This thesis deals with the development of corpus tools for building corpora of religious and historical texts. The corpus tools are designed to provide data ingestion, text preprocessing, statistics calculation, and qualitative and quantitative text analysis, with all of these features being customizable. The Big Data approach adopted here means that the corpus tools are treated as a data platform, while the corpus itself is treated as a combination of data lake and data warehouse solutions.

Ways of resolving the algorithmic, methodological, and architectural problems that arise while building corpus tools are suggested. The effectiveness of natural language processing (NLP) and natural language understanding (NLU) methods, libraries, and tools is verified on the example of building corpora of historical and religious texts. Workflows are created that comprise data extraction from sources, data transformation, data enrichment, and loading into corpus storage with the required qualitative and quantitative characteristics. Data extraction approaches common for data lake ingestion are used, transformations and enrichments are implemented by means of NLP and NLU techniques, and statistical characteristics are computed using machine learning techniques.

Keywords and the relations between them are found by employing latent semantic analysis, term and N-gram frequencies, and term frequency-inverse document frequency (TF-IDF). Computational complexity and the amount of information noise are reduced by singular value decomposition (SVD), and the influence of SVD parameters on text processing accuracy is analyzed. The results of a corpus-based computational experiment on religious text concept analysis are presented. Finally, architectural approaches to building a corpus-based data platform are suggested, along with the choice of specific software tools, frameworks, and libraries.
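To make the extract-transform-load workflow described above concrete, the following is a minimal illustrative sketch in Python, not the thesis's actual implementation. It assumes spaCy as one possible NLP library for the enrichment step; all other names (extract_documents, transform, load, the raw_texts directory, the corpus.jsonl target) are hypothetical placeholders.

```python
# Illustrative ETL sketch: extract raw texts, enrich them with NLP
# annotations, and load them into corpus storage. All function and
# path names are hypothetical, not the thesis's API.
import json
from pathlib import Path

import spacy  # one common NLP library; the thesis may use other tools

nlp = spacy.load("en_core_web_sm")

def extract_documents(source_dir: Path):
    """Extract step: read raw text files from a source (data lake side)."""
    for path in sorted(source_dir.glob("*.txt")):
        yield {"id": path.stem, "text": path.read_text(encoding="utf-8")}

def transform(doc: dict) -> dict:
    """Transform/enrich step: lemmatize and tag named entities."""
    parsed = nlp(doc["text"])
    doc["lemmas"] = [t.lemma_ for t in parsed if t.is_alpha]
    doc["entities"] = [(e.text, e.label_) for e in parsed.ents]
    return doc

def load(docs, target: Path):
    """Load step: write enriched documents into corpus storage (JSON lines)."""
    with target.open("w", encoding="utf-8") as f:
        for doc in docs:
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    load((transform(d) for d in extract_documents(Path("raw_texts"))),
         Path("corpus.jsonl"))
```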
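Similarly, the keyword-finding pipeline of TF-IDF weighting followed by truncated SVD (the standard formulation of latent semantic analysis) can be sketched as below. This is an illustration under the assumption that scikit-learn is used; the toy corpus is a placeholder, and n_components corresponds to the SVD parameter whose influence on accuracy the thesis analyzes.

```python
# Illustrative LSA sketch: TF-IDF weighting followed by truncated SVD,
# which reduces dimensionality (and information noise) and exposes
# relations between terms. The corpus below is a toy placeholder.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "in the beginning was the word",
    "the word was with god",
    "a chronicle of kings and battles",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X = vectorizer.fit_transform(corpus)              # document-term TF-IDF matrix

svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(X)                 # documents in latent space

# Top terms per latent dimension hint at keyword groups and their relations.
terms = vectorizer.get_feature_names_out()
for i, comp in enumerate(svd.components_):
    top = comp.argsort()[-5:][::-1]
    print(f"dimension {i}:", [terms[j] for j in top])
```

Choosing n_components trades off noise reduction against loss of detail, which is why the influence of this SVD parameter on text processing accuracy merits the analysis reported in the thesis.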