Big Data Approach to Developing Adaptable Corpus Tools

Cited by: 0
Authors
Lutskiv, Andriy [1]
Popovych, Nataliya [2]
Affiliations
[1] Ternopil Ivan Puluj Natl Tech Univ, Comp Syst & Networks Dept, Ternopol, Ukraine
[2] Uzhgorod Natl Univ, State Univ, Dept Multicultural Educ & Translat, Uzhgorod, Ukraine
Keywords
adaptable text corpus; Big Data; natural language processing; natural language understanding; statistics; machine learning; data mining; conceptual analysis; corpus-based translation studies; conceptual seme; componential analysis;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This work deals with the development of corpus tools for building a corpus of religious and historical texts. The corpus is intended to support data ingestion, text data preprocessing, statistics calculation, and qualitative and quantitative text analysis. All these features are customizable. The Big Data approach means that the corpus tools are treated as a data platform and the corpus itself is treated as a combination of data lake and data warehouse solutions. Ways of resolving the algorithmic, methodological and architectural problems that arise while building a corpus tool are suggested. The effectiveness of natural language processing and natural language understanding methods, libraries and tools was checked on the example of building corpora of historical and religious texts. Workflows were created that comprise data extraction from sources, data transformation, data enrichment and loading into corpus storage with proper qualitative and quantitative characteristics. Data extraction approaches that are common for ingestion into a data lake were used. Transformations and enrichments were implemented by means of natural language processing and natural language understanding techniques. Statistical characteristics were calculated by means of machine learning techniques. Finding keywords and the relations between them became possible through latent semantic analysis, term and N-gram frequencies, and term frequency-inverse document frequency. Computational complexity and the amount of information noise were reduced by singular value decomposition. The influence of singular value decomposition parameters on text processing accuracy was analyzed. The results of a corpus-based computational experiment on concept analysis of a religious text are shown.
Architectural approaches to building a corpus-based data platform are suggested, along with the use of software tools, frameworks and specific libraries.
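The abstract names N-gram frequencies and term frequency-inverse document frequency among the statistics used to find keywords. The following is only a minimal sketch of those two measures under stated assumptions, not the authors' implementation: the function names (tokenize, ngram_counts, tf_idf), the whitespace tokenizer, the unsmoothed idf = log(N / df) variant, and the three toy sentences are all hypothetical choices made for illustration. The singular value decomposition step of latent semantic analysis is not shown.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation (assumed tokenizer)."""
    tokens = (word.strip(".,;:!?\"'()") for word in text.lower().split())
    return [t for t in tokens if t]


def ngram_counts(tokens, n):
    """Count contiguous n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf  = term count / document length
    idf = log(N / df), where df is the number of documents containing the term
    (an unsmoothed variant chosen here for simplicity).
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: count each term once per document it occurs in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    # Invented toy sentences, not taken from the corpus described in the abstract.
    docs = [
        "the covenant of the lord",
        "the king made a covenant",
        "the king of the north",
    ]
    print(ngram_counts(tokenize(docs[0]), 2).most_common(3))
    for weights in tf_idf(docs):
        print(sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:2])
```

With this idf variant, a term that occurs in every document receives weight zero, so frequent function words drop out of the keyword ranking without an explicit stop-word list.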
Pages: 22