Big Data Approach to Developing Adaptable Corpus Tools

Cited by: 0
Authors
Lutskiv, Andriy [1]
Popovych, Nataliya [2]
Affiliations
[1] Ternopil Ivan Puluj Natl Tech Univ, Comp Syst & Networks Dept, Ternopol, Ukraine
[2] Uzhgorod Natl Univ, State Univ, Dept Multicultural Educ & Translat, Uzhgorod, Ukraine
Keywords
adaptable text corpus; Big Data; natural language processing; natural language understanding; statistics; machine learning; data mining; conceptual analysis; corpus-based translation studies; conceptual seme; componential analysis;
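DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract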
This work deals with the development of corpus tools for building a corpus of religious and historical texts. The corpus is expected to support data ingestion, text preprocessing, calculation of statistics, and qualitative and quantitative text analysis; all of these features are customizable. The Big Data approach means that the corpus tools are treated as a data platform and the corpus itself as a combination of data lake and data warehouse solutions. Ways of resolving the algorithmic, methodological and architectural problems that arise while building a corpus tool are suggested. The effectiveness of natural language processing and natural language understanding methods, libraries and tools was checked by building corpora of historical and religious texts. Workflows were created that comprise data extraction from sources, data transformation, data enrichment and loading into corpus storage with proper qualitative and quantitative characteristics. Data extraction used approaches that are common for ingestion into a data lake. Transformations and enrichments were implemented by means of natural language processing and natural language understanding techniques. Statistical characteristics were calculated by means of machine learning techniques. Keywords and the relations between them were found using latent semantic analysis, term and N-gram frequencies, and term frequency-inverse document frequency. Computational complexity and the amount of information noise were reduced by singular value decomposition. The influence of singular value decomposition parameters on text processing accuracy was analyzed. The results of a corpus-based computational experiment on religious text concept analysis are presented.
Architectural approaches to building a corpus-based data platform are suggested, together with the use of software tools, frameworks and specific libraries.
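The abstract names N-gram frequencies and term frequency-inverse document frequency among the measures used to find keywords, but does not give an implementation. The following is a minimal sketch under stated assumptions, not the authors' code: the function names (`tokenize`, `ngram_counts`, `tf_idf`), the whitespace tokenizer, the unsmoothed weighting `tf * log(N / df)`, and the toy sentences are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'()"

def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation (assumed tokenizer)."""
    stripped = (word.strip(PUNCT) for word in text.lower().split())
    return [word for word in stripped if word]

def ngram_counts(tokens, n):
    """Count every contiguous n-token window in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def tf_idf(docs):
    """Return one {term: score} dict per document.

    Assumed weighting: tf = count / document length, idf = log(N / df),
    with no smoothing. Other variants exist and may differ from the paper's.
    """
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # each document counts a term once
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

if __name__ == "__main__":
    # Hypothetical toy documents, not taken from the paper's corpus.
    docs = [
        "the light shines in the darkness",
        "the king ruled the land",
        "the light of the king",
    ]
    for doc, doc_scores in zip(docs, tf_idf(docs)):
        top_term = max(doc_scores, key=doc_scores.get)
        print(doc, "->", top_term)
    print(ngram_counts(tokenize(docs[0]), 2).most_common(3))
```

With this weighting, a term that occurs in every document scores zero, so frequent function words drop out of the keyword ranking without a stop-word list. A term-document matrix built from such scores is the usual input to the singular value decomposition step that latent semantic analysis relies on.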
Pages: 22