Big Data Approach to Developing Adaptable Corpus Tools

Cited by: 0
Authors
Lutskiv, Andriy [1]
Popovych, Nataliya [2]
Affiliations
[1] Ternopil Ivan Puluj Natl Tech Univ, Comp Syst & Networks Dept, Ternopol, Ukraine
[2] Uzhgorod Natl Univ, State Univ, Dept Multicultural Educ & Translat, Uzhgorod, Ukraine
Keywords
adaptable text corpus; Big Data; natural language processing; natural language understanding; statistics; machine learning; data mining; conceptual analysis; corpus-based translation studies; conceptual seme; componential analysis
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This work deals with the development of corpus tools for building a corpus of religious and historical texts. The corpus is intended to support data ingestion, text preprocessing, calculation of statistics, and qualitative and quantitative text analysis; all of these features are customizable. The Big Data approach means that the corpus tools are treated as a data platform and the corpus itself as a combination of data lake and data warehouse solutions. Ways of resolving the algorithmic, methodological and architectural problems that arise while building a corpus tool are suggested. The effectiveness of natural language processing and natural language understanding methods, libraries and tools was checked by building corpora of historical and religious texts.
Workflows were created that comprise data extraction from sources, data transformation, data enrichment and loading into corpus storage with proper qualitative and quantitative characteristics. Data extraction used approaches that are common for ingestion into a data lake. Transformations and enrichments were implemented with natural language processing and natural language understanding techniques, and statistical characteristics were calculated with machine learning techniques. Keywords and the relations between them were found using latent semantic analysis, term and N-gram frequencies, and term frequency-inverse document frequency (TF-IDF). Singular value decomposition reduced the computational complexity and the amount of information noise; the influence of its parameters on text processing accuracy was analyzed. The results of a corpus-based computational experiment on concept analysis of a religious text are presented. Architectural approaches to building a corpus-based data platform, and the use of software tools, frameworks and specific libraries, are suggested.
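The term, N-gram and TF-IDF statistics named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenizer, the function names, the unsmoothed `ln(N / df)` weighting and the toy corpus are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    tokens = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [t for t in tokens if t]


def ngram_counts(tokens, n):
    """Count contiguous n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / document length
    idf = ln(N / df), the plain unsmoothed variant (an assumption;
          the abstract does not say which weighting was used).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


# Hypothetical toy corpus, not taken from the paper.
docs = [
    "grace and faith",
    "faith and hope",
    "hope and charity",
]
weights = tf_idf(docs)
bigrams = ngram_counts(tokenize(docs[0]), 2)
```

In this toy corpus a word that occurs in every document ("and") gets weight zero, while a word unique to one document ("grace") gets the highest weight in that document, which is the property that makes TF-IDF usable for keyword finding.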
Pages: 22