Big Data Approach to Developing Adaptable Corpus Tools

Authors
Lutskiv, Andriy [1]
Popovych, Nataliya [2]
Affiliations
[1] Ternopil Ivan Puluj Natl Tech Univ, Comp Syst & Networks Dept, Ternopol, Ukraine
[2] Uzhgorod Natl Univ, State Univ, Dept Multicultural Educ & Translat, Uzhgorod, Ukraine
Keywords
adaptable text corpus; Big Data; natural language processing; natural language understanding; statistics; machine learning; data mining; conceptual analysis; corpus-based translation studies; conceptual seme; componential analysis
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This work addresses the development of corpus tools for building a corpus of religious and historical texts. The corpus is intended to support data ingestion, text data preprocessing, statistics calculation, and qualitative and quantitative text analysis, all of which are customizable. The Big Data approach here means that the corpus tools are treated as a data platform and the corpus itself as a combination of data lake and data warehouse solutions. Ways of resolving the algorithmic, methodological and architectural problems that arise while building a corpus tool are suggested. The effectiveness of natural language processing and natural language understanding methods, libraries and tools was checked by building corpora of historical and religious texts. Workflows were created that comprise data extraction from sources, data transformation, data enrichment and loading into corpus storage with proper qualitative and quantitative characteristics. Data extraction relied on approaches commonly used for ingestion into a data lake. Transformations and enrichments were implemented by means of natural language processing and natural language understanding techniques. Statistical characteristics were calculated by means of machine learning techniques. Keywords and the relations between them were found using latent semantic analysis, term and N-gram frequencies, and term frequency-inverse document frequency (TF-IDF). Computational complexity and the amount of information noise were reduced by singular value decomposition. The influence of singular value decomposition parameters on text processing accuracy was analyzed. The results of a corpus-based computational experiment on concept analysis of religious texts are presented. Architectural approaches to building a corpus-based data platform are suggested, along with the use of software tools, frameworks and specific libraries.
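The keyword statistics named in the abstract (term frequencies, N-gram frequencies and TF-IDF) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `tokenize`, `ngrams` and `tf_idf`, the regular-expression tokenizer, and the unsmoothed `log(N / df)` weighting are all assumptions chosen for brevity, and only the Python standard library is used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, keep runs of ASCII letters only.
    # A real corpus tool would use a language-aware tokenizer instead.
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    # Return all contiguous n-token windows as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    # Compute one {term: score} dict per document, where
    # tf = count / document length and idf = log(N / document frequency).
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty documents
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


if __name__ == "__main__":
    docs = ["grace and faith", "faith and hope", "hope and love"]
    bigram_counts = Counter(ngrams(tokenize(" ".join(docs)), 2))
    print(bigram_counts.most_common(3))
    for doc_scores in tf_idf(docs):
        print(sorted(doc_scores.items(), key=lambda kv: -kv[1]))
```

With this weighting, a term that occurs in every document receives a score of zero, so it drops out of the keyword ranking; a production pipeline would more likely rely on a library vectorizer followed by a truncated singular value decomposition for the latent semantic analysis step.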
Pages: 22