Cross-Lingual Information Retrieval from Multilingual Construction Documents Using Pretrained Language Models

Cited by: 2
Authors
Kim, Jungyeon [1]
Chung, Sehwan [1]
Chi, Seokho [1, 2]
Affiliations
[1] Seoul Natl Univ, Dept Civil & Environm Engn, Seoul 08826, South Korea
[2] Seoul Natl Univ, Inst Construct & Environm Engn, Seoul 08826, South Korea
Funding
National Research Foundation, Singapore
DOI
10.1061/JCEMD4.COENG-14273
Chinese Library Classification (CLC) number
TU [Building Science]
Discipline classification code
0813
Abstract
The growth of the global construction market has attracted international companies to participate in overseas projects. Overseas projects are extremely dynamic with numerous uncertainties, raising the need to collect information about construction in host countries. Due to the vast amounts of text data in the construction industry, an automated method, specifically information retrieval, is required to find the necessary information. Previous studies have suggested automated methods to review various construction documents. However, these studies required substantial manual effort and mainly focused on only one language, resulting in loss of vital information because it is buried in documents written in the host country's language. To address these limitations, this study proposes a cross-lingual information retrieval (CLIR) framework using pretrained Bidirectional Encoder Representations from Transformers (BERT) models to retrieve information from multilingual construction documents. The proposed framework employs language models (i.e., monolingual, multilingual, and cross-lingual) and trains these models on a construction data set to enhance their ability in construction-specific text. The framework achieved reliable performance of retrieval, even with minimal additional training using domain-specific data. The results indicate that training on the domain data set raises the level of retrieval, increasing the mean reciprocal rank of a specific task by up to 0.2128. With the employment of a monolingual model with machine translation, CLIR in a specific domain could be performed effectively without the need for a labeled data set. The suggested CLIR framework offers a practical alternative for dealing with construction documents in overseas projects, reducing time and cost while improving risk identification and mitigation.
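
The abstract reports retrieval quality as mean reciprocal rank (MRR). As a rough illustration of that metric only, the sketch below uses the standard definition: for each query, take the reciprocal of the rank of the first relevant document (0 if none is retrieved), then average over queries. The function names and the toy document IDs are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Set


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Return 1/rank of the first relevant document, or 0.0 if none appears."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[str]],
    relevants: Sequence[Set[str]],
) -> float:
    """Average the reciprocal rank over all queries."""
    if len(rankings) != len(relevants):
        raise ValueError("rankings and relevants must have the same length")
    if not rankings:
        return 0.0
    total = sum(reciprocal_rank(r, g) for r, g in zip(rankings, relevants))
    return total / len(rankings)


# Toy data (hypothetical): three queries with their ranked result lists.
toy_rankings = [["d3", "d1", "d2"], ["d5", "d4"], ["d7", "d8"]]
toy_relevants = [{"d1"}, {"d5"}, {"d9"}]

# Reciprocal ranks are 1/2, 1/1, and 0, so the mean is 0.5.
toy_mrr = mean_reciprocal_rank(toy_rankings, toy_relevants)
print(toy_mrr)
```

Because MRR is bounded between 0 and 1, an increase of 0.2128 such as the one reported above is a sizeable change on that scale.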
Pages: 15