Cross-Lingual Information Retrieval from Multilingual Construction Documents Using Pretrained Language Models

Cited by: 2
Authors
Kim, Jungyeon [1]
Chung, Sehwan [1]
Chi, Seokho [1, 2]
Affiliations
[1] Seoul Natl Univ, Dept Civil & Environm Engn, Seoul 08826, South Korea
[2] Seoul Natl Univ, Inst Construct & Environm Engn, Seoul 08826, South Korea
Funding
National Research Foundation, Singapore
DOI
10.1061/JCEMD4.COENG-14273
Chinese Library Classification (CLC) number
TU [Building Science]
Discipline classification code
0813
Abstract
The growth of the global construction market has attracted international companies to overseas projects. These projects are highly dynamic and carry numerous uncertainties, which raises the need to collect information about construction in host countries. Because the construction industry produces vast amounts of text data, an automated method, specifically information retrieval, is required to find the necessary information. Previous studies have proposed automated methods for reviewing various construction documents. However, these studies required substantial manual effort and mainly focused on a single language, so vital information buried in documents written in the host country's language was lost. To address these limitations, this study proposes a cross-lingual information retrieval (CLIR) framework that uses pretrained Bidirectional Encoder Representations from Transformers (BERT) models to retrieve information from multilingual construction documents. The proposed framework employs monolingual, multilingual, and cross-lingual language models and trains them on a construction data set to improve their handling of construction-specific text. The framework achieved reliable retrieval performance, even with minimal additional training on domain-specific data. The results indicate that training on the domain data set improves retrieval, increasing the mean reciprocal rank of a specific task by up to 0.2128. By employing a monolingual model with machine translation, CLIR in a specific domain could be performed effectively without a labeled data set. The suggested CLIR framework offers a practical alternative for handling construction documents in overseas projects, reducing time and cost while improving risk identification and mitigation.
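The abstract reports its gain in mean reciprocal rank (MRR). As a reminder of what that metric measures, the following is a minimal sketch of the standard MRR definition, not the authors' evaluation code. The function names, document identifiers, and toy rankings are all hypothetical and chosen only for illustration.

```python
from typing import Hashable, Sequence, Set


def reciprocal_rank(ranked_ids: Sequence[Hashable], relevant_ids: Set[Hashable]) -> float:
    """Return 1 / (position of the first relevant document), or 0.0 if none is retrieved."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]],
    relevant_sets: Sequence[Set[Hashable]],
) -> float:
    """Average the reciprocal rank over all queries."""
    if len(rankings) != len(relevant_sets):
        raise ValueError("rankings and relevant_sets must have one entry per query")
    if not rankings:
        return 0.0
    total = sum(reciprocal_rank(r, rel) for r, rel in zip(rankings, relevant_sets))
    return total / len(rankings)


# Hypothetical toy data: three queries, each with a ranked list of document ids.
toy_rankings = [
    ["d3", "d1", "d2"],  # first relevant document at rank 2
    ["d2", "d3", "d1"],  # first relevant document at rank 3
    ["d1", "d2", "d3"],  # no relevant document retrieved
]
toy_relevant = [{"d1"}, {"d1"}, {"d9"}]
toy_mrr = mean_reciprocal_rank(toy_rankings, toy_relevant)
```

Because MRR lies between 0 and 1, a reported increase of 0.2128 means the first relevant document moved noticeably closer to the top of the ranking on average for that task.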
Pages: 15