Multilingual sentence categorization and novelty mining

Cited by: 10
Authors
Zhang, Yi [1]
Tsai, Flora S. [1]
Kwee, Agus Trisnajaya [1]
Affiliations
[1] Nanyang Technological University, School of Electrical and Electronic Engineering, Singapore 639798, Singapore
Keywords
Multilingual categorization; Sentence retrieval; Novelty mining; Malay; Chinese;
DOI
10.1016/j.ipm.2010.02.003
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
A challenge for sentence categorization and novelty mining is to detect not only when text is relevant to the user's information need, but also when it contains something new that the user has not seen before. This involves two tasks: identifying relevant sentences (categorization), and identifying new information among those relevant sentences (novelty mining). Many previous studies of relevant sentence retrieval and novelty mining have been conducted on English, but few papers have addressed multilingual sentence categorization and novelty mining. This is an important issue in global business environments, where mining knowledge from text in a single language is not sufficient. In this paper, we perform the first task by categorizing Malay and Chinese sentences and comparing their performance with that of English. We then conduct novelty mining to identify the sentences with new information. Experimental results on TREC 2004 Novelty Track data show similar categorization performance on Malay and English sentences, both of which greatly outperform Chinese. In the second task, we achieve similar novelty mining results for all three languages, which indicates that our algorithm is suitable for novelty mining of multilingual sentences. In addition, benchmarking our results against novelty mining without categorization shows that categorization is necessary for successful novelty mining. (C) 2010 Elsevier Ltd. All rights reserved.
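The two-task flow described in the abstract (categorize first, then mine novelty among the relevant sentences) can be illustrated with a minimal sketch. The abstract does not specify the paper's features, similarity measure, or thresholds, so everything here is an assumption made for illustration: bag-of-words vectors, cosine similarity, and the hypothetical names `categorize`, `mine_novelty`, `relevance_threshold`, and `novelty_threshold`.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    # Assumption: simple regex tokenization. Chinese text would need a
    # word segmenter first; this sketch does not attempt that.
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a, b):
    # Cosine similarity between two term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def categorize(sentences, topic, relevance_threshold=0.1):
    # Task 1 (hypothetical rule): keep sentences similar enough to the topic.
    topic_vec = bag_of_words(topic)
    return [
        s for s in sentences
        if cosine(bag_of_words(s), topic_vec) >= relevance_threshold
    ]


def mine_novelty(relevant, novelty_threshold=0.6):
    # Task 2 (hypothetical rule): a sentence is novel if its highest
    # similarity to any previously seen relevant sentence is below the threshold.
    history, novel = [], []
    for s in relevant:
        vec = bag_of_words(s)
        max_sim = max((cosine(vec, h) for h in history), default=0.0)
        if max_sim < novelty_threshold:
            novel.append(s)
        history.append(vec)
    return novel


sentences = [
    "The volcano erupted on Monday.",
    "The stock market rose sharply.",
    "The volcano erupted on Monday.",
    "Ash from the volcano closed the airport.",
]
relevant = categorize(sentences, "volcano eruption ash")
print(mine_novelty(relevant))
```

In this toy run the off-topic sentence is dropped in the first stage and the repeated sentence is dropped in the second, which mirrors the ordering the abstract argues for; the thresholds are arbitrary and would need tuning on real data.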
Pages: 667-675
Number of pages: 9