Multilingual sentence categorization and novelty mining

Cited by: 10
Authors
Zhang, Yi [1]
Tsai, Flora S. [1]
Kwee, Agus Trisnajaya [1]
Affiliation
[1] Nanyang Technological University, School of Electrical and Electronic Engineering, Singapore 639798, Singapore
Keywords
Multilingual categorization; Sentence retrieval; Novelty mining; Malay; Chinese
DOI
10.1016/j.ipm.2010.02.003
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
A challenge for sentence categorization and novelty mining is to detect not only when text is relevant to the user's information need, but also when it contains something new that the user has not seen before. This involves two tasks: the first is identifying relevant sentences (categorization), and the second is identifying new information within those relevant sentences (novelty mining). Many previous studies of relevant sentence retrieval and novelty mining have been conducted on English, but few papers have addressed multilingual sentence categorization and novelty mining. This is an important issue in global business environments, where mining knowledge from text in a single language is not sufficient. In this paper, we perform the first task by categorizing Malay and Chinese sentences and comparing their performance with that of English. Thereafter, we conduct novelty mining to identify the sentences with new information. Experimental results on TREC 2004 Novelty Track data show similar categorization performance on Malay and English sentences, both of which greatly outperform Chinese. In the second task, we achieve similar novelty mining results for all three languages, which indicates that our algorithm is suitable for novelty mining of multilingual sentences. In addition, benchmarking our results against novelty mining without categorization shows that categorization is necessary for novelty mining to perform successfully. (C) 2010 Elsevier Ltd. All rights reserved.
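The two-task pipeline described in the abstract (categorization first, then novelty mining over the relevant sentences) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity measure, features, or thresholds, so the bag-of-words cosine similarity, the function names (`vectorize`, `cosine`, `categorize`, `mine_novelty`), and both threshold values below are assumptions chosen purely for illustration. Whitespace tokenization is also an assumption; it would not work for unsegmented Chinese text, which would need a word segmentation step first.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector (assumes whitespace-separated tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def categorize(sentences, topic, relevance_threshold=0.2):
    """Task 1 (assumed form): keep sentences similar enough to the topic."""
    topic_vec = vectorize(topic)
    return [s for s in sentences
            if cosine(vectorize(s), topic_vec) >= relevance_threshold]


def mine_novelty(relevant, novelty_threshold=0.8):
    """Task 2 (assumed form): a sentence is novel if its highest similarity
    to any earlier relevant sentence stays below the threshold."""
    history, novel = [], []
    for s in relevant:
        vec = vectorize(s)
        max_sim = max((cosine(vec, h) for h in history), default=0.0)
        if max_sim < novelty_threshold:
            novel.append(s)
        history.append(vec)  # every relevant sentence counts as "seen"
    return novel


# Hypothetical toy data, not taken from TREC 2004.
topic = "flood damage in the city"
sentences = [
    "The flood caused heavy damage in the city",
    "Heavy damage was caused by the flood in the city",
    "Tomorrow will be sunny",
    "Rescue teams reached the flood victims in the city",
]

relevant = categorize(sentences, topic)
novel = mine_novelty(relevant)
print(relevant)
print(novel)
```

In this toy run the off-topic third sentence is filtered out during categorization, and the second sentence is dropped during novelty mining because it restates the first. Running novelty mining only on the output of `categorize` mirrors the ordering the abstract describes, in which categorization precedes novelty mining.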
Pages: 667-675
Page count: 9