Building parallel corpora by automatic title alignment

Authors
Yang, CC [1]
Li, KW [1]
Affiliation
[1] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Sha Tin 100083, Peoples R China
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Cross-lingual semantic interoperability has drawn significant research attention recently, as the number of digital libraries in non-English languages has grown exponentially. Cross-lingual information retrieval (CLIR) across European languages, such as English, Spanish and French, has been widely explored, but CLIR between European and Oriental languages is still in its early stages. To cross the language boundary, a corpus-based approach shows promise in overcoming the limitations of knowledge-based and controlled-vocabulary approaches. However, collecting parallel corpora between European and Oriental languages is not an easy task. Length-based and text-based approaches are the two major approaches to aligning parallel documents. In this paper, we investigate several techniques using these approaches and compare their performance in aligning the English and Chinese titles of parallel documents available on the Web.
Pages: 328 - 339
Page count: 12