Graph-Based Bilingual Sentence Alignment from Large Scale Web Pages

Authors
Zhu, Yihe [1]
Wang, Haofen [1]
Ouyang, Xixiu [1]
Yu, Yong [1]
Affiliation
[1] Shanghai Jiao Tong Univ, Apex Data & Knowledge Management Lab, Shanghai 200030, Peoples R China
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Sentence alignment is an enabling technology for automatically extracting large bilingual corpora from the vast and ever-growing body of Web pages. In this paper, we propose a novel graph-based sentence alignment approach. Compared with existing approaches, ours is more resistant to the noise and structural diversity of Web pages because it takes sentence structural features into account. We formulate sentence alignment as a matching problem between the nodes (bilingual sentences) of a bipartite graph. The maximum-weighted bipartite graph matching algorithm is first applied to sentence alignment to obtain a globally optimal matching. Moreover, sentence merging and aligned sentence pattern detection are used to handle, respectively, many-to-many matching and the low matching probability of aligned sentences that share few mutually translated words. We achieve precision over 85% and recall over 80% on manually annotated data, and we extract 1 million aligned sentence pairs with over 82% accuracy from 0.8 million bilingual pages.
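The bipartite formulation in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the weight matrix, the function name `max_weight_matching`, and the choice of the Hungarian (Kuhn-Munkres) algorithm as the solver are assumptions made for illustration, and the abstract does not say how edge weights are computed. Rows stand for source-language sentences, columns for target-language sentences, and each entry is a hypothetical similarity score.

```python
from typing import List, Tuple


def max_weight_matching(weights: List[List[float]]) -> List[Tuple[int, int]]:
    """Match every row to a distinct column so that total weight is maximal.

    Hungarian algorithm, O(rows^2 * cols). Requires rows <= cols; transpose
    the matrix first if the source side has more sentences.
    """
    n = len(weights)
    m = len(weights[0]) if n else 0
    if n > m:
        raise ValueError("expects rows <= cols; transpose the matrix first")
    inf = float("inf")
    # Minimizing negated weights is equivalent to maximizing weights.
    cost = [[-w for w in row] for row in weights]
    u = [0.0] * (n + 1)   # row potentials (1-based)
    v = [0.0] * (m + 1)   # column potentials (1-based; index 0 is a dummy)
    p = [0] * (m + 1)     # p[j] = row matched to column j, 0 if unmatched
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = inf
            j1 = 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            # Shift potentials so that at least one new edge becomes tight.
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path back to the dummy column.
        while True:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    return sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0)


# Hypothetical similarity scores: 2 source sentences x 3 target sentences.
scores = [
    [0.9, 0.8, 0.0],
    [0.8, 0.1, 0.0],
]
pairs = max_weight_matching(scores)
total = sum(scores[i][j] for i, j in pairs)
```

In this toy matrix, picking the single best edge first (source 0 with target 0, weight 0.9) forces source 1 onto a weak edge, whereas the globally optimal matching pairs source 0 with target 1 and source 1 with target 0. That difference is the motivation for global matching over greedy per-sentence choices. The sketch covers only one-to-one matching; the sentence merging and pattern detection steps mentioned in the abstract are not modeled here.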
Pages: 209-216
Page count: 8