Text Retrieval Based on Syntactic Information

Authors
Yongwei Z. [1, 2]
Ting L. [1]
Chang L. [3]
Bingxin W. [3]
Jingsong Y. [3]
Affiliations
[1] School of Chinese Language and Literature, University of Chinese Academy of Social Sciences, Beijing
[2] Corpus and Computational Linguistics Center, Institute of Linguistics, Chinese Academy of Social Sciences, Beijing
[3] School of Software and Microelectronics, Peking University, Beijing
Keywords
Constituency Syntax; Corpus; Dependency Syntax; Index; Retrieval
DOI
10.11925/infotech.2096-3467.2022.0093
Abstract
[Objective] This study explores an efficient method for retrieving syntactic information from large text corpora. [Methods] First, we built linearized indices for syntactic information based on its features. These indices then supply matching information that improves retrieval efficiency. [Results] We evaluated the method on the People’s Daily Corpus of 28.51 million sentences. The average processing time over 26 queries was 802.6 milliseconds, which meets the requirements of retrieval systems for large corpora. [Limitations] More research is needed to evaluate the proposed method with a larger number of queries. [Conclusions] The method can quickly retrieve lexical, dependency-syntactic, and constituency-syntactic information from large text corpora. © 2022, Chinese Academy of Sciences. All rights reserved.
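The abstract does not spell out how the linearized indices are laid out. As a rough illustration only, the general idea of flattening dependency arcs into string keys and looking them up in an inverted index might be sketched as follows. The key format, the toy parses, and the names `linearize`, `build_index`, and `search` are all assumptions made for this sketch, not the authors' design.

```python
from collections import defaultdict

# Toy dependency parses, invented for illustration.
# Each sentence id maps to a list of (head, relation, dependent) arcs.
SENTENCES = {
    0: [("eats", "nsubj", "cat"), ("eats", "obj", "fish")],
    1: [("eats", "nsubj", "dog"), ("eats", "obj", "meat")],
    2: [("sleeps", "nsubj", "cat")],
}


def linearize(arc):
    """Flatten one arc into string keys (hypothetical key format).

    Several keys of varying specificity are emitted so that queries
    with a wildcard head or dependent can be answered by plain lookup.
    """
    head, rel, dep = arc
    return [
        f"{rel}|{head}|{dep}",  # fully specified arc
        f"{rel}|{head}|*",      # any dependent
        f"{rel}|*|{dep}",       # any head
    ]


def build_index(sentences):
    """Build an inverted index: linearized key -> set of sentence ids."""
    index = defaultdict(set)
    for sid, arcs in sentences.items():
        for arc in arcs:
            for key in linearize(arc):
                index[key].add(sid)
    return index


def search(index, keys):
    """Return sorted ids of sentences containing every queried key."""
    postings = [index.get(key, set()) for key in keys]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


index = build_index(SENTENCES)
hits = search(index, ["nsubj|*|cat", "obj|eats|*"])
```

In this sketch the intersection only yields candidate sentences in which every key occurs somewhere. Checking that the matched arcs share the same nodes would need a further verification step, which is omitted here.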
Pages: 25-37
Page count: 12