Text Retrieval Based on Syntactic Information

Authors
Yongwei Z. [1, 2]
Ting L. [1 ]
Chang L. [3 ]
Bingxin W. [3 ]
Jingsong Y. [3 ]
Affiliations
[1] School of Chinese Language and Literature, University of Chinese Academy of Social Sciences, Beijing
[2] Corpus and Computational Linguistics Center, Institute of Linguistics, Chinese Academy of Social Sciences, Beijing
[3] School of Software and Microelectronics, Peking University, Beijing
Keywords
Constituency Syntax; Corpus; Dependency Syntax; Index; Retrieval
DOI
10.11925/infotech.2096-3467.2022.0093
Abstract
[Objective] This study explores an efficient method for retrieving syntactic information from large text corpora. [Methods] First, we built linearized indices for syntactic information based on its features. These indices then supply matching information that improves retrieval efficiency. [Results] We evaluated the method on the People's Daily Corpus of 28.51 million sentences. The average processing time for 26 queries was 802.6 milliseconds, which meets the requirements of retrieval systems for large corpora. [Limitations] More research is needed to evaluate the proposed method with a larger number of queries. [Conclusions] The method can quickly retrieve lexical, dependency-syntactic, and constituency-syntactic information from large text corpora. © 2022, Chinese Academy of Sciences. All rights reserved.
Pages: 25-37
Page count: 12