From Word Embeddings To Document Similarities for Improved Information Retrieval in Software Engineering

Cited by: 203
Authors
Ye, Xin [1]
Shen, Hui [1]
Ma, Xiao [1]
Bunescu, Razvan [1]
Liu, Chang [1]
Affiliations
[1] Ohio Univ, Sch Elect Engn & Comp Sci, Athens, OH 45701 USA
Funding
National Science Foundation (USA)
Keywords
Word embeddings; skip-gram model; bug localization; bug reports; API documents
DOI
10.1145/2884781.2884862
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
The application of information retrieval techniques to search tasks in software engineering is made difficult by the lexical gap between search queries, usually expressed in natural language (e.g., English), and retrieved documents, usually expressed in code (e.g., programming languages). This is often the case in bug and feature location, community question answering, or more generally the communication between technical personnel and non-technical stakeholders in a software project. In this paper, we propose bridging the lexical gap by projecting natural language statements and code snippets as meaning vectors in a shared representation space. In the proposed architecture, word embeddings are first trained on API documents, tutorials, and reference documents, and then aggregated in order to estimate semantic similarities between documents. Empirical evaluations show that the learned vector space embeddings lead to improvements in a previously explored bug localization task and a newly defined task of linking API documents to computer programming questions.
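As a rough illustration of the idea of aggregating word embeddings into a document-level similarity, the following Python sketch averages word vectors and compares the averages with cosine similarity. This is a minimal sketch under stated assumptions, not the paper's method: the abstract does not specify the aggregation, so plain averaging is an assumption made here, and the three-dimensional vectors, the `EMBEDDINGS` table, and the function names are all hypothetical. In practice the vectors would be learned, for instance by a skip-gram model.

```python
import math

# Toy 3-dimensional word vectors, invented for illustration only.
# Real vectors would be learned from API documents, tutorials, etc.
EMBEDDINGS = {
    "window":  [0.9, 0.1, 0.0],
    "close":   [0.1, 0.9, 0.0],
    "shell":   [0.8, 0.2, 0.1],
    "dispose": [0.2, 0.8, 0.1],
    "parse":   [0.0, 0.1, 0.9],
    "token":   [0.1, 0.0, 0.9],
}


def doc_vector(tokens, embeddings):
    """Average the vectors of the known tokens; return None if none are known."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def doc_similarity(tokens_a, tokens_b, embeddings=EMBEDDINGS):
    """Similarity of two token lists via averaged embeddings (assumed aggregation)."""
    vec_a = doc_vector(tokens_a, embeddings)
    vec_b = doc_vector(tokens_b, embeddings)
    if vec_a is None or vec_b is None:
        return 0.0
    return cosine(vec_a, vec_b)


# A natural-language query and two code-like documents with no shared tokens.
query = ["window", "close"]
code_a = ["shell", "dispose"]
code_b = ["parse", "token"]

print(doc_similarity(query, code_a))
print(doc_similarity(query, code_b))
```

With these toy vectors the query shares no token with either document, yet it scores higher against `code_a` than against `code_b`, which is the kind of lexical-gap bridging the abstract describes. Averaging discards word order and weights every word equally, so it is only one of several possible aggregation choices.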
Pages: 404-415
Page count: 12
Related papers
50 in total
  • [31] An approach for document fragment retrieval and its formatting issue in engineering information management
    Liu, Shaofeng
    McMahon, Chris A.
    Darlington, Mansur J.
    Culley, Steve J.
    Wild, Peter J.
    [J]. COMPUTATIONAL SCIENCE AND ITS APPLICATIONS - ICCSA 2006, PT 2, 2006, 3981 : 279 - 287
  • [32] A review of structured document retrieval (SDR) technology to improve information access performance in engineering document management
    Liu, S.
    McMahon, C. A.
    Culley, S. J.
    [J]. COMPUTERS IN INDUSTRY, 2008, 59 (01) : 3 - 16
  • [33] Inferring Multilingual Domain-Specific Word Embeddings From Large Document Corpora
    Cagliero, Luca
    La Quatra, Moreno
    [J]. IEEE ACCESS, 2021, 9 : 137309 - 137321
  • [34] Profile based information retrieval from printed document images
    Abirami, S.
    Manjula, D.
    [J]. COMPUTER GRAPHICS, IMAGING AND VISUALISATION: NEW ADVANCES, 2007, : 268 - +
  • [35] DocBrowse: A system for information retrieval from document image data
    Jaisimha, MY
    Brue, A
    Nguyen, T
    [J]. STORAGE AND RETRIEVAL FOR STILL IMAGE AND VIDEO DATABASES IV, 1996, 2670 : 350 - 361
  • [36] Document clustering for efficient and secure information retrieval from cloud
    Handa, Rohit
    Krishna, C. Rama
    Aggarwal, Naveen
    [J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2019, 31 (15):
  • [37] A computational framework for retrieval of document fragments based on decomposition schemes in engineering information management
    Liu, S.
    McMahon, C. A.
    Darlington, M. J.
    Culley, S. J.
    Wild, P. J.
    [J]. ADVANCED ENGINEERING INFORMATICS, 2006, 20 (04) : 401 - 413
  • [38] Word Retrieval from Kannada Document Images Using HOG and Morphological Features
    Hangarge, Mallikarjun
    Veershetty, C.
    Rajmohan, P.
    Mukarambi, Gururaj
    [J]. RECENT TRENDS IN IMAGE PROCESSING AND PATTERN RECOGNITION (RTIP2R 2016), 2017, 709 : 71 - 79
  • [39] Learning semantic information from Internet Domain Names using word embeddings
    Lopez, Waldemar
    Merlino, Jorge
    Rodriguez-Bocca, Pablo
    [J]. ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2020, 94
  • [40] A resolving of word sense ambiguity using two-level document ranking method in information retrieval
    Kang, Hyun-Kyu
    Jeon, Heung Seok
    Ko, Myeong-Cheol
    Kim, Jin Soo
    Yang, Kiduk
    [J]. 2007 INTERNATIONAL SYMPOSIUM ON INFORMATION TECHNOLOGY CONVERGENCE, PROCEEDINGS, 2007, : 315 - +