From Word Embeddings To Document Similarities for Improved Information Retrieval in Software Engineering

Cited by: 203
Authors
Ye, Xin [1]
Shen, Hui [1]
Ma, Xiao [1]
Bunescu, Razvan [1]
Liu, Chang [1]
Affiliations
[1] Ohio Univ, Sch Elect Engn & Comp Sci, Athens, OH 45701 USA
Funding
National Science Foundation (USA)
Keywords
Word embeddings; skip-gram model; bug localization; bug reports; API documents;
DOI
10.1145/2884781.2884862
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
The application of information retrieval techniques to search tasks in software engineering is made difficult by the lexical gap between search queries, usually expressed in natural language (e.g., English), and retrieved documents, usually expressed in code (e.g., programming languages). This is often the case in bug and feature location, community question answering, or, more generally, the communication between technical personnel and non-technical stakeholders in a software project. In this paper, we propose bridging the lexical gap by projecting natural language statements and code snippets as meaning vectors in a shared representation space. In the proposed architecture, word embeddings are first trained on API documents, tutorials, and reference documents, and then aggregated in order to estimate semantic similarities between documents. Empirical evaluations show that the learned vector space embeddings lead to improvements in a previously explored bug localization task and a newly defined task of linking API documents to computer programming questions.
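The idea of aggregating word embeddings into a document-level similarity can be illustrated with a minimal sketch. This is not the paper's method: the abstract does not say how the embeddings are aggregated, so the sketch assumes simple mean pooling followed by cosine similarity. The three-dimensional vectors in `EMBEDDINGS` are hand-written, hypothetical values rather than trained skip-gram output, and the function names are made up for illustration.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-written for illustration only.
# Real embeddings would be trained (e.g. with a skip-gram model) on API
# documents, tutorials, and reference documents.
EMBEDDINGS = {
    "window": [0.9, 0.1, 0.0],
    "dialog": [0.8, 0.2, 0.1],
    "close": [0.1, 0.9, 0.0],
    "dispose": [0.2, 0.8, 0.1],
    "parse": [0.0, 0.1, 0.9],
}


def document_vector(tokens, embeddings=EMBEDDINGS):
    """Mean-pool the vectors of known tokens (an assumed aggregation).

    Returns None when no token has an embedding.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def document_similarity(tokens_a, tokens_b, embeddings=EMBEDDINGS):
    """Similarity of two tokenized documents via their pooled vectors."""
    vec_a = document_vector(tokens_a, embeddings)
    vec_b = document_vector(tokens_b, embeddings)
    if vec_a is None or vec_b is None:
        return 0.0
    return cosine(vec_a, vec_b)


# A natural-language query and two code-side documents that share no
# surface words with it.
bug_report = ["dialog", "close"]
related_code = ["window", "dispose"]
unrelated_code = ["parse"]

related_score = document_similarity(bug_report, related_code)
unrelated_score = document_similarity(bug_report, unrelated_code)
```

With these toy vectors, `related_score` comes out higher than `unrelated_score` even though the query and the related document have no words in common, which is the kind of lexical gap the abstract describes.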
Pages: 404-415
Page count: 12