From Word Embeddings To Document Similarities for Improved Information Retrieval in Software Engineering

Cited by: 203
Authors
Ye, Xin [1]
Shen, Hui [1]
Ma, Xiao [1]
Bunescu, Razvan [1]
Liu, Chang [1]
Affiliations
[1] Ohio Univ, Sch Elect Engn & Comp Sci, Athens, OH 45701 USA
Funding
National Science Foundation (USA)
Keywords
Word embeddings; skip-gram model; bug localization; bug reports; API documents;
DOI
10.1145/2884781.2884862
Chinese Library Classification
TP31 [Computer software]
Discipline classification codes
081202; 0835
Abstract
The application of information retrieval techniques to search tasks in software engineering is made difficult by the lexical gap between search queries, usually expressed in natural language (e.g., English), and retrieved documents, usually expressed in code (e.g., programming languages). This is often the case in bug and feature location, community question answering, or, more generally, the communication between technical personnel and non-technical stakeholders in a software project. In this paper, we propose bridging the lexical gap by projecting natural language statements and code snippets as meaning vectors in a shared representation space. In the proposed architecture, word embeddings are first trained on API documents, tutorials, and reference documents, and then aggregated in order to estimate semantic similarities between documents. Empirical evaluations show that the learned vector space embeddings lead to improvements in a previously explored bug localization task and a newly defined task of linking API documents to computer programming questions.
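As a rough illustration of the "train word embeddings, then aggregate them into a document similarity" idea in the abstract, the following Python sketch may help. It is not the authors' implementation. The toy vectors in `EMBEDDINGS`, the function names, and the specific aggregation rule are all assumptions made for this example: each word is matched to its closest word in the other document by cosine similarity, the matches are averaged, and the two directions are combined symmetrically. In practice the vectors would come from a skip-gram model trained on API documents, tutorials, and reference documents.

```python
import numpy as np

# Hypothetical toy embeddings, hand-picked for illustration only.
# Real vectors would be learned by a skip-gram model.
EMBEDDINGS = {
    "open":   np.array([0.9, 0.1, 0.0]),
    "read":   np.array([0.8, 0.3, 0.1]),
    "file":   np.array([0.1, 0.9, 0.2]),
    "stream": np.array([0.2, 0.8, 0.3]),
    "color":  np.array([0.0, 0.1, 0.9]),
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def word_to_doc(word, doc):
    """Best cosine match between one word and any known word in doc."""
    targets = [t for t in doc if t in EMBEDDINGS]
    if word not in EMBEDDINGS or not targets:
        return 0.0
    return max(cosine(EMBEDDINGS[word], EMBEDDINGS[t]) for t in targets)


def directed_similarity(src, dst):
    """Average, over known words in src, of their best match in dst."""
    known = [w for w in src if w in EMBEDDINGS]
    if not known:
        return 0.0
    return sum(word_to_doc(w, dst) for w in known) / len(known)


def document_similarity(a, b):
    """Symmetric score: mean of the two directed similarities."""
    return 0.5 * (directed_similarity(a, b) + directed_similarity(b, a))


# A natural-language query and two token lists standing in for code.
query = ["open", "file"]
code_tokens = ["read", "stream"]
unrelated = ["color"]

print(document_similarity(query, code_tokens))
print(document_similarity(query, unrelated))
```

With these toy vectors the query shares no word with `code_tokens`, yet it scores higher against them than against `unrelated`, which is the kind of lexical-gap bridging the abstract describes. Weighting words by inverse document frequency would be a natural refinement, and is likewise an assumption here.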
Pages: 404 - 415
Page count: 12
Related papers
50 in total
  • [41] Towards improved information retrieval from medical sources
    Kagolovsky, Y.
    Freese, D.
    Miller, M.
    Walrod, T.
    Moehr, J.
    International Journal of Medical Informatics, 51 (2-3) : 181 - 195
  • [42] Towards improved information retrieval from medical sources
    Kagolovsky, Y
    Freese, D
    Miller, M
    Walrod, T
    Moehr, J
    INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS, 1998, 51 (2-3) : 181 - 195
  • [43] Key word extraction from a document using word co-occurrence statistical information
    Matsuo, Yutaka
    Ishizuka, Mitsuru
    Transactions of the Japanese Society for Artificial Intelligence, 2002, 17 (03) : 217 - 223
  • [44] Applications of Tf-idf Concept to Improve Monolingual and Cross-Language Information Retrieval based on Word Embeddings
    Sari, Syandra
    Adriani, Mirna
    PROCEEDINGS OF THE 1ST INTERNATIONAL CONFERENCE ON ADVANCED INFORMATION SCIENCE AND SYSTEM, AISS 2019, 2019
  • [45] The Semantic Dimension in Information Retrieval, from Document Indexing to Query Reformulation
    Bouramoul, Abdelkrim
    KNOWLEDGE ORGANIZATION, 2011, 38 (05) : 425 - 437
  • [46] Matching word images for content-based retrieval from printed document images
    Million Meshesha
    C. V. Jawahar
    International Journal of Document Analysis and Recognition (IJDAR), 2008, 11 : 29 - 38
  • [47] Matching word images for content-based retrieval from printed document images
    Meshesha, Million
    Jawahar, C. V.
    INTERNATIONAL JOURNAL ON DOCUMENT ANALYSIS AND RECOGNITION, 2008, 11 (01) : 29 - 38
  • [48] Poster: Which Similarity Metric to Use for Software Documents? A Study on Information Retrieval based Software Engineering Tasks
    Rahman, Md Masudur
    Chakraborty, Saikat
    Ray, Baishakhi
    PROCEEDINGS 2018 IEEE/ACM 40TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING - COMPANION (ICSE-COMPANION), 2018 : 335 - 336
  • [49] Proposal of Game Design Document from Software Engineering Requirements Perspective
    Gonzalez Salazar, Mario
    Mitre, Hugo A.
    Lemus Olalde, Cuauhtemoc
    Gonzalez Sanchez, Jose Luis
    2012 17TH INTERNATIONAL CONFERENCE ON COMPUTER GAMES (CGAMES), 2012 : 81 - 85
  • [50] Extracting information from experimental software engineering papers
    Cruzes, Daniela
    Mendonca, Manoel
    Basili, Victor
    Shull, Forrest
    Jino, Mario
    SCCC 2007: XXVI INTERNATIONAL CONFERENCE OF THE CHILEAN SOCIETY OF COMPUTER SCIENCE, PROCEEDINGS, 2007 : 105 - +