From Word Embeddings To Document Similarities for Improved Information Retrieval in Software Engineering

Cited by: 203
Authors
Ye, Xin [1]
Shen, Hui [1]
Ma, Xiao [1]
Bunescu, Razvan [1]
Liu, Chang [1]
Affiliations
[1] Ohio Univ, Sch Elect Engn & Comp Sci, Athens, OH 45701 USA
Funding
National Science Foundation (USA);
Keywords
Word embeddings; skip-gram model; bug localization; bug reports; API documents;
DOI
10.1145/2884781.2884862
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Discipline classification codes
081202; 0835;
Abstract
The application of information retrieval techniques to search tasks in software engineering is made difficult by the lexical gap between search queries, usually expressed in natural language (e.g., English), and retrieved documents, usually expressed in code (e.g., programming languages). This is often the case in bug and feature location, community question answering, or more generally the communication between technical personnel and non-technical stakeholders in a software project. In this paper, we propose bridging the lexical gap by projecting natural language statements and code snippets as meaning vectors in a shared representation space. In the proposed architecture, word embeddings are first trained on API documents, tutorials, and reference documents, and then aggregated in order to estimate semantic similarities between documents. Empirical evaluations show that the learned vector space embeddings lead to improvements in a previously explored bug localization task and a newly defined task of linking API documents to computer programming questions.
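As a rough illustration of the "train word embeddings, then aggregate them into a document similarity" idea described in the abstract, the following Python sketch scores two token lists against each other. The abstract does not state which aggregation is used, so the best-match averaging here is an assumption chosen for illustration, and the embedding table, its values, and all function names are hypothetical.

```python
import math

# Hypothetical 3-dimensional embeddings; the values are invented for this
# sketch and do not come from any trained model.
EMBEDDINGS = {
    "open":   [0.9, 0.1, 0.0],
    "file":   [0.1, 0.9, 0.0],
    "fopen":  [0.8, 0.3, 0.0],
    "stream": [0.2, 0.8, 0.1],
    "color":  [0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def directed_similarity(source, target, emb):
    """Average, over in-vocabulary source words, of the best cosine match
    among in-vocabulary target words (an assumed aggregation)."""
    src = [w for w in source if w in emb]
    tgt = [w for w in target if w in emb]
    if not src or not tgt:
        return 0.0
    best = [max(cosine(emb[s], emb[t]) for t in tgt) for s in src]
    return sum(best) / len(src)


def document_similarity(doc_a, doc_b, emb):
    """Symmetric score: mean of the two directed similarities."""
    return 0.5 * (directed_similarity(doc_a, doc_b, emb)
                  + directed_similarity(doc_b, doc_a, emb))


query = ["open", "file"]        # natural-language side
code_doc = ["fopen", "stream"]  # code/API side
unrelated = ["color"]

related_score = document_similarity(query, code_doc, EMBEDDINGS)
unrelated_score = document_similarity(query, unrelated, EMBEDDINGS)
print(related_score > unrelated_score)
```

With these toy vectors the query shares no literal token with `code_doc`, yet it scores higher against it than against `unrelated`, which is the kind of lexical-gap bridging the abstract refers to. In practice the embedding table would come from a model trained on the relevant documentation rather than from hand-written values.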
Pages: 404-415
Page count: 12
Related papers
50 in total
  • [1] Word Embeddings for the Software Engineering Domain
    Efstathiou, Vasiliki
    Chatzilenas, Christos
    Spinellis, Diomidis
    [J]. 2018 IEEE/ACM 15TH INTERNATIONAL CONFERENCE ON MINING SOFTWARE REPOSITORIES (MSR), 2018, : 38 - 41
  • [2] Towards Word Embeddings for Improved Duplicate Bug Report Retrieval in Software Repositories
    Budhiraja, Amar
    Dutta, Kartik
    Shrivastava, Manish
    Reddy, Raghu
    [J]. PROCEEDINGS OF THE 2018 ACM SIGIR INTERNATIONAL CONFERENCE ON THEORY OF INFORMATION RETRIEVAL (ICTIR'18), 2018, : 167 - 170
  • [3] From Word Embeddings To Document Distances
    Kusner, Matt J.
    Sun, Yu
    Kolkin, Nicholas I.
    Weinberger, Kilian Q.
    [J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 37, 2015, 37 : 957 - 966
  • [4] Multi-class Document Classification Using Improved Word Embeddings
    Rabut, Benedict A.
    Fajardo, Arnel C.
    Medina, Ruji P.
    [J]. 2019 2ND INTERNATIONAL CONFERENCE ON COMPUTING AND BIG DATA (ICCBD 2019), 2019, : 42 - 46
  • [5] Improving Arabic information retrieval using word embedding similarities
    El Mahdaouy, Abdelkader
    El Alaoui, Said Ouatik
    Gaussier, Eric
    [J]. INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2018, 21 (01) : 121 - 136
  • [6] Constraining Word Embeddings by Prior Knowledge - Application to Medical Information Retrieval
    Liu, Xiaojie
    Nie, Jian-Yun
    Sordoni, Alessandro
    [J]. INFORMATION RETRIEVAL TECHNOLOGY, AIRS 2016, 2016, 9994 : 155 - 167
  • [7] Query Expansion based on Word Embeddings and Ontologies for Efficient Information Retrieval
    Rastogi, Namrata
    Verma, Parul
    Kumar, Pankaj
    [J]. INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2021, 12 (11) : 367 - 373
  • [8] Combining Word Embeddings with Taxonomy Information for Multi-Label Document Classification
    Hirschmeier, Stefan
    Schoder, Detlef
    [J]. DOCENG'19: PROCEEDINGS OF THE ACM SYMPOSIUM ON DOCUMENT ENGINEERING 2019, 2019,
  • [9] Semantically Enhanced Term Frequency based on Word Embeddings for Arabic Information Retrieval
    El Mahdaouy, Abdelkader
    El Alaoui, Said Ouatik
    Gaussier, Eric
    [J]. 2016 4TH IEEE INTERNATIONAL COLLOQUIUM ON INFORMATION SCIENCE AND TECHNOLOGY (CIST), 2016, : 385 - 389
  • [10] Re-Engineered Word Embeddings for Improved Document-Level Sentiment Analysis
    Yang, Su
    Deravi, Farzin
    [J]. APPLIED SCIENCES-BASEL, 2022, 12 (18):