From Word Embeddings To Document Similarities for Improved Information Retrieval in Software Engineering

Cited by: 203

Authors:

Ye, Xin [1]

Shen, Hui [1]

Ma, Xiao [1]

Bunescu, Razvan [1]

Liu, Chang [1]

Affiliations:

[1] Ohio Univ, Sch Elect Engn & Comp Sci, Athens, OH 45701 USA

Source:

2016 IEEE/ACM 38TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE) | 2016

Funding:

U.S. National Science Foundation

Keywords:

Word embeddings; skip-gram model; bug localization; bug reports; API documents

DOI:

10.1145/2884781.2884862

Chinese Library Classification:

TP31 [Computer Software]

Discipline classification codes:

081202; 0835

Abstract:

The application of information retrieval techniques to search tasks in software engineering is made difficult by the lexical gap between search queries, usually expressed in natural language (e.g., English), and retrieved documents, usually expressed in code (e.g., programming languages). This is often the case in bug and feature location, community question answering, or, more generally, the communication between technical personnel and non-technical stakeholders in a software project. In this paper, we propose bridging the lexical gap by projecting natural language statements and code snippets as meaning vectors in a shared representation space. In the proposed architecture, word embeddings are first trained on API documents, tutorials, and reference documents, and then aggregated in order to estimate semantic similarities between documents. Empirical evaluations show that the learned vector space embeddings lead to improvements in a previously explored bug localization task and a newly defined task of linking API documents to computer programming questions.
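
As a rough illustration of the idea of aggregating word embeddings into a document-level similarity, the sketch below averages word vectors and compares the results with cosine similarity. This is an assumption made for illustration only: the abstract does not specify the aggregation the authors use, and mean pooling is simply one common choice. The `EMBEDDINGS` table and the function names `doc_vector`, `cosine`, and `similarity` are hypothetical; real embeddings would be trained (for example with a skip-gram model) on API documents and tutorials.

```python
import math

# Hypothetical toy embeddings, hand-written for illustration only.
# In practice these would be learned from API documents, tutorials, etc.
EMBEDDINGS = {
    "open": [1.0, 0.0, 0.0],
    "file": [0.0, 1.0, 0.0],
    "read": [0.5, 0.5, 0.0],
    "stream": [0.0, 0.8, 0.2],
    "button": [0.0, 0.0, 1.0],
}


def doc_vector(tokens, embeddings=EMBEDDINGS):
    """Mean-pool the vectors of known tokens; return None if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity(query_tokens, doc_tokens):
    """Similarity between a natural-language query and a code-like document."""
    q = doc_vector(query_tokens)
    d = doc_vector(doc_tokens)
    if q is None or d is None:
        return 0.0
    return cosine(q, d)


if __name__ == "__main__":
    query = ["open", "file"]
    print(similarity(query, ["read", "stream"]))  # related vocabulary
    print(similarity(query, ["button"]))          # unrelated vocabulary
```

With these toy vectors, the query shares no tokens with either document, yet the document about reading a stream scores higher than the one about a button, which is the kind of lexical-gap bridging the abstract describes.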

Pages: 404-415

Page count: 12

