Word Embeddings for the Software Engineering Domain

Cited by: 55
Authors
Efstathiou, Vasiliki [1]
Chatzilenas, Christos [1]
Spinellis, Diomidis [1]
Affiliations
[1] Athens Univ Econ & Business, Athens, Greece
Keywords
Natural Language Processing; Skip-gram; word2vec; Stack Overflow
DOI
10.1145/3196398.3196448
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
The software development process produces vast amounts of textual data expressed in natural language. Outcomes from the natural language processing community have been adapted in software engineering research to leverage this rich textual information; these include methods and readily available tools, often furnished with pre-trained models. State-of-the-art pre-trained models, however, capture general, common-sense knowledge, which is of limited value when handling data specific to a specialized domain. There is currently a lack of domain-specific pre-trained models that would further enhance the processing of natural language artefacts related to software engineering. To this end, we release a word2vec model trained over 15GB of textual data from Stack Overflow posts. We illustrate how the model disambiguates polysemous words by interpreting them within their software engineering context. In addition, we present examples of fine-grained semantics captured by the model, which suggest that these results transfer to diverse, targeted information retrieval tasks in software engineering and motivate further reuse of the model.
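The abstract's point about interpreting a polysemous word in its software engineering context can be pictured as a nearest-neighbour query over word vectors. The following is a minimal sketch only: the three-dimensional vectors are invented for illustration and are not taken from the released model, and the names `TOY_VECTORS`, `cosine_similarity`, and `most_similar` are hypothetical helpers rather than part of any published API.

```python
import math

# Toy 3-dimensional vectors, invented purely for illustration.
# They are NOT values from the released Stack Overflow model.
TOY_VECTORS = {
    "python": [0.9, 0.1, 0.0],
    "java": [0.8, 0.2, 0.1],
    "snake": [0.0, 0.1, 0.9],
    "coffee": [0.1, 0.9, 0.2],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


if __name__ == "__main__":
    # In this toy space "python" sits next to "java", not "snake",
    # mimicking a domain-specific reading of an ambiguous word.
    for neighbour, score in most_similar("python", TOY_VECTORS):
        print(f"{neighbour}\t{score:.3f}")
```

With real embeddings the same query would run over vectors of a few hundred dimensions loaded from the model file; only the ranking logic shown here carries over.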
Pages: 38-41
Number of pages: 4