Hierarchical Data Augmentation and the Application in Text Classification

Cited by: 14
Authors
Yu, Shujuan [1]
Yang, Jie [1]
Liu, Danlei [1]
Li, Runqi [1]
Zhang, Yun [1]
Zhao, Shengmei [2]
Affiliations
[1] Nanjing Univ Posts & Telecommun, Coll Elect & Opt Engn, Nanjing 210023, Peoples R China
[2] Nanjing Univ Posts & Telecommun, Coll Telecommun & Informat Engn, Nanjing 210003, Peoples R China
Source
IEEE ACCESS | 2019 / Vol. 7
Funding
National Natural Science Foundation of China;
Keywords
Attention mechanism; data augmentation; natural language processing; text classification;
DOI
10.1109/ACCESS.2019.2960263
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Applications of data augmentation in natural language processing have been limited. In this paper, we propose a novel method named Hierarchical Data Augmentation (HDA) and apply it to text classification. First, inspired by the hierarchical structure of texts, in which words form a sentence and sentences form a document, HDA implements a hierarchical data augmentation strategy by augmenting texts at the word level and the sentence level respectively. Second, inspired by cropping, a popular data augmentation method in computer vision, HDA uses an attention mechanism at each augmentation level to distill (crop) the important content from texts hierarchically as summaries of the texts. Specifically, we use a trained Hierarchical Attention Networks (HAN) model to obtain attention values for all documents in the training sets at both levels, which are then used to extract the most important words/sentences and generate new samples by concatenating them in order. This yields two levels of augmented datasets, WordSet and SentSet. Finally, we extend the training set with a certain amount of HDA-generated samples and evaluate the models' performance with the new training set. The results show that HDA can generate large numbers of high-quality augmented samples at both levels, and that models using these samples obtain significant improvements. Compared with existing methods, HDA is simple in both theory and implementation, and it can augment texts at two levels to increase data diversity.
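The attention-based cropping step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `crop_by_attention`, the `keep_ratio` parameter, and the attention values are all hypothetical, and the abstract does not state how many words or sentences are kept. In the paper the weights come from a trained HAN model; here they are hard-coded for illustration.

```python
from typing import List, Sequence


def crop_by_attention(
    units: Sequence[str],
    weights: Sequence[float],
    keep_ratio: float = 0.5,  # assumed parameter, not taken from the paper
) -> List[str]:
    """Keep the highest-weighted units, preserving their original order."""
    if len(units) != len(weights):
        raise ValueError("units and weights must have the same length")
    if not units:
        return []
    # Keep at least one unit so the generated sample is never empty.
    k = max(1, int(len(units) * keep_ratio))
    # Rank positions by weight (descending); ties go to the earlier position.
    ranked = sorted(range(len(units)), key=lambda i: (-weights[i], i))
    # Restore document order before concatenating.
    kept = sorted(ranked[:k])
    return [units[i] for i in kept]


# Word-level example (a WordSet-style sample); attention values are made up.
words = ["the", "movie", "was", "surprisingly", "good", "overall"]
word_attention = [0.05, 0.25, 0.05, 0.20, 0.40, 0.05]
word_sample = " ".join(crop_by_attention(words, word_attention, keep_ratio=0.5))

# Sentence-level example (a SentSet-style sample); attention values are made up.
sentences = ["The plot was slow.", "The acting was superb.", "I left happy."]
sentence_attention = [0.2, 0.5, 0.3]
sent_sample = " ".join(
    crop_by_attention(sentences, sentence_attention, keep_ratio=0.67)
)

print(word_sample)
print(sent_sample)
```

The same function serves both levels because only the unit changes (word or sentence); each cropped sample would keep the label of the document it was derived from when added to the training set.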
Pages: 185476 - 185485
Number of pages: 10
Related papers
50 in total
  • [1] Data Augmentation with Transformers for Text Classification
    Medardo Tapia-Tellez, Jose
    Jair Escalante, Hugo
    [J]. ADVANCES IN COMPUTATIONAL INTELLIGENCE, MICAI 2020, PT II, 2020, 12469 : 247 - 259
  • [2] A Survey on Data Augmentation for Text Classification
    Bayer, Markus
    Kaufhold, Marc-Andre
    Reuter, Christian
    [J]. ACM COMPUTING SURVEYS, 2023, 55 (07)
  • [3] Tokenization-based data augmentation for text classification
    Prakrankamanant, Patawee
    Chuangsuwanich, Ekapol
    [J]. 2022 19TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER SCIENCE AND SOFTWARE ENGINEERING (JCSSE 2022), 2022,
  • [4] AEDA: An Easier Data Augmentation Technique for Text Classification
    Karimi, Akbar
    Rossi, Leonardo
    Prati, Andrea
    [J]. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021, 2021, : 2748 - 2754
  • [5] Probabilistic Interpolation with Mixup Data Augmentation for Text Classification
    Xu, Rongkang
    Zhang, Yongcheng
    Ren, Kai
    Huang, Yu
    Wei, Xiaomei
    [J]. ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT IV, ICIC 2024, 2024, 14878 : 410 - 421
  • [6] Text Smoothing: Enhance Various Data Augmentation Methods on Text Classification Tasks
    Wu, Xing
    Gao, Chaochen
    Lin, Meng
    Zang, Liangjun
    Hu, Songlin
    [J]. PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022): (SHORT PAPERS), VOL 2, 2022, : 871 - 875
  • [7] LiDA: Language-Independent Data Augmentation for Text Classification
    Sujana, Yudianto
    Kao, Hung-Yu
    [J]. IEEE ACCESS, 2023, 11 : 10894 - 10901
  • [8] Data Augmentation Methods for Enhancing Robustness in Text Classification Tasks
    Tang, Huidong
    Kamei, Sayaka
    Morimoto, Yasuhiko
    [J]. ALGORITHMS, 2023, 16 (01)
  • [9] PDA: Data Augmentation with Preposition Words on Chinese text classification
    Yang, Leixin
    Xiong, Haoyu
    Xiang, Yu
    [J]. 2024 2ND ASIA CONFERENCE ON COMPUTER VISION, IMAGE PROCESSING AND PATTERN RECOGNITION, CVIPPR 2024, 2024,
  • [10] A Submodular Optimization Framework for Imbalanced Text Classification With Data Augmentation
    Alemayehu, Eyor
    Fang, Yi
    [J]. IEEE ACCESS, 2023, 11 : 41680 - 41696