Document Classification of SuDer Turkish News Corpora

Authors
Sen, Mehmet Umut [1]
Yanikoglu, Berrin [1]
Affiliation
[1] Sabanci Univ, Istanbul, Turkey
Keywords
document classification; SuDer news corpora; word embeddings; neural networks
DOI
Not available
Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809
Abstract
Word embeddings are successfully employed in various Natural Language Processing tasks, but training them requires large amounts of text, which are scarce for Turkish. In this work, we collected a large number of articles from two news websites and used the tags within the web pages as labels. We evaluated the resulting corpora with various document classification models. Embedding-based models performed better than models using traditional TF-IDF features. A neural network that simultaneously learns the word embeddings and the document classification task performed best.
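The abstract does not describe the models in detail, so the following is only a minimal sketch of what a TF-IDF baseline for tag-labelled news articles could look like. It assumes whitespace tokenization, a smoothed IDF, and a nearest-centroid classifier scored by dot product; the toy documents, the labels `spor` and `ekonomi`, and all function names are hypothetical and are not taken from the paper or the SuDer corpora.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: plain lowercase whitespace tokenization, no stemming.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency computed on the training documents.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(doc, idf):
    # L2-normalized sparse TF-IDF vector; terms unseen in training are dropped.
    tf = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}


def fit_centroids(docs, labels, idf):
    # Average the TF-IDF vectors of all training documents sharing a label.
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for doc, label in zip(docs, labels):
        for t, v in tfidf(doc, idf).items():
            sums[label][t] += v
    return {
        label: {t: v / counts[label] for t, v in vec.items()}
        for label, vec in sums.items()
    }


def predict(doc, idf, centroids):
    # Pick the label whose centroid has the largest dot product with the document.
    vec = tfidf(doc, idf)

    def score(label):
        centroid = centroids[label]
        return sum(v * centroid.get(t, 0.0) for t, v in vec.items())

    return max(centroids, key=score)


# Hypothetical toy corpus: page tags play the role of class labels.
train_docs = ["takim mac gol", "mac hakem gol", "borsa dolar faiz", "faiz enflasyon dolar"]
train_labels = ["spor", "spor", "ekonomi", "ekonomi"]

idf = fit_idf(train_docs)
centroids = fit_centroids(train_docs, train_labels, idf)
print(predict("gol atan takim", idf, centroids))
```

An embedding-based model would replace the sparse TF-IDF vectors with dense word vectors, either pretrained or learned jointly with the classifier, as the abstract reports for the best-performing network.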
Pages: 4