Text Ranking and Classification using Data Compression

Authors
Kasturi, Nitya [1 ]
Markov, Igor L. [1 ]
Affiliation
[1] Meta, Menlo Park, CA 94025 USA
Source
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
A well-known but rarely used approach to text categorization uses conditional entropy estimates computed using data compression tools. Text affinity scores derived from compressed sizes can be used for classification and ranking tasks, but their success depends on the compression tools used. We use the Zstandard compressor and strengthen these ideas in several ways, calling the resulting language-agnostic technique Zest. In applications, this approach simplifies configuration, avoiding careful feature extraction and large ML models. Our ablation studies confirm the value of individual enhancements we introduce. We show that Zest complements and can compete with language-specific multidimensional content embeddings in production, but cannot outperform other counting methods on public datasets.
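The idea of scoring text affinity by compressed size can be illustrated with a minimal sketch. It is not the authors' Zest implementation: it uses Python's standard-library `zlib` as a stand-in for Zstandard, and the function names (`compressed_size`, `affinity`, `classify`, `rank`) and the sample texts are hypothetical. The score is the number of extra bytes needed to compress a query once a class's reference text has already been seen, which serves as a rough conditional-entropy estimate; fewer extra bytes means closer affinity.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (stand-in for Zstandard)."""
    return len(zlib.compress(data, 9))


def affinity(class_text: str, query: str) -> int:
    """Extra compressed bytes the query costs after class_text (lower = closer)."""
    base = class_text.encode("utf-8")
    joined = base + b" " + query.encode("utf-8")
    return compressed_size(joined) - compressed_size(base)


def rank(corpora: dict, query: str) -> list:
    """Class labels ordered from closest to farthest affinity."""
    return sorted(corpora, key=lambda label: affinity(corpora[label], query))


def classify(corpora: dict, query: str) -> str:
    """Label of the class whose reference text compresses the query best."""
    return rank(corpora, query)[0]


# Hypothetical sample data, for illustration only.
corpora = {
    "sports": "the team won the match after scoring a late goal "
              "in the second half of the game",
    "cooking": "simmer the onions and garlic in olive oil then add "
               "tomatoes and fresh basil leaves",
}
query = "scoring a late goal in the second half"

print(classify(corpora, query))
```

Because the score depends only on byte sequences, the same sketch applies to any language without tokenization or feature extraction. As the abstract notes, results depend heavily on the compressor used, so scores from `zlib` should not be taken as representative of Zstandard.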
Pages: 48-53
Page count: 6