Text-based Document Similarity Matching Using sdtext

Author
Shields, Clay [1]
Affiliation
[1] Georgetown Univ, Dept Comp Sci, Washington, DC 20057 USA
DOI
10.1109/HICSS.2016.694
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Forensics examiners frequently try to identify duplicate files during an investigation. They might do so to identify known files of interest, or to allow more rapid review of documents that appear to be similar. Current forensic tools for detecting duplicate files operate over the low-level bits of the file, typically using hashing. While this can be a fast and effective method in many cases, it can fail due to differences in file format. We introduce sdtext, a tool developed to identify similar files based on their textual contents, which is robust to changes in format. We show that sdtext is far more accurate than existing tools in matching files that contain the same text in different formats.
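The contrast the abstract draws between byte-level hashing and text-based matching can be illustrated with a minimal sketch. This is not sdtext's algorithm, which the abstract does not describe; it assumes a simple stand-in technique (Jaccard similarity over word 3-grams), and every function name below, as well as the crude `strip_tags` text extractor, is hypothetical and chosen only for illustration.

```python
from __future__ import annotations

import hashlib
import re


def byte_hash(data: bytes) -> str:
    """SHA-256 over the raw bytes, as a byte-level duplicate detector would compute."""
    return hashlib.sha256(data).hexdigest()


def normalize(text: str) -> list[str]:
    """Lowercase the text and keep only alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shingles(tokens: list[str], k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping k-word sequences in the token list."""
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def text_similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two texts' word k-gram sets (assumed metric)."""
    sa, sb = shingles(normalize(a), k), shingles(normalize(b), k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def strip_tags(html: str) -> str:
    """Crude tag removal; a placeholder for real format-specific text extraction."""
    return re.sub(r"<[^>]+>", " ", html)


# The same sentence stored as plain text and as HTML.
plain = "Forensics examiners frequently try to identify duplicate files."
html = (
    "<html><body><p>Forensics examiners <b>frequently</b> "
    "try to identify duplicate files.</p></body></html>"
)

# The byte-level hashes differ because the encodings differ.
same_bytes = byte_hash(plain.encode()) == byte_hash(html.encode())

# The text-level comparison sees identical content once markup is removed.
score = text_similarity(plain, strip_tags(html))

print(same_bytes, score)
```

In this sketch the two files hash differently, yet their extracted text is identical, which is the failure mode of format-sensitive hashing that the abstract describes.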
Pages: 5607-5616
Page count: 10