A Method of Readability Assessment for Web Documents Using Text Features and HTML Structures

Cited by: 4
Authors
Yamasaki, Takahiro [1 ]
Tokiwa, Kin-Ichiroh [1 ]
Affiliations
[1] Osaka Sangyo Univ, Daito, Osaka, Japan
Keywords
readability assessment; Web documents; document classification; feature extraction
DOI
10.1002/ecj.11565
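Chinese Library Classification
TM [Electrical engineering]; TN [Electronic technology, communication technology];
Subject classification codes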
0808 ; 0809 ;
Abstract
This paper describes a method of readability assessment for Web documents. Readability is the ease with which text can be read and understood. We hypothesize that readability is determined by whether a reader can easily grasp the structure of the text, and that the impression and complexity of the text are significant factors. We extract features of impression and complexity from the plain text and from additional data such as HTML tags. To compare the effect of the extracted features, we estimate readability rank by machine learning. We conduct fivefold cross validation for each domain and calculate the root mean squared error between the actual rank and the estimated rank. The cross validation experiments confirm that our method performs well, showing that extracting features of impression and complexity is effective for readability assessment. (C) 2014 Wiley Periodicals, Inc.
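The evaluation outlined in the abstract (features from plain text and HTML tags, fivefold cross validation, root mean squared error between actual and estimated rank) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the feature set (word, paragraph, and heading counts), the mean-rank baseline, the fold assignment, and the pooling of predictions across folds are all assumptions, since the abstract does not specify them.

```python
import math
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags and collects text nodes from an HTML string."""

    def __init__(self):
        super().__init__()
        self.tag_counts = {}
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] = self.tag_counts.get(tag, 0) + 1

    def handle_data(self, data):
        self.text_parts.append(data)


def extract_features(html):
    """Hypothetical features (not from the paper): word, paragraph, heading counts."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    words = " ".join(parser.text_parts).split()
    headings = sum(parser.tag_counts.get(f"h{i}", 0) for i in range(1, 7))
    return {
        "words": len(words),
        "paragraphs": parser.tag_counts.get("p", 0),
        "headings": headings,
    }


def rmse(actual, estimated):
    """Root mean squared error between two equal-length rank sequences."""
    squared = [(a - e) ** 2 for a, e in zip(actual, estimated)]
    return math.sqrt(sum(squared) / len(squared))


def k_fold_indices(n, k=5):
    """Yield (train, test) index lists; fold i tests every k-th item from i."""
    for i in range(k):
        test = list(range(i, n, k))
        held_out = set(test)
        train = [j for j in range(n) if j not in held_out]
        yield train, test


def cross_validated_rmse(features, ranks, fit, predict, k=5):
    """Pool held-out predictions over k folds, then compute one RMSE (assumed)."""
    estimated = [0.0] * len(ranks)
    for train, test in k_fold_indices(len(ranks), k):
        model = fit([features[j] for j in train], [ranks[j] for j in train])
        for j in test:
            estimated[j] = predict(model, features[j])
    return rmse(ranks, estimated)


def fit_mean(train_features, train_ranks):
    """Placeholder model: remembers the mean training rank, ignores features."""
    return sum(train_ranks) / len(train_ranks)


def predict_mean(model, feature_row):
    """Placeholder prediction: always returns the stored mean rank."""
    return model
```

Any regressor can be substituted for the hypothetical `fit_mean` and `predict_mean` pair by passing functions with the same shape to `cross_validated_rmse`; running it once per domain would mirror the per-domain evaluation the abstract describes.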
Pages: 1 / 10
Page count: 10
Related papers
50 in total
  • [1] A method of readability assessment for web documents using text features and HTML structures
    Yamasaki, Takahiro
    Tokiwa, Kin-Ichiroh
    [J]. IEEJ Transactions on Electronics, Information and Systems, 2012, 132 (09) : 1524 - 1532
  • [2] USING COOLLISTS TO INDEX HTML DOCUMENTS IN THE WEB
    LIM, JG
    [J]. COMPUTER NETWORKS AND ISDN SYSTEMS, 1995, 28 (1-2) : 147 - 154
  • [3] Extracting structures of HTML documents
    Lim, SJ
    Ng, YK
    [J]. TWELFTH INTERNATIONAL CONFERENCE ON INFORMATION NETWORKING (ICOIN-12), PROCEEDINGS, 1998, : 420 - 426
  • [4] Hierarchies in HTML documents: Linking text to concepts
    Burget, R
    [J]. 15TH INTERNATIONAL WORKSHOP ON DATABASE AND EXPERT SYSTEMS APPLICATIONS, PROCEEDINGS, 2004, : 186 - 190
  • [5] From HTML documents to web tables and rules
    Simon, Kai
    Lausen, Georg
    Boley, Harold
    [J]. 2006 ICEC: EIGHTH INTERNATIONAL CONFERENCE ON ELECTRONIC COMMERCE, PROCEEDINGS: THE NEW E-COMMERCE: INNOVATIONS FOR CONQUERING CURRENT BARRIERS, OBSTACLES AND LIMITATIONS TO CONDUCTING SUCCESSFUL BUSINESS ON THE INTERNET, 2006, : 125 - 131
  • [6] A hybrid method to categorize HTML documents
    Khordad, M
    Shamsfard, M
    Kazemeyni, F
    [J]. Data Mining VI: Data Mining, Text Mining and Their Business Applications, 2005, : 331 - 340
  • [7] Automatic discovery of semantic structures in HTML documents
    Mukherjee, S
    Yang, GZ
    Tan, WF
    Ramakrishnan, IV
    [J]. SEVENTH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION, VOLS I AND II, PROCEEDINGS, 2003, : 245 - 249
  • [8] Mining Web Pages Using Features of Rendering HTML Elements in the Web Browser
    Fernandez, F. J.
    Alvarez, Jose L.
    Abad, Pedro J.
    Jimenez, Patricia
    [J]. TRENDS IN PRACTICAL APPLICATIONS OF AGENTS AND MULTI-AGENTS SYSTEMS, 2011, 90 : 161 - 168
  • [9] Study on Text Information Extraction Model and Algorithm of HTML Documents
    Li Chunyan
    Jiang Ilaiyang
    [J]. PROCEEDINGS OF 2010 CROSS-STRAIT CONFERENCE ON INFORMATION SCIENCE AND TECHNOLOGY, 2010, : 399 - 403
  • [10] Extracting structures of HTML documents using a high-level stack machine
    Lim, SJ
    Ng, YK
    [J]. INFORMATION NETWORKING IN ASIA, 2001, 3 : 177 - 188