A Method of Readability Assessment for Web Documents Using Text Features and HTML Structures

Cited by: 4

Authors:

Yamasaki, Takahiro [1]

Tokiwa, Kin-Ichiroh [1]

Affiliations:

[1] Osaka Sangyo Univ, Daito, Osaka, Japan

Source:

ELECTRONICS AND COMMUNICATIONS IN JAPAN | 2014 / Vol. 97 / No. 10

Keywords:

readability assessment; Web documents; document classification; feature extraction

DOI:

10.1002/ecj.11565

Chinese Library Classification (CLC) code:

TM [Electrical engineering]; TN [Electronics and communication technology]

Discipline classification codes:

0808; 0809

Abstract:

This paper describes a method of readability assessment for Web documents. Readability is the ease with which text can be read and understood. We hypothesize that readability is determined by whether a reader can easily grasp the structure of a text. The impression a text gives and its complexity are significant factors. We extract impression and complexity features from the plain text and from additional data, such as HTML tags. To compare the effect of the extracted features, we estimate readability rank by machine learning. We conduct fivefold cross-validation for each domain and calculate the root mean squared error between the actual rank and the estimated rank. The cross-validation experiments confirm that our method performs well, which shows that extracting impression and complexity features is effective for readability assessment. (C) 2014 Wiley Periodicals, Inc.
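
The evaluation procedure named in the abstract (fivefold cross-validation scored by root mean squared error between actual and estimated rank) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not name the learning algorithm or the feature set, so `train_mean_baseline` is a hypothetical placeholder learner, the `Sample` layout is assumed, and pooling all held-out predictions into a single RMSE (rather than averaging per-fold values) is also an assumption.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Assumed layout: (feature vector, readability rank). Not taken from the paper.
Sample = Tuple[Sequence[float], float]
Predictor = Callable[[Sequence[float]], float]


def rmse(actual: Sequence[float], estimated: Sequence[float]) -> float:
    """Root mean squared error between actual and estimated ranks."""
    if len(actual) == 0 or len(actual) != len(estimated):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((a - e) ** 2 for a, e in zip(actual, estimated))
    return math.sqrt(squared / len(actual))


def k_fold_rmse(
    samples: List[Sample],
    train: Callable[[List[Sample]], Predictor],
    k: int = 5,
) -> float:
    """Hold out each of k folds once and return the RMSE over all held-out samples."""
    if len(samples) < k:
        raise ValueError("need at least k samples")
    actual: List[float] = []
    estimated: List[float] = []
    for fold in range(k):
        # Sample i belongs to fold i % k; the remaining samples form the training set.
        held_out = samples[fold::k]
        training = [s for i, s in enumerate(samples) if i % k != fold]
        predict = train(training)
        for features, rank in held_out:
            actual.append(rank)
            estimated.append(predict(features))
    return rmse(actual, estimated)


def train_mean_baseline(training: List[Sample]) -> Predictor:
    """Hypothetical placeholder learner: always predicts the mean training rank."""
    mean_rank = sum(rank for _, rank in training) / len(training)
    return lambda features: mean_rank
```

Under these assumptions, running `k_fold_rmse(samples, train)` once per document domain, with `train` replaced by a real learner, would yield one error value per domain, which is how the abstract describes the comparison of feature sets.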


Pages: 1 / 10

Page count: 10

