Automated labeling of PDF mathematical exercises with word N-grams VSM classification

Cited by: 0

Authors:

Yamauchi, Taisei ^{[1]}

Flanagan, Brendan ^{[2]}

Nakamoto, Ryosuke ^{[1]}

Dai, Yiling ^{[3]}

Takami, Kyosuke ^{[4]}

Ogata, Hiroaki ^{[3]}

Affiliations:

[1] Kyoto Univ, Grad Sch Informat, Kyoto, Japan

[2] Kyoto Univ, Inst Liberal Arts & Sci, Ctr Innovat Res & Educ Data Sci, Kyoto, Japan

[3] Kyoto Univ, Acad Ctr Comp & Media Studies, Kyoto, Japan

[4] Natl Inst Educ Policy Res, Educ Data Sci Ctr, Tokyo, Japan

Source:

SMART LEARNING ENVIRONMENTS | 2023 / Volume 10 / Issue 01

Keywords:

Automatic labeling; Word n-gram; Random forest; Incomplete text classification; Word embedding; Mathematical education; Mathematical education in Japan; LANGUAGE; RECOGNITION; ANALYTICS;

DOI:

10.1186/s40561-023-00271-9

Chinese Library Classification (CLC) number:

G40 [Education];

Discipline classification code:

040101 ; 120403 ;

Abstract:

In recent years, smart learning environments have become central to modern education, supporting students and instructors through tools based on prediction and recommendation models. These tools often rely on learning material metadata, such as the knowledge covered by an exercise, which is usually labeled by domain experts; this labeling is costly and difficult to scale. Automated labeling can ease the workload on experts, as seen in previous studies that applied automatic classification algorithms to research papers and Japanese mathematical exercises. However, these studies did not address fine-grained labeling. In addition, as such systems are adopted more widely, paper materials are converted into PDF format, and text extraction from these PDFs can be incomplete. Previous research has placed little emphasis on labeling incomplete mathematical sentences. This study aims to achieve precise automated classification even from incomplete text inputs. To tackle these challenges, we propose a mathematical exercise labeling algorithm based on word n-grams that can handle detailed labels even for incomplete sentences, and we compare it with a state-of-the-art word embedding method. The experimental results show that mono-gram (unigram) features with Random Forest models achieved the best performance, with macro F-measures of 92.50% and 61.28% for the 24-class and 297-class labeling tasks, respectively. The contribution of this research is to show that the proposed method, based on traditional simple n-grams, can find context-independent similarities in incomplete sentences and outperforms state-of-the-art word embedding methods in specific tasks such as classifying short and incomplete texts.
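
As a rough illustration of two ideas named in the abstract, word n-gram features and the macro F-measure, the following Python sketch may help. It is not the authors' implementation: the function names, the whitespace tokenization (Japanese text would need a morphological tokenizer), and the toy labels are all assumptions made for this example, and the Random Forest classifier itself is left out.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return the contiguous word n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_vector(text, n=1):
    """Bag-of-n-grams count vector; whitespace tokenization is an assumption."""
    return Counter(word_ngrams(text.lower().split(), n))


def macro_f_measure(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data, for illustration only.
unigrams = ngram_vector("solve the quadratic equation", n=1)
bigrams = word_ngrams(["solve", "the", "quadratic", "equation"], 2)
score = macro_f_measure(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

Because each unigram is counted independently of its neighbors, a vector built this way still overlaps with that of a complete sentence when some words are missing from the extracted text, which is one plausible reading of the "context-independent similarities" the abstract mentions.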


Pages: 30
