Improving Readability for Automatic Speech Recognition Transcription

Cited by: 5
Authors
Liao, Junwei [1 ]
Eskimez, Sefik [2 ]
Lu, Liyang [2 ]
Shi, Yu [2 ]
Gong, Ming [3 ]
Shou, Linjun [3 ]
Qu, Hong [1 ]
Zeng, Michael [2 ]
Affiliations
[1] Univ Elect Sci & Technol China, Chengdu, Peoples R China
[2] Microsoft Speech & Dialogue Res Grp, New York, NY USA
[3] Microsoft STCA NLP Grp, Beijing, Peoples R China
Keywords
Automatic speech recognition; post-processing for readability; data synthesis; pre-trained model; PUNCTUATION; CAPITALIZATION;
DOI
10.1145/3557894
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Modern Automatic Speech Recognition (ASR) systems can achieve high recognition accuracy. However, even a perfectly accurate transcript can be challenging to read due to grammatical errors, disfluency, and other noise common in spoken communication. These readability issues, introduced by both speakers and ASR systems, impair the performance of downstream tasks and the comprehension of human readers. In this work, we present a task called ASR post-processing for readability (APR) and formulate it as a sequence-to-sequence text generation problem. The APR task aims to transform noisy ASR output into text that is readable for humans and downstream tasks while preserving the semantic meaning of the speaker. We further study the APR task in terms of a benchmark dataset, evaluation metrics, and baseline models: First, to address the lack of task-specific data, we propose a method to construct a dataset for the APR task from data collected for grammatical error correction. Second, we adapt or borrow metrics from similar tasks to evaluate model performance on APR. Lastly, we use several typical or adapted pre-trained models as baselines. Furthermore, we fine-tune the baseline models on the constructed dataset and compare their performance with a traditional pipeline method in terms of the proposed evaluation metrics. Experimental results show that all fine-tuned baseline models outperform the traditional pipeline method, and our adapted RoBERTa model outperforms the pipeline method by 4.95 and 6.63 BLEU points on two test sets, respectively. A human evaluation and case study further demonstrate the ability of the proposed model to improve the readability of ASR transcripts.
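The data-synthesis idea summarized in the abstract — reusing grammatical-error-correction (GEC) data to build APR training pairs — can be illustrated with a minimal sketch. The transformation below (lowercasing and punctuation stripping to mimic raw ASR output) is a hypothetical simplification for illustration, not the paper's exact pipeline; the function name and rules are assumptions.

```python
import re

def simulate_asr_output(text: str) -> str:
    """Degrade clean written text into pseudo-ASR output.

    Illustrative only: raw ASR transcripts are typically single-case and
    unpunctuated, so a corrected GEC sentence can serve as the readable
    target while this degraded form serves as the noisy APR source.
    """
    # Drop punctuation that ASR systems usually do not emit.
    text = re.sub(r"[^\w\s']", " ", text)
    # ASR output is conventionally lowercase.
    text = text.lower()
    # Collapse the whitespace left behind by removed punctuation.
    return re.sub(r"\s+", " ", text).strip()

# A GEC target sentence becomes one side of a synthetic APR training pair:
apr_target = "However, the transcript can still be hard to read."
apr_source = simulate_asr_output(apr_target)
print(apr_source)  # however the transcript can still be hard to read
```

In the actual task, such (source, target) pairs would train a sequence-to-sequence model to restore punctuation, casing, and fluency in one pass.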
Pages: 23
Related papers
50 records in total
  • [1] Improving Speech Synthesis by Automatic Speech Recognition and Speech Discriminator
    Huang, Li-Yu
    Chen, Chia-Ping
    JOURNAL OF INFORMATION SCIENCE AND ENGINEERING, 2024, 40 (01) : 189 - 200
  • [2] Improving Automatic Recognition of Aphasic Speech with AphasiaBank
    Le, Duc
    Provost, Emily Mower
    17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 2681 - 2685
  • [3] Improving the Quality of Automatic Speech Recognition in Trucks
    Korenevsky, Maxim
    Medennikov, Ivan
    Shchemelinin, Vadim
SPEECH AND COMPUTER, 2016, 9811 : 362 - 369
  • [4] Improving analysis techniques for automatic speech recognition
    O'Shaughnessy, D
    2002 45TH MIDWEST SYMPOSIUM ON CIRCUITS AND SYSTEMS, VOL III, CONFERENCE PROCEEDINGS, 2002, : 65 - 68
  • [5] THE DEREVERBERATION OF SPEECH WITH A VIEW TO IMPROVING THE AUTOMATIC RECOGNITION OF SPEECH IN ROOMS
    HIRSCH, HG
    ACUSTICA, 1989, 67 (03): : 216 - 221
  • [6] UTILIZATION OF REDUNDANCY OF PHONEMIC TRANSCRIPTION OF SPEECH FOR AUTOMATIC-SPEECH RECOGNITION
    OTTEN, KW
    KLEINER, RT
JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 1964, 36 (05): : 1039+
  • [7] Automatic speech recognition performance on a voicemail transcription task
    Padmanabhan, M
    Saon, G
    Huang, J
    Kingsbury, B
    Mangu, L
    IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, 2002, 10 (07): : 433 - 442
  • [8] Improving Hypernasality Estimation with Automatic Speech Recognition in Cleft Palate Speech
    Song, Kaitao
    Wan, Teng
    Wang, Bixia
    Jiang, Huiqiang
    Qiu, Luna
    Xu, Jiahang
    Jiang, Liping
    Lou, Qun
    Yang, Yuqing
    Li, Dongsheng
    Wang, Xudong
    Qiu, Lili
    INTERSPEECH 2022, 2022, : 4820 - 4824
  • [9] Improving Automatic Emotion Recognition from Speech Signals
    Bozkurt, Elif
    Erzin, Engin
    Erdem, Cigdem Eroglu
    Erdem, A. Tanju
INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 312+
  • [10] IMPROVING AUTOMATIC SPEECH RECOGNITION ROBUSTNESS FOR THE ROMANIAN LANGUAGE
    Buzo, Andi
    Cucu, Horia
    Burileanu, Corneliu
    19TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO-2011), 2011, : 2119 - 2122