A large margin algorithm for speech-to-phoneme and music-to-score alignment

Cited by: 25

Authors:

Keshet, Joseph

Shalev-Shwartz, Shai

Singer, Yoram

Chazan, Dan

Affiliations:

[1] Hebrew Univ Jerusalem, Sch Engn & Comp Sci, IL-91904 Jerusalem, Israel

[2] Google Inc, Mountain View, CA 94043 USA

[3] Hebrew Univ Jerusalem, Dept Elect Engn, IL-91904 Jerusalem, Israel

Source:

IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2007 / Vol. 15 / No. 08

Keywords:

forced alignment; large margin and kernel methods; music; speech processing; support vector machines (SVMs)

DOI:

10.1109/TASL.2007.903928

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Subject classification codes:

070206; 082403

Abstract:

We describe and analyze a discriminative algorithm for learning to align an audio signal with a given sequence of events that tag the signal. We demonstrate the applicability of our method for the tasks of speech-to-phoneme alignment ("forced alignment") and music-to-score alignment. In the first alignment task, the events that tag the speech signal are phonemes while in the music alignment task, the events are musical notes. Our goal is to learn an alignment function whose input is an audio signal along with its accompanying event sequence and its output is a timing sequence representing the actual start time of each event in the audio signal. Generalizing the notion of separation with a margin used in support vector machines for binary classification, we cast the learning task as the problem of finding a vector in an abstract inner-product space. To do so, we devise a mapping of the input signal and the event sequence along with any possible timing sequence into an abstract vector space. Each possible timing sequence therefore corresponds to an instance vector and the predicted timing sequence is the one whose projection onto the learned prediction vector is maximal. We set the prediction vector to be the solution of a minimization problem with a large set of constraints. Each constraint enforces a gap between the projection of the correct target timing sequence and the projection of an alternative, incorrect, timing sequence onto the vector. Though the number of constraints is very large, we describe a simple iterative algorithm for efficiently learning the vector and analyze the formal properties of the resulting learning algorithm. We report experimental results comparing the proposed algorithm to previous studies on speech-to-phoneme and music-to-score alignment, which use hidden Markov models. The results obtained in our experiments using the discriminative alignment algorithm are comparable to results of state-of-the-art systems.
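The prediction rule in the abstract (choose the timing sequence whose feature vector has the largest projection onto the learned vector) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature map `feature_map` (per-class sums of frames), the dynamic-programming search in `align`, and the perceptron-style update in `perceptron_step` are all hypothetical stand-ins. The abstract does not specify the paper's actual feature functions or its iterative update.

```python
import numpy as np

def align(frames, events, w):
    """Return start frames y maximizing <w, phi(x, e, y)> for the toy phi below.

    frames: (T, d) array; events: list of k class ids; w: (num_classes, d).
    Assumes T >= k, each event covers at least one frame, and event 0 starts at 0.
    """
    T, k = len(frames), len(events)
    scores = frames @ w[events].T           # scores[t, i]: frame t under event i
    best = np.full((T, k), -np.inf)         # best[t, i]: best prefix score ending in event i
    entered = np.zeros((T, k), dtype=bool)  # True if event i starts at frame t
    best[0, 0] = scores[0, 0]
    for t in range(1, T):
        for i in range(min(k, t + 1)):
            stay = best[t - 1, i]
            enter = best[t - 1, i - 1] if i > 0 else -np.inf
            entered[t, i] = enter > stay
            best[t, i] = max(stay, enter) + scores[t, i]
    # Backtrack from the last frame of the last event to recover start times.
    starts, i = [0] * k, k - 1
    for t in range(T - 1, 0, -1):
        if entered[t, i]:
            starts[i] = t
            i -= 1
    return starts

def feature_map(frames, events, starts, num_classes):
    """Toy phi(x, e, y): for each event class, the sum of frames assigned to it."""
    phi = np.zeros((num_classes, frames.shape[1]))
    bounds = list(starts) + [len(frames)]
    for i, event in enumerate(events):
        phi[event] += frames[bounds[i]:bounds[i + 1]].sum(axis=0)
    return phi

def perceptron_step(w, frames, events, true_starts):
    """Stand-in update (assumption): move w toward the correct timing sequence
    and away from the currently predicted one whenever they differ."""
    predicted = align(frames, events, w)
    if predicted != list(true_starts):
        n = w.shape[0]
        w = w + feature_map(frames, events, true_starts, n) \
              - feature_map(frames, events, predicted, n)
    return w
```

With this toy feature map, the projection decomposes into a sum of per-frame scores, which is why the maximization over all timing sequences reduces to a dynamic program over (frame, event) pairs rather than an explicit enumeration.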

Pages: 2373-2382

Number of pages: 10

4 records in total

[1] An On-line Algorithm for Music-to-Score Alignment of Guzheng Performance
Wang, Ziyi
Cao, Yin
2018 IEEE 23RD INTERNATIONAL CONFERENCE ON DIGITAL SIGNAL PROCESSING (DSP), 2018
[2] OPTIMIZING THE MAPPING FROM A SYMBOLIC TO AN AUDIO REPRESENTATION FOR MUSIC-TO-SCORE ALIGNMENT
Joder, Cyril
Essid, Slim
Richard, Gael
2011 IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS (WASPAA), 2011, : 121 - 124
[3] A COMPARATIVE STUDY OF TONAL ACOUSTIC FEATURES FOR A SYMBOLIC LEVEL MUSIC-TO-SCORE ALIGNMENT
Joder, Cyril
Essid, Slim
Richard, Gael
2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2010, : 409 - 412
[4] A Coupled Duration-Focused Architecture for Real-Time Music-to-Score Alignment
Cont, Arshia
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2010, 32 (06) : 974 - 987