GSoC 2017 Application & Background

CMUSphinx is a pioneer in speech recognition and actively maintains several open-source research tools. Google Summer of Code 2017 accepted proposals from the CMUSphinx community, some of them related to pronunciation remediation and mentored by James Salsman and others. As a research student in speech and language technology, I find these projects very exciting and well suited to my area of interest. I had previously worked with Sphinx at IIIT-H and used it for a basic pronunciation evaluation task, citing Srikanth Ronanki's work from GSoC 2012. I therefore contacted James and asked him to review my proposal. He suggested several modifications to the architecture and approach. We modified the timeline to suit the project requirements and added motivational factors to keep the learner going during training.
Thanks to the mentors and Google, the proposal has been accepted for implementation during GSoC 2017.

This is what I propose to implement during GSoC 2017:

The proposed framework will implement a user interface that collects an audio signal from the microphone for a prompted text. It must perform a comprehensive speech analysis and produce a report of the mispronounced segments of speech in the collected audio. Audio collection is performed asynchronously through web workers in the browser, with Flash ActionScript as a fallback. The collected audio is decoded in parallel using several grammars within the pocketsphinx framework, and the standardized acoustic scores for each decoded unit are compared across grammars to evaluate pronunciation. These grammars form the core contribution of this project: they act as a panel of native and non-native transcriptionists, or recognition models, each predicting the output as it is intelligible to them.
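
The cross-grammar comparison described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: it assumes each grammar yields one raw acoustic score per decoded unit, that the units are already aligned across grammars, and that "standardized" means z-normalization within a grammar. The names `standardize`, `flag_units`, `reference` and `margin`, and all the scores, are hypothetical.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Z-normalize raw acoustic scores from one grammar (mean 0, std 1)."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All scores identical: nothing stands out, so return zeros.
        return [0.0] * len(scores)
    return [(s - mu) / sigma for s in scores]


def flag_units(units, scores_by_grammar, reference="native", margin=1.0):
    """Return the units where some other grammar beats the reference
    grammar by more than `margin` standard deviations (hypothetical rule)."""
    z = {name: standardize(s) for name, s in scores_by_grammar.items()}
    flagged = []
    for i, unit in enumerate(units):
        best_other = max(z[name][i] for name in z if name != reference)
        if best_other - z[reference][i] > margin:
            flagged.append(unit)
    return flagged


# Made-up scores for five aligned units under two grammars.
units = ["DH", "AH", "K", "AE", "T"]
scores = {
    "native": [-100.0, -120.0, -110.0, -300.0, -105.0],
    "non_native": [-110.0, -115.0, -112.0, -108.0, -111.0],
}
print(flag_units(units, scores))
```

Standardizing within each grammar first means a grammar that scores everything low does not dominate the comparison; only units that are unusually poor for the reference grammar relative to the others get flagged.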

The live implementation of the project can be seen at the following link: IREMEDY


A large body of literature addresses mispronunciation detection, including conference proceedings and PhD theses. Below I summarize some of the literature relevant to the task of pronunciation evaluation.

1. Witt, Silke M. "Automatic error detection in pronunciation training: Where we are and where we need to go." Proc. IS ADEPT 6 (2012).

This is an invited paper in the proceedings of the International Symposium on Automatic Detection of Errors in Pronunciation Training. It covers the whole framework of pronunciation training in great detail and discusses the history, typology, observations, features, approaches and design of pronunciation correction systems.

It says that CALL (computer-assisted language learning) systems combine several disciplines, including speech recognition, linguistics, psycholinguistics and pedagogy as well as auditory and articulatory research. Research on pronunciation training sprang to life with increased computing power. In 2007, ISCA created a special interest group called Speech and Language Technology for Education (SLaTE). An utterance in a language L2 spoken by a native speaker of L1 may fall anywhere on a scale from "native-sounding speech" to "unintelligible speech". In general, the approaches to pronunciation error detection fall into two groups:

  1. Phoneme-level pronunciation error detection
  2. Prosodic error detection

Phoneme errors may be "severe", i.e. substitution of another phoneme, or "less severe", i.e. an accented or malformed phoneme caused by a slight variation in tongue position, etc.

Prosodic errors may involve stress placed on different segments of speech, rhythm and intonation.

Kim et al. suggest that phoneme errors are hard for humans to judge and that the best phoneme-level score is obtained when the scores for all instances of the same phoneme are averaged over one speaker. They also show that at least 300 instances of a single phoneme are required to reach a correlation of 0.8 with human judgement.
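
As a rough illustration of the per-speaker averaging idea, the sketch below groups one speaker's scores by phoneme, averages them, and marks whether enough instances were seen. The function name, the `(phoneme, score)` input format and the sample values are assumptions made for this example; only the 300-instance figure comes from the summary above.

```python
from collections import defaultdict
from statistics import mean


def average_phoneme_scores(observations, min_instances=300):
    """Average one speaker's scores per phoneme.

    `observations` is a list of (phoneme, score) pairs (assumed format).
    Returns {phoneme: (mean_score, has_enough_instances)}.
    """
    grouped = defaultdict(list)
    for phoneme, score in observations:
        grouped[phoneme].append(score)
    return {
        phoneme: (mean(values), len(values) >= min_instances)
        for phoneme, values in grouped.items()
    }


# Made-up scores for a single speaker; a tiny threshold is used for the demo.
observations = [("AE", 1.0), ("AE", 3.0), ("T", 0.5)]
print(average_phoneme_scores(observations, min_instances=2))
```

The boolean in each result lets a report distinguish averages that rest on enough data from ones that should be treated as provisional.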

The obvious disadvantages of phoneme-level error detection stem from co-articulation effects and non-obvious spelling-to-pronunciation rules.

Raux et al. establish the relationship between error rates and intelligibility. Errors related to prosodic features, such as vowel insertion, impact intelligibility more than segmental errors such as phoneme substitution, for example replacing 'A' with 'ER'.

Koniaris et al. model pronunciation based on how sounds are perceived by native versus non-native speakers.

The features used for pronunciation scoring are N-dimensional vectors for each phoneme, where each dimension varies within a range for a single phoneme class.

The two approaches for error detection that are relevant to this proposal are:
  1. Likelihood-based scoring
  2. Classifier-based scoring
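
To make likelihood-based scoring concrete, here is a minimal sketch of one common likelihood-ratio formulation, often called Goodness of Pronunciation (GOP). The summary above does not say which formulation the paper or this proposal uses, so treat it as an assumed example: the log-likelihood of the expected phoneme (forced alignment) is compared with the best log-likelihood from an unconstrained phoneme loop over the same frames, normalized by the number of frames. The function names, threshold and numbers are hypothetical.

```python
def gop_score(forced_loglik, free_loglik, n_frames):
    """Duration-normalized log-likelihood ratio for one phoneme segment.

    forced_loglik: log-likelihood of the expected phoneme (assumed input).
    free_loglik:   best log-likelihood from an unconstrained phoneme loop.
    n_frames:      number of acoustic frames in the segment.
    """
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    return abs(forced_loglik - free_loglik) / n_frames


def is_mispronounced(forced_loglik, free_loglik, n_frames, threshold=0.5):
    """Flag the segment when the score exceeds a (hypothetical) threshold."""
    return gop_score(forced_loglik, free_loglik, n_frames) > threshold


# Made-up values: the expected phoneme fits much worse than the best
# competitor, so the segment is flagged.
print(gop_score(-50.0, -40.0, 10))
print(is_mispronounced(-50.0, -40.0, 10))
```

A score near zero means the expected phoneme explains the audio about as well as any other phoneme; in practice the threshold would have to be tuned, possibly per phoneme. Classifier-based scoring would instead train a model on labelled correct and incorrect examples.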

Another vital consideration for pronunciation correction systems is prosodic error detection and relevant feedback in two areas:

  1. Pitch accent recognition
  2. Duration, energy, pitch and pauses

The F0 contour of the subject's utterance can be tracked, and feedback can be provided to adjust the pitch of a mispronounced segment. Rhythm features must also be tracked. Audio-visual feedback can be provided through articulatory feature inversion. Other corrective feedback can be built into the design of the interactive system. The paper also comments on user motivation and suggests some important changes to the presentation layer.
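
As an illustration of F0 tracking, the sketch below estimates the pitch of a single frame with a plain autocorrelation search. The paper summary does not name a pitch tracker, so this method, the function name `estimate_f0` and the 50 to 400 Hz search range are assumptions; a real system would also need voicing detection and smoothing across frames.

```python
import math


def estimate_f0(frame, sample_rate, min_f0=50.0, max_f0=400.0):
    """Estimate F0 (Hz) of one frame by picking the autocorrelation peak.

    Returns None when no lag in the search range has positive correlation
    (for example, a silent frame).
    """
    min_lag = int(sample_rate / max_f0)
    max_lag = min(int(sample_rate / min_f0), len(frame) - 1)
    best_lag, best_corr = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation at this lag.
        corr = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    if best_lag is None:
        return None
    return sample_rate / best_lag


# Synthetic test signal: a 200 Hz sine sampled at 8 kHz (50 ms frame).
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(400)]
f0 = estimate_f0(frame, sample_rate)
print(f0)
```

Running this over successive frames of an utterance gives an F0 contour that could be compared against a reference contour to decide where pitch feedback is needed.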



