DaNangVMD: Vietnamese Speech Mispronunciation Detection

DaNangVMD: Nhận diện phát âm sai tiếng Việt

  • Ket Doan Nguyen
  • Nguyen Anh Tran
  • Van Nam Vo
  • Tran Tien Nguyen
  • Pham Tuyen Le
  • Quoc Vuong Nguyen
  • Huu Nhat Minh Nguyen
Keywords: Mispronunciation Detection, Multimodal Embedding, Vietnamese Speech Recognition

Abstract

Automatic Speech Recognition (ASR), which automatically converts human speech into readable text, has advanced rapidly over the past decade. However, Vietnamese speech recognition faces critical challenges, including frequent mispronunciations and wide variation in how Vietnamese is spoken. In this work, we address the challenging task of Mispronunciation Detection (MD) for the Vietnamese language. Because Vietnamese is a tonal language, its pronunciation depends not only on consonants and vowels but also on variations in pitch, or tone. We propose DaNangVMD, a model that detects mispronunciations in Vietnamese speech from the speech audio and its canonical transcript. By building a multimodal representation with multi-head attention over the embeddings of a phonetic encoder and a linguistic encoder, DaNangVMD aims to provide a robust solution for accurate mispronunciation detection and diagnosis. In extensive evaluation, DaNangVMD outperforms the PAPL baseline models by 15% in F1 score and 13% in accuracy.
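
To make the idea of attention-based multimodal fusion concrete, the following is a minimal sketch in plain Python, not the DaNangVMD architecture itself. It assumes that linguistic embeddings (one per canonical phoneme) act as queries over phonetic embeddings (one per audio frame), and it omits the learned projection matrices a real model would have. The function names and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def split_heads(vec, num_heads):
    """Split one embedding vector into num_heads equal slices."""
    head_dim = len(vec) // num_heads
    return [vec[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]


def cross_attention(queries, keys_values, num_heads):
    """Multi-head scaled dot-product attention without learned projections.

    queries:     linguistic embeddings, one per canonical phoneme (assumed).
    keys_values: phonetic embeddings, one per audio frame (assumed).
    Returns one fused vector per query with the same dimension as the inputs.
    """
    dim = len(queries[0])
    assert dim % num_heads == 0, "embedding size must divide evenly into heads"
    head_dim = dim // num_heads
    kv_heads = [split_heads(kv, num_heads) for kv in keys_values]

    fused = []
    for query in queries:
        output = []
        for h, q_head in enumerate(split_heads(query, num_heads)):
            # Similarity between this phoneme and every audio frame, per head.
            scores = [
                sum(a * b for a, b in zip(q_head, kv[h])) / math.sqrt(head_dim)
                for kv in kv_heads
            ]
            weights = softmax(scores)
            # Weighted average of the frame embeddings for this head.
            for j in range(head_dim):
                output.append(sum(w * kv[h][j] for w, kv in zip(weights, kv_heads)))
        fused.append(output)
    return fused


# Toy data: 2 canonical phonemes, 3 audio frames, 4-dimensional embeddings.
linguistic = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
phonetic = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
]
fused = cross_attention(linguistic, phonetic, num_heads=2)
```

Each row of `fused` is an audio summary aligned to one canonical phoneme. In a full system, a classifier could then compare that summary with the phoneme's linguistic embedding to decide whether the phoneme was mispronounced; that downstream step is an assumption here and is not shown.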

Author Biographies

Nguyen Anh Tran

Tran Nguyen Anh is pursuing the B.Eng. degree in Information Technology at the University of Danang, Vietnam - Korea University of Information and Communication Technology. His research interests include software development, machine learning and deep learning.
Email: anhtn.21it@vku.udn.vn

Van Nam Vo

Vo Van Nam is pursuing the B.Eng. degree in Information Technology at the University of Danang, Vietnam - Korea University of Information and Communication Technology. His research interests include software development, machine learning and deep learning.
Email: namvv.21it@vku.udn.vn

Tran Tien Nguyen

Nguyen Tran Tien is pursuing the B.Eng. degree in Information Technology at the University of Danang, Vietnam - Korea University of Information and Communication Technology. His research interests include software development, machine learning and deep learning.
Email: nttien.20it6@vku.udn.vn

Pham Tuyen Le

Le Pham Tuyen received the B.S. degree in computer science from Ho Chi Minh City University of Technology, Vietnam, in 2013, and the Ph.D. degree in computer science and engineering from Kyung Hee University, South Korea, in 2019. He is currently a Principal Research Engineer at AgileSoDA Company and a visiting lecturer at Industrial University of Ho Chi Minh City. His current research interests include machine learning, reinforcement learning, combinatorial optimization, and robotics.
Email: tuyen_01036033@iuh.edu.vn

Quoc Vuong Nguyen

Nguyen Quoc Vuong received the master's degree in computer science from the University of Da Nang in 2011. His research interests include software development, machine learning and deep learning.
Email: voung@donga.edu.vn

Huu Nhat Minh Nguyen

Nguyen Huu Nhat Minh (M’20) received the Ph.D. degree in Computer Science and Engineering from Kyung Hee University, South Korea, in 2020. He continued as a postdoctoral researcher working on Federated Learning and Democratized Learning at the Intelligent Networking Lab, Kyung Hee University, South Korea. He is Deputy Head of the Department of Science, Technology, and International Cooperation, and is in charge of the Research Program at the Digital Science and Technology Institute, the University of Danang, Vietnam - Korea University of Information and Communication Technology, Vietnam. He received the best KHU Ph.D. thesis award in engineering in 2020. He has published in premier ACM/IEEE journals and conferences. His research interests include wireless communications, federated learning, NLP, and computer vision.
Email: nhnminh@vku.udn.vn

Published
2024-05-27