SOLVING THE PROBLEM OF SPEECH COMMAND RECOGNITION

Chi Thien Nguyen,
Ho Chi Minh City University of Technology (HCMUT), Ho Chi Minh City, Vietnam

DOI: 10.36724/2664-066X-2024-10-4-43-50

SYNCHROINFO JOURNAL. Volume 10, Number 4 (2024). P. 43-50.

Abstract

In terms of technology development, speech recognition has a long history marked by several waves of major innovation. More recently, the field has been boosted by advances in deep learning and big data. These advances are evidenced not only by the growing number of scientific papers published in this area, but also by the worldwide adoption of various deep learning methods in the design and implementation of speech recognition systems. The wide variety of speech signal processing tasks, together with the high variability of speech and the general instability of processing results, calls for a new formulation of the processing problem in this area. The problem of identifying speech production models for the purpose of adequate perception is formulated, and the recognition of speech commands is investigated. Two speaker adaptation schemes for improving the recognition of speech signals are presented. The transformation of speech signals and their recognition are implemented using likelihood functions. Results of experiments with the proposed adaptation schemes are given: one hundred experiments were carried out on speech signals from the publicly available TIDigits 1.0 database.

Keywords: mel-frequency cepstral coefficients, speech recognition, speaker adaptation
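For illustration only, the sketch below shows one plausible way to combine the ideas named in the abstract and keywords: mel-frequency cepstral coefficients as features and a per-command likelihood function for classification. It is a minimal Python sketch under assumed tooling (librosa for MFCC extraction, SciPy for Gaussian log-likelihoods); the function and class names are hypothetical, and it does not reproduce the paper's actual likelihood functions or speaker adaptation schemes.

    # Hypothetical sketch: MFCC features + per-command Gaussian likelihood scoring.
    # Assumes librosa and SciPy are available; not the authors' implementation.
    import numpy as np
    import librosa
    from scipy.stats import multivariate_normal

    def mfcc_features(path, n_mfcc=13):
        """Load a speech command and return its time-averaged MFCC vector."""
        signal, sr = librosa.load(path, sr=None)      # keep the native sample rate
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
        return mfcc.mean(axis=1)                      # one feature vector per utterance

    class LikelihoodCommandRecognizer:
        """Assigns the command whose Gaussian model gives the highest log-likelihood."""

        def fit(self, features_by_command):
            # features_by_command: {command_label: [feature_vector, ...]}
            self.models = {}
            for command, vectors in features_by_command.items():
                X = np.vstack(vectors)
                mean = X.mean(axis=0)
                cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # regularize
                self.models[command] = (mean, cov)
            return self

        def predict(self, feature_vector):
            # Maximum-likelihood decision over the trained per-command models.
            return max(
                self.models,
                key=lambda c: multivariate_normal.logpdf(
                    feature_vector, mean=self.models[c][0], cov=self.models[c][1]
                ),
            )

Time-averaging the MFCC matrix into a single vector is the simplest possible pooling choice; a practical recognizer would more likely model the frame sequence (for example with dynamic time warping or an HMM), and speaker adaptation would adjust the per-command models before scoring.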
