M. V. Galickiy
Network Information Technologies and Services, MTUCI, Moscow, Russia
DOI: 10.36724/2664-066X-2025-11-4-2-8
SYNCHROINFO JOURNAL. Volume 11, Number 4 (2025). P. 2-8.
Abstract
Information systems are becoming more complex as artificial intelligence and machine learning are integrated into them. Introducing additional data entry methods can increase user productivity. To improve the accuracy and efficiency of speech-to-text conversion, it is essential to consider technologies such as voice activity detection (VAD) and automatic speech recognition (ASR). These technologies provide advanced mechanisms for user-system interaction through natural user interfaces, in particular voice. The subject of research in this article is VAD and ASR methods. The purpose of the study is to analyze VAD and ASR modules and test them in order to make recommendations on their use. The article also discusses several ASR platforms with different levels of adaptation to linguistic and acoustic environments. The results of the study will be useful for web application developers who are considering implementing these modules in their projects.
Keywords: voice recognition; algorithms; web application; speech synthesis; ASR; VAD; artificial intelligence
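The abstract does not state which VAD method is evaluated. As a point of reference, the simplest family of VAD methods splits the signal into short frames and marks a frame as speech when its short-term energy exceeds a threshold. The sketch below is a minimal, hypothetical illustration of that idea, not the method tested in the article. The frame length (160 samples, i.e. 10 ms at an assumed 16 kHz sampling rate), the fixed RMS threshold of 0.02, and the function names `frame_energies`, `energy_vad` and `speech_segments` are all assumptions chosen for illustration.

```python
import math
from typing import List, Optional, Tuple


def frame_energies(samples: List[float], frame_len: int) -> List[float]:
    """Split the signal into non-overlapping frames and return each frame's RMS energy.

    A trailing partial frame (fewer than frame_len samples) is ignored.
    """
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies


def energy_vad(samples: List[float], frame_len: int = 160,
               threshold: float = 0.02) -> List[bool]:
    """Label each frame as speech (True) if its RMS energy exceeds a fixed threshold.

    Assumption: samples are floats normalized to [-1.0, 1.0]; the default
    frame_len and threshold are illustrative values, not tuned parameters.
    """
    return [e > threshold for e in frame_energies(samples, frame_len)]


def speech_segments(flags: List[bool], frame_len: int) -> List[Tuple[int, int]]:
    """Merge consecutive speech frames into (start_sample, end_sample) segments."""
    segments = []
    start: Optional[int] = None
    for i, is_speech in enumerate(flags):
        if is_speech and start is None:
            start = i * frame_len          # a speech run begins at this frame
        elif not is_speech and start is not None:
            segments.append((start, i * frame_len))  # the run ended at the previous frame
            start = None
    if start is not None:                  # the signal ends while still in speech
        segments.append((start, len(flags) * frame_len))
    return segments


if __name__ == "__main__":
    # Synthetic test signal: 20 ms of silence, 30 ms of a 440 Hz tone, 20 ms of silence.
    silence = [0.0] * 320
    tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(480)]
    flags = energy_vad(silence + tone + silence)
    print(flags)
    print(speech_segments(flags, 160))
```

A fixed energy threshold is fragile under background noise, which is why practical systems typically use adaptive thresholds or statistical and neural classifiers. The overall structure usually stays the same: split the signal into frames, classify each frame, then merge the results into segments that are passed on to the ASR stage.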