Performance Analysis: AI-based VIST Audio Player by Microsoft Speech API
https://doi.org/10.24017/science.2021.1.3
Abstract
Speech recognition has attracted considerable research attention over the last two decades. Isolated-word, connected-word, and continuous speech recognition are its main focus areas. Researchers have adopted many techniques to address its challenges under the umbrellas of Artificial Intelligence (AI), pattern recognition, and acoustic-phonetic approaches. Variation in word pronunciation, individual accents, unwanted ambient noise, speech context, and the quality of input devices are among these challenges. Many Application Programming Interfaces (APIs), such as the Microsoft Speech API and the Google Speech API, have been developed to improve the accuracy of speech-to-text conversion. In this paper, the performance of the Microsoft Speech API is analyzed against other speech APIs reported in the literature, using a specially prepared dataset free of background noise. A Voice Interactive Speech to Text (VIST) audio player was developed for this analysis. The VIST audio player generates runtime subtitles for the audio files it plays, performing speech-to-text conversion in real time. The Microsoft Speech API was incorporated into the application so that its performance could be validated and measured. The experiments showed the Microsoft Speech API to be more accurate than the other APIs on the prepared dataset. Measured by precision and recall, the Microsoft Speech API achieved an accuracy of 96%, exceeding the results previously reported in the literature.
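The abstract describes the pipeline only at a high level, so the following is a minimal, illustrative sketch of how a player could obtain runtime subtitles through Microsoft's speech service. It assumes the Azure Cognitive Services Speech SDK for Python (azure-cognitiveservices-speech) as a stand-in for the Microsoft Speech API; the paper does not publish its integration code, and the key, region, and file name below are placeholders.

```python
# Minimal sketch: runtime subtitles from an audio file via the Azure
# Cognitive Services Speech SDK (pip install azure-cognitiveservices-speech).
# The key, region, and file name are illustrative placeholders.
import time
import azure.cognitiveservices.speech as speechsdk

def transcribe(wav_path: str, key: str, region: str) -> None:
    """Stream a WAV file through the recognizer and print each final
    phrase as it arrives -- one printed line per runtime subtitle."""
    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    audio_config = speechsdk.audio.AudioConfig(filename=wav_path)
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                            audio_config=audio_config)
    done = False

    def on_recognized(evt):
        print(evt.result.text)          # a finalized subtitle line

    def on_stopped(evt):
        nonlocal done
        done = True

    recognizer.recognized.connect(on_recognized)
    recognizer.session_stopped.connect(on_stopped)
    recognizer.canceled.connect(on_stopped)

    recognizer.start_continuous_recognition()
    while not done:
        time.sleep(0.2)                 # wait for the stream to finish
    recognizer.stop_continuous_recognition()

transcribe("sample.wav", key="<YOUR_KEY>", region="<YOUR_REGION>")
```

The reported 96% figure is stated in terms of precision and recall, where precision = TP / (TP + FP) and recall = TP / (TP + FN) over recognized words. Below is a hedged sketch of one plausible word-level scoring; the bag-of-words overlap is an illustrative assumption, since the paper does not specify its exact matching procedure.

```python
# Illustrative word-level precision/recall for a recognized transcript
# against a reference transcript (bag-of-words overlap as true positives).
from collections import Counter

def precision_recall(reference: str, hypothesis: str) -> tuple[float, float]:
    ref = Counter(reference.lower().split())
    hyp = Counter(hypothesis.lower().split())
    overlap = sum((ref & hyp).values())            # true positives
    precision = overlap / max(sum(hyp.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall

p, r = precision_recall("the quick brown fox", "the quick brown box")
print(f"precision={p:.2f} recall={r:.2f}")         # precision=0.75 recall=0.75
```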