Speech synthesis and recognition systems (TTS, STT): Application in the sound search engine of the University Library “Svetozar Marković”
Abstract
This paper examines speech synthesis (text-to-speech, TTS) and speech recognition (speech-to-text, STT) technologies, focusing on their application in the sound search engine of the University Library “Svetozar Marković”. The introductory section presents the theoretical background and traces the historical development of the field, from 18th-century mechanical devices such as Kratzenstein’s vocal tract model of 1779 to modern systems based on artificial intelligence and deep learning. It also outlines the development of these technologies for the Serbian language, from the first systems in the mid-1980s to today’s solutions. The paper then describes the implementation of a search engine for audio recordings from the University Library’s YouTube channel, which uses Whisper JAX for speech recognition and achieves a recognition accuracy of over 90%. The processes of metadata collection, speech transcription, data processing, and storage in a Lucene/Solr-based database are described in detail. The system supports searching the transcribed content, including matching of similar words by Levenshtein distance, which makes search more robust to possible speech recognition errors. Challenges such as the lack of temporal subject descriptors in the metadata and the dependence of recognition quality on the quality of the source recording are also discussed. The paper concludes that the developed system is a useful tool that facilitates access to the library’s audiovisual content, and it outlines plans for further development and expansion of its functionality.
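The similar-word matching mentioned above can be illustrated with a minimal sketch in Python. It is a standalone illustration of Levenshtein distance, not the code used in the library's system and not how Lucene/Solr implements fuzzy matching internally. The function names `levenshtein` and `similar_words`, the sample vocabulary, and the distance threshold are all assumptions made for this example.

```python
from typing import Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between a and b, counting single-character
    insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similar_words(query: str, vocabulary: Iterable[str],
                  max_distance: int = 2) -> List[str]:
    """Return vocabulary words within max_distance edits of query,
    closest first (hypothetical helper, case-insensitive)."""
    q = query.lower()
    scored = [(levenshtein(q, word.lower()), word) for word in vocabulary]
    return [word for dist, word in sorted(scored) if dist <= max_distance]


# Hypothetical transcript vocabulary containing a recognition error.
words = ["bibliotaka", "knjiga", "biblioteke"]
matches = similar_words("biblioteka", words, max_distance=1)
```

With a threshold of one edit, a query for "biblioteka" would still match a transcript word misrecognized as "bibliotaka", which is the kind of error tolerance the abstract describes.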