Frequency of lowercase and uppercase letters, bigrams, and trigrams in the Serbian language

  • Vukašin Babić, Faculty of Organizational Sciences, University of Belgrade
Keywords: Serbian language, letter frequency, bigram frequency, trigram frequency, Cyrillic script, text analysis

Abstract

This study presents an analysis of letter, bigram, and trigram frequencies in Serbian written in the Cyrillic script. Using a corpus of approximately 4 million characters drawn from literary works, newspapers, and an online encyclopedia, we calculated the frequencies of uppercase and lowercase letters, as well as of bigrams and trigrams. The findings reveal distinct patterns in Serbian, including the prevalence of certain letters and letter combinations. The results largely align with previous studies on the Serbian and Croatian languages, with some variations attributable to dialectal differences. This research provides data for applications in cryptography, natural language processing, and linguistic studies specific to the Serbian language.
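The kind of counting described in the abstract can be illustrated with a minimal Python sketch. It is not the study's actual procedure: the function name `ngram_frequencies`, the choice to count n-grams only within whitespace-separated words, the removal of non-letter characters, and the sample sentence are all assumptions made for illustration. Counting is case-sensitive, so uppercase and lowercase letters are tallied separately.

```python
from collections import Counter


def ngram_frequencies(text, n):
    """Return relative frequencies of letter n-grams in `text`.

    Assumptions (not taken from the study): n-grams do not span word
    boundaries, and any non-letter character (digits, punctuation) is
    dropped before counting.
    """
    counts = Counter()
    for word in text.split():
        # str.isalpha() is Unicode-aware, so Cyrillic letters are kept.
        letters = "".join(ch for ch in word if ch.isalpha())
        # Slide a window of width n across the cleaned word.
        for i in range(len(letters) - n + 1):
            counts[letters[i:i + n]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


# Hypothetical sample sentence ("This is an example."), not from the corpus.
sample = "Ово је пример."
letters = ngram_frequencies(sample, 1)
bigrams = ngram_frequencies(sample, 2)
trigrams = ngram_frequencies(sample, 3)

# Print the letters ordered from most to least frequent.
for gram, freq in sorted(letters.items(), key=lambda item: -item[1]):
    print(f"{gram}\t{freq:.3f}")
```

Restricting n-grams to word interiors is only one possible convention; a study could equally treat the space as a symbol, which would change the bigram and trigram tables.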

Published: 2025-02-25
Section: Information systems