DETERMINING THE IDENTITY OF THE AUTHOR BASED ON THE ANALYSIS OF THE TEXT
Marko Gogić
Fakultet organizacionih nauka, Univerzitet u Beogradu
Keywords:
forensic linguistics; stylometry; machine learning; linguistic features; authorship attribution
Abstract
Unlike "ordinary" criminals, cybercriminals do not have to worry about accidentally leaving fingerprints or traces of DNA that would reveal their identity. Instead, they can move freely through cyberspace, using social networks and online blogs to exchange illegal materials, send threatening messages, or carry out "phishing", while hiding behind digital pseudo-identities that can easily be changed. What cannot easily be changed, however, are the individual language characteristics that, just like fingerprints, remain embedded in electronic textual material. These linguistic features of an individual constitute biometric material on the basis of which forensic linguistics can determine the identity of the author of a threatening message or a ransom demand. The forensic linguistic method used for this purpose is referred to as stylometry. The abundance of available electronic textual material, coupled with the absence of classical biometric traces in cyberspace, calls for the increased use of stylometry in solving a number of practical problems. The aim of this paper is to examine the validity of existing stylometric techniques for authorship attribution based on machine learning algorithms. To this end, the paper presents a historical overview of the development of stylometry, along with an analysis and critical review of the existing work that presents the best solutions in this field.