Semantic analysis of text
Abstract
Text categorization using machine learning methods has become one of the key techniques for extracting and summarizing valuable information from text documents. This paper describes text pre-processing steps and analyzes supervised and unsupervised machine learning approaches to text categorization. Five algorithms are evaluated on five standard text categorization datasets. For the majority of the applied algorithms, precision and recall fall in the range of 70-90% on all datasets. In terms of the predefined metrics, the supervised algorithms perform better on four datasets, while the unsupervised approach performs better on one. The main advantage of the unsupervised approach over the supervised ones is also emphasized, and suggestions for further research in this area are given.
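For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how precision and recall can be computed for a single category. It is an illustration only: the function name `precision_recall`, the category names, and the label lists are hypothetical and are not taken from the evaluated datasets or from the implementation used in the paper.

```python
def precision_recall(true_labels, predicted_labels, category):
    """Precision and recall for one category, treated as the positive class."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == category and truth == category:
            tp += 1  # document correctly assigned to the category
        elif pred == category:
            fp += 1  # document wrongly assigned to the category
        elif truth == category:
            fn += 1  # document of the category that was missed
    # Guard against division by zero when the category is never predicted
    # or never occurs in the true labels.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels for five documents (illustration only).
true_labels = ["sport", "sport", "politics", "sport", "politics"]
predicted_labels = ["sport", "politics", "politics", "sport", "sport"]

p, r = precision_recall(true_labels, predicted_labels, "sport")
print(f"precision={p:.2f} recall={r:.2f}")
```

Precision is the share of documents assigned to a category that truly belong to it, while recall is the share of documents of that category that were found. Scores for a whole dataset are commonly obtained by averaging these per-category values, although the averaging scheme is an assumption here.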