A New Similarity Measure for Document Classification and Text Mining


Accurate and efficient processing of textual data and classification of electronic documents have become key factors in knowledge management and related businesses. Text mining, information retrieval, and document classification systems have a strong positive impact on digital libraries, electronic content management, e-marketing, electronic archives, customer relationship management, decision support systems, and the detection of copyright infringement and plagiarism, all of which directly affect economies, businesses, and organizations. In this study, we propose a new similarity measure that can be used with the k-nearest neighbors (k-NN) and Rocchio algorithms, two well-known algorithms for document classification, information retrieval, and other text mining tasks. We tested the proposed measure on several structured textual data sets and compared the results with standard distance metrics and similarity measures such as Cosine similarity, Euclidean distance, and the Pearson correlation coefficient. The results are promising and suggest that the proposed measure could serve as an alternative in suitable algorithms, methods, and models for text mining, document classification, and related knowledge management systems.

Keywords: text mining, document classification, similarity measures, k-NN, Rocchio algorithm
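
For orientation, the baseline measures named in the abstract can be sketched in a few lines of Python, together with a k-NN classifier that accepts any similarity function. This is only an illustrative sketch: the proposed measure itself is not reproduced here, and the function names, the toy term-frequency vectors in `TRAINING`, and the labels are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da)) * math.sqrt(sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / denom if denom else 0.0


def knn_classify(query, training, k=3, similarity=cosine_similarity):
    """Label a query vector by majority vote among its k most similar neighbors."""
    ranked = sorted(training, key=lambda item: similarity(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical term-frequency vectors over a three-word vocabulary.
TRAINING = [
    ([3, 0, 1], "sports"),
    ([2, 0, 0], "sports"),
    ([0, 3, 1], "finance"),
    ([0, 2, 2], "finance"),
]

predicted = knn_classify([1, 0, 0], TRAINING, k=3)
```

Because `knn_classify` ranks neighbors by descending similarity, a distance such as `euclidean_distance` would need to be negated before being passed in, for example `similarity=lambda a, b: -euclidean_distance(a, b)`. Any alternative similarity measure with the same two-vector signature could be substituted in the same way.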
