Abstract: With the rapid growth of digital text documents on the internet and in digital repositories, the pool of digital documents grows larger every day. Handling this enormous amount of data requires efficient and effective techniques, and mining the documents requires understanding them properly. Text similarity measurement plays a major role in finding coherence among documents. The goal of similarity computation is to identify cohesion among text documents and to prepare the text for applications such as document organization, plagiarism detection, and query matching. It is one of the most fundamental tasks in information retrieval, information extraction, document organization, plagiarism detection, and text mining, and the effectiveness of document clustering depends heavily on it. In this paper, four similarity measures are implemented and their descriptive statistics are compared. The results are found to be satisfactory. Graphs are drawn to visualize the results.
Keywords: similarity, cosine similarity, Jaccard similarity, commonality, Pearson, Spearman's correlation.
(ProQuest: ... denotes formulae omitted.)
I. INTRODUCTION
Technology has made us more productive and is transforming our world. It has changed how we communicate and how we learn. Computing systems are now equipped with Artificial Intelligence and are able to learn, reason, hear, and see. Artificial Intelligence has created an enormous number of new opportunities and has given rise to two promising technologies: Natural Language Processing and Text Mining. These technologies enable users to transform the key content of text documents into quantitative insight or to draw conclusions from it. Text analytics, also known as text mining, is the process of generating new knowledge or information: it examines a collection of existing written resources and transforms the unstructured text into structured data for use in further analysis. A text-mining-based search identifies related facts, relationships, similarities, assertions, etc. that would otherwise be difficult to find and would remain buried in a mass of free text or unstructured data. Most of the information available in the form of text is uncertain, ambiguous, or vague. Identifying plagiarism, organizing documents, categorizing a product's customers into different categories, identifying customers who...