Introduction to information retrieval by christopher manning pdf

This article has multiple issues. Document clustering involves the use of introduction to information retrieval by christopher manning pdf and descriptor extraction.

Descriptors are sets of words that describe the contents within the cluster. Document clustering is generally considered to be a centralized process. Examples of document clustering include web document clustering for search users. The application of document clustering can be categorized to two types, online and offline. Online applications are usually constrained by efficiency problems when compared to offline applications.

In general, there are two common algorithms. The first one is the hierarchical based algorithm, which includes single link, complete linkage, group average and Ward’s method. By aggregating or dividing, documents can be clustered into hierarchical structure, which is suitable for browsing. However, such an algorithm usually suffers from efficiency problems. These algorithms can further be classified as hard or soft clustering algorithms. In a soft assignment, a document has fractional membership in several clusters. Other algorithms involve graph based clustering, ontology supported clustering and order sensitive clustering.

Given a clustering, it can be beneficial to automatically derive human-readable labels for the clusters. Clustering methods can be used to automatically group the retrieved documents into a list of meaningful categories, as is achieved by e. And we can avoid calculating similar information repeatedly by reducing all tokens to its base form using various stemming and lemmatization dictionaries. Some tokens are less important than others. For instance, common words such as “the” might not be very helpful for revealing the essential characteristics of a text. So usually it is a good idea to eliminate stop words and punctuation marks before doing further analysis.

After pre-processing the text data, we can then proceed to generate features. For document clustering, one of the most common ways to generate features for a document is to calculate the term frequencies of all its tokens. Although not perfect, these frequencies can usually provide some clues about the topic of the document. And sometimes it is also useful to weight the term frequencies by the inverse document frequencies. We can then cluster different documents based on the features we have generated. Finally, the clustering models can be assessed by various metrics.

Manning, Prabhakar Raghavan, and Hinrich Schütze. Claudio Carpineto, Stanislaw Osiński, Giovanni Romano, Dawid Weiss. A survey of Web clustering engines. Wui Lee Chang, Kai Meng Tay, and Chee Peng Lim, A New Evolving Tree-Based Model with Local Re-learning for Document Clustering and Visualization, Neural Processing Letters, DOI: 10. This page was last edited on 27 February 2017, at 15:46. This article has multiple issues.

Document clustering involves the use of descriptors and descriptor extraction. Descriptors are sets of words that describe the contents within the cluster. Document clustering is generally considered to be a centralized process. Examples of document clustering include web document clustering for search users. The application of document clustering can be categorized to two types, online and offline. Online applications are usually constrained by efficiency problems when compared to offline applications.