A Project Report
On
TEXT CLUSTERING
By:
Ritik Lingwal (20521002)
Under the guidance of
Mr. Brijesh Dhyani
Department of Management Studies
Graphic Era Deemed to be University
Dehradun, Uttarakhand
(Batch: 2020-2022)
PROJECT DEFINITION
 The user presents the system with a collection of documents (INPUT).
 The system processes these documents and groups them into a fixed number of clusters
(say k).
 A cluster is a group of documents whose members are more similar to each other than to
the members of any other group.
 That is, the system groups documents with similar themes into separate clusters.
 The user then selects one or more of these clusters that seem relevant to him.
 The selected clusters are gathered to form a new collection, and the above process repeats.
 With each successive iteration, the clusters become smaller and more detailed, and thus
more specific to the user's field of interest.
 Finally, each cluster contains a single document, which the user can browse through (OUTPUT).
 Thus, the user can retrieve the relevant document without browsing through the
whole collection of documents.
PRACTICAL APPLICATIONS
To ease the process of browsing to find similar or related information,
e.g. to get an overview of documents returned in an opinion poll, or to ease the retrieval of
information from the list of documents returned for a web-based query.
To find outstanding documents within a collection,
e.g. among documents from a news service, unique documents could give hints about
new trends or technologies that have so far not been mentioned in other articles.
To find keywords,
e.g. it can be applied to a collection of research papers to find the representative keywords.
To provide an overview of the contents of a large document collection,
e.g. huge clusters in a consumer feedback collection would indicate where your products or
services need improvement.
METHODOLOGY
Before describing our main methodology, we introduce some definitions that are critical
to understanding the algorithms that follow.
 We represent each document by a vector in which each dimension corresponds to a unique
word in the collection. Thus, the number of dimensions of each vector equals the
number of unique words in the whole collection.
 The magnitude of each component is the frequency of that word in the corresponding
document.
 It has been shown, using probability theory, that applying a monotone damping function,
i.e. using the square root of each frequency instead of the raw frequency, gives better results.
 Such a vector for document d, in normalized form, is denoted p(d).
 Such a vector can also be associated with a cluster C:
p(C) = sum of p(di) over all constituent documents di of C.
 The similarity measure between two documents a and b is defined as the inner product
s(a,b) = <p(a), p(b)> (a code sketch of these definitions follows).
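A minimal Java sketch of these definitions, with documents given as word-frequency maps.
The class and method names here are illustrative, not taken from the actual ITC source:

    import java.util.HashMap;
    import java.util.Map;

    class DocVector {
        // p(d): damped components sqrt(freq), then scaled to unit length.
        static Map<String, Double> profile(Map<String, Integer> freqs) {
            Map<String, Double> p = new HashMap<>();
            double norm = 0.0;
            for (Map.Entry<String, Integer> e : freqs.entrySet()) {
                double damped = Math.sqrt(e.getValue());   // monotone damping
                p.put(e.getKey(), damped);
                norm += damped * damped;
            }
            final double len = Math.sqrt(norm);
            if (len > 0)
                for (Map.Entry<String, Double> e : p.entrySet())
                    e.setValue(e.getValue() / len);
            return p;
        }

        // s(a,b) = <p(a), p(b)>: inner product over the words the two vectors share.
        static double similarity(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0.0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                Double other = b.get(e.getKey());
                if (other != null) dot += e.getValue() * other;
            }
            return dot;
        }
    }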
Our methodology comprises the following steps:
FLOWCHART
STEP I: Initially, all the documents are scanned and the vector associated with each
document (as described above) is obtained. While doing this we perform the following steps
(a code sketch follows this list):
 Remove all the 'stopwords', or junk words, which do not contribute to the theme of
the document. For this we have developed a list of about 600 words comprising
various pronouns, conjunctions, prepositions, auxiliary verbs, etc.
 Keep only a single form of each word. If various forms of a word exist in the collection,
they are replaced throughout by a root word, with their frequencies added to the root
word. Thus, if the past tense, plural or gerund form, etc. of a word is present, its
occurrence is replaced by the root word.
 Apply the database derived from the thesaurus. We have developed a database of
synonyms, organized for fast access, using a thesaurus. Using this, we replace the
words of each class of synonyms by a single representative word. This further helps in
improving the quality of the clusters obtained later.
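A sketch of this scanning pass, assuming the stoplist, root-word and synonym databases have
already been loaded into in-memory maps (field and method names are illustrative):

    import java.util.*;

    class Preprocessor {
        final Set<String> stoplist;          // ~600 stopwords from the stoplist database
        final Map<String, String> roots;     // inflected form -> root word
        final Map<String, String> synonyms;  // synonym -> representative word (thesaurus)

        Preprocessor(Set<String> stoplist, Map<String, String> roots,
                     Map<String, String> synonyms) {
            this.stoplist = stoplist; this.roots = roots; this.synonyms = synonyms;
        }

        // One scanning pass: word -> frequency after stopword removal,
        // root-word conflation and synonym merging.
        Map<String, Integer> scan(List<String> words) {
            Map<String, Integer> freqs = new HashMap<>();
            for (String w : words) {
                w = w.toLowerCase();
                if (stoplist.contains(w)) continue;   // drop junk words
                w = roots.getOrDefault(w, w);         // keep a single form per word
                w = synonyms.getOrDefault(w, w);      // one word per synonym class
                freqs.merge(w, 1, Integer::sum);
            }
            return freqs;
        }
    }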
STEP II: We denote the number of clusters to be formed as k. The first task consists of
finding k centers of the entire collection. Centers are documents that are representative of the
various themes present in the collection.
Before we describe the algorithms for finding the centers, we describe an approximate, time-
consuming clustering algorithm that is used as a basic unit by both of the algorithms we have
used for finding the k centers.
CLUSTER SUBROUTINE
 Define s(C) as the average similarity between the constituent documents of cluster C. This
is a measure of the cohesiveness, or quality, of the cluster.
 At each stage, find the two clusters whose merged cluster has the maximum average
similarity among all possible pairs at that stage.
 Merge these two clusters, reducing the number of clusters by one.
 Hence, starting initially with n documents, this algorithm continues until k clusters
remain.
 Time complexity of this algorithm: O(n^2). (A code sketch follows.)
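A minimal, unoptimized Java sketch of the subroutine, reusing DocVector.similarity from the
earlier sketch. As written it recomputes pair similarities from scratch on every pass; the O(n^2)
bound quoted above assumes similarities are cached and updated incrementally on each merge:

    import java.util.*;

    class Agglomerate {
        // Start with n singleton clusters; repeatedly merge the pair whose union
        // has the highest average pairwise similarity s(C), until k remain.
        static List<List<Map<String, Double>>> cluster(
                List<Map<String, Double>> docs, int k) {
            List<List<Map<String, Double>>> clusters = new ArrayList<>();
            for (Map<String, Double> d : docs)
                clusters.add(new ArrayList<>(List.of(d)));
            while (clusters.size() > k) {
                int bi = 0, bj = 1;
                double best = -1.0;
                for (int i = 0; i < clusters.size(); i++)
                    for (int j = i + 1; j < clusters.size(); j++) {
                        List<Map<String, Double>> merged = new ArrayList<>(clusters.get(i));
                        merged.addAll(clusters.get(j));
                        double s = avgSim(merged);
                        if (s > best) { best = s; bi = i; bj = j; }
                    }
                clusters.get(bi).addAll(clusters.remove(bj));  // bj > bi, so bi stays valid
            }
            return clusters;
        }

        // s(C): mean similarity over all document pairs in the cluster.
        static double avgSim(List<Map<String, Double>> c) {
            if (c.size() < 2) return 1.0;
            double sum = 0.0; int pairs = 0;
            for (int i = 0; i < c.size(); i++)
                for (int j = i + 1; j < c.size(); j++) {
                    sum += DocVector.similarity(c.get(i), c.get(j));
                    pairs++;
                }
            return sum / pairs;
        }
    }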
We now describe the two algorithms we use for finding the k centers:
ALGORITHM 1
 Apply the Cluster Subroutine to sqrt(k*n) randomly selected documents from the whole
collection.
 From the k clusters returned by the subroutine, find the k centers.
 For each document belonging to a particular cluster, find its cosine similarity with the
cluster's profile; the document for which this similarity is maximum is the center
of the cluster.
 In this way the k centers are obtained (a code sketch follows this list):
o This algorithm is not deterministic, since it employs random sampling.
o It is faster, and suitable for online reclustering.
o Asymptotically, O(k*n).
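A sketch of Algorithm 1 built on the classes above (the Centers class name is illustrative):

    import java.util.*;

    class Centers {
        // Algorithm 1: cluster a random sample of sqrt(k*n) documents, then
        // take the most central document of each resulting cluster.
        static List<Map<String, Double>> findCenters(
                List<Map<String, Double>> docs, int k, Random rnd) {
            List<Map<String, Double>> sample = new ArrayList<>(docs);
            Collections.shuffle(sample, rnd);                     // random sampling
            int s = (int) Math.sqrt((double) k * docs.size());
            List<List<Map<String, Double>>> clusters =
                Agglomerate.cluster(sample.subList(0, Math.min(s, sample.size())), k);
            List<Map<String, Double>> centers = new ArrayList<>();
            for (List<Map<String, Double>> c : clusters)
                centers.add(centerOf(c));
            return centers;
        }

        // Center = the constituent document most similar to the cluster profile
        // p(C) = sum of p(di) over the cluster's documents.
        static Map<String, Double> centerOf(List<Map<String, Double>> cluster) {
            Map<String, Double> pC = new HashMap<>();
            for (Map<String, Double> d : cluster)
                for (Map.Entry<String, Double> e : d.entrySet())
                    pC.merge(e.getKey(), e.getValue(), Double::sum);
            Map<String, Double> best = cluster.get(0);
            double bestSim = Double.NEGATIVE_INFINITY;
            for (Map<String, Double> d : cluster) {
                double sim = DocVector.similarity(d, pC);
                if (sim > bestSim) { bestSim = sim; best = d; }
            }
            return best;
        }
    }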
ALGORITHM 2
Due to the shortcomings of the previous algorithm, which is the one suggested in most of the
papers, we tried another approach, described below (a code sketch follows this list).
 Divide the collection of documents into n/m groups of a fixed size m (where m is
chosen by experiment).
 Apply the Cluster Subroutine to each group, reducing it to r*m
clusters (where r is some fraction).
 Repeat the process on these clusters as the new collection until k clusters remain.
 The k centers are then found as described in Algorithm 1.
 Characteristics:
o More accurate.
o Asymptotically O(n*m) = O(k*n) (if m is a multiple of k), though in practice it takes
more time than Algorithm 1.
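A sketch of Algorithm 2 under the same assumptions. To keep it short, each cluster is carried
forward only as its summed profile p(C); the full system would also track which documents
each profile covers:

    import java.util.*;

    class Fractionation {
        // Algorithm 2: split into groups of size m, reduce each group to r*m
        // clusters, carry the cluster profiles forward, repeat until k remain.
        static List<Map<String, Double>> fractionate(
                List<Map<String, Double>> docs, int k, int m, double r) {
            List<Map<String, Double>> items = new ArrayList<>(docs);
            while (items.size() > k) {
                List<Map<String, Double>> next = new ArrayList<>();
                for (int start = 0; start < items.size(); start += m) {
                    List<Map<String, Double>> group = new ArrayList<>(
                        items.subList(start, Math.min(start + m, items.size())));
                    int target = Math.max(1, (int) (r * group.size()));
                    for (List<Map<String, Double>> c : Agglomerate.cluster(group, target))
                        next.add(profile(c));          // carry forward p(C)
                }
                if (next.size() >= items.size()) break; // no progress; stop
                items = next;
            }
            return items;  // ~k cluster profiles; centers then follow as in Algorithm 1
        }

        // p(C): component-wise sum of the constituent profiles.
        static Map<String, Double> profile(List<Map<String, Double>> cluster) {
            Map<String, Double> pC = new HashMap<>();
            for (Map<String, Double> d : cluster)
                for (Map.Entry<String, Double> e : d.entrySet())
                    pC.merge(e.getKey(), e.getValue(), Double::sum);
            return pC;
        }
    }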
After the k centers have been found, the remaining collection is divided randomly into two
subcollections. The ASSIGN-TO-NEAREST algorithm is applied to the first subcollection,
and BAYESIAN MACHINE LEARNING is applied to the second subcollection, using the
clusters obtained from the first subcollection as the training data.
We now describe the ASSIGN-TO-NEAREST algorithm:
ASSIGN-TO-NEAREST
 For each document, find its similarity with each center and assign the document to the
center with which its similarity is maximum.
 Hence, each document is assigned to one of the k centers and we finally have k clusters.
FOR MORE REFINEMENT
 Find the centers of these new clusters and repeat the assignment step with the new centers.
 This can be carried out indefinitely, but we observed that the most significant improvement
is obtained within 2 to 3 iterations. (A code sketch follows.)
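A sketch of ASSIGN-TO-NEAREST, again reusing the earlier classes. For the refinement step,
one recomputes each labelled cluster's center (as in Centers.centerOf above) and calls
assignToNearest again; as noted above, 2 to 3 such rounds suffice in practice:

    import java.util.*;

    class Assign {
        // Label each document with the index of its most similar center.
        static int[] assignToNearest(List<Map<String, Double>> docs,
                                     List<Map<String, Double>> centers) {
            int[] label = new int[docs.size()];
            for (int i = 0; i < docs.size(); i++) {
                double best = Double.NEGATIVE_INFINITY;
                for (int j = 0; j < centers.size(); j++) {
                    double s = DocVector.similarity(docs.get(i), centers.get(j));
                    if (s > best) { best = s; label[i] = j; }
                }
            }
            return label;
        }
    }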
BAYESIAN LEARNING
STEP III: For the remaining second subcollection, Bayesian ML is applied, using the just-
obtained clusters as the training data.
 Each document is represented by a set of attributes {a1, a2, ..., an}. Here, a1 = the first
word in the document, a2 = the second word in the document, and so on. We have a set of
target values (the k centers in our case) V = {v1, ..., vk}. Each document is assigned to one
of the k target centers.
 This approach uses the MAP hypothesis, in which each document is assigned to the
center vj for which the expression P(vj|a1,a2,...,an) is maximised.
 By Bayes' theorem, this can be rewritten as the maximisation of the
expression P(a1,a2,...,an|vj)*P(vj).
ASSUMPTIONS
The evaluation of this expression calls for a very large amount of training data; hence, we make
two reasonable assumptions:
 Attribute values are conditionally independent, i.e. the probability of occurrence of a given
set of words is independent of the order of the words.
 The probability that a word wk appears at position i is the same as the probability of its
occurrence at position j, for all i, j, k.
The terms P(vj), P(wk|vj), etc. in the final expression are calculated from the training data (the
data obtained from the documents already assigned by the cosine-similarity approach). A code
sketch of the resulting classifier follows.
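Under the two assumptions above, the classifier reduces to a multinomial naive Bayes rule.
A minimal sketch, assuming the word counts per center have already been accumulated from
the training subcollection; the Laplace (+1) smoothing is a standard guard against zero
probabilities that is not spelled out above, and logs are used to avoid underflow:

    import java.util.*;

    class BayesClassifier {
        // For each center vj we assume precomputed:
        //   wordCount.get(j) : word -> count over training docs assigned to vj
        //   totalWords[j]    : total word occurrences in those docs
        //   docCount[j]      : number of training docs assigned to vj
        static int classify(List<String> words, List<Map<String, Integer>> wordCount,
                            long[] totalWords, int[] docCount, int vocabSize, int nDocs) {
            int best = 0;
            double bestLog = Double.NEGATIVE_INFINITY;
            for (int j = 0; j < wordCount.size(); j++) {
                double logP = Math.log((double) docCount[j] / nDocs);  // log P(vj)
                for (String w : words)                                  // + sum log P(ai|vj)
                    logP += Math.log((wordCount.get(j).getOrDefault(w, 0) + 1.0)
                                     / (totalWords[j] + vocabSize));
                if (logP > bestLog) { bestLog = logP; best = j; }
            }
            return best;  // index of the MAP center
        }
    }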
PURPOSE
 ACCURATE AND FASTER: the reason for using Bayesian ML at all. In fact, for the text
classification problem, its performance has been shown to be comparable to ANNs, but at a
much reduced computational cost.
 DOMAIN INDEPENDENT: the reason for using the cosine-similarity approach at all.
Most ML processes have to be trained off-line on some training data, which renders the
system domain-dependent. We did not want to reduce the scope of our project by making
it domain-dependent. Hence we have maintained its DOMAIN INDEPENDENCE by
obtaining the training data from the given collection itself.
TOOLS USED FOR SPEED ENHANCEMENT AND RELIABILITY
 HASH TABLE
Data that is accessed frequently is stored in hash tables, as INSERT and FIND operations
are O(1) in average time complexity.
 MULTITHREADED PROGRAMMING
Provides performance comparable to parallel computing for operations that can be
done simultaneously.
 THESAURUS DATABASE
Improves the quality of clustering significantly.
 STOPLIST DATABASE
Used for eliminating stopwords (e.g. "the", "is", etc.).
 JAVA LIBRARY CLASSES
o Dynamically sized arrays ('Vector')
o Parsing and extraction of words ('StringTokenizer')
ADDED UTILITIES
QUERY BASED BROWSING
The documents containing the query and its synonyms are filtered out of the current
collection to form a new collection, and Scatter/Gather is then applied to this
collection.
REFINE YOUR SEARCH
After any scatter, the user can opt to refine the clustering, i.e. improve the quality of the
clusters. (This calls for added computation time, but it is at the discretion of the user to opt
for this.)
COPYING THE DOCUMENTS OF A CLUSTER TO A NEW DIRECTORY
For some applications, the user might be interested in all the documents of a cluster rather than
one specific document, so the user can avail of this utility.
RESULTS
We applied our INTELLIGENT TEXT CLUSTERER (ITC) to a collection of
documents (approx. 100) downloaded from Usenet news articles. These documents spanned
various themes or fields such as astronomy, medicine, software, religion and politics. The table
below shows the result of the first stage of clustering.
CLUSTER#1 CLUSTER#2 CLUSTER#3 CLUSTER#4 CLUSTER#5 CLUSTER#6 CLUSTER#7 CLUSTER#8
Size: 5 Size: 2 Size: 11 Size: 7 Size: 26 Size: 23 Size: 15 Size: 14
list list people object christian space scsi gun
find sort human model people year drive police
quote work rights vision god nasa ide weapon
josh people kill shape moral engineer pc handgun
fundies creation give proceedings human launch add crime
sort bible rights image wrong project hard time
time creation children ai matter time controlled arm
mcdowell quality problem dimensional point include problem fire
tyre guy life spatial word orbit comparable carry
inerrancy mice govt search reason program people semi
Suppose the user is interested in the field of astronomy, so he chooses cluster 6 for navigation.
As is evident from the table below, the search has narrowed down drastically, from 100
documents to the 23 relevant documents of that cluster. The clusters become smaller and
more detailed with each stage.
Size: 1 Size: 1 Size: 1 Size: 1 Size: 1 Size: 8 Size: 3 Size: 7
task rocket burst universe speculative mission convention space
robot feet star larson form space karl engineer
team motor plan book ablogenesis data hess dc
numerical allowed source star calling time libertarian year
performance launch disk physical probability nasa instt shuttle
motion organisation angle motion mind jpl business system
improve flight direction theory theist launch sept orbit
method depend sample physicists physicists project committe prog
juggle faa distribution matter year part price reusable
level shuttle black astronomers telling spacecraft dr launch
Now, as one can see, some clusters contain only a single document, while Cluster #6
and Cluster #8 contain many documents. The single-document clusters are
'outstanding' documents, in the sense that these documents cannot be grouped with the clusters
containing many documents.
For example, consider Cluster #7. It does not contain any word related to astronomy; scanning
through its topical words, it seems to be related to some business conventions or committees.
Now suppose the user is interested in looking for 'projects' or 'missions' regarding a 'launch'
into 'orbit', so he chooses clusters 2, 6 and 8 for further navigation. The result:
Size: 1 Size: 1 Size: 1 Size: 1 Size: 1 Size: 1 Size: 1 Size: 1
rocket mission nasa space space space jpl space
feet cost cost engg book age space dc
motor hst fee long history america data launch
launch extra overhead tech spaceship matter project shuttle
organisation mass dennis interest part station time vehicle
flight pat center life draw adm. launch orbit
depend minus reston literature higgins budget spacecraft year
rocketry service problem term science program nasa time
faatime hardware work detail illustration plan mission schedule
trajectory oms include general include resource research path
Thus we see that the search has narrowed down drastically and only single documents are
present in the clusters. The topical words of each cluster are now the words occurring in that
document only. Thus the user is able to retrieve the relevant document without browsing
through the entire corpus of documents.
At any stage, if the user is not satisfied with the quality of the clusters, he can choose the
'Refine' option, which improves the quality of the clusters. The user also has the option of
switching over to query-based browsing, on either the present or the original collection. If the
user feels he has made a wrong choice of clusters, he can also revert to the clusters of the
previous stage.
A proper GUI has been provided using Java Frames to make browsing easy, clear and
user-friendly.
Care has also been taken that the user cannot break the program with invalid input; suitable
error messages are printed and handled appropriately.
MAJOR WORKS
D.R. Cutting, Xerox PARC
 The cosine similarity measure was first suggested by him. The random-sampling
algorithm for finding centers is also due to him.
Mehran Sahami, Stanford University
 First to suggest that Bayesian ML could be applied to the text clustering problem given a
set of training data. (His approach is domain-dependent.)
REFERENCES
BOOKS:
Machine Learning by Tom M. Mitchell, The McGraw-Hill Companies, Inc., March 1997.
PAPERS:
Koller, D. and Sahami, M. 1997. Hierarchically Classifying Documents Using Very Few
Words. Proceedings of the Fourteenth International Conference on Machine Learning.
Sahami, M., Yusufali, S. and Baldonado, M.Q.W. 1998. SONIA: A Service for Organizing
Networked Information Autonomously. Proceedings of the Third ACM Conference on Digital
Libraries.
Sahami, M. 1996. Learning Limited Dependence Bayesian Classifiers. In KDD-96: Proceedings
of the Second International Conference on Knowledge Discovery and Data Mining, pp. 335-338,
Menlo Park, CA: AAAI Press.
Goldszmidt, M. and Sahami, M. 1998. A Probabilistic Approach to Full-Text Document
Clustering. Technical Report ITAD-433-MS-98-044, SRI International.
Cutting, D., Karger, D. and Pedersen, J. Constant Interaction-Time Scatter/Gather Browsing of
Very Large Document Collections. Proceedings of the Sixteenth Annual International ACM
SIGIR Conference, Pittsburgh, 1993.
Pirolli, P., Schank, P., Hearst, M.A. and Diehl, C. Scatter/Gather Browsing Communicates the
Topic Structure of a Very Large Text Collection. Proceedings of the ACM SIGCHI Conference
on Human Factors in Computing Systems (CHI), May 1996.
Hearst, M.A., Karger, D.R. and Pedersen, J.O. Scatter/Gather as a Tool for the Navigation of
Retrieval Results. Proceedings of the 1995 AAAI Fall Symposium on Knowledge Navigation.
Hearst, M. and Pedersen, J. Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval
Results. Proceedings of the 19th Annual International ACM SIGIR Conference, Zurich,
August 1996.
FUTURE SCOPE
The future scope of improvement in this project lies in the following fields:
LEXICAL AFFINITIES
In some domains, certain words tend to appear together. For example, in the domain of AI, the
words 'artificial' and 'intelligence' tend to appear together, as do the words 'machine' and
'learning'. This fact can be exploited by introducing crossover terms in the expression for
cosine similarity and by modifying the assumptions of the Bayesian ML.
USING SEMANTICS
Parsing can be incorporated into the program to provide semantic information, which can
be used to obtain a better estimate of the similarities between documents.