A Project Report
On
TEXT CLUSTERING
By:
Ritik Lingwal (20521002)
Under the guidance of
Mr. Brijesh Dhyani
Department of Management Studies
Graphic Era Deemed to be University
Dehradun, Uttarakhand
(Batch: 2020-2022)
PROJECT DEFINITION
 The user presents the system with a collection of documents (INPUT).
 The system processes these documents and groups them into a fixed number of clusters
(say k).
 A cluster is a group of documents whose members are more similar to each other than to
the members of any other group.
 That is, the system groups documents with similar themes into separate clusters.
 The user then selects one or more of these clusters that seem relevant to him.
 The selected clusters are gathered to form a new collection, and the above process repeats.
 With each successive iteration, the clusters become smaller and more detailed, and thus
more specific to the user's field of interest.
 Finally, each cluster contains a single document, which the user can browse through (OUTPUT).
 Thus, the user can retrieve the relevant document without browsing through the
whole collection of documents.
PRACTICAL APPLICATIONS
To ease the process of browsing to find similar or related information,
e.g. to get an overview of documents returned in an opinion poll, or to ease the retrieval of
information from the list of documents returned for a web-based query.
To find outstanding documents within a collection,
e.g. among documents from a news service, unique documents could give hints about
new trends or technologies that have so far not been mentioned in other articles.
To find keywords,
e.g. it can be applied to a collection of research papers to find the representative keywords.
To provide an overview of the contents of a large document collection,
e.g. huge clusters in a consumer feedback collection would indicate where your products or
services need improvement.
METHODOLOGY
Before describing our main methodology, we introduce some definitions that are critical
to understanding the algorithms that follow.
 We represent each document by a vector in which each dimension corresponds to a unique
word in the collection. Thus, the number of dimensions of each vector equals the
number of unique words in the whole collection.
 The magnitude of each component is the frequency of that word in the corresponding
document.
 It has been shown, using probability theory, that applying a monotone damping function,
i.e. using the square root of each frequency instead of the raw frequency, gives better results.
 Such a vector for document d, in normalized form, is denoted p(d).
 Such a vector can also be associated with a cluster C:
p(C) = sum of p(di) over all constituent documents di of C.
 The similarity measure between two documents a and b is defined as the inner product
s(a,b) = <p(a), p(b)> (a code sketch of these definitions follows).
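A minimal Java sketch of these definitions, with documents given as word-frequency maps.
The class and method names here are illustrative, not taken from the actual ITC source:

    import java.util.HashMap;
    import java.util.Map;

    class DocVector {
        // p(d): damped components sqrt(freq), then scaled to unit length.
        static Map<String, Double> profile(Map<String, Integer> freqs) {
            Map<String, Double> p = new HashMap<>();
            double norm = 0.0;
            for (Map.Entry<String, Integer> e : freqs.entrySet()) {
                double damped = Math.sqrt(e.getValue());   // monotone damping
                p.put(e.getKey(), damped);
                norm += damped * damped;
            }
            final double len = Math.sqrt(norm);
            if (len > 0)
                for (Map.Entry<String, Double> e : p.entrySet())
                    e.setValue(e.getValue() / len);
            return p;
        }

        // s(a,b) = <p(a), p(b)>: inner product over the words the two vectors share.
        static double similarity(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0.0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                Double other = b.get(e.getKey());
                if (other != null) dot += e.getValue() * other;
            }
            return dot;
        }
    }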
Our methodology comprises the following steps:
FLOWCHART
STEP I: Initially, all the documents are scanned and the vector associated with each
document (as described above) is obtained. While doing this we perform the following steps
(a code sketch follows this list):
 Remove all the 'stopwords', or junk words, which do not contribute to the theme of
the document. For this we have developed a list of about 600 words comprising
various pronouns, conjunctions, prepositions, auxiliary verbs, etc.
 Keep only a single form of each word. If various forms of a word exist in the collection,
they are replaced throughout by a root word, with their frequencies added to the root
word. Thus, if the past tense, plural or gerund form, etc. of a word is present, its
occurrence is replaced by the root word.
 Apply the database derived from the thesaurus. We have developed a database of
synonyms, organized for fast access, using a thesaurus. Using this, we replace the
words of each class of synonyms by a single representative word. This further helps in
improving the quality of the clusters obtained later.
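A sketch of this scanning pass, assuming the stoplist, root-word and synonym databases have
already been loaded into in-memory maps (field and method names are illustrative):

    import java.util.*;

    class Preprocessor {
        final Set<String> stoplist;          // ~600 stopwords from the stoplist database
        final Map<String, String> roots;     // inflected form -> root word
        final Map<String, String> synonyms;  // synonym -> representative word (thesaurus)

        Preprocessor(Set<String> stoplist, Map<String, String> roots,
                     Map<String, String> synonyms) {
            this.stoplist = stoplist; this.roots = roots; this.synonyms = synonyms;
        }

        // One scanning pass: word -> frequency after stopword removal,
        // root-word conflation and synonym merging.
        Map<String, Integer> scan(List<String> words) {
            Map<String, Integer> freqs = new HashMap<>();
            for (String w : words) {
                w = w.toLowerCase();
                if (stoplist.contains(w)) continue;   // drop junk words
                w = roots.getOrDefault(w, w);         // keep a single form per word
                w = synonyms.getOrDefault(w, w);      // one word per synonym class
                freqs.merge(w, 1, Integer::sum);
            }
            return freqs;
        }
    }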
STEP II: We denote the number of clusters to be formed as k. The first task consists of
finding k centers of the entire collection. Centers are documents that are representative of the
various themes present in the collection.
Before we describe the algorithms for finding the centers, we describe an approximate, time-
consuming clustering algorithm that is used as a basic unit by both of the algorithms we have
used for finding the k centers.
CLUSTER SUBROUTINE
 Define s(C) as the average similarity between the constituent documents of cluster C. This
is a measure of the cohesiveness, or quality, of the cluster.
 At each stage, find the two clusters whose merged cluster has the maximum average
similarity among all possible pairs at that stage.
 Merge these two clusters, reducing the number of clusters by one.
 Hence, starting initially with n documents, this algorithm continues until k clusters
remain.
 Time complexity of this algorithm: O(n^2). (A code sketch follows.)
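A minimal, unoptimized Java sketch of the subroutine, reusing DocVector.similarity from the
earlier sketch. As written it recomputes pair similarities from scratch on every pass; the O(n^2)
bound quoted above assumes similarities are cached and updated incrementally on each merge:

    import java.util.*;

    class Agglomerate {
        // Start with n singleton clusters; repeatedly merge the pair whose union
        // has the highest average pairwise similarity s(C), until k remain.
        static List<List<Map<String, Double>>> cluster(
                List<Map<String, Double>> docs, int k) {
            List<List<Map<String, Double>>> clusters = new ArrayList<>();
            for (Map<String, Double> d : docs)
                clusters.add(new ArrayList<>(List.of(d)));
            while (clusters.size() > k) {
                int bi = 0, bj = 1;
                double best = -1.0;
                for (int i = 0; i < clusters.size(); i++)
                    for (int j = i + 1; j < clusters.size(); j++) {
                        List<Map<String, Double>> merged = new ArrayList<>(clusters.get(i));
                        merged.addAll(clusters.get(j));
                        double s = avgSim(merged);
                        if (s > best) { best = s; bi = i; bj = j; }
                    }
                clusters.get(bi).addAll(clusters.remove(bj));  // bj > bi, so bi stays valid
            }
            return clusters;
        }

        // s(C): mean similarity over all document pairs in the cluster.
        static double avgSim(List<Map<String, Double>> c) {
            if (c.size() < 2) return 1.0;
            double sum = 0.0; int pairs = 0;
            for (int i = 0; i < c.size(); i++)
                for (int j = i + 1; j < c.size(); j++) {
                    sum += DocVector.similarity(c.get(i), c.get(j));
                    pairs++;
                }
            return sum / pairs;
        }
    }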
We now describe the two algorithms we use for finding the k centers:
ALGORITHM 1
 Apply the Cluster Subroutine to sqrt(k*n) randomly selected documents from the whole
collection.
 From the k clusters returned by the subroutine, find the k centers.
 For each document belonging to a particular cluster, find its cosine similarity with the
cluster's profile; the document for which this similarity is maximum is the center
of the cluster.
 In this way the k centers are obtained (a code sketch follows this list):
o This algorithm is not deterministic, since it employs random sampling.
o It is faster, and suitable for online reclustering.
o Asymptotically, O(k*n).
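A sketch of Algorithm 1 built on the classes above (the Centers class name is illustrative):

    import java.util.*;

    class Centers {
        // Algorithm 1: cluster a random sample of sqrt(k*n) documents, then
        // take the most central document of each resulting cluster.
        static List<Map<String, Double>> findCenters(
                List<Map<String, Double>> docs, int k, Random rnd) {
            List<Map<String, Double>> sample = new ArrayList<>(docs);
            Collections.shuffle(sample, rnd);                     // random sampling
            int s = (int) Math.sqrt((double) k * docs.size());
            List<List<Map<String, Double>>> clusters =
                Agglomerate.cluster(sample.subList(0, Math.min(s, sample.size())), k);
            List<Map<String, Double>> centers = new ArrayList<>();
            for (List<Map<String, Double>> c : clusters)
                centers.add(centerOf(c));
            return centers;
        }

        // Center = the constituent document most similar to the cluster profile
        // p(C) = sum of p(di) over the cluster's documents.
        static Map<String, Double> centerOf(List<Map<String, Double>> cluster) {
            Map<String, Double> pC = new HashMap<>();
            for (Map<String, Double> d : cluster)
                for (Map.Entry<String, Double> e : d.entrySet())
                    pC.merge(e.getKey(), e.getValue(), Double::sum);
            Map<String, Double> best = cluster.get(0);
            double bestSim = Double.NEGATIVE_INFINITY;
            for (Map<String, Double> d : cluster) {
                double sim = DocVector.similarity(d, pC);
                if (sim > bestSim) { bestSim = sim; best = d; }
            }
            return best;
        }
    }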
ALGORITHM 2
Due to the shortcomings of the previous algorithm, which is the one suggested in most of the
papers, we tried another approach, described below (a code sketch follows this list).
 Divide the collection of documents into n/m groups of a fixed size m (where m is
chosen by experiment).
 Apply the Cluster Subroutine to each group, reducing it to r*m
clusters (where r is some fraction).
 Repeat the process on these clusters as the new collection until k clusters remain.
 The k centers are then found as described in Algorithm 1.
 Characteristics:
o More accurate.
o Asymptotically O(n*m) = O(k*n) (if m is a multiple of k), though in practice it takes
more time than Algorithm 1.
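A sketch of Algorithm 2 under the same assumptions. To keep it short, each cluster is carried
forward only as its summed profile p(C); the full system would also track which documents
each profile covers:

    import java.util.*;

    class Fractionation {
        // Algorithm 2: split into groups of size m, reduce each group to r*m
        // clusters, carry the cluster profiles forward, repeat until k remain.
        static List<Map<String, Double>> fractionate(
                List<Map<String, Double>> docs, int k, int m, double r) {
            List<Map<String, Double>> items = new ArrayList<>(docs);
            while (items.size() > k) {
                List<Map<String, Double>> next = new ArrayList<>();
                for (int start = 0; start < items.size(); start += m) {
                    List<Map<String, Double>> group = new ArrayList<>(
                        items.subList(start, Math.min(start + m, items.size())));
                    int target = Math.max(1, (int) (r * group.size()));
                    for (List<Map<String, Double>> c : Agglomerate.cluster(group, target))
                        next.add(profile(c));          // carry forward p(C)
                }
                if (next.size() >= items.size()) break; // no progress; stop
                items = next;
            }
            return items;  // ~k cluster profiles; centers then follow as in Algorithm 1
        }

        // p(C): component-wise sum of the constituent profiles.
        static Map<String, Double> profile(List<Map<String, Double>> cluster) {
            Map<String, Double> pC = new HashMap<>();
            for (Map<String, Double> d : cluster)
                for (Map.Entry<String, Double> e : d.entrySet())
                    pC.merge(e.getKey(), e.getValue(), Double::sum);
            return pC;
        }
    }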
After the k centers have been found, the remaining collection is divided randomly into two
subcollections. The ASSIGN-TO-NEAREST algorithm is applied to the first subcollection,
and BAYESIAN MACHINE LEARNING is applied to the second subcollection, using the
clusters obtained from the first subcollection as the training data.
We now describe the ASSIGN-TO-NEAREST algorithm:
ASSIGN-TO-NEAREST
 For each document, find its similarity with each center and assign the document to the
center with which its similarity is maximum.
 Hence, each document is assigned to one of the k centers and we finally have k clusters.
FOR MORE REFINEMENT
 Find the centers of these new clusters and repeat the assignment step with the new centers.
 This can be carried out indefinitely, but we observed that the most significant improvement
is obtained within 2 to 3 iterations. (A code sketch follows.)
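A sketch of ASSIGN-TO-NEAREST, again reusing the earlier classes. For the refinement step,
one recomputes each labelled cluster's center (as in Centers.centerOf above) and calls
assignToNearest again; as noted above, 2 to 3 such rounds suffice in practice:

    import java.util.*;

    class Assign {
        // Label each document with the index of its most similar center.
        static int[] assignToNearest(List<Map<String, Double>> docs,
                                     List<Map<String, Double>> centers) {
            int[] label = new int[docs.size()];
            for (int i = 0; i < docs.size(); i++) {
                double best = Double.NEGATIVE_INFINITY;
                for (int j = 0; j < centers.size(); j++) {
                    double s = DocVector.similarity(docs.get(i), centers.get(j));
                    if (s > best) { best = s; label[i] = j; }
                }
            }
            return label;
        }
    }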
BAYESIAN LEARNING
STEP III: For the remaining second subcollection, Bayesian ML is applied, using the just-
obtained clusters as the training data.
 Each document is represented by a set of attributes {a1, a2, ..., an}. Here, a1 = the first
word in the document, a2 = the second word in the document, and so on. We have a set of
target values (the k centers in our case) V = {v1, ..., vk}. Each document is assigned to one
of the k target centers.
 This approach uses the MAP hypothesis, in which each document is assigned to the
center vj for which the expression P(vj|a1,a2,...,an) is maximised.
 By Bayes' theorem, this can be rewritten as the maximisation of the
expression P(a1,a2,...,an|vj)*P(vj).
ASSUMPTIONS
The evaluation of this expression calls for a very large amount of training data; hence, we make
two reasonable assumptions:
 Attribute values are conditionally independent, i.e. the probability of occurrence of a given
set of words is independent of the order of the words.
 The probability that a word wk appears at position i is the same as the probability of its
occurrence at position j, for all i, j, k.
The terms P(vj), P(wk|vj), etc. in the final expression are calculated from the training data (the
data obtained from the documents already assigned by the cosine-similarity approach). A code
sketch of the resulting classifier follows.
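Under the two assumptions above, the classifier reduces to a multinomial naive Bayes rule.
A minimal sketch, assuming the word counts per center have already been accumulated from
the training subcollection; the Laplace (+1) smoothing is a standard guard against zero
probabilities that is not spelled out above, and logs are used to avoid underflow:

    import java.util.*;

    class BayesClassifier {
        // For each center vj we assume precomputed:
        //   wordCount.get(j) : word -> count over training docs assigned to vj
        //   totalWords[j]    : total word occurrences in those docs
        //   docCount[j]      : number of training docs assigned to vj
        static int classify(List<String> words, List<Map<String, Integer>> wordCount,
                            long[] totalWords, int[] docCount, int vocabSize, int nDocs) {
            int best = 0;
            double bestLog = Double.NEGATIVE_INFINITY;
            for (int j = 0; j < wordCount.size(); j++) {
                double logP = Math.log((double) docCount[j] / nDocs);  // log P(vj)
                for (String w : words)                                  // + sum log P(ai|vj)
                    logP += Math.log((wordCount.get(j).getOrDefault(w, 0) + 1.0)
                                     / (totalWords[j] + vocabSize));
                if (logP > bestLog) { bestLog = logP; best = j; }
            }
            return best;  // index of the MAP center
        }
    }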
PURPOSE
 ACCURATE AND FASTER: the reason for using Bayesian ML at all. In fact, for the text
classification problem, its performance has been shown to be comparable to ANNs, but at a
much reduced computational cost.
 DOMAIN INDEPENDENT: the reason for using the cosine-similarity approach at all.
Most ML processes have to be trained off-line on some training data, which renders the
system domain-dependent. We did not want to reduce the scope of our project by making
it domain-dependent. Hence we have maintained its DOMAIN INDEPENDENCE by
obtaining the training data from the given collection itself.
TOOLS USED FOR SPEED ENHANCEMENT AND RELIABILITY
 HASH TABLE
Data that is accessed frequently is stored in hash tables, as INSERT and FIND operations
are O(1) in average time complexity.
 MULTITHREADED PROGRAMMING
Provides performance comparable to parallel computing for operations that can be
done simultaneously.
 THESAURUS DATABASE
Improves the quality of clustering significantly.
 STOPLIST DATABASE
Used for eliminating stopwords (e.g. "the", "is", etc.).
 JAVA LIBRARY CLASSES
o Dynamically sized arrays ('Vector')
o Parsing and extraction of words ('StringTokenizer')
ADDED UTILITIES
QUERY BASED BROWSING
The documents containing the query and its synonyms are filtered out of the current
collection to form a new collection, and Scatter/Gather is then applied to this
collection.
REFINE YOUR SEARCH
After any scatter, the user can opt to refine the clustering, i.e. improve the quality of the
clusters. (This calls for added computation time, but it is at the discretion of the user to opt
for this.)
COPYING THE DOCUMENTS OF A CLUSTER TO A NEW DIRECTORY
For some applications, the user might be interested in all the documents of a cluster rather than
one specific document, so the user can avail of this utility.
RESULTS
We applied our INTELLIGENT TEXT CLUSTERER (ITC) to a collection of
documents (approx. 100) downloaded from Usenet news articles. These documents spanned
various themes or fields such as astronomy, medicine, software, religion and politics. The table
below shows the result of the first stage of clustering.
CLUSTER#1 CLUSTER#2 CLUSTER#3 CLUSTER#4 CLUSTER#5 CLUSTER#6 CLUSTER#7 CLUSTER#8
Size: 5 Size: 2 Size: 11 Size: 7 Size: 26 Size: 23 Size: 15 Size: 14
list list people object christian space scsi gun
find sort human model people year drive police
quote work rights vision god nasa ide weapon
josh people kill shape moral engineer pc handgun
fundies creation give proceedings human launch add crime
sort bible rights image wrong project hard time
time creation children ai matter time controlled arm
mcdowell quality problem dimensional point include problem fire
tyre guy life spatial word orbit comparable carry
inerrancy mice govt search reason program people semi
Suppose the user is interested in the field of astronomy, so he chooses cluster 6 for navigation.
As is evident from the table below, the search has narrowed down drastically, from 100
documents to the 23 relevant documents of that cluster. The clusters become smaller and
more detailed with each stage.
Size: 1 Size: 1 Size: 1 Size: 1 Size: 1 Size: 8 Size: 3 Size: 7
task rocket burst universe speculative mission convention space
robot feet star larson form space karl engineer
team motor plan book ablogenesis data hess dc
numerical allowed source star calling time libertarian year
performance launch disk physical probability nasa instt shuttle
motion organisation angle motion mind jpl business system
improve flight direction theory theist launch sept orbit
method depend sample physicists physicists project committe prog
juggle faa distribution matter year part price reusable
level shuttle black astronomers telling spacecraft dr launch
Now, as one can see, some clusters contain only a single document, while Cluster #6
and Cluster #8 contain many documents. The single-document clusters are
'outstanding' documents, in the sense that these documents cannot be grouped with the clusters
containing many documents.
For example, consider Cluster #7. It does not contain any word related to astronomy; scanning
through its topical words, it seems to be related to some business conventions or committees.
Now suppose the user is interested in looking for 'projects' or 'missions' regarding a 'launch'
into 'orbit', so he chooses clusters 2, 6 and 8 for further navigation. The result:
Size: 1 Size: 1 Size: 1 Size: 1 Size: 1 Size: 1 Size: 1 Size: 1
rocket mission nasa space space space jpl space
feet cost cost engg book age space dc
motor hst fee long history america data launch
launch extra overhead tech spaceship matter project shuttle
organisation mass dennis interest part station time vehicle
flight pat center life draw adm. launch orbit
depend minus reston literature higgins budget spacecraft year
rocketry service problem term science program nasa time
faatime hardware work detail illustration plan mission schedule
trajectory oms include general include resource research path
Thus we see that the search has narrowed down drastically and only single documents are
present in the clusters. The topical words of each cluster are now the words occurring in that
document only. Thus the user is able to retrieve the relevant document without browsing
through the entire corpus of documents.
At any stage, if the user is not satisfied with the quality of the clusters, he can choose the
'Refine' option, which improves the quality of the clusters. The user also has the option of
switching over to query-based browsing, on either the present or the original collection. If the
user feels he has made a wrong choice of clusters, he can also revert to the clusters of the
previous stage.
A proper GUI has been provided using Java Frames to make browsing easy, clear and
user-friendly.
Care has also been taken that the user cannot break the program with invalid input; suitable
error messages are printed and handled appropriately.
MAJOR WORKS
D.R. Cutting, Xerox PARC
 The cosine similarity measure was first suggested by him. The random-sampling
algorithm for finding centers is also due to him.
Mehran Sahami, Stanford University
 First to suggest that Bayesian ML could be applied to the text clustering problem given a
set of training data. (His approach is domain-dependent.)
REFERENCES
BOOKS:
Machine Learning by Tom M. Mitchell, The McGraw-Hill Companies, Inc., March 1997.
PAPERS:
Koller, D. and Sahami, M. 1997. Hierarchically Classifying Documents Using Very Few
Words. Proceedings of the Fourteenth International Conference on Machine Learning.
Sahami, M., Yusufali, S. and Baldonado, M.Q.W. 1998. SONIA: A Service for Organizing
Networked Information Autonomously. Proceedings of the Third ACM Conference on Digital
Libraries.
Sahami, M. 1996. Learning Limited Dependence Bayesian Classifiers. In KDD-96: Proceedings
of the Second International Conference on Knowledge Discovery and Data Mining, pp. 335-338,
Menlo Park, CA: AAAI Press.
Goldszmidt, M. and Sahami, M. 1998. A Probabilistic Approach to Full-Text Document
Clustering. Technical Report ITAD-433-MS-98-044, SRI International.
Cutting, D., Karger, D. and Pedersen, J. Constant Interaction-Time Scatter/Gather Browsing of
Very Large Document Collections. Proceedings of the Sixteenth Annual International ACM
SIGIR Conference, Pittsburgh, 1993.
Pirolli, P., Schank, P., Hearst, M.A. and Diehl, C. Scatter/Gather Browsing Communicates the
Topic Structure of a Very Large Text Collection. Proceedings of the ACM SIGCHI Conference
on Human Factors in Computing Systems (CHI), May 1996.
Hearst, M.A., Karger, D.R. and Pedersen, J.O. Scatter/Gather as a Tool for the Navigation of
Retrieval Results. Proceedings of the 1995 AAAI Fall Symposium on Knowledge Navigation.
Hearst, M. and Pedersen, J. Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval
Results. Proceedings of the 19th Annual International ACM SIGIR Conference, Zurich,
August 1996.
FUTURE SCOPE
The future scope of improvement in this project lies in the following fields:
LEXICAL AFFINITIES
In some domains, certain words tend to appear together. For example, in the domain of AI, the
words 'artificial' and 'intelligence' tend to appear together, as do the words 'machine' and
'learning'. This fact can be exploited by introducing crossover terms in the expression for
cosine similarity and by modifying the assumptions of the Bayesian ML.
USING SEMANTICS
Parsing can be incorporated into the program to provide semantic information, which can
be used to obtain a better estimate of the similarities between documents.