Web Clustering Engines
YASH DARAK
206117026
CONTENTS
● Introduction
● Why web clustering engines?
● Advantages of cluster hierarchy
● Issues in implementation of clusters
● Architecture
● Data centric clustering algorithm
● Conclusion
Search engines ?
● Search engines are an invaluable tool for retrieving information from the web. In
response to a user query, they return a list of results ranked in order of relevance to
the query.
● E.g.: Google, Yahoo, Credo, etc.
Google (Flat ranked search engine)
Yippy (Web clustering engine)
Web clustering engines
● Search engine.
● Web clustering engines are systems that perform clustering of web search
results. These systems group the results returned by a search engine into a hierarchy of
labeled clusters (also called categories).
● Clustering is the act of grouping similar objects into sets.
● The distance between objects in the same cluster should be minimal.
● The distance between objects in different clusters should be maximal.
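The two distance criteria can be checked numerically. The sketch below is only an illustration: the 2-D points, the Euclidean distance, and the helper names mean_intra and mean_inter are assumptions, not part of any particular engine.

```python
from itertools import combinations, product
from math import dist


def mean_intra(cluster):
    """Average distance between all pairs of points inside one cluster.

    Assumes the cluster holds at least two points.
    """
    pairs = list(combinations(cluster, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def mean_inter(cluster_a, cluster_b):
    """Average distance between points taken from two different clusters."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy data: two well-separated groups of 2-D points.
cluster_a = [(0, 0), (0, 1), (1, 0)]
cluster_b = [(5, 5), (5, 6), (6, 5)]

# A good clustering keeps the first value small and the second value large.
print(mean_intra(cluster_a), mean_inter(cluster_a, cluster_b))
```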
Web clustering engines -
1. Northern Light (predefined set of clusters)
2. Vivisimo (cluster labels were generated dynamically)
3. Clusty
4. Grokker
5. Yippy
6. Lingo3G
7. Credo, etc.
Why web clustering engines ?
● Conventional search engines are not very effective for ambiguous queries.
● The results a conventional search engine returns for such a query mix the different
meanings together in one list, so items irrelevant to the intended meaning appear among the relevant ones.
This is where clustering of search results comes into the picture.
Main advantages of cluster hierarchy :
● It provides shortcuts to the items that relate to the same meaning.
● It allows better topic understanding.
● It favors systematic exploration of search results.
Issues in implementation of clusters :
● Short input description.
● Meaningful labels.
● Selection of similarity measure.
● Grouping of objects into clusters.
● Computational efficiency.
● Overlapping clusters.
● Unknown number of clusters.
Architecture :
1. Search Result Acquisition :
● The task of the search result acquisition is to provide input for the rest of the system.
● Based on the query, the acquisition component must deliver 50 to 500 results, each of
which should contain -
■ Title
■ Contextual snippet
■ URL pointing to the full text being referred to.
● The source of search results can be any public search engine, such as Google or Yahoo.
● The most elegant way of fetching results from such search engines is to use the application
programming interfaces (APIs) these engines provide.
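A minimal sketch of what the acquisition component hands to the rest of the system. The class name SearchResult, the function acquire, and the dictionary layout of the raw items are hypothetical; a real search API will use its own field names and must be adapted accordingly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchResult:
    """One item delivered by the acquisition component."""
    title: str
    snippet: str
    url: str

    def text(self) -> str:
        # Title and snippet together form the short input description
        # that the later stages tokenize and cluster.
        return f"{self.title}. {self.snippet}"


def acquire(raw_items: list[dict], limit: int = 500) -> list[SearchResult]:
    """Convert raw items (hypothetical dict layout) into SearchResult objects."""
    results = []
    for item in raw_items[:limit]:
        # Skip items that lack any of the three required fields.
        if all(item.get(key) for key in ("title", "snippet", "url")):
            results.append(SearchResult(item["title"], item["snippet"], item["url"]))
    return results
```

Keeping the three fields in one immutable record makes it easy to map a cluster member back to its URL once clustering is done.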
2. Preprocessing of search results :
● It converts the contents of search results (output by the acquisition component) into a
sequence of features used by the actual clustering algorithm.
● Steps for feature extraction -
a. Language identification
b. Tokenization
c. Stemming
d. Selection of features.
b. Tokenization :
● During the tokenization step, the text of each search result gets split into a sequence of
basic independent units called tokens, which will usually represent single words, numbers,
symbols and so on.
● Tokenization becomes much more complex for languages that do not separate words with white
space (such as Chinese) or where the text may switch direction (such as Arabic).
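For whitespace-delimited languages, tokenization can be sketched with one regular expression. The pattern and the function name tokenize are assumptions made for illustration; the sketch does not handle the harder cases mentioned above, such as Chinese.

```python
import re

# A token is either a run of word characters (letters, digits, underscore)
# or a single character that is neither a word character nor white space.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split text into lowercase word, number, and symbol tokens."""
    return TOKEN_PATTERN.findall(text.lower())


print(tokenize("Web clustering, in 2 steps!"))
# ['web', 'clustering', ',', 'in', '2', 'steps', '!']
```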
c. Stemming :
● The aim of stemming is to remove the inflectional prefixes and suffixes of each word and
thus reduce different grammatical forms of the word to a common base form called a stem.
● E.g.: Connected, Connecting and interconnected → ‘Connect’
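The example can be reproduced with a toy affix stripper. This is not a real stemming algorithm: the prefix and suffix lists are hypothetical and cover only this example, and widely used stemmers such as the Porter stemmer strip suffixes only.

```python
PREFIXES = ("inter",)          # illustrative only
SUFFIXES = ("ing", "ed", "s")  # illustrative only


def toy_stem(word: str, min_stem: int = 3) -> str:
    """Strip at most one known prefix and one known suffix from a word."""
    stem = word.lower()
    for prefix in PREFIXES:
        # Only strip if at least min_stem characters would remain.
        if stem.startswith(prefix) and len(stem) - len(prefix) >= min_stem:
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem:
            stem = stem[: -len(suffix)]
            break
    return stem


print([toy_stem(w) for w in ("Connected", "Connecting", "interconnected")])
# ['connect', 'connect', 'connect']
```

A rule list this small over-stems many ordinary words, which is why production systems rely on established stemmers or dictionary-based lemmatizers.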
d. Selection of features :
● It extracts features for each search result present in the input.
● Features are atomic entities by which we can describe an object and represent its most
important characteristics to an algorithm.
● The features can vary from single words and fixed-length tuples of words (n-grams) to
frequent phrases (variable-length sequences of words)
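Fixed-length n-gram features can be extracted from a token list as follows. The function name ngrams and the sample tokens are assumptions; extracting variable-length frequent phrases needs more machinery and is not shown.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return all contiguous n-token tuples, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["web", "clustering", "engine", "web", "clustering"]

# Single words (n = 1) and word pairs (n = 2) used as features.
unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))

print(bigram_counts[("web", "clustering")])  # 2
```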
How to represent a feature/text ?
● One method for representing a text is the vector space model (VSM).
● A document d is represented in the VSM as a vector $[w_{t_0}, w_{t_1}, \dots, w_{t_n}]$, where $t_0, t_1, \dots, t_n$ is
a global set of words (features) and $w_{t_i}$ expresses the weight (importance) of feature $t_i$ to
document d.
● Eg. :
d-> “Polly had a dog and the dog had Polly”
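A sketch of how the example sentence maps to a vector. It assumes the weight of a feature is its raw term frequency and that the global vocabulary is just the sorted set of words in this one document; other weighting schemes, such as tf-idf, would give different numbers.

```python
from collections import Counter


def tf_vector(text: str, vocabulary: list[str]) -> list[int]:
    """Return the term-frequency weight of every vocabulary word in text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


doc = "Polly had a dog and the dog had Polly"

# Assumption: the vocabulary is built from this single document.
vocabulary = sorted(set(doc.lower().split()))
vector = tf_vector(doc, vocabulary)

print(vocabulary)  # ['a', 'and', 'dog', 'had', 'polly', 'the']
print(vector)      # [1, 1, 2, 2, 2, 1]
```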
3. Cluster construction and labelling :
● The set of search results, along with the features extracted in the preprocessing step, is
given as input to the clustering algorithm.
● There are a number of algorithms available for clustering. We can classify them into two
different categories -
a. Data-centric clustering algorithms
b. Description-aware clustering algorithms
● The cluster labels should be unique, unambiguous, comprehensive and appropriate to the
content.
Data centric clustering algorithm :
● This system uses the VSM for text representation, and the clustering technique used is
agglomerative hierarchical clustering (AHC).
● It starts from an initial clustering of a collection of documents into a set of k clusters (scattering).
● At query time, the user selects the clusters of interest (gathering) and the system re-clusters
those documents.
● This process repeats until a small cluster with relevant documents is found.
Agglomerative Hierarchical Clustering (AHC) :
● Initially each document is in its own cluster.
● Build a distance matrix (dissimilarity matrix) over every pair of clusters.
● Merge the two closest clusters and rebuild the distance matrix, replacing the two merged clusters by one
cluster.
● Continue this process until the desired number of clusters, k, is reached.
● Building the distance matrix alone takes O(n^2) time and space, where n is the number of
documents (the initial number of clusters). A naive implementation that rescans the matrix for every merge takes O(n^3) time overall.
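The merge loop can be sketched in a few lines. Several choices here are assumptions: single-link distance (the closest pair of points) stands in for an unspecified linkage criterion, 2-D tuples stand in for VSM document vectors, and the function names single_link and ahc are hypothetical. For brevity the sketch recomputes cluster distances on every pass instead of maintaining a distance matrix, so it is slower than a matrix-based implementation.

```python
from math import dist


def single_link(cluster_a, cluster_b):
    """Cluster distance = distance between the closest pair of points."""
    return min(dist(a, b) for a in cluster_a for b in cluster_b)


def ahc(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # initially each point is its own cluster
    while len(clusters) > max(k, 1):
        # Scan every pair of clusters for the smallest distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Replace the two merged clusters by one cluster (j > i, so
        # deleting index j leaves index i valid).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(ahc(points, 3))
# [[(0, 0), (0, 1)], [(5, 5), (5, 6)], [(10, 0)]]
```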
Improving the efficiency of clustering
1. Client-side processing : During periods of high query rates, response times can increase
significantly and degrade the user experience. To avoid this, some of the processing can be done
using client-side resources.
2. Pretokenized documents : Clustering engines can reuse the tokens that the
conventional search engine has already produced.
Conclusion
● Web clustering engines organize search results by topic, thus offering a
complementary view to the flat-ranked list returned by conventional search engines.
● Because efficient methods for evaluating the performance of clustering engines are lacking,
they have not yet attracted much attention.
Thank you all for your kind
attention!!