SlideShare a Scribd company logo
Web Mining
Web Mining Outline
• Goal –
   – Examine the use of data mining on the World Wide
     Web.

• Outline -
   – Introduction.
   – Web Content Mining.
   – Web Structure Mining.
   – Web Usage Mining.




                Mahavir Advaya - Ruia College       2
Web Mining Issues
• Size –
   – >350 million pages (1999).
   – Grows at about 1 million pages a day.
   – Google indexes 3 billion documents.

• Diverse types of data.




                  Mahavir Advaya - Ruia College   3
Web Data
•   Web pages.
•   Intra-page structures.
•   Inter-page structures.
•   Usage data.
•   Supplemental data –
     – Profiles.
     – Registration information.
     – Cookies.




                    Mahavir Advaya - Ruia College   4
Web Mining Taxonomy




        Mahavir Advaya - Ruia College   5
Web Content Mining
• Extends work of basic search engines.

• Search Engines –
   – IR application.
   – Keyword based.
   – Similarity between query and document.
   – Crawlers.
   – Indexing.
   – Profiles.
   – Link analysis.




                  Mahavir Advaya - Ruia College   6
Crawlers (Spider)
• Robot (spider), a program, traverses the hypertext structure in
  the Web.
   – Collect information from visited pages.
   – Used to construct indexes for search engines.

• Traditional Crawler – visits entire Web (?) and replaces index.

• Periodic Crawler – visits portions of the Web and updates subset
  of index.

• Incremental Crawler – selectively searches the Web and
  incrementally modifies index.

• Focused Crawler – visits pages related to a particular subject.



                     Mahavir Advaya - Ruia College                  7
Focused Crawler
• Only visit links from a page if that page is determined to
  be relevant.

• Classifier is static after learning phase.

• Components –
   – Hypertext Classifier which assigns relevance score to
     each page based on crawl topic.

   – Distiller to identify hub pages.

   – Crawler visits pages to based on crawler and distiller
     scores.

                    Mahavir Advaya - Ruia College          8
Focused Crawler
• Classifier to related documents to topics.

• Classifier also determines how useful outgoing links are.

• Hub Pages contain links to many relevant pages. Must
  be visited even if not high relevance score.




                   Mahavir Advaya - Ruia College              9
Focused Crawler




          Mahavir Advaya - Ruia College   10
Context Focused Crawler
•    Context Graph –
    – Context graph created for each seed document .
    – Root is the seed document.
    – Nodes at each level show documents with links to
       documents at next higher level.
    – Updated during crawl itself .

•    Approach –
    1. Construct context graph and classifiers using seed
       documents as training data.
    2. Perform crawling using classifiers and context graph
       created.


                   Mahavir Advaya - Ruia College         11
Context Graph




          Mahavir Advaya - Ruia College   12
Virtual Web View
• Approach to handle unstructured data.
• Multiple Layered DataBase (MLDB) built on top of the
  Web.
• Each layer of the database is more generalized (and
  smaller) and centralized than the one beneath it.
• Upper layers of MLDB are structured and can be accessed
  with SQL type queries.
• Does not require the use of spiders (Crawlers).
• Translation tools convert Web documents to XML.
• Extraction tools extract desired information to place in first
  layer of MLDB. Convert web document to XML.
• Higher levels contain more summarized data obtained
  through generalizations of the lower levels.

                    Mahavir Advaya - Ruia College            13
WebML
• Web data Mining Query Language.

• Provides data mining operations on MLDB.

• Major feature – four operations –
  – COVERS: one concept covers another if it is higher in
    the hierarchy.
  – COVERED BY: reverse of COVERS, reverses the
    descendents.
  – LIKE: concept is a synonym.
  – CLOSE TO: One concept is close to another if it is a
    sibling in the hierarchy.

                  Mahavir Advaya - Ruia College        14
WebML
• Example –
   – Find all the documents                 at   the   level   of
     www.engr.smu.edu.

• Query –
     SELECT *
     FROM document in ‘ ‘ www.engr.smu.edu ‘ ‘
     WHERE ONE OF keywords COVERS ‘ ‘ cat ‘ ‘




                 Mahavir Advaya - Ruia College                 15
Personalization
• Example of Web Content Mining.
• Web access or contents tuned to better fit the desires of each
  user.
• With personalization, advertisements to be sent to the customers
  based on specific knowledge.
• Goal – Make the customer purchase something.

• Three basic types –
   – Manual techniques – identify user’s preferences based on
     profiles or demographics.
   – Collaborative filtering identifies preferences based on ratings
     from similar users.
   – Content based filtering retrieves pages based on similarity
     between pages and user profiles.


                     Mahavir Advaya - Ruia College                16
Web Structure Mining
• Create a model of the Web organization or a portion of
  it.

• Mine structure (links, graph) of the Web.

• Techniques –
   – PageRank.
   – CLEVER.

• May be combined with content mining to more
  effectively retrieve important pages.


                  Mahavir Advaya - Ruia College       17
PageRank
• Used by Google.

• Prioritize pages returned from search by looking at Web
  structure.

• Importance of page is calculated based on number of
  pages which point to it – Backlinks.

• Weighting is used to provide more importance to
  backlinks coming form important pages.




                  Mahavir Advaya - Ruia College        18
PageRank (cont’d)

• PR(p) = c (PR(1)/N1 + … + PR(n)/Nn)
  – PR(i): PageRank for a page i which points to
    target page p.

  – Ni: number of links coming out of page i.

  – c: constant value between 0 and 1 used for
    normalization.



               Mahavir Advaya - Ruia College    19
CLEVER
• System developed by IBM.
• Finding both authoritative and hub pages.

• Authoritative Pages –
   – Authors define an authority as the “best source” for
     the request.
      o Highly important pages.
      o Best source for requested information.

• Hub Pages –
   – Contain links to highly important pages.
   – Clever, identifies authoritative and hub pages by
     creating weights.

                  Mahavir Advaya - Ruia College        20
HITS
• Hyperlink-Induces Topic Search.
• Finds Hubs and Authoritative Pages.

• Two components –
   – Based on a set of keywords, find set of relevant
     pages – R.
   – Identify hub and authority pages for these.
      o Expand R to a base set, B, of pages linked to or
        from R.
      o Calculate weights for authorities and hubs.

• Pages with highest ranks in R are returned.
                  Mahavir Advaya - Ruia College       21
HITS Algorithm




         Mahavir Advaya - Ruia College   22
Web Usage Mining
• Extends work of basic search engines.

• Performs mining on Web usage data or Web logs.

• Search Engines –
   – IR application.
   – Keyword based.
   – Similarity between query and document.
   – Crawlers.
   – Indexing.
   – Profiles.
   – Link analysis.

                  Mahavir Advaya - Ruia College    23
Web Usage Mining Applications
• Personalization – tracking of previously accessed pages.
• Determining frequent access behavior for users.
• Improve structure of a site’s Web pages.
• Aid in caching and prediction of future page references.
• Improve design of individual pages.
• Improve effectiveness of e-commerce (sales and
  advertising).
• Gathering Statistics – considering accessed pages may
  or may not be viewed as part web mining .




                  Mahavir Advaya - Ruia College         24
Web Usage Mining Activities
• Preprocessing Web log –
   – Cleanse.
   – Remove extraneous information.
   – Sessionize –
       o Session: Sequence of pages referenced by one user at a
         sitting.

• Pattern Discovery –
   – Count patterns that occur in sessions.
   – Pattern is sequence of pages references in session.
   – Similar to association rules –
       o Transaction: session.
       o Itemset: pattern (or subset).
       o Order is important.

• Pattern Analysis.
                      Mahavir Advaya - Ruia College          25
ARs in Web Mining
• Web Mining –
  – Content.
  – Structure.
  – Usage.

• Frequent patterns of sequential page references in Web
  searching.

• Uses –
   – Caching
   – Clustering users
   – Develop user profiles
   – Identify important pages
                 Mahavir Advaya - Ruia College        26
Web Usage Mining Issues
• Identification of exact user not possible.

• Exact sequence of pages referenced by a user not
  possible due to caching.

• Session not well defined.

• Security, privacy, and legal issues.




                   Mahavir Advaya - Ruia College   27
Web Log Cleansing
• Replace source IP address with unique but non-
  identifying ID.

• Replace exact URL of pages referenced with unique but
  non-identifying ID.

• Delete error records and records containing not page
  data (such as figures and code).




                 Mahavir Advaya - Ruia College       28
Sessionizing
• Divide Web log into sessions.

• Two common techniques –
   – Number of consecutive page references from a source
     IP address occurring within a predefined time interval
     (e.g. 25 minutes).

   – All consecutive page references from a source IP
     address where the interclick time is less than a
     predefined threshold.




                  Mahavir Advaya - Ruia College          29
Data Structures
• Keep track of patterns identified during Web usage
  mining process.

• Common techniques –
   – Trie.
   – Suffix Tree.
   – Generalized Suffix Tree.
   – WAP Tree.




                  Mahavir Advaya - Ruia College   30
Trie vs. Suffix Tree
• Trie –
   – Rooted tree.
   – Edges labeled which character (page) from pattern.
   – Path from root to leaf represents pattern.

• Suffix Tree –
   – Single child collapsed with parent. Edge contains
     labels of both prior edges.




                  Mahavir Advaya - Ruia College           31
Trie and Suffix Tree




              A

          L

      O

  G
                                                  ALOG




                  Mahavir Advaya - Ruia College          32
Generalized Suffix Tree
• Suffix tree for multiple sessions.

• Contains patterns from all sessions.

• Maintains count of frequency of occurrence of a pattern
  in the node.

• WAP Tree –
  – Web Access Pattern.
  – Compressed version of generalized suffix tree.
  – Tree stores sequences and their counts.


                    Mahavir Advaya - Ruia College      33
Types of Patterns
• Algorithms have been developed to discover different
  types of patterns.

• Properties –
   – Ordered – Characters (pages) must occur in the exact
     order in the original session.
   – Duplicates – Duplicate characters are allowed in the
     pattern.
   – Consecutive – All characters in pattern must occur
     consecutive in given session.
   – Maximal – Not subsequence of another pattern.


                  Mahavir Advaya - Ruia College        34
Pattern Types
•   Association Rules.
•   Episodes.
•   Sequential Patterns.
•   Forward Sequences.
•   Maximal Frequent Sequences.




                   Mahavir Advaya - Ruia College   35
Questions???
• Write a short note on Web Content Mining.
• What is Web Mining? Give web mining taxonomy.
• What do you mean by Web Usage Mining? Explain rule with
  examples.
• Write a short note on Harvest System.
• Define crawler. State and explain different types of crawlers.
• Write a short note on crawlers.
• Give taxonomy of web mining activities. For what purpose web
  usage mining is used? What activities are involved in web usage
  mining?
• What do you understand by the term “Web Usage Mining”.
• Explain the term crawlers in web mining.
• Discuss the importance of establishing a standardized WebML.
• Write a short note on web structure mining.

                    Mahavir Advaya - Ruia College              36
Thank you !!
Ad

More Related Content

What's hot (20)

Web Mining Presentation Final
Web Mining Presentation FinalWeb Mining Presentation Final
Web Mining Presentation Final
Er. Jagrat Gupta
 
Web mining (1)
Web mining (1)Web mining (1)
Web mining (1)
ajaybabu1314
 
Web data mining
Web data miningWeb data mining
Web data mining
Institute of Technology Telkom
 
Pagerank Algorithm Explained
Pagerank Algorithm ExplainedPagerank Algorithm Explained
Pagerank Algorithm Explained
jdhaar
 
Web usage mining
Web usage miningWeb usage mining
Web usage mining
Monu Chaudhary
 
Web essentials clients, servers and communication – the internet – basic inte...
Web essentials clients, servers and communication – the internet – basic inte...Web essentials clients, servers and communication – the internet – basic inte...
Web essentials clients, servers and communication – the internet – basic inte...
smitha273566
 
Data Mining: Text and web mining
Data Mining: Text and web miningData Mining: Text and web mining
Data Mining: Text and web mining
DataminingTools Inc
 
Web crawler
Web crawlerWeb crawler
Web crawler
poonamkenkre
 
Apriori Algorithm
Apriori AlgorithmApriori Algorithm
Apriori Algorithm
International School of Engineering
 
Data mining tasks
Data mining tasksData mining tasks
Data mining tasks
Khwaja Aamer
 
Multimedia Information Retrieval
Multimedia Information RetrievalMultimedia Information Retrieval
Multimedia Information Retrieval
Stephane Marchand-Maillet
 
WEB MINING.
WEB MINING.WEB MINING.
WEB MINING.
Sushil kasar
 
AI: Learning in AI
AI: Learning in AI AI: Learning in AI
AI: Learning in AI
DataminingTools Inc
 
The Social Semantic Web
The Social Semantic WebThe Social Semantic Web
The Social Semantic Web
John Breslin
 
Web mining (structure mining)
Web mining (structure mining)Web mining (structure mining)
Web mining (structure mining)
Amir Fahmideh
 
Recommendation system
Recommendation system Recommendation system
Recommendation system
Vikrant Arya
 
Web Content Mining
Web Content MiningWeb Content Mining
Web Content Mining
Daminda Herath
 
An Introduction to Semantic Web Technology
An Introduction to Semantic Web TechnologyAn Introduction to Semantic Web Technology
An Introduction to Semantic Web Technology
Ankur Biswas
 
Recommendation system
Recommendation systemRecommendation system
Recommendation system
Akshat Thakar
 
Web Mining
Web MiningWeb Mining
Web Mining
dataminers.ir
 

Similar to Web mining slides (20)

Web mining
Web miningWeb mining
Web mining
Jay Lohokare
 
Web Mining.pptx
Web Mining.pptxWeb Mining.pptx
Web Mining.pptx
ScrbifPt
 
Data mining and warehouse by dr D. R. Patil sir
Data mining and warehouse by dr D. R. Patil sirData mining and warehouse by dr D. R. Patil sir
Data mining and warehouse by dr D. R. Patil sir
chaudharipruthvirajr
 
Scalability andefficiencypres
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypres
NekoGato
 
Web mining .pdf module 6 dwm third year ce
Web mining .pdf module 6 dwm third year ceWeb mining .pdf module 6 dwm third year ce
Web mining .pdf module 6 dwm third year ce
NiramayKolalle
 
4 Web Crawler.pptx
4 Web Crawler.pptx4 Web Crawler.pptx
4 Web Crawler.pptx
DEEPAK948083
 
Gaurav web mining
Gaurav web miningGaurav web mining
Gaurav web mining
Gaurav Uniyal
 
Web Usage Pattern
Web Usage PatternWeb Usage Pattern
Web Usage Pattern
Shreyansh Kejriwal
 
Search Across Multiple VIVO Instances
Search Across Multiple VIVO InstancesSearch Across Multiple VIVO Instances
Search Across Multiple VIVO Instances
cappadona
 
Web mining
Web miningWeb mining
Web mining
SarthakSahoo8
 
Search Engines
Search EnginesSearch Engines
Search Engines
Kamal Acharya
 
Enterprise Search Case Study: SpareBank1 Gruppen
Enterprise Search Case Study: SpareBank1 GruppenEnterprise Search Case Study: SpareBank1 Gruppen
Enterprise Search Case Study: SpareBank1 Gruppen
Findwise
 
Information Architecture
Information ArchitectureInformation Architecture
Information Architecture
Henry Osborne
 
Search engine
Search engineSearch engine
Search engine
Rishabh Agarwal
 
A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...
butest
 
A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...
butest
 
Web mining
Web miningWeb mining
Web mining
SwarnaLatha177
 
Link analysis for web search
Link analysis for web searchLink analysis for web search
Link analysis for web search
Emrullah Delibas
 
Web Mining
Web MiningWeb Mining
Web Mining
Mudit Dholakia
 
Web mining
Web miningWeb mining
Web mining
Innovative Pencils
 
Web Mining.pptx
Web Mining.pptxWeb Mining.pptx
Web Mining.pptx
ScrbifPt
 
Data mining and warehouse by dr D. R. Patil sir
Data mining and warehouse by dr D. R. Patil sirData mining and warehouse by dr D. R. Patil sir
Data mining and warehouse by dr D. R. Patil sir
chaudharipruthvirajr
 
Scalability andefficiencypres
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypres
NekoGato
 
Web mining .pdf module 6 dwm third year ce
Web mining .pdf module 6 dwm third year ceWeb mining .pdf module 6 dwm third year ce
Web mining .pdf module 6 dwm third year ce
NiramayKolalle
 
4 Web Crawler.pptx
4 Web Crawler.pptx4 Web Crawler.pptx
4 Web Crawler.pptx
DEEPAK948083
 
Search Across Multiple VIVO Instances
Search Across Multiple VIVO InstancesSearch Across Multiple VIVO Instances
Search Across Multiple VIVO Instances
cappadona
 
Enterprise Search Case Study: SpareBank1 Gruppen
Enterprise Search Case Study: SpareBank1 GruppenEnterprise Search Case Study: SpareBank1 Gruppen
Enterprise Search Case Study: SpareBank1 Gruppen
Findwise
 
Information Architecture
Information ArchitectureInformation Architecture
Information Architecture
Henry Osborne
 
A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...
butest
 
A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...
butest
 
Link analysis for web search
Link analysis for web searchLink analysis for web search
Link analysis for web search
Emrullah Delibas
 
Ad

Recently uploaded (20)

What is the Philosophy of Statistics? (and how I was drawn to it)
What is the Philosophy of Statistics? (and how I was drawn to it)What is the Philosophy of Statistics? (and how I was drawn to it)
What is the Philosophy of Statistics? (and how I was drawn to it)
jemille6
 
Cultivation Practice of Garlic in Nepal.pptx
Cultivation Practice of Garlic in Nepal.pptxCultivation Practice of Garlic in Nepal.pptx
Cultivation Practice of Garlic in Nepal.pptx
UmeshTimilsina1
 
antiquity of writing in ancient India- literary & archaeological evidence
antiquity of writing in ancient India- literary & archaeological evidenceantiquity of writing in ancient India- literary & archaeological evidence
antiquity of writing in ancient India- literary & archaeological evidence
PrachiSontakke5
 
MEDICAL BIOLOGY MCQS BY. DR NASIR MUSTAFA
MEDICAL BIOLOGY MCQS  BY. DR NASIR MUSTAFAMEDICAL BIOLOGY MCQS  BY. DR NASIR MUSTAFA
MEDICAL BIOLOGY MCQS BY. DR NASIR MUSTAFA
Dr. Nasir Mustafa
 
2025 The Senior Landscape and SET plan preparations.pptx
2025 The Senior Landscape and SET plan preparations.pptx2025 The Senior Landscape and SET plan preparations.pptx
2025 The Senior Landscape and SET plan preparations.pptx
mansk2
 
Botany Assignment Help Guide - Academic Excellence
Botany Assignment Help Guide - Academic ExcellenceBotany Assignment Help Guide - Academic Excellence
Botany Assignment Help Guide - Academic Excellence
online college homework help
 
How to Create Kanban View in Odoo 18 - Odoo Slides
How to Create Kanban View in Odoo 18 - Odoo SlidesHow to Create Kanban View in Odoo 18 - Odoo Slides
How to Create Kanban View in Odoo 18 - Odoo Slides
Celine George
 
Rock Art As a Source of Ancient Indian History
Rock Art As a Source of Ancient Indian HistoryRock Art As a Source of Ancient Indian History
Rock Art As a Source of Ancient Indian History
Virag Sontakke
 
*"Sensing the World: Insect Sensory Systems"*
*"Sensing the World: Insect Sensory Systems"**"Sensing the World: Insect Sensory Systems"*
*"Sensing the World: Insect Sensory Systems"*
Arshad Shaikh
 
Cultivation Practice of Onion in Nepal.pptx
Cultivation Practice of Onion in Nepal.pptxCultivation Practice of Onion in Nepal.pptx
Cultivation Practice of Onion in Nepal.pptx
UmeshTimilsina1
 
How to Configure Public Holidays & Mandatory Days in Odoo 18
How to Configure Public Holidays & Mandatory Days in Odoo 18How to Configure Public Holidays & Mandatory Days in Odoo 18
How to Configure Public Holidays & Mandatory Days in Odoo 18
Celine George
 
Origin of Brahmi script: A breaking down of various theories
Origin of Brahmi script: A breaking down of various theoriesOrigin of Brahmi script: A breaking down of various theories
Origin of Brahmi script: A breaking down of various theories
PrachiSontakke5
 
U3 ANTITUBERCULAR DRUGS Pharmacology 3.pptx
U3 ANTITUBERCULAR DRUGS Pharmacology 3.pptxU3 ANTITUBERCULAR DRUGS Pharmacology 3.pptx
U3 ANTITUBERCULAR DRUGS Pharmacology 3.pptx
Mayuri Chavan
 
History Of The Monastery Of Mor Gabriel Philoxenos Yuhanon Dolabani
History Of The Monastery Of Mor Gabriel Philoxenos Yuhanon DolabaniHistory Of The Monastery Of Mor Gabriel Philoxenos Yuhanon Dolabani
History Of The Monastery Of Mor Gabriel Philoxenos Yuhanon Dolabani
fruinkamel7m
 
Redesigning Education as a Cognitive Ecosystem: Practical Insights into Emerg...
Redesigning Education as a Cognitive Ecosystem: Practical Insights into Emerg...Redesigning Education as a Cognitive Ecosystem: Practical Insights into Emerg...
Redesigning Education as a Cognitive Ecosystem: Practical Insights into Emerg...
Leonel Morgado
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 9 THEO ĐƠN VỊ BÀI HỌC - GLOBAL SUCCESS - CẢ NĂM (TỪ...
BÀI TẬP BỔ TRỢ TIẾNG ANH 9 THEO ĐƠN VỊ BÀI HỌC - GLOBAL SUCCESS - CẢ NĂM (TỪ...BÀI TẬP BỔ TRỢ TIẾNG ANH 9 THEO ĐƠN VỊ BÀI HỌC - GLOBAL SUCCESS - CẢ NĂM (TỪ...
BÀI TẬP BỔ TRỢ TIẾNG ANH 9 THEO ĐƠN VỊ BÀI HỌC - GLOBAL SUCCESS - CẢ NĂM (TỪ...
Nguyen Thanh Tu Collection
 
Myopathies (muscle disorders) for undergraduate
Myopathies (muscle disorders) for undergraduateMyopathies (muscle disorders) for undergraduate
Myopathies (muscle disorders) for undergraduate
Mohamed Rizk Khodair
 
All About the 990 Unlocking Its Mysteries and Its Power.pdf
All About the 990 Unlocking Its Mysteries and Its Power.pdfAll About the 990 Unlocking Its Mysteries and Its Power.pdf
All About the 990 Unlocking Its Mysteries and Its Power.pdf
TechSoup
 
Final Evaluation.docx...........................
Final Evaluation.docx...........................Final Evaluation.docx...........................
Final Evaluation.docx...........................
l1bbyburrell
 
How to Manage Amounts in Local Currency in Odoo 18 Purchase
How to Manage Amounts in Local Currency in Odoo 18 PurchaseHow to Manage Amounts in Local Currency in Odoo 18 Purchase
How to Manage Amounts in Local Currency in Odoo 18 Purchase
Celine George
 
What is the Philosophy of Statistics? (and how I was drawn to it)
What is the Philosophy of Statistics? (and how I was drawn to it)What is the Philosophy of Statistics? (and how I was drawn to it)
What is the Philosophy of Statistics? (and how I was drawn to it)
jemille6
 
Cultivation Practice of Garlic in Nepal.pptx
Cultivation Practice of Garlic in Nepal.pptxCultivation Practice of Garlic in Nepal.pptx
Cultivation Practice of Garlic in Nepal.pptx
UmeshTimilsina1
 
antiquity of writing in ancient India- literary & archaeological evidence
antiquity of writing in ancient India- literary & archaeological evidenceantiquity of writing in ancient India- literary & archaeological evidence
antiquity of writing in ancient India- literary & archaeological evidence
PrachiSontakke5
 
MEDICAL BIOLOGY MCQS BY. DR NASIR MUSTAFA
MEDICAL BIOLOGY MCQS  BY. DR NASIR MUSTAFAMEDICAL BIOLOGY MCQS  BY. DR NASIR MUSTAFA
MEDICAL BIOLOGY MCQS BY. DR NASIR MUSTAFA
Dr. Nasir Mustafa
 
2025 The Senior Landscape and SET plan preparations.pptx
2025 The Senior Landscape and SET plan preparations.pptx2025 The Senior Landscape and SET plan preparations.pptx
2025 The Senior Landscape and SET plan preparations.pptx
mansk2
 
Botany Assignment Help Guide - Academic Excellence
Botany Assignment Help Guide - Academic ExcellenceBotany Assignment Help Guide - Academic Excellence
Botany Assignment Help Guide - Academic Excellence
online college homework help
 
How to Create Kanban View in Odoo 18 - Odoo Slides
How to Create Kanban View in Odoo 18 - Odoo SlidesHow to Create Kanban View in Odoo 18 - Odoo Slides
How to Create Kanban View in Odoo 18 - Odoo Slides
Celine George
 
Rock Art As a Source of Ancient Indian History
Rock Art As a Source of Ancient Indian HistoryRock Art As a Source of Ancient Indian History
Rock Art As a Source of Ancient Indian History
Virag Sontakke
 
*"Sensing the World: Insect Sensory Systems"*
*"Sensing the World: Insect Sensory Systems"**"Sensing the World: Insect Sensory Systems"*
*"Sensing the World: Insect Sensory Systems"*
Arshad Shaikh
 
Cultivation Practice of Onion in Nepal.pptx
Cultivation Practice of Onion in Nepal.pptxCultivation Practice of Onion in Nepal.pptx
Cultivation Practice of Onion in Nepal.pptx
UmeshTimilsina1
 
How to Configure Public Holidays & Mandatory Days in Odoo 18
How to Configure Public Holidays & Mandatory Days in Odoo 18How to Configure Public Holidays & Mandatory Days in Odoo 18
How to Configure Public Holidays & Mandatory Days in Odoo 18
Celine George
 
Origin of Brahmi script: A breaking down of various theories
Origin of Brahmi script: A breaking down of various theoriesOrigin of Brahmi script: A breaking down of various theories
Origin of Brahmi script: A breaking down of various theories
PrachiSontakke5
 
U3 ANTITUBERCULAR DRUGS Pharmacology 3.pptx
U3 ANTITUBERCULAR DRUGS Pharmacology 3.pptxU3 ANTITUBERCULAR DRUGS Pharmacology 3.pptx
U3 ANTITUBERCULAR DRUGS Pharmacology 3.pptx
Mayuri Chavan
 
History Of The Monastery Of Mor Gabriel Philoxenos Yuhanon Dolabani
History Of The Monastery Of Mor Gabriel Philoxenos Yuhanon DolabaniHistory Of The Monastery Of Mor Gabriel Philoxenos Yuhanon Dolabani
History Of The Monastery Of Mor Gabriel Philoxenos Yuhanon Dolabani
fruinkamel7m
 
Redesigning Education as a Cognitive Ecosystem: Practical Insights into Emerg...
Redesigning Education as a Cognitive Ecosystem: Practical Insights into Emerg...Redesigning Education as a Cognitive Ecosystem: Practical Insights into Emerg...
Redesigning Education as a Cognitive Ecosystem: Practical Insights into Emerg...
Leonel Morgado
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 9 THEO ĐƠN VỊ BÀI HỌC - GLOBAL SUCCESS - CẢ NĂM (TỪ...
BÀI TẬP BỔ TRỢ TIẾNG ANH 9 THEO ĐƠN VỊ BÀI HỌC - GLOBAL SUCCESS - CẢ NĂM (TỪ...BÀI TẬP BỔ TRỢ TIẾNG ANH 9 THEO ĐƠN VỊ BÀI HỌC - GLOBAL SUCCESS - CẢ NĂM (TỪ...
BÀI TẬP BỔ TRỢ TIẾNG ANH 9 THEO ĐƠN VỊ BÀI HỌC - GLOBAL SUCCESS - CẢ NĂM (TỪ...
Nguyen Thanh Tu Collection
 
Myopathies (muscle disorders) for undergraduate
Myopathies (muscle disorders) for undergraduateMyopathies (muscle disorders) for undergraduate
Myopathies (muscle disorders) for undergraduate
Mohamed Rizk Khodair
 
All About the 990 Unlocking Its Mysteries and Its Power.pdf
All About the 990 Unlocking Its Mysteries and Its Power.pdfAll About the 990 Unlocking Its Mysteries and Its Power.pdf
All About the 990 Unlocking Its Mysteries and Its Power.pdf
TechSoup
 
Final Evaluation.docx...........................
Final Evaluation.docx...........................Final Evaluation.docx...........................
Final Evaluation.docx...........................
l1bbyburrell
 
How to Manage Amounts in Local Currency in Odoo 18 Purchase
How to Manage Amounts in Local Currency in Odoo 18 PurchaseHow to Manage Amounts in Local Currency in Odoo 18 Purchase
How to Manage Amounts in Local Currency in Odoo 18 Purchase
Celine George
 
Ad

Web mining slides

  • 2. Web Mining Outline • Goal – – Examine the use of data mining on the World Wide Web. • Outline - – Introduction. – Web Content Mining. – Web Structure Mining. – Web Usage Mining. Mahavir Advaya - Ruia College 2
  • 3. Web Mining Issues • Size – – >350 million pages (1999). – Grows at about 1 million pages a day. – Google indexes 3 billion documents. • Diverse types of data. Mahavir Advaya - Ruia College 3
  • 4. Web Data • Web pages. • Intra-page structures. • Inter-page structures. • Usage data. • Supplemental data – – Profiles. – Registration information. – Cookies. Mahavir Advaya - Ruia College 4
  • 5. Web Mining Taxonomy Mahavir Advaya - Ruia College 5
  • 6. Web Content Mining • Extends work of basic search engines. • Search Engines – – IR application. – Keyword based. – Similarity between query and document. – Crawlers. – Indexing. – Profiles. – Link analysis. Mahavir Advaya - Ruia College 6
  • 7. Crawlers (Spider) • Robot (spider), a program, traverses the hypertext structure in the Web. – Collect information from visited pages. – Used to construct indexes for search engines. • Traditional Crawler – visits entire Web (?) and replaces index. • Periodic Crawler – visits portions of the Web and updates subset of index. • Incremental Crawler – selectively searches the Web and incrementally modifies index. • Focused Crawler – visits pages related to a particular subject. Mahavir Advaya - Ruia College 7
  • 8. Focused Crawler • Only visit links from a page if that page is determined to be relevant. • Classifier is static after learning phase. • Components – – Hypertext Classifier which assigns relevance score to each page based on crawl topic. – Distiller to identify hub pages. – Crawler visits pages to based on crawler and distiller scores. Mahavir Advaya - Ruia College 8
  • 9. Focused Crawler • Classifier to related documents to topics. • Classifier also determines how useful outgoing links are. • Hub Pages contain links to many relevant pages. Must be visited even if not high relevance score. Mahavir Advaya - Ruia College 9
  • 10. Focused Crawler Mahavir Advaya - Ruia College 10
  • 11. Context Focused Crawler • Context Graph – – Context graph created for each seed document . – Root is the seed document. – Nodes at each level show documents with links to documents at next higher level. – Updated during crawl itself . • Approach – 1. Construct context graph and classifiers using seed documents as training data. 2. Perform crawling using classifiers and context graph created. Mahavir Advaya - Ruia College 11
  • 12. Context Graph Mahavir Advaya - Ruia College 12
  • 13. Virtual Web View • Approach to handle unstructured data. • Multiple Layered DataBase (MLDB) built on top of the Web. • Each layer of the database is more generalized (and smaller) and centralized than the one beneath it. • Upper layers of MLDB are structured and can be accessed with SQL type queries. • Does not require the use of spiders (Crawlers). • Translation tools convert Web documents to XML. • Extraction tools extract desired information to place in first layer of MLDB. Convert web document to XML. • Higher levels contain more summarized data obtained through generalizations of the lower levels. Mahavir Advaya - Ruia College 13
  • 14. WebML • Web data Mining Query Language. • Provides data mining operations on MLDB. • Major feature – four operations – – COVERS: one concept covers another if it is higher in the hierarchy. – COVERED BY: reverse of COVERS, reverses the descendents. – LIKE: concept is a synonym. – CLOSE TO: One concept is close to another if it is a sibling in the hierarchy. Mahavir Advaya - Ruia College 14
  • 15. WebML • Example – – Find all the documents at the level of www.engr.smu.edu. • Query – SELECT * FROM document in ‘ ‘ www.engr.smu.edu ‘ ‘ WHERE ONE OF keywords COVERS ‘ ‘ cat ‘ ‘ Mahavir Advaya - Ruia College 15
  • 16. Personalization • Example of Web Content Mining. • Web access or contents tuned to better fit the desires of each user. • With personalization, advertisements to be sent to the customers based on specific knowledge. • Goal – Make the customer purchase something. • Three basic types – – Manual techniques – identify user’s preferences based on profiles or demographics. – Collaborative filtering identifies preferences based on ratings from similar users. – Content based filtering retrieves pages based on similarity between pages and user profiles. Mahavir Advaya - Ruia College 16
  • 17. Web Structure Mining • Create a model of the Web organization or a portion of it. • Mine structure (links, graph) of the Web. • Techniques – – PageRank. – CLEVER. • May be combined with content mining to more effectively retrieve important pages. Mahavir Advaya - Ruia College 17
  • 18. PageRank • Used by Google. • Prioritize pages returned from search by looking at Web structure. • Importance of page is calculated based on number of pages which point to it – Backlinks. • Weighting is used to provide more importance to backlinks coming form important pages. Mahavir Advaya - Ruia College 18
  • 19. PageRank (cont’d) • PR(p) = c (PR(1)/N1 + … + PR(n)/Nn) – PR(i): PageRank for a page i which points to target page p. – Ni: number of links coming out of page i. – c: constant value between 0 and 1 used for normalization. Mahavir Advaya - Ruia College 19
  • 20. CLEVER • System developed by IBM. • Finding both authoritative and hub pages. • Authoritative Pages – – Authors define an authority as the “best source” for the request. o Highly important pages. o Best source for requested information. • Hub Pages – – Contain links to highly important pages. – Clever, identifies authoritative and hub pages by creating weights. Mahavir Advaya - Ruia College 20
  • 21. HITS • Hyperlink-Induces Topic Search. • Finds Hubs and Authoritative Pages. • Two components – – Based on a set of keywords, find set of relevant pages – R. – Identify hub and authority pages for these. o Expand R to a base set, B, of pages linked to or from R. o Calculate weights for authorities and hubs. • Pages with highest ranks in R are returned. Mahavir Advaya - Ruia College 21
  • 22. HITS Algorithm Mahavir Advaya - Ruia College 22
  • 23. Web Usage Mining • Extends work of basic search engines. • Performs mining on Web usage data or Web logs. • Search Engines – – IR application. – Keyword based. – Similarity between query and document. – Crawlers. – Indexing. – Profiles. – Link analysis. Mahavir Advaya - Ruia College 23
  • 24. Web Usage Mining Applications • Personalization – tracking of previously accessed pages. • Determining frequent access behavior for users. • Improve structure of a site’s Web pages. • Aid in caching and prediction of future page references. • Improve design of individual pages. • Improve effectiveness of e-commerce (sales and advertising). • Gathering Statistics – considering accessed pages may or may not be viewed as part web mining . Mahavir Advaya - Ruia College 24
  • 25. Web Usage Mining Activities • Preprocessing Web log – – Cleanse. – Remove extraneous information. – Sessionize – o Session: Sequence of pages referenced by one user at a sitting. • Pattern Discovery – – Count patterns that occur in sessions. – Pattern is sequence of pages references in session. – Similar to association rules – o Transaction: session. o Itemset: pattern (or subset). o Order is important. • Pattern Analysis. Mahavir Advaya - Ruia College 25
  • 26. ARs in Web Mining • Web Mining – – Content. – Structure. – Usage. • Frequent patterns of sequential page references in Web searching. • Uses – – Caching – Clustering users – Develop user profiles – Identify important pages Mahavir Advaya - Ruia College 26
  • 27. Web Usage Mining Issues • Identification of exact user not possible. • Exact sequence of pages referenced by a user not possible due to caching. • Session not well defined. • Security, privacy, and legal issues. Mahavir Advaya - Ruia College 27
  • 28. Web Log Cleansing • Replace source IP address with unique but non- identifying ID. • Replace exact URL of pages referenced with unique but non-identifying ID. • Delete error records and records containing not page data (such as figures and code). Mahavir Advaya - Ruia College 28
  • 29. Sessionizing • Divide Web log into sessions. • Two common techniques – – Number of consecutive page references from a source IP address occurring within a predefined time interval (e.g. 25 minutes). – All consecutive page references from a source IP address where the interclick time is less than a predefined threshold. Mahavir Advaya - Ruia College 29
  • 30. Data Structures • Keep track of patterns identified during Web usage mining process. • Common techniques – – Trie. – Suffix Tree. – Generalized Suffix Tree. – WAP Tree. Mahavir Advaya - Ruia College 30
  • 31. Trie vs. Suffix Tree • Trie – – Rooted tree. – Edges labeled which character (page) from pattern. – Path from root to leaf represents pattern. • Suffix Tree – – Single child collapsed with parent. Edge contains labels of both prior edges. Mahavir Advaya - Ruia College 31
  • 32. Trie and Suffix Tree A L O G ALOG Mahavir Advaya - Ruia College 32
  • 33. Generalized Suffix Tree • Suffix tree for multiple sessions. • Contains patterns from all sessions. • Maintains count of frequency of occurrence of a pattern in the node. • WAP Tree – – Web Access Pattern. – Compressed version of generalized suffix tree. – Tree stores sequences and their counts. Mahavir Advaya - Ruia College 33
  • 34. Types of Patterns • Algorithms have been developed to discover different types of patterns. • Properties – – Ordered – Characters (pages) must occur in the exact order in the original session. – Duplicates – Duplicate characters are allowed in the pattern. – Consecutive – All characters in pattern must occur consecutive in given session. – Maximal – Not subsequence of another pattern. Mahavir Advaya - Ruia College 34
  • 35. Pattern Types • Association Rules. • Episodes. • Sequential Patterns. • Forward Sequences. • Maximal Frequent Sequences. Mahavir Advaya - Ruia College 35
  • 36. Questions??? • Write a short note on Web Content Mining. • What is Web Mining? Give web mining taxonomy. • What do you mean by Web Usage Mining? Explain rule with examples. • Write a short note on Harvest System. • Define crawler. State and explain different types of crawlers. • Write a short note on crawlers. • Give taxonomy of web mining activities. For what purpose web usage mining is used? What activities are involved in web usage mining? • What do you understand by the term “Web Usage Mining”. • Explain the term crawlers in web mining. • Discuss the importance of establishing a standardized WebML. • Write a short note on web structure mining. Mahavir Advaya - Ruia College 36
  翻译: