Mastering Web
Scraping with JSoup:
Unlocking the Secrets
of HTML Parsing
Shrasti Gupta
Automation Consultant
Test Automation Competency
Lack of etiquette and manners is a huge turn off.
KnolX Etiquettes
• Punctuality
Join the session 5 minutes prior to the session start time. We start on
time and conclude on time!
• Feedback
Make sure to submit constructive feedback for all sessions, as it is very
helpful for the presenter.
• Silent Mode
Keep your mobile devices in silent mode. Feel free to step out of the session
if you need to attend an urgent call.
• Avoid Disturbance
Avoid unwanted chit chat during the session.
1. What is Web Scraping
2. What is JSoup and Why?
3. Setting up with JSoup
4. Understanding the Document Object Model (DOM)
5. Navigating the DOM with Jsoup
6. Parsing HTML with Jsoup
7. Extracting data with JSoup
8. Demo
Introduction to Web
Scraping with JSoup
What is Web Scraping
• Web scraping, also known as web data extraction, is the process of
automatically extracting information from websites using specialized tools and
software. Web scraping provides access to valuable data that may not be available
through APIs or databases.
• It enables the collection of large volumes of data from multiple sources
efficiently, suitable for various applications like market research and competitive
analysis.
Use cases of web scraping -
Competitive Analysis - Extracting data such as product pricing, features, and customer
reviews from competitor websites for analysis.
Market Research - Collecting data on consumer preferences, product demand, and
pricing strategies from various sources across the web.
Data Collection - Scraping websites to collect data for research, analysis, and
modelling purposes, such as gathering weather data, financial information, or
demographic statistics.
Content Aggregation - Scraping multiple websites for new and changed content and
gathering it in one place, to stay informed about industry developments.
What is JSoup
JSoup is a Java library designed for parsing, manipulating, and
extracting data from HTML documents. It provides a convenient API
for working with HTML, allowing developers to perform tasks such as
parsing, traversing the DOM tree, and extracting specific elements or
data.
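JSoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers.
It does not execute JavaScript, so content that a page renders client-side is not available to it.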
• Scrape and parse HTML from a URL, file, or string
• Find and extract data, using DOM traversal or CSS selectors
• Manipulate the HTML elements, attributes, and text.
Key Features Of Jsoup
HTML Parsing:
JSoup simplifies the process of parsing HTML documents, converting them into a structured Document Object
Model (DOM) representation.
DOM Traversal:
It enables developers to navigate the DOM tree, accessing and manipulating HTML elements based on their
relationships and properties.
CSS Selection:
JSoup supports CSS-like selectors for targeting specific elements within HTML documents, facilitating easy
extraction of data.
Element Manipulation:
Developers can modify HTML elements, attributes, and content using JSoup's API, enabling dynamic
manipulation of web pages.
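The features above can be sketched in a few lines. This is a minimal illustration, assuming the jsoup library is on the classpath; the class name, method name, HTML snippet, and attribute values are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ManipulationSketch {

    // Parses a small HTML snippet, edits the first link, and returns that link's HTML.
    static String markExternal(String html) {
        Document doc = Jsoup.parse(html);           // HTML parsing: string -> Document
        Element link = doc.selectFirst("a[href]");  // CSS selection: first <a> that has an href
        if (link == null) {
            return "";                              // no link found in the input
        }
        link.attr("rel", "nofollow");               // set (or overwrite) an attribute
        link.addClass("external");                  // add a CSS class
        link.text("Visit site");                    // replace the element's text
        return link.outerHtml();
    }

    public static void main(String[] args) {
        System.out.println(markExternal("<p><a href=\"https://example.com\">old text</a></p>"));
    }
}
```

The edits apply only to the in-memory Document. Nothing is written back to the website.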
Why JSoup?
Jsoup is a popular Java library for web scraping, and there are several reasons why it's a preferred choice:
• Ease of Use: Jsoup provides a simple and intuitive API for parsing HTML documents, making it easy for
developers to extract the data they need from web pages.
• HTML Parsing: Jsoup handles HTML parsing efficiently, allowing you to navigate the HTML structure, select
elements based on CSS selectors, and manipulate the DOM easily.
• Security: Jsoup is designed with security in mind. It helps prevent common vulnerabilities such as cross-site
scripting (XSS) attacks by cleaning untrusted HTML against a safelist of permitted tags and attributes.
• Open Source: Jsoup is an open-source library distributed under the MIT license, so it's free to use, and it
has an active community of developers contributing to its improvement.
• Java Integration: If you're working in a Java environment, Jsoup integrates seamlessly with your existing Java
codebase. This makes it a natural choice for Java developers who need to incorporate web scraping into their
projects.
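The security point can be illustrated with a short sketch. It assumes jsoup 1.14.2 or later, where the safelist class is named Safelist (older releases call it Whitelist); the class name and sample input are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class SanitizeSketch {

    // Keeps only the simple text-formatting tags permitted by Safelist.basic()
    // and drops everything else, including <script> elements and event-handler attributes.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String dirty = "<p onclick=\"steal()\">Hi<script>alert(1)</script></p>";
        System.out.println(sanitize(dirty));
    }
}
```

Safelist.basic() is one of several presets. Which preset fits depends on how much markup the application wants to allow.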
Setting Up JSoup
Adding JSoup to your Java project is straightforward with a build tool such as Maven. By including the JSoup
dependency, you gain access to its HTML parsing and data extraction capabilities.
Dependency:
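A typical Maven declaration for JSoup looks like the following. The coordinates org.jsoup:jsoup are the library's published ones; the version number is only an example, so check Maven Central for the current release.

```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <!-- example version, assumed for illustration; use the latest release -->
    <version>1.17.2</version>
</dependency>
```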
Understanding Document Object Model (DOM)
• The Document Object Model (DOM) is a representation of
the structure of an HTML document as a tree of nodes.
Each node corresponds to an element, attribute, or piece of
text in the HTML document. Understanding the DOM is
crucial for effective web scraping, as it allows us to navigate
and manipulate the structure of web pages.
• The Document Object Model (DOM) connects web pages to
scripts or programming languages by representing the
structure of a document, such as the HTML of a web page,
in memory.
• The DOM represents a document with a logical tree. Each
branch of the tree ends in a node, and each node contains
objects. DOM methods allow programmatic access to the
tree. With them, you can change the document's structure,
style, or content.
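To make the tree structure concrete, the sketch below parses a snippet with jsoup and prints one line per node, indented by depth. It assumes jsoup on the classpath and Java 11 or later (for String.repeat); the class and method names are hypothetical. Note that jsoup exposes attributes on their element rather than as child nodes, so only elements and text appear in the outline.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;

class DomTreeSketch {

    // Appends one line per node, indented two spaces per level of depth.
    static void walk(Node node, int depth, StringBuilder out) {
        out.append("  ".repeat(depth)).append(node.nodeName()).append('\n');
        for (Node child : node.childNodes()) {
            walk(child, depth + 1, out);
        }
    }

    // Parses the HTML and returns the indented outline of its node tree.
    static String outline(String html) {
        Document doc = Jsoup.parse(html);
        StringBuilder out = new StringBuilder();
        walk(doc, 0, out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(outline("<p>Hello <b>world</b></p>"));
    }
}
```

The parser adds the html, head, and body elements even when the input omits them, which is why they show up in the outline.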
Navigating the DOM
The Document Object Model (DOM) serves as the interface between web content and scripts, providing a
structured representation of the HTML document. Effective DOM traversal is essential for accessing and
manipulating elements within the document. In this section, we'll explore various methods for navigating the
DOM to locate and interact with desired elements.
Traversal methods:
• Selecting Parent Elements
• Selecting Child Elements
• Exploring Sibling Elements
• Descending into Descendant Elements
Selecting Parent Elements:
Accessing the parent of a particular DOM element is fundamental for various operations, such as styling or modifying its
content. In JSoup, parent() returns the immediate parent of the selected element and parents() returns all of its
ancestors, allowing traversal to higher levels of the DOM hierarchy.
Selecting Child Elements:
Web pages are like big family trees, with elements having child elements. In JSoup, children() returns the list of child
elements of the element we are interested in, and child(index) returns a single one. It's like getting a list of all the kids in a family.
Exploring Sibling Elements:
Sibling elements share the same parent node and offer opportunities for targeted manipulation or traversal within a specific
context. In JSoup, nextElementSibling() and previousElementSibling() navigate to adjacent elements at the same level
in the DOM tree, while nextSibling() and previousSibling() also return non-element nodes such as text.
Descending into Descendant Elements:
Traversing through descendant elements allows for deep exploration within the DOM tree, enabling access to nested
structures and nested content. The browser DOM offers querySelector and querySelectorAll for this; the JSoup
equivalents are selectFirst() and select(), which select elements based on CSS selectors.
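The four traversal directions can be tried on a small menu. This is a sketch assuming jsoup on the classpath; the class name, HTML, and summary format are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class TraversalSketch {

    static final String HTML =
            "<ul id=\"menu\">"
          + "<li>Home</li>"
          + "<li class=\"current\">Docs <a href=\"/api\">API</a></li>"
          + "<li>Blog</li>"
          + "</ul>";

    // Starts at the li.current element and looks up, down, and sideways from it.
    static String summarize(String html) {
        Document doc = Jsoup.parse(html);
        Element current = doc.selectFirst("li.current");

        Element parent = current.parent();                     // parent: the <ul>
        Elements children = parent.children();                 // children: the three <li> elements
        Element previous = current.previousElementSibling();   // sibling before: "Home"
        Element next = current.nextElementSibling();           // sibling after: "Blog"
        Elements links = parent.select("a");                   // descendants matching a CSS selector

        return "parent=" + parent.id()
                + " children=" + children.size()
                + " prev=" + previous.text()
                + " next=" + next.text()
                + " links=" + links.first().text();
    }

    public static void main(String[] args) {
        System.out.println(summarize(HTML));
    }
}
```

In real pages selectFirst, the sibling methods, and first() can return null when nothing matches, so production code should check before dereferencing.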
Parsing HTML with Jsoup
HTML Structure
HTML (Hypertext Markup Language) is the standard language
for creating web pages.
It uses a markup structure composed of elements, tags,
attributes, and content to define the structure and content of
web documents; visual appearance is usually left to CSS.
Elements and Tags:
• Elements: Fundamental building blocks of HTML documents,
representing different types of content.
• Tags: Enclosed in angle brackets (<>), define the beginning
and end of HTML elements.
Parsing HTML with Jsoup
JSoup offers an array of robust features for parsing HTML documents with ease and precision. Whether you're
dealing with simple or complex HTML structures, JSoup's flexible API empowers you to efficiently extract the
desired data.
• Loading HTML Documents:
Jsoup can load HTML documents from various sources, including URLs, files, and
strings. Each source has its own entry point (Jsoup.connect for URLs, Jsoup.parse for files and strings), and
each returns a Document that is ready to query.
• CSS Selectors:
JSoup supports CSS selectors, allowing you to target specific elements within the HTML document based on
their classes, IDs, attributes, or hierarchy. This granular selection capability enables precise data extraction
from the DOM.
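The three loading paths and a CSS selector query might look like the sketch below. It assumes jsoup on the classpath; the class name, user-agent string, and timeout are hypothetical choices, and only the string variant runs in main so that the example needs no network or file access.

```java
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class LoadingSketch {

    // From a string already held in memory.
    static Document fromString(String html) {
        return Jsoup.parse(html);
    }

    // From a local file, with an explicit character set.
    static Document fromFile(File file) throws IOException {
        return Jsoup.parse(file, "UTF-8");
    }

    // From a URL over HTTP; the user agent and timeout are illustrative values.
    static Document fromUrl(String url) throws IOException {
        return Jsoup.connect(url)
                .userAgent("example-scraper/0.1")
                .timeout(10_000) // milliseconds
                .get();
    }

    public static void main(String[] args) {
        Document doc = fromString(
                "<title>Demo</title><div id=\"main\"><p class=\"intro\">hi</p><p>bye</p></div>");
        System.out.println(doc.title());                       // page title
        System.out.println(doc.select("#main p.intro").text()); // by id, hierarchy, and class
    }
}
```

When scraping a live site with fromUrl, check the site's terms and robots.txt first and keep the request rate modest.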
Extracting Data with JSoup
In the dynamic landscape of web development, accessing and extracting data from websites is a fundamental
task. JSoup, a Java library, offers powerful tools for parsing HTML and manipulating the Document Object Model
(DOM) of a web page. In this slide, we'll delve into the world of data extraction with JSoup, exploring its
capabilities and demonstrating how to harness its potential to gather valuable information from the web.
Data extraction techniques:
• Text Extraction
• Attribute Extraction
• HTML Content Extraction
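In JSoup, each of the three techniques maps to a method on Element. The sketch below assumes jsoup on the classpath; the class name, HTML snippet, and base URI are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ExtractionSketch {

    static final String HTML =
            "<div id=\"card\"><h2>Blue <em>Widget</em></h2><a href=\"/items/42\">Details</a></div>";
    static final String BASE_URI = "https://example.com/";

    // The base URI lets jsoup resolve relative links to absolute ones.
    static Document load() {
        return Jsoup.parse(HTML, BASE_URI);
    }

    // Text extraction: visible text of the element and its descendants, tags stripped.
    static String headingText(Document doc) {
        return doc.selectFirst("h2").text();
    }

    // Attribute extraction: the raw attribute value as written in the markup.
    static String rawHref(Document doc) {
        return doc.selectFirst("a").attr("href");
    }

    // Attribute extraction: the same link resolved against the base URI.
    static String absoluteHref(Document doc) {
        return doc.selectFirst("a").absUrl("href");
    }

    // HTML content extraction: the element's inner markup.
    static String headingHtml(Document doc) {
        return doc.selectFirst("h2").html();
    }

    public static void main(String[] args) {
        Document doc = load();
        System.out.println(headingText(doc));
        System.out.println(rawHref(doc));
        System.out.println(absoluteHref(doc));
        System.out.println(headingHtml(doc));
    }
}
```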
Text Extraction
Text extraction retrieves textual information from a data source such as a web page, document, or
database. Common methods include:
Regular Expressions (Regex):
- A powerful tool for pattern matching, regex enables the identification and extraction of specific text strings
based on predefined patterns or rules.
HTML Parsing Libraries:
- Utilizing libraries like Beautiful Soup in Python or Jsoup in Java, developers can navigate through HTML
documents, pinpointing and extracting text content enclosed within designated tags.
Optical Character Recognition (OCR):
- When dealing with scanned documents or images containing text, OCR algorithms come into play. These
algorithms analyse the image, recognize characters, and convert them into editable text.
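As an illustration of the regex approach, the sketch below pulls price strings out of plain text using only the Java standard library. The price format and class name are hypothetical. Regex tends to work best on text that has already been extracted from the page (for example, the result of an HTML parser's text method) rather than on raw HTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexExtractionSketch {

    // Matches a dollar sign, one or more digits, and an optional two-digit fraction.
    private static final Pattern PRICE = Pattern.compile("\\$\\d+(?:\\.\\d{2})?");

    // Returns every price-like substring in the order it appears.
    static List<String> findPrices(String text) {
        List<String> prices = new ArrayList<>();
        Matcher matcher = PRICE.matcher(text);
        while (matcher.find()) {
            prices.add(matcher.group());
        }
        return prices;
    }

    public static void main(String[] args) {
        System.out.println(findPrices("Widget: $19.99, gadget: $5, total due $24.99"));
        // prints [$19.99, $5, $24.99]
    }
}
```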
Attributes Extraction
Attributes enrich the contextual understanding of data by providing metadata associated
with elements. Extracting attributes facilitates categorization, filtering, and analysis. Common techniques
include:
XPath Queries:
- XPath enables the selection of elements based on their attributes within XML or HTML documents. By
crafting XPath queries, developers can precisely target elements and retrieve attribute values.
CSS Selectors:
- Similar to XPath, CSS selectors allow for the identification and extraction of elements based on their
attributes. CSS selectors provide a concise syntax for specifying attribute-based criteria.
API Integration:
- Some data sources offer APIs (Application Programming Interfaces) that expose structured data along with
associated attributes. Integrating with these APIs simplifies attribute extraction and ensures data consistency.
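An XPath query that selects attribute values can be sketched with the XPath support built into the JDK. Note the assumptions: this uses javax.xml rather than JSoup, it requires well-formed XHTML input (real-world HTML often is not), and the class name, markup, and query are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class XPathAttributeSketch {

    // Returns the href of every <a> element whose class attribute equals "ext".
    static List<String> externalHrefs(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList attrs = (NodeList) xpath.evaluate(
                    "//a[@class='ext']/@href", doc, XPathConstants.NODESET);

            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                hrefs.add(attrs.item(i).getNodeValue());
            }
            return hrefs;
        } catch (ParserConfigurationException | SAXException
                | IOException | XPathExpressionException e) {
            throw new IllegalArgumentException("Input is not well-formed or query failed", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<div>"
                + "<a class=\"ext\" href=\"https://a.example\">A</a>"
                + "<a href=\"/local\">Local</a>"
                + "<a class=\"ext\" href=\"https://b.example\">B</a>"
                + "</div>";
        System.out.println(externalHrefs(xhtml));
    }
}
```

The CSS selector a.ext combined with an attribute read expresses the same query more compactly, which is one reason selector-based extraction is common for HTML.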
HTML Content Extraction
HTML content extraction involves capturing the structure and layout of web pages, including text, images, links,
and other multimedia elements. Techniques for extracting HTML content include:
DOM Traversal:
- Traversing the Document Object Model (DOM) tree of a web page enables the extraction of specific HTML
elements and their contents. DOM traversal libraries like Cheerio (for Node.js) provide an intuitive interface
for this purpose.
Web Scraping Frameworks:
- Frameworks such as Scrapy (Python) and Puppeteer (JavaScript) offer robust tools for web scraping,
allowing developers to extract HTML content programmatically while handling various complexities like
pagination and dynamic content.
Browser Extensions:
- For more user-centric applications, browser extensions (for example, Chrome extensions using content scripts)
can extract HTML content directly from the user's browsing session. This approach is particularly useful for
tasks like content curation and data aggregation.
DEMO
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
James Anderson
 
Developing System Infrastructure Design Plan.pptx
Developing System Infrastructure Design Plan.pptxDeveloping System Infrastructure Design Plan.pptx
Developing System Infrastructure Design Plan.pptx
wondimagegndesta
 
Top 5 Benefits of Using Molybdenum Rods in Industrial Applications.pptx
Top 5 Benefits of Using Molybdenum Rods in Industrial Applications.pptxTop 5 Benefits of Using Molybdenum Rods in Industrial Applications.pptx
Top 5 Benefits of Using Molybdenum Rods in Industrial Applications.pptx
mkubeusa
 
IT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information TechnologyIT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information Technology
SHEHABALYAMANI
 
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
SOFTTECHHUB
 
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
Ivano Malavolta
 
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Markus Eisele
 
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Wonjun Hwang
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdfKit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Wonjun Hwang
 
fennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solutionfennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solution
shallal2
 
Ad

Mastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing

  • 1. Mastering Web Scraping with JSoup: Unlocking the Secrets of HTML Parsing Shrasti Gupta Automation Consultant Test Automation Competency
  • 2. Lack of etiquette and manners is a huge turn off. KnolX Etiquettes: Punctuality - Join the session 5 minutes prior to the session start time. We start on time and conclude on time! Feedback - Make sure to submit constructive feedback for all sessions, as it is very helpful for the presenter. Silent Mode - Keep your mobile devices in silent mode; feel free to step out of the session if you need to attend an urgent call. Avoid Disturbance - Avoid unwanted chit-chat during the session.
  • 3. 1. What is Web Scraping 2. What is JSoup and Why? 3. Setting up with JSoup 4. Understanding the Document Object Model (DOM) 5. Navigating the DOM with JSoup 6. Parsing HTML with JSoup 7. Extracting data with JSoup 8. Demo
  • 5. What is Web Scraping • Web scraping, also known as web data extraction, is the process of automatically extracting information from websites using specialized tools and software. Web scraping provides access to valuable data that may not be available through APIs or databases. • It enables the collection of large volumes of data from multiple sources efficiently, suitable for various applications like market research and competitive analysis. Use cases of web scraping - Competitive Analysis - Extracting data such as product pricing, features, and customer reviews from competitor websites for analysis. Market Research - Collecting data on consumer preferences, product demand, and pricing strategies from various sources across the web. Data Collection - Scraping websites to collect data for research, analysis, and modelling purposes, such as gathering weather data, financial information, or demographic statistics. Content Aggregation - Scraping websites for updates and changes in content, ensuring timely updates and staying informed about industry developments.
  • 7. What is JSoup JSoup is a Java library designed for parsing, manipulating, and extracting data from HTML documents. It provides a convenient API for working with HTML, allowing developers to perform tasks such as parsing, traversing the DOM tree, and extracting specific elements or data. Jsoup parses HTML to the same DOM as modern browsers. • Scrape and parse HTML from a URL, file, or string • Find and extract data, using DOM traversal or CSS selectors • Manipulate the HTML elements, attributes, and text.
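The two core capabilities listed above, parsing HTML from a string and finding data with a CSS selector, can be sketched as follows. This is a minimal illustration, not code from the original deck: the class name, the sample markup, and the selector are all hypothetical, and JSoup is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

final class ParseAndSelectSketch {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
        "<html><head><title>Book list</title></head><body>"
      + "<ul id=\"books\">"
      + "<li class=\"book\">Effective Java</li>"
      + "<li class=\"book\">Java Concurrency in Practice</li>"
      + "</ul></body></html>";

    // Parses the HTML string and returns the text of every matching list item.
    static List<String> bookTitles(String html) {
        Document doc = Jsoup.parse(html);
        List<String> titles = new ArrayList<>();
        for (Element li : doc.select("ul#books > li.book")) {
            titles.add(li.text());
        }
        return titles;
    }

    public static void main(String[] args) {
        System.out.println(Jsoup.parse(HTML).title());
        System.out.println(bookTitles(HTML));
    }
}
```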
  • 8. Key Features Of Jsoup HTML Parsing: JSoup simplifies the process of parsing HTML documents, converting them into a structured Document Object Model (DOM) representation. DOM Traversal: It enables developers to navigate the DOM tree, accessing and manipulating HTML elements based on their relationships and properties. CSS Selection: JSoup supports CSS-like selectors for targeting specific elements within HTML documents, facilitating easy extraction of data. Element Manipulation: Developers can modify HTML elements, attributes, and content using JSoup's API, enabling dynamic manipulation of web pages.
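Element manipulation, the last feature above, might look like the sketch below. The helper name and the sample link are assumptions made for illustration; the calls used (selectFirst, attr, addClass, text) are standard JSoup Element methods.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class ManipulationSketch {
    // Hypothetical helper: rewrites the first link found in the document.
    static Document markFirstLink(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.selectFirst("a");
        if (link != null) {
            link.attr("rel", "nofollow");   // set (or overwrite) an attribute
            link.addClass("external");      // add a CSS class
            link.text("Visit the site");    // replace the element's text content
        }
        return doc;
    }

    public static void main(String[] args) {
        Document doc = markFirstLink("<p><a href=\"https://example.com\">old text</a></p>");
        System.out.println(doc.body().html());
    }
}
```

The changes are made on the in-memory document only; nothing is sent back to the original page.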
  • 9. Why JSoup? JSoup is a popular Java library for web scraping, and there are several reasons why it's a preferred choice: • Ease of Use: JSoup provides a simple and intuitive API for parsing HTML documents, making it easy for developers to extract the data they need from web pages. • HTML Parsing: JSoup handles HTML parsing efficiently, allowing you to navigate the HTML structure, select elements based on CSS selectors, and manipulate the DOM easily. • Security: JSoup is designed with security in mind. It helps prevent cross-site scripting (XSS) attacks by cleaning untrusted HTML against a safelist of permitted tags and attributes. • Open Source: JSoup is an open-source library, which means it's free to use and is maintained by a community of contributors. • Java Integration: If you're working in a Java environment, JSoup integrates directly with your existing Java codebase. This makes it a natural choice for Java developers who need to incorporate web scraping into their projects.
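The security point can be illustrated with JSoup's cleaner. The sketch below assumes a JSoup version that ships org.jsoup.safety.Safelist (older releases call the same class Whitelist); the class name and the sample input are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

final class SanitizeSketch {
    // Keeps only the simple text formatting allowed by Safelist.basic();
    // script elements, event-handler attributes, and javascript: links are dropped.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String dirty = "<p>Hello <b>world</b><script>alert('xss')</script></p>"
                     + "<a href=\"javascript:alert(1)\" onclick=\"steal()\">link</a>";
        System.out.println(sanitize(dirty));
    }
}
```

Safelist.basic() is only one of several presets; which tags to allow depends on where the cleaned HTML will be displayed.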
  • 11. Setting Up JSoup Adding JSoup to your Java project is straightforward with a build tool such as Maven. By including the JSoup dependency, you gain access to its HTML parsing and data extraction capabilities. Dependency: groupId org.jsoup, artifactId jsoup.
  • 13. Understanding Document Object Model (DOM) • The Document Object Model (DOM) is a representation of the structure of an HTML document as a tree of nodes. Each node corresponds to an element, attribute, or piece of text in the HTML document. Understanding the DOM is crucial for effective web scraping, as it allows us to navigate and manipulate the structure of web pages. • The DOM connects web pages to scripts or programming languages by representing the structure of a document, such as the HTML of a web page, in memory. • The DOM represents a document with a logical tree. Each branch of the tree ends in a node, and each node contains objects. DOM methods allow programmatic access to the tree. With them, you can change the document's structure, style, or content.
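To make the tree structure concrete, the sketch below parses a small hypothetical document and prints its element nodes as an indented outline, one level of indentation per level of nesting. The class and method names are illustrative; only element nodes are listed, so text nodes are not shown.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

final class DomOutlineSketch {
    // Returns one line per element under <body>, indented by tree depth.
    static List<String> outline(String html) {
        List<String> lines = new ArrayList<>();
        walk(Jsoup.parse(html).body(), 0, lines);
        return lines;
    }

    // Depth-first walk: record this element, then recurse into its child elements.
    private static void walk(Element el, int depth, List<String> lines) {
        lines.add("  ".repeat(depth) + el.tagName());
        for (Element child : el.children()) {
            walk(child, depth + 1, lines);
        }
    }

    public static void main(String[] args) {
        String html = "<body><div><h1>Title</h1><p>One <b>bold</b></p></div></body>";
        outline(html).forEach(System.out::println);
    }
}
```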
  • 15. Navigating the DOM The Document Object Model (DOM) serves as the interface between web content and scripts, providing a structured representation of the HTML document. Effective DOM traversal is essential for accessing and manipulating elements within the document. In this section, we'll explore various methods for navigating the DOM to locate and interact with desired elements. Traversal methods: • Selecting Parent Elements • Selecting Child Elements • Exploring Sibling Elements • Descending into Descendant Elements
  • 16. Selecting Parent Elements: Accessing the parent of a particular DOM element is fundamental for various operations, such as styling or modifying its content. These methods provide direct access to the immediate parent of the selected element, allowing manipulation or traversal to higher levels of the DOM hierarchy. Selecting Child Elements: Web pages are like big family trees, with elements having child elements. These methods return a list of the child elements of the specific element we are interested in. It's like getting a list of all the kids in a family. Exploring Sibling Elements: Sibling elements share the same parent node and offer opportunities for targeted manipulation or traversal within a specific context. Traversal methods such as nextSibling and previousSibling (in JSoup, nextElementSibling and previousElementSibling for elements) enable navigation to adjacent elements at the same level in the DOM tree. Descending into Descendant Elements: Traversing through descendant elements allows for deep exploration within the DOM tree, enabling access to nested structures and nested content. Methods like the browser's querySelector and querySelectorAll (in JSoup, selectFirst and select) provide powerful mechanisms for selecting elements based on CSS selectors.
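The four traversal directions can be sketched with JSoup's Element API as follows. The markup, the element ids, and the helper name are hypothetical; the sketch starts from one list item and looks up its parent, its parent's children, its neighbouring siblings, and all descendants of the list.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

final class NavigationSketch {
    // Hypothetical sample markup.
    static final String HTML =
        "<ul id=\"menu\"><li>Home</li><li id=\"current\">Docs</li><li>Blog</li></ul>";

    // Describes the surroundings of the element with the given id.
    static List<String> neighbours(String html, String id) {
        Document doc = Jsoup.parse(html);
        List<String> out = new ArrayList<>();
        Element el = doc.getElementById(id);
        if (el == null || el.parent() == null) {
            return out; // nothing to describe
        }
        Element parent = el.parent();                       // parent element
        out.add("parent=" + parent.id());
        out.add("children=" + parent.children().size());    // child elements of the parent
        Element prev = el.previousElementSibling();         // sibling before, or null
        Element next = el.nextElementSibling();             // sibling after, or null
        out.add("previous=" + (prev == null ? "" : prev.text()));
        out.add("next=" + (next == null ? "" : next.text()));
        out.add("descendants=" + doc.select("#menu li").eachText()); // descendants via CSS selector
        return out;
    }

    public static void main(String[] args) {
        neighbours(HTML, "current").forEach(System.out::println);
    }
}
```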
  • 18. Parsing HTML with Jsoup HTML Structure HTML (Hypertext Markup Language) is the standard language for creating web pages. It uses a markup structure composed of elements, tags, attributes, and content to define the structure and appearance of web documents. Elements and Tags: • Elements: Fundamental building blocks of HTML documents, representing different types of content. • Tags: Enclosed in angle brackets (<>), define the beginning and end of HTML elements.
  • 19. Parsing HTML with Jsoup JSoup offers an array of robust features for parsing HTML documents with ease and precision. Whether you're dealing with simple or complex HTML structures, JSoup's flexible API empowers you to efficiently extract the desired data. • Loading HTML Documents: Jsoup simplifies the process of loading HTML documents from various sources, including URLs, files, and strings. You can seamlessly retrieve the HTML content and begin parsing it without hassle. • CSS Selectors: JSoup supports CSS selectors, allowing you to target specific elements within the HTML document based on their classes, IDs, attributes, or hierarchy. This granular selection capability enables precise data extraction from the DOM.
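A minimal sketch of the three loading paths and of selectors by class, ID, attribute, and hierarchy is shown below. The class name, method names, user-agent string, timeout, and sample markup are assumptions for illustration. The URL variant performs a real network request, so it is defined here but not called.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class LoadingSketch {
    // Hypothetical sample markup.
    static final String HTML =
        "<html><head><title>Shop</title></head><body><div id=\"main\">"
      + "<p class=\"price\">10</p><p class=\"price\">20</p>"
      + "<a href=\"/cart\">Cart</a><a>No link</a></div></body></html>";

    // 1) From a string already held in memory.
    static Document fromString(String html) {
        return Jsoup.parse(html);
    }

    // 2) From a file on disk, read as UTF-8.
    static Document fromFile(Path path) {
        try {
            return Jsoup.parse(path.toFile(), "UTF-8");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // 3) From a URL. The user agent and timeout are illustrative values.
    static Document fromUrl(String url) throws IOException {
        return Jsoup.connect(url).userAgent("example-scraper/0.1").timeout(10_000).get();
    }

    // Counts the elements matched by a CSS selector.
    static int count(Document doc, String css) {
        return doc.select(css).size();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sample", ".html");
        Files.writeString(tmp, HTML);
        Document doc = fromFile(tmp);
        System.out.println(doc.title());
        System.out.println(count(doc, ".price"));   // by class
        System.out.println(count(doc, "#main"));    // by ID
        System.out.println(count(doc, "a[href]"));  // by attribute presence
        System.out.println(count(doc, "div > p"));  // by hierarchy (direct children)
        Files.delete(tmp);
    }
}
```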
  • 21. Extracting Data with JSoup In the dynamic landscape of web development, accessing and extracting data from websites is a fundamental task. JSoup, a Java library, offers powerful tools for parsing HTML and manipulating the Document Object Model (DOM) of a web page. In this slide, we'll delve into the world of data extraction with JSoup, exploring its capabilities and demonstrating how to harness its potential to gather valuable information from the web. Data extraction techniques: • Text Extraction • Attribute Extraction • HTML Content Extraction
  • 22. Text Extraction Text extraction involves the retrieval of textual information embedded within various data sources, be it web pages, documents, or databases. Here are some prevalent methods: Regular Expressions (Regex): - A powerful tool for pattern matching, regex enables the identification and extraction of specific text strings based on predefined patterns or rules. HTML Parsing Libraries: - Using libraries like Beautiful Soup in Python or JSoup in Java, developers can navigate through HTML documents, pinpointing and extracting text content enclosed within designated tags. Optical Character Recognition (OCR): - When dealing with scanned documents or images containing text, OCR algorithms come into play. These algorithms analyse the image, recognize characters, and convert them into editable text.
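The first two methods combine naturally: an HTML parser isolates the text of an element, and a regular expression then picks out a pattern within it. The sketch below assumes a hypothetical product snippet and a simple dollar-amount pattern; neither comes from the original slides.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TextExtractionSketch {
    // Hypothetical sample markup.
    static final String HTML =
        "<div id=\"item\"><h2>Blue Mug</h2>"
      + "<p>Price: <span class=\"amount\">$12.50</span> (was $15.00)</p></div>";

    // Matches amounts such as $12.50 and captures the numeric part.
    private static final Pattern PRICE = Pattern.compile("\\$(\\d+\\.\\d{2})");

    // Returns the combined text of the first element matching the selector,
    // including the text of nested elements; empty string if nothing matches.
    static String textOf(String html, String css) {
        Element el = Jsoup.parse(html).selectFirst(css);
        return el == null ? "" : el.text();
    }

    // Applies the regex to plain text and collects every captured amount.
    static List<String> prices(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = PRICE.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String text = textOf(HTML, "#item p");
        System.out.println(text);
        System.out.println(prices(text));
    }
}
```

Running the regex on the extracted text rather than on raw HTML avoids accidental matches inside tags or attribute values.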
  • 23. Attributes Extraction Attributes enrich the contextual understanding of data by providing metadata associated with elements. Extracting attributes facilitates categorization, filtering, and analysis. Common techniques include: XPath Queries: - XPath enables the selection of elements based on their attributes within XML or HTML documents. By crafting XPath queries, developers can precisely target elements and retrieve attribute values. CSS Selectors: - Similar to XPath, CSS selectors allow for the identification and extraction of elements based on their attributes. CSS selectors provide a concise syntax for specifying attribute-based criteria. API Integration: - Some data sources offer APIs (Application Programming Interfaces) that expose structured data along with associated attributes. Integrating with these APIs simplifies attribute extraction and ensures data consistency.
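With JSoup, the CSS-selector route might look like the sketch below: an attribute selector narrows the match to elements that carry the attribute, and attr or absUrl reads its value. The markup, the base URI, and the method names are hypothetical. A base URI is passed to the parser so that relative links can be resolved.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

final class AttributeExtractionSketch {
    // Hypothetical sample markup and base URI.
    static final String HTML =
        "<a href=\"/guide.html\" title=\"Guide\">Guide</a>"
      + "<a>No target</a>"
      + "<img src=\"logo.png\" alt=\"Logo\">";
    static final String BASE = "https://example.com/docs/";

    // Raw attribute values exactly as written in the markup.
    static List<String> rawHrefs(String html) {
        List<String> out = new ArrayList<>();
        for (Element a : Jsoup.parse(html).select("a[href]")) { // only links that have an href
            out.add(a.attr("href"));
        }
        return out;
    }

    // Attribute values resolved against the base URI.
    static List<String> absoluteUrls(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> out = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {
            out.add(a.absUrl("href"));
        }
        for (Element img : doc.select("img[src]")) {
            out.add(img.absUrl("src"));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(rawHrefs(HTML));
        System.out.println(absoluteUrls(HTML, BASE));
    }
}
```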
  • 24. HTML Content Extraction HTML content extraction involves capturing the structure and layout of web pages, including text, images, links, and other multimedia elements. Techniques for extracting HTML content include: DOM Traversal: - Traversing the Document Object Model (DOM) tree of a web page enables the extraction of specific HTML elements and their contents. DOM traversal libraries like Cheerio (for Node.js) provide an intuitive interface for this purpose. Web Scraping Tools: - Tools such as Scrapy (a Python scraping framework) and Puppeteer (a Node.js browser-automation library) allow developers to extract HTML content programmatically while handling complexities like pagination and dynamic content. Browser Extensions: - For more user-centric applications, browser extensions (for example, using Chrome content scripts) enable the extraction of HTML content directly from the user's browsing session. This approach is particularly useful for tasks like content curation and data aggregation.
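In JSoup itself, the markup of a selected element can be read back with html (inner markup) or outerHtml (including the element's own tag). The following is a rough sketch under assumed names and sample markup. Pretty-printing is switched off so the output is not re-indented.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class HtmlContentSketch {
    // Hypothetical sample markup.
    static final String HTML = "<div id=\"card\"><p>Hi <b>there</b></p></div>";

    // Parses without pretty-printing so serialized markup stays compact.
    private static Element find(String html, String css) {
        Document doc = Jsoup.parse(html);
        doc.outputSettings().prettyPrint(false);
        return doc.selectFirst(css);
    }

    // Markup inside the element, excluding the element's own tag.
    static String innerHtml(String html, String css) {
        Element el = find(html, css);
        return el == null ? "" : el.html();
    }

    // Markup of the element including its own opening and closing tag.
    static String outerHtml(String html, String css) {
        Element el = find(html, css);
        return el == null ? "" : el.outerHtml();
    }

    public static void main(String[] args) {
        System.out.println(innerHtml(HTML, "#card"));
        System.out.println(outerHtml(HTML, "#card"));
    }
}
```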
  • 25. DEMO