SlideShare a Scribd company logo
Data Engineering:
A Deep Dive into
Databricks
Presenter:
Mohika Rastogi
Sant Singh
Lack of etiquette and manners is a huge turn off.
KnolX Etiquettes
 Punctuality
Join the session 5 minutes prior to the session start time. We start on
time and conclude on time!
 Feedback
Make sure to submit a constructive feedback for all sessions as it is very
helpful for the presenter.
 Silent Mode
Keep your mobile devices in silent mode, feel free to move out of session
in case you need to attend an urgent call.
 Avoid Disturbance
Avoid unwanted chit chat during the session.
1. Introduction
o What is Data Engineering
o Data Engineer vs Analyst vs Scientist
2. Central Repository
o Data Warehouse
o Data Lake
o Data Lakehouse
3. Databricks
o What is Databricks ?
o Use cases
o Managed Integration
o Delta Lake
o Delta Sharing
4. Apache Spark
5. Databricks Workspace
o Workspace Terminologies
6. Demo
Data Engineering A Deep Dive into Databricks
Data Engineering
 Data engineering is the practice of designing and building systems for collecting, storing, and
analyzing data at scale.
 Data engineering is the complex task of making raw data usable to data scientists and groups
within an organization.
Data Engineer vs Data Analyst vs Data Scientist
Data Scientist
A data scientist is someone who
uses their knowledge of statistics,
machine learning, and
programming to extract meaning
from data. They use their skills to
solve complex problems, identify
trends, and make predictions.
Data Analyst
A data analyst is someone who
collects, cleans, and analyzes
data to help businesses make
better decisions. They use
their skills to identify patterns
in data, and to create reports and
visualizations that help others
understand the data.
Data Engineer
A data engineer is someone who
builds and maintains the systems
that data scientists and data
analysts use to collect, store, and
analyze data. They use their
skills to design and build
data pipelines, and to ensure
that data is stored in a secure
and efficient way.
Central Repositories
Data Warehouse
A data lake is an ample storage that can store structured,
semi-structured, and raw data. The schema of the data is
not known as it is a schema-on-read.
Data Lake
A data warehouse is a central repository of business
data stored in structured format to help organizations
gain insights. Schema needs to be known before writing
data into a warehouse.
Data Lakehouse
 Data lakehouse is a realtively new architecture and it is combining the best of the both worlds —
data warehouses and data lakes.
 It serves as a single platform for data warehousing and data lakes. It has data management
features such as ACID transcation coming from a warehouse perspective and low cost storage
like a data lake.
Databricks
A unified, open analytics platform for
building, deploying, sharing, and
maintaining enterprise-grade data,
analytics, and AI solutions at scale.
Databricks
 An Interactive Analytics platform that enables Data Engineers, Data Scientists, and Businesses to
collaborate and work closely on notebooks, experiments, models, data, libraries, and jobs.
 Databricks was founded by creators of Apache Spark in 2013
 A one-stop product for all Data requirements, like Storage and Analysis.
 Databricks is integrated with Microsoft Azure, Amazon Web Services, and Google Cloud Platform.
What is Databricks used for?
 Data processing workflows scheduling and
management
 Working in SQL
 Generating dashboards and visualizations
 Data ingestion
 Managing security, governance, and HA/DR
 Data discovery, annotation, and exploration
 Compute management
 Machine learning (ML) modeling and tracking
 ML model serving
 Source control with Git
The Databricks workspace provides a unified interface and tools for most data tasks, including:
Databricks for Data Engineering
 Simplified data ingestion
 Automated ETL processing
 Reliable workflow orchestration
 End-to-end observability and monitoring
 Next-generation data processing engine
 Foundation of governance, reliability and performance
Databricks excels in data engineering with
its unified platform, leveraging Apache
Spark for efficient processing and
scalability.
Managed integration with open source
The following technologies are open source projects founded by Databricks employees:
 Delta Lake
− Delta Lake is the optimized storage layer that provides the foundation for storing data and tables in the Databricks
Lakehouse Platform.
 Delta Sharing
− An open standard for secure data sharing.
 Apache Spark
− Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on
single-node machines or clusters.
 MLflow
− MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment,
and a central model registry
Delta Lake
 Delta Lake is the default storage format for all operations on Databricks.
 Delta Lake is open source software that extends Parquet data files with a file-based transaction
log for ACID transactions and scalable metadata handling.
 Delta Lake is fully compatible with Apache Spark APIs, and was developed for tight integration
with Structured Streaming, allowing you to easily use a single copy of data for both batch and
streaming operations and providing incremental processing at scale.
Delta Sharing
 Delta Sharing is the industry’s first open protocol for secure data sharing, making it simple to
share data with other organizations regardless of which computing platforms they use.
 Databricks and the Linux Foundation developed Delta Sharing to provide the first open source
approach to data sharing across data, analytics and AI. Customers can share live data across
platforms, clouds and regions with strong security and governance.
Apache Spark
 Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based on Hadoop
MapReduce and it extends the MapReduce model to efficiently use it for more types of computations, which includes
interactive queries and stream processing.
 The main feature of Spark is its in-memory cluster computing that increases the processing speed of an application.
 PySpark:- PySpark is an interface for Apache Spark in Python. With PySpark, you can write
Python and SQL-like commands to manipulate and analyze data in a distributed processing
environment.
Databricks Workspace
The Databricks “Workspace” is an Environment for accessing all of the Databricks “Assets”.
The “Workspace” organizes Objects, such as- “Notebooks”, “Libraries” and “Experiments”
into “Folders”, and, provides access to “Data” and Computational Resources, such as -
“Clusters” and “Jobs”.
The Databricks “Workspace” can be managed using :-
1. Workspace UI
2. Databricks CLI
3. Databricks REST API
Databricks Workspace Terminology
01 02
03
05 06
04
Cluster is a “Set of Computational
Resources and Configurations”, on
which an organization’s Data
Engineering Workloads are run.
Cluster
Job is a way of running a “Notebook”
or a “JAR” either immediately or on a
“Scheduled Basis”.
Jobs
Every “Databricks Deployment” has a
“Central Hive Meta-store”, accessible
by all “Clusters” to persist “Table
Metadata”.
Hive Meta-store
“Notebook” is a “Web-Based Interface”
composed of a “Group of Cells” that
allow to execute coding commands.
Notebooks
DBFS is a Distributed File System
mounted into each Databricks
Workspace. DBFS contains Directorie
s which in turn contains Data Files,
Libraries and other Directories.
DBFS
By default, all tables created in
Databricks are Delta tables. Delta
tables are based on the Delta Lake
open source project.
Delta Table
Data Engineering A Deep Dive into Databricks
Data Engineering A Deep Dive into Databricks
Data Engineering A Deep Dive into Databricks
Ad

More Related Content

What's hot (20)

Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
Durga Gadiraju
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
Adam Doyle
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data Mesh
LibbySchulze
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
adb.pdf
adb.pdfadb.pdf
adb.pdf
AdityaMehta724216
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
 
Introduction to Azure Synapse Webinar
Introduction to Azure Synapse WebinarIntroduction to Azure Synapse Webinar
Introduction to Azure Synapse Webinar
Peter Ward
 
Azure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene PolonichkoAzure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene Polonichko
Dimko Zhluktenko
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
Databricks
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
Databricks
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
Databricks
 
Introduction to Azure Data Lake
Introduction to Azure Data LakeIntroduction to Azure Data Lake
Introduction to Azure Data Lake
Antonios Chatzipavlis
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
James Serra
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the Cloud
Mark Kromer
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Scaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on DatabricksScaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on Databricks
Databricks
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
James Serra
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
Durga Gadiraju
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
Adam Doyle
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data Mesh
LibbySchulze
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
 
Introduction to Azure Synapse Webinar
Introduction to Azure Synapse WebinarIntroduction to Azure Synapse Webinar
Introduction to Azure Synapse Webinar
Peter Ward
 
Azure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene PolonichkoAzure DataBricks for Data Engineering by Eugene Polonichko
Azure DataBricks for Data Engineering by Eugene Polonichko
Dimko Zhluktenko
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
Databricks
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
Databricks
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
Databricks
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
James Serra
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the Cloud
Mark Kromer
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Scaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on DatabricksScaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on Databricks
Databricks
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
James Serra
 

Similar to Data Engineering A Deep Dive into Databricks (20)

DataBricks fundamentals for fresh graduates
DataBricks fundamentals for fresh graduatesDataBricks fundamentals for fresh graduates
DataBricks fundamentals for fresh graduates
SanjeevaniClinicalRe
 
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsfPyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
sasuke20y4sh
 
use_case.pptx
use_case.pptxuse_case.pptx
use_case.pptx
vuppalanaveen
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on Azure
Trivadis
 
Azure Databricks - An Introduction 2019 Roadshow.pptx
Azure Databricks - An Introduction 2019 Roadshow.pptxAzure Databricks - An Introduction 2019 Roadshow.pptx
Azure Databricks - An Introduction 2019 Roadshow.pptx
pascalsegoul
 
201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine Learning201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine Learning
Mark Tabladillo
 
Data Mining Tools_presnetion_data_scince.pptx
Data Mining Tools_presnetion_data_scince.pptxData Mining Tools_presnetion_data_scince.pptx
Data Mining Tools_presnetion_data_scince.pptx
fnas9889
 
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Dataconomy Media
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
James Serra
 
Pyspark presentationfsfsfjspfsjfsfsfjsfpsfsf
Pyspark presentationfsfsfjspfsjfsfsfjsfpsfsfPyspark presentationfsfsfjspfsjfsfsfjsfpsfsf
Pyspark presentationfsfsfjspfsjfsfsfjsfpsfsf
sasuke20y4sh
 
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
rajeshseo5
 
Microsoft Fabric Introduction
Microsoft Fabric IntroductionMicrosoft Fabric Introduction
Microsoft Fabric Introduction
James Serra
 
Azure Databricks & Spark @ Techorama 2018
Azure Databricks & Spark @ Techorama 2018Azure Databricks & Spark @ Techorama 2018
Azure Databricks & Spark @ Techorama 2018
Nathan Bijnens
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
James Serra
 
CC -Unit4.pptx
CC -Unit4.pptxCC -Unit4.pptx
CC -Unit4.pptx
Revathiparamanathan
 
5 Comparing Microsoft Big Data Technologies for Analytics
5 Comparing Microsoft Big Data Technologies for Analytics5 Comparing Microsoft Big Data Technologies for Analytics
5 Comparing Microsoft Big Data Technologies for Analytics
Jen Stirrup
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
Leandro Totino Pereira
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Luan Moreno Medeiros Maciel
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
Simon Ambridge
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Michael Rys
 
DataBricks fundamentals for fresh graduates
DataBricks fundamentals for fresh graduatesDataBricks fundamentals for fresh graduates
DataBricks fundamentals for fresh graduates
SanjeevaniClinicalRe
 
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsfPyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
sasuke20y4sh
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on Azure
Trivadis
 
Azure Databricks - An Introduction 2019 Roadshow.pptx
Azure Databricks - An Introduction 2019 Roadshow.pptxAzure Databricks - An Introduction 2019 Roadshow.pptx
Azure Databricks - An Introduction 2019 Roadshow.pptx
pascalsegoul
 
201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine Learning201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine Learning
Mark Tabladillo
 
Data Mining Tools_presnetion_data_scince.pptx
Data Mining Tools_presnetion_data_scince.pptxData Mining Tools_presnetion_data_scince.pptx
Data Mining Tools_presnetion_data_scince.pptx
fnas9889
 
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Dataconomy Media
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
James Serra
 
Pyspark presentationfsfsfjspfsjfsfsfjsfpsfsf
Pyspark presentationfsfsfjspfsjfsfsfjsfpsfsfPyspark presentationfsfsfjspfsjfsfsfjsfpsfsf
Pyspark presentationfsfsfjspfsjfsfsfjsfpsfsf
sasuke20y4sh
 
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
rajeshseo5
 
Microsoft Fabric Introduction
Microsoft Fabric IntroductionMicrosoft Fabric Introduction
Microsoft Fabric Introduction
James Serra
 
Azure Databricks & Spark @ Techorama 2018
Azure Databricks & Spark @ Techorama 2018Azure Databricks & Spark @ Techorama 2018
Azure Databricks & Spark @ Techorama 2018
Nathan Bijnens
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
James Serra
 
5 Comparing Microsoft Big Data Technologies for Analytics
5 Comparing Microsoft Big Data Technologies for Analytics5 Comparing Microsoft Big Data Technologies for Analytics
5 Comparing Microsoft Big Data Technologies for Analytics
Jen Stirrup
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Luan Moreno Medeiros Maciel
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
Simon Ambridge
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Michael Rys
 
Ad

More from Knoldus Inc. (20)

Angular Hydration Presentation (FrontEnd)
Angular Hydration Presentation (FrontEnd)Angular Hydration Presentation (FrontEnd)
Angular Hydration Presentation (FrontEnd)
Knoldus Inc.
 
Optimizing Test Execution: Heuristic Algorithm for Self-Healing
Optimizing Test Execution: Heuristic Algorithm for Self-HealingOptimizing Test Execution: Heuristic Algorithm for Self-Healing
Optimizing Test Execution: Heuristic Algorithm for Self-Healing
Knoldus Inc.
 
Self-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - HealeniumSelf-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - Healenium
Knoldus Inc.
 
Kanban Metrics Presentation (Project Management)
Kanban Metrics Presentation (Project Management)Kanban Metrics Presentation (Project Management)
Kanban Metrics Presentation (Project Management)
Knoldus Inc.
 
Java 17 features and implementation.pptx
Java 17 features and implementation.pptxJava 17 features and implementation.pptx
Java 17 features and implementation.pptx
Knoldus Inc.
 
Chaos Mesh Introducing Chaos in Kubernetes
Chaos Mesh Introducing Chaos in KubernetesChaos Mesh Introducing Chaos in Kubernetes
Chaos Mesh Introducing Chaos in Kubernetes
Knoldus Inc.
 
GraalVM - A Step Ahead of JVM Presentation
GraalVM - A Step Ahead of JVM PresentationGraalVM - A Step Ahead of JVM Presentation
GraalVM - A Step Ahead of JVM Presentation
Knoldus Inc.
 
Nomad by HashiCorp Presentation (DevOps)
Nomad by HashiCorp Presentation (DevOps)Nomad by HashiCorp Presentation (DevOps)
Nomad by HashiCorp Presentation (DevOps)
Knoldus Inc.
 
Nomad by HashiCorp Presentation (DevOps)
Nomad by HashiCorp Presentation (DevOps)Nomad by HashiCorp Presentation (DevOps)
Nomad by HashiCorp Presentation (DevOps)
Knoldus Inc.
 
DAPR - Distributed Application Runtime Presentation
DAPR - Distributed Application Runtime PresentationDAPR - Distributed Application Runtime Presentation
DAPR - Distributed Application Runtime Presentation
Knoldus Inc.
 
Introduction to Azure Virtual WAN Presentation
Introduction to Azure Virtual WAN PresentationIntroduction to Azure Virtual WAN Presentation
Introduction to Azure Virtual WAN Presentation
Knoldus Inc.
 
Introduction to Argo Rollouts Presentation
Introduction to Argo Rollouts PresentationIntroduction to Argo Rollouts Presentation
Introduction to Argo Rollouts Presentation
Knoldus Inc.
 
Intro to Azure Container App Presentation
Intro to Azure Container App PresentationIntro to Azure Container App Presentation
Intro to Azure Container App Presentation
Knoldus Inc.
 
Insights Unveiled Test Reporting and Observability Excellence
Insights Unveiled Test Reporting and Observability ExcellenceInsights Unveiled Test Reporting and Observability Excellence
Insights Unveiled Test Reporting and Observability Excellence
Knoldus Inc.
 
Introduction to Splunk Presentation (DevOps)
Introduction to Splunk Presentation (DevOps)Introduction to Splunk Presentation (DevOps)
Introduction to Splunk Presentation (DevOps)
Knoldus Inc.
 
Code Camp - Data Profiling and Quality Analysis Framework
Code Camp - Data Profiling and Quality Analysis FrameworkCode Camp - Data Profiling and Quality Analysis Framework
Code Camp - Data Profiling and Quality Analysis Framework
Knoldus Inc.
 
AWS: Messaging Services in AWS Presentation
AWS: Messaging Services in AWS PresentationAWS: Messaging Services in AWS Presentation
AWS: Messaging Services in AWS Presentation
Knoldus Inc.
 
Amazon Cognito: A Primer on Authentication and Authorization
Amazon Cognito: A Primer on Authentication and AuthorizationAmazon Cognito: A Primer on Authentication and Authorization
Amazon Cognito: A Primer on Authentication and Authorization
Knoldus Inc.
 
ZIO Http A Functional Approach to Scalable and Type-Safe Web Development
ZIO Http A Functional Approach to Scalable and Type-Safe Web DevelopmentZIO Http A Functional Approach to Scalable and Type-Safe Web Development
ZIO Http A Functional Approach to Scalable and Type-Safe Web Development
Knoldus Inc.
 
Managing State & HTTP Requests In Ionic.
Managing State & HTTP Requests In Ionic.Managing State & HTTP Requests In Ionic.
Managing State & HTTP Requests In Ionic.
Knoldus Inc.
 
Angular Hydration Presentation (FrontEnd)
Angular Hydration Presentation (FrontEnd)Angular Hydration Presentation (FrontEnd)
Angular Hydration Presentation (FrontEnd)
Knoldus Inc.
 
Optimizing Test Execution: Heuristic Algorithm for Self-Healing
Optimizing Test Execution: Heuristic Algorithm for Self-HealingOptimizing Test Execution: Heuristic Algorithm for Self-Healing
Optimizing Test Execution: Heuristic Algorithm for Self-Healing
Knoldus Inc.
 
Self-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - HealeniumSelf-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - Healenium
Knoldus Inc.
 
Kanban Metrics Presentation (Project Management)
Kanban Metrics Presentation (Project Management)Kanban Metrics Presentation (Project Management)
Kanban Metrics Presentation (Project Management)
Knoldus Inc.
 
Java 17 features and implementation.pptx
Java 17 features and implementation.pptxJava 17 features and implementation.pptx
Java 17 features and implementation.pptx
Knoldus Inc.
 
Chaos Mesh Introducing Chaos in Kubernetes
Chaos Mesh Introducing Chaos in KubernetesChaos Mesh Introducing Chaos in Kubernetes
Chaos Mesh Introducing Chaos in Kubernetes
Knoldus Inc.
 
GraalVM - A Step Ahead of JVM Presentation
GraalVM - A Step Ahead of JVM PresentationGraalVM - A Step Ahead of JVM Presentation
GraalVM - A Step Ahead of JVM Presentation
Knoldus Inc.
 
Nomad by HashiCorp Presentation (DevOps)
Nomad by HashiCorp Presentation (DevOps)Nomad by HashiCorp Presentation (DevOps)
Nomad by HashiCorp Presentation (DevOps)
Knoldus Inc.
 
Nomad by HashiCorp Presentation (DevOps)
Nomad by HashiCorp Presentation (DevOps)Nomad by HashiCorp Presentation (DevOps)
Nomad by HashiCorp Presentation (DevOps)
Knoldus Inc.
 
DAPR - Distributed Application Runtime Presentation
DAPR - Distributed Application Runtime PresentationDAPR - Distributed Application Runtime Presentation
DAPR - Distributed Application Runtime Presentation
Knoldus Inc.
 
Introduction to Azure Virtual WAN Presentation
Introduction to Azure Virtual WAN PresentationIntroduction to Azure Virtual WAN Presentation
Introduction to Azure Virtual WAN Presentation
Knoldus Inc.
 
Introduction to Argo Rollouts Presentation
Introduction to Argo Rollouts PresentationIntroduction to Argo Rollouts Presentation
Introduction to Argo Rollouts Presentation
Knoldus Inc.
 
Intro to Azure Container App Presentation
Intro to Azure Container App PresentationIntro to Azure Container App Presentation
Intro to Azure Container App Presentation
Knoldus Inc.
 
Insights Unveiled Test Reporting and Observability Excellence
Insights Unveiled Test Reporting and Observability ExcellenceInsights Unveiled Test Reporting and Observability Excellence
Insights Unveiled Test Reporting and Observability Excellence
Knoldus Inc.
 
Introduction to Splunk Presentation (DevOps)
Introduction to Splunk Presentation (DevOps)Introduction to Splunk Presentation (DevOps)
Introduction to Splunk Presentation (DevOps)
Knoldus Inc.
 
Code Camp - Data Profiling and Quality Analysis Framework
Code Camp - Data Profiling and Quality Analysis FrameworkCode Camp - Data Profiling and Quality Analysis Framework
Code Camp - Data Profiling and Quality Analysis Framework
Knoldus Inc.
 
AWS: Messaging Services in AWS Presentation
AWS: Messaging Services in AWS PresentationAWS: Messaging Services in AWS Presentation
AWS: Messaging Services in AWS Presentation
Knoldus Inc.
 
Amazon Cognito: A Primer on Authentication and Authorization
Amazon Cognito: A Primer on Authentication and AuthorizationAmazon Cognito: A Primer on Authentication and Authorization
Amazon Cognito: A Primer on Authentication and Authorization
Knoldus Inc.
 
ZIO Http A Functional Approach to Scalable and Type-Safe Web Development
ZIO Http A Functional Approach to Scalable and Type-Safe Web DevelopmentZIO Http A Functional Approach to Scalable and Type-Safe Web Development
ZIO Http A Functional Approach to Scalable and Type-Safe Web Development
Knoldus Inc.
 
Managing State & HTTP Requests In Ionic.
Managing State & HTTP Requests In Ionic.Managing State & HTTP Requests In Ionic.
Managing State & HTTP Requests In Ionic.
Knoldus Inc.
 
Ad

Recently uploaded (20)

Webinar - Top 5 Backup Mistakes MSPs and Businesses Make .pptx
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make   .pptxWebinar - Top 5 Backup Mistakes MSPs and Businesses Make   .pptx
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make .pptx
MSP360
 
Mastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B LandscapeMastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B Landscape
marketing943205
 
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
Ivano Malavolta
 
The Future of Cisco Cloud Security: Innovations and AI Integration
The Future of Cisco Cloud Security: Innovations and AI IntegrationThe Future of Cisco Cloud Security: Innovations and AI Integration
The Future of Cisco Cloud Security: Innovations and AI Integration
Re-solution Data Ltd
 
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Raffi Khatchadourian
 
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Raffi Khatchadourian
 
Financial Services Technology Summit 2025
Financial Services Technology Summit 2025Financial Services Technology Summit 2025
Financial Services Technology Summit 2025
Ray Bugg
 
Agentic Automation - Delhi UiPath Community Meetup
Agentic Automation - Delhi UiPath Community MeetupAgentic Automation - Delhi UiPath Community Meetup
Agentic Automation - Delhi UiPath Community Meetup
Manoj Batra (1600 + Connections)
 
Transcript: Canadian book publishing: Insights from the latest salary survey ...
Transcript: Canadian book publishing: Insights from the latest salary survey ...Transcript: Canadian book publishing: Insights from the latest salary survey ...
Transcript: Canadian book publishing: Insights from the latest salary survey ...
BookNet Canada
 
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptxDevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
Justin Reock
 
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptxReimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
John Moore
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
IT484 Cyber Forensics_Information Technology
IT484 Cyber Forensics_Information TechnologyIT484 Cyber Forensics_Information Technology
IT484 Cyber Forensics_Information Technology
SHEHABALYAMANI
 
Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Everything You Need to Know About Agentforce? (Put AI Agents to Work)Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Cyntexa
 
Com fer un pla de gestió de dades amb l'eiNa DMP (en anglès)
Com fer un pla de gestió de dades amb l'eiNa DMP (en anglès)Com fer un pla de gestió de dades amb l'eiNa DMP (en anglès)
Com fer un pla de gestió de dades amb l'eiNa DMP (en anglès)
CSUC - Consorci de Serveis Universitaris de Catalunya
 
UiPath Automation Suite – Cas d'usage d'une NGO internationale basée à Genève
UiPath Automation Suite – Cas d'usage d'une NGO internationale basée à GenèveUiPath Automation Suite – Cas d'usage d'une NGO internationale basée à Genève
UiPath Automation Suite – Cas d'usage d'une NGO internationale basée à Genève
UiPathCommunity
 
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
James Anderson
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
fennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solutionfennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solution
shallal2
 
Cybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and MitigationCybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and Mitigation
VICTOR MAESTRE RAMIREZ
 
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make .pptx
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make   .pptxWebinar - Top 5 Backup Mistakes MSPs and Businesses Make   .pptx
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make .pptx
MSP360
 
Mastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B LandscapeMastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B Landscape
marketing943205
 
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
Ivano Malavolta
 
The Future of Cisco Cloud Security: Innovations and AI Integration
The Future of Cisco Cloud Security: Innovations and AI IntegrationThe Future of Cisco Cloud Security: Innovations and AI Integration
The Future of Cisco Cloud Security: Innovations and AI Integration
Re-solution Data Ltd
 
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Raffi Khatchadourian
 
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Raffi Khatchadourian
 
Financial Services Technology Summit 2025
Financial Services Technology Summit 2025Financial Services Technology Summit 2025
Financial Services Technology Summit 2025
Ray Bugg
 
Transcript: Canadian book publishing: Insights from the latest salary survey ...
Transcript: Canadian book publishing: Insights from the latest salary survey ...Transcript: Canadian book publishing: Insights from the latest salary survey ...
Transcript: Canadian book publishing: Insights from the latest salary survey ...
BookNet Canada
 
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptxDevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
Justin Reock
 
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptxReimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
John Moore
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
IT484 Cyber Forensics_Information Technology
IT484 Cyber Forensics_Information TechnologyIT484 Cyber Forensics_Information Technology
IT484 Cyber Forensics_Information Technology
SHEHABALYAMANI
 
Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Everything You Need to Know About Agentforce? (Put AI Agents to Work)Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Cyntexa
 
UiPath Automation Suite – Cas d'usage d'une NGO internationale basée à Genève
UiPath Automation Suite – Cas d'usage d'une NGO internationale basée à GenèveUiPath Automation Suite – Cas d'usage d'une NGO internationale basée à Genève
UiPath Automation Suite – Cas d'usage d'une NGO internationale basée à Genève
UiPathCommunity
 
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
James Anderson
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
fennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solutionfennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solution
shallal2
 
Cybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and MitigationCybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and Mitigation
VICTOR MAESTRE RAMIREZ
 

Data Engineering A Deep Dive into Databricks

  • 1. Data Engineering: A Deep Dive into Databricks Presenter: Mohika Rastogi Sant Singh
  • 2. Lack of etiquette and manners is a huge turn off. KnolX Etiquettes  Punctuality Join the session 5 minutes prior to the session start time. We start on time and conclude on time!  Feedback Make sure to submit a constructive feedback for all sessions as it is very helpful for the presenter.  Silent Mode Keep your mobile devices in silent mode, feel free to move out of session in case you need to attend an urgent call.  Avoid Disturbance Avoid unwanted chit chat during the session.
  • 3. 1. Introduction o What is Data Engineering o Data Engineer vs Analyst vs Scientist 2. Central Repository o Data Warehouse o Data Lake o Data Lakehouse 3. Databricks o What is Databricks ? o Use cases o Managed Integration o Delta Lake o Delta Sharing 4. Apache Spark 5. Databricks Workspace o Workspace Terminologies 6. Demo
  • 5. Data Engineering  Data engineering is the practice of designing and building systems for collecting, storing, and analyzing data at scale.  Data engineering is the complex task of making raw data usable to data scientists and groups within an organization.
  • 6. Data Engineer vs Data Analyst vs Data Scientist Data Scientist A data scientist is someone who uses their knowledge of statistics, machine learning, and programming to extract meaning from data. They use their skills to solve complex problems, identify trends, and make predictions. Data Analyst A data analyst is someone who collects, cleans, and analyzes data to help businesses make better decisions. They use their skills to identify patterns in data, and to create reports and visualizations that help others understand the data. Data Engineer A data engineer is someone who builds and maintains the systems that data scientists and data analysts use to collect, store, and analyze data. They use their skills to design and build data pipelines, and to ensure that data is stored in a secure and efficient way.
  • 7. Central Repositories Data Warehouse A data lake is an ample storage that can store structured, semi-structured, and raw data. The schema of the data is not known as it is a schema-on-read. Data Lake A data warehouse is a central repository of business data stored in structured format to help organizations gain insights. Schema needs to be known before writing data into a warehouse.
  • 8. Data Lakehouse  Data lakehouse is a realtively new architecture and it is combining the best of the both worlds — data warehouses and data lakes.  It serves as a single platform for data warehousing and data lakes. It has data management features such as ACID transcation coming from a warehouse perspective and low cost storage like a data lake.
  • 9. Databricks A unified, open analytics platform for building, deploying, sharing, and maintaining enterprise-grade data, analytics, and AI solutions at scale.
  • 10. Databricks  An Interactive Analytics platform that enables Data Engineers, Data Scientists, and Businesses to collaborate and work closely on notebooks, experiments, models, data, libraries, and jobs.  Databricks was founded by creators of Apache Spark in 2013  A one-stop product for all Data requirements, like Storage and Analysis.  Databricks is integrated with Microsoft Azure, Amazon Web Services, and Google Cloud Platform.
  • 11. What is Databricks used for?  Data processing workflows scheduling and management  Working in SQL  Generating dashboards and visualizations  Data ingestion  Managing security, governance, and HA/DR  Data discovery, annotation, and exploration  Compute management  Machine learning (ML) modeling and tracking  ML model serving  Source control with Git The Databricks workspace provides a unified interface and tools for most data tasks, including:
  • 12. Databricks for Data Engineering  Simplified data ingestion  Automated ETL processing  Reliable workflow orchestration  End-to-end observability and monitoring  Next-generation data processing engine  Foundation of governance, reliability and performance Databricks excels in data engineering with its unified platform, leveraging Apache Spark for efficient processing and scalability.
  • 13. Managed integration with open source The following technologies are open source projects founded by Databricks employees:  Delta Lake − Delta Lake is the optimized storage layer that provides the foundation for storing data and tables in the Databricks Lakehouse Platform.  Delta Sharing − An open standard for secure data sharing.  Apache Spark − Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.  MLflow − MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry
  • 14. Delta Lake  Delta Lake is the default storage format for all operations on Databricks.  Delta Lake is open source software that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling.  Delta Lake is fully compatible with Apache Spark APIs, and was developed for tight integration with Structured Streaming, allowing you to easily use a single copy of data for both batch and streaming operations and providing incremental processing at scale.
  • 15. Delta Sharing  Delta Sharing is the industry’s first open protocol for secure data sharing, making it simple to share data with other organizations regardless of which computing platforms they use.  Databricks and the Linux Foundation developed Delta Sharing to provide the first open source approach to data sharing across data, analytics and AI. Customers can share live data across platforms, clouds and regions with strong security and governance.
  • 16. Apache Spark  Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based on Hadoop MapReduce and it extends the MapReduce model to efficiently use it for more types of computations, which includes interactive queries and stream processing.  The main feature of Spark is its in-memory cluster computing that increases the processing speed of an application.  PySpark:- PySpark is an interface for Apache Spark in Python. With PySpark, you can write Python and SQL-like commands to manipulate and analyze data in a distributed processing environment.
  • 17. Databricks Workspace The Databricks “Workspace” is an Environment for accessing all of the Databricks “Assets”. The “Workspace” organizes Objects, such as- “Notebooks”, “Libraries” and “Experiments” into “Folders”, and, provides access to “Data” and Computational Resources, such as - “Clusters” and “Jobs”. The Databricks “Workspace” can be managed using :- 1. Workspace UI 2. Databricks CLI 3. Databricks REST API
  • 18. Databricks Workspace Terminology 01 02 03 05 06 04 Cluster is a “Set of Computational Resources and Configurations”, on which an organization’s Data Engineering Workloads are run. Cluster Job is a way of running a “Notebook” or a “JAR” either immediately or on a “Scheduled Basis”. Jobs Every “Databricks Deployment” has a “Central Hive Meta-store”, accessible by all “Clusters” to persist “Table Metadata”. Hive Meta-store “Notebook” is a “Web-Based Interface” composed of a “Group of Cells” that allow to execute coding commands. Notebooks DBFS is a Distributed File System mounted into each Databricks Workspace. DBFS contains Directorie s which in turn contains Data Files, Libraries and other Directories. DBFS By default, all tables created in Databricks are Delta tables. Delta tables are based on the Delta Lake open source project. Delta Table
  翻译: