1
Capturing and querying fine-grained provenance of
preprocessing pipelines in data science
(DP4DS)
Adriane Chapman (1), Paolo Missier (2), Luca Lauro (3), Riccardo Torlone (3)
(1) University of Southampton, UK
(2) Newcastle University, UK
(3) Università Roma Tre, Italy
[1] Chapman, A.; Missier, P.; Simonelli, G.; and Torlone, R. Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. PVLDB, 14(4): 507–520. January 2021.
[2] Chapman, A.; Missier, P.; Lauro, L.; and Torlone, R. DPDS: Assisting Data Science with Data Provenance. PVLDB, 15(12): 3614–3617. 2022.
2
[Figure: data science pipeline. Stages: Data sources; Acquisition, wrangling; Preparing for learning; Training / test split (Training set, Test set); Model Selection; Model Learning; Model Validation; Model Testing; Model Usage. Outputs: model M, Predictions]
Decision points:
- Source selection
- Sample / population shape
- Cleaning
- Integration
Decision points:
- Sampling / stratification
- Feature selection
- Feature engineering
- Dimensionality reduction
- Regularisation
- Imputation
- Class rebalancing
- …
[Figure: pipeline structure with provenance annotations. Elements: Training / test split; Imputation; Feature selection; …; Training set; Hyperparameters; Model Learning; M; intermediate datasets D′, D″; Provenance trace; labels C1, C2, C3]
3
Provenance of what?
Base case:
- Opaque program P0
- Coarse-grained dataset
Default provenance:
- Every output depends on every input
Finer-grained cases:
- Transparent program PT, fine-grained datasets
- Transparent pipeline P′T (programs PnT), fine-grained datasets
- Transparent program PT, coarse-grained datasets. Example:
  if c:
      y1 ← x1
  else:
      y1 ← x2
  y2 ← f(x1, x2)
  Runtime: c == True
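The contrast between the opaque default and the transparent example can be sketched in a few lines. This is only an illustration: the function names and the dictionary used to record dependencies are assumptions, and the control dependency on c is deliberately not recorded.

```python
def run_transparent(x1, x2, c, f):
    """Run the example program and record which inputs each output used."""
    deps = {}
    if c:
        y1 = x1
        deps["y1"] = {"x1"}
    else:
        y1 = x2
        deps["y1"] = {"x2"}
    y2 = f(x1, x2)
    deps["y2"] = {"x1", "x2"}
    return (y1, y2), deps


def opaque_dependencies(inputs, outputs):
    """Default provenance for an opaque program: every output depends on every input."""
    return {out: set(inputs) for out in outputs}


# Runtime: c == True, so y1 depends on x1 only
(y1, y2), fine = run_transparent(3, 4, c=True, f=lambda a, b: a + b)
coarse = opaque_dependencies(["x1", "x2"], ["y1", "y2"])
print(fine)
print(coarse)
```

With c == True the transparent trace drops the dependency of y1 on x2, which the opaque default cannot do.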
4
Typical operators used in data prep
5
Data reduction
- Conditional projection
- Selection
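A minimal pandas sketch of the two reduction operators. The dataframe, the column names and the 50% null threshold are made up for illustration, and conditional projection is read here as "keep only the columns that satisfy a condition", which is an assumption about the slide's definition.

```python
import pandas as pd

# Hypothetical input dataframe
df = pd.DataFrame({
    "age": [25, 40, 31],
    "zip": ["SO17", None, None],
    "income": [30, 52, 41],
})

# Selection: keep the rows that satisfy a predicate.
# The surviving index labels identify the source rows.
selected = df[df["age"] > 30]

# Conditional projection: keep the columns whose null fraction is at most 50%.
keep = [c for c in df.columns if df[c].isna().mean() <= 0.5]
projected = df[keep]

print(list(selected.index))
print(list(projected.columns))
```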
6
Data augmentation
Vertical augmentation
Horizontal augmentation
avg(age)
group by age
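A small pandas sketch of the two augmentation directions, assuming that vertical augmentation adds columns and horizontal augmentation adds rows (the latter matches the "Rows added?" case in the summary slide). The data and the derived feature are hypothetical.

```python
import pandas as pd

# Hypothetical input dataframe
df = pd.DataFrame({"name": ["a", "b", "c"], "age": [20, 30, 40]})

# Vertical augmentation (columns added): derive a new feature from an existing one
vertical = df.assign(age_decade=df["age"] // 10)

# Horizontal augmentation (rows added): append one aggregate row
summary = pd.DataFrame({"name": ["avg"], "age": [df["age"].mean()]})
horizontal = pd.concat([df, summary], ignore_index=True)

print(vertical.shape, horizontal.shape)
```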
7
Data transformation
Example: data imputation. Here f replaces the nulls in column Zip with the column's most frequent value
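The imputation example can be written in pandas as follows. The sample rows are invented; only the column name Zip and the "most frequent value" rule come from the slide.

```python
import pandas as pd

# Hypothetical data with one null in Zip
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Di"],
    "Zip": ["NE1", None, "NE1", "SO17"],
})

# f: replace nulls with the most frequent value of the column
most_frequent = df["Zip"].mode().iloc[0]
was_null = df["Zip"].isna()
imputed = df.assign(Zip=df["Zip"].fillna(most_frequent))

# Rows whose Zip value was produced by f; these are the cells
# for which a derivation would need to be recorded
changed_rows = list(df.index[was_null])
print(imputed["Zip"].tolist(), changed_rows)
```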
8
Data fusion: join and append
9
Provenance model
10
Capturing provenance: Assumptions
- Common data abstraction: pandas dataframes
- Observability: the runtime execution of a Python program can be observed
- Each input and output dataframe of each operator can be inspected
11
Capturing provenance: templates
A different provenance template pt𝜏 is associated with each type 𝜏 of operator
12
Capturing provenance: bindings
At runtime, when operator o of type 𝜏 is executed, the appropriate template pt𝜏 for 𝜏 is selected
Data items from the inputs and outputs of the operator are used to bind the variables in the template
op: {old values: F, I, V} → {new values: F′, J, V′}
+ Binding rules
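One way to picture templates and bindings is as a statement pattern with placeholders, instantiated once per set of values taken from the operator's input and output. The sketch below is hypothetical: the tuple-based template, the `bind` helper and the reading of F, I, V as feature, row index and value are all assumptions, not the authors' implementation.

```python
def bind(template, bindings):
    """Instantiate a provenance template once per binding."""
    return [tuple(part.format(**b) for part in template) for b in bindings]


# Hypothetical template: (relation, new data item, old data item)
VALUE_CHANGE_TEMPLATE = ("wasDerivedFrom", "{F_new}[{J}]={V_new}", "{F}[{I}]={V}")

# One binding per changed cell, collected at runtime from the two dataframes
bindings = [
    {"F": "Zip", "I": 1, "V": None, "F_new": "Zip", "J": 1, "V_new": "NE1"},
]

records = bind(VALUE_CHANGE_TEMPLATE, bindings)
print(records)
```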
13
This applies to all operators
14
Join provenance pattern -- keys
[Figure: PROV pattern for join keys. Nodes: Left, Right, Output, Join activity. Edge labels: used (twice), wasGeneratedBy, wasDerivedFrom]
15
Join provenance pattern -- non-key elements
[Figure: PROV pattern for non-key elements. Nodes: Left, Right, Output, Join activity. Edge labels: used, wasGeneratedBy, wasDerivedFrom]
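A rough pandas sketch of the two join patterns, under the assumption that an output key cell is derived from the matching key cells on both sides, while a non-key cell is derived only from the side it came from. The dataframes, the `l_row` / `r_row` helper columns and the string identifiers are hypothetical.

```python
import pandas as pd

# Hypothetical inputs
left = pd.DataFrame({"K": [1, 2], "A": ["a1", "a2"]})
right = pd.DataFrame({"K": [2, 3], "B": ["b2", "b3"]})

# Carry the original row labels through the join so that each output row
# can be traced back to its source rows.
joined = left.reset_index().rename(columns={"index": "l_row"}).merge(
    right.reset_index().rename(columns={"index": "r_row"}), on="K"
)

derivations = []
for t in joined.itertuples():
    out, l, r = t.Index, t.l_row, t.r_row
    # Key pattern: the output key is derived from both input keys
    derivations.append((f"out[{out}].K", [f"left[{l}].K", f"right[{r}].K"]))
    # Non-key pattern: each output cell is derived from one input cell
    derivations.append((f"out[{out}].A", [f"left[{l}].A"]))
    derivations.append((f"out[{out}].B", [f"right[{r}].B"]))

print(derivations)
```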
17
Capturing provenance: a more practical approach
The approach just described requires recognizing the type of operation from the source code
This restricts it to a closed set of operators → the set needs to be maintained over time
We take a more generic route to implementing the same idea:
1. Look at the operators’ input / output dataframes Din, Dout regardless of the specific operator
2. Dataframe diff: compare both the shapes and the values of Din, Dout (*)
3. Use the diff to:
• Select the appropriate template
• Bind the template variables using the relevant values in the two dataframes
(*) extends to joins, append
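The dataframe diff in step 2 could look roughly like this. It is a sketch only: the function name and the returned keys are hypothetical, and it assumes that index labels are unique and preserved by the operator, so that rows can be matched between Din and Dout.

```python
import pandas as pd


def dataframe_diff(d_in, d_out):
    """Compare the shapes and values of an operator's input and output dataframes."""
    cols_in, cols_out = set(d_in.columns), set(d_out.columns)
    rows_in, rows_out = set(d_in.index), set(d_out.index)
    shared_cols = [c for c in d_in.columns if c in cols_out]
    shared_rows = d_in.index.intersection(d_out.index)
    a = d_in.loc[shared_rows, shared_cols]
    b = d_out.loc[shared_rows, shared_cols]
    # A cell is unchanged if the values are equal or both are null
    changed = ~((a == b) | (a.isna() & b.isna()))
    return {
        "rows_added": sorted(rows_out - rows_in),
        "rows_removed": sorted(rows_in - rows_out),
        "cols_added": sorted(cols_out - cols_in),
        "cols_removed": sorted(cols_in - cols_out),
        "nulls_reduced": [c for c in shared_cols if b[c].isna().sum() < a[c].isna().sum()],
        "values_changed": [c for c in shared_cols if changed[c].any()],
    }


# Usage on a hypothetical imputation step
d_in = pd.DataFrame({"Name": ["Ann", "Bob"], "Zip": ["NE1", None]})
d_out = d_in.assign(Zip=d_in["Zip"].fillna("NE1"))
diff = dataframe_diff(d_in, d_out)
print(diff)
```

The shape entries drive template selection, while the per-column entries point at the cells whose values are needed for the bindings.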
18
Example
Consider the following sequence: Imputation → join → append → one-hot encoding
[Figure: dataframes Da, Db, Dc, D1, D2, D3, D4, D5. Steps: Impute K; Join K1=K2; append; Add ‘B0’, ‘B1’; Remove ‘B’]
19
Example
Dataframes | Diff | Template
D1, Da | Value change, reduced number of null values | Data transformation
D2, {Da, Db} | | Join provenance
D3, {D1, D2} | | Append provenance
D4, D3 | Shape change, column(s) added | <wait!>
D5, D4 | Shape change, column(s) removed | Data transformation, composite
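For readers who want to reproduce the shape changes, the sequence can be run on toy data. Everything concrete here is an assumption: the cell values, the column C, and in particular the wiring of Db and Dc, which is not recoverable from the slide. The sketch assumes D1 is joined with Db and Dc is appended to the result.

```python
import pandas as pd

# Hypothetical inputs; column names K and B follow the slide
Da = pd.DataFrame({"K": [1.0, None, 2.0], "B": [0, 1, 0]})
Db = pd.DataFrame({"K": [1.0, 2.0], "C": [10, 20]})
Dc = pd.DataFrame({"K": [3.0], "B": [1], "C": [30]})

D1 = Da.assign(K=Da["K"].fillna(Da["K"].mode().iloc[0]))          # impute K
D2 = D1.merge(Db, on="K")                                         # join on K
D3 = pd.concat([D2, Dc], ignore_index=True)                       # append
D4 = D3.join(pd.get_dummies(D3["B"], prefix="B", prefix_sep=""))  # add 'B0', 'B1'
D5 = D4.drop(columns="B")                                         # remove 'B'

for name, d in [("D1", D1), ("D2", D2), ("D3", D3), ("D4", D4), ("D5", D5)]:
    print(name, d.shape, list(d.columns))
```

The one-hot encoding shows up as two consecutive shape changes (columns added, then a column removed), which is why the D4 step is deferred until the D5 step is seen.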
20
Summary: Shape and value changes
Shape changes:
[Figure: decision tree. Tests: Rows added? Rows removed? Columns added? Columns removed?]
Templates: Horizontal augmentation; Reduction by selection; Reduction by projection; Data transformation (composite); Data transformation
Value changes for each column:
[Figure: decision tree. Tests: Nulls reduced? Values changed?]
Templates: Data transformation (imputation); Data transformation; 1-1 derivations
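The two decision trees can be approximated in code. The mapping from tests to templates below is a reading of the slide, not a confirmed rule set: in particular, treating "columns added only" as a plain data transformation and "no value change" as 1-1 derivations are assumptions, as is the function name.

```python
import pandas as pd


def select_templates(d_in, d_out):
    """Pick templates from shape changes and per-column value changes."""
    shape = []
    if len(d_out.index.difference(d_in.index)) > 0:
        shape.append("horizontal augmentation")
    if len(d_in.index.difference(d_out.index)) > 0:
        shape.append("reduction by selection")
    cols_added = len(d_out.columns.difference(d_in.columns)) > 0
    cols_removed = len(d_in.columns.difference(d_out.columns)) > 0
    if cols_added and cols_removed:
        shape.append("data transformation (composite)")
    elif cols_added:
        shape.append("data transformation")
    elif cols_removed:
        shape.append("reduction by projection")

    # Value changes are checked only on rows and columns present in both dataframes
    rows = d_in.index.intersection(d_out.index)
    per_column = {}
    for col in d_in.columns.intersection(d_out.columns):
        before, after = d_in.loc[rows, col], d_out.loc[rows, col]
        if after.isna().sum() < before.isna().sum():
            per_column[col] = "data transformation (imputation)"
        elif not before.equals(after):
            per_column[col] = "data transformation"
        else:
            per_column[col] = "1-1 derivations"
    return {"shape": shape, "columns": per_column}


# Usage on two hypothetical operators
d_in = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})
one_hot = pd.get_dummies(d_in, columns=["B"])
filtered = d_in[d_in["A"] > 1]
print(select_templates(d_in, one_hot))
print(select_templates(d_in, filtered))
```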
21
Code instrumentation
A Python tracker object intercepts dataframe operations, using an observer pattern
The tracker collects the values required to generate the bindings
[Code screenshots: create a provenance object and a tracker object; simple column transform; one-hot encoding; join]
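The original code screenshots are not recoverable, so the following is a stand-in that only illustrates the observer idea. The `Provenance` and `Tracker` classes, their methods and the recorded fields are hypothetical and should not be read as the tool's actual API.

```python
import pandas as pd


class Provenance:
    """Hypothetical observer that stores one record per observed operation."""

    def __init__(self):
        self.records = []

    def notify(self, description, d_in, d_out):
        self.records.append(
            {"op": description, "in_shape": d_in.shape, "out_shape": d_out.shape}
        )


class Tracker:
    """Hypothetical wrapper that notifies observers whenever the dataframe is replaced."""

    def __init__(self, df, observers):
        self._df = df
        self._observers = list(observers)

    @property
    def df(self):
        return self._df

    def apply(self, description, operation):
        d_in = self._df
        d_out = operation(d_in.copy())
        for observer in self._observers:
            observer.notify(description, d_in, d_out)
        self._df = d_out
        return d_out


# Create a provenance object and a tracker object
prov = Provenance()
tracker = Tracker(pd.DataFrame({"x": [1, 2], "B": ["u", "v"]}), [prov])

# Simple column transform
tracker.apply("simple column transform", lambda d: d.assign(x=d["x"] * 2))

# One-hot encoding
tracker.apply("one-hot encoding", lambda d: pd.get_dummies(d, columns=["B"]))

print(prov.records)
```

Because the observer receives both the input and the output dataframe, it has what it needs to compute a diff and generate bindings without knowing which operator ran.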
22
Evaluation – benchmark datasets
Census pipeline:
- Clerical cleaning on every cell (removing blanks)
- Replace all ‘?’ with NaN
- One-hot encoding of 7 categorical variables
- Map binary labels to 0, 1
- Drop one column
Evaluation – benchmark pipelines
24
Evaluation: Provenance capture times
25
Evaluation: Provenance query times on Neo4j
26
Scalability: provenance query times
Synthetic benchmarking datasets created using TPC-DI
27
Scalability: operations on TPC-DI datasets
[Figure panels: Basic operators; Join + append operators]
28
Tool demo
Chapman, A.; Missier, P.; Lauro, L.; and Torlone, R. DPDS: Assisting Data Science with Data Provenance. PVLDB, 15(12): 3614–3617. 2022. (demo paper)
29
Summary
A method, infrastructure and tooling for collecting, querying, and visualizing very fine-grained provenance from data processing pipelines
Open questions:
1. What is the killer app for such granular provenance?
2. How general is the technique with respect to arbitrary pandas programs?
Amelia Swank
 
Mastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B LandscapeMastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B Landscape
marketing943205
 
Computer Systems Quiz Presentation in Purple Bold Style (4).pdf
Computer Systems Quiz Presentation in Purple Bold Style (4).pdfComputer Systems Quiz Presentation in Purple Bold Style (4).pdf
Computer Systems Quiz Presentation in Purple Bold Style (4).pdf
fizarcse
 
Best 10 Free AI Character Chat Platforms
Best 10 Free AI Character Chat PlatformsBest 10 Free AI Character Chat Platforms
Best 10 Free AI Character Chat Platforms
Soulmaite
 
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptxUiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
anabulhac
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
Slack like a pro: strategies for 10x engineering teams
Slack like a pro: strategies for 10x engineering teamsSlack like a pro: strategies for 10x engineering teams
Slack like a pro: strategies for 10x engineering teams
Nacho Cougil
 
Ad

Capturing and querying fine-grained provenance of preprocessing pipelines in data science (DP4DS)

  • 1. Capturing and querying fine-grained provenance of preprocessing pipelines in data science (DP4DS). Adriane Chapman (1), Paolo Missier (2), Luca Lauro (3), Riccardo Torlone (3). (1) University of Southampton, UK; (2) Newcastle University, UK; (3) Università Roma Tre, Italy. [1] Chapman, A.; Missier, P.; Simonelli, G.; and Torlone, R., Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. PVLDB, 14(4): 507–520. January 2021. [2] Chapman, A.; Missier, P.; Lauro, L.; and Torlone, R., DPDS: Assisting Data Science with Data Provenance. PVLDB, 15(12): 3614–3617. 2022.
  • 2. [Figure: pipeline structure with provenance annotations. Stages shown: data sources; acquisition, wrangling; preparing for learning; training / test split (training set, test set); model selection; model learning; model validation; model testing; model usage (predictions).] Decision points (acquisition, wrangling): source selection; sample / population shape; cleaning; integration. Decision points (preparing for learning): sampling / stratification; feature selection; feature engineering; dimensionality reduction; regularisation; imputation; class rebalancing; … [Figure: provenance trace for model M, covering model learning, training set, training / test split, imputation, feature selection, intermediate datasets D’ and D’’, hyperparameters, and C1, C2, C3.]
  • 3. Provenance of what? Base case: opaque program Po, coarse-grained dataset. Default provenance: every output depends on every input. Further cases: transparent program PT with coarse-grained datasets; transparent program PT with fine-grained datasets; transparent pipeline with fine-grained datasets. Example of a transparent program: if c: y1 ← x1 else: y1 ← x2; y2 ← f(x1, x2). Runtime: c == True.
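
The toy program on slide 3 can be read as follows. This is a minimal sketch, not the authors' implementation: the function name `run_with_provenance` and the dictionary-of-sets encoding of dependencies are assumptions made for illustration. It contrasts the default (opaque) provenance, where every output depends on every input, with the dependencies actually exercised at runtime.

```python
def run_with_provenance(x1, x2, c, f):
    """Run the slide's toy program and record which inputs each output used.

    The dependency encoding (output name -> set of input names) is a
    hypothetical choice for this sketch.
    """
    deps = {}
    if c:
        y1 = x1
        deps["y1"] = {"x1"}      # branch taken: y1 copies x1 only
    else:
        y1 = x2
        deps["y1"] = {"x2"}      # other branch: y1 copies x2 only
    y2 = f(x1, x2)
    deps["y2"] = {"x1", "x2"}    # f reads both inputs
    return (y1, y2), deps


outputs, fine_grained = run_with_provenance(3, 4, c=True, f=lambda a, b: a + b)

# Default provenance for an opaque program: every output depends on every input.
coarse = {name: {"x1", "x2"} for name in fine_grained}

print(outputs)
print(fine_grained)
print(coarse)
```

With `c == True`, the recorded trace links `y1` to `x1` alone, which the coarse default cannot express.
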
  • 5. Data reduction: conditional projection; selection.
  • 6. Data augmentation: vertical augmentation; horizontal augmentation. Figure label: avg(age) group by age.
  • 7. Data transformation. Example: data imputation. Here f replaces nulls in column Zip with the most frequent value.
  • 8. Data fusion: join and append.
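
The operator classes on slides 5 to 8 map naturally onto pandas calls. The sketch below is illustrative only: the toy data, the column names, and the particular pandas idioms chosen for each operator are assumptions, and the slides do not say which calls the authors' tool recognises.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["ann", "bob", "cy", "dee"],
    "age": [22, 41, 22, 35],
    "zip": ["AB1", None, "AB1", "CD2"],
})

# Data reduction: selection keeps rows, conditional projection keeps columns
# that satisfy a condition (here: columns without nulls).
selected = df[df["age"] > 30]
projected = df.loc[:, df.notna().all()]

# Horizontal augmentation: add a column derived from an existing one.
horizontal = df.assign(
    age_band=df["age"].map(lambda a: "young" if a < 25 else "adult")
)

# Vertical augmentation (one possible reading): append aggregate rows.
avg_rows = df.groupby("zip", as_index=False)["age"].mean()
vertical = pd.concat([df, avg_rows], ignore_index=True)

# Data transformation: impute nulls in "zip" with the most frequent value.
most_frequent = df["zip"].mode().iloc[0]
imputed = df.assign(zip=df["zip"].fillna(most_frequent))

# Data fusion: join with a second (made-up) table, and append extra rows.
regions = pd.DataFrame({"zip": ["AB1", "CD2"], "region": ["north", "south"]})
joined = imputed.merge(regions, on="zip", how="left")
appended = pd.concat([df, df.iloc[:1]], ignore_index=True)
```

Each result differs from its input in shape, in values, or in both, which is the property the later slides exploit.
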
  • 10. Capturing provenance: assumptions. Common data abstraction: (pandas) dataframes. Observability: the runtime execution of a (Python) program can be observed. Each input and output dataframe of each operator can be inspected.
  • 11. Capturing provenance: templates. A different provenance template pt𝜏 is associated with each type 𝜏 of operator.
  • 12. Capturing provenance: bindings. At runtime, when an operator o of type 𝜏 is executed, the appropriate template pt𝜏 for 𝜏 is selected. Data items from the inputs and outputs of the operator are used to bind the variables in the template: op {old values: F, I, V} → {new values: F’, J, V’}, plus binding rules.
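
The template-and-binding idea on slides 11 and 12 can be sketched in a few lines. Everything named here is hypothetical: the `TEMPLATE` statement list, the `cell_id` naming scheme, and the activity identifier `impute_1` are assumptions for illustration, not the templates used in the paper. The relation names are standard W3C PROV relations.

```python
# A hypothetical template for a value-transforming operator. Names in braces
# are template variables to be bound at runtime.
TEMPLATE = [
    ("used", "{activity}", "{old}"),
    ("wasGeneratedBy", "{new}", "{activity}"),
    ("wasDerivedFrom", "{new}", "{old}"),
]


def cell_id(frame, feature, index):
    """Identify one data item by dataframe name, feature (column) and index."""
    return f"{frame}/{feature}/{index}"


def bind(template, binding):
    """Instantiate every statement of the template with the given binding."""
    return [
        tuple(part.format(**binding) for part in statement)
        for statement in template
    ]


# Binding for one imputed cell: row 3 of column Zip, before and after.
binding = {
    "activity": "impute_1",
    "old": cell_id("Da", "Zip", 3),
    "new": cell_id("D1", "Zip", 3),
}
statements = bind(TEMPLATE, binding)
for statement in statements:
    print(statement)
```

One binding is produced per affected data item, so a single operator execution yields many instantiated statements from the same template.
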
  • 13. This applies to all operators.
  • 14. Join provenance pattern: keys. [Figure: graph over the Left, Right and Output entities and the Join activity, with used, wasGeneratedBy and wasDerivedFrom relations.]
  • 15. Join provenance pattern: non-key elements. [Figure: graph over the Left, Right and Output entities and the Join activity, with used, wasGeneratedBy and wasDerivedFrom relations.]
  • 16. Capturing provenance: a more practical approach. The approach just described requires recognizing the type of operation from the source code. This restricts it to a closed set of operators, which needs to be maintained over time. We take a more generic route to implementing the same idea: 1. Look at each operator’s input and output dataframes Din, Dout, regardless of the specific operator. 2. Dataframe diff: compare both the shapes and the values of Din and Dout (*). 3. Use the diff to select the appropriate template and to bind the template variables using the relevant values in the two dataframes. (*) Extends to joins and append.
  • 17. Example. Consider the following sequence: imputation → join → append → one-hot encoding. [Figure: Da → impute K → D1; Db, Dc → join on K1=K2 → D2; D1, D2 → append → D3; D3 → add ‘B0’, ‘B1’ → D4; D4 → remove ‘B’ → D5.]
  • 18. Example (continued). Dataframes compared, diff observed, and template selected:
      - D1, Da: value change, reduced number of null values. Template: data transformation.
      - D2, {Db, Dc}: join provenance.
      - D3, {D1, D2}: append provenance.
      - D4, D3: shape change, column(s) added. Template: wait for the next step.
      - D5, D4: shape change, column(s) removed. Template: data transformation, composite.
  • 19. Summary: shape and value changes. Shape changes checked: rows added? rows removed? columns added? columns removed? Templates selected from the shape diff: horizontal augmentation; reduction by selection; reduction by projection; data transformation (composite); data transformation when the shape is unchanged. Value changes checked for each column: nulls reduced? values changed? Templates selected from the value diff: data transformation (imputation); data transformation; 1-1 derivations when nothing changed.
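
A dataframe diff along the lines of slides 16 to 19 might look like the sketch below. The function name `diff_template`, the order of the checks, and the exact mapping from diff outcome to template name are assumptions reconstructed from the slide text. In particular, the "vertical augmentation" branch for added rows is a guess, and joins and appends, which have more than one input, are not covered.

```python
import pandas as pd


def diff_template(d_in, d_out):
    """Pick a template name by comparing shapes, then values, of Din and Dout."""
    added_cols = set(d_out.columns) - set(d_in.columns)
    removed_cols = set(d_in.columns) - set(d_out.columns)

    # Shape changes on columns.
    if added_cols and removed_cols:
        return "data transformation (composite)"
    if added_cols:
        return "horizontal augmentation"
    if removed_cols:
        return "reduction by projection"

    # Shape changes on rows.
    if len(d_out) < len(d_in):
        return "reduction by selection"
    if len(d_out) > len(d_in):
        return "vertical augmentation"  # assumption, not stated on the slide

    # Same shape: look at value changes.
    nulls_in = int(d_in.isna().sum().sum())
    nulls_out = int(d_out.isna().sum().sum())
    if nulls_out < nulls_in:
        return "data transformation (imputation)"
    if not d_in.equals(d_out):
        return "data transformation"
    return "1-1 derivations"


df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", None, "x"]})
print(diff_template(df, df.fillna({"B": "x"})))
print(diff_template(df, df[df["A"] > 1]))
print(diff_template(df, df.drop(columns=["B"])))
print(diff_template(df, pd.get_dummies(df, columns=["B"])))
```

One-hot encoding both adds and removes columns in a single call here, so it falls directly into the composite case; when the two steps are issued separately, as in the example on slides 17 and 18, the decision has to wait for the second step.
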
  • 20. Code instrumentation. A Python tracker object intercepts dataframe operations, using an observer pattern. The tracker collects the values required to generate the bindings. [Code screenshots: creating a provenance object and a tracker object; simple column transform; one-hot encoding; join.]
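
The code shown on slide 20 is not recoverable from the transcript, so the following is only a sketch of how a tracker could intercept dataframe operations with an observer pattern. The `Tracker` class, its constructor arguments, and the `record` observer are hypothetical and are not the API of the authors' tool.

```python
import pandas as pd


class Tracker:
    """Hypothetical proxy that forwards method calls to a dataframe and
    notifies observers with the input and output of each operation."""

    def __init__(self, data, observers):
        self._data = data
        self._observers = observers

    def __getattr__(self, name):
        # Only called for names not found on the Tracker itself.
        attribute = getattr(self._data, name)
        if not callable(attribute):
            return attribute  # plain attributes such as .shape pass through

        def intercepted(*args, **kwargs):
            result = attribute(*args, **kwargs)
            if isinstance(result, pd.DataFrame):
                for observer in self._observers:
                    observer(name, self._data, result)
                return Tracker(result, self._observers)
            return result

        return intercepted


events = []


def record(operation, d_in, d_out):
    """Observer: keep what a provenance generator would need (here, shapes)."""
    events.append((operation, d_in.shape, d_out.shape))


df = pd.DataFrame({"A": [1.0, None, 3.0], "B": ["x", "y", "x"]})
tracked = Tracker(df, [record]).dropna().drop(columns=["B"])
print(events)
print(tracked.shape)
```

A real observer would receive the full input and output dataframes, as `record` does here, and could run a dataframe diff on them to choose and bind a template.
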
  • 21. Evaluation: benchmark datasets. Census pipeline: clerical cleaning on every cell (removing blanks); replace all ‘?’ with NaN; one-hot encoding of 7 categorical variables; map binary labels to 0, 1; drop one column.
  • 25. Scalability: provenance query times. Synthetic benchmarking datasets created using TPC-DI.
  • 26. Scalability: operations on TPC-DI datasets. [Plots: basic operators; join + append operators.]
  • 27. Tool demo. DPDS: Assisting Data Science with Data Provenance. Chapman, A.; Missier, P.; Lauro, L.; and Torlone, R. PVLDB, 15(12): 3614–3617. 2022. (demo paper)
  • 28. Summary. A method, infrastructure and tooling for collecting, querying, and visualizing very fine-grained provenance from data processing pipelines. 1. What is the killer app for such granular provenance? 2. How general is the technique with respect to arbitrary pandas programs?

Editor's Notes

  • #7: $f_1$, which associates the string "young" with an age less than 25 and the string "adult" otherwise; $f_2$, which computes the average of a set of numbers.
  • #19: $D_1=\tau_{f(K)}(D_a)$; $D_2=D_b \bowtie^{\mathrm{outer}}_{K_1=K_2} D_c$; $D_3=D_1 \cup D_2$; $D_4=\mathrm{horaug}_{h(B)}(D_3)$; $D_5=\pi_{\{A,B_0,B_1\}}(D_4)$