Capturing and querying fine-grained provenance of preprocessing pipelines in data science (DP4DS)

1
Adriane Chapman1, Paolo Missier2, Luca Lauro3, Riccardo Torlone3
(1) University of Southampton, UK
(2) Newcastle University, UK
(3) Università Roma Tre, Italy
[1] Chapman, A.; Missier, P.; Simonelli, G.; and Torlone, R., Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. PVLDB, 14(4): 507–520. January 2021.
[2] Chapman, A.; Missier, P.; Lauro, L.; and Torlone, R., DPDS: Assisting Data Science with Data Provenance. PVLDB, 15(12): 3614 – 3617. 2022.

2
[Figure: data science pipeline. Stages shown: Data sources; Acquisition, wrangling; Preparing for learning; Training / test split (Training set, Test set); Model Selection; Model Learning; Model Validation; Model Testing; model M; Model Usage; Predictions]
Decision points:
- Source selection
- Sample / population shape
- Cleaning
- Integration
Decision points:
- Sampling / stratification
- Feature selection
- Feature engineering
- Dimensionality reduction
- Regularisation
- Imputation
- Class rebalancing
- …
[Figure: pipeline structure with provenance annotations. Elements shown: Training / test split; Training set; Imputation; Feature selection; intermediate datasets D’, D’’, …; Hyperparameters; Model Learning; model M; components C1, C2, C3; Provenance trace]

3
Provenance of what?
Base case:
- Opaque program P0
- Coarse-grained dataset
Default provenance:
- Every output depends on every input
Transparent program, fine-grained datasets:
- Transparent program PT
- Fine-grained datasets
Transparent pipeline, fine-grained datasets:
- Transparent pipeline P’T, …, PnT
- Fine-grained datasets
Transparent program, coarse-grained datasets:
- Transparent program PT
- Coarse-grained datasets
- Example (runtime: c == True):
    if c:
        y1 ← x1
    else:
        y1 ← x2
    y2 ← f(x1, x2)
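The difference between the opaque base case and the transparent program can be sketched in a few lines of Python. This is only an illustration: the function names and the dictionary-of-sets representation of dependencies are assumptions made for this sketch, not part of the authors' implementation.

```python
def opaque_provenance(inputs, outputs):
    """Default provenance for an opaque program: every output depends on every input."""
    return {out: set(inputs) for out in outputs}


def transparent_provenance(c):
    """Provenance for the transparent example program, given the runtime value of c.

    y1 is copied from x1 when c is true and from x2 otherwise;
    y2 = f(x1, x2) always depends on both inputs.
    """
    y1_deps = {"x1"} if c else {"x2"}
    return {"y1": y1_deps, "y2": {"x1", "x2"}}


coarse = opaque_provenance(["x1", "x2"], ["y1", "y2"])
fine = transparent_provenance(c=True)
print(coarse)
print(fine)
```

With c == True, the transparent case records that y1 depends on x1 only, whereas the default provenance reports a dependency on both inputs.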

4
Typical operators used in data prep

5
Data reduction
- Conditional projection
- Selection
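The two reduction operators might look as follows in pandas. This is a minimal sketch on made-up data; the column names and the 50% null threshold used as the projection condition are assumptions for illustration.

```python
import pandas as pd

# Toy data; column names and values are invented for this example.
df = pd.DataFrame({
    "age": [25, 41, None, 33],
    "zip": [None, None, None, "00146"],
    "income": [30000, 52000, 41000, 38000],
})

# Conditional projection (assumed condition): keep columns with at most 50% nulls.
keep = [col for col in df.columns if df[col].isna().mean() <= 0.5]
projected = df[keep]

# Selection: keep only the rows that satisfy a predicate.
selected = df[df["income"] > 35000]
```

Projection changes the set of columns and leaves the rows alone; selection does the opposite.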

6
Data augmentation
- Vertical augmentation
- Horizontal augmentation
Example: avg(age) group by age

7
Data transformation
Example: data imputation. Here f replaces nulls in column Zip with the most frequent value.
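A most-frequent-value imputation of this kind can be sketched in pandas as follows. The data is invented for illustration; only the column name Zip comes from the slide.

```python
import pandas as pd

# Toy data; only the column name "Zip" is taken from the slide.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Dee"],
    "Zip": ["00146", None, "00146", "SO17"],
})

# f: replace nulls with the most frequent value of the column.
most_frequent = df["Zip"].mode().iloc[0]  # mode() ignores nulls by default
imputed = df.assign(Zip=df["Zip"].fillna(most_frequent))
```

The shape of the dataframe is unchanged; only the previously null cells of Zip take a new value.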

8
Data fusion: join and append
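As a rough illustration of the two fusion operators in pandas, the sketch below joins two toy dataframes on K1 = K2 and then appends one more row. All names and values are made up for this example.

```python
import pandas as pd

# Toy inputs; names and values are invented for this example.
left = pd.DataFrame({"K1": [1, 2, 3], "A": ["a1", "a2", "a3"]})
right = pd.DataFrame({"K2": [2, 3, 4], "B": ["b2", "b3", "b4"]})

# Join: combine the rows of the two inputs whose keys match (K1 = K2).
joined = left.merge(right, left_on="K1", right_on="K2", how="inner")

# Append: stack the rows of one dataframe under another.
extra = pd.DataFrame({"K1": [9], "A": ["a9"], "K2": [9], "B": ["b9"]})
appended = pd.concat([joined, extra], ignore_index=True)
```

Unlike the single-input operators, each output row of a join draws on two input dataframes.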

10
Capturing provenance: Assumptions
- Common data abstraction: (pandas) dataframes
- Observability: runtime execution of a (Python) program can be observed
- Each input and output dataframe to each operator can be inspected

11
Capturing provenance: templates
A different provenance template pt𝜏 is associated with each type 𝜏 of operator

12
Capturing provenance: bindings
At runtime, when operator o of type 𝜏 is executed, the appropriate template pt𝜏 for 𝜏 is selected
Data items from the inputs and outputs of the operator are used to bind the variables in the template
op: {old values: F, I, V} → {new values: F’, J, V’} + binding rules
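The idea of instantiating a template with runtime bindings can be sketched as below. Everything here is an assumption made for illustration: the template contents, the `bind` and `cell_id` helpers, and the reading of F, I, V as feature (column), index (row) and value. The relation names are the PROV relations that appear in the join patterns.

```python
# Hypothetical template for a value-transforming operator; variables are in braces.
TEMPLATE = [
    ("used", "{op}", "{old}"),
    ("wasGeneratedBy", "{new}", "{op}"),
    ("wasDerivedFrom", "{new}", "{old}"),
]


def bind(template, binding):
    """Instantiate a template by substituting each variable with its bound value."""
    return [tuple(term.format(**binding) for term in stmt) for stmt in template]


def cell_id(feature, index, value):
    """Identify a data item by feature (column), index (row) and value (assumed reading of F, I, V)."""
    return f"{feature}[{index}]={value}"


# One binding per changed cell: old item (F, I, V) -> new item (F', J, V').
binding = {
    "op": "impute_1",
    "old": cell_id("Zip", 1, None),
    "new": cell_id("Zip", 1, "00146"),
}
statements = bind(TEMPLATE, binding)
```

The template is fixed per operator type; only the bindings vary from one execution to the next.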

13
This applies to all operators

14
Join provenance pattern -- keys
[Figure: PROV graph for join keys. Elements shown: a Join activity; entities Left, Right and Output; relations used (twice), wasGeneratedBy, wasDerivedFrom]

15
Join provenance pattern -- non-key elements
[Figure: PROV graph for non-key elements. Elements shown: a Join activity; entities Left, Right and Output; relations used, wasGeneratedBy, wasDerivedFrom]

17
Capturing provenance: a more practical approach
The approach just described requires recognizing the type of operation from the source code.
This restricts it to a closed set of operators, which needs to be maintained over time.
We take a more generic route to implementing the same idea:
1. Look at each operator's input / output dataframes Din, Dout, regardless of the specific operator
2. Dataframe diff: compare both the shapes and the values of Din, Dout (*)
3. Use the diff to:
• Select the appropriate template
• Bind the template variables using the relevant values in the two dataframes
(*) extends to joins, append
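A dataframe diff of this kind could be sketched as follows. This is not the authors' code: the function name, the returned dictionary, and the assumption that rows and columns keep unique labels across the operator are all choices made for this illustration.

```python
import pandas as pd


def dataframe_diff(d_in, d_out):
    """Compare the shapes and values of an operator's input and output dataframes.

    Assumes both dataframes have unique index labels and that a row or column
    keeps its label when it survives the operator.
    """
    cols_in, cols_out = set(d_in.columns), set(d_out.columns)
    rows_in, rows_out = set(d_in.index), set(d_out.index)
    shared_cols = [c for c in d_in.columns if c in cols_out]
    shared_rows = d_in.index.intersection(d_out.index)

    values_changed, nulls_reduced = [], []
    for col in shared_cols:
        before = d_in.loc[shared_rows, col]
        after = d_out.loc[shared_rows, col]
        # A cell is unchanged if the values are equal or both are null.
        same = (before == after) | (before.isna() & after.isna())
        if not same.all():
            values_changed.append(col)
        if after.isna().sum() < before.isna().sum():
            nulls_reduced.append(col)

    return {
        "rows_added": bool(rows_out - rows_in),
        "rows_removed": bool(rows_in - rows_out),
        "cols_added": bool(cols_out - cols_in),
        "cols_removed": bool(cols_in - cols_out),
        "values_changed": values_changed,
        "nulls_reduced": nulls_reduced,
    }


# Usage: an imputation changes values and reduces nulls, but not the shape.
d_in = pd.DataFrame({"K": [1, 2, 3], "Zip": ["00146", None, "00146"]})
d_out = d_in.assign(Zip=d_in["Zip"].fillna("00146"))
diff = dataframe_diff(d_in, d_out)
```

The shape flags and the per-column value flags are the kind of signal from which a template can be chosen; the changed cells themselves supply the bindings.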

18
Example
Consider the following sequence: imputation → join → append → one-hot encoding
[Figure: dataframes Da, Db, Dc, D1, D2, D3, D4, D5 connected by the operations Impute K, Join K1=K2, append, Add 'B0', 'B1', Remove 'B']

19
Example
Dataframes   | Diff                                         | Template
D1, Da       | Value change, reduced number of null values  | Data transformation
D2, {Da, Db} |                                              | Join provenance
D3, {D1, D2} |                                              | Append provenance
D4, D3       | Shape change, column(s) added                | <wait!>
D5, D4       | Shape change, column(s) removed              | Data transformation, composite

20
Summary: Shape and value changes
Shape changes:
[Figure: decision flowchart. Questions asked (Y/N): Rows added? Rows removed? Columns added? Columns removed?]
Templates:
- Horizontal augmentation
- Reduction by selection
- Reduction by projection
- Data transformation (composite)
- Data transformation
Value changes for each column:
[Figure: decision flowchart. Questions asked (Y/N): Nulls reduced? Values changed?]
Templates:
- Data transformation (imputation)
- Data transformation
- 1-1 derivations
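One possible reading of these decision rules is sketched below. The order of the checks is an assumption, since the slide only lists the questions and the resulting templates; the "vertical augmentation" branch for added columns is likewise assumed from the operator taxonomy rather than taken from this slide.

```python
def select_template(rows_added, rows_removed, cols_added, cols_removed,
                    nulls_reduced=False, values_changed=False):
    """Pick a provenance template name from shape and value changes (assumed wiring)."""
    # Shape changes are checked first.
    if rows_added:
        return "horizontal augmentation"
    if rows_removed:
        return "reduction by selection"
    if cols_added and cols_removed:
        return "data transformation (composite)"
    if cols_removed:
        return "reduction by projection"
    if cols_added:
        return "vertical augmentation"  # assumption: not listed on this slide
    # No shape change: fall back to per-column value changes.
    if nulls_reduced:
        return "data transformation (imputation)"
    if values_changed:
        return "data transformation"
    return None  # nothing changed


one_hot = select_template(False, False, cols_added=True, cols_removed=True)
imputation = select_template(False, False, False, False,
                             nulls_reduced=True, values_changed=True)
```

Under this reading, one-hot encoding (columns added and removed) maps to the composite data transformation template, and an imputation maps to the imputation variant.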

21
Code instrumentation
A Python tracker object intercepts dataframe operations, using an observer pattern.
The tracker collects the values required to generate the bindings.
[Code examples shown: create a provenance object and a tracker object; simple column transform; one-hot encoding; join]
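The observer idea can be illustrated with a small sketch. The class names, the `apply` and `notify` methods, and the recorded fields are hypothetical and do not reproduce the API of the authors' tool; the sketch only shows a tracker reporting each operation's input and output dataframes to a provenance object.

```python
import pandas as pd


class Provenance:
    """Observer that records one entry per observed operation (hypothetical API)."""

    def __init__(self):
        self.records = []

    def notify(self, op_name, d_in, d_out):
        self.records.append({
            "op": op_name,
            "shape_in": d_in.shape,
            "shape_out": d_out.shape,
        })


class Tracker:
    """Subject that wraps a dataframe and reports each operation to its observers."""

    def __init__(self, df, observers):
        self.df = df
        self.observers = list(observers)

    def apply(self, op_name, func):
        d_in = self.df
        d_out = func(d_in.copy())  # work on a copy so the input stays inspectable
        for obs in self.observers:
            obs.notify(op_name, d_in, d_out)
        self.df = d_out
        return self


# Create a provenance object and a tracker object, then run two operations.
prov = Provenance()
tracker = Tracker(pd.DataFrame({"A": [1, 2], "B": ["x", "y"]}), [prov])
tracker.apply("column transform", lambda d: d.assign(A=d["A"] * 2))
tracker.apply("one-hot encoding", lambda d: pd.get_dummies(d, columns=["B"]))
```

Because the observer receives both dataframes for every step, it has what it needs to compute a diff and generate bindings without knowing which operator ran.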

22
Evaluation – benchmark datasets
Census pipeline:
- Clerical cleaning on every cell (removing blanks)
- Replace all '?' with NaN
- One-hot encoding of 7 categorical variables
- Map binary labels to 0, 1
- Drop one column
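The steps of this pipeline could be written in pandas roughly as follows. This is a sketch on a three-row toy dataframe, not the benchmark code: the column names, the label values, the choice of column to drop, and the use of a single categorical variable instead of seven are all assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the census data; names and values are assumed.
df = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov"],
    "fnlwgt": [1, 2, 3],
    "label": [" <=50K", " >50K", " <=50K"],
})

# 1. Clerical cleaning: remove blanks from every text cell.
for col in ["workclass", "label"]:
    df[col] = df[col].str.strip()

# 2. Replace all '?' with NaN.
df = df.replace("?", np.nan)

# 3. One-hot encode the categorical variable (a NaN gets no indicator column).
df = pd.get_dummies(df, columns=["workclass"])

# 4. Map binary labels to 0, 1.
df["label"] = df["label"].map({"<=50K": 0, ">50K": 1})

# 5. Drop one column.
df = df.drop(columns=["fnlwgt"])
```

Each step changes either values or shape in a way that is visible by comparing the dataframe before and after it.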

23
Evaluation – benchmark pipelines

24
Evaluation: Provenance capture times

25
Evaluation: Provenance query times on Neo4J

26
Scalability: provenance query times
Synthetic Benchmarking datasets created using TPC-DI

27
Scalability: operations on TPC-DI datasets
[Figure panels: basic operators; join + append operators]

28
Tool demo
DPDS: Assisting Data Science with Data Provenance. Chapman, A.; Missier, P.; Lauro, L.; and Torlone, R. PVLDB, 15(12): 3614–3617. 2022. (demo paper)

29
Summary
A method, infrastructure and tooling for collecting, querying, and visualizing very fine-grained provenance from data processing pipelines
1. What is the killer app for such granular provenance?
2. How general is the technique with respect to arbitrary pandas programs?
