Iteratively Learning Data Transformation Programs from Examples

Ph.D. defense
2015-10-21
Bo Wu
Department of Computer Science

Agenda
• Introduction
• Previous work
• Our approach
– Learning conditional statements
– Synthesizing branch transformation programs
– Maximize user correctness with minimal effort
• Related work
• Conclusion and future work

Programming by example
Accession  Credit                   Dimensions                                  Medium         Name
01.2       Gift of the artist       5.25 in HIGH x 9.375 in WIDE                Oil on canvas  John Mix Stanley
05.411     Gift of James L. Edison  20 in HIGH x 24 in WIDE                     Oil on canvas  Mortimer L. Smith
06.1       Gift of the artist       Image: 20.5 in. HIGH x 17.5 in. WIDE        Oil on canvas  Theodore Scott Dabo
06.2       Gift of the artist       9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE  Oil on canvas  Leon Dabo
…
09.8       Gift of the artist       12 in|14 in HIGH x 16 in|18 in WIDE         Oil on canvas  Gari Melchers

Programming by Example
    Raw Value                                   Target Value (over successive iterations)
R1  5.25 in HIGH x 9.375 in WIDE                9.375
R2  20 in HIGH x 24 in WIDE                     24     24     24
R3  9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE  null   19.5   19.5
R4  Image: 20.5 in. HIGH x 17.5 in. WIDE        17.5   17.5   17.5
R5  12 in|14 in HIGH x 16 in|18 in WIDE         null   null   18
. . .

Challenges
• Various formats and few examples
• Stringent time limits
• Verifying the correctness on large datasets

Research problem
Enabling PBE approaches to efficiently generate correct
transformation programs for large datasets with
multiple formats using minimal user effort

Iterative Transformation
• Users: examining records and providing examples
• PBE systems: synthesizing programs and transforming records
• Users to PBE systems: examples
• PBE systems to users: transformed records


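[Figure: system overview]
• GUI: the user provides examples
• Clustering: examples are grouped into Example Cluster 1 … Example Cluster K
• Learning classifier: the clusters are used to learn the conditional statement
• Synthesizing: one branch program per cluster (Branch Program 1 … Branch Program k)
• Combining: the conditional statement and the branch programs form the transformation program
Examples:
R1 5.25 in HIGH x 9.375 in WIDE 9.375
R2 20 in HIGH x 24 in WIDE 24
R3 9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE 19.5
R4 Image: 20.5 in. HIGH x 17.5 in. WIDE 17.5
Clusters: {R1, R2, R4} and {R3}
classify selects the branch; the branches are substring(pos1, pos2) and substring(pos3, pos4)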

Transformation Program
BNK: blankspace
NUM([0-9]+): 98
UWRD([A-Z]): I
LWRD([a-z]+): mage
WORD([a-zA-Z]+): Image
START:
END:
VBAR: |
…
Example: 9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE -> 19.5
Segment program: returns a substring
Position program: returns a position in the input
Transform(value)
    switch (classify(value)):                    [conditional statement]
        case format1:                            [branch transformation program]
            pos1 = value.indexOf(BNK, NUM, -1)
            pos2 = value.indexOf(NUM, BNK, 2)
            output = value.substr(pos1, pos2)
        case format2:                            [branch transformation program]
            pos3 = value.indexOf("|", NUM, 2)
            pos4 = value.indexOf(NUM, BNK, -1)
            output = value.substr(pos3, pos4)
    return output
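
A minimal Python sketch of the transformation program on this slide, under stated assumptions. Here indexOf(left, right, c) is read as "the c-th position whose left context matches left and whose right context matches right, with negative c counting from the end"; that reading, the regexes behind the tokens, and the hard-coded classify are illustrative assumptions, not the system's implementation (the real classifier is learned from the example clusters).

```python
import re

# Token names follow the slide; the regexes are assumptions for illustration.
TOKENS = {"BNK": r"\s", "NUM": r"[0-9]+", "VBAR": r"\|"}


def index_of(value, left, right, occurrence):
    """Return the occurrence-th position whose left context matches `left`
    and whose right context matches `right`; negative values count from the end.
    Returns None when no such position exists."""
    positions = [
        p for p in range(len(value) + 1)
        if re.search(rf"(?:{TOKENS[left]})\Z", value[:p])
        and re.match(TOKENS[right], value[p:])
    ]
    index = occurrence - 1 if occurrence > 0 else occurrence
    if -len(positions) <= index < len(positions):
        return positions[index]
    return None


def classify(value):
    # Hypothetical stand-in for the learned classifier: "|" marks format2.
    return "format2" if "|" in value else "format1"


def transform(value):
    if classify(value) == "format1":
        start = index_of(value, "BNK", "NUM", -1)
        end = index_of(value, "NUM", "BNK", 2)
    else:
        start = index_of(value, "VBAR", "NUM", 2)
        end = index_of(value, "NUM", "BNK", -1)
    if start is None or end is None:
        return None
    return value[start:end]


print(transform("9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE"))
```

Each branch is a single segment program built from two position programs, which is why one branch cannot cover both formats.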

Creating Hypothesis Spaces
• Create traces
• Derive hypothesis spaces
Traces: A trace here defines how the output string is constructed from
a specific set of substrings from the input string.
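
To make the notion of a trace concrete, the following sketch enumerates every way to build an output string from consecutive substrings of the input. The function names and the span representation are hypothetical, and the enumeration is deliberately naive (exponential); it only illustrates the definition above, not how the system stores traces.

```python
def find_all(text, piece):
    """All start offsets at which `piece` occurs in `text` (overlaps allowed)."""
    starts, i = [], text.find(piece)
    while i != -1:
        starts.append(i)
        i = text.find(piece, i + 1)
    return starts


def traces(source, target):
    """Enumerate every way to write `target` as a concatenation of substrings
    of `source`. Each trace is a list of (start, end) spans into `source`."""
    if not target:
        return [[]]
    result = []
    for cut in range(1, len(target) + 1):
        head = target[:cut]
        for start in find_all(source, head):
            for rest in traces(source, target[cut:]):
                result.append([(start, start + cut)] + rest)
    return result


all_traces = traces("20 in HIGH x 24 in WIDE", "24")
simplest = min(all_traces, key=len)  # the trace with the fewest segments
print(len(all_traces), simplest)
```

For this input there are three traces; the one-segment trace copies "24" from a single span, while the other two assemble it from separate "2" and "4" spans.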

Generating Branch Programs
• Generate programs from hypothesis space
– Generate-and-test
– Generate simpler programs first
• Programs with one segment program are generated earlier than programs with three segment programs

Learning Conditional Statements
• Cluster examples
• Learn a multiclass classifier
• Recognize the format of the inputs
Cluster1-format1:
R1 5.25 in HIGH x 9.375 in WIDE 9.375
R4 Image: 20.5 in. HIGH x 17.5 in. WIDE 17.5
Cluster2-format2
Input to classify as format1 or format2:
R6 12 in|14 in HIGH x 16 in|18 in WIDE


Our contributions
• Efficiently learning accurate conditional
statements [DINA, 2014]
• Efficiently synthesizing branch transformation
programs [IJCAI, 2015]
• Maximizing the user correctness with minimal
user effort [IUI, 2014; IUI, 2016 (submitted)]


Motivation
• Example clustering is time consuming
– Many ways (2^n) to cluster the examples
– Many examples are not compatible
– Verifying compatibility is expensive
• Learned conditional statement is not accurate
– Users are willing to provide only a few examples

Utilizing known compatibilities
[Figure: the example sets after providing 3 examples (R1 R2 R3) and after providing 4 examples (R1 R2 R3 R4), with ✕ marks; example shown: R1 5.25 in HIGH x 9.375 in WIDE 9.375]

Constraints
• Two types of constraints:
• Cannot-merge constraints:
Ex:
5.25 in HIGH x 9.375 in WIDE 9.375
9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE 13.75
20 in HIGH x 24 in WIDE 24
• Must-merge constraints:
Ex:
5.25 in HIGH x 9.375 in WIDE 9.375
20 in HIGH x 24 in WIDE 24

Constrained Agglomerative Clustering
[Figure: merge steps]
{R1} {R2} {R3} {R4}
{R1, R4} {R2} {R3}
{R1, R2, R4} {R3}
✕ (not merged):
R2 20 in HIGH x 24 in WIDE
R3 9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE
Distance between clusters (pi and pj) :
Distance Metric Learning
• Distance metric learning
• Objective function (figure annotations: "far away", "close to each other")

Evaluation
• Dataset:
– 30 editing scenarios collected from student course projects
Avg records  Min formats  Max formats  Avg formats
350          2            12           4.4
• Methods:
– SP
• The state-of-the-art approach, which uses a compatibility score to select partitions to merge
– SPIC
• Uses previous constraints in addition to the compatibility score
– DP
• Learns a distance metric
– DPIC
• Uses previous constraints in addition to learning a distance metric
– DPICED
• Our approach in this paper


Learning Transformation Programs by Example
Input Data                                                             Target Data
2000 Ford Expedition 11k runs great los angeles $4900 (los angeles)    2000 Ford Expedition los angeles $4900
1998 Honda Civic 12k miles s. Auto. - $3800 (Arcadia)                  1998 Honda Civic Arcadia $3800
2008 Mitsubishi Galant ES $7500 (Sylmar CA) pic                        2008 Mitsubishi Galant Sylmar CA $7500
1996 Isuzu Trooper 14k clean title west covina $999 (west covina) pic  1996 Isuzu Trooper west covina $999
…                                                                      …
Time complexity is exponential in the number of examples
and a high-degree polynomial in the length of the examples
10/11/2016 26

Reuse subprograms
[Figure: the programs generated after the 1st, 2nd, and 3rd examples]

Identify incorrect subprograms
Input                                                                Output
2000 Ford Expedition 11k runs great los angeles $4900 (los angeles)  2000 Ford Expedition los angeles $4900
1998 Honda Civic 12k miles s. Auto. - $3800 (Arcadia)                1998 Honda Civic Arcadia $3800
Program: (START, NUM, 1) (BNK, NUM, 1) ('(', WORD, 1) (LWRD, ')', 1) (ANY, BNK'$', 1) (NUM, BNK, 2)
Input: 2008 Mitsubishi Galant ES $7500 (Sylmar CA) pic
Execution result: positions 0, -1 (Null); positions 33, -1 (Null); ␣$7500
Target output: 2008 Mitsubishi Galant Sylmar CA $7500
Target positions: 0, 23; 33, 43
Each executed position is compared with its target position (= match, ≠ mismatch)
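
The check on this slide can be sketched as follows: run each position program on the new input and flag the ones whose result differs from the position implied by the new example. The token regexes, the function names, and the reading of (left, right, c) as "the c-th position with these left and right contexts" are assumptions for illustration; -1 stands for "no position found", as on the slide.

```python
import re

# Token names follow the slides; the regexes are assumptions for illustration.
TOKENS = {
    "START": r"\A", "BNK": r"\s", "NUM": r"[0-9]+",
    "WORD": r"[a-zA-Z]+", "LWRD": r"[a-z]+",
    "(": r"\(", ")": r"\)",
}


def run_position_program(value, program):
    """Evaluate (left, right, occurrence); return -1 when no position matches."""
    left, right, occurrence = program
    positions = [
        p for p in range(len(value) + 1)
        if re.search(rf"(?:{TOKENS[left]})\Z", value[:p])
        and re.match(TOKENS[right], value[p:])
    ]
    index = occurrence - 1 if occurrence > 0 else occurrence
    if -len(positions) <= index < len(positions):
        return positions[index]
    return -1


def incorrect_subprograms(value, programs, expected):
    """Indices of position programs whose result differs from the expected position."""
    return [
        i for i, (program, want) in enumerate(zip(programs, expected))
        if run_position_program(value, program) != want
    ]


programs = [("START", "NUM", 1), ("BNK", "NUM", 1), ("(", "WORD", 1), ("LWRD", ")", 1)]
new_input = "2008 Mitsubishi Galant ES $7500 (Sylmar CA) pic"
target_positions = [0, 23, 33, 43]  # target positions as given on the slide
print(incorrect_subprograms(new_input, programs, target_positions))
```

Only the flagged position programs need to be refined; the ones that already produce the target position can be reused as they are.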

Update hypothesis spaces
Program: (START, NUM, 1) (BNK, NUM, 1) ('(', WORD, 1) (LWRD, ')', 1) (ANY, BNK'$', 1)
Input: 2000 Ford Expedition 11k runs great los angeles $4900 (los angeles)
Hypothesis space H3: segment spaces h31, h32, h33, each with a start (s) and an end (e) position space (h31s, h31e, h32s, h32e)
h32e before the update
Left context:
• LWRD
• WORD
• …
Right context:
• “)”
• …
h32e after the update
Left context:
• WORD
• …
Right context:
• “)”
• …

Evaluation
• Dataset
– D1: 17 scenarios used in (Lin et al., 2014)
• 5 records per scenario
– D2: 30 scenarios collected from student data integration projects
• about 350 records per scenario
– D3: synthetic dataset
• designed to evaluate scale-up
• Alternative approaches
– Our implementation of Gulwani’s approach: (Gulwani, 2011)
– Metagol: (Lin et al., 2014)
• Metric
– Time (in seconds) to generate a transformation program

Program generation time comparisons
Table: time (in seconds) to generate programs on D1 and D2 datasets
Figure: scalability test on D3


Motivation
• Thousands of records in datasets
• Various transformation scenarios
• Overconfident users

Learning from various past results
…
Raw                            Transformed
26" H x 24" W x 12.5           26
Framed at 21.75" H x 24.25” W  21
12" H x 9"                     12
…
Raw                                                   Transformed
Ravage 2099#24 (November, 1994)                       November, 1994
Gambit III#1 (September, 1997)                        September, 1997
(comic) Spidey Super Stories#12/2 (September, 1975)   comic
…
Legend: Examples, Incorrect records, Correct records

Approach Overview
Entire dataset
Raw           Transformed
10“ H x 8” W  10
H: 58 x W:25” 58
12”H x 9”W    12
11”H x 6”     11
…             …
30 x 46”      30 x 46
Random sampling -> Sampled records
Raw           Transformed
10“ H x 8” W  10
11”H x 6”     11
…             …
30 x 46”      30 x 46
Verifying records
Raw           Transformed
11”H x 6”     11
30 x 46”      30 x 46
…             …
Sorting and color-coding
Raw           Transformed
30 x 46”      30 x 46
11”H x 6”     11
…             …

Verifying Records
• Recommend records causing runtime errors
– Records that cause the program to exit abnormally
Ex:
Program: (LWRD, ‘)’, 1)
Input: 2008 Mitsubishi Galant ES $7500 (Sylmar CA) pic
• Recommend potentially incorrect records
– Learn a binary meta-classifier
Raw        Transformed
11”H x 6”  11
30 x 46”   30 x 46
…          …

Learning the Meta-classifier
Learn an ensemble of classifiers using AdaBoost:
(1) Select a classifier fi from a pool of binary classifiers
(2) Assign a weight wi to fi
(3) Loop until the error is below a threshold
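
The three steps above can be sketched with a textbook discrete AdaBoost over a fixed pool of binary classifiers. This is a generic version under stated assumptions, not the system's code: labels are +1 (potentially incorrect) and -1 (correct), the feature (a single number per record) and the threshold stumps in the pool are hypothetical, and max_rounds is an added safety bound.

```python
import math


def adaboost(samples, labels, pool, max_rounds=20, threshold=0.0):
    """Build a weighted ensemble from `pool`; labels must be +1 or -1."""
    n = len(samples)
    weights = [1.0 / n] * n
    ensemble = []  # list of (w_i, f_i)

    def predict(x):
        return 1 if sum(w * f(x) for w, f in ensemble) >= 0 else -1

    for _ in range(max_rounds):
        def weighted_error(f):
            return sum(wt for wt, x, y in zip(weights, samples, labels) if f(x) != y)

        # (1) Select the classifier with the lowest weighted error.
        best = min(pool, key=weighted_error)
        err = weighted_error(best)
        if err >= 0.5:  # no classifier in the pool beats chance
            break
        # (2) Assign its weight (clamped to avoid division by zero).
        err = max(err, 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, best))
        # Re-weight the samples so that misclassified ones count more.
        weights = [wt * math.exp(-alpha * y * best(x))
                   for wt, x, y in zip(weights, samples, labels)]
        total = sum(weights)
        weights = [wt / total for wt in weights]
        # (3) Stop once the training error reaches the threshold.
        train_error = sum(predict(x) != y for x, y in zip(samples, labels)) / n
        if train_error <= threshold:
            break
    return predict


# Hypothetical feature: one number per record (for example, output length).
samples = [1, 2, 3, 4, 5, 6]
labels = [-1, -1, -1, 1, 1, 1]
pool = [lambda x, t=t: 1 if x > t else -1 for t in range(7)]
predict = adaboost(samples, labels, pool)
print([predict(x) for x in samples])
```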

Evaluation
Dataset:
30 scenarios
350 records per scenario
Experiment setup:
• Approach-β
• Baseline
Metrics:
• Iteration correctness
• MRR
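
MRR is read here as mean reciprocal rank. A small sketch of how it could be computed for this setting, assuming that each iteration yields a ranked list of recommended records and that the score uses the rank of the first truly incorrect record; the function name and this exact setup are assumptions, since the slide only names the metric.

```python
def mean_reciprocal_rank(ranked_lists, incorrect):
    """Average of 1/rank of the first truly incorrect record in each ranking.

    A ranking that contains no incorrect record contributes 0.
    """
    if not ranked_lists:
        return 0.0
    total = 0.0
    for ranking in ranked_lists:
        for rank, record in enumerate(ranking, start=1):
            if record in incorrect:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "d", "e"]]
score = mean_reciprocal_rank(rankings, incorrect={"a"})
print(score)
```

A higher score means that incorrect records appear nearer the top of the recommendations, so the user finds them with less scanning.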


Related Work
• Approaches not focusing on data transformation
– Wrapper induction
• Kushmerick, 1997; Hsu and Dung, 1998; Muslea et al., 1999
– Inductive programming (we learn )
• Summers, 1977; Kitzelmann and Schmid, 2006; Shapiro, 1981; Muggleton and Lin, 2013
• Approaches not learning programs iteratively
– FlashFill (Gulwani, 2011); SmartPython (Lau, 2001); SmartEdit (Lau, 2001); Singh and Gulwani, 2012; Raza et al., 2014; Harris et al., 2011
• Approaches learning part of the programs iteratively
– MetagolDF (Lin et al., 2014); Perelman et al., 2014

Conclusion: contributions
• Enable users to generate complicated
programs in real time
• Enable users to work on large datasets
• Improve the performance of other PBE
approaches

Conclusion: future work
• Managing user expectation
• Incorporating third-party functions
• Handling user errors
