Iteratively Learning Data Transformation Programs from Examples
Ph.D. defense
2015-10-21
Bo Wu
1
Department of Computer Science
Agenda
• Introduction
• Previous work
• Our approach
– Learning conditional statements
– Synthesizing branch transformation programs
– Maximize user correctness with minimal effort
• Related work
• Conclusion and future work
2
Programming by example
Accession  Credit                   Dimensions                                  Medium         Name
01.2       Gift of the artist       5.25 in HIGH x 9.375 in WIDE                Oil on canvas  John Mix Stanley
05.411     Gift of James L. Edison  20 in HIGH x 24 in WIDE                     Oil on canvas  Mortimer L. Smith
06.1       Gift of the artist       Image: 20.5 in. HIGH x 17.5 in. WIDE        Oil on canvas  Theodore Scott Dabo
06.2       Gift of the artist       9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE  Oil on canvas  Leon Dabo
…
09.8       Gift of the artist       12 in|14 in HIGH x 16 in|18 in WIDE         Oil on canvas  Gari Melchers
3
Programming by Example
    Raw Value                                   Target Value
                                                Iteration 1  Iteration 2  Iteration 3
R1  5.25 in HIGH x 9.375 in WIDE                9.375        9.375        9.375
R2  20 in HIGH x 24 in WIDE                     24           24           24
R3  9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE  null         19.5         19.5
R4  Image: 20.5 in. HIGH x 17.5 in. WIDE        17.5         17.5         17.5
R5  12 in|14 in HIGH x 16 in|18 in WIDE         null         null         18
. . .
Challenges
• Various formats and few examples
• Stringent time limits
• Verifying the correctness on large datasets
5
Research problem
Enabling PBE approaches to efficiently generate correct
transformation programs for large datasets with
multiple formats using minimal user effort
6
Iterative Transformation
Users: examining records and providing examples
PBE systems: synthesizing programs and transforming records
Users → PBE systems: examples
PBE systems → Users: transformed records
Agenda
• Introduction
• Previous work
• Our approach
– Learning conditional statements
– Synthesizing branch transformation programs
– Maximize user correctness with minimal effort
• Related work
• Conclusion and future work
8
[Figure: system overview. GUI → Examples → Clustering → Example Cluster 1 … Example Cluster K. The clusters feed both Learning Classifier → Conditional Statement and Synthesizing → Branch Program 1 … Branch Program K. Combining the conditional statement with the branch programs gives the Transformation program.]
R1 5.25 in HIGH x 9.375 in WIDE 9.375
R2 20 in HIGH x 24 in WIDE 24
R3 9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE 19.5
R4 Image: 20.5 in. HIGH x 17.5 in. WIDE 17.5
classify → R1, R2, R4: substring(pos1, pos2)
classify → R3: substring(pos3, pos4)
Transformation Program
BNK: blankspace
NUM([0-9]+): 98
UWRD([A-Z]): I
LWRD([a-z]+): mage
WORD([a-zA-Z]+): Image
START:
END:
VBAR: |
…
10
Example: 9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE 19.5
Transform(value)
  switch (classify(value)):                    (conditional statement)
    case format1:                              (branch transformation program)
      pos1 = value.indexOf(BNK, NUM, -1)
      pos2 = value.indexOf(NUM, BNK, 2)
      output = value.substr(pos1, pos2)
    case format2:                              (branch transformation program)
      pos3 = value.indexOf("|", NUM, 2)
      pos4 = value.indexOf(NUM, BNK, -1)
      output = value.substr(pos3, pos4)
  return output
Segment program (value.substr): return a substring
Position program (value.indexOf): return a position in the input
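The program on this slide can be sketched in runnable form. The following is a minimal Python sketch, not the author's implementation: the regexes behind the tokens, the `index_of` helper, and especially the one-line `classify` rule (a stand-in for the learned classifier) are assumptions made for illustration.

```python
import re

# Token regexes follow the slide's token list; matching them this way is an assumption.
TOKENS = {"BNK": r" ", "NUM": r"[0-9]+", "WORD": r"[a-zA-Z]+"}


def index_of(value, left, right, occurrence):
    """Position program: return the occurrence-th position p where `left`
    matches the text ending at p and `right` matches the text starting at p.
    A negative occurrence counts from the end; -1 means no such position."""
    left_re = re.compile("(?:" + TOKENS.get(left, re.escape(left)) + r")\Z")
    right_re = re.compile(TOKENS.get(right, re.escape(right)))
    hits = [p for p in range(len(value) + 1)
            if left_re.search(value[:p]) and right_re.match(value, p)]
    if occurrence == 0 or abs(occurrence) > len(hits):
        return -1
    return hits[occurrence - 1] if occurrence > 0 else hits[occurrence]


def classify(value):
    # Hypothetical stand-in for the learned conditional statement.
    return "format2" if "|" in value else "format1"


def transform(value):
    if classify(value) == "format1":
        start = index_of(value, "BNK", "NUM", -1)
        end = index_of(value, "NUM", "BNK", 2)
    else:
        start = index_of(value, "|", "NUM", 2)
        end = index_of(value, "NUM", "BNK", -1)
    if start < 0 or end < 0:
        return None  # a position program failed on this input
    return value[start:end]  # segment program: return a substring


print(transform("9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE"))
```

Under these assumptions the sketch returns the target values shown for R1 to R5 on the earlier slide.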
Creating Hypothesis Spaces
• Create traces
• Derive hypothesis spaces
Traces: a trace defines how the output string is constructed from a specific set of substrings of the input string.
Generating Branch Programs
• Generate programs from the hypothesis space
– Generate-and-test
– Generate simpler programs first: programs with one segment program are generated before programs with three segment programs
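The notion of a trace, and the "simpler programs first" ordering, can be illustrated with a small enumeration. This is a hedged Python sketch rather than the method from the thesis: the function name `traces`, the `max_segments` limit, and the choice to record only the first occurrence of each piece are assumptions.

```python
from itertools import combinations


def traces(source, target, max_segments=3):
    """Yield traces, fewest segments first. A trace is a list of (start, end)
    spans in `source` whose substrings concatenate to `target`."""
    n = len(target)
    for k in range(1, max_segments + 1):
        # Choose k - 1 cut points inside the target string.
        for cuts in combinations(range(1, n), k - 1):
            bounds = (0, *cuts, n)
            pieces = [target[a:b] for a, b in zip(bounds, bounds[1:])]
            if all(piece in source for piece in pieces):
                # Assumption: keep only the first occurrence of each piece.
                yield [(source.find(p), source.find(p) + len(p)) for p in pieces]


src = "1998 Honda Civic 12k miles s. Auto. - $3800 (Arcadia)"
first = next(traces(src, "1998 Honda Civic Arcadia $3800"))
print(len(first), [src[a:b] for a, b in first])
```

A single-segment trace is tried before any three-segment trace, which mirrors the ordering described on the slide.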
Learning Conditional Statements
• Cluster examples
Cluster1-format1:
R1 5.25 in HIGH x 9.375 in WIDE 9.375
R2 20 in HIGH x 24 in WIDE 24
R4 Image: 20.5 in. HIGH x 17.5 in. WIDE 17.5
Cluster2-format2:
R3 9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE 19.5
• Learn a multiclass classifier
• Recognize the format of the inputs
R5 Image: 20.5 in. HIGH x 17.5 in. WIDE → format1
R6 12 in|14 in HIGH x 16 in|18 in WIDE → format2
Agenda
• Introduction
• Previous work
• Our approach
– Learning conditional statements
– Synthesizing branch transformation programs
– Maximize user correctness with minimal effort
• Related work
• Conclusion and future work
14
Our contributions
• Efficiently learning accurate conditional
statements [DINA, 2014]
• Efficiently synthesizing branch transformation
programs [IJCAI, 2015]
• Maximizing the user correctness with minimal
user effort [IUI, 2014; IUI, 2016 (submitted)]
15
Agenda
• Introduction
• Previous work
• Our approach
– Learning conditional statements
– Synthesizing branch transformation programs
– Maximize user correctness with minimal effort
• Related work
• Conclusion and future work
16
Motivation
• Example clustering is time-consuming
– Many ways (2^n) to cluster the examples
– Many examples are not compatible
– Verifying compatibility is expensive
• The learned conditional statement is not accurate
– Users are willing to provide only a few examples
17
R1 5.25 in HIGH x 9.375 in WIDE 9.375
R2 20 in HIGH x 24 in WIDE 24
R3 9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE 19.5
✕
Utilizing known compatibilities
R1 5.25 in HIGH x 9.375 in WIDE 9.375
R2 20 in HIGH x 24 in WIDE 24
R3 9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE 19.5
✕
After providing 3 examples: R1 R2 R3 ✕
After providing 4 examples: R1 R2 R3 R4
Constraints
• Two types of constraints:
• Cannot-merge constraints:
Ex:
5.25 in HIGH x 9.375 in WIDE 9.375
9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE 13.75
20 in HIGH x 24 in WIDE 24
• Must-merge constraints:
Ex:
5.25 in HIGH x 9.375 in WIDE 9.375
20 in HIGH x 24 in WIDE 24
Constrained Agglomerative Clustering
Merging steps: {R1} {R2} {R3} {R4} → {R1, R4} {R2} {R3} → {R1, R2, R4} {R3}
Distance between clusters (pi and pj) :
Merged:
R1 5.25 in HIGH x 9.375 in WIDE
R4 Image: 20.5 in. HIGH x 17.5 in. WIDE
Cannot merge (✕):
R1 5.25 in HIGH x 9.375 in WIDE
R2 20 in HIGH x 24 in WIDE
R3 9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE
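Agglomerative clustering under cannot-merge constraints can be sketched as follows. This is an illustrative Python sketch only: the slide's cluster distance is not recoverable from the transcript, so the token-count `signature`, the L1 `distance`, the average linkage, and the `threshold` stopping rule are all assumptions.

```python
import re
from itertools import combinations


def signature(text):
    # Hypothetical features: counts of numbers, words, "|" and ":".
    return [len(re.findall(p, text)) for p in (r"[0-9]+", r"[A-Za-z]+", r"\|", r":")]


def distance(a, b):
    # Assumed L1 distance between signatures (not the slide's formula).
    return sum(abs(x - y) for x, y in zip(signature(a), signature(b)))


def cluster(records, cannot_merge, threshold):
    """Repeatedly merge the closest pair of clusters that violates no constraint."""
    clusters = [[r] for r in records]

    def allowed(c1, c2):
        return not any(frozenset((a, b)) in cannot_merge for a in c1 for b in c2)

    def linkage(c1, c2):  # average linkage
        return sum(distance(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while True:
        pairs = [(linkage(c1, c2), i, j)
                 for (i, c1), (j, c2) in combinations(enumerate(clusters), 2)
                 if allowed(c1, c2)]
        if not pairs:
            break
        d, i, j = min(pairs)
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


R1 = "5.25 in HIGH x 9.375 in WIDE"
R2 = "20 in HIGH x 24 in WIDE"
R3 = "9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE"
R4 = "Image: 20.5 in. HIGH x 17.5 in. WIDE"
result = cluster([R1, R2, R3, R4], {frozenset((R1, R3))}, threshold=5)
print(result)
```

With the cannot-merge pair (R1, R3), the sketch ends with R1, R2, R4 in one cluster and R3 alone, the same final grouping as on the slide; the order of intermediate merges depends on the assumed distance.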
Distance Metric Learning
• Distance metric learning
• Objective function
[Figure labels: examples “close to each other”; examples “far away”]
21
Utilizing Unlabeled data
22
Evaluation
• Dataset:
– 30 editing scenarios collected from student course projects
Avg records  Min formats  Max formats  Avg formats
350          2            12           4.4
• Methods:
– SP
• The state-of-the-art approach, which uses a compatibility score to select partitions to merge
– SPIC
• Uses previous constraints in addition to the compatibility score
– DP
• Learns a distance metric
– DPIC
• Uses previous constraints in addition to learning a distance metric
– DPICED
• Our approach in this paper
Results
Time and Examples:
24
Agenda
• Introduction
• Previous work
• Our approach
– Learning conditional statements
– Synthesizing branch transformation programs
– Maximize user correctness with minimal effort
• Related work
• Conclusion and future work
25
Learning Transformation Programs
by Example
Input Data                                                             Target Data
2000 Ford Expedition 11k runs great los angeles $4900 (los angeles)    2000 Ford Expedition los angeles $4900
1998 Honda Civic 12k miles s. Auto. - $3800 (Arcadia)                  1998 Honda Civic Arcadia $3800
2008 Mitsubishi Galant ES $7500 (Sylmar CA) pic                        2008 Mitsubishi Galant Sylmar CA $7500
1996 Isuzu Trooper 14k clean title west covina $999 (west covina) pic  1996 Isuzu Trooper west covina $999
…                                                                      …
Time complexity is exponential in the number of examples
and a high-degree polynomial in their length
10/11/2016 26
Reuse subprograms
[Figure: panels labeled “After 1st example”, “After 2nd example”, “After 3rd example”]
27
Identify incorrect subprograms
Input                                                                Output
2000 Ford Expedition 11k runs great los angeles $4900 (los angeles)  2000 Ford Expedition los angeles $4900
1998 Honda Civic 12k miles s. Auto. - $3800 (Arcadia)                1998 Honda Civic Arcadia $3800
Program: (START, NUM, 1) (BNK, NUM, 1) ('(', WORD, 1) (LWRD, ')', 1) (ANY, BNK '$', 1) (NUM, BNK, 2)
Input: 2008 Mitsubishi Galant ES $7500 (Sylmar CA) pic
Target output: 2008 Mitsubishi Galant Sylmar CA $7500
Segment  Execution positions  Execution result  Target positions
1        0, -1                Null              0, 23
2        33, -1               Null              33, 43
3                             ␣$7500
Start positions agree with the target (=); end positions differ (≠).
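The check on this slide can be approximated by running each position expression on the new input and flagging the ones that find no position. The Python sketch below is an assumption-laden illustration: `index_of`, the token regexes, and `failing_positions` are hypothetical names, and the sketch only detects expressions that return -1 rather than comparing against target positions.

```python
import re

# Token regexes are assumed from the deck's token list.
TOKENS = {"START": r"\A", "BNK": r" ", "NUM": r"[0-9]+",
          "LWRD": r"[a-z]+", "WORD": r"[a-zA-Z]+"}


def index_of(value, left, right, occurrence):
    """Return the occurrence-th position with matching left/right context, or -1."""
    left_re = re.compile("(?:" + TOKENS.get(left, re.escape(left)) + r")\Z")
    right_re = re.compile(TOKENS.get(right, re.escape(right)))
    hits = [p for p in range(len(value) + 1)
            if left_re.search(value[:p]) and right_re.match(value, p)]
    if occurrence == 0 or abs(occurrence) > len(hits):
        return -1
    return hits[occurrence - 1] if occurrence > 0 else hits[occurrence]


def failing_positions(program, value):
    """Evaluate every position expression; report those that return -1."""
    results = [index_of(value, *expr) for expr in program]
    failed = [expr for expr, pos in zip(program, results) if pos == -1]
    return results, failed


# First four position expressions of the program shown on the slide.
program = [("START", "NUM", 1), ("BNK", "NUM", 1),
           ("(", "WORD", 1), ("LWRD", ")", 1)]
results, failed = failing_positions(
    program, "2008 Mitsubishi Galant ES $7500 (Sylmar CA) pic")
print(results, failed)
```

Only the flagged expressions would then need to be re-learned, which is the motivation for reusing the remaining subprograms.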
Update hypothesis spaces
Program: (START, NUM, 1) (BNK, NUM, 1) ('(', WORD, 1) (LWRD, ')', 1) (ANY, BNK '$', 1)
Hypothesis space H3 = (h31, h32, h33); each segment h3j has a start position (h3j s) and an end position (h3j e).
Learned from:
2000 Ford Expedition 11k runs great los angeles $4900 (los angeles)
1998 Honda Civic 12k miles s. Auto. - $3800 (Arcadia)
h32 e
Left context:
• LWRD
• WORD
• …
Right context:
• “)”
• …
After adding:
2008 Mitsubishi Galant ES $7500 (Sylmar CA) pic
h32 e
Left context:
• WORD
• …
Right context:
• “)”
• …
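The narrowing of the left-context candidates for the end position of the second segment can be reproduced with a small filter. This Python sketch is illustrative: the function names, the candidate set, and the use of the first ")" as the end position are assumptions, not details taken from the thesis.

```python
import re

# Token regexes assumed from the deck's token list.
TOKENS = {"NUM": r"[0-9]+", "UWRD": r"[A-Z]",
          "LWRD": r"[a-z]+", "WORD": r"[a-zA-Z]+"}


def consistent_left_contexts(candidates, value, position):
    """Keep the candidate tokens that match the text ending at `position`."""
    return {t for t in candidates
            if re.search("(?:" + TOKENS[t] + r")\Z", value[:position])}


def refine(candidates, examples):
    """Shrink the candidate set so it stays consistent with every example."""
    for value, position in examples:
        candidates = consistent_left_contexts(candidates, value, position)
    return candidates


a = "2000 Ford Expedition 11k runs great los angeles $4900 (los angeles)"
b = "1998 Honda Civic 12k miles s. Auto. - $3800 (Arcadia)"
c = "2008 Mitsubishi Galant ES $7500 (Sylmar CA) pic"

# Assumption: the segment ends right before the first ")".
before = refine({"NUM", "LWRD", "WORD"}, [(a, a.index(")")), (b, b.index(")"))])
after = refine(before, [(c, c.index(")"))])
print(before, after)
```

After the third example, LWRD is no longer consistent because "CA" is uppercase, leaving WORD as on the slide.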
Evaluation
• Dataset
– D1: 17 scenarios used in (Lin et al., 2014)
• 5 records per scenario
– D2: 30 scenarios collected from student data integration projects
• about 350 records per scenario
– D3: synthetic dataset
• designed to evaluate scale-up
• Alternative approaches
– Our implementation of Gulwani’s approach (Gulwani, 2011)
– Metagol (Lin et al., 2014)
• Metric
– Time (in seconds) to generate a transformation program
Program generation time comparisons
Table: time (in seconds) to generate programs on D1 and D2 datasets
Figure: scalability test on D3
Agenda
• Introduction
• Previous work
• Our approach
– Learning conditional statements
– Synthesizing branch transformation programs
– Maximize user correctness with minimal effort
• Related work
• Conclusion and future work
32
Motivation
33
• Thousands of records in datasets
• Various transformation scenarios
• Overconfident users
User Interface
34
Learning from various past results
35
Raw                            Transformed
…
26" H x 24" W x 12.5           26
Framed at 21.75" H x 24.25” W  21
12" H x 9"                     12
…
Raw                                                  Transformed
…
Ravage 2099#24 (November, 1994)                      November, 1994
Gambit III#1 (September, 1997)                       September, 1997
(comic) Spidey Super Stories#12/2 (September, 1975)  comic
…
[Legend: examples, incorrect records, correct records]
Approach Overview
Entire dataset:
Raw            Transformed
10“ H x 8” W   10
H: 58 x W:25”  58
12”H x 9”W     12
11”H x 6”      11
…              …
30 x 46”       30 x 46
→ Random sampling → Sampled records:
Raw            Transformed
10“ H x 8” W   10
11”H x 6”      11
…              …
30 x 46”       30 x 46
→ Verifying records:
Raw            Transformed
11”H x 6”      11
30 x 46”       30 x 46
…              …
→ Sorting and color-coding:
Raw            Transformed
30 x 46”       30 x 46
11”H x 6”      11
…              …
Verifying Records
• Recommend records causing runtime errors
– Records that cause the program to exit abnormally
Ex:
Program: (LWRD, ‘)’, 1)
Input: 2008 Mitsubishi Galant ES $7500 (Sylmar CA) pic
• Recommend potentially incorrect records
– Learn a binary meta-classifier
Raw        Transformed
11”H x 6”  11
30 x 46”   30 x 46
…          …
Learning the Meta-classifier
38
Learn an ensemble of classifiers using AdaBoost:
(1) Select a classifier fi from a pool of binary classifiers
(2) Assign a weight wi to fi
(3) Loop until the error is below a threshold
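The three steps above correspond to discrete AdaBoost, sketched below in plain Python. The deck does not list the classifier pool or the features, so the two pool members (`not_numeric`, `long_output`), the toy records, and the labels (+1 for potentially incorrect, -1 for correct) are hypothetical.

```python
import math


def adaboost(samples, labels, pool, max_rounds=10, error_threshold=0.0):
    """Discrete AdaBoost over a fixed pool of binary classifiers (+1 / -1)."""
    n = len(samples)
    weights = [1.0 / n] * n
    ensemble = []  # list of (w_i, f_i)

    def predict(x):
        return 1 if sum(w * f(x) for w, f in ensemble) >= 0 else -1

    def weighted_error(f):
        return sum(w for w, x, y in zip(weights, samples, labels) if f(x) != y)

    for _ in range(max_rounds):
        # (1) Select the classifier with the lowest weighted error.
        f = min(pool, key=weighted_error)
        err = weighted_error(f)
        if err >= 0.5:
            break
        # (2) Assign its weight (small constant avoids division by zero).
        alpha = 0.5 * math.log((1.0 - err + 1e-10) / (err + 1e-10))
        ensemble.append((alpha, f))
        # Re-weight the samples so that mistakes count more in the next round.
        weights = [w * math.exp(-alpha * y * f(x))
                   for w, x, y in zip(weights, samples, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
        # (3) Stop once the training error is at or below the threshold.
        mistakes = sum(predict(x) != y for x, y in zip(samples, labels))
        if mistakes / n <= error_threshold:
            break
    return predict


# Hypothetical pool: each classifier looks at a (raw, transformed) record.
def not_numeric(record):
    return 1 if not record[1].isdigit() else -1


def long_output(record):
    return 1 if len(record[1]) > 3 else -1


samples = [('10" H x 8" W', "10"), ('11"H x 6"', "11"),
           ('30 x 46"', "30 x 46"), ('H: 58 x W:25"', "58")]
labels = [-1, -1, 1, -1]  # +1 marks a potentially incorrect record
predict = adaboost(samples, labels, [not_numeric, long_output])
print(predict(('26 x 40"', "26 x 40")))
```

On this toy data a single classifier already separates the records, so the loop stops after one round; a realistic pool would need several rounds.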
Evaluation
Dataset:
30 scenarios
350 records per scenario
Experiment setup:
• Approach-β
• Baseline
Metrics:
• Iteration correctness
• MRR (mean reciprocal rank)
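The deck does not spell out how MRR is computed. Under the standard definition (the mean, over ranked lists, of the reciprocal rank of the first relevant item), a minimal Python sketch looks like this; treating an incorrect record as the "relevant" item in a recommendation list is an assumption.

```python
def mean_reciprocal_rank(ranked_lists, is_relevant):
    """Average of 1 / rank of the first relevant item; 0 for lists with none."""
    if not ranked_lists:
        return 0.0
    total = 0.0
    for items in ranked_lists:
        for rank, item in enumerate(items, start=1):
            if is_relevant(item):
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


# Hypothetical recommendation lists; "bad" marks an incorrect record.
lists = [["ok", "bad"], ["bad"], ["ok", "ok"]]
score = mean_reciprocal_rank(lists, lambda item: item == "bad")
print(score)
```

A higher score means incorrect records tend to appear nearer the top of the recommended list.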
Agenda
• Introduction
• Previous work
• Our approach
– Learning conditional statements
– Synthesizing branch transformation programs
– Maximize user correctness with minimal effort
• Related work
• Conclusion and future work
40
Related Work
• Approaches not focusing on data transformation
– Wrapper induction
• Kushmerick, 1997; Hsu and Dung, 1998; Muslea et al., 1999
– Inductive programming (we learn )
• Summers, 1977; Kitzelmann and Schmid, 2006; Shapiro, 1981; Muggleton and Lin, 2013
• Approaches not learning programs iteratively
• FlashFill (Gulwani, 2011); SmartPython (Lau, 2001); SmartEdit (Lau, 2001); Singh and Gulwani, 2012; Raza et al., 2014; Harris et al., 2011
– Approaches learning part of the programs iteratively
• MetagolDF (Lin et al., 2014); Perelman et al., 2014
41
Conclusion: contributions
• Enable users to generate complicated
programs in real time
• Enable users to work on large datasets
• Improve the performance of other PBE
approaches
42
Conclusion: future work
• Managing user expectation
• Incorporating third-party functions
• Handling user errors
43
44
Questions?

Iteratively Learning Data Transformation Programs from Examples

  • 1. Iteratively Learning Data Transformation Programs from Examples. Ph.D. defense, 2015-10-21. Bo Wu, Department of Computer Science
  • 2. Agenda • Introduction • Previous work • Our approach – Learning conditional statements – Synthesizing branch transformation programs – Maximize user correctness with minimal effort • Related work • Conclusion and future work
  • 3. Programming by example
      Accession; Credit; Dimensions; Medium; Name
      01.2; Gift of the artist; 5.25 in HIGH x 9.375 in WIDE; Oil on canvas; John Mix Stanley
      05.411; Gift of James L. Edison; 20 in HIGH x 24 in WIDE; Oil on canvas; Mortimer L. Smith
      06.1; Gift of the artist; Image: 20.5 in. HIGH x 17.5 in. WIDE; Oil on canvas; Theodore Scott Dabo
      06.2; Gift of the artist; 9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE; Oil on canvas; Leon Dabo
      …
      09.8; Gift of the artist; 12 in|14 in HIGH x 16 in|18 in WIDE; Oil on canvas; Gari Melchers
  • 4. Programming by Example
      Raw values:
      R1: 5.25 in HIGH x 9.375 in WIDE
      R2: 20 in HIGH x 24 in WIDE
      R3: 9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE
      R4: Image: 20.5 in. HIGH x 17.5 in. WIDE
      R5: 12 in|14 in HIGH x 16 in|18 in WIDE
      . . .
      Target values (successive animation builds, in slide order): 9.375, 24, null, 17.5, null; then 24, 19.5, 17.5, null; then 24, 19.5, 17.5, 18
  • 5. Challenges • Various formats and few examples • Stringent time limits • Verifying the correctness on large datasets
  • 6. Research problem: enabling PBE approaches to efficiently generate correct transformation programs for large datasets with multiple formats using minimal user effort
  • 7. Iterative Transformation (figure): users examine records and provide examples; PBE systems synthesize programs and transform records. Examples flow from users to PBE systems, and transformed records flow back to users.
  • 8. Agenda • Introduction • Previous work • Our approach – Learning conditional statements – Synthesizing branch transformation programs – Maximize user correctness with minimal effort • Related work • Conclusion and future work
  • 9. System overview (figure): examples entered through the GUI (here R1, R2, R4) are clustered into Example Cluster 1 … Example Cluster K. Learning a classifier over the clusters yields the conditional statement, synthesis yields Branch Program 1 … Branch Program k, and combining them yields the transformation program. Examples: R1 5.25 in HIGH x 9.375 in WIDE → 9.375; R2 20 in HIGH x 24 in WIDE → 24; R3 9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE → 19.5; R4 Image: 20.5 in. HIGH x 17.5 in. WIDE → 17.5. The classify step selects between the branches substring(pos1, pos2) and substring(pos3, pos4).
  • 10. Transformation Program
      Token types: BNK: blank space; NUM([0-9]+): 98; UWRD([A-Z]): I; LWRD([a-z]+): mage; WORD([a-zA-Z]+): Image; START; END; VBAR: |; …
      Structure: a conditional statement selects one of several branch transformation programs. Segment program: returns a substring. Position program: returns a position in the input.
      Example: 9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE → 19.5
      Transform(value)
        switch (classify(value)):
          case format1:
            pos1 = value.indexOf(BNK, NUM, -1)
            pos2 = value.indexOf(NUM, BNK, 2)
            output = value.substr(pos1, pos2)
          case format2:
            pos3 = value.indexOf("|", NUM, 2)
            pos4 = value.indexOf(NUM, BNK, -1)
            output = value.substr(pos3, pos4)
        return output
  • 11. Creating Hypothesis Spaces • Create traces • Derive hypothesis spaces. Traces: a trace defines how the output string is constructed from a specific set of substrings of the input string.
  • 12. Generating Branch Programs • Generate programs from the hypothesis space – Generate-and-test – Generate simpler programs first: programs with one segment program are generated before programs with three segment programs
  • 13. Learning Conditional Statements • Cluster examples • Learn a multiclass classifier • Recognize the format of the inputs. Cluster1-format1: R1 5.25 in HIGH x 9.375 in WIDE → 9.375; R2 20 in HIGH x 24 in WIDE → 24; R4 Image: 20.5 in. HIGH x 17.5 in. WIDE → 17.5. Cluster2-format2: R3 9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE → 19.5. New inputs: R5 Image: 20.5 in. HIGH x 17.5 in. WIDE → format1; R6 12 in|14 in HIGH x 16 in|18 in WIDE → format2
  • 14. Agenda • Introduction • Previous work • Our approach – Learning conditional statements – Synthesizing branch transformation programs – Maximize user correctness with minimal effort • Related work • Conclusion and future work
  • 15. Our contributions • Efficiently learning accurate conditional statements [DINA, 2014] • Efficiently synthesizing branch transformation programs [IJCAI, 2015] • Maximizing the user correctness with minimal user effort [IUI, 2014; IUI, 2016 (submitted)]
  • 16. Agenda • Introduction • Previous work • Our approach – Learning conditional statements – Synthesizing branch transformation programs – Maximize user correctness with minimal effort • Related work • Conclusion and future work
  • 17. Motivation • Example clustering is time consuming – Many ways (2^n) to cluster the examples – Many examples are not compatible – Verifying compatibility is expensive • Learned conditional statement is not accurate – Users are willing to provide only a few examples. Examples shown: R1 5.25 in HIGH x 9.375 in WIDE → 9.375; R2 20 in HIGH x 24 in WIDE → 24; R3 9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE → 19.5 (✕)
  • 18. Utilizing known compatibilities. Examples: R1 5.25 in HIGH x 9.375 in WIDE → 9.375; R2 20 in HIGH x 24 in WIDE → 24; R3 9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE → 19.5 (✕). Figure panels: after providing 3 examples (R1, R2, R3) and after providing 4 examples (R1, R2, R3, R4), with ✕ marking a known incompatibility.
  • 19. Constraints • Two types of constraints: • Cannot-merge constraints. Ex: 5.25 in HIGH x 9.375 in WIDE → 9.375; 9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE → 13.75; 20 in HIGH x 24 in WIDE → 24 • Must-merge constraints. Ex: 5.25 in HIGH x 9.375 in WIDE → 9.375; 20 in HIGH x 24 in WIDE → 24
  • 20. Constrained Agglomerative Clustering (figure): merge sequence over R1, R2, R3, R4, shown as R1 R2 R3 R4; R1 R4 R2 R3; R1 R2 R4 R3. Distance between clusters (p_i and p_j): [equation image]. Records shown: R1 5.25 in HIGH x 9.375 in WIDE; R4 Image: 20.5 in. HIGH x 17.5 in. WIDE; R2 20 in HIGH x 24 in WIDE; R3 9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE (✕)
  • 21. Distance Metric Learning • Distance metric learning • Objective function: [equation image, with terms annotated "close to each other" and "far away"]
  • 23. Evaluation • Dataset: – 30 editing scenarios collected from student course projects (Avg records: 350; Min formats: 2; Max formats: 12; Avg formats: 4.4) • Methods: – SP: the state-of-the-art approach, which uses a compatibility score to select partitions to merge – SPIC: uses constraints from previous iterations in addition to the compatibility score – DP: learns a distance metric – DPIC: uses constraints from previous iterations in addition to learning a distance metric – DPICED: our approach in this paper
  • 25. Agenda • Introduction • Previous work • Our approach – Learning conditional statements – Synthesizing branch transformation programs – Maximize user correctness with minimal effort • Related work • Conclusion and future work
  • 26. Learning Transformation Programs by Example
      Input data → Target data:
      2000 Ford Expedition 11k runs great los angeles $4900 (los angeles) → 2000 Ford Expedition los angeles $4900
      1998 Honda Civic 12k miles s. Auto. - $3800 (Arcadia) → 1998 Honda Civic Arcadia $3800
      2008 Mitsubishi Galant ES $7500 (Sylmar CA) pic → 2008 Mitsubishi Galant Sylmar CA $7500
      1996 Isuzu Trooper 14k clean title west covina $999 (west covina) pic → 1996 Isuzu Trooper west covina $999
      …
      Time complexity is exponential in the number of examples and a high-degree polynomial in their length.
  • 27. Reuse subprograms (figure panels): after 1st example; after 2nd example; after 3rd example
  • 28. Identify incorrect subprograms
      Examples (input → output):
      2000 Ford Expedition 11k runs great los angeles $4900 (los angeles) → 2000 Ford Expedition los angeles $4900
      1998 Honda Civic 12k miles s. Auto. - $3800 (Arcadia) → 1998 Honda Civic Arcadia $3800
      Program (position programs): (START, NUM, 1) (BNK, NUM, 1) (’(’, WORD, 1) (LWRD, ’)', 1) (ANY,BNK’$', 1) (NUM, BNK, 2)
      Input: 2008 Mitsubishi Galant ES $7500 (Sylmar CA) pic
      Execution result: Null, Null, ␣$7500; positions returned: 0, -1, 33, -1
      Target output: 2008 Mitsubishi Galant Sylmar CA $7500; positions expected: 0, 23, 33, 43
      Comparing returned and expected positions (= or ≠) identifies the incorrect position programs.
  • 29. Update hypothesis spaces (figure). Program: (START, NUM, 1) (BNK, NUM, 1) (’(’, WORD, 1) (LWRD, ’)', 1) (ANY,BNK’$', 1). Hypothesis space H3 contains segment spaces h31, h32, h33, each with a start (s) and end (e) position space, derived from the examples "2000 Ford Expedition 11k runs great los angeles $4900 (los angeles)" and "1998 Honda Civic 12k miles s. Auto. - $3800 (Arcadia)". Position space h32 e before the update: left context LWRD, WORD, …; right context “)”, …. After adding "2008 Mitsubishi Galant ES $7500 (Sylmar CA) pic": left context WORD, …; right context “)”, ….
  • 30. Evaluation • Dataset – D1: 17 scenarios used in (Lin et al., 2014), 5 records per scenario – D2: 30 scenarios collected from student data integration projects, about 350 records per scenario – D3: synthetic dataset designed to evaluate scale-up • Alternative approaches – Our implementation of Gulwani’s approach (Gulwani, 2011) – Metagol (Lin et al., 2014) • Metric – Time (in seconds) to generate a transformation program
  • 31. Program generation time comparisons. Table: time (in seconds) to generate programs on the D1 and D2 datasets. Figure: scalability test on D3.
  • 32. Agenda • Introduction • Previous work • Our approach – Learning conditional statements – Synthesizing branch transformation programs – Maximize user correctness with minimal effort • Related work • Conclusion and future work
  • 33. Motivation • Thousands of records in datasets • Various transformation scenarios • Overconfident users
  • 35. Learning from various past results. Scenario 1 (Raw → Transformed): 26" H x 24" W x 12.5 → 26; Framed at 21.75" H x 24.25" W → 21; 12" H x 9" → 12; …. Scenario 2 (Raw → Transformed): Ravage 2099#24 (November, 1994) → November, 1994; Gambit III#1 (September, 1997) → September, 1997; (comic) Spidey Super Stories#12/2 (September, 1975) → comic; …. Legend: examples, incorrect records, correct records.
  • 36. Approach Overview (figure): entire dataset → random sampling → sampled records → verifying records → sorting and color-coding. Entire dataset (Raw → Transformed): 10" H x 8" W → 10; H: 58 x W:25" → 58; 12"H x 9"W → 12; 11"H x 6" → 11; …; 30 x 46" → 30 x 46. Sampled records: 10" H x 8" W → 10; 11"H x 6" → 11; …; 30 x 46" → 30 x 46. After verifying records: 11"H x 6" → 11; 30 x 46" → 30 x 46; …. After sorting and color-coding: 30 x 46" → 30 x 46; 11"H x 6" → 11; ….
  • 37. Verifying Records • Recommend records causing runtime errors – Records that cause the program to exit abnormally • Recommend potentially incorrect records – Learn a binary meta-classifier. Ex: Program: (LWRD, ‘)’, 1); Input: 2008 Mitsubishi Galant ES $7500 (Sylmar CA) pic; records (Raw → Transformed): 11"H x 6" → 11; 30 x 46" → 30 x 46; …
  • 38. Learning the Meta-classifier. Learn an ensemble of classifiers using AdaBoost: (1) Select a classifier f_i from a pool of binary classifiers (2) Assign weight w_i to f_i (3) Loop until the error is below a threshold
  • 39. Evaluation. Dataset: 30 scenarios, 350 records per scenario. Experiment setup: • Approach-β • Baseline. Metrics: • Iteration correctness • MRR
  • 40. Agenda • Introduction • Previous work • Our approach – Learning conditional statements – Synthesizing branch transformation programs – Maximize user correctness with minimal effort • Related work • Conclusion and future work
  • 41. Related Work • Approaches not focusing on data transformation – Wrapper induction • Kushmerick, 1997; Hsu and Dung, 1998; Muslea et al., 1999 – Inductive programming (we learn ) • Summers, 1977; Kitzelmann and Schmid, 2006; Shapiro, 1981; Muggleton and Lin, 2013 • Approaches not learning programs iteratively • FlashFill (Gulwani, 2011); SmartPython (Lau, 2001); SmartEdit (Lau, 2001); Singh and Gulwani, 2012; Raza et al., 2014; Harris et al., 2011 – Approaches learning part of the programs iteratively • MetagolDF (Lin et al., 2014); Perelman et al., 2014
  • 42. Conclusion: contributions • Enable users to generate complicated programs in real time • Enable users to work on large datasets • Improve the performance of other PBE approaches
  • 43. Conclusion: future work • Managing user expectation • Incorporating third-party functions • Handling user errors
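
A minimal Python sketch of the two-branch transformation program on slide 10, under stated assumptions. It reads `indexOf(left, right, count)` as "the count-th position whose left context ends with the `left` token and whose right context starts with the `right` token", with negative counts taken from the end. That reading, the character-level regular-expression matching, and the `classify` stand-in (a fixed test for `|` rather than a learned classifier) are illustrative assumptions, not the system's implementation.

```python
import re

# Token patterns taken from slide 10; START/END are modelled as anchors.
TOKENS = {
    "BNK": r"\s",
    "NUM": r"[0-9]+",
    "UWRD": r"[A-Z]",
    "LWRD": r"[a-z]+",
    "WORD": r"[a-zA-Z]+",
    "START": r"^",
    "END": r"$",
    "VBAR": r"\|",
}


def index_of(value, left, right, count):
    """Position program: return the count-th matching position, or None."""
    left_re = re.compile("(?:" + TOKENS[left] + ")$")   # left context ends here
    right_re = re.compile(TOKENS[right])                # right context starts here
    hits = [p for p in range(len(value) + 1)
            if left_re.search(value[:p]) and right_re.match(value[p:])]
    if count == 0:
        return None
    idx = count - 1 if count > 0 else count  # 1-based from the front, -1 = last
    try:
        return hits[idx]
    except IndexError:
        return None


def classify(value):
    """Hypothetical stand-in for the learned conditional statement."""
    return "format2" if "|" in value else "format1"


def transform(value):
    """Run the branch program selected by classify(); None on failure."""
    if classify(value) == "format1":
        start = index_of(value, "BNK", "NUM", -1)
        end = index_of(value, "NUM", "BNK", 2)
    else:
        start = index_of(value, "VBAR", "NUM", 2)
        end = index_of(value, "NUM", "BNK", -1)
    if start is None or end is None:
        return None
    return value[start:end]  # segment program: a substring of the input
```

Applied to the five raw values on slide 4, this sketch returns 9.375, 24, 19.5, 17.5 and 18 for R1 to R5.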

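
The constrained agglomerative clustering of slides 19 and 20 can be sketched roughly as below. The feature vector, the single-link Euclidean cluster distance, and the `compatible` callback are all assumptions made for illustration: the slide's actual distance formula is not recoverable from this transcript, and `compatible` is a hypothetical stand-in for the expensive check that one branch program can cover every example in a cluster.

```python
import math
from itertools import combinations


def features(text):
    """Hypothetical format features: counts of a few punctuation marks."""
    return (text.count("|"), text.count(":"), text.count("."))


def cluster_distance(a, b):
    """Single-link distance between two clusters (an assumed choice)."""
    return min(math.dist(features(x), features(y)) for x in a for y in b)


def constrained_clustering(examples, compatible):
    """Greedily merge the closest clusters, honouring cannot-merge constraints."""
    clusters = [frozenset([e]) for e in examples]
    cannot_merge = set()  # pairs of clusters known to be incompatible
    while True:
        candidates = [
            (cluster_distance(a, b), i, j)
            for (i, a), (j, b) in combinations(enumerate(clusters), 2)
            if frozenset((a, b)) not in cannot_merge
        ]
        if not candidates:
            return clusters
        _, i, j = min(candidates)
        a, b = clusters[i], clusters[j]
        if compatible(a | b):
            # Merge succeeded: replace the two clusters with their union.
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
            clusters.append(a | b)
        else:
            # Record the failure so this pair is never tested again.
            cannot_merge.add(frozenset((a, b)))


R1 = "5.25 in HIGH x 9.375 in WIDE"
R2 = "20 in HIGH x 24 in WIDE"
R3 = "9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE"
R4 = "Image: 20.5 in. HIGH x 17.5 in. WIDE"


def same_bar_usage(cluster):
    """Toy compatibility test: all examples agree on whether they contain '|'."""
    return len({"|" in e for e in cluster}) == 1


result = constrained_clustering([R1, R2, R3, R4], same_bar_usage)
```

With the toy compatibility test, `result` holds two clusters, one with R1, R2 and R4 and one with R3 alone, matching the grouping on slide 13. Recording failed merges as cannot-merge constraints is what keeps the expensive compatibility check from being repeated for the same pair.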
Editor's Notes

  • #4: To do: Change the table format here
  • #6: One figure for one bullet and one built in for one bullet.
  • #10: Remove
  • #11: Transformation programs: conditional / branch. A branch is a concatenation of several segment programs. A segment program basically returns a substring. One way of specifying a segment program is to specify where to extract this substring from the input.
  • #12: Creating traces. Derive hypothesis space: 1. Node contains all 2. Edges contain all the segment programs 3. Each path contains all the programs that transform the inputs into the outputs
  • #15: 18:08
  • #17: 18:08
  • #19: Provide the intuition here
  • #20: Every time the user provides an example, the system tries to generate a program. During this process, the system figures out that certain examples shouldn’t be put into the same partition and that certain examples can be put into the same partition. We use two types of constraints to model this information.
  • #22: The reason for learning a distance metric. Why weighted Euclidean distance. How to model the two constraints.
  • #24: Verifying 3 design decisions: whether using constraints from previous iterations is helpful; whether the learned distance metric can outperform directly applying Euclidean distance and CBS; whether incorporating unlabeled data is helpful.
  • #26: 35 minutes
  • #27: Syntax transformation entity extraction.
  • #28: Trace specifies one way of generating the Program from the trace will look like ….
  • #29: Change color of the dashed lines
  • #33: 12 minutes
  • #35: buildin
  • #39: How much detail should I use to describe the classifiers?
  • #41: 15 minutes
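
Note #22 mentions a weighted Euclidean distance. A minimal sketch of the idea follows, assuming a hypothetical two-feature representation (counts of `|` and `.`) and hand-picked weights rather than learned ones: choosing the weights well pulls a must-merge pair together while keeping a cannot-merge pair apart.

```python
import math


def weighted_distance(x, y, weights):
    """Weighted Euclidean distance: sqrt(sum_i w_i * (x_i - y_i)^2)."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


def features(text):
    """Hypothetical features: number of '|' and number of '.' characters."""
    return (text.count("|"), text.count("."))


r1 = features("5.25 in HIGH x 9.375 in WIDE")                 # (0, 2)
r2 = features("20 in HIGH x 24 in WIDE")                      # (0, 0)
r3 = features("9.75 in|16 in HIGH x 13.75 in|19.5 in WIDE")   # (2, 3)

uniform = (1.0, 1.0)  # plain Euclidean distance
chosen = (1.0, 0.0)   # hand-picked here; the talk learns such weights

# Must-merge pair (R1, R2) and cannot-merge pair (R1, R3) under each weighting.
must_uniform = weighted_distance(r1, r2, uniform)
cannot_uniform = weighted_distance(r1, r3, uniform)
must_chosen = weighted_distance(r1, r2, chosen)
cannot_chosen = weighted_distance(r1, r3, chosen)
```

Under uniform weights the must-merge pair is almost as far apart as the cannot-merge pair (2.0 versus about 2.24); ignoring the `.` count brings the must-merge pair to distance 0 while the cannot-merge pair stays at 2.0.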