Building Custom ML PipelineStages for Feature Selection with Marc Kaminski

Kaminski, Schlegel | Oct. 25, 2017
BUILDING CUSTOM ML PIPELINESTAGES
FOR FEATURE SELECTION.
SPARK SUMMIT EUROPE 2017.

WHAT YOU WILL LEARN DURING THIS SESSION.
- What data-driven car diagnostics look like at BMW.
- A good understanding of the most important elements in Spark ML PipelineStages (using a feature selection example).
- Attention: There will be Scala code examples!
- How to use spark-FeatureSelection in your Spark ML Pipeline.
- The impact of feature selection on learning performance and on the understanding of the big data black box.
Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 2

MOTIVATION.
- The #3 contributor to warranty incidents for OEMs is "no trouble found" cases. [1]
- Potential root causes:
  - Manually formalized expert knowledge cannot cope with the vast number of possibilities.
  - Cars are getting more and more complex (hybridization, connectivity).
  - Less experienced workshop staff in evolving markets.
- Improve three workflows at once by shifting from a manual to a data-driven approach:
  - Automatic knowledge generation.
  - Automatic workshop diagnostics.
  - Predictive maintenance.
[1] BearingPoint, Global Automotive Warranty Survey Report 2009
THE DATASET AND ITS CHALLENGES.
MV_S MV_0 … MV_4000 MV_BGEE_TP SC_IP SC_1 SC_2 DTC_PU DTC_1 DTC_2 CP Label
44 3 … 20 -0.06 false 2 77 27 false true v.10 false
72 36 … 73 -0.01 false 16 29 false v.10 false
100 4 … 16 -0.02 true 45 1 false false v.10 false
44 14 … 54 -0.02 true 76 false v.10 true
95 34 … 73 -0.07 false 80 22 false false v.10 false
16 50 … 33 -0.02 true 61 93 false false false v.11 false
4 … 27 -0.09 false 59 91 false v.10 false
88 60 … 72 -0.01 true 1.9 96 53 true false true v.10 false
27 14 … 88 false 73 14 false v.10 false

- High-dimensional feature space (7000+ features)
- High sparsity
- High class imbalance

SPARK PIPELINE.
Relational DWH → ETL → Handling imbalance → Preprocessing → Cross-validation loop → Model

- ETL: Loading, Imputation
  MV_S SC_IP DTC_PU CP Label
  44 2 false v.10 false
  72 1.5 true v.11 false
  23 1.4 false v.11 false
  44 1.5 true v.10 true
- Handling imbalance: SMOTE [2], Undersampling, …
- Preprocessing: StringIndexer, OneHotEncoder, VectorAssembler, Discretization, Std. Scaler
  Features Label
  [0.34,0.8,0,1] 0.0
  [0.7,0.4,1,0] 0.0
  [0.31,0.35,1,0] 1.0
  [0.3,0.4,1.1] 1.0
- Cross-validation loop:
  - Feature selection [3]: InformationGain, Correlation, ChiSquared, Random Forest, Gini, L1 LogReg
  - Classifier: Logistic Regression / Random Forest
[2] Chawla et al.: SMOTE: Synthetic Minority Over-sampling Technique
[3] Schlegel et al.: Design and optimization of an autonomous feature selection pipeline for high dimensional, heterogeneous feature spaces.
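
Undersampling, one of the imbalance-handling options named in the pipeline, could be sketched in plain Scala as follows. This is only an illustration under stated assumptions: the object and method names are hypothetical, the 1:1 target ratio is an assumption, and the talk does not say how the step was actually implemented on Spark.

```scala
import scala.util.Random

// Hypothetical helper, not part of Spark or spark-FeatureSelection.
object UndersamplingSketch {
  /** Randomly drops majority-class rows until both classes have the same size.
    * Each row is a (payload, label) pair; the 1:1 ratio is an assumption. */
  def undersample[A](rows: Seq[(A, Boolean)], seed: Long): Seq[(A, Boolean)] = {
    val (positives, negatives) = rows.partition(_._2)
    // Whichever class is smaller is kept completely.
    val (minority, majority) =
      if (positives.size <= negatives.size) (positives, negatives)
      else (negatives, positives)
    val rng = new Random(seed) // fixed seed keeps the sample reproducible
    minority ++ rng.shuffle(majority).take(minority.size)
  }
}
```

Random undersampling throws away majority-class rows, which is cheap but discards information; SMOTE [2] instead synthesizes additional minority-class rows.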

SPARK PIPELINE API.
- PipelineStage: interface for usage in a Pipeline.
- Transformer: "transforms data" (data → data).
- Estimator: "learns from data" (data → Transformer).

ORG.APACHE.SPARK.ML.*
- PipelineStage: interface for usage in Pipeline
  - Estimator: learns from data
    - Pipeline: concatenates PipelineStages
    - Predictor: interface for predictors
    - FeatureSelector: interface for feature selection
  - Transformer: transforms data
    - Model: fitted model
      - PipelineModel: model from Pipeline
      - PredictionModel: model from Predictor
      - FeatureSelectorModel: model from FeatureSelector

MAKING AN ESTIMATOR – FEATURESELECTION EXAMPLE.
abstract class FeatureSelector[
    Learner <: FeatureSelector[Learner, M],
    M <: FeatureSelectorModel[M]]
  extends Estimator[M] with FeatureSelectorParams with DefaultParamsWritable {

  // Setters for params in FeatureSelectorParams
  def setParam*(value: ParamType): Learner = set(param, value).asInstanceOf[Learner]

  // PipelineStage and Estimator methods
  override def transformSchema(schema: StructType): StructType = {}
  override def fit(dataset: Dataset[_]): M = {}
  override def copy(extra: ParamMap): Learner

  // Abstract methods that are called from fit()
  protected def train(dataset: Dataset[_]): Array[(Int, Double)]
  protected def make(uid: String, selectedFeatures: Array[Int],
                     featureImportances: Map[String, Double]): M
}
- Estimator[M]: needs to know what it shall return.
- FeatureSelectorParams: defined later.
- DefaultParamsWritable: makes all Params writable.
- Setters return Learner for setter concatenation: mdl.setParam1(val1).setParam2(val2)...
- transformSchema (= input validation): performs input checking and fails fast. It returns the transformed schema or throws an exception.
  DataFrame with schema: features: VectorColumn, selected: VectorColumn, label: Double
  features label
  [0,1,0,1] 1.0
  [1,0,0,0] 1.0
  Attention: VectorColumns have metadata: name, type, range, etc.
- fit (= learn from data): Dataset → Transformer. Learns from data and returns a Model. Here: calculate feature importances.
- train and make: not necessary, but avoid code duplication.
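
The Learner type parameter is what makes setter concatenation work: a setter defined once in the base class returns the concrete subclass, so subclass-only setters stay reachable later in the chain. The sketch below shows that idea without Spark. Selector, GiniLikeSelector and all parameter names are hypothetical stand-ins, not the package's classes.

```scala
// Hypothetical stand-in for the abstract FeatureSelector; no Spark dependency.
abstract class Selector[Learner <: Selector[Learner]] {
  private var params: Map[String, Any] = Map.empty

  // Returning Learner (not Selector) keeps the concrete type across chained calls.
  def setOutputCol(value: String): Learner = set("outputCol", value)
  def setNumTopFeatures(value: Int): Learner = set("numTopFeatures", value)

  def get(name: String): Option[Any] = params.get(name)

  protected def set(name: String, value: Any): Learner = {
    params = params.updated(name, value)
    this.asInstanceOf[Learner] // same cast as on the slide
  }
}

final class GiniLikeSelector extends Selector[GiniLikeSelector] {
  // A setter that exists only on the concrete class.
  def setImpurityFloor(value: Double): GiniLikeSelector = set("impurityFloor", value)
}

object SetterChainingDemo {
  def main(args: Array[String]): Unit = {
    val sel = new GiniLikeSelector()
      .setOutputCol("gini")     // defined in the base class, still typed as GiniLikeSelector
      .setImpurityFloor(0.01)   // so the subclass-only setter compiles here
      .setNumTopFeatures(10)
    println(sel.get("outputCol"))
  }
}
```

If the base-class setters returned the base type instead, the call to setImpurityFloor after setOutputCol would not compile.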

MAKING A TRANSFORMER – FEATURESELECTION EXAMPLE.
abstract class FeatureSelectorModel[M <: FeatureSelectorModel[M]] (override val uid: String,
    val selectedFeatures: Array[Int],
    val featureImportances: Map[String, Double])
  extends Model[M] with FeatureSelectorParams with MLWritable {

  def setFeaturesCol(value: String): this.type = set(featuresCol, value)

  // PipelineStage and Transformer methods
  override def transform(dataset: Dataset[_]): DataFrame = {}

  def write: MLWriter
}
- MLWritable: for persistence.
- Same idea as in Estimator, but different tasks.
- transform: transforms data.
- write: adds persistence.

GIVING YOUR NEW PIPELINESTAGE PARAMETERS.
import org.apache.spark.ml.param._
import org.apache.spark.ml.param.shared._

private[selection] trait FeatureSelectorParams extends Params
    with HasFeaturesCol with HasOutputCol with HasLabelCol {
  // Define params and getters here...
  final val param = new Param[Type](this, "name", "description")
  def getParam: Type = $(param)
}
- Using the shared params is possible because the package is in org.apache.spark.ml.
- Params are available out of the box for several types, e.g. DoubleParam, IntParam, BooleanParam, StringArrayParam, ... Other types need to implement jsonEncode and jsonDecode to maintain persistence.
- Getters are shared between Estimator and Transformer. Setters are not, so that they can be concatenated.

ADDING PERSISTENCE TO YOUR NEW PIPELINEMODEL.
- What has to be saved?
  - Metadata: uid, timestamp, version, …
  - Parameters
  - Learnt data: selectedFeatures & featureImportances
- Metadata and parameters: since we are in org.apache.spark.ml, use DefaultParamsWriter.saveMetadata() and DefaultParamsReader.loadMetadata().
- Learnt data: create a DataFrame and use write.parquet(…).
- How do we do that?
  - Create companion object FeatureSelectorModel, which offers the following classes:
    - abstract class FeatureSelectorModelReader[M <: FeatureSelectorModel[M]] extends MLReader[M] {…}
    - class FeatureSelectorModelWriter[M <: FeatureSelectorModel[M]](instance: M) extends MLWriter {…}
HOW TO USE SPARK-FEATURESELECTION.

import org.apache.spark.ml.feature.selection.filter._
import org.apache.spark.ml.feature.selection.util.VectorMerger
import org.apache.spark.ml.Pipeline

// load Data
val df = spark.read.parquet("path/to/data/train.parquet")

// Feature selectors. Offer different selection methods.
val corSel = new CorrelationSelector().setInputCol("features").setOutputCol("cor")
val giniSel = new GiniSelector().setInputCol("features").setOutputCol("gini")

// VectorMerger merges VectorColumns and removes duplicates. Requires vector columns with names!
val merger = new VectorMerger().setInputCols(Array("cor", "gini")).setOutputCol("selected")

// Put everything in a pipeline and fit together
val plModel = new Pipeline().setStages(Array(corSel, giniSel, merger)).fit(df)
val dfT = plModel.transform(df).drop("Features")

df:
features Label
[0,1,0,1] 1.0
[0,0,0,0] 0.0
[1,1,0,0] 0.0
[1,0,0,0] 1.0

Scores computed by fit:
Feature F1 F2 F3 F4
Score 1 0.9 0.7 0.0 0.5
Score 2 0.6 0.8 0.0 0.4

dfT (result of transform):
selected Label
[0,1] 1.0
[0,0] 0.0
[1,1] 0.0
[1,0] 1.0
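
The merge step can be pictured with a small plain-Scala sketch. It is not the package's VectorMerger: NamedVector and merge are hypothetical names, and keeping the first occurrence of each feature name in input order is an assumption. It only illustrates why the columns need names, since duplicates can be detected only by feature name.

```scala
// Hypothetical model of one row of a vector column whose slots carry feature names.
final case class NamedVector(names: Seq[String], values: Seq[Double])

object MergeSketch {
  /** Concatenates the vectors and keeps the first occurrence of every feature name. */
  def merge(vectors: Seq[NamedVector]): NamedVector = {
    val merged = scala.collection.mutable.LinkedHashMap.empty[String, Double]
    for {
      v <- vectors
      (name, value) <- v.names.zip(v.values)
      if !merged.contains(name) // duplicate feature, already taken from an earlier column
    } merged(name) = value
    NamedVector(merged.keys.toSeq, merged.values.toSeq)
  }

  def main(args: Array[String]): Unit = {
    // First row of the example, assuming both selectors kept F1 and F2.
    val cor = NamedVector(Seq("F1", "F2"), Seq(0.0, 1.0))
    val gini = NamedVector(Seq("F2", "F1"), Seq(1.0, 0.0))
    println(merge(Seq(cor, gini)))
  }
}
```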

SPARK-FEATURESELECTION PACKAGE.
- Offers selection based on:
  - Gini coefficient
  - Correlation coefficient
  - Information gain
  - L1-Logistic regression weights
  - Random forest importances
- Utility stage:
  - VectorMerger
- Three modes:
  - Percentile (default)
  - Fixed number of columns
  - Compare to random column [4]
Find on GitHub: spark-FeatureSelection or on Spark-packages
[4]: Stoppiglia et al.: Ranking a Random Feature for Variable and Feature Selection
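
The three modes can be read as three ways of turning a list of (feature index, importance) pairs into a set of selected indices. The plain-Scala sketch below works under stated assumptions: the method names, the rounding in the percentile mode, and passing the random column's importance in as a plain number are all hypothetical and may differ from what the package does.

```scala
// Hypothetical illustration of the three selection modes; not the package's code.
object SelectionModes {
  type Importances = Seq[(Int, Double)] // (feature index, importance)

  /** Fixed number of columns: keep the k most important features. */
  def byNumber(imps: Importances, k: Int): Seq[Int] =
    imps.sortBy { case (_, score) => -score }.take(k).map(_._1).sorted

  /** Percentile: keep the top fraction of features (rounded up, an assumption). */
  def byPercentile(imps: Importances, fraction: Double): Seq[Int] =
    byNumber(imps, math.ceil(imps.size * fraction).toInt)

  /** Compare to random column: keep features that score above a random probe feature. */
  def betterThanRandom(imps: Importances, randomImportance: Double): Seq[Int] =
    imps.collect { case (idx, score) if score > randomImportance => idx }.sorted

  def main(args: Array[String]): Unit = {
    // "Score 1" row of the usage example: F1..F4 mapped to indices 0..3.
    val scores = Seq(0 -> 0.9, 1 -> 0.7, 2 -> 0.0, 3 -> 0.5)
    println(byPercentile(scores, 0.5))
    println(byNumber(scores, 3))
    println(betterThanRandom(scores, 0.6))
  }
}
```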

PERFORMANCE.
[Figure: Area under normalized PRC and ROC (series: normalized area under PRC, area under ROC; axis 0 to 1) for FS - 25 Trees, FS - 100 Trees, No FS - 25 Trees, No FS - 100 Trees.]
[Figure: Time for FS methods and random forest (time [s], axis 0 to 450; series: Multibucketizer, Gini, Correlation, Informationgain, Chi², Randomforest) for the same four configurations.]
[Figure: Correlation between feature importances from feature selection and random forest (Chi², Correlation, Gini, InfoGain; axis 0 to 1.2).]
[Figure: Both comparisons repeated with the series Multibucketizer, Gini, Informationgain, Randomforest.]

LESSONS LEARNT.
- Know what your data looks like and where it is located! Example:
  - Operations can succeed in local mode, but fail on a cluster.
  - Use .persist(StorageLevel.MEMORY_ONLY) when the data fits into memory. The default for .cache is MEMORY_AND_DISK.
- Do not reinvent the wheel for common methods → Consider putting your stages into the spark.ml namespace.
- Use the Spark Web GUI to understand your Spark jobs.

QUESTIONS?
Marc.Kaminski@bmw.de
Bernhard.bb.Schegel@bmw.de

BACKUP.

DETERMINING WHERE YOUR PIPELINESTAGE SHOULD LIVE.
Own namespace vs. org.apache.spark.ml.*
- Own namespace
  - Pro: Safer solution
  - Con: Code duplication
- org.apache.spark.ml.*
  - Pro: Less code duplication (sharedParams, SchemaUtils, …)
  - Pro: Easier to implement persistence
  - Con: More dangerous when not cautious

FEATURE SELECTION.
- Motivation:
  - Many sparse features → feature space has to be reduced → select features that carry a lot of information for prediction.
  - Feature selection (unlike feature transformation) enables understanding of which features have a high impact on the model.

Input:
F1 F2 Noise Label = F1 XOR F2
0 0 0 0
1 0 0 1
0 1 0 1
1 1 1 0

Feature selection (e.g. Correlation, InformationGain, RandomForest, etc.) yields:
Feature Importance
Feature 1 0.7
Feature 2 0.7
Noise 0.2

Output:
F1 F2 Label = F1 XOR F2
0 0 0
1 0 1
0 1 1
1 1 0

FEATURE SELECTION.
- Filter
  - Description: Evaluate intrinsic data properties
  - Advantages: Fast; scalable
  - Disadvantages: Ignore inter-feature dependencies; ignore interaction with classifier
  - Examples: Chi-squared, information gain, correlation
- Wrapper
  - Description: Evaluate model performance of feature subset
  - Advantages: Feature dependencies; simple
  - Disadvantages: Classifier-dependent selection; computationally expensive; risk of overfitting
  - Examples: Genetic algorithms, search algorithms
- Embedded
  - Description: Feature selection is embedded in classifier training
  - Advantages: Feature dependencies
  - Disadvantages: Classifier-dependent selection
  - Examples: L1-Logistic regression, random forest
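
The filter disadvantage "ignore inter-feature dependencies" can be made concrete with the four-row XOR table from the earlier feature selection slide. The sketch scores each column on its own with a Pearson correlation against the label. The Pearson filter is used here purely as an illustration and says nothing about how the importances on that slide were computed; the object and value names are hypothetical.

```scala
// Illustration only: a univariate correlation filter applied to the XOR toy data.
object XorFilterDemo {
  /** Pearson correlation coefficient of two equally long, non-constant sequences. */
  def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    val mx = x.sum / x.size
    val my = y.sum / y.size
    val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
    val sx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
    val sy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
    cov / (sx * sy)
  }

  // Columns of the four-row table: F1, F2, Noise, Label = F1 XOR F2.
  val f1: Seq[Double] = Seq(0.0, 1.0, 0.0, 1.0)
  val f2: Seq[Double] = Seq(0.0, 0.0, 1.0, 1.0)
  val noise: Seq[Double] = Seq(0.0, 0.0, 0.0, 1.0)
  val label: Seq[Double] = Seq(0.0, 1.0, 1.0, 0.0)

  def main(args: Array[String]): Unit = {
    println(s"F1:    ${pearson(f1, label)}")
    println(s"F2:    ${pearson(f2, label)}")
    println(s"Noise: ${pearson(noise, label)}")
  }
}
```

On these four rows F1 and F2 each have a correlation of 0 with the label, although together they determine it completely, while the Noise column gets a nonzero score (about -0.58). A method that evaluates features jointly, such as a wrapper or an embedded method, does not have this blind spot.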

CHALLENGES.
- Big plans for DataFrames when performing many operations on many columns → Can take a long time to build and optimize the DAG.
- Column limit for DataFrames, described in several Jiras, especially SPARK-18016 → Hopefully fixed in Spark 2.3.0.
- Spark PipelineStages are not consistent in how they handle DataFrame schemas → Sometimes no schema is appended.
