Anomaly 
Detection with 
Apache Spark 
A Gentle Introduction 
Sean Owen // Director of Data Science
2 
www.flickr.com/photos/sammyjammy/1285612321/in/set- 
72157620597747933
3
4 
Unknown Unknowns 
• Looking for things we don’t know 
we don’t know 
– Failing server metrics 
– Suspicious access patterns 
– Fraudulent transactions 
• Search among anomalies 
• Labeled, or not 
– Sometimes have examples of 
“important” or “unusual” 
– Usually not 
streathambrixtonchess.blogspot.co.uk/2012/07/rumsfeld-redux.html
5 
Clustering 
• Find areas dense with data 
(conversely, areas without data) 
• Anomaly = far from any cluster 
• Unsupervised learning 
• Supervise with labels to 
improve, interpret 
en.wikipedia.org/wiki/Cluster_analysis
6 
K-Means++ 
• Assign points to nearest center, 
update centers, iterate 
• Goal: points close to nearest 
cluster center 
• Must choose k = number of 
clusters 
• ++ means smarter starting point 
mahout.apache.org/users/clustering/fuzzy-k-means.html
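The assign-then-update loop can be sketched in plain Scala. This is a hypothetical single-machine helper for intuition, not MLlib's implementation; `squaredDist` and `kmeansStep` are names introduced here:

```scala
// One k-means iteration: assign each point to its nearest center,
// then move each center to the mean of its assigned points.
def squaredDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def kmeansStep(points: Seq[Array[Double]],
               centers: Seq[Array[Double]]): Seq[Array[Double]] = {
  // Assignment step: group points by the index of their nearest center
  val assigned = points.groupBy { p =>
    centers.indices.minBy(i => squaredDist(p, centers(i)))
  }
  // Update step: each center becomes the element-wise mean of its points
  centers.indices.map { i =>
    assigned.get(i) match {
      case Some(ps) =>
        ps.reduce((a, b) => a.zip(b).map(t => t._1 + t._2))
          .map(_ / ps.size)
      case None => centers(i) // empty cluster: keep the old center
    }
  }
}
```

Iterating this step to convergence is exactly what `KMeans.run` does at scale; the ++ part only changes how the initial centers are picked.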
7 
KDD Cup 1999 
Data Set
8 
KDD Cup 1999 
• Annual ML competition 
www.sigkdd.org/kddcup/index.php 
• 1999: Network intrusion detection 
• 4.9M network sessions 
• Some normal; many known attacks 
• Not a realistic sample!
9 
9 
Sample record, with the fields the slide calls out: 
0,tcp,http,SF,215,45076,          ← service (http), bytes received (45076) 
0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1, 
0.00,0.00,0.00,0.00,1.00,0.00,0.00,   ← % SYN errors 
0,0,0.00,0.00,0.00,0.00,0.00,0.00, 
0.00,0.00,normal.                 ← label
10 
Apache Spark: Something for Everyone 
• Scala-based 
– “Distributed Scala” 
– Expressive, efficient 
– JVM-based 
• Consistent Scala-like API 
– RDDs for everything 
– RDD works like immutable Scala 
collection 
– Collection-like, as in Apache Crunch 
• … but Java/Python APIs too 
• Inherently Distributed 
• Hadoop-friendly 
– Works on existing data (HDFS, HBase, 
Kafka) 
– With existing resources (YARN) 
– ETL no longer separate 
• Interactive REPL 
– Familiar model for R, Python devs 
– Exploratory, not just operational 
• MLlib
11 
Clustering, Take #0 
Just Do It
12 
> spark-shell 
val rawData = sc.textFile("/user/srowen/kddcup.data") 
rawData.first() 
... 
res3: String = 
0,tcp,http,SF,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1, 
1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0,0,0.00,0.00,0.00,0. 
00,0.00,0.00,0.00,0.00,normal.
13 
0,tcp,http,SF,215,45076, 
0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1, 
0.00,0.00,0.00,0.00,1.00,0.00,0.00, 
0,0,0.00,0.00,0.00,0.00,0.00,0.00, 
0.00,0.00,normal.
14 
import org.apache.spark.mllib.linalg.Vectors 
import org.apache.spark.mllib.clustering._ 

val labelsAndData = rawData.map { line => 
  val buffer = line.split(',').toBuffer 
  buffer.remove(1, 3)                 // drop the categorical columns for now 
  val label = buffer.remove(buffer.length-1) 
  val vec = Vectors.dense(buffer.map(_.toDouble).toArray) 
  (label, vec) 
} 

val data = labelsAndData.values.cache() 

val kmeans = new KMeans() 
val model = kmeans.run(data)
15
16 
cluster  label  count 
0 back. 2203 
0 buffer_overflow. 30 
0 ftp_write. 8 
0 guess_passwd. 53 
0 imap. 12 
0 ipsweep. 12481 
0 land. 21 
0 loadmodule. 9 
0 multihop. 7 
0 neptune. 1072017 
0 nmap. 2316 
0 normal. 972781 
0 perl. 3 
0 phf. 4 
0 pod. 264 
0 portsweep. 10412 
0 rootkit. 10 
0 satan. 15892 
0 smurf. 2807886 
0 spy. 2 
0 teardrop. 979 
0 warezclient. 1020 
0 warezmaster. 20 
1 portsweep. 1
17 
Clustering, Take #1 
Choose k
18 
import org.apache.spark.mllib.linalg.Vector 
import org.apache.spark.rdd.RDD 

def distance(a: Vector, b: Vector) = 
  math.sqrt(a.toArray.zip(b.toArray).map(t => t._1 - t._2).map(d => d*d).sum) 

def distToCentroid(v: Vector, model: KMeansModel) = { 
  val centroid = model.clusterCenters(model.predict(v)) 
  distance(centroid, v) 
} 

def clusteringScore(data: RDD[Vector], k: Int) = { 
  val kmeans = new KMeans() 
  kmeans.setK(k) 
  val model = kmeans.run(data) 
  data.map(d => distToCentroid(d, model)).mean() 
} 

(5 to 40 by 5).map(k => (k, clusteringScore(data, k)))
19 
( 5,1938.858341805931) 
(10,1689.4950178959496) 
(15,1381.315620528147) 
(20,1318.256644582388) 
(25, 932.0599419255919) 
(30, 594.2334547238697) 
(35, 829.5361226176625) 
(40, 424.83023056838846)
20 
kmeans.setRuns(10) 
kmeans.setEpsilon(1.0e-6) 
(30 to 100 by 10).par.map(k => 
  (k, clusteringScore(data, k))) 

( 30,862.9165758614838) 
( 40,801.679800071455) 
( 50,379.7481910409938) 
( 60,358.6387344388997) 
( 70,265.1383809649689) 
( 80,232.78912076732163) 
( 90,230.0085251067184) 
(100,142.84374573413373)
21
22 
Clustering, Take #2 
Normalize
23 
Standard Scores 
• Standard or “z” score 
• σ (standard deviation): 
normalize away scale 
• μ (mean): 
doesn’t really matter here 
• Assumes normalish distribution 
(xi − μi) / σi
24 
val dataArray = data.map(_.toArray) 
val numCols = dataArray.first().length 
val n = dataArray.count() 

val sums = dataArray.reduce((a,b) => a.zip(b).map(t => t._1 + t._2)) 
// aggregate, not fold: the merge of per-partition results must add, 
// not square, the partial sums 
val sumSquares = dataArray.aggregate(new Array[Double](numCols))( 
  (a,b) => a.zip(b).map(t => t._1 + t._2 * t._2), 
  (a,b) => a.zip(b).map(t => t._1 + t._2) 
) 
val stdevs = sumSquares.zip(sums).map { 
  case (sumSq, sum) => math.sqrt(n*sumSq - sum*sum) / n 
} 
val means = sums.map(_ / n) 

def normalize(v: Vector) = { 
  val normed = (v.toArray, means, stdevs).zipped.map( 
    (value, mean, stdev) => (value - mean) / stdev) 
  Vectors.dense(normed) 
}
25 
( 60,0.0038662664156513646) 
( 70,0.003284024281015404) 
( 80,0.00308768458568131) 
( 90,0.0028326001931487516) 
(100,0.002550914511356702) 
(110,0.002516106387216959) 
(120,0.0021317966227260106)
26
27 
Clustering, Take #3 
Categoricals
28 
One-Hot / 1-of-n Encoding 
…,tcp,… 
…,udp,… 
…,icmp,… 
…,1,0,0,… 
…,0,1,0,… 
…,0,0,1,…
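A minimal Scala sketch of this encoding for the protocol column (the value list is taken from the slide; `oneHot` is a name introduced here):

```scala
// Hypothetical helper: map each categorical value to a 1-of-n vector.
// In practice the distinct values are collected from the data itself.
val protocols = Seq("tcp", "udp", "icmp")

def oneHot(value: String): Array[Double] =
  protocols.map(p => if (p == value) 1.0 else 0.0).toArray
```

Applied to the record's protocol field, "udp" becomes 0,1,0; the same scheme extends to the other categorical columns (service, flag) before normalizing.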
29 
( 80,0.038867919526032156) 
( 90,0.03633130732772693) 
(100,0.025534431488492226) 
(110,0.02349979741110366) 
(120,0.01579211360618129) 
(130,0.011155491535441237) 
(140,0.010273258258627196) 
(150,0.008779632525837223) 
(160,0.009000858639068911)
30 
Clustering, Take #4 
Labels & Entropy
31 
Using Labels With Entropy 
• Information theory concept 
• Measures mixed-ness 
• Function of label proportions, pi 
• Good clusters have 
homogeneous labels 
• Homogeneous = 
low entropy = 
good clustering 
−Σi pi log pi  =  Σi pi log (1/pi)
32 
def entropy(counts: Iterable[Int]) = { 
val values = counts.filter(_ > 0) 
val n: Double = values.sum 
values.map { v => 
val p = v / n 
-p * math.log(p) 
}.sum 
} 
def clusteringScore(...) = { 
... 
val labelsAndClusters = 
normalizedLabelsAndData.mapValues(model.predict) 
val clustersAndLabels = labelsAndClusters.map(_.swap) 
val labelsInCluster = 
clustersAndLabels.groupByKey().values 
val labelCounts = labelsInCluster.map( 
_.groupBy(l => l).map(_._2.size)) 
val n = normalizedLabelsAndData.count() 
labelCounts.map(m => m.sum * entropy(m)).sum / n 
}
33 
(80,1.0079370754411006) 
(90,0.9637681417493124) 
(100,0.9403615199645968) 
(110,0.4731764778562114) 
(120,0.37056636906883805) 
(130,0.36584249542565717) 
(140,0.10532529463749402) 
(150,0.10380319762303959) 
(160,0.14469129892579444)
34 
cluster  label  count 
0 back. 6 
0 neptune. 821239 
0 normal. 255 
0 portsweep. 114 
0 satan. 31 
... 
90 ftp_write. 1 
90 loadmodule. 1 
90 neptune. 1 
90 normal. 41253 
90 warezclient. 12 
... 
93 normal. 8 
93 portsweep. 7365 
93 warezclient. 1
35 
Detecting An 
Anomaly
36 
Evaluate with Spark Streaming 
(diagram: streaming sessions are scored against the model; anomalies raise an alert)
37 
val distances = normalizedData.map( 
d => distToCentroid(d, model) 
) 
val threshold = distances.top(100).last 
val anomalies = normalizedData.filter( 
d => distToCentroid(d, model) > threshold 
)
38 
flag (S1); count / srv count (15, 16) 
0,tcp,http,S1,299,26280,0,0,0,1,0,1,0,1,0,0, 
0,0,0,0,0,0,15,16,0.07,0.06,0.00,0.00,1.00, 
0.00,0.12,231,255,1.00,0.00,0.00,0.01,0.01, 
0.01,0.00,0.00,normal. 
dst_host_count / dst_host_srv_count (231, 255) 
Anomaly?
Thank You! 
sowen@cloudera.com 
@sean_r_owen