From the course: Spark for Machine Learning & AI
Bucketize numeric data - Apache Spark Tutorial
Bucketize numeric data
- [Instructor] Now let's take a look at how we can organize continuous ranges of data into buckets, or partitions. First, I'll verify my working directory and start pyspark. I'll use Ctrl+L to clear the screen. Next, I'll import the code we need: from pyspark.ml.feature, I want to import the transformation called Bucketizer. Bucketizer lets us group data based on boundaries, so I need to provide a list of boundaries for it to work with. I'll call those boundaries splits and provide them as a list. At the lower end, I'd like anything starting at negative infinity to go in the first bucket. To specify negative infinity, I use the syntax -float("inf"). From negative infinity up to -10 will be one bucket, from -10 to zero will be another bucket, from zero to 10 will be my next bucket, and everything that's greater than 10 and up to positive infinity…
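The splits described so far can be sketched roughly as follows. This is a minimal illustration, not the instructor's exact session: the sample values, the column names "features" and "bucket", and the helper names expected_bucket and bucketize_with_spark are assumptions made for the example. The pure-Python helper mirrors Bucketizer's documented rule that each bucket covers [lower, upper), except the last, which also includes its upper bound. The Spark part is wrapped in a function so the import of pyspark happens only when it is called.

```python
from bisect import bisect_right

# Boundaries from the walkthrough: four buckets between five split points.
splits = [-float("inf"), -10.0, 0.0, 10.0, float("inf")]


def expected_bucket(value, boundaries=splits):
    """Pure-Python mirror of the bucket rule (hypothetical helper).

    Each bucket covers [lower, upper); the last bucket also includes
    its upper bound, so the index is capped at the final bucket.
    """
    index = bisect_right(boundaries, value) - 1
    return min(index, len(boundaries) - 2)


def bucketize_with_spark(values, boundaries=splits):
    """Apply Bucketizer to a list of floats (hypothetical helper).

    Assumes pyspark is installed; the column names are illustrative.
    """
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Bucketizer

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("bucketizer-sketch")
        .getOrCreate()
    )
    # One-column DataFrame of doubles to be bucketed.
    df = spark.createDataFrame([(float(v),) for v in values], ["features"])
    bucketizer = Bucketizer(
        splits=boundaries, inputCol="features", outputCol="bucket"
    )
    return bucketizer.transform(df)


# Made-up sample values that land in each of the four buckets.
sample = [-25.0, -10.0, -0.5, 0.0, 10.0, 99.0]
buckets = [expected_bucket(v) for v in sample]
```

With pyspark available, calling bucketize_with_spark(sample).show() should display each value next to its bucket index, which Bucketizer writes as a double in the output column.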