From the course: Spark for Machine Learning & AI


Tokenize text data

- [Instructor] Now let's shift our focus to working with text data. I'll start a new instance of PySpark, and I'll clear the screen with Control+L, and the first thing I'll do is import some code we need. I'm going to import from the pyspark.ml.feature package again, and I'm going to import something called Tokenizer. So from pyspark.ml.feature, I'm going to import Tokenizer, and I'll clear the screen so I have a fresh screen to work with while I define a new data frame. So now I'll create a data frame for sentences, and I'll call it sentences_df, and I'll reference the Spark session and call createDataFrame. And in this data frame we'll have three rows, and in the first row we'll have the sentence, "this is an introduction to Spark MLlib". The second row will contain the sentence, "MLlib includes libraries for classification and regression". And then our last sentence is, "it also contains supporting tools for pipelines". And that's our data, and we now want to specify…
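For reference, here is a minimal sketch of the steps described so far. It assumes the pyspark shell, where a SparkSession named spark is already available; the column names "id" and "sentence", and the Tokenizer configuration at the end, are illustrative assumptions, since the transcript cuts off before that part of the demonstration.

```python
from pyspark.ml.feature import Tokenizer

# Three-row data frame of sentences, matching the text quoted in the lesson.
# Column names "id" and "sentence" are assumptions, not from the transcript.
sentences_df = spark.createDataFrame(
    [
        (1, "this is an introduction to Spark MLlib"),
        (2, "MLlib includes libraries for classification and regression"),
        (3, "it also contains supporting tools for pipelines"),
    ],
    ["id", "sentence"],
)

# Tokenizer splits each sentence on whitespace into a list of lowercase words.
# This step is an assumption about where the lesson is heading next.
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenized_df = tokenizer.transform(sentences_df)
tokenized_df.select("sentence", "words").show(truncate=False)
```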
