Spark SQL – Learning about basic string functions
Problem
A big data engineer can transform data stored in files using either Spark dataframe methods or Spark SQL functions. I chose the Spark SQL syntax because SQL is already familiar to many data professionals. Most languages have at least three core data types: strings, numbers, and date/time. How do we manipulate strings using Spark SQL?
Solution
In the last tip, I reviewed the syntax for numeric Spark SQL functions. This tip focuses on the available string functions. Spark SQL has so many string functions that I have split them into two categories: basic and encoding. Today, we will discuss the ones I consider basic, which are the functions found in most databases and programming languages.
Business Problem
Our manager has asked us to explore the syntax of the string functions available in Azure Databricks. I will execute Spark SQL using the %sql magic command in a Python notebook. That way, we can see the output for a given input. After testing, I usually turn the Spark SQL statement into a string variable that can be executed by the spark.sql method.
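The workflow above, moving a tested statement from a magic-command cell into a Python string, can be sketched as follows. This is a minimal illustration and not code from the original notebook: the names sql_stmt and run_statement are hypothetical, and the sample query simply uses two common Spark SQL string functions, upper and length. It assumes the spark session object that Databricks predefines in every notebook.

```python
# Spark SQL statement kept as a plain Python string (hypothetical example query).
# While exploring, the same SELECT could be run in its own cell under the %sql magic command.
sql_stmt = """
    SELECT
        upper('three blind mice') AS upper_title,
        length('Star Wars')       AS title_length
"""


def run_statement(spark, stmt):
    """Execute a Spark SQL statement on the given session and return the DataFrame."""
    return spark.sql(stmt)


# In an Azure Databricks notebook the `spark` session already exists, so usage is:
# df = run_statement(spark, sql_stmt)
# df.show()
```

Keeping the statement in a variable makes it easy to reuse the tested SQL inside a larger Python program, for example in a scheduled job.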
During our exploration, we will use sample text drawn from some written and digital content: Three Blind Mice, The Three Musketeers, and Star Wars. The first two appeared in books a long time ago. I have seen a couple of variations of the “Musketeers” at the movies during my lifetime. I fondly remember seeing Star Wars at my local theater in 1977. By the end of this article, the big data engineer will have a good overview of the basic string functions.
Please see the MSSQLTips article for details.