This document surveys tools for distributed data analysis, centered on Apache Spark. It is divided into three parts: 1) an introduction to cluster computing concepts such as batch processing and stream processing; 2) the Python data analysis library stack, including NumPy, Matplotlib, Scikit-image, Scikit-learn, Rasterio, Fiona, Pandas, and Jupyter; and 3) the Apache Spark cluster computing framework, with usage examples covering contexts, HDFS, telemetry data, MLlib, streaming, and deployment on AWS.