7 Python Essentials For Data Science
New to Python?
Here's a quick list of the most important things to know about the Python ecosystem to get started:
1. Distribution
Python is an interpreted language - you will need an interpreter program to run Python scripts.
Most operating systems come with a Python interpreter pre-installed, but I find the Anaconda distribution the easiest and most convenient to use.
One thing to keep in mind is that there are two main versions of Python available for download: 2.x and 3.x. 3.x is relatively new, so maybe one in a hundred scripts you download from the internet will need tweaking to run on a 3.x interpreter, but it is generally very stable.
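If you're not sure which version your machine is running, you can check from inside the interpreter itself - a minimal sketch using only the standard library:

```python
import sys

# sys.version is a human-readable string; sys.version_info is a named tuple
print(sys.version)

if sys.version_info.major >= 3:
    print("Running Python 3.x")
else:
    print("Running legacy Python 2.x")
```

You can also run `python --version` straight from the OS command line.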
2. Shell
Python distributions come packaged with an interactive command-line shell, where you can type Python statements and see the output, one command at a time.
However, there is an awesome piece of software called Jupyter Notebook, which takes the relatively plain, less-interactive command-line interface, spices it up with web-grade responsiveness and brings it to your favourite browser. Jupyter Notebook runs as a small server program from your OS command line, giving you a cool interactive web page in the browser. Definitely the way to go!
PS: It's already packaged with the Anaconda distribution, so that's one less thing to worry about when getting started.
3. Libraries
Python comes with a whole host of libraries and packages pre-installed along with the interpreter (and the Anaconda distribution includes even more of the useful libraries mentioned below by default).
There are several ways to download and install new libraries onto your machine, but probably the easiest and least painful is the pip program that ships with most Python distributions. To install a new library, all you have to do is type pip install <name-of-library> on the OS command line.
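For example, installing the libraries mentioned in this post from the OS command line (library names as published on PyPI; Anaconda users will already have most of these):

```shell
# Install the core data-analysis stack with pip
pip install pandas
pip install numpy scipy scikit-learn
pip install bokeh

# Jupyter Notebook can be installed and launched the same way
pip install notebook
jupyter notebook   # starts the local server and opens the browser UI
```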
Some of the most important libraries you will need for data analysis are:
a) pandas
pandas is an advanced library that makes working with two-dimensional data (rows and columns, as in a CSV) as easy as working in Excel. It has a bit of a learning curve, but it's definitely worth it! ($ pip install pandas)
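As a taste of what that looks like, here's a minimal sketch - the CSV content is made up for illustration and held in memory so the example is self-contained:

```python
import pandas as pd
from io import StringIO

# A small made-up CSV held in memory instead of on disk
csv_data = StringIO("""name,age,city
Alice,34,London
Bob,28,Leeds
Carol,45,London""")

df = pd.read_csv(csv_data)

print(df.head())                  # peek at the first rows, like glancing at a spreadsheet
print(df["age"].mean())           # column statistics in one line
print(df.groupby("city").size())  # pivot-table-style aggregation by city
```

In a real project you'd point `pd.read_csv` at a file path instead of a StringIO buffer.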
b) scikit-learn, numpy ...
NumPy is the foundation on which most of pandas is built. SciPy and scikit-learn (imported as sklearn) are libraries containing statistical and machine-learning algorithms such as Decision Trees.
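To give a flavour of how these fit together, here's a minimal sketch training a Decision Tree on a tiny made-up dataset (the labels follow a simple AND rule):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy made-up data: the label is 1 only when both features are 1 (an AND rule)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

clf = DecisionTreeClassifier()
clf.fit(X, y)

# A fitted tree separates this tiny training set perfectly
print(clf.predict([[1, 1]]))  # -> [1]
print(clf.predict([[0, 1]]))  # -> [0]
```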
4. Visualisation
Quite a bit of charting and plotting can be done within pandas and matplotlib (another legendary Python library, which forms the basis of most visuals across Python libraries), but if you want to take your charting to the next level of sophistication and polish, you may want to look into Bokeh.
Bokeh is only a few years old but is very stable and well established - plus it's from the same people as Anaconda, so it comes pre-packaged with the distribution.
Things can easily get complex in the Bokeh world, so to start off with, stick to bokeh.charts, which provides most of the awesome charts out of the box with minimal coding (one or two lines). As you get more advanced, you can look into bokeh.plotting and bokeh.models to draw pretty much whatever you want in a custom visualisation.
BTW Bokeh works super well with Pandas and Jupyter Notebook!
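Here's a minimal bokeh.plotting sketch - the data points and the output filename are made up; save writes a standalone HTML page you can open in any browser:

```python
from bokeh.plotting import figure, output_file, save

# Where to write the standalone HTML page (hypothetical filename)
output_file("first_chart.html")

# Create a figure and draw a simple line from made-up data
p = figure(title="A first Bokeh chart", x_axis_label="x", y_axis_label="y")
p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], line_width=2)

save(p)  # use show(p) instead to open the chart in the browser straight away
```

Inside a Jupyter Notebook, call bokeh.io.output_notebook() once and the charts render inline.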
5. IDE
Don't worry about this when you are starting off with Python - the interactive shell provided by the Jupyter (IPython) notebook is really all you need when doing data analysis.
When you get to the point of writing a whole analytics application (multiple classes, methods, files and so on), then it might be worth looking into a decent IDE.
There's a bunch of them out there, but my personal favourite is PyCharm. It's free to use, unless you go for the Professional edition, which adds a few extra capabilities for roughly £6 per month per desktop licence.
PyCharm also comes with a built-in Notebook-style interactive shell, neatly tucked into one corner of the IDE window (instead of having to open a separate browser window).
6. Source Control
If you plan to invest in an IDE, you are probably writing complex workflows and algorithms, and that might be the right time to start thinking about saving your code in decent source control software and backing it up online.
I use Git on my desktop for local version control, and GitHub.com for remote backup and version control.
Also, as an added benefit, GitHub.com and Jupyter Notebook have pretty seamless integration, and so do PyCharm and GitHub.
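The basic local-plus-remote workflow looks like this - the project folder name and repository URL below are hypothetical placeholders:

```shell
# One-time setup: start tracking an existing project folder with Git
cd my-analysis-project
git init
git add .
git commit -m "Initial commit of analysis notebooks"

# Hook it up to a (hypothetical) GitHub repository you created online first
git remote add origin https://github.com/your-username/my-analysis-project.git
git push -u origin master
```

After that, day-to-day work is just git add, git commit and git push.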
7. Web framework
Python has a bunch of web frameworks that can be easily integrated into your web app. My favourite is Django, an MVC-based framework that is easy to understand and use.
In day-to-day, business-as-usual (BAU) data analytics and data science projects you will not need a web framework, but it's good to know it's there.
Big Data & Python:
There's a lot to cover, perhaps in another post, on Python as one of the most established channels for Big Data management, interrogation and manipulation across various platforms (MapR, Cloudera, etc.) and technologies (Spark, Hive, etc.).
Hope this helps!