7 Python Essentials For Data Science

New to Python?

Here's a quick list of the most important things to know about the Python ecosystem before you get started:

1. Distribution

Python is an interpreted language - you will need an interpreter program to run Python scripts.

Most operating systems come packaged with a Python interpreter, but I find the Anaconda distribution the easiest and most convenient to use.

One thing to keep in mind is that there are mainly two versions of Python available for download: 2.x and 3.x. 3.x is relatively new, so maybe 1-in-100 scripts you download from the internet will need to be tweaked to work with a 3.x interpreter, but it is generally very stable.

2. Shell

Python distributions come packaged with an interactive command-line shell, where you can type Python statements and see the output, one command at a time.

However, there is an awesome piece of software called Jupyter Notebook, which takes the relatively boring, less interactive command-line interface, spices it up with web-grade responsiveness, and brings it to your favourite browser. The Jupyter Notebook program runs as a small server from your OS command line, giving you access to a cool interactive web page. Definitely the way to go!

PS: It's already packaged with the Anaconda distribution, so that's one less thing to worry about when getting started.

3. Libraries

Python comes with a whole host of libraries and packages pre-installed along with the interpreter (and the Anaconda distribution contains even more of the useful libraries mentioned below by default).

There are several ways to download and install new libraries onto your machine, but probably the easiest and least painful is the pip program that comes with most Python distributions. To install a new library, all you have to do is type pip install <name-of-library> at the OS command line.

Some of the most important libraries you will need for Data Analysis are:

a) pandas

pandas is an advanced library that makes working with two-dimensional data (rows and columns, like in a CSV) as easy as working with Excel. It has a bit of a learning curve, but it's definitely worth it! ($ pip install pandas)
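To give a flavour of the Excel-style work pandas makes easy, here's a minimal sketch (the region/sales figures are invented purely for illustration):

```python
import pandas as pd

# A tiny table of made-up sales data, defined inline instead of loading a CSV
df = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "sales":  [100, 80, 120, 90],
})

# Excel-style operations in one line each
total = df["sales"].sum()                        # grand total
by_region = df.groupby("region")["sales"].sum()  # pivot-style aggregation

print(total)               # 390
print(by_region["North"])  # 220
```

With a real CSV, you would simply swap the inline DataFrame for `pd.read_csv("yourfile.csv")` and the rest stays the same.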

b) scikit-learn, numpy ...

NumPy is the foundation that most of pandas is built on. SciPy and scikit-learn (sklearn) are libraries containing statistical and machine-learning algorithms like Decision Trees.
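As a rough sketch of how these fit together, here's a NumPy array and a toy scikit-learn Decision Tree (the feature values and labels are made up, just to show the fit/predict pattern):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# pandas columns are NumPy arrays underneath; NumPy gives you fast maths on them
heights = np.array([150, 160, 170, 180])
print(heights.mean())  # 165.0

# A toy Decision Tree: learn a label from a single feature on invented data
X = np.array([[1], [2], [8], [9]])   # feature values
y = np.array([0, 0, 1, 1])           # class labels
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[1.5], [8.5]]))   # [0 1]
```

The same fit/predict pattern carries across almost every algorithm in scikit-learn, which is a big part of why it's so popular.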

4. Visualisation

Although quite a bit of charting and plotting can be achieved with pandas and matplotlib (another legendary Python library, which forms the basis of most visuals in the Python ecosystem), if you want to take your charting to the next level of sophistication and enterprise-grade polish, you may want to look into Bokeh.

Bokeh is only a few years old but is very stable and well established; plus it's from the same people as Anaconda, so it comes pre-packaged.

Things can easily get complex in the Bokeh world, but to start off with, stick to bokeh.charts, which provides most of the awesome charts out of the box with minimal coding (one or two lines). As you get more advanced, you can look into bokeh.plotting and bokeh.models to draw pretty much whatever you want in a custom visualisation.

BTW Bokeh works super well with Pandas and Jupyter Notebook!
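On the pandas-plus-matplotlib side mentioned above, a one-liner gets you a chart, because pandas calls matplotlib under the hood. A minimal sketch (the monthly figures are invented; the Agg backend just renders off-screen so no display is needed):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display window needed
import matplotlib.pyplot as plt
import pandas as pd

# Invented monthly figures, just to show the pandas -> matplotlib handoff
df = pd.DataFrame({"month": [1, 2, 3, 4], "visits": [10, 14, 9, 20]})

ax = df.plot(x="month", y="visits", kind="line")  # pandas builds the matplotlib chart
ax.set_title("Visits by month")
plt.savefig("visits.png")  # write the chart to a file
```

In a Jupyter Notebook you can skip the backend and savefig lines entirely; the chart renders inline in the page.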

5. IDE

Don't worry about this when you are starting off with Python; the interactive shell provided by the Jupyter (IPython) Notebook is really all you need when doing data analysis.

When you get to the point of writing a whole analytics application (multiple classes, methods, files, etc.), then it might be worth looking into a decent IDE.

There's a bunch of them out there, but my personal favourite is PyCharm. It's free to use unless you go for the professional version, which adds a few extra capabilities for ~£6/month per desktop licence.

PyCharm also comes with a built-in Notebook-style interactive shell, neatly tucked into one corner of the IDE window (instead of having to open a separate browser window).

6. Source Control

If you plan to invest in an IDE, you are probably writing complex workflows and algorithms, and that might be the right time to start thinking about saving your code in decent source control software and backing it up online.

I use Git on my desktop for local version control, and GitHub.com for remote backup and version control.

Also, as an added benefit, GitHub.com and Jupyter Notebook have pretty seamless integration, as do PyCharm and GitHub.

7. Web framework

Python has a bunch of web frameworks that can be easily integrated into your web app. My favourite is Django, an MVC-based framework that's easy to understand and use.

In day-to-day, BAU data analytics and data science projects you will not need a web framework, but it's good to know it's there.

Big Data & Python:

There's lots to cover, perhaps in another post, on Python as one of the most established channels for Big Data management, interrogation, and manipulation across various platforms (MapR, Cloudera, etc.) and technologies (Spark, Hive, etc.).

Hope this helps!

