How to become a Data Scientist

Data Scientist: "A person who is better at statistics than any software engineer and better at software engineering than any statistician."

I work as an Associate Big Data consultant, where I learn data engineering. I am studying for an MSc in Data Science and have a BSc in Mathematics and Statistics from a Russell Group university. I am also a professional fellow of the Royal Statistical Society. I am a unicorn, and this article is a summary of my journey towards Data Science.

Mathematics needed for Data Science:

I started by building a solid background in mathematics and statistics. During my bachelor's degree, more than 70% of my modules were in advanced statistics, including Time Series Forecasting, LASSO, Ridge, elastic net, ANOVA, computational statistics, cross-validation and much more. I was the only student in my year writing an Advanced Statistics project. At the end of my degree, I claimed my status as a professional statistician and registered as a fellow of the Royal Statistical Society.

This is not an easy route to take. A mathematics degree is not easy anyway, but what I found was that many students chose easy modules over important ones. Anyone who studies or has studied mathematics knows that some modules are more demanding than others. For example, only 20 students in my year took the Time Series module, and preparing for the exam was extremely hard. But who said that Data Science is not hard? Who said that you don't need mathematics or statistics to do Data Science? Definitely not any of the best data scientists on the market. So I would suggest that mathematics students take as many statistics and probability modules as possible, but also modules in linear algebra, Markov chains, calculus (I - III), differential equations, set theory, networks, geometry, cryptography, combinatorics and a first course in algebra.

Linear algebra is also incredibly important if you want to understand ML algorithms. It starts with linear regression, and you will find more of it in Neural Networks, Deep Learning and other advanced Machine Learning algorithms. I think that the more complex the model, the more linear algebra is needed.
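To make the link between linear regression and linear algebra concrete, here is a minimal sketch, assuming a regression with a single predictor and made-up data. It solves the normal equations (X^T X) b = X^T y by hand for the 2x2 case. The function name fit_line and the numbers are illustrative only and are not taken from any particular course or library.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x via the normal equations."""
    n = len(xs)
    # Entries of X^T X and X^T y for the design matrix X with columns [1, x].
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Solve the 2x2 linear system with Cramer's rule.
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("x values must not all be identical")
    intercept = (sy * sxx - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope


# Made-up data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

With more predictors the same idea applies, but the system grows beyond 2x2 and is usually solved with a matrix library rather than by hand.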

I think that some statistics modules are useful only in certain areas of statistics. Design of Experiments, for example, is important in medical statistics, but I have never actually used randomised designs, factorial designs or anything like that in my workplace. I suppose that when many people say "design of experiment", what they really mean is hypothesis testing and the basic tools of statistical modelling, such as the t-test, z-test, ANOVA and so on. These are the basics of every good statistics course.
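As an illustration of the kind of basic hypothesis test mentioned above, here is a small sketch that computes Welch's two-sample t statistic using only the Python standard library. The samples and the function name welch_t are made up for the example. Turning the statistic into a p-value needs the t distribution, which in practice you would normally get from a statistics package (for example scipy.stats.ttest_ind with equal_var=False).

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference between two sample means."""
    var_a = variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = variance(sample_b)
    # Standard error of the difference in means, without assuming equal variances.
    standard_error = sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Made-up measurements for two groups.
group_a = [5.1, 4.9, 5.6, 5.8, 5.2]
group_b = [4.4, 4.7, 4.1, 4.9, 4.5]
t_stat = welch_t(group_a, group_b)
print(round(t_stat, 2))
```

A large absolute value of the statistic suggests that the difference between the group means is unlikely to be due to chance alone.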

Resources:

  1. Forecasting Principles and Practice,
  2. Time Series Analysis and Its Applications,
  3. Stanford University ML course,
  4. Practical Analysis for analysing large, complex data,
  5. Data Mining: Practical Machine Learning Tools and Techniques,
  6. DeepMind,
  7. Andrej Karpathy blog,
  8. A Comprehensive Guide to Data Exploration

Programming and technical skills for Data Science:

After that, I decided to continue with postgraduate study in the evenings while working full time. I am studying for an MSc in Data Science in the computer science department, and at work I learn the technical side, data engineering: creating data pipelines, writing configuration files, analysing large-scale data and more.

So to be a Data Scientist you need to know programming and data engineering. Let me give a quick overview of languages and tools such as R, Python, SAS and Splunk. On my degree I learn Python and R, and in my workplace I use Splunk. In addition, I know a little SAS from my previous experience.

  • R is excellent statistical software designed by statisticians for statisticians. From the history of R, I like the statement: "You know first was S and then data got bigger and there was a need to make S stronger so there was R. And the name you know? If the R is before S we assume that R is a little bit better than S." I think that summarises everything. R lets you do amazing statistical analysis and has hundreds of different libraries, but I prefer Python for large-scale data.
  • Python was designed by computer scientists, so the aim was not to concentrate only on statistical analysis but on everything: web development, software development, whatever you can imagine, there is a package for it. I like Python for large-scale data and I prefer to run ML algorithms in it rather than in R.
  • Splunk is mainly used by FTSE 100 companies for Security and AI Ops. In my opinion, it is one of the best tools currently on the market for these two sectors. It is designed to deal with extremely large and complex systems, and you can run most ML algorithms on it. There are hundreds of different apps, including MLTK and sklearn as Python extensions, and there is R analytics for R.
  • SAS is again mainly used by FTSE 100 companies, and it is a very good tool for financial modelling. It is easy to learn and easy to debug, and job prospects are huge once you know it, but it is an expensive tool and its graphics are not the best. My advice to students: do a SAS course and try to get the certificate at the end of it. FTSE 100 companies like to have paper for everything.

R resources:

  1. Data Science with R,
  2. An introduction to data cleaning with R,
  3. Introduction to Exploratory Data Analysis with R,
  4. awesome R,
  5. r-statistics,
  6. Regression Models for Data Science in R,
  7. Great R packages for data import, wrangling and visualization,
  8. R-markdown for mathematics

Python resources:

  1. learn Python,
  2. Beginning Python programming,
  3. Matplotlib tutorial,
  4. Step by step approach to perform data analysis using Python,
  5. visualise your Python code

Splunk and SAS tutorials:

  1. Free Splunk Fundamentals 1 tutorial,
  2. Free SAS e-learning

Data visualisation skills:

You can do data visualisation in any of the above-mentioned languages and tools, but programs such as Tableau or Power BI are specially designed for easy data visualisation, dashboards, pivot tables and reporting. Be aware that without an external platform (R or Python, plus a connection through Flask if you want both at the same time) you won't be able to implement ML algorithms in Tableau or Power BI. That is not the main role of these programs. Splunk, mentioned before, can be used as a reporting tool without knowing SPL (Splunk's Search Processing Language); however, its main function is as an advanced programming tool.

Data visualisation resources:

  1. UW Interactive Data Lab,
  2. flowingdata

It is good to have exposure to Hadoop, Cassandra, Apache or other tools for large-scale data. I want to write a dissertation on ML for large-scale systems and I am fully aware that I won't be able to do that without learning these tools. I am lucky that on my master's degree I will take a module that extends my knowledge of them, and hopefully I will also get exposure in my workplace. Another thing you need is Cloud, but try to concentrate on the bits that you need. For example, in AWS there is a track for Big Data. Should I also mention DevOps? Yes! I will, especially with the rise of DataOps:

  1. Creating a Data-Driven Enterprise with DataOps

So what else can you do to get more experience?

  • Practise whenever you can. You can take freely available datasets and try to solve a problem of your own:
  1. AggData,
  2. LondonData,
  3. Reddit,
  4. Google,
  5. Kaggle,
  6. Datahub,
  7. re3data,
  8. ComplexNetwork,
  9. UCI,
  10. 100+ Interesting Data Sets for Statistics
  • Publish your own articles, create your own blog or contribute to others. I had published just one article on a data science blog in Germany when a lady from Copenhagen contacted me:
  1. Blog
  • Take part in hackathons:
  1. Hackevents,
  2. search on eventbrite
  • If you need a good training plan, someone has already made one for you:
  1. The most comprehensive Data Science learning plan for 2017
