Data Science Projects - Through the lens of a Data Scientist
In 2012, an article in the Harvard Business Review named the role of data scientist the sexiest job of the 21st century. Data scientists are getting a lot of attention, and as a result, books about data science are proliferating. But, while surveying new data science literature, it becomes clear that most authors would rather explain how to use all the latest tools and technologies than discuss the nuanced problem-solving nature of the data science process. Armed with several books and the latest knowledge of algorithms and data stores, many aspiring data scientists are still asking the question: Where do I start? We break down this daunting task into three problems and explore them one by one.
Let's roll!
1. Getting Started
The process of data science begins with preparation. You need to establish what you know, what you have, what you can get, where you are, and where you would like to be. This last one is of utmost importance; a project in data science needs to have a purpose and corresponding goals. Only when you have well-defined goals that are possible, valuable and efficient can you begin to survey the available resources and all the possibilities for moving toward those goals.
Listening to a customer
Every project in data science has a customer. Sometimes the customer is someone who pays you or your business to do the project - for example, a client or contracting agency. You and the customer share a mutual interest in completing the project successfully, but the two of you likely have different specific motivations, different skills, and, most important, different perspectives. In this way, a data science project begins by finding agreement between two personalities, two perspectives, that, if not outright conflicting, are at the very least disparate. Be aware that the customer may know little about data science!
Asking good questions and eliminating impediments
You need to make sure the project focuses on answering good questions. Asking specific, well-defined, and testable questions leads to informative answers and, subsequently, improved results. Moreover, you should try to anticipate obstacles to getting everything you want - for example, data that turns out to be incomplete, hard to obtain, or restricted in how it may be used.
It's better to be skeptical at the very beginning so that the uncertainty of the task doesn't cost you nearly as much later. Take time to think through all possible paths to answering the good questions. Your goal is now to answer these questions with rigorous analysis of data.
Data deluge
In the twenty-first century, data is being collected at unprecedented rates, and in many cases it’s not being collected for a specific purpose. Whether private, public, for free, for sale, structured, unstructured, big, normal size, social, scientific, passive, active, or any other type, data sets are accumulating everywhere. There’s so much data that no one can possibly understand it all, so we treat it as a world unto itself, worthy of exploration.
You should remain keenly aware that most data sets not owned by you or your organization come with restrictions on use. Without confirming that your use case is legal, you risk losing access to the data or, even worse, facing a lawsuit.
Let's say you've found data and confirmed that you're allowed to use it for your project. Should you keep looking for more data, or should you attack the data you have immediately? A good way to decide is to run through a few specific examples of your intended analyses and see whether additional data would make a significant difference. If the data you need doesn't exist, it's worth asking whether it can exist at all. If it does exist somewhere - for example, spread across web pages - but not in a usable form, you can turn to data extraction and transformation strategies such as web scraping.
Web scraping usually entails writing code that can fetch and read web pages, interpret the HTML, and extract the specific pieces of the page that are of interest.
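As a minimal sketch of that idea in Python, assuming the requests and BeautifulSoup libraries are installed; the URL and the CSS class used here are hypothetical placeholders for whatever page and elements actually interest you.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical example: fetch a page and pull out the table cells of interest.
# The URL and the "price" class are placeholders, not a real data source.
url = "https://example.com/listings"
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail loudly if the page couldn't be fetched

soup = BeautifulSoup(response.text, "html.parser")

# Scrape out only the pieces of the HTML we care about.
prices = [cell.get_text(strip=True) for cell in soup.find_all("td", class_="price")]
print(prices)
```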
Data wrangling
Data wrangling is the process of taking data and information in difficult, unstructured, or otherwise arbitrary formats and converting it into something that conventional software can use. Like many aspects of data science, it's not so much a process as it is a collection of strategies and techniques that can be applied within the context of an overall project strategy. Wrangling isn't a task with steps that can be prescribed exactly beforehand; every case is different and takes some problem-solving to get good results. Wrangling can be done with a script that exploits some pattern or structure in a data file, by simply copying the relevant text by hand, or with command-line tools. In any case, it helps to think like a computer program in order to write a good data-wrangling script.
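For instance, a small wrangling script might pick the usable records out of a loosely structured text export and write them to a clean CSV. This is only a sketch: the file name messy_export.txt and the "name: age, income" line pattern are assumptions for illustration.

```python
import csv
import re

# Hypothetical input: useful lines look like "Alice Smith: 34, 72000";
# everything else in the file is noise to be skipped.
pattern = re.compile(r"^(?P<name>[A-Za-z ]+):\s*(?P<age>\d+),\s*(?P<income>\d+)\s*$")

rows = []
with open("messy_export.txt", encoding="utf-8", errors="replace") as f:
    for line in f:
        match = pattern.match(line.strip())
        if match:  # keep only lines that fit the pattern we recognize
            rows.append(match.groupdict())

# Write the wrangled records to a clean CSV that conventional tools can use.
with open("clean.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.DictWriter(out, fieldnames=["name", "age", "income"])
    writer.writeheader()
    writer.writerows(rows)
```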
Five common obstacles to data wrangling are corrupted data (some aspect of the file is missing or has been obfuscated); poorly designed databases (values or keys that don't match each other, or incongruences in scope, depth, APIs, or schemas); Windows/Mac/Linux problems (text parsing, line endings, and other special characters); escape characters; and outliers.
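As one concrete illustration of the operating-system and escape-character issues, a script can normalize line endings and tolerate invalid bytes before any real parsing begins; the file name raw_input.txt is hypothetical.

```python
# Normalize a file that may mix Windows (\r\n), old Mac (\r), and Unix (\n)
# line endings and may contain bytes that aren't valid UTF-8.
with open("raw_input.txt", "rb") as f:
    raw = f.read()

text = raw.decode("utf-8", errors="replace")           # keep going past bad bytes
text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify line endings

lines = [line for line in text.split("\n") if line.strip()]
print(f"{len(lines)} non-empty lines after normalization")
```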
If you can find a proprietary product that converts the data you have into the data you want, it's well worth considering. Spending a little money on such tools can pay off if it gets your project done much earlier, but the industry is young and changing too fast for any survey of these tools to stay meaningful.
Data assessment
Without a preliminary assessment, you may run into problems with outliers, biases, precision, specificity, or any number of other inherent aspects of the data. In order to uncover these and get to know the data better, the first step of post-wrangling data analysis is to calculate some descriptive statistics.
Descriptive statistics concerns itself with only the data you have. Examples include summaries of a data set and its maximum, minimum, and average values, among others.
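A minimal sketch of such a first-pass assessment, using pandas on a small made-up data set (the column names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical data set; in practice this would come from your wrangled file.
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 52, 35, 61],
    "income": [41000, 62000, 75000, 48000, 91000, 58000, 30000],
})

# describe() reports count, mean, std, min, max, and quartiles per column -
# exactly the kind of first-pass summary that exposes outliers and oddities.
print(df.describe())
print("missing values per column:\n", df.isna().sum())
```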
2. Building a product
The main objective of any data science project is to produce something that helps solve problems and achieve goals. This might take the form of a software product, a report, or a set of insights or answers to important questions. The key tool sets for producing any of these are software and statistics.
Statistics and its role
Statistical methods are often considered nearly one half, or at least one third, of the skills and knowledge needed for doing good data science. The other large piece is software development and/or application, and the remaining, smaller piece is subject matter or domain expertise. Statistics is the slice of data science that provides the insights. All of the software development and database administration that data scientists do contributes to their ability to do statistics. Web development and user interface design - two other tasks that might be asked of a data scientist - help deliver statistical analysis to the customer.
A statistical model is a description of a set of quantities or variables that are involved in a system and also a description of the mathematical relationships between those quantities. Beyond linear and exponential equations, all sorts of function types are used in statistical modeling: polynomial, piecewise polynomial (spline), differential equations, nonlinear equations of various types, and many others. The purpose of statistical modeling is to draw meaningful conclusions about the system you're studying based on a model of that system and some data. In order to draw meaningful conclusions about a system via statistical modeling, the model has to be good, the data has to be good, and the relationship between them also has to be good.
Fitting a model
Fitting a model to a data set is the process of taking the model you've designed and finding the parameter values that describe the data best. Model fitting is optimization: among all possible combinations of parameter values, you find the ones that give the best value of a goodness-of-fit function. Goodness of fit can be defined in many ways. If your model is intended to be predictive, its predictions should be close to the eventual outcomes, so you could define a closeness-of-prediction function. If the model is supposed to represent a population, such as a model of human height, you might want random samples from the model to look similar to the population you're modeling. One of the most common goodness-of-fit functions across a wide range of applications is the likelihood.
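As an illustrative sketch of likelihood-based fitting, the snippet below estimates the mean and standard deviation of a normal model by minimizing the negative log-likelihood with SciPy; the data is simulated, standing in for real measurements such as human heights.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Simulated data standing in for real measurements (e.g., heights in cm).
rng = np.random.default_rng(0)
data = rng.normal(loc=170, scale=8, size=500)

def negative_log_likelihood(params):
    mu, sigma = params
    if sigma <= 0:  # keep the optimizer inside the valid parameter space
        return np.inf
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

# Model fitting as optimization: find the parameters that maximize the likelihood.
result = minimize(negative_log_likelihood, x0=[150.0, 5.0], method="Nelder-Mead")
mu_hat, sigma_hat = result.x
print(f"fitted mean = {mu_hat:.1f}, fitted std dev = {sigma_hat:.1f}")
```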
Hypothesis testing
Often you need to know more about a variable than merely a point estimate or a range of values that probably contains the true value. Sometimes it's important to know whether a variable possesses a certain property or not - for example, whether its mean differs from a reference value, or whether two groups of measurements really differ from each other.
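As a brief sketch, a one-sample t-test from SciPy asks whether a variable's mean differs from a chosen reference value; the sample below is simulated and the reference value of 100 is purely illustrative.

```python
import numpy as np
from scipy import stats

# Simulated measurements; the question is whether their true mean differs from 100.
rng = np.random.default_rng(1)
sample = rng.normal(loc=103, scale=10, size=40)

t_stat, p_value = stats.ttest_1samp(sample, popmean=100)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Reject the hypothesis that the mean is 100 at the 5% level.")
else:
    print("The data is consistent with a mean of 100.")
```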
Machine Learning
Machine learning refers to a class of somewhat abstract algorithms that can draw conclusions from data. These are complex statistical methods that are often treated as black boxes: data goes in, answers come out. Each specific machine learning algorithm is different, and you have to do a lot of work to confirm that you didn't make any mistakes, that you didn't overfit, that the data was properly separated into training and test sets, and that your predictions, classifications, or other conclusions remain valid when brand-new data comes in.
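A minimal sketch of that checking workflow with scikit-learn, using synthetic data: split the data into training and test sets, fit on the training portion only, and compare performance on data the model has never seen.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real labeled data set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out a test set so we can check for overfitting on unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)

print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, model.predict(X_test)))
# A large gap between the two numbers is a warning sign of overfitting.
```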
Software
If statistics is the framework for analyzing and drawing conclusions from data, then software is the tool that puts this framework into action. For anyone who has spent significant time with Microsoft Excel or another spreadsheet application, a spreadsheet is often the first choice for performing any sort of data analysis. If you want to use a statistical technique that you can't find in Excel's menus, you may want to level up to SPSS, Stata, SAS, or Minitab. Each of these proprietary tools has its own language for performing statistical analysis. With a programming language you can do more than you can by pointing and clicking, and you can also save the code for future use, so you know exactly what you've done and can repeat or modify it as necessary. A file containing a sequence of commands that can be run is generally called a script and is a common concept in programming.
Programming languages such as R, Python, or GNU Octave (an open-source clone of MATLAB) are far more versatile than statistical applications. Code in any popular language has the potential to do almost anything: these languages can execute any number of instructions on any machine, can interact with other software services via APIs, and can be included in scripts and other pieces of software. A language that's tied to its parent application is severely limited in these respects.
Some technologies don't fall under the category of statistical software, but they're useful in making statistical software faster, more scalable, and more efficient. Well-configured databases, high-performance computing, cloud services, and big data technologies all have their place in the industry of analytical software, and each has its own advantages and disadvantages. When deciding whether to adopt any of these auxiliary technologies, it's usually best to ask: are there any gross inefficiencies or limitations in my current software stack? It takes time and effort to migrate to a new technology, but it can be worth it if you have a compelling reason.
3. Finishing off the product and wrapping up
Once a product is built, as in part 2, you still have a few things left to do to make the project more successful and to make your future life easier. The earlier parts focused more on what I might call raw results - results that are good in a statistical sense but may not be polished enough for presenting to the customer. Part 3 first looks at the advantages of refining and curating the form and content of the product with the express purpose of concisely conveying to the customer the results that most effectively solve problems and achieve the goals of the project.
Product Delivery
In order to create an effective product that you can deliver to the customer, you must first understand the customer's perspective and what they expect and intend to do with the results you'll deliver. Talking to multiple people is also a good idea, particularly if your audience is composed of individuals with varying experience, knowledge, and interest in the results. Second, you need to choose the best medium for the project and for the customer: a report, a white paper, an analytical tool such as a spreadsheet, an interactive graphical application, or a web-based API, among many others. Finally, the content of the product should focus on important, conclusive results and not distract the customer with inconclusive results or other trivia. It's best to spend some time thinking formally about user experience (UX) design in order to make the product as effective as possible. Making good choices throughout product creation and delivery can greatly improve the project's chances of success.
Customers not using the product in the intended ways is problematic because it can lead to false or misleading results or no results at all. Misleading results are probably worse, because the customer might gain false confidence and act on those results, potentially leading to poor business decisions. You talked to the customer at the time of product delivery, and presumably you provided instructions for use, but, like the product itself, instructions may be used incorrectly, partially, or not at all. At least one follow-up is usually in order.
Product revisions should be designed and engineered with the same level of care (or more) as when you designed and built the product itself. Getting feedback is helpful, but it shouldn't be taken at face value. Moreover, not every problem needs fixing.
Moving on
Good user and developer documentation is crucial to being able to navigate project materials and to find definitive answers about details of the project. To avoid losing significant knowledge, it's best to capture whatever you can in documentation at the end of a project. Not only might it save you time and effort, but it may also ensure that customers and others don't lose faith in your work if they come back to you with questions later or intend to continue doing business with you in the future. It's also best practice to collect all project results, reports, and other non-code materials - including the raw data, if it's not too big - and place them in a shared, reliable storage location.
Whether there's a specific lesson you can apply to future projects or a general lesson that contributes to your awareness of possible, unexpected outcomes, thinking through the project during a postmortem review can help uncover useful knowledge that will enable you to do things differently - and hopefully better - next time. The more data science you do, the more experience you have.
This post is based on the book Think Like a Data Scientist. I also recommend the complementary book by Manning Publications.