Data science team structures
Embarking on data science and predictive analytics requires a clear understanding of how the initiative is going to be introduced, maintained, and further scaled in terms of team structure. We recommend considering three basic team structures that match different stages of machine learning adoption.
IT-centric structure
Sometimes, hiring data scientists is not an option, and you must leverage talent that’s already in-house. The main analytics and leadership role would be a “business translator,” usually referred to as a chief analytics officer (CAO) or chief data officer (CDO). The latter term is gradually becoming redundant as most data processes are reshaped towards predictive analytics. This person should be capable of leading the initiative. We’ll take a more detailed look at the position below.
All the rest – data preparation, training models, creating user interfaces, and model deployment within a corporate IT infrastructure – can be largely managed by the IT department (if your organization actually has a fully functioning, in-house IT department). This approach is fairly limited, but it can be realized by using MLaaS solutions. Environments like Azure Machine Learning or Amazon Machine Learning are already equipped with approachable user interfaces to clean datasets, train models, evaluate them, and deploy.
Azure Machine Learning, for instance, supports its users with detailed documentation for a low entry threshold. This allows for fast training and early deployment of models even without an expert data scientist on board.
On the other hand, MLaaS solutions present their limitations in terms of machine learning methods and cost. All operations, from data cleaning to model evaluation, have their separate prices. And considering that the number of iterations to train an effective model can’t be estimated in advance, working with MLaaS platforms entails some budget uncertainty.
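The budget uncertainty can be shown with a little arithmetic. The sketch below is only an illustration: the prices, the `MlaasPrices` class, and the function names are hypothetical and not taken from any real provider. It shows how the cost per prediction shifts when the number of training iterations turns out higher than expected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MlaasPrices:
    """Hypothetical per-operation prices in USD (assumed for illustration)."""
    training_per_iteration: float
    evaluation_per_iteration: float
    per_prediction: float


def total_cost(prices: MlaasPrices, iterations: int, predictions: int) -> float:
    """Cost of building a model over `iterations` rounds plus serving predictions."""
    build = iterations * (prices.training_per_iteration + prices.evaluation_per_iteration)
    return build + predictions * prices.per_prediction


def cost_per_prediction(prices: MlaasPrices, iterations: int, predictions: int) -> float:
    """Total cost spread over the number of predictions served."""
    if predictions <= 0:
        raise ValueError("predictions must be positive")
    return total_cost(prices, iterations, predictions) / predictions


# Assumed example prices; real services publish their own rates.
prices = MlaasPrices(training_per_iteration=4.0,
                     evaluation_per_iteration=1.0,
                     per_prediction=0.0001)

# The iteration count is unknown in advance, so compare a few scenarios.
for iterations in (5, 20, 80):
    print(iterations, cost_per_prediction(prices, iterations, predictions=100_000))
```

With these assumed numbers, the serving cost stays the same in every scenario, while the build cost grows linearly with the iteration count. That is why the final cost per prediction is hard to pin down before the model is actually trained.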
Pros of IT-centric structure:
- Leverage new investments with existing IT resources
- Computing infrastructure is provided and maintained by an external service
- In-house specialists can be trained to further realize predictive analytics potential
- Cross-silo management is reduced as all operations are held within the IT department
- Shorter time-to-market for relatively simple machine learning tasks requiring one or a few models
Cons of IT-centric structure:
- Limited machine learning methods and data cleaning procedures that these services provide
- Model training, testing, and prediction must be paid for. This makes the eventual cost per prediction uncertain, as the number of iterations needed can’t be estimated in advance
Integrated structure
With the integrated structure, a data science team focuses on dataset preparation and model training, while IT specialists take charge of the interfaces and infrastructure supporting deployed models. Combining machine learning expertise with IT resources is the most viable option for constant and scalable machine learning operations.
Unlike the IT-centric approach, the integrated method requires having an experienced data scientist on the team and an elaborate recruitment effort beforehand. This ensures better operational flexibility in terms of available techniques. Besides end-to-end but limited services, you can leverage deeper machine learning tools and libraries, such as TensorFlow or Theano, that are designed for researchers and experts with data science backgrounds. With this effort allocation, you can address highly specific business problems and choose between as-a-service and custom-built ML solutions.
Pros of integrated structure:
- Leveraging existing IT resources and investments
- Data scientists focus on innovation
- Utilizing full potential of both as-a-service and custom ML applications
- Start with one or two data scientists, then train and onboard more homegrown experts
- Using custom model combinations (ensemble models) that yield better or broader predictions
Cons of integrated structure:
- Computing infrastructure is required in case of custom ML use
- Cross-silo management takes considerable effort
- Significant investments into data science talent acquisition
- Data science talent engagement and retention challenges
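The custom model combinations (ensemble models) mentioned in the pros list can be illustrated with a toy example. This is a minimal sketch of one common ensembling technique, majority voting. The three threshold "models" and the `churn`/`stay` labels are hypothetical stand-ins for real trained classifiers.

```python
from collections import Counter
from typing import Callable, Sequence

# A "model" here is any function mapping a numeric score to a label.
Model = Callable[[float], str]


def majority_vote(models: Sequence[Model], x: float) -> str:
    """Return the label predicted by the largest number of models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy classifiers that differ only in their decision threshold.
def low_threshold(x: float) -> str:
    return "churn" if x > 0.3 else "stay"


def mid_threshold(x: float) -> str:
    return "churn" if x > 0.5 else "stay"


def high_threshold(x: float) -> str:
    return "churn" if x > 0.7 else "stay"


ensemble = [low_threshold, mid_threshold, high_threshold]
print(majority_vote(ensemble, 0.6))  # two of the three models vote "churn"
```

An odd number of binary classifiers is used so that a vote can never end in a tie. Real ensembles usually combine models trained with different algorithms or on different data, which is where the custom tooling of the integrated structure comes in.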
Specialized data science department
To reduce management effort and build an all-encompassing machine learning framework, you can run the entire machine learning workflow within an independent data science department. This approach entails the highest cost. All operations, from data cleaning and model training to building front-end interfaces, are realized by a dedicated data science team. It doesn’t necessarily mean that all team members should have a data science background, but they should acquire technology infrastructure and service management skills.
A specialized structure model aids in addressing complex data science tasks that include research, use of multiple ML models tailored to various aspects of decision-making, or multiple ML-backed services. In the case of large organizations, specialized data science teams can supplement different business units and operate within their specific fields of analytical interest.
Most successful AI-driven companies operate with specialized data science teams. Obviously, being custom-built and wired for specific tasks, they’re all very different. The team structure at Airbnb Data Science is one of the most interesting ones. You can watch this fascinating talk by Airbnb’s data scientist Martin Daniel for a deeper understanding of how the company builds its culture or read a blog post from its ex-DS lead, but in short, here are the main principles they apply:
- Experiment. Find ways to put data into new projects using an established Learn-Plan-Test-Measure process.
- Democratize data. Scale your data science team to the whole company and even clients.
- Measure the impact. Evaluate what part DS teams have in your decision-making process and give them credit for it.
Pros of specialized data science department:
- Centralized data science management and increased problem-solving capacities
- Realizing the full potential of both as-a-service and custom ML applications
- Solving complex prediction problems that require deep research or building segmented model factories (that operate automatically across different segments and business units)
- Setting a fully-featured data science playground to foster innovation
- Greater scalability potential
Cons of specialized data science department:
- Building and maintaining a complex computational infrastructure
- Heavy investments into data science talent acquisition
- Data science talent engagement and retention challenges
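The segmented model factories mentioned in the pros list can be pictured as a loop that trains one model per segment without manual intervention. The sketch below is an assumption-laden illustration: the segment names and data are invented, and the "model" is just the segment average, standing in for whatever algorithm a real team would train.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, Iterable, List, Tuple


def train_mean_model(values: List[float]) -> Callable[[], float]:
    """Stand-in for real training: the model always predicts the segment average."""
    average = mean(values)
    return lambda: average


def build_model_factory(rows: Iterable[Tuple[str, float]]) -> Dict[str, Callable[[], float]]:
    """Group (segment, target) rows by segment and train one model per group."""
    by_segment: Dict[str, List[float]] = defaultdict(list)
    for segment, target in rows:
        by_segment[segment].append(target)
    return {segment: train_mean_model(values) for segment, values in by_segment.items()}


# Hypothetical order values for two business segments.
rows = [
    ("retail", 10.0),
    ("retail", 14.0),
    ("wholesale", 100.0),
    ("wholesale", 120.0),
]

models = build_model_factory(rows)
for segment, model in models.items():
    print(segment, model())
```

Adding a new segment requires no code change: once rows for it appear in the data, the factory trains a model for it on the next run.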
Enterprise IT involvement changes depending on the team structure you choose
Data science team roles
Let’s talk about data scientist skill sets. Unfortunately, the term data scientist has expanded and become too vague in recent years. Even though data science has moved into the business spotlight, no consensus has emerged on what the skillset of a data scientist is. Matthew Mayo, Data Scientist and the Deputy Editor of KDnuggets, argues: “When I hear the term data scientist, I tend to think of the unicorn, and all that it entails, and then remember that they don’t exist, and that actual data scientists play many diverse roles in organizations, with varying levels of business, technical, interpersonal, communication, and domain skills.”
Skillset of a data scientist
As you will see below, there are many roles within the data science ecosystem, and a lot of classifications offered on the web. We will share with you the one offered by Stitch Fix’s Michael Hochster. Michael defines two types of data scientists: Type A and Type B.
Type A stands for Analysis. This person is a statistician that makes sense of data without necessarily having strong programming knowledge. Type A data scientists perform data cleaning, forecasting, modeling, visualization, etc.
Type B stands for Building. These folks use data in production. They’re excellent software engineers with some stats background who build recommendation systems, personalization use cases, etc.
Rarely does one expert fit into a single category. But understanding these two data science functions can help you make sense of the roles described below.
Keep in mind that even professionals with this hypothetical skillset usually have their core strengths, which should be considered when distributing roles within a team. In most cases, acquiring talents will entail further training depending on their background.
But people and their roles are two different things. For instance, if your team model is the integrated one, an individual may combine multiple roles. So, let’s disregard how many actual experts you may have and outline the roles themselves. Obviously, many skillsets across roles may intersect.
Chief Analytics Officer/Chief Data Officer. In our whitepaper on machine learning, we broadly discussed this key leadership role. The CAO, a “business translator,” bridges the gap between data science and domain expertise, acting both as a visionary and a technical lead. You may get a better idea by looking at the visualization below.
Preferred skills: data science and analytics, programming skills, domain expertise, leadership and visionary abilities
Data analyst. The data analyst role implies proper data collection and interpretation activities. An analyst ensures that collected data is relevant and exhaustive while also interpreting the analytics results. Some companies, like IBM or HP, also require data analysts to have visualization skills to convert alienating numbers into tangible insights through graphics.
Preferred skills: R, Python, JavaScript, C/C++, SQL
Business analyst. A business analyst basically realizes a CAO’s functions but on the operational level. This implies converting business expectations into data analysis. If your core data scientist lacks domain expertise, a business analyst bridges this gulf.
Preferred skills: data visualization, business intelligence, SQL
Data scientist (not a data science unicorn). What does a data scientist do? Assuming you aren’t hunting unicorns, a data scientist is a person who solves business tasks using machine learning and data mining techniques. If this is too fuzzy, the role can be narrowed down to data preparation and cleaning with further model training and evaluation.
Preferred skills: R, SAS, Python, Matlab, SQL, noSQL, Hive, Pig, Hadoop, Spark
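The narrowed-down version of the role (data preparation and cleaning, then model training and evaluation) can be sketched in a few lines. This is a toy illustration only: the dataset is made up, the model is a simple least-squares line, and all function names are hypothetical.

```python
from statistics import mean
from typing import List, Optional, Tuple

RawRow = Tuple[Optional[float], Optional[float]]
Row = Tuple[float, float]


def clean(rows: List[RawRow]) -> List[Row]:
    """Data preparation: drop records with a missing feature or target."""
    return [(x, y) for x, y in rows if x is not None and y is not None]


def fit_line(rows: List[Row]) -> Tuple[float, float]:
    """Model training: ordinary least squares for y = slope * x + intercept."""
    x_bar = mean([x for x, _ in rows])
    y_bar = mean([y for _, y in rows])
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in rows)
             / sum((x - x_bar) ** 2 for x, _ in rows))
    return slope, y_bar - slope * x_bar


def mean_absolute_error(model: Tuple[float, float], rows: List[Row]) -> float:
    """Model evaluation: average absolute gap between predictions and targets."""
    slope, intercept = model
    return mean([abs(y - (slope * x + intercept)) for x, y in rows])


# Made-up data with two incomplete records.
raw = [(1.0, 3.0), (2.0, 5.0), (None, 4.0), (3.0, 7.0),
       (4.0, None), (5.0, 11.0), (6.0, 13.0)]

data = clean(raw)
train, test = data[:3], data[3:]  # hold out the last rows for evaluation
model = fit_line(train)
print("model:", model, "test error:", mean_absolute_error(model, test))
```

Evaluating on rows the model has not seen during training is the important design choice here. In practice the split would be randomized and the model far richer, but the three stages stay the same.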
To avoid confusion and make the search for a data scientist less overwhelming, their job is often divided into two roles: machine learning engineer and data journalist.
A machine learning engineer combines software engineering and modeling skills by determining which model to use and what data should be used for each model. Probability and statistics are also their forte. Everything that goes into training, monitoring, and maintaining a model is the ML engineer’s job.
Preferred skills: R, Python, Scala, Julia, Java
Data journalists help make sense of data output by putting it in the right context. They’re also tasked with articulating business problems and shaping analytics results into compelling stories. Though required to have coding and statistics experience, they should also be able to present the idea to stakeholders and represent the data team to those unfamiliar with statistics.
Preferred skills: SQL, Python, R, Scala, Carto, D3, QGIS, Tableau
Data architect. This role is critical for working with large amounts of data (you guessed it, Big Data). Unless you rely solely on MLaaS cloud platforms, you need an architect to warehouse the data, define database architecture, centralize data, and ensure integrity across different sources. For large distributed systems and big datasets, the architect is also in charge of performance.
Preferred skills: SQL, noSQL, XML, Hive, Pig, Hadoop, Spark
Data engineer. Engineers implement, test, and maintain infrastructural components that data architects design. Realistically, the role of an engineer and the role of an architect can be combined in one person. The set of skills is very close.
Preferred skills: SQL, noSQL, Hive, Pig, Matlab, SAS, Python, Java, Ruby, C++, Perl
Application/data visualization engineer. Basically, this role is only necessary for a specialized data science model. In other cases, software engineers come from IT units to deliver data science results in applications that end-users face. And it’s very likely that an application engineer or other developers from front-end units will oversee end-user data visualization.
Preferred skills: programming, JavaScript (for visualization), SQL, noSQL