Fitting Data Science Methodologies to the Complex Contours of the Internet of Things

Lenses filter different portions of the visible spectrum in different ways.

Statistical models are a type of lens. They filter data sets in diverse ways and enable data scientists to visualize correlations, trends, outliers, and other patterns that might reveal some useful insight.

Data science is the art of fitting the right statistical methods to the data sets of interest. When we consider which data science methodologies might be best suited to Internet of Things (IoT) initiatives, it’s helpful to begin with an understanding of what types of data we encounter in the IoT, how this data differs from data that data scientists explore in other types of projects, and how these differences call for particular IoT-geared data-scientific approaches.

As with fitting prescription lenses to the specific shape of a particular human eye, identifying the optimal algorithmic approach to an IoT project demands that data scientists understand the contours of the following data types: 

  • Machine data: IoT data is predominantly sourced from sensors, such as the cameras and radars embedded in connected cars. Much of it is algorithmically processed at those endpoints or transmitted to the IoT cloud for further analysis, returning to the endpoints as analytic guidance either for the actuators—such as steering mechanisms and transmissions—in those same endpoints or for human operators (e.g., drivers) of those endpoints.
  • Sparse behavioral data: Much IoT data describes behaviors of the machine endpoints from which it flows, including features such as position, direction, velocity, temperature, status, and so on. This data may have many fine-grained features, but most of those features may hold zero or null values, creating sparse data sets (see the sparse-matrix sketch after this list). As machine-sourced behavioral data is correlated with other sparse data sets, such as those from mobile apps and advertising responses, IoT data sets may grow increasingly sparse. That's because, in a sparse IoT data set, no individual machine endpoint whose behavior is being recorded is likely to exhibit more than a limited range of behaviors. But when you look across an entire population, you're likely to observe every specific type of behavior being expressed at least once, and perhaps numerous times within specific IoT niches.
  • Log data: The event log is the common-denominator storage and integration abstraction for the IoT. Continually logged data of all sorts (web logs, application logs, database logs, system logs, and more) constitutes the bulk of the historical data analyzed by IoT analytics (see the log-normalization sketch after this list). Distributed IoT log databases manage objects of both relational and non-relational types, process advanced analytics against all of this data, support mixed latencies of batch and streaming data, and ensure the linear scalability needed to support massive volumes of in-flight log data. Most IoT endpoints will support local logs within the increasingly tight storage constraints associated with their disparate form factors.
  • Streaming event data: Much IoT data consists of event streams that flow in real time from sources to processing nodes throughout the fabric of these massively distributed environments.
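
To make the sparsity point above concrete, here is a minimal sketch in Python, using scipy's sparse-matrix support, of behavioral data in which each row is a machine endpoint and each column is a possible behavior. The behavior vocabulary, endpoint count, and event counts are all hypothetical:

```python
from scipy.sparse import csr_matrix

# Hypothetical behavior vocabulary; each column of the matrix below
# counts occurrences of one behavior per endpoint.
behaviors = ["idle", "accelerate", "brake", "overheat", "fault_code",
             "door_open", "gps_drift", "firmware_update"]

# Observed (endpoint, behavior) event counts -- a tiny, made-up sample.
rows = [0, 0, 1, 2, 2, 3]      # endpoint indices
cols = [0, 1, 2, 0, 3, 7]      # behavior indices
counts = [120, 14, 9, 200, 1, 2]

X = csr_matrix((counts, (rows, cols)), shape=(4, len(behaviors)))

# Individually, each endpoint expresses only a few behaviors...
print(X[0].toarray())   # [[120  14   0   0   0   0   0   0]]

# ...while most cells across the population remain zero.
print(f"density: {X.nnz / (X.shape[0] * X.shape[1]):.2%}")   # 18.75%
```

The compressed-sparse-row format stores only the nonzero cells, which is why it scales to populations of endpoints far larger than this toy example.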
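
And here is a small sketch of the event-log abstraction: normalizing heterogeneous log lines into one common record shape before they land in a distributed log store. The line format, field names, and sample reading are invented for illustration:

```python
import json
import re
from datetime import datetime, timezone

# Hypothetical common line shape: "<timestamp> <host> <message>".
LOG_RE = re.compile(r"^(?P<ts>\S+) (?P<host>\S+) (?P<msg>.*)$")

def normalize(line: str, source: str) -> dict:
    """Map one raw log line to a common event record."""
    m = LOG_RE.match(line)
    if not m:
        raise ValueError(f"unparseable line: {line!r}")
    return {
        "source": source,                 # web, app, db, system, ...
        "timestamp": m.group("ts"),
        "host": m.group("host"),
        "message": m.group("msg"),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

record = normalize("2016-02-14T08:00:01Z meter-07 temp=81.2 status=OK",
                   source="system")
print(json.dumps(record, indent=2))
```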

What sorts of analytic methods are best suited to finding meaningful patterns in IoT data sets that embody these characteristics? We can break them out into at least the following: 

  • Deep learning: As I discussed here, IoT data consists of diverse feeds of sensor, media, and other data streams. Making sense of it all requires deep learning algorithms that leverage artificial neural networks. This technology enables us to automate the continuous monitoring of feeds whose semantic patterns, context, and importance must be inferred. The algorithms can dissect IoT feeds of daunting complexity, recognizing implicit semantic hierarchies among constituent objects and correlations across multiple distinct streams (see the model sketch following this list). IoT streams may include objects as varied as video, images, speech, faces, gestures, geospatial coordinates, and browser clicks. In this way, deep learning is a fundamental tool in instrumenting the planet with ubiquitous IoT sensors and actuators to sense and react to dynamic, distributed patterns, such as terrorist activity, disaster response, weather conditions, traffic congestion, and energy-grid spikes. Without deep learning algorithms that can dynamically respond to myriad concurrent streams, we risk drowning in our own big data.
  • Graph analysis: As I discussed here, many IoT insights can best be revealed through graph analysis, because IoT data sets are often entirely or predominantly composed of behavioral data. When evaluated in the aggregate, IoT behavioral data may exhibit unprecedented behavioral interactions that can’t be understood by focusing on nodes in isolation. Graph analysis is the key to flagging the sorts of IoT interaction patterns that might otherwise go undetected (see the graph sketch following this list). Already we can see graph analysis at work in many established IoT applications in business and industry. Use cases such as real-time sensor grids, remote telemetry, self-healing network computing, medical monitoring, traffic management, emergency response, and security incident and event monitoring all depend on our being able to detect unusual graph patterns rapidly and take effective action to neutralize the risks.
  • Streaming analytics: As I discussed here, IoT data is often real-time, streaming, and continuous in its flow from end to end across a distributed application environment. To make sense of the patterns revealed in these streams, you need the analytic libraries and runtime engines built into stream computing tools such as Apache Spark, IBM Streams, and the like (see the Spark sketch following this list). These tools enable data scientists to build and refine machine learning models that handle low-latency acquisition, transmission, and analysis of device-level machine data, enabling these functions to be executed in distributed fashion across massively parallel IoT computing fabrics.
  • Anomaly detection: As this recent article notes, anomaly detection is a core analytic function of many IoT applications, especially those that perform surveillance functions. Data scientists build anomaly-detection algorithms that filter the firehose of machine data, looking for exception conditions to flag for subsequent alerting, notification, process triggering, and other actions (see the z-score sketch following this list). IoT analytics identifies the most noteworthy events, issues, obstacles, dangers, and opportunities in all that sensor data. It helps us to see hidden patterns fast enough to drive nonstop anomaly detection, situational awareness, prescriptive analysis, and next-best-action guidance.
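
As one illustration of the deep learning bullet, here is a minimal sketch, using PyTorch, of a 1-D convolutional network that classifies fixed-length windows of multichannel sensor readings. The window size, channel count, and class count are assumptions, not values from any particular IoT deployment:

```python
import torch
import torch.nn as nn

WINDOW, CHANNELS, CLASSES = 128, 6, 4   # assumed sizes

model = nn.Sequential(
    # Conv1d expects input shaped (batch, channels, timesteps);
    # convolutions learn local temporal patterns in the raw stream.
    nn.Conv1d(CHANNELS, 32, kernel_size=7), nn.ReLU(),
    nn.MaxPool1d(2),
    nn.Conv1d(32, 64, kernel_size=5), nn.ReLU(),
    # Collapse the time axis before the classification head.
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(64, CLASSES),             # one score per pattern class
)

windows = torch.randn(8, CHANNELS, WINDOW)  # batch of synthetic windows
logits = model(windows)
print(logits.shape)                         # torch.Size([8, 4])
```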
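
For the graph-analysis bullet, a small sketch using networkx: model endpoint-to-endpoint interactions as edges, then flag nodes whose connectivity departs sharply from the population norm. The interaction records and the 1.5x-mean threshold are invented for illustration:

```python
import networkx as nx

# (source endpoint, target endpoint) interaction records -- invented.
interactions = [
    ("meter-01", "gateway-A"), ("meter-02", "gateway-A"),
    ("meter-03", "gateway-A"), ("meter-04", "gateway-B"),
    ("meter-05", "gateway-B"), ("meter-05", "meter-01"),
    ("meter-05", "meter-02"), ("meter-05", "meter-03"),  # unusual fan-out
]

G = nx.Graph()
G.add_edges_from(interactions)

# Degree centrality is one simple lens on "unusual interaction patterns":
# a meter talking directly to many peers may signal misconfiguration or
# compromise. The 1.5x-mean cutoff is arbitrary for this sketch.
centrality = nx.degree_centrality(G)
mean_c = sum(centrality.values()) / len(centrality)
suspects = [n for n, c in centrality.items() if c > 1.5 * mean_c]
print(suspects)   # ['meter-05']
```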
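
For the streaming-analytics bullet, a hedged sketch using Apache Spark's Structured Streaming API (Spark is named above). The file source, schema, and field names are placeholders for a real IoT feed, which would more likely arrive via Kafka or an MQTT bridge:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("iot-stream-sketch").getOrCreate()

# Expected shape of each JSON event (hypothetical field names).
schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("event_time", TimestampType()),
])

# A file source keeps the sketch self-contained.
readings = spark.readStream.schema(schema).json("/tmp/iot-events")

# Per-device average temperature over one-minute event-time windows,
# tolerating events that arrive up to two minutes late.
per_device = (readings
              .withWatermark("event_time", "2 minutes")
              .groupBy(F.window("event_time", "1 minute"), "device_id")
              .agg(F.avg("temperature").alias("avg_temp")))

query = (per_device.writeStream
         .outputMode("append")
         .format("console")
         .start())
query.awaitTermination()
```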
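
And for the anomaly-detection bullet, a minimal sketch of the exception-condition filter described there: a rolling z-score that flags readings deviating sharply from the recent baseline. The window size, threshold, and synthetic feed are illustrative assumptions, not tuned values:

```python
from collections import deque
from statistics import mean, stdev

def zscore_anomalies(stream, window=20, threshold=3.0):
    """Yield (index, value, z) for readings beyond `threshold` sigmas
    relative to a rolling window of recent readings."""
    history = deque(maxlen=window)
    for i, x in enumerate(stream):
        if len(history) >= 2:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0:
                z = (x - mu) / sigma
                if abs(z) > threshold:
                    yield i, x, z
        history.append(x)

# Synthetic feed: steady readings with one injected spike at index 60.
feed = [20.0 + 0.1 * (i % 5) for i in range(100)]
feed[60] = 35.0   # the exception condition to be flagged

for i, x, z in zscore_anomalies(feed):
    print(f"reading {i}: value={x:.1f}, z={z:.1f}")   # flags index 60
```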

Does IoT require a different data science workflow from the standard one defined under the CRISP-DM framework? That’s a very contentious topic. One argument against CRISP-DM is that it assumes a sequential “waterfall” development methodology in a world where IoT and other applications are increasingly being developed according to adaptive “agile” approaches.

Agile is aligned to the exploratory, investigative nature of data science initiatives such as IoT analytics. In agile workflows, the CRISP-DM development phases (business understanding, data understanding, data preparation, modeling, evaluation, and deployment) are carried out in iterative, overlapping fashion. As you iterate IoT models, bring in fresh IoT data sources, and identify heretofore hidden IoT patterns, new ideas spring forth that help you to align your IoT-centric solution with specific value points. Also, agile emphasizes simplicity and incrementalism, and is in that way aligned with the embedded endpoint apps that are increasingly the focus of IoT development.

However, it’s difficult and perhaps futile to specify a standard development methodology for IoT analytics. Some IoT applications, such as energy-grid monitoring and control, are extraordinarily complex, distributed, and high-stakes. Other IoT applications, such as those that enable you to use your smartwatch to remotely check whether you left the desk lamp on in your home office, ingest far less data, involve simpler analytics, and have far less catastrophic impacts if they fail.

Rubens Zimbres, Ph.D.

ML Engineer, Google Developer Expert in AI/ML & Google Cloud, Sec+


James, there is an epistemological problem. All technology was developed within an existing paradigm, and so was the research; it evolved. Maybe some foundations of science should be analyzed. The roots. Edgar Morin gives us a good perspective; Bertalanffy (1950) does too: open systems, feedback, complexity. For instance, there is a major problem with unstructured data. I come from the business administration field, and there is neither a well-established theory for handling and analyzing qualitative data nor any consensus. This is a problem when we talk about business strategies, government surveillance, marketing initiatives, etc. Besides, we are all subject to bounded rationality: coders and decision makers alike. A suboptimal solution seems to be the best option. This may have already generated an error in the "Matrix", affecting the whole dynamic of the IoT. Entropy of the system is a great risk. Some algorithms took over certain activities and we are moving on. Same paradigm. This involves cost, sunk cost, barriers to a disruptive approach, and a lot of people.


Schoeneburg: Agreed. With ad hoc mesh IoT, dynamic graph search is very hard. Your data is always out of sync with NoSQL in-memory packet inspection. Where it works is with fixed sensor networks, like electric and gas smart meters.

Eberhard Schoneburg

Artificial Intelligence and Artificial Life Pioneer, Author, Speaker, Investor, Advisor, Lecturer


The intelligent analysis of real-time IoT data is much more difficult than suggested here. Just throwing 'deep learning' into the pond may not work. The critical part of any real-time problem solution is the timing of the events occurring, and their prediction and analysis. Neural networks may work for some problems but not for all. Classical deep learning neural networks (convolutional or otherwise) do not represent or handle time delays and time dependencies among the input patterns well. In some applications, hierarchical temporal networks might work better at representing and processing complex timing patterns in IoT data, especially in streaming applications and problems.
