How Data Scientists Think About Data

Jenn Gamble

This year at the virtual ElixirConf, I was honored to deliver one of the keynote presentations, where I talked about building robust data pipelines for machine learning-driven industrial IoT. Watch the video below to see the full talk, and check out the post below for a deeper dive.

Artificial intelligence (AI) and machine learning (ML) may soak up most of today’s data-related buzz, but there’s a lot more to turning raw numbers into valuable knowledge than simply feeding them into a computer. Just like any other raw material, data needs a little attention before it’s ready for consumption. Even for ML, data scientists play the chef by preparing the data: washing it to remove defects, augmenting and seasoning it, and ultimately transforming it into a meaningful product.

But before we go any further, let’s start with the fundamentals. What is data science? And what’s data? The answer to these questions will help us tread down the path towards thinking about data like a data scientist.

Data is information. It can run the gamut from video to text to sound, and can come in structured, unstructured, or semi-structured forms. When it comes to IoT data, our embedded sensors are data sources that continuously produce streams of numbers, and those numbers can represent the complex states and interdependencies of a machine or system.

Data scientists work with data like chemists work with molecules and biologists work with proteins. As Cassie Kozyrkov, Chief Decision Intelligence Engineer at Google, writes for Hacker Noon, “Data science is the discipline of making data useful.”

We use data to influence real-world decisions. And, while most companies employ data scientists to aid their strategic calculus, the Internet of Things presents a unique opportunity for us to extract value from the data that our devices generate. IoT devices create a digital overlay for the world around us, and so the data that they generate directly correlates to physical phenomena in real time.

Consider, for example, the Industrial Internet of Things (IIoT) and Industry 4.0. These devices enable us to apply data science to manufacturing, which in turn unlocks insights into factory floor performance and ways to optimize everything from machine configurations to operational tasks like maintenance.

Going from raw numbers to real knowledge, however, takes patience, effort, and a little creativity. In this article, we’re going to show you how Very approaches IoT data science.

Analytical Framing

Our process begins before we even look at a single datapoint. We need an analytical framework, an instrument that guides our work. It covers framing the question we’re trying to answer, placing the data in a larger business or operational context, and understanding the data’s journey as it travels from inside the device into our AWS cloud for use in our ML pipeline.

For instance, business problems often have ambiguous framing. A manufacturer may want to reduce downtime, but this still leaves us with multiple paths to follow. Should we look for sensor readings that tend to precede machine failures (a task often approached with anomaly detection) to support predictive maintenance, or should we search for the conditions associated with longer machine life, which would give our engineers a way to create more durable products? These are just two of the many possible routes.

Another way that we need to frame our data is by mapping out the data flow. To this end, we emphasize a DevOps approach with a tight-knit, interdisciplinary team so that everyone is on the same page. That way we understand where we’re calculating derived variables, what our input/output (IO) requirements are, whether we’re using synchronous or asynchronous data transmission, and other similar details.

This all amounts to contextualizing our data. We’re answering fundamental questions like “What is this observation?”, “What am I trying to model?”, and “What am I trying to predict?”

Depending on how we choose to approach the data, we’ll prioritize certain types of data, conduct different analytics, and select the right ML algorithm for the job.

A Data Scientist’s Perspective on Models

Data scientists think about models a little differently than most people. For us, models take data as input and then give us an output. This input data can have tens or hundreds of variables, and the output is the answer to the question we posed during analytical framing.

Going back to our predictive maintenance example for industrial IoT, the inputs might include variables like uptime and frequency of past failures, environmental factors like humidity and temperature, and derived variables like number of operations per second. The output is a percentage that answers the question “How likely is our machine to fail within the next month?”
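To make the idea of raw and derived inputs concrete, here is a minimal Python sketch. The class, field names, and values are all hypothetical and chosen only for illustration; they are not taken from a real pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MachineSnapshot:
    """One observation of a machine. All field names are illustrative."""
    uptime_hours: float
    past_failures: int
    humidity_pct: float
    temperature_c: float
    operations: int         # operations counted during the window
    window_seconds: float   # length of the observation window


def build_features(s: MachineSnapshot) -> List[float]:
    """Turn raw readings into the numeric input vector a model would see."""
    # Derived variable: computed from raw fields rather than measured directly.
    ops_per_second = s.operations / s.window_seconds
    return [
        s.uptime_hours,
        float(s.past_failures),
        s.humidity_pct,
        s.temperature_c,
        ops_per_second,
    ]


snapshot = MachineSnapshot(
    uptime_hours=1200.0,
    past_failures=2,
    humidity_pct=45.0,
    temperature_c=71.5,
    operations=9000,
    window_seconds=3600.0,
)
features = build_features(snapshot)
print(features)
```

A trained model would take a vector like this as input and return the failure probability as its output.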

For machine learning models, one of the key takeaways is that we don’t write any rules for translating inputs into outputs. Instead, we give the model many examples of inputs paired with their known outputs, and it learns the mapping on its own. We call this training the model. From there, we can give it new inputs and it will predict the output. As we give it more data or better data, it generally becomes more accurate.
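As a rough illustration of learning from examples instead of hand-written rules, the sketch below trains a tiny logistic regression with plain gradient descent. The model choice, the made-up data, and the function names are assumptions for illustration only; a production pipeline would normally rely on an established ML library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Return the model's estimated probability of failure for input x."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(examples, labels, epochs=2000, learning_rate=0.1):
    """Fit weights and a bias from labeled examples by gradient descent."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            error = predict(weights, bias, x) - y
            # Nudge each parameter to reduce the error on this example.
            weights = [w - learning_rate * error * xi
                       for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


# Made-up inputs: [temperature, past failures], both scaled to 0..1.
examples = [[0.1, 0.0], [0.2, 0.1], [0.3, 0.0],
            [0.7, 0.8], [0.8, 0.6], [0.9, 0.9]]
labels = [0, 0, 0, 1, 1, 1]  # 1 = machine failed, 0 = it did not

weights, bias = train(examples, labels)
risk_hot = predict(weights, bias, [0.85, 0.7])    # unseen, failure-like input
risk_cool = predict(weights, bias, [0.15, 0.05])  # unseen, healthy-like input
print(f"hot machine: {risk_hot:.2f}, cool machine: {risk_cool:.2f}")
```

No rule such as "hot machines fail" appears anywhere in the code. The relationship is picked up entirely from the labeled examples.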

That’s why an important part of any data scientist’s job is creating the training data. We almost never feed our model data straight from the sensors. The real world is noisy; there are errors, blank spots, and a slew of other issues to overcome. Preparing the training data, therefore, requires us to aggregate it, augment it, and clean it up. The specifics of this process are in turn defined by our earlier analytical framing.
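The cleaning and aggregation steps might look something like the following sketch. It assumes a hypothetical temperature sensor whose plausible range is -40 to 150, fills gaps by carrying the last good reading forward, and averages fixed-size windows. These are illustrative choices rather than a prescribed method, and augmentation is not shown.

```python
from statistics import mean


def clean(readings, low, high):
    """Replace missing or implausible readings with the last good value.

    Readings that arrive before any good value are dropped.
    """
    cleaned = []
    last_good = None
    for r in readings:
        if r is not None and low <= r <= high:
            last_good = r
        if last_good is not None:
            cleaned.append(last_good)
    return cleaned


def aggregate(values, window):
    """Average consecutive, non-overlapping windows of readings.

    A trailing partial window is averaged as well.
    """
    return [mean(values[i:i + window])
            for i in range(0, len(values), window)]


# None marks a blank spot; 999.0 stands in for a sensor glitch.
raw = [70.0, 71.0, None, 72.0, 999.0, 73.0, 74.0, None]

cleaned = clean(raw, low=-40.0, high=150.0)
window_means = aggregate(cleaned, window=4)
print(cleaned)
print(window_means)
```

Carrying the last value forward is only one way to handle gaps. Depending on the framing, interpolating or discarding the affected window may be more appropriate.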

Preparing training data, selecting the right model, choosing the hyperparameters (the settings that control how the model is trained), and writing Python code in AWS SageMaker are all iterative processes. This is where an agile approach proves its worth. We’re constantly tweaking and refining our model, both to answer our initial questions and to adapt to changing business demands.
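One small piece of that iteration, trying several hyperparameter values and keeping the one that scores best on held-out data, can be sketched as follows. The k-nearest-neighbors model is a stand-in chosen because it fits in a few lines, not the model used in practice, and the data (including one deliberately mislabeled point) is made up.

```python
def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest training points."""
    def distance(pair):
        return sum((a - b) ** 2 for a, b in zip(pair[0], x))

    nearest = sorted(zip(train_x, train_y), key=distance)[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def accuracy(train_x, train_y, val_x, val_y, k):
    """Fraction of validation points classified correctly for a given k."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(val_x, val_y))
    return hits / len(val_x)


# Made-up, one-feature data. The point at 0.25 is deliberately mislabeled.
train_x = [[0.1], [0.2], [0.25], [0.3], [0.4], [0.6], [0.7], [0.8], [0.9]]
train_y = [0, 0, 1, 0, 0, 1, 1, 1, 1]
val_x = [[0.26], [0.15], [0.35], [0.65], [0.85], [0.75]]
val_y = [0, 0, 0, 1, 1, 1]

# k is the hyperparameter: try each candidate and score it on held-out data.
candidates = [1, 3, 5]
scores = {k: accuracy(train_x, train_y, val_x, val_y, k) for k in candidates}
best_k = max(candidates, key=lambda k: scores[k])
print(scores, "best k:", best_k)
```

With k set to 1, the model copies the mislabeled neighbor and gets one validation point wrong, while larger values of k outvote it. Real projects repeat this kind of loop across many settings and revisit it as the data and requirements change.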

Conclusion

As expectations grow for our connected devices to become smarter, assume more autonomy, and make our organizations more efficient, IoT machine learning becomes increasingly important. That’s why the ML data pipelines we build take device data, train models in the cloud, and then ship the resulting insights back to the user, either through an IoT frontend such as a mobile app or by embedding them into the device itself via edge computing.

In the end, data science is all about creating knowledge. That knowledge then fuels actionable predictions, recommendations, and insights. As a pioneer in the ML for IoT industry, Very’s cross-disciplinary team works closely together to foster a data ecosystem that’s rich in gold.

Excited to learn more about how Very uses data science and ML for IoT? Check out our machine learning capabilities for a look at our tooling and use cases—and, as always, feel free to contact us.