For the last few years, I have been working in the Industry 4.0 field. At my company, we integrate AI systems into industrial processes to monitor production quality. Working through our first projects, we had to solve many issues in our infrastructure, most of them related to data storage and retrieval. In this post, I’d like to describe these issues and propose Reduct Storage as a possible solution.
The Data Zoo
If you have any experience in the AI/ML field, you already know that data is the key factor. In industrial environments, getting AI-ready data can be especially challenging because the input is extremely diverse. It can be data from industrial automation (temperature, pressure, etc.), pictures from a CV (computer vision) camera, or sound. Let's skip the problem of gathering data from the sources and imagine that we have a magic box which provides the following data every second (a minimal sketch of such records follows the list):
- scalar values (temperature, machine state)
- time series with a 48 kHz sample rate (sound, vibration)
- 1920×1080 matrices (full-HD pictures)
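To make the storage problem concrete, here is what a single reading from that magic box could look like once every payload is reduced to timestamped bytes. This is a plain-Python sketch; all names are illustrative:

```python
import struct
import time
from dataclasses import dataclass


@dataclass
class Record:
    """One reading from the 'magic box': anything that serializes to bytes."""
    timestamp: int   # Unix time in microseconds
    entry: str       # logical source, e.g. "temperature" or "camera-1"
    payload: bytes   # from 8 bytes for a float up to megabytes for a frame


def now_us() -> int:
    return int(time.time() * 1_000_000)


# A scalar, one second of 48 kHz audio, and a full-HD frame all take the same shape:
scalar = Record(now_us(), "temperature", struct.pack("d", 42.0))
audio = Record(now_us(), "microphone", b"\x00" * 48_000 * 2)      # 16-bit mono PCM
frame = Record(now_us(), "camera-1", b"\x00" * 1920 * 1080 * 3)   # raw RGB pixels
```

Whatever the source, a record is just a timestamp plus an opaque payload, and that observation drives everything below.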
Now we have the following issues:
- How can we handle the data in a generic way when it has variable dimensions (points, vectors, matrices, tensors)?
- How can we get rid of noise and compress the data?
- How can we keep a history of the data, which can be from a few bytes to a few megabytes in size?
In this article, I would like to discuss the problem of keeping a history of such diverse data. However, if you are interested in the first two issues, you can have a look at this solution here.
Why Do We Need Historical Data?
First of all, we need the history to train our models. If your AI application recognizes donuts in a box and counts them, it should be trained on a dataset of a few thousand images. This dataset can be the photos from a CV camera over a day, or any other time interval when production was running.
The second case where we need the history is model validation. We must run our system for several days and check the results. For example, we can take 100 random photos of boxes of donuts, count the donuts manually, and compare the results with the metrics from the model. Here, the history alone is not enough: we have to connect images and results with some identifier. In our application, this is the timestamp at which we capture a photo with the CV camera.
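As an illustration, a validation helper could sample stored photos that have a matching model metric under the same capture timestamp. This is a hypothetical sketch, not our actual tooling:

```python
import random


def sample_for_validation(
    photos: dict[int, bytes],  # capture timestamp (us) -> JPEG bytes
    counts: dict[int, int],    # capture timestamp (us) -> donuts counted by the model
    sample_size: int = 100,
) -> list[tuple[int, bytes, int]]:
    """Pick random photos that have a matching metric, for manual re-counting."""
    common = sorted(photos.keys() & counts.keys())
    sample = random.sample(common, min(sample_size, len(common)))
    return [(ts, photos[ts], counts[ts]) for ts in sample]
```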
And finally, our customer may want to have a record of the “bad” boxes, which have the wrong number of donuts, to understand why it happened. In other words, we use the history for accident investigation.
The Place To Keep Historical Data
We have to keep our data close to the process, and there are many reasons for this. Above all, this is industrial production, and its network is almost always isolated for security reasons. Moreover, some data sources like CV cameras produce heavy traffic, so it is better to filter and compress the data before streaming it anywhere else.
To solve these issues, we have to install an edge device on the production site which collects and stores the data continuously, and now we have to keep two new problems in mind:
- Limited disk space
- Low data availability
The disk space problem is quite obvious and can be solved by removing the oldest data once we reach some quota; a rough sketch of such FIFO cleanup follows.
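Here, I assume each record lives in its own file under a data directory; the quota value, layout, and names are illustrative:

```python
from pathlib import Path

QUOTA_BYTES = 50 * 2**30  # e.g. a 50 GiB quota on the edge device


def enforce_quota(data_dir: Path) -> None:
    """Delete the oldest records until the total size fits the quota again."""
    files = sorted(data_dir.glob("*.blob"), key=lambda p: p.stat().st_mtime)
    total = sum(f.stat().st_size for f in files)
    while files and total > QUOTA_BYTES:
        oldest = files.pop(0)
        total -= oldest.stat().st_size
        oldest.unlink()
```

Low data availability, however, needs some explanation.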
As I said before, the network on the production site is usually isolated from the Internet. You may ask the customer for temporary access to the device or plug in an LTE modem, but you won’t have a high-bandwidth, stable connection to the device and its data. On the other hand, it is impossible to train the model on the device itself because of its limited computing power.
A typical AI application, then, works like this: we keep a history of the input data and metrics on the device as a ring buffer for a few days. When we need it for training or validation, we establish a connection between the device and the cloud and replicate the data for some time interval. Afterwards, we can work with a highly available copy of the data.
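In code, that replication step can be as simple as the sketch below; `edge` and `cloud` stand for hypothetical clients of the same minimal time series blob API (query by time range, write by timestamp), not a real library:

```python
async def replicate(edge, cloud, entry: str, start_us: int, stop_us: int) -> None:
    """Copy every record of `entry` in [start_us, stop_us) from the edge to the cloud."""
    async for timestamp, payload in edge.query(entry, start_us, stop_us):
        await cloud.write(entry, payload, timestamp=timestamp)
```

After the copy, training jobs in the cloud read from the replica, while the ring buffer on the device keeps rolling.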
What Is Data Storage For Industry 4.0?
Now that I have explained all the issues we meet when building AI applications, we can specify the following requirements for our data storage:
- It should be a time series database, because we retrieve data for time intervals: when production was running, when an accident happened, and so on.
- It should store blobs: we may have images, metrics as JSON documents, sound data, etc., so the data comes in different formats and sizes.
- It should have a quota and work as a FIFO buffer, removing old data when the quota is reached.
- It should support replication and provide a tool to copy data for some interval from one database to another.
At my company, we didn’t find a ready-to-use database which could cover all these requirements. So we use object storage to keep the blobs and a time series database to store the metrics and links to the objects for faster searching. It is a completely workable solution, but we had to develop many tools and workarounds to make it fly.
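Roughly, the write path of that workaround looks like the sketch below. The boto3 call against an S3-compatible store is real; the `tsdb` client is a stand-in for whatever time series database is in use:

```python
import time

import boto3  # any S3-compatible object storage works here

s3 = boto3.client("s3", endpoint_url="http://minio:9000")  # illustrative endpoint


def store(entry: str, payload: bytes, tsdb) -> None:
    """Blob goes to object storage; only its key and timestamp go to the TSDB."""
    ts = int(time.time() * 1_000_000)
    key = f"{entry}/{ts}.blob"
    s3.put_object(Bucket="history", Key=key, Body=payload)
    # Hypothetical TSDB call: index the object key under the capture timestamp.
    tsdb.insert(measurement=entry, timestamp=ts, fields={"object_key": key})
```

Reads go the other way: query the TSDB for an interval, then fetch every referenced object. This is exactly the kind of glue code we would rather not write and maintain.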
That is why I started the Reduct Storage project: time series storage which can be used natively in Industry 4.0 applications.
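To give an idea of how it covers the requirements above, here is a sketch using the Reduct Storage Python client. The API has evolved since the project started; the names below follow a recent version of the `reduct` package and may differ in yours:

```python
import asyncio
import time

from reduct import BucketSettings, Client, QuotaType


async def main():
    client = Client("http://127.0.0.1:8383")

    # One bucket with a FIFO quota: the oldest records are dropped
    # automatically once the 50 GiB limit is reached.
    bucket = await client.create_bucket(
        "production-history",
        BucketSettings(quota_type=QuotaType.FIFO, quota_size=50 * 2**30),
        exist_ok=True,
    )

    # Write any blob (an image, a JSON document, a sound chunk)
    # under a microsecond timestamp.
    ts = int(time.time() * 1_000_000)
    await bucket.write("camera-1", b"\xff\xd8 ...jpeg bytes...", timestamp=ts)

    # Query a time interval, e.g. the ten seconds before an accident.
    async for record in bucket.query("camera-1", start=ts - 10_000_000, stop=ts + 1):
        photo = await record.read_all()


asyncio.run(main())
```

Replication between an edge instance and the cloud then boils down to the `replicate()` loop shown earlier.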