DS18B20 data analysis using Pandas

A few months ago I wrote about how to collect data from a DS18B20 temperature sensor with a Raspberry Pi. This article shows how to do some basic data analysis on the collected temperature data using Python.

NumPy, Pandas and Matplotlib

Pandas is a Python library that provides high-performance, easy-to-use, high-level data structures and tools for data manipulation and analysis. Pandas is built on top of NumPy, which supports large, multi-dimensional arrays and matrices, along with a large collection of mathematical functions that operate on those arrays. For plotting, I’m using Matplotlib to generate various 2D plots. Installation instructions for these packages can be found on their respective websites, so I won’t cover them here.

Convert raw data

As described in my previous article, the data collected from the DS18B20 temperature sensor is stored in the /var/log/ds18b20.log log file on my Raspberry Pi. Running tail ds18b20.log gives a snapshot of the data:

2017-07-24_10:10:01 31187
2017-07-24_10:20:02 31062
2017-07-24_10:30:02 85000
2017-07-24_10:40:02 31250
2017-07-24_10:50:02 31312
2017-07-24_11:00:02 31312
2017-07-24_11:10:02 31312
2017-07-24_11:20:02 31125
2017-07-24_11:30:02 31312
2017-07-24_11:40:02 31437

The ds18b20.log file is plain text. Each line consists of two pieces of data separated by a space: a timestamp and a temperature reading from the DS18B20 in thousandths of a degree Celsius (e.g. 29000 means 29 degrees Celsius). In order to use the log for data analysis, we need to:

  • Read each line of the data log;
  • Convert the timestamp string into a Python datetime object;
  • Convert the temperature to a floating-point number and divide it by 1000 to get the actual temperature in degrees Celsius;
  • Create a Pandas data frame object with the data for further analysis.
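The first three steps can be sketched on a single log line using only the standard library. This is a minimal illustration, not the code used in the rest of the article; the helper name parse_line is hypothetical.

```python
from datetime import datetime


def parse_line(line):
    """Split one log line into a datetime and a temperature in degrees Celsius.

    parse_line is a hypothetical helper used only for this illustration.
    """
    stamp, raw = line.split()
    # The log uses an underscore between the date and the time
    when = datetime.strptime(stamp, '%Y-%m-%d_%H:%M:%S')
    # Raw readings are in thousandths of a degree Celsius
    celsius = float(raw) / 1000.0
    return when, celsius


when, celsius = parse_line('2017-07-24_10:10:01 31187')
print(when, celsius)
```

Pandas performs the same conversion for the whole file at once, which is what the get_data function in the listing that follows does.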

Clean up data

In many IoT or sensor applications, the collected data often contains errors. For example, I noticed that due to a transmission error (I have since solved this problem, but my early logs still contain the bad readings), the data occasionally contains an error reading with a temperature of 85000. The data therefore needs to be cleaned up before further analysis.

import numpy as np
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt


def get_data(file_name):
    # Get data from ds18b20.log file
    df = pd.read_csv(file_name, names = ['Date', 'Temp'], header = None, sep = ' ')
    df['Date'] = [dt.datetime.strptime(datestr, '%Y-%m-%d_%H:%M:%S') for datestr in df['Date']]
    df['Temp'] = df['Temp']/1000.
    df.index = df['Date']
    return df

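def clean_up_data(df):
    # Clean up data with error reading of 85.0 (modifies df in place)
    temps = np.array(df['Temp'])
    # Corner cases: an error in the first or last entry copies its only neighbor
    if temps[0] == 85.:
        temps[0] = temps[1]
    if temps[-1] == 85.:
        temps[-1] = temps[-2]
    # Interior entries: replace the error with the mean of both neighbors.
    # The range skips both ends so that temps[i+1] never runs past the array.
    for i in range(1, len(temps) - 1):
        if temps[i] == 85.:
            temps[i] = np.mean([temps[i-1], temps[i+1]])
    df['Temp'] = temps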

Save this code as ds18b20_functions.py so that we can reuse it in our main program later.

The get_data function reads ds18b20.log and converts the data into the format described above. The function returns a pandas data frame. We also explicitly set df['Date'] as the index of the data frame, which makes it easier to retrieve part of the data later on, such as daily or hourly data.

The clean_up_data function cleans up the error data. It replaces each 85000 error reading (85.0 after conversion) with the average of the adjacent time slots (the previous and the next temperature entries). The function also takes care of the corner cases where the error occurs in the first or last data entry. We now have the data frame ready for further analysis.
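As an alternative sketch, the same clean-up can be expressed with pandas itself by marking the error readings as missing and interpolating over them. This is not the method used in the rest of the article; the function name clean_up_data_interp and the sample values are made up for illustration. Unlike the neighbor average, linear interpolation also copes with two or more consecutive error readings.

```python
import pandas as pd


def clean_up_data_interp(df, error_value=85.0):
    """Return a copy of df with error readings replaced by interpolated values.

    clean_up_data_interp is a hypothetical name; error_value is the reading
    treated as an error after conversion to degrees Celsius.
    """
    cleaned = df.copy()
    # Turn every error reading into NaN so that pandas treats it as missing
    cleaned['Temp'] = cleaned['Temp'].mask(cleaned['Temp'] == error_value)
    # Linear interpolation between valid neighbors; limit_direction='both'
    # also fills missing values at the very start and end of the series
    cleaned['Temp'] = cleaned['Temp'].interpolate(limit_direction='both')
    return cleaned


# Made-up sample with errors at both ends and two consecutive errors inside
sample = pd.DataFrame({'Temp': [85.0, 31.0, 85.0, 85.0, 34.0, 85.0]})
result = clean_up_data_interp(sample)
print(result['Temp'].tolist())
```

Note that this version returns a new data frame instead of modifying the one passed in, so the call would be df = clean_up_data_interp(df).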

Basic Data Analysis

We can easily plot the entire data log in a 2D graph using the Pandas plot function, which is a higher-level wrapper around matplotlib.pyplot.

from ds18b20_functions import *

df = get_data('ds18b20.log')
clean_up_data(df)

# plot the entire data log
plot_obj = df.plot(x = 'Date', y = 'Temp', figsize = (10,7), title = 'DS18B20 Temperature Reading', legend = None, grid = True, rot = 30)
plot_obj.set_ylabel('Temperature (Degree C)')
plt.show()

This will plot the entire data log in a graph:

DS18B20 temperatures plot

Well, the chart basically shows the temperature fluctuating between roughly 25 and 32 degrees Celsius. It is clearly summertime (I actually live in a tropical area), but there is not much to tell other than that.

Data Extraction

It makes more sense to look at the temperature data for a particular date. We will modify our code a little to get the daily data; the rest of the code is basically the same as in the previous example:

from ds18b20_functions import *

df = get_data('ds18b20.log')
clean_up_data(df)

# Get the data for a particular date, e.g. '2017-07-22'
df['Time'] = [timestr.time() for timestr in df['Date']]
july22 = df.loc['2017-07-22 00:00:00':'2017-07-22 23:59:59']

plot_obj = july22.plot(x = 'Time', y = 'Temp', figsize = (10,7), title = 'DS18B20 Temperature Reading on July 22', legend = None, grid = True, rot = 30)
plot_obj.set_ylabel('Temperature (Degree C)')
plt.show()

This gives much better insight into the daily fluctuation of temperatures. Creating the new column df['Time'], which contains only the time without the date, is not strictly necessary, but it gives a more readable x axis on the plot. df.loc[] selects part of the data frame based on the date index, which allows us to get the data for a particular date.

DS18B20 temperatures on a particular date

With this we should be able to view the data for any given date, or get weekly and monthly data with a minor modification of the df.loc[] range.
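As a rough sketch of such modifications, the snippet below selects a day, a week and a month from a data frame with a date-time index. The frame here is synthetic (a constant 30.0 every 10 minutes for July 2017) and only stands in for the cleaned log; the variable names are illustrative.

```python
import pandas as pd

# Synthetic stand-in for the cleaned log: one reading every 10 minutes
index = pd.date_range('2017-07-01 00:00:00', '2017-07-31 23:50:00', freq='10min')
df = pd.DataFrame({'Temp': 30.0}, index=index)

# A date string without a time selects the whole day
one_day = df.loc['2017-07-22']
# A slice of two dates includes both end dates in full
one_week = df.loc['2017-07-17':'2017-07-23']
# A year-month string selects the whole month
one_month = df.loc['2017-07']

print(len(one_day), len(one_week), len(one_month))
```

With 144 readings per day, the three selections contain 144, 1008 and 4464 rows respectively.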

Simple Moving Average

A Simple Moving Average (or Moving Average for short) takes a moving window of time and calculates the average, or mean, of the data during that period as the current value. In our case, we have a temperature reading every 10 minutes, so we can get an hourly moving average by averaging the temperature data over 6 time periods. Doing this in Pandas is incredibly fast and easy.

from ds18b20_functions import *

df = get_data('ds18b20.log')
clean_up_data(df)

# Get the data for a particular date, e.g. '2017-07-22'
df['Time'] = [timestr.time() for timestr in df['Date']]
df['Temp SMA'] = df['Temp'].rolling(window = 6).mean()
july22 = df.loc['2017-07-22 00:00:00':'2017-07-22 23:59:59']

plot_obj = july22.plot(x = 'Time', y = ['Temp','Temp SMA'], figsize = (10,7), title = 'DS18B20 Temperature Reading on July 22', grid = True, rot = 30)
plot_obj.set_ylabel('Temperature (Degree C)')
plt.show()

We calculate the moving average and add it to the data frame as a newly created column, then plot both the raw temperature data and the moving average on the same chart.

DS18B20 temperatures with hourly moving average
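To see what a six-sample rolling mean does numerically, here is a small sketch on made-up readings. The values are invented for illustration, and min_periods is an optional pandas argument that the article's own code does not use.

```python
import pandas as pd

# Seven made-up readings, one per 10-minute slot
temps = pd.Series([30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0])

# Six-sample window: the first five results are NaN because the window
# is not yet full; each later value is the mean of the latest six readings
sma = temps.rolling(window=6).mean()

# With min_periods=1 the mean is taken over however many readings exist
partial = temps.rolling(window=6, min_periods=1).mean()

print(sma.tolist())
print(partial.tolist())
```

In the listing above, the moving average is computed on the whole data frame before July 22 is selected, so the first values of that day are averaged with the last readings of July 21 rather than left empty.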

Data Resample

Sometimes too much detail does not necessarily provide clarity. For example, my ds18b20.log records the temperature every 10 minutes, but we may want to present just an hourly average. To do that, we need to resample the temperature data and calculate each hour’s average temperature. Luckily, this can be done easily with a pandas data frame.

from ds18b20_functions import *

df = get_data('ds18b20.log')
clean_up_data(df)

theDate = df.loc['2017-07-22':'2017-07-22']
# Resample the temperature data to hourly average
temp_resample = theDate.Temp.resample('H').mean()
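# Build the 24 hourly timestamps of the day; this assumes the log covers
# the whole day, so that the resampled data also has 24 values
dt_range = pd.date_range('2017-07-22', periods = 24, freq = 'H')
# Create a new data frame with the resampled data
hourly_average = pd.DataFrame({'Time': dt_range, 'Temp': np.array(temp_resample)}, index = dt_range)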

plot_obj = hourly_average.plot(x = 'Time', y = 'Temp', figsize = (10,7), title = 'DS18B20 Temperatures for July 22 with Hourly Average', legend = None, grid = True, rot = 30)
plot_obj.set_ylabel('Hourly Average Temperature (Degree C)')
plt.show()

DS18B20 temperatures with resampled hourly average
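The effect of resampling is easy to check on a tiny made-up series. The sketch below uses two hours of synthetic readings and the frequency string '60min', which I chose here because it is accepted by both older and newer pandas releases; the listing above uses 'H'.

```python
import pandas as pd

# Two hours of made-up readings, one every 10 minutes
index = pd.date_range('2017-07-22 00:00:00', periods=12, freq='10min')
temps = pd.Series([30.0] * 6 + [33.0] * 6, index=index, name='Temp')

# Group the readings into one-hour bins and average each bin
hourly = temps.resample('60min').mean()

print(hourly)
```

The result is already indexed by the start of each hour, so it may be possible to plot it directly (for example with hourly.plot()) without building a separate date range, which would also avoid a length mismatch when a day is only partly logged.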

Summary

Data manipulation and presentation are the basics of any data analysis. In this article, I showed how to extract partial data from a text data log, along with other data manipulation techniques such as calculating a moving average and resampling a data set. With these examples, it is easy to modify the code to meet other requirements.

That’s all for now, have fun with your data!

The code examples are available at my GitHub repository.
