profile picture

Linear regression in sqlite

2021 January 06 - 1041 words - 6 mins - sqlite prediction emonitor

Linear regression is a very simple, powerful tool to predict the future. Especially when directly implemented in sqlite. This article is based on data from my PV system. From which I want to predict three things:

These predictions can be useful to plan when to turn on home appliances, and thus use more of the self produced energy and rely less on the grid. I cannot base the prediction on historical data, and due to clouds and shade of nearby trees I cannot derive the prediction from the time the sun rises and sets. So the prediction will be based on the last 30 days of data.

Approach

The linear regression is implemented in sqlite, because, that's where the data is. The python code handling the result can remain very simple, which is a nice side effect. To do this we start simple, and wrap each result in an outer query, which runs over the previous result. So let's implement a linear regression on time series data in sqlite!

Source Data

The source data looks like this:

select TimeStamp,ETotal from spotdata limit 10 offset 2000;
TimeStampETotal
159842790616329548
159842792616329548
159842794616329549
159842798616329549
159842800616329550
159842802616329550

A timestamp and an ever increasing counter.

Relying heavy on sql windows functions we transform this into a input for our regression. This table will be referred to as input in the rest to this article.

daystartmaxendcount
12897051420514411
22617933600524991
... data omitted for readability ...
293166034860530431
303112168580686291

Here we see for the last 30 days at which time on that day the energy production:

The last 'count' column only exists as an artefact and will be omitted. The time is counted in seconds from the start of the day at midnight.

Linear Regression

source data

This image shows what I want to achieve. The linear regression for each series is overlaid in gray. This way I can extend the regression line and predict the values for the current day and tomorrow.

As you can see, a linear regression is the best fitting line for the series. Best fitting means the total distance of all point to the line is minimized.

Definition

The equation can be found in any statistical analytics text book:

y=m∙x+b

Where m is called the slope which determines the direction of the line and b is called the intercept which defines the distance of the line from 0.

m is calculated:

m=∑(xi - ẍ)(yi - ÿ) / ∑(xi - ẍ)2

and b is calculated using m :

b=max y - max x ∙ m

Where and ÿ are the averages of x and y.

Slope

First we calculate (xi - ẍ)(yi - ÿ). We do this by using the window function for avg. This adds the average to the each row in the result set:

select
	day, avg(day) over () as day_bar,
	start, avg(start) over () as start_bar,
	max, avg(max) over () as max_bar,
	end, avg(end) over () as end_bar
from input
dayday_barstartstart_barmaxmax_barendend_bar
115.52897030647.05142045272.05144154322.8
215.52617930647.03360045272.05249954322.8
315.52752930647.05004045272.05110254322.8
........................

Next we add the summaries as follows to calculate the slope:

select
	sum((day - day_bar) * (start - start_bar)) / 
		sum((day - day_bar) * (day - day_bar)) as start_slope,
	sum((day - day_bar) * (max - max_bar)) / 
		sum((day - day_bar) * (day - day_bar)) as max_slope,
	sum((day - day_bar) * (end - end_bar)) / 
		sum((day - day_bar) * (day - day_bar)) as end_slope
from (
	<previous query omitted for readability>
)
start_slopemax_slopeend_slope
128.292324805339206.255839822024109.764627363737

Intercept

With the slope we can calculate the intercept.

b=max y - max x ∙ m

As you can see in the equation, we also need the max in the previous query.

select
	start_slope,
	start_bar_max - day_bar_max * start_slope as start_intercept,
	max_slope,
	max_bar_max - day_bar_max * max_slope as max_intercept,
	end_slope,
	end_bar_max - day_bar_max * end_slope as end_intercept
from (
select	
	sum((day - day_bar) * (start - start_bar)) / 
		sum((day - day_bar) * (day - day_bar)) as start_slope,
	sum((day - day_bar) * (max - max_bar)) / 
		sum((day - day_bar) * (day - day_bar)) as max_slope,
	sum((day - day_bar) * (end - end_bar)) / 
		sum((day - day_bar) * (day - day_bar)) as end_slope,
	max(day_bar) as day_bar_max,
	max(start_bar) as start_bar_max,
	max(max_bar) as max_bar_max,
	max(end_bar) as end_bar_max
from (
	<previous query omitted for readability>

start_slopestart_interceptmax_slopemax_interceptend_slopeend_intercept
128.29232480533928658.4689655172206.25583982202442075.0344827586109.76462736373752621.4482758621

Predict the future!

With the slope and intercept for each series it's trivial to calculate the prediction for today or tomorrow. The example uses 30 days history, so tomorrow is the 31st day.

We add the following outer query to the query from above:

select 	
	time(start_slope * 31 + start_intercept, 'unixepoch', 'localtime') 
	as start_time,
	time(max_slope * 31 + max_intercept, 'unixepoch', 'localtime') 
	as max_time,
	time(end_slope * 31 + end_intercept, 'unixepoch', 'localtime') 
	as end_time
from (
	<previous query omitted for readability>
)

And here is our prediction:

start_timemax_timeend_time
09:03:5513:27:4815:33:44

The result of an linear integration, over time series data, right in sqlite! The result is ready to be used in any control or visualization.

To conclude, the complete query:

select 
	time(start_slope * 31 + start_intercept, 'unixepoch', 'localtime') as start_time,
	time(max_slope * 31 + max_intercept, 'unixepoch', 'localtime') as max_time,
	time(end_slope * 31 + end_intercept, 'unixepoch', 'localtime') as end_time
from (

select	
	start_slope,
	start_bar_max - day_bar_max * start_slope as start_intercept,
	max_slope,
	max_bar_max - day_bar_max * max_slope as max_intercept,
	end_slope,
	end_bar_max - day_bar_max * end_slope as end_intercept
from(

select	
	sum((day - day_bar) * (start - start_bar)) / sum((day - day_bar) * (day - day_bar)) as start_slope,
	sum((day - day_bar) * (max - max_bar)) / sum((day - day_bar) * (day - day_bar)) as max_slope,
	sum((day - day_bar) * (end - end_bar)) / sum((day - day_bar) * (day - day_bar)) as end_slope,
	max(day_bar) as day_bar_max,
	max(start_bar) as start_bar_max,
	max(max_bar) as max_bar_max,
	max(end_bar) as end_bar_max
from (

select	
	day, avg(day) over () as day_bar,
	start, avg(start) over () as start_bar,
	max, avg(max) over () as max_bar,
	end, avg(end) over () as end_bar
from input
)
)
)