Lights on: Electricity price prediction using Decision tree and SVM in Python – Wonder Mahembe

The price of electricity depends on many factors. Predicting it helps businesses understand how much they must pay for electricity each year. The Electricity Price Prediction task is based on a case study in which a data analyst needs to predict the daily price of electricity based on the daily consumption of heavy machinery used by businesses.

Suppose that a business relies on computing services where the power consumed by its machines varies throughout the day. The company would not know the actual cost of the electricity consumed by the machines throughout the day, but a rich set of historical data on the daily price of the electricity consumed by the machines has been provided. This project uses decision trees and support vector machines to predict the price of electricity per day. The purpose of the project was not to predict the exact price, but rather whether the price would end up below or above a certain threshold in a day.

Description of the solution

For this task, I began with cleaning the data. I observed that there were some invalid entries in the dataset which were in the form of question marks. To omit these from the analysis, I first replaced them with NA values, then used a Pandas function to drop the rows with NA values.
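The cleaning step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the column names follow the post, but the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are invented).
raw = pd.DataFrame({
    "SystemLoadEA": ["3159.6", "?", "2973.0", "2834.0"],
    "SMPEA": ["54.32", "54.23", "?", "51.95"],
    "SMPEP2": ["59.10", "58.20", "58.33", "?"],
})

# Replace the '?' placeholders with NaN, then drop every row containing one.
cleaned = raw.replace("?", np.nan).dropna()

print(len(raw), "rows before,", len(cleaned), "rows after")
```

If the data is loaded from a CSV file, passing `na_values="?"` to `pd.read_csv` would likely achieve the same replacement at load time.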

The next step was to understand the data types I was going to be dealing with. I noticed that numerous columns had been captured as strings (objects) when they were in fact decimal numbers (floats), so I converted them to the correct types. I also ensured that the variable “Holiday” was classified as a category, since it is a categorical variable taking the values 0 and 1.
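A possible way to do the type conversion with Pandas is shown below. The toy values and the list of numeric columns are assumptions for illustration; the real dataset has more columns.

```python
import pandas as pd

# Toy frame in which every column was read in as a string (object).
df = pd.DataFrame({
    "SystemLoadEA": ["3159.60", "2973.01", "2834.00"],
    "SMPEA": ["54.32", "54.23", "51.95"],
    "Holiday": ["0", "0", "1"],
})

# Convert the numeric columns from strings to floats.
numeric_cols = ["SystemLoadEA", "SMPEA"]  # assumed subset of the real columns
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col])

# Treat the 0/1 holiday flag as a categorical variable.
df["Holiday"] = df["Holiday"].astype(int).astype("category")

print(df.dtypes)
```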

After the variable types were correctly assigned, I conducted a further data exploration using correlation analysis on all numeric variables. This is the metric I utilised to drop some of the features of the dataset. I selected the six features that had the strongest correlation with the price of electricity, which were: ‘PeriodOfDay’, ‘SystemLoadEA’, ‘SMPEA’, ‘SystemLoadEP2’, ‘DayOfWeek’, and ‘ForecastWindProduction’. The heatmap shows the correlation coefficients.
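Ranking features by their correlation with the target could look roughly like this. The data is synthetic, and ranking by absolute correlation is my assumption here, since a strong negative correlation is as informative as a strong positive one. The toy example keeps the top two features where the analysis kept six.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic data: the target SMPEP2 is built to track SMPEA closely,
# while the other two columns are unrelated noise.
smpea = rng.normal(50, 10, n)
df = pd.DataFrame({
    "SMPEA": smpea,
    "SystemLoadEA": rng.normal(3000, 300, n),
    "ForecastWindProduction": rng.normal(1000, 200, n),
    "SMPEP2": smpea + rng.normal(0, 3, n),
})

# Correlation of every feature with the target, strongest first.
corr = df.corr()
ranking = corr["SMPEP2"].drop("SMPEP2").abs().sort_values(ascending=False)

top_features = ranking.head(2).index.tolist()
print(ranking)
```

The full matrix in `corr` is also what a heatmap function such as `seaborn.heatmap` would take as input.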

Since I wanted to use classification algorithms in this task, my next step was to convert the target variable SMPEP2 to a binary value, with the price split into two classes: one below and one above a certain threshold. After checking the target variable’s distribution using the mean, mode, median, and range, and visually using a histogram, I used the median to split the variable into two classes. The purpose of my prediction was hence to determine whether the price of electricity went above or below the median price based on the selected features.
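The median split can be expressed in a couple of lines. The prices below are invented, and putting values exactly equal to the median in the lower class is an assumption on my part.

```python
import pandas as pd

# Invented prices standing in for the SMPEP2 column.
prices = pd.Series([40.0, 55.0, 48.0, 120.0, 52.0, 47.0], name="SMPEP2")

# The median is robust to the occasional extreme price (120.0 here).
threshold = prices.median()

# 1 if the price is above the median, 0 otherwise.
above_median = (prices > threshold).astype(int)

print(threshold)
print(above_median.value_counts())
```

Splitting at the median gives two classes of roughly equal size, which keeps accuracy a meaningful metric.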

I then split my data into the training and test sets and fit the decision tree classifier. With the model fitted, I was able to use the test dataset to evaluate the decision tree classifier based on the confusion matrix, accuracy, F1 score, precision, and recall using functions from scikit-learn. I also conducted a cross validation of the model using five folds.
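The workflow in this step can be sketched with scikit-learn as follows. The data is synthetic (generated with `make_classification`), and the 80/20 split, the random seeds, and the default tree hyperparameters are all assumptions rather than the settings used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the six selected features and the binary target.
X, y = make_classification(n_samples=400, n_features=6, random_state=42)

# Hold out 20% of the rows for testing (assumed split ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

# Evaluate on the held-out test set.
cm = confusion_matrix(y_test, y_pred)
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
}

# Five-fold cross-validation on the full dataset with a fresh estimator.
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)

print(cm)
print(metrics)
print(cv_scores.mean())
```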

Comparing model performances

Comparing the confusion matrices of the two models, we can see that both models have relatively high true positive rates and true negative rates. However, the Decision Tree Classifier had slightly more true positives and true negatives than the SVM, and fewer false negatives and false positives, indicating that it may be performing slightly better than the SVM.

The next plot contains a comparison of the two models using other metrics: ‘Accuracy’, ‘F1 Score’, ‘Precision’, and ‘Recall’. Based on these performance metrics, the decision tree classifier outperformed the SVM across the board, with higher accuracy, F1 score, precision, and recall.

The next plot compares the cross-validation scores of the two models based on five-fold cross validation. As shown, the mean cross-validation score for the decision tree classifier is 0.70, while for SVM it is 0.74. This means that the SVM model performed better on average across all folds during cross-validation. However, the decision tree classifier had a more consistent performance across all folds as indicated by the smaller variance in the cross-validation scores. Overall, while the decision tree classifier had a slightly lower average score, it may be a more robust model due to its consistency.
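A comparison of this kind, looking at both the mean and the spread of the fold scores, can be sketched as below. The data is synthetic, so the numbers it prints will not match the 0.70 and 0.74 reported above. Scaling the features before the SVM and using the default `SVC` kernel are my assumptions, not necessarily what the notebook does.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the six selected features and the binary target.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    # SVMs are sensitive to feature scale, hence the scaler (assumed step).
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    # Mean measures average performance; std measures fold-to-fold consistency.
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.2f}, std={scores.std():.3f}")
```

A higher mean with a larger standard deviation, as reported for the SVM, suggests a model that is better on average but less predictable from fold to fold.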

Interpretation and actionable insights

After all the models had been fitted and evaluated, I then set out to interpret the relationship between the features and the price of electricity. The purpose of this was to develop actionable insights for businesses. I plotted visualisations for each of the features used in the model, but below are three selected visualisations briefly explained. The first visualisation shows the relationship between period of day and electricity price in absolute terms.

The chart shows that there were two main price spikes during the day, each lasting for about five hours. Assuming that period zero is midnight, electricity prices were relatively low around midnight, with prices rising as the workday began. There were even bigger price spikes from the early to the late evening, with prices calming down again close to midnight. The next plot shows the relationship between forecasted system load and electricity price.

The chart shows that prices remained largely unchanged with higher values of forecasted system load. However, as the forecasted load rose even more, the price became increasingly varied with high peaks. The last selected chart shows the relationship between forecasted electricity price and actual price.

The relationship shows that the actual price of electricity rose and fell with the forecasted price to a large extent. However, at higher and higher forecasted prices, the actual price’s response became unstable.

Suggesting actionable insights for businesses

Based on the findings discussed here and on the attached notebook, the following insights can help businesses going forward:

Machine learning models are a proven way for businesses to use data for predicting electricity prices. Energy-intensive businesses should consider investing in working algorithms for predicting electricity prices.

There are many factors positively correlated with the price of electricity. The forecasted price, for instance, is very important to consider. Businesses that cannot afford to use major ML algorithms could still rely on the predicted price to anticipate future prices. However, they should note that the predictions for very high prices do not always come true.

The forecasted and actual system loads are not the strongest predictors, but still insightful. A takeaway for businesses, in this case, is that they should watch out for very high predictions and actual values of system load. This is because the price of electricity becomes highly likely to shoot up when the power grid gets extremely strained.

Lastly, energy-intensive businesses should consider operating at full capacity during times of the day that are associated with lower average electricity prices, such as from around midnight to the early morning. If operating during late afternoons and evenings, businesses should expect to pay quite a bit more for electricity.
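The time-of-day recommendation above rests on the average price per period of the day, which could be computed along these lines. The column names follow the post, while the three periods and their prices are invented purely for illustration.

```python
import pandas as pd

# Invented observations: two price readings for each of three periods.
df = pd.DataFrame({
    "PeriodOfDay": [0, 0, 1, 1, 2, 2],
    "SMPEP2": [40.0, 44.0, 60.0, 70.0, 50.0, 54.0],
})

# Average price for each period of the day.
mean_price = df.groupby("PeriodOfDay")["SMPEP2"].mean()

cheapest = mean_price.idxmin()
priciest = mean_price.idxmax()

print(mean_price)
print("cheapest period:", cheapest, "priciest period:", priciest)
```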
