AI-Driven Home Energy Management System for Profit and Grid Stability
Deep Reinforcement Learning and Predictive Models for Minimizing Peak Demand While Balancing Battery Degradation in a Dynamic Environment

Degree project report in Bachelor's Programme in Electrical Engineering

Adam Michelin

Department of Electrical Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2025
www.chalmers.se

Degree project report 2025

© ADAM MICHELIN, 2025.
Supervisor & Examiner: Thomas Hammarström, Department of Electrical Engineering, Chalmers University of Technology

Degree project report 2025
Department of Electrical Engineering
Chalmers University of Technology
SE-412 96 Gothenburg
Sweden
Telephone +46 31 772 1000

Cover: Futuristic abstract home (Generated with Leonardo AI, Feb 2025)
Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Gothenburg, Sweden 2025

Abstract

This thesis presents the development and implementation of an AI-driven home energy management system designed to optimize residential battery storage in response to Sweden's new power-based electricity tariffs, which introduce capacity fees based on average monthly power peaks starting January 2027. The system integrates three components: (1) multi-modal forecasting models for electricity prices, solar production, and household demand; (2) a Recurrent Proximal Policy Optimization (RPPO) reinforcement learning agent for real-time battery control; and (3) automated orchestration via Prefect with Home Assistant integration. The forecasting stack (XGBoost and temporal convolutional networks (TCN)) achieves competitive accuracy, and the RL agent, trained on a custom reward balancing cost, solar utilization, and safety, learns price arbitrage and solar-aware charging strategies. Field deployment on a 22 kWh battery with a 20 kW dual-orientation PV array demonstrates integration with real hardware and shows preliminary economic benefits under simulated seasonal conditions. The agent maintains 100% safety compliance (zero charge/discharge violations during final deployment) while achieving high grid independence.
Although additional computational time for full training convergence and hyperparameter tuning remains future work, these preliminary results underscore the strong potential of AI-driven residential energy management for cost savings and grid support.

Keywords: AI, RL, Home Energy Management System, Software.

Acknowledgements

I would like to express my sincere gratitude to my supervisor and examiner, Thomas Hammarström, for his guidance and support throughout this project. His expertise and constructive feedback have played a significant role in shaping this work. I am deeply grateful to Chalmers University of Technology for providing the academic environment and resources necessary for this research. Special thanks to the Department of Electrical Engineering for their sponsorship.

This project would not have been possible without the incredible open-source community. I extend my appreciation to the developers and maintainers of the tools that formed the backbone of this work: Prefect for workflow orchestration, Gymnasium for reinforcement learning environments, XGBoost, PyTorch, and TensorFlow for other machine-learning implementations, and Home Assistant for smart home integration. Particular thanks to the forecast.solar team for providing reliable solar prediction services that enabled accurate PV forecasting.
Adam Michelin, Gothenburg, June 2025

List of Acronyms

AI        Artificial Intelligence
ANN       Artificial Neural Networks
API       Application Programming Interface
BMS       Battery Management System
CET       Central European Time
COP       Coefficient of Performance
DRL       Deep Reinforcement Learning
ETL       Extract, Transform, Load
EUPHEMIA  Pan-European Hybrid Electricity Market Integration Algorithm
GBT       Gradient-Boosted Trees
GRU       Gated Recurrent Unit
HEMS      Home Energy Management System
HMM       Hidden Markov Model
HTML      HyperText Markup Language
kW        Kilowatt
kWh       Kilowatt-hour
MAE       Mean Absolute Error
MDP       Markov Decision Process
ML        Machine Learning
PV        Photovoltaic
PVGIS     Photovoltaic Geographical Information System
R2        Coefficient of Determination
RESTful   Representational State Transfer
ROI       Return on Investment
RPPO      Recurrent Proximal Policy Optimization
SE3       Swedish Electricity Price Zone 3
SEK       Swedish Krona
SMOTE     Synthetic Minority Oversampling Technique
SoC       State of Charge
TCN       Temporal Convolutional Network
VAT       Value Added Tax
XGBoost   eXtreme Gradient Boosting

Contents

List of Acronyms  ix
Nomenclature  xi
List of Figures  xiii
List of Tables  xvii
1 Introduction  1
  1.1 Background  1
  1.2 Purpose  3
  1.3 Goals  4
  1.4 Limitations / Demarcations  5
2 Theory  7
  2.1 Electricity–pricing context  8
  2.2 Physical system modeling  9
    2.2.1 Battery State-of-Charge (SoC) Dynamics  9
    2.2.2 Photovoltaic Dynamics  10
  2.3 Machine Learning  12
    2.3.1 Time-series forecasting  13
      2.3.1.1 Neural Networks (NN)  14
      2.3.1.2 Temporal Convolutional Networks (TCN)  16
      2.3.1.3 Long Short-Term Memory (LSTM)  17
      2.3.1.4 Gradient-Boosted Trees  18
    2.3.2 Deep Reinforcement Learning  19
      2.3.2.1 Markov Decision Process (MDP)  20
      2.3.2.2 Proximal Policy Optimization & Recurrent PPO algorithms  20
  2.4 Performance metrics  21
3 Methods  23
  3.1 Work-flow overview  24
  3.2 Data acquisition & pre-processing  25
  3.3 Forecasting models  26
    3.3.1 Price Model  26
      3.3.1.1 Trend Model  27
      3.3.1.2 Peak & Valley Models  28
      3.3.1.3 Merged Model  29
    3.3.2 Home Demand Model  29
    3.3.3 Solar Prediction Model  31
  3.4 Reinforcement learning agent  32
    3.4.1 Environment Design  32
    3.4.2 Reward Function  33
    3.4.3 Training Regime  36
    3.4.4 Benchmark Strategies  37
  3.5 Orchestration & Automation  38
  3.6 Validation & Evaluation Procedures  39
4 Results  41
  4.1 Price Models  41
    4.1.1 Price extreme detection  41
    4.1.2 Price trend model  45
    4.1.3 Merged price model  46
  4.2 Solar forecasts  47
  4.3 Demand Model  49
  4.4 Reinforcement Learning Agent  51
    4.4.1 Performance Results  51
    4.4.2 Battery Operation and SoC Dynamics  53
5 Conclusion  55
  5.1 Ethical and Environmental Considerations  56
Bibliography  57

List of Figures

1.1 Rule-based vs adaptive control: fixed rules (left) versus data-driven adjustments (right).  2
1.2 System purpose diagram  3
1.3 System goal diagram  4
2.1 Theory chapter overview  7
2.2 Real-time energy-flow dashboard in Home Assistant.  9
2.3 Symbolic proportion of delivered energy versus heat losses.  10
2.4 (a) Relative irradiance reduction cos(θz − β) as a function of incidence angle for two fixed tilts: horizontal (β = 0°) and a south-facing roof pitch (β = 27°). (b) Geometric interpretation of the cosine law [12].  11
2.5 Horizontal (altitude–azimuth) coordinate system used in solar-position calculations.  11
2.6 Machine-learning chapter overview showing the different branches between the forecasting and reinforcement-learning classes  12
2.7 (a) Inputs for forecasting indoor temperature: outdoor temp, occupancy, HVAC set-point, indoor humidity. (b) Example h-step-ahead forecast.  13
2.8 Multilayer perceptron, a kind of neural network  15
2.9 Gradient descent to a global minimum on a loss surface  15
2.10 TCN model with l equal to input length, k equal to kernel size, b equal to dilation base, k ≥ b, and with a minimum number of residual blocks n for full history coverage, where n can be computed from the other values as explained above (source: https://unit8.com/wp-content/uploads/2021/07/image-50.png)  16
2.11 Internal structure of an LSTM cell, showing how the forget gate ft, input gate it, and output gate ot regulate the flow of information into and out of the cell state Ct and hidden state ht.  17
2.12 Illustration of gradient boosting: successive shallow trees are added to the ensemble, each one correcting the residual errors of the accumulated model.  18
2.13 Gridworld MDP for a cleaning robot: the agent starts in one of the nine states S1 . . . S9, must visit the dirty room at S5, and then navigate to the charging station at S9. Transitions occur on up/down/left/right actions, and the reward structure encourages cleaning and timely return to charge.  20
3.1 System architecture  23
3.2 Data acquisition pipeline showcasing the raw API signal process through validation and feature engineering to complete, usable, and clean data for the ML stack  25
3.3 Layout of the dual-orientation PV array with per-panel energy output (in kWh), where North is up  31
3.4 Prefect dashboard showing the deployments for the scheduling of the home system  38
3.5 Prefect markdown artifact generated after running the agent before and after turning on the sauna (left image before, right after the sauna has been on for some time.)  39
4.1 Hourly spot price (blue) with red triangles marking detected peaks (ground truth) over a two-week window.  42
4.2 Hourly spot price (blue) with red triangles marking valleys  42
4.3 Model performance for predicted peaks (top) and valleys (bottom) over one week of labeled data. (Underestimates due to some true peaks missing in labels.)  44
4.4 Trend-model performance for April 2024. Top: actual vs. predicted hourly SE3 prices (blue/red). Middle: hourly error (green = overestimate, red = underestimate). Bottom: daily MAE (orange) and average daily price (dashed blue). Metrics: MAE 20.18 öre, RMSE 26.77 öre, direction accuracy 0.57  45
4.5 Merged model over one-week validation. Top: actual (blue) vs. trend forecast (green). Middle: merged output (magenta) with peaks (red) and valleys (blue). Bottom: classifier probabilities (thresholds shown).  46
4.6 Hourly predicted (orange) versus actual (blue) solar energy production for the PV installation over five consecutive days (March 15–19, 2025), illustrating the close alignment of the forecasted and measured outputs.  47
4.7 Heatmap of predicted hourly solar energy production from March 1 to March 10, 2025. Each cell shows the forecasted kWh for that hour and date, with deeper reds indicating higher output.  48
4.8 Home Demand Model performance. Top: actual vs. predicted consumption (±1σ). Middle: HMM occupancy states and ambient temperature by hour/day. Bottom: HMM state timeline and residuals.  49
4.9 Top 20 features by XGBoost normalized gain in the Home Demand Model, highlighting heat-pump contribution, raw consumption metrics, occupancy, and lag-based predictors among others.  50
4.10 Monthly cost for different control strategies under a cloudy month (Jan 2025).  51
4.11 Net monthly benefit (SEK) for different control strategies under a sunny month (May 2025).  52
4.12 Maximum hourly import (kW) for different control strategies in January 2025.  52
4.13 Ten-day battery operation: SoC vs. electricity price (top), household consumption and PV production (middle), and hourly grid import with discount factor (bottom). Gray shading denotes nighttime discount periods.  53

List of Tables

2.1 Notation for the Swedish retail-electricity price model  8
2.2 Notation for symbols and parameters used in photovoltaics  10
2.3 Common activation functions  14
2.4 Notation for Reinforcement Learning, MDP and PPO-related symbols  19
2.5 Forecasting Metrics Used in This Project  22
3.1 All input features by category. Data range starts at 2017-01-01  27
3.2 Trained Model Parameters  36

1 Introduction

The transition towards renewable energy sources is reshaping residential electricity management. As Sweden implements power-based tariffs that charge consumers based on their highest monthly usage peaks, the need for intelligent home energy systems becomes critical.
This shift in pricing, beginning in 2025 for some network operators, creates both a challenge and an opportunity for homeowners to actively manage their consumption patterns. This thesis explores the development of an AI-driven home energy management system that leverages deep reinforcement learning to optimize battery storage operations. By intelligently coordinating solar production, battery charge/discharge cycles, and household consumption, the system aims to minimize electricity costs while contributing to grid stability through peak demand reduction. The work demonstrates how modern AI techniques can transform residential energy management from reactive rule-based approaches to adaptive, data-driven strategies that benefit both consumers and the broader electrical grid.

1.1 Background

The increasing integration of renewable energy sources, such as solar power, has introduced new challenges in energy management, particularly for residential and commercial buildings. Energy consumption patterns vary throughout the day, leading to peak demand periods that contribute to high electricity costs. Sweden is transitioning to an additional capacity-based fee system for households and small businesses. The new price model will be fully implemented by January 1, 2027, and some network operators, like Ellevio, are introducing it earlier (January 2025). Their implementation bases a part of the monthly electricity bill on the user's three highest monthly power peaks. We can assume that other operators will implement a similar pricing model [1], [2].

By incentivizing households to reduce their peak power consumption through capacity-based fees, Sweden aims to flatten the aggregate demand curve across the grid. When many homes shift their energy usage away from peak periods, this collective behavior reduces stress on transmission infrastructure, minimizes voltage fluctuations, and decreases the need for expensive grid reinforcements.
This creates a win-win scenario where consumers save money while contributing to a more resilient and stable energy system, addressing the core reason why Sweden is transitioning to this new tariff structure [9].

Home battery systems offer a viable solution for mitigating peak electricity demand, enhancing overall energy efficiency, and potentially generating income through participation as a micro-producer. By storing excess energy from solar panels or the grid during off-peak hours, these batteries can supply power when demand is high, reducing reliance on expensive grid electricity. However, optimizing the use of home batteries presents several challenges, including balancing energy storage, predicting peak demand, and minimizing battery degradation over time.

A well-optimized home energy system can lead to lower electricity costs for consumers while contributing to a more balanced and efficient power grid [3]. This represents a win-win solution for both users and electricity providers, as the primary goal of power-based tariffs is to encourage more even energy distribution and reduce grid strain [2].

Artificial intelligence (AI) and machine learning techniques, particularly reinforcement learning (RL), have shown significant promise in optimizing energy usage [6]. These methods can dynamically adjust battery charging and discharging strategies based on real-time data, predictive models, and user preferences. While traditional battery management systems rely on predefined rules, there is growing interest in adaptive, data-driven approaches that can respond dynamically to changing electricity prices and user behavior [3].

Figure 1.1: Rule-based vs adaptive control: fixed rules (left) versus data-driven adjustments (right).
1.2 Purpose

The purpose of this thesis is to design, implement, and experimentally evaluate an AI-driven battery optimization system that minimizes electricity costs while supporting grid stability. With Sweden transitioning to a power-based tariff system, residential homes and small businesses are forced to strategically manage their energy consumption to avoid high costs. More concretely, the work seeks to:

I Integrate a full-stack optimization platform, comprising multi-model forecasting (demand, spot prices, and solar production) together with a Recurrent PPO controller, into a real residential installation built around a 22 kWh battery and a dual-orientation 20 kW PV array.

II Automate end-to-end orchestration of data ingestion, model retraining, and control dispatch through a Prefect server workflow pipeline, thereby demonstrating a production-grade architecture that can run unattended. The solution will be integrated with Home Assistant (https://www.home-assistant.io/) for real-time monitoring and automation, providing a practical and user-friendly implementation of AI-powered energy management.

III Quantify economic and technical impact by comparing the AI system against rule-based and static-schedule baselines; although the project timeline does not allow for month-long field trials, general system performance with respect to the chosen metrics will be assessed by simulating the different systems.

Figure 1.2: System purpose diagram

1.3 Goals

To translate the overarching purpose into tangible outcomes, the thesis aims for six concrete goals:

I Establish a solid theoretical foundation by researching the state of the art in AI-based home energy management and Sweden's power-based tariff schemes.
II Develop accurate forecasting modules for household demand, day-ahead spot prices, and photovoltaic production to supply the controller with reliable short-term predictions.

III Design, train, and validate a reinforcement-learning controller that converts forecasts, battery constraints, and tariff rules into battery charge/discharge actions.

IV Deploy the complete optimization platform in a real residential setting, integrating hardware and software for continuous, unattended operation while providing an intuitive user interface.

V Evaluate practical performance and limitations through a combination of field trials and simulation studies, benchmarking the AI system against rule-based and static-schedule baselines.

VI Publish an open reference implementation (source code, configuration, and documentation) to facilitate replication and future research.

Figure 1.3: System goal diagram

1.4 Limitations / Demarcations

While the work aspires to provide an end-to-end proof of concept, several deliberate boundaries have been set to keep the project feasible within a bachelor-level time frame:

I Single-site pilot. All field experiments are conducted in one house located in Stockholm, equipped with a 22 kWh battery and a 20 kW dual-orientation PV array. Results may not generalise to other climates, tariff zones, hardware configurations, or user patterns.

II Evaluation horizon. Because of time constraints, the RL controller is validated over a couple of days. Long-term seasonal effects and battery ageing beyond this window are assessed only through simulation.

III Battery-centric control. The controller schedules only battery charge and discharge.
Flexible loads such as EV charging, heat pumps, and smart appliances are left unmanaged. While the codebase supports future integration, safe deployment would require additional safety logic and hyperparameter tuning that lie outside this study.

IV Simplified battery degradation model. Cycle life is approximated by an energy-throughput penalty calibrated and calculated on the specific battery installed; electrochemical ageing mechanisms (temperature, C-rate, SoC swing) are not explicitly modelled.

V Fixed 15-minute control granularity. The decision interval operates on a seemingly low resolution but is chosen for simplicity. Unforeseen demand spikes are safely handled by a buffer script to ensure proper home energy management safety. Faster RL dynamics remain outside the study scope.

VI Hardware and network reliability. The prototype runs on a server located at the homeowner's premises. Cyber-security hardening via HTTPS and ZeroTier¹ tunneling is in place to ensure safe integration, together with basic crash handling.

The above demarcations ensure that the thesis remains achievable while still demonstrating the viability of an AI-driven battery optimizer under Sweden's forthcoming power-based tariff regime. Future research can ease these constraints to address a more sophisticated and broader system.

¹ZeroTier (https://www.zerotier.com/) is a networking solution to securely connect virtual networks across various devices and locations.

2 Theory

This chapter lays the theoretical groundwork on which the remainder of the thesis is built. It opens with Sweden's emerging power-based tariff design and details how the new capacity-fee formula, based on the three highest hourly peaks, reshapes the household cost landscape (section 2.1).
Section 2.2 then translates the physical installation into mathematics: grid-exchange balances, battery state-of-charge dynamics, and a tilt-aware photovoltaic model that links solar geometry to electrical output. With the energy flows formalised, section 2.3 surveys the machine-learning tools that will later drive optimisation: gradient-boosted trees, temporal convolutional networks, LSTMs, and their integration into a Recurrent PPO reinforcement-learning framework. Finally, section 2.4 establishes the accuracy, cost-saving, and robustness metrics that will benchmark both forecasts and control policies throughout the study. Together these four sections provide the analytical scaffold required to understand the methods and results that follow.

Figure 2.1: Theory chapter overview

2.1 Electricity–pricing context

The monthly cost for a household customer is formalised in Equation (2.1), with all symbols defined in Table 2.1. At its core, this expression aggregates four main components:

(i) The per-kWh energy price set by Nord Pool's day-ahead market.
(ii) A per-kWh variable grid-energy charge imposed by the DSO.
(iii) A fixed monthly grid fee covering metering and service costs.
(iv) From 2025 onwards, a capacity fee based on peak power consumption [1].

The energy-based terms are multiplied by a uniform VAT rate [8], while the capacity fee mandated under Ei's EIFS 2022:1 regulation charges customers for their three highest hourly averages each month, thereby incentivising lower peaks [9].
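As a minimal numerical sketch of the capacity-fee mechanism described above (function and variable names, as well as the example β value, are illustrative and not taken from the thesis), the fee can be computed from a month of hourly mean powers as follows:

```python
def capacity_fee(hourly_means_kw, beta_sek_per_kw):
    """Capacity fee: beta times the mean of the three highest hourly
    average powers of the month (night-hour discounting omitted here)."""
    top_three = sorted(hourly_means_kw, reverse=True)[:3]
    return beta_sek_per_kw * sum(top_three) / len(top_three)

# Illustrative month: a ~2 kW base load with three sharp peaks.
powers = [2.0] * 717 + [7.5, 9.0, 6.0]
fee = capacity_fee(powers, beta_sek_per_kw=50.0)  # mean peak 7.5 kW -> 375.0 SEK
```

The sketch makes the incentive visible: shaving the single 9.0 kW peak lowers the whole month's fee, which is exactly the behaviour the tariff is designed to reward.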
To further encourage off-peak usage, many DSOs (e.g. Ellevio) apply a 50% discount on any registered peak between 22:00 and 06:00, effectively halving those hours when calculating the average of the top three peaks. This structure strongly signals to consumers that shifting energy usage, such as EV or battery charging, to nighttime hours reduces their capacity charges, aligning household behaviour with grid-stability goals [5].

Table 2.1: Notation for the Swedish retail-electricity price model

Symbol | Meaning | Unit
λ_t | Nord Pool spot price in hour t | öre/kWh
τ_E | Energy-tax rate (ex-VAT) | öre/kWh
φ | Variable grid-energy charge (ex-VAT) | öre/kWh
v | VAT rate (= 1.25) | –
E_t | Energy consumed in hour t | kWh
β | Capacity-fee unit price | SEK/kW
P_(i) | i-th largest hourly mean power in month M | kW
w_t | 0.5 during 22:00 to 05:59, 1 otherwise | –
C_fix | Fixed monthly grid fee | SEK

C_month = v Σ_{t∈M} (λ_t + φ + τ_E) E_t + C_fix + β · [w_t (P_(1) + P_(2) + P_(3))] / 3    (2.1)

2.2 Physical system modeling

A precise mathematical power-flow model is essential for linking control actions to cost outcomes. Grid exchange at time t is determined by Equation (2.2), where L_t is the household load, P^PV_t the PV output, and P^dis_t / P^ch_t the battery discharge/charge power. A positive P^grid_t denotes import, whereas a negative value denotes exported power. Figure 2.2 illustrates these flows in the live Home Assistant dashboard.

P^grid_t = L_t − P^PV_t − P^dis_t + P^ch_t    (2.2)

Figure 2.2: Real-time energy-flow dashboard in Home Assistant.

2.2.1 Battery State-of-Charge (SoC) Dynamics

The battery dynamic is governed by the round-trip efficiency, as shown in Equation (2.3). It quantifies the energy returned by a battery versus the energy put in, accounting for losses from heat, internal resistance, and inverter conversion. This means that the battery must discharge slightly more energy than the load actually receives (Fig. 2.3).

E_{t+1} = E_t + η_ch P^ch_t Δt − (P^dis_t Δt) / η_dis    (2.3)

where η_ch P^ch_t Δt is the net energy stored and (P^dis_t Δt)/η_dis the energy withdrawn.

Figure 2.3: Symbolic proportion of delivered energy versus heat losses.

2.2.2 Photovoltaic Dynamics

Table 2.2: Notation for symbols and parameters used in photovoltaics

Symbol | Meaning | Unit
δ | Solar declination angle (tilt north/south of equator) | rad
φ | Site latitude (positive north of equator) | rad
β | Panel tilt angle (from horizontal) | rad
γ | Panel azimuth angle (deviation from poles) | rad
H | Hour angle (solar-time offset: H = 0 at solar noon) | rad
θ_i | Solar incidence angle (between sun rays and panel normal) | rad
G(τ) | Plane-of-array irradiance at time τ | kW m⁻²
R_t | Raw radiation sum over [t − Δt, t] | kWh m⁻²
R^eff_t | Cosine-corrected radiation sum over [t − Δt, t] | kWh m⁻²
A_PV | Effective PV array area | m²
η_PV | PV conversion efficiency | –
P^PV_t | Average PV electrical power over interval | kW

Photovoltaic output is modeled from the total solar radiation over each control interval Δt [11]. First define the raw radiation sum:

R_t = ∫_{t−Δt}^{t} G(τ) dτ    [kWh m⁻²]    (2.4)

where G(τ) is the plane-of-array irradiance at time τ for hourly steps Δt. To account for the panel's tilt and orientation, we apply a cosine-law correction. Let

R^eff_t = ∫_{t−Δt}^{t} G(τ) cos(θ_i(τ)) dτ,    (2.5)

where θ_i(τ) is the solar incidence angle defined below and shown in Figure 2.4(b). If the PV array has effective area A_PV [m²] and module efficiency η_PV, treated as a constant for simplicity¹, the interval-averaged electrical power is

P^PV_t = η_PV A_PV R^eff_t / Δt    [kW]    (2.6)

With the usual Δt = 1 h, this reduces to P^PV_t = η_PV A_PV R^eff_t, i.e. power is directly proportional to the cosine-corrected hourly radiation.

¹η_PV depends partly on temperature, spectral effects, and long-term degradation, among other factors.

For a fixed panel tilt β and azimuth γ, the solar incidence angle on the module surface is given by Equation (2.7) [13].
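The radiation-to-power chain of Equations (2.4)–(2.6) can be sketched numerically. The irradiance samples, incidence angles, array area, and efficiency below are illustrative placeholders, not the values of the thesis installation:

```python
import math

def pv_power_kw(irradiance_kw_m2, incidence_deg, area_m2, eta):
    """Interval-average PV power per Eqs. (2.4)-(2.6): sum the
    cosine-corrected irradiance over hourly samples (a discrete stand-in
    for the integral) and scale by effective area and a constant
    module efficiency. Incidence angles beyond 90 deg contribute zero."""
    r_eff = sum(g * max(0.0, math.cos(math.radians(a)))  # Eq. (2.5)
                for g, a in zip(irradiance_kw_m2, incidence_deg))
    dt_h = len(irradiance_kw_m2)                         # hourly steps
    return eta * area_m2 * r_eff / dt_h                  # Eq. (2.6)

# One hour at 0.8 kW/m2 and 30 deg incidence, 100 m2 array, 20% efficiency:
p = pv_power_kw([0.8], [30.0], area_m2=100.0, eta=0.20)
```

The clipping at 90° simply encodes that a panel facing away from the sun receives no direct beam irradiance; the geometric origin of the cos θ_i factor is developed next.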
The effective plane-of-array irradiance is then G_eff(t) = G(t) cos(θ_i(t)). Figure 2.4(a) shows the relative factor cos(θz − β) for β = 27° versus β = 0°, highlighting that tilt increases the daily integral of G_eff by approximately 35%. Figure 2.4(b) illustrates the basic cosine geometry.

θ_i(t) = cos⁻¹[ sin δ sin φ cos β − sin δ cos φ sin β cos γ + cos δ cos φ cos β cos H + cos δ sin φ sin β cos γ cos H + cos δ sin β sin γ sin H ]    (2.7)

Figure 2.4: (a) Relative irradiance reduction cos(θz − β) as a function of incidence angle for two fixed tilts: horizontal (β = 0°) and a south-facing roof pitch (β = 27°). (b) Geometric interpretation of the cosine law [12].

Figure 2.5: Horizontal (altitude–azimuth) coordinate system used in solar-position calculations.

2.3 Machine Learning

Machine learning (ML) is a subfield of artificial intelligence focused on developing algorithms that automatically infer patterns, relationships, and decision rules from data, rather than relying on strict hard-coded rules. Over the past decade, advances in statistical learning theory, optimization methods, and scalable computing have enabled ML models to tackle increasingly complex tasks, from image classification to sequential decision-making.

In energy systems, the high dimensionality and nonlinearity of phenomena such as fluctuating demand, variable renewable generation, and time-varying market prices pose challenges for conventional rule-based controllers. ML approaches address these challenges by learning functional mappings directly from historical and real-time measurements, adapting to changing conditions, and continually improving as more data become available.

Broadly speaking, ML methods can be divided into:

• Supervised learning: Models trained on labeled input–output pairs to predict quantities of interest (e.g. load forecasting, price forecasting).
• Unsupervised learning: Techniques that discover structure in unlabeled data (e.g. clustering of consumption patterns, anomaly detection).

Each class offers distinct capabilities: supervised models excel at probabilistic forecasts, while unsupervised methods reveal hidden structure. In this chapter, we focus on the two ML components driving our AI-based energy optimizer: (1) time-series forecasters for generating multi-step-ahead predictions of electrical load, PV output, and spot prices, and (2) a deep reinforcement-learning agent that selects battery charge/discharge actions. The following sections outline the essential concepts and mathematical formulations underlying each model class before delving into their specific architectures and training algorithms.

Figure 2.6: Machine-learning chapter overview showing the different branches between the forecasting and reinforcement-learning classes
clustering of consumption patterns, anomaly detection).

Each class offers distinct capabilities: supervised models excel at probabilistic forecasts, while unsupervised methods reveal hidden structure. In this chapter, we focus on the two ML components driving our AI-based energy optimizer: (1) time-series forecasters for generating multi-step-ahead predictions of electrical load, PV output, and spot prices, and (2) a deep reinforcement-learning agent that selects battery charge/discharge actions. The following sections outline the essential concepts and mathematical formulations underlying each model class before delving into their specific architectures and training algorithms.

Figure 2.6: Machine learning chapter overview showing the different branches of the forecasting and reinforcement-learning classes (Time-Series Forecasting: Neural Networks, Temporal Convolutional Networks (TCN), Long Short-Term Memory (LSTM), Gradient-Boosted Trees; Reinforcement Learning: Markov Decision Process (MDP), Proximal Policy Optimization & Recurrent PPO).

2.3.1 Time-series forecasting

Time-series forecasting is a predictive modeling technique that analyzes past sequential data points to identify patterns and trends that can be extrapolated into predictions; see Figure 2.7(b). This differs from other modeling techniques, such as tabular forecasting, which treat each time series as an independent example. We define time-series forecasting mathematically by letting x_{1:t} = {x_1, ..., x_t} denote a history and x̂_{t+h} its h-step prediction.

Naturally, more accurate and robust forecasts can be obtained by incorporating multiple coherent data sources rather than relying solely on past target values.
For a simple real-world example, forecasting the indoor temperature of an office room, one might include the historical indoor temperature series x_{1:t} together with exogenous inputs

z_{1:t} = {outdoor temperature, occupancy, HVAC set-point, indoor humidity}_{1:t},

see Figure 2.7(a). A general multivariate forecasting function then takes the form

x̂_{t+h} = f(x_{1:t}, z_{1:t}).

Figure 2.7: (a) Inputs for forecasting indoor temperature: outdoor temperature, occupancy, HVAC set-point, indoor humidity. (b) Example h-step-ahead forecast.

Building on the general forecasting formulation introduced above, we now turn to a powerful class of nonlinear models capable of capturing complex, non-stationary relationships in time-series data: neural networks.

2.3.1.1 Neural Networks (NN)

Neural networks are a family of nonlinear function approximators inspired by the structure of biological brains. The simplest unit is the perceptron, which computes a weighted sum of its inputs plus a bias,

net = Σ_{i=0}^{n} w_i x_i + b,    (2.8)
o = σ(net),    (2.9)

where σ(·) is an activation function (e.g. sigmoid, ReLU; see Table 2.3) that introduces nonlinearity. By stacking many such units into layers, we obtain a multilayer perceptron (MLP), see Figure 2.8.

Table 2.3: Common activation functions

Function name | Definition
Unit step | σ_step(net) = 1 if net > 0, −1 otherwise
Linear | ϕ_lin(net) = net
Logistic (sigmoid) | σ_log(net) = 1 / (1 + e^{−net})
Hyperbolic tangent | σ_tanh(net) = (e^{net} − e^{−net}) / (e^{net} + e^{−net})
Rectified Linear Unit (ReLU) | σ_ReLU(net) = max(0, net)

Activation functions are a core component of neural networks, introducing the nonlinearity that enables these models to learn complex patterns beyond simple linear mappings. Early neural architectures predominantly used the logistic sigmoid function, which "squeezes" the weighted sum into the interval (0, 1), mirroring the biological concept of a neuron being inactive or active.
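The single unit of Equations (2.8)–(2.9) can be sketched in a few lines (an illustration; function names are ours):

```python
import math

def relu(net):
    """Rectified Linear Unit: zero below the threshold, identity above it."""
    return max(0.0, net)

def sigmoid(net):
    """Logistic function: squeezes the weighted sum into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))

def perceptron(x, w, b, activation=sigmoid):
    """One unit, Eqs. (2.8)-(2.9): weighted sum plus bias, then a nonlinearity."""
    net = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(net)
```

An MLP simply feeds the outputs of one layer of such units as inputs to the next.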
However, as networks have grown deeper and more complex, the Rectified Linear Unit (ReLU) has become the dominant choice: it closely mimics the biological neuron's threshold behavior, where an output "fires" only after the input surpasses a certain level, and it dramatically accelerates training convergence [14].

Training a neural network entails finding the set of weights W^(ℓ) and biases b^(ℓ) for each layer ℓ = 1, ..., L that minimize a chosen loss function J(θ). For regression tasks one typically uses the mean-squared error,

J(θ) = (1/N) Σ_{i=1}^{N} (y_i − f_θ(u_i))²,

where θ = {W^(ℓ), b^(ℓ)} collectively denotes all parameters, u_i the i-th input, and y_i its target. Computing the gradient ∇_θ J efficiently for deep, layered models is made possible by backpropagation, which applies the chain rule to propagate error signals from the output layer back to each parameter. Parameters are then updated iteratively.

Figure 2.8: Multilayer perceptron, a kind of neural network.

Figure 2.9: Gradient descent toward a global minimum on the loss surface.

2.3.1.2 Temporal Convolutional Networks (TCN)

Temporal Convolutional Networks (TCNs) are convolutional architectures specifically designed for sequence data. Unlike recurrent models, TCNs use 1D causal convolutions, filters that only span current and past time steps, so that at no point does information "leak" from the future. By stacking layers with exponentially increasing dilation (spacing between filter taps), a TCN achieves a very large receptive field with relatively few layers, allowing it to capture long-range dependencies efficiently.

Each convolutional layer is wrapped in a residual block: after applying a small stack of operations (convolution, weight normalization, ReLU and dropout), the block's input is added back to its output. These skip-connections stabilize training and help gradients flow through deep networks.
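The receptive-field growth from dilation stacking can be illustrated with a short calculation (a sketch; with one causal convolution per dilation level, each layer with kernel k and dilation d extends the field by (k − 1)·d, and typical TCN residual blocks contain two such convolutions):

```python
def receptive_field(kernel_size, dilations, convs_per_level=1):
    """Receptive field (in time steps) of stacked dilated causal convolutions."""
    r = 1  # the current time step itself
    for d in dilations:
        r += convs_per_level * (kernel_size - 1) * d
    return r

# Exponentially increasing dilations cover long histories with few layers:
# kernel 3 with dilations 1, 2, 4, 8, 16, 32 already spans 127 time steps.
```

This is why a handful of layers suffices to cover, for example, a full week of hourly history.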
In practice, TCNs combine the parallelism of CNNs with the sequential modeling power of RNNs, often outperforming LSTM/GRU on large forecasting benchmarks.

Figure 2.10: TCN model with input length l, kernel size k and dilation base b, with k ≥ b and a minimum number of residual blocks n for full history coverage, where n can be computed from the other values as explained above. Source: https://unit8.com/wp-content/uploads/2021/07/image-50.png

2.3.1.3 Long Short-Term Memory (LSTM)

Long Short-Term Memory networks (LSTMs) are a type of recurrent neural network [15] designed to capture long-range dependencies in sequence data while avoiding the vanishing- and exploding-gradient problems of standard RNNs. At their core, each LSTM cell maintains a cell state C_t, which carries information forward unchanged unless explicitly modified by learned gating mechanisms. Three gates regulate this flow:

• Forget gate f_t decides which information from the previous cell state C_{t−1} to discard.
• Input gate i_t determines which new information to add to the cell state, often computed as a candidate update C̃_t.
• Output gate o_t controls which parts of the updated cell state are exposed as the hidden state h_t.

By learning when to remember, update, or output information, LSTMs excel at modeling time series with complex temporal patterns, such as seasonality, trends, and irregular events, making them a cornerstone architecture for sequential forecasting tasks [15].

Figure 2.11: Internal structure of an LSTM cell, showing how the forget gate f_t, input gate i_t, and output gate o_t regulate the flow of information into and out of the cell state C_t and hidden state h_t.

2.3.1.4 Gradient-Boosted Trees

Gradient-Boosted Trees (GBT) are a powerful ensemble method that builds a strong predictive model by sequentially adding decision trees, each one trained to correct the mistakes of the ensemble so far.
Rather than fitting one large, complex tree, GBT constructs many small "weak learners" (shallow trees), combining them into a robust predictor. Key characteristics include:

• Stage-wise learning: each new tree focuses on the residual errors (differences between the true target and the current model's predictions) and attempts to reduce them.
• Shrinkage and learning rate: a small scaling factor (the learning rate) limits each tree's impact, encouraging gradual improvements and reducing overfitting.
• Tree complexity control: maximum depth, minimum child weight, and subsampling ratios govern each tree's size and the fraction of data used, providing regularization.
• Handling missing and categorical data: modern GBT implementations (e.g. XGBoost) automatically learn optimal "default directions" for missing values, and efficiently bin or one-hot encode categorical features.
• Feature importance and interpretability: by measuring how often and how effectively features split the data, GBTs offer built-in diagnostics to rank predictors and guide feature engineering.
• Scalability and speed: highly optimized libraries exploit parallel tree construction, out-of-core computation, and low-level optimizations to handle large-scale datasets.

In time-series forecasting, GBT models excel when combined with thoughtfully engineered features: lagged values, rolling statistics (mean, variance), and calendar indicators (hour of day, day of week). Their ability to capture nonlinear interactions without extensive hyperparameter tuning makes them a popular baseline and often a component in hybrid deep-learning ensembles.

Figure 2.12: Illustration of gradient boosting: successive shallow trees are added to the ensemble, each one correcting the residual errors of the accumulated model.
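The stage-wise residual-fitting idea can be sketched with one-split "stumps" on 1-D data (illustrative only; function names are ours, and real libraries such as XGBoost add regularization and second-order statistics):

```python
def fit_stump(x, residuals):
    """Find the single threshold split minimizing squared error on residuals."""
    best = None
    for thr in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda xi, thr=thr, lm=lm, rm=rm: lm if xi <= thr else rm

def boost(x, y, n_trees=50, lr=0.1):
    """Gradient boosting for squared loss: each stump fits current residuals,
    scaled by a small learning rate (shrinkage)."""
    pred = [0.0] * len(y)
    trees = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        trees.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: sum(lr * t(xi) for t in trees)
```

Because of shrinkage, each round removes only a fraction of the remaining residual, so the ensemble approaches the target gradually rather than in one aggressive step.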
2.3.2 Deep Reinforcement Learning

Deep reinforcement learning augments the standard RL framework by representing policies π_θ(a | s) and value functions V^π(s), Q^π(s, a) with deep neural networks parameterized by θ. Instead of tabular lookup, the network takes a raw state s_t (and, in recurrent variants, a hidden state h_{t−1}) and outputs either a distribution over actions or an estimate of the expected return. The "deep" component refers to the stacked layers that learn hierarchical features from high-dimensional inputs, enabling scalable solutions in complex environments.

Training proceeds by sampling trajectories {s_t, a_t, r_t, s_{t+1}} and updating θ to maximize the expected return

J(π_θ) = E_{π_θ}[ Σ_{t=0}^{T} γ^t r_t ].

Gradient estimators such as the policy gradient theorem allow backpropagation of ∇_θ J through the network.

Table 2.4: Notation for reinforcement-learning, MDP and PPO-related symbols

Symbol | Meaning | Unit / Domain
S | Set of all states (e.g. battery SoC) | –
A | Set of all actions (e.g. charge power level) | –
P(s′ | s, a) | Transition probability | –
R(s, a) | Immediate reward | currency / kWh
γ | Discount factor | [0, 1]
π_θ(a | s) | Parameterized policy | –
J(π_θ) | Expected return | cumulative reward
V^π(s) | Value of state s | cumulative reward
Q^π(s, a) | Value of (s, a) | cumulative reward
Â_t | Advantage estimate at time t | same as reward
r_t(θ) | Probability ratio π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t) | –
ϵ | Clipping parameter in PPO | –
L^CLIP(θ) | PPO clipped objective | –
θ | Network parameters | –
α | Learning rate | –
h_t | Recurrent hidden state | vector

To ground these concepts, Figure 2.13 shows a simple gridworld in which an agent (e.g. a cleaning robot) must navigate to clean a "dirty" room and then return to a charging station. Each cell S_i represents a distinct state; at each time step the agent chooses one of four actions (up, down, left, right) and transitions according to the MDP dynamics.
The reward function penalizes leaving the room dirty and incentivizes reaching the charging station before battery depletion. This toy environment illustrates how states, actions, transition probabilities, and rewards come together in a Markov decision process.

Figure 2.13: Gridworld MDP for a cleaning robot: the agent starts in one of the nine states S_1 ... S_9, must visit the dirty room at S_5, and then navigate to the charging station at S_9. Transitions occur on up/down/left/right actions, and the reward structure encourages cleaning and timely return to charge.

2.3.2.1 Markov Decision Process (MDP)

A Markov decision process is a framework for modeling sequential decision-making under uncertainty. It captures the dynamics of an environment in terms of states, actions, transitions, and rewards, providing the foundation upon which RL algorithms optimize behavior.

An RL problem is formalized as an MDP (S, A, P, R, γ). At each time t, the agent in state s_t ∈ S selects action a_t ∈ A according to π_θ(a_t | s_t), transitions with probability P(s_{t+1} | s_t, a_t), and receives reward r_t = R(s_t, a_t). The goal is to find π_θ that maximizes the discounted return G_t = Σ_{k=0}^{∞} γ^k r_{t+k}.

2.3.2.2 Proximal Policy Optimization & Recurrent PPO algorithms

Proximal Policy Optimization (PPO) updates the policy by maximizing the clipped surrogate objective in Equation (2.10), where r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t) and Â_t is an advantage estimate (e.g. GAE). This constraint on r_t prevents excessively large policy updates.

Recurrent PPO (RPPO) extends PPO by embedding an RNN (LSTM or GRU) to maintain a hidden state h_t, so that π_θ(a_t | s_t, h_{t−1}) can condition on past observations. During training, trajectories carry both s_t and h_{t−1}, and gradients propagate through time. This enables the agent to handle partial observability and learn temporal abstractions in environments with latent dynamics.

L^CLIP(θ) = E_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ϵ, 1 + ϵ) Â_t ) ]    (2.10)
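The clipped surrogate of Equation (2.10) is a one-liner over a batch of probability ratios and advantages; a minimal sketch (function name is ours):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate (Eq. 2.10) averaged over a batch of samples:
    the elementwise minimum of the unclipped and clipped terms caps the
    incentive for moving the ratio far from 1."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()
```

For example, a ratio of 1.5 with a positive advantage contributes only as if the ratio were 1 + ϵ = 1.2, which is exactly the mechanism preventing excessively large policy updates.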
2.4 Performance metrics

As shown in Table 2.5, a suite of standard forecasting performance metrics is employed to evaluate model accuracy and robustness. These metrics quantify errors between predicted values ŷ_t and actual observations y_t over a validation set of size N.

By combining absolute, squared, percentage, and robust loss functions, the evaluation framework captures different aspects of error behavior, ranging from average deviations to sensitivity toward large outliers. Together, these metrics provide a comprehensive understanding of model performance, enabling comparison across varying error scales and distributions.

The first group of metrics, Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE), focuses on error magnitude. MAE computes the average of absolute differences |ŷ_t − y_t|, treating all deviations equally and offering robustness to extreme values. In contrast, MSE squares each error term (ŷ_t − y_t)², thereby penalizing larger errors more heavily and emphasizing outliers. RMSE, defined as the square root of MSE, restores the error units to match those of the original data, making interpretation more intuitive while still retaining the squared-error emphasis on larger deviations.

The remaining metrics capture relative and distributional aspects of error. Mean Absolute Percentage Error (MAPE) expresses the average absolute error relative to the true value y_t, facilitating comparisons across series with different scales but suffering undefined values when y_t = 0.

Symmetric Mean Absolute Percentage Error (sMAPE) addresses this limitation by symmetrizing the denominator as (|y_t| + |ŷ_t|)/2, thus bounding the measure between 0% and 200% and reducing extreme percentage errors for values near zero.

The coefficient of determination R² quantifies the proportion of variance in y_t explained by the model, ranging from −∞ to 1, with values closer to one indicating a better fit.
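As an illustrative sketch (function name is ours), the point metrics above can be computed directly from paired predictions and observations:

```python
import math

def forecast_metrics(y_true, y_pred):
    """MAE, RMSE, sMAPE (%) and R^2 for paired observation/prediction lists."""
    n = len(y_true)
    errors = [yp - yt for yt, yp in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # symmetrized denominator keeps the percentage bounded near zero values
    smape = 100.0 / n * sum(
        abs(e) / ((abs(yt) + abs(yp)) / 2)
        for e, yt, yp in zip(errors, y_true, y_pred)
    )
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "sMAPE": smape, "R2": r2}
```

A perfect forecast yields MAE = RMSE = sMAPE = 0 and R² = 1; a forecast no better than the constant mean yields R² = 0.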
Finally, Huber Loss combines the properties of MAE and MSE by using a quadratic penalty for errors within a threshold δ and a linear penalty for larger residuals, offering a balanced approach that is less sensitive to outliers than MSE but more discriminative than MAE.

Table 2.5: Forecasting metrics used in this project

MAE = (1/N) Σ_{t=1}^{N} |ŷ_t − y_t|. Mean Absolute Error: average of the absolute differences between predicted values ŷ_t and true values y_t. Treats all errors equally and is robust to outliers.

MSE = (1/N) Σ_{t=1}^{N} (ŷ_t − y_t)². Mean Squared Error: average of the squared differences between ŷ_t and y_t. Penalizes larger errors more heavily, making it sensitive to outliers.

RMSE = sqrt( (1/N) Σ_{t=1}^{N} (ŷ_t − y_t)² ). Root Mean Squared Error: square root of MSE, providing an error measure in the same units as y_t. Emphasizes larger errors similarly to MSE.

MAPE = (100%/N) Σ_{t=1}^{N} |(ŷ_t − y_t)/y_t|. Mean Absolute Percentage Error: error as a percentage of the true value y_t. Useful for relative accuracy but undefined when y_t = 0.

sMAPE = (100%/N) Σ_{t=1}^{N} |ŷ_t − y_t| / ((|y_t| + |ŷ_t|)/2). Symmetric Mean Absolute Percentage Error: bounded between 0% and 200%; mitigates extreme percentage errors when values are near zero by symmetrizing numerator and denominator.

R² = 1 − Σ_{t=1}^{N} (y_t − ŷ_t)² / Σ_{t=1}^{N} (y_t − ȳ)², with ȳ = (1/N) Σ_{t=1}^{N} y_t. Coefficient of Determination: proportion of variance in y_t explained by the model. Ranges from −∞ to 1; values closer to 1 indicate a better fit.

Huber Loss = (1/2)(y_t − ŷ_t)² for |y_t − ŷ_t| ≤ δ, and δ(|y_t − ŷ_t| − δ/2) otherwise. Combination of MSE and MAE that is quadratic for small errors (|y_t − ŷ_t| ≤ δ) and linear for large errors (|y_t − ŷ_t| > δ). Balances sensitivity to outliers; δ is the threshold parameter.

3 Methods

This chapter explains how the Home Energy AI system was built, from raw data to live control. It documents the full data acquisition, modelling, control design, automation, and validation.
The study begins with a bird's-eye workflow, before detailing the two main components of the pipeline: (i) multi-modal forecasting of demand, price and solar production, and (ii) a safety-constrained reinforcement-learning agent that decides on battery usage.

Each subsequent section focuses on one link in that chain, describing the rationale behind key design choices, the data and tools used, and the procedures applied to verify performance. Throughout, emphasis is placed on reproducibility, robustness, and relevance to Swedish residential tariffs and climate data. Prefect was chosen as the orchestration engine because of its reliability, ease of scheduling, failure handling and thorough documentation.

Figure 3.1: System architecture

3.1 Work-flow overview

Figure 3.1 sketches the end-to-end data–model–control workflow that turns raw measurements into real-time commands for the house. The diagram is organized in lanes corresponding to each logical layer of the software stack.

Data layer

Flows, implemented in Prefect¹, run at weekly, hourly and 15-minute intervals to ingest external data (spot prices, weather fields, commodity indices) and internal measurements (household consumption, heat-pump metrics, solar output). Each flow's schedule is defined in code so that timing is predictable and failures automatically retry with logs. Raw API responses are passed through an ETL (Extract, Transform, Load) process: timestamps are aligned, basic validity checks (gaps, outliers) are applied, and lagged/rolling features are computed. Any detected anomalies are sent to the monitoring layer for validation. (See Section 3.2 for details.)

Machine learning pipeline

To prevent concept drift, all forecasting models and the RL agent are retrained on a weekly basis. The forecasting stack and the reinforcement-learning agent each have dedicated modules (Sections 3.3 and 3.4, respectively).
In brief, weekly flows assemble the latest cleaned data, run hyperparameter optimization for each forecasting model, and update model artifacts so that the control layer always uses up-to-date predictions.

Control layer

Every 15 minutes, a recurrent PPO agent (Section 2.3.2.2) evaluates the current state (prices, forecasts, state-of-charge, capacity-fee context) and proposes a battery control action. That action is filtered by a deterministic safety module, enforcing charge/discharge limits, SoC bounds and breaker constraints, before any command is sent to the house APIs. (Further orchestration details appear in Section 3.5.)

¹ Prefect is an open-source workflow orchestration tool [16].

3.2 Data acquisition & pre-processing

All raw signals enter through two pipelines.

15-Minute Home Data Update

Executes every 15 minutes. It gathers home-level measurements needed by the control agent: energy consumption, battery state-of-charge, heat-pump metrics and recent weather forecasts. All readings are aligned to Europe/Stockholm time to ensure consistency.

Hourly Exogenous Data Update

Executes every hour. This retrieves external inputs: day-ahead spot prices, CO2 intensity, fuel costs, grid-mix breakdown and coarse weather forecasts, then aligns them to CET and forward-fills missing values for a continuous hourly series.

Both pipelines use modular scripts that apply basic integrity checks (timestamps, gap detection and numerical validation) before feature engineering. The cleaned, feature-rich tables feed directly into the machine learning and control pipelines.

Figure 3.2: Data acquisition pipeline showing how raw API signals pass through validation and feature engineering to produce complete, clean, usable data for the ML stack.
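The alignment and gap-filling step can be sketched with pandas (a simplified illustration; the function and column names are ours, not the project's actual module API):

```python
import pandas as pd

def clean_hourly(raw: pd.DataFrame) -> tuple[pd.DataFrame, int]:
    """Align an hourly exogenous feed to local time, expose gaps, forward-fill."""
    df = raw.copy()
    df.index = pd.to_datetime(df.index, utc=True).tz_convert("Europe/Stockholm")
    df = df[~df.index.duplicated(keep="first")].sort_index()
    # reindex onto a complete hourly grid so missing hours become explicit NaNs
    full = pd.date_range(df.index.min(), df.index.max(),
                         freq="h", tz="Europe/Stockholm")
    df = df.reindex(full)
    n_gaps = int(df.isna().any(axis=1).sum())   # report gaps to monitoring
    return df.ffill(), n_gaps
```

Working in a timezone-aware index also handles the DST transitions that would otherwise corrupt hour-of-day features.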
3.3 Forecasting models

Effective forecasting is at the heart of our Home Energy AI system, where accurate price, demand, and solar predictions feed directly into the control agent's decisions. In this chapter, we describe three forecasting streams, electricity price (Section 3.3.1), home demand (Section 3.3.2), and solar production (Section 3.3.3), each built and validated separately before being stitched together for real-time operation.

We start by explaining why we opted for a trio of specialized sub-models rather than a single predictor (Trend, Peak, Valley for prices), and how we merge their outputs into a coherent hourly price curve. Next, we explain the demand model that captures household consumption patterns at a 15-minute resolution, followed by the solar model that uses weather and array-geometry data to forecast PV output. Each section covers data inputs, feature engineering, model architecture, training regimen, and performance metrics.

Our goal is not to chase perfect accuracy but to provide "good enough" forecasts, refreshed weekly, so that the RL agent can optimize battery charge/discharge decisions with minimal latency and maximal robustness.

3.3.1 Price Model

Forecasting electricity prices is difficult: prices depend on many factors (supply, demand, weather, fuel costs, transmission) [4]. European day-ahead prices are computed with the EUPHEMIA market-coupling algorithm across bidding zones [17]. Even SE3 prices, our target, reflect cross-border flows via that same process, which further complicates prediction [18].

Building one "all-in-one" model with every feature can capture subtle patterns, but it demands hard-to-get data, large compute budgets, and added complexity [4]. Instead, we use three focused sub-models:

• "Trend" for the overall price level.
• "Peak" detector for high-price hours.
• "Valley" detector for low-price hours.
Because merging multiple models can amplify forecast errors [4], we adopt the "specialized + merge" strategy, where each smaller model is faster to train and can be updated weekly with proper validation, yet still delivers predictions for battery control.

3.3.1.1 Trend Model

Table 3.1: All input features by category. Data range starts at 2017-01-01.

Grid: fossilFreePercentage, renewablePercentage, powerConsumptionTotal, powerProductionTotal, powerImportTotal, powerExportTotal, nuclear, wind, hydro, solar, unknown, import_SE-SE2, export_SE-SE4, import_NO-NO1, export_NO-NO1, import_DK-DK1, export_DK-DK1, import_FI, export_FI
Prices: PriceArea, SE3_price_ore, price_24h_avg, price_168h_avg, price_24h_std, hour_avg_price, price_vs_hour_avg, Gas_Price, Coal_Price, CO2_Price
Weather: temperature_2m, cloud_cover, relative_humidity_2m, wind_speed_100m, wind_direction_100m, shortwave_radiation_sum
Time & Holiday: hour_sin, hour_cos, day_of_week_sin, day_of_week_cos, month_sin, month_cos, is_morning_peak, is_evening_peak, is_weekend, season, is_holiday, is_holiday_eve, days_to_next_holiday, days_from_last_holiday

The Trend Model uses XGBoost to forecast the next 24 hours of SE3 day-ahead spot prices (öre/kWh) at an hourly resolution. Its main purpose is to capture the smooth backbone of the price curve so that subsequent models can focus on extreme spikes and dips.

Features. All features listed in Table 3.1 are generated by the hourly data acquisition pipeline (Section 3.2). These include grid-related metrics, historical price statistics, weather variables, and time/holiday encodings. Each hourly record represents a snapshot of the system state, with appropriate lagged and rolling statistics already applied.

Hyperparameter tuning. Every week, an Optuna study samples trials to identify optimal hyperparameters.
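The weekly tuning loop can be sketched as follows. To keep the sketch dependency-free we substitute plain random sampling for Optuna's sampler (Optuna's TPE sampler is smarter, but the trial structure is the same); the ranges shown mirror the demand-model search space listed later, and `train_eval` stands in for the project's actual rolling-split training routine:

```python
import math
import random

def random_search(train_eval, n_trials=50, seed=0):
    """Optuna-style study, sketched with random sampling: draw a candidate
    hyperparameter set, evaluate validation RMSE, keep the best trial."""
    rng = random.Random(seed)
    best_params, best_rmse = None, math.inf
    for _ in range(n_trials):
        params = {
            "max_depth": rng.randint(4, 12),
            "n_estimators": rng.randint(200, 1500),
            # log-uniform draw for the learning rate, as in the study
            "learning_rate": math.exp(rng.uniform(math.log(0.005), math.log(0.2))),
        }
        rmse = train_eval(params)   # train on rolling splits, return val. RMSE
        if rmse < best_rmse:
            best_params, best_rmse = params, rmse
    return best_params, best_rmse
```

The winning parameter set is what gets tagged as the production model for the week.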
The objective is squared-error regression, using RMSE as the evaluation metric, with the "hist" tree method and "depthwise" growth policy. The best-performing model on a rolling hold-out validation set is tagged as the current production version.

Training & validation. Training occurs weekly on the most recent 3–6 months of hourly data. We employ time-series cross-validation with forward-rolling splits, holding out one day at a time. Models that improve validation metrics replace the existing production model.

Inference. At runtime, the Trend Model outputs a 24-element vector of hourly price forecasts. These baseline predictions are later combined with outputs from the Peak and Valley detectors (Section ??) to form a composite forecast that balances overall trend accuracy with sensitivity to extreme price events.

3.3.1.2 Peak & Valley Models

Both the Peak and Valley Models use a temporal convolutional network (TCN) to classify each of the next 24 hours as an extreme event (peak or valley) or not. Instead of predicting a continuous price, each model outputs a 24-element binary vector, where "1" denotes a predicted peak (or valley) hour.

Features. Both models reuse the feature set in Table 3.1 (Section 3.3.1.1). Inputs are constructed as a sliding window of the past 168 hours (one week) of these features, yielding a 168 × (feature count) tensor for each model.

Peak labeling. Before training the models, we first needed to define what constitutes a "peak" or "valley" in the price series, a seemingly easy task on its own. To do this we scanned historical prices, and an hour is marked as a peak if it passes a derivative check that captures sharp rise-then-drop points. These labels (see Figure 4.1) serve as ground truth for Peak Model training.

Valley labeling. By inverting the peak criteria, an hour is labeled a valley if it passes a negative derivative check, thus capturing sharp drop-then-rise patterns.
These binary valley labels form the training targets for the Valley Model (see Figure 4.2).

Architecture. Each TCN has two stacked residual blocks:

• Three 1D convolutions (kernel 3, dilations 1, 2, 4), 64 filters each, followed by BatchNorm, ReLU, dropout, and a skip connection.
• Three 1D convolutions (kernel 3, dilations 8, 16, 32), same pattern.

A global average layer collapses the temporal dimension, and a final dense layer with 24 sigmoid outputs produces per-hour probabilities. Both models share this structure; they differ in class-weight and loss settings.

Training & Validation. Because peaks and valleys account for < 5% of hours, we apply SMOTE² to oversample minority-class examples, increasing extreme-event labels to ≈ 20% of the training data. After training, we sweep the probability threshold to balance precision and recall.

Inference & Performance. At inference, a threshold converts probabilities into binary flags. We visualize the model's output by mapping predicted probabilities to marker heights, so higher probabilities indicate a stronger predicted extreme (see Figure 4.3(a&b)). However, this approach is flawed: since the probabilities do not directly correspond to actual peak or valley magnitudes, it was chosen purely for ease of implementation.

² SMOTE (Synthetic Minority Oversampling Technique) is a machine learning technique used to address class imbalance in datasets. It generates synthetic samples of the minority class to help balance the class distribution and improve model performance.

3.3.1.3 Merged Model

After training the Trend, Peak, and Valley Models independently, we merge their outputs into a single prediction. This is done with simple override logic: any hour flagged as a peak or valley replaces the Trend Model's forecast for that hour, scaled by the labeled prediction probability to set the price magnitude.
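A sketch of that override merge. The thesis describes the probability scaling only loosely, so the (1 ± p) factors below, the function name, and the 0.5 threshold are our illustrative assumptions; only the flag-or-keep structure is taken from the text:

```python
def merge_forecasts(trend, peak_prob, valley_prob, threshold=0.5):
    """Override merge: a flagged hour replaces the trend value with a
    probability-scaled value; unflagged hours keep the trend forecast."""
    merged = list(trend)
    for h, (p, v) in enumerate(zip(peak_prob, valley_prob)):
        if p >= threshold:
            merged[h] = trend[h] * (1.0 + p)   # push flagged peaks above trend
        elif v >= threshold:
            merged[h] = trend[h] * (1.0 - v)   # push flagged valleys below trend
    return merged
```

The key property is that the extreme detectors never alter unflagged hours, so the trend backbone survives intact.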
Hours not flagged by either extreme model keep the Trend Model's value; see Figure 4.9 for reference. This merging is suboptimal, since the extreme-price models are not directly trained to predict the magnitude of an hour; the solution was used for its simplicity.

3.3.2 Home Demand Model

The Home Demand Model forecasts hourly residential electricity consumption using an XGBoost regressor with extensive feature engineering. Rather than classifying extreme events, this model predicts a continuous consumption value (kWh) for each of the next 24 hours, enabling the downstream controller to schedule battery usage optimally.

Features. We engineer over 220 predictors drawn from historical consumption, calendar signals, weather data, and home-specific dynamics. Many of these are lagged features (e.g., consumption at 1 h, 2 h, 3 h, 24 h, 48 h, 72 h, and 168 h ago, plus rolling statistics over 6 h, 12 h, 24 h, 48 h, 72 h, and 168 h windows) to capture temporal autocorrelation. For the full list of non-lagged features (time encodings, temperature, wind speed, solar irradiance, etc.), see Table 3.1.

During this project, no data were available on occupancy patterns, a feature highly correlated with home energy demand. To address this, we fit a three-state Gaussian Hidden Markov Model (HMM) on historical hourly load data. Each hour's observed consumption is treated as an emission, and the HMM learns three latent "occupancy" states roughly corresponding to low, medium, and high home activity. At each time step, we compute the posterior probability of each HMM state given the past sequence of loads. These three posterior probabilities are included as features, allowing the model to infer whether occupants are likely away (state 1), partially present (state 2), or fully present (state 3).
By providing this latent occupancy signal, the XGBoost regressor can adjust its forecast when people are more or less active in the home.

The only available data regarding the power-hungry heat pump is its binary state (running or not). We estimate its consumption explicitly via two intermediate calculations.

Thermal output:

Q̇_heat = ṁ · c_p · ΔT_clamped,

where ΔT_clamped = min(max(ΔT, 2 K), 10 K) is the supply–return water temperature difference, clamped between 2 K and 10 K.

Electrical input:

P_in = Q̇_heat / COP,

using a fixed Coefficient of Performance (COP). Including both Q̇_heat and P_in as features captures the nonlinear relationship between ambient/indoor temperatures and heat-pump electrical load.

Architecture. We employ an XGBoost regressor in reg:squarederror mode to learn nonlinear mappings from the 220+ features to hourly consumption. Optuna is used to tune the following hyperparameters:

• n_estimators: 200 – 1500
• max_depth: 4 – 12
• learning_rate: 0.005 – 0.2 (log-scaled)
• subsample: 0.7 – 0.95
• colsample_bytree: 0.7 – 0.95
• reg_alpha: 10⁻⁸ – 10 (log-scaled)
• reg_lambda: 10⁻⁸ – 10 (log-scaled)
• min_child_weight: 1 – 7
• gamma: 0 – 0.5

These hyperparameters are searched over 500 Optuna trials, using time-series cross-validation on the training window.

Inference & Performance. During inference, the model generates point predictions for each of the next 24 hours. Prediction uncertainty is estimated via the variance of leaf outputs in the trained trees. Model validation is performed on a hold-out dataset to assess generalization before deployment.

3.3.3 Solar Prediction Model

To estimate the photovoltaic (PV) energy output for our installation, we relied on the forecast.solar³ API rather than developing a custom model.
This service handles the necessary physical and meteorological computations (e.g., irradiance decomposition, angle-of-incidence adjustments) described in the Theory chapter, allowing us to obtain reliable hourly forecasts spanning multiple days.

Our PV system comprises two arrays oriented at different azimuths (see Figure 3.3):
• Southeast-facing array: 24 panels with a tilt of 30°.
• Northwest-facing array: 26 panels, also tilted at 30°.

We issue separate API calls to forecast.solar for each orientation to account for differences in irradiance and shading throughout the day. After receiving the two hourly time series, one for the southeast array and one for the northwest array, we sum their outputs at each hour to obtain the total predicted energy production.

To align the forecasts with our panels' real-world characteristics (manufacturer tolerances, inverter efficiency, wiring losses, etc.), we apply a small scaling factor to each orientation's forecast. This factor was determined empirically by comparing historical production data (hourly aggregated) against the API's predictions over a representative validation period. The resulting forecast is visualized in Figure 4.6.

Figure 3.3: Layout of the dual-orientation PV array with per-panel energy output (in kWh), where north is up.

Footnote: forecast.solar is a cloud-based forecasting service that leverages PVGIS (Photovoltaic Geographical Information System) for irradiance and weather-based solar production estimates via a simple RESTful API [19].

3.4 Reinforcement learning agent

The core of our control strategy is a reinforcement learning (RL) agent designed to optimally manage battery charge and discharge in response to predicted electricity prices, forecasted solar production, and household load.
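The per-orientation combination described in Section 3.3.3 amounts to a scaled element-wise sum of the two hourly series; the scaling factors in this sketch are placeholders, not the empirically fitted values.

```python
def combine_pv_forecasts(se_kw: list, nw_kw: list,
                         se_scale: float = 1.0,
                         nw_scale: float = 1.0) -> list:
    """Sum hourly forecasts from the southeast and northwest arrays,
    each corrected by an empirically determined scaling factor."""
    if len(se_kw) != len(nw_kw):
        raise ValueError("forecast series must cover the same hours")
    return [se_scale * se + nw_scale * nw for se, nw in zip(se_kw, nw_kw)]
```

In practice the two input lists come from separate forecast.solar API calls, one per array azimuth.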
Instead of using a rule-based or optimization-only approach, we implement a recurrent Proximal Policy Optimization (RPPO) agent with LSTM memory to capture temporal dependencies in price, solar, and load patterns. By interacting with a custom Gymnasium environment, the agent learns a policy that balances cost minimization (including spot-price energy costs and monthly capacity fees) with battery state-of-charge (SoC) constraints.

3.4.1 Environment Design

The environment's observation (state) is a dictionary comprising the following components:
• Battery SoC: a continuous scalar in [0, 1] representing the fraction of the battery's usable capacity currently stored.
• Time Index: a 3-element vector (hour of day, minute of hour, day of week) to encode temporal patterns.
• Price Forecast: a 24-hour-ahead vector of predicted spot prices (SE3 market), in öre/kWh, updated daily.
• Solar Forecast: a 72-hour-ahead vector of predicted PV production (kW), obtained from the forecast.solar API.
• Capacity Metrics: a 5-dimensional vector (top1, top2, top3, rolling average, month progress) that tracks the three highest grid import peaks encountered in the current billing month (for capacity-fee calculation), the rolling average import, and how far through the month we are.
• Price Averages: two scalars (24-hour average, 168-hour average) representing recent average spot prices, used to identify high- and low-price regimes.
• Night Discount Flag: a boolean (0 or 1) indicating whether the current hour falls into a low-tariff "night" period.
• Load Forecast: a 72-hour-ahead vector of predicted household consumption (kW), produced by the demand forecasting model.

Footnote: Gymnasium is an open-source toolkit and framework for developing RL algorithms [?].

3.4.2 Reward Function

We design a multi-component reward at each hour t to guide the agent toward cost minimization while respecting SoC constraints and preferring to keep the SoC in a comfortable mid-range.
The total reward r_t is the sum of:

1. Grid Cost Penalty: When net grid power P_g,t (kW) is drawn from the grid, the instantaneous cost in öre is computed from the sum term in Equation (2.1), covered in Section 2.1. This penalizes importing power during expensive hours.

2. Capacity Fee Penalty: In Sweden's tariff structure, only the single highest grid import peak per day contributes to the monthly capacity fee. We maintain a list of the top-3 peaks {(t_i, P_peak,i)} in the current month. Whenever a new P_g,t exceeds that day's recorded peak, we update the list accordingly. At the end of each month, the highest recorded peak (kW) is multiplied by the constant tariff (öre/kW).

3. SoC Penalty: We encourage the agent to keep the battery's SoC within a preferred range [SoC_low, SoC_high]. The complete SoC reward is shown in Equation (3.1), where α is a large penalty factor for violating the hard limits [SoC_min, SoC_max], while β softly rewards or penalizes departures from the preferred mid-range.

4. Potential-Based Shaping: To accelerate learning without altering the optimal policy, we add a potential-based shaping term [7]. Let Φ(soc_t) be a smooth potential function maximized at (0.3 + 0.8)/2 = 0.55. Then at each step: r_shape,t = γ Φ(soc_{t+1}) − Φ(soc_t), where γ = 0.99 is the PPO discount factor. This term (weight w_shape) gently guides the agent's SoC toward the "optimal" mid-range over time, speeding up convergence.

5. Battery Degradation Penalty: Cycling the battery (charging or discharging) incurs wear costs. We penalize the absolute energy throughput |E_throughput,t| (in kWh) by w_deg × |E_throughput,t| × 45 öre/kWh, where 45 öre/kWh is derived from the initial cost as C_init / (N_cycles × E_capacity), i.e., the effective cost per kWh of usage [10]. This term is symmetric for charge vs. discharge and discourages excessive cycling.

6.
Action Modification Penalty: When the agent proposes an action a_t that violates safety constraints (e.g., would drive soc_{t+1} outside [SoC_min, SoC_max]), the environment's safety mask overrides it. Each override incurs a penalty, and consecutive violations escalate the multiplier. This discourages unsafe exploration and encourages the agent to learn feasible actions.

7. Arbitrage Bonus: This is the core profit mechanism. We define dynamic thresholds based on recent price percentiles: P_low = 30th percentile of the last 24 h, P_high = 75th percentile of the last 168 h. The agent is then rewarded for charging when Price < P_low and for discharging when Price > P_high.

8. Export Bonus: When the agent discharges into the grid, we also reward exported energy E_exp,t (kWh) by w_export × (spot_t + tax_bonus) × E_exp,t. Adding a tax_bonus of 60 öre/kWh simulates the current Swedish tax reduction on exported energy, so discharging during moderately high prices yields extra incentive. This encourages the agent to export when it is both profitable and grid-friendly.

9. Night Charging Reward: To further incentivize off-peak charging, any energy E_night,t charged during the "night" window (22:00–06:00) receives a bonus.

10. Solar-Aware SoC Management: This component makes room for solar charging when significant production is expected. Logic: at each time step, look ahead 6–12 hours using the solar forecast. If the forecasted incoming solar energy exceeds 2.0 kWh and the current state of charge soc_t > 0.60, assess the available battery headroom H_t = 1 − soc_t. Calculation: if H_t < 0.50 × E_solar_6–12h, then any discharge E_dis,t that creates additional headroom is rewarded. In other words, for every kWh discharged in anticipation of soon-available solar production, the agent gains a bonus.

11. Night-to-Peak Chain Bonus: This encourages a chain strategy of charging at night when prices are low and discharging during subsequent peak hours.
The environment tracks E_night_charged, the energy (kWh) drawn from the grid between 22:00 and 06:00 each day. This summed pool decays after 24 hours (i.e., energy charged more than 24 h ago is discarded). During any "peak" hour (defined as either Price > P_high or observed household load > L_high), any discharge E_chain,t that can be matched against the current pool is rewarded. In practice, if the agent discharges 1 kWh during a peak hour that was originally charged at night (within the last 24 h), it receives a bonus.

The SoC reward is:

Reward(soc_t) =
  −α × (1 + severity),                                                    if soc_t ≤ SoC_min or soc_t ≥ SoC_max,
  +β × (1 − |soc_t − (SoC_low + SoC_high)/2| / ((SoC_high − SoC_low)/2)), if SoC_low ≤ soc_t ≤ SoC_high,
  −β × (SoC_low − soc_t) / (SoC_low − SoC_min),                           if SoC_min < soc_t < SoC_low,
  −β × (soc_t − SoC_high) / (SoC_max − SoC_high),                         if SoC_high < soc_t < SoC_max.     (3.1)

The net reward at time t is then given in Equation (3.2), or in the more compact form

R(t) = wᵀR(t) = Σ_{i=1}^{N} w_i R_i(t),

where all reward components R_i are collected into a vector R(t) ∈ ℝ^N and the corresponding weights w_i into w ∈ ℝ^N:

R(t) = (−grid_cost(t), −capacity_penalty(t), −degradation_cost(t), soc_reward(t), shaping_reward(t), night_charging(t), arbitrage_bonus(t), export_bonus(t), −action_penalty(t), solar_soc_reward(t), night_peak_chain(t))ᵀ,

w = (w_grid, w_cap, w_deg, w_soc, w_shape, w_night, w_arbitrage, w_export, w_action_mod, w_solar, w_chain)ᵀ.     (3.2)

This reward function balances the key objectives of residential energy storage: reducing electricity costs, preserving battery life, ensuring safety, and capturing revenue opportunities. By splitting it into weighted components, we can tune the agent to prioritize cost savings during expensive periods and strategically store energy for future arbitrage.
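A direct transcription of the piecewise SoC reward in Equation (3.1) follows. The numerical limits (SoC_min = 0.10, SoC_low = 0.30, SoC_high = 0.80, SoC_max = 0.95) and the α, β magnitudes are illustrative assumptions, since the thesis does not fix all of them.

```python
def soc_reward(soc: float, severity: float = 0.0,
               alpha: float = 10.0, beta: float = 1.0,
               soc_min: float = 0.10, soc_low: float = 0.30,
               soc_high: float = 0.80, soc_max: float = 0.95) -> float:
    """Piecewise SoC reward of Equation (3.1): hard penalty outside
    [soc_min, soc_max], soft reward inside [soc_low, soc_high],
    linear penalty in the buffer zones in between."""
    if soc <= soc_min or soc >= soc_max:
        return -alpha * (1.0 + severity)          # hard-limit violation
    if soc_low <= soc <= soc_high:
        mid = (soc_low + soc_high) / 2.0
        half_width = (soc_high - soc_low) / 2.0
        return beta * (1.0 - abs(soc - mid) / half_width)  # peak at mid-range
    if soc < soc_low:
        return -beta * (soc_low - soc) / (soc_low - soc_min)
    return -beta * (soc - soc_high) / (soc_max - soc_high)
```

The reward peaks at the mid-range (0.55 with these limits) and falls off linearly toward the soft bounds, matching the shaping potential's maximum described in component 4.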
The potential-based shaping and graduated penalties promote stable learning without altering the optimal policy, while solar-aware SoC management and night-to-peak chaining guide the agent to anticipate renewable generation and predictable load cycles. Overall, this design lets the RL agent learn advanced, market- and weather-aware strategies that should outperform simple peak-shaving or time-of-use rules, achieving positive economic results while respecting operational limits.

3.4.3 Training Regime

We implement a training regime incorporating several advanced techniques to ensure robust policy learning. The agent is trained using a three-phase curriculum learning schedule spanning millions of timesteps:

Phase 1 (20% of training) focuses on basic battery management with simplified 3-day episodes, emphasizing SoC discipline with reduced reward complexity.

Phase 2 (30% of training) introduces price arbitrage strategies over 7-day episodes while maintaining basic constraints.

Phase 3 (50% of training) employs full system complexity with 30-day episodes, complete solar integration, and all reward components active.

Adaptive exploration is achieved through dynamic entropy-coefficient scheduling, starting with high exploration (4× base entropy) early in training and gradually reducing to 0.5× base entropy in the final 20% of training to enable policy refinement.

Continuous reward monitoring analyzes component balance every 100,000 timesteps, detecting reward dominance and extreme value ranges, and providing automatic recommendations for hyperparameter adjustment.

Data augmentation introduces controlled variability during training through random scaling of solar production (±5%) and consumption patterns (±15%) to improve policy robustness across diverse operating conditions.
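The random-scaling augmentation can be sketched as follows. Applying one scale factor per episode is an assumption for the sketch; the thesis does not state whether scaling is drawn per episode or per step.

```python
import random

def augment_episode(solar_kw, load_kw, rng=None):
    """Randomly rescale solar (+/-5%) and consumption (+/-15%) traces,
    as in the data-augmentation step of Section 3.4.3."""
    rng = rng or random.Random()
    solar_factor = 1.0 + rng.uniform(-0.05, 0.05)   # solar: +/-5%
    load_factor = 1.0 + rng.uniform(-0.15, 0.15)    # consumption: +/-15%
    return ([s * solar_factor for s in solar_kw],
            [x * load_factor for x in load_kw])
```

Each training episode would then see slightly different solar and load magnitudes, discouraging the policy from overfitting to exact historical traces.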
This training framework enables the agent to learn stable, interpretable policies that generalize effectively across varying seasonal conditions and market dynamics while maintaining strict adherence to operational safety requirements. Table 3.2 summarizes the core hyperparameters of our RecurrentPPO agent.

Table 3.2: Trained Model Parameters

Parameter            Value
Algorithm            RecurrentPPO
Learning Rate        2 × 10⁻⁴
Discount Factor      0.995
Rollout Buffer       4096
Batch Size           256
Training Epochs      8
LSTM Layers          1
LSTM Hidden Size     64
Training Steps       30,000,000+
GAE Lambda           0.95
Entropy Coefficient  0.01

3.4.4 Benchmark Strategies

To evaluate the RL agent's performance, we implement five rule-based battery management strategies representing conventional approaches used in residential energy storage systems.

1. No Battery Baseline: This strategy serves as the control case where no battery intervention occurs. The household directly imports or exports all net energy (consumption minus solar production) to the grid, establishing the worst-case scenario for capacity fees and energy costs.

2. Time-of-Use Strategy: A simple time-based approach using fixed scheduling rules:
• Night charging (22:00–06:00): moderate charging to 60% SoC when solar production is minimal
• Evening discharge (16:00–20:00): conservative discharge from 40% SoC during typical peak hours
• Power limits: 80% of maximum charging rate, 70% of maximum discharge rate

3. Price-Based Strategy: Uses fixed price thresholds based on typical Swedish electricity market ranges:
• Low-price threshold: 40 öre/kWh (conservative estimate)
• High-price threshold: 120 öre/kWh (conservative estimate)
• Conservative charging: 80% power rate when the price is below the low threshold
• Conservative discharge: 60% power rate when the price is above the high threshold

4.
Solar-Following Strategy: Reacts only to current-timestep solar conditions:
• Excess solar charging: store 90% of excess production when solar > demand
• Solar deficit discharge: provide 80% of the shortfall when solar < 50% of demand
• Simple thresholds: 30–80% SoC operating range

5. Peak-Shaving Strategy: Uses simple historical peak tracking (24-hour memory) without demand forecasting:
• Dynamic threshold: the maximum of 5.0 kW or 80% of recent peak demand
• Peak reduction: discharge when current demand exceeds the threshold
• Excess storage: charge with 80% of excess generation above 1.5 kW

All strategies include battery degradation costs in their economic evaluation, calculated at 45 öre/kWh of energy throughput. This comparison framework demonstrates the economic value of intelligent prediction and multi-objective optimization in complex energy management scenarios.

3.5 Orchestration & Automation

The orchestration is managed by a Python script using Prefect 3.4.1. Each basic operation, such as fetching data, training a model, or running an inference, is defined as a Prefect task. Related tasks are grouped into flows that run on a schedule. All tasks use run_python_script() to ensure consistent retry behavior, timeouts, and logging. We assign descriptive names to tasks so the Prefect UI shows clear labels instead of file paths.

There are four main flows, each with its own schedule. An hourly flow updates external data (electricity prices, weather, CO2 metrics, and solar forecasts). A 15-minute flow gathers home data (energy consumption, heat-pump status, and actual load) and then runs the reinforcement-learning (RL) agent to decide battery actions; this flow also generates a brief HTML/Markdown report with the agent's status. A daily flow runs each morning to refresh data and, if enabled, run price forecasts. Weekly flows retrain the models: price models on Sundays, the demand model on Mondays, and the RL agent on Tuesdays (see Figure 3.4).
These training jobs use adjustable hyperparameters and are allowed extra time to finish. When the RL agent runs, we parse its output to extract battery SoC, energy commands, price signals, and action values. This information is included in a short report that flags whether the agent is charging, discharging, or idle, and summarizes the solar and load forecasts. All logs, reports, and model files are saved as Prefect artifacts so any issues can be traced later (see Figure ?? in Section ??).

Error handling is consistent across all flows: each task retries once after a brief delay if it fails, and any task that exceeds its timeout is stopped and logged. A single task's failure does not stop the entire flow; downstream steps either skip or use fallback logic. This design keeps data fetching, model training, and RL inference running smoothly with minimal manual intervention.

Figure 3.4: Prefect Dashboard showing the deployments for the scheduling of the Home system

3.6 Validation & Evaluation Procedures

To verify that each forecasting component and the RL controller perform as intended under realistic conditions, we employ a layered validation strategy. For the price peak and valley detectors, where extreme events are sparse, we emphasize balancing precision and recall via oversight of false positives and negatives. For the trend model, we rely on automated hyperparameter tuning with constrained Optuna trials, exploiting XGBoost's computational efficiency. Finally, the RL agent undergoes extensive testing on historical data. For each model, dedicated evaluation scripts generate performance plots using unseen test-split data, enabling manual inspection of any anomalous behavior, because visual review often reveals issues that numerical metrics alone cannot fully capture.
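Prefect provides retry and timeout behavior natively through task options; purely as an illustration of the retry-once policy described above, a Prefect-free sketch of the wrapper logic might look like this (the function is hypothetical, not the project's actual run_python_script()):

```python
import time

def run_with_retry(fn, *args, retries: int = 1, delay_s: float = 1.0, **kwargs):
    """Run a callable, retrying once after a brief delay on failure,
    mirroring the per-task error handling of Section 3.5."""
    last_exc = None
    for attempt in range(retries + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:        # on failure, wait and retry once
            last_exc = exc
            if attempt < retries:
                time.sleep(delay_s)
    raise last_exc                      # exhausted retries: surface the error
```

In the real system the surrounding flow catches this final exception, so one task's failure lets downstream steps skip or fall back rather than aborting the whole flow.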
The RL agent generates dedicated HTML/Markdown documentation of its current battery action decision, shown in Figure 3.5, for logging and, more importantly, to provide visual feedback. This approach supports the project's goal of creating a highly user-friendly system that can be easily adapted in the future.

Figure 3.5: Prefect markdown artifact generated after running the agent before and after turning on the sauna (left image before, right image after the sauna has been on for some time).

4 Results

In this chapter, we present the performance outcomes for each component of the optimization platform on test data and in real-world deployment. We first report results for the price-extreme detectors (peaks/valleys) and the trend forecasting model, followed by solar-generation forecasts. Next, we summarize demand-forecast accuracy. Finally, we detail the RL agent's testing results.

4.1 Price Models

The price forecasting pipeline comprises three components: detecting extreme price events (peaks and valleys), forecasting the underlying price trend, and merging these outputs into a single, unified prediction. In the first subsection, we evaluate how accurately the peak and valley classifiers flag hours. Next, we assess the Trend Model's ability to capture broader hourly price movements over a month. Finally, the Merged Model section shows how combining the trend forecast with extreme-event flags improves overall prediction quality.

4.1.1 Price extreme detection

Figure 4.1 displays a two-week period, with ground-truth peaks marked by red triangles on the hourly spot price curve (blue). Similarly, Figure 4.2 highlights true valleys. These plots illustrate that extreme price events are quite rare: fewer than 5% of hours qualify as high-price peaks, while the low-price labels make up ≈ 20% of the data.
Because these peak and valley labels are generated by a rule-based algorithm and treated as "hard truths," the resulting classifiers can still exhibit numerous false negatives and false positives. For example, when a peak persists for more than one hour, the algorithm may label only a single hour as a peak, even though we would ideally mark every hour of sustained high prices. The same challenge applies to valley periods: despite extensive effort to ensure the labeling algorithm captures multiple consecutive low-price hours, it often fails to do so. Consequently, relying solely on summary metrics can be misleading, and visual inspection of the labeled events is therefore essential.

Figure 4.1: Hourly spot price (blue) with red triangles marking detected peaks (ground truth) over a two-week window.

Figure 4.2: Hourly spot price (blue) with red triangles marking valleys.

Figures 4.3a and 4.3b show the model performance over one week of labeled data for peaks and valleys, respectively.

Peaks

In Figure 4.3a, the blue line represents the hourly spot price, red triangles mark actual (ground-truth) peaks, and inverted green triangles mark predicted peaks. The shaded vertical bands indicate actual peaks. Over this week, there were 8 actual peaks, but the model predicted 16. This yields a precision of 0.44 but a recall of 0.88: the model catches nearly all true peaks, while fewer than half of its predictions are correct. The resulting F1-score is 0.58, reflecting the trade-off: the classifier is tuned to prioritize recall (so as not to miss rare high-price hours) at the expense of many false positives.

Valleys

Similarly, Figure 4.3b shows the valley-detection performance over one week of labeled data. The blue line again denotes the hourly spot price, red triangles mark the actual (ground-truth) valley hours, and inverted green triangles mark the predicted valleys. The shaded vertical bands in this plot indicate actual valley periods.
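The reported scores follow the standard classification-metric definitions. For the peak week above (8 actual peaks, 16 predictions, and, consistent with the reported precision and recall, 7 true positives), they can be reproduced as:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard classification metrics used for the peak/valley detectors.

    tp: true positives, fp: false positives, fn: false negatives.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

With tp = 7, fp = 9, fn = 1 this gives precision ≈ 0.44, recall ≈ 0.88, and F1 ≈ 0.58, matching Figure 4.3a.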
During this week, there were 31 true valley hours, but the model predicted 47. As a result, the precision is 0.47 (fewer than half of the predicted valleys align with the ground truth) while the recall is 0.71 (about 70% of true valleys are successfully identified). The combined F1-score of 0.56 reflects this balance, again emphasizing that the classifier is tuned to favor recall over precision.

Notice how several actual valley periods span multiple consecutive hours (for example, the trough around February 23–24), yet the model sometimes captures only a subset of those hours, or extends the predicted band into adjacent non-valley hours. This behavior underlines the difficulty of perfectly labeling extended low-price runs: while SMOTE-augmented training and threshold tuning help catch most valley events, occasional false positives and false negatives remain inevitable. Visual inspection of these shaded bands thus remains crucial for validating whether the classifier's "imperfect but recall-focused" performance is acceptable before passing its signals to the RL controller.

Figure 4.3: Model performance for predicted peaks (a, top) and valleys (b, bottom) over one week of labeled data. (Scores are underestimates, as some true peaks are missing from the labels.)

4.1.2 Price trend model

Figure 4.4 illustrates the Trend Model's performance on SE3 price data throughout April 2024. In the top panel, the blue curve shows the actual hourly spot prices, while the red curve depicts the model's 24-hour-ahead forecasts. We observe that the forecast generally captures broad daily patterns, such as gradual rises during mid-April volatility and lower prices in early April, but smooths out short-lived spikes and dips. The middle panel plots the point-wise error, defined as actual − predicted.
Green bars above zero indicate hours when the model underestimates the actual price, and red bars below zero indicate overestimates. During periods of rapid price escalation we expect large errors, since the trend model is meant to capture the overall shape rather than the high-frequency volatility.

In the bottom panel, the orange bars represent the daily MAE, while the dashed blue line shows the average actual daily price, which closely mirrors the red forecast curve in the top panel, indirectly confirming the model's overall accuracy.

The directional accuracy is 0.57, but this figure is misleading since the metric is calculated on data that is too granular. The average daily price provides a more meaningful performance measure, although here it is presented only visually.

Figure 4.4: Trend Model performance for April 2024. Top: actual vs. predicted hourly SE3 prices (blue/red). Middle: hourly error (green = underestimate, red = overestimate). Bottom: daily MAE (orange) and average daily price (dashed blue). Metrics: MAE 20.18 öre, RMSE 26.77 öre, directional accuracy 0.57.

4.1.3 Merged price model

The merged price model injects detected peaks and valleys into the Trend Model's smooth forecast; Figure 4.5 shows this merging. Note that we use the classifier's probability score to set the spike height, even though the peak/valley model is not trained to predict actual price magnitudes. As a result, the injected peaks may not reflect true peak heights. By adding the flags nonetheless, large errors around sharp spikes and drops are reduced compared to the Trend Model alone.

Figure 4.5: Merged model over one-week validation. Top: actual (blue) vs. trend forecast (green). Middle: merged output (magenta) with peaks (red) and valleys (blue). Bottom: classifier probabilities (thresholds shown).

4.2 Solar forecasts

Figure 4.6 compares hourly predicted (orange) and actual (blue) solar production for five consecutive days (March 15–19, 2025).
Across this interval, the forecast closely tracks the characteristic ramp-up and ramp-down of PV output. The hourly MAE over these five days is 0.45 kW (out of the 20.3 kW system), corresponding to an sMAPE of 5.3%. Minor overpredictions occur, but overall the timing and magnitude of the peaks closely match the actual production.

Figure 4.6: Hourly predicted (orange) versus actual (blue) solar energy production for the PV installation over five consecutive days (March 15–19, 2025), illustrating the close alignment of the forecasted and measured outputs.

Figure 4.7 presents a heatmap of predicted hourly production from March 1 to March 10, 2025. Each cell's color intensity indicates the forecasted kWh for that hour and date. Notice how the predictions capture the variability of cloud cover on March 4–5 (midday dips) and correctly shift the noon-hour peak later on March 7, when sunrise was delayed. Over these ten days, the average daily predicted energy is 25.6 kWh.

Figure 4.7: Heatmap of predicted hourly solar energy production from March 1 to March 10, 2025. Each cell shows the forecasted kWh for that hour and date, with deeper reds indicating higher output.

4.3 Demand Model

Figure 4.8 shows the Home Demand Model's performance from March 1–15, 2025. In the top panel, actual hourly consumption (blue) and predicted demand (orange) are overlaid, with a shaded ±1σ uncertainty band. Over these 336 hours, the model achieves RMSE = 0.80 kWh, MAE = 0.46 kWh, and R² = 0.874. Peaks from morning and evening heating loads are well captured.

The middle-left heatmap displays the HMM-derived occupancy states (0 = low, 1 = medium, 2 = high) by hour and day, indicating higher load when occupancy is high. The bottom-left timeline traces the continuous HMM state sequence, revealing weekday transitions.
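The error metrics quoted in this chapter follow their usual definitions. The sMAPE variant below, which normalizes by the mean of |actual| and |predicted| and skips zero-production hours, is an assumption, since the thesis does not spell out the exact formula:

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent."""
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2.0
        if denom > 0.0:                 # skip hours where both are zero
            total += abs(a - p) / denom
    return 100.0 * total / len(actual)
```

For solar series, skipping zero hours matters because nighttime hours would otherwise make a percentage error undefined.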
The bottom-right plot shows the prediction residuals (actual − predicted), which mostly lie within ±1 kWh; larger errors occur mainly due to a lead or lag in the prediction. Overall, the model tracks baseline patterns and heating-related peaks accurately, with HMM states, temperature, and lag features enabling robust adaptation to both daily cycles and anomalies.

Figure 4.8: Home Demand Model performance. Top: actual vs. predicted consumption (±1σ). Middle: HMM occupancy states and ambient temperature by hour/day. Bottom: HMM state timeline and residuals.

Figure 4.9 ranks the Demand Model's top 20 features by their normalized XGBoost gain. The single most influential feature is "Hp Contribution" (heat-pump load), accounting for ≈ 19% of total gain. The immediate consumption metrics "Consumption" (≈ 11%) and "Consumption Power Ratio" (≈ 10%) are next, highlighting the importance of current usage levels. Weekday vs. weekend demand ("Consumption Dow (day-of-week) Avg Ratio," ≈ 9%) and HMM-inferred occupancy states ("HMM State Posterior 0," ≈ 1.7%; "HMM State Posterior 2," ≈ 1.5%) also rank highly, showing that both occupancy and day-of-week patterns matter. Lagged consumption (e.g., "Consumption Pct 1H," "Consumption Same Hour 1D Ago," "Consumption Lag 24H") and solar–temperature interaction features (≈ 1% each) further refine the model, while lower-gain features such as multi-day rolling means (≈ 0.6%–0.8%) show diminishing returns.

Figure 4.9: Top 20 features by XGBoost normalized gain in the Home Demand Model, highlighting heat-pump contribution, raw consumption metrics, and occupancy and lag-based predictors, among others.

4.4 Reinforcement Learning Agent

The RL-based controller demonstrates robust, data-driven battery management with clear improvements over conventional, rule-based baselines in our simulated environment.
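Normalized gain, as used in the feature ranking of Figure 4.9, divides each feature's total gain by the sum over all features. A minimal sketch with a hypothetical gain dictionary (the feature names and values below are illustrative only; in XGBoost the raw scores would come from booster.get_score(importance_type="gain")):

```python
def normalized_gain(gain: dict) -> dict:
    """Convert raw gain scores into fractions of total gain,
    ordered from most to least influential."""
    total = sum(gain.values())
    ranked = sorted(gain.items(), key=lambda kv: kv[1], reverse=True)
    return {name: value / total for name, value in ranked}
```

The resulting fractions sum to 1, so a feature's share can be read directly as a percentage of total gain.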
Training is performed with the RecurrentPPO algorithm; the final hyperparameters are summarized in Table 3.2. A full training run (30 million timesteps) on an AMD Ryzen 5700X CPU (no GPU) requires approximately 78 hours.

4.4.1 Performance Results

Cost under low solar generation: In the "cloudy month" scenario (January 2025), shown in Figure 4.10, the RL controller achieves the lowest total cost at 2 550 SEK, a 41% reduction compared to no battery (4 320 SEK) and a 15% improvement over the best rule-based method (Peak Shaving, 2 990 SEK) for this period. Although net profit is impossible when solar yield is minimal, the agent effectively limits grid imports to minimize the monthly expense.

Net benefit with generous PV: In contrast, during a "sunny month" (May 2025, Figure 4.11), the same agent realizes a net benefit of 900 SEK, turning otherwise idle battery capacity into revenue. This marks a 125% increase versus no battery (400 SEK) and a 50% gain over Peak Shaving (600 SEK), showing the controller's ability to collect surplus solar and arbitrage price fluctuations.

Figure 4.10: Monthly cost for different control strategies under a cloudy month (Jan 2025).

Figure 4.11: Net monthly benefit (SEK) for different control strategies under a sunny month (May 2025).

Peak Reduction Performance: Figure 4.12 clearly illustrates that the RL controller outperforms all baseline strategies in shaving peak grid import. With the RL policy, the highest hourly draw is limited to about 6.5 kW, compared to 11 kW when no battery is used (a 41% reduction).
The rule-based schemes all land between roughly 9 and 10 kW: Time-of-Use at ≈ 9.98 kW, Price-Based and Solar-Following both at ≈ 9.01 kW, and Peak-Shaving, interestingly, at 7 kW. Even dedicated peak-shaving logic cannot match the learned, price- and forecast-aware behavior of the RL agent (although the peak strategy is only a fixed "dump battery when import exceeds 6 kW" rule that does not ensure battery power is available for those occasions). This reduction in peak demand directly translates into lower capacity charges and greater operational flexibility, demonstrating the RL approach's ability to anticipate and pre-empt high-price/high-demand hours.

Figure 4.12: Maximum hourly import (kW) for different control strategies in January 2025.

4.4.2 Battery Operation and SoC Dynamics

Figure 4.13 illustrates a multi-day interval of the agent's operation (June 2024). The top panel overlays battery state of charge (SoC) against the hourly electricity price. The agent charges aggressively during low-cost windows and discharges when prices exceed a learned threshold. The middle panel shows household load and PV output. From these traces we observe that the agent maintains SoC within the preferred band (15–85%) while exploiting both price arbitrage and solar surplus.

Figure 4.13: Multi-day battery operation: SoC vs. electricity price (top), household consumption and PV production (middle), and hourly grid import with discount factor (bottom). Gray shading denotes nighttime discount periods.

5 Conclusion

This thesis demonstrates the feasibility and potential of AI-driven home energy management systems in the context of Sweden's transition to power-based electricity tariffs. Despite the complexity of integrating forecasting models, reinforcement learning control, and real-time hardware operation, the results show promising progress toward intelligent, automated battery management.
The developed system successfully coordinates multiple dynamic factors (spot price fluctuations, solar production variability, household consumption patterns, and power-based fees) into a unified control strategy. Field results confirm that AI approaches can yield measurable economic benefits, even with limited training time and minimal hyperparameter tuning. These findings suggest that further model refinement, extended training periods, and improved reward design could significantly enhance performance.

Today's residential battery management still shows a large gap between available technology and practical use. Most homeowners either rely on built-in manufacturer modes that ignore price signals, or manually schedule charging and discharging based on forecasts. The former neglects profit and grid optimization entirely, while the latter demands expertise and time for suboptimal results. Neither approach accounts for battery wear, dynamic grid tariffs, or coordinated solar-battery optimiza