AI-Driven Home Energy Management System for Profit and Grid Stability
Deep Reinforcement Learning and Predictive Models for Minimizing Peak Demand While Balancing Battery Degradation in a Dynamic Environment

Degree project report in Bachelor's Programme in Electrical Engineering

Adam Michelin

Department of Electrical Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2025
www.chalmers.se

Degree project report 2025

© ADAM MICHELIN, 2025.
Supervisor & Examiner: Thomas Hammarström, Department of Electrical Engineering, Chalmers University of Technology

Degree project report 2025
Department of Electrical Engineering
Chalmers University of Technology
SE-412 96 Gothenburg
Sweden
Telephone +46 31 772 1000

Cover: Futuristic abstract home (Generated with Leonardo AI, Feb 2025)
Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Gothenburg, Sweden 2025

Abstract

This thesis presents the development and implementation of an AI-driven home energy management system designed to optimize residential battery storage in response to Sweden's new power-based electricity tariffs, which introduce capacity fees based on average monthly power peaks starting January 2027. The system integrates three components: (1) multi-modal forecasting models for electricity prices, solar production, and household demand; (2) a Recurrent Proximal Policy Optimization (RPPO) reinforcement learning agent for real-time battery control; and (3) automated orchestration via Prefect with Home Assistant integration. The forecasting stack (XGBoost and temporal convolutional networks (TCN)) achieves competitive accuracy, and the RL agent, trained on a custom reward balancing cost, solar utilization, and safety, learns price arbitrage and solar-aware charging strategies. Field deployment on a 22 kWh battery with a 20 kW dual-orientation PV array demonstrates integration with real hardware and shows preliminary economic benefits under simulated seasonal conditions. The agent maintains 100% safety compliance (zero charge/discharge violations during final deployment) while achieving high grid independence.
Although additional computational time for full training convergence and hyperparameter tuning remains future work, these preliminary results underscore the strong potential of AI-driven residential energy management for cost savings and grid support.

Keywords: AI, RL, Home Energy Management System, Software.

Acknowledgements

I would like to express my sincere gratitude to my supervisor and examiner, Thomas Hammarström, for his guidance and support throughout this project. His expertise and constructive feedback have played a significant role in shaping this work. I am deeply grateful to Chalmers University of Technology for providing the academic environment and resources necessary for this research. Special thanks to the Department of Electrical Engineering for their sponsorship.

This project would not have been possible without the incredible open-source community. I extend my appreciation to the developers and maintainers of the tools that formed the backbone of this work: Prefect for workflow orchestration, Gymnasium for reinforcement learning environments, XGBoost, PyTorch, and TensorFlow for other machine-learning implementations, and Home Assistant for smart home integration. Particular thanks to the forecast.solar team for providing reliable solar prediction services that enabled accurate PV forecasting.
Adam Michelin, Gothenburg, June 2025

List of Acronyms

AI        Artificial Intelligence
ANN       Artificial Neural Networks
API       Application Programming Interface
BMS       Battery Management System
CET       Central European Time
COP       Coefficient of Performance
DRL       Deep Reinforcement Learning
ETL       Extract, Transform, Load
EUPHEMIA  Pan-European Hybrid Electricity Market Integration Algorithm
GBT       Gradient-Boosted Trees
GRU       Gated Recurrent Unit
HEMS      Home Energy Management System
HMM       Hidden Markov Model
HTML      HyperText Markup Language
kW        Kilowatt
kWh       Kilowatt-hour
MAE       Mean Absolute Error
MDP       Markov Decision Process
ML        Machine Learning
PV        Photovoltaic
PVGIS     Photovoltaic Geographical Information System
R2        Coefficient of Determination
RESTful   Representational State Transfer
ROI       Return on Investment
RPPO      Recurrent Proximal Policy Optimization
SE3       Swedish Electricity Price Zone 3
SEK       Swedish Krona
SMOTE     Synthetic Minority Oversampling Technique
SoC       State of Charge
TCN       Temporal Convolutional Network
VAT       Value Added Tax
XGBoost   eXtreme Gradient Boosting

Contents

List of Acronyms  ix
Nomenclature  xi
List of Figures  xiii
List of Tables  xvii
1 Introduction  1
  1.1 Background  1
  1.2 Purpose  3
  1.3 Goals  4
  1.4 Limitations / Demarcations  5
2 Theory  7
  2.1 Electricity–pricing context  8
  2.2 Physical system modeling  9
    2.2.1 Battery State-of-Charge (SoC) Dynamics  9
    2.2.2 Photovoltaic Dynamics  10
  2.3 Machine Learning  12
    2.3.1 Time-series forecasting  13
      2.3.1.1 Neural Networks (NN)  14
      2.3.1.2 Temporal Convolutional Networks (TCN)  16
      2.3.1.3 Long Short-Term Memory (LSTM)  17
      2.3.1.4 Gradient-Boosted Trees  18
    2.3.2 Deep Reinforcement Learning  19
      2.3.2.1 Markov Decision Process (MDP)  20
      2.3.2.2 Proximal Policy Optimization & Recurrent PPO algorithms  20
  2.4 Performance metrics  21
3 Methods  23
  3.1 Work-flow overview  24
  3.2 Data acquisition & pre-processing  25
  3.3 Forecasting models  26
    3.3.1 Price Model  26
      3.3.1.1 Trend Model  27
      3.3.1.2 Peak & Valley Models  28
      3.3.1.3 Merged Model  29
    3.3.2 Home Demand Model  29
    3.3.3 Solar Prediction Model  31
  3.4 Reinforcement learning agent  32
    3.4.1 Environment Design  32
    3.4.2 Reward Function  33
    3.4.3 Training Regime  36
    3.4.4 Benchmark Strategies  37
  3.5 Orchestration & Automation  38
  3.6 Validation & Evaluation Procedures  39
4 Results  41
  4.1 Price Models  41
    4.1.1 Price extreme detection  41
    4.1.2 Price trend model  45
    4.1.3 Merged price model  46
  4.2 Solar forecasts  47
  4.3 Demand Model  49
  4.4 Reinforcement Learning Agent  51
    4.4.1 Performance Results  51
    4.4.2 Battery Operation and SoC Dynamics  53
5 Conclusion  55
  5.1 Ethical and Environmental Considerations  56
Bibliography  57

List of Figures

1.1 Rule-based vs adaptive control: fixed rules (left) versus data-driven adjustments (right).  2
1.2 System purpose diagram  3
1.3 System goal diagram  4
2.1 Theory chapter overview  7
2.2 Real-time energy-flow dashboard in Home Assistant.  9
2.3 Symbolic proportion of delivered energy versus heat losses.  10
2.4 (a) Relative irradiance reduction cos(θz − β) as a function of incidence angle for two fixed tilts: horizontal (β = 0°) and a south-facing roof pitch (β = 27°). (b) Geometric interpretation of the cosine law [12].  11
2.5 Horizontal (altitude–azimuth) coordinate system used in solar-position calculations.  11
2.6 Machine-learning chapter overview showing the different branches between the forecasting and reinforcement-learning classes  12
2.7 (a) Inputs for forecasting indoor temperature: outdoor temp, occupancy, HVAC set-point, indoor humidity. (b) Example h-step-ahead forecast.  13
2.8 Multilayer perceptron, a kind of neural network  15
2.9 Gradient descent to a global minimum on a loss surface  15
2.10 TCN model with l equal to input length, k equal to kernel size, b equal to dilation base, k ≥ b, and with a minimum number of residual blocks n for full history coverage, where n can be computed from the other values as explained above (source: https://unit8.com/wp-content/uploads/2021/07/image-50.png)  16
2.11 Internal structure of an LSTM cell, showing how the forget gate ft, input gate it, and output gate ot regulate the flow of information into and out of the cell state Ct and hidden state ht.  17
2.12 Illustration of gradient boosting: successive shallow trees are added to the ensemble, each one correcting the residual errors of the accumulated model.  18
2.13 Gridworld MDP for a cleaning robot: the agent starts in one of the nine states S1 . . . S9, must visit the dirty room at S5, and then navigate to the charging station at S9. Transitions occur on up/down/left/right actions, and the reward structure encourages cleaning and timely return to charge.  20
3.1 System architecture  23
3.2 Data acquisition pipeline showcasing the raw API signal process through validation and feature engineering to complete, usable, and clean data for the ML stack  25
3.3 Layout of the dual-orientation PV array with per-panel energy output (in kWh), where North is up  31
3.4 Prefect dashboard showing the deployments for the scheduling of the home system  38
3.5 Prefect markdown artifact generated after running the agent before and after turning on the sauna (left image before, right after the sauna has been on for some time.)  39
4.1 Hourly spot price (blue) with red triangles marking detected peaks (ground truth) over a two-week window.  42
4.2 Hourly spot price (blue) with red triangles marking valleys  42
4.3 Model performance for predicted peaks (top) and valleys (bottom) over one week of labeled data. (Underestimates due to some true peaks missing in labels.)  44
4.4 Trend-model performance for April 2024. Top: actual vs. predicted hourly SE3 prices (blue/red). Middle: hourly error (green = overestimate, red = underestimate). Bottom: daily MAE (orange) and average daily price (dashed blue). Metrics: MAE 20.18 öre, RMSE 26.77 öre, direction accuracy 0.57  45
4.5 Merged model over one-week validation. Top: actual (blue) vs. trend forecast (green). Middle: merged output (magenta) with peaks (red) and valleys (blue). Bottom: classifier probabilities (thresholds shown).  46
4.6 Hourly predicted (orange) versus actual (blue) solar energy production for the PV installation over five consecutive days (March 15–19, 2025), illustrating the close alignment of the forecasted and measured outputs.  47
4.7 Heatmap of predicted hourly solar energy production from March 1 to March 10, 2025. Each cell shows the forecasted kWh for that hour and date, with deeper reds indicating higher output.  48
4.8 Home Demand Model performance. Top: actual vs. predicted consumption (±1σ). Middle: HMM occupancy states and ambient temperature by hour/day. Bottom: HMM state timeline and residuals.  49
4.9 Top 20 features by XGBoost normalized gain in the Home Demand Model, highlighting heat-pump contribution, raw consumption metrics, occupancy, and lag-based predictors among others.  50
4.10 Monthly cost for different control strategies under a cloudy month (Jan 2025).  51
4.11 Net monthly benefit (SEK) for different control strategies under a sunny month (May 2025).  52
4.12 Maximum hourly import (kW) for different control strategies in January 2025.  52
4.13 Ten-day battery operation: SoC vs. electricity price (top), household consumption and PV production (middle), and hourly grid import with discount factor (bottom). Gray shading denotes nighttime discount periods.  53

List of Tables

2.1 Notation for the Swedish retail-electricity price model  8
2.2 Notation for symbols and parameters used in photovoltaics  10
2.3 Common activation functions  14
2.4 Notation for Reinforcement Learning, MDP and PPO-related symbols  19
2.5 Forecasting Metrics Used in This Project  22
3.1 All input features by category. Data range starts at 2017-01-01  27
3.2 Trained Model Parameters  36

1 Introduction

The transition towards renewable energy sources is reshaping residential electricity management. As Sweden implements power-based tariffs that charge consumers based on their highest monthly usage peaks, the need for intelligent home energy systems becomes critical.
This shift in pricing, beginning in 2025 for some network operators, creates both a challenge and an opportunity for homeowners to actively manage their consumption patterns. This thesis explores the development of an AI-driven home energy management system that leverages deep reinforcement learning to optimize battery storage operations. By intelligently coordinating solar production, battery charge/discharge cycles, and household consumption, the system aims to minimize electricity costs while contributing to grid stability through peak demand reduction. The work demonstrates how modern AI techniques can transform residential energy management from reactive rule-based approaches to adaptive, data-driven strategies that benefit both consumers and the broader electrical grid.

1.1 Background

The increasing integration of renewable energy sources, such as solar power, has introduced new challenges in energy management, particularly for residential and commercial buildings. Energy consumption patterns vary throughout the day, leading to peak demand periods that contribute to high electricity costs. Sweden is transitioning to an additional capacity-based fee system for households and small businesses. The new price model will be fully implemented by January 1, 2027, and some network operators, like Ellevio, are introducing it earlier (January 2025). Their implementation bases a part of the monthly electricity bill on the user's three highest monthly power peaks. We can assume that other operators will implement a similar pricing model [1], [2].

By incentivizing households to reduce their peak power consumption through capacity-based fees, Sweden aims to flatten the aggregate demand curve across the grid. When many homes shift their energy usage away from peak periods, this collective behavior reduces stress on transmission infrastructure, minimizes voltage fluctuations, and decreases the need for expensive grid reinforcements.
This creates a win-win scenario where consumers save money while contributing to a more resilient and stable energy system, addressing the core reason why Sweden is transitioning to this new tariff structure [9].

Home battery systems offer a viable solution for mitigating peak electricity demand, enhancing overall energy efficiency, and potentially generating income through participation as a micro-producer. By storing excess energy from solar panels or the grid during off-peak hours, these batteries can supply power when demand is high, reducing reliance on expensive grid electricity. However, optimizing the use of home batteries presents several challenges, including balancing energy storage, predicting peak demand, and minimizing battery degradation over time.

A well-optimized home energy system can lead to lower electricity costs for consumers while contributing to a more balanced and efficient power grid [3]. This represents a win-win solution for both users and electricity providers, as the primary goal of power-based tariffs is to encourage more even energy distribution and reduce grid strain [2].

Artificial intelligence (AI) and machine learning techniques, particularly reinforcement learning (RL), have shown significant promise in optimizing energy usage [6]. These methods can dynamically adjust battery charging and discharging strategies based on real-time data, predictive models, and user preferences. While traditional battery management systems rely on predefined rules, there is growing interest in adaptive, data-driven approaches that can respond dynamically to changing electricity prices and user behavior [3].

Figure 1.1: Rule-based vs adaptive control: fixed rules (left) versus data-driven adjustments (right).
1.2 Purpose

The purpose of this thesis is to design, implement, and experimentally evaluate an AI-driven battery optimization system that minimizes electricity costs while supporting grid stability. With Sweden transitioning to a power-based tariff system, residential homes and small businesses are forced to strategically manage their energy consumption to avoid high costs. More concretely, the work seeks to:

I Integrate a full-stack optimization platform, comprising multi-model forecasting (demand, spot prices, and solar production) together with a Recurrent PPO controller, into a real residential installation built around a 22 kWh battery and a dual-orientation 20 kW PV array.

II Automate end-to-end orchestration of data ingestion, model retraining, and control dispatch through a Prefect server workflow pipeline, thereby demonstrating a production-grade architecture that can run unattended. The solution will be integrated with Home Assistant (https://www.home-assistant.io/) for real-time monitoring and automation, providing a practical and user-friendly implementation of AI-powered energy management.

III Quantify economic and technical impact by comparing the AI system against rule-based and static-schedule baselines; although the project timeline does not allow for month-long field trials, general system performance with respect to the chosen metrics will be assessed by simulating the different systems.

Figure 1.2: System purpose diagram

1.3 Goals

To translate the overarching purpose into tangible outcomes, the thesis aims for six concrete goals:

I Establish a solid theoretical foundation by researching the state of the art in AI-based home energy management and Sweden's power-based tariff schemes.
II Develop accurate forecasting modules for household demand, day-ahead spot prices, and photovoltaic production to supply the controller with reliable short-term predictions.

III Design, train, and validate a reinforcement-learning controller that converts forecasts, battery constraints, and tariff rules into battery charge/discharge actions.

IV Deploy the complete optimization platform in a real residential setting, integrating hardware and software for continuous, unattended operation while providing an intuitive user interface.

V Evaluate practical performance and limitations through a combination of field trials and simulation studies, benchmarking the AI system against rule-based and static-schedule baselines.

VI Publish an open reference implementation (source code, configuration, and documentation) to facilitate replication and future research.

Figure 1.3: System goal diagram

1.4 Limitations / Demarcations

While the work aspires to provide an end-to-end proof of concept, several deliberate boundaries have been set to keep the project feasible within a bachelor-level time frame:

I Single-site pilot. All field experiments are conducted in one house located in Stockholm, equipped with a 22 kWh battery and a 20 kW dual-orientation PV array. Results may not generalise to other climates, tariff zones, hardware configurations, or user patterns.

II Evaluation horizon. Because of time constraints, the RL controller is validated over a couple of days. Long-term seasonal effects and battery ageing beyond this window are assessed only through simulation.

III Battery-centric control. The controller schedules only battery charge and discharge.
Flexible loads such as EV charging, heat pumps, and smart appliances are left unmanaged. While the codebase supports future integration, safe deployment would require additional safety logic and hyperparameter tuning that lie outside this study.

IV Simplified battery degradation model. Cycle life is approximated by an energy-throughput penalty calibrated and calculated on the specific battery installed; electrochemical ageing mechanisms (temperature, C-rate, SoC swing) are not explicitly modelled.

V Fixed 15-minute control granularity. The decision interval operates on a seemingly low resolution but is chosen for simplicity. Unforeseen demand spikes are safely handled by a buffer script to ensure proper home energy management safety. Faster RL dynamics remain outside the study scope.

VI Hardware and network reliability. The prototype runs on a server located at the homeowner's premises. Cyber-security hardening via HTTPS and ZeroTier¹ tunneling is in place to ensure safe integration, together with basic crash handling.

The above demarcations ensure that the thesis remains achievable while still demonstrating the viability of an AI-driven battery optimizer under Sweden's forthcoming power-based tariff regime. Future research can ease these constraints to address a more sophisticated and broader system.

¹ZeroTier (https://www.zerotier.com/) is a networking solution to securely connect virtual networks across various devices and locations.

2 Theory

This chapter lays the theoretical groundwork on which the remainder of the thesis is built. It opens with Sweden's emerging power-based tariff design and details how the new capacity-fee formula, based on the three highest hourly peaks, reshapes the household cost landscape (section 2.1).
Section 2.2 then translates the physical installation into mathematics: grid-exchange balances, battery state-of-charge dynamics, and a tilt-aware photovoltaic model that links solar geometry to electrical output. With the energy flows formalised, section 2.3 surveys the machine-learning tools that will later drive optimisation: gradient-boosted trees, temporal convolutional networks, LSTMs, and their integration into a Recurrent PPO reinforcement-learning framework. Finally, section 2.4 establishes the accuracy, cost-saving, and robustness metrics that will benchmark both forecasts and control policies throughout the study. Together these four sections provide the analytical scaffold required to understand the methods and results that follow.

Figure 2.1: Theory chapter overview

2.1 Electricity–pricing context

The monthly cost for a household customer is formalised in Equation (2.1), with all symbols defined in Table 2.1. At its core, this expression aggregates four main components:

(i) The per-kWh energy price set by Nord Pool's day-ahead market.
(ii) A per-kWh variable grid-energy charge imposed by the DSO.
(iii) A fixed monthly grid fee covering metering and service costs.
(iv) From 2025 onwards, a capacity fee based on peak power consumption [1].

The energy-based terms are multiplied by a uniform VAT rate [8], while the capacity fee mandated under Ei's EIFS 2022:1 regulation charges customers for their three highest hourly averages each month, thereby incentivising lower peaks [9].
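As a minimal numerical sketch of the capacity-fee mechanism described above (function and variable names, as well as the example β value, are illustrative and not taken from the thesis), the fee can be computed from a month of hourly mean powers as follows:

```python
def capacity_fee(hourly_means_kw, beta_sek_per_kw):
    """Capacity fee: beta times the mean of the three highest hourly
    average powers of the month (night-hour discounting omitted here)."""
    top_three = sorted(hourly_means_kw, reverse=True)[:3]
    return beta_sek_per_kw * sum(top_three) / len(top_three)

# Illustrative month: a ~2 kW base load with three sharp peaks.
powers = [2.0] * 717 + [7.5, 9.0, 6.0]
fee = capacity_fee(powers, beta_sek_per_kw=50.0)  # mean peak 7.5 kW -> 375.0 SEK
```

The sketch makes the incentive visible: shaving the single 9.0 kW peak lowers the whole month's fee, which is exactly the behaviour the tariff is designed to reward.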
To further encourage off-peak usage, many DSOs (e.g. Ellevio) apply a 50% discount on any registered peak between 22:00 and 06:00, effectively halving those hours when calculating the average of the top three peaks. This structure strongly signals to consumers that shifting energy usage, such as EV or battery charging, to nighttime hours reduces their capacity charges, aligning household behaviour with grid-stability goals [5].

Table 2.1: Notation for the Swedish retail-electricity price model

Symbol | Meaning | Unit
λ_t | Nord Pool spot price in hour t | öre/kWh
τ_E | Energy-tax rate (ex-VAT) | öre/kWh
φ | Variable grid-energy charge (ex-VAT) | öre/kWh
v | VAT rate (= 1.25) | –
E_t | Energy consumed in hour t | kWh
β | Capacity-fee unit price | SEK/kW
P_(i) | i-th largest hourly mean power in month M | kW
w_t | 0.5 during 22:00 to 05:59, 1 otherwise | –
C_fix | Fixed monthly grid fee | SEK

C_month = v Σ_{t∈M} (λ_t + φ + τ_E) E_t + C_fix + β · [w_t (P_(1) + P_(2) + P_(3))] / 3    (2.1)

2.2 Physical system modeling

A precise mathematical power-flow model is essential for linking control actions to cost outcomes. Grid exchange at time t is determined by Equation (2.2), where L_t is the household load, P^PV_t the PV output, and P^dis_t / P^ch_t the battery discharge/charge power. A positive P^grid_t denotes import, whereas a negative value denotes exported power. Figure 2.2 illustrates these flows in the live Home Assistant dashboard.

P^grid_t = L_t − P^PV_t − P^dis_t + P^ch_t    (2.2)

Figure 2.2: Real-time energy-flow dashboard in Home Assistant.

2.2.1 Battery State-of-Charge (SoC) Dynamics

The battery dynamic is governed by the round-trip efficiency, as shown in Equation (2.3). It quantifies the energy returned by a battery versus the energy put in, accounting for losses from heat, internal resistance, and inverter conversion. This means that the battery must discharge slightly more energy than the load actually receives (Fig. 2.3).

E_{t+1} = E_t + η_ch P^ch_t Δt − (P^dis_t Δt) / η_dis    (2.3)

where η_ch P^ch_t Δt is the net energy stored and (P^dis_t Δt)/η_dis the energy withdrawn.

Figure 2.3: Symbolic proportion of delivered energy versus heat losses.

2.2.2 Photovoltaic Dynamics

Table 2.2: Notation for symbols and parameters used in photovoltaics

Symbol | Meaning | Unit
δ | Solar declination angle (tilt north/south of equator) | rad
φ | Site latitude (positive north of equator) | rad
β | Panel tilt angle (from horizontal) | rad
γ | Panel azimuth angle (deviation from poles) | rad
H | Hour angle (solar-time offset: H = 0 at solar noon) | rad
θ_i | Solar incidence angle (between sun rays and panel normal) | rad
G(τ) | Plane-of-array irradiance at time τ | kW m⁻²
R_t | Raw radiation sum over [t − Δt, t] | kWh m⁻²
R^eff_t | Cosine-corrected radiation sum over [t − Δt, t] | kWh m⁻²
A_PV | Effective PV array area | m²
η_PV | PV conversion efficiency | –
P^PV_t | Average PV electrical power over interval | kW

Photovoltaic output is modeled from the total solar radiation over each control interval Δt [11]. First define the raw radiation sum:

R_t = ∫_{t−Δt}^{t} G(τ) dτ    [kWh m⁻²]    (2.4)

where G(τ) is the plane-of-array irradiance at time τ for hourly steps Δt. To account for the panel's tilt and orientation, we apply a cosine-law correction. Let

R^eff_t = ∫_{t−Δt}^{t} G(τ) cos(θ_i(τ)) dτ,    (2.5)

where θ_i(τ) is the solar incidence angle defined below and shown in Figure 2.4(b). If the PV array has effective area A_PV [m²] and module efficiency η_PV, treated as a constant for simplicity¹, the interval-averaged electrical power is

P^PV_t = η_PV A_PV R^eff_t / Δt    [kW]    (2.6)

With the usual Δt = 1 h, this reduces to P^PV_t = η_PV A_PV R^eff_t, i.e. power is directly proportional to the cosine-corrected hourly radiation.

¹η_PV depends partly on temperature, spectral effects, and long-term degradation, among other factors.

For a fixed panel tilt β and azimuth γ, the solar incidence angle on the module surface is given by Equation (2.7) [13].
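The radiation-to-power chain of Equations (2.4)–(2.6) can be sketched numerically. The irradiance samples, incidence angles, array area, and efficiency below are illustrative placeholders, not the values of the thesis installation:

```python
import math

def pv_power_kw(irradiance_kw_m2, incidence_deg, area_m2, eta):
    """Interval-average PV power per Eqs. (2.4)-(2.6): sum the
    cosine-corrected irradiance over hourly samples (a discrete stand-in
    for the integral) and scale by effective area and a constant
    module efficiency. Incidence angles beyond 90 deg contribute zero."""
    r_eff = sum(g * max(0.0, math.cos(math.radians(a)))  # Eq. (2.5)
                for g, a in zip(irradiance_kw_m2, incidence_deg))
    dt_h = len(irradiance_kw_m2)                         # hourly steps
    return eta * area_m2 * r_eff / dt_h                  # Eq. (2.6)

# One hour at 0.8 kW/m2 and 30 deg incidence, 100 m2 array, 20% efficiency:
p = pv_power_kw([0.8], [30.0], area_m2=100.0, eta=0.20)
```

The clipping at 90° simply encodes that a panel facing away from the sun receives no direct beam irradiance; the geometric origin of the cos θ_i factor is developed next.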
The effective plane-of-array irradiance is then G_eff(t) = G(t) cos(θ_i(t)). Figure 2.4(a) shows the relative factor cos(θz − β) for β = 27° versus β = 0°, highlighting that tilt increases the daily integral of G_eff by approximately 35%. Figure 2.4(b) illustrates the basic cosine geometry.

θ_i(t) = cos⁻¹[ sin δ sin φ cos β − sin δ cos φ sin β cos γ + cos δ cos φ cos β cos H + cos δ sin φ sin β cos γ cos H + cos δ sin β sin γ sin H ]    (2.7)

Figure 2.4: (a) Relative irradiance reduction cos(θz − β) as a function of incidence angle for two fixed tilts: horizontal (β = 0°) and a south-facing roof pitch (β = 27°). (b) Geometric interpretation of the cosine law [12].

Figure 2.5: Horizontal (altitude–azimuth) coordinate system used in solar-position calculations.

2.3 Machine Learning

Machine learning (ML) is a subfield of artificial intelligence focused on developing algorithms that automatically infer patterns, relationships, and decision rules from data, rather than relying on strict hard-coded rules. Over the past decade, advances in statistical learning theory, optimization methods, and scalable computing have enabled ML models to tackle increasingly complex tasks, from image classification to sequential decision-making.

In energy systems, the high dimensionality and nonlinearity of phenomena such as fluctuating demand, variable renewable generation, and time-varying market prices pose challenges for conventional rule-based controllers. ML approaches address these challenges by learning functional mappings directly from historical and real-time measurements, adapting to changing conditions, and continually improving as more data become available.

Broadly speaking, ML methods can be divided into:

• Supervised learning: Models trained on labeled input–output pairs to predict quantities of interest (e.g. load forecasting, price forecasting).
• Unsupervised learning: Techniques that discover structure in unlabeled data (e.g. clustering of consumption patterns, anomaly detection).

Each class offers distinct capabilities: supervised models excel at probabilistic forecasts, while unsupervised methods reveal hidden structure. In this chapter, we focus on the two ML components driving our AI-based energy optimizer: (1) time-series forecasters for generating multi-step-ahead predictions of electrical load, PV output, and spot prices, and (2) a deep reinforcement-learning agent that selects battery charge/discharge actions. The following sections outline the essential concepts and mathematical formulations underlying each model class before delving into their specific architectures and training algorithms.

Figure 2.6: Machine-learning chapter overview showing the different branches between the forecasting and reinforcement-learning classes
clustering of consumption patterns, anomaly detection).

Each class offers distinct capabilities: supervised models excel at probabilistic forecasts, while unsupervised methods reveal hidden structure. In this chapter, we focus on the two ML components driving our AI-based energy optimizer: (1) time-series forecasters for generating multi-step-ahead predictions of electrical load, PV output, and spot prices, and (2) a deep reinforcement-learning agent that selects battery charge/discharge actions. The following sections outline the essential concepts and mathematical formulations underlying each model class before delving into their specific architectures and training algorithms.

Figure 2.6: Machine learning chapter overview showing the different branches of the forecasting and reinforcement-learning classes (Time-Series Forecasting: Neural Networks, Temporal Convolutional Networks (TCN), Long Short-Term Memory (LSTM), Gradient-Boosted Trees; Reinforcement Learning: Markov Decision Process (MDP), Proximal Policy Optimization & Recurrent PPO).

2.3.1 Time-series forecasting

Time-series forecasting is a predictive modeling technique that analyzes past sequential data points to identify patterns and trends that can be extrapolated into predictions; see Figure 2.7(b). This differs from other modeling techniques, such as tabular forecasting, which treat each time series as an independent example. We define time-series forecasting mathematically by letting x_{1:t} = {x_1, ..., x_t} denote a history and x̂_{t+h} its h-step prediction.

Naturally, more accurate and robust forecasts can be obtained by incorporating multiple coherent data sources rather than relying solely on past target values.
For a simple real-world example, forecasting the indoor temperature of an office room, one might include the historical indoor temperature series x_{1:t} together with exogenous inputs

z_{1:t} = {outdoor temperature, occupancy, HVAC set-point, indoor humidity}_{1:t},

see Figure 2.7(a). A general multivariate forecasting function then takes the form

x̂_{t+h} = f(x_{1:t}, z_{1:t}).

Figure 2.7: (a) Inputs for forecasting indoor temperature: outdoor temperature, occupancy, HVAC set-point, indoor humidity. (b) Example h-step-ahead forecast.

Building on the general forecasting formulation introduced above, we now turn to a powerful class of nonlinear models capable of capturing complex, non-stationary relationships in time-series data: neural networks.

2.3.1.1 Neural Networks (NN)

Neural networks are a family of nonlinear function approximators inspired by the structure of biological brains. The simplest unit is the perceptron, which computes a weighted sum of its inputs plus a bias,

net = Σ_{i=0}^{n} w_i x_i + b,    (2.8)
o = σ(net),    (2.9)

where σ(·) is an activation function (e.g. sigmoid, ReLU; see Table 2.3) that introduces nonlinearity. By stacking many such units into layers, we obtain a multilayer perceptron (MLP), see Figure 2.8.

Table 2.3: Common activation functions

Function name | Definition
Unit step | σ_step(net) = 1 if net > 0, −1 otherwise
Linear | ϕ_lin(net) = net
Logistic (sigmoid) | σ_log(net) = 1 / (1 + e^{−net})
Hyperbolic tangent | σ_tanh(net) = (e^{net} − e^{−net}) / (e^{net} + e^{−net})
Rectified Linear Unit (ReLU) | σ_ReLU(net) = max(0, net)

Activation functions are a core component of neural networks, introducing the nonlinearity that enables these models to learn complex patterns beyond simple linear mappings. Early neural architectures predominantly used the logistic sigmoid function, which "squeezes" the weighted sum into the interval (0, 1), mirroring the biological concept of a neuron being inactive or active.
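The single unit of Equations (2.8)–(2.9) can be sketched in a few lines (an illustration; function names are ours):

```python
import math

def relu(net):
    """Rectified Linear Unit: zero below the threshold, identity above it."""
    return max(0.0, net)

def sigmoid(net):
    """Logistic function: squeezes the weighted sum into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))

def perceptron(x, w, b, activation=sigmoid):
    """One unit, Eqs. (2.8)-(2.9): weighted sum plus bias, then a nonlinearity."""
    net = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(net)
```

An MLP simply feeds the outputs of one layer of such units as inputs to the next.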
However, as networks have grown deeper and more complex, the Rectified Linear Unit (ReLU) has become the dominant choice: it closely mimics the biological neuron's threshold behavior, where an output "fires" only after the input surpasses a certain level, and it dramatically accelerates training convergence [14].

Training a neural network entails finding the set of weights W^(ℓ) and biases b^(ℓ) for each layer ℓ = 1, ..., L that minimize a chosen loss function J(θ). For regression tasks one typically uses the mean-squared error,

J(θ) = (1/N) Σ_{i=1}^{N} (y_i − f_θ(u_i))²,

where θ = {W^(ℓ), b^(ℓ)} collectively denotes all parameters, u_i the i-th input, and y_i its target. Computing the gradient ∇_θ J efficiently for deep, layered models is made possible by backpropagation, which applies the chain rule to propagate error signals from the output layer back to each parameter. Parameters are then updated iteratively.

Figure 2.8: Multilayer perceptron, a kind of neural network.

Figure 2.9: Gradient descent toward a global minimum on the loss surface.

2.3.1.2 Temporal Convolutional Networks (TCN)

Temporal Convolutional Networks (TCNs) are convolutional architectures specifically designed for sequence data. Unlike recurrent models, TCNs use 1D causal convolutions, filters that only span current and past time steps, so that at no point does information "leak" from the future. By stacking layers with exponentially increasing dilation (spacing between filter taps), a TCN achieves a very large receptive field with relatively few layers, allowing it to capture long-range dependencies efficiently.

Each convolutional layer is wrapped in a residual block: after applying a small stack of operations (convolution, weight normalization, ReLU and dropout), the block's input is added back to its output. These skip-connections stabilize training and help gradients flow through deep networks.
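The receptive-field growth from dilation stacking can be illustrated with a short calculation (a sketch; with one causal convolution per dilation level, each layer with kernel k and dilation d extends the field by (k − 1)·d, and typical TCN residual blocks contain two such convolutions):

```python
def receptive_field(kernel_size, dilations, convs_per_level=1):
    """Receptive field (in time steps) of stacked dilated causal convolutions."""
    r = 1  # the current time step itself
    for d in dilations:
        r += convs_per_level * (kernel_size - 1) * d
    return r

# Exponentially increasing dilations cover long histories with few layers:
# kernel 3 with dilations 1, 2, 4, 8, 16, 32 already spans 127 time steps.
```

This is why a handful of layers suffices to cover, for example, a full week of hourly history.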
In practice, TCNs combine the parallelism of CNNs with the sequential modeling power of RNNs, often outperforming LSTM/GRU on large forecasting benchmarks.

Figure 2.10: TCN model with input length l, kernel size k and dilation base b, with k ≥ b and a minimum number of residual blocks n for full history coverage, where n can be computed from the other values as explained above. Source: https://unit8.com/wp-content/uploads/2021/07/image-50.png

2.3.1.3 Long Short-Term Memory (LSTM)

Long Short-Term Memory networks (LSTMs) are a type of recurrent neural network [15] designed to capture long-range dependencies in sequence data while avoiding the vanishing- and exploding-gradient problems of standard RNNs. At their core, each LSTM cell maintains a cell state C_t, which carries information forward unchanged unless explicitly modified by learned gating mechanisms. Three gates regulate this flow:

• Forget gate f_t decides which information from the previous cell state C_{t−1} to discard.
• Input gate i_t determines which new information to add to the cell state, often computed as a candidate update C̃_t.
• Output gate o_t controls which parts of the updated cell state are exposed as the hidden state h_t.

By learning when to remember, update, or output information, LSTMs excel at modeling time series with complex temporal patterns, such as seasonality, trends, and irregular events, making them a cornerstone architecture for sequential forecasting tasks [15].

Figure 2.11: Internal structure of an LSTM cell, showing how the forget gate f_t, input gate i_t, and output gate o_t regulate the flow of information into and out of the cell state C_t and hidden state h_t.

2.3.1.4 Gradient-Boosted Trees

Gradient-Boosted Trees (GBT) are a powerful ensemble method that builds a strong predictive model by sequentially adding decision trees, each one trained to correct the mistakes of the ensemble so far.
Rather than fitting one large, complex tree, GBT constructs many small "weak learners" (shallow trees), combining them into a robust predictor. Key characteristics include:

• Stage-wise learning: each new tree focuses on the residual errors (differences between the true target and the current model's predictions) and attempts to reduce them.
• Shrinkage and learning rate: a small scaling factor (the learning rate) limits each tree's impact, encouraging gradual improvements and reducing overfitting.
• Tree complexity control: maximum depth, minimum child weight, and subsampling ratios govern each tree's size and the fraction of data used, providing regularization.
• Handling missing and categorical data: modern GBT implementations (e.g. XGBoost) automatically learn optimal "default directions" for missing values, and efficiently bin or one-hot encode categorical features.
• Feature importance and interpretability: by measuring how often and how effectively features split the data, GBTs offer built-in diagnostics to rank predictors and guide feature engineering.
• Scalability and speed: highly optimized libraries exploit parallel tree construction, out-of-core computation, and low-level optimizations to handle large-scale datasets.

In time-series forecasting, GBT models excel when combined with thoughtfully engineered features: lagged values, rolling statistics (mean, variance), and calendar indicators (hour of day, day of week). Their ability to capture nonlinear interactions without extensive hyperparameter tuning makes them a popular baseline and often a component in hybrid deep-learning ensembles.

Figure 2.12: Illustration of gradient boosting: successive shallow trees are added to the ensemble, each one correcting the residual errors of the accumulated model.
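The stage-wise residual-fitting idea can be sketched with one-split "stumps" on 1-D data (illustrative only; function names are ours, and real libraries such as XGBoost add regularization and second-order statistics):

```python
def fit_stump(x, residuals):
    """Find the single threshold split minimizing squared error on residuals."""
    best = None
    for thr in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda xi, thr=thr, lm=lm, rm=rm: lm if xi <= thr else rm

def boost(x, y, n_trees=50, lr=0.1):
    """Gradient boosting for squared loss: each stump fits current residuals,
    scaled by a small learning rate (shrinkage)."""
    pred = [0.0] * len(y)
    trees = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        trees.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: sum(lr * t(xi) for t in trees)
```

Because of shrinkage, each round removes only a fraction of the remaining residual, so the ensemble approaches the target gradually rather than in one aggressive step.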
2.3.2 Deep Reinforcement Learning

Deep reinforcement learning augments the standard RL framework by representing policies π_θ(a | s) and value functions V^π(s), Q^π(s, a) with deep neural networks parameterized by θ. Instead of tabular lookup, the network takes a raw state s_t (and, in recurrent variants, a hidden state h_{t−1}) and outputs either a distribution over actions or an estimate of the expected return. The "deep" component refers to the stacked layers that learn hierarchical features from high-dimensional inputs, enabling scalable solutions in complex environments.

Training proceeds by sampling trajectories {s_t, a_t, r_t, s_{t+1}} and updating θ to maximize the expected return

J(π_θ) = E_{π_θ}[ Σ_{t=0}^{T} γ^t r_t ].

Gradient estimators such as the policy gradient theorem allow backpropagation of ∇_θ J through the network.

Table 2.4: Notation for reinforcement-learning, MDP and PPO-related symbols

Symbol | Meaning | Unit / Domain
S | Set of all states (e.g. battery SoC) | –
A | Set of all actions (e.g. charge power level) | –
P(s′ | s, a) | Transition probability | –
R(s, a) | Immediate reward | currency / kWh
γ | Discount factor | [0, 1]
π_θ(a | s) | Parameterized policy | –
J(π_θ) | Expected return | cumulative reward
V^π(s) | Value of state s | cumulative reward
Q^π(s, a) | Value of (s, a) | cumulative reward
Â_t | Advantage estimate at time t | same as reward
r_t(θ) | Probability ratio π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t) | –
ϵ | Clipping parameter in PPO | –
L^CLIP(θ) | PPO clipped objective | –
θ | Network parameters | –
α | Learning rate | –
h_t | Recurrent hidden state | vector

To ground these concepts, Figure 2.13 shows a simple gridworld in which an agent (e.g. a cleaning robot) must navigate to clean a "dirty" room and then return to a charging station. Each cell S_i represents a distinct state; at each time step the agent chooses one of four actions (up, down, left, right) and transitions according to the MDP dynamics.
The reward function penalizes leaving the room dirty and incentivizes reaching the charging station before battery depletion. This toy environment illustrates how states, actions, transition probabilities, and rewards come together in a Markov decision process.

Figure 2.13: Gridworld MDP for a cleaning robot: the agent starts in one of the nine states S_1 ... S_9, must visit the dirty room at S_5, and then navigate to the charging station at S_9. Transitions occur on up/down/left/right actions, and the reward structure encourages cleaning and timely return to charge.

2.3.2.1 Markov Decision Process (MDP)

A Markov decision process is a framework for modeling sequential decision-making under uncertainty. It captures the dynamics of an environment in terms of states, actions, transitions, and rewards, providing the foundation upon which RL algorithms optimize behavior.

An RL problem is formalized as an MDP (S, A, P, R, γ). At each time t, the agent in state s_t ∈ S selects action a_t ∈ A according to π_θ(a_t | s_t), transitions with probability P(s_{t+1} | s_t, a_t), and receives reward r_t = R(s_t, a_t). The goal is to find π_θ that maximizes the discounted return G_t = Σ_{k=0}^{∞} γ^k r_{t+k}.

2.3.2.2 Proximal Policy Optimization & Recurrent PPO algorithms

Proximal Policy Optimization (PPO) updates the policy by maximizing the clipped surrogate objective in Equation (2.10), where r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t) and Â_t is an advantage estimate (e.g. GAE). This constraint on r_t prevents excessively large policy updates.

Recurrent PPO (RPPO) extends PPO by embedding an RNN (LSTM or GRU) to maintain a hidden state h_t, so that π_θ(a_t | s_t, h_{t−1}) can condition on past observations. During training, trajectories carry both s_t and h_{t−1}, and gradients propagate through time. This enables the agent to handle partial observability and learn temporal abstractions in environments with latent dynamics.

L^CLIP(θ) = E_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ϵ, 1 + ϵ) Â_t ) ]    (2.10)
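The clipped surrogate of Equation (2.10) is a one-liner over a batch of probability ratios and advantages; a minimal sketch (function name is ours):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate (Eq. 2.10) averaged over a batch of samples:
    the elementwise minimum of the unclipped and clipped terms caps the
    incentive for moving the ratio far from 1."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()
```

For example, a ratio of 1.5 with a positive advantage contributes only as if the ratio were 1 + ϵ = 1.2, which is exactly the mechanism preventing excessively large policy updates.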
2.4 Performance metrics

As shown in Table 2.5, a suite of standard forecasting performance metrics is employed to evaluate model accuracy and robustness. These metrics quantify errors between predicted values ŷ_t and actual observations y_t over a validation set of size N.

By combining absolute, squared, percentage, and robust loss functions, the evaluation framework captures different aspects of error behavior, ranging from average deviations to sensitivity toward large outliers. Together, these metrics provide a comprehensive understanding of model performance, enabling comparison across varying error scales and distributions.

The first group of metrics, Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE), focuses on error magnitude. MAE computes the average of absolute differences |ŷ_t − y_t|, treating all deviations equally and offering robustness to extreme values. In contrast, MSE squares each error term (ŷ_t − y_t)², thereby penalizing larger errors more heavily and emphasizing outliers. RMSE, defined as the square root of MSE, restores the error units to match those of the original data, making interpretation more intuitive while still retaining the squared-error emphasis on larger deviations.

The remaining metrics capture relative and distributional aspects of error. Mean Absolute Percentage Error (MAPE) expresses the average absolute error relative to the true value y_t, facilitating comparisons across series with different scales but suffering undefined values when y_t = 0.

Symmetric Mean Absolute Percentage Error (sMAPE) addresses this limitation by symmetrizing the denominator as (|y_t| + |ŷ_t|)/2, thus bounding the measure between 0% and 200% and reducing extreme percentage errors for values near zero.

The coefficient of determination R² quantifies the proportion of variance in y_t explained by the model, ranging from −∞ to 1, with values closer to one indicating a better fit.
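As an illustrative sketch (function name is ours), the point metrics above can be computed directly from paired predictions and observations:

```python
import math

def forecast_metrics(y_true, y_pred):
    """MAE, RMSE, sMAPE (%) and R^2 for paired observation/prediction lists."""
    n = len(y_true)
    errors = [yp - yt for yt, yp in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # symmetrized denominator keeps the percentage bounded near zero values
    smape = 100.0 / n * sum(
        abs(e) / ((abs(yt) + abs(yp)) / 2)
        for e, yt, yp in zip(errors, y_true, y_pred)
    )
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "sMAPE": smape, "R2": r2}
```

A perfect forecast yields MAE = RMSE = sMAPE = 0 and R² = 1; a forecast no better than the constant mean yields R² = 0.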
Finally, Huber Loss combines the properties of MAE and MSE by using a quadratic penalty for errors within a threshold δ and a linear penalty for larger residuals, offering a balanced approach that is less sensitive to outliers than MSE but more discriminative than MAE.

Table 2.5: Forecasting metrics used in this project

MAE = (1/N) Σ_{t=1}^{N} |ŷ_t − y_t|. Mean Absolute Error: average of the absolute differences between predicted values ŷ_t and true values y_t. Treats all errors equally and is robust to outliers.

MSE = (1/N) Σ_{t=1}^{N} (ŷ_t − y_t)². Mean Squared Error: average of the squared differences between ŷ_t and y_t. Penalizes larger errors more heavily, making it sensitive to outliers.

RMSE = sqrt( (1/N) Σ_{t=1}^{N} (ŷ_t − y_t)² ). Root Mean Squared Error: square root of MSE, providing an error measure in the same units as y_t. Emphasizes larger errors similarly to MSE.

MAPE = (100%/N) Σ_{t=1}^{N} |(ŷ_t − y_t)/y_t|. Mean Absolute Percentage Error: error as a percentage of the true value y_t. Useful for relative accuracy but undefined when y_t = 0.

sMAPE = (100%/N) Σ_{t=1}^{N} |ŷ_t − y_t| / ((|y_t| + |ŷ_t|)/2). Symmetric Mean Absolute Percentage Error: bounded between 0% and 200%; mitigates extreme percentage errors when values are near zero by symmetrizing numerator and denominator.

R² = 1 − Σ_{t=1}^{N} (y_t − ŷ_t)² / Σ_{t=1}^{N} (y_t − ȳ)², with ȳ = (1/N) Σ_{t=1}^{N} y_t. Coefficient of Determination: proportion of variance in y_t explained by the model. Ranges from −∞ to 1; values closer to 1 indicate a better fit.

Huber Loss = (1/2)(y_t − ŷ_t)² for |y_t − ŷ_t| ≤ δ, and δ(|y_t − ŷ_t| − δ/2) otherwise. Combination of MSE and MAE that is quadratic for small errors (|y_t − ŷ_t| ≤ δ) and linear for large errors (|y_t − ŷ_t| > δ). Balances sensitivity to outliers; δ is the threshold parameter.

3 Methods

This chapter explains how the Home Energy AI system was built, from raw data to live control. It documents the full data acquisition, modelling, control design, automation, and validation.
The study begins with a bird's-eye workflow, before detailing the two main components of the pipeline: (i) multi-modal forecasting of demand, price and solar production, and (ii) a safety-constrained reinforcement-learning agent that decides on battery usage.

Each subsequent section focuses on one link in that chain, describing the rationale behind key design choices, the data and tools used, and the procedures applied to verify performance. Throughout, emphasis is placed on reproducibility, robustness, and relevance to Swedish residential tariffs and climate data. Prefect was chosen as the orchestration engine because of its reliability, ease of scheduling, failure handling and thorough documentation.

Figure 3.1: System architecture

3.1 Work-flow overview

Figure 3.1 sketches the end-to-end data–model–control workflow that turns raw measurements into real-time commands for the house. The diagram is organized in lanes corresponding to each logical layer of the software stack.

Data layer

Flows, implemented in Prefect¹, run at weekly, hourly and 15-minute intervals to ingest external data (spot prices, weather fields, commodity indices) and internal measurements (household consumption, heat-pump metrics, solar output). Each flow's schedule is defined in code so that timing is predictable and failures automatically retry with logs. Raw API responses are passed through an ETL (Extract, Transform, Load) process: timestamps are aligned, basic validity checks (gaps, outliers) are applied, and lagged/rolling features are computed. Any detected anomalies are sent to the monitoring layer for validation. (See Section 3.2 for details.)

Machine learning pipeline

To prevent concept drift, all forecasting models and the RL agent are retrained on a weekly basis. The forecasting stack and the reinforcement-learning agent each have dedicated modules (Sections 3.3 and 3.4, respectively).
In brief, weekly flows assemble the latest cleaned data, run hyperparameter optimization for each forecasting model, and update model artifacts so that the control layer always uses up-to-date predictions.

Control layer

Every 15 minutes, a recurrent PPO agent (Section 2.3.2.2) evaluates the current state (prices, forecasts, state-of-charge, capacity-fee context) and proposes a battery control action. That action is filtered by a deterministic safety module, enforcing charge/discharge limits, SoC bounds and breaker constraints, before any command is sent to the house APIs. (Further orchestration details appear in Section 3.5.)

¹ Prefect is an open-source workflow orchestration tool [16].

3.2 Data acquisition & pre-processing

All raw signals enter through two pipelines.

15-Minute Home Data Update

Executes every 15 minutes. It gathers home-level measurements needed by the control agent: energy consumption, battery state-of-charge, heat-pump metrics and recent weather forecasts. All readings are aligned to Europe/Stockholm time to ensure consistency.

Hourly Exogenous Data Update

Executes every hour. This retrieves external inputs: day-ahead spot prices, CO2 intensity, fuel costs, grid-mix breakdown and coarse weather forecasts, then aligns them to CET and forward-fills missing values for a continuous hourly series.

Both pipelines use modular scripts that apply basic integrity checks (timestamps, gap detection and numerical validation) before feature engineering. The cleaned, feature-rich tables feed directly into the machine learning and control pipelines.

Figure 3.2: Data acquisition pipeline showing how raw API signals pass through validation and feature engineering to produce complete, clean, usable data for the ML stack.
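The alignment and gap-filling step can be sketched with pandas (a simplified illustration; the function and column names are ours, not the project's actual module API):

```python
import pandas as pd

def clean_hourly(raw: pd.DataFrame) -> tuple[pd.DataFrame, int]:
    """Align an hourly exogenous feed to local time, expose gaps, forward-fill."""
    df = raw.copy()
    df.index = pd.to_datetime(df.index, utc=True).tz_convert("Europe/Stockholm")
    df = df[~df.index.duplicated(keep="first")].sort_index()
    # reindex onto a complete hourly grid so missing hours become explicit NaNs
    full = pd.date_range(df.index.min(), df.index.max(),
                         freq="h", tz="Europe/Stockholm")
    df = df.reindex(full)
    n_gaps = int(df.isna().any(axis=1).sum())   # report gaps to monitoring
    return df.ffill(), n_gaps
```

Working in a timezone-aware index also handles the DST transitions that would otherwise corrupt hour-of-day features.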
3.3 Forecasting models

Effective forecasting is at the heart of our Home Energy AI system, where accurate price, demand, and solar predictions feed directly into the control agent's decisions. In this chapter, we describe three forecasting streams, electricity price (Section 3.3.1), home demand (Section 3.3.2), and solar production (Section 3.3.3), each built and validated separately before being stitched together for real-time operation.

We start by explaining why we opted for a trio of specialized sub-models rather than a single predictor (Trend, Peak, Valley for prices), and how we merge their outputs into a coherent hourly price curve. Next, we explain the demand model that captures household consumption patterns at a 15-minute resolution, followed by the solar model that uses weather and array-geometry data to forecast PV output. Each section covers data inputs, feature engineering, model architecture, training regimen, and performance metrics.

Our goal is not to chase perfect accuracy but to provide "good enough" forecasts, refreshed weekly, so that the RL agent can optimize battery charge/discharge decisions with minimal latency and maximal robustness.

3.3.1 Price Model

Forecasting electricity prices is difficult: prices depend on many factors (supply, demand, weather, fuel costs, transmission) [4]. European day-ahead prices are computed with the EUPHEMIA market-coupling algorithm across bidding zones [17]. Even SE3 prices, our target, reflect cross-border flows via that same process, which further complicates prediction [18].

Building one "all-in-one" model with every feature can capture subtle patterns, but it demands hard-to-get data, large compute budgets, and added complexity [4]. Instead, we use three focused sub-models:

• "Trend" for the overall price level.
• "Peak" detector for high-price hours.
• "Valley" detector for low-price hours.
Because merging multiple models can amplify forecast errors [4], we adopt the "specialized + merge" strategy, where each smaller model is faster to train and can be updated weekly with proper validation, yet still delivers predictions for battery control.

3.3.1.1 Trend Model

Table 3.1: All input features by category. Data range starts at 2017-01-01.

Grid: fossilFreePercentage, renewablePercentage, powerConsumptionTotal, powerProductionTotal, powerImportTotal, powerExportTotal, nuclear, wind, hydro, solar, unknown, import_SE-SE2, export_SE-SE4, import_NO-NO1, export_NO-NO1, import_DK-DK1, export_DK-DK1, import_FI, export_FI
Prices: PriceArea, SE3_price_ore, price_24h_avg, price_168h_avg, price_24h_std, hour_avg_price, price_vs_hour_avg, Gas_Price, Coal_Price, CO2_Price
Weather: temperature_2m, cloud_cover, relative_humidity_2m, wind_speed_100m, wind_direction_100m, shortwave_radiation_sum
Time & Holiday: hour_sin, hour_cos, day_of_week_sin, day_of_week_cos, month_sin, month_cos, is_morning_peak, is_evening_peak, is_weekend, season, is_holiday, is_holiday_eve, days_to_next_holiday, days_from_last_holiday

The Trend Model uses XGBoost to forecast the next 24 hours of SE3 day-ahead spot prices (öre/kWh) at an hourly resolution. Its main purpose is to capture the smooth backbone of the price curve so that subsequent models can focus on extreme spikes and dips.

Features. All features listed in Table 3.1 are generated by the hourly data acquisition pipeline (Section 3.2). These include grid-related metrics, historical price statistics, weather variables, and time/holiday encodings. Each hourly record represents a snapshot of the system state, with appropriate lagged and rolling statistics already applied.

Hyperparameter tuning. Every week, an Optuna study samples trials to identify optimal hyperparameters.
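The weekly tuning loop can be sketched as follows. To keep the sketch dependency-free we substitute plain random sampling for Optuna's sampler (Optuna's TPE sampler is smarter, but the trial structure is the same); the ranges shown mirror the demand-model search space listed later, and `train_eval` stands in for the project's actual rolling-split training routine:

```python
import math
import random

def random_search(train_eval, n_trials=50, seed=0):
    """Optuna-style study, sketched with random sampling: draw a candidate
    hyperparameter set, evaluate validation RMSE, keep the best trial."""
    rng = random.Random(seed)
    best_params, best_rmse = None, math.inf
    for _ in range(n_trials):
        params = {
            "max_depth": rng.randint(4, 12),
            "n_estimators": rng.randint(200, 1500),
            # log-uniform draw for the learning rate, as in the study
            "learning_rate": math.exp(rng.uniform(math.log(0.005), math.log(0.2))),
        }
        rmse = train_eval(params)   # train on rolling splits, return val. RMSE
        if rmse < best_rmse:
            best_params, best_rmse = params, rmse
    return best_params, best_rmse
```

The winning parameter set is what gets tagged as the production model for the week.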
The objective is squared-error regression, using RMSE as the evaluation metric, with the "hist" tree method and "depthwise" growth policy. The best-performing model on a rolling hold-out validation set is tagged as the current production version.

Training & validation. Training occurs weekly on the most recent 3–6 months of hourly data. We employ time-series cross-validation with forward-rolling splits, holding out one day at a time. Models that improve validation metrics replace the existing production model.

Inference. At runtime, the Trend Model outputs a 24-element vector of hourly price forecasts. These baseline predictions are later combined with outputs from the Peak and Valley detectors (Section ??) to form a composite forecast that balances overall trend accuracy with sensitivity to extreme price events.

3.3.1.2 Peak & Valley Models

Both the Peak and Valley Models use a temporal convolutional network (TCN) to classify each of the next 24 hours as an extreme event (peak or valley) or not. Instead of predicting a continuous price, each model outputs a 24-element binary vector, where "1" denotes a predicted peak (or valley) hour.

Features. Both models reuse the feature set in Table 3.1 (Section 3.3.1.1). Inputs are constructed as a sliding window of the past 168 hours (one week) of these features, yielding a 168 × (feature count) tensor for each model.

Peak labeling. Before training the models, we first needed to define what constitutes a "peak" or "valley" in the price series, a seemingly easy task on its own. To do this we scanned historical prices, and an hour is marked as a peak if it passes a derivative check that captures sharp rise-then-drop points. These labels (see Figure 4.1) serve as ground truth for Peak Model training.

Valley labeling. By inverting the peak criteria, an hour is labeled a valley if it passes a negative derivative check, thus capturing sharp drop-then-rise patterns.
These binary valley labels form the training targets for the Valley Model (see Figure 4.2).

Architecture. Each TCN has two stacked residual blocks:

• Three 1D convolutions (kernel 3, dilations 1, 2, 4), 64 filters each, followed by BatchNorm, ReLU, dropout, and a skip connection.
• Three 1D convolutions (kernel 3, dilations 8, 16, 32), same pattern.

A global average layer collapses the temporal dimension, and a final dense layer with 24 sigmoid outputs produces per-hour probabilities. Both models share this structure; they differ in class-weight and loss settings.

Training & Validation. Because peaks and valleys account for < 5% of hours, we apply SMOTE² to oversample minority-class examples, increasing extreme-event labels to ≈ 20% of the training data. After training, we sweep the probability threshold to balance precision and recall.

Inference & Performance. At inference, a threshold converts probabilities into binary flags. We visualize the model's output by mapping predicted probabilities to marker heights, so higher probabilities indicate a stronger predicted extreme (see Figure 4.3(a&b)). However, this approach is flawed: since the probabilities do not directly correspond to actual peak or valley magnitudes, it was chosen purely for ease of implementation.

² SMOTE (Synthetic Minority Oversampling Technique) is a machine learning technique used to address class imbalance in datasets. It generates synthetic samples of the minority class to help balance the class distribution and improve model performance.

3.3.1.3 Merged Model

After training the Trend, Peak, and Valley Models independently, we merge their outputs into a single prediction. This is done with simple override logic: any hour flagged as a peak or valley replaces the Trend Model's forecast for that hour, scaled by the labeled prediction probability to set the price magnitude.
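A sketch of that override merge. The thesis describes the probability scaling only loosely, so the (1 ± p) factors below, the function name, and the 0.5 threshold are our illustrative assumptions; only the flag-or-keep structure is taken from the text:

```python
def merge_forecasts(trend, peak_prob, valley_prob, threshold=0.5):
    """Override merge: a flagged hour replaces the trend value with a
    probability-scaled value; unflagged hours keep the trend forecast."""
    merged = list(trend)
    for h, (p, v) in enumerate(zip(peak_prob, valley_prob)):
        if p >= threshold:
            merged[h] = trend[h] * (1.0 + p)   # push flagged peaks above trend
        elif v >= threshold:
            merged[h] = trend[h] * (1.0 - v)   # push flagged valleys below trend
    return merged
```

The key property is that the extreme detectors never alter unflagged hours, so the trend backbone survives intact.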
Hours not flagged by either extreme model keep the Trend Model's value; see Figure 4.9 for reference. This merging is suboptimal, since the extreme-price models are not directly trained to predict the magnitude of an hour; the solution was used for its simplicity.

3.3.2 Home Demand Model

The Home Demand Model forecasts hourly residential electricity consumption using an XGBoost regressor with extensive feature engineering. Rather than classifying extreme events, this model predicts a continuous consumption value (kWh) for each of the next 24 hours, enabling the downstream controller to schedule battery usage optimally.

Features. We engineer over 220 predictors drawn from historical consumption, calendar signals, weather data, and home-specific dynamics. Many of these are lagged features (e.g., consumption at 1 h, 2 h, 3 h, 24 h, 48 h, 72 h, and 168 h ago, plus rolling statistics over 6 h, 12 h, 24 h, 48 h, 72 h, and 168 h windows) to capture temporal autocorrelation. For the full list of non-lagged features (time encodings, temperature, wind speed, solar irradiance, etc.), see Table 3.1.

During this project, no data were available on occupancy patterns, a feature highly correlated with home energy demand. To address this, we fit a three-state Gaussian Hidden Markov Model (HMM) on historical hourly load data. Each hour's observed consumption is treated as an emission, and the HMM learns three latent "occupancy" states roughly corresponding to low, medium, and high home activity. At each time step, we compute the posterior probability of each HMM state given the past sequence of loads. These three posterior probabilities are included as features, allowing the model to infer whether occupants are likely away (state 1), partially present (state 2), or fully present (state 3).
By providing this latent occupancy signal, the XGBoost regressor can adjust its forecast when people are more or less active in the home.

The only available data regarding the power-hungry heat pump is its binary state (running or not). We estimate its consumption explicitly via two intermediate calculations.

Thermal output:

Q̇_heat = ṁ · c_p · ΔT_clamped,

where ΔT_clamped = min(max(ΔT, 2 K), 10 K) is the supply–return water temperature difference, clamped between 2 K and 10 K.

Electrical input:

P_in = Q̇_heat / COP,

using a fixed Coefficient of Performance (COP). Including both Q̇_heat and P_in as features captures the nonlinear relationship between ambient/indoor temperatures and heat-pump electrical load.

Architecture. We employ an XGBoost regressor in reg:squarederror mode to learn nonlinear mappings from the 220+ features to hourly consumption. Optuna is used to tune the following hyperparameters:

• n_estimators: 200 – 1500
• max_depth: 4 – 12
• learning_rate: 0.005 – 0.2 (log-scaled)
• subsample: 0.7 – 0.95
• colsample_bytree: 0.7 – 0.95
• reg_alpha: 10⁻⁸ – 10 (log-scaled)
• reg_lambda: 10⁻⁸ – 10 (log-scaled)
• min_child_weight: 1 – 7
• gamma: 0 – 0.5

These hyperparameters are searched over 500 Optuna trials, using time-series cross-validation on the training window.

Inference & Performance. During inference, the model generates point predictions for each of the next 24 hours. Prediction uncertainty is estimated via the variance of leaf outputs in the trained trees. Model validation is performed on a hold-out dataset to assess generalization before deployment.

3.3.3 Solar Prediction Model

To estimate the photovoltaic (PV) energy output for our installation, we relied on the forecast.solar³ API rather than developing a custom model.
This service handles the necessary physical and meteorological computations (e.g., irradiance decomposition, angle-of-incidence adjustments) described in the Theory chapter, allowing us to obtain reliable hourly forecasts spanning multiple days.

Our PV system comprises two arrays oriented at different azimuths (see Figure 3.3):
• Southeast-facing array: 24 panels with a tilt of 30°.
• Northwest-facing array: 26 panels, also tilted at 30°.

We issue separate API calls to forecast.solar for each orientation to account for differences in irradiance and shading throughout the day. After receiving the two hourly time series, one for the southeast array and one for the northwest array, we sum their outputs at each hour to obtain the total predicted energy production.

To align the forecasts with our panels' real-world characteristics (manufacturer tolerances, inverter efficiency, wiring losses, etc.), we apply a small scaling factor to each orientation's forecast. This factor was determined empirically by comparing historical production data (hourly aggregated) against the API's predictions over a representative validation period. The resulting forecast is visualized in Figure 4.6.

Figure 3.3: Layout of the dual-orientation PV array with per-panel energy output (in kWh), where north is up.

Footnote: forecast.solar is a cloud-based forecasting service that leverages PVGIS (Photovoltaic Geographical Information System) for irradiance and weather-based solar production estimates via a simple RESTful API [19].

3.4 Reinforcement learning agent

The core of our control strategy is a reinforcement learning (RL) agent designed to optimally manage battery charge and discharge in response to predicted electricity prices, forecasted solar production, and household load.
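The per-orientation combination described in Section 3.3.3 amounts to a scaled element-wise sum of the two hourly series; the scaling factors in this sketch are placeholders, not the empirically fitted values.

```python
def combine_pv_forecasts(se_kw: list, nw_kw: list,
                         se_scale: float = 1.0,
                         nw_scale: float = 1.0) -> list:
    """Sum hourly forecasts from the southeast and northwest arrays,
    each corrected by an empirically determined scaling factor."""
    if len(se_kw) != len(nw_kw):
        raise ValueError("forecast series must cover the same hours")
    return [se_scale * se + nw_scale * nw for se, nw in zip(se_kw, nw_kw)]
```

In practice the two input lists come from separate forecast.solar API calls, one per array azimuth.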
Instead of using a rule-based or optimization-only approach, we implement a recurrent Proximal Policy Optimization (RPPO) agent with LSTM memory to capture temporal dependencies in price, solar, and load patterns. By interacting with a custom Gymnasium environment, the agent learns a policy that balances cost minimization (including spot-price energy costs and monthly capacity fees) with battery state-of-charge (SoC) constraints.

3.4.1 Environment Design

The environment's observation (state) is a dictionary comprising the following components:
• Battery SoC: a continuous scalar in [0, 1] representing the fraction of the battery's usable capacity currently stored.
• Time Index: a 3-element vector (hour of day, minute of hour, day of week) to encode temporal patterns.
• Price Forecast: a 24-hour-ahead vector of predicted spot prices (SE3 market), in öre/kWh, updated daily.
• Solar Forecast: a 72-hour-ahead vector of predicted PV production (kW), obtained from the forecast.solar API.
• Capacity Metrics: a 5-dimensional vector (top1, top2, top3, rolling average, month progress) that tracks the three highest grid import peaks encountered in the current billing month (for capacity-fee calculation), the rolling average import, and how far through the month we are.
• Price Averages: two scalars (24-hour average, 168-hour average) representing recent average spot prices, used to identify high- and low-price regimes.
• Night Discount Flag: a boolean (0 or 1) indicating whether the current hour falls into a low-tariff "night" period.
• Load Forecast: a 72-hour-ahead vector of predicted household consumption (kW), produced by the demand forecasting model.

Footnote: Gymnasium is an open-source toolkit and framework for developing RL algorithms [?].

3.4.2 Reward Function

We design a multi-component reward at each hour t to guide the agent toward cost minimization while respecting SoC constraints and preferring to keep the SoC in a comfortable mid-range.
The total reward r_t is the sum of:

1. Grid Cost Penalty: When net grid power P_g,t (kW) is drawn from the grid, the instantaneous cost in öre is computed from the sum term in Equation (2.1), covered in Section 2.1. This penalizes importing power during expensive hours.

2. Capacity Fee Penalty: In Sweden's tariff structure, only the single highest grid import peak per day contributes to the monthly capacity fee. We maintain a list of the top-3 peaks {(t_i, P_peak,i)} in the current month. Whenever a new P_g,t exceeds that day's recorded peak, we update the list accordingly. At the end of each month, the highest recorded peak (kW) is multiplied by the constant tariff (öre/kW).

3. SoC Penalty: We encourage the agent to keep the battery's SoC within a preferred range [SoC_low, SoC_high]. The complete SoC reward is shown in Equation (3.1), where α is a large penalty factor for violating the hard limits [SoC_min, SoC_max], while β softly rewards or penalizes departures from the preferred mid-range.

4. Potential-Based Shaping: To accelerate learning without altering the optimal policy, we add a potential-based shaping term [7]. Let Φ(soc_t) be a smooth potential function maximized at (0.3 + 0.8)/2 = 0.55. Then at each step: r_shape,t = γ Φ(soc_{t+1}) − Φ(soc_t), where γ = 0.99 is the PPO discount factor. This term (weight w_shape) gently guides the agent's SoC toward the "optimal" mid-range over time, speeding up convergence.

5. Battery Degradation Penalty: Cycling the battery (charging or discharging) incurs wear costs. We penalize the absolute energy throughput |E_throughput,t| (in kWh) by w_deg × |E_throughput,t| × 45 öre/kWh, where 45 öre/kWh is derived from the initial cost as C_init / (N_cycles × E_capacity), i.e., the effective cost per kWh of usage [10]. This term is symmetric for charge vs. discharge and discourages excessive cycling.

6.
Action Modification Penalty: When the agent proposes an action a_t that violates safety constraints (e.g., would drive soc_{t+1} outside [SoC_min, SoC_max]), the environment's safety mask overrides it. Each override incurs a penalty, and consecutive violations escalate the multiplier. This discourages unsafe exploration and encourages the agent to learn feasible actions.

7. Arbitrage Bonus: This is the core profit mechanism. We define dynamic thresholds based on recent price percentiles: P_low = 30th percentile of the last 24 h, P_high = 75th percentile of the last 168 h. The agent is then rewarded for charging when Price < P_low and for discharging when Price > P_high.

8. Export Bonus: When the agent discharges into the grid, we also reward exported energy E_exp,t (kWh) by w_export × (spot_t + tax_bonus) × E_exp,t. Adding a tax_bonus of 60 öre/kWh simulates the current Swedish tax reduction on exported energy, so discharging during moderately high prices yields extra incentive. This encourages the agent to export when it is both profitable and grid-friendly.

9. Night Charging Reward: To further incentivize off-peak charging, any energy E_night,t charged during the "night" window (22:00–06:00) receives a bonus.

10. Solar-Aware SoC Management: This component makes room for solar charging when significant production is expected. Logic: at each time step, look ahead 6–12 hours using the solar forecast. If the forecasted incoming solar energy exceeds 2.0 kWh and the current state of charge soc_t > 0.60, assess the available battery headroom H_t = 1 − soc_t. Calculation: if H_t < 0.50 × E_solar_6–12h, then any discharge E_dis,t that creates additional headroom is rewarded. In other words, for every kWh discharged in anticipation of soon-available solar production, the agent gains a bonus.

11. Night-to-Peak Chain Bonus: This encourages a chain strategy of charging at night when prices are low and discharging during subsequent peak hours.
The environment tracks E_night_charged, the energy (kWh) drawn from the grid between 22:00 and 06:00 each day. This summed pool decays after 24 hours (i.e., energy charged more than 24 h ago is discarded). During any "peak" hour (defined as either Price > P_high or observed household load > L_high), any discharge E_chain,t that can be matched against the current pool is rewarded. In practice, if the agent discharges 1 kWh during a peak hour that was originally charged at night (within the last 24 h), it receives a bonus.

The SoC reward is:

Reward(soc_t) =
  −α × (1 + severity),                                                    if soc_t ≤ SoC_min or soc_t ≥ SoC_max,
  +β × (1 − |soc_t − (SoC_low + SoC_high)/2| / ((SoC_high − SoC_low)/2)), if SoC_low ≤ soc_t ≤ SoC_high,
  −β × (SoC_low − soc_t) / (SoC_low − SoC_min),                           if SoC_min < soc_t < SoC_low,
  −β × (soc_t − SoC_high) / (SoC_max − SoC_high),                         if SoC_high < soc_t < SoC_max.     (3.1)

The net reward at time t is then given in Equation (3.2), or in the more compact form

R(t) = wᵀR(t) = Σ_{i=1}^{N} w_i R_i(t),

where all reward components R_i are collected into a vector R(t) ∈ ℝ^N and the corresponding weights w_i into w ∈ ℝ^N:

R(t) = (−grid_cost(t), −capacity_penalty(t), −degradation_cost(t), soc_reward(t), shaping_reward(t), night_charging(t), arbitrage_bonus(t), export_bonus(t), −action_penalty(t), solar_soc_reward(t), night_peak_chain(t))ᵀ,

w = (w_grid, w_cap, w_deg, w_soc, w_shape, w_night, w_arbitrage, w_export, w_action_mod, w_solar, w_chain)ᵀ.     (3.2)

This reward function balances the key objectives of residential energy storage: reducing electricity costs, preserving battery life, ensuring safety, and capturing revenue opportunities. By splitting it into weighted components, we can tune the agent to prioritize cost savings during expensive periods and strategically store energy for future arbitrage.
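A direct transcription of the piecewise SoC reward in Equation (3.1) follows. The numerical limits (SoC_min = 0.10, SoC_low = 0.30, SoC_high = 0.80, SoC_max = 0.95) and the α, β magnitudes are illustrative assumptions, since the thesis does not fix all of them.

```python
def soc_reward(soc: float, severity: float = 0.0,
               alpha: float = 10.0, beta: float = 1.0,
               soc_min: float = 0.10, soc_low: float = 0.30,
               soc_high: float = 0.80, soc_max: float = 0.95) -> float:
    """Piecewise SoC reward of Equation (3.1): hard penalty outside
    [soc_min, soc_max], soft reward inside [soc_low, soc_high],
    linear penalty in the buffer zones in between."""
    if soc <= soc_min or soc >= soc_max:
        return -alpha * (1.0 + severity)          # hard-limit violation
    if soc_low <= soc <= soc_high:
        mid = (soc_low + soc_high) / 2.0
        half_width = (soc_high - soc_low) / 2.0
        return beta * (1.0 - abs(soc - mid) / half_width)  # peak at mid-range
    if soc < soc_low:
        return -beta * (soc_low - soc) / (soc_low - soc_min)
    return -beta * (soc - soc_high) / (soc_max - soc_high)
```

The reward peaks at the mid-range (0.55 with these limits) and falls off linearly toward the soft bounds, matching the shaping potential's maximum described in component 4.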
The potential-based shaping and graduated penalties promote stable learning without altering the optimal policy, while solar-aware SoC management and night-to-peak chaining guide the agent to anticipate renewable generation and predictable load cycles. Overall, this design lets the RL agent learn advanced, market- and weather-aware strategies that should outperform simple peak-shaving or time-of-use rules, achieving positive economic results while respecting operational limits.

3.4.3 Training Regime

We implement a training regime incorporating several advanced techniques to ensure robust policy learning. The agent is trained using a three-phase curriculum learning schedule spanning millions of timesteps:

Phase 1 (20% of training) focuses on basic battery management with simplified 3-day episodes, emphasizing SoC discipline with reduced reward complexity.

Phase 2 (30% of training) introduces price arbitrage strategies over 7-day episodes while maintaining basic constraints.

Phase 3 (50% of training) employs full system complexity with 30-day episodes, complete solar integration, and all reward components active.

Adaptive exploration is achieved through dynamic entropy-coefficient scheduling, starting with high exploration (4× base entropy) early in training and gradually reducing to 0.5× base entropy in the final 20% of training to enable policy refinement.

Continuous reward monitoring analyzes component balance every 100,000 timesteps, detecting reward dominance and extreme value ranges, and providing automatic recommendations for hyperparameter adjustment.

Data augmentation introduces controlled variability during training through random scaling of solar production (±5%) and consumption patterns (±15%) to improve policy robustness across diverse operating conditions.
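The random-scaling augmentation can be sketched as follows. Applying one scale factor per episode is an assumption for the sketch; the thesis does not state whether scaling is drawn per episode or per step.

```python
import random

def augment_episode(solar_kw, load_kw, rng=None):
    """Randomly rescale solar (+/-5%) and consumption (+/-15%) traces,
    as in the data-augmentation step of Section 3.4.3."""
    rng = rng or random.Random()
    solar_factor = 1.0 + rng.uniform(-0.05, 0.05)   # solar: +/-5%
    load_factor = 1.0 + rng.uniform(-0.15, 0.15)    # consumption: +/-15%
    return ([s * solar_factor for s in solar_kw],
            [x * load_factor for x in load_kw])
```

Each training episode would then see slightly different solar and load magnitudes, discouraging the policy from overfitting to exact historical traces.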
This training framework enables the agent to learn stable, interpretable policies that generalize effectively across varying seasonal conditions and market dynamics while maintaining strict adherence to operational safety requirements. Table 3.2 summarizes the core hyperparameters of our RecurrentPPO agent.

Table 3.2: Trained Model Parameters

Parameter            Value
Algorithm            RecurrentPPO
Learning Rate        2 × 10⁻⁴
Discount Factor      0.995
Rollout Buffer       4096
Batch Size           256
Training Epochs      8
LSTM Layers          1
LSTM Hidden Size     64
Training Steps       30,000,000+
GAE Lambda           0.95
Entropy Coefficient  0.01

3.4.4 Benchmark Strategies

To evaluate the RL agent's performance, we implement five rule-based battery management strategies representing conventional approaches used in residential energy storage systems.

1. No Battery Baseline: This strategy serves as the control case where no battery intervention occurs. The household directly imports or exports all net energy (consumption minus solar production) to the grid, establishing the worst-case scenario for capacity fees and energy costs.

2. Time-of-Use Strategy: A simple time-based approach using fixed scheduling rules:
• Night charging (22:00–06:00): moderate charging to 60% SoC when solar production is minimal
• Evening discharge (16:00–20:00): conservative discharge from 40% SoC during typical peak hours
• Power limits: 80% of maximum charging rate, 70% of maximum discharge rate

3. Price-Based Strategy: Uses fixed price thresholds based on typical Swedish electricity market ranges:
• Low-price threshold: 40 öre/kWh (conservative estimate)
• High-price threshold: 120 öre/kWh (conservative estimate)
• Conservative charging: 80% power rate when the price is below the low threshold
• Conservative discharge: 60% power rate when the price is above the high threshold

4.
Solar-Following Strategy: Reacts only to current-timestep solar conditions:
• Excess solar charging: store 90% of excess production when solar > demand
• Solar deficit discharge: provide 80% of the shortfall when solar < 50% of demand
• Simple thresholds: 30–80% SoC operating range

5. Peak-Shaving Strategy: Uses simple historical peak tracking (24-hour memory) without demand forecasting:
• Dynamic threshold: the maximum of 5.0 kW or 80% of recent peak demand
• Peak reduction: discharge when current demand exceeds the threshold
• Excess storage: charge with 80% of excess generation above 1.5 kW

All strategies include battery degradation costs in their economic evaluation, calculated at 45 öre/kWh of energy throughput. This comparison framework demonstrates the economic value of intelligent prediction and multi-objective optimization in complex energy management scenarios.

3.5 Orchestration & Automation

The orchestration is managed by a Python script using Prefect 3.4.1. Each basic operation, such as fetching data, training a model, or running an inference, is defined as a Prefect task. Related tasks are grouped into flows that run on a schedule. All tasks use run_python_script() to ensure consistent retry behavior, timeouts, and logging. We assign descriptive names to tasks so the Prefect UI shows clear labels instead of file paths.

There are four main flows, each with its own schedule. An hourly flow updates external data (electricity prices, weather, CO2 metrics, and solar forecasts). A 15-minute flow gathers home data (energy consumption, heat-pump status, and actual load) and then runs the reinforcement-learning (RL) agent to decide battery actions; this flow also generates a brief HTML/Markdown report with the agent's status. A daily flow runs each morning to refresh data and, if enabled, run price forecasts. Weekly flows retrain the models: price models on Sundays, the demand model on Mondays, and the RL agent on Tuesdays (see Figure 3.4).
These training jobs use adjustable hyperparameters and are allowed extra time to finish. When the RL agent runs, we parse its output to extract battery SoC, energy commands, price signals, and action values. This information is included in a short report that flags whether the agent is charging, discharging, or idle, and summarizes the solar and load forecasts. All logs, reports, and model files are saved as Prefect artifacts so any issues can be traced later (see Figure ?? in Section ??).

Error handling is consistent across all flows: each task retries once after a brief delay if it fails, and any task that exceeds its timeout is stopped and logged. A single task's failure does not stop the entire flow; downstream steps either skip or use fallback logic. This design keeps data fetching, model training, and RL inference running smoothly with minimal manual intervention.

Figure 3.4: Prefect Dashboard showing the deployments for the scheduling of the Home system

3.6 Validation & Evaluation Procedures

To verify that each forecasting component and the RL controller perform as intended under realistic conditions, we employ a layered validation strategy. For the price peak and valley detectors, where extreme events are sparse, we emphasize balancing precision and recall via oversight of false positives and negatives. For the trend model, we rely on automated hyperparameter tuning with constrained Optuna trials, exploiting XGBoost's computational efficiency. Finally, the RL agent undergoes extensive testing on historical data. For each model, dedicated evaluation scripts generate performance plots using unseen test-split data, enabling manual inspection of any anomalous behavior, because visual review often reveals issues that numerical metrics alone cannot fully capture.
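Prefect provides retry and timeout behavior natively through task options; purely as an illustration of the retry-once policy described above, a Prefect-free sketch of the wrapper logic might look like this (the function is hypothetical, not the project's actual run_python_script()):

```python
import time

def run_with_retry(fn, *args, retries: int = 1, delay_s: float = 1.0, **kwargs):
    """Run a callable, retrying once after a brief delay on failure,
    mirroring the per-task error handling of Section 3.5."""
    last_exc = None
    for attempt in range(retries + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:        # on failure, wait and retry once
            last_exc = exc
            if attempt < retries:
                time.sleep(delay_s)
    raise last_exc                      # exhausted retries: surface the error
```

In the real system the surrounding flow catches this final exception, so one task's failure lets downstream steps skip or fall back rather than aborting the whole flow.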
The RL agent generates dedicated HTML/Markdown documentation of its current battery action decision, shown in Figure 3.5, for logging and, more importantly, to provide visual feedback. This approach supports the project's goal of creating a highly user-friendly system that can be easily adapted in the future.

Figure 3.5: Prefect markdown artifact generated after running the agent before and after turning on the sauna (left image before, right image after the sauna has been on for some time).

4 Results

In this chapter, we present the performance outcomes for each component of the optimization platform on test data and in real-world deployment. We first report results for the price-extreme detectors (peaks/valleys) and the trend forecasting model, followed by solar-generation forecasts. Next, we summarize demand-forecast accuracy. Finally, we detail the RL agent's testing results.

4.1 Price Models

The price forecasting pipeline comprises three components: detecting extreme price events (peaks and valleys), forecasting the underlying price trend, and merging these outputs into a single, unified prediction. In the first subsection, we evaluate how accurately the peak and valley classifiers flag hours. Next, we assess the Trend Model's ability to capture broader hourly price movements over a month. Finally, the Merged Model section shows how combining the trend forecast with extreme-event flags improves overall prediction quality.

4.1.1 Price extreme detection

Figure 4.1 displays a two-week period, with ground-truth peaks marked by red triangles on the hourly spot price curve (blue). Similarly, Figure 4.2 highlights true valleys. These plots illustrate that extreme price events are quite rare: fewer than 5% of hours qualify as high-price peaks, while the low-price labels make up ≈ 20% of the data.
Because these peak and valley labels are generated by a rule-based algorithm and treated as "hard truths," the resulting classifiers can still exhibit numerous false negatives and false positives. For example, when a peak persists for more than one hour, the algorithm may label only a single hour as a peak, even though we would ideally mark every hour of sustained high prices. The same challenge applies to valley periods: despite extensive effort to ensure the labeling algorithm captures multiple consecutive low-price hours, it often fails to do so. Consequently, relying solely on summary metrics can be misleading, and visual inspection of the labeled events is therefore essential.

Figure 4.1: Hourly spot price (blue) with red triangles marking detected peaks (ground truth) over a two-week window.

Figure 4.2: Hourly spot price (blue) with red triangles marking valleys.

Figures 4.3a and 4.3b show the model performance over one week of labeled data for peaks and valleys, respectively.

Peaks

In Figure 4.3a, the blue line represents the hourly spot price, red triangles mark actual (ground-truth) peaks, and inverted green triangles mark predicted peaks. The shaded vertical bands indicate actual peaks. Over this week, there were 8 actual peaks, but the model predicted 16. This yields a precision of 0.44 but a recall of 0.88: the model catches nearly all true peaks, while fewer than half of its predictions are correct. The resulting F1-score is 0.58, reflecting the trade-off: the classifier is tuned to prioritize recall (so as not to miss rare high-price hours) at the expense of many false positives.

Valleys

Similarly, Figure 4.3b shows the valley-detection performance over one week of labeled data. The blue line again denotes the hourly spot price, red triangles mark the actual (ground-truth) valley hours, and inverted green triangles mark the predicted valleys. The shaded vertical bands in this plot indicate actual valley periods.
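The reported scores follow the standard classification-metric definitions. For the peak week above (8 actual peaks, 16 predictions, and, consistent with the reported precision and recall, 7 true positives), they can be reproduced as:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard classification metrics used for the peak/valley detectors.

    tp: true positives, fp: false positives, fn: false negatives.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

With tp = 7, fp = 9, fn = 1 this gives precision ≈ 0.44, recall ≈ 0.88, and F1 ≈ 0.58, matching Figure 4.3a.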
During this week, there were 31 true valley hours, but the model predicted 47. As a result, the precision is 0.47 (fewer than half of the predicted valleys align with the ground truth) while the recall is 0.71 (about 70% of true valleys are successfully identified). The combined F1-score of 0.56 reflects this balance, again emphasizing that the classifier is tuned to favor recall over precision.

Notice how several actual valley periods span multiple consecutive hours (for example, the trough around February 23–24), yet the model sometimes captures only a subset of those hours, or extends the predicted band into adjacent non-valley hours. This behavior underlines the difficulty of perfectly labeling extended low-price runs: while SMOTE-augmented training and threshold tuning help catch most valley events, occasional false positives and false negatives remain inevitable. Visual inspection of these shaded bands thus remains crucial for validating whether the classifier's "imperfect but recall-focused" performance is acceptable before passing its signals to the RL controller.

Figure 4.3: Model performance for predicted peaks (a, top) and valleys (b, bottom) over one week of labeled data. (Scores are underestimates, as some true peaks are missing from the labels.)

4.1.2 Price trend model

Figure 4.4 illustrates the Trend Model's performance on SE3 price data throughout April 2024. In the top panel, the blue curve shows the actual hourly spot prices, while the red curve depicts the model's 24-hour-ahead forecasts. We observe that the forecast generally captures broad daily patterns, such as gradual rises during mid-April volatility and lower prices in early April, but smooths out short-lived spikes and dips. The middle panel plots the point-wise error, defined as actual − predicted.
Green bars above zero indicate hours when the model underestimates the actual price, and red bars below zero indicate overestimates. During periods of rapid price escalation we expect large errors, since the trend model is meant to capture the overall shape rather than the high-frequency volatility.

In the bottom panel, the orange bars represent the daily MAE, while the dashed blue line shows the average actual daily price, which closely mirrors the red forecast curve in the top panel, indirectly confirming the model's overall accuracy.

The directional accuracy is 0.57, but this figure is misleading since the metric is calculated on data that is too granular. The average daily price provides a more meaningful performance measure, although here it is presented only visually.

Figure 4.4: Trend Model performance for April 2024. Top: actual vs. predicted hourly SE3 prices (blue/red). Middle: hourly error (green = underestimate, red = overestimate). Bottom: daily MAE (orange) and average daily price (dashed blue). Metrics: MAE 20.18 öre, RMSE 26.77 öre, directional accuracy 0.57.

4.1.3 Merged price model

The merged price model injects detected peaks and valleys into the Trend Model's smooth forecast; Figure 4.5 shows this merging. Note that we use the classifier's probability score to set the spike height, even though the peak/valley model is not trained to predict actual price magnitudes. As a result, the injected peaks may not reflect true peak heights. By adding the flags nonetheless, large errors around sharp spikes and drops are reduced compared to the Trend Model alone.

Figure 4.5: Merged model over one-week validation. Top: actual (blue) vs. trend forecast (green). Middle: merged output (magenta) with peaks (red) and valleys (blue). Bottom: classifier probabilities (thresholds shown).

4.2 Solar forecasts

Figure 4.6 compares hourly predicted (orange) and actual (blue) solar production for five consecutive days (March 15–19, 2025).
Across this interval, the forecast closely tracks the characteristic ramp-up and ramp-down of PV output. The hourly MAE over these five days is 0.45 kW (out of the 20.3 kW system), corresponding to an sMAPE of 5.3%. Minor overpredictions occur, but overall the timing and magnitude of the peaks closely match the actual production.

Figure 4.6: Hourly predicted (orange) versus actual (blue) solar energy production for the PV installation over five consecutive days (March 15–19, 2025), illustrating the close alignment of the forecasted and measured outputs.

Figure 4.7 presents a heatmap of predicted hourly production from March 1 to March 10, 2025. Each cell's color intensity indicates the forecasted kWh for that hour and date. Notice how the predictions capture the variability of cloud cover on March 4–5 (midday dips) and correctly shift the noon-hour peak later on March 7, when sunrise was delayed. Over these ten days, the average daily predicted energy is 25.6 kWh.

Figure 4.7: Heatmap of predicted hourly solar energy production from March 1 to March 10, 2025. Each cell shows the forecasted kWh for that hour and date, with deeper reds indicating higher output.

4.3 Demand Model

Figure 4.8 shows the Home Demand Model's performance from March 1–15, 2025. In the top panel, actual hourly consumption (blue) and predicted demand (orange) are overlaid, with a shaded ±1σ uncertainty band. Over these 336 hours, the model achieves RMSE = 0.80 kWh, MAE = 0.46 kWh, and R² = 0.874. Peaks from morning and evening heating loads are well captured.

The middle-left heatmap displays the HMM-derived occupancy states (0 = low, 1 = medium, 2 = high) by hour and day, indicating higher load when occupancy is high. The bottom-left timeline traces the continuous HMM state sequence, revealing weekday transitions.
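The error metrics quoted in this chapter follow their usual definitions. The sMAPE variant below, which normalizes by the mean of |actual| and |predicted| and skips zero-production hours, is an assumption, since the thesis does not spell out the exact formula:

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent."""
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2.0
        if denom > 0.0:                 # skip hours where both are zero
            total += abs(a - p) / denom
    return 100.0 * total / len(actual)
```

For solar series, skipping zero hours matters because nighttime hours would otherwise make a percentage error undefined.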
The bottom-right plot shows the prediction residuals (actual − predicted), which mostly lie within ±1 kWh; larger errors occur mainly due to a lead or lag in the prediction. Overall, the model tracks baseline patterns and heating-related peaks accurately, with HMM states, temperature, and lag features enabling robust adaptation to both daily cycles and anomalies.

Figure 4.8: Home Demand Model performance. Top: actual vs. predicted consumption (±1σ). Middle: HMM occupancy states and ambient temperature by hour/day. Bottom: HMM state timeline and residuals.

Figure 4.9 ranks the Demand Model's top 20 features by their normalized XGBoost gain. The single most influential feature is "Hp Contribution" (heat-pump load), accounting for ≈ 19% of total gain. The immediate consumption metrics "Consumption" (≈ 11%) and "Consumption Power Ratio" (≈ 10%) are next, highlighting the importance of current usage levels. Weekday vs. weekend demand ("Consumption Dow (day-of-week) Avg Ratio," ≈ 9%) and HMM-inferred occupancy states ("HMM State Posterior 0," ≈ 1.7%; "HMM State Posterior 2," ≈ 1.5%) also rank highly, showing that both occupancy and day-of-week patterns matter. Lagged consumption (e.g., "Consumption Pct 1H," "Consumption Same Hour 1D Ago," "Consumption Lag 24H") and solar–temperature interaction features (≈ 1% each) further refine the model, while lower-gain features such as multi-day rolling means (≈ 0.6%–0.8%) show diminishing returns.

Figure 4.9: Top 20 features by XGBoost normalized gain in the Home Demand Model, highlighting heat-pump contribution, raw consumption metrics, and occupancy and lag-based predictors, among others.

4.4 Reinforcement Learning Agent

The RL-based controller demonstrates robust, data-driven battery management with clear improvements over conventional, rule-based baselines in our simulated environment.
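Normalized gain, as used in the feature ranking of Figure 4.9, divides each feature's total gain by the sum over all features. A minimal sketch with a hypothetical gain dictionary (the feature names and values below are illustrative only; in XGBoost the raw scores would come from booster.get_score(importance_type="gain")):

```python
def normalized_gain(gain: dict) -> dict:
    """Convert raw gain scores into fractions of total gain,
    ordered from most to least influential."""
    total = sum(gain.values())
    ranked = sorted(gain.items(), key=lambda kv: kv[1], reverse=True)
    return {name: value / total for name, value in ranked}
```

The resulting fractions sum to 1, so a feature's share can be read directly as a percentage of total gain.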
Training is performed with the RecurrentPPO algorithm; the final hyperparameters are summarized in Table 3.2. A full training run (30 million timesteps) on an AMD Ryzen 5700X CPU (no GPU) requires approximately 78 hours.

4.4.1 Performance Results

Cost under low solar generation: In the "cloudy month" scenario (January 2025), shown in Figure 4.10, the RL controller achieves the lowest total cost at 2 550 SEK, a 41% reduction compared to no battery (4 320 SEK) and a 15% improvement over the best rule-based method (Peak Shaving, 2 990 SEK) for this period. Although net profit is impossible when solar yield is minimal, the agent effectively limits grid imports to minimize the monthly expense.

Net benefit with generous PV: In contrast, during a "sunny month" (May 2025, Figure 4.11), the same agent realizes a net benefit of 900 SEK, turning otherwise idle battery capacity into revenue. This marks a 125% increase versus no battery (400 SEK) and a 50% gain over Peak Shaving (600 SEK), showing the controller's ability to collect surplus solar and arbitrage price fluctuations.

Figure 4.10: Monthly cost for different control strategies under a cloudy month (Jan 2025).

Figure 4.11: Net monthly benefit (SEK) for different control strategies under a sunny month (May 2025).

Peak Reduction Performance: Figure 4.12 clearly illustrates that the RL controller outperforms all baseline strategies in shaving peak grid import. With the RL policy, the highest hourly draw is limited to about 6.5 kW, compared to 11 kW when no battery is used (a 41% reduction).
The rule-based schemes all land between roughly 9 and 10 kW: Time-of-Use at ≈ 9.98 kW, Price-Based and Solar-Following both at ≈ 9.01 kW, and Peak-Shaving, interestingly, at 7 kW. Even dedicated peak-shaving logic cannot match the learned, price- and forecast-aware behavior of the RL agent (although the peak strategy is only a fixed "dump battery when import exceeds 6 kW" rule that does not ensure battery power is available for those occasions). This reduction in peak demand directly translates into lower capacity charges and greater operational flexibility, demonstrating the RL approach's ability to anticipate and pre-empt high-price/high-demand hours.

Figure 4.12: Maximum hourly import (kW) for different control strategies in January 2025.

4.4.2 Battery Operation and SoC Dynamics

Figure 4.13 illustrates a multi-day interval of the agent's operation (June 2024). The top panel overlays battery state of charge (SoC) against the hourly electricity price. The agent charges aggressively during low-cost windows and discharges when prices exceed a learned threshold. The middle panel shows household load and PV output. From these traces we observe that the agent maintains SoC within the preferred band (15–85%) while exploiting both price arbitrage and solar surplus.

Figure 4.13: Multi-day battery operation: SoC vs. electricity price (top), household consumption and PV production (middle), and hourly grid import with discount factor (bottom). Gray shading denotes nighttime discount periods.

5 Conclusion

This thesis demonstrates the feasibility and potential of AI-driven home energy management systems in the context of Sweden's transition to power-based electricity tariffs. Despite the complexity of integrating forecasting models, reinforcement learning control, and real-time hardware operation, the results show promising progress toward intelligent, automated battery management.
The developed system successfully coordinates multiple dynamic factors (spot price fluctuations, solar production variability, household consumption patterns, and power-based fees) into a unified control strategy. Field results confirm that AI approaches can yield measurable economic benefits, even with limited training time and minimal hyperparameter tuning. These findings suggest that further model refinement, extended training periods, and improved reward design could significantly enhance performance.

Today's residential battery management still shows a large gap between available technology and practical use. Most homeowners either rely on built-in manufacturer modes that ignore price signals, or manually schedule charging and discharging based on forecasts. The former neglects profit and grid optimization entirely, while the latter demands expertise and time for suboptimal results. Neither approach accounts for battery wear, dynamic grid tariffs, or coordinated solar-battery optimiza