Project 5: Predicting Energy Price. ML

March 10, 2021

In this ML project we will create a model that predicts the day-ahead price of power in Spain with Python.

As in previous projects, we will cover some of the most interesting steps and findings, but not all of them. For a better understanding, I recommend downloading the code (Jupyter notebook) and the datasets (CSV files).

Our dataset (power_market.csv) is composed of:

date: date of the observation, formatted as "%Y-%m-%d"
hour: hour of the observation, [0 - 23]
fc_demand: forecast of demand in MWh
fc_nuclear: forecast of nuclear power production in MWh
import_FR: forecast of the importing capacity from France to Spain in MWh
export_FR: forecast of the exporting capacity from Spain to France in MWh
fc_wind: forecast of wind power production in MWh
fc_solar_pv: forecast of PV solar (solar panels) power production in MWh
fc_solar_th: forecast of thermal solar power production in MWh
price: power price for each hour in €/MWh. This is the target we want to predict.
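
As a minimal sketch of loading a file with this layout, assuming the column names match the ones used in the code later in this post (fc_demand, import_FR and so on). The inline sample stands in for power_market.csv and its values are made up for illustration.

```python
import io

import pandas as pd

# Two made-up rows standing in for power_market.csv (illustrative values only).
sample_csv = io.StringIO(
    "date,hour,fc_demand,fc_nuclear,import_FR,export_FR,fc_wind,fc_solar_pv,fc_solar_th,price\n"
    "2020-01-01,0,24400.0,7117.2,3000.0,2600.0,1732.0,0.0,5.1,35.5\n"
    "2020-01-01,1,23616.0,7117.2,,,1826.0,0.0,0.6,33.2\n"
)

# parse_dates turns the "%Y-%m-%d" strings into datetime64 values,
# which the .dt accessors used during feature engineering require.
df = pd.read_csv(sample_csv, parse_dates=["date"])

# Separate the features from the target column.
X = df.drop(columns="price")
y = df["price"]
print(df.dtypes)
```

Parsing the date at load time keeps the later transformers free of string handling.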

And the scoring.csv:

We start off by importing the libraries we need. As well as the common machine learning libraries (sklearn, pandas, numpy, etc.), we use astral to calculate sunrise and sunset times, and xgboost for its gradient boosting regressor.

Exploratory Data Analysis

We create the class “ESHolidayCalendar” to check whether or not certain days are a holiday in Spain.

from pandas.tseries.holiday import AbstractHolidayCalendar, GoodFriday, Holiday
from pandas.tseries.offsets import CustomBusinessDay

class ESHolidayCalendar(AbstractHolidayCalendar):
    rules = [
        Holiday("New Years Day", month=1, day=1),
        GoodFriday,
        Holiday("Labor Day", month=5, day=1),
        Holiday("Spain Day", month=10, day=12),
        Holiday("All Saints Day", month=11, day=1),
        Holiday("Constitution Day", month=12, day=6),
        Holiday("Inmaculada Concepcion", month=12, day=8),
        Holiday("Christmas", month=12, day=25),
    ]

# All seven days are in the weekmask, so this offset skips holidays only.
spain_calendar = CustomBusinessDay(calendar=ESHolidayCalendar(), weekmask="Mon Tue Wed Thu Fri Sat Sun")
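
To see what such a calendar produces, here is a small sketch. ShortESHolidayCalendar is a hypothetical, shortened copy of the rules above, used only so the example is self-contained; the date range is arbitrary.

```python
import pandas as pd
from pandas.tseries.holiday import AbstractHolidayCalendar, GoodFriday, Holiday
from pandas.tseries.offsets import CustomBusinessDay


class ShortESHolidayCalendar(AbstractHolidayCalendar):
    # Shortened copy of the rules, for illustration only.
    rules = [
        Holiday("New Years Day", month=1, day=1),
        GoodFriday,
        Holiday("Christmas", month=12, day=25),
    ]


calendar = ShortESHolidayCalendar()

# holidays() returns a DatetimeIndex with every holiday in the interval.
holidays_2021 = calendar.holidays(start="2021-01-01", end="2021-12-31")

# With a seven-day weekmask the offset skips holidays but keeps weekends.
non_holiday = CustomBusinessDay(calendar=calendar, weekmask="Mon Tue Wed Thu Fri Sat Sun")
days = pd.date_range("2021-03-31", "2021-04-04", freq=non_holiday)
print(holidays_2021)
print(days)
```

Any date missing from a range generated with this offset is a holiday, which is the idea the is_holiday feature relies on further down.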

We detected that there are 26 null values in our dataframe (df.isnull().sum()). We plot a correlation matrix to make sure that our missing values are not correlated.
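
One way to run this kind of check is sketched below on a toy dataframe. The assumption that the nulls sit in import_FR and export_FR comes from the imputation step later in the post; the values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame: the two interconnection columns are missing on the same rows.
df = pd.DataFrame(
    {
        "fc_demand": [24000.0, 23500.0, 22900.0, 22500.0, 23100.0],
        "import_FR": [3000.0, np.nan, 2800.0, np.nan, 3100.0],
        "export_FR": [2600.0, np.nan, 2500.0, np.nan, 2700.0],
    }
)

# Number of nulls per column.
null_counts = df.isnull().sum()

# Correlate the 0/1 missingness indicators of the columns that contain nulls.
# A value near 1 means two columns tend to be missing on the same rows.
missing_cols = null_counts[null_counts > 0].index
missing_corr = df[missing_cols].isnull().astype(int).corr()
print(null_counts)
print(missing_corr)
```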

We can also detect some patterns with visualizations.

Feature Engineering

This DateEncoder is a result of our iterative feature engineering process. As we discovered new patterns in the data we added more features. It works in the same way as FunctionTransformers, so it can be included as a step in a pipeline, and its parameters can be changed so it behaves differently (include some features or not). This allows for clean pipelines, and prevents us from having to make copies of dataframes for every type of preprocessing we want to do.

These are the features it can extract:

season, which can be returned as a column of strings or as a series of encoded columns
is_holiday, whether or not it is a holiday in Spain
is_weekend
is_business_day, the combination of the two previous features
sunlight_s, seconds of daylight (from dawn to sunset)
sun, whether or not the sun is up at any given hour (compared by whole hours)
t_from_zenith, the hours away from the sun’s zenith
day_period, the period of the day (night, morning, day or evening)

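import numpy as np
import pandas as pd
from astral import LocationInfo, sun
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder

class DateEncoder(TransformerMixin, BaseEstimator):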
    def __init__(self, date_feature_name="date", hour_feature_name="hour", drop_date_col=True, drop_hour_col=True, ohe_cats=True):
        self.date_feature_name = date_feature_name
        self.hour_feature_name = hour_feature_name
        self.drop_date_col = drop_date_col
        self.drop_hour_col = drop_hour_col
        self.ohe_cats = ohe_cats

    def fit(self, X, y=None):
        return self

    def create_sun_items(self, date_series, city="Madrid", region="Spain", timezone="Europe/Madrid", latitude=40.4168, longitude=-3.7038):
        # Madrid lies west of Greenwich, so its longitude is negative.
        city = LocationInfo(city, region, timezone, latitude, longitude)
        # One dict of sun events (dawn, sunrise, noon, sunset, dusk) per date.
        s = date_series.apply(lambda x: sun.sun(city.observer, date=x, tzinfo=city.timezone))
        self.sun_items = s

    def transform(self, X):
        X_ = X.copy()

        month = X_[self.date_feature_name].dt.month
        X_["season"] = np.where(
            np.logical_or(month < 3, month == 12),
            "winter",
            np.where(np.logical_and(month >= 3, month < 6), "spring", np.where(np.logical_and(month >= 6, month < 9), "summer", "autumn")),
        )



        start_date = X_[self.date_feature_name].min()
        end_date = X_[self.date_feature_name].max()
        date_range = pd.bdate_range(start_date, end_date, freq=spain_calendar)
        holiday = X_[self.date_feature_name].isin(date_range)
        holiday = ~holiday
        X_["is_holiday"] = holiday.astype(int)

        # TODO: morning vs midday vs evening

        hour_cats_enc = np.where(
            np.logical_and(
                X_[self.hour_feature_name] >= 0,
                X_[self.hour_feature_name] < 6,
            ),
            "night",
            np.where(
                np.logical_and(X_[self.hour_feature_name] >= 6, X_[self.hour_feature_name] < 11),
                "morning",
                np.where(
                    np.logical_and(X_[self.hour_feature_name] >= 11, X_[self.hour_feature_name] < 18),
                    "day",
                    np.where(
                        np.logical_and(X_[self.hour_feature_name] >= 18, X_[self.hour_feature_name] < 22),
                        "evening",
                        np.where(np.logical_and(X_[self.hour_feature_name] >= 22, X_[self.hour_feature_name] < 24), "night", np.nan),
                    ),
                ),
            ),
        )

        X_["day_period"] = hour_cats_enc

        self.create_sun_items(X_[self.date_feature_name])
        X_["sun_item"] = self.sun_items
        X_["sunlight_s"] = X_.apply(lambda row: (row["sun_item"]["sunset"] - row["sun_item"]["dawn"]).seconds, axis=1)
        X_["sun"] = X_.apply(lambda row: row["sun_item"]["sunrise"].hour < row[self.hour_feature_name] < row["sun_item"]["sunset"].hour, axis=1).astype(int)
        X_["t_from_zenith"] = X_.apply(lambda row: np.abs(row["sun_item"]["noon"].hour - row[self.hour_feature_name]), axis=1)
        X_ = X_.drop("sun_item", axis=1)

        X_["is_weekend"] = (X_[self.date_feature_name].dt.dayofweek >= 5).astype(int)

        business_day = ~np.logical_or(X_["is_holiday"], X_["is_weekend"])
        X_["is_business_day"] = business_day.astype(int)

        if self.ohe_cats:
            # One-hot encode both categorical features. The encoded frames reuse
            # X_'s index so that pd.concat aligns them row by row.
            # On scikit-learn >= 1.2, use sparse_output=False and get_feature_names_out.
            ohe = OneHotEncoder(sparse=False, handle_unknown="ignore")
            seasons_ohe = ohe.fit_transform(X_[["season"]])
            season_cols = ohe.get_feature_names(["season"])
            seasons_ohe = pd.DataFrame(seasons_ohe, columns=season_cols, index=X_.index)

            dayperiod_ohe = ohe.fit_transform(X_[["day_period"]])
            dayperiod_cols = ohe.get_feature_names(["day_period"])
            dayperiod_ohe = pd.DataFrame(dayperiod_ohe, columns=dayperiod_cols, index=X_.index)

            X_ = pd.concat([X_, seasons_ohe, dayperiod_ohe], axis=1)
            # Drop the string columns now that their encoded versions exist.
            X_ = X_.drop(["season", "day_period"], axis=1)


        if self.drop_date_col:
            X_ = X_.drop(self.date_feature_name, axis=1)
        if self.drop_hour_col:
            X_ = X_.drop(self.hour_feature_name, axis=1)

        return X_
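
The nested np.where calls that build day_period can also be written as a binning step. The following is an alternative sketch using pd.cut, not the code from the notebook; the bin edges are copied from the thresholds above.

```python
import pandas as pd

hours = pd.Series(range(24), name="hour")

# Bin edges mirror the thresholds in the nested np.where:
# [0, 6) night, [6, 11) morning, [11, 18) day, [18, 22) evening, [22, 24) night.
day_period = pd.cut(
    hours,
    bins=[0, 6, 11, 18, 22, 24],
    labels=["night", "morning", "day", "evening", "night"],
    right=False,    # intervals are closed on the left and open on the right
    ordered=False,  # needed because the label "night" appears twice
)
print(day_period.value_counts())
```

Keeping the edges in a single list makes it easier to experiment with different period boundaries.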

This EnergyEncoder extracts ratios and differences between fc_demand and all the sources of production in our dataset.

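import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import KNNImputer

class EnergyEncoder(TransformerMixin, BaseEstimator):
    """Adds ratio and difference features between forecast demand and production sources."""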

    def __init__(self, calculate_ratios=True, calculate_diffs=True, impute_imp_exp=True):
        self.calculate_ratios = calculate_ratios
        self.calculate_diffs = calculate_diffs
        self.impute_imp_exp = impute_imp_exp

    #     def generate_thermal_gap(self, df):
    #         self.thermal_gap = df["fc_demand"] - df["fc_nuclear"] - df["fc_wind"] - df["fc_solar_pv"] - df["fc_solar_th"]

    def generate_ratios(self, df, numerators, denominators=("fc_demand",)):
        ratios = {}
        for denominator in denominators:
            for numerator in numerators:
                ratios[f"{numerator}/{denominator}"] = df[numerator] / df[denominator]
            # Combined share of all numerators relative to the denominator.
            ratios[f"total/{denominator}"] = df[numerators].sum(axis=1) / df[denominator]
        self.ratios = ratios

    def generate_diffs(self, df, left_amounts, right_amounts):
        diffs = {}
        for left_amount in left_amounts:
            for right_amount in right_amounts:
                diffs[f"dif_{left_amount}_{right_amount}"] = df[left_amount] - df[right_amount]
            diffs[f"dif_{left_amount}_total"] = df[left_amount] - df[right_amounts].sum(axis=1)
        self.diffs = diffs

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_ = X.copy()

        if self.impute_imp_exp:
            imp_X_ = X_.select_dtypes(include=np.number)
            cols = imp_X_.columns
            knnimp = KNNImputer()
            imp_X_ = knnimp.fit_transform(imp_X_)
            imp_df = pd.DataFrame(imp_X_, columns=cols, index=X_.index)[["import_FR", "export_FR"]]
            X_["import_FR"] = imp_df["import_FR"]
            X_["export_FR"] = imp_df["export_FR"]

        if self.calculate_ratios:
            self.generate_ratios(X, ["fc_nuclear", "fc_wind", "fc_solar_pv", "fc_solar_th"], ["fc_demand"])
            X_ = pd.concat([X_, pd.DataFrame(self.ratios)], sort=False, axis=1)

        #         if self.calculate_thermal_gap:
        #             self.generate_thermal_gap(X)
        #             X_["thermal_gap"] = self.thermal_gap
        if self.calculate_diffs:
            self.generate_diffs(X_, ["fc_demand"], ["fc_nuclear", "fc_wind", "fc_solar_pv", "fc_solar_th"])
            X_ = pd.concat([X_, pd.DataFrame(self.diffs)], sort=False, axis=1)
        return X_
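
To make the ratio and difference features concrete, here is a small worked sketch on two made-up rows. It assumes the same column names as the encoder but does not call it.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "fc_demand": [25000.0, 30000.0],
        "fc_nuclear": [7000.0, 7000.0],
        "fc_wind": [5000.0, 9000.0],
        "fc_solar_pv": [0.0, 3000.0],
        "fc_solar_th": [0.0, 1000.0],
    }
)
sources = ["fc_nuclear", "fc_wind", "fc_solar_pv", "fc_solar_th"]

# Share of forecast demand covered by each source, and by all of them together.
ratios = df[sources].div(df["fc_demand"], axis=0).add_suffix("/fc_demand")
ratios["total/fc_demand"] = df[sources].sum(axis=1) / df["fc_demand"]

# Demand left over after subtracting the listed sources
# (the commented-out code above calls this the thermal gap).
residual = df["fc_demand"] - df[sources].sum(axis=1)
print(ratios.round(3))
print(residual)
```

In the first row the four sources cover 12,000 of 25,000 MWh, so the total ratio is 0.48 and the residual is 13,000 MWh.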

We apply the transformations to our training data.

custom_preprocessing = make_pipeline(EnergyEncoder(), DateEncoder(ohe_cats=False))
transformed_df = custom_preprocessing.fit_transform(X_train, y_train)
all_features_df = pd.concat([transformed_df, y_train], axis=1)

Since some algorithms perform better with encoded variables, we one-hot encode the variables “season” and “day_period”.

one_hot_season = ColumnTransformer([("ohe", OneHotEncoder(sparse=False, handle_unknown="ignore"), ["season", "day_period"])], remainder="passthrough")
preprocessing = make_pipeline(custom_preprocessing, one_hot_season)
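
The effect of this step can be seen on a toy dataframe. This is a sketch under assumptions: the three rows are made up, and sparse_threshold=0 is used here (instead of the encoder's sparse argument) only to get a dense array across scikit-learn versions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame(
    {
        "season": ["winter", "summer", "winter"],
        "day_period": ["night", "day", "evening"],
        "fc_demand": [24000.0, 31000.0, 28000.0],
    }
)

# sparse_threshold=0 makes the ColumnTransformer return a dense array
# regardless of the encoder's own output format.
encoder = ColumnTransformer(
    [("ohe", OneHotEncoder(handle_unknown="ignore"), ["season", "day_period"])],
    remainder="passthrough",
    sparse_threshold=0,
)
encoded = encoder.fit_transform(toy)

# 2 season columns + 3 day_period columns + 1 passthrough column.
print(encoded.shape)

# A category unseen during fit is encoded as all zeros instead of raising.
unseen = pd.DataFrame({"season": ["spring"], "day_period": ["night"], "fc_demand": [25000.0]})
unseen_encoded = encoder.transform(unseen)
print(unseen_encoded)
```

With handle_unknown="ignore", a season that never appeared in training does not break prediction on the scoring set.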

Model Testing

To evaluate the performance of each model we have created a function that plots the real and the predicted output and shows the RMSE.

def evaluate_model(fitted_model, X_test, y_test):
    y_pred = fitted_model.predict(X_test)
    fig, ax = plt.subplots()
    rmse = mean_squared_error(y_test, y_pred, squared=False)
    p1 = max(max(y_pred), max(y_test))
    p2 = min(min(y_pred), min(y_test))
    fig.set_size_inches((12, 12))
    ax.scatter(y_test, y_pred, s=2, c=np.absolute(y_test - y_pred), cmap="viridis_r")
    ax.plot([p1, p2], [p1, p2], "--", c="black")
    ax.set_xlabel("True Values", fontsize=15)
    ax.set_ylabel("Predictions", fontsize=15)
    ax.text(0.15, 0.85, f"RMSE: {rmse:.2f}", horizontalalignment="left", verticalalignment="top", transform=ax.transAxes, fontsize="large")
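
For reference, the RMSE shown in the plot is the square root of the mean squared error. A minimal numeric sketch with made-up prices:

```python
import numpy as np

y_true = np.array([40.0, 50.0, 60.0])
y_pred = np.array([42.0, 50.0, 54.0])

# RMSE: square the errors, average them, then take the square root.
# Squaring makes large misses weigh more than small ones.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(f"RMSE: {rmse:.2f}")
```

The result is in the same unit as the target, here €/MWh.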

Linear Regression (Simplest model)

simple_reg = Pipeline(
    [
        ("preprocessing", simple_preprocessing),
        ("regression", LinearRegression()),
    ]
)
simple_reg.fit(X_train, y_train)

Feature Engineered Linear Regression

Feature Engineered Linear Regression - RMSE=12.54

Feature Engineered Random Forest

Feature Engineered XGBoost

Feature Engineered XGBoost with early Stopping

X_xg_train, X_xg_val, y_xg_train, y_xg_val = train_test_split(
    X_train, y_train, test_size=test_rows, shuffle=False
)  # shuffle=False keeps the row order: the first rows go to training, the remaining ones to validation

X_xg_train_t = preprocessing.fit_transform(X_xg_train)
X_xg_val_t = preprocessing.transform(X_xg_val)

xgb_eval_set = [(X_xg_train_t, y_xg_train), (X_xg_val_t, y_xg_val)]
xgb_earlystopping_reg = sklearn.base.clone(feateng_xgb_reg)
xgb_earlystopping_reg.set_params(**{"xgbregressor__n_estimators": 10000, "xgbregressor__learning_rate": 0.01})
xgb_earlystopping_reg.fit(X_xg_train, y_xg_train, xgbregressor__eval_set=xgb_eval_set, xgbregressor__early_stopping_rounds=200)

xgb_earlystopping_reg.named_steps["xgbregressor"].best_ntree_limit
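
The validation split used for early stopping is chronological. A minimal sketch of what shuffle=False does, on ten toy observations (test_size=3 is an arbitrary choice for the example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten observations in time order; the values are just their positions.
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# With shuffle=False the split is a plain cut: the first rows go to training
# and the last test_size rows go to validation.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=3, shuffle=False)
print(y_tr, y_val)
```

Validating on the most recent rows mimics how the model is used: trained on the past, scored on later hours.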

Selecting the best number of features

Lastly, we checked whether we could improve the accuracy of our model by predicting with fewer features, as a measure to prevent overfitting. However, the result was that the more features we kept, the more accurate our model was.

Y-axis: Score. X-axis: Number of parameters
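
A curve of this kind can be produced by scoring a model for each candidate number of features. The sketch below is one possible way to do it, not necessarily the method used in the notebook: it assumes univariate selection (SelectKBest), a linear model and synthetic data.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data: 8 features, of which 4 carry signal.
X, y = make_regression(n_samples=200, n_features=8, n_informative=4, noise=10.0, random_state=0)

scores = {}
for k in range(1, X.shape[1] + 1):
    # Keep the k features most associated with the target, then fit a linear model.
    model = make_pipeline(SelectKBest(f_regression, k=k), LinearRegression())
    # Mean cross-validated R^2 (higher is better).
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Putting the selector inside the pipeline means it is refit on each training fold, so the held-out fold does not influence which features are kept.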

Find the code and the datasets here: https://github.com/OriolVila/Predicting-Energy-Price

Demand by hour and month

Price by hour and month

Simple Linear Regression - RMSE=15.90

Feature Engineered Linear Regression - RMSE=12.54

Feature Engineered Random Forest - RMSE=18.64

Feature Engineered XGBoost - RMSE=17.75

Feature Engineered XGBoost with Early Stopping - RMSE=6.60
