Water availability prediction using historical data

Prachi Jain
12 min read · Jun 6, 2021
Here we try to preserve the “blue gold” by using data to forecast water levels throughout the year and better regulate the daily supply across different regions

Table of Contents

  1. The problem
  2. The challenge that lies behind
  3. Data Collection and description
  4. Expected Outcome
  5. Basic Workflow
  6. Some similar approaches already explored
  7. Our roadmap to the solution
  8. Scope for improvements
  9. References

1. The problem

The Acea Group is one of the largest Italian multiutility operators in the water services sector, supplying 9 million inhabitants. It organized this competition on Kaggle to get help in predicting the water level in various types of water bodies across different seasons of the year. It is crucial for a water supply company to forecast the water level in a waterbody (water spring, lake, river, or aquifer) to handle daily consumption.

Forecasting can become tricky because water bodies are refilled during fall and winter but start to drain during spring and summer. To help preserve the health of these water bodies, it is important to predict the water availability, in terms of level and water flow, for each day of the year.

2. The challenge that lies behind

The Acea Group provides data for 4 types of water bodies, namely water spring, lake, river, and aquifer. While the primary intention is the same, i.e., to predict water availability, the reality is that each waterbody has such unique characteristics that their attributes are not linked to each other. This problem uses datasets that are completely independent of each other. As each waterbody differs from the others, the related features also differ.

Determining how the features influence the water availability of each waterbody poses a challenge. So we want to design a solution that helps us better understand volumes, so that we can ensure water availability for each time interval (day/month) of the year.

It is of the utmost importance to notice that some features, like rainfall and temperature, which are present in each dataset, don’t act on the same date they are recorded. Indeed, both rainfall and temperature affect the water level features. This means, for instance, that rain that fell on 1st January doesn’t affect the mentioned features on that same day but some time later. As we don’t know how many days/weeks/months later rainfall affects these features, this is another aspect to take into consideration when analyzing the datasets.

3. Data Collection and description

The data has been provided at https://www.kaggle.com/c/acea-water-prediction/data

There are nine different datasets, completely independent and not linked to each other. The Acea Group deals with four different types of waterbodies: water spring (three datasets), lake (one dataset), river (one dataset), and aquifers (four datasets). A brief description of each one: link.

4. Expected Outcome

The expected feature to forecast for each waterbody is shown below:

  • Aquifers: depth to groundwater
  • Water springs: flow rate
  • River: hydrometry
  • Lake: lake level and flow rate

In conclusion, we want to generate four mathematical models, one for each category of waterbody (aquifers, water springs, river, lake), that might be applicable to every single waterbody in that category.

5. Basic Workflow

An overview of the expected pipeline is shown below:

Input, Data Processing, and output for the four water bodies

6. Some similar approaches already explored

There have been various research works and experiments performed in the area of predicting water levels in an area in one form or the other. Some of them are summarised below to give a brief overview of the scope:

  • Short-term water level prediction using neural networks and neuro-fuzzy approach by Bunchingiv Bazartseren, Gerald Hildebrandt, K.-P. Holz (Link): The paper demonstrates short-term prediction of water levels using ANNs and neuro-fuzzy systems as well as the linear statistical models auto-regressive moving average (ARMA) and auto-regressive exogenous (ARX). The unstable outcome of the linear models at the beginning of the verification period, together with their late response over longer prediction spans, resulted in lower performance. The ANN models were preferred due to their superior prediction ability over the other models considered.
  • Development of Water Level Prediction Models Using Machine Learning in Wetlands: A Case Study of Upo Wetland in South Korea by Changhyun Choi, Jungwook Kim, Heechan Han, Daegun Han, and Hung Soo Kim (Link): This paper uses various machine learning models such as ANNs, DTs, RFs, and SVMs. Since the problem has many variables with non-linear relationships, the correlation between the dependent and independent variables was examined using mutual information (MI) rather than the commonly used Pearson correlation. The mutual information between two random variables X and Y is defined in terms of their joint probability distribution p(x, y) as shown here (the standard definition):
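I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}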

Based on the predictive performance evaluation results, the Random Forest was the most suitable for simulating the water level.

  • Analysis and Prediction of Dammed Water Level in a Hydropower Reservoir Using Machine Learning and Persistence-Based Techniques by C. Castillo-Botón, D. Casillas-Pérez, C. Casanova-Mateo, L. M. Moreno-Saavedra, B. Morales-Díaz, J. Sanz-Justo, P. A. Gutiérrez, and S. Salcedo-Sanz (Link): This paper covers long and short-term predictions of the dammed water level in a hydropower reservoir. A set of models, including different types of neural networks, Support Vector Regression (SVR), and Gaussian processes, was tested. The original data were on a daily basis, so a weekly aggregation was performed for both the long and short-term analyses. This improved the representation of the differences in the changes of the level of the water stored in the reservoir; otherwise, the differences on a smaller time scale (daily or hourly) would be too small and unstable to properly train the model. Depending on the type of variable, either an average or a sum was taken. The best overall result according to RMSE (Root Mean Square Error) was obtained with the “seasonal data, temporal partition” configuration, which uses four different ML models, one per season. According to MAE (Mean Absolute Error), the best overall result was obtained with the “standard data, temporal partition” configuration using an SVR algorithm.

7. Our roadmap to the solution

In order to design an effective solution, we first deeply analyze the interrelationships in the datasets using Exploratory Data Analysis (EDA).

EDA

The following insights were derived when the features from each dataset were checked against their respective target feature (refer to the EDA notebook in the GitHub link):

  • There is a large proportion of missing values in most datasets: aquifers (~60% to 85%), lake (~85%), river (~85%), and springs (~55% to 75%), including major features like depth to groundwater, rainfall, temperature, and flow rate.

Check the following graphs of the percentage of missing values for each feature in each waterbody dataset:

Aquifer Auser missing values
Aquifer Doganella missing values
Aquifer Luco missing values
Aquifer Petrignano missing values
Lake Bilancino missing values
River Arno missing values
Water Spring Amiata missing values
Water Spring Lupa missing values
Water Spring Madonna di Canneto missing values

See the example heatmaps used to study the correlations for each waterbody type below:

Aquifer Auser correlation (refer EDA pynb GitHub for a clearer plot)
  • Most aquifers have a strong positive correlation with rainfall and temperature features and a negative correlation with depth, hydrometry, and volume features.
Lake Bilancino correlation
  • The lake is positively correlated with rainfall.
River Arno correlation (refer EDA pynb GitHub for a clearer plot)
  • The river is positively correlated with rainfall and weakly negatively correlated with temperature.
Water Spring Amiata correlation (refer EDA pynb GitHub for a clearer plot)
  • For water springs, rainfall and temperature are strongly positively correlated with each other and weakly negatively correlated with the depth to groundwater and flow rate features.

Look at the plot of univariate analysis for one feature in Aquifer Auser below. Refer to the EDA pynb for all the other such plots.

Aquifer Auser Rainfall feature univariate analysis

Observations:

  • There is a considerable number of outliers in the rainfall features of aquifers, ranging from 100 mm to 300 mm, and a few in the hydrometry and volume features.
  • In the lake, there are lots of outliers in the rainfall and flow rate features.
  • In River Arno, the rainfall features have a considerable number of outliers in the range of up to 120 mm, and the Hydrometry feature has a few.
  • Water springs have a considerable number of outliers in the rainfall features, in the range of up to 130 mm.

Now let’s assess the yearly and monthly observations for a single waterbody, e.g., Aquifer Auser (refer to GitHub for all the other plots and their respective observations):

  • The mean depth to groundwater per year keeps dipping and drops to an average of -6 m in 2020. This happens when the mean rainfall and temperature are 4 mm and 9 °C respectively, which are moderate compared to other years' means. Also, the mean volume and mean hydrometry are -6500 m³ and -0.2 m in 2020, which is very low.
  • The maximum average rainfall is recorded in October or November. The maximum temperature is found in July or August. The mean volume is also highest at -8700 m³ and the mean hydrometry at -0.10 m in July and August. This is around the same time the average depth to groundwater reaches its maximum of -6.7 m.

Metrics Used

We have chosen our own metrics for this problem: Median Absolute Error (MAE), Root Mean Square Log Error (RMSLE), and R-Squared (R²), for the following reasons:

  • MAE is robust to outliers, whereas RMSE is not. Using the median is an extreme way of trimming extreme values, so the median absolute error reduces the bias in favor of low forecasts. MAE is also well suited from an interpretation standpoint.
  • RMSLE is used because underestimation of the target variable is not acceptable while overestimation can be tolerated: RMSLE incurs a larger penalty for underestimation of the actual value. Also, we don’t want to penalize huge differences between the predicted and the actual values when both are huge numbers. The RMSLE metric (unlike RMSE) considers only the relative error between the predicted and the actual value, so the scale of the error is not significant.
  • A high R² means that the correlation between observed and predicted values is high. It tells how good our regression model is compared to a very simple model that just predicts the mean value of the target from the train set as its predictions (a quick sketch of all three metrics follows this list).
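For illustration, here is a minimal sketch of how these three metrics can be computed with NumPy and scikit-learn; the sample values are made up and the RMSLE helper is hand-rolled to show the formula:

```python
import numpy as np
from sklearn.metrics import median_absolute_error, r2_score

def rmsle(y_true, y_pred):
    # Root Mean Square Log Error: relative error on log1p-transformed values,
    # penalizing underestimation more heavily than overestimation
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

y_true = np.array([3.1, 2.8, 4.0, 3.5])   # made-up actual values
y_pred = np.array([3.0, 2.9, 4.2, 3.3])   # made-up predictions

print('Median AE:', median_absolute_error(y_true, y_pred))
print('RMSLE    :', rmsle(y_true, y_pred))
print('R2       :', r2_score(y_true, y_pred))
```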

Data Preprocessing

Now we process the data in order to remove the anomalies found in the EDA.

First, the major issue of missing values in all datasets is addressed. This is handled in two steps:

  • Initially, take only the rows where the target features are not null. Impute the missing values in the independent features using KNN imputation with the k value selected for least error (this is done by plotting k values against our chosen metrics, i.e., MAE, RMSLE, and R² score).
  • Then use the complete dataset (with filled features) to predict the missing values in the respective target feature with Random Forest or other baseline models like Linear Regression, Decision Tree, or KNN.

Check the below code snippet and plots for one waterbody:

For Aquifer Auser
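The original snippet is embedded as an image in the post; the sketch below captures the same two-step idea under stated assumptions (the Kaggle CSV file name, the target column Depth_to_Groundwater_LT2, and the candidate k values are illustrative):

```python
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import median_absolute_error, r2_score

df = pd.read_csv('Aquifer_Auser.csv', parse_dates=['Date'])  # from the Kaggle data
target = 'Depth_to_Groundwater_LT2'                          # one of Auser's targets

# Step 1: keep rows with a known target, impute the independent features
known = df[df[target].notnull()]
X, y = known.drop(columns=['Date', target]), known[target]

for k in (3, 5, 9, 15, 21):
    X_filled = pd.DataFrame(KNNImputer(n_neighbors=k).fit_transform(X),
                            columns=X.columns)
    X_tr, X_te, y_tr, y_te = train_test_split(X_filled, y, test_size=0.2,
                                              random_state=42)
    pred = RandomForestRegressor(random_state=42).fit(X_tr, y_tr).predict(X_te)
    # Step 2 then reuses the best (k, model) pair to fill the target itself
    print(f'k={k:2d}  MedAE={median_absolute_error(y_te, pred):.4f}  '
          f'R2={r2_score(y_te, pred):.4f}')
```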

Then we see different plots comparing these models' error metrics:

The best model and k value obtained from all 3 metrics are then used for predicting the missing values in the target feature. In this example, it comes out to be k=15 with a Random Forest regressor.

This way, we get all 9 datasets with 100% of their values filled.

Feature Engineering

The primary goal is to have 4 models at the end, one for each waterbody type. We also have to take into account the possible latency in the variables due to the delayed effect of rainfall and temperature. Thus we integrate the datasets into their respective waterbody type by taking the average over different time durations, i.e., daily, weekly, monthly, and yearly.

Then we find the Median Absolute Error trends in each of these duration models when the variables are shifted by days, weeks, or months. Hence, the final data for each waterbody will have the actual values of the target variables as affected by rainfall and temperature.

For example, in aquifers, we first take the mean of all rainfall, temperature, volume, hydrometry, and depth features in each aquifer dataset, then combine them and again take the daily, weekly, or monthly mean based on the Date column. This gives us the datasets aquifer_daily, aquifer_weekly, and aquifer_monthly.
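As a rough sketch of that aggregation (the combined dataframe here is synthetic and the column names are illustrative):

```python
import numpy as np
import pandas as pd

# stand-in for the column-wise means combined from the four aquifer datasets
dates = pd.date_range('2015-01-01', periods=730, freq='D')
combined = pd.DataFrame({
    'rainfall': np.random.rand(730),
    'temperature': np.random.rand(730) * 20,
    'depth_to_groundwater': -6 + np.random.rand(730),
}, index=dates)

aquifer_daily = combined                          # already at daily resolution
aquifer_weekly = combined.resample('W').mean()    # weekly means on the Date index
aquifer_monthly = combined.resample('M').mean()   # monthly means on the Date index
```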

In each of these new datasets, we shift each feature value by 1 to 31 days, 1 to 52 weeks, and 1 to 12 months respectively. Then, using a Random Forest Regressor, we note the lag at which the MAE for predicting the shifted feature is lowest (i.e., where the rainfall and temperature effect fits best).

A sample code snippet for shifting:
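The snippet in the post is an image; a minimal stand-in for the lag search on aquifer_daily (synthetic data, illustrative column names) could look like this:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import median_absolute_error

rng = np.random.default_rng(0)
n = 1000
aquifer_daily = pd.DataFrame({            # stand-in for the combined daily means
    'rainfall': rng.gamma(2.0, 2.0, n),
    'temperature': 15 + 10 * np.sin(np.arange(n) * 2 * np.pi / 365),
    'depth_to_groundwater': -6 + rng.normal(0, 0.3, n),
})

errors = {}
for lag in range(1, 32):                  # shift rainfall/temperature by 1..31 days
    shifted = aquifer_daily.copy()
    shifted[['rainfall', 'temperature']] = (
        shifted[['rainfall', 'temperature']].shift(lag))
    shifted = shifted.dropna()
    X, y = shifted[['rainfall', 'temperature']], shifted['depth_to_groundwater']
    model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)
    errors[lag] = median_absolute_error(y, model.predict(X))

best_lag = min(errors, key=errors.get)
print(f'lowest Median AE at a lag of {best_lag} days')
```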

Observing the best results at 26 days (aquifer_daily), 8 weeks (aquifer_weekly), and 2 months (aquifer_monthly), we made the final aquifers dataset by shifting the values by 56 days (8 weeks or 2 months) backward.

Likewise, we do this analysis for all waterbodies and ultimately get 4 feature-engineered datasets capturing the true effect of rainfall and temperature.

The Model

The datasets are first normalized to remove the negative values. Then several types of models are tried to get the least error metrics/best predictions: KNN, Linear Regression, Random Forest, SGD Regression, Decision Trees, XGBoost, and AdaBoost were tried along with rigorous hyperparameter tuning using GridSearchCV. But the best results were observed when a Multi-Layer Perceptron (MLP) was used. This was achieved with the best-found combination of the number of layers, dropouts, optimizer, custom loss function and metrics (here RMSLE), batch_size, epochs, etc. for each individual waterbody.
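As a hedged sketch of such an MLP with a custom RMSLE loss in Keras (layer sizes, dropout rates, and the synthetic data are illustrative, not the tuned per-waterbody configuration):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

def rmsle(y_true, y_pred):
    # custom RMSLE loss/metric; values are non-negative after normalization
    return tf.sqrt(tf.reduce_mean(
        tf.square(tf.math.log1p(y_pred) - tf.math.log1p(y_true))))

# stand-in for a feature-engineered, normalized waterbody dataset
X = np.random.rand(500, 10).astype('float32')
y = np.random.rand(500, 1).astype('float32')

model = models.Sequential([
    layers.Input(shape=(X.shape[1],)),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(32, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(1, activation='relu'),   # keeps predictions non-negative
])
model.compile(optimizer='adam', loss=rmsle, metrics=[rmsle])
model.fit(X, y, validation_split=0.2, batch_size=32, epochs=5,
          callbacks=[callbacks.TensorBoard(log_dir='logs')])
model.save_weights('aquifer_mlp.h5')      # reused later by the Flask app
```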

The train and validation graphs for each of the four water bodies can also be found in the Model pynb in GitHub. Here are a few snapshots of them:

Aquifer model (Find its complete tensorboard here)

color coding for the graphs
Aquifer model loss per epoch
Aquifer model rmse per epoch

Lake model (Find its complete tensorboard here)

Lakes model loss per epoch
Lakes model rmse per epoch

River model (Find its complete tensorboard here)

Rivers model loss per epoch
Rivers model rmse per epoch

Water Springs model (Find its complete tensorboard here)

Water Springs model loss per epoch
Water Springs model rmse per epoch

While most algorithms performed well when predicting for aquifers, the MLP achieved the lowest-error predictions for each waterbody type. Below is a comparative summary of the R² score for different models.

So while training, the best MLP models’ logs and weights were saved for faster model access in the future.

The saved weights are used to load the respective model in the Flask app for the problem. It has been deployed on Heroku. The app can be accessed at https://acea-flask-api.herokuapp.com/index. See a snapshot of it below.

The web application to access the predictions
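A minimal sketch of how the saved weights could be served from such a Flask app (the route, feature count, and weights file name are hypothetical, not the deployed code):

```python
import numpy as np
from flask import Flask, request, jsonify
from tensorflow.keras import layers, models

def build_mlp(n_features):
    # must mirror the architecture used during training
    return models.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(64, activation='relu'),
        layers.Dropout(0.2),
        layers.Dense(32, activation='relu'),
        layers.Dropout(0.2),
        layers.Dense(1, activation='relu'),
    ])

app = Flask(__name__)
model = build_mlp(n_features=10)          # hypothetical feature count
model.load_weights('aquifer_mlp.h5')      # hypothetical saved-weights file

@app.route('/predict/aquifer', methods=['POST'])
def predict_aquifer():
    x = np.array(request.json['features'], dtype='float32').reshape(1, -1)
    return jsonify({'prediction': float(model.predict(x)[0, 0])})

if __name__ == '__main__':
    app.run()
```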

8. Scope for improvements

Various types of machine learning algorithms and an MLP have been explored for solving the problem. So now it would be appropriate to say that exploring more neural network architectures like CNNs or LSTMs could yield good results. Linear statistical models like the auto-regressive moving average (ARMA) and auto-regressive exogenous (ARX) models can also be tested to capture the seasonal trends in the datasets.

All the pynbs for the problem can be found at the GitHub link: https://github.com/prachij94/Acea-smart-water-analytics

Thanks for reading through till the end! Please share doubts, feedback, or anything that can help me better learn a thing or two, at my LinkedIn profile: https://in.linkedin.com/in/prachi-jain-bb3750102
