Multivariate LSTM deep learning model to predict streamflow

Multivariate LSTM deep learning model to predict streamflow

My Background:

I am a data scientist at Lynker, working with NOAA's Office of Water Prediction (OWP) to develop a hydrologic modeling tool named t-route. T-route is a dynamic channel routing model that is part of the National Water Model (NWM). This model provides a comprehensive solution for river network routing challenges. More details can be found on github.

Introduction:

This article will be in two parts. In the first part, we will access observed USGS data for a specific point and obtain meteorological data for hydrologic modeling using Google Earth Engine (GEE). Then, we will train and tune the hyperparameters of a Long-Short-Term Memory (LSTM) model to simulate streamflow. In Part 2, we will train transformers and ultimately compare the results with t-route.

Roadmap:

1- Initialization and Data Retrieval:

  • Initializes the Earth Engine,
  • fetches USGS water data,
  • Processes the response to create a usable DataFrame.

2. Mapping and Visualization:

  • Create a Folium map to visualize the location and surrounding basin.

3. Fetching ERA5 Data:

  • Retrieves ERA5 meteorological data and processes it into a DataFrame.

4. Data Preparation:

  • Prepares the combined dataset for LSTM model input using a custom function.

5. Building and Training the LSTM Model:

  • Defines, builds, trains and tune the LSTM model using TensorFlow and Keras.

6. Evaluation and Visualization:

  • Evaluates the model’s performance and visualizes the results using matplotlib.

This study is inspired by Biplov Bhandari notebook, However the hyper parameter here is tuned and another model will be added in future and compare with t-route


Tutorial:

I am running the code on Google Colab with GPU activated. The code also works with CPU but is faster with GPU. Ensure you have a Google Earth Engine (GEE) account to run the code.

Import Necessary Libraries:


!pip install scikeras
import pandas as pd
import requests
from io import StringIO
import ee
import tensorflow as tf
from keras import models, optimizers, layers, callbacks, backend as K
from sklearn import preprocessing
from datetime import datetime
import folium
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, Input
# from keras.wrappers.scikit_learn import KerasRegressor
from scikeras.wrappers import KerasClassifier, KerasRegressor
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler        

Initialize the Earth Engine:

Google Earth Engine (GEE) is a cloud-based platform for planetary-scale environmental data analysis. Here's what it offers:

1- Data Access: GEE provides access to a vast repository of geospatial datasets, including satellite imagery, climate data, and environmental data. This includes data from sources like Landsat, Sentinel, MODIS, and many more.

2- Cloud Computing: It leverages Google's cloud infrastructure to process and analyze large datasets efficiently. This means you can run complex analyses on terabytes of data without needing powerful local hardware.

3- APIs and Code Editor: GEE has APIs for both JavaScript and Python, allowing you to write and run code for data analysis. The web-based Code Editor provides a convenient interface to write, debug, and visualize your code and results.

4- Applications: It's used in various fields such as climate change research, deforestation monitoring, agricultural forecasting, water resource management, and urban planning.

5- Visualization: GEE allows you to create visualizations and maps that can be easily shared or embedded in web applications.

# gee-lstm is my Project ID for Earth Engine, replace with yours
project_id = 'gee-lstm'

try:
    ee.Initialize(project=project_id)
except Exception as e:
    ee.Authenticate()
    ee.Initialize(project=project_id)        

Define Study Period:

Let's pick 30 years, using 80% for training and 20% for testing. The date range is from 1990 to 2020.

Fetch the USGS Data:

Select a USGS site number for the Guadalupe River, TX. Feel free to train on other stations. Detailed information about gages can be found on the USGS website.

siteNumber = '08188800' # guadelope river- Replace with your site number
usgs_water_data_url = "https://waterdata.usgs.gov/nwis/dv"
params = {
    "site_no": siteNumber,
    "begin_date": START_DATE,
    "end_date": END_DATE,
    "format": "rdb",
    "parameter_cd": "00060",  # Discharge, cubic feet per second (Mean)
}

response = requests.get(usgs_water_data_url, params=params)

if response.status_code == 200:
    print("Successful!")
else:
    print(response.status_code)
    print(response.reason)
        

Data Cleaning and Processing:

Clean and process the DataFrame. Details are in the notebook.

Fetch Basin Data and Create the Map:

Create a map that shows the catchment for the point of interest.

Fetch ERA5 Data:

Fetch ERA5 data for the specified bands and region, then combine USGS data with ERA5 data.

gauge_lat, gauge_lon = 28.50583370,	-96.88470840   # guadelope
gauge_geom = ee.Geometry.Point([gauge_lon, gauge_lat])
hydrosheds = ee.FeatureCollection("WWF/HydroSHEDS/v1/Basins/hybas_8")
basin = hydrosheds.filterBounds(gauge_geom)
basinOutline = ee.Image().byte().paint(featureCollection=basin, color=1, width=3).getMapId()
map = folium.Map(location=[gauge_lat, gauge_lon], zoom_start=11)

# Add the custom tile layer
folium.TileLayer(
    tiles=basinOutline["tile_fetcher"].url_format,
    attr="Google Earth Engine",
    overlay=True,
    name="River Basin",
).add_to(map)

# Add a marker
folium.Marker([gauge_lat, gauge_lon]).add_to(map)

# Add a layer control
map.add_child(folium.LayerControl())

# Display the map
map        

which the map will be like below.

Article content

now we Fetch ERA5 data for the specified bands and region. and combineUSGS with ERA5 data. It will be like:

Article content

Prepare Data for LSTM Model:

The target is discharge. This is multivariate forecasting because there are multiple time-dependent variables to generate predictions. This approach incorporates historical data while accounting for the interdependencies among the variables.

Loss and Evaluation Metrics:

Our loss function is 'huber_loss,' and our evaluation metrics are 'mae,' 'mse,' and 'mape.'

  • Mean Absolute Percentage Error (MAPE): The percentage equivalent of mean absolute error (MAE).
  • Huber Loss: A loss function used in robust regression that is less sensitive to outliers than the squared error loss. It combines the best properties of MSE and MAE, being quadratic for small errors and linear for large errors, making it more robust to outliers compared to MSE.

Article content


Split Data and Build the Model:

Split the data into training and testing sets, and build the model.

test_split = round(len(model_df) * 0.20)

df_for_training = model_df[:-test_split]

df_for_testing = model_df[-test_split:]

# Scale the data
scaler = RobustScaler()
df_for_training_scaled = scaler.fit_transform(df_for_training)
df_for_testing_scaled = scaler.transform(df_for_testing)
# Create sequences
def createXY(dataset, n_past):
    dataX = []
    dataY = []
    for i in range(n_past, len(dataset)):
        dataX.append(dataset[i - n_past:i, :])
        dataY.append(dataset[i, 0])
    return np.array(dataX), np.array(dataY)

trainX, trainY = createXY(df_for_training_scaled, 30)
testX, testY = createXY(df_for_testing_scaled, 30)
# Build the model
def build_model(optimizer='adam'):
    model = Sequential()
    model.add(Input(shape=(30, trainX.shape[2])))
    model.add(LSTM(50, return_sequences=True))
    model.add(LSTM(50))
    model.add(Dropout(0.2))
    model.add(Dense(1))
    model.compile(loss=tf.keras.losses.Huber(), optimizer=optimizer, metrics=['mae', 'mse', 'mape'])
    return model

# Wrapping the build_model function
grid_model = KerasRegressor(build_model, verbose=1)

# Define the grid search parameters
parameters = {
    'batch_size': [16, 32],
    'epochs': [10, 20],
    'optimizer': ['adam']
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(estimator=grid_model, param_grid=parameters, cv=2)

# Fit the model and capture the training history
history = grid_search.fit(trainX, trainY, validation_data=(testX, testY))

# Print the best parameters and score
print(f"Best: {grid_search.best_score_} using {grid_search.best_params_}")

# Evaluate the best model on the test set
best_model = grid_search.best_estimator_.model_
loss, mae, mse, mape = best_model.evaluate(testX, testY)
print(f"Test loss: {loss}")
print(f"Test MAE: {mae}")
print(f"Test MSE: {mse}")
print(f"Test MAPE: {mape}")        

Grid Search and Cross-Validation:

Perform grid search with cross-validation to find the best model. The result shows the model can predict very well with MAE: 0.045. Fewer epochs may also be used.

Article content

The visualizations illustrate the model's performance: the discharge prediction closely aligns with real discharge values, as seen in the left graph, where the red and blue lines overlap significantly. The scatter plot on the right further confirms this by showing a strong correlation between predicted and observed discharges, with points clustering around the dashed line of perfect prediction.


Article content

The training and validation loss and MAE graphs indicate that the model effectively learns and generalizes well, with both training and validation errors decreasing consistently over epochs.

Article content
model summary

Summary

The multivariate LSTM model developed for streamflow prediction demonstrates high accuracy and robustness. The model architecture includes two LSTM layers with 50 units each, followed by a dropout layer and a dense output layer, resulting in a total of 95,555 parameters. The model was trained using the Huber loss function and evaluated with metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), and Mean Absolute Percentage Error (MAPE).

This robust performance suggests the model's potential for reliable streamflow forecasting in hydrologic applications.

Mehran Homayounfar

Data Scientist and System Modeler in Agro-environmental Systems | Dynamical Optimization | Resilience Management

11mo

Interesting!

To view or add a comment, sign in

Insights from the community

Others also viewed

Explore topics