Using Bayesian Regression for Stacking Time Series Predictive Models

Time series analytics is an important part of modern data science. In this study, we consider the use of Bayesian regression for stacking time series predictive models: the second level of the predictive model is an ensemble of the models of the first level.

Probabilistic regression models can be based on Bayes' theorem. This approach allows us to obtain a posterior distribution of model parameters using the conditional likelihood and a prior distribution. A probabilistic approach is more natural for stochastic variables such as sales time series. The difference between the Bayesian approach and the conventional Ordinary Least Squares (OLS) method is that in the Bayesian approach, uncertainty comes from the parameters of the model, as opposed to the OLS method, where the parameters are constant and uncertainty comes from the data. In Bayesian inference, we can use informative prior distributions, which can be set up by an expert, so the result can be considered a compromise between historical data and expert opinion. This is important in cases where we have a small amount of historical data. In a Bayesian model, we can consider a target variable with a non-Gaussian distribution, e.g. Student's t-distribution. The probabilistic approach also gives us the probability density function of the target variable; having such a function, we can make a risk assessment and calculate the value at risk (VaR), which is the 5% quantile. Bayesian models are solved using numerical Monte Carlo methods; Gibbs and Hamiltonian sampling are popular methods for finding the posterior distributions of the parameters of a probabilistic model.

Predictive models can be combined into an ensemble model using the stacking approach. In this approach, the prediction results of predictive models on the validation set are treated as covariates for the stacking regression. These predictive models form the first level of the ensemble, and the stacking model forms its second level. Using Bayesian inference for the stacking regression gives distributions for the stacking regression coefficients, which makes it possible to estimate the uncertainty of the first-level predictive models. As first-level models of the ensemble, we used 'ARIMA', 'ExtraTree', 'RandomForest', 'Lasso' and 'NeuralNetwork'. For stacking, we chose a robust regression with Student's t-distribution for the target variable as the second-level model:

[Figure: stacking regression model y ~ Student_t(ν, α + Xβ, σ)]
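
A minimal sketch of this model in Stan (embedded as a Python string for pystan) is given below. The data-block names and the prior on ν are illustrative assumptions; the N(0, 1) priors on α, β and σ are the ones described later in the text.

stacking_model_code = """
data {
  int<lower=1> N;        // number of stacking training samples
  int<lower=1> K;        // number of first-level models
  matrix[N, K] x;        // z-scored first-level predictions (covariates)
  vector[N] y;           // z-scored target variable (sales)
}
parameters {
  real alpha;            // intercept
  vector[K] beta;        // stacking weights of the first-level models
  real<lower=0> sigma;   // scale of the Student's t likelihood
  real<lower=1> nu;      // degrees of freedom (controls robustness)
}
model {
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  sigma ~ normal(0, 1);  // half-normal because of the lower bound
  nu ~ gamma(2, 0.1);    // assumed prior; not specified in the text
  y ~ student_t(nu, alpha + x * beta, sigma);
}
"""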

The data for our analysis are based on the store sales historical data from the “Rossmann Store Sales” Kaggle competition. For the Bayesian regression, we used the Stan platform for statistical modeling. The analysis was conducted in the Jupyter Notebook environment using the Python programming language and the following main Python packages: pandas, sklearn, pystan, numpy, scipy, statsmodels, keras, matplotlib, seaborn. We trained different predictive models and made predictions on the validation set. The ARIMA model was evaluated using the statsmodels package, the Neural Network using the keras package, and Random Forest and Extra Tree using the sklearn package. The next figure shows the time series forecasts on the validation set obtained using the different models:

[Figure: time series forecasts of the first-level models on the validation set]
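
A condensed sketch of how such first-level forecasts can be produced is shown below; the feature arrays (X_train, y_train, X_val), the raw sales series y_arima, the hyperparameters and the network architecture are illustrative assumptions, not the exact configurations used in the study.

import numpy as np
from statsmodels.tsa.arima_model import ARIMA
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso
from keras.models import Sequential
from keras.layers import Dense

preds = {}

# ARIMA works directly on the sales series; the order is illustrative
arima = ARIMA(y_arima, order=(5, 1, 0)).fit(disp=0)
preds['ARIMA'] = arima.forecast(steps=len(X_val))[0]

# sklearn models are trained on engineered features (lags, calendar features, etc.)
for name, model in [('ExtraTree', ExtraTreesRegressor(n_estimators=500)),
                    ('RandomForest', RandomForestRegressor(n_estimators=500)),
                    ('Lasso', Lasso(alpha=0.1))]:
    preds[name] = model.fit(X_train, y_train).predict(X_val)

# a simple feed-forward network as the neural model
nn = Sequential([Dense(64, activation='relu', input_dim=X_train.shape[1]),
                 Dense(32, activation='relu'),
                 Dense(1)])
nn.compile(optimizer='adam', loss='mae')
nn.fit(X_train, y_train, epochs=100, batch_size=32, verbose=0)
preds['NeuralNetwork'] = nn.predict(X_val).ravel()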

The prediction results of these models on the validation set are considered as the covariates for the regression on the second, stacking level of the model ensemble. For stacking the predictive models, we split the validation set into training and testing sets by the time factor: the data with the predictions of the different models were split into a training set (48 samples) and a testing set (50 samples) by date. For the stacking regression, we normalized the covariates and the target variable using z-scores. The prior distributions for the parameters α, β, σ in the Bayesian regression model are taken as Gaussian with mean 0 and standard deviation 1. The parameters of the prior distributions can be adjusted using prediction scores on the testing set or, in the case of a small amount of data, using expert opinions. To estimate the uncertainty of the regression coefficients, we used the coefficient of variation, defined as the ratio between the standard deviation and the mean value of a model coefficient's distribution. Taking into account that the mean μ_i can be negative, we analyze the absolute value of the coefficient of variation. For the evaluation of the results, we used the relative mean absolute error (RMAE) and the root mean square error (RMSE); RMAE is defined as the ratio between the mean absolute error (MAE) and the mean value of the target variable. We used the robust regression with Student's t-distribution for the target variable. As a result of the calculations, we received the following scores: RMAE(train)=12.4%, RMAE(test)=9.8%, RMSE(train)=113.7, RMSE(test)=74.7. The next figure shows the mean values of the real and forecasted sales time series on the training and testing sets:

[Figure: real and forecasted mean sales on the training and testing sets]

The vertical dotted line separates the training and testing sets. The next figure shows the probability density function (PDF) of the intercept parameter:

[Figure: PDF of the intercept parameter]

One can observe a positive bias in this PDF. It is caused by the fact that we applied machine learning algorithms to a non-stationary time series. If the non-stationary trend is small, it can be compensated on the validation set using the stacking regression.
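
The fitting and scoring of the stacking regression can be sketched as follows (pystan 2.x API). The DataFrame df with the first-level predictions and real 'Sales' on the validation set, and the use of training-part statistics for the z-scores, are my assumptions; the 48/50 split by date, the z-scoring and the RMAE definition follow the text. The last lines show how the intercept PDF above and a 5% VaR estimate can be obtained from the posterior samples.

import pystan
import numpy as np
import seaborn as sns

model_cols = ['ARIMA', 'ExtraTree', 'RandomForest', 'Lasso', 'NeuralNetwork']
train, test = df.iloc[:48], df.iloc[48:]   # df is assumed sorted by date

# z-score normalization; statistics are taken from the training part
mu_x, sd_x = train[model_cols].mean(), train[model_cols].std()
mu_y, sd_y = train['Sales'].mean(), train['Sales'].std()
x_train = ((train[model_cols] - mu_x) / sd_x).values
x_test = ((test[model_cols] - mu_x) / sd_x).values

sm = pystan.StanModel(model_code=stacking_model_code)
fit = sm.sampling(data=dict(N=len(train), K=len(model_cols), x=x_train,
                            y=((train['Sales'] - mu_y) / sd_y).values),
                  iter=2000, chains=4)
post = fit.extract()   # posterior samples of alpha, beta, sigma, nu

# point forecasts on the testing set, mapped back to sales units
y_pred = (post['alpha'].mean() + x_test.dot(post['beta'].mean(axis=0))) * sd_y + mu_y
rmae = np.mean(np.abs(test['Sales'] - y_pred)) / test['Sales'].mean()
rmse = np.sqrt(np.mean((test['Sales'] - y_pred) ** 2))

# posterior PDF of the intercept (as in the figure above)
sns.distplot(post['alpha'])

# posterior predictive draws with Student's t noise; 5% VaR per testing point
n_draws = len(post['nu'])
noise = post['sigma'][:, None] * np.random.standard_t(post['nu'][:, None],
                                                      size=(n_draws, len(test)))
y_draws = (post['alpha'][:, None] + post['beta'].dot(x_test.T) + noise) * sd_y + mu_y
var_5 = np.percentile(y_draws, 5, axis=0)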

The next figure shows the box plots of the PDFs of the model regression coefficients:

[Figure: box plots of the PDFs of the model regression coefficients]

The next figure shows the coefficient of variation of the PDF of each model's regression coefficient:

[Figure: coefficients of variation of the model regression coefficients]
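
This coefficient of variation can be computed directly from the posterior samples, for example (using the fit object and model_cols from the sketch above):

beta = fit.extract()['beta']   # shape: (number of draws, number of models)
# absolute value of std/mean, since the mean of a coefficient can be negative
cv = np.abs(beta.std(axis=0) / beta.mean(axis=0))
for name, v in zip(model_cols, cv):
    print('%s: %.2f' % (name, v))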

We also considered the case with the constraint that the regression coefficients of the models should be positive. We received similar results: RMAE(train)=12.9%, RMAE(test)=9.7%, RMSE(train)=117.3, RMSE(test)=76.1. The next figure shows the box plots of the PDFs of the model regression coefficients for this case:

[Figure: box plots of the PDFs of the model regression coefficients, positive-constraint case]
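
In Stan, this constraint amounts to a lower bound on the coefficient vector; a minimal way to obtain the constrained model from the sketch above is:

# constrain the stacking weights to be non-negative
constrained_model_code = stacking_model_code.replace(
    'vector[K] beta;', 'vector<lower=0>[K] beta;')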

All models have similar mean values and coefficients of variation. We can observe that the error characteristics RMAE and RMSE on the testing set can be similar to these errors on the training set. This suggests that the Bayesian regression does not overfit on the training set, in contrast to machine learning algorithms, which can demonstrate essential overfitting on training sets, especially in the case of a small amount of training data. We then chose the best stacking model, ExtraTree, and conducted the Bayesian regression with this one model only. We received the following scores: RMAE(train)=12.9%, RMAE(test)=11.1%, RMSE(train)=117.1, RMSE(test)=84.7. We also tried to exclude the best model, ExtraTree, from the stacking regression and conducted the Bayesian regression with the remaining models. In this case, we received the following scores: RMAE(train)=14.1%, RMAE(test)=10.2%, RMSE(train)=139.1, RMSE(test)=75.3. The next figure shows the box plots of the PDFs of the model regression coefficients:

[Figure: box plots of the PDFs of the model regression coefficients for this case]

The next figure shows the coefficient of variation of the PDFs of the model regression coefficients for this case:

[Figure: coefficients of variation of the model regression coefficients for this case]

We received worse results on the testing set. At the same time, these models have a similar influence, and thus they can potentially provide more stable results in the future, given possible changes in feature quality. Noisy models can decrease accuracy on large training data sets; at the same time, they contribute to adequate results in the case of small data sets. We then considered a case with a small number of training data: 12 samples. To get stable results, we fixed the ν parameter of the Student's t-distribution in the Bayesian regression model at 10. We received the following scores: RMAE(train)=5.0%, RMAE(test)=14.2%, RMSE(train)=37.5, RMSE(test)=121.3. The next figure shows the mean values of the real and forecasted sales time series on the training and testing sets:

[Figure: real and forecasted mean sales, small-data case with 12 training samples]
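
One way to fix ν in the Stan sketch above is to declare it as data rather than a parameter and pass ν = 10 together with the training data, for example:

fixed_nu_model_code = """
data {
  int<lower=1> N;
  int<lower=1> K;
  matrix[N, K] x;
  vector[N] y;
  real<lower=1> nu;      // fixed degrees of freedom, passed with the data
}
parameters {
  real alpha;
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  sigma ~ normal(0, 1);
  y ~ student_t(nu, alpha + x * beta, sigma);
}
"""
# sampling as before, with nu added to the data dictionary:
# fit = sm.sampling(data=dict(N=N, K=K, x=x_train, y=y_train, nu=10), ...)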

The next figure shows the box plots of the PDFs of the model regression coefficients:

[Figure: box plots of the PDFs of the model regression coefficients, small-data case]

The next figure shows the coefficient of variation of the PDFs of the model regression coefficients:

[Figure: coefficients of variation of the model regression coefficients, small-data case]

In this case, we can see that another model starts playing an important role compared with the previous cases, and the ExtraTree model does not dominate. We also tried to change the parameters of the informative prior distributions: we changed the σ of the priors for the intercept, the model regression coefficients and the σ parameter of the Student's t-distribution of the target variable to the value 0.15. As a result, we received improved scores on the testing set: RMAE(train)=7.0%, RMAE(test)=12.3%, RMSE(train)=54.3, RMSE(test)=109.9.
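
In the fixed-ν sketch above, this change amounts to narrowing the prior scales from 1 to 0.15 for all three parameters:

# narrow the informative priors for the intercept, the model coefficients
# and the sigma parameter of the Student's t likelihood
narrow_prior_model_code = fixed_nu_model_code.replace('normal(0, 1)',
                                                      'normal(0, 0.15)')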

Conclusion

In this case study, we considered a two-level ensemble of predictive models for time series. The models ARIMA, Neural Network, Random Forest, Extra Tree and Lasso were used for prediction on the first level of the ensemble. On the second, stacking level, the time series predictions of these models on the validation set were combined using Bayesian regression. This approach gives distributions for the regression coefficients of these models, which makes it possible to estimate the uncertainty contributed by each model to the stacking result. The information about these distributions allows us to select an optimal set of stacking models, taking into account domain knowledge. The probabilistic approach to stacking predictive models allows us to make a risk assessment for the predictions, which is important in a decision-making process. Noisy models can decrease accuracy on large training data sets; at the same time, they contribute to adequate results in the case of small data sets. Using Bayesian inference for the stacking regression can be useful in the case of small datasets; it can help experts select a set of models for stacking and assess different kinds of risks and prediction uncertainty. Choosing the final models for stacking is up to an expert who takes into account different factors, such as the uncertainty of each model on the stacking regression level, the amount of training and testing data, and the stability of the models. In Bayesian regression, we receive a quantitative measure of this uncertainty, which can be very useful information for experts in model selection and stacking. An expert can also set up informative prior distributions for the stacking regression coefficients of the models, taking into account domain knowledge. So, the Bayesian approach to stacking regression gives us information about the uncertainty of the predictive models; using this information and domain knowledge, an expert can select models to obtain a stable stacking ensemble of predictive models.

Comments

Bohdan Pavlyshenko (author): This research was conducted using different pieces of code for the different models, and at present we do not have a single commented and optimized notebook demonstrating the case study. For the stacking, the 'pystan' package was used with a standard linear Bayesian regression and a Student's t-distribution for the target variable.
Abdullah Al Imran: Thanks for the write-up. It would be really helpful if we could have a view of the notebooks.

