In many real-world Machine Learning projects, there is a need to ensemble complex models as well as maintain pipelines. As we will demonstrate, DVC is a good tool that helps tackling common technical challenges of building pipelines for the ensemble learning.
In a model ensembling setup, the final prediction is a composite of predictions
from individual machine learning algorithms. To make the best model composite,
you have to try dozens of combinations of weights for the model set. It takes a
lot of time to come up with the best one. That is why the iteration speed is
crucial in the ML model ensembling. We are going to make our research
reproducible by using Data Version Control tool -
(DVC). It provides the ability to quickly re-run and replicate
the ML prediction result by executing just a single command dvc repro
.
As we will demonstrate, DVC is a good tool that helps tackling common technical challenges of building pipelines for the ensemble learning.
In this case, we will build an R-based solution to attack the supervised-learning regression problem to predict win sales per Predict Wine Sales Kaggle competition.
An ensemble prediction methodology will be used in the project. The weighted
ensemble of three models will be implemented, trained, and predicted from
(namely, these are Linear Regression, GBM
, and XGBoost
).
If properly designed and used, ensemble prediction can perform much better then predictions of individual machine learning models composing the ensemble.
Prediction results will be delivered in a format of output CSV file that is specified in the requirements to the Predict Wine Sales Kaggle competition (so called Kaggle submission file).
In order to try the materials of this repository in your environment, the following software should be installed on your machine
pip install dvc
)The technical challenges of building the ML pipeline for this project were to meet business requirements below
The next sections below will explain how these challenges are addressed in the design of ML pipeline for this project.
The ML pipeline for this project is presented in the diagram below
As you can see, the essential implementation of the solution is as follows
preprocessing.R
handles all aspects of data manipulations and pre-processing (reading training
and testing data sets, removing outliers, imputing NAs etc.) as well as stores
refined training and testing set data as new files to reuse by model scriptsLR.R
,
GBM.R
,
xgboost.R
)ensemble.R
is responsible for the weighted ensemble prediction and the final output of
the Kaggle submission fileconfig.R
is responsible for all of the conditional logic switches needed in
the pipeline (it is included as a source to all of modeling and ensemble
prediction scripts, to get this done)There is a special note about lack of feature engineering for this project. It was an intended specification related to the specifics of the dataset. The existing features were quite instrumental to predict the target values ‘as is’. Therefore it had been decided to follow the well-known Pareto principle (interpreted as “20% of efforts address 80% of issues”, in this case) and not to spend more time on it.
Note: all R
and batch files mentioned throughout this blog post are
available online in a separate GitHub
repository. You will be also able
to review more details on the implementation of each of the machine learning
prediction models there.
All of the essential tweaks to conditional machine learning pipeline for this
project is managed by a configuration file. For ease of its use across solution,
it was implemented as an R code module (config.R
), to be included to all model
training and forecasting. Thus the respective parameters (assigned as R
variables) will be retrieved by the runnable scripts, and the conditional logic
there will be triggered respectively.
This file is not intended to run from a command line (unlike the rest of the R scripts in the project).
# Competition: https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/ | |
# This is a configuration file to the entire solution | |
# LR.R specific settings | |
cfg_run_LR <- 1 # if set to 0, LR model will not fit, and its prediction will not be calculated in the batch mode | |
# GMB.R specific settings | |
cfg_run_GBM <- 1 # if set to 0, GBM model will not fit, and its prediction will not be calculated in the batch mode | |
# xgboost.R specific settings | |
cfg_run_xgboost <- 1 # if set to 0, xgboost model will not fit, and its prediction will not be calculated in the batch mode | |
# ensemble.R specific settings | |
cfg_run_ensemble <- 1 # if set to 0, the ensemble will not predict, and ensemble prediction will not be created | |
# ensemble components | |
cfg_model_predictions <- c("data/submission_LR.csv", "data/submission_GBM.csv", "data/submission_XGBOOST.csv") | |
# element weights mapped to the cfg_model_predictions elements above | |
cfg_model_weights <- c(1,1,1) # weights of predictions of the models in the ensemble |
As we all know, there is no way to build the ideal ML model with sound prediction accuracy from the very beginning. You will have to continuously adjust your algorithm/model implementations based on the cross-validation appraisal until you yield the blooming results. This is especially true in the ensemble learning where you have to constantly tweak not only parameters of the individual prediction models but also the settings of the ensemble itself
Under such a condition, DVC will help you to manage your ensemble ML pipeline in a really solid manner. Let’s consider the following real-world scenario
GBM
model and resubmit its
implementation to (this is emulated by the commit
#8604103f0,
check sum 27825d0
)GBM
as well as the updated final ensemble predictionGBM
higher weight vs. xgboost
and LR
config.R
appeared in
the repository, as emulated by the commit
#eb97612ce,
check sum 5bcbe11
), you re-run the model predictions and the final ensemble
prediction on your machine once againAll that you need to do to handle the changes above is simply to keep running your DVC commands per the script developed (see the section below). You do not have to remember or know explicitly the changes being made into the project codebase or its pipeline configuration. DVC will automatically check out latest changes from the repo as well as make sure it runs only those steps in the pipeline that were affected by the recent changes in the code modules.
After we developed individual R scripts needed by different steps of our Machine Learning pipeline, we orchestrate it together using DVC.
Below is a batch file illustrating how DVC manages steps of the machine learning process for this project
# This is a DVC-based script to manage machine-learning pipeline for a project per | |
# https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/ | |
mkdir R_DVC_GITHUB_CODE | |
cd R_DVC_GITHUB_CODE | |
# clone the github repo with the code | |
git clone https://github.com/gvyshnya/DVC_R_Ensemble | |
# initialize DVC | |
$ dvc init | |
# import data | |
$ dvc import https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/download/wine.csv data/ | |
$ dvc import https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/download/wine_test.csv data/ | |
# run data pre-processing | |
$ dvc run Rscript --vanilla code/preprocessing.R data/wine.csv data/wine_test.csv data/training_imputed.csv data/testing_imputed.csv | |
# run LR model fit and forecasting | |
$ dvc run Rscript --vanilla code/LR.R data/training_imputed.csv data/testing_imputed.csv 0.7 825 data/submission_LR.csv code/config.R | |
# run GBM model fit and forecasting | |
$ dvc run Rscript --vanilla code/GBM.R data/training_imputed.csv data/testing_imputed.csv 5000 10 4 25 data/submission_GBM.csv code/config.R | |
# rum XGBOOST model fit and forecasting | |
$ dvc run Rscript --vanilla code/GBM.R data/training_imputed.csv data/testing_imputed.csv 1000 10 0.0001 1.0 data/submission_xgboost.csv code/config.R | |
# prepare ensemble submission | |
# Note: please make sure to edit your code/config.R to set up the references to the predictions from each model according | |
# to the names of output files on the steps above | |
$ dvc run Rscript --vanilla code/ensemble.R data/submission_ensemble.csv code/config.R |
If you then further edit ensemble configuration setup in code/config.R
, you
can simply leverage the power of DVC as for automatic dependencies resolving and
tracking to rebuild the new ensemble prediction as follows
# Improve ensemble configuration | |
$ vi code/config.R | |
# Commit all the changes. | |
$ git commit -am "Updated weights of the models in the ensemble" | |
# Reproduce the ensemble prediction | |
$ dvc repro data/submission_ensemble.csv |
In this blog post, we worked through the process of building an ensemble prediction pipeline using DVC. The essential key features of that pipeline were as follows
The helpful side effect of using DVC was you stop keeping in mind what was changed on every step of modifying your project scripts or in the pipeline configuration. Due to it maintaining the dependencies graph (DAG) automatically, it automatically triggered the only steps that were affected by the particular changes, within the pipeline job setup. It, in turn, provides the capability to quickly iterate through the entire ML pipeline.
As DVC brings proven engineering practices to often suboptimal and messy ML processes as well as helps a typical Data Science project team to eliminate a big chunk of common DevOps overheads, I found it extremely useful to leverage DVC on the industrial data science and predictive analytics projects.