The goal of this example is to give you some hands-on experience with a basic machine learning version control scenario: managing multiple datasets and ML models using DVC. We'll work with a tutorial that François Chollet put together to show how to build a powerful image classifier using a pretty small dataset.
(Figure: dataset to classify cats and dogs)
We highly recommend reading François' tutorial itself. It's a great demonstration of how a general pre-trained model can be leveraged to build a new high-performance model, with very limited resources.
We first train a classifier model using 1000 labeled images, then we double the
number of images (2000) and retrain our model. We capture both datasets and
classifier results and show how to use dvc checkout
to switch between
workspace versions.
The specific algorithm used to train and validate the classifier is not important, and no prior knowledge of Keras is required. We'll reuse the script from the original blog post as a black box: it takes some data and produces a model file.
We have tested our tutorials and examples with Python 3. We don't recommend using earlier versions.
You'll need Git to run the commands in this tutorial. Also, if DVC is not installed, please follow these instructions to do so.
If you're using Windows, please review Running DVC on Windows for important tips to improve your experience.
Okay! Let's first download the code and set up a Git repository:
$ git clone https://github.com/iterative/example-versioning.git
$ cd example-versioning
This command pulls a DVC project with a single script train.py
that will train the model.
Let's now install the requirements. But before we do that, we strongly recommend creating a virtual environment:
$ python3 -m venv .env
$ source .env/bin/activate
$ pip install -r requirements.txt
Now that we're done with preparations, let's add some data and then train the first model. We'll capture everything with DVC, including the input dataset and model metrics.
$ dvc get https://github.com/iterative/dataset-registry \
tutorial/ver/data.zip
$ unzip -q data.zip
$ rm -f data.zip
dvc get can download any file or directory tracked in a DVC repository (and stored remotely). It's like wget, but for DVC or Git repos. In this case we use our dataset registry repo as the data source (refer to Data Registries for more info).
This command downloads and extracts our raw dataset, consisting of 1000 labeled images for training and 800 labeled images for validation. In total, it's a 43 MB dataset, with a directory structure like this:
data
├── train
│   ├── dogs
│   │   ├── dog.1.jpg
│   │   ├── ...
│   │   └── dog.500.jpg
│   └── cats
│       ├── cat.1.jpg
│       ├── ...
│       └── cat.500.jpg
└── validation
    ├── dogs
    │   ├── dog.1001.jpg
    │   ├── ...
    │   └── dog.1400.jpg
    └── cats
        ├── cat.1001.jpg
        ├── ...
        └── cat.1400.jpg
(Who doesn't love ASCII directory art?)
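If you like, you can sanity-check the download with a couple of standard shell commands; the counts should match the numbers above:
$ find data/train -type f | wc -l        # 1000 training images
$ find data/validation -type f | wc -l   # 800 validation images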
Let's capture the current state of this dataset with dvc add:
$ dvc add data
You can use this command instead of git add
on files or directories that are
too large to be tracked with Git: usually input datasets, models, some
intermediate results, etc. It tells Git to ignore the directory and puts it into
the cache (while keeping a
file link
to it in the workspace, so you can continue working the same way as
before). This is achieved by creating a simple human-readable
DVC-file that serves as a pointer
to the cache.
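If you're curious, you can take a look at the DVC-file itself. Its exact fields depend on your DVC version, but it is a small YAML file along these lines (the hash shown here is just an illustration):
$ cat data.dvc
outs:
- md5: b6923e1e4ad16ea1a7e2b328842d56a2.dir
  path: data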
Next, we train our first model with train.py. Because the dataset is small, this training process should run on most computers in a reasonable amount of time (a few minutes). The script outputs a bunch of files, among them model.h5 and metrics.csv: the weights of the trained model and the training metrics history, respectively. The simplest way to capture the current version of the model is to use dvc add again:
$ python train.py
$ dvc add model.h5
We manually added the model output here, which isn't ideal. The preferred way of capturing command outputs is with dvc run. More on this later.
Let's commit the current state:
$ git add data.dvc model.h5.dvc metrics.csv .gitignore
$ git commit -m "First model, trained with 1000 images"
$ git tag -a "v1.0" -m "model v1.0, 1000 images"
Note that executing train.py produced other intermediate files. This is OK; we will use them later.
$ git status
...
	bottleneck_features_train.npy
	bottleneck_features_validation.npy
Let's imagine that our image dataset doubles in size. The next command extracts 500 new cat images and 500 new dog images into data/train:
$ dvc get https://github.com/iterative/dataset-registry \
tutorial/ver/new-labels.zip
$ unzip -q new-labels.zip
$ rm -f new-labels.zip
For simplicity's sake, we keep the validation subset the same. Now our dataset has 2000 images for training and 800 images for validation, with a total size of 67 MB:
data
├── train
│   ├── dogs
│   │   ├── dog.1.jpg
│   │   ├── ...
│   │   └── dog.1000.jpg
│   └── cats
│       ├── cat.1.jpg
│       ├── ...
│       └── cat.1000.jpg
└── validation
    ├── dogs
    │   ├── dog.1001.jpg
    │   ├── ...
    │   └── dog.1400.jpg
    └── cats
        ├── cat.1001.jpg
        ├── ...
        └── cat.1400.jpg
Now let's leverage these new labels and retrain the model:
$ dvc add data
$ python train.py
$ dvc add model.h5
Let's commit the second version:
$ git add data.dvc model.h5.dvc metrics.csv
$ git commit -m "Second model, trained with 2000 images"
$ git tag -a "v2.0" -m "model v2.0, 2000 images"
That's it! We've tracked a second version of the dataset, model, and metrics in DVC and committed the DVC-files that point to them with Git. Let's now look at how DVC can help us go back to the previous version if we need to.
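Both snapshots are now recorded as Git tags:
$ git tag
v1.0
v2.0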
The DVC command that helps get a specific committed version of data is designed to be similar to git checkout. All we need to do in our case is to additionally run dvc checkout to get the right data into the workspace.
There are two ways of doing this: a full workspace checkout or checkout of a specific data or model file. Let's consider the full checkout first. It's pretty straightforward:
$ git checkout v1.0
$ dvc checkout
These commands will restore the workspace to the first snapshot we made: code,
data files, model, all of it. DVC optimizes this operation to avoid copying data
or model files each time. So dvc checkout
is quick even if you have large
datasets, data files, or models.
On the other hand, if we want to keep the current code, but go back to the previous dataset version, we can do something like this:
$ git checkout v1.0 data.dvc
$ dvc checkout data.dvc
If you run git status
you'll see that data.dvc
is modified and currently
points to the v1.0
version of the dataset, while code and model files are from
the v2.0
tag.
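For example, the relevant part of the output would look roughly like this (abridged):
$ git status
...
	modified:   data.dvc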
dvc add
makes sense when you need to keep track of different versions of
datasets or model files that come from source projects. The data/
directory
above (with cats and dogs images) is a good example.
On the other hand, there are files that are the result of running some code. In our example, train.py produces binary files (e.g. bottleneck_features_train.npy), the model file model.h5, and the metrics file metrics.csv.
When you have a script that takes some data as an input and produces other data outputs, a better way to capture them is to use dvc run:
If you tried the commands in the Switching between workspace versions section, go back to the master branch code and data, and remove the model.h5.dvc file with:
$ git checkout master
$ dvc checkout
$ dvc remove model.h5.dvc
$ dvc run -n train -d train.py -d data \
-o model.h5 -o bottleneck_features_train.npy \
-o bottleneck_features_validation.npy -M metrics.csv \
python train.py
dvc run writes a pipeline stage named train (specified using the -n option) in dvc.yaml. It tracks all outputs (-o) the same way as dvc add does. Unlike dvc add, dvc run also tracks dependencies (-d) and the command (python train.py) that was run to produce the result.
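You can inspect the result yourself. The exact layout depends on your DVC version, but the train stage in dvc.yaml should look roughly like this (DVC also writes a dvc.lock file that records the hashes of the dependencies and outputs):
$ cat dvc.yaml
stages:
  train:
    cmd: python train.py
    deps:
    - data
    - train.py
    outs:
    - bottleneck_features_train.npy
    - bottleneck_features_validation.npy
    - model.h5
    metrics:
    - metrics.csv:
        cache: false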
At this point you could run git add . and git commit to save the train stage and its outputs to the repository.
dvc repro will run the train stage if any of its dependencies (-d) changed. For example, when we added new images to build the second version of our model, that was a dependency change. It also updates the outputs and puts them into the cache.
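As a sketch, a further iteration of this experiment would then boil down to something like the following (the commit message and tag are hypothetical):
$ dvc repro                      # reruns the train stage only if train.py or data changed
$ git add dvc.yaml dvc.lock
$ git commit -m "Third model"
$ git tag -a "v3.0" -m "model v3.0"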
To make things a little simpler: dvc add
and dvc checkout
provide a basic
mechanism for model and large dataset versioning. dvc run
and dvc repro
provide a build system for machine learning models, which is similar to
Make in software build automation.
In this example, our focus was on giving you hands-on experience with dataset
and ML model versioning. We specifically looked at the dvc add
and
dvc checkout
commands. We'd also like to outline some topics and ideas you might want to try next to learn more about DVC and how it makes managing ML projects simpler.
First, you may have noticed that the script that trains the model is written in
a monolithic way. It uses the save_bottleneck_feature
function to
pre-calculate the bottom, "frozen" part of the net every time it is run.
Features are written into files. The intention was probably that the save_bottleneck_feature call can be commented out after the first run, but it's not very convenient to have to remember to do so every time the dataset changes.
Here's where the pipelines feature of DVC comes in
handy. We touched on it briefly when we described dvc run
and dvc repro
. The
next step would be splitting the script into two parts and utilizing pipelines.
See Data Pipelines to get hands-on experience with
pipelines, and try to apply it here. Don't hesitate to join our
community and ask any questions!
Another detail we only brushed over here is the way we captured the
metrics.csv
metrics file with the -M
option of dvc run
. Marking this
output as a metric enables us to compare its values across Git tags
or branches (for example, representing different experiments). See dvc metrics
and
Compare Experiments
to learn more about managing metrics with DVC.
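For instance, the commands below print the current metric values and compare them between the two tags from this tutorial; note that whether a .csv metrics file is parsed directly depends on your DVC version:
$ dvc metrics show
$ dvc metrics diff v1.0 v2.0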