How cool would it be to make Git handle arbitrarily large files and directories
with the same performance it has with small code files? Imagine doing a git clone
and seeing data files and machine learning models in the workspace. Or switching
to a different version of a 100 GB file in less than a second with a
git checkout.
The foundation of DVC consists of a few commands that you can run along with
git to track large files, directories, or ML models. Think "Git for data".
Read on or watch our video to learn about versioning data with DVC!
To start tracking a file or directory, use dvc add:
⚙️ Expand to get an example dataset.
Having initialized a project in the previous section, get the data file we will
be using later like this:
$ mkdir data
$ dvc get https://github.com/iterative/dataset-registry \
get-started/data.xml -o data/data.xml
We use the fancy dvc get command to jump ahead a bit and show how a Git repo
becomes a source for datasets or models - what we call a "data registry" or
"model registry". dvc get can download any file or directory tracked in a DVC
repository. It's like wget, but for DVC or Git repos. In this case we
download the latest version of the data.xml file from the
dataset registry repo as the
data source.
$ dvc add data/data.xml
DVC stores information about the added file (or a directory) in a special .dvc
file named data/data.xml.dvc, a small text file with a human-readable
format. This file can be
easily versioned like source code with Git, as a placeholder for the original
data (which gets listed in .gitignore):
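For example, a typical way to do that (the commit message is only illustrative):

$ git add data/data.xml.dvc data/.gitignore
$ git commit -m "Add raw data"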
dvc add moved the data to the project's cache and linked it back to the
workspace.
$ tree .dvc/cache
.dvc/cache
└── a3
    └── 04afb96060aad90176268345e10355
The hash value of the data.xml file we just added (a304afb...) determines
the cache path shown above. And if you check data/data.xml.dvc, you will find
it there too:
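It should look roughly like this (the exact fields, such as file size, can vary between DVC versions):

outs:
- md5: a304afb96060aad90176268345e10355
  path: data.xml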
You can upload DVC-tracked data or models with dvc push, so they're safely
stored remotely. This also means they can be
retrieved on other environments later with dvc pull. First, we need to set up a
remote storage location:
DVC supports the following remote storage types: Google Drive, Amazon S3,
Azure Blob Storage, Google Cloud Storage, Aliyun OSS, SSH, HDFS, and HTTP.
Please refer to dvc remote add for more details and examples.
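For instance, an Amazon S3 remote could be added like this (the bucket and path are placeholders):

$ dvc remote add storage s3://mybucket/dvcstore

Passing -d would also make it the default remote used by dvc push and dvc pull.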
⚙️ Set up a remote storage
DVC remotes let you store a copy of the data tracked by DVC outside of the local
cache, usually a cloud storage service. For simplicity, let's set up a local
remote:
While the term "local remote" may seem contradictory, it doesn't have to be.
The "local" part refers to the type of location: another directory in the file
system. "Remote" is how we call storage for DVC projects. It's
essentially a local data backup.
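For example, using the /tmp/dvc-storage directory that the rest of this section refers to (the remote name myremote is arbitrary):

$ mkdir /tmp/dvc-storage
$ dvc remote add -d myremote /tmp/dvc-storage

The -d flag makes it the default remote, so dvc push and dvc pull will use it without extra options.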
$ dvc push
Usually, we also want to git commit and git push the corresponding .dvc
files.
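For example, assuming the only pending change at this point is the remote configuration from the previous step:

$ git add .dvc/config
$ git commit -m "Configure remote storage"
$ git push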
💡 Expand to see what happens under the hood.
dvc push copied the data cached locally to the remote storage we
set up earlier. You can check that the data has been stored in the DVC remote
with:
$ ls -R /tmp/dvc-storage
/tmp/dvc-storage/:
a3
/tmp/dvc-storage/a3:
04afb96060aad90176268345e10355
Retrieving
Once DVC-tracked data is stored remotely, it can be downloaded when needed in
other copies of this project with dvc pull. Usually, we run it
after git clone and git pull.
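In its simplest form it takes no arguments and downloads the data referenced by the .dvc files in the workspace:

$ dvc pull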
⚙️ Expand to explode the project 💣
If you've run dvc push, you can delete the cache (.dvc/cache) and
data/data.xml to experiment with dvc pull:
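For example (make sure the earlier dvc push succeeded before removing anything):

$ rm -rf .dvc/cache
$ rm -f data/data.xml
$ dvc pull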
The regular workflow is to use git checkout first (to switch a branch, or to
check out a commit or a revision of a .dvc file) and then run dvc checkout to
sync data:
$ git checkout <...>
$ dvc checkout
⚙️ Expand to get the previous version of the dataset.
Let's clean up the previous artificial changes we made and get the previous
version of the dataset:
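Assuming the previous version of the dataset lives one commit back in Git history, one way to do this is:

$ git checkout HEAD~1 data/data.xml.dvc
$ dvc checkout

To keep that older version around, remember to commit the restored .dvc file afterwards.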
Yes, DVC is technically not even a version control system! The contents of .dvc
files define data file versions. Git itself provides the version control. DVC in turn
creates these .dvc files, updates them, and synchronizes DVC-tracked data in
the workspace efficiently to match them.
Versioning large datasets
In cases where you process very large datasets, you need an efficient mechanism
(in terms of space and performance) to share a lot of data, including its
different versions. Do you use network-attached storage (NAS)? Or a large
external volume?
While these cases are not covered in the Get Started, we recommend reading the
following sections next to learn more about advanced workflows:
A shared external cache can be set
up to store, version and access a lot of data on a large shared volume
efficiently.
A quite advanced scenario is to track and version data directly on the remote
storage (e.g. S3). Check out
Managing External Data
to learn more.