Versioning large data files and directories for data science is great, but not
enough. How is data filtered, transformed, or used to train ML models? DVC
introduces a mechanism to capture data pipelines — series of data processes
that produce a final result.
DVC pipelines and their data can also be easily versioned (using Git). This
allows you to better organize projects, and reproduce your workflow and results
later — exactly as they were built originally! For example, you could capture a
simple ETL workflow, organize a data science project, or build a detailed
machine learning pipeline.
Follow along with the code examples below!
Pipeline stages
Use dvc run to create stages. These represent processes (source code tracked
with Git) that form the steps of a pipeline. Stages also connect code to its
data input and output. Let's transform a Python script into a
stage:
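For example, the stage discussed in the rest of this section could be created with a command along these lines (a minimal sketch; whether the script takes the data file as a command-line argument is an assumption about src/prepare.py):

```bash
# Create a `prepare` stage: name, parameters, dependencies, output, and the
# command to run (how prepare.py receives its input is assumed here).
dvc run -n prepare \
        -p prepare.seed,prepare.split \
        -d src/prepare.py -d data/data.xml \
        -o data/prepared \
        python src/prepare.py data/data.xml
```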
A dvc.yaml file is generated. It includes information about the command we ran
(python src/prepare.py), its dependencies, and
outputs.
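For a command like the one above, dvc.yaml would look roughly like this (a sketch; DVC generates the file for you, so the exact contents depend on the options you passed):

```yaml
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - data/data.xml
      - src/prepare.py
    params:
      - prepare.seed
      - prepare.split
    outs:
      - data/prepared
```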
The command options used above mean the following:
-n prepare specifies a name for the stage. If you open the dvc.yaml file
you will see a section named prepare.
-p prepare.seed,prepare.split defines a special type of dependency: parameters.
We'll get to them later in the Experiments section, but the idea is that a
stage can depend on field values from a parameters file (params.yaml by
default):
```yaml
prepare:
  split: 0.20
  seed: 20170428
```
-d src/prepare.py and -d data/data.xml mean that the stage depends on
these files to work. Notice that the source code itself is marked as a
dependency. If any of these files change later, DVC will know that this stage
needs to be reproduced.
-o data/prepared specifies an output directory for this script, which writes
two files into it. This is how the workspace should look now:
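Something like the tree below (the file names under data/prepared are placeholders, since your script determines what it actually writes, and files from earlier steps such as .dvc files are omitted):

```
.
├── data
│   ├── data.xml
│   └── prepared           # stage output, tracked by DVC
│       ├── test.tsv       # placeholder name
│       └── train.tsv      # placeholder name
├── dvc.lock
├── dvc.yaml
├── params.yaml
└── src
    └── prepare.py
```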
There's no need to use dvc add for DVC to track stage outputs (data/prepared
in this case); dvc run already took care of this. You only need to run
dvc push if you want to save them to remote storage (usually along with
git commit to version dvc.yaml and dvc.lock themselves).
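At this point that typically looks something like the following (the commit message and the exact set of files to stage are up to you):

```bash
# Version the pipeline definition with Git, then upload the cached outputs
git add dvc.yaml dvc.lock data/.gitignore
git commit -m "Add prepare stage"
dvc push
```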
Dependency graphs (DAGs)
By using dvc run multiple times, and specifying outputs of a
stage as dependencies of another one, we can describe a sequence of
commands that gets to a desired result. This is what we call a data pipeline
or dependency graph.
Let's create a second stage chained to the outputs of prepare, to perform
feature extraction:
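For instance (the featurization script, its output path, and how it receives arguments are placeholders; the important part is that -d data/prepared makes this stage depend on the previous stage's output):

```bash
# Hypothetical second stage, chained to `prepare` through data/prepared
dvc run -n featurize \
        -d src/featurization.py -d data/prepared \
        -o data/features \
        python src/featurization.py data/prepared data/features
```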
As before, there's no need for dvc repro to rerun prepare, featurize, etc. But
this time it doesn't rerun train either! The previous run with the same set of
inputs (parameters + data) was cached, and is simply reused.
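In practice this means that after a change, regenerating everything that is actually affected is a single command:

```bash
# DVC determines which stages are outdated and skips the rest
dvc repro
```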
dvc repro relies on the DAG definition that it reads from dvc.yaml, and uses
dvc.lock to determine what exactly needs to be run.
The dvc.lock file is similar to a .dvc file: it captures hashes (in most cases
md5) of the dependencies and the values of the parameters that were used. It
can be considered a state of the pipeline:
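For the prepare stage it would look roughly like this (hash values are placeholders, and the exact layout varies a little between DVC versions):

```yaml
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - path: data/data.xml
        md5: <hash of the raw data>      # placeholder
      - path: src/prepare.py
        md5: <hash of the script>        # placeholder
    params:
      params.yaml:
        prepare.seed: 20170428
        prepare.split: 0.20
    outs:
      - path: data/prepared
        md5: <hash of the output>.dir    # placeholder
```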
DVC pipelines (the dvc.yaml file together with the dvc run and dvc repro
commands) solve a few important problems:

Automation - run a sequence of steps in a "smart" way that makes iterating
on your project faster. DVC automatically determines which parts of a project
need to be run, and it caches "runs" and their results to avoid unnecessary
re-runs.
Reproducibility - dvc.yaml and dvc.lock files describe what data to use
and which commands will generate the pipeline results (such as an ML model).
Storing these files in Git makes them easy to version and share.
Continuous Delivery and Continuous Integration (CI/CD) for ML - describing
projects in a way that they can be reproduced (built) is the first necessary
step before introducing CI/CD systems. See our sister project,
CML, for some examples.
Visualize
Having built our pipeline, we need a good way to understand its structure.
Seeing a graph of connected stages would help. DVC lets you do just that,
without leaving the terminal!
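The dvc dag command prints an ASCII rendering of the stages and their connections directly in the terminal:

```bash
# Show the pipeline graph defined in dvc.yaml
dvc dag
```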