Once initialized in a project, DVC populates its installation
directory (.dvc/) with the
internal directories and files needed for DVC
operation.
Additionally, there are a few metafiles that support DVC's features:
- Files ending with the .dvc extension are placeholders to track data files
  and directories. A DVC project usually has one .dvc file per large data
  file or directory being tracked.
- dvc.yaml files (or pipelines files) specify stages that form the
  pipeline(s) of a project, and how they connect (dependency graph or DAG).
  These normally have a matching dvc.lock file to record the pipeline state
  and track its outputs.
Both .dvc files and dvc.yaml use human-friendly YAML 1.2 schemas, described
below. We encourage you to get familiar with them so you may create, generate,
and edit them on your own.
Both the internal directory and these metafiles should be versioned with Git (in Git-enabled repositories).
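For example, after tracking a data file and defining a pipeline, the resulting
metafiles could be committed along with the code (the file names below are
illustrative, matching the examples in this document):

$ git add .gitignore data.xml.dvc dvc.yaml dvc.lock
$ git commit -m "Track data and pipeline with DVC"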
.dvc files

When you add a file or directory to a DVC project with dvc add, dvc import,
or dvc import-url, a .dvc file is created based on the data file name
(e.g. data.xml.dvc). These files contain the information needed to track the
data with DVC.
They use a simple YAML format, meant to be easy to read, edit, or even create
manually. Here is a sample:

outs:
  - md5: a304afb96060aad90176268345e10355
    path: data.xml
    desc: cats and dogs dataset

# Comments and user metadata are supported.
meta:
  name: 'John Doe'
  email: john@doe.com

.dvc files can contain the following fields:
- outs (always present): List of output entries (details below) that
  represent the files or directories tracked with DVC. Typically there is
  only one (but several can be added or combined manually).
- deps: List of dependency entries (details below). Only present when
  dvc import or dvc import-url are used to generate this .dvc file.
  Typically there is only one (but several can be added manually).
- wdir: Working directory for the outs and deps paths (relative to the
  .dvc file's location). If this field is not present explicitly, it defaults
  to . (the .dvc file's location).
- md5 (only for imports): MD5 hash of the import .dvc file itself.
- meta (optional): Arbitrary metadata can be added manually with this field.
  Any YAML content is supported. meta contents are ignored by DVC, but they
  can be meaningful for user processes that read .dvc files.

An output entry (outs) consists of these fields:
- path: Path to the file or directory (relative to wdir, which defaults to
  the file's location).
- md5, etag, or checksum: Hash value for the file or directory being tracked
  with DVC. MD5 is used for most locations (local file system and SSH); ETag
  for HTTP, S3, or Azure external outputs; and a special checksum for HDFS
  and WebHDFS.
- size: Size of the file or directory (sum of all files).
- nfiles: If a directory, number of files inside.
- cache: Whether or not this file or directory is cached (true by default,
  if not present). See the --no-commit option of dvc add.
- desc: User description for this output. This doesn't affect any DVC
  operations.

A dependency entry (deps) consists of these fields:
- path: Path to the dependency (relative to wdir, which defaults to the
  file's location).
- md5, etag, or checksum: Hash value for the file or directory being tracked
  with DVC. MD5 is used for most locations (local file system and SSH); ETag
  for HTTP, S3, or Azure external dependencies; and a special checksum for
  HDFS and WebHDFS. See dvc import-url for more information.
- size: Size of the file or directory (sum of all files).
- nfiles: If a directory, number of files inside.
- repo: This entry is only for external dependencies created with dvc import,
  and can contain the following fields:
  - url: URL of the Git repository with the source DVC project.
  - rev: Only present when the --rev option of dvc import is used. Specific
    commit hash, branch or tag name, etc. (a Git revision) used to import the
    dependency from.
  - rev_lock: Git commit hash of the external DVC repository at the time of
    importing or updating the dependency (with dvc update).

Note that comments can be added to .dvc files using the # comment syntax.
meta fields and # comments are preserved among executions of the dvc repro
and dvc commit commands, but not when a .dvc file is overwritten by
dvc add, dvc move, dvc import, or dvc import-url.
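For illustration, a .dvc file written by dvc import might look roughly like
this (the repository URL, the top-level md5, and the rev_lock value here are
hypothetical, not taken from a real run):

# Sketch of an import .dvc file; values are illustrative only.
md5: 3d1a3e5a5b9d2f0c8f7e6a4b2c1d0e9f # hash of this import .dvc file itself
deps:
  - path: data.xml
    repo:
      url: https://github.com/example/dataset-registry # source DVC project
      rev_lock: 6c2f5afc2a0b3f4d8e9a1b7c5d4e3f2a1b0c9d8e # commit imported from
outs:
  - md5: a304afb96060aad90176268345e10355
    path: data.xml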
dvc.yaml file

dvc.yaml files describe data science or machine learning pipelines, similar
to how Makefiles work for building software. Their YAML structure contains a
list of stages which can be written manually or generated by user code.

A helper command, dvc run, is also available to add or update stages in
dvc.yaml. Additionally, a dvc.lock file is created or updated by dvc run and
dvc repro, to record the pipeline state.
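As a rough sketch (stage name, dependencies, and outputs are borrowed from the
training stage in the example below), such a stage could be created with:

$ dvc run -n training \
          -d train.py -d features \
          -o model.pkl \
          python train.py

This appends the corresponding stage entry to dvc.yaml and records its
dependencies and outputs in dvc.lock.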
Here's a comprehensive dvc.yaml example:
stages:
  features:
    cmd: jupyter nbconvert --execute featurize.ipynb
    deps:
      - data/clean
    params:
      - levels.no
    outs:
      - features
    metrics:
      - performance.json
  training:
    desc: train your model
    cmd: python train.py
    deps:
      - train.py
      - features
    outs:
      - model.pkl:
          desc: my model description
    plots:
      - logs.csv:
          x: epoch
          x_label: Epoch
    meta: 'For deployment'
    # User metadata and comments are supported.

dvc.yaml files consist of a group of stages with names provided explicitly
by the user with the --name (-n) option of dvc run. Each stage can contain
the following fields:
- cmd (always present): Executable command defined in this stage.
- wdir: Working directory for the stage command to run in (relative to the
  file's location). If this field is not present explicitly, it defaults to .
  (the file's location).
- deps: List of dependency file or directory paths of this stage (relative to
  wdir, which defaults to the file's location).
- params: List of parameter dependency keys (field names) that are read from
  a YAML, JSON, TOML, or Python file (params.yaml by default); see the sketch
  after this list.
- outs: List of output file or directory paths of this stage (relative to
  wdir, which defaults to the file's location), and optionally, whether or
  not this file or directory is cached (true by default, if not present).
  See the --no-commit option of dvc run.
- metrics: List of metrics files, and optionally, whether or not this metrics
  file is cached (true by default, if not present). See the
  --metrics-no-cache (-M) option of dvc run.
- plots: List of plot metrics, and optionally, their default configuration
  (subfields matching the options of dvc plots modify), and whether or not
  this plots file is cached (true by default, if not present). See the
  --plots-no-cache option of dvc run.
- frozen: Whether or not this stage is frozen from reproduction.
- always_changed: Whether or not this stage is considered as changed by
  commands such as dvc status and dvc repro. false by default.
- meta (optional): Arbitrary metadata can be added manually with this field.
  Any YAML content is supported. meta contents are ignored by DVC, but they
  can be meaningful for user processes that read or write these files
  directly.
- desc (optional): User description for this stage. This doesn't affect any
  DVC operations.

dvc.yaml files also support # comments.
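The params entry above points at keys inside a parameters file. A params.yaml
consistent with the features stage in the example (and with the value shown
later in dvc.lock) might look like this; the nesting shown is an assumption
about how the dotted key levels.no is laid out:

levels:
  no: 5 # referenced in dvc.yaml as the dotted key levels.no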
💡 We maintain a dvc.yaml
schema that can be used by
editors like VSCode or
PyCharm to enable automatic syntax
checks and auto-completion.
dvc.lock file

For every dvc.yaml file, a matching dvc.lock (YAML) file usually exists. It's
created or updated by DVC commands such as dvc run and dvc repro. dvc.lock
describes the latest pipeline state. It has several purposes:

- Tracking of the intermediate and final outputs of a pipeline, similar to
  .dvc files.
- Allowing DVC to detect when stage definitions or their dependencies change,
  which invalidates the affected stages (see dvc status, dvc repro).
- dvc.lock is needed internally for several DVC commands to operate, such as
  dvc checkout, dvc get, and dvc import.

Here's an example dvc.lock based on the one in dvc.yaml above:
stages:
  features:
    cmd: jupyter nbconvert --execute featurize.ipynb
    deps:
      - path: data/clean
        md5: d8b874c5fa18c32b2d67f73606a1be60
    params:
      params.yaml:
        levels.no: 5
    outs:
      - path: features
        md5: 2119f7661d49546288b73b5730d76485
      - path: performance.json
        md5: ea46c1139d771bfeba7942d1fbb5981e
      - path: logs.csv
        md5: f99aac37e383b422adc76f5f1fb45004

Stage commands are listed again in dvc.lock, in order to know when their
definitions change in the dvc.yaml file.
Regular dependencies and all kinds of outputs
(including metrics and
plots files) are also listed (per stage) in
dvc.lock, but with an additional field to store the hash value of each file or
directory tracked by DVC. Specifically: md5, etag, or checksum (same as in
deps and outs entries of .dvc files).
Full parameters (key and value) are listed separately under
params, grouped by parameters file.
Internal directories and files

- .dvc/config: This is a configuration file. The config file can be edited
  by hand or with the dvc config command.
- .dvc/config.local: This is a local configuration file that will overwrite
  options in .dvc/config. This is useful when you need to specify private
  options in your config that you don't want to track and share through Git
  (credentials, private locations, etc.). The local config file can be edited
  by hand or with the command dvc config --local.
- .dvc/cache: The cache directory will store your data in a special
  structure. The data files and directories in the workspace will only
  contain links to the data files in the cache. (Refer to Large Dataset
  Optimization.) See dvc config cache for related configuration options.
  Note that DVC includes the cache directory in .gitignore during
  initialization. No data tracked by DVC will ever be pushed to the Git
  repository, only DVC-files that are needed to download or reproduce them.
- .dvc/plots: Directory for plot templates.
- .dvc/tmp: Directory for miscellaneous temporary files.
- .dvc/tmp/index: Directory for remote index files that are used for
  optimizing dvc push, dvc pull, dvc fetch and dvc status -c operations.
- .dvc/tmp/state: This file is used for optimization. It is a SQLite database
  that contains hash values for files tracked in a DVC project, with
  respective timestamps and inodes to avoid unnecessary file hash
  computations. It also contains a list of links (from cache to workspace)
  created by DVC, and is used to clean up your workspace when calling
  dvc checkout.
- .dvc/tmp/state-journal: Temporary file for SQLite operations.
- .dvc/tmp/state-wal: Another SQLite temporary file.
- .dvc/tmp/updater: This file is used to store the latest available version
  of DVC. It's used to remind the user to upgrade when the installed version
  is behind.
- .dvc/tmp/updater.lock: Lock file for .dvc/tmp/updater.
- .dvc/tmp/lock: Lock file for the entire DVC project.
- .dvc/tmp/rwlock: JSON file that contains read and write locks for specific
  dependencies and outputs, to allow safely running multiple DVC commands in
  parallel.

Structure of the cache directory

The DVC cache is a
content-addressable storage
(by default in .dvc/cache), which adds a layer of indirection between code and
data.
There are two ways in which the data is cached: as a single file
(e.g. data.csv), or as a directory.
DVC calculates the file hash, a 32-character string (usually MD5). The
first two characters are used to name the directory inside the cache, and the
rest become the file name of the cached file. For example, if a data file has a
hash value of ec1d2935f811b77cc49b031b999cbf17, its path in the cache will be
.dvc/cache/ec/1d2935f811b77cc49b031b999cbf17.
Note that file hashes are calculated from file contents only. Two or more files with different names but the same contents can exist in the workspace and be tracked by DVC, but only one copy is stored in the cache. This helps avoid data duplication.
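As a quick sketch (assuming data.xml hashes to the value above; the copied
file name is just illustrative), both points can be checked from the shell:

$ md5sum data.xml
ec1d2935f811b77cc49b031b999cbf17  data.xml
$ cp data.xml same-data.xml
$ dvc add data.xml same-data.xml
$ ls .dvc/cache/ec/
1d2935f811b77cc49b031b999cbf17

The file lands under .dvc/cache/ec/ (the first two hash characters), and
adding a second file with identical contents does not create a second cache
entry.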
Let's imagine adding a directory with 2 images:
$ tree data/images/
data/images/
├── cat.jpeg
└── index.jpeg
$ dvc add data/images

The directory is cached as a JSON file with a .dir extension. The files it
contains are stored in the cache regularly, as explained earlier. It looks like
this:
.dvc/cache/
├── 19
│   └── 6a322c107c2572335158503c64bfba.dir
├── 29
│   └── a6c8271c0c8fbf75d3b97aecee589f
└── df
    └── f70c0392d7d386c39a23c64fcc0376

The .dir file contains the mapping of files in data/images (as a JSON
array), including their hash values:
$ cat .dvc/cache/19/6a322c107c2572335158503c64bfba.dir
[{"md5": "dff70c0392d7d386c39a23c64fcc0376", "relpath": "cat.jpeg"},
{"md5": "29a6c8271c0c8fbf75d3b97aecee589f", "relpath": "index.jpeg"}]That's how DVC knows that the other two cached files belong in the directory.