Record changes to DVC-tracked files in the project by saving them to the cache
and updating the dvc.lock or .dvc files.
usage: dvc commit [-h] [-q | -v] [-f] [-d] [-R]
                  [targets [targets ...]]

positional arguments:
  targets        Limit command scope to these stages or .dvc files.
                 Using -R, directories to search for stages or .dvc
                 files can also be given.
The dvc commit command is useful in several scenarios where data already
tracked by DVC changes: when a stage or pipeline is in
development/experimentation; to force-update the dvc.lock or .dvc files without
reproducing stages or pipelines; or to mark existing files/dirs as stage
outputs. These scenarios are further detailed below.
- Code or data for a stage is under active development, with multiple
  iterations (experiments) in code, configuration, or data. Use the --no-commit
  option of DVC commands (dvc add, dvc run, dvc repro) to avoid caching
  unnecessary data repeatedly. Use dvc commit when the DVC-tracked data is
  final.

  💡 For convenience, a pre-commit Git hook is available to remind you to
  dvc commit when needed. See dvc install for more details.
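- One or more stage dependencies or outputs changed, but without consequences
  for the results (for example, editing comments in source code). Use
  dvc commit to force update the dvc.lock or .dvc files and cache.
- Output files were created manually and then specified as outputs in dvc.yaml.
  It is possible to add missing data to an existing stage, and then dvc commit
  can be used to save outputs to the cache (and update dvc.lock).
- The commands of a stage or pipeline were executed by hand, or files were
  modified by hand (see dvc unprotect). Once the desired result is reached, use
  dvc commit to update the dvc.lock file(s) and store changed data to the
  cache.

Let's take a closer look at what happens in the first scenario. Normally, DVC
commands like dvc add, dvc repro, or dvc run commit the data to the cache after
creating or updating a dvc.lock or .dvc file. What commit means is that DVC: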
- Computes a hash for the file/directory.
- Enters the hash value and file name into the dvc.lock or .dvc file.
- Tells Git to ignore the file/directory (by adding it to .gitignore). (Note
  that if the project was initialized with no Git support
  (dvc init --no-scm), this does not happen.)
- Adds the file/directory to the cache.

There are many cases where the last step is not desirable (for example, rapid
iterations on an experiment). The --no-commit option prevents it (on the
commands where it's available). The file hash is still computed and added to the
dvc.lock or .dvc file, but the actual data is not cached. This is where the
dvc commit command comes into play: it performs that last step when needed.
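The difference between recording a hash and committing data can be pictured with a toy model. The sketch below is not DVC's implementation: the function names (file_md5, track, commit, demo), the in-memory lock dictionary, and the use of a plain file copy are all assumptions made for illustration, and the .gitignore step is left out.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file's contents."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def commit(path: Path, md5: str, cache_dir: Path) -> Path:
    """Toy version of the last step: store the data in the cache directory."""
    target = cache_dir / md5[:2] / md5[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(path, target)  # assumption: a plain copy stands in for caching
    return target


def track(path: Path, lock: dict, cache_dir: Path, no_commit: bool = False) -> str:
    """Record the hash in the (hypothetical) lock dict; cache unless no_commit."""
    md5 = file_md5(path)
    lock[path.name] = md5  # the hash is recorded either way
    if not no_commit:
        commit(path, md5, cache_dir)
    return md5


def demo() -> dict:
    """Run the toy workflow in a temporary directory and report what happened."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "model.pkl"
        data.write_bytes(b"weights v1")
        cache_dir = root / "cache"
        lock: dict = {}

        # Equivalent in spirit to running a command with --no-commit.
        md5 = track(data, lock, cache_dir, no_commit=True)
        cached_before = (cache_dir / md5[:2] / md5[2:]).exists()

        # Equivalent in spirit to running dvc commit afterwards.
        commit(data, md5, cache_dir)
        cached_after = (cache_dir / md5[:2] / md5[2:]).exists()

    return {
        "recorded": lock["model.pkl"] == md5,
        "cached_before": cached_before,
        "cached_after": cached_after,
    }
```

Calling demo() shows the hash recorded right away while the data reaches the toy cache only after the explicit commit call.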
Note that it's best to avoid the last three scenarios. They essentially
force-update the dvc.lock or .dvc files and save data to cache. They are still
useful, but keep in mind that DVC can't guarantee reproducibility in those
cases.
-d, --with-deps - determines the files to commit by tracking dependencies to
the target stages or .dvc files. If no targets are provided, this option is
ignored. By traversing all stage dependencies, DVC searches backward from the
target stages in the corresponding pipelines. This means DVC will not commit
files referenced in stages later than the targets.

-R, --recursive - determines the files to commit by searching each target
directory and its subdirectories for stages or .dvc files to inspect. If there
are no directories among the targets, this option is ignored.

-f, --force - commit data even if hash values for dependencies or outputs did
not change.

-h, --help - prints the usage/help message and exits.

-q, --quiet - do not write anything to standard output. Exit with 0 if no
problems arise, otherwise 1.

-v, --verbose - displays detailed tracing information from executing the
dvc commit command.

Let's employ a simple workspace with some data, code, ML models, and pipeline
stages, such as the DVC project created for the Get Started. Then we can see
what happens with git commit and dvc commit in different situations.
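Before moving on to the examples, the backward search that --with-deps performs can be pictured with a small sketch. This is only an illustration, not DVC's code: the PIPELINE dictionary is a hypothetical stage graph (the prepare stage and the exact edges are assumptions), and stages_with_deps is an invented helper name.

```python
from collections import deque

# Hypothetical stage graph: each stage maps to the stages it depends on.
PIPELINE = {
    "prepare": [],
    "featurize": ["prepare"],
    "train": ["featurize"],
    "evaluate": ["train", "featurize"],
}


def stages_with_deps(targets, graph):
    """Return the targets plus every stage reached by following dependencies backward."""
    seen = []
    queue = deque(targets)
    while queue:
        stage = queue.popleft()
        if stage in seen:
            continue
        seen.append(stage)
        queue.extend(graph[stage])  # walk upstream only, never downstream
    return seen


print(stages_with_deps(["train"], PIPELINE))
# ['train', 'featurize', 'prepare']
```

The evaluate stage comes after train in this graph, so it is not selected, which matches the description of later stages being left out.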
Sometimes we want to iterate through multiple changes to configuration, code, or
data, trying different ways to improve the output of a stage. To avoid filling
the cache with undesired intermediate results, we can run a single stage with
dvc run --no-commit, or reproduce an entire pipeline using
dvc repro --no-commit. This prevents data from being saved to the cache. When
development of the stage is finished, dvc commit can be used to store the data
files in the cache.
In the featurize stage, src/featurization.py is executed. A useful change to
make is adjusting the parameters for that script. The parameters are defined in
the params.yaml file. Updating the value of the max_features param to 6000
changes the resulting model:
featurize:
  max_features: 6000
  ngrams: 2
This edit introduces a change that would cause the featurize, train, and
evaluate stages to execute if we ran dvc repro. But if we want to try several
values for max_features and save only the best result to the cache, we can run
it like this:
$ dvc repro --no-commit
We can run this command as many times as we like, editing params.yaml any way
we like, and so long as we use --no-commit, the data does not get saved to the
cache. Let's verify that's the case, starting with dvc status:
$ dvc status
featurize:
    changed outs:
        not in cache:       data/features
train:
    changed outs:
        not in cache:       model.pkl
Next, we can look in the cache directory to confirm that the new version of
model.pkl is indeed not in the cache. Let's look at the latest state of train in
dvc.lock first:
train:
  cmd: python src/train.py data/features model.pkl
  deps:
    - path: data/features
      md5: de03a7e34e003e54dde0d40582c6acf4.dir
    - path: src/train.py
      md5: ad8e71b2cca4334a7d3bb6495645068c
  params:
    params.yaml:
      train.n_estimators: 100
      train.seed: 20170428
  outs:
    - path: model.pkl
      md5: 9aba000ba83b341a423a81eed8ff9238
To verify this instance of model.pkl is not in the cache, we must know the path
to the cached file. In the cache directory, the first two characters of the hash
value are used as a subdirectory name, and the remaining characters are the file
name. Therefore, had the file been committed to the cache, it would appear in
the directory .dvc/cache/9a. Let's check:
$ ls .dvc/cache/9a
ls: .dvc/cache/9a: No such file or directory
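The path rule used in this check can be written as a small helper. This is a minimal sketch that assumes the cache layout described in this document; cache_path is a hypothetical name for illustration, not part of DVC.

```python
from pathlib import PurePosixPath


def cache_path(md5: str, cache_dir: str = ".dvc/cache") -> str:
    """Map a hash value to its expected location in the cache.

    Assumes the layout described in this document: the first two
    characters name a subdirectory, the remaining characters name the file.
    """
    return str(PurePosixPath(cache_dir) / md5[:2] / md5[2:])


# Hash recorded for model.pkl in the dvc.lock excerpt above.
print(cache_path("9aba000ba83b341a423a81eed8ff9238"))
# .dvc/cache/9a/ba000ba83b341a423a81eed8ff9238
```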
If we've determined the changes to params.yaml were successful, we can execute
this set of commands:
$ dvc commit
$ dvc status
Data and pipelines are up to date.
$ ls .dvc/cache/9a
ba000ba83b341a423a81eed8ff9238
We've verified that dvc commit has saved the changes into the cache, and that
the new instance of model.pkl is there.
It is also possible to run by hand the commands that dvc repro would execute.
You won't have DVC helping you, but you have the freedom to run any command you
like, even ones not defined in dvc.yaml stages. For example:
$ python src/featurization.py data/prepared data/features
$ python src/train.py data/features model.pkl
$ python src/evaluate.py model.pkl data/features auc.metric
As before, dvc status will show which files have changed, and when your work is
finalized, dvc commit will commit everything to the cache.
Sometimes we want to clean up a code or configuration file in a way that doesn't
change its results. We might document the code with inline comments, change
indentation, remove some debugging printouts, or make any other change that
doesn't alter the output of pipeline stages.
Let's edit one of the source code files. It doesn't matter which one. Both Git
and DVC recognize that a change was made:
$ git status -s
 M src/train.py
$ dvc status
train:
    changed deps:
        modified:           src/train.py
If we ran dvc repro at this point, this pipeline would be reproduced. But since
the change was inconsequential, that would be a waste of time and CPU. That's
especially critical if the corresponding stages take lots of resources to
execute.
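Why does a comment-only edit register as a change at all? A rough illustration, assuming the change check rests on content hashes such as the md5 values recorded in dvc.lock: any byte-level edit yields a different hash, whether or not the program behaves differently. The two script bodies below are made up for the example, and md5_of is a hypothetical helper.

```python
import hashlib


def md5_of(content: bytes) -> str:
    """Return the MD5 hex digest of some file content."""
    return hashlib.md5(content).hexdigest()


# Hypothetical before/after versions of a script: only a comment was added.
original = b"model.fit(X, y)\n"
edited = b"# Train the model.\nmodel.fit(X, y)\n"

# The behavior is the same, but the content hashes differ.
print(md5_of(original) == md5_of(edited))
# False
```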
$ git add src/train.py
$ git commit -m "CHANGED"
[master 72327bd] CHANGED
1 file changed, 2 insertions(+)
$ dvc commit
dependencies ['src/train.py'] of 'train.dvc' changed.
Are you sure you commit it? [y/n] y
$ dvc status
Data and pipelines are up to date.
Instead of reproducing the pipeline for changes that do not produce different
results, just use commit on both Git and DVC.