Updating tracked files or directories may mean either modifying some of the data contents, or completely replacing them (under the same file name).
When the cache.type config option is set to symlink or hardlink (not the default; see dvc config cache for more info), updating tracked files has to be carried out with caution, to avoid data corruption. This is due to the way in which DVC handles linking data files between the cache and the workspace (refer to Large Dataset Optimization for details).
For an example of the cache corruption problem, see issue #599 in our GitHub repo.
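The underlying mechanism can be reproduced without DVC. The sketch below is only a simulation: the cache and workspace directories and the blob file name are hypothetical stand-ins, not DVC's real cache layout. It shows that appending to a hard-linked workspace file also changes the "cached" copy.

```shell
# Simulation only: plain directories stand in for the DVC cache and workspace.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/cache" "$demo_dir/workspace"

# A "cached" file and a hard link to it in the workspace.
printf 'original\n' > "$demo_dir/cache/blob"
ln "$demo_dir/cache/blob" "$demo_dir/workspace/train.tsv"

# Editing the workspace file in place also rewrites the cached copy,
# because both names refer to the same data on disk.
echo "new data item" >> "$demo_dir/workspace/train.tsv"

cache_after_edit=$(cat "$demo_dir/cache/blob")
echo "$cache_after_edit"

rm -rf "$demo_dir"
```

The cached file ends up containing the appended line even though only the workspace file was edited. This is the kind of silent change the procedures below are meant to prevent.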
If you use dvc.yaml files and dvc repro, there is no need to manage stage outputs manually. DVC removes them for you before regenerating them.
Otherwise (the data was tracked with dvc add), use one of the procedures below to "unlink" the data from the cache prior to updating it. We'll be working with a train.tsv file:
Unlink the file with dvc unprotect. This will make train.tsv safe to edit:
$ dvc unprotect train.tsv
Then edit the content of the file, for example with:
$ echo "new data item" >> train.tsv
Add the new version of the file back with DVC:
$ dvc add train.tsv
$ git add train.tsv.dvc
$ git commit -m "modify train data"
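To see why unlinking first makes the edit safe, here is a rough simulation in plain shell. It assumes that unlinking amounts to replacing the link with an independent copy. The directory and file names are hypothetical, and this is not the actual dvc unprotect implementation.

```shell
# Simulation only: plain directories stand in for the DVC cache and workspace.
unprotect_dir=$(mktemp -d)
mkdir "$unprotect_dir/cache" "$unprotect_dir/workspace"
printf 'original\n' > "$unprotect_dir/cache/blob"
ln "$unprotect_dir/cache/blob" "$unprotect_dir/workspace/train.tsv"

# "Unlink": swap the hard link for an independent copy of the data.
cp "$unprotect_dir/workspace/train.tsv" "$unprotect_dir/workspace/train.tsv.tmp"
mv "$unprotect_dir/workspace/train.tsv.tmp" "$unprotect_dir/workspace/train.tsv"

# The edit now touches only the workspace copy.
echo "new data item" >> "$unprotect_dir/workspace/train.tsv"

cache_after_unprotect=$(cat "$unprotect_dir/cache/blob")
workspace_after_edit=$(cat "$unprotect_dir/workspace/train.tsv")

rm -rf "$unprotect_dir"
```

After the copy, the cached file still holds only the original line, while the workspace file holds both lines.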
If you want to replace the file altogether, you can take the following steps.
First, stop tracking the file by using dvc remove on the .dvc file. This will remove train.tsv from the workspace (and unlink it from the cache):
$ dvc remove train.tsv.dvc
Next, replace the file with new content:
$ echo new > train.tsv
And start tracking it again:
$ dvc add train.tsv
$ git add train.tsv.dvc .gitignore
$ git commit -m "new train data"
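The order of the replacement steps matters because a shell redirection such as > overwrites an existing file in place, which would also overwrite a linked cache file. The following simulation uses hypothetical directories and file names and does not call DVC. It illustrates that removing the workspace file first leaves the "cached" data intact.

```shell
# Simulation only: plain directories stand in for the DVC cache and workspace.
replace_dir=$(mktemp -d)
mkdir "$replace_dir/cache" "$replace_dir/workspace"
printf 'original\n' > "$replace_dir/cache/blob"
ln "$replace_dir/cache/blob" "$replace_dir/workspace/train.tsv"

# Remove the workspace entry first, so the redirection below creates
# a brand-new file instead of truncating the linked one.
rm "$replace_dir/workspace/train.tsv"
echo new > "$replace_dir/workspace/train.tsv"

cache_after_replace=$(cat "$replace_dir/cache/blob")
workspace_after_replace=$(cat "$replace_dir/workspace/train.tsv")

rm -rf "$replace_dir"
```

Here the cached file keeps its original content and the workspace file holds the new content. Skipping the rm step in this simulation would have overwritten both.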