How to Resolve Merge Conflicts in DVC Metafiles

Sometimes multiple members of a team work on the same DVC-tracked data. When the time comes to combine their changes, merge conflicts can occur in the Git-tracked metafiles, and these need to be resolved.

dvc.yaml

Conflicts here are no different from what we would see in source code. See Git Merging.

stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
<<<<<<< HEAD
    - data/big.xml
=======
    - data/small.xml
>>>>>>> branch
    - src/prepare.py
    params:
    - prepare.seed
    - prepare.split
    outs:
    - data/prepared

dvc.lock

There's no need to resolve lock file conflicts manually. You can safely delete dvc.lock, merge dvc.yaml, and then run dvc repro to regenerate it.

dvc commit can also be a good option, but only for the specific case where the HEAD version is chosen.
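
The delete-and-regenerate approach described above can be sketched in Python as follows. This is only an illustration: the helper name `regenerate_lock` and its injectable `run` parameter are assumptions made for this example and are not part of DVC; the only thing taken from the text is the `dvc repro` command itself.

```python
import subprocess
from pathlib import Path


def regenerate_lock(repo_dir=".", run=subprocess.run):
    """Delete a conflicted dvc.lock and let `dvc repro` write a fresh one.

    Call this only after the conflicts in dvc.yaml have been resolved.
    `run` is injectable (hypothetical parameter) so the flow can be
    dry-run without DVC installed.
    """
    repo = Path(repo_dir)
    lock = repo / "dvc.lock"
    # Step 1: drop the conflicted lock file (no error if it is already gone).
    lock.unlink(missing_ok=True)
    # Step 2: reproduce the pipeline, which writes a new dvc.lock.
    run(["dvc", "repro"], cwd=repo, check=True)
    return lock
```

Passing a stub such as `run=print` shows the command that would be executed without actually invoking DVC.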

.dvc files

There are three main variations in the structure of these files, depending on the command that generated them:

Simple tracking (add)

In .dvc files generated by dvc add, you'll get something that looks like:

outs:
<<<<<<< HEAD
- md5: a304afb96060aad90176268345e10355
  size: 12
=======
- md5: 35dd1fda9cfb4b645ae431f4621fa324
  size: 100
>>>>>>> branch
  path: data.xml

If you decide to just pick one of the versions, leave that md5 (with or without size and, possibly, nfiles fields) and delete the other one(s):

outs:
  - md5: 35dd1fda9cfb4b645ae431f4621fa324
    path: data.xml
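
If the same choice has to be made in many .dvc files, picking one side can be scripted. The sketch below assumes Git's default two-way conflict style (no `|||||||` ancestor section); the function name `pick_side` and its `keep` parameter are hypothetical names chosen for this example. It keeps the size field along with md5, which is one of the variants allowed above.

```python
def pick_side(text, keep="ours"):
    """Resolve Git conflict hunks in `text` by keeping one side of each.

    keep="ours" keeps the lines between `<<<<<<<` and `=======`;
    keep="theirs" keeps the lines between `=======` and `>>>>>>>`.
    """
    if keep not in ("ours", "theirs"):
        raise ValueError("keep must be 'ours' or 'theirs'")
    resolved = []
    state = "outside"  # one of "outside", "ours", "theirs"
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            state = "ours"
        elif line.startswith("=======") and state == "ours":
            state = "theirs"
        elif line.startswith(">>>>>>>") and state == "theirs":
            state = "outside"
        elif state in ("outside", keep):
            # Unconflicted lines and lines from the chosen side survive.
            resolved.append(line)
    return "\n".join(resolved) + "\n"


conflicted = """\
outs:
<<<<<<< HEAD
- md5: a304afb96060aad90176268345e10355
  size: 12
=======
- md5: 35dd1fda9cfb4b645ae431f4621fa324
  size: 100
>>>>>>> branch
  path: data.xml
"""

# Keep the version coming from `branch`.
print(pick_side(conflicted, keep="theirs"))
```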

But if you actually want to merge the data files (or directories) from both versions, you can follow this process:

  1. Run dvc checkout data.xml on both HEAD and branch;
  2. Copy the data into temporary locations (e.g. data.xml.head and data.xml.branch);
  3. Merge them by hand;
  4. Finally, run dvc add data.xml to overwrite the conflicted .dvc file.

Append-only directories

If you have an "append-only" dataset, where people only add new files/directories, DVC provides a merge driver that can automatically resolve Git conflicts for you. To use it, first set it up in your Git repo:

$ git config merge.dvc.name 'DVC merge driver'
$ git config merge.dvc.driver \
           'dvc git-hook merge-driver --ancestor %O --our %A --their %B'

Then add this line to your .gitattributes (in the root of your Git repo):

mydataset.dvc merge=dvc

Now, when a merge conflict occurs, DVC will simply combine data from both branches.
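
To make "combine data from both branches" concrete, here is a conceptual sketch of an additions-only three-way merge over maps of relative path to file hash. It is not DVC's actual implementation; the function name `merge_append_only`, the dictionary representation, and the sample paths and hashes are all assumptions made for illustration.

```python
def merge_append_only(ancestor, ours, theirs):
    """Three-way merge of {relative_path: hash} maps, additions only."""
    # Every file from the common ancestor must survive unchanged on both sides.
    for name, side in (("ours", ours), ("theirs", theirs)):
        for path, file_hash in ancestor.items():
            if side.get(path) != file_hash:
                raise ValueError(f"{name} changed or removed {path!r}")
    merged = dict(ancestor)
    # Add the new files from each side; identical additions are fine.
    for side in (ours, theirs):
        for path, file_hash in side.items():
            if merged.setdefault(path, file_hash) != file_hash:
                raise ValueError(f"both sides added {path!r} with different content")
    return merged


# Hypothetical directory listings: each branch added one new file.
base = {"cat1.jpg": "h1"}
head = {"cat1.jpg": "h1", "cat2.jpg": "h2"}
branch = {"cat1.jpg": "h1", "dog1.jpg": "h3"}
print(sorted(merge_append_only(base, head, branch)))
```

The sketch refuses to merge when a side modified or deleted an existing file, which is why the approach only suits datasets that grow by appending.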

Imported data

To resolve conflicted .dvc files generated by dvc import or dvc import-url, remove the conflicted hashes (as well as size and, possibly, nfiles) altogether:

<<<<<<< HEAD
md5: 263395583f35403c8e0b1b94b30bea32
=======
md5: 520d2602f440d13372435d91d3bfa176
>>>>>>> branch
frozen: true
deps:
- path: get-started/data.xml
  repo:
    url: https://github.com/iterative/dataset-registry
<<<<<<< HEAD
    rev_lock: f31f5c4cdae787b4bdeb97a717687d44667d9e62
=======
    rev_lock: 06be1104741f8a7c65449322a1fcc8c5f1070a1e
>>>>>>> branch
outs:
<<<<<<< HEAD
- md5: a304afb96060aad90176268345e10355
  size: 12
=======
- md5: 35dd1fda9cfb4b645ae431f4621fa324
  size: 100
>>>>>>> branch
  path: data.xml

So you get something like this:

frozen: true
deps:
  - path: get-started/data.xml
    repo:
      url: https://github.com/iterative/dataset-registry
outs:
  - path: data.xml

Then run dvc update on the .dvc file to download the latest data from its original source.

Note that updating brings in the latest version of the data from its source, which may not correspond to any of the hashes that were removed.
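
Removing the conflicted hashes from an import .dvc file can also be done programmatically. The following is a minimal, line-based sketch rather than a YAML-aware tool: it assumes Git's default conflict style and that only the hash-related fields (md5, size, nfiles, rev_lock) are in conflict. The name `strip_import_hashes` is hypothetical. The resulting file still needs dvc update afterwards.

```python
import re

# Fields to remove, optionally preceded by a YAML list dash.
HASH_FIELD = re.compile(r"^(\s*)(- )?(md5|size|nfiles|rev_lock):")


def strip_import_hashes(text):
    """Drop conflict markers and all hash fields from an import .dvc file."""
    kept = []
    in_theirs = False
    dash_indent = None  # indent of a list item whose first line was dropped
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            continue
        if line.startswith("======="):
            in_theirs = True
            continue
        if line.startswith(">>>>>>>"):
            in_theirs = False
            continue
        if in_theirs:
            continue  # the second side only repeats fields we remove anyway
        match = HASH_FIELD.match(line)
        if match:
            if match.group(2):
                dash_indent = match.group(1)
            continue
        if dash_indent is not None:
            prefix = dash_indent + "  "
            if line.startswith(prefix):
                # Move the list dash onto the first surviving key of the item.
                line = dash_indent + "- " + line[len(prefix):]
            dash_indent = None
        kept.append(line)
    return "\n".join(kept) + "\n"


example = """\
outs:
<<<<<<< HEAD
- md5: a304afb96060aad90176268345e10355
  size: 12
=======
- md5: 35dd1fda9cfb4b645ae431f4621fa324
  size: 100
>>>>>>> branch
  path: data.xml
"""
print(strip_import_hashes(example))
```

Because the list dash sits on the md5 line in the conflicted file, the sketch re-attaches it to the path line so that the result stays valid YAML.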
