Some teams prefer to run their experiments on a single shared machine.
This allows better resource utilization, such as the ability to use multiple
GPUs, centralized data storage, etc. With DVC, you can easily set up shared data
storage on a server accessed by several users, or in any similar setup, in a way
that makes restoring or switching workspaces almost instantaneous for everyone,
similar to git checkout for your code.
Create a directory external to your DVC projects to be used as a shared cache location for everyone's projects:
$ mkdir -p /home/shared/dvc-cache
Make sure that the directory has proper permissions, so that all your colleagues can write to it, and can read cached files written by others. The most straightforward way to do this is to make all users members of the same group, and have the shared cache directory owned by that group.
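The group-ownership approach could be sketched roughly as follows. CACHE_DIR and CACHE_GROUP are illustrative variables, not DVC settings; they default to a scratch directory and your current primary group so the commands can be tried without root. The setgid bit is an optional extra that is assumed to be useful here, since it makes new entries inherit the directory's group on Linux.

```shell
#!/bin/sh
# Sketch only: CACHE_DIR and CACHE_GROUP are illustrative names.
# Defaults point at a scratch directory and the caller's primary group.
CACHE_DIR="${CACHE_DIR:-$(mktemp -d)/dvc-cache}"
CACHE_GROUP="${CACHE_GROUP:-$(id -gn)}"

mkdir -p "$CACHE_DIR"

# Hand the directory to the shared group.
chgrp "$CACHE_GROUP" "$CACHE_DIR"

# rwx for owner and group, r-x for others. The leading 2 sets the setgid
# bit, so entries created inside inherit the directory's group.
chmod 2775 "$CACHE_DIR"

# Show the resulting owner, group, and mode.
ls -ld "$CACHE_DIR"
```

For the real shared location you would run this with CACHE_DIR=/home/shared/dvc-cache and CACHE_GROUP set to your team's group, most likely with sudo.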
You can skip this part if you are setting up a new DVC project where the local
cache directory (.dvc/cache by default) hasn't been used.
If you have already worked on a DVC project and wish to transfer its existing cache to the shared cache directory, simply move its contents from the old location to the new one:
$ mv .dvc/cache/* /home/shared/dvc-cache
Now, ensure that the cached directories and files have appropriate permissions, so that they can be accessed by your colleagues (assuming their users are members of the same group):
$ sudo find /home/shared/dvc-cache -type d -exec chmod 0775 {} \;
$ sudo find /home/shared/dvc-cache -type f -exec chmod 0444 {} \;
$ sudo chown -R myuser:ourgroup /home/shared/dvc-cache/
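To confirm that the permission changes took effect, a small check along these lines may help. find_unshared is a hypothetical helper, not part of DVC; it lists directories that lack group read/write/execute and files that lack group read, and prints nothing when everything looks shareable.

```shell
#!/bin/sh
# Hypothetical helper (not part of DVC): list cache entries that members
# of the owning group could not use.
find_unshared() {
    cache_dir="$1"
    # Directories need group read, write, and execute (octal 070).
    find "$cache_dir" -type d ! -perm -070
    # Files need at least group read (octal 040).
    find "$cache_dir" -type f ! -perm -040
}
```

For example, `find_unshared /home/shared/dvc-cache` should produce no output after the chmod commands above. It only inspects mode bits, so group ownership still needs to be checked separately (for instance with ls -l).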
Tell DVC to use the directory we've set up above as the cache for your project:
$ dvc cache dir /home/shared/dvc-cache
And tell DVC to set group permissions on newly created or downloaded cache files:
$ dvc config cache.shared group
See dvc cache dir and dvc config cache for more information.
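Both commands write to the project's config file, so a quick sanity check before committing might look like the sketch below. has_shared_cache is a hypothetical helper, not a DVC command. It assumes INI-style "key = value" lines and does not parse section headers, so treat it as a rough check rather than a validator.

```shell
#!/bin/sh
# Hypothetical helper (not part of DVC): succeed only if the config file
# contains both a "dir = ..." line and a "shared = group" line.
# Assumption: INI-style "key = value" lines; sections are not parsed.
has_shared_cache() {
    config_file="${1:-.dvc/config}"
    grep -Eq '^[[:space:]]*dir[[:space:]]*=[[:space:]]*[^[:space:]]' "$config_file" &&
        grep -Eq '^[[:space:]]*shared[[:space:]]*=[[:space:]]*group[[:space:]]*$' "$config_file"
}
```

Run from the project root, `has_shared_cache && echo "shared cache configured"` prints the message only when both settings are present.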
If you're using Git, commit changes to your project's config file (.dvc/config by default):
$ git add .dvc/config
$ git commit -m "config external/shared DVC cache"
You and your colleagues can work in your own separate workspaces as usual, and DVC will handle all your data efficiently. Let's say you are cleaning up raw data for later stages:
$ dvc add raw
$ dvc run -n clean_data -d raw -o clean ./cleanup.py raw clean
# The data is cached in the shared location.
$ git add raw.dvc dvc.yaml dvc.lock .gitignore
$ git commit -m "cleanup raw data"
$ git push
Your colleagues can check out the project data (from the shared cache), and have
both raw and clean data files appear in their workspace without moving anything
manually. After this, they could decide to continue building this pipeline and
process the clean data:
$ git pull
$ dvc checkout
A raw # Data is linked from cache to workspace.
$ dvc run -n process_clean_data -d clean -o processed ./process.py clean processed
$ git add dvc.yaml dvc.lock
$ git commit -m "process clean data"
$ git push
And now you can just as easily make their work appear in your workspace with:
$ git pull
$ dvc checkout
A processed