Every month we share our news, findings, interesting reads, and community takeaways. Some of these are related to our brainchild DVC and its journey; others are a collection of exciting stories and ideas centered around ML best practices and workflow.
We have some exciting news to share this month!
DVC is going to PyCon 2019! It is the first conference that we attend as a team. When we say ‘team’ — we mean it. Our engineers are flying from all over the globe to get together offline and catch up with fellow Pythonistas.
The speaker lineup is amazing! DVC creator Dmitry Petrov is giving a talk on machine learning model and dataset versioning practices.
Stop by our booth at the Startup Row on Saturday, May 4, reach out and let us know that you are willing to chat, or simply find a person with a huge DVC owl on their shirt!
Speaking of the owls — DVC has done some rebranding recently and we love our new logo. Special thanks to 99designs.com for building a great platform for finding trusted designers.
DVC is moving fast (almost as fast as my two-year-old). We do our best to keep up and totally love all the buzz in our community channels lately!
Here are a number of interesting reads that caught our eye:
A great article about using DVC in a quite advanced scenario with Docker. If you haven’t had a chance to try DVC yet — this is a great comprehensive read on why you should do so right away.
A short (only 8 minutes!) and inspiring talk by Alejandro Saucedo at FOSDEM. Alejandro covers the key trends in machine learning operations, as well as most recent open source tools and frameworks. Focused on reproducibility, monitoring and explainability, this lightning talk is a great snapshot of the current state of ML operations.
There is no way you will become a Kaggle Master without learning how to approach a new, unknown problem in a fast, hacky way with a very high number of iterations per unit of time. In the world of competitive machine learning, this skill is a matter of survival.
There are lots of hidden gems in our Discord community discussions. Sometimes they are scattered all over the channels and hard to track down.
We are sifting through the issues and discussions and sharing with you the most interesting takeaways.
No server licenses are needed for DVC. It is 100% free and open source.
I am trying to version control datasets and models of >10 GB (potentially even bigger). Can DVC handle this?

There is no limit enforced by DVC itself. It depends on the size of your local or remote storage. You need to have enough space available on S3, your SSH server, or whatever other storage you are using to keep the data files, models, and the versions of them you would like to store.
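For instance, here is a minimal sketch of hooking a project up to roomy remote storage (the remote name, bucket, and file name below are made up for illustration):

```shell
# Register an S3 bucket as the default remote storage
$ dvc remote add -d myremote s3://my-bucket/dvc-storage

# Track a large model file and upload its cached copy to the remote
$ dvc add model.pkl
$ dvc push
```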
How does it connect them? Does it see that there is a dependency which is output by the first run?
DVC figures out the pipeline by looking at the dependencies and outputs of the stages. For example, having the following:
```shell
$ dvc run -f download.dvc \
          -o joke.txt \
          "curl https://geek-jokes.sameerkumar.website/api > joke.txt"

$ dvc run -f duplicate.dvc \
          -d joke.txt \
          -o duplicate.txt \
          "cat joke.txt joke.txt > duplicate.txt"
```
you will end up with two stages: `download.dvc` and `duplicate.dvc`. The download stage has `joke.txt` as an output. The duplicate stage defines `joke.txt` as a dependency, as it is the same file. DVC detects that and creates a pipeline by joining those stages.
You can inspect the content of each stage file here (they are human readable).
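For illustration, here is roughly what a stage file like `duplicate.dvc` could contain (the layout mirrors DVC's stage file format; the checksums below are placeholders, not real values):

```yaml
cmd: cat joke.txt joke.txt > duplicate.txt
deps:
- md5: <checksum of joke.txt>
  path: joke.txt
outs:
- cache: true
  md5: <checksum of duplicate.txt>
  metric: false
  path: duplicate.txt
md5: <checksum of this stage>
```

DVC matches `joke.txt` in the `deps` section here against the same path in the `outs` of `download.dvc` to build the pipeline graph.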
(e.g. in one repo, running `dvc pull -r my_remote` pulls some data; running the same command in a different Git repo should also pull the same data)
Yes! It’s a frequent scenario for multiple repos to share remotes and even a local cache. A DVC file serves as a link to the actual data. If you add the same DVC file (e.g. `data.dvc`) to the new repo and do `dvc pull -r remotename data.dvc`, it will fetch the data. You have to use `dvc remote add` first to specify the coordinates of the remote storage you would like to share in every project. Alternatively (check out the question below), you could use `--global` to specify a single default remote (and/or cache dir) per machine.
Use `--global` when you specify the remote settings. The remote will then be visible for all projects on the same machine. `--global` saves the remote configuration to the global config (e.g. `~/.config/dvc/config`) instead of the per-project one, `.dvc/config`. See more details here.
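For example, after running something like `dvc remote add --global -d myremote s3://my-bucket/storage` (the remote name and bucket here are made up), the global config would contain roughly:

```ini
['remote "myremote"']
url = s3://my-bucket/storage
[core]
remote = myremote
```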
We would recommend skimming through our get started tutorial. To summarize the data versioning process of DVC:

1. Put files under DVC control (with `dvc add` / `dvc import`), or run a command to generate files: `$ dvc run -o file.csv "wget https://example.com/file.csv"`
2. Use `git` to save and switch between versions of the DVC files (e.g. `git checkout v1.0`)
3. Run `dvc checkout` to retrieve all the files referenced by those stage files

All your files (with each different version) are stored in a `.dvc/cache` directory that you sync with a remote file storage (for example, S3) using the `dvc push` and `dvc pull` commands (analogous to `git push` / `git pull`, but instead of syncing your `.git` directory, you are syncing your `.dvc` cache) against a remote repository (let’s say an S3 bucket).
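Putting those steps together, a typical switch between data versions might look like this sketch (the `v1.0` tag name is just an example):

```shell
# Move code and DVC files to a tagged version
$ git checkout v1.0

# Restore the matching data files from the local cache
$ dvc checkout

# If something is missing locally, fetch it from the remote
$ dvc pull
```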
If you need to move your DVC file somewhere, it is pretty easy, even if done manually:

```shell
$ mv my.dvc data/my.dvc
# now open data/my.dvc with your favorite editor and change wdir in it to 'wdir: ../'
```
I did a `dvc push` of a file to a remote. On the remote, a directory called `8f` was created with a file inside called `2ec34faf91ff15ef64abf3fbffa7ee`. The original CSV file doesn’t appear on the remote. Is that expected behaviour?

This is expected behavior. DVC saves files under names built from their checksums in order to prevent duplication. If you delete the pushed file in your project directory and perform `dvc pull`, DVC will take care of pulling the file and restoring its original name.
Below are some details about how the DVC cache works, just to illustrate the logic. When you add a data source:

```shell
$ echo "foo" > data.txt
$ dvc add data.txt
```
It computes the (md5) checksum of the file and generates a DVC file with related information:
```yaml
md5: 3bccbf004063977442029334c3448687
outs:
- cache: true
  md5: d3b07384d113edec49eaa6238ad5ff00
  metric: false
  path: data.txt
wdir: ..
```
The original file is moved to the cache and a link or copy (depending on your filesystem) is created to replace it on your working space:
```
.dvc/cache
└── d3
    └── b07384d113edec49eaa6238ad5ff00
```
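The cache layout above can be reproduced in a few lines of Python; this is just an illustration of the naming scheme, not DVC's actual implementation:

```python
import hashlib

# Content of data.txt as written by `echo "foo" > data.txt` (trailing newline included)
content = b"foo\n"

# DVC names cache files after the md5 checksum of their content
checksum = hashlib.md5(content).hexdigest()
print(checksum)  # d3b07384d113edec49eaa6238ad5ff00

# The first two hex characters become a directory, the rest the file name
cache_path = f".dvc/cache/{checksum[:2]}/{checksum[2:]}"
print(cache_path)  # .dvc/cache/d3/b07384d113edec49eaa6238ad5ff00
```

Because the name depends only on the content, two identical files added anywhere in the project occupy a single cache entry.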
Absolutely! There are three ways you could interact with DVC:

- You can `from dvc.main import main` and use it with regular CLI logic, like `ret = main('add', 'foo')`.
- You can use our internal API (take a look at `dvc/repo` and `dvc/command` in our source to get a grasp of it). It is not officially public yet, and we don’t have any special docs for it, but it is fairly stable and could definitely be used for a POC. We’ll add docs and all the official stuff for it in the not-so-distant future.

Is it possible to use DVC without `dvc run` and a graph of tasks? Basically, what would be an extremely minimal DVC intrusion into my Git repo for an existing machine learning application?

There are two options:
1. Use `dvc add` to track models and/or input datasets. It should be enough if you use `git commit` on the DVC files produced by `dvc add`. This is the very minimum you can get with DVC, and it does not require using `dvc run`. Check the first part (up to the Pipelines/Add transformations section) of the DVC get started.
2. Use `--no-exec` in `dvc run` and then just `dvc commit` and `git commit` the results. That way you’ll get your DVC files with all the linkages, without having to actually run your commands through DVC.

If you have any questions, concerns or ideas, let us know here and our stellar team will get back to you in no time.