Happy Halloween from Pirate DeeVee!
.dvc file, and what would happen if I decided not to push my .dvc files to my Git repo?
DVC creates lightweight metafiles (.dvc files) that correspond to large
artifacts in your project. These .dvc files contain pointers to your artifacts
in remote storage (we use a simple content-based storage scheme). Because we use
content-based storage, the remote storage itself isn't designed for browsing
(although there are some discussions about
how to make stored files more "discoverable", and you can always identify them
manually by their contents and meta-information like timestamps).
Your .dvc files establish meaningful links between human-readable
filenames and file contents in remote storage, and they let you apply Git
versioning to your stored datasets and models. You can think of your DVC remote
storage as a complement to your Git repository, not a replacement.
In other words... if you're not Git versioning your .dvc files, you're not
versioning anything in DVC remote storage!
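To make "content-based storage" more concrete, here is a minimal sketch of how a file's contents can determine where it is stored. It assumes `md5sum` is available, and the two-character directory split is an illustrative layout rather than a guarantee about any particular DVC version. The file name `data.csv` is hypothetical.

```shell
# Hypothetical stand-in for a large artifact (name and contents are made up).
workdir=$(mktemp -d)
printf 'example contents\n' > "$workdir/data.csv"

# Hash the file contents. md5sum ships with GNU coreutils; other systems
# may provide a different tool for the same job.
hash=$(md5sum "$workdir/data.csv" | cut -d ' ' -f 1)

# Assumed layout: a two-character directory, then the rest of the hash.
prefix=$(printf '%s' "$hash" | cut -c 1-2)
rest=$(printf '%s' "$hash" | cut -c 3-)
storage_path="$prefix/$rest"

# The stored name says nothing about "data.csv"; only a metafile that
# records both the hash and the path can connect the two.
echo "data.csv would be stored as: $storage_path"
```

Because the stored path depends only on the contents, the human-readable name has to be recorded somewhere else, which is the job of the .dvc file.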
dvc pull?
Yep. By default, DVC data transfer operations use a number of threads
proportional to the number of CPUs detected. But there's a handy flag for
dvc pull and dvc push that lets you override the defaults:
-j <number>, --jobs <number> - number of threads to run
simultaneously to handle the downloading of files from
the remote. The default value is 4 * cpu_count(). For
SSH remotes, the default is just 4. Using more jobs may
improve the total download speed if a combination of small
and large files are being fetched.
dvc plots show multiple precision recall curves, one for each class?
Currently, dvc plots doesn't support multiple linear curves on a single plot
(except for dvc plots diff, of course!). But, you could make one precision
recall curve per class and display them side-by-side.
To do this, you'd want to write the precision recall curve values to separate
files for each class (prc-0.json, prc-1.json, etc.). Then you would run:
$ dvc plots show prc-0.json prc-1.json
And you'll see two plots side-by-side! A benefit of this approach is that when
you run dvc plots diff to compare precision recall curves across Git commits,
you'll get a comparison plotted for each class.
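With more than two classes, typing every file name gets tedious. The sketch below builds the argument list in a loop, following the prc-N.json naming used above. The class count is a hypothetical value, and the command is printed rather than executed so the sketch does not assume DVC or the JSON files are present.

```shell
# Hypothetical number of classes; adjust to match your model.
num_classes=3

# Collect prc-0.json, prc-1.json, ... into one argument list.
prc_files=""
i=0
while [ "$i" -lt "$num_classes" ]; do
  prc_files="$prc_files prc-$i.json"
  i=$((i + 1))
done
prc_files=${prc_files# }   # drop the leading space

# Print the resulting command for review before running it yourself.
echo "dvc plots show $prc_files"
```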
.dvc/config file? It contains my logging credentials for storage, and I'm nervous about adding it to a shared Git repository.
This is a common scenario: you don't necessarily want to broadcast your remote
storage credentials to everyone on your team, but you still want to check in
your DVC setup (meaning, your .dvc/config file). In this case, you want to use
a local config file!
You can use the command
$ dvc config --local
to set up remote credentials that will be stored in .dvc/config.local. By
default, this file is in your .gitignore, so you don't have to worry about
accidentally committing secrets to your Git repository.
Check out the docs for more,
including the --system and --global options for setting your configuration
for multiple projects and users respectively.
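If you want to double-check that Git really ignores the local config before you store secrets in it, `git check-ignore` can tell you. The sketch below tries this in a throwaway repository. The ignore rule is written by hand here on the assumption that it mirrors what DVC sets up, and the helper name `is_git_ignored` is made up for this example.

```shell
# Succeeds (exit status 0) when Git ignores the given path.
is_git_ignored() {
  git check-ignore -q "$1" 2>/dev/null
}

# Throwaway repository that imitates a DVC project. The ignore rule is
# hand-written for the demo (assumption: DVC generates a similar one).
demo_repo=$(mktemp -d)
git init -q "$demo_repo" 2>/dev/null
mkdir "$demo_repo/.dvc"
printf '/config.local\n' > "$demo_repo/.dvc/.gitignore"

# Check both config files from inside the demo repository.
local_status=$(cd "$demo_repo" && if is_git_ignored .dvc/config.local; then echo "ignored"; else echo "not ignored"; fi)
shared_status=$(cd "$demo_repo" && if is_git_ignored .dvc/config; then echo "ignored"; else echo "not ignored"; fi)

echo ".dvc/config.local: $local_status"
echo ".dvc/config:       $shared_status"
```

In a real project, running the same check from the repository root before adding credentials gives a quick sanity test.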
cml-publish?
cml-publish is a service for hosting files that are embedded in CML reports,
like images, audio files, and GIFs. By default, we have a limit of 2 MB per
upload.
If your files are larger than this (which can happen, depending on the machine
learning problem you're working on!) we recommend using GitLab's artifact
storage.
Based on discussions in the community,
we recently implemented a CML flag (--gitlab-uploads) to streamline the
process:
$ cml-publish movie.mov --md --gitlab-uploads > report.md
Note that we don't currently have an analogous solution for GitHub, because GitHub artifacts expire after 90 days (whereas they're permanent in GitLab).
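One way to apply the size limit in a GitLab workflow is to pick the flags based on the file size. This is a rough sketch: it reads "2 MB" as 2 * 1024 * 1024 bytes, which is an assumption, and the helper name `publish_flags` and the demo file names are hypothetical. The commands are printed rather than executed.

```shell
# Upload limit from the answer above, read here as 2 * 1024 * 1024 bytes
# (assumption: the exact byte count is not stated).
limit_bytes=$((2 * 1024 * 1024))

# Print the cml-publish flags suited to a file's size.
publish_flags() {
  size=$(( $(wc -c < "$1") ))
  if [ "$size" -gt "$limit_bytes" ]; then
    echo "--md --gitlab-uploads"
  else
    echo "--md"
  fi
}

# Demo with two throwaway files (hypothetical names and contents).
demo_dir=$(mktemp -d)
printf 'tiny' > "$demo_dir/small.png"
head -c 3000000 /dev/zero > "$demo_dir/movie.mov"

echo "cml-publish small.png $(publish_flags "$demo_dir/small.png")"
echo "cml-publish movie.mov $(publish_flags "$demo_dir/movie.mov")"
```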
Failed guessing mime type of file, when I try to use cml-publish. What's going on?
This error message usually means that the target of cml-publish, for example,
$ cml-publish <target file>
is not found. Check for typos in the target filename and ensure that the file was in fact generated during the run (if it isn't part of your Git repository). We've opened an issue to add a more informative error message in the future.
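Until the error message improves, a small guard in the workflow can give a clearer failure. The sketch below checks that the target exists before publishing. The helper name `check_target` and the file name `plot.png` are hypothetical, and the publish command is only printed.

```shell
# Succeed only when the target exists as a regular file; otherwise
# explain the problem on stderr and return a non-zero status.
check_target() {
  if [ -f "$1" ]; then
    echo "found: $1"
  else
    echo "missing: $1 (typo, or not generated during this run?)" >&2
    return 1
  fi
}

# Hypothetical target; the command is printed only if the file is there.
target="plot.png"
if check_target "$target"; then
  echo "cml-publish $target --md"
fi
```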
dvc metrics diff to compare metrics generated during the run to metrics on the main branch and print a table, but the table isn't showing any of the metrics from main. What could be happening?
When a continuous integration runner won't report metrics from previous versions of your project (or other branches), that's usually a sign that the runner doesn't have access to the full Git history of your project or your metrics themselves. Here are a few things to check for:
dvc metrics diff requires the Git history to be accessible. Make sure that
in your workflow, before you run this function, you've done a git fetch. We
recommend:
$ git fetch --prune --unshallow
dvc pull in your workflow before attempting dvc metrics diff.
metric.json:
$ dvc run -n mystage -m metric.json train.py
By default, metric.json is cached and ignored by Git, which means that if you
aren't using a DVC remote in your CI workflow, metric.json will effectively be
abandoned on your local machine! You can avoid this by using the -M flag
instead of -m in dvc run, or manually adding the field cache: false to
your metric in dvc.yaml. Be sure to remove your metrics from any .gitignore
files, and commit and push them to your Git repository.
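The checks above can be strung together in a workflow roughly as follows. One detail worth handling: `git fetch --unshallow` fails on a clone that already has its full history, so this sketch only adds the flag when Git reports a shallow repository. The helper name `fetch_args` is made up, `main` is the branch name from the question, and the steps are printed rather than executed so the sketch has no side effects.

```shell
# Print the `git fetch` arguments suited to the current clone.
# --unshallow is only added when Git reports a shallow repository,
# because it fails on a clone that already has full history.
fetch_args() {
  if [ "$(git rev-parse --is-shallow-repository 2>/dev/null)" = "true" ]; then
    echo "--prune --unshallow"
  else
    echo "--prune"
  fi
}

# Planned workflow steps, in order: restore the Git history, restore
# DVC-tracked files, then compare metrics against the main branch.
echo "git fetch $(fetch_args)"
echo "dvc pull"
echo "dvc metrics diff main"
```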
That's all for this month. Happy Halloween! Watch out for scary bugs.