Here are some Q&As from our Discord channel that we think are worth sharing.
My project still uses the old .dvc files. Is there an easy way to convert the old .dvc format to the new dvc.yaml standard?

Yes! You can easily transfer the stages by hand: dvc.yaml is designed for manual edits in any text editor, so you can type your old stages in and then delete the old .dvc files. We also have a migration script available, although we can't provide long-term support for it. Learn more about the dvc.yaml format in our brand new docs!
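For instance, a hand-written stage in dvc.yaml might look like this (the stage name and file names below are illustrative, not taken from the original question):

stages:
  train:
    cmd: python train.py
    deps:
      - train.py
      - data
    outs:
      - model.pkl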
Can I rename the files in my DVC cache or remote storage?

No, but for a good reason! What you're seeing are cached files, and they're stored with a special naming convention that makes DVC versioning and addressing possible: these file names are how DVC deduplicates data (to avoid keeping multiple copies of the same file version) and ensures that each unique version of a file is immutable. If you manually overwrote those filenames, you would risk breaking Git version control. You can read more about how DVC uses this file format in our docs.
It sounds like you're looking for ways to interact with DVC-tracked objects at a high level of abstraction, meaning that you want to interface with the original filenames and not the machine-generated hashes used by DVC. There are a few secure and recommended ways to do this: the dvc list command (read up on it here), plus dvc get and dvc import, which are used to download and share DVC-tracked artifacts. The syntax is built for an experience like using a package manager.
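As a quick sketch of that workflow (the repository URL and paths here are placeholders, not from the original thread):

$ dvc list https://github.com/example/registry data            # browse tracked files by their original names
$ dvc get https://github.com/example/registry data/data.xml    # download a single artifact
$ dvc import https://github.com/example/registry data/data.xml # download it and record where it came from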
Is it better to dvc add data files individually, or to add a directory containing multiple data files?

If the directory you're adding is logically one unit (for example, it is the whole dataset in your project), we recommend using dvc add at the directory level. Otherwise, add files one-by-one. You can read more about how DVC versions directories in our docs.
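For example, assuming your dataset lives in a data/ directory (the names here are just placeholders):

$ dvc add data              # one .dvc file tracks the whole directory
$ dvc add data/labels.csv   # or track an individual file on its own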
Is there a tutorial for using DVC with MinIO?

We don't have any tutorials for this use case exactly, but it's a very straightforward modification of our basic use cases. The key difference when using MinIO or similar S3-compatible storage (like DigitalOcean Spaces or IBM Cloud Object Storage) is that, in addition to setting up remote data storage, you must also set the endpointurl. For example:
$ dvc remote add -d myremote s3://mybucket/path/to/dir
$ dvc remote modify myremote endpointurl https://object-storage.example.com
Read up on configuring supported storage in our docs.
Is it better to add my dataset to DVC as a plain directory or as a .zip?

There are a few things to consider, but generally, we would recommend first trying a plain, unzipped directory: DVC is designed to work with large numbers of files (on the order of millions), and the latest release (DVC 1.0) has optimizations built for exactly this purpose.
Can I run dvc push with the DVC Python API inside a Python script?

Currently, our Python API doesn't support commands like dvc push, dvc pull, or dvc status. It is designed for interfacing with objects tracked by DVC. That said, CLI commands are basically calling dvc.repo.Repo object methods, so if you want to use commands from within Python code, you could try creating a Repo object with r = Repo({root_dir}) and then calling r.push(). Please note that we don't officially support this use case yet.
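As a minimal sketch of that unofficial approach:

from dvc.repo import Repo

# Not part of the official Python API: call the internal Repo object directly.
repo = Repo(".")  # "." is the root directory of your DVC project
repo.push()       # roughly equivalent to running `dvc push` in the shell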
Of course, you can also run DVC commands from a Python script using subprocess or a similar library for issuing system commands.
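For example, with the standard library:

import subprocess

# Shell out to the DVC command-line tool; check=True raises if the command fails.
subprocess.run(["dvc", "push"], check=True)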
Will the dvc pipeline command for visualizing pipelines still work in DVC 1.0?

Most of the dvc pipeline functionality, like dvc pipeline show --ascii to print out an ASCII diagram of your pipeline, has been migrated to a new command, dvc dag. This command is written for our new pipeline format. Check out our new docs for an example.
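In practice it's a drop-in replacement:

$ dvc dag    # prints an ASCII graph of the stages defined in dvc.yaml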
Can I create a pipeline stage without running it right away?

Yes. Say you have a Python script, train.py, that takes in a dataset data and outputs a model model.pkl. To create a DVC pipeline stage corresponding to this process, you could run:

$ dvc run -n train \
      -d train.py -d data \
      -o model.pkl \
      python train.py
However, this would automatically run the command python train.py, which is not necessarily desirable if you have recently run it, the process is time-consuming, and the dependencies and outputs haven't changed. You can use the --no-exec flag to get around this:
$ dvc run --no-exec -n train \
      -d train.py -d data \
      -o model.pkl \
      python train.py
This flag can also be useful when you want to define the pipeline on your local
machine but plan to run it later on a different machine (perhaps an instance in
the cloud).
Read more about the --no-exec flag in our docs.
One other approach worth mentioning is that you can manually edit your dvc.yaml file to add a stage. If you add a stage this way, pipeline commands won't be executed until you run dvc repro.
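Once the stage is saved in dvc.yaml, a single command executes the pipeline:

$ dvc repro    # runs any stages whose dependencies or outputs have changed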