Okay, now that we've learned how to track data and models with DVC and how to version them with Git, the next question is: how can we use these artifacts outside of the project? How do I download a model to deploy it? How do I download a specific version of a model? How do I reuse datasets across different projects?
These questions tend to come up when you browse the files that DVC saves to remote storage, e.g.
s3://dvc-public/remote/get-started/fb/89904ef053f04d64eafcc3d70db673 instead of the original file names such as model.pkl or data.xml.
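To make the shape of that storage path less mysterious, here is a minimal sketch of how a content-addressed location can be derived from a file's contents. It assumes the layout seen in the example path: the MD5 hex digest split into a 2-character directory and a 30-character file name. The function name and the bucket URL are hypothetical, and some DVC versions may add further path components in front of the hash.

```python
import hashlib


def remote_object_path(remote_url: str, content: bytes) -> str:
    """Return a content-addressed path for `content` under `remote_url`.

    Assumption: the MD5 hex digest (32 characters) is split into a
    2-character directory name and a 30-character file name.
    """
    digest = hashlib.md5(content).hexdigest()
    return f"{remote_url.rstrip('/')}/{digest[:2]}/{digest[2:]}"


# "s3://example-bucket/remote" is a placeholder, not a real remote.
print(remote_object_path("s3://example-bucket/remote", b"hello"))
```

Because the path depends only on the contents, the original file name does not appear in remote storage; it is recorded in the DVC files kept in Git.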
Read on or watch our video to see how to find and access models and datasets with DVC.
Remember those .dvc files that dvc add generates? Those files (and dvc.lock,
which we'll cover later) have their history in Git. Together with the DVC
remote storage config, which is also saved in Git, they contain all the
information needed to access and download any version of datasets, files, and
models. This means that a Git repository with DVC files becomes an entry point
and can be used instead of accessing files directly.
You can use dvc list to explore a DVC repository hosted on any
Git server. For example, let's see what's in the get-started/ directory of our
dataset-registry repo:
$ dvc list https://github.com/iterative/dataset-registry get-started
.gitignore
data.xml
data.xml.dvc

The benefit of this command over browsing a Git hosting website is that the list
includes files and directories tracked by both Git and DVC (data.xml is not
visible if you check GitHub).
One way to access the data is to simply download it with dvc get. This is
useful when working outside of a DVC project environment, for example in an
automated ML model deployment task:
$ dvc get https://github.com/iterative/dataset-registry \
    use-cases/cats-dogs

When working inside another DVC project though, this is not the best strategy because the connection between the projects is lost: others won't know where the data came from or whether new versions are available.
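For an automated deployment task, the dvc get call can be wrapped in a small script. The sketch below builds the command line, optionally pinning a Git revision with --rev and choosing the destination with -o, which is one way to download a specific version. The helper names, the tag v1.0, and the output directory are hypothetical and used only for illustration.

```python
import shlex
import subprocess


def dvc_get_command(repo_url, path, rev=None, out=None):
    """Build a `dvc get` command; `rev` may be a Git tag, branch, or commit."""
    cmd = ["dvc", "get", repo_url, path]
    if rev is not None:
        cmd += ["--rev", rev]
    if out is not None:
        cmd += ["-o", out]
    return cmd


def fetch(repo_url, path, rev=None, out=None):
    """Run the download; requires the `dvc` CLI and network access."""
    subprocess.run(dvc_get_command(repo_url, path, rev, out), check=True)


# "v1.0" and "data/cats-dogs" are hypothetical values for illustration.
print(shlex.join(dvc_get_command(
    "https://github.com/iterative/dataset-registry",
    "use-cases/cats-dogs",
    rev="v1.0",
    out="data/cats-dogs",
)))
```

Only the command is printed here; calling fetch() with the same arguments would perform the actual download.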
dvc import also downloads any file or directory, and additionally creates a
.dvc file that can be saved in the project:
$ dvc import https://github.com/iterative/dataset-registry \
    get-started/data.xml -o data/data.xml

This is similar to dvc get + dvc add, but the resulting .dvc file
includes metadata to track changes in the source repository. This allows you to
bring in changes from the data source later, using dvc update.
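The import-then-update cycle can be sketched as a pair of commands. The helper names below are hypothetical, and the sketch assumes that the import writes its metadata next to the output as <out>.dvc, so that this is the file to pass to dvc update.

```python
import shlex


def import_command(repo_url, path, out):
    """Build the `dvc import` command that downloads `path` to `out`."""
    return ["dvc", "import", repo_url, path, "-o", out]


def update_command(out):
    """Build the `dvc update` command for a previously imported file.

    Assumption: the import wrote its metadata to `<out>.dvc`.
    """
    return ["dvc", "update", f"{out}.dvc"]


repo = "https://github.com/iterative/dataset-registry"
for cmd in (
    import_command(repo, "get-started/data.xml", "data/data.xml"),
    update_command("data/data.xml"),
):
    # Pass `cmd` to subprocess.run(cmd, check=True) to execute it for real.
    print(shlex.join(cmd))
```

The first command is run once, when the data is brought into the project; the second can be repeated whenever you want to check the source repository for a newer version.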
It's also possible to integrate your data or models directly in source code with DVC's Python API. This lets you access the data contents directly from within an application at runtime. For example:
import dvc.api

with dvc.api.open(
    'get-started/data.xml',
    repo='https://github.com/iterative/dataset-registry'
) as fd:
    # fd is a file-like object that can be processed normally,
    # for example by reading its contents.
    contents = fd.read()
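The same call can also target a specific version of the data by passing a Git revision through the rev parameter of dvc.api.open. The following is a minimal sketch: the helper read_head and its n_chars parameter are hypothetical, and DVC is imported inside the function only so that the module loads even where DVC is not installed.

```python
def read_head(path, repo, rev=None, n_chars=200):
    """Return the first `n_chars` characters of a DVC-tracked text file.

    `rev` is any Git revision (tag, branch, or commit) in `repo`;
    None means the repository's default branch.
    """
    import dvc.api  # requires DVC to be installed (e.g. `pip install dvc`)

    with dvc.api.open(path, repo=repo, rev=rev) as fd:
        return fd.read(n_chars)


# Example call (needs network access), left commented out on purpose:
# print(read_head(
#     "get-started/data.xml",
#     repo="https://github.com/iterative/dataset-registry",
# ))
```

Reading only the first few characters keeps the example cheap to run; an application would typically hand fd to a parser instead.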