Get tracked files or directories from remote storage into the cache.
```
usage: dvc fetch [-h] [-q | -v] [-j <number>] [-r <name>] [-a] [-T]
                 [--all-commits] [-d] [-R] [--run-cache]
                 [targets [targets ...]]

positional arguments:
  targets        Limit command scope to these tracked files/directories,
                 .dvc files, or stage names.
```
Downloads DVC-tracked files from remote storage into the cache of the project (without placing them in the workspace, like `dvc pull` would). This makes them available for linking (or copying) into the workspace (refer to `dvc config cache.type`).
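For instance, to inspect or change the link strategy used when placing files into the workspace (a sketch; `symlink` is just one of the supported link types):

```
$ dvc config cache.type            # print the current link type(s)
$ dvc config cache.type symlink    # e.g. prefer symlinks to copies
```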
Without arguments, `dvc fetch` ensures that the files specified in all `dvc.lock` and `.dvc` files in the workspace exist in the cache. The `--all-branches`, `--all-tags`, and `--all-commits` options enable fetching data for multiple Git commits.
The `targets` given to this command (if any) limit what to fetch. It accepts paths to tracked files or directories (including paths inside tracked directories), `.dvc` files, and stage names (found in `dvc.yaml`).
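For illustration, each of the following is a valid target (the paths and stage name come from the example project used later on this page):

```
$ dvc fetch data/data.xml.dvc        # a .dvc file
$ dvc fetch data/features/test.pkl   # a path inside a tracked directory
$ dvc fetch prepare                  # a stage name from dvc.yaml
```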
Fetching is performed automatically by `dvc pull` (when the data is not already in the cache), along with `dvc checkout`:

```
Tracked files                Commands
---------------- ---------------------------------
remote storage
      +
      |         +------------+
      | - - - - | dvc fetch  | ++
      v         +------------+   +   +----------+
project's cache                  ++  | dvc pull |
      +         +------------+   +   +----------+
      | - - - - |dvc checkout| ++
      |         +------------+
      v
workspace
```
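In other words, running the two commands below in sequence has roughly the same effect as a single `dvc pull` (a sketch based on the diagram above):

```
$ dvc fetch      # remote storage -> project's cache
$ dvc checkout   # project's cache -> workspace
```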
Here are some scenarios in which `dvc fetch` is useful, instead of pulling:

- `dvc metrics show` with its `--all-branches` option.

The default remote is used (see `dvc config core.remote`) unless the `--remote` option is used.
- `-r <name>`, `--remote <name>` - name of the remote storage to fetch from (see `dvc remote list`).
- `--run-cache` - downloads all available history of stage runs from the remote repository.
- `-d`, `--with-deps` - determines files to download by tracking dependencies to the `targets`. If none are provided, this option is ignored. By traversing all stage dependencies, DVC searches backward from the target stages in the corresponding pipelines. This means DVC will not fetch files referenced in stages later than the `targets`.
- `-R`, `--recursive` - determines the files to fetch by searching each target directory and its subdirectories for `dvc.yaml` and `.dvc` files to inspect. If there are no directories among the `targets`, this option is ignored.
- `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data from remote storage. The default value is `4 * cpu_count()`. For SSH remotes, the default is `4`. Note that the default value can be set using the `jobs` config option with `dvc remote modify`. Using more jobs may improve the overall transfer speed.
- `-a`, `--all-branches` - fetch cache for all Git branches instead of just the current workspace. This means DVC may download files needed to reproduce different versions of a `.dvc` file (experiments), not just the ones currently in the workspace. Note that this can be combined with `-T` below, for example using the `-aT` flag.
- `-T`, `--all-tags` - same as `-a` above, but applies to Git tags as well as the workspace. Note that both options can be combined, for example using the `-aT` flag.
- `--all-commits` - same as `-a` or `-T` above, but applies to all Git commits as well as the workspace. This downloads tracked data for the entire commit history of the project.
- `-h`, `--help` - prints the usage/help message and exits.
- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no problems arise, otherwise 1.
- `-v`, `--verbose` - displays detailed tracing information.
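As a quick illustration, several of these flags can be combined in one invocation (`myremote` below is a hypothetical remote name):

```
$ dvc fetch -r myremote -aT -j 8      # all branches and tags, 8 download jobs
$ dvc remote modify myremote jobs 8   # or persist the parallelism level
```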
Let's employ a simple workspace with some data, code, ML models, and pipeline stages, such as the DVC project created for the Get Started. Then we can see what `dvc fetch` does in different scenarios.
The workspace looks like this:

```
.
├── data
│   └── data.xml.dvc
├── dvc.lock
├── dvc.yaml
├── params.yaml
├── prc.json
├── scores.json
└── src
    └── <code files here>
```
This project comes with a predefined HTTP remote storage. We can now just run `dvc fetch` to download the most recent `model.pkl`, `data.xml`, and other DVC-tracked files into our local cache.
```
$ dvc status --cloud
...
        deleted:            data/features/train.pkl
        deleted:            model.pkl

$ dvc fetch
$ tree .dvc/cache
.dvc/cache
├── 20
│   └── b786b6e6f80e2b3fcf17827ad18597.dir
├── c8
│   ├── d307aa005d6974a8525550956d5fb3
│   └── ...
...
```
`dvc status --cloud` compares the cache contents against the default remote. Refer to `dvc status`.

Note that the `.dvc/cache` directory was created and populated. Refer to Structure of cache directory for more info.
Used without arguments (as above), `dvc fetch` downloads all files and directories needed by all `dvc.yaml` and `.dvc` files in the current branch. For example, the hash values `20b786b...` and `c8d307a...` correspond to the `data/features/` directory and the `model.pkl` file, respectively.
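To check which hash belongs to which artifact yourself, you can look them up in `dvc.lock` (a sketch; the exact layout of the output below is illustrative):

```
$ grep -A 1 'path: model.pkl' dvc.lock
    - path: model.pkl
      md5: c8d307aa005d6974a8525550956d5fb3
```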
Let's now link files from the cache to the workspace with:

```
$ dvc checkout
```
If you tried the previous example, please delete the `.dvc/cache` directory first (e.g. `rm -Rf .dvc/cache`) to follow this one.

`dvc fetch` only downloads the tracked data corresponding to any given `targets`:
```
$ dvc fetch prepare
$ tree .dvc/cache
.dvc/cache
├── 20
│   └── b786b6e6f80e2b3fcf17827ad18597.dir
├── 32
│   └── b715ef0d71ff4c9e61f55b09c15e75
└── 6f
    └── 597d341ceb7d8fbbe88859a892ef81
```
Cache entries for the `data/prepared` directory (output of the `prepare` stage), as well as the actual `test.tsv` and `train.tsv` files, were downloaded. Their hash values are shown above.

Note that you can fetch data within tracked directories. For example, the `featurize` stage has the entire `data/features` directory as output, but we can just get this:
```
$ dvc fetch data/features/test.pkl
```
If you check `.dvc/cache` again, you'll see a couple more files were downloaded: the cache entries for the `data/features` directory, and `data/features/test.pkl` itself.
After following the previous example (Specific stages), only the files associated with the `prepare` stage have been fetched. Several dependencies/outputs of other pipeline stages are still missing from the cache:
```
$ dvc status -c
...
        deleted:            data/features/test.pkl
        deleted:            data/features/train.pkl
        deleted:            model.pkl
```
One could do a simple `dvc fetch` to get all the data, but what if you only want to retrieve the data up to our third stage, `train`? We can use the `--with-deps` (or `-d`) option:
```
$ dvc fetch --with-deps train
$ tree .dvc/cache
.dvc/cache
├── 20
│   └── b786b6e6f80e2b3fcf17827ad18597.dir
├── c8
│   ├── 43577f9da31eab5ddd3a2cf1465f9b
│   └── d307aa005d6974a8525550956d5fb3
├── 32
│   └── b715ef0d71ff4c9e61f55b09c15e75
├── 54
│   └── c0f3ef1f379563e0b9ba4accae6807
├── 6f
│   └── 597d341ceb7d8fbbe88859a892ef81
├── a1
│   └── 414b22382ffbb76a153ab1f0d69241.dir
└── a3
    └── 04afb96060aad90176268345e10355
```
Fetching with `--with-deps` starts with the target stage (`train`) and searches backwards through its pipeline for data to download into the project's cache. All the data for the second and third stages (`featurize` and `train`) has now been downloaded to the cache. We could now use `dvc checkout` to get the data files needed to reproduce this pipeline up to the third stage into the workspace (with `dvc repro train`), as sketched below.
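A minimal sketch of that follow-up:

```
$ dvc checkout      # link the fetched data into the workspace
$ dvc repro train   # reproduce the pipeline up to the train stage
```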
Note that in this example project, the last stage (`evaluate`) doesn't add any more data files than those from previous stages, so at this point all of the data for this pipeline is cached, and `dvc status -c` would output `Cache and remote 'storage' are in sync.`
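To verify (the expected output line is quoted from the paragraph above):

```
$ dvc status -c
Cache and remote 'storage' are in sync.
```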