There are cases when data is so large, or its processing is organized in such a way, that it's preferable to avoid moving it from its original location. Examples include data on network-attached storage (NAS), data processed on HDFS, Dask jobs run via SSH, or a script that streams data from S3 to process it.
External dependencies and external outputs provide ways to track data outside of the project.
External dependencies are considered part of the (extended) DVC project: DVC will track them, detecting when they change (triggering stage executions on dvc repro, for example).
To define files or directories in an external location as stage dependencies, put their remote URLs or external paths in dvc.yaml (deps field). Use the same format as the url of certain dvc remote types. Currently, the following protocols are supported:
Note that remote storage is a different feature.
Let's take a look at defining and running a download_file stage that simply downloads a file from an external location, on all the supported location types.
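As one illustration of such a stage, here is a minimal sketch for an S3 location. The bucket name mybucket, the file name data.txt, and the scratch directory standing in for the project root are assumptions made for this example. The stage lists the external URL under deps and the downloaded copy under outs:

```shell
# Scratch directory standing in for a DVC project root (assumption for this sketch).
project_dir="$(mktemp -d)"

# Write a dvc.yaml whose download_file stage depends on an external S3 object.
# "mybucket" and "data.txt" are hypothetical names.
cat > "$project_dir/dvc.yaml" <<'EOF'
stages:
  download_file:
    cmd: aws s3 cp s3://mybucket/data.txt data.txt
    deps:
      - s3://mybucket/data.txt
    outs:
      - data.txt
EOF

# Show the stage definition that was written.
cat "$project_dir/dvc.yaml"
```

In a real project, dvc.yaml lives in the project directory, and dvc repro would rerun the stage once DVC detects that the external object has changed.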
You may want to encapsulate external locations as configurable entities that can be managed independently. This is useful if multiple dependencies (or stages) reuse the same location, or if it's likely to change in the future. If the location requires authentication, you also need a way to configure it in order to connect.
DVC remotes can do just this. You may use dvc remote add to define them, and then use a special URL with format remote://{remote_name}/{path} (remote alias) to define the external dependency.
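To make the alias format concrete, the following sketch splits a remote alias into its remote name and path, then joins the path onto the remote's base URL. This is a conceptual illustration only: DVC resolves aliases internally, and the variable names and the base URL ssh://myserver.com used here are assumptions for the example.

```shell
# Illustrative only: DVC performs this resolution itself.
alias_url="remote://myssh/path/to/data.txt"
remote_base_url="ssh://myserver.com"   # assumed URL configured for the "myssh" remote

rest="${alias_url#remote://}"    # drop the remote:// scheme
remote_name="${rest%%/*}"        # everything before the first slash
remote_path="${rest#*/}"         # everything after the first slash

# Join the path onto the remote's base URL, avoiding a double slash.
resolved_url="${remote_base_url%/}/${remote_path}"

echo "remote name: $remote_name"
echo "path:        $remote_path"
echo "resolved:    $resolved_url"
```

Because the alias only names the remote, changing the remote's URL or credentials later does not require editing every stage that refers to it.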
Let's see an example using SSH. First, register and configure the remote:
$ dvc remote add myssh ssh://myserver.com
$ dvc remote modify --local myssh user myuser
$ dvc remote modify --local myssh password mypassword
Please refer to dvc remote add for more details, such as setting up access credentials for the different remote types.
Now, use an alias to this remote when defining the stage:
$ dvc run -n download_file \
-d remote://myssh/path/to/data.txt \
-o data.txt \
wget https://example.com/data.txt -O data.txt
import-url command

In the previous examples, special downloading tools were used: scp, aws s3 cp, etc. dvc import-url simplifies the downloading for all the supported external path or URL types.
$ dvc import-url https://data.dvc.org/get-started/data.xml
Importing 'https://data.dvc.org/get-started/data.xml' -> 'data.xml'
The command above creates the import .dvc file data.xml.dvc, which contains an external dependency (in this case an HTTPS URL).
dvc import can download a file or directory from any DVC project, or from a Git repository. It also creates an external dependency in its import .dvc file.
$ dvc import git@github.com:iterative/example-get-started model.pkl
Importing 'model.pkl (git@github.com:iterative/example-get-started)'
-> 'model.pkl'
The command above creates model.pkl.dvc, where the external dependency is specified (with the repo field).