There are cases when data is so large, or its processing is organized in such a
way, that it's preferable to avoid moving it from its original location.
Examples include data on network-attached storage (NAS), data processed on
HDFS, Dask running via SSH, or a script that streams data from S3 to process
it.
External outputs and
external dependencies provide ways to
track data outside of the project.
How external outputs work
External outputs are considered part of the (extended) DVC project:
DVC will track them for
versioning, detecting when
they change (reported by dvc status, for example).
To use existing files or directories in an external location as
stage outputs, give their remote URLs or external
paths to dvc add, or put them in dvc.yaml (outs field). Use the same
format as the url of certain dvc remote types. Currently, the following
protocols are supported:
Amazon S3
Google Cloud Storage
SSH
HDFS
Local files and directories outside the workspace
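As a rough illustration of the URL formats above, the following Python sketch maps a target to one of the listed location types by its URL scheme. The scheme names (s3, gs, ssh, hdfs) are assumed to match the usual DVC remote URL format. The function name location_type and the example bucket and path names are hypothetical, and DVC's own parsing may differ.

```python
from urllib.parse import urlparse

# Assumed mapping from URL scheme to the location types listed above.
SCHEME_TO_TYPE = {
    "s3": "Amazon S3",
    "gs": "Google Cloud Storage",
    "ssh": "SSH",
    "hdfs": "HDFS",
}


def location_type(target: str) -> str:
    """Classify a URL or POSIX path by its scheme (illustrative only)."""
    scheme = urlparse(target).scheme
    if not scheme:
        # No scheme: treat the target as a local file system path.
        return "Local path"
    try:
        return SCHEME_TO_TYPE[scheme]
    except KeyError:
        raise ValueError(f"unsupported location: {target!r}") from None


if __name__ == "__main__":
    # Hypothetical targets, one remote URL and one local path.
    for target in ("s3://mybucket/existing-data", "/home/shared/existing-data"):
        print(target, "->", location_type(target))
```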
External outputs require an
external cache
in the same external/remote file system.
Note that remote storage is a different
feature, and that external outputs are not pushed to or pulled from DVC
remotes.
⚠️ Avoid using the same DVC remote used for dvc push, dvc pull, etc. for
external outputs, because it may cause data collisions: the hash of an
external output could collide with that of a local file with different
content.
Examples
Let's take a look at the following operations on all the supported location
types:
Adding a dvc remote in the same location as the desired outputs, and
configuring it as external cache with dvc config.
Tracking existing data on the external location using dvc add (--external
option needed). This produces a .dvc file with an external URL or path in
its outs field.
Creating a simple stage with dvc run (--external option needed) that
moves a local file to the external location. This produces an external output
in dvc.yaml.
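The three operations above can be sketched for an S3 location as follows. This Python snippet only assembles the command lines and prints them; nothing is executed. The remote name s3cache, the bucket mybucket, the stage name upload, and the helper external_s3_commands are hypothetical. The cache.s3 config option and the aws s3 cp stage command are assumptions about a typical setup, not something the text above prescribes.

```python
import shlex
from typing import List


def external_s3_commands(bucket: str, remote: str = "s3cache") -> List[List[str]]:
    """Build the DVC command lines for an external S3 output (not executed)."""
    base = f"s3://{bucket}"
    return [
        # 1. Add a remote in the same location as the outputs...
        ["dvc", "remote", "add", remote, f"{base}/cache"],
        # ...and configure it as the external cache for S3 (assumed option name).
        ["dvc", "config", "cache.s3", remote],
        # 2. Track data that already exists in the bucket.
        ["dvc", "add", "--external", f"{base}/existing-data"],
        # 3. Create a stage that copies a local file to the bucket.
        ["dvc", "run", "-n", "upload", "-d", "data.txt", "--external",
         "-o", f"{base}/data.txt",
         "aws", "s3", "cp", "data.txt", f"{base}/data.txt"],
    ]


if __name__ == "__main__":
    # Print each command as a shell-quoted line for review.
    for argv in external_s3_commands("mybucket"):
        print(shlex.join(argv))
```

To run the commands for real, each list could be passed to subprocess.run(argv, check=True) from inside a DVC project.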
⚠️ DVC requires both SSH and SFTP access to work with remote SSH locations.
Please check that you are able to connect both ways with tools like ssh and
sftp (GNU/Linux).
Note that your server's SFTP root might differ from its physical root (/).
The default cache is in .dvc/cache, so there is no need to set a
custom cache location for local paths outside of your project.
The exception is external data on different storage devices or partitions
mounted on the same file system (e.g. /mnt/raid/data). In that case, please
set up an external cache on that same drive to enable
file links
and avoid copying data.
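Whether an external path lives on a different device than the project can be checked before deciding on a cache location. The sketch below compares device IDs with os.stat. It relies on the general fact that hard links cannot cross file systems, and it is only a heuristic, not how DVC itself decides. The function name same_device and the example paths are hypothetical.

```python
import os


def same_device(path_a: str, path_b: str) -> bool:
    """Return True if both existing paths are on the same device."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


if __name__ == "__main__":
    # Hypothetical paths: the project's default cache and an external data directory.
    cache_dir, data_dir = ".dvc/cache", "/mnt/raid/data"
    if os.path.exists(cache_dir) and os.path.exists(data_dir):
        if same_device(cache_dir, data_dir):
            print("Same device: the default cache should be able to link files.")
        else:
            print("Different devices: consider an external cache on the data's drive.")
```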