Depending on your storage type, you may also need dvc remote modify to
provide credentials and/or configure other remote parameters.
Synopsis
usage: dvc remote add [-h] [--global | --system | --local] [-q | -v]
                      [-d] [-f]
                      name url
positional arguments:
  name    Name of the remote.
  url     (See supported URLs in the examples below.)
Description
This command creates a remote section in the DVC project's
config file and optionally assigns a default
remote in the core section, if the --default option is used (recommended
for the first remote).
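For example, adding a local directory as the default remote could look like
this (the name myremote and the path /tmp/dvcstore are placeholders):
$ dvc remote add -d myremote /tmp/dvcstore
$ cat .dvc/config
['remote "myremote"']
    url = /tmp/dvcstore
[core]
    remote = myremote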
name and url are required. The name is used to identify the remote and
must be unique for the project.
url specifies a location to store your data. It can represent a cloud storage
service, an SSH server, network-attached storage, or even a directory in the
local file system (see all the supported remote storage types in the examples
below).
DVC will determine the type of remote based on the
url provided. This may affect which parameters you can access later via
dvc remote modify (note that the url itself can be modified).
If you installed DVC via pip and plan to use cloud services as remote
storage, you might need to install these optional dependencies: [s3],
[azure], [gdrive], [gs], [oss], [ssh]. Alternatively, use [all] to
include them all. The command should look like this: pip install "dvc[s3]".
(This example installs the boto3 library along with DVC to support S3 storage.)
Options
--global - save remote configuration to the global config (e.g.
~/.config/dvc/config) instead of .dvc/config.
--system - save remote configuration to the system config (e.g.
/etc/dvc/config) instead of .dvc/config.
--local - modify a local config file
instead of .dvc/config. It is located in .dvc/config.local and is
Git-ignored. This is useful when you need to specify private config options
that you don't want to track and share through Git (credentials,
private locations, etc.).
-d, --default - commands that require a remote (such as dvc pull,
dvc push, dvc fetch) will use this remote by default to upload or
download data (unless their -r option is used).
Click for Amazon S3
By default, DVC expects that your AWS CLI is already configured.
DVC will use the default AWS credentials file to access S3. To override some of
these settings, use the parameters described in dvc remote modify.
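For example, an S3 remote can be added like this (the bucket and path are
placeholders):
$ dvc remote add -d myremote s3://mybucket/dvcstore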
We use the boto3 library to communicate with AWS. The following API methods
are performed:
list_objects_v2, list_objects
head_object
download_file
upload_file
delete_object
copy
So, make sure the credentials you use have S3 permissions covering these
operations (listing, reading, writing, and deleting objects in the bucket).
Click for Microsoft Azure Blob Storage
The connection string contains sensitive user info. Therefore, it's safer to
add it with the --local option, so it's written to a Git-ignored config
file. See dvc remote modify for a full list of Azure storage parameters.
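For example, assuming a container named mycontainer (the remote name and
connection string value are placeholders):
$ dvc remote add -d myremote azure://mycontainer/path
$ dvc remote modify --local myremote connection_string 'mysecret'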
The Azure Blob Storage remote can also be configured globally via environment
variables:
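(The exact variable names can depend on your DVC version; the ones below
follow standard Azure naming, and the values are placeholders.)
$ export AZURE_STORAGE_CONNECTION_STRING='mysecret'
$ export AZURE_STORAGE_CONTAINER_NAME='mycontainer'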
For more information on configuring Azure Storage connection strings, see the
Azure Storage documentation.
connection string - this is the connection string to access your Azure
Storage Account. If you don't already have a storage account, you can create
one following
these instructions.
The connection string can be found in the Access Keys pane of your Storage
Account resource in the Azure portal.
💡 Make sure the value is quoted so it's processed correctly by the console.
container name - this is the top-level container in your Azure Storage
Account under which all the files for this remote will be uploaded. If the
container doesn't already exist, it will be created automatically.
Click for Google Drive
To start using a GDrive remote, first add it with a
valid URL format. Then
use any DVC command that needs to connect to it (e.g. dvc pull or dvc push
once there's tracked data to synchronize). For example:
$ dvc remote add -d myremote gdrive://0AIac4JZqHhKmUk9PDA/dvcstore
$ dvc push  # Assuming there's data to push
Go to the following link in your browser:
https://accounts.google.com/o/oauth2/auth # ... copy this link
Enter verification code: # <- enter resulting code
Note that GDrive remotes are not "trusted" by default. This means that the
verify
parameter is enabled on this type of storage, so DVC recalculates the file
hashes upon download (e.g. dvc pull), to make sure that these haven't been
modified.
Click for Google Cloud Storage
By default, DVC expects that your GCP CLI is already configured.
DVC will use the default GCP key file to access Google Cloud Storage.
To override some of these settings, use the parameters described in
dvc remote modify.
Make sure to run gcloud auth application-default login unless you use
GOOGLE_APPLICATION_CREDENTIALS and/or a service account, or other ways to
authenticate. See the GCP documentation for details.
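For example (the bucket and path are placeholders):
$ dvc remote add -d myremote gs://mybucket/path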
Click for Aliyun OSS
First you need to set up OSS storage on Aliyun Cloud. Then, use an S3-style URL
for OSS storage and configure the endpoint, as in the example below.
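For example (the bucket path and endpoint value are placeholders, and
oss_endpoint is the endpoint parameter described in dvc remote modify):
$ dvc remote add myremote oss://my-bucket/path
$ dvc remote modify myremote oss_endpoint endpoint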
To set the key id, key secret, and endpoint (or any other OSS parameter), use
dvc remote modify. Example usage is shown below. Make sure to use the --local
option to avoid committing your secrets to Git.
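A minimal sketch, assuming oss_key_id and oss_key_secret are the parameter
names for your DVC version (the values are placeholders):
$ dvc remote modify --local myremote oss_key_id mykey
$ dvc remote modify --local myremote oss_key_secret mysecret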
Click for SSH
⚠️ DVC requires both SSH and SFTP access to work with remote SSH locations.
Please check that you are able to connect both ways with tools like ssh and
sftp (GNU/Linux).
Note that the server's SFTP root might differ from its physical root (/).
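For example (the user, host, and path are placeholders):
$ dvc remote add -d myremote ssh://user@example.com/path/to/dir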
Click for HDFS
💡 Using an HDFS cluster as remote storage is also supported via the WebHDFS
API. Read more about it in the WebHDFS notes below.
Both remotes, HDFS and WebHDFS, allow using a Hadoop cluster as a remote
repository. However, HDFS relies on pyarrow which in turn requires libhdfs,
an interface to the Java Hadoop client, that must be installed separately.
Meanwhile, WebHDFS does not have this requirement, as it communicates with the
Hadoop cluster via an HTTP REST API using the Python libraries HdfsCLI and
requests. The latter remote should be preferred by users who seek easier and
more portable setups, at the expense of performance due to the added overhead of
HTTP.
One last note: WebHDFS does require enabling the HTTP REST API in the cluster by
setting the configuration property dfs.webhdfs.enabled to true in
hdfs-site.xml.
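For example, an HDFS or WebHDFS remote can be added like this (the user, host,
and path are placeholders):
$ dvc remote add -d myremote hdfs://user@example.com/path
or, for WebHDFS:
$ dvc remote add -d myremote webhdfs://user@example.com/path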
A "local remote" is a directory in the machine's file system. Not to be confused
with the --local option of dvc remote commands!
While the term may seem contradictory, it doesn't have to be. The "local" part
refers to the type of location where the storage is: another directory in the
same file system. "Remote" is how we call storage for DVC
projects. It's essentially a local backup for data tracked by DVC.
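For example (the directory path is a placeholder):
$ dvc remote add -d localremote /mnt/dvc-storage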