Many general-use tools are available for synchronizing data to and from cloud storage; some widely used options are rsync, rclone, and aws s3 sync, each with its own advantages and disadvantages. Likewise, in DVC we provide the ability to efficiently sync versioned datasets to and from cloud storage through a git-like push and pull interface.
Given that transferring data over a network to and from cloud storage is an inherently slow operation, it's important for data sync tools to optimize performance wherever possible. While the data transfer itself may be the most apparent performance bottleneck in the data sync process, here we'll cover a less obvious performance issue: How to determine which files to upload and download.
In this post, we'll outline the general methods used to solve this problem, and investigate each method's effects on performance by comparing benchmark results from DVC and rclone. We'll then conclude with a more in-depth explanation of new optimizations made in DVC 1.0 which enabled us to outperform both older DVC releases as well as general data sync tools (like rclone).
Note: "Cloud storage" and "remote storage" will be used interchangeably throughout this post. When discussing dataset size in this post, we mean size in terms of total number of files in a dataset, rather than the total amount of file data (bytes).
At the start of any data sync operation, we must first complete the following steps in order to determine which files to upload and download between the local machine and cloud storage:

1. Determine the status of the files on the local machine.
2. Determine the status of the corresponding files in cloud storage.
3. Compare the two and collect the set of files that differ.
Once this difference in file status has been determined, the necessary files can be copied to or from cloud storage as needed ("file status" meaning file existence as well as other potential status information, such as modification time). While this may seem like a trivial problem, the second step is actually a significant potential performance bottleneck.
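As a minimal illustration of the comparison step (the file names here are made up, and the local and remote listings are assumed to have already been collected):

```python
# The comparison step in miniature: given the local and remote listings as
# sets of file names, a set difference tells us what needs to be transferred.
local_files = {"a.csv", "b.csv", "c.csv"}
remote_files = {"a.csv", "c.csv", "old.csv"}

files_to_upload = local_files - remote_files      # present locally, missing remotely
files_to_download = remote_files - local_files    # present remotely, missing locally

print(files_to_upload)    # {'b.csv'}
print(files_to_download)  # {'old.csv'}
```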
In general, cloud storage APIs provide two possible ways to determine what files are present in cloud storage, and it's up to the data sync tool to select which method to use. Even for an operation as simple as synchronizing a single local file to cloud storage, choosing incorrectly between these two options could actually mean the difference between that "simple" operation taking several hours to complete instead of just a few seconds.
Note: The term "file status query" will be used throughout this post when referring to this type of cloud storage API query.
The first query method is to individually check whether or not particular files exist in cloud storage, one at a time.
Ex: The S3 API provides the `HeadObject` method.
When using this method, performance depends on the number of files being queried: a single file takes a single API request, while 1 million files take 1 million API requests. In this case, the overall amount of time it will take to complete the full operation will scale with the number of files to query.
One particular advantage to using this method is that it can be easily parallelized. Overall runtime can be improved by making simultaneous API requests to query for multiple files at once.
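As a rough sketch of what method 1 can look like against the S3 API using boto3 (the bucket name and keys here are hypothetical, and error handling is reduced to the minimum):

```python
# Query method 1: one HeadObject call per file, fanned out over a thread pool.
from concurrent.futures import ThreadPoolExecutor

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def exists(bucket, key):
    """Return True if the object exists, using a single HeadObject request."""
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as exc:
        if exc.response["Error"]["Code"] == "404":
            return False
        raise

keys = ["00/411460f7c92d2124a67ea0f4cb5f85", "00/6f52e9102a8d3be2fe5614f42ba989"]
with ThreadPoolExecutor(max_workers=16) as pool:
    status = dict(zip(keys, pool.map(lambda key: exists("my-bucket", key), keys)))
print(status)
```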
The second query method is to request the full listing of files present in cloud storage, all at once.
Ex: The S3 API provides the `ListObjects` method.
With this method, the overall amount of time it will take to complete the full operation scales with the total number of files in cloud storage, rather than the number of files we wish to query.
It's important to note that when using this method, cloud APIs will only return a certain number of files at a time (the amount returned varies depending on the API). This means that for an API which returns 1000 files at a time (such as S3), retrieving the full listing of a remote containing 1000 files or fewer would only take a single API request, while listing a remote which contains 1 million files would take 1000 API requests.
Another important note is that API calls for this method must be made sequentially and can't be easily parallelized. Using S3 as an example, the first API call would return files 0 through 999, the next call would return files 1000 through 1999, and so on. However, each page must be requested using a continuation marker returned by the previous response, so the calls have to be made one after another until the full list has been retrieved. This means we can't make two simultaneous requests for "files 0 through 999" and "files 1000 through 1999".
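A comparable sketch of method 2 with boto3 (again with a hypothetical bucket name); the paginator walks the pages sequentially, each request carrying the continuation token from the previous one:

```python
# Query method 2: page through the full remote listing with ListObjectsV2.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

remote_files = set()
for page in paginator.paginate(Bucket="my-bucket"):  # up to 1000 keys per page
    for obj in page.get("Contents", []):
        remote_files.add(obj["Key"])

print(f"{len(remote_files)} objects listed in the remote")
```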
Consider an example scenario where a dataset being synchronized contains 100 local files, and we need to check which of those files exist in cloud storage. For the purposes of this example, we'll also assume that all individual API calls take the same amount of time to complete, and that we are not running any tasks in parallel. Additionally, let's say that our example cloud storage API returns 1000 files per page when using query method 2.
In this situation, we know that the first query method will always take a fixed number of API calls to complete (100). The number of API calls required for the second query method depends on the total number of files that already exist in the remote.
Since we know that the API returns 1000 results per call, we can say that if the remote contains fewer than 1000 * 100 = 100,000 files, fetching the full remote listing (method 2) will be faster than checking each file individually, since it will take fewer than 100 API calls to complete. In the case that the remote contains 1000 files or fewer, method 2 would only require a single API call (potentially outperforming method 1 by 100x).
However, if the remote contains more than this 100,000-file threshold, method 1 will be faster than method 2, with the difference in performance between the two methods scaling linearly as the remote size increases.
Total API calls required to query 100 local files from S3
This example illustrates an important point. Given a (relatively) small set of files to query and a sufficiently large remote, method 1 will always be faster than method 2.
Thinking about it from a different perspective, what happens if we have the ability to reduce the size of a (relatively) large query set?
Once our query set is smaller than a certain threshold, we'll be able to use method 1 rather than method 2. On top of that, we know that the runtime of method 1 scales with query set size. In simple terms, by reducing the size of our query set as much as possible, we can also improve performance.
So, as we have shown, choosing the optimal method depends on both:

- the number of files we need to query (the query set size), and
- the total number of files already present in the remote.
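In code, that decision could look something like the following sketch (the 1000-results-per-page figure matches the S3 example above and is an assumption for other backends; this is not DVC's actual implementation):

```python
# Choose between per-file queries (method 1) and a full listing (method 2)
# by comparing the number of API calls each would require.
def choose_query_method(query_set_size, estimated_remote_size, page_size=1000):
    method1_calls = query_set_size                           # one request per file
    method2_calls = -(-estimated_remote_size // page_size)   # ceil(remote / page size)
    return "individual queries" if method1_calls < method2_calls else "full listing"

print(choose_query_method(100, 1_000))        # full listing (1 call vs 100)
print(choose_query_method(100, 1_000_000))    # individual queries (100 calls vs 1000)
```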
Note: In terms of real world performance, there are other considerations that DVC must account for, such as different API calls taking different amounts of time to complete, parallelization, and the amount of time it takes to run list comparison operations in Python.
Now let's take a look at some real-world numbers to examine the impact selecting one query method or the other has on data sync performance in DVC and rclone. Both tools can utilize either potential query method, with some differences:

- rclone requires the user to specify the `--no-traverse` option to select the first query method; otherwise rclone will default to the second method in most situations (with the exception being cases with very small query set sizes).
- In DVC releases prior to 1.0, the default query method depends on the remote type and can be overridden with the `no_traverse` configuration option. DVC 1.0 selects the query method automatically, as described later in this post.

In the following scenarios, we are simulating the typical DVC use case in which a user tracks a local directory containing some number of files using DVC, and then synchronizes the DVC-tracked directory to cloud storage (S3 in these examples) using either DVC or rclone. The user would then continually repeat a process of:

1. modifying or adding some (typically small) subset of files in the tracked directory, and
2. synchronizing the updated directory to cloud storage.
Keep in mind that for DVC's purposes, we are most interested in optimizing performance for scenarios which are normally very slow to complete. Improving an operation which previously took several hours so that it finishes in a few minutes will have a much greater impact for our users than shaving a few seconds off of an operation which previously took under a minute to run.
Note: For these benchmarks we are only interested in the amount of time required to determine file status for this one-way push operation. So the runtimes in each case are for status queries only (using `dvc status -c` in DVC and `rclone copy --dry-run` in rclone). No file data was transferred to or from S3 in any of these scenarios.
Benchmark command usage:
$ time dvc status -c -r remote
$ time rclone copy --dry-run --progress --exclude "**/**.unpacked/" .dvc/cache remote:...
rclone was run with `--no-traverse` where indicated.
Benchmark platform: Python 3.7, macOS Catalina, DVC installed from pip, dual-core 3.1 GHz i7 CPU
Local directory w/100k total files, S3 bucket w/1M total files (1 file modified since last sync)
The previous chart contains benchmarks for a scenario in which the local directory contains 100,000 files, and the S3 bucket contains approximately 1 million files. One file in the local directory has been modified since the directory was last synchronized with the S3 bucket. This scenario tests the length of time it takes DVC or rclone to determine (and report to the user) that only the one modified file is missing from the S3 bucket and needs to be uploaded.
This illustrates DVC's performance advantage over rclone with regard to synchronizing iterations of a versioned dataset over time, as well as the DVC 1.0 performance improvements over prior releases.
Note: In these examples, the local file count refers to the number of files inside the original tracked directory. The number of files present in the DVC cache will differ slightly, since the DVC cache will contain an additional file representing the tracked directory itself, but the end result is that DVC and rclone will both need to query for the same number of files (i.e. the number of files in the cache directory).
Local directory w/1 file, S3 bucket w/1M total files
In this example, we are testing a simple scenario in which the local directory contains 1 file and the S3 bucket contains approximately 1 million files.
In this case, in DVC 0.91 we essentially get lucky that our default choice for S3 happens to be the first query method. If we ran this same scenario with a Google Drive remote (where the 0.91 default choice is the second query method) instead of S3, we would see a very long runtime for DVC 0.91.
Also note that here, rclone is able to determine that with a single local file to query, it should use the first query method instead of defaulting to the second method.
Note: We are unsure of the reason for the rclone runtime difference with and without `--no-traverse` for this scenario, but rclone does do some computation to determine whether or not to default to no-traverse behavior for small query sets. It's likely that specifying `--no-traverse` allows rclone to skip that overhead entirely in this case.
Local directory w/1M files, Empty S3 bucket
Note: DVC 0.91 and rclone with `--no-traverse` both take multiple hours to complete in this scenario and continue off of the chart.
In this example, we are testing a simple scenario in which the local directory contains approximately 1 million files and the S3 bucket is empty.
The difference in rclone runtime with or without `--no-traverse` in this scenario shows the performance impact of selecting the optimal query method for a given situation.
This scenario also shows that rclone can outperform DVC with regard to collecting the list of local files during certain types of sync operations. In this case, rclone simply iterates over whatever files exist in the local directory without doing any additional steps, since our benchmark uses a one-way `rclone copy` operation.
However, in DVC, we have some extra overhead for this step, since we collect the list of files expected to be present in the current DVC repository revision, and then verify that those files are present locally. We would then check to see if any missing files are available to be downloaded from remote storage.
It should also be noted that in common use cases where the number of files in cloud storage continues to grow over time (such as in backup solutions or in dataset versioning), rclone's advantage in this case would only apply for this initial sync operation. Once the local dataset has been pushed to cloud storage, DVC's advantage in synchronizing modifications to existing datasets would become more apparent (as shown in the first example).
So I hope that by now you're curious about DVC, and are planning on using it (or maybe even already are using it 😀) to sync your files. For those who are wondering where the magic actually happens, let's dive a bit deeper into how DVC stores files, and how we were able to leverage that storage format to implement query performance optimizations in DVC 1.0. (This will also be a useful primer for anyone interested in learning about DVC internals in general.)
Previously, we have established that:

- choosing the optimal query method depends on the size of the query set relative to the total number of files in the remote, and
- reducing the size of the query set improves performance regardless of which query method is used.

In this section, we'll cover the ways in which DVC 1.0 has directly addressed both of these key points:

- by estimating the size of a remote in order to choose the optimal query method, and
- by indexing already-pushed data in order to shrink the query set as much as possible.
Before continuing, it will be helpful for the reader to understand a few things about the DVC cache and remote storage structure.
.
├── 00
│ ├── 411460f7c92d2124a67ea0f4cb5f85
│ ├── 6f52e9102a8d3be2fe5614f42ba989
│ └── ...
├── 01
├── 02
├── 03
├── ...
└── ff
Example DVC cache/remote structure

Files in the DVC cache (and in DVC remote storage) are stored according to their content hash: the first two characters of the hash are used as a subdirectory name, and the remaining characters as the file name. Since hash values are effectively uniformly distributed, the files in a DVC cache or remote end up evenly distributed across the 256 possible (`00` through `ff`) subdirectories.
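As a rough sketch of that layout (DVC uses MD5 content hashes for this; real DVC also handles directory entries, chunked hashing, and other details omitted here):

```python
# Map a file to its location in the cache/remote layout shown above:
# first two hash characters -> subdirectory, the rest -> file name.
import hashlib
from pathlib import Path

def cache_path(filename, cache_dir=".dvc/cache"):
    md5 = hashlib.md5(Path(filename).read_bytes()).hexdigest()
    return Path(cache_dir) / md5[:2] / md5[2:]

# A file hashing to 00411460f7c92d2124a67ea0f4cb5f85 would be stored as
# .dvc/cache/00/411460f7c92d2124a67ea0f4cb5f85
```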
In DVC, the number of files we need to query is just the number of files for a given project revision. So, as long as we can estimate the number of files in a DVC remote, we can programmatically choose the optimal query method for a remote operation.
In DVC 1.0, we accomplish this by taking advantage of the DVC remote structure. The over/under remote size threshold only depends on the number of files being queried (i.e. the number of files in our DVC versioned dataset). And as we have already established, a DVC remote will be evenly distributed. Therefore, if we know the number of files contained in a subset of the remote, we can then estimate the number of files contained in the entire remote.
For example, if we know that the remote subdirectory `00/` contains 10 files, we can estimate that the remote contains roughly 256 * 10 = 2,560 files in total. So, by requesting the listing of one subdirectory at a time (rather than the full remote) via the cloud storage API, we can calculate a running estimate of the total remote size. If the running estimated total size goes over the threshold value, DVC will stop fetching the contents of the remote subdirectory and switch to querying each file in our dataset individually. If DVC reaches the end of the subdirectory without the estimated size going over the threshold, it will continue to fetch the full listing for the rest of the remote.
By estimating remote size in DVC 1.0, we can ensure that we always use the optimal method when querying remote status.
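To make the estimation concrete, here is a simplified sketch of the idea against the S3 API (not DVC's actual code; the bucket name is hypothetical and the 1000-per-page figure is the S3 value used throughout this post):

```python
# List a single hash-prefix subdirectory (one of 256) and extrapolate the
# total remote size, bailing out to per-file queries as soon as the running
# estimate crosses the threshold.
import boto3

def estimate_and_choose(bucket, query_set_size, page_size=1000):
    threshold = query_set_size * page_size  # largest remote size where listing wins
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    seen = 0
    for page in paginator.paginate(Bucket=bucket, Prefix="00/"):
        seen += len(page.get("Contents", []))
        if seen * 256 > threshold:          # running estimate of the full remote
            return "individual queries"
    return "full listing"
```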
A common DVC use case is versioning the contents of a large directory. As the contents of the directory changes over time, DVC will be used to push each updated version of the directory into cloud storage. In many cases, only a small number of files within that directory will be modified between project iterations.
So after the first version of a project is pushed into cloud storage, for subsequent versions, only the small subset of changed files actually needs to be synchronized with cloud storage.
Consider a case where a user has an existing directory with 1 million files which has been versioned and pushed to a remote with DVC. In the next iteration of the project, only a single file in the directory has been modified. We can obviously see that everything other than the one modified file will already exist in cloud storage. Ideally, we should only need to query for the single modified file.
However, in DVC releases prior to 1.0, DVC would always need to query for every file in the directory, regardless of whether or not a given file had changed since the last time it was pushed to remote storage.
But in DVC 1.0, we now keep an index of directories which have already been versioned and pushed into remote storage. By referencing this index, DVC will "remember" which files already exist in a remote, and will remove them from our query set at the start of a data sync operation (before we choose a query method, and before we make any cloud storage API requests).
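Conceptually, the reduction is just a set difference performed before any API calls are made (a toy sketch, not DVC's actual index implementation; the hashes are placeholders):

```python
# Drop everything the index already knows to be in remote storage, and only
# query cloud storage for whatever remains.
current_version = {"aa11", "bb22", "cc33", "dd44"}   # content hashes in this revision
already_pushed = {"aa11", "bb22", "cc33"}            # hashes recorded in the local index

query_set = current_version - already_pushed
print(query_set)  # {'dd44'} -- only the changed file needs a status query
```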
Note: This optimization only applies to DVC versioned directories. Individually versioned files (including those added with `dvc add -R`) are not indexed in DVC 1.0, and will always be queried during remote operations.
By utilizing a storage structure that allows for optimized status queries, DVC makes data synchronization incredibly fast. Coupled with the ability to quickly identify which files remain unchanged between sync operations, DVC 1.0 is a powerful data management tool.
Whether you are upgrading from a prior DVC release, or trying DVC for the first time, we hope that all of our users are able to benefit from these new optimizations. DVC performance is an important issue, and our team is looking forward to working on further performance optimizations in the future - across all areas in DVC, not just remotes.
As always, if you have any questions, comments or suggestions regarding DVC performance, please feel free to connect with the DVC community on Discourse, Discord and GitHub.