Reproduce complete or partial pipelines by executing commands defined in their stages in the correct order. The commands to be executed are determined by recursively analyzing dependencies and outputs of the target stages.
usage: dvc repro [-h] [-q | -v] [-f] [-s] [-m] [--dry] [-i]
[-p] [-P] [-R] [--no-run-cache] [--force-downstream]
[--no-commit] [--downstream] [--pull]
[targets [targets ...]]
positional arguments:
targets Stage or path to dvc.yaml or .dvc file to reproduce. Using -R,
directories to search for stages can also be given.
dvc repro provides a way to regenerate data pipeline results, by restoring the dependency graph (a DAG) implicitly defined by the stages listed in dvc.yaml. The commands defined in these stages can then be executed in the correct order, reproducing pipeline results.
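To make "the correct order" concrete, here is a minimal sketch of ordering stages by their dependencies with Python's standard graphlib module. It is an illustration only, not DVC's implementation; the stage names (prepare, featurize, train) and the execution_order helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def execution_order(stage_deps):
    """Return stage names so that each stage comes after the stages it depends on.

    stage_deps maps a stage name to the set of stages whose outputs it consumes.
    """
    return list(TopologicalSorter(stage_deps).static_order())


# Hypothetical three-stage chain: train needs featurize, which needs prepare.
pipeline = {
    "prepare": set(),
    "featurize": {"prepare"},
    "train": {"featurize"},
}
print(execution_order(pipeline))  # ['prepare', 'featurize', 'train']
```

A topological sort only exists when the graph has no cycles, which is why the stage graph has to be a DAG.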
Pipeline stages are defined in a dvc.yaml file (either manually or by using dvc run) while initial data dependencies can be registered with dvc add.
This command is similar to Make in software build automation, but DVC captures build requirements (dependencies and outputs) and caches the pipeline's outputs along the way.
💡 For convenience, a Git hook is available to remind you to dvc repro when needed after a git commit. See dvc install for more details.
dvc repro does not run dvc fetch, dvc pull or dvc checkout to get data files, intermediate or final results (except if the --pull option is used).
By default, this command checks all pipeline stages to determine which ones have changed. Then it executes the corresponding commands. Stage outputs are deleted from the workspace before executing the stage commands that produce them.
There are a few ways to restrict what will be regenerated by this command: by specifying stages as targets, or by using --single-item, among other options.
Note that stages without dependencies are considered always changed, so dvc repro always executes them.
It saves all the data files, intermediate or final results into the DVC cache (unless the --no-commit option is used), and updates the hash values of changed dependencies and outputs in the dvc.lock and .dvc files.
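The idea of deciding whether a stage has changed by comparing recorded hash values can be sketched as follows. This is a simplified illustration, not DVC's actual code: it assumes each recorded hash is an MD5 digest of the file's bytes, ignores directories and outputs, and the helper names (file_md5, changed_files, stage_needs_run) are hypothetical.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(recorded):
    """Return the paths whose current hash differs from the recorded one.

    recorded maps a file path to the hash stored on a previous run.
    A missing file counts as changed.
    """
    return [
        path
        for path, old_hash in recorded.items()
        if not Path(path).is_file() or file_md5(path) != old_hash
    ]


def stage_needs_run(recorded_deps):
    """A stage with no dependencies is treated as always changed."""
    return not recorded_deps or bool(changed_files(recorded_deps))
```

For instance, calling stage_needs_run with an empty mapping returns True, mirroring the note that stages without dependencies are always executed.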
Currently, dvc repro is not able to parallelize stage execution automatically. If you need to do this, you can launch dvc repro multiple times manually. For example, let's say a pipeline graph looks something like this:
$ dvc dag
+--------+        +--------+
|   A1   |        |   B1   |
+--------+        +--------+
     *                 *
     *                 *
     *                 *
+--------+        +--------+
|   A2   |        |   B2   |
+--------+        +--------+
         *        *
          **    **
            *  *
       +------------+
       |   train    |
       +------------+
This pipeline consists of two parallel branches (A and B), and the final train stage, where the branches merge. If you run dvc repro at this point, it would reproduce each branch sequentially before train. To reproduce both branches simultaneously, you could run dvc repro A2 and dvc repro B2 at the same time (e.g. in separate terminals). After both finish successfully, you can then run dvc repro train: DVC will know that both branches are already up-to-date and only execute the final stage.
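Instead of separate terminals, the same manual workflow could be scripted. The sketch below is one possible approach, not a DVC feature: it launches dvc repro A2 and dvc repro B2 concurrently through subprocess, waits for both, and then runs dvc repro train. The function names and the injectable run parameter are hypothetical conveniences (the latter makes the logic testable without DVC installed).

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def repro(target, run=subprocess.run):
    """Run `dvc repro <target>`; check=True raises if the command fails."""
    return run(["dvc", "repro", target], check=True)


def repro_branches_then_final(branch_targets, final_target, run=subprocess.run):
    """Reproduce independent branches at the same time, then the merging stage."""
    with ThreadPoolExecutor(max_workers=max(1, len(branch_targets))) as pool:
        # list() waits for every branch and re-raises the first failure, if any.
        list(pool.map(lambda target: repro(target, run), branch_targets))
    # Only reached when all branches finished successfully.
    repro(final_target, run)


# Example call for the pipeline above (requires DVC and the project directory):
# repro_branches_then_final(["A2", "B2"], "train")
```

Threads are enough here because each one only waits on an external process; the actual work happens in the dvc subprocesses.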
-f, --force - reproduce a pipeline, regenerating its results, even if no changes were found. This executes all of the stages by default, but it can be limited with the targets argument, or the -s, -p options.
-s, --single-item - reproduce only a single stage by turning off the recursive search for changed dependencies. Multiple stages are executed (non-recursively) if multiple stage names are given as targets.
-R, --recursive - determines the stages to reproduce by searching each target directory and its subdirectories for stages (in dvc.yaml) to inspect. If there are no directories among the targets, this option is ignored.
--no-commit - do not save outputs to cache. A DVC-file is created, while nothing is added to the cache. (dvc status will report that the file is not in cache.) Use dvc commit when ready to commit outputs with DVC. Useful to avoid caching unnecessary data repeatedly when running multiple experiments.
-m, --metrics - show metrics after reproduction. The target pipelines must have at least one metrics file defined either with the dvc metrics command, or by the -M or -m options of the dvc run command.
--dry - only print the commands that would be executed without actually executing the commands.
-i, --interactive - ask for confirmation before reproducing each stage. The stage is only executed if the user types "y".
-p, --pipeline - reproduce the entire pipelines that the targets belong to. Use dvc dag <target> to show the parent pipeline of a target.
-P, --all-pipelines - reproduce all pipelines for all dvc.yaml files present in the DVC project.
--no-run-cache - execute stage commands even if they have already been run with the same dependencies/outputs/etc. before.
--force-downstream - in cases like ... -> A (changed) -> B -> C it will reproduce A first and then B, even if B was previously executed with the same inputs from A (cached). To be precise, it reproduces all descendants of a changed stage or the stages following the changed stage, even if their direct dependencies did not change.
It can be useful when we have a common dependency among all stages, and want to specify it only once (for stage A here). For example, if we know that all stages (A and below) depend on requirements.txt, we can specify it in A, and omit it in B and C.
Like with the same option on dvc run, this is a way to force-execute stages without changes. This can also be useful for pipelines containing stages that produce non-deterministic (semi-random) outputs, where outputs can vary on each execution, meaning the cache cannot be trusted for such stages.
--downstream - only execute the stages after the given targets in their corresponding pipelines, including the target stages themselves. This option has no effect if targets are not provided.
--pull - pulls dependencies and outputs involved in the stages being reproduced, if they are found in the default remote storage. Note that it checks the local run-cache too (available history of stage runs). Has no effect if combined with --no-run-cache.
-h, --help - prints the usage/help message, and exits.
-q, --quiet - do not write anything to standard output. Exit with 0 if all stages are up to date or if all stages are successfully executed, otherwise exit with 1. The command defined in the stage is free to write output regardless of this flag.
-v, --verbose - displays detailed tracing information.
To get hands-on experience with data science and machine learning pipelines, see Get Started: Data Pipelines.
Let's build and reproduce a simple pipeline. It takes this text.txt file:
dvc
1231
is
3
the
best
And runs a few simple transformations to filter and count numbers:
$ dvc run -n filter -d text.txt -o numbers.txt \
"cat text.txt | egrep '[0-9]+' > numbers.txt"
$ dvc run -n count -d numbers.txt -d process.py -M count.txt \
"python process.py numbers.txt > count.txt"
Where process.py is a script that, for simplicity, just prints the number of lines:
import sys

# Count the lines of the file given as the first command-line argument.
num_lines = 0
with open(sys.argv[1], 'r') as f:
    for line in f:
        num_lines += 1
print(num_lines)
The result of executing these dvc run commands should look like this:
$ tree
.
├── count.txt <---- result: "2"
├── dvc.lock <---- file to record pipeline state
├── dvc.yaml <---- file containing list of stages.
├── numbers.txt <---- intermediate result of the first stage
├── process.py <---- code that implements data transformation
└── text.txt <---- text file to process
You may want to check the contents of dvc.lock and count.txt for later reference.
Ok, now let's run dvc repro:
$ dvc repro
Stage 'filter' didn't change, skipping
Stage 'count' didn't change, skipping
Data and pipelines are up to date.
It makes sense, since we haven't changed any of the dependencies of this pipeline (text.txt and process.py). Now, let's imagine we want to print a description and we add this line to process.py:
...
print('Number of lines:')
print(num_lines)
If we now run dvc repro, we should see this:
$ dvc repro
Stage 'filter' didn't change, skipping
Running stage 'count' with command:
python process.py numbers.txt > count.txt
Updating lock file 'dvc.lock'
You can now check that dvc.lock and count.txt have been updated with the new information: updated dependency/output file hash values, and a new result, respectively.
This example continues the previous one.
The --downstream option, when used with a target stage, allows us to only reproduce results from commands after that specific stage in a pipeline. To demonstrate how it works, let's make a change in text.txt (the input of our first stage, created in the previous example):
...
The answer to universe is 42
- The Hitchhiker's Guide to the Galaxy
Let's say we also want to print the filename in the description, and so we update process.py as:
print(f'Number of lines in {sys.argv[1]}:')
print(num_lines)
Now, using the --downstream option with dvc repro results in the execution of only the target (count) and following stages (none in this case):
$ dvc repro --downstream count
Running stage 'count' with command:
python process.py numbers.txt > count.txt
Updating lock file 'dvc.lock'
The change in text.txt is ignored because that file is a dependency in the filter stage, which wasn't executed by the dvc repro above. This is because filter happens before the target (count) in the pipeline (see dvc dag), as shown below:
$ dvc dag
+--------+
| filter |
+--------+
     *
     *
     *
+-------+
| count |
+-------+
Note that using dvc repro without --downstream in the above example results in the execution of the target (count), and the preceding stages (only 'filter' in this case).
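The stage selection that --downstream performs can be pictured as a graph traversal: start at the target and follow the edges towards the stages that consume its outputs. The sketch below illustrates that idea on the filter/count pipeline from this example; it is not DVC's implementation, and the downstream_stages helper and the dictionary representation of the pipeline are assumptions made for illustration.

```python
from collections import defaultdict


def downstream_stages(stage_deps, target):
    """Return the target plus every stage that depends on it, directly or not.

    stage_deps maps a stage name to the set of stages it depends on.
    """
    # Invert the mapping: for each stage, which stages consume its outputs?
    dependents = defaultdict(set)
    for stage, deps in stage_deps.items():
        for dep in deps:
            dependents[dep].add(stage)

    selected, stack = set(), [target]
    while stack:
        stage = stack.pop()
        if stage not in selected:
            selected.add(stage)
            stack.extend(dependents[stage])
    return selected


# The pipeline from this example: count depends on filter.
pipeline = {"filter": set(), "count": {"filter"}}
print(sorted(downstream_stages(pipeline, "count")))   # ['count']
print(sorted(downstream_stages(pipeline, "filter")))  # ['count', 'filter']
```

With count as the target, filter is never selected, which matches the output above where the change in text.txt was ignored.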