push
Upload tracked files or directories to remote storage based on the current dvc.yaml and .dvc files.
Synopsis
usage: dvc push [-h] [-q | -v] [-j <number>] [-r <name>] [-a] [-T]
[--all-commits] [--glob] [-d] [-R]
[--run-cache | --no-run-cache]
[targets [targets ...]]
positional arguments:
  targets        Limit command scope to these tracked files/directories,
                 .dvc files, or stage names.

Description
The dvc push and dvc pull commands are the means for uploading and
downloading data to and from remote storage (S3, SSH, GCS, etc.). These
commands are similar to git push and git pull, respectively. Data sharing
across environments, and preserving data versions (input datasets, intermediate
results, models, dvc metrics, etc.) remotely are the most common use cases for
these commands.
dvc push uploads data from the cache to a dvc remote.
Note that pushing data does not affect code, dvc.yaml, or .dvc files. Those
should be uploaded with git push. dvc import data is also ignored by this
command.
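For example, a typical sharing step uses both tools side by side:
$ git push          # uploads code, dvc.yaml, and .dvc files to the Git remote
$ dvc push          # uploads the corresponding data to the DVC remote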
The dvc remote used is determined in order, based on:
- the remote fields in the dvc.yaml or .dvc files
- the value passed to the --remote (-r) option via CLI
- the value of the core.remote config option (see dvc remote default)
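For example, to upload to a specific remote regardless of the default (the r2 remote name below is only an illustration):
$ dvc push --remote r2
$ dvc remote default r2    # or make it the default for future pushes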
Without arguments, it uploads the files and directories referenced in the
current workspace (found in all dvc.yaml and .dvc files) that are missing
from the remote. Any targets given to this command limit what to push. It
accepts paths to tracked files or directories (including paths inside tracked
directories), .dvc files, and stage names (found in dvc.yaml).
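For example, each of the following is a valid way to limit the scope of a push (the target names are illustrative, not part of any particular project):
$ dvc push data/features        # a tracked directory
$ dvc push model.pkl.dvc        # a .dvc file
$ dvc push train                # a stage name from dvc.yaml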
The --all-branches, --all-tags, and --all-commits options enable pushing
files/dirs referenced in multiple Git commits.
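For instance, to upload data referenced in every branch and tag in addition to the workspace (these flags are described in the Options section below):
$ dvc push --all-branches --all-tags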
💡 For convenience, a Git hook is available to automate running dvc push after
git push. See dvc install for more details.
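A minimal sketch, assuming the hooks have not been installed yet:
$ dvc install
After this, dvc push runs automatically as part of git push.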
For all outputs referenced in each target, DVC finds the
corresponding files and directories in the cache (identified by
hash values saved in dvc.lock and .dvc files). DVC then gathers a list of
files missing from the remote storage, and uploads them.
Note that the dvc status -c command can list files tracked by DVC that are new
in the cache (compared to the default remote). It can be used to see what files
dvc push would upload.
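For example, a quick check before pushing:
$ dvc status -c    # lists cached data missing from the default remote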
Options
- -a, --all-branches - determines the files to upload by examining dvc.yaml
  and .dvc metafiles in all Git branches, as well as in the workspace. It's
  useful if branches are used to track experiments. Note that this can be
  combined with -T below, for example using the -aT flags.
- -T, --all-tags - examines metafiles in all Git tags, as well as in the
  workspace. Useful if tags are used to mark certain versions of an experiment
  or project. Note that this can be combined with -a above, for example using
  the -aT flags.
- -A, --all-commits - examines metafiles in all Git commits, as well as in the
  workspace. This uploads tracked data for the entire commit history of the
  project.
- -d, --with-deps - only meaningful when specifying targets. This determines
  files to push by resolving all dependencies of the targets: DVC searches
  backward from the targets in the corresponding pipelines. This will not push
  files referenced in later stages than the targets.
- -R, --recursive - determines the files to push by searching each target
  directory and its subdirectories for dvc.yaml and .dvc files to inspect. If
  there are no directories among the targets, this option has no effect.
- -r <name>, --remote <name> - name of the dvc remote to push to (see
  dvc remote list).
- --run-cache, --no-run-cache - whether to upload all available history of
  stage runs to the dvc remote. Default is --no-run-cache.
- -j <number>, --jobs <number> - parallelism level for DVC to upload data to
  remote storage. The default value is 4 * cpu_count(). Note that the default
  value can be set using the jobs config option with dvc remote modify. Using
  more jobs may speed up the operation.
- --glob - allows pushing files and directories that match the pattern
  specified in targets. Shell style wildcards supported: *, ?, [seq], [!seq],
  and **. See the combined usage sketch after this list.
- -h, --help - prints the usage/help message, and exits.
- -q, --quiet - do not write anything to standard output. Exit with 0 if no
  problems arise, otherwise 1.
- -v, --verbose - displays detailed tracing information.
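A few usage sketches combining these options (the remote name r1 comes from the Examples section below; paths and patterns are illustrative):
$ dvc push -r r1 -j 8                # push to remote r1 with 8 parallel jobs
$ dvc push --run-cache               # also upload the history of stage runs
$ dvc push --glob 'data/*.dvc'       # push targets matching a wildcard pattern
$ dvc push -R data/                  # search data/ recursively for metafiles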
Examples
To use dvc push (without options), a dvc remote default must be defined (see
also dvc remote add --default). Let's see an SSH remote example:
$ dvc remote add --default r1 \
  ssh://user@example.com/project/data/cache
For existing projects, remotes are usually already set up. You can
use dvc remote list to check them:
$ dvc remote list
r1 ssh://user@example.com/project/data/cache (default)
r2 ssh://user@example.com/other/storage
Push entire data cache from the current workspace to the default remote:
$ dvc push
Push files related to a specific .dvc file only:
$ dvc push data.zip.dvc
Example: With dependencies
Demonstrating the --with-deps option requires a larger example. First, assume
a pipeline has been set up with these stages: clean-posts, featurize,
test-posts, and matrix-train.
Imagine the project has been modified such that the outputs of some of these stages need to be uploaded to remote storage.
$ dvc status --cloud
...
new: data/model.p
new: data/matrix-test.p
new: data/matrix-train.p
One could do a simple dvc push to share all the data, but what if you only
want to upload part of the data?
$ dvc push --with-deps test-posts
# Do some work based on the partial update...
# Then push the rest of the data:
$ dvc push --with-deps matrix-train
$ dvc status --cloud
Cache and remote 'r1' are in sync.
We specified a stage in the middle of this pipeline (test-posts) with the
first push. --with-deps caused DVC to start with that stage, and search
backward through the pipeline for data files to upload.
Because the matrix-train stage occurs later (it's the last one), its data was
not pushed. However, we then specified it in the second push, so all remaining
data was uploaded.
Finally, we used dvc status to double check that all data had been uploaded.
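A roughly equivalent first push can be sketched by naming the upstream stages explicitly instead of using --with-deps (assuming the dependency chain shown above):
$ dvc push clean-posts featurize test-posts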
Example: What happens in the cache?
Let's take a detailed look at what happens to the cache directory as you run
an experiment locally and push data to remote storage. To set up the example,
consider a project that has some code, data, and a dvc remote already
configured.
Some work has been performed in the workspace, and new data is ready for
uploading to the remote. dvc status --cloud will list several files in new
state. We can see exactly what that means by looking in the project's
cache:
$ tree .dvc/cache/files/md5
.dvc/cache/files/md5
├── 02
│ └── 423d88d184649a7157a64f28af5a73
├── 0b
│ └── d48000c6a4e359f4b81285abf059b5
├── 38
│ └── 64e70211d3bdb367ad1432bfc14c1f.dir
├── 4a
│ └── 8c47036c79c01522e79ac0f518d0f7
├── 6c
│ └── 3074754e3a9b563b62c8f1a38670dc
├── 77
│ └── bea77463abe2b7c6b4d13f00d2c7b4
└── 88
└── c3db1c257136090dbb4a7ddf31e678.dir
10 directories, 9 files
$ tree ~/vault/recursive/files/md5
~/vault/recursive/files/md5
├── 0b
│ └── d48000c6a4e359f4b81285abf059b5
├── 4a
│ └── 8c47036c79c01522e79ac0f518d0f7
└── 88
└── c3db1c257136090dbb4a7ddf31e678.dir
5 directories, 5 files
The directory .dvc/cache is the local cache, while ~/vault/recursive is a
"local remote" (another directory in the local file system). This listing shows
the cache having more files in it than the remote – which is what the new
state means.
Refer to Structure of cache directory for more info.
Next we can copy the remaining data from the cache to the remote using
dvc push, and then list the remote again:
$ dvc push
$ tree ~/vault/recursive
~/vault/recursive
├── 02
│ └── 423d88d184649a7157a64f28af5a73
├── 0b
│ └── d48000c6a4e359f4b81285abf059b5
├── 38
│ └── 64e70211d3bdb367ad1432bfc14c1f.dir
├── 4a
│ └── 8c47036c79c01522e79ac0f518d0f7
├── 6c
│ └── 3074754e3a9b563b62c8f1a38670dc
├── 77
│ └── bea77463abe2b7c6b4d13f00d2c7b4
└── 88
└── c3db1c257136090dbb4a7ddf31e678.dir
10 directories, 10 files
$ dvc status --cloud
Cache and remote 'r1' are in sync.
Running dvc status --cloud again confirms that indeed there are no more
files to push to remote storage.