fetch
Download files or directories from remote storage to the cache.
Synopsis
usage: dvc fetch [-h] [-q | -v] [-j <number>] [-r <name>] [-a] [-T]
                 [--all-commits] [-d] [-R] [--run-cache | --no-run-cache]
                 [--max-size <bytes>] [--type {metrics,plots}]
                 [targets [targets ...]]
positional arguments:
  targets        Limit command scope to these tracked files/directories,
                 .dvc files, or stage names.
Description
Downloads tracked files and directories from a dvc remote into the
cache. Unlike dvc pull, it does not place them in the workspace. This makes
the tracked data available for linking (or copying) into the workspace
(see dvc checkout).
Note that dvc pull includes fetching.
Tracked files              Commands
----------------  ---------------------------------
remote storage
     +
     |         +------------+
     | - - - - | dvc fetch  | ++
     v         +------------+   +   +----------+
project's cache                  ++ | dvc pull |
     +         +------------+   +   +----------+
     | - - - - |dvc checkout| ++
     |         +------------+
     v
workspace
Here are some scenarios in which dvc fetch is useful, instead of pulling:
- After checking out a fresh copy of a DVC repository, to get DVC-tracked data from multiple project branches or tags into your machine.
- To use comparison commands across different Git commits, for example dvc metrics show with its --all-branches option, or dvc plots diff.
- If you want to avoid linking files from the cache, or keep the workspace clean for any other reason.
Without arguments, dvc fetch downloads all files and directories referenced in
the current workspace (found in dvc.yaml and .dvc files) that are missing from
the cache. Any targets given to this command limit what to fetch. It
accepts paths to tracked files or directories (including paths inside tracked
directories), .dvc files, and stage names (found in dvc.yaml).
The --all-branches, --all-tags, and --all-commits options enable fetching
files/dirs referenced in multiple Git commits.
The dvc remote used is determined in order, based on:
- the remote fields in the dvc.yaml or .dvc files.
- the value passed to the --remote option via CLI.
- the value of the core.remote config option (see dvc remote default).
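The lookup order above can be sketched as a simple "first configured value wins" function. This is only an illustrative model, not DVC's actual implementation: the function name `resolve_remote` and its parameters are hypothetical, and it assumes that a missing setting is represented as `None`.

```python
from typing import Optional


def resolve_remote(
    output_remote: Optional[str],   # `remote` field in dvc.yaml or a .dvc file
    cli_remote: Optional[str],      # value passed with --remote on the CLI
    default_remote: Optional[str],  # value of the core.remote config option
) -> Optional[str]:
    """Return the first configured remote, checking sources in the documented order."""
    for candidate in (output_remote, cli_remote, default_remote):
        if candidate:
            return candidate
    return None  # no remote configured anywhere


# Hypothetical remote names, for illustration only.
print(resolve_remote(None, "backup", "storage"))  # prints "backup"
print(resolve_remote(None, None, "storage"))      # prints "storage"
```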
Options
- -r <name>, --remote <name> - name of the dvc remote to fetch from (see dvc remote list).
- -d, --with-deps - only meaningful when specifying targets. This determines files to download by resolving all dependencies of the targets: DVC searches backward from the targets in the corresponding pipelines. This will not fetch files referenced in stages later than the targets.
- -R, --recursive - determines the files to fetch by searching each target directory and its subdirectories for dvc.yaml and .dvc files to inspect. If there are no directories among the targets, this option has no effect.
- --run-cache, --no-run-cache - whether to download all available history of stage runs from the remote repository. See the same option in dvc push. Default is --no-run-cache.
- -j <number>, --jobs <number> - parallelism level for DVC to download data from remote storage. The default value is 4 * cpu_count(). Note that the default value can be set using the jobs config option with dvc remote modify. Using more jobs may speed up the operation.
- -a, --all-branches - fetch cache for all Git branches, as well as for the workspace. This means DVC may download files needed to reproduce different versions of a .dvc file, not just the ones currently in the workspace. Note that this can be combined with -T below, for example using the -aT flags.
- -T, --all-tags - fetch cache for all Git tags, as well as for the workspace. Note that this can be combined with -a above, for example using the -aT flags.
- -A, --all-commits - fetch cache for all Git commits, as well as for the workspace. This downloads tracked data for the entire commit history of the project.
- --max-size <bytes> - fetch only data files/directories that are each below the specified size (in bytes). Note that the size is determined by the corresponding size field in the .dvc or dvc.lock file. This means that a DVC-tracked directory above the limit is skipped as a whole, even if some files or subdirectories inside it are smaller.
- --type <type> - fetch data files/directories that are of a particular type. Currently only metrics and plots are supported.
- -h, --help - prints the usage/help message, and exits.
- -q, --quiet - do not write anything to standard output. Exit with 0 if no problems arise, otherwise 1.
- -v, --verbose - displays detailed tracing information.
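To illustrate why --max-size can skip an entire tracked directory, here is a minimal sketch of filtering by the recorded size field. It is not DVC's code: the function name, the entry layout, and the sizes are hypothetical, and the strict "less than" comparison is an assumption based on the word "below" in the option description.

```python
def select_below_max_size(entries, max_size):
    """Keep only tracked outputs whose recorded size is below max_size (bytes).

    Each entry stands for one output as recorded in a .dvc or dvc.lock file.
    A tracked directory has a single size covering all of its contents, so it
    is kept or skipped as a whole.
    """
    return [entry["path"] for entry in entries if entry["size"] < max_size]


# Hypothetical sizes, for illustration only.
tracked = [
    {"path": "data/features", "size": 5_000_000},  # whole directory
    {"path": "model.pkl", "size": 1_000_000},
]

print(select_below_max_size(tracked, 2_000_000))  # prints ['model.pkl']
```

In this model, data/features is skipped even if it contains small files, because only the directory-level size is compared.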
Examples
Let's employ a simple workspace with some data, code, ML models, and
pipeline stages, such as the DVC project created for Get Started.
Then we can see what dvc fetch does in different scenarios.
Start by cloning our example repo if you don't already have it:
$ git clone https://github.com/iterative/example-get-started
$ cd example-get-started
The workspace looks like this:
.
├── data
│   └── data.xml.dvc
├── dvc.lock
├── dvc.yaml
├── params.yaml
├── prc.json
├── scores.json
└── src
    └── <code files here>
This project comes with a predefined HTTP remote storage. We can now just run
dvc fetch to download the most recent model.pkl, data.xml, and other
DVC-tracked files into our local cache.
$ dvc status --cloud
...
deleted: data/features/train.pkl
deleted: model.pkl
$ dvc fetch
$ tree .dvc/cache/files/md5
.dvc/cache/files/md5
├── 20
│   └── b786b6e6f80e2b3fcf17827ad18597.dir
├── c8
│   ├── d307aa005d6974a8525550956d5fb3
│   └── ...
...
dvc status --cloud compares the cache contents against the default remote. Refer to dvc status.
Note that the .dvc/cache directory was created and populated.
All the data needed in this version of the project is now in your cache: File
names 20b786b... and c8d307a... correspond to the data/features/ directory
and model.pkl file, respectively.
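The relationship between a hash value and its cache file name can be sketched as follows. This only mirrors the layout visible in the tree output above (a two-character subdirectory followed by the remaining characters, under files/md5). The helper name `cache_path` is hypothetical, and the layout may differ in other DVC versions.

```python
from pathlib import PurePosixPath


def cache_path(md5: str, cache_root: str = ".dvc/cache") -> PurePosixPath:
    """Map a hash value to the cache file location shown in the tree output.

    The first two characters become a subdirectory name and the rest becomes
    the file name. Directory entries keep their ".dir" suffix.
    """
    return PurePosixPath(cache_root) / "files" / "md5" / md5[:2] / md5[2:]


print(cache_path("20b786b6e6f80e2b3fcf17827ad18597.dir"))
# prints ".dvc/cache/files/md5/20/b786b6e6f80e2b3fcf17827ad18597.dir"
```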
To link these files to the workspace:
$ dvc checkout
Example: Specific files or directories
If you tried the previous example, delete the .dvc/cache directory first (e.g. rm -Rf .dvc/cache) to follow this one.
dvc fetch only downloads the tracked data corresponding to any given
targets:
$ dvc fetch prepare
$ tree .dvc/cache/files/md5
.dvc/cache/files/md5
├── 20
│   └── b786b6e6f80e2b3fcf17827ad18597.dir
├── 32
│   └── b715ef0d71ff4c9e61f55b09c15e75
└── 6f
    └── 597d341ceb7d8fbbe88859a892ef81
Cache entries for the data/prepared directory (output of the
prepare target), as well as the actual test.tsv and train.tsv files, were
downloaded. Their hash values are shown above.
Note that you can fetch data within tracked directories. For example, the
featurize stage has the entire data/features directory as output, but we can
get just this file:
$ dvc fetch data/features/test.pkl
If you check .dvc/cache again, you'll see a couple more files were downloaded:
the cache entries for the data/features directory, and
data/features/test.pkl itself.
Example: With dependencies
After following the previous example (Specific files or directories), only the
files associated with the prepare stage have been fetched. Several
dependencies/outputs of other pipeline stages are still missing from the cache:
$ dvc status -c
...
deleted: data/features/test.pkl
deleted: data/features/train.pkl
deleted: model.pkl
One could do a simple dvc fetch to get all the data, but what if you only want
to retrieve the data up to our third stage, train? We can use the
--with-deps (or -d) option:
$ dvc fetch --with-deps train
$ tree .dvc/cache/files/md5
.dvc/cache/files/md5
├── 20
│   └── b786b6e6f80e2b3fcf17827ad18597.dir
├── c8
│   ├── 43577f9da31eab5ddd3a2cf1465f9b
│   └── d307aa005d6974a8525550956d5fb3
├── 32
│   └── b715ef0d71ff4c9e61f55b09c15e75
├── 54
│   └── c0f3ef1f379563e0b9ba4accae6807
├── 6f
│   └── 597d341ceb7d8fbbe88859a892ef81
├── a1
│   └── 414b22382ffbb76a153ab1f0d69241.dir
└── a3
    └── 04afb96060aad90176268345e10355
Fetching using --with-deps starts with the target stage (train) and searches
backwards through its pipeline for data to download into the project's cache.
All the data for the second and third stages (featurize and train) has now
been downloaded to the cache. We could now use dvc checkout to get the data
files needed to reproduce this pipeline up to the third stage into the workspace
(with dvc repro train).
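The backward search that --with-deps performs can be pictured as a graph walk from the target stage toward its upstream stages. The sketch below is a simplified model, not DVC's implementation: the `UPSTREAM` mapping is a hypothetical description of this example pipeline's stage order, and `stages_with_deps` is an illustrative helper name.

```python
from collections import deque

# Hypothetical stage graph: each stage maps to the stages it depends on.
UPSTREAM = {
    "prepare": [],
    "featurize": ["prepare"],
    "train": ["featurize"],
    "evaluate": ["train", "featurize"],
}


def stages_with_deps(target, upstream=UPSTREAM):
    """Return the target stage plus every stage found by walking backward."""
    visited = []
    queue = deque([target])
    while queue:
        stage = queue.popleft()
        if stage in visited:
            continue  # already handled via another path
        visited.append(stage)
        queue.extend(upstream[stage])
    return visited


print(stages_with_deps("train"))  # prints ['train', 'featurize', 'prepare']
```

Note that evaluate never appears when the walk starts at train, which matches the rule that stages later than the target are not fetched.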
Note that in this example project, the last stage evaluate doesn't add any more data files than those from previous stages, so at this point all of the data for this pipeline is cached and dvc status -c would output Cache and remote 'storage' are in sync.