import
Download a file or directory tracked by another DVC or Git repository into the
workspace, and track it (an import .dvc file is created).
See also our
dvc.api.open()Python API function.
Synopsis
usage: dvc import [-h] [-q | -v]
[-o <path>] [-f] [--rev <commit>]
[--no-exec | --no-download]
[-j <number>]
[--config <path>] [--remote <name>]
[--remote-config [<name>=<value> ...]]
url path
positional arguments:
url Location of DVC or Git repository to download from
path Path to a file or directory within the repositoryDescription
Provides an easy way to reuse files or directories tracked in any DVC
repository (e.g. datasets, intermediate results, ML models) or Git
repository (e.g. code, small image/other files). dvc import downloads the
target file or directory (found at path in url), and tracks it in the local
project. This makes it possible to update the import later, if the data source
has changed (see dvc update).
See
dvc listfor a way to browse repository contents to find files or directories to import.
Note that
dvc getcorresponds to the first step this command performs (just download the data).
The imported data is cached, and linked (or copied) to the current
working directory with its original file name e.g. data.txt (or to a location
provided with --out). An import .dvc file is created in the same location
e.g. data.txt.dvc – similar to using dvc add after downloading the data.
DVC won't push data imported from other DVC repos to remote storage.
dvc pull will download from the original source.
The url argument specifies the address of the DVC or Git repository containing
the data source. Both HTTP and SSH protocols are supported (e.g.
[user@]server:project.git). url can also be a local file system path
(including the current project e.g. .).
The path argument specifies a file or directory to download (paths inside
tracked directories are supported). It should be relative to the root of the
repo (absolute paths are supported when url is local). Note that DVC-tracked
targets must be found in a dvc.yaml or .dvc file of the repo.
See
dvc import-urlto download and track data from other supported locations such as S3, SSH, HTTP, etc.
.dvc files support references to data in an external DVC repository (hosted on
a Git server). In such a .dvc file, the deps field specifies the url and
data path, and the outs field contains the corresponding local path in the
workspace. It records enough metadata about the imported data to
enable DVC to efficiently determine whether the local copy is out of date.
To actually version the data, git add (and git commit) the import .dvc
file.
⚠️ Relevant notes and limitation:
- Source DVC repos should have a
dvc remote defaultcontaining the target data for this command to work. - The only exception to the above requirement is for local repos, where DVC will try to copy the data from its cache first.
- Limited support for chained imports is available (importing data that was itself imported into the source repo from another one).
- Note that
dvc reprodoesn't check or update import.dvcfiles (seedvc freeze), usedvc updateto bring the import up to date from the data source.
Options
-
-o <path>,--out <path>- specify apathto the desired location in the workspace to place the downloaded file or directory (instead of using the current working directory). -
-f,--force- when using--outto specify a local target file or directory, the operation will fail if those paths already exist. this flag will force the operation causing local files/dirs to be overwritten by the command. -
--rev <commit>- commit hash, branch or tag name, etc. (any Git revision) of the repository to download the file or directory from. The latest commit (in the default branch) is used by default when this option is not specified.Note that this adds a
revfield in the import.dvcfile that fixes it to the revision. This can impact the behavior ofdvc update(see the Importing and updating fixed revisions example below). -
--no-exec- create the import.dvcfile without accessingurl(assumes that the data source is valid). This is useful if you need to define the project imports quickly, and import the data later (usedvc updateto finish the operation(s)). -
--no-download- create the import.dvcfile including the source data information (repository URL and version) but without downloading the associated data. This is useful if you need track changes in remote data without using local storage space (yet). The data can be downloaded later usingdvc pull. File version can be updated usingdvc update --no-download. -
-j <number>,--jobs <number>- parallelism level for DVC to download data from the remote. The default value is4 * cpu_count(). Using more jobs may speed up the operation. Note that the default value can be set in the source repo using thejobsconfig option ofdvc remote modify. -
--config <path>- path to a config file that will be merged with the config in the target repository. -
--remote <name>- name of thedvc remoteto set as a default in the target repository. -
--remote-config [<name>=<value> ...]-dvc remoteconfig options to merge with a remote's config (default or one specified by--remote) in the target repository. -
-h,--help- prints the usage/help message, and exit. -
-q,--quiet- do not write anything to standard output. Exit with 0 if no problems arise, otherwise 1. -
-v,--verbose- displays detailed tracing information.
Examples
A simple case for this command is to import a dataset from an external DVC repository, such as our get started example repo.
$ dvc import git@github.com:iterative/example-get-started \
data/data.xml
Importing 'data/data.xml (git@github.com:iterative/example-get-started)'
-> 'data.xml'In contrast with dvc get, this command doesn't just download the data file,
but it also creates an import .dvc file with a link to the data source (as
explained in the description above). (This .dvc file can later be used to
update the import.) Check data.xml.dvc:
md5: 7de90e7de7b432ad972095bc1f2ec0f8
frozen: true
wdir: .
deps:
- path: data/data.xml
repo:
url: git@github.com:iterative/example-get-started
rev_lock: 6c73875a5f5b522f90b5afa9ab12585f64327ca7
outs:
- md5: a304afb96060aad90176268345e10355
path: data.xml
cache: trueSeveral of the values above are pulled from the original .dvc file
in the external DVC repository.
The url and rev_lock subfields under repo are used to save the origin and
version of the dependency, respectively.
Example: Importing and updating fixed revisions
To import a specific version of a file/directory, we may use the --rev option:
$ dvc import --rev cats-dogs-v1 \
git@github.com:iterative/dataset-registry.git \
use-cases/cats-dogs
Importing
'use-cases/cats-dogs (git@github.com:iterative/dataset-registry.git)'
-> 'cats-dogs'When using this option, the import .dvc file will also have a rev subfield
under repo:
deps:
- path: use-cases/cats-dogs
repo:
url: git@github.com:iterative/dataset-registry.git
rev: cats-dogs-v1
rev_lock: 0547f5883fb18e523e35578e2f0d19648c8f2d5cIf rev is a Git branch or tag (where the underlying commit changes), the data
source may have updates at a later time. To bring it up to date if so (and
update rev_lock in the .dvc file), simply use dvc update <stage>.dvc. If
rev is a specific commit hash (does not change), dvc update without options
will not have an effect on the import .dvc file. You may force-update it to a
different commit with dvc update --rev:
$ dvc update --rev cats-dogs-v2 cats-dogs.dvcIn the above example, the value for
revin the new.dvcfile will bemaster(a branch) so it will be able update normally going forward.
Example: Data registry
If you take a look at our
dataset registry
project, you'll see that it's organized into different directories
such as tutorials/versioning and use-cases/, and these contain .dvc files
that track different datasets. Given this simple structure, its data files can
be easily shared among several other projects using dvc get and dvc import.
For example:
$ dvc get https://github.com/iterative/dataset-registry \
tutorials/versioning/data.zipUsed in our versioning tutorial
Or
$ dvc import git@github.com:iterative/dataset-registry.git \
use-cases/cats-dogsdvc import provides a better way to incorporate data files tracked in external
DVC repositories because it saves the connection between the
current project and the source repo. This means that enough information is
recorded in an import .dvc file in order to
reproduce downloading of this same data version
in the future, where and when needed. This is achieved with the repo field,
for example (matching the import command above):
frozen: true
deps:
- path: use-cases/cats-dogs
repo:
url: git@github.com:iterative/dataset-registry.git
rev_lock: 0547f5883fb18e523e35578e2f0d19648c8f2d5c
outs:
- md5: b6923e1e4ad16ea1a7e2b328842d56a2.dir
path: cats-dogs
cache: trueSee a full explanation in our Data Registry use case.
Example: Importing from any Git repository
You can even import files from plain Git repos that are not DVC repositories. For example, let's import a dataset from GSA's data repo:
$ dvc import git@github.com:GSA/data \
enterprise-architecture/it-standards.csv
Importing ...Note that Git-tracked files can be imported from DVC repos as well.
The file is imported, and along with it, an import .dvc file is created. Check
it-standards.csv.dvc:
deps:
- path: enterprise-architecture/it-standards.csv
repo:
url: git@github.com:GSA/data
rev_lock: af6a1feb542dc05b4d3e9c80deb50e6596876e5f
outs:
- md5: 7e6de779a1ab286745c808f291d2d671
path: it-standards.csvThe url and rev_lock subfields under repo are used to save the origin and
version of the dependency, respectively.
Example: Chained imports
DVC supports importing data that was itself imported into the source repo, as
long as all the repos in the import chain (and their dvc remote default) are
accessible from the final destination repo.
Consider an example with 3 DVC repos (A, B, and C). DVC repo /repo/a contains
a data.csv file tracked with dvc add:
/repo/a
├── data.csv
└── data.csv.dvcIn repo B, we import data.csv from A and into a subdirectory:
$ dvc import /repo/a data.csv --out training/data.csvProject B may of course contain other files unique to itself, for example:
/repo/b
└── training
├── data.csv
├── data.csv.dvc
├── labels
│ ├── test.txt
│ └── truth.txt
└── labels.dvcNotice that the
training/labelsdirectory (not an import) is also tracked in B separately.
If we examine training/data.csv.dvc, we can see that that the import source is
repo A (/repo/a):
deps:
- path: data.csv
repo:
url: /repo/a
rev_lock: 32ab3ddc8a0b5cbf7ed8cb252f93915a34b130eb
outs:
- md5: acbd18db4cc2f85cedef654fccc4a4d8
size: 3234523
path: data.csvNow lets imagine that we run the following command in our third repo, C:
$ dvc import /repo/b trainingThis will result in the following directory structure, which contains a chained import and a regular one:
/repo/c
├── training
│ ├── data.csv
│ └── labels
│ ├── test.txt
│ └── truth.txt
└── training.dvctraining/data.csvis imported from A into B into Ctraining/labels/is imported from B into C directly
However, training.dvc only references repo B (/repo/b):
deps:
- path: training
repo:
url: /repo/b
rev_lock: 15136ed84b59468b68fd66b8141b41c5be682ced
outs:
- md5: e784c380dd9aa9cb13fbe22e62d7b2de.dir
size: 27
nfiles: 3
path: trainingEach time that we dvc import or dvc update* training/ into C (or even
dvc pull it) DVC will first look up the contents of training in B and notice
that training/data.csv is itself an import. It will then resolve the chain as
needed (finding data.csv in A).
*Note that when running
dvc update trainingfrom repo C, DVC will only check whether or nottraining/changed in repo B. So ifdata.csvhas only changed in A,training/data.csvwon't be updated in C untildvc update training/data.csvhas been run in B.
This means both repos A and B must be reachable when dvc import runs in repo
C, otherwise the import chain resolution would fail.
The dvc remote default for all repos in the import chain must also be
accessible (repo C needs to have all the appropriate credentials).
Example: Set default remote
$ dvc import https://github.com/iterative/example-get-started-s3 data/prepared --remote myremote
...
$ cat prepared.dvc
deps:
- path: data/prepared
repo:
url: https://github.com/iterative/example-get-started-s3
rev_lock: 8141b41c5be682ced15136ed84b59468b68fd66b
remote: myremote
outs:
- md5: e784c380dd9aa9cb13fbe22e62d7b2de.dir
size: 27
nfiles: 3
path: preparedExample: Set AWS profile for default remote
$ dvc import https://github.com/iterative/example-get-started-s3 data/prepared --remote-config profile=myprofile
...
$ cat prepared.dvc
deps:
- path: data/prepared
repo:
url: https://github.com/iterative/example-get-started-s3
rev_lock: 8141b41c5be682ced15136ed84b59468b68fd66b
remote:
profile: myprofile
outs:
- md5: e784c380dd9aa9cb13fbe22e62d7b2de.dir
size: 27
nfiles: 3
path: preparedExample: Create new AWS S3 remote and set it as default
If remote with that name already exists, its config will be merged with options
provided by --remote-config.
$ dvc import https://github.com/iterative/example-get-started-s3 data/prepared \
--remote myremote \
--remote-config url=s3://mybucket/mypath profile=myprofile
...
$ cat prepared.dvc
deps:
- path: data/prepared
repo:
url: https://github.com/iterative/example-get-started-s3
rev_lock: 8141b41c5be682ced15136ed84b59468b68fd66b
config:
core:
remote: myremote
remote:
myremote:
url: s3://mybucket/mypath
profile: myprofile
outs:
- md5: e784c380dd9aa9cb13fbe22e62d7b2de.dir
size: 27
nfiles: 3
path: preparedExample: Set AWS secret keys for particular remote
In this example, instead of using --remote myremote with --remote-config and
exposing your secrets in dvcfile, you could use --config to use a gitignored
config file. The format of the config file is the same as produced by
dvc config.
$ cat myconfig
[core]
remote = myremote
[remote "myremote"]
access_key_id = myaccesskeyid
secret_access_key = mysecretaccesskey
$ cat .gitignore # make sure you are not commiting this file to git
...
/myconfig
...
$ dvc import https://github.com/iterative/example-get-started-s3 data/prepared --config myconfig
...
$ cat prepared.dvc
deps:
- path: data/prepared
repo:
url: https://github.com/iterative/example-get-started-s3
rev_lock: 8141b41c5be682ced15136ed84b59468b68fd66b
config: myconfig
outs:
- md5: e784c380dd9aa9cb13fbe22e62d7b2de.dir
size: 27
nfiles: 3
path: prepared