HDFS & WebHDFS
Start with `dvc remote add` to define the remote:
$ dvc remote add -d myremote hdfs://user@example.com/path

⚠️ Using HDFS with a Hadoop cluster might require additional setup. We assume
that the client is already set up to use it; specifically, libhdfs should be
installed.
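The exact setup depends on your Hadoop distribution, but as a rough sketch
(the paths below are placeholders, not part of DVC), the native client
typically needs the Hadoop home, the Java classpath, and the location of
libhdfs exported in the environment. If the HDFS support is backed by PyArrow,
`ARROW_LIBHDFS_DIR` is the variable it checks for `libhdfs.so`:

$ export HADOOP_HOME=/opt/hadoop                    # placeholder install path
$ export ARROW_LIBHDFS_DIR=$HADOOP_HOME/lib/native  # directory containing libhdfs.so
$ export CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath --glob)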
HDFS configuration parameters
If any values given to the parameters below contain sensitive user info, add
them with the --local option, so they're written to a Git-ignored config file.
- `url` - remote location:

  $ dvc remote modify myremote url hdfs://user@example.com/path

- `user` - user name to access the remote.

  $ dvc remote modify --local myremote user myuser

- `kerb_ticket` - path to the Kerberos ticket cache for Kerberos-secured HDFS
  clusters.

  $ dvc remote modify --local myremote \
        kerb_ticket /path/to/ticket/cache

- `replication` - replication factor for write operations on the HDFS cluster.
  Default value is 3.

  $ dvc remote modify myremote replication 2
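For reference, with the settings above the relevant config sections end up
looking roughly like this (values added with `--local`, such as `user`, go to
the Git-ignored `.dvc/config.local` instead):

$ cat .dvc/config
[core]
    remote = myremote
['remote "myremote"']
    url = hdfs://user@example.com/path
    replication = 2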
WebHDFS
Using an HDFS cluster as remote storage is also supported via the WebHDFS API.
If your cluster is secured, then WebHDFS is commonly used with Kerberos and
HTTPS. To enable these for the DVC remote, set `use_https` and `kerberos` to
`true`.
$ dvc remote add -d myremote webhdfs://example.com/path
$ dvc remote modify myremote use_https true
$ dvc remote modify myremote kerberos true
$ dvc remote modify --local myremote token SOME_BASE64_ENCODED_TOKEN

⚠️ Using WebHDFS requires enabling REST API access in the cluster: set the
config property `dfs.webhdfs.enabled` to `true` in `hdfs-site.xml`.
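In `hdfs-site.xml`, that is a standard Hadoop configuration entry (restarting
the NameNode and DataNodes may be needed for the change to take effect):

<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>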
💡 You may want to run `kinit` before using the remote to make sure you have
an active Kerberos session.
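For example (the principal below is a placeholder for your own user and
realm):

$ kinit myuser@EXAMPLE.COM
$ klist    # verify that a valid ticket was obtained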
WebHDFS configuration parameters
If any values given to the parameters below contain sensitive user info, add
them with the --local option, so they're written to a Git-ignored config file.
- `url` - remote location:

  $ dvc remote modify myremote url webhdfs://user@example.com/path

  Do not provide a `user` in the URL with `kerberos` or `token` authentication.

- `user` - user name to access the remote. Do not set this with `kerberos` or
  `token` authentication.

  $ dvc remote modify --local myremote user myuser

- `kerberos` - enable Kerberos authentication (`false` by default):

  $ dvc remote modify myremote kerberos true

- `kerberos_principal` - Kerberos principal to use, in case you have multiple
  ones (for example service accounts). Only used if `kerberos` is `true`.

  $ dvc remote modify myremote kerberos_principal myprincipal

- `proxy_to` - Hadoop superuser to proxy as. The proxy user feature must be
  enabled on the cluster, and the user must have the correct access rights. If
  the cluster is secured, Kerberos must be enabled (set `kerberos` to `true`)
  for this to work. This parameter is incompatible with `token`.

  $ dvc remote modify myremote proxy_to myuser

- `use_https` - enables SWebHdfs. Note that DVC still expects the protocol in
  `url` to be `webhdfs://`, and will fail if `swebhdfs://` is used.

  $ dvc remote modify myremote use_https true

- `ssl_verify` - whether to verify SSL requests. Defaults to `true` when
  `use_https` is enabled, `false` otherwise.

  $ dvc remote modify myremote ssl_verify false

- `token` - Hadoop delegation token (as returned by the WebHDFS API). If the
  cluster is secured, Kerberos must be enabled (set `kerberos` to `true`) for
  this to work. This parameter is incompatible with providing a `user` and
  with `proxy_to`. See the sketch after this list for one way to obtain a
  token.

  $ dvc remote modify --local myremote token "mysecret"

- `password` - password to use in combination with `user` for Basic
  Authentication. If you provide `password`, you must also provide `user`.
  Since this is a password, it is recommended to store it in your local config
  (i.e. not in Git).

  $ dvc remote modify --local myremote password "mypassword"

- `data_proxy_target` - target mapping to be used in the call to the fsspec
  WebHDFS constructor (see
  https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=data_proxy#fsspec.implementations.webhdfs.WebHDFS.__init__).
  This enables access to a WebHDFS cluster that sits behind a High
  Availability proxy server by rewriting the URL used for connecting.

  For example, if you provide the url `webhdfs://host:port/` and the value
  `https://host:port/gateway/cluster` for the `data_proxy_target` parameter,
  then internally the fsspec WebHDFS will rewrite every occurrence of
  `https://host:port/webhdfs/v1` into
  `https://host:port/gateway/cluster/webhdfs/v1`.

  $ dvc remote modify myremote data_proxy_target "https://host:port/gateway/cluster"
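How you obtain a delegation token depends on your cluster. As a rough sketch,
assuming a Kerberos-secured cluster, an active `kinit` session, and a NameNode
reachable at `https://namenode:9871` (a placeholder address), you can request
one through the WebHDFS REST API and pass its `urlString` to DVC:

$ curl -s --negotiate -u : \
      "https://namenode:9871/webhdfs/v1/?op=GETDELEGATIONTOKEN&renewer=myuser"
{"Token":{"urlString":"SOME_BASE64_ENCODED_TOKEN"}}

$ dvc remote modify --local myremote token SOME_BASE64_ENCODED_TOKEN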