Using xarray with dask

This document presents an example of how the xarray and the dask Python libraries can be used together in R.

Installation

To use the xarray in R, it is required to have the reticulate package installed. To do so, you can use the following command:

install.packages("reticulate")

After installing the reticulate package, we must create a Python virtual environment to install the xarray library. For this, we can use the following command:

In this example, we use conda to manage Python virtual environments, but any other package management tool can also be used (e.g., mamba, virtualenv, poetry).

# defining the name of the environment
conda_env_name <- "r-xarray"

# creating conda environment
reticulate::conda_create(envname = conda_env_name)

## + /opt/conda/bin/conda 'create' '--yes' '--name' 'r-xarray' 'python=3.9' '--quiet' '-c' 'conda-forge'

## [1] "/home/sits/.conda/envs/r-xarray/bin/python"

With the environment created, let’s install the dependencies on it:

reticulate::py_install(
  c("xarray", "dask", "distributed"),
  envname = conda_env_name,
  pip =  TRUE
)

Importing Python libraries

Now, we have an environment ready to go. So, let’s activate it:

reticulate::use_condaenv(condaenv = conda_env_name)

To use the xarray library, we can import it using the following reticulate command:

xr <- reticulate::import("xarray")

For this example, we also need to import some dask packages:

da <- reticulate::import("dask.array")
dd <- reticulate::import("dask.distributed")

To create some dummy data, we must also import pandas:

pd <- reticulate::import("pandas")

Configuring Dask

As dask was created to help us scale applications using multiple cores or machines, to use it, we need first to configure the connection between our application and those resources. To do that, we must create an instance of the class dask.distributed.Client:

client <- dd$Client()
client

## <Client: 'tcp://127.0.0.1:46389' processes=4 threads=16, memory=31.20 GiB>

By default, when we create an instance of the dask.distributed.Client class, dask configures a local server to use all cores available in the machine.

Dummy dataset example

To show how the xarray can use parallel-enabled arrays from dask, we will create a dummy dataset with 3 dimensions: X, Y, and Y.

Using dask array is possible because xarray can work with multiple “array backends”. To learn more, please check this documentation page

For this, let’s create a 3D array using Dask array with values:

data <- da$random$random(c(10000, 1000, 100), chunks = c(1000, 1000, 10))

We already have the data created. Now, we need to define the name of its dimensions:

dims <- c("x", "y", "time")
dims

## [1] "x"    "y"    "time"

As the xarray uses indices to handle data, we also need to specify the coordinates of the dataset:

The coordinates are used to specify the extensions of the data (e.g., spatial dimension (X, Y) and temporal dimension (Y))

coords = list(
  x = 1:10000,
  y = 1:1000,
  time = pd$period_range("2000-01-01", periods = 100)
)

Using the elements we have defined in the code blocks above, let’s create a DataArray:

ds <- xr$DataArray(data, dims = dims, coords = coords)
ds

## <xarray.DataArray 'random_sample-bffe8d580f9a246a840a9f6935173724' (x: 10000,
##                                                                     y: 1000,
##                                                                     time: 100)> Size: 8GB
## dask.array<random_sample, shape=(10000, 1000, 100), dtype=float64, chunksize=(1000, 1000, 10), chunktype=numpy.ndarray>
## Coordinates:
##   * x        (x) int64 80kB 1 2 3 4 5 6 7 ... 9995 9996 9997 9998 9999 10000
##   * y        (y) int64 8kB 1 2 3 4 5 6 7 8 ... 993 994 995 996 997 998 999 1000
##   * time     (time) object 800B 2000-01-01 2000-01-02 ... 2000-04-08 2000-04-09

Temporal mean

With the DataArray created, it can be manipulated (e.g., extract values, calculate statistics). As an example, let’s calculate the temporal mean:

ds_mean <- ds$mean(dim = "time")
ds_mean

## <xarray.DataArray 'random_sample-bffe8d580f9a246a840a9f6935173724' (x: 10000,
##                                                                     y: 1000)> Size: 80MB
## dask.array<mean_agg-aggregate, shape=(10000, 1000), dtype=float64, chunksize=(1000, 1000), chunktype=numpy.ndarray>
## Coordinates:
##   * x        (x) int64 80kB 1 2 3 4 5 6 7 ... 9995 9996 9997 9998 9999 10000
##   * y        (y) int64 8kB 1 2 3 4 5 6 7 8 ... 993 994 995 996 997 998 999 1000

By using the mean(dim = "time"), we specify to the xarray the mean should be calculated in the temporal dimension. So, in our example, it is calculated using 10 values values for each X and Y.

With dask, we are always using lazy evaluation functions. So, the code above still needs to be executed. For this, we can ask xarray to execute it using the compute method.

ds_mean_result <- ds_mean$compute()
ds_mean_result

## <xarray.DataArray 'random_sample-bffe8d580f9a246a840a9f6935173724' (x: 10000,
##                                                                     y: 1000)> Size: 80MB
## array([[0.52395468, 0.52582319, 0.5760999 , ..., 0.52075235, 0.46198502,
##         0.49506406],
##        [0.49823877, 0.52700124, 0.48738285, ..., 0.47766254, 0.47700453,
##         0.55002639],
##        [0.49502534, 0.48837638, 0.5267161 , ..., 0.50563961, 0.48020724,
##         0.48280721],
##        ...,
##        [0.4804682 , 0.49752797, 0.51064833, ..., 0.52835988, 0.52285502,
##         0.53107802],
##        [0.49753673, 0.51126792, 0.48127124, ..., 0.51847087, 0.46748262,
##         0.55353419],
##        [0.51380704, 0.50863396, 0.47923159, ..., 0.51144999, 0.42756488,
##         0.42257002]])
## Coordinates:
##   * x        (x) int64 80kB 1 2 3 4 5 6 7 ... 9995 9996 9997 9998 9999 10000
##   * y        (y) int64 8kB 1 2 3 4 5 6 7 8 ... 993 994 995 996 997 998 999 1000