This document presents an example of how the xarray and the dask Python libraries can be used together in R.
To use the xarray in R, it is required to have the reticulate package installed. To do so, you can use the following command:
install.packages("reticulate")
After installing the reticulate package, we must create a Python virtual environment to install the xarray library. For this, we can use the following command:
In this example, we use conda to manage Python virtual environments, but any other package management tool can also be used (e.g., mamba, virtualenv, poetry).
# defining the name of the environment
conda_env_name <- "r-xarray"
# creating conda environment
reticulate::conda_create(envname = conda_env_name)
## + /opt/conda/bin/conda 'create' '--yes' '--name' 'r-xarray' 'python=3.9' '--quiet' '-c' 'conda-forge'
## [1] "/home/sits/.conda/envs/r-xarray/bin/python"
With the environment created, let’s install the dependencies on it:
reticulate::py_install(
c("xarray", "dask", "distributed"),
envname = conda_env_name,
pip = TRUE
)
Now, we have an environment ready to go. So, let’s activate it:
reticulate::use_condaenv(condaenv = conda_env_name)
To use the xarray library, we can import it using the following reticulate command:
xr <- reticulate::import("xarray")
For this example, we also need to import some dask packages:
da <- reticulate::import("dask.array")
dd <- reticulate::import("dask.distributed")
To create some dummy data, we must also import pandas:
pd <- reticulate::import("pandas")
As dask was created to help us
scale applications using multiple cores or machines, to use it, we need
first to configure the connection between our application and those
resources. To do that, we must create an instance of the class
dask.distributed.Client
:
client <- dd$Client()
client
## <Client: 'tcp://127.0.0.1:46389' processes=4 threads=16, memory=31.20 GiB>
By default, when we create an instance of the
dask.distributed.Client
class, dask configures a local server to use
all cores available in the machine.
To show how the xarray can use
parallel-enabled arrays from dask,
we will create a dummy dataset with 3 dimensions: X
,
Y
, and Y
.
Using dask array is possible because xarray can work with multiple “array backends”. To learn more, please check this documentation page
For this, let’s create a 3D array using Dask array with values:
data <- da$random$random(c(10000, 1000, 100), chunks = c(1000, 1000, 10))
We already have the data created. Now, we need to define the name of its dimensions:
dims <- c("x", "y", "time")
dims
## [1] "x" "y" "time"
As the xarray uses indices to handle data, we also need to specify the coordinates of the dataset:
The
coordinates
are used to specify the extensions of the data (e.g., spatial dimension (X
,Y
) and temporal dimension (Y
))
coords = list(
x = 1:10000,
y = 1:1000,
time = pd$period_range("2000-01-01", periods = 100)
)
Using the elements we have defined in the code blocks above, let’s create a DataArray:
ds <- xr$DataArray(data, dims = dims, coords = coords)
ds
## <xarray.DataArray 'random_sample-bffe8d580f9a246a840a9f6935173724' (x: 10000,
## y: 1000,
## time: 100)> Size: 8GB
## dask.array<random_sample, shape=(10000, 1000, 100), dtype=float64, chunksize=(1000, 1000, 10), chunktype=numpy.ndarray>
## Coordinates:
## * x (x) int64 80kB 1 2 3 4 5 6 7 ... 9995 9996 9997 9998 9999 10000
## * y (y) int64 8kB 1 2 3 4 5 6 7 8 ... 993 994 995 996 997 998 999 1000
## * time (time) object 800B 2000-01-01 2000-01-02 ... 2000-04-08 2000-04-09
With the DataArray created, it can be manipulated (e.g., extract values, calculate statistics). As an example, let’s calculate the temporal mean:
ds_mean <- ds$mean(dim = "time")
ds_mean
## <xarray.DataArray 'random_sample-bffe8d580f9a246a840a9f6935173724' (x: 10000,
## y: 1000)> Size: 80MB
## dask.array<mean_agg-aggregate, shape=(10000, 1000), dtype=float64, chunksize=(1000, 1000), chunktype=numpy.ndarray>
## Coordinates:
## * x (x) int64 80kB 1 2 3 4 5 6 7 ... 9995 9996 9997 9998 9999 10000
## * y (y) int64 8kB 1 2 3 4 5 6 7 8 ... 993 994 995 996 997 998 999 1000
By using the mean(dim = "time")
, we specify to the xarray the mean
should be calculated in the temporal dimension. So, in our example, it
is calculated using 10 values
values for each
X
and Y
.
With dask, we are
always using lazy evaluation functions. So, the code above still needs
to be executed. For this, we can ask xarray to execute it using the
compute
method.
ds_mean_result <- ds_mean$compute()
ds_mean_result
## <xarray.DataArray 'random_sample-bffe8d580f9a246a840a9f6935173724' (x: 10000,
## y: 1000)> Size: 80MB
## array([[0.52395468, 0.52582319, 0.5760999 , ..., 0.52075235, 0.46198502,
## 0.49506406],
## [0.49823877, 0.52700124, 0.48738285, ..., 0.47766254, 0.47700453,
## 0.55002639],
## [0.49502534, 0.48837638, 0.5267161 , ..., 0.50563961, 0.48020724,
## 0.48280721],
## ...,
## [0.4804682 , 0.49752797, 0.51064833, ..., 0.52835988, 0.52285502,
## 0.53107802],
## [0.49753673, 0.51126792, 0.48127124, ..., 0.51847087, 0.46748262,
## 0.55353419],
## [0.51380704, 0.50863396, 0.47923159, ..., 0.51144999, 0.42756488,
## 0.42257002]])
## Coordinates:
## * x (x) int64 80kB 1 2 3 4 5 6 7 ... 9995 9996 9997 9998 9999 10000
## * y (y) int64 8kB 1 2 3 4 5 6 7 8 ... 993 994 995 996 997 998 999 1000