This document presents an example of how the xarray Python library can be used in R.
To use xarray from R, the reticulate package must be installed. You can install it with the following command:
install.packages("reticulate")
After installing reticulate, we must create a Python virtual environment in which to install the xarray library.
In this example, we use conda to manage Python virtual environments, but any other environment management tool can also be used (e.g., mamba, virtualenv, poetry). The environment can be created with the following commands:
# defining the name of the environment
conda_env_name <- "r-xarray"
# creating conda environment
reticulate::conda_create(envname = conda_env_name)
## + /opt/conda/bin/conda 'create' '--yes' '--name' 'r-xarray' 'python=3.9' '--quiet' '-c' 'conda-forge'
## [1] "/home/sits/.conda/envs/r-xarray/bin/python"
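Running the creation step a second time may fail or redo work if the environment already exists. A minimal sketch of a guarded version is shown below. It assumes conda is installed and discoverable by reticulate, and it reuses the `conda_env_name` value defined above. The `existing_envs` variable is introduced here only for illustration.

```r
# Name reused from the step above
conda_env_name <- "r-xarray"

# conda_list() returns a data frame with one row per conda environment
# (columns `name` and `python`)
existing_envs <- reticulate::conda_list()

if (!(conda_env_name %in% existing_envs$name)) {
  # Create the environment only when it is not there yet
  reticulate::conda_create(envname = conda_env_name)
} else {
  message("Conda environment '", conda_env_name, "' already exists; skipping creation.")
}
```

With this guard, the document can be re-run without recreating the environment each time.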
With the environment created, let’s install the dependencies in it:
reticulate::py_install(
  c("xarray"),
  envname = conda_env_name,
  pip = TRUE
)
Now that the environment is ready, let’s tell reticulate to use it:
reticulate::use_condaenv(condaenv = conda_env_name)
To use the xarray library, we can import it using the following reticulate command:
xr <- reticulate::import("xarray")
For this example, we will also use two other Python libraries, pandas and numpy. Both are dependencies of xarray, so they were installed along with it and we don’t need to install them separately.
We import them in the same way:
pd <- reticulate::import("pandas")
np <- reticulate::import("numpy")
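If one of the imports fails, it can help to check which modules reticulate can actually see in the active environment. The sketch below is one possible check, assuming the "r-xarray" environment created earlier. The `required_modules` and `available` names are introduced here only for illustration.

```r
# Assumes the "r-xarray" environment from the previous steps
reticulate::use_condaenv(condaenv = "r-xarray")

# Modules used in this example
required_modules <- c("xarray", "pandas", "numpy")

# py_module_available() returns TRUE when the module can be imported
available <- vapply(
  required_modules,
  reticulate::py_module_available,
  logical(1)
)
print(available)

# Stop early with a readable message if something is missing
if (!all(available)) {
  stop("Missing Python modules: ", paste(required_modules[!available], collapse = ", "))
}
```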
To show how xarray can handle multidimensional data, we will create a dummy dataset with 3 dimensions: X, Y, and time.
For this, let’s create a 3D array using numpy and random values:
data <- np$random$rand(1000L, 100L, 10L)
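Note the `L` suffix in the call above. In R, a bare number such as `1000` is a double, which reticulate passes to Python as a float, while `1000L` is an integer and is passed as a Python int. Functions such as `numpy.random.rand` expect integer sizes. The sketch below illustrates the difference and shows one way to convert sizes stored in variables. The `shape` and `shape_int` names are hypothetical.

```r
# A bare number is stored as a double; the L suffix makes it an integer
is.integer(1000)   # FALSE
is.integer(1000L)  # TRUE

# Sizes kept in a regular numeric vector are doubles as well
shape <- c(1000, 100, 10)
typeof(shape)      # "double"

# as.integer() converts them so they reach Python as ints
shape_int <- as.integer(shape)
typeof(shape_int)  # "integer"

# The converted values could then be used as array sizes, e.g.:
# data <- np$random$rand(shape_int[1], shape_int[2], shape_int[3])
```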
We already have the data created. Now, we need to define the name of its dimensions:
dims <- c("x", "y", "time")
dims
## [1] "x" "y" "time"
Because xarray uses labeled indices to handle data, we also need to specify the coordinates of the dataset. The coordinates define the extent of the data along each dimension (e.g., the spatial dimensions X and Y and the temporal dimension time):
coords <- list(
  x = 1:1000,
  y = 1:100,
  # 10L is passed to Python as an integer number of periods
  time = pd$period_range("2000-01-01", periods = 10L)
)
Using the elements we have defined in the code blocks above, let’s create a DataArray:
ds <- xr$DataArray(data, dims = dims, coords = coords)
ds
## <xarray.DataArray (x: 1000, y: 100, time: 10)> Size: 8MB
## array([[[0.47365453, 0.93473671, 0.2548404 , ..., 0.82982438,
## 0.95541118, 0.37491129],
## [0.7475468 , 0.14546763, 0.82604556, ..., 0.8975616 ,
## 0.81746904, 0.69233318],
## [0.2993701 , 0.90822995, 0.02811051, ..., 0.52169678,
## 0.81964306, 0.39314783],
## ...,
## [0.56722477, 0.4892301 , 0.92894962, ..., 0.42108727,
## 0.7352988 , 0.94542465],
## [0.94111763, 0.61111741, 0.08069564, ..., 0.31817046,
## 0.41960853, 0.2316986 ],
## [0.50303572, 0.24054607, 0.00241998, ..., 0.10399 ,
## 0.73901889, 0.76927791]],
##
## [[0.30891382, 0.31217185, 0.57671207, ..., 0.74373597,
## 0.03436587, 0.30680222],
## [0.84206798, 0.54613429, 0.12351806, ..., 0.16650705,
## 0.55805102, 0.77127825],
## [0.28929037, 0.0219615 , 0.61572283, ..., 0.11560709,
## 0.24896107, 0.07376672],
## ...
## [0.48350243, 0.28147534, 0.47720186, ..., 0.86906556,
## 0.76099373, 0.65650543],
## [0.35865617, 0.44446669, 0.49240284, ..., 0.45152173,
## 0.3631969 , 0.8962284 ],
## [0.8507341 , 0.41606127, 0.40959265, ..., 0.81736821,
## 0.57150855, 0.31787653]],
##
## [[0.13390902, 0.768228 , 0.50647091, ..., 0.24672934,
## 0.57601404, 0.62934807],
## [0.72837219, 0.69028253, 0.76259053, ..., 0.59126404,
## 0.12934916, 0.04822422],
## [0.37596659, 0.5347851 , 0.68630718, ..., 0.84354121,
## 0.35699331, 0.73959951],
## ...,
## [0.59597012, 0.90326966, 0.69016369, ..., 0.29433991,
## 0.36862798, 0.29567567],
## [0.85861481, 0.72773225, 0.28225493, ..., 0.92192114,
## 0.91106818, 0.11629271],
## [0.28512211, 0.30550223, 0.69243255, ..., 0.7663476 ,
## 0.95204822, 0.82730802]]])
## Coordinates:
## * x (x) int64 8kB 1 2 3 4 5 6 7 8 ... 993 994 995 996 997 998 999 1000
## * y (y) int64 800B 1 2 3 4 5 6 7 8 9 10 ... 92 93 94 95 96 97 98 99 100
## * time (time) object 80B 2000-01-01 2000-01-02 ... 2000-01-09 2000-01-10
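Values can also be pulled out of a DataArray and brought back into R. The sketch below uses a small stand-in array (named `small` here, a hypothetical name) instead of the full dataset, and assumes the "r-xarray" environment from the earlier steps. It selects by position with `isel()`, selects by coordinate label with `sel()`, and reads the result through the `values` attribute, which reticulate converts to an R array.

```r
# Assumes the "r-xarray" environment created earlier
reticulate::use_condaenv(condaenv = "r-xarray")
xr <- reticulate::import("xarray")
np <- reticulate::import("numpy")

# Small stand-in for the dataset: 4 x 3 grid with 2 time steps
small <- xr$DataArray(
  np$random$rand(4L, 3L, 2L),
  dims = c("x", "y", "time"),
  coords = list(x = 1:4, y = 1:3, time = 1:2)
)

# Position-based selection: Python positions start at 0,
# so time = 0L is the first time step
first_step <- small$isel(time = 0L)

# `values` returns the underlying data, converted to an R matrix
first_step_r <- first_step$values
dim(first_step_r)

# Label-based selection: uses the coordinate values themselves
pixel_series <- small$sel(x = 2L, y = 3L)$values
length(pixel_series)
```

Note that `isel()` follows Python's zero-based indexing, whereas `sel()` uses the coordinate labels, which in this sketch start at 1.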
Once the DataArray is created, it can be manipulated (e.g., to extract values or calculate statistics). As an example, let’s calculate the temporal mean:
ds_mean <- ds$mean(dim = "time")
ds_mean
## <xarray.DataArray (x: 1000, y: 100)> Size: 800kB
## array([[0.75679603, 0.50913861, 0.60772233, ..., 0.69097612, 0.4666945 ,
## 0.48696565],
## [0.39722416, 0.51349987, 0.25312909, ..., 0.35545533, 0.39998652,
## 0.52965279],
## [0.55566312, 0.56567519, 0.57832792, ..., 0.57993744, 0.3852997 ,
## 0.50447857],
## ...,
## [0.41033577, 0.59772485, 0.4951134 , ..., 0.44662321, 0.51364 ,
## 0.53419491],
## [0.42373195, 0.63990477, 0.48853719, ..., 0.51528988, 0.47914088,
## 0.52390989],
## [0.49930364, 0.47579821, 0.55102785, ..., 0.45384248, 0.64826883,
## 0.56499602]])
## Coordinates:
## * x (x) int64 8kB 1 2 3 4 5 6 7 8 ... 993 994 995 996 997 998 999 1000
## * y (y) int64 800B 1 2 3 4 5 6 7 8 9 10 ... 92 93 94 95 96 97 98 99 100
By calling mean(dim = "time"), we tell xarray that the mean should be calculated along the temporal dimension. So, in our example, each mean is calculated from the 10 time values available for each X and Y position.
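For readers more familiar with base R, the same kind of reduction can be sketched without Python at all. The example below is only an analogy, not what xarray does internally: it builds a small 4 x 3 x 10 array (the `arr` name and its size are chosen for illustration) and averages over the third dimension, which plays the role of time.

```r
set.seed(42)

# Small stand-in array: 4 (x) by 3 (y) by 10 (time)
arr <- array(runif(4 * 3 * 10), dim = c(4, 3, 10))

# Keep dimensions 1 and 2 (x, y) and average over the remaining one (time)
temporal_mean <- apply(arr, c(1, 2), mean)
dim(temporal_mean)

# Each cell is the mean of the 10 time values at that x, y position
isTRUE(all.equal(temporal_mean[1, 1], mean(arr[1, 1, ])))

# rowMeans() with dims = 2 performs the same reduction
isTRUE(all.equal(temporal_mean, rowMeans(arr, dims = 2)))
```

As with the xarray result, the time dimension disappears and one value remains for each X and Y position.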