Dask integration

recursive-diff supports xarray.DataArray and xarray.Dataset objects backed by Dask. When it compares two such objects, the comparison is optimized to maximize parallelism and minimize memory usage.

In this example, we’re going to compare two arrays worth a total of 3 GiB. However, because they’re lazily defined, the whole comparison will use only a few MiB of RAM and will run on all available threads:

import sys

sys.path.insert(0, "..")

import dask.array as da
import xarray

from recursive_diff import display_diffs

a = xarray.DataArray(da.ones((200_000, 1_000)), name="ones")
b = xarray.DataArray(da.ones((200_000, 1_000)), name="ones")
a[123_456, 789] = 1.01
b[133_700, 333] = 1.0000000001  # Below tolerance

display_diffs(a, b)
                lhs  rhs  abs_delta  rel_delta
dim_0  dim_1
123456 789     1.01  1.0      -0.01  -0.009901
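
The "3 GiB" figure and the lazy behaviour can be checked without computing anything. The following is a minimal sketch that uses only dask and xarray (not recursive-diff); the variable names `total_gib` and `is_lazy` are illustrative. It reads the size from the array metadata and confirms that the data is still a chunked Dask array rather than an in-memory NumPy array:

```python
import dask.array as da
import xarray

# Same shape as the arrays compared above; nothing is allocated yet.
a = xarray.DataArray(da.ones((200_000, 1_000)), name="ones")
b = xarray.DataArray(da.ones((200_000, 1_000)), name="ones")

# nbytes is derived from shape and dtype, so it does not trigger a compute.
total_gib = (a.nbytes + b.nbytes) / 2**30
print(f"Total size if fully loaded: {total_gib:.2f} GiB")

# The underlying data is a Dask array split into chunks, which is what
# lets the work be done piece by piece and in parallel.
is_lazy = isinstance(a.data, da.Array)
print("Lazy:", is_lazy, "- number of chunks:", a.data.npartitions)
```

The number of chunks depends on Dask's automatic chunking configuration, so it may differ between installations.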

Dask clusters

If you have a Dask client active and compare chunked Xarray objects, the comparison will run on the Dask cluster.

In this example we’re using a LocalCluster, but this works with remote clusters as well as Coiled clusters!

You may use xarray.open_zarr() or xarray.open_dataset() to open Zarr or NetCDF files on S3. The data is read by the cluster’s workers, so if the cluster runs inside AWS and only your client is outside of it, the data won’t transfer over the internet and you won’t pay egress charges. S3 access is not yet supported by recursive_open().

import dask.distributed

with dask.distributed.LocalCluster() as cluster, dask.distributed.Client(cluster):
    display_diffs(a, b)
                lhs  rhs  abs_delta  rel_delta
dim_0  dim_1
123456 789     1.01  1.0      -0.01  -0.009901
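
For a remote cluster, the usual Dask pattern is to connect a client to the scheduler's address. The sketch below is an illustration under assumptions, not part of recursive-diff: it uses a small in-process LocalCluster as a stand-in for a remote scheduler, and a plain Dask reduction in place of display_diffs, to show that any Dask work started while the client is active is sent to that cluster.

```python
import dask.array as da
import dask.distributed

# An in-process, single-worker cluster stands in for a remote one here.
# With a real remote cluster you would already know the scheduler address.
with dask.distributed.LocalCluster(
    n_workers=1, processes=False, dashboard_address=None
) as cluster:
    address = cluster.scheduler_address

    # Connecting by address is the same call you would use for a remote
    # scheduler; the client becomes the default for Dask computations.
    with dask.distributed.Client(address) as client:
        x = da.ones((1_000, 1_000), chunks=(250, 1_000))
        total = float(x.sum().compute())  # runs on the cluster's worker

print(total)
```

Under the same assumption, calling display_diffs(a, b) inside the inner `with` block would run the comparison on the connected cluster, as in the LocalCluster example above.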