Dask integration
recursive-diff supports xarray.DataArray and xarray.Dataset objects backed by Dask. When it compares two such objects, the comparison is optimized to maximise parallelism and minimize memory usage.
In this example, we’re going to compare two arrays worth a total of 3 GiB. However, because they’re lazily defined, the whole comparison will use only a few MiB RAM and will run on all available threads:
import sys
sys.path.insert(0, "..")
import dask.array as da
import xarray
from recursive_diff import display_diffs
a = xarray.DataArray(da.ones((200_000, 1_000)), name="ones")
b = xarray.DataArray(da.ones((200_000, 1_000)), name="ones")
a[123_456, 789] = 1.01
b[133_700, 333] = 1.0000000001 # Below tolerance
display_diffs(a, b)
| lhs | rhs | abs_delta | rel_delta | ||
|---|---|---|---|---|---|
| dim_0 | dim_1 | ||||
| 123456 | 789 | 1.01 | 1.0 | -0.01 | -0.009901 |
Dask clusters
If you have a Dask client active and compare chunked Xarray objects, the comparison will run on the Dask cluster.
In this example we’re using a LocalCluster, but this works with remote clusters as well as Coiled clusters!
You may use xarray.open_zarr() or xarray.open_dataset() to open Zarr or NetCDF files on S3, which means that if your client is outside of AWS the data won’t transfer over the internet and you won’t pay egress charges.
S3 access not yet supported by recursive_open().
import dask.distributed
with dask.distributed.LocalCluster() as cluster, dask.distributed.Client(cluster):
display_diffs(a, b)
| lhs | rhs | abs_delta | rel_delta | ||
|---|---|---|---|---|---|
| dim_0 | dim_1 | ||||
| 123456 | 789 | 1.01 | 1.0 | -0.01 | -0.009901 |