{ "cells": [ { "cell_type": "markdown", "id": "9b9a1d2c-7664-4fd9-b5cb-3a766d907fe7", "metadata": {}, "source": [ "# Dask integration\n", "\n", "recursive-diff supports {class}`xarray.DataArray` and {class}`xarray.Dataset` objects backed by [Dask](https://dask.org). When it compares two such objects, the comparison is optimized to maximize parallelism and minimize memory usage.\n", "\n", "In this example, we're going to compare two arrays that total about 3 GiB.\n", "However, because they're lazily defined, the whole comparison will use only a few MiB of RAM and will run on all available threads:" ] }, { "cell_type": "code", "execution_count": null, "id": "6e52dd2a-b565-4aee-8aa8-52c4a41e8914", "metadata": {}, "outputs": [], "source": [ "import sys\n", "\n", "sys.path.insert(0, \"..\")\n", "\n", "import dask.array as da\n", "import xarray\n", "\n", "from recursive_diff import display_diffs\n", "\n", "a = xarray.DataArray(da.ones((200_000, 1_000)), name=\"ones\")\n", "b = xarray.DataArray(da.ones((200_000, 1_000)), name=\"ones\")\n", "a[123_456, 789] = 1.01\n", "b[133_700, 333] = 1.0000000001  # Below tolerance\n", "\n", "display_diffs(a, b)" ] }, { "cell_type": "markdown", "id": "bf4417c1-3989-4512-8ea0-d9b5ecf31ab8", "metadata": {}, "source": [ "## Dask clusters\n", "If you have a Dask client active and compare chunked Xarray objects, the comparison will run on the Dask cluster.\n", "\n", "In this example we're using a ``LocalCluster``, but this also works with remote clusters, including [Coiled](https://coiled.io) clusters!\n", "\n", "You may use {func}`xarray.open_zarr` or {func}`xarray.open_dataset` to open Zarr or NetCDF files on S3. If the cluster itself runs inside AWS, the workers read the data directly from S3, so even if your client is outside of AWS the data won't transfer over the internet and you won't pay egress charges.\n", "S3 access is not yet supported by {func}`~recursive_diff.recursive_open`."
] }, { "cell_type": "code", "execution_count": null, "id": "9205439e-3c2b-43d7-a512-b8a4e986ea27", "metadata": {}, "outputs": [], "source": [ "import dask.distributed\n", "\n", "with dask.distributed.LocalCluster() as cluster, dask.distributed.Client(cluster):\n", "    display_diffs(a, b)" ] }, { "cell_type": "code", "execution_count": null, "id": "5822b326-f3e0-4be0-a015-a734ddbc816d", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.14.3" } }, "nbformat": 4, "nbformat_minor": 5 }