CLI tool

Compare either two files or all files in two directories. For supported file formats, see open().

Usage

usage: recursive-diff [-h] [--quiet]
                      [--recursive] [--match PATTERN [PATTERN ...]]
                      [--format {json,jsonl,msgpack,yaml,yml,netcdf,nc,zarr}]
                      [--rtol RTOL] [--atol ATOL]
                      [--brief_dims DIM [DIM ...] | --brief]
                      [--engine {netcdf4,h5netcdf,scipy,pydap}]
                      lhs rhs

Compare either two data files or all data files in two directories.

positional arguments:
  lhs                   Left-hand-side file or (if --recursive) directory
  rhs                   Right-hand-side file or (if --recursive) directory

options:
  -h, --help            show this help message and exit
  --quiet, -q           Suppress logging
  --recursive, -r       Compare all files with matching names in two directories
  --match, -m PATTERN [PATTERN ...]
                        Bash wildcard patterns for file names when using --recursive
                        (default: **/*.json **/*.jsonl **/*.msgpack **/*.yaml **/*.yml
                        **/*.nc **/*.zarr)
  --format {json,jsonl,msgpack,yaml,yml,netcdf,nc,zarr}
                        File format (default: infer from file extension)
  --rtol RTOL           Relative comparison tolerance (default: 1e-9)
  --atol ATOL           Absolute comparison tolerance (default: 0)
  --brief_dims DIM [DIM ...]
                        Just count differences along one or more dimensions instead of
                        printing them out individually
  --brief, -b           Just count differences for every variable instead of printing
                        them out individually
  --engine, -e {netcdf4,h5netcdf,scipy,pydap}
                        netCDF engine (default: first available)

Examples:

Compare two files:
  recursive-diff a.json b.json
Compare all files with identical names in two directories:
  recursive-diff -r dir1 dir2
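
When the tool is driven from a script, the flags listed above can be assembled into an argument list. The following is a minimal sketch, not part of recursive-diff itself: the helper names build_command and run are hypothetical, and only flags shown in the usage text are used. The positional arguments are placed first because --match accepts one or more patterns and would otherwise absorb lhs and rhs.

```python
import shutil
import subprocess


def build_command(lhs, rhs, *, recursive=False, match=(),
                  rtol=None, atol=None, brief=False):
    """Build an argv list for recursive-diff (hypothetical helper)."""
    # Positionals go first so that --match (one or more patterns)
    # cannot swallow them.
    cmd = ["recursive-diff", str(lhs), str(rhs)]
    if recursive:
        cmd.append("--recursive")
    if rtol is not None:
        cmd += ["--rtol", str(rtol)]
    if atol is not None:
        cmd += ["--atol", str(atol)]
    if brief:
        cmd.append("--brief")
    if match:
        cmd += ["--match", *match]
    return cmd


def run(cmd):
    """Run the command if the executable is on PATH; return None otherwise."""
    if shutil.which(cmd[0]) is None:
        return None
    # check=False: this sketch makes no assumption about exit codes.
    return subprocess.run(cmd, check=False)
```

For example, build_command("dir1", "dir2", recursive=True, match=["**/*.nc"], brief=True) produces the equivalent of: recursive-diff dir1 dir2 --recursive --brief --match '**/*.nc'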

Memory design

If Dask is installed:

For netCDF and Zarr files, the tool loads one pair of matching Dask chunks into RAM at a time, compares them, and then discards them. Dask chunks are automatically sized to 128 MiB, or to the native netCDF/Zarr chunks on disk when those are larger. JSON, JSONL, YAML, and MessagePack files are loaded one pair of files at a time, compared, and then discarded; chunking JSONL files is not supported. You may end up with as many pairs of chunks or files in RAM at once as there are CPUs available (or more if chunks are misaligned).

If Dask is not installed:

The tool fully loads a pair of netCDF/Zarr variables into RAM at once, compares them, and then discards them. Native chunks are not used. JSON, JSONL, YAML, and MessagePack files are loaded all at once eagerly.
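
The two modes above can be compared with back-of-the-envelope arithmetic. The sketch below is a rough lower bound derived only from the figures stated in this section (128 MiB chunks, one pair per CPU, one pair of variables without Dask). The function names are hypothetical, and real usage will be higher because of misaligned chunks and temporary arrays created during comparison.

```python
import os
from typing import Optional

MIB = 2**20
DASK_CHUNK_TARGET = 128 * MIB  # chunk size stated in this section


def dask_floor_bytes(native_chunk_bytes: int = 0,
                     n_cpus: Optional[int] = None) -> int:
    """Rough floor with Dask: one pair of chunks per CPU."""
    n_cpus = n_cpus or os.cpu_count() or 1
    # Native on-disk chunks are used when they exceed the 128 MiB target.
    chunk = max(DASK_CHUNK_TARGET, native_chunk_bytes)
    return n_cpus * 2 * chunk


def eager_floor_bytes(variable_bytes: int) -> int:
    """Rough floor without Dask: one full pair of variables."""
    return 2 * variable_bytes


print(dask_floor_bytes(n_cpus=8) // MIB)         # 2048 (MiB)
print(eager_floor_bytes(10 * 1024 * MIB) // MIB)  # 20480 (MiB)
```

Under these assumptions, an 8-CPU machine needs at least about 2 GiB with Dask regardless of variable size, whereas comparing a pair of 10 GiB variables without Dask needs at least about 20 GiB.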

Limitations

  • Doesn’t compare netCDF/Zarr settings or metadata, such as store version, compression, or chunking.

  • Doesn’t support netCDF/Zarr indices with duplicate elements

  • Treats netCDF datasets split across multiple files (typically created by Dask) as individual files. This can be slow, as there is no option to avoid repeatedly loading variables that don’t sit on the concat_dim. It also means that it can’t compare two datasets that differ only by file chunking. See also xarray#2039.

  • Can’t compare file sets that are grouped by prefix rather than by root directory (e.g. foo.*.json vs. bar.*.json).
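
One possible workaround for the prefix limitation is to stage each file set into its own directory with the prefix stripped, so that matching files end up with identical names. This is a sketch under assumptions, not a feature of the tool: stage_by_prefix and stage_pair are hypothetical helpers, files are copied rather than linked, and only regular files are handled (so directory-based Zarr stores are skipped).

```python
import shutil
import tempfile
from pathlib import Path


def stage_by_prefix(src: Path, prefix: str, dest: Path) -> list:
    """Copy src/<prefix>.* into dest with the prefix removed."""
    dest.mkdir(parents=True, exist_ok=True)
    names = []
    for path in sorted(src.glob(f"{prefix}.*")):
        if path.is_file():
            # "foo.1.json" -> "1.json"
            name = path.name[len(prefix) + 1:]
            shutil.copy2(path, dest / name)
            names.append(name)
    return names


def stage_pair(src: Path, lhs_prefix: str, rhs_prefix: str) -> tuple:
    """Stage both file sets under a fresh temporary root.

    The caller is responsible for deleting the root afterwards.
    """
    root = Path(tempfile.mkdtemp(prefix="recursive-diff-"))
    lhs, rhs = root / "lhs", root / "rhs"
    stage_by_prefix(src, lhs_prefix, lhs)
    stage_by_prefix(src, rhs_prefix, rhs)
    return lhs, rhs
```

After lhs, rhs = stage_pair(Path("data"), "foo", "bar"), the two returned directories can be passed to recursive-diff -r as lhs and rhs.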