Data Chunking with Dask#

In this notebook we demonstrate:

Xarray + Dask
NetCDF file Chunks versus Dask Chunks
chunk shapes

The following material uses Coupled Model Intercomparison Project (CMIP6) collections. Please see the data collection catalogue and CMIP6 terms of use for more information.

Slightly adapted from: https://github.com/NCI-data-analysis-platform/climate-cmip

Original Authors: NCI Virtual Research Environment Team
Keywords: CMIP6, Xarray, Dask, Chunks
Creation Date: 2019-June; Updated: 2020-May

Adaptation for DKRZ data pool: S. Kindermann

Load the required modules#

import xarray as xr
import netCDF4 as nc
import time
%matplotlib inline

Data#

In this example we use daily precipitation (`pr`) data from the ACCESS-CM2 model for the SSP5-8.5 scenario (`ssp585`). Let’s take a look at the file metadata:

# On the DKRZ system: the netcdf module must be loaded

!ncdump -hst '/pool/data/CMIP6/data/ScenarioMIP/CSIRO-ARCCSS/ACCESS-CM2/ssp585/r1i1p1f1/day/pr/gn/v20191108/pr_day_ACCESS-CM2_ssp585_r1i1p1f1_gn_20150101-20641231.nc'

netcdf pr_day_ACCESS-CM2_ssp585_r1i1p1f1_gn_20150101-20641231 {
dimensions:
	time = UNLIMITED ; // (18263 currently)
	lat = 144 ;
	lon = 192 ;
	bnds = 2 ;
variables:
	double time(time) ;
		time:bounds = "time_bnds" ;
		time:units = "days since 1850-01-01" ;
		time:calendar = "proleptic_gregorian" ;
		time:axis = "T" ;
		time:long_name = "time" ;
		time:standard_name = "time" ;
		time:_Storage = "chunked" ;
		time:_ChunkSizes = 1 ;
		time:_Endianness = "little" ;
	double time_bnds(time, bnds) ;
		time_bnds:_Storage = "chunked" ;
		time_bnds:_ChunkSizes = 1, 2 ;
		time_bnds:_DeflateLevel = 1 ;
		time_bnds:_Endianness = "little" ;
	double lat(lat) ;
		lat:bounds = "lat_bnds" ;
		lat:units = "degrees_north" ;
		lat:axis = "Y" ;
		lat:long_name = "Latitude" ;
		lat:standard_name = "latitude" ;
		lat:_Storage = "contiguous" ;
		lat:_Endianness = "little" ;
	double lat_bnds(lat, bnds) ;
		lat_bnds:_Storage = "chunked" ;
		lat_bnds:_ChunkSizes = 144, 2 ;
		lat_bnds:_DeflateLevel = 1 ;
		lat_bnds:_Endianness = "little" ;
	double lon(lon) ;
		lon:bounds = "lon_bnds" ;
		lon:units = "degrees_east" ;
		lon:axis = "X" ;
		lon:long_name = "Longitude" ;
		lon:standard_name = "longitude" ;
		lon:_Storage = "contiguous" ;
		lon:_Endianness = "little" ;
	double lon_bnds(lon, bnds) ;
		lon_bnds:_Storage = "chunked" ;
		lon_bnds:_ChunkSizes = 192, 2 ;
		lon_bnds:_DeflateLevel = 1 ;
		lon_bnds:_Endianness = "little" ;
	float pr(time, lat, lon) ;
		pr:standard_name = "precipitation_flux" ;
		pr:long_name = "Precipitation" ;
		pr:comment = "includes both liquid and solid phases" ;
		pr:units = "kg m-2 s-1" ;
		pr:cell_methods = "area: time: mean" ;
		pr:cell_measures = "area: areacella" ;
		pr:history = "2019-11-08T10:45:49Z altered by CMOR: replaced missing value flag (-1.07374e+09) with standard missing value (1e+20)." ;
		pr:missing_value = 1.e+20f ;
		pr:_FillValue = 1.e+20f ;
		pr:_Storage = "chunked" ;
		pr:_ChunkSizes = 1, 144, 192 ;
		pr:_DeflateLevel = 1 ;
		pr:_Endianness = "little" ;

// global attributes:
		:Conventions = "CF-1.7 CMIP-6.2" ;
		:activity_id = "ScenarioMIP" ;
		:branch_method = "standard" ;
		:branch_time_in_child = 60265. ;
		:branch_time_in_parent = 60265. ;
		:creation_date = "2019-11-08T10:45:50Z" ;
		:data_specs_version = "01.00.30" ;
		:experiment = "update of RCP8.5 based on SSP5" ;
		:experiment_id = "ssp585" ;
		:external_variables = "areacella" ;
		:forcing_index = 1 ;
		:frequency = "day" ;
		:further_info_url = "https://furtherinfo.es-doc.org/CMIP6.CSIRO-ARCCSS.ACCESS-CM2.ssp585.none.r1i1p1f1" ;
		:grid = "native atmosphere N96 grid (144x192 latxlon)" ;
		:grid_label = "gn" ;
		:history = "2019-11-08T10:45:50Z ; CMOR rewrote data to be consistent with CMIP6, CF-1.7 CMIP-6.2 and CF standards." ;
		:initialization_index = 1 ;
		:institution = "CSIRO (Commonwealth Scientific and Industrial Research Organisation, Aspendale, Victoria 3195, Australia), ARCCSS (Australian Research Council Centre of Excellence for Climate System Science)" ;
		:institution_id = "CSIRO-ARCCSS" ;
		:mip_era = "CMIP6" ;
		:nominal_resolution = "250 km" ;
		:notes = "Exp: CM2-ssp585; Local ID: bk786; Variable: pr ([\'fld_s05i216\'])" ;
		:parent_activity_id = "CMIP" ;
		:parent_experiment_id = "historical" ;
		:parent_mip_era = "CMIP6" ;
		:parent_source_id = "ACCESS-CM2" ;
		:parent_time_units = "days since 1850-01-01" ;
		:parent_variant_label = "r1i1p1f1" ;
		:physics_index = 1 ;
		:product = "model-output" ;
		:realization_index = 1 ;
		:realm = "atmos" ;
		:run_variant = "forcing: GHG, Oz, SA, Sl, Vl, BC, OC, (GHG = CO2, N2O, CH4, CFC11, CFC12, CFC113, HCFC22, HFC125, HFC134a)" ;
		:source = "ACCESS-CM2 (2019): \n",
			"aerosol: UKCA-GLOMAP-mode\n",
			"atmos: MetUM-HadGEM3-GA7.1 (N96; 192 x 144 longitude/latitude; 85 levels; top level 85 km)\n",
			"atmosChem: none\n",
			"land: CABLE2.5\n",
			"landIce: none\n",
			"ocean: ACCESS-OM2 (GFDL-MOM5, tripolar primarily 1deg; 360 x 300 longitude/latitude; 50 levels; top grid cell 0-10 m)\n",
			"ocnBgchem: none\n",
			"seaIce: CICE5.1.2 (same grid as ocean)" ;
		:source_id = "ACCESS-CM2" ;
		:source_type = "AOGCM" ;
		:sub_experiment = "none" ;
		:sub_experiment_id = "none" ;
		:table_id = "day" ;
		:table_info = "Creation Date:(30 April 2019) MD5:e14f55f257cceafb2523e41244962371" ;
		:title = "ACCESS-CM2 output prepared for CMIP6" ;
		:variable_id = "pr" ;
		:variant_label = "r1i1p1f1" ;
		:version = "v20191108" ;
		:cmor_version = "3.4.0" ;
		:tracking_id = "hdl:21.14100/1cade23c-cf5e-4d0e-96f9-4128cd729af7" ;
		:license = "CMIP6 model data produced by CSIRO is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (https://creativecommons.org/licenses/).  Consult https://pcmdi.llnl.gov/CMIP6/TermsOfUse for terms of use governing CMIP6 output, including citation requirements and proper acknowledgment.  Further information about this data, including some limitations, can be found via the further_info_url (recorded as a global attribute in this file).  The data producers and data providers make no warranty, either express or implied, including, but not limited to, warranties of merchantability and fitness for a particular purpose. All liabilities arising from the supply of the information (including any liability arising in negligence) are excluded to the fullest extent permitted by law." ;
		:_NCProperties = "version=2,netcdf=4.6.2,hdf5=1.10.5" ;
		:_SuperblockVersion = 0 ;
		:_IsNetcdf4 = 0 ;
		:_Format = "netCDF-4 classic model" ;
}

# From any other system: access via THREDDS (NCI ESGF node)

!ncdump -hst 'https://esgf.nci.org.au/thredds/dodsC/master/CMIP6/ScenarioMIP/CSIRO-ARCCSS/ACCESS-CM2/ssp585/r1i1p1f1/day/pr/gn/v20191108/pr_day_ACCESS-CM2_ssp585_r1i1p1f1_gn_20150101-20641231.nc'

netcdf pr_day_ACCESS-CM2_ssp585_r1i1p1f1_gn_20150101-20641231 {
dimensions:
	time = UNLIMITED ; // (18263 currently)
	bnds = 2 ;
	lat = 144 ;
	lon = 192 ;
variables:
	double time(time) ;
		time:bounds = "time_bnds" ;
		time:units = "days since 1850-01-01" ;
		time:calendar = "proleptic_gregorian" ;
		time:axis = "T" ;
		time:long_name = "time" ;
		time:standard_name = "time" ;
		time:_ChunkSizes = 1 ; // "1850-01-02"
	double time_bnds(time, bnds) ;
		time_bnds:_ChunkSizes = 1, 2 ;
	double lat(lat) ;
		lat:bounds = "lat_bnds" ;
		lat:units = "degrees_north" ;
		lat:axis = "Y" ;
		lat:long_name = "Latitude" ;
		lat:standard_name = "latitude" ;
	double lat_bnds(lat, bnds) ;
		lat_bnds:_ChunkSizes = 144, 2 ;
	double lon(lon) ;
		lon:bounds = "lon_bnds" ;
		lon:units = "degrees_east" ;
		lon:axis = "X" ;
		lon:long_name = "Longitude" ;
		lon:standard_name = "longitude" ;
	double lon_bnds(lon, bnds) ;
		lon_bnds:_ChunkSizes = 192, 2 ;
	float pr(time, lat, lon) ;
		pr:standard_name = "precipitation_flux" ;
		pr:long_name = "Precipitation" ;
		pr:comment = "includes both liquid and solid phases" ;
		pr:units = "kg m-2 s-1" ;
		pr:cell_methods = "area: time: mean" ;
		pr:cell_measures = "area: areacella" ;
		pr:history = "2019-11-08T10:45:49Z altered by CMOR: replaced missing value flag (-1.07374e+09) with standard missing value (1e+20)." ;
		pr:missing_value = 1.e+20f ;
		pr:_FillValue = 1.e+20f ;
		pr:_ChunkSizes = 1, 144, 192 ;

// global attributes:
		:Conventions = "CF-1.7 CMIP-6.2" ;
		:activity_id = "ScenarioMIP" ;
		:branch_method = "standard" ;
		:branch_time_in_child = 60265. ;
		:branch_time_in_parent = 60265. ;
		:creation_date = "2019-11-08T10:45:50Z" ;
		:data_specs_version = "01.00.30" ;
		:experiment = "update of RCP8.5 based on SSP5" ;
		:experiment_id = "ssp585" ;
		:external_variables = "areacella" ;
		:forcing_index = 1 ;
		:frequency = "day" ;
		:further_info_url = "https://furtherinfo.es-doc.org/CMIP6.CSIRO-ARCCSS.ACCESS-CM2.ssp585.none.r1i1p1f1" ;
		:grid = "native atmosphere N96 grid (144x192 latxlon)" ;
		:grid_label = "gn" ;
		:history = "2019-11-08T10:45:50Z ; CMOR rewrote data to be consistent with CMIP6, CF-1.7 CMIP-6.2 and CF standards." ;
		:initialization_index = 1 ;
		:institution = "CSIRO (Commonwealth Scientific and Industrial Research Organisation, Aspendale, Victoria 3195, Australia), ARCCSS (Australian Research Council Centre of Excellence for Climate System Science)" ;
		:institution_id = "CSIRO-ARCCSS" ;
		:mip_era = "CMIP6" ;
		:nominal_resolution = "250 km" ;
		:notes = "Exp: CM2-ssp585; Local ID: bk786; Variable: pr ([\'fld_s05i216\'])" ;
		:parent_activity_id = "CMIP" ;
		:parent_experiment_id = "historical" ;
		:parent_mip_era = "CMIP6" ;
		:parent_source_id = "ACCESS-CM2" ;
		:parent_time_units = "days since 1850-01-01" ;
		:parent_variant_label = "r1i1p1f1" ;
		:physics_index = 1 ;
		:product = "model-output" ;
		:realization_index = 1 ;
		:realm = "atmos" ;
		:run_variant = "forcing: GHG, Oz, SA, Sl, Vl, BC, OC, (GHG = CO2, N2O, CH4, CFC11, CFC12, CFC113, HCFC22, HFC125, HFC134a)" ;
		:source = "ACCESS-CM2 (2019): \n",
			"aerosol: UKCA-GLOMAP-mode\n",
			"atmos: MetUM-HadGEM3-GA7.1 (N96; 192 x 144 longitude/latitude; 85 levels; top level 85 km)\n",
			"atmosChem: none\n",
			"land: CABLE2.5\n",
			"landIce: none\n",
			"ocean: ACCESS-OM2 (GFDL-MOM5, tripolar primarily 1deg; 360 x 300 longitude/latitude; 50 levels; top grid cell 0-10 m)\n",
			"ocnBgchem: none\n",
			"seaIce: CICE5.1.2 (same grid as ocean)" ;
		:source_id = "ACCESS-CM2" ;
		:source_type = "AOGCM" ;
		:sub_experiment = "none" ;
		:sub_experiment_id = "none" ;
		:table_id = "day" ;
		:table_info = "Creation Date:(30 April 2019) MD5:e14f55f257cceafb2523e41244962371" ;
		:title = "ACCESS-CM2 output prepared for CMIP6" ;
		:variable_id = "pr" ;
		:variant_label = "r1i1p1f1" ;
		:version = "v20191108" ;
		:cmor_version = "3.4.0" ;
		:tracking_id = "hdl:21.14100/1cade23c-cf5e-4d0e-96f9-4128cd729af7" ;
		:license = "CMIP6 model data produced by CSIRO is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (https://creativecommons.org/licenses/).  Consult https://pcmdi.llnl.gov/CMIP6/TermsOfUse for terms of use governing CMIP6 output, including citation requirements and proper acknowledgment.  Further information about this data, including some limitations, can be found via the further_info_url (recorded as a global attribute in this file).  The data producers and data providers make no warranty, either express or implied, including, but not limited to, warranties of merchantability and fitness for a particular purpose. All liabilities arising from the supply of the information (including any liability arising in negligence) are excluded to the fullest extent permitted by law." ;
		:DODS_EXTRA.Unlimited_Dimension = "time" ;
		:_NCProperties = "version=2,netcdf=4.6.2,hdf5=1.10.5" ;
		:_Format = "classic" ;
}

Xarray + Dask#

Xarray can automatically wrap its data into Dask arrays. This capability turns Xarray into an extremely powerful tool when working with big earth science datasets.

To see this in action, we will open a fairly large dataset to analyze. We use Xarray’s open_mfdataset, which opens multiple files and combines them into a single dataset.

# On DKRZ system

!ls /pool/data/CMIP6/data/ScenarioMIP/CSIRO-ARCCSS/ACCESS-CM2/ssp585/r1i1p1f1/day/pr/gn/v20191108
path = '/pool/data/CMIP6/data/ScenarioMIP/CSIRO-ARCCSS/ACCESS-CM2/ssp585/r1i1p1f1/day/pr/gn/v20191108/*'

pr_day_ACCESS-CM2_ssp585_r1i1p1f1_gn_20150101-20641231.nc
pr_day_ACCESS-CM2_ssp585_r1i1p1f1_gn_20650101-21001231.nc

f_ssp585 = xr.open_mfdataset(path, combine='by_coords')
f_ssp585

<xarray.Dataset>
Dimensions:    (time: 31411, bnds: 2, lat: 144, lon: 192)
Coordinates:
  * time       (time) datetime64[ns] 2015-01-01T12:00:00 ... 2100-12-31T12:00:00
  * lat        (lat) float64 -89.38 -88.12 -86.88 -85.62 ... 86.88 88.12 89.38
  * lon        (lon) float64 0.9375 2.812 4.688 6.562 ... 355.3 357.2 359.1
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) datetime64[ns] dask.array<chunksize=(18263, 2), meta=np.ndarray>
    lat_bnds   (time, lat, bnds) float64 dask.array<chunksize=(18263, 144, 2), meta=np.ndarray>
    lon_bnds   (time, lon, bnds) float64 dask.array<chunksize=(18263, 192, 2), meta=np.ndarray>
    pr         (time, lat, lon) float32 dask.array<chunksize=(18263, 144, 192), meta=np.ndarray>
Attributes: (12/47)
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            ScenarioMIP
    branch_method:          standard
    branch_time_in_child:   60265.0
    branch_time_in_parent:  60265.0
    creation_date:          2019-11-08T10:45:50Z
    ...                     ...
    variable_id:            pr
    variant_label:          r1i1p1f1
    version:                v20191108
    cmor_version:           3.4.0
    tracking_id:            hdl:21.14100/1cade23c-cf5e-4d0e-96f9-4128cd729af7
    license:                CMIP6 model data produced by CSIRO is licensed un...

NOTE: the values are not displayed, since that would trigger computation.
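
To illustrate the note above, here is a minimal sketch of lazy evaluation. It uses a small synthetic Dask array in place of `pr` (the shape and chunk size are made up for illustration). Building an expression only records a task graph; values are computed when `.compute()` is called.

```python
import dask.array as da

# Synthetic stand-in for `pr`: 100 time steps on a 144 x 192 grid,
# split into Dask chunks of 10 time steps each.
x = da.ones((100, 144, 192), chunks=(10, 144, 192), dtype="float32")

# Still lazy: this is a Dask array describing the computation, not a number.
lazy_mean = x.mean()
print(type(lazy_mean).__name__)

# Triggers the actual computation across all chunks.
result = float(lazy_mean.compute())
print(result)
```

The same applies to the Xarray dataset: displaying `f_ssp585` shows only metadata, while calling `.compute()` or `.load()` on a variable reads the data.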

Chunks#

Notice that the `pr` variable is now shown as `pr (time, lat, lon) float32 dask.array<chunksize=(18263, 144, 192), meta=np.ndarray>`. The data array has become a Dask array, and its description includes a chunksize component.

The chunking of the array comes from the integration of Dask with Xarray. Dask divides the data array into small pieces called “chunks”, with each chunk designed to be small enough to fit into memory.

The file itself may already be chunked. Chunked storage at the file level is available in netCDF-4 and HDF5 files. The CMIP6 data should all be netCDF-4 and include some form of chunking for each file.

Looking at the file metadata in the “Data” section above, we see that in this case the file is chunked such that:

`pr:_ChunkSizes = 1, 144, 192 ;`

Here we see that the data is chunked along time but not in space: each chunk holds a single time step and all points in lat-lon.

image source: https://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters

Consider 2 types of data access

Accessing a 2D lat-lon slice in time (RHS figure)
Accessing a time series at a single lat-lon point (LHS figure)

With the chunking above, the first type of data access only requires a single chunk, while the second type has to read every chunk of the data array, even though it uses only one value from each. This dataset will therefore be fastest for 2D lat-lon access at a single time step.
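
The difference between the two access types can be counted directly. The sketch below is illustrative only: the helper `chunks_touched` is a hypothetical name, and the grid-point indices (70, 90) are arbitrary. It counts how many stored chunks a rectangular selection intersects, using the chunk shape and the time length of the first file from the metadata above.

```python
from math import prod


def chunks_touched(selection, chunk_shape):
    """Count the chunks intersected by a rectangular selection.

    selection: one (start, stop) index pair per dimension, stop exclusive.
    chunk_shape: chunk length along each dimension.
    """
    per_dim = []
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size        # index of the first chunk touched
        last = (stop - 1) // size    # index of the last chunk touched
        per_dim.append(last - first + 1)
    return prod(per_dim)


file_chunks = (1, 144, 192)  # pr:_ChunkSizes
n_time = 18263               # time steps in the first file

# One full lat-lon map at a single time step.
spatial = chunks_touched([(0, 1), (0, 144), (0, 192)], file_chunks)

# Full time series at a single (arbitrary) grid point.
series = chunks_touched([(0, n_time), (70, 71), (90, 91)], file_chunks)

print(spatial, series)
```

Each stored chunk holds 144 x 192 = 27648 values, so the time-series access has to read and decompress 27648 values for every single value it keeps.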

In general, even without chunking, when the data is stored contiguously (in index order), time is the slowest-varying dimension, then y (lat), with x (lon) being the fastest. With the chunking of this CMIP6 dataset, time remains the slowest dimension to access. More uniform access speeds across dimensions would require more evenly shaped chunks.
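
The ordering argument can be made concrete with a short calculation. Assuming a contiguous row-major (C-order) layout of `pr(time, lat, lon)`, the sketch below computes how far apart, in elements, neighbouring values are along each dimension. The helper name `c_order_strides` is made up for illustration.

```python
def c_order_strides(shape):
    """Element strides for a contiguous row-major array of the given shape."""
    strides = []
    step = 1
    # Walk from the last (fastest-varying) dimension to the first.
    for length in reversed(shape):
        strides.append(step)
        step *= length
    return tuple(reversed(strides))


shape = (18263, 144, 192)  # (time, lat, lon) of the first file
strides = c_order_strides(shape)
print(strides)  # (27648, 192, 1)
```

Neighbouring longitudes sit next to each other, neighbouring latitudes are 192 elements apart, and consecutive time steps at the same grid point are 27648 elements apart, which is why a time series is the most scattered read.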

The same volume of data can take orders of magnitude longer to load#

The spatial selection contained 27648 data points (144 x 192) and took on the order of 100 ms to load. The time-series selection contained 31411 data points and took on the order of 10,000 ms to load.

NOTE: If you look at the dashboard, the task stream actually shows that the most time consuming part is data concatenation.

Chunking and the way in which data is read are important to consider, both for how you access the data and for whether you wish to parallelise your code.

NetCDF file Chunks versus Dask Chunks#

Keep in mind that Dask chunking is different from the chunking of the stored data. As we saw in our example, the stored data is chunked with chunks of shape (1, 144, 192), whereas the Dask array has a chunk shape of (18263, 144, 192). The chunk size of the Dask array can be changed. In the example below, we specify that each Dask chunk spans 730 time steps.

f_ssp585 = xr.open_mfdataset(path, chunks={'time': 730}, combine='by_coords')

f_ssp585

<xarray.Dataset>
Dimensions:    (time: 31411, bnds: 2, lat: 144, lon: 192)
Coordinates:
  * time       (time) datetime64[ns] 2015-01-01T12:00:00 ... 2100-12-31T12:00:00
  * lat        (lat) float64 -89.38 -88.12 -86.88 -85.62 ... 86.88 88.12 89.38
  * lon        (lon) float64 0.9375 2.812 4.688 6.562 ... 355.3 357.2 359.1
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) datetime64[ns] dask.array<chunksize=(730, 2), meta=np.ndarray>
    lat_bnds   (time, lat, bnds) float64 dask.array<chunksize=(18263, 144, 2), meta=np.ndarray>
    lon_bnds   (time, lon, bnds) float64 dask.array<chunksize=(18263, 192, 2), meta=np.ndarray>
    pr         (time, lat, lon) float32 dask.array<chunksize=(730, 144, 192), meta=np.ndarray>
Attributes: (12/47)
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            ScenarioMIP
    branch_method:          standard
    branch_time_in_child:   60265.0
    branch_time_in_parent:  60265.0
    creation_date:          2019-11-08T10:45:50Z
    ...                     ...
    variable_id:            pr
    variant_label:          r1i1p1f1
    version:                v20191108
    cmor_version:           3.4.0
    tracking_id:            hdl:21.14100/1cade23c-cf5e-4d0e-96f9-4128cd729af7
    license:                CMIP6 model data produced by CSIRO is licensed un...
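
As a rough check on what `chunks={'time': 730}` means for `pr`, the following sketch estimates the number and size of the resulting Dask chunks. It uses only numbers from the outputs above and assumes that chunks are laid out per file, so that each file ends with a shorter remainder chunk; sizes are uncompressed, in-memory sizes.

```python
from math import ceil

total_steps = 31411                # time dimension of the combined dataset
first_file_steps = 18263           # time dimension of the 2015-2064 file
steps_per_file = [first_file_steps, total_steps - first_file_steps]

time_chunk = 730                   # from chunks={'time': 730}
bytes_per_step = 144 * 192 * 4     # one float32 lat-lon field

# Assumption: chunking is applied to each file separately.
n_chunks = sum(ceil(n / time_chunk) for n in steps_per_file)

chunk_mib = time_chunk * bytes_per_step / 2**20          # one 730-step chunk
default_gib = first_file_steps * bytes_per_step / 2**30  # one chunk per file

print(n_chunks)
print(round(chunk_mib, 1))
print(round(default_gib, 2))
```

Under these assumptions, `pr` is split into a few dozen chunks of roughly 77 MiB each, instead of a first chunk of nearly 2 GiB that covers the whole first file.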

Poor chunking with dask can make your performance worse!#

As you can see, bad chunks and the alignment of the chunks slow down the I/O performance significantly. They are both important to keep in mind when creating Dask chunks.
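
One simple way to check alignment before opening the data is to test whether each Dask chunk length is a whole multiple of the corresponding file chunk length. The helper `is_aligned` below is a hypothetical name, and this is a minimal sketch rather than a complete check (it ignores the shorter remainder chunk at the end of a dimension).

```python
def is_aligned(dask_chunks, file_chunks):
    """True if every Dask chunk length is a multiple of the file chunk length."""
    return all(d % f == 0 for d, f in zip(dask_chunks, file_chunks))


file_chunks = (1, 144, 192)  # pr:_ChunkSizes from the file metadata

# 730 time steps per Dask chunk, full lat-lon grid: whole file chunks only.
aligned = is_aligned((730, 144, 192), file_chunks)

# Splitting latitude in half cuts through every stored chunk.
misaligned = is_aligned((730, 72, 192), file_chunks)

print(aligned, misaligned)
```

In the second case every stored chunk would be needed by two different Dask chunks, so it would likely be read and decompressed twice.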

Close the client#

Before moving on to the next exercise, make sure to close your client or stop this kernel.

# Assumes `client` is the dask.distributed Client started earlier in this session
client.close()

Summary#

This example shows how to chunk data with Dask and how Dask chunks relate to the chunks stored in the netCDF file.

For further information regarding Dask, please see: https://docs.dask.org/en/latest/
