3.5 Climate Model Data on Google Cloud

Note

The code for downloading climate model data in this section was adapted from Section 6.2 of Anderson and Smith [2021].

The ESGF offers the complete archive of CMIP5, CMIP6, and CORDEX climate model output data, but unfortunately, it can be prone to errors that make downloading the data cumbersome. If you're running into difficulties accessing the data using the methods described in Section 3.4, it's worth checking whether the data you want are available in the Google Cloud Services (GCS) archive of CMIP data, which you can navigate by following this link. Accessing the data on GCS is similar to using OPeNDAP: you can subset the data on the remote end and transfer only what you need to your own computer for analysis.

The downside of the GCS archive is that it isn't as complete as the ESGF archive. For many models, data are only available at monthly temporal frequency, which isn't very useful for downscaling. Another minor downside is that you'll need to learn a new file format. While netCDF is the standard for climate model output data, GCS uses a cloud-optimized format called zarr. As the focus of scientific computing shifts more and more towards the cloud, zarr has become increasingly popular as the format for climate data hosted on remote servers. Fortunately, xarray supports the zarr format, and once you've opened a store with xr.open_zarr, you can work with it like you would any other xr.Dataset or xr.DataArray.
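If you've never worked with zarr before, the following minimal sketch (using a made-up file name) does a local round-trip, just to illustrate that a zarr store behaves like any other xarray dataset once opened:

import numpy as np
import xarray as xr

# create a small in-memory dataset and write it to a zarr store
# (note that 'example.zarr' is a directory, not a single file)
ds_demo = xr.Dataset({'tas': (('time', 'lat'), np.random.rand(4, 3))})
ds_demo.to_zarr('example.zarr', mode = 'w')

# re-open it: the result is a lazily-loaded xr.Dataset, just like with netCDF
ds_back = xr.open_zarr('example.zarr')
print(ds_back.tas.mean().values)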

3.5.1 Searching For Data

Similar to the ESGF, we can search the GCS climate data catalog in Python to find the files we want. We'll do this using pandas to read the CSV file that contains the data catalog, and then filter it to isolate the access URLs for the desired files.

import xarray as xr
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import gcsfs
import zarr

# url for the CSV file that contains the data catalog
url_catalog = 'https://storage.googleapis.com/cmip6/cmip6-zarr-consolidated-stores.csv'
# open the data catalog with pandas, and take a peek at how it's formatted
df_catalog = pd.read_csv(url_catalog)
print(df_catalog.columns)
df_catalog.head()
Index(['activity_id', 'institution_id', 'source_id', 'experiment_id',
       'member_id', 'table_id', 'variable_id', 'grid_label', 'zstore',
       'dcpp_init_year', 'version'],
      dtype='object')
activity_id institution_id source_id experiment_id member_id table_id variable_id grid_label zstore dcpp_init_year version
0 HighResMIP CMCC CMCC-CM2-HR4 highresSST-present r1i1p1f1 Amon ps gn gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/... NaN 20170706
1 HighResMIP CMCC CMCC-CM2-HR4 highresSST-present r1i1p1f1 Amon rsds gn gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/... NaN 20170706
2 HighResMIP CMCC CMCC-CM2-HR4 highresSST-present r1i1p1f1 Amon rlus gn gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/... NaN 20170706
3 HighResMIP CMCC CMCC-CM2-HR4 highresSST-present r1i1p1f1 Amon rlds gn gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/... NaN 20170706
4 HighResMIP CMCC CMCC-CM2-HR4 highresSST-present r1i1p1f1 Amon psl gn gs://cmip6/CMIP6/HighResMIP/CMCC/CMCC-CM2-HR4/... NaN 20170706

As you can see, the GCS data catalog is formatted in a very similar way to the ESGF search interface. Each entry in the catalog specifies the:

  • Project name (activity_id)

  • Modeling centre (institution_id)

  • Model name (source_id)

  • Experiment type (experiment_id, e.g. historical, SSP5-8.5, etc.)

  • ID of the ensemble member (member_id)

  • Time frequency (table_id)

  • Variable name (variable_id)

  • Access URL for the zarr store (zstore)

Plus some other entries we won't worry about for now. You can use the pandas method df.query to search for particular values of each column, and return a new DataFrame that contains only the entries that meet your search criteria.
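For instance, since the GCS archive is less complete than the ESGF's, a quick query like the one below is a handy way to check which models actually provide a given variable at daily frequency before settling on one:

# list the models that provide daily precipitation data
df_pr_day = df_catalog.query("table_id == 'day' & variable_id == 'pr'")
print(sorted(df_pr_day.source_id.unique()))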

Recall from Section 3.4 that the model variable names can be a bit cryptic. The full table of long variable names, matched to the short names (which you include in the search), can be found here.
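For quick reference, here are a few of the short names most relevant to downscaling, written out as a small lookup table of our own:

# a few common CMIP6 variable short names and their meanings
common_variables = {'tas': 'near-surface (2 m) air temperature',
                    'tasmax': 'daily maximum near-surface air temperature',
                    'tasmin': 'daily minimum near-surface air temperature',
                    'pr': 'precipitation flux',
                    'psl': 'sea level air pressure'}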

For example, let's search for daily precipitation data from the model MPI-ESM1-2-LR for both the historical experiment and the SSP2-4.5 future scenario.

# prepare the search criteria as a string
search_string = "table_id == 'day' & source_id == 'MPI-ESM1-2-LR' & variable_id == 'pr'" 
# continue on the next line
search_string += " & experiment_id == ['historical', 'ssp245']"
df_search = df_catalog.query(search_string)
df_search
activity_id institution_id source_id experiment_id member_id table_id variable_id grid_label zstore dcpp_init_year version
218364 ScenarioMIP MPI-M MPI-ESM1-2-LR ssp245 r10i1p1f1 day pr gn gs://cmip6/CMIP6/ScenarioMIP/MPI-M/MPI-ESM1-2-... NaN 20190710
218552 ScenarioMIP MPI-M MPI-ESM1-2-LR ssp245 r3i1p1f1 day pr gn gs://cmip6/CMIP6/ScenarioMIP/MPI-M/MPI-ESM1-2-... NaN 20190710
219088 ScenarioMIP MPI-M MPI-ESM1-2-LR ssp245 r1i1p1f1 day pr gn gs://cmip6/CMIP6/ScenarioMIP/MPI-M/MPI-ESM1-2-... NaN 20190710
219697 ScenarioMIP MPI-M MPI-ESM1-2-LR ssp245 r4i1p1f1 day pr gn gs://cmip6/CMIP6/ScenarioMIP/MPI-M/MPI-ESM1-2-... NaN 20190710
219930 ScenarioMIP MPI-M MPI-ESM1-2-LR ssp245 r2i1p1f1 day pr gn gs://cmip6/CMIP6/ScenarioMIP/MPI-M/MPI-ESM1-2-... NaN 20190710
220123 ScenarioMIP MPI-M MPI-ESM1-2-LR ssp245 r5i1p1f1 day pr gn gs://cmip6/CMIP6/ScenarioMIP/MPI-M/MPI-ESM1-2-... NaN 20190710
220331 ScenarioMIP MPI-M MPI-ESM1-2-LR ssp245 r6i1p1f1 day pr gn gs://cmip6/CMIP6/ScenarioMIP/MPI-M/MPI-ESM1-2-... NaN 20190710
220638 ScenarioMIP MPI-M MPI-ESM1-2-LR ssp245 r7i1p1f1 day pr gn gs://cmip6/CMIP6/ScenarioMIP/MPI-M/MPI-ESM1-2-... NaN 20190710
221400 ScenarioMIP MPI-M MPI-ESM1-2-LR ssp245 r9i1p1f1 day pr gn gs://cmip6/CMIP6/ScenarioMIP/MPI-M/MPI-ESM1-2-... NaN 20190710
221478 ScenarioMIP MPI-M MPI-ESM1-2-LR ssp245 r8i1p1f1 day pr gn gs://cmip6/CMIP6/ScenarioMIP/MPI-M/MPI-ESM1-2-... NaN 20190710
222076 CMIP MPI-M MPI-ESM1-2-LR historical r1i1p1f1 day pr gn gs://cmip6/CMIP6/CMIP/MPI-M/MPI-ESM1-2-LR/hist... NaN 20190710
224126 CMIP MPI-M MPI-ESM1-2-LR historical r10i1p1f1 day pr gn gs://cmip6/CMIP6/CMIP/MPI-M/MPI-ESM1-2-LR/hist... NaN 20190710
232949 CMIP MPI-M MPI-ESM1-2-LR historical r4i1p1f1 day pr gn gs://cmip6/CMIP6/CMIP/MPI-M/MPI-ESM1-2-LR/hist... NaN 20190710
234219 CMIP MPI-M MPI-ESM1-2-LR historical r3i1p1f1 day pr gn gs://cmip6/CMIP6/CMIP/MPI-M/MPI-ESM1-2-LR/hist... NaN 20190710
235140 CMIP MPI-M MPI-ESM1-2-LR historical r2i1p1f1 day pr gn gs://cmip6/CMIP6/CMIP/MPI-M/MPI-ESM1-2-LR/hist... NaN 20190710
235281 CMIP MPI-M MPI-ESM1-2-LR historical r5i1p1f1 day pr gn gs://cmip6/CMIP6/CMIP/MPI-M/MPI-ESM1-2-LR/hist... NaN 20190710
235391 CMIP MPI-M MPI-ESM1-2-LR historical r8i1p1f1 day pr gn gs://cmip6/CMIP6/CMIP/MPI-M/MPI-ESM1-2-LR/hist... NaN 20190710
235410 CMIP MPI-M MPI-ESM1-2-LR historical r6i1p1f1 day pr gn gs://cmip6/CMIP6/CMIP/MPI-M/MPI-ESM1-2-LR/hist... NaN 20190710
236223 CMIP MPI-M MPI-ESM1-2-LR historical r9i1p1f1 day pr gn gs://cmip6/CMIP6/CMIP/MPI-M/MPI-ESM1-2-LR/hist... NaN 20190710
236461 CMIP MPI-M MPI-ESM1-2-LR historical r7i1p1f1 day pr gn gs://cmip6/CMIP6/CMIP/MPI-M/MPI-ESM1-2-LR/hist... NaN 20190710

3.5.2 Accessing the Data

Excellent, we now have a DataFrame that has the descriptions and access URLs for the datasets we want. To access the data, we first need to authenticate ourselves using the gcsfs package. Since we're only accessing data available to the public, we can authenticate as an anonymous user; no account is needed here.

# authenticate access to Google Cloud
gcs = gcsfs.GCSFileSystem(token='anon')

# Get the path to a specific zarr store (the first one from the dataframe above)
zstore_url = df_search.zstore.values[0]
print(zstore_url)

# use the gcsfs package to turn the URL into an interface to the data set
mapper = gcs.get_mapper(zstore_url)

# now open the zarr store using xarray.
# "consolidated = True" tells xarray to read the store's consolidated metadata,
# which describes all variables in one place and makes opening the dataset faster
ds = xr.open_zarr(mapper, consolidated = True)
ds
gs://cmip6/CMIP6/ScenarioMIP/MPI-M/MPI-ESM1-2-LR/ssp245/r10i1p1f1/day/pr/gn/v20190710/
<xarray.Dataset> Size: 2GB
Dimensions:    (lat: 96, bnds: 2, lon: 192, time: 31411)
Coordinates:
  * lat        (lat) float64 768B -88.57 -86.72 -84.86 ... 84.86 86.72 88.57
    lat_bnds   (lat, bnds) float64 2kB dask.array<chunksize=(96, 2), meta=np.ndarray>
  * lon        (lon) float64 2kB 0.0 1.875 3.75 5.625 ... 354.4 356.2 358.1
    lon_bnds   (lon, bnds) float64 3kB dask.array<chunksize=(192, 2), meta=np.ndarray>
  * time       (time) datetime64[ns] 251kB 2015-01-01T12:00:00 ... 2100-12-31...
    time_bnds  (time, bnds) datetime64[ns] 503kB dask.array<chunksize=(15706, 2), meta=np.ndarray>
Dimensions without coordinates: bnds
Data variables:
    pr         (time, lat, lon) float32 2GB dask.array<chunksize=(980, 96, 192), meta=np.ndarray>
Attributes: (12/50)
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            ScenarioMIP
    branch_method:          standard
    branch_time_in_child:   60265.0
    branch_time_in_parent:  60265.0
    cmor_version:           3.5.0
    ...                     ...
    title:                  MPI-ESM1-2-LR output prepared for CMIP6
    tracking_id:            hdl:21.14100/4b1502fc-585a-4005-8d9d-49ad10284480...
    variable_id:            pr
    variant_label:          r10i1p1f1
    netcdf_tracking_ids:    hdl:21.14100/4b1502fc-585a-4005-8d9d-49ad10284480...
    version_id:             v20190710

Here we accessed the data for the SSP2-4.5 scenario and the ensemble member named r10i1p1f1. Let’s select a spatial region and calculate a long-term average for the period 2071-2100, just to demonstrate that the functionality is the same as before.

# select a region near western Europe
lats = [40, 60]
lons = [-10, 10]

# convert longitude values in the dataset from the (0, 360) convention
# to the (-180, 180) convention
ds = ds.assign_coords(lon = (((ds.lon + 180) % 360) - 180))
ds = ds.sortby('lon')

# subset the data
ds_subset = ds.sel(lat = slice(*lats), lon = slice(*lons),
                   time = ds.time.dt.year.isin(range(2071, 2101)))
ds_subset
<xarray.Dataset> Size: 6MB
Dimensions:    (lat: 11, bnds: 2, lon: 11, time: 10957)
Coordinates:
  * lat        (lat) float64 88B 40.1 41.97 43.83 45.7 ... 55.02 56.89 58.76
    lat_bnds   (lat, bnds) float64 176B dask.array<chunksize=(11, 2), meta=np.ndarray>
    lon_bnds   (lon, bnds) float64 176B dask.array<chunksize=(11, 2), meta=np.ndarray>
  * time       (time) datetime64[ns] 88kB 2071-01-01T12:00:00 ... 2100-12-31T...
    time_bnds  (time, bnds) datetime64[ns] 175kB dask.array<chunksize=(10957, 2), meta=np.ndarray>
  * lon        (lon) float64 88B -9.375 -7.5 -5.625 -3.75 ... 5.625 7.5 9.375
Dimensions without coordinates: bnds
Data variables:
    pr         (time, lat, lon) float32 5MB dask.array<chunksize=(126, 11, 11), meta=np.ndarray>
Attributes: (12/50)
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            ScenarioMIP
    branch_method:          standard
    branch_time_in_child:   60265.0
    branch_time_in_parent:  60265.0
    cmor_version:           3.5.0
    ...                     ...
    title:                  MPI-ESM1-2-LR output prepared for CMIP6
    tracking_id:            hdl:21.14100/4b1502fc-585a-4005-8d9d-49ad10284480...
    variable_id:            pr
    variant_label:          r10i1p1f1
    netcdf_tracking_ids:    hdl:21.14100/4b1502fc-585a-4005-8d9d-49ad10284480...
    version_id:             v20190710
# calculate the long-term average precip, and trigger computation.
# pr has units of kg m-2 s-1 (equivalent to mm/s), so multiplying by
# 86400 seconds per day converts it to mm/day
pr_ltm = ds_subset.pr.mean('time').compute() * 86400
p = pr_ltm.plot.contourf(subplot_kws = dict(projection = ccrs.PlateCarree(), 
                                            transform = ccrs.PlateCarree()),
                         levels = 15, cbar_kwargs = dict(label = "mm/day"))
p.axes.coastlines()
p.axes.set_title("MPI-ESM1-2-LR Average Daily Precipitation\n SSP2-4.5 2071-2100")
plt.show()
[Figure: filled contour map of MPI-ESM1-2-LR average daily precipitation (mm/day) over western Europe, SSP2-4.5, 2071-2100]

3.5.3 Saving the Data

To save data from Google Cloud to your local machine, you can either use the xarray to_netcdf method discussed in the previous section, or save the data in zarr format, just like it is stored on the cloud. Just like to_netcdf, there is a to_zarr method for xarray Dataset or DataArray objects (documentation here). Unlike netCDF, the zarr format doesn't save the data to a single file; instead, it creates a directory that contains multiple files, each corresponding to different data variables, coordinates, metadata, etc. For most users, there isn't much benefit to using zarr over netCDF, so it's probably worth keeping things simple and sticking with netCDF as the format for your data files.
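As a sketch, saving the subset from above could look like the following (the output file names here are our own choices):

# load the subset into memory before writing it to disk
ds_subset = ds_subset.load()

# option 1: save as a netCDF file
ds_subset.to_netcdf('pr_MPI-ESM1-2-LR_ssp245_r10i1p1f1_2071-2100.nc')

# option 2: save as a zarr store (this creates a directory, not a single file)
ds_subset.to_zarr('pr_MPI-ESM1-2-LR_ssp245_r10i1p1f1_2071-2100.zarr', mode = 'w')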