3.4 Climate Model Output from ESGF#

Note

The code for downloading climate model data in this section was adapted from Section 6.1 of Anderson and Smith [2021].

Up until now, this chapter has focused on acquiring and analyzing climate data from observations, or from reanalysis products, which assimilate observational data into a weather forecast model. These sources of data are only half of the story. Since this guide is about climate change projections, we must also learn how to download and process output from climate model simulations.

Output from climate models that participated in the CMIP projects has been made publicly available for download from the Earth System Grid Federation (ESGF). The ESGF is a peer-to-peer network with nodes hosted by several research centres around the world. The archive of data hosted on a given node can be searched and navigated in a browser. The website for the node hosted at Lawrence Livermore National Laboratory in the USA can be found at this link, and links for all of the nodes can be found here. Via the browser interface, you can populate a data cart like in an online store and generate a wget script that, when executed, will download the selected data. However, this process can be cumbersome if you are interested in large amounts of data, or if you wish to create an automated workflow. For advanced users, the ESGF offers a RESTful API, which is outside the scope of this guide. In this guide, we will use the Python package esgf-pyclient (imported as pyesgf) as an interface to the API. With pyesgf, one can either get an OPeNDAP URL to access the data directly, or get a URL from which the data can be downloaded using the Python requests library.

Complete documentation for the ESGF, and all options for accessing the data, can be found here.

As with the reanalysis and gridded observations, climate model output is typically very large in volume. For this reason, the examples in this notebook use monthly means instead of the daily means or 6-hourly instantaneous output typically required for climate impact analysis.

3.4.1 Searching For Data#

To search the ESGF archive we must first establish a connection to an ESGF node using the pyesgf.search.SearchConnection class. To do this, we pass the URL for the node to SearchConnection. Since each node may only host a subset of the complete CMIP archive, the argument distrib = True tells the API to search across all nodes, not just the one we connected to.

Note

There are sometimes network/service issues with the ESGF nodes, meaning you may sometimes get unexpected errors when using esgf-pyclient or when downloading data from the ESGF. You can try connecting to a different node, or waiting and trying to run the code later. The URLs for each of the nodes, and their operating status, can be found on this page. If you’re really stuck, you can try accessing CMIP data through Google Cloud Services, following the tutorial in Section 3.5. Another option is to use the intake-esm Python package to access climate model output, which is the focus of Section 3.6. These more modern options are easier to use, so it’s ok if your main takeaway from this section is data analysis, starting at Section 3.4.3.

# import the required packages
from pyesgf.search import SearchConnection
import os
import glob
import requests
from tqdm import tqdm
import ec3

import xarray as xr
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
from xclim import ensembles

# use this to suppress a warning which we can ignore
os.environ["ESGF_PYCLIENT_NO_FACETS_STAR_WARNING"] = "on"
# establish a connection to the LLNL node
conn = SearchConnection('https://esgf-node.llnl.gov/esg-search', distrib=True)

# if that doesn't work, try the UK node
#conn = SearchConnection('https://esgf.ceda.ac.uk/esg-search', distrib=True)

# German Node:
#conn = SearchConnection('http://esgf-data.dkrz.de/esg-search', distrib=True)

After connecting to a node we need to decide which data we’d like to access. The search function takes a number of different search parameters (also called facets), listed and described in the table below. Unfortunately, the names for some of the search parameters are different for the CMIP5 and CMIP6 archives, so both are included in the table.

| Search Criterion | CMIP5 Name | CMIP6 Name |
| --- | --- | --- |
| Project (i.e. CMIP6, CORDEX, etc.) | project | project |
| Model Name | model | source_id |
| Time Frequency | time_frequency | table_id or frequency |
| Forcing Scenario | experiment | experiment_id |
| Variable | variable | variable_id |
| Climate domain (i.e. atmosphere, ocean, land, etc.) | realm | realm |
| Ensemble Member | ensemble | member_id |
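To make the CMIP5 column concrete, a search against the CMIP5 archive might look roughly like the sketch below. This is only an illustration of the facet names: the model (CanESM2), experiments, and ensemble member shown here are example choices, not data used later in this section.

# a hypothetical CMIP5-style search, using the CMIP5 facet names from the table above
query_cmip5 = conn.new_context(project = "CMIP5",
                               model = "CanESM2",               # 'model' instead of 'source_id'
                               experiment = 'historical,rcp85', # 'experiment' instead of 'experiment_id'
                               time_frequency = 'mon',          # 'time_frequency' instead of 'frequency'
                               ensemble = "r1i1p1",             # 'ensemble' instead of 'member_id'
                               variable = "tas")                # 'variable' instead of 'variable_id'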

Note that the CMIP model variable names can be somewhat cryptic. You can find these using the drop-down menus on the left hand side of the ESGF Metagrid page, but for convenience, we link to a table that maps the longer, more easily understandable variable names to the short variable names (which you include in the search), on this page.

To include multiple values for a search parameter (e.g. if you want data for both the historical and SSP5-8.5 forcing scenarios), you can separate the values with a comma inside a single string (e.g. experiment_id = 'historical,ssp585'). Let’s do exactly this, searching for tas (surface air temperature) data at monthly frequency from the model CanESM5, the Canadian model contributing to CMIP6. To limit the amount of data involved, we’ll also select only one ensemble member (r1i1p1f1), instead of the whole ensemble. Dropping the search parameter member_id would make the search return results for all ensemble members (if there are multiple members for the model(s)/experiment(s) of interest).

query = conn.new_context(project="CMIP6",
                         experiment_id='historical,ssp585',
                         source_id = "CanESM5",
                         # monthly ('day' for daily, '6hr' for 6-hourly, etc.)
                         frequency = 'mon', 
                         member_id="r1i1p1f1",
                         variable_id = "tas")
results = query.search()

len(results)
19

Our search returned multiple results, meaning several different data files match our search criteria. The results still need a bit more processing until we can get the download URLs (or OPeNDAP URLs) for the data. Next, we’ll do this processing, and make a pd.DataFrame mapping the file names to the URLs.

# sometimes running this cell raises an error about shards, but running it a second time seems to work.
files = []
for i in range(len(results)):
    try:
        hit = results[i].file_context().search()
    except Exception: # retry once, since ESGF search errors are often transient
        hit = results[i].file_context().search()
    files += list(map(lambda f: {'filename': f.filename, 
                                 'download_url': f.download_url, 
                                 'opendap_url': f.opendap_url}, hit))
files = pd.DataFrame.from_dict(files)
len(files)
28

We now have a longer list of files that match our search criteria. This might be a result of the distributed search returning the same data files, hosted on different nodes. Let’s print out all of the file names to check for duplicates before we do any downloading.

for fname in files['filename'].sort_values(): # print in alphabetical order
    print(fname)
tas_Amon_CanESM5_historical_r1i1p1f1_gn_185001-201412.nc
tas_Amon_CanESM5_historical_r1i1p1f1_gn_185001-201412.nc
tas_Amon_CanESM5_historical_r1i1p1f1_gn_185001-201412.nc
tas_Amon_CanESM5_historical_r1i1p1f1_gn_185001-201412.nc
tas_Amon_CanESM5_historical_r1i1p1f1_gn_185001-201412.nc
tas_Amon_CanESM5_historical_r1i1p1f1_gn_185001-201412.nc
tas_Amon_CanESM5_historical_r1i1p1f1_gn_185001-201412.nc
tas_Amon_CanESM5_historical_r1i1p1f1_gn_185001-201412.nc
tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_201501-210012.nc
tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_201501-210012.nc
tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_201501-210012.nc
tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_201501-210012.nc
tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_201501-210012.nc
tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_201501-210012.nc
tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_201501-210012.nc
tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_201501-210012.nc
tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_210101-218012.nc
tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_210101-218012.nc
tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_210101-218012.nc
tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_210101-218012.nc
tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_210101-218012.nc
tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_210101-218012.nc
tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_218101-230012.nc
tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_218101-230012.nc
tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_218101-230012.nc
tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_218101-230012.nc
tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_218101-230012.nc
tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_218101-230012.nc

As suspected, there are duplicates of each file - we really only want four files here: one for the historical experiment, and three covering different time periods of the SSP5-8.5 experiment.

# filter the DataFrame to drop duplicate filenames
files = files.drop_duplicates('filename')
files
filename download_url opendap_url
0 tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_201501-210... http://crd-esgf-drc.ec.gc.ca/thredds/fileServe... http://crd-esgf-drc.ec.gc.ca/thredds/dodsC/esg...
1 tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_210101-218... http://crd-esgf-drc.ec.gc.ca/thredds/fileServe... http://crd-esgf-drc.ec.gc.ca/thredds/dodsC/esg...
2 tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_218101-230... http://crd-esgf-drc.ec.gc.ca/thredds/fileServe... http://crd-esgf-drc.ec.gc.ca/thredds/dodsC/esg...
3 tas_Amon_CanESM5_historical_r1i1p1f1_gn_185001... http://aims3.llnl.gov/thredds/fileServer/css03... http://aims3.llnl.gov/thredds/dodsC/css03_data...

3.4.2 Downloading Data#

We are now ready to access the data, either directly with xarray and the OPeNDAP URL, or by downloading and saving the file locally. Since we already know how to do the former, we will do the latter. If you don’t want to save the data in the current working directory, join the filename to the path of the directory you’d like to write the data to using os.path.join().
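For reference, the OPeNDAP route would look roughly like the sketch below (not executed here). Whether it works depends on the hosting node’s OPeNDAP service being available at the time.

# a minimal sketch of remote access via OPeNDAP, without downloading anything
ds_remote = xr.open_dataset(files['opendap_url'].iloc[0])
# only the metadata is read at first; values are transferred when accessed,
# so subset before computing to limit the volume sent over the network
tas_remote_subset = ds_remote['tas'].sel(time = slice('2015-01', '2020-12'))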

# Adapted from: https://stackoverflow.com/a/37573701
def download(url, filename):
    """
    Download a file hosted at <url> and write to <filename>
    """
    print("Downloading ", filename)
    r = requests.get(url, stream=True)
    total_size, block_size = int(r.headers.get('content-length', 0)), 1024
    with open(filename, 'wb') as f:
        for data in tqdm(r.iter_content(block_size),
                         total=total_size//block_size,
                         unit='KiB', unit_scale=True):
            f.write(data)

    if total_size != 0 and os.path.getsize(filename) != total_size:
        print("Downloaded size does not match expected size!\n",
              "FYI, the status code was ", r.status_code)
# create a directory (inside the current working directory) to save the data to
data_directory = "CanESM5_tas_monthly"
# only make the directory if it doesn't already exist
if not os.path.exists(data_directory): 
    os.mkdir(data_directory)
# download the data, one file at a time
for i in range(len(files)):
    url = files['download_url'].loc[i]
    filename = files['filename'].loc[i]
    path_to_write = os.path.join(data_directory, filename)
    
    # only download if the file doesn't already exist.
    if not os.path.exists(path_to_write): 
        download(url, path_to_write)
    
# if there is an error downloading a file, you'll need to either delete the file,
# or change this block of code to call download(url, path_to_write) without
# first checking if path_to_write exists.

Sometimes there are issues downloading files from the ESGF, due to technical errors on their end. One way to check whether the file was downloaded successfully is to look at the file size. For monthly data, we expect each file to be a few tens of MB, so if the file size is much smaller, there was likely a problem. This will get flagged in the download function we used to download the data, but we can also check the file size using the os.stat function, as demonstrated below:

# loop through each file, just like when we downloaded them
for i in range(len(files)):
    filename = files['filename'].loc[i]
    filepath = os.path.join(data_directory, filename)
    filestats = os.stat(filepath)
    size_in_MB = filestats.st_size / 1e6 # convert bytes to megabytes
    print(f"Size of {filename}: {size_in_MB} MB")
Size of tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_201501-210012.nc: 27.267498 MB
Size of tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_210101-218012.nc: 25.23234 MB
Size of tas_Amon_CanESM5_ssp585_r1i1p1f1_gn_218101-230012.nc: 37.722355 MB
Size of tas_Amon_CanESM5_historical_r1i1p1f1_gn_185001-201412.nc: 52.443019 MB

The size of each downloaded file is as expected, so we will proceed. If the rest of the notebook runs without error, then that’s another sign that the download was successful.

3.4.3 Opening and Exploring the Data#

Now that the files are downloaded and saved, we can open them directly with xarray and do some analysis. Let’s plot a time series of temperature near Toronto for the whole record and compare the historical data to the Toronto City station observations (from Section 3.1). Since we’re opening multiple files, each of which has data for a different period of time, we’ll use the function xr.open_mfdataset (documentation) to open all of them at once, and automatically concatenate them along the time dimension.

The wildcard character * matches any sequence of characters in the position where it occurs. In the example below, it takes the place of both historical and ssp585 in its first occurrence. In the second occurrence, it takes the place of the year range, e.g. 185001-201412 for the historical period, 201501-210012 for the first chunk of the SSP5-8.5 period, etc. Patterns like this are called glob patterns; they are a simpler relative of regular expressions, which are symbols used to represent patterns in text.
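If you want to check which files a pattern matches before opening them, you can pass the same pattern to glob.glob, as in the short check below (the sorted call is only there to make the output easier to read).

# preview which files the wildcard pattern matches before opening them
for path in sorted(glob.glob(data_directory + "/tas_Amon_CanESM5_*_r1i1p1f1_gn_*.nc")):
    print(path)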

# first open the CanESM5 data for ensemble member r1i1p1f1
ds_CanESM5 = xr.open_mfdataset(data_directory + "/tas_Amon_CanESM5_*_r1i1p1f1_gn_*.nc")
# now get obs data - this function is copied and modified from Section 3.1
stn_ids_toronto_city = [5051, 31688, 41863]
df_list = []
for i, stn_id in enumerate(stn_ids_toronto_city):
    df = ec3.get_data(stn_id, 
                      type = 3, # monthly mean data
                      progress = False)
    if i > 0: # ensure no duplicate times
        df = df[df['Date/Time'] < df_list[i-1]['Date/Time'].min()]
    df_list.append(df)

df_obs = pd.concat(df_list, axis = 0)


column_name_dict = {'Date/Time': 'time', 
                    'Mean Temp (°C)': 'tas',
                    'Latitude (y)': 'lat',
                    'Longitude (x)': 'lon',
                    'Station Name': 'Name'}
    
df_obs = df_obs.rename(columns = column_name_dict)
df_obs = df_obs[list(column_name_dict.values())]
df_obs['time'] = pd.to_datetime(df_obs['time'])
df_obs = df_obs.set_index("time")
    
# sort the data in proper chronological order
df_obs = df_obs.sort_index()

# select data only up to 2015, when the "historical" simulation ends
df_obs = df_obs[df_obs.index.year.isin(range(1850, 2015))]

print(df_obs.index.min(), df_obs.index.max())
1850-01-01 00:00:00 2006-12-01 00:00:00

The monthly data for this station ends at the end of 2006, but we’ll proceed regardless. Since we know that daily data is available up until the present, the best practice would be to calculate monthly means from the daily means (weighted by the number of days in the month), but we digress.
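For completeness, a minimal sketch of that calculation is below, assuming a hypothetical DataFrame df_daily of daily mean temperatures with a DatetimeIndex and a 'tas' column (the names and the completeness threshold are illustrative, not part of this notebook’s workflow).

# a minimal sketch: monthly means computed from hypothetical daily station data
tas_monthly_from_daily = df_daily['tas'].resample('MS').mean()
# only keep months with enough valid daily values (the threshold is a judgement call)
valid_days = df_daily['tas'].resample('MS').count()
tas_monthly_from_daily = tas_monthly_from_daily.where(valid_days >= 25)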

# interpolate the CanESM5 model data to the station location
station_lat = df_obs['lat'].iloc[0]
# add 360 since the model data longitudes go from (0, 360) instead of (-180, 180)
station_lon = df_obs['lon'].iloc[0] + 360

tas_CanESM5 = ds_CanESM5.tas.interp(lat = station_lat, lon = station_lon) - 273.15
# station and model data have different time datatypes, so make a time axis with
# xarray to use for the obs
times_obs = xr.cftime_range(start = df_obs.index.min(), 
# need to add an extra month to the endpoint to make it closed at the right
                            end = df_obs.index.max() + pd.Timedelta(days=30), 
                            freq = 'M', calendar = 'noleap')
# plot the time series
fig, ax = plt.subplots(figsize = (8, 6))
tas_CanESM5.plot.line(ax = ax, label = 'CanESM5')
ax.plot(times_obs, df_obs['tas'], color = 'k', linestyle = '-', label = 'Station Obs')
ax.set_title("Monthly Mean Temperature in Toronto")
ax.set_ylabel(r"Temperature ($^{\circ}$C)")
plt.legend()
plt.show()
[Figure: time series of monthly mean temperature in Toronto, comparing CanESM5 to the station observations]

While we should not expect the individual months of the simulated data to match up with the station observations, the degree of overlap is encouraging: the model agrees reasonably well with the observations.

3.4.4 Basic Exploration of Model Bias#

To better quantify this, we can plot the probability distribution of each dataset during their common period and calculate metrics such as the mean bias. Because the model has lower minima than the observations, we might expect it to be biased cold (i.e. negative mean bias). It also has slightly more maxima that exceed the observations, so perhaps it also has a higher standard deviation. Let’s calculate these metrics and add them to the distribution plot.

# select model data during obs period
# add one to the max year so the range includes the max year
obs_year_range = range(times_obs.year.min(), times_obs.year.max() + 1) 
tas_CanESM5_obsperiod = tas_CanESM5.sel(time = tas_CanESM5.time.dt.year.isin(obs_year_range))

# fit Kernel Density Estimator so we can plot a smooth distribution
kde_CanESM5 = stats.gaussian_kde(tas_CanESM5_obsperiod.values)
# need to drop missing values or else scipy complains
kde_obs = stats.gaussian_kde(df_obs['tas'].dropna().values) 


# mean values for each dataset
tas_CanESM5_mean = tas_CanESM5_obsperiod.mean('time')
tas_obs_mean = df_obs['tas'].mean()

# stdevs for each dataset
tas_CanESM5_stdev = tas_CanESM5_obsperiod.std('time')
tas_obs_stdev = df_obs['tas'].std()
# range of temperatures to plot
temperatures = np.arange(-40, 40)
plt.figure(figsize = (8,6))
plt.title("Distribution of Monthly Mean Temperatures in Toronto")
plt.plot(temperatures, kde_CanESM5(temperatures), label = 'CanESM5')
plt.plot(temperatures, kde_obs(temperatures), label = 'Station Obs', color = 'k')
plt.vlines([tas_CanESM5_mean, tas_obs_mean], 0, 1,
           colors = ['tab:blue', 'k'], linestyles = '--')
plt.ylim(0, 0.05)
plt.xlabel(r"Temperature ($^{\circ}$C)")
plt.ylabel("Probability Density")
plt.legend(loc = 'upper left')
plt.grid()

# annotate with mean bias and ratio of stdevs
plt.text(-42, 0.038, r'Mean Bias: %.2f $^{\circ}$C' % (tas_CanESM5_mean - tas_obs_mean))
plt.text(-42, 0.036, r'Stdev Ratio: %.2f' % (tas_CanESM5_stdev / tas_obs_stdev))

plt.show()
[Figure: probability distributions of monthly mean temperature in Toronto for CanESM5 and the station observations, with dashed vertical lines marking each dataset's mean, and the mean bias and standard deviation ratio annotated]

Indeed, the model data for this location is biased slightly cold and has too large a standard deviation compared to the station observations. For sub-monthly data, these biases could be even larger, which is one reason why downscaling and bias correction are necessary parts of climate impact analysis.

An interesting feature of these distributions of temperature is that they are bimodal, meaning there are two maxima. This is likely because of the seasonal cycle - the peak near lower temperatures is for cold-season months, and the other is for the warm season. For this reason, we often separate the data by month or by season when quantifying biases (at least for variables that have a strong seasonal cycle). Let’s do this by comparing the monthly climatologies of the two datasets.

# calculate monthly climatologies
tas_monthlyclim_CanESM5 = tas_CanESM5_obsperiod.groupby('time.month').mean('time')
tas_monthlyclim_obs = df_obs['tas'].groupby(df_obs.index.month).mean()

# difference in climatologies
monthlyclim_bias = tas_monthlyclim_CanESM5.values - tas_monthlyclim_obs.values

months = tas_monthlyclim_CanESM5.month.values 

# plot
plt.figure(figsize = (8,6))
plt.title("Climatology of Monthly Mean Temperatures in Toronto")
plt.plot(months, tas_monthlyclim_CanESM5, label = 'CanESM5', marker = 'o')
plt.plot(months, tas_monthlyclim_obs, label = 'Station Obs', color = 'k', marker = 'o')
plt.plot(months, monthlyclim_bias, label = 'Model Bias', 
         color = 'red', linestyle = '--', marker = 'o')
plt.xlabel("Month of the Year")
plt.ylabel(r"Temperature ($^{\circ}$C)")
plt.legend()
plt.grid()
plt.show()
[Figure: monthly climatology of mean temperature in Toronto for CanESM5 and the station observations, along with the model bias for each month]

This plot is much more informative! The overall negative mean bias arises because the model’s winter months are too cold, but in the annual mean this bias is partly washed out by the summer months being slightly too hot. This is also why the model data has a higher standard deviation than the observations - the contrast between the summer and winter seasons is greater than in the observations. This example demonstrates that for variables with a strong seasonal cycle like temperature, bias correction should account for seasonally-varying biases. This can be done by grouping the data by the day or month of the year, and applying separate corrections to each of these groups. This example also illustrates why it’s important to learn some basic climate science before working with climate data, and to do exploratory analysis with the observations and raw model data before doing bias correction or downscaling. If you lacked the required background knowledge, or went directly to downscaling without assessing the raw model biases, you’d likely produce downscaled projections that still suffer from seasonally-dependent biases.
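As an illustration of that idea (not the downscaling workflow developed later in this guide), a simple monthly “delta” correction could be applied to the model data as sketched below. Real applications should use a dedicated bias-adjustment package such as xclim.sdba; this sketch only removes the monthly-mean bias computed above.

# a minimal sketch of a mean-bias ("delta") correction applied separately for each calendar month
bias_by_month = xr.DataArray(monthlyclim_bias,
                             coords = {'month': months},
                             dims = 'month')
# subtract the bias for the matching calendar month from each model value
tas_CanESM5_debiased = tas_CanESM5_obsperiod.groupby('time.month') - bias_by_month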

3.4.6 Analyzing Multiple Ensemble Members#

The above examples used data only from a single ensemble member of CanESM5. In order to quantify uncertainty related to internal variability (Section 2.3), your final analysis should include multiple ensemble members, possibly from multiple models. Here we will use the Python module xclim and its xclim.ensembles functionality to analyze data from multiple ensemble members at once. The documentation for these utilities is here.

Aside about xr.open_mfdataset

You can also use xr.open_mfdataset to open multiple files that represent different ensemble members. To do this, you’ll need to use the keyword arguments it takes to intelligently structure the resulting Dataset. For example, you can use the argument concat_dim to create a new dimension called 'ensemble_member', so the data is shaped like ('ensemble_member', 'time', 'lat', 'lon'). Unfortunately, this only works if all of the data files use the same calendar. If you have a multi-model ensemble with models that use different calendars, it’s much better to use xclim.ensembles.

If you do it this way, it’s tricky to concatenate along the time dimension as well (it’s technically possible, but it requires advanced knowledge of xclim.ensembles and its calendar-handling features). The easiest way to handle different files for different time periods (like we did in 3.4.3) is to open the files for each time period separately, and then concatenate them using xr.concat (documentation) and the keyword argument dim = "time".

Since file I/O is usually slow compared to other steps of analysis, it’s best to make the smallest possible number of file-opening calls. The best way to open your data will depend on how your files are structured. Say you have separate data files for each year of a 100-year period, and 10 ensemble members. In this situation, it would be better to iterate over the ensemble members (10 iterations) and concatenate over time with open_mfdataset, instead of iterating over the time periods (100 iterations) and concatenating ensemble members with xclim.ensembles. You can then combine the 10 datasets, each representing a different ensemble member, into a single Dataset using xr.concat and the keyword argument dim = "ensemble_member", or create an xclim ensemble by passing the list of datasets to xclim.ensembles.create_ensemble, as in the sketch below.
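Here is a minimal sketch of that per-member strategy. The file paths, model name, and member IDs are hypothetical, and only illustrate the structure of the loop.

# hypothetical layout: many files per ensemble member, split by time period
member_ids = ["r1i1p1f1", "r2i1p1f1"]
member_datasets = []
for member in member_ids:
    # open all time chunks for this member and concatenate them along the time dimension
    ds_member = xr.open_mfdataset(f"some_directory/tas_Amon_SomeModel_historical_{member}_*.nc")
    member_datasets.append(ds_member)
# stack the members along a new 'ensemble_member' dimension...
ds_members = xr.concat(member_datasets, dim = "ensemble_member")
ds_members = ds_members.assign_coords(ensemble_member = member_ids)
# ...or build an xclim ensemble from the same list of datasets
# ds_members = ensembles.create_ensemble(member_datasets)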

Now, back to working with xclim.ensembles.

Below we’ll download data for a few more ensemble members from CanESM5, plot the data for the different ensemble members, and calculate some ensemble statistics. To keep things simple, we’ll do this only for the historical period.

Caution

It’s commonly stated that ensemble or multi-model means give “better results” than data from individual ensemble members. This is true in certain cases - averaging model output for long-term mean quantities can reduce model biases - but you should never take an ensemble average as the first step in your workflow. Because different ensemble members contain different realizations of internal variability and weather, averaging attenuates variability and extremes. It’s okay to calculate ensemble means and other statistics in your exploratory analysis, but downscaling should be applied to individual ensemble members separately, and the spread of the final results should be assessed across the ensemble.

query2 = conn.new_context(project="CMIP6",
                         experiment_id='historical',
                         source_id = "CanESM5",
                         frequency = 'mon',
                         member_id="r2i1p1f1,r3i1p1f1,r4i1p1f1",
                         variable_id = "tas")
results2 = query2.search()
files2 = []
for i in range(len(results2)):
    try:
        hit = results2[i].file_context().search()
    except Exception: # retry once, since ESGF search errors are often transient
        hit = results2[i].file_context().search()
    files2 += list(map(lambda f: {'filename': f.filename, 
                                 'download_url': f.download_url, 
                                 'opendap_url': f.opendap_url}, hit))
files2 = pd.DataFrame.from_dict(files2)
files2 = files2.drop_duplicates('filename')

files2
filename download_url opendap_url
0 tas_Amon_CanESM5_historical_r4i1p1f1_gn_185001... http://aims3.llnl.gov/thredds/fileServer/css03... http://aims3.llnl.gov/thredds/dodsC/css03_data...
1 tas_Amon_CanESM5_historical_r3i1p1f1_gn_185001... http://aims3.llnl.gov/thredds/fileServer/css03... http://aims3.llnl.gov/thredds/dodsC/css03_data...
2 tas_Amon_CanESM5_historical_r2i1p1f1_gn_185001... http://aims3.llnl.gov/thredds/fileServer/css03... http://aims3.llnl.gov/thredds/dodsC/css03_data...
# download the files
for i in range(len(files2)):
    url = files2['download_url'].loc[i]
    filename = files2['filename'].loc[i]
    path_to_write = os.path.join(data_directory, filename)
    
    # only download if the file doesn't already exist. 
    if not os.path.exists(path_to_write): 
        download(url, path_to_write)
# create a list of the files to create the ensemble, using the wildcard character *. 
# note that the order of the files is random when using glob.glob and the 
# wildcard character, so the order of the 'realization' dimension created by 
# xclim.ensembles.create_ensemble will also be random.
# You can specify the order of the 'realization' coordinates
# by making your own ordered list of file names.
historical_files = glob.glob(data_directory + "/tas_Amon_CanESM5_historical_*_gn_185001-201412.nc")

# open up the four historical files (one per ensemble member) using xclim.ensembles
ds_ensemble = ensembles.create_ensemble(historical_files)

# Print a summary of the dataset.
# The dimension 'realization' represents the ensemble member
ds_ensemble
<xarray.Dataset> Size: 260MB
Dimensions:      (realization: 4, time: 1980, bnds: 2, lat: 64, lon: 128)
Coordinates:
  * time         (time) object 16kB 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
  * lat          (lat) float64 512B -87.86 -85.1 -82.31 ... 82.31 85.1 87.86
  * lon          (lon) float64 1kB 0.0 2.812 5.625 8.438 ... 351.6 354.4 357.2
    height       float64 8B 2.0
  * realization  (realization) int64 32B 0 1 2 3
Dimensions without coordinates: bnds
Data variables:
    time_bnds    (realization, time, bnds) float64 127kB dask.array<chunksize=(1, 1980, 2), meta=np.ndarray>
    lat_bnds     (realization, lat, bnds) float64 4kB dask.array<chunksize=(1, 64, 2), meta=np.ndarray>
    lon_bnds     (realization, lon, bnds) float64 8kB dask.array<chunksize=(1, 128, 2), meta=np.ndarray>
    tas          (realization, time, lat, lon) float32 260MB dask.array<chunksize=(1, 1980, 64, 128), meta=np.ndarray>
Attributes: (12/53)
    CCCma_model_hash:            3dedf95315d603326fde4f5340dc0519d80d10c0
    CCCma_parent_runid:          rc3-pictrl
    CCCma_pycmor_hash:           33c30511acc319a98240633965a04ca99c26427e
    CCCma_runid:                 rc3.1-his01
    Conventions:                 CF-1.7 CMIP-6.2
    YMDH_branch_time_in_child:   1850:01:01:00
    ...                          ...
    tracking_id:                 hdl:21.14100/872062df-acae-499b-aa0f-9eaca76...
    variable_id:                 tas
    variant_label:               r1i1p1f1
    version:                     v20190429
    license:                     CMIP6 model data produced by The Government ...
    cmor_version:                3.4.0
# interpolate the data to the Toronto station location
tas_CanESM5_ensemble = ds_ensemble.tas.interp(lat = station_lat, lon = station_lon) - 273.15

# calculate annual means - plotting monthly means is too noisy to see the example clearly
tas_CanESM5_ensemble_annual = tas_CanESM5_ensemble.resample(time = 'Y').mean()
tas_obs_annual = df_obs['tas'].resample('Y').mean()

# obs data is missing for a few years, and incomplete for 2003
tas_obs_annual = tas_obs_annual[tas_obs_annual.index.year < 2003]

# obs data year values to pass to plotting function
yrmax_obs = tas_obs_annual.index.year.max()
yrs_obs = tas_CanESM5_ensemble_annual.time.sel(time = tas_CanESM5_ensemble_annual.time.dt.year <= yrmax_obs)

# calculate ensemble mean to include in the plot
tas_CanESM5_annual_ensmean = tas_CanESM5_ensemble_annual.mean("realization")

# plot a time series of the annual mean temperatures for Toronto, using a different colour for each ensemble member
fig, ax = plt.subplots(figsize = (8, 6))
# ensemble mean
tas_CanESM5_annual_ensmean.plot.line(ax = ax, color = 'blue', 
                                     label = 'Ensemble Mean', 
                                     linewidth = 2.5, zorder = 2)

ax.plot(yrs_obs, tas_obs_annual, color = 'k', 
        label = 'Station Obs', linewidth = 2.5, zorder = 2)

tas_CanESM5_ensemble_annual.plot.line(ax = ax, hue = 'realization', 
                                      linewidth = 0.75,
                                      add_legend = False, 
                                      zorder = 1)

ax.set_title("CanESM5 Annual Mean Temperature in Toronto")
ax.set_ylabel(r"Temperature ($^{\circ}$C)")
ax.legend()
plt.show()
[Figure: time series of CanESM5 annual mean temperature in Toronto, showing the individual ensemble members, the ensemble mean, and the station observations]

In the above plot, you can see how even the annual mean temperatures vary significantly across the ensemble, and show larger fluctuations than the ensemble mean (thick blue curve). The ensemble mean does not agree perfectly with the observations, because of model bias. Also, note that the year-to-year fluctuations in the ensemble mean are smaller than those in the observations. This is because the observations reflect only one possible realization of the climate, so internal variability is still present. For observations, we can only disentangle forced trends (like the effects of increasing greenhouse gases) from internal variability by examining long-term changes; over a short record, what might appear to be a trend could actually be due to internal variability on interannual or decadal timescales. For model simulations, aggregating over multiple realizations smooths out the effects of internal variability and can give us a better estimate of the forced trend by itself. To get a plausible range of values that accounts for internal variability, you can use a measure of ensemble spread, like the standard deviation or the range between a low and high quantile (e.g. the 10th-90th percentile range).
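A sketch of that last step is below, using plain xarray reductions over the realization dimension (xclim.ensembles also provides helper functions for ensemble statistics). The shaded band shows the 10th-90th percentile range across the four members; with only four members this range is a rough estimate, but the idea carries over to larger ensembles.

# the annual series at a single point is small, so load it into memory
ens_annual = tas_CanESM5_ensemble_annual.load()
# 10th and 90th percentiles across the ensemble members
ens_low_high = ens_annual.quantile([0.1, 0.9], dim = 'realization')

# plot the ensemble mean with the 10th-90th percentile range shaded
years = ens_annual.time.dt.year.values
fig, ax = plt.subplots(figsize = (8, 6))
ax.fill_between(years,
                ens_low_high.sel(quantile = 0.1),
                ens_low_high.sel(quantile = 0.9),
                color = 'tab:blue', alpha = 0.3, label = '10th-90th Percentile')
ax.plot(years, tas_CanESM5_annual_ensmean, color = 'blue', label = 'Ensemble Mean')
ax.set_title("CanESM5 Annual Mean Temperature in Toronto: Ensemble Spread")
ax.set_xlabel("Year")
ax.set_ylabel(r"Temperature ($^{\circ}$C)")
ax.legend()
plt.show()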

Summary#

This notebook has worked through the process of downloading CMIP6 climate model data from the ESGF and doing some basic model validation using station observations. Any other data hosted on the ESGF, including CORDEX regional climate model data, can be accessed in the same way. You are encouraged to navigate the ESGF web portal before using pyesgf, to make sure you’re using the correct facet names for the project you’re searching and that the search returns the results you expect.