5.2 Data Selection#
Once you have defined the climate indicators relevant to your impact study's application domain, you'll need to assess and select the datasets you'll use to probe your research question. This section focuses on the factors that contribute to your choice of observational and climate model data.
5.2.1 Temporal Sampling#
Climate model output is typically only archived at certain standard time frequencies. These include monthly and daily means, maxima and minima, and 6-hourly means or instantaneous values. For most models and variables, it’s unlikely you will be able to find data at sub-6-hourly frequency, though some models may have hourly data (either means or instantaneous values) available on the ESGF. The most appropriate frequency and sampling type (means, max/min, instantaneous, etc.) of climate data for your study depends on the nature of your climate indicator. For example, indicators based on the number of days per year that meet certain criteria require daily means, maxima, or minima. Since you’re likely already intimately familiar with your climate indicator or your impact model, it should be fairly easy to specify the required temporal sampling.
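For instance, an indicator like the annual count of hot days can only be computed from daily maxima. Here is a minimal sketch; the 30°C threshold and the data values are invented for illustration:

```python
# Sketch: a threshold-count indicator that requires daily-maximum data.
# The indicator (days with Tmax > 30 degrees C) and the values are illustrative.

def hot_days(daily_tmax, threshold=30.0):
    """Count days whose daily-maximum temperature exceeds `threshold` (deg C)."""
    return sum(1 for t in daily_tmax if t > threshold)

# A few days of made-up daily Tmax values (deg C)
daily_tmax = [28.0, 31.5, 29.9, 33.2, 30.1, 27.4]
print(hot_days(daily_tmax))  # 3 values exceed 30 deg C
```

Monthly means would smooth out the individual hot days entirely, which is why the daily sampling is non-negotiable for this kind of indicator.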
Almost no sub-hourly data is archived from climate model simulations. Since the typical timestep for 1-degree resolution models is about 30 minutes, temporal variability at sub-hourly timescales is not well represented. If your application or impact model requires sub-hourly input data, you'll need to somehow impose higher-frequency temporal variability on top of the downscaled model output. Methods for doing this for certain climate variables, such as wind speed [Tang and Bassill, 2018], have been developed, but will not be covered further in this Guidebook.
5.2.2 Spatial Sampling#
The spatial scales and coverage needed for your application are other factors you must consider in the initial steps of the study design. As you learned in Chapter 4, the spatial resolution and coverage of your downscaled output depend entirely on the observational data you use as the training data for downscaling. You may need data that covers an extended region at fine spatial resolution, or you may instead require data for one or more discrete point locations, perhaps conveniently located at or close to a weather station that provides observations of the variables you need. The spatial requirements of your application are among the largest deciding factors in which observational dataset to use for your study.
One interesting case is when your climate indicator is based on spatial aggregations over a region that covers more than one climate model grid cell. Is downscaling even necessary for such an application? The answer is maybe! Because spatial downscaling methods like BCCA (and its derivatives DBCCA and BCCAQ) are nonlinear, as are many climate indicators like CDD, the aggregation of an indicator calculated from downscaled data could be different from a calculation based on aggregated and then bias-corrected raw model output. This is all to say, the order of operations matters, and when in doubt, do the spatial downscaling and then derive further quantities from the downscaled data, or perform both types of analysis and compare the results.
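To see concretely how the order of operations can matter, here is a toy sketch with invented precipitation values, using the longest run of dry days (a CDD-like indicator) over two grid cells:

```python
# Illustrative sketch (made-up numbers): for a nonlinear indicator such as
# the longest run of dry days (CDD), aggregating grid cells before computing
# the indicator need not match computing it per cell and then aggregating.

def longest_dry_run(precip, wet_threshold=1.0):
    """Longest run of consecutive days with precip below `wet_threshold` (mm)."""
    longest = run = 0
    for p in precip:
        run = run + 1 if p < wet_threshold else 0
        longest = max(longest, run)
    return longest

# Daily precipitation (mm) for two neighbouring grid cells
cell_a = [0.0, 0.0, 5.0, 0.0, 0.0, 0.0]
cell_b = [4.0, 0.0, 0.0, 0.0, 6.0, 0.0]

# Indicator first, then spatial mean: each cell has a 3-day dry spell
mean_of_cdd = (longest_dry_run(cell_a) + longest_dry_run(cell_b)) / 2

# Spatial mean first, then indicator: the dry spells no longer line up
regional = [(a + b) / 2 for a, b in zip(cell_a, cell_b)]
cdd_of_mean = longest_dry_run(regional)

print(mean_of_cdd, cdd_of_mean)  # the two orders of operations disagree
```

Because the dry spells in the two cells are offset in time, averaging the precipitation first erases them, while averaging the per-cell indicator preserves them.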
5.2.3 Assessing Observational Products#
Once you know all of the inputs required for your climate indicator(s), and the spatial and temporal sampling required of the input variables, you are ready to select the observational data product to use for downscaling, bias correction, and validation. In addition to providing observations of the required climate variables with the proper sampling, your observational dataset must have spatial coverage for your study region and a record long enough to characterize observed historical variability. This is typically at least 30 years, though for applications involving extremes, longer is better. It may be tempting to use, for example, satellite observations with very high spatial resolution, but such data products typically cover very short time periods and thus can't be used to construct a climatology and capture the full range of historical variations.
In Chapters 3 and 4 we used a gridded observational dataset as the main observational data product for downscaling and model validation. This dataset, developed by NRCan, is produced by interpolating station observations using the ANUSPLIN algorithm. As a result, for locations far from any weather stations, the data does not really reflect the local climate, but instead a blend of observations from distant locations. For remote locations that lack observations, this dataset (and other gridded, interpolated observational datasets) may not be appropriate. Instead, you may wish to use reanalysis data (Section 3.3) as a proxy for observations, since the physical relationships by which remote observations inform the output over a data-sparse region may be more reliable than the statistical relationships used by interpolation algorithms. However, reanalysis data can also be unreliable over observationally sparse regions, so for applications where this is an issue, treat your results with suspicion.
Reanalysis data may also be preferable to gridded observations for some applications due to its globally complete spatial coverage and global consistency. As an example, you might want your study to cover all of Canada and the contiguous United States. Neither the NRCanMet nor the Livneh gridded observations cover this whole study domain, and patching the two together would introduce inconsistencies since they use different interpolation algorithms. This is a scenario for which using reanalysis as the "observational" data product would be appropriate. Other situations include when observations are not available for a variable you require, or when the available observations lack sufficient temporal frequency (e.g. if you require hourly data but observations are only available at daily frequency).
5.2.4 Assessing Model Products#
Chapter 2.3 summarized the different sources of uncertainty in climate model projections. To best characterize the degree of uncertainty in your future projections, you’ll need to make choices about which simulations to select, including the type of future scenario, which models to include, and whether to use multiple ensemble members from the selected model(s). Each of these factors relates to one of the three sources of uncertainty from 2.3: scenario uncertainty, model structural uncertainty, and internal variability.
5.2.4.1 Future Scenario and Time Horizon#
When developing the research question for your study, you likely started with a question broadly similar to "How will my exposure unit be affected by climate change?" In order to answer this question quantitatively using climate model projections, you'll need to be more specific about what you mean by "climate change". As explained in 2.3.1, there are several standard future climate forcing scenarios called SSPs (Shared Socioeconomic Pathways) for which the models contributing to CMIP6 have run simulations. These are the successors of the Representative Concentration Pathways (RCPs) from CMIP5, though the GHG concentration pathways for three of the SSPs (SSP1-2.6, SSP2-4.5, and SSP5-8.5) were carried over from the previous generation (RCP 2.6, RCP 4.5, and RCP 8.5). Each SSP reflects a different pathway of emissions and climate-related policies, ranging from continually increasing GHG emissions (SSP5-8.5) to near-term achievement of negative emissions as a result of aggressive mitigation policies (SSP1-2.6). Further explanation of the SSPs, along with answers to some frequently asked questions, is available on this page of climatedata.ca.
Most climate impact studies involve choosing one or more SSPs. Perhaps you are interested in the changes that may occur in a future that follows a particular SSP, in which case you will focus your analysis on simulations for that scenario only. For example, you may wish to produce the most conservative (i.e. worst-case) projections possible, so you'd choose the highest-emissions scenario, SSP5-8.5. Alternatively, you may wish to produce a range of projections based on low and high emissions scenarios, in which case you'd choose simulations for both SSP1-2.6 and SSP5-8.5, and possibly also SSP2-4.5 as a middle-of-the-road scenario.
The time horizon of interest for your application is another choice you must make in the study design process, and it can affect your choice of future climate scenario. In the near term (i.e. one to two decades), the climate forcing pathways (GHGs, aerosols, and other climate forcers) of the different SSPs do not differ substantially, so it may make little meaningful difference which one you choose. In other words, scenario uncertainty contributes little to the uncertainty in near-term climate projections, compared to model structural uncertainty and internal variability. For time horizons longer than about 20 years, the different SSPs yield very different future projections, so sampling from multiple SSPs is necessary to quantify the effects of scenario uncertainty.
For some applications, such as the design of a highway or bridge, you might be interested in a time horizon longer than 100 years because of the intended lifetime of the piece of infrastructure. However, most CMIP6 future simulations end at the year 2100, and those that continue beyond it archive little daily or sub-daily output. For long-lifetime infrastructure projects, one could make use of the limited data that is available post-2100, or instead focus on projections of extreme values with long return periods (Section 3.1.3) for an end-of-century time horizon. For the most robust estimates of extreme value statistics, it is best to use many ensemble members for each selected model; more on ensemble selection in the next subsection.
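As a rough illustration of the return-period idea (not the extreme-value fitting of Section 3.1.3, which uses a fitted distribution rather than raw quantiles), an empirical return level can be read directly off a large pooled sample of annual maxima, such as one built from many ensemble members. The data here are synthetic:

```python
# Rough sketch (synthetic data): an empirical T-year return level from a
# pooled sample of annual maxima. A real analysis would fit an extreme-value
# distribution; the empirical quantile is shown only to convey the concept.

import random

random.seed(0)
# Pretend these are 200 annual maxima pooled across ensemble members
annual_maxima = [random.gauss(40.0, 5.0) for _ in range(200)]

def empirical_return_level(maxima, return_period):
    """Value exceeded on average once per `return_period` years (empirical)."""
    ordered = sorted(maxima)
    # Exceedance probability 1/T corresponds to the (1 - 1/T) quantile
    idx = int((1 - 1 / return_period) * len(ordered))
    return ordered[min(idx, len(ordered) - 1)]

level_50yr = empirical_return_level(annual_maxima, 50)
print(level_50yr)
```

With a single 30-year record, the 50-year return level sits beyond the sample, which is exactly why pooling ensemble members (or fitting a distribution) is needed for long return periods.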
As also mentioned in 2.3.1, there exists another approach to selecting the type of future you’d like to analyze, which sidesteps the issues of scenario uncertainty and selecting a time horizon. This method is to assess changes when the model global mean surface temperature (GMST) warms to a certain number of degrees above the GMST from that model’s pre-industrial control simulation. Since different models (and different future scenarios) will reach the warming threshold at different times, this method is agnostic to when the impacts occur. Its advantages are that it removes the issue of needing to weigh the likelihood of different emissions scenarios and that the impacts can be tied directly to policy goals, which are often stated in terms of global warming targets (such as \(1.5^{\circ}C\) and \(2^{\circ}C\) in the Paris Agreement). If it’s relevant, one can also report the range of years for which the different models first reach the warming threshold, to communicate when these impacts are likely to occur.
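The warming-level approach can be sketched as a simple search for the first year a model's smoothed GMST anomaly reaches the threshold. The running-mean window length and the toy anomaly series below are invented for illustration:

```python
# Sketch (made-up GMST series): find the first year a model's 21-year
# running-mean GMST anomaly, relative to its pre-industrial baseline,
# reaches a warming threshold such as +2 deg C.

def first_crossing_year(years, gmst_anomaly, threshold=2.0, window=21):
    """First central year whose `window`-year mean anomaly >= threshold."""
    half = window // 2
    for i in range(half, len(years) - half):
        mean = sum(gmst_anomaly[i - half : i + half + 1]) / window
        if mean >= threshold:
            return years[i]
    return None  # threshold not reached in this simulation

# Toy anomaly series: steady warming of 0.03 deg C per year from 2000
years = list(range(2000, 2101))
anomaly = [0.8 + 0.03 * (y - 2000) for y in years]
print(first_crossing_year(years, anomaly))
```

Applying this per model and per scenario yields the range of crossing years one can report alongside the warming-level impacts.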
5.2.4.2 Model Selection#
Characterizing the effects of model structural uncertainty requires using multiple different climate models to produce your range of future climate projections. However, for most applications it's not practical to use every bit of data available on the ESGF, especially since many different Source IDs are not truly independent models. Section 2.3 mentioned how some models share large amounts of common code, but additionally, some modelling centres submit runs from slightly modified versions of the same model, with either different horizontal resolutions (e.g. BCC-CSM2-HR and BCC-CSM2-MR) or different representations of the upper atmosphere (e.g. CESM2 and CESM2-WACCM). Paring down your multi-model ensemble by choosing only one model from each modelling centre is a good first step in model selection, which should retain a substantial amount of the variation in model code. You could also choose to eliminate models that use a 360-day calendar. Finally, you could perform some preliminary analysis on the raw model projected changes in the climate variable(s) needed to calculate your climate indicator(s), and choose the two models that fall at the lower and upper ends of the overall range to carry forward for downscaling and further analysis. There is no universally agreed-upon best method for selecting models for a climate impact study, so any of the practical methods described here are reasonable.
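The last step, bracketing the ensemble with the lowest- and highest-change models, amounts to a one-line selection once you have a table of projected changes. The model names and warming values below are invented placeholders:

```python
# Illustrative sketch (invented numbers): after paring to one model per
# centre, pick the models at the low and high ends of the projected change
# in a key variable, to bracket the ensemble range with two simulations.

projected_dT = {  # hypothetical end-of-century warming (deg C) per model
    "ModelA": 2.1,
    "ModelB": 3.4,
    "ModelC": 4.8,
    "ModelD": 2.9,
}

low_model = min(projected_dT, key=projected_dT.get)
high_model = max(projected_dT, key=projected_dT.get)
print(low_model, high_model)  # the two models that bracket the range
```

Downscaling only these two bracketing models keeps the workload manageable while still conveying (most of) the structural spread.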
5.2.4.3 Ensemble Member Selection#
Section 2.3.2 discussed how internal climate variability is an important source of uncertainty in climate model projections, especially when focusing on regional or local climate. While many CMIP5 and CMIP6 models publish output from multiple ensemble members on the ESGF, some only submit a single simulation per scenario. For this reason, some studies choose to select only a single ensemble member from each model, to place each on an equal footing when aggregating results across models. This approach is acceptable if fluctuations due to internal variability are not likely to contribute substantially to the overall range of projections, such as for long time horizons or variables with a strong climate change signal (like temperature). For variables subject to greater degrees of internal variability, such as wind speeds or precipitation, you may not be able to separate a forced climate change signal from internal variability without using multiple ensemble members from each model.
If computational resource constraints make it unfeasible to include both multiple models and multiple ensemble members in your analysis, then you'll need to prioritize either a multi-model ensemble or a single-model, multi-member ensemble. The best option depends on whether model structural uncertainty or internal variability dominates the spread in your variables of interest or climate indicator across models and simulations. Determining this requires either reviewing the existing climate science literature relevant to your variables/application, or your own investigation using CMIP model data.
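A first-pass version of that investigation is to compare the spread across one model's members to the spread across models (one member each). This is a deliberately crude sketch with invented numbers; formal variance-partitioning methods in the literature are more careful:

```python
# Rough sketch (synthetic numbers): compare spread across ensemble members
# of one model (internal variability) to spread across models, one member
# each (structural uncertainty), to decide which ensemble type to prioritize.

from statistics import pstdev

# Hypothetical projected changes in some climate indicator
members_of_one_model = [1.8, 2.3, 2.0, 2.2, 1.9]   # varies with initial conditions
one_member_per_model = [1.5, 2.6, 3.4, 2.0, 4.1]   # varies with model structure

internal_spread = pstdev(members_of_one_model)
structural_spread = pstdev(one_member_per_model)

if structural_spread > internal_spread:
    print("prioritize a multi-model ensemble")
else:
    print("prioritize a single-model, multi-member ensemble")
```

Note that the one-member-per-model spread also contains some internal variability, so this comparison only gives a rough indication, not a clean partition of the two uncertainty sources.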
There are also some objective methods that can be used to reduce the size of your ensemble but maintain representation of the statistics of the overall ensemble. You can see some examples of these advanced methods here.