A large source of weather and climate model uncertainty is the approximate representation of unresolved subgrid processes through parameterization schemes. Traditional, deterministic parameterization schemes represent the mean or most likely subgrid scale forcing for a given resolved-scale state. While model errors can be reduced to a certain degree through improvements to such parameterizations, they cannot be eliminated. Irreducible uncertainties result from a lack of scale separation between resolved and unresolved processes. Uncertainty in weather forecasts also arises because the chaotic nature of the atmosphere gives rise to sensitivity to uncertain initial conditions. Practically, uncertainty is represented in forecasts using ensembles of integrations of comprehensive weather and climate prediction models, first suggested by Leith (). To produce reliable probabilistic forecasts, the generation of the ensemble must include a representation of both model and initial condition uncertainty.
Initial condition uncertainty is addressed by perturbing the initial conditions of ensemble members, for example, by selecting directions of optimal perturbation growth using singular vectors (Buizza & Palmer, ) or by characterizing initial condition uncertainty during the data assimilation cycle (Isaksen et al., ). One approach for representing irreducible model uncertainty is stochastic parameterization of unresolved physical processes. A stochastic parameterization represents the probability distribution of possible subgrid scale tendencies conditioned on the large scale. Each ensemble member experiences one possible, equally likely realization of the subgrid-scale tendencies. A more detailed motivation for including stochastic parameterizations in weather and climate models is presented in Palmer ().
Stochastic approaches for numerical weather prediction (NWP) were originally proposed for use in the European Centre for Medium-Range Weather Forecasts (ECMWF) ensemble prediction system (Buizza et al., ; Palmer et al., ). They were shown to substantially improve the quality of initialized ensemble forecasts and so became widely adopted by meteorological services around the world, where they are used to produce operational ensemble weather, subseasonal, and seasonal forecasts (Berner et al., ; Leutbecher et al., ; Palmer et al., ; Palmer, ; Reyes et al., ; Suselj et al., ; Sušelj et al., ; Stockdale et al., ; Sanchez et al., ; Teixeira & Reynolds, ; Weisheimer et al., ).
Recent work has assessed the impact of stochastic parameterization schemes in both idealized and state-of-the-art climate models for long-term integration (Ajayamohan et al., ; Christensen et al., ; Dawson & Palmer, ; Davini et al., ; Juricke & Jung, ; Strømmen et al., ; Williams, ; Wang et al., ). These studies demonstrate that including a stochastic representation of model uncertainty can go beyond improving initialized forecast reliability and can also lead to improvements in the model mean state (Berner et al., ; Palmer, ) and climate variability (Ajayamohan et al., ; Christensen et al., ; Dawson & Palmer, ) and change a model's climate sensitivity (Seiffert & von Storch, ). These impacts occur through nonlinear rectification, noise-enhanced variability, and noise-induced regime transitions (Berner et al., ). In this way, small-scale variability represented by stochastic physics can impact large spatiotemporal scales of climate variability.
Despite the historical disconnect between the weather and climate prediction communities, the boundaries between weather and climate prediction are somewhat artificial (Hurrell et al., ; Palmer et al., ; Shapiro et al., ). This disconnect is challenged by recent advances in prediction on time scales from weather to subseasonal-to-seasonal and decadal by operational weather forecasting centers around the world (Moncrieff et al., ; Vitart & Robertson, ) and the ability of global cloud-resolving models to both forecast the weather and simulate the long-term climate (Crueger et al., ; Satoh et al., ; Zangl et al., ). Nonlinearities in the climate system lead to an upscale transfer of energy (and therefore error) from smaller to larger scales (Lorenz, ; Palmer, ; Tribbia & Baumhefner, ). At the same time, slowly evolving modes of variability can produce predictable signals on shorter time scales (Hoskins, ; Vannitsem & Lucarini, ). Under the “seamless prediction” paradigm, the weather and climate communities should work together to develop Earth system models (Brunet et al., ; Christensen & Berner, ), as developments made in one community are expected to benefit the other. The development and use of stochastic parameterizations is a good example of this paradigm at work.
Recent years have seen substantial interest in the development of stochastic parameterization schemes in weather and climate models. Pragmatic approaches, such as the stochastically perturbed parameterization tendencies (SPPT) scheme (Buizza et al., ; Palmer et al., ), are widely used due to their ease of implementation and beneficial impacts on the model (Christensen et al., ; Leutbecher et al., ; Sanchez et al., ). Other schemes predict the statistics of model uncertainty using a theoretical understanding of the atmospheric processes involved, such as the statistics of convection (Bengtsson et al., ; Craig & Cohen, ; Khouider et al., ; Sakradzija & Klocke, ). A third approach is to make use of observations or high-resolution simulations to characterize variability that is unresolved in a low-resolution forecast model (Shutts & Palmer, ). This last approach can be used to develop data-driven stochastic schemes (Bessac et al., ; Dorrestijn et al., ) or to constrain tunable parameters in stochastic parameterizations (Christensen et al., ; Christensen, ; Shutts & Pallares, ). A drawback of these data-driven approaches is that assumptions are made about the structure of the stochastic parameterization (e.g., the physical process to focus on, or the distribution of the stochastic term conditioned on the resolved state) in order to make the analysis tractable using conventional methods.
Machine learning models offer an approach to parameterizing complex nonlinear subgrid processes in a potentially computationally efficient manner from data describing those processes. The family of machine learning models consists of mathematical models whose structure and parameters (often denoted weights) are optimized for predictive performance on a priori unknown relationships between input (“predictor”) and output (“predictand”) variables. Commonly used machine learning frameworks range in complexity from simple linear regression to decision trees and neural networks (Hastie et al., ). More complex methods allow modeling of broader classes of predictor-predictand relationships but risk “overfitting” to spurious patterns and producing predictions with large variance over small changes to the inputs. Machine learning practitioners minimize the risk of overfitting through the use of regularization techniques that impose constraints on the predictions through the model structure and the optimization loss function. Regularization is critical for more complex machine learning models, so that they can converge to optimal and robust configurations in large parameter spaces. Machine learning for parameterizations has been considered since the work of Krasnopolsky et al. (), and recently, multiple groups have begun developing new parameterizations for a variety of processes (Bolton & Zanna, ; Gentine et al., ; Rasp et al., ; Schneider et al., ). The most common regularization in existing machine learning parameterization approaches has been multitask learning (Caruana, ), in which the machine learning model predicts multiple correlated values simultaneously and learns to preserve the correlations. However, these schemes have focused exclusively on deterministic parameterization approaches, although the need for stochastic perturbations is increasingly being recognized (Brenowitz & Bretherton, ).
One active area in current machine learning research is generative modeling, which focuses on models that create synthetic representative samples from distributions of arbitrary complexity. Generative adversarial networks, or GANs (Goodfellow et al., ), are a class of generative models that consist of two neural networks in mutual competition. The generator network produces synthetic samples similar to the original training data from a latent vector, and the critic, or discriminator, network examines samples from the generator and the training data in order to determine if a sample is real or synthetic. The discriminator acts as an adaptable loss function for the generator by learning features of the training data and teaching those features to the generator through back-propagation. The original GAN formulation used a latent vector of random numbers as the only input to the generator, but subsequent work on conditional GANs (Mirza & Osindero, ) introduced the ability to incorporate a combination of fixed and random inputs into the generator, enabling sampling from conditional distributions. Because the stochastic parameterization problem can be framed as sampling from the distribution of subgrid tendencies conditioned on the resolved state, conditional GANs have the potential to perform well on this task.
The purpose of this study is to evaluate how well GANs can parameterize the subgrid tendency component of an atmospheric model at weather and climate time scales. A key question is whether a GAN can learn uncertainty quantification within the parameterization framework, removing the need to retrospectively develop separate stochastic representations of structural model uncertainty. While the ultimate goal is to test these ideas in a full general circulation model (GCM; left for future work), as a proof of concept, we use the two-time scale model proposed in Lorenz (), henceforth the L96 system, as a test bed for assessing the use of GANs in atmospheric models. Simple chaotic dynamical systems such as L96 are useful for testing methods in atmospheric modeling due to their transparency and low computational cost. The L96 system has been widely used as a test bed in studies including the development of stochastic parameterization schemes (Arnold et al., ; Crommelin & Vanden-Eijnden, ; Chorin & Lu, ; Kwasniok, ; Wilks, ), data assimilation methodology (Fertig et al., ; Hatfield et al., ; Law et al., ), and the use of ML approaches to learn improved deterministic parameterization schemes (Dueben & Bauer, ; Schneider et al., ; Watson, ).
The evaluation consists of four primary questions. First, given inputs from the “true” L96 model run, how closely does the GAN approximate the true distribution of subgrid tendencies? Second, when an ensemble of L96 models with stochastic GAN parameterizations are integrated forward to medium-range weather prediction time scales, how quickly does the prediction error increase and how well does the ensemble spread capture the error growth? Third, when the L96 model with a stochastic GAN parameterization is integrated out to climate time scales, how well does the GAN simulation approximate the properties of the true climate? Fourth, how closely does the GAN represent both different regimes within the system and the probability of switching between them?
Details of the Lorenz '96 model and the GAN are presented in Section 2, and the results of the weather and climate analyses described above are presented in Section 3. Section 4 presents an overall discussion of results. Conclusions follow in Section 5. A draft of this paper has been posted on the arXiv website (Gagne et al., ).
Methods

Lorenz '96 Model Configuration

The L96 system was designed as a “toy model” of the extratropical atmosphere, with simplified representations of advective nonlinearities and multiscale interactions (Lorenz, ). It consists of two scales of variables arranged around a latitude circle. The large-scale, low-frequency X variables are coupled to a larger number of small-scale, high-frequency Y variables, with a two-way interaction between the X's and Y's. It is the interaction between variables of different scales that makes the L96 system ideal for evaluating new ideas in parameterization development. The L96 system has proven useful in assessing new techniques that were later developed for use in GCMs (Crommelin & Vanden-Eijnden, ; Dorrestijn et al., ).
The X and Y variables evolve following

$$\frac{dX_k}{dt} = -X_{k-1}\left(X_{k-2} - X_{k+1}\right) - X_k + F - \frac{hc}{b}\sum_{j=J(k-1)+1}^{kJ} Y_j,$$

$$\frac{dY_j}{dt} = -cb\,Y_{j+1}\left(Y_{j+2} - Y_{j-1}\right) - c\,Y_j + \frac{hc}{b}X_{\lceil j/J \rceil},$$

where in the present study the number of X variables is K = 8 and the number of Y variables per X variable is J = 32. Further, we set the coupling constant to h = 1, the spatial-scale ratio to b = 10, and the temporal-scale ratio to c = 10. The forcing term F = 20 is set large enough to ensure chaotic behavior. The chosen parameter settings, which were used in Arnold et al. (), are such that one model time unit (MTU) is approximately equivalent to five atmospheric days, deduced by comparing error-doubling times in L96 and the atmosphere (Lorenz, ).
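To make the time stepping concrete, the following minimal NumPy sketch integrates the two-scale system above with a fixed-step RK4 scheme. The function and variable names are our own and the snippet is illustrative rather than the code used in the study; the parameter values simply mirror the configuration described above.

```python
import numpy as np

# Illustrative parameter values mirroring the configuration described above.
K, J = 8, 32                          # number of X variables; Y variables per X
h, b, c, F = 1.0, 10.0, 10.0, 20.0

def l96_tendencies(X, Y):
    """Tendencies of the two-scale Lorenz '96 system (X has length K, Y length K*J)."""
    dX = (-np.roll(X, 1) * (np.roll(X, 2) - np.roll(X, -1)) - X + F
          - (h * c / b) * Y.reshape(K, J).sum(axis=1))
    dY = (-c * b * np.roll(Y, -1) * (np.roll(Y, -2) - np.roll(Y, 1)) - c * Y
          + (h * c / b) * np.repeat(X, J))
    return dX, dY

def rk4_step(X, Y, dt):
    """One fixed-step fourth-order Runge-Kutta update of the coupled system."""
    k1x, k1y = l96_tendencies(X, Y)
    k2x, k2y = l96_tendencies(X + 0.5 * dt * k1x, Y + 0.5 * dt * k1y)
    k3x, k3y = l96_tendencies(X + 0.5 * dt * k2x, Y + 0.5 * dt * k2y)
    k4x, k4y = l96_tendencies(X + dt * k3x, Y + dt * k3y)
    return (X + dt / 6.0 * (k1x + 2 * k2x + 2 * k3x + k4x),
            Y + dt / 6.0 * (k1y + 2 * k2y + 2 * k3y + k4y))
```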
In this study, the full Lorenz '96 equations are treated as the “truth” which must be forecast or simulated. In the case of the atmosphere, the physical equations of motion of the system are known. However, due to limited computational resources, it is not possible to explicitly simulate the smallest scales, which are instead parameterized. Motivated by this requirement for weather and climate prediction, a forecast model for the L96 system is constructed by truncating the model equations and parameterizing the impact of the small scales on the resolved scales:

$$\frac{dX_k^*}{dt} = -X_{k-1}^*\left(X_{k-2}^* - X_{k+1}^*\right) - X_k^* + F + U_p\left(X_k^*\right),$$

where X_k^* is the forecast estimate of X_k and U_p is the parameterized subgrid tendency. The parameterization approximates the true subgrid tendencies,

$$U_k = -\frac{hc}{b}\sum_{j=J(k-1)+1}^{kJ} Y_j,$$

which can be estimated from realizations of the “truth” time series as

$$U_k(t) \approx \frac{X_k(t+\Delta t) - X_k(t)}{\Delta t} - \left[-X_{k-1}(t)\left(X_{k-2}(t) - X_{k+1}(t)\right) - X_k(t) + F\right],$$

following Arnold et al. (). The time step Δt in this estimate equals the time step used in the forecast model for consistency.
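Under the same assumptions as the sketch above, the finite-difference estimate of the subgrid tendency can be computed from consecutive truth-run states as follows; the helper names are ours and illustrative.

```python
import numpy as np

def resolved_tendency(X, F=20.0):
    """Tendency of the truncated (X-only) model, excluding the parameterized term."""
    return -np.roll(X, 1) * (np.roll(X, 2) - np.roll(X, -1)) - X + F

def estimate_subgrid_tendency(X_now, X_next, dt):
    """Finite-difference estimate of U(t): total increment minus the resolved tendency."""
    return (X_next - X_now) / dt - resolved_tendency(X_now)
```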
A long “truth” run of the L96 model is performed to generate both training data for the machine learning models and a test period for both weather and climate evaluations. The “truth” run is integrated for 20,000 MTU using a fourth-order Runge-Kutta (RK4) time stepping scheme and a time step of 0.001 MTU. Output from the first 2,000 MTU is used for training, and the remaining 18,000 MTU are used for testing. A burn-in period of 2 MTU is discarded. All parameterized forecast models of the L96 system use a forecast time step of 0.005 MTU and a second-order Runge-Kutta (RK2) time stepping scheme. The RK2 scheme is used instead of RK4 to represent the temporal discretization of the equations representing the resolved dynamics in an atmospheric forecasting model.
GAN Parameterizations

The GAN parameterization developed for the Lorenz '96 model in this study utilizes a conditional dense GAN to predict the subgrid tendency U at the current time step given information about the state at the previous time step. We investigate two classes of predictors of U_t: both X_{t-1} and U_{t-1} from the previous forecast time step, and X_{t-1} alone. In the following discussion, we focus on GANs based on the first of these predictor sets; the construction of those associated with the second set is analogous. Note that in this section, we move to a discrete time notation, with the time index indicated by the subscript t, where adjacent indices are separated by the forecast time step Δt.
The GAN generator G accepts X_{t-1}, U_{t-1}, and a latent Gaussian random vector z as input to estimate Û_t, the predicted U at time t. The discriminator D accepts X_{t-1}, U_{t-1}, and U_t as inputs (where U_t may be either the true value if drawn from the training data or Û_t if drawn from the generator) and outputs the probability that U_t comes from the training data. All inputs and outputs are rescaled to have a mean of 0 and standard deviation of 1 based on the training data distributions. Note that we choose to develop local GANs, that is, GANs which accept X_{t-1} and U_{t-1} for a given spatial index k and predict U_t for that same index k, as opposed to GANs that accept the full vectors X and U and thus include spatial information. This is to match the local nature of parameterization schemes in weather and climate models.
Each GAN we consider consists of the same neural network architecture with variations in the inputs and in how noise is scaled and inserted into the network. A diagram of the architecture of the GAN networks is shown in Figure . Both the generator and discriminator networks contain two hidden layers with 16 neurons in each layer. The weights of the hidden layers are regularized with an L2, or ridge, penalty (Hoerl & Kennard, ) with a scale factor of 0.001, which was chosen after evaluating different values and selecting the one that minimized the Hellinger distance. Scaled exponential linear unit (SELU) activation functions (Klambauer et al., ) follow each hidden layer. SELU is a variation of the common rectified linear unit (ReLU) activation function with a scaled exponential transform for negative values that helps ensure the output distribution retains a mean of 0 and standard deviation of 1. Larger numbers of neurons per hidden layer were evaluated and did not result in improved performance. Gaussian additive noise layers before each hidden layer, and optionally before the output layer, inject noise into the latent representations of the input data. A batch normalization (Ioffe & Szegedy, ) output layer ensures that the output values retain a mean of 0 and standard deviation of 1, which helps the generator converge to the true distribution faster.
(Top) A diagram of how the GAN networks are connected for training. (Bottom) A diagram of the GAN network architectures used for the stochastic parameterization.
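The architecture described above can be sketched in Keras roughly as follows. This is an illustrative reconstruction, not the authors' released code: the builder function names are ours, and one practical caveat is that Keras GaussianNoise layers are only active when the model is called with training=True, which matters when sampling stochastically from the trained generator.

```python
from tensorflow.keras import layers, models, regularizers

def build_generator(noise_sd=0.01, latent_dim=1, output_noise=True):
    """Sketch of the conditional generator: (X_{t-1}, U_{t-1}, z) -> predicted U_t."""
    x_in = layers.Input(shape=(1,), name="X_prev")
    u_in = layers.Input(shape=(1,), name="U_prev")
    z_in = layers.Input(shape=(latent_dim,), name="latent")
    net = layers.Concatenate()([x_in, u_in, z_in])
    for _ in range(2):                                   # two hidden layers of 16 neurons
        net = layers.GaussianNoise(noise_sd)(net)        # additive noise before each hidden layer
        net = layers.Dense(16, activation="selu",
                           kernel_regularizer=regularizers.l2(0.001))(net)
    if output_noise:                                     # optional noise before the output layer
        net = layers.GaussianNoise(noise_sd)(net)
    out = layers.BatchNormalization()(layers.Dense(1)(net))   # keep output near mean 0, sd 1
    return models.Model([x_in, u_in, z_in], out)

def build_discriminator(noise_sd=0.01):
    """Sketch of the discriminator: (X_{t-1}, U_{t-1}, U_t) -> probability the sample is real."""
    x_in = layers.Input(shape=(1,), name="X_prev")
    u_in = layers.Input(shape=(1,), name="U_prev")
    ut_in = layers.Input(shape=(1,), name="U_t")
    net = layers.Concatenate()([x_in, u_in, ut_in])
    for _ in range(2):
        net = layers.GaussianNoise(noise_sd)(net)
        net = layers.Dense(16, activation="selu",
                           kernel_regularizer=regularizers.l2(0.001))(net)
    prob = layers.Dense(1, activation="sigmoid")(net)
    return models.Model([x_in, u_in, ut_in], prob)

# Note: GaussianNoise is inactive at inference by default; call the trained generator
# with training=True so that stochastic sampling persists when it is used as a parameterization.
```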
The GAN training procedure iteratively updates the discriminator and generator networks until the networks reach an equilibrium in which the discriminator struggles to distinguish true from generated samples. The inputs, outputs, and connections between networks are shown in Figure . A batch (a subset of samples drawn randomly without replacement from the training data) of truth run output is split in half. One subset is fed through the generator G and then into the discriminator D, and the other is sent directly to the discriminator. The discriminator weights are then updated based on the following loss function L_D,

$$L_D = -\left\langle \log D\left(X_{t-1}, U_{t-1}, U_t\right) \right\rangle_B - \left\langle \log\left[1 - D\left(X_{t-1}, U_{t-1}, G\left(X_{t-1}, U_{t-1}, z\right)\right)\right] \right\rangle_B.$$
Here ⟨·⟩_B is the expected value over a single batch of data. Another batch of samples is drawn and sent through the generator and then the discriminator with frozen weights. The generator loss is calculated as

$$L_G = -\left\langle \log D\left(X_{t-1}, U_{t-1}, G\left(X_{t-1}, U_{t-1}, z\right)\right) \right\rangle_B,$$

based on labeling the generated samples as originating from the truth run. This reversal forces the discriminator to teach the generator features that would worsen the discriminator's own performance.
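A minimal sketch of one alternating training update is given below, assuming the hypothetical build_generator and build_discriminator helpers above and NumPy column arrays x_prev, u_prev, and u_true (shape (batch, 1)) holding one shuffled batch. It follows the standard Keras conditional-GAN recipe with binary cross entropy, which corresponds to the loss functions written above with the non-saturating generator trick.

```python
import numpy as np
from tensorflow.keras import layers, models, optimizers

latent_dim = 1
gen = build_generator(latent_dim=latent_dim)
disc = build_discriminator()
disc.compile(optimizer=optimizers.Adam(1e-4), loss="binary_crossentropy")

# Combined model: generator followed by a frozen discriminator (used for the generator update).
disc.trainable = False
x_in = layers.Input(shape=(1,))
u_in = layers.Input(shape=(1,))
z_in = layers.Input(shape=(latent_dim,))
combined = models.Model([x_in, u_in, z_in],
                        disc([x_in, u_in, gen([x_in, u_in, z_in])]))
combined.compile(optimizer=optimizers.Adam(1e-4), loss="binary_crossentropy")

def train_step(x_prev, u_prev, u_true):
    """One alternating update: half the batch trains D on real data, half on generated data."""
    half = len(x_prev) // 2
    z = np.random.normal(size=(half, latent_dim))
    u_fake = gen.predict([x_prev[:half], u_prev[:half], z])
    disc.train_on_batch([x_prev[half:], u_prev[half:], u_true[half:]],
                        np.ones((len(x_prev) - half, 1)))      # real samples labeled 1
    disc.train_on_batch([x_prev[:half], u_prev[:half], u_fake],
                        np.zeros((half, 1)))                   # generated samples labeled 0
    z = np.random.normal(size=(len(x_prev), latent_dim))
    # Generator update: label generated samples as real so D "teaches" G.
    combined.train_on_batch([x_prev, u_prev, z], np.ones((len(x_prev), 1)))
```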
Gaussian noise is injected into the neural network at each iteration through both the input and hidden layers. We consider hidden layer Gaussian noise scaled to standard deviations of different orders of magnitude (Table ) in order to evaluate how the magnitude of the noise affects the forecast spread and the representation of the model climate. In forecast mode, we test providing both white (uncorrelated) and red (temporally correlated) noise to the GAN. The red noise is generated using an AR(1) process with a lag-1 temporal autocorrelation equal to the lag-1 autocorrelation of the deterministic residuals of the GAN,

$$e_t = U_t - G\left(X_{t-1}, U_{t-1}, 0\right),$$

$$\phi = \mathrm{corr}\left(e_t, e_{t-1}\right),$$

$$z_t = \phi\, z_{t-1} + \sqrt{1 - \phi^2}\,\epsilon_t, \qquad \epsilon_t \sim N(0, 1).$$

The color of the noise is not relevant during the training process because the GAN only uses inputs from the previous time step, like parameterizations in full-complexity weather and climate models. Both white- and red-noise representations are trained in the same way. The noise values are kept constant through the integration of a single time step. The difference between the white- and red-noise representations only manifests when they are incorporated as parameterizations in the full forecast model.
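The red-noise latent input can be generated with a short AR(1) recursion such as the sketch below; the function name and interface are ours, and setting phi = 0 recovers the white-noise case.

```python
import numpy as np

def ar1_noise(n_steps, phi, size=1, rng=None):
    """Red (AR(1)) latent noise with lag-1 autocorrelation phi and unit variance."""
    rng = np.random.default_rng() if rng is None else rng
    z = np.empty((n_steps, size))
    z[0] = rng.standard_normal(size)
    for t in range(1, n_steps):
        z[t] = phi * z[t - 1] + np.sqrt(1.0 - phi ** 2) * rng.standard_normal(size)
    return z
```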
Summary of the GAN Configurations Tested
| Short name | Input variables | Noise magnitude | Noise correlation | Output layer noise? |
|---|---|---|---|---|
| XU-lrg-w | X, U | 1 | white | yes |
| XU-med-w | X, U | 0.1 | white | yes |
| XU-sml-w | X, U | 0.01 | white | yes |
| XU-tny-w | X, U | 0.001 | white | yes |
| X-med-w | X | 0.1 | white | yes |
| X-sml-w | X | 0.01 | white | yes |
| X-tny-w | X | 0.001 | white | yes |
| XU-lrg-r | X, U | 1 | red | yes |
| XU-med-r | X, U | 0.1 | red | yes |
| XU-sml-r | X, U | 0.01 | red | yes |
| XU-tny-r | X, U | 0.001 | red | yes |
| X-med-r | X | 0.1 | red | yes |
| X-sml-r | X | 0.01 | red | yes |
| X-tny-r | X | 0.001 | red | yes |
| XU-lrg-w* | X, U | 1 | white | no |
| XU-med-w* | X, U | 0.1 | white | no |
| XU-sml-w* | X, U | 0.01 | white | no |
| XU-tny-w* | X, U | 0.001 | white | no |
| X-sml-w* | X | 0.01 | white | no |
| X-tny-w* | X | 0.001 | white | no |
The GANs are all trained with a consistent set of optimization parameters. The GANs are updated through stochastic gradient descent with a batch size (number of examples randomly drawn without replacement from the training data) of 1,024 and a learning rate of 0.0001 with the Adam optimizer (Kingma & Ba, ). The GANs are trained on 639,936 samples and are validated on 28,797,096 samples from different portions of the truth run. The GANs are trained for 30 epochs, or passes through the training data. The model weights are saved for analysis every epoch for the first 20 epochs and then every 2 epochs between epochs 20 and 30. The GANs are developed with the Keras v2.2 machine learning API coupled with Tensorflow v1.13.
The GAN configurations considered in this study are summarized in Table . A short name of the format “predictors–noise magnitude–noise correlation” is introduced to simplify identification of different GANs. For example, “XU-med-r” refers to the GAN that takes X and U as predictors and uses medium (med) magnitude red (r) noise. While most GANs tested include the optional additive noise layer before the output layer, the sensitivity to this choice was also considered. GANs that do not include this final noise layer follow the naming convention above, but are indicated by an asterisk.
Polynomial Regression Parameterization

The GAN stochastic parameterization is evaluated against a cubic polynomial regression parameterization, U_p, similar to the model used in Arnold et al. (),

$$U_p\left(X_k\right) = b_0 + b_1 X_k + b_2 X_k^2 + b_3 X_k^3.$$
The parameters b_0, …, b_3 are determined by a least squares fit to the data from the L96 “truth” training run. It is known that the simple cubic polynomial deterministic parameterization is a poor forecast model for the L96 system (Arnold et al., ; Christensen et al., ; Wilks, ), as X_k does not uniquely determine U_k. The variability in the relationship is accounted for using a temporally correlated additive noise term:

$$U_p\left(X_k, t\right) = b_0 + b_1 X_k + b_2 X_k^2 + b_3 X_k^3 + e_k(t), \qquad e_k(t) = \phi\, e_k(t - \Delta t) + \sigma_e \sqrt{1 - \phi^2}\, z_k(t),$$

where z_k(t) ∼ N(0, 1), the first-order autoregressive parameters φ and σ_e are fit from the residual time series, and the noise processes e_k are independent for the different X variables.
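As an illustration, the cubic fit and its AR(1) residual model can be estimated with NumPy as in the following sketch, assuming arrays X_train and U_train of resolved states and diagnosed subgrid tendencies from the truth run (the array names and function interface are ours).

```python
import numpy as np

# Cubic fit U_p(X) = b0 + b1*X + b2*X^2 + b3*X^3 to truth-run (X, U) pairs.
coeffs = np.polyfit(X_train.ravel(), U_train.ravel(), 3)

# AR(1) model for the residuals e = U - U_p(X).
resid = U_train.ravel() - np.polyval(coeffs, X_train.ravel())
phi = np.corrcoef(resid[:-1], resid[1:])[0, 1]     # lag-1 autocorrelation
sigma_e = resid.std()

def poly_param(x, e_prev, rng):
    """Deterministic cubic fit plus temporally correlated additive noise."""
    e = phi * e_prev + sigma_e * np.sqrt(1.0 - phi ** 2) * rng.standard_normal(x.shape)
    return np.polyval(coeffs, x) + e, e
```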
The polynomial parameterization has been specifically designed to represent the impact of the Y variables in this version of the L96 model, just as traditional parameterization schemes are designed to represent a given process in a GCM. Previous studies have demonstrated that the polynomial parameterization with additive noise performs very well (Arnold et al., ; Christensen et al., ; Wilks, ). Although the multiplicative noise polynomial parameterization performs slightly better in Arnold et al. (), we compare the GAN with the additive noise polynomial to ensure consistency in the noise inclusion process. This “bespoke” parameterization is therefore a stringent benchmark against which to test GAN parameterizations.
Results

Metrics

The accuracy of ensemble weather forecasts can be summarized by evaluating the root mean square error (RMSE) of the ensemble mean. The lower the RMSE, the more accurate the forecast. The RMSE at lead time τ is defined as

$$\mathrm{RMSE}(\tau) = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(\overline{X}_n(\tau) - X^o\left(t_n + \tau\right)\right)^2},$$

where N is the number of forecast-observation pairs, X^o(t) is the observed state at time t, and \overline{X}_n(τ) is the ensemble mean forecast at lead time τ, initialized at time t_n, such that the verification time is t = t_n + τ.
If an ensemble forecast correctly accounts for all sources of uncertainty, such that the forecast spread of the ensemble and the measured probabilities of an event are statistically consistent, the forecast is said to be reliable. For a reliable forecast, the ensemble spread matches, on average, the RMSE of the ensemble mean.
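For reference, a minimal sketch of how the ensemble-mean RMSE and ensemble spread can be computed at a single lead time is given below; the array layout is an assumption of ours.

```python
import numpy as np

def rmse_and_spread(ens_forecasts, truth):
    """Ensemble-mean RMSE and mean ensemble spread at one lead time.

    ens_forecasts: array (n_cases, n_members, K) of forecasts at that lead time.
    truth:         array (n_cases, K) of verifying truth states.
    """
    ens_mean = ens_forecasts.mean(axis=1)
    rmse = np.sqrt(np.mean((ens_mean - truth) ** 2))
    spread = np.sqrt(np.mean(ens_forecasts.var(axis=1, ddof=1)))
    return rmse, spread
```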
The simplest definition of the “climate” of the L96 system is the probability density function (PDF) of the individual X values. The climatological skill can therefore be summarized by quantifying the difference between the true and forecast PDFs. The Hellinger distance H is calculated for each forecast model,

$$H(p, q) = \sqrt{\frac{1}{2}\int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 dx},$$

where p is the PDF of forecast X values and q is the PDF of truth X values (Pollard, ). The smaller H, the closer the forecast model PDF is to the truth PDF. We also considered the Kullback-Leibler (KL) divergence (Kullback & Leibler, ), motivated by information theory, but found it provided no additional information over the Hellinger distance, so results for the KL divergence are not shown for brevity.
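A histogram-based estimate of the Hellinger distance can be computed as in the following sketch; the bin count and bounds here are illustrative choices of ours, not those used in the paper.

```python
import numpy as np

def hellinger_distance(sample_f, sample_t, bins=100, bounds=(-10.0, 20.0)):
    """Hellinger distance between histogram estimates of the forecast and truth pdfs."""
    edges = np.linspace(bounds[0], bounds[1], bins + 1)
    p, _ = np.histogram(sample_f, bins=edges, density=True)
    q, _ = np.histogram(sample_t, bins=edges, density=True)
    dx = np.diff(edges)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2 * dx))
```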
Evidence that the L96 model displays distinct dynamical regimes of behavior for the parameter set considered was presented in Christensen et al. (), in which regime affiliations were determined using a metric based on the temporally localized spatial covariance. Christensen et al. () found that during the more common regime (regime frequency 80%), the eight X variables exhibit wave-2-like behavior around the ring, while in the rarer regime, the X variables exhibit wave-1-type behavior. Another approach to characterizing regime structure that makes use of both the instantaneous state and the recent past of the system is hidden Markov model (HMM) analysis (Franzke et al., ; Monahan et al., ; Rabiner, ). In an HMM analysis, it is assumed that underlying the observed state variables is an unobserved Markov chain taking discrete values. The HMM algorithm provides maximum likelihood estimates of the probability distributions of the state variables conditioned on the instantaneous hidden state values, the stochastic matrix of transition probabilities for each time step, and an optimal estimate of the hidden state sequence.
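As an illustration of the HMM step, the sketch below fits a two-state Gaussian HMM to the wavenumber-amplitude features using the third-party hmmlearn package; the choice of library and the settings shown are ours, not necessarily those used in the study.

```python
from hmmlearn.hmm import GaussianHMM  # third-party HMM package; the choice of tool is ours

def fit_two_state_hmm(Z, n_iter=100, seed=0):
    """Fit a two-state Gaussian HMM to wavenumber-amplitude features Z (n_times, 4).

    Returns the most likely hidden regime sequence and the per-step transition
    (stochastic) matrix."""
    hmm = GaussianHMM(n_components=2, covariance_type="full",
                      n_iter=n_iter, random_state=seed)
    hmm.fit(Z)
    return hmm.predict(Z), hmm.transmat_
```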
Off-line Assessment of GAN Performance

The GAN parameterizations are first evaluated on how closely their output subgrid forcing distributions match those of the truth run when the GANs are supplied with input X and U values from the truth run. This is summarized by the Hellinger distance in Figure . Most of the GANs show a trend of decreasing Hellinger distance for the first few epochs followed by mostly stable oscillations. GANs with both X and U as input tend to perform better in the off-line analysis than those with only X. Larger input noise standard deviations seem to reduce the amount of fluctuation in the Hellinger distance between epochs, but there does not appear to be a consistent correlation between noise standard deviation and Hellinger distance. Note that the weights as fitted at the end of epoch 30 are used in the forecast networks, regardless of whether the GAN at this epoch shows the minimum off-line Hellinger distance, because we wanted to give each network a consistent amount of training time.
Off-line assessment of GAN performance. Hellinger distances between the GAN subgrid tendency distributions given input X and U values from the truth run and the truth run subgrid forcing distribution as a function of training epoch.
The joint distributions of X_{t-1} and U_t from the different model runs reveal how the noise standard deviation affects the model climate (Figure ). Larger noise standard deviations increase the range of X values appearing in the run but do not appear to change the range of U values output by the GAN. The X-only GANs do the best job of capturing the shape of the truth distribution, although some, such as X-sml-r, have a translational bias in their joint distribution that the Hellinger distance penalizes. While the XU-w* and X-w* GANs capture the bulk of the distribution well, there are spurious points outside the bounds of the truth distribution for all of these models. The XU-*-r GANs produce either an overly diffuse but unbiased distribution or fail to capture the true distribution entirely. The polynomial model captures the conditional mean and shape of the distribution very well but produces an overly smooth representation of the truth and does not capture the bifurcation in the positive-X, positive-U quadrant.
Joint distributions (2-D histograms) of Xt−1 and Ut for each GAN configuration. The truth joint distribution is overlaid in red contours on each forecast model distribution. The distributions are ordered from left to right descending in terms of their relative total marginal Hellinger distances.
The parameterized models for the Lorenz '96 system are evaluated in a weather forecast framework. An extensive set of reforecast experiments was produced for 751 initial conditions selected from the attractor. An ensemble of 40 forecasts was produced from each initial condition (i.e., no initial condition perturbations are used). Different random seeds are used for each ensemble member to generate the stochastic perturbations in the GAN or polynomial parameterizations.
Figure shows the RMSE and spread for all weather experiments at a lead time of 1 MTU. A reduction in RMSE indicates an ensemble forecast that more closely tracks the observations. A good match between RMSE and ensemble spread indicates a reliable forecast. The best performing GANs in terms of RMSE are X-tny-r, X-tny-w, and X-tny-w*. All of these models performed slightly better than the polynomial regression, which was competitive with most GANs in terms of both RMSE and the ratio of RMSE to spread. The spread of the white noise GANs is generally underdispersive. Most of the red noise GANs, on the other hand, are somewhat overdispersive, with X-med-r having the spread/error ratio closest to 1. Red noise perturbs the model in a similar direction for a longer period, which results in greater ensemble spread compared with white noise. Figure shows the RMSE and ensemble spread for a subset of the ensemble forecasts of the X variables performed using the GAN parameterizations or the bespoke polynomial parameterization. X-sml-w* demonstrates both low RMSE and spread similar to the RMSE throughout the forecast period. X-sml-r and X-sml-w feature similar RMSE through 1 MTU, but X-sml-r has smaller RMSE after that point and a better spread-to-error ratio throughout the period. The XU GANs have higher RMSEs than their X counterparts because the XU models may have overfit to the strong correlation between U_{t-1} and U_t in the training data. Inspection of the input weights revealed that the XU GANs generally weight U_{t-1} more highly than X_{t-1}. At maxima and minima in the waves, the XU models may be biased toward extending the current growth forward, which can be a source of error in forecast runs.
Summary of performance of different parameterized models (x-axis). (a) Weather forecast performance. Ensemble spread (circles) and RMSE (crosses) for experiments with white and red noise in GANs at time step 201. The horizontal dashed line indicates the RMSE for the polynomial forecast model. Ideally, a forecast model will produce forecasts with small RMSE while maintaining the match between spread and RMSE. (b) Climate performance. The Hellinger distance between each forecast pdf and the “true” pdf. The metric in question is calculated for the best estimate of the climatological pdf, averaging across all X variables (circles), as well as for each X variable in turn, for example, comparing true and forecast X1 pdfs, etc (crosses). The latter gives an indication of the sampling uncertainty. The horizontal dashed line indicates the mean value of H for the polynomial model.
RMSE (lines with dots) and ensemble spread (dashed lines) for a subset of experiments with white and red noise in GANs. Note that 400 forecast time steps corresponds to 2 MTU, or 10 “atmospheric days.”
The GAN parameterizations are also tested on their ability to characterize the climate of the Lorenz '96 system. First, the ability to reproduce the pdf of the X variables is evaluated. Each forecast model and the full L96 system were used to produce a long simulation of length 10,000 MTU. Figure a shows kernel density estimates of the marginal pdfs of X from the full L96 system and from a sample of parameterized models. The pdf of the true L96 system is markedly non-Gaussian, with enhanced density forming a “hump” in the positive tail of the distribution. Compared to the true distribution, the XU-sml-w and XU-sml-r models both perform poorly, producing very similar pdfs that have too large a standard deviation and are too symmetric. However, the other displayed models skillfully reproduce the true pdf. Figure b shows the difference between each forecast pdf and the true pdf. Several GANs perform as well as, if not better than, the benchmark bespoke polynomial parameterization.
The skill of the forecast models at reproducing the climate of the Lorenz ‘96 system defined as the pdf of the X variables. (a) The pdf of the Lorenz ‘96 system (black dashed) compared to a subset of the forecast models. (b) The difference between the forecast and true pdfs shown in (a). Sampling variability is indicated by shading the range in metrics for each of the 8 X variables of the full Lorenz ‘96 system.
Figure b shows the Hellinger distance evaluated for each parameterized model. The filled circles indicate the value of the metric when the pdf is evaluated across all X variables in both the parameterized and truth time series. The crosses give an indication of sampling variability and indicate the metric computed separately for each pair of true and forecast X_k pdfs, for k = 1, …, 8. Quantifying the parameterized model performance in this way allows for easy ranking of the different models. While the AR(1) stochastic polynomial parameterized forecast model is very skillful (Arnold et al., ), several GANs outperform the polynomial model.
In addition to correctly capturing the distribution of the X variables, it is desirable that a parameterized model capture the spatiotemporal behavior of the system. This is assessed by considering the spatial and temporal correlations of X in both the true system and the parameterized models. This diagnostic is shown for a subset of the tested parameterized models in Figure . It is evident that the parameterized models that skillfully capture the pdf of X also skillfully represent the spatiotemporal characteristics of the system. The X-sml-w* scheme performs particularly well, improving over the stochastic polynomial approach and other GANs by having a temporal correlation pattern in sync with the truth, though with lower magnitude peaks.
The skill of the parameterized models (colors) at reproducing the “true” (a) spatial correlation and (b) temporal correlation of the X variables in the Lorenz ‘96 system (black), calculated from the climatological simulation. The sampling variability in these metrics, as indicated by the variability between the metric calculated for different X variables, is narrower than the plotted line width.
Following the regime results presented in Christensen et al. (), we use HMM analysis to classify into two clusters the instantaneous states in the four-dimensional space spanned by the norms of the projections of X on wavenumbers 1 through 4 (denoted |Z_1| through |Z_4|). Because of the spatial homogeneity of the statistics, these wavenumber projections correspond to the empirical orthogonal functions.
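The wavenumber amplitudes |Z_1| through |Z_4| can be obtained from the X fields with a discrete Fourier transform, as in the short sketch below; the normalization convention shown is an illustrative choice of ours.

```python
import numpy as np

def wavenumber_amplitudes(X_series, max_wavenumber=4):
    """Norms |Z_m| of the projections of X (n_times, K) onto zonal wavenumbers 1..4."""
    Z = np.fft.rfft(X_series, axis=1) / X_series.shape[1]   # normalization is illustrative
    return np.abs(Z[:, 1:max_wavenumber + 1])
```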
Kernel density estimates of the joint pdfs of the projections of X on wavenumbers 1 and 2 (|Z_1| and |Z_2|) are presented in Figure , along with estimates of the joint distributions conditioned on the HMM regime sequence. The conditional distributions have been scaled by the probability of each state so that the full joint pdf is the sum of the conditional pdfs. For reference, the unconditional joint pdf for the truth is shown in gray contours in each panel. The stochastic matrix shows the probability of remaining in each regime (diagonal values) or transitioning from one regime to the other (off-diagonal values). As in previous figures, only a subset of results is displayed.
Joint distributions of the projections of X on wavenumbers 1 and 2 (|Z1| and |Z2|, respectively) conditioned on HMM regime occupation (red and blue contours) for the truth simulation and a subset of parameterized simulations. In all but the upper left panel, thin black lines display the unconditional joint distribution. In all panels, the grey curves denote the full joint distribution (without regime occupation conditioning) from the truth simulation. The conditional distributions have been scaled by the relative probabilities of each state. Inset: HMM stochastic matrix Q.
The clear separation of the truth simulation into two distinct regimes is modeled well by the polynomial parameterization and most of the GAN parameterizations. With the exception of XU-sml-w and XU-sml-r, the regime spatial structures and stochastic matrices are captured well. The GANs are slightly more likely to transition between regimes than the truth run, while the polynomial run is slightly more likely to stay in the same regime. Consistent with the other climate performance results presented above, the joint distributions of (|Z_1|, |Z_2|) produced by XU-sml-w and XU-sml-r are strongly biased, with no evidence of a meaningful separation into two distinct regimes of behavior.
To quantify the forecast model skill at reproducing the L96 behavior in wavenumber space, Figure shows the Hellinger distance calculated over the marginal pdfs of |Z_1| and |Z_2|, plotted as upward and downward triangles, respectively. Several of the GANs evaluated are competitive with, or improve upon, the polynomial parameterization scheme. The best performing GANs also performed the best in terms of other climate metrics (e.g., see Figure ).
The Hellinger distance between each forecast pdf and the “true” pdf, considering the projection of the X variables onto wavenumber 1 (upward triangles) and wavenumber 2 (downward triangles).
We note that in particular, models which accurately capture the regime behavior will also show good correlation statistics when averaged over a long time series. The regime analysis can help diagnose why a model shows poor correlation statistics. For example, X-sml-w* accurately captures the marginal distributions of the two regimes, but the frequency of occurrence of the red regime, dominated by wavenumber-1 behavior, is slightly too high (at 54% compared to 51% in the “truth” run). This reduces the magnitude of the negative correlation at a lag of 0.75 MTU and the positive correlation at a lag of 1.5 MTU observed in Figure .
Wavelet Analysis

To further investigate why the white noise and red noise GANs differ in online performance, a wavelet analysis is performed on time series of X from the climate runs. A continuous wavelet transform with the Ricker wavelet decomposes the time series into contributions from different periods. We choose a continuous wavelet transform to have more control over the spectral sampling. The total energy for a given period p is represented as

$$E(p) = \sum_{t} \left|W_p(t)\right|^2,$$

where |W_p(t)| is the wavelet magnitude at a given time step t. The total power for each period from each model is shown in Figure . All sml GANs except XU-sml-r and XU-sml-w follow the truth power curve closely. The polynomial regression follows the truth closely, although it tends to underestimate the power slightly at each period. The GANs peak in power at the same period at which the temporal correlation in Figure crosses 0. The GANs with poor Hellinger distances also contain more energy at longer periods.
(Top) Wavelet power spectra for the white and red noise GAN climate runs as well as the polynomial and truth runs. (Bottom) Mean absolute percent differences between the truth wavelet power spectrum and each model.
A clearer evaluation of the wavelet differences can be found by calculating the mean absolute percentage difference (MAPD) from the truth run at different wavelengths. The absolute difference between the truth and parameterized runs increases with increasing wavelength, so the percentage difference is employed to control for this trend,

$$\mathrm{MAPD}(p) = \frac{100\%}{K}\sum_{k=1}^{K}\frac{\left|E_k^{\mathrm{model}}(p) - E_k^{\mathrm{truth}}(p)\right|}{E_k^{\mathrm{truth}}(p)},$$

where the mean is taken over the spectra of the K individual X variables.
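A sketch of the wavelet power and MAPD calculations is given below, using the Ricker continuous wavelet transform available in older SciPy releases (scipy.signal.cwt has since been deprecated). As noted above, averaging MAPD over the spectra of the individual X variables is our reading of the metric rather than a verbatim reproduction.

```python
import numpy as np
from scipy import signal   # signal.cwt / signal.ricker are available in older SciPy releases

def wavelet_power(x, widths):
    """Total power per wavelet width (period) from a continuous Ricker-wavelet transform."""
    coeffs = signal.cwt(x, signal.ricker, widths)   # shape (n_widths, n_times)
    return np.sum(np.abs(coeffs) ** 2, axis=1)

def mapd_per_period(power_model, power_truth):
    """Mean absolute percent difference at each period, averaged over the K X-variable
    spectra (inputs have shape (K, n_periods)); this averaging is our assumption."""
    return 100.0 * np.mean(np.abs(power_model - power_truth) / power_truth, axis=0)
```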
The MAPD scores as a function of period in Figure show that none of the GANs consistently performs better at all periods, but some do provide closer matches to the truth spectrum at longer periods. Near the peak energy period, the different GANs reach their minimum error at slightly different periods before the error increases again. The X-sml-r GAN uniquely shows decreasing MAPD with increasing period, while the white noise GANs generally have similar differences across the range of evaluated periods.
Discussion

In this study, we choose to focus on GANs for stochastic parameterization primarily because the framework offers a way to embed stochasticity directly into the model and training process instead of adding it a posteriori to a deterministic parameterization. The GAN's use of the discriminator as an adaptive loss function is particularly attractive for weather and climate applications because it reduces the need for developing a handcrafted loss and can be scaled to higher-dimensional and more complex outputs, including spatial fields.
Several of the GANs tested show weather and climate skill competitive with the bespoke polynomial parameterization scheme. For climatological skill, several different metrics were considered, including the distribution of the X variables, spatiotemporal correlation statistics, and regime behavior. We found that forecast models that performed well according to one online metric also performed consistently well for all online metrics. The good performance of the GANs is encouraging, demonstrating that GANs can indeed be used to learn explicit stochastic parameterizations of uncertain subgrid processes directly from data. Furthermore, a small number of GANs improve upon the bespoke polynomial approach, indicating the potential for such machine-learned approaches to improve on current hand-designed parameterization schemes, given suitable training data. While this has been proposed and demonstrated for deterministic parameterizations (Rasp et al., ; Schneider et al., ), this is a first demonstration for an explicitly stochastic parameterization.
Comparison of Figures a and b indicates a relationship between forecast model performance on weather and on climate time scales. To quantify this further, Figure shows the correlation between weather forecast skill and climate performance for each model considered. Models which produce weather forecasts with a lower RMSE also show good statistics on climate time scales. In contrast, the reliability of weather forecasts, that is, the statistical match between spread and error, is a poor predictor of climate performance. This is reflected in the competitive performance of the white noise GANs at producing a realistic climate, whereas on weather time scales, red noise increases the spread and can thereby substantially improve forecast reliability (e.g., compare the white- and red-noise versions of the X-med, X-sml, and X-tny GANs; Figure ). This relationship between initialized and climatological performance has been discussed in the context of global models (Williams et al., ). It provides further evidence that parameterizations can first be tested in weather forecasts before being used in climate models, as promoted by the “seamless prediction” framework (Brunet et al., ).
Correlation between weather forecast skill and climate performance. (a) Weather forecast RMSE versus climate Hellinger distance. (b) Weather forecast spread-error ratio versus climate Hellinger distance. (c) Off-line versus online Hellinger distance. Colors indicate forecast model, as in Figure , and marker sizes correspond to the noise standard deviation.
The disparity between the off-line verification statistics and those from the climate and weather runs (Figure c) highlights the challenges in training GANs for parameterization tasks. Neither the values of the generator loss nor the off-line evaluation of the GAN samples correlated with performance in the forecast model integrations. The generator and discriminator networks optimize until they reach an equilibrium, but there is no guarantee that the equilibrium is stable or optimal. Some of the differences in the results may be due to particular GANs converging on a poor equilibrium state as opposed to the other factors being tested. GANs and other machine learning parameterization models are trained under the assumption that the data are independent and identically distributed, but in practice they are applied sequentially to spatially and temporally correlated fields, potentially introducing nonlinear feedback effects. GANs are more complex to train than other machine learning methods because they require two neural networks and do not output an informative loss function. Larger magnitude additive noise appears to help prevent runaway feedbacks from model biases, at the expense of increasing weather prediction errors. The inclusion of the batch normalization output layer appeared to assist both training and prediction by limiting the possible extremes reached during integration.
Other generative neural network frameworks should also be investigated to determine if they can provide similar or better performance with a less sensitive training process. One stochastic machine learning approach is Monte Carlo dropout (Gal & Ghahramani, ), in which dropout layers are activated during inference mode to produce an ensemble of stochastic samples for the same input. Monte Carlo dropout can be applied to any neural network architecture and only requires tuning the dropout rate but may produce artifacts in the predicted distribution, especially with large dropout rates. Conditional variational autoencoders (Kingma & Welling, ), a Bayesian generative model that regularizes the latent space as a vector of Gaussian distributions, are another potentially viable approach. Variational autoencoders have a more constrained latent space than GANs but often produce overly smooth samples and have their own training challenges.
Wavelet analysis helped uncover differences in model performance across different time scales. While the other evaluation metrics focused on distributional or error metrics in the time domain, the wavelet power spectrum separated the time series into different periods and enabled comparisons of the energy embedded in different scales. In particular, the wavelet analysis revealed that some of the GANs added energy to the system either at long periods or across all periods in some cases.
The standard deviation of the noise does impact both the training of the GANs and the resulting weather and climate model runs. Using too large a standard deviation limits the ability of the GAN to discover the structure of the true distribution of the data. Standard deviations that are too small may result in either the generator or discriminator becoming overly good at their tasks during training, which results in the GAN equilibrium being broken. During simulations, noise standard deviations that are too small can result in the system becoming trapped within one regime and never escaping. We tested optimizing the noise standard deviation during the training process but did not achieve consistent convergence to the same noise value in repeated experiments.
In addition to the GAN configurations evaluated in the paper, other GAN settings were tested and were found to have similar or worse performance. Given the relative simplicity of the Lorenz '96 system, adding neurons to each hidden layer did not improve performance. Using the SELU activation function generally resulted in faster equilibrium convergence than ReLU. Varying the scaling factor for the L2 regularization on each hidden layer did affect model performance: using a larger value greatly reduced the variance of the predictions, while using a smaller value resulted in the peaks of the final distributions being too far apart, especially when both X and U were used as inputs. We also tested deriving U from a 1-D convolutional GAN that reproduced the set of Y values associated with each X. That approach did produce somewhat realistic Y values but contained “checkerboard” artifacts from the convolutions and upscaling, especially at the edges. The sum of the Y values was also not equal to the U derived from the finite-difference estimate in Section 2.
The L96 system is commonly used as a test bed for new ideas in parameterization, and ideas tested using the system can be readily developed further for use in higher complexity Earth system models. However, the L96 system has many fewer dimensions than an Earth system model and a relatively simple target distribution. The relative simplicity of the L96 system may have also led to the more complex GANs overfitting to the data compared with the simpler polynomial parameterization. For more complex, higher dimensional systems, the extra representational capacity of the GAN may provide more benefit than can be realized in L96. The computational simplicity of L96 also allows for the production of extremely large training data sets with modest computational resources. Producing higher complexity Earth system model data sets requires trading off among spatial resolution, temporal output frequency, and temporal coverage based on the amount of computational and storage resources available.
Conclusions

In this study, we have developed an explicitly stochastic GAN framework for the parameterization of subgrid processes in the Lorenz '96 dynamical system. After testing a wide range of GANs with different noise characteristics, we identified a subset of models that outperform a benchmark bespoke cubic polynomial parameterization. Returning to the questions posed in Section 1, we found that this subset of GANs approximates well the joint distribution of the resolved state and the subgrid-scale tendency. This model subset also produces the most accurate weather forecasts (i.e., lowest RMSE in the ensemble mean). Some GANs with red noise produce reliable weather forecasts (Figure ), in which the ensemble spread is a good indicator of the error in the ensemble mean. Based on the comparison of the same white and red noise GANs, the correlation structure of the noise is most critical for producing reliable forecasts. However, these are not necessarily the GANs that produce the most accurate forecasts. The subset of models with the most accurate weather forecasts produces the most accurate climate simulations, as characterized by probability distributions, space and time correlations, and regime dynamics. However, we note that the GANs which produce skillful weather and climate forecasts were different from those which performed well in “off-line” mode (Figure ).
Although the GANs required an iterative development process to maximize model performance and were very sensitive to the noise magnitude and other hyperparameter settings, they do show promise as an approach for stochastic parameterization of physical processes in more complex weather and climate models. Applying other recently developed GAN losses and regularizers (Kurach et al., ) could help reduce the chance for the GAN to experience a failure mode.
The experiments presented here demonstrate that GANs are a promising machine learning approach for developing stochastic parameterization in complex GCMs. Key lessons learned and unanswered questions include the following:
- While including the tendency from the previous time step provides a natural approach for building temporal dependence into the parameterization, it can lead to accumulation of error in the forecast, such that local-in-time parameterization should also be considered.
- Autocorrelated noise is important for a skillful weather forecast but appears less important for capturing the climatological distribution.
- It is possible that spatial correlations are also important in a higher complexity Earth system model, which could not be assessed here due to the simplicity of the Lorenz ‘96 system.
- It is possible that the noise characteristics could also be learned by the GAN framework to automate the tuning of the stochasticity.
Future work will use these lessons to guide further investigations in the Lorenz ’96 framework and to develop machine-learned stochastic parameterization schemes for use in higher complexity Earth system models. Noise can also be incorporated into the truth run training data to evaluate the effect of simulated observational error on training performance. Other noise inclusion processes, such as multiplicative noise, should also be evaluated in GANs to see if they result in further improvements over the current additive noise approach. GANs of a similar level of complexity to those used for L96 could emulate local effects, such as some warm rain formation processes.
AcknowledgmentsThis research started in a working group supported by the Statistical and Applied Mathematical Sciences Institute (SAMSI). D. J. G. and H. M. C. were funded by National Center for Atmospheric Research Advanced Study Program Postdoctoral Fellowships and by the National Science Foundation through NCAR's Cooperative Agreement AGS-1852977. H. M. C. was funded by Natural Environment Research Council grant number NE/P018238/1. A. H. M. acknowledges support from the Natural Sciences and Engineering Research Council of Canada (NSERC) and thanks SAMSI for hosting him in the autumn of 2017. The NCAR Cheyenne supercomputer was used to generate the model runs analyzed in this paper. Sue Ellen Haupt and Joseph Tribbia provided helpful reviews of the paper prior to submission. The source code and model configuration files necessary for replicating the results of this paper can be accessed at
Abstract
Stochastic parameterizations account for uncertainty in the representation of unresolved subgrid processes by sampling from the distribution of possible subgrid forcings. Some existing stochastic parameterizations utilize data‐driven approaches to characterize uncertainty, but these approaches require significant structural assumptions that can limit their scalability. Machine learning models, including neural networks, are able to represent a wide range of distributions and build optimized mappings between a large number of inputs and subgrid forcings. Recent research on machine learning parameterizations has focused only on deterministic parameterizations. In this study, we develop a stochastic parameterization using the generative adversarial network (GAN) machine learning framework. The GAN stochastic parameterization is trained and evaluated on output from the Lorenz '96 model, which is a common baseline model for evaluating both parameterization and data assimilation techniques. We evaluate different ways of characterizing the input noise for the model and perform model runs with the GAN parameterization at weather and climate time scales. Some of the GAN configurations perform better than a baseline bespoke parameterization at both time scales, and the networks closely reproduce the spatiotemporal correlations and regimes of the Lorenz '96 system. We also find that, in general, those models which produce skillful forecasts are also associated with the best climate simulations.
1 National Center for Atmospheric Research, Boulder, CO, USA
2 National Center for Atmospheric Research, Boulder, CO, USA; Atmospheric, Oceanic and Planetary Physics, University of Oxford, Oxford, UK
3 Department of Atmospheric and Oceanic Sciences, University of Colorado, Boulder, CO, USA
4 School of Earth and Ocean Sciences, University of Victoria, Victoria, British Columbia, Canada