Despite very careful design, there is no guarantee that a computer model will be adequate for its intended use: some processes treated as negligible can turn out to be more important than initially thought; a parameterisation may not be valid in the particular conditions of interest or may be incompatible with other hypotheses employed; the selection of parameters can be far from optimal; and so on. As a consequence, climate models have to be tested to assess their quality and evaluate their performance. In this framework, it is always necessary to keep in mind the scientific objectives of the study (or studies) that will be conducted using a particular model. Although the principles remain the same, the tests performed with a model developed to analyse the evolution of the global carbon cycle over the last million years (see section 5.3.2) are clearly different from those for a model providing projections of future climate changes at the highest possible resolution (see Chapter 6).
A first step is to ensure that the numerical model solves the equations of the physical model adequately. This procedure, often referred to as verification (Fig. 3.15), only deals with the numerical resolution of the equations in the model, not with the agreement between the model and reality. It checks that no coding errors have been introduced into the program. The numerical methods used to solve the model equations must also be sufficiently accurate. Different methods are available to achieve this goal. A standard one is to compare the numerical solution with the analytical one for highly idealised test cases for which an exact solution is available. It is also possible to formally demonstrate that some parts of the code are correct, for instance, the part that solves large systems of n linear algebraic equations with n unknowns (which are often produced as part of the numerical resolution of the partial differential equations on the model grid).
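The comparison of a numerical solution with an analytical one can be sketched as follows. The test case (a one-dimensional heat equation solved with an explicit finite-difference scheme) and all numerical values are illustrative choices, not taken from the text:

```python
import math

# Verification sketch (hypothetical test case): solve the 1-D heat equation
# du/dt = kappa * d2u/dx2 on [0, pi] with u = 0 at the boundaries, using an
# explicit finite-difference scheme, and compare with the exact solution
# u(x, t) = exp(-kappa * t) * sin(x).

kappa = 1.0
nx = 51                       # number of grid points
dx = math.pi / (nx - 1)
dt = 0.4 * dx * dx / kappa    # time step satisfying the stability criterion
nsteps = 200

x = [i * dx for i in range(nx)]
u = [math.sin(xi) for xi in x]          # initial condition

for _ in range(nsteps):
    un = u[:]
    for i in range(1, nx - 1):
        u[i] = un[i] + kappa * dt / dx**2 * (un[i+1] - 2*un[i] + un[i-1])

t = nsteps * dt
exact = [math.exp(-kappa * t) * math.sin(xi) for xi in x]
max_err = max(abs(a - b) for a, b in zip(u, exact))
print(f"maximum error vs analytical solution: {max_err:.2e}")
```

If the code were free of errors but the scheme inaccurate, the error would shrink as the grid is refined; a coding error, by contrast, typically produces a discrepancy that does not decrease with resolution.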

The next step is the validation process, i.e. determining whether the model accurately represents reality. To do this, model results have to be compared with observations obtained in the same conditions. In particular, this implies that the boundary conditions and forcings must be correctly specified to represent the observed situation. Validation must first be performed on the representation of individual physical processes, such as the formulation of the change in the snow albedo in response to surface melting and temperature change. This is generally achieved for particular locations, during field campaigns specifically designed to study this process. They provide a much larger amount of very specific data than global databases, allowing a detailed evaluation of the performance of the model on this topic. On a larger scale, the different components of the model (atmosphere, ocean, sea ice, etc., see section 3.3) have to be tested independently, ensuring that the boundary conditions at the interface with the other components are well defined. Finally, the results of the whole coupled model have to be compared with observations. All those steps are necessary because unpleasant surprises are always possible after the different elements are coupled together, due to nonlinear interactions between the components. Some problems with the model can also be masked by the formulation of the boundary conditions when components are run individually. However, having a coupled model providing reasonable results is not enough. In order to test whether the results occur for the correct reason, it is necessary to check that all the elements of the model are doing a good job, and that the satisfactory overall behaviour of the model is not due to several errors in its various elements cancelling each other out.
When discussing verification and validation, we must always recognise that both of them can only be partial for a climate model, except maybe in some trivial cases. The accuracy of the numerical solution can only be estimated for small elements of the code or in very special (simplified) conditions. Indeed, if it were possible to obtain a very accurate solution to compare with the numerical model results for all the possible cases, there would be no point in developing a numerical model! The comparison of model results with observations is also limited to some particular conditions, and completely validating a climate model in all the potential situations would require an infinite number of tests. A climate model can thus never be considered as formally verified or validated. A model is sometimes said to be validated if it has passed a reasonable number of tests. In such a case, the credibility of projections performed with such a model could be very high. However, there is no way to formally guarantee that the results of the model will be correct even if the conditions are only slightly different from those used in the validation process, in particular for a very complex system like the climate. Furthermore, there is no agreement in climatology as to what a reasonable number of tests is.
The term "a validated model" and phrases like "the model has been validated" must therefore be avoided. Rather, verification and validation should be considered as processes that never lead to a final, definitive product. The model should be continuously retested as new data or experimental results become available. The building of a model can then be viewed in the same way as the construction of a scientific theory. Hypotheses are formulated and a first version of the model developed. The results of the model are then compared to observations. If the model results are in good agreement with the data, the model can be said to be confirmed for those conditions, thereby increasing its credibility. However, this does not mean that the model is validated for all possible cases. If the model results do not compare well with observations, the model should be improved. This could lead to new hypotheses, to additional terms in the governing equations, or to the inclusion of new processes through new equations or new parameterisations.
Alternatively, a disagreement between model and observations can be related to an inadequate selection of the value of some parameters that are not precisely known (for instance the exchange coefficients in Eqs. 2.33 and 2.34). Adjusting those parameters is part of the calibration of the model, also referred to as tuning. Model developers and users may also decide that, if the model cannot reproduce the observations in some special cases, this indicates that it is not valid for such conditions, although it can still be used in other situations where the tests indicate better behaviour. For instance, we can imagine a climate model that cannot simulate the climate of Mars correctly without some modifications; however, this does not invalidate it for modelling conditions on Earth. On the other hand, if it works well for both Mars and Earth, this is a good test of its robustness.
The calibration of physical parameters is generally required and is perfectly justified, as there is no a priori reason to select one particular value in the observed range of the parameters. It is also valid to calibrate the numerical parameters in order to obtain the most accurate numerical solution of the equations. However, care has to be taken to ensure that the calibration is not a way of artificially masking some deficiencies in the model. If this does occur, there is a high probability that the selected parameters will not provide satisfactory results for other conditions (e.g. the climate at the end of the 21^{st} century). Performing many tests for widely different situations and for various elements of the model should limit the risk, but the number of observations is often too small to ensure that the problem has been completely avoided. An additional problem with the constant improvement of the model and of its calibration as soon as new data become available is the absence of independent data with which to really test the performance of the model. Ideally, some of the available information should be used for the model development and calibration, and some should be kept to assess its accuracy. Another good modelling practice is to choose or design model components for which the selection of one particular value of the parameters has only a small impact on model results, thereby reducing the importance of the calibration.
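The idea of calibrating a parameter on one part of the data while keeping the rest for an independent assessment can be sketched as follows. The toy model (a flux proportional to a temperature difference through an exchange coefficient c), the synthetic "observations" and the parameter range are all hypothetical, chosen purely for illustration:

```python
# Hypothetical calibration sketch with a held-out evaluation set. The "model"
# is a toy linear response flux = c * delta_T, where the exchange coefficient
# c plays the role of a poorly known physical parameter.
delta_T = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
obs_flux = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]   # synthetic "observations"

calib = list(range(0, 4))    # points used for calibration (tuning)
evalu = list(range(4, 6))    # points held back for independent assessment

def rms(c, idx):
    """RMS misfit of the toy model with coefficient c on the selected points."""
    return (sum((c * delta_T[i] - obs_flux[i])**2 for i in idx) / len(idx)) ** 0.5

# Scan the plausible range of the parameter and keep the best fit on the
# calibration subset only.
candidates = [1.5 + 0.01 * j for j in range(101)]   # c in [1.5, 2.5]
c_best = min(candidates, key=lambda c: rms(c, calib))

print(f"calibrated c = {c_best:.2f}")
print(f"RMS on calibration data: {rms(c_best, calib):.3f}")
print(f"RMS on independent data: {rms(c_best, evalu):.3f}")
```

A much larger misfit on the held-out points than on the calibration points would suggest that the tuning is compensating for a structural deficiency of the model rather than correcting a genuinely uncertain parameter.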
In all the tests performed with the model, it is necessary to estimate the agreement between model results and observations. This is a complex and sometimes undervalued task. Indeed, the comparisons between the results of various models have shown that a single model is never the best for all the regions and variables analysed. Introducing a new parameterisation or changing the value of a parameter usually improves the results in some areas and worsens them in others. The agreement should then be related to the intended use of the model. This could be done more or less intuitively by visually comparing maps or plots describing both the model results and the observations. However, a much better solution is to define an appropriate metric. For a single field, such as the annual mean surface temperature T_{s}, a simple root mean square (RMS) error may be appropriate:
$$RMS=\sqrt{\frac{1}{n}\sum _{k=1}^{n}{\left({T}_{s\mathrm{,\; mod}}^{k}-{T}_{s\mathrm{,\; obs}}^{k}\right)}^{2}}$$  (3.30) 
where n is the number of grid points for which observations are available, ${T}_{s\mathrm{,\; mod}}^{k}$ is the model surface temperature at point k and ${T}_{s\mathrm{,\; obs}}^{k}$ is the observed surface temperature at point k. This estimate could be improved by taking into account the area of each grid point or by giving greater weight to the regions of most interest. If many variables have to be included in the metric, the RMS errors of different variables can be combined in various ways. The model-data comparison should also take into account the errors or uncertainties in both model results and observations. Errors in the observations can be directly related to the precision of the instruments or of the indirect method used to retrieve the climate signal (see for instance section 5.3.3). The uncertainties could also be due to the internal variability of the system (see sections 1.1 and 5.2) because observations and model results covering a relatively short period are not necessarily representative of the mean behaviour of the system.
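Equation (3.30) and its area-weighted refinement can be sketched as follows. The temperatures, latitudes and cos(latitude) area weights below are hypothetical values chosen for illustration only:

```python
import math

# Sketch of Eq. (3.30): RMS error between modelled and observed annual mean
# surface temperature, optionally weighted by the area of each grid cell.

def rms_error(t_mod, t_obs, weights=None):
    """Root mean square error; if weights are given, each squared difference
    is weighted (e.g. by grid-cell area) and the weights are normalised."""
    if weights is None:
        weights = [1.0] * len(t_mod)
    wsum = sum(weights)
    return math.sqrt(sum(w * (m - o) ** 2
                         for w, m, o in zip(weights, t_mod, t_obs)) / wsum)

# Hypothetical example: three latitude bands, with cell area taken as
# proportional to cos(latitude).
lats = [0.0, 45.0, 80.0]
t_mod = [299.0, 287.5, 252.0]            # model surface temperatures (K)
t_obs = [300.0, 288.0, 255.0]            # "observed" temperatures (K)
area_w = [math.cos(math.radians(lat)) for lat in lats]

print(f"unweighted RMS: {rms_error(t_mod, t_obs):.2f} K")
print(f"area-weighted RMS: {rms_error(t_mod, t_obs, area_w):.2f} K")
```

In this illustrative case the weighted error is smaller than the unweighted one because the largest misfit sits in the high-latitude band, whose cells cover the least area.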