## Abstract

In this study, a precipitation forecasting model is developed based on sea level pressure (SLP), difference in sea level pressure (DSLP), and sea surface temperature (SST) data. For this purpose, the variables effective for precipitation estimation are determined using the Gamma test (GT) and correlation coefficient analysis for the wet and dry seasons. The best combination of the selected variables is identified using entropy theory and the GT, and the performances of these alternative input selection methods are compared. A support vector machine (SVM) model is then developed for dry- and wet-season precipitation. The results are compared with benchmark models including naïve, trend, and multivariable regression models. The results show that the SVM model outperforms the benchmark models in precipitation prediction.

- climatic prediction
- entropy theory
- Gamma test
- precipitation prediction
- support vector machine

## INTRODUCTION

Adaptation to climate variability is one of the main challenges in water resources management. The development of accurate precipitation and streamflow forecasting models can support water resources planning and management. Precipitation prediction can be made in the short term, such as next-hour prediction for flood management, or in the long term, such as prediction for the next 6 months. Long-term precipitation forecasts are usually used for operational purposes: irrigation and water supply management, flood warning and prevention, and hydropower planning. In recent decades, much effort has been devoted to the development of reliable long-lead forecasting models utilizing large-scale ocean–atmospheric patterns. There are various methods to develop the relationship between large-scale climatic parameters (predictors), such as geopotential height and mean sea level pressure (SLP), and local variables (predictands), such as temperature, precipitation, and runoff. The most widely used models usually implement general circulation model (GCM) outputs as predictors of precipitation. However, because GCM data have coarse resolution, downscaling techniques are employed to convert climate model outputs into meteorological variables appropriate for hydrologic applications.

In recent years, a wide range of rainfall prediction methods has been used to investigate the effects of large-scale climate signals on rainfall variability. Dynamical and empirical methods are two approaches for rainfall prediction. In the dynamical approach, a physical model is utilized to generate and solve a system of equations for precipitation prediction. In the empirical method, the historical atmospheric and oceanic data are analyzed and a simulation model is developed for precipitation prediction of future periods. Recently, a wide range of conceptual models, such as regression, artificial neural networks (ANNs), and *K*-nearest neighbor (KNN) have been used in empirical approaches.

Many efforts have been devoted to the development of prediction models by implementing linear methods, such as simple/multiple linear regression and canonical correlation analysis (Johansson & Chen 2003; Hanssen-Bauer *et al*. 2003; Busuico *et al*. 2006; Crawford *et al*. 2007; Schmidli *et al*. 2007; Hertig & Jacobeit 2008; Hashmi *et al*. 2009), independent component analysis (Moradkhani & Meier 2010), or singular value decomposition (Conway *et al*. 1996; Widmann *et al*. 2003).

When the predictand variable is precipitation, linear regression relationship may not work very well, because the predictor–predictand relationships are often very complex. For this reason, a number of nonlinear regression downscaling techniques, especially ANNs because of their high potential for simulating the complex, nonlinear, and time-varying input–output systems, are employed (e.g., Mpelasoka *et al*. 2001; Cavazos & Hewitson 2005; Haylock *et al*. 2006; Hashmi *et al*. 2009; Najafi *et al*. 2011).

Support vector machines (SVMs), as one of the nonlinear modeling tools, are widely used in precipitation prediction. SVMs for regression (SVR), as described by Vapnik (1992), exploit the idea of mapping input data into a high-dimensional (often infinite-dimensional) reproducing kernel Hilbert space where a linear regression is performed. Dibike *et al*. (2001) presented results showing that the radial basis function (RBF) is the best kernel function to use in SVM models. Liong & Sivapragasam (2002) compared SVM with ANN and concluded that the SVM's inherent properties give it an edge in overcoming some of the major problems in the application of ANNs (Han *et al*. 2007). Bray & Han (2004) illustrated the difficulties in SVM identification for flood forecasting problems. Tripathi *et al*. (2006) identified climate variables affecting the spatio-temporal variation of precipitation in India; an SVM-based downscaling model was then applied to future climate predictions from the second-generation coupled global climate model to obtain future projections of precipitation. Ghosh & Mujumdar (2008) developed downscaling models based on sparse Bayesian learning and the relevance vector machine to model streamflow at the river basin scale for the monsoon period using GCM-simulated climatic variables; a decreasing trend was observed for the monsoon streamflow of the Mahanadi due to high surface warming in the future, with the CCSR/NIES GCM and the B2 scenario. Najafi *et al*. (2011) used multilinear regression, SVM, and adaptive-network-based fuzzy inference system models.

Other downscaling techniques including KNN (Araghinejad *et al*. 2006) and genetic programming (Hashmi *et al*. 2011) are also utilized. Depending on regions and criteria of comparison, any linear and nonlinear techniques can be employed.

Some other techniques are used for data preprocessing to reduce the dimensionality of the problem, including sensitivity analysis (Nourani & Sayyah Frad 2012), principal component analysis (Schoof & Pryor 2001; Araghinejad & Burn 2005), fuzzy clustering (Ghosh & Mujumdar 2008), wavelet transform (Nourani & Parhizkar 2013) and Gamma test (GT) (Ahmadi *et al*. 2009; Moghaddamnia *et al*. 2009; Ahmadi & Han 2013). The GT is a nonlinear modeling and analysis tool. GT predicts the minimum achievable modeling error before the modeling. GT was first reported by Stefansson *et al*. (1997) and Končar (1997), and later was discussed by many scientists and used to determine the best input combination (Chuzhanova *et al*. 1998; Remesan *et al*. 2008; Jaafar & Han 2011).

A main challenge in developing prediction models is input selection. A basic question is how many input variables should be considered in a model. Although adding inputs tends to improve performance in model calibration, the accuracy of the model in estimation cannot necessarily be improved with more inputs. Recent research has investigated the best input variables and data length using the GT (Ahmadi *et al*. 2009; Piri *et al*. 2009; Jaafar & Han 2011).

The aim of this paper is to find the relationship between the large-scale climate parameters provided by the National Centers for Environmental Prediction (NCEP) and the precipitation of the Aharchay watershed in the northwestern part of Iran. This paper deals with the input selection challenge in three parts in identification of: (1) the most effective predictors; (2) the best combination of predictors; and (3) the best simulation method. In this paper, a comparison and assessment have been carried out for selection of the most effective predictors by GT and correlation coefficient analysis. The best combination of selected predictors is identified using entropy theory and GT. The best simulation model is selected among SVM, naïve, trend, and multivariable regression models.

The main novelty of this paper is utilizing the GT in two steps: determining the effective input variables and determining the best combination of input variables. The results are compared with other methods, namely correlation coefficient analysis and entropy theory, respectively. The other contribution of this paper is proposing a sequence of steps for achieving a more accurate long-lead precipitation prediction model: determination of the effective signals, the best combination of input variables, and the best simulation model among several alternatives. The paper is organized as follows. The ‘Materials and methods’ section describes the models used: the GT, entropy theory, and SVM. This is followed by a case study implementing the proposed methodology. The results are presented in the next section, followed by a summary and conclusion.

## MATERIALS AND METHODS

The modeling process in this study, including determination of the effective signals, the best combination of them, and the best simulation model, is presented in Figure 1. First, the time series of signals affecting western Iran's precipitation (suggested by Karamouz *et al*. 2005) are gathered and updated. Then the seasonal precipitation is calculated for the wet (December to May) and dry (June to November) seasons. In order to select the effective variables from the full set of climate signals, two methods are used: the GT and correlation coefficient analysis considering multicollinearity. The results of the two methods are compared by developing simulation models.

After identifying the effective signals, in order to reduce the complexity and increase the model's accuracy, the best combination of selected variables using GT and entropy theory is selected. In the third part of modeling, the results of the SVM model are compared with the benchmark models including naïve, trend, and multivariable regression models. In the following sub-sections, brief explanations about the methods are given.

### Gamma test

This novel technique enables us to quickly evaluate and estimate the best mean squared error that can be achieved by a smooth model on unseen data for a given selection of inputs, prior to model construction.

The GT estimates the minimum mean square error that is achievable by a continuous nonlinear model on unseen data. The main idea is somewhat different from previous efforts at nonlinear analysis. Suppose two input vectors $x_i$ and $x_j$ are close to each other; then the corresponding outputs $y_i$ and $y_j$ should also be close to each other. The GT makes this view quantitative through the mean distance between nearest-neighbour points in a bounded input set and the mean distance between the corresponding output points, and from these obtains an estimate of the error. Suppose there is a set of observations of the following form:

$$\{(x_i, y_i),\; 1 \le i \le M\} \qquad (1)$$

where $x_i = (x_1, \ldots, x_m)$ is the input vector confined to a set $C \subset \mathbb{R}^m$, and $y_i$ is the corresponding output. The only assumption of this method is that the following relationship holds for the system:

$$y = f(x_1, \ldots, x_m) + r \qquad (2)$$

where $f$ is a smooth unknown function and $r$ is a random variable representing the noise of the equation, which must be determined. Without losing the generality of the function, it can be assumed that the mean of this random variable is zero (as any constant bias may be subsumed into the unknown function) and that its variance is bounded. The GT is based on $N[i, k]$, the index of the $k$th ($1 \le k \le p$) nearest neighbour of each vector $x_i$ ($1 \le i \le M$). The delta function computes the mean squared $k$th nearest-neighbour distance in input space:

$$\delta_M(k) = \frac{1}{M} \sum_{i=1}^{M} \left| x_{N[i,k]} - x_i \right|^2 \qquad (3)$$

where $|\cdot|$ indicates Euclidean distance. The corresponding gamma function is:

$$\gamma_M(k) = \frac{1}{2M} \sum_{i=1}^{M} \left| y_{N[i,k]} - y_i \right|^2 \qquad (4)$$

where $y_{N[i,k]}$ is the output value corresponding to the $k$th nearest neighbour of $x_i$ in Equation (3). In order to calculate $\Gamma$, a least-squares regression line is constructed from the $p$ points $(\delta_M(k), \gamma_M(k))$:

$$\gamma = A\delta + \Gamma \qquad (5)$$

The intercept with the vertical axis ($\delta = 0$) is the value of $\Gamma$, an estimate of the error variance. The regression line also provides useful information about the complexity of the model: its vertical intercept is the best obtainable mean square error (Evans & Jones 2002), while its gradient $A$ indicates model complexity (more complex models have steeper gradients). The GT is model-independent, and its results have nothing to do with the technique later used to fit the function $f$. The results can be standardized by the term $V_{\text{ratio}}$, defined as follows:

$$V_{\text{ratio}} = \frac{\Gamma}{\sigma^2(y)} \qquad (6)$$

where $\sigma^2(y)$ is the variance of the output $y$, which allows a judgement of predictability to be formed independently of the output range. When $V_{\text{ratio}}$ is close to zero, there is a high degree of predictability of the required output by a smooth model. A formal proof of the GT can be found in Evans (2002) and Evans & Jones (2002).
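As a concrete illustration, Equations (3)–(6) can be sketched in a few lines of NumPy. This is a minimal sketch of the Gamma statistic and $V_{\text{ratio}}$ under our own naming, not the authors' implementation, and the neighbour count `p = 10` is an illustrative choice.

```python
import numpy as np

def gamma_test(X, y, p=10):
    """Estimate the Gamma statistic (noise-variance estimate), the slope A,
    and V_ratio for outputs y given inputs X, following Equations (3)-(6).

    X : (M, m) array of input vectors
    y : (M,) array of outputs
    p : number of nearest neighbours used in the regression line
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    M = len(y)
    # Pairwise Euclidean distances in input space
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)           # exclude each point itself
    order = np.argsort(d, axis=1)         # order[i, k-1] is N[i, k]
    delta = np.empty(p)
    gamma = np.empty(p)
    for k in range(1, p + 1):
        nbr = order[:, k - 1]
        delta[k - 1] = np.mean(d[np.arange(M), nbr] ** 2)   # Eq. (3)
        gamma[k - 1] = np.mean((y[nbr] - y) ** 2) / 2.0     # Eq. (4)
    # Least-squares line gamma = A * delta + Gamma, Eq. (5)
    A, Gamma = np.polyfit(delta, gamma, 1)
    v_ratio = Gamma / np.var(y)                             # Eq. (6)
    return float(Gamma), float(A), float(v_ratio)
```

On a smooth noise-free relationship the intercept should be near zero, while added observation noise should raise it toward the noise variance.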

### Entropy

Entropy is a tool for quantifying the uncertainty of random processes. It measures the reduction of the uncertainty using the observation data based on the gained information. Shannon & Weaver (1949) developed the principles of the information theory in terms of ‘Entropy’. Singh (1997) reviewed some applications of entropy approaches in hydrology and water resources.

Harmancioglu & Alpaslan (1992), Caselton & Husain (1980), Harmancioglu & Singh (1998), and Husain (1989) used entropy to assess uncertainties of hydrologic variables in water resources systems and to design water quality monitoring and hydrological networks. Also, Krstanovic & Singh (1992), Mogheir & Singh (2002), and Alfonso *et al*. (2010) have used entropy theory in the field of designing groundwater quantity and quality monitoring networks.

Shannon & Weaver (1949) defined the marginal entropy, $H(X)$, of a discrete random variable $X$ as follows:

$$H(X) = -\sum_{i=1}^{N} p(x_i)\,\log p(x_i) \qquad (7)$$

Here, $N$ represents the number of elementary events with probabilities $p(x_i)$ ($i = 1, \ldots, N$). Transinformation measures the redundant or mutual information between dependent variables $X$ and $Y$, expressed as:

$$T(X, Y) = H(X) + H(Y) - H(X, Y) \qquad (8)$$

or as:

$$T(X, Y) = H(X) - H(X \mid Y) \qquad (9)$$

where $H(X \mid Y)$ is the conditional entropy of $X$ given $Y$. By definition, mutual information is the reduction in uncertainty about $X$ due to the observation of $Y$ (Cover & Thomas 1991).

### Support vector machine

The fundamentals of SVM were developed by Vapnik (1998). SVM is based on the principle of structural risk minimization from statistical learning theory. The application of SVM has received attention in the fields of hydrological engineering and water resources management due to its many interesting features and promising empirical performance (Choy & Chan 2003; Yu *et al*. 2004; Bray & Han 2004; Sivapragasam & Liong 2005; Karamouz *et al*. 2009).

The SVM model is produced by the support vectors contained in the training data and is represented by means of a small subset of the training points. The cost function for building the model ignores any training data within a threshold $\varepsilon$ of the model prediction; that is, the generalization bounds rely on a loss function that ignores small errors. In SVR, the problem is to find a linear function that best interpolates a set of training points:

$$y(x) = \langle W, x \rangle + b \qquad (10)$$

The parameters $(W, b)$ could be determined by minimizing the sum of the squared deviations of the data, utilizing the least-squares approach:

$$\min_{W, b} \; \sum_{i=1}^{N} \left( y_i - \langle W, x_i \rangle - b \right)^2 \qquad (11)$$

Instead, some deviation $\varepsilon$ between the eventual targets $y_i$ and the function $y(x)$ is allowed by defining the following constraint:

$$\left| y_i - \langle W, x_i \rangle - b \right| \le \varepsilon \qquad (12)$$

A band or tube around the hypothesis function $y(x)$ can be visualized, with points outside the tube regarded as training errors, measured by slack variables $\xi_i$. For points inside the tube, the slack variables are zero; they increase gradually for points outside the tube. This approach to regression is called $\varepsilon$-SV regression (Vapnik 1998). It can be shown that this regression problem can be expressed as the following convex optimization problem:

$$\min_{W, b, \xi, \xi^*} \; \frac{1}{2}\|W\|^2 + C \sum_{i=1}^{N} \left( \xi_i + \xi_i^* \right) \qquad (13)$$

subject to

$$y_i - \langle W, x_i \rangle - b \le \varepsilon + \xi_i, \qquad \langle W, x_i \rangle + b - y_i \le \varepsilon + \xi_i^*, \qquad \xi_i, \xi_i^* \ge 0 \qquad (14)$$

where $C$ is a pre-specified positive constant that determines the degree of penalized loss when a training error occurs, and $\xi_i$ and $\xi_i^*$ are slack variables that represent the upper and lower training errors subject to an error tolerance $\varepsilon$.

Then the Lagrange function is constructed from the objective function and the corresponding constraints to solve the optimization problem. SVMs are characterized by the use of a kernel function, which changes the representation of the data in the input space to a linear representation in a higher-dimensional space called the feature space. Four standard kernels are usually used in classification problems and also in regression cases: linear, polynomial, radial basis, and sigmoid. The architecture of an SVM algorithm for regression is presented in Figure 2. The input pattern (for which a prediction is to be made) is mapped into the feature space. Then its products with the training patterns (support vectors) are computed using kernel functions. Finally, the products are summed using the weights; this, plus the constant term *b*, yields the final prediction output. For more information about SVMs, readers are referred to Vapnik (1992, 2010).
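The prediction step just described (kernel products of the input pattern with the support vectors, a weighted sum, and the bias term *b*) can be written compactly in NumPy. The RBF kernel width `gamma` and the toy coefficients below are illustrative assumptions of ours, not values from the paper.

```python
import numpy as np

def rbf_kernel(u, v, gamma=1.0):
    """Radial basis function kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    return np.exp(-gamma * np.sum((u - v) ** 2, axis=-1))

def svr_predict(x, support_vectors, dual_coefs, b, gamma=1.0):
    """SVR prediction as in the architecture of Figure 2: kernel products
    of the input pattern x with the support vectors, weighted by the dual
    coefficients, plus the bias term b."""
    k = rbf_kernel(support_vectors, x, gamma)   # one product per support vector
    return float(np.dot(dual_coefs, k) + b)
```

In a fitted model the support vectors, dual coefficients, and bias would come from solving the convex problem in Equations (13)–(14); here they are supplied by hand purely to show the prediction arithmetic.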

### Model evaluation

The criteria of root mean square error (RMSE), coefficient of determination, and Nash–Sutcliffe model efficiency coefficient are used to evaluate the performance of the simulation of historical precipitation. The RMSE is calculated as:

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left( y_t - \hat{y}_t \right)^2} \qquad (15)$$

where $y_t$ is the observed value of the historical precipitation, $\hat{y}_t$ is the modeled value of the precipitation, and $n$ is the number of data.

The correlation coefficient indicates the strength and direction of a linear relationship between two variables. The correlation is +1 in the case of a perfect increasing linear relationship and −1 in the case of a perfect decreasing linear relationship:

$$r = \frac{\sum_{t=1}^{n} (y_t - \bar{y})(\hat{y}_t - \bar{\hat{y}})}{\sqrt{\sum_{t=1}^{n} (y_t - \bar{y})^2} \, \sqrt{\sum_{t=1}^{n} (\hat{y}_t - \bar{\hat{y}})^2}} \qquad (16)$$

where $\bar{y}$ and $\bar{\hat{y}}$ are the mean values of the observed and modeled precipitation. The square of the correlation coefficient ($r^2$), known as the coefficient of determination, ranges from 0 to 1 and describes how much of the variance between the two variables is captured by the linear fit.

The Nash–Sutcliffe model efficiency coefficient is defined as (Nash & Sutcliffe 1970):

$$E = 1 - \frac{\sum_{t=1}^{n} (y_t - \hat{y}_t)^2}{\sum_{t=1}^{n} (y_t - \bar{y})^2} \qquad (17)$$

Nash–Sutcliffe efficiencies can range from −∞ to 1. An efficiency of 1 (*E* = 1) corresponds to a perfect match of modeled precipitation to the observed data. An efficiency of 0 (*E* = 0) indicates that the model predictions are as accurate as the mean of the observed data, whereas an efficiency less than zero (*E* < 0) occurs when the observed mean is a better predictor than the model. Essentially, the closer the model efficiency is to 1, the more accurate the model is.
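Under the definitions in Equations (15)–(17), the three evaluation criteria can be sketched as follows; the function names are ours.

```python
import numpy as np

def rmse(obs, mod):
    """Root mean square error, Equation (15)."""
    obs, mod = np.asarray(obs, float), np.asarray(mod, float)
    return float(np.sqrt(np.mean((obs - mod) ** 2)))

def r_squared(obs, mod):
    """Coefficient of determination r^2, the square of Equation (16)."""
    obs, mod = np.asarray(obs, float), np.asarray(mod, float)
    r = np.corrcoef(obs, mod)[0, 1]
    return float(r ** 2)

def nash_sutcliffe(obs, mod):
    """Nash-Sutcliffe efficiency E, Equation (17)."""
    obs, mod = np.asarray(obs, float), np.asarray(mod, float)
    return float(1.0 - np.sum((obs - mod) ** 2)
                 / np.sum((obs - obs.mean()) ** 2))
```

As the text notes, a model that merely predicts the observed mean scores $E = 0$, while a perfect match scores $E = 1$.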

## CASE STUDY

The Aharchay river basin in the northwestern part of Iran is located between 47°20′ and 47°30′ east longitude and 38°20′ and 38°45′ north latitude, as shown in Figure 3. The precipitation totals in the wet (December to May) and dry (June to November) seasons are about 180 and 112 mm, respectively.

The mean annual precipitation, temperature, and inflow at the end of this basin are about 292 mm, 10 °C, and 51 MCM, respectively. About 62% of precipitation occurs in the wet season. The basin is 2,232 km^{2} in area and contains the primary tributaries of the Aharchay river which is one of the most important rivers in the Azarbayjan province. Sattarkhan dam, constructed as a multi-purpose dam, is located on the Aharchay River to supply downstream water demands including domestic, industrial, agricultural, and environmental.

The monthly precipitation data of the stations in the basin are extracted from the data bank of Iran's Meteorological Organization. The precipitation record in the basin is complete from 1970 to 2010. The predictors used in this study are the monthly SLP, difference in sea level pressure (DSLP), and sea surface temperature (SST) at different points around the world, as estimated by the National Center for Atmospheric Research (NCAR). The data come from several sources and are freely available on the NCEP/NCAR internet site (http://dss.ucar.edu/pub/reanalysis/). Karamouz *et al*. (2005) investigated the effects of different points around the world on Iran's climate, as shown in Figure 4. In this paper, the 15 points addressed by Karamouz *et al*. (2005) are considered as predictors of precipitation in the northwestern part of Iran.

The months of the year are divided into dry and wet seasons: the dry season is from June to November and the wet season from December to May. To make the results of the prediction model usable for water resources management in the basin, these two operational seasons are considered for precipitation prediction, and a lead time of 6 months, appropriate for water resources planning and water allocation, is used. The effective predictors of seasonal Aharchay precipitation are selected from the DSLP and SST of the 15 climate variables presented in Table 1.

## RESULTS

### Selecting the appropriate predictors

Correlation coefficient analysis between the input and output variables is traditionally used to identify the effective climate signals. If *n* is the number of variables, 2^{n} − 1 cases should be examined when calculating transinformation values with the entropy method. To limit this computational burden, the number of input variables is restricted to six, reducing the number of cases to examine with entropy to 63. The GT is likewise utilized to select six variables from the 15 predictor variables. The results of the correlation coefficient analysis between the climate variables and the precipitation for the wet and dry seasons are shown in Figure 5.

As shown in this figure, during the wet season (December to May), variable 5 (DSLP between Siberia and Sudan), variable 6 (DSLP between Siberia and the Eastern Persian Gulf), variable 9 (SST in the Black Sea), variable 14 (SST west of the Persian Gulf), variable 10 (SST east of the Mediterranean Sea), and variable 13 (SST in the Arabian Sea) have the highest correlations with the Aharchay basin precipitation. Variable 3 (DSLP between south Greenland and the east Mediterranean Sea), variable 2 (DSLP between south Greenland and the west Mediterranean Sea), variable 1 (DSLP between southern Greenland and the Azores), variable 4 (DSLP between southern Greenland and the Black Sea), variable 8 (SST in the Aden Sea), and variable 7 (SST in the Caspian Sea) are the variables most correlated with the area's precipitation in the dry season from June to November.

However, correlations may also exist among some of these variables themselves, so that one variable can stand in for one or more others. Multicollinearity occurs when two or more predictor variables in a multiple regression model are highly correlated. In this section, the multicollinearity among the predictors resulting from the correlation coefficient analysis is explored by computing the correlation matrix for all the variables. The correlation value varies between −1 and +1. The correlation coefficient values presented in Table 2 show four pairs of variables with coefficients above 0.85, which can be classified as highly correlated. In this method, the predictor variables are selected based on high correlation with the precipitation together with the multicollinearity analysis.
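The screening step described above (flagging predictor pairs whose absolute correlation exceeds 0.85) can be sketched as follows; the function name is ours, and the threshold is the one used in this analysis.

```python
import numpy as np

def high_correlation_pairs(X, names, threshold=0.85):
    """Return the pairs of predictors whose absolute pairwise correlation
    exceeds the threshold, i.e. candidates for multicollinearity.

    X : (n_samples, n_vars) array of predictor time series
    names : list of variable labels, one per column
    """
    corr = np.corrcoef(np.asarray(X, float), rowvar=False)
    pairs = []
    n = corr.shape[0]
    for i in range(n):
        for j in range(i + 1, n):       # upper triangle only
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs
```

One variable from each flagged pair would then be replaced by the next most precipitation-correlated candidate, as done for variables 5 and 6 below.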

In the wet season from December to May, variable 6 is replaced by variable 11 (SST in the Indian Ocean) in the model input list, since a significant correlation was observed between variables 5 and 6. Therefore, variables 5, 9, 14, 10, 13, and 11 are selected as model inputs using the multicollinearity analysis.

In the dry season (June to November) considering the prediction model inputs covariance matrix, a significant correlation exists between variables 3 and 4 and variables 1 and 2; therefore variables 1, 2, and 4 are replaced by variable 13 (SST at the Arabian Sea), variable 9 (SST at the Black Sea), and variable 5 (DSLP between Siberia and Sudan) in the model inputs list. Therefore, variables 3, 8, 7, 13, 9, and 5 are selected as the prediction model inputs using the multicollinearity method.

The Gamma values between 15 climate signals and the precipitation in the dry and wet seasons are presented in Figure 6. In the wet season, signals with the lowest Gamma value in relation to the precipitation including variables 15, 13, 6, 5, 10, and 11 are selected.

According to the correlation coefficient (CC) between variables 5 and 6, variable 5 is replaced by variable 9 (SST in the Black Sea). Therefore, considering the correlations between the variables, variables 15, 13, 6, 10, 11, and 9 are selected as input variables for the second GT model. In order to evaluate the input variable selection methods, three SVM models have been developed using 80% of the historical data for training and 20% for testing. The mean values of wet season precipitation (December to May) for the training and testing sets are 194.3 and 191.8 mm, and their standard deviations are 71.0 and 63.9 mm, respectively. The corresponding values for the dry season are 118.3, 109.8, 55.33, and 55.31 mm, respectively. The results of the developed models are presented in Table 3.

It is observed that input selection using the GT with replacement of the dependent variables gives the best model at the training stage, since it has the lowest mean deviation error and root mean square error and the highest CC between the predicted and historical values. At the testing stage, however, input selection through the GT without deletion of the dependent variables performs best, indicating the efficiency of the GT in selecting the input variables for the prediction model.

Considering the Gamma values presented in Figure 6 for the dry season and the CC between variables, the signals with the lowest Gamma values in relation to the precipitation, variables 14, 12, 15, 7, 8, and 1, are chosen. Since no significant correlation is present among the proposed variables, they are selected as the prediction model inputs. In order to evaluate the GT and CC methods in input selection, SVM models are developed. The results of their application, presented in Table 3, show that determining predictors using the CC method performs better in both the training and testing stages, with the lowest prediction error.

### Selecting the best combination of predictors

The aim of selecting the best combination of predictors is to identify and omit the signals whose presence increases the complexity of the model without significantly improving the results. In order to choose the best possible combination of predictors, the GT and the entropy method are utilized to find the model with the minimum Gamma error and the maximum transinformation, respectively. The number of different combinations created by the presence or absence of each of the six selected signals as prediction model inputs is 2^{6} − 1 = 63 states. In Table 4, only seven states with significant transinformation or low Gamma values are presented for the two seasons. The transinformation in this table is calculated based on entropy theory.
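The enumeration of the 2^{6} − 1 = 63 candidate input combinations, encoded as binary strings such as 111100 (1 meaning the variable is included), can be sketched as follows; each combination would then be scored by its Gamma value or its transinformation.

```python
from itertools import product

def candidate_combinations(n_vars=6):
    """Enumerate all 2**n_vars - 1 non-empty input combinations as
    binary strings such as '111100' (1 = variable included)."""
    combos = []
    for bits in product((1, 0), repeat=n_vars):
        if any(bits):  # skip the empty combination '000000'
            combos.append("".join(str(b) for b in bits))
    return combos
```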

As can be seen in Table 4, the combination of the first four inputs (111100) has the lowest Gamma value in the wet season, and the combination 101011 has the lowest Gamma value in the dry season. Therefore, the scenarios proposed by the GT are 111100 and 101011 for the wet and dry seasons, respectively. According to the transinformation values in the last column of Table 4, the combination of all variables (111111) is the scenario suggested by the entropy method in both seasons. The SVM model is developed to evaluate the selected combinations and study the performance of the two methods.

The results in Table 5 show that the model with the combination of the first four variables has the better performance in precipitation simulation in the wet season. Therefore, variables 15, 13, 6, and 5 in the wet season (December to May) and variables 3, 7, 9, and 5 in the dry season (June to November) are selected as the model inputs.

### Selecting the best model

The SVM technique is used for precipitation prediction in wet and dry periods. The results obtained from SVM in the wet and dry seasons are compared with the basic models, such as the naïve, the trend, and the multivariable regression models as shown in Table 6. The data are divided into training and testing data by the ratios of 80% and 20%, respectively.

The training phase of the learning machine involves adjusting the parameters using a training sample of 32 patterns in four-dimensional space (*N* = 32 and *n* = 4 in Figure 2). The input vector of the SVM model includes variables 15, 13, 6, and 5 in the wet season (December to May) and variables 3, 7, 9, and 5 in the dry season (June to November). The seasonal precipitation (predictand) constitutes the model output. In the SVM modeling process, after carrying out a sensitivity analysis, the ν-SVR model with the RBF kernel function is developed.

By checking the results of the testing stage, the overfitting problem is controlled. In the naïve model, the prediction for the next period is assumed to equal the current period's value. In the trend model, the prediction for the next period is based on the linear trend of the two previous periods. As the results in Table 6 show, in both wet and dry seasons the SVM model gives better results than the naïve and trend models at the testing stage. The SVM model also has a lower modeling error than the multivariable regression model, indicating better performance in nonlinear modeling. Figures 7 and 8 show the predicted and observed seasonal precipitation for the wet and dry periods.
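The two simplest benchmarks described above can be stated in a couple of lines; this is a minimal sketch of the naïve and trend rules as described in the text, not the authors' code.

```python
def naive_forecast(series):
    """Naive benchmark: the next-period prediction equals the current value."""
    return series[-1]

def trend_forecast(series):
    """Trend benchmark: linear extrapolation of the two previous periods,
    i.e. the current value plus the last observed change."""
    return series[-1] + (series[-1] - series[-2])
```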

## SUMMARY AND CONCLUSION

In this study, the relationships between variations of the SST, SLP, and DSLP at certain points and the precipitation of the Aharchay basin in the wet and dry seasons are explored. The GT and correlation coefficient analysis are used to select the most effective variables among the climatic signals; the SVM models developed to evaluate the two methods show the better performance of the GT in input selection. In the second part of this paper, two techniques, the GT and entropy, are used to select the best combination of inputs. The results show that the entropy method selects the model with more input variables, which may be the best model at the training stage but without any guarantee at the testing stage, whereas the GT selects the input combination with the better overall performance. In the third part of the paper, the SVM model is used for precipitation prediction and its performance is compared with precipitation modeling using the naïve, trend, and multivariable regression models as benchmarks. The results show the better performance of the SVM model at the testing stage.

- First received 5 December 2013.
- Accepted in revised form 6 July 2014.

- © IWA Publishing 2015
