International Journal of Computational Intelligence Systems

Stock e-exchange prices forecasting is an important financial problem that is receiving increasing attention. This study proposes a novel three-stage nonlinear ensemble model. In the proposed model, three different types of neural-network based models, i.e. Elman network, generalized regression neural network (GRNN) and wavelet neural network (WNN) are constructed by three non-overlapping training sets and are further optimized by improved particle swarm optimization (IPSO). Finally, a neural-network-based nonlinear meta-model is generated by learning three neural-network based models through support vector machines (SVM) neural network. The superiority of the proposed approach lies in its flexibility to account for potentially complex nonlinear relationships. Three daily stock indices time series are used for validating the forecasting model. Empirical results suggest the ensemble ANNs-PSO-GA approach can significantly improve the prediction performance over other individual models and linear combination models listed in this study.


Introduction
Stock market is a complex financial market with high volatility, noise non-stationary, unstructured nature, high degree of uncertainty, and hidden relationships. Due to its irregularity, stock e-exchange prices forecasting is regarded as a rather challenging task. The main purpose of forecasting is to reduce the risk in decision making that is important for financial organizations, firm and private investors. The methods of forecasting stock could be classified into two broad classes: fundamental analysis and technical analysis. The fundamental analysis depends upon exact knowledge of the various factors that influence the stock market such as micro-economics, macro-economics, political and even psychological factors. But the knowledge is usually not readily available. The technical analysis attempts to make predictions based on past patterns. However, these patterns are not always evident because of the noise. For traditional statistical methods, it is extremely difficult to capture the irregularity [1]. In these traditional models, we need to assume a functional relationship between input and output and try to fit the data as per that relationship. This particularly hampers our efforts, since the predictors of stock form multidimensional input space and the relationship between input and output is essentially non-linear [2]. This has encouraged academic researchers and business practitioners to develop more predictable forecasting models [3]. As a result models using artificial intelligence such as artificial neural network (ANN) techniques have been recognized as more useful than conventional statistical forecasting models [4]. The neural networks can simultaneously handle the non-linear data of multidimensional input space. Furthermore, neural networks do not require an explicitly well defined relationship between input and output as they determine their own relationships based on input and output values [5]. With its proven generalization ability, the ANN is able to infer from historical patterns the characteristics of performing stocks. During the last few years, a number of neural network models and hybrid models have been proposed for obtaining accurate prediction results.
Some researches are presented. Brownstone [6] predicts the daily Market close 5 days ahead, and 25 days ahead of the Footsie by neural network. The results indicate that predictions can be produced to a high level of accuracy, in a readily understandable format. Quah [7] and Srinivasan uncover the intricate relationships between the performance of stocks and the related financial and technical variables by neural network. Experimental results obtained this far have been very encouraging. Kuo et al. [8] develop a genetic algorithm based fuzzy neural network (GFNN) to formulate the knowledge base of fuzzy inference rules which can measure the qualitative effect on the stock market. Plikynas et al. [9] use ANN for compound (technical and fundamental) analysis and prognosis' of LNSE, LITIN-A and LITIN-VVP. They employ initial preprocessing (analysis for entropy and correlation) for filtering out model input variables. A wide spectrum of different results has shown a high sensitivity to ANN parameters. Ao [10] designs a simplified automated system to study the correlation between the US market and the Asian markets by employing the evolutionary computation to simulate the markets interactive dynamics. Slim [11] proposes a stochastic neural network (SNN) to the modelling and forecasting the time varying conditional volatility of the TUNINDEX returns. The empirical analysis shows that out-of-simple volatility forecasts of the SNN are superior to forecasts of traditional linear methods (GARCH) and also better than merely assuming a conditional Gaussian distribution. O'Connor and Madden [12] evaluate the effectiveness of using external indicators, such as commodity prices and currency exchange rates. In the experiments presented, basing trading decisions on a neural network trained on a range of external indicators result in a return on investment of 23.5% per annum, during a period when the DJIA index grew by 13.03% per annum. Thomaidis et al. [13] experiment with a "top-down" pruning technique as well as two "bottomup" strategies that start with simple models and gradually complicate the architecture if data indicate so. Liao et al. [14] propose an improved neural networkthe stochastic time effective neural network model and test the forecasting performance of the model by using different volatility parameters. Guresen et al. [15] analyze multi-layer perceptron (MLP), dynamic artificial neural network (DAN2) and the hybrid neural networks which use generalized autoregressive conditional heteroscedasticity (GARCH) to extract new input variables. Jasemi et al. [16] propose the network that is not going to learn the candlestick lines alone or in combination, but is to present a kind of regression model whose independent variables are important clues and factors of the technical analysis patterns; and its dependent variable is the market trend in near future.
The broad spectrum of applications to which neural networks have been applied, however, has revealed some of its drawbacks making researchers in particular suspicious as to their suitability in financial forecasting. As we all know, neural networks are a kind of unstable learning methods. Even for some simple problems, different architectures of neural networks (e.g., different number of hidden layers, different hidden nodes and different initial conditions) result in different patterns of network generalization. Researchers have devoted a great deal of effort during the last decade in order to find the optimal parameters of neural network (e.g., number of units, number of hidden layers, type of neurons, learning rates for supervised and unsupervised training and initial weights, etc.) which can solve an actual problem based on performance evaluation criteria.
Evolutionary computation algorithms which are comprised of four major paradigms, genetic algorithms, evolutionary programming, evolution strategies, and genetic programming, have demonstrated to be suitable for optimization of different ANNs. As a popular evolutionary computation paradigm, namely, the particle swarm optimization (PSO), utilizes a "population" of candidate solutions to evolve toward an optimal or near optimal solution of an actual problem. Because of its simplicity, easy implementation, and quick convergence, PSO has attracted more and more attention and has been applied extensively in various fields. Despite of its success and popularity, Grimaldi et al. [17] have indicated that, although PSO may find solutions of reasonable quality much faster than other evolutionary computation algorithms, it can not improve the quality of the solutions as the number of iterations increases. Hence, a premature phenomenon may occur for the original PSO, especially when optimizing complex multi-objective functions. Therefore, many improved PSO algorithms have been proposed. For example, Alfi et al. [18] have presented a methodology for finding optimal system parameters and optimal control parameters by a novel adaptive particle swarm optimization (APSO) algorithm. Wang et al. [19] have presented a poly-hybrid PSO optimization method with intelligent parameter adjustment.
Unfortunately, more and more researchers have realized that only selecting a single neural-network model with the best performance may lead to loss of potentially valuable information contained by other neural-network models that may have slightly weaker performances. Therefore, some different learning strategies such as combined/ensemble learning and meta-learning have been presented. For example, Abraham and AuYeung [20] present two ensemble approaches: based on a direct error measure and based on an evolutionary algorithm to search the optimal linear combination. Experimental results reveal that the ensemble techniques perform better than the individual methods and the direct ensemble approach seems to work well for the problem considered. Kwon and Moon [21] propose a genetic ensemble of recurrent neural networks for stock prediction model. It shows notable improvement on the average over not only the buy-andhold strategy but also other traditional ensemble approaches. Chun and Park [22] propose a new learning technique which extracts new case vectors using Dynamic Adaptive Ensemble CBR (DAE CBR). The main idea of DAE CBR originates from finding combinations of parameter and updating and applying an optimal CBR model to application or domain area. Chen et al. [23] propose a flexible neural tree (FNT) ensemble technique. The structure and parameters of FNT are optimized using genetic programming (GP) like tree structure-based evolutionary algorithm and particle swarm optimization (PSO) algorithms, respectively. Experimental results show that the model considered could represent the stock indices behavior very accurately. Aladag et al. [24] propose a new forecast combination approach based on artificial neural networks. The forecasts obtain from different fuzzy time series models are combined by utilizing artificial neural networks. The proposed method is applied to index of Istanbul stock exchange (IMKB) time series. It is seen that the proposed combination approach produces better forecasts than those produced by other combination methods available in the literature. Hung [25] presents the results of using a fuzzy system method to analyze clustering in generalized autoregressive conditional heteroskedasticity (GARCH) models. Although there are many studies on ensemble forecasting, we find that the ensemble model from different base models is often combined in a linear way in the existing studies. However, a linear weighted approach is not necessarily appropriate for the financial time series. Moreover, how to optimize the performance of the base models is a key problem.
To solve the above two main problems, an ensemble model from three different base models, Elman network, generalized regression neural network (GRNN) and wavelet neural network (WNN) combined by support vector machines (SVM) neural network in a nonlinear way is presented. The optimal architectures of the three base models are obtained by the improved swarm particle optimization algorithm (IPSO). The main contributions of this study are summarized as follows: (a) A three-stage neural-network-based nonlinear weighted ensemble model for forecasting stock indices time series is proposed. In the first stage, a certain data sampling technique is used to generate three different training sets for three base models, a validation set and a testing set. Based on the different training sets, three neural-network base models are optimized by IPSO

Published by Atlantis Press
Copyright: the authors 98 algorithm in the second stage. Because the three neuralnetwork base models' training processes do not affect the overall efficiency of time series forecasting system, they can be optimized and formulated in parallel. In the final stage, a neural-network-based nonlinear weighted meta-model is produced by learning the three neuralnetwork base models through SVM neural network.
(b) All parameters in the three base models are adaptively adjusted by the improved particle swarm optimization (IPSO) algorithm. To ameliorate the performance of standard PSO, IPSO employs adaptive nonlinear inertia weight updating with fitness values. At the same time, acceleration parameters are controlled by a declining arccosine function and an increasing arccosine function. Further, the crossover operation and mutation operation are introduced to improve the performance of the candidate particles. We adopt twopoint crossover and design a crossover rate only depending on generation and an adaptive mutation rate depending on individual fitness. Finally, the optimal structure and parameters of base models are adjusted by a 2-level algorithm in the training process, i.e., binary particle swarm optimization (BiPSO) and decimal particle swarm optimization (DePSO). With IPSO the deadly drawbacks of the base models, e.g., difficult to select the parameters and frequent confinement to local minima, have been significantly improved.
The rest of this study is organized as follows. Section 2 describes the building process of the proposed model in detail. For further illustration, three stock time series are used for testing in Section 3. Then, Section 4 presents and discusses the results of forecasting and compares the forecasting performance with other methods in terms of all kinds of evaluation criteria. Finally, some concluding remarks are drawn in Section 5.

The architecture of the nonlinear ensemble model
The nonlinear ensemble model in this study adopts the concept of metal-earning, which use some individual learning algorithms to extract knowledge from several different data subsets and then to construct a unified body of knowledge, metamodel, that adequately represents the entire dataset.

Meta-learning
As above mentioned, the performance of the ensemble models from several different base models is superior to a single neural-network model with the best performance. The ensemble model is a kind of learning from learned knowledge recently developed as an emerging machine learning technique in order to forecast financial time series using multiple training datasets. Generally, learning involves extracting a pattern f = f a from a training set, T, while ensemble model does these from several training sets, (T 1 , T 2 , …, T n ), each of which was used to train an associated base model f = f a (i), (i = 1, 2, . . ., n). The symbol n represents the number of the base models. The mode of metamodel which learns form learned knowledge by the base models may different from some or all of the base models. In addition, the metamodel will be trained by a new training set, T M , which is distinct from other training sets, T n , used by the base models, f a (i). At present, many different meta-learning approaches have developed such as stacked generalization, boosting, dynamic bias selection, mining meta-knowledge, and inductive transfer, etc. Metamodel has usually two modeling methods: based on different training sets or different learning algorithms. The former is described as follows: where y is the output of meta-model, f is the learning algorithm of the meta-model, T M is the meta-training set, n is the number of base models (i.e. the number of training subset), T i (i=1,2, …, n) is the ith training subset used by the ith base model, σ is the learning operator of f, |T i |(i=1,2, …, n) is the dimension of the ith training subset and f i (i=1,2, …, n) is the learning algorithm of the ith base model. Note that the learning algorithm of each base model based on different training sets is identical.
The latter is described as follows:

Published by Atlantis Press
Copyright: the authors Note that the learning algorithm of each base model based on identical training sets is different.
This study proposes a novel ensemble model integrating the advantages of two different metalearning methods. The proposed model uses a hybrid meta-learning strategy in terms of different learning algorithms and different training subsets which include the following three stages: (a) An original data set D is divided into three parts: training set T, validation set V and testing set S. Then the different training subsets T 1 , T 2 , …, T n are sampled from T for the correspond base models.
(b) The optimal architecture of n base models f i (i=1,2, …, n) are obtained by training subsets T i (i=1,2, …, n) and validation set V.
(c) Applying the trained m different base models to validation set V, validation results from the m base models can formulate a meta-training set (T M ). Based on the meta-training set, the metamodel is constructed. In this study, NN is used as both base learner and metalearner. That is, the improved meta-learning process utilizes the nonlinear weighted form to create a novel metamodel different from the existing linear weighted metamodeling. The improved three-stage nonlinear meta-learning process is illustrated in Fig. 1. There are three main problems to be solved: (a) create n different training subsets from the original training set T for n base models, (b) determine the optimal architecture of each neural-network base model, and (c) create a metamodel with different metadata produced by the NN base models.

Data partition and sampling
In forecasting financial time series by neural network, data partitions can have a significant impact on the final results. Generally, the initial data set is only split into training set and testing set. The former set is used for

Published by Atlantis Press
Copyright: the authors 100 model construction and the latter one is used for model testing. However, a third set from the original data set, validation set, can effectively improve the robustness of neural networks. Therefore, this study divides the original data set into three different parts nonoverlapping: training set, validation set and testing set. The partition ratios usually are no consensus and selected subjectively. However, the appropriate ratios will generate better forecasting accuracy such as 7:2:1 [26]. The partition ratios are not absolute according to different problems. How to create n training subsets from the original training set T is a problem after data partition? There are three common sampling techniques to create training subsets: direct replication, bagging and noise injection. In direct replication method, training subsets is a sample duplicate of original training set T.
Because of the feature of its random sampling with replacement, Bagging [27] has become a widely used data sampling method in machine learning. But there are maybe many duplicates in some training subsets. Noise injection can increase the independence between different training subsets and further effectively reduce variance between models by inserting noise to the training dataset [28]. Unfortunately, the injected noise may distort the characteristic of original data. To overcome above three sampling techniques, an interval sampling method is proposed as where T 1 (j) is jth element of ith training subset, T the original training set, n the number of training subsets, and N is the size of T.

Neural-network base models
The performance of base model is the basis to construct an effective ensemble model. The diversity of the base models is crucial for improving the performance of the ensemble model. The neural-network base models usually can be formulated parallel by different ways such as different initial weights, different architecture (e.g., different numbers of layers or numbers of nodes in each layer), different training algorithms (e.g., the gradient descent or Levenberg-Marquardt algorithm, etc). The neural-network base models utilize usually identical technique with different parameters, however it is proven that neural-network base models based on different technique are much helpful to improve the generalization ability and adaptability to actual problems of an ensemble model. Therefore, three different neural networks, Elman network, generalized regression neural network (GRNN) and wavelet neural network (WNN) are adopted because of their respective advantages in financial time series forecasting in this study.

Neural network
Neural networks (NNs), first introduced in 1943 [29], are a set of systems derived through neuropsychology models. The basic idea of NN is to emulate the biological system of the human brain to learn and identify patterns. The NN is widely used for time series forecasting because its flexible nonlinear modeling capability can capture the nonlinear characteristics of time series well. When applying NN to time series forecasting, the final output of the ANN-based forecasting model can be represented as ) , , , where v is a vector of all parameters, p nonlinear time dependency of size (lag), and ϕ is a function determined by the network structure and connection weights. Thus, in some senses, the NN model is equivalent to a nonlinear autoregressive model.

Elman network
Elman networks belong to the class of recurrent neural networks (RNN) architecture. Elman network is a three layers feedforward neural network with the addition of a recurrent connection from the output of the hidden layer to its input. The network is augmented at the input level by additional units, called context units. Context units are also known as memory units as they store the previous output of the hidden neurons. During Elman networks operation, the activation values of the input units are set to a desired input pattern. The activation value of every hidden unit is computed by multiplying each input and context activation value by the value of the weight from the unit to the hidden unit. These values are then summed, the bias of the hidden unit is added, and the sum is passed through a squashing function f. The resulting value then constitutes the output value of the hidden unit. In the Elman network, the squashing function used is the logistic f. Then, the activations of output units are calculated based on the hidden units in an analogous manner. This represents one time step. Next, the activation of each hidden unit is copied into a corresponding context unit on a one-for-one basis with fixed weights of 1, and then the next time step is performed. This is equivalent to a recurrent connection from every hidden unit to itself and is more restrictive than the arbitrary recurrent connections allowed by Minsky's claim [31].
Suppose that n, l, m are number of the input units, hidden units and output units. The primary input is xi(i=1,2…,n), and the network output is y k (k=1,2,…,m). w ji (i=1,2,…,n; j=1,2,…l), w jr (r=1,2,…,l; j=1,2,…l), w kj (j=1,2,…,l; k=1,2,…m) are the weights of the connections between the input and hidden units, between the recurrent and the hidden units, and between the hidden and the output units, respectively. b j and b k are biases of hidden units and output units, and f(·) and g(·) are hidden and output functions, respectively. The architecture of Elman network can be written mathematically by The output of the hidden unit: The output of the output unit:

Generalized regression neural network
The generalized regression neural network (GRNN) is memory based feedforward network which was introduced [32] as a generalization of both the radial basis function network (RBFN) and probabilistic neural network (PNN). The GRNN is a powerful tool for linear or nonlinear regression based on kernel estimation theory, which builds the sought function surface in a nonparametric fashion through the available data set. The GRNN can exhibit high accuracy and robustness to sparse and noisy data and its estimate converges to the conditional mean surface while introducing more data samples. The flexible structure of GRNN makes it amenable to adaptation for different environments.
At its standard form, it allows for a simple implementation and a very fast training procedure due to the single window bandwidth parameter σ (sigma) that regulates the smoothness of the regression surface. Moreover, unlike other networks, such as MLFNN or Elman, it does not require an exact topology definition.
The GRNN architecture includes four layers, namely, the input, hidden, summation and output layers. Unlike the most popular backpropagation (BP) algorithm that trains multilayer feedforward networks iteratively, the GRNN training is a single pass procedure. In addition, the GRNN formulation comprises only one free parameter that can be optimized fast. Consequently, the GRNN trains itself in a significantly shorter time, as compared with the BP algorithm.
The GRNN utilizes the Parzen Probability Density Estimator [33] between the independent vector random variable X with dimension m, and dependent scalar random variable Y. Assume that x and y are the measured values for X and Y variables, respectively. The clustering version of the GRNN with multiple hyperspherical kernels is proposed as shown below: (11) where x i and y i are the ith training set data, x i the vector form of variable x, ) (x y ∧ the predicted output, m the dimension of input domain, and n is the number of kernels. x ij and ij σ denote the center and sigma of jth variable for the ith pattern node, respectively.

Wavelet neural network
Wavelet is a type of transformation that retains both time and frequency information of the signal [34]. The transformation process from time domain to time scale domain is a WT, technically known as signal decomposition because a given signal is decomposed into several other signals with different levels of resolution. From these decomposed signals, it is possible to recover the original time domain signal without losing any information. This reverse process is called the inverse WT or signal reconstruction. In Fourier transform, only the sine and cosine functions can be chosen as the basis functions. However, wavelet transformation (WT) has versatile basis functions to be selected based on the type of the signal analyzed. Wavelet transforms can be divided in two categories: continuous wavelet transform (CWT) and discrete wavelet transform (DWT).
The continuous wavelet transform where the mother wavelet, ψ(x), is a single fixed function such as Morlet function from which all basis functions ψ a,b (x) can be derived through Eq.13. The dilation (scale) parameter a (a R and a>0) controls the spread of the wavelet and translation (time-shift) parameter b (b R) determines its central position and the superscript * represents the complex conjugate (R denotes real number).
The WT transforms the function from original time domain in to wavelet (a, b) domain. A function x(t) having both smooth global variations and sharp local variations can be effectively represented in wavelet domain by corresponding wavelet function The discrete wavelet transform is defined by where the asterisk denotes the complex conjugate, a and b are scale and time-shift parameters, respectively, and ψ(x) is a selected basis function (mother wavelet).
Both continuous and discrete wavelet transforms have been used to implement wavelet neural networks. Normally, the former can provide much excellent performance in nonstationary signal analysis and nonlinear function modeling.
The neural networks provide flexible mapping between inputs and outputs. Hornik et al. [35] have theoretically proved that a three-layer feedforward neural network (MLFN) can approximate any continuous function arbitrarily well given a sufficient number of middle-layer nodes. However, MLFN with using sigmoid function has some limitations such as settle in local minima of the error surface and convergence too slowly. Except the radial basis function neural networks (RBF), WNNs are also an improvement approach to MLFN. Wavelet function is the same as radial basis function is a local function and influence the networks output only in some local range. Although RBF is also local function, but it does not have the spatial-spectral zooming property of the wavelet function, and therefore it cannot represent the local spatial spectral characteristics of the function. WNN shows surprising effectiveness in solving the conventional problems of poor convergence or even divergence encountered in other kinds of neural networks [36].
In this study, a Morlet mother function (shown Eq.15.) is used as node activation function for the hidden layer of a three-layer MLFN. The dilation and translation parameters, a t and b t , of the Morlet function for each node in the hidden layer are different and they need to be optimized. In the WNN, the gradient descend algorithm is employed and the error is minimized by adjusting weight vector of the connections between input units and hidden units and between hidden units and output units, and a t and b t .

Particle swarm optimization algorithm
Particle swarm optimization is a population-based stochastic optimization algorithm which has been proposed by Eberhart and Kennedy in 1995. The concept is mainly from the natural flocking and swarming behavior of birds and insects. It is considered to be able to optimize the performance of the ANN by improve its disadvantages such as difficult to select the parameters and easy to get stuck in a local minimum, etc., because it does not require gradient and differentiable information.
Suppose that the search space is h dimensional, the particles of the swarm can be represented by an n dimensional vector X i = (x i1 , x i2 , …, x ih ) T . The fitness of each particle can be evaluated according to the objective function of the actual optimization problem. The velocity of each particle can be represented by n , …, p bh ) T be the last best position of the i-th particle, which is noted as its individual best position. Further, G b = (g b1 , g b2 , …, g bh ) T is the global best position. The new velocity of particle will be assigned according to the following equations: where c 1 and c 2 represent the acceleration parameters, w represents the inertia weight, and r 1 and r 2 are random numbers ranging from 0 to 1. The velocities of the particles on each dimension are clamped to a maximum velocity: v max . The new position of each particle is calculated by Eq. (16).

The improved particle swarm optimization algorithm
Although the traditional PSO can usually find good solutions rapidly, it may be trapped in local minimum and fail to converge to the best position. In order to reduce the opportunity of trapping in a local optimum, expand the search scope of the algorithm and enhance the algorithm's climbing ability, it is certainly critical to always maintain the diversity of particles. The existing algorithms such as chaos mechanism optimization [37], hybrid simplex search PSO [38], comprehensive learning PSO, dynamic random search technique [39] are difficult to solve the two problems (global optimization and premature convergence) simultaneously. Therefore, we design an improved particle swarm optimization (IPSO) with adaptive nonlinear inertia weight and dynamic arccosine function acceleration parameters. At the same time, the crossover operation and mutation operation of GA are introduced in IPSO in order to improve the performance of the candidate particles.

Improved acceleration coefficients
In the particle swarm optimization algorithm, acceleration coefficients c 1 and c 2 control the "cognitive" part and the "social" part of the particle velocity, respectively. In general, a large "cognitive" c 1 and a small "social" c 2 in the initial stages and a small "cognitive" c 1 and a large "social" c 2 in the last stages can balance the performance in the entire optimization process [40]. Based on this idea, many methods have been proposed such as linear adjustment strategy, fuzzy control strategy, and random change strategy. However, these methods are unstable. Therefore, we propose a dynamic acceleration parameters adjustment strategy based on arccosine function. c 1 and c 2 are controlled by a declining arccosine function and an increasing arccosine function. This strategy attempts to promote particles to be placed in an unexplored area so that they can contribute to the process of finding better solutions in the early stages of optimization. The method is more conducive to getting rid of the interference of local minimum, obtaining the global optimal solution to avoid premature convergence, and improving the convergence speed and accuracy in the latter stages of optimization.  where c start represents the iteration initial value of acceleration parameters, c end represents the iteration final value of acceleration parameters, Iter is the current iteration number, Iter max is the maximum iteration number.

Improved inertia weight
The inertia weight w represents the contribution of past velocity values to the current velocity of the particle. A large inertia weight biases the search towards global exploration, while a smaller inertia weight directs towards fine-tuning the current solutions. Suitable selection of the inertia weight and acceleration coefficients can provide a balance between the global and the local search [41]. Based on this idea, many methods have been proposed such as linear decreasing inertia weight strategy, random inertia weight strategy [42], inertia weight strategy based on concave function and convex function [43], and fuzzy control strategy. However, these methods are not adaptive. In this study, we employ an adaptive nonlinear adjustment inertia weight strategy depending on particle's fitness value, which will help balance the exploring and exploiting capabilities at different stages during its search process. The strategy can be represented as  (21) where w min~wmax represent the range of inertia weight, fitness represents the current fitness value of some particle, fitness min and fitness avg represent the minimum fitness value and the average fitness value of all particles respectively. It can be seen from Eq. (21) that, the inertia weight will increase when the fitness values of particles are consistent (become local optimum) and will decrease when the fitness values of all particles are scattered. Therefore, the inertia weights of the superior particles whose fitness values are larger than the average fitness value are smaller to protect their properties. In contrast, the inertia weights of the poor particles whose fitness values are smaller than the average fitness value are larger so that they can search better space.

Adaptive genetic operators
In PSO algorithm, when the individual optimum solution P best has not been updated for a long time in the latter part of the training, the particles will be close to the global optimum solution G best . At this point the particle update velocity mainly depends on wv ij of the first part of Eq. (16) because the inertia weight w <1, the particle velocity will become increasingly smaller. The particle swarm will "fly" toward a direction, which will lead to its falling into local minimum position. In this study, the adaptive genetic operators (crossover operation and mutation operation) are introduced in order to improve the performance of the candidate particles. The particles can execute crossover operation and mutation operation according to a certain probability.
Crossover is the main search operator in GAs, creating offsprings by randomly mixing sections of the parental genome. In this study, we use two-point crossover and design a crossover rate only depending on iteration number and not associating with the individual fitness. The crossover rate can be represented as is crossover probability of t-th iteration.
A small fraction of the offsprings is randomly selected to undergo genetic mutation. The mutation operator randomly picks a location from a bit-string and flips its contents. To avoid premature convergence, we design an adaptive mutation rate depending on individual fitness as follows, In order to improve search efficiency, take advantages of PSO's training speed and GA's global search, the genetic operator control function in this study is defined as , , 2 , 1 , ln 1 where k represents current iteration number. In the process of each iteration, a random number will be created ranging from 0 to 1. If the number is less than GP k , the current particle will execute genetic operator. As can be seen from Eq. (27), in the early iterations GP k << 1, genetic operator will be executed at a small probability, in the later iterations GP k , it will be close to be 1, so the particle will execute genetic operator at greater probability. Genetic operator expands population search space shrinking in the process of iteration, so that particles can escape from the optimal value searched previously to a larger search space. The particles maintain the diversity of the population, thus it increases the possibility of finding better solutions.

NN optimized by the improved PSO
The deadly drawbacks of the NN (frequent confinement to local minima and parameters selection) are expected to be improved with IPSO. The basic idea is to optimize weights and bias of NN by decimal particle swarm optimization (DePSO), and optimize the NN structure by binary particle swarm optimization (BiPSO). A particle in DePSO real-coded represents a set of NN weight vector and bias weight vector.
Let the number of input layer nodes be R, the number of hidden layer nodes Q, the output layer nodes S, the weight and bias vector of NN can be represented as ] , , , represents the weight vector from the input layer to the hidden layer, represents the weight vector from the hidden layer to the output layer, represents the hidden layer bias vector, represents the output layer bias vector, and h is the dimension of vector X.
A particle binary-coded in BiPSO represents the corresponding hidden layer node, that is, 1 represents the corresponding hidden layer node existence and 0 represents inexistence. The particle velocity is updated according to the Eq. (16). The particle position is updated by the state transition probability depending on the particle velocity. When the particle velocity is greater than a certain value, the particle will be 1 at a larger probability.
The number of hidden nodes is not generally less than the number of input layer nodes and more than the twice of the sum of input layer nodes and output layer nodes. The BiPSO can be represented in the binary form as is a random number ranging from 0 to 1, sig(·) is a sigmoid function and , and h is the dimension of vector. If x = 1, the corresponding hidden layer node exists, and the weight and bias vector of that node in DePSO is valid. Otherwise, if x = 0, the corresponding hidden layer node does not exist, and the weight and bias vector of that node in DePSO is invalid.

Non-linear neural-network metamodel
After the three neural-network base models are trained, next work is how to integrate or combine them by the metamodel. The mode of integration or combination can be defined as where y is aggregate output combined the outputs of these base models, n the number of base models (i.e., the number of training subset), w i the assigned weight of f i , f i (i=1,2, …, n) the learning algorithm of ith base model, and T i (i=1,2, …, n) ith training subset used by ith base model.
There are four common strategies to determine the weights of base models w i : simple averaging, simple, mean squares error (MSE), stacked regression, and error-variance-based weighting [44]. Besides the individual feature of different strategies, the existing integrated technique is built on linear assumption. However, linear strategies are not necessarily sufficient for financial time series. A nonlinear integrated strategy is proposed to construct a metamodel by using SVM neural network, which is different from the base neural networks in this study. In this nonlinear metamodeling approach, outputs of base neural-network models construct a meta-training set (T M ). This metamodel, SVM neural network, in the final stage can be trained by T M and assessed by testing set S.
The nonlinear ensemble forecasting model can be defined as ) , , ( n y y y f y L = (34) where f(·) is a nonlinear ensemble function realized by SVM neural network and y i is the forecast of ith base model.
To validate the effect the proposed the nonlinear ensemble model, an individual Elman network, an individual GRNN, an individual WNN, the linear combination models, and the proposed model to predict stock indices so as to compare forecasting performance.

Data preparation
The stocks data used in this paper are daily observations obtained from Wind database (http://www.wind.com.cn). They consist of the Shanghai composite index, Shenzhen component index and Shanghai-Shenzhen 300 index studied in this paper. The entire data set of each stock index covers the period of five years. Each time series is split into three sets: training set, validation set and testing set. The first set is used to determine the specifications of the model and parameters of the forecasting technique, the second set is used to not only evaluate the good or bad performance of the predictions of the base models based on evaluation measurements but also construct a metatraining set with outputs of base models, and the third set is used for out-of-sample evaluation of the forecasting model. In addition, the training data set is divided into three training subsets by the interval sampling algorithm. Table 1 shows the information about the time series and size of subsets used.

Data Preprocessing
The data must be normalized before training, which can be described as the following formula: min( X and ) max( X represent the maximum and minimum of the original data. Some common features from past stocks time series of Shanghai composite index, Shenzhen component index and Shanghai-Shenzhen 300 index are extracted for training and testing purposes. Each set of data are normalized by dividing each value by the maximum value of each set such that each normalized value is less than or equal to unity. Normalization of input data is necessary for obtaining correct trigonometric expansion.

Performance measures
To assess the ensemble prediction model, the forecasts are compared with the true realizations. Following performance measures are used.  ( 1 (38) The symbol N is the total number of data patterns. T i and T i ' represent the actual value and prediction at time i.
MAE, MAPE and RMSE are the metrics used to estimate the error of prediction. MAE, MAPE and RMSE are widely used statistical metrics that estimate the error of prediction by measuring the deviation between actual and forecasted value returned in this case. Smaller values of these metrics indicate higher accuracy in forecasting.
Of course accuracy is one of the most important indicators for forecasting models-the others being the cost savings and profit earnings generated from better decisions. From the business, the latter is usually more important because for the business practitioners, the aim of forecasting is to support or improve decisions so as to make more profit. Thus profits or returns are more important than conventional fit measurements. In stock indices forecasting, improved decisions often depend on correct forecasting directions or turning points between the actual and predicted values (T i and T i ' ). The ability to forecast movement direction or turning points can be measured by a statistic of directional change (DC), which can be expressed in percentage as (39) where T i is the actual value at time t, T ' i+1 is the prediction at time t+1.
However, the real aim of forecasting is to obtain profits based on prediction results. To provide a more complete evaluation of the models, our comparison is based on not only the performance statistics but also the trading returns. Here the return rate is introduced as an important evaluation indicator, which is calculated according to the simple principle ignoring the friction costs.
( ) (40) where MR is the P periods excess return rate relative to the tested stock indices, AR the amount of the return rate obtained on the entire period of testing set, IR the return rate of the tested stock indices on the entire period of testing set, and N is the number of the testing periods. For convenient computing, we assume that stock can only be bought in a given lot size. It is worth noting that computation of MR is based on the trading strategy, as in the following: If ( T ' i+1 -T i )>0, then " buy", else " sell". The difference between the predicted value and the actual value will guide trading. Because the MAE, MAPE and RMSE measure predictions only in terms of levels, it is better to choose DC and every period return rate (MR) as the measurements for forecast evaluation. Of course, MAE, MAPE and RMSE are also taken into consideration for comparison of levels.

Set parameters for the ensemble model
The architectures and parameters of base models are optimized by IPSO. The parameters of metamodel, SVM, are determined by k-fold cross-validate (CV). The initial parameters of IPSO-NN model are defined as: the population size of particle swarm is 30, the maximum iteration number is 150, the training times of NN are 1500; in DePSO the initial particle positions are random numbers ranging from -15 to 15 and the initial particle velocity randomly varies between -8 and 8, c 1start =2.95, c 1end =1.05, w min =0.2, w max =0.8; in BiPSO, the initial particle positions are random numbers ranging from -1 to 1 and the initial particle velocity is randomly chosen from the range between -0.6 and 0.6, c 1 =c 2 =1.14, w=1; in genetic algorithm P c,min =0.5, P c,max =0.9, P m,min =0.02, and P m,max =0.6.

Time dependency
The daily data of the stock indices possess the time dependency, therefore, the lag order of the time series need be determined before prediction. A non-linear time dependency of size (lag) p is increased from 1 to 36 in step from 1 lag. For each test, the networks are trained on the training sets until a MSE from 2×10 -5 or less is reached. The best performing IPSO-NN on MSE for the lag order is 3 by testing the different performance of the lag orders between 1 to 36. The process is modeled with lag 3, and the realization at t+1 is dependent on the realizations of the last 3 trading days.

Simulation and prediction
The daily stock indices data of Shanghai composite index, Shenzhen component index and Shanghai-Shenzhen 300 index are pre-processed between 0 and 1 and passed to the ensemble model as non-stationary data. Fig. 2     However, the less MAE, MAPE and/or RMSE don't affirmatively represent a high hit rate of forecasting direction for stock indices movement direction prediction. Therefore, the comparison of the directional change statistic (DC) is significantly. From Table 5, we can see the proposed nonlinear ensemble model also performs much better than the other models by the rank. Furthermore, DC is more important than MAE, MAPE and/or RMSE because the former is more useful for the business practitioners' decision. Focusing on Table 5, the    Considering return rate, the empirical results show that the proposed nonlinear ensemble model could be well forecast future variation of stock indices. Compared with the other models such as three single base models and four linear ensemble models, the proposed nonlinear ensemble model belongs to the best forecasting effect. Interestingly, you can see that the rank of Table 6 is the similar to that of Table 5 because the right forecasting to direction often leads to high return rates. As shown in Table 6, for the Shenzhen component index case, the return rate for the best single base model, GRNN, is 2.61%, and the return rate for the best linear ensemble model, simple RMSE, is also 7.72%; however the return rate for the proposed nonlinear ensemble model reaches 10.6%.
From the experiments presented in this study we can draw the following conclusions: (i) The experimental results show that the proposed nonlinear ensemble forecasting model is superior to four linear ensemble models which are superior to three single base models for the test cases of three stock indices in terms of the measurement of annual return rate (MR), as can be seen from Tables 6. Likewise, the proposed nonlinear ensemble model also outperforms other models in terms of goodness-of-fit or MAE, MAPE and RMSE (refer to Figs. 2-4 and Table 2-4). (ii) MAE, MAPE and RMSE are the metrics used to estimate the error of prediction, however, it don't affirmatively represent a high return rate for stock indices forecasting. For example, in the Shenzhen component index test case, the simple MAPE model is the best in terms of the MAPE (refer to Tables 3), but it is worse than the proposed nonlinear ensemble model in the return rate (refer to Tables 6). Similarly, the indicator DC and MR have a strong positive relationship which isn't absolute. For example, in the Shanghai-Shenzhen 300 index test case, in all linear ensemble models the simple RMSE model is the best, concerning DC (refer to Tables 5), but is the worst in the return rate (refer to Tables 6). (iii) The proposed nonlinear ensemble model can be used as an alternative tool for stock indices forecasting to obtain greater forecasting accuracy and improve the prediction quality further in view of empirical results.

Conclusions
In this study we hope to design a model that can provide the most accurate prediction of stock indices. In order to overcome the drawbacks of the traditional NN, a threestage neural-network-based nonlinear weighted ensemble model is proposed. In this model, three neural-network base models, i.e., Elman, GRNN and WNN are generated by three different training sets, further, they are optimized by improved particle swarm optimization (IPSO) with adaptive nonlinear inertia weight, dynamic arccosine function acceleration parameters and the crossover and mutation operation of GA. Finally, a neural-networkbased nonlinear weighted meta-model be produced by learning three neural-network base models through SVM neural network. By applying daily data to these models and comparing the prediction results based on MAE, MAPE, RMSE, DC and MR, we find that in general the annual return rate of the proposed nonlinear ensemble model is better than single base models and the linear ensemble models for forecasting stock indices with high volatility and noise. The result of this paper may be helpful for day-ahead price forecasting of electricity markets. It would be interesting to investigate the optimization of the proposed model e.g., by inclusion of more efficient NNs and effective stochastic search techniques but this will be left for future research.