Recognition of Oil Shale Based on LIBSVM Optimized by Modified Genetic Algorithm

The parameters of a traditional SVM are usually chosen by experience. To improve the speed, accuracy and generalization of an oil shale recognition model built from log data, a LIBSVM recognition model with parameters optimized by a genetic algorithm is proposed. First, all sample data were converted to the double type required by the LIBSVM tool, and the best normalization method was chosen by comparing the accuracies obtained with different normalization methods. Second, the fitness value was calculated with the traditional LIBSVM. Finally, the parameters C and g were optimized by the genetic algorithm according to the fitness value. The optimized LIBSVM model was applied to identify oil shale in the northern Qaidam Basin. The results show that the optimized model has better generalization ability, with a recognition accuracy of 97.2806%. Judging from its application to wells with the same geological background, this optimized LIBSVM model is currently the best-performing option.


INTRODUCTION
Because the logging curves of oil shale show high resistivity, high interval transit time, high natural gamma, high neutron porosity and low bulk density, the current methods for recognizing oil shale include the LogR overlap method, the combination of multiple logging curves, and so on. However, these methods require a large amount of complicated calculation, which is error-prone and inefficient. According to Zhu Jianwei's research on identifying oil shale with logging curves, both the recognition and the quantitative analysis of oil shale were based on a linear relation between log response and the actual stratum. Given the complexity of geological conditions and sedimentary environments, together with formation heterogeneity, the relation between them must be nonlinear. According to Zhang Jiajia's research on recognizing oil shale with seismic methods, the seismic approach gave better results, but it cannot be used in old areas where only log data and no seismic data are available. Therefore, in combination with other methods, a method based on a Library for Support Vector Machines (LIBSVM) modified by a genetic algorithm is proposed to identify oil shale. The Support Vector Machine (SVM), first proposed by Vapnik, is a machine learning method derived from the principle of structural risk minimization and the VC dimension theory of statistical learning theory, and it can be used for pattern classification and nonlinear regression. Using the global search ability of the genetic algorithm in optimizing complex systems, a parameter selection method for LIBSVM is proposed and applied to the recognition of oil shale in the northern Qaidam Basin. It offers a feasible and efficient method for recognizing oil shale under the same geological conditions.
There are many SVM toolboxes, and LIBSVM, developed by Professor Chih-Jen Lin of National Taiwan University, is currently the best of them. It was designed as a simple, fast and easy-to-use package for SVM pattern recognition and regression. It provides not only compiled executables for Windows systems but also the source code, which can be conveniently modified and updated for use on other systems.
The classification principle of SVM is to find an optimal hyperplane that serves as the decision surface under a linearly separable condition. This decision surface maximizes the margin between positive and negative examples, and thus achieves the best classification result with the lowest proportion of misclassified examples. Assume that the training samples are $(x_i, y_i)$, $i = 1, 2, \dots, N$, where $x_i$ is the input value, $y_i$ is the expected output, and $y_i = \pm 1$ denotes positive and negative examples respectively. The decision surface is given by:

$$w^T x + b = 0 \qquad (1)$$

For linearly separable samples $(x_i, y_i)$, formula (1) satisfies:

$$y_i (w^T x_i + b) \ge 1, \quad i = 1, 2, \dots, N$$

In these formulas, $x$ is the input vector, $w$ is the adjustable weight vector and $b$ is the bias. For a given weight vector $w$ and bias $b$, the distance between the hyperplane defined by formula (1) and the nearest data point is the margin, denoted $\rho$. The $x_i$ lying on the margin boundary are the support vectors; their number is limited, and they determine the hyperplane. According to the principle of SVM, maximizing the margin between positive and negative examples means finding the maximum value of $\rho$. According to Li Yang's research on LIBSVM, if $x_1$ and $x_2$ denote a positive and a negative example on the margin boundaries, the margin between them is:

$$\rho = \frac{2}{\|w\|} \qquad (2)$$

According to formula (2), maximizing the margin means maximizing $2 / \|w\|$, that is, minimizing $\|w\| / 2$, which is turned into minimizing

$$\Phi(w) = \frac{1}{2} \|w\|^2$$

The method of Lagrange multipliers is applied to solve this constrained optimization problem:

$$L(w, b, a) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{N} a_i \left[ y_i (w^T x_i + b) - 1 \right] \qquad (3)$$

In formula (3), $a_i \ge 0$ are the Lagrange multipliers. Taking the partial derivatives of $L$ with respect to $w$ and $b$ and setting them to 0, the problem of finding the optimal classification surface is turned into the dual problem of maximizing the objective function with respect to $a_i$.
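The step from the Lagrangian to the dual problem can be written out as follows. This is the standard textbook derivation rather than a formula recovered from the original text, and the name $Q(a)$ for the dual objective is an assumption introduced here.

```latex
% Lagrangian of the hard-margin problem
L(w, b, a) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} a_i \left[ y_i (w^T x_i + b) - 1 \right]

% Stationarity conditions
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} a_i y_i x_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} a_i y_i = 0

% Substituting both conditions back into L gives the dual problem
\max_{a} \; Q(a) = \sum_{i=1}^{N} a_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j y_i y_j \, x_i^T x_j
\quad \text{s.t.} \quad \sum_{i=1}^{N} a_i y_i = 0, \;\; a_i \ge 0
```

Only inner products $x_i^T x_j$ appear in $Q(a)$, which is what later allows them to be replaced by a kernel function.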
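The optimal solution for the weight vector $w$ and the bias $b$ is:

$$w^* = \sum_{i=1}^{N} a_i^* y_i x_i, \qquad b^* = y_s - (w^*)^T x_s \ \text{ for any support vector } x_s$$

The optimal classification surface is:

$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{N} a_i^* y_i \, x_i^T x + b^* \right)$$

where $x$ stands for a testing sample.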
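After the original problem is turned into the dual problem, the computational complexity no longer depends on the dimension of the space. The default kernel function in the LIBSVM toolbox is the RBF kernel and the default SVM training model is C-SVC, so the decision function is:

$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{N} a_i y_i \exp\left( -\gamma \|x_i - x\|^2 \right) + b \right)$$

Here $\|x_i - x\|$ is the two-norm distance and $\gamma$ is the parameter g, whose default value is the reciprocal of the number of attributes. For k attributes, the default g is therefore 1/k; the default C is 1.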
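The RBF decision function can be illustrated with a minimal Python sketch. It is not LIBSVM's implementation: the support vectors, coefficients and bias below are made-up toy values, and the function names `rbf_kernel` and `decision` are hypothetical.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def decision(x, support_vectors, coeffs, bias, gamma):
    """Return +1 or -1 for sample x.

    coeffs[i] holds the product a_i * y_i for support vector i.
    """
    total = sum(c * rbf_kernel(sv, x, gamma)
                for sv, c in zip(support_vectors, coeffs))
    return 1 if total + bias >= 0 else -1

# Toy model with two attributes (all values are illustrative assumptions).
svs = [[1.0, 1.0], [-1.0, -1.0]]
coeffs = [1.0, -1.0]      # a_i * y_i
bias = 0.0
k = 2                     # number of attributes
gamma = 1.0 / k           # the default g = 1/k described above

label_pos = decision([0.9, 1.1], svs, coeffs, bias, gamma)
label_neg = decision([-1.2, -0.8], svs, coeffs, bias, gamma)
```

A sample close to the first support vector receives the label +1 and one close to the second receives -1, because the kernel value decays with squared distance at a rate set by g.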
To obtain an SVM classifier that generalizes well, the key is to find the optimal penalty parameter C and kernel function parameter g. The parameter g mainly influences the complexity of the distribution of the samples in the feature space. The penalty coefficient C adjusts the balance between the confidence interval and the empirical risk in the given feature space.

OPTIMIZATION OF LIBSVM PARAMETERS BASED ON GENETIC ALGORITHM
The genetic algorithm originated from computer simulation of biological systems. It simulates natural selection together with the genetic phenomena of reproduction, crossover and mutation. Starting from an arbitrary initial population, random selection, crossover and mutation produce better-adapted individuals and move the population toward a better region of the search space. As it evolves from one generation to the next, the population converges to the best individual, and the optimal solution is obtained. The genetic algorithm is a highly parallel, random and self-adaptive search method, and it has a clear advantage in solving nonlinear problems that traditional methods cannot handle.
Lithology recognition with logging data is a typical nonlinear pattern classification process. Parameter selection plays an important role in establishing an oil shale identification model with the LIBSVM method. In practice, samples are often linearly non-separable in SVM classification, and may remain inseparable even after mapping. Choosing the right LIBSVM parameters is therefore the key to solving the problem. The main steps are as follows:

1) Binary coding was adopted to encode the chromosome. Each parameter was represented by a 10-bit code, so the two parameters formed a 20-bit binary string. The first ten bits encoded the penalty coefficient C and the last ten bits encoded the kernel function parameter g. Parameter C was searched in the range $(0, C^*)$ with $C^* = \max(a_i)$, and g was searched in the range $[g_2, g_1]$, where $g_1$ is the maximum and $g_2$ is the minimum.

2) $U_1$ and $U_2$ denote the decimal integers corresponding to the binary codes of C and g respectively; the decoding formulas convert them back to values of C and g within these search ranges.

3) The initial values generated by the genetic algorithm were adopted as the initial population, with a population size of 30. A large population increases the amount of calculation, while a small population cannot reflect the diversity of the population.

4) The fitness function was designed with k as the number of cross-validation folds; five-fold cross-validation was adopted to obtain the fitness value of each individual in the genetic algorithm. Inaccuracy is the error rate of LIBSVM on the training samples: the lower the error rate on the training samples, the higher the fitness value of the chromosome encoding the parameters.

5) The genetic algorithm used a proportional selection operator, one-point crossover for the crossover operation, and the basic mutation operator for the mutation operation. The terminating number of generations was G = 100, the crossover rate was Pc = 0.70, and the mutation rate was Pm = 0.03.

6) The fitness value of each individual in the population was calculated repeatedly and the genetic operators were applied according to the fitness values to generate a new population. The calculation stopped when generation 100 was reached or when the change in the objective function value no longer exceeded 0.005.
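The six steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' program: the linear decoding (with a one-step offset that keeps C above 0), the search ranges, the `patience` rule used to apply the 0.005 stopping threshold, and the smooth `toy_fitness` surrogate are all assumptions. In the paper the fitness comes from LIBSVM cross-validation, which is replaced here so that the sketch runs without LIBSVM.

```python
import random

BITS = 10                    # bits per parameter (step 1)
MAX_INT = 2 ** BITS - 1

def decode(chrom, c_max, g_min, g_max):
    """Map a 20-bit chromosome to (C, g); the linear mapping is an assumption."""
    u1 = int("".join(map(str, chrom[:BITS])), 2)   # integer code of C
    u2 = int("".join(map(str, chrom[BITS:])), 2)   # integer code of g
    c = c_max * (u1 + 1) / (MAX_INT + 1)           # offset keeps C > 0
    g = g_min + (g_max - g_min) * u2 / MAX_INT
    return c, g

def roulette(pop, fits, rng):
    """Proportional (roulette-wheel) selection of one individual."""
    pick = rng.uniform(0, sum(fits))
    acc = 0.0
    for ind, fit in zip(pop, fits):
        acc += fit
        if acc >= pick:
            return ind
    return pop[-1]

def evolve(fitness, c_max, g_min, g_max, size=30, gens=100,
           pc=0.70, pm=0.03, tol=0.005, patience=10, seed=0):
    """Run the GA and return (best C, best g, best fitness)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(2 * BITS)] for _ in range(size)]
    best, best_fit, prev, stall = None, float("-inf"), None, 0
    for _ in range(gens):
        fits = [fitness(*decode(ind, c_max, g_min, g_max)) for ind in pop]
        gen_best = max(fits)
        if gen_best > best_fit:
            best_fit, best = gen_best, pop[fits.index(gen_best)][:]
        # Stop early once the best fitness has stayed within tol for a while.
        stall = stall + 1 if prev is not None and abs(gen_best - prev) <= tol else 0
        if stall >= patience:
            break
        prev = gen_best
        nxt = []
        while len(nxt) < size:
            p1 = roulette(pop, fits, rng)[:]
            p2 = roulette(pop, fits, rng)[:]
            if rng.random() < pc:                      # one-point crossover
                cut = rng.randint(1, 2 * BITS - 1)
                p1[cut:], p2[cut:] = p2[cut:], p1[cut:]
            for child in (p1, p2):
                for i in range(len(child)):            # basic bit-flip mutation
                    if rng.random() < pm:
                        child[i] ^= 1
                nxt.append(child)
        pop = nxt[:size]
    c, g = decode(best, c_max, g_min, g_max)
    return c, g, best_fit

def toy_fitness(c, g):
    """Hypothetical stand-in for cross-validation accuracy (always in (0, 1])."""
    return 1.0 / (1.0 + ((c - 32.0) / 32.0) ** 2 + (g - 4.0) ** 2)

best_c, best_g, best_fit = evolve(toy_fitness, c_max=100.0, g_min=0.01, g_max=10.0)
```

To use real data, `toy_fitness` would be replaced by a function that trains LIBSVM with the given C and g and returns a positive score such as the cross-validation accuracy; roulette-wheel selection requires the fitness values to be positive.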

Selection of Samples and Normalization
The characteristics of the logging curves were studied using geochemical analysis data and log data. The logging curves of oil shale were found to show high resistivity, high interval transit time, high natural gamma, high neutron porosity and low bulk density. Natural gamma (GR), the logarithm of resistivity (logR), interval transit time (AC) and bulk density (DEN) were used as the input samples. If a sample was oil shale, the output was 1; otherwise the output was -1. All output values were arranged as a column vector. The input data format was fixed as the double type; data in any other format had to be converted before being used in the calculation. The logging data of the y district in the Qaidam Basin were chosen as samples. Considering the representativeness and reliability of the samples, 500 typical samples were chosen as the pattern recognition samples for LIBSVM. The first 400 were taken as learning samples, and the remaining 100 were taken as testing samples to calculate the precision of the model.
In MATLAB, normalization usually uses the maximum and minimum method, which assumes that the maximum and minimum of each feature in the testing samples are equal to the maximum and minimum of the same feature in the training samples. However, this assumption does not hold for every sample set, and the accuracy after normalization is not necessarily higher than the accuracy without normalization. Normalization is therefore not always necessary and should be decided case by case.
The accuracies obtained with the various normalization methods were compared after running the program. They are listed in Table 1, which shows that normalization to the range [0, 1] gave the best accuracy.
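Normalization to [0, 1] with the maximum and minimum method can be sketched as below. The sketch takes the minimum and maximum from the training samples and reuses them for the testing samples, which is one common way to handle the assumption discussed above; the function names and the sample values for GR, logR, AC and DEN are illustrative assumptions, not data from the study.

```python
def fit_min_max(rows):
    """Return per-feature minima and maxima of the training rows."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]

def scale(rows, mins, maxs, lo=0.0, hi=1.0):
    """Scale each feature to [lo, hi] using the given minima and maxima."""
    scaled = []
    for row in rows:
        scaled.append([
            lo + (hi - lo) * (v - mn) / (mx - mn) if mx > mn else lo
            for v, mn, mx in zip(row, mins, maxs)
        ])
    return scaled

# Hypothetical samples: columns are GR, logR, AC, DEN.
train = [
    [80.0, 1.2, 250.0, 2.5],
    [120.0, 2.0, 300.0, 2.3],
    [100.0, 1.6, 275.0, 2.4],
]
test = [[110.0, 1.8, 290.0, 2.35]]

mins, maxs = fit_min_max(train)
train_scaled = scale(train, mins, maxs)
test_scaled = scale(test, mins, maxs)   # reuse the training min and max
```

A testing value outside the training range would fall outside [0, 1] under this scheme, which is the situation in which the equal-range assumption breaks down.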

Determination of Parameters C and g Based on Genetic Algorithm
In the svmtrain function of LIBSVM, the penalty coefficient C and the parameter g can take values within a certain range, and the pair of C and g that gives the highest accuracy is the one chosen. The best SVM parameters were searched with the genetic algorithm. The initial population had a size of 30, and the maximum number of generations was 100. The SVM trained with the best parameters was taken as the final calculation model. The best parameters obtained were C = 32.7793 and g = 4.6352. The classification accuracy improved as the number of iterations increased, and the fitness value rose correspondingly until it converged to a stable value.
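As an illustration of how the chosen parameters reach LIBSVM, the sketch below builds the option string that LIBSVM's training function accepts. The flags `-s 0` (C-SVC), `-t 2` (RBF kernel), `-c`, `-g` and `-v` (n-fold cross-validation) are LIBSVM's documented options; the helper name `libsvm_options` is hypothetical, and Python is used here for illustration although the text refers to the MATLAB interface.

```python
def libsvm_options(c, g, folds=None):
    """Build a LIBSVM option string for C-SVC with an RBF kernel."""
    opts = f"-s 0 -t 2 -c {c} -g {g}"
    if folds is not None:
        opts += f" -v {folds}"      # cross-validation mode
    return opts

# Best parameters reported in the text.
train_opts = libsvm_options(32.7793, 4.6352)
cv_opts = libsvm_options(32.7793, 4.6352, folds=5)

# Assuming the LIBSVM Python bindings are installed, the strings would be used as:
#   from libsvm.svmutil import svm_train
#   model = svm_train(labels, features, train_opts)
#   cv_accuracy = svm_train(labels, features, cv_opts)
```

In cross-validation mode the training call returns an accuracy rather than a model, which is the value a fitness function for the genetic algorithm could be built on.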

CONCLUSION
This work studied the application of LIBSVM to the recognition of oil shale in the northern Qaidam Basin. To address the difficulty of choosing the right SVM parameters, a method for optimizing the LIBSVM parameters based on a genetic algorithm was proposed. The normalization method was chosen by comparing the accuracies obtained with different normalization methods. The optimal penalty coefficient C and kernel function parameter g were obtained with a modified genetic algorithm, and the resulting model was used to identify oil shale. The results show that the recognition accuracy for oil shale reached 97.2806% with the optimized parameters, which indicates that this is a feasible way to deal with similar problems in log districts with the same geological background. Follow-up work will modify the encoding method and improve the performance and accuracy of the algorithm.

For convenience of calculation, minimizing $\|w\| / 2$ is replaced by minimizing $\|w\|^2 / 2$. Finding the hyperplane is therefore equivalent to finding the $w$ and $b$ that minimize this quantity.

Fig. (1). Linearly separable optimal hyperplane.

For linearly non-separable training samples, for example when the training samples contain noise and some of them are misclassified, slack variables $\xi_i$ and the penalty coefficient C are introduced into the hyperplane problem of formula (1), which is adjusted to the following problem:

$$\min_{w, b, \xi} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0$$

The 400 training samples were used to train LIBSVM, and the trained model was then used to predict the labels. The comparison results are displayed in Table 2.

Table 3. Selection of parameters in LIBSVM and results of recognition.

The final classification accuracy is 97.2806%. According to this result, the method can be applied to the recognition of oil shale. The identification results for the y log district in the northern Qaidam Basin and their comparison with rock cores are shown in Fig. (2). Part of the classification results of the LIBSVM model are shown in Table 3.