Optimization of Fuzzy Support Vector Machine (FSVM) Performance by Distance-Based Similarity Measure Classification

This research aims to determine the maximum or minimum value of a Fuzzy Support Vector Machine (FSVM) algorithm using an optimization function. SVM is considered an effective data classification method, but it is less effective on large and complex data because of its sensitivity to outliers and noise. One technique used to overcome this weakness is fuzzy logic, whose ability to select the right membership function significantly affects the performance of the FSVM algorithm. This research was carried out using the Gaussian membership function and Distance-Based Similarity Measures consisting of the Euclidean, Manhattan, Chebyshev, and Minkowski distance methods. Subsequently, the optimization of the FSVM classification process was evaluated using four proposed FSVM models, with standard SVM as a comparison reference. The results showed that the method tends to eliminate the impact of noise and enhance classification accuracy effectively. FSVM provides the best and highest accuracy value of 94% at a penalty parameter value of 1000 using the Chebyshev distance. Furthermore, the proposed model was compared to the performance evaluation model of a preliminary study (Xiao-Kang Ding et al., 2018). The results further showed that FSVM with the Chebyshev distance and a Gaussian membership function provides a better performance evaluation value.


Introduction
Classification is a grouping method based on the characteristics possessed by objects. The Support Vector Machine (SVM) is one of the classification methods that has attracted considerable attention over the last decade due to its high generalization performance and wide application. Research comparing SVM with ANN, KNN, fuzzy logic, and random forest (RF) methods for classifying driving models [1] indicates that SVM has the highest accuracy value of 96%. Furthermore, in [2] and [3], SVM is proven to have a higher level of accuracy and generalization performance than other classification methods. In the real world, this method is applied in many areas such as text categorization [4][5][6], speech recognition [7], bioinformatics [8][9][10][11], and network security [11].
Vapnik introduced SVM in 1995 [12,13] based on structural risk minimization theory. It is one of the superior methods trained with an algorithm and used to separate a dataset into two or more classes. Steve Gunn stated that the SVM method is used to determine the optimal global solution and works by mapping the training data into a high-dimensional space while searching for a classifier capable of maximizing the margin between the two classes [14].
Margin is the distance between the support vector and the hyperplane, while the support vector is the pattern of each class with the closest distance to the hyperplane.
The SVM method performs poorly on complex problems with many parameters due to noise and outliers, which cause a decrease in generalization performance [15]. Therefore, one of the methods used to solve this problem is combining the SVM method with fuzzy logic [16][17][18]. Preliminary studies applied fuzzy logic in two ways, namely using fuzzy rules [19,20] and membership functions [21][22][23][24][25][26]. Each input sample is assigned a different contribution, so as to eliminate the effect of noise and outliers and to improve the generalization performance of SVM classification.
Fuzzy logic is a mathematical way of describing vagueness, introduced by Lotfi A. Zadeh in 1965 [27]. The combination of the SVM method with fuzzy logic is called the Fuzzy Support Vector Machine (FSVM), in which constructing the membership function is a crucial step in classification [28]. Several methods, such as the function approach, intuition, rank-ordering, inductive reasoning, neural networks, and genetic algorithms, have been used to build a fuzzy membership function.
Euclidean distance is a general criterion chosen to determine the similarity of the data used to construct the membership function. Xiao-Kang Ding (2018) proposed an FSVM method based on the Euclidean distance using 3 models, namely FSVM-1, FSVM-2, and FSVM-3, by comparing the distances of positive and negative samples. In the FSVM-3 model, sample points are mapped into a high-dimensional space and calculated using the FSVM-2 method. This research indicates that the best accuracy is given by FSVM-3, followed by FSVM-2 and FSVM-1 [29].
The measuring methods commonly used to determine similarity are the Euclidean, Manhattan, Chebyshev, Minkowski, Hamming, Mahalanobis, and Minkowski-Chebyshev distances, each with its own advantages and disadvantages. According to Mohammed and Abdulazeez (2018), Euclidean distance is the most commonly used method for calculating distances in numerical data. It works efficiently by calculating the similarity within a grouping and has the ability to separate the data adequately [29]. Manhattan distance is often used due to its ability to detect special circumstances such as the presence of outliers [30]. Similarly, the Chebyshev distance is sensitive in detecting objects with outliers.
Based on the description of several preliminary studies, the combination of the SVM and fuzzy logic (FSVM) methods tends to optimize classification when the right membership function is selected. Therefore, this research aims to apply FSVM with a Gaussian membership function built on distance measures. Furthermore, a comparative study is carried out using several methods, namely the Euclidean, Manhattan, Chebyshev, and Minkowski distances.

Materials and Methods
The fuzzy system proposed by Lotfi A. Zadeh was built based on fuzzy set theory and fuzzy logic. This method is useful for dealing with complex real-world problems involving uncertainty and imprecision; fuzzy set theory and classification techniques help handle uncertainty and ambiguity, which increases the generalizability of the classifier. FSVM is an extension proposed by Lin and Wang in 2002 to reduce the sensitivity of SVM to outliers and noise. It works by assigning a low weight to noisy samples through a fuzzy membership function based on the similarity (distance) of the data. A fuzzy membership $s_i$, where $0 < s_i \le 1$, is assigned to each sample of the training dataset with class label $y_i \in \{1, -1\}$, so the fuzzy version of the dataset is $\{(x_1, y_1, s_1), (x_2, y_2, s_2), \ldots, (x_n, y_n, s_n)\}$. The optimal FSVM hyperplane is then obtained by entering the membership values into the standard SVM formulation.
where $s_i$ denotes the fuzzy membership with a value between 0 and 1 ($0 < s_i \le 1$). In 2016, Xiao-Kang proposed determining the degree of membership of each data point by adapting the calculation of the membership function. This was carried out by comparing the distance from each positive and negative sample to the center of each class, using the Euclidean, Manhattan, Chebyshev, and Minkowski distance formulas. The membership function is calculated as follows: when the distance of a sample to the positive (its own) class center is less than its distance to the negative class center, it is considered a "useful point" and its membership is set to 1. However, when the distance to the positive class center is greater than the distance to the negative one, the sample is considered a "noisy point" and its membership is calculated according to the Gaussian membership function formula.
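The membership rule described above can be sketched in Python. This is a minimal illustration, not the paper's implementation: it assumes Euclidean distance to the class centers, a Gaussian decay for noisy points, and a width parameter `sigma`; all names are our own.

```python
import numpy as np

def fuzzy_memberships(X, y, sigma=1.0):
    """Assign each sample a membership in (0, 1]: 1 for 'useful points'
    (closer to their own class center), a Gaussian value for 'noisy points'."""
    X, y = np.asarray(X, float), np.asarray(y)
    pos_center = X[y == 1].mean(axis=0)
    neg_center = X[y == -1].mean(axis=0)
    s = np.empty(len(X))
    for i, (x, label) in enumerate(zip(X, y)):
        own = pos_center if label == 1 else neg_center
        other = neg_center if label == 1 else pos_center
        d_own = np.linalg.norm(x - own)
        d_other = np.linalg.norm(x - other)
        if d_own <= d_other:
            s[i] = 1.0                                # "useful point"
        else:
            s[i] = np.exp(-d_own**2 / (2 * sigma**2))  # "noisy point"
    return s
```

A positive sample lying nearer the negative center (e.g. an outlier) thus receives a small weight, reducing its influence on the classifier.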

Gaussian Membership Function
The Gaussian curve membership function formula is written as follows:

$$\mu(x) = \exp\left(-\frac{(x - c)^2}{2\sigma^2}\right)$$

where $c$ denotes the center and $\sigma$ the width of the curve.
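A one-line sketch of the Gaussian membership function, expressed in terms of the distance $d = x - c$ from the center (the function name and default `sigma` are our own):

```python
import numpy as np

def gaussian_membership(d, sigma=1.0):
    """Gaussian membership value for a distance d from the class center."""
    return np.exp(-d**2 / (2 * sigma**2))
```

The value is 1 at the center and decays smoothly toward 0 as the distance grows, which is what lets FSVM down-weight far-away (noisy) samples.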

Radial Basis Function (RBF) Kernel
The RBF kernel formula is as follows:

$$K(x_i, x_j) = \exp\left(-\gamma \, \lVert x_i - x_j \rVert^2\right)$$

where $\gamma > 0$ is the kernel parameter.
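The standard RBF kernel can be sketched directly from its definition (a minimal illustration; the function name and default `gamma` are our own):

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.exp(-gamma * np.sum((x - y) ** 2))
```

The kernel equals 1 when the two points coincide and approaches 0 as they move apart, giving a similarity measure for the implicit high-dimensional mapping.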

Distance Based Similarity Measure
The similarity measure is an important part that needs to be considered in pattern matching and to carry out various types of classification. Distance-Based Similarity Measure works to measure the level of similarity of two objects in terms of the geometric distance from the variables included in both objects. These include the following.

Euclidean Distance
Euclidean distance is often used in measuring data similarities, as shown in Equation 7:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Manhattan Distance
Manhattan distance is used to calculate the absolute difference between the coordinates of a pair of objects, as shown in Equation 8:

$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

Chebyshev Distance
The Chebyshev distance is measured using the following formula:

$$d(x, y) = \max_{i} |x_i - y_i|$$

Minkowski Distance
The Minkowski distance is a generalization of the Euclidean and Manhattan distances, whereby the power $p$ acts as the determining parameter. When $p$ equals 1 and 2, the Minkowski distance becomes equivalent to the Manhattan and Euclidean distances, respectively. The following formula is used to calculate the Minkowski distance:

$$d(x, y) = \left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{1/p}$$
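The four distance measures above can be sketched together, with Euclidean and Manhattan expressed as special cases of Minkowski (a minimal illustration; function names are our own):

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def euclidean(a, b):
    """Minkowski distance with p = 2."""
    return minkowski(a, b, 2)

def manhattan(a, b):
    """Minkowski distance with p = 1."""
    return minkowski(a, b, 1)

def chebyshev(a, b):
    """Largest coordinate-wise absolute difference (limit p -> infinity)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.max(np.abs(a - b))
```

For the pair (0, 0) and (3, 4), for instance, the Euclidean, Manhattan, and Chebyshev distances are 5, 7, and 4, respectively.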

Results and Discussion
The Fuzzy Support Vector Machine algorithm proposed in this research is applied to 3 datasets considered representative for effectively verifying the proposed FSVM model. The data processing was carried out in Jupyter Notebook with the Python programming language. The details of the 3 datasets are shown in Table 1. The basic concept of the Fuzzy Support Vector Machine is first used to determine the degree of fuzzy membership of the data used for the FSVM calculation, which comprises positive and negative classes. Therefore, the center of each class can be defined as the average vector of its attributes, determined using Equation 11.
where $\bar{x}^+$ and $\bar{x}^-$ denote the means of the positive and negative classes, while $n^+$ and $n^-$ are the numbers of data points in the positive and negative classes, respectively.
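The class-center computation of Equation 11 amounts to a per-class mean of the attribute vectors, which can be sketched as follows (labels assumed to be 1 and -1; the function name is our own):

```python
import numpy as np

def class_centers(X, y):
    """Mean attribute vector of the positive and negative classes (Eq. 11)."""
    X, y = np.asarray(X, float), np.asarray(y)
    return X[y == 1].mean(axis=0), X[y == -1].mean(axis=0)
```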
where SE (sensitivity) and SP (specificity) are the ratios of the completeness or accuracy of the correct prediction of positive and negative data, respectively. The results of the FSVM classification accuracy based on the AUC (Area Under the Curve) approach are shown in Figure 1. Figure 2 shows that when the penalty parameter is 2 (C=2), FSVM using the Manhattan and Chebyshev distances generates the best accuracy. Furthermore, at parameters C=50 and C=500, FSVM using the Manhattan distance produces better accuracy than the other methods. The best accuracy for the penalty parameter C=10 is obtained by the FSVM method using the Euclidean distance, while the highest accuracy at the penalty parameters C=200 and C=1000 is generated by FSVM with the Chebyshev distance. The accuracy of each method is shown in Table 2. With a value of C=1000, the accuracy is determined under different training and testing data scenarios. Table 3 and Figure 2 show that the best performance among the proposed methods is obtained by FSVM using the Chebyshev distance and the RBF kernel with C=1000. Table 4 and Figure 3 show that FSVM 2 and FSVM 3 provide better classification performance evaluations than previous studies, while FSVM 5 and FSVM 6 provide better classification performance evaluations than FSVM 1. Out of all the models discussed in this research, FSVM 3 provides the best classification evaluation.
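As an illustration, the SE and SP ratios defined above can be computed from predicted and true labels with a short sketch (labels assumed to be 1 and -1, matching the class convention used here; the function name is our own):

```python
def sensitivity_specificity(y_true, y_pred):
    """SE = TP / (TP + FN); SP = TN / (TN + FP), for labels in {1, -1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    se = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    return se, sp
```

SE measures how completely the positive class is recovered, SP the negative class; both feed into AUC-style summaries of classifier performance.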

Conclusion
This research introduced a fuzzy membership function based on Distance-Based Similarity Measures with the Euclidean, Manhattan, Chebyshev, and Minkowski distance methods to determine the best method for optimizing the Fuzzy Support Vector Machine classification process. This application is a new approach because there are no previous studies on the analysis of FSVM based on Distance-Based Similarity Measures. Based on the results and discussion of this study, it is concluded that FSVM using the Chebyshev distance and the Gaussian membership function has the best performance in reducing the effects of noise and outliers.

Data Availability Statement
The data presented in this study are available in the article.