Analysis of Slow Moving Goods Classification Technique: Random Forest and Naïve Bayes

Classification techniques in data mining are useful for grouping data based on related criteria and history. Categorizing goods as slow moving or not is important because it affects selling policy. Various classification algorithms are available to predict the class labels of data; two of them are Random Forest and Naïve Bayes. Both algorithms can describe their predictions in detail through the indicators of accuracy, precision, and recall. This study compares the performance of the two algorithms on test data of snacks with attributes for packaging, size, taste, and category. The study analyzes data patterns and decides whether or not the goods fall into the slow moving category. Our research shows that the Random Forest algorithm predicts well, with an accuracy of 87.33%, a precision of 85.82%, and a recall of 100%. It performs better than the Naïve Bayes algorithm, which attains an accuracy of 84.67%, a precision of 88.33%, and a recall of 92.17%. Furthermore, the Random Forest algorithm attains an AUC of 0.975, which is slightly higher than the 0.936 attained by Naïve Bayes. Random Forest is considered better based on these metrics, which is reasonable because the algorithm does not produce bias and is very stable.


Introduction
Goods can be classified based on their circulation over a certain period of time, and goods with very slow circulation are called slow moving goods [1]. Slow moving goods have to be stored in warehouses in large quantities. They are materials that circulate at a rate of one item within a year [2]. Classification problems associated with slow moving goods occur due to a lack of analysis of previous data [3]. Such analysis can be conducted using the classification algorithms of data mining. Classification creates patterns by analyzing the closeness of the labels or attributes that make up the item data. The resulting patterns are the predictions of slow moving goods.
In this study, Random Forest and Naïve Bayes are the classification algorithms used, and they are applied to data on packaged snacks. Both algorithms were chosen because they can produce accurate predictions with descriptions that agree closely with actual situations. Many studies have applied the two algorithms to classification. In [4], Random Forest was used to analyze multispectral images by classifying points in the images. A taxonomy of the Random Forest algorithm is described in [5] through several parameters, such as the base of classifications, size division, number of tracks, combination of strategies, number of attributes, criteria, cut-off ability, additional classifications, and number of datasets used in the training phase. In addition, the advantages and benefits of the Random Forest algorithm for prediction on large datasets are highlighted in [6].
The ability of the Naïve Bayes algorithm has been tested in various prediction tasks, including predicting purchase behavior by transaction time [7]. The resulting pattern shows that more buyers make transactions in the afternoon, particularly on Sundays. The Naïve Bayes algorithm has also been used to group blogger data [8] and banking product marketing data [9], [10] to help banks find potential customers. Its performance has been compared with that of other classification algorithms, namely K-Nearest Neighbor (KNN) and Decision Tree [11], in grouping data on school students who consume alcohol. In that comparison, Naïve Bayes showed better accuracy than the other two algorithms.
The results of previous studies using the Random Forest and Naïve Bayes algorithms motivate an attempt to observe both algorithms in the classification of slow moving goods. The result is valuable for decision makers who implement policies related to such goods. A comparison is required to obtain a clear picture of the performance of both algorithms.

Theory
a. Random Forest Algorithm
The Random Forest algorithm is an ensemble model that was created and developed by Tin Kam Ho [12]. It belongs to supervised learning and obtains its results by combining the calculations of several models [6]. As an ensemble model, Random Forest builds multiple decision trees and uses their rules to calculate the final result, following formula (1) [13]. After the training data have been processed, predictions are obtained from the average of the results of all trees, using formula (2).
In its various applications, the Random Forest algorithm is widely used for its advantages, such as better accuracy, resistance to various disturbances, speed, and convenience of implementation [14].
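The averaging of tree results described above is commonly written in the following standard form. It is given here as an assumption about what the averaging step expresses, not as a reproduction of the paper's own formula; the symbols $B$ (number of trees), $f_b$ (the prediction of the $b$-th tree), and $x$ (an input item) are introduced only for illustration.

```latex
\hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x)
```

For class labels, the same idea is usually applied as a majority vote, where the class predicted by the largest number of the $B$ trees is taken as the final result.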

b. Naïve Bayes Algorithm
The Naïve Bayes algorithm is a simple probabilistic classification technique based on applying Bayes' theorem with strong (naïve) independence assumptions between attributes [15]. Naïve Bayes can be applied to a limited amount of data to estimate the parameters needed for classification. The Naïve Bayes formula is expressed in formula (3) [11]. The advantages of the Naïve Bayes algorithm include the ability to handle quantitative and discrete data, resistance to isolated noise points, the sufficiency of a small amount of training data, the ability to handle missing values by ignoring the affected instances when estimating probabilities, speed, efficiency in space, and robustness against irrelevant attributes.
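As an illustration of the technique described above, the following minimal sketch applies Bayes' theorem, P(C|X) = P(X|C) P(C) / P(X), to categorical attributes under the independence assumption. It is not the RapidMiner implementation used in this study. The function names, the six sample rows, and their attribute values are hypothetical, and the Laplace smoothing is an added assumption to avoid zero probabilities.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate class counts and per-attribute value counts from categorical data."""
    priors = Counter(labels)
    # counts[class][attribute_index][value] = number of training rows
    counts = defaultdict(lambda: defaultdict(Counter))
    values = defaultdict(set)  # distinct values seen for each attribute
    for row, label in zip(rows, labels):
        for i, v in enumerate(row):
            counts[label][i][v] += 1
            values[i].add(v)
    return priors, counts, values

def predict(model, row):
    """Return (best_class, posterior) under the naive independence assumption."""
    priors, counts, values = model
    total = sum(priors.values())
    scores = {}
    for c, n_c in priors.items():
        score = n_c / total  # prior P(C)
        for i, v in enumerate(row):
            # Laplace-smoothed P(x_i | C); the smoothing is an assumption
            score *= (counts[c][i][v] + 1) / (n_c + len(values[i]))
        scores[c] = score
    evidence = sum(scores.values())  # P(X), used to normalise
    posterior = {c: s / evidence for c, s in scores.items()}
    return max(posterior, key=posterior.get), posterior

# Hypothetical items: (packaging, size, taste, category) -> slow moving?
rows = [
    ("plastic", "small", "sweet", "chips"),
    ("plastic", "large", "salty", "chips"),
    ("box", "large", "sweet", "biscuit"),
    ("box", "small", "salty", "biscuit"),
    ("plastic", "small", "salty", "chips"),
    ("box", "large", "salty", "biscuit"),
]
labels = ["No", "No", "Yes", "Yes", "No", "Yes"]

model = train_naive_bayes(rows, labels)
best, posterior = predict(model, ("box", "large", "sweet", "biscuit"))
print(best, posterior)
```

The posterior values play the role of the confidence values reported for each class: the class with the larger posterior becomes the prediction.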

Method
a. Training Data
The two algorithms were applied in two separate tests using the RapidMiner application. RapidMiner is an application for data mining tasks such as machine learning, information mining, and content mining [16]. In this study, RapidMiner is used to display the performance of the two algorithms on data of packaged snacks. A total of 150 data items were taken at random. The data must pass a selection process in accordance with the stages of Knowledge Discovery in Databases (KDD). The data were arranged by the attributes considered to most affect the speed of item transactions: packaging, size, taste, and category. The attributes are described in Table 1.

c. Research Framework
To produce a prediction of slow moving goods, the research goes through several steps, as shown in Figure 1. The first stage is data preparation, which follows the stages of Knowledge Discovery in Databases (KDD). Data selection consumes a large amount of time because the data must be adjusted to the classification algorithms, i.e. Random Forest and Naïve Bayes. The data have to pass the KDD stages to yield training data of proper quality. The KDD selection produces the training data described in Table 1.

Figure 1. Research Framework
The next step is selecting the parameters used to compare the performance of the test results. We choose accuracy, precision, and recall, which are obtained with the gain ratio criterion in RapidMiner. Gain ratio is chosen as the criterion because of its ability to take into account every data item in the available sample space. The parameters are selected so that the test results of the two algorithms can be compared. In addition to the three parameters, the test results are displayed in accordance with the features of each algorithm. Training produces patterns that can be analyzed to obtain predictions that reflect the actual situation. This step provides the best prediction results for each of the two algorithms.

Results and Discussion
Research results on the two algorithms are described in the following two sections: the prediction results and the parameter results.

a. Prediction Results
Item data were tested on both algorithms using 10 iterations. Each iteration produced a different tree structure. A confidence value was displayed to indicate how confidently each attribute contributes to the decision of whether an item belongs to the slow moving or the non-slow moving category. The gain ratio criterion was used as the measure for reading the Random Forest test results. There are some discrepancies in the test results, especially on the target attribute "No", because the confidence value of "No" is higher than that of the target attribute "Yes". Of the 150 data items, 19 are discrepant, which implies an error rate of 12.67%. The rules resulting from the Random Forest calculation are described in Table 2.
The Naïve Bayes prediction using the gain ratio shows a lower error rate of 8.67%, with 13 discrepant items. The differences between the prediction and the initial data are mostly on items labeled "No" that become "Yes" according to the Naïve Bayes calculation. This means that items not originally in the slow moving category are placed into it. The confidence value is lower here, so it changes items with the class attribute "No" to "Yes". Naïve Bayes produces a model of the slow moving attribute with the two classes mentioned previously, with respective values of 0.767 for the "No" class and 0.233 for the "Yes" class. The rules resulting from the Naïve Bayes calculation are in Table 3.
Based on Table 3, Naïve Bayes produces an accuracy of 84.67%. The recall calculated under the gain ratio criterion is 60% for positive class = Yes and 92.17% for positive class = No.
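The reported percentages can be checked against the usual confusion-matrix definitions, as in the sketch below. The counts are not taken from the paper's tables. They are reconstructed under two assumptions: that "No" is the positive class, and that the 150 items split into 115 "No" and 35 "Yes" items (consistent with the class values 0.767 and 0.233). The function name `metrics` is hypothetical.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, precision and recall (in percent) from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = 100 * (tp + tn) / total
    precision = 100 * tp / (tp + fp)
    recall = 100 * tp / (tp + fn)
    return round(accuracy, 2), round(precision, 2), round(recall, 2)

# Reconstructed counts (an assumption), with "No" as the positive class
naive_bayes = metrics(tp=106, fn=9, fp=14, tn=21)
random_forest = metrics(tp=115, fn=0, fp=19, tn=16)

print(naive_bayes)    # (accuracy, precision, recall) for Naïve Bayes
print(random_forest)  # (accuracy, precision, recall) for Random Forest
```

Under these assumed counts, the recall for positive class = Yes in Naïve Bayes is 21 / 35 = 60%, and the 19 false positives of Random Forest correspond to an error rate of 19 / 150 = 12.67%.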

b. Accuracy, Precision, and Recall Parameters
Applying the gain ratio criterion in this training stage produces detailed calculations in the form of a confusion matrix. Both algorithms reveal patterns that are hidden in the training data. Running RapidMiner with the Performance operator yields three metrics: accuracy, precision, and recall, as shown in Table 4. Table 4 shows that the accuracy and recall of the Random Forest algorithm are higher than those of the Naïve Bayes algorithm, whereas the precision of Naïve Bayes is higher. Hence Random Forest is superior to Naïve Bayes in two of the three metrics. To further decide which classification algorithm is better, we observe the Receiver Operating Characteristic (ROC) curve and calculate the Area Under the ROC Curve (AUC) [7]. An ROC curve expresses confusion matrix data, with the horizontal axis representing the false positive (FP) rate and the vertical axis representing the true positive (TP) rate. Figure 2 is the ROC curve obtained from the Random Forest algorithm, with an AUC of 0.975. In [8], AUC was used to measure discriminative performance by estimating the probability that the output for a random sample from the positive population exceeds the output for a random sample from the negative population. The greater the AUC, the more strongly the classifier can be recommended. Because the AUC is a portion of the area of the unit square, its value always lies between 0.0 and 1.0.
Figure 3 is the ROC curve obtained from the Naïve Bayes algorithm, with an AUC of 0.936. The AUC of the Random Forest algorithm, at 0.975, is slightly higher. However, both algorithms behave as nearly perfect classification models, with AUC values close to 1.00.
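The relationship between the ROC curve and the AUC can be sketched as follows. This is a generic trapezoidal-rule calculation, not the routine used by RapidMiner. The function name `roc_auc` and the sample labels and confidence scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the trapezoidal rule.

    labels: 1 for the positive class, 0 for the negative class.
    scores: classifier confidence for the positive class.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    # Sort by descending score, then sweep the decision threshold downwards
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp = fp = 0
    prev_fpr = prev_tpr = 0.0
    auc = 0.0
    i = 0
    while i < len(ranked):
        # Handle tied scores together so that ties form one diagonal segment
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        tpr, fpr = tp / pos, fp / neg
        # Area of the trapezoid between the previous and current ROC points
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_fpr, prev_tpr = fpr, tpr
        i = j
    return auc

# Hypothetical confidences for six items (1 = positive class)
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

In this small example, 8 of the 9 positive-negative pairs are ranked correctly, so the area equals 8/9. A classifier that ranks every positive item above every negative item would reach an AUC of 1.0.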

c. Discussion
Both algorithms show good performance but with different results. The overall results of testing the item data with Random Forest and Naïve Bayes are summarized in Table 5.
The metrics in Table 5 show that the performance of the Random Forest algorithm is generally better. The Random Forest algorithm produces a tree structure in each iteration that is easy to compare with the structures from other iterations. The majority result across the structures becomes the final result. The ability of the Random Forest algorithm to analyze the results of each decision tree over 10 iterations has apparently produced higher accuracy than the Naïve Bayes algorithm. The predominantly similar rules across iterations are one of the advantages of Random Forest and may support its high accuracy [17]. The recall of 100% and the AUC of 0.975 make Random Forest the best choice for the classification of slow moving goods. Therefore, the attributes considered responsible for causing goods to become slow moving are taken from those identified by the Random Forest algorithm.
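The way the results of individual trees are combined into a final result can be sketched as a simple majority vote. The function name `majority_vote` and the ten votes are hypothetical, and reporting the vote share as a confidence value is an assumption made for illustration rather than RapidMiner's own definition.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most frequent class label and its share of the votes."""
    votes = Counter(tree_predictions)
    label, count = votes.most_common(1)[0]
    return label, count / len(tree_predictions)

# Hypothetical predictions from 10 trees for a single item
votes = ["No", "No", "Yes", "No", "No", "No", "Yes", "No", "No", "No"]
label, confidence = majority_vote(votes)
print(label, confidence)
```

Because the final label depends on agreement among many trees, a single tree that disagrees does not change the outcome, which is one way to read the stability attributed to the algorithm.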

Conclusion
We have observed two algorithms, Random Forest and Naïve Bayes, to classify data on packaged snacks and to identify which attributes support the slow moving class label. Calculations using RapidMiner on both algorithms give predictions with nearly similar accuracy. The 2.51% difference in precision between the two algorithms suggests that the Naïve Bayes algorithm is more accurate on slow moving goods in the training data. This is shown by its smaller prediction errors compared with the Random Forest algorithm, and by confidence values that tend to be identical. However, the Random Forest algorithm is more reliable for obtaining a precise prediction because the prediction is drawn from several decision trees. This research shows that the Random Forest algorithm provides better predictions that reflect actual conditions with a limited amount of data. A total of 5 rules were produced, showing perfect (100%) compatibility with the actual situation of packaged snacks.