Sentiment Analysis of On-Demand Ride-Hailing Systems using Support Vector Machine and Naïve Bayes

Gojek, founded in 2010, is one of Indonesia's most popular online transportation services. The Gojek application has been downloaded 142 million times and has more than two million drivers and four hundred thousand food delivery partners. Given the increasing use of the Gojek application, it is important to understand how users view the services it provides. This research applies sentiment analysis using the Support Vector Machine and Naïve Bayes methods to classify reviews as positive or negative. The target labels are limited to positive and negative to avoid the bias present in neutrally labeled reviews of the Gojek application. The research process consists of data collection, data pre-processing, weighting with Term Frequency-Inverse Document Frequency, training the Support Vector Machine and Naïve Bayes models on a split of 90% training data and 10% testing data, and evaluating the results using a confusion matrix. Testing with the Support Vector Machine algorithm yielded 90% accuracy, 94% recall, 91% precision, and a 94% F1-score, whereas the Naïve Bayes algorithm yielded 77% accuracy, 96% recall, 77% precision, and an 85% F1-score.


Introduction
Modern technology is advancing rapidly, particularly in the realm of smartphone applications that operate on Android, Windows, and iOS platforms. This swift technological progress has led to the accumulation of vast amounts of data, which has subsequently transformed into valuable information. Technological advancements continue to expand across various sectors, including the economy, education, and transportation. Transportation is a crucial element, involving the movement and relocation of goods and passengers to different locations. Efficient transportation plays a vital role in boosting the economy of a region. The introduction of online transportation services has significantly increased public trust. The key drivers for using online transportation services include their accessibility through user-friendly applications, affordability, and safety. Moreover, online transportation services follow the on-demand business model, tailoring their services to meet consumer requests, thereby enhancing overall convenience for consumers. One of the prominent online transportation services in Indonesia is Gojek, managed by Gojek Indonesia or PT Aplikasi Karya Anak Bangsa. Established in 2010, it initially began as a two-wheeled online transportation company (Wahyu Handani et al., 2019).
Gojek has become a popular online transportation service that offers rides by motorcycle and can be ordered easily through a smartphone application. Users benefit greatly from Gojek because it currently offers 17 flagship services that meet consumer needs (Muttaqin, 2020). The popularity of the Gojek App among users is reflected in the number of positive and negative reviews the app receives. However, reading every review manually would take developers a great deal of time and effort, so that approach is not effective.
Meanwhile, these reviews can guide improvements to the Gojek App (Kulsum et al., 2022).
In 2022, Dimas Diandra Audiansyah, Dian Eka Ratnawati, and Buce Trias Hanggara conducted sentiment analysis on reviews of the MyXL App on the Google Play Store, using the Support Vector Machine (SVM) method for sentiment classification. For two classes, the SVM algorithm gave the best results with a training-to-test data ratio of 80%:20%, a total of 160 positive and 160 negative data points, cross-validation with K=5, and a linear kernel. The results were 88% accuracy, 88% precision, 88% recall, and an 88% average F1-score (Dimas Diandra Audiansyah, Dian Eka Ratnawati, 2022).
In 2021, R. Wahyudi and G. Kusumawardana studied sentiment analysis of the Grab application on the Google Play Store, basing their evaluation on more than 1,000 user reviews of the Grab Indonesia application. Analysis using the Support Vector Machine method resulted in an accuracy of 85.54%, with "ovo" receiving the majority of positive reviews and "driver" receiving the majority of negative reviews. At this accuracy and with fold value = 5, the Support Vector Machine method predicted 59 positive reviews and 675 negative reviews among the 900 reviews in the testing data (Wahyudi & Kusumawardana, 2021).
Meanwhile, previous research using the Naïve Bayes method for sentiment analysis of KRI Nanggala 402 tweets on Twitter showed that the public generally reacted neutrally, with positive and negative sentiment being equal. The Naïve Bayes algorithm was used to classify tweet documents during the analysis phase and achieved an accuracy of 73.00%, which the study regarded as reliable (Djamaludin et al., 2022).
Another study used the Naïve Bayes method for sentiment analysis of Twitter data about the National BMKG, dividing it into three classes (positive, negative, and neutral), and obtained an accuracy of 68.97% (Darwis et al., 2021). The results of these previous studies serve as references for this research, which compares the Support Vector Machine algorithm and the Naïve Bayes algorithm for sentiment analysis of Gojek App reviews on the Google Play Store.
Sentiment analysis, also referred to as opinion mining, involves the computational examination of individuals' perspectives, emotions, sentiments, judgments, and attitudes regarding various entities such as products, services, organizations, individuals, topics, events, themes, and their associated characteristics.
The emergence and rapid expansion of this field coincided with the proliferation of social media platforms on the internet, including reviews, discussion forums, blogs, microblogs, Twitter, and social networks. Since the early 2000s, sentiment analysis has evolved into one of the most dynamic and actively researched domains within natural language processing (NLP) (Zhang et al., 2018). Twitter is a social media platform frequently utilized for information exchange, discussions, and emotional expression.
The emotions expressed by Twitter users are referred to as sentiments. Sentiment analysis is performed to assess opinions and inclinations, which can be positive or negative (Silaen et al., 2022). More generally, it determines whether an opinion or view on a topic or event is positive, negative, or neutral. As a topic within NLP, sentiment analysis is used in a variety of contexts, including stock price prediction, political issues, customer satisfaction, reputation analysis, and more (Fikri et al., 2020).
Google Play Store is "an online sales platform for developers to sell and distribute products to players through one or more software ecosystem platforms".
Piksel 11 (2): 383-392 (September 2023)
Google Play Store allows developers to make profits from their software and bring new functionality to consumers. Google Play Store was launched in 2008 and is the largest app store in the Android ecosystem. Google Play Store serves the Android platform as an open-source operating system for mobile devices and tablet computers (Oktavian & Budi, 2020).
In Indonesian, a review ("ulasan", from the word "ulas") can also be called "kupasan", "tafsir", or commentary. According to the Big Indonesian Dictionary, a review is a response to an event. Reviews of a product or application are important because most users tend to look at previous user reviews (Onantya et al., 2019).

Research Method
This research uses the Knowledge Discovery in Database (KDD) method to conduct sentiment analysis on Gojek Application reviews on the Google Play Store and to compare the Support Vector Machine and Naïve Bayes algorithms. The KDD method comprises several processes, namely data selection, pre-processing, transformation, text mining, and evaluation (Alam et al., 2022). The stages of the KDD process can be seen in Figure 1.

Data Selection
The collected data is first selected. Selecting the data ensures that subsequent processing aligns with the research objectives to be achieved (Alfiqra, 2018). The results of data selection are used in the next step.

Pre-processing
The pre-processing stage is a very important step that cleans the data that is the focus of KDD. The cleaning process includes removing duplicated data, checking for inconsistent data, and correcting errors in the data, such as typographical errors (Wahyuni, 2018). The steps in pre-processing include cleansing, case folding, tokenizing, normalization, stopword removal, and stemming (Rosid et al., 2020).

Transformation
Transformation is a crucial step in converting data into vector form to facilitate the data mining process. This step utilizes the Term Frequency-Inverse Document Frequency (TF-IDF) method. TF-IDF is a technique employed to assess the relevance of words (terms) to documents by assigning weights to individual words based on their significance (Simatupang & Utomo, 2019).
Inverse Document Frequency (IDF) is computed for each term. Within a collection of documents, IDF assigns a value to each term based on how many documents it occurs in: the more documents mention a term, the lower its IDF value (Yutika & Faraby, 2021).
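The weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this research: the function name `tf_idf`, the base-10 logarithm, the unsmoothed IDF, and the toy documents are all assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # IDF shrinks as a term appears in more documents (assumed log10 variant).
    idf = {term: math.log10(n_docs / count) for term, count in df.items()}
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Normalized term frequency multiplied by IDF.
        weights.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return weights


# Hypothetical, already pre-processed reviews.
docs = [
    ["aplikasi", "bagus", "cepat"],
    ["aplikasi", "mahal", "kecewa"],
    ["driver", "cepat", "bagus"],
]
weights = tf_idf(docs)
```

In the second toy document, "mahal" occurs in only one document and therefore receives a larger weight than "aplikasi", which occurs in two.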

Text Mining
Text mining involves the process of analyzing text to uncover pertinent information by applying data mining principles and techniques to detect patterns within the text and extract meaningful insights to serve a particular objective. An alternate definition of text mining is "extracting textual data" with the aim of identifying words and effectively conveying the content contained within documents. Typically, text-based data sources consist of text documents (Simatupang & Utomo, 2019). The text mining step in this research compares the Support Vector Machine algorithm and the Naïve Bayes algorithm.

Support Vector Machine
Support Vector Machine (SVM) is a powerful classification technique that establishes a decision boundary between two classes, allowing labels to be predicted from one or more feature vectors. SVM works by finding a hyperplane that separates the data of the two classes; the data points nearest to the hyperplane are called support vectors, and the margin is the distance between the support vectors and the hyperplane. When applied to linear data, SVM generalizes well even with limited training data and can handle high-dimensional input spaces effectively (Huang et al., 2018). The SVM chooses the hyperplane that gives the greatest separation distance between one class and another (Handayanto et al., 2021).
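To make the idea of a margin-maximizing hyperplane concrete, the following sketch trains a linear SVM by sub-gradient descent on the regularized hinge loss. It is an illustration under stated assumptions, not the solver used in this research: the names `train_linear_svm` and `predict`, the hyperparameters `lam`, `lr`, and `epochs`, and the two-feature toy data are all hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit weights w and bias b by sub-gradient descent on the hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Inside the margin or misclassified: move the hyperplane.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Outside the margin: only apply the regularization shrinkage.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy, linearly separable data: +1 = positive review, -1 = negative review.
X = [[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
```

In practice the feature vectors would be the TF-IDF weights of each review, and a library solver would normally replace this hand-written loop.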

Naïve Bayes
Based on probability calculations, Naïve Bayes can predict the likelihood that a data tuple belongs to a specific class. This approach has the benefit of requiring only a small amount of training data to estimate the parameters needed for classification (Handayanto et al., 2021).
In this study, evaluation uses a confusion matrix, a tool for gauging the effectiveness of the resulting classification model. The confusion matrix compares the predicted classes with the actual classes of the data.
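The Naïve Bayes classification described in this subsection can be sketched as a multinomial model with add-one (Laplace) smoothing. This is a simplified illustration that works on raw word counts rather than TF-IDF weights; the function names `train_nb` and `predict_nb` and the toy reviews are assumptions, not part of this research's implementation.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect class counts, per-class word counts, and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {t for tokens in docs for t in tokens}
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest log posterior for the tokens."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        total_words = sum(word_counts[label].values())
        # Log prior plus summed log likelihoods with add-one smoothing.
        score = math.log(count / n_docs)
        for t in tokens:
            if t in vocab:  # words never seen in training are skipped
                score += math.log(
                    (word_counts[label][t] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, already pre-processed training reviews.
docs = [["bagus", "cepat"], ["mantap", "bagus"],
        ["mahal", "kecewa"], ["susah", "kecewa"]]
labels = ["positive", "positive", "negative", "negative"]
model = train_nb(docs, labels)
```

Calling `predict_nb(model, ["bagus", "mudah"])` on this toy model returns `"positive"`, because "bagus" occurs only in positive training reviews and the unseen word "mudah" is ignored.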

Results and Analysis
This section presents the results of the research conducted to determine user sentiment in Gojek App reviews on the Google Play Store, using the Knowledge Discovery in Database (KDD) process and comparing the Support Vector Machine and Naïve Bayes algorithms.

Data Selection
Gojek Application reviews were collected from the Google Play Store website by scraping with the google-play-scraper library. A total of 20,000 Gojek Application reviews were retrieved, dated from 8 January 2023 to 30 March 2023. The attributes taken from the review data are username, score, at, and content. The data was then labeled positive or negative. Neutral data is not used in this research, so that the analysis focuses on positively and negatively labeled reviews and avoids the bias in neutral data. There are 12,914 positively labeled reviews, 6,241 negatively labeled reviews, and 845 neutral reviews that are not used.
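The labeling and filtering step can be illustrated as follows. The labeling rule is not stated in this paper, so the mapping below (scores 4 to 5 positive, 1 to 2 negative, 3 neutral) is purely an assumption for the sketch, as are the function name `label_review` and the sample records.

```python
def label_review(score):
    """Map a 1-5 star score to a sentiment label (hypothetical rule)."""
    if score >= 4:
        return "positive"
    if score <= 2:
        return "negative"
    return "neutral"


# Hypothetical records using the attributes named in the text.
reviews = [
    {"username": "user_a", "score": 5, "at": "2023-01-08", "content": "mantap"},
    {"username": "user_b", "score": 3, "at": "2023-02-14", "content": "lumayan"},
    {"username": "user_c", "score": 1, "at": "2023-03-30", "content": "kecewa"},
]

# Attach a label to each review, then drop the neutral ones.
labeled = [dict(r, label=label_review(r["score"])) for r in reviews]
kept = [r for r in labeled if r["label"] != "neutral"]
```

Only the positive and negative records in `kept` would be passed on to pre-processing.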

Pre-processing
The review data obtained by scraping the Gojek application is text stored in CSV format. This data is not well-structured and contains a lot of noise: it still includes punctuation marks, numbers, emoticons, symbols, and slang words. Therefore, pre-processing steps are necessary to remove characters other than letters, reduce vocabulary variation, and convert slang words into standard words to make the data more structured. The following are the pre-processing steps:
1) The cleansing stage aims to remove symbols, punctuation marks, numbers, links, and emoticons. An example of the implementation of the cleansing stage can be seen in Table 1.
2) The case folding stage changes uppercase letters into lowercase letters in the text document. An example of the implementation of the case folding stage can be seen in Table 2.
3) The tokenization stage aims to break sentences in the text document into individual words, known as tokens, which are separated by spaces or special characters.
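The three stages listed above can be sketched with Python's standard `re` module. The regular expressions and the function names (`cleanse`, `case_fold`, `tokenize`, `preprocess`) are assumptions chosen for the example; normalization, stopword removal, and stemming are left out because they depend on language-specific dictionaries.

```python
import re


def cleanse(text):
    """Remove links, then every character that is not a letter or space."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # Collapse the runs of spaces left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


def case_fold(text):
    """Convert all letters to lowercase."""
    return text.lower()


def tokenize(text):
    """Split the text into tokens on whitespace."""
    return text.split()


def preprocess(text):
    """Apply cleansing, case folding, and tokenization in order."""
    return tokenize(case_fold(cleanse(text)))


tokens = preprocess("Aplikasi BAGUS banget!!! 👍 cek https://example.com 100%")
# tokens == ["aplikasi", "bagus", "banget", "cek"]
```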

Text Mining
At this step, the data is classified using the two algorithms being compared, Support Vector Machine (SVM) and Naïve Bayes. The data is first divided into 90% training data and 10% testing data, giving 17,240 training records and 1,915 testing records. The classification process using the linear kernel in SVM can be seen in Figure 5, and the classification process using the Naïve Bayes algorithm can be seen in Figure 6.
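The split sizes follow from the 19,155 labeled reviews (12,914 positive plus 6,241 negative). A minimal sketch of such a split is shown below; the function name `split_90_10`, the fixed seed, and the choice to round the test set down are assumptions made for the illustration.

```python
import random


def split_90_10(items, seed=42):
    """Shuffle a copy of the items and hold out one tenth for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = len(shuffled) // 10  # integer division: 19,155 // 10 = 1,915
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the 19,155 positive and negative reviews.
train, test = split_90_10(range(19155))
# len(train) == 17240 and len(test) == 1915
```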
Next, the Support Vector Machine formula is implemented with W as the weight vector; two weights are used in this calculation. With W1 = positive and W2 = negative, an input X = (X1, X2), and W0 as an additional weight, the hyperplane equation can be written in the usual linear form:
$$W_1 X_1 + W_2 X_2 + W_0 = 0$$
Table 9 shows the TF-IDF results for several reviews that will be used in the Support Vector Machine calculation.
The words that appear most often in the word cloud of positive reviews are 'bagus' (good), 'mantap' (great), 'bantu' (help), 'mudah' (easy), 'good', and 'cepat' (fast). Meanwhile, the words that often appear in negative reviews are 'mahal' (expensive), 'kecewa' (disappointed), 'saldo' (balance), 'susah' (difficult), and others.
The evaluation stage is the final step in the KDD process, where the aim is to measure the performance of the implemented model. This is accomplished using a confusion matrix with parameters such as accuracy, precision, recall, and F1-score.
The evaluation and confusion matrix results are obtained using a linear kernel in the Support Vector Machine (SVM) algorithm with a 90:10 split; for the Naïve Bayes algorithm, the data is also divided 90:10. These results can be seen in Figure 3. The best classification result is achieved by the Support Vector Machine algorithm, which yields an accuracy of 90%, a recall of 94%, a precision of 91%, and an F1-score of 94%, with 1,218 positive class data correctly predicted and 500 negative class data correctly predicted. The comparison of accuracy, recall, precision, and F1-score between the Support Vector Machine and Naïve Bayes models can be seen in Table 11.
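For reference, the four evaluation measures are derived from the cells of a binary confusion matrix as sketched below. The counts passed in are illustrative only and are not the confusion matrix of this research; the function name `classification_metrics` is likewise an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative counts: 80 true positives, 10 false positives,
# 20 false negatives, 90 true negatives.
m = classification_metrics(tp=80, fp=10, fn=20, tn=90)
```

Because the F1-score is the harmonic mean of precision and recall, it always lies between those two values.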
Figure 1. Knowledge Discovery in Database (KDD) Process

Table 1. The Implementation of Cleansing

Table 2. The Implementation of Case Folding

Table 3. The Implementation of Tokenizing

Table 4. The Implementation of Normalization

Table 5. The Implementation of Stopword Removal

Table 6. The Implementation of Stemming

Table 7. Some Documents for TF-IDF Calculation

After calculating the document frequencies, the next step is to calculate the Inverse Document Frequency (IDF). The TF-IDF values are then computed by multiplying the normalized term frequencies by the IDF for each document. Table 8 displays the TF-IDF calculation results.

Table 8. The Result of TF-IDF

Table 9. TF-IDF Results from Several Reviews

Figure 2. The Word Cloud of Positive Reviews (left); The Word Cloud of Negative Reviews (right)

Table 11. Comparison Results Between SVM and Naïve Bayes