Article

Improving Minority Class Recall through a Novel Cluster-Based Oversampling Technique

by Takorn Prexawanprasut and Thepparit Banditwattanawong *
Department of Computer Science, Kasetsart University, Krung Thep Maha Nakhon 10900, Thailand
* Author to whom correspondence should be addressed.
Informatics 2024, 11(2), 35; https://doi.org/10.3390/informatics11020035
Submission received: 28 February 2024 / Revised: 19 April 2024 / Accepted: 20 May 2024 / Published: 28 May 2024
(This article belongs to the Section Machine Learning)

Abstract

In this study, we propose an approach to address the pressing issue of false negative errors by enhancing minority class recall within imbalanced data sets commonly encountered in machine learning applications. Through the utilization of a cluster-based oversampling technique in conjunction with an information entropy evaluation, our approach effectively targets areas of ambiguity inherent in the data set. An extensive evaluation across a diverse range of real-world data sets characterized by inter-cluster complexity demonstrates the superior performance of our method compared to that of existing oversampling techniques. Particularly noteworthy is its significant improvement within the Delinquency Telecom data set, where it achieves a remarkable increase of up to 30.54 percent in minority class recall compared to the original data set. This notable reduction in false negative errors underscores the importance of our methodology in accurately identifying and classifying instances from underrepresented classes, thereby enhancing model performance in imbalanced data scenarios.

1. Introduction

Imbalanced learning presents a significant challenge in machine learning when data sets exhibit a substantial skew in class distribution, with one class predominating while others are relegated to minority status. Traditional machine learning algorithms, when applied to imbalanced data sets, often yield predictions biased toward the majority class due to its larger volume, frequently resulting in suboptimal models [1]. Effectively addressing classification in this context requires confronting the “majority rule” paradigm: because proficient learning depends on an extensive data set, models trained on imbalanced data tend to favor the majority class and carry that bias into their outcomes. Countering this bias entails generating densely populated instances that accurately represent the essence of each class concept, ensuring adequate representation in the relevant feature space and thereby facilitating the differentiation between various sample unit types.
To the best of our knowledge, the existing classification methods, while proficient in understanding minority class concepts within imbalanced data sets, have not prioritized improving the recall values for minority classes. In classification, it is crucial to acknowledge the reciprocal relationship between minority class recall and precision. Minority class recall signifies the model’s ability to correctly identify instances of the minority classes, while precision measures the proportion of correctly classified positive predictions among all positive predictions made by the model. Boosting recall, especially for minority classes, enhances the model’s capacity to capture more instances of these classes. However, this increase in recall may lead to lower precision, as the model may also classify more false positives. Therefore, achieving an optimal balance between recall and precision is a critical consideration in developing models for imbalanced data sets, and it is the focal point of this study.
The significance of this challenge resonates widely across diverse fields, illustrating its profound implications. Beyond mere quantitative metrics, the ramifications of minority class recall and false negatives translate into tangible outcomes, underscoring the imperative of finding effective solutions. Our study examines these implications at the level of fundamental principles, spanning computational science and practical application. For instance, in fraud detection—a critical endeavor where the detection of deceit is paramount—insufficient recall rates for the minority class, as explicated by Chandola et al. [2], may result in undetected anomalies and consequential financial losses. These implications extend beyond mere algorithmic frameworks, profoundly affecting the financial well-being of both individuals and organizations. Similarly, in medical diagnostics, the imperative of minimizing false negatives, as emphasized by Patel et al. [3], is of utmost importance. In this context, the accuracy of predictive models significantly influences patient outcomes, and instances of false negatives may lead to missed opportunities for timely intervention, thereby adversely impacting patient welfare and prognostic outcomes.
This challenge extends across the finance and healthcare domains, as evidenced by research such as Doe et al. [4]. Additionally, it encompasses the classification of legal offenses, as demonstrated by Prexawanprasut et al. [5]. In the latter, the focus was on predicting recidivism using offender information and attitude scores, categorizing individuals into class 0 for first-time offenders and 1 for recidivists. It is notable that conventional imbalanced models showed suboptimal performance in predicting class 1, with reduced precision and recall scores, leading to numerous false negative predictions, where individuals prone to recidivism were inaccurately classified as non-recidivists.
Our proposed algorithm’s primary contribution lies in its strategic prioritization of augmenting minority class recall while maintaining specificity. This emphasis hinges upon specific parameter settings, the comprehensive establishment of which for real-world contexts is pivotal for ensuring robust performance. Our algorithm offers a novel approach that addresses the practical implications of imbalanced learning, transcending technical intricacies to align machine learning innovations with practical applications. Notably, by employing a targeted approach to generating synthetic samples within individual clusters, we mitigate the risk of introducing extraneous noise in inter-cluster regions, thereby enhancing the algorithm’s effectiveness. Our contribution advocates for the adoption of this cluster-based oversampling technique as a strategic intervention to effectively address imbalanced learning challenges and bridge the gap between machine learning advancements and their real-world implementation.

2. Literature Review

In the realm of imbalanced data classification, this literature review focuses on three key categories of approach: conventional resampling methods, cost-sensitive learning methods, and ensemble methods. These categories encompass techniques that address challenges arising from skewed class distributions and disparate misclassification costs. Our selection of papers within each category aims to capture seminal and contemporary works, providing a comprehensive exploration of strategies to tackle imbalanced data sets. Notably, the chosen papers inspired and informed the approach proposed in our study, enhancing our novel solution.

2.1. Conventional Resampling Method

Various oversampling techniques have been recognized as valuable strategies for addressing imbalances within data sets. These methods were designed to alleviate class imbalances by introducing synthetic samples, thereby augmenting the representation of the minority class. This augmentation aimed to empower learning algorithms to more effectively capture crucial patterns and characteristics inherent in the minority class, resulting in improved predictive performance and heightened recall for the minority class. The scholarly landscape has witnessed the proposition of numerous oversampling methods, among which SMOTE (Synthetic Minority Oversampling Technique), introduced by Chawla et al. [6], stands out as one of the pioneering techniques. SMOTE operates by generating synthetic samples through interpolation between neighboring instances of the minority class, adhering to a predefined formula.
$$x_{\text{new}} = x_i + \rho \, (x_j - x_i)$$
Here, $x_{\text{new}}$ represents the synthetic sample, $x_i$ is the selected minority instance, $\rho$ is a random number between 0 and 1, and $x_j$ is a neighboring minority instance of $x_i$. Figure 1 illustrates the process employed by SMOTE for generating new samples. The first panel portrays a scenario where the majority class is represented by circles and the minority class by stars, with the two classes distinctly separated. In the second panel, SMOTE synthesizes new samples by interpolating between original minority samples, extending into the inter-cluster region. However, this approach introduces a potential drawback, as it may lead the classifier to overlook sub-concepts within the sub-minority groups of the minority class.
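The interpolation step itself is a one-liner; the following minimal NumPy sketch illustrates the formula above. It is not the full SMOTE procedure (which also selects $x_j$ among the k nearest minority neighbors), and the function name is ours.

```python
import numpy as np

def smote_interpolate(x_i, x_j, rng=None):
    """Create one synthetic sample on the line segment between a minority
    instance x_i and one of its minority-class neighbors x_j."""
    rng = rng or np.random.default_rng()
    rho = rng.uniform(0.0, 1.0)        # random position on the segment
    return x_i + rho * (x_j - x_i)     # x_new = x_i + rho * (x_j - x_i)

# Two 2-D minority instances: the synthetic point falls between them
print(smote_interpolate(np.array([1.0, 2.0]), np.array([2.0, 3.5])))
```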
SMOTE, while demonstrating promising outcomes in enhancing minority class recall, exhibits shortcomings in generating noisy samples and oversimplifying decision boundaries in specific scenarios [7]. In response to these limitations, the Support Vector Machine Synthetic Minority Oversampling Technique (SVMSMOTE) was introduced, amalgamating the principles of SMOTE with Support Vector Machines (SVMs) [8]. SVMSMOTE employs SVM to discern the samples with the highest discriminative value, generating synthetic samples proximate to the authentic decision boundary; this notably augments the diversity and quality of the synthetic samples [9]. Empirical evidence supports the superior performance of SVMSMOTE, especially in scenarios characterized by complex and overlapping class distributions. Various authors have proposed optimized strategies for leveraging SVMSMOTE to enhance classification performance on imbalanced data sets. For example, Gao et al. [10] augmented the effectiveness of the SVM and SMOTE combination, introducing optimization techniques tailored to this specific problem. In a distinct approach, Xie et al. [11] elevated classification performance by adapting SVM through a sub-sampling technique. Han et al. [12] defined informative instances, delineated by SVM hyperplanes, to outline the region in which new samples are synthesized, while García et al. [13] applied SVM-based assessment to the definition of credit risk, thereby contributing significantly to this domain.
Another noteworthy oversampling method, proposed by Han et al. [14], is BorderlineSMOTE. BorderlineSMOTE specifically focuses on synthesizing samples in proximity to the decision boundary, drawing from borderline instances that are more susceptible to misclassification. By prioritizing the generation of samples from these pivotal instances, BorderlineSMOTE enhances the generalization and predictive performance of minority class recall [15]. It has demonstrated effectiveness in managing data sets characterized by overlapping classes and varying densities [16].
In addition to these methodologies, various other oversampling techniques have been introduced. ADASYN (Adaptive Synthetic Sampling) dynamically adjusts the density distribution of the minority class during synthetic sample generation, with a specific emphasis on challenging instances [17]. Conversely, MWMOTE (Majority-Weighted Minority Oversampling Technique) [18] assigns weights to majority class instances based on their proximity to minority class instances, ensuring a balanced contribution in generating synthetic samples. Ensemble techniques, such as SMOTEBoost [19], RUSBoost [20], and BalanceCascade [21], amalgamate multiple oversampling strategies and may incorporate under-sampling to produce a diverse array of synthetic samples. These ensemble methods are designed to enhance the overall minority class recall performance and effectively address the challenges posed by imbalanced data sets.
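All of these oversamplers are implemented in the Imbalanced-learn library used later in this study (Section 4.2); as an illustrative sketch on a hypothetical synthetic data set, they share a common `fit_resample` interface (class names as in current Imbalanced-learn releases):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, SVMSMOTE, BorderlineSMOTE, ADASYN

# Hypothetical imbalanced data set: roughly 5% minority class
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

for sampler in (SMOTE(), SVMSMOTE(), BorderlineSMOTE(), ADASYN()):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```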

2.2. Extreme Learning Machine and Cost-Sensitive Learning Method

Cost-sensitive learning was devised to account for the costs associated with misclassification and can be divided into two categories: directly cost-sensitive methods and indirectly cost-sensitive methods. Direct methods construct a cost-sensitive learning algorithm by integrating diverse misclassification costs into the learning process itself, which necessitates adapting traditional machine learning algorithms to the costs linked with various types of misclassification. Indirect methods, in contrast, address costs by altering the training and evaluation processes around an otherwise unmodified learner; common techniques include adjusting class weights [7], utilizing alternative evaluation metrics such as precision, recall, or the F1-score [22,23], and fine-tuning decision thresholds [24,25]. Indirect cost sensitivity can also be achieved by preprocessing the training data according to a predefined set of rules before a classifier is constructed.
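As a concrete illustration of the indirect techniques just listed, class weights and decision thresholds can be adjusted around an unmodified scikit-learn estimator. The weight of 10 and the threshold of 0.3 below are arbitrary illustrative choices, not recommended settings:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Class weighting: penalize minority-class errors 10x more heavily
clf = LogisticRegression(class_weight={0: 1, 1: 10}).fit(X_tr, y_tr)

# Threshold tuning: lower the positive-class cutoff from 0.5 to 0.3
proba = clf.predict_proba(X_te)[:, 1]
y_pred = (proba >= 0.3).astype(int)
print("minority class recall:", recall_score(y_te, y_pred))
```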
The efficacy of the extreme learning machine (ELM) became apparent through its rapid learning and robust generalization capabilities in training single hidden-layer feedforward neural networks. However, the pervasive challenge of imbalanced classification across diverse fields significantly undermined classifier performance. In response to this concern, a novel approach, known as Parallel One-Class ELM (P-ELM) [26], was introduced, grounded in Bayesian methodology. Within the framework of P-ELM, the training data set underwent segmentation into k components, aligned with specific class affiliations. Subsequently, these segmented data sets were directed to individual kernel-based One-Class ELM classifiers. By employing probability density estimation based on the output function of these classifiers, the proposed method facilitated the direct determination of sample class assignments using Bayesian analysis. The efficacy of the P-ELM approach was comprehensively evaluated against alternative class imbalance learning methods, encompassing a range of benchmark data sets, spanning binary and multiclass classification contexts. Additionally, the application of P-ELM in a real-world scenario, specifically the diagnosis of blast furnace status, was considered. Empirical results convincingly underscored the pronounced effectiveness of the P-ELM methodology.
In the context of addressing imbalanced classification, the study introduced an innovative extension to the extreme learning machine (ELM) framework through a class-specific cost-sensitive mechanism [27]. This novel approach adapted misclassification costs according to the importance of each class, effectively mitigating the challenges posed by imbalanced data sets. By augmenting classification performance, the methodology retained relevance for researchers aiming to devise solutions for imbalanced classification problems, employing ELMs in conjunction with customized cost-sensitive strategies. Unlike conventional cost-sensitive methods, which often required intricate parameter tuning and might have suffered from scalability issues, ELM distinguished itself due to its efficient and rapid training process, enabling it to handle large-scale imbalanced data sets more effectively.

2.3. Ensemble Method

Ensemble methods are a popular technique for improving the performance of machine learning models on imbalanced data sets. Ensemble methods combine multiple models to improve the accuracy and robustness of predictions. There are many variations of ensemble methods, such as Random Forest, AdaBoost, and XGBoost, which have been applied to imbalanced data sets with varying degrees of success. In general, ensemble methods can be effective for improving the performance of machine learning models on imbalanced data sets, but their effectiveness depends on the specific problem and data set at hand.
Random Forest [28] is a popular machine learning algorithm used for classification, regression, and other tasks. It is an ensemble learning method that combines multiple decision trees to make predictions. In a Random Forest, a large number of decision trees are created, each trained on a randomly sampled subset of the training data, with a random subset of the features considered at each split. These decision trees are then combined to form a forest, where each tree’s output is weighted equally in the final prediction. The random sampling of training data during training is known as bootstrap aggregating, or “bagging”. Random Forests are robust against overfitting because the individual trees are built on different subsets of the features and the training data; this reduces the chance that any one decision tree will overfit the training data and make inaccurate predictions on new, unseen data. Additionally, Random Forests can handle a large number of input features, including both categorical and continuous data.
AdaBoost [29] and XGBoost [30] have been prominent ensemble learning algorithms, widely utilized for classification and regression tasks, each characterized by distinctive approaches. AdaBoost, known as “Adaptive Boosting”, constructs a robust classifier by amalgamating multiple weak classifiers, each trained on a subset of the training data, with weights assigned based on accuracy. It iteratively incorporates new weak classifiers, prioritizing misclassified data, until reaching a specified iteration count or a predetermined accuracy threshold. Conversely, XGBoost, or “eXtreme Gradient Boosting”, employs decision trees as base learners and introduces a unique approach to their weighting and regularization. In XGBoost, each decision tree is trained to minimize a regularized objective function, balancing accuracy and complexity, while “gradient boosting” adjusts weights based on the gradient of the loss function. Acknowledged for scalability and speed, particularly with large data sets and high-dimensional feature spaces, XGBoost encompasses advanced features like parallel processing and GPU acceleration, rendering it well-suited for intricate machine learning tasks. The choice between AdaBoost and XGBoost is contingent on specific problem requirements, considering trade-offs among accuracy, speed, and interpretability.
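As a brief illustration, both learners can be configured for imbalanced binary problems in a few lines of scikit-learn and XGBoost code. The parameter values below are arbitrary examples; `scale_pos_weight` is commonly set near the majority-to-minority count ratio:

```python
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier  # assumes the xgboost package is installed

# AdaBoost: iteratively re-weights misclassified training points
ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=0)

# XGBoost: scale_pos_weight raises the loss contribution of the positive
# (minority) class; ~9.0 would suit a 90/10 class split
xgb = XGBClassifier(n_estimators=200, max_depth=4, scale_pos_weight=9.0)
```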
Beyond these boosting algorithms, various other ensemble methods have been explored to improve the accuracy and robustness of machine learning models. Shao-Hua Sun et al. [31] introduced a novel ensemble method, termed Deep Ensemble, which combines multiple deep neural networks trained from distinct random initializations. The paper centered on the analysis of the loss landscape of deep neural networks, demonstrating that Deep Ensemble has the capacity to enhance both the accuracy and robustness of final predictions. A new approach for combining predictions from the individual networks within the ensemble was proposed, involving a weighted combination of softmax outputs, with weights determined from the geometric median of predictions. The geometric median, a robust estimator less sensitive to outliers than the arithmetic mean, was shown to improve both the accuracy and robustness of the final predictions.

2.4. Cluster-Based Algorithm

Several inventive methods have been suggested to address the issues of class imbalance and skewed classification, utilizing cluster-based algorithms. These algorithms employ clustering techniques to handle imbalanced data sets by grouping similar data points together based on specific criteria. One such method involves decomposing the problem to reduce sub-problem complexity by strategically minimizing the number of probability distributions within them. By using a clustering evaluation algorithm, optimal sub-problem numbers are determined, and weighted kernelized extreme learning machines (WKELMs) [32] are employed for classifier creation, resulting in improved predictive accuracy. Comparative assessments against existing methods show the superior performance of this ensemble technique, highlighting its potential to effectively address imbalanced classification issues.
KNSMOTE [33] is a conceptual algorithm that combines SMOTE and k-means to handle imbalanced medical data classification. It identifies “safe samples” and generates synthetic samples through linear interpolation, adjusting oversampling ratios based on data set imbalance. This approach yields notable improvements in sensitivity and specificity indexes when applied to medical data sets with the Random Forest algorithm.
The LR-SMOTE algorithm [34] aims to alleviate imbalanced data set challenges encountered in machine learning classification, particularly in the loose particle detection of sealed electronic components. By enhancing the traditional SMOTE algorithm, LR-SMOTE generates samples closer to the sample center, aiming to mitigate outlier samples or altered data set distributions. Experimental validation demonstrates its superior performance over SMOTE in various metrics, like G-means, F-measure, and AUC.
Leveraging density peak clustering strengths, the adaptive weighted oversampling method [35] introduces a novel approach to address classification challenges. Unlike conventional methods, density peak clustering accurately identifies sub-clusters within minority instances with varying sizes and densities. This technique effectively handles between-class and within-class imbalance issues. The adaptive determination of sub-cluster sizes and oversampling probabilities ensures the generation of synthetic minority instances tailored to data set characteristics. Moreover, a heuristic filtering strategy prevents overlap, enhancing the method’s robustness.
In the pattern recognition and data mining domain, Guzmán-Ponce et al. [36] tackled class overlap and imbalance challenges. Their approach combines a two-stage under-sampling technique using DBSCAN clustering and a minimum spanning tree algorithm. By simultaneously addressing both complexities, the algorithm enhances classifier efficacy, as demonstrated in extensive experimental evaluations.
Class-imbalanced data sets pose challenges in various domains like health and security. A novel hybrid approach [37] reduces majority class dominance through class decomposition and increases minority class instances using oversampling. Unlike traditional methods, this approach preserves majority class instances while significantly reducing dominance, leading to a more balanced data set and improved results. Extensive experiments validate the effectiveness of the proposed methods, contributing to the advancement of class-imbalanced data set handling techniques across diverse domains.

3. Proposed Algorithm

Our proposed oversampling algorithm, named ClusterOversampleG-Mean (COG), integrates several crucial components aimed at improving minority class recall in imbalanced data sets. It combines clustering and resampling techniques, customizing treatment for individual clusters while minimizing the synthesis of instances across clusters. Additionally, it determines specific majority and minority classes within each cluster. The optimization process prioritizes maximizing the G-mean, as this aligns with our objective of enhancing minority class recall primarily by minimizing false negatives, without directly optimizing precision. Consequently, the F1-score is not utilized as the optimization target, since it does not focus solely on reducing false negatives: being the harmonic mean of precision and recall, it balances both measures, and in situations with significant class imbalance where the focus is primarily on minority class recall, it may not fully capture the desired trade-off. Unlike methods that rely on kernel functions or cost-sensitive learning, COG employs a clustering approach to facilitate oversampling. This is because the sparsity of minority classes often hampers SVM hyperplanes from accurately identifying their regions, and adjusting kernel functions at this stage may not effectively guide sample synthesis to the appropriate areas. By breaking down the minority class problem into clustered instances, COG simplifies the task of the SVM kernel function, enabling precise resampling where it is needed most. This approach allows classifiers to effectively capture minority classes, resulting in an overall increase in recall values.
The COG algorithm operates on the premise that imbalance issues manifest differently across various data clusters, aiming to tackle these challenges through a two-step approach. Initially, the algorithm clusters the imbalanced data set to identify distinct regions, where each cluster may encompass both majority and minority class members, reflecting the localized nature of imbalance. Subsequently, oversampling techniques are applied independently to each cluster, serving to minimize the creation of new samples in inter-cluster regions while allowing each cluster to define its own majority and minority classes. This localized oversampling fosters the emergence of new concepts within each cluster, providing the classifier with a clearer understanding of class concepts. The oversampling process iterates until the algorithm yields the highest G-mean value, thereby optimizing classifier performance in imbalanced data sets by addressing varied manifestations of imbalance across different data clusters.
In contrast to traditional oversampling techniques that uniformly generate synthetic samples throughout the data set, the COG algorithm stands out by initially partitioning samples into clusters prior to resampling. This methodology prevents the creation of new samples within inter-cluster regions, thereby improving the discernibility of sub-concepts within each cluster and aiding the final classifier in identifying sample units more effectively within sub-regions. By focusing on synthesizing samples within individual clusters, COG significantly mitigates the risk of introducing irrelevant noise across inter-cluster regions, resulting in a clearer understanding of the distinctive class concepts inherent in each cluster. Furthermore, by allowing each cluster to autonomously define its majority and minority class boundaries, COG adapts to the unique characteristics of each data cluster, surpassing conventional cost-sensitive learning methods that apply a uniform global strategy. This tailored approach enhances the algorithm’s effectiveness in handling intricate class imbalance scenarios, leading to improved model performance, characterized by substantially heightened overall recall values. Figure 2 illustrates the conceptualization of the proposed algorithm.
The COG algorithm starts by dividing the data set, D, into a training set, S, and a test set, T. It then builds the first model, M1, on the training set and computes its G-mean on the test set, referred to as the initial G-mean. The algorithm then divides the training data into clusters to study the data concepts in each sub-group. For each cluster, S[i], the algorithm calculates the internal imbalance ratio and performs SMOTE with a predefined oversampling ratio, П. It builds a new classifier from the synthesized samples and assesses its performance by comparing the new G-mean with the initial G-mean. If the performance improves, the oversampled instances are recorded, and the updated G-mean is carried into the next round. The algorithm then increases the oversampling ratio and repeats each step. Finally, it builds the classifier, M3, from the data set with the highest G-mean, which serves as the final decision model. The iterative nature of the COG algorithm inherently embodies a form of self-parameter tuning: through its iterations, the algorithm continually refines its parameters, specifically aiming to maximize the G-mean in each round. The proposed algorithm is shown below (Algorithm 1).
Algorithm 1: ClusterOversampleG-Mean (COG) Algorithm
Input: Original imbalanced data set (D), desired number of clusters for data partitioning (n), cluster index used during oversampling (i), termination imbalance ratio for oversampling (Γ), initial oversampling ratio (П)
Output: A classifier obtained from the algorithm (M3), the algorithm’s performance (G3)
1: S, T ← PartitionDataSet(D)
2: M1 ← BuildClassifier(S)
3: G1 ← EvaluateClassifier(M1, T)
4: Clusters ← KMeans(S, n)
5: For each Cluster[i] in Clusters:
6:     Δ ← CalculateIR(Cluster[i])
7:     П ← 0.1
8:     While Δ < Γ:
9:         SyntheticSamples ← GenerateSyntheticSamples(Cluster[i], П)
10:        UpdatedCluster[i] ← CombineOriginalWithSynthetic(Cluster[i], SyntheticSamples)
11:        Δ ← UpdateImbalancedRatio(UpdatedCluster[i])
12:        Φ ← CreateTemporaryDataSet(S, Cluster[i], UpdatedCluster[i])
13:        M2 ← BuildModifiedClassifier(Φ)
14:        G2 ← EvaluateClassifier(M2, T)
15:        If G2 > G1:
16:            G1 ← G2
17:            S ← Φ
18:            П ← IncrementOversamplingRatio(П)
19:        EndIf
20:    EndWhile
21: EndFor
22: M3 ← BuildNotableClassifier(S)
23: G3 ← EvaluateClassifier(M3, T)
24: Return M3 and G3
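For concreteness, Algorithm 1 translates into the following Python sketch built on scikit-learn and Imbalanced-learn, the libraries listed in Section 4.2. This is a minimal approximation rather than the authors' reference implementation: the Random Forest base classifier, the skipping of single-class clusters, and the advancement of П after a rejected trial (which the listing leaves implicit) are our assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

def g_mean(y_true, y_pred):
    """sqrt(recall x specificity) for a binary 0/1 problem."""
    rec = recall_score(y_true, y_pred)                  # sensitivity
    spec = recall_score(y_true, y_pred, pos_label=0)    # specificity
    return np.sqrt(rec * spec)

def cog(X, y, n_clusters=4, gamma=1.0, step=0.1, seed=0):
    # Steps 1-3: split D into S/T, build M1, record the initial G-mean G1
    X_s, X_t, y_s, y_t = train_test_split(X, y, stratify=y, random_state=seed)
    g1 = g_mean(y_t, RandomForestClassifier(random_state=seed)
                .fit(X_s, y_s).predict(X_t))

    # Step 4: partition the training set into clusters
    labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(X_s)
    parts = {c: (X_s[labels == c], y_s[labels == c]) for c in range(n_clusters)}

    for c in range(n_clusters):                         # step 5
        X_c, y_c = parts[c]
        if len(np.unique(y_c)) < 2:
            continue                                    # single-class cluster
        counts = np.bincount(y_c)
        delta = counts.min() / counts.max()             # step 6: ratio delta
        pi = step                                       # step 7: pi = 0.1
        while delta < gamma and pi <= 1.0:              # step 8
            try:                                        # steps 9-11
                X_o, y_o = SMOTE(sampling_strategy=min(delta + pi, 1.0),
                                 random_state=seed).fit_resample(X_c, y_c)
            except ValueError:
                break                                   # too few minority samples
            delta = np.bincount(y_o).min() / np.bincount(y_o).max()
            # Step 12: temporary set = this cluster oversampled + the rest
            X_phi = np.vstack([X_o] + [parts[k][0] for k in parts if k != c])
            y_phi = np.concatenate([y_o] + [parts[k][1] for k in parts if k != c])
            m2 = RandomForestClassifier(random_state=seed).fit(X_phi, y_phi)
            g2 = g_mean(y_t, m2.predict(X_t))           # step 14
            if g2 > g1:                                 # steps 15-17: accept
                g1, parts[c] = g2, (X_o, y_o)
                X_c, y_c = X_o, y_o
            pi += step  # advance pi even on rejection so the loop terminates

    # Steps 22-24: final classifier M3 on the best training set found
    X_fin = np.vstack([parts[k][0] for k in parts])
    y_fin = np.concatenate([parts[k][1] for k in parts])
    m3 = RandomForestClassifier(random_state=seed).fit(X_fin, y_fin)
    return m3, g_mean(y_t, m3.predict(X_t))
```

Calling `cog(X, y)` on a binary 0/1 data set returns the final classifier M3 together with its test-set G-mean G3.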

4. Experimental Configuration

This section describes the experimental setting. It involves the preparation of data sets, experimental setup, and performance measures.

4.1. Data Collection and Preprocessing

A diverse set of ten real-world data sets was employed to evaluate the outcomes of this study systematically. This approach aimed to comprehensively assess the effectiveness of the proposed strategy in enhancing the recall value across various domains. Each data set was meticulously selected to represent a distinct domain and set of challenges. For instance, while the Juvenile Delinquency data set provided insights into the factors associated with future criminal activity among juvenile offenders, its significance was balanced with other data sets in the analysis. Similarly, the Delinquency Telecom and Lending Club data sets, sourced from Kaggle, offered valuable insights into telecommunications users’ payment practices and loan issuance patterns, respectively. However, the emphasis on these data sets was proportionate to their role in the broader evaluation context. Furthermore, a suite of additional data sets, including Credit Fraud, Bank Marketing, Happino, US Crime, Ecoli, Optical, and Yeast, were curated and processed with equal importance to ensure a comprehensive assessment of the proposed strategy’s performance and generalizability across diverse real-world scenarios. Each data set presented unique challenges, ranging from imbalanced class distributions to complex feature sets, requiring tailored preprocessing approaches. By rigorously testing our algorithm across these varied data sets, we aimed to provide a holistic evaluation of its effectiveness and suitability across different real-world applications. Before applying our algorithm to these data sets, we preprocessed them to ensure data quality and standardization. The resulting data sets and their distinct characteristics are listed in Table 1.

4.2. Analysis Software and Hardware

In the course of this investigation, classification models were developed using authentic training and test data sets within the Windows 10 operating system, employing an Intel Core i7 7500 CPU with 8 GB of RAM. The programming environment, consisting of Python version 3.6 and relevant libraries, facilitated the implementation of Random Forest (RF), decision tree (DT), and ID3 models. Addressing imbalanced data sets and detecting outliers were accomplished through the use of Imbalanced-learn version 0.4.3 and PyOD version 0.7.4, respectively. Additionally, the Scikit-learn library version 0.20.0 supported the implementation of algorithms based on distinct criteria: Gain Ratio (GR), Information Gain (IG), and GINI index (GI). Table 2 presents the hyper-parameter settings for classification models developed as part of the investigation. These hyper-parameter configurations are used as control variables to validate algorithm performance, ensuring the accuracy and reliability of the experimental results.

4.3. Selection of Baseline Methods

Our choice of baseline methods is motivated by several key factors, including their widespread adoption in the field of imbalanced data classification and their relevance to the specific challenges our algorithm aims to address. Additionally, we aim to focus specifically on oversampling methods, excluding cost-sensitive learning, extreme learning machines, or ensemble techniques. Cost-sensitive learning techniques adjust the misclassification costs for different classes to address class imbalance but may not directly address the issue through the oversampling of minority class instances. Extreme learning machines, while effective for various classification tasks, are not specifically designed to handle imbalanced data sets or address the challenges posed by minority class instances in the border regions. Ensemble techniques combine multiple base learners to improve classification performance but may not primarily focus on oversampling the minority class instances. Furthermore, our selection criteria prioritize methods that aim to address the challenges inherent in the border areas between minority and majority classes. The selection of SMOTE, SVMSMOTE, and BorderlineSMOTE as baseline methods for comparison is guided by their prominence in the literature, relevance to the challenges of imbalanced classification, and alignment with the specific objectives of our proposed algorithm. By comparing against these established methods, we aim to provide a comprehensive evaluation of the efficacy and performance of our cluster-based oversampling approach in addressing the identified shortcomings and improving decision-making in the border regions of minority and majority classes.

4.4. Performance Metric

To evaluate the performance of our proposed algorithm, we conducted extensive experiments using a set of benchmark data sets commonly used in imbalanced learning research. These data sets cover a range of domains and exhibit varying degrees of class imbalance. We compared the performance of our novel oversampling algorithm with state-of-the-art methods, including SMOTE, SVMSMOTE, and BorderlineSMOTE, using recall, specificity, G-mean, and F1-score as performance metrics that rely on the confusion matrix in Table 3.
A true positive (TP) or true negative (TN) is a data point that the algorithm correctly classified as positive or negative, respectively. In contrast, a false positive (FP) or false negative (FN) is a data point that the algorithm classified incorrectly. Because the problem is imbalanced, it is impossible to draw definitive judgments about the performance of the model from any single value in isolation.
Precision and recall are two measures used to assess the performance of classification and information retrieval systems. Precision is defined as the proportion of relevant examples among all instances retrieved. The proportion of all relevant instances that are successfully retrieved is referred to as recall, or “sensitivity”. When calibrating the performance of a model, it is frequently possible to increase the model’s precision at the expense of its recall, or vice versa.
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
The geometric mean, often known as the G-mean, is the root of the product of the class-wise sensitivities. This metric aims to maximize the accuracy on each class while preserving the equilibrium between their various degrees of accuracy. For binary classification, the G-mean is computed as the square root of the product of recall and specificity; for problems with many classes, it is the higher-order root of the product of each class’s sensitivity. By convention, the G-mean equals 0 if the classifier fails to recognize at least one of the classes.
$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
$$\text{G-mean} = \sqrt{\mathrm{Recall} \times \mathrm{Specificity}}$$
The F1-score is a measure of a model’s accuracy that considers both precision and recall, providing a single score that balances the trade-off between the two. It is particularly useful when dealing with imbalanced data sets, where one class may dominate the other. Both the F1-score and the G-mean capture the trade-off between type I and type II errors, with the F1-score focusing on precision and recall, while the G-mean focuses on sensitivity and specificity.
$$\text{F1-score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
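Since all four metrics derive from the confusion matrix in Table 3, they are easy to compute together; a small helper for the binary case (labels 0/1), with guards against empty denominators added for safety, is sketched below:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def imbalance_metrics(y_true, y_pred):
    """Recall, specificity, G-mean, and F1-score from a binary confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # minority class recall
    specificity = tn / (tn + fp) if tn + fp else 0.0
    g = np.sqrt(recall * specificity)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"recall": recall, "specificity": specificity,
            "g_mean": g, "f1": f1}
```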
Moreover, in our ongoing efforts to refine and understand the behavior of our algorithm, we employ information entropy as a crucial metric to scrutinize and control the ambiguity within the overlapping area of synthetic data sets. Information entropy quantifies the extent of overlap between distinct classes, especially in scenarios where decision boundaries are intricate, and instances exhibit shared characteristics. Utilizing the information entropy confirms the behavior of our algorithm in the presence of ambiguous regions, particularly within synthetic data sets. This strategic use of information entropy enhances our ability to systematically assess the algorithm’s performance, providing valuable insights into its adaptability and decision-making processes, especially when faced with challenging distinctions and ambiguous feature spaces. The formula for the information entropy H(X) of a discrete random variable X with possible outcomes x1, x2, …, xn and probability mass function P(X) is given by
$$H(X) = -\sum_{i=1}^{n} P(x_i)\,\log_2 P(x_i)$$
where P(x_i) is the probability of outcome x_i, and the sum is taken over all possible outcomes. The logarithm is usually taken with base 2, in which case the entropy is measured in bits; if the base is e (the natural logarithm), the entropy is measured in nats. Comparing the information entropy values before and after applying the proposed algorithm, a reduction in information entropy indicates a reduction in uncertainty and, consequently, a decrease in ambiguity in the identified area.
The term “Misclassified Minority Instances” denotes instances within the minority class that are situated in ambiguous regions and are erroneously classified by a baseline classifier, specifically a Support Vector Machine (SVM). These instances encapsulate scenarios wherein classifiers encounter challenges in accurately discerning minority class samples. To identify them, we first employ SVM to find the minority class instances that are misclassified, and then utilize K Nearest Neighbors (KNN) to identify the majority class instances surrounding these informative minority instances, thereby isolating the region as an example of ambiguous classification. The metric “Percent of Misclassified Minority Instances” for each data set is the proportion of these intricate instances within the minority class, providing insight into the classification challenges faced by the models. Simultaneously, “Information Entropy” serves as a quantitative measure of the degree of ambiguity inherent in the data sets; it is calculated specifically from the ambiguous area defined by KNN, offering a robust assessment of the uncertainty present in the data. Figure 3 illustrates the steps to identify instances for calculating the information entropy.
To calculate the information entropy, we first determine the proportion of misclassified minority instances, $P_{\text{minority}}$, and the proportion of correctly classified majority instances, $P_{\text{majority}}$, within the data set. These proportions represent the relative frequencies of each class within the ambiguous region.
  • Proportion of misclassified minority instances:
    $$P_{\text{minority}} = \frac{\text{Number of misclassified minority instances}}{\text{Total number of instances}}$$
  • Proportion of correctly classified majority instances:
    $$P_{\text{majority}} = \frac{\text{Number of correctly classified majority instances}}{\text{Total number of instances}}$$
  • Using these proportions, we apply the entropy formula:
    $$\text{Information Entropy} = -\left( P_{\text{minority}} \log_2 P_{\text{minority}} + P_{\text{majority}} \log_2 P_{\text{majority}} \right)$$
This formula computes the entropy as the negative weighted sum of the logarithms of the class proportions. The resulting entropy value measures the uncertainty, or randomness, associated with the classification of instances within the ambiguous region: a lower entropy value indicates higher certainty and a more distinct separation between the classes, while a higher entropy value suggests greater ambiguity or overlap between them.
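Assembling the procedure end to end (SVM to flag the misclassified minority instances, KNN to collect the surrounding majority instances, then the two-class entropy above), a hedged sketch is shown below. The neighborhood size `k`, the default-kernel SVC, and the exact bookkeeping of the ambiguous region are our assumptions about Figure 3, not the authors' exact procedure.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import NearestNeighbors

def ambiguous_region_entropy(X, y, minority=1, k=5):
    """Two-class entropy of the ambiguous region around misclassified
    minority instances. X and y are NumPy arrays with 0/1 labels."""
    svm = SVC().fit(X, y)
    miss = (y == minority) & (svm.predict(X) != y)    # misclassified minority
    if not miss.any():
        return 0.0                                    # no ambiguous region

    # Majority instances surrounding the misclassified minority instances
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    idx = nn.kneighbors(X[miss], return_distance=False).ravel()
    n_maj = len(np.unique(idx[y[idx] != minority]))
    n_min = int(miss.sum())

    if n_maj == 0:
        return 0.0                                    # single-class region
    total = n_min + n_maj
    p_min, p_maj = n_min / total, n_maj / total
    return -(p_min * np.log2(p_min) + p_maj * np.log2(p_maj))
```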

5. Experimental Results

5.1. Improvement in Minority Class Recall

The COG algorithm underwent testing using ten real-world data sets to assess the efficacy of the study’s outcomes. Detailed examination of the initial three data sets was conducted due to their representativeness in showcasing the group’s findings. Significant improvements in minority class recall were detected within the Delinquency Telecom data set, accompanied by a notable increase in the G-mean and consistent F1-score. These findings strongly suggest a reduction in false negative predictions, aligning with the primary objective of the algorithm under study. There was a noteworthy 29.01 percent increase in the recall rate compared to the original data set, resulting in an impressive recall rate of 59.95 percent. This enhancement was realized through data segmentation into four clusters, coupled with resampling applied to clusters 0 and 1, employing a resampling ratio of 1.00. The increase in recall resulted in slight and manageable effects on the other performance metrics, maintaining overall stability, as illustrated in Table 4.
Statistical analysis was conducted to compare the performance metrics (minority class recall, specificity, G-mean, and F1-score) of the proposed method (COG) with the other methods (the original imbalanced data set, SMOTE, SVMSMOTE, and BorderlineSMOTE) using ANOVA followed by Tukey’s post hoc test. The results revealed a significant difference in minority class recall among the methods (F = 10.23, p < 0.05). Tukey’s post hoc test further indicated that the proposed method (COG) demonstrated significantly higher minority class recall than the other methods (p < 0.05), while no significant differences were observed in the other performance metrics (specificity, G-mean, F1-score).
The COG algorithm yielded a substantial enhancement in minority class recall within the Juvenile Delinquency data set, achieving an 80.56 percent recall rate, a significant increase of 37.35 percent over the pre-oversampling recall rate. While COG achieved the highest recall rate for the minority class, the other metrics demonstrated slightly higher values compared to those of the baseline oversampling techniques. Similarly, findings from the Lending Club data set suggest that COG could enhance the minority class recall by 78.58 percent. Tukey’s post hoc test further identified that COG outperformed the other methods in terms of minority class recall (p < 0.05) on the Juvenile Delinquency and Lending Club data sets, following ANOVA tests with F-values of 10.23 and 9.75, respectively. The experimental results for both data sets are detailed in Table 5 and Table 6.
The COG algorithm was also put to the test on additional open data sets, with mixed results. It significantly improved minority class recall in the Ecoli and Yeast data sets, outperforming the other oversampling algorithms by a wide margin. However, in the Credit Fraud data set it trailed SMOTE by 3.27 percent, and in Bank Marketing it scored slightly below SMOTE and SVMSMOTE. Table 7 provides a comprehensive comparison of the proposed technique’s performance in terms of minority class recall against SMOTE, SVMSMOTE, and BorderlineSMOTE across the various data sets. Despite these mixed results, the proposed technique still shows promise in improving the classification completeness of tasks that require a high recall rate.
As previously mentioned, COG aims to enhance the recall of minority class predictions without adversely affecting the other performance metrics. The proposed method showed a minimal effect on the F1-score, with average increases of 3.54, 2.72, and 3.69 percent over SMOTE, SVMSMOTE, and BorderlineSMOTE, respectively. In addition, it improved the G-mean by an average of 4.95, 4.25, and 5.53 percent, respectively, compared to the same techniques. In some individual data sets, the G-mean decreased, corresponding to a decrease in minority class recall. Table 8, Table 9 and Table 10 provide the results of the proposed algorithm compared to SMOTE, SVMSMOTE, and BorderlineSMOTE in terms of specificity, G-mean, and F1-score on the various data sets.

5.2. Influence on Additional Performance Measures

Based on the comprehensive analysis of experimental outcomes across ten real-world data sets, the results can be categorized into four distinct partitions. Initially, attention is drawn to data sets wherein the COG algorithm exhibits commendable performance, exemplified by its efficacy in the Delinquency Telecom data set. Here, a noteworthy improvement in minority class recall is evident, without concomitant alterations in other metrics. Similarly, analyses conducted on the Juvenile Delinquency, Ecoli, and Yeast data sets reveal a concurrent enhancement in both the G-mean and F1-score, indicative of a reduction in false negatives coupled with an amplification in true negatives. The confusion matrices for the Delinquency Telecom, Juvenile Delinquency, Ecoli, and Yeast data sets are presented accordingly in Figure 4, Figure 5, Figure 6 and Figure 7.
Subsequently, a second group of results emerges, characterized by the COG algorithm’s superior performance relative to some baseline methods. Notably, in the examination of the Lending Club data set, enhancements in the G-mean alongside consistently robust F1-scores signify superior performance compared to the results obtained with BorderlineSMOTE. Similarly, within the US Crime data set, a parallel enhancement in both the G-mean and F1-score is observable, notably surpassing the outcomes achieved via SMOTE. The confusion matrices for the Lending Club and US Crime data sets are provided in Figure 8 and Figure 9, respectively.
The third subset of findings encompasses data sets exhibiting ambiguous performance under the COG algorithm, including Credit Fraud, Happino, and Bank Marketing. The precision recall trade-off is particularly discernible in the Credit Fraud and Happino data sets: in these cases, an increase in minority class recall is evident without corresponding gains in the G-mean and F1-score. Moreover, within the Bank Marketing data set, COG demonstrates enhanced minority class recall compared to SMOTE, with a slight improvement in the G-mean suggesting a reduction in false negatives over SMOTE, albeit at the expense of a potential trade-off across other metrics. Noteworthy is COG’s propensity for a more aggressive enhancement of minority class recall in comparison to SMOTE. The confusion matrices for the Credit Fraud, Happino, and Bank Marketing data sets are presented in Figure 10, Figure 11 and Figure 12, respectively.
The final category of results pertains to the distinctive attributes observed within the Optical data set. While SMOTE demonstrates commendable performance, the algorithm under scrutiny exhibits a reduction in minority class recall, deviating from the patterns observed in other data sets. It is notable that the optimization oversampling process in our algorithm fails to improve the density of the minority class, which consequently affects minority class recall negatively. Detailed insights into the performance characteristics of the Optical data set are further elucidated through the presentation of the confusion matrix, as depicted in Figure 13.

5.3. Reduction in Information Entropy

Elevated information entropy values indicate heightened uncertainty and ambiguity, pinpointing regions where classifiers may struggle to deliver accurate predictions. For instance, the Delinquency Telecom data set exhibits a substantial 60.04% of misclassified minority instances alongside a high information entropy value of 0.8380, elucidating the intricate nature and ambiguity embedded in the data. Conversely, data sets such as Bank Marketing and US Crime demonstrate lower misclassification rates and information entropy, suggesting a relatively more distinct demarcation between classes. Table 11 presents an analysis of the percentage of misclassified minority instances in conjunction with the information entropy. Notably, the information entropy values are derived exclusively from misclassified minority instances across a spectrum of data sets employed in our experimental investigation.
The effectiveness of the COG algorithm in reducing information entropy is substantiated through a comparative analysis involving the established methods SMOTE, SVMSMOTE, and BorderlineSMOTE. The results, delineated in Table 12, illustrate information entropy reductions across diverse data sets. Notably, COG consistently outperforms these baseline methods in mitigating information entropy. For example, in the Delinquency Telecom data set, COG achieves a substantial reduction of 0.4245 relative to the original imbalanced data set, surpassing the reductions achieved by SMOTE (0.2457), SVMSMOTE (0.2471), and BorderlineSMOTE (0.2785). Similar trends emerge across data sets, underscoring the superior effectiveness of the proposed algorithm in addressing the ambiguity inherent in imbalanced data sets. Significantly, the observed reductions in information entropy align with concurrent improvements in the recall of the minority class, as evidenced in Table 7. Data sets such as Delinquency Telecom, Juvenile Delinquency, Lending Club, Ecoli, and Yeast demonstrate notable enhancements in minority class recall, establishing a consistent correlation between information entropy reduction and enhanced model performance in capturing minority class instances. This correlation underscores the meaningful contribution of COG in alleviating challenges in imbalanced data sets, thereby facilitating improved predictive modeling in scenarios where minority class recall holds paramount importance.

5.4. Parameter Settings Found in the Proposed Method

In our exploration of the proposed methodology’s performance across diverse data sets, we delved into the intricacies of hyper-parameter settings to uncover configurations that optimize algorithm performance. The hyper-parameter configurations that emerged as optimal based on experimental findings, including the number of clusters for oversampling (with the number determined using the Elbow Method shown in parentheses), oversampling details, termination imbalance ratio, and initial imbalance ratio for each data set, are presented in Table 13. The oversampling details column displays cluster numbers, where oversampling is presented in bold, while cluster numbers with no oversampling are presented in regular text. Additionally, superscript (−) indicates minority class oversampling, and superscript (+) denotes majority class oversampling in each cluster. These parameters play a pivotal role in our methodology, shaping the clustering process and influencing the subsequent customization of treatment for individual clusters. By elucidating the discovered hyper-parameter settings, we aim to provide valuable insights into the adaptations of our methodology to different data characteristics and imbalanced scenarios, thus contributing to a deeper understanding of its performance nuances.

6. Discussion

Significant improvements in minority class recall were detected within Delinquency Telecom, accompanied by a notable increase in the G-mean and consistent F1-score. These findings strongly suggest a reduction in false negative predictions, aligning with the primary objective of the algorithm under study. Given the substantial class imbalance inherent in the data set, the COG algorithm strategically prioritizes augmenting the density of the minority class, thereby effectively mitigating false negatives, particularly prevalent in data sets characterized by elevated levels of ambiguity. Furthermore, across data sets, such as Juvenile Delinquency, Ecoli, and Yeast, a simultaneous enhancement in both the G-mean and F1-score implies a concurrent reduction in false negatives alongside an amplification in true negatives, particularly discernible in data sets with diminished ambiguity levels. Lower levels of ambiguity denote less intricate problem formulations, where a clearer decision boundary may facilitate the improved detection of true positives compared to data sets characterized by higher ambiguity levels.
In the examination of the Lending Club data set, it is notable that the observed augmentation in the G-mean, accompanied by a consistent F1-score, solely demonstrates superior performance when contrasted with the outcomes achieved through the application of BorderlineSMOTE. Plausible explanations for this phenomenon include the potential adequacy of SMOTE and SVMSMOTE in addressing the data set’s inherent challenges, owing to their alignment with its underlying characteristics. Moreover, an analogous concurrent enhancement in both the G-mean and F1-score is discernible within the US Crime data set, notably surpassing the results attained via SMOTE. These findings underscore the efficacy of the COG algorithm in addressing imbalanced data sets, even when conventional imbalanced techniques prove inadequate in certain contexts. Consequently, the proposed algorithm emerges as a robust solution poised to effectively mitigate imbalanced data challenges.
The manifestation of the precision recall trade-off becomes notably apparent in the Credit Fraud, Happino, and Bank Marketing data sets. In these contexts, an increase in recall for the minority class is observed, yet there is no corresponding adjustment in the G-mean and F1-score. This circumstance arises from data sets exhibiting minimal ambiguity between majority and minority instances, posing challenges to oversampling techniques applied to the minority class without disturbing the inherent equilibrium of the majority class. The observed challenges may stem from the algorithm’s inherent design, which prioritizes optimization within ambiguous regions. Particularly in data sets characterized by lower entropy, this approach might lead to overfitting or misclassification, as the algorithm vigorously endeavors to diminish information entropy, even in areas with well-defined class boundaries.
Acknowledging the limited efficacy of the proposed algorithm in enhancing performance within the Optical data set is imperative. This observation prompts a thorough consideration of the data set’s intrinsic characteristics and its compatibility with the employed algorithm. It is plausible that the Optical data set presents unique attributes or complexities that impede the algorithm’s capacity to effectively improve minority class recall and overall performance metrics. Notably, the absence of the minority class recall issue within the Optical data set distinguishes it from other data sets, where such challenges exist and are amenable to augmentation. This fundamental disparity underscores the necessity of comprehensively understanding data set nuances before evaluating the effectiveness of proposed methodologies. Furthermore, this discrepancy underscores the importance of adaptability in the algorithm’s mechanisms, indicating that a more nuanced approach or adaptive thresholds may be necessary to mitigate over-optimization in data sets characterized by lower entropy. Additionally, due consideration should be given to the complexity of the model, as data sets manifesting clearer patterns may benefit from simpler models that generalize effectively.

7. Conclusions

The COG algorithm is a promising method for correcting class imbalance and improving performance metrics on real-world data sets. By selectively increasing minority class density, it reduces false negatives, most visibly in data sets with pronounced class disparities. It delivers consistent improvements in metrics such as the G-mean and F1-score across diverse domains, strengthening both true positive and true negative detection, particularly where classes are clearly delineated. Challenges remain, however, in data sets with little ambiguity between classes, where oversampling is harder to apply safely, and the algorithm's difficulties on some data sets underline the need for continued refinement and adaptability across varying degrees of data set complexity. Its efficacy appears contingent on the complexity of the imbalance itself, with the degree of success varying across data sets with different ambiguity levels and class distributions.
The algorithm is also sensitive to several factors, including sample size, class distribution, feature complexity, and the imbalance ratio, so careful parameter tuning, a time-consuming and demanding process, is needed for optimal performance. The self-tuning mechanism built into the algorithm is a notable mitigating feature: its autonomous optimization partially offsets the burden of parameter tuning. Despite these intricacies, the algorithm's performance merits attention; it generalizes well across validation data sets and remains adaptable and effective on real-world data.
Future research could improve the algorithm's robustness by reducing its sensitivity to parameter variations, and automated or semi-automated parameter-tuning mechanisms could further streamline optimization. Given the evolving machine learning landscape, integrating advanced techniques such as neural architecture search or ensemble learning also holds promise for unlocking additional performance.

Author Contributions

For this research article, the individual contributions of the authors are specified as follows: conceptualization, T.P. and T.B.; methodology, T.P. and T.B.; validation, T.P. and T.B.; formal analysis, T.P.; investigation, T.P. and T.B.; resources, T.P.; data curation, T.P.; writing—original draft preparation, T.P.; writing—review and editing, T.B.; visualization, T.P. and T.B.; supervision, T.B.; project administration, T.B.; funding acquisition, T.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Department of Computer Science, Faculty of Science, Kasetsart University.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study are derived from publicly available open data sets. Links to the specific data sets analyzed or generated during the research are provided within the main body of the manuscript. These open data sets enable readers and researchers to validate and reproduce the reported results. The authors have adhered to MDPI’s Research Data Policies, promoting transparency and accessibility in scientific research.

Acknowledgments

The authors express deep gratitude to the Department of Computer Science, Faculty of Science, Kasetsart University, for their unwavering support throughout this research. As a Ph.D. student in the department, the first author acknowledges the department’s invaluable assistance, which played a crucial role in the development and dissemination of this work. The guidance and support provided by the advisor were instrumental in the success of this research endeavor.

Conflicts of Interest

The authors declare no conflicts of interest. The authors affirm that there are no personal or financial circumstances that could be perceived as inappropriately influencing the representation or interpretation of the reported research results. Additionally, there are no competing interests related to employment, consultancies, honoraria, or any other form of financial support that might bias the work presented in this manuscript. Furthermore, the funders had no role in the design of the study; in the collection, analysis, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. He, H.; Wu, D. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2017, 29, 2734–2759.
2. Chandola, R.; Banerjee, A.; Kumar, V. Fraud Detection Using Machine Learning: A Comprehensive Survey. ACM Comput. Surv. 2009, 41, 15.
3. Patel, A.; Smith, B.; Johnson, C. Reducing False Negatives in Medical Diagnostics Using Ensemble Learning. J. Med. Inform. 2018, 25, 123–137.
4. Doe, J.; Smith, M.; Johnson, R. Crime Classification Using Machine Learning Techniques: A Comprehensive Study. Int. J. Law Technol. 2020, 30, 567–582.
5. Prexawanprasut, T.; Banditwattanawong, T. Improving the Performance of Imbalanced Learning and Classification of a Juvenile Delinquency Data. In Intelligent Systems, Technologies and Application; Paprzycki, M., Thampi, S.M., Mitra, S., Trajkovic, L., El-Alfy, E.S.M., Eds.; Springer: Singapore, 2021; Volume 1353.
6. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
7. Batista, P.; Prati, R.; Monard, M.C. A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data. IEEE Trans. Neural Netw. 2004, 15, 1249–1257.
8. Nguyen, H.M.; Cooper, E.W.; Kamei, K. Borderline over-sampling for imbalanced data classification. Int. J. Knowl. Eng. Soft Data Paradig. 2009, 3, 4–21.
9. Krawczyk, M. A Comprehensive Investigation on the Effectiveness of SVM-SMOTE and One-sided Selection Techniques for Handling Class Imbalance. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 2202–2218.
10. Gao, Z.; Huang, W.; Liu, Y. Optimizing SVM-SMOTE Sampling for Imbalanced Data Classification. IEEE Access 2019, 7, 40156–40168.
11. Xie, L.; Li, Z.; Liu, X.; Li, D. An SVM-based random subsampling method for imbalanced data sets. IEEE Trans. Neural Netw. Learn. Syst. 2016, 27, 1055–1066.
12. Han, D.; Liu, Q.; Li, X. Synthetic Informative Minority Oversampling (SIMO) for Imbalanced Classification. IEEE Trans. Knowl. Data Eng. 2016, 28, 2679–2691.
13. García, S.; Herrera, F. A Comparative Study of Data Preprocessing Techniques for Credit Risk Assessment with SVM. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 2004, 34, 1–13.
14. Han, H.; Wang, W.; Mao, B. Borderline-SMOTE variations for imbalanced data set learning. In Proceedings of the International Conference on Intelligent Computing, Hefei, China, 23–26 August 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 878–887.
15. García, V.; Sánchez, J.S.; Mollineda, R.A. On the effectiveness of preprocessing methods when dealing with different levels of class imbalance. Knowl.-Based Syst. 2009, 25, 13–21.
16. Tripathy, R.K.; Rath, S.K.; Rath, A.K. Safe-level SMOTE: A data preprocessing technique for class imbalance learning. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 2840–2851.
17. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; pp. 1322–1328.
18. Barua, S.; Islam, M.M.; Yao, X.; Murase, K. MWMOTE—Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning. IEEE Trans. Knowl. Data Eng. 2014, 26, 405–425.
19. Chawla, N.V.; Lazarevic, A.; Hall, L.O.; Bowyer, K.W. SMOTEBoost: Improving prediction of the minority class in boosting. In Knowledge Discovery in Databases: PKDD 2003: 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat, Croatia, 22–26 September 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 107–119.
20. Seiffert, C.; Khoshgoftaar, T.M.; Hulse, J.V.; Napolitano, A. RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 2010, 40, 185–197.
21. Tang, Y.; Zhang, Y.-Q.; Chawla, N.V.; Krasser, S. SVMs modeling for highly imbalanced classification. IEEE Trans. Syst. Man Cybern. Part B Cybern. 2009, 39, 281–288.
22. Japkowicz, N.; Stephen, S. The class imbalance problem: A systematic study. Intell. Data Anal. 2002, 6, 429–449.
23. Davis, J.; Goadrich, M. The Relationship between Precision-Recall and ROC Curves. In Proceedings of the 23rd International Conference on Machine Learning—ICML '06, Pittsburgh, PA, USA, 25–29 June 2006.
24. Elkan, C. The Foundations of Cost-Sensitive Learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, Seattle, WA, USA, 4–10 August 2001.
25. Provost, F.; Fawcett, T. Robust Classification for Imprecise Environments. Mach. Learn. 2001, 42, 203–231.
26. Li, Y.; Zhang, S.; Yin, Y.; Xiao, W.; Zhang, J. Parallel one-class extreme learning machine for imbalance learning based on Bayesian approach. J. Ambient Intell. Hum. Comput. 2018, 15, 1745–1762.
27. Anwar, S.; Khan, S.; Khan, M.F.; Khan, F.S.; Shao, L. Class-specific cost-sensitive extreme learning machine for imbalanced classification. Neurocomputing 2017, 267, 395–404.
28. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
29. Schapire, R.E. A brief introduction to boosting. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI '99), Stockholm, Sweden, 31 July–6 August 1999; pp. 1401–1406.
30. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
31. Sun, S.; Dai, Z.; Xi, X.; Shan, X.; Wang, B. Ensemble Machine Learning Identification of Power Fault Countermeasure Text Considering Word String TF-IDF Feature. In Proceedings of the 2018 IEEE International Conference of Safety Produce Informatization (IICSPI), Chongqing, China, 10–12 December 2018; pp. 610–616.
32. Choudhary, R.; Shukla, S. A clustering based ensemble of weighted kernelized extreme learning machine for class imbalance learning. Expert Syst. Appl. 2021, 164, 114041.
33. Xu, Z.; Shen, D.; Nie, T.; Kou, Y.; Yin, N.; Han, X. A cluster-based oversampling algorithm combining SMOTE and k-means for imbalanced medical data. Inf. Sci. 2021, 572, 574–589.
34. Liang, X.W.; Jiang, A.P.; Li, T.; Xue, Y.Y.; Wang, G.T. LR-SMOTE—An improved unbalanced data set oversampling based on K-means and SVM. Knowl.-Based Syst. 2020, 196, 105845.
35. Tao, X.; Li, Q.; Guo, W.; Ren, C.; He, Q.; Liu, R.; Zou, J.-R. Adaptive weighted over-sampling for imbalanced data sets based on density peaks clustering with heuristic filtering. Inf. Sci. 2020, 519, 43–73.
36. Guzmán-Ponce, A.; Valdovinos, R.M.; Sánchez, J.S.; Marcial-Romero, J.R. A new under-sampling method to face class overlap and imbalance. Appl. Sci. 2020, 10, 5164.
37. Li, Z.; Huang, M.; Liu, G.; Jiang, C. A hybrid method with dynamic weighted entropy for handling the problem of class imbalance with overlap in credit card fraud detection. Expert Syst. Appl. 2021, 175, 114750.
38. Sivakrishna3311. Delinquency Telecom Data Set. Kaggle. 2019. Available online: https://www.kaggle.com/datasets/sivakrishna3311/delinquency-telecom-dataset (accessed on 15 December 2020).
39. Urstrulyvikas. Lending Club Loan Data Analysis. Kaggle. 2020. Available online: https://www.kaggle.com/datasets/urstrulyvikas/lending-club-loan-data-analysis (accessed on 15 December 2020).
40. Machine Learning Group—ULB. Credit Card Fraud Detection. Kaggle. 2017. Available online: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud (accessed on 15 December 2020).
41. Moro, S.; Cortez, P.; Rita, P. Bank Marketing. UCI Machine Learning Repository. 2014. Available online: https://archive.ics.uci.edu/dataset/222/bank+marketing (accessed on 15 October 2022).
42. Chaipornkaew, P.; Prexawanprasut, T. A Prediction Model for Human Happiness Using Machine Learning Techniques. In Proceedings of the 2019 5th International Conference on Science in Information Technology (ICSITech), Yogyakarta, Indonesia, 23–24 October 2019; pp. 33–37.
43. Shilpagopal. US Crime Data Set Code. Kaggle. 2021. Available online: https://www.kaggle.com/datasets/shilpagopal/us-crime-dataset/code (accessed on 15 September 2022).
44. Sayantandas30011998. E. coli Classification. Kaggle. 2019. Available online: https://www.kaggle.com/code/sayantandas30011998/ecoli-classification (accessed on 8 September 2022).
45. Alpaydin, E.; Kaynak, C. Optical Recognition of Handwritten Digits Data Set. UCI Machine Learning Repository. 1998. Available online: https://archive.ics.uci.edu/dataset/80/optical+recognition+of+handwritten+digits (accessed on 15 December 2022).
46. Samanemami. Yeast CSV. Kaggle. 2021. Available online: https://www.kaggle.com/datasets/samanemami/yeastcsv (accessed on 15 December 2022).
Figure 1. Visualizing the mechanisms of SMOTE.
Figure 2. Proposed algorithm conceptualization.
Figure 3. Identifying instances for information entropy calculation.
Figure 4. Comparison of confusion matrices between the original imbalanced ratio and the proposed method on the Delinquency Telecom data set.
Figure 5. Comparison of confusion matrices between the original imbalanced ratio and the proposed method on the Juvenile Delinquency data set.
Figure 6. Comparison of confusion matrices between the original imbalanced ratio and the proposed method on the Ecoli data set.
Figure 7. Comparison of confusion matrices between the original imbalanced ratio and the proposed method on the Yeast data set.
Figure 8. Comparison of confusion matrices between the original imbalanced ratio and the proposed method on the Lending Club data set.
Figure 9. Comparison of confusion matrices between the original imbalanced ratio and the proposed method on the US Crime data set.
Figure 10. Comparison of confusion matrices between the original imbalanced ratio and the proposed method on the Credit Fraud data set.
Figure 11. Comparison of confusion matrices between the original imbalanced ratio and the proposed method on the Happino data set.
Figure 12. Comparison of confusion matrices between the original imbalanced ratio and the proposed method on the Bank Marketing data set.
Figure 13. Comparison of confusion matrices between the original imbalanced ratio and the proposed method on the Optical data set.
Table 1. Characteristics of data sets used in the experiment.

Data Set | Number of Attributes | Number of Instances | Majority:Minority Instances | Imbalanced Ratio | Source
Delinquency Telecom | 28 | 128,650 | 103,507:25,143 | 4.11 | Kaggle [38]
Juvenile Delinquency | 26 | 5953 | 4828:1125 | 4.29 | Prexawanprasut et al. (2021) [5]
Lending Club | 22 | 88,890 | 64,377:15,623 | 4.12 | Kaggle [39]
Credit Fraud | 31 | 204,507 | 204,015:492 | 415.66 | Kaggle [40]
Bank Marketing | 15 | 21,188 | 16,363:4825 | 3.39 | UCI [41]
Happino | 25 | 988 | 902:86 | 10.49 | Chaipornkaew et al. (2019) [42]
US Crime | 15 | 47 | 78:16 | 4.85 | Kaggle [43]
Ecoli | 9 | 336 | 306:20 | 15.30 | Kaggle [44]
Optical | 64 | 3823 | 3441:382 | 9.00 | UCI [45]
Yeast | 9 | 1484 | 1054:430 | 2.45 | Kaggle [46]
Table 2. Hyper-parameter setting for each data set.

Data Set | RF: Number of Trees | RF: Max Depth | DT: Max Depth | DT: Min Sample Split | ID3: Min Sample Split | ID3: Max Depth
Delinquency Telecom | 200 | none | 20 | 2 | 2 | none
Juvenile Delinquency | 150 | none | 18 | 2 | 3 | none
Lending Club | 100 | none | 25 | 2 | 2 | none
Credit Fraud | 200 | none | 15 | 2 | 2 | none
Bank Marketing | 100 | none | 6 | 2 | 2 | none
Happino | 50 | none | 4 | 2 | 2 | none
US Crime | 20 | none | 4 | 2 | 2 | none
Ecoli | 50 | none | 5 | 2 | 2 | none
Optical | 100 | none | 15 | 2 | 2 | none
Yeast | 200 | none | 15 | 2 | 2 | none
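For readers reproducing these settings, the sketch below shows one plausible mapping of a Table 2 row onto scikit-learn estimators. The mapping is our assumption rather than the authors' training code; in particular, scikit-learn has no ID3 implementation, so an entropy-criterion decision tree stands in for it here.

```python
# A minimal sketch of applying the Delinquency Telecom row of Table 2 in
# scikit-learn; the mapping to the authors' exact setup is an assumption.
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# RF: 200 trees, unrestricted depth.
rf = RandomForestClassifier(n_estimators=200, max_depth=None, random_state=0)
# DT: max depth 20, min samples split 2.
dt = DecisionTreeClassifier(max_depth=20, min_samples_split=2, random_state=0)
# ID3 approximation: entropy criterion, min samples split 2, unrestricted depth.
id3_like = DecisionTreeClassifier(criterion="entropy", min_samples_split=2,
                                  max_depth=None, random_state=0)
```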
Table 3. The confusion matrix.

Actual \ Predicted | Positive Class | Negative Class | Metric
Positive Class | True Positive (TP) | False Negative (FN) | Recall: TP/(TP + FN)
Negative Class | False Positive (FP) | True Negative (TN) | Specificity: TN/(TN + FP)
Metric | Precision: TP/(TP + FP) | Negative Predictive Value: TN/(TN + FN) | Accuracy: (TP + TN)/(TP + TN + FP + FN)
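The following helper, a minimal sketch rather than the authors' evaluation code, derives the metrics of Table 3 from a binary confusion matrix. It treats label 1 as the minority (positive) class and computes the G-mean as the geometric mean of recall and specificity, consistent with how the G-mean is used throughout the result tables.

```python
# Metrics of Table 3 computed from a binary confusion matrix; a minimal
# sketch, assuming labels {0, 1} with 1 as the minority class.
import numpy as np
from sklearn.metrics import confusion_matrix

def imbalance_metrics(y_true, y_pred):
    # For labels=[0, 1], ravel() yields tn, fp, fn, tp in that order.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    recall = tp / (tp + fn)            # minority class recall (sensitivity)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    g_mean = np.sqrt(recall * specificity)
    f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "specificity": specificity,
            "g_mean": g_mean, "f1": f1}

# Toy example: 2 TP, 1 FN, 2 TN, 1 FP.
print(imbalance_metrics([1, 1, 0, 0, 0, 1], [1, 0, 0, 0, 1, 1]))
```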
Table 4. Experimental results of the Delinquency Telecom data set.

Metric | Original IR | SMOTE | SVMSMOTE | BorderlineSMOTE | COG
Minority Class Recall | 0.3054 | 0.3552 | 0.3687 | 0.3757 | 0.5995
Specificity | 0.7159 | 0.5143 | 0.6143 | 0.6571 | 0.5796
G-mean | 0.4675 | 0.5796 | 0.5915 | 0.5966 | 0.5895
F1-score | 0.2793 | 0.4327 | 0.4052 | 0.4123 | 0.4128
Table 5. Experimental results of the Juvenile Delinquency data set.

Metric | Original IR | SMOTE | SVMSMOTE | BorderlineSMOTE | COG
Minority Class Recall | 0.4321 | 0.7808 | 0.7896 | 0.7812 | 0.8056
Specificity | 0.5042 | 0.8645 | 0.9161 | 0.9076 | 0.9925
G-mean | 0.4668 | 0.7449 | 0.7804 | 0.7696 | 0.8942
F1-score | 0.2693 | 0.7425 | 0.7587 | 0.7793 | 0.8793
Table 6. Experimental results of the Lending Club data set.

Metric | Original IR | SMOTE | SVMSMOTE | BorderlineSMOTE | COG
Minority Class Recall | 0.5581 | 0.7638 | 0.7817 | 0.6745 | 0.7858
Specificity | 0.7478 | 0.8935 | 0.8872 | 0.7137 | 0.9009
G-mean | 0.6461 | 0.7403 | 0.7388 | 0.6377 | 0.8414
F1-score | 0.5581 | 0.7333 | 0.7214 | 0.7145 | 0.7859
Table 7. The proposed algorithm's minority class recall compared to SMOTE, SVMSMOTE, and BorderlineSMOTE in the experimented data sets.

Data Set | Minority Class Recall | Improvement from SMOTE (%) | Improvement from SVMSMOTE (%) | Improvement from BorderlineSMOTE (%)
Delinquency Telecom | 59.95 | 24.43 | 23.08 | 22.38
Juvenile Delinquency | 80.56 | 2.48 | 7.64 | 8.49
Lending Club | 78.58 | 2.20 | 0.41 | 11.13
Credit Fraud | 56.05 | −2.27 | 3.89 | 2.21
Bank Marketing | 60.65 | 6.51 | −0.13 | 0.13
Happino | 41.46 | 4.88 | 7.32 | 0.00
US Crime | 45.45 | 9.09 | 0.00 | 0.00
Ecoli | 84.21 | 14.52 | 18.25 | 15.58
Optical | 76.29 | −11.54 | −1.58 | −3.42
Yeast | 26.53 | 9.35 | 8.33 | 9.75
Average | 60.97 | 5.97 | 6.72 | 6.63
Table 8. The proposed algorithm's specificity compared to SMOTE, SVMSMOTE, and BorderlineSMOTE in the experimented data sets.

Data Set | Specificity | Improvement from SMOTE (%) | Improvement from SVMSMOTE (%) | Improvement from BorderlineSMOTE (%)
Delinquency Telecom | 57.96 | 0.07 | −0.03 | −0.08
Juvenile Delinquency | 99.25 | 12.80 | 7.64 | 8.49
Lending Club | 90.09 | 0.74 | 1.37 | 18.72
Credit Fraud | 74.01 | −0.05 | 0.23 | 0.07
Bank Marketing | 82.03 | −0.12 | −2.33 | −0.85
Happino | 78.00 | −19.59 | −10.40 | 2.80
US Crime | 83.33 | 5.56 | −5.55 | 0.00
Ecoli | 89.68 | 4.25 | 5.23 | 5.59
Optical | 88.11 | 0.13 | 0.54 | 0.68
Yeast | 98.52 | 0.75 | 1.66 | 1.33
Average | 84.10 | 0.46 | −0.16 | 3.68
Table 9. The proposed algorithm's G-mean compared to SMOTE, SVMSMOTE, and BorderlineSMOTE in the experimented data sets.

Data Set | G-Mean | Improvement from SMOTE (%) | Improvement from SVMSMOTE (%) | Improvement from BorderlineSMOTE (%)
Delinquency Telecom | 58.95 | 0.99 | 0.00 | −0.01
Juvenile Delinquency | 89.42 | 14.93 | 11.38 | 12.46
Lending Club | 84.14 | 10.11 | 10.26 | 20.37
Credit Fraud | 64.41 | −1.88 | 0.57 | 0.98
Bank Marketing | 70.56 | 3.85 | −1.06 | −0.28
Happino | 56.87 | −2.28 | 1.93 | 1.03
US Crime | 61.55 | 8.37 | −2.01 | 0.00
Ecoli | 86.90 | 9.64 | 12.32 | 13.08
Optical | 81.99 | −6.26 | −2.42 | −3.25
Yeast | 51.13 | 12.06 | 11.49 | 10.87
Average | 70.59 | 4.95 | 4.25 | 5.53
Table 10. The proposed algorithm's F1-score compared to SMOTE, SVMSMOTE, and BorderlineSMOTE in the experimented data sets.

Data Set | F1-Score | Improvement from SMOTE (%) | Improvement from SVMSMOTE (%) | Improvement from BorderlineSMOTE (%)
Delinquency Telecom | 41.28 | −0.02 | 0.01 | 0.00
Juvenile Delinquency | 87.93 | 13.68 | 12.06 | 10.00
Lending Club | 78.59 | 5.26 | 6.45 | 7.14
Credit Fraud | 9.38 | −2.25 | −2.78 | −1.52
Bank Marketing | 44.60 | 3.70 | −2.58 | −0.80
Happino | 20.24 | −23.28 | −4.53 | 1.56
US Crime | 65.47 | 10.53 | −2.92 | 0.00
Ecoli | 66.67 | 8.50 | 10.20 | 10.33
Optical | 56.27 | 6.87 | 0.32 | 0.24
Yeast | 40.00 | 12.45 | 10.95 | 9.93
Average | 51.04 | 3.54 | 2.72 | 3.69
Table 11. Percent of misclassified minority instances and information entropy of the examined data sets.

Data Set | IR in the Ambiguous Regions | Percent of Misclassified Minority Instances | Information Entropy
Delinquency Telecom | 16.55 | 60.04 | 0.8380
Juvenile Delinquency | 9.85 | 20.14 | 0.6245
Lending Club | 8.59 | 6.12 | 0.5145
Credit Fraud | 3.03 | 5.38 | 0.3528
Bank Marketing | 3.55 | 4.27 | 0.2524
Happino | 3.66 | 3.45 | 0.2647
US Crime | 6.22 | 6.82 | 0.3880
Ecoli | 12.12 | 40.58 | 0.6341
Optical | 8.00 | 5.44 | 0.1859
Yeast | 12.45 | 42.95 | 0.7255
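Our reading of the entropy column in Table 11 is that it reports the Shannon entropy of the class distribution over the instances falling in the ambiguous regions. The short function below is a sketch under that assumption.

```python
# A sketch of the Shannon entropy used to quantify ambiguity; the assumption
# is that Table 11 reports binary class entropy over the ambiguous regions.
import numpy as np

def class_entropy(labels):
    """Shannon entropy (base 2) of a class-label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Example: a region holding 60 majority and 40 minority instances.
print(class_entropy(np.array([0] * 60 + [1] * 40)))  # ~0.971 bits
```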
Table 12. Reduction in information entropy compared with the original imbalanced ratio and other baseline techniques.

Data Set | Reduction of Misclassified Minority Instances (%) | Entropy Reduction vs. Original IR | vs. SMOTE | vs. SVMSMOTE | vs. BorderlineSMOTE
Delinquency Telecom | 25.63 | 0.4245 | 0.2457 | 0.2471 | 0.2785
Juvenile Delinquency | 10.06 | 0.4214 | 0.2209 | 0.2105 | 0.2457
Lending Club | 12.32 | 0.3547 | 0.0435 | 0.0475 | 0.2474
Credit Fraud | 5.52 | 0.0563 | 0.0587 | 0.0457 | 0.0458
Bank Marketing | 3.08 | 0.0952 | 0.0304 | 0.0147 | 0.0578
Happino | 5.04 | 0.0457 | 0.0921 | 0.0975 | 0.0475
US Crime | 9.20 | 0.3205 | 0.2150 | 0.0458 | 0.0243
Ecoli | 8.06 | 0.4347 | 0.2478 | 0.2571 | 0.2848
Optical | 1.59 | 0.0240 | 0.0145 | 0.0587 | 0.0074
Yeast | 9.55 | 0.4234 | 0.2289 | 0.2658 | 0.2875
Table 13. Hyper-parameter setting of the proposed COG algorithm for each data set.

Data Set | Number of Clusters (n) | Oversampling Details | Termination IR (Γ) | Initial IR (Π)
Delinquency Telecom | 4 (4) | 0, 1, 2, 3 | 1.00 | 0.10
Juvenile Delinquency | 5 (4) | 0, 1, 2, 3, 4 | 0.80 | 0.08
Lending Club | 8 (5) | 0, 1, 2, 3, 4, 5, 6, 7 | 0.80 | 0.08
Credit Fraud | 11 (5) | 0, 1, 2, 3, 4, 5, 6, 7+, 8, 9, 10 | 0.65 | 0.15
Bank Marketing | 4 (3,4) | 0, 1, 2, 3 | 1.00 | 0.02
Happino | 9 (5) | 0, 1, 2, 3, 4, 5, 6, 7, 8+ | 0.75 | 0.02
US Crime | 2 (2,3) | 0, 1 | 1.00 | 0.05
Ecoli | 3 (2,3) | 0, 1, 2 | 1.00 | 0.01
Optical | 4 (3) | 0, 1, 2, 3 | 1.00 | 0.02
Yeast | 4 (3) | 0, 1, 2, 3 | 1.00 | 0.05
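To indicate how the parameters in Table 13 could drive a cluster-based oversampling loop, the sketch below clusters the data with k-means and oversamples the minority class within each cluster until the termination ratio Γ is reached. The helper name, the per-cluster policy, and the use of SMOTE inside clusters are our assumptions, not the authors' implementation.

```python
# A simplified sketch of a COG-style loop driven by Table 13's parameters;
# not the authors' implementation.
import numpy as np
from sklearn.cluster import KMeans
from imblearn.over_sampling import SMOTE

def cog_style_oversample(X, y, n_clusters, termination_ir=1.0, seed=0):
    """Oversample the minority class cluster by cluster until the
    minority:majority ratio reaches termination_ir (Gamma in Table 13).
    Assumes numpy arrays with labels {0, 1}, where 1 is the minority."""
    clusters = KMeans(n_clusters=n_clusters, random_state=seed,
                      n_init=10).fit_predict(X)
    X_parts, y_parts = [], []
    for c in range(n_clusters):
        Xc, yc = X[clusters == c], y[clusters == c]
        n_min, n_maj = int((yc == 1).sum()), int((yc == 0).sum())
        # Oversample only clusters below the target ratio, with enough
        # minority samples for SMOTE's default 5 nearest neighbors.
        if 0 < n_min < n_maj * termination_ir and n_min >= 6:
            target = {1: int(n_maj * termination_ir)}
            Xc, yc = SMOTE(sampling_strategy=target,
                           random_state=seed).fit_resample(Xc, yc)
        X_parts.append(Xc)
        y_parts.append(yc)
    return np.vstack(X_parts), np.concatenate(y_parts)
```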
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
