Article

Tuning Data Mining Models to Predict Secondary School Academic Performance

by William Hoyos 1,2,3,*,† and Isaac Caicedo-Castro 4,†
1 Sustainable and Intelligent Engineering Research Group, Cooperative University of Colombia, Monteria 230002, Colombia
2 R&D&I in ICT, EAFIT University, Medellin 050022, Colombia
3 Microbiological and Biomedical Research Group of Cordoba, University of Córdoba, Monteria 230002, Colombia
4 SOCRATES Research Team, Department of Systems and Telecommunications Engineering, Faculty of Engineering, University of Córdoba, Monteria 230002, Colombia
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Submission received: 12 April 2024 / Revised: 14 June 2024 / Accepted: 15 June 2024 / Published: 26 June 2024

Abstract: In recent years, educational data mining has emerged as a growing discipline focused on developing models for predicting academic performance. The primary objective of this research was to tune classification models to predict academic performance in secondary school. The dataset employed for this study encompassed information from 19,545 high school students. We used descriptive statistics to characterise information contained in personal, school, and socioeconomic variables. We implemented two data mining techniques, namely artificial neural networks (ANN) and support vector machines (SVM). Parameter optimisation was conducted through five-fold cross-validation, and model performance was assessed using accuracy and F1-score. The results indicate a functional dependence between predictor variables and academic performance. The algorithms demonstrated an average performance exceeding 80% accuracy. Notably, ANN outperformed SVM in the dataset analysed. This type of methodology could help educational institutions to predict academic underachievement and thus generate strategies to improve students’ academic performance.

1. Introduction

Academic performance is conceived as a construct that depends not only on student motivation but also on other factors that may affect it, such as the teacher–student relationship, availability of study tools, access to computers and internet service, among others. In addition, there are demographic, socio–economic and psychological variables that contribute to the performance of any student, whether in secondary or higher education [1]. García–Tinisaray [2] defines academic performance as the main indicator of student success or failure and believes that it has been considered one of the important aspects when analysing the results of the teaching–learning process. On the other hand, in educational institutions, academic performance is an indicator of educational efficiency and quality. The academic performance of students is one of the main problems faced by secondary education institutions due to the high failure rate in some subjects that leads to poor performance during the school year and, in some cases, to student dropout [3]. The results of PISA 2022 show that 25% of 15-year-old students in Organisation for Economic Co-operation and Development (OECD) countries had not achieved a basic level of proficiency in at least one of the three main subjects assessed by PISA: reading, mathematics and science. In absolute numbers, this means that nearly 13 million 15-year-old students in the 64 countries and economies participating in PISA 2022 showed low performance in at least one subject [4].
The PISA tests conducted in 2022 show that developed countries such as Canada, Denmark, Finland, Hong Kong (China), Ireland, Japan, Korea, Latvia, Macao (China) and the United Kingdom boast the best results with percentages below 10% of students with low academic performance in the three areas evaluated. They also show that countries such as Spain have moderate levels of low academic performance (18.3% and 23.6% for reading and mathematics, respectively). It should be noted that these results are above the average of all the countries evaluated by the OECD [5].
The panorama in Latin America is quite discouraging. Perú is the country with the highest percentage of 15-year-old students who do not reach the basic level established by the OECD (60%, 68.5% and 74.6% in reading, science and mathematics, respectively) [6]. Brazil and Argentina obtained similar percentages for reading, science and mathematics (Brazil: 50.8%, 55% and 68.3%; Argentina: 53.6%, 50.9% and 66.5%) [6]. In Colombia, the problem is very similar to that of other Latin American countries. The percentage of students with low academic performance was 51.4% for reading, 56.2% for science and 73.8% for mathematics, which shows a complicated outlook regarding students’ performance during their high school years. Students’ poor performance at the school stage will affect their performance in the following learning phases [6]. On the other hand, in Colombia, one way to measure a student’s academic performance is through the Saber 11 tests, an evaluation conducted by the Colombian Institute for the Evaluation of Education (ICFES) [7]. The Saber 11 test consists of 268 questions that evaluate five competencies: critical reading, mathematics, social sciences, natural sciences and English. There is currently a challenge in collecting socio-economic, school and personal variables to predict academic performance in the Saber 11 tests. This challenge stems from the absence of numerical variables in the data set: since most of the variables are categorical, the exact values of these characteristics are not known. At present, the dependence or functional relationship between the personal, socio-economic and school variables of Colombian students and their performance in the competencies evaluated in the Saber 11 tests is not known. Studies on academic performance prediction have been conducted in different departments of our country.
The models that have been developed and implemented use the results of the Saber 11 tests to predict performance in different university courses [8,9,10,11,12,13,14].

2. Related Work

Data mining can be used to predict academic performance outcomes in both secondary and higher education students. Studies conducted in the Netherlands in 2017 compared different data mining techniques (ANN, SVM, logistic regression, and Naive-Bayes) for predicting academic performance in students in virtual courses. The researchers demonstrated that the predictive performance of an ANN outperformed other classifiers in terms of accuracy. However, the ANN and the other six classifiers did not outperform the findings of other studies, probably attributable to the difference in predictor variables used and the study setup [15]. Cuevas-Redondo and Estévez Bravo [16] used decision trees and linear regression for the improvement and prediction of academic performance. In the tests performed, linear regression had an error rate of more than 64% compared to the 56% maximum obtained by decision trees. For this reason, in the case of a database with little information, they do not recommend the use of linear regression for the prediction of academic performance. Abu Saa et al. [17] used multiple data mining tasks with the goal of creating qualitative prediction models that were efficient and effective in predicting student grades from a set of collected training data. The researchers implemented data mining tasks on the data set (personal, social and academic data) in question to generate classification models and evaluate them. They implemented four decision tree algorithms as well as the Naive Bayes algorithm. The results showed that student performance does not depend entirely on their academic efforts, even though there are many other factors that have equal influence. The authors conclude that the use of data mining can motivate and help universities find interesting results and patterns that can help both the university and the students in many ways. 
Kabakchieva [18] implemented classification models making use of four data mining algorithms (association rules, decision trees, ANN and k-Nearest Neighbour). These algorithms were applied to the available student data, which were carefully preprocessed. The results reveal a classification accuracy between 67.46% and 73.59%. The highest accuracy is achieved by the ANN (73.59%), followed by the decision tree model (72.74%) and the k-NN model (70.49%). The most influential factors in the classification process are the data attributes related to the college admission score and the number of failures in the first-year college exams. The study by Romero et al. [19] applied data mining techniques to predict the academic performance of first-year undergraduate students. Among the techniques they used were decision trees, Naive-Bayes and ANN. The results showed that the first two had an equal accuracy of 60.52%, while for the ANN, the accuracy was 54.47%. Regarding the accuracy of the models, the classification percentages were not very high, which indicates the difficulty of the academic performance prediction problem, as it is affected by many factors.
In Latin America, the amount of research conducted in this field is increasing. In Brazil, research on the usefulness of machine learning in education has been developed. De Melo [20] reported that one of the ways found to improve the educational system in Latin American countries is to track students. This could be done by collecting and analysing data to find functional dependencies between sociodemographic variables and academic performance. The work aimed to evaluate machine learning techniques on data sets of technical high school students. The results of the research show that the use of machine learning, specifically supervised classification learning, is very useful for the creation of help tools that can monitor and predict student performance. It also showed that decision trees can make predictions with 89% to 94% accuracy. Menacho-Chiok [21] conducted research in which several data mining techniques (logistic regression, Naive-Bayes and neural networks) were applied to data from students enrolled in a subject at the Universidad Nacional Agraria La Molina in Peru. The results indicated that the Naive Bayes algorithm obtained the highest classification rate at 71%. Socioeconomic variables have a considerable influence on students’ academic performance; therefore, the author recommends using them in order to improve the predictive model.
In Colombia, academic performance has been widely studied in diverse populations. Models have been developed and implemented for the prediction of the academic performance of students in higher or secondary education. Merchan-Rubiano et al. [10] have conducted several investigations using applied decision trees with the results of the Saber 11 tests (formerly called ICFES tests) and sociodemographic variables for the prediction of academic performance in engineering courses, with accuracy results of up to 86%. According to the results, the prediction of academic performance in the first year of university reduces the possibility of academic desertion and also improves the quality of the students’ formative processes, allowing the orientation of preventive monitoring strategies for people with real possibilities of suffering academic risk [10,11,12].
Moreover, the outcomes of four out of the five competencies of the Saber 11 test, excluding natural science, have been used to predict whether a given student might either fail or drop out of any course during the first semester of the systems engineering bachelor’s degree program at the University of Córdoba in Colombia [8]. In the same context, it has been studied whether considering the outcomes of all competencies contributes to predicting whether a student might fail any course related to mathematics or physics during the first term [9].
In [22,23], the goal is to determine the factors that influence the outcomes achieved by students in the Saber test. The former study focuses on Cundinamarca, a region of Colombia, during the period from 2017 to 2021 [22], while the latter examines the period from 2012 to 2022, analysing the performance of students who took the Saber 11 test in Bogotá and comparing their outcomes with those of students in the rest of Colombia [23]. Both studies differ from ours because they adopt descriptive statistical methods in lieu of predictive data mining techniques, whereas we utilise machine learning.
Furthermore, in [23], the study aims to determine the effect of the lockdown caused by the COVID-19 pandemic on Colombian students’ performance in the Saber 11 test. The study adopts linear regression analysis to estimate the weight of the variables that influence students’ outcomes in the test, while we use classification methods in our study to predict students’ performance. The drawback of the findings in [23] is that the coefficient of determination is less than 0.5, suggesting that the prediction function has limited predictive power and might not be significantly better than using the mean of the target variable.
Another study conducted in Colombia using data mining for the prediction of academic performance is entitled “Discovering patterns of academic performance in the critical reading competency”. The study, executed by Timarán-Pereira and collaborators, used decision trees to discover patterns of academic performance in the generic competencies of students who took the Saber Pro tests. The study included results of these tests in different regions of the country (Bogota, Eje Cafetero, Caribbean, Central East, Pacific, Central South and Llano). The results obtained with the decision tree classification model indicate that it is capable of generating models consistent with the observed reality and the theoretical support, based only on the data stored in the ICFES databases. The authors reported difficulties in the development of the research due to the poor quality of the data in the ICFES databases: certain attributes had to be discarded because their values could not be obtained from other sources, which may in some way have influenced the discovery of the patterns under study, in addition to the considerable resources consumed by the data cleaning and transformation process [13].
The vast majority of studies conducted in Colombia on the subject use the results of the Saber 11 tests as predictor variables for the prediction of student performance in undergraduate programs, but few have been conducted for the prediction of test performance using classification tasks. A single report by [24] used a regression task for the prediction of the subject’s critical reading and mathematics, but the author’s publication does not report the performance of the model generated. Another effort on the subject comes from researcher Ferney Rodriguez, who is carrying out a project for the prediction of academic performance, but to the best of our knowledge, so far there is no report of the results obtained in the aforementioned project. The main objective of the author is to find which variables have the greatest influence on academic performance in the Saber 11 tests. Finally, as far as we know, in our department, there are no studies that use personal, school and socio-demographic information variables from the ICFES database to predict student performance in the Saber 11 tests.

3. Theoretical Framework

3.1. Mathematical Notation

This section shows the mathematical notation used in this document. Vectors will be represented by bold lowercase letters (x) and matrices by bold capital letters (X). The superscript T will denote the transpose of a matrix. The superscript (i) will denote the i-th training example, and the superscript [l] the l-th layer of an ANN.
Vectors are, by default, represented as columns. The notation $a \in \mathbb{R}$ will be used to indicate that an object is a scalar, and the notation $\mathbf{a} \in \mathbb{R}^p$ to indicate a vector of length p. To indicate that an object is a matrix, we will use the convention $\mathbf{A} \in \mathbb{R}^{r \times s}$.
On the other hand, we will use m to represent the number of observations or training examples and n the number of variables available in the data set. We will denote by $X_{ij}$ the value of the j-th variable for the i-th observation, where $i = 1, \ldots, m$ and $j = 1, \ldots, n$. Therefore, i will be used to index the training examples (from 1 through m) and j to index the variables used for training the model (from 1 through n).

3.2. SVM

SVMs are one of the most powerful and widely used learning algorithms. This technique has its roots in statistical learning theory and has shown satisfactory results in tasks ranging from image recognition to text categorisation. It is a method that works well with high-dimensional data and avoids the curse of dimensionality. Alpaydin [25] defines SVMs as methods that allow the model to be written as the sum of the influences of a subset of training instances.
The optimisation objective in SVM is margin maximisation. The margin is defined as the distance between the separation hyperplane (decision boundary) and the training instances that are closest to the hyperplane, which are called support vectors (see Figure 1). The SVM method can be classified into two types: linear SVM and nonlinear SVM. In turn, SVM can be applied for separable and non-separable cases.

3.2.1. Linear SVM for Separable Cases

The linear SVM, also known as the maximum margin classifier, finds a hyperplane with the largest possible margin [27]. If we consider a binary classification problem with m training examples, each example can be denoted as a tuple $(\mathbf{x}_i, y_i)$, for $i = 1, 2, \ldots, m$, where $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{in})^T$. Denoting the two classes ($C_1$ and $C_2$) by −1/+1, we have $y_i = +1$ if $\mathbf{x}_i \in C_1$ and $y_i = -1$ if $\mathbf{x}_i \in C_2$. The decision boundary for this case can be described by the following equation:
$\mathbf{w}^T \mathbf{x}_i + w_0 = 0$
where $\mathbf{w}$ and $w_0$ are the model parameters. The goal of this classifier is to find $\mathbf{w}$ and $w_0$ such that:
$\mathbf{w}^T \mathbf{x}_i + w_0 \geq +1 \quad \text{for } y_i = +1,$
$\mathbf{w}^T \mathbf{x}_i + w_0 \leq -1 \quad \text{for } y_i = -1,$
which can be rewritten as follows:
$y_i(\mathbf{w}^T \mathbf{x}_i + w_0) \geq 1,$
furthermore, a separating hyperplane has the property that:
$y_i(\mathbf{w}^T \mathbf{x}_i + w_0) > 0.$
Thus, it follows that if a separating hyperplane exists, it can be used to construct a classifier where an observation with an unknown class is assigned to a class depending on which side of the hyperplane [28] is located. If we consider the task of constructing a maximum margin hyperplane based on a set of m training observations { x 1 , , x m R n } associated with classes C 1 and C 2 , the maximum margin hyperplane is the solution to the following optimisation problem [28]:
$\max_{\mathbf{w}, w_0} M \quad \text{s.t.} \quad y_i(\mathbf{w}^T \mathbf{x}_i + w_0) \geq M, \; i = 1, \ldots, m$
The constraint (Equation (6)) in the above optimisation problem ensures that each observation is on the correct side of the hyperplane and at a distance M from the hyperplane. Thus, M represents the margin of the hyperplane, and the optimisation problem chooses w 0 , w 1 , , w p to maximise M [28]. Maximising such a margin is equivalent to minimising the following objective function:
$f(\mathbf{w}) = \frac{\|\mathbf{w}\|^2}{2}$
Thus, the learning task for this case can be formalised as the following optimisation problem:
$\min_{\mathbf{w}} \frac{\|\mathbf{w}\|^2}{2} \quad \text{s.t.} \quad y_i(\mathbf{w}^T \mathbf{x}_i + w_0) \geq 1, \; i = 1, \ldots, m$
Finally, considering that the objective function is quadratic and the constraints on the parameters w and w 0 are linear, the Lagrange multiplier method is used to solve the optimisation problem [25]. The new objective function known as the primal formulation is shown in the following equation:
$L_p = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{m} \lambda_i \left( y_i(\mathbf{w}^T \mathbf{x}_i + w_0) - 1 \right).$
Because the optimisation problem remains a complicated task due to the number of parameters it has ($\mathbf{w}$, $w_0$ and $\lambda_i$), it can be simplified by transforming the function so that it depends only on the Lagrange multipliers. This is known as the dual formulation and looks as follows:
$L_d = \sum_{i=1}^{m} \lambda_i - \frac{1}{2}\sum_{i,j} \lambda_i \lambda_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$
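To make the resulting decision rule concrete, here is a minimal sketch (plain Python, with hypothetical weights rather than values fitted by any solver) of how a point is assigned to a class according to which side of the hyperplane it lies on:

```python
# Classify a point by the sign of w^T x + w0.
# The weight vector and bias below are illustrative, not fitted.
w = [2.0, -1.0]   # hypothetical weight vector
w0 = -1.0         # hypothetical bias

def classify(x, w, w0):
    """Return +1 if the point lies on the positive side of the hyperplane, else -1."""
    activation = sum(wj * xj for wj, xj in zip(w, x)) + w0
    return 1 if activation >= 0 else -1

print(classify([2.0, 1.0], w, w0))   # 2*2 - 1 - 1 = 2  -> +1
print(classify([0.0, 2.0], w, w0))   # 0 - 2 - 1 = -3   -> -1
```

In a trained SVM, w and w0 would be recovered from the Lagrange multipliers of the support vectors; the classification rule itself is unchanged.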

3.2.2. Soft-Margin Hyperplane

A maximum margin classifier can be used for classification as long as a separating hyperplane exists. However, in many cases, such a hyperplane does not exist, so we would not have a maximum margin classifier. In this case, the optimisation problem proposed in Equation (6) has no solution with M > 0 [28]. In the above case of linear SVM for separable cases, we assume that error-free decision boundaries are constructed. The formulation of the method can be modified so that it learns a decision boundary that is tolerant of small errors in the training observations (Figure 2). To achieve this goal, a method known as the soft margin approach is used [29]. In this way, it is possible to construct a linear decision boundary even when the classes are not linearly separable.
The objective function (Equation (7)) is applicable in this case; however, the decision boundary no longer satisfies its constraints, so these are modified using a positive slack variable ($\xi_i$) [30], as can be seen below:
$\mathbf{w}^T \mathbf{x}_i + w_0 \geq 1 - \xi_i \quad \text{for } y_i = +1$
$\mathbf{w}^T \mathbf{x}_i + w_0 \leq -1 + \xi_i \quad \text{for } y_i = -1$
The objective function in Equation (7) is modified to penalise the classifier when training examples fall on the wrong side of the decision boundary; the penalised function is given by the following equation:
$f(\mathbf{w}) = \frac{\|\mathbf{w}\|^2}{2} + C \left( \sum_{i=1}^{m} \xi_i \right)^k,$
where C and k are user-specified parameters to penalise the classification error of the training instances. The dual formulation in this case is the same as for the separable case (see Equation (10)); however, the multipliers used for the non-separable case are different.
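A short plain-Python sketch of the soft-margin objective may help fix ideas; the data, weights, C and k below are illustrative (not taken from the paper), and each slack is computed as $\xi_i = \max(0, 1 - y_i(\mathbf{w}^T\mathbf{x}_i + w_0))$:

```python
# Soft-margin objective f(w) = ||w||^2/2 + C * (sum of slacks)^k.
# Illustrative data and parameters only; k = 1 is the common choice.
def soft_margin_objective(X, y, w, w0, C=1.0, k=1):
    norm_sq = sum(wj * wj for wj in w)
    slacks = []
    for x_i, y_i in zip(X, y):
        margin = y_i * (sum(wj * xj for wj, xj in zip(w, x_i)) + w0)
        slacks.append(max(0.0, 1.0 - margin))   # xi_i = 0 when the point is outside the margin
    return norm_sq / 2.0 + C * sum(slacks) ** k

X = [[1.0, 1.0], [-1.0, -1.0], [0.2, 0.0]]
y = [1, -1, 1]
# Third point lies inside the margin, contributing slack 0.8:
print(soft_margin_objective(X, y, w=[1.0, 1.0], w0=0.0, C=1.0))  # 1.0 + 0.8 = 1.8
```

Larger C values penalise margin violations more heavily, trading margin width for training accuracy.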

3.2.3. SVM on Nonlinear Classification Problems

One of the reasons why SVM has gained so much popularity in the field of data mining is its ability to be kernelised for the solution of nonlinear classification problems. To accomplish this task, it is necessary to map the problem to a new space by conducting nonlinear transformations using basis functions and then using a linear model in this new space [25,31]. A basis function ϕ ( · ) will allow the training data to be transformed into a higher dimensionality feature space; however, a drawback with this approach is that creating new features is computationally expensive. To solve this, the kernel trick is used with the following function:
$k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T \phi(\mathbf{x}')$
One of the most commonly used kernels is the radial basis function, also called the Gaussian kernel:
$k(\mathbf{x}, \mathbf{x}') = \exp\left(-\gamma \|\mathbf{x} - \mathbf{x}'\|^2\right)$
where γ is a parameter to be optimised. The term kernel can be interpreted as a similarity function between two training samples. Another function used in SVM to deal with nonlinear problems is the sigmoidal kernel function:
$k(\mathbf{x}, \mathbf{x}') = \tanh(2\mathbf{x}^T \mathbf{x}' + 1)$
where $\tanh(\cdot)$ has the same form as the sigmoidal function, with the difference that its range is between −1 and +1 [25].
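The two kernels above can be written directly as similarity functions between samples. The following is an illustrative Python sketch (the study itself used the kernels built into the R package e1071):

```python
import math

# RBF (Gaussian) and sigmoidal kernels as similarity functions between two samples.
def rbf_kernel(x, z, gamma):
    """exp(-gamma * ||x - z||^2); equals 1.0 for identical points."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, z):
    """tanh(2 x^T z + 1); bounded between -1 and +1."""
    dot = sum(xi * zi for xi, zi in zip(x, z))
    return math.tanh(2 * dot + 1)

print(rbf_kernel([1.0, 0.0], [1.0, 0.0], gamma=0.5))  # identical points -> 1.0
```

Note how γ controls how quickly the RBF similarity decays with distance, which is why it is tuned alongside C in Section 4.4.1.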

3.2.4. SVM for More Than Two Classes

There are currently two versions of SVM that allow the classification of more than two classes in a dataset. First, we have the One-Versus-One classification, which compares each pair of classes by assigning +1 and −1 to each. The test observations are classified using the different classifiers constructed and the number of times the observation is assigned to each of the classes is counted. The other alternative is known as One-Versus-All, where a class (assigned as +1) is compared with the rest of the observations as another class (assigned as −1) [28].
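The One-Versus-One voting scheme can be sketched as follows; the pairwise decisions here are stubbed with a hypothetical dictionary (in practice each entry would come from a trained binary SVM):

```python
from itertools import combinations
from collections import Counter

# One-Versus-One: each pairwise classifier casts a vote for one of its two
# classes, and the test observation is assigned to the class with most votes.
classes = ["low", "mid", "high"]

# Hypothetical pairwise decisions for a single test observation.
pairwise_decision = {
    ("low", "mid"): "mid",
    ("low", "high"): "high",
    ("mid", "high"): "high",
}

votes = Counter(pairwise_decision[pair] for pair in combinations(classes, 2))
predicted = votes.most_common(1)[0][0]
print(predicted)  # "high" wins with two of the three votes
```

With K classes, One-Versus-One trains K(K−1)/2 binary classifiers, whereas One-Versus-All trains only K, which is the usual trade-off between the two schemes.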

3.3. ANN

The ANN models relationships between a set of input signals (input layer) and an output signal (output layer). For such a function, it uses a model derived from our understanding of how the brain responds to sensory input stimuli. Our brain uses a network of interconnected cells called neurons [32]. The structure of an ANN can be seen in Figure 3.
An important feature of ANNs is their structure, which may contain the input layer, several intermediate layers and the output layer. Such intermediate layers are called hidden layers and their nodes are called hidden units. In addition, the network uses activation functions such as the sigmoid function, hyperbolic tangent, Rectified Linear Unit (ReLU) function, among others [29]. The learning process of an ANN consists of two main steps: first, forward-propagation, which consists of computing the model predictions and the error; the second step consists of updating the parameters generated in the previous step to decrease the error. This last step is known as backpropagation.

3.3.1. Forward Propagation

Forward propagation consists of calculating the model output values and the corresponding error between the predictions made and the correct values of the training examples. If we assume an ANN with a single hidden layer, $\mathbf{W}$ and $\mathbf{b}$ are the parameters to be updated, $\mathbf{X}$ are the input features, $\mathbf{Z}$ is the linear function, and $\sigma(\cdot)$ is a nonlinear activation function. The following equations show the calculations performed in the forward propagation process. First, it is applied to the hidden layer of the network:
$\mathbf{Z}^{[1]} = \mathbf{W}^{[1]} \mathbf{X} + \mathbf{b}^{[1]}$
Then, applying the activation function σ [ 1 ] for the units of the hidden layer, we have:
$\mathbf{A}^{[1]} = \sigma^{[1]}(\mathbf{Z}^{[1]})$
Subsequently, the values for the output layer are calculated. First, the linear function Z [ 2 ] and finally the nonlinear activation function:
$\mathbf{Z}^{[2]} = \mathbf{W}^{[2]} \mathbf{A}^{[1]} + \mathbf{b}^{[2]}$
Equation (20) calculates the predictions in the output layer:
$\hat{y} = \mathbf{A}^{[2]} = \sigma^{[2]}(\mathbf{Z}^{[2]})$
The last thing to calculate is the model error using the function known as cross-entropy:
$J = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log a_i^{[2]} + (1 - y_i) \log\left(1 - a_i^{[2]}\right) \right]$
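The forward pass and cross-entropy cost above can be sketched in a few lines of NumPy; the network sizes and data below are illustrative (the paper's ANNs were built with Keras in R), and sigmoid is used in both layers for simplicity:

```python
import numpy as np

# Forward propagation through one hidden layer; X has shape (n_features, m_examples).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, n_in, n_hidden = 4, 3, 5                        # illustrative sizes
X = rng.normal(size=(n_in, m))
W1, b1 = rng.normal(size=(n_hidden, n_in)), np.zeros((n_hidden, 1))
W2, b2 = rng.normal(size=(1, n_hidden)), np.zeros((1, 1))

Z1 = W1 @ X + b1          # Z^[1] = W^[1] X + b^[1]
A1 = sigmoid(Z1)          # A^[1] = sigma(Z^[1])
Z2 = W2 @ A1 + b2         # Z^[2] = W^[2] A^[1] + b^[2]
A2 = sigmoid(Z2)          # y_hat = A^[2]

y = np.array([[1, 0, 1, 0]])
# Cross-entropy cost J (averaged over the m examples).
J = -np.mean(y * np.log(A2) + (1 - y) * np.log(1 - A2))
print(A2.shape, float(J))
```

Because A2 is a sigmoid output, every prediction lies strictly between 0 and 1, so the logarithms in J are always defined.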

3.3.2. Back-Propagation

In the back propagation process, gradients are calculated that allow updating the parameter values in order to decrease the error in the model. Continuing with the example of an ANN with a single hidden layer, the gradients of the output layer are calculated first, where $\partial J / \partial \cdot$ denotes the gradient of the cost with respect to the corresponding quantity:
$\frac{\partial J}{\partial \mathbf{Z}^{[2]}} = \mathbf{A}^{[2]} - \mathbf{Y}$
$\frac{\partial J}{\partial \mathbf{W}^{[2]}} = \frac{1}{m} \frac{\partial J}{\partial \mathbf{Z}^{[2]}} \mathbf{A}^{[1]T}$
$\frac{\partial J}{\partial \mathbf{b}^{[2]}} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial J}{\partial \mathbf{Z}^{[2]}}$
Gradients are then calculated for the only hidden layer of the network:
$\frac{\partial J}{\partial \mathbf{Z}^{[1]}} = \mathbf{W}^{[2]T} \frac{\partial J}{\partial \mathbf{Z}^{[2]}} \odot \sigma^{[1]\prime}(\mathbf{Z}^{[1]})$
$\frac{\partial J}{\partial \mathbf{W}^{[1]}} = \frac{1}{m} \frac{\partial J}{\partial \mathbf{Z}^{[1]}} \mathbf{X}^T$
$\frac{\partial J}{\partial \mathbf{b}^{[1]}} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial J}{\partial \mathbf{Z}^{[1]}}$
Finally, the network parameters are updated using the gradient descent update rule, where $\alpha$ represents the learning rate:
$\mathbf{W}^{[1]} = \mathbf{W}^{[1]} - \alpha \frac{\partial J}{\partial \mathbf{W}^{[1]}}$
$\mathbf{W}^{[2]} = \mathbf{W}^{[2]} - \alpha \frac{\partial J}{\partial \mathbf{W}^{[2]}}$
$\mathbf{b}^{[1]} = \mathbf{b}^{[1]} - \alpha \frac{\partial J}{\partial \mathbf{b}^{[1]}}$
$\mathbf{b}^{[2]} = \mathbf{b}^{[2]} - \alpha \frac{\partial J}{\partial \mathbf{b}^{[2]}}$
Equations (28)–(31) can be summarised in the following two equations:
$\mathbf{W}^{[l]} = \mathbf{W}^{[l]} - \alpha \frac{\partial J}{\partial \mathbf{W}^{[l]}}$
$\mathbf{b}^{[l]} = \mathbf{b}^{[l]} - \alpha \frac{\partial J}{\partial \mathbf{b}^{[l]}}$
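Putting the backpropagation equations and the update rule together, one gradient-descent step for a single-hidden-layer network can be sketched in NumPy (sigmoid activations, so $\sigma'(Z^{[1]}) = A^{[1]}(1 - A^{[1]})$; all sizes and data are illustrative):

```python
import numpy as np

# One gradient-descent step of backpropagation for a single-hidden-layer
# network with sigmoid activations and cross-entropy cost; a minimal sketch.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
m, n_in, n_hidden = 8, 3, 4
X = rng.normal(size=(n_in, m))
y = rng.integers(0, 2, size=(1, m)).astype(float)
W1, b1 = rng.normal(size=(n_hidden, n_in)) * 0.1, np.zeros((n_hidden, 1))
W2, b2 = rng.normal(size=(1, n_hidden)) * 0.1, np.zeros((1, 1))
alpha = 0.1                                        # learning rate

def cost(W1, b1, W2, b2):
    A1 = sigmoid(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    return -np.mean(y * np.log(A2) + (1 - y) * np.log(1 - A2)), A1, A2

J_before, A1, A2 = cost(W1, b1, W2, b2)
dZ2 = A2 - y                                       # dJ/dZ^[2]
dW2 = (1 / m) * dZ2 @ A1.T                         # dJ/dW^[2]
db2 = (1 / m) * dZ2.sum(axis=1, keepdims=True)     # dJ/db^[2]
dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)                 # sigma'(Z^[1]) = A1(1 - A1)
dW1 = (1 / m) * dZ1 @ X.T                          # dJ/dW^[1]
db1 = (1 / m) * dZ1.sum(axis=1, keepdims=True)     # dJ/db^[1]
W1, b1 = W1 - alpha * dW1, b1 - alpha * db1        # gradient descent updates
W2, b2 = W2 - alpha * dW2, b2 - alpha * db2
J_after, _, _ = cost(W1, b1, W2, b2)
print(J_before, J_after)                           # the cost decreases
```

In practice the step is repeated over many epochs; frameworks such as Keras (used in the paper) perform exactly this loop with more elaborate optimisers like Adam.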

4. Materials and Methods

4.1. Dataset

The data used for this research were provided by ICFES [34]. The information provided corresponds to Córdoba during the second semester of 2017. The data contained 80 predictors related to different characteristics: personal, socioeconomic and school. The variables with personal information about the students are briefly described in Table 1, the socioeconomic variables are described in Table 2 and the school-related variables are briefly described in Table 3. In addition, the data contained information on 5 subjects evaluated in the exam (target variables), such as critical reading, mathematics, social sciences, natural sciences and English. The level of academic performance is classified taking into account the score obtained in each subject (see Table 4). The data set is relatively balanced in terms of these classes, as illustrated in Figure 4. The bar chart in this figure shows that the number of records is nearly equal across all performance classes.

4.2. Data Preprocessing

Data preparation was performed using the statistical programming language R [35]. The first step was to split the initial database by each subject. An oversampling technique was used to balance the classes when they were unbalanced. Subsequently, categorical variables were coded to numerical values. Missing values were set to 0. Classes or labels (subjects evaluated in the Saber 11 exam) were coded to integer values and constant attributes were eliminated. The rest of the variables presented different scales; therefore, they were standardised with a mean of 0 and a standard deviation of 1. The standardisation procedure can be expressed in the following equation:
$x_{s_i} = \frac{x_i - \mu}{\sigma}$
where $x_{s_i}$ is the standardised value of the i-th feature, $\mu$ is the sample mean of each attribute, and $\sigma$ corresponds to the standard deviation. Each of the preprocessed databases contained information on 19,545 students and 50 columns, including the target variable. Each was divided into two parts, one for cross-validation (50%) and another for testing (50%).
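The standardisation step can be sketched as follows; the preprocessing in the paper was done in R, so this is an illustrative Python equivalent using the population standard deviation on a toy column of values:

```python
from statistics import mean, pstdev

# Standardise a column of attribute values to mean 0 and standard deviation 1,
# i.e. x_s = (x - mu) / sigma.
def standardise(values):
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

col = [10.0, 20.0, 30.0, 40.0]   # illustrative attribute values
z = standardise(col)
print([round(v, 3) for v in z])
```

After this transformation every attribute contributes on the same scale, which matters for distance- and margin-based methods such as SVM.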

4.3. Characterisation of the Evaluated Students

The preprocessed data set was used to characterise the students who took the Saber 11 tests in the second period of 2017. A secondary variable (age) was created from the date of birth. Descriptive statistics were used to know the average and standard deviation of numerical variables and the frequency distribution for qualitative variables. The factors taken into account to develop the experiments were classified according to the ICFES in personal information variables as seen in Table 1. The socioeconomic variables are shown in Table 2. The characteristics related to the educational institutions where students attend are shown in Table 3.

4.4. Experimental Setup

The descriptions of the experiments carried out in this research are presented here. Two data mining techniques were applied, and the algorithms were coded in the R statistical programming language [35].

4.4.1. Experiment 1: Tuning SVM

SVMs were implemented using the package e1071 [36]. This package allows the optimisation of parameters to create a support vector machine model. On the other hand, the kernel trick was used to create non-linear combinations of the original features to be projected into a higher dimensional space. The kernels used were radial, linear and sigmoid. Different regularisation parameters C and γ were evaluated to optimise the model. The cost (C) values used were 0.01, 0.1, 1, 10, 100 and 1000. The γ values used were $5 \times 10^{-21}, 5 \times 10^{-19}, 5 \times 10^{-17}, 5 \times 10^{-15}, 5 \times 10^{-13}, 5 \times 10^{-11}, 5 \times 10^{-9}, 5 \times 10^{-7}, 5 \times 10^{-5}, 5 \times 10^{-3}, 5 \times 10^{-1}, 5 \times 10^{1}$ and $5 \times 10^{3}$.
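The size of the tuning task can be made explicit by enumerating the hyperparameter grid and the cross-validation folds; the sketch below is in Python for illustration (the actual fitting was done with e1071 in R), and the fold-splitting helper is a simplified stand-in for the five-fold procedure of Section 4.5:

```python
from itertools import product

# Enumerate the (C, gamma) grid described above.
C_values = [0.01, 0.1, 1, 10, 100, 1000]
gamma_values = [5 * 10 ** e for e in range(-21, 4, 2)]  # 5e-21, 5e-19, ..., 5e3
grid = list(product(C_values, gamma_values))
print(len(grid))  # 6 C values x 13 gamma values = 78 candidate models

def five_fold_indices(n_examples, k=5):
    """Split example indices into k roughly equal, disjoint folds."""
    return [list(range(i, n_examples, k)) for i in range(k)]

folds = five_fold_indices(20)
print([len(f) for f in folds])
```

Each of the 78 candidate models is then evaluated once per fold, so the grid search implies 78 × 5 = 390 training runs per subject.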

4.4.2. Experiment 2: Tuning ANN

ANNs were created using the RStudio interface of the Keras [37] library, a high-level API that uses TensorFlow for its execution. TensorFlow is a machine learning platform developed by Google [38]. Regarding the network configuration, a single hidden layer was established. For optimisation, different combinations of hyperparameters (units in the hidden layer and λ values for regularisation) were tested. The ReLU activation function was used for the hidden layer. The weights were updated over 10 epochs with batches of 32 training examples. The optimisation method called Adam [39] was used with an adaptive learning rate. The multilayer network was optimised using categorical cross-entropy as a cost or error function. The softmax activation function was used in the output layer. The λ values used were 0.001, 0.01, 0.05, 0.25 and 0.5; and the hidden layer sizes were 30, 50, 80 and 120.

4.5. Evaluation

We adopted the K-fold cross-validation method as described by Mohri, Rostamizadeh and Talwalkar [40]. This approach was used to find the optimal parameters for both SVM and ANN. This method allows dividing the training set into five subsets using four for training and one for validation. The combination of hyperparameters with the best resulting average performance is used for a single evaluation on the test set [40]. There are different metrics to evaluate the performance of data mining models. The metric to use will depend on whether the task to be performed is regression or classification. The metric commonly used to evaluate classification models is accuracy, which is established as the percentage of correctly classified examples among the total number of classified examples. Higher accuracy means higher model performance. The accuracy can be expressed in the following formula:
Accuracy = (TP + TN) / (TP + FN + FP + TN)
where TP denotes the number of true positives, TN true negatives, FN false negatives and FP false positives.
In addition, we used the F1-score, a metric that combines precision and recall to evaluate classification models. The F1-score can be calculated as follows:
F1 = 2 × (precision × recall) / (precision + recall) = 2TP / (2TP + FN + FP)
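Both metrics follow directly from the confusion-matrix counts; a minimal Python check with illustrative counts (not the study's results), which also verifies that the two forms of the F1 formula agree:

```python
def accuracy(tp, tn, fp, fn):
    # Proportion of correctly classified examples among all examples.
    return (tp + tn) / (tp + tn + fp + fn)

def f1(tp, fp, fn):
    # Harmonic mean of precision and recall.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

tp, tn, fp, fn = 80, 70, 10, 20      # illustrative counts
print(accuracy(tp, tn, fp, fn))       # 150/180
print(f1(tp, fp, fn))                 # 160/190

# The harmonic-mean form equals the count form 2TP / (2TP + FN + FP).
assert abs(f1(tp, fp, fn) - 2 * tp / (2 * tp + fn + fp)) < 1e-12
```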

5. Results

5.1. Characterisation of the Dataset

The characterisation of the data was performed using the R language. Averages and standard deviations were calculated for quantitative variables such as age and the individual socioeconomic index, while qualitative variables were analysed using frequencies. It is important to highlight that the subjects originally presented four performance levels, but levels 2 and 3 were combined due to the small amount of data in one of the two intermediate classes.
The average age of the students was 18 years with a standard deviation (SD) of 1.8. The average socioeconomic index among the students was 45.6 with an SD of 9.0. The most common type of document was the identity card (86.5%), followed by the citizenship card (10.8%) and civil registration (2.7%). Regarding gender, 54.3% were women and 45.7% were men. The distribution of ethnicity among the students was: Zenú 9.9%, Afro-descendant 2.4%, Emberá 0.12%, Raizal 0.04%, Wayuu 0.035%, Arhuaco 0.005% and another ethnic group 0.44%. Eighty-seven percent of the students did not belong to any minority ethnic group. The most common municipality of residence was Montería with 26.8%, followed by Lorica with 7.8% (see Figure 5); this reflects Montería being the capital city of the Córdoba department. The frequencies for all the municipalities of Córdoba are shown in Figure 5. The highest educational levels for both fathers and mothers were incomplete primary and complete secondary school; the frequency percentages for these two factors can be seen in Figure 6. The most common type of work for fathers was farmer or day labourer (23.1%), while among mothers the most common occupation was housewife (58.6%). The most frequent socioeconomic stratum was stratum 1 with 60.0%, followed by stratum 2 with 20.5%, stratum 3 with 7.1% and no stratum with 4.5% (see Figure 7).
The most common household size was 3–4 people (38.1%), followed by households of 5 or 6 people (37.7%); the lowest percentage (4.8%) corresponded to households with only 1 or 2 people. Regarding the number of rooms in the home, the most frequent value was two rooms (43.7%), followed by three rooms (33.2%), four rooms (11%), one room (6.7%), five rooms (2.9%) and more than five rooms (1.35%). With respect to the technological variables, 30.1% had access to a computer, 73.8% had a washing machine, 28.1% had a microwave oven, 65.5% had a closed-circuit TV service and only 8.2% had a video game console. Regarding daily Internet use, 27.6% of students spent between 30 and 60 min online, 24.0% between one and three hours, 22.8% 30 min or less, 11.3% more than three hours, and 10% did not use the service.
The possession of transportation vehicles is also one of the factors collected by the ICFES and taken into account for the prediction of academic performance in the Saber 11 tests; the results show that 10.9% of households have a car and 55.2% have a motorcycle. In relation to food, the most frequent consumption level was once or twice per week for milk and dairy products (45.3%), for meat, fish and eggs (35.7%), and for cereals, fruits and legumes (47.0%). Table 5 shows all the frequencies of weekly food consumption.

5.2. SVM Models

The results of applying support vector machines to the ICFES dataset for predicting academic performance in the Saber 11 tests are shown here. The development of data mining models is largely empirical: different parameters are tested and those that optimise the model are chosen, and cross-validation is useful for this purpose. Table 6, Table 7, Table 8, Table 9 and Table 10 show the cross-validation results and the optimal parameter combination for each model; the parameter combination with the highest accuracy was chosen. In addition, the accuracy and F1-score of the models on the test set can be seen in Table 11 and Table 12.
The performance of the SVM models for predicting academic achievement varied with the subject tested and the type of kernel used. The best performance was for critical reading, which obtained an accuracy of 93.5% for all three kernel types; with respect to the F1-score, the best result was for the sigmoid kernel with 93.4%. For mathematics, the accuracy was 82.1% for both the linear and sigmoid kernels, and the F1-score was 81.7% for the linear kernel. The social sciences model obtained an accuracy of 72.9% and an F1-score of 72.8% for the linear kernel. The natural sciences model obtained an accuracy of 81.1% and an F1-score of 81.0% for both the linear and sigmoid kernels. Finally, for English, the best performance corresponded to the linear and sigmoid kernels, with an accuracy of 79.2% and an F1-score of 78.7% (see Table 11).
The optimal values of the regularisation parameter C that minimise the error in the model are found in Table 13.
Another parameter tuned to optimise the model was γ; note that this parameter applies only to the radial (Gaussian) kernel. The results of the first experiment show that the optimal value of γ that minimises the model error was 5000 (5 × 10³) for critical reading, mathematics and natural sciences, with C = 0.01. Such a high γ value can cause severe overfitting because the decision boundary becomes highly complex: the influence of each training vector is extremely localised, so the boundary is highly sensitive to the position of each individual training vector. Nevertheless, the strong regularisation introduced by the low C value counteracts this effect.
On the other hand, for social sciences and English, the optimal value of γ was 0.005 (5 × 10⁻³). The combined effect of a small γ and C = 10 causes the SVM to create a relatively smooth decision boundary (due to the small γ) while still penalising training errors (due to the higher C value), tolerating some misclassifications while keeping the training error low; the higher the C value, the less regularisation there is. This trade-off between hyperparameters aims to achieve a model that generalises properly without either overfitting or underfitting.
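The locality argument can be made concrete: the RBF kernel value k(x, z) = exp(−γ‖x − z‖²) collapses towards zero for any non-identical pair when γ is large, so each support vector influences only its immediate neighbourhood, while a small γ keeps even distant points similar. A small numeric check (illustrative values, not the study's data):

```python
import math

def rbf(x, z, gamma):
    """RBF (Gaussian) kernel: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = (0.0, 0.0), (1.0, 1.0)   # squared distance = 2

# Large gamma (5e3, as for critical reading): similarity vanishes,
# so each training vector is effectively isolated.
print(rbf(x, z, 5e3))

# Small gamma (5e-3, as for social sciences/English): distant points
# still look similar, giving a smooth decision boundary.
print(rbf(x, z, 5e-3))
```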

5.3. ANN Models

The results of applying a single-hidden-layer ANN to the preprocessed datasets are presented below. There is no ideal network architecture for all applications; a suitable architecture is found empirically through cross-validation. Table 14, Table 15, Table 16, Table 17 and Table 18 show the cross-validation results used to choose the optimal parameters (regularisation parameter and hidden layer size) and thereby the best model to evaluate on the test set; the combination of parameters with the highest accuracy was chosen. The optimised models were then evaluated on the corresponding test sets: Table 19 shows that the best performance was for critical reading, where the model achieved 93.6% accuracy and an F1-score of 93.2%, while the worst performance was for mathematics, with an accuracy of 82.2% and an F1-score of 82.5%. Regarding the hyperparameters, different values were found: the most common λ value was 0.01, and the size of the hidden layer varied between 30 and 80 units, most frequently 30 or 50 units.

6. Discussion

The functional dependence between personal, socioeconomic and school variables and academic performance in the Saber 11 tests has not been widely explored in our country. Moreover, using categorical data to train data mining models is a major challenge when seeking decent results. An added factor is that ANNs tend to work better on unstructured data such as text, images and videos, so applying ANNs to tabular (structured) data is also considered a challenge for educational data mining research. Nevertheless, this discipline is useful for finding patterns in data coming from educational institutions.
The two data mining techniques applied in the present research obtained an average accuracy above 80%, which is an important step in the development of educational data mining models in our country. The present work focused on applying ANN and SVM to predict the results of the Saber 11 tests. Orjuela [24] used SVM for the subjects of language and mathematics, but addressed a regression task and did not report the accuracy or performance of the model on a test set.
The performance of the algorithms was aided by the combination of two classes (performance levels), because the classes were unbalanced, and the speed of training was aided by standardising the input features. Different studies worldwide have found data preprocessing useful for increasing the performance and convergence of algorithms [15,41,42]. However, preprocessing the dataset does not always improve model performance; this is evidenced in the work by Ahamed [43], who found higher accuracy (78% vs. 70%) on the non-preprocessed dataset using SVM. SVMs are algorithms that find optimal separating hyperplanes to classify the dataset. The results obtained from applying support vector machines to the ICFES datasets show that it is not necessary to use a special type of kernel to increase the dimensionality of the input feature space: the three kernels tested produced no meaningful differences in the results, which indicates that the dimensionality of the input features is sufficient to establish a separation between the classes.
ANNs have been successfully used for solving classification and regression problems that involve large-scale data sets. Indeed, several empirical studies have evidenced that the larger the dataset, the more accurate ANNs are, often outperforming other machine learning methods ([44], p. 3).
Concerning the results of the ANN on the subjects evaluated by the ICFES, it is noteworthy that the performance of the algorithm is very similar across all combinations of the regularisation parameter and the number of units in the hidden layer. ANNs may find nonlinear relationships between input features and predictions of academic performance classes regardless of the size of the hidden layer. The outcomes of this study reveal that, in all these settings, the new vector representation of the original input features formed in the hidden layer enhances the accuracy of the predictions computed in the output layer: even if the relationship between the input variables and the target variable is not linear, the hidden layer can learn a representation in which the relationship between the hidden-layer outputs and the target variable becomes approximately linear.
In the experiments performed, only one hidden layer was used because the data in the present investigation are tabular, and a deeper network is not necessary for the network to learn the weights that minimise the error or cost function. The results show that a maximum of 50 hidden units (in some cases only 30) is required to find a functional dependence between the input variables and the predictions made by the network. Using very deep networks on tabular data can lead to overfitting, with poor performance on the test set.
The present study shows that ANN performed better than SVM. This can be observed in natural sciences (93% vs. 81%), social sciences (86% vs. 73%) and English (85% vs. 79%); for the remaining subjects, the performance was similar (see Table 20). Academic performance in critical reading and mathematics could be correctly predicted approximately 93% and 82% of the time, respectively, by both ANN and SVM. The present research does not use statistical tests to compare the results of the algorithms because the metric used is not an average; in cases where a regression task is used and the metric is the mean squared error, tests such as Student's t-test or the Mann–Whitney U-test can be used to compare means between two data mining techniques.
Finally, besides their accuracy, ANNs offer another advantage: they provide a probability output, which is useful for decision-makers and stakeholders. For instance, a student with a low probability (e.g., 15%) of achieving a B+ in English proficiency requires more training, and possibly an intervention plan, compared with a student whose probability of attaining this level is higher (e.g., 65%). This probabilistic output gives ANNs an edge over SVMs, as the latter do not inherently provide probability information with their predictions. Nonetheless, this limitation of SVM may be mitigated by using Platt scaling [45].
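Platt scaling fits a sigmoid to the SVM's decision values to obtain class probabilities. A minimal sketch, assuming Python with scikit-learn (the study itself used R): SVC's `probability=True` option performs Platt-style calibration internally via cross-validation on the decision values.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# probability=True enables Platt-style calibration: a sigmoid is fitted
# to the decision values so the SVM can output class probabilities.
clf = SVC(kernel="linear", probability=True, random_state=0).fit(X, y)
proba = clf.predict_proba(X[:3])
print(proba)  # one row per student-like example; each row sums to 1
```

A decision-maker could then flag, for instance, examples whose probability of the target class falls below a chosen threshold, as in the B+ scenario above.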

7. Conclusions

7.1. General Considerations

The prediction of academic performance has been investigated for a long time in different parts of the world, and the increased use of information technologies has allowed educational data mining to become a reference for improving education in educational institutions. The main objective of this research was to develop and implement two data mining models for predicting academic performance in the Saber 11 tests using personal, school and socioeconomic information as predictor variables. SVM and ANN turned out to be suitable tools for predicting the academic performance of high school students in the Saber 11 exam: on average, the subjects evaluated in the exam were correctly classified 82% and 88% of the time, respectively. ANNs perform well in many classification problems, and this research demonstrated that the predictive performance of ANN is superior to that of SVM on the Saber 11 dataset. With the results obtained in this research, a platform can be developed that allows secondary education institutions to monitor students and provide them with support to improve in the subjects in which they are having difficulties.

7.2. Recommendations

This research suggests some recommendations for future work. For example, other data mining techniques (decision trees, random forests, Bayesian networks, etc.) or a combination of algorithms could be used to improve performance. Another option is to apply a dimensionality reduction technique such as PCA to find the principal components, in order to improve training speed and model performance. Data could also be collected for students from all over the country over several years; the present research only used data from Córdoba in the year 2017, and with a larger amount of data the functional dependence could be stronger, which would significantly improve the results. The results could also be improved by using different optimisation functions, as well as other techniques to cope with overfitting (e.g., dropout and early stopping). The present work used Adam (which has been shown to work very well on structured data), but there are other methods, such as gradient descent, RMSprop and Adaline, which were not used due to computational limitations. Finally, it would be interesting to determine which latent factors influence academic performance in the Saber 11 tests in order to generate strategies to help improve the quality of education in Colombia. To this end, we will adopt matrix factorisation algorithms (e.g., singular value decomposition, non-negative matrix factorisation, etc.) to compute those hidden factors from the input variables.
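As a sketch of the PCA recommendation (Python with scikit-learn rather than R, and a hypothetical 95% variance threshold chosen purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the student feature matrix.
X, _ = make_classification(n_samples=300, n_features=20, random_state=0)

# Standardise first (PCA is scale-sensitive), then keep as many principal
# components as needed to retain 95% of the variance.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95).fit(X_std)
X_reduced = pca.transform(X_std)
print(X_reduced.shape[1], "components retain 95% of the variance")
```

The reduced matrix `X_reduced` would then replace the original features when training the SVM or ANN.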

Author Contributions

Conceptualisation, W.H. and I.C.-C.; methodology, W.H.; software, W.H. and I.C.-C.; validation, W.H. and I.C.-C.; formal analysis, W.H. and I.C.-C.; investigation, W.H. and I.C.-C.; resources, W.H. and I.C.-C.; data curation, W.H. and I.C.-C.; writing—original draft preparation, W.H.; writing—review and editing, W.H. and I.C.-C.; visualisation, W.H.; supervision, W.H.; project administration, W.H.; funding acquisition, W.H. All authors have read and agreed to the published version of the manuscript.

Funding

Universidad Cooperativa de Colombia (Grant No. INV3569).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this research are available on the ICFES website, which can be accessed at https://www.icfes.gov.co/data-icfes, accessed on 5 June 2022.

Acknowledgments

The authors would like to thank Universidad Cooperativa de Colombia and Universidad de Córdoba for their support during the research. Caicedo-Castro thanks the Lord Jesus Christ for blessing this study. Finally, the authors thank the editors and the anonymous referees for their comments, which improved the quality of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Banerjee, P.A. A systematic review of factors linked to poor academic performance of disadvantaged students in science and maths in schools. Cogent Educ. 2016, 3, 1178441. [Google Scholar] [CrossRef]
  2. García-Tinisaray, D. Construction of a Model to Determine Students’ Academic Performance Based on Learning Analytics, Using Multivariate Techniques. Ph.D. Thesis, Universidad de Sevilla, Sevilla, Spain, 2015. [Google Scholar]
  3. Rodriguez, D.; Carrasquillo, A.; Garcia, E.; Howitt, D. Factors that challenge English learners and increase their dropout rates: Recommendations from the field. Int. J. Biling. Educ. Biling. 2022, 25, 878–894. [Google Scholar] [CrossRef]
  4. OCDE. PISA 2022 Results Vol I: The State of Learning and Equity in Education; OCDE: Paris, France, 2022. [Google Scholar]
  5. OCDE. PISA 2022 Results Vol II: Learning During—and From—Disruption; OCDE: Paris, France, 2022. [Google Scholar]
  6. OCDE. Who are the Low-Performing Students? OCDE: Paris, France, 2016. [Google Scholar]
  7. Chica-Gómez, S.; Galvis-Gutiérrez, D.; Ramírez-Hassan, A. Determinants of Academic Performance in Colombia: Saber 11 Tests, 2009. Rev. Univ. Eafit 2010, 46, 48–72. [Google Scholar]
  8. Caicedo-Castro, I.; Velez-Langs, O.; Macea-Anaya, M.; Castaño-Rivera, S.; Catro-Púche, R. Early Risk Detection of Bachelor’s Student Withdrawal or Long-Term Retention. In Proceedings of the IARIA Congress 2022: International Conference on Technical Advances and Human Consequences, Nice, France, 24–28 July 2022; pp. 76–84. [Google Scholar]
  9. Caicedo-Castro, I.; Macea-Anaya, M.; Castaño-Rivera, S. Forecasting Failure Risk in Early Mathematics and Physical Science Courses in the Bachelor’s Degree in Engineering. In Proceedings of the IARIA Congress 2023: International Conference on Technical Advances and Human Consequences, Valencia, Spain, 13–17 November 2023; pp. 177–187. [Google Scholar]
  10. Merchán-Rubiano, S.; Beltrán-Gómez, A.; Duarte-García, J. Formulation of a Predictive Model for Academic Performance based on Students’ Academic and Demographic Data. In Proceedings of the IEEE Frontiers in Educacion Conference, El Paso, TX, USA, 21–24 October 2015; pp. 1–7. [Google Scholar] [CrossRef]
  11. Merchán-Rubiano, S.; Beltrán-Gómez, A.; Duarte-García, J. Analysis of Data Mining Techniques for Constructing a Predictive Model for Academic Performance. IEEE Lat. Am. Trans. 2016, 14, 2783–2788. [Google Scholar] [CrossRef]
  12. Merchán-Rubiano, S.; Beltrán-Gómez, A.; Duarte-García, J. Engineering Students’ Academic Performance Prediction using ICFES Test Scores and Demographic Data. Ing. Solidar. 2017, 21, 53–61. [Google Scholar] [CrossRef]
  13. Timarán-Pereira, R.; Hidalgo-Troya, A.; Caicedo-Zambrano, J.; Hernandez-Arteaga, I.; Alvarado-Pérez, J. Discovery of Academic Performance Patterns in Critical Reading Competence. In Proceedings of the 13th LACCEI Anual International Conference, Santo Domingo, Dominican Republic, 29–31 July 2015; pp. 1–9. [Google Scholar] [CrossRef]
  14. Timarán-Pereira, R.; Hidalgo-Troya, A.; Caicedo-Zambrano, J.; Hernandez-Arteaga, I.; Alvarado-Pérez, J. Discovery of Academic Performance Patterns; Ediciones Universidad Cooperativa de Colombia: Bogotá, Colombia, 2016. [Google Scholar] [CrossRef]
  15. Gerritsen, L. Predicting Student Performance with Neural Networks. 2017. Available online: https://api.semanticscholar.org/CorpusID:196154149 (accessed on 10 April 2024).
  16. Cuevas-Redondo, M.A.; Estévez-Bravo, M. Analysis Techniques for the Improvement and Prediction of Academic Performance. Ph.D. Thesis, Universidad Complutense de Madrid, Madrid, Spain, 2017. [Google Scholar]
  17. Abu Saa, A. Educational Data Mining & Students’ Performance Prediction. Int. J. Adv. Comput. Sci. Appl. 2016, 7, 212–220. [Google Scholar] [CrossRef]
  18. Kabakchieva, D. Student Performance Prediction by Using Data Mining Classification Algorithms. Int. J. Comput. Sci. Manag. Res. 2012, 1, 686–690. [Google Scholar]
  19. Romero, C.; Zafra, A.; Gibaja, E.; Luque, M.; Ventura, S. Predicción del Rendimiento Académico en las Nuevas Titulaciones de Grado de la EPS de la Universidad de Córdoba. In Proceedings of the Jornadas de Enseñanza Universitaria de la Informática (JENUI), Córdoba, Spain, 10–13 July 2012; pp. 57–64. [Google Scholar]
  20. De Melo-Junior, G.; Oliveira, S.; Ferreira, C.; Vasconcelos, E.; Calixto, W.; Furriel, G. Evaluation Techniques of Machine Learning in Task of Reprovation Prediction of Technical High School Students. In Proceedings of the Conference on Electrical, Electronics Engineering, Information and Communication Technologies, Karachi, Pakistan, 30–31 December 2017; pp. 1–7. [Google Scholar] [CrossRef]
  21. Menacho-Chiok, C. Predicción del rendimiento académico aplicando técnicas de minería de datos. An. Científicos 2017, 78, 26–33. [Google Scholar] [CrossRef]
  22. Larrarte-Torres, C. Educational Data Mining: Analysis of the Factors that Influenced the Students’ Performance in the Test Called Saber in Cundinamarca (Colombia) between 2017 and 2021. Master’s Thesis, Escuela Colombiana de Ingeniería Julio Garavito, Bogotá, Colombia, 2022. [Google Scholar]
  23. Abadía Alvarado, L.K.; Gómez Soler, S.C.; Cifuentes González, J. Gone with the pandemic: How did COVID-19 affect the academic performance of Colombian students? Int. J. Educ. Dev. 2023, 100, 102783. [Google Scholar] [CrossRef] [PubMed]
  24. Orjuela, J. Determination of Performance in State Tests for High School Education in Colombia; Technical Report; Instituto Colombiano para la Evaluación de la Educación (ICFES): Bogotá, Colombia, 2010. [Google Scholar]
  25. Alpaydin, E. Introduction to Machine Learning; MIT Press: Cambridge, MA, USA, 2010. [Google Scholar]
  26. Cervantes, J. Support vector machine classification for large data sets via minimum enclosing ball clustering. Neurocomputing 2008, 71, 611–619. [Google Scholar] [CrossRef]
  27. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  28. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning with Applications in R; Springer: New York, NY, USA, 2013. [Google Scholar]
  29. Tan, P.; Steinbach, M.; Kumar, V. Introduction to Data Mining; Pearson Education: Boston, MA, USA, 2006. [Google Scholar]
  30. Shilton, A. Design and Training of Support Vector Machines. Ph.D. Thesis, The University of Melbourne, Melbourne, VIC, Australia, 2006. [Google Scholar]
  31. Shawe-Taylor, J.; Cristianini, N. Kernel Methods for Pattern Analysis; Cambridge University Press: New York, NY, USA, 2004. [Google Scholar]
  32. Lantz, B. Machine Learning with R; Packt Publishing: Birmingham, UK, 2015; pp. 1–396. [Google Scholar]
  33. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning. Data Mining, Inference, and Prediction; Springer: Palo Alto, CA, USA, 2008; pp. 1–700. [Google Scholar]
  34. ICFES. Database: Saber 11 Tests; ICFES: Bogotá, Colombia, 2018.
  35. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2018. [Google Scholar]
  36. Meyer, D.; Dimitriadou, E.; Hornik, K.; Weingessel, A.; Leisch, F. e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), R package version 1.7-0; R Foundation for Statistical Computing: Vienna, Austria, 2018. [Google Scholar]
  37. Allaire, J.; Chollet, F. Keras: R Interface to ‘Keras’, R package version 2.1.6; R Foundation for Statistical Computing: Vienna, Austria, 2018. [Google Scholar]
  38. Allaire, J.; Tang, Y. Tensorflow: R Interface to ‘TensorFlow’, R package version 1.8; R Foundation for Statistical Computing: Vienna, Austria, 2018. [Google Scholar]
  39. Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  40. Mohri, M.; Rostamizadeh, A.; Talwalkar, A. Foundations of Machine Learning (Adaptive Computation and Machine Learning); MIT Press: Cambridge, MA, USA, 2012. [Google Scholar]
  41. Eashwar, K.B.; Venkatesan, R.; Ganesh, D. Student Performance Prediction Using SVM. Int. J. Mech. Eng. Technol. 2017, 8, 649–662. [Google Scholar]
  42. Sembiring, S.; Zarlis, M.; Hartama, D.; Ramliana, S.; Wani, E. Prediction of Student Academic Performance By an Application of Data Mining Techniques. In Proceedings of the 2011 International Conference on Management and Artificial Intelligence, Zhengzhou, China, 8–10 August 2011; Volume 6, pp. 110–114. [Google Scholar]
  43. Ahamed, A.T.M.S.; Mahmood, N.T.; Rahman, R.M. An intelligent system to predict academic performance based on different factors during adolescence. J. Inf. Telecommun. 2017, 1, 155–175. [Google Scholar] [CrossRef]
  44. Aggarwal, C.C. Neural Networks and Deep Learning; Springer: Berlin/Heidelberg, Germany, 2018; p. 497. [Google Scholar]
  45. Platt, J.C. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In Advances in Large Margin Classifiers; MIT Press: Cambridge, MA, USA, 1999; pp. 61–74. [Google Scholar]
Figure 1. Decision boundary, margin and support vectors in SVM [26].
Figure 2. Error-tolerant soft margin hyperplane. When an instance is classified, four possible cases may appear: (a) indicates that the instance is on the correct side and away from the margin; (b) indicates that the instance is on the correct side and on the margin; (c) indicates that the instance is on the correct side but inside the margin. Finally, (d) indicates that the instance is on the wrong side: it is a misclassification [25].
Figure 3. Schematic representation of an ANN with a single hidden layer. The blue nodes represent the input variables (input layer), the red nodes represent the hidden layer, and the yellow nodes represent the output variables (output layer) [33]. In this shallow network, with a single hidden layer, the inputs to the artificial neurons in the output layer are not the original variables. Hence, this fully connected network maps the input variables in a p–dimensional space to an M–dimensional space, where there might exist a hyperplane decision boundary to classify the input vectors. Thus, ANNs are known as universal approximators, where the output layers calculate their predictions based on a new representation of the original input variables obtained in the hidden layers.
Figure 4. Number of records per class distributed along the 5 subjects evaluated in the test Saber 11.
Figure 5. Frequency distribution of the municipality of residence.
Figure 6. Frequency distribution for parents’ educational levels.
Figure 7. Distribution of frequencies of the variable socioeconomic stratum. In Colombia, there are six categories used to classify the population based on their wealth and living conditions. Furthermore, there are sectors of the population that are not classified within these categories (e.g., rural areas, informal settlements, indigenous and Afro-Colombian communities, and so forth). Instead, these sectors are classified as No stratum in the bar chart.
Table 1. Variables of personal information collected from students in the Saber 11 test.
Variable | Description
Estu_tipo_documento | Type of document
Estu_genero | Gender
Estu_fecha_nacimiento | Date of birth
Estu_tiene_etnia | Does the student have an ethnic group?
Estu_etnia | Ethnicity
Estu_mcpio_reside | Municipality of residence
Table 2. Socioeconomic variables collected from students in the Saber 11 test.
Variable | Description
Fami_educacion_padre | Father's highest educational level
Fami_educacion_madre | Mother's highest educational level
Fami_estrato_vivienda | Socioeconomic stratum of housing
Fami_personas_hogas | Number of persons living in the household
Fami_cuartos_hogar | Number of rooms owned by the household
Fami_tiene_servicio_tv | Closed-circuit television service
Fami_tiene_computador | Computer ownership in the home
Fami_tiene_lavadora | Washing machine ownership in the home
Fami_tiene_horno | Ownership of microwave oven
Fami_tiene_automovil | Car ownership
Fami_tiene_motocicleta | Motorcycle ownership
Fami_tiene_consola | Video game console ownership
Fami_num_libros | Number of books in the home
Fami_come_leche_derivados | Consumption of milk and milk products
Fami_come_carne_pescado_huevos | Consumption of meat, fish and eggs
Fami_come_cereal_frutos_legumbres | Consumption of cereals, fruits and legumes
Fami_trabajo_labor_padre | Father's job in the last year
Fami_trabajo_labor_madre | Mother's job in the last year
Fami_situación_económica | Economic situation
Estu_dedicación_lectura_diaria | Daily reading time
Estu_dedicación_internet | Time spent on the Internet
Estu_horas_semana_trabaja | Time worked
Estu_tipo_remuneración | Type of compensation received
Estu_inse_individual | Student socioeconomic index
Estu_nse_individual | Student's socioeconomic level
Table 3. School-associated variables collected from students in the Saber 11 test.

Variable | Description
Cole_cod_icfes | ICFES school code
Cole_cod_DANE | DANE school code
Cole_genero | School gender
Cole_naturaleza | Nature of the school
Cole_calendario | School calendar
Cole_bilingue | Indicates if the school is bilingual
Cole_caracter | Character of the school
Cole_area_ubicacion | Urban or rural area of the school
Cole_jornada | School day
Estu_nse_establecimiento | Socioeconomic level of the school
Table 4. Classification of the level of performance in the Saber 11 test.

Subject | Score Interval | Performance Level | Class
Critical reading and mathematics | 0–35 | 1 (Low) | 0
Critical reading and mathematics | 36–70 | 2 (Medium) | 1
Critical reading and mathematics | 71–100 | 3 (High) | 2
Social sciences and natural sciences | 0–40 | 1 (Low) | 0
Social sciences and natural sciences | 41–70 | 2 (Medium) | 1
Social sciences and natural sciences | 71–100 | 3 (High) | 2
English | 0–57 | A1 | 0
English | 58–68 | A2 | 1
English | 69–79 | B1 | 2
English | 80–100 | B+ | 3
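The interval-to-class mapping in Table 4 is mechanical, so it can be encoded as a small helper. The function name and subject keys below are illustrative, not part of the published pipeline:

```python
def performance_class(subject: str, score: float) -> int:
    """Map a raw Saber 11 score (0-100) to the class label of Table 4."""
    if subject == "english":
        # CEFR-style bands A1, A2, B1, B+ map to classes 0-3.
        for upper, label in ((57, 0), (68, 1), (79, 2)):
            if score <= upper:
                return label
        return 3
    # Critical reading and mathematics split low performance at 35;
    # social and natural sciences split it at 40. Medium ends at 70.
    low_cut = 35 if subject in ("critical_reading", "mathematics") else 40
    if score <= low_cut:
        return 0
    return 1 if score <= 70 else 2
```

For example, a mathematics score of 36 falls in the 36–70 interval and is assigned class 1, while the same score in natural sciences stays in class 0.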
Table 5. Frequency of weekly food consumption by students (% of students).

Food | Rarely | 1 or 2 Times | 3 to 5 Times | Every Day
Milk and dairy products | 11.5 | 45.3 | 23.6 | 16.0
Meat, fish and eggs | 4.8 | 35.7 | 35.1 | 21.6
Cereals, fruits and legumes | 23.8 | 47.0 | 18.7 | 6.8
Table 6. C and γ optimisation for radial kernel in critical reading.

Gamma (γ) | C = 0.01 | C = 0.1 | C = 1 | C = 10 | C = 100 | C = 1000
5 × 10^-21 | 0.936 | 0.936 | 0.936 | 0.936 | 0.936 | 0.936
5 × 10^-19 | 0.936 | 0.936 | 0.936 | 0.936 | 0.936 | 0.936
5 × 10^-17 | 0.936 | 0.936 | 0.936 | 0.936 | 0.936 | 0.936
5 × 10^-15 | 0.936 | 0.936 | 0.936 | 0.936 | 0.936 | 0.936
5 × 10^-13 | 0.936 | 0.936 | 0.936 | 0.936 | 0.936 | 0.936
5 × 10^-11 | 0.936 | 0.936 | 0.936 | 0.936 | 0.936 | 0.936
5 × 10^-9 | 0.936 | 0.936 | 0.936 | 0.936 | 0.936 | 0.936
5 × 10^-7 | 0.936 | 0.936 | 0.936 | 0.936 | 0.936 | 0.936
5 × 10^-5 | 0.936 | 0.936 | 0.936 | 0.936 | 0.936 | 0.936
5 × 10^-3 | 0.936 | 0.936 | 0.936 | 0.936 | 0.091 | 0.099
5 × 10^-1 | 0.936 | 0.936 | 0.936 | 0.936 | 0.936 | 0.936
5 × 10^1 | 0.936 | 0.936 | 0.936 | 0.936 | 0.936 | 0.936
5 × 10^3 | 0.939 | 0.936 | 0.936 | 0.936 | 0.909 | 0.936
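Grids such as the one in Table 6 are typically produced by an exhaustive search with cross-validation. A minimal sketch with scikit-learn (an assumption; the paper does not state its software stack), using synthetic data as a stand-in for the preprocessed Saber 11 features and the Table 4 class labels:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Grid mirroring Tables 6-10: C in {0.01, ..., 1000} and gamma spanning
# 5 x 10^-21 to 5 x 10^3 in steps of two orders of magnitude.
param_grid = {
    "C": [0.01, 0.1, 1, 10, 100, 1000],
    "gamma": [5 * 10.0 ** e for e in range(-21, 4, 2)],
}

# Placeholder dataset; the study itself uses 19,545 student records.
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

# Five-fold cross-validated accuracy for every (C, gamma) pair,
# matching the paper's tuning protocol.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
```

After fitting, `search.cv_results_` holds the full accuracy grid and `search.best_params_` the cell that would be reported as optimal.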
Table 7. C and γ optimisation for radial kernel in mathematics.

Gamma (γ) | C = 0.01 | C = 0.1 | C = 1 | C = 10 | C = 100 | C = 1000
5 × 10^-21 | 0.822 | 0.822 | 0.822 | 0.822 | 0.822 | 0.822
5 × 10^-19 | 0.822 | 0.822 | 0.822 | 0.822 | 0.822 | 0.822
5 × 10^-17 | 0.822 | 0.822 | 0.822 | 0.822 | 0.822 | 0.822
5 × 10^-15 | 0.822 | 0.822 | 0.822 | 0.822 | 0.822 | 0.822
5 × 10^-13 | 0.822 | 0.822 | 0.822 | 0.822 | 0.822 | 0.822
5 × 10^-11 | 0.822 | 0.822 | 0.822 | 0.822 | 0.822 | 0.822
5 × 10^-9 | 0.822 | 0.822 | 0.822 | 0.822 | 0.822 | 0.822
5 × 10^-7 | 0.822 | 0.822 | 0.822 | 0.822 | 0.822 | 0.822
5 × 10^-5 | 0.822 | 0.822 | 0.822 | 0.822 | 0.822 | 0.822
5 × 10^-3 | 0.822 | 0.822 | 0.822 | 0.822 | 0.225 | 0.735
5 × 10^-1 | 0.822 | 0.822 | 0.822 | 0.822 | 0.822 | 0.822
5 × 10^1 | 0.822 | 0.822 | 0.822 | 0.822 | 0.822 | 0.822
5 × 10^3 | 0.825 | 0.822 | 0.822 | 0.822 | 0.822 | 0.822
Table 8. C and γ optimisation for radial kernel in natural sciences.

Gamma (γ) | C = 0.01 | C = 0.1 | C = 1 | C = 10 | C = 100 | C = 1000
5 × 10^-21 | 0.811 | 0.811 | 0.811 | 0.811 | 0.811 | 0.811
5 × 10^-19 | 0.811 | 0.811 | 0.811 | 0.811 | 0.811 | 0.811
5 × 10^-17 | 0.811 | 0.811 | 0.811 | 0.811 | 0.811 | 0.811
5 × 10^-15 | 0.811 | 0.811 | 0.811 | 0.811 | 0.811 | 0.811
5 × 10^-13 | 0.811 | 0.811 | 0.811 | 0.811 | 0.811 | 0.811
5 × 10^-11 | 0.811 | 0.811 | 0.811 | 0.811 | 0.811 | 0.811
5 × 10^-9 | 0.811 | 0.811 | 0.811 | 0.811 | 0.811 | 0.811
5 × 10^-7 | 0.811 | 0.811 | 0.811 | 0.811 | 0.811 | 0.811
5 × 10^-5 | 0.811 | 0.811 | 0.811 | 0.811 | 0.811 | 0.811
5 × 10^-3 | 0.812 | 0.811 | 0.811 | 0.807 | 0.763 | 0.732
5 × 10^-1 | 0.811 | 0.811 | 0.811 | 0.811 | 0.811 | 0.811
5 × 10^1 | 0.811 | 0.811 | 0.811 | 0.811 | 0.811 | 0.811
5 × 10^3 | 0.811 | 0.811 | 0.811 | 0.811 | 0.811 | 0.811
Table 9. C and γ optimisation for radial kernel in social sciences.

Gamma (γ) | C = 0.01 | C = 0.1 | C = 1 | C = 10 | C = 100 | C = 1000
5 × 10^-21 | 0.723 | 0.723 | 0.723 | 0.723 | 0.723 | 0.723
5 × 10^-19 | 0.723 | 0.723 | 0.723 | 0.723 | 0.723 | 0.723
5 × 10^-17 | 0.723 | 0.723 | 0.723 | 0.723 | 0.723 | 0.723
5 × 10^-15 | 0.723 | 0.723 | 0.723 | 0.723 | 0.723 | 0.723
5 × 10^-13 | 0.723 | 0.723 | 0.723 | 0.723 | 0.723 | 0.723
5 × 10^-11 | 0.723 | 0.723 | 0.723 | 0.723 | 0.723 | 0.723
5 × 10^-9 | 0.723 | 0.723 | 0.723 | 0.723 | 0.723 | 0.723
5 × 10^-7 | 0.723 | 0.723 | 0.723 | 0.723 | 0.723 | 0.723
5 × 10^-5 | 0.723 | 0.723 | 0.723 | 0.723 | 0.723 | 0.723
5 × 10^-3 | 0.723 | 0.723 | 0.723 | 0.835 | 0.331 | 0.361
5 × 10^-1 | 0.723 | 0.723 | 0.723 | 0.723 | 0.723 | 0.723
5 × 10^1 | 0.822 | 0.822 | 0.822 | 0.822 | 0.822 | 0.822
5 × 10^3 | 0.723 | 0.723 | 0.723 | 0.723 | 0.723 | 0.723
Table 10. C and γ optimisation for radial kernel in English.

Gamma (γ) | C = 0.01 | C = 0.1 | C = 1 | C = 10 | C = 100 | C = 1000
5 × 10^-21 | 0.851 | 0.851 | 0.851 | 0.851 | 0.851 | 0.851
5 × 10^-19 | 0.851 | 0.851 | 0.851 | 0.851 | 0.851 | 0.851
5 × 10^-17 | 0.851 | 0.851 | 0.851 | 0.851 | 0.851 | 0.851
5 × 10^-15 | 0.851 | 0.851 | 0.851 | 0.851 | 0.851 | 0.851
5 × 10^-13 | 0.851 | 0.851 | 0.851 | 0.851 | 0.851 | 0.851
5 × 10^-11 | 0.851 | 0.851 | 0.851 | 0.851 | 0.851 | 0.851
5 × 10^-9 | 0.851 | 0.851 | 0.851 | 0.851 | 0.851 | 0.851
5 × 10^-7 | 0.851 | 0.851 | 0.851 | 0.851 | 0.851 | 0.851
5 × 10^-5 | 0.851 | 0.851 | 0.851 | 0.851 | 0.851 | 0.851
5 × 10^-3 | 0.851 | 0.851 | 0.852 | 0.856 | 0.833 | 0.799
5 × 10^-1 | 0.851 | 0.851 | 0.851 | 0.851 | 0.851 | 0.851
5 × 10^1 | 0.851 | 0.851 | 0.851 | 0.851 | 0.851 | 0.851
5 × 10^3 | 0.851 | 0.851 | 0.851 | 0.851 | 0.851 | 0.851
Table 11. Accuracy of the SVM models using different kernel types.

Subject | Linear (%) | Radial (%) | Sigmoid (%)
Critical reading | 93.5 | 93.5 | 93.5
Mathematics | 82.1 | 82.0 | 82.1
Natural sciences | 81.1 | 81.1 | 81.0
Social sciences | 72.9 | 71.9 | 72.1
English | 79.2 | 79.1 | 79.2
Table 12. F1-score of the SVM models using different kernel types.

Subject | Linear (%) | Radial (%) | Sigmoid (%)
Critical reading | 91.4 | 92.8 | 93.4
Mathematics | 81.7 | 81.4 | 81.4
Natural sciences | 81.0 | 81.0 | 80.5
Social sciences | 72.8 | 71.9 | 72.0
English | 78.7 | 78.6 | 78.7
Table 13. Optimal values of the regularisation parameter C used in SVM.

Subject | Linear | Radial | Sigmoid
Critical reading | 0.1 | 0.01 | 0.01
Mathematics | 0.1 | 0.01 | 0.01
Natural sciences | 0.01 | 0.01 | 0.01
Social sciences | 0.1 | 10 | 0.01
English | 10 | 0.01 | 10
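Tables 11–13 compare the three kernels at their tuned C. Under the same scikit-learn assumption, the per-kernel accuracy and F1 of a single subject could be reproduced along these lines; the C values below are taken from the critical-reading row of Table 13, the data are synthetic stand-ins, and macro averaging is one plausible (unconfirmed) choice for the multi-class F1:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Placeholder for the preprocessed features and Table 4 class labels.
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

results = {}
for kernel, C in [("linear", 0.1), ("rbf", 0.01), ("sigmoid", 0.01)]:
    # Out-of-fold predictions from 5-fold CV, matching the paper's protocol.
    pred = cross_val_predict(SVC(kernel=kernel, C=C), X, y, cv=5)
    results[kernel] = {
        "accuracy": accuracy_score(y, pred),
        "f1": f1_score(y, pred, average="macro"),
    }
```

Each entry of `results` then corresponds to one cell of Tables 11 and 12 for the chosen subject.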
Table 14. Optimisation of the regularisation parameter and size of the ANN hidden layer for critical reading.

Lambda (λ) | 30 Units | 50 Units | 80 Units | 120 Units
0.01 | 0.933 | 0.938 | 0.934 | 0.934
0.03 | 0.931 | 0.934 | 0.934 | 0.915
0.05 | 0.930 | 0.934 | 0.934 | 0.917
0.25 | 0.934 | 0.902 | 0.934 | 0.934
0.5 | 0.934 | 0.934 | 0.934 | 0.934
Table 15. Optimisation of the regularisation parameter and size of the ANN hidden layer for mathematics.

Lambda (λ) | 30 Units | 50 Units | 80 Units | 120 Units
0.01 | 0.821 | 0.819 | 0.825 | 0.820
0.03 | 0.821 | 0.821 | 0.821 | 0.820
0.05 | 0.820 | 0.820 | 0.822 | 0.821
0.25 | 0.821 | 0.821 | 0.820 | 0.821
0.5 | 0.821 | 0.821 | 0.821 | 0.821
Table 16. Optimisation of the regularisation parameter and size of the ANN hidden layer for natural sciences.

Lambda (λ) | 30 Units | 50 Units | 80 Units | 120 Units
0.01 | 0.925 | 0.926 | 0.926 | 0.925
0.03 | 0.926 | 0.927 | 0.926 | 0.927
0.05 | 0.927 | 0.927 | 0.927 | 0.927
0.25 | 0.926 | 0.926 | 0.926 | 0.926
0.5 | 0.931 | 0.926 | 0.926 | 0.926
Table 17. Optimisation of the regularisation parameter and size of the ANN hidden layer for social sciences.

Lambda (λ) | 30 Units | 50 Units | 80 Units | 120 Units
0.01 | 0.863 | 0.862 | 0.862 | 0.861
0.03 | 0.891 | 0.862 | 0.863 | 0.862
0.05 | 0.862 | 0.862 | 0.863 | 0.863
0.25 | 0.863 | 0.863 | 0.862 | 0.862
0.5 | 0.867 | 0.862 | 0.864 | 0.862
Table 18. Optimisation of the regularisation parameter and size of the ANN hidden layer for English.

Lambda (λ) | 30 Units | 50 Units | 80 Units | 120 Units
0.01 | 0.854 | 0.892 | 0.853 | 0.854
0.03 | 0.854 | 0.855 | 0.853 | 0.854
0.05 | 0.850 | 0.851 | 0.852 | 0.853
0.25 | 0.851 | 0.851 | 0.852 | 0.853
0.5 | 0.850 | 0.851 | 0.850 | 0.852
Table 19. Accuracy and optimal hyperparameters of the ANN models.

Subject | Accuracy (%) | F1-Score (%) | Lambda (λ) | Hidden Layer Units
Critical reading | 93.6 | 93.2 | 0.01 | 50
Mathematics | 82.2 | 82.5 | 0.01 | 80
Natural sciences | 92.7 | 92.1 | 0.5 | 30
Social sciences | 86.2 | 85.0 | 0.03 | 30
English | 85.3 | 86.1 | 0.01 | 50
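The ANN tuning of Tables 14–18 varies a regularisation strength λ and the size of a single hidden layer. A sketch with scikit-learn's MLPClassifier, under the assumption that λ corresponds to the L2 penalty `alpha` (the paper's exact network implementation is not specified), again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Grid mirroring Tables 14-18: five lambda values and one hidden
# layer whose size ranges over 30, 50, 80, and 120 units.
param_grid = {
    "alpha": [0.01, 0.03, 0.05, 0.25, 0.5],
    "hidden_layer_sizes": [(30,), (50,), (80,), (120,)],
}

# Placeholder for the preprocessed features and Table 4 class labels.
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

# Five-fold cross-validated accuracy for every (lambda, size) pair.
search = GridSearchCV(
    MLPClassifier(max_iter=300, random_state=0),
    param_grid, cv=5, scoring="accuracy",
)
search.fit(X, y)
```

The winning cell per subject is what Table 19 reports, e.g. λ = 0.01 with 50 hidden units for critical reading.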
Table 20. Accuracy comparison between SVM and ANN models.

Subject | SVM (%) | ANN (%) | Difference (%)
Critical reading | 93.5 | 93.6 | 0.1
Mathematics | 82.1 | 82.2 | 0.1
Natural sciences | 81.1 | 92.7 | 11.6
Social sciences | 72.9 | 86.2 | 13.3
English | 79.2 | 85.3 | 6.1
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Hoyos, W.; Caicedo-Castro, I. Tuning Data Mining Models to Predict Secondary School Academic Performance. Data 2024, 9, 86. https://doi.org/10.3390/data9070086