How can you compare and select the best ML algorithms?
This is a common question faced by many data scientists and machine learning practitioners when working on a project. Many factors can influence the performance and suitability of different algorithms, such as the characteristics of the data, the problem domain, the evaluation metrics, and the computational resources. In this article, you will learn some general steps and tips to help you compare and select the best ML algorithms for your task.
The first step is to define your goal and what you want to achieve with your ML model. This will help you narrow down the algorithms that can address your problem and the criteria you will use to evaluate them. For example, if your goal is to classify images into different categories, you will need an algorithm that can handle complex, high-dimensional data, such as a convolutional neural network (CNN). If your goal is to predict the price of a house based on its features, you will need an algorithm that can perform regression, such as linear regression or a decision tree.
-
There is no "one size fits all" approach for selecting the best model or assessing its performance. In many cases, statistical metrics based on error rates will be good enough. In most cases, it is worth digging deeper and analyzing the consequences of the decision-making that relies on the machine learning tool. This is particularly true when there is a significant imbalance between the consequences of errors, some errors being benign while others trigger severe negative consequences. This is the case in various domains, like computing the maximum load of a network, predicting probability of default, ... A good knowledge of the field for which ML is applied is the best safeguard to avoid choosing a wrong performance metric.
-
Selecting the best machine learning (ML) model for a particular task involves a combination of understanding the problem, exploring the data, and experimenting with different algorithms. 1. Clearly understand the problem you are trying to solve. 2. Understand the distribution of classes in classification tasks or of the target variable in regression tasks. 3. Select appropriate evaluation metrics based on the nature of your problem (accuracy, precision, recall, F1 score for classification; mean squared error, R-squared for regression, etc.), and consider business-specific metrics if applicable. 4. Consider the trade-offs between different models, such as model complexity and interpretability.
-
In general, we compare learning algorithms based on their error rates, but it is important to remember that in reality, error is just one of the factors that influence our decision. Other criteria are: risks when errors are weighted with loss functions instead of 0/1 loss; training time and space complexity; testing time and space complexity; interpretability, namely, whether the method allows knowledge extraction that can be checked and validated by experts; and easy programmability.
-
The initial step in machine learning is defining your goal and desired outcomes for your model. This process helps narrow down suitable algorithms and criteria for evaluation. For instance, classifying images might require a convolutional neural network (CNN) for complex data, while predicting house prices might involve regression methods like linear regression or decision trees.
-
Generally, comparing algorithms is guided by the following steps: 1. Identify the type of problem to be solved (classification, regression, clustering, etc.). 2. Determine the success metric (both technical and business). 3. Shortlist a number of ML algorithms to train and benchmark against each other. 4. Measure how well these algorithms perform on an out-of-fold set using the metrics you identified in step 2. 5. Pick the ML algorithm that satisfies the success requirement. 6. Optimize the chosen algorithm through hyperparameter optimization. 7. Test and understand why the model makes its choices; here you can use frameworks like LIME/SHAP (model explainability).
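As a rough sketch of steps 3 to 5 above, the benchmark loop might look like the following. It assumes scikit-learn is available; the synthetic dataset, candidate list, and F1 metric are illustrative placeholders, not a prescription.

```python
# Sketch: benchmark several candidate algorithms with cross-validation.
# Dataset, candidates, and metric are placeholders for your own choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
}

# Step 4: measure each algorithm on held-out folds with the chosen metric.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    for name, model in candidates.items()
}

# Step 5: pick the algorithm that best satisfies the success requirement.
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

The same loop extends naturally to step 6 by wrapping the winner in a hyperparameter search.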
The second step is to explore your data and understand its characteristics, such as its size, shape, distribution, quality, and features. This will help you choose algorithms that are compatible with your data and can handle its challenges, such as missing values, outliers, imbalances, or noise. For example, if your data is large and sparse, you may want to use an algorithm that scales well and can reduce dimensionality, such as a support vector machine (SVM) or principal component analysis (PCA). If your data is small and noisy, you may want to use an algorithm that can avoid overfitting and regularize the model, such as logistic regression or a random forest.
-
Data engineering is the process of preparing and transforming the data for the machine learning model. - A data quality assessment report evaluates the quality of the data and identifies and quantifies data issues. - A data strategy defines the criteria and logic for selecting or excluding data sources based on relevance, availability, reliability, and diversity. - A data engineering design specifies the steps and rules for performing data engineering tasks, such as data cleansing and data transformation, and ensures the pipeline is scalable.
-
Whenever data is limited (e.g., labelled health data can be expensive), it is worth checking which samples the model considers more "uncertain" before obtaining their annotations. With which level of confidence was a given sample classified? By selecting low-confidence samples, we can induce a more significant impact on the learning process, which is equivalent to achieving the same performance with less data, because we are discarding samples that are considered "redundant". For those, we opt not to obtain annotations, reducing cost.
-
To effectively compare and select ML algorithms, it's essential to deeply understand your data. Start by analyzing data characteristics using tools like matplotlib or seaborn. This analysis will highlight important features such as distribution, outliers, or missing values, guiding your algorithm choice. For example, tree-based models are great for non-linear data, while simpler models suffice for linear relationships. The data’s complexity and size also influence this choice; large datasets might need robust methods like ensemble models or deep learning. The key is a comprehensive data understanding, guiding you to the right algorithm.
-
By analyzing various aspects of your data, like its size, shape, distribution, quality, and features, you can identify challenges in your data, such as missing values, outliers, imbalances, or noise. The characteristics of your data will guide you in choosing algorithms that are compatible with its specific nature. For instance, if you're working with large and sparse data, algorithms like support vector machines (SVM) or principal component analysis (PCA) may be suitable due to their ability to scale well and reduce dimensionality. On the other hand, if your data is small and noisy, algorithms like logistic regression or random forest, which can prevent overfitting and provide regularization, might be more appropriate.
-
Don't underestimate this piece. It is true what they say: a large part of the work of being a data analyst or data scientist resides within this step. Understanding whether your data has unique characteristics - such as group imbalances, missing values, outliers, distributional skew - will affect how you clean and treat the data, which in turn plays a big part on the validity and efficacy of your model. Visualize your data, using value tables and histograms to get a sense of the distribution of your variables of interest. Determine your method for dealing with group imbalances, outliers, or missing data, and remember that you have to justify these methods in your final model.
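As a minimal sketch of the exploration described above, the checks below use pandas on a tiny made-up house-price table (the columns and values are purely illustrative). The 1.5 × IQR rule is one common, simple outlier heuristic, not the only option.

```python
# Sketch: quick exploratory checks before choosing an algorithm.
# The toy DataFrame stands in for your real dataset.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [250_000, 310_000, 180_000, np.nan, 2_000_000],
    "rooms": [3, 4, 2, 3, 10],
    "city": ["A", "A", "B", "B", "A"],
})

print(df.shape)                    # size and shape
print(df.isna().sum())             # missing values per column
print(df["city"].value_counts())   # group imbalance
print(df["price"].describe())      # distribution, possible outliers

# A simple outlier flag: values beyond 1.5 * IQR above the third quartile.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[df["price"] > q3 + 1.5 * iqr]
print(len(outliers))
```

On this toy table the checks surface one missing price and one extreme price, exactly the kind of finding that shapes cleaning and model choice.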
The third step is to select the metrics that will measure the performance and quality of your ML model. This will help you compare algorithms objectively and quantitatively, and choose the one that best fits your expectations and requirements. For example, if your metric is accuracy, you will want to choose the algorithm with the highest percentage of correct predictions. If your metric is precision, you will want the algorithm with the lowest rate of false positives. Other common metrics include recall, F1 score, the ROC curve, MSE, MAE, and R-squared.
-
When defining metrics, control metrics are often the overlooked part of the story. In complex systems with intelligent internal and external interactions, a performance metric should be supported or constrained by proper control metrics to foresee, apply, and monitor the desired effects.
-
Selecting the right metrics is crucial in comparing and selecting ML algorithms. Accuracy is a common starting point, but it's not always sufficient. In classification tasks with imbalanced classes, precision, recall, and F1-score provide deeper insights. For regression models, consider mean squared error or mean absolute error. In complex scenarios, custom metrics tailored to specific business objectives can be more informative. It's also important to consider computational efficiency, especially for large datasets or real-time applications. Ultimately, the choice of metrics should align with your project’s goals and the specific nature of the data at hand.
-
Selecting the best metric also depends on the goal and the data distribution. Accuracy, for example, can be misleading for certain applications. Suppose you have a dataset of 1,000 chest X-ray images, of which 10 are cancerous. If your ML model always gives negative predictions (no cancer), its accuracy is 99%! But it is neither significant nor useful.
-
Having a good validation strategy and solid evaluation metrics is important. Ensure that the model you choose is robust and reliable and correlates with the business metrics you intend to optimize with the machine learning solution. Choosing the correct evaluation schema, whether a simple train-test split or a complex cross-validation strategy, is the crucial first step of building any machine learning solution.
-
In assessing machine learning models, performance isn't solely about accuracy. While metrics like accuracy, precision, and recall matter, practical considerations such as latency and hardware usage are crucial. Achieving optimal accuracy must be balanced with the model's efficiency in terms of response time and resource utilization. This ensures that the selected model is not only accurate but also feasible and scalable for real-world deployment.
The fourth step is to split your data into separate subsets for training, validation, and testing. This will help you avoid overfitting and underfitting, and estimate the generalization ability of your ML model. For example, you can use a 70/15/15 split, where 70% of the data is used for training, 15% for validation, and 15% for testing. The training set is used to fit the model's parameters, the validation set to tune its hyperparameters, and the test set to evaluate the model's final performance.
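One common way to get a 70/15/15 split is two successive `train_test_split` calls (a sketch assuming scikit-learn; the synthetic dataset and ratios are illustrative, and `stratify` keeps class proportions in each subset):

```python
# Sketch: a 70/15/15 train/validation/test split via two splits.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First split off ~15% as the test set, then carve the validation set
# out of the remaining 85% (15/85 of it, so ~15% of the original data).
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.15 / 0.85, stratify=y_tmp, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # roughly 700 / 150 / 150
```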
-
In comparing and selecting ML algorithms, data splitting is crucial. Generally, data is divided into training, validation, and testing sets. Training builds the model, validation tunes parameters, and testing evaluates performance. This method assesses a model's generalization on unseen data. Common split ratios are 60-20-20 or 70-15-15. It's vital to ensure each set represents the whole dataset, especially in classification tasks. Techniques like stratified sampling help maintain class distribution. Proper splitting ensures fair, accurate ML algorithm comparison and robust model selection.
-
Generalization ability: data splitting allows you to estimate the generalization ability of your model by evaluating its performance on unseen data. This helps you assess whether the model has learned the underlying patterns in the data or is just memorizing the training data. Preventing overfitting and underfitting: by having separate training, validation, and testing sets, you can prevent overfitting and underfitting. The validation set helps in tuning the model's hyperparameters, while the testing set provides an unbiased evaluation of the model's performance on unseen data. Model selection: insights from the validation set can guide you in selecting the best model architecture and hyperparameters that result in optimal performance.
-
In the vast realm of machine learning, data splitting is a crucial choreography. However, with colossal datasets like 10 million images, the conventional 70/15/15 split may need adaptation. In this scenario, even a mere 1% for testing or validation yields substantial subsets for meaningful insights. Scaling the split percentage to the dataset's magnitude ensures resource efficiency and statistical robustness. Let the data split be a harmonious composition, finely tuned to unravel the grand symphony of machine learning in an optimal and judicious manner.
-
1. Training data: used to train the machine learning model; often around 60-80% of the total dataset. The model learns patterns, relationships, and features from this subset. 2. Validation data: used to assess and tune the model during training; usually around 10-20% of the total dataset (can vary). It supports hyperparameter tuning and helps prevent overfitting by providing feedback to adjust the model. 3. Testing data: used to evaluate the model's performance; typically around 10-20% of the total dataset (can vary). It is completely unseen by the model during training and validation, and is used to assess how well the model generalizes to new data.
-
If you're dealing with data with timestamps, which is common if your data has transactions or interactions or events, it's critical to use out-of-time validation (split the data into different non-overlapping periods for train, test, and validation), to avoid any potential leakage.
The fifth step is to train and test the algorithms using the data subsets and metrics you have selected. This will help you see how your algorithms perform under different data scenarios and how they compare to each other. For example, you can use a loop or a function to iterate over different algorithms, apply them to the same datasets, and store the results in a table or a plot. You can also use libraries or tools that automate this process, such as scikit-learn, TensorFlow, or AutoML.
-
Algorithm performance comparison: by testing multiple algorithms on the same dataset, you can compare their performances and identify the one that best suits your specific problem. Comparing metrics such as accuracy, precision, recall, F1 score, and ROC-AUC can help you determine which algorithm performs better under different data scenarios. Model selection and tuning: training and testing allow you to identify the best-performing algorithm and its associated hyperparameters. Understanding model behavior: training and testing provide insights into how different algorithms behave with respect to your dataset. You can gain an understanding of which models are prone to overfitting or underfitting and how they handle different data scenarios.
-
Perhaps one of the most underrated tools for model selection are learning curves, which are crucial in evaluating the performance of machine learning models. These curves depict the relationship between a chosen performance metric (such as accuracy or loss) and the amount of training data or training iterations. If underfitting is detected, one might consider increasing model complexity, while overfitting may necessitate the use of regularization techniques or a reduction in complexity. The curves also guide decisions on whether a model would benefit from additional data or training iterations. Ultimately, learning curves offer a systematic approach to refining model architecture, enhancing training strategies, and optimizing performance.
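Learning curves of the kind described above can be computed directly with scikit-learn's `learning_curve` (a sketch; the dataset, model, and grid of training sizes are illustrative):

```python
# Sketch: learning curves to diagnose over- and underfitting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=600, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

# A persistent gap between the two curves suggests overfitting; two low,
# converged curves suggest underfitting.
for n, tr, va in zip(sizes, train_scores.mean(axis=1),
                     val_scores.mean(axis=1)):
    print(n, round(tr, 3), round(va, 3))
```

Plotting the two mean curves against `sizes` (e.g. with matplotlib) gives the visual described in the paragraph above.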
-
Train Multiple Algorithms: Utilize various machine learning models like decision trees, SVMs, and neural networks on your training dataset, tailoring each to the specific nature of your problem. Test on Validation Set: Evaluate these models on a separate validation set using relevant metrics to determine effectiveness, ensuring they have not been exposed to this data during training. Create Similar New Data: Synthesize new data with the same distribution as the validation set to replicate real-world scenarios, using methods like bootstrapping or synthetic data generation. Introduce Bias and Stress-Test: Inject biases into this new data and re-test the models to assess their robustness and ability to handle data variations and challenges.
-
Understanding the use case and the model is of high importance. For example, a model detecting a disease must have higher recall than precision, and is considered better even if it has a lower F1 score. So it depends on the use case. But, in general: F1 score, accuracy, RMSE, binary cross-entropy loss, ROC-AUC curve.
-
The machine learning model selection process involves training and testing algorithms on selected data subsets using chosen metrics. This involves implementing a loop or function to iterate over different algorithms, applying them to the same datasets, and storing and comparing results using tables or plots. Automation tools like scikit-learn, TensorFlow, or AutoML can streamline this process, enhancing efficiency and consistency in model evaluation.
The sixth and final step is to analyze and select the algorithm based on the results and insights obtained in the previous steps. This will help you make an informed, rational decision that fits your goal and your data. For example, you can look at the metric values, learning curves, confusion matrices, feature importances, or model complexity, and see which algorithm strikes the best balance between accuracy, efficiency, and simplicity. You can also take other factors into account, such as the algorithm's interpretability, robustness, or scalability.
-
Analyzing and selecting the most suitable algorithm involves the following considerations. Metrics evaluation: assess the performance of different algorithms based on metrics like accuracy, precision, recall, F1 score, or ROC-AUC to understand how well each performs on your specific dataset. Learning curves and model complexity: examine learning curves to gauge how algorithms handle training data and whether they tend to overfit or underfit; understanding model complexity can help prevent issues like overfitting, ensuring better generalization. Confusion matrices and feature importances: analyze confusion matrices to understand how well an algorithm classifies different classes and where it might be making errors.
-
When evaluating model performance and selecting algorithms in a scenario with various subgroups or segments in the data, it is advisable to also consider the performance for each group/segment separately in addition to the overall model performance. It is for instance possible that the overall performance is relatively high, but that there are groups/segments for which the performance is significantly lower.
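The per-segment check described above is a one-liner with pandas (a sketch; the segments and labels are made up to show how a good overall score can hide a failing segment):

```python
# Sketch: overall vs per-segment accuracy. The model below looks fine
# overall (0.8) but is completely wrong on segment B.
import pandas as pd

df = pd.DataFrame({
    "segment": ["A"] * 8 + ["B"] * 2,
    "y_true":  [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    "y_pred":  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
})

overall = (df["y_true"] == df["y_pred"]).mean()
per_segment = (df.assign(correct=df["y_true"] == df["y_pred"])
                 .groupby("segment")["correct"].mean())

print(overall)       # 0.8 overall...
print(per_segment)   # ...1.0 on segment A, 0.0 on segment B
```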
-
In the research for my master's thesis, focused on classification (10 years ago by now), I concluded that a good combination can be to use advanced but difficult-to-interpret algorithms (such as Random Tree) to establish what a good baseline performance for a model looks like, and then to use a more understandable model (e.g., logistic regression, regular decision trees) on well pre-processed data and compare how close it gets to that established reference point.
-
In the process of algorithm selection, it's akin to trying on different outfits to find the one that fits you the best. Each algorithm is like a unique outfit, offering different styles and fits. Evaluating them involves checking how well each outfit complements your figure – in the ML world, it's about metrics like accuracy, precision, recall, and F1 score. The chosen algorithm should not only look good on the training data but also suit unseen data, demonstrating its versatility and reliability. Just like finding that perfect outfit, the selected algorithm should make you feel confident across various situations.
-
Heavy cross validation using the chosen metrics will help better understand the candidate models. Multiple comparisons must be considered when doing this, as they are likely to give overly optimistic estimates if one isn't careful.
-
Selecting the best algorithm is a trade-off between: 1. Training and deployment cost: classical supervised learning algorithms are usually more cost-effective than deep learning. If the data is tabular (numerical/categorical), it is often best to choose classical supervised learning algorithms. 2. High accuracy/AUC (static metrics): leverage search-space tools such as Keras Tuner or AutoML solutions to get a baseline of the best algorithms. 3. High-performing algorithms: the success metrics can be static (e.g., accuracy/precision) or business metrics. In the A/B testing world, we could select a few of the best algorithms, run the experiment, and select the one that gives the best metrics (revenue/user engagement/latency).
-
As machine learning practitioners, we often focus on fine-tuning algorithms, tweaking hyperparameters, and experimenting with complex models. However, achieving optimal performance goes beyond just the algorithm itself and to the entire ML pipeline. But how do we optimize this pipeline? While hyperparameter tuning is crucial, consider other dimensions of optimization. Sometimes, a simple rule-based step can significantly enhance model performance. For instance, incorporating domain-specific knowledge or business rules can lead to better predictions. Don’t hesitate to experiment with such techniques.
-
Machine learning's adaptability spans all industries, with data quality taking precedence over a singular focus on technical intricacies. Rather than fixating solely on technical aspects, a prudent approach involves surveying the environment to identify or obtain relevant data and discern its potential value, embracing a problem-centric methodology: what problem needs resolution, and what enhancement is envisioned? It is essential to recognize that the goal of a machine learning model transcends mere speed or accuracy; the measure of success lies in its ability to effectively address substantial human challenges.
-
ML is a highly iterative process in which a lot depends on the quality of the data. Moreover, given the pace of ML research, new techniques come out every day, so constant research into methods is key to selecting the best-suited ML algorithms.
-
Once you have selected the best algorithm, you can try to improve its performance by tuning its hyperparameters using grid search, random search, or Bayesian optimization methods. You can also validate your model on a new or unseen data set to check its generalization ability and robustness.
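As a sketch of the grid-search option mentioned above (scikit-learn assumed; the model and the small parameter grid are illustrative, and `RandomizedSearchCV` or Bayesian tools would slot in the same way):

```python
# Sketch: tune the chosen algorithm's hyperparameters with grid search.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3, scoring="f1")
grid.fit(X, y)

print(grid.best_params_, round(grid.best_score_, 3))
```

The fitted `grid.best_estimator_` can then be evaluated once on the held-out test set to check generalization.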