-
EMG subspace alignment and visualization for cross-subject hand gesture classification
Authors:
Martin Colot,
Cédric Simar,
Mathieu Petieau,
Ana Maria Cebolla Alvarez,
Guy Cheron,
Gianluca Bontempi
Abstract:
Electromyograms (EMG)-based hand gesture recognition systems are a promising technology for human/machine interfaces. However, one of their main limitations is the long calibration time that is typically required to handle new users. The paper discusses and analyses the challenge of cross-subject generalization thanks to an original dataset containing the EMG signals of 14 human subjects during hand gestures. The experimental results show that, though an accurate generalization based on pooling multiple subjects is hardly achievable, it is possible to improve the cross-subject estimation by identifying a robust low-dimensional subspace for multiple subjects and aligning it to a target subject. A visualization of the subspace enables us to provide insights for the improvement of cross-subject generalization with EMG signals.
Submitted 18 December, 2023;
originally announced January 2024.
-
A churn prediction dataset from the telecom sector: a new benchmark for uplift modeling
Authors:
Théo Verhelst,
Denis Mercier,
Jeevan Shrestha,
Gianluca Bontempi
Abstract:
Uplift modeling, also known as individual treatment effect (ITE) estimation, is an important approach for data-driven decision making that aims to identify the causal impact of an intervention on individuals. This paper introduces a new benchmark dataset for uplift modeling focused on churn prediction, coming from a telecom company in Belgium, Orange Belgium. Churn, in this context, refers to customers terminating their subscription to the telecom service. This is the first publicly available dataset offering the possibility to evaluate the efficiency of uplift modeling on the churn prediction problem. Moreover, its unique characteristics make it more challenging than the few other public uplift datasets.
Submitted 12 December, 2023;
originally announced December 2023.
-
A data-science pipeline to enable the Interpretability of Many-Objective Feature Selection
Authors:
Uchechukwu F. Njoku,
Alberto Abelló,
Besim Bilalli,
Gianluca Bontempi
Abstract:
Many-Objective Feature Selection (MOFS) approaches use four or more objectives to determine the relevance of a subset of features in a supervised learning task. As a consequence, MOFS typically returns a large set of non-dominated solutions, which have to be assessed by the data scientist in order to proceed with the final choice. Given the multi-variate nature of the assessment, which may include criteria (e.g. fairness) not related to predictive accuracy, this step is often not straightforward and suffers from the lack of existing tools. For instance, it is common to make use of a tabular presentation of the solutions, which provides little information about the trade-offs and the relations between criteria over the set of solutions.
This paper proposes an original methodology to support data scientists in the interpretation and comparison of the MOFS outcome by combining post-processing and visualisation of the set of solutions. The methodology supports the data scientist in the selection of an optimal feature subset by providing her with high-level information at three different levels: objectives, solutions, and individual features.
The methodology is experimentally assessed on two feature selection tasks adopting a GA-based MOFS with six objectives (number of selected features, balanced accuracy, F1-Score, variance inflation factor, statistical parity, and equalised odds). The results show the added value of the methodology in the selection of the final subset of features.
Submitted 30 November, 2023;
originally announced November 2023.
-
Between accurate prediction and poor decision making: the AI/ML gap
Authors:
Gianluca Bontempi
Abstract:
Intelligent agents rely on AI/ML functionalities to predict the consequence of possible actions and optimise the policy. However, the effort of the research community in addressing prediction accuracy has been so intense (and successful) that it created the illusion that the more accurate the learner prediction (or classification), the better the final decision would be. Now, such an assumption is valid only if the (human or artificial) decision maker has complete knowledge of the utility of the possible actions. This paper argues that the AI/ML community has so far taken too unbalanced an approach by devoting excessive attention to the estimation of the state (or target) probability to the detriment of accurate and reliable estimations of the utility. In particular, little evidence exists about the impact of a wrong utility assessment on the resulting expected utility of the decision strategy. This situation is creating a substantial gap between the expectations and the effective impact of AI solutions, as witnessed by recent criticisms and emphasised by the regulatory legislative efforts. This paper aims to study this gap by quantifying the sensitivity of the expected utility to the utility uncertainty and comparing it to the one due to probability estimation. Theoretical and simulated results show that an inaccurate utility assessment may be as harmful as (and sometimes more harmful than) a poor probability estimation. The final recommendation to the community is then to undertake a focus shift from a pure accuracy-driven (or obsessed) approach to a more utility-aware methodology.
Submitted 3 October, 2023;
originally announced October 2023.
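The point made in this abstract, that a misjudged utility can damage a decision as much as a misjudged probability, can be illustrated with a toy two-state, two-action problem. The sketch below is not taken from the paper: the function names, the probabilities and the utility tables are all made-up assumptions chosen only to show how either kind of error can flip the chosen action.

```python
def expected_utility(probs, utilities, action):
    """Expected utility of one action; utilities[action][state] is a payoff."""
    return sum(p * utilities[action][s] for s, p in enumerate(probs))


def best_action(probs, utilities):
    """Index of the action with the highest expected utility."""
    return max(range(len(utilities)),
               key=lambda a: expected_utility(probs, utilities, a))


# Hypothetical ground truth (illustrative numbers only).
true_probs = [0.3, 0.7]               # P(state 0), P(state 1)
true_utils = [[10.0, -2.0],           # action 0: payoff in state 0, state 1
              [-10.0, 0.0]]           # action 1

# Case 1: the probabilities are misestimated, the utilities are correct.
wrong_probs = [0.05, 0.95]
# Case 2: the utilities are misestimated, the probabilities are correct.
wrong_utils = [[10.0, -10.0],
               [-10.0, 0.0]]

optimal = best_action(true_probs, true_utils)
chosen_bad_probs = best_action(wrong_probs, true_utils)
chosen_bad_utils = best_action(true_probs, wrong_utils)


def regret(action):
    """True expected utility lost by taking `action` instead of the optimum."""
    return (expected_utility(true_probs, true_utils, optimal)
            - expected_utility(true_probs, true_utils, action))


regret_bad_probs = regret(chosen_bad_probs)
regret_bad_utils = regret(chosen_bad_utils)
```

With these particular numbers both errors lead to the same wrong action and therefore the same loss in true expected utility; other numbers would give different regrets.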
-
Uplift vs. predictive modeling: a theoretical analysis
Authors:
Théo Verhelst,
Robin Petit,
Wouter Verbeke,
Gianluca Bontempi
Abstract:
Despite the growing popularity of machine-learning techniques in decision-making, the added value of causal-oriented strategies with respect to pure machine-learning approaches has rarely been quantified in the literature. These strategies are crucial for practitioners in various domains, such as marketing, telecommunications, health care and finance. This paper presents a comprehensive treatment of the subject, starting from firm theoretical foundations and highlighting the parameters that influence the performance of the uplift and predictive approaches. The focus of the paper is on a binary outcome case and a binary action, and the paper presents a theoretical analysis of uplift modeling, comparing it with the classical predictive approach. The main research contributions of the paper include a new formulation of the measure of profit, a formal proof of the convergence of the uplift curve to the measure of profit, and an illustration, through simulations, of the conditions under which predictive approaches still outperform uplift modeling. We show that the mutual information between the features and the outcome plays a significant role, along with the variance of the estimators, the distribution of the potential outcomes and the underlying costs and benefits of the treatment and the outcome.
Submitted 21 September, 2023;
originally announced September 2023.
-
Adversarial Learning in Real-World Fraud Detection: Challenges and Perspectives
Authors:
Danele Lunghi,
Alkis Simitsis,
Olivier Caelen,
Gianluca Bontempi
Abstract:
Data economy relies on data-driven systems and complex machine learning applications are fueled by them. Unfortunately, however, machine learning models are exposed to fraudulent activities and adversarial attacks, which threaten their security and trustworthiness. In the last decade or so, the research interest on adversarial machine learning has grown significantly, revealing how learning applications could be severely impacted by effective attacks. Although early results of adversarial machine learning indicate the huge potential of the approach to specific domains such as image processing, still there is a gap in both the research literature and practice regarding how to generalize adversarial techniques in other domains and applications. Fraud detection is a critical defense mechanism for data economy, as it is for other applications as well, which poses several challenges for machine learning. In this work, we describe how attacks against fraud detection systems differ from other applications of adversarial machine learning, and propose a number of interesting directions to bridge this gap.
Submitted 3 July, 2023;
originally announced July 2023.
-
Traffic Modeling with SUMO: a Tutorial
Authors:
Davide Andrea Guastella,
Gianluca Bontempi
Abstract:
This paper presents a step-by-step guide to generating and simulating a traffic scenario using the open-source simulation tool SUMO. It introduces the common pipeline used to generate a synthetic traffic model for SUMO and shows how to import existing traffic data into a model to achieve accuracy in traffic simulation (that is, producing a traffic model whose dynamics are similar to the real ones). It also describes how SUMO outputs information from simulation that can be used for data analysis purposes.
Submitted 1 March, 2023;
originally announced April 2023.
-
Partial counterfactual identification and uplift modeling: theoretical results and real-world assessment
Authors:
Théo Verhelst,
Denis Mercier,
Jeevan Shrestha,
Gianluca Bontempi
Abstract:
Counterfactuals are central in causal human reasoning and the scientific discovery process. The uplift, also called conditional average treatment effect, measures the causal effect of some action, or treatment, on the outcome of an individual. This paper discusses how it is possible to derive bounds on the probability of counterfactual statements based on uplift terms. First, we derive some original bounds on the probability of counterfactuals and we show that tightness of such bounds depends on the information of the feature set on the uplift term. Then, we propose a point estimator based on the assumption of conditional independence between the counterfactual outcomes. The quality of the bounds and the point estimators are assessed on synthetic data and a large real-world customer data set provided by a telecom company, showing significant improvement over the state of the art.
Submitted 14 November, 2022;
originally announced November 2022.
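To make the notion of uplift in this abstract concrete: when the treatment has been assigned at random, a very simple estimate of the uplift inside a customer segment is the outcome rate among treated customers minus the outcome rate among untreated ones. The sketch below only illustrates that definition and is not the estimator or the bounds proposed in the paper. The function name `uplift_by_segment`, the record layout and the data are hypothetical, and randomized assignment is assumed.

```python
from collections import defaultdict


def uplift_by_segment(records):
    """Estimate uplift per segment from (segment, treated, outcome) records.

    Assumes treatment was assigned at random and outcome is 0 or 1.
    Returns {segment: treated outcome rate - control outcome rate}.
    """
    # For each segment keep [sum of outcomes, count] for treated and control.
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for segment, treated, outcome in records:
        cell = counts[segment][bool(treated)]
        cell[0] += outcome
        cell[1] += 1

    result = {}
    for segment, groups in counts.items():
        (t_sum, t_n), (c_sum, c_n) = groups[True], groups[False]
        if t_n and c_n:  # skip segments missing one of the two groups
            result[segment] = t_sum / t_n - c_sum / c_n
    return result


# Made-up records: outcome 1 means the customer churned.
records = [
    ("new", True, 1), ("new", True, 0), ("new", False, 1), ("new", False, 1),
    ("old", True, 1), ("old", True, 0), ("old", False, 0), ("old", False, 0),
]
uplift = uplift_by_segment(records)
```

In this made-up data the treatment lowers churn in the "new" segment and raises it in the "old" one, which is the kind of heterogeneity uplift modeling is meant to detect.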
-
Transfer Learning for Credit Card Fraud Detection: A Journey from Research to Production
Authors:
Wissam Siblini,
Guillaume Coter,
Rémy Fabry,
Liyun He-Guelton,
Frédéric Oblé,
Bertrand Lebichot,
Yann-Aël Le Borgne,
Gianluca Bontempi
Abstract:
The dark face of digital commerce generalization is the increase of fraud attempts. To prevent any type of attacks, state-of-the-art fraud detection systems are now embedding Machine Learning (ML) modules. The conception of such modules is only communicated at the level of research and papers mostly focus on results for isolated benchmark datasets and metrics. But research is only a part of the journey, preceded by the right formulation of the business problem and collection of data, and followed by a practical integration. In this paper, we give a wider vision of the process, on a case study of transfer learning for fraud detection, from business to research, and back to business.
Submitted 4 November, 2021; v1 submitted 20 July, 2021;
originally announced July 2021.
-
Streaming Active Learning Strategies for Real-Life Credit Card Fraud Detection: Assessment and Visualization
Authors:
Fabrizio Carcillo,
Yann-Aël Le Borgne,
Olivier Caelen,
Gianluca Bontempi
Abstract:
Credit card fraud detection is a very challenging problem because of the specific nature of transaction data and the labeling process. The transaction data are peculiar because they are obtained in a streaming fashion, are strongly imbalanced and are prone to non-stationarity. The labeling is the outcome of an active learning process, as every day human investigators contact only a small number of cardholders (associated with the riskiest transactions) and obtain the class (fraud or genuine) of the related transactions. An adequate selection of the set of cardholders is therefore crucial for an efficient fraud detection process. In this paper, we present a number of active learning strategies and we investigate their fraud detection accuracies. We compare different criteria (supervised, semi-supervised and unsupervised) to query unlabeled transactions. Finally, we highlight the existence of an exploitation/exploration trade-off for active learning in the context of fraud detection, which has so far been overlooked in the literature.
Submitted 20 April, 2018;
originally announced April 2018.
-
SCARFF: a Scalable Framework for Streaming Credit Card Fraud Detection with Spark
Authors:
Fabrizio Carcillo,
Andrea Dal Pozzolo,
Yann-Aël Le Borgne,
Olivier Caelen,
Yannis Mazzer,
Gianluca Bontempi
Abstract:
The expansion of electronic commerce, together with the increasing confidence of customers in electronic payments, makes fraud detection a critical factor. Detecting fraud in a (nearly) real-time setting demands the design and implementation of scalable learning techniques able to ingest and analyse massive amounts of streaming data. Recent advances in analytics and the availability of open source solutions for Big Data storage and processing open new perspectives to the fraud detection field. In this paper we present a SCAlable Real-time Fraud Finder (SCARFF) which integrates Big Data tools (Kafka, Spark and Cassandra) with a machine learning approach which deals with imbalance, nonstationarity and feedback latency. Experimental results on a massive dataset of real credit card transactions show that this framework is scalable, efficient and accurate over a big stream of transactions.
Submitted 26 September, 2017;
originally announced September 2017.
-
Feature selection in high-dimensional dataset using MapReduce
Authors:
Claudio Reggiani,
Yann-Aël Le Borgne,
Gianluca Bontempi
Abstract:
This paper describes a distributed MapReduce implementation of the minimum Redundancy Maximum Relevance algorithm, a popular feature selection method in bioinformatics and network inference problems. The proposed approach handles both tall/narrow and wide/short datasets. We further provide an open source implementation based on Hadoop/Spark, and illustrate its scalability on datasets involving millions of observations or features.
Submitted 7 September, 2017;
originally announced September 2017.
-
OpenTED Browser: Insights into European Public Spendings
Authors:
Yann-Aël Le Borgne,
Adriana Homolova,
Gianluca Bontempi
Abstract:
We present the OpenTED browser, a Web application for interactively browsing public spending data related to public procurements in the European Union. The application relies on Open Data recently published by the European Commission and the Publications Office of the European Union, from which we imported a curated dataset of 4.2 million contract award notices spanning the period 2006-2015. The application is designed to easily filter notices and visualise relationships between public contracting authorities and private contractors. The simple design makes it possible, for example, to quickly find information about who the biggest suppliers of local governments are, and the nature of the contracted goods and services. We believe the tool, which we make Open Source, is a valuable source of information for journalists, NGOs, analysts and citizens for getting information on public procurement data, from large scale trends to local municipal developments.
Submitted 16 September, 2016;
originally announced November 2016.
-
From dependency to causality: a machine learning approach
Authors:
Gianluca Bontempi,
Maxime Flauder
Abstract:
The relationship between statistical dependency and causality lies at the heart of all statistical approaches to causal inference. Recent results in the ChaLearn cause-effect pair challenge have shown that causal directionality can be inferred with good accuracy also in Markov indistinguishable configurations thanks to data driven approaches. This paper proposes a supervised machine learning approach to infer the existence of a directed causal link between two variables in multivariate settings with $n>2$ variables. The approach relies on the asymmetry of some conditional (in)dependence relations between the members of the Markov blankets of two variables causally connected. Our results show that supervised learning methods may be successfully used to extract causal information on the basis of asymmetric statistical descriptors also for $n>2$ variate distributions.
Submitted 19 December, 2014;
originally announced December 2014.
-
Optimizing Component Combination in a Multi-Indexing Paragraph Retrieval System
Authors:
Boris Iolis,
Gianluca Bontempi
Abstract:
We demonstrate a method to optimize the combination of distinct components in a paragraph retrieval system. Our system makes use of several indices, query generators and filters, each of them potentially contributing to the quality of the returned list of results. The components are combined with a weighted sum, and we optimize the weights using a heuristic optimization algorithm. This allows us to maximize the quality of our results, but also to determine which components are most valuable in our system. We evaluate our approach on the paragraph selection task of a Question Answering dataset.
Submitted 11 August, 2014;
originally announced August 2014.
-
A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition
Authors:
Souhaib Ben Taieb,
Gianluca Bontempi,
Amir Atiya,
Antti Sorjamaa
Abstract:
Multi-step ahead forecasting is still an open challenge in time series forecasting. Several approaches that deal with this complex problem have been proposed in the literature but an extensive comparison on a large number of tasks is still missing. This paper aims to fill this gap by reviewing existing strategies for multi-step ahead forecasting and comparing them in theoretical and practical terms. To attain such an objective, we performed a large scale comparison of these different strategies using a large experimental benchmark (namely the 111 series from the NN5 forecasting competition). In addition, we considered the effects of deseasonalization, input variable selection, and forecast combination on these strategies and on multi-step ahead forecasting at large. The following three findings appear to be consistently supported by the experimental results: Multiple-Output strategies are the best performing approaches, deseasonalization leads to uniformly improved forecast accuracy, and input selection is more effective when performed in conjunction with deseasonalization.
Submitted 16 August, 2011;
originally announced August 2011.
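Two of the classical strategies for multi-step ahead forecasting, usually called recursive and direct, can be contrasted in a few lines. The sketch below is an illustration only and does not reproduce the models, the Multiple-Output strategies or the experimental setup of the paper. It assumes a deliberately simple one-lag linear model fitted by least squares, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def recursive_forecast(series, horizon):
    """Fit ONE one-step model and feed its own predictions back as inputs."""
    slope, intercept = fit_line(series[:-1], series[1:])
    forecasts, last = [], series[-1]
    for _ in range(horizon):
        last = slope * last + intercept
        forecasts.append(last)
    return forecasts


def direct_forecast(series, horizon):
    """Fit a SEPARATE model for each horizon h, mapping y[t] to y[t + h]."""
    forecasts = []
    for h in range(1, horizon + 1):
        slope, intercept = fit_line(series[:-h], series[h:])
        forecasts.append(slope * series[-1] + intercept)
    return forecasts


series = [float(t) for t in range(1, 11)]   # toy series with a linear trend
recursive = recursive_forecast(series, 3)
direct = direct_forecast(series, 3)
```

On this noise-free linear series both strategies give the same forecasts; on noisy data they generally differ, because the recursive strategy accumulates its own one-step errors while the direct strategy trains each horizon on fewer input/output pairs.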
-
Distributed Principal Component Analysis for Wireless Sensor Networks
Authors:
Yann-Aël Le Borgne,
Sylvain Raybaud,
Gianluca Bontempi
Abstract:
The Principal Component Analysis (PCA) is a data dimensionality reduction technique well-suited for processing data from sensor networks. It can be applied to tasks like compression, event detection, and event recognition. This technique is based on a linear transform where the sensor measurements are projected on a set of principal components. When sensor measurements are correlated, a small set of principal components can explain most of the variability in the measurements. This makes it possible to significantly decrease the amount of radio communication and energy consumption. In this paper, we show that the power iteration method can be distributed in a sensor network in order to compute an approximation of the principal components. The proposed implementation relies on an aggregation service, which has recently been shown to provide a suitable framework for distributing the computation of a linear transform within a sensor network. We also extend this previous work by providing a detailed analysis of the computational, memory, and communication costs involved. A compression experiment involving real data validates the algorithm and illustrates the tradeoffs between accuracy and communication costs.
Submitted 9 March, 2010;
originally announced March 2010.
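The power iteration method mentioned in this abstract can be sketched on a single machine as follows. This is a centralized illustration only: it does not implement the distributed, aggregation-based version described in the paper. The function names and the sensor readings are made up, and the starting vector is assumed not to be orthogonal to the first principal component.

```python
import math


def covariance(rows):
    """Sample covariance matrix of a list of equal-length measurement vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def power_iteration(matrix, iterations=200):
    """Approximate the dominant eigenvector of a symmetric matrix."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d          # unit-norm starting vector
    for _ in range(iterations):
        # Multiply by the matrix, then renormalize to unit length.
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


# Two strongly correlated (made-up) sensors: the second reads about twice the first.
readings = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
pc1 = power_iteration(covariance(readings))
```

Each iteration needs only a matrix-vector product and a norm, both of which are sums; that is the kind of operation an in-network aggregation service can compute.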