From the course: Machine Learning with Python: Foundations

Common data quality issues - Python Tutorial


- [Instructor] An ideal dataset is one that has no missing values and no values that deviate from the expected. Such a dataset hardly exists, if at all. In reality, most datasets have to be transformed or have data quality issues that need to be dealt with before they can be used for machine learning. This is what the third stage in the machine learning process is all about: data preparation. Data preparation is the process of making sure that our data is suitable for the machine learning approach we choose to use. In computing, the saying "Garbage in, garbage out" expresses the idea that incorrect or poor-quality input will invariably result in incorrect or poor-quality output. This concept is fundamentally important in machine learning. If proper care is not taken on the front end to deal with data quality issues before building the model, then the model output will be unreliable, misleading, or simply wrong.

One of the most commonly encountered data quality issues is missing data. There are several reasons why data could be missing, including changes in data collection methods, human error, bias, or simply the lack of reliable input. Before we resolve missing data, we should attempt to understand why the data is missing and whether there is a pattern to the missing values. There are several approaches to dealing with missing data. One is to simply remove all instances with features that have a missing value. Another is to use an indicator value, such as NA, unknown, or negative one, to represent missing values. An alternative approach is a method known as imputation. Imputation is the use of a systematic approach to fill in missing data with the most probable substitute value. There are several approaches to imputation, one of which is known as median imputation. With median imputation, we can resolve a missing value in the amount column by replacing it with the median of the non-missing values.
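Median imputation can be sketched in a few lines of pandas. The tiny dataset below is made up for illustration; only the `amount` column name comes from the example above.

```python
import pandas as pd

# Illustrative data: the 'amount' column contains one missing value.
df = pd.DataFrame({"amount": [120.0, 85.0, None, 200.0, 95.0]})

# Median imputation: compute the median of the non-missing values
# and use it as the substitute for the missing value.
median_amount = df["amount"].median()  # median of 85, 95, 120, 200 -> 107.5
df["amount"] = df["amount"].fillna(median_amount)

print(df["amount"].tolist())  # [120.0, 85.0, 107.5, 200.0, 95.0]
```

The same one-liner, `df["amount"].fillna(df["amount"].median())`, works on real datasets of any size; the median is often preferred over the mean here because it is less sensitive to outliers.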
Another common data quality problem is outliers. An outlier is a data point that is significantly different from other observations within the dataset. Outliers manifest either as instances with characteristics different from most other instances or as feature values that are unusual with respect to the typical values for that feature. Before we decide what to do with outlier data, we must first understand why it exists and whether it is useful toward our machine learning goal.

Supervised machine learning algorithms learn by identifying rules or estimating the function that explains the value of the dependent variable based on the values of the independent variables. If the values of the dependent variable are categorical, we refer to them as class labels, and the proportion of examples that belong to each class label is known as the class distribution. For most real-world problems, the class distribution of the historical data is not uniform. For example, the vast majority of people who take out loans pay them back. This means historical loan datasets will typically have far more examples of people who repay their loans than of people who default. This phenomenon is known as class imbalance. Class imbalance is a well-known data quality problem in machine learning. If not properly accounted for, it can lead to misleading results, because the model we build will not have an equal shot at learning the patterns that correspond to each class label. There are several ways to resolve class imbalance. One approach is to undersample the majority class, meaning we randomly remove some instances of the majority class in an attempt to even out the class distribution.
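Random undersampling of the majority class can be sketched with pandas. The 90/10 loan dataset below is hypothetical, chosen only to mirror the repaid-versus-default example above; column names like `loan_status` are assumptions.

```python
import pandas as pd

# Hypothetical imbalanced loan data: 90 repaid loans, 10 defaults.
df = pd.DataFrame({
    "loan_status": ["repaid"] * 90 + ["default"] * 10,
    "amount": range(100),
})

# Inspect the class distribution before resampling.
print(df["loan_status"].value_counts())  # repaid: 90, default: 10

# Undersample the majority class: randomly keep only as many
# 'repaid' instances as there are 'default' instances.
minority = df[df["loan_status"] == "default"]
majority = df[df["loan_status"] == "repaid"].sample(
    n=len(minority), random_state=42  # fixed seed for reproducibility
)
balanced = pd.concat([majority, minority])

print(balanced["loan_status"].value_counts())  # repaid: 10, default: 10
```

The trade-off of undersampling is that it discards potentially useful majority-class examples, which is why alternatives such as oversampling the minority class are also common in practice.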
