Data Analysis Recap
In the last three weeks, we have talked about data exploration, data pre-processing and data analysis. In the data exploration stage, we use approaches such as visual exploration or statistics to understand what is in a dataset and the characteristics of the data. These characteristics can include the size or amount of data, the completeness of the data (e.g., the number of missing values), the correctness of the data (e.g., outliers), and possible relationships amongst data elements or variables in the data (e.g., correlations, distributions).
Based on the information obtained through data exploration, data pre-processing aims to improve the quality of the data by dealing with missing values, removing unusable parts of the data, correcting poorly formatted elements and defining relevant relationships across datasets. Common approaches for data pre-processing include missing value imputation, data scaling, normalization, feature selection and dimension reduction.
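A minimal sketch of two of these steps (missing-value imputation and scaling), assuming scikit-learn is available; the toy array is invented for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with one missing value (values invented for illustration).
X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [3.0, 260.0]])

# Missing value imputation: replace each NaN with its column's mean.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# Scaling: standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_imputed)
print(X_scaled)
```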
After data exploration and data pre-processing, the data are ready for analysis. Machine learning algorithms are widely used in data analysis. There are various machine learning algorithms, which can be grouped into different types. Supervised learning and unsupervised learning are the two major types of machine learning. In this module, we introduced three supervised machine learning algorithms, i.e., linear regression, SVM (Support Vector Machine) and NN (Neural Network), and one unsupervised machine learning algorithm: K-means clustering.
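A minimal sketch of the supervised/unsupervised distinction, assuming scikit-learn: linear regression is fitted on feature-label pairs, while K-means receives the features alone. The toy arrays are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]])  # toy feature values
y = np.array([2.0, 4.0, 6.0, 16.0, 18.0, 20.0])            # known targets

# Supervised: the model learns from (features, label) pairs.
reg = LinearRegression().fit(X, y)
print(reg.predict([[5.0]]))        # close to 10.0 for this toy data

# Unsupervised: K-means sees only the features, no labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                  # a cluster index for each sample
```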
Training, Test, and Validation Datasets
In machine learning, multiple datasets are used to build the final model. In particular, three datasets are commonly used at different stages of the model's construction: the training dataset, the validation dataset and the test dataset.
Definitions
- Training dataset
The training dataset is the actual dataset used to train the model. The model learns from this data, using it to fit the parameters (e.g., the weights and biases in a neural network).
Generally, the training data is a certain percentage of the overall dataset, with the remainder held back as the test set. As a rule, the better the training data, the better the algorithm or classifier performs.
- Validation dataset
The validation dataset is used to provide an unbiased evaluation of a model fit on the training dataset while tuning the model's hyper-parameters. The model occasionally 'sees' this dataset during evaluation, but never learns from it.
- Test dataset
The test dataset is used to provide an unbiased evaluation of the final model fit on the training dataset.
During training, only the training and validation sets are available; the test dataset must not be used during training and only becomes available when testing the classifier. Informally, the term "validation dataset" is sometimes used interchangeably with the term "test dataset", with both referring to a sample of the dataset held back from training the model.
Oftentimes, these datasets are taken from the same overall dataset. By using similar data for training and testing, you can minimize the effects of data discrepancies and better understand the characteristics of the model.
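One common way to obtain the three datasets from a single overall dataset is two successive random splits. The sketch below assumes scikit-learn and its bundled Iris data; the 60/20/20 proportions are an illustrative choice, not a rule.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples

# First hold back the test set; it is untouched until final evaluation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then split the remainder into training and validation sets.
# 0.25 of the remaining 80% gives a 60/20/20 overall split.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```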
Splitting a dataset into training and test sets
- Holdout method
Most simply, the original dataset is split more or less randomly into a training and a test dataset, while making sure the training dataset captures the important classes you know about up front.
Typically, when you separate a dataset into a training set and a test set, most of the data (e.g., 80%) is used for training, and a smaller portion (e.g., 20%) is used for testing.
The holdout method is widely used to specify the training and test datasets. However, relying on a single split can bias the classification results, and the results may not be generalizable.
- N-fold cross validation
This approach randomizes the dataset, creates N (almost) equal-size partitions, and chooses one partition for testing and the other N-1 partitions for training (within the training set, you can further employ another K-fold cross validation to create a validation set and find the best hyper-parameters). This process is repeated N times, so that each partition is used for testing exactly once, and the metric is averaged over the N runs.
Cross-validation is almost unbiased. However, it does not work in situations where the data cannot be shuffled, most notably time series. Both methods are sketched in code after this list.
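A minimal sketch of both strategies, assuming scikit-learn; the Iris dataset and the SVM classifier are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Holdout method: one random 80/20 split. stratify=y keeps the class
# proportions, helping the training set capture the known classes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
clf = SVC().fit(X_train, y_train)
print("holdout accuracy:", clf.score(X_test, y_test))

# N-fold cross validation (N = 5): each partition is used once for testing
# while the other N-1 partitions train the model; average the N scores.
scores = cross_val_score(SVC(), X, y, cv=5)
print("5-fold mean accuracy:", scores.mean())
```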
Confusion matrix
A confusion matrix (also known as an error matrix) is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.
An example confusion matrix for a binary classifier is shown below:
N = 260        Predicted: NO    Predicted: YES
Actual: NO     90               5
Actual: YES    15               150

In this confusion matrix, there are two possible predicted classes: "yes" and "no". If we were predicting the presence of a disease, for example, "yes" would mean the patients have the disease, and "no" would mean they don't. The classifier made a total of 260 predictions; for example, 260 patients were tested for the presence of the disease. Out of the 260 cases, 105 were predicted as "no" and 155 were predicted as "yes". In reality, 95 cases are labelled as "no" and 165 cases as "yes".
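The example matrix can be reproduced programmatically. The sketch below, assuming scikit-learn, rebuilds label arrays with the same counts as the example (0 = "no", 1 = "yes") and passes them to confusion_matrix:

```python
from sklearn.metrics import confusion_matrix

# Rebuild label arrays matching the example's counts (0 = "no", 1 = "yes").
y_true = [0] * 95 + [1] * 165             # 95 actual "no", 165 actual "yes"
y_pred = [0] * 90 + [1] * 5 + [0] * 15 + [1] * 150

print(confusion_matrix(y_true, y_pred))
# [[ 90   5]     rows: actual NO / YES
#  [ 15 150]]    columns: predicted NO / YES
```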
Terms in a confusion matrix
Some basic terms are defined in a confusion matrix:
- True positives (TP)
These are cases in which we predicted "yes" and the real class is "yes". E.g., we predicted the patients have the disease, and they actually do have the disease.
- True negatives (TN)
We predicted "no", and the real class is "no".
- False positives (FP) - also known as a "Type I error"
We predicted "yes", but the real class is "no". E.g., we predicted the patients have the disease, but they actually do not.
- False negatives (FN) - also known as a "Type II error"
We predicted "no", but the real class is "yes".
Adding those terms to the confusion matrix gives:

N = 260        Predicted: NO    Predicted: YES    Total
Actual: NO     TN = 90          FP = 5            95
Actual: YES    FN = 15          TP = 150          165
Total          105              155               260

A set of rates is commonly calculated from the confusion matrix of a binary classifier:
- Accuracy: Overall, how often is the classifier correct?
Accuracy = (TP + TN) / total
In the example: Accuracy = (150 + 90) / 260 = 0.923
- Misclassification Rate: Overall, how often is it wrong?
Misclassification Rate = (FP + FN) / total = 1 - Accuracy
Also known as "Error Rate". In the example: Misclassification Rate = (5 + 15) / 260 = 0.077
- True Positive Rate: When it's actually yes, how often does it predict yes?
True Positive Rate = TP / actual yes
Also known as "Sensitivity" or "Recall". In the example: True Positive Rate = 150 / 165 = 0.909
- False Positive Rate: When it's actually no, how often does it predict yes?
False Positive Rate = FP / actual no
In the example: False Positive Rate = 5 / 95 = 0.053
- Specificity: When it's actually no, how often does it predict no?
Specificity = TN / actual no = 1 - False Positive Rate
In the example: Specificity = 90 / 95 = 0.947
- Precision: When it predicts yes, how often is it correct?
Precision = TP / predicted yes
In the example: Precision = 150 / 155 = 0.968
- Prevalence: How often does the yes condition actually occur in our sample?
Prevalence = actual yes / total
In the example: Prevalence = 165 / 260 = 0.635
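All of these rates follow directly from the four counts. A minimal check in Python, using the values from the worked example:

```python
# Counts from the worked example above.
TN, FP, FN, TP = 90, 5, 15, 150
total = TN + FP + FN + TP                    # 260

accuracy = (TP + TN) / total                 # 0.923
misclassification_rate = (FP + FN) / total   # 0.077 (= 1 - accuracy)
true_positive_rate = TP / (TP + FN)          # 0.909 (sensitivity / recall)
false_positive_rate = FP / (FP + TN)         # 0.053
specificity = TN / (TN + FP)                 # 0.947 (= 1 - FPR)
precision = TP / (TP + FP)                   # 0.968
prevalence = (TP + FN) / total               # 0.635

print(round(accuracy, 3), round(precision, 3), round(prevalence, 3))
```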