  • Steps in Data Science Process

    Data scientists follow various processes to unveil information hidden in data and provide insights across many fields. In this module, we adopt a simple linear form of the data science process (as illustrated below), which includes five basic steps: acquire data, prepare data, analyse data, report, and act.

    Diagram: the five-step process (acquire data, prepare data, analyse data, report, act)

    In the acquire step, raw data are imported into your analytics platform. Then you prepare the data for analysis; data preparation can be divided into data exploration and data pre-processing. Various models can then be applied to analyse the data. Finally, the results of the investigation should be presented to the public or to internal consumers of information in a way that makes sense and can be used by stakeholders.

    In a practical scenario, this process is rarely linear. Each step can push data scientists back to previous steps before achieving the final goal, forcing them to revisit the methods, techniques, or even the data sources.

    In the following, we will concentrate on the data preparation and data analysis steps.

  • Exploring data - Correlations

    In the broadest sense, correlation is any statistical association. In common usage it indicates whether, and how strongly, a pair of variables has a linear relationship. Positive correlation indicates the extent to which the two variables increase or decrease in parallel; negative correlation indicates the extent to which one variable increases as the other decreases.

    Correlations can have a value - Correlation Coefficient $r$ :
    $r = 1$ means a perfect positive correlation
    $r = 0$ means no correlation (the values don't seem linked at all)
    $r = -1$ means a perfect negative correlation

    The value of $r$ shows how strong the correlation is:

    Correlation Diagram 1

    For example, a local ice cream shop plots how much ice cream they sell versus the temperature on that day, and gets the following scatter plot:

    Scatter Plot Graph Diagram 1

    We can see that temperature and sales increase together: there is a high positive correlation between these two variables.

    The correlation between variable $x$ and $y$ can be calculated using a formula:

    $$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \hat{x})(y_i - \hat{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \hat{x})^2 \sum_{i=1}^{n} (y_i - \hat{y})^2}}$$

    Where $n$ is the number of $x$ or $y$ values,

    $\sum$ is Sigma, the symbol for "sum up",

    $(x_i - \hat{x})$ is each $x$-value minus the mean of $x$

    $(y_i - \hat{y})$ is each $y$-value minus the mean of $y$.

    Or you can just make a Scatter Plot, and look at it to get a rough idea about the correlation.
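    To make the formula concrete, here is a minimal Python sketch (assuming NumPy is installed; the temperature and sales figures are made up for illustration) that computes $r$ both directly from the formula and with NumPy's built-in corrcoef function:

```python
# A minimal sketch (not part of the module materials): compute Pearson's r
# for made-up temperature vs. ice cream sales figures.
import numpy as np

temperature = np.array([14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1])
sales = np.array([215, 325, 185, 332, 406, 522, 412, 614, 544, 421])

# Direct use of the formula: sum of products of deviations from the means,
# divided by the square root of the product of the sums of squared deviations.
x_dev = temperature - temperature.mean()
y_dev = sales - sales.mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# The same value from NumPy's built-in correlation matrix.
r_numpy = np.corrcoef(temperature, sales)[0, 1]

print(round(r_manual, 3), round(r_numpy, 3))  # both close to +1: strong positive correlation
```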

    Correlation is NOT causation! i.e., a correlation does not prove one thing causes the other. One classic example is about ice cream and murder:

    The rates of violent crime and murder have been known to jump when ice cream sales do. But, presumably, buying ice cream doesn't turn you into a killer.

  • Exploring data – Outliers

    An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. An outlier may be due to variability in the measurement or it may indicate experimental error. An outlier can cause serious problems in statistical analyses and data analyses.

    Outlier Graph Diagram 1
    Outlier Graph Diagram 2

    In a sense, the definition of outlier leaves it up to the analyst to decide what will be considered abnormal. There is no precise way to define and identify outliers in general because of the specifics of each dataset. Instead, you, or a domain expert, must interpret the raw observations and decide whether a value is an outlier or not. Nevertheless, the following statistical methods can be used to identify observations that appear to be rare or unlikely given the available data.

    1. Box plots or box and whisker plots
    These plots provide a graphical depiction of the data distribution and its extreme values, and can be used as an initial screening tool for outliers, as points lying far beyond the whiskers (typically more than 1.5 times the interquartile range from the quartiles) stand out clearly (an IQR-based sketch follows this list).

    2. Probability plots
    This type of plot graphically displays a data set’s conformance to a normal distribution. It can be used as an initial screening tool for outliers, as points that deviate markedly from the straight line are potential outliers.

    3. Dixon’s test
    It evaluates a single suspected outlier by computing a test statistic. If the test statistic is greater than the critical value, the suspected outlier is confirmed as a statistical outlier.

    4. Rosner’s test
    This test helps to identify multiple outliers in a data set with at least 20 normally-distributed values.
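    As a concrete illustration of the box-plot screening rule in item 1, the following minimal Python sketch (assuming NumPy; the sample values are made up) flags points lying more than 1.5 times the interquartile range beyond the quartiles:

```python
# A minimal sketch of the box-plot (IQR) screening rule from item 1 above,
# applied to a small made-up sample.
import numpy as np

data = np.array([36, 25, 38, 46, 55, 68, 72, 55, 36, 38,
                 67, 45, 22, 48, 91, 46, 52, 61, 58, 55])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the box plot "whisker" fences

outliers = data[(data < lower) | (data > upper)]
print(lower, upper, outliers)  # here only the value 91 falls outside the fences
```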

  • Exploring data – Summary statistics

    Summary statistics give a quick and simple description of the data. Summary statistics can include:

    1. Mean
    The average of all the numbers, sometimes called the arithmetic mean. $$\hat{x} = \frac{1}{n} \Big(\sum_{i=1}^{n}x_i \Big) = \frac{x_1 + x_2 + \cdots + x_n}{n} $$

    2. Median
    The statistical median is the middle number in a sequence of numbers.
    To find the median, organize each number in order by size; the number in the middle is the median.

    3. Mode
    The number that occurs most often within a set of numbers.
    Mode helps identify the most common or frequent occurrence of a characteristic. It is possible to have two modes (bimodal), three modes (trimodal) or more modes within larger sets of numbers.

    4. Range
    The difference between the highest and lowest values within a set of numbers.
    To calculate range, subtract the smallest number from the largest number in the set:
    Range = Maximum – Minimum

    5. Standard deviation $\sigma$
    Standard deviation is a measure used to quantify the amount of variation or dispersion of a set of data values:

    $$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$$

    Where $\mu$ is the mean of the variable $x$ and $n$ is the number of values.

    A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set; a high standard deviation indicates that the data points are spread out over a wider range of values.
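    The following minimal Python sketch (using the standard library and NumPy on a small made-up sample) computes all five summary statistics described above:

```python
# A minimal sketch computing the five summary statistics above, using the
# Python standard library and NumPy on a small made-up sample.
import statistics
import numpy as np

x = [36, 25, 38, 46, 55, 68, 72, 55, 36, 38,
     67, 45, 22, 48, 91, 46, 52, 61, 58, 55]

mean = statistics.mean(x)         # arithmetic mean
median = statistics.median(x)     # middle value of the sorted data
mode = statistics.mode(x)         # most frequent value (55 in this sample)
value_range = max(x) - min(x)     # maximum minus minimum
sigma = np.std(x)                 # population standard deviation (divides by n)

print(mean, median, mode, value_range, round(sigma, 2))
```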

  • Exploring data - Visualisation

    Histogram

    A histogram is a graphical display of data using bars of different heights. It is similar to a bar chart, but a histogram groups numbers into ranges (bins) and then counts how many values fall into each interval. The bins must be adjacent, and are often (but are not required to be) of equal size.

    Histograms give a rough sense of the density of the underlying distribution of the data. A histogram is an estimate of the probability distribution of a continuous (quantitative) variable.

    For example, suppose you have the 20 data values listed in the table below:

    36 25 38 46 55 68 72 55 36 38
    67 45 22 48 91 46 52 61 58 55

    Split them into 8 bins, with each bin representing an interval of 10 starting from 20. Each bin contains the number of occurrences of scores in the data set that are contained within that bin. For the above data set, the frequencies in each bin have been tabulated along with the scores that contributed to the frequency in each bin (see below):

    Bin      Frequency   Scores included in bin
    20-30    2           25, 22
    30-40    4           36, 38, 36, 38
    40-50    4           46, 45, 48, 46
    50-60    5           55, 55, 52, 58, 55
    60-70    3           68, 67, 61
    70-80    1           72
    80-90    0           -
    90-100   1           91

    Plotting the frequency against the bins, you get the following histogram:

    Bar Graph Example
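    The histogram above can be reproduced with a few lines of Python, assuming matplotlib is available; the bin edges 20, 30, ..., 100 give the eight bins of width 10 used in the table:

```python
# A minimal sketch (assuming matplotlib is installed) reproducing the
# histogram above: 8 bins of width 10, with edges at 20, 30, ..., 100.
import matplotlib.pyplot as plt

scores = [36, 25, 38, 46, 55, 68, 72, 55, 36, 38,
          67, 45, 22, 48, 91, 46, 52, 61, 58, 55]

plt.hist(scores, bins=range(20, 101, 10), edgecolor="black")
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.show()
```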

    There is no right or wrong answer as to how wide a bin should be, but there are rules of thumb. You need to make sure that the bins are not too small or too large. The following histograms use the same data, but have different bins:

    Bar Graph Example 2

    Histograms can have different patterns. Some examples are shown below. When a distribution differs from the expected shape, the underlying process should be examined to find the real causes.

    Normal, Right-skewed and Bimodal Diagrams

    Scatter plots

    A scatter plot is a two-dimensional data visualization. Data points are plotted on a horizontal and a vertical axis in the attempt to show how much one variable is affected by another. If the points are color-coded, one additional variable can be displayed. A scatter plot can suggest various kinds of correlations between variables with a certain confidence interval. For example, the scatter plot below suggests the data could be grouped into two clusters.

    Scatter Plot Graphs Diagrams

    As used in section Exploring data - Correlations, scatter plots can be used to visualise correlations between variables.

    Scatter Plot Graphs Diagrams 1
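    As a sketch, the ice cream scatter plot from the correlation section could be produced in Python with matplotlib (the temperature and sales figures below are made up for illustration):

```python
# A minimal sketch (assuming matplotlib) of the temperature vs. sales scatter
# plot; the figures are made up for illustration.
import matplotlib.pyplot as plt

temperature = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1]
sales = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421]

plt.scatter(temperature, sales)
plt.xlabel("Temperature (°C)")
plt.ylabel("Ice cream sales")
plt.show()
```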
  • Data Pre-processing overview

    Data preparation is a very important part of the data science process. In fact, this is where you will spend most of your time on any data science effort. Data pre-processing can be a tedious process, but it is a crucial step. Always remember, “garbage in, garbage out”. If you don't spend the time and effort to create good data for the analysis, you will not get good results no matter how sophisticated the analysis technique you're using is.

    Two major aims to be achieved in data pre-processing:

    1. To clean the data to address data quality issues, which include:
    Missing values
    Invalid data
    Outliers
    Domain knowledge is essential to making informed decisions on how to handle incomplete or incorrect data.

    2. To transform the raw data to make it suitable for analysis, which includes
    Scaling
    Transformation
    Feature selection
    Dimensionality reduction
    Data manipulation

  • Data Pre-processing – Missing values

    In statistics, missing values occur when no data value is stored for the variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.

    Types of missing values

    When considering the potential impact of the missing data on data analysis, it is important to consider the underlying reasons for why the data are missing. Missing data are typically grouped into three categories:
    Missing completely at random (MCAR)
    When data are MCAR, events that lead to any particular data-item being missing are independent both of observable variables and of unobservable parameters of interest, and occur entirely at random.
    In these instances, the missing data reduce the population of the study and consequently, the statistical power, but the analysis performed on the data is unbiased: when data are MCAR, the data which remain can be considered a simple random sample of the full data set. MCAR is generally regarded as a strong and often unrealistic assumption.
    Missing at random (MAR)
    When data are MAR, the fact that the data are missing is systematically related to the observed data but not to the unobserved data.
    Missing not at random (MNAR)
    When data are MNAR, the fact that the data are missing is systematically related to the unobserved data, that is, the missingness is related to events or factors which are not measured by the researcher.

    Handling missing values

    If the number of missing values is extremely small (say, less than 5% of the sample), you can drop or omit those values from the analysis.

    If the number of missing values is larger, it can be better to drop those cases in multivariate analysis. In univariate analysis, on the other hand, imputation can decrease the amount of bias in the data, provided the values are missing at random.

    Common methods for data imputation include:
    Mean imputation
    K-nearest neighbours (KNN) imputation
    Fuzzy K-means (FKM) imputation
    Singular value decomposition (SVD) imputation
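    As a minimal sketch, the first two of these methods are available in scikit-learn (the small table with np.nan values below is made up for illustration):

```python
# A minimal sketch of mean imputation and KNN imputation with scikit-learn,
# on a small made-up numeric table containing missing values (np.nan).
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)  # fill with column means
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X)        # fill from the 2 nearest rows

print(mean_imputed)
print(knn_imputed)
```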
  • Data Pre-processing – Invalid data & outlier

    Invalid data

    A value is invalid if it is inconsistent with your input statement. For example, it does not conform to the specified format, does not match the input style used, or is out of range (too large or too small).

    Invalid data can be grouped into three categories:
    Official missing data values
    Invalid values that provide no information regarding true value
    Invalid values that suggest true value

    The problem of invalid data will typically be transformed into one of missing data.

    Outlier

    Outliers could be just meaningless aberrations caused by measurement and recording errors, but they could also be Extreme Values containing valuable information. So it is important to question and analyse outliers in any case to see what their actual meaning is.

    Why are they occurring, where, and what might the meaning be? The answer could be different from case to case, but it’s important to have this investigation rather than to simply ignore them.

  • Data Pre-processing – Scaling and Normalization

    Scaling

    Scaling, or min-max scaling, is a method used to standardize the range of independent variables or features of data. It scales features to lie between a given minimum and maximum value, e.g. between zero and one, so that the maximum absolute value of each feature is scaled to unit size.

    In statistics, scaling usually means a linear transformation:

    $$f(x) = ax + b.$$

    In some machine learning algorithms, objective functions will not work properly without normalization, for example methods based on measures of how far apart data points are, such as support vector machines (SVM) or k-nearest neighbours (KNN). With these algorithms, a change of "1" in any numeric feature is given the same importance. If one of the features has a broad range of values, the distance will be governed by that particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
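    A minimal Python sketch of min-max scaling, shown both by hand and with scikit-learn's MinMaxScaler (the data are made up; each column is scaled independently to [0, 1]):

```python
# A minimal sketch of min-max scaling to [0, 1], by hand and with
# scikit-learn's MinMaxScaler, on made-up data (each column scaled separately).
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 800.0]])

# By hand: (x - min) / (max - min), column by column.
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# The same with scikit-learn.
X_sklearn = MinMaxScaler().fit_transform(X)

print(np.allclose(X_manual, X_sklearn))  # True
```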

    Normalization

    Scaling just changes the range of your data. Normalization, also called Z-score normalization or standardization, applies a transformation so that the transformed data are roughly normally distributed, i.e., have the properties of a standard normal distribution with:

    $$ \mu = 0\; \text{and}\; \sigma = 1$$

    Standard scores (also called $z$ scores) are calculated by subtracting the mean and dividing by the standard deviation:

    $$ z = \frac{x - \mu}{\sigma} $$

    Normalization is not only important if we are comparing measurements that have different units, but it is also a general requirement for many machine learning algorithms, for example methods based on the gradient descent algorithm (an optimization algorithm often used in logistic regression, SVMs, neural networks, etc.). When gradient descent is used, if features are on different scales, certain weights may update faster than others.
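    A minimal Python sketch of z-score normalization, by hand and with scikit-learn's StandardScaler (made-up data; each column ends up with mean 0 and standard deviation 1):

```python
# A minimal sketch of z-score normalization, by hand and with scikit-learn's
# StandardScaler, on made-up data (each column ends up with mean 0, std 1).
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 800.0]])

# By hand: subtract the column mean, divide by the column standard deviation.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same with scikit-learn (which also uses the population standard deviation).
X_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_sklearn))  # True
```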

  • Data Pre-processing – Feature Selection and Dimensionality Reduction

    Feature selection

    Feature selection, also called variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction.

    Top reasons to use feature selection are:
    It enables the machine learning algorithm to train faster.
    It reduces the complexity of a model and makes it easier to interpret.
    It improves the accuracy of a model if the right subset is chosen.
    It reduces overfitting.

    There are three main categories of methods for feature selection:
    Filter methods
    Filter methods select features based on their scores in various statistical tests for their correlation with the outcome variable. The features are ranked by the score and either selected to be kept or removed from the dataset. Examples of some statistical tests are: Pearson’s Correlation, Linear discriminant analysis, ANOVA, Chi-square.
    Wrapper methods
    Wrapper methods use a predictive model to score feature subsets, where different combinations are prepared, evaluated and compared to other combinations. An example of wrapper methods is the recursive feature elimination algorithm.
    Wrapper Method Diagram
    Embedded methods
    Embedded methods learn which features best contribute to the accuracy of the model while the model is being created. The most common type of embedded feature selection methods are regularization methods.
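    As a minimal sketch of the filter and wrapper categories above (using scikit-learn and its bundled iris data set, which are not part of this module's materials), the two approaches could look like this:

```python
# A minimal sketch of a filter method (SelectKBest with an ANOVA F-test) and a
# wrapper method (recursive feature elimination), using scikit-learn's bundled
# iris data purely for illustration.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter: keep the 2 features with the highest ANOVA F-score.
X_filtered = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Wrapper: recursively eliminate features, scoring subsets with a logistic regression model.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)

print(X_filtered.shape)  # (150, 2)
print(rfe.support_)      # boolean mask of the features kept by RFE
```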

    Dimensionality reduction

    Dimension reduction is the process of converting a data set with a large number of dimensions into one with fewer dimensions while ensuring that it conveys similar information concisely.

    For example, as shown below, the points on the two-dimensional plane ($x_1, x_2$) could be represented along a one-dimensional line ($z_1$).

    points on the two dimensional plane diagram

    Also, in the following example, points in the 3D space ($x_1, x_2, x_3$) can be converted into a 2D plane ($z_1, z_2$) with little information lost.

    3D plane converted to 2D plane diagram
    Common methods for Dimension Reduction include:
    Decision Trees
    Random Forest
    High Correlation
    Factor Analysis
    Principal Component Analysis (PCA)

    In PCA, variables are transformed into a new set of variables, which are linear combinations of the original variables.

    linear combination of original variables diagram
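    A minimal PCA sketch with scikit-learn (using its bundled iris data set purely for illustration), projecting four-dimensional data onto the first two principal components:

```python
# A minimal PCA sketch with scikit-learn: project the 4-dimensional iris data
# (used purely for illustration) onto its first two principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize first so every feature contributes on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance captured by each component
```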

    Difference between Feature Selection and Dimensionality Reduction

    Both feature selection and dimensionality reduction seek to reduce the number of attributes in the dataset. A dimensionality reduction method does so by creating new combinations of attributes, whereas feature selection methods include or exclude attributes present in the data without changing them. That said, in some articles feature selection is treated as a type of dimensionality reduction method.

School of Computing, Engineering and Built Environment