Machine Learning: Visualisation and Analysis with Amazon SageMaker Data Wrangler

In a machine learning project, data preparation is typically the most time-consuming part: data scientists and data engineers spend much of their time getting the data into a form that machine learning algorithms can work with.


Let’s say there is a clinic that specialises in Hypertension (high blood pressure). Over a number of years the clinic has treated more than a hundred thousand Hypertension patients. The clinic keeps records on each patient, covering age, gender, ethnic group, job sector, weekly exercise habits, other chronic conditions, long-term medications, address (which may hint at a patient’s socioeconomic position, though this is not necessarily reliable) and more. The treatment plan varies from patient to patient and is also part of the patient records (the patient dataset). In total, the patient dataset contains more than 70 features.


Through treatment, many patients have their Hypertension managed reasonably well, but some are re-admitted within a certain period of time. The clinic would like to use machine learning to develop a model that predicts patient re-admission within the next six months, twelve months and twenty-four months, respectively.


The clinic does not have a data science team, so it has hired a consultant to advise on the machine-learning project. The consultant has explained data preparation to the clinic as follows:


The quality of the data directly impacts a machine learning model’s performance. If the data is not prepared thoroughly, the patterns that the model relies on will be inaccurate, leading to poor or false predictions. Raw data often arrives with missing values, inaccuracies and other errors. Separate datasets are often in different formats that need to be reconciled when they are combined. Also, out of the many features, some are more relevant to the prediction than others. Furthermore, some features are correlated with each other – understanding such correlations is important so that these features are not treated as independent.


The consultant suggests Amazon SageMaker Data Wrangler as the data preparation tool. 

  

AWS describes Amazon SageMaker Data Wrangler as the ‘fastest and easiest way to prepare data for machine learning’. The data preparation time for tabular, image, and text data can be greatly reduced. Data Wrangler helps to simplify data preparation and feature engineering through a visual and natural language interface. 


 


(from the AWS website)


The clinic particularly likes Data Wrangler’s visualisation and analysis capabilities for its use case.


First, the consultant prepares an initial version of the patient dataset and uploads it to a designated S3 bucket. Then a Data Wrangler Flow is created in Amazon SageMaker Studio.

This is followed by importing data from the S3 bucket.
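Outside the Data Wrangler UI, the upload and import steps would look roughly like the sketch below, using boto3 and pandas. The bucket name, key and file name are placeholders for illustration, not the clinic’s real values.

import boto3
import pandas as pd

# Placeholder bucket, key and local file name - illustration only
bucket = "clinic-ml-data"
key = "hypertension/patients.csv"

# Upload the prepared patient dataset to the designated S3 bucket
boto3.client("s3").upload_file("patients.csv", bucket, key)

# Data Wrangler imports the object through its UI; the equivalent read in
# pandas (requires the s3fs package) looks like this
df = pd.read_csv(f"s3://{bucket}/{key}")
print(df.shape)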

A number of transformation steps are performed to prepare the data (a rough pandas/scikit-learn sketch of comparable steps follows the list):

  • ensuring there are no duplicate columns
  • ensuring there are no duplicate entries
  • performing imputation on missing values 
  • normalization (standard scaling)
  • one hot encoding on some categorical features
  • custom transformation 
  • analysis
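
As a rough illustration of what these steps do, here is a minimal pandas/scikit-learn sketch of comparable transformations. The column names (Age, Weekly_Exercise_Hours, Gender, Job_Sector) are assumptions rather than the clinic’s actual schema, and Data Wrangler performs the equivalent steps through its visual interface rather than code.

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("patients.csv")  # assumed local copy of the patient dataset

# Remove duplicate columns and duplicate rows
df = df.loc[:, ~df.columns.duplicated()]
df = df.drop_duplicates()

# Impute missing numeric values with the column median
numeric_cols = ["Age", "Weekly_Exercise_Hours"]
df[numeric_cols] = SimpleImputer(strategy="median").fit_transform(df[numeric_cols])

# Normalise with standard scaling (zero mean, unit variance)
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

# One-hot encode selected categorical features
df = pd.get_dummies(df, columns=["Gender", "Job_Sector"])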

Data Wrangler supports visualisation. A histogram shows the counts of values for a chosen feature, and relationships between features can be viewed by using the Color-by option.
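For example, a histogram with a Color-by style breakdown could be sketched in Altair roughly as follows; the Age and Readmitted column names are assumed for illustration.

import altair as alt
import pandas as pd

df = pd.read_csv("patients.csv")  # assumed local copy of the patient dataset

# Binned histogram of Age, with bars segmented by an assumed Readmitted label,
# similar in spirit to Data Wrangler's Color-by option
chart = alt.Chart(df).mark_bar().encode(
    x=alt.X("Age:Q", bin=True),
    y="count()",
    color="Readmitted:N",
)
chart.save("age_histogram.html")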



A Scatter Plot can be used to inspect the relationship between features. Scatter Plot charts need two numeric columns for the X axis and the Y axis, and the points can be coloured by an additional column.
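A comparable scatter plot could be sketched in Altair like this, again with assumed column names:

import altair as alt
import pandas as pd

df = pd.read_csv("patients.csv")  # assumed local copy of the patient dataset

# Two numeric columns on the axes, coloured by a third (assumed) column
chart = alt.Chart(df).mark_circle(size=60).encode(
    x="Age:Q",
    y="Weekly_Exercise_Hours:Q",
    color="Readmitted:N",
    tooltip=["Age", "Weekly_Exercise_Hours", "Readmitted"],
)
chart.save("scatter.html")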



The Quick Model chart can be used to quickly evaluate the data and produce an importance score for each feature. A feature importance score indicates how useful a feature is at predicting the target label. The score is between 0 and 1, and a higher number indicates that the feature is more important to the whole dataset.
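Quick Model’s internals aside, a rough analogue of such scores can be produced with a scikit-learn random forest, whose impurity-based importances also sum to 1. The file name and the Readmitted_6m target column below are hypothetical.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("patients_prepared.csv")  # assumed: already cleaned and encoded
X = df.drop(columns=["Readmitted_6m"])     # hypothetical target column
y = df["Readmitted_6m"]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importance scores, normalised to sum to 1
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))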




Multicollinearity

Multicollinearity is important to watch for in data analysis. It occurs when two or more features used for prediction are related to each other. When there is multicollinearity, the features are not only predictive of the target variable but also predictive of each other. Multicollinearity can be measured using the Variance Inflation Factor (VIF), Principal Component Analysis (PCA), or Lasso feature selection (a sketch of all three follows the list below).

  • Data Wrangler returns a VIF score as a measure of how closely the variables are related to each other. A VIF score is a positive number that is greater than or equal to 1.
    • VIF = 1: uncorrelated
    • 1 < VIF <= 5: moderately correlated
    • 5 < VIF <=50: highly correlated (VIF is clipped at 50 in Data Wrangler)
  • In PCA’s case, Data Wrangler normalizes each feature to have a mean of 0 and a standard deviation of 1. An ordered list of variances is generated. These variances are also known as singular values. The values in the list of variances are greater than or equal to 0, which can be used to determine how much multicollinearity there is in the data. In this context PCA is also called SVD - Singular Value Decomposition.
  • Lasso feature selection uses the L1 regularisation to only include the most predictive features. The regularisation technique generates a coefficient for each feature. The absolute value of the coefficient provides an importance score for the feature. A higher importance score indicates that it is more predictive of the target variable. A common feature selection method is to use all the features that have a non-zero lasso coefficient.
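
As a rough sketch under stated assumptions, the three measures could be computed outside Data Wrangler with statsmodels and scikit-learn as shown below. The file name, feature columns and Readmitted_6m target are hypothetical, and the Lasso is fitted directly on the 0/1 target purely for illustration.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("patients_prepared.csv")  # assumed: cleaned, numeric-only columns
X = df.drop(columns=["Readmitted_6m"])     # hypothetical target column
y = df["Readmitted_6m"]
X_scaled = StandardScaler().fit_transform(X)

# 1. VIF: one score per feature; values above 5 suggest high collinearity
vif = pd.Series(
    [variance_inflation_factor(X_scaled, i) for i in range(X_scaled.shape[1])],
    index=X.columns,
)

# 2. PCA on standardised features: an ordered list of variances; very small
#    trailing values hint that some features are linear combinations of others
variances = PCA().fit(X_scaled).explained_variance_

# 3. Lasso (L1): features with non-zero coefficients are the ones kept
coef = pd.Series(Lasso(alpha=0.01).fit(X_scaled, y).coef_, index=X.columns)
selected = coef[coef != 0]

print(vif.sort_values(ascending=False).head())
print(variances[:5])
print(selected.sort_values(key=np.abs, ascending=False).head())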

There are other predesigned visualisation and analysis models in Data Wrangler. 

Custom visualisations can also be generated for custom analysis. Following is a sample code block to create a custom histogram:

# In a Data Wrangler custom visualisation, the current dataset is available
# as the pandas DataFrame `df`, and the resulting Altair chart is stored in
# the variable `chart`
import altair as alt

# Work on a small sample and treat the "Age" column as the value to plot
df = df.iloc[:30]
df = df.rename(columns={"Age": "value"})

# Count how often each value occurs
df = df.assign(count=df.groupby('value').value.transform('count'))
df = df[["value", "count"]]

# Binned bar chart of the values, plus a red rule marking the mean
base = alt.Chart(df)
bar = base.mark_bar().encode(x=alt.X('value', bin=True, axis=None), y=alt.Y('count'))
rule = base.mark_rule(color='red').encode(
    x='mean(value):Q',
    size=alt.value(5))
chart = bar + rule



The clinic is convinced that Amazon SageMaker Data Wrangler is a powerful and user-friendly data preparation tool.

                             -- Simon Wang
