Machine Learning: Visualisation and Analysis with Amazon SageMaker Data Wrangler
In a machine learning project, the most time-consuming part is data preparation: data scientists and data engineers typically spend more time getting the data into a shape suitable for machine-learning algorithms than on any other task.
Let’s say there is a clinic that specialises in Hypertension (high blood pressure). Over a number of years the clinic has treated more than a hundred thousand Hypertension patients. The clinic keeps patient records covering age, gender, ethnic group, job sector, weekly exercise habits, other chronic conditions, long-term medications, address (which may hint at a patient’s socioeconomic position, though this may not be reliable) and more. The treatment plan varies from patient to patient and is also part of the patient records (the patient dataset). In total, there are more than 70 features in the patient dataset.
Through treatment, many patients have their Hypertension managed reasonably well, but some patients are re-admitted within a certain period of time. The clinic would like to use machine learning to develop a model that predicts patient re-admission within the next six, twelve and twenty-four months, respectively.
The clinic does not have a data science team but has hired a consultant to help provide advice on the machine-learning project. The consultant has explained the data preparation to the clinic:
The quality of the data directly impacts a machine-learning model's performance. If the data is not prepared thoroughly, the patterns the model relies on will be inaccurate, leading to poor or false predictions. Raw data often contains missing values, inaccuracies and other errors. Separate datasets often come in different formats that must be reconciled when they are combined. Also, out of the many features, some are more relevant to the prediction than others. Furthermore, some features are correlated with one another – understanding such correlations is important so that these features are not treated as independent.
The consultant suggests Amazon SageMaker Data Wrangler as the data preparation tool.
AWS describes Amazon SageMaker Data Wrangler as the ‘fastest and easiest way to prepare data for machine learning’. It can greatly reduce data preparation time for tabular, image and text data, and it simplifies data preparation and feature engineering through a visual and natural-language interface.
(from AWS website)
The clinic particularly likes Data Wrangler’s visualisation and analysis capabilities for their use case.
First, the consultant pre-prepares the patient dataset and uploads it to a designated S3 bucket. A Data Wrangler Flow is then created in Amazon SageMaker Studio, with the following steps:
- ensuring there are no duplicate columns
- ensuring there are no duplicate entries
- performing imputation on missing values
- normalization (standard scaling)
- one hot encoding on some categorical features
- custom transformation
- analysis
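Outside Data Wrangler's visual interface, the first few of these transforms can be sketched in plain pandas/scikit-learn. This is only an illustration – the column names (`age`, `gender`, `systolic`) are made up and do not come from the clinic's actual schema:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the patient dataset (illustrative columns only)
df = pd.DataFrame({
    "age": [54.0, 61.0, 61.0, None, 48.0],
    "gender": ["F", "M", "M", "F", "M"],
    "systolic": [150.0, 165.0, 165.0, 142.0, None],
})

# Drop duplicate columns, then duplicate entries (rows)
df = df.loc[:, ~df.columns.duplicated()]
df = df.drop_duplicates()

# Impute missing numeric values with the column mean
for col in ["age", "systolic"]:
    df[col] = df[col].fillna(df[col].mean())

# Standard scaling: mean 0, standard deviation 1
df[["age", "systolic"]] = StandardScaler().fit_transform(df[["age", "systolic"]])

# One-hot encode the categorical feature
df = pd.get_dummies(df, columns=["gender"])
```

After these steps the toy frame has no missing values, one duplicate row removed, scaled numeric columns and `gender_F`/`gender_M` indicator columns.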
- Data Wrangler returns a VIF (Variance Inflation Factor) score as a measure of how closely the variables are related to each other. A VIF score is a positive number that is greater than or equal to 1.
- VIF = 1: uncorrelated
- 1 < VIF <= 5: moderately correlated
- 5 < VIF <= 50: highly correlated (VIF is clipped at 50 in Data Wrangler)
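To make the VIF idea concrete, here is a minimal sketch of how such a score can be computed by hand – regress each feature on the others and take VIF = 1 / (1 − R²), clipping at 50 as Data Wrangler does. The data and the `vif_scores` helper are invented for illustration:

```python
import numpy as np
import pandas as pd

def vif_scores(df):
    """VIF per column: regress it on the other columns, VIF = 1/(1 - R^2)."""
    scores = {}
    X = df.to_numpy(dtype=float)
    n, p = X.shape
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])      # add an intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - (resid @ resid) / (((y - y.mean()) ** 2).sum())
        vif = 1.0 / (1.0 - r2) if r2 < 1 else np.inf
        scores[df.columns[j]] = min(vif, 50.0)         # clip at 50
    return scores

# Toy data: "a" and "b" are almost identical, "c" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
scores = vif_scores(df)
```

Here `a` and `b` get high VIF scores (they carry nearly the same information), while the independent `c` stays close to 1.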
- In PCA’s case, Data Wrangler normalizes each feature to have a mean of 0 and a standard deviation of 1, then generates an ordered list of variances. These variances are also known as singular values. Each value in the list is greater than or equal to 0, and the list can be used to determine how much multicollinearity there is in the data: values close to 0 indicate stronger multicollinearity. In this context PCA is also called SVD – Singular Value Decomposition.
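The PCA/SVD variance list can be sketched with NumPy – standardize the features, take the singular values of the matrix, and convert them to per-component variances. The toy data below (one near-duplicate feature) is an assumption for illustration:

```python
import numpy as np

# Toy data: two nearly identical features plus one independent feature
rng = np.random.default_rng(1)
a = rng.normal(size=200)
X = np.column_stack([
    a,
    a + rng.normal(scale=0.05, size=200),  # almost a copy of the first column
    rng.normal(size=200),
])

# Normalize each feature to mean 0, standard deviation 1
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Singular values of the standardized matrix, largest first
s = np.linalg.svd(Xs, compute_uv=False)
variances = s**2 / len(Xs)   # variance explained per principal component
```

The smallest entry of `variances` is close to 0, signalling that one feature is (almost) a linear combination of the others.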
- Lasso feature selection uses L1 regularisation to keep only the most predictive features. The regularisation generates a coefficient for each feature, and the absolute value of the coefficient provides an importance score for the feature. A higher importance score indicates that the feature is more predictive of the target variable. A common feature selection method is to keep all the features that have a non-zero lasso coefficient.
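A small scikit-learn sketch of this idea, on invented data where the target depends on only two of five features (the `alpha` value is an illustrative choice, not Data Wrangler's setting):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Toy data: the target depends only on the first two of five features
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Fit an L1-regularised linear model on standardized features
Xs = StandardScaler().fit_transform(X)
model = Lasso(alpha=0.1).fit(Xs, y)

# |coefficient| is the importance score; non-zero coefficient => selected
importance = np.abs(model.coef_)
selected = [i for i, c in enumerate(model.coef_) if c != 0.0]
```

The L1 penalty shrinks the coefficients of the three irrelevant features towards zero, so `selected` recovers the two truly predictive ones.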
As an example of a custom visualisation, the following Altair snippet (run inside Data Wrangler, where `df` holds the current data frame) plots a histogram of the first 30 `Age` values with the mean marked as a red rule:

```python
import altair as alt

# df is supplied by Data Wrangler; work on the first 30 rows only
df = df.iloc[:30]
df = df.rename(columns={"Age": "value"})
df = df.assign(count=df.groupby('value').value.transform('count'))
df = df[["value", "count"]]

base = alt.Chart(df)
bar = base.mark_bar().encode(
    x=alt.X('value', bin=True, axis=None),
    y=alt.Y('count'),
)
rule = base.mark_rule(color='red').encode(
    x='mean(value):Q',
    size=alt.value(5),
)
chart = bar + rule
```