Solving PII Data Security Problems in an AWS Machine Learning Use Case

Recently I discussed how a solution for extracting large volumes of data from a set of enterprise applications into AWS S3 for processing helped an organisation achieve its desired data analytics outcomes.

A further initiative has since commenced to leverage that data using AI, with the aim of applying further intelligence to analytics and of providing a Gen AI-style service in which users can ask questions about, for example, career development training options and receive intelligent suggestions. This initiative is, of course, a progressive process.

But before any AI work can begin, the data needs to be fed into AWS machine learning services for ML training and analysis.

This was where a big obstacle existed that almost ground development to a standstill: the data from the enterprise applications contains PII (Personally Identifiable Information), and the organisation has clear policies on protecting PII, including that no PII can be made subject to machine learning or AI.

Fortunately, AWS already has a solution to this problem: machine learning PII redaction.


Machine Learning PII Redaction


PII is information that enables the identity of the individual to whom it applies to be reasonably inferred. It can be direct information, such as a name, telephone number, address or date of birth, or indirect information that identifies a specific individual when combined with other pieces of data, such as a unique number corresponding to a specific line record in an employee record table.

ML PII redaction refers to censoring or obscuring PII in source data so that the processed data fed into ML models carries privacy protection assurance and maintains compliance, be it for ML training, fine-tuning, or use with deep learning models.

This AWS solution starts with executing the following steps in Amazon SageMaker Studio:


  • On the SageMaker console, choose Studio in the navigation pane.
  • Choose the SageMaker Studio domain and user profile.
  • Open Studio.
  • Choose the file icon in the navigation pane.
  • Choose the upload icon, then choose redact-pii.flow.


Then open redact-pii.flow. Once loading is complete, the flow is defined by two groups of steps: the data source specification, followed by multiple transformation steps.


Data Source Specification


In this solution’s case, the data source is a CSV file in an S3 bucket. Once this is specified in SageMaker Studio, the data is imported for a new flow.

The file contains roughly 860,000 rows, and the Sampling field is set to 9,000, so SageMaker Data Wrangler will sample 9,000 rows to display in the user interface. The Data types step sets the data type for each column of imported data.

The next step is the Sampling specification, which sets the number of rows SageMaker Data Wrangler will sample for an export job, using the Approximate sample size field. This is distinct from the number of rows sampled for display in the user interface.
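To make the two sampling notions concrete, below is a minimal sketch of the same idea outside Data Wrangler; the bucket and file names are hypothetical, and reading directly from S3 with pandas assumes the s3fs package is installed.

import pandas as pd

# Hypothetical source location; in the actual flow this is configured in the Studio UI.
df = pd.read_csv("s3://example-bucket/source/employee_data.csv")

# Roughly analogous to the 9,000-row sample displayed in the user interface
ui_sample = df.sample(n=9000, random_state=0)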

SageMaker Data Wrangler Custom Transforms

The next steps together involve using SageMaker Data Wrangler custom transforms. Custom transforms enable running your own Python or SQL code within a Data Wrangler flow. SageMaker Data Wrangler optimises Python user-defined functions to provide performance similar to an Apache Spark plugin, without you needing to know PySpark or pandas.

As such, while AWS recommends using either the Python (PySpark) or the Python (user-defined function) option, this solution uses a Python user-defined function written in pure Python.

(The custom code can be written in four ways: in SQL, using PySpark SQL to modify the dataset; in Python, using a PySpark data frame and libraries to modify the dataset; in Python, using a pandas data frame and libraries to modify the dataset; or in Python, using a user-defined function to modify a column of the dataset. The Python (pandas) approach requires the dataset to fit into memory and can only be run on a single instance, limiting its ability to scale efficiently.)
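For illustration, a Python (user-defined function) custom transform follows a simple contract: Data Wrangler passes the selected column in as a pandas series and expects a series of the same length back. The trivial function below is only a sketch of that shape and is not part of the redaction flow itself.

import pandas as pd

# Sketch of the user-defined function shape Data Wrangler expects:
# one column in as a pandas Series, a same-length Series returned.
def custom_func(series: pd.Series) -> pd.Series:
    # Illustrative only: trim surrounding whitespace in each cell
    return series.astype(str).str.strip()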

The PII Column is now created, to be used as the target column to redact.

This target column is then used to create a new column called PII Column Preparation, which is ready for efficient redaction using Amazon Comprehend.

Where needed, a different column can be specified in the Input column field of this step.

Because CSV files contain small amounts of text per cell, it’s more efficient to combine text from multiple cells into a single document to send to Amazon Comprehend. This reduces the overhead associated with many repeated function calls. 

Calling Amazon Comprehend 

The redaction is done as a step of the SageMaker Data Wrangler flow, in which Amazon Comprehend is called synchronously. Since Amazon Comprehend sets a 100,000-character limit per synchronous call, we need to ensure that any text sent stays under this limit.

As such, before the data is sent to Amazon Comprehend, the preparation involves appending a delimiter string (‘-=-‘ in this case, chosen because it does not appear anywhere in the column data itself) to the end of the text in each cell. This cell delimiter is what later allows the combined text to be split back into individual cells, making the optimised use of Amazon Comprehend possible.

Of course, if the text in any individual cell is longer than the Amazon Comprehend limit of 100,000 characters, it will be truncated to 100,000 characters. In this use case that will not happen, as the texts are much shorter.
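A minimal sketch of how this chunking could work is shown below; make_text_chunks, CELL_DELIM and COMPREHEND_MAX_CHARS are assumed names chosen to match the snippet later in this post, not the exact code in redact-pii.flow.

COMPREHEND_MAX_CHARS = 100000   # synchronous call limit
CELL_DELIM = '-=-'              # delimiter not present in the column data

def make_text_chunks(series, max_chars):
    """Concatenate cell text (each followed by the delimiter) into chunks
    that stay under the Comprehend character limit."""
    chunks = []
    current = ''
    for cell in series:
        # truncate any overly long cell so a single piece never exceeds the limit
        piece = str(cell)[:max_chars - len(CELL_DELIM)] + CELL_DELIM
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = ''
        current += piece
    if current:
        chunks.append(current)
    return chunks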

The redacted data is then saved to a new column called PII Redacted. When we use a Python custom function transform, SageMaker Data Wrangler defines an empty custom_func that takes a column of text as input and returns a modified pandas series of the same length.

This custom_func contains two sub-functions:

  1. Fun_1: This function does the work of concatenating text from individual cells in the series into longer chunks to send to Amazon Comprehend.
  2. Fun_2: This function takes text as input, calls Amazon Comprehend to detect PII, redacts where it is found, and returns the redacted text. Redaction is done by replacing any PII text with the type of PII found, in square brackets; for example, Joe Doe would be replaced with [NAME]. This can be modified to replace PII with any string, including the empty string (“”) to simply remove the PII rather than mark it as redacted.


(We can also modify the function to check the confidence score of each PII entity and only redact if it’s above a specific threshold, as the sketch below illustrates.)
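Below is a hedged sketch of what such a redaction sub-function could look like, using the Amazon Comprehend DetectPiiEntities API via boto3; the function name, threshold value and defaults are assumptions for illustration rather than the exact code in the flow.

import boto3

comprehend = boto3.client("comprehend")
SCORE_THRESHOLD = 0.5  # assumed confidence threshold; tune as required

def redact_pii(text, language_code="en"):
    """Detect PII with Amazon Comprehend and replace each entity with its
    type in square brackets, e.g. a person's name becomes [NAME]."""
    response = comprehend.detect_pii_entities(Text=text, LanguageCode=language_code)
    redacted = text
    # replace from the end of the string first so earlier offsets stay valid
    for entity in sorted(response["Entities"], key=lambda e: e["BeginOffset"], reverse=True):
        if entity["Score"] < SCORE_THRESHOLD:
            continue  # only redact entities above the confidence threshold
        start, end = entity["BeginOffset"], entity["EndOffset"]
        redacted = redacted[:start] + "[" + entity["Type"] + "]" + redacted[end:]
    return redacted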

When the redaction is complete, the chunks are split back into the original cells, which are then saved in the PII Redacted column.

A sample code snippet is provided below:

# concatenate text from cells into longer chunks
chunks = make_text_chunks(series, COMPREHEND_MAX_CHARS)

redacted_chunks = []

# call Comprehend once for each chunk, and redact
for text in chunks:
    redacted_text = redact_pii(text)
    redacted_chunks.append(redacted_text)

# join all redacted chunks into one text string
redacted_text = ''.join(redacted_chunks)

# split back to list of the original rows
redacted_rows = redacted_text.split(CELL_DELIM)


Export Outputs 


Next, we need to export the outputs back to S3, for which a destination node needs to be created in SageMaker Data Wrangler:

  • In the SageMaker Data Wrangler flow diagram, choose the plus sign next to the Redact PII step.
  • Choose Add destination, then choose Amazon S3.
  • Provide an output name for your transformed dataset.
  • Browse or enter the S3 location to store the redacted data file.
  • Choose Add destination.


After the destination node is added, create an export job to complete the process:


  • Choose Create job in SageMaker Data Wrangler.
  • Confirm the destination node, then choose Next.
  • Accept the defaults for all other options, then choose Run.


This creates a SageMaker Processing job. Go to the Processing section and choose Processing jobs to check the processing status. The duration depends on the instance types and the volume of data, and it can take many minutes.
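The status can also be checked programmatically; below is a minimal sketch using boto3, where the processing job name is hypothetical and should be replaced with the name shown in the Processing jobs list.

import boto3

sagemaker_client = boto3.client("sagemaker")

# Hypothetical job name; copy the actual name from the Processing jobs list.
job = sagemaker_client.describe_processing_job(
    ProcessingJobName="data-wrangler-flow-processing-example"
)
print(job["ProcessingJobStatus"])  # InProgress, Completed, or Failed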

The output file is stored in the specified S3 bucket, now ready for the machine learning and AI exercises without PII concerns.
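As a final sanity check, the redacted file can be read back and inspected; the bucket, key and column name below are assumptions based on the steps above, and reading from S3 with pandas requires the s3fs package.

import pandas as pd

# Hypothetical output location chosen in the export destination step.
redacted_df = pd.read_csv("s3://example-bucket/redacted/pii-redacted-output.csv")

# The redacted column should show placeholders such as [NAME] instead of raw PII.
print(redacted_df["PII Redacted"].head())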


                                    Simon Wang

 
