Using Data Analytics and Machine Learning in Enterprise Data Protection

(This blog piece is a streamlined transcript of a community video talk I gave.)


Enterprise data protection is an integral part of any organisation’s data security architecture and operations. As a characteristic of the cloud era we are in, more and more data nowadays resides in cloud environments. 

In a typical scenario of an enterprise’s AWS cloud setup, multiple AWS accounts are used – there are good reasons why multiple OUs (Organizational Units) and accounts are structured for a given organisation. 

Behaviour differences exist between these AWS accounts. Though they all belong to the same enterprise, the AWS resources for a workload in one account may behave differently from another workload and its resources in a different account. Behaviour here refers to how certain resources communicate and how certain data is accessed. As a simple example, an S3 bucket in one account is designed to be frequently accessed by a vast number of users (even including the general public), while another S3 bucket in a different account is designed to be confidential – meant to be accessed by only a handful of people. The same differences exist for all other cloud services: storage, compute, database, etc.

Nowadays, best practices are published and tools like AWS Security Hub, AWS Control Tower, Service Control Policies and others are widely used to define a good cloud security posture. The Principle of Least Privilege is observed. These continue to be crucial aspects of data protection. At the same time, there are new dynamics.

For example, new threats are always emerging that exploit unintended access to enterprise data. This is despite the access control measures that you are likely to be operating already: the built-in access controls through user policies, Role Based Access Control, and tools like AWS STS (Security Token Service), federation and single sign-on that grant temporary access where needed in preference to long-term access. In other words, you are already following a set of sound measures, but dynamic threats to data are still not eliminated.

Using data analytics and machine learning to combat dynamic threats to enterprise data has become an interesting idea. The key concept here is that in modern clouds, extensive volumes of logging and monitoring data are generated through AWS CloudTrail, Amazon CloudWatch and other tools, including third-party tools. Such data contains behavioural information on enterprise data access and application communication.
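To make "behavioural information" a little more concrete, here is a minimal Python sketch that reduces one CloudTrail-style record to a "who did what to which resource, and when" view. The field names (eventTime, eventSource, eventName, userIdentity, requestParameters, sourceIPAddress) follow the CloudTrail record layout; the sample values, the behaviour_of helper and its output keys are assumptions made up for illustration.

```python
from datetime import datetime
from typing import Any

# A hand-written sample in the shape of a CloudTrail record; every value is made up.
SAMPLE_EVENT: dict[str, Any] = {
    "eventTime": "2024-01-15T03:12:45Z",
    "eventSource": "s3.amazonaws.com",
    "eventName": "GetObject",
    "sourceIPAddress": "203.0.113.10",
    "userIdentity": {
        "type": "AssumedRole",
        "arn": "arn:aws:sts::222222222222:assumed-role/ops-role/alice",
        "accountId": "222222222222",
    },
    "requestParameters": {"bucketName": "restricted-bucket", "key": "payroll.csv"},
}


def behaviour_of(event: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: keep only the fields that describe access behaviour."""
    identity = event.get("userIdentity", {})
    params = event.get("requestParameters") or {}  # may be null in real records
    when = datetime.strptime(event["eventTime"], "%Y-%m-%dT%H:%M:%SZ")
    service = event["eventSource"].split(".")[0]  # "s3.amazonaws.com" -> "s3"
    return {
        "principal": identity.get("arn", "unknown"),
        "principal_type": identity.get("type", "unknown"),
        "action": f"{service}:{event['eventName']}",
        "bucket": params.get("bucketName"),
        "hour_utc": when.hour,
        "source_ip": event.get("sourceIPAddress"),
    }


summary = behaviour_of(SAMPLE_EVENT)
print(summary)
```

Records reduced this way are the kind of input that the analytics and machine learning tools discussed next would consume at scale.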

We can use analytics tools and machine learning tools to process such volumes of data. Logically, real-time data ingestion is also part of the solution. Some of the tools include:

  • Amazon Kinesis
  • Amazon OpenSearch
  • Amazon EMR (previously known as Amazon Elastic MapReduce)
  • Amazon SageMaker
  • Amazon MSK (Managed Streaming for Apache Kafka)
  • Amazon KDA (Kinesis Data Analytics)
  • Others

These modern tools have the capability to receive and process crazily huge volumes of data. In the case of this discussion we are talking about the extensive amount of security related logging and monitoring data. 

Using these tools together, we can let the analytics develop a model of what is normal behaviour and what is abnormal behaviour for a workload, based on the accumulated behavioural data collected from the workload environment.

From here, we can define so-called ‘suspicion levels’. Some behaviours in one workload can be normal and will continue to be regarded as normal. The same behaviour in a different workload, in a different environment, might be considered quite suspicious. It can then be flagged accordingly.
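As a rough illustration of per-workload baselines and suspicion levels, the sketch below uses plain frequency counts as a stand-in for a real machine learning model. The workload names, the history data, the level names and the 5% threshold are all assumptions chosen for the example.

```python
from collections import Counter

# Hypothetical history of (workload, principal, action) observations.
HISTORY = (
    [("public-site", "anonymous", "s3:GetObject")] * 980
    + [("public-site", "deployer", "s3:PutObject")] * 20
    + [("payroll", "hr-admin", "s3:GetObject")] * 95
    + [("payroll", "hr-admin", "s3:PutObject")] * 5
)


def build_baselines(history):
    """Count how often each (principal, action) pair occurred in each workload."""
    baselines = {}
    for workload, principal, action in history:
        baselines.setdefault(workload, Counter())[(principal, action)] += 1
    return baselines


def suspicion_level(baselines, workload, principal, action):
    """Map a behaviour's share of its own workload's history to a coarse level."""
    counts = baselines.get(workload, Counter())
    total = sum(counts.values())
    share = counts[(principal, action)] / total if total else 0.0
    if share >= 0.05:  # assumed threshold: common in this workload
        return "normal"
    if share > 0.0:  # seen before in this workload, but rarely
        return "elevated"
    return "high"  # never seen in this workload


baselines = build_baselines(HISTORY)
# The same behaviour is judged against each workload's own history.
print(suspicion_level(baselines, "public-site", "anonymous", "s3:GetObject"))
print(suspicion_level(baselines, "payroll", "anonymous", "s3:GetObject"))
```

Anonymous reads dominate the public workload's history, so they rate as normal there, while the same read against the payroll workload has never been seen and rates as high.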

Naturally, the complexity associated with these things should not be underestimated. In many cases even more specialised tools may need to be leveraged, including graph databases like Amazon Neptune.

What are the benefits, you may ask? It can be interesting and exciting. One potential benefit is the so-called ‘Real Time Response to Threats’.

As long as the threats can be well defined and the analytics and machine learning tools are used properly, you will be able to respond to threats that are:

  • About to happen, or
  • Already happening

In other words, you can stop them before it is too late.

Let's use a simplified scenario, one that was articulated by an AWS speaker in a re:Invent session, for the purpose of explanation here.

Imagine a user in account A managed to assume a role in account B. (This is normal behaviour so far.)

But this user has a rogue intention – they want to copy some confidential corporate data that is stored in a restricted S3 bucket.

Even with the assumed role, they still do not have the privilege to access that bucket.

The user finds out that the assumed role allows them to perform a number of EKS-related operations.

They figure out a way. They power up some EKS nodes and, through SSM (AWS Systems Manager) Session Manager, manage to access the restricted S3 bucket. They create another bucket and copy the objects (data) in the restricted bucket over to it.

A data breach is happening!

In the past, this would typically be discovered only afterwards, when logs could be analysed to pinpoint who committed the cyber crime. But that was long after the data had been stolen.

Now, with the help of analytics and machine learning, alarms are raised while it is happening – because such a string of behaviours (a user from another account assuming the role, powering up EKS nodes, then using SSM Session Manager, then accessing the specific bucket) is clearly abnormal, based on the accumulated behaviour of those who have been accessing this bucket.
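The chain of behaviours in this scenario can be sketched as a simple rule-based score. This is a hand-written stand-in for what a trained model would learn from accumulated data, not the method from the re:Invent session. The event dictionaries, the known-reader baseline, the score_session helper and the alert threshold are all hypothetical, and attributing all of these steps to a single session is itself a non-trivial problem that the sketch assumes away.

```python
# Assumed baseline: principals that have historically read each sensitive bucket.
KNOWN_BUCKET_READERS = {"restricted-bucket": {"hr-admin", "payroll-batch"}}

# Hypothetical, time-ordered events attributed to one assumed-role session.
SESSION = [
    {"action": "sts:AssumeRole", "caller_account": "111111111111",
     "target_account": "222222222222"},
    {"action": "eks:CreateNodegroup"},
    {"action": "ssm:StartSession"},
    {"action": "s3:GetObject", "bucket": "restricted-bucket"},
]


def score_session(principal, events, known_readers):
    """Return (score, reasons); each unusual trait of the session adds one point."""
    reasons = []
    seen = []  # actions observed earlier in the same session
    for event in events:
        action = event["action"]
        if (action == "sts:AssumeRole"
                and event.get("caller_account") != event.get("target_account")):
            reasons.append("role assumed from another account")
        bucket = event.get("bucket")
        if bucket in known_readers and principal not in known_readers[bucket]:
            reasons.append(f"first-time reader of {bucket}")
            if "eks:CreateNodegroup" in seen and "ssm:StartSession" in seen:
                reasons.append("bucket reached via new EKS nodes and Session Manager")
        seen.append(action)
    return len(reasons), reasons


ALERT_THRESHOLD = 2  # assumed cut-off for raising an alarm
score, reasons = score_session("ops-role/alice", SESSION, KNOWN_BUCKET_READERS)
if score >= ALERT_THRESHOLD:
    print("ALERT", score, reasons)
```

No single step here is alarming on its own; it is the combination, judged against who normally reads the bucket, that pushes the score over the threshold.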

As all this is processed in real time, the party initiating the data breach is traced and the access is stopped in real time, either before the data is transferred or while it is being transferred.

The prevention is also more intelligent and more efficient. What commonly happened in the past was that as soon as a data breach was identified, access to the target (in this case the S3 bucket) would be blocked by a blanket rule – as a generic measure pending the outcome of the investigation. Legitimate access would be disrupted. It might take hours or days before the blanket ban was lifted.

On the other hand, using the solution that has been discussed above, only the unintended access would be stopped. All legitimate access will continue on as BAU (business as usual). 
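One way to picture a targeted block, as opposed to a blanket one, is a bucket policy statement that explicitly denies a single principal and leaves everyone else untouched. The sketch below only builds the policy document; the bucket name, the role ARN and the targeted_deny_statement helper are made up for illustration. Which principal to deny (a role, a specific session, or something else) depends on your setup, and in practice the statement would be merged into the bucket's existing policy rather than replacing it.

```python
import json


def targeted_deny_statement(bucket: str, principal_arn: str) -> dict:
    """Hypothetical helper: one statement that denies a single principal only."""
    return {
        "Sid": "DenySuspectedPrincipal",
        "Effect": "Deny",
        "Principal": {"AWS": principal_arn},
        "Action": "s3:*",
        "Resource": [
            f"arn:aws:s3:::{bucket}",    # the bucket itself
            f"arn:aws:s3:::{bucket}/*",  # every object in it
        ],
    }


policy = {
    "Version": "2012-10-17",
    "Statement": [
        targeted_deny_statement(
            "restricted-bucket",
            "arn:aws:iam::222222222222:role/ops-role",  # made-up role ARN
        )
    ],
}
print(json.dumps(policy, indent=2))
```

Because the statement names only the suspected principal, other principals' existing permissions on the bucket are not affected by it.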

Today we discussed the power of combining analytics, machine learning and the extensive volumes of logging and monitoring data in AWS cloud environments. I believe this is making enterprise data protection more expansive. I strongly encourage anyone who is interested to start looking into these emerging technologies and solutions.

                                                        --Simon Wang   


