A Problem Solving Experience on Amazon S3 Multi-Region Access Points

Amazon S3 buckets can be used to store large amounts of data. AWS's comprehensive data analytics tools can then be used to process that data. Often, the processed (post-analytics) data is stored in another S3 bucket, from which users fetch it.

This piece discusses a problem recently encountered when accessing S3 buckets behind a Multi-Region Access Point setup, and how it was investigated.


The Setup

A large amount of data is deposited into Amazon S3 for analytics processing. The processed data is then stored in a different S3 bucket. Around 10,000 users actively access the processed data through a proprietary application, which is latency sensitive. Most users are located in three geographical regions: Australia and New Zealand, North America, and South America.

Because of this latency sensitivity, an Amazon S3 Multi-Region Access Point setup is used.

Three S3 buckets are configured, one in each of the following AWS Regions:


  • Sydney (Australia)
  • Northern Virginia (US East)
  • Sao Paulo (South America)


The contents of these three S3 buckets are kept in sync (replicated) between them.

The Amazon S3 Multi-Region Access Points service provides a global endpoint and uses the AWS backbone network to route Amazon S3 request traffic between AWS Regions. Each global endpoint routes Amazon S3 data request traffic from multiple sources, using AWS Global Accelerator. The service takes network congestion and the location of the requesting application into account, and dynamically routes requests over the AWS network to the closest S3 bucket. AWS claims that Amazon S3 data requests originating from the public Internet and routed through an S3 Multi-Region Access Point can be accelerated by up to 60% compared with requests routed to S3 over the public Internet. This gives multi-region applications lower latency with a simple architecture, even when their users can be anywhere in the world.

[Diagram: the S3 Multi-Region Access Points concept. Source: the AWS website]


In this setup, users in South America running the proprietary application reach the configured S3 Multi-Region Access Point, which automatically routes their requests to the S3 bucket in the Sao Paulo Region.


The Problem

Despite the Multi-Region Access Point setup, the South America users still had a poor experience: latency was clearly noticeable whenever they interacted with the application.

North America and Oceania users, on the other hand, had a good application experience.


The Troubleshooting 

First, testers performed typical application runs from Australia, the US and Brazil. Logs were collected and analysed.

The logs confirmed that the S3 Multi-Region Access Point was doing its job: when testers ran the application in the US, the requests were routed to the S3 bucket in Northern Virginia (US East). The same held for the tests in Australia and in Brazil.

Then, the Total Time in the S3 logs was analysed. In summary:


  • The Total Time logged by the S3 bucket in the Sydney (Australia) AWS Region: <10 milliseconds

  • The Total Time logged by the S3 bucket in the Northern Virginia (US East) AWS Region: <10 milliseconds

  • The Total Time logged by the S3 bucket in the Sao Paulo (South America) AWS Region: 40 to 80 milliseconds


Note that the Total Time recorded in the S3 logs is the time from when the request is received by the Access Point to when the last byte of the response has left the Access Point. In other words, it indicates the latency introduced by the S3 apparatus, which comprises the AWS Edge Location, the AWS network and the S3 bucket.
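
A per-bucket summary of Total Time like the one above can be produced directly from S3 server access logs. The following is a minimal sketch, assuming the logs follow the standard space-delimited server access log layout in which Total Time is the fourteenth field. The sample lines, bucket names and the median_total_time helper are invented for illustration and are not the data from this investigation.

```python
import re
from collections import defaultdict
from statistics import median

# A token is a [bracketed timestamp], a "quoted string", or a run of non-spaces.
TOKEN = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')

# Zero-based field positions in the standard S3 server access log layout.
BUCKET_FIELD = 1
TOTAL_TIME_FIELD = 13


def median_total_time(lines):
    """Return {bucket name: median Total Time in milliseconds}."""
    samples = defaultdict(list)
    for line in lines:
        fields = TOKEN.findall(line)
        if len(fields) <= TOTAL_TIME_FIELD:
            continue  # truncated or malformed line
        total_time = fields[TOTAL_TIME_FIELD]
        if total_time.isdigit():  # "-" means no value was recorded
            samples[fields[BUCKET_FIELD]].append(int(total_time))
    return {bucket: median(values) for bucket, values in samples.items()}


# Made-up log lines (hypothetical buckets and timings), cut off after Turn-Around Time.
SAMPLE = [
    'owner1 example-bucket-sydney [06/Feb/2019:00:00:38 +0000] 192.0.2.3 - REQ1 '
    'REST.GET.OBJECT a.bin "GET /a.bin HTTP/1.1" 200 - 1024 1024 7 5',
    'owner1 example-bucket-sydney [06/Feb/2019:00:00:39 +0000] 192.0.2.3 - REQ2 '
    'REST.GET.OBJECT b.bin "GET /b.bin HTTP/1.1" 200 - 1024 1024 9 6',
    'owner1 example-bucket-sao-paulo [06/Feb/2019:00:00:40 +0000] 192.0.2.9 - REQ3 '
    'REST.GET.OBJECT a.bin "GET /a.bin HTTP/1.1" 200 - 1024 1024 45 30',
    'owner1 example-bucket-sao-paulo [06/Feb/2019:00:00:41 +0000] 192.0.2.9 - REQ4 '
    'REST.GET.OBJECT b.bin "GET /b.bin HTTP/1.1" 200 - 1024 1024 75 50',
]

for bucket, value in median_total_time(SAMPLE).items():
    print(bucket, value)
```

Grouping by the bucket field also shows which bucket served each request, which is one way to confirm that the Multi-Region Access Point routed traffic as expected.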

The Sao Paulo (South America) AWS Region did take longer to return the data to the tester, but the difference was only 30 to 70 milliseconds, whereas the difference the testers experienced in the application was more like several seconds.

Further tests were conducted. Because the organization's employees travelled frequently between the US and South America on business, it was feasible to have the same tester use the same device in both the US and South America. These tests confirmed that the same user carrying the same laptop experienced a noticeable difference in delay. In other words, differences between devices were ruled out as the cause.

Further tests in South America showed that testers at different locations there could have quite different experiences of the application. While none seemed as good as those in Australia and the US, some South America locations were more acceptable than others.

During a scheduled change window, turning off the S3 Multi-Region Access Point was also tested, so that South America based testers ran the application against the S3 bucket in Australia. The experience was far worse than in any of the Multi-Region Access Point scenarios.

At this stage, the focus turned to workarounds. After various adjustments, it was found that reducing the timeout in the proprietary application helped. A possible explanation is that a shorter timeout makes the application retry sooner, and each retry may be routed differently on both legs: the public Internet leg between the device and the nearest AWS Edge Location, and the AWS network leg between the AWS Edge Location and the S3 service in that Region.
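
The retry behaviour described above can be sketched as follows. The proprietary application's code is not shown in this piece, so this is only an illustration of the idea: fetch_with_retries, make_flaky_fetch and all parameter values are hypothetical, and the fetch is simulated rather than a real S3 call.

```python
import time


def fetch_with_retries(fetch, timeout_s, max_attempts=4, pause_s=0.0):
    """Call fetch(timeout_s) until it succeeds or the attempts run out.

    `fetch` is a hypothetical callable that raises TimeoutError when no
    reply arrives within `timeout_s` seconds. Returns (result, attempts used).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(timeout_s), attempt
        except TimeoutError as error:
            last_error = error
            if pause_s and attempt < max_attempts:
                time.sleep(pause_s)  # optional pause before the next try
    raise TimeoutError(f"gave up after {max_attempts} attempts") from last_error


def make_flaky_fetch(slow_attempts):
    """Simulate a fetch whose first `slow_attempts` calls hit a slow path."""
    calls = {"count": 0}

    def fetch(timeout_s):
        calls["count"] += 1
        if calls["count"] <= slow_attempts:
            raise TimeoutError("simulated slow path")
        return b"payload"

    return fetch


# Two slow attempts are abandoned after 0.5 s each; the third succeeds.
data, attempts = fetch_with_retries(make_flaky_fetch(2), timeout_s=0.5)
print(data, attempts)
```

A shorter timeout_s caps how long each slow attempt can hold up the run, at the cost of abandoning some requests that would have completed shortly afterwards.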

In a geographical region where the population is large and dense and the Internet infrastructure is vast, conditions change constantly, and each request may be routed differently. Even the AWS network in such a vast region might share these characteristics, as the single AWS Region there covers multiple countries and vast geographical areas. The route taken between the Edge Location and the S3 service can therefore also be quite dynamic.
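
If each retry really is routed independently of the previous one, the effect of the timeout on a single fetch can be estimated from a set of measured latencies. The sketch below is a simplified model, not something measured in this investigation: it assumes independent attempts and unlimited retries, and the latency samples are invented. Under those assumptions the expected time is the mean number of timed-out attempts multiplied by the timeout, plus the mean latency of the attempts that finish in time.

```python
def expected_fetch_time(latencies_s, timeout_s):
    """Estimate the mean time to a successful fetch under retry-on-timeout.

    Assumes every attempt draws its latency independently from
    `latencies_s` and that the application retries until it succeeds.
    """
    fast = [x for x in latencies_s if x <= timeout_s]
    if not fast:
        return float("inf")  # no attempt ever beats the timeout
    p_success = len(fast) / len(latencies_s)
    mean_fast = sum(fast) / len(fast)
    # Failures before the first success follow a geometric distribution.
    expected_timeouts = (1 - p_success) / p_success
    return expected_timeouts * timeout_s + mean_fast


# Invented latency samples in seconds: mostly fast, with a slow tail.
SAMPLES = [0.2] * 8 + [5.0] * 2

for timeout in (0.1, 0.5, 6.0):
    print(timeout, expected_fetch_time(SAMPLES, timeout))
```

With these invented samples, a 0.5 second timeout gives an estimate of about 0.33 seconds per fetch, against about 1.16 seconds for a 6 second timeout, while a 0.1 second timeout never succeeds. In practice slow paths may persist across retries, so the real benefit is likely to be smaller than this model suggests.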


The Conclusion

The Amazon S3 Multi-Region Access Point did its job and delivered benefits by dynamically routing requests to the closest S3 bucket. There was no misconfiguration of the Multi-Region Access Point. The relatively worse user experience was likely due to the vastness of the geographical area and the size of the population in South America, and to the unevenness from one area to another in that region. The proprietary application performed multiple fetches in a single run, which compounded the effect. Finding a good balance for the timeout (long enough to receive replies, but short enough that a slow attempt is abandoned and retried promptly) can be a workaround, but its effectiveness varies. A permanent improvement depends on better Internet consistency and reliability across that vast area. If one day there are multiple AWS Regions on that continent, it would most likely help.

                                                                                                                 Simon Wang
