Developing Highly Concurrent Software for Detecting PII

During my time at Amazon Web Services (AWS), I worked on the relaunch of AWS Macie. Macie lets customers scan their S3 buckets for personally identifiable information (PII), such as credit card numbers, social security numbers, names, and addresses.

The original Macie launched several years before I joined. By the time I joined, the team had started building a revamped version of the product from scratch. The new Macie offered several benefits over its predecessor, including better efficiency and lower cost.

At a high level, this is how Macie works:

After enabling Macie, customers schedule a job to scan their S3 buckets. The service lists all the files in each bucket, extracts their content, and passes it to the classification module, which uses a combination of machine learning and regular expressions to find PII. The results are exported to an S3 bucket of the customer's choice and are also viewable in the console.
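To make that flow concrete, here is a deliberately simplified, hypothetical sketch of the list → fetch → classify → report loop (not Macie's actual code). It uses the AWS SDK for Java v2, and a single toy regex stands in for the real classification module:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.S3Object;

/** Simplified, hypothetical sketch of the scan pipeline: list -> fetch -> classify -> report. */
public class ScanJobSketch {

    // Toy "classifier": the real service combines ML models and many managed patterns.
    private static final Pattern SSN = Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");

    public static void main(String[] args) {
        String bucket = args[0];
        try (S3Client s3 = S3Client.create()) {
            // 1. List every object in the customer's bucket (paginated by the SDK).
            s3.listObjectsV2Paginator(ListObjectsV2Request.builder().bucket(bucket).build())
              .contents()
              .forEach(obj -> scanObject(s3, bucket, obj));
        }
    }

    private static void scanObject(S3Client s3, String bucket, S3Object obj) {
        // 2. Fetch the object's content (buffered here for simplicity; the real service streams).
        ResponseBytes<?> bytes = s3.getObjectAsBytes(
                GetObjectRequest.builder().bucket(bucket).key(obj.key()).build());
        String text = bytes.asUtf8String();

        // 3. Classify the extracted text and 4. report findings.
        Matcher m = SSN.matcher(text);
        while (m.find()) {
            System.out.printf("Finding in %s at offset %d%n", obj.key(), m.start());
        }
    }
}
```

The real service cannot afford to buffer whole objects or run single-threaded; the streaming, concurrency, and archive handling that make it scale are covered in the sections below.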


Areas of Work

During my time at Macie I helped develop and launch the extraction and classification service (ECS). This was a highly concurrent, high-performance service responsible for:

  • Obtaining the content from the customer's bucket
  • Extracting that content
  • Performing the actual classification to find PII
  • Generating results

Below are some of the challenges I faced while working on this project:

Support for Various File Types: We supported many different file types: txt, csv, Excel, Word, Avro, Parquet, PDF, and more. Each of these formats stores information differently, which complicated not only the extraction process but also the classification. Special consideration went into supporting each of them.
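As an illustration of the kind of dispatch this requires, here is a minimal, hypothetical sketch (the names are mine, not Macie's): each supported format gets its own extractor that turns a file's bytes into text the classifier can scan.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical sketch: pick an extractor based on the file's type. */
interface ContentExtractor {
    /** Returns the plain text that the classifier will scan. */
    String extract(InputStream in) throws IOException;
}

class PlainTextExtractor implements ContentExtractor {
    @Override
    public String extract(InputStream in) throws IOException {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }
}

class ExtractorRegistry {
    // Each format (csv, parquet, pdf, ...) gets its own extractor implementation.
    private final Map<String, ContentExtractor> byExtension = Map.of(
            "txt", new PlainTextExtractor(),
            "csv", new PlainTextExtractor()   // real CSV handling would preserve column structure
            // "parquet", new ParquetExtractor(), "pdf", new PdfExtractor(), ...
    );

    ContentExtractor forKey(String objectKey) {
        String ext = objectKey.substring(objectKey.lastIndexOf('.') + 1).toLowerCase();
        ContentExtractor extractor = byExtension.get(ext);
        if (extractor == null) {
            throw new IllegalArgumentException("Unsupported file type: " + ext);
        }
        return extractor;
    }
}
```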

No File Size Limit: Our goal was to design a system that could process files of any size. Dealing with a 100KB file is very different from dealing with a 500TB file; hardware limitations mean that simple solutions that work well for small files do not scale. Our final design was able to handle files of any size (for certain file types) within constrained hardware resources: their content was extracted and classified, and the results were sent to the customer.
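A minimal sketch of the underlying idea, assuming a chunked/streaming approach (all names and numbers here are illustrative, not Macie's implementation): read fixed-size chunks, keep a small overlap so matches that straddle a chunk boundary are not lost, and never hold more than one chunk in memory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: scan arbitrarily large text with bounded memory. */
class StreamingScanner {
    private static final int CHUNK_CHARS = 1 << 20;   // ~1M characters per chunk
    private static final int OVERLAP_CHARS = 256;     // longer than any pattern we look for

    long countMatches(InputStream in, Pattern pattern) throws IOException {
        long matches = 0;
        char[] buffer = new char[CHUNK_CHARS];
        String carry = "";                             // tail of the previous chunk
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            int read = reader.read(buffer);
            while (read != -1) {
                String chunk = carry + new String(buffer, 0, read);
                int next = reader.read(buffer);        // look ahead: is this the final chunk?
                boolean last = (next == -1);
                int countLimit = last ? chunk.length() : chunk.length() - OVERLAP_CHARS;
                Matcher m = pattern.matcher(chunk);
                while (m.find()) {
                    // Matches starting inside the overlap tail will be re-found in the
                    // next chunk, so only count them here on the final chunk.
                    if (m.start() < countLimit) {
                        matches++;                     // real code would emit a finding here
                    }
                }
                // Keep only a short suffix; memory stays bounded regardless of file size.
                carry = chunk.substring(Math.max(0, chunk.length() - OVERLAP_CHARS));
                read = next;
            }
        }
        return matches;
    }
}
```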

High Concurrency: We were designing a system that would handle petabytes of data, and handling such large volumes required a high degree of concurrency. A lot of my work involved multithreading and required a strong understanding of concepts such as the following (a small sketch follows the list):

  • Java thread pools
  • IO-bound and CPU-bound operations
  • Deadlocks
  • Thread profiling
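
Here is the small sketch referenced above, assuming a design where IO-bound and CPU-bound work run on separate, differently sized pools (the pool sizes and names are illustrative):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Hypothetical sketch: keep IO-bound and CPU-bound work on separate pools so slow
 * downloads cannot starve classification (and vice versa).
 */
class PoolsSketch {
    // IO-bound work (S3 reads) spends most of its time waiting, so it can be
    // oversubscribed relative to the core count.
    private final ExecutorService ioPool = Executors.newFixedThreadPool(64);

    // CPU-bound work (regex/ML classification) is sized to the available cores.
    private final ExecutorService cpuPool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    CompletableFuture<String> scan(String objectKey) {
        return CompletableFuture
                .supplyAsync(() -> download(objectKey), ioPool)   // IO-bound stage
                .thenApplyAsync(this::classify, cpuPool);         // CPU-bound stage
    }

    private byte[] download(String objectKey) { /* S3 get... */ return new byte[0]; }

    private String classify(byte[] content) { /* run classifiers... */ return "NO_PII"; }
}
```

Keeping the stages on separate executors means a burst of slow reads cannot starve the classification threads, and each pool can be profiled and tuned for its own bottleneck.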

Archive Handling: One of the file types Macie supported was archives (zip, tar, gzip). Archives can be quite complex:

  • An archive can contain nested archives, each holding thousands of files.
  • Deeply nested or highly compressed files require special handling so as not to exhaust system resources.
  • There are various other edge cases, such as ensuring that a single malformed file inside an archive does not render the entire archive unprocessable.
  • Results have to be aggregated across everything inside the archive.

All of this had to be done while meeting high performance requirements and ensuring system resources were not drained by arbitrarily large archives (500TB+).
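The sketch below shows the general shape of such a traversal, assuming zip-only handling for brevity; the depth limit, byte budget, and per-entry error handling are illustrative choices, not Macie's actual values:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/**
 * Hypothetical sketch (zip only, for brevity): walk a possibly nested archive while
 * enforcing a maximum nesting depth and a total decompressed-byte budget, and skip a
 * malformed entry instead of failing the whole archive.
 */
class ArchiveWalker {
    private static final int MAX_DEPTH = 8;                       // guards against zip-bomb style nesting
    private static final long MAX_DECOMPRESSED_BYTES = 10L << 30; // example 10 GiB budget

    private long bytesSeen = 0;

    void walk(InputStream archiveStream, int depth) throws IOException {
        if (depth > MAX_DEPTH) {
            return;
        }
        // Not closed on purpose: closing it would also close the enclosing stream.
        ZipInputStream zip = new ZipInputStream(archiveStream);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            try {
                if (entry.getName().endsWith(".zip")) {
                    walk(zip, depth + 1);   // nested archive: recurse with the same budget
                } else {
                    scanEntry(zip);         // feed this file to extraction/classification
                }
            } catch (IOException malformedEntry) {
                // Skip just this entry; the rest of the archive is still processed.
            }
        }
    }

    private void scanEntry(InputStream entryStream) throws IOException {
        byte[] buffer = new byte[8192];
        int read;
        while ((read = entryStream.read(buffer)) != -1) {
            bytesSeen += read;
            if (bytesSeen > MAX_DECOMPRESSED_BYTES) {
                // Stop before decompression exhausts disk or memory.
                throw new IllegalStateException("decompressed size budget exceeded");
            }
            // hand buffer[0..read) to the streaming classifier here
        }
    }
}
```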

Dependency Throttling: When building highly scalable software that operates at high TPS, not all dependencies can keep up. I designed and implemented a feature to handle such cases gracefully: it kept the system operational while ensuring we did not overwhelm our dependencies.
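One common way to implement this, shown here as a hypothetical sketch rather than the actual feature, is to cap the number of in-flight calls to a dependency and back off exponentially when it signals throttling:

```java
import java.time.Duration;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/**
 * Hypothetical sketch: cap in-flight calls to a slow dependency and back off when it
 * signals throttling, instead of letting load pile up and fail.
 */
class ThrottledDependencyClient {
    private final Semaphore inFlight = new Semaphore(32);   // max concurrent calls

    <T> T call(Supplier<T> dependencyCall) throws InterruptedException {
        inFlight.acquire();                                  // blocks once the cap is reached
        try {
            Duration backoff = Duration.ofMillis(100);
            while (true) {
                try {
                    return dependencyCall.get();
                } catch (ThrottledException e) {
                    // The dependency asked us to slow down: wait, then retry with
                    // exponentially increasing delays (jitter and a retry cap are
                    // omitted here but needed in practice).
                    Thread.sleep(backoff.toMillis());
                    backoff = backoff.multipliedBy(2);
                }
            }
        } finally {
            inFlight.release();
        }
    }

    /** Stand-in for however the real dependency reports throttling. */
    static class ThrottledException extends RuntimeException {}
}
```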

Cross-Team/Microservice Development: While most of my work at Macie was on the extraction and classification service, the overall system was composed of many different microservices. Any change I made to the ECS required careful consideration of how it would impact those other microservices. This generally required a strong knowledge of the system's overall architecture, and it often meant diving deep into other services' code to understand their inner workings.

Continuous Deployment, Layered Testing and Monitoring: At Amazon the software development process is continuous deployment (CD). Most teams have no dedicated QA, and once your code is pushed it reaches production within a few days. This makes it critical that your code is not only tested properly but also backed by monitoring, so production issues are dealt with quickly. On projects involving multiple teams, it is also necessary to ensure that your changes do not impact other developers, not just production.

For monitoring, we set up CloudWatch alarms connected to 24/7 pagers so that engineers were notified whenever anything was not working correctly, and round-the-clock canary tests verified that the main customer paths were always healthy. For testing, we used several layers of tests: build-time, integration, performance, and end-to-end. These layers ensured that we did not break things in production and that we did not cause issues for other developers, both on our team and on dependent teams.
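As a small illustration of the fastest layer, here is what a build-time (unit) test might look like in JUnit 5 for the toy SSN pattern used in the earlier sketches; the integration, performance, and e2e layers would instead exercise real buckets and deployed endpoints:

```java
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.regex.Pattern;
import org.junit.jupiter.api.Test;

/** Build-time (unit) layer: fast checks that run on every build, long before production. */
class SsnPatternTest {
    private static final Pattern SSN = Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");

    @Test
    void detectsFormattedSsn() {
        assertTrue(SSN.matcher("ssn: 123-45-6789").find());
    }

    @Test
    void ignoresShorterNumberFormats() {
        assertFalse(SSN.matcher("call 555-0199").find());
    }
}
```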

Multi-Region Deployments: Most companies deploy a single production instance of their service, and customers from across the globe connect to that one instance. At AWS, however, we had multi-region deployments: each service is deployed to over 20 regions, which allows customers to choose the endpoint closest to them.

Deploying software to over 20 regions adds an entirely new layer of complexity:

  • All tests had to run in every region to catch region-specific edge cases
  • Infrastructure deployments were fully automated and written using CloudFormation
  • Cross-region monitoring and ticketing were set up
  • Layered deployments with mandatory wait periods ensured that a faulty change had minimal blast radius (and could not cause a total outage)

Skills

  • Java 11
  • JUnit 5
  • Multithreading and high concurrency
  • Low latency design
  • Project Reactor
  • Functional Programming
  • Working with various AWS services (S3, SQS, Lambda, CodeDeploy, CloudWatch, CloudFormation)

 

Description

Worked on the development of AWS Macie, a service for detecting PII in customer S3 buckets:

  • Java 11
  • JUnit 5
  • Multithreading and high concurrency
  • Low latency design
  • Project Reactor
  • Functional Programming
  • Working with various AWS services (S3, SQS, Lambda, CodeDeploy, CloudWatch, CloudFormation)