Resilient Engineer: AWS & Serverless Best Practices

Mark Fowler
9 min read · Sep 7, 2020

At my work, our teams continue to develop acclaimed serverless digital solutions, and we see their complexity and scope grow nearly daily. That growth requires a disciplined, resilient mindset and structure when it comes to deploying these services in a way that minimizes bugs, maintains security, bolsters the app mesh, and allows our teams to stay “agile” (with a lowercase “a”). This post is one more in the #ResilientEngineer series.

Two key areas where I’ve seen teams struggle, once their serverless solutions reach upwards of hundreds of production-ready services, are bombproof, pragmatic security and simple, consistent conventions. I want to share three better (because I don’t believe in “best”) practices in each area that my teams have found to be lifesavers for their continued improvement and sanity when working with complex, distributed serverless solutions.

Bombproof, pragmatic security

Smart & secure secrets

All secrets need to be secured. Just like Gandalf told Frodo, keep them secret, keep them safe. Whether they are API Keys, database credentials, or other sensitive bits of info, secrets need to be securely stored and accessed ONLY by your applications and limited team members.

There are several ways to accomplish this goal, but the main points are listed below:

  1. Keep your secrets out of your source control and enable repository secret scanning if it’s available in your service (GitHub has it and that’s what my teams use)
  2. Limit access to secrets (the Principle of Least Privilege) to only those who need them for a given deploy environment, stage, or function (we lock down secrets in AWS Secrets Manager and govern access with IAM)
  3. Always use separate secrets for different applications and stages, no exceptions (we never share secrets across applications or services)

When it comes to serverless solutions, our teams embrace the simplicity and abstraction provided by the Serverless Framework. There are many posts about how to best handle secrets in this rock-solid framework, and here’s one of my favorites. Even if you’re not using Serverless, the post has value as most other frameworks or services provide similar mechanisms of accomplishing the same goals.

Our teams look for all opportunities to use parameters provided by external services (like AWS SSM Parameter Store) to allow scalable configuration of our secrets across different services, AWS accounts, and application stages. We also love and use safeguard policies (policy-as-code) to block accidental or malicious service deployments whenever there are plaintext secrets set as environment variables in our serverless.yml configuration resources. Our CI/CD also scans for secret patterns and blows up loudly if suspected secrets are discovered.
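As a rough sketch of what that looks like in practice (the parameter path and application name here are hypothetical, not our actual setup), a serverless.yml can pull a secret from SSM Parameter Store at deploy time:

```yaml
# serverless.yml (sketch — the parameter path is hypothetical)
provider:
  name: aws
  environment:
    # Resolved from SSM Parameter Store at deploy time, per stage;
    # the secret value itself never lands in source control.
    # On older Framework versions, SecureStrings need a ~true suffix.
    DB_PASSWORD: ${ssm:/stoic-panda/${opt:stage, 'dev'}/db-password}
```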

Using “least privilege” IAM policies

One of the most tempting shortcuts for some of our teams is writing IAM policies that are too permissive. Projects get behind, stakeholders apply the pressure, and an open IAM policy will really help remove that one big hurdle. It takes discipline to stop and not take the easy path.

A hugely important best practice for our teams is limiting the scope of permissions granted to our applications. In the case of AWS, whenever we create IAM policies for our services, it is mandatory that those roles are limited to the minimum permissions required to operate. As part of this, we reduce the use of wildcards (the * character) in our policy definitions. In addition, in our serverless configuration we apply a Safeguard policy to block deployments that contain wildcards in IAM permissions.
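Here’s a rough sketch of such a policy, assuming the @serverless/safeguards-plugin configuration schema (the exact keys may differ by Framework version; in the Dashboard era this lived in a deployment profile instead):

```yaml
# serverless.yml (sketch — assumes the @serverless/safeguards-plugin)
plugins:
  - "@serverless/safeguards-plugin"

custom:
  safeguards:
    # Fail the deploy if any IAM role statement uses a wildcard
    # in its Action or Resource.
    - title: Block wildcard IAM statements
      safeguard: no-wild-iam-role-statements
      enforcementLevel: error
```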

Though we adopted them only recently, we now use this type of safeguard policy extensively, either to block deployments entirely or to warn developers to take another look at the IAM policies they’re using.

Restricting deploy schedules

Imagine you’re a healthcare company, like a PBM, going into your open-enrollment or 1/1 go-live season. Your teams have high confidence that the services and code deployed are rock-steady. However, you as the tech lead want to limit any remote possibility that some new bug will be introduced during this high-stress season. You may also want to lock down changes over weekends that aren’t staffed to scale.

One common way our teams prevent unwanted deploys (or limit when they can happen) is to configure allowed deploy windows for that period. We can do this through our CI/CD scripts, or we can do it through the Serverless Framework as we’re configuring applications.

This kind of situation may not apply at every organization. I’ve been at companies where it didn’t matter when you deployed, and I’ve seen others where deploying during certain business-hour windows was a sensitive, unwanted risk. If the latter describes your organization, there is a Safeguard that will allow your teams to apply this policy to your environment.
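As a sketch (the dates and durations below are hypothetical, and the config shape follows the built-in restricted-deploy-times safeguard as I recall it, so verify it against your plugin version):

```yaml
custom:
  safeguards:
    # Block all deploys during a recurring freeze window.
    - title: Open-enrollment deploy freeze
      safeguard: restricted-deploy-times
      enforcementLevel: error
      config:
        time: 2020-12-15   # freeze start (ISO 8601)
        duration: P30D     # freeze lasts 30 days
        interval: P1Y      # repeats every year
```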

Simple, consistent conventions

Environments & stages

Having conventions has been key in helping our developers learn a set of standards and then intuitively understand other parts of a system. Most companies have a separate place for code that customers or end-users see (production) and one or more places for code that developers are working on that isn’t quite ready (development, testing, etc.). This is a kind of convention, and these different environments are usually called stages. Stages allow our teams to set up a consistent path for our code to take as it moves toward its destination of delighting customers.

As we apply these conventions across our distributed microservices using our serverless architecture and frameworks, our applications are pushed out to nonprod stages while our teams and developers work on them. Then, when they’re ready for production, they are deployed to a stage like prod by updating our serverless.yml or running a deploy command with the --stage prod option. For each of these stages, we use very different sets of configurations, secrets, and IAM policies.
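A minimal sketch of stage parameterization in serverless.yml (the default stage name is an assumption; pick whatever suits your workflow):

```yaml
# serverless.yml (sketch)
provider:
  name: aws
  # Default to dev; override per deploy, e.g. `serverless deploy --stage prod`
  stage: ${opt:stage, 'dev'}
```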

Baked into the Serverless Framework’s App Dashboard, there’s a lot of new granularity in what you can do when it comes to interacting with stages. Per-stage configuration can include things like:

  • Which AWS account or region a stage is deployed to
  • Which Safeguards are evaluated against the deployment
  • Which parameters and secrets are used

With this kind of per-stage customization, we can use Safeguards to block nonprod-stage deployments to production AWS accounts, or to make sure that our production API keys and secrets are bundled only with production deployments. Our teams have found these options to be very flexible and extensible in supporting the needs and workflows of our entire organization.
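For instance, a sketch of a stage allow-list (the stage names here are hypothetical):

```yaml
custom:
  safeguards:
    # Reject deploys to any stage name not in this list.
    - title: Allowed stages
      safeguard: allowed-stages
      enforcementLevel: error
      config:
        - dev
        - qa
        - prod
```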

Allowed region mechanisms

For many teams during the pandemic, collaboration and development have become geographically distributed. On this kind of team, the default AWS region for each developer may not be the same. A developer in Salt Lake City might default to us-west-2 and one in Orlando might use us-east-1. We found that as our teams started to deploy frequently across different stages, services, and accounts, this led to inadvertent issues in our code. One service may reference one region but actually need to be deployed in another. Or different regions may have different supported features or limitations.

To avoid issues like this, our teams require developers and engineers to use a single region or a subset of regions that suits their needs. We don’t require this as a manual configuration on the developer side, though, as anything manual will introduce fragility. Instead, our teams use Safeguards to define the rails for this type of restriction. At deployment time, they ensure that our myriad services are deployed only to an allowed region or list of regions that our tech leads specify as valid account or regional targets.
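A sketch of such a region allow-list (the regions chosen are just examples):

```yaml
custom:
  safeguards:
    # Block deploys to any region outside the approved list.
    - title: Allowed regions
      safeguard: allowed-regions
      enforcementLevel: error
      config:
        - us-east-1
        - us-west-2
```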

Cloud resource naming conventions

Along with the stage and region controls listed above, keeping resources aligned to consistent naming and description conventions has helped keep our infrastructure organized. Our practice of naming infrastructure-as-code resources by convention also helps new team members quickly see what’s going on in a specific AWS account and service, understand how resources are interconnected across microservices, and build and troubleshoot more easily.

An example of this convention is the pattern our teams follow in naming Lambda functions. We require that every Lambda function name include the application name, the service name, and the function’s purpose. This allows us to more easily find relevant functions when they live in the same AWS account yet are spread across multiple services. Our teams can also more quickly tie multiple functions to a particular service.

Imagine that you have a CRM application (pretend it’s Microsoft Dynamics CE) that your call center teams and account services teams use to 1) configure how services should work for customers, and 2) quickly look up information about customers when they call in for support. You might have a Lambda function that watches for changes in your data warehouse that then need to be synchronized with metadata in Microsoft Dynamics CE. You might have another Lambda function that monitors a queue that is populated during the synchronization process, and one last function to process the “dead letter queue” (or DLQ) when issues arise from processing the normal queue. Managing these functions and the resources tied to them becomes easier if you follow a convention like this: ApplicationName-ServiceName-FunctionPurposeName.

Using the convention listed above, the function names end up looking something like this:

  • StoicPanda-CRMSynchService-DBMonitorFunc
  • StoicPanda-CRMSynchService-QueueMonitorFunc
  • StoicPanda-CRMSynchService-DLQMonitorFunc

This way, you know exactly what the function you need is called and can find it when you need it. When you combine this practice with mandatory tagging of cloud resources, you gain a powerful control that helps prevent deployments of opaquely named services. Like some of the controls above, our teams can also enforce this naming convention using yet another Safeguard in the Serverless Dashboard.
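In serverless.yml, setting explicit function names to match the convention might look like this (a sketch; the handlers and service layout are hypothetical):

```yaml
# serverless.yml (sketch) for the hypothetical CRM sync service
service: CRMSynchService

functions:
  dbMonitor:
    # Explicit Lambda name following ApplicationName-ServiceName-FunctionPurposeName
    name: StoicPanda-CRMSynchService-DBMonitorFunc
    handler: src/dbMonitor.handler
  queueMonitor:
    name: StoicPanda-CRMSynchService-QueueMonitorFunc
    handler: src/queueMonitor.handler
  dlqMonitor:
    name: StoicPanda-CRMSynchService-DLQMonitorFunc
    handler: src/dlqMonitor.handler
```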

TL;DR

So there you have it. I’ve listed three practices each under security and consistency that have helped my teams maintain order in ever-scaling, poly-stack, multi-cloud environments. What we’ve covered above is only a subset of the controls that any team running production cloud workloads should be considering, and most of these are greatly simplified by using the Serverless Framework or similar abstraction frameworks.

There are also many other Safeguards to enable more application-specific practices like enforcing the creation of Dead Letter Queues or requiring services be within a VPC. So many aspects of service and deployment governance can be automated through policy-as-code during CI/CD pipelines.
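As one last sketch, those two practices might be wired up like this (the safeguard names are as I recall them from the built-in list, so verify them against your plugin version):

```yaml
custom:
  safeguards:
    # Warn when a function has no dead letter queue configured.
    - title: Require DLQs
      safeguard: require-dlq
      enforcementLevel: warning
    # Fail when functions are not attached to a VPC.
    - title: Require VPC
      safeguard: require-global-vpc
      enforcementLevel: error
```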

Keep in mind that these best practices aren’t only applicable to the Serverless Framework. However your teams decide to build your applications, many of these practices can help you do so more effectively and securely.


Mark Fowler

Continuous learner & technologist currently focused on building healthcare entities with forward-thinking partners. Passionate about all things Cloud.