How to use the WillReplace feature of AWS Auto Scaling Groups

Instead of performing rolling deploys with AWS ASGs you can replace all instances in one go for quicker deployments; however, there are a few quirks that need to be catered for.

We are going to take a look at a pattern for AWS deployments using AMIs and CloudFormation. My previous post, High availability deployments in AWS using Ansible and Packer, detailed how to perform rolling updates: a set of new instances (with new AMIs) is brought up and given time to become ready using Auto Scaling Group signals; once they are ready, old instances are terminated and the whole process repeats until all old instances have been replaced.

Now this process works fine if you have a small number of instances in play; however, if you are dealing with a stack that contains a large number of instances then this technique can result in slow deployments, as you wait for each new instance to boot up, execute its user data and signal its ASG.

There is an alternative however: the WillReplace UpdatePolicy feature. When in use, an update creates a new ASG alongside the existing one, and this new ASG launches all of its instances in one go - so if you are replacing a stack of six machines the new ASG will launch all six at once and wait for them to become ready (you can use ASG signals just as with a normal rolling update). If all new instances become ready, the old ASG is removed, at which point all of the new instances should be serving traffic via their assigned ELB(s).

This method speeds up deployment time considerably, as you only have to wait roughly the time it takes to bring up a single instance for the whole deployment to finish.

WillReplace in action

So let's take a look at a complete CF template that uses the WillReplace method for ASG updates:

---
# ha_willreplace.template.yaml

AWSTemplateFormatVersion: '2010-09-09'
Description: HA Will Replace Example

Parameters:
  InstanceType:
    Description: WebServer EC2 instance type
    Type: String
    Default: t2.nano
    AllowedValues:
    - t2.nano
    - t2.micro
    ConstraintDescription: must be a valid EC2 instance type.
  KeyName:
    Description: Name of an existing EC2 KeyPair to enable SSH access to the instance
    Type: AWS::EC2::KeyPair::KeyName
    ConstraintDescription: must be the name of an existing EC2 KeyPair.
  InstanceCount:
    Description: Number of EC2 instances to launch
    Type: Number
    Default: '1'
  InstanceCountMax:
    Description: Maximum number of EC2 instances to launch
    Type: Number
    Default: '6'
  InstanceImageId:
    Description: Image ID for EC2 instances
    Type: String

Resources:
  InternetGateway:
    Type: AWS::EC2::InternetGateway
    Properties:
      Tags:
        - Key: Application
          Value: !Ref AWS::StackId

  VPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16
      Tags:
        - Key: Application
          Value: !Ref AWS::StackId

  AttachGateway:
    Type: AWS::EC2::VPCGatewayAttachment
    Properties:
      VpcId: !Ref VPC
      InternetGatewayId: !Ref InternetGateway

  PublicSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: 10.0.1.0/24
      Tags:
        - Key: Application
          Value: !Ref AWS::StackId

  RouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VPC
      Tags:
        - Key: Application
          Value: !Ref AWS::StackId

  Route:
    Type: AWS::EC2::Route
    DependsOn: AttachGateway
    Properties:
      RouteTableId: !Ref RouteTable
      DestinationCidrBlock: 0.0.0.0/0
      GatewayId: !Ref InternetGateway

  PublicSubnetRouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref PublicSubnet
      RouteTableId: !Ref RouteTable

  PublicSshSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Enable external SSH access
      VpcId: !Ref VPC
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: '22'
          ToPort: '22'
          CidrIp: 0.0.0.0/0

  PublicWebSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Enable external web access
      VpcId: !Ref VPC
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: '80'
          ToPort: '80'
          CidrIp: 0.0.0.0/0

  WebServerGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      VPCZoneIdentifier:
        - !Ref PublicSubnet
      LaunchConfigurationName: !Ref WebLaunchConfig
      DesiredCapacity: !Ref InstanceCount
      MinSize: 1
      MaxSize: !Ref InstanceCountMax
      LoadBalancerNames:
        - !Ref WebElasticLoadBalancer
      HealthCheckGracePeriod: '300'
      HealthCheckType: ELB
    CreationPolicy:
      ResourceSignal:
        Count: !Ref InstanceCount
        Timeout: PT5M
    UpdatePolicy:
      AutoScalingReplacingUpdate:
        WillReplace: 'true'

  WebLaunchConfig:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      AssociatePublicIpAddress: 'true'
      ImageId: !Ref InstanceImageId
      InstanceType: !Ref InstanceType
      SecurityGroups:
        - !Ref PublicSshSecurityGroup
        - !Ref PublicWebSecurityGroup
      KeyName: !Ref KeyName
      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash
          yum update -y
          yum install -y aws-cfn-bootstrap
          /opt/aws/bin/cfn-init --resource WebLaunchConfig --stack ${AWS::StackName} --region ${AWS::Region}
          yum install -y nginx
          service nginx start
          /opt/aws/bin/cfn-signal -e $? --stack ${AWS::StackName} --resource WebServerGroup --region ${AWS::Region}

  WebElasticLoadBalancer:
    Type: AWS::ElasticLoadBalancing::LoadBalancer
    Properties:
      CrossZone: 'false'
      Scheme: internet-facing
      SecurityGroups:
        - !Ref PublicWebSecurityGroup
      Subnets:
        - !Ref PublicSubnet
      Listeners:
        - LoadBalancerPort: '80'
          InstancePort: '80'
          Protocol: HTTP

Outputs:
  AutoScalingGroup:
    Description: AutoScalingGroup ID for stack
    Value: !Ref WebServerGroup

This template is self-contained - it comes with a VPC, a subnet and everything else needed to launch instances. Note that provisioning this stack will potentially cost some money (though not much).

To launch this stack we can use the AWS CLI - take note of the parameters that we are pumping in:

aws cloudformation create-stack --stack-name ha-willreplace --template-body file://ha_willreplace.template.yaml --parameters ParameterKey=KeyName,ParameterValue=[KEY-PAIR-NAME] ParameterKey=InstanceCount,ParameterValue=1  ParameterKey=InstanceImageId,ParameterValue=[AMI-ID]
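
If you want your shell to block until the stack has finished creating, you can wait on it:

aws cloudformation wait stack-create-complete --stack-name ha-willreplace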

You can look up some Amazon Linux AMIs to launch like so:

aws ec2 describe-images --owners 137112412989 --filters "Name=name,Values=*amzn-ami-hvm*" "Name=virtualization-type,Values=hvm" "Name=root-device-type,Values=ebs" "Name=architecture,Values=x86_64" "Name=hypervisor,Values=xen" --output text --query "reverse(sort_by(Images, &CreationDate))|[].ImageId" | tr '\t' '\n'

This will give you a list of Amazon Linux AMIs with the latest at the top; you can use these IDs as the InstanceImageId for the above stack. Try launching the stack and then updating it with different AMIs and instance counts (use the update-stack method instead of create-stack with the aws cloudformation CLI call); you should see that it replaces all instances in one go, with a single pause while the new instances become ready and send their signals.
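
For example, you could capture the newest AMI ID from the lookup above in a shell variable and feed it into an update - a rough sketch, with an arbitrary instance count of 3:

LATEST_AMI_ID=$( aws ec2 describe-images --owners 137112412989 --filters "Name=name,Values=*amzn-ami-hvm*" "Name=virtualization-type,Values=hvm" "Name=root-device-type,Values=ebs" "Name=architecture,Values=x86_64" "Name=hypervisor,Values=xen" --output text --query "reverse(sort_by(Images, &CreationDate))|[0].ImageId" )

aws cloudformation update-stack --stack-name ha-willreplace --template-body file://ha_willreplace.template.yaml --parameters ParameterKey=KeyName,ParameterValue=[KEY-PAIR-NAME] ParameterKey=InstanceCount,ParameterValue=3 ParameterKey=InstanceImageId,ParameterValue="${LATEST_AMI_ID}"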

Now just to state the obvious: normally you would be using a pre-baked image for a service that already has everything you need in it; here we are just using a base Amazon Linux AMI and installing what we need on the fly, by way of example.

Watch your signals

There are a few quirks to the WillReplace method. ASGs seem to launch new machines in batches of 10 with a slight delay between each batch, so if you are updating a stack of 10 or more machines you will see a slight pause between each set. The main quirk, however, is on the ASG signal side of things - this bit can really trip you up if you are not careful!

Let's take a look at the create and update policies used by the WillReplace method:

CreationPolicy:
  ResourceSignal:
    Count: '1'
    Timeout: PT5M
UpdatePolicy:
  AutoScalingReplacingUpdate:
    WillReplace: 'true'

The WillReplace style of update creates a new ASG, so updates actually use the ASG's CreationPolicy to determine how many signals to wait for. The ResourceSignal count in a CreationPolicy denotes exactly how many signals to wait for in total, rather than the number of signals per instance (as with rolling updates). If you have a fixed number of instances then this is fine, as you can just set the signal count to however many instances you need.
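
If you want to see the signals arriving during an update, you can filter the stack events from the CLI; this is just an illustrative query, and the exact wording of the status reason may vary slightly:

aws cloudformation describe-stack-events --stack-name ha-willreplace --query "StackEvents[].[Timestamp,LogicalResourceId,ResourceStatusReason]" --output text | grep "Received SUCCESS signal"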

For stacks that can scale up based on a metric, such as instance CPU usage, you don't want to reset the stack down to the minimum number of instances every time you release - telling developers "please only release during quiet periods" is obviously not acceptable. For scalable stacks, what we need to do is detect how many instances are in play before we update the stack, so we can pass this number in as the InstanceCount parameter. Notice that our stack template has an output for the Auto Scaling Group ID; we can use this to find the current instance count.

Here is a sample bash script that detects the number of running instances and updates the stack:

#!/bin/bash

set -euo pipefail

# ha_willreplace_stack_update.sh
# Usage: ha_willreplace_stack_update.sh [AMI_ID] [SSH_KEY_PAIR_NAME]

function main {
    local AMI_ID
    local SSH_KEY_NAME
    local ASG_ID
    local ASG_CURRENT_INSTANCE_COUNT

    AMI_ID=${1-ami-d3c0c4b5}
    SSH_KEY_NAME=${2-}

    if [[ -z ${AMI_ID} ]] || [[ -z ${SSH_KEY_NAME} ]]; then
        echo "Missing required arguments" >&2
        exit 1
    fi

    # Grab ASG ID
    ASG_ID=$( aws cloudformation describe-stacks \
        --stack-name ha-willreplace \
        --query "Stacks[0].Outputs[?OutputKey=='AutoScalingGroup'].OutputValue" \
        --output text )

    # Grab current instance count from ASG
    ASG_CURRENT_INSTANCE_COUNT=$( aws autoscaling describe-auto-scaling-groups \
        --auto-scaling-group-names "${ASG_ID}" \
        --query "AutoScalingGroups[0].DesiredCapacity" \
        --output text )

    echo "Current instance count: ${ASG_CURRENT_INSTANCE_COUNT}"

    # Update stack
    aws cloudformation update-stack \
        --stack-name ha-willreplace \
        --template-body file://ha_willreplace.template.yaml \
        --parameters \
            ParameterKey=KeyName,ParameterValue="${SSH_KEY_NAME}" \
            ParameterKey=InstanceCount,ParameterValue="${ASG_CURRENT_INSTANCE_COUNT}" \
            ParameterKey=InstanceImageId,ParameterValue="${AMI_ID}"

    aws cloudformation wait stack-update-complete \
        --stack-name ha-willreplace

    echo "Stack updated"
}

main "$@"

Give this a go with different AMI IDs using the lookup command from earlier; you can manually scale up the stack by editing the ASG in the AWS console and setting the desired instance count. Once an update is complete you should be left with the same number of instances as before the update, and if you check the CF output in the AWS console you should see that the correct number of signals was waited for and that the instances were launched at the same time (give or take).
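
If you'd rather do that manual scale-up from the CLI instead of the console, something like this should work (using the ASG ID from the stack output):

aws autoscaling set-desired-capacity --auto-scaling-group-name [ASG-NAME] --desired-capacity 4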

Those pesky database migrations

The above process assumes that launching all instances at once is acceptable for a given service; however, if your service runs database migrations on boot then this is a really bad idea. Luckily there is a pretty simple solution for this - thanks AWS!

What we can do is use two ASGs. The first one launches a single instance only, which is not added to the stack's ELB - this is the migration ASG. The single machine that it launches goes first and runs any database migrations; once that ASG has updated successfully, the main ASG that uses the WillReplace method runs and replaces all of the actual serving instances behind the stack's ELB. The instances in the main ASG can still run migrations on boot, so there is no need to configure them differently; since the database will already be up to date these migrations will do nothing - as long as your devs are using a decent migration framework, of course. As an additional bonus you can give the migration group a longer signal timeout so you can handle long-running data migrations.

Here is an updated version of our previous stack with a migration ASG. Note the use of the DependsOn attribute on the WebServerGroup; this attribute makes the group wait for the MigrationGroup update to complete first:

---
# ha_willreplace_migrations.template.yaml

AWSTemplateFormatVersion: '2010-09-09'
Description: HA Will Replace Example

Parameters:
  InstanceType:
    Description: WebServer EC2 instance type
    Type: String
    Default: t2.nano
    AllowedValues:
    - t2.nano
    - t2.micro
    ConstraintDescription: must be a valid EC2 instance type.
  KeyName:
    Description: Name of an existing EC2 KeyPair to enable SSH access to the instance
    Type: AWS::EC2::KeyPair::KeyName
    ConstraintDescription: must be the name of an existing EC2 KeyPair.
  InstanceCount:
    Description: Number of EC2 instances to launch
    Type: Number
    Default: '1'
  InstanceCountMax:
    Description: Maximum number of EC2 instances to launch
    Type: Number
    Default: '6'
  InstanceImageId:
    Description: Image ID for EC2 instances
    Type: String

Resources:
  InternetGateway:
    Type: AWS::EC2::InternetGateway
    Properties:
      Tags:
        - Key: Application
          Value: !Ref AWS::StackId

  VPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16
      Tags:
        - Key: Application
          Value: !Ref AWS::StackId

  AttachGateway:
    Type: AWS::EC2::VPCGatewayAttachment
    Properties:
      VpcId: !Ref VPC
      InternetGatewayId: !Ref InternetGateway

  PublicSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: 10.0.1.0/24
      Tags:
        - Key: Application
          Value: !Ref AWS::StackId

  RouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VPC
      Tags:
        - Key: Application
          Value: !Ref AWS::StackId

  Route:
    Type: AWS::EC2::Route
    DependsOn: AttachGateway
    Properties:
      RouteTableId: !Ref RouteTable
      DestinationCidrBlock: 0.0.0.0/0
      GatewayId: !Ref InternetGateway

  PublicSubnetRouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref PublicSubnet
      RouteTableId: !Ref RouteTable

  PublicSshSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Enable external SSH access
      VpcId: !Ref VPC
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: '22'
          ToPort: '22'
          CidrIp: 0.0.0.0/0

  PublicWebSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Enable external web access
      VpcId: !Ref VPC
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: '80'
          ToPort: '80'
          CidrIp: 0.0.0.0/0

  MigrationGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      VPCZoneIdentifier:
        - !Ref PublicSubnet
      LaunchConfigurationName: !Ref MigrationLaunchConfig
      DesiredCapacity: 1
      MinSize: 0
      MaxSize: 1
    CreationPolicy:
      ResourceSignal:
        Count: 1
        Timeout: PT5M
    UpdatePolicy:
      AutoScalingReplacingUpdate:
        WillReplace: 'true'

  MigrationLaunchConfig:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      AssociatePublicIpAddress: 'true'
      ImageId: !Ref InstanceImageId
      InstanceType: !Ref InstanceType
      SecurityGroups:
        - !Ref PublicSshSecurityGroup
      KeyName: !Ref KeyName
      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash
          yum update -y
          yum install -y aws-cfn-bootstrap
          /opt/aws/bin/cfn-init --resource MigrationLaunchConfig --stack ${AWS::StackName} --region ${AWS::Region}
          yum install -y nginx
          service nginx start
          /opt/aws/bin/cfn-signal -e $? --stack ${AWS::StackName} --resource MigrationGroup --region ${AWS::Region}

  WebServerGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    DependsOn: MigrationGroup
    Properties:
      VPCZoneIdentifier:
        - !Ref PublicSubnet
      LaunchConfigurationName: !Ref WebLaunchConfig
      DesiredCapacity: !Ref InstanceCount
      MinSize: 1
      MaxSize: !Ref InstanceCountMax
      LoadBalancerNames:
        - !Ref WebElasticLoadBalancer
      HealthCheckGracePeriod: '300'
      HealthCheckType: ELB
    CreationPolicy:
      ResourceSignal:
        Count: !Ref InstanceCount
        Timeout: PT5M
    UpdatePolicy:
      AutoScalingReplacingUpdate:
        WillReplace: 'true'

  WebLaunchConfig:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      AssociatePublicIpAddress: 'true'
      ImageId: !Ref InstanceImageId
      InstanceType: !Ref InstanceType
      SecurityGroups:
        - !Ref PublicSshSecurityGroup
        - !Ref PublicWebSecurityGroup
      KeyName: !Ref KeyName
      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash
          yum update -y
          yum install -y aws-cfn-bootstrap
          /opt/aws/bin/cfn-init --resource WebLaunchConfig --stack ${AWS::StackName} --region ${AWS::Region}
          yum install -y nginx
          service nginx start
          /opt/aws/bin/cfn-signal -e $? --stack ${AWS::StackName} --resource WebServerGroup --region ${AWS::Region}

  WebElasticLoadBalancer:
    Type: AWS::ElasticLoadBalancing::LoadBalancer
    Properties:
      CrossZone: 'false'
      Scheme: internet-facing
      SecurityGroups:
        - !Ref PublicWebSecurityGroup
      Subnets:
        - !Ref PublicSubnet
      Listeners:
        - LoadBalancerPort: '80'
          InstancePort: '80'
          Protocol: HTTP

Outputs:
  AutoScalingGroup:
    Description: AutoScalingGroup ID for stack
    Value: !Ref WebServerGroup
  AutoScalingGroupMigration:
    Description: Migration AutoScalingGroup ID for stack
    Value: !Ref MigrationGroup

In this example I have kept the same basic launch config for both ASGs, so no migrations are actually executed; we are just observing how one ASG can wait for another.
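
In a real stack the MigrationLaunchConfig would differ - its user data would run your migrations before signalling. A minimal sketch, assuming a hypothetical /opt/myapp/bin/migrate command baked into your AMI:

      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash
          yum update -y
          yum install -y aws-cfn-bootstrap
          # Run database migrations before signalling (hypothetical command baked into the AMI)
          /opt/myapp/bin/migrate
          # Signal the ASG with the migrate command's exit status
          /opt/aws/bin/cfn-signal -e $? --stack ${AWS::StackName} --resource MigrationGroup --region ${AWS::Region}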

You can launch this stack like so:

aws cloudformation create-stack --stack-name ha-willreplace-migrations --template-body file://ha_willreplace_migrations.template.yaml --parameters ParameterKey=KeyName,ParameterValue=[KEY-PAIR-NAME] ParameterKey=InstanceCount,ParameterValue=1  ParameterKey=InstanceImageId,ParameterValue=[AMI-ID]

Once launched you should see that the migration group goes first, with the web server group going second. One more thing of note is the output for the migration group's ASG ID; we can use this to minimise the migration group after a create/update.

Here is an example update script that does just this:

#!/bin/bash

set -euo pipefail

# ha_willreplace_migrations_stack_update.sh
# Usage: ha_willreplace_migrations_stack_update.sh [AMI_ID] [SSH_KEY_PAIR_NAME]

function main {
    local AMI_ID
    local SSH_KEY_NAME
    local ASG_ID
    local ASG_CURRENT_INSTANCE_COUNT
    local MIGRATION_ASG_ID

    AMI_ID=${1-ami-d3c0c4b5}
    SSH_KEY_NAME=${2-}

    if [[ -z ${AMI_ID} ]] || [[ -z ${SSH_KEY_NAME} ]]; then
        echo "Missing required arguments" >&2
        exit 1
    fi

    # Grab ASG ID
    ASG_ID=$( aws cloudformation describe-stacks \
        --stack-name ha-willreplace-migrations \
        --query "Stacks[0].Outputs[?OutputKey=='AutoScalingGroup'].OutputValue" \
        --output text )

    # Grab current instance count from ASG
    ASG_CURRENT_INSTANCE_COUNT=$( aws autoscaling describe-auto-scaling-groups \
        --auto-scaling-group-names "${ASG_ID}" \
        --query "AutoScalingGroups[0].DesiredCapacity" \
        --output text )

    echo "Current instance count: ${ASG_CURRENT_INSTANCE_COUNT}"

    # Update stack
    aws cloudformation update-stack \
        --stack-name ha-willreplace-migrations \
        --template-body file://ha_willreplace_migrations.template.yaml \
        --parameters \
            ParameterKey=KeyName,ParameterValue="${SSH_KEY_NAME}" \
            ParameterKey=InstanceCount,ParameterValue="${ASG_CURRENT_INSTANCE_COUNT}" \
            ParameterKey=InstanceImageId,ParameterValue="${AMI_ID}"

    aws cloudformation wait stack-update-complete \
        --stack-name ha-willreplace-migrations

    echo "Stack updated"

    # Minimise the migration ASG
    MIGRATION_ASG_ID=$( aws cloudformation describe-stacks \
        --stack-name ha-willreplace-migrations \
        --query "Stacks[0].Outputs[?OutputKey=='AutoScalingGroupMigration'].OutputValue" \
        --output text )

    aws autoscaling update-auto-scaling-group \
        --auto-scaling-group-name "${MIGRATION_ASG_ID}" \
        --desired-capacity 0

    echo "Migration group minimised"

}

main "$@"

After running this update script you should have newly updated instances in your ASG and a minimised migration group.
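
You can double check the end state from the CLI, for example by comparing the desired capacity of both groups (using the two ASG IDs pulled from the stack outputs, as in the script above):

aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names "${ASG_ID}" "${MIGRATION_ASG_ID}" --query "AutoScalingGroups[].[AutoScalingGroupName,DesiredCapacity]" --output table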

Some migration tips

Whilst we are talking about automated migrations, I think it's worth taking a side look at how to do them safely. Really this comes down to making sure that your devs write migrations carefully - very carefully!

So the rules of migrations are:

  • Always make migrations backwards compatible with the code version that is in production, in case you need to roll back; test pre-production that rolling back is possible
  • Push a migration through to production before adding any more migrations so there is only one migration queued up at any one time
  • Pull Requests that contain migrations should be reviewed and tested by several developers to ensure that the change is safe
  • It may be necessary to split migrations up into several releases depending on the type of change that is taking place
  • When adding new columns, add default values for columns that are not nullable; once the code that uses the new columns is in production you can add a second migration that cleans up and removes the default values
  • When deleting columns, always release the code that no longer uses the columns first; you may need to run a migration that adds default values to non-nullable columns that are due for deletion, and once that migration is in production you can remove the columns completely with a second migration
  • When renaming columns, create a copy of the old column and use triggers to keep both columns in sync with one another; once the code that uses the new column name is in production you can put in a second migration that removes the old column and the triggers
  • Adding new tables should be possible in one migration; however, do not add any constraints based on the new table to existing tables - that will require a second migration once the new table and code are in production
  • Deleting tables should follow the same process as deleting columns: stop using the table in code first, then drop it with a migration once that code is in production
  • Renaming tables is again similar to renaming columns: clone the existing table and use triggers to keep both tables in play until your code is all the way to production, then a second release can delete the table with the old name plus the triggers
  • For bulk data migrations (e.g. converting values for entire columns or adding in a large data set) there is no hard and fast rule; use multiple migrations in stages and adjust the timeout on the migration ASG if needed

Wrapping up

In this example I haven't used any configuration management or orchestration tools such as Ansible or Terraform; I strongly recommend that you use such a tool instead of writing your own templates and deployment scripts as I have here.

You can see the examples from this article in my CF examples repo; please post any questions as comments if you have them.