What is Blue/Green Environment Interoperation?

Blue/Green environments have become one of the essential non-functional requirements for any modern web app that sees significant traction. In this setup, the system’s production topology comprises two replicated copies of the production environment, preferably in two different datacentres/regions/availability zones.

Interoperation between them means that at any point in time, for reasons foreseen and/or unforeseen, either of the environments should be in active service of the end user, and it should do so transparently: the user should not care which environment is in active service. This is a prelude to High Availability and Disaster Recovery.

Cohesive and correct interoperation means that any switch between these two environments should not yield inconsistencies in the persistent state of the system, and should happen with minimum impact to the user.

In this post, I will describe the challenges and considerations one must take into account to get such a mechanism working correctly, and a protocol to put it into action.


Topology Description

For the purpose of faster understanding, let’s document the example system’s topology below.

Pictorial Depiction

Every production environment consists of 4 components. The following few lines describe each numbered component. The paired numbers are counterparts of each other in the two environments.

Components 1 & 5 - Static Frontend:
These are hosted on an Nginx webserver. Each of these Nginx servers has a well-known CNAME associated with it: https://blueapp.example.com for the Blue/Live env and https://greenapp.example.com for the Standby/Green env. These domains are not accessible from the internet on their own. (We can clamp down traffic on ports 443 and 80 by using AWS Security Groups to allow only a few IPs that we use to test and troubleshoot. Another variation is the deny or allow keyword in the location block.)
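
For illustration, a minimal sketch of the allow/deny variation on one of these frontends might look like this (the 203.0.113.0/24 range is a placeholder for whatever troubleshooting IPs you trust):

server {
  listen 443 ssl;
  server_name blueapp.example.com;

  location / {
    # only a handful of trusted troubleshooting IPs may reach this env directly
    allow 203.0.113.0/24;   # placeholder range
    deny  all;

    # ... rest of the static frontend configuration ...
  }
}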

Components 2 & 6 - RESTful API server:
These are hosted on any application server.

Components 3 & 7 - MySQL Database:
Persistent database storage

Components 4 & 8 - Redis Cache:
Ephemeral data storage for fast access and performance enhancement

We have two such sets of environments replicated in production, called Blue and Green (also called Live and Standby respectively).
Blue is the Live environment, meaning all the live traffic is directed to it. Green is the Standby environment, waiting in the shadows to be called into service in case of an incident.
Green has the same code artifacts deployed on it as the Blue env, but its persistent state is different, because the live traffic is mutating the state of the Live environment. (This is a caveat that needs to be paid due respect for correct and cohesive interoperation.)

Both environments sit behind the API Gateway, which separates them in a manner transparent to the end user.


Probable Solutions

The general consensus in such a scenario is to have an AWS ELB or HAProxy as the Gateway, and this is assumed to be the secret sauce of Blue/Green deployments with zero downtime.
Though these steps are in the right direction, there are multiple caveats that need to be respected. Add the need for centralized devops, and the load balancer and HAProxy start falling short.


Caveats

There can never be a truly zero-downtime deployment or operation for a system underpinned by persistent data storage (in this case, the MySQL and Redis datastores).
Consider the above topology: the live traffic is being directed to the Blue env by the Gateway. This live traffic will mutate the persistent state. Unless we have a transparent and realtime data/transaction flow from the Live to the Standby environment, the Standby environment will always lag behind the Live. This lag is a function of the time that has elapsed since the last backup/restore cycle. The backup/restore cycle in such a case is described here.

In this case, if a hard switch is made to the Standby env using the load balancer or HAProxy, the system will show inconsistent data to the user, and no one wants that. So the correctness of the system has to be ensured.
This comes at the cost of a sliver of downtime, during which the persistent state of the Live env is copied over to the Standby env. This ensures that the persistent state of the system is predictable when the switch is made.

With the above considerations made, a solution can be designed. At the centre of it is the API Gateway, whose configuration makes sure that we have a proper handle on the state of the system.


API/Web Gateway

The live traffic is directed via the API Gateway, which in this case is an Nginx webserver. The well-known domain name https://app.example.com points to the IP of this server. The responsibilities of this webserver are:
1. SSL Termination.
2. Reverse Proxying the traffic.
3. Maintaining the upstreams. (The upstreams in this case are components 1 & 5 in the diagram.)
4. Optimisations and Security.


Configuration of the Nginx API Gateway

worker_processes 1;

events { worker_connections 1024; }

http {

    upstream backend {
      server blueapp.example.com:443 fail_timeout=0;
      server greenapp.example.com:443 fail_timeout=0 backup down;
    }

    client_body_in_file_only clean;
    client_body_buffer_size 32K;
    client_max_body_size 20M;
    sendfile on;
    send_timeout 300s;

    include /etc/nginx/mime.types;

    gzip on;
    gzip_disable "msie6";
    gzip_proxied any;

    gzip_types text/plain text/xml text/css application/x-javascript application/json application/javascript text/javascript;

    gzip_vary on;

    server{
      listen      80;
      return 301 https://$host$request_uri;
    }

    server {

      listen 443 ssl;
      ssl_certificate     /etc/ssl/example/example.crt;
      ssl_certificate_key /etc/ssl/example/example.key;

      access_log    /var/log/nginx/example.access.log;
      error_log     /var/log/nginx/example.error.log;
      proxy_next_upstream  error  http_502 http_503 http_504;

      #security hardening
      server_tokens off;
      add_header X-Frame-Options "SAMEORIGIN";
      add_header X-XSS-Protection "1; mode=block";

      #attach the far expires header and proxy pass to backend for assets
      location ~* \.(?:ico|css|js|gif|jpe?g|png)$ {
        expires 30d;
        add_header Pragma public;
        add_header Cache-Control "public";
        proxy_pass https://backend;
        error_page 502 503 504  = @maintainence;
      }


      location ~ ^/ {
        proxy_pass https://backend;
        error_page 502 503 504  = @maintainence;
      }

      location @maintainence {
        root /usr/share/nginx/html/web-gateway;
        rewrite ^ /down.html break;
      }

  }

}

The above is the nginx.conf for the Gateway Webserver. Apart from the usual configuration, the interesting parts are below:

Upstreams

upstream backend {
  server blueapp.example.com:443 fail_timeout=0;
  server greenapp.example.com:443 fail_timeout=0 backup down;
}
  1. The above upstream segment creates an array of upstream servers aliased as backend, which is used by the proxy_pass directive.
  2. https://blueapp.example.com is the primary server and https://greenapp.example.com is the backup server, denoted by the backup keyword.
  3. The eagle-eyed will notice a particular keyword, down. It is a boolean-valued keyword.
    - Presence of down against a server means that no traffic will be proxied to that server.
    - Absence of down means that the server is up and can serve traffic.
  4. They will also notice a keyword named backup. A server annotated with this keyword will not serve any traffic unless the primary servers (those without the backup keyword) are actually unresponsive or down.


Note: This doesn’t mean that the server annotated with down is actually down. It just means that no traffic will be proxied to that server.

Thus, in the above setup, https://blueapp.example.com serves the live traffic and https://greenapp.example.com is a standby server and does not serve traffic.

This combination of down and backup is very powerful and can be used to do all kinds of traffic wizardry; flipping these flags is what switches live traffic between the two environments, as shown below.
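
For example, switching the live traffic from Blue to Green is just a matter of swapping the flags in the same upstream block and hot reloading nginx. Blue then becomes the backup marked down, and Green becomes the primary:

upstream backend {
  server blueapp.example.com:443 fail_timeout=0 backup down;
  server greenapp.example.com:443 fail_timeout=0;
}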

Reverse Proxy

location ~ ^/ {
  proxy_pass https://backend;
  error_page 502 503 504  = @maintainence;
}
  1. Every request will be proxied to the upstream group named backend. The traffic will be handled by the rules mentioned in the previous section.
  2. In case the backend fails to serve a request (on error codes 502, 503, 504), the request will be served by the location named @maintainence.

 

maintainence Location

location @maintainence {
  root /usr/share/nginx/html/web-gateway;
  rewrite ^ /down.html break;
}
  1. In case of an absolute failure of the backend, this location block shows a user-friendly page named down.html.

 

With this setup, we can achieve a correct and cohesive Blue/Green environment interoperation.


Scenarios & Protocols

Scheduled outage/maintenance:

Suppose we have to roll out a new version of the application on a particular day. Let’s say A (https://blueapp.example.com) is currently live and B (https://greenapp.example.com) is standby. The aim is to roll out the new deployment to env A. For a slightly more complex scenario, consider that the new deployment will have schema changes.

Following is the protocol that has to be followed:
1. Stop the live traffic to the system by setting the down flag of A to true and display a graceful down-for-maintenance page.
2. Backup of DB of A
3. Backup of Redis of A
4. Restore the DB and Redis of A onto B and make sure that B is up to date with the last known state while A was live (a command-level sketch of steps 2 to 4 appears after this protocol)
5. Switch live traffic to B by setting the down flag of B to false and hot reload nginx
6. Migrate the DB of A as per the new changes. Please note that this is a throwaway migration for the purpose of testing
7. Deploy the new application on env A
7. Deploy the new application on env A
8. Smoke test using the public ip of A
9. Stop the live traffic on B as well by setting the down flag of B to true
10. Restore the latest data of B onto the DB of A 
11. Again migrate the DB of A as per the new changes
12. Switch live traffic to A by setting the down flag of the upstream A to false

This way, the end user experiences downtime only while the maintenance page is shown: during steps 1 to 4 (until traffic is switched to B) and again during steps 9 to 11 (until traffic is switched back to A). As mentioned before, this downtime can’t be avoided if we need correctness of the deployment.
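
For concreteness, here is a rough command-level sketch of the backup/restore steps 2 to 4. The hostnames, credentials and paths below are placeholders; the exact commands depend on how your MySQL and Redis instances are deployed:

# Step 2: backup the MySQL database of A (Blue)
mysqldump -h blue-db.internal -u backup_user -p appdb > /backups/appdb.sql

# Step 3: backup the Redis data of A (Blue) by forcing an RDB snapshot
redis-cli -h blue-redis.internal BGSAVE
# wait for the snapshot to finish (check with LASTSAVE), then copy it over
scp blue-redis.internal:/var/lib/redis/dump.rdb /backups/dump.rdb

# Step 4: restore both onto B (Green)
mysql -h green-db.internal -u restore_user -p appdb < /backups/appdb.sql
scp /backups/dump.rdb green-redis.internal:/var/lib/redis/dump.rdb
# restart Redis on B so it loads the copied dump.rdb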
 

Unforeseen/Disaster outage:

Suppose our live application is on A (https://blueapp.example.com) and due to some unforeseen circumstances there is an outage. We need to transparently switch over to B (https://greenapp.example.com). Following is the protocol that has to be followed:

  1. Encounter Downtime
  2. Set the down flag of A to true and hot reload nginx
  3. Get the latest backup of Redis and DB from A to B
  4. Switch traffic to B by setting the down flag of B to false
  5. Fix the environment A
  6. Smoke test
  7. Take the B environment down by setting the down flag of B to true and hot reload nginx
  8. Restore the latest data of B to A
  9. Switch the live traffic to A by setting the down flag of A to false and hot reload nginx 
During this scenario we have recovery downtime during steps 2 to 4 (until traffic is switched to B) and again during steps 7 to 9 (until traffic is switched back to A).

Note: To hot reload nginx use the command nginx -s reload
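
If you want to script the flag flip instead of editing nginx.conf in an editor, a quick-and-dirty sketch is a pair of sed substitutions followed by a config test and reload. The config path and the exact text of the server lines are assumptions based on the config above; editing the file by hand works just as well:

# take Blue out of service and promote Green (run on the gateway host)
sudo sed -i 's/server blueapp.example.com:443 fail_timeout=0;/server blueapp.example.com:443 fail_timeout=0 down;/' /etc/nginx/nginx.conf
sudo sed -i 's/server greenapp.example.com:443 fail_timeout=0 backup down;/server greenapp.example.com:443 fail_timeout=0 backup;/' /etc/nginx/nginx.conf

# always validate before hot reloading
sudo nginx -t && sudo nginx -s reload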


Conclusion

1. With the mixture of the down and backup flags on upstream servers, we can manipulate where the system’s traffic is directed.

2. This is very useful for correct environment interoperation, so that the user always gets a transparent and correct view of the system with minimum downtime.

3. This does require you to SSH into the server, edit nginx.conf manually, and hot reload nginx. In subsequent posts, we will explore how to leverage the power of Lua and Nginx together to remove the need for manual SSH.

As always, please let me know what you think of this setup!
Cheers!