Our remote operations teams have supported US holiday season traffic for decades. Black Friday and Cyber Monday can drive traffic to company websites and systems up by as much as 100x. It’s the time when some e-commerce & retail companies make most of their annual revenue within a few days. And I know one thing for sure: Murphy’s Law loves the holiday season. Anything & everything that can go wrong will go wrong during this time, and many companies struggle because their engineering teams are off for the holidays.
Our clients, of course, have the luxury of remote teams helping them with preparations as well as continuous 24x7 SRE operations, with tier 1, 2 & 3 support teams fully staffed and operational while the onsite teams enjoy their well-deserved time off.
Even big brand names have had their fair share of holiday season downtime due to various failures and issues. In these few mad rush hours, even a few minutes of downtime could cost the organization millions of dollars. So here we’re focusing on how you can ensure that your IT infrastructure is well prepared for the holiday season. The list below is in no particular order.
Monitor resource usage & critical metrics with proper thresholds and alerting configured. Review past data and estimate whether your existing resources are sufficient to cope with the extra surge. Do not run anything close to its limit and pray that it will suffice; always keep a buffer of spare resources available.
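As a rough illustration of the estimate described above, the following minimal sketch projects this year’s peak from last year’s observed peak and checks whether current capacity covers it with a buffer on top. The function names, the 20% growth figure and the 30% buffer are illustrative assumptions, not values taken from any particular tool or client.

```python
def required_capacity(last_peak: float, growth: float = 0.20, buffer: float = 0.30) -> float:
    """Projected peak (last year's peak plus growth) with a safety buffer on top."""
    return last_peak * (1 + growth) * (1 + buffer)


def has_headroom(current_capacity: float, last_peak: float,
                 growth: float = 0.20, buffer: float = 0.30) -> bool:
    """True if current capacity covers the projected peak plus the buffer."""
    return current_capacity >= required_capacity(last_peak, growth, buffer)


if __name__ == "__main__":
    # Hypothetical numbers: 10,000 requests/s peak last year, 14,000 requests/s capacity today.
    needed = required_capacity(10_000)
    print(f"needed: {needed:.0f} req/s, enough today: {has_headroom(14_000, 10_000)}")
```

The same check can be repeated per resource (CPU, memory, connections, bandwidth), since the tightest one is the one that fails first.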
Ensure your infrastructure is ready to handle extra load through automatic scaling and auto-migrations. Having your systems in the cloud with proper configuration makes it possible to scale beyond the limits of physical hardware. It is also more economical than keeping extra infrastructure capacity that would sit unused for many months of the year. Ensure that you enable vMotion and DRS and configure proper rules on VMware, or the equivalents on other virtualization platforms, well in advance. Ensure that your clusters are well balanced so that no hotspots form. And in the cloud, enable autoscaling and verify that the thresholds and autoscale policies/rules are appropriate for peak load.
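To sanity-check an autoscale policy against expected peak load, it can help to work through the scaling arithmetic by hand. The sketch below uses the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current * metric / target)), clamped to the policy’s minimum and maximum. The function name and the numbers are illustrative assumptions, and other platforms may compute this differently.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Proportional scaling: grow or shrink so the per-replica metric returns to target."""
    if current <= 0 or target <= 0:
        raise ValueError("current and target must be positive")
    raw = math.ceil(current * metric / target)
    # Clamp to the bounds configured in the autoscale policy.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Hypothetical: 10 replicas at 300% of a 60% CPU target, policy allows 2 to 20 replicas.
    print(desired_replicas(10, 300, 60, min_replicas=2, max_replicas=20))
```

If the unclamped result for your projected peak exceeds the policy maximum, the ceiling rather than the traffic decides how far you scale, so it is worth reviewing before the holiday season.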
Ensure that your storage clusters & disks have enough free space & IOPS for a sudden surge of data. Usually, data comes into a system and is then moved to another location on a daily basis, which keeps storage usage steady. However, the amount of data received on Black Friday/Cyber Monday might be 100x the normal volume, overwhelming capacity that coped well during the rest of the year. Also, keep an eye out for storage hotspots.
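The storage arithmetic above can be sketched as a simple "days until full" estimate. This is a minimal sketch with hypothetical names; it assumes the daily offload job moves a fixed amount of data and does not speed up during the surge, which may not match your setup.

```python
import math


def days_until_full(free_gb: float, daily_ingest_gb: float,
                    daily_offload_gb: float, surge: float = 1.0) -> float:
    """Days before free space runs out, given ingest scaled by a surge multiplier."""
    net_growth_gb = daily_ingest_gb * surge - daily_offload_gb
    if net_growth_gb <= 0:
        # Offload keeps up with ingest, so usage stays steady.
        return math.inf
    return free_gb / net_growth_gb


if __name__ == "__main__":
    # Hypothetical: 5 TB free, 100 GB/day in and out during a normal week.
    print(days_until_full(5000, 100, 100))             # steady state
    print(days_until_full(5000, 100, 100, surge=100))  # holiday surge
```

With these example numbers, a volume that never fills during the year runs out of space in about half a day at 100x ingest.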
If there are any servers & VMs that require restarts to clear high CPU or high memory conditions, do rolling restarts well in advance of the holiday season. This is especially true for applications with memory leaks.
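A rolling restart can be sketched as restarting hosts in small batches and stopping as soon as a batch fails its health check. The `restart` and `is_healthy` callables below are hypothetical placeholders for whatever your orchestration tooling provides; this is a sketch of the control flow, not a definitive implementation.

```python
from typing import Callable, List, Sequence


def rolling_restart(hosts: Sequence[str],
                    restart: Callable[[str], None],
                    is_healthy: Callable[[str], bool],
                    batch_size: int = 1) -> List[str]:
    """Restart hosts batch by batch; halt if any host in a batch is unhealthy afterwards."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    restarted: List[str] = []
    for start in range(0, len(hosts), batch_size):
        batch = list(hosts[start:start + batch_size])
        for host in batch:
            restart(host)
        unhealthy = [host for host in batch if not is_healthy(host)]
        if unhealthy:
            # Stop here so the remaining hosts keep serving traffic.
            raise RuntimeError(f"halting rollout, unhealthy after restart: {unhealthy}")
        restarted.extend(batch)
    return restarted
```

Keeping the batch size small relative to the cluster means the remaining hosts can absorb the load while each batch is down.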
Apply updates and patches well in advance so that your systems are up to date and stable, and so that you avoid last-minute surprises. Having a tier-1 team allows you to quickly apply and test updates across a large amount of infrastructure.
Ensure that your thresholds and watermarks are reviewed properly and will not raise false alarms during traffic spikes. Have people & systems monitoring these alerts to ensure there are no unusual symptoms. Have response scripts ready & handy for various incidents. The chances of a mistake are very high when you need to write scripts in a stressful situation.
Make use of CDN caching if you’re not already doing so. Set up distributed uptime monitoring using global monitoring nodes. Tools such as Pingdom and AlertSite can be used for this.
Diversify infrastructure locally and geographically with automatic failovers (such as GeoDNS failover). Ensure that the failover infrastructure is sized to handle the full peak traffic volume; otherwise an overloaded failover site can fail in turn and traffic ends up ping-ponging between sites.
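One way to check failover sizing is to ask, for each site, whether the remaining sites could carry the peak on their own if that site were lost. The sketch below does that with hypothetical site names and capacities; it assumes capacity and load are measured in the same unit (for example requests per second).

```python
from typing import Dict, List


def undersized_failovers(site_capacity: Dict[str, float], peak_load: float) -> List[str]:
    """Sites whose loss would leave the remaining sites unable to carry peak_load."""
    total = sum(site_capacity.values())
    return sorted(
        name for name, capacity in site_capacity.items()
        if total - capacity < peak_load
    )


if __name__ == "__main__":
    # Hypothetical two-site setup with a smaller secondary site.
    sites = {"primary": 10_000, "secondary": 4_000}
    print(undersized_failovers(sites, peak_load=8_000))
```

In this example the secondary site is fine for normal traffic but cannot absorb the holiday peak if the primary fails, which is exactly the situation that leads to repeated failovers.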
Ensure that load-balancing & failover rules are implemented properly and tested many times. The worst thing to discover during an emergency is that your plan B is not working. Remember, Murphy’s Law loves the holiday season.
Ensure your databases are configured to auto-grow data files in larger chunks. Ensure that you reconfigure any ceiling thresholds. Ensure that database files are distributed properly so there are no bottlenecks or hot files. Architecting for multiple database replicas is highly recommended to ensure optimal performance.
Ensure that any data transfers & syncs are optimized, including database replication and file transfers. Shut down non-essential data sync/replication jobs if required so that the bandwidth is available for critical syncs.
Implement microservices & containerized applications to improve resiliency. Review container migration & restart policies carefully.
Load test & stress test your infrastructure by creating chaotic situations. You could use your previous year’s data and the predicted traffic growth to build these scenarios. It is important to load test & stress test your failover plans and backup infrastructure too.
Adjust traffic routing policies to ensure network resiliency. Identify bottlenecks in advance. Some of your backup equipment may not be sized to handle holiday season volumes even though it copes well with traffic during normal times.
Have a proper 24x7 SRE team, on-call schedules and an incident response platform such as PagerDuty, VictorOps or Opsgenie. It’s quite common to be unable to reach some of the onsite engineers during the holidays, so having a remote expert NOC reduces the risk of tier-1 teams being stuck without help.
Have the SRE teams send out timely, frequent communications about all issues to the rest of the engineering teams and operational management. Have war room channels and email groups established in advance for emergencies.
If your large batch jobs struggle for resources, identify low-priority tasks which could be temporarily disabled to speed up priority work. Depending on the setup, some steps of large batch jobs could also be disabled.
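Deciding what to pause is easier if it is worked out before the rush. The sketch below picks the least important jobs first until enough CPU is freed, and never touches jobs above a protected priority level. The `Job` structure, the priority convention (larger number means less important) and the example jobs are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class Job(NamedTuple):
    name: str
    priority: int      # assumption: larger number = less important
    cpu_cores: float


def jobs_to_pause(jobs: List[Job], cores_needed: float, protect_below: int = 2) -> List[str]:
    """Pick least important jobs first until cores_needed is freed.

    Jobs with priority < protect_below are never paused, so the result may
    free fewer cores than requested.
    """
    paused: List[str] = []
    freed = 0.0
    for job in sorted(jobs, key=lambda j: j.priority, reverse=True):
        if freed >= cores_needed:
            break
        if job.priority < protect_below:
            continue
        paused.append(job.name)
        freed += job.cpu_cores
    return paused


if __name__ == "__main__":
    # Hypothetical job list.
    jobs = [Job("orders-etl", 1, 8), Job("weekly-report", 3, 4), Job("log-archive", 4, 2)]
    print(jobs_to_pause(jobs, cores_needed=5))
```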
Optimize code as much as possible. It could be slow transactional code that takes even longer with the additional load on the servers, or an unoptimized reporting job that now runs for hours instead of a few minutes. Some reports and dashboards are highly important during the holiday season, as the business will want to tweak and optimize its sales campaigns according to real-time data.
Try to avoid last-minute untested changes by implementing change freezes. Have a well-established protocol and authorization procedure for emergency/break-fix changes.
Have tested, ready-to-use standby resources available. Having proper backups of code, configurations and data is important so that you can quickly roll back to a previous state if required.
Make your application workflows independent of third-party services where possible. Would your service or site fail completely if a third-party service failed? Or are your systems written to bypass it and continue where possible? There will be plenty of cases where third-party services fail under peak traffic.
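The "bypass and continue" idea can be sketched as a small wrapper that returns a safe default when a non-critical third-party call fails. The names below are hypothetical, and the wrapper does not add a timeout of its own; it assumes the third-party client already enforces one and raises an exception when it expires.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T], fallback: T,
                       on_error: Callable[[Exception], None] = lambda exc: None) -> T:
    """Return primary() if it succeeds, otherwise report the error and return fallback."""
    try:
        return primary()
    except Exception as exc:
        # Report the failure (for example to logging or metrics), then degrade gracefully.
        on_error(exc)
        return fallback


def fetch_recommendations() -> list:
    # Hypothetical third-party call; here it simulates an outage.
    raise TimeoutError("recommendation service timed out")


if __name__ == "__main__":
    errors: list = []
    items = call_with_fallback(fetch_recommendations, fallback=[], on_error=errors.append)
    print(items, len(errors))
```

This pattern suits optional features such as recommendations or reviews; calls that the transaction cannot complete without (payment, for example) need a different strategy, such as a secondary provider.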
Have escalation protocols, communication templates, outage messages and support teams established in advance and on standby. Many of your onsite staff will be spending time with their families, so a remote team becomes very valuable during the holiday season.
Have your security operations centre team prepared and continuously monitoring your systems for DDoS attacks, data breaches and other incidents that attackers may attempt opportunistically during peak traffic. Again, a highly experienced remote SOC team can monitor the situation closely while your onsite engineers enjoy the holidays.
There are certain application settings, tweaks and non-default configurations that companies apply to cope with holiday traffic volumes. These are usually documented in a place like Confluence. Ensure those checklists are followed and verified.
Ensure that all incidents are carefully documented with timelines. In the rush to firefight, engineers might skip proper processes and procedures, and even cause bigger issues while desperately attempting a fix. So following properly established procedures and ITIL incident management flows is essential.
Author: Roshan Jayalath is a director at Bluecorp. He has close to 20 years of experience in building & leading large cross-functional IT teams across many industries. He is certified in AWS, Azure and OCI.