By Guy Warren, CEO of ITRS Group
What makes Google different from other businesses? It’s not the sleep pods, or lunch-time quidditch games. It’s not their ability to innovate new technologies.
It’s that, for the majority of us, our use of it is barely a consumer choice – particularly for Gen Zs, who grew up with Google almost as a supplementary parental figure, doling out immediate advice and solutions in response to their every question and whim. It’s as integral a part of our day-to-day infrastructure as the roads we walk down. It’s even one of the few brand names to make it into the Oxford English Dictionary as a verb.
Put simply, Google is always there. It’s omnipresent. This is both the key to its success and the weight it bears. It can never not be there, such is the dependence we now have on it.
For that reason, there is only one thing more amazing than inventing a platform like Google: maintaining it through more than two decades of exponential growth. And it has achieved this through what is arguably its greatest but most boring-sounding invention: Site Reliability Engineering (SRE) – a discipline for monitoring and running services that is now the gold standard of performance delivery for internet giants.
SRE involves tracking data and trends over a long lifespan to identify and quickly fix degrading performance levels, usually prompted by a particular change, well before the situation breaches the Service Level Agreement (SLA). It uses both Service Level Objectives (SLOs), internal targets set tighter than the SLA, and Service Level Indicators (SLIs), the measurements tracked against those targets, as a two-phase early warning system to ensure Google is never even close to breaching the minimum uptime promised.
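As a rough illustration of that two-phase idea, the sketch below computes an availability SLI from request counts and compares it first against an internal SLO and then against the contractual SLA. The names (`Targets`, `availability_sli`, `classify`) and the example thresholds are assumptions made for illustration, not Google's actual tooling or figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Targets:
    """Hypothetical availability targets, expressed as fractions of 1."""
    sla: float  # contractual minimum promised to customers
    slo: float  # stricter internal objective, set above the SLA


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: share of requests served successfully in the window."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing has failed
    return good_requests / total_requests


def classify(sli: float, targets: Targets) -> str:
    """Two-phase check: the SLO trips first, the SLA only if things get worse."""
    if sli < targets.sla:
        return "sla_breach"
    if sli < targets.slo:
        return "slo_warning"  # early warning: still inside the SLA
    return "ok"


if __name__ == "__main__":
    targets = Targets(sla=0.995, slo=0.999)  # illustrative values only
    for good in (9995, 9980, 9900):
        sli = availability_sli(good, 10_000)
        print(f"SLI={sli:.4f} -> {classify(sli, targets)}")
```

Because the SLO sits above the SLA, a team sees the `slo_warning` state and can roll back a change while the customer-facing commitment is still being met.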
It was born out of a commitment by Google 10 years ago to maintain operational resilience and provide customers with near-24/7 uptime by becoming the first major cloud provider to eliminate maintenance windows from its SLA. Since then, it has been making thousands of changes to its services every single day with barely a hiccup.
Yet, despite being largely to thank for Google’s longevity, SRE is not what makes Google famous. In the popular imagination, it is famous for its monolithic size and market share dominance, and for leading the world towards digital transformation.
These are the traits that less digitally native sectors like banking have historically tried to replicate, following Google down a road of rapid expansion without the essential safety net of SRE. This has led to unwieldy expansion of financial services firms’ IT estates, with new solutions tacked on to legacy systems, often rendering observability of transaction flow impossible and causing damaging IT failures.
Spurred by COVID-19, which has put digital availability front of mind, banks have spent the last 18 months finally switching on to the importance of SRE. Repeated global outages at major banks have shone a light on the cracks in legacy systems, and new oncoming regulations are set to mandate stronger commitments to individual SLAs and operational resilience.
For individual businesses, this will involve collecting and tracking a sufficient volume of data to gain much-needed visibility and to flag, as early as possible, when they might be heading towards a breach, particularly when updating existing services or onboarding new ones.
Google, of course, has the benefit of massive resources and an incredibly experienced team dedicated to the monitoring of this data. But smaller businesses should not discount themselves from such an approach. Third-party providers can support them in their shift to SRE, using threshold-based tools with coded alerts, as well as predictive analysis tools to anticipate downtime threats.
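To make the distinction between threshold-based alerts and predictive analysis concrete, here is a minimal sketch. It assumes an availability metric sampled at even intervals, where lower is worse; the function names and the straight-line trend fit are illustrative choices, not a description of any particular vendor's product.

```python
from typing import Optional, Sequence


def threshold_alert(value: float, threshold: float) -> bool:
    """Threshold-based alert: fires only once the metric is already too low."""
    return value < threshold


def steps_until_breach(samples: Sequence[float], threshold: float) -> Optional[float]:
    """Predictive check: fit a least-squares line to evenly spaced samples and
    estimate how many sampling intervals remain after the latest sample before
    the trend crosses the threshold. Returns None if the trend is flat or
    improving, or if there are too few samples to fit a line."""
    n = len(samples)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    slope = sxy / sxx
    if slope >= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - (n - 1))


if __name__ == "__main__":
    history = [100.0, 99.0, 98.0, 97.0]  # made-up availability percentages
    print(threshold_alert(history[-1], 95.0))
    print(steps_until_breach(history, 95.0))
```

In this made-up series the threshold alert stays silent because the latest value is still above the limit, while the trend estimate already indicates that a breach is approaching.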
With third-party support, businesses can improve and demonstrate operational resilience across the entire estate, achieving Google-level uptime without Google-level resources.