Opinion | How site reliability engineering can alter outsourcing models

Site reliability engineers combine the skills of software coders with the talent of those who operationalize systems for clients

Earlier this week, Mint reported that Tata Consultancy Services (TCS) is poised to become the third largest information technology (IT) services firm. This is an important milestone, not just for TCS but also for the Indian IT services sector in general. The firm’s continued strong growth is laudable.

But it’s important to make clear that TCS was compared with firms in its current peer group, such as IBM Global Services and Accenture. Some weeks ago, I had written about an orthogonal threat coming at IT service providers from other companies in the Big Tech ecosystem. One example is the entry of firms such as Amazon Web Services (AWS), Microsoft, Google, Intel and others into the “cloud" services space. Informal estimates of the quarterly revenue of these operations range from $6 billion (Intel) to $11 billion (AWS). That said, these estimates are handicapped by a twofold issue: first, money is fungible, so companies get to put their revenues into different internal buckets based on their own internal preferences and, second, these operations are internal to the Big Tech firms and are opaque, making it difficult for external evaluators to parse the numbers.

Revenue estimates aside, it is accepted that the Big Tech behemoths have taken over the “cloud" data centre and infrastructure management space. The class comprising the IBMs, Accentures and TCSes has been edged out of the business. This second set of firms has only the consolation prize of being able to migrate their customers’ infrastructure onto the platforms owned and run by the first set.

One might also imagine that after these software applications migrate to the cloud, the second set of IT service providers remains responsible for building and maintaining the applications, while the first set (the cloud infrastructure companies) solely manages the lowest levels of the stack—the hardware and base software. Indeed, reflecting this perception, TCS’s tagline for many of its advertisements is “experience certainty".

Even as recently as 10 years ago, before the cloud phenomenon took hold, outsourcing contracts with data centre providers used “availability" as the key metric. This metric was usually expressed as a percentage in a service level agreement, with contractual penalties for failure drawn up to police it. The usual expression was “2 nines" for 99%, “3 nines" for 99.9% and so on. All a data centre provider needed to do—cloud or otherwise—was to guarantee this uptime.

This differs from the “reliability" of the actual application. An application can still be unreliable, even if it runs in a data centre that is available at “4 nines" or 99.99% of the time. This unreliability of the application is the fault of the programmer, not the data centre.
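To see what these percentages mean in practice, here is a back-of-the-envelope sketch (my own illustration, not from the article) that converts an “N nines" availability target into the downtime it permits over a year:

```python
# Illustrative sketch: how much annual downtime an "N nines"
# availability target allows. Function name is my own.

def allowed_downtime_minutes_per_year(nines: int) -> float:
    """Minutes of downtime per year permitted at the given number of nines."""
    availability = 1 - 10 ** (-nines)   # e.g. 3 nines -> 0.999
    minutes_per_year = 365 * 24 * 60    # 525,600 minutes in a (non-leap) year
    return minutes_per_year * (1 - availability)

# Two nines (99%) permits about 5,256 minutes (~3.7 days) of downtime a
# year; four nines (99.99%) permits only about 53 minutes.
```

The gap between each step is tenfold, which is why every extra nine in a contract is dramatically harder—and costlier—to deliver.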

In the traditional model, where the development (dev) and operations (ops) teams were separate, the team that wrote the software code was not responsible for how it worked once customers started using it. The development team would “throw the code over the wall" to the operations team to install and support. This arrangement could lead to dysfunction, because the goals of the dev and ops teams were not the same: a developer wants customers to use the “latest and greatest" piece of code, but the operations team wants a steady system with as little change as possible. Ops engineers contended that any change can introduce instability, whereas a system left unchanged should continue to behave predictably (other factors being equal—too many concurrent users, for instance, can still crash an unchanged system).

This problem finally spawned the “DevOps" methods for software development that try to ensure that software engineers are responsible not just for the development and buildout of an application, but also for its ongoing operations. In other words, they need to work in tandem with the system’s operational environment, thereby guaranteeing not just availability, but also the reliability of their computer programmes.

But even DevOps can’t solve the problem entirely. Nowadays, the trend in software is to give every system an application programming interface. This turns every software product, even one built with DevOps, into a platform for other applications to use. If an unreliable application is written on top of a reliable platform, the reliability problem persists. Customers’ perceptions of how reliable their service is depend on more than just the availability that platform providers guarantee, since reliability is increasingly being driven by the quality of the software that customers bring to the platform.
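The arithmetic behind this point is simple. Assuming (as a simplification) that the platform and the application fail independently, the availability the end user experiences is the product of the two layers—a sketch of my own, not a formula from the article:

```python
# Illustrative sketch: end-to-end availability of an application running
# on a cloud platform, assuming the two layers fail independently.

def end_to_end_availability(platform: float, application: float) -> float:
    """The user sees the service as up only when both layers are up."""
    return platform * application

# A platform available 99.99% of the time, running an application that is
# reliable only 99% of the time, delivers roughly 98.99% to the end user.
```

The weaker layer dominates: no amount of extra nines from the platform provider can compensate for an unreliable application sitting on top of it.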

Google, which through its search engine provides a software platform for many others to write applications, recognized this issue early on and gave birth to a new breed of systems engineer called the “Site Reliability Engineer" or SRE. Google’s Ben Treynor Sloss coined the term in the early 2000s, defining it thus: “It’s what happens when you ask a software engineer to design an operations function." Even though the SRE role has become prevalent in recent years, many people don’t know what it is or does.

Google’s stance is that SREs should take joint operational responsibility and go on-call for the systems that customers build on its platforms. Site Reliability Engineering: How Google Runs Production Systems, written by a group of Google engineers, is widely considered the definitive text on the discipline. And yes, my speculation is that Google and the other Big Tech firms are building large SRE practices—which will eventually compete with the TCSes, IBMs and Accentures of the world.

The Big Tech cloud infrastructure providers are moving up the chain into the application development space.

Siddharth Pai is founder of Siana Capital, a venture fund management company focused on deep science and tech in India