Software Development

Bottleneck #05: Resilience and Observability


Availability is the most important feature

— Mike Fisher, former CTO of Etsy

“I get knocked down, but I get up again…”

— Tubthumping, Chumbawumba

Every organization pays attention to resilience. The big question is
when.

Startups tend to only address resilience when their systems are already
down, often taking a very reactive approach. For a scaleup, excessive system
downtime represents a significant bottleneck to the organization, both from
the effort expended on restoring function and also from the impact of customer
dissatisfaction.

To move past this, resilience needs to be built into the business
objectives, which will influence the architecture, design, product
management, and even governance of business systems. In this article, we’ll
explore the Resilience and Observability Bottleneck: how you can recognize
it coming, how you might realize it has already arrived, and what you can do
to survive the bottleneck.

How did you get into the bottleneck?

One of the first goals of a startup is getting an initial product out
to market. Getting it in front of as many users as possible and receiving
feedback from them is typically the highest priority. If customers use
your product and see the unique value it delivers, your startup will carve
out market share and have a dependable revenue stream. However, getting
there often comes at a cost to the resilience of your product.

A startup may decide to skip automating recovery processes, because at
a small scale, the organization believes it can provide resilience through
the developers that know the system well. Incidents are handled in a
reactive nature, and resolutions come by hand. Possible solutions might be
spinning up another instance to handle increased load, or restarting a
service when it’s failing. Your first customers might even be aware of
your lack of true resilience as they experience system outages.

At one of our scaleup engagements, to get the system out to production
quickly, the client deprioritized health check mechanisms in the
cluster. The developers managed the startup process successfully for the
few times when it was necessary. For an important demo, it was decided to
spin up a new cluster so that there would be no externalities impacting
the system performance. Unfortunately, actively managing the status of all
the services running in the cluster was overlooked. The demo started
before the system was fully operational and an important component of the
system failed in front of prospective customers.

Fundamentally, your organization has made an explicit trade-off
prioritizing user-facing functionality over automating resilience,
gambling that the organization can recover from downtime through manual
intervention. The trade-off is likely acceptable as a startup while it’s
at a manageable scale. However, as you experience high growth rates and
transform from a
startup to a scaleup, the lack of resilience proves to be a scaling
bottleneck,
manifesting as an increasing occurrence of service
interruptions translating into more work on the Ops side of the DevOps
team’s responsibilities, reducing the productivity of teams. The impact
seems to appear suddenly, because the effect tends to be non-linear
relative to the growth of the customer base. What was recently manageable
is suddenly extremely impactful. Eventually, the scale of the system
creates manual work beyond the capacity of your team, which bubbles up to
affect the customer experiences. The combination of reduced productivity
and customer dissatisfaction leads to a bottleneck that is hard to
survive.

The question then is, how do I know if my product is about to hit a
scaling bottleneck? And further, if I know about those signs, how can I
avoid or keep pace with my scale? That is what we’ll look to answer as we
describe common challenges we’ve experienced with our clients and the
solutions we have seen to be most effective.

Signs you are approaching a scaling bottleneck

It’s always difficult to operate in an environment in which the scale
of the business is changing rapidly. Investing in handling high traffic
volumes too early is a waste of resources. Investing too late means your
customers are already feeling the effects of the scaling bottleneck.

To shift your operating model from reactive to proactive, you have to
be able to predict future behavior with a confidence level sufficient to
support important business decisions. Making data driven decisions is
always the goal. The key is to find the leading indicators which will
guide you to prepare for, and hopefully avoid the bottleneck, rather than
react to a bottleneck that has already occurred. Based on our experience,
we have found a set of indicators related to the common preconditions as
you approach this bottleneck.

Resilience is not a first class consideration

This may be the least obvious sign, but is arguably the most important.
Resilience is thought of as purely a technical problem and not a feature
of the product. It’s deprioritized for new features and enhancements. In
some cases, it’s not even a concern to be prioritized.

Here’s a quick test. Listen in on the different discussions that
occur within your teams, and note the context in which resilience is
discussed. You may find that it isn’t included as part of a standup, but
it does make its way into a developer meeting. When the development team isn’t
responsible for operations, resilience is effectively siloed away.
In those cases, pay close attention to how resilience is discussed.

Evidence of inadequate focus on resilience is often indirect. At one
client, we’ve seen it come in the form of technical debt cards that not
only aren’t prioritized, but become a constant growing list. At another
client, the operations team had their backlog filled purely with
customer incidents, the majority of which dealt with the system either
not being up or being unable to process requests. When resilience concerns
are not part of a team’s backlog and roadmap, you’ll have evidence that
it is not core to the product.

Solving resilience by hand (reactive manual resilience)

How your organization resolve service outages can be a key indicator
of whether your product can scaleup effectively or not. The characteristics
we describe here are fundamentally caused by a
lack of automation, resulting in excessive manual effort. Are service
outages resolved via restarts by developers? Under high load, is there
coordination required to scale compute instances?

In general, we find
these approaches don’t follow sustainable operational practices and are
brittle solutions for the next system outage. They include bandaid solutions
which alleviate a symptom, but never truly solve it in a way that allows
for future resilience.

Ownership of systems are not well defined

When your organization is moving quickly, developing new services and
capabilities, quite often key pieces of the service ecosystem, or even
the infrastructure, can become “orphaned” – without clear responsibility
for operations. As a result, production issues may remain unnoticed until
customers react. When they do occur, it takes longer to troubleshoot which
causes delays in resolving outages. Resolution is delayed while ping ponging issues
between teams in an effort to find the responsible party, wasting
everyone’s time as the issue bounces from team to team.

This problem is not unique to microservice environments. At one
engagement, we witnessed similar situations with a monolith architecture
lacking clear ownership for parts of the system. In this case, clarity
of ownership issues stemmed from a lack of clear system boundaries in a
“ball of mud” monolith.

Ignoring the reality of distributed systems

Part of developing effective systems is being able to define and use
abstractions that enable us to simplify a complex system to the point
that it actually fits in the developer’s head. This allows developers to
make decisions about the future changes necessary to deliver new value
and functionality to the business. However, as in all things, one can go
too far, not realizing that these simplifications are actually
assumptions hiding critical constraints which impact the system.
Riffing off the fallacies of distributed computing:

  • The network is not reliable.
  • Your system is affected by the speed of light. Latency is never zero.
  • Bandwidth is finite.
  • The network is not inherently secure.
  • Topology always changes, by design.
  • The network and your systems are heterogeneous. Different systems behave
    differently under load.
  • Your virtual machine will disappear when you least expect it, at exactly the
    wrong time.
  • Because people have access to a keyboard and mouse, mistakes will
    happen.
  • Your customers can (and will) take their next action in <
    500ms.

Very often, testing environments provide perfect world
conditions, which avoids violating these assumptions. Systems which
don’t account for (and test for) these real-world properties are
designed for a world in which nothing bad ever happens. As a result,
your system will exhibit unanticipated and seemingly non-deterministic
behavior as the system starts to violate the hidden assumptions. This
translates into poor performance for customers, and incredibly difficult
troubleshooting processes.

Not planning for potential traffic

Estimating future traffic volume is difficult, and we find that we
are wrong more often than we are right. Over-estimating traffic means
the organization is wasting effort designing for a reality that doesn’t
exist. Under-estimating traffic could be even more catastrophic. Unexpected
high traffic loads could happen for a variety of reasons, and a social media marketing
campaign which unexpectedly goes viral is a good example. Suddenly your
system can’t manage the incoming traffic, components start to fall over,
and everything grinds to a halt.

As a startup, you’re always looking to attract new customers and gain
additional market share. How and when that manifests can be incredibly
difficult to predict. At the scale of the internet, anything could happen,
and you should assume that it will.

Alerted via customer notifications

When customers are invested in your product and believe the issue is
resolvable, they might try to contact your support staff for
help. That may be through email, calling in, or opening a support
ticket. Service failures cause spikes in call volume or email traffic.
Your sales people may even be relaying these messages because
(potential) customers are telling them as well. And if service outages
affect strategic customers, your CEO might tell you directly (this may be
okay early on, but it’s certainly not a state you want to be in long term).

Customer communications will not always be clear and straightforward, but
rather will be based on a customer’s unique experience. If customer success staff
do not realize that these are indications of resilience problems,
they will proceed with business as usual and your engineering staff will
not receive the feedback. When they aren’t identified and managed
correctly, notifications may then turn non-verbal. For example, you may
suddenly find the rate at which customers are canceling subscriptions
increases.

When working with a small customer base, knowing about a problem
through your customers is “mostly” manageable, as they are fairly
forgiving (they are on this journey with you after all). However, as
your customer base grows, notifications will begin to pile up towards
an unmanageable state.

Figure 1:
Communication patterns as seen in an organization where customer notifications
are not managed well.

How do you get out of the bottleneck?

Once you have an outage, you want to recover as quickly as possible and
understand in detail why it happened, so you can improve your system and
ensure it never happens again.

Tackling the resilience of your products and services while in the bottleneck
can be difficult. Tactical solutions often mean you end up stuck in fire after fire.
However if it’s managed strategically, even while in the bottleneck, not
only can you relieve the pressure on your teams, but you can learn from past recovery
efforts to help manage through the hypergrowth stage and beyond.

The following five sections are effectively strategies your organization can implement.
We believe they flow in order and should be taken as a whole. However, depending
on your organization’s maturity, you may decide to leverage a subset of
strategies. Within each, we lay out several solutions that work towards it’s
respective strategy.

Ensure you have implemented basic resilience techniques

There are some basic techniques, ranging from architecture to
organization, that can improve your resiliency. They keep your product
in the right place, enabling your organization to scale effectively.

Use multiple zones within a region

For highly critical services (and their data), configure and enable
them to run across multiple zones. This should give a bump to your
system availability, and increase your resiliency in the case of
disruption (within a zone).

Specify appropriate computing instance types and specifications

Business critical services should have computing capacity
appropriately assigned to them. If services are required to run 24/7,
your infrastructure should reflect those requirements.

Match investment to critical service tiers

Many organizations manage investment by identifying critical
service tiers, with the understanding that not all business systems
share the same importance in terms of delivering customer experience
and supporting revenue. Identifying service tiers and associated
resilience outcomes informed by service level agreements (SLAs), paired with architecture and
design patterns that support the outcomes, provides helpful guardrails
and governance for your product development teams.

Clearly define owners across your entire system

Each service that exists within your system should have
well-defined owners. This information can be used to help direct issues
to the right place, and to people who can effectively resolve them.
Implementing a developer portal which provides a software services
catalog with clearly defined team ownership helps with internal
communication patterns.

Automate manual resilience processes (within a timebox)

Certain resilience problems that have been solved by hand can be
automated: actions like restarting a service, adding new instances or
restoring database backups. Many actions are easily automated or simply
require a configuration change within your cloud service provider.
While in the bottleneck, implementing these capabilities can give the
team the relief it needs, providing much needed breathing room and
time to solve the root cause(s).

Make sure to keep these implementations at their simplest and
timeboxed (couple of days at max). Bear in mind these started out as
bandaids, and automating them is just another (albeit better) type of
bandaid. Integrate these into your monitoring solution, allowing you
to remain aware of how frequently your system is automatically recovering and how long it
takes. At the same time, these metrics allow you to prioritize
moving away from reliance on these bandaid solutions and make your
whole system more robust.

Improve mean time to restore with observability and monitoring

To work your way out of a bottleneck, you need to understand your
current state so you can make effective decisions about where to invest.
If you want to be 5 nines, but have no sense of how many nines are
actually currently provided, then it’s hard to even know what path you
should be taking.

To know where you are, you need to invest in observability.
Observability allows you to be more proactive in timing investment in
resilience before it becomes unmanageable.

Centralize your logs to be viewable through a single interface

Aggregate logs from core services and systems to be available
through a central interface. This will keep them accessible to
multiple eyes easily and reduce troubleshooting efforts (potentially
improving mean time to recovery).

Define a clear structured format for log messages

Anyone who’s had to parse through aggregated log messages can tell
you that when multiple services follow differing log structures it’s
an incredible mess to find anything. Every service just ends up
speaking its own language, and only the original authors understand
the logs. Ideally, once those logs are aggregated, anyone from
developers to support teams should be able to understand the logs, no
matter their origin.

Structure the log messages using an organization-wide standardized
format. Most logging tools support a JSON format as a standard, which
enables the log message structure to contain metadata like timestamp,
severity, service and/or correlation-id. And with log management
services (through an observability platform), one can filter and search across these
properties to help debug bottleneck issues. To help make search more
efficient, prefer fewer log messages with more fields containing
pertinent information over many messages with a small number of
fields. The actual messages themselves may still be unique to a
specific service, but the attributes associated with the log message
are helpful to everyone.

Treat your log messages as a key piece of information that is
visible to more than just the developers that wrote them. Your support team can
become more effective when debugging initial customer queries, because
they can understand the structure they are viewing. If every service
can speak the same language, the barrier to provide support and
debugging assistance is removed.

Add observability that’s close to your customer experience

What gets measured gets managed.

— Peter Drucker

Though infrastructure metrics and service message logs are
useful, they are fairly low level and don’t provide any context of
the actual customer experience. On the other hand, customer
notifications are a direct indication of an issue, but they are
usually anecdotal and don’t provide much in terms of pattern (unless
you put in the work to find one).

Monitoring core business metrics enables teams to observe a
customer’s experience. Typically defined through the product’s
requirements and features, they provide high level context around
many customer experiences. These are metrics like completed
transactions, start and stop rate of a video, API usage or response
time metrics. Implicit metrics are also useful in measuring a
customer’s experiences, like frontend load time or search response
time. It’s crucial to match what is being observed directly
to how a customer is experiencing your product. Also
important to note, metrics aligned to the customer experience become
even more important in a B2B environment, where you might not have
the volume of data points necessary to be aware of customer issues
when only measuring individual components of a system.

At one client, services started to publish domain events that
were related to the product experience: events like added to cart,
failed to add to cart, transaction completed, payment approved, etc.
These events could then be picked up by an observability platform (like
Splunk, ELK or Datadog) and displayed on a dashboard, categorized and
analyzed even further. Errors could be captured and categorized, allowing
better problem solving on errors related to unexpected customer
experience.

Figure 2:
Example of what a dashboard focusing on the user experience could look like

Data gathered through core business metrics can help you understand
not only what might be failing, but where your system thresholds are and
how it manages when it’s outside of that. This gives further insight into
how you might get through the bottleneck.

Provide product status insight to customers using status indicators

It can be difficult to manage incoming customer inquiries of
different issues they are facing, with support services quickly finding
they are fighting fire after fire. Managing issue volume can be crucial
to a startup’s success, but within the bottleneck, you need to look for
systemic ways of reducing that traffic. The ability to divert call
traffic away from support will give some breathing room and a better chance to
solve the right problem.

Service status indicators can provide customers the information they are
seeking without having to reach out to support. This could come in
the form of public dashboards, email messages, or even tweets. These can
leverage backend service health and readiness checks, or a combination
of metrics to determine service availability, degradation, and outages.
During times of incidents, status indicators can provide a way of updating
many customers at once about your product’s status.

Building trust with your customers is just as important as creating a
reliable and resilient service. Providing methods for customers to understand
the services’ status and expected resolution timeframe helps build
confidence through transparency, while also giving the support staff
the space to problem-solve.

Figure 3:
Communication patterns within an organization that proactively manages how customers are notified.

Shift to explicit resilience business requirements

As a startup, new features are often considered more valuable
than technical debt, including any work related to resilience. And as stated
before, this certainly made sense initially. New features and
enhancements help keep customers and bring in new ones. The work to
provide new capabilities should, in theory, lead to an increase in
revenue.

This doesn’t necessarily hold true as your organization
grows and discovers new challenges to increasing revenue. Failures of
resilience are one source of such challenges. To move beyond this, there
needs to be a shift in how you value the resilience of your product.

Understand the costs of service failure

For a startup, the consequences of not hitting a revenue target
this ‘quarter’ might be different than for a scaleup or a mature
product. But as often happens, the initial “new features are more
valuable than technical debt” decision becomes a permanent fixture in the
organizational culture – whether the actual revenue impact is provable
or not; or even calculated. An aspect of the maturity needed when
moving from startup to scaleup is in the data-driven element of
decision-making. Is the organization tracking the value of every new
feature shipped? And is the organization analyzing the operational
investments as contributing to new revenue rather than just a
cost-center? And are the costs of an outage or recurring outages known
both in terms of wasted internal labor hours as well as lost revenue?
As a startup, in most of these regards, you’ve got nothing to lose.
But this is not true as you grow.

Therefore, it’s important to start analyzing the costs of service
failures as part of your overall product management and revenue
recognition value stream. Understanding your revenue “velocity” will
provide an easy way to quantify the direct cost-per-minute of
downtime. Tracking the costs to the team for everyone involved in an
outage incident, from customer support calls to developers to management
to public relations/marketing and even to sales, can be an eye-opening experience.
Add on the opportunity costs of dealing with an outage rather than
expanding customer outreach or delivering new features and the true
scope and impact of failures in resilience become apparent.

Manage resilience as a feature

Start treating resilience as more than just a technical
expectation. It’s a core feature that customers will come to expect.
And because they expect it, it should become a first class
consideration among other features. Part of this evolution is about shifting where the
responsibility lies. Instead of it being purely a responsibility for
tech, it’s one for product and the business. Multiple layers within
the organization will need to consider resilience a priority. This
demonstrates that resilience gets the same amount of attention that
any other feature would get.

Close collaboration between
the product and technology
is vital to make sure you’re able to
set the correct expectations across story definition, implementation
and communication to other parts of the organization. Resilience,
though a core feature, is still invisible to the customer (unlike new
features like additions to a UI or API). These two groups need to
collaborate to ensure resilience is prioritized appropriately and
implemented effectively.

The objective here is shifting resilience from being a reactionary
concern to a proactive one. And if your teams are able to be
proactive, you can also react more appropriately when something
significant is happening to your business.

Requirements should reflect realistic expectations

Knowing realistic expectations for resilience relative to
requirements and customer expectations is key to keeping your
engineering efforts cost effective. Different levels of resilience, as
measured by uptime and availability, have vastly different costs. The
cost difference between “three nines” and “four nines” of availability
(99.9% vs 99.99%) may be a factor of 10x.

It’s important to understand your customer requirements for each
business capability. Do you and your customers expect a 24x7x365
experience? Where are your customers
based? Are they local to a specific region or are they global?
Are they primarily consuming your service via mobile devices, or are
your customers integrated via your public API? For example, it is an
ineffective use of capital to provide 99.999% uptime on a service delivered via
mobile devices which only enjoy 99.9% uptime due to cell phone
reliability limits.

These are important questions to ask
when thinking about resilience, because you don’t want to pay for the
implementation of a level of resiliency that has no perceived customer
value. They also help to set and manage
expectations for the product being built, the team building and
maintaining it, the folks in your organization selling it and the
customers using it.

Feel out your problems first and avoid overengineering

If you’re solving resiliency problems by hand, your first instinct
might be to just automate it. Why not, right? Though it can help, it’s most
effective when the implementation is time-boxed to a very short period
(a couple of days at max). Spending more time will likely lead to
overengineering in an area that was actually just a symptom.
A large amount of time, energy and money will be invested into something that is
just another bandaid and most likely is not sustainable, or even worse,
causes its own set of second-order challenges.

Instead of going straight to a tactical solution, this is an
opportunity to really feel out your problem: Where do the fault lines
exist, what is your observability trying to tell you, and what design
choices correlate to these failures. You may be able to discover those
fault lines through stress, chaos or exploratory testing. Use this
opportunity to your advantage to discover other system stress points
and determine where you can get the largest value for your investment.

As your business grows and scales, it’s critical to re-evaluate
past decisions. What made sense during the startup phase may not get
you through the hypergrowth stages.

Leverage multiple techniques when gathering requirements

Gathering requirements for technically oriented features
can be difficult. Product managers or business analysts who are not
versed in the nomenclature of resilience can find it hard to
understand. This often translates into vague requirements like “Make x service
more resilient” or “100% uptime is our goal”. The requirements you define are as
important as the resulting implementations. There are many techniques
that can help us gather those requirements.

Try running a pre-mortem before writing requirements. In this
lightweight activity, individuals in different roles give their
perspectives about what they think could fail, or what is failing. A
pre-mortem provides valuable insights into how folks perceive
potential causes of failure, and the related costs. The ensuing
discussion helps prioritize things that need to be made resilient,
before any failure occurs. At a minimum, you can create new test
scenarios to further validate system resilience.

Another option is to write requirements alongside tech leads and
architecture SMEs. The responsibility to create an effective resilient system
is now shared amongst leaders on the team, and each can speak to
different aspects of the design.

These two techniques show that requirements gathering for
resilience features isn’t a single responsibility. It should be shared
across different roles within a team. Throughout every technique you
try, keep in mind who should be involved and the perspectives they bring.

Evolve your architecture and infrastructure to meet resiliency needs

For a startup, the design of the architecture is dictated by the
speed at which you can get to market. That often means the design that
worked at first can become a bottleneck in your transition to scaleup.
Your product’s resilience will ultimately come down to the technology
choices you make. It may mean examining your overall design and
architecture of the system and evolving it to meet the product
resilience needs. Much of what we spoke to earlier can help give you
data points and slack within the bottleneck. Within that space, you can
evolve the architecture and incorporate patterns that enable a truly
resilient product.

Broadly look at your architecture and determine appropriate trade-offs

Either implicitly or explicitly, when the initial architecture was
created, trade-offs were made. During the experimentation and gaining
traction phases of a startup, there is a high degree of focus on
getting something to market quickly, keeping development costs low,
and being able to easily modify or pivot product direction. The
trade-off is sacrificing the benefits of resilience
that would come from your ideal architecture.

Take an API backed by Functions as a Service (FaaS). This approach is a great way to
create something with little to no management of the infrastructure it
runs on, potentially ticking all three boxes of our focus area. On the
other hand, it’s limited based on the infrastructure it’s allowed to
run on, timing constraints of the service and the potential
communication complexity between many different functions. Though not
unachievable, the constraints of the architecture may make it
difficult or complex to achieve the resilience your product needs.

As the product and organization grows and matures, its constraints
also evolve. It’s important to acknowledge that early design decisions
may no longer be appropriate to the current operating environment, and
consequently new architectures and technologies need to be introduced.
If not addressed, the trade-offs made early on will only amplify the
bottleneck within the hypergrowth phase.

Enhance resilience with effective error recovery strategies

Data gathered from monitors can show where high failure
rates are coming from, be it third-party integrations, backed-up queues,
backoffs or others. This data can drive decisions on what are
appropriate recovery strategies to implement.

Use caching where appropriate

When retrieving information, caching strategies can help in two
ways. Primarily, they can be used to reduce the load on the service by
providing cached results for the same queries. Caching can also be
used as the fallback response when a backend service fails to return
successfully.

The trade-off is potentially serving stale data to customers, so
ensure that your use case is not sensitive to stale data. For example,
you wouldn’t want to use cached results for real-time stock price
queries.

Use default responses where appropriate

As an alternative to caching, which provides the last known
response for a query, it is possible to provide a static default value
when the backend service fails to return successfully. For example,
providing retail pricing as the fallback response for a pricing
discount service will do no harm if it is better to risk losing a sale
rather than risk losing money on a transaction.

Use retry strategies for mutation requests

Where a client is calling a service to effect a change in the data,
the use case may require a successful request before proceeding. In
this case, retrying the call may be appropriate in order to minimize
how often error management processes need to be employed.

There are some important trade-offs to consider. Retries without
delays risk causing a storm of requests which bring the whole system
down under the load. Using an exponential backoff delay mitigates the
risk of traffic load, but instead ties up connection sockets waiting
for a long-running request, which causes a different set of
failures.

Use idempotency to simplify error recovery

Clients implementing any type of retry strategy will potentially
generate multiple identical requests. Ensure the service can handle
multiple identical mutation requests, and can also handle resuming a
multi-step workflow from the point of failure.

Design business appropriate failure modes

In a system, failure is a given and your goal is to protect the end
user experience as much as possible. Specifically in cases that are
supported by downstream services, you may be able to anticipate
failures (through observability) and provide an alternative flow. Your
underlying services that leverage these integrations can be designed
with business appropriate failure modes.

Consider an ecommerce system supported by a microservice
architecture. Should downstream services supporting the ordering
function become overwhelmed, it would be more appropriate to
temporarily disable the order button and present a limited error
message to a customer. While this provides clear feedback to the user,
Product Managers concerned with sales conversions might instead allow
for orders to be captured and alert the customer to a delay in order
confirmation.

Failure modes should be embedded into upstream systems, so as to ensure
business continuity and customer satisfaction. Depending on your
architecture, this might involve your CDN or API gateway returning
cached responses if requests are overloading your subsystems. Or as
described above, your system might provide for an alternative path to
eventual consistency for specific failure modes. This is a far more
effective and customer focused approach than the presentation of a
generic error page that conveys ‘something has gone wrong’.

Resolve single points of failure

A single service can easily go from managing a single
responsibility of the product to multiple. For a startup, appending to
an existing service is often the simplest approach, as the
infrastructure and deployment path is already solved. However,
services can easily bloat and become a monolith, creating a point of
failure that can bring down many or all parts of the product. In cases
like this, you’ll need to understand ways to split up the architecture,
while also keeping the product as a whole functional.

At a fintech client, during a hyper-growth period, load
on their monolithic system would spike wildly. Due to the monolithic
nature, all of the functions were brought down simultaneously,
resulting in lost revenue and unhappy customers. The long-term
solution was to start splitting the monolith into several separate
services that could be scaled horizontally. In addition, they
introduced event queues, so transactions were never lost.

Implementing a microservice approach is not a simple and straightforward
task, and does take time and effort. Start by defining a domain that
requires a resiliency boost, and extract it’s capabilities piece by piece.
Roll out the new service, adjust infrastructure configuration as needed (increase
provisioned capacity, implement auto scaling, etc) and monitor it.
Ensure that the user journey hasn’t been affected, and resilience as
a whole has improved. Once stability is achieved, continue to iterate over
each capability in the domain. As noted in the client example, this is
also an opportunity to introduce architectural elements that help increase
the general resilience of your system. Event queues, circuit breakers, bulkheads and
anti-corruption layers are all useful architectural components that
increase the overall reliability of the system.

Continually optimize your resilience

It’s one thing to get through the bottleneck, it’s another to stay
out of it. As you grow, your system resiliency will be continually
tested. New features result in new pathways for increased system load.
Architectural changes introduces unknown system stability. Your
organization will need to stay ahead of what will eventually come. As it
matures and grows, so should your investment into resilience.

Regularly chaos test to validate system resilience

Chaos engineering is the bedrock of truly resilient products. The
core value is the ability to generate failure in ways that you might
never think of. And while that chaos is creating failures, running
through user scenarios at the same time helps to understand the user
experience. This can provide confidence that your system can withstand
unexpected chaos. At the same time, it identifies which user
experiences are impacted by system failures, giving context on what to
improve next.

Though you may feel more comfortable testing against a dev or QA
environment, the value of chaos testing comes from production or
production-like environments. The goal is to understand how resilient
the system is in the face of chaos. Early environments are (usually)
not provisioned with the same configurations found in production, thus
will not provide the confidence needed. Running a test like
this in production can be daunting, so make sure you have confidence in
your ability to restore service. This means the entire system can be
spun back up and data can be restored if needed, all through automation.

Start with small understandable scenarios that can give useful data.
As you gain experience and confidence, consider using your load/performance
tests to simulate users while you execute your chaos testing. Ensure teams and
stakeholders are aware that an experiment is about to be run, so they
are prepared to monitor (in case things go wrong). Frameworks like
Litmus or Gremlin can provide structure to chaos engineering. As
confidence and maturity in your resilience grows, you can start to run
experiments where teams are not alerted beforehand.

Recruit specialists with knowledge of resilience at scale

Hiring generalists when building and delivering an initial product
makes sense. Time and money are incredibly valuable, so having
generalists provides the flexibility to ensure you can get out to
market quickly and not eat away at the initial investment. However,
the teams have taken on more than they can handle and as your product
scales, what was once good enough is no longer the case. A slightly
unstable system that made it to market will continue to get more
unstable as you scale, because the skills required to manage it have
overtaken the skills of the existing team. In the same vein as
technical
debt
,
this can be a slippery slope and if not addressed, the problem will
continue to compound.

To sustain the resilience of your product, you’ll need to recruit
for that expertise to focus on that capability. Experts bring in a
fresh view on the system in place, along with their ability to
identify gaps and areas for improvement. Their past experiences can
have a two-fold effect on the team, providing much needed guidance in
areas that sorely need it, and a further investment in the growth of
your employees.

Always maintain or improve your reliability

In 2021, the State of Devops report expanded the fifth key metric from availability to reliability.
Under operational performance, it asserts a product’s ability to
retain its promises. Resilience ties directly into this, as it’s a
key business capability that can ensure your reliability.
With many organizations pushing more frequently to production,
there needs to be assurances that reliability remains the same or gets better.

With your observability and monitoring in place, ensure what it
tells you matches what your service level objectives (SLOs) state. With every deployment to
production, the monitors should not deviate from what your SLAs
guarantee. Certain deployment structures, like blue/green or canary
(to some extent), can help to validate the changes before being
released to a wide audience. Running tests effectively in production
can increase confidence that your agreements haven’t swayed and
resilience has remained the same or better.

Resilience and observability as your organization grows

Phase 1

Experimenting

Prototype solutions, with hyper focus on getting a product to market quickly

Phase 2

Getting Traction

Resilience and observability are manually implemented via developer intervention

Prioritization for solving resilience mainly comes from technical debt

Dashboards reflect low level services statistics like CPU and RAM

Majority of support issues come in via calls or text messages from customers

Phase 3

(Hyper) Growth

Resilience is a core feature delivered to customers, prioritized in the same vein as features

Observability is able to reflect the overall customer experience, reflected through dashboards and monitoring

Re-architect or recreate problematic services, improving the resilience in the process

Phase 4

Optimizing

Platforms evolve from internal facing services, productizing observability and compute environments

Run periodic chaos engineering exercises, with little to no notice

Augment teams with engineers that are versed in resilience at scale

Summary

As a scaleup, what determines your ability to effectively navigate the
hyper(growth) phase is in part tied to the resilience of your
product. The high growth rate starts to put pressure on a system that was
developed during the startup phase, and failure to address the resilience of
that system often results in a bottleneck.

To minimize risk, resilience needs to be treated as a first-class citizen.
The details may vary according to your context, but at a high level the
following considerations can be effective:

  • Resilience is a key feature of your product. It is no longer just a
    technical detail, but a key component that your customers will come to expect,
    shifting the company towards a proactive approach.
  • Build customer status indicators to help divert some support requests,
    allowing breathing room for your team to solve the important problems.
  • The customer experience should be reflected within your observability stack.
    Monitor core business metrics that reflect experiences your customers have.
  • Understand what your dashboards and monitors are telling you, to get a sense
    of what are the most critical areas to solve.
  • Evolve your architecture to meet your resiliency goals as you identify
    specific challenges. Initial designs may work at small scale but become
    increasingly limiting as you transition to a scaleup.
  • When architecting failure modes, find ways to fail that are friendly to the
    consumer, helping to ensure continuity and customer satisfaction.
  • Define realistic resilience expectations for your product, and understand the
    limitations with which it’s being served. Use this knowledge to provide your
    customers with effective SLAs and reasonable SLOs.
  • Optimize your resilience when you’re through the bottleneck. Make chaos
    engineering part of a regular practice or recruiting specialists.

Successfully incorporating these practices results in a future organization
where resilience is built into business objectives, across all dimensions of
people, process, and technology.