This is the first post in a series called Reliable, scalable and maintainable applications.

Reliability

Everybody has an intuitive idea of what it means for an application to be reliable:

  • the application performs the function that the user expected
  • it can tolerate the user making mistakes or using the software in unexpected ways
  • its performance is good enough for the required use case, under the expected load and data volume
  • the system prevents any unauthorized access and abuse

The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient. When a fault causes the system as a whole to stop working and providing the required service to the user, we call that a failure. For fault-tolerant systems it makes sense to continuously verify that they perform well and that faults do not turn into failures. One way to do this is to increase the rate of faults by triggering them deliberately - for example by randomly killing individual processes without warning. An example of this approach is Netflix's Chaos Monkey.
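
As a rough, minimal sketch of that idea (not how Chaos Monkey itself is implemented - this just spawns a handful of local worker processes and kills one at random, with made-up names like worker and inject_faults):

```python
import random
import time
from multiprocessing import Process


def worker(worker_id: int) -> None:
    """Stand-in for a real service process; a real one would be serving requests."""
    while True:
        time.sleep(1)


def inject_faults(workers, rounds: int, interval: float) -> None:
    """Deliberately raise the fault rate by killing one random worker per round."""
    for _ in range(rounds):
        time.sleep(interval)
        alive = [w for w in workers if w.is_alive()]
        if not alive:
            break
        victim = random.choice(alive)
        print(f"injecting fault: killing worker pid={victim.pid}")
        victim.kill()  # terminate abruptly, without warning
        # A real test would now check that the service as a whole still responds,
        # i.e. that this fault did not turn into a failure.


if __name__ == "__main__":
    workers = [Process(target=worker, args=(i,)) for i in range(5)]
    for w in workers:
        w.start()
    inject_faults(workers, rounds=3, interval=1.0)
    for w in workers:
        w.terminate()
        w.join()
```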

There are situations in which we may choose to sacrifice reliability in order to reduce development cost or operational cost - but we should be very conscious of when we are cutting corners.

Scalability

Even if a system is working reliably today, that doesn’t mean it will necessarily work reliably in the future. One common reason for degradation is increased load: maybe the system has grown from 10,000 concurrent users to 100,000 concurrent users, or from 1 million to 10 million. Perhaps it is processing much larger volumes of data than it did before.

Scalability is the term used to describe a system’s ability to cope with increased load. An important note is that it is meaningless to say “X is scalable” or “Y doesn’t scale”. Instead, scalability is discussed through questions like “if the system grows in a particular way, what are our options for coping with the growth?” or “how can we add computing resources to handle the additional load?”

Load

To be able to ask these questions, and then think about possible solutions, we first need to describe the current load on the system.

Load can be described with a few numbers which we call load parameters. The best choice of parameters depends on the architecture of your system - for example:

  • Web server: requests per second
  • Database: ratio of reads to writes
  • Chat room: number of simultaneously active users
  • Cache: hit rate
  • and more…
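
As a toy illustration, a couple of these parameters can be derived from a request log; the Request record and load_parameters helper below are made up for the example, not taken from any real system:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: float  # seconds since some arbitrary start
    kind: str         # "read" or "write" - a deliberate simplification


def load_parameters(log):
    """Derive a few example load parameters from a request log."""
    duration = max(r.timestamp for r in log) - min(r.timestamp for r in log) or 1.0
    kinds = Counter(r.kind for r in log)
    return {
        "requests_per_second": len(log) / duration,
        "read_write_ratio": kinds["read"] / max(kinds["write"], 1),
    }


# tiny synthetic log: 4 requests spread over 2 seconds, 3 reads to 1 write
log = [Request(0.0, "read"), Request(0.5, "write"), Request(1.0, "read"), Request(2.0, "read")]
print(load_parameters(log))  # {'requests_per_second': 2.0, 'read_write_ratio': 3.0}
```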

Performance

Once we have described the load on our system, we can investigate what happens when the load increases.

  • When you increase a load parameter and keep the system resources (CPU, memory, network bandwidth, etc.) unchanged, how is the performance of your system affected?
  • When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?

Both questions require performance numbers. For example:

  • in batch processing systems we usually care about throughput
    • number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size
  • in online systems, what is usually more important is the service’s response time
    • the time between a client sending a request and receiving a response (see the sketch just after this list)
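
A minimal client-side sketch of measuring that, using only Python's standard library (https://example.com/ is just a placeholder URL):

```python
import time
import urllib.request


def measure_response_time(url: str) -> float:
    """Client-observed response time for a single request, in seconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url) as response:
        response.read()  # include receiving the full body in the measurement
    return time.perf_counter() - start


# Even for the identical request, every try gives a slightly different number,
# so we collect a list of samples rather than a single value.
samples = [measure_response_time("https://example.com/") for _ in range(10)]
print(sorted(samples))
```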

Even if you only make the same request over and over again, you’ll get a slightly different response time on every try. We therefore need to think of response time not as a single number, but as a distribution of values that you can measure.

It’s common to see the average response time of a service reported. (Strictly speaking, the term “average” doesn’t refer to any particular formula, but in practice it is usually understood as the arithmetic mean: given n values, add up all the values, and divide by n.) However, the mean is not a very good metric if you want to know your “typical” response time, because it doesn’t tell you how many users actually experienced that delay.

Usually it is better to use percentiles. If you take your list of response times and sort it from fastest to slowest, then the median is the halfway point: for example, if your median response time is 200 ms, that means half your requests return in less than 200 ms, and half your requests take longer than that. The median is also known as the 50th percentile, and sometimes abbreviated as p50.

In order to figure out how bad your outliers are, you can look at higher percentiles: the 95th, 99th, and 99.9th percentiles are common (abbreviated p95, p99, and p999). They are the response time thresholds at which 95%, 99%, or 99.9% of requests are faster than that particular threshold.
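
Sticking with the sort-the-list definition described above, a minimal sketch of computing these percentiles from a batch of measured response times might look like this (real monitoring systems usually use approximating histograms rather than sorting every value):

```python
import math


def percentile(samples, p: float) -> float:
    """Nearest-rank percentile: smallest value such that at least p% of samples are <= it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based rank into the sorted list
    return ordered[rank - 1]


response_times_ms = [95, 110, 120, 130, 150, 180, 200, 250, 400, 1200]
for label, p in [("p50 (median)", 50), ("p95", 95), ("p99", 99), ("p999", 99.9)]:
    print(label, percentile(response_times_ms, p), "ms")
```

With only ten samples the higher percentiles all land on the slowest request (1200 ms), which is exactly the point: the tail of the distribution is dominated by a few slow outliers.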

Maintainability

It is well known that the majority of the cost of software is not in its initial development, but in its ongoing maintenance - fixing bugs, keeping the system operational, investigating failures, adapting it to new platforms…

Yet, unfortunately, many people working on software systems dislike maintenance of so-called legacy systems.

However, we can and should design software in such a way that it will hopefully minimize pain during maintenance, and thus avoid creating legacy software ourselves. To this end, we will pay particular attention to three design principles for software systems:

  • Operability
    • Make it easy for operations teams to keep the system running smoothly
  • Simplicity
    • Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system.
  • Evolvability
    • Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change