This article is intended to give a compact overview of the book Designing Data-Intensive Applications by Martin Kleppmann. The importance of designing robust and reliable applications might be taken for granted by professionals in the tech industry; however, let’s open the conversation by giving some context to all non-technical readers.
During the last decade, we have seen various technological developments that have enabled companies to build platforms, such as social networks and search engines, that generate and manage unprecedented volumes of data. These massive amounts of data have made it imperative for businesses to focus on agility and short development cycles, along with hypothesis testing, to allow a quick response to emerging market trends and insights.
Moreover, open-source software has become very successful and is often preferred over commercial or in-house alternatives. Infrastructure as a service (IaaS) also plays a key role, as it makes it possible for small teams to build distributed systems across many machines and countries. Together, these developments have produced a reality in which applications with very high uptime are the norm and maintenance outages are practically unacceptable.
Coordinating a group of developers can be time-consuming and exhausting, even more so when balancing the priorities of internal and external stakeholders. Thus, to ensure the creation of good-quality applications that adapt easily to a quickly changing environment, a technically savvy leader should rely on a framework suited to the current ecosystem.
Each piece of software is unique and must be treated as such, but it is also true that there are foundations shared among most software systems. The foundations can be reduced to Reliability, Scalability, and Maintainability.
Undoubtedly, there will be times when our webpage, application, or software system fails. Even the most experienced programmer is prone to errors, as it is human nature to be imperfect. Faults can also originate in hardware or in the software itself. Regardless of the source of the fault, a system should continue to perform at the desired level even in adversity.
As indicated by Kleppmann, a reliable system:
• Performs the function that the user expected.
• Can tolerate the user making mistakes or using the software in unexpected ways.
• Performs well enough for the required use case, under the expected load and data volume.
• Prevents any unauthorized access and abuse.
Without going too deep into the types of errors, let’s briefly illustrate and compare human, hardware, and software faults.
Hardware faults are rare on any single machine, but they become routine in large data centers with many machines. Hard disks are reported to have a mean time to failure (MTTF) of approximately 10 to 50 years. To put that in perspective, a storage cluster with 10,000 disks should expect, on average, about one disk to die per day.
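The disk-failure estimate above can be checked with a few lines of arithmetic. The sketch below assumes that disks fail independently and at a constant rate, and that a year has 365 days; the function name `expected_failures_per_day` is hypothetical and chosen only for illustration.

```python
def expected_failures_per_day(num_disks, mttf_years):
    """Expected disk failures per day in a cluster.

    Assumes independent failures at a constant rate, so each disk
    fails on average once every `mttf_years` years.
    """
    days_per_year = 365
    return num_disks / (mttf_years * days_per_year)


# Check both ends of the 10-to-50-year MTTF range for 10,000 disks.
for mttf in (10, 50):
    rate = expected_failures_per_day(10_000, mttf)
    print(f"MTTF {mttf} years: about {rate:.2f} failures per day")

# MTTF 10 years: about 2.74 failures per day
# MTTF 50 years: about 0.55 failures per day
```

Under these assumptions, the two ends of the range bracket the figure of one disk per day, so the estimate is best read as an order of magnitude rather than an exact rate.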
Common hardware errors include:
• Faulty RAM
• Power grid blackout
• Hard disk crashes
• Someone unplugging the wrong network cable.
Probably the trickiest type of fault is the software fault. If code is not tested properly, bugs can lie dormant for a long time without causing much trouble. That is, of course, until they are triggered by an unusual set of circumstances. In such cases, the problem is often exposed because the software was making an assumption about its environment that was usually valid but eventually stopped being true for some reason.
Here are a few scenarios in which a bug could suddenly become apparent, causing both surprise and headaches for the engineers involved:
• A software bug that crashes an application server when it is given a particular bad input. A well-known case occurred on June 30, 2012, when a leap second caused several applications to hang simultaneously due to a bug in the Linux kernel.
• A runaway process that uses up some shared resource — CPU time, memory, disk space, or network bandwidth.
• An essential service for the system that slows down, becomes unresponsive, or starts returning corrupted responses.
• Cascading failures, where a minor defect in one component triggers an error in another component, which in turn triggers further faults.
It is common for widely adopted systems, such as social networks and search engines, to grow in data volume, complexity, or traffic volume. The product should therefore be able to handle this growth through a scalable architecture.
That doesn’t mean all applications should share the same architecture, since each system has its own particular requirements. Instead, a scalable architecture is one built from general-purpose building blocks arranged in familiar patterns.
“People often talk of a dichotomy between scaling up (vertical scaling, moving to a more powerful machine) and scaling out (horizontal scaling, distributing the load across multiple smaller machines).”
(Kleppmann, 2017)
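To make "distributing the load across multiple smaller machines" concrete, here is a minimal sketch of one possible approach: assigning each key to a machine by hashing it. This is an illustrative assumption, not a technique prescribed by the book summary above; the function `assign_machine`, the `user-N` keys, and the choice of four machines are all hypothetical.

```python
import hashlib
from collections import Counter


def assign_machine(key, num_machines):
    """Map a key to a machine index in [0, num_machines).

    Uses SHA-256 so the assignment is stable across runs
    (Python's built-in hash() is randomized per process for strings).
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_machines


# Hypothetical workload: 10,000 user keys spread over 4 machines.
keys = [f"user-{i}" for i in range(10_000)]
load = Counter(assign_machine(key, 4) for key in keys)

for machine in sorted(load):
    print(f"machine {machine}: {load[machine]} keys")
```

A simple modulo scheme like this spreads keys fairly evenly, but changing the number of machines reassigns most keys, which is one reason real systems often use more elaborate partitioning schemes.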
The primary purpose of considering scalability during application design is to maintain good performance as the number of users grows. The load on an application can be described with a few numbers called load parameters. The best choice of parameters depends on the architecture of the system; examples include:
- The number of requests per second to a web server
- The ratio of reads to writes in a database
- The number of simultaneously active users in a chat room
- The hit rate on a cache.
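As a rough illustration, several of these load parameters can be computed from a request log. The sketch below assumes a hypothetical in-memory log in which each request records whether it was a write and whether it was served from cache; the `Request` type, its field names, and the `load_parameters` function are invented for this example. The number of simultaneously active users is omitted because it would need session data.

```python
from dataclasses import dataclass


@dataclass
class Request:
    is_write: bool   # True for a write, False for a read
    cache_hit: bool  # True if the response came from the cache


def load_parameters(requests, window_seconds):
    """Summarize a list of requests observed over `window_seconds`."""
    if not requests or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")
    writes = sum(r.is_write for r in requests)
    reads = len(requests) - writes
    hits = sum(r.cache_hit for r in requests)
    return {
        "requests_per_second": len(requests) / window_seconds,
        # Ratio is infinite when the window contains no writes at all.
        "read_write_ratio": reads / writes if writes else float("inf"),
        # Simplification: hit rate is taken over all requests, not only reads.
        "cache_hit_rate": hits / len(requests),
    }


# Hypothetical sample: 4 requests observed during a 2-second window.
sample = [
    Request(is_write=False, cache_hit=True),
    Request(is_write=False, cache_hit=False),
    Request(is_write=True, cache_hit=False),
    Request(is_write=False, cache_hit=True),
]
params = load_parameters(sample, window_seconds=2.0)
print(params)
# {'requests_per_second': 2.0, 'read_write_ratio': 3.0, 'cache_hit_rate': 0.5}
```

Which of these numbers matters most depends on the system: a read-heavy service may care mainly about the cache hit rate, while a write-heavy one may be limited by the write volume.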
Some businesses and consulting firms have higher employee turnover than others, but in almost every case, different people will work on the system over time. During the development and upgrading of an app, the main goal becomes maintaining current behavior while adapting the system to new use cases.
“We can and should design software in such a way that it will hopefully minimize pain during maintenance, and thus avoid creating legacy software ourselves.”
(Kleppmann, 2017)
In fact, there are specific design principles that we can focus on to make system maintenance simple and avoid the creation of legacy software:
- Operability: make it easy for operations teams to keep the system running smoothly.
- Simplicity: make it easy for new engineers to understand the system by removing as much complexity as possible. This is not the same as simplifying the user interface.
- Evolvability: make it easy for engineers to change the system in the future, adapting it to unanticipated use cases as requirements change. Also known as extensibility, modifiability, or plasticity.
Why are software engineers still needed despite advances in automation and artificial intelligence? Because (unfortunately) there are no easy, fixed solutions for making applications reliable, scalable, or maintainable. However, decades of software development have shown that certain patterns and techniques keep appearing across applications, even when their objectives and uses are entirely different.
If your goal is to build data-intensive systems, you should definitely look more in-depth into Martin Kleppmann’s book Designing Data-Intensive Applications. You will find a comprehensive analysis of the characteristics shared among data systems that thrive and how they work toward those goals.