The Hidden Costs of Bus Factor

Software Development has long left the “Wild West”; it is now a highly respectable field, with proven development methods and engineering practices. Like any other type of development, it has come up with metrics to measure progress, quality and risk. Mature development teams will actively track the number of issues, the health status of their services and the latest build status, but many still don’t track the Bus Factor. Bus Factor (BF), also known as Lottery Factor, Truck Factor and Circus Factor, is the number of key, irreplaceable people that exist in a project. If all of the key people leave the project or company, win the lottery, or are hit by a bus (hence the name Bus Factor), the project would be severely impacted and, in some cases, it could even come to a halt. A low BF is a silent illness that affects most development teams, yet its impact is underestimated, and it starts wreaking havoc long before someone leaves the team.

Types of Bus Factor

Organic

Organic Bus Factor occurs naturally over long periods of time as a project progresses. There must be no interference from any team members for it to be considered organic: no one is deliberately withholding knowledge. Although Organic Bus Factor tends to happen naturally, it does not mean it is not harmful. Organic Bus Factor can be further broken down into two categories: Survivor’s Bus Factor and Specialist’s Bus Factor.

Survivor’s Bus Factor

As programs are written and developers spend more time in a project, they become attuned to the needs and quirks of the project. Projects, however, rarely start and end with the same contributors. As people leave or join the team or project, knowledge starts entrenching itself in a smaller group of people. These individuals become the Survivors. They understand the quirks of the project because they are the authors or were involved early on. If the team does not find the time to document the knowledge that the Survivors have acquired, the Bus Factor will start shrinking and they will become key members of the team.

An example of Survivor’s Bus Factor: Winston and his team have begun Project A, which is a REST API in Python that makes use of C++ bindings to a mathematical library. Winston and 3 other team members built the Python bindings to the C++ library. As there are four team members that work on this module, there is less push for documentation and knowledge-transfer. After multiple years, Winston is the only team member left that worked on the original C++/Python bindings. When this module breaks or produces errors, the team pings him so he can fix them. The team is under time pressure to deliver new modules, therefore Winston has very little time left to write documentation. He wishes he could document the module so he could finally vacation without his laptop.

Specialist’s Bus Factor

Some projects deal with specialized areas that not all team members will be versed in. To successfully complete a project, they will require Specialists to join the team. Without an expert in the field that the project relates to, the project will grind to a halt. There tends to be a reduced pool of candidates when looking for Specialists, either because it’s a niche field or because they require higher salaries. Bus Factor in these situations tends to approximate or be equal to the number of Specialists in a project.

An example of Specialist’s Bus Factor: ACME company writes photography management software for schools. Schools use their software to store yearbook and event pictures. ACME now wants automated image search using Machine Learning. The entire team is made up of front-end developers, DBAs and back-end developers, so they hire Jessica, a Machine Learning Engineer who specializes in Computer Vision and Image Classification. She gathers datasets, trains ML models, writes MLOps pipelines, and deploys robust models. She also writes the search module for pictures. Schools love it, and more customers sign up because of this feature. Jessica is the only Machine Learning engineer in the company, so the Bus Factor for her search module is 1. Jessica does not like the responsibility of being the only person in charge of her module, but the company is struggling to find more specialists that fit their budget constraints.

Inorganic

Inorganic Bus Factor can only occur through deliberate actions by team members or through a company’s culture. It requires direct intervention and would not occur “naturally”. Being a key member of a project can provide perks, such as increased job security or better prospects of promotion. The impact of losing key members from a project can be harmful even in the mildest of cases, so companies try their best to entice them to stay. This type is called Inorganic because the knowledge silos either would not have occurred naturally or were deliberately and significantly accelerated. This includes willful actions such as not documenting key processes, not sharing information about specific areas of a project, preventing other team members from accessing key information, obfuscating key information (either documentation or code) to prevent others from obtaining it, and creating unnecessary layers of access to key information.

An example of Inorganic Bus Factor caused by a team Member: Franz has been in the company writing user management libraries for many years. He has always been terrified of sudden layoffs and wants to increase his job security at all costs. As he is the only surviving author of the user management libraries, he knows that no one else understands their inner workings. He does not actively write any documentation, and he keeps all notes regarding the libraries in his own workspace so no one else can read them. When other team members have trouble with the libraries, he fixes the problem himself and does not delegate. He feels safer being the sole expert in the whole project. The Bus Factor for the User Management libraries is 1.

Inorganic Bus Factor caused by a team Member is hard to root out, because “you do not know what you do not know” (also known as “Unknown Unknowns”). It is also hard to discern between willful and involuntary Bus Factor. Is the team Member actively attempting to maintain a low Bus Factor? Or are they genuinely too overloaded to deal with the issue? It is also hard to confront, because the key member holds the cards: They could threaten to leave and risk the integrity of the project.

Inorganic Bus Factor can also be caused by a company’s culture. All companies have one, and if a company is big enough, development culture may even differ between teams. Company culture can wreak as much havoc as deliberate actions by a team member, or even more, for example by mandating knowledge silos across teams and team members or by obfuscating key information behind layers of middle managers.

An example of Inorganic Bus Factor caused by a Company: ANNOY Inc. has a policy against documentation and code comments. They believe that unit tests are enough information to understand what a program does. They are also very strict about team responsibilities and cross-team collaboration. They state that teams have a single responsibility and should not encroach on any other team’s tasks. When new developers join a team, there is little to no documentation to understand the codebase. Senior developers need to find the time to explain how components interact with each other. Handovers when developers leave are a nightmare, as there is little written down. When there are issues in other teams’ programs, meetings need to be set up to find out what happened. Leave is often denied close to critical deployments for fear of not having key members available.

Although ANNOY Inc. did not introduce these policies to deliberately reduce Bus Factor, these policies interfered with organic processes. Teams would have naturally documented some of their processes and would have minimized the most critical knowledge silos. The natural process would not have been as efficient as a company-wide culture of documentation but would still be better than the current situation.

Bus Factor as a Measure of Efficiency

The name “Bus Factor” is misleading, because its effects can be felt before a key member leaves a project (or gets hit by a bus). Bus Factor is, in fact, about the bottlenecks caused by knowledge silos and the lack of redundancy for specialized team members. Project Management 101: A highly efficient project reaches its goal faster. A poorly run one takes longer. If a key member cannot contribute to the project, efficiency is lost.

Efficiency bottlenecks begin when a team member transitions to a key member, either via organic or inorganic processes. Often, these key members work on key tasks: tasks that only key members can work on, like fixing a complex CI/CD pipeline or training the company’s most important ML model. If such a task requires immediate attention, the key member will have to stop their current workload, do a costly context switch to the urgent task, and work on it. Other team members that have spare capacity or a smaller workload cannot take over the key task. If multiple key tasks require the attention of the key member, project efficiency is severely impacted. The key member’s workload then becomes a bottleneck in the development process even before the member has been hit by a bus. This is true even when there are multiple key members in the same area: If multiple key tasks arise, they risk becoming the bottleneck.
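The bottleneck can be made concrete with a toy scheduling sketch. Everything in it is an assumption made for illustration: the function name finish_time, the task sizes, and the greedy rule (key tasks first, each task to the least-loaded person allowed to take it). It is not a real planning tool, only a way to see how much the finish time depends on the number of key members.

```python
def finish_time(tasks, team_size, key_members):
    """Hours until every task is done, under a simple greedy assignment.

    tasks: list of (hours, is_key) tuples (hypothetical workload).
    The first `key_members` people in the team are the only ones who
    can take key tasks; non-key tasks can go to anyone.
    """
    if not 1 <= key_members <= team_size:
        raise ValueError("key_members must be between 1 and team_size")

    load = [0.0] * team_size  # hours already assigned to each person

    # Key tasks are placed first, longest first, then the non-key tasks.
    for hours, is_key in sorted(tasks, key=lambda t: (not t[1], -t[0])):
        eligible = range(key_members) if is_key else range(team_size)
        person = min(eligible, key=lambda i: load[i])  # least-loaded eligible person
        load[person] += hours

    return max(load)


# Three urgent key tasks and two ordinary ones, in a team of four.
workload = [(8, True), (8, True), (8, True), (4, False), (4, False)]

print(finish_time(workload, team_size=4, key_members=1))
print(finish_time(workload, team_size=4, key_members=2))
```

With a single key member, the three key tasks queue up behind one person while the rest of the team runs out of work; adding a second key member shortens the queue without adding any headcount.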

It is futile to talk about Project Management and efficiency if we do not discuss costs. A low Bus Factor can increase the cost of a project. In our simplest scenario the key member was required to work on an urgent key task. If the original task they were working on is not a key task, then the only new cost incurred is the time required to context switch for both the key member and the team member that took over the non-key task. If the original task is also a key task, then the new costs are the context switch for the new key task, and the delay in completion of the original task being worked on. In a more serious scenario, multiple key tasks that are urgent may arise and the key member will be required to prioritize them. This can easily cascade into missed deadlines or delays to downstream tasks.
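The two simple scenarios above can be written down as a back-of-the-envelope calculation. This is a minimal sketch under my own assumptions: the names Interruption and extra_cost_hours are made up, every person loses the same amount of time per context switch, and the cost of switching back afterwards is ignored.

```python
from dataclasses import dataclass


@dataclass
class Interruption:
    """One urgent key task landing on a key member (all values hypothetical)."""
    context_switch_hours: float   # time one person loses when switching tasks
    urgent_task_hours: float      # time needed to finish the urgent key task
    original_is_key_task: bool    # True if nobody else can take over the original task


def extra_cost_hours(event: Interruption) -> dict[str, float]:
    """Extra switching time and the delay imposed on the original task."""
    if event.original_is_key_task:
        # Nobody can take over: only the key member switches, and the
        # original task waits until the urgent one is finished.
        switching = event.context_switch_hours
        delay = event.context_switch_hours + event.urgent_task_hours
    else:
        # A teammate takes over the original task: two people switch
        # context, but the original task only waits for the handover.
        switching = 2 * event.context_switch_hours
        delay = event.context_switch_hours
    return {"switching_hours": switching, "original_task_delay_hours": delay}


print(extra_cost_hours(Interruption(1.0, 8.0, original_is_key_task=False)))
print(extra_cost_hours(Interruption(1.0, 8.0, original_is_key_task=True)))
```

The numbers are invented, but the shape of the result is the point: when the interrupted task is itself a key task, the delay grows with the size of the urgent task instead of staying at the cost of a handover.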

Efficiency is more severely impacted when a key member leaves a project or company. If the departure is expected, such as retirement, resignation with notice or transfer, damage control can reduce the impact. The most important step is knowledge transfer, where the key member passes any key knowledge on to other members of the team. Knowledge transfer is an additional cost to the project, because both the key member and the team members receiving the knowledge must stop their current work. Context switching must also be considered when team members take on the key tasks left by the key member. But not all knowledge transfers are straightforward, as the key member may not have enough resources in place to easily transfer the key knowledge. Extra documentation may have to be written, or in the worst-case scenario it may have to be written from scratch. Notes, recordings, minutes and other types of documentation will need to be gathered too, incurring extra costs to the project.

If the departure of the key member is sudden, like winning the lottery, an unfortunate bus accident, or joining a secretive cult, the impact could be catastrophic in the worst scenario. If documentation is available, team members will have to reconstruct their knowledge of the codebase from these sources. The status of in-progress, future and recurring tasks will need to be obtained. If documentation is lacking or not available, team members will have to scour multiple sources to reconstruct knowledge from scratch. Even with good documentation available, there is always a risk of missing an important task. The “unknown unknowns” could cause issues later in a project: Not knowing that a deployment script required urgent refactoring, or that database back-ups had to be run manually every Monday.

Other Side-effects

On-boarding

A low Bus Factor impacts the efficiency of on-boarding new team members. When new developers join the team and must learn a specific module, library, framework, etc., that is only understood by one or a few key members, time will be wasted. Instead of referring the new developer to documentation, the key members must block out some of their time and stop their current tasks. As previously mentioned, non-key members cannot take over this work even if they have spare capacity, courtesy of a low Bus Factor. Costs are higher still if the key member is overloaded or is paid more than other developers, both of which are more common when the Bus Factor is lower.

Observability

A low Bus Factor may come to light in difficult times. A program may crash, a script may raise strange exceptions, CI/CD pipelines may stop deploying, and only the key members may be able to solve this problem. If documentation is not available or is lacking, then the “emergency” may have to wait until a key member is available. Key members may also be the only ones with the expertise to prevent a catastrophic failure. For example, a key member of the Infrastructure Team provisions more resources for their Kubernetes Cluster when Christmas nears, because she knows that an increase in online shopping could crash the company’s online shop. If this key member is not available and a knowledge silo exists, a serious catastrophe will arise. Even worse, maybe only the key member knows how to bring back the online shop via specialized debugging techniques.

Upstream and Downstream

Key members will (hopefully) have a “big picture” view of a project. They would understand the downstream implications of failures in the project. They would also understand how upstream applications can affect the project. The key members would be able to advise on how changes in the project’s inputs and outputs could affect downstream projects, or how changes to the inputs or outputs of upstream projects may affect the project itself. If these insights are lost or temporarily unavailable, seemingly innocuous changes could cascade into a serious problem.

For example: The Data Warehouse Team runs data pipelines, which trigger a variety of ETLs. They read from upstream services managed by other Teams, transform the data with elegant statistics and store the results in their Data Warehouse powered by S3. Other teams then read from the Data Warehouse and use the data in their services. Upstream changes are discussed with the key members to ensure that downstream services are not affected, and key members also take great care to not make changes that significantly affect the data used by downstream services. When an upstream project wants to make some changes to its outputs, it liaises with the key members to ensure that no unexpected side-effects occur. The loss of a key member could mean that the “big picture” view is also lost, preventing the Data Warehouse Team from understanding how upstream sources, the project itself and downstream sources interact. Changes may be introduced that skew the data for downstream projects, and suddenly downstream services are providing garbage results. Even worse, what if one of the downstream services is of extreme value? By losing the key member, this information is also potentially lost, now endangering valuable downstream projects.

Increasing Bus Factor

The most serious effects of a low Bus Factor have been addressed, so now we can focus on what is on everyone’s mind: How do you fix and prevent this? First, it is important to understand that the different types of Bus Factor are not mutually exclusive. Although I have no fancy statistics to back this statement, in my experience Survivor’s Bus Factor tends to be the most common. It also rarely acts alone. Some mitigation strategies will increase all types of Bus Factors, but more targeted mitigations may be required depending on the situation.

Documentation

Documentation is the elephant in the room. It is effective at treating all types of Bus Factor. A commonly under-used resource, well-maintained documentation is the fastest way to increase Bus Factor. To be effective, documentation must be treated as much more than an explanation of how a library or program works and must cover most aspects of development. This includes, but is not limited to, provenance of data sources, library documentation, building instructions, deployment instructions, testing instructions, FAQs, manuals, on-boarding material, development notes, project-level documentation, requirements, maintainers and main contacts, network diagrams, and upstream and downstream dependencies, to mention a few. Documentation is the first line of defense against Survivor’s Bus Factor, and a serious improvement for Specialist’s Bus Factor, although it is not sufficient for the latter, as specialist redundancy is still important. Bus Factor is immediately increased once up-to-date and thorough documentation is available to members of a Project. Theoretically, most if not all team members would be able to follow the documentation.

Why is documentation often neglected? The pursuit of growth over long-term maintenance plagues fast-paced environments, such as start-ups. Documentation requires constant maintenance, taking away valuable man-hours that could be invested in more development. Akin to a botanic garden, it requires constant care and supervision to ensure the quality of the knowledge it shares. When a garden is not well maintained, it shows. Documentation is no different. Developers will recount their horrible experiences attempting to follow out-of-date documentation or struggling to convince their team to invest man-hours into maintaining it. Some documentation is always better than no documentation, but badly maintained documentation leaves a sour taste in the mouth of any developer who savors it. A team’s culture may also start relying on a dangerous trend: Tribal Knowledge.

Tribal Knowledge

Before our sprawling cities, busy towns and even quaint villages, humans would gather in small tribal groups. Entire civilizations did not yet inhabit great swathes of the world, and writing had not yet made it to daily lives (writing for most of the population is relatively young: the world was estimated to have a total literacy rate of about 60%-70% in the 1970s). These groups of people, or tribes, would share their knowledge via word-of-mouth: They would have told epic stories about their voyages, or their ancestors’ achievements, around a campfire while sharing a delicious meal. This led to some of history’s great stories and epics, such as the Epic Cycle (which includes the Iliad and the Odyssey).

Word-of-mouth, however, would have been extremely difficult even for long stories such as the Iliad and the Odyssey. It is also extremely difficult and unreliable for a project’s key knowledge, yet it is commonplace in development teams. Tribal Knowledge is any knowledge in a project that is passed down via inter-personal communication. The phrase “word of mouth” would have been appropriate, but in the day of enterprise instant messaging it no longer applies. Tribal knowledge is passed around via video-calls, chats, meetings, face-to-face communication or over lunch. It is not written down or easily accessible to the wider team. You can spot it easily: If a data pipeline breaks, and you ask around on how to fix it, you may be pointed to a specific person and told “They know how to fix it, send them a message”. If they decide you can fix the problem, they will pass down the information via a chat message, video call or face-to-face meeting instead of pointing you to documentation.

Tribal Knowledge is the biggest enabler of a low Bus Factor. Once entrenched in a team, it is difficult to eradicate. People are habitual animals, who tend to dislike disruption and change. Some teams will attest that their way of dealing with knowledge sharing is superior to documentation. They have become so accustomed to tribal knowledge that introducing documentation will seem like an added burden. Tribal knowledge is the opposite of documentation, its arch-nemesis. But tribal knowledge can also be written down: If I have notes on how to deploy one of our company’s services, but they are in my workstation or my personal cloud instead of a location that other team members can access too, they are tribal knowledge. To put it more succinctly: Any knowledge that cannot be found via company-wide or team-wide searches cannot be considered documentation. If I cannot find specific information by searching the company’s Wiki, Cloud, network drives, SharePoint, messaging platform or code repository, for example, it is not documented. It does not matter how well written, explained or up to date a document is if it cannot be found by other team members.

Redundancy

Site Reliability Engineers (SREs) and other people who work closely with infrastructure can attest: If you do not have redundancy, you are playing a stupid game. And as a rule of life, “If you play stupid games, you may win stupid prizes”. Lessons have been learnt by our Computer Science forebears, and solutions like RAID, redundant cloud services, automated backups, and many other tools exist. An experienced and reputable team will not deploy their important production program on a single machine; they would (hopefully) have redundant machines or other methods in place. The same rules can be applied to key members.

Another efficient antidote for Bus Factor is redundancy. This is much easier to achieve in the earlier stages of a project. Even if resources or specialization constrain a project to a single team member, it is important to find the time to include a second team member (or more) to review the work. If using version control systems like Git, PRs are the best way to begin encouraging redundancy. Other team members would be more encouraged to ask questions when reviewing someone else’s work and may even force the key members/authors to improve documentation (more on this in the Accountability section).

When projects can be staffed with more than a handful of people, ensuring that each sub-section of the project is developed by at least 2 people can greatly increase the Bus Factor. As people leave or enter the project, always make sure that team members are evenly distributed and urgently sent to sub-sections that are in danger of a low Bus Factor. Specialist’s Bus Factor may be a greater challenge, because the budget for higher salaries or access to rare specialists may not be available. Falling back to high quality, well-maintained documentation is a good strategy in cases like these. Redundancy also greatly reduces the impact of Inorganic Bus Factor, reducing the chances of willful Bus Factor taking place.
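The “at least 2 people per sub-section” rule is easy to check once a team has written down who understands what. The sketch below assumes such a mapping is maintained by hand; the function names at_risk and without, the sub-section names and the people in it are all hypothetical, and how the mapping is gathered is left open.

```python
def at_risk(knowledge: dict[str, set[str]], minimum: int = 2) -> list[str]:
    """Sub-sections understood by fewer than `minimum` people, sorted by name."""
    return sorted(s for s, people in knowledge.items() if len(people) < minimum)


def without(knowledge: dict[str, set[str]], person: str) -> dict[str, set[str]]:
    """The same mapping after `person` leaves; the input is not modified."""
    return {s: people - {person} for s, people in knowledge.items()}


# Hypothetical project: sub-section -> people who can work on it unaided.
project = {
    "bindings": {"Winston"},
    "rest-api": {"Winston", "Ana", "Li"},
    "deployment": {"Ana", "Li"},
}

# Sub-sections that already need a second person.
print(at_risk(project))

# Sub-sections that would need attention if Ana left the project.
print(at_risk(without(project, "Ana")))
```

Running the second check for every team member before they leave (or go on holiday) shows where people need to be sent first.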

Accountability

A great side effect of enforcing code reviews and peer practices like pair-programming is increased Accountability. Documentation can be lacking for lack of time or complacency; code could be lacking in clarity for the same reasons. During pair programming or Pull Request (PR) reviews, other team members can hold you accountable for your documentation and code. Examples of comments that could arise: “I do not understand what this line does, can you leave a comment?”, “Your function is doing too many things, break it up into multiple ones” or “The deployment documentation is not clear enough, where do I find the docker image and in what machine is it meant to be deployed?”. These comments will hold the author accountable, ensuring that the Bus Factor doesn’t decrease. If there are time or resource constraints, reviewers can force work to be scheduled in the future or prevent half-baked features or documentation from being deployed.

Another problem that peer-based practices solve is the curse of knowledge. Also known as the expert blind spot, the curse of knowledge occurs when someone spends so much time in a field and becomes so accustomed to it that they forget that other people lack that expertise. For example: You have spent the past few weeks trying to upgrade a Spark Cluster. During stand-up you are asked what the progress is, and you may say something like “It turns out that the AWS Hadoop library was expecting Hadoop 4.3.2, but we have Hadoop 4.2 installed. I will investigate changing it in our pom.xml, and after upgrading our Java version to 17, I will try again.”. If your team does not have experience in Spark and Java, they will stay quiet and dumbfounded. While you learnt all these concepts in the past few weeks, they haven’t, but you forgot that for a second. When you have code reviews or pair-programming, team members who are not in the same expertise bubble as you bring you back to reality. Comments like “Can you explain what Hadoop is?” or “Your explanation of the pom.xml changes is not clear to me” can remind key members to step into other members’ shoes for a second.

Bus Factor’s Natural Habitat

Documentation is an added burden to development, and we must have realistic expectations. We could maintain incredibly high bus factors by writing step-by-step documentation of all processes, but more time would be spent maintaining the documentation than developing a project. It is unrealistic to think that all team members will understand all aspects of a project. Even if we tried, developers have varying levels of expertise in different areas. Bus Factor is a natural process; it will never equal the number of developers in big projects. It may do so in smaller projects (excluding single developer projects), but even then, it’s rare.

Bus Factor is measured to minimize risks. Risk minimization strategies do not fully remove all risks; they just seek to reduce their impacts. A Bus Factor lower than the number of developers in a project will always exist, and to expect otherwise is to be unrealistic. There will always be Team Leaders who have a big picture view of multiple projects or specialists who make unique contributions to a project. There will always be one or more highly skilled team members that are involved in multiple aspects of a project, and who can provide insights or knowledge that others cannot. This is the Natural Habitat of the Bus Factor, where it lingers on a slightly risky but not deadly number. To expect that any Team member can replace a specialist, or a highly skilled developer, is foolish. But to expect that someone is so valuable, so knowledgeable, that it is worth risking the integrity of an entire project, is just as foolish.

Final Thoughts

We have discussed what Bus Factor is, how it can be categorized, how the factor slowly reduces, and how to mitigate it. Some strategies to detect it were discussed, but they required human intervention. I have yet to hear about a company or team that tracks the Bus Factor of their projects. If I ever hear of one, I doubt I will be told that they track it in real-time. Development teams should consider tracking the Bus Factor as part of the health metrics of their projects, but an objective measure is yet to arise. I am hoping to cover this in a future post, but until then, steer away from buses.