06 Oct Facebook's Business Case for Knowledge Management
As you may have noticed, all of Facebook's services were down for about 6 hours on Monday, October 4th. At the time of writing, it is still unclear whether the cause was a (deliberate) human error or a technical configuration gone wrong. I will therefore not comment on preventive measures. The short business case I want to make is derived from a statement, apparently posted on Reddit by a Facebook employee while the crisis was being resolved.
“There are people now trying to gain access to the peering router to implement fixes, but the people with physical access is separate from the people with knowledge of how to actually authenticate to the systems and people who know what to actually do, so there is now a logistical challenge with getting all that knowledge unified.”
Around 18:00 (CEST) all Facebook services (Facebook, WhatsApp, and Instagram) stopped completely. Somehow Facebook's DNS servers became unreachable, so its domain names could no longer be resolved. I will not go into more technical details of the issue. Roughly 3.5 billion people were affected by this error. Facebook's own employees were hit especially hard: they couldn't enter their work environment or unlock meeting rooms and offices. To get the problem fixed, service engineers had to physically go to the data centers to reset the servers.
The KM Challenge
The Knowledge Management challenge was to muster someone with physical access to the data center, someone with the proper authentication rights to access the server, and someone with the knowledge of what to actually do. Now that’s a proper Knowledge Management Challenge. Questions that might lead to improvements:
- Could one of these people be removed from the critical path? (e.g. by giving physical access to someone who knows what to do)
- Did these people (or teams) know each other? Formally, or also personally / informally?
- Did these teams ever train or test the execution of their work with respect to their dependencies?
The combined services have an annual revenue of roughly 125 billion dollars. That's more than 340 million per day, and thus more than 85 million during the 6 hours of downtime. Another effect is on the stock price, which lost 7.5 percent, or 75 billion dollars in market capitalization!
The lost revenue will probably not be regained. There may even be an aftereffect as users and advertisers move to different platforms. The market capitalization, on the other hand, is likely to return to its normal trajectory. Therefore the absolute minimum cost for Facebook was 85 million dollars, which is about 14 million dollars per hour or roughly 240,000 dollars per minute.
Knowledge Management business case
For this event alone, an investment of 1 million dollars would have been justified if it had resolved the issue about 4 minutes sooner!
Even with limited insight into the issue and its resolution, we can already clearly see the added benefit of Knowledge Management. Specifically, social knowledge awareness (knowing who holds which knowledge and which access) would have greatly accelerated the resolution of these issues.
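The arithmetic behind these figures can be checked with a few lines of Python. This is only a rough sketch: it assumes revenue is spread evenly over a 365-day year, and the function names are made up for illustration.

```python
# Figures from the post; an even spread of revenue over the year is an assumption.
ANNUAL_REVENUE_USD = 125e9    # roughly 125 billion dollars per year
OUTAGE_HOURS = 6              # approximate duration of the outage
MINUTES_PER_YEAR = 365 * 24 * 60


def revenue_per_minute(annual_revenue: float) -> float:
    """Average revenue per minute, assuming an even spread over the year."""
    return annual_revenue / MINUTES_PER_YEAR


def outage_cost(annual_revenue: float, hours: float) -> float:
    """Revenue lost during an outage lasting the given number of hours."""
    return revenue_per_minute(annual_revenue) * hours * 60


def break_even_minutes(investment: float, annual_revenue: float) -> float:
    """Minutes of downtime an investment must save to pay for itself."""
    return investment / revenue_per_minute(annual_revenue)


per_minute = revenue_per_minute(ANNUAL_REVENUE_USD)
lost = outage_cost(ANNUAL_REVENUE_USD, OUTAGE_HOURS)
minutes_needed = break_even_minutes(1_000_000, ANNUAL_REVENUE_USD)

print(f"Revenue per day:    {per_minute * 60 * 24 / 1e6:,.0f} million dollars")
print(f"Revenue per hour:   {per_minute * 60 / 1e6:,.1f} million dollars")
print(f"Revenue per minute: {per_minute:,.0f} dollars")
print(f"Lost in {OUTAGE_HOURS} hours:    {lost / 1e6:,.1f} million dollars")
print(f"Break-even for 1 million dollars: {minutes_needed:.1f} minutes")
```

With these inputs, a minute of downtime costs just under 240,000 dollars, so a 1 million dollar investment pays for itself once it shortens the outage by a little over 4 minutes.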
Do you know who you need in your organization to resolve critical errors?
Have you ever trained with them, so that everyone knows what to do and how to do it?