One of the challenges with taking an organization in a new direction is the old guard. That was one of the challenges that I faced at D&B and it is a challenge that I face now.
As I was getting ready to leave work today, I was drawn into a conversation about monitoring and alerting in this brave new world. Personally, I am against monolithic Network Operations Centers. In my experience, they prolong just as many problems as they actually help solve.
At D&B, we had two different ways to manage incidents since we had two different hosting models: the stuff that I managed and everything else. For the stuff that I managed, my team of system engineers was in an on-call rotation. Any alert that was triggered by our monitoring went (via PagerDuty) to the on-call admin, who would diagnose the issue, call in any additional support such as a DBA (as needed), and resolve it. I would wager that 99.9 percent of the time, the on-call person could resolve it within moments without customer impact (this is borne out by the nearly 1000 days of uptime I had when I left).
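The heart of that model is that an alert always lands on exactly one responsible person. A minimal sketch of that routing, assuming a simple weekly rotation (the names and functions here are illustrative, not PagerDuty's actual API):

```python
from datetime import date

def on_call(admins, today=None):
    """Pick this week's on-call admin by rotating through the list.

    `admins` is an ordered list of names; the rotation advances
    once per ISO calendar week.
    """
    today = today or date.today()
    week = today.isocalendar()[1]
    return admins[week % len(admins)]

def route_alert(alert, admins):
    """Send every alert to the single on-call admin, not a group."""
    responder = on_call(admins)
    return {"alert": alert, "page": responder}
```

The point of the sketch is the shape, not the scheduling: one alert, one owner, who then pulls in a DBA or anyone else as needed.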
Everything else went through the NOC managed by our hosting partner. Once an app alerted, they would call out every possible group associated with it: systems, DBA, network, application team, service manager, and so on. I'd like to say it was quick to get people on the call, but truthfully it usually took an hour or two because nobody knew enough about the overall architecture (which is a whole different problem).
At Blackbaud, I'm hoping to take what I had in place at D&B to the next level. Our move to the cloud has given us the ability to switch to cattle rather than pets. Long term, I hope to eliminate alerting (in the traditional sense) and instead trigger an action that terminates the offending server and spins up a new one. Sure, we get notified that something happened, and in some instances we might even quarantine the offending node, but we get production back up virtually instantaneously.
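To make the "replace, don't repair" idea concrete, here is a minimal sketch of what such a remediation handler could look like. Everything here is a stub for illustration: `quarantine_node`, `provision_replacement`, and the node dictionaries are assumptions, not any real cloud provider's API.

```python
def quarantine_node(node):
    """Pull the sick node out of the pool but keep it for diagnosis."""
    node["in_pool"] = False
    node["quarantined"] = True

def provision_replacement(image):
    """Spin up a fresh node from a known-good image."""
    return {"image": image, "healthy": True,
            "in_pool": True, "quarantined": False}

def handle_alert(pool, node, image="known-good-image"):
    """Instead of paging a human, swap the sick node for a fresh one.

    The alert becomes a notification of what already happened,
    not a request for someone to go fix something.
    """
    quarantine_node(node)      # keep it around in case we want to look later
    pool.remove(node)
    replacement = provision_replacement(image)
    pool.append(replacement)
    return replacement
```

Because the node is cattle, nothing on it is precious; the only decisions are whether to quarantine it for later inspection and which image to launch the replacement from.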
Whatever the end state is, it cannot rely on people interacting with it to get things working again. If we can build servers through automation and deploy them through automation, we need to be able to fix them through automation. At the end of the day, automation will fix them faster than even my old team at D&B could.