Happy Patch Tuesday! Securing CN-NOS

Trust takes time to build and even longer to rebuild once it’s lost. Protecting our brands, both personal and corporate, is paramount to every operator and the companies they represent. At SnapRoute, our background as operators gives us a unique perspective on these challenges and lets us meet them head-on. We have had to respond to Common Vulnerabilities and Exposures (“CVEs”) that can expose serious flaws. We understand the constant battle of patching or upgrading code to keep applications secure and available. All too often, however, availability trumps security, and the result is the risk of an exploit that lands you in a top headline in the Wall Street Journal. Security should never be a second-class citizen to application availability, but to date monolithic NOS architectures have made it nearly impossible to treat the two as equals. From our perspective, the threat landscape has only become more serious, and the risk of losing that trust has grown as threats to our applications and infrastructure have become more frequent.

At SnapRoute we place availability and security on equal footing. While we have done our best to build CN-NOS to be secure from day one, we knew from experience that CN-NOS also had to be flexible enough to let us react to security threats quickly and easily. We would love to never ship code in which a security flaw is later discovered, but we all know that will never be the case, and any vendor that claims otherwise should be held suspect.

On February 11th, security vulnerability CVE-2019-5736, discovered by Adam Iwaniuk and Borys Poplawski, was publicly disclosed, just a single day before we announced CN-NOS to the world. It is a flaw in runC, the low-level container runtime used by Docker and by many Kubernetes deployments, that lets an attacker break out of a container and gain root privileges on the host, taking over the system and potentially compromising the integrity of production infrastructure. This is a frightening thought for anyone who has been tasked with finding and patching these issues in a production environment.
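To make the "finding" part concrete, here is a minimal sketch of a first-pass check an operator might script. It assumes the threshold stated in the public CVE description ("Docker before 18.09.2"); the function names are hypothetical and this is not how CN-NOS itself detects the issue. Distributions and older Docker release lines also shipped backported fixes, so a plain version comparison like this can flag a host that is in fact patched. Treat it as a prompt to look closer, not a verdict.

```python
import re

# Assumption: threshold taken from the public CVE-2019-5736 description
# ("Docker before 18.09.2"). Backported fixes are NOT accounted for.
FIRST_FIXED_DOCKER = (18, 9, 2)


def parse_version(text):
    """Extract the first major.minor.patch triple from a version string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def predates_fix(version_text):
    """Return True if the reported Docker version is older than 18.09.2."""
    return parse_version(version_text) < FIRST_FIXED_DOCKER
```

You could feed it the text printed by `docker --version` on each host and collect the hosts for which `predates_fix` returns True for manual follow-up.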

For a practical example, I’m reminded of my day-to-day struggles during 2014, when it was my responsibility to resolve the “Shellshock” vulnerability on our network gear. For those who don’t recall, Shellshock was a Bash flaw that allowed attackers to execute arbitrary commands they should never have been allowed to run. It was a massive problem that affected almost every Linux system. We had many war rooms and discussions on how to roll out the patch quickly and efficiently across all parts of the infrastructure, including network and compute. While I watched our compute teams roll out the fix in a few hours, on the network side we had to wait for the patch to be integrated into the vendor’s code, and then wait some more for the new version to become available. Once it was available, it took us almost a year to test, verify and roll out.

Now I know you’re asking yourself: how could it take a year to replace a shell that has nothing to do with the data or control plane components of a network switch? As those running networks know, it’s because these monolithic NOS architectures were never designed to take a patch live. They require a “rip and replace” update in which new images are loaded and the device is rebooted. All of this for just a shell fix. Our compute colleagues were dumbfounded when we described the situation to them.
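Going back to Shellshock itself for a moment: part of what made the situation so frustrating is how small the check was. The sketch below is a Python wrapper around the widely circulated probe, which places a function-style definition with a trailing command in an environment variable and starts Bash. A vulnerable Bash runs the trailing command while importing the variable; a patched one treats it as plain text. The function names and the marker strings are my own choices for illustration, and the probe should only be run on systems you are responsible for.

```python
import os
import subprocess

MARKER = "VULNERABLE"  # hypothetical marker string chosen for this sketch


def output_indicates_shellshock(stdout):
    """True if the marker was printed, i.e. Bash ran the injected command."""
    return MARKER in stdout.splitlines()


def probe_bash(bash_path="bash"):
    """Start Bash with a crafted environment variable and inspect its output."""
    env = dict(os.environ)
    # Function-style value followed by a trailing command (the classic probe).
    env["x"] = "() { :;}; echo " + MARKER
    result = subprocess.run(
        [bash_path, "-c", "echo probe-done"],
        env=env,
        capture_output=True,
        text=True,
        timeout=5,
    )
    return output_indicates_shellshock(result.stdout)
```

Calling `probe_bash()` on a host with a patched Bash should return False, since only the harmless `probe-done` line is printed. A check this small is exactly why a year-long rollout on the network side was so hard to explain to our compute colleagues.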

Well, that’s not how we work here at SnapRoute, and it is certainly not the architectural paradigm we have adhered to within CN-NOS. The containerized microservices architecture of CN-NOS allowed us to consume, integrate and roll out a patch for CVE-2019-5736 less than a day after the flaw was publicly announced. Let me repeat that so it’s clear: in less than 24 hours we had fixed, released and enabled our customers to roll out a patch for this massive security flaw. This is a real-time, real-life example of the power of the CN-NOS containerized architecture. Something I could never have imagined possible during my days as an operator is now part of standard day-to-day operations.

While I never cheer a security vulnerability or take pleasure in hearing about a company impacted by one, this CVE has helped validate the architectural decisions we made when designing and writing CN-NOS. Users of CN-NOS can breathe easier knowing that they are running a NOS architecture that delivers a turnaround on these scary security flaws in hours, instead of the months it takes with a monolithic NOS.