In the summer of 2003, a quiet crisis unfolded in the backbone networks that make up the core of the Internet. A software logic flaw discovered in Cisco's venerable IP routers threatened to halt IP traffic. The ubiquity of Cisco's IOS operating system amplified the significance of the issue: it is no exaggeration to say that most Internet traffic is eventually routed by a device running IOS.
Before going public with the problem, Cisco released a patch to backbone providers including Sprint, Level 3, and AT&T, who promptly took down routers across the globe for emergency maintenance. This was little consolation to the thousands of corporate, government, and university administrators who awoke to a CERT notice that signaled the race to patch before exploits inevitably became widely available. As a data center administrator and web service developer, I kept a close watch on the security newsgroups while a major financial institution prepared to take our network application live. Spurious network outages would have meant regrettable 7:00 AM phone calls and a tarnished release.
While the quick response by Cisco and the network providers averted widespread outages, the incident revealed the pitfall of relying on a single vendor for critical network operations. No single-vendor high availability scheme can reliably withstand the critical, and inevitable, logic flaws in modern network equipment. An IT or network manager responsible for critical networks is best served by deploying open, widely accepted protocols and interoperable devices from multiple vendors.
Logical failure versus physical failure
The crux of the problem lies in the different types of failures that are possible in network devices. Traditionally, the availability of a system has been considered the aggregate of hardware and software reliability, where the system resides in a closed environment. The common example is the uptime estimate quoted for high-end servers or mainframes by vendors such as IBM. This model is not sufficient for a massive public network such as the Internet or a campus WAN, which presents infinitely variable input to devices in the form of valid and invalid data packets. In these environments, the characteristics of physical and logical failures must be enumerated and understood by those purchasing, deploying, and administering network devices.
The anatomy of a physical failure
Physical failures occur when the properties of a physical device subsystem change in a way that prevents the device from operating properly. Just as automobiles are prone to faulty alternators or transmissions, network devices are prone to failing hard drives, power supplies, and I/O backplanes, to name a few. Problems can result from manufacturing flaws, wear on moving parts such as drive platters, overheating, or lack of maintenance.
The classic solution to physical failure is redundancy. By deploying multiple identical devices or subsystems, a second subsystem can take over when the first has failed. For instance, network switches are often available with dual power supplies, since the power supply has a higher probability of failure than other components in the device. When the main power supply fails, the backup takes over seamlessly and the administrator is notified that maintenance is required. This would be equivalent to having two alternators in your car, with the second taking over upon failure of the first; the driver would be notified and could take the car in for maintenance without becoming stranded. Such tactics are often employed in aerospace and military applications where reliability is critical.
Another example of physical redundancy common in network applications is the hot standby: a configuration in which an entire redundant system is running and ready to take over processing the moment the main system fails. Outside the computing world, hot standbys are used in professional road bicycle racing. Unlike in NASCAR, where a car must be repaired when a problem occurs, support vans follow the cyclists carrying entire replacement machines for racers whose bikes develop mechanical problems. When any component fails, the racer simply switches to a new bicycle. The same is true of hot standby network devices: when any component fails, the system fails over to a second or third identical device. It is common to run multiple firewalls or routers in such a configuration.
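The pattern is easy to express in miniature. The Python sketch below is an illustration only; the Device class and its is_healthy() probe are hypothetical stand-ins for whatever heartbeat or keepalive mechanism a real primary/standby pair would use.

    # Minimal sketch of a hot standby: an identical backup device waits,
    # ready to carry traffic the moment a health probe against the
    # primary fails. Device names and the probe are hypothetical.
    class Device:
        def __init__(self, name):
            self.name = name
            self.healthy = True

        def is_healthy(self):
            # A real device pair would exchange heartbeats or keepalives here.
            return self.healthy

    def active_device(primary, standby):
        """Return whichever device should be carrying traffic right now."""
        return primary if primary.is_healthy() else standby

    primary, standby = Device("router-a"), Device("router-b")
    print(active_device(primary, standby).name)   # router-a
    primary.healthy = False                       # a physical failure occurs
    print(active_device(primary, standby).name)   # router-b takes over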
The anatomy of a logical failure
Redundancy has reduced downtime due to physical failures to a statistical problem: what are the chances that n or more systems will fail simultaneously? Unfortunately, the same is not true of logical errors. Logical errors result from unforeseen flaws in the logic of the system's hardware or software, arising either from an incorrect interpretation of the specification or from simple programmer error. Statistical measures such as mean time between failures are not valid for determining the chances that logic errors will cripple your network infrastructure.
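To make the statistical point concrete, the short Python sketch below estimates the chance that several redundant devices are down at the same moment, assuming their failures are independent; the per-device probability is purely illustrative. It is precisely this independence assumption that a shared logic flaw violates.

    from math import comb

    def prob_at_least_n_down(k, n, p):
        """Probability that at least n of k independent devices are down
        at the same time, given each is down with probability p."""
        return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(n, k + 1))

    # With two redundant devices, each unavailable 0.1% of the time,
    # the chance that both are down at once is about one in a million.
    print(prob_at_least_n_down(k=2, n=2, p=0.001))   # 1e-06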
Let's revisit the racing analogy with respect to logical versus hardware failures. The failure properties of car tires resemble those of a hardware failure. Tires have a limited useful lifespan, and statistical analysis can be applied to determine the likelihood of failure of a specific tire. When a tire approaches the end of its useful life it can be replaced with an identical model, restoring the original performance.
Conversely, engineering flaws that affect the performance of the car cannot be rectified by replacing components. For instance, if a car is not well balanced and pulls to the left, switching to another car of the same model will not solve the engineering, or logical, problem; all cars with the flaw must be upgraded. This is the same problem network engineers face when dealing with logical errors. When a logical error arises and is being actively exploited, whether intentionally or unintentionally, all hardware with the error must be taken down for repair, including software and/or hardware upgrades.
Logical failures cannot be resolved by identical redundant systems
It is important to point out that using identical redundant systems does NOT significantly increase system availability when faced with logical errors. In fact, a logical failure can propagate to backup systems as they come online following the failure of the first system. Consider a problem similar to the Cisco situation mentioned at the beginning of this paper. Assume your organization has procured two identical Cisco Catalyst routers to improve the uptime of your network egress point. The first is live, and the second is a hot standby running HSRP (Hot Standby Router Protocol). Assume an error has been found in the current IOS version in which specifically crafted traffic causes the router to fail catastrophically, requiring a hard reboot to restore service. The problem has only recently been discovered, and no patch is available.
Code that exploits the logic error becomes available on the Internet; your first router fails as a result, and the standby comes online. For a brief period the network functions correctly, until the standby too is presented with the rogue data stream. In the best case, the first system will have been rebooted by then, and traffic will be restored until the rogue traffic is again presented to the router. In the worst case, the rogue traffic arrives at such a rate that neither router can remain active for any length of time, and a denial-of-service attack against your network has been successfully mounted. The administrator is forced to take down the network until a patch is available or the rogue traffic can be filtered. In either case the network experiences unscheduled downtime at the egress point, and depending on the importance of the traffic, this can come at significant cost to the organization, all while employing Cisco's HSRP redundancy options.
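The scenario reduces to a toy simulation. Everything in the Python sketch below is invented for illustration: the trigger packet and the "firmware" functions do not model real IOS behavior, but they show why an identical standby inherits the same logic flaw while a diversely implemented standby would not.

    TRIGGER = b"\xde\xad\xbe\xef"   # hypothetical malformed packet

    def firmware_vendor_a(packet):
        if packet == TRIGGER:
            raise RuntimeError("router crashed")   # the shared logic flaw
        return "forwarded"

    def firmware_vendor_b(packet):
        return "forwarded"   # independent code base, no such flaw

    def egress(packet, primary, standby):
        """Try the primary; fail over to the standby if it crashes."""
        for firmware in (primary, standby):
            try:
                return firmware(packet)
            except RuntimeError:
                continue
        return "network down"

    # Identical redundancy: both replicas run the same flawed firmware.
    print(egress(TRIGGER, firmware_vendor_a, firmware_vendor_a))   # network down

    # Diverse redundancy: the standby comes from a different vendor.
    print(egress(TRIGGER, firmware_vendor_a, firmware_vendor_b))   # forwarded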
To improve uptime, diversify network infrastructure vendors
Because of the combinatorial nature of computer systems, logical errors are inevitable. It is better to acknowledge their existence than to assume that even a name-brand vendor's systems are fault free. The best known tactic for improving the overall reliability of the logic subsystem is to combine multiple systems with different failure characteristics. This requires multiple teams, with limited interaction, implementing the same specification. A commonly cited example of this type of system is the Space Shuttle flight control system.
The Space Shuttle uses a complex voting arrangement comprising four diversely developed systems, each of which responds to the input data and performs the calculations. The results are then compared, and the most common answer is used. If an error exists in one of the systems performing a calculation, the other systems outvote the faulty device. The system is constructed in such a way that it can continue to operate, albeit with lower reliability, even if three of the four systems fail.
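A majority vote over diverse implementations is easy to sketch. The Python below is loosely inspired by the voting idea described above, not the Shuttle's actual implementation, which is far more involved.

    from collections import Counter

    def vote(results):
        """Return the most common answer among several diverse implementations."""
        return Counter(results).most_common(1)[0][0]

    # Three of the four independently developed systems agree;
    # the one with the logic error is simply outvoted.
    print(vote([42, 42, 41, 42]))   # 42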
As network managers and administrators, we are fortunate that our domain is built upon open standards. These standards allow devices from different vendors to interoperate. For instance, a request from a Windows desktop computer to a Linux web server can be routed over Cisco, HP, and Juniper routers. The open standards that make up the Internet act as the specification, and the engineering teams at HP, Cisco, Juniper, Microsoft, and other network device providers act as the multiple teams with limited interaction implementing that specification.
In other words, network administrators have the ability to build highly available networks on the concept of diverse redundancy.
The advantages of diverse network software have long been understood by the administrators of the Internet root nameservers, who run multiple independent DNS server implementations to improve the reliability of the system. While off-the-shelf protocols and devices cannot provide the same level of logical redundancy as custom aerospace applications, critical network points can be made far more reliable and secure by using interoperable devices from multiple vendors. For this reason, if you are procuring critical infrastructure, it is worthwhile to consider the interoperability of those components. I recommend avoiding proprietary protocols such as HSRP that prevent the deployment of a diverse network infrastructure.