Thursday, July 19, 2007

redundancy

I've always said (and I picked it up from somewhere) that for every 9 you want to add to your uptime, multiply your initial cost by 9.
So if you have a 10mm USD data center and you want to go from 99% to 99.9% uptime, you need a 90mm data center infrastructure. Why? Well, let's take a look.
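(For scale, in q since that's what I live in: that one extra 9 is the difference between roughly 88 and 9 hours of downtime a year.)

8760*1-0.99     / 87.6 hours of allowed downtime per year at 99%
8760*1-0.999    / 8.76 hours per year at 99.9%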

You need backup generators and backup network lines, which means changing the data center architecture to support all of that: redundant gas lines for the generators, failover for power sources, etc. (it would have been cheaper to build it like that in the first place) - 8mm
You need double the hardware, and multiple components in all the hardware you have (dual NICs, etc.) - 7mm
You need a DR data center in another location, with the same setup - 15mm
You need a new high-speed line for the inter-center links, and a new line for the DR site (different CLEC) - 2mm
You need to upgrade the SAN with real-time LUN-level replication, and buy one for the DR site - 5mm
You need clustering on everything (Oracle, Sybase, Windows, Unix, Linux...), and custom-coded apps need to be rewritten for active failover - 25mm (includes services)
You need to hire staff to deal with active/active failover and 24x7 operations - 4mm
You need load balancing and/or failover network operations for inbound and outbound connections (data feeds, etc.) - 10mm (at least!)
Since you now have two different physical locations (hopefully not in the same state), you need new service contracts - 1mm

So that's 77mm more, or nearly 8x the original cost, and I'm sure I could spend the last few million on something I've forgotten.
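Keeping myself honest, the tally in q (figures straight from the list above):

costs:8 7 15 2 5 25 4 10 1   / the line items above, in mm USD
sum costs                    / 77
10+sum costs                 / 87mm all-in, against the 90mm rule-of-thumb target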

The best DR setup I've ever seen was at the Department of the Navy. Everything was a virtual machine. Live snapshots of the VMs were taken every hour or so. The snapshots were saved to an EMC SAN, which had real-time replication to 5 other locations. All locals replicated to all remotes. Every remote site had a small cluster of "failover machines". The network was designed by Cisco and everything could be automagically routed wherever it needed to go. So, the entire data center in VA gets blown up (or whatever). The alarm fires, the VMs are started at the primary failover site (NC), they come online, the routers do their thing (the DoD has the benefit of its own network) and voila: the data center has magically moved. Worst case loss: 1 hour. Failover time for the entire data center: 5-10 minutes (and no application restart, since they are hot snapshot loads). Beautiful.
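Back-of-envelope on that setup (my numbers, not the Navy's): data loss is capped at the snapshot interval, and a whole-site loss barely dents a three-9s downtime budget.

rto:10              / worst-case minutes to bring the VMs up at the failover site
525600*1-0.999      / ~526 minutes of downtime allowed per year at 99.9%
rto%525600*1-0.999  / one full-site failover burns roughly 2% of that yearly budget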

So, having built a couple of data centers in my day, coded many an application for active failover, and deployed clustering on every version of Windows since NT4 as well as on Red Hat, when a data center is down for, say, 7 hours, I think people should not only be fired, but any contractors should be sued. I'm not saying whose data center went down, but let's just say it was bad.
Oh, and kdb+ failover is trivial. Everything I do is in a pub/sub model, so aside from an extra machine in my data center, I push everything to my desktop. So when the lights went out, I still knew my positions.
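For the curious, a minimal sketch of the kind of pub/sub hookup I mean (not my production code; the port, table, and function names are placeholders). One process in the data center owns the positions table and pushes every change to its subscribers; the desktop is just another subscriber.

/ server side: runs in the data center, e.g. q pub.q -p 5010 (port assumed)
positions:([sym:`symbol$()] qty:`long$(); px:`float$())   / master copy, keyed by sym
subs:`int$()                                              / handles of connected subscribers
sub:{[x] subs,::.z.w; positions}                          / register the caller, hand back a snapshot
pub:{[s;q;p] `positions upsert (s;q;p); (neg subs)@\:(`upd;s;q;p)}  / apply locally, push async to all
.z.pc:{subs::subs except x}                               / forget a subscriber when its connection drops

/ desktop side: just another subscriber, outside the data center
upd:{[s;q;p] `positions upsert (s;q;p)}   / mirror each update into a local positions table
h:hopen `::5010                           / host and port are placeholders
positions:h(`sub;::)                      / subscribe and take the initial snapshot

When the link to the data center dies, the desktop copy just stops ticking; it still has everything published up to that point, which is how I knew my positions with the lights out.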
J
