BCP

Christchurch Earthquake

At 4:35am on Saturday 4th September, Canterbury was struck by New Zealand's most destructive earthquake since the 1931 Napier earthquake (GNS Press Release or follow #tenz on Twitter). Many of the city's older buildings were either totally destroyed or suffered significant structural damage. The Central Business District was particularly hard hit, with many buildings unsafe. Widespread power outages, loss of water and broken sewerage also affected a large part of the city.

A week earlier I had hosted a session on Disaster Recovery at Code Camp in Auckland. Time to put those words into practice and help get our clients up and running again. Our clients are largely concentrated in Christchurch and surrounding towns.

Scale of the problem:

Extended electricity outages meant that the UPSes at most of our client sites initiated graceful shutdowns, so powering on again should be a relatively straightforward process. The violent shaking could have resulted in disk failures, as well as unsecured items moving around. It is also possible that building damage or broken water pipes could have affected computer systems. In some cases buildings are inaccessible or are still without power.

Communication failure:

We were very fortunate that both the Telecom and Vodafone mobile networks were available in a large part of the city, allowing communication. Twitter was particularly lively until the message spread that batteries in cell towers were running low and mobile phone use should be minimised. National Radio should be commended for providing an excellent service; if only most people still had a radio with batteries!

Once electricity was restored in some areas of the city, the team of IT pros I work with got cracking testing VPNs to identify which client sites were up. This was a particularly effective way to identify who needed help and to set some priorities.
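
To give an idea of what that kind of check boils down to, here is a minimal sketch in Python: it simply attempts a TCP connection to each client site's VPN endpoint and reports which sites respond. The host names and port are hypothetical placeholders, not real client details, and in practice we used our normal VPN client configurations rather than a script like this.

    # Minimal sketch: check which client sites respond on their VPN endpoint.
    # Host names and the port below are hypothetical placeholders.
    import socket

    CLIENT_SITES = {
        "Client A": "vpn.client-a.example.co.nz",
        "Client B": "vpn.client-b.example.co.nz",
        "Client C": "vpn.client-c.example.co.nz",
    }

    def site_is_up(host, port=443, timeout=5):
        """Return True if a TCP connection to the VPN endpoint succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    for name, host in sorted(CLIENT_SITES.items()):
        status = "UP" if site_is_up(host) else "DOWN (no power or no connectivity?)"
        print(f"{name:<10} {host:<35} {status}")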

Direct communication with clients via phone is best in this type of situation, as many other communication methods are not reliable. It was a good thing most of us had client contact numbers stored in our mobiles (and that our phones were charged).

Hardware failure:

One of the key points I raised during my Disaster Recovery presentation was that virtualisation is your friend in a disaster. I was unlucky enough to have one client with a critical server that would not boot (no BIOS screen). Fortunately the site had existing virtual infrastructure with enough disk space to migrate the server. Rather than waiting to resolve the issue through the warranty process, I decided that a P2V (physical-to-virtual) migration was the best method. To do this I needed a second server with the same hardware: I simply swapped the disks over to the working server, booted it and ran the P2V process. Once the migrated server started, a couple of tweaks to the network settings had everything running happily. About 1.5 hours later the system had been migrated and the client was back in business. The server warranty was next business day, so the result here was much better than the client would have received if that path had been followed.

Problems:

While the vast majority of our client sites simply required power to kick back into life, we did strike a few issues. A short list of things we found on the first day back in the office:

  • Servers and racks moved
  • Blown circuit breakers
  • Dismounted Exchange Information Store
  • Hyper-V host server that would not start Virtual Machines
  • Domain Controller that shut down partway through the boot process
  • Various cabling issues (due to movement)
  • Printer issues
  • PCs that had moved (pulling out cables)

All of these issues were solved today and I am sure a few lessons were learnt. I think we will find that this is really just the beginning, as many clients are still not operating and others have asked staff to stay away for a couple of days.

Teamwork:

IT pros all seem to share a common attribute: we respond. IT pros will drop everything, often with little notice or regard for their own personal circumstances, to respond to clients in need. I witnessed this first hand from the guys I work with. We are a self-organising bunch and it really showed over the past couple of days. I am sure this isn't limited to my team, but applies to IT pros all over the Canterbury region. Keep up the good work everyone!

 


VMware Site Recovery Manager overview

VMware Site Recovery Manager (SRM) is a disaster recovery technology that allows VMware ESX environments to be replicated to a secondary site. The ability to move protected virtual servers between sites quickly and easily takes away a lot of the difficulty associated with implementing a Disaster Recovery (DR) solution. SRM does require a significant investment in hardware and high-performance links between sites (fibre is recommended), making it a solution best suited to larger organisations.

SRM leverages SAN-to-SAN replication technology to keep up-to-date copies of the production virtual servers at the recovery site. Any changes made to production servers are replicated in real time to the recovery site. The recovery site has VMware host servers with virtual servers in a shutdown state; in the event of a failure at the production site, these servers are started (manually or automatically). SRM uses plug-ins to manage the underlying SAN storage environment, simplifying management of the total solution.

Testing and validation of the recovery site is often one of the most complex and difficult parts of managing a DR site. One of the best features of SRM is its testing functionality, which allows the recovery site to be tested without shutting down the production environment. VLANs are used to isolate the recovery site during the test. This lowers the risks and costs associated with testing the site.

Recovery time is essentially the time taken to boot up the recovery site. Multiple protection groups can be created and started in a predefined order. Within a protection group, servers can be given priorities, e.g. Active Directory starts before the Exchange servers, which start before Citrix and BlackBerry.
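
To illustrate the idea of ordered startup (this is just a conceptual sketch, not SRM's actual API, and the group and server names are made up), a recovery plan is essentially a list of protection groups started in order, with the servers inside each group booted by priority:

    # Conceptual sketch of priority-ordered recovery, not SRM's actual API.
    # Group and server names are hypothetical examples.
    from dataclasses import dataclass, field

    @dataclass
    class ProtectionGroup:
        name: str
        servers: list = field(default_factory=list)  # (priority, server) pairs

        def boot_order(self):
            # Lower priority number boots first within the group.
            return [server for _, server in sorted(self.servers)]

    recovery_plan = [
        ProtectionGroup("Core infrastructure", [(1, "DC01 (Active Directory)"),
                                                (2, "EXCH01 (Exchange)")]),
        ProtectionGroup("Applications", [(1, "CTX01 (Citrix)"),
                                         (2, "BES01 (BlackBerry)")]),
    ]

    # Groups start in the predefined order; servers within a group by priority.
    for group in recovery_plan:
        for server in group.boot_order():
            print(f"Starting {server} in group '{group.name}'")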

Basic requirements:

  • Two VMware farms (a production farm and a recovery farm)
  • Two VMware vCenter servers (one at each site)
  • Two SANs with replication between sites (a wide range of SANs is supported)

SRM can fill a big part of the Disaster Recovery jigsaw and should be considered by any organisation with a VMware environment and DR needs. It is a competitive solution in terms of functionality and has low ongoing management costs. It does require high-performance data links between sites, so make sure you can get, and afford, those services at the start of the planning process.

VMware Site Recovery Manager website

Business Continuity

Business Continuity is a popular topic at the moment and is high on the list of priorities for businesses around the globe. Business Continuity is about ensuring your business continues to function in the event of a disruption (foreseen or unforeseen). Many people think of the stereotypical disasters, e.g. the building catching fire, an earthquake or even a terrorist attack. The reality is that around 80% of computer system downtime is actually the result of the people who run the computer systems, i.e. your IT people, not terrorists.

So what should you do to ensure your business can continue in the event of an “IT disaster”?

Before you begin, make sure the basics are right

  1. ensure you have good backups. This is your first line of defence.
  2. ensure your servers are in a protected environment: UPS, computer-friendly environment, secure room, etc.
  3. ensure your IT staff know what they are doing and act in the interests of your business.

Develop a Business Continuity Plan

  1. identify and document your key business processes, e.g. order processing, banking
  2. identify what each of the processes from step 1 depends on
  3. work out your threshold for pain, i.e. how long you can live without each process
  4. calculate the business cost if that business process is not available, e.g. lost sales, delayed payments (see the sketch after this list)
  5. document what you will do if the process is disrupted, e.g. manual processing, an alternative process
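
As a rough illustration of steps 1 to 4, here is a small Python sketch that records each process, what it depends on, its pain threshold and an hourly cost, then estimates the impact of an outage. All of the processes, dependencies and figures are made-up examples, not real numbers.

    # Hypothetical worked example of steps 1-4; all figures are illustrative.
    from dataclasses import dataclass

    @dataclass
    class BusinessProcess:
        name: str                 # step 1: key business process
        depends_on: list          # step 2: what the process depends on
        max_outage_hours: float   # step 3: pain threshold
        cost_per_hour: float      # step 4: cost while unavailable

        def outage_cost(self, hours):
            return hours * self.cost_per_hour

    processes = [
        BusinessProcess("Order processing", ["ERP server", "Internet"], 8, 500.0),
        BusinessProcess("Banking", ["Internet banking", "Accounts PC"], 24, 100.0),
    ]

    outage_hours = 12  # assume a 12-hour disruption
    for p in processes:
        breached = "YES" if outage_hours > p.max_outage_hours else "no"
        print(f"{p.name}: cost approx ${p.outage_cost(outage_hours):,.0f}, "
              f"threshold breached: {breached}, depends on: {', '.join(p.depends_on)}")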

Following these basic steps, you will soon get an idea of where your business is vulnerable and how long you can continue before the pain threshold is reached. Businesses with large suppliers or customers should consider getting those partner organisations involved in the plan, as they may also be affected if disaster strikes.

One size doesn’t fit all when it comes to Business Continuity: some businesses are happy to ride out a short outage with manual processes, while others require a duplication of IT infrastructure. Compliance with regulations is also an important consideration.

From a technical point of view, a number of technologies can be used to build the IT components of a Business Continuity solution. Virtualisation allows portability of servers and applications. Replication technologies allow data to be copied to a secondary location. Remote access technologies allow employees to work from outside the office. Backup technology can also be an important component of the solution.
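
As a very simplified illustration of the replication idea, the core concept is just keeping a copy of changed data at a secondary location. Real replication products work at the block or application level; the sketch below only copies changed files, and the paths are hypothetical.

    # Very simplified sketch of copying changed files to a secondary location.
    # Paths are hypothetical; real replication works at block/application level.
    import shutil
    from pathlib import Path

    SOURCE = Path("/data/production")      # hypothetical production data
    REPLICA = Path("/mnt/recovery-site")   # hypothetical secondary location

    def replicate(source: Path, replica: Path):
        """Copy files that are new or newer at the source to the replica."""
        for src in source.rglob("*"):
            if not src.is_file():
                continue
            dest = replica / src.relative_to(source)
            if not dest.exists() or src.stat().st_mtime > dest.stat().st_mtime:
                dest.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, dest)

    replicate(SOURCE, REPLICA)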

Knowing what to do when a disaster occurs is a big part of BCP. Being prepared will reduce the time it takes to recover and reduce panic. It is important to test often and update your plan where necessary.