Disaster? What Disaster?

Bill Cole

Bill Cole – Competitive Sales Specialist, Information Management, IBM

In a previous post, I wrote about our System/370 dangling from a crane.  It was a simpler time, when loss of computing wasn’t the business-collapsing event it can be today.  If you believe one of the adverts regarding disaster recovery, a significant number of businesses don’t recover.  That’s a scary thought for any of us who are responsible for building that capability.  In case you’re wondering, this isn’t an academic exercise for me.  In fact, you may be using a navigation service or buying gifts online through one of the systems that I architected and implemented.

From a database perspective, 24 x 7 x forever is an expensive and technically challenging proposition.  Whatever your reasons or choices, we’re simply talking about an insurance policy that’s written in hardware, software and processes rather than bits of paper.  In one of my stints as a Production DBA, the CEO would walk by and ask about my systems.  I told him they were still up so we were still in business.  He understood the logic and wasn’t comforted.  Neither was I since I was the entire DBA staff!

Let’s start with some basics.  You’re ensuring that your database survives intact and the application can continue to function, even if it’s in a degraded fashion.  I’ve had customers with exact replicas of their Production environment and others with a smaller version just to keep going through the disaster.  It depends on how much you want to spend on that insurance policy.  If your business is really global and/or your customers & partners expect ubiquitous access, then your choice is made.

And you have to commit to testing.  It’s far too late to find a hole when the hurricane comes through and your machine is dangling from a crane.  One of my clients actually fails over, conducts business for a weekend and then fails back.  It’s not extreme since they’re out of business if the system ever fully fails.  They survived Hurricane Sandy because they were ready and knew how to fail over and keep going.

You really have three choices.  HADR, QRep and CDC.  CDC??  Yup.  CDC replicates changes from one database to another and that seems to be what we’re talking about, right?

DB2 with HADR (High Availability Disaster Recovery) is the simple choice.  It works with pureScale 10.5, too.  You can even tune the time delay so you have some idea of how many transactions might be in flight.  The application should see an error and recover nicely.  That’s the theory anyway.  Failover and failback are supported.  So you’re good to go, as we say in NASCAR country.  If you’re using pureScale and HADR on a Power system, you’ve pretty much prepared for anything and everything from a database perspective.  QRep is the likely variation without some of the neat tuning knobs HADR brings.
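
Just to make that first choice concrete, here is a minimal sketch of standing up HADR from the DB2 command line.  The database name (SALESDB), hosts, ports and instance name are all made up for illustration; the values shown are for the primary, and the LOCAL/REMOTE pairs are simply swapped on the standby.

    # On the standby server: restore a backup image of the primary, leaving the
    # database in rollforward-pending state
    db2 restore db SALESDB from /backups taken at 20130601120000

    # On both servers: tell each copy where its partner lives (hypothetical names/ports)
    db2 update db cfg for SALESDB using HADR_LOCAL_HOST hostA HADR_LOCAL_SVC 55001
    db2 update db cfg for SALESDB using HADR_REMOTE_HOST hostB HADR_REMOTE_SVC 55001
    db2 update db cfg for SALESDB using HADR_REMOTE_INST db2inst1 HADR_SYNCMODE NEARSYNC

    # Start the standby first, then the primary
    db2 start hadr on db SALESDB as standby      # on the standby server
    db2 start hadr on db SALESDB as primary      # on the primary server

    # Planned failover (role switch); add BY FORCE only in a genuine disaster
    db2 takeover hadr on db SALESDB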

The two biggest issues are licensing and network costs, it seems to me.  Well, it’s expensive to have a pipe large enough to handle the volume of data in a large production environment.  Licensing isn’t an issue if you’re using PureData for Transactions (PDTx) since the relevant licenses (database, pureScale and HADR) are included.  Changes the whole debate.  You’ve got the first part of the insurance policy all wrapped up and paid for.

Choice two-A: Monthly or weekly cold backups and daily incrementals.  Or incrementals more often, depending on your recovery objectives.  I’ve accomplished this one by having a process that watches for log file completions and then FTPs each completed log file to another system for backing up.  Or you can write a script that does a remote copy of the log files directly (less chance of random corruption, in my experience) to the backup system.
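
As a sketch of how that can look in practice, here’s a shell fragment along those lines.  The database name, paths and DR host are placeholders, and in real life you’d drive the copy from your log archiving method or a scheduler rather than a bare loop.

    # Weekly full plus daily incremental online backups (made-up paths)
    db2 backup db SALESDB online to /backups compress
    db2 backup db SALESDB online incremental to /backups compress

    # Ship each completed archive log to the DR system as it appears;
    # scp rather than ftp means less chance of a half-written or mangled file
    for log in /db2/archlogs/SALESDB/*.LOG; do
        scp -p "$log" drhost:/db2/shipped_logs/SALESDB/ && \
            mv "$log" /db2/archlogs/SALESDB/sent/
    done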

Choice two-B: Same as above but shipping the logs to a database that is ingesting the logs in recovery mode.  The variation would be to use Change Data Capture to accomplish this.  And there’s lots more CDC could do for you besides this chore.
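
The receiving side of that arrangement is a sketch like the one below, again with the hypothetical SALESDB and a made-up directory for the shipped logs: restore once from a backup image, then keep rolling forward (without completing) as new logs arrive.

    # One time: build the standby from a backup image; it stays in rollforward-pending state
    db2 restore db SALESDB from /backups taken at 20130601120000 without prompting

    # Repeatedly (cron job or loop): apply whatever logs have been shipped so far
    db2 "rollforward db SALESDB to end of logs overflow log path (/db2/shipped_logs/SALESDB)"

    # Only at failover time: finish recovery and open the database for business
    db2 "rollforward db SALESDB to end of logs and complete overflow log path (/db2/shipped_logs/SALESDB)"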

Choice three: Write every transaction to a log so you can replay it for recovery.  Hmmm.  How do you sell that one to the application developers?  And won’t there be a performance hit?

There’s a fourth choice.  Read on!

Here’s a short synopsis of IBM’s products:

QRep – replicates transactions to a remote database using MQ to transmit transaction messages reliably.  Think of your backups going through the reliable messaging that MQ provides rather than taking a chance with ftp or simply losing some part of a transmission.

CDC (Change Data Capture) – replicates transactions to a remote database using proprietary TCP/IP messaging.  More on CDC in another installment.  Another useful capability within CDC is building files for DataStage to use for reloading the database, which is a pretty interesting option.

HADR — replicates transactions to a remote database.  HADR can be tuned to prevent loss of any transactions (which imposes an overhead on performance, of course).  You can choose to lose a few transactions by configuring for async replication, and you can tune how long that window is.  One of the things you can do with HADR is run reporting against your backup/standby databases.  I’m a big fan of this option since I hate the thought of servers simply waiting for something to fail without providing any real business value.
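
To illustrate those knobs (with the same made-up SALESDB as before): the sync mode sets how many in-flight transactions you’re prepared to lose, and a registry variable opens the standby up for read-only reporting.

    # SYNC means no committed transaction is lost, at a cost in commit latency;
    # NEARSYNC, ASYNC and SUPERASYNC trade safety for speed
    db2 update db cfg for SALESDB using HADR_SYNCMODE NEARSYNC
    db2 update db cfg for SALESDB using HADR_PEER_WINDOW 120   # seconds the primary stays in peer state if the standby drops off

    # Allow read-only reporting queries on the standby ("reads on standby")
    db2set DB2_HADR_ROS=ON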

One of the really esoteric HA/DR configurations I’ve seen is cascaded backup databases (backups of backups).   I’ve seen this done with HADR and log shipping.  Or HADR to HADR.  It works.  I’d consider using different methods such as HADR and then log shipping.
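
If you want to keep it all inside HADR, DB2 10.1 and later let one primary feed multiple standbys through the hadr_target_list parameter, with the extra, auxiliary standbys running in SUPERASYNC mode.  Strictly speaking that isn’t a cascade (every standby is fed directly from the primary), but it buys you the same third copy.  A hypothetical two-standby sketch:

    # On the primary: the first entry is the principal standby, the rest are auxiliaries
    db2 update db cfg for SALESDB using HADR_TARGET_LIST "standby1:55001|standby2:55001"

    # Each standby's own target list names the primary (and the other standbys),
    # then it starts the usual way
    db2 start hadr on db SALESDB as standby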

None of the products above require any changes to application code, so any and all applications should work without worrying you or the developers.  That leaves you free to stay focused on adding value to the business rather than simply playing with the technology (no matter how much fun that may be).

Full-out paranoia: Combine several of the above into your strategy.  I had three different methods of backup for a very large online auction house (no, not the one you’re thinking about).  A daily full warm backup.  Remote copies of the logs. Log shipping with recovery.  One of the backups had to be right!

Trick question: Which method did we miss?  The cloud, of course.  Why not put your DR site in a Public cloud?  All of these methods could be implemented in a cloud.  I suggested Public because you don’t have to manage it and it’s not subject to the vicissitudes of your environment or budget.  You can scale it to meet your requirements and pay only for what you use.  It’s a valid variation on our theme.

Finally, I keep talking about testing your recovery methods because it seems to be the weak spot that every system has.  It’s messy and time-consuming, seemingly to no real business purpose.  Not to mention fraught with the possibility of a major malfunction.  We all have stories in our pocket about the backup that was never tested or the plug being pulled.  I’ve got lots of them, too.  I designed and then managed a data center.  I had a diesel generator that would kick in whenever the building lost power.  I tested it weekly.  It seemed the rational thing to do, having spent all that money on it.  I was sitting in my office a few weeks later and the building lights went out.  Someone doing construction had cut the power.  I ran to my computer room (twelve feet) and found everything humming along.  Of course, no one in the building knew that because their PCs were dead.  The CEO knocked on the door after walking down the stairs.  My data center was still up.  We were still in business.  Smiles all around.  Ah, paranoia pays off!

Learn more about the new version of DB2 for LUW

Read about the PureData System for Transactions, which is optimized exclusively for transactional workloads.

Follow Bill Cole on Twitter: @billcole_ibm

2 Responses to Disaster? What Disaster?

  1. Ron Delaware says:

    Can HADR be configured to run on a single LPAR between two instances of TSM (one Production, one DR)?

    This is the configuration of the proposed setup:

    1. The production server is running a pSeries AIX server, with only one (1) LPAR at two (2) sites (Prod_1 and Prod_2).
    2. The DR instance of TSM is also running on the same AIX server on the same LPAR at the same two (2) sites as the production (DR_1 and DR_2)
    3. The server IS NOT sharing the same network interface cards between the TSM instances
    4. Site 1 – The Prod_1 TSM server will link to the Site 2 DR_1 server instance and HADR will be setup between the two server instances
    5. Site 2 – The Prod_2 TSM server will link to the Site 1 DR_2 server instance and HADR will be setup between the two server instances

    The question to be answered:

    1st – Is this a configuration that IBM would support?
    2nd – What are the drawbacks of this type of implementation?

    • First, thanks for reading the blog, Ron. Always good to know that there’s someone on the other side of the screen.

      It seems to me your proposed configuration is valid and would provide the necessary DR capabilities. Running TSM on the same server as the database is okay, though not optimal; ideally TSM would run on a separate server, or at least in its own LPAR. The real limitation in this configuration is the network traffic. I trust the HADR traffic will be on a dedicated LAN segment, since we don’t want to either slow down everyone else or be slowed by them. You know the story: the network gets saturated because someone is running a backup or file transfer, and your HADR suffers because it can’t keep up with the transaction volume.

      Again, thanks for reading and asking!

      /Bill
