December 6, 2005

Loss of Control (ler)

We had the joy this weekend of a battery failing on the RAID controller in our primary Domino mail server!  Who'd have thought a battery could impact so much?

The discs attached to this controller were our transaction logging discs, so they are quite important to the overall operation of the server.  Basically the server came to a grinding halt.  As things go, though, we were lucky.  It happened on a Saturday morning and the server is one of a cluster, so our Notes and Web clients carried on working against the other server in the cluster.

Our 24x7 operations crew responded and got an IBM technician involved to replace the faulty part.  This was duly done and the server was restarted.  Unfortunately, Domino did not!  Even though the controller was replaced, the transaction logging discs did not restart.  The server group tried a few options to kick-start the discs without losing the logs, but were unable to.  Why these discs failed in addition to the controller is a mystery!

After a quick call with them I decided we could sacrifice the discs and logs and try some new discs.  The new discs worked and a fresh transaction log was built.

Upon restart, the server began the arduous process of fixing many hundreds of mail files and bringing them back into synchronisation with any changes that had occurred on the other cluster member.  As about 30 hours had passed by now, there were a significant number of changes to make.

When all was said and done, only one mail file was corrupted by the controller failure, and it was quickly recovered by replicating from the other cluster member.  No email was lost and none of our users were even aware that their primary mail server was unavailable for 30 hours.

Having the cluster and individual mail stores made this almost a trivial event!
Posted by Simon Barratt at 12:31:58 PM