VMware vSphere, MS Clusters and vMotion resulting in cluster service failures

A quick guide, based on our own experience.

When running an active/passive MS Failover Cluster (MSFoC) in a VMware environment, you need to be aware of how the cluster behaves when the active node is vMotioned. During the final cutover of the VM to the new host, networking and activity on the active node are briefly interrupted, and the passive node can often see this as a failure and try to take over. As the active node is not actually down, both nodes end up trying to run the services, and ‘split-brain’ detection then shuts the service down on both nodes.
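To make the timing problem concrete, here is a minimal Python sketch of the idea. It is a toy model, not how MSFoC is actually implemented: the function names are made up for illustration, and the heartbeat interval and missed-heartbeat threshold are assumed example values, not settings taken from any particular cluster.

```python
def missed_heartbeats(pause_ms: int, interval_ms: int) -> int:
    """Number of whole heartbeat intervals that fit inside the pause.

    Toy model: every heartbeat due while the active node is paused is lost.
    """
    return pause_ms // interval_ms


def passive_declares_failure(pause_ms: int,
                             interval_ms: int = 1000,  # assumed example value
                             threshold: int = 5) -> bool:  # assumed example value
    """True if the passive node would treat the pause as a node failure."""
    return missed_heartbeats(pause_ms, interval_ms) >= threshold


if __name__ == "__main__":
    # A short cutover pause is tolerated; a long one looks like a dead node.
    for pause in (500, 3000, 6000):
        print(pause, "ms pause -> takeover attempted:",
              passive_declares_failure(pause))
```

The point is only that the passive node cannot tell a paused node from a dead one. Once enough heartbeats are missed it tries to take over, even though the active node carries on a moment later.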

Our way around this issue is to set DRS for these cluster nodes to manual, so they are only moved when we choose to move them. When a vMotion is needed (generally for maintenance, or for manual load balancing), we make sure the MSFoC service is shut down on the passive node first. That way we can move whatever we need to without triggering a false takeover.
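The order of operations matters more than the tooling, so here is a rough Python sketch of the sequence. The three callables are hypothetical placeholders for however you stop the cluster service, start it again, and trigger the vMotion in your environment; none of them is a real VMware or Microsoft API. Restarting the service on the passive node afterwards is our assumption of the obvious final step.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def passive_node_quiesced(stop_service: Callable[[], None],
                          start_service: Callable[[], None]) -> Iterator[None]:
    """Keep the cluster service on the passive node stopped for the duration."""
    stop_service()
    try:
        yield
    finally:
        # Bring the passive node back even if the migration failed.
        start_service()


def migrate_active_node(stop_service: Callable[[], None],
                        start_service: Callable[[], None],
                        vmotion: Callable[[], None]) -> None:
    """Stop the passive node's cluster service, vMotion, then start it again."""
    with passive_node_quiesced(stop_service, start_service):
        vmotion()


if __name__ == "__main__":
    # Stand-in actions that just print, to show the ordering.
    migrate_active_node(
        lambda: print("stop cluster service on passive node"),
        lambda: print("start cluster service on passive node"),
        lambda: print("vMotion active node"),
    )
```

Using a context manager means the passive node's service is started again even when the vMotion step raises an error, so the cluster is not left without its standby.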

Bear this in mind when running MSFoC in VMware. We have seen the impact more often with shared-disk clusters, but non-shared-disk clusters are affected as well.
