Saturday, May 15, 2010

Testing Microsoft Cluster

I came across this tutorial when I needed to plan cluster configuration testing, so it may be useful for you too.

Test Number 1: Moving Groups

Move the current resources (including the Cluster Group and any Disk Groups) that were created when Cluster Service was installed from the active cluster node to the inactive cluster node.

In your cluster, the two nodes can be divided into an active node and an inactive node.

After the Cluster Service has been installed on both nodes of the cluster, one of the nodes will be in control of all the default cluster groups (the active node) and the other node will not have any cluster groups assigned to it (the inactive node). The resources found on the active node, by default, include what is called the "Cluster Group" and the "Disk Group". (There may be one or more Disk Groups, depending on how your shared disk array has been configured. In this example, we will assume there is only one Disk Group.)

The Cluster Group generally contains these cluster resources:

· Cluster IP Address (the virtual IP address of the cluster).

· Cluster Name (the virtual name of the cluster, used by clients to access the cluster).

· Disk Q: (the quorum disk, may or may not be labeled Q:)

The Disk Group generally contains a single disk resource, a drive letter, which refers to a logical drive on the shared disk array. If you have more than one logical drive as part of the shared disk array, then there will be a separate Disk Group for each logical drive available.
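As a rough illustration of the layout described above, the default groups can be written down as a small Python structure. This is only a sketch: the drive letters and the "Disk Group N" naming are placeholders chosen for the example, and your cluster may label things differently.

```python
# Illustrative model of the default groups on a freshly installed cluster.
# Drive letters other than Q: are hypothetical placeholders.
default_groups = {
    "Cluster Group": ["Cluster IP Address", "Cluster Name", "Disk Q:"],
    "Disk Group": ["Disk R:"],
}


def disk_groups_for(logical_drives):
    """Return one Disk Group per logical drive on the shared array.

    The "Disk Group N" names are an assumption made for this sketch.
    """
    return {
        f"Disk Group {i}": [f"Disk {letter}:"]
        for i, letter in enumerate(logical_drives, start=1)
    }


# Two logical drives on the shared array give two separate Disk Groups.
two_drive_layout = disk_groups_for(["R", "S"])
```

With two logical drives, `two_drive_layout` holds two groups with one disk resource each, which matches the "separate Disk Group for each logical drive" rule.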

Now that we have all that out of the way, let's begin our first test to see if the cluster is functioning properly. Our goal in this test is to see if we can manually move both default cluster groups from the active node in the cluster to the inactive node, and then reverse our steps so that the cluster groups return to their original location on the active node. Here's how:

1. Start Cluster Administrator.

2. In the Explorer pane at the left side of the Cluster Administrator, open up the "Groups" folder. Inside it you should see the Cluster Group and the Disk Group groups.

3. Click on "Cluster Group" to highlight it. In the right pane of the screen, you will see the cluster resources that make up this group. Note the "Owner" of the resources. This is the name of the active node.

4. Each of the groups must be moved to the other node, one at a time. First, right-click on "Cluster Group," then select "Move Group." As soon as you do this, you will see the "State" change from "Online" to "Offline pending" to "Offline" to "Online pending" to "Online." This will happen very quickly. Also note that the "Owner" changes from the active node to the inactive node.

5. Now do the same for the "Disk Group."

6. Assuming there are no problems, both groups will have moved to the inactive node, which, in effect, has now become the active node. Once both groups have been moved, look in the Event Viewer to see if any error messages were generated. If everything worked correctly, there should be no error messages.

7. Now, move both groups back to the original node by repeating steps four through six above.
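The steps above can be sketched as a small Python model of what you should expect to see. It does not talk to a real cluster. The node names NODE1 and NODE2 and the function name are made up for illustration, and the state list is simply the sequence described in step four.

```python
# States a group passes through during "Move Group", as described in step 4.
MOVE_STATES = ["Online", "Offline pending", "Offline", "Online pending", "Online"]


def move_group(owners, group, nodes):
    """Simulate "Move Group" on a two-node cluster.

    Switches the owner of `group` to the other node and returns the
    sequence of states the group is expected to pass through.
    """
    current = owners[group]
    owners[group] = nodes[1] if current == nodes[0] else nodes[0]
    return list(MOVE_STATES)


nodes = ("NODE1", "NODE2")  # hypothetical node names
owners = {"Cluster Group": "NODE1", "Disk Group": "NODE1"}

# Steps 4 to 6: move each group, one at a time, to the inactive node.
for group in list(owners):
    move_group(owners, group, nodes)
after_first_pass = dict(owners)

# Step 7: repeat the moves to return both groups to the original node.
for group in list(owners):
    move_group(owners, group, nodes)
after_second_pass = dict(owners)
```

After the first pass both groups are owned by NODE2. After the second pass they are back on NODE1, which is the round trip the test asks for.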

This is a very basic test, but it helps to determine if the cluster is working as it should. The following tests are slightly more comprehensive, helping you to root out any other potential problems.

· Now if you click on the "Disk Group," you will notice that your disk resource did not fail over. This is also normal. A failover only forces dependent resources to fail over as a group, and the "Cluster Group" we failed over earlier is not dependent on the "Disk Group," so the disk resource stayed where it was. To fail over the disk group, right-click on the disk resource in the right pane of the window, and select "Initiate failure." You will have to do this a total of four times in order to fail over the disk resource to the other node.

· Now that you are done, reverse your steps and fail over the "Cluster Group" and the "Disk Group" back to the original node.

Like the previous test, check out the Event Viewer logs to see if any error messages occurred. If everything worked as expected, you are ready for the next test.
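The "four times" above can be read as a restart threshold: the resource is restarted on the same node a few times before the group is moved to the other node. The sketch below models that reading in Python. The threshold of 3 is an assumption inferred from the four attempts in the text; the actual value is set in the resource's properties and may differ on your cluster.

```python
def initiate_failures(count, restart_threshold=3):
    """Return the expected outcome of each of `count` initiated failures.

    restart_threshold is an assumed value: the number of times the
    resource is restarted in place before the next failure moves the
    group to the other node.
    """
    outcomes = []
    restarts = 0
    for _ in range(count):
        if restarts < restart_threshold:
            restarts += 1
            outcomes.append("restart on same node")
        else:
            restarts = 0  # the counter starts over on the new node
            outcomes.append("fail over to other node")
    return outcomes


four_attempts = initiate_failures(4)
```

Under that assumption, the first three entries of `four_attempts` are restarts on the same node and only the fourth is a failover.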

Test Number 3: Turn Off Each Node

While the first two tests were performed from the Cluster Administrator, the next three tests are more real world. In this test, you will first need to ensure that all of the default groups are located on one of the two nodes. Then you will physically turn off (flip the switch) the active node (first node).

If you are watching the cluster groups from the Cluster Administrator from the inactive node after turning off the first node, you should see a failover occur and the resources should be automatically failed over to the second node. Check the Event Log for any potential error messages after this occurs.

Once you have checked for any potential problems, turn the node that was turned off earlier (node 1) back on and wait until it fully boots. You will note that turning this node back on does not cause the cluster to fail back. The cluster resources will remain on the second node until you force them to return to the first node.

Now turn off the node with the active groups (second node), repeating what you did earlier with the first node. As before, you can use the Cluster Administrator from node 1 to watch the groups fail over to the first node. Check the Event Log for any potential error messages.

Once the groups have failed over to the first node, turn the second node back on, and wait until it boots up fully.

This is a very good test to see if failover will work in the real world. If no problems arose from this test, then you are ready for the next test.
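The expected behavior in this test, including the fact that powering a node back on does not move anything back, can be written as a small Python model. This is only a sketch of the outcome you should observe. The class, method, and node names are invented for the example and are not part of any cluster tool.

```python
class TwoNodeCluster:
    """Toy model of group ownership during power-off tests."""

    def __init__(self, nodes, groups):
        self.up = {node: True for node in nodes}
        # All default groups start on the first node.
        self.owner = {group: nodes[0] for group in groups}

    def power_off(self, node):
        """Turn a node off; its groups fail over to a surviving node."""
        self.up[node] = False
        survivors = [n for n, alive in self.up.items() if alive]
        for group, owner in list(self.owner.items()):
            if owner == node:
                self.owner[group] = survivors[0] if survivors else None

    def power_on(self, node):
        """Turn a node on; ownership is unchanged (no automatic failback)."""
        self.up[node] = True


cluster = TwoNodeCluster(["NODE1", "NODE2"], ["Cluster Group", "Disk Group"])

cluster.power_off("NODE1")
after_node1_off = dict(cluster.owner)   # groups should now be on NODE2

cluster.power_on("NODE1")
after_node1_on = dict(cluster.owner)    # still on NODE2, no failback

cluster.power_off("NODE2")
after_node2_off = dict(cluster.owner)   # groups should now be on NODE1
cluster.power_on("NODE2")
```

If what you see in Cluster Administrator differs from the three snapshots at any step, that is the point at which to check the Event Log.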

Test Number 4: Break Network Connectivity

This test is similar in concept to the test above. What we want to do is force a failover. But instead of simulating a computer failure, we will be simulating a network-related error.

From the node that has the default resource groups (the first node), remove the network cable from the public network card. This will simulate a failure of the first node, and should initiate a failover to the second node.

If you are watching the cluster groups from the Cluster Administrator from the second node, you should see a failover occur and the resources should be automatically failed over. Check the Event Log for any potential error messages.

Once you have checked for any potential problems, plug the network cable back into the first node, and then remove the network cable from the public network card on the second node. As before, you can use the Cluster Administrator to watch the groups fail over to the first node. Check the Event Log for any potential error messages. Once you are done, plug the network cable back into the public network card on the second node.

If no problems arose from this test, then you are ready for the next.
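If you also want a number for how long clients lost access while the cable was unplugged, one option is to probe the cluster name at regular intervals during the test and note when the probes fail and recover. The Python sketch below only does the arithmetic on such a log. The sample timestamps are invented for illustration, and how you collect the probes is left to you.

```python
def longest_outage(samples):
    """Return the longest outage, in seconds, seen in a probe log.

    samples is a time-ordered list of (seconds, reachable) tuples.
    An outage runs from the first failed probe to the next successful
    one. An outage that never recovers within the log is not counted.
    """
    longest = 0.0
    down_since = None
    for timestamp, reachable in samples:
        if not reachable and down_since is None:
            down_since = timestamp
        elif reachable and down_since is not None:
            longest = max(longest, timestamp - down_since)
            down_since = None
    return longest


# Made-up probe log: one probe every 5 seconds, cable pulled around t=10.
probe_log = [(0, True), (5, True), (10, False), (15, False),
             (20, False), (25, True), (30, True)]
outage_seconds = longest_outage(probe_log)
```

For this invented log the first failed probe is at 10 seconds and the first successful one after it is at 25 seconds, so the outage is measured as 15 seconds.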

Test Number 5: Break Shared Array Connectivity

This test is designed to help uncover potential issues with the shared disk array. I have seen clusters pass all four of the above tests, but fail this one if the shared disk array is not configured 100% correctly. This test is designed to simulate what would happen if the controller card or cable connected from a node to the shared disk array fails.

From the node that has the default resource groups (the first node), remove the cable from the card used to connect to the shared array. This will simulate a failure of the first node, and should initiate a failover to the second node.

If you are watching the cluster groups from the Cluster Administrator from the second node, you should see a failover occur and the resources should be automatically failed over. Check the Event Log for any potential error messages.

Once you have checked for any potential problems, plug the cable back into the first node, and then remove the cable from the card used to connect to the shared array on the second node. As before, you can use the Cluster Administrator to watch the groups fail over to the first node. Check the Event Log for any potential error messages. Once you are done, plug the cable back into the appropriate card.
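Since every test ends with the same two checks (did the groups end up on the other node, and is the Event Log free of errors), it can help to record the results in one place. The following Python sketch is one possible way to do that. The class and field names are made up for this example, and the sample entries are invented, not real results.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterTestResult:
    """Outcome of one of the numbered tests in this article."""
    number: int
    groups_moved: bool
    event_log_errors: list = field(default_factory=list)

    @property
    def passed(self):
        # A test passes only if the groups moved and no errors were logged.
        return self.groups_moved and not self.event_log_errors


def first_failure(results):
    """Return the number of the first failed test, or None if all passed."""
    for result in sorted(results, key=lambda r: r.number):
        if not result.passed:
            return result.number
    return None


# Invented example: tests 1 to 4 pass, test 5 logs a disk-related error.
results = [ClusterTestResult(n, True) for n in range(1, 5)]
results.append(ClusterTestResult(5, True, ["example disk error"]))
failed_at = first_failure(results)
```

In this invented run `failed_at` is 5, which mirrors the case described above where a cluster passes the first four tests and fails only the shared array test.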
