I had a customer call me tonight with a problem on their Windows 2000 Cluster (They had to bring it down due to maintenance of their SAN, and when it was brought back up one of the groups failed to come online).
The disk resource was in a state of 'Online pending' and it was impossible to stop it (As you might know, a resource in a pending state can't be configured or brought into an offline, online or failed state - either through the GUI or through the Cluster command).
It wasn't immediately apparent what the problem was, either from looking in Event Viewer or in the cluster log (Found in %SystemRoot%\Cluster\Cluster.log). But we remembered that the last time the Cluster was brought down (Due to a power failure) the Cluster wanted to run a Chkdsk on the largest volume (1.2 TB RAID 5 - I don't even want to guess how long that takes - maybe some of you have experience with that?).
Furthermore, we could see that it created a log file called ChkDsk_Disk1_SigXXXXXXX.log. We then found that each time the disk resource was started it brought up an empty Command Prompt (Named Chkdsk) on the Console (Not at the RDP connection ;-) - but it didn't seem to start running, and the log file stopped growing after a few minutes. Additionally, the Disk's drive letter and description disappeared and reappeared on the Parameters tab of the disk resource.
We tried to stop the Cluster Service (Net Stop ClusSvc), which timed out, so I ended up killing the process with Kill.exe (TaskKill in Windows Server 2003). When we restarted the Service, the Disk Resource once again ended up in a state of online pending.
Due to the critical nature of this Cluster, we had to find a way to bring this resource online without needing to run ChkDsk. There are basically two settings/registry keys that define how Chkdsk is run on a Cluster: one is SkipChkdsk (A value of 1 means skip - 0 is the default) and the other is ConditionalMount (If SkipChkdsk equals 0, then a value of 0 will fail the disk resource and the default value of 1 will run 'Chkdsk /f' against the resource before bringing it online).
As the Disk resource in question was in a pending state, I was unable to configure it through the Cluster.exe command 'Cluster Clustername res "Disk X:" /priv ConditionalMount=1', so I had to once again "kill" ClusSvc and then change the registry key containing this setting. All registry keys for a Microsoft Cluster's resources are contained within the HKLM\Cluster\Resources\'GUIDs' keys. I found the correct key by searching for the description of the disk resource and verifying that it had the correct disk signature (Found by using Diskpart - Detail Disk), and then changed the ..\Parameters\ConditionalMount REG_DWORD value to 0.
After this I restarted the Cluster Service and the disk resource failed immediately. I then used the "correct" way to set the SkipChkdsk value, namely through the Cluster command, and brought the Disk resource and the group online (Remember, you can't configure these properties when the resource is 'pending' or the cluster service is stopped).
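For those who prefer to see the SkipChkdsk/ConditionalMount interaction spelled out, here is a minimal Python sketch of the decision as I described it above. It is only an illustration of my reading of the two settings, not how the Cluster Service is actually implemented: the function name disk_online_action and the returned text labels are made up for this example, and it doesn't try to model anything else the service may check before deciding to run Chkdsk.

```python
def disk_online_action(skip_chkdsk: int = 0, conditional_mount: int = 1) -> str:
    """Return what should happen when the disk resource is brought online.

    Hypothetical helper: it mirrors the description in this post only.
    Defaults match the defaults mentioned above (SkipChkdsk=0, ConditionalMount=1).
    """
    if skip_chkdsk == 1:
        # SkipChkdsk=1 wins: no Chkdsk, ConditionalMount is not consulted.
        return "mount without chkdsk"
    if conditional_mount == 0:
        # SkipChkdsk=0 and ConditionalMount=0: the disk resource fails.
        return "fail resource"
    # SkipChkdsk=0 and ConditionalMount=1 (the defaults): Chkdsk /f runs first.
    return "run chkdsk /f, then mount"


if __name__ == "__main__":
    # Print all four combinations as a small truth table.
    for skip in (0, 1):
        for cond in (0, 1):
            action = disk_online_action(skip, cond)
            print(f"SkipChkdsk={skip} ConditionalMount={cond}: {action}")
```

In our case the resource started out with the defaults (Chkdsk /f), the registry change moved it to the "fail resource" case so it would leave the pending state, and setting SkipChkdsk=1 afterwards moved it to "mount without chkdsk".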
Problem solved (Well not really - more like symptom solved - I guess we need to revamp the Cluster when we upgrade it to 2003 SP1 anyway).
More resources can be found here, and KB article 223023 describes the ConditionalMount and SkipChkdsk settings in further detail.
4 comments:
I have been in the weeds solving this problem and finally found the obscure information. The interesting question that I have is why does the dirty bit get set, and why doesn't chkdsk /f fix it? I am not a hardware guy, just a mere mortal programmer who has deep respect for the hardware team....
thanks, you and google saved my day (night) ;.)
You are welcome, we're just happy that someone finds our information useful ;-)
Great article - saved my hide. We had a 4-node cluster go down due to a power failure in the computer room. The cluster has several 1 TB LUNs and one went into a chkdsk.