From: Dushyanth Harinath
Subject: Re: Multipath failover issues
Date: Tue, 17 Mar 2009 18:00:00 +0530
Message-ID: <49BF97C8.1030406@directi.com>
In-Reply-To: <1237243253.309.13.camel@chandra-ubuntu>
References: <49BE7C9C.8020100@directi.com> <1237226357.309.6.camel@chandra-ubuntu> <49BEBEB9.7020707@directi.com> <1237243253.309.13.camel@chandra-ubuntu>
Reply-To: device-mapper development
To: sekharan@linux.vnet.ibm.com, device-mapper development
List-Id: dm-devel.ids

Hi,

>> 257 Critical 2009-03-11 10:38:43 ALERT:Redundant Controller Failure
>> Detected (Slot B)
>>
>> I also found additional logs from /var/log/messages which i did not
>> check earlier.
>>
>> Mar 11 10:32:46 multipathd: sdc: readsector0 checker reports path is down
>> Mar 11 10:32:46 multipathd: checker failed path 8:32 in map infortrend01
>> Mar 11 10:32:46 multipathd: infortrend01: remaining active paths: 1
>> Mar 11 10:32:46 multipathd: sdd: readsector0 checker reports path is down
>> Mar 11 10:32:46 multipathd: checker failed path 8:48 in map infortrend01
>
> Does this timing correspond to when you turned off the controller ?

This is when the controller failed. The controller shutdown happened
much later.

>> Iam assuming it must have been busy for a few secs during the switch
>> over and the multipath config doesn't wait enough for the switchover to
>> work.
>
> Answer to your previous question would help here :)
>
> Set no_path_retry to "queue", which would queue the I/Os when "all" the
> paths fail.

I am not sure if I can do this either. Aren't we creating an illusion
that the storage subsystem is fine by queuing requests when the
subsystem is actually gone?
What actually happens when I/O is queued? There must be some limit on
the queue as well, right?

> If the behavior seen above was caused by the storage and will be
> rectified in an acceptable (to the user) time, then this parameter
> setting would solve your problem.

I am checking this with Infortrend.

> BTW, have you seen the I/O successfully been sent to the lun (both paths
> - you can use iostat to check it) before you failed the controller ? (I
> am trying to see if your config settings are proper).

I am doing a post-mortem of the redundant controller failure here :).
I dug out what was done after the controller failure.

* Primary controller failed and failover to the secondary did not work
* Multipath failed both paths and ext3 went read-only
* Postgres crashed
* When they logged in and ran (multipath -v2 -ll), they saw both paths
  active
  - I cannot find any multipath log entries which show paths reinstated
    until 11:50, which was after the controller shutdown and power cycle.
* The filesystem was mounted again (without fsck) and the database
  started (this answers your question about I/O to the LUNs, I think)
* Postgres recovered, and was then shut down immediately and /data
  unmounted.
* After this the controllers on the Infortrend were shut down and the
  device power cycled.

PS: I am digging up the entire multipath logs instead of posting
snippets here. I will add them to pastebin and send the link over.

TIA
Dushyanth
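
For going through the full logs, a small script along these lines could
pull the path-failure events out of /var/log/messages. This is only a
sketch: the function name path_failures and the regular expression are
assumptions modelled on the multipathd lines quoted above, and real
syslog lines may carry a hostname or PID that the pattern only loosely
allows for.

```python
import re

# Assumed line shape, taken from the log excerpt quoted in this thread:
#   Mar 11 10:32:46 multipathd: checker failed path 8:32 in map infortrend01
# An optional hostname and an optional "[pid]" suffix are tolerated.
FAILED = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?:\S+ )?"
    r"multipathd(?:\[\d+\])?: "
    r"checker failed path (?P<dev>\d+:\d+) in map (?P<map>\S+)"
)


def path_failures(lines):
    """Yield (timestamp, major:minor, map name) for each failed-path line."""
    for line in lines:
        match = FAILED.match(line.strip())
        if match:
            yield match.group("ts"), match.group("dev"), match.group("map")


if __name__ == "__main__":
    # The five lines quoted earlier in this message.
    sample = [
        "Mar 11 10:32:46 multipathd: sdc: readsector0 checker reports path is down",
        "Mar 11 10:32:46 multipathd: checker failed path 8:32 in map infortrend01",
        "Mar 11 10:32:46 multipathd: infortrend01: remaining active paths: 1",
        "Mar 11 10:32:46 multipathd: sdd: readsector0 checker reports path is down",
        "Mar 11 10:32:46 multipathd: checker failed path 8:48 in map infortrend01",
    ]
    for ts, dev, map_name in path_failures(sample):
        print(ts, dev, map_name)
```

Run against the whole log file (for example by passing an open file
object to path_failures), this should give a timeline of when each path
was failed, which can then be compared with the controller's own event
log.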