* Multipath failover issues
From: dushyanth.h @ 2009-03-16 16:21 UTC
  To: device-mapper development

Hi guys,

I am using dm-multipath with an Infortrend dual-controller F16F-R4031-6 FC
system.

Version details are :

device-mapper-multipath-0.4.7-17.el5
device-mapper-1.02.24-1.el5
device-mapper-event-1.02.24-1.el5

OS : Red Hat Enterprise Linux Server release 5.2 (Tikanga)
Kernel : 2.6.18-92.1.10.el5 #1 SMP x86_64 x86_64 x86_64

Recently, one of the RAID controllers failed, which caused multipath to
fail both active paths:

device-mapper: multipath: Failing path 8:32.
sd 2:0:0:0: SCSI error: return code = 0x00020000
end_request: I/O error, dev sdd, sector 1976776672
device-mapper: multipath: Failing path 8:48.
sd 2:0:0:0: SCSI error: return code = 0x00020000
end_request: I/O error, dev sdd, sector 1967432880
sd 2:0:0:0: SCSI error: return code = 0x00020000
end_request: I/O error, dev sdd, sector 161647296

This caused the ext3 filesystem to go read-only. The full I/O error log
is at http://pastebin.com/m103325d9

The dual-controller storage unit and the host server (only one server,
using two Qlogic FC HBAs) are connected to two different Qlogic SanBox FC
switches for redundancy.

multipath.conf : http://pastebin.com/m4c7da817
multipath -v4 -ll : http://pastebin.com/m7d863925

I have checked the logs on the FC switch and the HBAs and I don't see any
event which suggests that both paths failed at once. Even the errors I
captured from dmesg show that only one of the physical disks that make up
dm-0 had 'end_request: I/O error' messages, while the other did not have
any such error.

sd 2:0:0:0: SCSI error: return code = 0x00020000
end_request: I/O error, dev sdd, sector 1967432880
sd 2:0:0:0: SCSI error: return code = 0x00020000
end_request: I/O error, dev sdd, sector 161647296

At this point I am wondering how paths 8:32 and 8:48 failed together,
considering that the two paths go through two different FC switches. Any
suggestions on this?
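
For readers matching the "Failing path 8:32" lines against the "dev sdd"
lines, here is a minimal Python sketch of how a major:minor pair relates to
an sd device name. It assumes the classic SCSI disk numbering (major 8,
16 minors per disk); the helper name sd_name is hypothetical and not part
of any multipath tool.

```python
def sd_name(major: int, minor: int) -> str:
    """Map a block device number to an sd name.

    Assumes SCSI disk major 8 with 16 minors per disk
    (sda = 8:0, sdb = 8:16, sdc = 8:32, ...).
    """
    if major != 8:
        raise ValueError("this sketch only handles SCSI disk major 8")
    disk, part = divmod(minor, 16)
    name = "sd" + chr(ord("a") + disk)
    # A non-zero remainder is a partition number on that disk.
    return name if part == 0 else f"{name}{part}"


for dev in ("8:32", "8:48"):
    major, minor = map(int, dev.split(":"))
    print(dev, "->", sd_name(major, minor))
```

Under that assumption, 8:32 corresponds to sdc and 8:48 to sdd, so the
end_request errors quoted above all belong to the second path.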

Additionally, I have looked at the mailing list archives and annotated conf
files and found two options: a) failback and b) no_path_retry. What would
be the recommended values for these on a dual-controller setup like
mine?

It would also be helpful if someone could share Infortrend-specific
multipath settings.

TIA
Dushyanth


* Re: Multipath failover issues
From: Bryn M. Reeves @ 2009-03-16 16:49 UTC
  To: device-mapper development

On Mon, 2009-03-16 at 21:51 +0530, dushyanth.h@directi.com wrote:
> Hi guys,
> 
> Iam using dm-multipath for a Infortrend dual controller F16F-R4031-6 FC
> system.

Can you post your multipath.conf somewhere? (pastebin is fine).

Regards,
Bryn.


* Re: Multipath failover issues
From: Bryn M. Reeves @ 2009-03-16 16:50 UTC
  To: device-mapper development

On Mon, 2009-03-16 at 21:51 +0530, dushyanth.h@directi.com wrote:
> multipath.conf : http://pastebin.com/m4c7da817
> multipath -v4 -ll : http://pastebin.com/m7d863925

Duh. Ignore me ;)

Bryn.


* Re: Multipath failover issues
From: Chandra Seetharaman @ 2009-03-16 17:59 UTC
  To: device-mapper development


On Mon, 2009-03-16 at 21:51 +0530, dushyanth.h@directi.com wrote:
> Hi guys,
> 
> Iam using dm-multipath for a Infortrend dual controller F16F-R4031-6 FC
> system.
> 
> Version details are :
> 
> device-mapper-multipath-0.4.7-17.el5
> device-mapper-1.02.24-1.el5
> device-mapper-event-1.02.24-1.el5
> 
> OS : Red Hat Enterprise Linux Server release 5.2 (Tikanga)
> Kernel : 2.6.18-92.1.10.el5 #1 SMP x86_64 x86_64 x86_64
> 
> Recently, one of the RAID controllers failed and caused multipath to
> fail both active paths
> 
> device-mapper: multipath: Failing path 8:32.

8:32 has failed here.
> sd 2:0:0:0: SCSI error: return code = 0x00020000

Return code 0x00020000 has host byte 0x02 (DID_BUS_BUSY), which means the
bus is busy.
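
As a rough illustration of how that return code breaks down, here is a
short Python sketch. It assumes the usual Linux SCSI result layout (status
byte in bits 0-7, message byte in bits 8-15, host byte in bits 16-23,
driver byte in bits 24-31). The function name decode_scsi_result is
hypothetical and the host-byte table is only a partial list.

```python
# Partial table of Linux SCSI host byte codes (DID_*).
HOST_BYTE = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
}


def decode_scsi_result(result: int) -> dict:
    """Split a 32-bit SCSI result word into its four byte fields."""
    return {
        "status": result & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "host": (result >> 16) & 0xFF,
        "driver": (result >> 24) & 0xFF,
    }


fields = decode_scsi_result(0x00020000)
print(fields, HOST_BYTE.get(fields["host"], "unknown"))
```

With the value from the log, only the host byte is non-zero, so the error
was reported by the host adapter layer rather than as a SCSI status from
the target.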

> end_request: I/O error, dev sdd, sector 1976776672
> device-mapper: multipath: Failing path 8:48.

and 8:48 failed because of that.

Do you know which one was supposed to fail when the RAID controller
failed ? (my guess is it is 8:32).

It looks like, for whatever reason, the other SCSI bus became busy.
> sd 2:0:0:0: SCSI error: return code = 0x00020000
> end_request: I/O error, dev sdd, sector 1967432880
> sd 2:0:0:0: SCSI error: return code = 0x00020000
> end_request: I/O error, dev sdd, sector 161647296
> 
> This caused the ext3 filesystem to go into a read only mode. Full IO
> errors is at http://pastebin.com/m103325d9
> 
> The dual controller storage unit and the host server (Only 1 Server
> using 2 Qlogic FC HBAs) are hooked upto two different Qlogic SanBox FC
> switch for redundancy.
> 
> multipath.conf : http://pastebin.com/m4c7da817
> multipath -v4 -ll : http://pastebin.com/m7d863925
> 
> I have checked the logs on the FC switch and the HBAs
> and i dont see any event which suggest both paths failed at once. Even
> the errors i captured out of dmesg show that one of the physical disks
> that makes up dm-0 had 'end_request: I/O errors' while the other did not
> have any such error.
> 
> sd 2:0:0:0: SCSI error: return code = 0x00020000
> end_request: I/O error, dev sdd, sector 1967432880
> sd 2:0:0:0: SCSI error: return code = 0x00020000
> end_request: I/O error, dev sdd, sector 161647296
> 
> At this point iam wondering how paths 8:32 and 8:48 failed together -
> considering both paths are through two different FC switches. Any
> suggestions on this ?
> 
> Additionaly, I have looked at the mailing list archives & annotated conf
> files and found two options a) failback and b) no_path_retry. What would
> be the best recommended values for these on a dual controller setup like
> mine ?
> 
> It would also be helpful if someone could share infotrend specific
> settings multipath settings.
> 
> TIA
> Dushyanth
> 


* Re: Multipath failover issues
From: dushyanth.h @ 2009-03-16 21:03 UTC
  To: device-mapper development

Hi,

>> device-mapper: multipath: Failing path 8:32.
> 
> 8:32 has failed here.

>> sd 2:0:0:0: SCSI error: return code = 0x00020000
> 
> error code 20000 mean the BUS is busy.
> 
>> end_request: I/O error, dev sdd, sector 1976776672
>> device-mapper: multipath: Failing path 8:48.
> 
> and 8:48 failed because of that.

> Do you know which one was supposed to fail when the RAID controller
> failed ? (my guess is it is 8:32).

The alert on the storage device was (sorry for not including this earlier):

257 Critical 2009-03-11 10:38:43 ALERT:Redundant Controller Failure
Detected (Slot B)

I also found additional logs in /var/log/messages which I did not
check earlier:

Mar 11 10:32:46 multipathd: sdc: readsector0 checker reports path is down
Mar 11 10:32:46 multipathd: checker failed path 8:32 in map infortrend01
Mar 11 10:32:46 multipathd: infortrend01: remaining active paths: 1
Mar 11 10:32:46 multipathd: sdd: readsector0 checker reports path is down
Mar 11 10:32:46 multipathd: checker failed path 8:48 in map infortrend01
Mar 11 10:32:46 multipathd: infortrend01: remaining active paths: 0
Mar 11 10:32:46 multipathd: dm-0: add map (uevent)
Mar 11 10:32:46 multipathd: dm-0: devmap already registered
Mar 11 10:32:46 multipathd: dm-0: add map (uevent)
Mar 11 10:32:47 multipathd: dm-0: devmap already registered
Mar 11 10:32:47 multipathd: sdd: readsector0 checker reports path is down
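
To pull the sequence of events out of a longer log, something like the
following Python sketch could be used. It only assumes the two message
formats visible in the excerpt above ("checker failed path ... in map ..."
and "remaining active paths: N"); the summarize function and its regular
expressions are illustrative and not part of multipath-tools.

```python
import re

# Excerpt copied from the multipathd messages quoted above.
LOG = """\
Mar 11 10:32:46 multipathd: sdc: readsector0 checker reports path is down
Mar 11 10:32:46 multipathd: checker failed path 8:32 in map infortrend01
Mar 11 10:32:46 multipathd: infortrend01: remaining active paths: 1
Mar 11 10:32:46 multipathd: sdd: readsector0 checker reports path is down
Mar 11 10:32:46 multipathd: checker failed path 8:48 in map infortrend01
Mar 11 10:32:46 multipathd: infortrend01: remaining active paths: 0
"""

STAMP = r"(\w+ +\d+ [\d:]+)"
FAILED = re.compile(
    "^" + STAMP + r" multipathd: checker failed path (\d+:\d+) in map (\S+)$"
)
REMAINING = re.compile(
    "^" + STAMP + r" multipathd: (\S+): remaining active paths: (\d+)$"
)


def summarize(log: str):
    """Return (failed path events, first time each map had zero paths)."""
    failed = []
    all_down = {}
    for line in log.splitlines():
        m = FAILED.match(line)
        if m:
            failed.append((m.group(1), m.group(2), m.group(3)))
            continue
        m = REMAINING.match(line)
        if m and int(m.group(3)) == 0:
            # Keep only the first timestamp at which the map lost all paths.
            all_down.setdefault(m.group(2), m.group(1))
    return failed, all_down


failed, all_down = summarize(LOG)
print(failed)
print(all_down)
```

Run against the full /var/log/messages, this would show whether the two
path failures were logged in the same second, as they are in this excerpt.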

So it looks like 8:32 was the path to the failed controller, and during
the switchover multipath must have detected 8:48 as busy. If this is
right, then it must be due to the Infortrend device itself.

> looks like for whatever reason the other SCSI bus became busy.

>> sd 2:0:0:0: SCSI error: return code = 0x00020000
>> end_request: I/O error, dev sdd, sector 1967432880
>> sd 2:0:0:0: SCSI error: return code = 0x00020000
>> end_request: I/O error, dev sdd, sector 161647296

I am assuming it must have been busy for a few seconds during the
switchover, and the multipath config doesn't wait long enough for the
switchover to complete.

Any advice on the values below?

 > Additionaly, I have looked at the mailing list archives & annotated conf
 > files and found two options a) failback and b) no_path_retry. What would
 > be the best recommended values for these on a dual controller setup like
 > mine ?
 >
 > It would also be helpful if someone could share infotrend specific
 > settings multipath settings.

TIA
Dushyanth


* Re: Multipath failover issues
From: Chandra Seetharaman @ 2009-03-16 22:40 UTC
  To: device-mapper development


On Tue, 2009-03-17 at 02:33 +0530, dushyanth.h@directi.com wrote:
> Hi,
> 
> >> device-mapper: multipath: Failing path 8:32.
> > 
> > 8:32 has failed here.
> 
> >> sd 2:0:0:0: SCSI error: return code = 0x00020000
> > 
> > error code 20000 mean the BUS is busy.
> > 
> >> end_request: I/O error, dev sdd, sector 1976776672
> >> device-mapper: multipath: Failing path 8:48.
> > 
> > and 8:48 failed because of that.
> 
> > Do you know which one was supposed to fail when the RAID controller
> > failed ? (my guess is it is 8:32).
> 
> The alert on the storage device was (sorry for not including this earlier)
> 
> 257 Critical 2009-03-11 10:38:43 ALERT:Redundant Controller Failure
> Detected (Slot B)
> 
> I also found additional logs from /var/log/messages which i did not 
> check earlier.
> 
> Mar 11 10:32:46 multipathd: sdc: readsector0 checker reports path is down
> Mar 11 10:32:46 multipathd: checker failed path 8:32 in map infortrend01
> Mar 11 10:32:46 multipathd: infortrend01: remaining active paths: 1
> Mar 11 10:32:46 multipathd: sdd: readsector0 checker reports path is down
> Mar 11 10:32:46 multipathd: checker failed path 8:48 in map infortrend01

Does this timing correspond to when you turned off the controller ?


> Mar 11 10:32:46 multipathd: infortrend01: remaining active paths: 0
> Mar 11 10:32:46 multipathd: dm-0: add map (uevent)
> Mar 11 10:32:46 multipathd: dm-0: devmap already registered
> Mar 11 10:32:46 multipathd: dm-0: add map (uevent)
> Mar 11 10:32:47 multipathd: dm-0: devmap already registered
> Mar 11 10:32:47 multipathd: sdd: readsector0 checker reports path is down
> 
> So, it looks like 8:32 was the path which had the failed controller and 
> during the switch over multipath must have detected 8:48 as busy? if 
> this is right, then it must be due to the infortrend device itself.
> 
> > looks like for whatever reason the other SCSI bus became busy.
> 
> >> sd 2:0:0:0: SCSI error: return code = 0x00020000
> >> end_request: I/O error, dev sdd, sector 1967432880
> >> sd 2:0:0:0: SCSI error: return code = 0x00020000
> >> end_request: I/O error, dev sdd, sector 161647296
> 
> Iam assuming it must have been busy for a few secs during the switch 
> over and the multipath config doesn't wait enough for the switchover to 
> work.
> 

Answer to your previous question would help here :)

Set no_path_retry to "queue", which would queue the I/Os when "all" the
paths fail.

If the behavior seen above was caused by the storage and will be
rectified in an acceptable (to the user) time, then this parameter
setting would solve your problem.
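
As a sketch of what such a setting might look like, the Python below
renders a multipath.conf defaults stanza for either "queue" or a bounded
retry count. It rests on two assumptions that should be checked against
the multipath.conf documentation for the installed version: that a numeric
no_path_retry keeps queueing for roughly no_path_retry x polling_interval
seconds once all paths are down, and that polling_interval is 5 seconds.
The 60-second outage is a made-up figure, and both helper names are
hypothetical.

```python
import math


def retries_for_outage(outage_seconds: float, polling_interval: int = 5) -> int:
    """Number of checker intervals needed to cover an outage of this length."""
    return max(1, math.ceil(outage_seconds / polling_interval))


def defaults_stanza(no_path_retry, polling_interval: int = 5) -> str:
    """Render a defaults section with the given no_path_retry value."""
    return (
        "defaults {\n"
        f"    polling_interval {polling_interval}\n"
        f"    no_path_retry    {no_path_retry}\n"
        "}\n"
    )


# Queue indefinitely while all paths are down.
print(defaults_stanza("queue"))

# Or queue for a bounded window, here a hypothetical 60-second failover.
print(defaults_stanza(retries_for_outage(60)))
```

The bounded form trades some protection for a guarantee that I/O errors
eventually reach the filesystem if the storage never comes back.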

BTW, did you see I/O being sent successfully to the LUN on both paths
(you can use iostat to check) before the controller failed? (I am trying
to see whether your config settings are proper.)
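
Instead of iostat, the same check can be sketched in Python by reading
the per-device counters in /proc/diskstats directly. This assumes the
standard field layout (major, minor, name, reads completed, ..., writes
completed as the eighth field). The sample text uses made-up numbers, and
io_counts and read_diskstats are hypothetical helper names.

```python
def io_counts(diskstats_text: str, names) -> dict:
    """Return {device: (reads_completed, writes_completed)} for the names."""
    counts = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 11 and fields[2] in names:
            counts[fields[2]] = (int(fields[3]), int(fields[7]))
    return counts


def read_diskstats(path: str = "/proc/diskstats") -> str:
    """Read the live counters on a Linux host."""
    with open(path) as fh:
        return fh.read()


# Made-up sample in /proc/diskstats format, for illustration only.
SAMPLE = """\
   8      32 sdc 1500 20 120000 900 3000 40 240000 1800 0 1500 2700
   8      48 sdd 1480 18 118400 880 2990 35 239200 1790 0 1490 2670
"""

print(io_counts(SAMPLE, {"sdc", "sdd"}))
```

On the real host one would call io_counts(read_diskstats(), {"sdc", "sdd"})
twice a few seconds apart. In a multibus setup both devices would be
expected to show their counters increasing.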

> Any advice on the below values ?
> 
>  > Additionaly, I have looked at the mailing list archives & annotated conf
>  > files and found two options a) failback and b) no_path_retry. What would

Failback would be useful only when you have different path groups. In a
multibus setup like yours it is of no value.

>  > be the best recommended values for these on a dual controller setup like
>  > mine ?
>  >
>  > It would also be helpful if someone could share infotrend specific
>  > settings multipath settings.
> 
> TIA
> Dushyanth


* Re: Multipath failover issues
From: Dushyanth Harinath @ 2009-03-17 12:30 UTC
  To: sekharan, device-mapper development

Hi,

>> 257 Critical 2009-03-11 10:38:43 ALERT:Redundant Controller Failure
>> Detected (Slot B)
>>
>> I also found additional logs from /var/log/messages which i did not 
>> check earlier.
>>
>> Mar 11 10:32:46 multipathd: sdc: readsector0 checker reports path is down
>> Mar 11 10:32:46 multipathd: checker failed path 8:32 in map infortrend01
>> Mar 11 10:32:46 multipathd: infortrend01: remaining active paths: 1
>> Mar 11 10:32:46 multipathd: sdd: readsector0 checker reports path is down
>> Mar 11 10:32:46 multipathd: checker failed path 8:48 in map infortrend01
> 
> Does this timing correspond to when you turned off the controller ?

This is when the controller failed. The controller shutdown happened 
much later.

>> Iam assuming it must have been busy for a few secs during the switch 
>> over and the multipath config doesn't wait enough for the switchover to 
>> work.
> 
> Answer to your previous question would help here :)
> 
> Set no_path_retry to "queue", which would queue the I/Os when "all" the
> paths fail.

I am not sure I can do this. Aren't we creating the illusion that the
storage subsystem is fine by queuing requests when the subsystem is
actually gone? How is the queuing actually done, and there must be some
limit on the queue as well, right?

> If the behavior seen above was caused by the storage and will be
> rectified in an acceptable (to the user) time, then this parameter
> setting would solve your problem.

I am checking this with Infortrend.

> BTW, have you seen the I/O successfully been sent to the lun (both paths
> - you can use iostat to check it) before you failed the controller ? (I
> am trying to see if your config settings are proper).

I am doing a post-mortem of the redundant controller failure here :). I
dug out what was done after the controller failure.

* The primary controller failed and failover to the secondary did not work.
* Multipath failed both paths and ext3 went read-only.
* Postgres crashed.
* When they logged in and ran (multipath -v2 -ll), they saw both paths
active. I cannot find any multipath log entries showing the paths
reinstated until 11:50, which was after the controller shutdown and power
cycle.
* The filesystem was mounted again (without fsck) and the database started
(this answers your question about I/O to the LUNs, I think).
* Postgres recovered, was shut down immediately, and /data was unmounted.
* After this the controllers on the Infortrend were shut down and the
device power cycled.

PS: I am digging up the entire multipath logs instead of posting
snippets here. I will add them to pastebin and send the link over.

TIA
Dushyanth
