* multipath-tools-0.4.4 on 3par unknown path failure issue
@ 2005-08-10 21:49 Alan Kasindorf
  2005-08-11 19:55 ` Andy
  0 siblings, 1 reply; 7+ messages in thread
From: Alan Kasindorf @ 2005-08-10 21:49 UTC
  To: dm-devel

Hey,

I have ~10 machines running multipath-tools-0.4.4 on RHEL ES 4.1 (latest 
everything). Machines are mounting multipathed mounts on an EMC clariion 
and a 3PAR SAN device, over the same fabric.

At some random point in time today, one of the machines lost one of its 
four 3par mounts. All other mounts worked fine. This has happened once 
or twice before as well, but we rebooted before I had time to inspect 
the issue.

multipath -v3 -l showed this status for the bad path:

params = 1 queue_if_no_path 0 1 1 round-robin 0 2 1 8:64 1000 8:176 1000
status = 1 3 0 1 1 E 0 2 0 8:64 F 3574 8:176 F 3574
exports (350002ac0005b02a4)
[size=150 GB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [enabled][first]
  \_ 5:0:0:3 sde  8:64    [ready ][failed]
  \_ 6:0:1:3 sdl  8:176   [ready ][failed]
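
A minimal sketch of how the "all paths failed" condition could be detected from a status line like the one above. It assumes each path shows up as a "<major>:<minor> <A|F> <fail_count>" triple, which matches this output but is not checked against every device-mapper version. The function names are hypothetical and not part of multipath-tools.

```python
import re

# One path entry in a dm-multipath status line: "<major>:<minor> <A|F> <fail_count>".
# Assumption: this layout matches the status line shown above.
PATH_RE = re.compile(r"\b(\d+:\d+) ([AF]) (\d+)\b")

def parse_paths(status_line):
    """Return a list of (device, active, fail_count) tuples from a status line."""
    return [(dev, state == "A", int(count))
            for dev, state, count in PATH_RE.findall(status_line)]

def all_paths_failed(status_line):
    """True when the line lists at least one path and none of them is active."""
    paths = parse_paths(status_line)
    return bool(paths) and not any(active for _, active, _ in paths)

status = "status = 1 3 0 1 1 E 0 2 0 8:64 F 3574 8:176 F 3574"
print(parse_paths(status))       # [('8:64', False, 3574), ('8:176', False, 3574)]
print(all_paths_failed(status))  # True
```

With queue_if_no_path enabled, a map in this state queues I/O instead of returning errors, so a check like this is one way to notice the hang from a script.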

This was being spammed into /var/log/messages once every five seconds 
(the multipathd polling interval):

Aug 10 15:35:43 cc42-86 multipathd: 8:64: tur checker reports path is up
Aug 10 15:35:43 cc42-86 multipathd: devmap event (8163) on exports
Aug 10 15:35:43 cc42-86 kernel: device-mapper: dm-multipath: Failing path 8:176.
Aug 10 15:35:43 cc42-86 kernel: device-mapper: dm-multipath: Failing path 8:64.
Aug 10 15:35:43 cc42-86 multipathd: 8:176: tur checker reports path is up
Aug 10 15:35:43 cc42-86 kernel: cdrom: open failed.
Aug 10 15:35:43 cc42-86 kernel: device-mapper: dm-multipath: Failing path 8:176.
Aug 10 15:35:43 cc42-86 kernel: device-mapper: dm-multipath: Failing path 8:64.
Aug 10 15:35:43 cc42-86 kernel: cdrom: open failed.
Aug 10 15:35:43 cc42-86 multipathd: open(/dev/hdc) failed
Aug 10 15:35:43 cc42-86 multipathd: mark 8:64 as failed
Aug 10 15:35:43 cc42-86 multipathd: mark 8:176 as failed
Aug 10 15:35:43 cc42-86 multipathd: devmap event (8164) on exports
Aug 10 15:35:43 cc42-86 kernel: cdrom: open failed.
Aug 10 15:35:43 cc42-86 multipathd: open(/dev/hdc) failed
Aug 10 15:35:43 cc42-86 kernel: cdrom: open failed.
Aug 10 15:35:43 cc42-86 multipathd: open(/dev/hdc) failed

tur sees it up, kernel says it's down, ad infinitum.
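
As a rough sketch, the disagreement could be counted per path from the syslog text. The regular expressions below are written against the messages in this excerpt only and assume each message sits on one line (the kernel lines wrap in this mail). The function name is hypothetical.

```python
import re
from collections import Counter

# Patterns taken from the log excerpt above; other versions may word these differently.
CHECKER_UP = re.compile(r"multipathd: (\d+:\d+): tur checker reports path is up")
KERNEL_FAIL = re.compile(r"dm-multipath: Failing path (\d+:\d+)")

def flapping_paths(lines):
    """Map each device to (checker_up_count, kernel_fail_count) when both are nonzero."""
    up, failed = Counter(), Counter()
    for line in lines:
        m = CHECKER_UP.search(line)
        if m:
            up[m.group(1)] += 1
        m = KERNEL_FAIL.search(line)
        if m:
            failed[m.group(1)] += 1
    return {dev: (up[dev], failed[dev]) for dev in up if failed[dev]}

log = [
    "Aug 10 15:35:43 cc42-86 multipathd: 8:64: tur checker reports path is up",
    "Aug 10 15:35:43 cc42-86 kernel: device-mapper: dm-multipath: Failing path 8:176.",
    "Aug 10 15:35:43 cc42-86 kernel: device-mapper: dm-multipath: Failing path 8:64.",
    "Aug 10 15:35:43 cc42-86 multipathd: 8:176: tur checker reports path is up",
    "Aug 10 15:35:43 cc42-86 kernel: cdrom: open failed.",
]
print(flapping_paths(log))  # {'8:64': (1, 1), '8:176': (1, 1)}
```

A path that appears in the result is one the checker keeps reinstating while the kernel keeps failing it, which is the loop described here.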

Nothing I tried could elicit a more detailed error about why this was 
happening. The mount on top of it is a normal ext3 mount, and wasn't 
being accessed at the time of the failure as far as I know.

I switched off the queue_if_no_path option globally in the 
multipath.conf file. Immediately the ext3 journal failed out, and 
multipath brought both paths back as active:

exports (350002ac0005b02a4)
[size=150 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [active][first]
  \_ 5:0:0:3 sde  8:64    [ready ][active]
  \_ 6:0:1:3 sdl  8:176   [ready ][active]

I was able to fsck the device and remount it without issue or reboot 
after that. Since then, I've left the queue option disabled to see if the 
problem creeps back.

I basically have a default multipath.conf file, with some WWN to alias 
mappings, had the queue_if_no_path option enabled, and the EMC device 
info added. The problem's on the 3par however. Only one of the four 3par 
mounts on the machine was having issues.

Is this known at all? Is there anything else I can provide so that we 
can figure out why this happened? I had been running multipath tools for 
two months on a test box and never encountered this problem. It's only 
snuck up as we've started deploying it on more machines for 
pre-production. All of the servers are identical... redhat ES4.1, same 
qla2300 fiber cards, same CPUs/etc.

We also encountered the EMC ghost LUN issue (discussed on here once), 
which is especially bad if queue_if_no_path is enabled, sometimes 
causing a kernel panic and bringing the machine down :(

Any assistance on the first or second issue would be appreciated!

Thanks,
-Alan


* Re: multipath-tools-0.4.4 on 3par unknown path failure issue
  2005-08-10 21:49 multipath-tools-0.4.4 on 3par unknown path failure issue Alan Kasindorf
@ 2005-08-11 19:55 ` Andy
  2005-08-11 20:19   ` Alan Kasindorf
  0 siblings, 1 reply; 7+ messages in thread
From: Andy @ 2005-08-11 19:55 UTC
  To: device-mapper development

On Wed, Aug 10, 2005 at 05:49:40PM -0400, Alan Kasindorf wrote:
> Hey,
> 
> 
> At some random point in time today, one of the machines lost one of its 
> four 3par mounts. All other mounts worked fine. This has happened once 
> or twice before as well, but we rebooted before I had time to inspect 
> the issue.
> 
> Is this known at all? Is there anything else I can provide so that we 
> can figure out why this happened? I had been running multipath tools for 
> two months on a test box and never encountered this problem. It's only 
> snuck up as we've started deploying it on more machines for 
>
I've had problems like this happen to me on 3par too.  What kernel version
are you using?  It almost always happened when the SAN got an RSCN (usually
when another server was rebooted).  I found, at least in kernel 2.6.11.7,
that if I changed the line

bio->bi_rw |= (1 << BIO_RW_FAILFAST); to
bio->bi_rw |= (0 << BIO_RW_FAILFAST); 

in drivers/md/dm-mpath.c

the problem went away.  Now, in the newest kernels, after there was a big
change to the qla drivers (2.6.12-rc? and beyond, I believe) I did not need
to do the above change, but I now get aborts sometimes (these aborts
apparently come from the qlogic card).  The aborts recover, but I have been
unable to determine why I am getting them.
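
To illustrate what that one-character change does: assuming the statement is the bitwise-OR assignment (`|=`), shifting a 1 sets the failfast bit on the bio, while shifting a 0 ORs in nothing and leaves the flags untouched. This is a small Python model of the arithmetic, not kernel code. The bit position used here is an assumption; check include/linux/bio.h in your tree for the real value.

```python
# Assumed bit position of the failfast flag; verify against include/linux/bio.h.
BIO_RW_FAILFAST = 3

def or_in_failfast(bi_rw, value):
    """Model of `bio->bi_rw |= (value << BIO_RW_FAILFAST)`."""
    return bi_rw | (value << BIO_RW_FAILFAST)

bi_rw = 0b0001                       # some pre-existing flag bit
stock = or_in_failfast(bi_rw, 1)     # original line: failfast bit gets set
patched = or_in_failfast(bi_rw, 0)   # modified line: flags are left as they were

print(bin(stock))    # 0b1001
print(bin(patched))  # 0b1
```

So the edit effectively stops dm-multipath from requesting failfast on the bios it maps, which would let the lower layers retry instead of failing the path right away.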

Andy


* Re: multipath-tools-0.4.4 on 3par unknown path failure issue
  2005-08-11 19:55 ` Andy
@ 2005-08-11 20:19   ` Alan Kasindorf
  2005-08-12 16:04     ` Andy
  2005-08-17 19:38     ` Alan Kasindorf
  0 siblings, 2 replies; 7+ messages in thread
From: Alan Kasindorf @ 2005-08-11 20:19 UTC
  To: device-mapper development


> I've had problems like this happen to me on 3par too.  What kernel version
> are you using?  It almost always happened when the SAN got an RSCN (usually
> when another server was rebooted).  I found, at least in kernel 2.6.11.7,
> that if I changed the line
> 
> bio->bi_rw |= (1 << BIO_RW_FAILFAST); to
> bio->bi_rw |= (0 << BIO_RW_FAILFAST); 
> 
> in drivers/md/dm-mpath.c
> 
> the problem went away.  Now, in the newest kernels, after there was a big
> change to the qla drivers (2.6.12-rc? and beyond, I believe) I did not need
> to do the above change, but I now get aborts sometimes (these aborts
> apparently come from the qlogic card).  The aborts recover, but I have been
> unable to determine why I am getting them.
> 
> Andy

We're running 2.6.9-11.ELsmp, off of redhat ES 4.1. I don't exactly have 
the entire list of redhat patches on hand, so I can't say for sure. Nor 
can I actually modify our kernel without losing support to the box. If 
this is fixed with a kernel upgrade, we can open a support ticket from 
redhat and scream/yell until they apply the patch.

However, I'd like to know what the exact issue is. I'm not exactly great 
at diagnosing issues with the Linux kernel right now. How were you 
monitoring what events the SAN was sending up through the card? I could 
use this to at least verify what is happening if/when we lose another 
mount. None of our servers were being rebooted when this happened though.

-Alan


* Re: multipath-tools-0.4.4 on 3par unknown path failure issue
  2005-08-11 20:19   ` Alan Kasindorf
@ 2005-08-12 16:04     ` Andy
  2005-08-17 19:38     ` Alan Kasindorf
  1 sibling, 0 replies; 7+ messages in thread
From: Andy @ 2005-08-12 16:04 UTC
  To: device-mapper development

On Thu, Aug 11, 2005 at 04:19:24PM -0400, Alan Kasindorf wrote:
> 
> >I've had problems like this happen to me on 3par too.  What kernel version
> 
> However, I'd like to know what the exact issue is. I'm not exactly great 
> at diagnosing issues with the Linux kernel right now. How were you 
> monitoring what events the SAN was sending up through the card? I could 
> use this to at least verify what is happening if/when we lose another 
> mount. None of our servers were being rebooted when this happened though.
> 
My issue was that I was getting I/O errors when other systems were rebooted
(and occasionally at other times, but reboots would consistently cause a
problem), and those I/O errors would cause the mount to drop.  Somehow I got
the idea that turning off failfast might help, and it did.  I have not
monitored what events the SAN was sending the card; all I did was turn on
debugging in the qla2xxx drivers and ask Andrew Vasquez (qlogic developer)
about them.

Andy


* Re: multipath-tools-0.4.4 on 3par unknown path failure issue
  2005-08-11 20:19   ` Alan Kasindorf
  2005-08-12 16:04     ` Andy
@ 2005-08-17 19:38     ` Alan Kasindorf
  2005-08-17 20:08       ` Ed Wilts
  1 sibling, 1 reply; 7+ messages in thread
From: Alan Kasindorf @ 2005-08-17 19:38 UTC
  To: device-mapper development

I could use some suggestions on how to approach the issue at this 
point... We're in a bad position because this bug did not surface 
during our single-machine testing.

I have tried adjusting QLogic timeouts and any related parameters to see 
if a higher or lower timeout would affect the issue at all, and it's 
done nothing but make the problem worse.

I'm confident in what folks have said about 2.6.12 and later not 
exhibiting this issue. However, I have a redhat kernel running Oracle, 
and we need them to not drop our support. Currently redhat's support is 
ignoring me as well. Should I talk with the QLogic maintainer at all? 
3par has no clue, unless they're hiding their "Linux guy" from me.

Does SuSE professional have this patched up? We might be able to switch 
to their distro if they will actually support multipathing on a database 
in their mainline kernel, vs not at all.

Thanks,
-Alan

> We're running 2.6.9-11.ELsmp, off of redhat ES 4.1. I don't exactly have 
> the entire list of redhat patches on hand, so I can't say for sure. Nor 
> can I actually modify our kernel without losing support to the box. If 
> this is fixed with a kernel upgrade, we can open a support ticket from 
> redhat and scream/yell until they apply the patch.


* Re: multipath-tools-0.4.4 on 3par unknown path failure issue
  2005-08-17 19:38     ` Alan Kasindorf
@ 2005-08-17 20:08       ` Ed Wilts
  2005-08-17 20:25         ` Alan Kasindorf
  0 siblings, 1 reply; 7+ messages in thread
From: Ed Wilts @ 2005-08-17 20:08 UTC
  To: device-mapper development

On Wed, Aug 17, 2005 at 03:38:38PM -0400, Alan Kasindorf wrote:
> I have tried adjusting QLogic timeouts and any related parameters to see 
> if a higher or lower timeout would affect the issue at all, and it's 
> done nothing but make the problem worse.
> 
> I'm confident in what folks have said about 2.6.12 and later not 
> exhibiting this issue. However, I have a redhat kernel running Oracle, 
> and we need them to not drop our support. Currently redhat's support is 
> ignoring me as well. Should I talk with the QLogic maintainer at all? 
> 3par has no clue, unless they're hiding their "Linux guy" from me.

The current Red Hat qlogic driver doesn't support most of the adjustable
parameters.  However, the one you can download from qlogic does.  Red
Hat will include this driver in its next release (beta in about a week,
production in mid-September).  In addition, Red Hat will support
multipathing to at least the same level as what SuSE supports today:
http://portal.suse.com/sdb/en/2005/04/sles_multipathing.html

> Does SuSE professional have this patched up? We might be able to switch 
> to their distro if they will actually support multipathing on a database 
> in their mainline kernel, vs not at all.

Red Hat is just a bit behind, but if you really need this in production,
I'd suggest you contact your support and/or sales rep to ask them to
support you with the beta release.  How much luck you'll have, I don't
know.

-- 
Ed Wilts, RHCE
Mounds View, MN, USA
mailto:ewilts@ewilts.org
Member #1, Red Hat Community Ambassador Program


* Re: multipath-tools-0.4.4 on 3par unknown path failure issue
  2005-08-17 20:08       ` Ed Wilts
@ 2005-08-17 20:25         ` Alan Kasindorf
  0 siblings, 0 replies; 7+ messages in thread
From: Alan Kasindorf @ 2005-08-17 20:25 UTC
  To: device-mapper development


> The current Red Hat qlogic driver doesn't support most of the adjustable
> parameters.  However, the one you can download from qlogic does.  Red
> Hat will include this driver in its next release (beta in about a week,
> production in mid-September).  In addition, Red Hat will support
> multipathing to at least the same level as what SuSE supports today:
> http://portal.suse.com/sdb/en/2005/04/sles_multipathing.html

I haven't used a QLogic driver with those tunables, and the ones shown 
to me via `modinfo qla2xxx` have been sufficient for fixing most of our 
latency related problems. I don't believe any of these tunables will be 
able to fix this problem though. Both paths go down in the same second, 
so there is no way to support the device. On that note however, I assume 
the U2 release will have both the mpath driver and the QLogic driver 
updated?

> Red Hat is just a bit behind, but if you really need this in production,
> I'd suggest you contact your support and/or sales rep to ask them to
> support you with the beta release.  How much luck you'll have, I don't
> know.

All I really care about is getting a workaround, or getting a redhat 
kernel which is at least on par with a stable vanilla release of Linux. 
Having consistently subpar stable releases of redhat software, and 
absolutely no option to bring it up to the quality of a stable Linux 
release is possibly going to result in us switching our contracts at 
this point. I have contacted support, and am still waiting for some kind 
of response. Last time we talked with them about multipath support, they 
said to wait for U1 then it'd be good. Now we get to start over? :)

Thanks,
-Alan


