* multipath-tools-0.4.4 on 3par unknown path failure issue
From: Alan Kasindorf @ 2005-08-10 21:49 UTC
To: dm-devel
Hey,
I have ~10 machines running multipath-tools-0.4.4 on RHEL ES 4.1 (latest
everything). The machines have multipathed mounts on an EMC CLARiiON
and a 3PAR SAN device, over the same fabric.
At some random point in time today, one of the machines lost one of its
four 3par mounts. All other mounts worked fine. This has happened once
or twice before as well, but we rebooted before I had time to inspect
the issue.
multipath -v3 -l showed this status on the bad path:
params = 1 queue_if_no_path 0 1 1 round-robin 0 2 1 8:64 1000 8:176 1000
status = 1 3 0 1 1 E 0 2 0 8:64 F 3574 8:176 F 3574
exports (350002ac0005b02a4)
[size=150 GB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [enabled][first]
\_ 5:0:0:3 sde 8:64 [ready ][failed]
\_ 6:0:1:3 sdl 8:176 [ready ][failed]
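For anyone reading the status line above: as far as I understand the dm-multipath status output, each path is reported at the end as a device number, a state flag, and a fail count, so `8:64 F 3574` would mean path 8:64 is marked failed with 3574 recorded failures. A minimal Python sketch that pulls those triples out of the string. The reading of `A`/`F` as active/failed is my assumption and is not stated anywhere in this thread, and `parse_path_states` is a made-up helper name.

```python
import re

# Matches "<major>:<minor> <A|F> <fail_count>" triples in a dm-multipath
# status line. Treating A as active and F as failed is an assumption.
PATH_RE = re.compile(r"(\d+:\d+) ([AF]) (\d+)")

def parse_path_states(status):
    """Return a list of (device, state, fail_count) tuples."""
    return [(dev, state, int(count))
            for dev, state, count in PATH_RE.findall(status)]

# Status line copied from the multipath -v3 -l output above.
status = "1 3 0 1 1 E 0 2 0 8:64 F 3574 8:176 F 3574"

for dev, state, fails in parse_path_states(status):
    label = "failed" if state == "F" else "active"
    print(f"{dev}: {label}, fail count {fails}")
```

Both paths come out as failed with the same fail count, which fits both being failed together on every polling cycle.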
This was being spammed into /var/log/messages once every five seconds
(the multipathd polling interval):
Aug 10 15:35:43 cc42-86 multipathd: 8:64: tur checker reports path is up
Aug 10 15:35:43 cc42-86 multipathd: devmap event (8163) on exports
Aug 10 15:35:43 cc42-86 kernel: device-mapper: dm-multipath: Failing path 8:176.
Aug 10 15:35:43 cc42-86 kernel: device-mapper: dm-multipath: Failing path 8:64.
Aug 10 15:35:43 cc42-86 multipathd: 8:176: tur checker reports path is up
Aug 10 15:35:43 cc42-86 kernel: cdrom: open failed.
Aug 10 15:35:43 cc42-86 kernel: device-mapper: dm-multipath: Failing path 8:176.
Aug 10 15:35:43 cc42-86 kernel: device-mapper: dm-multipath: Failing path 8:64.
Aug 10 15:35:43 cc42-86 kernel: cdrom: open failed.
Aug 10 15:35:43 cc42-86 multipathd: open(/dev/hdc) failed
Aug 10 15:35:43 cc42-86 multipathd: mark 8:64 as failed
Aug 10 15:35:43 cc42-86 multipathd: mark 8:176 as failed
Aug 10 15:35:43 cc42-86 multipathd: devmap event (8164) on exports
Aug 10 15:35:43 cc42-86 kernel: cdrom: open failed.
Aug 10 15:35:43 cc42-86 multipathd: open(/dev/hdc) failed
Aug 10 15:35:43 cc42-86 kernel: cdrom: open failed.
Aug 10 15:35:43 cc42-86 multipathd: open(/dev/hdc) failed
tur sees it up, kernel says it's down, ad infinitum.
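One way to make that contradiction visible is to tally, per path, how often the checker reports it up versus how often the kernel fails it. A rough Python sketch, assuming the message wording from the excerpt above and that each log entry sits on a single line; `tally_path_events` is a hypothetical helper, not part of multipath-tools.

```python
import re
from collections import Counter

# Patterns taken from the wording of the log excerpt above.
CHECKER_UP = re.compile(r"multipathd: (\d+:\d+): tur checker reports path is up")
KERNEL_FAIL = re.compile(r"dm-multipath: Failing path (\d+:\d+)")

def tally_path_events(lines):
    """Count checker-up and kernel-failing messages per major:minor."""
    up, failed = Counter(), Counter()
    for line in lines:
        m = CHECKER_UP.search(line)
        if m:
            up[m.group(1)] += 1
            continue
        m = KERNEL_FAIL.search(line)
        if m:
            failed[m.group(1)] += 1
    return up, failed

# Four entries from the excerpt, with wrapped lines joined.
sample = [
    "Aug 10 15:35:43 cc42-86 multipathd: 8:64: tur checker reports path is up",
    "Aug 10 15:35:43 cc42-86 kernel: device-mapper: dm-multipath: Failing path 8:176.",
    "Aug 10 15:35:43 cc42-86 kernel: device-mapper: dm-multipath: Failing path 8:64.",
    "Aug 10 15:35:43 cc42-86 multipathd: 8:176: tur checker reports path is up",
]

up, failed = tally_path_events(sample)
for dev in sorted(set(up) | set(failed)):
    print(f"{dev}: checker up x{up[dev]}, kernel failed x{failed[dev]}")
```

A path that keeps scoring in both columns during the same interval is one the checker considers healthy while I/O through it is still being failed.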
Nothing I tried could elicit a more detailed error about why this was
happening. The mount on top of it is a normal ext3 mount, and wasn't
being accessed at the time of the failure as far as I know.
I switched off the queue_if_no_path option globally in the
multipath.conf file. Immediately the ext3 journal failed out, and
multipath brought both paths back as active:
exports (350002ac0005b02a4)
[size=150 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [active][first]
\_ 5:0:0:3 sde 8:64 [ready ][active]
\_ 6:0:1:3 sdl 8:176 [ready ][active]
I was able to fsck the device and remount it without issue or reboot
after that. Since then, I've left the queue option disabled to see if the
problem creeps back.
I basically have a default multipath.conf file, with some WWN to alias
mappings, had the queue_if_no_path option enabled, and the EMC device
info added. The problem's on the 3par however. Only one of the four 3par
mounts on the machine was having issues.
Is this known at all? Is there anything else I can provide so that we
can figure out why this happened? I had been running multipath tools for
two months on a test box and never encountered this problem. It's only
snuck up as we've started deploying it on more machines for
pre-production. All of the servers are identical... redhat ES4.1, same
qla2300 fiber cards, same CPUs/etc.
We also encountered the EMC ghost LUN issue (discussed on here once),
which is especially bad if queue_if_no_path is enabled, sometimes
causing a kernel panic and bringing the machine down :(
Any assistance on the first or second issue would be appreciated!
Thanks,
-Alan
* Re: multipath-tools-0.4.4 on 3par unknown path failure issue
From: Andy @ 2005-08-11 19:55 UTC
To: device-mapper development
On Wed, Aug 10, 2005 at 05:49:40PM -0400, Alan Kasindorf wrote:
> Hey,
>
>
> At some random point in time today, one of the machines lost one of its
> four 3par mounts. All other mounts worked fine. This has happened once
> or twice before as well, but we rebooted before I had time to inspect
> the issue.
>
> Is this known at all? Is there anything else I can provide so that we
> can figure out why this happened? I had been running multipath tools for
> two months on a test box and never encountered this problem. It's only
> snuck up as we've started deploying it on more machines for
>
I've had problems like this happen to me on 3par too. What kernel version
are you using? It almost always happened when the SAN got an RSCN (usually
when another server was rebooted). I found that, at least in kernel 2.6.11.7,
if I changed the line
bio->bi_rw |= (1 << BIO_RW_FAILFAST); to
bio->bi_rw |= (0 << BIO_RW_FAILFAST);
in drivers/md/dm-mpath.c
the problem went away. Now, in the newest kernels, after there was a big
change to the qla drivers (2.6.12-rc? and beyond, I believe) I did not need
to do the above change, but I now get aborts sometimes (these aborts
apparently come from the qlogic card). The aborts recover, but I have been
unable to determine why I am getting them.
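To spell out why that one-character edit matters: assuming the statement is meant to OR the failfast bit into the request flags, shifting 0 instead of 1 yields 0, and OR-ing 0 into the flags leaves them untouched, so the bio is never marked failfast. A small illustration in Python rather than kernel C. The bit position used for BIO_RW_FAILFAST here is an arbitrary stand-in, not the kernel's value, and `set_flag` is a made-up helper.

```python
# Placeholder bit position for illustration only; NOT taken from kernel headers.
BIO_RW_FAILFAST = 3

def set_flag(flags, value, bit):
    """Mimic the C statement `flags |= (value << bit)`."""
    return flags | (value << bit)

flags = 0b0001  # some pre-existing request flags

with_failfast = set_flag(flags, 1, BIO_RW_FAILFAST)     # original line
without_failfast = set_flag(flags, 0, BIO_RW_FAILFAST)  # modified line

print(bin(with_failfast))     # failfast bit is now set
print(bin(without_failfast))  # flags are unchanged
```

So the change amounts to turning failfast off for I/O submitted through dm-multipath, without touching anything else in the function.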
Andy
* Re: multipath-tools-0.4.4 on 3par unknown path failure issue
From: Alan Kasindorf @ 2005-08-11 20:19 UTC
To: device-mapper development
> I've had problems like this happen to me on 3par too. What kernel version
> are you using? It almost always happened when the SAN got an RSCN (usually
> when another server was rebooted). I found that, at least in kernel 2.6.11.7,
> if I changed the line
>
> bio->bi_rw |= (1 << BIO_RW_FAILFAST); to
> bio->bi_rw |= (0 << BIO_RW_FAILFAST);
>
> in drivers/md/dm-mpath.c
>
> the problem went away. Now, in the newest kernels, after there was a big
> change to the qla drivers (2.6.12-rc? and beyond, I believe) I did not need
> to do the above change, but I now get aborts sometimes (these aborts
> apparently come from the qlogic card). The aborts recover, but I have been
> unable to determine why I am getting them.
>
> Andy
We're running 2.6.9-11.ELsmp, off of redhat ES 4.1. I don't exactly have
the entire list of redhat patches on hand, so I can't say for sure. Nor
can I actually modify our kernel without losing support to the box. If
this is fixed with a kernel upgrade, we can open a support ticket from
redhat and scream/yell until they apply the patch.
However, I'd like to know what the exact issue is. I'm not exactly great
on eliciting issues with the linux kernel right now. How were you
monitoring what events the SAN was sending up through the card? I could
use this to at least verify what is happening if/when we lose another
mount. None of our servers were being rebooted when this happened though.
-Alan
* Re: multipath-tools-0.4.4 on 3par unknown path failure issue
From: Andy @ 2005-08-12 16:04 UTC
To: device-mapper development
On Thu, Aug 11, 2005 at 04:19:24PM -0400, Alan Kasindorf wrote:
>
> >I've had problems like this happen to me on 3par too. What kernel version
>
> However, I'd like to know what the exact issue is. I'm not exactly great
> on eliciting issues with the linux kernel right now. How were you
> monitoring what events the SAN was sending up through the card? I could
> use this to at least verify what is happening if/when we lose another
> mount. None of our servers were being rebooted when this happened though.
>
My issue was that I was getting I/O errors when other systems were rebooted
(and occasionally at other times, but reboots would consistently cause a
problem), and those I/O errors would cause the mount to drop. Somehow I got
the idea that turning off failfast might help, and it did. I have not
monitored what events the SAN was sending the card; all I did was turn on
debugging in the qla2xxx drivers and ask Andrew Vasquez (qlogic developer)
about them.
Andy
* Re: multipath-tools-0.4.4 on 3par unknown path failure issue
From: Alan Kasindorf @ 2005-08-17 19:38 UTC
To: device-mapper development
I could use some suggestions on how to approach the issue at this
point... We're in a bad position because this bug did not show up
during single-instance testing.
I have tried adjusting QLogic timeouts and any related parameters to see
if a higher or lower timeout would affect the issue at all, and it's
done nothing but make the problem worse.
I'm confident in what folks have said about 2.6.12 and later not
exhibiting this issue. However, I have a redhat kernel running Oracle,
and we need them to not drop our support. Currently redhat's support is
ignoring me as well. Should I talk with the QLogic maintainer at all?
3par has no clue, unless they're hiding their "Linux guy" from me.
Does SuSE professional have this patched up? We might be able to switch
to their distro if they will actually support multipathing on a database
in their mainline kernel, vs not at all.
Thanks,
-Alan
> We're running 2.6.9-11.ELsmp, off of redhat ES 4.1. I don't exactly have
> the entire list of redhat patches on hand, so I can't say for sure. Nor
> can I actually modify our kernel without losing support to the box. If
> this is fixed with a kernel upgrade, we can open a support ticket from
> redhat and scream/yell until they apply the patch.
* Re: multipath-tools-0.4.4 on 3par unknown path failure issue
From: Ed Wilts @ 2005-08-17 20:08 UTC
To: device-mapper development
On Wed, Aug 17, 2005 at 03:38:38PM -0400, Alan Kasindorf wrote:
> I have tried adjusting QLogic timeouts and any related parameters to see
> if a higher or lower timeout would affect the issue at all, and it's
> done nothing but make the problem worse.
>
> I'm confident in what folks have said about 2.6.12 and later not
> exhibiting this issue. However, I have a redhat kernel running Oracle,
> and we need them to not drop our support. Currently redhat's support is
> ignoring me as well. Should I talk with the QLogic maintainer at all?
> 3par has no clue, unless they're hiding their "Linux guy" from me.
The current Red Hat qlogic driver doesn't support most of the adjustable
parameters. However, the one you can download from qlogic does. Red
Hat will include this driver in its next release (beta in about a week,
production in mid-September). In addition, Red Hat will support
multipathing to at least the same level as what SuSE supports today:
http://portal.suse.com/sdb/en/2005/04/sles_multipathing.html
> Does SuSE professional have this patched up? We might be able to switch
> to their distro if they will actually support multipathing on a database
> in their mainline kernel, vs not at all.
Red Hat is just a bit behind, but if you really need this in production,
I'd suggest you contact your support and/or sales rep to ask them to
support you with the beta release. How much luck you'll have, I don't
know.
--
Ed Wilts, RHCE
Mounds View, MN, USA
mailto:ewilts@ewilts.org
Member #1, Red Hat Community Ambassador Program
* Re: multipath-tools-0.4.4 on 3par unknown path failure issue
From: Alan Kasindorf @ 2005-08-17 20:25 UTC
To: device-mapper development
> The current Red Hat qlogic driver doesn't support most of the adjustable
> parameters. However, the one you can download from qlogic does. Red
> Hat will include this driver in its next release (beta in about a week,
> production in mid-September). In addition, Red Hat will support
> multipathing to at least the same level as what SuSE supports today:
> http://portal.suse.com/sdb/en/2005/04/sles_multipathing.html
I haven't used a QLogic driver with those tunables, and the ones shown
to me via `modinfo qla2xxx` have been sufficient for fixing most of our
latency related problems. I don't believe any of these tunables will be
able to fix this problem though. Both paths go down in the same second,
so there is no way to support the device. On that note however, I assume
the U2 release will have both the mpath driver and the QLogic driver
updated?
> Red Hat is just a bit behind, but if you really need this in production,
> I'd suggest you contact your support and/or sales rep to ask them to
> support you with the beta release. How much luck you'll have, I don't
> know.
All I really care about is getting a workaround, or getting a redhat
kernel which is at least on par with a stable vanilla release of Linux.
Having consistently subpar stable releases of redhat software, and
absolutely no option to bring it up to the quality of a stable Linux
release, is possibly going to result in us switching our contracts at
this point. I have contacted support, and am still waiting for some kind
of response. Last time we talked with them about multipath support, they
said to wait for U1 then it'd be good. Now we get to start over? :)
Thanks,
-Alan