From mboxrd@z Thu Jan 1 00:00:00 1970
From: Alan Kasindorf
Subject: multipath-tools-0.4.4 on 3par unknown path failure issue
Date: Wed, 10 Aug 2005 17:49:40 -0400
Message-ID: <42FA7674.4070201@mail.communityconnect.com>
Reply-To: device-mapper development
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Sender: dm-devel-bounces@redhat.com
Errors-To: dm-devel-bounces@redhat.com
To: dm-devel@redhat.com
List-Id: dm-devel.ids

Hey,

I have ~10 machines running multipath-tools-0.4.4 on RHEL ES 4.1 (latest everything). The machines mount multipathed devices from an EMC clariion and a 3PAR SAN device, over the same fabric.

At some point today, one of the machines lost one of its four 3par mounts. All other mounts worked fine. This has happened once or twice before as well, but we rebooted before I had time to inspect the issue.

multipath -v3 -l showed this status on the bad path:

params = 1 queue_if_no_path 0 1 1 round-robin 0 2 1 8:64 1000 8:176 1000
status = 1 3 0 1 1 E 0 2 0 8:64 F 3574 8:176 F 3574
exports (350002ac0005b02a4)
[size=150 GB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [enabled][first]
  \_ 5:0:0:3 sde 8:64 [ready ][failed]
  \_ 6:0:1:3 sdl 8:176 [ready ][failed]

This was being spammed into /var/log/messages once every five seconds (the multipathd polling interval):

Aug 10 15:35:43 cc42-86 multipathd: 8:64: tur checker reports path is up
Aug 10 15:35:43 cc42-86 multipathd: devmap event (8163) on exports
Aug 10 15:35:43 cc42-86 kernel: device-mapper: dm-multipath: Failing path 8:176.
Aug 10 15:35:43 cc42-86 kernel: device-mapper: dm-multipath: Failing path 8:64.
Aug 10 15:35:43 cc42-86 multipathd: 8:176: tur checker reports path is up
Aug 10 15:35:43 cc42-86 kernel: cdrom: open failed.
Aug 10 15:35:43 cc42-86 kernel: device-mapper: dm-multipath: Failing path 8:176.
Aug 10 15:35:43 cc42-86 kernel: device-mapper: dm-multipath: Failing path 8:64.
Aug 10 15:35:43 cc42-86 kernel: cdrom: open failed.
Aug 10 15:35:43 cc42-86 multipathd: open(/dev/hdc) failed
Aug 10 15:35:43 cc42-86 multipathd: mark 8:64 as failed
Aug 10 15:35:43 cc42-86 multipathd: mark 8:176 as failed
Aug 10 15:35:43 cc42-86 multipathd: devmap event (8164) on exports
Aug 10 15:35:43 cc42-86 kernel: cdrom: open failed.
Aug 10 15:35:43 cc42-86 multipathd: open(/dev/hdc) failed
Aug 10 15:35:43 cc42-86 kernel: cdrom: open failed.
Aug 10 15:35:43 cc42-86 multipathd: open(/dev/hdc) failed

The tur checker sees the path as up, the kernel says it's down, ad infinitum. Nothing I tried could elicit a more detailed error about why this was happening. The mount on top of it is a normal ext3 mount, and wasn't being accessed at the time of the failure as far as I know.

I switched off the queue_if_no_path option globally in the multipath.conf file. Immediately the ext3 journal failed out, and multipath brought both paths back as active:

exports (350002ac0005b02a4)
[size=150 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [active][first]
  \_ 5:0:0:3 sde 8:64 [ready ][active]
  \_ 6:0:1:3 sdl 8:176 [ready ][active]

After that I was able to fsck the device and remount it without further issue and without a reboot. Since then, I've left the queue option disabled to see if the problem creeps back.

I basically have a default multipath.conf file, with some WWN to alias mappings, the queue_if_no_path option enabled (until today), and the EMC device info added. The problem is on the 3par, however. Only one of the four 3par mounts on the machine was having issues.

Is this known at all? Is there anything else I can provide so that we can figure out why this happened? I had been running multipath-tools for two months on a test box and never encountered this problem. It has only snuck up as we've started deploying it on more machines for pre-production. All of the servers are identical: Red Hat ES 4.1, same qla2300 fiber cards, same CPUs, etc.

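In case it helps anyone check other boxes for the same state, here is a rough Python sketch (not part of multipath-tools) that scans a multipath -l listing in the format pasted above and reports paths where the checker column says "ready" while the device-mapper column says "failed". The function name find_mismatched_paths and the regex are illustrative assumptions based only on the two listings in this message; other versions of the tools may print a different layout.

```python
import re

# Matches a path line in the layout shown in the listings above, e.g.
#   \_ 5:0:0:3 sde 8:64 [ready ][failed]
# Path-group lines such as "\_ round-robin 0 [enabled][first]" do not match,
# because they carry no H:C:I:L address.
PATH_LINE = re.compile(
    r"\\_\s+(?P<hcil>\d+:\d+:\d+:\d+)\s+(?P<dev>\S+)\s+"
    r"(?P<majmin>\d+:\d+)\s+\[(?P<checker>\w+)\s*\]\[(?P<dm>\w+)\s*\]"
)


def find_mismatched_paths(listing):
    """Return (hcil, dev, major:minor) for every path that the checker
    reports as ready but device-mapper reports as failed."""
    hits = []
    for m in PATH_LINE.finditer(listing):
        if m.group("checker") == "ready" and m.group("dm") == "failed":
            hits.append((m.group("hcil"), m.group("dev"), m.group("majmin")))
    return hits


# The failed map from this message, used as sample input.
SAMPLE = r"""exports (350002ac0005b02a4)
[size=150 GB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [enabled][first]
  \_ 5:0:0:3 sde 8:64 [ready ][failed]
  \_ 6:0:1:3 sdl 8:176 [ready ][failed]
"""

for hcil, dev, majmin in find_mismatched_paths(SAMPLE):
    print("checker/dm mismatch on %s (%s, %s)" % (dev, hcil, majmin))
```

Feeding it the captured output of multipath -l from cron would at least flag the condition before a mount goes away; it says nothing about why the kernel keeps failing the paths.
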
We also encountered the EMC ghost LUN issue (discussed on here once), which is especially bad if queue_if_no_path is enabled, sometimes causing a kernel panic and bringing the machine down :(

Any assistance on the first or second issue would be appreciated!

Thanks,
-Alan