* Failed path will not be recovered when disabling/enabling remote port
@ 2009-07-02 11:27 Christian May
  2009-07-02 11:44 ` Hannes Reinecke
  2009-07-02 17:51 ` Chandra Seetharaman
  0 siblings, 2 replies; 8+ messages in thread
From: Christian May @ 2009-07-02 11:27 UTC (permalink / raw)
  To: dm-devel

Hi,

I've set up an IBM z10 LPAR (mainframe server) with a 2.6.30 kernel.
Attached to the System z10 was an IBM DS8000 storage server. 10x SCSI
LUNs were assigned to the LPAR via two paths:

Example:
36005076303ffc1040000000000001269 dm-9 IBM,2107900
size=1.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=-2 status=active
  |- 0:0:0:1080639506 sdw 65:96 active undef running
  `- 1:0:1:1080639506 sdt 65:48 active undef running

Special parameter setting: dev_loss_tmo=90sec; fast_io_fail_tmo=5sec

multipath tools: multipath-tools v0.4.9 (04/04, 2009)
device-mapper: device-mapper-1.02.27-7.fc10.s390x,
device-mapper-libs-1.02.27-7.fc10.s390x

When removing a remote port (disabling a port on the BROCADE FC switch)
one path failed.

root@h42lp26/ESAME:~] > multipath -l
36005076303ffc1040000000000001268 dm-8 ,
size=1.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=-2 status=active
  |- #:#:#:# - #:# failed undef running
  `- 1:0:1:1080573970 sdr 65:16 active undef running

After a while (>90sec) the SCSI LUNs were removed from the system:

UEVENT[1246531815.619428] add /kernel/uids/74 (uids)
UDEV [1246531815.621708] add /kernel/uids/74 (uids)
UEVENT[1246531816.725299] remove /kernel/uids/74 (uids)
UDEV [1246531816.726151] remove /kernel/uids/74 (uids)
UEVENT[1246531929.959709] change /devices/virtual/block/dm-0 (block)
UEVENT[1246531929.959749] change /devices/virtual/block/dm-3 (block)
UEVENT[1246531929.959759] change /devices/virtual/block/dm-4 (block)
UEVENT[1246531929.959769] change /devices/virtual/block/dm-5 (block)
UEVENT[1246531929.966647] change /devices/virtual/block/dm-7 (block)
UDEV [1246531930.045444] change /devices/virtual/block/dm-4 (block)
UDEV [1246531930.048923] change /devices/virtual/block/dm-7 (block)
UDEV [1246531930.054614] change /devices/virtual/block/dm-0 (block)
UDEV [1246531930.060091] change /devices/virtual/block/dm-3 (block)
UDEV [1246531930.071744] change /devices/virtual/block/dm-5 (block)
UEVENT[1246531949.278541] change /devices/virtual/block/dm-9 (block)
UDEV [1246531949.369690] change /devices/virtual/block/dm-9 (block)
UEVENT[1246531950.295756] change /devices/virtual/block/dm-8 (block)
UEVENT[1246531950.297597] change /devices/virtual/block/dm-6 (block)
UEVENT[1246531950.297610] change /devices/virtual/block/dm-2 (block)
UEVENT[1246531950.297620] change /devices/virtual/block/dm-1 (block)
UDEV [1246531950.430097] change /devices/virtual/block/dm-8 (block)
UDEV [1246531950.588626] change /devices/virtual/block/dm-2 (block)
UDEV [1246531950.632482] change /devices/virtual/block/dm-1 (block)
UDEV [1246531950.634515] change /devices/virtual/block/dm-6 (block)
UEVENT[1246532034.277177] remove /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080377362/scsi_generic/sg0 (scsi_generic)
UEVENT[1246532034.277214] remove /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080377362/scsi_device/0:0:0:1080377362 (scsi_device)
UEVENT[1246532034.277226] remove /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080377362/scsi_disk/0:0:0:1080377362 (scsi_disk)
UEVENT[1246532034.277236] remove /devices/virtual/bdi/8:0 (bdi)
UEVENT[1246532034.277247] remove /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080377362/block/sda (block)
UEVENT[1246532034.277258] remove /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080377362 (scsi)
UEVENT[1246532034.277384] remove /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080836114/scsi_generic/sg2 (scsi_generic)
UEVENT[1246532034.277594] remove /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080836114/scsi_device/0:0:0:1080836114 (scsi_device)
UEVENT[1246532034.277864] remove /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080836114/scsi_disk/0:0:0:1080836114 (scsi_disk)
UEVENT[1246532034.278035] remove /devices/virtual/bdi/8:32 (bdi)...

....

When re-enabling the path, the SCSI LUNs were reassigned to the system but
the path didn't recover:

UEVENT[1246532107.387169] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080836114 (scsi)
UEVENT[1246532107.387209] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080836114/scsi_device/0:0:0:1080836114 (scsi_device)
UEVENT[1246532107.387220] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080836114/scsi_generic/sg0 (scsi_generic)
UEVENT[1246532107.387230] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080836114/scsi_disk/0:0:0:1080836114 (scsi_disk)
UEVENT[1246532107.388941] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080377362 (scsi)
UEVENT[1246532107.388952] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080377362/scsi_device/0:0:0:1080377362 (scsi_device)
UEVENT[1246532107.388963] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080377362/scsi_generic/sg2 (scsi_generic)
UEVENT[1246532107.397111] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080836114/block/sdu (block)
UEVENT[1246532107.399249] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080639506 (scsi)
UEVENT[1246532107.399261] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080639506/scsi_device/0:0:0:1080639506 (scsi_device)
UEVENT[1246532107.399272] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080639506/scsi_generic/sg4 (scsi_generic)
UEVENT[1246532107.399711] add /devices/virtual/bdi/65:64 (bdi)
UEVENT[1246532107.399722] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080377362/scsi_disk/0:0:0:1080377362 (scsi_disk)
UEVENT[1246532107.401605] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080573970 (scsi)
UEVENT[1246532107.401617] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080573970/scsi_device/0:0:0:1080573970 (scsi_device)
UEVENT[1246532107.401628] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080573970/scsi_generic/sg6 (scsi_generic)
UEVENT[1246532107.403731] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080967186 (scsi)
UEVENT[1246532107.403742] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080967186/scsi_device/0:0:0:1080967186 (scsi_device)
UEVENT[1246532107.403753] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080967186/scsi_generic/sg8 (scsi_generic)
UEVENT[1246532107.405963] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080377362/block/sdv (block)
UEVENT[1246532107.406168] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080901650 (scsi)
UEVENT[1246532107.407608] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080901650/scsi_device/0:0:0:1080901650 (scsi_device)
UEVENT[1246532107.407624] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080901650/scsi_generic/sg10 (scsi_generic)
UEVENT[1246532107.407880] add /devices/virtual/bdi/65:80 (bdi)

[root@h42lp26/ESAME:~] > multipath -l
36005076303ffc1040000000000001268 dm-8 ,
size=1.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=-2 status=active
  |- #:#:#:# - #:# failed undef running
  `- 1:0:1:1080573970 sdr 65:16 active undef running

Running the "multipath" command will recover the failed path, but that's
not the way it should be... Can somebody help to fix this? Why is the path
not recovered automatically?

Regards,
Christian May

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: Failed path will not be recovered when disabling/enabling remote port
  2009-07-02 11:27 Failed path will not be recovered when disabling/enabling remote port Christian May
@ 2009-07-02 11:44 ` Hannes Reinecke
  2009-07-02 13:06   ` Konrad Rzeszutek
  2009-07-02 17:51 ` Chandra Seetharaman
  1 sibling, 1 reply; 8+ messages in thread
From: Hannes Reinecke @ 2009-07-02 11:44 UTC (permalink / raw)
  To: device-mapper development

Christian May wrote:
> Hi,
>
> I've setup an IBM z10 LPAR (mainframe server) with 2.6.30-kernel.
> Attached to the System z10 was an IBM DS8000 storage server. 10x SCSI
> LUNs were assigned to LPAR via two pathes:
>
> Example:
> 36005076303ffc1040000000000001269 dm-9 IBM,2107900
> size=1.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=-2 status=active
>   |- 0:0:0:1080639506 sdw 65:96 active undef running
>   `- 1:0:1:1080639506 sdt 65:48 active undef running
>
> Special parameter setting: dev_loss_tmo=90sec; fast_io_fail_tmo=5sec
>
> multipath tools: multipath-tools v0.4.9 (04/04, 2009)
> device-mapper: device-mapper-1.02.27-7.fc10.s390x,
> device-mapper-libs-1.02.27-7.fc10.s390x
>
> When removing a remote port (disabling a port on the BROCADE FC switch)
> one path failed.
>
> root@h42lp26/ESAME:~]
>> multipath -l
> 36005076303ffc1040000000000001268 dm-8 ,
> size=1.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=-2 status=active
>   |- #:#:#:# - #:# failed undef running
>   `- 1:0:1:1080573970 sdr 65:16 active undef running
>
> After a while (>90sec) SCSI LUNs were removed from system:
>
[ .. ]
>
> When re-enabling the path, SCSI LUNS were reassigned to system but path
> didn't recover:
>
[ .. ]
>
>
> [root@h42lp26/ESAME:~]
>> multipath -l
> 36005076303ffc1040000000000001268 dm-8 ,
> size=1.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=-2 status=active
>   |- #:#:#:# - #:# failed undef running
>   `- 1:0:1:1080573970 sdr 65:16 active undef running
>
>
> Running "multipath" command will recover the failed path but that's not
> way it should be...can somebody help to fix this? Why is the path not
> recovered automatically?
>
It should, really.

The problem is that the paths have _not_ been reconnected;
the hashes indicate that the in-kernel multipath code references
a device for which no information is available.
And the new device has _not_ been reconnected, as otherwise
you'd end up with _three_ paths here.

Probably missing udev integration.

I really have to push my patches upstream ... sigh.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: Failed path will not be recovered when disabling/enabling remote port
  2009-07-02 11:44 ` Hannes Reinecke
@ 2009-07-02 13:06   ` Konrad Rzeszutek
  2009-07-02 13:16     ` Hannes Reinecke
  0 siblings, 1 reply; 8+ messages in thread
From: Konrad Rzeszutek @ 2009-07-02 13:06 UTC (permalink / raw)
  To: device-mapper development

On Thu, Jul 02, 2009 at 01:44:18PM +0200, Hannes Reinecke wrote:
> Christian May wrote:
> > Hi,
> >
> > I've setup an IBM z10 LPAR (mainframe server) with 2.6.30-kernel.
> > Attached to the System z10 was an IBM DS8000 storage server. 10x SCSI
> > LUNs were assigned to LPAR via two pathes:
> >
> > Example:
> > 36005076303ffc1040000000000001269 dm-9 IBM,2107900
> > size=1.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
> > `-+- policy='round-robin 0' prio=-2 status=active
> >   |- 0:0:0:1080639506 sdw 65:96 active undef running
> >   `- 1:0:1:1080639506 sdt 65:48 active undef running
> >
> > Special parameter setting: dev_loss_tmo=90sec; fast_io_fail_tmo=5sec
> >
> > multipath tools: multipath-tools v0.4.9 (04/04, 2009)
> > device-mapper: device-mapper-1.02.27-7.fc10.s390x,
> > device-mapper-libs-1.02.27-7.fc10.s390x
> >
> > When removing a remote port (disabling a port on the BROCADE FC switch)
> > one path failed.
> >
> > root@h42lp26/ESAME:~]
> >> multipath -l
> > 36005076303ffc1040000000000001268 dm-8 ,
> > size=1.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
> > `-+- policy='round-robin 0' prio=-2 status=active
> >   |- #:#:#:# - #:# failed undef running
> >   `- 1:0:1:1080573970 sdr 65:16 active undef running
> >
> > After a while (>90sec) SCSI LUNs were removed from system:
> >
> [ .. ]
> >
> > When re-enabling the path, SCSI LUNS were reassigned to system but path
> > didn't recover:
> >
> [ .. ]
> >
> >
> > [root@h42lp26/ESAME:~]
> >> multipath -l
> > 36005076303ffc1040000000000001268 dm-8 ,
> > size=1.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
> > `-+- policy='round-robin 0' prio=-2 status=active
> >   |- #:#:#:# - #:# failed undef running
> >   `- 1:0:1:1080573970 sdr 65:16 active undef running
> >
> >
> > Running "multipath" command will recover the failed path but that's not
> > way it should be...can somebody help to fix this? Why is the path not
> > recovered automatically?
> >
> It should, really.
>
> The problem is that the paths have _not_ been reconnected;
> the hashes indicates that the in-kernel multipath code references
> a device for which no information is available.
> And the new device has _not_ been reconnected, as otherwise
> you'd end up with _three_ paths here.
>
> Probably missing udev integration.

Could also be a race condition that is present in SLES10 + RHEL5
kernels, where the SysFS directories are created (and the udev event is
sent out), but the kernel hasn't populated them yet. So when multipathd
tries to read them it finds no pertinent information and shoves the
device off to the 'orphan' state.

I did post a patch for this a while back. Granted, this isn't a problem
with the more recent kernels.

>
> I really have to push my patches upstream ... sigh.
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke		      zSeries & Storage
> hare@suse.de			      +49 911 74053 688
> SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: Markus Rex, HRB 16746 (AG Nürnberg)
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: Failed path will not be recovered when disabling/enabling remote port
  2009-07-02 13:06   ` Konrad Rzeszutek
@ 2009-07-02 13:16     ` Hannes Reinecke
  2009-07-20 16:46       ` Konrad Rzeszutek
  0 siblings, 1 reply; 8+ messages in thread
From: Hannes Reinecke @ 2009-07-02 13:16 UTC (permalink / raw)
  To: device-mapper development

Hi all,

Konrad Rzeszutek wrote:
> On Thu, Jul 02, 2009 at 01:44:18PM +0200, Hannes Reinecke wrote:
>> Christian May wrote:
>>> Hi,
>>>
>>> I've setup an IBM z10 LPAR (mainframe server) with 2.6.30-kernel.
>>> Attached to the System z10 was an IBM DS8000 storage server. 10x SCSI
>>> LUNs were assigned to LPAR via two pathes:
>>>
>>> Example:
>>> 36005076303ffc1040000000000001269 dm-9 IBM,2107900
>>> size=1.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
>>> `-+- policy='round-robin 0' prio=-2 status=active
>>>   |- 0:0:0:1080639506 sdw 65:96 active undef running
>>>   `- 1:0:1:1080639506 sdt 65:48 active undef running
>>>
>>> Special parameter setting: dev_loss_tmo=90sec; fast_io_fail_tmo=5sec
>>>
>>> multipath tools: multipath-tools v0.4.9 (04/04, 2009)
>>> device-mapper: device-mapper-1.02.27-7.fc10.s390x,
>>> device-mapper-libs-1.02.27-7.fc10.s390x
>>>
>>> When removing a remote port (disabling a port on the BROCADE FC switch)
>>> one path failed.
>>>
>>> root@h42lp26/ESAME:~]
>>>> multipath -l
>>> 36005076303ffc1040000000000001268 dm-8 ,
>>> size=1.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
>>> `-+- policy='round-robin 0' prio=-2 status=active
>>>   |- #:#:#:# - #:# failed undef running
>>>   `- 1:0:1:1080573970 sdr 65:16 active undef running
>>>
>>> After a while (>90sec) SCSI LUNs were removed from system:
>>>
>> [ .. ]
>>> When re-enabling the path, SCSI LUNS were reassigned to system but path
>>> didn't recover:
>>>
>> [ .. ]
>>
>>>
>>> [root@h42lp26/ESAME:~]
>>>> multipath -l
>>> 36005076303ffc1040000000000001268 dm-8 ,
>>> size=1.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
>>> `-+- policy='round-robin 0' prio=-2 status=active
>>>   |- #:#:#:# - #:# failed undef running
>>>   `- 1:0:1:1080573970 sdr 65:16 active undef running
>>>
>>>
>>> Running "multipath" command will recover the failed path but that's not
>>> way it should be...can somebody help to fix this? Why is the path not
>>> recovered automatically?
>>>
>> It should, really.
>>
>> The problem is that the paths have _not_ been reconnected;
>> the hashes indicates that the in-kernel multipath code references
>> a device for which no information is available.
>> And the new device has _not_ been reconnected, as otherwise
>> you'd end up with _three_ paths here.
>>
>> Probably missing udev integration.
>
> Could also be a race condition that is present in SLES10 + RHEL5
> kernels. Where the SysFS directories are created (and the udev event it
> sent out), but the kernel hasn't populated the SysFS directories. So
> when multipathd tries to read them it finds no pertient information and
> shoves it off to the 'orphan' state.
>
Really? With SLES10? Have you actually observed this?
We're running multipath _after_ udev has processed the event.
And udev already waited for sysfs, so we should be safe there.

It might be applicable to mainline multipath-tools, but
the SLES10 one ... I'd be surprised.

Well, reasonably surprised. multipath keeps on throwing
an amazing number of issues still.

Do you have more information here?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: Failed path will not be recovered when disabling/enabling remote port
  2009-07-02 13:16     ` Hannes Reinecke
@ 2009-07-20 16:46       ` Konrad Rzeszutek
  2009-07-21  6:19         ` Hannes Reinecke
  0 siblings, 1 reply; 8+ messages in thread
From: Konrad Rzeszutek @ 2009-07-20 16:46 UTC (permalink / raw)
  To: device-mapper development

> > Could also be a race condition that is present in SLES10 + RHEL5
> > kernels. Where the SysFS directories are created (and the udev event it
> > sent out), but the kernel hasn't populated the SysFS directories. So
> > when multipathd tries to read them it finds no pertient information and
> > shoves it off to the 'orphan' state.
> >
> Really? With SLES10? Have you actually observed this?

With SLES10 SP2 to be exact. It wasn't an issue with SLES10 since the
initial patch was there. The equipment I used to test this was an
AX150FC with failed batteries (so no cache writes) and with a failed
controller so it would run extra slow.

> We're running multipath _after_ udev has processed the event.

Right, the one where the SysFS directory is created. Then multipathd
reads the data. I remember posting it here and mentioning that this
problem exists on SLES10SP2 and RHEL5 but not on the upstream kernels.

> And udev already waited for sysfs, so we should be safe there.

Not so. udev gets the SCSI uevent for the creation, creates /dev/sdX, and
so on. But the kernel hasn't yet fully populated the SysFS entries (so
/sys/block/sdX/device/vendor does exist, but has no data in it).

>
> It might be applicable to mainline multipath-tools, but

It really depends on how the SysFS directories are populated and how
slow the SCSI target is.

> the SLES10 one ... I'd be surprised.
>
> Well, reasonably surprised. multipath keeps on throwing
> an amazing number of issues still.
>
> Do you have more information here?

Here is the patch along with a detailed description.

The "multipath-tools-add-wait" patch is a backport/write of the
wait_for_file routine used in the sysfs_get_[vendor|model|rev]
macros. The SLES10 SP2 back-ported a lot of the upstream features
of multipath, and one of those was getting rid of this function.
I haven't yet found out the reason why it was deleted; it looks
like a mistake, as the upstream kernel _should_ cause the same
set of problems with multipath.
[update: Upstream kernel has this fixed]

The reason a wait is necessary is due to the way the kernel
sends the event. When a SCSI device is added the SCSI subsystem
pursues this path:

_sysfs_add_sdev:
  calls device_add ...
    [ '/devices/platform/host16/session6/target16:0:0/16:0:0:17'] uevent
    bus_attach_device
      bus_for_each_drv
        driver_probe_device
          sd_probe
            ['/class/scsi_disk/16:0:0:17' ] uevent
            add_disk
              ['/block/sdai'] [ Here multipath starts its job ]

  calls class_device_add ...
    [ '/class/scsi_device/16:0:0:17' ] uevent
    sg_add:
      [ '/class/scsi_generic/sg35' ] uevent


  done with device_add, and now we add the attributes:
  --> scsi_sysfs_sdev_attrs[i].vendor, model, rev  <-- THIS is the problem.

[Multipathd at the 'block/sdai' event has started analyzing the data, and
it reads the SysFS, but the 'vendor', 'model' have no data so multipathd
discards them and orphans the devices. That data gets to be there once
'device_add' is finished.]

There are four uevents sent from the kernel in the creation of a SysFS
representation of the device. Only after the last event are the SysFS
entries for vendor, model, rev populated, which can lead to a race
condition when multipath investigates the new block device and finds it
can't read the vendor.

This patch adds the wait_for_file routine which adds some wait time.

diff -uNrp multipath-tools-0.4.7.orig/libmultipath/discovery.c multipath-tools-0.4.7/libmultipath/discovery.c
--- multipath-tools-0.4.7.orig/libmultipath/discovery.c	2008-09-25 14:02:28.000000000 -0400
+++ multipath-tools-0.4.7/libmultipath/discovery.c	2008-09-25 19:07:50.000000000 -0400
@@ -125,11 +125,54 @@ path_discovery (vector pathvec, struct c
 	return r;
 }
 
+
+/*
+ * the daemon can race udev upon path add,
+ * not multipath(8), ran by udev
+ */
+#if DAEMON
+#define WAIT_MAX_SECONDS 5
+#define WAIT_LOOP_PER_SECOND 5
+
+static int
+wait_for_file (char * filename)
+{
+	int loop;
+	struct stat stats;
+
+	loop = WAIT_MAX_SECONDS * WAIT_LOOP_PER_SECOND;
+
+	while (--loop) {
+		if (stat(filename, &stats) == 0)
+			return 0;
+
+		if (errno != ENOENT)
+			return 1;
+
+		usleep(1000 * 1000 / WAIT_LOOP_PER_SECOND);
+	}
+	return 1;
+}
+#else
+static int
+wait_for_file (char * filename)
+{
+	return 0;
+}
+#endif
+
 #define declare_sysfs_get_str(fname) \
 extern int \
 sysfs_get_##fname (struct sysfs_device * dev, char * buff, size_t len) \
 { \
 	char *attr; \
+	char attr_path[SYSFS_PATH_SIZE]; \
+\
+	if (safe_sprintf(attr_path, "%s/%s/%s", sysfs_path, dev->devpath, #fname)) \
+		return 1; \
+\
+	if (wait_for_file(attr_path)) \
+		return 1; \
 \
 	attr = sysfs_attr_get_value(dev->devpath, #fname); \
 	if (!attr) \

>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke		      zSeries & Storage
> hare@suse.de			      +49 911 74053 688
> SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: Markus Rex, HRB 16746 (AG Nürnberg)
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 8+ messages in thread
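[Illustration only, not part of the posted patch.] The message above says
the attribute file can already exist while still holding no data, whereas
wait_for_file() in the patch returns as soon as stat() succeeds. A minimal
sketch of a stricter variant, which polls until the attribute actually
yields content, could look like the following. The helper name
wait_for_attr_data() is hypothetical and does not exist in multipath-tools;
the 5 second / 5 polls-per-second budget is simply borrowed from the patch.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define WAIT_MAX_SECONDS	5
#define WAIT_LOOP_PER_SECOND	5

/*
 * Hypothetical helper (NOT part of multipath-tools): poll a sysfs
 * attribute until it can be opened AND contains at least one character
 * other than trailing blanks/newline, or until the time budget runs out.
 * Returns 0 and fills buf on success, 1 on timeout or hard error.
 */
static int
wait_for_attr_data (const char *path, char *buf, size_t len)
{
	int loop = WAIT_MAX_SECONDS * WAIT_LOOP_PER_SECOND;

	if (!buf || len < 2)
		return 1;

	while (loop--) {
		FILE *f = fopen(path, "r");

		if (f) {
			char *got = fgets(buf, (int)len, f);

			fclose(f);
			if (got) {
				/* strip trailing newline and blank padding */
				size_t n = strlen(buf);

				while (n && (buf[n - 1] == '\n' || buf[n - 1] == ' '))
					buf[--n] = '\0';
				if (n)
					return 0;	/* attribute is populated */
			}
		} else if (errno != ENOENT) {
			return 1;	/* anything but "not there yet" is fatal */
		}
		/* file missing or still empty: wait and try again */
		usleep(1000 * 1000 / WAIT_LOOP_PER_SECOND);
	}
	return 1;
}
```

A caller would pass the full attribute path (for example the vendor file
under the device's sysfs directory) and treat a return value of 1 as
"not ready, do not orphan yet" or as a failure, depending on policy.
Whether such a content check is needed on top of the existence check
depends on the kernel in use; per the message above, upstream kernels
no longer show the race.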
* Re: Failed path will not be recovered when disabling/enabling remote port
  2009-07-20 16:46       ` Konrad Rzeszutek
@ 2009-07-21  6:19         ` Hannes Reinecke
  2009-07-21 21:42           ` Konrad Rzeszutek
  0 siblings, 1 reply; 8+ messages in thread
From: Hannes Reinecke @ 2009-07-21 6:19 UTC (permalink / raw)
  To: device-mapper development

Hi Konrad,

Konrad Rzeszutek wrote:
>>> Could also be a race condition that is present in SLES10 + RHEL5
>>> kernels. Where the SysFS directories are created (and the udev event it
>>> sent out), but the kernel hasn't populated the SysFS directories. So
>>> when multipathd tries to read them it finds no pertient information and
>>> shoves it off to the 'orphan' state.
>>>
>> Really? With SLES10? Have you actually observed this?
>
> With SLES10 SP2 to be exact. It wasn't an issue with SLES10 since the
> initial patch was there. The equipment I used to test this was an
> AX150FC with failed batteries (so no cache writes) and with a failed
> controller so it would run extra slow.
>
>> We're running multipath _after_ udev has processed the event.
>
> Right, the one where the SysFS directory is created. Then multipatd
> reads the data. I remember posting it here and mentioning that this
> problem exists on SLES10SP2 and RHEL5 but not on the upstream kernels.
>
>> And udev already waited for sysfs, so we should be safe there.
>
> Not so. The udev gets the SCSI uevent creation, creates the /dev/sdX, and
> so. But the kernel hasn't yet fully populated the SysFS entries (so
> /sys/block/sdX/device/vendor does exist, but has no data in it).
>> It might be applicable to mainline multipath-tools, but
>
> It really depends on how the SysFS directories are populated and how
> slow the SCSI target is.
>
>> the SLES10 one ... I'd be surprised.
>>
>> Well, reasonably surprised. multipath keeps on throwing
>> an amazing number of issues still.
>>
>> Do you have more information here?
>
> Here is the patch along with a detailed description.
>
> The "multipath-tools-add-wait" patch is a backport/write of the
> wait_for_file routine used in the sysfs_get_[vendor|model|rev]
> macros. The SLES10 SP2 back-ported a lot of the upstream features
> of multipath, and one of those was getting rid of this function.
> I haven't yet found out the reason why it was deleted - looks
> as if a mistake as the upstream kernel _should_ cause the same
> set of problems with multipath.
> [update: Upstream kernel has this fixed]
>
> The reason a wait is necessary is due to the way the kernel
> sends the event. When a SCSI device is added the SCSI subsystem
> pursues this path:
>
> _sysfs_add_sdev:
>   calls device_add ...
>     [ '/devices/platform/host16/session6/target16:0:0/16:0:0:17'] uevent
>     bus_attach_device
>       bus_for_each_drv
>         driver_probe_device
>           sd_probe
>             ['/class/scsi_disk/16:0:0:17' ] uevent
>             add_disk
>               ['/block/sdai'] [ Here multipath starts its job ]
>
>   calls class_device_add ...
>     [ '/class/scsi_device/16:0:0:17' ] uevent
>     sg_add:
>       [ '/class/scsi_generic/sg35' ] uevent
>
>
>   done with device_add, and now we add the attributes:
>   --> scsi_sysfs_sdev_attrs[i].vendor, model, rev  <-- THIS is the
> problem.
>
> [Multipathd at the 'block/sdai' event has started analyzing the data, and
> it reads the SysFS, but the 'vendor', 'model' have no data so multipathd
> discards them an orphans the devices. That data gets to be there once
> 'device_add' is finished.]
>
Ah. Hmm. Seems you are correct.

I'll have to apply the patch, then.

Fancy opening a bugzilla for it?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: Failed path will not be recovered when disabling/enabling remote port
  2009-07-21  6:19         ` Hannes Reinecke
@ 2009-07-21 21:42           ` Konrad Rzeszutek
  0 siblings, 0 replies; 8+ messages in thread
From: Konrad Rzeszutek @ 2009-07-21 21:42 UTC (permalink / raw)
  To: device-mapper development

> Ah. Hmm. Seems you are correct.
>
> I'll have to apply the patch, then.
>
> Fancy opening a bugzilla for it?

Done. BZ #524018.

>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke		      zSeries & Storage
> hare@suse.de			      +49 911 74053 688
> SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: Markus Rex, HRB 16746 (AG Nürnberg)
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: Failed path will not be recovered when disabling/enabling remote port
  2009-07-02 11:27 Failed path will not be recovered when disabling/enabling remote port Christian May
  2009-07-02 11:44 ` Hannes Reinecke
@ 2009-07-02 17:51 ` Chandra Seetharaman
  1 sibling, 0 replies; 8+ messages in thread
From: Chandra Seetharaman @ 2009-07-02 17:51 UTC (permalink / raw)
  To: device-mapper development

One simple question.

Did you observe if multipathd is (still) running ? (when the port was
enabled)

On Thu, 2009-07-02 at 13:27 +0200, Christian May wrote:
> Hi,
>
> I've setup an IBM z10 LPAR (mainframe server) with 2.6.30-kernel.
> Attached to the System z10 was an IBM DS8000 storage server. 10x SCSI
> LUNs were assigned to LPAR via two pathes:
>
> Example:
> 36005076303ffc1040000000000001269 dm-9 IBM,2107900
> size=1.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=-2 status=active
>   |- 0:0:0:1080639506 sdw 65:96 active undef running
>   `- 1:0:1:1080639506 sdt 65:48 active undef running
>
> Special parameter setting: dev_loss_tmo=90sec; fast_io_fail_tmo=5sec
>
> multipath tools: multipath-tools v0.4.9 (04/04, 2009)
> device-mapper: device-mapper-1.02.27-7.fc10.s390x,
> device-mapper-libs-1.02.27-7.fc10.s390x
>
> When removing a remote port (disabling a port on the BROCADE FC switch)
> one path failed.
> > root@h42lp26/ESAME:~] > > multipath -l > 36005076303ffc1040000000000001268 dm-8 , > size=1.0G features='1 queue_if_no_path' hwhandler='0' wp=rw > `-+- policy='round-robin 0' prio=-2 status=active > |- #:#:#:# - #:# failed undef running > `- 1:0:1:1080573970 sdr 65:16 active undef running > > After a while (>90sec) SCSI LUNs were removed from system: > > > UEVENT[1246531815.619428] add /kernel/uids/74 (uids) > UDEV [1246531815.621708] add /kernel/uids/74 (uids) > UEVENT[1246531816.725299] remove /kernel/uids/74 (uids) > UDEV [1246531816.726151] remove /kernel/uids/74 (uids) > UEVENT[1246531929.959709] change /devices/virtual/block/dm-0 (block) > UEVENT[1246531929.959749] change /devices/virtual/block/dm-3 (block) > UEVENT[1246531929.959759] change /devices/virtual/block/dm-4 (block) > UEVENT[1246531929.959769] change /devices/virtual/block/dm-5 (block) > UEVENT[1246531929.966647] change /devices/virtual/block/dm-7 (block) > UDEV [1246531930.045444] change /devices/virtual/block/dm-4 (block) > UDEV [1246531930.048923] change /devices/virtual/block/dm-7 (block) > UDEV [1246531930.054614] change /devices/virtual/block/dm-0 (block) > UDEV [1246531930.060091] change /devices/virtual/block/dm-3 (block) > UDEV [1246531930.071744] change /devices/virtual/block/dm-5 (block) > UEVENT[1246531949.278541] change /devices/virtual/block/dm-9 (block) > UDEV [1246531949.369690] change /devices/virtual/block/dm-9 (block) > UEVENT[1246531950.295756] change /devices/virtual/block/dm-8 (block) > UEVENT[1246531950.297597] change /devices/virtual/block/dm-6 (block) > UEVENT[1246531950.297610] change /devices/virtual/block/dm-2 (block) > UEVENT[1246531950.297620] change /devices/virtual/block/dm-1 (block) > UDEV [1246531950.430097] change /devices/virtual/block/dm-8 (block) > UDEV [1246531950.588626] change /devices/virtual/block/dm-2 (block) > UDEV [1246531950.632482] change /devices/virtual/block/dm-1 (block) > UDEV [1246531950.634515] change /devices/virtual/block/dm-6 (block) > 
UEVENT[1246532034.277177] remove > /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080377362/scsi_generic/sg0 > (scsi_generic) > UEVENT[1246532034.277214] remove > /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080377362/scsi_device/0:0:0:1080377362 > (scsi_device) > UEVENT[1246532034.277226] remove > /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080377362/scsi_disk/0:0:0:1080377362 > (scsi_disk) > UEVENT[1246532034.277236] remove /devices/virtual/bdi/8:0 (bdi) > UEVENT[1246532034.277247] remove > /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080377362/block/sda > (block) > UEVENT[1246532034.277258] remove > /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080377362 > (scsi) > UEVENT[1246532034.277384] remove > /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080836114/scsi_generic/sg2 > (scsi_generic) > UEVENT[1246532034.277594] remove > /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080836114/scsi_device/0:0:0:1080836114 > (scsi_device) > UEVENT[1246532034.277864] remove > /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080836114/scsi_disk/0:0:0:1080836114 > (scsi_disk) > UEVENT[1246532034.278035] remove /devices/virtual/bdi/8:32 (bdi)... > > .... 

When re-enabling the path, the SCSI LUNs were reassigned to the system but the path didn't recover:

UEVENT[1246532107.387169] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080836114 (scsi)
UEVENT[1246532107.387209] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080836114/scsi_device/0:0:0:1080836114 (scsi_device)
UEVENT[1246532107.387220] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080836114/scsi_generic/sg0 (scsi_generic)
UEVENT[1246532107.387230] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080836114/scsi_disk/0:0:0:1080836114 (scsi_disk)
UEVENT[1246532107.388941] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080377362 (scsi)
UEVENT[1246532107.388952] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080377362/scsi_device/0:0:0:1080377362 (scsi_device)
UEVENT[1246532107.388963] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080377362/scsi_generic/sg2 (scsi_generic)
UEVENT[1246532107.397111] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080836114/block/sdu (block)
UEVENT[1246532107.399249] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080639506 (scsi)
UEVENT[1246532107.399261] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080639506/scsi_device/0:0:0:1080639506 (scsi_device)
UEVENT[1246532107.399272] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080639506/scsi_generic/sg4 (scsi_generic)
UEVENT[1246532107.399711] add /devices/virtual/bdi/65:64 (bdi)
UEVENT[1246532107.399722] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080377362/scsi_disk/0:0:0:1080377362 (scsi_disk)
UEVENT[1246532107.401605] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080573970 (scsi)
UEVENT[1246532107.401617] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080573970/scsi_device/0:0:0:1080573970 (scsi_device)
UEVENT[1246532107.401628] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080573970/scsi_generic/sg6 (scsi_generic)
UEVENT[1246532107.403731] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080967186 (scsi)
UEVENT[1246532107.403742] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080967186/scsi_device/0:0:0:1080967186 (scsi_device)
UEVENT[1246532107.403753] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080967186/scsi_generic/sg8 (scsi_generic)
UEVENT[1246532107.405963] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080377362/block/sdv (block)
UEVENT[1246532107.406168] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080901650 (scsi)
UEVENT[1246532107.407608] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080901650/scsi_device/0:0:0:1080901650 (scsi_device)
UEVENT[1246532107.407624] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080901650/scsi_generic/sg10 (scsi_generic)
UEVENT[1246532107.407880] add /devices/virtual/bdi/65:80 (bdi)

[root@h42lp26/ESAME:~] > multipath -l
36005076303ffc1040000000000001268 dm-8 ,
size=1.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=-2 status=active
  |- #:#:#:# - #:# failed undef running
  `- 1:0:1:1080573970 sdr 65:16 active undef running

Running the "multipath" command recovers the failed path, but that's not the
way it should be. Can somebody help to fix this? Why is the path not
recovered automatically?

Regards,

Christian May

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
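
One way to narrow this down is to check, for each SCSI device that came back, whether a matching block-device add event was logged as well. A minimal sketch in Python, assuming the log has been unwrapped to one event per line in the "UEVENT[timestamp] action devpath (subsystem)" layout shown above. The function names parse_uevents and luns_without_block_add are hypothetical helpers, not part of udev or multipath-tools, and the sample is three events taken from the log above:

```python
import re

# Assumed layout, one event per line (as printed by `udevadm monitor`):
#   UEVENT[<timestamp>] <action> <devpath> (<subsystem>)
EVENT_RE = re.compile(
    r"^(UEVENT|UDEV)\s*\[(\d+\.\d+)\]\s+(\S+)\s+(\S+)\s+\((\S+)\)\s*$"
)


def parse_uevents(text):
    """Return (source, timestamp, action, devpath, subsystem) tuples."""
    events = []
    for line in text.splitlines():
        m = EVENT_RE.match(line.strip())
        if m:
            source, ts, action, devpath, subsystem = m.groups()
            events.append((source, float(ts), action, devpath, subsystem))
    return events


def luns_without_block_add(events):
    """List SCSI addresses that were added without a block 'add' event."""
    scsi_added = []
    block_added = set()
    for source, _ts, action, devpath, subsystem in events:
        # Only look at kernel events; UDEV lines repeat them after rule processing.
        if source != "UEVENT" or action != "add":
            continue
        parts = devpath.split("/")
        if subsystem == "scsi":
            # .../target0:0:0/0:0:0:<lun>
            scsi_added.append(parts[-1])
        elif subsystem == "block" and len(parts) >= 3 and parts[-2] == "block":
            # .../0:0:0:<lun>/block/sdX
            block_added.add(parts[-3])
    return [lun for lun in scsi_added if lun not in block_added]


sample = """\
UEVENT[1246532107.387169] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080836114 (scsi)
UEVENT[1246532107.397111] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080836114/block/sdu (block)
UEVENT[1246532107.401605] add /devices/css0/0.0.0330/0.0.1780/host0/rport-0:0-0/target0:0:0/0:0:0:1080573970 (scsi)
"""

print(luns_without_block_add(parse_uevents(sample)))
```

For the three sample lines this prints ['0:0:0:1080573970']. Since the log above is elided, a LUN reported here may only mean that its block event fell outside the excerpt, so the result is a hint for where to look rather than a diagnosis.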
Thread overview: 8+ messages
2009-07-02 11:27 Failed path will not be recovered when disabling/enabling remote port Christian May
2009-07-02 11:44 ` Hannes Reinecke
2009-07-02 13:06 ` Konrad Rzeszutek
2009-07-02 13:16 ` Hannes Reinecke
2009-07-20 16:46 ` Konrad Rzeszutek
2009-07-21  6:19 ` Hannes Reinecke
2009-07-21 21:42 ` Konrad Rzeszutek
2009-07-02 17:51 ` Chandra Seetharaman