Using mpath_prio_emc with group_by_prio did the trick. Thanks!

But I am still losing the paths to the failed devices. I increased dev_loss_tmo, but the maximum seems to be about 600, so after 10 minutes the paths fail:

SANfile_m linux # multipath -l
hcfshare (360060160c820080063502869e459dc11) dm-0 ,
[size=3.4T][features=1 queue_if_no_path][hwhandler=1 emc]
\_ round-robin 0 [prio=0][enabled]
 \_ #:#:#:# - #:# [failed][undef]
 \_ #:#:#:# - #:# [failed][undef]
\_ round-robin 0 [prio=0][active]
 \_ 2:0:0:0 sdd 8:48 [active][undef]
 \_ 1:0:0:0 sdb 8:16 [active][undef]

If I put them online again, I run into the -EEXIST problem. Async SCSI scanning *is* off in my kernel, so the only thing left for me to do is to try the patch, isn't it?

Oct 17 17:26:34 SANfile_m kernel: scsi 1:0:1:0: rejecting I/O to dead device
Oct 17 17:26:34 SANfile_m multipathd: sdc: emc_clariion_checker: query command indicates error
Oct 17 17:26:35 SANfile_m kernel: scsi 2:0:1:0: rejecting I/O to dead device
Oct 17 17:26:35 SANfile_m multipathd: sde: emc_clariion_checker: query command indicates error
Oct 17 17:26:36 SANfile_m kernel: scsi 1:0:1:0: Direct-Access DGC RAID 5 0219 PQ: 0 ANSI: 4
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Very big device. Trying to use READ CAPACITY(16).
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] 7263453184 512-byte hardware sectors (3718888 MB)
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Test WP failed, assume Write Enabled
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Asking for cache data failed
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Assuming drive cache: write through
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Very big device. Trying to use READ CAPACITY(16).
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] 7263453184 512-byte hardware sectors (3718888 MB)
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Test WP failed, assume Write Enabled
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Asking for cache data failed
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Assuming drive cache: write through
Oct 17 17:26:36 SANfile_m kernel: sdf:<6>sd 1:0:1:0: [sdf] Device not ready: <6>: Sense Key : 0x2 [current]
Oct 17 17:26:36 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 17 17:26:36 SANfile_m kernel: end_request: I/O error, dev sdf, sector 0
Oct 17 17:26:36 SANfile_m kernel: printk: 40 messages suppressed.
Oct 17 17:26:36 SANfile_m kernel: Buffer I/O error on device sdf, logical block 0
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Device not ready: <6>: Sense Key : 0x2 [current]
Oct 17 17:26:36 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 17 17:26:36 SANfile_m kernel: end_request: I/O error, dev sdf, sector 0
Oct 17 17:26:36 SANfile_m kernel: Buffer I/O error on device sdf, logical block 0
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Device not ready: <6>: Sense Key : 0x2 [current]
Oct 17 17:26:36 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 17 17:26:36 SANfile_m kernel: end_request: I/O error, dev sdf, sector 0
Oct 17 17:26:36 SANfile_m kernel: Buffer I/O error on device sdf, logical block 0
Oct 17 17:26:36 SANfile_m kernel: ldm_validate_partition_table(): Disk read failed.
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Device not ready: <6>: Sense Key : 0x2 [current]
Oct 17 17:26:36 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 17 17:26:36 SANfile_m kernel: end_request: I/O error, dev sdf, sector 0
Oct 17 17:26:36 SANfile_m kernel: Buffer I/O error on device sdf, logical block 0
Oct 17 17:26:36 SANfile_m kernel: unable to read partition table
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Attached SCSI disk
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: Attached scsi generic sg2 type 0
Oct 17 17:26:36 SANfile_m kernel: scsi 1:0:1:0: Direct-Access DGC RAID 5 0219 PQ: 0 ANSI: 4
Oct 17 17:26:36 SANfile_m kernel: kobject_add failed for 1:0:1:0 with -EEXIST, don't try to register things with the same name in the same directory.
Oct 17 17:26:36 SANfile_m kernel: [number+85/816] kobject_shadow_add+0x115/0x1b0
Oct 17 17:26:36 SANfile_m kernel: [] kobject_shadow_add+0x115/0x1b0
Oct 17 17:26:36 SANfile_m kernel: [lo_ioctl+1125/2528] device_add+0xc5/0x570
Oct 17 17:26:36 SANfile_m kernel: [] device_add+0xc5/0x570
Oct 17 17:26:36 SANfile_m kernel: [fc_remote_port_rolechg+127/320] scsi_adjust_queue_depth+0x9f/0xf0
Oct 17 17:26:36 SANfile_m kernel: [] scsi_adjust_queue_depth+0x9f/0xf0
Oct 17 17:26:36 SANfile_m kernel: [blk_register_region+18/64] __blk_queue_init_tags+0x32/0x70
Oct 17 17:26:36 SANfile_m kernel: [] __blk_queue_init_tags+0x32/0x70
Oct 17 17:26:36 SANfile_m kernel: [sr_get_mcn+2/240] scsi_sysfs_add_sdev+0x32/0x230
Oct 17 17:26:36 SANfile_m kernel: [] scsi_sysfs_add_sdev+0x32/0x230
Oct 17 17:26:36 SANfile_m kernel: [] qla2xxx_slave_configure+0x77/0x110 [qla2xxx]
Oct 17 17:26:36 SANfile_m kernel: [sd_init_command+313/1088] scsi_probe_and_add_lun+0x8c9/0x940
Oct 17 17:26:36 SANfile_m kernel: [] scsi_probe_and_add_lun+0x8c9/0x940
Oct 17 17:26:36 SANfile_m kernel: [sr_probe+72/1472] __scsi_scan_target+0x518/0x5c0
Oct 17 17:26:36 SANfile_m kernel: [] __scsi_scan_target+0x518/0x5c0
Oct 17 17:26:36 SANfile_m kernel: [kallsyms_addresses+36259/130252] schedule+0x2df/0x940
Oct 17 17:26:36 SANfile_m kernel: [] schedule+0x2df/0x940
Oct 17 17:26:36 SANfile_m kernel: [sr_init_command+54/944] scsi_scan_target+0xb6/0xe0
Oct 17 17:26:36 SANfile_m kernel: [] scsi_scan_target+0xb6/0xe0
Oct 17 17:26:36 SANfile_m kernel: [SendIocInit+224/784] fc_scsi_scan_rport+0x0/0x90
Oct 17 17:26:36 SANfile_m kernel: [] fc_scsi_scan_rport+0x0/0x90
Oct 17 17:26:36 SANfile_m kernel: [SendIocInit+344/784] fc_scsi_scan_rport+0x78/0x90
Oct 17 17:26:36 SANfile_m kernel: [] fc_scsi_scan_rport+0x78/0x90
Oct 17 17:26:36 SANfile_m kernel: [run_workqueue+131/256] run_workqueue+0x73/0x100
Oct 17 17:26:36 SANfile_m kernel: [] run_workqueue+0x73/0x100
Oct 17 17:26:36 SANfile_m kernel: [autoremove_wake_function+16/80] autoremove_wake_function+0x0/0x50
Oct 17 17:26:36 SANfile_m kernel: [] autoremove_wake_function+0x0/0x50
Oct 17 17:26:36 SANfile_m kernel: [worker_thread+172/256] worker_thread+0x9c/0x100
Oct 17 17:26:36 SANfile_m kernel: [] worker_thread+0x9c/0x100
Oct 17 17:26:36 SANfile_m kernel: [autoremove_wake_function+16/80] autoremove_wake_function+0x0/0x50
Oct 17 17:26:36 SANfile_m kernel: [] autoremove_wake_function+0x0/0x50
Oct 17 17:26:36 SANfile_m kernel: [worker_thread+16/256] worker_thread+0x0/0x100
Oct 17 17:26:36 SANfile_m kernel: [] worker_thread+0x0/0x100
Oct 17 17:26:36 SANfile_m kernel: [kthread+82/112] kthread+0x42/0x70
Oct 17 17:26:36 SANfile_m kernel: [] kthread+0x42/0x70
Oct 17 17:26:36 SANfile_m kernel: [kthread+16/112] kthread+0x0/0x70
Oct 17 17:26:36 SANfile_m kernel: [] kthread+0x0/0x70
Oct 17 17:26:36 SANfile_m kernel: [print_trace_stack+3/16] kernel_thread_helper+0x7/0x14
Oct 17 17:26:36 SANfile_m kernel: [] kernel_thread_helper+0x7/0x14
Oct 17 17:26:36 SANfile_m kernel: =======================
Oct 17 17:26:36 SANfile_m kernel: error 1
Oct 17 17:26:36 SANfile_m kernel: scsi 1:0:1:0: Unexpected response from lun 0 while scanning, scan aborted
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Device not ready: <6>: Sense Key : 0x2 [current]
Oct 17 17:26:36 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 17 17:26:36 SANfile_m kernel: end_request: I/O error, dev sdf, sector 7263453056
Oct 17 17:26:36 SANfile_m kernel: Buffer I/O error on device sdf, logical block 907931632
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Device not ready: <6>: Sense Key : 0x2 [current]
Oct 17 17:26:36 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 17 17:26:36 SANfile_m kernel: end_request: I/O error, dev sdf, sector 7263453056
Oct 17 17:26:36 SANfile_m kernel: Buffer I/O error on device sdf, logical block 907931632
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Device not ready: <6>: Sense Key : 0x2 [current]
Oct 17 17:26:36 SANfile_m kernel: : ASC=0x4 ASCQ=0x3

Is that what you refer as

----- Original Message -----
From: Tore Anderson
To: device-mapper development
Sent: Wednesday, October 17, 2007 2:32 PM
Subject: Re: [dm-devel] multibus / failover and EMC CX600

* Hannes Reinecke

> That's the dev_loss_tmo setting. Just increase it to something to
> your liking.

Oh, sweet. This knob won't affect how long the layer will hold I/O before failing it (like lpfc_nodev_tmo), I assume? (I'm worried about it taking longer for dm-multipath to detect failed paths.)

I wish it could've been set to unlimited, though. Seems like there's always some kind of trouble with re-adding the devices: either I run into that -EEXIST bug, or udev doesn't do its job properly and the revived device isn't added back into the dm-multipath map. In addition, it sometimes breaks queue_if_no_path with earlier multipath-tools versions that don't use no_flush on suspend. Those versions are of course included in most server distributions... Sigh.

Regards
--
Tore Anderson

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
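
For anyone following the dev_loss_tmo discussion above, here is a minimal sketch of raising the timeout on every FC remote port in one go. It assumes the attribute is exposed as /sys/class/fc_remote_ports/rport-*/dev_loss_tmo, which is the usual FC transport class layout but may differ on your kernel. The helper name set_dev_loss_tmo and its sysfs_root parameter are made up for illustration and are not part of multipath-tools or any driver. The kernel may reject values above its limit (about 600 on the system described in this thread), so the sketch records the error instead of aborting.

```python
import glob
import os


def set_dev_loss_tmo(seconds, sysfs_root="/sys/class/fc_remote_ports"):
    """Write `seconds` to the dev_loss_tmo attribute of every FC remote port.

    Hypothetical helper: `sysfs_root` is an assumed default location and can
    be pointed elsewhere. Returns a dict mapping each rport directory name to
    True (value written) or to the OSError raised while writing it.
    """
    results = {}
    pattern = os.path.join(sysfs_root, "rport-*", "dev_loss_tmo")
    for attr in sorted(glob.glob(pattern)):
        rport = os.path.basename(os.path.dirname(attr))
        try:
            # The write is flushed when the file is closed, so a rejection
            # by the kernel surfaces inside this try block as well.
            with open(attr, "w") as f:
                f.write(str(seconds))
            results[rport] = True
        except OSError as err:
            results[rport] = err
    return results
```

Run as root, something like set_dev_loss_tmo(600) would try the largest value that was accepted here; any rport whose entry is not True kept its previous setting. Note that this only changes how long the remote port is kept around after it disappears, it does nothing about the -EEXIST problem when the devices are re-added.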