Using mpath_prio_emc with group_by_prio did the trick. Thanks!
But I am still losing the paths to the failed devices. I increased
dev_loss_tmo, but the maximum seems to be about 600; thus, after
10 minutes, the paths fail:
SANfile_m linux # multipath -l
hcfshare (360060160c820080063502869e459dc11) dm-0 ,
[size=3.4T][features=1 queue_if_no_path][hwhandler=1 emc]
\_ round-robin 0 [prio=0][enabled]
 \_ #:#:#:# - #:# [failed][undef]
 \_ #:#:#:# - #:# [failed][undef]
\_ round-robin 0 [prio=0][active]
 \_ 2:0:0:0 sdd 8:48 [active][undef]
 \_ 1:0:0:0 sdb 8:16 [active][undef]
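For reference, here is a minimal Python sketch of how the dev_loss_tmo change could be scripted. It assumes the FC transport class exposes the remote ports under /sys/class/fc_remote_ports/rport-*/dev_loss_tmo and that it runs as root; the function name set_dev_loss_tmo and its sysfs_root parameter are made up for illustration. It reads each value back after writing, so a value the kernel or driver refuses shows up as an error instead of being silently assumed.

```python
import glob
import os


def set_dev_loss_tmo(seconds, sysfs_root="/sys/class/fc_remote_ports"):
    """Write `seconds` to dev_loss_tmo of every FC remote port found.

    Returns a dict mapping each rport name to the value read back
    afterwards, or to an "error: ..." string if the write was refused.
    The default sysfs location is an assumption about the running kernel.
    """
    results = {}
    pattern = os.path.join(sysfs_root, "rport-*", "dev_loss_tmo")
    for path in sorted(glob.glob(pattern)):
        rport = os.path.basename(os.path.dirname(path))
        try:
            with open(path, "w") as f:
                f.write(str(seconds))
            # Read back so we report what is actually in effect.
            with open(path) as f:
                results[rport] = f.read().strip()
        except OSError as exc:
            # The kernel or driver may reject values it considers invalid.
            results[rport] = "error: %s" % exc
    return results
```

Calling set_dev_loss_tmo(600) and printing the returned dict shows, per remote port, whether the new timeout was accepted.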
If I put them online again, I run into the -EEXIST problem. Async SCSI
scanning *is* off in my kernel, so the only thing I can do from here
is to try the patch, isn't it?
Oct 17 17:26:34 SANfile_m kernel: scsi 1:0:1:0: rejecting I/O to dead device
Oct 17 17:26:34 SANfile_m multipathd: sdc: emc_clariion_checker: query command indicates error
Oct 17 17:26:35 SANfile_m kernel: scsi 2:0:1:0: rejecting I/O to dead device
Oct 17 17:26:35 SANfile_m multipathd: sde: emc_clariion_checker: query command indicates error
Oct 17 17:26:36 SANfile_m kernel: scsi 1:0:1:0: Direct-Access DGC RAID 5 0219 PQ: 0 ANSI: 4
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Very big device. Trying to use READ CAPACITY(16).
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] 7263453184 512-byte hardware sectors (3718888 MB)
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Test WP failed, assume Write Enabled
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Asking for cache data failed
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Assuming drive cache: write through
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Very big device. Trying to use READ CAPACITY(16).
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] 7263453184 512-byte hardware sectors (3718888 MB)
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Test WP failed, assume Write Enabled
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Asking for cache data failed
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Assuming drive cache: write through
Oct 17 17:26:36 SANfile_m kernel: sdf:<6>sd 1:0:1:0: [sdf] Device not ready: <6>: Sense Key : 0x2 [current]
Oct 17 17:26:36 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 17 17:26:36 SANfile_m kernel: end_request: I/O error, dev sdf, sector 0
Oct 17 17:26:36 SANfile_m kernel: printk: 40 messages suppressed.
Oct 17 17:26:36 SANfile_m kernel: Buffer I/O error on device sdf, logical block 0
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Device not ready: <6>: Sense Key : 0x2 [current]
Oct 17 17:26:36 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 17 17:26:36 SANfile_m kernel: end_request: I/O error, dev sdf, sector 0
Oct 17 17:26:36 SANfile_m kernel: Buffer I/O error on device sdf, logical block 0
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Device not ready: <6>: Sense Key : 0x2 [current]
Oct 17 17:26:36 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 17 17:26:36 SANfile_m kernel: end_request: I/O error, dev sdf, sector 0
Oct 17 17:26:36 SANfile_m kernel: Buffer I/O error on device sdf, logical block 0
Oct 17 17:26:36 SANfile_m kernel: ldm_validate_partition_table(): Disk read failed.
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Device not ready: <6>: Sense Key : 0x2 [current]
Oct 17 17:26:36 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 17 17:26:36 SANfile_m kernel: end_request: I/O error, dev sdf, sector 0
Oct 17 17:26:36 SANfile_m kernel: Buffer I/O error on device sdf, logical block 0
Oct 17 17:26:36 SANfile_m kernel: unable to read partition table
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Attached SCSI disk
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: Attached scsi generic sg2 type 0
Oct 17 17:26:36 SANfile_m kernel: scsi 1:0:1:0: Direct-Access DGC RAID 5 0219 PQ: 0 ANSI: 4
Oct 17 17:26:36 SANfile_m kernel: kobject_add failed for 1:0:1:0 with -EEXIST, don't try to register things with the same name in the same directory.
Oct 17 17:26:36 SANfile_m kernel: [number+85/816] kobject_shadow_add+0x115/0x1b0
Oct 17 17:26:36 SANfile_m kernel: [<c02f95f5>] kobject_shadow_add+0x115/0x1b0
Oct 17 17:26:36 SANfile_m kernel: [lo_ioctl+1125/2528] device_add+0xc5/0x570
Oct 17 17:26:36 SANfile_m kernel: [<c03aefd5>] device_add+0xc5/0x570
Oct 17 17:26:36 SANfile_m kernel: [fc_remote_port_rolechg+127/320] scsi_adjust_queue_depth+0x9f/0xf0
Oct 17 17:26:36 SANfile_m kernel: [<c03f9d7f>] scsi_adjust_queue_depth+0x9f/0xf0
Oct 17 17:26:36 SANfile_m kernel: [blk_register_region+18/64] __blk_queue_init_tags+0x32/0x70
Oct 17 17:26:36 SANfile_m kernel: [<c02eeb72>] __blk_queue_init_tags+0x32/0x70
Oct 17 17:26:36 SANfile_m kernel: [sr_get_mcn+2/240] scsi_sysfs_add_sdev+0x32/0x230
Oct 17 17:26:36 SANfile_m kernel: [<c0402882>] scsi_sysfs_add_sdev+0x32/0x230
Oct 17 17:26:36 SANfile_m kernel: [<f99445b7>] qla2xxx_slave_configure+0x77/0x110 [qla2xxx]
Oct 17 17:26:36 SANfile_m kernel: [sd_init_command+313/1088] scsi_probe_and_add_lun+0x8c9/0x940
Oct 17 17:26:36 SANfile_m kernel: [<c0400859>] scsi_probe_and_add_lun+0x8c9/0x940
Oct 17 17:26:36 SANfile_m kernel: [sr_probe+72/1472] __scsi_scan_target+0x518/0x5c0
Oct 17 17:26:36 SANfile_m kernel: [<c04012c8>] __scsi_scan_target+0x518/0x5c0
Oct 17 17:26:36 SANfile_m kernel: [kallsyms_addresses+36259/130252] schedule+0x2df/0x940
Oct 17 17:26:36 SANfile_m kernel: [<c053695f>] schedule+0x2df/0x940
Oct 17 17:26:36 SANfile_m kernel: [sr_init_command+54/944] scsi_scan_target+0xb6/0xe0
Oct 17 17:26:36 SANfile_m kernel: [<c04019f6>] scsi_scan_target+0xb6/0xe0
Oct 17 17:26:36 SANfile_m kernel: [SendIocInit+224/784] fc_scsi_scan_rport+0x0/0x90
Oct 17 17:26:36 SANfile_m kernel: [<c04084b0>] fc_scsi_scan_rport+0x0/0x90
Oct 17 17:26:36 SANfile_m kernel: [SendIocInit+344/784] fc_scsi_scan_rport+0x78/0x90
Oct 17 17:26:36 SANfile_m kernel: [<c0408528>] fc_scsi_scan_rport+0x78/0x90
Oct 17 17:26:36 SANfile_m kernel: [run_workqueue+131/256] run_workqueue+0x73/0x100
Oct 17 17:26:36 SANfile_m kernel: [<c0131dc3>] run_workqueue+0x73/0x100
Oct 17 17:26:36 SANfile_m kernel: [autoremove_wake_function+16/80] autoremove_wake_function+0x0/0x50
Oct 17 17:26:36 SANfile_m kernel: [<c01354e0>] autoremove_wake_function+0x0/0x50
Oct 17 17:26:36 SANfile_m kernel: [worker_thread+172/256] worker_thread+0x9c/0x100
Oct 17 17:26:36 SANfile_m kernel: [<c01326dc>] worker_thread+0x9c/0x100
Oct 17 17:26:36 SANfile_m kernel: [autoremove_wake_function+16/80] autoremove_wake_function+0x0/0x50
Oct 17 17:26:36 SANfile_m kernel: [<c01354e0>] autoremove_wake_function+0x0/0x50
Oct 17 17:26:36 SANfile_m kernel: [worker_thread+16/256] worker_thread+0x0/0x100
Oct 17 17:26:36 SANfile_m kernel: [<c0132640>] worker_thread+0x0/0x100
Oct 17 17:26:36 SANfile_m kernel: [kthread+82/112] kthread+0x42/0x70
Oct 17 17:26:36 SANfile_m kernel: [<c0135212>] kthread+0x42/0x70
Oct 17 17:26:36 SANfile_m kernel: [kthread+16/112] kthread+0x0/0x70
Oct 17 17:26:36 SANfile_m kernel: [<c01351d0>] kthread+0x0/0x70
Oct 17 17:26:36 SANfile_m kernel: [print_trace_stack+3/16] kernel_thread_helper+0x7/0x14
Oct 17 17:26:36 SANfile_m kernel: [<c0104763>] kernel_thread_helper+0x7/0x14
Oct 17 17:26:36 SANfile_m kernel: =======================
Oct 17 17:26:36 SANfile_m kernel: error 1
Oct 17 17:26:36 SANfile_m kernel: scsi 1:0:1:0: Unexpected response from lun 0 while scanning, scan aborted
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Device not ready: <6>: Sense Key : 0x2 [current]
Oct 17 17:26:36 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 17 17:26:36 SANfile_m kernel: end_request: I/O error, dev sdf, sector 7263453056
Oct 17 17:26:36 SANfile_m kernel: Buffer I/O error on device sdf, logical block 907931632
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Device not ready: <6>: Sense Key : 0x2 [current]
Oct 17 17:26:36 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 17 17:26:36 SANfile_m kernel: end_request: I/O error, dev sdf, sector 7263453056
Oct 17 17:26:36 SANfile_m kernel: Buffer I/O error on device sdf, logical block 907931632
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Device not ready: <6>: Sense Key : 0x2 [current]
Oct 17 17:26:36 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Is that what you refer as
----- Original Message -----
Sent: Wednesday, October 17, 2007 2:32 PM
Subject: Re: [dm-devel] multibus / failover and EMC CX600
* Hannes Reinecke
> That's the dev_loss_tmo setting. Just increase it to something to
> your liking.
Oh, sweet.
This knob won't affect how long the layer will hold I/O before failing
it (like lpfc_nodev_tmo), I assume? (I'm worried about it taking longer
for dm-multipath to detect failed paths).
I wish it could've been set to unlimited, though. Seems like there's
always some kind of trouble with re-adding the devices: either I run
into that -EEXIST bug, or udev doesn't do its job properly and the
revived device isn't added back into the dm-multipath map. In addition,
it sometimes breaks queue_if_no_path with earlier multipath-tools
versions that don't use no_flush on suspend. Those versions are of
course included in most server distributions...
Sigh.
Regards
--
Tore Anderson