* multibus / failover and EMC CX600
@ 2007-10-17 10:23 Gerald Nowitzky
  2007-10-17 10:40 ` Tore Anderson
  0 siblings, 1 reply; 16+ messages in thread
From: Gerald Nowitzky @ 2007-10-17 10:23 UTC (permalink / raw)
  To: dm-devel

Hello!

I am a little stuck with my multipath setup. kpartx is doing well now and my 
failover works, but failback doesn't. There are some strange things in the 
syslog, but one thing at a time:

- I have a host with two HBAs (HBA-A and B)
- these are connected to two Switches (HBA-A to SW-A and HBA-B to SW-B)
- each of the switches is connected to both Service Processors (SP-A and SP-B) 
of my EMC CX600
- The CX600 is not multihomed, thus either SP-A or SP-B is servicing my LUN.

What I'd like to have is multibus via HBA-A -> SW-A -> SP-A and HBA-B -> 
SW-B -> SP-A to the active SP and, in case both paths to the active SP fail, 
a trespass of my LUN to SP-B with multibus to SP-B, and vice versa.

I thought "group_by_serial" should do that, but it doesn't.

I get messages about failing and recovering paths in the syslog. The 
failover from SP-A to SP-B works, but then I get strange things in the log 
and failing back doesn't work:

-> All paths ok, SP-A is holding the LUN:

SANfile_m ~ # multipath -l
hcfshare (360060160c820080063502869e459dc11) dm-0 DGC     ,RAID 5
[size=3.4T][features=1 queue_if_no_path][hwhandler=1 emc]
\_ round-robin 0 [prio=0][active]
 \_ 2:0:1:0 sde 8:64  [active][undef]
 \_ 2:0:0:0 sdd 8:48  [active][undef]
 \_ 1:0:1:0 sdc 8:32  [active][undef]
 \_ 1:0:0:0 sdb 8:16  [active][undef]

SANfile_m ~ # dmsetup table
hcfshare1: 0 7263453117 linear 253:0 34
hcfshare: 0 7263453184 multipath 1 queue_if_no_path 1 emc 1 1 round-robin 0 
4 1 8:64 1000 8:48 1000 8:32 1000 8:16 1000
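
The multipath table line above can be read field by field: "1 queue_if_no_path" is one feature, "1 emc" is one hardware-handler argument, and the following "1 1" means one path group, with group 1 tried first. "round-robin 0 4 1" then introduces four paths with one argument each. So group_by_serial has put all four paths into a single group. A small sketch that pulls the path-group count out of `dmsetup table` output, assuming each map sits on one unwrapped line (the function name count_path_groups is made up for this example):

```shell
# Print the number of path groups for every multipath map read from stdin.
# Assumes the dm-multipath table layout:
#   multipath <#features> <features...> <#hwhandler args> <args...> <#groups> ...
count_path_groups() {
    awk '{
        # locate the "multipath" target keyword; skip other targets (e.g. linear)
        for (i = 1; i <= NF; i++) if ($i == "multipath") break
        if (i > NF) next
        i++              # now at the feature count
        i += $i + 1      # skip the features, land on the hw-handler arg count
        i += $i + 1      # skip the hw-handler args, land on the path-group count
        print $i
    }'
}
```

Typical use would be `dmsetup table | count_path_groups`; with the table shown above it reports a single group.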

syslog:
Oct 16 21:29:50 SANfile_m multipathd: sdd: emc_clariion_checker: Passive 
path is healthy.
Oct 16 21:29:50 SANfile_m multipathd: 8:48: reinstated
Oct 16 21:29:50 SANfile_m multipathd: hcfshare: remaining active paths: 3
Oct 16 21:29:50 SANfile_m kernel: sd 2:0:0:0: [sdd] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:29:50 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:29:50 SANfile_m kernel: end_request: I/O error, dev sdd, sector 
6609990458
Oct 16 21:29:50 SANfile_m kernel: device-mapper: multipath: Failing path 
8:48.
Oct 16 21:29:50 SANfile_m multipathd: sdb: emc_clariion_checker: Passive 
path is healthy.
Oct 16 21:29:50 SANfile_m multipathd: 8:16: reinstated
Oct 16 21:29:50 SANfile_m multipathd: hcfshare: remaining active paths: 4
Oct 16 21:29:50 SANfile_m kernel: sd 1:0:0:0: [sdb] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:29:50 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:29:50 SANfile_m kernel: end_request: I/O error, dev sdb, sector 
6609991482
Oct 16 21:29:50 SANfile_m kernel: device-mapper: multipath: Failing path 
8:16.
Oct 16 21:29:50 SANfile_m multipathd: 8:48: mark as failed
Oct 16 21:29:50 SANfile_m multipathd: hcfshare: remaining active paths: 3
Oct 16 21:29:50 SANfile_m multipathd: 8:16: mark as failed
Oct 16 21:29:50 SANfile_m multipathd: hcfshare: remaining active paths: 2
Oct 16 21:29:55 SANfile_m multipathd: sdd: emc_clariion_checker: Passive 
path is healthy.
Oct 16 21:29:55 SANfile_m multipathd: 8:48: reinstated
Oct 16 21:29:55 SANfile_m multipathd: hcfshare: remaining active paths: 3
Oct 16 21:29:55 SANfile_m kernel: sd 2:0:0:0: [sdd] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:29:55 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:29:55 SANfile_m kernel: end_request: I/O error, dev sdd, sector 
2072001426
Oct 16 21:29:55 SANfile_m kernel: device-mapper: multipath: Failing path 
8:48.
Oct 16 21:29:55 SANfile_m multipathd: sdb: emc_clariion_checker: Passive 
path is healthy.
Oct 16 21:29:55 SANfile_m multipathd: 8:16: reinstated
Oct 16 21:29:55 SANfile_m multipathd: hcfshare: remaining active paths: 4
Oct 16 21:29:55 SANfile_m kernel: sd 1:0:0:0: [sdb] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:29:55 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:29:55 SANfile_m kernel: end_request: I/O error, dev sdb, sector 
2072001938
Oct 16 21:29:55 SANfile_m kernel: device-mapper: multipath: Failing path 
8:16.
Oct 16 21:29:55 SANfile_m multipathd: 8:48: mark as failed
Oct 16 21:29:55 SANfile_m multipathd: hcfshare: remaining active paths: 3
Oct 16 21:29:55 SANfile_m multipathd: 8:16: mark as failed
Oct 16 21:29:55 SANfile_m multipathd: hcfshare: remaining active paths: 2
Oct 16 21:30:00 SANfile_m multipathd: sdd: emc_clariion_checker: Passive 
path is healthy.
Oct 16 21:30:00 SANfile_m multipathd: 8:48: reinstated
Oct 16 21:30:00 SANfile_m multipathd: hcfshare: remaining active paths: 3
Oct 16 21:30:00 SANfile_m kernel: sd 2:0:0:0: [sdd] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:30:00 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:30:00 SANfile_m kernel: end_request: I/O error, dev sdd, sector 
3208345898
Oct 16 21:30:00 SANfile_m kernel: device-mapper: multipath: Failing path 
8:48.
Oct 16 21:30:00 SANfile_m multipathd: sdb: emc_clariion_checker: Passive 
path is healthy.
Oct 16 21:30:00 SANfile_m multipathd: 8:16: reinstated
Oct 16 21:30:00 SANfile_m multipathd: hcfshare: remaining active paths: 4
Oct 16 21:30:00 SANfile_m multipathd: 8:48: mark as failed
Oct 16 21:30:00 SANfile_m multipathd: hcfshare: remaining active paths: 3
Oct 16 21:30:00 SANfile_m kernel: sd 1:0:0:0: [sdb] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:30:00 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:30:00 SANfile_m kernel: end_request: I/O error, dev sdb, sector 
3208346410
Oct 16 21:30:00 SANfile_m kernel: device-mapper: multipath: Failing path 
8:16.

Now both paths to SP-A fail, the failover to SP-B works:
syslog:
Oct 16 21:32:15 SANfile_m kernel:  rport-2:0-1: blocked FC remote port time 
out: removing target and saving binding
Oct 16 21:32:17 SANfile_m kernel:  rport-1:0-1: blocked FC remote port time 
out: removing target and saving binding
Oct 16 21:32:17 SANfile_m multipathd: 8:64: mark as failed
Oct 16 21:32:17 SANfile_m multipathd: hcfshare: remaining active paths: 3
Oct 16 21:32:17 SANfile_m multipathd: 8:48: mark as failed
Oct 16 21:32:17 SANfile_m multipathd: hcfshare: remaining active paths: 2
Oct 16 21:32:17 SANfile_m multipathd: 8:32: mark as failed
Oct 16 21:32:17 SANfile_m multipathd: hcfshare: remaining active paths: 1
Oct 16 21:32:17 SANfile_m multipathd: 8:16: mark as failed
Oct 16 21:32:17 SANfile_m multipathd: hcfshare: Entering recovery mode: 
max_retries=60
Oct 16 21:32:17 SANfile_m multipathd: hcfshare: remaining active paths: 0
Oct 16 21:32:17 SANfile_m multipathd: hcfshare: Entering recovery mode: 
max_retries=60
Oct 16 21:32:22 SANfile_m kernel: scsi 2:0:1:0: rejecting I/O to dead device
Oct 16 21:32:22 SANfile_m multipathd: sde: emc_clariion_checker: query 
command indicates error
Oct 16 21:32:22 SANfile_m multipathd: sdd: emc_clariion_checker: Passive 
path is healthy.
Oct 16 21:32:22 SANfile_m multipathd: 8:48: reinstated
Oct 16 21:32:22 SANfile_m multipathd: hcfshare: queue_if_no_path enabled
Oct 16 21:32:22 SANfile_m multipathd: hcfshare: Recovered to normal mode
Oct 16 21:32:22 SANfile_m kernel: device-mapper: multipath emc: emc_pg_init: 
sending switch-over command
Oct 16 21:32:22 SANfile_m multipathd: hcfshare: remaining active paths: 1
Oct 16 21:32:22 SANfile_m kernel: scsi 1:0:1:0: rejecting I/O to dead device
Oct 16 21:32:22 SANfile_m multipathd: sdc: emc_clariion_checker: query 
command indicates error
Oct 16 21:32:22 SANfile_m multipathd: sdb: emc_clariion_checker: Active path 
is healthy.
Oct 16 21:32:22 SANfile_m multipathd: 8:16: reinstated
Oct 16 21:32:22 SANfile_m multipathd: hcfshare: remaining active paths: 2
Oct 16 21:32:27 SANfile_m kernel: scsi 2:0:1:0: rejecting I/O to dead device
Oct 16 21:32:27 SANfile_m multipathd: sde: emc_clariion_checker: query 
command indicates error
Oct 16 21:32:27 SANfile_m kernel: scsi 1:0:1:0: rejecting I/O to dead device
Oct 16 21:32:27 SANfile_m multipathd: sdc: emc_clariion_checker: query 
command indicates error
Oct 16 21:32:32 SANfile_m kernel: scsi 2:0:1:0: rejecting I/O to dead device
Oct 16 21:32:32 SANfile_m kernel: scsi 1:0:1:0: rejecting I/O to dead device
Oct 16 21:32:32 SANfile_m multipathd: sde: emc_clariion_checker: query 
command indicates error
Oct 16 21:32:32 SANfile_m multipathd: sdc: emc_clariion_checker: query 
command indicates error
Oct 16 21:32:37 SANfile_m kernel: scsi 2:0:1:0: rejecting I/O to dead device
Oct 16 21:32:37 SANfile_m multipathd: sde: emc_clariion_checker: query 
command indicates error
Oct 16 21:32:37 SANfile_m kernel: scsi 1:0:1:0: rejecting I/O to dead device
Oct 16 21:32:37 SANfile_m multipathd: sdc: emc_clariion_checker: query 
command indicates error
Oct 16 21:32:42 SANfile_m multipathd: sde: emc_clariion_checker: query 
command indicates error
Oct 16 21:32:42 SANfile_m multipathd: sdc: emc_clariion_checker: query 
command indicates error
Oct 16 21:32:42 SANfile_m kernel: scsi 2:0:1:0: rejecting I/O to dead device
Oct 16 21:32:42 SANfile_m kernel: scsi 1:0:1:0: rejecting I/O to dead device


hcfshare (360060160c820080063502869e459dc11) dm-0 ,
[size=3.4T][features=1 queue_if_no_path][hwhandler=1 emc]
\_ round-robin 0 [prio=0][active]
 \_ #:#:#:# -   #:#   [failed][undef]
 \_ 2:0:0:0 sdd 8:48  [active][undef]
 \_ #:#:#:# -   #:#   [failed][undef]
 \_ 1:0:0:0 sdb 8:16  [active][undef]

SANfile_m ~ # dmsetup table
hcfshare1: 0 7263453117 linear 253:0 34
hcfshare: 0 7263453184 multipath 1 queue_if_no_path 1 emc 1 1 round-robin 0 
4 1 8:64 1000 8:48 1000 8:32 1000 8:16 1000


Now the paths to SP-A are coming up again, but multipath still shows them as 
failed, and there are some disturbing messages in the syslog:

SANfile_m ~ # multipath -l
hcfshare (360060160c820080063502869e459dc11) dm-0 ,
[size=3.4T][features=1 queue_if_no_path][hwhandler=1 emc]
\_ round-robin 0 [prio=0][active]
 \_ #:#:#:# -   #:#   [failed][undef]
 \_ 2:0:0:0 sdd 8:48  [active][undef]
 \_ #:#:#:# -   #:#   [failed][undef]
 \_ 1:0:0:0 sdb 8:16  [active][undef]

SANfile_m ~ # dmsetup table
hcfshare1: 0 7263453117 linear 253:0 34
hcfshare: 0 7263453184 multipath 1 queue_if_no_path 1 emc 1 1 round-robin 0 
4 1 8:64 1000 8:48 1000 8:32 1000 8:16 1000

syslog:
Oct 16 21:35:27 SANfile_m kernel: scsi 1:0:1:0: Direct-Access     DGC 
RAID 5           0219 PQ: 0 ANSI: 4
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Very big device. Trying 
to use READ CAPACITY(16).
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] 7263453184 512-byte 
hardware sectors (3718888 MB)
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Test WP failed, assume 
Write Enabled
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Asking for cache data 
failed
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Assuming drive cache: 
write through
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Very big device. Trying 
to use READ CAPACITY(16).
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] 7263453184 512-byte 
hardware sectors (3718888 MB)
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Test WP failed, assume 
Write Enabled
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Asking for cache data 
failed
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Assuming drive cache: 
write through
Oct 16 21:35:27 SANfile_m kernel:  sdg:<6>sd 1:0:1:0: [sdg] Device not 
ready: <6>: Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: printk: 35 messages suppressed.
Oct 16 21:35:27 SANfile_m kernel: Buffer I/O error on device sdg, logical 
block 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: Buffer I/O error on device sdg, logical 
block 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: Buffer I/O error on device sdg, logical 
block 0
Oct 16 21:35:27 SANfile_m kernel: ldm_validate_partition_table(): Disk read 
failed.
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: Buffer I/O error on device sdg, logical 
block 0
Oct 16 21:35:27 SANfile_m kernel:  unable to read partition table
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Attached SCSI disk
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: Attached scsi generic sg4 type 
0
Oct 16 21:35:27 SANfile_m kernel: scsi 1:0:1:0: Direct-Access     DGC 
RAID 5           0219 PQ: 0 ANSI: 4
Oct 16 21:35:27 SANfile_m kernel: kobject_add failed for 1:0:1:0 
with -EEXIST, don't try to register things with the same name in the same 
directory.
Oct 16 21:35:27 SANfile_m kernel:  [number+85/816] 
kobject_shadow_add+0x115/0x1b0
Oct 16 21:35:27 SANfile_m kernel:  [<c02f95f5>] 
kobject_shadow_add+0x115/0x1b0
Oct 16 21:35:27 SANfile_m kernel:  [lo_ioctl+1125/2528] 
device_add+0xc5/0x570
Oct 16 21:35:27 SANfile_m kernel:  [<c03aefd5>] device_add+0xc5/0x570
Oct 16 21:35:27 SANfile_m kernel:  [fc_remote_port_rolechg+127/320] 
scsi_adjust_queue_depth+0x9f/0xf0
Oct 16 21:35:27 SANfile_m kernel:  [<c03f9d7f>] 
scsi_adjust_queue_depth+0x9f/0xf0
Oct 16 21:35:27 SANfile_m kernel:  [blk_register_region+18/64] 
__blk_queue_init_tags+0x32/0x70
Oct 16 21:35:27 SANfile_m kernel:  [<c02eeb72>] 
__blk_queue_init_tags+0x32/0x70
Oct 16 21:35:27 SANfile_m kernel:  [sr_get_mcn+2/240] 
scsi_sysfs_add_sdev+0x32/0x230
Oct 16 21:35:27 SANfile_m kernel:  [<c0402882>] 
scsi_sysfs_add_sdev+0x32/0x230
Oct 16 21:35:27 SANfile_m kernel:  [<f99445b7>] 
qla2xxx_slave_configure+0x77/0x110 [qla2xxx]
Oct 16 21:35:27 SANfile_m kernel:  [sd_init_command+313/1088] 
scsi_probe_and_add_lun+0x8c9/0x940
Oct 16 21:35:27 SANfile_m kernel:  [<c0400859>] 
scsi_probe_and_add_lun+0x8c9/0x940
Oct 16 21:35:27 SANfile_m kernel:  [sr_probe+72/1472] 
__scsi_scan_target+0x518/0x5c0
Oct 16 21:35:27 SANfile_m kernel:  [<c04012c8>] 
__scsi_scan_target+0x518/0x5c0
Oct 16 21:35:27 SANfile_m kernel:  [kallsyms_addresses+36259/130252] 
schedule+0x2df/0x940
Oct 16 21:35:27 SANfile_m kernel:  [<c053695f>] schedule+0x2df/0x940
Oct 16 21:35:27 SANfile_m kernel:  [sr_init_command+54/944] 
scsi_scan_target+0xb6/0xe0
Oct 16 21:35:27 SANfile_m kernel:  [<c04019f6>] scsi_scan_target+0xb6/0xe0
Oct 16 21:35:27 SANfile_m kernel:  [SendIocInit+224/784] 
fc_scsi_scan_rport+0x0/0x90
Oct 16 21:35:27 SANfile_m kernel:  [<c04084b0>] fc_scsi_scan_rport+0x0/0x90
Oct 16 21:35:27 SANfile_m kernel:  [SendIocInit+344/784] 
fc_scsi_scan_rport+0x78/0x90
Oct 16 21:35:27 SANfile_m kernel:  [<c0408528>] fc_scsi_scan_rport+0x78/0x90
Oct 16 21:35:27 SANfile_m kernel:  [run_workqueue+131/256] 
run_workqueue+0x73/0x100
Oct 16 21:35:27 SANfile_m kernel:  [<c0131dc3>] run_workqueue+0x73/0x100
Oct 16 21:35:27 SANfile_m kernel:  [autoremove_wake_function+16/80] 
autoremove_wake_function+0x0/0x50
Oct 16 21:35:27 SANfile_m kernel:  [<c01354e0>] 
autoremove_wake_function+0x0/0x50
Oct 16 21:35:27 SANfile_m kernel:  [worker_thread+172/256] 
worker_thread+0x9c/0x100
Oct 16 21:35:27 SANfile_m kernel:  [<c01326dc>] worker_thread+0x9c/0x100
Oct 16 21:35:27 SANfile_m kernel:  [autoremove_wake_function+16/80] 
autoremove_wake_function+0x0/0x50
Oct 16 21:35:27 SANfile_m kernel:  [<c01354e0>] 
autoremove_wake_function+0x0/0x50
Oct 16 21:35:27 SANfile_m kernel:  [worker_thread+16/256] 
worker_thread+0x0/0x100
Oct 16 21:35:27 SANfile_m kernel:  [<c0132640>] worker_thread+0x0/0x100
Oct 16 21:35:27 SANfile_m kernel:  [kthread+82/112] kthread+0x42/0x70
Oct 16 21:35:27 SANfile_m kernel:  [<c0135212>] kthread+0x42/0x70
Oct 16 21:35:27 SANfile_m kernel:  [kthread+16/112] kthread+0x0/0x70
Oct 16 21:35:27 SANfile_m kernel:  [<c01351d0>] kthread+0x0/0x70
Oct 16 21:35:27 SANfile_m kernel:  [print_trace_stack+3/16] 
kernel_thread_helper+0x7/0x14
Oct 16 21:35:27 SANfile_m kernel:  [<c0104763>] 
kernel_thread_helper+0x7/0x14
Oct 16 21:35:27 SANfile_m kernel:  =======================
Oct 16 21:35:27 SANfile_m kernel: error 1
Oct 16 21:35:27 SANfile_m kernel: scsi 1:0:1:0: Unexpected response from lun 
0 while scanning, scan aborted
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 
7263453056
Oct 16 21:35:27 SANfile_m kernel: Buffer I/O error on device sdg, logical 
block 907931632
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 
7263453056
Oct 16 21:35:27 SANfile_m kernel: Buffer I/O error on device sdg, logical 
block 907931632
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 
7263453056
Oct 16 21:35:27 SANfile_m kernel: Buffer I/O error on device sdg, logical 
block 907931632
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 
7263453176
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 
7263453176
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 
7263453176
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 
7263453176
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 
7263453176
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 
7263453176
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 
7263453120
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 
7263453168
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 
7263453176
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 
7263453176
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:27 SANfile_m kernel: sd 1:0:1:0: [sdg] Device not ready: <6>: 
Sense Key : 0x2 [current]
Oct 16 21:35:27 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 16 21:35:27 SANfile_m kernel: end_request: I/O error, dev sdg, sector 0
Oct 16 21:35:28 SANfile_m kernel: scsi 2:0:1:0: rejecting I/O to dead device
Oct 16 21:35:28 SANfile_m kernel: scsi 1:0:1:0: rejecting I/O to dead device
Oct 16 21:35:28 SANfile_m multipathd: sde: emc_clariion_checker: query 
command indicates error
Oct 16 21:35:33 SANfile_m kernel: scsi 2:0:1:0: rejecting I/O to dead device
Oct 16 21:35:33 SANfile_m kernel: scsi 1:0:1:0: rejecting I/O to dead device
Oct 16 21:35:33 SANfile_m multipathd: sdc: emc_clariion_checker: query 
command indicates error
Oct 16 21:35:33 SANfile_m multipathd: sde: emc_clariion_checker: query 
command indicates error
Oct 16 21:35:33 SANfile_m multipathd: sdc: emc_clariion_checker: query 
command indicates error
Oct 16 21:35:38 SANfile_m kernel: scsi 2:0:1:0: rejecting I/O to dead device
Oct 16 21:35:38 SANfile_m multipathd: sde: emc_clariion_checker: query 
command indicates error
Oct 16 21:35:38 SANfile_m kernel: scsi 1:0:1:0: rejecting I/O to dead device
Oct 16 21:35:38 SANfile_m multipathd: sdc: emc_clariion_checker: query 
command indicates error
Oct 16 21:35:43 SANfile_m kernel: scsi 2:0:1:0: rejecting I/O to dead device
Oct 16 21:35:43 SANfile_m kernel: scsi 1:0:1:0: rejecting I/O to dead device
Oct 16 21:35:43 SANfile_m multipathd: sde: emc_clariion_checker: query 
command indicates error
Oct 16 21:35:43 SANfile_m multipathd: sdc: emc_clariion_checker: query 
command indicates error



multipath -l still shows:
hcfshare (360060160c820080063502869e459dc11) dm-0 ,
[size=3.4T][features=1 queue_if_no_path][hwhandler=1 emc]
\_ round-robin 0 [prio=0][active]
 \_ #:#:#:# -   #:#   [failed][undef]
 \_ 2:0:0:0 sdd 8:48  [active][undef]
 \_ #:#:#:# -   #:#   [failed][undef]
 \_ 1:0:0:0 sdb 8:16  [active][undef]

Of course, failback won't work then.

My config:

defaults {
       udev_dir                 /dev
       polling_interval         5
       selector                 "round-robin 0"
       path_grouping_policy     group_by_serial
       failback                 immediate
       getuid_callout           "/sbin/scsi_id -g -u -s /block/%n"
}

multipaths {
        multipath {
                wwid                    360060160c820080063502869e459dc11
                alias                   hcfshare
                path_grouping_policy    group_by_serial
                path_checker            emc_clariion
                path_selector           "round-robin 0"
                failback                immediate
        }
}

Does that tell somebody something?

Thanks,
(Gerald) 

* Re: multibus / failover and EMC CX600
  2007-10-17 10:23 multibus / failover and EMC CX600 Gerald Nowitzky
@ 2007-10-17 10:40 ` Tore Anderson
  2007-10-17 11:08   ` Hannes Reinecke
  0 siblings, 1 reply; 16+ messages in thread
From: Tore Anderson @ 2007-10-17 10:40 UTC (permalink / raw)
  To: device-mapper development

* Gerald Nowitzky

> What I'd like to have is multibus via HBA-A -> SW-A -> SP-A  and
> HBA-B -> SW-B -> SP-A to the active SP and, in case both paths to the
> active SP fail, a trespas of my LUN to SP-B, multibus to the other
> SP-B and vice versa.

Try the following:

  prio_callout "/sbin/mpath_prio_emc /dev/%n"
  path_grouping_policy group_by_prio
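
A sketch of how these two lines could be merged into the multipath.conf from the original post. Putting prio_callout into the defaults section and changing path_grouping_policy in both sections are assumptions; the thread does not say where the lines were added.

```
defaults {
       udev_dir                 /dev
       polling_interval         5
       selector                 "round-robin 0"
       # assumption: group paths by the priority the callout reports
       path_grouping_policy     group_by_prio
       prio_callout             "/sbin/mpath_prio_emc /dev/%n"
       failback                 immediate
       getuid_callout           "/sbin/scsi_id -g -u -s /block/%n"
}

multipaths {
        multipath {
                wwid                    360060160c820080063502869e459dc11
                alias                   hcfshare
                # assumption: changed here as well, since a per-multipath
                # setting would otherwise take precedence over defaults
                path_grouping_policy    group_by_prio
                path_checker            emc_clariion
                path_selector           "round-robin 0"
                failback                immediate
        }
}
```

With this grouping, the paths to the SP that owns the LUN should end up in one path group and the paths to the other SP in a second one.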

The -EEXIST message is a kernel bug, see the thread starting at
<http://lkml.org/lkml/2007/8/14/106> for more information.  It might be
possible to work around it by disabling async SCSI scanning, if not
there's a patch from Matthew Wilcox that removed most of those -EEXIST
errors (but not all the corner cases).

It kinda sucks that the kernel removes the SCSI devices pointing to
blocked rports.  Wonder if it's possible to disable that somehow...

Regards
-- 
Tore Anderson

* Re: multibus / failover and EMC CX600
  2007-10-17 10:40 ` Tore Anderson
@ 2007-10-17 11:08   ` Hannes Reinecke
  2007-10-17 12:32     ` Tore Anderson
  0 siblings, 1 reply; 16+ messages in thread
From: Hannes Reinecke @ 2007-10-17 11:08 UTC (permalink / raw)
  To: device-mapper development

Tore Anderson wrote:
> * Gerald Nowitzky
> 
>> What I'd like to have is multibus via HBA-A -> SW-A -> SP-A  and
>> HBA-B -> SW-B -> SP-A to the active SP and, in case both paths to the
>> active SP fail, a trespas of my LUN to SP-B, multibus to the other
>> SP-B and vice versa.
> 
> Try the following:
> 
>   prio_callout "/sbin/mpath_prio_emc /dev/%n"
>   path_grouping_policy group_by_prio
> 
> The -EEXIST message is a kernel bug, see the thread starting at
> <http://lkml.org/lkml/2007/8/14/106> for more information.  It might be
> possible to work around it by disabling async SCSI scanning, if not
> there's a patch from Matthew Wilcox that removed most of those -EEXIST
> errors (but not all the corner cases).
> 
> It kinda sucks that the kernel removes the SCSI devices pointing to
> blocked rports.  Wonder if it's possible to disable that somehow...
> 
That's the dev_loss_tmo setting. Just increase it to something to
your liking.
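
The value is exposed per FC remote port in sysfs. A minimal sketch for inspecting and raising it, assuming the usual /sys/class/fc_remote_ports layout; the RPORT_DIR variable and the two function names are made up for this example:

```shell
# Directory holding the FC remote ports; overridable so the functions
# can be tried against a scratch directory (hypothetical convenience).
RPORT_DIR="${RPORT_DIR:-/sys/class/fc_remote_ports}"

# Print "<rport name> <dev_loss_tmo in seconds>" for every remote port.
show_dev_loss_tmo() {
    for f in "$RPORT_DIR"/rport-*/dev_loss_tmo; do
        [ -e "$f" ] || continue
        printf '%s %s\n' "$(basename "$(dirname "$f")")" "$(cat "$f")"
    done
}

# Write the timeout given as $1 (seconds) to every remote port.
# Needs root on a real system.
set_dev_loss_tmo() {
    for f in "$RPORT_DIR"/rport-*/dev_loss_tmo; do
        [ -e "$f" ] || continue
        echo "$1" > "$f"
    done
}
```

For example, `set_dev_loss_tmo 120` followed by `show_dev_loss_tmo` shows whether the write was accepted. A value written this way is not kept across a reboot.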

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

* Re: multibus / failover and EMC CX600
  2007-10-17 11:08   ` Hannes Reinecke
@ 2007-10-17 12:32     ` Tore Anderson
  2007-10-17 14:48       ` Gerald Nowitzky
  2007-10-17 19:49       ` Mike Christie
  0 siblings, 2 replies; 16+ messages in thread
From: Tore Anderson @ 2007-10-17 12:32 UTC (permalink / raw)
  To: device-mapper development

* Hannes Reinecke

> That's the dev_loss_tmo setting. Just increase it to something to
> your liking.

Oh, sweet.  This knob won't affect how long the layer will hold I/O
before failing it (like lpfc_nodev_tmo), I assume?  (I'm worried about
it taking longer for dm-multipath to detect failed paths.)

I wish it could've been set to unlimited, though.  Seems like there's
always some kind of trouble with re-adding the devices: either I run
into that -EEXIST bug, or udev doesn't do its job properly and the
revived device isn't added back into the dm-multipath map.  In addition,
it sometimes breaks queue_if_no_path with earlier multipath-tools
versions that don't use no_flush on suspend.  Those versions are of
course included in most server distributions...  Sigh.

Regards
-- 
Tore Anderson

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: multibus / failover and EMC CX600
  2007-10-17 12:32     ` Tore Anderson
@ 2007-10-17 14:48       ` Gerald Nowitzky
  2007-10-17 16:01         ` Tore Anderson
  2007-10-17 19:49       ` Mike Christie
  1 sibling, 1 reply; 16+ messages in thread
From: Gerald Nowitzky @ 2007-10-17 14:48 UTC (permalink / raw)
  To: device-mapper development


The mpath_prio_emc with group_by_prio did the trick. Thanks!
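
For the archive, a sketch of how the corresponding device section in /etc/multipath.conf might look. Only prio_callout and path_grouping_policy come from Tore's suggestion. The vendor string, hardware handler, path checker and features line are inferred from the multipath -l and syslog output earlier in the thread, and the product pattern and failback setting are assumptions. It is not a tested configuration.

```
devices {
        device {
                vendor                  "DGC"
                product                 ".*"                  # assumption: match every CLARiiON LUN
                path_grouping_policy    group_by_prio
                prio_callout            "/sbin/mpath_prio_emc /dev/%n"
                hardware_handler        "1 emc"
                path_checker            emc_clariion
                features                "1 queue_if_no_path"
                failback                immediate             # assumption: fail back as soon as the SP returns
        }
}
```

With group_by_prio the paths to the SP that currently owns the LUN end up in one group and the paths to the other SP in a second group, which matches the two groups in the output below.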

But I am still losing the paths to the failed devices. I increased dev_loss_tmo, but the maximum seems to be about 600; thus, after 10 minutes, the paths fail:

SANfile_m linux # multipath -l
hcfshare (360060160c820080063502869e459dc11) dm-0 ,
[size=3.4T][features=1 queue_if_no_path][hwhandler=1 emc]
\_ round-robin 0 [prio=0][enabled]
 \_ #:#:#:# -   #:#   [failed][undef]
 \_ #:#:#:# -   #:#   [failed][undef]
\_ round-robin 0 [prio=0][active]
 \_ 2:0:0:0 sdd 8:48  [active][undef]
 \_ 1:0:0:0 sdb 8:16  [active][undef]

If I put them online again, I run into the -EEXIST problem. Async SCSI scanning *is* off in my kernel, so the only thing I can do from here is to try the patch, isn't it?

Oct 17 17:26:34 SANfile_m kernel: scsi 1:0:1:0: rejecting I/O to dead device
Oct 17 17:26:34 SANfile_m multipathd: sdc: emc_clariion_checker: query command indicates error
Oct 17 17:26:35 SANfile_m kernel: scsi 2:0:1:0: rejecting I/O to dead device
Oct 17 17:26:35 SANfile_m multipathd: sde: emc_clariion_checker: query command indicates error
Oct 17 17:26:36 SANfile_m kernel: scsi 1:0:1:0: Direct-Access     DGC      RAID 5           0219 PQ: 0 ANSI: 4
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Very big device. Trying to use READ CAPACITY(16).
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] 7263453184 512-byte hardware sectors (3718888 MB)
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Test WP failed, assume Write Enabled
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Asking for cache data failed
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Assuming drive cache: write through
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Very big device. Trying to use READ CAPACITY(16).
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] 7263453184 512-byte hardware sectors (3718888 MB)
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Test WP failed, assume Write Enabled
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Asking for cache data failed
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Assuming drive cache: write through
Oct 17 17:26:36 SANfile_m kernel:  sdf:<6>sd 1:0:1:0: [sdf] Device not ready: <6>: Sense Key : 0x2 [current]
Oct 17 17:26:36 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 17 17:26:36 SANfile_m kernel: end_request: I/O error, dev sdf, sector 0
Oct 17 17:26:36 SANfile_m kernel: printk: 40 messages suppressed.
Oct 17 17:26:36 SANfile_m kernel: Buffer I/O error on device sdf, logical block 0
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Device not ready: <6>: Sense Key : 0x2 [current]
Oct 17 17:26:36 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 17 17:26:36 SANfile_m kernel: end_request: I/O error, dev sdf, sector 0
Oct 17 17:26:36 SANfile_m kernel: Buffer I/O error on device sdf, logical block 0
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Device not ready: <6>: Sense Key : 0x2 [current]
Oct 17 17:26:36 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 17 17:26:36 SANfile_m kernel: end_request: I/O error, dev sdf, sector 0
Oct 17 17:26:36 SANfile_m kernel: Buffer I/O error on device sdf, logical block 0
Oct 17 17:26:36 SANfile_m kernel: ldm_validate_partition_table(): Disk read failed.
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Device not ready: <6>: Sense Key : 0x2 [current]
Oct 17 17:26:36 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 17 17:26:36 SANfile_m kernel: end_request: I/O error, dev sdf, sector 0
Oct 17 17:26:36 SANfile_m kernel: Buffer I/O error on device sdf, logical block 0
Oct 17 17:26:36 SANfile_m kernel:  unable to read partition table
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Attached SCSI disk
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: Attached scsi generic sg2 type 0
Oct 17 17:26:36 SANfile_m kernel: scsi 1:0:1:0: Direct-Access     DGC      RAID 5           0219 PQ: 0 ANSI: 4
Oct 17 17:26:36 SANfile_m kernel: kobject_add failed for 1:0:1:0 with -EEXIST, don't try to register things with the same name in the same directory.
Oct 17 17:26:36 SANfile_m kernel:  [number+85/816] kobject_shadow_add+0x115/0x1b0
Oct 17 17:26:36 SANfile_m kernel:  [<c02f95f5>] kobject_shadow_add+0x115/0x1b0
Oct 17 17:26:36 SANfile_m kernel:  [lo_ioctl+1125/2528] device_add+0xc5/0x570
Oct 17 17:26:36 SANfile_m kernel:  [<c03aefd5>] device_add+0xc5/0x570
Oct 17 17:26:36 SANfile_m kernel:  [fc_remote_port_rolechg+127/320] scsi_adjust_queue_depth+0x9f/0xf0
Oct 17 17:26:36 SANfile_m kernel:  [<c03f9d7f>] scsi_adjust_queue_depth+0x9f/0xf0
Oct 17 17:26:36 SANfile_m kernel:  [blk_register_region+18/64] __blk_queue_init_tags+0x32/0x70
Oct 17 17:26:36 SANfile_m kernel:  [<c02eeb72>] __blk_queue_init_tags+0x32/0x70
Oct 17 17:26:36 SANfile_m kernel:  [sr_get_mcn+2/240] scsi_sysfs_add_sdev+0x32/0x230
Oct 17 17:26:36 SANfile_m kernel:  [<c0402882>] scsi_sysfs_add_sdev+0x32/0x230
Oct 17 17:26:36 SANfile_m kernel:  [<f99445b7>] qla2xxx_slave_configure+0x77/0x110 [qla2xxx]
Oct 17 17:26:36 SANfile_m kernel:  [sd_init_command+313/1088] scsi_probe_and_add_lun+0x8c9/0x940
Oct 17 17:26:36 SANfile_m kernel:  [<c0400859>] scsi_probe_and_add_lun+0x8c9/0x940
Oct 17 17:26:36 SANfile_m kernel:  [sr_probe+72/1472] __scsi_scan_target+0x518/0x5c0
Oct 17 17:26:36 SANfile_m kernel:  [<c04012c8>] __scsi_scan_target+0x518/0x5c0
Oct 17 17:26:36 SANfile_m kernel:  [kallsyms_addresses+36259/130252] schedule+0x2df/0x940
Oct 17 17:26:36 SANfile_m kernel:  [<c053695f>] schedule+0x2df/0x940
Oct 17 17:26:36 SANfile_m kernel:  [sr_init_command+54/944] scsi_scan_target+0xb6/0xe0
Oct 17 17:26:36 SANfile_m kernel:  [<c04019f6>] scsi_scan_target+0xb6/0xe0
Oct 17 17:26:36 SANfile_m kernel:  [SendIocInit+224/784] fc_scsi_scan_rport+0x0/0x90
Oct 17 17:26:36 SANfile_m kernel:  [<c04084b0>] fc_scsi_scan_rport+0x0/0x90
Oct 17 17:26:36 SANfile_m kernel:  [SendIocInit+344/784] fc_scsi_scan_rport+0x78/0x90
Oct 17 17:26:36 SANfile_m kernel:  [<c0408528>] fc_scsi_scan_rport+0x78/0x90
Oct 17 17:26:36 SANfile_m kernel:  [run_workqueue+131/256] run_workqueue+0x73/0x100
Oct 17 17:26:36 SANfile_m kernel:  [<c0131dc3>] run_workqueue+0x73/0x100
Oct 17 17:26:36 SANfile_m kernel:  [autoremove_wake_function+16/80] autoremove_wake_function+0x0/0x50
Oct 17 17:26:36 SANfile_m kernel:  [<c01354e0>] autoremove_wake_function+0x0/0x50
Oct 17 17:26:36 SANfile_m kernel:  [worker_thread+172/256] worker_thread+0x9c/0x100
Oct 17 17:26:36 SANfile_m kernel:  [<c01326dc>] worker_thread+0x9c/0x100
Oct 17 17:26:36 SANfile_m kernel:  [autoremove_wake_function+16/80] autoremove_wake_function+0x0/0x50
Oct 17 17:26:36 SANfile_m kernel:  [<c01354e0>] autoremove_wake_function+0x0/0x50
Oct 17 17:26:36 SANfile_m kernel:  [worker_thread+16/256] worker_thread+0x0/0x100
Oct 17 17:26:36 SANfile_m kernel:  [<c0132640>] worker_thread+0x0/0x100
Oct 17 17:26:36 SANfile_m kernel:  [kthread+82/112] kthread+0x42/0x70
Oct 17 17:26:36 SANfile_m kernel:  [<c0135212>] kthread+0x42/0x70
Oct 17 17:26:36 SANfile_m kernel:  [kthread+16/112] kthread+0x0/0x70
Oct 17 17:26:36 SANfile_m kernel:  [<c01351d0>] kthread+0x0/0x70
Oct 17 17:26:36 SANfile_m kernel:  [print_trace_stack+3/16] kernel_thread_helper+0x7/0x14
Oct 17 17:26:36 SANfile_m kernel:  [<c0104763>] kernel_thread_helper+0x7/0x14
Oct 17 17:26:36 SANfile_m kernel:  =======================
Oct 17 17:26:36 SANfile_m kernel: error 1
Oct 17 17:26:36 SANfile_m kernel: scsi 1:0:1:0: Unexpected response from lun 0 while scanning, scan aborted
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Device not ready: <6>: Sense Key : 0x2 [current]
Oct 17 17:26:36 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 17 17:26:36 SANfile_m kernel: end_request: I/O error, dev sdf, sector 7263453056
Oct 17 17:26:36 SANfile_m kernel: Buffer I/O error on device sdf, logical block 907931632
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Device not ready: <6>: Sense Key : 0x2 [current]
Oct 17 17:26:36 SANfile_m kernel: : ASC=0x4 ASCQ=0x3
Oct 17 17:26:36 SANfile_m kernel: end_request: I/O error, dev sdf, sector 7263453056
Oct 17 17:26:36 SANfile_m kernel: Buffer I/O error on device sdf, logical block 907931632
Oct 17 17:26:36 SANfile_m kernel: sd 1:0:1:0: [sdf] Device not ready: <6>: Sense Key : 0x2 [current]
Oct 17 17:26:36 SANfile_m kernel: : ASC=0x4 ASCQ=0x3




Is that what you refer to as the -EEXIST bug?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: multibus / failover and EMC CX600
  2007-10-17 14:48       ` Gerald Nowitzky
@ 2007-10-17 16:01         ` Tore Anderson
  2007-10-17 18:04           ` Gerald Nowitzky
  2007-10-17 19:38           ` Gerald Nowitzky
  0 siblings, 2 replies; 16+ messages in thread
From: Tore Anderson @ 2007-10-17 16:01 UTC (permalink / raw)
  To: device-mapper development

* Gerald Nowitzky

> The mpath_prio_emc with group_by_prio did the trick. Thanks!
>  
> But I am still losing the paths to the failed devices. I increased
> dev_loss_tmo, but the maximum seems to be about 600 - thus, after 10
> minutes, the paths fail:

The maximum is indeed 600 seconds in 2.6.23.

> SANfile_m linux # multipath -l
> hcfshare (360060160c820080063502869e459dc11) dm-0 ,
> [size=3.4T][features=1 queue_if_no_path][hwhandler=1 emc]
> \_ round-robin 0 [prio=0][enabled]
>  \_ #:#:#:# -   #:#   [failed][undef]
>  \_ #:#:#:# -   #:#   [failed][undef]
> \_ round-robin 0 [prio=0][active]
>  \_ 2:0:0:0 sdd 8:48  [active][undef]
>  \_ 1:0:0:0 sdb 8:16  [active][undef]
> If I put them online again, I run into the -EEXIST prob. Async SCSI
> scanning *is* off in my kernel, so the only thing I could do from
> here is to try the patch, is it?

Matthew Wilcox' patch solved this particular problem for me, yes.  I
still had some problems with -EEXIST when unloading and re-inserting the
HBA driver module, but that's a corner case I rarely run into (and one
that's easily worked around by trying again).

Come to think of it, you never said which kernel version you're running...?

> Oct 17 17:26:36 SANfile_m kernel: kobject_add failed for 1:0:1:0 with
> -EEXIST, don't try to register things with the same name in the same
> directory.

One suggestion...  If the sysfs object is still around, you might be
able to delete it manually by running «echo 1 >
/sys/class/scsi_device/1:0:1:0/device/delete».  If that works, you can
try to rescan again by doing «echo 0 1 0 >
/sys/class/scsi_host/host1/scan».  With some luck it'll work...

If it does, most of the time udev will notice and alert multipath to
check out the new device.  Sometimes it doesn't work, though - simply
run the «multipath» command manually in that case.

By the way - the «1» in «host1» maps to the first digit in «1:0:1:0»
(the host number), while the «0 1 0» in the echo command maps to the
last three (channel, target and LUN).
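
The mapping above can be sketched as a small helper that takes the H:C:T:L address and derives both sysfs writes from it. The function names and the SYSFS_ROOT override are made up for illustration; the two sysfs paths are the ones given above.

```sh
#!/bin/sh
# Sketch: derive the "delete" and "scan" writes from an address like 1:0:1:0.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"    # overridable, e.g. for a dry run on a copy

hctl_host() { echo "${1%%:*}"; }                 # 1:0:1:0 -> 1
hctl_scan() { echo "${1#*:}" | tr ':' ' '; }     # 1:0:1:0 -> 0 1 0

delete_and_rescan() {
    hctl="$1"
    case "$hctl" in
        ''|*[!0-9:]*) echo "expected an address like 1:0:1:0" >&2; return 2 ;;
    esac
    del="$SYSFS_ROOT/class/scsi_device/$hctl/device/delete"
    scan="$SYSFS_ROOT/class/scsi_host/host$(hctl_host "$hctl")/scan"
    # Only delete if the stale sysfs object is actually still around.
    if [ -e "$del" ]; then
        echo 1 > "$del" || return 1
    fi
    hctl_scan "$hctl" > "$scan"
}
```

As root, "delete_and_rescan 1:0:1:0" would then do the same as the two echo commands above.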

Regards
-- 
Tore Anderson

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: multibus / failover and EMC CX600
  2007-10-17 16:01         ` Tore Anderson
@ 2007-10-17 18:04           ` Gerald Nowitzky
  2007-10-18  6:19             ` Hannes Reinecke
  2007-10-17 19:38           ` Gerald Nowitzky
  1 sibling, 1 reply; 16+ messages in thread
From: Gerald Nowitzky @ 2007-10-17 18:04 UTC (permalink / raw)
  To: device-mapper development


I'm afraid the patch did not work for me. It's still the same.

I am using kernel 2.6.22.2 at the moment. Should I upgrade to 2.6.23?

Does anybody have any ideas?
The system is not in production at the moment. We could do some testing.

(Gerald)

Oct 17 20:57:09 SANfile_m kernel: kobject_add failed for 1:0:1:0 with -EEXIST, don't try to register things with the same name in the same directory.
Oct 17 20:57:09 SANfile_m kernel:  [number+85/816] kobject_shadow_add+0x115/0x1b0
Oct 17 20:57:09 SANfile_m kernel:  [<c02f95f5>] kobject_shadow_add+0x115/0x1b0
Oct 17 20:57:09 SANfile_m kernel:  [lo_ioctl+1125/2528] device_add+0xc5/0x570
Oct 17 20:57:09 SANfile_m kernel:  [<c03aefd5>] device_add+0xc5/0x570
Oct 17 20:57:09 SANfile_m kernel:  [fc_remote_port_rolechg+127/320] scsi_adjust_queue_depth+0x9f/0xf0
Oct 17 20:57:09 SANfile_m kernel:  [<c03f9d7f>] scsi_adjust_queue_depth+0x9f/0xf0
Oct 17 20:57:09 SANfile_m kernel:  [blk_register_region+18/64] __blk_queue_init_tags+0x32/0x70
Oct 17 20:57:09 SANfile_m kernel:  [<c02eeb72>] __blk_queue_init_tags+0x32/0x70
Oct 17 20:57:09 SANfile_m kernel:  [sr_get_mcn+50/240] scsi_sysfs_add_sdev+0x32/0x230
Oct 17 20:57:09 SANfile_m kernel:  [<c04028b2>] scsi_sysfs_add_sdev+0x32/0x230
Oct 17 20:57:09 SANfile_m kernel:  [<f99445b7>] qla2xxx_slave_configure+0x77/0x110 [qla2xxx]
Oct 17 20:57:09 SANfile_m kernel:  [sd_init_command+313/1088] scsi_probe_and_add_lun+0x8c9/0x940
Oct 17 20:57:09 SANfile_m kernel:  [<c0400859>] scsi_probe_and_add_lun+0x8c9/0x940
Oct 17 20:57:09 SANfile_m kernel:  [sr_probe+72/1472] __scsi_scan_target+0x518/0x5c0
Oct 17 20:57:09 SANfile_m kernel:  [<c04012c8>] __scsi_scan_target+0x518/0x5c0
Oct 17 20:57:09 SANfile_m kernel:  [kallsyms_addresses+36323/130252] schedule+0x2df/0x940
Oct 17 20:57:09 SANfile_m kernel:  [<c053699f>] schedule+0x2df/0x940
Oct 17 20:57:09 SANfile_m kernel:  [sr_init_command+128/944] scsi_scan_target+0xd0/0xe0
Oct 17 20:57:09 SANfile_m kernel:  [<c0401a40>] scsi_scan_target+0xd0/0xe0
Oct 17 20:57:09 SANfile_m kernel:  [SendIocInit+272/784] fc_scsi_scan_rport+0x0/0x90
Oct 17 20:57:09 SANfile_m kernel:  [<c04084e0>] fc_scsi_scan_rport+0x0/0x90
Oct 17 20:57:09 SANfile_m kernel:  [SendIocInit+392/784] fc_scsi_scan_rport+0x78/0x90
Oct 17 20:57:09 SANfile_m kernel:  [<c0408558>] fc_scsi_scan_rport+0x78/0x90
Oct 17 20:57:09 SANfile_m kernel:  [run_workqueue+131/256] run_workqueue+0x73/0x100
Oct 17 20:57:09 SANfile_m kernel:  [<c0131dc3>] run_workqueue+0x73/0x100
Oct 17 20:57:09 SANfile_m kernel:  [autoremove_wake_function+16/80] autoremove_wake_function+0x0/0x50
Oct 17 20:57:09 SANfile_m kernel:  [<c01354e0>] autoremove_wake_function+0x0/0x50
Oct 17 20:57:09 SANfile_m kernel:  [worker_thread+172/256] worker_thread+0x9c/0x100
Oct 17 20:57:09 SANfile_m kernel:  [<c01326dc>] worker_thread+0x9c/0x100
Oct 17 20:57:09 SANfile_m kernel:  [autoremove_wake_function+16/80] autoremove_wake_function+0x0/0x50
Oct 17 20:57:09 SANfile_m kernel:  [<c01354e0>] autoremove_wake_function+0x0/0x50
Oct 17 20:57:09 SANfile_m kernel:  [worker_thread+16/256] worker_thread+0x0/0x100
Oct 17 20:57:09 SANfile_m kernel:  [<c0132640>] worker_thread+0x0/0x100
Oct 17 20:57:09 SANfile_m kernel:  [kthread+82/112] kthread+0x42/0x70
Oct 17 20:57:09 SANfile_m kernel:  [<c0135212>] kthread+0x42/0x70
Oct 17 20:57:09 SANfile_m kernel:  [kthread+16/112] kthread+0x0/0x70
Oct 17 20:57:09 SANfile_m kernel:  [<c01351d0>] kthread+0x0/0x70
Oct 17 20:57:09 SANfile_m kernel:  [print_trace_stack+3/16] kernel_thread_helper+0x7/0x14
Oct 17 20:57:09 SANfile_m kernel:  [<c0104763>] kernel_thread_helper+0x7/0x14
Oct 17 20:57:09 SANfile_m kernel:  =======================
Oct 17 20:57:09 SANfile_m kernel: error 1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: multibus / failover and EMC CX600
  2007-10-17 16:01         ` Tore Anderson
  2007-10-17 18:04           ` Gerald Nowitzky
@ 2007-10-17 19:38           ` Gerald Nowitzky
  2007-10-18  6:01             ` Tore Anderson
  1 sibling, 1 reply; 16+ messages in thread
From: Gerald Nowitzky @ 2007-10-17 19:38 UTC (permalink / raw)
  To: device-mapper development


Not much difference with 2.6.23.1:

Oct 17 22:33:56 SANfile_m kernel: kobject_add failed for 1:0:1:0 with -EEXIST, don't try to register things with the same name in the same directory.
Oct 17 22:33:56 SANfile_m kernel:  [<c03074d5>] kobject_shadow_add+0x115/0x1b0
Oct 17 22:33:56 SANfile_m kernel:  [<c03a8f58>] device_add+0xa8/0x5a0
Oct 17 22:33:56 SANfile_m kernel:  [<c02fc582>] __blk_queue_init_tags+0x32/0x70
Oct 17 22:33:56 SANfile_m kernel:  [<c03fb8af>] scsi_sysfs_add_sdev+0x4f/0x220
Oct 17 22:33:56 SANfile_m kernel:  [<f99498b7>] qla2xxx_slave_configure+0x77/0x110 [qla2xxx]
Oct 17 22:33:56 SANfile_m kernel:  [<c03f986a>] scsi_probe_and_add_lun+0x92a/0x950
Oct 17 22:33:56 SANfile_m kernel:  [<c03fa26d>] __scsi_scan_target+0x4fd/0x5b0
Oct 17 22:33:56 SANfile_m kernel:  [<c03fa954>] scsi_scan_target+0x94/0xc0
Oct 17 22:33:56 SANfile_m kernel:  [<c04014f0>] fc_scsi_scan_rport+0x0/0x90
Oct 17 22:33:56 SANfile_m kernel:  [<c0401568>] fc_scsi_scan_rport+0x78/0x90
Oct 17 22:33:56 SANfile_m kernel:  [<c013e963>] run_workqueue+0x73/0x100
Oct 17 22:33:56 SANfile_m kernel:  [<c0142350>] autoremove_wake_function+0x0/0x50
Oct 17 22:33:56 SANfile_m kernel:  [<c013f3dc>] worker_thread+0x9c/0x100
Oct 17 22:33:56 SANfile_m kernel:  [<c0142350>] autoremove_wake_function+0x0/0x50
Oct 17 22:33:56 SANfile_m kernel:  [<c013f340>] worker_thread+0x0/0x100
Oct 17 22:33:56 SANfile_m kernel:  [<c0142082>] kthread+0x42/0x70
Oct 17 22:33:56 SANfile_m kernel:  [<c0142040>] kthread+0x0/0x70
Oct 17 22:33:56 SANfile_m kernel:  [<c0105e6f>] kernel_thread_helper+0x7/0x18
Oct 17 22:33:56 SANfile_m kernel:  =======================
Oct 17 22:33:56 SANfile_m kernel: error 1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: multibus / failover and EMC CX600
  2007-10-17 12:32     ` Tore Anderson
  2007-10-17 14:48       ` Gerald Nowitzky
@ 2007-10-17 19:49       ` Mike Christie
  1 sibling, 0 replies; 16+ messages in thread
From: Mike Christie @ 2007-10-17 19:49 UTC (permalink / raw)
  To: device-mapper development

Tore Anderson wrote:
> * Hannes Reinecke
> 
>> That's the dev_loss_tmo setting. Just increase it to something to
>> your liking.
> 
> Oh, sweet.  This knob won't affect how long the layer will hold I/O
> before failing it (like lpfc_nodev_tmo), I assume?  (I'm worried about
> it taking longer for dm-multipath to detect failed paths).
> 

With newer versions of lpfc you can set
/sys/class/fc_rport/rportXYZ/fast_io_fail_tmo to a low value so that I/O
is failed quickly, and then set dev_loss_tmo to a high value so the
device is not removed quickly.

The only problem may be that there is a race where dm-multipath could be
queueing I/O to the SCSI layer while the SCSI layer is reporting a
failure. The I/O that was being queued will then sit in the SCSI layer
until dev_loss_tmo fires. That is fixed by this patchset,
http://marc.info/?l=linux-scsi&m=117399843216280&w=2
but I never finished testing it.
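
A sketch of the low/high split described above, for a single remote port directory. The function name is made up, and refusing a fast_io_fail_tmo that is not below dev_loss_tmo is my own sanity check, not something stated in this thread.

```sh
#!/bin/sh
# Sketch: fail I/O quickly but keep the SCSI device around for longer.
# $1 = rport sysfs directory, $2 = fast_io_fail_tmo, $3 = dev_loss_tmo (seconds)
set_rport_timeouts() {
    dir="$1"; fast="$2"; loss="$3"
    for v in "$fast" "$loss"; do
        case "$v" in
            ''|*[!0-9]*) echo "timeouts must be whole seconds" >&2; return 2 ;;
        esac
    done
    # Assumption: the split only makes sense if fast < loss, so refuse the rest.
    if [ "$fast" -ge "$loss" ]; then
        echo "fast_io_fail_tmo should be lower than dev_loss_tmo" >&2
        return 1
    fi
    # Write the larger value first so the smaller one is never above it.
    echo "$loss" > "$dir/dev_loss_tmo" || return 1
    echo "$fast" > "$dir/fast_io_fail_tmo" || return 1
}
```

For example "set_rport_timeouts /sys/class/fc_rport/rportXYZ 5 600", with rportXYZ replaced by the actual remote port. Whether fast_io_fail_tmo exists at all depends on the driver, as noted above.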

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: multibus / failover and EMC CX600
  2007-10-17 19:38           ` Gerald Nowitzky
@ 2007-10-18  6:01             ` Tore Anderson
  2007-10-18  6:19               ` Tore Anderson
  0 siblings, 1 reply; 16+ messages in thread
From: Tore Anderson @ 2007-10-18  6:01 UTC (permalink / raw)
  To: device-mapper development

* Gerald Nowitzky

> not much difference with 2.6.23.1:

Hmm - maybe this patch helps (needs Wilcox' patch to be applied first):
<http://www.mail-archive.com/linux-scsi@vger.kernel.org/msg09458.html>.

I tested it a bit and it helped some, but I don't remember exactly for 
which situations...  It was on a machine that was using the lpfc HBA 
driver though so no guarantees that it'll help in your case.

Anyway, since you're able to reproduce it consistently and with the 
latest kernel.org tarball, I suggest you try to take it to the 
linux-scsi list to see if they've got any suggestions.  This obviously 
is a kernel bug, and it's surely not in the DM layer.  Either in the HBA 
driver or in the SCSI layer (I'm not sure which), but linux-scsi should 
be the right list in any case.

Regards
-- 
Tore Anderson

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: multibus / failover and EMC CX600
  2007-10-17 18:04           ` Gerald Nowitzky
@ 2007-10-18  6:19             ` Hannes Reinecke
  2007-10-18  6:55               ` Gerald Nowitzky
  0 siblings, 1 reply; 16+ messages in thread
From: Hannes Reinecke @ 2007-10-18  6:19 UTC (permalink / raw)
  To: device-mapper development

On Wed, Oct 17, 2007 at 08:04:12PM +0200, Gerald Nowitzky wrote:
> I'm afraid the patch did not work for me. It's still the same.
> 
> I am using kernel 2.6.22.2 at the moment. Should I upgrade to 2.6.23?
> 
> Does anybody have any ideas?
> The system is not in production at the moment. We could do some testing.
> 
Well, yes. By the looks of it, the problem is with multipathing still holding
references to the stale devices.
I.e., after dev_loss_tmo kicks in, the devices are removed from sysfs.
But multipathing does _not_ update its device-mapper tables (that's why
you see all the '#' in the output), so there's still a reference on the
removed device and the in-kernel resources can't be freed.

So when the device is re-registered, you're getting this Oops.

Try to update the multipath information by running 'multipath' after the
devices have been removed. Once the '#' entries in the output are gone, you can
safely re-add the devices.
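
A rough sketch of that check, in case it helps while debugging. It counts the '#:#:#:#' path lines in the 'multipath -l' output and reruns 'multipath' only if there are any. The function names are made up, and this is a stopgap, not a replacement for multipathd doing the update itself.

```sh
#!/bin/sh
# Sketch: detect paths that multipath prints as "#:#:#:#", i.e. paths whose
# SCSI device is gone but which are still listed in the multipath map.

# Reads 'multipath -l' output on stdin and prints the number of stale paths.
count_stale_paths() {
    grep -c '#:#:#:#' || true    # grep -c prints 0 but exits 1 on no match
}

refresh_if_stale() {
    stale=$(multipath -l 2>/dev/null | count_stale_paths)
    if [ "$stale" -gt 0 ]; then
        echo "$stale stale path(s), rerunning multipath" >&2
        multipath
    fi
}
```

Call "refresh_if_stale" as root after the devices have been removed, and re-add them only once it no longer reports stale paths.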

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: multibus / failover and EMC CX600
  2007-10-18  6:01             ` Tore Anderson
@ 2007-10-18  6:19               ` Tore Anderson
  0 siblings, 0 replies; 16+ messages in thread
From: Tore Anderson @ 2007-10-18  6:19 UTC (permalink / raw)
  To: device-mapper development

* Tore Anderson

> Hmm - maybe this patch helps (needs Wilcox' patch to be applied first):
> <http://www.mail-archive.com/linux-scsi@vger.kernel.org/msg09458.html>.

By the way - did you try with async SCSI scanning enabled?  Worth a shot...

-- 
Tore Anderson

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: multibus / failover and EMC CX600
  2007-10-18  6:19             ` Hannes Reinecke
@ 2007-10-18  6:55               ` Gerald Nowitzky
  2007-10-18  7:12                 ` Hannes Reinecke
  0 siblings, 1 reply; 16+ messages in thread
From: Gerald Nowitzky @ 2007-10-18  6:55 UTC (permalink / raw)
  To: device-mapper development


Hannes,

so is this behavior by design? In this case, patching the SCSI subsystem won't help too much, will it?

Manually updating the multipath information - well, yes, that will work I guess, but in the end that should work without manual intervention. Thus, I'd need to have a job checking the multipath table for stale devices and, if there are any, rerunning multipath to get them out. Not exactly smooth, is it?

Thanks
(Gerald)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: multibus / failover and EMC CX600
  2007-10-18  6:55               ` Gerald Nowitzky
@ 2007-10-18  7:12                 ` Hannes Reinecke
  2007-10-18  8:07                   ` Gerald Nowitzky
  0 siblings, 1 reply; 16+ messages in thread
From: Hannes Reinecke @ 2007-10-18  7:12 UTC (permalink / raw)
  To: device-mapper development


On Thu, Oct 18, 2007 at 08:55:52AM +0200, Gerald Nowitzky wrote:
> Hannes,
> 
> So is this behavior by design? In that case, patching the SCSI subsystem won't
> help too much, will it?
>
No, not really.
 
> Manually updating the multipath information - well, yes, that will work I guess,
> but in the end that should work without manual intervention. Thus, I'd need to have
> a job checking the multipath table if there are stale devices, and, if there are,
> rerun multipath to get them out. Not exactly smooth, is it?
>
Why, we do have interns for this kind of work :-)

No, really: Normally this should be done by multipathd / udev. With the mainline
multipathd it reads from the kernel netlink socket and will get a uevent if
a device is removed. And it should update the tables accordingly.
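
To illustrate what such a removal event looks like on the wire, here is a small Python sketch that splits a raw kernel uevent payload into its fields. It is not multipathd's code; it only assumes the usual kernel uevent layout (an "action@devpath" header followed by NUL-separated KEY=VALUE pairs). The function names and the sample device path are hypothetical.

```python
def parse_uevent(payload):
    """Split a raw kernel uevent (bytes) into a dict of its KEY=VALUE fields."""
    fields = {}
    for part in payload.split(b"\0"):
        key, sep, value = part.partition(b"=")
        if sep:  # skips the "action@devpath" header and the empty tail
            fields[key.decode()] = value.decode()
    return fields


def is_block_removal(fields):
    """True if the event reports that a block device went away."""
    return fields.get("ACTION") == "remove" and fields.get("SUBSYSTEM") == "block"
```

A daemon that sees is_block_removal() return True for one of its paths is expected to drop that path from the device-mapper table, which is the step that does not seem to happen here.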

There have been problems with device-mapper itself (older versions required to
flush all outstanding I/O before the table could be modified), but that should
be resolved by now.

So maybe have multipathd running in verbose mode (i.e. -v 6 or some such) and see
what's going on. Especially why it isn't updating the device-mapper tables.
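
With that much verbosity the syslog gets noisy, so a small filter helps. The following Python sketch picks out multipathd lines, optionally only those that mention one path. It assumes the classic syslog line layout seen earlier in this thread ("Oct 16 21:29:50 SANfile_m multipathd: 8:48: reinstated"). The function name is hypothetical.

```python
import re

# Timestamp, hostname, then "multipathd:" (optionally with a "[pid]").
LINE = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+multipathd(?:\[\d+\])?:\s+(.*)$"
)


def multipathd_messages(lines, needle=""):
    """Return (timestamp, message) for multipathd lines containing needle."""
    found = []
    for line in lines:
        match = LINE.match(line.rstrip("\n"))
        if match and needle in match.group(2):
            found.append((match.group(1), match.group(2)))
    return found
```

For example, multipathd_messages(open("/var/log/messages"), "8:48") would list what the daemon logged about that path around the time it was blocked. The log file location is an assumption and depends on the syslog setup.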

BTW, which distribution are you running? multipath & udev seem to be a tricky
area for most.

Not ours, of course :-)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)


* Re: multibus / failover and EMC CX600
  2007-10-18  7:12                 ` Hannes Reinecke
@ 2007-10-18  8:07                   ` Gerald Nowitzky
  2007-10-19 22:35                     ` David Strand
  0 siblings, 1 reply; 16+ messages in thread
From: Gerald Nowitzky @ 2007-10-18  8:07 UTC (permalink / raw)
  To: device-mapper development



Hannes,

Nothing that looks really alarming in the syslog.

What I have done:

9:47 multipathd -v 6 started
9:50 2:0:0:0 and 1:0:1:0 blocked (both paths to SP-A of the SAN)
9:52 2:0:0:0 and 1:0:1:0 re-enabled

I have attached my syslog.

I am on Gentoo with multipath 0.4.7.

(Gerald)


[-- Attachment #2: syslog.zip --]
[-- Type: application/x-zip-compressed, Size: 7679 bytes --]


* Re: multibus / failover and EMC CX600
  2007-10-18  8:07                   ` Gerald Nowitzky
@ 2007-10-19 22:35                     ` David Strand
  0 siblings, 0 replies; 16+ messages in thread
From: David Strand @ 2007-10-19 22:35 UTC (permalink / raw)
  To: device-mapper development

I have this same -EEXIST problem.

If a file system on the /dev/mapper/mpath* device node is mounted when
the disk is disconnected, multipath won't give up one of the device
paths (it logs "map in use"). If the disk is then reconnected, it gets
a new SCSI device name. Multipath is then sometimes confused (it shows 3
paths, one of which is ####), and sometimes I get the -EEXIST core
dump to go along with it.

I can do the same disconnect / reconnect of the disk all day long, as
long as no file system is mounted on it. Of course, there will pretty
much always be a file system mounted on it so that isn't really a
solution, just a fact.
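
To tell in advance which maps fall into the "in use" case, one can look for /dev/mapper devices in the mount table. A minimal Python sketch follows, assuming the usual /proc/mounts layout (device, mount point, type, options, ...). The function name is hypothetical, and the check only catches file systems mounted directly on a /dev/mapper node, not other holders of the map.

```python
def mounted_mapper_devices(mounts_text):
    """Map each mounted /dev/mapper/* device to its list of mount points."""
    mounted = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith("/dev/mapper/"):
            mounted.setdefault(fields[0], []).append(fields[1])
    return mounted
```

Usage would be along the lines of mounted_mapper_devices(open("/proc/mounts").read()). Any map that appears in the result is one where a disconnect is likely to leave a path behind.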


Thread overview: 16+ messages
2007-10-17 10:23 multibus / failover and EMC CX600 Gerald Nowitzky
2007-10-17 10:40 ` Tore Anderson
2007-10-17 11:08   ` Hannes Reinecke
2007-10-17 12:32     ` Tore Anderson
2007-10-17 14:48       ` Gerald Nowitzky
2007-10-17 16:01         ` Tore Anderson
2007-10-17 18:04           ` Gerald Nowitzky
2007-10-18  6:19             ` Hannes Reinecke
2007-10-18  6:55               ` Gerald Nowitzky
2007-10-18  7:12                 ` Hannes Reinecke
2007-10-18  8:07                   ` Gerald Nowitzky
2007-10-19 22:35                     ` David Strand
2007-10-17 19:38           ` Gerald Nowitzky
2007-10-18  6:01             ` Tore Anderson
2007-10-18  6:19               ` Tore Anderson
2007-10-17 19:49       ` Mike Christie
