* Trouble with ALUA on controller failover
@ 2015-04-30 14:34 Adam Drew
From: Adam Drew @ 2015-04-30 14:34 UTC
To: dm-devel@redhat.com
Hi all,
We're facing a bit of a strange problem and would like some input on debugging and next steps.
We have RHEL 7 connected to a Nimble CS-series array via FC. We're running device-mapper-multipath-0.4.9-77.el7.x86_64 on kernel 3.10.0-123.el7.x86_64. Our multipath config is very simple:
devices {
    device {
        vendor "Nimble"
        product "Server"
        prio alua
        path_grouping_policy group_by_prio
        path_checker tur
        features "1 queue_if_no_path"
        rr_weight priorities
        rr_min_io 20
        failback manual
        path_selector "round-robin 0"
        dev_loss_tmo infinity
        fast_io_fail_tmo 5
    }
}
And our devices look as expected:
mpathek (29b72fe86f66a2a366c9ce9009d9a9742) dm-0 Nimble ,Server
size=244G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| `- 9:0:0:0 sdb 8:16 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 9:0:1:0 sdc 8:32 active ghost running
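
For readers less familiar with group_by_prio, here is a minimal sketch of the grouping that produces the two path groups above. It is an illustration only, not the multipath-tools implementation: the function name group_by_prio and the tuple layout are hypothetical, and the priority values 50 and 1 are simply copied from the output above.

```python
from collections import defaultdict

# Paths copied from the multipath -ll output above:
# (H:C:T:L, device node, priority reported for the path)
paths = [
    ("9:0:0:0", "sdb", 50),
    ("9:0:1:0", "sdc", 1),
]

def group_by_prio(paths):
    """Return (priority, members) pairs, highest priority first (hypothetical helper)."""
    groups = defaultdict(list)
    for hctl, dev, prio in paths:
        groups[prio].append((hctl, dev))
    # The highest-priority group is the one expected to carry I/O.
    return sorted(groups.items(), key=lambda item: item[0], reverse=True)

path_groups = group_by_prio(paths)
for prio, members in path_groups:
    print(prio, members)
```

With failback set to manual, a priority change alone is not expected to move I/O back to a recovered group, which is worth keeping in mind when reading the later output.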
When we initiate controller failover we never switch over to the correct path:
[ 822.192772] sd 9:0:0:0: rejecting I/O to offline device
[ 822.198633] device-mapper: multipath: Failing path 8:16.
[ 822.204595] device-mapper: multipath: Failing path 8:32.
[ 824.099448] sd 9:0:1:0: Parameters changed
[ 825.043943] device-mapper: multipath: Failing path 8:32.
[ 830.052981] device-mapper: multipath: Failing path 8:32.
[ 835.062030] device-mapper: multipath: Failing path 8:32.
[ 840.071071] device-mapper: multipath: Failing path 8:32.
[ 845.080060] device-mapper: multipath: Failing path 8:32.
[ 850.089089] device-mapper: multipath: Failing path 8:32.
[ 855.098110] device-mapper: multipath: Failing path 8:32.
The path status is strange. The path that should be active ready running now, 9:0:1:0, is failed ready running:
mpathek (29b72fe86f66a2a366c9ce9009d9a9742) dm-0 Nimble ,Server
size=244G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=0 status=enabled
| `- 9:0:0:0 sdb 8:16 failed faulty offline
`-+- policy='round-robin 0' prio=50 status=enabled
  `- 9:0:1:0 sdc 8:32 failed ready running
Multipath tries to send IO down that path but:
[ 405.078481] Add. Sense: Logical unit not accessible, target port in standby state
[ 405.086856] sd 9:0:1:0: [sdc] CDB:
[ 405.090748] Write(10): 2a 00 1c fe 95 30 00 00 08 00
[ 405.096456] sd 9:0:1:0: [sdc] Device not ready
[ 405.101419] sd 9:0:1:0: [sdc]
[ 405.104934] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 405.111162] sd 9:0:1:0: [sdc]
[ 405.114678] Sense Key : Not Ready [current]
[ 405.119481] Info fld=0x0
[ 405.122321] sd 9:0:1:0: [sdc]
[ 405.125838] Add. Sense: Logical unit not accessible, target port in standby state
[ 405.134202] sd 9:0:1:0: [sdc] CDB:
[ 405.138096] Write(10): 2a 00 1b 64 94 b0 00 00 02 00
[ 405.143785] sd 9:0:1:0: [sdc] Device not ready
[ 405.148736] sd 9:0:1:0: [sdc]
[ 405.152254] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 405.158488] sd 9:0:1:0: [sdc]
[ 405.162003] Sense Key : Not Ready [current]
When we point RHEL 6 at this same array, and same volume, failover goes over without a hitch. We've been able to reproduce this on the RHEL 7 kernel / device-mapper-multipath combo on several systems.
We've been tearing through the device-mapper-multipath-libs and kernel code to see if we can find the cause of the problem, and we've been testing quite a bit, but have as yet been unable to resolve this. We'd like some input on next steps for debugging and testing.
The only thing we've found so far that looks promising is with the parameter data format on RTPG during failover. RHEL 7 is sending a parameter data format of 1, but we answer 0 (which is within spec). Here's the message on our array side:
dsd.log.2:19485 2015-04-20,11:45:46.746395-07 INFO: scsi.core:_scsi_report_target_group: parameter data format = 1, treating it as length only format(0)
dsd.log.2:19485 2015-04-20,11:45:46.747362-07 INFO: scsi.core:_scsi_report_target_group: parameter data format = 1, treating it as length only format(0)
But on the RHEL side we see:
Apr 14 18:05:54 UCS-PGUO-RHEL7 kernel: host1: Assigned Port ID 720080
Apr 14 18:05:54 UCS-PGUO-RHEL7 kernel: scsi 1:0:0:0: Direct-Access Nimble Server 1.0 PQ: 0 ANSI: 5
Apr 14 18:05:54 UCS-PGUO-RHEL7 kernel: scsi 1:0:0:0: alua: supports implicit TPGS
Apr 14 18:05:54 UCS-PGUO-RHEL7 kernel: scsi 1:0:0:0: alua: port group 01 rel port 01
Apr 14 18:05:54 UCS-PGUO-RHEL7 kernel: scsi 1:0:0:0: alua: rtpg failed with 8000002
Apr 14 18:05:54 UCS-PGUO-RHEL7 kernel: scsi 1:0:0:0: alua: port group 01 state S non-preferred supports tolusna
Apr 14 18:05:54 UCS-PGUO-RHEL7 kernel: scsi 1:0:0:0: alua: Attached
We're wondering whether the RTPG failure is what prevents the new active path from being activated, and whether it is caused by the RHEL 7 kernel or device-mapper-multipath not accepting a format 0 (length-only) response to a format 1 request. However, we're unsure whether this would be in the kernel ALUA scsi_dh code or in the device-mapper-multipath-libs alua code.
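
To make the parameter data format question concrete, here is a rough sketch of how a REPORT TARGET PORT GROUPS CDB carries that field. It assumes the SPC-4 layout (MAINTENANCE IN opcode 0xA3, service action 0x0A in the low five bits of byte 1, parameter data format in the top three bits, allocation length in bytes 6 to 9). The helper name build_rtpg_cdb is hypothetical and this is not the kernel's code.

```python
import struct

MAINTENANCE_IN = 0xA3                 # operation code
SA_REPORT_TARGET_PORT_GROUPS = 0x0A   # service action, byte 1 bits 4..0

def build_rtpg_cdb(alloc_len, param_data_format=0):
    """Build a 12-byte RTPG CDB (hypothetical helper).

    param_data_format: 0 = length-only header, 1 = extended header.
    """
    if not 0 <= param_data_format <= 7:
        raise ValueError("parameter data format is a 3-bit field")
    byte1 = (param_data_format << 5) | SA_REPORT_TARGET_PORT_GROUPS
    # opcode, byte 1, 4 reserved bytes, big-endian allocation length,
    # 1 reserved byte, control byte
    return struct.pack(">BB4xIBB", MAINTENANCE_IN, byte1, alloc_len, 0, 0)

extended = build_rtpg_cdb(128, param_data_format=1)
length_only = build_rtpg_cdb(128, param_data_format=0)
print(extended.hex(), length_only.hex())
```

The two CDBs differ only in byte 1, so a capture of that byte on the wire should show which format the initiator actually requested.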
Any help is appreciated. We'll supply any data requested.
Adam Drew
Nimble Storage
* Re: Trouble with ALUA on controller failover
From: Stewart, Sean @ 2015-05-06 14:56 UTC
To: device-mapper development
Hi Adam,
On Thu, 2015-04-30 at 14:34 +0000, Adam Drew wrote:
> Hi all,
>
> We're facing a bit of a strange problem and would like some input on debugging and next steps.
>
> We have RHEL 7 connected to a Nimble CS-series array via FC. We're running device-mapper-multipath-0.4.9-77.el7.x86_64 and 3.10.0-123.el7.x86_64. Our multipath config is very simple:
...
> The path status is strange. The path that should be active ready running now, 9:0:1:0, is failed ready running:
> mpathek (29b72fe86f66a2a366c9ce9009d9a9742) dm-0 Nimble ,Server
> size=244G features='1 queue_if_no_path' hwhandler='0' wp=rw
> |-+- policy='round-robin 0' prio=0 status=enabled
> | `- 9:0:0:0 sdb 8:16 failed faulty offline
> `-+- policy='round-robin 0' prio=50 status=enabled
>   `- 9:0:1:0 sdc 8:32 failed ready running
The "failed ready" path status comes from the fact that the last I/O dm
sent down sdc came back failed, as can be seen below.
>
> Multipath tries to send IO down that path but:
> [ 405.078481] Add. Sense: Logical unit not accessible, target port in standby state
> [ 405.086856] sd 9:0:1:0: [sdc] CDB:
> [ 405.090748] Write(10): 2a 00 1c fe 95 30 00 00 08 00
> [ 405.096456] sd 9:0:1:0: [sdc] Device not ready
> [ 405.101419] sd 9:0:1:0: [sdc]
> [ 405.104934] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [ 405.111162] sd 9:0:1:0: [sdc]
> [ 405.114678] Sense Key : Not Ready [current]
> [ 405.119481] Info fld=0x0
> [ 405.122321] sd 9:0:1:0: [sdc]
> [ 405.125838] Add. Sense: Logical unit not accessible, target port in standby state
> [ 405.134202] sd 9:0:1:0: [sdc] CDB:
> [ 405.138096] Write(10): 2a 00 1b 64 94 b0 00 00 02 00
> [ 405.143785] sd 9:0:1:0: [sdc] Device not ready
> [ 405.148736] sd 9:0:1:0: [sdc]
> [ 405.152254] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [ 405.158488] sd 9:0:1:0: [sdc]
> [ 405.162003] Sense Key : Not Ready [current]
>
It's getting an 02/04/0b check condition from your storage, because that
target port is still in the standby asymmetric access state (AAS).
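
As a quick aid for reading such triples, here is a small sketch that splits a sense key/ASC/ASCQ string and looks it up. The table covers only the ALUA-related 04h codes from SPC and a few sense keys, and the helper name decode_sense is hypothetical.

```python
# Small, deliberately incomplete lookup tables (values from SPC).
SENSE_KEYS = {
    0x02: "Not Ready",
    0x05: "Illegal Request",
    0x06: "Unit Attention",
}
ALUA_ASC_ASCQ = {
    (0x04, 0x0A): "Logical unit not accessible, asymmetric access state transition",
    (0x04, 0x0B): "Logical unit not accessible, target port in standby state",
    (0x04, 0x0C): "Logical unit not accessible, target port in unavailable state",
}

def decode_sense(triple):
    """Decode a 'key/asc/ascq' hex string such as '02/04/0b' (hypothetical helper)."""
    key, asc, ascq = (int(part, 16) for part in triple.split("/"))
    return (
        SENSE_KEYS.get(key, "unknown sense key"),
        ALUA_ASC_ASCQ.get((asc, ascq), "not in this table"),
    )

decoded = decode_sense("02/04/0b")
print(decoded)
```

The decoded pair matches the "Sense Key : Not Ready" and "target port in standby state" lines in the quoted log.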
> When we point RHEL 6 at this same array, and same volume, failover goes over without a hitch. We've been able to reproduce this on the RHEL 7 kernel / device-mapper-multipath combo on several systems.
I'm not sure how this is possible, because the code should be very
similar between RHEL6 and RHEL7 in these areas.
> We've been tearing through the device-mapper-multipath-libs and kernel code to see if we can find the cause of the problem, and we've been testing quite a bit, but have as yet been unable to resolve this. We'd like some input on next steps for debugging and testing.
>
> The only thing we've found so far that looks promising is with the parameter data format on RTPG during failover. RHEL 7 is sending a parameter data format of 1, but we answer 0 (which is within spec). Here's the message on our array side:
>
> dsd.log.2:19485 2015-04-20,11:45:46.746395-07 INFO: scsi.core:_scsi_report_target_group: parameter data format = 1, treating it as length only format(0)
> dsd.log.2:19485 2015-04-20,11:45:46.747362-07 INFO: scsi.core:_scsi_report_target_group: parameter data format = 1, treating it as length only format(0)
>
> But on the RHEL side we see:
> Apr 14 18:05:54 UCS-PGUO-RHEL7 kernel: host1: Assigned Port ID 720080
> Apr 14 18:05:54 UCS-PGUO-RHEL7 kernel: scsi 1:0:0:0: Direct-Access Nimble Server 1.0 PQ: 0 ANSI: 5
> Apr 14 18:05:54 UCS-PGUO-RHEL7 kernel: scsi 1:0:0:0: alua: supports implicit TPGS
> Apr 14 18:05:54 UCS-PGUO-RHEL7 kernel: scsi 1:0:0:0: alua: port group 01 rel port 01
> Apr 14 18:05:54 UCS-PGUO-RHEL7 kernel: scsi 1:0:0:0: alua: rtpg failed with 8000002
> Apr 14 18:05:54 UCS-PGUO-RHEL7 kernel: scsi 1:0:0:0: alua: port group 01 state S non-preferred supports tolusna
> Apr 14 18:05:54 UCS-PGUO-RHEL7 kernel: scsi 1:0:0:0: alua: Attached
>
> We're wondering whether the RTPG failure is what prevents the new active path from being activated, and whether it is caused by the RHEL 7 kernel or device-mapper-multipath not accepting a format 0 (length-only) response to a format 1 request. However, we're unsure whether this would be in the kernel ALUA scsi_dh code or in the device-mapper-multipath-libs alua code.
>
The RTPG failure should not be causing this problem: the subsequent
"port group 01 state S non-preferred supports tolusna" message shows
that the RTPG does complete successfully. The first attempt uses
extended RTPG headers, and if it gets an illegal request back, the
driver disables the extended header and tries again.
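
The "8000002" in the "rtpg failed with" line is a hexadecimal SCSI result word. A minimal sketch of decoding it, assuming the usual Linux layout of that era (status in the low byte, then message, host and driver bytes); the helper name decode_scsi_result is hypothetical:

```python
DRIVER_SENSE = 0x08      # driver byte: sense data is available
CHECK_CONDITION = 0x02   # SCSI status byte

def decode_scsi_result(result):
    """Split a 32-bit SCSI result word into its four bytes (hypothetical helper)."""
    return {
        "status": result & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "host": (result >> 16) & 0xFF,
        "driver": (result >> 24) & 0xFF,
    }

# The kernel prints the value in hex without a 0x prefix.
fields = decode_scsi_result(int("8000002", 16))
print(fields)
print(fields["driver"] == DRIVER_SENSE, fields["status"] == CHECK_CONDITION)
```

Under that assumption the value reads as a check condition with sense data and a clean host byte, which is consistent with the array rejecting the first request rather than with a transport error.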
Your RTPG data indicates that only implicit TPGS is supported. Also, the
multipath configuration and multipath -ll output show that no hardware
handler is being used for failover and failback (but I see the alua
handler is attached, which must be happening on device discovery and not
because of dm). Basically, in this configuration, it would be up to the
array to change the AAS to allow I/O on the other path.
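
When comparing the RHEL 6 and RHEL 7 cases, it may help to pull the reported state out of the "alua: port group" lines mechanically. The sketch below assumes the scsi_dh_alua letter conventions (A, N, S, U, L, O, T for the state, and upper case in the "supports" string meaning the state is advertised as supported); that reading should be checked against the kernel source in use, and parse_alua_line is a hypothetical helper.

```python
import re

# Assumed meaning of the state letters printed by scsi_dh_alua.
STATE_NAMES = {
    "A": "active/optimized",
    "N": "active/non-optimized",
    "S": "standby",
    "U": "unavailable",
    "L": "LBA dependent",
    "O": "offline",
    "T": "transitioning",
}

LINE_RE = re.compile(
    r"alua: port group (?P<group>[0-9a-fA-F]+) state (?P<state>\w) "
    r"(?P<pref>preferred|non-preferred) supports (?P<supports>\w+)"
)

def parse_alua_line(line):
    """Extract group, state and supported states from one log line."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "group": int(m.group("group"), 16),
        "state": STATE_NAMES.get(m.group("state"), "unknown"),
        "preferred": m.group("pref") == "preferred",
        # Upper-case letters are assumed to mark supported states.
        "supported": [STATE_NAMES[c] for c in m.group("supports")
                      if c.isupper() and c in STATE_NAMES],
    }

line = ("Apr 14 18:05:54 UCS-PGUO-RHEL7 kernel: scsi 1:0:0:0: alua: "
        "port group 01 state S non-preferred supports tolusna")
parsed = parse_alua_line(line)
print(parsed)
```

Running the same parser over the RHEL 6 boot log would give a direct way to spot a reporting difference between the two hosts.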
I have to wonder if there's some configuration difference, or reporting
difference between the two cases. Sorry I don't have all the answers,
but hopefully this helps to some extent.
Thanks,
Sean Stewart