From: Laurence Oberman <loberman@redhat.com>
To: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: linux-block@vger.kernel.org,
linux-scsi <linux-scsi@vger.kernel.org>,
Mike Snitzer <snitzer@redhat.com>,
James Bottomley <James.Bottomley@HansenPartnership.com>,
device-mapper development <dm-devel@redhat.com>,
lsf@lists.linux-foundation.org,
Benjamin Marzinski <bmarzins@redhat.com>
Subject: Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM
Date: Fri, 29 Apr 2016 17:51:31 -0400 (EDT) [thread overview]
Message-ID: <2009387014.32589776.1461966691447.JavaMail.zimbra@redhat.com> (raw)
In-Reply-To: <850484819.32589649.1461966427528.JavaMail.zimbra@redhat.com>
One small correction
In the cut and past the mpath timing was this. I had a cut and past error in my prior message.
Fri Apr 29 17:16:14 EDT 2016
mpathe (360014052a6f5f9f256d4c1097eedfd95) dm-2 LIO-ORG ,block-13
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
|- 0:0:0:12 sds 65:32 active ready running
|- 0:0:1:12 sdbh 67:176 active ready running
|- 1:0:0:12 sdr 65:16 active ready running
`- 1:0:1:12 sdbi 67:192 active ready running
Start again here so its the same 5 minutes while we are in the error_handler
Fri Apr 29 17:21:26 EDT 2016
mpathe (360014052a6f5f9f256d4c1097eedfd95) dm-2 LIO-ORG ,block-13
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
|- 0:0:0:12 sds 65:32 active ready running
|- 0:0:1:12 sdbh 67:176 failed faulty offline
|- 1:0:0:12 sdr 65:16 active ready running
`- 1:0:1:12 sdbi 67:192 failed faulty offline
Laurence Oberman
Principal Software Maintenance Engineer
Red Hat Global Support Services
----- Original Message -----
From: "Laurence Oberman" <loberman@redhat.com>
To: "Bart Van Assche" <bart.vanassche@sandisk.com>
Cc: linux-block@vger.kernel.org, "linux-scsi" <linux-scsi@vger.kernel.org>, "Mike Snitzer" <snitzer@redhat.com>, "James Bottomley" <James.Bottomley@HansenPartnership.com>, "device-mapper development" <dm-devel@redhat.com>, lsf@lists.linux-foundation.org, "Benjamin Marzinski" <bmarzins@redhat.com>
Sent: Friday, April 29, 2016 5:47:07 PM
Subject: Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM
Hello Bart
I will email the entire log just to you. This is a summary only below of pertinent log messages.
I dont think the whole list will have an interest in all thge log messages.
When I sent the dull log to you I will include SCSI debug for the error handler stuff.
I ran the tests. This is a worst case test with 21 LUNS and jammed commands.
Typical failures like a port switch failure or link down wont be like this.
Also where we get RSCN's and we can react quicker we will.
In this case I simulated a hung switch issue like a backplane/mesh problem and believe me I see a lot of these
black-holed SCSI command situations in real life.
Recovery with 21 LUNS is 300s that have in-flights to abort.
This configuration is a multibus configuration for multipath.
Two qla2xx ports are connected to a switch and the 2 array pots are connected to the same switch.
This gives me 4 active/active paths for 21 mpath devices
I start I/O to all 21 reading 64k blocks using dd and iflag=direct
Example mpath device
mpathf (360014056a5be89021364a4c90556bfbb) dm-7 LIO-ORG ,block-14
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
|- 0:0:0:13 sdp 8:240 active ready running
|- 0:0:1:13 sdbf 67:144 active ready running
|- 1:0:0:13 sdo 8:224 active ready running
`- 1:0:1:13 sdbg 67:160 active ready running
eh_deadline is set to 10 on the 2 qlogic ports, eh_timeout is set to 10 for all devices
In multipath fast_io_fail_tmo=5
I jam one of the target array ports and discard the commands effectively black-holing the commands and leave it that way until we recover and I watch the I/O.
The recovery takes around 300s even with all the tuning and this effectively lands up in Oracle cluster evictions.
Watching multipath -ll mpathe I will block as expected while in recovery
BLocked here
Fri Apr 29 17:16:14 EDT 2016
mpathe (360014052a6f5f9f256d4c1097eedfd95) dm-2 LIO-ORG ,block-13
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
|- 0:0:0:12 sds 65:32 active ready running
|- 0:0:1:12 sdbh 67:176 active ready running
|- 1:0:0:12 sdr 65:16 active ready running
`- 1:0:1:12 sdbi 67:192 active ready running
Starte again here
Fri Apr 29 17:16:26 EDT 2016
mpathe (360014052a6f5f9f256d4c1097eedfd95) dm-2 LIO-ORG ,block-13
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
|- 0:0:0:12 sds 65:32 active ready running
|- 0:0:1:12 sdbh 67:176 failed faulty offline
|- 1:0:0:12 sdr 65:16 active ready running
`- 1:0:1:12 sdbi 67:192 failed faulty offline
Tracking I/O
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- -----timestamp-----
r b swpd free buff cache si so bi bo in cs us sy id wa st EDT
0 21 0 15409656 25580 452056 0 0 13740 0 367 2523 0 1 41 59 0 2016-04-29 17:16:17
0 21 0 15408904 25580 452336 0 0 15872 0 378 2779 0 1 42 57 0 2016-04-29 17:16:18
2 20 0 15408096 25580 452624 0 0 17612 0 399 3310 0 0 26 73 0 2016-04-29 17:16:19
0 21 0 15407188 25580 453096 0 0 17860 0 412 3137 0 0 30 70 0 2016-04-29 17:16:20
0 21 0 15410420 25580 451552 0 0 23116 0 900 6747 0 1 31 69 0 2016-04-29 17:16:21
0 21 0 15410552 25580 451420 0 0 22664 0 430 3752 0 0 24 76 0 2016-04-29 17:16:22
0 21 0 15410552 25580 451420 0 0 15700 0 325 2619 0 0 25 75 0 2016-04-29 17:16:23
0 21 0 15410552 25580 451420 0 0 13648 0 303 2387 0 0 28 71 0 2016-04-29 17:16:24
..
.. Blocked
..
Starts recovering ~= 300s seconds later
..
0 38 0 15406428 25860 452652 0 0 3208 0 859 2437 0 1 13 86 0 2016-04-29 17:21:20
0 38 0 15405668 26244 452268 0 0 6640 0 499 3575 0 1 0 99 0 2016-04-29 17:21:21
0 38 0 15406840 26496 452300 0 0 5372 0 273 1878 0 0 1 98 0 2016-04-29 17:21:22
0 38 0 15402684 29156 452048 0 0 9700 0 318 2326 0 0 11 88 0 2016-04-29 17:21:23
0 38 0 15400800 30152 452168 0 0 11876 0 433 3265 0 1 16 83 0 2016-04-29 17:21:24
0 38 0 15399792 31140 452344 0 0 11804 0 394 2902 0 1 15 85 0 2016-04-29 17:21:25
0 38 0 15398552 31952 452196 0 0 12908 0 417 3347 0 1 4 96 0 2016-04-29 17:21:26
0 35 0 15394564 32660 452800 0 0 10904 0 575 4191 1 1 9 89 0 2016-04-29 17:21:27
0 29 0 15394292 32968 452900 0 0 13356 0 602 3993 1 1 1 96 0 2016-04-29 17:21:28
0 26 0 15394464 33692 452196 0 0 16124 0 764 5451 1 1 2 96 0 2016-04-29 17:21:29
0 24 0 15394168 33880 452392 0 0 20156 0 479 3957 0 1 3 96 0 2016-04-29 17:21:30
0 24 0 15394216 34008 452460 0 0 21760 0 456 3836 0 1 6 94 0 2016-04-29 17:21:31
0 22 0 15393920 34016 452604 0 0 20104 28 437 3418 0 1 12 87 0 2016-04-29 17:21:32
0 22 0 15393952 34016 452600 0 0 20352 0 483 3259 0 1 67 32 0 2016-04-29 17:21:33
0 22 0 15394148 34016 452600 0 0 20560 0 451 3228 0 1 74 25 0 2016-04-29 17:21:34
I see the error handler start in the qlogic.
Keep in mind we are black-holed so after RESET we start the process again.
Apr 29 17:15:26 localhost root: Starting test with eh_deadline=10
Apr 29 17:16:54 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:9 -- 1 2002.
Apr 29 17:16:54 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:7 -- 1 2002.
Apr 29 17:16:54 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:8 -- 1 2002.
Apr 29 17:16:55 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:6 -- 1 2002.
Apr 29 17:16:55 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:4 -- 1 2002.
Apr 29 17:16:55 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:3 -- 1 2002.
Apr 29 17:16:55 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:22 -- 1 2002.
Apr 29 17:16:55 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:23 -- 1 2002.
Apr 29 17:16:55 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:21 -- 1 2002.
Apr 29 17:16:56 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:2 -- 1 2002.
Apr 29 17:16:56 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:19 -- 1 2002.
Apr 29 17:16:56 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:18 -- 1 2002.
Apr 29 17:16:56 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:20 -- 1 2002.
Apr 29 17:16:56 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:1 -- 1 2002.
Apr 29 17:16:56 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:0 -- 1 2002.
Apr 29 17:16:57 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:17 -- 1 2002.
Apr 29 17:16:57 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:16 -- 1 2002.
Apr 29 17:16:57 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:10 -- 1 2002.
Apr 29 17:16:58 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:14 -- 1 2002.
Apr 29 17:16:58 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:13 -- 1 2002.
Apr 29 17:16:58 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:12 -- 1 2002.
Apr 29 17:16:58 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:0 -- 1 2002.
Apr 29 17:17:00 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:10 -- 1 2002.
Apr 29 17:17:00 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:12 -- 1 2002.
Apr 29 17:17:01 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:0 -- 1 2002.
Apr 29 17:17:02 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:1 -- 1 2002.
Apr 29 17:17:03 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:13 -- 1 2002.
Apr 29 17:17:03 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:14 -- 1 2002.
Apr 29 17:17:03 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:16 -- 1 2002.
Apr 29 17:17:04 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:17 -- 1 2002.
Apr 29 17:17:09 localhost kernel: qla2xxx [0000:07:00.1]-8018:1: ADAPTER RESET ISSUED nexus=1:1:12.
Apr 29 17:17:09 localhost kernel: qla2xxx [0000:07:00.1]-00af:1: Performing ISP error recovery - ha=ffff88042a4b0000.
Apr 29 17:17:10 localhost kernel: qla2xxx [0000:07:00.1]-500a:1: LOOP UP detected (4 Gbps).
Apr 29 17:17:10 localhost kernel: qla2xxx [0000:07:00.1]-8017:1: ADAPTER RESET SUCCEEDED nexus=1:1:12.
Apr 29 17:17:30 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:12 -- 1 2002.
Apr 29 17:17:34 localhost kernel: qla2xxx [0000:07:00.0]-8018:0: ADAPTER RESET ISSUED nexus=0:1:17.
Apr 29 17:17:34 localhost kernel: qla2xxx [0000:07:00.0]-00af:0: Performing ISP error recovery - ha=ffff88042b030000.
Apr 29 17:17:35 localhost kernel: qla2xxx [0000:07:00.0]-500a:0: LOOP UP detected (4 Gbps).
Apr 29 17:17:35 localhost kernel: qla2xxx [0000:07:00.0]-8017:0: ADAPTER RESET SUCCEEDED nexus=0:1:17.
Apr 29 17:17:40 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:10 -- 1 2002.
Apr 29 17:17:50 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:1 -- 1 2002.
Apr 29 17:17:56 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:17 -- 1 2002.
Apr 29 17:18:00 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:0 -- 1 2002.
Apr 29 17:18:06 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:18 -- 1 2002.
Apr 29 17:18:10 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:13 -- 1 2002.
Apr 29 17:18:16 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:19 -- 1 2002.
Apr 29 17:18:20 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:8 -- 1 2002.
Apr 29 17:18:26 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:8 -- 1 2002.
Apr 29 17:18:30 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:7 -- 1 2002.
Apr 29 17:18:36 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:22 -- 1 2002.
Apr 29 17:18:40 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:22 -- 1 2002.
Apr 29 17:18:46 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:21 -- 1 2002.
Apr 29 17:18:50 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:21 -- 1 2002.
..
..
We start seeing the hung tasks
Apr 29 17:19:16 localhost kernel: INFO: task dd:10193 blocked for more than 120 seconds.
Apr 29 17:19:16 localhost kernel: Not tainted 4.6.0-rc5+ #1
Apr 29 17:19:16 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 29 17:19:16 localhost kernel: dd D ffff8804303079d8 0 10193 10177 0x00000080
Apr 29 17:19:16 localhost kernel: ffff8804303079d8 ffff880434814140 ffff8800a86515c0 ffff880430308000
Apr 29 17:19:16 localhost kernel: 0000000000000000 7fffffffffffffff 0000000000000000 ffff8804307bfd00
Apr 29 17:19:16 localhost kernel: ffff8804303079f0 ffffffff816ba8e5 ffff880436696e00 ffff880430307aa0
Apr 29 17:19:16 localhost kernel: Call Trace:
Apr 29 17:19:16 localhost kernel: [<ffffffff816ba8e5>] schedule+0x35/0x80
Apr 29 17:19:16 localhost kernel: [<ffffffff816bd661>] schedule_timeout+0x231/0x2d0
Apr 29 17:19:16 localhost kernel: [<ffffffff81315843>] ? __blk_run_queue+0x33/0x40
Apr 29 17:19:16 localhost kernel: [<ffffffff813158ba>] ? queue_unplugged+0x2a/0xb0
Apr 29 17:19:16 localhost kernel: [<ffffffff816b9f66>] io_schedule_timeout+0xa6/0x110
Apr 29 17:19:16 localhost kernel: [<ffffffff81259332>] do_blockdev_direct_IO+0x1b52/0x2180
Apr 29 17:19:16 localhost kernel: [<ffffffff81254320>] ? I_BDEV+0x20/0x20
Apr 29 17:19:16 localhost kernel: [<ffffffff812599a3>] __blockdev_direct_IO+0x43/0x50
Apr 29 17:19:16 localhost kernel: [<ffffffff81254a7c>] blkdev_direct_IO+0x4c/0x50
Apr 29 17:19:16 localhost kernel: [<ffffffff81193ab1>] generic_file_read_iter+0x641/0x7b0
Apr 29 17:19:16 localhost kernel: [<ffffffff8120bcf5>] ? mem_cgroup_commit_charge+0x85/0x100
Apr 29 17:19:16 localhost kernel: [<ffffffff81254e57>] blkdev_read_iter+0x37/0x40
Apr 29 17:19:16 localhost kernel: [<ffffffff81219379>] __vfs_read+0xc9/0x100
Apr 29 17:19:16 localhost kernel: [<ffffffff8121a1ef>] vfs_read+0x7f/0x130
Apr 29 17:19:16 localhost kernel: [<ffffffff8121b6d5>] SyS_read+0x55/0xc0
Apr 29 17:19:16 localhost kernel: [<ffffffff81003c12>] do_syscall_64+0x62/0x110
Apr 29 17:19:16 localhost kernel: [<ffffffff816be4a1>] entry_SYSCALL64_slow_path+0x25/0x25
..
..
Finally after the serialized timeouts we get the offline states
Apr 29 17:19:26 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:3 -- 1 2002.
Apr 29 17:19:30 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:17 -- 1 2002.
Apr 29 17:19:36 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:7 -- 1 2002.
Apr 29 17:19:40 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:18 -- 1 2002.
Apr 29 17:19:46 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:4 -- 1 2002.
Apr 29 17:19:51 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:19 -- 1 2002.
Apr 29 17:19:56 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:6 -- 1 2002.
Apr 29 17:20:01 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:2 -- 1 2002.
Apr 29 17:20:06 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:9 -- 1 2002.
Apr 29 17:20:11 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:23 -- 1 2002.
Apr 29 17:20:16 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:12 -- 1 2002.
Apr 29 17:20:21 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:3 -- 1 2002.
Apr 29 17:20:26 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:10 -- 1 2002.
Apr 29 17:20:31 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:4 -- 1 2002.
Apr 29 17:20:37 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:0 -- 1 2002.
Apr 29 17:20:41 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:6 -- 1 2002.
Apr 29 17:20:47 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:1 -- 1 2002.
Apr 29 17:20:51 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:9 -- 1 2002.
Apr 29 17:20:51 localhost kernel: sd 1:0:1:12: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:10: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:1: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:0: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:13: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: device-mapper: multipath: Failing path 66:208.
Apr 29 17:20:51 localhost kernel: device-mapper: multipath: Failing path 68:192.
Apr 29 17:20:51 localhost kernel: device-mapper: multipath: Failing path 67:224.
Apr 29 17:20:51 localhost kernel: device-mapper: multipath: Failing path 67:192.
Apr 29 17:20:51 localhost multipathd: mpatha: sdat - path offline
Apr 29 17:20:51 localhost multipathd: checker failed path 66:208 in map mpatha
Apr 29 17:20:51 localhost multipathd: mpatha: remaining active paths: 3
Apr 29 17:20:51 localhost multipathd: mpathb: sdby - path offline
Apr 29 17:20:51 localhost multipathd: checker failed path 68:192 in map mpathb
Apr 29 17:20:51 localhost multipathd: mpathb: remaining active paths: 3
Apr 29 17:20:51 localhost multipathd: mpathc: sdbk - path offline
Apr 29 17:20:51 localhost multipathd: checker failed path 67:224 in map mpathc
Apr 29 17:20:51 localhost multipathd: mpathc: remaining active paths: 3
Apr 29 17:20:51 localhost multipathd: mpathe: sdbi - path offline
Apr 29 17:20:51 localhost multipathd: checker failed path 67:192 in map mpathe
Apr 29 17:20:51 localhost multipathd: mpathe: remaining active paths: 3
Apr 29 17:20:51 localhost kernel: sd 1:0:1:8: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:7: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:22: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:21: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:20: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:14: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:16: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:17: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:18: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:19: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:2: Device offlined - not ready after error recovery
..
..
Apr 29 17:20:51 localhost kernel: sd 1:0:1:12: [sdbi] tag#14 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:52 localhost kernel: sd 1:0:1:12: [sdbi] tag#14 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:52 localhost kernel: blk_update_request: I/O error, dev sdbi, sector 0
Apr 29 17:20:52 localhost kernel: sd 1:0:1:12: rejecting I/O to offline device
Apr 29 17:20:52 localhost kernel: sd 1:0:1:12: [sdbi] killing request
Apr 29 17:20:52 localhost kernel: sd 1:0:1:12: [sdbi] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Apr 29 17:20:52 localhost kernel: sd 1:0:1:12: [sdbi] CDB: Read(10) 28 00 00 02 5b 80 00 00 80 00
Apr 29 17:20:52 localhost kernel: blk_update_request: I/O error, dev sdbi, sector 154496
Apr 29 17:20:52 localhost kernel: sd 1:0:1:10: [sdbk] tag#13 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:52 localhost kernel: sd 1:0:1:10: [sdbk] tag#13 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:52 localhost kernel: blk_update_request: I/O error, dev sdbk, sector 0
Apr 29 17:20:52 localhost kernel: sd 1:0:1:1: [sdby] tag#16 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:52 localhost kernel: sd 1:0:1:1: [sdby] tag#16 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:52 localhost multipathd: mpathf: sdbg - path offline
Apr 29 17:20:52 localhost multipathd: checker failed path 67:160 in map mpathf
Apr 29 17:20:52 localhost multipathd: mpathf: remaining active paths: 3
Apr 29 17:20:52 localhost multipathd: mpathg: sdbe - path offline
Apr 29 17:20:52 localhost multipathd: checker failed path 67:128 in map mpathg
Apr 29 17:20:52 localhost multipathd: mpathg: remaining active paths: 3
Apr 29 17:20:52 localhost multipathd: mpathi: sdbc - path offline
Apr 29 17:20:52 localhost multipathd: checker failed path 67:96 in map mpathi
Apr 29 17:20:52 localhost multipathd: mpathi: remaining active paths: 3
Apr 29 17:20:52 localhost multipathd: mpathj: sdba - path offline
Apr 29 17:20:52 localhost multipathd: checker failed path 67:64 in map mpathj
Apr 29 17:20:52 localhost multipathd: mpathj: remaining active paths: 3
Apr 29 17:20:52 localhost multipathd: mpathk: sday - path offline
Apr 29 17:20:52 localhost multipathd: checker failed path 67:32 in map mpathk
Apr 29 17:20:52 localhost multipathd: mpathk: remaining active paths: 3
..
..
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 67:160.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 67:128.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 67:96.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 67:64.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 67:32.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 66:240.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:160.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 69:0.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 69:32.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 69:64.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 69:96.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:128.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:96.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:64.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:224.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:32.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:0.
Apr 29 17:20:52 localhost kernel: blk_update_request: I/O error, dev sdby, sector 0
Apr 29 17:20:52 localhost kernel: sd 1:0:1:0: [sdat] tag#15 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:52 localhost kernel: sd 1:0:1:0: [sdat] tag#15 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:52 localhost kernel: blk_update_request: I/O error, dev sdat, sector 0
Apr 29 17:20:52 localhost kernel: sd 1:0:1:13: [sdbg] tag#17 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:52 localhost kernel: sd 1:0:1:13: [sdbg] tag#17 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:53 localhost kernel: blk_update_request: I/O error, dev sdbg, sector 0
Apr 29 17:20:53 localhost kernel: sd 1:0:1:13: rejecting I/O to offline device
Apr 29 17:20:53 localhost kernel: sd 1:0:1:13: [sdbg] killing request
Apr 29 17:20:53 localhost kernel: sd 1:0:1:13: [sdbg] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Apr 29 17:20:53 localhost kernel: sd 1:0:1:13: [sdbg] CDB: Read(10) 28 00 00 02 5d 80 00 00 80 00
Apr 29 17:20:53 localhost kernel: blk_update_request: I/O error, dev sdbg, sector 155008
Apr 29 17:20:53 localhost kernel: sd 1:0:1:8: [sdbo] tag#31 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:53 localhost kernel: sd 1:0:1:8: [sdbo] tag#31 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:53 localhost kernel: blk_update_request: I/O error, dev sdbo, sector 0
Apr 29 17:20:53 localhost kernel: sd 1:0:1:7: [sdca] tag#30 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:53 localhost kernel: sd 1:0:1:7: [sdca] tag#30 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:53 localhost kernel: blk_update_request: I/O error, dev sdca, sector 0
Apr 29 17:20:53 localhost kernel: sd 1:0:1:22: [sdcg] tag#26 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
..
Apr 29 17:21:18 localhost multipathd: checker failed path 66:224 in map mpatha
Apr 29 17:21:18 localhost multipathd: mpatha: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathb: sdbx - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 68:176 in map mpathb
Apr 29 17:21:18 localhost multipathd: mpathb: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathc: sdbj - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 67:208 in map mpathc
Apr 29 17:21:18 localhost multipathd: mpathc: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathe: sdbh - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 67:176 in map mpathe
Apr 29 17:21:18 localhost multipathd: mpathe: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathf: sdbf - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 67:144 in map mpathf
Apr 29 17:21:18 localhost multipathd: mpathf: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathg: sdbd - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 67:112 in map mpathg
Apr 29 17:21:18 localhost multipathd: mpathg: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathi: sdbb - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 67:80 in map mpathi
Apr 29 17:21:18 localhost multipathd: mpathi: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpatho: sdbr - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 68:80 in map mpatho
Apr 29 17:21:18 localhost multipathd: mpatho: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathq: sdbp - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 68:48 in map mpathq
Apr 29 17:21:18 localhost multipathd: mpathq: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathv: sdbz - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 68:208 in map mpathv
Apr 29 17:21:18 localhost multipathd: mpathv: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpatht: sdbl - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 67:240 in map mpatht
Apr 29 17:21:18 localhost multipathd: mpatht: remaining active paths: 2
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 66:224.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 68:176.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 67:208.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 67:176.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 67:144.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 67:112.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 67:80.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 68:80.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 68:48.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 68:208.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 67:240.
Apr 29 17:21:18 localhost kernel: blk_update_request: I/O error, dev sdaw, sector 0
Apr 29 17:21:18 localhost kernel: sd 0:0:1:8: [sdbn] tag#25 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:21:18 localhost kernel: sd 0:0:1:8: [sdbn] tag#25 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:21:18 localhost kernel: blk_update_request: I/O error, dev sdbn, sector 0
Apr 29 17:21:18 localhost kernel: sd 0:0:1:8: rejecting I/O to offline device
Apr 29 17:21:18 localhost kernel: sd 0:0:1:8: [sdbn] killing request
Laurence Oberman
Principal Software Maintenance Engineer
Red Hat Global Support Services
----- Original Message -----
From: "Laurence Oberman" <loberman@redhat.com>
To: "Bart Van Assche" <bart.vanassche@sandisk.com>
Cc: linux-block@vger.kernel.org, "linux-scsi" <linux-scsi@vger.kernel.org>, "Mike Snitzer" <snitzer@redhat.com>, "James Bottomley" <James.Bottomley@HansenPartnership.com>, "device-mapper development" <dm-devel@redhat.com>, lsf@lists.linux-foundation.org
Sent: Thursday, April 28, 2016 12:47:24 PM
Subject: Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM
Hello Bart, This is when we have a subset of the paths fails.
As you know the remaining path wont be used until the eh_handler is either done or is short circuited.
What I will do is set this up via my jammer and capture a test using latest upstream.
Of course my customer pain points are all in the RHEL kernels so I need to capture a recovery trace
on the latest upstream kernel.
When the SCSI commands for a path are black-holed and remain that way, even with eh_deadline and the short circuited adapter resets
we simply try again and get back in the wait loop until we finally declare the device offline.
This can take a while and differs depending on Qlogic, Emulex or fnic etc.
First thing tomorrow will set this up and show you what I mean.
Thanks!!
Laurence Oberman
Principal Software Maintenance Engineer
Red Hat Global Support Services
----- Original Message -----
From: "Bart Van Assche" <bart.vanassche@sandisk.com>
To: "Laurence Oberman" <loberman@redhat.com>
Cc: linux-block@vger.kernel.org, "linux-scsi" <linux-scsi@vger.kernel.org>, "Mike Snitzer" <snitzer@redhat.com>, "James Bottomley" <James.Bottomley@HansenPartnership.com>, "device-mapper development" <dm-devel@redhat.com>, lsf@lists.linux-foundation.org
Sent: Thursday, April 28, 2016 12:41:26 PM
Subject: Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM
On 04/28/2016 09:23 AM, Laurence Oberman wrote:
> We still suffer from periodic complaints in our large customer base
> regarding the long recovery times for dm-multipath.
> Most of the time this is when we have something like a switch
> back-plane issue or an issue where RSCN'S are blocked coming back up
> the fabric. Corner cases still bite us often.
>
> Most of the complaints originate from customers for example seeing
> Oracle cluster evictions where during the waiting on the mid-layer
> all mpath I/O is blocked until recovery.
>
> We have to tune eh_deadline, eh_timeout and fast_io_fail_tmo but
> even tuning those we have to wait on serial recovery even if we
> set the timeouts low.
>
> Lately we have been living with
> eh_deadline=10
> eh_timeout=5
> fast_fail_io_tmo=10
> leaving default sd timeout at 30s
>
> So this continues to be an issue and I have specific examples using
> the jammer I can provide showing the serial recovery times here.
Hello Laurence,
The long recovery times you refer to, is that for a scenario where all
paths failed or for a scenario where some paths failed and other paths
are still working? In the latter case, how long does it take before
dm-multipath fails over to another path?
Thanks,
Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
next prev parent reply other threads:[~2016-04-29 21:51 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-04-27 23:39 Notes from the four separate IO track sessions at LSF/MM James Bottomley
2016-04-28 12:11 ` Mike Snitzer
2016-04-28 15:40 ` James Bottomley
2016-04-28 15:53 ` [Lsf] " Bart Van Assche
2016-04-28 16:19 ` Knight, Frederick
2016-04-28 16:37 ` Bart Van Assche
2016-04-28 17:33 ` James Bottomley
2016-04-28 16:23 ` Laurence Oberman
2016-04-28 16:41 ` [dm-devel] " Bart Van Assche
2016-04-28 16:47 ` Laurence Oberman
2016-04-29 21:47 ` Laurence Oberman
2016-04-29 21:51 ` Laurence Oberman [this message]
2016-04-30 0:36 ` Bart Van Assche
2016-04-30 0:47 ` Laurence Oberman
2016-05-02 18:49 ` Bart Van Assche
2016-05-02 19:28 ` Laurence Oberman
2016-05-02 22:28 ` Bart Van Assche
2016-05-03 17:44 ` Laurence Oberman
2016-05-26 2:38 ` bio-based DM multipath is back from the dead [was: Re: Notes from the four separate IO track sessions at LSF/MM] Mike Snitzer
2016-05-27 8:39 ` Hannes Reinecke
2016-05-27 8:39 ` Hannes Reinecke
2016-05-27 14:44 ` Mike Snitzer
2016-05-27 15:42 ` Hannes Reinecke
2016-05-27 15:42 ` Hannes Reinecke
2016-05-27 16:10 ` Mike Snitzer
2016-04-29 16:45 ` [dm-devel] Notes from the four separate IO track sessions at LSF/MM Benjamin Marzinski
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=2009387014.32589776.1461966691447.JavaMail.zimbra@redhat.com \
--to=loberman@redhat.com \
--cc=James.Bottomley@HansenPartnership.com \
--cc=bart.vanassche@sandisk.com \
--cc=bmarzins@redhat.com \
--cc=dm-devel@redhat.com \
--cc=linux-block@vger.kernel.org \
--cc=linux-scsi@vger.kernel.org \
--cc=lsf@lists.linux-foundation.org \
--cc=snitzer@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.