public inbox for linux-scsi@vger.kernel.org
* [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
@ 2011-06-15 11:20 Heiko Carstens
  2011-06-16 16:01 ` Heiko Carstens
  0 siblings, 1 reply; 42+ messages in thread
From: Heiko Carstens @ 2011-06-15 11:20 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-scsi, Steffen Maier, Manvanthara B. Puttashankar,
	Tarak Reddy, Seshagiri N. Ippili, Christof Schmitt

Hi James,

while running some error recovery tests we encountered the following bug:

[23710.196446] Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
[23710.196453] Oops: 0038 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[23710.196460] Modules linked in: dm_round_robin sunrpc qeth_l3 binfmt_misc dm_multipath scsi_dh dm_mod ipv6 vmur qeth ccwgroup [last unloaded: scsi_wait_scan]
[23710.196477] CPU: 0 Not tainted 2.6.39.1-49.x.20110609-s390xdefault #1
[23710.196480] Process kworker/u:5 (pid: 18534, task: 000000002f4d2368, ksp: 000000002f48f268)
[23710.196483] Krnl PSW : 0704000180000000 000000000042407c (scsi_dispatch_cmd+0xd4/0x29c)
[23710.196492]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3
[23710.196496] Krnl GPRS: 0280000000000000 6b6b6b6b6b6b6b6b 0000000000000060 0000000000006b6b
[23710.196499]            00000000005a0ab2 000000002f4d2af0 00000000390d8090 000000002f611ac7
[23710.196502]            000000003a764600 000000002f6114f0 000000003a764600 0000000000000000
[23710.196505]            000000003a78e1b0 00000000005f0638 0000000000424070 000000002f48f530
[23710.196516] Krnl Code: 000000000042406a: c0e5fffffdf9        brasl   %r14,423c5c
[23710.196520]            0000000000424070: e310a0000004        lg      %r1,0(%r10)
[23710.196525]            0000000000424076: e330a04a0091        llgh    %r3,74(%r10)
[23710.196529]           >000000000042407c: e31010000004        lg      %r1,0(%r1)
[23710.196533]            0000000000424082: e34012480091        llgh    %r4,584(%r1)
[23710.196537]            0000000000424088: 1534                clr     %r3,%r4
[23710.196541]            000000000042408a: a7240033            brc     2,4240f0
[23710.196546]            000000000042408e: d503d008c2c0        clc     8(4,%r13),704(%r12)
[23710.196549] Call Trace:
[23710.196551] ([<0000000000424070>] scsi_dispatch_cmd+0xc8/0x29c)
[23710.196555]  [<000000000042bdb8>] scsi_request_fn+0x3b8/0x480
[23710.196560]  [<00000000003adfb6>] blk_execute_rq_nowait+0x76/0xd0
[23710.196565]  [<00000000003ae0a6>] blk_execute_rq+0x96/0xe8
[23710.196569]  [<000000000042af84>] scsi_execute+0x124/0x22c
[23710.196573]  [<000000000042b1cc>] scsi_execute_req+0x140/0x148
[23710.196577]  [<0000000000422f0c>] scsi_vpd_inquiry+0x80/0xac
[23710.196580]  [<0000000000422f80>] scsi_get_vpd_page+0x48/0x108
[23710.196584]  [<000000000043e240>] sd_revalidate_disk+0x984/0x1a6c
[23710.196590]  [<00000000002b898e>] rescan_partitions+0xc6/0x594
[23710.196596]  [<000000000027b1ca>] __blkdev_get+0x2f6/0x43c
[23710.196602]  [<000000000027b358>] blkdev_get+0x48/0x38c
[23710.196606]  [<00000000003b463e>] register_disk+0x182/0x19c
[23710.196610]  [<00000000003b4758>] add_disk+0x100/0x2ac
[23710.196614]  [<00000000004402bc>] sd_probe_async+0x118/0x1c8
[23710.196618]  [<0000000000173168>] async_run_entry_fn+0x98/0x188
[23710.196624]  [<000000000015ec5e>] process_one_work+0x1f6/0x4ec
[23710.196630]  [<0000000000162a48>] worker_thread+0x17c/0x370
[23710.196633]  [<0000000000168b7e>] kthread+0xa6/0xb0
[23710.196637]  [<00000000005a61fa>] kernel_thread_starter+0x6/0xc
[23710.196643]  [<00000000005a61f4>] kernel_thread_starter+0x0/0xc
[23710.196646] INFO: lockdep is turned off.
[23710.196648] Last Breaking-Event-Address:
[23710.196650]  [<00000000005a0abc>] printk+0x5c/0x60

The test case is a multipath (dm) setup with I/O running to these SCSI
devices while the channel paths to the adapters are randomly switched on
and off. Each time a channel path to an adapter is switched off, this
results in a call to fc_remote_port_delete() for each connected port of
that adapter.

A short analysis for the call trace above: we use SLUB and have all debugging
options turned on. The system crashed within scsi_dispatch_cmd() when it tried
to dereference cmd->device in order to get the host pointer within this short
code snippet:

        /*
         * Before we queue this command, check if the command
         * length exceeds what the host adapter can handle.
         */
        if (cmd->cmd_len > cmd->device->host->max_cmd_len) {
                SCSI_LOG_MLQUEUE(3,

The struct scsi_cmnd is filled with 0x6b, the SLUB poison pattern, which
means it was freed: we have a use-after-free bug here. The interesting
thing is that the code got this far into the function without crashing.
Since the function dereferences this memory in several earlier places
(without crashing), it must have been freed while the function was
executing (well, preemption is turned on, so maybe it was freed while
the task was preempted).

SLUB caller tracking is turned on, which reveals that this piece of memory
got freed by __scsi_put_command().

As far as I can tell, in this case the device driver cannot be blamed since
the request in question didn't even reach the driver before it got freed and
the system crashed.
But of course I could be entirely wrong :)

Would you happen to have any ideas what could cause this bug?

FWIW, this is 2.6.39.1 which in addition contains
e73e079bf128d68284efedeba1fbbc18d78610f9 "[SCSI] Fix oops caused by queue
refcounting failure"

Thanks,
Heiko

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
  2011-06-15 11:20 [BUG] 2.6.39.1 crash in scsi_dispatch_cmd() Heiko Carstens
@ 2011-06-16 16:01 ` Heiko Carstens
  2011-06-16 16:37   ` James Bottomley
  0 siblings, 1 reply; 42+ messages in thread
From: Heiko Carstens @ 2011-06-16 16:01 UTC (permalink / raw)
  To: James Bottomley, Jens Axboe
  Cc: linux-scsi, Steffen Maier, Manvanthara B. Puttashankar,
	Tarak Reddy, Seshagiri N. Ippili, Christof Schmitt

cc'ing Jens (full quote of the original bug report at the end).

I patched the SLUB debugging code so that it includes full call traces
instead of just the address of the caller of kfree/kmalloc. So here is
yet another incarnation of what I think is the same use-after-free bug:

What we can see here: while scsi_dispatch_cmd() operates on a scsi_cmnd
it gets freed via blk_done_softirq().
In this specific case scsi_log_send() crashes while trying to dereference
the SLAB/SLUB poison value.

[  952.532025] Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
[  952.532031] Oops: 0038 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  952.532038] Modules linked in: dm_round_robin dm_multipath scsi_dh sunrpc qeth_l3 binfmt_misc dm_mod ipv6 qeth ccwgroup [last unloaded: scsi_wait_scan]
[  952.532055] CPU: 0 Not tainted 2.6.39.1-50.0slubdbg.20110616-s390xdefault #1
[  952.532058] Process flush-252:1 (pid: 3993, task: 00000000b837a558, ksp: 000000009f1934f0)
[  952.532062] Krnl PSW : 0704100180000000 0000000000426dfc (scsi_print_command+0x44/0xf8)
[  952.532071]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
[  952.532075] Krnl GPRS: 0000000000000799 6b6b6b6b6b6b6b6b 000000008ac7c800 0000000000000001
[  952.532078]            00000000005a1c8e 00000000005b95b0 00000000af005ba8 00000000afc9f57f
[  952.532082]            000000008ac7c800 00000000afc9efa8 000000008ac7c800 000000008ac7c800
[  952.532085]            0000000000000004 00000000005f1bb0 000000009f1938a0 000000009f193870
[  952.532097] Krnl Code: 0000000000426dea: f0b80004ebbf        srp	4(12,%r0),3007(%r14),8
[  952.532102]            0000000000426df0: f0a0000407f4        srp	4(11,%r0),2036,0
[  952.532106]            0000000000426df6: e31020800004        lg	%r1,128(%r2)
[  952.532111]           >0000000000426dfc: e31010b00004        lg	%r1,176(%r1)
[  952.532115]            0000000000426e02: b9020011            ltgr	%r1,%r1
[  952.532118]            0000000000426e06: a7840032            brc	8,426e6a
[  952.532122]            0000000000426e0a: e33020000004        lg	%r3,0(%r2)
[  952.532127]            0000000000426e10: c04000165e8b        larl	%r4,6f2b26
[  952.532131] Call Trace:
[  952.532133] ([<000000008ac7c800>] 0x8ac7c800)
[  952.532136]  [<0000000000424f0c>] scsi_log_send+0xf0/0x130
[  952.532140]  [<0000000000425230>] scsi_dispatch_cmd+0xc8/0x29c
[  952.532144]  [<000000000042cf78>] scsi_request_fn+0x3b8/0x480
[  952.532149]  [<00000000003a78be>] blk_insert_cloned_request+0x72/0xcc
[  952.532154]  [<000003c0023f3466>] dm_dispatch_request+0x5a/0x94 [dm_mod]
[  952.532166]  [<000003c0023f5b0c>] dm_request_fn+0x1e4/0x208 [dm_mod]
[  952.532174]  [<00000000003aa512>] queue_unplugged.clone.27+0x3e/0x6c
[  952.532178]  [<00000000003ab284>] blk_flush_plug_list+0x1fc/0x290
[  952.532182]  [<00000000003ab342>] blk_finish_plug+0x2a/0x58
[  952.532185]  [<00000000001e9410>] generic_writepages+0x6c/0x94
[  952.532192]  [<000000000026abe0>] writeback_single_inode+0xf8/0x268
[  952.532198]  [<000000000026b13a>] writeback_sb_inodes+0xd2/0x18c
[  952.532202]  [<000000000026bcc0>] writeback_inodes_wb+0x80/0x168
[  952.532206]  [<000000000026c04e>] wb_writeback+0x2a6/0x320
[  952.532210]  [<000000000026c2f4>] wb_do_writeback+0x22c/0x270
[  952.532213]  [<000000000026c3ec>] bdi_writeback_thread+0xb4/0x1c0
[  952.532217]  [<0000000000169b3e>] kthread+0xa6/0xb0
[  952.532223]  [<00000000005a73ea>] kernel_thread_starter+0x6/0xc
[  952.532228]  [<00000000005a73e4>] kernel_thread_starter+0x0/0xc
[  952.532231] INFO: lockdep is turned off.
[  952.532233] Last Breaking-Event-Address:
[  952.532235]  [<0000000000426de4>] scsi_print_command+0x2c/0xf8

Debug call traces from the freed object:

free (on cpu 1):
000000000011bd5e save_stack_trace
00000000002290b4 process_slab
0000000000229514 free_debug_processing
00000000002298a8 __slab_free
0000000000229c30 kmem_cache_free
000000000042d400 scsi_next_command
000000000042d628 scsi_io_completion
00000000003b0238 blk_done_softirq
000000000014c110 __do_softirq
000000000010e2ce do_softirq
000000000014c570 irq_exit
00000000004689b8 do_IRQ
00000000005a7e70 io_return

alloc ([flush-252:1]):
000000000011bd82 save_stack_trace
00000000002290b4 process_slab
00000000002292e4 alloc_debug_processing
000000000022a574 new_slab
000000000022a926 kmem_cache_alloc
000000000042475e scsi_pool_alloc_command
000000000042490a scsi_host_alloc_command
000000000042499e __scsi_get_command
0000000000424a8e scsi_get_command
000000000042bda6 scsi_setup_fs_cmnd
0000000000440662 sd_prep_fn
00000000003aa97e blk_peek_request
000000000042cc28 scsi_request_fn
00000000003a78be blk_insert_cloned_request
000003c0023f3466 dm_dispatch_request
000003c0023f5b0c dm_request_fn
00000000003aa512 blk_end_request_cur
00000000003ab284 blk_flush_plug_list
00000000003ab342 blk_finish_plug
00000000001e9410 generic_writepages
000000000026abe0 writeback_single_inode
000000000026b13a writeback_sb_inodes
000000000026bcc0 writeback_inodes_wb
000000000026c04e wb_writeback
000000000026c2f4 wb_do_writeback
000000000026c3ec bdi_writeback_thread
0000000000169b3e kthread
00000000005a73ea kernel_thread_starter

On Wed, Jun 15, 2011 at 01:20:16PM +0200, Heiko Carstens wrote:
> Hi James,
> 
> while running some error recovery tests we encountered the following bug:
> 
> [23710.196446] Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
> [23710.196453] Oops: 0038 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [23710.196460] Modules linked in: dm_round_robin sunrpc qeth_l3 binfmt_misc dm_multipath scsi_dh dm_mod ipv6 vmur qeth ccwgroup [last unloaded: scsi_wait_scan]
> [23710.196477] CPU: 0 Not tainted 2.6.39.1-49.x.20110609-s390xdefault #1
> [23710.196480] Process kworker/u:5 (pid: 18534, task: 000000002f4d2368, ksp: 000000002f48f268)
> [23710.196483] Krnl PSW : 0704000180000000 000000000042407c (scsi_dispatch_cmd+0xd4/0x29c)
> [23710.196492]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3
> [23710.196496] Krnl GPRS: 0280000000000000 6b6b6b6b6b6b6b6b 0000000000000060 0000000000006b6b
> [23710.196499]            00000000005a0ab2 000000002f4d2af0 00000000390d8090 000000002f611ac7
> [23710.196502]            000000003a764600 000000002f6114f0 000000003a764600 0000000000000000
> [23710.196505]            000000003a78e1b0 00000000005f0638 0000000000424070 000000002f48f530
> [23710.196516] Krnl Code: 000000000042406a: c0e5fffffdf9        brasl   %r14,423c5c
> [23710.196520]            0000000000424070: e310a0000004        lg      %r1,0(%r10)
> [23710.196525]            0000000000424076: e330a04a0091        llgh    %r3,74(%r10)
> [23710.196529]           >000000000042407c: e31010000004        lg      %r1,0(%r1)
> [23710.196533]            0000000000424082: e34012480091        llgh    %r4,584(%r1)
> [23710.196537]            0000000000424088: 1534                clr     %r3,%r4
> [23710.196541]            000000000042408a: a7240033            brc     2,4240f0
> [23710.196546]            000000000042408e: d503d008c2c0        clc     8(4,%r13),704(%r12)
> [23710.196549] Call Trace:
> [23710.196551] ([<0000000000424070>] scsi_dispatch_cmd+0xc8/0x29c)
> [23710.196555]  [<000000000042bdb8>] scsi_request_fn+0x3b8/0x480
> [23710.196560]  [<00000000003adfb6>] blk_execute_rq_nowait+0x76/0xd0
> [23710.196565]  [<00000000003ae0a6>] blk_execute_rq+0x96/0xe8
> [23710.196569]  [<000000000042af84>] scsi_execute+0x124/0x22c
> [23710.196573]  [<000000000042b1cc>] scsi_execute_req+0x140/0x148
> [23710.196577]  [<0000000000422f0c>] scsi_vpd_inquiry+0x80/0xac
> [23710.196580]  [<0000000000422f80>] scsi_get_vpd_page+0x48/0x108
> [23710.196584]  [<000000000043e240>] sd_revalidate_disk+0x984/0x1a6c
> [23710.196590]  [<00000000002b898e>] rescan_partitions+0xc6/0x594
> [23710.196596]  [<000000000027b1ca>] __blkdev_get+0x2f6/0x43c
> [23710.196602]  [<000000000027b358>] blkdev_get+0x48/0x38c
> [23710.196606]  [<00000000003b463e>] register_disk+0x182/0x19c
> [23710.196610]  [<00000000003b4758>] add_disk+0x100/0x2ac
> [23710.196614]  [<00000000004402bc>] sd_probe_async+0x118/0x1c8
> [23710.196618]  [<0000000000173168>] async_run_entry_fn+0x98/0x188
> [23710.196624]  [<000000000015ec5e>] process_one_work+0x1f6/0x4ec
> [23710.196630]  [<0000000000162a48>] worker_thread+0x17c/0x370
> [23710.196633]  [<0000000000168b7e>] kthread+0xa6/0xb0
> [23710.196637]  [<00000000005a61fa>] kernel_thread_starter+0x6/0xc
> [23710.196643]  [<00000000005a61f4>] kernel_thread_starter+0x0/0xc
> [23710.196646] INFO: lockdep is turned off.
> [23710.196648] Last Breaking-Event-Address:
> [23710.196650]  [<00000000005a0abc>] printk+0x5c/0x60
> 
> The test case is a multipath (dm) setup with I/O running to these SCSI
> devices while the channel paths to the adapters are randomly switched on
> and off. Each time a channel path to an adapter is switched off, this
> results in a call to fc_remote_port_delete() for each connected port of
> that adapter.
> 
> A short analysis for the call trace above: we use SLUB and have all debugging
> options turned on. The system crashed within scsi_dispatch_cmd() when it tried
> to dereference cmd->device in order to get the host pointer within this short
> code snippet:
> 
>         /*
>          * Before we queue this command, check if the command
>          * length exceeds what the host adapter can handle.
>          */
>         if (cmd->cmd_len > cmd->device->host->max_cmd_len) {
>                 SCSI_LOG_MLQUEUE(3,
> 
> The struct scsi_cmnd is filled with 0x6b, the SLUB poison pattern, which
> means it was freed: we have a use-after-free bug here. The interesting
> thing is that the code got this far into the function without crashing.
> Since the function dereferences this memory in several earlier places
> (without crashing), it must have been freed while the function was
> executing (well, preemption is turned on, so maybe it was freed while
> the task was preempted).
> 
> SLUB caller tracking is turned on, which reveals that this piece of memory
> got freed by __scsi_put_command().
> 
> As far as I can tell, in this case the device driver cannot be blamed since
> the request in question didn't even reach the driver before it got freed and
> the system crashed.
> But of course I could be entirely wrong :)
> 
> Would you happen to have any ideas what could cause this bug?
> 
> FWIW, this is 2.6.39.1 which in addition contains
> e73e079bf128d68284efedeba1fbbc18d78610f9 "[SCSI] Fix oops caused by queue
> refcounting failure"
> 
> Thanks,
> Heiko

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
  2011-06-16 16:01 ` Heiko Carstens
@ 2011-06-16 16:37   ` James Bottomley
  2011-06-16 18:40     ` Heiko Carstens
  0 siblings, 1 reply; 42+ messages in thread
From: James Bottomley @ 2011-06-16 16:37 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Jens Axboe, linux-scsi, Steffen Maier,
	Manvanthara B. Puttashankar, Tarak Reddy, Seshagiri N. Ippili,
	Christof Schmitt

On Thu, 2011-06-16 at 18:01 +0200, Heiko Carstens wrote:
> cc'ing Jens (full quote of the original bug report at the end).
> 
> I patched the SLUB debugging code so that it includes full call traces
> instead of just the address of the caller of kfree/kmalloc. So here is
> yet another incarnation of what I think is the same use-after-free bug:
> 
> What we can see here: while scsi_dispatch_cmd() operates on a scsi_cmnd
> it gets freed via blk_done_softirq().
> In this specific case scsi_log_send() crashes while trying to dereference
> the SLAB/SLUB poison value.

But you're crashing in the submit path, not the completion path.
scsi_log_send() is called *before* we dispatch the command, so how on
earth did the command get freed in the completion (softirq) path?

James

> [  952.532025] Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
> [  952.532031] Oops: 0038 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [  952.532038] Modules linked in: dm_round_robin dm_multipath scsi_dh sunrpc qeth_l3 binfmt_misc dm_mod ipv6 qeth ccwgroup [last unloaded: scsi_wait_scan]
> [  952.532055] CPU: 0 Not tainted 2.6.39.1-50.0slubdbg.20110616-s390xdefault #1
> [  952.532058] Process flush-252:1 (pid: 3993, task: 00000000b837a558, ksp: 000000009f1934f0)
> [  952.532062] Krnl PSW : 0704100180000000 0000000000426dfc (scsi_print_command+0x44/0xf8)
> [  952.532071]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
> [  952.532075] Krnl GPRS: 0000000000000799 6b6b6b6b6b6b6b6b 000000008ac7c800 0000000000000001
> [  952.532078]            00000000005a1c8e 00000000005b95b0 00000000af005ba8 00000000afc9f57f
> [  952.532082]            000000008ac7c800 00000000afc9efa8 000000008ac7c800 000000008ac7c800
> [  952.532085]            0000000000000004 00000000005f1bb0 000000009f1938a0 000000009f193870
> [  952.532097] Krnl Code: 0000000000426dea: f0b80004ebbf        srp	4(12,%r0),3007(%r14),8
> [  952.532102]            0000000000426df0: f0a0000407f4        srp	4(11,%r0),2036,0
> [  952.532106]            0000000000426df6: e31020800004        lg	%r1,128(%r2)
> [  952.532111]           >0000000000426dfc: e31010b00004        lg	%r1,176(%r1)
> [  952.532115]            0000000000426e02: b9020011            ltgr	%r1,%r1
> [  952.532118]            0000000000426e06: a7840032            brc	8,426e6a
> [  952.532122]            0000000000426e0a: e33020000004        lg	%r3,0(%r2)
> [  952.532127]            0000000000426e10: c04000165e8b        larl	%r4,6f2b26
> [  952.532131] Call Trace:
> [  952.532133] ([<000000008ac7c800>] 0x8ac7c800)
> [  952.532136]  [<0000000000424f0c>] scsi_log_send+0xf0/0x130
> [  952.532140]  [<0000000000425230>] scsi_dispatch_cmd+0xc8/0x29c
> [  952.532144]  [<000000000042cf78>] scsi_request_fn+0x3b8/0x480
> [  952.532149]  [<00000000003a78be>] blk_insert_cloned_request+0x72/0xcc
> [  952.532154]  [<000003c0023f3466>] dm_dispatch_request+0x5a/0x94 [dm_mod]
> [  952.532166]  [<000003c0023f5b0c>] dm_request_fn+0x1e4/0x208 [dm_mod]
> [  952.532174]  [<00000000003aa512>] queue_unplugged.clone.27+0x3e/0x6c
> [  952.532178]  [<00000000003ab284>] blk_flush_plug_list+0x1fc/0x290
> [  952.532182]  [<00000000003ab342>] blk_finish_plug+0x2a/0x58
> [  952.532185]  [<00000000001e9410>] generic_writepages+0x6c/0x94
> [  952.532192]  [<000000000026abe0>] writeback_single_inode+0xf8/0x268
> [  952.532198]  [<000000000026b13a>] writeback_sb_inodes+0xd2/0x18c
> [  952.532202]  [<000000000026bcc0>] writeback_inodes_wb+0x80/0x168
> [  952.532206]  [<000000000026c04e>] wb_writeback+0x2a6/0x320
> [  952.532210]  [<000000000026c2f4>] wb_do_writeback+0x22c/0x270
> [  952.532213]  [<000000000026c3ec>] bdi_writeback_thread+0xb4/0x1c0
> [  952.532217]  [<0000000000169b3e>] kthread+0xa6/0xb0
> [  952.532223]  [<00000000005a73ea>] kernel_thread_starter+0x6/0xc
> [  952.532228]  [<00000000005a73e4>] kernel_thread_starter+0x0/0xc
> [  952.532231] INFO: lockdep is turned off.
> [  952.532233] Last Breaking-Event-Address:
> [  952.532235]  [<0000000000426de4>] scsi_print_command+0x2c/0xf8
> 
> Debug call traces from the freed object:
> 
> free (on cpu 1):
> 000000000011bd5e save_stack_trace
> 00000000002290b4 process_slab
> 0000000000229514 free_debug_processing
> 00000000002298a8 __slab_free
> 0000000000229c30 kmem_cache_free
> 000000000042d400 scsi_next_command
> 000000000042d628 scsi_io_completion
> 00000000003b0238 blk_done_softirq
> 000000000014c110 __do_softirq
> 000000000010e2ce do_softirq
> 000000000014c570 irq_exit
> 00000000004689b8 do_IRQ
> 00000000005a7e70 io_return
> 
> alloc ([flush-252:1]):
> 000000000011bd82 save_stack_trace
> 00000000002290b4 process_slab
> 00000000002292e4 alloc_debug_processing
> 000000000022a574 new_slab
> 000000000022a926 kmem_cache_alloc
> 000000000042475e scsi_pool_alloc_command
> 000000000042490a scsi_host_alloc_command
> 000000000042499e __scsi_get_command
> 0000000000424a8e scsi_get_command
> 000000000042bda6 scsi_setup_fs_cmnd
> 0000000000440662 sd_prep_fn
> 00000000003aa97e blk_peek_request
> 000000000042cc28 scsi_request_fn
> 00000000003a78be blk_insert_cloned_request
> 000003c0023f3466 dm_dispatch_request
> 000003c0023f5b0c dm_request_fn
> 00000000003aa512 blk_end_request_cur
> 00000000003ab284 blk_flush_plug_list
> 00000000003ab342 blk_finish_plug
> 00000000001e9410 generic_writepages
> 000000000026abe0 writeback_single_inode
> 000000000026b13a writeback_sb_inodes
> 000000000026bcc0 writeback_inodes_wb
> 000000000026c04e wb_writeback
> 000000000026c2f4 wb_do_writeback
> 000000000026c3ec bdi_writeback_thread
> 0000000000169b3e kthread
> 00000000005a73ea kernel_thread_starter
> 
> On Wed, Jun 15, 2011 at 01:20:16PM +0200, Heiko Carstens wrote:
> > Hi James,
> > 
> > while running some error recovery tests we encountered the following bug:
> > 
> > [23710.196446] Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
> > [23710.196453] Oops: 0038 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> > [23710.196460] Modules linked in: dm_round_robin sunrpc qeth_l3 binfmt_misc dm_multipath scsi_dh dm_mod ipv6 vmur qeth ccwgroup [last unloaded: scsi_wait_scan]
> > [23710.196477] CPU: 0 Not tainted 2.6.39.1-49.x.20110609-s390xdefault #1
> > [23710.196480] Process kworker/u:5 (pid: 18534, task: 000000002f4d2368, ksp: 000000002f48f268)
> > [23710.196483] Krnl PSW : 0704000180000000 000000000042407c (scsi_dispatch_cmd+0xd4/0x29c)
> > [23710.196492]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3
> > [23710.196496] Krnl GPRS: 0280000000000000 6b6b6b6b6b6b6b6b 0000000000000060 0000000000006b6b
> > [23710.196499]            00000000005a0ab2 000000002f4d2af0 00000000390d8090 000000002f611ac7
> > [23710.196502]            000000003a764600 000000002f6114f0 000000003a764600 0000000000000000
> > [23710.196505]            000000003a78e1b0 00000000005f0638 0000000000424070 000000002f48f530
> > [23710.196516] Krnl Code: 000000000042406a: c0e5fffffdf9        brasl   %r14,423c5c
> > [23710.196520]            0000000000424070: e310a0000004        lg      %r1,0(%r10)
> > [23710.196525]            0000000000424076: e330a04a0091        llgh    %r3,74(%r10)
> > [23710.196529]           >000000000042407c: e31010000004        lg      %r1,0(%r1)
> > [23710.196533]            0000000000424082: e34012480091        llgh    %r4,584(%r1)
> > [23710.196537]            0000000000424088: 1534                clr     %r3,%r4
> > [23710.196541]            000000000042408a: a7240033            brc     2,4240f0
> > [23710.196546]            000000000042408e: d503d008c2c0        clc     8(4,%r13),704(%r12)
> > [23710.196549] Call Trace:
> > [23710.196551] ([<0000000000424070>] scsi_dispatch_cmd+0xc8/0x29c)
> > [23710.196555]  [<000000000042bdb8>] scsi_request_fn+0x3b8/0x480
> > [23710.196560]  [<00000000003adfb6>] blk_execute_rq_nowait+0x76/0xd0
> > [23710.196565]  [<00000000003ae0a6>] blk_execute_rq+0x96/0xe8
> > [23710.196569]  [<000000000042af84>] scsi_execute+0x124/0x22c
> > [23710.196573]  [<000000000042b1cc>] scsi_execute_req+0x140/0x148
> > [23710.196577]  [<0000000000422f0c>] scsi_vpd_inquiry+0x80/0xac
> > [23710.196580]  [<0000000000422f80>] scsi_get_vpd_page+0x48/0x108
> > [23710.196584]  [<000000000043e240>] sd_revalidate_disk+0x984/0x1a6c
> > [23710.196590]  [<00000000002b898e>] rescan_partitions+0xc6/0x594
> > [23710.196596]  [<000000000027b1ca>] __blkdev_get+0x2f6/0x43c
> > [23710.196602]  [<000000000027b358>] blkdev_get+0x48/0x38c
> > [23710.196606]  [<00000000003b463e>] register_disk+0x182/0x19c
> > [23710.196610]  [<00000000003b4758>] add_disk+0x100/0x2ac
> > [23710.196614]  [<00000000004402bc>] sd_probe_async+0x118/0x1c8
> > [23710.196618]  [<0000000000173168>] async_run_entry_fn+0x98/0x188
> > [23710.196624]  [<000000000015ec5e>] process_one_work+0x1f6/0x4ec
> > [23710.196630]  [<0000000000162a48>] worker_thread+0x17c/0x370
> > [23710.196633]  [<0000000000168b7e>] kthread+0xa6/0xb0
> > [23710.196637]  [<00000000005a61fa>] kernel_thread_starter+0x6/0xc
> > [23710.196643]  [<00000000005a61f4>] kernel_thread_starter+0x0/0xc
> > [23710.196646] INFO: lockdep is turned off.
> > [23710.196648] Last Breaking-Event-Address:
> > [23710.196650]  [<00000000005a0abc>] printk+0x5c/0x60
> > 
> > The test case is a multipath (dm) setup with I/O running to these SCSI
> > devices while the channel paths to the adapters are randomly switched on
> > and off. Each time a channel path to an adapter is switched off, this
> > results in a call to fc_remote_port_delete() for each connected port of
> > that adapter.
> > 
> > A short analysis for the call trace above: we use SLUB and have all debugging
> > options turned on. The system crashed within scsi_dispatch_cmd() when it tried
> > to dereference cmd->device in order to get the host pointer within this short
> > code snippet:
> > 
> >         /*
> >          * Before we queue this command, check if the command
> >          * length exceeds what the host adapter can handle.
> >          */
> >         if (cmd->cmd_len > cmd->device->host->max_cmd_len) {
> >                 SCSI_LOG_MLQUEUE(3,
> > 
> > The struct scsi_cmnd is filled with 0x6b, the SLUB poison pattern, which
> > means it was freed: we have a use-after-free bug here. The interesting
> > thing is that the code got this far into the function without crashing.
> > Since the function dereferences this memory in several earlier places
> > (without crashing), it must have been freed while the function was
> > executing (well, preemption is turned on, so maybe it was freed while
> > the task was preempted).
> > 
> > SLUB caller tracking is turned on, which reveals that this piece of memory
> > got freed by __scsi_put_command().
> > 
> > As far as I can tell, in this case the device driver cannot be blamed since
> > the request in question didn't even reach the driver before it got freed and
> > the system crashed.
> > But of course I could be entirely wrong :)
> > 
> > Would you happen to have any ideas what could cause this bug?
> > 
> > FWIW, this is 2.6.39.1 which in addition contains
> > e73e079bf128d68284efedeba1fbbc18d78610f9 "[SCSI] Fix oops caused by queue
> > refcounting failure"
> > 
> > Thanks,
> > Heiko
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
  2011-06-16 16:37   ` James Bottomley
@ 2011-06-16 18:40     ` Heiko Carstens
  2011-06-20 15:30       ` Heiko Carstens
  0 siblings, 1 reply; 42+ messages in thread
From: Heiko Carstens @ 2011-06-16 18:40 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, linux-scsi, Steffen Maier,
	Manvanthara B. Puttashankar, Tarak Reddy, Seshagiri N. Ippili,
	Christof Schmitt

On Thu, Jun 16, 2011 at 12:37:55PM -0400, James Bottomley wrote:
> On Thu, 2011-06-16 at 18:01 +0200, Heiko Carstens wrote:
> > cc'ing Jens (full quote of the original bug report at the end).
> > 
> > I patched the SLUB debugging code so that it includes full call traces
> > instead of just the address of the caller of kfree/kmalloc. So here is
> > yet another incarnation of what I think is the same use-after-free bug:
> > 
> > What we can see here: while scsi_dispatch_cmd() operates on a scsi_cmnd
> > it gets freed via blk_done_softirq().
> > In this specific case scsi_log_send() crashes while trying to dereference
> > the SLAB/SLUB poison value.
> 
> But you're crashing in the submit path, not the completion path.
> scsi_log_send() is called *before* we dispatch the command, so how on
> earth did the command get freed in the completion (softirq) path?

Well, I was hoping you would point to something obviously wrong ;)
The only way I can think this could have happened is a sequence
like this:

User A -- allocs command
User A -- somehow frees command (but still has a reference)
User B -- submit path allocs command (gets freed slab again)
User A -- frees it again
User B -- submit path crashes

So it looks like I need to add more trace and/or debug code. Sigh.
Dunno... maybe zfcp returns the command twice or something like that, even
though there is code in place which should prevent exactly this scenario.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
  2011-06-16 18:40     ` Heiko Carstens
@ 2011-06-20 15:30       ` Heiko Carstens
  2011-07-01 18:07         ` Roland Dreier
  0 siblings, 1 reply; 42+ messages in thread
From: Heiko Carstens @ 2011-06-20 15:30 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, linux-scsi, Steffen Maier,
	Manvanthara B. Puttashankar, Tarak Reddy, Seshagiri N. Ippili,
	Christof Schmitt

On Thu, Jun 16, 2011 at 08:40:23PM +0200, Heiko Carstens wrote:
> On Thu, Jun 16, 2011 at 12:37:55PM -0400, James Bottomley wrote:
> > On Thu, 2011-06-16 at 18:01 +0200, Heiko Carstens wrote:
> > > cc'ing Jens (full quote of the original bug report at the end).
> > > 
> > > I patched the SLUB debugging code so that it includes full call traces
> > > instead of just the address of the caller of kfree/kmalloc. So here is
> > > yet another incarnation of what I think is the same use-after-free bug:
> > > 
> > > What we can see here: while scsi_dispatch_cmd() operates on a scsi_cmnd
> > > it gets freed via blk_done_softirq().
> > > In this specific case scsi_log_send() crashes while trying to dereference
> > > the SLAB/SLUB poison value.
> > 
> > But you're crashing in the submit path, not the completion path.
> > scsi_log_send() is called *before* we dispatch the command, so how on
> > earth did the command get freed in the completion (softirq) path?

Here is another one:

In this case it crashed in elv_drain_elevator() while trying to get the ops
pointer of the struct elevator_queue in this line:

while (q->elevator->ops->elevator_dispatch_fn(q, 1))

The queue_flags field of the corresponding request_queue has the
QUEUE_FLAG_DEAD bit set, if I'm not mistaken (see below). It looks like
blk_cleanup_queue() was called, which in turn called elevator_exit(), which
freed the struct. Does it even make sense to have QUEUE_FLAG_DEAD set while
QUEUE_FLAG_STOPPED is not set?

The refcount of the embedded kobj of the request_queue however is still 1.

In addition, the piece of memory was allocated and freed by the zfcp driver,
but this doesn't look like the real cause of the bug.
It looks more like an unexpected call to blk_cleanup_queue(), no?

[ 3081.573315] Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
[ 3081.573320] Oops: 0038 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 3081.573324] Modules linked in: dm_round_robin sunrpc qeth_l3 binfmt_misc dm_multipath scsi_dh dm_mod ipv6 qeth ccwgroup [last unloaded: scsi_wait_scan]
[ 3081.573333] CPU: 0 Not tainted 2.6.39.1-50.0slubdbg.20110617-s390xdefault #1
[ 3081.573335] Process multipathd (pid: 633, task: 0000000074c90000, ksp: 0000000072467888)
[ 3081.573337] Krnl PSW : 0404200180000000 00000000003a6430 (elv_drain_elevator+0x34/0x94)
[ 3081.573346]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3
[ 3081.573348] Krnl GPRS: 0e40000000000000 6b6b6b6b6b6b6b6b 00000000602bca70 0000000000000001
[ 3081.573350]            0000000000000002 0000000074c90750 0000000061aa923f 000000001f89a1c0
[ 3081.573352]            000000006fc6e250 0000000073131ea8 0000000061aa8c68 000000006fc6e268
[ 3081.573354]            00000000602bca70 00000000602bca70 0000000072467b58 0000000072467b38
[ 3081.573364] Krnl Code: 00000000003a6422: b904002d            lgr     %r2,%r13
[ 3081.573368]            00000000003a6426: e31010000004        lg      %r1,0(%r1)
[ 3081.573370]            00000000003a642c: a7390001            lghi    %r3,1
[ 3081.573373]           >00000000003a6430: e31010280004        lg      %r1,40(%r1)
[ 3081.573375]            00000000003a6436: 0de1                basr    %r14,%r1
[ 3081.573377]            00000000003a6438: 1222                ltr     %r2,%r2
[ 3081.573379]            00000000003a643a: a774fff1            brc     7,3a641c
[ 3081.573382]            00000000003a643e: bf4fd6b0            icm     %r4,15,1712(%r13)
[ 3081.573384] Call Trace:
[ 3081.573385] ([<0000000061aa8c68>] 0x61aa8c68)
[ 3081.573387]  [<00000000003a6534>] __elv_add_request+0xa4/0x254
[ 3081.573390]  [<00000000003a789e>] blk_insert_cloned_request+0x72/0xcc
[ 3081.573392]  [<000003c00241f466>] dm_dispatch_request+0x5a/0x94 [dm_mod]
[ 3081.573400]  [<000003c002421b0c>] dm_request_fn+0x1e4/0x208 [dm_mod]
[ 3081.573405]  [<00000000003a72c0>] blk_start_queue+0x68/0x9c
[ 3081.573407]  [<000003c00241eafe>] start_queue+0x4e/0x68 [dm_mod]
[ 3081.573411]  [<000003c002421cc2>] dm_resume+0xd2/0xfc [dm_mod]
[ 3081.573415]  [<000003c00242769e>] dev_suspend+0x1ca/0x250 [dm_mod]
[ 3081.573420]  [<000003c00242857a>] ctl_ioctl+0x1e2/0x2f4 [dm_mod]
[ 3081.573425]  [<000003c0024286b6>] dm_ctl_ioctl+0x2a/0x38 [dm_mod]
[ 3081.573429]  [<0000000000252d28>] do_vfs_ioctl+0x94/0x588
[ 3081.573432]  [<00000000002532b0>] SyS_ioctl+0x94/0xac
[ 3081.573434]  [<00000000005a789a>] sysc_noemu+0x16/0x1c
[ 3081.573438]  [<000003fffd33c7ca>] 0x3fffd33c7ca
[ 3081.573440] INFO: lockdep is turned off.
[ 3081.573441] Last Breaking-Event-Address:
[ 3081.573442]  [<00000000003a652e>] __elv_add_request+0x9e/0x254

Parts of the corresponding struct request_queue:

  queue_flags = 76320,  /* == 0x12a20 --> QUEUE_FLAG_DEAD set */
  ...
  kobj = {
    name = 0x6c94f740 "queue", 
    entry = {
      next = 0x602bd0a0, 
      prev = 0x602bd0a0
    }, 
    parent = 0x0, 
    kset = 0x0, 
    ktype = 0x93fca8, 
    sd = 0x0, 
    kref = {
      refcount = {
        counter = 1
      }
    }, 
    state_initialized = 1, 
    state_in_sysfs = 0, 
    state_add_uevent_sent = 0, 
    state_remove_uevent_sent = 0, 
    uevent_suppress = 0
  }, 


In the meantime the piece of memory was allocated and freed by the
zfcp driver (extended SLUB caller tracing):

Allocated:

000000000011bd3e save_stack_trace
0000000000229094 process_slab
00000000002292c4 alloc_debug_processing
000000000022a554 new_slab
000000000022b72a __kmalloc
00000000001e0b68 mempool_alloc
00000000004c2f7a zfcp_fsf_req_create
00000000004c4a52 zfcp_fsf_fcp_cmnd
00000000004c7618 zfcp_scsi_queuecommand
000000000042525a scsi_dispatch_cmd
000000000042cf58 scsi_request_fn
00000000003a733e blk_run_queue
000000000042aea6 scsi_run_queue
000000000042d3ea scsi_next_command
000000000042d608 scsi_io_completion
00000000003b0218 blk_done_softirq
000000000014c0f0 __do_softirq
000000000010e2ae do_softirq
000000000014c550 irq_exit
0000000000468998 do_IRQ
00000000005a7f3c io_return

Freed:

000000000011bd3e save_stack_trace
0000000000229094 process_slab
00000000002294f4 free_debug_processing
0000000000229888 __slab_free
0000000000229a36 kfree
00000000004c5482 zfcp_fsf_reqid_check
00000000004c63f2 zfcp_qdio_int_resp
000000000048219c qdio_kick_handler
000000000048448e __tiqdio_inbound_processing
000000000014b550 tasklet_action
000000000014c0f0 __do_softirq
000000000010e2ae do_softirq
000000000014c550 irq_exit
0000000000468998 do_IRQ
00000000005a7f3c io_return
0000000000190b3a generic_exec_single
0000000000169b1e kthread
00000000005a74d2 kernel_thread_starter


* Re: [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
  2011-06-20 15:30       ` Heiko Carstens
@ 2011-07-01 18:07         ` Roland Dreier
  2011-07-01 19:04           ` James Bottomley
  0 siblings, 1 reply; 42+ messages in thread
From: Roland Dreier @ 2011-07-01 18:07 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: James Bottomley, Jens Axboe, linux-scsi, Steffen Maier,
	Manvanthara B. Puttashankar, Tarak Reddy, Seshagiri N. Ippili,
	Christof Schmitt

We seem to be hitting something similar, running 2.6.39.2.
Did anyone make any progress on this?  (I'm happy to try and
gather more info but I probably won't be able to seriously
debug until next week)

Anyway, we have a system with two mpt2sas adapters that
have 4 paths to a JBOD, and we're using dm-multipath to the
drives in the JBOD.  Killing one of the drives (i.e. yanking the
drive, so all 4 paths go down at once) leads nearly instantly to:

[  768.999560] device-mapper: multipath: Failing path 8:48.
[  769.005151] device-mapper: multipath: Failing path 8:48.
[  769.010919] device-mapper: table: 252:4: multipath: error getting device
[  769.017708] device-mapper: ioctl: error adding target to table
[  769.023696] device-mapper: multipath: Failing path 8:48.
[  769.030979] BUG: unable to handle kernel paging request at 0000000200000000
[  769.038119] IP: [<0000000200000000>] 0x1ffffffff
[  769.042859] PGD 0
[  769.045264] Oops: 0010 [#1] SMP
[  769.048671] last sysfs file:
/sys/devices/pci0000:00/0000:00:07.0/0000:09:00.0/host11/port-11:0/expander-11:0/port-11:0:1/end_device-11:0:1/target11:0:1/11:0:1:0/block/sdd/uevent
[  769.064690] CPU 6
[  769.066835] Modules linked in: target_core_pscsi target_core_file
target_core_iblock tcm_loop target_core_mod configfs ps_bdrv
ipmi_devintf ipmi_si ipmi_msghandler serio_raw ioatdma i7core_edac dca
edac_core ses enclosure usb_storage mpt2sas qla2xxx usbhid
scsi_transport_sas ahci uas scsi_transport_fc libahci e1000e hid
mlx4_core scsi_tgt raid_class [last unloaded: evbug]
[  769.102323]
[  769.104141] Pid: 30, comm: kworker/6:0 Not tainted 2.6.39.2+ #1
Xyratex Storage Server        /HS-1235T-ATX
[  769.116224] RIP: 0010:[<0000000200000000>]  [<0000000200000000>] 0x1ffffffff
[  769.123679] RSP: 0018:ffff880c1b1a1cc8  EFLAGS: 00010082
[  769.129317] RAX: ffff8806165cc780 RBX: ffff880614cd06c0 RCX: 0000000000000000
[  769.136808] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff880614cd06c0
[  769.144298] RBP: ffff880c1b1a1ce0 R08: 0000000000000000 R09: 0000000000000001
[  769.151789] R10: ffff880c0ecd2af0 R11: ffff880619b40400 R12: ffff880614cd06c0
[  769.159279] R13: 0000000000000002 R14: 0000000000000002 R15: ffff880614cd4010
[  769.166770] FS:  0000000000000000(0000) GS:ffff880c3fc00000(0000)
knlGS:0000000000000000
[  769.175210] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  769.181284] CR2: 0000000200000000 CR3: 0000000001a03000 CR4: 00000000000006e0
[  769.188772] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  769.196254] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  769.203746] Process kworker/6:0 (pid: 30, threadinfo
ffff880c1b1a0000, task ffff880c1b1996b0)
[  769.212617] Stack:
[  769.214694]  ffffffff8123290a ffff880c19d54338 ffff880c19d54338
ffff880c1b1a1d10
[  769.222669]  ffffffff81232a63 ffff880c19d54338 ffff880614cd06c0
0000000000000002
[  769.230636]  ffffc900189b7040 ffff880c1b1a1d40 ffffffff812357ad
ffff880c1b1a1d60
[  769.238600] Call Trace:
[  769.241118]  [<ffffffff8123290a>] ? elv_drain_elevator+0x2a/0x80
[  769.247452]  [<ffffffff81232a63>] __elv_add_request+0x103/0x280
[  769.253697]  [<ffffffff812357ad>] add_acct_request+0x3d/0x50
[  769.259680]  [<ffffffff81235825>] blk_insert_cloned_request+0x65/0x90
[  769.266449]  [<ffffffff813b332e>] dm_dispatch_request+0x3e/0x70
[  769.272692]  [<ffffffff813b4e3b>] dm_request_fn+0x16b/0x240
[  769.278588]  [<ffffffff81237300>] ? perf_trace_block_unplug+0xe0/0xe0
[  769.285353]  [<ffffffff81237340>] blk_delay_work+0x40/0x60
[  769.291218]  [<ffffffff8106954a>] process_one_work+0x11a/0x420
[  769.297375]  [<ffffffff8106a5a3>] worker_thread+0x163/0x340
[  769.303274]  [<ffffffff8106a440>] ? manage_workers.clone.21+0x240/0x240
[  769.310212]  [<ffffffff8106f2d6>] kthread+0x96/0xa0
[  769.315421]  [<ffffffff814de1a4>] kernel_thread_helper+0x4/0x10
[  769.321662]  [<ffffffff8106f240>] ? kthread_worker_fn+0x190/0x190
[  769.328079]  [<ffffffff814de1a0>] ? gs_change+0x13/0x13
[  769.333632] Code:  Bad RIP value.


* Re: [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
  2011-07-01 18:07         ` Roland Dreier
@ 2011-07-01 19:04           ` James Bottomley
  2011-07-06  0:34             ` Roland Dreier
  0 siblings, 1 reply; 42+ messages in thread
From: James Bottomley @ 2011-07-01 19:04 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Heiko Carstens, Jens Axboe, linux-scsi, Steffen Maier,
	Manvanthara B. Puttashankar, Tarak Reddy, Seshagiri N. Ippili,
	Christof Schmitt

On Fri, 2011-07-01 at 11:07 -0700, Roland Dreier wrote:
> We seem to be hitting something similar, running 2.6.39.2.
> Did anyone make any progress on this?  (I'm happy to try and
> gather more info but I probably won't be able to seriously
> debug until next week)

The first obvious question is does it reproduce with git HEAD?  If yes,
then it's an unfixed reference bug, if no, we forgot to backport the fix
(we should be able to find it easily enough).

James




* Re: [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
  2011-07-01 19:04           ` James Bottomley
@ 2011-07-06  0:34             ` Roland Dreier
  2011-07-06  6:47               ` Heiko Carstens
  0 siblings, 1 reply; 42+ messages in thread
From: Roland Dreier @ 2011-07-06  0:34 UTC (permalink / raw)
  To: James Bottomley
  Cc: Heiko Carstens, Jens Axboe, linux-scsi, Steffen Maier,
	Manvanthara B. Puttashankar, Tarak Reddy, Seshagiri N. Ippili,
	Christof Schmitt

On Fri, Jul 1, 2011 at 12:04 PM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> The first obvious question is does it reproduce with git HEAD?  If yes,
> then it's an unfixed reference bug, if no, we forgot to backport the fix
> (we should be able to find it easily enough).

Hi James,

Yes, I'm able to reproduce with v3.0-rc5-170-gba466c7.

I booted with slub_debug=FZUP and the 6b pattern in RAX pretty much
does prove that this is a use-after-free issue.  Any thoughts about
how to pin this down before I muddle on in my lowbrow way?

    general protection fault: 0000 [#1] SMP
    CPU 7
    Modules linked in: kvm_intel kvm serio_raw i7core_edac edac_core
ioatdma dca pci_stub ses
   id usb_storage qla2xxx mpt2sas ahci hid uas libahci e1000e
scsi_transport_fc scsi_transport
    raid_class scsi_tgt

    Pid: 12458, comm: blkid Not tainted 3.0.0-rc5+ #1

    RIP: 0010:[<ffffffff81233867>]  [<ffffffff81233867>]
elv_drain_elevator+0x27/0x80
    RSP: 0018:ffff880614e9fa48  EFLAGS: 00010096
    RAX: 6b6b6b6b6b6b6b6b RBX: ffff880610bb0000 RCX: 0000000000000000
    RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff880610bb0000
    RBP: ffff880614e9fa58 R08: 0000000000000000 R09: 0000000000000001
    R10: ffff880c0a7dca70 R11: ffff880615622440 R12: ffff880610bb0000
    R13: 0000000000000002 R14: 0000000000000002 R15: ffff880c0db01160
    FS:  00007fe46457a760(0000) GS:ffff880c3fc20000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fe463c86330 CR3: 0000000c0cc0c000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process blkid (pid: 12458, threadinfo ffff880614e9e000, task
ffff8806190eae20)
    Stack:
     ffff880c0d9f03c8 ffff880c0d9f03c8 ffff880614e9fa88 ffffffff812339a8
     ffff880c0d9f03c8 ffff880610bb0000 0000000000000002 ffffc9001bd5e040
     ffff880614e9fab8 ffffffff812366fd ffff880614e9fad8 ffff880610bb0000
    Call Trace:
     [<ffffffff812339a8>] __elv_add_request+0xe8/0x280
     [<ffffffff812366fd>] add_acct_request+0x3d/0x50
     [<ffffffff81236775>] blk_insert_cloned_request+0x65/0x90
     [<ffffffff813b8d4e>] dm_dispatch_request+0x3e/0x70
     [<ffffffff813ba850>] dm_request_fn+0x160/0x250
     [<ffffffff81236e88>] queue_unplugged+0x48/0xd0
     [<ffffffff8123b03d>] blk_flush_plug_list+0x1ed/0x250
     [<ffffffff810e98f0>] ? sleep_on_page+0x20/0x20
     [<ffffffff814db895>] io_schedule+0x75/0xd0
     [<ffffffff810e98fe>] sleep_on_page_killable+0xe/0x40
     [<ffffffff814dbf9a>] __wait_on_bit_lock+0x5a/0xc0
     [<ffffffff810e9857>] __lock_page_killable+0x67/0x70
     [<ffffffff8106d650>] ? autoremove_wake_function+0x40/0x40
     [<ffffffff810eb5c5>] generic_file_aio_read+0x405/0x720
     [<ffffffff81146962>] do_sync_read+0xd2/0x110
     [<ffffffff8121e393>] ? security_file_permission+0x93/0xb0
     [<ffffffff81146c81>] ? rw_verify_area+0x61/0xf0
     [<ffffffff81147143>] vfs_read+0xc3/0x180
     [<ffffffff81147251>] sys_read+0x51/0x90
     [<ffffffff814e5942>] system_call_fastpath+0x16/0x1b
    Code: c3 0f 1f 00 55 48 89 e5 53 48 83 ec 08 66 66 66 66 90 48 89
fb 0f 1f 80 00 00 00 00
    01 00 00 00 48 89 df 48 8b 00 <ff> 50 28 85 c0 75 ea 8b 93 f0 03
00 00 85 d2 74 14 8b 05 f

    RIP  [<ffffffff81233867>] elv_drain_elevator+0x27/0x80
     RSP <ffff880614e9fa48>
    ---[ end trace 5dbd03e7295023d7 ]---
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
  2011-07-06  0:34             ` Roland Dreier
@ 2011-07-06  6:47               ` Heiko Carstens
  2011-07-06  8:06                 ` Roland Dreier
  0 siblings, 1 reply; 42+ messages in thread
From: Heiko Carstens @ 2011-07-06  6:47 UTC (permalink / raw)
  To: Roland Dreier
  Cc: James Bottomley, Jens Axboe, linux-scsi, Steffen Maier,
	Manvanthara B. Puttashankar, Tarak Reddy, Seshagiri N. Ippili,
	Alan Stern

On Tue, Jul 05, 2011 at 05:34:16PM -0700, Roland Dreier wrote:
> On Fri, Jul 1, 2011 at 12:04 PM, James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> > The first obvious question is does it reproduce with git HEAD?  If yes,
> > then it's an unfixed reference bug, if no, we forgot to backport the fix
> > (we should be able to find it easily enough).
> 
> Hi James,
> 
> Yes, I'm able to reproduce with v3.0-rc5-170-gba466c7.
> 
> I booted with slub_debug=FZUP and the 6b pattern in RAX pretty much
> does prove that this is a use-after-free issue.  Any thoughts about
> how to pin this down before I muddle on in my lowbrow way?

Alan Stern came up with a patch that could fix this:

http://marc.info/?l=linux-kernel&m=130963676907731&w=2



* Re: [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
  2011-07-06  6:47               ` Heiko Carstens
@ 2011-07-06  8:06                 ` Roland Dreier
  2011-07-06  9:25                   ` Heiko Carstens
  2011-07-06 14:20                   ` Alan Stern
  0 siblings, 2 replies; 42+ messages in thread
From: Roland Dreier @ 2011-07-06  8:06 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: James Bottomley, Jens Axboe, linux-scsi, Steffen Maier,
	Manvanthara B. Puttashankar, Tarak Reddy, Seshagiri N. Ippili,
	Alan Stern

 > > I booted with slub_debug=FZUP and the 6b pattern in RAX pretty much
 > > does prove that this is a use-after-free issue.  Any thoughts about
 > > how to pin this down before I muddle on in my lowbrow way?
 >
 > Alan Stern came up with a patch that could fix this:
 >
 > http://marc.info/?l=linux-kernel&m=130963676907731&w=2

Thanks!  It seems my crash is actually a different problem (or perhaps
the same problem in different code), since I'm not going through SCSI
directly but rather a dm-multipath device on top of SCSI disks.  But
it does seem that it is another case of the block queue elevator
getting freed while requests can still be submitted.

In my case, to reproduce this I have to hold the multipath device file
open with something like "cat > /dev/dm-X" and then kill the
underlying drive.  What I think is happening (although I haven't
traced all the layers to be sure) is that then the multipath daemon
notices that all the paths to the disk are lost and tries to kill the
multipath device, which ends up in dm.c:dm_destroy().

This ends up in blk_cleanup_queue() which frees q->elevator, which of
course leads to the crash.

What's not clear to me is how things are supposed to work.  It seems
that the dm stuff at least is missing a lot of required reference
counting, to make sure that some structure sticks around to reject IOs
to the device after it is destroyed but while it is still open.  But I
don't understand why people don't hit this more since it is completely
reproducible for me with a fairly normal setup (hot-remove a multipath
device that some process has open).

Alan Stern's patch looks a bit fishy -- the scsi_free_queue() is moved
earlier than the

	/* cause the request function to reject all I/O requests */
	sdev->request_queue->queuedata = NULL;

which seems to leave a small window where the use-after-free can
happen, and it's not clear to me why the scsi_free_queue() has to move
at all.

Thanks,
 - R.


* Re: [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
  2011-07-06  8:06                 ` Roland Dreier
@ 2011-07-06  9:25                   ` Heiko Carstens
  2011-07-06 14:20                   ` Alan Stern
  1 sibling, 0 replies; 42+ messages in thread
From: Heiko Carstens @ 2011-07-06  9:25 UTC (permalink / raw)
  To: Roland Dreier
  Cc: James Bottomley, Jens Axboe, linux-scsi, Steffen Maier,
	Manvanthara B. Puttashankar, Tarak Reddy, Seshagiri N. Ippili,
	Alan Stern, dm-devel, Alasdair G Kergon

On Wed, Jul 06, 2011 at 01:06:44AM -0700, Roland Dreier wrote:
>  > > I booted with slub_debug=FZUP and the 6b pattern in RAX pretty much
>  > > does prove that this is a use-after-free issue.  Any thoughts about
>  > > how to pin this down before I muddle on in my lowbrow way?
>  >
>  > Alan Stern came up with a patch that could fix this:
>  >
>  > http://marc.info/?l=linux-kernel&m=130963676907731&w=2
> 
> Thanks!  It seems my crash is actually a different problem (or perhaps
> the same problem in different code), since I'm not going through SCSI
> directly but rather a dm-multipath device on top of SCSI disks.  But
> it does seem that it is another case of the block queue elevator
> getting freed while requests can still be submitted.
> 
> In my case, to reproduce this I have to hold the multipath device file
> open with something like "cat > /dev/dm-X" and then kill the
> underlying drive.  What I think is happening (although I haven't
> traced all the layers to be sure) is that then the multipath daemon
> notices that all the paths to the disk are lost and tries to kill the
> multipath device, which ends up in dm.c:dm_destroy().
> 
> This ends up in blk_cleanup_queue() which frees q->elevator, which of
> course leads to the crash.

So it's the identical crash that I reported also (well, at least one of
the crashes I reported).

> What's not clear to me is how things are supposed to work.  It seems
> that the dm stuff at least is missing a lot of required reference
> counting, to make sure that some structure sticks around to reject IOs
> to the device after it is destroy but while it is still open.  But I
> don't understand why people don't hit this more since it is completely
> reproducible for me with a fairly normal setup (hot-remove a multipath
> device that some process has open).

cc'ing Alasdair as well, maybe he knows...

> Alan Stern's patch looks a bit fishy -- the scsi_free_queue() is moved
> earlier than the
> 
> 	/* cause the request function to reject all I/O requests */
> 	sdev->request_queue->queuedata = NULL;
> 
> which seems to leave a small window where the use-after-free can
> happen, and it's not clear to me why the scsi_free_queue() has to move
> at all.
> 
> Thanks,
>  - R.


* Re: [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
  2011-07-06  8:06                 ` Roland Dreier
  2011-07-06  9:25                   ` Heiko Carstens
@ 2011-07-06 14:20                   ` Alan Stern
  2011-07-06 14:24                     ` James Bottomley
  2011-07-06 16:24                     ` [BUG] 2.6.39.1 crash in scsi_dispatch_cmd() Roland Dreier
  1 sibling, 2 replies; 42+ messages in thread
From: Alan Stern @ 2011-07-06 14:20 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Heiko Carstens, James Bottomley, Jens Axboe, linux-scsi,
	Steffen Maier, Manvanthara B. Puttashankar, Tarak Reddy,
	Seshagiri N. Ippili

On Wed, 6 Jul 2011, Roland Dreier wrote:

> Alan Stern's patch looks a bit fishy -- the scsi_free_queue() is moved
> earlier than the
> 
> 	/* cause the request function to reject all I/O requests */
> 	sdev->request_queue->queuedata = NULL;
> 
> which seems to leave a small window where the use-after-free can
> happen, and it's not clear to me why the scsi_free_queue() has to move
> at all.

Looks can be deceiving.  Although the scsi_free_queue() is higher up in
the source file, it actually runs later than this code.  That's because
__scsi_remove_device() -- this code -- gets called when the device is
unregistered from the driver core, whereas
scsi_device_dev_release_usercontext() -- where the scsi_free_queue() is
moved to -- gets called when the last reference to the device is
dropped.

Now, one of the things I'm not sure about (it would be nice if James would
pick up this thread again and comment) is whether queuedata should be
set to NULL at unregistration time or later on, when the device and the
queue are about to be freed.

Alan Stern



* Re: [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
  2011-07-06 14:20                   ` Alan Stern
@ 2011-07-06 14:24                     ` James Bottomley
  2011-07-06 16:30                       ` Roland Dreier
  2011-07-06 16:24                     ` [BUG] 2.6.39.1 crash in scsi_dispatch_cmd() Roland Dreier
  1 sibling, 1 reply; 42+ messages in thread
From: James Bottomley @ 2011-07-06 14:24 UTC (permalink / raw)
  To: Alan Stern
  Cc: Roland Dreier, Heiko Carstens, Jens Axboe, linux-scsi,
	Steffen Maier, Manvanthara B. Puttashankar, Tarak Reddy,
	Seshagiri N. Ippili

On Wed, 2011-07-06 at 10:20 -0400, Alan Stern wrote:
> On Wed, 6 Jul 2011, Roland Dreier wrote:
> 
> > Alan Stern's patch looks a bit fishy -- the scsi_free_queue() is moved
> > earlier than the
> > 
> > 	/* cause the request function to reject all I/O requests */
> > 	sdev->request_queue->queuedata = NULL;
> > 
> > which seems to leave a small window where the use-after-free can
> > happen, and it's not clear to me why the scsi_free_queue() has to move
> > at all.
> 
> Looks can be deceiving.  Although the scsi_free_queue() is higher up in
> the source file, it actually runs later than this code.  That's because
> __scsi_remove_device() -- this code -- gets called when the device is
> unregistered from the driver core, whereas
> scsi_device_dev_release_usercontext() -- where the scsi_free_queue() is
> moved to -- gets called when the last reference to the device is
> dropped.
> 
> Now, one of the things I'm not sure about (it would nice if James would
> pick up this thread again and comment) is whether queuedata should be
> set to NULL at unregistration time or later on, when the device and the
> queue are about to be freed.

Sorry, higher-priority problems at the moment.  Sorry about the
->queuedata cockup; I was thinking of sdev->request_queue. Moving the
queue free is wrong ... it recently moved to fix another oops.  The problem
is most likely missing block guards on blk_execute_req() ... no check for
QUEUE_DEAD.

Will be back on Thursday.

James




* Re: [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
  2011-07-06 14:20                   ` Alan Stern
  2011-07-06 14:24                     ` James Bottomley
@ 2011-07-06 16:24                     ` Roland Dreier
  1 sibling, 0 replies; 42+ messages in thread
From: Roland Dreier @ 2011-07-06 16:24 UTC (permalink / raw)
  To: Alan Stern
  Cc: Heiko Carstens, James Bottomley, Jens Axboe, linux-scsi,
	Steffen Maier, Manvanthara B. Puttashankar, Tarak Reddy,
	Seshagiri N. Ippili

On Wed, Jul 6, 2011 at 7:20 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> Looks can be deceiving.  Although the scsi_free_queue() is higher up in
> the source file, it actually runs later than this code.  That's because
> __scsi_remove_device() -- this code -- gets called when the device is
> unregistered from the driver core, whereas
> scsi_device_dev_release_usercontext() -- where the scsi_free_queue() is
> moved to -- gets called when the last reference to the device is
> dropped.

OK, I see.  Sorry for the confusion, I just looked at the patch quickly and
assumed that the line moved up in the same function.

 - R.


* Re: [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
  2011-07-06 14:24                     ` James Bottomley
@ 2011-07-06 16:30                       ` Roland Dreier
  2011-07-06 16:53                         ` Alan Stern
  0 siblings, 1 reply; 42+ messages in thread
From: Roland Dreier @ 2011-07-06 16:30 UTC (permalink / raw)
  To: James Bottomley
  Cc: Alan Stern, Heiko Carstens, Jens Axboe, linux-scsi, Steffen Maier,
	Manvanthara B. Puttashankar, Tarak Reddy, Seshagiri N. Ippili

On Wed, Jul 6, 2011 at 7:24 AM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> Problem
> most likely missing block guards on blk_execute_req() ... no check for
> QUEUE_DEAD.

blk_execute_req() doesn't exist.  Do you mean __blk_run_queue()?

It does seem that bailing out before calling q->request_fn() would fix the
immediate problem, but I'm not sure if that's the right fix.

Jens, have you been following this?  Any opinion?

Thanks,
  Roland


* Re: [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
  2011-07-06 16:30                       ` Roland Dreier
@ 2011-07-06 16:53                         ` Alan Stern
  2011-07-06 18:07                           ` Roland Dreier
  0 siblings, 1 reply; 42+ messages in thread
From: Alan Stern @ 2011-07-06 16:53 UTC (permalink / raw)
  To: Roland Dreier
  Cc: James Bottomley, Heiko Carstens, Jens Axboe, linux-scsi,
	Steffen Maier, Manvanthara B. Puttashankar, Tarak Reddy,
	Seshagiri N. Ippili

On Wed, 6 Jul 2011, Roland Dreier wrote:

> On Wed, Jul 6, 2011 at 7:24 AM, James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> > Problem
> > most likely missing block guards on blk_execute_req() ... no check for
> > QUEUE_DEAD.
> 
> blk_execute_req() doesn't exist.  Do you mean __blk_run_queue()?

He probably meant blk_execute_rq_nowait().  The test has to be done 
before the elevator is accessed.

Alan Stern


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
  2011-07-06 16:53                         ` Alan Stern
@ 2011-07-06 18:07                           ` Roland Dreier
  2011-07-06 18:49                             ` Alan Stern
  0 siblings, 1 reply; 42+ messages in thread
From: Roland Dreier @ 2011-07-06 18:07 UTC (permalink / raw)
  To: Alan Stern
  Cc: James Bottomley, Heiko Carstens, Jens Axboe, linux-scsi,
	Steffen Maier, Manvanthara B. Puttashankar, Tarak Reddy,
	Seshagiri N. Ippili

On Wed, Jul 6, 2011 at 9:53 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> He probably meant blk_execute_rq_nowait().  The test has to be done
> before the elevator is accessed.

Hmm, seems we would need the test in multiple places, since my second call
trace is io_schedule -> blk_flush_plug_list -> queue_unplugged ->
__blk_run_queue

So I don't think I hit blk_execute_rq_nowait in my crash.

But maybe the problem is that dm-multipath is trying to requeue the IO to an
underlying sdX device that is already dead?

 - R.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
  2011-07-06 18:07                           ` Roland Dreier
@ 2011-07-06 18:49                             ` Alan Stern
  2011-07-07 20:45                               ` James Bottomley
  0 siblings, 1 reply; 42+ messages in thread
From: Alan Stern @ 2011-07-06 18:49 UTC (permalink / raw)
  To: Roland Dreier
  Cc: James Bottomley, Heiko Carstens, Jens Axboe, linux-scsi,
	Steffen Maier, Manvanthara B. Puttashankar, Tarak Reddy,
	Seshagiri N. Ippili


On Wed, 6 Jul 2011, Roland Dreier wrote:

> On Wed, Jul 6, 2011 at 9:53 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> > He probably meant blk_execute_rq_nowait().  The test has to be done
> > before the elevator is accessed.
> 
> Hmm, seems we would need the test in multiple places, since my second call
> trace is io_schedule -> blk_flush_plug_list -> queue_unplugged ->
> __blk_run_queue
> 
> So I don't think I hit blk_execute_rq_nowait in my crash.
> 
> But maybe the problem is that dm-multipath is trying to requeue the IO to an
> underlying sdX device that is already dead?

I'm not at all familiar with the block layer.  It seems that the check
for a dead queue would have to be made on every path that ends up
calling the elevator, which would be a difficult sort of thing to
enforce.

I'm not too sure about James's comment:

> Moving the
> queue free is wrong ... it recently moved to fix another oops.

Apparently this refers to commit
e73e079bf128d68284efedeba1fbbc18d78610f9 ([SCSI] Fix oops caused by
queue refcounting failure).  In fact that commit does _not_ move the
call to scsi_free_queue().  Instead it merely takes another reference
to the queue, so that scsi_free_queue() doesn't actually deallocate the
queue.  But it does still deallocate the elevator.

Perhaps this means the elevator shouldn't be freed until the queue is.  
I just don't know.  Jens and James are the experts, but Jens hasn't
said anything and James is currently busy.

Alan Stern


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
  2011-07-06 18:49                             ` Alan Stern
@ 2011-07-07 20:45                               ` James Bottomley
  2011-07-07 21:07                                 ` Alan Stern
  0 siblings, 1 reply; 42+ messages in thread
From: James Bottomley @ 2011-07-07 20:45 UTC (permalink / raw)
  To: Alan Stern
  Cc: Roland Dreier, Heiko Carstens, Jens Axboe, linux-scsi,
	Steffen Maier, Manvanthara B. Puttashankar, Tarak Reddy,
	Seshagiri N. Ippili

On Wed, 2011-07-06 at 14:49 -0400, Alan Stern wrote:
> On Wed, 6 Jul 2011, Roland Dreier wrote:
> 
> > On Wed, Jul 6, 2011 at 9:53 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> > > He probably meant blk_execute_rq_nowait(). The test has to be done
> > > before the elevator is accessed.
> > 
> > Hmm, seems we would need the test in multiple places, since my second call
> > trace is io_schedule -> blk_flush_plug_list -> queue_unplugged ->
> > __blk_run_queue
> > 
> > So I don't think I hit blk_execute_rq_nowait in my crash.
> > 
> > But maybe the problem is that dm-multipath is trying to requeue the IO to an
> > underlying sdX device that is already dead?
> 
> I'm not at all familiar with the block layer.  It seems that the check
> for a dead queue would have to be made on every path that ends up
> calling the elevator, which would be a difficult sort of thing to
> enforce.
> 
> I'm not too sure about James's comment:
> 
> > Moving the
> > queue free is wrong ... it recently moved to fix another oops.
> 
> Apparently this refers to commit
> e73e079bf128d68284efedeba1fbbc18d78610f9 ([SCSI] Fix oops caused by
> queue refcounting failure).  In fact that commit does _not_ move the
> call to scsi_free_queue().  Instead it merely takes another reference
> to the queue, so that scsi_free_queue() doesn't actually deallocate the
> queue.  But it does still deallocate the elevator.

Not that one, 86cbfb5607d4b81b1a993ff689bbd2addd5d3a9b

> Perhaps this means the elevator shouldn't be freed until the queue is.  
> I just don't know.  Jens and James are the experts, but Jens hasn't
> said anything and James is currently busy.

OK, back from being busy now (for a short while).

This is what I think should fix it all (at least it fixes it for me now
I've finally figured out how to reproduce it).

The nasty about this is that blk_get_request() has to return NULL even
if __GFP_WAIT is specified, if the queue is already dead.  Lots of block
users don't check for NULL if they specify __GFP_WAIT.

James

---

diff --git a/block/blk-core.c b/block/blk-core.c
index d2f8f40..1d49e1c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -839,6 +839,9 @@ struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
 {
 	struct request *rq;
 
+	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
+		return NULL;
+
 	BUG_ON(rw != READ && rw != WRITE);
 
 	spin_lock_irq(q->queue_lock);
diff --git a/block/blk-exec.c b/block/blk-exec.c
index 8a0e7ec..a1ebceb 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -50,6 +50,13 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk,
 {
 	int where = at_head ? ELEVATOR_INSERT_FRONT : ELEVATOR_INSERT_BACK;
 
+	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
+		rq->errors = -ENXIO;
+		if (rq->end_io)
+			rq->end_io(rq, rq->errors);
+		return;
+	}
+
 	rq->rq_disk = bd_disk;
 	rq->end_io = done;
 	WARN_ON(irqs_disabled());
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index ec1803a..28d9c9d 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -213,6 +213,8 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 	int ret = DRIVER_ERROR << 24;
 
 	req = blk_get_request(sdev->request_queue, write, __GFP_WAIT);
+	if (!req)
+		return ret;
 
 	if (bufflen &&	blk_rq_map_kern(sdev->request_queue, req,
 					buffer, bufflen, __GFP_WAIT))



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
  2011-07-07 20:45                               ` James Bottomley
@ 2011-07-07 21:07                                 ` Alan Stern
  2011-07-08 17:04                                   ` James Bottomley
  0 siblings, 1 reply; 42+ messages in thread
From: Alan Stern @ 2011-07-07 21:07 UTC (permalink / raw)
  To: James Bottomley
  Cc: Roland Dreier, Heiko Carstens, Jens Axboe, linux-scsi,
	Steffen Maier, Manvanthara B. Puttashankar, Tarak Reddy,
	Seshagiri N. Ippili

On Thu, 7 Jul 2011, James Bottomley wrote:

> > Perhaps this means the elevator shouldn't be freed until the queue is.  
> > I just don't know.  Jens and James are the experts, but Jens hasn't
> > said anything and James is currently busy.
> 
> OK, back from being busy now (for a short while).
> 
> This is what I think should fix it all (at least it fixes it for me now
> I've finally figured out how to reproduce it).
> 
> The nasty about this is that blk_get_request() has to return NULL even
> if __GFP_WAIT is specified, if the queue is already dead.  Lots of block
> users don't check for NULL if they specify __GFP_WAIT.

There are a few things here you might be able to explain.

First, after blk_cleanup_queue() is called, whose responsibility is it
to make sure that no more requests can be added to the queue: the block
layer or the client (in this case the SCSI layer)?  Your patch implies
that the block layer is responsible; is this documented anywhere?  Or
if it isn't, does Jens agree?

Second, there isn't any synchronization between __scsi_remove_device()
and scsi_prep_fn().  Isn't it still possible that a request will get
added to the queue and processed after queuedata has been set to NULL,
leading to another invalid memory access?  Isn't a NULL check still 
needed in scsi_prep_fn()?
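The kind of guard being asked about can be sketched in userspace C.  All
names here are illustrative, not the kernel's: the real function is
scsi_prep_fn() and the pointer in question is q->queuedata (the
scsi_device), which __scsi_remove_device() clears.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a prep function that refuses to touch a device
 * pointer that teardown has already cleared. */
enum prep_ret { PREP_OK, PREP_KILL };

struct toy_queue {
	void *queuedata;                   /* would be struct scsi_device * */
};

static enum prep_ret toy_prep_fn(struct toy_queue *q)
{
	if (!q->queuedata)                 /* device already torn down */
		return PREP_KILL;          /* fail the request, don't dereference */
	/* ... normal command preparation would go here ... */
	return PREP_OK;
}
```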

Third, do you recall the reason (if any) for freeing the queue in 
__scsi_remove_device() rather than 
scsi_device_dev_release_usercontext()?  At that point, you can be quite 
sure no more requests will be added to the queue.

Alan Stern


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
  2011-07-07 21:07                                 ` Alan Stern
@ 2011-07-08 17:04                                   ` James Bottomley
  2011-07-08 19:43                                     ` Alan Stern
  2011-07-08 20:47                                     ` Roland Dreier
  0 siblings, 2 replies; 42+ messages in thread
From: James Bottomley @ 2011-07-08 17:04 UTC (permalink / raw)
  To: Alan Stern
  Cc: Roland Dreier, Heiko Carstens, Jens Axboe, linux-scsi,
	Steffen Maier, Manvanthara B. Puttashankar, Tarak Reddy,
	Seshagiri N. Ippili

On Thu, 2011-07-07 at 17:07 -0400, Alan Stern wrote:
> On Thu, 7 Jul 2011, James Bottomley wrote:
> 
> > > Perhaps this means the elevator shouldn't be freed until the queue is.  
> > > I just don't know.  Jens and James are the experts, but Jens hasn't
> > > said anything and James is currently busy.
> > 
> > OK, back from being busy now (for a short while).
> > 
> > This is what I think should fix it all (at least it fixes it for me now
> > I've finally figured out how to reproduce it).
> > 
> > The nasty about this is that blk_get_request() has to return NULL even
> > if __GFP_WAIT is specified, if the queue is already dead.  Lots of block
> > users don't check for NULL if they specify __GFP_WAIT.
> 
> There are a few things here you might be able to explain.
> 
> First, after blk_cleanup_queue() is called, whose responsibility is it
> to make sure that no more requests can be added to the queue: the block
> layer or the client (in this case the SCSI layer)?  Your patch implies
> that the block layer is responsible; is this documented anywhere?  Or
> if it isn't, does Jens agree?

So when I asked Jens, he agreed block.  The problem we've been running
into is that the teardown state (QUEUE_FLAG_DEAD) isn't really checked
where it should be, so the early teardown that SCSI now does runs into
a lot of these missing checks.

> Second, there isn't any synchronization between __scsi_remove_device()
> and scsi_prep_fn().  Isn't it still possible that a request will get
> added to the queue and processed after queuedata has been set to NULL,
> leading to another invalid memory access?  Isn't a NULL check still 
> needed in scsi_prep_fn()?

Only in the REQ_TYPE_BLOCK_PC case.  The theory is that before we tear
down the queue, we've waited for the ULD detach to occur, so none of the
ULD prep functions need to check.

However, that does beg the question of why sr is using the device after
sr_remove() has completed.  That seems to be because of the block ops
being the dominant structure, so we try to do cleanup when we get the
block release rather than the driver release ... that looks to be the
root cause to me.

> Third, do you recall the reason (if any) for freeing the queue in 
> __scsi_remove_device() rather than 
> scsi_device_dev_release_usercontext()?  At that point, you can be quite 
> sure no more requests will be added to the queue.

Yes, it was to get early teardown of the queue.  I've forgotten the
specific bug, but it's somewhere in bugzilla.  There was an oops because
the queue was still running.

James



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
  2011-07-08 17:04                                   ` James Bottomley
@ 2011-07-08 19:43                                     ` Alan Stern
  2011-07-08 20:41                                       ` James Bottomley
  2011-07-08 20:47                                     ` Roland Dreier
  1 sibling, 1 reply; 42+ messages in thread
From: Alan Stern @ 2011-07-08 19:43 UTC (permalink / raw)
  To: James Bottomley
  Cc: Roland Dreier, Heiko Carstens, Jens Axboe, linux-scsi,
	Steffen Maier, Manvanthara B. Puttashankar, Tarak Reddy,
	Seshagiri N. Ippili

On Fri, 8 Jul 2011, James Bottomley wrote:

> However, that does beg the question of why sr is using the device after
> sr_remove() has completed.  That seems to be because of the block ops
> being the dominant structure, so we try to do cleanup when we get the
> block release rather than the driver release ... that looks to be the
> root cause to me.

Hmmm.  What happens if I use sysfs to unbind the scsi_device from sr
while it is still mounted, and then quickly rebind it again?  Until the
filesystem is unmounted and the block release is complete, would both
instances end up sending commands to the device concurrently?

Alan Stern


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
  2011-07-08 19:43                                     ` Alan Stern
@ 2011-07-08 20:41                                       ` James Bottomley
  2011-07-08 22:08                                         ` Alan Stern
  0 siblings, 1 reply; 42+ messages in thread
From: James Bottomley @ 2011-07-08 20:41 UTC (permalink / raw)
  To: Alan Stern
  Cc: Roland Dreier, Heiko Carstens, Jens Axboe, linux-scsi,
	Steffen Maier, Manvanthara B. Puttashankar, Tarak Reddy,
	Seshagiri N. Ippili

On Fri, 2011-07-08 at 15:43 -0400, Alan Stern wrote:
> On Fri, 8 Jul 2011, James Bottomley wrote:
> 
> > However, that does beg the question of why sr is using the device after
> > sr_remove() has completed.  That seems to be because of the block ops
> > being the dominant structure, so we try to do cleanup when we get the
> > block release rather than the driver release ... that looks to be the
> > root cause to me.
> 
> Hmmm.  What happens if I use sysfs to unbind the scsi_device from sr
> while it is still mounted, and then quickly rebind it again?  Until the
> filesystem is unmounted and the block release is complete, would both
> instances end up sending commands to the device concurrently?

The device will go into CANCEL or DEL and resurrection is impossible ...
it should give an error when you attempt the rebind.

James



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
  2011-07-08 17:04                                   ` James Bottomley
  2011-07-08 19:43                                     ` Alan Stern
@ 2011-07-08 20:47                                     ` Roland Dreier
  2011-07-08 23:04                                       ` [PATCH] block: Check that queue is alive in blk_insert_cloned_request() Roland Dreier
  1 sibling, 1 reply; 42+ messages in thread
From: Roland Dreier @ 2011-07-08 20:47 UTC (permalink / raw)
  To: James Bottomley
  Cc: Alan Stern, Heiko Carstens, Jens Axboe, linux-scsi, Steffen Maier,
	Manvanthara B. Puttashankar, Tarak Reddy, Seshagiri N. Ippili

On Fri, Jul 8, 2011 at 10:04 AM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> So when I asked Jens, he agreed block.  The problem we've been running
> into is that the teardown state (QUEUE_FLAG_DEAD) isn't really checked
> where it should be, so the early teardown that SCSI now does runs into
> a lot of these missing checks.

So I guess my dm-multipath case also needs to sprinkle a bunch of checks
of block queue state into dm.c too... I'll look at doing that.

 - R.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
  2011-07-08 20:41                                       ` James Bottomley
@ 2011-07-08 22:08                                         ` Alan Stern
  2011-07-08 22:25                                           ` James Bottomley
  0 siblings, 1 reply; 42+ messages in thread
From: Alan Stern @ 2011-07-08 22:08 UTC (permalink / raw)
  To: James Bottomley
  Cc: Roland Dreier, Heiko Carstens, Jens Axboe, linux-scsi,
	Steffen Maier, Manvanthara B. Puttashankar, Tarak Reddy,
	Seshagiri N. Ippili

On Fri, 8 Jul 2011, James Bottomley wrote:

> On Fri, 2011-07-08 at 15:43 -0400, Alan Stern wrote:
> > On Fri, 8 Jul 2011, James Bottomley wrote:
> > 
> > > However, that does beg the question of why sr is using the device after
> > > sr_remove() has completed.  That seems to be because of the block ops
> > > being the dominant structure, so we try to do cleanup when we get the
> > > block release rather than the driver release ... that looks to be the
> > > root cause to me.
> > 
> > Hmmm.  What happens if I use sysfs to unbind the scsi_device from sr
> > while it is still mounted, and then quickly rebind it again?  Until the
> > filesystem is unmounted and the block release is complete, would both
> > instances end up sending commands to the device concurrently?
> 
> The device will go into CANCEL or DEL and resurrection is impossible ...
> it should give an error when you attempt the rebind.

No, you're thinking of device removal.  I'm talking about unbinding;  
the device remains intact, only the driver is affected.  The device
doesn't go into CANCEL or DEL, and the driver can indeed be rebound.  
In fact I just tried it (with a USB flash drive, not a cdrom) and the
rebind worked perfectly.

Alan Stern


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUG] 2.6.39.1 crash in scsi_dispatch_cmd()
  2011-07-08 22:08                                         ` Alan Stern
@ 2011-07-08 22:25                                           ` James Bottomley
  0 siblings, 0 replies; 42+ messages in thread
From: James Bottomley @ 2011-07-08 22:25 UTC (permalink / raw)
  To: Alan Stern
  Cc: Roland Dreier, Heiko Carstens, Jens Axboe, linux-scsi,
	Steffen Maier, Manvanthara B. Puttashankar, Tarak Reddy,
	Seshagiri N. Ippili

On Fri, 2011-07-08 at 18:08 -0400, Alan Stern wrote:
> On Fri, 8 Jul 2011, James Bottomley wrote:
> 
> > On Fri, 2011-07-08 at 15:43 -0400, Alan Stern wrote:
> > > On Fri, 8 Jul 2011, James Bottomley wrote:
> > > 
> > > > However, that does beg the question of why sr is using the device after
> > > > sr_remove() has completed.  That seems to be because of the block ops
> > > > being the dominant structure, so we try to do cleanup when we get the
> > > > block release rather than the driver release ... that looks to be the
> > > > root cause to me.
> > > 
> > > Hmmm.  What happens if I use sysfs to unbind the scsi_device from sr
> > > while it is still mounted, and then quickly rebind it again?  Until the
> > > filesystem is unmounted and the block release is complete, would both
> > > instances end up sending commands to the device concurrently?
> > 
> > The device will go into CANCEL or DEL and resurrection is impossible ...
> > it should give an error when you attempt the rebind.
> 
> No, you're thinking of device removal.  I'm talking about unbinding;  
> the device remains intact, only the driver is affected.  The device
> doesn't go into CANCEL or DEL, and the driver can indeed be rebound.  
> In fact I just tried it (with a USB flash drive, not a cdrom) and the
> rebind worked perfectly.

For the sr case, it's broken as I pointed out before.  For sd, you'll
see the device shut down.

James



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH] block: Check that queue is alive in blk_insert_cloned_request()
  2011-07-08 20:47                                     ` Roland Dreier
@ 2011-07-08 23:04                                       ` Roland Dreier
  2011-07-09  9:05                                         ` Stefan Richter
                                                           ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Roland Dreier @ 2011-07-08 23:04 UTC (permalink / raw)
  To: Jens Axboe
  Cc: James Bottomley, Alan Stern, Heiko Carstens, linux-scsi,
	Steffen Maier, Manvanthara B. Puttashankar, Tarak Reddy,
	Seshagiri N. Ippili, linux-kernel

From: Roland Dreier <roland@purestorage.com>

This fixes crashes such as the below that I see when the storage
underlying a dm-multipath device is hot-removed.  The problem is that
dm requeues a request to a device whose block queue has already been
cleaned up, and blk_insert_cloned_request() doesn't check if the queue
is alive, but rather goes ahead and tries to queue the request.  This
ends up dereferencing the elevator that was already freed in
blk_cleanup_queue().

    general protection fault: 0000 [#1] SMP 
    CPU 7 
    Modules linked in: kvm_intel kvm serio_raw i7core_edac edac_core ioatdma dca pci_stub ses enclosure usbhid usb_storage qla2xxx mpt2sas ahci hid uas libahci e1000e scsi_transport_fc scsi_transport_sas mlx4_core raid_class scsi_tgt
    RIP: 0010:[<ffffffff81233867>]  [<ffffffff81233867>] elv_drain_elevator+0x27/0x80
    RSP: 0018:ffff880614e9fa48  EFLAGS: 00010096
    RAX: 6b6b6b6b6b6b6b6b RBX: ffff880610bb0000 RCX: 0000000000000000
    RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff880610bb0000
    RBP: ffff880614e9fa58 R08: 0000000000000000 R09: 0000000000000001
    R10: ffff880c0a7dca70 R11: ffff880615622440 R12: ffff880610bb0000
    R13: 0000000000000002 R14: 0000000000000002 R15: ffff880c0db01160
    FS:  00007fe46457a760(0000) GS:ffff880c3fc20000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fe463c86330 CR3: 0000000c0cc0c000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process blkid (pid: 12458, threadinfo ffff880614e9e000, task ffff8806190eae20)
    Stack:
     ffff880c0d9f03c8 ffff880c0d9f03c8 ffff880614e9fa88 ffffff
     ffff880c0d9f03c8 ffff880610bb0000 0000000000000002 ffffc9
     ffff880614e9fab8 ffffffff812366fd ffff880614e9fad8 ffff88
    Call Trace:
     [<ffffffff812339a8>] __elv_add_request+0xe8/0x280
     [<ffffffff812366fd>] add_acct_request+0x3d/0x50
     [<ffffffff81236775>] blk_insert_cloned_request+0x65/0x90
     [<ffffffff813b8d4e>] dm_dispatch_request+0x3e/0x70
     [<ffffffff813ba850>] dm_request_fn+0x160/0x250
     [<ffffffff81236e88>] queue_unplugged+0x48/0xd0
     [<ffffffff8123b03d>] blk_flush_plug_list+0x1ed/0x250
     [<ffffffff810e98f0>] ? sleep_on_page+0x20/0x20
     [<ffffffff814db895>] io_schedule+0x75/0xd0
     [<ffffffff810e98fe>] sleep_on_page_killable+0xe/0x40
     [<ffffffff814dbf9a>] __wait_on_bit_lock+0x5a/0xc0
     [<ffffffff810e9857>] __lock_page_killable+0x67/0x70
     [<ffffffff8106d650>] ? autoremove_wake_function+0x40/0x40
     [<ffffffff810eb5c5>] generic_file_aio_read+0x405/0x720
     [<ffffffff81146962>] do_sync_read+0xd2/0x110
     [<ffffffff8121e393>] ? security_file_permission+0x93/0xb0
     [<ffffffff81146c81>] ? rw_verify_area+0x61/0xf0
     [<ffffffff81147143>] vfs_read+0xc3/0x180
     [<ffffffff81147251>] sys_read+0x51/0x90
     [<ffffffff814e5942>] system_call_fastpath+0x16/0x1b
    Code: c3 0f 1f 00 55 48 89 e5 53 48 83 ec 08 66 66 66 66 90 48 89 fb 0f 1f 80 00 00 00 00 48 8b 43 18 be 01 00 00 00 48 89 df 48 8b 00 <ff> 50 28 85 c0 75 ea 8b 93 f0 03 00 00 85 d2 74 14 8b 05 fa e1 
    RIP  [<ffffffff81233867>] elv_drain_elevator+0x27/0x80
     RSP <ffff880614e9fa48>
    ---[ end trace 5dbd03e7295023d7 ]---

Signed-off-by: Roland Dreier <roland@purestorage.com>
---
 block/blk-core.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d2f8f40..a13a9cc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1706,6 +1706,9 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
 		return -EIO;
 #endif
 
+	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
+		return -EIO;
+
 	spin_lock_irqsave(q->queue_lock, flags);
 
 	/*

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH] block: Check that queue is alive in blk_insert_cloned_request()
  2011-07-08 23:04                                       ` [PATCH] block: Check that queue is alive in blk_insert_cloned_request() Roland Dreier
@ 2011-07-09  9:05                                         ` Stefan Richter
  2011-07-11 22:40                                         ` Mike Snitzer
  2011-07-12  2:09                                         ` Vivek Goyal
  2 siblings, 0 replies; 42+ messages in thread
From: Stefan Richter @ 2011-07-09  9:05 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Jens Axboe, James Bottomley, Alan Stern, Heiko Carstens,
	linux-scsi, Steffen Maier, Manvanthara B. Puttashankar,
	Tarak Reddy, Seshagiri N. Ippili, linux-kernel

On Jul 08 Roland Dreier wrote:
> This fixes crashes such as the below that I see when the storage
> underlying a dm-multipath device is hot-removed.  The problem is that
> dm requeues a request to a device whose block queue has already been
> cleaned up, and blk_insert_cloned_request() doesn't check if the queue
> is alive, but rather goes ahead and tries to queue the request.  This
> ends up dereferencing the elevator that was already freed in
> blk_cleanup_queue().
[...]
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1706,6 +1706,9 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
>  		return -EIO;
>  #endif
>  
> +	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
> +		return -EIO;
> +
>  	spin_lock_irqsave(q->queue_lock, flags);
>  
>  	/*

Not knowing the calling contexts or having tried to learn about them, I am
wondering:  What prevents the elevator to be freed right after the flag
was checked?
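The race Stefan describes closes only if the flag test and the elevator
access are atomic with respect to teardown, i.e. both happen under the same
lock that teardown holds.  A userspace model of that discipline (toy names
throughout; a pthread mutex stands in for q->queue_lock):

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Illustrative model: teardown marks the queue dead and frees the
 * elevator while holding the lock, so a caller that tests the flag
 * under the same lock can never see a stale elevator pointer. */
struct toy_queue {
	pthread_mutex_t lock;              /* stands in for q->queue_lock */
	bool dead;
	int *elevator;                     /* freed at teardown */
};

static void toy_cleanup_queue(struct toy_queue *q)
{
	pthread_mutex_lock(&q->lock);
	q->dead = true;
	free(q->elevator);
	q->elevator = NULL;
	pthread_mutex_unlock(&q->lock);
}

/* Safe variant: flag checked under the lock teardown takes. */
static bool toy_insert_request(struct toy_queue *q)
{
	bool ok = false;

	pthread_mutex_lock(&q->lock);
	if (!q->dead) {
		*q->elevator += 1;         /* elevator guaranteed live here */
		ok = true;
	}
	pthread_mutex_unlock(&q->lock);
	return ok;
}
```

A lockless test_bit() before taking the lock can still be useful as a fast
path, but on its own it leaves exactly the window Stefan is asking about.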
-- 
Stefan Richter
-=====-==-== -=== -=--=
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] block: Check that queue is alive in blk_insert_cloned_request()
  2011-07-08 23:04                                       ` [PATCH] block: Check that queue is alive in blk_insert_cloned_request() Roland Dreier
  2011-07-09  9:05                                         ` Stefan Richter
@ 2011-07-11 22:40                                         ` Mike Snitzer
  2011-07-12  0:52                                           ` Alan Stern
                                                             ` (2 more replies)
  2011-07-12  2:09                                         ` Vivek Goyal
  2 siblings, 3 replies; 42+ messages in thread
From: Mike Snitzer @ 2011-07-11 22:40 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Jens Axboe, James Bottomley, Alan Stern, Heiko Carstens,
	linux-scsi, Steffen Maier, Manvanthara B. Puttashankar,
	Tarak Reddy, Seshagiri N. Ippili, linux-kernel,
	device-mapper development, Vivek Goyal, Tejun Heo

[cc'ing dm-devel, vivek and tejun]

On Fri, Jul 8, 2011 at 7:04 PM, Roland Dreier <roland@kernel.org> wrote:
> From: Roland Dreier <roland@purestorage.com>
>
> This fixes crashes such as the below that I see when the storage
> underlying a dm-multipath device is hot-removed.  The problem is that
> dm requeues a request to a device whose block queue has already been
> cleaned up, and blk_insert_cloned_request() doesn't check if the queue
> is alive, but rather goes ahead and tries to queue the request.  This
> ends up dereferencing the elevator that was already freed in
> blk_cleanup_queue().

Your patch looks fine to me:
Acked-by: Mike Snitzer <snitzer@redhat.com>

And I looked at various code paths to arrive at the references DM takes.

A reference is taken on the underlying devices' block_device via
drivers/md/dm-table.c:open_dev() with blkdev_get_by_dev().  open_dev()
also does bd_link_disk_holder(), resulting in the mpath device
becoming a holder of the underlying devices. e.g.:
/sys/block/sda/holders/dm-4

But at no point does DM-mpath get a reference to the underlying
devices' request_queue that gets assigned to clone->q (in
drivers/md/dm-mpath.c:map_io).

Seems we should, though AFAIK it won't help with the issue you've
pointed out (because the hotplugged device's driver already called
blk_cleanup_queue and nuked the elevator).

So I'm not sure getting the request_queue reference actually fixes
anything (maybe my imagination is lacking?).  But getting the
request_queue reference is "the right thing" if we're going to be
setting pointers to it.
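The get/put discipline being described can be modeled in userspace C.  This
is a sketch patterned loosely after blk_get_queue()/blk_put_queue(); the
names and fields are illustrative.  Note it also shows Mike's caveat: a held
reference pins the memory, but does not keep a dead queue usable, so the
DEAD check is still needed.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy refcounted queue: memory is released only on the last put. */
struct toy_queue {
	int refcount;
	bool dead;
	bool freed;
};

static bool toy_get_queue(struct toy_queue *q)
{
	if (q->dead)
		return false;              /* like blk_get_queue() on a dying queue */
	q->refcount++;
	return true;
}

static void toy_put_queue(struct toy_queue *q)
{
	if (--q->refcount == 0)
		q->freed = true;           /* memory released on last put */
}
```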

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] block: Check that queue is alive in blk_insert_cloned_request()
  2011-07-11 22:40                                         ` Mike Snitzer
@ 2011-07-12  0:52                                           ` Alan Stern
  2011-07-12  1:22                                             ` Mike Snitzer
  2011-07-12 14:58                                           ` [PATCH] dm mpath: manage reference on request queue of underlying devices Mike Snitzer
  2011-07-12 17:06                                           ` [PATCH] block: Check that queue is alive in blk_insert_cloned_request() Vivek Goyal
  2 siblings, 1 reply; 42+ messages in thread
From: Alan Stern @ 2011-07-12  0:52 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Roland Dreier, Jens Axboe, James Bottomley, Heiko Carstens,
	linux-scsi, Steffen Maier, Manvanthara B. Puttashankar,
	Tarak Reddy, Seshagiri N. Ippili, linux-kernel,
	device-mapper development, Vivek Goyal, Tejun Heo


On Mon, 11 Jul 2011, Mike Snitzer wrote:

> [cc'ing dm-devel, vivek and tejun]
> 
> On Fri, Jul 8, 2011 at 7:04 PM, Roland Dreier <roland@kernel.org> wrote:
> > From: Roland Dreier <roland@purestorage.com>
> >
> > This fixes crashes such as the below that I see when the storage
> > underlying a dm-multipath device is hot-removed.  The problem is that
> > dm requeues a request to a device whose block queue has already been
> > cleaned up, and blk_insert_cloned_request() doesn't check if the queue
> > is alive, but rather goes ahead and tries to queue the request.  This
> > ends up dereferencing the elevator that was already freed in
> > blk_cleanup_queue().
> 
> Your patch looks fine to me:
> Acked-by: Mike Snitzer <snitzer@redhat.com>

There's still the issue that Stefan Richter pointed out: The test for a 
dead queue must be made _after_ acquiring the queue lock, not _before_.

Alan Stern

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: block: Check that queue is alive in blk_insert_cloned_request()
  2011-07-12  0:52                                           ` Alan Stern
@ 2011-07-12  1:22                                             ` Mike Snitzer
  2011-07-12  1:46                                               ` Vivek Goyal
  0 siblings, 1 reply; 42+ messages in thread
From: Mike Snitzer @ 2011-07-12  1:22 UTC (permalink / raw)
  To: Alan Stern
  Cc: Roland Dreier, Jens Axboe, James Bottomley, Heiko Carstens,
	linux-scsi, Steffen Maier, Manvanthara B. Puttashankar,
	Tarak Reddy, Seshagiri N. Ippili, linux-kernel,
	device-mapper development, Vivek Goyal, Tejun Heo, jaxboe

On Mon, Jul 11 2011 at  8:52pm -0400,
Alan Stern <stern@rowland.harvard.edu> wrote:

> On Mon, 11 Jul 2011, Mike Snitzer wrote:
> 
> > [cc'ing dm-devel, vivek and tejun]
> > 
> > On Fri, Jul 8, 2011 at 7:04 PM, Roland Dreier <roland@kernel.org> wrote:
> > > From: Roland Dreier <roland@purestorage.com>
> > >
> > > This fixes crashes such as the below that I see when the storage
> > underlying a dm-multipath device is hot-removed.  The problem is that
> > dm requeues a request to a device whose block queue has already been
> > cleaned up, and blk_insert_cloned_request() doesn't check if the queue
> > is alive, but rather goes ahead and tries to queue the request.  This
> > > ends up dereferencing the elevator that was already freed in
> > > blk_cleanup_queue().
> > 
> > Your patch looks fine to me:
> > Acked-by: Mike Snitzer <snitzer@redhat.com>
> 
> There's still the issue that Stefan Richter pointed out: The test for a 
> dead queue must be made _after_ acquiring the queue lock, not _before_.

Yes, quite important.

Jens, can you tweak the patch or should Roland send a v2?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: block: Check that queue is alive in blk_insert_cloned_request()
  2011-07-12  1:22                                             ` Mike Snitzer
@ 2011-07-12  1:46                                               ` Vivek Goyal
  2011-07-12 15:24                                                 ` Alan Stern
  0 siblings, 1 reply; 42+ messages in thread
From: Vivek Goyal @ 2011-07-12  1:46 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Alan Stern, Roland Dreier, Jens Axboe, James Bottomley,
	Heiko Carstens, linux-scsi, Steffen Maier,
	Manvanthara B. Puttashankar, Tarak Reddy, Seshagiri N. Ippili,
	linux-kernel, device-mapper development, Tejun Heo, jaxboe

On Mon, Jul 11, 2011 at 09:22:06PM -0400, Mike Snitzer wrote:
> On Mon, Jul 11 2011 at  8:52pm -0400,
> Alan Stern <stern@rowland.harvard.edu> wrote:
> 
> > On Mon, 11 Jul 2011, Mike Snitzer wrote:
> > 
> > > [cc'ing dm-devel, vivek and tejun]
> > > 
> > > On Fri, Jul 8, 2011 at 7:04 PM, Roland Dreier <roland@kernel.org> wrote:
> > > > From: Roland Dreier <roland@purestorage.com>
> > > >
> > > > This fixes crashes such as the below that I see when the storage
> > > underlying a dm-multipath device is hot-removed.  The problem is that
> > > dm requeues a request to a device whose block queue has already been
> > > cleaned up, and blk_insert_cloned_request() doesn't check if the queue
> > > is alive, but rather goes ahead and tries to queue the request.  This
> > > > ends up dereferencing the elevator that was already freed in
> > > > blk_cleanup_queue().
> > > 
> > > Your patch looks fine to me:
> > > Acked-by: Mike Snitzer <snitzer@redhat.com>
> > 
> > There's still the issue that Stefan Richter pointed out: The test for a 
> > dead queue must be made _after_ acquiring the queue lock, not _before_.
> 
> Yes, quite important.
> 
> Jens, can you tweak the patch or should Roland send a v2?

I do not think that we should do the queue dead check after taking a spinlock.
The reason is that there are lifetime issues with two objects:

- Validity of the request queue pointer
- Validity of the q->spin_lock pointer

If dm has taken a reference to the request queue in the beginning
then it can be sure the request queue pointer is valid. But the spin_lock might
be coming from the driver and might be in one of the driver-allocated structures.
So it might happen that the driver has called blk_cleanup_queue() and freed
up the structures which contained the spin lock.

So if the queue is not dead, we know that q->spin_lock is valid. I think
the only race present here is that the whole operation is not atomic: first
we check the queue-not-dead flag and then go on to acquire the request
queue lock, which leaves a small window for a race. I think I have
seen other code written in this manner (__generic_make_request()), so
it is probably reasonably safe to do here too.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] block: Check that queue is alive in blk_insert_cloned_request()
  2011-07-08 23:04                                       ` [PATCH] block: Check that queue is alive in blk_insert_cloned_request() Roland Dreier
  2011-07-09  9:05                                         ` Stefan Richter
  2011-07-11 22:40                                         ` Mike Snitzer
@ 2011-07-12  2:09                                         ` Vivek Goyal
  2 siblings, 0 replies; 42+ messages in thread
From: Vivek Goyal @ 2011-07-12  2:09 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Jens Axboe, James Bottomley, Alan Stern, Heiko Carstens,
	linux-scsi, Steffen Maier, Manvanthara B. Puttashankar,
	Tarak Reddy, Seshagiri N. Ippili, linux-kernel

On Fri, Jul 08, 2011 at 04:04:30PM -0700, Roland Dreier wrote:
> From: Roland Dreier <roland@purestorage.com>
> 
> This fixes crashes such as the below that I see when the storage
> underlying a dm-multipath device is hot-removed.  The problem is that
> dm requeues a request to a device whose block queue has already been
> cleaned up, and blk_insert_cloned_request() doesn't check if the queue
> is alive, but rather goes ahead and tries to queue the request.  This
> ends up dereferencing the elevator that was already freed in
> blk_cleanup_queue().
> 
>     general protection fault: 0000 [#1] SMP 
>     CPU 7 
>     Modules linked in: kvm_intel kvm serio_raw i7core_edac edac_core ioatdma dca pci_stub ses enclosure usbhid usb_storage qla2xxx mpt2sas ahci hid uas libahci e1000e scsi_transport_fc scsi_transport_sas mlx4_core raid_class scsi_tgt
>     RIP: 0010:[<ffffffff81233867>]  [<ffffffff81233867>] elv_drain_elevator+0x27/0x80
>     RSP: 0018:ffff880614e9fa48  EFLAGS: 00010096
>     RAX: 6b6b6b6b6b6b6b6b RBX: ffff880610bb0000 RCX: 0000000000000000
>     RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff880610bb0000
>     RBP: ffff880614e9fa58 R08: 0000000000000000 R09: 0000000000000001
>     R10: ffff880c0a7dca70 R11: ffff880615622440 R12: ffff880610bb0000
>     R13: 0000000000000002 R14: 0000000000000002 R15: ffff880c0db01160
>     FS:  00007fe46457a760(0000) GS:ffff880c3fc20000(0000) knlGS:0000000000000000
>     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>     CR2: 00007fe463c86330 CR3: 0000000c0cc0c000 CR4: 00000000000006e0
>     DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>     DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>     Process blkid (pid: 12458, threadinfo ffff880614e9e000, task ffff8806190eae20)
>     Stack:
>      ffff880c0d9f03c8 ffff880c0d9f03c8 ffff880614e9fa88 ffffff
>      ffff880c0d9f03c8 ffff880610bb0000 0000000000000002 ffffc9
>      ffff880614e9fab8 ffffffff812366fd ffff880614e9fad8 ffff88
>     Call Trace:
>      [<ffffffff812339a8>] __elv_add_request+0xe8/0x280
>      [<ffffffff812366fd>] add_acct_request+0x3d/0x50
>      [<ffffffff81236775>] blk_insert_cloned_request+0x65/0x90
>      [<ffffffff813b8d4e>] dm_dispatch_request+0x3e/0x70
>      [<ffffffff813ba850>] dm_request_fn+0x160/0x250
>      [<ffffffff81236e88>] queue_unplugged+0x48/0xd0
>      [<ffffffff8123b03d>] blk_flush_plug_list+0x1ed/0x250

Roland,

IIUC, this crash can happen even without dm-multipath being in the
picture, because it looks like it can happen that we have put requests
on the plug list, and then the device is hot-removed, which ends up
cleaning up the elevator, and then blk_flush_plug_list() is called.

Can you please try it without the multipath target?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH] dm mpath: manage reference on request queue of underlying devices
  2011-07-11 22:40                                         ` Mike Snitzer
  2011-07-12  0:52                                           ` Alan Stern
@ 2011-07-12 14:58                                           ` Mike Snitzer
  2011-07-12 17:06                                           ` [PATCH] block: Check that queue is alive in blk_insert_cloned_request() Vivek Goyal
  2 siblings, 0 replies; 42+ messages in thread
From: Mike Snitzer @ 2011-07-12 14:58 UTC (permalink / raw)
  To: Alasdair G. Kergon
  Cc: Jens Axboe, James Bottomley, Alan Stern, Heiko Carstens,
	linux-scsi, Steffen Maier, Manvanthara B. Puttashankar,
	Tarak Reddy, Seshagiri N. Ippili, linux-kernel,
	device-mapper development, Vivek Goyal, Tejun Heo, Roland Dreier

When processing a request, DM-mpath's map_io() sets the cloned request's
request_queue to the appropriate underlying device's request_queue
without getting a reference on that request_queue.

DM-mpath now maintains a reference count on the underlying devices'
request_queues.  This change wasn't motivated by a specific report, but
code like blk_insert_cloned_request() will access the request_queue
with the understanding that it is valid.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm-mpath.c |   33 +++++++++++++++++++++------------
 1 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index eadea8e..629efa3 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -161,11 +161,14 @@ static void free_pgpaths(struct list_head *pgpaths, struct dm_target *ti)
 {
 	struct pgpath *pgpath, *tmp;
 	struct multipath *m = ti->private;
+	struct request_queue *q;
 
 	list_for_each_entry_safe(pgpath, tmp, pgpaths, list) {
 		list_del(&pgpath->list);
+		q = bdev_get_queue(pgpath->path.dev->bdev);
 		if (m->hw_handler_name)
-			scsi_dh_detach(bdev_get_queue(pgpath->path.dev->bdev));
+			scsi_dh_detach(q);
+		blk_put_queue(q);
 		dm_put_device(ti, pgpath->path.dev);
 		free_pgpath(pgpath);
 	}
@@ -552,6 +555,7 @@ static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps
 {
 	int r;
 	struct pgpath *p;
+	struct request_queue *q;
 	struct multipath *m = ti->private;
 
 	/* we need at least a path arg */
@@ -571,9 +575,14 @@ static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps
 		goto bad;
 	}
 
-	if (m->hw_handler_name) {
-		struct request_queue *q = bdev_get_queue(p->path.dev->bdev);
+	q = bdev_get_queue(p->path.dev->bdev);
+	r = blk_get_queue(q);
+	if (r) {
+		ti->error = "error getting request queue";
+		goto bad_queue;
+	}
 
+	if (m->hw_handler_name) {
 		r = scsi_dh_attach(q, m->hw_handler_name);
 		if (r == -EBUSY) {
 			/*
@@ -586,8 +595,7 @@ static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps
 
 		if (r < 0) {
 			ti->error = "error attaching hardware handler";
-			dm_put_device(ti, p->path.dev);
-			goto bad;
+			goto bad_finish;
 		}
 
 		if (m->hw_handler_params) {
@@ -596,21 +604,22 @@ static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps
 				ti->error = "unable to set hardware "
 							"handler parameters";
 				scsi_dh_detach(q);
-				dm_put_device(ti, p->path.dev);
-				goto bad;
+				goto bad_finish;
 			}
 		}
 	}
 
 	r = ps->type->add_path(ps, &p->path, as->argc, as->argv, &ti->error);
-	if (r) {
-		dm_put_device(ti, p->path.dev);
-		goto bad;
-	}
+	if (r)
+		goto bad_finish;
 
 	return p;
 
- bad:
+bad_finish:
+	blk_put_queue(q);
+bad_queue:
+	dm_put_device(ti, p->path.dev);
+bad:
 	free_pgpath(p);
 	return ERR_PTR(r);
 }

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: block: Check that queue is alive in blk_insert_cloned_request()
  2011-07-12  1:46                                               ` Vivek Goyal
@ 2011-07-12 15:24                                                 ` Alan Stern
  2011-07-12 17:10                                                   ` Vivek Goyal
  0 siblings, 1 reply; 42+ messages in thread
From: Alan Stern @ 2011-07-12 15:24 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Mike Snitzer, Roland Dreier, Jens Axboe, James Bottomley,
	Heiko Carstens, linux-scsi, Steffen Maier,
	Manvanthara B. Puttashankar, Tarak Reddy, Seshagiri N. Ippili,
	linux-kernel, device-mapper development, Tejun Heo, jaxboe

On Mon, 11 Jul 2011, Vivek Goyal wrote:

> > > There's still the issue that Stefan Richter pointed out: The test for a 
> > > dead queue must be made _after_ acquiring the queue lock, not _before_.
> > 
> > Yes, quite important.
> > 
> > Jens, can you tweak the patch or should Roland send a v2?
> 
> I do not think that we should do the queue dead check after taking a spinlock.
> The reason is that there are lifetime issues with two objects:
> 
> - Validity of the request queue pointer
> - Validity of the q->spin_lock pointer
> 
> If dm has taken a reference to the request queue in the beginning
> then it can be sure the request queue pointer is valid. But the spin_lock might
> be coming from the driver and might be in one of the driver-allocated structures.
> So it might happen that the driver has called blk_cleanup_queue() and freed
> up the structures which contained the spin lock.

Surely this is a bug in the design of the block layer?

> So if the queue is not dead, we know that q->spin_lock is valid. I think
> the only race present here is that the whole operation is not atomic: first
> we check the queue-not-dead flag and then go on to acquire the request
> queue lock, which leaves a small window for a race. I think I have
> seen other code written in this manner (__generic_make_request()), so
> it is probably reasonably safe to do here too.

"Probably reasonably safe" = "unsafe".  The fact that it will usually
work out okay means that when it does fail, it will be very difficult
to track down.

It needs to be fixed _now_, when people are aware of the issue.  Not 
five years from now, when everybody has forgotten about it.

Alan Stern

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] block: Check that queue is alive in blk_insert_cloned_request()
  2011-07-11 22:40                                         ` Mike Snitzer
  2011-07-12  0:52                                           ` Alan Stern
  2011-07-12 14:58                                           ` [PATCH] dm mpath: manage reference on request queue of underlying devices Mike Snitzer
@ 2011-07-12 17:06                                           ` Vivek Goyal
  2011-07-12 17:41                                             ` James Bottomley
  2 siblings, 1 reply; 42+ messages in thread
From: Vivek Goyal @ 2011-07-12 17:06 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Roland Dreier, Jens Axboe, James Bottomley, Alan Stern,
	Heiko Carstens, linux-scsi, Steffen Maier,
	Manvanthara B. Puttashankar, Tarak Reddy, Seshagiri N. Ippili,
	linux-kernel, device-mapper development, Tejun Heo

On Mon, Jul 11, 2011 at 06:40:11PM -0400, Mike Snitzer wrote:
> [cc'ing dm-devel, vivek and tejun]
> 
> On Fri, Jul 8, 2011 at 7:04 PM, Roland Dreier <roland@kernel.org> wrote:
> > From: Roland Dreier <roland@purestorage.com>
> >
> > This fixes crashes such as the below that I see when the storage
> > underlying a dm-multipath device is hot-removed.  The problem is that
> > dm requeues a request to a device whose block queue has already been
> > cleaned up, and blk_insert_cloned_request() doesn't check if the queue
> > is alive, but rather goes ahead and tries to queue the request.  This
> > ends up dereferencing the elevator that was already freed in
> > blk_cleanup_queue().
> 
> Your patch looks fine to me:
> Acked-by: Mike Snitzer <snitzer@redhat.com>
> 
> And I looked at various code paths to arrive at the references DM takes.
> 
> A reference is taken on the underlying devices' block_device via
> drivers/md/dm-table.c:open_dev() with blkdev_get_by_dev().  open_dev()
> also does bd_link_disk_holder(), resulting in the mpath device
> becoming a holder of the underlying devices. e.g.:
> /sys/block/sda/holders/dm-4
> 
> But at no point does DM-mpath get a reference to the underlying
> devices' request_queue that gets assigned to clone->q (in
> drivers/md/dm-mpath.c:map_io).
> 
> Seems we should, though AFAIK it won't help with the issue you've
> pointed out (because the hotplugged device's driver already called
> blk_cleanup_queue and nuked the elevator).

[Thinking loud]

Could it be a driver-specific issue that it cleaned up the request
queue too early?

Is there any notion of a device reference which higher layers take and
which should make sure the request queue is intact while somebody is
holding the device reference?

If yes, what's the connection between the device reference and the
request queue reference? IOW, why is a request queue reference needed,
and why is a device reference not sufficient? (Because there is not
necessarily a one-to-one mapping between request queue and device?)

I seem to just have lots of questions about devices and referencing.
Hopefully somebody with more knowledge in this area will be able to
shed some light on it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: block: Check that queue is alive in blk_insert_cloned_request()
  2011-07-12 15:24                                                 ` Alan Stern
@ 2011-07-12 17:10                                                   ` Vivek Goyal
  0 siblings, 0 replies; 42+ messages in thread
From: Vivek Goyal @ 2011-07-12 17:10 UTC (permalink / raw)
  To: Alan Stern
  Cc: Mike Snitzer, Roland Dreier, Jens Axboe, James Bottomley,
	Heiko Carstens, linux-scsi, Steffen Maier,
	Manvanthara B. Puttashankar, Tarak Reddy, Seshagiri N. Ippili,
	linux-kernel, device-mapper development, Tejun Heo, jaxboe

On Tue, Jul 12, 2011 at 11:24:54AM -0400, Alan Stern wrote:
> On Mon, 11 Jul 2011, Vivek Goyal wrote:
> 
> > > > There's still the issue that Stefan Richter pointed out: The test for a 
> > > > dead queue must be made _after_ acquiring the queue lock, not _before_.
> > > 
> > > Yes, quite important.
> > > 
> > > Jens, can you tweak the patch or should Roland send a v2?
> > 
> > I do not think that we should do the queue dead check after taking a spinlock.
> > The reason is that there are lifetime issues with two objects:
> > 
> > - Validity of the request queue pointer
> > - Validity of the q->spin_lock pointer
> > 
> > If dm has taken a reference to the request queue in the beginning
> > then it can be sure the request queue pointer is valid. But the spin_lock might
> > be coming from the driver and might be in one of the driver-allocated structures.
> > So it might happen that the driver has called blk_cleanup_queue() and freed
> > up the structures which contained the spin lock.
> 
> Surely this is a bug in the design of the block layer?
> 
> > So if the queue is not dead, we know that q->spin_lock is valid. I think
> > the only race present here is that the whole operation is not atomic: first
> > we check the queue-not-dead flag and then go on to acquire the request
> > queue lock, which leaves a small window for a race. I think I have
> > seen other code written in this manner (__generic_make_request()), so
> > it is probably reasonably safe to do here too.
> 
> "Probably reasonably safe" = "unsafe".  The fact that it will usually
> work out okay means that when it does fail, it will be very difficult
> to track down.
> 
> It needs to be fixed _now_, when people are aware of the issue.  Not 
> five years from now, when everybody has forgotten about it.

I agree that fixing it would be good. Frankly speaking, I don't even have
a full understanding of the problem. I know a little bit from the request
queue side but have no idea about the referencing mechanism at the device
level and how that is supposed to work with request queue referencing.

So once we understand the problem well, we will probably have an answer
for how to go about fixing it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] block: Check that queue is alive in blk_insert_cloned_request()
  2011-07-12 17:06                                           ` [PATCH] block: Check that queue is alive in blk_insert_cloned_request() Vivek Goyal
@ 2011-07-12 17:41                                             ` James Bottomley
  2011-07-12 18:02                                               ` Vivek Goyal
  0 siblings, 1 reply; 42+ messages in thread
From: James Bottomley @ 2011-07-12 17:41 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Mike Snitzer, Roland Dreier, Jens Axboe, Alan Stern,
	Heiko Carstens, linux-scsi, Steffen Maier,
	Manvanthara B. Puttashankar, Tarak Reddy, Seshagiri N. Ippili,
	linux-kernel, device-mapper development, Tejun Heo

On Tue, 2011-07-12 at 13:06 -0400, Vivek Goyal wrote:
> On Mon, Jul 11, 2011 at 06:40:11PM -0400, Mike Snitzer wrote:
> > [cc'ing dm-devel, vivek and tejun]
> > 
> > On Fri, Jul 8, 2011 at 7:04 PM, Roland Dreier <roland@kernel.org> wrote:
> > > From: Roland Dreier <roland@purestorage.com>
> > >
> > > This fixes crashes such as the below that I see when the storage
> > > underlying a dm-multipath device is hot-removed.  The problem is that
> > > dm requeues a request to a device whose block queue has already been
> > > cleaned up, and blk_insert_cloned_request() doesn't check if the queue
> > > is alive, but rather goes ahead and tries to queue the request.  This
> > > ends up dereferencing the elevator that was already freed in
> > > blk_cleanup_queue().
> > 
> > Your patch looks fine to me:
> > Acked-by: Mike Snitzer <snitzer@redhat.com>
> > 
> > And I looked at various code paths to arrive at the references DM takes.
> > 
> > A reference is taken on the underlying devices' block_device via
> > drivers/md/dm-table.c:open_dev() with blkdev_get_by_dev().  open_dev()
> > also does bd_link_disk_holder(), resulting in the mpath device
> > becoming a holder of the underlying devices. e.g.:
> > /sys/block/sda/holders/dm-4
> > 
> > But at no point does DM-mpath get a reference to the underlying
> > devices' request_queue that gets assigned to clone->q (in
> > drivers/md/dm-mpath.c:map_io).
> > 
> > Seems we should, though AFAIK it won't help with the issue you've
> > pointed out (because the hotplugged device's driver already called
> > blk_cleanup_queue and nuked the elevator).
> 
> [Thinking loud]
> 
> Could it be a driver-specific issue that it cleaned up the request
> queue too early?

One could glibly answer yes to this.  However, the fact is that it's
currently SCSI which manages the queue, so SCSI cleans it up.  Now, the
only real thing dm is interested in is the queue itself, hence the need
to take a reference to the queue.  However, queue references don't pin
SCSI devices, so you can hold a queue reference all you like and SCSI
will still clean up the queue.

I think a better question is what should cleaning up the queue do?  SCSI
uses it to indicate that we're no longer processing requests, which
happens when the device goes into a DEL state, but queue cleanup tears
down the elevators and really makes the request queue non functional.
In this case, holding a reference isn't particularly helpful.

I'm starting to wonder if there's actually any value to
blk_cleanup_queue() and whether its functionality wouldn't be better
assumed by the queue release function on last put.

> Is there any notion of a device reference which higher layers take and
> which should make sure the request queue is intact while somebody is
> holding the device reference?
> 
> If yes, what's the connection between the device reference and the
> request queue reference? IOW, why is a request queue reference needed,
> and why is a device reference not sufficient? (Because there is not
> necessarily a one-to-one mapping between request queue and device?)

For the model, these need to be independent but related.  Trying to
subordinate one to the other is going to lead to layering problems.

James


> I seem to just have lots of questions about devices and referencing.
> Hopefully somebody with more knowledge in this area will be able to
> shed some light on it.
> 
> Thanks
> Vivek



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] block: Check that queue is alive in blk_insert_cloned_request()
  2011-07-12 17:41                                             ` James Bottomley
@ 2011-07-12 18:02                                               ` Vivek Goyal
  2011-07-12 18:28                                                 ` James Bottomley
  0 siblings, 1 reply; 42+ messages in thread
From: Vivek Goyal @ 2011-07-12 18:02 UTC (permalink / raw)
  To: James Bottomley
  Cc: Mike Snitzer, Roland Dreier, Jens Axboe, Alan Stern,
	Heiko Carstens, linux-scsi, Steffen Maier,
	Manvanthara B. Puttashankar, Tarak Reddy, Seshagiri N. Ippili,
	linux-kernel, device-mapper development, Tejun Heo

On Tue, Jul 12, 2011 at 12:41:30PM -0500, James Bottomley wrote:
> On Tue, 2011-07-12 at 13:06 -0400, Vivek Goyal wrote:
> > On Mon, Jul 11, 2011 at 06:40:11PM -0400, Mike Snitzer wrote:
> > > [cc'ing dm-devel, vivek and tejun]
> > > 
> > > On Fri, Jul 8, 2011 at 7:04 PM, Roland Dreier <roland@kernel.org> wrote:
> > > > From: Roland Dreier <roland@purestorage.com>
> > > >
> > > > This fixes crashes such as the below that I see when the storage
> > > > underlying a dm-multipath device is hot-removed.  The problem is that
> > > > dm requeues a request to a device whose block queue has already been
> > > > cleaned up, and blk_insert_cloned_request() doesn't check if the queue
> > > > is alive, but rather goes ahead and tries to queue the request.  This
> > > > ends up dereferencing the elevator that was already freed in
> > > > blk_cleanup_queue().
> > > 
> > > Your patch looks fine to me:
> > > Acked-by: Mike Snitzer <snitzer@redhat.com>
> > > 
> > > And I looked at various code paths to arrive at the references DM takes.
> > > 
> > > A reference is taken on the underlying devices' block_device via
> > > drivers/md/dm-table.c:open_dev() with blkdev_get_by_dev().  open_dev()
> > > also does bd_link_disk_holder(), resulting in the mpath device
> > > becoming a holder of the underlying devices. e.g.:
> > > /sys/block/sda/holders/dm-4
> > > 
> > > But at no point does DM-mpath get a reference to the underlying
> > > devices' request_queue that gets assigned to clone->q (in
> > > drivers/md/dm-mpath.c:map_io).
> > > 
> > > Seems we should, though AFAIK it won't help with the issue you've
> > > pointed out (because the hotplugged device's driver already called
> > > blk_cleanup_queue and nuked the elevator).
> > 
> > [Thinking loud]
> > 
> > Could it be a driver-specific issue that it cleaned up the request
> > queue too early?
> 
> One could glibly answer yes to this.  However, the fact is that it's
> currently SCSI which manages the queue, so SCSI cleans it up.  Now, the
> only real thing dm is interested in is the queue itself, hence the need
> to take a reference to the queue.  However, queue references don't pin
> SCSI devices, so you can hold a queue reference all you like and SCSI
> will still clean up the queue.
> 
> I think a better question is what should cleaning up the queue do?  SCSI
> uses it to indicate that we're no longer processing requests, which
> happens when the device goes into a DEL state, but queue cleanup tears
> down the elevators and really makes the request queue non functional.
> In this case, holding a reference isn't particularly helpful.
> 
> I'm starting to wonder if there's actually any value to
> blk_cleanup_queue() and whether its functionality wouldn't be better
> assumed by the queue release function on last put.

I think one problem point is q->queue_lock. If the driver drops its
reference on the queue and cleans up its data structures, then it will
free up the memory associated with q->queue_lock too (if the driver
provided its own queue lock). In that case, anything which depends on
the queue lock needs to be freed up in blk_cleanup_queue().

If we can make sure that a request queue reference will keep the spin
lock alive, then I guess all of the cleanup might be able to go into
the release queue function.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] block: Check that queue is alive in blk_insert_cloned_request()
  2011-07-12 18:02                                               ` Vivek Goyal
@ 2011-07-12 18:28                                                 ` James Bottomley
  2011-07-12 18:54                                                   ` Vivek Goyal
  2011-07-12 21:02                                                   ` Alan Stern
  0 siblings, 2 replies; 42+ messages in thread
From: James Bottomley @ 2011-07-12 18:28 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Mike Snitzer, Roland Dreier, Jens Axboe, Alan Stern,
	Heiko Carstens, linux-scsi, Steffen Maier,
	Manvanthara B. Puttashankar, Tarak Reddy, Seshagiri N. Ippili,
	linux-kernel, device-mapper development, Tejun Heo

On Tue, 2011-07-12 at 14:02 -0400, Vivek Goyal wrote:
> On Tue, Jul 12, 2011 at 12:41:30PM -0500, James Bottomley wrote:
> > On Tue, 2011-07-12 at 13:06 -0400, Vivek Goyal wrote:
> > > On Mon, Jul 11, 2011 at 06:40:11PM -0400, Mike Snitzer wrote:
> > > > [cc'ing dm-devel, vivek and tejun]
> > > > 
> > > > On Fri, Jul 8, 2011 at 7:04 PM, Roland Dreier <roland@kernel.org> wrote:
> > > > > From: Roland Dreier <roland@purestorage.com>
> > > > >
> > > > > This fixes crashes such as the below that I see when the storage
> > > > > underlying a dm-multipath device is hot-removed.  The problem is that
> > > > > dm requeues a request to a device whose block queue has already been
> > > > > cleaned up, and blk_insert_cloned_request() doesn't check if the queue
> > > > > is alive, but rather goes ahead and tries to queue the request.  This
> > > > > ends up dereferencing the elevator that was already freed in
> > > > > blk_cleanup_queue().
> > > > 
> > > > Your patch looks fine to me:
> > > > Acked-by: Mike Snitzer <snitzer@redhat.com>
> > > > 
> > > > And I looked at various code paths to arrive at the references DM takes.
> > > > 
> > > > A reference is taken on the underlying devices' block_device via
> > > > drivers/md/dm-table.c:open_dev() with blkdev_get_by_dev().  open_dev()
> > > > also does bd_link_disk_holder(), resulting in the mpath device
> > > > becoming a holder of the underlying devices. e.g.:
> > > > /sys/block/sda/holders/dm-4
> > > > 
> > > > But at no point does DM-mpath get a reference to the underlying
> > > > devices' request_queue that gets assigned to clone->q (in
> > > > drivers/md/dm-mpath.c:map_io).
> > > > 
> > > > Seems we should, though AFAIK it won't help with the issue you've
> > > > pointed out (because the hotplugged device's driver already called
> > > > blk_cleanup_queue and nuked the elevator).
> > > 
> > > [Thinking loud]
> > > 
> > > Could it be a driver-specific issue that it cleaned up the request
> > > queue too early?
> > 
> > One could glibly answer yes to this.  However, the fact is that it's
> > currently SCSI which manages the queue, so SCSI cleans it up.  Now, the
> > only real thing dm is interested in is the queue itself, hence the need
> > to take a reference to the queue.  However, queue references don't pin
> > SCSI devices, so you can hold a queue reference all you like and SCSI
> > will still clean up the queue.
> > 
> > I think a better question is what should cleaning up the queue do?  SCSI
> > uses it to indicate that we're no longer processing requests, which
> > happens when the device goes into a DEL state, but queue cleanup tears
> > down the elevators and really makes the request queue non functional.
> > In this case, holding a reference isn't particularly helpful.
> > 
> > I'm starting to wonder if there's actually any value to
> > blk_cleanup_queue() and whether its functionality wouldn't be better
> > assumed by the queue release function on last put.
> 
> I think one problem point is q->queue_lock. If driver drops its reference
> on queue and cleans up its data structures, then it will free up memory
> associated with q->queue_lock too. (If driver provided its own queue
> lock). In that case anything which is dependent on queue lock, needs
> to be freed up on blk_cleanup_queue().

I don't quite follow.  blk_cleanup_queue() doesn't free anything (well,
except the elevator).  Final put will free the queue structure which
contains the lock, but if it's really a final put, you have no other
possible references, so no-one is using the lock ... well, assuming
there isn't a programming error, of course ...

> If we can make sure that request queue reference will keep the spin lock
> alive, then i guess all cleanup part might be able to go in release
> queue function.

As I said: cleanup doesn't free the structure containing the lock,
release does, so that piece wouldn't be altered by putting
blk_cleanup_queue() elsewhere.

James


* Re: [PATCH] block: Check that queue is alive in blk_insert_cloned_request()
  2011-07-12 18:28                                                 ` James Bottomley
@ 2011-07-12 18:54                                                   ` Vivek Goyal
  2011-07-12 21:02                                                   ` Alan Stern
  1 sibling, 0 replies; 42+ messages in thread
From: Vivek Goyal @ 2011-07-12 18:54 UTC (permalink / raw)
  To: James Bottomley
  Cc: Roland Dreier, Jens Axboe, Seshagiri N. Ippili, Mike Snitzer,
	linux-scsi, Heiko Carstens, linux-kernel, Tejun Heo,
	device-mapper development, Alan Stern, Steffen Maier,
	Manvanthara B. Puttashankar, Tarak Reddy

On Tue, Jul 12, 2011 at 01:28:18PM -0500, James Bottomley wrote:

[..]
> > > I'm starting to wonder if there's actually any value to
> > > blk_cleanup_queue() and whether its functionality wouldn't be better
> > > assumed by the queue release function on last put.
> > 
> > I think one problem point is q->queue_lock. If driver drops its reference
> > on queue and cleans up its data structures, then it will free up memory
> > associated with q->queue_lock too. (If driver provided its own queue
> > lock). In that case anything which is dependent on queue lock, needs
> > to be freed up on blk_cleanup_queue().
> 
> I don't quite follow.  blk_cleanup_queue() doesn't free anything (well,
> except the elevator).  Final put will free the queue structure which
> contains the lock, but if it's really a final put, you have no other
> possible references, so no-one is using the lock ... well, assuming
> there isn't a programming error, of course ...
> 
> > If we can make sure that request queue reference will keep the spin lock
> > alive, then i guess all cleanup part might be able to go in release
> > queue function.
> 
> As I said: cleanup doesn't free the structure containing the lock,
> release does, so that piece wouldn't be altered by putting
> blk_cleanup_queue() elsewhere.

I thought a driver could either rely on the spin lock provided by the
request queue or override it by providing its own spinlock.

blk_init_allocated_queue_node()
        /* Override internal queue lock with supplied lock pointer */
        if (lock)
                q->queue_lock           = lock;

So if a driver calls blk_cleanup_queue() and drops its reference on the
queue, it should be free to release any memory it has allocated for the
spinlock. So even though the queue is still around, there is no guarantee
that q->queue_lock is still around. That memory might have been freed by
the driver and reused.

I see many drivers providing their own locks. Some samples from
drivers/block:

./virtio_blk.c:	q = vblk->disk->queue = blk_init_queue(do_virtblk_request, &vblk->lock);
./xd.c:	xd_queue = blk_init_queue(do_xd_request, &xd_lock);
./cpqarray.c:	q = blk_init_queue(do_ida_request, &hba[i]->lock);
./sx8.c:	q = blk_init_queue(carm_rq_fn, &host->lock);
./sx8.c:	q = blk_init_queue(carm_oob_rq_fn, &host->lock);
./floppy.c:	disks[dr]->queue = blk_init_queue(do_fd_request, &floppy_lock);
./viodasd.c:	q = blk_init_queue(do_viodasd_request, &d->q_lock);
./cciss.c:	disk->queue = blk_init_queue(do_cciss_request, &h->lock);
./hd.c:	hd_queue = blk_init_queue(do_hd_request, &hd_lock);
./DAC960.c:  	RequestQueue = blk_init_queue(DAC960_RequestFunction,&Controller->queue_lock);
./z2ram.c:    z2_queue = blk_init_queue(do_z2_request, &z2ram_lock);
./amiflop.c:	disk->queue = blk_init_queue(do_fd_request, &amiflop_lock);
./xen-blkfront.c:	rq = blk_init_queue(do_blkif_request, &blkif_io_lock);
./paride/pd.c:	pd_queue = blk_init_queue(do_pd_request, &pd_lock);
./paride/pf.c:	pf_queue = blk_init_queue(do_pf_request, &pf_spin_lock);
./paride/pcd.c:	pcd_queue = blk_init_queue(do_pcd_request, &pcd_lock);
./mg_disk.c:	host->breq = blk_init_queue(mg_request_poll, &host->lock);
./mg_disk.c:	host->breq = blk_init_queue(mg_request, &host->lock);
./rbd.c:	q = blk_init_queue(rbd_rq_fn, &rbd_dev->lock);
./sunvdc.c:	q = blk_init_queue(do_vdc_request, &port->vio.lock);
./swim.c:	swd->queue = blk_init_queue(do_fd_request, &swd->lock);
./xsysace.c:	ace->queue = blk_init_queue(ace_request, &ace->lock);
./osdblk.c:	q = blk_init_queue(osdblk_rq_fn, &osdev->lock);
./ps3disk.c:	queue = blk_init_queue(ps3disk_request, &priv->lock);
./swim3.c:	swim3_queue = blk_init_queue(do_fd_request, &swim3_lock);
./ub.c:	if ((q = blk_init_queue(ub_request_fn, sc->lock)) == NULL)
./nbd.c:	disk->queue = blk_init_queue(do_nbd_request, &nbd_lock);

Thanks
Vivek


* Re: [PATCH] block: Check that queue is alive in blk_insert_cloned_request()
  2011-07-12 18:28                                                 ` James Bottomley
  2011-07-12 18:54                                                   ` Vivek Goyal
@ 2011-07-12 21:02                                                   ` Alan Stern
  1 sibling, 0 replies; 42+ messages in thread
From: Alan Stern @ 2011-07-12 21:02 UTC (permalink / raw)
  To: James Bottomley
  Cc: Vivek Goyal, Mike Snitzer, Roland Dreier, Jens Axboe,
	Heiko Carstens, linux-scsi, Steffen Maier,
	Manvanthara B. Puttashankar, Tarak Reddy, Seshagiri N. Ippili,
	linux-kernel, device-mapper development, Tejun Heo

On Tue, 12 Jul 2011, James Bottomley wrote:

> > I think one problem point is q->queue_lock. If driver drops its reference
> > on queue and cleans up its data structures, then it will free up memory
> > associated with q->queue_lock too. (If driver provided its own queue
> > lock). In that case anything which is dependent on queue lock, needs
> > to be freed up on blk_cleanup_queue().
> 
> I don't quite follow.  blk_cleanup_queue() doesn't free anything (well,
> except the elevator).  Final put will free the queue structure which
> contains the lock, but if it's really a final put, you have no other
> possible references, so no-one is using the lock ... well, assuming
> there isn't a programming error, of course ...
> 
> > If we can make sure that request queue reference will keep the spin lock
> > alive, then i guess all cleanup part might be able to go in release
> > queue function.
> 
> As I said: cleanup doesn't free the structure containing the lock,
> release does, so that piece wouldn't be altered by putting
> blk_cleanup_queue() elsewhere.

It's worth taking a closer look at what happened here.  Originally
request_queue->queuedata was set to NULL and blk_cleanup_queue() was
called from scsi_device_dev_release_usercontext() (by way of
scsi_free_queue()).  This caused a problem because there was a window
in which the last reference to the scsi_device had been dropped but the
release routine hadn't yet run, so the queue was still marked as active
(i.e., queuedata was still non-NULL).

Commit 86cbfb5607d4b81b1a993ff689bbd2addd5d3a9b ([SCSI] put stricter 
guards on queue dead checks) was written to fix this problem.  However 
it moved _both_ lines from the release routine into 
__scsi_remove_device().  There was no need for this; all that was 
needed was to make sure queuedata was NULL before the device reference 
was dropped.

And in fact, moving the call to scsi_free_queue() has opened up another
window for bugs.  Now it's possible for requests to get on the queue
(submitted by drivers from another layer that still hold a reference to
the scsi_device) even after the queue's elevator has been freed.  Yes,
the block layer is supposed to prevent this from happening, but there
are multiple paths and they don't all have the requisite checks.

And as Vivek points out, it's questionable whether they _can_ make the 
proper checks.  Checking requires the queue's lock, but the client may 
have deallocated the lock at the time the queue was cleaned up.  Maybe 
SCSI doesn't do this, but other clients of the block layer might.  The 
block layer can't make any assumptions.  The unavoidable conclusion is 
that the clients must be responsible for ensuring that no requests are 
added to a queue after its lock has been released.

To help accomplish this in SCSI, I'm suggesting that the call to
scsi_free_queue() be moved back to where it used to be, in
scsi_device_dev_release_usercontext().  Then it won't be possible for
any wayward requests to be added to the queue after it is cleaned up,
because there won't be any outstanding references to the scsi_device.

Alan Stern



end of thread, other threads:[~2011-07-12 21:02 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-06-15 11:20 [BUG] 2.6.39.1 crash in scsi_dispatch_cmd() Heiko Carstens
2011-06-16 16:01 ` Heiko Carstens
2011-06-16 16:37   ` James Bottomley
2011-06-16 18:40     ` Heiko Carstens
2011-06-20 15:30       ` Heiko Carstens
2011-07-01 18:07         ` Roland Dreier
2011-07-01 19:04           ` James Bottomley
2011-07-06  0:34             ` Roland Dreier
2011-07-06  6:47               ` Heiko Carstens
2011-07-06  8:06                 ` Roland Dreier
2011-07-06  9:25                   ` Heiko Carstens
2011-07-06 14:20                   ` Alan Stern
2011-07-06 14:24                     ` James Bottomley
2011-07-06 16:30                       ` Roland Dreier
2011-07-06 16:53                         ` Alan Stern
2011-07-06 18:07                           ` Roland Dreier
2011-07-06 18:49                             ` Alan Stern
2011-07-07 20:45                               ` James Bottomley
2011-07-07 21:07                                 ` Alan Stern
2011-07-08 17:04                                   ` James Bottomley
2011-07-08 19:43                                     ` Alan Stern
2011-07-08 20:41                                       ` James Bottomley
2011-07-08 22:08                                         ` Alan Stern
2011-07-08 22:25                                           ` James Bottomley
2011-07-08 20:47                                     ` Roland Dreier
2011-07-08 23:04                                       ` [PATCH] block: Check that queue is alive in blk_insert_cloned_request() Roland Dreier
2011-07-09  9:05                                         ` Stefan Richter
2011-07-11 22:40                                         ` Mike Snitzer
2011-07-12  0:52                                           ` Alan Stern
2011-07-12  1:22                                             ` Mike Snitzer
2011-07-12  1:46                                               ` Vivek Goyal
2011-07-12 15:24                                                 ` Alan Stern
2011-07-12 17:10                                                   ` Vivek Goyal
2011-07-12 14:58                                           ` [PATCH] dm mpath: manage reference on request queue of underlying devices Mike Snitzer
2011-07-12 17:06                                           ` [PATCH] block: Check that queue is alive in blk_insert_cloned_request() Vivek Goyal
2011-07-12 17:41                                             ` James Bottomley
2011-07-12 18:02                                               ` Vivek Goyal
2011-07-12 18:28                                                 ` James Bottomley
2011-07-12 18:54                                                   ` Vivek Goyal
2011-07-12 21:02                                                   ` Alan Stern
2011-07-12  2:09                                         ` Vivek Goyal
2011-07-06 16:24                     ` [BUG] 2.6.39.1 crash in scsi_dispatch_cmd() Roland Dreier
