2.6.39-rc5+ BUG at scsi_run

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* 2.6.39-rc5+ BUG at scsi_run_queue+0x24/0xe3
@ 2011-05-03 16:53 Jim Schutt
  2011-05-03 17:00 ` James Bottomley
  0 siblings, 1 reply; 7+ messages in thread
From: Jim Schutt @ 2011-05-03 16:53 UTC (permalink / raw)
  To: James.Bottomley; +Cc: linux-kernel

Hi,

I'm getting this BUG on ~20% of boots with 2.6.39-rc5+:

[   22.607020] BUG: unable to handle kernel NULL pointer dereference at           (null)
[   22.608004] IP: [<ffffffffa019b8c5>] scsi_run_queue+0x24/0xe3 [scsi_mod]
[   22.608004] PGD 22564b067 PUD 222e93067 PMD 0
[   22.608004] Oops: 0000 [#1] SMP
[   22.608004] last sysfs file: /sys/devices/pci0000:00/0000:00:1d.7/usb1/usb_device/usbdev1.1/dev
[   22.608004] CPU 0
[   22.608004] Modules linked in: megaraid_sas ide_cd_mod ib_mthca(+) cdrom ib_mad qla2xxx(+) ib_core button scsi_transport_fc scsi_tgt serio_raw ata_piix i5k_amb tpm_tis libata hwmon tpm i5000_edac floppy(+) tpm_bios scsi_mod dcdbas edac_core pcspkr uhci_hcd ehci_hcd iTCO_wdt iTCO_vendor_support rtc nfs nfs_acl auth_rpcgss fscache lockd sunrpc tg3 bnx2 e1000
[   22.608004]
[   22.608004] Pid: 1820, comm: path_id Not tainted 2.6.39-rc5-00139-g9fbc674 #23 Dell Inc. PowerEdge 1950/0DT097
[   22.608004] RIP: 0010:[<ffffffffa019b8c5>]  [<ffffffffa019b8c5>] scsi_run_queue+0x24/0xe3 [scsi_mod]
[   22.608004] RSP: 0000:ffff88022fc03d10  EFLAGS: 00010282
[   22.608004] RAX: ffff8802240ece00 RBX: ffff88022fc03d20 RCX: ffff88022f002900
[   22.608004] RDX: 0000000000000000 RSI: 0000000000000037 RDI: 0000000000000000
[   22.608004] RBP: ffff88022fc03d60 R08: 0000000000000286 R09: ffffea00077e33a0
[   22.608004] R10: ffff88022f002900 R11: ffff88022fc03cf0 R12: ffff880223977740
[   22.608004] R13: ffff8802254b2938 R14: 0000000000000000 R15: ffff880223977740
[   22.608004] FS:  00007f2e084d26e0(0000) GS:ffff88022fc00000(0000) knlGS:0000000000000000
[   22.608004] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   22.608004] CR2: 0000000000000000 CR3: 0000000222e03000 CR4: 00000000000006f0
[   22.608004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   22.608004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   22.608004] Process path_id (pid: 1820, threadinfo ffff8802236a8000, task ffff8802234d1690)
[   22.608004] Stack:
[   22.608004]  0000000000000282 ffff8802254b2800 0000000000000000 ffff880223977740
[   22.608004]  ffff88022fc03d60 ffff8802240ece00 ffff880223977740 ffff8802254b2938
[   22.608004]  0000000000000000 ffff880223977740 ffff88022fc03d90 ffffffffa019c205
[   22.608004] Call Trace:
[   22.608004]  <IRQ>
[   22.608004]  [<ffffffffa019c205>] scsi_next_command+0x3b/0x4c [scsi_mod]
[   22.608004]  [<ffffffffa019c84a>] scsi_end_request+0x83/0x94 [scsi_mod]
[   22.608004]  [<ffffffffa019cbea>] scsi_io_completion+0x1b0/0x3fb [scsi_mod]
[   22.608004]  [<ffffffffa019b635>] ? spin_unlock_irqrestore+0xe/0x10 [scsi_mod]
[   22.608004]  [<ffffffffa0195159>] scsi_finish_command+0xeb/0xf4 [scsi_mod]
[   22.608004]  [<ffffffffa019d9df>] scsi_softirq_done+0x112/0x11b [scsi_mod]
[   22.608004]  [<ffffffff811c727e>] blk_done_softirq+0x4b/0x61
[   22.608004]  [<ffffffff8104f74c>] __do_softirq+0xbf/0x16e
[   22.608004]  [<ffffffff813b354c>] call_softirq+0x1c/0x30
[   22.608004]  [<ffffffff810041a3>] do_softirq+0x3d/0x86
[   22.608004]  [<ffffffff8104f44a>] invoke_softirq+0x17/0x20
[   22.608004]  [<ffffffff8104fa19>] irq_exit+0x57/0x98
[   22.608004]  [<ffffffff813b3c81>] do_IRQ+0x91/0xa8
[   22.608004]  [<ffffffff813abc53>] common_interrupt+0x13/0x13
[   22.608004]  <EOI>
[   22.608004]  [<ffffffff813abc9a>] ? retint_swapgs+0xe/0x13
[   22.608004] Code: ff ff 5b 41 5c c9 c3 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 28 0f 1f 44 00 00 49 89 ff 48 8b bf 40 03 00 00 48 8d 5d c0 <4c> 8b 37 48 89 5d c0 48 89 5d c8 48 8b 87 38 01 00 00 f6 80 a4
[   22.608004] RIP  [<ffffffffa019b8c5>] scsi_run_queue+0x24/0xe3 [scsi_mod]
[   22.608004]  RSP <ffff88022fc03d10>
[   22.608004] CR2: 0000000000000000
[   22.929460] ---[ end trace f9ecaaa16661ec4a ]---
[   22.934070] Kernel panic - not syncing: Fatal exception in interrupt
[   22.940410] Pid: 1820, comm: path_id Tainted: G      D     2.6.39-rc5-00139-g9fbc674 #23
[   22.948483] Call Trace:
[   22.950923]  <IRQ>  [<ffffffff8104953e>] ? panic+0xbc/0x1c3
[   22.953064] qla2xxx 0000:0e:00.0: Allocated (64 KB) for EFT...
[   22.953217] qla2xxx 0000:0e:00.0: Allocated (1285 KB) for firmware dump...
[   22.969169]  [<ffffffff813aba57>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[   22.970176] scsi0 : qla2xxx
[   22.970505] qla2xxx 0000:0e:00.0:
[   22.970506]  QLogic Fibre Channel HBA Driver: 8.03.07.00
[   22.970507]   QLogic QLE2462 - PCI-Express Dual Channel 4Gb Fibre Channel HBA
[   22.970508]   ISP2432: PCIe (2.5GT/s x4) @ 0000:0e:00.0 hdma+, host#=0, fw=5.03.13 (496)
[   22.970540] qla2xxx 0000:0e:00.1: PCI INT B -> GSI 17 (level, low) -> IRQ 17
[   23.009538]  [<ffffffff81049a32>] ? spin_unlock_irqrestore+0xe/0x10
[   23.009544] qla2xxx 0000:0e:00.1: Found an ISP2432, irq 17, iobase 0xffffc9001178c000
[   23.009787] qla2xxx 0000:0e:00.1: irq 107 for MSI/MSI-X
[   23.009856] qla2xxx 0000:0e:00.1: Configuring PCI space...
[   23.009862] qla2xxx 0000:0e:00.1: setting latency timer to 64
[   23.040006]  [<ffffffff8104ada0>] ? kmsg_dump+0x4f/0xe6
[   23.040531] qla2xxx 0000:0e:00.1: Configure NVRAM parameters...
[   23.051121]  [<ffffffff813aca70>] ? oops_end+0xaf/0xbf
[   23.056249]  [<ffffffff8102b04c>] ? no_context+0xea/0xf6
[   23.061550]  [<ffffffff8102b227>] ? __bad_area_nosemaphore+0x107/0x114
[   23.068063]  [<ffffffff8102b2be>] ? bad_area_nosemaphore+0x13/0x15
[   23.074231]  [<ffffffff813aeb28>] ? do_page_fault+0x192/0x331
[   23.076259] qla2xxx 0000:0e:00.1: Verifying loaded RISC code...
[   23.085867]  [<ffffffff8101c7eb>] ? apic_write+0x16/0x18
[   23.086060] qla2xxx 0000:0e:00.1: FW: Loading via request-firmware...
[   23.097590]  [<ffffffff8101ca74>] ? lapic_next_event+0x15/0x19
[   23.103413]  [<ffffffff81072fa9>] ? clockevents_program_event+0x78/0x81
[   23.110014]  [<ffffffff81074428>] ? tick_dev_program_event+0x2f/0x8e
[   23.116357]  [<ffffffff810adeae>] ? trace_hardirqs_off_caller+0x11/0x25
[   23.122959]  [<ffffffff8106bb66>] ? sched_clock_local+0x11/0x76
[   23.128867]  [<ffffffff811dd81a>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[   23.135381]  [<ffffffff813abe8f>] ? page_fault+0x1f/0x30
[   23.140692]  [<ffffffffa019b8c5>] ? scsi_run_queue+0x24/0xe3 [scsi_mod]
[   23.147300]  [<ffffffffa019c205>] ? scsi_next_command+0x3b/0x4c [scsi_mod]
[   23.154166]  [<ffffffffa019c84a>] ? scsi_end_request+0x83/0x94 [scsi_mod]
[   23.160946]  [<ffffffffa019cbea>] ? scsi_io_completion+0x1b0/0x3fb [scsi_mod]
[   23.168072]  [<ffffffffa019b635>] ? spin_unlock_irqrestore+0xe/0x10 [scsi_mod]
[   23.175283]  [<ffffffffa0195159>] ? scsi_finish_command+0xeb/0xf4 [scsi_mod]
[   23.182329]  [<ffffffffa019d9df>] ? scsi_softirq_done+0x112/0x11b [scsi_mod]
[   23.189363]  [<ffffffff811c727e>] ? blk_done_softirq+0x4b/0x61
[   23.195185]  [<ffffffff8104f74c>] ? __do_softirq+0xbf/0x16e
[   23.200745]  [<ffffffff813b354c>] ? call_softirq+0x1c/0x30
[   23.206219]  [<ffffffff810041a3>] ? do_softirq+0x3d/0x86
[   23.211519]  [<ffffffff8104f44a>] ? invoke_softirq+0x17/0x20
[   23.217167]  [<ffffffff8104fa19>] ? irq_exit+0x57/0x98
[   23.222293]  [<ffffffff813b3c81>] ? do_IRQ+0x91/0xa8
[   23.227247]  [<ffffffff813abc53>] ? common_interrupt+0x13/0x13
[   23.233067]  <EOI>  [<ffffffff813abc9a>] ? retint_

I get no BUGs in dozens of boots if I revert commit 86cbfb5607d:

     [SCSI] put stricter guards on queue dead checks

     SCSI uses request_queue->queuedata == NULL as a signal that the queue
     is dying.  We set this state in the sdev release function.  However,
     this allows a small window where we release the last reference but
     haven't quite got to this stage yet and so something will try to take
     a reference in scsi_request_fn and oops.  It's very rare, but we had a
     report here, so we're pushing this as a bug fix

     The actual fix is to set request_queue->queuedata to NULL in
     scsi_remove_device() before we drop the reference.  This causes
     correct automatic rejects from scsi_request_fn as people who hold
     additional references try to submit work and prevents anything from
     getting a new reference to the sdev that way.

     Cc: stable@kernel.org
     Signed-off-by: James Bottomley <James.Bottomley@suse.de>

Please let me know if what further information you need, or if there is
anything I can do, to help resolve this.

Thanks -- Jim


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: 2.6.39-rc5+ BUG at scsi_run_queue+0x24/0xe3
  2011-05-03 16:53 2.6.39-rc5+ BUG at scsi_run_queue+0x24/0xe3 Jim Schutt
@ 2011-05-03 17:00 ` James Bottomley
  2011-05-03 17:27   ` Jim Schutt
  0 siblings, 1 reply; 7+ messages in thread
From: James Bottomley @ 2011-05-03 17:00 UTC (permalink / raw)
  To: Jim Schutt; +Cc: linux-kernel

On Tue, 2011-05-03 at 10:53 -0600, Jim Schutt wrote:
> Please let me know if what further information you need, or if there is
> anything I can do, to help resolve this.

I think this is the fix (already in rc-fixes):

James

---
>From 3e85ea868dbd60a84240be5c1eebc36841b9c568 Mon Sep 17 00:00:00 2001
From: James Bottomley <James.Bottomley@suse.de>
Date: Sun, 1 May 2011 09:42:07 -0500
Subject: [PATCH] [SCSI] fix oops in scsi_run_queue()

The recent commit closing the race window in device teardown:

commit 86cbfb5607d4b81b1a993ff689bbd2addd5d3a9b
Author: James Bottomley <James.Bottomley@suse.de>
Date:   Fri Apr 22 10:39:59 2011 -0500

    [SCSI] put stricter guards on queue dead checks

is causing a potential NULL deref in scsi_run_queue() because the
q->queuedata may already be NULL by the time this function is called.
Since we shouldn't be running a queue that is being torn down, simply
add a NULL check in scsi_run_queue() to forestall this.

Signed-off-by: James Bottomley <James.Bottomley@suse.de>

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index e9901b8..03979f4 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -404,6 +404,10 @@ static void scsi_run_queue(struct request_queue *q)
 	LIST_HEAD(starved_list);
 	unsigned long flags;
 
+	/* if the device is dead, sdev will be NULL, so no queue to run */
+	if (!sdev)
+		return;
+
 	if (scsi_target(sdev)->single_lun)
 		scsi_single_lun_run(sdev);
 



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: 2.6.39-rc5+ BUG at scsi_run_queue+0x24/0xe3
  2011-05-03 17:00 ` James Bottomley
@ 2011-05-03 17:27   ` Jim Schutt
  2011-05-03 17:37     ` James Bottomley
  0 siblings, 1 reply; 7+ messages in thread
From: Jim Schutt @ 2011-05-03 17:27 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel

James Bottomley wrote:
> On Tue, 2011-05-03 at 10:53 -0600, Jim Schutt wrote:
>> Please let me know if what further information you need, or if there is
>> anything I can do, to help resolve this.
> 
> I think this is the fix (already in rc-fixes):
> 
> James
> 
> ---
> From 3e85ea868dbd60a84240be5c1eebc36841b9c568 Mon Sep 17 00:00:00 2001
> From: James Bottomley <James.Bottomley@suse.de>
> Date: Sun, 1 May 2011 09:42:07 -0500
> Subject: [PATCH] [SCSI] fix oops in scsi_run_queue()
> 
> The recent commit closing the race window in device teardown:
> 
> commit 86cbfb5607d4b81b1a993ff689bbd2addd5d3a9b
> Author: James Bottomley <James.Bottomley@suse.de>
> Date:   Fri Apr 22 10:39:59 2011 -0500
> 
>     [SCSI] put stricter guards on queue dead checks
> 
> is causing a potential NULL deref in scsi_run_queue() because the
> q->queuedata may already be NULL by the time this function is called.
> Since we shouldn't be running a queue that is being torn down, simply
> add a NULL check in scsi_run_queue() to forestall this.
> 
> Signed-off-by: James Bottomley <James.Bottomley@suse.de>
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index e9901b8..03979f4 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -404,6 +404,10 @@ static void scsi_run_queue(struct request_queue *q)
>  	LIST_HEAD(starved_list);
>  	unsigned long flags;
>  
> +	/* if the device is dead, sdev will be NULL, so no queue to run */
> +	if (!sdev)
> +		return;
> +
>  	if (scsi_target(sdev)->single_lun)
>  		scsi_single_lun_run(sdev);
>  

Hmmm, with the above added, I still get BUGs.  Here's an
example:

[   17.142931] BUG: unable to handle kernel NULL pointer dereference at           (null)
[   17.143002] IP: [<ffffffffa01cf8c5>] scsi_run_queue+0x24/0xec [scsi_mod]
[   17.143002] PGD 128257067 PUD 129da5067 PMD 0
[   17.143002] Oops: 0000 [#1] SMP
[   17.143002] last sysfs file: /sys/devices/platform/pcspkr/input/input0/event0/dev
[   17.143002] CPU 1
[   17.143002] Modules linked in: megaraid_sas ide_cd_mod cdrom button ib_mthca(+) ib_mad ib_core serio_raw floppy(+) dcdbas tpm_tis ata_piix tpm tpm_bios libata i5k_amb hwmon iTCO_wdt scsi_mod iTCO_vendor_support i5000_edac ehci_hcd pcspkr edac_core uhci_hcd rtc nfs nfs_acl auth_rpcgss fscache lockd sunrpc tg3 bnx2 e1000
[   17.143002]
[   17.143002] Pid: 1751, comm: path_id Not tainted 2.6.39-rc5-00140-g6a9a2d5 #24 Dell Inc. PowerEdge 1950/0DT097
[   17.143002] RIP: 0010:[<ffffffffa01cf8c5>]  [<ffffffffa01cf8c5>] scsi_run_queue+0x24/0xec [scsi_mod]
[   17.143002] RSP: 0000:ffff88012fc43d10  EFLAGS: 00010246
[   17.143002] RAX: ffff880127393700 RBX: ffff880127393700 RCX: ffff88012f002900
[   17.143002] RDX: 0000000000000000 RSI: 0000000000000037 RDI: 0000000000000000
[   17.143002] RBP: ffff88012fc43d60 R08: 0000000000000286 R09: ffffea00040947f0
[   17.143002] R10: ffff88012f002900 R11: ffff88012fc43cf0 R12: ffff880126cdcf80
[   17.143002] R13: ffff880126d45138 R14: 0000000000000000 R15: ffff880126cdcf80
[   17.143002] FS:  0000000000000000(0000) GS:ffff88012fc40000(0000) knlGS:0000000000000000
[   17.143002] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   17.143002] CR2: 0000000000000000 CR3: 0000000126d0f000 CR4: 00000000000006e0
[   17.143002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   17.143002] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   17.143002] Process path_id (pid: 1751, threadinfo ffff880127a10000, task ffff880129c02d20)
[   17.143002] Stack:
[   17.143002]  0000000000000282 ffff880126cdcf80 0000000000000000 ffff880126cdcf80
[   17.143002]  ffff88012fc43d60 ffff880127393700 ffff880126cdcf80 ffff880126d45138
[   17.143002]  0000000000000000 ffff880126cdcf80 ffff88012fc43d90 ffffffffa01d020e
[   17.143002] Call Trace:
[   17.143002]  <IRQ>
[   17.143002]  [<ffffffffa01d020e>] scsi_next_command+0x3b/0x4c [scsi_mod]
[   17.143002]  [<ffffffffa01d0853>] scsi_end_request+0x83/0x94 [scsi_mod]
[   17.143002]  [<ffffffffa01d0bf3>] scsi_io_completion+0x1b0/0x3fb [scsi_mod]
[   17.143002]  [<ffffffffa01cf635>] ? spin_unlock_irqrestore+0xe/0x10 [scsi_mod]
[   17.143002]  [<ffffffffa01c9159>] scsi_finish_command+0xeb/0xf4 [scsi_mod]
[   17.143002]  [<ffffffffa01d19e8>] scsi_softirq_done+0x112/0x11e [scsi_mod]
[   17.143002]  [<ffffffff811c727e>] blk_done_softirq+0x4b/0x61
[   17.143002]  [<ffffffff8104f74c>] __do_softirq+0xbf/0x16e
[   17.143002]  [<ffffffff813b354c>] call_softirq+0x1c/0x30
[   17.143002]  [<ffffffff810041a3>] do_softirq+0x3d/0x86
[   17.143002]  [<ffffffff8104f44a>] invoke_softirq+0x17/0x20
[   17.143002]  [<ffffffff8104fa19>] irq_exit+0x57/0x98
[   17.143002]  [<ffffffff813b3c81>] do_IRQ+0x91/0xa8
[   17.143002]  [<ffffffff813abc53>] common_interrupt+0x13/0x13
[   17.143002]  <EOI>
[   17.143002]  [<ffffffff811078c0>] ? kmem_cache_create+0x175/0x175
[   17.143002]  [<ffffffff810eeb06>] ? anon_vma_alloc+0x1a/0x2b
[   17.143002]  [<ffffffff810eec0a>] anon_vma_prepare+0x60/0xfe
[   17.143002]  [<ffffffff810e2f59>] __do_fault+0xc8/0x360
[   17.143002]  [<ffffffff810e3227>] do_linear_fault+0x36/0x38
[   17.143002]  [<ffffffff8102ea08>] ? pgtable_page_ctor+0x1a/0x1c
[   17.143002]  [<ffffffff810e4112>] handle_pte_fault+0x6a/0x170
[   17.143002]  [<ffffffff810e2a1e>] ? spin_lock+0xe/0x10
[   17.143002]  [<ffffffff810e56e6>] handle_mm_fault+0x15f/0x177
[   17.143002]  [<ffffffff813aebda>] do_page_fault+0x244/0x331
[   17.143002]  [<ffffffff810eb086>] ? do_mmap_pgoff+0x267/0x2cc
[   17.143002]  [<ffffffff810adeae>] ? trace_hardirqs_off_caller+0x11/0x25
[   17.143002]  [<ffffffff811dd81a>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[   17.143002]  [<ffffffff813abe8f>] page_fault+0x1f/0x30
[   17.143002] Code: ff ff 5b 41 5c c9 c3 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 28 0f 1f 44 00 00 48 89 7d b8 48 8b bf 40 03 00 00 48 85 ff <4c> 8b 37 0f 84 b0 00 00 00 48 8d 5d c0 48 89 5d c0 48 89 5d c8
[   17.143002] RIP  [<ffffffffa01cf8c5>] scsi_run_queue+0x24/0xec [scsi_mod]
[   17.143002]  RSP <ffff88012fc43d10>
[   17.143002] CR2: 0000000000000000
[   17.535741] ---[ end trace 97dde672b920540a ]---

Please let me know what else I can do to help sort this out.

-- Jim


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: 2.6.39-rc5+ BUG at scsi_run_queue+0x24/0xe3
  2011-05-03 17:27   ` Jim Schutt
@ 2011-05-03 17:37     ` James Bottomley
  2011-05-03 17:54       ` Jim Schutt
  0 siblings, 1 reply; 7+ messages in thread
From: James Bottomley @ 2011-05-03 17:37 UTC (permalink / raw)
  To: Jim Schutt; +Cc: linux-kernel, linux-scsi

On Tue, 2011-05-03 at 11:27 -0600, Jim Schutt wrote:
> James Bottomley wrote:
> > On Tue, 2011-05-03 at 10:53 -0600, Jim Schutt wrote:
> >> Please let me know if what further information you need, or if there is
> >> anything I can do, to help resolve this.
> > 
> > I think this is the fix (already in rc-fixes):
> > 
> > James
> > 
> > ---
> > From 3e85ea868dbd60a84240be5c1eebc36841b9c568 Mon Sep 17 00:00:00 2001
> > From: James Bottomley <James.Bottomley@suse.de>
> > Date: Sun, 1 May 2011 09:42:07 -0500
> > Subject: [PATCH] [SCSI] fix oops in scsi_run_queue()
> > 
> > The recent commit closing the race window in device teardown:
> > 
> > commit 86cbfb5607d4b81b1a993ff689bbd2addd5d3a9b
> > Author: James Bottomley <James.Bottomley@suse.de>
> > Date:   Fri Apr 22 10:39:59 2011 -0500
> > 
> >     [SCSI] put stricter guards on queue dead checks
> > 
> > is causing a potential NULL deref in scsi_run_queue() because the
> > q->queuedata may already be NULL by the time this function is called.
> > Since we shouldn't be running a queue that is being torn down, simply
> > add a NULL check in scsi_run_queue() to forestall this.
> > 
> > Signed-off-by: James Bottomley <James.Bottomley@suse.de>
> > 
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index e9901b8..03979f4 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -404,6 +404,10 @@ static void scsi_run_queue(struct request_queue *q)
> >  	LIST_HEAD(starved_list);
> >  	unsigned long flags;
> >  
> > +	/* if the device is dead, sdev will be NULL, so no queue to run */
> > +	if (!sdev)
> > +		return;
> > +
> >  	if (scsi_target(sdev)->single_lun)
> >  		scsi_single_lun_run(sdev);
> >  
> 
> Hmmm, with the above added, I still get BUGs.  Here's an
> example:
> 
> [   17.142931] BUG: unable to handle kernel NULL pointer dereference at           (null)
> [   17.143002] IP: [<ffffffffa01cf8c5>] scsi_run_queue+0x24/0xec [scsi_mod]

Ooh, compiler optimisation, I think; try this instead

James

---

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index e9901b8..0bac91e 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -400,10 +400,15 @@ static inline int scsi_host_is_busy(struct Scsi_Host *shost)
 static void scsi_run_queue(struct request_queue *q)
 {
 	struct scsi_device *sdev = q->queuedata;
-	struct Scsi_Host *shost = sdev->host;
+	struct Scsi_Host *shost;
 	LIST_HEAD(starved_list);
 	unsigned long flags;
 
+	/* if the device is dead, sdev will be NULL, so no queue to run */
+	if (!sdev)
+		return;
+
+	shost = sdev->host;
 	if (scsi_target(sdev)->single_lun)
 		scsi_single_lun_run(sdev);
 




^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: 2.6.39-rc5+ BUG at scsi_run_queue+0x24/0xe3
  2011-05-03 17:37     ` James Bottomley
@ 2011-05-03 17:54       ` Jim Schutt
  2011-05-03 18:52         ` Jim Schutt
  0 siblings, 1 reply; 7+ messages in thread
From: Jim Schutt @ 2011-05-03 17:54 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel, linux-scsi

James Bottomley wrote:
> On Tue, 2011-05-03 at 11:27 -0600, Jim Schutt wrote:
>> James Bottomley wrote:
>>> On Tue, 2011-05-03 at 10:53 -0600, Jim Schutt wrote:
>>>> Please let me know if what further information you need, or if there is
>>>> anything I can do, to help resolve this.
>>> I think this is the fix (already in rc-fixes):
>>>
>>> James
>>>
>>> ---
>>> From 3e85ea868dbd60a84240be5c1eebc36841b9c568 Mon Sep 17 00:00:00 2001
>>> From: James Bottomley <James.Bottomley@suse.de>
>>> Date: Sun, 1 May 2011 09:42:07 -0500
>>> Subject: [PATCH] [SCSI] fix oops in scsi_run_queue()
>>>
>>> The recent commit closing the race window in device teardown:
>>>
>>> commit 86cbfb5607d4b81b1a993ff689bbd2addd5d3a9b
>>> Author: James Bottomley <James.Bottomley@suse.de>
>>> Date:   Fri Apr 22 10:39:59 2011 -0500
>>>
>>>     [SCSI] put stricter guards on queue dead checks
>>>
>>> is causing a potential NULL deref in scsi_run_queue() because the
>>> q->queuedata may already be NULL by the time this function is called.
>>> Since we shouldn't be running a queue that is being torn down, simply
>>> add a NULL check in scsi_run_queue() to forestall this.
>>>
>>> Signed-off-by: James Bottomley <James.Bottomley@suse.de>
>>>
>>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
>>> index e9901b8..03979f4 100644
>>> --- a/drivers/scsi/scsi_lib.c
>>> +++ b/drivers/scsi/scsi_lib.c
>>> @@ -404,6 +404,10 @@ static void scsi_run_queue(struct request_queue *q)
>>>  	LIST_HEAD(starved_list);
>>>  	unsigned long flags;
>>>  
>>> +	/* if the device is dead, sdev will be NULL, so no queue to run */
>>> +	if (!sdev)
>>> +		return;
>>> +
>>>  	if (scsi_target(sdev)->single_lun)
>>>  		scsi_single_lun_run(sdev);
>>>  
>> Hmmm, with the above added, I still get BUGs.  Here's an
>> example:
>>
>> [   17.142931] BUG: unable to handle kernel NULL pointer dereference at           (null)
>> [   17.143002] IP: [<ffffffffa01cf8c5>] scsi_run_queue+0x24/0xec [scsi_mod]
> 
> Ooh, compiler optimisation, I think; try this instead
> 
> James
> 
> ---
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index e9901b8..0bac91e 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -400,10 +400,15 @@ static inline int scsi_host_is_busy(struct Scsi_Host *shost)
>  static void scsi_run_queue(struct request_queue *q)
>  {
>  	struct scsi_device *sdev = q->queuedata;
> -	struct Scsi_Host *shost = sdev->host;
> +	struct Scsi_Host *shost;
>  	LIST_HEAD(starved_list);
>  	unsigned long flags;
>  
> +	/* if the device is dead, sdev will be NULL, so no queue to run */
> +	if (!sdev)
> +		return;
> +
> +	shost = sdev->host;
>  	if (scsi_target(sdev)->single_lun)
>  		scsi_single_lun_run(sdev);
>  

Yes, that definitely fixes things for me.

Thanks!!

-- Jim


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: 2.6.39-rc5+ BUG at scsi_run_queue+0x24/0xe3
  2011-05-03 17:54       ` Jim Schutt
@ 2011-05-03 18:52         ` Jim Schutt
  2011-05-03 20:36           ` James Bottomley
  0 siblings, 1 reply; 7+ messages in thread
From: Jim Schutt @ 2011-05-03 18:52 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel, linux-scsi

Hi James,

FWIW, I noticed that commit 99f3c722e23 in scsi-rc-fixes-2.6
might want a Cc: stable@kernel.org, since the commit it's
fixing, 86cbfb5607d4, had one.

I'm not sure how that all works, but I thought I'd mention it
just in case.

-- Jim

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: 2.6.39-rc5+ BUG at scsi_run_queue+0x24/0xe3
  2011-05-03 18:52         ` Jim Schutt
@ 2011-05-03 20:36           ` James Bottomley
  0 siblings, 0 replies; 7+ messages in thread
From: James Bottomley @ 2011-05-03 20:36 UTC (permalink / raw)
  To: Jim Schutt; +Cc: linux-kernel, linux-scsi

On Tue, 2011-05-03 at 12:52 -0600, Jim Schutt wrote:
> FWIW, I noticed that commit 99f3c722e23 in scsi-rc-fixes-2.6
> might want a Cc: stable@kernel.org, since the commit it's
> fixing, 86cbfb5607d4, had one.
> 
> I'm not sure how that all works, but I thought I'd mention it
> just in case.

Yes, good point; I'll add it ... destabilising the stable kernels
wouldn't be a very good idea.

James



^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2011-05-03 20:36 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-05-03 16:53 2.6.39-rc5+ BUG at scsi_run_queue+0x24/0xe3 Jim Schutt
2011-05-03 17:00 ` James Bottomley
2011-05-03 17:27   ` Jim Schutt
2011-05-03 17:37     ` James Bottomley
2011-05-03 17:54       ` Jim Schutt
2011-05-03 18:52         ` Jim Schutt
2011-05-03 20:36           ` James Bottomley

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox