[Bug 12020] New: scsi_times_out NULL pointer dereference

linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [Bug 12020] New: scsi_times_out NULL pointer dereference
@ 2008-11-13 18:30 bugme-daemon
  2008-11-13 18:40 ` [Bug 12020] " bugme-daemon
                   ` (12 more replies)
  0 siblings, 13 replies; 17+ messages in thread
From: bugme-daemon @ 2008-11-13 18:30 UTC (permalink / raw)
  To: linux-scsi

http://bugzilla.kernel.org/show_bug.cgi?id=12020

           Summary: scsi_times_out NULL pointer dereference
           Product: SCSI Drivers
           Version: 2.5
     KernelVersion: 2.6.28-git20081113
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Other
        AssignedTo: scsi_drivers-other@kernel-bugs.osdl.org
        ReportedBy: bs@q-leap.de


Latest working kernel version: 2.6.27
Earliest failing kernel version: 2.6.28-rc4
Hardware Environment: Infortrend G2430 connected to LSI22320R
Problem Description:

Hello,

first in 2.6.28-rc{1,2,3} the error handler was entirely broken - it
deadlocked. In rc4 this is fixed, but now I already two times got a Null
pointer dereference while doing some error handler tests. All of that looks
like due to the scsi timeout commits.

Steps to reproduce: E.g. reset devices connected to LSI 53C1030 devices using
lsiutil. Can be reproduced on about 20% eh activations.

(gdb) l *(scsi_times_out+0x15)
0xffffffff80460f1e is in scsi_times_out (drivers/scsi/scsi_error.c:176).
171             enum blk_eh_timer_return (*eh_timed_out)(struct scsi_cmnd *);
172             enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
173
174             scsi_log_completion(scmd, TIMEOUT_ERROR);
175
176             if (scmd->device->host->transportt->eh_timed_out)
177                     eh_timed_out =
scmd->device->host->transportt->eh_timed_out;
178             else if (scmd->device->host->hostt->eh_timed_out)
179                     eh_timed_out = scmd->device->host->hostt->eh_timed_out;
180             else

[  143.804672] BUG: unable to handle kernel NULL pointer dereference at
0000000000000000
[  143.808507] IP: [<ffffffff80460f1e>] scsi_times_out+0x15/0x71
[  143.816020] PGD f9381067 PUD f9360067 PMD 0
[  143.824018] Oops: 0000 [#1] SMP
[  143.824018] last sysfs file:
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map
[  143.832016] Dumping ftrace buffer:
[  143.832016]    (ftrace buffer empty)
[  143.832016] CPU 1
[  143.832016] Modules linked in: mptctl ib_ipoib inet_lro ib_umad rdma_ucm
rdma_cm ib_cm iw_cm ib_sa ib_addr ib_uvee
[  143.832016] Pid: 246, comm: pdflush Not tainted 2.6.28-rc4-bs1 #10
[  143.832016] RIP: 0010:[<ffffffff80460f1e>]  [<ffffffff80460f1e>]
scsi_times_out+0x15/0x71
[  143.832016] RSP: 0018:ffff88007f6a3df0  EFLAGS: 00010086
[  143.832016] RAX: ffff88007ebf5330 RBX: 0000000000000000 RCX:
ffff8800f93804b8
[  143.832016] RDX: ffff88007ebf5948 RSI: 0000000000000246 RDI:
ffff8800f9380378
[  143.832016] RBP: ffff88007f6a3e00 R08: 0000000000000000 R09:
0000000000000000
[  143.832016] R10: ffff8800f9144680 R11: ffff88007eeac240 R12:
ffff88007ebf5330
[  143.832016] R13: ffff88007ebf5808 R14: ffffffff80380461 R15:
0000000000000000
[  143.832016] FS:  0000000000733860(0000) GS:ffff8800fb29ab40(0000)
knlGS:0000000000000000
[  143.832016] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[  143.832016] CR2: 0000000000000000 CR3: 00000000e80ec000 CR4:
00000000000006e0
[  143.832016] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[  143.832016] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[  143.832016] Process pdflush (pid: 246, threadinfo ffff88007ed12000, task
ffff88007ed11890)
[  143.832016] Stack:
[  143.832016]  ffff88007f6a3e00 ffff8800f9380378 ffff88007f6a3e20
ffffffff80380426
[  143.832016]  ffff88007ebf5330 ffff8800f9380378 ffff88007f6a3e70
ffffffff803804f9
[  143.832016]  ffff88007eea0000 ffff88007ebf5668 0000000000000246
ffff88007ebf5330
[  143.832016] Call Trace:
[  143.832016]  <IRQ> <0> [<ffffffff80380426>] blk_rq_timed_out+0x1b/0x56
[  143.832016]  [<ffffffff803804f9>] blk_rq_timed_out_timer+0x98/0x118
[  143.832016]  [<ffffffff80380461>] ? blk_rq_timed_out_timer+0x0/0x118
[  143.832016]  [<ffffffff802464e2>] run_timer_softirq+0x14c/0x1cc
[  143.832016]  [<ffffffff80242392>] __do_softirq+0x83/0x128
[  143.832016]  [<ffffffff8020d03c>] call_softirq+0x1c/0x28
[  143.832016]  [<ffffffff8020ea39>] do_softirq+0x49/0x90
[  143.832016]  [<ffffffff802422aa>] irq_exit+0x44/0x46
[  143.832016]  [<ffffffff8020e88b>] do_IRQ+0xba/0xcf


-- 
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug 12020] scsi_times_out NULL pointer dereference
  2008-11-13 18:30 [Bug 12020] New: scsi_times_out NULL pointer dereference bugme-daemon
@ 2008-11-13 18:40 ` bugme-daemon
  2008-11-13 19:03 ` [Bug 12020] New: " James Bottomley
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: bugme-daemon @ 2008-11-13 18:40 UTC (permalink / raw)
  To: linux-scsi

http://bugzilla.kernel.org/show_bug.cgi?id=12020


akpm@osdl.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Regression|0                           |1




-- 
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Bug 12020] New: scsi_times_out NULL pointer dereference
  2008-11-13 18:30 [Bug 12020] New: scsi_times_out NULL pointer dereference bugme-daemon
  2008-11-13 18:40 ` [Bug 12020] " bugme-daemon
@ 2008-11-13 19:03 ` James Bottomley
  2008-11-13 22:46   ` James Bottomley
  2008-11-13 19:03 ` [Bug 12020] " bugme-daemon
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 17+ messages in thread
From: James Bottomley @ 2008-11-13 19:03 UTC (permalink / raw)
  To: bugme-daemon; +Cc: linux-scsi, Jens Axboe

On Thu, 2008-11-13 at 10:30 -0800, bugme-daemon@bugzilla.kernel.org
wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12020
> 
>            Summary: scsi_times_out NULL pointer dereference
>            Product: SCSI Drivers
>            Version: 2.5
>      KernelVersion: 2.6.28-git20081113
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Other
>         AssignedTo: scsi_drivers-other@kernel-bugs.osdl.org
>         ReportedBy: bs@q-leap.de
> 
> 
> Latest working kernel version: 2.6.27
> Earliest failing kernel version: 2.6.28-rc4
> Hardware Environment: Infortrend G2430 connected to LSI22320R
> Problem Description:
> 
> Hello,
> 
> first in 2.6.28-rc{1,2,3} the error handler was entirely broken - it
> deadlocked. In rc4 this is fixed, but now I already two times got a Null
> pointer dereference while doing some error handler tests. All of that looks
> like due to the scsi timeout commits.
> 
> Steps to reproduce: E.g. reset devices connected to LSI 53C1030 devices using
> lsiutil. Can be reproduced on about 20% eh activations.
> 
> (gdb) l *(scsi_times_out+0x15)
> 0xffffffff80460f1e is in scsi_times_out (drivers/scsi/scsi_error.c:176).
> 171             enum blk_eh_timer_return (*eh_timed_out)(struct scsi_cmnd *);
> 172             enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
> 173
> 174             scsi_log_completion(scmd, TIMEOUT_ERROR);
> 175
> 176             if (scmd->device->host->transportt->eh_timed_out)
> 177                     eh_timed_out =
> scmd->device->host->transportt->eh_timed_out;
> 178             else if (scmd->device->host->hostt->eh_timed_out)
> 179                     eh_timed_out = scmd->device->host->hostt->eh_timed_out;
> 180             else

Actually, I think the trace is slightly off.  I suspect this is the
problem:

	struct scsi_cmnd *scmd = req->special;

I bet req->special is NULL because the command timed out even before it
was prepared by the subsystem.

Does this fix it?

The fix is more of a bandaid than anything ... we can't really have
commands timing out in the mid-layer because we expect we have full
control of them.  With this patch, if we run out of resets, block will
complete a command we're still processing.

James

---

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 94ed262..5612c42 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -127,6 +127,13 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
 	enum blk_eh_timer_return (*eh_timed_out)(struct scsi_cmnd *);
 	enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
 
+	if (!scmd)
+		/*
+		 * nasty: command timed out before the mid layer
+		 * even prepared it
+		 */
+		return BLK_EH_RESET_TIMER;
+
 	scsi_log_completion(scmd, TIMEOUT_ERROR);
 
 	if (scmd->device->host->transportt->eh_timed_out)



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [Bug 12020] scsi_times_out NULL pointer dereference
  2008-11-13 18:30 [Bug 12020] New: scsi_times_out NULL pointer dereference bugme-daemon
  2008-11-13 18:40 ` [Bug 12020] " bugme-daemon
  2008-11-13 19:03 ` [Bug 12020] New: " James Bottomley
@ 2008-11-13 19:03 ` bugme-daemon
  2008-11-13 20:12 ` bugme-daemon
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: bugme-daemon @ 2008-11-13 19:03 UTC (permalink / raw)
  To: linux-scsi

http://bugzilla.kernel.org/show_bug.cgi?id=12020





------- Comment #1 from anonymous@kernel-bugs.osdl.org  2008-11-13 11:03 -------
Reply-To: James.Bottomley@HansenPartnership.com

On Thu, 2008-11-13 at 10:30 -0800, bugme-daemon@bugzilla.kernel.org
wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12020
> 
>            Summary: scsi_times_out NULL pointer dereference
>            Product: SCSI Drivers
>            Version: 2.5
>      KernelVersion: 2.6.28-git20081113
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Other
>         AssignedTo: scsi_drivers-other@kernel-bugs.osdl.org
>         ReportedBy: bs@q-leap.de
> 
> 
> Latest working kernel version: 2.6.27
> Earliest failing kernel version: 2.6.28-rc4
> Hardware Environment: Infortrend G2430 connected to LSI22320R
> Problem Description:
> 
> Hello,
> 
> first in 2.6.28-rc{1,2,3} the error handler was entirely broken - it
> deadlocked. In rc4 this is fixed, but now I already two times got a Null
> pointer dereference while doing some error handler tests. All of that looks
> like due to the scsi timeout commits.
> 
> Steps to reproduce: E.g. reset devices connected to LSI 53C1030 devices using
> lsiutil. Can be reproduced on about 20% eh activations.
> 
> (gdb) l *(scsi_times_out+0x15)
> 0xffffffff80460f1e is in scsi_times_out (drivers/scsi/scsi_error.c:176).
> 171             enum blk_eh_timer_return (*eh_timed_out)(struct scsi_cmnd *);
> 172             enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
> 173
> 174             scsi_log_completion(scmd, TIMEOUT_ERROR);
> 175
> 176             if (scmd->device->host->transportt->eh_timed_out)
> 177                     eh_timed_out =
> scmd->device->host->transportt->eh_timed_out;
> 178             else if (scmd->device->host->hostt->eh_timed_out)
> 179                     eh_timed_out = scmd->device->host->hostt->eh_timed_out;
> 180             else

Actually, I think the trace is slightly off.  I suspect this is the
problem:

        struct scsi_cmnd *scmd = req->special;

I bet req->special is NULL because the command timed out even before it
was prepared by the subsystem.

Does this fix it?

The fix is more of a bandaid than anything ... we can't really have
commands timing out in the mid-layer because we expect we have full
control of them.  With this patch, if we run out of resets, block will
complete a command we're still processing.

James

---

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 94ed262..5612c42 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -127,6 +127,13 @@ enum blk_eh_timer_return scsi_times_out(struct request
*req)
        enum blk_eh_timer_return (*eh_timed_out)(struct scsi_cmnd *);
        enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;

+       if (!scmd)
+               /*
+                * nasty: command timed out before the mid layer
+                * even prepared it
+                */
+               return BLK_EH_RESET_TIMER;
+
        scsi_log_completion(scmd, TIMEOUT_ERROR);

        if (scmd->device->host->transportt->eh_timed_out)


-- 
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [Bug 12020] scsi_times_out NULL pointer dereference
  2008-11-13 18:30 [Bug 12020] New: scsi_times_out NULL pointer dereference bugme-daemon
                   ` (2 preceding siblings ...)
  2008-11-13 19:03 ` [Bug 12020] " bugme-daemon
@ 2008-11-13 20:12 ` bugme-daemon
  2008-11-13 20:22   ` James Bottomley
  2008-11-13 20:23 ` bugme-daemon
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 17+ messages in thread
From: bugme-daemon @ 2008-11-13 20:12 UTC (permalink / raw)
  To: linux-scsi

http://bugzilla.kernel.org/show_bug.cgi?id=12020





------- Comment #2 from bs@q-leap.de  2008-11-13 12:12 -------
Thanks going to test it now. 

While we are at this function, could you please check 

        if (eh_timed_out)
                rtn = eh_timed_out(scmd);
                switch (rtn) {
                case BLK_EH_NOT_HANDLED:
                        break;
                default:
                        return rtn;
                }


Is the indentation wrong or are there missing if-braces?


Thanks,
Bernd


-- 
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Bug 12020] scsi_times_out NULL pointer dereference
  2008-11-13 20:12 ` bugme-daemon
@ 2008-11-13 20:22   ` James Bottomley
  0 siblings, 0 replies; 17+ messages in thread
From: James Bottomley @ 2008-11-13 20:22 UTC (permalink / raw)
  To: bugme-daemon; +Cc: linux-scsi

On Thu, 2008-11-13 at 12:12 -0800, bugme-daemon@bugzilla.kernel.org
wrote:
> While we are at this function, could you please check 
> 
>         if (eh_timed_out)
>                 rtn = eh_timed_out(scmd);
>                 switch (rtn) {
>                 case BLK_EH_NOT_HANDLED:
>                         break;
>                 default:
>                         return rtn;
>                 }
> 
> 
> Is the indentation wrong or are there missing if-braces?

It's not as intended, but harmless:

http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=6ec39f02cf48df89c3cbab4aeef521569fec00e4

James



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug 12020] scsi_times_out NULL pointer dereference
  2008-11-13 18:30 [Bug 12020] New: scsi_times_out NULL pointer dereference bugme-daemon
                   ` (3 preceding siblings ...)
  2008-11-13 20:12 ` bugme-daemon
@ 2008-11-13 20:23 ` bugme-daemon
  2008-11-13 21:36 ` bugme-daemon
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: bugme-daemon @ 2008-11-13 20:23 UTC (permalink / raw)
  To: linux-scsi

http://bugzilla.kernel.org/show_bug.cgi?id=12020





------- Comment #3 from anonymous@kernel-bugs.osdl.org  2008-11-13 12:23 -------
Reply-To: James.Bottomley@HansenPartnership.com

On Thu, 2008-11-13 at 12:12 -0800, bugme-daemon@bugzilla.kernel.org
wrote:
> While we are at this function, could you please check 
> 
>         if (eh_timed_out)
>                 rtn = eh_timed_out(scmd);
>                 switch (rtn) {
>                 case BLK_EH_NOT_HANDLED:
>                         break;
>                 default:
>                         return rtn;
>                 }
> 
> 
> Is the indentation wrong or are there missing if-braces?

It's not as intended, but harmless:

http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=6ec39f02cf48df89c3cbab4aeef521569fec00e4

James


-- 
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug 12020] scsi_times_out NULL pointer dereference
  2008-11-13 18:30 [Bug 12020] New: scsi_times_out NULL pointer dereference bugme-daemon
                   ` (4 preceding siblings ...)
  2008-11-13 20:23 ` bugme-daemon
@ 2008-11-13 21:36 ` bugme-daemon
  2008-11-13 22:47 ` bugme-daemon
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: bugme-daemon @ 2008-11-13 21:36 UTC (permalink / raw)
  To: linux-scsi

http://bugzilla.kernel.org/show_bug.cgi?id=12020


rjw@sisk.pl changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rjw@sisk.pl
OtherBugsDependingO|                            |11808
              nThis|                            |




-- 
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Bug 12020] New: scsi_times_out NULL pointer dereference
  2008-11-13 19:03 ` [Bug 12020] New: " James Bottomley
@ 2008-11-13 22:46   ` James Bottomley
  0 siblings, 0 replies; 17+ messages in thread
From: James Bottomley @ 2008-11-13 22:46 UTC (permalink / raw)
  To: bugme-daemon; +Cc: linux-scsi, Jens Axboe

On Thu, 2008-11-13 at 13:03 -0600, James Bottomley wrote:
> Actually, I think the trace is slightly off.  I suspect this is the
> problem:
> 
> 	struct scsi_cmnd *scmd = req->special;
> 
> I bet req->special is NULL because the command timed out even before it
> was prepared by the subsystem.
> 
> Does this fix it?
> 
> The fix is more of a bandaid than anything ... we can't really have
> commands timing out in the mid-layer because we expect we have full
> control of them.  With this patch, if we run out of resets, block will
> complete a command we're still processing.
> 
> James
> 
> ---
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 94ed262..5612c42 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -127,6 +127,13 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
>  	enum blk_eh_timer_return (*eh_timed_out)(struct scsi_cmnd *);
>  	enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
>  
> +	if (!scmd)
> +		/*
> +		 * nasty: command timed out before the mid layer
> +		 * even prepared it
> +		 */
> +		return BLK_EH_RESET_TIMER;
> +
>  	scsi_log_completion(scmd, TIMEOUT_ERROR);
>  
>  	if (scmd->device->host->transportt->eh_timed_out)

Mike Anderson pointed out that we have a potential window where the
timer can fire after we've unprepped the request in SCSI (so making
req->special NULL) but before we call blk_requeue_request() which stops
the timer.  We can rejig the locking to prevent this from happening, so
could you (separately) try this patch?

James

---

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f5d3b96..3475b74 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -649,8 +643,8 @@ static void scsi_requeue_command(struct request_queue *q, struct scsi_cmnd *cmd)
 	struct request *req = cmd->request;
 	unsigned long flags;
 
-	scsi_unprep_request(req);
 	spin_lock_irqsave(q->queue_lock, flags);
+	scsi_unprep_request(req);
 	blk_requeue_request(q, req);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [Bug 12020] scsi_times_out NULL pointer dereference
  2008-11-13 18:30 [Bug 12020] New: scsi_times_out NULL pointer dereference bugme-daemon
                   ` (5 preceding siblings ...)
  2008-11-13 21:36 ` bugme-daemon
@ 2008-11-13 22:47 ` bugme-daemon
  2008-11-16 17:50 ` bugme-daemon
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: bugme-daemon @ 2008-11-13 22:47 UTC (permalink / raw)
  To: linux-scsi

http://bugzilla.kernel.org/show_bug.cgi?id=12020





------- Comment #4 from anonymous@kernel-bugs.osdl.org  2008-11-13 14:47 -------
Reply-To: James.Bottomley@HansenPartnership.com

On Thu, 2008-11-13 at 13:03 -0600, James Bottomley wrote:
> Actually, I think the trace is slightly off.  I suspect this is the
> problem:
> 
> 	struct scsi_cmnd *scmd = req->special;
> 
> I bet req->special is NULL because the command timed out even before it
> was prepared by the subsystem.
> 
> Does this fix it?
> 
> The fix is more of a bandaid than anything ... we can't really have
> commands timing out in the mid-layer because we expect we have full
> control of them.  With this patch, if we run out of resets, block will
> complete a command we're still processing.
> 
> James
> 
> ---
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 94ed262..5612c42 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -127,6 +127,13 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
>  	enum blk_eh_timer_return (*eh_timed_out)(struct scsi_cmnd *);
>  	enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
>  
> +	if (!scmd)
> +		/*
> +		 * nasty: command timed out before the mid layer
> +		 * even prepared it
> +		 */
> +		return BLK_EH_RESET_TIMER;
> +
>  	scsi_log_completion(scmd, TIMEOUT_ERROR);
>  
>  	if (scmd->device->host->transportt->eh_timed_out)

Mike Anderson pointed out that we have a potential window where the
timer can fire after we've unprepped the request in SCSI (so making
req->special NULL) but before we call blk_requeue_request() which stops
the timer.  We can rejig the locking to prevent this from happening, so
could you (separately) try this patch?

James

---

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f5d3b96..3475b74 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -649,8 +643,8 @@ static void scsi_requeue_command(struct request_queue *q,
struct scsi_cmnd *cmd)
        struct request *req = cmd->request;
        unsigned long flags;

-       scsi_unprep_request(req);
        spin_lock_irqsave(q->queue_lock, flags);
+       scsi_unprep_request(req);
        blk_requeue_request(q, req);
        spin_unlock_irqrestore(q->queue_lock, flags);



-- 
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [Bug 12020] scsi_times_out NULL pointer dereference
  2008-11-13 18:30 [Bug 12020] New: scsi_times_out NULL pointer dereference bugme-daemon
                   ` (6 preceding siblings ...)
  2008-11-13 22:47 ` bugme-daemon
@ 2008-11-16 17:50 ` bugme-daemon
  2008-11-20 15:12 ` bugme-daemon
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: bugme-daemon @ 2008-11-16 17:50 UTC (permalink / raw)
  To: linux-scsi

http://bugzilla.kernel.org/show_bug.cgi?id=12020





------- Comment #5 from rjw@sisk.pl  2008-11-16 09:50 -------
Patch : http://bugzilla.kernel.org/show_bug.cgi?id=12020#c4
Handled-By : James Bottomley <James.Bottomley@HansenPartnership.com>


-- 
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug 12020] scsi_times_out NULL pointer dereference
  2008-11-13 18:30 [Bug 12020] New: scsi_times_out NULL pointer dereference bugme-daemon
                   ` (7 preceding siblings ...)
  2008-11-16 17:50 ` bugme-daemon
@ 2008-11-20 15:12 ` bugme-daemon
  2008-11-20 19:36   ` Mike Anderson
  2008-11-20 19:36 ` bugme-daemon
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 17+ messages in thread
From: bugme-daemon @ 2008-11-20 15:12 UTC (permalink / raw)
  To: linux-scsi

http://bugzilla.kernel.org/show_bug.cgi?id=12020





------- Comment #6 from git.user@gmail.com  2008-11-20 07:12 -------
looks very similar

[  316.336654] BUG: unable to handle kernel NULL pointer dereference at
0000000000000000
[  316.339972] IP: [<ffffffff803f84d3>] scsi_times_out+0x10/0x72
[  316.339972] PGD 3e627067 PUD 3de0b067 PMD 0
[  316.339972] Oops: 0000 [#1] PREEMPT SMP
[  316.339972] last sysfs file:
/sys/devices/virtual/block/md0/md/metadata_version
[  316.339972] Dumping ftrace buffer:
[  316.339972]    (ftrace buffer empty)
[  316.339972] CPU 1
[  316.339972] Modules linked in: floppy sg
[  316.339972] Pid: 0, comm: swapper Not tainted 2.6.28-rc5 #1
[  316.339972] RIP: 0010:[<ffffffff803f84d3>]  [<ffffffff803f84d3>]
scsi_times_out+0x10/0x72
[  316.339972] RSP: 0018:ffff88003fb53e20  EFLAGS: 00010082
[  316.339972] RAX: ffff88003ef60000 RBX: 0000000000000000 RCX:
ffff88003ef60308
[  316.339972] RDX: ffff88003ef60308 RSI: 0000000000006cb2 RDI:
ffff880033dae5c0
[  316.339972] RBP: ffff88003fb53e30 R08: ffff880001019180 R09:
0000000000000010
[  316.339972] R10: 0000000000000000 R11: 0000000000000001 R12:
ffff88003ef601c8
[  316.339972] R13: ffff88003ef60308 R14: 0000000000000102 R15:
0000000000000000
[  316.339972] FS:  0000000000000000(0000) GS:ffff88003fb23b00(0000)
knlGS:0000000000000000
[  316.339972] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[  316.339972] CR2: 0000000000000000 CR3: 000000003e7e8000 CR4:
00000000000006e0
[  316.339972] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[  316.339972] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[  316.339972] Process swapper (pid: 0, threadinfo ffff88003fb4e000, task
ffff88003f863500)
[  316.339972] Stack:
[  316.339972]  ffff88000101f900 ffff880033dae5c0 ffff88003fb53e50
ffffffff803544ca
[  316.339972]  ffff88003ef60000 ffff880033dae5c0 ffff88003fb53ea0
ffffffff803545d8
[  316.339972]  ffff88003fb53ea0 ffff88003ef60000 0000000000000286
ffff88003ef60000
[  316.339972] Call Trace:
[  316.339972]  <IRQ> <0> [<ffffffff803544ca>] blk_rq_timed_out+0x16/0x5c
[  316.339972]  [<ffffffff803545d8>] blk_rq_timed_out_timer+0xc8/0x138
[  316.339972]  [<ffffffff80354510>] ? blk_rq_timed_out_timer+0x0/0x138
[  316.339972]  [<ffffffff8023f8b7>] run_timer_softirq+0x183/0x1ec
[  316.339972]  [<ffffffff80254c0c>] ? tick_dev_program_event+0x6c/0xa4
[  316.339972]  [<ffffffff8023b326>] __do_softirq+0x72/0x128
[  316.339972]  [<ffffffff8020c8cc>] call_softirq+0x1c/0x30
[  316.339972]  [<ffffffff8020df2d>] do_softirq+0x3d/0x78
[  316.339972]  [<ffffffff8023b249>] irq_exit+0x8f/0x98
[  316.339972]  [<ffffffff8021d8e4>] smp_apic_timer_interrupt+0x8a/0xd6
[  316.339972]  [<ffffffff8020c31b>] apic_timer_interrupt+0x6b/0x70
[  316.339972]  <EOI> <0> [<ffffffff80212f22>] ? mwait_idle+0x45/0x4a
[  316.339972]  [<ffffffff80209deb>] ? enter_idle+0x22/0x24
[  316.339972]  [<ffffffff8020a386>] ? cpu_idle+0x41/0x80
[  316.339972] Code: cb ff ff 85 c0 74 a0 45 31 e4 eb d2 45 31 e4 44 89 e0 5b
41 5c 41 5d 41 5e 5d c3 55 48 89 e5 53 48 83 ec 08 48 8b 9f e0 00 00 00 <48> 8b
03 48 8b 10 48 8b 82 b8 00 00 00 48 8b 80 60 01 00 00 48
[  316.339972] RIP  [<ffffffff803f84d3>] scsi_times_out+0x10/0x72
[  316.339972]  RSP <ffff88003fb53e20>
[  316.339972] CR2: 0000000000000000
[  316.339972] Kernel panic - not syncing: Fatal exception in interrupt

in my case easily be triggered by disk activity (i.g. rsync) 
on rebuilding raid

On Thu, 2008-11-13 at 13:03 -0600, James Bottomley wrote:
> Actually, I think the trace is slightly off.  I suspect this is the
> problem:
> 
>       struct scsi_cmnd *scmd = req->special;
> 
> I bet req->special is NULL because the command timed out even before it
> was prepared by the subsystem.
> 
> Does this fix it?

In my case it doesn't 'fix', but proof of concept.
With your patch [i just printk-ing the comment] 
system remains locked printk-ing from time to time: 
"nasty: command timed out before the mid layer 
even prepared it"

> The fix is more of a bandaid than anything ... we can't really have
> commands timing out in the mid-layer because we expect we have full
> control of them.  With this patch, if we run out of resets, block will
> complete a command we're still processing.

here is a dmesg:
http://sysadminday.org.ru/2.6.28-rc5-git3/scsi_times_out-NULL_pointer_dereference_dmesg


-- 
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Bug 12020] scsi_times_out NULL pointer dereference
  2008-11-20 15:12 ` bugme-daemon
@ 2008-11-20 19:36   ` Mike Anderson
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Anderson @ 2008-11-20 19:36 UTC (permalink / raw)
  To: bugme-daemon; +Cc: linux-scsi, Jens Axboe, James Bottomley, Tejun Heo

I have two systems that are hitting similar signatures in scsi_times_out.

Note: that my testing is using a distro kernel, but in this area the code
is very similar. I will work to get a reproduction on mainline.

..but..

I added some debug to scsi_times_out and noticed that the request with no
scmd set in req->special also did not have REQ_STARTED set.

I added a WARN_ON check to blk_add_timer for any requests 
that we where starting a timer for that did not have REQ_STARTED. This is
shown below. This does not look good as the elv_dequeue_request is being
called off elv_next_request for some cases.

Call Trace:
[c00000007b747580] [c00000000027808c] .blk_add_timer+0x74/0x134
(unreliable)
[c00000007b747610] [c00000000026f9b8] .elv_dequeue_request+0x78/0x8c
[c00000007b747680] [c000000000275830] .blk_do_ordered+0x8c/0x31c
[c00000007b747720] [c00000000026fc18] .elv_next_request+0x24c/0x2d4
[c00000007b7477c0] [d000000000368004] .scsi_request_fn+0xc8/0x628
[scsi_mod]
[c00000007b7478a0] [c00000000026fdf4] .elv_insert+0x154/0x38c
[c00000007b747940] [c000000000273ad0] .__make_request+0x4e4/0x568
[c00000007b7479f0] [c000000000271a68] .generic_make_request+0x3f4/0x468
[c00000007b747af0] [c000000000271bd8] .submit_bio+0xfc/0x124
[c00000007b747bb0] [c000000000160a00] .submit_bh+0x14c/0x198
[c00000007b747c40] [c0000000001630a0] .sync_dirty_buffer+0xbc/0x15c
[c00000007b747cd0] [c0000000001fcac0]
.journal_commit_transaction+0x1014/0x158c
[c00000007b747e10] [c00000000020111c] .kjournald+0x104/0x2f4
[c00000007b747f00] [c0000000000a909c] .kthread+0x78/0xc4
[c00000007b747f90] [c00000000002ae2c] .kernel_thread+0x4c/0x68

I changed the previous mentioned WARN_ON to just do a return if the request
does not have REQ_STARTED. This corrected the issue of seeing an oops in
scsi_times_out. But this is just a hack.

Hope this analysis is not flawed because of kernel deltas. It also may not
address this specific issue being seen in this bug, but does appear to
indicate a possible path to get a request on the timeout list with out a
req->special set.

I think we may need to look at some of the paths that are calling
blkdev_dequeue_request and understand how to prevent blk_add_timer from
being called if we are not really starting a SCSI cmd.

-andmike
--
Michael Anderson
andmike@linux.vnet.ibm.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug 12020] scsi_times_out NULL pointer dereference
  2008-11-13 18:30 [Bug 12020] New: scsi_times_out NULL pointer dereference bugme-daemon
                   ` (8 preceding siblings ...)
  2008-11-20 15:12 ` bugme-daemon
@ 2008-11-20 19:36 ` bugme-daemon
  2008-12-03 10:19 ` bugme-daemon
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: bugme-daemon @ 2008-11-20 19:36 UTC (permalink / raw)
  To: linux-scsi

http://bugzilla.kernel.org/show_bug.cgi?id=12020

------- Comment #7 from anonymous@kernel-bugs.osdl.org  2008-11-20 11:36 -------
Reply-To: andmike@linux.vnet.ibm.com

I have two systems that are hitting similar signatures in scsi_times_out.

Note: that my testing is using a distro kernel, but in this area the code
is very similar. I will work to get a reproduction on mainline.

..but..

I added some debug to scsi_times_out and noticed that the request with no
scmd set in req->special also did not have REQ_STARTED set.

I added a WARN_ON check to blk_add_timer for any requests 
that we where starting a timer for that did not have REQ_STARTED. This is
shown below. This does not look good as the elv_dequeue_request is being
called off elv_next_request for some cases.

Call Trace:
[c00000007b747580] [c00000000027808c] .blk_add_timer+0x74/0x134
(unreliable)
[c00000007b747610] [c00000000026f9b8] .elv_dequeue_request+0x78/0x8c
[c00000007b747680] [c000000000275830] .blk_do_ordered+0x8c/0x31c
[c00000007b747720] [c00000000026fc18] .elv_next_request+0x24c/0x2d4
[c00000007b7477c0] [d000000000368004] .scsi_request_fn+0xc8/0x628
[scsi_mod]
[c00000007b7478a0] [c00000000026fdf4] .elv_insert+0x154/0x38c
[c00000007b747940] [c000000000273ad0] .__make_request+0x4e4/0x568
[c00000007b7479f0] [c000000000271a68] .generic_make_request+0x3f4/0x468
[c00000007b747af0] [c000000000271bd8] .submit_bio+0xfc/0x124
[c00000007b747bb0] [c000000000160a00] .submit_bh+0x14c/0x198
[c00000007b747c40] [c0000000001630a0] .sync_dirty_buffer+0xbc/0x15c
[c00000007b747cd0] [c0000000001fcac0]
.journal_commit_transaction+0x1014/0x158c
[c00000007b747e10] [c00000000020111c] .kjournald+0x104/0x2f4
[c00000007b747f00] [c0000000000a909c] .kthread+0x78/0xc4
[c00000007b747f90] [c00000000002ae2c] .kernel_thread+0x4c/0x68

I changed the previous mentioned WARN_ON to just do a return if the request
does not have REQ_STARTED. This corrected the issue of seeing an oops in
scsi_times_out. But this is just a hack.

Hope this analysis is not flawed because of kernel deltas. It also may not
address this specific issue being seen in this bug, but does appear to
indicate a possible path to get a request on the timeout list with out a
req->special set.

I think we may need to look at some of the paths that are calling
blkdev_dequeue_request and understand how to prevent blk_add_timer from
being called if we are not really starting a SCSI cmd.

-andmike
--
Michael Anderson
andmike@linux.vnet.ibm.com

-- 
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug 12020] scsi_times_out NULL pointer dereference
  2008-11-13 18:30 [Bug 12020] New: scsi_times_out NULL pointer dereference bugme-daemon
                   ` (9 preceding siblings ...)
  2008-11-20 19:36 ` bugme-daemon
@ 2008-12-03 10:19 ` bugme-daemon
  2008-12-07 20:21 ` bugme-daemon
  2008-12-07 20:21 ` bugme-daemon
  12 siblings, 0 replies; 17+ messages in thread
From: bugme-daemon @ 2008-12-03 10:19 UTC (permalink / raw)
  To: linux-scsi

http://bugzilla.kernel.org/show_bug.cgi?id=12020

------- Comment #8 from bs@q-leap.de  2008-12-03 02:19 -------
> Mike Anderson pointed out that we have a potential window where the
> timer can fire after we've unprepped the request in SCSI (so making
> req->special NULL) but before we call blk_requeue_request() which stops
> the timer.  We can rejig the locking to prevent this from happening, so
> could you (separately) try this patch?
> 
> James
> 

Hello James,

sorry for the huge delay. Unfortunately it turned out I was just 'lucky' to run
into this bug the first few times. When I later on tried to reproduce this
specific issue, I tried 20 times and couldn't, even without any patch :( So
testing especially the 2nd patch turnes out to be a bit difficult (can't be
verified by printk). 
I now also mostly only have remotely access to the test hardware and can't
reset this system remotely, so testing has become a bit difficult :(
Tomorrow will be in the lab again and try to test again.

Thanks for your help and patience,
Bernd

-- 
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug 12020] scsi_times_out NULL pointer dereference
  2008-11-13 18:30 [Bug 12020] New: scsi_times_out NULL pointer dereference bugme-daemon
                   ` (10 preceding siblings ...)
  2008-12-03 10:19 ` bugme-daemon
@ 2008-12-07 20:21 ` bugme-daemon
  2008-12-07 20:21 ` bugme-daemon
  12 siblings, 0 replies; 17+ messages in thread
From: bugme-daemon @ 2008-12-07 20:21 UTC (permalink / raw)
  To: linux-scsi

http://bugzilla.kernel.org/show_bug.cgi?id=12020


rjw@sisk.pl changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |REJECTED
         Resolution|                            |UNREPRODUCIBLE




-- 
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug 12020] scsi_times_out NULL pointer dereference
  2008-11-13 18:30 [Bug 12020] New: scsi_times_out NULL pointer dereference bugme-daemon
                   ` (11 preceding siblings ...)
  2008-12-07 20:21 ` bugme-daemon
@ 2008-12-07 20:21 ` bugme-daemon
  12 siblings, 0 replies; 17+ messages in thread
From: bugme-daemon @ 2008-12-07 20:21 UTC (permalink / raw)
  To: linux-scsi

http://bugzilla.kernel.org/show_bug.cgi?id=12020


rjw@sisk.pl changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REJECTED                    |CLOSED




-- 
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2008-12-07 20:28 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-11-13 18:30 [Bug 12020] New: scsi_times_out NULL pointer dereference bugme-daemon
2008-11-13 18:40 ` [Bug 12020] " bugme-daemon
2008-11-13 19:03 ` [Bug 12020] New: " James Bottomley
2008-11-13 22:46   ` James Bottomley
2008-11-13 19:03 ` [Bug 12020] " bugme-daemon
2008-11-13 20:12 ` bugme-daemon
2008-11-13 20:22   ` James Bottomley
2008-11-13 20:23 ` bugme-daemon
2008-11-13 21:36 ` bugme-daemon
2008-11-13 22:47 ` bugme-daemon
2008-11-16 17:50 ` bugme-daemon
2008-11-20 15:12 ` bugme-daemon
2008-11-20 19:36   ` Mike Anderson
2008-11-20 19:36 ` bugme-daemon
2008-12-03 10:19 ` bugme-daemon
2008-12-07 20:21 ` bugme-daemon
2008-12-07 20:21 ` bugme-daemon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).