From: Jens Axboe <jens.axboe@oracle.com>
To: "Alan D. Brunelle" <Alan.Brunelle@hp.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
LKML-scsi <linux-scsi@vger.kernel.org>
Subject: Re: kernel BUG at block/blk-timeout.c:178!
Date: Fri, 5 Dec 2008 14:35:27 +0100
Message-ID: <20081205133527.GK18255@kernel.dk>
In-Reply-To: <49392D71.10505@hp.com>
On Fri, Dec 05 2008, Alan D. Brunelle wrote:
> Jens Axboe wrote:
> >>
> >> I've pushed the BUG ON check into blk_execute_rq, and it's finding it
> >> set there. Could we be getting SCSI_DH_IMM_RETRYs and that's causing the
> >> same request to be used without being re-initialized, and on error the
> >> bit is not being cleaned up properly?
> >>
> >> I'm checking that out next...
> >
> > That does indeed look problematic, we only init the timer stuff when
> > getting the request initially. So you could either make your retry loop
> > do blk_put_request() and jump to the very beginning again, or this
> > should fix the current usage.
> >
> > diff --git a/block/elevator.c b/block/elevator.c
> > index a6951f7..0a2f378 100644
> > --- a/block/elevator.c
> > +++ b/block/elevator.c
> > @@ -590,6 +590,12 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
> >
> >  	rq->q = q;
> >
> > +	/*
> > +	 * This could happen on a request requeue, init the timer here as well
> > +	 */
> > +	blk_delete_timer(rq);
> > +	blk_clear_rq_complete(rq);
> > +
> >  	switch (where) {
> >  	case ELEVATOR_INSERT_FRONT:
> >  		rq->cmd_flags |= REQ_SOFTBARRIER;
> >
>
> Hi Jens -
>
> I was able to determine we were getting retries on TURs, and then right
> (soon?) thereafter was when we triggered the problem. I put some
> slightly different code than what you suggest above, and that triggered
> a problem elsewhere in SCSI. I backed out what I did, inserted your code
> (after updating my tree to:)
>
> commit bbeba4c35c252b2e961f09ce6ebe76b2cd5e7e3e
> Merge: 6df944c... 2cbed89...
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Thu Dec 4 21:45:44 2008 -0800
>
> Merge branch 'for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/vi
>
> * 'for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev:
> [PATCH] fix bogus argument of blkdev_put() in pktcdvd
> [PATCH 2/2] documnt FMODE_ constants
> [PATCH 1/2] kill FMODE_NDELAY_NOW
> [PATCH] clean up blkdev_get a little bit
> [PATCH] Fix block dev compat ioctl handling
> [PATCH] kill obsolete temporary comment in swsusp_close()
>
> and I get the following SCSI-related problem (the same error I got with
> code I tried last night - basically just clearing out the atomic value).
> This has now happened repeatedly 3 times. [I've cc'd linux-scsi as well,
> just to make sure to cover all the bases...]
>
> ------------[ cut here ]------------
> kernel BUG at drivers/scsi/scsi.c:347!
> invalid opcode: 0000 [#1] SMP
> last sysfs file:
> CPU 0
> Pid: 0, comm: swapper Not tainted 2.6.28-rc7 #7
> RIP: 0010:[<ffffffff80d3df3e>] [<ffffffff80d3df3e>]
> scsi_put_command+0x27/0x65
> RSP: 0018:ffffffff827b3d80 EFLAGS: 00010046
> RAX: 0000000000000282 RBX: ffff88087b68dec0 RCX: ffff88087b68dec8
> RDX: ffff88087b68dec8 RSI: 0000000000000282 RDI: ffff88087b35d834
> RBP: ffffffff827b3d90 R08: ffffffff827b3d30 R09: 00000000827b3d60
> R10: 0000000000000246 R11: ffff88087b16042c R12: ffff88087b35d800
> R13: ffff88087b35d920 R14: 0000000000000000 R15: 0000000000000001
> FS: 0000000000000000(0000) GS:ffffffff82535500(0000) knlGS:0000000000000000
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process swapper (pid: 0, threadinfo ffffffff82548000, task ffffffff82049380)
> Stack:
> ffff88087b68dec0 ffff88087b36b230 ffffffff827b3dc0 ffffffff80d43793
> 0000000000000000 ffff88087b68dec0 ffff88087b35b658 0000000000000000
> ffffffff827b3e00 ffffffff80d43980 ffff88087b36b230 ffff88087b35b658
> Call Trace:
> <IRQ> <0> [<ffffffff80d43793>] scsi_next_command+0x2e/0x46
> [<ffffffff80d43980>] scsi_end_request+0x92/0xa4
> [<ffffffff80d43fe3>] scsi_io_completion+0x1a7/0x3ad
> [<ffffffff80d3d926>] scsi_finish_command+0xe9/0xf2
> [<ffffffff80d444d5>] scsi_softirq_done+0x10c/0x115
> [<ffffffff8062afeb>] blk_done_softirq+0x77/0x87
> [<ffffffff802628a0>] ? ktime_get_ts+0x49/0x4e
> [<ffffffff802514b8>] __do_softirq+0x8a/0x151
> [<ffffffff8256b140>] ? early_idt_handler+0x0/0x72
> [<ffffffff8020dbec>] call_softirq+0x1c/0x28
> [<ffffffff8020f32c>] do_softirq+0x44/0x8b
> [<ffffffff802511bf>] irq_exit+0x3f/0x82
> [<ffffffff8020f5f0>] do_IRQ+0xc3/0xe3
> [<ffffffff8020c983>] ret_from_intr+0x0/0x29
> <EOI> <0> [<ffffffff80228446>] ? native_safe_halt+0x6/0x8
> [<ffffffff8021406e>] ? default_idle+0x3c/0x59
> [<ffffffff802141b0>] ? c1e_idle+0x117/0x11e
> [<ffffffff8026332d>] ? atomic_notifier_call_chain+0x13/0x15
> [<ffffffff8020b4ea>] ? cpu_idle+0x51/0x92
> [<ffffffff815be32d>] ? rest_init+0x61/0x63
> [<ffffffff8256bd6b>] ? start_kernel+0x39c/0x3a7
> [<ffffffff8256b29f>] ? x86_64_start_reservations+0xa5/0xa9
> [<ffffffff8256b3f0>] ? x86_64_start_kernel+0x12a/0x139
> Code: 41 5e c9 c3 55 48 89 e5 41 54 53 4c 8b 27 48 89 fb 49 8d 7c 24 34
> e8 69 56 98 00 48 8b 53 08 48 8d 4b 08 48 89 c6 48 39 ca 75 04 <0f> 0b
> eb fe 48 8b 43 10 48 89 42 08 48 89 10 48 8b 3b 48 89 4b
> RIP [<ffffffff80d3df3e>] scsi_put_command+0x27/0x65
> RSP <ffffffff827b3d80>
>
I haven't audited everything for that path yet, but I don't think others
do retries like you suggest here. I'd strongly encourage you to change
the retry logic to really put and get a new request, instead of relying
on all state being clean for a reissue.
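
To illustrate the difference between the two retry styles, here is a
minimal userspace sketch. It is a toy model, not kernel code: struct
toy_request and every toy_* and retry_* name are hypothetical, and the
single "complete" flag only stands in for the per-request state that is
set up when a request is first obtained.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a block request: only the state at issue. */
struct toy_request {
	bool complete;	/* models the "request completed" bit */
};

/* Models getting a request: state is initialized only at this point. */
static struct toy_request *toy_get_request(void)
{
	return calloc(1, sizeof(struct toy_request));
}

/* Models putting the request back. */
static void toy_put_request(struct toy_request *rq)
{
	free(rq);
}

/*
 * Models issuing a request and waiting for it to finish.
 * Returns -1 where the real code would hit a BUG_ON: the request
 * still carries the completed bit from an earlier attempt.
 */
static int toy_execute(struct toy_request *rq)
{
	if (rq->complete)
		return -1;
	rq->complete = true;	/* completion marks the request */
	return 0;
}

/* Retry by reissuing the same request: relies on state being clean. */
static int retry_reusing_request(int attempts)
{
	struct toy_request *rq = toy_get_request();
	int rc = 0;

	if (!rq)
		return -1;
	for (int i = 0; i < attempts && rc == 0; i++)
		rc = toy_execute(rq);
	toy_put_request(rq);
	return rc;
}

/* Retry by putting the request and getting a new one each time. */
static int retry_with_fresh_request(int attempts)
{
	for (int i = 0; i < attempts; i++) {
		struct toy_request *rq = toy_get_request();
		int rc;

		if (!rq)
			return -1;
		rc = toy_execute(rq);
		toy_put_request(rq);
		if (rc)
			return rc;
	}
	return 0;
}
```

In this model retry_reusing_request(2) fails on the second attempt
because the flag from the first is still set, while
retry_with_fresh_request() succeeds for any number of attempts since
each one starts from freshly initialized state.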
--
Jens Axboe