* kernel BUG at block/blk-timeout.c:178!
@ 2008-12-04 14:26 Alan D. Brunelle
2008-12-04 15:25 ` Alan D. Brunelle
2008-12-04 15:50 ` Jens Axboe
0 siblings, 2 replies; 9+ messages in thread
From: Alan D. Brunelle @ 2008-12-04 14:26 UTC (permalink / raw)
To: linux-kernel@vger.kernel.org; +Cc: Jens Axboe
(I'll be checking into reproducibility &c., but for now...)
Using git tree as of:
commit 6a1214113090905aca6a492fc8ef10d84c608a69
Merge: c07f62e... 2642b11...
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon Dec 1 09:34:23 2008 -0800
Merge branch 'for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/ie
* 'for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linu
ieee1394: sbp2: fix race condition in state change
ieee1394: fix list corruption (reported at module removal)
firewire: fw-sbp2: another iPod mini quirk entry
ieee1394: sbp2: another iPod mini quirk entry
I just got whilst booting (16-way AMD64, 126GB RAM, 3@CCISS, 4@QLA):
------------[ cut here ]------------
kernel BUG at block/blk-timeout.c:178!
invalid opcode: 0000 [#1] SMP
last sysfs file:
CPU 0
Pid: 268, comm: scsi_wq_0 Not tainted 2.6.28-rc6 #13
RIP: 0010:[<ffffffff804e8d4d>] [<ffffffff804e8d4d>] blk_add_timer+0xdd/0xf0
RSP: 0018:ffff88087b6c3830 EFLAGS: 00010002
RAX: ffff88087cdf4408 RBX: ffff88087cded670 RCX: ffff88087cdf4408
RDX: ffff88087cded670 RSI: ffff88087cdf42c8 RDI: ffff88087cdf42c8
RBP: ffff88087b6c3840 R08: 0000000000000020 R09: ffff88087cded670
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000020
R13: ffff88087cdf42c8 R14: 0000000000000020 R15: 0000000000000008
FS: 0000000000000000(0000) GS:ffffffff80f76980(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process scsi_wq_0 (pid: 268, threadinfo ffff88087b6c2000, task
ffff88087bf706c0)
Stack:
ffff88087cded670 ffff88087b6cf460 ffff88087b6c3850 ffffffff804e2308
ffff88087b6c3890 ffffffff804e5566 ffff88087cded670 ffff88087cded670
ffff88087cdf1800 ffff88087cded670 ffff88087cded670 ffff88087cdf42c8
Call Trace:
[<ffffffff804e2308>] elv_dequeue_request+0x48/0x60
[<ffffffff804e5566>] blk_queue_start_tag+0x96/0x130
[<ffffffff8060c900>] scsi_request_fn+0x120/0x5a0
[<ffffffff804e47db>] __generic_unplug_device+0x2b/0x40
[<ffffffff804e7d56>] blk_execute_rq_nowait+0x66/0xb0
[<ffffffff804e7dfd>] blk_execute_rq+0x5d/0xb0
[<ffffffff80308c80>] ? sysfs_ilookup_test+0x0/0x20
[<ffffffff8063011e>] hp_sw_tur+0xbe/0x230
[<ffffffff8062ecc6>] ? scsi_dh_handler_lookup+0x56/0xb0
[<ffffffff806305c8>] hp_sw_bus_attach+0x58/0x160
[<ffffffff8062eba2>] scsi_dh_handler_attach+0x32/0x40
[<ffffffff8062f0da>] scsi_dh_notifier+0x6a/0xa0
[<ffffffff8025f3df>] notifier_call_chain+0x3f/0x80
[<ffffffff8025f7f8>] __blocking_notifier_call_chain+0x58/0x80
[<ffffffff8025f831>] blocking_notifier_call_chain+0x11/0x20
[<ffffffff805b4703>] device_add+0x133/0x650
[<ffffffff805ba50c>] ? attribute_container_device_trigger+0xbc/0xd0
[<ffffffff80612944>] scsi_sysfs_add_sdev+0x64/0x280
[<ffffffff80610379>] scsi_probe_and_add_lun+0xdb9/0xde0
[<ffffffff80610a07>] __scsi_scan_target+0x4d7/0x6f0
[<ffffffff806112e0>] scsi_scan_target+0xc0/0xd0
[<ffffffff8061c2f7>] fc_scsi_scan_rport+0xb7/0xc0
[<ffffffff8061c240>] ? fc_scsi_scan_rport+0x0/0xc0
[<ffffffff80256c70>] run_workqueue+0x70/0x120
[<ffffffff80256d80>] worker_thread+0x60/0xb0
[<ffffffff8025aca0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff80256d20>] ? worker_thread+0x0/0xb0
[<ffffffff8025a8e9>] kthread+0x49/0x80
[<ffffffff8020d269>] child_rip+0xa/0x11
[<ffffffff8025a8a0>] ? kthread+0x0/0x80
[<ffffffff8020d25f>] ? child_rip+0x0/0x11
Code: 48 8d bb 28 03 00 00 48 89 c6 e8 af 6f d6 ff 48 83 c4 08 5b c9 c3
48 8b 15 81 cb a9 00 89 c0 48 01 d0 48 89 87 38 01 00 00 eb 90 <0f> 0b
eb fe 0f 1f 80 00 00 00 00 0f 0b eb fe 0f 1f 40 00 55 48
RIP [<ffffffff804e8d4d>] blk_add_timer+0xdd/0xf0
RSP <ffff88087b6c3830>
---[ end trace 1d1b7db6f1a8b5dd ]---
^ permalink raw reply [flat|nested] 9+ messages in thread* Re: kernel BUG at block/blk-timeout.c:178! 2008-12-04 14:26 kernel BUG at block/blk-timeout.c:178! Alan D. Brunelle @ 2008-12-04 15:25 ` Alan D. Brunelle 2008-12-04 15:50 ` Jens Axboe 1 sibling, 0 replies; 9+ messages in thread From: Alan D. Brunelle @ 2008-12-04 15:25 UTC (permalink / raw) To: linux-kernel@vger.kernel.org; +Cc: Jens Axboe (First: It seems to be easily reproduced - 3 for 3.) Interestingly, if I add a test forward : diff --git a/block/elevator.c b/block/elevator.c index 9ac82dd..c2c8342 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -586,6 +586,7 @@ void elv_insert(struct request_queue *q, struct request *rq, unsigned ordseq; int unplug_it = 1; + BUG_ON(test_bit(REQ_ATOM_COMPLETE, &req->atomic_flags)); blk_add_trace_rq(q, rq, BLK_TA_INSERT); rq->q = q; the problem moves with it: ------------[ cut here ]------------ kernel BUG at block/elevator.c:589! invalid opcode: 0000 [#1] SMP last sysfs file: CPU 0 Pid: 268, comm: scsi_wq_0 Not tainted 2.6.28-rc6 #14 RIP: 0010:[<ffffffff804e278e>] [<ffffffff804e278e>] elv_insert+0x25e/0x270 RSP: 0018:ffff88087cd4f8c0 EFLAGS: 00010002 RAX: 0000000000000000 RBX: ffff88087b6d52c8 RCX: 00000000fffee229 RDX: 0000000000000001 RSI: ffff88087b6d52c8 RDI: ffff88087cd57670 RBP: ffff88087cd4f8e0 R08: 0000000000000000 R09: ffff88087cd57670 R10: 0000000000000000 R11: 0000000000000000 R12: ffff88087cd57670 R13: ffff88087cd57670 R14: ffff88087cdf5920 R15: 0000000000000001 FS: 0000000000000000(0000) GS:ffffffff80f76980(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process scsi_wq_0 (pid: 268, threadinfo ffff88087cd4e000, task ffff88087ba9c040) Stack: ffff88087b6d52c8 ffff88087cd57670 0000000000000001 ffff88087cdf5920 ffff88087cd4f910 ffffffff804e281b 0000000000200200 0000000000000001 ffff88087b6d52c8 ffff88087cd57670 ffff88087cd4f940 ffffffff804e7d5e Call Trace: [<ffffffff804e281b>] __elv_add_request+0x7b/0xd0 [<ffffffff804e7d5e>] blk_execute_rq_nowait+0x5e/0xb0 [<ffffffff804e7e0d>] blk_execute_rq+0x5d/0xb0 [<ffffffff80308c80>] ? sysfs_ilookup_test+0x0/0x20 [<ffffffff8063012e>] hp_sw_tur+0xbe/0x230 [<ffffffff8062ecd6>] ? scsi_dh_handler_lookup+0x56/0xb0 [<ffffffff806305d8>] hp_sw_bus_attach+0x58/0x160 [<ffffffff8062ebb2>] scsi_dh_handler_attach+0x32/0x40 [<ffffffff8062f0ea>] scsi_dh_notifier+0x6a/0xa0 [<ffffffff8025f3df>] notifier_call_chain+0x3f/0x80 [<ffffffff8025f7f8>] __blocking_notifier_call_chain+0x58/0x80 [<ffffffff8025f831>] blocking_notifier_call_chain+0x11/0x20 [<ffffffff805b4713>] device_add+0x133/0x650 [<ffffffff805ba51c>] ? attribute_container_device_trigger+0xbc/0xd0 [<ffffffff80612954>] scsi_sysfs_add_sdev+0x64/0x280 [<ffffffff80610389>] scsi_probe_and_add_lun+0xdb9/0xde0 [<ffffffff80610a17>] __scsi_scan_target+0x4d7/0x6f0 [<ffffffff806112f0>] scsi_scan_target+0xc0/0xd0 [<ffffffff8061c307>] fc_scsi_scan_rport+0xb7/0xc0 [<ffffffff8061c250>] ? fc_scsi_scan_rport+0x0/0xc0 [<ffffffff80256c70>] run_workqueue+0x70/0x120 [<ffffffff80256d80>] worker_thread+0x60/0xb0 [<ffffffff8025aca0>] ? autoremove_wake_function+0x0/0x40 [<ffffffff80256d20>] ? worker_thread+0x0/0xb0 [<ffffffff8025a8e9>] kthread+0x49/0x80 [<ffffffff8020d269>] child_rip+0xa/0x11 [<ffffffff8025a8a0>] ? kthread+0x0/0x80 [<ffffffff8020d25f>] ? child_rip+0x0/0x11 Code: 89 ef 0f 1f 00 e8 53 20 00 00 0f 1f 00 e9 29 fe ff ff 49 8b 44 24 08 4c 89 23 49 89 5c 24 08 48 89 18 48 89 43 08 e9 10 fe ff ff <0f> 0b eb fe 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 RIP [<ffffffff804e278e>] elv_insert+0x25e/0x270 RSP <ffff88087cd4f8c0> ---[ end trace 0bd30bab78e129e5 ]--- ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: kernel BUG at block/blk-timeout.c:178! 2008-12-04 14:26 kernel BUG at block/blk-timeout.c:178! Alan D. Brunelle 2008-12-04 15:25 ` Alan D. Brunelle @ 2008-12-04 15:50 ` Jens Axboe 2008-12-04 16:13 ` Alan D. Brunelle 1 sibling, 1 reply; 9+ messages in thread From: Jens Axboe @ 2008-12-04 15:50 UTC (permalink / raw) To: Alan D. Brunelle; +Cc: linux-kernel@vger.kernel.org On Thu, Dec 04 2008, Alan D. Brunelle wrote: > (I'll be checking into reproducibility &c., but for now...) > > Using git tree as of: > > commit 6a1214113090905aca6a492fc8ef10d84c608a69 > Merge: c07f62e... 2642b11... > Author: Linus Torvalds <torvalds@linux-foundation.org> > Date: Mon Dec 1 09:34:23 2008 -0800 > > Merge branch 'for-linus' of > git://git.kernel.org/pub/scm/linux/kernel/git/ie > > * 'for-linus' of > git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linu > ieee1394: sbp2: fix race condition in state change > ieee1394: fix list corruption (reported at module removal) > firewire: fw-sbp2: another iPod mini quirk entry > ieee1394: sbp2: another iPod mini quirk entry > > I just got whilst booting (16-way AMD64, 126GB RAM, 3@CCISS, 4@QLA): > > ------------[ cut here ]------------ > kernel BUG at block/blk-timeout.c:178! > invalid opcode: 0000 [#1] SMP > last sysfs file: > CPU 0 > Pid: 268, comm: scsi_wq_0 Not tainted 2.6.28-rc6 #13 > RIP: 0010:[<ffffffff804e8d4d>] [<ffffffff804e8d4d>] blk_add_timer+0xdd/0xf0 > RSP: 0018:ffff88087b6c3830 EFLAGS: 00010002 > RAX: ffff88087cdf4408 RBX: ffff88087cded670 RCX: ffff88087cdf4408 > RDX: ffff88087cded670 RSI: ffff88087cdf42c8 RDI: ffff88087cdf42c8 > RBP: ffff88087b6c3840 R08: 0000000000000020 R09: ffff88087cded670 > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000020 > R13: ffff88087cdf42c8 R14: 0000000000000020 R15: 0000000000000008 > FS: 0000000000000000(0000) GS:ffffffff80f76980(0000) knlGS:0000000000000000 > CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b > CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process scsi_wq_0 (pid: 268, threadinfo ffff88087b6c2000, task > ffff88087bf706c0) > Stack: > ffff88087cded670 ffff88087b6cf460 ffff88087b6c3850 ffffffff804e2308 > ffff88087b6c3890 ffffffff804e5566 ffff88087cded670 ffff88087cded670 > ffff88087cdf1800 ffff88087cded670 ffff88087cded670 ffff88087cdf42c8 > Call Trace: > [<ffffffff804e2308>] elv_dequeue_request+0x48/0x60 > [<ffffffff804e5566>] blk_queue_start_tag+0x96/0x130 > [<ffffffff8060c900>] scsi_request_fn+0x120/0x5a0 > [<ffffffff804e47db>] __generic_unplug_device+0x2b/0x40 > [<ffffffff804e7d56>] blk_execute_rq_nowait+0x66/0xb0 > [<ffffffff804e7dfd>] blk_execute_rq+0x5d/0xb0 > [<ffffffff80308c80>] ? sysfs_ilookup_test+0x0/0x20 > [<ffffffff8063011e>] hp_sw_tur+0xbe/0x230 > [<ffffffff8062ecc6>] ? scsi_dh_handler_lookup+0x56/0xb0 > [<ffffffff806305c8>] hp_sw_bus_attach+0x58/0x160 > [<ffffffff8062eba2>] scsi_dh_handler_attach+0x32/0x40 > [<ffffffff8062f0da>] scsi_dh_notifier+0x6a/0xa0 > [<ffffffff8025f3df>] notifier_call_chain+0x3f/0x80 > [<ffffffff8025f7f8>] __blocking_notifier_call_chain+0x58/0x80 > [<ffffffff8025f831>] blocking_notifier_call_chain+0x11/0x20 > [<ffffffff805b4703>] device_add+0x133/0x650 > [<ffffffff805ba50c>] ? attribute_container_device_trigger+0xbc/0xd0 > [<ffffffff80612944>] scsi_sysfs_add_sdev+0x64/0x280 > [<ffffffff80610379>] scsi_probe_and_add_lun+0xdb9/0xde0 > [<ffffffff80610a07>] __scsi_scan_target+0x4d7/0x6f0 > [<ffffffff806112e0>] scsi_scan_target+0xc0/0xd0 > [<ffffffff8061c2f7>] fc_scsi_scan_rport+0xb7/0xc0 > [<ffffffff8061c240>] ? fc_scsi_scan_rport+0x0/0xc0 > [<ffffffff80256c70>] run_workqueue+0x70/0x120 > [<ffffffff80256d80>] worker_thread+0x60/0xb0 > [<ffffffff8025aca0>] ? autoremove_wake_function+0x0/0x40 > [<ffffffff80256d20>] ? worker_thread+0x0/0xb0 > [<ffffffff8025a8e9>] kthread+0x49/0x80 > [<ffffffff8020d269>] child_rip+0xa/0x11 > [<ffffffff8025a8a0>] ? kthread+0x0/0x80 > [<ffffffff8020d25f>] ? child_rip+0x0/0x11 > Code: 48 8d bb 28 03 00 00 48 89 c6 e8 af 6f d6 ff 48 83 c4 08 5b c9 c3 > 48 8b 15 81 cb a9 00 89 c0 48 01 d0 48 89 87 38 01 00 00 eb 90 <0f> 0b > eb fe 0f 1f 80 00 00 00 00 0f 0b eb fe 0f 1f 40 00 55 48 > RIP [<ffffffff804e8d4d>] blk_add_timer+0xdd/0xf0 > RSP <ffff88087b6c3830> > ---[ end trace 1d1b7db6f1a8b5dd ]--- Alan, can you try latest -git? feaf3848a813a106f163013af6fcf6c4bfec92d9 or later. -- Jens Axboe ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: kernel BUG at block/blk-timeout.c:178! 2008-12-04 15:50 ` Jens Axboe @ 2008-12-04 16:13 ` Alan D. Brunelle 2008-12-04 18:31 ` Alan D. Brunelle 0 siblings, 1 reply; 9+ messages in thread From: Alan D. Brunelle @ 2008-12-04 16:13 UTC (permalink / raw) To: Jens Axboe; +Cc: linux-kernel@vger.kernel.org Jens Axboe wrote: > > Alan, can you try latest -git? feaf3848a813a106f163013af6fcf6c4bfec92d9 > or later. > git pull()ed to: feaf3848a813a106f163013af6fcf6c4bfec92d9 and the same problem occurs. Alan ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: kernel BUG at block/blk-timeout.c:178! 2008-12-04 16:13 ` Alan D. Brunelle @ 2008-12-04 18:31 ` Alan D. Brunelle 2008-12-04 21:06 ` Alan D. Brunelle 0 siblings, 1 reply; 9+ messages in thread From: Alan D. Brunelle @ 2008-12-04 18:31 UTC (permalink / raw) To: Alan D. Brunelle; +Cc: Jens Axboe, linux-kernel@vger.kernel.org Alan D. Brunelle wrote: > Jens Axboe wrote: > >> Alan, can you try latest -git? feaf3848a813a106f163013af6fcf6c4bfec92d9 >> or later. >> > > git pull()ed to: feaf3848a813a106f163013af6fcf6c4bfec92d9 and the same > problem occurs. Maybe not - I've not been to reproduce that problem in subsequent reboots. It could be that I booted the wrong kernel first time (rc6 instead of rc7). Will keep plugging - any idea as to what might have "fixed" the problem between rc6 & rc7? Alan ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: kernel BUG at block/blk-timeout.c:178! 2008-12-04 18:31 ` Alan D. Brunelle @ 2008-12-04 21:06 ` Alan D. Brunelle 2008-12-05 9:40 ` Jens Axboe 0 siblings, 1 reply; 9+ messages in thread From: Alan D. Brunelle @ 2008-12-04 21:06 UTC (permalink / raw) To: Alan D. Brunelle; +Cc: Jens Axboe, linux-kernel@vger.kernel.org Alan D. Brunelle wrote: > Alan D. Brunelle wrote: >> Jens Axboe wrote: >> >>> Alan, can you try latest -git? feaf3848a813a106f163013af6fcf6c4bfec92d9 >>> or later. >>> >> git pull()ed to: feaf3848a813a106f163013af6fcf6c4bfec92d9 and the same >> problem occurs. > > Maybe not - I've not been to reproduce that problem in subsequent > reboots. It could be that I booted the wrong kernel first time (rc6 > instead of rc7). Will keep plugging - any idea as to what might have > "fixed" the problem between rc6 & rc7? > > Alan It's back - just not as easily reproduced as before. I'm concerned over this piece of code: /* * hp_sw_tur - Send TEST UNIT READY * @sdev: sdev command should be sent to * * Use the TEST UNIT READY command to determine * the path state. */ static int hp_sw_tur(struct scsi_device *sdev, struct hp_sw_dh_data *h) { struct request *req; int ret; req = blk_get_request(sdev->request_queue, WRITE, GFP_NOIO); if (!req) return SCSI_DH_RES_TEMP_UNAVAIL; req->cmd_type = REQ_TYPE_BLOCK_PC; req->cmd_flags |= REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER; req->cmd_len = COMMAND_SIZE(TEST_UNIT_READY); req->cmd[0] = TEST_UNIT_READY; req->timeout = HP_SW_TIMEOUT; req->sense = h->sense; memset(req->sense, 0, SCSI_SENSE_BUFFERSIZE); req->sense_len = 0; retry: ret = blk_execute_rq(req->q, NULL, req, 1); if (ret == -EIO) { if (req->sense_len > 0) { ret = tur_done(sdev, h->sense); } else { sdev_printk(KERN_WARNING, sdev, "%s: sending tur failed with %x\n", HP_SW_NAME, req->errors); ret = SCSI_DH_IO; } } else { h->path_state = HP_SW_PATH_ACTIVE; ret = SCSI_DH_OK; } if (ret == SCSI_DH_IMM_RETRY) goto retry; if (ret == SCSI_DH_DEV_OFFLINED) { h->path_state = HP_SW_PATH_PASSIVE; ret = SCSI_DH_OK; } blk_put_request(req); return ret; } I've pushed the BUG ON check into blk_execute_rq, and it's finding it set there. Could we be getting SCSI_DH_IMM_RETRYs and that's causing the same request to be used without being re-initialized, and on error the bit is not being cleaned up properly? I'm checking that out next... Alan ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: kernel BUG at block/blk-timeout.c:178! 2008-12-04 21:06 ` Alan D. Brunelle @ 2008-12-05 9:40 ` Jens Axboe 2008-12-05 13:32 ` Alan D. Brunelle 0 siblings, 1 reply; 9+ messages in thread From: Jens Axboe @ 2008-12-05 9:40 UTC (permalink / raw) To: Alan D. Brunelle; +Cc: linux-kernel@vger.kernel.org On Thu, Dec 04 2008, Alan D. Brunelle wrote: > Alan D. Brunelle wrote: > > Alan D. Brunelle wrote: > >> Jens Axboe wrote: > >> > >>> Alan, can you try latest -git? feaf3848a813a106f163013af6fcf6c4bfec92d9 > >>> or later. > >>> > >> git pull()ed to: feaf3848a813a106f163013af6fcf6c4bfec92d9 and the same > >> problem occurs. > > > > Maybe not - I've not been to reproduce that problem in subsequent > > reboots. It could be that I booted the wrong kernel first time (rc6 > > instead of rc7). Will keep plugging - any idea as to what might have > > "fixed" the problem between rc6 & rc7? > > > > Alan > > It's back - just not as easily reproduced as before. > > I'm concerned over this piece of code: > > /* > * hp_sw_tur - Send TEST UNIT READY > * @sdev: sdev command should be sent to > * > * Use the TEST UNIT READY command to determine > * the path state. > */ > static int hp_sw_tur(struct scsi_device *sdev, struct hp_sw_dh_data *h) > { > struct request *req; > int ret; > > req = blk_get_request(sdev->request_queue, WRITE, GFP_NOIO); > if (!req) > return SCSI_DH_RES_TEMP_UNAVAIL; > > req->cmd_type = REQ_TYPE_BLOCK_PC; > req->cmd_flags |= REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | > REQ_FAILFAST_DRIVER; > req->cmd_len = COMMAND_SIZE(TEST_UNIT_READY); > req->cmd[0] = TEST_UNIT_READY; > req->timeout = HP_SW_TIMEOUT; > req->sense = h->sense; > memset(req->sense, 0, SCSI_SENSE_BUFFERSIZE); > req->sense_len = 0; > > retry: > ret = blk_execute_rq(req->q, NULL, req, 1); > if (ret == -EIO) { > if (req->sense_len > 0) { > ret = tur_done(sdev, h->sense); > } else { > sdev_printk(KERN_WARNING, sdev, > "%s: sending tur failed with %x\n", > HP_SW_NAME, req->errors); > ret = SCSI_DH_IO; > } > } else { > h->path_state = HP_SW_PATH_ACTIVE; > ret = SCSI_DH_OK; > } > if (ret == SCSI_DH_IMM_RETRY) > goto retry; > if (ret == SCSI_DH_DEV_OFFLINED) { > h->path_state = HP_SW_PATH_PASSIVE; > ret = SCSI_DH_OK; > } > > blk_put_request(req); > > return ret; > } > > I've pushed the BUG ON check into blk_execute_rq, and it's finding it > set there. Could we be getting SCSI_DH_IMM_RETRYs and that's causing the > same request to be used without being re-initialized, and on error the > bit is not being cleaned up properly? > > I'm checking that out next... That does indeed look problematic, we only init the timer stuff when getting the request initially. So you could either make your retry loop do blk_put_request() and jump to the very beginning again, or this should fix the current usage. diff --git a/block/elevator.c b/block/elevator.c index a6951f7..0a2f378 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -590,6 +590,12 @@ void elv_insert(struct request_queue *q, struct request *rq, int where) rq->q = q; + /* + * This could happen on a request requeue, init the timer here as well + */ + blk_delete_timer(rq); + blk_clear_rq_complete(rq); + switch (where) { case ELEVATOR_INSERT_FRONT: rq->cmd_flags |= REQ_SOFTBARRIER; -- Jens Axboe ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: kernel BUG at block/blk-timeout.c:178! 2008-12-05 9:40 ` Jens Axboe @ 2008-12-05 13:32 ` Alan D. Brunelle 2008-12-05 13:35 ` Jens Axboe 0 siblings, 1 reply; 9+ messages in thread From: Alan D. Brunelle @ 2008-12-05 13:32 UTC (permalink / raw) To: Jens Axboe; +Cc: linux-kernel@vger.kernel.org, LKML-scsi Jens Axboe wrote: >> >> I've pushed the BUG ON check into blk_execute_rq, and it's finding it >> set there. Could we be getting SCSI_DH_IMM_RETRYs and that's causing the >> same request to be used without being re-initialized, and on error the >> bit is not being cleaned up properly? >> >> I'm checking that out next... > > That does indeed look problematic, we only init the timer stuff when > getting the request initially. So you could either make your retry loop > do blk_put_request() and jump to the very beginning again, or this > should fix the current usage. > > diff --git a/block/elevator.c b/block/elevator.c > index a6951f7..0a2f378 100644 > --- a/block/elevator.c > +++ b/block/elevator.c > @@ -590,6 +590,12 @@ void elv_insert(struct request_queue *q, struct request *rq, int where) > > rq->q = q; > > + /* > + * This could happen on a request requeue, init the timer here as well > + */ > + blk_delete_timer(rq); > + blk_clear_rq_complete(rq); > + > switch (where) { > case ELEVATOR_INSERT_FRONT: > rq->cmd_flags |= REQ_SOFTBARRIER; > Hi Jens - I was able to determine we were getting retries on TURs, and then right (soon?) thereafter was when we triggered the problem. I put some slightly different code than what you suggest above, and that triggered a problem elsewhere in SCSI. I backed out what I did, inserted your code (after updating my tree to:) commit bbeba4c35c252b2e961f09ce6ebe76b2cd5e7e3e Merge: 6df944c... 2cbed89... Author: Linus Torvalds <torvalds@linux-foundation.org> Date: Thu Dec 4 21:45:44 2008 -0800 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/vi * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev: [PATCH] fix bogus argument of blkdev_put() in pktcdvd [PATCH 2/2] documnt FMODE_ constants [PATCH 1/2] kill FMODE_NDELAY_NOW [PATCH] clean up blkdev_get a little bit [PATCH] Fix block dev compat ioctl handling [PATCH] kill obsolete temporary comment in swsusp_close() and I get the following SCSI-related problem (the same error I got with code I tried last night - basically just clearing out the atomic value). This has now happened repeatedly 3 times. [I've cc'd linux-scsi as well, just to make sure to cover all the bases...] ------------[ cut here ]------------ kernel BUG at drivers/scsi/scsi.c:347! invalid opcode: 0000 [#1] SMP last sysfs file: CPU 0 Pid: 0, comm: swapper Not tainted 2.6.28-rc7 #7 RIP: 0010:[<ffffffff80d3df3e>] [<ffffffff80d3df3e>] scsi_put_command+0x27/0x65 RSP: 0018:ffffffff827b3d80 EFLAGS: 00010046 RAX: 0000000000000282 RBX: ffff88087b68dec0 RCX: ffff88087b68dec8 RDX: ffff88087b68dec8 RSI: 0000000000000282 RDI: ffff88087b35d834 RBP: ffffffff827b3d90 R08: ffffffff827b3d30 R09: 00000000827b3d60 R10: 0000000000000246 R11: ffff88087b16042c R12: ffff88087b35d800 R13: ffff88087b35d920 R14: 0000000000000000 R15: 0000000000000001 FS: 0000000000000000(0000) GS:ffffffff82535500(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper (pid: 0, threadinfo ffffffff82548000, task ffffffff82049380) Stack: ffff88087b68dec0 ffff88087b36b230 ffffffff827b3dc0 ffffffff80d43793 0000000000000000 ffff88087b68dec0 ffff88087b35b658 0000000000000000 ffffffff827b3e00 ffffffff80d43980 ffff88087b36b230 ffff88087b35b658 Call Trace: <IRQ> <0> [<ffffffff80d43793>] scsi_next_command+0x2e/0x46 [<ffffffff80d43980>] scsi_end_request+0x92/0xa4 [<ffffffff80d43fe3>] scsi_io_completion+0x1a7/0x3ad [<ffffffff80d3d926>] scsi_finish_command+0xe9/0xf2 [<ffffffff80d444d5>] scsi_softirq_done+0x10c/0x115 [<ffffffff8062afeb>] blk_done_softirq+0x77/0x87 [<ffffffff802628a0>] ? ktime_get_ts+0x49/0x4e [<ffffffff802514b8>] __do_softirq+0x8a/0x151 [<ffffffff8256b140>] ? early_idt_handler+0x0/0x72 [<ffffffff8020dbec>] call_softirq+0x1c/0x28 [<ffffffff8020f32c>] do_softirq+0x44/0x8b [<ffffffff802511bf>] irq_exit+0x3f/0x82 [<ffffffff8020f5f0>] do_IRQ+0xc3/0xe3 [<ffffffff8020c983>] ret_from_intr+0x0/0x29 <EOI> <0> [<ffffffff80228446>] ? native_safe_halt+0x6/0x8 [<ffffffff8021406e>] ? default_idle+0x3c/0x59 [<ffffffff802141b0>] ? c1e_idle+0x117/0x11e [<ffffffff8026332d>] ? atomic_notifier_call_chain+0x13/0x15 [<ffffffff8020b4ea>] ? cpu_idle+0x51/0x92 [<ffffffff815be32d>] ? rest_init+0x61/0x63 [<ffffffff8256bd6b>] ? start_kernel+0x39c/0x3a7 [<ffffffff8256b29f>] ? x86_64_start_reservations+0xa5/0xa9 [<ffffffff8256b3f0>] ? x86_64_start_kernel+0x12a/0x139 Code: 41 5e c9 c3 55 48 89 e5 41 54 53 4c 8b 27 48 89 fb 49 8d 7c 24 34 e8 69 56 98 00 48 8b 53 08 48 8d 4b 08 48 89 c6 48 39 ca 75 04 <0f> 0b eb fe 48 8b 43 10 48 89 42 08 48 89 10 48 8b 3b 48 89 4b RIP [<ffffffff80d3df3e>] scsi_put_command+0x27/0x65 RSP <ffffffff827b3d80> I'll post my .config & full dump a little later. Alan ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: kernel BUG at block/blk-timeout.c:178! 2008-12-05 13:32 ` Alan D. Brunelle @ 2008-12-05 13:35 ` Jens Axboe 0 siblings, 0 replies; 9+ messages in thread From: Jens Axboe @ 2008-12-05 13:35 UTC (permalink / raw) To: Alan D. Brunelle; +Cc: linux-kernel@vger.kernel.org, LKML-scsi On Fri, Dec 05 2008, Alan D. Brunelle wrote: > Jens Axboe wrote: > >> > >> I've pushed the BUG ON check into blk_execute_rq, and it's finding it > >> set there. Could we be getting SCSI_DH_IMM_RETRYs and that's causing the > >> same request to be used without being re-initialized, and on error the > >> bit is not being cleaned up properly? > >> > >> I'm checking that out next... > > > > That does indeed look problematic, we only init the timer stuff when > > getting the request initially. So you could either make your retry loop > > do blk_put_request() and jump to the very beginning again, or this > > should fix the current usage. > > > > diff --git a/block/elevator.c b/block/elevator.c > > index a6951f7..0a2f378 100644 > > --- a/block/elevator.c > > +++ b/block/elevator.c > > @@ -590,6 +590,12 @@ void elv_insert(struct request_queue *q, struct request *rq, int where) > > > > rq->q = q; > > > > + /* > > + * This could happen on a request requeue, init the timer here as well > > + */ > > + blk_delete_timer(rq); > > + blk_clear_rq_complete(rq); > > + > > switch (where) { > > case ELEVATOR_INSERT_FRONT: > > rq->cmd_flags |= REQ_SOFTBARRIER; > > > > Hi Jens - > > I was able to determine we were getting retries on TURs, and then right > (soon?) thereafter was when we triggered the problem. I put some > slightly different code than what you suggest above, and that triggered > a problem elsewhere in SCSI. I backed out what I did, inserted your code > (after updating my tree to:) > > commit bbeba4c35c252b2e961f09ce6ebe76b2cd5e7e3e > Merge: 6df944c... 2cbed89... > Author: Linus Torvalds <torvalds@linux-foundation.org> > Date: Thu Dec 4 21:45:44 2008 -0800 > > Merge branch 'for-linus' of > git://git.kernel.org/pub/scm/linux/kernel/git/vi > > * 'for-linus' of > git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev: > [PATCH] fix bogus argument of blkdev_put() in pktcdvd > [PATCH 2/2] documnt FMODE_ constants > [PATCH 1/2] kill FMODE_NDELAY_NOW > [PATCH] clean up blkdev_get a little bit > [PATCH] Fix block dev compat ioctl handling > [PATCH] kill obsolete temporary comment in swsusp_close() > > and I get the following SCSI-related problem (the same error I got with > code I tried last night - basically just clearing out the atomic value). > This has now happened repeatedly 3 times. [I've cc'd linux-scsi as well, > just to make sure to cover all the bases...] > > ------------[ cut here ]------------ > kernel BUG at drivers/scsi/scsi.c:347! > invalid opcode: 0000 [#1] SMP > last sysfs file: > CPU 0 > Pid: 0, comm: swapper Not tainted 2.6.28-rc7 #7 > RIP: 0010:[<ffffffff80d3df3e>] [<ffffffff80d3df3e>] > scsi_put_command+0x27/0x65 > RSP: 0018:ffffffff827b3d80 EFLAGS: 00010046 > RAX: 0000000000000282 RBX: ffff88087b68dec0 RCX: ffff88087b68dec8 > RDX: ffff88087b68dec8 RSI: 0000000000000282 RDI: ffff88087b35d834 > RBP: ffffffff827b3d90 R08: ffffffff827b3d30 R09: 00000000827b3d60 > R10: 0000000000000246 R11: ffff88087b16042c R12: ffff88087b35d800 > R13: ffff88087b35d920 R14: 0000000000000000 R15: 0000000000000001 > FS: 0000000000000000(0000) GS:ffffffff82535500(0000) knlGS:0000000000000000 > CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b > CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process swapper (pid: 0, threadinfo ffffffff82548000, task ffffffff82049380) > Stack: > ffff88087b68dec0 ffff88087b36b230 ffffffff827b3dc0 ffffffff80d43793 > 0000000000000000 ffff88087b68dec0 ffff88087b35b658 0000000000000000 > ffffffff827b3e00 ffffffff80d43980 ffff88087b36b230 ffff88087b35b658 > Call Trace: > <IRQ> <0> [<ffffffff80d43793>] scsi_next_command+0x2e/0x46 > [<ffffffff80d43980>] scsi_end_request+0x92/0xa4 > [<ffffffff80d43fe3>] scsi_io_completion+0x1a7/0x3ad > [<ffffffff80d3d926>] scsi_finish_command+0xe9/0xf2 > [<ffffffff80d444d5>] scsi_softirq_done+0x10c/0x115 > [<ffffffff8062afeb>] blk_done_softirq+0x77/0x87 > [<ffffffff802628a0>] ? ktime_get_ts+0x49/0x4e > [<ffffffff802514b8>] __do_softirq+0x8a/0x151 > [<ffffffff8256b140>] ? early_idt_handler+0x0/0x72 > [<ffffffff8020dbec>] call_softirq+0x1c/0x28 > [<ffffffff8020f32c>] do_softirq+0x44/0x8b > [<ffffffff802511bf>] irq_exit+0x3f/0x82 > [<ffffffff8020f5f0>] do_IRQ+0xc3/0xe3 > [<ffffffff8020c983>] ret_from_intr+0x0/0x29 > <EOI> <0> [<ffffffff80228446>] ? native_safe_halt+0x6/0x8 > [<ffffffff8021406e>] ? default_idle+0x3c/0x59 > [<ffffffff802141b0>] ? c1e_idle+0x117/0x11e > [<ffffffff8026332d>] ? atomic_notifier_call_chain+0x13/0x15 > [<ffffffff8020b4ea>] ? cpu_idle+0x51/0x92 > [<ffffffff815be32d>] ? rest_init+0x61/0x63 > [<ffffffff8256bd6b>] ? start_kernel+0x39c/0x3a7 > [<ffffffff8256b29f>] ? x86_64_start_reservations+0xa5/0xa9 > [<ffffffff8256b3f0>] ? x86_64_start_kernel+0x12a/0x139 > Code: 41 5e c9 c3 55 48 89 e5 41 54 53 4c 8b 27 48 89 fb 49 8d 7c 24 34 > e8 69 56 98 00 48 8b 53 08 48 8d 4b 08 48 89 c6 48 39 ca 75 04 <0f> 0b > eb fe 48 8b 43 10 48 89 42 08 48 89 10 48 8b 3b 48 89 4b > RIP [<ffffffff80d3df3e>] scsi_put_command+0x27/0x65 > RSP <ffffffff827b3d80> > I haven't audited everything for that path yet, but I don't think others do retries like you suggest here. I'd strongly encourage you to change the retry logic to really put and get a new request, instead of relying on all state being clean for a reissue. -- Jens Axboe ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2008-12-05 13:35 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-12-04 14:26 kernel BUG at block/blk-timeout.c:178! Alan D. Brunelle 2008-12-04 15:25 ` Alan D. Brunelle 2008-12-04 15:50 ` Jens Axboe 2008-12-04 16:13 ` Alan D. Brunelle 2008-12-04 18:31 ` Alan D. Brunelle 2008-12-04 21:06 ` Alan D. Brunelle 2008-12-05 9:40 ` Jens Axboe 2008-12-05 13:32 ` Alan D. Brunelle 2008-12-05 13:35 ` Jens Axboe
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox