* [PATCH V2] SCSI: fix queue cleanup race before queue initialization is done
@ 2018-11-14 8:25 Ming Lei
2018-11-14 15:02 ` Bart Van Assche
2018-11-14 15:20 ` Jens Axboe
0 siblings, 2 replies; 16+ messages in thread
From: Ming Lei @ 2018-11-14 8:25 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Ming Lei, Andrew Jones, Bart Van Assche, linux-scsi,
Martin K . Petersen, Christoph Hellwig, James E . J . Bottomley,
stable, jianchao . wang
c2856ae2f315d ("blk-mq: quiesce queue before freeing queue") has
already fixed this race, however the implied synchronize_rcu()
in blk_mq_quiesce_queue() can slow down LUN probe a lot, so caused
performance regression.
Then 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
tried to quiesce queue for avoiding unnecessary synchronize_rcu()
only when queue initialization is done, because it is usual to see
lots of inexistent LUNs which need to be probed.
However, turns out it isn't safe to quiesce queue only when queue
initialization is done. Because when one SCSI command is completed,
the user of sending command can be waken up immediately, then the
scsi device may be removed, meantime the run queue in scsi_end_request()
is still in-progress, so kernel panic can be caused.
In Red Hat QE lab, there are several reports about this kind of kernel
panic triggered during kernel booting.
This patch tries to address the issue by grabing one queue usage
counter during freeing one request and the following run queue.
Fixes: 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
Cc: Andrew Jones <drjones@redhat.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
Cc: stable <stable@vger.kernel.org>
Cc: jianchao.wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-core.c | 5 ++---
drivers/scsi/scsi_lib.c | 8 ++++++++
2 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index ce12515f9b9b..deb56932f8c4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -798,9 +798,8 @@ void blk_cleanup_queue(struct request_queue *q)
* dispatch may still be in-progress since we dispatch requests
* from more than one contexts.
*
- * No need to quiesce queue if it isn't initialized yet since
- * blk_freeze_queue() should be enough for cases of passthrough
- * request.
+ * We rely on driver to deal with the race in case that queue
+ * initialization isn't done.
*/
if (q->mq_ops && blk_queue_init_done(q))
blk_mq_quiesce_queue(q);
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index c7fccbb8f554..fa6e0c3b3aa6 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -697,6 +697,12 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
*/
scsi_mq_uninit_cmd(cmd);
+ /*
+ * queue is still alive, so grab the ref for preventing it
+ * from being cleaned up during running queue.
+ */
+ percpu_ref_get(&q->q_usage_counter);
+
__blk_mq_end_request(req, error);
if (scsi_target(sdev)->single_lun ||
@@ -704,6 +710,8 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
kblockd_schedule_work(&sdev->requeue_work);
else
blk_mq_run_hw_queues(q, true);
+
+ percpu_ref_put(&q->q_usage_counter);
} else {
unsigned long flags;
--
2.9.5
^ permalink raw reply related [flat|nested] 16+ messages in thread* Re: [PATCH V2] SCSI: fix queue cleanup race before queue initialization is done 2018-11-14 8:25 [PATCH V2] SCSI: fix queue cleanup race before queue initialization is done Ming Lei @ 2018-11-14 15:02 ` Bart Van Assche 2018-11-15 0:48 ` Ming Lei 2018-11-14 15:20 ` Jens Axboe 1 sibling, 1 reply; 16+ messages in thread From: Bart Van Assche @ 2018-11-14 15:02 UTC (permalink / raw) To: Ming Lei, Jens Axboe Cc: linux-block, Andrew Jones, Bart Van Assche, linux-scsi, Martin K . Petersen, Christoph Hellwig, James E . J . Bottomley, stable, jianchao . wang On Wed, 2018-11-14 at 16:25 +0800, Ming Lei wrote: > --- a/drivers/scsi/scsi_lib.c > +++ b/drivers/scsi/scsi_lib.c > @@ -697,6 +697,12 @@ static bool scsi_end_request(struct request *req, > blk_status_t error, > */ > scsi_mq_uninit_cmd(cmd); > > + /* > + * queue is still alive, so grab the ref for preventing it > + * from being cleaned up during running queue. > + */ > + percpu_ref_get(&q->q_usage_counter); > + I think the above comment is misleading. In the block layer a queue is called alive if the "dying" flag has not been set. When the above call to percpu_ref_get() occurs it is not guaranteed that that flag has not yet been set. But it is guaranteed that q->q_usage_counter is not zero. I would prefer if the comment would be modified. What's not clear to me is why this patch only protects the blk-mq path but not the legacy path. Does the legacy path need similar protection? It also triggers a queue run after having finished a request. Otherwise this patch looks fine to me. Bart. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH V2] SCSI: fix queue cleanup race before queue initialization is done 2018-11-14 15:02 ` Bart Van Assche @ 2018-11-15 0:48 ` Ming Lei 0 siblings, 0 replies; 16+ messages in thread From: Ming Lei @ 2018-11-15 0:48 UTC (permalink / raw) To: Bart Van Assche Cc: Jens Axboe, linux-block, Andrew Jones, Bart Van Assche, linux-scsi, Martin K . Petersen, Christoph Hellwig, James E . J . Bottomley, stable, jianchao . wang On Wed, Nov 14, 2018 at 07:02:28AM -0800, Bart Van Assche wrote: > On Wed, 2018-11-14 at 16:25 +0800, Ming Lei wrote: > > --- a/drivers/scsi/scsi_lib.c > > +++ b/drivers/scsi/scsi_lib.c > > @@ -697,6 +697,12 @@ static bool scsi_end_request(struct request *req, > > blk_status_t error, > > */ > > scsi_mq_uninit_cmd(cmd); > > > > + /* > > + * queue is still alive, so grab the ref for preventing it > > + * from being cleaned up during running queue. > > + */ > > + percpu_ref_get(&q->q_usage_counter); > > + > > I think the above comment is misleading. In the block layer a queue is called > alive if the "dying" flag has not been set. When the above call to > percpu_ref_get() occurs it is not guaranteed that that flag has not yet been > set. But it is guaranteed that q->q_usage_counter is not zero. I would prefer > if the comment would be modified. I am fine with either way given we know the context, not mention the first thing blk_cleanup_queue() does is to call blk_set_queue_dying(). > > What's not clear to me is why this patch only protects the blk-mq path but > not the legacy path. Does the legacy path need similar protection? It also > triggers a queue run after having finished a request. The queue_lock is held for protecting everything, so such issue in legacy path. Thanks, Ming ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH V2] SCSI: fix queue cleanup race before queue initialization is done 2018-11-14 8:25 [PATCH V2] SCSI: fix queue cleanup race before queue initialization is done Ming Lei 2018-11-14 15:02 ` Bart Van Assche @ 2018-11-14 15:20 ` Jens Axboe 2018-11-15 1:02 ` Ming Lei 2018-11-21 21:47 ` Jens Axboe 1 sibling, 2 replies; 16+ messages in thread From: Jens Axboe @ 2018-11-14 15:20 UTC (permalink / raw) To: Ming Lei Cc: linux-block, Andrew Jones, Bart Van Assche, linux-scsi, Martin K . Petersen, Christoph Hellwig, James E . J . Bottomley, stable, jianchao . wang On 11/14/18 1:25 AM, Ming Lei wrote: > c2856ae2f315d ("blk-mq: quiesce queue before freeing queue") has > already fixed this race, however the implied synchronize_rcu() > in blk_mq_quiesce_queue() can slow down LUN probe a lot, so caused > performance regression. > > Then 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()") > tried to quiesce queue for avoiding unnecessary synchronize_rcu() > only when queue initialization is done, because it is usual to see > lots of inexistent LUNs which need to be probed. > > However, turns out it isn't safe to quiesce queue only when queue > initialization is done. Because when one SCSI command is completed, > the user of sending command can be waken up immediately, then the > scsi device may be removed, meantime the run queue in scsi_end_request() > is still in-progress, so kernel panic can be caused. > > In Red Hat QE lab, there are several reports about this kind of kernel > panic triggered during kernel booting. > > This patch tries to address the issue by grabing one queue usage > counter during freeing one request and the following run queue. Thanks applied, this bug was elusive but ever present in recent testing that we did internally, it's been a huge pain in the butt. The symptoms were usually a crash in blk_mq_get_driver_tag() with hctx->tags == NULL, or a crash inside deadline request insert off requeue. -- Jens Axboe ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH V2] SCSI: fix queue cleanup race before queue initialization is done 2018-11-14 15:20 ` Jens Axboe @ 2018-11-15 1:02 ` Ming Lei 2018-11-21 21:47 ` Jens Axboe 1 sibling, 0 replies; 16+ messages in thread From: Ming Lei @ 2018-11-15 1:02 UTC (permalink / raw) To: Jens Axboe Cc: linux-block, Andrew Jones, Bart Van Assche, linux-scsi, Martin K . Petersen, Christoph Hellwig, James E . J . Bottomley, stable, jianchao . wang On Wed, Nov 14, 2018 at 08:20:09AM -0700, Jens Axboe wrote: > On 11/14/18 1:25 AM, Ming Lei wrote: > > c2856ae2f315d ("blk-mq: quiesce queue before freeing queue") has > > already fixed this race, however the implied synchronize_rcu() > > in blk_mq_quiesce_queue() can slow down LUN probe a lot, so caused > > performance regression. > > > > Then 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()") > > tried to quiesce queue for avoiding unnecessary synchronize_rcu() > > only when queue initialization is done, because it is usual to see > > lots of inexistent LUNs which need to be probed. > > > > However, turns out it isn't safe to quiesce queue only when queue > > initialization is done. Because when one SCSI command is completed, > > the user of sending command can be waken up immediately, then the > > scsi device may be removed, meantime the run queue in scsi_end_request() > > is still in-progress, so kernel panic can be caused. > > > > In Red Hat QE lab, there are several reports about this kind of kernel > > panic triggered during kernel booting. > > > > This patch tries to address the issue by grabing one queue usage > > counter during freeing one request and the following run queue. > > Thanks applied, this bug was elusive but ever present in recent > testing that we did internally, it's been a huge pain in the butt. > The symptoms were usually a crash in blk_mq_get_driver_tag() with > hctx->tags == NULL, or a crash inside deadline request insert off > requeue. Thanks for applying it. In Red Hat internal test, kernel panic is triggered in blk_mq_hctx_has_pending(), either sbitmap_any_bit_set() or elevator's .has_work. I think this patch can fix most of SCSI's corner case, but may not cover all, that is why I marked it as RFC in 1st post. The root cause is in blk_mq_run_hw_queue(), which calls blk_mq_hctx_has_pending() with RCU read lock held, but we can't afford the synchronize_rcu() when blk_queue_init_done() is false. For SCSI, blk_mq_run_hw_queue() can be run from other 3 code paths: 1) scsi_ioctl_reset() - this one should be fine, given ioctl should be run after disk is added 2) scsi_error_handler() - this one is fine too, since EH implies that there is failed request not completed yet 3) scsi_unblock_requests() - there might be risk in this code, I guess. Also not sure if there is such case for other devices. Thanks, Ming ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH V2] SCSI: fix queue cleanup race before queue initialization is done 2018-11-14 15:20 ` Jens Axboe 2018-11-15 1:02 ` Ming Lei @ 2018-11-21 21:47 ` Jens Axboe 2018-11-21 22:02 ` Theodore Y. Ts'o 2018-11-22 1:00 ` Ming Lei 1 sibling, 2 replies; 16+ messages in thread From: Jens Axboe @ 2018-11-21 21:47 UTC (permalink / raw) To: Ming Lei Cc: linux-block, Andrew Jones, Bart Van Assche, linux-scsi, Martin K . Petersen, Christoph Hellwig, James E . J . Bottomley, stable, jianchao . wang On 11/14/18 8:20 AM, Jens Axboe wrote: > On 11/14/18 1:25 AM, Ming Lei wrote: >> c2856ae2f315d ("blk-mq: quiesce queue before freeing queue") has >> already fixed this race, however the implied synchronize_rcu() >> in blk_mq_quiesce_queue() can slow down LUN probe a lot, so caused >> performance regression. >> >> Then 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()") >> tried to quiesce queue for avoiding unnecessary synchronize_rcu() >> only when queue initialization is done, because it is usual to see >> lots of inexistent LUNs which need to be probed. >> >> However, turns out it isn't safe to quiesce queue only when queue >> initialization is done. Because when one SCSI command is completed, >> the user of sending command can be waken up immediately, then the >> scsi device may be removed, meantime the run queue in scsi_end_request() >> is still in-progress, so kernel panic can be caused. >> >> In Red Hat QE lab, there are several reports about this kind of kernel >> panic triggered during kernel booting. >> >> This patch tries to address the issue by grabing one queue usage >> counter during freeing one request and the following run queue. > > Thanks applied, this bug was elusive but ever present in recent > testing that we did internally, it's been a huge pain in the butt. > The symptoms were usually a crash in blk_mq_get_driver_tag() with > hctx->tags == NULL, or a crash inside deadline request insert off > requeue. I'm still hitting some weird crashes even with this applied, like this one: BUG: unable to handle kernel NULL pointer dereference at 0000000000000148 PGD 0 P4D 0. Oops: 0000 [#1] SMP PTI CPU: 37 PID: 763 Comm: kworker/37:1H Not tainted 4.20.0-rc3-00649-ge64d9a554a91-dirty #14 Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM08 03/03/2017 Workqueue: kblockd blk_mq_run_work_fn RIP: 0010:blk_mq_get_driver_tag+0x81/0x120 Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 f6 87 b0 00 00 00 02 RSP: 0018:ffffc90004aabd30 EFLAGS: 00010246 RAX: 0000000000000003 RBX: ffff888465ea1300 RCX: ffffc90004aabde8 RDX: 00000000ffffffff RSI: ffffc90004aabde8 RDI: 0000000000000000 RBP: 0000000000000000 R08: ffff888465ea1348 R09: 0000000000000000 R10: 0000000000001000 R11: 00000000ffffffff R12: ffff888465ea1300 R13: 0000000000000000 R14: ffff888465ea1348 R15: ffff888465d10000 FS: 0000000000000000(0000) GS:ffff88846f9c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000148 CR3: 000000000220a003 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: blk_mq_dispatch_rq_list+0xec/0x480 ? elv_rb_del+0x11/0x30 blk_mq_do_dispatch_sched+0x6e/0xf0 blk_mq_sched_dispatch_requests+0xfa/0x170 __blk_mq_run_hw_queue+0x5f/0xe0 process_one_work+0x154/0x350 worker_thread+0x46/0x3c0 kthread+0xf5/0x130 ? process_one_work+0x350/0x350 ? kthread_destroy_worker+0x50/0x50 ret_from_fork+0x1f/0x30 Modules linked in: sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm switchtec irqbypass iTCO_wdt iTCO_vendor_support efivars cdc_ether usbnet mii cdc_acm i2c_i801 lpc_ich mfd_core ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq button sch_fq_codel nfsd nfs_acl lockd grace auth_rpcgss oid_registry sunrpc nvme nvme_core fuse sg loop efivarfs autofs4 CR2: 0000000000000148 ---[ end trace 340a1fb996df1b9b ]--- RIP: 0010:blk_mq_get_driver_tag+0x81/0x120 Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 f6 87 b0 00 00 00 02 which doesn't look that great... Are we sure this patch closed the window completely? One thing I'm pondering is we're running the queue async, so the ref get will protect whatever blk_mq_run_hw_queues() does, but what is preventing the queue from going away as soon as we've returned from that call? Meanwhile we still have the work item queued up, and it'll run, and go boom like above. -- Jens Axboe ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH V2] SCSI: fix queue cleanup race before queue initialization is done 2018-11-21 21:47 ` Jens Axboe @ 2018-11-21 22:02 ` Theodore Y. Ts'o 2018-11-22 3:43 ` Theodore Y. Ts'o 2018-11-22 1:00 ` Ming Lei 1 sibling, 1 reply; 16+ messages in thread From: Theodore Y. Ts'o @ 2018-11-21 22:02 UTC (permalink / raw) To: Jens Axboe Cc: Ming Lei, linux-block, Andrew Jones, Bart Van Assche, linux-scsi, Martin K . Petersen, Christoph Hellwig, James E . J . Bottomley, stable, jianchao . wang On Wed, Nov 21, 2018 at 02:47:35PM -0700, Jens Axboe wrote: > > Thanks applied, this bug was elusive but ever present in recent > > testing that we did internally, it's been a huge pain in the butt. > > The symptoms were usually a crash in blk_mq_get_driver_tag() with > > hctx->tags == NULL, or a crash inside deadline request insert off > > requeue. > > I'm still hitting some weird crashes even with this applied, like > this one: FYI, there are a number of Ubuntu users running 4.19, 4.19.1, and 4.19.2 which have been reporting file system corruption problems. They have a fix of configurations, but one of the things which is seem to be a common factor is they all have CONFIG_SCSI_MQ_DEFAULT disabled. (Which also happens to be how I happen to be running my laptop, and I've noticed no problems.) https://bugzilla.kernel.org/show_bug.cgi?id=201685 One user in particular reported that 4.19 worked fine, and 4.19.1 had fs corruptions (and there are no storage-related changes between 4.19 and 4.19.1) --- but the one thing those two kernels had in common was his 4.19 build had SCSI_MQ_DEFAULT disabled, and his 4.19.1 build did *not* have SCSI_MQ_DEFAULT enabled. This same user tried 4.19.3, and after two hours of heavy I/O, he's not seen a repeat, and interestingly, 4.19.3 has the backport mentioned on this thread. The weird thing is that it looked like the problem that was fixed by this commit would only show up at queue setup and teardown time. Is that correct? Is it possible that the bug fixed here would manifest as data corruptions on disk? Or would only manifest as kernel BUG_ON's and/or crashes? One more thing. I tried building a 4.20-rc2 based kernel with CONFIG_SCSI_MQ_DEFAULT=y, and I tried running gce-xfstests (which uses virtio-scsi) and I saw no failures. So I don't have a clean repro of Kernel Bugzilla #201685, and at the moment I'm too chicken to enable CONFIG_SCSI_MQ_DEFAULT on my primary development laptop... Any thoughts/suggestions appreciated. - Ted ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH V2] SCSI: fix queue cleanup race before queue initialization is done 2018-11-21 22:02 ` Theodore Y. Ts'o @ 2018-11-22 3:43 ` Theodore Y. Ts'o 0 siblings, 0 replies; 16+ messages in thread From: Theodore Y. Ts'o @ 2018-11-22 3:43 UTC (permalink / raw) To: Jens Axboe, Ming Lei, linux-block, Andrew Jones, Bart Van Assche, linux-scsi, Martin K . Petersen, Christoph Hellwig, James E . J . Bottomley, stable, jianchao . wang On Wed, Nov 21, 2018 at 05:02:13PM -0500, Theodore Y. Ts'o wrote: > On Wed, Nov 21, 2018 at 02:47:35PM -0700, Jens Axboe wrote: > > > Thanks applied, this bug was elusive but ever present in recent > > > testing that we did internally, it's been a huge pain in the butt. > > > The symptoms were usually a crash in blk_mq_get_driver_tag() with > > > hctx->tags == NULL, or a crash inside deadline request insert off > > > requeue. > > > > I'm still hitting some weird crashes even with this applied, like > > this one: > > FYI, there are a number of Ubuntu users running 4.19, 4.19.1, and > 4.19.2 which have been reporting file system corruption problems. > They have a fix of configurations, but one of the things which is seem > to be a common factor is they all have CONFIG_SCSI_MQ_DEFAULT > disabled. (Which also happens to be how I happen to be running my > laptop, and I've noticed no problems.) One correction to the above --- the people who are having the problem have CONFIG_SCSI_MQ_DEFAULT *enabled* (at least for those who reported the kernel configs --- not all of them did). I have CONFIG_SCSI_MQ_DEFAULT *disabled*, and things are running just fine on my laptop. Although that may be a red herring, since as you pointed out on the bug NVMe isn't affected by the SCSI_MQ_DEFAULT setting (sorry, I'm still used to a world where SCSI controls the whole world :-). And my laptop is an XPS 13 with an NVMe-attached 1T SSD. Fortunately I've not seen any corruption (or at least nothing visible yet). Anyway, all of this is in the bug, and I'll see if I can find a way of repro'ing corruption in a KVM or GCE crash-and-burn environment... - Ted ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH V2] SCSI: fix queue cleanup race before queue initialization is done 2018-11-21 21:47 ` Jens Axboe 2018-11-21 22:02 ` Theodore Y. Ts'o @ 2018-11-22 1:00 ` Ming Lei 2018-11-22 1:42 ` Jens Axboe 1 sibling, 1 reply; 16+ messages in thread From: Ming Lei @ 2018-11-22 1:00 UTC (permalink / raw) To: Jens Axboe Cc: linux-block, Andrew Jones, Bart Van Assche, linux-scsi, Martin K . Petersen, Christoph Hellwig, James E . J . Bottomley, stable, jianchao . wang On Wed, Nov 21, 2018 at 02:47:35PM -0700, Jens Axboe wrote: > On 11/14/18 8:20 AM, Jens Axboe wrote: > > On 11/14/18 1:25 AM, Ming Lei wrote: > >> c2856ae2f315d ("blk-mq: quiesce queue before freeing queue") has > >> already fixed this race, however the implied synchronize_rcu() > >> in blk_mq_quiesce_queue() can slow down LUN probe a lot, so caused > >> performance regression. > >> > >> Then 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()") > >> tried to quiesce queue for avoiding unnecessary synchronize_rcu() > >> only when queue initialization is done, because it is usual to see > >> lots of inexistent LUNs which need to be probed. > >> > >> However, turns out it isn't safe to quiesce queue only when queue > >> initialization is done. Because when one SCSI command is completed, > >> the user of sending command can be waken up immediately, then the > >> scsi device may be removed, meantime the run queue in scsi_end_request() > >> is still in-progress, so kernel panic can be caused. > >> > >> In Red Hat QE lab, there are several reports about this kind of kernel > >> panic triggered during kernel booting. > >> > >> This patch tries to address the issue by grabing one queue usage > >> counter during freeing one request and the following run queue. > > > > Thanks applied, this bug was elusive but ever present in recent > > testing that we did internally, it's been a huge pain in the butt. > > The symptoms were usually a crash in blk_mq_get_driver_tag() with > > hctx->tags == NULL, or a crash inside deadline request insert off > > requeue. > > I'm still hitting some weird crashes even with this applied, like > this one: > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000148 > PGD 0 P4D 0. > Oops: 0000 [#1] SMP PTI > CPU: 37 PID: 763 Comm: kworker/37:1H Not tainted 4.20.0-rc3-00649-ge64d9a554a91-dirty #14 > Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM08 03/03/2017 > Workqueue: kblockd blk_mq_run_work_fn > RIP: 0010:blk_mq_get_driver_tag+0x81/0x120 > Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 f6 87 b0 00 00 00 02 > RSP: 0018:ffffc90004aabd30 EFLAGS: 00010246 > RAX: 0000000000000003 RBX: ffff888465ea1300 RCX: ffffc90004aabde8 > RDX: 00000000ffffffff RSI: ffffc90004aabde8 RDI: 0000000000000000 > RBP: 0000000000000000 R08: ffff888465ea1348 R09: 0000000000000000 > R10: 0000000000001000 R11: 00000000ffffffff R12: ffff888465ea1300 > R13: 0000000000000000 R14: ffff888465ea1348 R15: ffff888465d10000 > FS: 0000000000000000(0000) GS:ffff88846f9c0000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000000000148 CR3: 000000000220a003 CR4: 00000000003606e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Call Trace: > blk_mq_dispatch_rq_list+0xec/0x480 > ? elv_rb_del+0x11/0x30 > blk_mq_do_dispatch_sched+0x6e/0xf0 > blk_mq_sched_dispatch_requests+0xfa/0x170 > __blk_mq_run_hw_queue+0x5f/0xe0 > process_one_work+0x154/0x350 > worker_thread+0x46/0x3c0 > kthread+0xf5/0x130 > ? process_one_work+0x350/0x350 > ? kthread_destroy_worker+0x50/0x50 > ret_from_fork+0x1f/0x30 > Modules linked in: sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm switchtec irqbypass iTCO_wdt iTCO_vendor_support efivars cdc_ether usbnet mii cdc_acm i2c_i801 lpc_ich mfd_core ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq button sch_fq_codel nfsd nfs_acl lockd grace auth_rpcgss oid_registry sunrpc nvme nvme_core fuse sg loop efivarfs autofs4 > CR2: 0000000000000148 > ---[ end trace 340a1fb996df1b9b ]--- > RIP: 0010:blk_mq_get_driver_tag+0x81/0x120 > Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 f6 87 b0 00 00 00 02 > > which doesn't look that great... Are we sure this patch closed the window > completely? I mentioned this patch is just one workaround, see my comment before: https://marc.info/?l=linux-scsi&m=154224379320094&w=2 > > One thing I'm pondering is we're running the queue async, so the > ref get will protect whatever blk_mq_run_hw_queues() does, but > what is preventing the queue from going away as soon as we've > returned from that call? Meanwhile we still have the work item > queued up, and it'll run, and go boom like above. blk_sync_queue() supposes to drain the queued work, but it can be queued after blk_sync_queue() returns. Or maybe we can try the following patch? -- diff --git a/block/blk-mq.c b/block/blk-mq.c index 0375c3bd410e..986c211ca74c 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1570,6 +1570,9 @@ static void blk_mq_run_work_fn(struct work_struct *work) hctx = container_of(work, struct blk_mq_hw_ctx, run_work.work); + if (unlikely(blk_queue_dead(hctx->queue))) + return; + /* * If we are stopped, don't run the queue. */ Thanks, Ming ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH V2] SCSI: fix queue cleanup race before queue initialization is done 2018-11-22 1:00 ` Ming Lei @ 2018-11-22 1:42 ` Jens Axboe 2018-11-22 2:00 ` Ming Lei 2019-03-29 20:21 ` James Smart 0 siblings, 2 replies; 16+ messages in thread From: Jens Axboe @ 2018-11-22 1:42 UTC (permalink / raw) To: Ming Lei Cc: linux-block, Andrew Jones, Bart Van Assche, linux-scsi, Martin K . Petersen, Christoph Hellwig, James E . J . Bottomley, stable, jianchao . wang On 11/21/18 6:00 PM, Ming Lei wrote: > On Wed, Nov 21, 2018 at 02:47:35PM -0700, Jens Axboe wrote: >> On 11/14/18 8:20 AM, Jens Axboe wrote: >>> On 11/14/18 1:25 AM, Ming Lei wrote: >>>> c2856ae2f315d ("blk-mq: quiesce queue before freeing queue") has >>>> already fixed this race, however the implied synchronize_rcu() >>>> in blk_mq_quiesce_queue() can slow down LUN probe a lot, so caused >>>> performance regression. >>>> >>>> Then 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()") >>>> tried to quiesce queue for avoiding unnecessary synchronize_rcu() >>>> only when queue initialization is done, because it is usual to see >>>> lots of inexistent LUNs which need to be probed. >>>> >>>> However, turns out it isn't safe to quiesce queue only when queue >>>> initialization is done. Because when one SCSI command is completed, >>>> the user of sending command can be waken up immediately, then the >>>> scsi device may be removed, meantime the run queue in scsi_end_request() >>>> is still in-progress, so kernel panic can be caused. >>>> >>>> In Red Hat QE lab, there are several reports about this kind of kernel >>>> panic triggered during kernel booting. >>>> >>>> This patch tries to address the issue by grabing one queue usage >>>> counter during freeing one request and the following run queue. >>> >>> Thanks applied, this bug was elusive but ever present in recent >>> testing that we did internally, it's been a huge pain in the butt. >>> The symptoms were usually a crash in blk_mq_get_driver_tag() with >>> hctx->tags == NULL, or a crash inside deadline request insert off >>> requeue. >> >> I'm still hitting some weird crashes even with this applied, like >> this one: >> >> BUG: unable to handle kernel NULL pointer dereference at 0000000000000148 >> PGD 0 P4D 0. >> Oops: 0000 [#1] SMP PTI >> CPU: 37 PID: 763 Comm: kworker/37:1H Not tainted 4.20.0-rc3-00649-ge64d9a554a91-dirty #14 >> Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM08 03/03/2017 >> Workqueue: kblockd blk_mq_run_work_fn >> RIP: 0010:blk_mq_get_driver_tag+0x81/0x120 >> Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 f6 87 b0 00 00 00 02 >> RSP: 0018:ffffc90004aabd30 EFLAGS: 00010246 >> RAX: 0000000000000003 RBX: ffff888465ea1300 RCX: ffffc90004aabde8 >> RDX: 00000000ffffffff RSI: ffffc90004aabde8 RDI: 0000000000000000 >> RBP: 0000000000000000 R08: ffff888465ea1348 R09: 0000000000000000 >> R10: 0000000000001000 R11: 00000000ffffffff R12: ffff888465ea1300 >> R13: 0000000000000000 R14: ffff888465ea1348 R15: ffff888465d10000 >> FS: 0000000000000000(0000) GS:ffff88846f9c0000(0000) knlGS:0000000000000000 >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> CR2: 0000000000000148 CR3: 000000000220a003 CR4: 00000000003606e0 >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >> Call Trace: >> blk_mq_dispatch_rq_list+0xec/0x480 >> ? elv_rb_del+0x11/0x30 >> blk_mq_do_dispatch_sched+0x6e/0xf0 >> blk_mq_sched_dispatch_requests+0xfa/0x170 >> __blk_mq_run_hw_queue+0x5f/0xe0 >> process_one_work+0x154/0x350 >> worker_thread+0x46/0x3c0 >> kthread+0xf5/0x130 >> ? process_one_work+0x350/0x350 >> ? kthread_destroy_worker+0x50/0x50 >> ret_from_fork+0x1f/0x30 >> Modules linked in: sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm switchtec irqbypass iTCO_wdt iTCO_vendor_support efivars cdc_ether usbnet mii cdc_acm i2c_i801 lpc_ich mfd_core ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq button sch_fq_codel nfsd nfs_acl lockd grace auth_rpcgss oid_registry sunrpc nvme nvme_core fuse sg loop efivarfs autofs4 >> CR2: 0000000000000148 >> ---[ end trace 340a1fb996df1b9b ]--- >> RIP: 0010:blk_mq_get_driver_tag+0x81/0x120 >> Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 f6 87 b0 00 00 00 02 >> >> which doesn't look that great... Are we sure this patch closed the window >> completely? > > I mentioned this patch is just one workaround, see my comment before: > > https://marc.info/?l=linux-scsi&m=154224379320094&w=2 > >> >> One thing I'm pondering is we're running the queue async, so the >> ref get will protect whatever blk_mq_run_hw_queues() does, but >> what is preventing the queue from going away as soon as we've >> returned from that call? Meanwhile we still have the work item >> queued up, and it'll run, and go boom like above. > > blk_sync_queue() supposes to drain the queued work, but it can be > queued after blk_sync_queue() returns. It's definitely broken. Big time. And we need to do something about it NOW. > Or maybe we can try the following patch? I'm going to start backing out the sync removal patches instead of adding items to the hot path... Ted, I saw your email, I'm looking into it. Sounds like a regression between 4.18 and 4.19. The sync issue could still be it, as it can cause memory corruption, and that could lead to other corruption issues. -- Jens Axboe ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH V2] SCSI: fix queue cleanup race before queue initialization is done 2018-11-22 1:42 ` Jens Axboe @ 2018-11-22 2:00 ` Ming Lei 2018-11-22 2:14 ` Jens Axboe 2019-03-29 20:21 ` James Smart 1 sibling, 1 reply; 16+ messages in thread From: Ming Lei @ 2018-11-22 2:00 UTC (permalink / raw) To: Jens Axboe Cc: linux-block, Andrew Jones, Bart Van Assche, linux-scsi, Martin K . Petersen, Christoph Hellwig, James E . J . Bottomley, stable, jianchao . wang On Wed, Nov 21, 2018 at 06:42:51PM -0700, Jens Axboe wrote: > On 11/21/18 6:00 PM, Ming Lei wrote: > > On Wed, Nov 21, 2018 at 02:47:35PM -0700, Jens Axboe wrote: > >> On 11/14/18 8:20 AM, Jens Axboe wrote: > >>> On 11/14/18 1:25 AM, Ming Lei wrote: > >>>> c2856ae2f315d ("blk-mq: quiesce queue before freeing queue") has > >>>> already fixed this race, however the implied synchronize_rcu() > >>>> in blk_mq_quiesce_queue() can slow down LUN probe a lot, so caused > >>>> performance regression. > >>>> > >>>> Then 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()") > >>>> tried to quiesce queue for avoiding unnecessary synchronize_rcu() > >>>> only when queue initialization is done, because it is usual to see > >>>> lots of inexistent LUNs which need to be probed. > >>>> > >>>> However, turns out it isn't safe to quiesce queue only when queue > >>>> initialization is done. Because when one SCSI command is completed, > >>>> the user of sending command can be waken up immediately, then the > >>>> scsi device may be removed, meantime the run queue in scsi_end_request() > >>>> is still in-progress, so kernel panic can be caused. > >>>> > >>>> In Red Hat QE lab, there are several reports about this kind of kernel > >>>> panic triggered during kernel booting. > >>>> > >>>> This patch tries to address the issue by grabing one queue usage > >>>> counter during freeing one request and the following run queue. > >>> > >>> Thanks applied, this bug was elusive but ever present in recent > >>> testing that we did internally, it's been a huge pain in the butt. > >>> The symptoms were usually a crash in blk_mq_get_driver_tag() with > >>> hctx->tags == NULL, or a crash inside deadline request insert off > >>> requeue. > >> > >> I'm still hitting some weird crashes even with this applied, like > >> this one: > >> > >> BUG: unable to handle kernel NULL pointer dereference at 0000000000000148 > >> PGD 0 P4D 0. > >> Oops: 0000 [#1] SMP PTI > >> CPU: 37 PID: 763 Comm: kworker/37:1H Not tainted 4.20.0-rc3-00649-ge64d9a554a91-dirty #14 > >> Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM08 03/03/2017 > >> Workqueue: kblockd blk_mq_run_work_fn > >> RIP: 0010:blk_mq_get_driver_tag+0x81/0x120 > >> Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 f6 87 b0 00 00 00 02 > >> RSP: 0018:ffffc90004aabd30 EFLAGS: 00010246 > >> RAX: 0000000000000003 RBX: ffff888465ea1300 RCX: ffffc90004aabde8 > >> RDX: 00000000ffffffff RSI: ffffc90004aabde8 RDI: 0000000000000000 > >> RBP: 0000000000000000 R08: ffff888465ea1348 R09: 0000000000000000 > >> R10: 0000000000001000 R11: 00000000ffffffff R12: ffff888465ea1300 > >> R13: 0000000000000000 R14: ffff888465ea1348 R15: ffff888465d10000 > >> FS: 0000000000000000(0000) GS:ffff88846f9c0000(0000) knlGS:0000000000000000 > >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > >> CR2: 0000000000000148 CR3: 000000000220a003 CR4: 00000000003606e0 > >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > >> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > >> Call Trace: > >> blk_mq_dispatch_rq_list+0xec/0x480 > >> ? elv_rb_del+0x11/0x30 > >> blk_mq_do_dispatch_sched+0x6e/0xf0 > >> blk_mq_sched_dispatch_requests+0xfa/0x170 > >> __blk_mq_run_hw_queue+0x5f/0xe0 > >> process_one_work+0x154/0x350 > >> worker_thread+0x46/0x3c0 > >> kthread+0xf5/0x130 > >> ? process_one_work+0x350/0x350 > >> ? kthread_destroy_worker+0x50/0x50 > >> ret_from_fork+0x1f/0x30 > >> Modules linked in: sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm switchtec irqbypass iTCO_wdt iTCO_vendor_support efivars cdc_ether usbnet mii cdc_acm i2c_i801 lpc_ich mfd_core ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq button sch_fq_codel nfsd nfs_acl lockd grace auth_rpcgss oid_registry sunrpc nvme nvme_core fuse sg loop efivarfs autofs4 > >> CR2: 0000000000000148 > >> ---[ end trace 340a1fb996df1b9b ]--- > >> RIP: 0010:blk_mq_get_driver_tag+0x81/0x120 > >> Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 f6 87 b0 00 00 00 02 > >> > >> which doesn't look that great... Are we sure this patch closed the window > >> completely? > > > > I mentioned this patch is just one workaround, see my comment before: > > > > https://marc.info/?l=linux-scsi&m=154224379320094&w=2 > > > >> > >> One thing I'm pondering is we're running the queue async, so the > >> ref get will protect whatever blk_mq_run_hw_queues() does, but > >> what is preventing the queue from going away as soon as we've > >> returned from that call? Meanwhile we still have the work item > >> queued up, and it'll run, and go boom like above. > > > > blk_sync_queue() supposes to drain the queued work, but it can be > > queued after blk_sync_queue() returns. > > It's definitely broken. Big time. And we need to do something about > it NOW. > > > Or maybe we can try the following patch? > > I'm going to start backing out the sync removal patches instead of > adding items to the hot path... Yeah, I agree to do the following to bring back the sync, even though it slows down boot. diff --git a/block/blk-core.c b/block/blk-core.c index 04f5be473638..f6943f4a4d16 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -355,7 +355,7 @@ void blk_cleanup_queue(struct request_queue *q) * We rely on driver to deal with the race in case that queue * initialization isn't done. */ - if (queue_is_mq(q) && blk_queue_init_done(q)) + if (queue_is_mq(q)) blk_mq_quiesce_queue(q); > > Ted, I saw your email, I'm looking into it. Sounds like a regression > between 4.18 and 4.19. The sync issue could still be it, as it can > cause memory corruption, and that could lead to other corruption > issues. But I don't understand why this sync issue is related data corruption because FS IO only starts after blk_queue_init_done() becomes true. And blk_mq_quiesce_queue() and the following blk_sync_queue() should have drained everything queued. Thanks, Ming ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH V2] SCSI: fix queue cleanup race before queue initialization is done 2018-11-22 2:00 ` Ming Lei @ 2018-11-22 2:14 ` Jens Axboe 2018-11-22 2:47 ` Ming Lei 0 siblings, 1 reply; 16+ messages in thread From: Jens Axboe @ 2018-11-22 2:14 UTC (permalink / raw) To: Ming Lei Cc: linux-block, Andrew Jones, Bart Van Assche, linux-scsi, Martin K . Petersen, Christoph Hellwig, James E . J . Bottomley, stable, jianchao . wang On 11/21/18 7:00 PM, Ming Lei wrote: > On Wed, Nov 21, 2018 at 06:42:51PM -0700, Jens Axboe wrote: >> On 11/21/18 6:00 PM, Ming Lei wrote: >>> On Wed, Nov 21, 2018 at 02:47:35PM -0700, Jens Axboe wrote: >>>> On 11/14/18 8:20 AM, Jens Axboe wrote: >>>>> On 11/14/18 1:25 AM, Ming Lei wrote: >>>>>> c2856ae2f315d ("blk-mq: quiesce queue before freeing queue") has >>>>>> already fixed this race, however the implied synchronize_rcu() >>>>>> in blk_mq_quiesce_queue() can slow down LUN probe a lot, so caused >>>>>> performance regression. >>>>>> >>>>>> Then 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()") >>>>>> tried to quiesce queue for avoiding unnecessary synchronize_rcu() >>>>>> only when queue initialization is done, because it is usual to see >>>>>> lots of inexistent LUNs which need to be probed. >>>>>> >>>>>> However, turns out it isn't safe to quiesce queue only when queue >>>>>> initialization is done. Because when one SCSI command is completed, >>>>>> the user of sending command can be waken up immediately, then the >>>>>> scsi device may be removed, meantime the run queue in scsi_end_request() >>>>>> is still in-progress, so kernel panic can be caused. >>>>>> >>>>>> In Red Hat QE lab, there are several reports about this kind of kernel >>>>>> panic triggered during kernel booting. >>>>>> >>>>>> This patch tries to address the issue by grabing one queue usage >>>>>> counter during freeing one request and the following run queue. >>>>> >>>>> Thanks applied, this bug was elusive but ever present in recent >>>>> testing that we did internally, it's been a huge pain in the butt. >>>>> The symptoms were usually a crash in blk_mq_get_driver_tag() with >>>>> hctx->tags == NULL, or a crash inside deadline request insert off >>>>> requeue. >>>> >>>> I'm still hitting some weird crashes even with this applied, like >>>> this one: >>>> >>>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000148 >>>> PGD 0 P4D 0. >>>> Oops: 0000 [#1] SMP PTI >>>> CPU: 37 PID: 763 Comm: kworker/37:1H Not tainted 4.20.0-rc3-00649-ge64d9a554a91-dirty #14 >>>> Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM08 03/03/2017 >>>> Workqueue: kblockd blk_mq_run_work_fn >>>> RIP: 0010:blk_mq_get_driver_tag+0x81/0x120 >>>> Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 f6 87 b0 00 00 00 02 >>>> RSP: 0018:ffffc90004aabd30 EFLAGS: 00010246 >>>> RAX: 0000000000000003 RBX: ffff888465ea1300 RCX: ffffc90004aabde8 >>>> RDX: 00000000ffffffff RSI: ffffc90004aabde8 RDI: 0000000000000000 >>>> RBP: 0000000000000000 R08: ffff888465ea1348 R09: 0000000000000000 >>>> R10: 0000000000001000 R11: 00000000ffffffff R12: ffff888465ea1300 >>>> R13: 0000000000000000 R14: ffff888465ea1348 R15: ffff888465d10000 >>>> FS: 0000000000000000(0000) GS:ffff88846f9c0000(0000) knlGS:0000000000000000 >>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>> CR2: 0000000000000148 CR3: 000000000220a003 CR4: 00000000003606e0 >>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >>>> Call Trace: >>>> blk_mq_dispatch_rq_list+0xec/0x480 >>>> ? elv_rb_del+0x11/0x30 >>>> blk_mq_do_dispatch_sched+0x6e/0xf0 >>>> blk_mq_sched_dispatch_requests+0xfa/0x170 >>>> __blk_mq_run_hw_queue+0x5f/0xe0 >>>> process_one_work+0x154/0x350 >>>> worker_thread+0x46/0x3c0 >>>> kthread+0xf5/0x130 >>>> ? process_one_work+0x350/0x350 >>>> ? kthread_destroy_worker+0x50/0x50 >>>> ret_from_fork+0x1f/0x30 >>>> Modules linked in: sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm switchtec irqbypass iTCO_wdt iTCO_vendor_support efivars cdc_ether usbnet mii cdc_acm i2c_i801 lpc_ich mfd_core ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq button sch_fq_codel nfsd nfs_acl lockd grace auth_rpcgss oid_registry sunrpc nvme nvme_core fuse sg loop efivarfs autofs4 >>>> CR2: 0000000000000148 >>>> ---[ end trace 340a1fb996df1b9b ]--- >>>> RIP: 0010:blk_mq_get_driver_tag+0x81/0x120 >>>> Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 f6 87 b0 00 00 00 02 >>>> >>>> which doesn't look that great... Are we sure this patch closed the window >>>> completely? >>> >>> I mentioned this patch is just one workaround, see my comment before: >>> >>> https://marc.info/?l=linux-scsi&m=154224379320094&w=2 >>> >>>> >>>> One thing I'm pondering is we're running the queue async, so the >>>> ref get will protect whatever blk_mq_run_hw_queues() does, but >>>> what is preventing the queue from going away as soon as we've >>>> returned from that call? Meanwhile we still have the work item >>>> queued up, and it'll run, and go boom like above. >>> >>> blk_sync_queue() supposes to drain the queued work, but it can be >>> queued after blk_sync_queue() returns. >> >> It's definitely broken. Big time. And we need to do something about >> it NOW. >> >>> Or maybe we can try the following patch? >> >> I'm going to start backing out the sync removal patches instead of >> adding items to the hot path... > > Yeah, I agree to do the following to bring back the sync, even though it slows > down boot. It slows it down for some cases, not in general. Do we need the synchronize_rcu() in the tagset deletion as well? > diff --git a/block/blk-core.c b/block/blk-core.c > index 04f5be473638..f6943f4a4d16 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -355,7 +355,7 @@ void blk_cleanup_queue(struct request_queue *q) > * We rely on driver to deal with the race in case that queue > * initialization isn't done. > */ > - if (queue_is_mq(q) && blk_queue_init_done(q)) > + if (queue_is_mq(q)) > blk_mq_quiesce_queue(q); > >> >> Ted, I saw your email, I'm looking into it. Sounds like a regression >> between 4.18 and 4.19. The sync issue could still be it, as it can >> cause memory corruption, and that could lead to other corruption >> issues. > > But I don't understand why this sync issue is related data corruption > because FS IO only starts after blk_queue_init_done() becomes true. > And blk_mq_quiesce_queue() and the following blk_sync_queue() should > have drained everything queued. Nobody understands what is going on so far, unfortunately. But I'm assuming it's some related corruption that happens before the fs is mounted. -- Jens Axboe ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH V2] SCSI: fix queue cleanup race before queue initialization is done 2018-11-22 2:14 ` Jens Axboe @ 2018-11-22 2:47 ` Ming Lei 0 siblings, 0 replies; 16+ messages in thread From: Ming Lei @ 2018-11-22 2:47 UTC (permalink / raw) To: Jens Axboe Cc: linux-block, Andrew Jones, Bart Van Assche, linux-scsi, Martin K . Petersen, Christoph Hellwig, James E . J . Bottomley, stable, jianchao . wang On Wed, Nov 21, 2018 at 07:14:35PM -0700, Jens Axboe wrote: > On 11/21/18 7:00 PM, Ming Lei wrote: > > On Wed, Nov 21, 2018 at 06:42:51PM -0700, Jens Axboe wrote: > >> On 11/21/18 6:00 PM, Ming Lei wrote: > >>> On Wed, Nov 21, 2018 at 02:47:35PM -0700, Jens Axboe wrote: > >>>> On 11/14/18 8:20 AM, Jens Axboe wrote: > >>>>> On 11/14/18 1:25 AM, Ming Lei wrote: > >>>>>> c2856ae2f315d ("blk-mq: quiesce queue before freeing queue") has > >>>>>> already fixed this race, however the implied synchronize_rcu() > >>>>>> in blk_mq_quiesce_queue() can slow down LUN probe a lot, so caused > >>>>>> performance regression. > >>>>>> > >>>>>> Then 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()") > >>>>>> tried to quiesce queue for avoiding unnecessary synchronize_rcu() > >>>>>> only when queue initialization is done, because it is usual to see > >>>>>> lots of inexistent LUNs which need to be probed. > >>>>>> > >>>>>> However, turns out it isn't safe to quiesce queue only when queue > >>>>>> initialization is done. Because when one SCSI command is completed, > >>>>>> the user of sending command can be waken up immediately, then the > >>>>>> scsi device may be removed, meantime the run queue in scsi_end_request() > >>>>>> is still in-progress, so kernel panic can be caused. > >>>>>> > >>>>>> In Red Hat QE lab, there are several reports about this kind of kernel > >>>>>> panic triggered during kernel booting. > >>>>>> > >>>>>> This patch tries to address the issue by grabing one queue usage > >>>>>> counter during freeing one request and the following run queue. > >>>>> > >>>>> Thanks applied, this bug was elusive but ever present in recent > >>>>> testing that we did internally, it's been a huge pain in the butt. > >>>>> The symptoms were usually a crash in blk_mq_get_driver_tag() with > >>>>> hctx->tags == NULL, or a crash inside deadline request insert off > >>>>> requeue. > >>>> > >>>> I'm still hitting some weird crashes even with this applied, like > >>>> this one: > >>>> > >>>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000148 > >>>> PGD 0 P4D 0. > >>>> Oops: 0000 [#1] SMP PTI > >>>> CPU: 37 PID: 763 Comm: kworker/37:1H Not tainted 4.20.0-rc3-00649-ge64d9a554a91-dirty #14 > >>>> Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM08 03/03/2017 > >>>> Workqueue: kblockd blk_mq_run_work_fn > >>>> RIP: 0010:blk_mq_get_driver_tag+0x81/0x120 > >>>> Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 f6 87 b0 00 00 00 02 > >>>> RSP: 0018:ffffc90004aabd30 EFLAGS: 00010246 > >>>> RAX: 0000000000000003 RBX: ffff888465ea1300 RCX: ffffc90004aabde8 > >>>> RDX: 00000000ffffffff RSI: ffffc90004aabde8 RDI: 0000000000000000 > >>>> RBP: 0000000000000000 R08: ffff888465ea1348 R09: 0000000000000000 > >>>> R10: 0000000000001000 R11: 00000000ffffffff R12: ffff888465ea1300 > >>>> R13: 0000000000000000 R14: ffff888465ea1348 R15: ffff888465d10000 > >>>> FS: 0000000000000000(0000) GS:ffff88846f9c0000(0000) knlGS:0000000000000000 > >>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > >>>> CR2: 0000000000000148 CR3: 000000000220a003 CR4: 00000000003606e0 > >>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > >>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > >>>> Call Trace: > >>>> blk_mq_dispatch_rq_list+0xec/0x480 > >>>> ? elv_rb_del+0x11/0x30 > >>>> blk_mq_do_dispatch_sched+0x6e/0xf0 > >>>> blk_mq_sched_dispatch_requests+0xfa/0x170 > >>>> __blk_mq_run_hw_queue+0x5f/0xe0 > >>>> process_one_work+0x154/0x350 > >>>> worker_thread+0x46/0x3c0 > >>>> kthread+0xf5/0x130 > >>>> ? process_one_work+0x350/0x350 > >>>> ? kthread_destroy_worker+0x50/0x50 > >>>> ret_from_fork+0x1f/0x30 > >>>> Modules linked in: sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm switchtec irqbypass iTCO_wdt iTCO_vendor_support efivars cdc_ether usbnet mii cdc_acm i2c_i801 lpc_ich mfd_core ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq button sch_fq_codel nfsd nfs_acl lockd grace auth_rpcgss oid_registry sunrpc nvme nvme_core fuse sg loop efivarfs autofs4 > >>>> CR2: 0000000000000148 > >>>> ---[ end trace 340a1fb996df1b9b ]--- > >>>> RIP: 0010:blk_mq_get_driver_tag+0x81/0x120 > >>>> Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 f6 87 b0 00 00 00 02 > >>>> > >>>> which doesn't look that great... Are we sure this patch closed the window > >>>> completely? > >>> > >>> I mentioned this patch is just one workaround, see my comment before: > >>> > >>> https://marc.info/?l=linux-scsi&m=154224379320094&w=2 > >>> > >>>> > >>>> One thing I'm pondering is we're running the queue async, so the > >>>> ref get will protect whatever blk_mq_run_hw_queues() does, but > >>>> what is preventing the queue from going away as soon as we've > >>>> returned from that call? Meanwhile we still have the work item > >>>> queued up, and it'll run, and go boom like above. > >>> > >>> blk_sync_queue() supposes to drain the queued work, but it can be > >>> queued after blk_sync_queue() returns. > >> > >> It's definitely broken. Big time. And we need to do something about > >> it NOW. > >> > >>> Or maybe we can try the following patch? > >> > >> I'm going to start backing out the sync removal patches instead of > >> adding items to the hot path... > > > > Yeah, I agree to do the following to bring back the sync, even though it slows > > down boot. > > It slows it down for some cases, not in general. It slows down in most of scsi hosts, and it is general, because the nonexistent LUNs need to be probed for common SCSI host. And VM is often sensitive on booting time. > Do we need the synchronize_rcu() in the tagset deletion as well? I think it isn't needed. There can't be any IO after blk_queue_cleanup() returns, and blk_mq_release() won't touch tagset resource too. > > > diff --git a/block/blk-core.c b/block/blk-core.c > > index 04f5be473638..f6943f4a4d16 100644 > > --- a/block/blk-core.c > > +++ b/block/blk-core.c > > @@ -355,7 +355,7 @@ void blk_cleanup_queue(struct request_queue *q) > > * We rely on driver to deal with the race in case that queue > > * initialization isn't done. > > */ > > - if (queue_is_mq(q) && blk_queue_init_done(q)) > > + if (queue_is_mq(q)) > > blk_mq_quiesce_queue(q); > > > >> > >> Ted, I saw your email, I'm looking into it. Sounds like a regression > >> between 4.18 and 4.19. The sync issue could still be it, as it can > >> cause memory corruption, and that could lead to other corruption > >> issues. > > > > But I don't understand why this sync issue is related data corruption > > because FS IO only starts after blk_queue_init_done() becomes true. > > And blk_mq_quiesce_queue() and the following blk_sync_queue() should > > have drained everything queued. > > Nobody understands what is going on so far, unfortunately. But I'm > assuming it's some related corruption that happens before the fs > is mounted. INIT_DONE is set after the disk shows up, seems there is a small window. If there is any IO before INIT_DONE is set, kernel oops may be triggered because of FS IO. But blk_mq_freeze_queue() in blk_cleanup_queue() has drained all in-flight IO, and prevented new IO from entering queue, not see how corruption may happen wrt. removal/sync. thanks, Ming ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH V2] SCSI: fix queue cleanup race before queue initialization is done 2018-11-22 1:42 ` Jens Axboe 2018-11-22 2:00 ` Ming Lei @ 2019-03-29 20:21 ` James Smart 2019-03-29 23:22 ` Ming Lei 2019-03-31 3:11 ` Ming Lei 1 sibling, 2 replies; 16+ messages in thread From: James Smart @ 2019-03-29 20:21 UTC (permalink / raw) To: Jens Axboe, Ming Lei Cc: linux-block, Andrew Jones, Bart Van Assche, linux-scsi, Martin K . Petersen, Christoph Hellwig, James E . J . Bottomley, stable, jianchao . wang On 11/21/2018 5:42 PM, Jens Axboe wrote: > On 11/21/18 6:00 PM, Ming Lei wrote: >> On Wed, Nov 21, 2018 at 02:47:35PM -0700, Jens Axboe wrote: >>> On 11/14/18 8:20 AM, Jens Axboe wrote: >>>> On 11/14/18 1:25 AM, Ming Lei wrote: >>>>> c2856ae2f315d ("blk-mq: quiesce queue before freeing queue") has >>>>> already fixed this race, however the implied synchronize_rcu() >>>>> in blk_mq_quiesce_queue() can slow down LUN probe a lot, so caused >>>>> performance regression. >>>>> >>>>> Then 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()") >>>>> tried to quiesce queue for avoiding unnecessary synchronize_rcu() >>>>> only when queue initialization is done, because it is usual to see >>>>> lots of inexistent LUNs which need to be probed. >>>>> >>>>> However, turns out it isn't safe to quiesce queue only when queue >>>>> initialization is done. Because when one SCSI command is completed, >>>>> the user of sending command can be waken up immediately, then the >>>>> scsi device may be removed, meantime the run queue in scsi_end_request() >>>>> is still in-progress, so kernel panic can be caused. >>>>> >>>>> In Red Hat QE lab, there are several reports about this kind of kernel >>>>> panic triggered during kernel booting. >>>>> >>>>> This patch tries to address the issue by grabing one queue usage >>>>> counter during freeing one request and the following run queue. >>>> Thanks applied, this bug was elusive but ever present in recent >>>> testing that we did internally, it's been a huge pain in the butt. >>>> The symptoms were usually a crash in blk_mq_get_driver_tag() with >>>> hctx->tags == NULL, or a crash inside deadline request insert off >>>> requeue. All, We are seeing errors with the following error: [44492.814347] BUG: unable to handle kernel NULL pointer dereference at (null) [44492.814383] IP: [<ffffffff8135a10b>] sbitmap_any_bit_set+0xb/0x30 ... [44492.815634] Call Trace: [44492.815652] [<ffffffff81303a88>] blk_mq_run_hw_queues+0x48/0x90 [44492.819755] [<ffffffff813053cc>] blk_mq_requeue_work+0x10c/0x120 [44492.819777] [<ffffffff81098cb4>] process_one_work+0x154/0x410 [44492.819781] [<ffffffff81099896>] worker_thread+0x116/0x4a0 [44492.819784] [<ffffffff8109edb9>] kthread+0xc9/0xe0 [44492.819790] [<ffffffff81619b05>] ret_from_fork+0x55/0x80 [44492.822798] DWARF2 unwinder stuck at ret_from_fork+0x55/0x80 [44492.822798] [44492.822799] Leftover inexact backtrace: [44492.822802] [<ffffffff8109ecf0>] ? kthread_park+0x50/0x50 [44492.822818] Code: c6 44 0f 46 ce 83 c2 01 45 89 ca 4c 89 54 01 08 48 8b 4f 10 2b 74 01 08 39 57 08 77 d8 f3 c3 90 8b 4f 08 85 c9 74 1f 48 8b 57 10 <48> 83 3a 00 75 18 31 c0 eb 0a 48 83 c2 40 48 83 3a 00 75 0a 83 [44492.822820] RIP [<ffffffff8135a10b>] sbitmap_any_bit_set+0xb/0x30 [44492.822821] RSP <ffff8807219ffdd8> [44492.822821] CR2: 0000000000000000 It appears the queue has been freed thus the bitmap is bad. Looking at the commit relative to this email thread: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/scsi/scsi_lib.c?id=8dc765d438f1e42b3e8227b3b09fad7d73f4ec9a It's interesting that the queue reference taken was released after the kblockd_schedule_work() call was made, and it's this work element that is hitting the issue. So perhaps the patch missed keeping the reference until the requeue_work item finished ? -- james ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH V2] SCSI: fix queue cleanup race before queue initialization is done 2019-03-29 20:21 ` James Smart @ 2019-03-29 23:22 ` Ming Lei 2019-03-31 3:11 ` Ming Lei 1 sibling, 0 replies; 16+ messages in thread From: Ming Lei @ 2019-03-29 23:22 UTC (permalink / raw) To: James Smart Cc: Jens Axboe, linux-block, Andrew Jones, Bart Van Assche, linux-scsi, Martin K . Petersen, Christoph Hellwig, James E . J . Bottomley, stable, jianchao . wang Hello James, On Fri, Mar 29, 2019 at 01:21:12PM -0700, James Smart wrote: > > > On 11/21/2018 5:42 PM, Jens Axboe wrote: > > On 11/21/18 6:00 PM, Ming Lei wrote: > > > On Wed, Nov 21, 2018 at 02:47:35PM -0700, Jens Axboe wrote: > > > > On 11/14/18 8:20 AM, Jens Axboe wrote: > > > > > On 11/14/18 1:25 AM, Ming Lei wrote: > > > > > > c2856ae2f315d ("blk-mq: quiesce queue before freeing queue") has > > > > > > already fixed this race, however the implied synchronize_rcu() > > > > > > in blk_mq_quiesce_queue() can slow down LUN probe a lot, so caused > > > > > > performance regression. > > > > > > > > > > > > Then 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()") > > > > > > tried to quiesce queue for avoiding unnecessary synchronize_rcu() > > > > > > only when queue initialization is done, because it is usual to see > > > > > > lots of inexistent LUNs which need to be probed. > > > > > > > > > > > > However, turns out it isn't safe to quiesce queue only when queue > > > > > > initialization is done. Because when one SCSI command is completed, > > > > > > the user of sending command can be waken up immediately, then the > > > > > > scsi device may be removed, meantime the run queue in scsi_end_request() > > > > > > is still in-progress, so kernel panic can be caused. > > > > > > > > > > > > In Red Hat QE lab, there are several reports about this kind of kernel > > > > > > panic triggered during kernel booting. > > > > > > > > > > > > This patch tries to address the issue by grabing one queue usage > > > > > > counter during freeing one request and the following run queue. > > > > > Thanks applied, this bug was elusive but ever present in recent > > > > > testing that we did internally, it's been a huge pain in the butt. > > > > > The symptoms were usually a crash in blk_mq_get_driver_tag() with > > > > > hctx->tags == NULL, or a crash inside deadline request insert off > > > > > requeue. > > All, > > We are seeing errors with the following error: > > [44492.814347] BUG: unable to handle kernel NULL pointer dereference at > (null) > [44492.814383] IP: [<ffffffff8135a10b>] sbitmap_any_bit_set+0xb/0x30 > ... > [44492.815634] Call Trace: > [44492.815652] [<ffffffff81303a88>] blk_mq_run_hw_queues+0x48/0x90 > [44492.819755] [<ffffffff813053cc>] blk_mq_requeue_work+0x10c/0x120 > [44492.819777] [<ffffffff81098cb4>] process_one_work+0x154/0x410 > [44492.819781] [<ffffffff81099896>] worker_thread+0x116/0x4a0 > [44492.819784] [<ffffffff8109edb9>] kthread+0xc9/0xe0 > [44492.819790] [<ffffffff81619b05>] ret_from_fork+0x55/0x80 > [44492.822798] DWARF2 unwinder stuck at ret_from_fork+0x55/0x80 > [44492.822798] > [44492.822799] Leftover inexact backtrace: > > [44492.822802] [<ffffffff8109ecf0>] ? kthread_park+0x50/0x50 > [44492.822818] Code: c6 44 0f 46 ce 83 c2 01 45 89 ca 4c 89 54 01 08 48 8b > 4f > 10 2b 74 01 08 39 57 08 77 d8 f3 c3 90 8b 4f 08 85 c9 74 1f 48 8b 57 10 <48> > 83 > 3a 00 75 18 31 c0 eb 0a 48 83 c2 40 48 83 3a 00 75 0a 83 > [44492.822820] RIP [<ffffffff8135a10b>] sbitmap_any_bit_set+0xb/0x30 > [44492.822821] RSP <ffff8807219ffdd8> > [44492.822821] CR2: 0000000000000000 > > It appears the queue has been freed thus the bitmap is bad. Could you provide a little background about this report? Such as the device/driver, reproduction steps, and kernel release. > > Looking at the commit relative to this email thread: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/scsi/scsi_lib.c?id=8dc765d438f1e42b3e8227b3b09fad7d73f4ec9a > > It's interesting that the queue reference taken was released after the > kblockd_schedule_work() call was made, and it's this work element that is > hitting the issue. So perhaps the patch missed keeping the reference until > the requeue_work item finished ? blk_mq's requeue_work is supposed to be drained before freeing queue, see blk_sync_queue(), and SCSI's requeue_work should have been drained too. This following change might make a difference for this issue, but looks it isn't good enough, given SCSI's requeue may come between cancel_work_sync() and blk_cleanup_queue(). Will take a close look on it in this weekend. diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c index 6a9040faed00..94882f65ccf1 100644 --- a/drivers/scsi/scsi_sysfs.c +++ b/drivers/scsi/scsi_sysfs.c @@ -1397,8 +1397,8 @@ void __scsi_remove_device(struct scsi_device *sdev) scsi_device_set_state(sdev, SDEV_DEL); mutex_unlock(&sdev->state_mutex); - blk_cleanup_queue(sdev->request_queue); cancel_work_sync(&sdev->requeue_work); + blk_cleanup_queue(sdev->request_queue); if (sdev->host->hostt->slave_destroy) sdev->host->hostt->slave_destroy(sdev); Thanks, Ming ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH V2] SCSI: fix queue cleanup race before queue initialization is done 2019-03-29 20:21 ` James Smart 2019-03-29 23:22 ` Ming Lei @ 2019-03-31 3:11 ` Ming Lei 1 sibling, 0 replies; 16+ messages in thread From: Ming Lei @ 2019-03-31 3:11 UTC (permalink / raw) To: James Smart Cc: Jens Axboe, Ming Lei, linux-block, Andrew Jones, Bart Van Assche, Linux SCSI List, Martin K . Petersen, Christoph Hellwig, James E . J . Bottomley, stable, jianchao . wang On Sat, Mar 30, 2019 at 4:21 AM James Smart <james.smart@broadcom.com> wrote: > > > > On 11/21/2018 5:42 PM, Jens Axboe wrote: > > On 11/21/18 6:00 PM, Ming Lei wrote: > >> On Wed, Nov 21, 2018 at 02:47:35PM -0700, Jens Axboe wrote: > >>> On 11/14/18 8:20 AM, Jens Axboe wrote: > >>>> On 11/14/18 1:25 AM, Ming Lei wrote: > >>>>> c2856ae2f315d ("blk-mq: quiesce queue before freeing queue") has > >>>>> already fixed this race, however the implied synchronize_rcu() > >>>>> in blk_mq_quiesce_queue() can slow down LUN probe a lot, so caused > >>>>> performance regression. > >>>>> > >>>>> Then 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()") > >>>>> tried to quiesce queue for avoiding unnecessary synchronize_rcu() > >>>>> only when queue initialization is done, because it is usual to see > >>>>> lots of inexistent LUNs which need to be probed. > >>>>> > >>>>> However, turns out it isn't safe to quiesce queue only when queue > >>>>> initialization is done. Because when one SCSI command is completed, > >>>>> the user of sending command can be waken up immediately, then the > >>>>> scsi device may be removed, meantime the run queue in scsi_end_request() > >>>>> is still in-progress, so kernel panic can be caused. > >>>>> > >>>>> In Red Hat QE lab, there are several reports about this kind of kernel > >>>>> panic triggered during kernel booting. > >>>>> > >>>>> This patch tries to address the issue by grabing one queue usage > >>>>> counter during freeing one request and the following run queue. > >>>> Thanks applied, this bug was elusive but ever present in recent > >>>> testing that we did internally, it's been a huge pain in the butt. > >>>> The symptoms were usually a crash in blk_mq_get_driver_tag() with > >>>> hctx->tags == NULL, or a crash inside deadline request insert off > >>>> requeue. > > All, > > We are seeing errors with the following error: > > [44492.814347] BUG: unable to handle kernel NULL pointer dereference at > (null) > [44492.814383] IP: [<ffffffff8135a10b>] sbitmap_any_bit_set+0xb/0x30 > ... > [44492.815634] Call Trace: > [44492.815652] [<ffffffff81303a88>] blk_mq_run_hw_queues+0x48/0x90 > [44492.819755] [<ffffffff813053cc>] blk_mq_requeue_work+0x10c/0x120 > [44492.819777] [<ffffffff81098cb4>] process_one_work+0x154/0x410 > [44492.819781] [<ffffffff81099896>] worker_thread+0x116/0x4a0 > [44492.819784] [<ffffffff8109edb9>] kthread+0xc9/0xe0 > [44492.819790] [<ffffffff81619b05>] ret_from_fork+0x55/0x80 > [44492.822798] DWARF2 unwinder stuck at ret_from_fork+0x55/0x80 > [44492.822798] > [44492.822799] Leftover inexact backtrace: > > [44492.822802] [<ffffffff8109ecf0>] ? kthread_park+0x50/0x50 > [44492.822818] Code: c6 44 0f 46 ce 83 c2 01 45 89 ca 4c 89 54 01 08 48 > 8b 4f > 10 2b 74 01 08 39 57 08 77 d8 f3 c3 90 8b 4f 08 85 c9 74 1f 48 8b 57 10 > <48> 83 > 3a 00 75 18 31 c0 eb 0a 48 83 c2 40 48 83 3a 00 75 0a 83 > [44492.822820] RIP [<ffffffff8135a10b>] sbitmap_any_bit_set+0xb/0x30 > [44492.822821] RSP <ffff8807219ffdd8> > [44492.822821] CR2: 0000000000000000 > > It appears the queue has been freed thus the bitmap is bad. > > Looking at the commit relative to this email thread: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/scsi/scsi_lib.c?id=8dc765d438f1e42b3e8227b3b09fad7d73f4ec9a > > It's interesting that the queue reference taken was released after the > kblockd_schedule_work() call was made, and it's this work element that > is hitting the issue. So perhaps the patch missed keeping the reference > until the requeue_work item finished ? Hi James, Please test the following patchset: https://lore.kernel.org/linux-block/20190331030954.22320-1-ming.lei@redhat.com/T/#u Thanks. Ming Lei ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2019-03-31 3:11 UTC | newest] Thread overview: 16+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2018-11-14 8:25 [PATCH V2] SCSI: fix queue cleanup race before queue initialization is done Ming Lei 2018-11-14 15:02 ` Bart Van Assche 2018-11-15 0:48 ` Ming Lei 2018-11-14 15:20 ` Jens Axboe 2018-11-15 1:02 ` Ming Lei 2018-11-21 21:47 ` Jens Axboe 2018-11-21 22:02 ` Theodore Y. Ts'o 2018-11-22 3:43 ` Theodore Y. Ts'o 2018-11-22 1:00 ` Ming Lei 2018-11-22 1:42 ` Jens Axboe 2018-11-22 2:00 ` Ming Lei 2018-11-22 2:14 ` Jens Axboe 2018-11-22 2:47 ` Ming Lei 2019-03-29 20:21 ` James Smart 2019-03-29 23:22 ` Ming Lei 2019-03-31 3:11 ` Ming Lei
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox