* [PATCH v3 0/3] Fix some starvation problems in block layer
@ 2024-09-14 7:28 Muchun Song
2024-09-14 7:28 ` [PATCH v3 1/3] block: fix missing dispatching request when queue is started or unquiesced Muchun Song
` (3 more replies)
0 siblings, 4 replies; 5+ messages in thread
From: Muchun Song @ 2024-09-14 7:28 UTC (permalink / raw)
To: axboe, ming.lei; +Cc: linux-block, linux-kernel, muchun.song, Muchun Song
We encountered a problem on our servers where hundreds of UNINTERRUPTIBLE
processes were all waiting in the WBT wait queue, and the hung task
detector logged many messages about tasks "blocked for more than 122
seconds". The call trace is as follows:
Call Trace:
__schedule+0x959/0xee0
schedule+0x40/0xb0
io_schedule+0x12/0x40
rq_qos_wait+0xaf/0x140
wbt_wait+0x92/0xc0
__rq_qos_throttle+0x20/0x30
blk_mq_make_request+0x12a/0x5c0
generic_make_request_nocheck+0x172/0x3f0
submit_bio+0x42/0x1c0
...
The WBT module is used to throttle buffered writeback: it blocks any
buffered writeback IO request until the previous inflight IOs have
completed. So I checked the inflight IO counter. It was one, meaning one
IO request had been submitted to a downstream interface, i.e. the block
core layer or the device driver (the virtio_blk driver in our case). We
needed to figure out why that inflight IO was never completed. I
confirmed that all the virtio ring buffers of virtio_blk were empty and
that the hardware dispatch list held one IO request, so the root cause is
not in the block device or the virtio_blk driver: the driver never
received that IO request.
We know that the block core layer can submit IO requests to the driver
through a kworker (the callback function is blk_mq_run_work_fn). I
thought the kworker might be blocked on some other resource, preventing
the callback from being invoked in time. So I checked all the kworkers
and workqueues and confirmed there was no pending work on any of them.
Putting all of the investigation together, the problem had to be in the
block core layer missing a chance to submit that IO request. After some
code investigation, I found several scenarios which could cause the
problem.
Changes in v3:
- Collect RB tag from Ming Lei.
- Adjust text to fit maximum 74 chars per line from Jens Axboe.
Changes in v2:
- Collect RB tag from Ming Lei.
- Use barrier-less approach to fix QUEUE_FLAG_QUIESCED ordering problem
suggested by Ming Lei.
- Apply new approach to fix BLK_MQ_S_STOPPED ordering for easier
maintenance.
- Add Fixes tag to each patch.
Muchun Song (3):
block: fix missing dispatching request when queue is started or
unquiesced
block: fix ordering between checking QUEUE_FLAG_QUIESCED and adding
requests
block: fix ordering between checking BLK_MQ_S_STOPPED and adding
requests
block/blk-mq.c | 55 ++++++++++++++++++++++++++++++++++++++------------
block/blk-mq.h | 13 ++++++++++++
2 files changed, 55 insertions(+), 13 deletions(-)
--
2.20.1
* [PATCH v3 1/3] block: fix missing dispatching request when queue is started or unquiesced
From: Muchun Song @ 2024-09-14 7:28 UTC (permalink / raw)
To: axboe, ming.lei
Cc: linux-block, linux-kernel, muchun.song, Muchun Song, stable
Supposing the following scenario with a virtio_blk driver.
CPU0                              CPU1                            CPU2

blk_mq_try_issue_directly()
  __blk_mq_issue_directly()
    q->mq_ops->queue_rq()
      virtio_queue_rq()
        blk_mq_stop_hw_queue()
                                                                  virtblk_done()
                                  blk_mq_try_issue_directly()
                                    if (blk_mq_hctx_stopped())
  blk_mq_request_bypass_insert()                                  blk_mq_run_hw_queue()
  blk_mq_run_hw_queue()           blk_mq_run_hw_queue()
                                      blk_mq_insert_request()
                                      return
After CPU0 has marked the queue as stopped, CPU1 will see that the queue
is stopped. But before CPU1 puts the request on the dispatch list, CPU2
receives the completion interrupt for the request, so it runs the
hardware queue and marks the queue as non-stopped. Meanwhile, CPU1 also
runs the same hardware queue. After both CPU1 and CPU2 have completed
blk_mq_run_hw_queue(), CPU1 just puts the request on the hardware queue
and returns, so dispatching that request is missed. Fix it by running
the hardware queue explicitly after the insertion. And
blk_mq_request_issue_directly() can hit a similar situation; fix it as
well.
Fixes: d964f04a8fde ("blk-mq: fix direct issue")
Cc: stable@vger.kernel.org
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e3c3c0c21b553..b2d0f22de0c7f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2619,6 +2619,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(rq->q)) {
blk_mq_insert_request(rq, 0);
+ blk_mq_run_hw_queue(hctx, false);
return;
}
@@ -2649,6 +2650,7 @@ static blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last)
if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(rq->q)) {
blk_mq_insert_request(rq, 0);
+ blk_mq_run_hw_queue(hctx, false);
return BLK_STS_OK;
}
--
2.20.1
* [PATCH v3 2/3] block: fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests
From: Muchun Song @ 2024-09-14 7:28 UTC (permalink / raw)
To: axboe, ming.lei
Cc: linux-block, linux-kernel, muchun.song, Muchun Song, stable
Supposing the following scenario.
CPU0                              CPU1

blk_mq_insert_request()  1) store
                                  blk_mq_unquiesce_queue()
                                    blk_queue_flag_clear()  3) store
                                    blk_mq_run_hw_queues()
                                      blk_mq_run_hw_queue()
                                        if (!blk_mq_hctx_has_pending())  4) load
                                          return
blk_mq_run_hw_queue()
  if (blk_queue_quiesced())  2) load
    return
  blk_mq_sched_dispatch_requests()
A full memory barrier should be inserted between 1) and 2), as well as
between 3) and 4), to make sure that either CPU0 sees that
QUEUE_FLAG_QUIESCED is cleared, or CPU1 sees the dispatch list or the
setting of the software queue bitmap. Otherwise, either CPU could fail
to rerun the hardware queue, causing starvation of the request.

There are two ways to fix the problem: 1) add a pair of memory barriers,
or 2) use hctx->queue->queue_lock to synchronize the check of
QUEUE_FLAG_QUIESCED. Here we choose 2), since memory barriers are not
easy to maintain.
Fixes: f4560ffe8cec ("blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue")
Cc: stable@vger.kernel.org
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq.c | 47 ++++++++++++++++++++++++++++++++++-------------
1 file changed, 34 insertions(+), 13 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b2d0f22de0c7f..ff6df6c7eeb25 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2202,6 +2202,24 @@ void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
}
EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
+static inline bool blk_mq_hw_queue_need_run(struct blk_mq_hw_ctx *hctx)
+{
+ bool need_run;
+
+ /*
+ * When queue is quiesced, we may be switching io scheduler, or
+ * updating nr_hw_queues, or other things, and we can't run queue
+ * any more, even blk_mq_hctx_has_pending() can't be called safely.
+ *
+ * And queue will be rerun in blk_mq_unquiesce_queue() if it is
+ * quiesced.
+ */
+ __blk_mq_run_dispatch_ops(hctx->queue, false,
+ need_run = !blk_queue_quiesced(hctx->queue) &&
+ blk_mq_hctx_has_pending(hctx));
+ return need_run;
+}
+
/**
* blk_mq_run_hw_queue - Start to run a hardware queue.
* @hctx: Pointer to the hardware queue to run.
@@ -2222,20 +2240,23 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
might_sleep_if(!async && hctx->flags & BLK_MQ_F_BLOCKING);
- /*
- * When queue is quiesced, we may be switching io scheduler, or
- * updating nr_hw_queues, or other things, and we can't run queue
- * any more, even __blk_mq_hctx_has_pending() can't be called safely.
- *
- * And queue will be rerun in blk_mq_unquiesce_queue() if it is
- * quiesced.
- */
- __blk_mq_run_dispatch_ops(hctx->queue, false,
- need_run = !blk_queue_quiesced(hctx->queue) &&
- blk_mq_hctx_has_pending(hctx));
+ need_run = blk_mq_hw_queue_need_run(hctx);
+ if (!need_run) {
+ unsigned long flags;
- if (!need_run)
- return;
+ /*
+ * Synchronize with blk_mq_unquiesce_queue(), because we check
+ * if hw queue is quiesced locklessly above, we need the use
+ * ->queue_lock to make sure we see the up-to-date status to
+ * not miss rerunning the hw queue.
+ */
+ spin_lock_irqsave(&hctx->queue->queue_lock, flags);
+ need_run = blk_mq_hw_queue_need_run(hctx);
+ spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
+
+ if (!need_run)
+ return;
+ }
if (async || !cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)) {
blk_mq_delay_run_hw_queue(hctx, 0);
--
2.20.1
* [PATCH v3 3/3] block: fix ordering between checking BLK_MQ_S_STOPPED and adding requests
From: Muchun Song @ 2024-09-14 7:28 UTC (permalink / raw)
To: axboe, ming.lei
Cc: linux-block, linux-kernel, muchun.song, Muchun Song, stable
Supposing the first scenario with a virtio_blk driver.
CPU0                                  CPU1

blk_mq_try_issue_directly()
  __blk_mq_issue_directly()
    q->mq_ops->queue_rq()
      virtio_queue_rq()
        blk_mq_stop_hw_queue()
                                      virtblk_done()
  blk_mq_request_bypass_insert()  1) store
                                      blk_mq_start_stopped_hw_queue()
                                        clear_bit(BLK_MQ_S_STOPPED)  3) store
                                        blk_mq_run_hw_queue()
                                          if (!blk_mq_hctx_has_pending())  4) load
                                            return
                                          blk_mq_sched_dispatch_requests()
  blk_mq_run_hw_queue()
    if (!blk_mq_hctx_has_pending())
      return
    blk_mq_sched_dispatch_requests()
      if (blk_mq_hctx_stopped())  2) load
        return
      __blk_mq_sched_dispatch_requests()
Supposing another scenario.
CPU0                                  CPU1

blk_mq_requeue_work()
  blk_mq_insert_request()  1) store
                                      virtblk_done()
                                        blk_mq_start_stopped_hw_queue()
  blk_mq_run_hw_queues()                  clear_bit(BLK_MQ_S_STOPPED)  3) store
                                          blk_mq_run_hw_queue()
                                            if (!blk_mq_hctx_has_pending())  4) load
                                              return
                                            blk_mq_sched_dispatch_requests()
    if (blk_mq_hctx_stopped())  2) load
      continue
    blk_mq_run_hw_queue()
Both scenarios are similar: a full memory barrier should be inserted
between 1) and 2), as well as between 3) and 4), to make sure that
either CPU0 sees that BLK_MQ_S_STOPPED is cleared or CPU1 sees the
dispatch list. Otherwise, either CPU could fail to rerun the hardware
queue, causing starvation of the request.

The easy way to fix it is to add the essential full memory barrier to
the blk_mq_hctx_stopped() helper. To avoid affecting the fast path (the
hardware queue is not stopped most of the time), we only insert the
barrier into the slow path. Indeed, only the slow path needs to care
about a missed dispatch of the request to the low-level device driver.
Fixes: 320ae51feed5 ("blk-mq: new multi-queue block IO queueing mechanism")
Cc: stable@vger.kernel.org
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq.c | 6 ++++++
block/blk-mq.h | 13 +++++++++++++
2 files changed, 19 insertions(+)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ff6df6c7eeb25..b90c1680cb780 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2413,6 +2413,12 @@ void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
return;
clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
+ /*
+ * Pairs with the smp_mb() in blk_mq_hctx_stopped() to order the
+ * clearing of BLK_MQ_S_STOPPED above and the checking of dispatch
+ * list in the subsequent routine.
+ */
+ smp_mb__after_atomic();
blk_mq_run_hw_queue(hctx, async);
}
EXPORT_SYMBOL_GPL(blk_mq_start_stopped_hw_queue);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 260beea8e332c..f36f3bff70d86 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -228,6 +228,19 @@ static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data
static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx)
{
+ /* Fast path: hardware queue is not stopped most of the time. */
+ if (likely(!test_bit(BLK_MQ_S_STOPPED, &hctx->state)))
+ return false;
+
+ /*
+ * This barrier is used to order adding of dispatch list before and
+ * the test of BLK_MQ_S_STOPPED below. Pairs with the memory barrier
+ * in blk_mq_start_stopped_hw_queue() so that dispatch code could
+ * either see BLK_MQ_S_STOPPED is cleared or dispatch list is not
+ * empty to avoid missing dispatching requests.
+ */
+ smp_mb();
+
return test_bit(BLK_MQ_S_STOPPED, &hctx->state);
}
--
2.20.1
* Re: [PATCH v3 0/3] Fix some starvation problems in block layer
From: Muchun Song @ 2024-09-24 6:10 UTC (permalink / raw)
To: axboe; +Cc: ming.lei, Muchun Song, linux-block, linux-kernel
> On Sep 14, 2024, at 15:28, Muchun Song <songmuchun@bytedance.com> wrote:
>
> [...]
Hi Jens,
Friendly ping... Do you have any concerns regarding this version?
Thanks,
Muchun