[PATCH v2 0/3] Fix some starvation problems in block layer

linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/3] Fix some starvation problems in block layer
@ 2024-09-03  8:16 Muchun Song
  2024-09-03  8:16 ` [PATCH v2 1/3] block: fix missing dispatching request when queue is started or unquiesced Muchun Song
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Muchun Song @ 2024-09-03  8:16 UTC (permalink / raw)
  To: axboe, ming.lei, yukuai1
  Cc: linux-block, linux-kernel, muchun.song, Muchun Song

We encounter a problem on our servers where hundreds of UNINTERRUPTED
processes are all waiting in the WBT wait queue. And the IO hung detector
logged so many messages about "blocked for more than 122 seconds". The
call trace is as follows:

    Call Trace:
        __schedule+0x959/0xee0
        schedule+0x40/0xb0
        io_schedule+0x12/0x40
        rq_qos_wait+0xaf/0x140
        wbt_wait+0x92/0xc0
        __rq_qos_throttle+0x20/0x30
        blk_mq_make_request+0x12a/0x5c0
        generic_make_request_nocheck+0x172/0x3f0
        submit_bio+0x42/0x1c0
        ...

The WBT module is used to throttle buffered writeback, which will block
any buffered writeback IO request until the previous inflight IOs have
been completed. So I checked the inflight IO counter. That was one meaning
one IO request was submitted to the downstream interface like block core
layer or device driver (virtio_blk driver in our case). We need to figure
out why the inflight IO is not completed in time. I confirmed that all
the virtio ring buffers of virtio_blk are empty and the hardware dispatch
list had one IO request, so the root cause is not related to the block
device or the virtio_blk driver since the driver has never received that
IO request.

We know that block core layer could submit IO requests to the driver through
kworker (the callback function is blk_mq_run_work_fn). I thought maybe the
kworker was blocked by some other resources causing the callback to not be
evoked in time. So I checked all the kworkers and workqueues and confirmed
there was no pending work on any kworker or workqueue.

Integrate all the investigation information, the problem should be in the
block core layer missing a chance to submit that IO request. After
some investigation of code, I found some scenarios which could cause the
problem.

Changes in v2:
  - Collect RB tag from Ming Lei.
  - Use barrier-less approach to fix QUEUE_FLAG_QUIESCED ordering problem
    suggested by Ming Lei.
  - Apply new approach to fix BLK_MQ_S_STOPPED ordering for easier maintenance.
  - Add Fixes tag to each patch.

Muchun Song (3):
  block: fix missing dispatching request when queue is started or
    unquiesced
  block: fix ordering between checking QUEUE_FLAG_QUIESCED and adding
    requests
  block: fix ordering between checking BLK_MQ_S_STOPPED and adding
    requests

 block/blk-mq.c | 55 ++++++++++++++++++++++++++++++++++++++------------
 block/blk-mq.h | 13 ++++++++++++
 2 files changed, 55 insertions(+), 13 deletions(-)

-- 
2.20.1

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 1/3] block: fix missing dispatching request when queue is started or unquiesced
  2024-09-03  8:16 [PATCH v2 0/3] Fix some starvation problems in block layer Muchun Song
@ 2024-09-03  8:16 ` Muchun Song
  2024-09-10 13:17   ` Jens Axboe
  2024-09-03  8:16 ` [PATCH v2 2/3] block: fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests Muchun Song
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Muchun Song @ 2024-09-03  8:16 UTC (permalink / raw)
  To: axboe, ming.lei, yukuai1
  Cc: linux-block, linux-kernel, muchun.song, Muchun Song, stable

Supposing the following scenario with a virtio_blk driver.

CPU0                                    CPU1                                    CPU2

blk_mq_try_issue_directly()
    __blk_mq_issue_directly()
        q->mq_ops->queue_rq()
            virtio_queue_rq()
                blk_mq_stop_hw_queue()
                                        blk_mq_try_issue_directly()             virtblk_done()
                                            if (blk_mq_hctx_stopped())
    blk_mq_request_bypass_insert()                                                  blk_mq_start_stopped_hw_queue()
    blk_mq_run_hw_queue()                                                               blk_mq_run_hw_queue()
                                                blk_mq_insert_request()
                                                return // Who is responsible for dispatching this IO request?

After CPU0 has marked the queue as stopped, CPU1 will see the queue is stopped.
But before CPU1 puts the request on the dispatch list, CPU2 receives the interrupt
of completion of request, so it will run the hardware queue and marks the queue
as non-stopped. Meanwhile, CPU1 also runs the same hardware queue. After both CPU1
and CPU2 complete blk_mq_run_hw_queue(), CPU1 just puts the request to the same
hardware queue and returns. It misses dispatching a request. Fix it by running
the hardware queue explicitly. And blk_mq_request_issue_directly() should handle
a similar situation. Fix it as well.

Fixes: d964f04a8fde8 ("blk-mq: fix direct issue")
Cc: stable@vger.kernel.org
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e3c3c0c21b553..b2d0f22de0c7f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2619,6 +2619,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 
 	if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(rq->q)) {
 		blk_mq_insert_request(rq, 0);
+		blk_mq_run_hw_queue(hctx, false);
 		return;
 	}
 
@@ -2649,6 +2650,7 @@ static blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last)
 
 	if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(rq->q)) {
 		blk_mq_insert_request(rq, 0);
+		blk_mq_run_hw_queue(hctx, false);
 		return BLK_STS_OK;
 	}
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 2/3] block: fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests
  2024-09-03  8:16 [PATCH v2 0/3] Fix some starvation problems in block layer Muchun Song
  2024-09-03  8:16 ` [PATCH v2 1/3] block: fix missing dispatching request when queue is started or unquiesced Muchun Song
@ 2024-09-03  8:16 ` Muchun Song
  2024-09-04 12:56   ` Ming Lei
  2024-09-10 13:22   ` Jens Axboe
  2024-09-03  8:16 ` [PATCH v2 3/3] block: fix ordering between checking BLK_MQ_S_STOPPED " Muchun Song
  2024-09-10  2:49 ` [PATCH v2 0/3] Fix some starvation problems in block layer Muchun Song
  3 siblings, 2 replies; 18+ messages in thread
From: Muchun Song @ 2024-09-03  8:16 UTC (permalink / raw)
  To: axboe, ming.lei, yukuai1
  Cc: linux-block, linux-kernel, muchun.song, Muchun Song, stable

Supposing the following scenario.

CPU0                                        CPU1

blk_mq_insert_request()         1) store    blk_mq_unquiesce_queue()
blk_mq_run_hw_queue()                       blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)       3) store
    if (blk_queue_quiesced())   2) load         blk_mq_run_hw_queues()
        return                                      blk_mq_run_hw_queue()
    blk_mq_sched_dispatch_requests()                    if (!blk_mq_hctx_has_pending())     4) load
                                                           return

The full memory barrier should be inserted between 1) and 2), as well as
between 3) and 4) to make sure that either CPU0 sees QUEUE_FLAG_QUIESCED is
cleared or CPU1 sees dispatch list or setting of bitmap of software queue.
Otherwise, either CPU will not re-run the hardware queue causing starvation.

So the first solution is to 1) add a pair of memory barrier to fix the
problem, another solution is to 2) use hctx->queue->queue_lock to synchronize
QUEUE_FLAG_QUIESCED. Here, we chose 2) to fix it since memory barrier is not
easy to be maintained.

Fixes: f4560ffe8cec1 ("blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue")
Cc: stable@vger.kernel.org
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 block/blk-mq.c | 47 ++++++++++++++++++++++++++++++++++-------------
 1 file changed, 34 insertions(+), 13 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b2d0f22de0c7f..ac39f2a346a52 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2202,6 +2202,24 @@ void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
 }
 EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
 
+static inline bool blk_mq_hw_queue_need_run(struct blk_mq_hw_ctx *hctx)
+{
+	bool need_run;
+
+	/*
+	 * When queue is quiesced, we may be switching io scheduler, or
+	 * updating nr_hw_queues, or other things, and we can't run queue
+	 * any more, even blk_mq_hctx_has_pending() can't be called safely.
+	 *
+	 * And queue will be rerun in blk_mq_unquiesce_queue() if it is
+	 * quiesced.
+	 */
+	__blk_mq_run_dispatch_ops(hctx->queue, false,
+				  need_run = !blk_queue_quiesced(hctx->queue) &&
+					      blk_mq_hctx_has_pending(hctx));
+	return need_run;
+}
+
 /**
  * blk_mq_run_hw_queue - Start to run a hardware queue.
  * @hctx: Pointer to the hardware queue to run.
@@ -2222,20 +2240,23 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 
 	might_sleep_if(!async && hctx->flags & BLK_MQ_F_BLOCKING);
 
-	/*
-	 * When queue is quiesced, we may be switching io scheduler, or
-	 * updating nr_hw_queues, or other things, and we can't run queue
-	 * any more, even __blk_mq_hctx_has_pending() can't be called safely.
-	 *
-	 * And queue will be rerun in blk_mq_unquiesce_queue() if it is
-	 * quiesced.
-	 */
-	__blk_mq_run_dispatch_ops(hctx->queue, false,
-		need_run = !blk_queue_quiesced(hctx->queue) &&
-		blk_mq_hctx_has_pending(hctx));
+	need_run = blk_mq_hw_queue_need_run(hctx);
+	if (!need_run) {
+		unsigned long flags;
 
-	if (!need_run)
-		return;
+		/*
+		 * Synchronize with blk_mq_unquiesce_queue(), becuase we check
+		 * if hw queue is quiesced locklessly above, we need the use
+		 * ->queue_lock to make sure we see the up-to-date status to
+		 * not miss rerunning the hw queue.
+		 */
+		spin_lock_irqsave(&hctx->queue->queue_lock, flags);
+		need_run = blk_mq_hw_queue_need_run(hctx);
+		spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
+
+		if (!need_run)
+			return;
+	}
 
 	if (async || !cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)) {
 		blk_mq_delay_run_hw_queue(hctx, 0);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 3/3] block: fix ordering between checking BLK_MQ_S_STOPPED and adding requests
  2024-09-03  8:16 [PATCH v2 0/3] Fix some starvation problems in block layer Muchun Song
  2024-09-03  8:16 ` [PATCH v2 1/3] block: fix missing dispatching request when queue is started or unquiesced Muchun Song
  2024-09-03  8:16 ` [PATCH v2 2/3] block: fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests Muchun Song
@ 2024-09-03  8:16 ` Muchun Song
  2024-09-04 13:04   ` Ming Lei
  2024-09-10 13:22   ` Jens Axboe
  2024-09-10  2:49 ` [PATCH v2 0/3] Fix some starvation problems in block layer Muchun Song
  3 siblings, 2 replies; 18+ messages in thread
From: Muchun Song @ 2024-09-03  8:16 UTC (permalink / raw)
  To: axboe, ming.lei, yukuai1
  Cc: linux-block, linux-kernel, muchun.song, Muchun Song, stable

Supposing first scenario with a virtio_blk driver.

CPU0                                                                CPU1

blk_mq_try_issue_directly()
    __blk_mq_issue_directly()
        q->mq_ops->queue_rq()
            virtio_queue_rq()
                blk_mq_stop_hw_queue()
                                                                    virtblk_done()
    blk_mq_request_bypass_insert()                                      blk_mq_start_stopped_hw_queues()
        /* Add IO request to dispatch list */   1) store                    blk_mq_start_stopped_hw_queue()
                                                                                clear_bit(BLK_MQ_S_STOPPED)                 3) store
    blk_mq_run_hw_queue()                                                       blk_mq_run_hw_queue()
        if (!blk_mq_hctx_has_pending())                                             if (!blk_mq_hctx_has_pending())         4) load
            return                                                                      return
        blk_mq_sched_dispatch_requests()                                            blk_mq_sched_dispatch_requests()
            if (blk_mq_hctx_stopped())          2) load                                 if (blk_mq_hctx_stopped())
                return                                                                      return
            __blk_mq_sched_dispatch_requests()                                          __blk_mq_sched_dispatch_requests()

Supposing another scenario.

CPU0                                                                CPU1

blk_mq_requeue_work()
    /* Add IO request to dispatch list */       1) store            virtblk_done()
    blk_mq_run_hw_queues()/blk_mq_delay_run_hw_queues()                 blk_mq_start_stopped_hw_queues()
        if (blk_mq_hctx_stopped())              2) load                     blk_mq_start_stopped_hw_queue()
            continue                                                            clear_bit(BLK_MQ_S_STOPPED)                 3) store
        blk_mq_run_hw_queue()/blk_mq_delay_run_hw_queue()                       blk_mq_run_hw_queue()
                                                                                    if (!blk_mq_hctx_has_pending())         4) load
                                                                                        return
                                                                                    blk_mq_sched_dispatch_requests()

Both scenarios are similar, the full memory barrier should be inserted between
1) and 2), as well as between 3) and 4) to make sure that either CPU0 sees
BLK_MQ_S_STOPPED is cleared or CPU1 sees dispatch list. Otherwise, either CPU
will not rerun the hardware queue causing starvation of the request.

The easy way to fix it is to add the essential full memory barrier into helper
of blk_mq_hctx_stopped(). In order to not affect the fast path (hardware queue
is not stopped most of the time), we only insert the barrier into the slow path.
Actually, only slow path needs to care about missing of dispatching the request
to the low-level device driver.

Fixes: 320ae51feed5c ("blk-mq: new multi-queue block IO queueing mechanism")
Cc: stable@vger.kernel.org
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 block/blk-mq.c |  6 ++++++
 block/blk-mq.h | 13 +++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ac39f2a346a52..48a6a437fba5e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2413,6 +2413,12 @@ void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 		return;
 
 	clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
+	/*
+	 * Pairs with the smp_mb() in blk_mq_hctx_stopped() to order the
+	 * clearing of BLK_MQ_S_STOPPED above and the checking of dispatch
+	 * list in the subsequent routine.
+	 */
+	smp_mb__after_atomic();
 	blk_mq_run_hw_queue(hctx, async);
 }
 EXPORT_SYMBOL_GPL(blk_mq_start_stopped_hw_queue);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 260beea8e332c..f36f3bff70d86 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -228,6 +228,19 @@ static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data
 
 static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx)
 {
+	/* Fast path: hardware queue is not stopped most of the time. */
+	if (likely(!test_bit(BLK_MQ_S_STOPPED, &hctx->state)))
+		return false;
+
+	/*
+	 * This barrier is used to order adding of dispatch list before and
+	 * the test of BLK_MQ_S_STOPPED below. Pairs with the memory barrier
+	 * in blk_mq_start_stopped_hw_queue() so that dispatch code could
+	 * either see BLK_MQ_S_STOPPED is cleared or dispatch list is not
+	 * empty to avoid missing dispatching requests.
+	 */
+	smp_mb();
+
 	return test_bit(BLK_MQ_S_STOPPED, &hctx->state);
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/3] block: fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests
  2024-09-03  8:16 ` [PATCH v2 2/3] block: fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests Muchun Song
@ 2024-09-04 12:56   ` Ming Lei
  2024-09-10 13:22   ` Jens Axboe
  1 sibling, 0 replies; 18+ messages in thread
From: Ming Lei @ 2024-09-04 12:56 UTC (permalink / raw)
  To: Muchun Song
  Cc: axboe, yukuai1, linux-block, linux-kernel, muchun.song, stable

On Tue, Sep 03, 2024 at 04:16:52PM +0800, Muchun Song wrote:
> Supposing the following scenario.
> 
> CPU0                                        CPU1
> 
> blk_mq_insert_request()         1) store    blk_mq_unquiesce_queue()
> blk_mq_run_hw_queue()                       blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)       3) store
>     if (blk_queue_quiesced())   2) load         blk_mq_run_hw_queues()
>         return                                      blk_mq_run_hw_queue()
>     blk_mq_sched_dispatch_requests()                    if (!blk_mq_hctx_has_pending())     4) load
>                                                            return
> 
> The full memory barrier should be inserted between 1) and 2), as well as
> between 3) and 4) to make sure that either CPU0 sees QUEUE_FLAG_QUIESCED is
> cleared or CPU1 sees dispatch list or setting of bitmap of software queue.
> Otherwise, either CPU will not re-run the hardware queue causing starvation.
> 
> So the first solution is to 1) add a pair of memory barrier to fix the
> problem, another solution is to 2) use hctx->queue->queue_lock to synchronize
> QUEUE_FLAG_QUIESCED. Here, we chose 2) to fix it since memory barrier is not
> easy to be maintained.
> 
> Fixes: f4560ffe8cec1 ("blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue")
> Cc: stable@vger.kernel.org
> Cc: Muchun Song <muchun.song@linux.dev>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

Reviewed-by: Ming Lei <ming.lei@redhat.com>


thanks,
Ming


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 3/3] block: fix ordering between checking BLK_MQ_S_STOPPED and adding requests
  2024-09-03  8:16 ` [PATCH v2 3/3] block: fix ordering between checking BLK_MQ_S_STOPPED " Muchun Song
@ 2024-09-04 13:04   ` Ming Lei
  2024-09-10 13:22   ` Jens Axboe
  1 sibling, 0 replies; 18+ messages in thread
From: Ming Lei @ 2024-09-04 13:04 UTC (permalink / raw)
  To: Muchun Song
  Cc: axboe, yukuai1, linux-block, linux-kernel, muchun.song, stable

On Tue, Sep 03, 2024 at 04:16:53PM +0800, Muchun Song wrote:
> Supposing first scenario with a virtio_blk driver.
> 
> CPU0                                                                CPU1
> 
> blk_mq_try_issue_directly()
>     __blk_mq_issue_directly()
>         q->mq_ops->queue_rq()
>             virtio_queue_rq()
>                 blk_mq_stop_hw_queue()
>                                                                     virtblk_done()
>     blk_mq_request_bypass_insert()                                      blk_mq_start_stopped_hw_queues()
>         /* Add IO request to dispatch list */   1) store                    blk_mq_start_stopped_hw_queue()
>                                                                                 clear_bit(BLK_MQ_S_STOPPED)                 3) store
>     blk_mq_run_hw_queue()                                                       blk_mq_run_hw_queue()
>         if (!blk_mq_hctx_has_pending())                                             if (!blk_mq_hctx_has_pending())         4) load
>             return                                                                      return
>         blk_mq_sched_dispatch_requests()                                            blk_mq_sched_dispatch_requests()
>             if (blk_mq_hctx_stopped())          2) load                                 if (blk_mq_hctx_stopped())
>                 return                                                                      return
>             __blk_mq_sched_dispatch_requests()                                          __blk_mq_sched_dispatch_requests()
> 
> Supposing another scenario.
> 
> CPU0                                                                CPU1
> 
> blk_mq_requeue_work()
>     /* Add IO request to dispatch list */       1) store            virtblk_done()
>     blk_mq_run_hw_queues()/blk_mq_delay_run_hw_queues()                 blk_mq_start_stopped_hw_queues()
>         if (blk_mq_hctx_stopped())              2) load                     blk_mq_start_stopped_hw_queue()
>             continue                                                            clear_bit(BLK_MQ_S_STOPPED)                 3) store
>         blk_mq_run_hw_queue()/blk_mq_delay_run_hw_queue()                       blk_mq_run_hw_queue()
>                                                                                     if (!blk_mq_hctx_has_pending())         4) load
>                                                                                         return
>                                                                                     blk_mq_sched_dispatch_requests()
> 
> Both scenarios are similar, the full memory barrier should be inserted between
> 1) and 2), as well as between 3) and 4) to make sure that either CPU0 sees
> BLK_MQ_S_STOPPED is cleared or CPU1 sees dispatch list. Otherwise, either CPU
> will not rerun the hardware queue causing starvation of the request.
> 
> The easy way to fix it is to add the essential full memory barrier into helper
> of blk_mq_hctx_stopped(). In order to not affect the fast path (hardware queue
> is not stopped most of the time), we only insert the barrier into the slow path.
> Actually, only slow path needs to care about missing of dispatching the request
> to the low-level device driver.
> 
> Fixes: 320ae51feed5c ("blk-mq: new multi-queue block IO queueing mechanism")
> Cc: stable@vger.kernel.org
> Cc: Muchun Song <muchun.song@linux.dev>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

Looks fine,

Reviewed-by: Ming Lei <ming.lei@redhat.com>


Thanks,
Ming


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/3] Fix some starvation problems in block layer
  2024-09-03  8:16 [PATCH v2 0/3] Fix some starvation problems in block layer Muchun Song
                   ` (2 preceding siblings ...)
  2024-09-03  8:16 ` [PATCH v2 3/3] block: fix ordering between checking BLK_MQ_S_STOPPED " Muchun Song
@ 2024-09-10  2:49 ` Muchun Song
  3 siblings, 0 replies; 18+ messages in thread
From: Muchun Song @ 2024-09-10  2:49 UTC (permalink / raw)
  To: Muchun Song; +Cc: Jens Axboe, Ming Lei, Yu Kuai, linux-block, linux-kernel



> On Sep 3, 2024, at 16:16, Muchun Song <songmuchun@bytedance.com> wrote:
> 
> We encounter a problem on our servers where hundreds of UNINTERRUPTED
> processes are all waiting in the WBT wait queue. And the IO hung detector
> logged so many messages about "blocked for more than 122 seconds". The
> call trace is as follows:
> 
>    Call Trace:
>        __schedule+0x959/0xee0
>        schedule+0x40/0xb0
>        io_schedule+0x12/0x40
>        rq_qos_wait+0xaf/0x140
>        wbt_wait+0x92/0xc0
>        __rq_qos_throttle+0x20/0x30
>        blk_mq_make_request+0x12a/0x5c0
>        generic_make_request_nocheck+0x172/0x3f0
>        submit_bio+0x42/0x1c0
>        ...
> 
> The WBT module is used to throttle buffered writeback, which will block
> any buffered writeback IO request until the previous inflight IOs have
> been completed. So I checked the inflight IO counter. That was one meaning
> one IO request was submitted to the downstream interface like block core
> layer or device driver (virtio_blk driver in our case). We need to figure
> out why the inflight IO is not completed in time. I confirmed that all
> the virtio ring buffers of virtio_blk are empty and the hardware dispatch
> list had one IO request, so the root cause is not related to the block
> device or the virtio_blk driver since the driver has never received that
> IO request.
> 
> We know that block core layer could submit IO requests to the driver through
> kworker (the callback function is blk_mq_run_work_fn). I thought maybe the
> kworker was blocked by some other resources causing the callback to not be
> evoked in time. So I checked all the kworkers and workqueues and confirmed
> there was no pending work on any kworker or workqueue.
> 
> Integrate all the investigation information, the problem should be in the
> block core layer missing a chance to submit that IO request. After
> some investigation of code, I found some scenarios which could cause the
> problem.

Hi Jens Axboe,

May I ask if you have any suggestions for those fixes? Or if they could
be merged?

Muchun,
Thanks.

> 
> Changes in v2:
>  - Collect RB tag from Ming Lei.
>  - Use barrier-less approach to fix QUEUE_FLAG_QUIESCED ordering problem
>    suggested by Ming Lei.
>  - Apply new approach to fix BLK_MQ_S_STOPPED ordering for easier maintenance.
>  - Add Fixes tag to each patch.
> 
> Muchun Song (3):
>  block: fix missing dispatching request when queue is started or
>    unquiesced
>  block: fix ordering between checking QUEUE_FLAG_QUIESCED and adding
>    requests
>  block: fix ordering between checking BLK_MQ_S_STOPPED and adding
>    requests
> 
> block/blk-mq.c | 55 ++++++++++++++++++++++++++++++++++++++------------
> block/blk-mq.h | 13 ++++++++++++
> 2 files changed, 55 insertions(+), 13 deletions(-)
> 
> -- 
> 2.20.1
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/3] block: fix missing dispatching request when queue is started or unquiesced
  2024-09-03  8:16 ` [PATCH v2 1/3] block: fix missing dispatching request when queue is started or unquiesced Muchun Song
@ 2024-09-10 13:17   ` Jens Axboe
  2024-09-11  2:43     ` Muchun Song
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2024-09-10 13:17 UTC (permalink / raw)
  To: Muchun Song, ming.lei, yukuai1
  Cc: linux-block, linux-kernel, muchun.song, stable

On 9/3/24 2:16 AM, Muchun Song wrote:
> Supposing the following scenario with a virtio_blk driver.
> 
> CPU0                                    CPU1                                    CPU2
> 
> blk_mq_try_issue_directly()
>     __blk_mq_issue_directly()
>         q->mq_ops->queue_rq()
>             virtio_queue_rq()
>                 blk_mq_stop_hw_queue()
>                                         blk_mq_try_issue_directly()             virtblk_done()
>                                             if (blk_mq_hctx_stopped())
>     blk_mq_request_bypass_insert()                                                  blk_mq_start_stopped_hw_queue()
>     blk_mq_run_hw_queue()                                                               blk_mq_run_hw_queue()
>                                                 blk_mq_insert_request()
>                                                 return // Who is responsible for dispatching this IO request?
> 
> After CPU0 has marked the queue as stopped, CPU1 will see the queue is stopped.
> But before CPU1 puts the request on the dispatch list, CPU2 receives the interrupt
> of completion of request, so it will run the hardware queue and marks the queue
> as non-stopped. Meanwhile, CPU1 also runs the same hardware queue. After both CPU1
> and CPU2 complete blk_mq_run_hw_queue(), CPU1 just puts the request to the same
> hardware queue and returns. It misses dispatching a request. Fix it by running
> the hardware queue explicitly. And blk_mq_request_issue_directly() should handle
> a similar situation. Fix it as well.

Patch looks fine, but this commit message is waaaaay too wide. Please
limit it to 72-74 chars. The above ordering is diagram is going to
otherwise be unreadable in a git log viewing in a terminal.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/3] block: fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests
  2024-09-03  8:16 ` [PATCH v2 2/3] block: fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests Muchun Song
  2024-09-04 12:56   ` Ming Lei
@ 2024-09-10 13:22   ` Jens Axboe
  2024-09-11  3:54     ` Ming Lei
  2024-09-11  3:56     ` Muchun Song
  1 sibling, 2 replies; 18+ messages in thread
From: Jens Axboe @ 2024-09-10 13:22 UTC (permalink / raw)
  To: Muchun Song, ming.lei, yukuai1
  Cc: linux-block, linux-kernel, muchun.song, stable

On 9/3/24 2:16 AM, Muchun Song wrote:
> Supposing the following scenario.
> 
> CPU0                                        CPU1
> 
> blk_mq_insert_request()         1) store    blk_mq_unquiesce_queue()
> blk_mq_run_hw_queue()                       blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)       3) store
>     if (blk_queue_quiesced())   2) load         blk_mq_run_hw_queues()
>         return                                      blk_mq_run_hw_queue()
>     blk_mq_sched_dispatch_requests()                    if (!blk_mq_hctx_has_pending())     4) load
>                                                            return
> 
> The full memory barrier should be inserted between 1) and 2), as well as
> between 3) and 4) to make sure that either CPU0 sees QUEUE_FLAG_QUIESCED is
> cleared or CPU1 sees dispatch list or setting of bitmap of software queue.
> Otherwise, either CPU will not re-run the hardware queue causing starvation.
> 
> So the first solution is to 1) add a pair of memory barrier to fix the
> problem, another solution is to 2) use hctx->queue->queue_lock to synchronize
> QUEUE_FLAG_QUIESCED. Here, we chose 2) to fix it since memory barrier is not
> easy to be maintained.

Same comment here, 72-74 chars wide please.

> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b2d0f22de0c7f..ac39f2a346a52 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2202,6 +2202,24 @@ void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
>  }
>  EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
>  
> +static inline bool blk_mq_hw_queue_need_run(struct blk_mq_hw_ctx *hctx)
> +{
> +	bool need_run;
> +
> +	/*
> +	 * When queue is quiesced, we may be switching io scheduler, or
> +	 * updating nr_hw_queues, or other things, and we can't run queue
> +	 * any more, even blk_mq_hctx_has_pending() can't be called safely.
> +	 *
> +	 * And queue will be rerun in blk_mq_unquiesce_queue() if it is
> +	 * quiesced.
> +	 */
> +	__blk_mq_run_dispatch_ops(hctx->queue, false,
> +				  need_run = !blk_queue_quiesced(hctx->queue) &&
> +					      blk_mq_hctx_has_pending(hctx));
> +	return need_run;
> +}

This __blk_mq_run_dispatch_ops() is also way too wide, why didn't you
just break it like where you copied it from?

> +
>  /**
>   * blk_mq_run_hw_queue - Start to run a hardware queue.
>   * @hctx: Pointer to the hardware queue to run.
> @@ -2222,20 +2240,23 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
>  
>  	might_sleep_if(!async && hctx->flags & BLK_MQ_F_BLOCKING);
>  
> -	/*
> -	 * When queue is quiesced, we may be switching io scheduler, or
> -	 * updating nr_hw_queues, or other things, and we can't run queue
> -	 * any more, even __blk_mq_hctx_has_pending() can't be called safely.
> -	 *
> -	 * And queue will be rerun in blk_mq_unquiesce_queue() if it is
> -	 * quiesced.
> -	 */
> -	__blk_mq_run_dispatch_ops(hctx->queue, false,
> -		need_run = !blk_queue_quiesced(hctx->queue) &&
> -		blk_mq_hctx_has_pending(hctx));
> +	need_run = blk_mq_hw_queue_need_run(hctx);
> +	if (!need_run) {
> +		unsigned long flags;
>  
> -	if (!need_run)
> -		return;
> +		/*
> +		 * synchronize with blk_mq_unquiesce_queue(), becuase we check
> +		 * if hw queue is quiesced locklessly above, we need the use
> +		 * ->queue_lock to make sure we see the up-to-date status to
> +		 * not miss rerunning the hw queue.
> +		 */
> +		spin_lock_irqsave(&hctx->queue->queue_lock, flags);
> +		need_run = blk_mq_hw_queue_need_run(hctx);
> +		spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
> +
> +		if (!need_run)
> +			return;
> +	}

Is this not solvable on the unquiesce side instead? It's rather a shame
to add overhead to the fast path to avoid a race with something that's
super unlikely, like quisce.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 3/3] block: fix ordering between checking BLK_MQ_S_STOPPED and adding requests
  2024-09-03  8:16 ` [PATCH v2 3/3] block: fix ordering between checking BLK_MQ_S_STOPPED " Muchun Song
  2024-09-04 13:04   ` Ming Lei
@ 2024-09-10 13:22   ` Jens Axboe
  2024-09-11  2:44     ` Muchun Song
  1 sibling, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2024-09-10 13:22 UTC (permalink / raw)
  To: Muchun Song, ming.lei, yukuai1
  Cc: linux-block, linux-kernel, muchun.song, stable

On 9/3/24 2:16 AM, Muchun Song wrote:
> Supposing first scenario with a virtio_blk driver.
> 
> CPU0                                                                CPU1
> 
> blk_mq_try_issue_directly()
>     __blk_mq_issue_directly()
>         q->mq_ops->queue_rq()
>             virtio_queue_rq()
>                 blk_mq_stop_hw_queue()
>                                                                     virtblk_done()
>     blk_mq_request_bypass_insert()                                      blk_mq_start_stopped_hw_queues()
>         /* Add IO request to dispatch list */   1) store                    blk_mq_start_stopped_hw_queue()
>                                                                                 clear_bit(BLK_MQ_S_STOPPED)                 3) store
>     blk_mq_run_hw_queue()                                                       blk_mq_run_hw_queue()
>         if (!blk_mq_hctx_has_pending())                                             if (!blk_mq_hctx_has_pending())         4) load
>             return                                                                      return
>         blk_mq_sched_dispatch_requests()                                            blk_mq_sched_dispatch_requests()
>             if (blk_mq_hctx_stopped())          2) load                                 if (blk_mq_hctx_stopped())
>                 return                                                                      return
>             __blk_mq_sched_dispatch_requests()                                          __blk_mq_sched_dispatch_requests()
> 
> Supposing another scenario.
> 
> CPU0                                                                CPU1
> 
> blk_mq_requeue_work()
>     /* Add IO request to dispatch list */       1) store            virtblk_done()
>     blk_mq_run_hw_queues()/blk_mq_delay_run_hw_queues()                 blk_mq_start_stopped_hw_queues()
>         if (blk_mq_hctx_stopped())              2) load                     blk_mq_start_stopped_hw_queue()
>             continue                                                            clear_bit(BLK_MQ_S_STOPPED)                 3) store
>         blk_mq_run_hw_queue()/blk_mq_delay_run_hw_queue()                       blk_mq_run_hw_queue()
>                                                                                     if (!blk_mq_hctx_has_pending())         4) load
>                                                                                         return
>                                                                                     blk_mq_sched_dispatch_requests()
> 
> Both scenarios are similar, the full memory barrier should be inserted between
> 1) and 2), as well as between 3) and 4) to make sure that either CPU0 sees
> BLK_MQ_S_STOPPED is cleared or CPU1 sees dispatch list. Otherwise, either CPU
> will not rerun the hardware queue causing starvation of the request.
> 
> The easy way to fix it is to add the essential full memory barrier into helper
> of blk_mq_hctx_stopped(). In order to not affect the fast path (hardware queue
> is not stopped most of the time), we only insert the barrier into the slow path.
> Actually, only slow path needs to care about missing of dispatching the request
> to the low-level device driver.

Again, this is way too wide, it's unreadable.

Patch looks fine, though.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/3] block: fix missing dispatching request when queue is started or unquiesced
  2024-09-10 13:17   ` Jens Axboe
@ 2024-09-11  2:43     ` Muchun Song
  0 siblings, 0 replies; 18+ messages in thread
From: Muchun Song @ 2024-09-11  2:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Muchun Song, Ming Lei, Yu Kuai, open list:BLOCK LAYER, LKML,
	stable



> On Sep 10, 2024, at 21:17, Jens Axboe <axboe@kernel.dk> wrote:
> 
> On 9/3/24 2:16 AM, Muchun Song wrote:
>> Supposing the following scenario with a virtio_blk driver.
>> 
>> CPU0                                    CPU1                                    CPU2
>> 
>> blk_mq_try_issue_directly()
>>    __blk_mq_issue_directly()
>>        q->mq_ops->queue_rq()
>>            virtio_queue_rq()
>>                blk_mq_stop_hw_queue()
>>                                        blk_mq_try_issue_directly()             virtblk_done()
>>                                            if (blk_mq_hctx_stopped())
>>    blk_mq_request_bypass_insert()                                                  blk_mq_start_stopped_hw_queue()
>>    blk_mq_run_hw_queue()                                                               blk_mq_run_hw_queue()
>>                                                blk_mq_insert_request()
>>                                                return // Who is responsible for dispatching this IO request?
>> 
>> After CPU0 has marked the queue as stopped, CPU1 will see the queue is stopped.
>> But before CPU1 puts the request on the dispatch list, CPU2 receives the interrupt
>> of completion of request, so it will run the hardware queue and marks the queue
>> as non-stopped. Meanwhile, CPU1 also runs the same hardware queue. After both CPU1
>> and CPU2 complete blk_mq_run_hw_queue(), CPU1 just puts the request to the same
>> hardware queue and returns. It misses dispatching a request. Fix it by running
>> the hardware queue explicitly. And blk_mq_request_issue_directly() should handle
>> a similar situation. Fix it as well.
> 
> Patch looks fine, but this commit message is waaaaay too wide. Please
> limit it to 72-74 chars. The above ordering is diagram is going to
> otherwise be unreadable in a git log viewing in a terminal.

Thanks for your reply. I'll adjust those lines to make the digram more
readable.

Muchun,
Thanks.

> 
> -- 
> Jens Axboe


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 3/3] block: fix ordering between checking BLK_MQ_S_STOPPED and adding requests
  2024-09-10 13:22   ` Jens Axboe
@ 2024-09-11  2:44     ` Muchun Song
  0 siblings, 0 replies; 18+ messages in thread
From: Muchun Song @ 2024-09-11  2:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Muchun Song, Ming Lei, Yu Kuai, open list:BLOCK LAYER, LKML,
	stable



> On Sep 10, 2024, at 21:22, Jens Axboe <axboe@kernel.dk> wrote:
> 
> On 9/3/24 2:16 AM, Muchun Song wrote:
>> Supposing first scenario with a virtio_blk driver.
>> 
>> CPU0                                                                CPU1
>> 
>> blk_mq_try_issue_directly()
>>    __blk_mq_issue_directly()
>>        q->mq_ops->queue_rq()
>>            virtio_queue_rq()
>>                blk_mq_stop_hw_queue()
>>                                                                    virtblk_done()
>>    blk_mq_request_bypass_insert()                                      blk_mq_start_stopped_hw_queues()
>>        /* Add IO request to dispatch list */   1) store                    blk_mq_start_stopped_hw_queue()
>>                                                                                clear_bit(BLK_MQ_S_STOPPED)                 3) store
>>    blk_mq_run_hw_queue()                                                       blk_mq_run_hw_queue()
>>        if (!blk_mq_hctx_has_pending())                                             if (!blk_mq_hctx_has_pending())         4) load
>>            return                                                                      return
>>        blk_mq_sched_dispatch_requests()                                            blk_mq_sched_dispatch_requests()
>>            if (blk_mq_hctx_stopped())          2) load                                 if (blk_mq_hctx_stopped())
>>                return                                                                     return
>>            __blk_mq_sched_dispatch_requests()                                          __blk_mq_sched_dispatch_requests()
>> 
>> Supposing another scenario.
>> 
>> CPU0                                                                CPU1
>> 
>> blk_mq_requeue_work()
>>    /* Add IO request to dispatch list */       1) store            virtblk_done()
>>    blk_mq_run_hw_queues()/blk_mq_delay_run_hw_queues()                 blk_mq_start_stopped_hw_queues()
>>        if (blk_mq_hctx_stopped())              2) load                     blk_mq_start_stopped_hw_queue()
>>            continue                                                            clear_bit(BLK_MQ_S_STOPPED)                 3) store
>>        blk_mq_run_hw_queue()/blk_mq_delay_run_hw_queue()                       blk_mq_run_hw_queue()
>>                                                                                    if (!blk_mq_hctx_has_pending())         4) load
>>                                                                                        return
>>                                                                                    blk_mq_sched_dispatch_requests()
>> 
>> Both scenarios are similar, the full memory barrier should be inserted between
>> 1) and 2), as well as between 3) and 4) to make sure that either CPU0 sees
>> BLK_MQ_S_STOPPED is cleared or CPU1 sees dispatch list. Otherwise, either CPU
>> will not rerun the hardware queue causing starvation of the request.
>> 
>> The easy way to fix it is to add the essential full memory barrier into helper
>> of blk_mq_hctx_stopped(). In order to not affect the fast path (hardware queue
>> is not stopped most of the time), we only insert the barrier into the slow path.
>> Actually, only slow path needs to care about missing of dispatching the request
>> to the low-level device driver.
> 
> Again, this is way too wide, it's unreadable.

OK. Will do.

Muchun,
Thanks.

> 
> Patch looks fine, though.
> 
> -- 
> Jens Axboe



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/3] block: fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests
  2024-09-10 13:22   ` Jens Axboe
@ 2024-09-11  3:54     ` Ming Lei
  2024-09-11  3:59       ` Muchun Song
  2024-09-12  3:27       ` Muchun Song
  2024-09-11  3:56     ` Muchun Song
  1 sibling, 2 replies; 18+ messages in thread
From: Ming Lei @ 2024-09-11  3:54 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Muchun Song, yukuai1, linux-block, linux-kernel, muchun.song,
	stable, ming.lei

On Tue, Sep 10, 2024 at 07:22:16AM -0600, Jens Axboe wrote:
> On 9/3/24 2:16 AM, Muchun Song wrote:
> > Supposing the following scenario.
> > 
> > CPU0                                        CPU1
> > 
> > blk_mq_insert_request()         1) store    blk_mq_unquiesce_queue()
> > blk_mq_run_hw_queue()                       blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)       3) store
> >     if (blk_queue_quiesced())   2) load         blk_mq_run_hw_queues()
> >         return                                      blk_mq_run_hw_queue()
> >     blk_mq_sched_dispatch_requests()                    if (!blk_mq_hctx_has_pending())     4) load
> >                                                            return
> > 
> > The full memory barrier should be inserted between 1) and 2), as well as
> > between 3) and 4) to make sure that either CPU0 sees QUEUE_FLAG_QUIESCED is
> > cleared or CPU1 sees dispatch list or setting of bitmap of software queue.
> > Otherwise, either CPU will not re-run the hardware queue causing starvation.
> > 
> > So the first solution is to 1) add a pair of memory barrier to fix the
> > problem, another solution is to 2) use hctx->queue->queue_lock to synchronize
> > QUEUE_FLAG_QUIESCED. Here, we chose 2) to fix it since memory barrier is not
> > easy to be maintained.
> 
> Same comment here, 72-74 chars wide please.
> 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index b2d0f22de0c7f..ac39f2a346a52 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2202,6 +2202,24 @@ void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
> >  }
> >  EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
> >  
> > +static inline bool blk_mq_hw_queue_need_run(struct blk_mq_hw_ctx *hctx)
> > +{
> > +	bool need_run;
> > +
> > +	/*
> > +	 * When queue is quiesced, we may be switching io scheduler, or
> > +	 * updating nr_hw_queues, or other things, and we can't run queue
> > +	 * any more, even blk_mq_hctx_has_pending() can't be called safely.
> > +	 *
> > +	 * And queue will be rerun in blk_mq_unquiesce_queue() if it is
> > +	 * quiesced.
> > +	 */
> > +	__blk_mq_run_dispatch_ops(hctx->queue, false,
> > +				  need_run = !blk_queue_quiesced(hctx->queue) &&
> > +					      blk_mq_hctx_has_pending(hctx));
> > +	return need_run;
> > +}
> 
> This __blk_mq_run_dispatch_ops() is also way too wide, why didn't you
> just break it like where you copied it from?
> 
> > +
> >  /**
> >   * blk_mq_run_hw_queue - Start to run a hardware queue.
> >   * @hctx: Pointer to the hardware queue to run.
> > @@ -2222,20 +2240,23 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
> >  
> >  	might_sleep_if(!async && hctx->flags & BLK_MQ_F_BLOCKING);
> >  
> > -	/*
> > -	 * When queue is quiesced, we may be switching io scheduler, or
> > -	 * updating nr_hw_queues, or other things, and we can't run queue
> > -	 * any more, even __blk_mq_hctx_has_pending() can't be called safely.
> > -	 *
> > -	 * And queue will be rerun in blk_mq_unquiesce_queue() if it is
> > -	 * quiesced.
> > -	 */
> > -	__blk_mq_run_dispatch_ops(hctx->queue, false,
> > -		need_run = !blk_queue_quiesced(hctx->queue) &&
> > -		blk_mq_hctx_has_pending(hctx));
> > +	need_run = blk_mq_hw_queue_need_run(hctx);
> > +	if (!need_run) {
> > +		unsigned long flags;
> >  
> > -	if (!need_run)
> > -		return;
> > +		/*
> > +		 * synchronize with blk_mq_unquiesce_queue(), becuase we check
> > +		 * if hw queue is quiesced locklessly above, we need the use
> > +		 * ->queue_lock to make sure we see the up-to-date status to
> > +		 * not miss rerunning the hw queue.
> > +		 */
> > +		spin_lock_irqsave(&hctx->queue->queue_lock, flags);
> > +		need_run = blk_mq_hw_queue_need_run(hctx);
> > +		spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
> > +
> > +		if (!need_run)
> > +			return;
> > +	}
> 
> Is this not solvable on the unquiesce side instead? It's rather a shame
> to add overhead to the fast path to avoid a race with something that's
> super unlikely, like quisce.

Yeah, it can be solved by adding synchronize_rcu()/srcu() in unquiesce
side, but SCSI may call it in non-sleepable context via scsi_internal_device_unblock_nowait().


Thanks,
Ming


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/3] block: fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests
  2024-09-10 13:22   ` Jens Axboe
  2024-09-11  3:54     ` Ming Lei
@ 2024-09-11  3:56     ` Muchun Song
  1 sibling, 0 replies; 18+ messages in thread
From: Muchun Song @ 2024-09-11  3:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Muchun Song, Ming Lei, Yu Kuai, linux-block, linux-kernel, stable



> On Sep 10, 2024, at 21:22, Jens Axboe <axboe@kernel.dk> wrote:
> 
> On 9/3/24 2:16 AM, Muchun Song wrote:
>> Supposing the following scenario.
>> 
>> CPU0                                        CPU1
>> 
>> blk_mq_insert_request()         1) store    blk_mq_unquiesce_queue()
>> blk_mq_run_hw_queue()                       blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)     3) store
>>    if (blk_queue_quiesced())   2) load         blk_mq_run_hw_queues()
>>        return                                      blk_mq_run_hw_queue()
>>    blk_mq_sched_dispatch_requests()                    if (!blk_mq_hctx_has_pending())    4) load
>>                                                           return
>> 
>> The full memory barrier should be inserted between 1) and 2), as well as
>> between 3) and 4) to make sure that either CPU0 sees QUEUE_FLAG_QUIESCED is
>> cleared or CPU1 sees dispatch list or setting of bitmap of software queue.
>> Otherwise, either CPU will not re-run the hardware queue causing starvation.
>> 
>> So the first solution is to 1) add a pair of memory barrier to fix the
>> problem, another solution is to 2) use hctx->queue->queue_lock to synchronize
>> QUEUE_FLAG_QUIESCED. Here, we chose 2) to fix it since memory barrier is not
>> easy to be maintained.
> 
> Same comment here, 72-74 chars wide please.

OK.

> 
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index b2d0f22de0c7f..ac39f2a346a52 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -2202,6 +2202,24 @@ void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
>> }
>> EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
>> 
>> +static inline bool blk_mq_hw_queue_need_run(struct blk_mq_hw_ctx *hctx)
>> +{
>> + 	bool need_run;
>> +
>> + 	/*
>> + 	 * When queue is quiesced, we may be switching io scheduler, or
>> + 	 * updating nr_hw_queues, or other things, and we can't run queue
>> + 	 * any more, even blk_mq_hctx_has_pending() can't be called safely.
>> + 	 *
>> + 	 * And queue will be rerun in blk_mq_unquiesce_queue() if it is
>> + 	 * quiesced.
>> + 	 */
>> + 	__blk_mq_run_dispatch_ops(hctx->queue, false,
>> + 				  need_run = !blk_queue_quiesced(hctx->queue) &&
>> + 			      	  blk_mq_hctx_has_pending(hctx));
>> + 	return need_run;
>> +}
> 
> This __blk_mq_run_dispatch_ops() is also way too wide, why didn't you
> just break it like where you copied it from?

I thought the rule allows max 80 chars pre line, so I adjusted
the code to let them align with the above "(". Seems you prefer
the previous way, I can keep it the same as before.

Muchun,
Thanks.

> 
>> +
>> /**
>>  * blk_mq_run_hw_queue - Start to run a hardware queue.
>>  * @hctx: Pointer to the hardware queue to run.
>> @@ -2222,20 +2240,23 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
>> 
>> 	might_sleep_if(!async && hctx->flags & BLK_MQ_F_BLOCKING);
>> 
>> - 	/*
>> - 	 * When queue is quiesced, we may be switching io scheduler, or
>> - 	 * updating nr_hw_queues, or other things, and we can't run queue
>> - 	 * any more, even __blk_mq_hctx_has_pending() can't be called safely.
>> - 	 *
>> - 	 * And queue will be rerun in blk_mq_unquiesce_queue() if it is
>> - 	 * quiesced.
>> - 	 */
>> - 	__blk_mq_run_dispatch_ops(hctx->queue, false,
>> - 	need_run = !blk_queue_quiesced(hctx->queue) &&
>> - 		blk_mq_hctx_has_pending(hctx));
>> + 		need_run = blk_mq_hw_queue_need_run(hctx);
>> + 	if (!need_run) {
>> + 		unsigned long flags;
>> 
>> - 	if (!need_run)
>> - 		return;
>> + 		/*
>> + 		 * synchronize with blk_mq_unquiesce_queue(), becuase we check
>> + 		 * if hw queue is quiesced locklessly above, we need the use
>> + 		 * ->queue_lock to make sure we see the up-to-date status to
>> + 		 * not miss rerunning the hw queue.
>> + 		 */
>> + 		spin_lock_irqsave(&hctx->queue->queue_lock, flags);
>> + 		need_run = blk_mq_hw_queue_need_run(hctx);
>> + 		spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
>> +
>> + 		if (!need_run)
>> + 			return;
>> + 	}
> 
> Is this not solvable on the unquiesce side instead? It's rather a shame
> to add overhead to the fast path to avoid a race with something that's
> super unlikely, like quisce.
> 
> -- 
> Jens Axboe



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/3] block: fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests
  2024-09-11  3:54     ` Ming Lei
@ 2024-09-11  3:59       ` Muchun Song
  2024-09-11  5:20         ` Muchun Song
  2024-09-12  3:27       ` Muchun Song
  1 sibling, 1 reply; 18+ messages in thread
From: Muchun Song @ 2024-09-11  3:59 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Muchun Song, Yu Kuai, open list:BLOCK LAYER, LKML,
	stable



> On Sep 11, 2024, at 11:54, Ming Lei <ming.lei@redhat.com> wrote:
> 
> On Tue, Sep 10, 2024 at 07:22:16AM -0600, Jens Axboe wrote:
>> On 9/3/24 2:16 AM, Muchun Song wrote:
>>> Supposing the following scenario.
>>> 
>>> CPU0                                        CPU1
>>> 
>>> blk_mq_insert_request()         1) store    blk_mq_unquiesce_queue()
>>> blk_mq_run_hw_queue()                       blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)   3) store
>>>    if (blk_queue_quiesced())   2) load         blk_mq_run_hw_queues()
>>>        return                                      blk_mq_run_hw_queue()
>>>    blk_mq_sched_dispatch_requests()                    if (!blk_mq_hctx_has_pending()) 4) load
>>>                                                           return
>>> 
>>> The full memory barrier should be inserted between 1) and 2), as well as
>>> between 3) and 4) to make sure that either CPU0 sees QUEUE_FLAG_QUIESCED is
>>> cleared or CPU1 sees dispatch list or setting of bitmap of software queue.
>>> Otherwise, either CPU will not re-run the hardware queue causing starvation.
>>> 
>>> So the first solution is to 1) add a pair of memory barrier to fix the
>>> problem, another solution is to 2) use hctx->queue->queue_lock to synchronize
>>> QUEUE_FLAG_QUIESCED. Here, we chose 2) to fix it since memory barrier is not
>>> easy to be maintained.
>> 
>> Same comment here, 72-74 chars wide please.
>> 
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index b2d0f22de0c7f..ac39f2a346a52 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -2202,6 +2202,24 @@ void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
>>> }
>>> EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
>>> 
>>> +static inline bool blk_mq_hw_queue_need_run(struct blk_mq_hw_ctx *hctx)
>>> +{
>>> + 	bool need_run;
>>> +
>>> + 	/*
>>> + 	 * When queue is quiesced, we may be switching io scheduler, or
>>> + 	 * updating nr_hw_queues, or other things, and we can't run queue
>>> + 	 * any more, even blk_mq_hctx_has_pending() can't be called safely.
>>> + 	 *
>>> + 	 * And queue will be rerun in blk_mq_unquiesce_queue() if it is
>>> + 	 * quiesced.
>>> + 	 */
>>> + 	__blk_mq_run_dispatch_ops(hctx->queue, false,
>>> +				  need_run = !blk_queue_quiesced(hctx->queue) &&
>>> + 				  blk_mq_hctx_has_pending(hctx));
>>> + 	return need_run;
>>> +}
>> 
>> This __blk_mq_run_dispatch_ops() is also way too wide, why didn't you
>> just break it like where you copied it from?
>> 
>>> +
>>> /**
>>>  * blk_mq_run_hw_queue - Start to run a hardware queue.
>>>  * @hctx: Pointer to the hardware queue to run.
>>> @@ -2222,20 +2240,23 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
>>> 
>>> might_sleep_if(!async && hctx->flags & BLK_MQ_F_BLOCKING);
>>> 
>>> - 	/*
>>> - 	 * When queue is quiesced, we may be switching io scheduler, or
>>> - 	 * updating nr_hw_queues, or other things, and we can't run queue
>>> - 	 * any more, even __blk_mq_hctx_has_pending() can't be called safely.
>>> - 	 *
>>> - 	 * And queue will be rerun in blk_mq_unquiesce_queue() if it is
>>> - 	 * quiesced.
>>> - 	 */
>>> - 	__blk_mq_run_dispatch_ops(hctx->queue, false,
>>> - 		need_run = !blk_queue_quiesced(hctx->queue) &&
>>> - 		blk_mq_hctx_has_pending(hctx));
>>> + 	need_run = blk_mq_hw_queue_need_run(hctx);
>>> + 	if (!need_run) {
>>> + 		unsigned long flags;
>>> 
>>> - 	if (!need_run)
>>> - 		return;
>>> + 	/*
>>> + 	 * synchronize with blk_mq_unquiesce_queue(), becuase we check
>>> + 	 * if hw queue is quiesced locklessly above, we need the use
>>> + 	 * ->queue_lock to make sure we see the up-to-date status to
>>> + 	 * not miss rerunning the hw queue.
>>> + 	 */
>>> + 	spin_lock_irqsave(&hctx->queue->queue_lock, flags);
>>> + 	need_run = blk_mq_hw_queue_need_run(hctx);
>>> + 	spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
>>> +
>>> + 	if (!need_run)
>>> + 		return;
>>> + 	}
>> 
>> Is this not solvable on the unquiesce side instead? It's rather a shame
>> to add overhead to the fast path to avoid a race with something that's
>> super unlikely, like quisce.
> 
> Yeah, it can be solved by adding synchronize_rcu()/srcu() in unquiesce
> side, but SCSI may call it in non-sleepable context via scsi_internal_device_unblock_nowait().

Another approach will be like the fix for BLK_MQ_S_STOPPED (in patch 3),
we could add a pair of mb into blk_queue_quiesced() and
blk_mq_unquiesce_queue(). In which case, the fix will not affect any fast
path, only slow path need the barrier overhead.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b2d0f22de0c7f..45588ddb08d6b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -264,6 +264,12 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
                ;
        } else if (!--q->quiesce_depth) {
                blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
+               /*
+                * Pairs with the smp_mb() in blk_queue_quiesced() to order the
+                * clearing of QUEUE_FLAG_QUIESCED above and the checking of
+                * dispatch list in the subsequent routine.
+                */
+               smp_mb__after_atomic();
                run_queue = true;
        }
        spin_unlock_irqrestore(&q->queue_lock, flags);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b8196e219ac22..7a71462892b66 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -628,7 +628,25 @@ void blk_queue_flag_clear(unsigned int flag, struct request_queue *q);
 #define blk_noretry_request(rq) \
        ((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
                             REQ_FAILFAST_DRIVER))
-#define blk_queue_quiesced(q)  test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags)
+
+static inline bool blk_queue_quiesced(struct request_queue *q)
+{
+       /* Fast path: hardware queue is unquiesced most of the time. */
+       if (likely(!test_bit(QUEUE_FLAG_QUIESCED, &q->queue_flags)))
+               return false;
+
+       /*
+        * This barrier is used to order adding of dispatch list before and
+        * the test of QUEUE_FLAG_QUIESCED below. Pairs with the memory barrier
+        * in blk_mq_unquiesce_queue() so that dispatch code could either see
+        * QUEUE_FLAG_QUIESCED is cleared or dispatch list is not  empty to
+        * avoid missing dispatching requests.
+        */
+       smp_mb();
+
+       return test_bit(QUEUE_FLAG_QUIESCED, &q->queue_flags);
+}
+
 #define blk_queue_pm_only(q)   atomic_read(&(q)->pm_only)
 #define blk_queue_registered(q)        test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags)
 #define blk_queue_sq_sched(q)  test_bit(QUEUE_FLAG_SQ_SCHED, &(q)->queue_flags)

Muchun,
Thanks.

> 
> 
> Thanks,
> Ming



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/3] block: fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests
  2024-09-11  3:59       ` Muchun Song
@ 2024-09-11  5:20         ` Muchun Song
  0 siblings, 0 replies; 18+ messages in thread
From: Muchun Song @ 2024-09-11  5:20 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Muchun Song, Yu Kuai, open list:BLOCK LAYER, LKML,
	stable



> On Sep 11, 2024, at 11:59, Muchun Song <muchun.song@linux.dev> wrote:
> 
> 
> 
>> On Sep 11, 2024, at 11:54, Ming Lei <ming.lei@redhat.com> wrote:
>> 
>>> On Tue, Sep 10, 2024 at 07:22:16AM -0600, Jens Axboe wrote:
>>> On 9/3/24 2:16 AM, Muchun Song wrote:
>>>> Supposing the following scenario.
>>>> 
>>>> CPU0                                        CPU1
>>>> 
>>>> blk_mq_insert_request()         1) store    blk_mq_unquiesce_queue()
>>>> blk_mq_run_hw_queue()                       blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)   3) store
>>>>   if (blk_queue_quiesced())   2) load         blk_mq_run_hw_queues()
>>>>       return                                      blk_mq_run_hw_queue()
>>>>   blk_mq_sched_dispatch_requests()                    if (!blk_mq_hctx_has_pending()) 4) load
>>>>                                                          return
>>>> 
>>>> The full memory barrier should be inserted between 1) and 2), as well as
>>>> between 3) and 4) to make sure that either CPU0 sees QUEUE_FLAG_QUIESCED is
>>>> cleared or CPU1 sees dispatch list or setting of bitmap of software queue.
>>>> Otherwise, either CPU will not re-run the hardware queue causing starvation.
>>>> 
>>>> So the first solution is to 1) add a pair of memory barrier to fix the
>>>> problem, another solution is to 2) use hctx->queue->queue_lock to synchronize
>>>> QUEUE_FLAG_QUIESCED. Here, we chose 2) to fix it since memory barrier is not
>>>> easy to be maintained.
>>> 
>>> Same comment here, 72-74 chars wide please.
>>> 
>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>> index b2d0f22de0c7f..ac39f2a346a52 100644
>>>> --- a/block/blk-mq.c
>>>> +++ b/block/blk-mq.c
>>>> @@ -2202,6 +2202,24 @@ void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
>>>> }
>>>> EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
>>>> 
>>>> +static inline bool blk_mq_hw_queue_need_run(struct blk_mq_hw_ctx *hctx)
>>>> +{
>>>> +    bool need_run;
>>>> +
>>>> +    /*
>>>> +     * When queue is quiesced, we may be switching io scheduler, or
>>>> +     * updating nr_hw_queues, or other things, and we can't run queue
>>>> +     * any more, even blk_mq_hctx_has_pending() can't be called safely.
>>>> +     *
>>>> +     * And queue will be rerun in blk_mq_unquiesce_queue() if it is
>>>> +     * quiesced.
>>>> +     */
>>>> +    __blk_mq_run_dispatch_ops(hctx->queue, false,
>>>> +                  need_run = !blk_queue_quiesced(hctx->queue) &&
>>>> +                  blk_mq_hctx_has_pending(hctx));
>>>> +    return need_run;
>>>> +}
>>> 
>>> This __blk_mq_run_dispatch_ops() is also way too wide, why didn't you
>>> just break it like where you copied it from?
>>> 
>>>> +
>>>> /**
>>>> * blk_mq_run_hw_queue - Start to run a hardware queue.
>>>> * @hctx: Pointer to the hardware queue to run.
>>>> @@ -2222,20 +2240,23 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
>>>> 
>>>> might_sleep_if(!async && hctx->flags & BLK_MQ_F_BLOCKING);
>>>> 
>>>> -    /*
>>>> -     * When queue is quiesced, we may be switching io scheduler, or
>>>> -     * updating nr_hw_queues, or other things, and we can't run queue
>>>> -     * any more, even __blk_mq_hctx_has_pending() can't be called safely.
>>>> -     *
>>>> -     * And queue will be rerun in blk_mq_unquiesce_queue() if it is
>>>> -     * quiesced.
>>>> -     */
>>>> -    __blk_mq_run_dispatch_ops(hctx->queue, false,
>>>> -        need_run = !blk_queue_quiesced(hctx->queue) &&
>>>> -        blk_mq_hctx_has_pending(hctx));
>>>> +    need_run = blk_mq_hw_queue_need_run(hctx);
>>>> +    if (!need_run) {
>>>> +        unsigned long flags;
>>>> 
>>>> -    if (!need_run)
>>>> -        return;
>>>> +    /*
>>>> +     * synchronize with blk_mq_unquiesce_queue(), becuase we check
>>>> +     * if hw queue is quiesced locklessly above, we need the use
>>>> +     * ->queue_lock to make sure we see the up-to-date status to
>>>> +     * not miss rerunning the hw queue.
>>>> +     */
>>>> +    spin_lock_irqsave(&hctx->queue->queue_lock, flags);
>>>> +    need_run = blk_mq_hw_queue_need_run(hctx);
>>>> +    spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
>>>> +
>>>> +    if (!need_run)
>>>> +        return;
>>>> +    }
>>> 
>>> Is this not solvable on the unquiesce side instead? It's rather a shame
>>> to add overhead to the fast path to avoid a race with something that's
>>> super unlikely, like quisce.
>> 
>> Yeah, it can be solved by adding synchronize_rcu()/srcu() in unquiesce
>> side, but SCSI may call it in non-sleepable context via scsi_internal_device_unblock_nowait().
> 
> Another approach will be like the fix for BLK_MQ_S_STOPPED (in patch 3),
> we could add a pair of mb into blk_queue_quiesced() and
> blk_mq_unquiesce_queue(). In which case, the fix will not affect any fast
> path, only slow path need the barrier overhead.

I misunderstood Jens’s question. I think Ming is right.
This approach only tries to reduce the overhead as 
much
as possible even for slow path compared to 
spinlock_based approach. 
Not solving the problem only from the unquiesce side.

Muchun,
Thanks.

> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b2d0f22de0c7f..45588ddb08d6b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -264,6 +264,12 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
>                ;
>        } else if (!--q->quiesce_depth) {
>                blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
> +               /*
> +                * Pairs with the smp_mb() in blk_queue_quiesced() to order the
> +                * clearing of QUEUE_FLAG_QUIESCED above and the checking of
> +                * dispatch list in the subsequent routine.
> +                */
> +               smp_mb__after_atomic();
>                run_queue = true;
>        }
>        spin_unlock_irqrestore(&q->queue_lock, flags);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index b8196e219ac22..7a71462892b66 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -628,7 +628,25 @@ void blk_queue_flag_clear(unsigned int flag, struct request_queue *q);
> #define blk_noretry_request(rq) \
>        ((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
>                             REQ_FAILFAST_DRIVER))
> -#define blk_queue_quiesced(q)  test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags)
> +
> +static inline bool blk_queue_quiesced(struct request_queue *q)
> +{
> +       /* Fast path: hardware queue is unquiesced most of the time. */
> +       if (likely(!test_bit(QUEUE_FLAG_QUIESCED, &q->queue_flags)))
> +               return false;
> +
> +       /*
> +        * This barrier is used to order adding of dispatch list before and
> +        * the test of QUEUE_FLAG_QUIESCED below. Pairs with the memory barrier
> +        * in blk_mq_unquiesce_queue() so that dispatch code could either see
> +        * QUEUE_FLAG_QUIESCED is cleared or dispatch list is not  empty to
> +        * avoid missing dispatching requests.
> +        */
> +       smp_mb();
> +
> +       return test_bit(QUEUE_FLAG_QUIESCED, &q->queue_flags);
> +}
> +
> #define blk_queue_pm_only(q)   atomic_read(&(q)->pm_only)
> #define blk_queue_registered(q)        test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags)
> #define blk_queue_sq_sched(q)  test_bit(QUEUE_FLAG_SQ_SCHED, &(q)->queue_flags)
> 
> Muchun,
> Thanks.
> 
>> 
>> 
>> Thanks,
>> Ming
> 
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/3] block: fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests
  2024-09-11  3:54     ` Ming Lei
  2024-09-11  3:59       ` Muchun Song
@ 2024-09-12  3:27       ` Muchun Song
  2024-09-12  6:27         ` Muchun Song
  1 sibling, 1 reply; 18+ messages in thread
From: Muchun Song @ 2024-09-12  3:27 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: Muchun Song, Yu Kuai, open list:BLOCK LAYER, LKML, stable



> On Sep 11, 2024, at 11:54, Ming Lei <ming.lei@redhat.com> wrote:
> 
> On Tue, Sep 10, 2024 at 07:22:16AM -0600, Jens Axboe wrote:
>> On 9/3/24 2:16 AM, Muchun Song wrote:
>>> Supposing the following scenario.
>>> 
>>> CPU0                                        CPU1
>>> 
>>> blk_mq_insert_request()         1) store    blk_mq_unquiesce_queue()
>>> blk_mq_run_hw_queue()                       blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)   3) store
>>>    if (blk_queue_quiesced())   2) load         blk_mq_run_hw_queues()
>>>        return                                      blk_mq_run_hw_queue()
>>>    blk_mq_sched_dispatch_requests()                    if (!blk_mq_hctx_has_pending()) 4) load
>>>                                                           return
>>> 
>>> The full memory barrier should be inserted between 1) and 2), as well as
>>> between 3) and 4) to make sure that either CPU0 sees QUEUE_FLAG_QUIESCED is
>>> cleared or CPU1 sees dispatch list or setting of bitmap of software queue.
>>> Otherwise, either CPU will not re-run the hardware queue causing starvation.
>>> 
>>> So the first solution is to 1) add a pair of memory barrier to fix the
>>> problem, another solution is to 2) use hctx->queue->queue_lock to synchronize
>>> QUEUE_FLAG_QUIESCED. Here, we chose 2) to fix it since memory barrier is not
>>> easy to be maintained.
>> 
>> Same comment here, 72-74 chars wide please.
>> 
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index b2d0f22de0c7f..ac39f2a346a52 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -2202,6 +2202,24 @@ void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
>>> }
>>> EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
>>> 
>>> +static inline bool blk_mq_hw_queue_need_run(struct blk_mq_hw_ctx *hctx)
>>> +{
>>> + 	bool need_run;
>>> +
>>> + 	/*
>>> + 	 * When queue is quiesced, we may be switching io scheduler, or
>>> + 	 * updating nr_hw_queues, or other things, and we can't run queue
>>> + 	 * any more, even blk_mq_hctx_has_pending() can't be called safely.
>>> + 	 *
>>> + 	 * And queue will be rerun in blk_mq_unquiesce_queue() if it is
>>> + 	 * quiesced.
>>> + 	 */
>>> + 	__blk_mq_run_dispatch_ops(hctx->queue, false,
>>> + 		need_run = !blk_queue_quiesced(hctx->queue) &&
>>> +       	blk_mq_hctx_has_pending(hctx));
>>> + 	return need_run;
>>> +}
>> 
>> This __blk_mq_run_dispatch_ops() is also way too wide, why didn't you
>> just break it like where you copied it from?
>> 
>>> +
>>> /**
>>>  * blk_mq_run_hw_queue - Start to run a hardware queue.
>>>  * @hctx: Pointer to the hardware queue to run.
>>> @@ -2222,20 +2240,23 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
>>> 
>>> might_sleep_if(!async && hctx->flags & BLK_MQ_F_BLOCKING);
>>> 
>>> - 	/*
>>> - 	 * When queue is quiesced, we may be switching io scheduler, or
>>> - 	 * updating nr_hw_queues, or other things, and we can't run queue
>>> - 	 * any more, even __blk_mq_hctx_has_pending() can't be called safely.
>>> - 	 *
>>> - 	 * And queue will be rerun in blk_mq_unquiesce_queue() if it is
>>> - 	 * quiesced.
>>> - 	 */
>>> - 	__blk_mq_run_dispatch_ops(hctx->queue, false,
>>> - 		need_run = !blk_queue_quiesced(hctx->queue) &&
>>> - 		blk_mq_hctx_has_pending(hctx));
>>> + 	need_run = blk_mq_hw_queue_need_run(hctx);
>>> + 	if (!need_run) {
>>> + 		unsigned long flags;
>>> 
>>> - 	if (!need_run)
>>> - 		return;
>>> + 	/*
>>> + 	 * synchronize with blk_mq_unquiesce_queue(), becuase we check
>>> + 	 * if hw queue is quiesced locklessly above, we need the use
>>> + 	 * ->queue_lock to make sure we see the up-to-date status to
>>> + 	 * not miss rerunning the hw queue.
>>> + 	 */
>>> + 	spin_lock_irqsave(&hctx->queue->queue_lock, flags);
>>> + 	need_run = blk_mq_hw_queue_need_run(hctx);
>>> + 	spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
>>> +
>>> + 	if (!need_run)
>>> + 		return;
>>> + }
>> 
>> Is this not solvable on the unquiesce side instead? It's rather a shame
>> to add overhead to the fast path to avoid a race with something that's
>> super unlikely, like quisce.
> 
> Yeah, it can be solved by adding synchronize_rcu()/srcu() in unquiesce
> side, but SCSI may call it in non-sleepable context via scsi_internal_device_unblock_nowait().

Hi Ming and Jens,

I use call_srcu/call_rcu to make it non-sleepable. Does this make sense to you?

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 12bf38bec1044..86cdff28b2ce6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -247,6 +247,13 @@ void blk_mq_quiesce_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);

+static void blk_mq_run_hw_queues_rcu(struct rcu_head *rh)
+{
+       struct request_queue *q = container_of(rh, struct request_queue,
+                                              rcu_head);
+       blk_mq_run_hw_queues(q, true);
+}
+
 /*
  * blk_mq_unquiesce_queue() - counterpart of blk_mq_quiesce_queue()
  * @q: request queue.
@@ -269,8 +276,13 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
        spin_unlock_irqrestore(&q->queue_lock, flags);

        /* dispatch requests which are inserted during quiescing */
-       if (run_queue)
-               blk_mq_run_hw_queues(q, true);
+       if (run_queue) {
+               if (q->tag_set->flags & BLK_MQ_F_BLOCKING)
+                       call_srcu(q->tag_set->srcu, &q->rcu_head,
+                                 blk_mq_run_hw_queues_rcu);
+               else
+                       call_rcu(&q->rcu_head, blk_mq_run_hw_queues_rcu);
+       }
 }
 EXPORT_SYMBOL_GPL(blk_mq_unquiesce_queue);

> 
> 
> Thanks,
> Ming



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/3] block: fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests
  2024-09-12  3:27       ` Muchun Song
@ 2024-09-12  6:27         ` Muchun Song
  0 siblings, 0 replies; 18+ messages in thread
From: Muchun Song @ 2024-09-12  6:27 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: Muchun Song, Yu Kuai, open list:BLOCK LAYER, LKML, stable



> On Sep 12, 2024, at 11:27, Muchun Song <muchun.song@linux.dev> wrote:
> 
> 
> 
>> On Sep 11, 2024, at 11:54, Ming Lei <ming.lei@redhat.com> wrote:
>> 
>> On Tue, Sep 10, 2024 at 07:22:16AM -0600, Jens Axboe wrote:
>>> On 9/3/24 2:16 AM, Muchun Song wrote:
>>>> Supposing the following scenario.
>>>> 
>>>> CPU0                                        CPU1
>>>> 
>>>> blk_mq_insert_request()         1) store    blk_mq_unquiesce_queue()
>>>> blk_mq_run_hw_queue()                       blk_queue_flag_clear(QUEUE_FLAG_QUIESCED) 3) store
>>>>   if (blk_queue_quiesced())   2) load         blk_mq_run_hw_queues()
>>>>       return                                      blk_mq_run_hw_queue()
>>>>   blk_mq_sched_dispatch_requests()                    if (!blk_mq_hctx_has_pending()) 4) load
>>>>                                                          return
>>>> 
>>>> The full memory barrier should be inserted between 1) and 2), as well as
>>>> between 3) and 4) to make sure that either CPU0 sees QUEUE_FLAG_QUIESCED is
>>>> cleared or CPU1 sees dispatch list or setting of bitmap of software queue.
>>>> Otherwise, either CPU will not re-run the hardware queue causing starvation.
>>>> 
>>>> So the first solution is to 1) add a pair of memory barrier to fix the
>>>> problem, another solution is to 2) use hctx->queue->queue_lock to synchronize
>>>> QUEUE_FLAG_QUIESCED. Here, we chose 2) to fix it since memory barrier is not
>>>> easy to be maintained.
>>> 
>>> Same comment here, 72-74 chars wide please.
>>> 
>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>> index b2d0f22de0c7f..ac39f2a346a52 100644
>>>> --- a/block/blk-mq.c
>>>> +++ b/block/blk-mq.c
>>>> @@ -2202,6 +2202,24 @@ void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
>>>> }
>>>> EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
>>>> 
>>>> +static inline bool blk_mq_hw_queue_need_run(struct blk_mq_hw_ctx *hctx)
>>>> +{
>>>> +  	bool need_run;
>>>> +
>>>> +  	/*
>>>> +  	 * When queue is quiesced, we may be switching io scheduler, or
>>>> +  	 * updating nr_hw_queues, or other things, and we can't run queue
>>>> +  	 * any more, even blk_mq_hctx_has_pending() can't be called safely.
>>>> +  	 *
>>>> +  	 * And queue will be rerun in blk_mq_unquiesce_queue() if it is
>>>> +  	 * quiesced.
>>>> +  	 */
>>>> +  	__blk_mq_run_dispatch_ops(hctx->queue, false,
>>>> +  	need_run = !blk_queue_quiesced(hctx->queue) &&
>>>> +       	blk_mq_hctx_has_pending(hctx));
>>>> +  	return need_run;
>>>> +}
>>> 
>>> This __blk_mq_run_dispatch_ops() is also way too wide, why didn't you
>>> just break it like where you copied it from?
>>> 
>>>> +
>>>> /**
>>>> * blk_mq_run_hw_queue - Start to run a hardware queue.
>>>> * @hctx: Pointer to the hardware queue to run.
>>>> @@ -2222,20 +2240,23 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
>>>> 
>>>> might_sleep_if(!async && hctx->flags & BLK_MQ_F_BLOCKING);
>>>> 
>>>> -  	/*
>>>> -  	 * When queue is quiesced, we may be switching io scheduler, or
>>>> -  	 * updating nr_hw_queues, or other things, and we can't run queue
>>>> -  	 * any more, even __blk_mq_hctx_has_pending() can't be called safely.
>>>> - 	 *
>>>> -  	 * And queue will be rerun in blk_mq_unquiesce_queue() if it is
>>>> - 	 * quiesced.
>>>> -  	 */
>>>> -  	__blk_mq_run_dispatch_ops(hctx->queue, false,
>>>> -  		need_run = !blk_queue_quiesced(hctx->queue) &&
>>>> -  		blk_mq_hctx_has_pending(hctx));
>>>> +  	need_run = blk_mq_hw_queue_need_run(hctx);
>>>> +  	if (!need_run) {
>>>> +  		unsigned long flags;
>>>> 
>>>> -  	if (!need_run)
>>>> -  		return;
>>>> +  		/*
>>>> +  		 * synchronize with blk_mq_unquiesce_queue(), becuase we check
>>>> +  		 * if hw queue is quiesced locklessly above, we need the use
>>>> +  		 * ->queue_lock to make sure we see the up-to-date status to
>>>> +  		 * not miss rerunning the hw queue.
>>>> +  		 */
>>>> +  		spin_lock_irqsave(&hctx->queue->queue_lock, flags);
>>>> +  		need_run = blk_mq_hw_queue_need_run(hctx);
>>>> +  		spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
>>>> +
>>>> +  		if (!need_run)
>>>> +  		return;
>>>> + 	}
>>> 
>>> Is this not solvable on the unquiesce side instead? It's rather a shame
>>> to add overhead to the fast path to avoid a race with something that's
>>> super unlikely, like quisce.
>> 
>> Yeah, it can be solved by adding synchronize_rcu()/srcu() in unquiesce
>> side, but SCSI may call it in non-sleepable context via scsi_internal_device_unblock_nowait().
> 
> Hi Ming and Jens,
> 
> I use call_srcu/call_rcu to make it non-sleepable. Does this make sense to you?

Sorry for the noise. call_srcu/call_rcu can't be easy to do this.
Because call_srcu/call_rcu could be issued twice if users try to
unquiesce the queue again before the callback of
blk_mq_run_hw_queues_rcu has been executed.

Thanks.

> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 12bf38bec1044..86cdff28b2ce6 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -247,6 +247,13 @@ void blk_mq_quiesce_queue(struct request_queue *q)
> }
> EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
> 
> +static void blk_mq_run_hw_queues_rcu(struct rcu_head *rh)
> +{
> +       struct request_queue *q = container_of(rh, struct request_queue,
> +                                              rcu_head);
> +       blk_mq_run_hw_queues(q, true);
> +}
> +
> /*
>  * blk_mq_unquiesce_queue() - counterpart of blk_mq_quiesce_queue()
>  * @q: request queue.
> @@ -269,8 +276,13 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
>        spin_unlock_irqrestore(&q->queue_lock, flags);
> 
>        /* dispatch requests which are inserted during quiescing */
> -       if (run_queue)
> -               blk_mq_run_hw_queues(q, true);
> +       if (run_queue) {
> +               if (q->tag_set->flags & BLK_MQ_F_BLOCKING)
> +                       call_srcu(q->tag_set->srcu, &q->rcu_head,
> +                                 blk_mq_run_hw_queues_rcu);
> +               else
> +                       call_rcu(&q->rcu_head, blk_mq_run_hw_queues_rcu);
> +       }
> }
> EXPORT_SYMBOL_GPL(blk_mq_unquiesce_queue);
> 
>> 
>> 
>> Thanks,
>> Ming



^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2024-09-12  6:27 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-03  8:16 [PATCH v2 0/3] Fix some starvation problems in block layer Muchun Song
2024-09-03  8:16 ` [PATCH v2 1/3] block: fix missing dispatching request when queue is started or unquiesced Muchun Song
2024-09-10 13:17   ` Jens Axboe
2024-09-11  2:43     ` Muchun Song
2024-09-03  8:16 ` [PATCH v2 2/3] block: fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests Muchun Song
2024-09-04 12:56   ` Ming Lei
2024-09-10 13:22   ` Jens Axboe
2024-09-11  3:54     ` Ming Lei
2024-09-11  3:59       ` Muchun Song
2024-09-11  5:20         ` Muchun Song
2024-09-12  3:27       ` Muchun Song
2024-09-12  6:27         ` Muchun Song
2024-09-11  3:56     ` Muchun Song
2024-09-03  8:16 ` [PATCH v2 3/3] block: fix ordering between checking BLK_MQ_S_STOPPED " Muchun Song
2024-09-04 13:04   ` Ming Lei
2024-09-10 13:22   ` Jens Axboe
2024-09-11  2:44     ` Muchun Song
2024-09-10  2:49 ` [PATCH v2 0/3] Fix some starvation problems in block layer Muchun Song

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).