[PATCH 0/4] Fix some starvation problems

linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/4] Fix some starvation problems
@ 2024-08-11 10:19 Muchun Song
  2024-08-11 10:19 ` [PATCH 1/4] block: fix request starvation when queue is stopped or quiesced Muchun Song
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Muchun Song @ 2024-08-11 10:19 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, linux-kernel, Muchun Song

We encounter a problem on our servers where there are hundreds of
UNINTERRUPTED processes which are all waiting in the WBT wait queue.
And the IO hung detector logged so many messages about "blocked for
more than 122 seconds". The call trace is as follows:

    Call Trace:
        __schedule+0x959/0xee0
        schedule+0x40/0xb0
        io_schedule+0x12/0x40
        rq_qos_wait+0xaf/0x140
        wbt_wait+0x92/0xc0
        __rq_qos_throttle+0x20/0x30
        blk_mq_make_request+0x12a/0x5c0
        generic_make_request_nocheck+0x172/0x3f0
        submit_bio+0x42/0x1c0
        ...

The WBT module is used to throttle buffered writeback, which will
block any buffered writeback IO request until the previous inflight
IOs have been completed. So I checked the inflight IO counter. That
was one meaning one IO request was submitted to the downstream
interface like block core layer or device driver (virtio_blk driver
in our case). We need to figure out why the inflight IO is not
completed in time. I confirmed that all the virtio ring buffers of
virtio_blk are empty, so the root cause is not related to the block
device or the virtio_blk driver since the driver has never received
that IO request.

We know that block core layer could submit IO requests to the driver
through kworker (the callback function is blk_mq_run_work_fn). I
thought maybe the kworker was blocked by some other resources causing
the callback to not be evoked in time. So I checked all the kworkers
and workqueues and confirmed there was no pending work on any kworker
or workqueue.

Integrate all the investigation information, I guess the problem should
be in block core layer missing a chance to submit an IO request. After
some investigation of code, I found some following scenarios which could
cause similar symptoms. I am not sure whether this is the root cause or
not, but maybe it is a reasonable suspect.

Muchun Song (4):
  block: fix request starvation when queue is stopped or quiesced
  block: fix ordering between checking BLK_MQ_S_STOPPED and adding
    requests to hctx->dispatch
  block: fix missing smp_mb in blk_mq_{delay_}run_hw_queues
  block: fix fix ordering between checking QUEUE_FLAG_QUIESCED and
    adding requests to hctx->dispatch

 block/blk-mq.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

-- 
2.20.1

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/4] block: fix request starvation when queue is stopped or quiesced
  2024-08-11 10:19 [PATCH 0/4] Fix some starvation problems Muchun Song
@ 2024-08-11 10:19 ` Muchun Song
  2024-08-16  9:14   ` Ming Lei
  2024-08-11 10:19 ` [PATCH 2/4] block: fix ordering between checking BLK_MQ_S_STOPPED and adding requests to hctx->dispatch Muchun Song
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Muchun Song @ 2024-08-11 10:19 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, linux-kernel, Muchun Song

Supposing the following scenario with a virtio_blk driver.

CPU0                                    CPU1                                    CPU2

blk_mq_try_issue_directly()
    __blk_mq_issue_directly()
        q->mq_ops->queue_rq()
            virtio_queue_rq()
                blk_mq_stop_hw_queue()
                                        blk_mq_try_issue_directly()             virtblk_done()
                                            if (blk_mq_hctx_stopped())
    blk_mq_request_bypass_insert()                                                  blk_mq_start_stopped_hw_queue()
    blk_mq_run_hw_queue()                                                               blk_mq_run_hw_queue()
                                                blk_mq_insert_request()
                                                return // Who is responsible for dispatching this IO request?

After CPU0 has marked the queue as stopped, CPU1 will see the queue is stopped.
But before CPU1 puts the request on the dispatch list, CPU2 receives the interrupt
of completion of request, so it will run the hardware queue and marks the queue
as non-stopped. Meanwhile, CPU1 also runs the same hardware queue. After both CPU1
and CPU2 complete blk_mq_run_hw_queue(), CPU1 just puts the request to the same
hardware queue and returns. Seems it misses dispatching a request. Fix it by
running the hardware queue explicitly. I think blk_mq_request_issue_directly()
should handle a similar problem.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 block/blk-mq.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e3c3c0c21b553..b2d0f22de0c7f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2619,6 +2619,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 
 	if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(rq->q)) {
 		blk_mq_insert_request(rq, 0);
+		blk_mq_run_hw_queue(hctx, false);
 		return;
 	}
 
@@ -2649,6 +2650,7 @@ static blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last)
 
 	if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(rq->q)) {
 		blk_mq_insert_request(rq, 0);
+		blk_mq_run_hw_queue(hctx, false);
 		return BLK_STS_OK;
 	}
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/4] block: fix ordering between checking BLK_MQ_S_STOPPED and adding requests to hctx->dispatch
  2024-08-11 10:19 [PATCH 0/4] Fix some starvation problems Muchun Song
  2024-08-11 10:19 ` [PATCH 1/4] block: fix request starvation when queue is stopped or quiesced Muchun Song
@ 2024-08-11 10:19 ` Muchun Song
  2024-08-19  2:27   ` Ming Lei
  2024-08-11 10:19 ` [PATCH 3/4] block: fix missing smp_mb in blk_mq_{delay_}run_hw_queues Muchun Song
  2024-08-11 10:19 ` [PATCH 4/4] block: fix fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests to hctx->dispatch Muchun Song
  3 siblings, 1 reply; 21+ messages in thread
From: Muchun Song @ 2024-08-11 10:19 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, linux-kernel, Muchun Song

Supposing the following scenario with a virtio_blk driver.

CPU0                                                                CPU1

blk_mq_try_issue_directly()
    __blk_mq_issue_directly()
        q->mq_ops->queue_rq()
            virtio_queue_rq()
                blk_mq_stop_hw_queue()
                                                                    virtblk_done()
    blk_mq_request_bypass_insert()                                      blk_mq_start_stopped_hw_queues()
        /* Add IO request to dispatch list */   1) store                    blk_mq_start_stopped_hw_queue()
                                                                                clear_bit(BLK_MQ_S_STOPPED)                 3) store
    blk_mq_run_hw_queue()                                                       blk_mq_run_hw_queue()
        if (!blk_mq_hctx_has_pending())                                             if (!blk_mq_hctx_has_pending())         4) load
            return                                                                      return
        blk_mq_sched_dispatch_requests()                                            blk_mq_sched_dispatch_requests()
            if (blk_mq_hctx_stopped())          2) load                                 if (blk_mq_hctx_stopped())
                return                                                                      return
            __blk_mq_sched_dispatch_requests()                                          __blk_mq_sched_dispatch_requests()

The full memory barrier should be inserted between 1) and 2), as well as between
3) and 4) to make sure that either CPU0 sees BLK_MQ_S_STOPPED is cleared or CPU1
sees dispatch list or setting of bitmap of software queue. Otherwise, either CPU
will not re-run the hardware queue causing starvation.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 block/blk-mq.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b2d0f22de0c7f..6f18993b8f454 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2075,6 +2075,13 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
 		 * in blk_mq_sched_restart(). Avoid restart code path to
 		 * miss the new added requests to hctx->dispatch, meantime
 		 * SCHED_RESTART is observed here.
+		 *
+		 * This barrier is also used to order adding of dispatch list
+		 * above and the test of BLK_MQ_S_STOPPED in the following
+		 * routine (in blk_mq_delay_run_hw_queue()). Pairs with the
+		 * barrier in blk_mq_start_stopped_hw_queue(). So dispatch code
+		 * could either see BLK_MQ_S_STOPPED is cleared or dispatch list
+		 * to avoid missing dispatching requests.
 		 */
 		smp_mb();
 
@@ -2237,6 +2244,17 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 	if (!need_run)
 		return;
 
+	/*
+	 * This barrier is used to order adding of dispatch list or setting
+	 * of bitmap of any software queue outside of this function and the
+	 * test of BLK_MQ_S_STOPPED in the following routine. Pairs with the
+	 * barrier in blk_mq_start_stopped_hw_queue(). So dispatch code could
+	 * either see BLK_MQ_S_STOPPED is cleared or dispatch list or setting
+	 * of bitmap of any software queue to avoid missing dispatching
+	 * requests.
+	 */
+	smp_mb();
+
 	if (async || !cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)) {
 		blk_mq_delay_run_hw_queue(hctx, 0);
 		return;
@@ -2392,6 +2410,13 @@ void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 		return;
 
 	clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
+	/*
+	 * Pairs with the smp_mb() in blk_mq_run_hw_queue() or
+	 * blk_mq_dispatch_rq_list() to order the clearing of
+	 * BLK_MQ_S_STOPPED and the test of dispatch list or
+	 * bitmap of any software queue.
+	 */
+	smp_mb__after_atomic();
 	blk_mq_run_hw_queue(hctx, async);
 }
 EXPORT_SYMBOL_GPL(blk_mq_start_stopped_hw_queue);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/4] block: fix missing smp_mb in blk_mq_{delay_}run_hw_queues
  2024-08-11 10:19 [PATCH 0/4] Fix some starvation problems Muchun Song
  2024-08-11 10:19 ` [PATCH 1/4] block: fix request starvation when queue is stopped or quiesced Muchun Song
  2024-08-11 10:19 ` [PATCH 2/4] block: fix ordering between checking BLK_MQ_S_STOPPED and adding requests to hctx->dispatch Muchun Song
@ 2024-08-11 10:19 ` Muchun Song
  2024-08-11 10:19 ` [PATCH 4/4] block: fix fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests to hctx->dispatch Muchun Song
  3 siblings, 0 replies; 21+ messages in thread
From: Muchun Song @ 2024-08-11 10:19 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, linux-kernel, Muchun Song

Supposing the following scenario with a virtio_blk driver.

CPU0                                                                CPU1

/*
 * Add request to dispatch list or set bitmap of
 * software queue.                      1) store                    virtblk_done()
 */
blk_mq_run_hw_queues()/blk_mq_delay_run_hw_queues()                     blk_mq_start_stopped_hw_queues()
    if (blk_mq_hctx_stopped())          2) load                             blk_mq_start_stopped_hw_queue()
        continue                                                                clear_bit(BLK_MQ_S_STOPPED)                 3) store
    blk_mq_run_hw_queue()/blk_mq_delay_run_hw_queue()                           blk_mq_run_hw_queue()
                                                                                    if (!blk_mq_hctx_has_pending())         4) load
                                                                                        return
                                                                                    blk_mq_sched_dispatch_requests()

The full memory barrier should be inserted between 1) and 2), as well as between
3) and 4) to make sure that either CPU0 sees BLK_MQ_S_STOPPED is cleared or CPU1
sees dispatch list or setting of bitmap of software queue. Otherwise, either CPU
will not re-run the hardware queue causing starvation.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 block/blk-mq.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6f18993b8f454..385a74e566874 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2299,6 +2299,18 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
 	sq_hctx = NULL;
 	if (blk_queue_sq_sched(q))
 		sq_hctx = blk_mq_get_sq_hctx(q);
+
+	/*
+	 * This barrier is used to order adding of dispatch list or setting
+	 * of bitmap of any software queue outside of this function and the
+	 * test of BLK_MQ_S_STOPPED in the following routine. Pairs with the
+	 * barrier in blk_mq_start_stopped_hw_queue(). So dispatch code could
+	 * either see BLK_MQ_S_STOPPED is cleared or dispatch list or setting
+	 * of bitmap of any software queue to avoid missing dispatching
+	 * requests.
+	 */
+	smp_mb();
+
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (blk_mq_hctx_stopped(hctx))
 			continue;
@@ -2327,6 +2339,18 @@ void blk_mq_delay_run_hw_queues(struct request_queue *q, unsigned long msecs)
 	sq_hctx = NULL;
 	if (blk_queue_sq_sched(q))
 		sq_hctx = blk_mq_get_sq_hctx(q);
+
+	/*
+	 * This barrier is used to order adding of dispatch list or setting
+	 * of bitmap of any software queue outside of this function and the
+	 * test of BLK_MQ_S_STOPPED in the following routine. Pairs with the
+	 * barrier in blk_mq_start_stopped_hw_queue(). So dispatch code could
+	 * either see BLK_MQ_S_STOPPED is cleared or dispatch list or setting
+	 * of bitmap of any software queue to avoid missing dispatching
+	 * requests.
+	 */
+	smp_mb();
+
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (blk_mq_hctx_stopped(hctx))
 			continue;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 4/4] block: fix fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests to hctx->dispatch
  2024-08-11 10:19 [PATCH 0/4] Fix some starvation problems Muchun Song
                   ` (2 preceding siblings ...)
  2024-08-11 10:19 ` [PATCH 3/4] block: fix missing smp_mb in blk_mq_{delay_}run_hw_queues Muchun Song
@ 2024-08-11 10:19 ` Muchun Song
  2024-08-23 11:27   ` Ming Lei
  3 siblings, 1 reply; 21+ messages in thread
From: Muchun Song @ 2024-08-11 10:19 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, linux-kernel, Muchun Song

Supposing the following scenario.

CPU0                                                                CPU1

blk_mq_request_issue_directly()                                     blk_mq_unquiesce_queue()
    if (blk_queue_quiesced())                                           blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)   3) store
        blk_mq_insert_request()                                         blk_mq_run_hw_queues()
            /*                                                              blk_mq_run_hw_queue()
             * Add request to dispatch list or set bitmap of                    if (!blk_mq_hctx_has_pending())     4) load
             * software queue.                  1) store                            return
             */
        blk_mq_run_hw_queue()
            if (blk_queue_quiesced())           2) load
                return
            blk_mq_sched_dispatch_requests()

The full memory barrier should be inserted between 1) and 2), as well as
between 3) and 4) to make sure that either CPU0 sees QUEUE_FLAG_QUIESCED is
cleared or CPU1 sees dispatch list or setting of bitmap of software queue.
Otherwise, either CPU will not re-run the hardware queue causing starvation.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 block/blk-mq.c | 38 +++++++++++++++++++++++++++-----------
 1 file changed, 27 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 385a74e566874..66b21407a9a6c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -264,6 +264,13 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
 		;
 	} else if (!--q->quiesce_depth) {
 		blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
+		/**
+		 * The need of memory barrier is in blk_mq_run_hw_queues() to
+		 * make sure clearing of QUEUE_FLAG_QUIESCED is before the
+		 * checking of dispatch list or bitmap of any software queue.
+		 *
+		 * smp_mb__after_atomic();
+		 */
 		run_queue = true;
 	}
 	spin_unlock_irqrestore(&q->queue_lock, flags);
@@ -2222,6 +2229,21 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 {
 	bool need_run;
 
+	/*
+	 * This barrier is used to order adding of dispatch list or setting
+	 * of bitmap of any software queue outside of this function and the
+	 * test of BLK_MQ_S_STOPPED in the following routine. Pairs with the
+	 * barrier in blk_mq_start_stopped_hw_queue(). So dispatch code could
+	 * either see BLK_MQ_S_STOPPED is cleared or dispatch list or setting
+	 * of bitmap of any software queue to avoid missing dispatching
+	 * requests.
+	 *
+	 * This barrier is also used to order adding of dispatch list or
+	 * setting of bitmap of any software queue outside of this function
+	 * and test of QUEUE_FLAG_QUIESCED below.
+	 */
+	smp_mb();
+
 	/*
 	 * We can't run the queue inline with interrupts disabled.
 	 */
@@ -2244,17 +2266,6 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 	if (!need_run)
 		return;
 
-	/*
-	 * This barrier is used to order adding of dispatch list or setting
-	 * of bitmap of any software queue outside of this function and the
-	 * test of BLK_MQ_S_STOPPED in the following routine. Pairs with the
-	 * barrier in blk_mq_start_stopped_hw_queue(). So dispatch code could
-	 * either see BLK_MQ_S_STOPPED is cleared or dispatch list or setting
-	 * of bitmap of any software queue to avoid missing dispatching
-	 * requests.
-	 */
-	smp_mb();
-
 	if (async || !cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)) {
 		blk_mq_delay_run_hw_queue(hctx, 0);
 		return;
@@ -2308,6 +2319,11 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
 	 * either see BLK_MQ_S_STOPPED is cleared or dispatch list or setting
 	 * of bitmap of any software queue to avoid missing dispatching
 	 * requests.
+	 *
+	 * This barrier is also used to order clearing of QUEUE_FLAG_QUIESCED
+	 * outside of this function in blk_mq_unquiesce_queue() and checking
+	 * of dispatch list or bitmap of any software queue in
+	 * blk_mq_run_hw_queue().
 	 */
 	smp_mb();
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/4] block: fix request starvation when queue is stopped or quiesced
  2024-08-11 10:19 ` [PATCH 1/4] block: fix request starvation when queue is stopped or quiesced Muchun Song
@ 2024-08-16  9:14   ` Ming Lei
  0 siblings, 0 replies; 21+ messages in thread
From: Ming Lei @ 2024-08-16  9:14 UTC (permalink / raw)
  To: Muchun Song; +Cc: axboe, linux-block, linux-kernel

On Sun, Aug 11, 2024 at 06:19:18PM +0800, Muchun Song wrote:
> Supposing the following scenario with a virtio_blk driver.
> 
> CPU0                                    CPU1                                    CPU2
> 
> blk_mq_try_issue_directly()
>     __blk_mq_issue_directly()
>         q->mq_ops->queue_rq()
>             virtio_queue_rq()
>                 blk_mq_stop_hw_queue()
>                                         blk_mq_try_issue_directly()             virtblk_done()
>                                             if (blk_mq_hctx_stopped())
>     blk_mq_request_bypass_insert()                                                  blk_mq_start_stopped_hw_queue()
>     blk_mq_run_hw_queue()                                                               blk_mq_run_hw_queue()
>                                                 blk_mq_insert_request()
>                                                 return // Who is responsible for dispatching this IO request?
> 
> After CPU0 has marked the queue as stopped, CPU1 will see the queue is stopped.
> But before CPU1 puts the request on the dispatch list, CPU2 receives the interrupt
> of completion of request, so it will run the hardware queue and marks the queue
> as non-stopped. Meanwhile, CPU1 also runs the same hardware queue. After both CPU1
> and CPU2 complete blk_mq_run_hw_queue(), CPU1 just puts the request to the same
> hardware queue and returns. Seems it misses dispatching a request. Fix it by
> running the hardware queue explicitly. I think blk_mq_request_issue_directly()
> should handle a similar problem.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  block/blk-mq.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e3c3c0c21b553..b2d0f22de0c7f 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2619,6 +2619,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>  
>  	if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(rq->q)) {
>  		blk_mq_insert_request(rq, 0);
> +		blk_mq_run_hw_queue(hctx, false);
>  		return;
>  	}
>  
> @@ -2649,6 +2650,7 @@ static blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last)
>  
>  	if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(rq->q)) {
>  		blk_mq_insert_request(rq, 0);
> +		blk_mq_run_hw_queue(hctx, false);
>  		return BLK_STS_OK;
>  	}

Looks one real issue, and the fix is fine:

Reviewed-by: Ming Lei <ming.lei@redhat.com>


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/4] block: fix ordering between checking BLK_MQ_S_STOPPED and adding requests to hctx->dispatch
  2024-08-11 10:19 ` [PATCH 2/4] block: fix ordering between checking BLK_MQ_S_STOPPED and adding requests to hctx->dispatch Muchun Song
@ 2024-08-19  2:27   ` Ming Lei
  2024-08-19  3:49     ` Muchun Song
  0 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2024-08-19  2:27 UTC (permalink / raw)
  To: Muchun Song; +Cc: axboe, linux-block, linux-kernel, ming.lei

Hi Muchun,

On Sun, Aug 11, 2024 at 06:19:19PM +0800, Muchun Song wrote:
> Supposing the following scenario with a virtio_blk driver.
> 
> CPU0                                                                CPU1
> 
> blk_mq_try_issue_directly()
>     __blk_mq_issue_directly()
>         q->mq_ops->queue_rq()
>             virtio_queue_rq()
>                 blk_mq_stop_hw_queue()
>                                                                     virtblk_done()
>     blk_mq_request_bypass_insert()                                      blk_mq_start_stopped_hw_queues()
>         /* Add IO request to dispatch list */   1) store                    blk_mq_start_stopped_hw_queue()
>                                                                                 clear_bit(BLK_MQ_S_STOPPED)                 3) store
>     blk_mq_run_hw_queue()                                                       blk_mq_run_hw_queue()
>         if (!blk_mq_hctx_has_pending())                                             if (!blk_mq_hctx_has_pending())         4) load
>             return                                                                      return
>         blk_mq_sched_dispatch_requests()                                            blk_mq_sched_dispatch_requests()
>             if (blk_mq_hctx_stopped())          2) load                                 if (blk_mq_hctx_stopped())
>                 return                                                                      return
>             __blk_mq_sched_dispatch_requests()                                          __blk_mq_sched_dispatch_requests()
> 
> The full memory barrier should be inserted between 1) and 2), as well as between
> 3) and 4) to make sure that either CPU0 sees BLK_MQ_S_STOPPED is cleared or CPU1
> sees dispatch list or setting of bitmap of software queue. Otherwise, either CPU
> will not re-run the hardware queue causing starvation.

Yeah, it is one kind of race which is triggered when adding request into
->dispatch list after returning STS_RESOURCE. We were troubled by lots of
such kind of race.

stopping queue is used in very less drivers, and its only purpose should
be for throttling hw queue in case that low level queue is busy. There seems
more uses of blk_mq_stop_hw_queues(), but most of them should be replaced
with blk_mq_quiesce_queue().

IMO, fixing this kind of issue via memory barrier is too tricky to
maintain cause WRITE/READ dependency is very hard to follow. I'd suggest to
make memory barrier solution as the last resort, and we can try to figure
out other easier & more reliable way first.

One idea I thought of is to call blk_mq_request_bypass_insert()(or rename
& export it) before calling blk_mq_stop_hw_queue() in driver, then
return new status code STS_STOP_DISPATCH for notifying blk-mq to stop
dispatch simply.


thanks,
Ming


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/4] block: fix ordering between checking BLK_MQ_S_STOPPED and adding requests to hctx->dispatch
  2024-08-19  2:27   ` Ming Lei
@ 2024-08-19  3:49     ` Muchun Song
  2024-08-22  3:54       ` Yu Kuai
  0 siblings, 1 reply; 21+ messages in thread
From: Muchun Song @ 2024-08-19  3:49 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, open list:BLOCK LAYER, LKML, muchun.song

On Mon, Aug 19, 2024 at 10:28 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> Hi Muchun,
>
> On Sun, Aug 11, 2024 at 06:19:19PM +0800, Muchun Song wrote:
> > Supposing the following scenario with a virtio_blk driver.
> >
> > CPU0                                                                CPU1
> >
> > blk_mq_try_issue_directly()
> >     __blk_mq_issue_directly()
> >         q->mq_ops->queue_rq()
> >             virtio_queue_rq()
> >                 blk_mq_stop_hw_queue()
> >                                                                     virtblk_done()
> >     blk_mq_request_bypass_insert()                                      blk_mq_start_stopped_hw_queues()
> >         /* Add IO request to dispatch list */   1) store                    blk_mq_start_stopped_hw_queue()
> >                                                                                 clear_bit(BLK_MQ_S_STOPPED)                 3) store
> >     blk_mq_run_hw_queue()                                                       blk_mq_run_hw_queue()
> >         if (!blk_mq_hctx_has_pending())                                             if (!blk_mq_hctx_has_pending())         4) load
> >             return                                                                      return
> >         blk_mq_sched_dispatch_requests()                                            blk_mq_sched_dispatch_requests()
> >             if (blk_mq_hctx_stopped())          2) load                                 if (blk_mq_hctx_stopped())
> >                 return                                                                      return
> >             __blk_mq_sched_dispatch_requests()                                          __blk_mq_sched_dispatch_requests()
> >
> > The full memory barrier should be inserted between 1) and 2), as well as between
> > 3) and 4) to make sure that either CPU0 sees BLK_MQ_S_STOPPED is cleared or CPU1
> > sees dispatch list or setting of bitmap of software queue. Otherwise, either CPU
> > will not re-run the hardware queue causing starvation.
>
> Yeah, it is one kind of race which is triggered when adding request into
> ->dispatch list after returning STS_RESOURCE. We were troubled by lots of
> such kind of race.

Yes. I saw the similar fix for BLK_MQ_S_SCHED_RESTART.

>
> stopping queue is used in very less drivers, and its only purpose should
> be for throttling hw queue in case that low level queue is busy. There seems
> more uses of blk_mq_stop_hw_queues(), but most of them should be replaced
> with blk_mq_quiesce_queue().
>
> IMO, fixing this kind of issue via memory barrier is too tricky to
> maintain cause WRITE/READ dependency is very hard to follow. I'd suggest to
> make memory barrier solution as the last resort, and we can try to figure
> out other easier & more reliable way first.

I do agree it is hard to maintain the dependencies in the future. We should
propose an easy-maintainable solution. But I thought it is a long-term issue
throughout different stable linux distros. Adding a mb is the easy way to fix
the problem (the code footprint is really small), so it will be very
easy for others
to backport those bug fixes to different stable linux distros. Therefore, mb
should be an interim solution. Then, we could improve it based on the solution
you've proposed below. What do you think?

Thanks,
Muchun.

>
> One idea I thought of is to call blk_mq_request_bypass_insert()(or rename
> & export it) before calling blk_mq_stop_hw_queue() in driver, then
> return new status code STS_STOP_DISPATCH for notifying blk-mq to stop
> dispatch simply.
>
>
> thanks,
> Ming
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/4] block: fix ordering between checking BLK_MQ_S_STOPPED and adding requests to hctx->dispatch
  2024-08-19  3:49     ` Muchun Song
@ 2024-08-22  3:54       ` Yu Kuai
  2024-08-26  8:35         ` Muchun Song
  0 siblings, 1 reply; 21+ messages in thread
From: Yu Kuai @ 2024-08-22  3:54 UTC (permalink / raw)
  To: Muchun Song, Ming Lei
  Cc: Jens Axboe, open list:BLOCK LAYER, LKML, muchun.song, yukuai (C)

Hi,

在 2024/08/19 11:49, Muchun Song 写道:
> On Mon, Aug 19, 2024 at 10:28 AM Ming Lei <ming.lei@redhat.com> wrote:
>>
>> Hi Muchun,
>>
>> On Sun, Aug 11, 2024 at 06:19:19PM +0800, Muchun Song wrote:
>>> Supposing the following scenario with a virtio_blk driver.
>>>
>>> CPU0                                                                CPU1
>>>
>>> blk_mq_try_issue_directly()
>>>      __blk_mq_issue_directly()
>>>          q->mq_ops->queue_rq()
>>>              virtio_queue_rq()
>>>                  blk_mq_stop_hw_queue()
>>>                                                                      virtblk_done()
>>>      blk_mq_request_bypass_insert()                                      blk_mq_start_stopped_hw_queues()
>>>          /* Add IO request to dispatch list */   1) store                    blk_mq_start_stopped_hw_queue()
>>>                                                                                  clear_bit(BLK_MQ_S_STOPPED)                 3) store
>>>      blk_mq_run_hw_queue()                                                       blk_mq_run_hw_queue()
>>>          if (!blk_mq_hctx_has_pending())                                             if (!blk_mq_hctx_has_pending())         4) load
>>>              return                                                                      return
>>>          blk_mq_sched_dispatch_requests()                                            blk_mq_sched_dispatch_requests()
>>>              if (blk_mq_hctx_stopped())          2) load                                 if (blk_mq_hctx_stopped())
>>>                  return                                                                      return
>>>              __blk_mq_sched_dispatch_requests()                                          __blk_mq_sched_dispatch_requests()
>>>
>>> The full memory barrier should be inserted between 1) and 2), as well as between
>>> 3) and 4) to make sure that either CPU0 sees BLK_MQ_S_STOPPED is cleared or CPU1
>>> sees dispatch list or setting of bitmap of software queue. Otherwise, either CPU
>>> will not re-run the hardware queue causing starvation.
>>
>> Yeah, it is one kind of race which is triggered when adding request into
>> ->dispatch list after returning STS_RESOURCE. We were troubled by lots of
>> such kind of race.
> 
> Yes. I saw the similar fix for BLK_MQ_S_SCHED_RESTART.
> 
>>
>> stopping queue is used in very less drivers, and its only purpose should
>> be for throttling hw queue in case that low level queue is busy. There seems
>> more uses of blk_mq_stop_hw_queues(), but most of them should be replaced
>> with blk_mq_quiesce_queue().
>>
>> IMO, fixing this kind of issue via memory barrier is too tricky to
>> maintain cause WRITE/READ dependency is very hard to follow. I'd suggest to
>> make memory barrier solution as the last resort, and we can try to figure
>> out other easier & more reliable way first.
> 
> I do agree it is hard to maintain the dependencies in the future. We should
> propose an easy-maintainable solution. But I thought it is a long-term issue
> throughout different stable linux distros. Adding a mb is the easy way to fix
> the problem (the code footprint is really small), so it will be very
> easy for others
> to backport those bug fixes to different stable linux distros. Therefore, mb
> should be an interim solution. Then, we could improve it based on the solution
> you've proposed below. What do you think?

I'll agree with Ming, let's figure out a better fix first. Easy to 
backport to stables is not first consideration.
> 
> Thanks,
> Muchun.
> 
>>
>> One idea I thought of is to call blk_mq_request_bypass_insert()(or rename
>> & export it) before calling blk_mq_stop_hw_queue() in driver, then
>> return new status code STS_STOP_DISPATCH for notifying blk-mq to stop
>> dispatch simply.

New status code look good to me, however, I wonder can we just remove
the problematic blk_mq_stop_hw_queue(), and replace it by handling the
new status from block layer?

- Passing the new status to blk_mq_run_dispatch_ops, and quiesce with
the new status, if no request is inflight, unquiesce immediately;
- unquiesce is any IO is done afterwards;

Thanks,
Kuai

>>
>>
>> thanks,
>> Ming
>>
> 
> .
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/4] block: fix fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests to hctx->dispatch
  2024-08-11 10:19 ` [PATCH 4/4] block: fix fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests to hctx->dispatch Muchun Song
@ 2024-08-23 11:27   ` Ming Lei
  2024-08-26  7:06     ` Muchun Song
  0 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2024-08-23 11:27 UTC (permalink / raw)
  To: Muchun Song; +Cc: axboe, linux-block, linux-kernel, ming.lei

On Sun, Aug 11, 2024 at 06:19:21PM +0800, Muchun Song wrote:
> Supposing the following scenario.
> 
> CPU0                                                                CPU1
> 
> blk_mq_request_issue_directly()                                     blk_mq_unquiesce_queue()
>     if (blk_queue_quiesced())                                           blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)   3) store
>         blk_mq_insert_request()                                         blk_mq_run_hw_queues()
>             /*                                                              blk_mq_run_hw_queue()
>              * Add request to dispatch list or set bitmap of                    if (!blk_mq_hctx_has_pending())     4) load
>              * software queue.                  1) store                            return
>              */
>         blk_mq_run_hw_queue()
>             if (blk_queue_quiesced())           2) load
>                 return
>             blk_mq_sched_dispatch_requests()
> 
> The full memory barrier should be inserted between 1) and 2), as well as
> between 3) and 4) to make sure that either CPU0 sees QUEUE_FLAG_QUIESCED is
> cleared or CPU1 sees dispatch list or setting of bitmap of software queue.
> Otherwise, either CPU will not re-run the hardware queue causing starvation.

Memory barrier shouldn't serve as bug fix for two slow code paths.

One simple fix is to add helper of blk_queue_quiesced_lock(), and
call the following check on CPU0:

	if (blk_queue_quiesced_lock())
         blk_mq_run_hw_queue();


thanks,
Ming


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Re: [PATCH 4/4] block: fix fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests to hctx->dispatch
  2024-08-23 11:27   ` Ming Lei
@ 2024-08-26  7:06     ` Muchun Song
  2024-08-26  7:33       ` Muchun Song
  0 siblings, 1 reply; 21+ messages in thread
From: Muchun Song @ 2024-08-26  7:06 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, open list:BLOCK LAYER, LKML, muchun.song

On Fri, Aug 23, 2024 at 7:28 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Sun, Aug 11, 2024 at 06:19:21 PM +0800, Muchun Song wrote:
> > Supposing the following scenario.
> >
> > CPU0                                                                CPU1
> >
> > blk_mq_request_issue_directly()                                     blk_mq_unquiesce_queue()
> >     if (blk_queue_quiesced())                                           blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)   3) store
> >         blk_mq_insert_request()                                         blk_mq_run_hw_queues()
> >             /*                                                              blk_mq_run_hw_queue()
> >              * Add request to dispatch list or set bitmap of                    if (!blk_mq_hctx_has_pending())     4) load
> >              * software queue.                  1) store                            return
> >              */
> >         blk_mq_run_hw_queue()
> >             if (blk_queue_quiesced())           2) load
> >                 return
> >             blk_mq_sched_dispatch_requests()
> >
> > The full memory barrier should be inserted between 1) and 2), as well as
> > between 3) and 4) to make sure that either CPU0 sees QUEUE_FLAG_QUIESCED is
> > cleared or CPU1 sees dispatch list or setting of bitmap of software queue.
> > Otherwise, either CPU will not re-run the hardware queue causing starvation.
>
> Memory barrier shouldn't serve as bug fix for two slow code paths.
>
> One simple fix is to add helper of blk_queue_quiesced_lock(), and
> call the following check on CPU0:
>
>         if (blk_queue_quiesced_lock())
>          blk_mq_run_hw_queue();

This only fixes blk_mq_request_issue_directly(), I think anywhere that
matching this
pattern (inserting a request to dispatch list and then running the
hardware queue)
should be fixed. And I think there are many places which match this
pattern (E.g.
blk_mq_submit_bio()). The above graph should be adjusted to the following.

CPU0                                        CPU1

blk_mq_insert_request()         1) store    blk_mq_unquiesce_queue()
blk_mq_run_hw_queue()
blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)       3) store
    if (blk_queue_quiesced())   2) load         blk_mq_run_hw_queues()
        return                                      blk_mq_run_hw_queue()
    blk_mq_sched_dispatch_requests()                    if
(!blk_mq_hctx_has_pending())     4) load
                                                            return

So I think fixing blk_mq_run_hw_queue() could cover all of the situations.
Maybe I thought wrongly. Please correct me.

Muchun,
Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/4] block: fix fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests to hctx->dispatch
  2024-08-26  7:06     ` Muchun Song
@ 2024-08-26  7:33       ` Muchun Song
  2024-08-26  9:20         ` Ming Lei
  0 siblings, 1 reply; 21+ messages in thread
From: Muchun Song @ 2024-08-26  7:33 UTC (permalink / raw)
  To: Muchun Song; +Cc: Ming Lei, Jens Axboe, open list:BLOCK LAYER, LKML



> On Aug 26, 2024, at 15:06, Muchun Song <songmuchun@bytedance.com> wrote:
> 
> On Fri, Aug 23, 2024 at 7:28 PM Ming Lei <ming.lei@redhat.com> wrote:
>> 
>> On Sun, Aug 11, 2024 at 06:19:21 PM +0800, Muchun Song wrote:
>>> Supposing the following scenario.
>>> 
>>> CPU0                                                                CPU1
>>> 
>>> blk_mq_request_issue_directly()                                     blk_mq_unquiesce_queue()
>>>    if (blk_queue_quiesced())                                           blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)   3) store
>>>        blk_mq_insert_request()                                         blk_mq_run_hw_queues()
>>>            /*                                                              blk_mq_run_hw_queue()
>>>             * Add request to dispatch list or set bitmap of                    if (!blk_mq_hctx_has_pending())     4) load
>>>             * software queue.                  1) store                            return
>>>             */
>>>        blk_mq_run_hw_queue()
>>>            if (blk_queue_quiesced())           2) load
>>>                return
>>>            blk_mq_sched_dispatch_requests()
>>> 
>>> The full memory barrier should be inserted between 1) and 2), as well as
>>> between 3) and 4) to make sure that either CPU0 sees QUEUE_FLAG_QUIESCED is
>>> cleared or CPU1 sees dispatch list or setting of bitmap of software queue.
>>> Otherwise, either CPU will not re-run the hardware queue causing starvation.
>> 
>> Memory barrier shouldn't serve as bug fix for two slow code paths.
>> 
>> One simple fix is to add helper of blk_queue_quiesced_lock(), and
>> call the following check on CPU0:
>> 
>>        if (blk_queue_quiesced_lock())
>>         blk_mq_run_hw_queue();
> 
> This only fixes blk_mq_request_issue_directly(), I think anywhere that
> matching this
> pattern (inserting a request to dispatch list and then running the
> hardware queue)
> should be fixed. And I think there are many places which match this
> pattern (E.g.
> blk_mq_submit_bio()). The above graph should be adjusted to the following.
> 
> CPU0                                        CPU1
> 
> blk_mq_insert_request()         1) store    blk_mq_unquiesce_queue()
> blk_mq_run_hw_queue()
> blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)       3) store
>    if (blk_queue_quiesced())   2) load         blk_mq_run_hw_queues()
>        return                                      blk_mq_run_hw_queue()
>    blk_mq_sched_dispatch_requests()                    if
> (!blk_mq_hctx_has_pending())     4) load
>                                                            return

Sorry. There is something wrong with my email client. Resend the graph.

CPU0                                        CPU1

blk_mq_insert_request()         1) store    blk_mq_unquiesce_queue()
blk_mq_run_hw_queue()                       blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)       3) store
    if (blk_queue_quiesced())   2) load         blk_mq_run_hw_queues()
        return                                      blk_mq_run_hw_queue()
    blk_mq_sched_dispatch_requests()                    if (!blk_mq_hctx_has_pending())     4) load
                                                            return

> 
> So I think fixing blk_mq_run_hw_queue() could cover all of the situations.
> Maybe I thought wrongly. Please correct me.
> 
> Muchun,
> Thanks.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/4] block: fix ordering between checking BLK_MQ_S_STOPPED and adding requests to hctx->dispatch
  2024-08-22  3:54       ` Yu Kuai
@ 2024-08-26  8:35         ` Muchun Song
  2024-08-26  8:53           ` Yu Kuai
  0 siblings, 1 reply; 21+ messages in thread
From: Muchun Song @ 2024-08-26  8:35 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Muchun Song, Ming Lei, Jens Axboe, open list:BLOCK LAYER, LKML,
	yukuai (C)



> On Aug 22, 2024, at 11:54, Yu Kuai <yukuai1@huaweicloud.com> wrote:
> 
> Hi,
> 
> 在 2024/08/19 11:49, Muchun Song 写道:
>> On Mon, Aug 19, 2024 at 10:28 AM Ming Lei <ming.lei@redhat.com> wrote:
>>> 
>>> Hi Muchun,
>>> 
>>> On Sun, Aug 11, 2024 at 06:19:19PM +0800, Muchun Song wrote:
>>>> Supposing the following scenario with a virtio_blk driver.
>>>> 
>>>> CPU0                                                                CPU1
>>>> 
>>>> blk_mq_try_issue_directly()
>>>>     __blk_mq_issue_directly()
>>>>         q->mq_ops->queue_rq()
>>>>             virtio_queue_rq()
>>>>                 blk_mq_stop_hw_queue()
>>>>                                                                     virtblk_done()
>>>>     blk_mq_request_bypass_insert()                                      blk_mq_start_stopped_hw_queues()
>>>>         /* Add IO request to dispatch list */   1) store                    blk_mq_start_stopped_hw_queue()
>>>>                                                                                 clear_bit(BLK_MQ_S_STOPPED)                 3) store
>>>>     blk_mq_run_hw_queue()                                                       blk_mq_run_hw_queue()
>>>>         if (!blk_mq_hctx_has_pending())                                             if (!blk_mq_hctx_has_pending())         4) load
>>>>             return                                                                      return
>>>>         blk_mq_sched_dispatch_requests()                                            blk_mq_sched_dispatch_requests()
>>>>             if (blk_mq_hctx_stopped())          2) load                                 if (blk_mq_hctx_stopped())
>>>>                 return                                                                      return
>>>>             __blk_mq_sched_dispatch_requests()                                          __blk_mq_sched_dispatch_requests()
>>>> 
>>>> The full memory barrier should be inserted between 1) and 2), as well as between
>>>> 3) and 4) to make sure that either CPU0 sees BLK_MQ_S_STOPPED is cleared or CPU1
>>>> sees dispatch list or setting of bitmap of software queue. Otherwise, either CPU
>>>> will not re-run the hardware queue causing starvation.
>>> 
>>> Yeah, it is one kind of race which is triggered when adding request into
>>> ->dispatch list after returning STS_RESOURCE. We were troubled by lots of
>>> such kind of race.
>> Yes. I saw the similar fix for BLK_MQ_S_SCHED_RESTART.
>>> 
>>> stopping queue is used in very less drivers, and its only purpose should
>>> be for throttling hw queue in case that low level queue is busy. There seems
>>> more uses of blk_mq_stop_hw_queues(), but most of them should be replaced
>>> with blk_mq_quiesce_queue().
>>> 
>>> IMO, fixing this kind of issue via memory barrier is too tricky to
>>> maintain cause WRITE/READ dependency is very hard to follow. I'd suggest to
>>> make memory barrier solution as the last resort, and we can try to figure
>>> out other easier & more reliable way first.
>> I do agree it is hard to maintain the dependencies in the future. We should
>> propose an easy-maintainable solution. But I thought it is a long-term issue
>> throughout different stable linux distros. Adding a mb is the easy way to fix
>> the problem (the code footprint is really small), so it will be very
>> easy for others
>> to backport those bug fixes to different stable linux distros. Therefore, mb
>> should be an interim solution. Then, we could improve it based on the solution
>> you've proposed below. What do you think?
> 
> I'll agree with Ming, let's figure out a better fix first. Easy to backport to stables is not first consideration.

Hi Kuai,

All right. I usually focus on MM, it seems there is a gap between MM and BLock.
Anyway, let's figure out if there is any good solution.

>> Thanks,
>> Muchun.
>>> 
>>> One idea I thought of is to call blk_mq_request_bypass_insert()(or rename
>>> & export it) before calling blk_mq_stop_hw_queue() in driver, then
>>> return new status code STS_STOP_DISPATCH for notifying blk-mq to stop
>>> dispatch simply.
> 
> New status code look good to me, however, I wonder can we just remove
> the problematic blk_mq_stop_hw_queue(), and replace it by handling the
> new status from block layer?
> 
> - Passing the new status to blk_mq_run_dispatch_ops, and quiesce with

I didn't fully understand your suggestion. Let me ask some questions.
blk_mq_stop_hw_queue() is usually called in blk_mq_ops->queue_rq path,
it'll be easy for this case to pass the new status to blk_mq_run_dispatch_ops.
Should we remove blk_mq_stop_hw_queues() as well? How to pass the new
status to blk_mq_run_dispatch_ops in this case?

> the new status, if no request is inflight, unquiesce immediately;

Actually, I didn't understand how to avoid the above race. May you elaborate
the scenario?

Muhcun,
Thanks.

> - unquiesce is any IO is done afterwards;





^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/4] block: fix ordering between checking BLK_MQ_S_STOPPED and adding requests to hctx->dispatch
  2024-08-26  8:35         ` Muchun Song
@ 2024-08-26  8:53           ` Yu Kuai
  2024-08-27  7:31             ` Muchun Song
  0 siblings, 1 reply; 21+ messages in thread
From: Yu Kuai @ 2024-08-26  8:53 UTC (permalink / raw)
  To: Muchun Song, Yu Kuai
  Cc: Muchun Song, Ming Lei, Jens Axboe, open list:BLOCK LAYER, LKML,
	yukuai (C)

Hi,

在 2024/08/26 16:35, Muchun Song 写道:
> 
> 
>> On Aug 22, 2024, at 11:54, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> 在 2024/08/19 11:49, Muchun Song 写道:
>>> On Mon, Aug 19, 2024 at 10:28 AM Ming Lei <ming.lei@redhat.com> wrote:
>>>>
>>>> Hi Muchun,
>>>>
>>>> On Sun, Aug 11, 2024 at 06:19:19PM +0800, Muchun Song wrote:
>>>>> Supposing the following scenario with a virtio_blk driver.
>>>>>
>>>>> CPU0                                                                CPU1
>>>>>
>>>>> blk_mq_try_issue_directly()
>>>>>      __blk_mq_issue_directly()
>>>>>          q->mq_ops->queue_rq()
>>>>>              virtio_queue_rq()
>>>>>                  blk_mq_stop_hw_queue()
>>>>>                                                                      virtblk_done()
>>>>>      blk_mq_request_bypass_insert()                                      blk_mq_start_stopped_hw_queues()
>>>>>          /* Add IO request to dispatch list */   1) store                    blk_mq_start_stopped_hw_queue()
>>>>>                                                                                  clear_bit(BLK_MQ_S_STOPPED)                 3) store
>>>>>      blk_mq_run_hw_queue()                                                       blk_mq_run_hw_queue()
>>>>>          if (!blk_mq_hctx_has_pending())                                             if (!blk_mq_hctx_has_pending())         4) load
>>>>>              return                                                                      return
>>>>>          blk_mq_sched_dispatch_requests()                                            blk_mq_sched_dispatch_requests()
>>>>>              if (blk_mq_hctx_stopped())          2) load                                 if (blk_mq_hctx_stopped())
>>>>>                  return                                                                      return
>>>>>              __blk_mq_sched_dispatch_requests()                                          __blk_mq_sched_dispatch_requests()
>>>>>
>>>>> The full memory barrier should be inserted between 1) and 2), as well as between
>>>>> 3) and 4) to make sure that either CPU0 sees BLK_MQ_S_STOPPED is cleared or CPU1
>>>>> sees dispatch list or setting of bitmap of software queue. Otherwise, either CPU
>>>>> will not re-run the hardware queue causing starvation.
>>>>
>>>> Yeah, it is one kind of race which is triggered when adding request into
>>>> ->dispatch list after returning STS_RESOURCE. We were troubled by lots of
>>>> such kind of race.
>>> Yes. I saw the similar fix for BLK_MQ_S_SCHED_RESTART.
>>>>
>>>> stopping queue is used in very less drivers, and its only purpose should
>>>> be for throttling hw queue in case that low level queue is busy. There seems
>>>> more uses of blk_mq_stop_hw_queues(), but most of them should be replaced
>>>> with blk_mq_quiesce_queue().
>>>>
>>>> IMO, fixing this kind of issue via memory barrier is too tricky to
>>>> maintain cause WRITE/READ dependency is very hard to follow. I'd suggest to
>>>> make memory barrier solution as the last resort, and we can try to figure
>>>> out other easier & more reliable way first.
>>> I do agree it is hard to maintain the dependencies in the future. We should
>>> propose an easy-maintainable solution. But I thought it is a long-term issue
>>> throughout different stable linux distros. Adding a mb is the easy way to fix
>>> the problem (the code footprint is really small), so it will be very
>>> easy for others
>>> to backport those bug fixes to different stable linux distros. Therefore, mb
>>> should be an interim solution. Then, we could improve it based on the solution
>>> you've proposed below. What do you think?
>>
>> I'll agree with Ming, let's figure out a better fix first. Easy to backport to stables is not first consideration.
> 
> Hi Kuai,
> 
> All right. I usually focus on MM, it seems there is a gap between MM and BLock.
> Anyway, let's figure out if there is any good solution.
> 
>>> Thanks,
>>> Muchun.
>>>>
>>>> One idea I thought of is to call blk_mq_request_bypass_insert()(or rename
>>>> & export it) before calling blk_mq_stop_hw_queue() in driver, then
>>>> return new status code STS_STOP_DISPATCH for notifying blk-mq to stop
>>>> dispatch simply.
>>
>> New status code look good to me, however, I wonder can we just remove
>> the problematic blk_mq_stop_hw_queue(), and replace it by handling the
>> new status from block layer?
>>
>> - Passing the new status to blk_mq_run_dispatch_ops, and quiesce with
> 
> I didn't fully understand your suggestion. Let me ask some questions.
> blk_mq_stop_hw_queue() is usually called in blk_mq_ops->queue_rq path,
> it'll be easy for this case to pass the new status to blk_mq_run_dispatch_ops.
> Should we remove blk_mq_stop_hw_queues() as well? How to pass the new
> status to blk_mq_run_dispatch_ops in this case?

For queue_rq from dispatch path, it can be removed. However, it is
called from remove path as well, I don't check yet if it can be removed
there, that's another story.

And just add a return value for dispatch_ops to pass status.

Thanks,
Kuai

> 
>> the new status, if no request is inflight, unquiesce immediately;
> 
> Actually, I didn't understand how to avoid the above race. May you elaborate
> the scenario?
> 
> Muhcun,
> Thanks.
> 
>> - unquiesce is any IO is done afterwards;
> 
> 
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/4] block: fix fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests to hctx->dispatch
  2024-08-26  7:33       ` Muchun Song
@ 2024-08-26  9:20         ` Ming Lei
  2024-08-27  7:24           ` Muchun Song
  0 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2024-08-26  9:20 UTC (permalink / raw)
  To: Muchun Song
  Cc: Muchun Song, Jens Axboe, open list:BLOCK LAYER, LKML, ming.lei

On Mon, Aug 26, 2024 at 03:33:18PM +0800, Muchun Song wrote:
> 
> 
> > On Aug 26, 2024, at 15:06, Muchun Song <songmuchun@bytedance.com> wrote:
> > 
> > On Fri, Aug 23, 2024 at 7:28 PM Ming Lei <ming.lei@redhat.com> wrote:
> >> 
> >> On Sun, Aug 11, 2024 at 06:19:21 PM +0800, Muchun Song wrote:
> >>> Supposing the following scenario.
> >>> 
> >>> CPU0                                                                CPU1
> >>> 
> >>> blk_mq_request_issue_directly()                                     blk_mq_unquiesce_queue()
> >>>    if (blk_queue_quiesced())                                           blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)   3) store
> >>>        blk_mq_insert_request()                                         blk_mq_run_hw_queues()
> >>>            /*                                                              blk_mq_run_hw_queue()
> >>>             * Add request to dispatch list or set bitmap of                    if (!blk_mq_hctx_has_pending())     4) load
> >>>             * software queue.                  1) store                            return
> >>>             */
> >>>        blk_mq_run_hw_queue()
> >>>            if (blk_queue_quiesced())           2) load
> >>>                return
> >>>            blk_mq_sched_dispatch_requests()
> >>> 
> >>> The full memory barrier should be inserted between 1) and 2), as well as
> >>> between 3) and 4) to make sure that either CPU0 sees QUEUE_FLAG_QUIESCED is
> >>> cleared or CPU1 sees dispatch list or setting of bitmap of software queue.
> >>> Otherwise, either CPU will not re-run the hardware queue causing starvation.
> >> 
> >> Memory barrier shouldn't serve as bug fix for two slow code paths.
> >> 
> >> One simple fix is to add helper of blk_queue_quiesced_lock(), and
> >> call the following check on CPU0:
> >> 
> >>        if (blk_queue_quiesced_lock())
> >>         blk_mq_run_hw_queue();
> > 
> > This only fixes blk_mq_request_issue_directly(), I think anywhere that
> > matching this
> > pattern (inserting a request to dispatch list and then running the
> > hardware queue)
> > should be fixed. And I think there are many places which match this
> > pattern (E.g.
> > blk_mq_submit_bio()). The above graph should be adjusted to the following.
> > 
> > CPU0                                        CPU1
> > 
> > blk_mq_insert_request()         1) store    blk_mq_unquiesce_queue()
> > blk_mq_run_hw_queue()
> > blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)       3) store
> >    if (blk_queue_quiesced())   2) load         blk_mq_run_hw_queues()
> >        return                                      blk_mq_run_hw_queue()
> >    blk_mq_sched_dispatch_requests()                    if
> > (!blk_mq_hctx_has_pending())     4) load
> >                                                            return
> 
> Sorry. There is something wrong with my email client. Resend the graph.
> 
> CPU0                                        CPU1
> 
> blk_mq_insert_request()         1) store    blk_mq_unquiesce_queue()
> blk_mq_run_hw_queue()                       blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)       3) store
>     if (blk_queue_quiesced())   2) load         blk_mq_run_hw_queues()
>         return                                      blk_mq_run_hw_queue()
>     blk_mq_sched_dispatch_requests()                    if (!blk_mq_hctx_has_pending())     4) load
>                                                             return

OK.

The issue shouldn't exist if blk_queue_quiesced() return false in
blk_mq_run_hw_queue(), so it is still one race in two slow paths?

I guess the barrier-less approach should work too, such as:


diff --git a/block/blk-mq.c b/block/blk-mq.c
index e3c3c0c21b55..632261982a77 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2202,6 +2202,12 @@ void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
 }
 EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
 
+static inline bool blk_mq_hw_queue_need_run(struct blk_mq_hw_ctx *hctx)
+{
+	return !blk_queue_quiesced(hctx->queue) &&
+		blk_mq_hctx_has_pending(hctx);
+}
+
 /**
  * blk_mq_run_hw_queue - Start to run a hardware queue.
  * @hctx: Pointer to the hardware queue to run.
@@ -2231,11 +2237,19 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 	 * quiesced.
 	 */
 	__blk_mq_run_dispatch_ops(hctx->queue, false,
-		need_run = !blk_queue_quiesced(hctx->queue) &&
-		blk_mq_hctx_has_pending(hctx));
+		need_run = blk_mq_hw_queue_need_run(hctx));
 
-	if (!need_run)
-		return;
+	if (!need_run) {
+		unsigned long flags;
+
+		/* sync with unquiesce */
+		spin_lock_irqsave(&hctx->queue->queue_lock, flags);
+		need_run = blk_mq_hw_queue_need_run(hctx);
+		spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
+
+		if (!need_run)
+			return;
+	}
 
 	if (async || !cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)) {
 		blk_mq_delay_run_hw_queue(hctx, 0);


thanks,
Ming


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/4] block: fix fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests to hctx->dispatch
  2024-08-26  9:20         ` Ming Lei
@ 2024-08-27  7:24           ` Muchun Song
  2024-08-27  8:16             ` Muchun Song
  0 siblings, 1 reply; 21+ messages in thread
From: Muchun Song @ 2024-08-27  7:24 UTC (permalink / raw)
  To: Ming Lei; +Cc: Muchun Song, Jens Axboe, open list:BLOCK LAYER, LKML



> On Aug 26, 2024, at 17:20, Ming Lei <ming.lei@redhat.com> wrote:
> 
> On Mon, Aug 26, 2024 at 03:33:18PM +0800, Muchun Song wrote:
>> 
>> 
>>> On Aug 26, 2024, at 15:06, Muchun Song <songmuchun@bytedance.com> wrote:
>>> 
>>> On Fri, Aug 23, 2024 at 7:28 PM Ming Lei <ming.lei@redhat.com> wrote:
>>>> 
>>>> On Sun, Aug 11, 2024 at 06:19:21 PM +0800, Muchun Song wrote:
>>>>> Supposing the following scenario.
>>>>> 
>>>>> CPU0                                                                CPU1
>>>>> 
>>>>> blk_mq_request_issue_directly()                                     blk_mq_unquiesce_queue()
>>>>>   if (blk_queue_quiesced())                                           blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)   3) store
>>>>>       blk_mq_insert_request()                                         blk_mq_run_hw_queues()
>>>>>           /*                                                              blk_mq_run_hw_queue()
>>>>>            * Add request to dispatch list or set bitmap of                    if (!blk_mq_hctx_has_pending())     4) load
>>>>>            * software queue.                  1) store                            return
>>>>>            */
>>>>>       blk_mq_run_hw_queue()
>>>>>           if (blk_queue_quiesced())           2) load
>>>>>               return
>>>>>           blk_mq_sched_dispatch_requests()
>>>>> 
>>>>> The full memory barrier should be inserted between 1) and 2), as well as
>>>>> between 3) and 4) to make sure that either CPU0 sees QUEUE_FLAG_QUIESCED is
>>>>> cleared or CPU1 sees dispatch list or setting of bitmap of software queue.
>>>>> Otherwise, either CPU will not re-run the hardware queue causing starvation.
>>>> 
>>>> Memory barrier shouldn't serve as bug fix for two slow code paths.
>>>> 
>>>> One simple fix is to add helper of blk_queue_quiesced_lock(), and
>>>> call the following check on CPU0:
>>>> 
>>>>       if (blk_queue_quiesced_lock())
>>>>        blk_mq_run_hw_queue();
>>> 
>>> This only fixes blk_mq_request_issue_directly(), I think anywhere that
>>> matching this
>>> pattern (inserting a request to dispatch list and then running the
>>> hardware queue)
>>> should be fixed. And I think there are many places which match this
>>> pattern (E.g.
>>> blk_mq_submit_bio()). The above graph should be adjusted to the following.
>>> 
>>> CPU0                                        CPU1
>>> 
>>> blk_mq_insert_request()         1) store    blk_mq_unquiesce_queue()
>>> blk_mq_run_hw_queue()
>>> blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)       3) store
>>>   if (blk_queue_quiesced())   2) load         blk_mq_run_hw_queues()
>>>       return                                      blk_mq_run_hw_queue()
>>>   blk_mq_sched_dispatch_requests()                    if
>>> (!blk_mq_hctx_has_pending())     4) load
>>>                                                           return
>> 
>> Sorry. There is something wrong with my email client. Resend the graph.
>> 
>> CPU0                                        CPU1
>> 
>> blk_mq_insert_request()         1) store    blk_mq_unquiesce_queue()
>> blk_mq_run_hw_queue()                       blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)       3) store
>>    if (blk_queue_quiesced())   2) load         blk_mq_run_hw_queues()
>>        return                                      blk_mq_run_hw_queue()
>>    blk_mq_sched_dispatch_requests()                    if (!blk_mq_hctx_has_pending())     4) load
>>                                                            return
> 
> OK.
> 
> The issue shouldn't exist if blk_queue_quiesced() return false in
> blk_mq_run_hw_queue(), so it is still one race in two slow paths?
> 
> I guess the barrier-less approach should work too, such as:
> 

If we prefer barrier-less approach, I think the following solution
will work as well, I'll use it in v2. Thanks.

> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e3c3c0c21b55..632261982a77 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2202,6 +2202,12 @@ void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
> }
> EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
> 
> +static inline bool blk_mq_hw_queue_need_run(struct blk_mq_hw_ctx *hctx)
> +{
> + 	return !blk_queue_quiesced(hctx->queue) &&
> + 		blk_mq_hctx_has_pending(hctx);
> +}
> +
> /**
>  * blk_mq_run_hw_queue - Start to run a hardware queue.
>  * @hctx: Pointer to the hardware queue to run.
> @@ -2231,11 +2237,19 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
>  * quiesced.
>  */
> 	__blk_mq_run_dispatch_ops(hctx->queue, false,
> - 		need_run = !blk_queue_quiesced(hctx->queue) &&
> - 		blk_mq_hctx_has_pending(hctx));
> + 		need_run = blk_mq_hw_queue_need_run(hctx));
> 
> - 	if (!need_run)
> - 		return;
> + 	if (!need_run) {
> + 		unsigned long flags;
> +
> + 		/* sync with unquiesce */
> + 		spin_lock_irqsave(&hctx->queue->queue_lock, flags);
> + 		need_run = blk_mq_hw_queue_need_run(hctx);

One question here: should we use __blk_mq_run_dispatch_ops()? I saw a comment above.
It seems it is safe to call blk_mq_hw_queue_need_run under [s]rcu lock.

Thanks.

> + 		spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
> +
> + 		if (!need_run)
> + 			return;
> + 	}
> 
> 	if (async || !cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)) {
> 		blk_mq_delay_run_hw_queue(hctx, 0);
> 
> 
> thanks,
> Ming



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/4] block: fix ordering between checking BLK_MQ_S_STOPPED and adding requests to hctx->dispatch
  2024-08-26  8:53           ` Yu Kuai
@ 2024-08-27  7:31             ` Muchun Song
  2024-08-29  7:57               ` Yu Kuai
  0 siblings, 1 reply; 21+ messages in thread
From: Muchun Song @ 2024-08-27  7:31 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Muchun Song, Ming Lei, Jens Axboe, open list:BLOCK LAYER, LKML,
	yukuai (C)



> On Aug 26, 2024, at 16:53, Yu Kuai <yukuai1@huaweicloud.com> wrote:
> 
> Hi,
> 
> 在 2024/08/26 16:35, Muchun Song 写道:
>>> On Aug 22, 2024, at 11:54, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>> 
>>> Hi,
>>> 
>>> 在 2024/08/19 11:49, Muchun Song 写道:
>>>> On Mon, Aug 19, 2024 at 10:28 AM Ming Lei <ming.lei@redhat.com> wrote:
>>>>> 
>>>>> Hi Muchun,
>>>>> 
>>>>> On Sun, Aug 11, 2024 at 06:19:19PM +0800, Muchun Song wrote:
>>>>>> Supposing the following scenario with a virtio_blk driver.
>>>>>> 
>>>>>> CPU0                                                                CPU1
>>>>>> 
>>>>>> blk_mq_try_issue_directly()
>>>>>>     __blk_mq_issue_directly()
>>>>>>         q->mq_ops->queue_rq()
>>>>>>             virtio_queue_rq()
>>>>>>                 blk_mq_stop_hw_queue()
>>>>>>                                                                     virtblk_done()
>>>>>>     blk_mq_request_bypass_insert()                                      blk_mq_start_stopped_hw_queues()
>>>>>>         /* Add IO request to dispatch list */   1) store                    blk_mq_start_stopped_hw_queue()
>>>>>>                                                                                 clear_bit(BLK_MQ_S_STOPPED)                 3) store
>>>>>>     blk_mq_run_hw_queue()                                                       blk_mq_run_hw_queue()
>>>>>>         if (!blk_mq_hctx_has_pending())                                             if (!blk_mq_hctx_has_pending())         4) load
>>>>>>             return                                                                      return
>>>>>>         blk_mq_sched_dispatch_requests()                                            blk_mq_sched_dispatch_requests()
>>>>>>             if (blk_mq_hctx_stopped())          2) load                                 if (blk_mq_hctx_stopped())
>>>>>>                 return                                                                      return
>>>>>>             __blk_mq_sched_dispatch_requests()                                          __blk_mq_sched_dispatch_requests()
>>>>>> 
>>>>>> The full memory barrier should be inserted between 1) and 2), as well as between
>>>>>> 3) and 4) to make sure that either CPU0 sees BLK_MQ_S_STOPPED is cleared or CPU1
>>>>>> sees dispatch list or setting of bitmap of software queue. Otherwise, either CPU
>>>>>> will not re-run the hardware queue causing starvation.
>>>>> 
>>>>> Yeah, it is one kind of race which is triggered when adding request into
>>>>> ->dispatch list after returning STS_RESOURCE. We were troubled by lots of
>>>>> such kind of race.
>>>> Yes. I saw the similar fix for BLK_MQ_S_SCHED_RESTART.
>>>>> 
>>>>> stopping queue is used in very less drivers, and its only purpose should
>>>>> be for throttling hw queue in case that low level queue is busy. There seems
>>>>> more uses of blk_mq_stop_hw_queues(), but most of them should be replaced
>>>>> with blk_mq_quiesce_queue().
>>>>> 
>>>>> IMO, fixing this kind of issue via memory barrier is too tricky to
>>>>> maintain cause WRITE/READ dependency is very hard to follow. I'd suggest to
>>>>> make memory barrier solution as the last resort, and we can try to figure
>>>>> out other easier & more reliable way first.
>>>> I do agree it is hard to maintain the dependencies in the future. We should
>>>> propose an easy-maintainable solution. But I thought it is a long-term issue
>>>> throughout different stable linux distros. Adding a mb is the easy way to fix
>>>> the problem (the code footprint is really small), so it will be very
>>>> easy for others
>>>> to backport those bug fixes to different stable linux distros. Therefore, mb
>>>> should be an interim solution. Then, we could improve it based on the solution
>>>> you've proposed below. What do you think?
>>> 
>>> I'll agree with Ming, let's figure out a better fix first. Easy to backport to stables is not first consideration.
>> Hi Kuai,
>> All right. I usually focus on MM, it seems there is a gap between MM and BLock.
>> Anyway, let's figure out if there is any good solution.
>>>> Thanks,
>>>> Muchun.
>>>>> 
>>>>> One idea I thought of is to call blk_mq_request_bypass_insert()(or rename
>>>>> & export it) before calling blk_mq_stop_hw_queue() in driver, then
>>>>> return new status code STS_STOP_DISPATCH for notifying blk-mq to stop
>>>>> dispatch simply.
>>> 
>>> New status code look good to me, however, I wonder can we just remove
>>> the problematic blk_mq_stop_hw_queue(), and replace it by handling the
>>> new status from block layer?
>>> 
>>> - Passing the new status to blk_mq_run_dispatch_ops, and quiesce with
>> I didn't fully understand your suggestion. Let me ask some questions.
>> blk_mq_stop_hw_queue() is usually called in blk_mq_ops->queue_rq path,
>> it'll be easy for this case to pass the new status to blk_mq_run_dispatch_ops.
>> Should we remove blk_mq_stop_hw_queues() as well? How to pass the new
>> status to blk_mq_run_dispatch_ops in this case?
> 
> For queue_rq from dispatch path, it can be removed. However, it is
> called from remove path as well, I don't check yet if it can be removed
> there, that's another story.

The reason why I asked this question is that blk_mq_stop_hw_queues() also needs
to be fixed. See my patch 3.

> 
> And just add a return value for dispatch_ops to pass status.
> 
> Thanks,
> Kuai
> 
>>> the new status, if no request is inflight, unquiesce immediately;
>> Actually, I didn't understand how to avoid the above race. May you elaborate
>> the scenario?

Sorry for repeating, I didn't get your point here. May you elaborate
your suggestion? Thanks very much.

>> Muhcun,
>> Thanks.
>>> - unquiesce is any IO is done afterwards;
>> .



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/4] block: fix fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests to hctx->dispatch
  2024-08-27  7:24           ` Muchun Song
@ 2024-08-27  8:16             ` Muchun Song
  2024-08-29  2:51               ` Ming Lei
  0 siblings, 1 reply; 21+ messages in thread
From: Muchun Song @ 2024-08-27  8:16 UTC (permalink / raw)
  To: Ming Lei; +Cc: Muchun Song, Jens Axboe, open list:BLOCK LAYER, LKML



> On Aug 27, 2024, at 15:24, Muchun Song <muchun.song@linux.dev> wrote:
> 
> 
> 
>> On Aug 26, 2024, at 17:20, Ming Lei <ming.lei@redhat.com> wrote:
>> 
>> On Mon, Aug 26, 2024 at 03:33:18PM +0800, Muchun Song wrote:
>>> 
>>> 
>>>> On Aug 26, 2024, at 15:06, Muchun Song <songmuchun@bytedance.com> wrote:
>>>> 
>>>> On Fri, Aug 23, 2024 at 7:28 PM Ming Lei <ming.lei@redhat.com> wrote:
>>>>> 
>>>>> On Sun, Aug 11, 2024 at 06:19:21 PM +0800, Muchun Song wrote:
>>>>>> Supposing the following scenario.
>>>>>> 
>>>>>> CPU0                                                                CPU1
>>>>>> 
>>>>>> blk_mq_request_issue_directly()                                     blk_mq_unquiesce_queue()
>>>>>>  if (blk_queue_quiesced())                                           blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)   3) store
>>>>>>      blk_mq_insert_request()                                         blk_mq_run_hw_queues()
>>>>>>          /*                                                              blk_mq_run_hw_queue()
>>>>>>           * Add request to dispatch list or set bitmap of                    if (!blk_mq_hctx_has_pending())     4) load
>>>>>>           * software queue.                  1) store                            return
>>>>>>           */
>>>>>>      blk_mq_run_hw_queue()
>>>>>>          if (blk_queue_quiesced())           2) load
>>>>>>              return
>>>>>>          blk_mq_sched_dispatch_requests()
>>>>>> 
>>>>>> The full memory barrier should be inserted between 1) and 2), as well as
>>>>>> between 3) and 4) to make sure that either CPU0 sees QUEUE_FLAG_QUIESCED is
>>>>>> cleared or CPU1 sees dispatch list or setting of bitmap of software queue.
>>>>>> Otherwise, either CPU will not re-run the hardware queue causing starvation.
>>>>> 
>>>>> Memory barrier shouldn't serve as bug fix for two slow code paths.
>>>>> 
>>>>> One simple fix is to add helper of blk_queue_quiesced_lock(), and
>>>>> call the following check on CPU0:
>>>>> 
>>>>>      if (blk_queue_quiesced_lock())
>>>>>       blk_mq_run_hw_queue();
>>>> 
>>>> This only fixes blk_mq_request_issue_directly(), I think anywhere that
>>>> matching this
>>>> pattern (inserting a request to dispatch list and then running the
>>>> hardware queue)
>>>> should be fixed. And I think there are many places which match this
>>>> pattern (E.g.
>>>> blk_mq_submit_bio()). The above graph should be adjusted to the following.
>>>> 
>>>> CPU0                                        CPU1
>>>> 
>>>> blk_mq_insert_request()         1) store    blk_mq_unquiesce_queue()
>>>> blk_mq_run_hw_queue()
>>>> blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)       3) store
>>>>  if (blk_queue_quiesced())   2) load         blk_mq_run_hw_queues()
>>>>      return                                      blk_mq_run_hw_queue()
>>>>  blk_mq_sched_dispatch_requests()                    if
>>>> (!blk_mq_hctx_has_pending())     4) load
>>>>                                                          return
>>> 
>>> Sorry. There is something wrong with my email client. Resend the graph.
>>> 
>>> CPU0                                        CPU1
>>> 
>>> blk_mq_insert_request()         1) store    blk_mq_unquiesce_queue()
>>> blk_mq_run_hw_queue()                       blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)       3) store
>>>   if (blk_queue_quiesced())   2) load         blk_mq_run_hw_queues()
>>>       return                                      blk_mq_run_hw_queue()
>>>   blk_mq_sched_dispatch_requests()                    if (!blk_mq_hctx_has_pending())     4) load
>>>                                                           return
>> 
>> OK.
>> 
>> The issue shouldn't exist if blk_queue_quiesced() return false in
>> blk_mq_run_hw_queue(), so it is still one race in two slow paths?
>> 
>> I guess the barrier-less approach should work too, such as:
>> 
> 
> If we prefer barrier-less approach, I think the following solution
> will work as well, I'll use it in v2. Thanks.
> 
>> 
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index e3c3c0c21b55..632261982a77 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -2202,6 +2202,12 @@ void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
>> }
>> EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
>> 
>> +static inline bool blk_mq_hw_queue_need_run(struct blk_mq_hw_ctx *hctx)
>> +{
>> + 	return !blk_queue_quiesced(hctx->queue) &&
>> + 		blk_mq_hctx_has_pending(hctx);
>> +}
>> +
>> /**
>> * blk_mq_run_hw_queue - Start to run a hardware queue.
>> * @hctx: Pointer to the hardware queue to run.
>> @@ -2231,11 +2237,19 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
>> * quiesced.
>> */
>> 	__blk_mq_run_dispatch_ops(hctx->queue, false,
>> - 		need_run = !blk_queue_quiesced(hctx->queue) &&
>> - 		blk_mq_hctx_has_pending(hctx));
>> + 		need_run = blk_mq_hw_queue_need_run(hctx));
>> 
>> - 	if (!need_run)
>> - 		return;
>> + 	if (!need_run) {
>> + 		unsigned long flags;
>> +
>> + 		/* sync with unquiesce */
>> + 		spin_lock_irqsave(&hctx->queue->queue_lock, flags);

After some time thought, I think here we need a big comment to explain
why we need to sync. Because there are other caller of blk_queue_quiesced()
which do not need to hold ->queue_lock to sync. Then, I am thinking
is ->queue_lock really easier to be maintained than mb? For developers,
we still need to care about this, right? I don't see any obvious benefit.
And the mb approach seems more efficient than spinlock. Something like:

	if (!need_run) {
		/* Add a comment here to explain what's going on here. */
		smp_mb();
		need_run = blk_mq_hw_queue_need_run(hctx);
		if (!need_run)
			return;
	}

I am not objecting to your approach, I want to know if you insist on
barrier-less approach here. If yes, I'm fine with this approach. I can
use it in v2.

Muhcun,
Thanks.

>> + 		need_run = blk_mq_hw_queue_need_run(hctx);
> 
> One question here: should we use __blk_mq_run_dispatch_ops()? I saw a comment above.
> It seems it is safe to call blk_mq_hw_queue_need_run under [s]rcu lock.
> 
> Thanks.
> 
>> + 		spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
>> +
>> + 		if (!need_run)
>> + 			return;
>> + 	}
>> 
>> 	if (async || !cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)) {
>> 		blk_mq_delay_run_hw_queue(hctx, 0);
>> 
>> 
>> thanks,
>> Ming
> 
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/4] block: fix fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests to hctx->dispatch
  2024-08-27  8:16             ` Muchun Song
@ 2024-08-29  2:51               ` Ming Lei
  2024-08-29  3:40                 ` Muchun Song
  0 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2024-08-29  2:51 UTC (permalink / raw)
  To: Muchun Song
  Cc: Muchun Song, Jens Axboe, open list:BLOCK LAYER, LKML, Ming Lei

On Tue, Aug 27, 2024 at 4:17 PM Muchun Song <muchun.song@linux.dev> wrote:
>
>
>
> > On Aug 27, 2024, at 15:24, Muchun Song <muchun.song@linux.dev> wrote:
> >
> >
> >
> >> On Aug 26, 2024, at 17:20, Ming Lei <ming.lei@redhat.com> wrote:
> >>
> >> On Mon, Aug 26, 2024 at 03:33:18PM +0800, Muchun Song wrote:
> >>>
> >>>
> >>>> On Aug 26, 2024, at 15:06, Muchun Song <songmuchun@bytedance.com> wrote:
> >>>>
> >>>> On Fri, Aug 23, 2024 at 7:28 PM Ming Lei <ming.lei@redhat.com> wrote:
> >>>>>
> >>>>> On Sun, Aug 11, 2024 at 06:19:21 PM +0800, Muchun Song wrote:
> >>>>>> Supposing the following scenario.
> >>>>>>
> >>>>>> CPU0                                                                CPU1
> >>>>>>
> >>>>>> blk_mq_request_issue_directly()                                     blk_mq_unquiesce_queue()
> >>>>>>  if (blk_queue_quiesced())                                           blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)   3) store
> >>>>>>      blk_mq_insert_request()                                         blk_mq_run_hw_queues()
> >>>>>>          /*                                                              blk_mq_run_hw_queue()
> >>>>>>           * Add request to dispatch list or set bitmap of                    if (!blk_mq_hctx_has_pending())     4) load
> >>>>>>           * software queue.                  1) store                            return
> >>>>>>           */
> >>>>>>      blk_mq_run_hw_queue()
> >>>>>>          if (blk_queue_quiesced())           2) load
> >>>>>>              return
> >>>>>>          blk_mq_sched_dispatch_requests()
> >>>>>>
> >>>>>> The full memory barrier should be inserted between 1) and 2), as well as
> >>>>>> between 3) and 4) to make sure that either CPU0 sees QUEUE_FLAG_QUIESCED is
> >>>>>> cleared or CPU1 sees dispatch list or setting of bitmap of software queue.
> >>>>>> Otherwise, either CPU will not re-run the hardware queue causing starvation.
> >>>>>
> >>>>> Memory barrier shouldn't serve as bug fix for two slow code paths.
> >>>>>
> >>>>> One simple fix is to add helper of blk_queue_quiesced_lock(), and
> >>>>> call the following check on CPU0:
> >>>>>
> >>>>>      if (blk_queue_quiesced_lock())
> >>>>>       blk_mq_run_hw_queue();
> >>>>
> >>>> This only fixes blk_mq_request_issue_directly(), I think anywhere that
> >>>> matching this
> >>>> pattern (inserting a request to dispatch list and then running the
> >>>> hardware queue)
> >>>> should be fixed. And I think there are many places which match this
> >>>> pattern (E.g.
> >>>> blk_mq_submit_bio()). The above graph should be adjusted to the following.
> >>>>
> >>>> CPU0                                        CPU1
> >>>>
> >>>> blk_mq_insert_request()         1) store    blk_mq_unquiesce_queue()
> >>>> blk_mq_run_hw_queue()
> >>>> blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)       3) store
> >>>>  if (blk_queue_quiesced())   2) load         blk_mq_run_hw_queues()
> >>>>      return                                      blk_mq_run_hw_queue()
> >>>>  blk_mq_sched_dispatch_requests()                    if
> >>>> (!blk_mq_hctx_has_pending())     4) load
> >>>>                                                          return
> >>>
> >>> Sorry. There is something wrong with my email client. Resend the graph.
> >>>
> >>> CPU0                                        CPU1
> >>>
> >>> blk_mq_insert_request()         1) store    blk_mq_unquiesce_queue()
> >>> blk_mq_run_hw_queue()                       blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)       3) store
> >>>   if (blk_queue_quiesced())   2) load         blk_mq_run_hw_queues()
> >>>       return                                      blk_mq_run_hw_queue()
> >>>   blk_mq_sched_dispatch_requests()                    if (!blk_mq_hctx_has_pending())     4) load
> >>>                                                           return
> >>
> >> OK.
> >>
> >> The issue shouldn't exist if blk_queue_quiesced() return false in
> >> blk_mq_run_hw_queue(), so it is still one race in two slow paths?
> >>
> >> I guess the barrier-less approach should work too, such as:
> >>
> >
> > If we prefer barrier-less approach, I think the following solution
> > will work as well, I'll use it in v2. Thanks.
> >
> >>
> >> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >> index e3c3c0c21b55..632261982a77 100644
> >> --- a/block/blk-mq.c
> >> +++ b/block/blk-mq.c
> >> @@ -2202,6 +2202,12 @@ void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
> >> }
> >> EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
> >>
> >> +static inline bool blk_mq_hw_queue_need_run(struct blk_mq_hw_ctx *hctx)
> >> +{
> >> +    return !blk_queue_quiesced(hctx->queue) &&
> >> +            blk_mq_hctx_has_pending(hctx);
> >> +}
> >> +
> >> /**
> >> * blk_mq_run_hw_queue - Start to run a hardware queue.
> >> * @hctx: Pointer to the hardware queue to run.
> >> @@ -2231,11 +2237,19 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
> >> * quiesced.
> >> */
> >>      __blk_mq_run_dispatch_ops(hctx->queue, false,
> >> -            need_run = !blk_queue_quiesced(hctx->queue) &&
> >> -            blk_mq_hctx_has_pending(hctx));
> >> +            need_run = blk_mq_hw_queue_need_run(hctx));
> >>
> >> -    if (!need_run)
> >> -            return;
> >> +    if (!need_run) {
> >> +            unsigned long flags;
> >> +
> >> +            /* sync with unquiesce */
> >> +            spin_lock_irqsave(&hctx->queue->queue_lock, flags);
>
> After some time thought, I think here we need a big comment to explain
> why we need to sync. Because there are other caller of blk_queue_quiesced()
> which do not need to hold ->queue_lock to sync. Then, I am thinking
> is ->queue_lock really easier to be maintained than mb? For developers,
> we still need to care about this, right? I don't see any obvious benefit.
> And the mb approach seems more efficient than spinlock. Something like:
>
>         if (!need_run) {
>                 /* Add a comment here to explain what's going on here. */
>                 smp_mb();
>                 need_run = blk_mq_hw_queue_need_run(hctx);
>                 if (!need_run)
>                         return;
>         }
>
> I am not objecting to your approach, I want to know if you insist on
> barrier-less approach here. If yes, I'm fine with this approach. I can
> use it in v2.

Yes, as I mentioned, the race only exists on two slow code paths,
we seldom use barrier in slow paths, in which traditional lock
can provide a simpler & more readable solution.  Anytime,
READ/WRITE dependency implied in any barrier is hard to follow.

Thanks,


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/4] block: fix fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests to hctx->dispatch
  2024-08-29  2:51               ` Ming Lei
@ 2024-08-29  3:40                 ` Muchun Song
  0 siblings, 0 replies; 21+ messages in thread
From: Muchun Song @ 2024-08-29  3:40 UTC (permalink / raw)
  To: Ming Lei; +Cc: Muchun Song, Jens Axboe, open list:BLOCK LAYER, LKML



> On Aug 29, 2024, at 10:51, Ming Lei <ming.lei@redhat.com> wrote:
> 
> On Tue, Aug 27, 2024 at 4:17 PM Muchun Song <muchun.song@linux.dev> wrote:
>> 
>> 
>> 
>>> On Aug 27, 2024, at 15:24, Muchun Song <muchun.song@linux.dev> wrote:
>>> 
>>> 
>>> 
>>>> On Aug 26, 2024, at 17:20, Ming Lei <ming.lei@redhat.com> wrote:
>>>> 
>>>> On Mon, Aug 26, 2024 at 03:33:18PM +0800, Muchun Song wrote:
>>>>> 
>>>>> 
>>>>>> On Aug 26, 2024, at 15:06, Muchun Song <songmuchun@bytedance.com> wrote:
>>>>>> 
>>>>>> On Fri, Aug 23, 2024 at 7:28 PM Ming Lei <ming.lei@redhat.com> wrote:
>>>>>>> 
>>>>>>> On Sun, Aug 11, 2024 at 06:19:21 PM +0800, Muchun Song wrote:
>>>>>>>> Supposing the following scenario.
>>>>>>>> 
>>>>>>>> CPU0                                                                CPU1
>>>>>>>> 
>>>>>>>> blk_mq_request_issue_directly()                                     blk_mq_unquiesce_queue()
>>>>>>>> if (blk_queue_quiesced())                                           blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)   3) store
>>>>>>>>     blk_mq_insert_request()                                         blk_mq_run_hw_queues()
>>>>>>>>         /*                                                              blk_mq_run_hw_queue()
>>>>>>>>          * Add request to dispatch list or set bitmap of                    if (!blk_mq_hctx_has_pending())     4) load
>>>>>>>>          * software queue.                  1) store                            return
>>>>>>>>          */
>>>>>>>>     blk_mq_run_hw_queue()
>>>>>>>>         if (blk_queue_quiesced())           2) load
>>>>>>>>             return
>>>>>>>>         blk_mq_sched_dispatch_requests()
>>>>>>>> 
>>>>>>>> The full memory barrier should be inserted between 1) and 2), as well as
>>>>>>>> between 3) and 4) to make sure that either CPU0 sees QUEUE_FLAG_QUIESCED is
>>>>>>>> cleared or CPU1 sees dispatch list or setting of bitmap of software queue.
>>>>>>>> Otherwise, either CPU will not re-run the hardware queue causing starvation.
>>>>>>> 
>>>>>>> Memory barrier shouldn't serve as bug fix for two slow code paths.
>>>>>>> 
>>>>>>> One simple fix is to add helper of blk_queue_quiesced_lock(), and
>>>>>>> call the following check on CPU0:
>>>>>>> 
>>>>>>>     if (blk_queue_quiesced_lock())
>>>>>>>      blk_mq_run_hw_queue();
>>>>>> 
>>>>>> This only fixes blk_mq_request_issue_directly(), I think anywhere that
>>>>>> matching this
>>>>>> pattern (inserting a request to dispatch list and then running the
>>>>>> hardware queue)
>>>>>> should be fixed. And I think there are many places which match this
>>>>>> pattern (E.g.
>>>>>> blk_mq_submit_bio()). The above graph should be adjusted to the following.
>>>>>> 
>>>>>> CPU0                                        CPU1
>>>>>> 
>>>>>> blk_mq_insert_request()         1) store    blk_mq_unquiesce_queue()
>>>>>> blk_mq_run_hw_queue()
>>>>>> blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)       3) store
>>>>>> if (blk_queue_quiesced())   2) load         blk_mq_run_hw_queues()
>>>>>>     return                                      blk_mq_run_hw_queue()
>>>>>> blk_mq_sched_dispatch_requests()                    if
>>>>>> (!blk_mq_hctx_has_pending())     4) load
>>>>>>                                                         return
>>>>> 
>>>>> Sorry. There is something wrong with my email client. Resend the graph.
>>>>> 
>>>>> CPU0                                        CPU1
>>>>> 
>>>>> blk_mq_insert_request()         1) store    blk_mq_unquiesce_queue()
>>>>> blk_mq_run_hw_queue()                       blk_queue_flag_clear(QUEUE_FLAG_QUIESCED)       3) store
>>>>>  if (blk_queue_quiesced())   2) load         blk_mq_run_hw_queues()
>>>>>      return                                      blk_mq_run_hw_queue()
>>>>>  blk_mq_sched_dispatch_requests()                    if (!blk_mq_hctx_has_pending())     4) load
>>>>>                                                          return
>>>> 
>>>> OK.
>>>> 
>>>> The issue shouldn't exist if blk_queue_quiesced() return false in
>>>> blk_mq_run_hw_queue(), so it is still one race in two slow paths?
>>>> 
>>>> I guess the barrier-less approach should work too, such as:
>>>> 
>>> 
>>> If we prefer barrier-less approach, I think the following solution
>>> will work as well, I'll use it in v2. Thanks.
>>> 
>>>> 
>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>> index e3c3c0c21b55..632261982a77 100644
>>>> --- a/block/blk-mq.c
>>>> +++ b/block/blk-mq.c
>>>> @@ -2202,6 +2202,12 @@ void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
>>>> }
>>>> EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
>>>> 
>>>> +static inline bool blk_mq_hw_queue_need_run(struct blk_mq_hw_ctx *hctx)
>>>> +{
>>>> +    return !blk_queue_quiesced(hctx->queue) &&
>>>> +            blk_mq_hctx_has_pending(hctx);
>>>> +}
>>>> +
>>>> /**
>>>> * blk_mq_run_hw_queue - Start to run a hardware queue.
>>>> * @hctx: Pointer to the hardware queue to run.
>>>> @@ -2231,11 +2237,19 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
>>>> * quiesced.
>>>> */
>>>>     __blk_mq_run_dispatch_ops(hctx->queue, false,
>>>> -            need_run = !blk_queue_quiesced(hctx->queue) &&
>>>> -            blk_mq_hctx_has_pending(hctx));
>>>> +            need_run = blk_mq_hw_queue_need_run(hctx));
>>>> 
>>>> -    if (!need_run)
>>>> -            return;
>>>> +    if (!need_run) {
>>>> +            unsigned long flags;
>>>> +
>>>> +            /* sync with unquiesce */
>>>> +            spin_lock_irqsave(&hctx->queue->queue_lock, flags);
>> 
>> After some time thought, I think here we need a big comment to explain
>> why we need to sync. Because there are other caller of blk_queue_quiesced()
>> which do not need to hold ->queue_lock to sync. Then, I am thinking
>> is ->queue_lock really easier to be maintained than mb? For developers,
>> we still need to care about this, right? I don't see any obvious benefit.
>> And the mb approach seems more efficient than spinlock. Something like:
>> 
>>        if (!need_run) {
>>                /* Add a comment here to explain what's going on here. */
>>                smp_mb();
>>                need_run = blk_mq_hw_queue_need_run(hctx);
>>                if (!need_run)
>>                        return;
>>        }
>> 
>> I am not objecting to your approach, I want to know if you insist on
>> barrier-less approach here. If yes, I'm fine with this approach. I can
>> use it in v2.
> 
> Yes, as I mentioned, the race only exists on two slow code paths,
> we seldom use barrier in slow paths, in which traditional lock
> can provide a simpler & more readable solution.  Anytime,
> READ/WRITE dependency implied in any barrier is hard to follow.

Got it. Thanks for your reply.

> 
> Thanks,



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/4] block: fix ordering between checking BLK_MQ_S_STOPPED and adding requests to hctx->dispatch
  2024-08-27  7:31             ` Muchun Song
@ 2024-08-29  7:57               ` Yu Kuai
  0 siblings, 0 replies; 21+ messages in thread
From: Yu Kuai @ 2024-08-29  7:57 UTC (permalink / raw)
  To: Muchun Song, Yu Kuai
  Cc: Muchun Song, Ming Lei, Jens Axboe, open list:BLOCK LAYER, LKML,
	yukuai (C)

Hi,

在 2024/08/27 15:31, Muchun Song 写道:
> 
> 
>> On Aug 26, 2024, at 16:53, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> 在 2024/08/26 16:35, Muchun Song 写道:
>>>> On Aug 22, 2024, at 11:54, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> 在 2024/08/19 11:49, Muchun Song 写道:
>>>>> On Mon, Aug 19, 2024 at 10:28 AM Ming Lei <ming.lei@redhat.com> wrote:
>>>>>>
>>>>>> Hi Muchun,
>>>>>>
>>>>>> On Sun, Aug 11, 2024 at 06:19:19PM +0800, Muchun Song wrote:
>>>>>>> Supposing the following scenario with a virtio_blk driver.
>>>>>>>
>>>>>>> CPU0                                                                CPU1
>>>>>>>
>>>>>>> blk_mq_try_issue_directly()
>>>>>>>      __blk_mq_issue_directly()
>>>>>>>          q->mq_ops->queue_rq()
>>>>>>>              virtio_queue_rq()
>>>>>>>                  blk_mq_stop_hw_queue()
>>>>>>>                                                                      virtblk_done()
>>>>>>>      blk_mq_request_bypass_insert()                                      blk_mq_start_stopped_hw_queues()
>>>>>>>          /* Add IO request to dispatch list */   1) store                    blk_mq_start_stopped_hw_queue()
>>>>>>>                                                                                  clear_bit(BLK_MQ_S_STOPPED)                 3) store
>>>>>>>      blk_mq_run_hw_queue()                                                       blk_mq_run_hw_queue()
>>>>>>>          if (!blk_mq_hctx_has_pending())                                             if (!blk_mq_hctx_has_pending())         4) load
>>>>>>>              return                                                                      return
>>>>>>>          blk_mq_sched_dispatch_requests()                                            blk_mq_sched_dispatch_requests()
>>>>>>>              if (blk_mq_hctx_stopped())          2) load                                 if (blk_mq_hctx_stopped())
>>>>>>>                  return                                                                      return
>>>>>>>              __blk_mq_sched_dispatch_requests()                                          __blk_mq_sched_dispatch_requests()
>>>>>>>
>>>>>>> The full memory barrier should be inserted between 1) and 2), as well as between
>>>>>>> 3) and 4) to make sure that either CPU0 sees BLK_MQ_S_STOPPED is cleared or CPU1
>>>>>>> sees dispatch list or setting of bitmap of software queue. Otherwise, either CPU
>>>>>>> will not re-run the hardware queue causing starvation.
>>>>>>
>>>>>> Yeah, it is one kind of race which is triggered when adding request into
>>>>>> ->dispatch list after returning STS_RESOURCE. We were troubled by lots of
>>>>>> such kind of race.
>>>>> Yes. I saw the similar fix for BLK_MQ_S_SCHED_RESTART.
>>>>>>
>>>>>> stopping queue is used in very less drivers, and its only purpose should
>>>>>> be for throttling hw queue in case that low level queue is busy. There seems
>>>>>> more uses of blk_mq_stop_hw_queues(), but most of them should be replaced
>>>>>> with blk_mq_quiesce_queue().
>>>>>>
>>>>>> IMO, fixing this kind of issue via memory barrier is too tricky to
>>>>>> maintain cause WRITE/READ dependency is very hard to follow. I'd suggest to
>>>>>> make memory barrier solution as the last resort, and we can try to figure
>>>>>> out other easier & more reliable way first.
>>>>> I do agree it is hard to maintain the dependencies in the future. We should
>>>>> propose an easy-maintainable solution. But I thought it is a long-term issue
>>>>> throughout different stable linux distros. Adding a mb is the easy way to fix
>>>>> the problem (the code footprint is really small), so it will be very
>>>>> easy for others
>>>>> to backport those bug fixes to different stable linux distros. Therefore, mb
>>>>> should be an interim solution. Then, we could improve it based on the solution
>>>>> you've proposed below. What do you think?
>>>>
>>>> I'll agree with Ming, let's figure out a better fix first. Easy to backport to stables is not first consideration.
>>> Hi Kuai,
>>> All right. I usually focus on MM, it seems there is a gap between MM and BLock.
>>> Anyway, let's figure out if there is any good solution.
>>>>> Thanks,
>>>>> Muchun.
>>>>>>
>>>>>> One idea I thought of is to call blk_mq_request_bypass_insert()(or rename
>>>>>> & export it) before calling blk_mq_stop_hw_queue() in driver, then
>>>>>> return new status code STS_STOP_DISPATCH for notifying blk-mq to stop
>>>>>> dispatch simply.
>>>>
>>>> New status code look good to me, however, I wonder can we just remove
>>>> the problematic blk_mq_stop_hw_queue(), and replace it by handling the
>>>> new status from block layer?
>>>>
>>>> - Passing the new status to blk_mq_run_dispatch_ops, and quiesce with
>>> I didn't fully understand your suggestion. Let me ask some questions.
>>> blk_mq_stop_hw_queue() is usually called in blk_mq_ops->queue_rq path,
>>> it'll be easy for this case to pass the new status to blk_mq_run_dispatch_ops.
>>> Should we remove blk_mq_stop_hw_queues() as well? How to pass the new
>>> status to blk_mq_run_dispatch_ops in this case?
>>
>> For queue_rq from dispatch path, it can be removed. However, it is
>> called from remove path as well, I don't check yet if it can be removed
>> there, that's another story.
> 
> The reason why I asked this question is that blk_mq_stop_hw_queues() also needs
> to be fixed. See my patch 3.

I just reviewed that patch, please check following.
> 
>>
>> And just add a return value for dispatch_ops to pass status.
>>
>> Thanks,
>> Kuai
>>
>>>> the new status, if no request is inflight, unquiesce immediately;
>>> Actually, I didn't understand how to avoid the above race. May you elaborate
>>> the scenario?
> 
> Sorry for repeating, I didn't get your point here. May you elaborate
> your suggestion? Thanks very much.

// dispatch path, replace stop queue with quiesce queue;
if (__blk_mq_run_dispatch_ops() != STS_STOP_DISPATCH)
        return;

blk_mq_quiesce_queue();
/* other context already stop dispatch */
if (test_and_set_bit(QUEUE_FLAG_STOP_DISPATCH, &q->queue_flags)) {
        blk_mq_unquiesce_queue();
        return;
}

/*
  * IO is done before stopping dispatch, hence can't let IO complete to
  * unquiesce queue.
  */
if (!blk_mq_inflight() && test_and_clear_bit(QUEUE_FLAG_STOP_DISPATCH, 
&q->queue_flags))
        blk_mq_unquiesce_queue();

// complete path
if (test_and_clear_bit(QUEUE_FLAG_STOP_DISPATCH, &q->queue_flags))
        blk_mq_unquiesce_queue();

Thanks,
Kuai

> 
>>> Muhcun,
>>> Thanks.
>>>> - unquiesce is any IO is done afterwards;
>>> .
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2024-08-29  7:57 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-11 10:19 [PATCH 0/4] Fix some starvation problems Muchun Song
2024-08-11 10:19 ` [PATCH 1/4] block: fix request starvation when queue is stopped or quiesced Muchun Song
2024-08-16  9:14   ` Ming Lei
2024-08-11 10:19 ` [PATCH 2/4] block: fix ordering between checking BLK_MQ_S_STOPPED and adding requests to hctx->dispatch Muchun Song
2024-08-19  2:27   ` Ming Lei
2024-08-19  3:49     ` Muchun Song
2024-08-22  3:54       ` Yu Kuai
2024-08-26  8:35         ` Muchun Song
2024-08-26  8:53           ` Yu Kuai
2024-08-27  7:31             ` Muchun Song
2024-08-29  7:57               ` Yu Kuai
2024-08-11 10:19 ` [PATCH 3/4] block: fix missing smp_mb in blk_mq_{delay_}run_hw_queues Muchun Song
2024-08-11 10:19 ` [PATCH 4/4] block: fix fix ordering between checking QUEUE_FLAG_QUIESCED and adding requests to hctx->dispatch Muchun Song
2024-08-23 11:27   ` Ming Lei
2024-08-26  7:06     ` Muchun Song
2024-08-26  7:33       ` Muchun Song
2024-08-26  9:20         ` Ming Lei
2024-08-27  7:24           ` Muchun Song
2024-08-27  8:16             ` Muchun Song
2024-08-29  2:51               ` Ming Lei
2024-08-29  3:40                 ` Muchun Song

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).