* [PATCH v2 1/9] blk-mq: remove blk_mq_in_flight()
2025-04-27 8:29 [PATCH v2 0/9] md: fix is_mddev_idle() Yu Kuai
@ 2025-04-27 8:29 ` Yu Kuai
2025-04-28 13:06 ` Christoph Hellwig
` (2 more replies)
2025-04-27 8:29 ` [PATCH v2 2/9] block: reuse part_in_flight_rw for part_in_flight Yu Kuai
` (8 subsequent siblings)
9 siblings, 3 replies; 31+ messages in thread
From: Yu Kuai @ 2025-04-27 8:29 UTC (permalink / raw)
To: hch, axboe, xni, agk, snitzer, mpatocka, song, yukuai3, cl,
nadav.amit, ubizjak, akpm
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
It's not used and can be removed.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-mq.c | 10 ----------
block/blk-mq.h | 2 --
2 files changed, 12 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c2697db59109..301dbd3e1743 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -101,16 +101,6 @@ static bool blk_mq_check_inflight(struct request *rq, void *priv)
return true;
}
-unsigned int blk_mq_in_flight(struct request_queue *q,
- struct block_device *part)
-{
- struct mq_inflight mi = { .part = part };
-
- blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
-
- return mi.inflight[0] + mi.inflight[1];
-}
-
void blk_mq_in_flight_rw(struct request_queue *q, struct block_device *part,
unsigned int inflight[2])
{
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 3011a78cf16a..a23d5812d08f 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -246,8 +246,6 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
return hctx->nr_ctx && hctx->tags;
}
-unsigned int blk_mq_in_flight(struct request_queue *q,
- struct block_device *part);
void blk_mq_in_flight_rw(struct request_queue *q, struct block_device *part,
unsigned int inflight[2]);
--
2.39.2
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH v2 1/9] blk-mq: remove blk_mq_in_flight()
2025-04-27 8:29 ` [PATCH v2 1/9] blk-mq: remove blk_mq_in_flight() Yu Kuai
@ 2025-04-28 13:06 ` Christoph Hellwig
2025-04-28 13:45 ` John Garry
2025-04-29 6:25 ` Hannes Reinecke
2 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2025-04-28 13:06 UTC (permalink / raw)
To: Yu Kuai
Cc: hch, axboe, xni, agk, snitzer, mpatocka, song, yukuai3, cl,
nadav.amit, ubizjak, akpm, linux-block, linux-kernel, dm-devel,
linux-raid, yi.zhang, yangerkun, johnny.chenyi
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 1/9] blk-mq: remove blk_mq_in_flight()
2025-04-27 8:29 ` [PATCH v2 1/9] blk-mq: remove blk_mq_in_flight() Yu Kuai
2025-04-28 13:06 ` Christoph Hellwig
@ 2025-04-28 13:45 ` John Garry
2025-04-29 1:40 ` Yu Kuai
2025-04-29 6:25 ` Hannes Reinecke
2 siblings, 1 reply; 31+ messages in thread
From: John Garry @ 2025-04-28 13:45 UTC (permalink / raw)
To: Yu Kuai, hch, axboe, xni, agk, snitzer, mpatocka, song, yukuai3,
cl, nadav.amit, ubizjak, akpm
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yi.zhang,
yangerkun, johnny.chenyi
On 27/04/2025 09:29, Yu Kuai wrote:
> From: Yu Kuai<yukuai3@huawei.com>
>
> It's not used and can be removed.
>
It would be nice to mention when it stopped being used.
> Signed-off-by: Yu Kuai<yukuai3@huawei.com>
Regardless, FWIW:
Reviewed-by: John Garry <john.g.garry@oracle.com>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 1/9] blk-mq: remove blk_mq_in_flight()
2025-04-28 13:45 ` John Garry
@ 2025-04-29 1:40 ` Yu Kuai
0 siblings, 0 replies; 31+ messages in thread
From: Yu Kuai @ 2025-04-29 1:40 UTC (permalink / raw)
To: John Garry, Yu Kuai, hch, axboe, xni, agk, snitzer, mpatocka,
song, cl, nadav.amit, ubizjak, akpm
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yi.zhang,
yangerkun, johnny.chenyi, yukuai (C)
Hi,
On 2025/04/28 21:45, John Garry wrote:
> On 27/04/2025 09:29, Yu Kuai wrote:
>> From: Yu Kuai<yukuai3@huawei.com>
>>
>> It's not used and can be removed.
>>
>
> it's nice to mention when it stopped being used.
Ok
>
>> Signed-off-by: Yu Kuai<yukuai3@huawei.com>
>
> Regardless, FWIW:
> Reviewed-by: John Garry <john.g.garry@oracle.com>
Thanks,
Kuai
>
> .
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 1/9] blk-mq: remove blk_mq_in_flight()
2025-04-27 8:29 ` [PATCH v2 1/9] blk-mq: remove blk_mq_in_flight() Yu Kuai
2025-04-28 13:06 ` Christoph Hellwig
2025-04-28 13:45 ` John Garry
@ 2025-04-29 6:25 ` Hannes Reinecke
2 siblings, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2025-04-29 6:25 UTC (permalink / raw)
To: Yu Kuai, hch, axboe, xni, agk, snitzer, mpatocka, song, yukuai3,
cl, nadav.amit, ubizjak, akpm
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yi.zhang,
yangerkun, johnny.chenyi
On 4/27/25 10:29, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> It's not used and can be removed.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
> block/blk-mq.c | 10 ----------
> block/blk-mq.h | 2 --
> 2 files changed, 12 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v2 2/9] block: reuse part_in_flight_rw for part_in_flight
2025-04-27 8:29 [PATCH v2 0/9] md: fix is_mddev_idle() Yu Kuai
2025-04-27 8:29 ` [PATCH v2 1/9] blk-mq: remove blk_mq_in_flight() Yu Kuai
@ 2025-04-27 8:29 ` Yu Kuai
2025-04-28 13:06 ` Christoph Hellwig
` (2 more replies)
2025-04-27 8:29 ` [PATCH v2 3/9] block: WARN if bdev inflight counter is negative Yu Kuai
` (7 subsequent siblings)
9 siblings, 3 replies; 31+ messages in thread
From: Yu Kuai @ 2025-04-27 8:29 UTC (permalink / raw)
To: hch, axboe, xni, agk, snitzer, mpatocka, song, yukuai3, cl,
nadav.amit, ubizjak, akpm
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
They are almost identical; reuse part_in_flight_rw() for part_in_flight() to make the code cleaner.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/genhd.c | 24 +++++++++---------------
1 file changed, 9 insertions(+), 15 deletions(-)
diff --git a/block/genhd.c b/block/genhd.c
index c2bd86cd09de..f671d9ee00c4 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -125,21 +125,6 @@ static void part_stat_read_all(struct block_device *part,
}
}
-unsigned int part_in_flight(struct block_device *part)
-{
- unsigned int inflight = 0;
- int cpu;
-
- for_each_possible_cpu(cpu) {
- inflight += part_stat_local_read_cpu(part, in_flight[0], cpu) +
- part_stat_local_read_cpu(part, in_flight[1], cpu);
- }
- if ((int)inflight < 0)
- inflight = 0;
-
- return inflight;
-}
-
static void part_in_flight_rw(struct block_device *part,
unsigned int inflight[2])
{
@@ -157,6 +142,15 @@ static void part_in_flight_rw(struct block_device *part,
inflight[1] = 0;
}
+unsigned int part_in_flight(struct block_device *part)
+{
+ unsigned int inflight[2];
+
+ part_in_flight_rw(part, inflight);
+
+ return inflight[READ] + inflight[WRITE];
+}
+
/*
* Can be deleted altogether. Later.
*
--
2.39.2
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH v2 2/9] block: reuse part_in_flight_rw for part_in_flight
2025-04-27 8:29 ` [PATCH v2 2/9] block: reuse part_in_flight_rw for part_in_flight Yu Kuai
@ 2025-04-28 13:06 ` Christoph Hellwig
2025-04-28 13:47 ` John Garry
2025-04-29 6:25 ` Hannes Reinecke
2 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2025-04-28 13:06 UTC (permalink / raw)
To: Yu Kuai
Cc: hch, axboe, xni, agk, snitzer, mpatocka, song, yukuai3, cl,
nadav.amit, ubizjak, akpm, linux-block, linux-kernel, dm-devel,
linux-raid, yi.zhang, yangerkun, johnny.chenyi
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 2/9] block: reuse part_in_flight_rw for part_in_flight
2025-04-27 8:29 ` [PATCH v2 2/9] block: reuse part_in_flight_rw for part_in_flight Yu Kuai
2025-04-28 13:06 ` Christoph Hellwig
@ 2025-04-28 13:47 ` John Garry
2025-04-29 6:25 ` Hannes Reinecke
2 siblings, 0 replies; 31+ messages in thread
From: John Garry @ 2025-04-28 13:47 UTC (permalink / raw)
To: Yu Kuai, hch, axboe, xni, agk, snitzer, mpatocka, song, yukuai3,
cl, nadav.amit, ubizjak, akpm
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yi.zhang,
yangerkun, johnny.chenyi
On 27/04/2025 09:29, Yu Kuai wrote:
> From: Yu Kuai<yukuai3@huawei.com>
>
> They are almost identical, to make code cleaner.
>
> Signed-off-by: Yu Kuai<yukuai3@huawei.com>
FWIW:
Reviewed-by: John Garry <john.g.garry@oracle.com>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 2/9] block: reuse part_in_flight_rw for part_in_flight
2025-04-27 8:29 ` [PATCH v2 2/9] block: reuse part_in_flight_rw for part_in_flight Yu Kuai
2025-04-28 13:06 ` Christoph Hellwig
2025-04-28 13:47 ` John Garry
@ 2025-04-29 6:25 ` Hannes Reinecke
2 siblings, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2025-04-29 6:25 UTC (permalink / raw)
To: Yu Kuai, hch, axboe, xni, agk, snitzer, mpatocka, song, yukuai3,
cl, nadav.amit, ubizjak, akpm
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yi.zhang,
yangerkun, johnny.chenyi
On 4/27/25 10:29, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> They are almost identical, to make code cleaner.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
> block/genhd.c | 24 +++++++++---------------
> 1 file changed, 9 insertions(+), 15 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v2 3/9] block: WARN if bdev inflight counter is negative
2025-04-27 8:29 [PATCH v2 0/9] md: fix is_mddev_idle() Yu Kuai
2025-04-27 8:29 ` [PATCH v2 1/9] blk-mq: remove blk_mq_in_flight() Yu Kuai
2025-04-27 8:29 ` [PATCH v2 2/9] block: reuse part_in_flight_rw for part_in_flight Yu Kuai
@ 2025-04-27 8:29 ` Yu Kuai
2025-04-28 13:07 ` Christoph Hellwig
` (2 more replies)
2025-04-27 8:29 ` [PATCH v2 4/9] block: cleanup blk_mq_in_flight_rw() Yu Kuai
` (6 subsequent siblings)
9 siblings, 3 replies; 31+ messages in thread
From: Yu Kuai @ 2025-04-27 8:29 UTC (permalink / raw)
To: hch, axboe, xni, agk, snitzer, mpatocka, song, yukuai3, cl,
nadav.amit, ubizjak, akpm
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
A negative inflight counter means there is a BUG in the related bio-based
disk driver, or in blk-mq for a rq-based disk; it's better not to hide the BUG.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/genhd.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/block/genhd.c b/block/genhd.c
index f671d9ee00c4..d158c25237b6 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -136,9 +136,9 @@ static void part_in_flight_rw(struct block_device *part,
inflight[0] += part_stat_local_read_cpu(part, in_flight[0], cpu);
inflight[1] += part_stat_local_read_cpu(part, in_flight[1], cpu);
}
- if ((int)inflight[0] < 0)
+ if (WARN_ON_ONCE((int)inflight[0] < 0))
inflight[0] = 0;
- if ((int)inflight[1] < 0)
+ if (WARN_ON_ONCE((int)inflight[1] < 0))
inflight[1] = 0;
}
--
2.39.2
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH v2 3/9] block: WARN if bdev inflight counter is negative
2025-04-27 8:29 ` [PATCH v2 3/9] block: WARN if bdev inflight counter is negative Yu Kuai
@ 2025-04-28 13:07 ` Christoph Hellwig
2025-04-28 14:06 ` John Garry
2025-04-29 6:27 ` Hannes Reinecke
2 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2025-04-28 13:07 UTC (permalink / raw)
To: Yu Kuai
Cc: hch, axboe, xni, agk, snitzer, mpatocka, song, yukuai3, cl,
nadav.amit, ubizjak, akpm, linux-block, linux-kernel, dm-devel,
linux-raid, yi.zhang, yangerkun, johnny.chenyi
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 3/9] block: WARN if bdev inflight counter is negative
2025-04-27 8:29 ` [PATCH v2 3/9] block: WARN if bdev inflight counter is negative Yu Kuai
2025-04-28 13:07 ` Christoph Hellwig
@ 2025-04-28 14:06 ` John Garry
2025-04-29 1:43 ` Yu Kuai
2025-04-29 6:27 ` Hannes Reinecke
2 siblings, 1 reply; 31+ messages in thread
From: John Garry @ 2025-04-28 14:06 UTC (permalink / raw)
To: Yu Kuai, hch, axboe, xni, agk, snitzer, mpatocka, song, yukuai3,
cl, nadav.amit, ubizjak, akpm
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yi.zhang,
yangerkun, johnny.chenyi
On 27/04/2025 09:29, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> Which means there is a BUG
nit: to me, BUG means the symbol BUG(), and not a software bug (which I
think you mean)
> for related bio-based disk driver, or blk-mq
> for rq-based disk, it's better not to hide the BUG.
AFAICS, this check was not present for mq, so is it really required now?
I suppose that the code is simpler to always have the check. I find it
an odd check to begin with...
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
> block/genhd.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/block/genhd.c b/block/genhd.c
> index f671d9ee00c4..d158c25237b6 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -136,9 +136,9 @@ static void part_in_flight_rw(struct block_device *part,
> inflight[0] += part_stat_local_read_cpu(part, in_flight[0], cpu);
> inflight[1] += part_stat_local_read_cpu(part, in_flight[1], cpu);
> }
> - if ((int)inflight[0] < 0)
> + if (WARN_ON_ONCE((int)inflight[0] < 0))
> inflight[0] = 0;
> - if ((int)inflight[1] < 0)
> + if (WARN_ON_ONCE((int)inflight[1] < 0))
> inflight[1] = 0;
> }
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 3/9] block: WARN if bdev inflight counter is negative
2025-04-28 14:06 ` John Garry
@ 2025-04-29 1:43 ` Yu Kuai
2025-04-29 8:53 ` John Garry
0 siblings, 1 reply; 31+ messages in thread
From: Yu Kuai @ 2025-04-29 1:43 UTC (permalink / raw)
To: John Garry, Yu Kuai, hch, axboe, xni, agk, snitzer, mpatocka,
song, cl, nadav.amit, ubizjak, akpm
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yi.zhang,
yangerkun, johnny.chenyi, yukuai (C)
Hi,
On 2025/04/28 22:06, John Garry wrote:
> On 27/04/2025 09:29, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Which means there is a BUG
>
> nit: to me, BUG means symbol BUG(), and not a software bug (which I
> think that you mean)
Actually, I mean that the bio-based disk driver or blk-mq messed up the IO
accounting: more IOs were completed than started, and this is a bug.
>
>> for related bio-based disk driver, or blk-mq
>> for rq-based disk, it's better not to hide the BUG.
>
> AFICS, this check was not present for mq, so is it really required now?
> I suppose that the code is simpler to always have the check. I find it
> an odd check to begin with...
This check is present for mq as well, for example in diskstats_show() and
update_io_ticks().
Thanks,
Kuai
>
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>> block/genhd.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/genhd.c b/block/genhd.c
>> index f671d9ee00c4..d158c25237b6 100644
>> --- a/block/genhd.c
>> +++ b/block/genhd.c
>> @@ -136,9 +136,9 @@ static void part_in_flight_rw(struct block_device
>> *part,
>> inflight[0] += part_stat_local_read_cpu(part, in_flight[0],
>> cpu);
>> inflight[1] += part_stat_local_read_cpu(part, in_flight[1],
>> cpu);
>> }
>> - if ((int)inflight[0] < 0)
>> + if (WARN_ON_ONCE((int)inflight[0] < 0))
>> inflight[0] = 0;
>> - if ((int)inflight[1] < 0)
>> + if (WARN_ON_ONCE((int)inflight[1] < 0))
>> inflight[1] = 0;
>> }
>
>
> .
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 3/9] block: WARN if bdev inflight counter is negative
2025-04-29 1:43 ` Yu Kuai
@ 2025-04-29 8:53 ` John Garry
0 siblings, 0 replies; 31+ messages in thread
From: John Garry @ 2025-04-29 8:53 UTC (permalink / raw)
To: Yu Kuai, hch, axboe, xni, agk, snitzer, mpatocka, song, cl,
nadav.amit, ubizjak, akpm
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yi.zhang,
yangerkun, johnny.chenyi, yukuai (C)
On 29/04/2025 02:43, Yu Kuai wrote:
> accounting, IO done is more than IO start, and this is a bug.
>>
>>> for related bio-based disk driver, or blk-mq
>>> for rq-based disk, it's better not to hide the BUG.
>>
>> AFICS, this check was not present for mq, so is it really required
>> now? I suppose that the code is simpler to always have the check. I
>> find it an odd check to begin with...
>
> This check do present for mq, for example, diskstats_show() and
> update_io_ticks().
ok, I just noticed this in part_inflight_show() ->
blk_mq_in_flight_rw(), which didn't have such a check. I think that is
because we expect the tagset iter to provide a sane count value.
Thanks,
John
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 3/9] block: WARN if bdev inflight counter is negative
2025-04-27 8:29 ` [PATCH v2 3/9] block: WARN if bdev inflight counter is negative Yu Kuai
2025-04-28 13:07 ` Christoph Hellwig
2025-04-28 14:06 ` John Garry
@ 2025-04-29 6:27 ` Hannes Reinecke
2 siblings, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2025-04-29 6:27 UTC (permalink / raw)
To: Yu Kuai, hch, axboe, xni, agk, snitzer, mpatocka, song, yukuai3,
cl, nadav.amit, ubizjak, akpm
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yi.zhang,
yangerkun, johnny.chenyi
On 4/27/25 10:29, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> Which means there is a BUG for related bio-based disk driver, or blk-mq
> for rq-based disk, it's better not to hide the BUG.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
> block/genhd.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
Please make 'BUG' lowercase.
Otherwise:
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v2 4/9] block: cleanup blk_mq_in_flight_rw()
2025-04-27 8:29 [PATCH v2 0/9] md: fix is_mddev_idle() Yu Kuai
` (2 preceding siblings ...)
2025-04-27 8:29 ` [PATCH v2 3/9] block: WARN if bdev inflight counter is negative Yu Kuai
@ 2025-04-27 8:29 ` Yu Kuai
2025-04-28 13:08 ` Christoph Hellwig
2025-04-29 6:31 ` Hannes Reinecke
2025-04-27 8:29 ` [PATCH v2 5/9] block: export API to get the number of bdev inflight IO Yu Kuai
` (5 subsequent siblings)
9 siblings, 2 replies; 31+ messages in thread
From: Yu Kuai @ 2025-04-27 8:29 UTC (permalink / raw)
To: hch, axboe, xni, agk, snitzer, mpatocka, song, yukuai3, cl,
nadav.amit, ubizjak, akpm
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
Also add a comment to part_inflight_show() explaining the difference
between bio-based and rq-based devices.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-mq.c | 12 ++++++------
block/blk-mq.h | 3 +--
block/genhd.c | 43 +++++++++++++++++++++++++------------------
3 files changed, 32 insertions(+), 26 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 301dbd3e1743..0067e8226e05 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -89,7 +89,7 @@ struct mq_inflight {
unsigned int inflight[2];
};
-static bool blk_mq_check_inflight(struct request *rq, void *priv)
+static bool blk_mq_check_in_driver(struct request *rq, void *priv)
{
struct mq_inflight *mi = priv;
@@ -101,14 +101,14 @@ static bool blk_mq_check_inflight(struct request *rq, void *priv)
return true;
}
-void blk_mq_in_flight_rw(struct request_queue *q, struct block_device *part,
- unsigned int inflight[2])
+void blk_mq_in_driver_rw(struct block_device *part, unsigned int inflight[2])
{
struct mq_inflight mi = { .part = part };
- blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
- inflight[0] = mi.inflight[0];
- inflight[1] = mi.inflight[1];
+ blk_mq_queue_tag_busy_iter(bdev_get_queue(part), blk_mq_check_in_driver,
+ &mi);
+ inflight[READ] = mi.inflight[READ];
+ inflight[WRITE] = mi.inflight[WRITE];
}
#ifdef CONFIG_LOCKDEP
diff --git a/block/blk-mq.h b/block/blk-mq.h
index a23d5812d08f..4205da1a4836 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -246,8 +246,7 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
return hctx->nr_ctx && hctx->tags;
}
-void blk_mq_in_flight_rw(struct request_queue *q, struct block_device *part,
- unsigned int inflight[2]);
+void blk_mq_in_driver_rw(struct block_device *part, unsigned int inflight[2]);
static inline void blk_mq_put_dispatch_budget(struct request_queue *q,
int budget_token)
diff --git a/block/genhd.c b/block/genhd.c
index d158c25237b6..2470099b492b 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -126,27 +126,32 @@ static void part_stat_read_all(struct block_device *part,
}
static void part_in_flight_rw(struct block_device *part,
- unsigned int inflight[2])
+ unsigned int inflight[2], bool mq_driver)
{
int cpu;
- inflight[0] = 0;
- inflight[1] = 0;
- for_each_possible_cpu(cpu) {
- inflight[0] += part_stat_local_read_cpu(part, in_flight[0], cpu);
- inflight[1] += part_stat_local_read_cpu(part, in_flight[1], cpu);
+ if (mq_driver) {
+ blk_mq_in_driver_rw(part, inflight);
+ } else {
+ for_each_possible_cpu(cpu) {
+ inflight[READ] += part_stat_local_read_cpu(
+ part, in_flight[READ], cpu);
+ inflight[WRITE] += part_stat_local_read_cpu(
+ part, in_flight[WRITE], cpu);
+ }
}
- if (WARN_ON_ONCE((int)inflight[0] < 0))
- inflight[0] = 0;
- if (WARN_ON_ONCE((int)inflight[1] < 0))
- inflight[1] = 0;
+
+ if (WARN_ON_ONCE((int)inflight[READ] < 0))
+ inflight[READ] = 0;
+ if (WARN_ON_ONCE((int)inflight[WRITE] < 0))
+ inflight[WRITE] = 0;
}
unsigned int part_in_flight(struct block_device *part)
{
- unsigned int inflight[2];
+ unsigned int inflight[2] = {0};
- part_in_flight_rw(part, inflight);
+ part_in_flight_rw(part, inflight, false);
return inflight[READ] + inflight[WRITE];
}
@@ -1036,19 +1041,21 @@ ssize_t part_stat_show(struct device *dev,
(unsigned int)div_u64(stat.nsecs[STAT_FLUSH], NSEC_PER_MSEC));
}
+/*
+ * Show the number of IOs issued to driver.
+ * For bio-based device, started from bdev_start_io_acct();
+ * For rq-based device, started from blk_mq_start_request();
+ */
ssize_t part_inflight_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
struct block_device *bdev = dev_to_bdev(dev);
struct request_queue *q = bdev_get_queue(bdev);
- unsigned int inflight[2];
+ unsigned int inflight[2] = {0};
- if (queue_is_mq(q))
- blk_mq_in_flight_rw(q, bdev, inflight);
- else
- part_in_flight_rw(bdev, inflight);
+ part_in_flight_rw(bdev, inflight, queue_is_mq(q));
- return sysfs_emit(buf, "%8u %8u\n", inflight[0], inflight[1]);
+ return sysfs_emit(buf, "%8u %8u\n", inflight[READ], inflight[WRITE]);
}
static ssize_t disk_capability_show(struct device *dev,
--
2.39.2
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH v2 4/9] block: cleanup blk_mq_in_flight_rw()
2025-04-27 8:29 ` [PATCH v2 4/9] block: cleanup blk_mq_in_flight_rw() Yu Kuai
@ 2025-04-28 13:08 ` Christoph Hellwig
2025-04-29 6:31 ` Hannes Reinecke
1 sibling, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2025-04-28 13:08 UTC (permalink / raw)
To: Yu Kuai
Cc: hch, axboe, xni, agk, snitzer, mpatocka, song, yukuai3, cl,
nadav.amit, ubizjak, akpm, linux-block, linux-kernel, dm-devel,
linux-raid, yi.zhang, yangerkun, johnny.chenyi
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 4/9] block: cleanup blk_mq_in_flight_rw()
2025-04-27 8:29 ` [PATCH v2 4/9] block: cleanup blk_mq_in_flight_rw() Yu Kuai
2025-04-28 13:08 ` Christoph Hellwig
@ 2025-04-29 6:31 ` Hannes Reinecke
2025-04-29 9:07 ` Yu Kuai
1 sibling, 1 reply; 31+ messages in thread
From: Hannes Reinecke @ 2025-04-29 6:31 UTC (permalink / raw)
To: Yu Kuai, hch, axboe, xni, agk, snitzer, mpatocka, song, yukuai3,
cl, nadav.amit, ubizjak, akpm
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yi.zhang,
yangerkun, johnny.chenyi
On 4/27/25 10:29, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> Also add comment for part_inflight_show() for the difference between
> bio-based and rq-based device.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
> block/blk-mq.c | 12 ++++++------
> block/blk-mq.h | 3 +--
> block/genhd.c | 43 +++++++++++++++++++++++++------------------
> 3 files changed, 32 insertions(+), 26 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 301dbd3e1743..0067e8226e05 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -89,7 +89,7 @@ struct mq_inflight {
> unsigned int inflight[2];
> };
>
> -static bool blk_mq_check_inflight(struct request *rq, void *priv)
> +static bool blk_mq_check_in_driver(struct request *rq, void *priv)
Please don't rename these functions. 'in flight' always means 'in flight
in the driver', so renaming them just introduces churn with no real
advantage.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 4/9] block: cleanup blk_mq_in_flight_rw()
2025-04-29 6:31 ` Hannes Reinecke
@ 2025-04-29 9:07 ` Yu Kuai
0 siblings, 0 replies; 31+ messages in thread
From: Yu Kuai @ 2025-04-29 9:07 UTC (permalink / raw)
To: Hannes Reinecke, Yu Kuai, hch, axboe, xni, agk, snitzer, mpatocka,
song, cl, nadav.amit, ubizjak, akpm
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yi.zhang,
yangerkun, johnny.chenyi, yukuai (C)
Hi,
On 2025/04/29 14:31, Hannes Reinecke wrote:
> On 4/27/25 10:29, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Also add comment for part_inflight_show() for the difference between
>> bio-based and rq-based device.
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>> block/blk-mq.c | 12 ++++++------
>> block/blk-mq.h | 3 +--
>> block/genhd.c | 43 +++++++++++++++++++++++++------------------
>> 3 files changed, 32 insertions(+), 26 deletions(-)
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 301dbd3e1743..0067e8226e05 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -89,7 +89,7 @@ struct mq_inflight {
>> unsigned int inflight[2];
>> };
>> -static bool blk_mq_check_inflight(struct request *rq, void *priv)
>> +static bool blk_mq_check_in_driver(struct request *rq, void *priv)
>
> Please don't rename these functions. 'in flight' always means 'in flight
> in the driver', so renaming them just introduces churn with no real
> advantage.
Actually, the inflight value from /proc/diskstats (diskstats_show())
means that IO accounting has started, while the IO may not yet be in the
rq driver; it may still sit in the IO scheduler, for example. The same
inflight value is used in update_io_ticks() as well.
This is the main reason for the rename. Related comments are added to
part_inflight_show() in this patch, and to part_in_flight() in the next
patch.
Thanks,
Kuai
>
> Cheers,
>
> Hannes
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v2 5/9] block: export API to get the number of bdev inflight IO
2025-04-27 8:29 [PATCH v2 0/9] md: fix is_mddev_idle() Yu Kuai
` (3 preceding siblings ...)
2025-04-27 8:29 ` [PATCH v2 4/9] block: cleanup blk_mq_in_flight_rw() Yu Kuai
@ 2025-04-27 8:29 ` Yu Kuai
2025-04-28 13:08 ` Christoph Hellwig
2025-04-29 6:32 ` Hannes Reinecke
2025-04-27 8:29 ` [PATCH v2 6/9] md: record dm-raid gendisk in mddev Yu Kuai
` (4 subsequent siblings)
9 siblings, 2 replies; 31+ messages in thread
From: Yu Kuai @ 2025-04-27 8:29 UTC (permalink / raw)
To: hch, axboe, xni, agk, snitzer, mpatocka, song, yukuai3, cl,
nadav.amit, ubizjak, akpm
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
- rename part_in_{flight, flight_rw} to bdev_count_{inflight, inflight_rw}
- export bdev_count_inflight; it will be used in later patches to fix a
problem in mdraid where foreground IO can be starved by background sync IO
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-core.c | 2 +-
block/blk.h | 1 -
block/genhd.c | 22 ++++++++++++++++------
include/linux/part_stat.h | 2 ++
4 files changed, 19 insertions(+), 8 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index e8cc270a453f..b862c66018f2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1018,7 +1018,7 @@ void update_io_ticks(struct block_device *part, unsigned long now, bool end)
stamp = READ_ONCE(part->bd_stamp);
if (unlikely(time_after(now, stamp)) &&
likely(try_cmpxchg(&part->bd_stamp, &stamp, now)) &&
- (end || part_in_flight(part)))
+ (end || bdev_count_inflight(part)))
__part_stat_add(part, io_ticks, now - stamp);
if (bdev_is_partition(part)) {
diff --git a/block/blk.h b/block/blk.h
index 006e3be433d2..f476f233f195 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -418,7 +418,6 @@ void blk_apply_bdi_limits(struct backing_dev_info *bdi,
int blk_dev_init(void);
void update_io_ticks(struct block_device *part, unsigned long now, bool end);
-unsigned int part_in_flight(struct block_device *part);
static inline void req_set_nomerge(struct request_queue *q, struct request *req)
{
diff --git a/block/genhd.c b/block/genhd.c
index 2470099b492b..fdaeafddfc4c 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -125,7 +125,7 @@ static void part_stat_read_all(struct block_device *part,
}
}
-static void part_in_flight_rw(struct block_device *part,
+static void bdev_count_inflight_rw(struct block_device *part,
unsigned int inflight[2], bool mq_driver)
{
int cpu;
@@ -147,14 +147,24 @@ static void part_in_flight_rw(struct block_device *part,
inflight[WRITE] = 0;
}
-unsigned int part_in_flight(struct block_device *part)
+/**
+ * bdev_count_inflight - get the number of inflight IOs for a block device.
+ *
+ * @part: the block device.
+ *
+ * Inflight here means started IO accounting, from bdev_start_io_acct() for
+ * bio-based block device, and from blk_account_io_start() for rq-based block
+ * device.
+ */
+unsigned int bdev_count_inflight(struct block_device *part)
{
unsigned int inflight[2] = {0};
- part_in_flight_rw(part, inflight, false);
+ bdev_count_inflight_rw(part, inflight, false);
return inflight[READ] + inflight[WRITE];
}
+EXPORT_SYMBOL_GPL(bdev_count_inflight);
/*
* Can be deleted altogether. Later.
@@ -1004,7 +1014,7 @@ ssize_t part_stat_show(struct device *dev,
struct disk_stats stat;
unsigned int inflight;
- inflight = part_in_flight(bdev);
+ inflight = bdev_count_inflight(bdev);
if (inflight) {
part_stat_lock();
update_io_ticks(bdev, jiffies, true);
@@ -1053,7 +1063,7 @@ ssize_t part_inflight_show(struct device *dev, struct device_attribute *attr,
struct request_queue *q = bdev_get_queue(bdev);
unsigned int inflight[2] = {0};
- part_in_flight_rw(bdev, inflight, queue_is_mq(q));
+ bdev_count_inflight_rw(bdev, inflight, queue_is_mq(q));
return sysfs_emit(buf, "%8u %8u\n", inflight[READ], inflight[WRITE]);
}
@@ -1308,7 +1318,7 @@ static int diskstats_show(struct seq_file *seqf, void *v)
if (bdev_is_partition(hd) && !bdev_nr_sectors(hd))
continue;
- inflight = part_in_flight(hd);
+ inflight = bdev_count_inflight(hd);
if (inflight) {
part_stat_lock();
update_io_ticks(hd, jiffies, true);
diff --git a/include/linux/part_stat.h b/include/linux/part_stat.h
index c5e9cac0575e..eeeff2a04529 100644
--- a/include/linux/part_stat.h
+++ b/include/linux/part_stat.h
@@ -79,4 +79,6 @@ static inline void part_stat_set_all(struct block_device *part, int value)
#define part_stat_local_read_cpu(part, field, cpu) \
local_read(&(part_stat_get_cpu(part, field, cpu)))
+unsigned int bdev_count_inflight(struct block_device *part);
+
#endif /* _LINUX_PART_STAT_H */
--
2.39.2
* Re: [PATCH v2 5/9] block: export API to get the number of bdev inflight IO
2025-04-27 8:29 ` [PATCH v2 5/9] block: export API to get the number of bdev inflight IO Yu Kuai
@ 2025-04-28 13:08 ` Christoph Hellwig
2025-04-29 6:32 ` Hannes Reinecke
1 sibling, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2025-04-28 13:08 UTC (permalink / raw)
To: Yu Kuai
Cc: hch, axboe, xni, agk, snitzer, mpatocka, song, yukuai3, cl,
nadav.amit, ubizjak, akpm, linux-block, linux-kernel, dm-devel,
linux-raid, yi.zhang, yangerkun, johnny.chenyi
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH v2 5/9] block: export API to get the number of bdev inflight IO
2025-04-27 8:29 ` [PATCH v2 5/9] block: export API to get the number of bdev inflight IO Yu Kuai
2025-04-28 13:08 ` Christoph Hellwig
@ 2025-04-29 6:32 ` Hannes Reinecke
1 sibling, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2025-04-29 6:32 UTC (permalink / raw)
To: Yu Kuai, hch, axboe, xni, agk, snitzer, mpatocka, song, yukuai3,
cl, nadav.amit, ubizjak, akpm
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yi.zhang,
yangerkun, johnny.chenyi
On 4/27/25 10:29, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> - rename part_in_{flight, flight_rw} to bdev_count_{inflight, inflight_rw}
> - export bdev_count_inflight, to fix a problem in mdraid that foreground
> IO can be starved by background sync IO in later patches
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
> block/blk-core.c | 2 +-
> block/blk.h | 1 -
> block/genhd.c | 22 ++++++++++++++++------
> include/linux/part_stat.h | 2 ++
> 4 files changed, 19 insertions(+), 8 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* [PATCH v2 6/9] md: record dm-raid gendisk in mddev
2025-04-27 8:29 [PATCH v2 0/9] md: fix is_mddev_idle() Yu Kuai
` (4 preceding siblings ...)
2025-04-27 8:29 ` [PATCH v2 5/9] block: export API to get the number of bdev inflight IO Yu Kuai
@ 2025-04-27 8:29 ` Yu Kuai
2025-04-27 8:29 ` [PATCH v2 7/9] md: add a new api sync_io_depth Yu Kuai
` (3 subsequent siblings)
9 siblings, 0 replies; 31+ messages in thread
From: Yu Kuai @ 2025-04-27 8:29 UTC (permalink / raw)
To: hch, axboe, xni, agk, snitzer, mpatocka, song, yukuai3, cl,
nadav.amit, ubizjak, akpm
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
Later patches will use the gendisk to check whether normal IO has
completed or is inflight, to fix a problem in mdraid where foreground
IO can be starved by background sync IO.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
---
drivers/md/dm-raid.c | 3 +++
drivers/md/md.h | 3 ++-
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 6adc55fd90d3..127138c61be5 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -14,6 +14,7 @@
#include "raid5.h"
#include "raid10.h"
#include "md-bitmap.h"
+#include "dm-core.h"
#include <linux/device-mapper.h>
@@ -3308,6 +3309,7 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
/* Disable/enable discard support on raid set. */
configure_discard_support(rs);
+ rs->md.dm_gendisk = ti->table->md->disk;
mddev_unlock(&rs->md);
return 0;
@@ -3327,6 +3329,7 @@ static void raid_dtr(struct dm_target *ti)
mddev_lock_nointr(&rs->md);
md_stop(&rs->md);
+ rs->md.dm_gendisk = NULL;
mddev_unlock(&rs->md);
if (work_pending(&rs->md.event_work))
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 1cf00a04bcdd..9d55b4630077 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -404,7 +404,8 @@ struct mddev {
* are happening, so run/
* takeover/stop are not safe
*/
- struct gendisk *gendisk;
+ struct gendisk *gendisk; /* mdraid gendisk */
+ struct gendisk *dm_gendisk; /* dm-raid gendisk */
struct kobject kobj;
int hold_active;
--
2.39.2
* [PATCH v2 7/9] md: add a new api sync_io_depth
2025-04-27 8:29 [PATCH v2 0/9] md: fix is_mddev_idle() Yu Kuai
` (5 preceding siblings ...)
2025-04-27 8:29 ` [PATCH v2 6/9] md: record dm-raid gendisk in mddev Yu Kuai
@ 2025-04-27 8:29 ` Yu Kuai
2025-04-27 8:29 ` [PATCH v2 8/9] md: fix is_mddev_idle() Yu Kuai
` (2 subsequent siblings)
9 siblings, 0 replies; 31+ messages in thread
From: Yu Kuai @ 2025-04-27 8:29 UTC (permalink / raw)
To: hch, axboe, xni, agk, snitzer, mpatocka, song, yukuai3, cl,
nadav.amit, ubizjak, akpm
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
Currently, if the sync speed is above speed_min and below speed_max,
md_do_sync() will wait for all sync IOs to be done before issuing new
sync IO, which means the sync IO depth is limited to just 1.
This limit is too low. To prevent the sync speed from dropping
conspicuously after fixing is_mddev_idle() in the next patch, add a new
API for limiting the sync IO depth; the default value is 32.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
---
drivers/md/md.c | 109 +++++++++++++++++++++++++++++++++++++++---------
drivers/md/md.h | 1 +
2 files changed, 91 insertions(+), 19 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 9daa78c5fe33..541151bcfe81 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -111,32 +111,48 @@ static void md_wakeup_thread_directly(struct md_thread __rcu *thread);
/* Default safemode delay: 200 msec */
#define DEFAULT_SAFEMODE_DELAY ((200 * HZ)/1000 +1)
/*
- * Current RAID-1,4,5 parallel reconstruction 'guaranteed speed limit'
- * is 1000 KB/sec, so the extra system load does not show up that much.
- * Increase it if you want to have more _guaranteed_ speed. Note that
- * the RAID driver will use the maximum available bandwidth if the IO
- * subsystem is idle. There is also an 'absolute maximum' reconstruction
- * speed limit - in case reconstruction slows down your system despite
- * idle IO detection.
+ * Current RAID-1,4,5,6,10 parallel reconstruction 'guaranteed speed limit'
+ * is sysctl_speed_limit_min, 1000 KB/sec by default, so the extra system load
+ * does not show up that much. Increase it if you want to have more guaranteed
+ * speed. Note that the RAID driver will use the maximum bandwidth
+ * sysctl_speed_limit_max, 200 MB/sec by default, if the IO subsystem is idle.
*
- * you can change it via /proc/sys/dev/raid/speed_limit_min and _max.
- * or /sys/block/mdX/md/sync_speed_{min,max}
+ * Background sync IO speed control:
+ *
+ * - below speed min:
+ * no limit;
+ * - above speed min and below speed max:
+ * a) if mddev is idle, then no limit;
+ * b) if mddev is busy handling normal IO, then limit inflight sync IO
+ * to sync_io_depth;
+ * - above speed max:
+ * sync IO can't be issued;
+ *
+ * Following configurations can be changed via /proc/sys/dev/raid/ for system
+ * or /sys/block/mdX/md/ for one array.
*/
-
static int sysctl_speed_limit_min = 1000;
static int sysctl_speed_limit_max = 200000;
-static inline int speed_min(struct mddev *mddev)
+static int sysctl_sync_io_depth = 32;
+
+static int speed_min(struct mddev *mddev)
{
return mddev->sync_speed_min ?
mddev->sync_speed_min : sysctl_speed_limit_min;
}
-static inline int speed_max(struct mddev *mddev)
+static int speed_max(struct mddev *mddev)
{
return mddev->sync_speed_max ?
mddev->sync_speed_max : sysctl_speed_limit_max;
}
+static int sync_io_depth(struct mddev *mddev)
+{
+ return mddev->sync_io_depth ?
+ mddev->sync_io_depth : sysctl_sync_io_depth;
+}
+
static void rdev_uninit_serial(struct md_rdev *rdev)
{
if (!test_and_clear_bit(CollisionCheck, &rdev->flags))
@@ -293,14 +309,21 @@ static const struct ctl_table raid_table[] = {
.procname = "speed_limit_min",
.data = &sysctl_speed_limit_min,
.maxlen = sizeof(int),
- .mode = S_IRUGO|S_IWUSR,
+ .mode = 0644,
.proc_handler = proc_dointvec,
},
{
.procname = "speed_limit_max",
.data = &sysctl_speed_limit_max,
.maxlen = sizeof(int),
- .mode = S_IRUGO|S_IWUSR,
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "sync_io_depth",
+ .data = &sysctl_sync_io_depth,
+ .maxlen = sizeof(int),
+ .mode = 0644,
.proc_handler = proc_dointvec,
},
};
@@ -5091,7 +5114,7 @@ static ssize_t
sync_min_show(struct mddev *mddev, char *page)
{
return sprintf(page, "%d (%s)\n", speed_min(mddev),
- mddev->sync_speed_min ? "local": "system");
+ mddev->sync_speed_min ? "local" : "system");
}
static ssize_t
@@ -5100,7 +5123,7 @@ sync_min_store(struct mddev *mddev, const char *buf, size_t len)
unsigned int min;
int rv;
- if (strncmp(buf, "system", 6)==0) {
+ if (strncmp(buf, "system", 6) == 0) {
min = 0;
} else {
rv = kstrtouint(buf, 10, &min);
@@ -5120,7 +5143,7 @@ static ssize_t
sync_max_show(struct mddev *mddev, char *page)
{
return sprintf(page, "%d (%s)\n", speed_max(mddev),
- mddev->sync_speed_max ? "local": "system");
+ mddev->sync_speed_max ? "local" : "system");
}
static ssize_t
@@ -5129,7 +5152,7 @@ sync_max_store(struct mddev *mddev, const char *buf, size_t len)
unsigned int max;
int rv;
- if (strncmp(buf, "system", 6)==0) {
+ if (strncmp(buf, "system", 6) == 0) {
max = 0;
} else {
rv = kstrtouint(buf, 10, &max);
@@ -5145,6 +5168,35 @@ sync_max_store(struct mddev *mddev, const char *buf, size_t len)
static struct md_sysfs_entry md_sync_max =
__ATTR(sync_speed_max, S_IRUGO|S_IWUSR, sync_max_show, sync_max_store);
+static ssize_t
+sync_io_depth_show(struct mddev *mddev, char *page)
+{
+ return sprintf(page, "%d (%s)\n", sync_io_depth(mddev),
+ mddev->sync_io_depth ? "local" : "system");
+}
+
+static ssize_t
+sync_io_depth_store(struct mddev *mddev, const char *buf, size_t len)
+{
+ unsigned int max;
+ int rv;
+
+ if (strncmp(buf, "system", 6) == 0) {
+ max = 0;
+ } else {
+ rv = kstrtouint(buf, 10, &max);
+ if (rv < 0)
+ return rv;
+ if (max == 0)
+ return -EINVAL;
+ }
+ mddev->sync_io_depth = max;
+ return len;
+}
+
+static struct md_sysfs_entry md_sync_io_depth =
+__ATTR_RW(sync_io_depth);
+
static ssize_t
degraded_show(struct mddev *mddev, char *page)
{
@@ -5671,6 +5723,7 @@ static struct attribute *md_redundancy_attrs[] = {
&md_mismatches.attr,
&md_sync_min.attr,
&md_sync_max.attr,
+ &md_sync_io_depth.attr,
&md_sync_speed.attr,
&md_sync_force_parallel.attr,
&md_sync_completed.attr,
@@ -8927,6 +8980,23 @@ static sector_t md_sync_position(struct mddev *mddev, enum sync_action action)
}
}
+static bool sync_io_within_limit(struct mddev *mddev)
+{
+ int io_sectors;
+
+ /*
+ * For raid456, sync IO is stripe(4k) per IO, for other levels, it's
+ * RESYNC_PAGES(64k) per IO.
+ */
+ if (mddev->level == 4 || mddev->level == 5 || mddev->level == 6)
+ io_sectors = 8;
+ else
+ io_sectors = 128;
+
+ return atomic_read(&mddev->recovery_active) <
+ io_sectors * sync_io_depth(mddev);
+}
+
#define SYNC_MARKS 10
#define SYNC_MARK_STEP (3*HZ)
#define UPDATE_FREQUENCY (5*60*HZ)
@@ -9195,7 +9265,8 @@ void md_do_sync(struct md_thread *thread)
msleep(500);
goto repeat;
}
- if (!is_mddev_idle(mddev, 0)) {
+ if (!sync_io_within_limit(mddev) &&
+ !is_mddev_idle(mddev, 0)) {
/*
* Give other IO more of a chance.
* The faster the devices, the less we wait.
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 9d55b4630077..b57842188f18 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -484,6 +484,7 @@ struct mddev {
/* if zero, use the system-wide default */
int sync_speed_min;
int sync_speed_max;
+ int sync_io_depth;
/* resync even though the same disks are shared among md-devices */
int parallel_resync;
--
2.39.2
* [PATCH v2 8/9] md: fix is_mddev_idle()
2025-04-27 8:29 [PATCH v2 0/9] md: fix is_mddev_idle() Yu Kuai
` (6 preceding siblings ...)
2025-04-27 8:29 ` [PATCH v2 7/9] md: add a new api sync_io_depth Yu Kuai
@ 2025-04-27 8:29 ` Yu Kuai
2025-04-27 9:51 ` Paul Menzel
2025-04-27 8:29 ` [PATCH v2 9/9] md: cleanup accounting for issued sync IO Yu Kuai
2025-04-27 9:59 ` [PATCH v2 0/9] md: fix is_mddev_idle() Paul Menzel
9 siblings, 1 reply; 31+ messages in thread
From: Yu Kuai @ 2025-04-27 8:29 UTC (permalink / raw)
To: hch, axboe, xni, agk, snitzer, mpatocka, song, yukuai3, cl,
nadav.amit, ubizjak, akpm
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
If sync_speed is above speed_min, then is_mddev_idle() will be called
for each sync IO to check if the array is idle, and inflihgt sync_io
will be limited if the array is not idle.
However, while mkfs.ext4 for a large raid5 array while recovery is in
progress, it's found that sync_speed is already above speed_min while
lots of stripes are used for sync IO, causing long delay for mkfs.ext4.
Root cause is the following checking from is_mddev_idle():
t1: submit sync IO: events1 = completed IO - issued sync IO
t2: submit next sync IO: events2 = completed IO - issued sync IO
if (events2 - events1 > 64)
For consequence, the more sync IO issued, the less likely checking will
pass. And when completed normal IO is more than issued sync IO, the
condition will finally pass and is_mddev_idle() will return false,
however, last_events will be updated hence is_mddev_idle() can only
return false once in a while.
Fix this problem by changing the checking as following:
1) mddev doesn't have normal IO completed;
2) mddev doesn't have normal IO inflight;
3) if any member disks is partition, and all other partitions doesn't
have IO completed.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
drivers/md/md.c | 84 +++++++++++++++++++++++++++----------------------
drivers/md/md.h | 3 +-
2 files changed, 48 insertions(+), 39 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 541151bcfe81..955efe0b40c6 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8625,50 +8625,58 @@ void md_cluster_stop(struct mddev *mddev)
put_cluster_ops(mddev);
}
-static int is_mddev_idle(struct mddev *mddev, int init)
+static bool is_rdev_holder_idle(struct md_rdev *rdev, bool init)
{
+ unsigned long last_events = rdev->last_events;
+
+ if (!bdev_is_partition(rdev->bdev))
+ return true;
+
+ /*
+ * If rdev is partition, and user doesn't issue IO to the array, the
+ * array is still not idle if user issues IO to other partitions.
+ */
+ rdev->last_events = part_stat_read_accum(rdev->bdev->bd_disk->part0,
+ sectors) -
+ part_stat_read_accum(rdev->bdev, sectors);
+
+ if (!init && rdev->last_events > last_events)
+ return false;
+
+ return true;
+}
+
+/*
+ * mddev is idle if following conditions are match since last check:
+ * 1) mddev doesn't have normal IO completed;
+ * 2) mddev doesn't have inflight normal IO;
+ * 3) if any member disk is partition, and other partitions doesn't have IO
+ * completed;
+ *
+ * Noted this checking rely on IO accounting is enabled.
+ */
+static bool is_mddev_idle(struct mddev *mddev, int init)
+{
+ unsigned long last_events = mddev->normal_IO_events;
+ struct gendisk *disk;
struct md_rdev *rdev;
- int idle;
- int curr_events;
+ bool idle = true;
- idle = 1;
- rcu_read_lock();
- rdev_for_each_rcu(rdev, mddev) {
- struct gendisk *disk = rdev->bdev->bd_disk;
+ disk = mddev_is_dm(mddev) ? mddev->dm_gendisk : mddev->gendisk;
+ if (!disk)
+ return true;
- if (!init && !blk_queue_io_stat(disk->queue))
- continue;
+ mddev->normal_IO_events = part_stat_read_accum(disk->part0, sectors);
+ if (!init && (mddev->normal_IO_events > last_events ||
+ bdev_count_inflight(disk->part0)))
+ idle = false;
- curr_events = (int)part_stat_read_accum(disk->part0, sectors) -
- atomic_read(&disk->sync_io);
- /* sync IO will cause sync_io to increase before the disk_stats
- * as sync_io is counted when a request starts, and
- * disk_stats is counted when it completes.
- * So resync activity will cause curr_events to be smaller than
- * when there was no such activity.
- * non-sync IO will cause disk_stat to increase without
- * increasing sync_io so curr_events will (eventually)
- * be larger than it was before. Once it becomes
- * substantially larger, the test below will cause
- * the array to appear non-idle, and resync will slow
- * down.
- * If there is a lot of outstanding resync activity when
- * we set last_event to curr_events, then all that activity
- * completing might cause the array to appear non-idle
- * and resync will be slowed down even though there might
- * not have been non-resync activity. This will only
- * happen once though. 'last_events' will soon reflect
- * the state where there is little or no outstanding
- * resync requests, and further resync activity will
- * always make curr_events less than last_events.
- *
- */
- if (init || curr_events - rdev->last_events > 64) {
- rdev->last_events = curr_events;
- idle = 0;
- }
- }
+ rcu_read_lock();
+ rdev_for_each_rcu(rdev, mddev)
+ if (!is_rdev_holder_idle(rdev, init))
+ idle = false;
rcu_read_unlock();
+
return idle;
}
diff --git a/drivers/md/md.h b/drivers/md/md.h
index b57842188f18..da3fd514d20c 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -132,7 +132,7 @@ struct md_rdev {
sector_t sectors; /* Device size (in 512bytes sectors) */
struct mddev *mddev; /* RAID array if running */
- int last_events; /* IO event timestamp */
+ unsigned long last_events; /* IO event timestamp */
/*
* If meta_bdev is non-NULL, it means that a separate device is
@@ -520,6 +520,7 @@ struct mddev {
* adding a spare
*/
+ unsigned long normal_IO_events; /* IO event timestamp */
atomic_t recovery_active; /* blocks scheduled, but not written */
wait_queue_head_t recovery_wait;
sector_t recovery_cp;
--
2.39.2
* Re: [PATCH v2 8/9] md: fix is_mddev_idle()
2025-04-27 8:29 ` [PATCH v2 8/9] md: fix is_mddev_idle() Yu Kuai
@ 2025-04-27 9:51 ` Paul Menzel
2025-04-29 5:45 ` Xiao Ni
0 siblings, 1 reply; 31+ messages in thread
From: Paul Menzel @ 2025-04-27 9:51 UTC (permalink / raw)
To: Yu Kuai
Cc: hch, axboe, xni, agk, snitzer, mpatocka, song, yukuai3, cl,
nadav.amit, ubizjak, akpm, linux-block, linux-kernel, dm-devel,
linux-raid, yi.zhang, yangerkun, johnny.chenyi
Dear Kuai,
Thank you for your patch.
On 27.04.25 10:29, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> If sync_speed is above speed_min, then is_mddev_idle() will be called
> for each sync IO to check if the array is idle, and inflihgt sync_io
infli*gh*t
> will be limited if the array is not idle.
>
> However, while mkfs.ext4 for a large raid5 array while recovery is in
> progress, it's found that sync_speed is already above speed_min while
> lots of stripes are used for sync IO, causing long delay for mkfs.ext4.
>
> Root cause is the following checking from is_mddev_idle():
>
> t1: submit sync IO: events1 = completed IO - issued sync IO
> t2: submit next sync IO: events2 = completed IO - issued sync IO
> if (events2 - events1 > 64)
>
> For consequence, the more sync IO issued, the less likely checking will
> pass. And when completed normal IO is more than issued sync IO, the
> condition will finally pass and is_mddev_idle() will return false,
> however, last_events will be updated hence is_mddev_idle() can only
> return false once in a while.
>
> Fix this problem by changing the checking as following:
>
> 1) mddev doesn't have normal IO completed;
> 2) mddev doesn't have normal IO inflight;
> 3) if any member disks is partition, and all other partitions doesn't
> have IO completed.
Do you have benchmarks of mkfs.ext4 before and after your patch? It’d be
great if you added those.
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
> drivers/md/md.c | 84 +++++++++++++++++++++++++++----------------------
> drivers/md/md.h | 3 +-
> 2 files changed, 48 insertions(+), 39 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 541151bcfe81..955efe0b40c6 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8625,50 +8625,58 @@ void md_cluster_stop(struct mddev *mddev)
> put_cluster_ops(mddev);
> }
>
> -static int is_mddev_idle(struct mddev *mddev, int init)
> +static bool is_rdev_holder_idle(struct md_rdev *rdev, bool init)
> {
> + unsigned long last_events = rdev->last_events;
> +
> + if (!bdev_is_partition(rdev->bdev))
> + return true;
Will the compiler generate code, that the assignment happens after this
condition?
> +
> + /*
> + * If rdev is partition, and user doesn't issue IO to the array, the
> + * array is still not idle if user issues IO to other partitions.
> + */
> + rdev->last_events = part_stat_read_accum(rdev->bdev->bd_disk->part0,
> + sectors) -
> + part_stat_read_accum(rdev->bdev, sectors);
> +
> + if (!init && rdev->last_events > last_events)
> + return false;
> +
> + return true;
Could be one return statement, couldn’t it?
return init || rdev->last_events <= last_events;
> +}
> +
> +/*
> + * mddev is idle if following conditions are match since last check:
… *the* following condition are match*ed* …
(or are met)
> + * 1) mddev doesn't have normal IO completed;
> + * 2) mddev doesn't have inflight normal IO;
> + * 3) if any member disk is partition, and other partitions doesn't have IO
don’t
> + * completed;
> + *
> + * Noted this checking rely on IO accounting is enabled.
> + */
> +static bool is_mddev_idle(struct mddev *mddev, int init)
> +{
> + unsigned long last_events = mddev->normal_IO_events;
> + struct gendisk *disk;
> struct md_rdev *rdev;
> - int idle;
> - int curr_events;
> + bool idle = true;
>
> - idle = 1;
> - rcu_read_lock();
> - rdev_for_each_rcu(rdev, mddev) {
> - struct gendisk *disk = rdev->bdev->bd_disk;
> + disk = mddev_is_dm(mddev) ? mddev->dm_gendisk : mddev->gendisk;
> + if (!disk)
> + return true;
>
> - if (!init && !blk_queue_io_stat(disk->queue))
> - continue;
> + mddev->normal_IO_events = part_stat_read_accum(disk->part0, sectors);
> + if (!init && (mddev->normal_IO_events > last_events ||
> + bdev_count_inflight(disk->part0)))
> + idle = false;
>
> - curr_events = (int)part_stat_read_accum(disk->part0, sectors) -
> - atomic_read(&disk->sync_io);
> - /* sync IO will cause sync_io to increase before the disk_stats
> - * as sync_io is counted when a request starts, and
> - * disk_stats is counted when it completes.
> - * So resync activity will cause curr_events to be smaller than
> - * when there was no such activity.
> - * non-sync IO will cause disk_stat to increase without
> - * increasing sync_io so curr_events will (eventually)
> - * be larger than it was before. Once it becomes
> - * substantially larger, the test below will cause
> - * the array to appear non-idle, and resync will slow
> - * down.
> - * If there is a lot of outstanding resync activity when
> - * we set last_event to curr_events, then all that activity
> - * completing might cause the array to appear non-idle
> - * and resync will be slowed down even though there might
> - * not have been non-resync activity. This will only
> - * happen once though. 'last_events' will soon reflect
> - * the state where there is little or no outstanding
> - * resync requests, and further resync activity will
> - * always make curr_events less than last_events.
> - *
> - */
> - if (init || curr_events - rdev->last_events > 64) {
> - rdev->last_events = curr_events;
> - idle = 0;
> - }
> - }
> + rcu_read_lock();
> + rdev_for_each_rcu(rdev, mddev)
> + if (!is_rdev_holder_idle(rdev, init))
> + idle = false;
> rcu_read_unlock();
> +
> return idle;
> }
>
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index b57842188f18..da3fd514d20c 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -132,7 +132,7 @@ struct md_rdev {
>
> sector_t sectors; /* Device size (in 512bytes sectors) */
> struct mddev *mddev; /* RAID array if running */
> - int last_events; /* IO event timestamp */
> + unsigned long last_events; /* IO event timestamp */
Please mention in the commit message, why the type is changed.
>
> /*
> * If meta_bdev is non-NULL, it means that a separate device is
> @@ -520,6 +520,7 @@ struct mddev {
> * adding a spare
> */
>
> + unsigned long normal_IO_events; /* IO event timestamp */
Make everything lower case?
> atomic_t recovery_active; /* blocks scheduled, but not written */
> wait_queue_head_t recovery_wait;
> sector_t recovery_cp;
Kind regards,
Paul
* Re: [PATCH v2 8/9] md: fix is_mddev_idle()
2025-04-27 9:51 ` Paul Menzel
@ 2025-04-29 5:45 ` Xiao Ni
2025-04-29 9:12 ` Yu Kuai
0 siblings, 1 reply; 31+ messages in thread
From: Xiao Ni @ 2025-04-29 5:45 UTC (permalink / raw)
To: Paul Menzel, Yu Kuai
Cc: hch, axboe, agk, snitzer, mpatocka, song, yukuai3, cl, nadav.amit,
ubizjak, akpm, linux-block, linux-kernel, dm-devel, linux-raid,
yi.zhang, yangerkun, johnny.chenyi
On 2025/4/27 17:51, Paul Menzel wrote:
> Dear Kuai,
>
>
> Thank you for your patch.
>
>
> On 27.04.25 10:29, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> If sync_speed is above speed_min, then is_mddev_idle() will be called
>> for each sync IO to check if the array is idle, and inflihgt sync_io
>
> infli*gh*t
>
>> will be limited if the array is not idle.
>>
>> However, while mkfs.ext4 for a large raid5 array while recovery is in
>> progress, it's found that sync_speed is already above speed_min while
>> lots of stripes are used for sync IO, causing long delay for mkfs.ext4.
>>
>> Root cause is the following checking from is_mddev_idle():
>>
>> t1: submit sync IO: events1 = completed IO - issued sync IO
>> t2: submit next sync IO: events2 = completed IO - issued sync IO
>> if (events2 - events1 > 64)
>>
>> For consequence, the more sync IO issued, the less likely checking will
>> pass. And when completed normal IO is more than issued sync IO, the
>> condition will finally pass and is_mddev_idle() will return false,
>> however, last_events will be updated hence is_mddev_idle() can only
>> return false once in a while.
>>
>> Fix this problem by changing the checking as following:
>>
>> 1) mddev doesn't have normal IO completed;
>> 2) mddev doesn't have normal IO inflight;
>> 3) if any member disks is partition, and all other partitions doesn't
>> have IO completed.
>
> Do you have benchmarks of mkfs.ext4 before and after your patch? It’d
> be great if you added those.
>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>> drivers/md/md.c | 84 +++++++++++++++++++++++++++----------------------
>> drivers/md/md.h | 3 +-
>> 2 files changed, 48 insertions(+), 39 deletions(-)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 541151bcfe81..955efe0b40c6 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -8625,50 +8625,58 @@ void md_cluster_stop(struct mddev *mddev)
>> put_cluster_ops(mddev);
>> }
>> -static int is_mddev_idle(struct mddev *mddev, int init)
>> +static bool is_rdev_holder_idle(struct md_rdev *rdev, bool init)
>> {
>> + unsigned long last_events = rdev->last_events;
>> +
>> + if (!bdev_is_partition(rdev->bdev))
>> + return true;
>
> Will the compiler generate code, that the assignment happens after
> this condition?
>
>> +
>> + /*
>> + * If rdev is partition, and user doesn't issue IO to the array,
>> the
>> + * array is still not idle if user issues IO to other partitions.
>> + */
>> + rdev->last_events =
>> part_stat_read_accum(rdev->bdev->bd_disk->part0,
>> + sectors) -
>> + part_stat_read_accum(rdev->bdev, sectors);
>> +
>> + if (!init && rdev->last_events > last_events)
>> + return false;
>> +
>> + return true;
>
> Could be one return statement, couldn’t it?
>
> return init || rdev->last_events <= last_events;
For me, I prefer the way of this patch. It's easy to understand. One
return statement is harder to understand than the two return statements.
>
>> +}
>> +
>> +/*
>> + * mddev is idle if following conditions are match since last check:
>
> … *the* following condition are match*ed* …
>
> (or are met)
>
>> + * 1) mddev doesn't have normal IO completed;
>> + * 2) mddev doesn't have inflight normal IO;
>> + * 3) if any member disk is partition, and other partitions doesn't
>> have IO
>
> don’t
>
>> + * completed;
>> + *
>> + * Noted this checking rely on IO accounting is enabled.
>> + */
>> +static bool is_mddev_idle(struct mddev *mddev, int init)
>> +{
>> + unsigned long last_events = mddev->normal_IO_events;
>> + struct gendisk *disk;
>> struct md_rdev *rdev;
>> - int idle;
>> - int curr_events;
>> + bool idle = true;
>> - idle = 1;
>> - rcu_read_lock();
>> - rdev_for_each_rcu(rdev, mddev) {
>> - struct gendisk *disk = rdev->bdev->bd_disk;
>> + disk = mddev_is_dm(mddev) ? mddev->dm_gendisk : mddev->gendisk;
>> + if (!disk)
>> + return true;
>> - if (!init && !blk_queue_io_stat(disk->queue))
>> - continue;
>> + mddev->normal_IO_events = part_stat_read_accum(disk->part0,
>> sectors);
>> + if (!init && (mddev->normal_IO_events > last_events ||
>> + bdev_count_inflight(disk->part0)))
>> + idle = false;
>> - curr_events = (int)part_stat_read_accum(disk->part0,
>> sectors) -
>> - atomic_read(&disk->sync_io);
>> - /* sync IO will cause sync_io to increase before the disk_stats
>> - * as sync_io is counted when a request starts, and
>> - * disk_stats is counted when it completes.
>> - * So resync activity will cause curr_events to be smaller than
>> - * when there was no such activity.
>> - * non-sync IO will cause disk_stat to increase without
>> - * increasing sync_io so curr_events will (eventually)
>> - * be larger than it was before. Once it becomes
>> - * substantially larger, the test below will cause
>> - * the array to appear non-idle, and resync will slow
>> - * down.
>> - * If there is a lot of outstanding resync activity when
>> - * we set last_event to curr_events, then all that activity
>> - * completing might cause the array to appear non-idle
>> - * and resync will be slowed down even though there might
>> - * not have been non-resync activity. This will only
>> - * happen once though. 'last_events' will soon reflect
>> - * the state where there is little or no outstanding
>> - * resync requests, and further resync activity will
>> - * always make curr_events less than last_events.
>> - *
>> - */
>> - if (init || curr_events - rdev->last_events > 64) {
>> - rdev->last_events = curr_events;
>> - idle = 0;
>> - }
>> - }
>> + rcu_read_lock();
>> + rdev_for_each_rcu(rdev, mddev)
>> + if (!is_rdev_holder_idle(rdev, init))
>> + idle = false;
>> rcu_read_unlock();
>> +
>> return idle;
>> }
>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> index b57842188f18..da3fd514d20c 100644
>> --- a/drivers/md/md.h
>> +++ b/drivers/md/md.h
>> @@ -132,7 +132,7 @@ struct md_rdev {
>> sector_t sectors; /* Device size (in 512bytes sectors) */
>> struct mddev *mddev; /* RAID array if running */
>> - int last_events; /* IO event timestamp */
>> + unsigned long last_events; /* IO event timestamp */
>
> Please mention in the commit message, why the type is changed.
>
>> /*
>> * If meta_bdev is non-NULL, it means that a separate device is
>> @@ -520,6 +520,7 @@ struct mddev {
>> * adding a spare
>> */
>> + unsigned long normal_IO_events; /* IO event
>> timestamp */
>
> Make everything lower case?
agree+
Regards
Xiao
>
>> atomic_t recovery_active; /* blocks scheduled, but
>> not written */
>> wait_queue_head_t recovery_wait;
>> sector_t recovery_cp;
>
>
> Kind regards,
>
> Paul
>
2025-04-29 5:45 ` Xiao Ni
@ 2025-04-29 9:12 ` Yu Kuai
0 siblings, 0 replies; 31+ messages in thread
From: Yu Kuai @ 2025-04-29 9:12 UTC (permalink / raw)
To: Xiao Ni, Paul Menzel, Yu Kuai
Cc: hch, axboe, agk, snitzer, mpatocka, song, cl, nadav.amit, ubizjak,
akpm, linux-block, linux-kernel, dm-devel, linux-raid, yi.zhang,
yangerkun, johnny.chenyi, yukuai (C)
Hi,
On 2025/04/29 13:45, Xiao Ni wrote:
>
> On 2025/4/27 17:51, Paul Menzel wrote:
>> Dear Kuai,
>>
>>
>> Thank you for your patch.
>>
>>
>> On 27.04.25 at 10:29, Yu Kuai wrote:
>>> From: Yu Kuai <yukuai3@huawei.com>
>>>
>>> If sync_speed is above speed_min, then is_mddev_idle() will be called
>>> for each sync IO to check if the array is idle, and inflihgt sync_io
>>
>> infli*gh*t
>>
>>> will be limited if the array is not idle.
>>>
>>> However, while mkfs.ext4 for a large raid5 array while recovery is in
>>> progress, it's found that sync_speed is already above speed_min while
>>> lots of stripes are used for sync IO, causing long delay for mkfs.ext4.
>>>
>>> Root cause is the following checking from is_mddev_idle():
>>>
>>> t1: submit sync IO: events1 = completed IO - issued sync IO
>>> t2: submit next sync IO: events2 = completed IO - issued sync IO
>>> if (events2 - events1 > 64)
>>>
>>> For consequence, the more sync IO issued, the less likely checking will
>>> pass. And when completed normal IO is more than issued sync IO, the
>>> condition will finally pass and is_mddev_idle() will return false,
>>> however, last_events will be updated hence is_mddev_idle() can only
>>> return false once in a while.
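The failure mode described above can be reproduced with a tiny user-space model of the old arithmetic (the sector counts below are made up for illustration; this is a sketch, not the kernel code):

```c
#include <assert.h>

/* Model of the old is_mddev_idle() check: curr_events is completed
 * IO sectors minus issued sync IO sectors, and the array is only
 * considered busy when the delta since the last check exceeds 64.
 */
static long last_events;

static int old_check_idle(long completed_sectors, long issued_sync_sectors)
{
	long curr_events = completed_sectors - issued_sync_sectors;

	if (curr_events - last_events > 64) {
		last_events = curr_events;
		return 0;	/* "busy": throttle sync IO */
	}
	return 1;		/* "idle": sync runs at full speed */
}
```

Because every issued sync IO lowers curr_events before the matching completion raises the part stats, heavy resync keeps the delta under 64 and the check almost never reports the array busy; when it finally does, last_events is refreshed, so it fires only once in a while.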
>>>
>>> Fix this problem by changing the checking as following:
>>>
>>> 1) mddev doesn't have normal IO completed;
>>> 2) mddev doesn't have normal IO inflight;
>>> 3) if any member disks is partition, and all other partitions doesn't
>>> have IO completed.
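With similarly made-up numbers, the reworked conditions can be modeled like this (again a user-space sketch, not the kernel code; `bdev_count_inflight()` and the per-partition stat reads are reduced to plain parameters):

```c
#include <assert.h>

static unsigned long mddev_last_events;	/* completed normal IO sectors */
static unsigned long rdev_last_events;	/* IO on other partitions      */

/* Sketch of the reworked check: idle only if no normal IO completed
 * since the last check, nothing is inflight, and (for a partition
 * member) the rest of the disk saw no IO either.
 */
static int new_check_idle(unsigned long completed_sectors,
			  unsigned int inflight,
			  unsigned long disk_sectors,	/* whole-disk accum */
			  unsigned long part_sectors,	/* this partition   */
			  int init)
{
	unsigned long last = mddev_last_events;
	unsigned long other = disk_sectors - part_sectors;
	int idle = 1;

	mddev_last_events = completed_sectors;
	if (!init && (completed_sectors > last || inflight))
		idle = 0;

	/* IO completed on other partitions of the same disk */
	if (!init && other > rdev_last_events)
		idle = 0;
	rdev_last_events = other;

	return idle;
}
```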
>>
>> Do you have benchmarks of mkfs.ext4 before and after your patch? It’d
>> be great if you added those.
>>
>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>> ---
>>> drivers/md/md.c | 84 +++++++++++++++++++++++++++----------------------
>>> drivers/md/md.h | 3 +-
>>> 2 files changed, 48 insertions(+), 39 deletions(-)
>>>
>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>> index 541151bcfe81..955efe0b40c6 100644
>>> --- a/drivers/md/md.c
>>> +++ b/drivers/md/md.c
>>> @@ -8625,50 +8625,58 @@ void md_cluster_stop(struct mddev *mddev)
>>> put_cluster_ops(mddev);
>>> }
>>> -static int is_mddev_idle(struct mddev *mddev, int init)
>>> +static bool is_rdev_holder_idle(struct md_rdev *rdev, bool init)
>>> {
>>> + unsigned long last_events = rdev->last_events;
>>> +
>>> + if (!bdev_is_partition(rdev->bdev))
>>> + return true;
>>
>> Will the compiler generate code, that the assignment happens after
>> this condition?
Usually the compiler won't. And this is fine because resync is not a hot path.
>>
>>> +
>>> + /*
>>> + * If rdev is partition, and user doesn't issue IO to the array,
>>> the
>>> + * array is still not idle if user issues IO to other partitions.
>>> + */
>>> + rdev->last_events =
>>> part_stat_read_accum(rdev->bdev->bd_disk->part0,
>>> + sectors) -
>>> + part_stat_read_accum(rdev->bdev, sectors);
>>> +
>>> + if (!init && rdev->last_events > last_events)
>>> + return false;
>>> +
>>> + return true;
>>
>> Could be one return statement, couldn’t it?
>>
>> return init || rdev->last_events <= last_events;
Yes, this is simpler.
>
>
> For me, I prefer the way of this patch. It's easy to understand. One
> return statement is harder to understand than the two return statements.
>
>>
>>> +}
>>> +
>>> +/*
>>> + * mddev is idle if following conditions are match since last check:
>>
>> … *the* following condition are match*ed* …
>>
>> (or are met)
>>
>>> + * 1) mddev doesn't have normal IO completed;
>>> + * 2) mddev doesn't have inflight normal IO;
>>> + * 3) if any member disk is partition, and other partitions doesn't
>>> have IO
>>
>> don’t
>>
>>> + * completed;
>>> + *
>>> + * Noted this checking rely on IO accounting is enabled.
>>> + */
>>> +static bool is_mddev_idle(struct mddev *mddev, int init)
>>> +{
>>> + unsigned long last_events = mddev->normal_IO_events;
>>> + struct gendisk *disk;
>>> struct md_rdev *rdev;
>>> - int idle;
>>> - int curr_events;
>>> + bool idle = true;
>>> - idle = 1;
>>> - rcu_read_lock();
>>> - rdev_for_each_rcu(rdev, mddev) {
>>> - struct gendisk *disk = rdev->bdev->bd_disk;
>>> + disk = mddev_is_dm(mddev) ? mddev->dm_gendisk : mddev->gendisk;
>>> + if (!disk)
>>> + return true;
>>> - if (!init && !blk_queue_io_stat(disk->queue))
>>> - continue;
>>> + mddev->normal_IO_events = part_stat_read_accum(disk->part0,
>>> sectors);
>>> + if (!init && (mddev->normal_IO_events > last_events ||
>>> + bdev_count_inflight(disk->part0)))
>>> + idle = false;
>>> - curr_events = (int)part_stat_read_accum(disk->part0,
>>> sectors) -
>>> - atomic_read(&disk->sync_io);
>>> - /* sync IO will cause sync_io to increase before the disk_stats
>>> - * as sync_io is counted when a request starts, and
>>> - * disk_stats is counted when it completes.
>>> - * So resync activity will cause curr_events to be smaller than
>>> - * when there was no such activity.
>>> - * non-sync IO will cause disk_stat to increase without
>>> - * increasing sync_io so curr_events will (eventually)
>>> - * be larger than it was before. Once it becomes
>>> - * substantially larger, the test below will cause
>>> - * the array to appear non-idle, and resync will slow
>>> - * down.
>>> - * If there is a lot of outstanding resync activity when
>>> - * we set last_event to curr_events, then all that activity
>>> - * completing might cause the array to appear non-idle
>>> - * and resync will be slowed down even though there might
>>> - * not have been non-resync activity. This will only
>>> - * happen once though. 'last_events' will soon reflect
>>> - * the state where there is little or no outstanding
>>> - * resync requests, and further resync activity will
>>> - * always make curr_events less than last_events.
>>> - *
>>> - */
>>> - if (init || curr_events - rdev->last_events > 64) {
>>> - rdev->last_events = curr_events;
>>> - idle = 0;
>>> - }
>>> - }
>>> + rcu_read_lock();
>>> + rdev_for_each_rcu(rdev, mddev)
>>> + if (!is_rdev_holder_idle(rdev, init))
>>> + idle = false;
>>> rcu_read_unlock();
>>> +
>>> return idle;
>>> }
>>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>>> index b57842188f18..da3fd514d20c 100644
>>> --- a/drivers/md/md.h
>>> +++ b/drivers/md/md.h
>>> @@ -132,7 +132,7 @@ struct md_rdev {
>>> sector_t sectors; /* Device size (in 512bytes sectors) */
>>> struct mddev *mddev; /* RAID array if running */
>>> - int last_events; /* IO event timestamp */
>>> + unsigned long last_events; /* IO event timestamp */
>>
>> Please mention in the commit message, why the type is changed.
I thought this was straightforward: no need for type casting.
>>
>>> /*
>>> * If meta_bdev is non-NULL, it means that a separate device is
>>> @@ -520,6 +520,7 @@ struct mddev {
>>> * adding a spare
>>> */
>>> + unsigned long normal_IO_events; /* IO event
>>> timestamp */
>>
>> Make everything lower case?
>
>
> agree+
ok
Thanks,
Kuai
> Regards
>
> Xiao
>
>>
>>> atomic_t recovery_active; /* blocks scheduled, but
>>> not written */
>>> wait_queue_head_t recovery_wait;
>>> sector_t recovery_cp;
>>
>>
>> Kind regards,
>>
>> Paul
>>
>
>
> .
>
* [PATCH v2 9/9] md: cleanup accounting for issued sync IO
2025-04-27 8:29 [PATCH v2 0/9] md: fix is_mddev_idle() Yu Kuai
` (7 preceding siblings ...)
2025-04-27 8:29 ` [PATCH v2 8/9] md: fix is_mddev_idle() Yu Kuai
@ 2025-04-27 8:29 ` Yu Kuai
2025-04-27 9:59 ` [PATCH v2 0/9] md: fix is_mddev_idle() Paul Menzel
9 siblings, 0 replies; 31+ messages in thread
From: Yu Kuai @ 2025-04-27 8:29 UTC (permalink / raw)
To: hch, axboe, xni, agk, snitzer, mpatocka, song, yukuai3, cl,
nadav.amit, ubizjak, akpm
Cc: linux-block, linux-kernel, dm-devel, linux-raid, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
It's no longer used and can be removed, also remove the field
'gendisk->sync_io'.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
---
drivers/md/md.h | 11 -----------
drivers/md/raid1.c | 3 ---
drivers/md/raid10.c | 9 ---------
drivers/md/raid5.c | 8 --------
include/linux/blkdev.h | 1 -
5 files changed, 32 deletions(-)
diff --git a/drivers/md/md.h b/drivers/md/md.h
index da3fd514d20c..4e6af8368e7e 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -717,17 +717,6 @@ static inline int mddev_trylock(struct mddev *mddev)
}
extern void mddev_unlock(struct mddev *mddev);
-static inline void md_sync_acct(struct block_device *bdev, unsigned long nr_sectors)
-{
- if (blk_queue_io_stat(bdev->bd_disk->queue))
- atomic_add(nr_sectors, &bdev->bd_disk->sync_io);
-}
-
-static inline void md_sync_acct_bio(struct bio *bio, unsigned long nr_sectors)
-{
- md_sync_acct(bio->bi_bdev, nr_sectors);
-}
-
struct md_personality
{
struct md_submodule_head head;
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 0efc03cea24e..9cc01cf66872 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2376,7 +2376,6 @@ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio)
wbio->bi_end_io = end_sync_write;
atomic_inc(&r1_bio->remaining);
- md_sync_acct(conf->mirrors[i].rdev->bdev, bio_sectors(wbio));
submit_bio_noacct(wbio);
}
@@ -3049,7 +3048,6 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
bio = r1_bio->bios[i];
if (bio->bi_end_io == end_sync_read) {
read_targets--;
- md_sync_acct_bio(bio, nr_sectors);
if (read_targets == 1)
bio->bi_opf &= ~MD_FAILFAST;
submit_bio_noacct(bio);
@@ -3058,7 +3056,6 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
} else {
atomic_set(&r1_bio->remaining, 1);
bio = r1_bio->bios[r1_bio->read_disk];
- md_sync_acct_bio(bio, nr_sectors);
if (read_targets == 1)
bio->bi_opf &= ~MD_FAILFAST;
submit_bio_noacct(bio);
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 846c5f29486e..7abe79425b85 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2425,7 +2425,6 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
atomic_inc(&conf->mirrors[d].rdev->nr_pending);
atomic_inc(&r10_bio->remaining);
- md_sync_acct(conf->mirrors[d].rdev->bdev, bio_sectors(tbio));
if (test_bit(FailFast, &conf->mirrors[d].rdev->flags))
tbio->bi_opf |= MD_FAILFAST;
@@ -2447,8 +2446,6 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
bio_copy_data(tbio, fbio);
d = r10_bio->devs[i].devnum;
atomic_inc(&r10_bio->remaining);
- md_sync_acct(conf->mirrors[d].replacement->bdev,
- bio_sectors(tbio));
submit_bio_noacct(tbio);
}
@@ -2582,13 +2579,10 @@ static void recovery_request_write(struct mddev *mddev, struct r10bio *r10_bio)
d = r10_bio->devs[1].devnum;
if (wbio->bi_end_io) {
atomic_inc(&conf->mirrors[d].rdev->nr_pending);
- md_sync_acct(conf->mirrors[d].rdev->bdev, bio_sectors(wbio));
submit_bio_noacct(wbio);
}
if (wbio2) {
atomic_inc(&conf->mirrors[d].replacement->nr_pending);
- md_sync_acct(conf->mirrors[d].replacement->bdev,
- bio_sectors(wbio2));
submit_bio_noacct(wbio2);
}
}
@@ -3756,7 +3750,6 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
r10_bio->sectors = nr_sectors;
if (bio->bi_end_io == end_sync_read) {
- md_sync_acct_bio(bio, nr_sectors);
bio->bi_status = 0;
submit_bio_noacct(bio);
}
@@ -4879,7 +4872,6 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr,
r10_bio->sectors = nr_sectors;
/* Now submit the read */
- md_sync_acct_bio(read_bio, r10_bio->sectors);
atomic_inc(&r10_bio->remaining);
read_bio->bi_next = NULL;
submit_bio_noacct(read_bio);
@@ -4939,7 +4931,6 @@ static void reshape_request_write(struct mddev *mddev, struct r10bio *r10_bio)
continue;
atomic_inc(&rdev->nr_pending);
- md_sync_acct_bio(b, r10_bio->sectors);
atomic_inc(&r10_bio->remaining);
b->bi_next = NULL;
submit_bio_noacct(b);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 6389383166c0..ca5b0e8ba707 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1240,10 +1240,6 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
}
if (rdev) {
- if (s->syncing || s->expanding || s->expanded
- || s->replacing)
- md_sync_acct(rdev->bdev, RAID5_STRIPE_SECTORS(conf));
-
set_bit(STRIPE_IO_STARTED, &sh->state);
bio_init(bi, rdev->bdev, &dev->vec, 1, op | op_flags);
@@ -1300,10 +1296,6 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
submit_bio_noacct(bi);
}
if (rrdev) {
- if (s->syncing || s->expanding || s->expanded
- || s->replacing)
- md_sync_acct(rrdev->bdev, RAID5_STRIPE_SECTORS(conf));
-
set_bit(STRIPE_IO_STARTED, &sh->state);
bio_init(rbi, rrdev->bdev, &dev->rvec, 1, op | op_flags);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e39c45bc0a97..f3a625b00734 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -182,7 +182,6 @@ struct gendisk {
struct list_head slave_bdevs;
#endif
struct timer_rand_state *random;
- atomic_t sync_io; /* RAID */
struct disk_events *ev;
#ifdef CONFIG_BLK_DEV_ZONED
--
2.39.2
* Re: [PATCH v2 0/9] md: fix is_mddev_idle()
2025-04-27 8:29 [PATCH v2 0/9] md: fix is_mddev_idle() Yu Kuai
` (8 preceding siblings ...)
2025-04-27 8:29 ` [PATCH v2 9/9] md: cleanup accounting for issued sync IO Yu Kuai
@ 2025-04-27 9:59 ` Paul Menzel
9 siblings, 0 replies; 31+ messages in thread
From: Paul Menzel @ 2025-04-27 9:59 UTC (permalink / raw)
To: Yu Kuai
Cc: hch, axboe, xni, agk, snitzer, mpatocka, song, yukuai3, cl,
nadav.amit, ubizjak, akpm, linux-block, linux-kernel, dm-devel,
linux-raid, yi.zhang, yangerkun, johnny.chenyi
Dear Kuai,
Thank you for your patch series. Some minor comments below.
On 27.04.25 at 10:29, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
If a full cover letter is not warranted, maybe state what patch to look
at? (In this case 8/9.)
> Changes in v2:
> - add patch 1-5
> - add reviewed-by in patch 6,7,9
> - rename mddev->last_events to mddev->normal_IO_events in patch 8
>
> Yu Kuai (9):
> blk-mq: remove blk_mq_in_flight()
> block: reuse part_in_flight_rw for part_in_flight
> block: WARN if bdev inflight counter is negative
> block: cleanup blk_mq_in_flight_rw()
cleanup → clean up
> block: export API to get the number of bdev inflight IO
> md: record dm-raid gendisk in mddev
> md: add a new api sync_io_depth
add a new → add new
> md: fix is_mddev_idle()
> md: cleanup accounting for issued sync IO
cleanup → clean up
> block/blk-core.c | 2 +-
> block/blk-mq.c | 22 ++---
> block/blk-mq.h | 5 +-
> block/blk.h | 1 -
> block/genhd.c | 69 ++++++++------
> drivers/md/dm-raid.c | 3 +
> drivers/md/md.c | 193 +++++++++++++++++++++++++++-----------
> drivers/md/md.h | 18 +---
> drivers/md/raid1.c | 3 -
> drivers/md/raid10.c | 9 --
> drivers/md/raid5.c | 8 --
> include/linux/blkdev.h | 1 -
> include/linux/part_stat.h | 2 +
> 13 files changed, 194 insertions(+), 142 deletions(-)
Kind regards,
Paul