[PATCH 0/2] blk-throttle: fix off-by-one jiffies wait

linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] blk-throttle: fix off-by-one jiffies wait_time
@ 2025-02-22  9:28 Yu Kuai
  2025-02-22  9:28 ` [PATCH 1/2] blk-throttle: cleanup throtl_extend_slice() Yu Kuai
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Yu Kuai @ 2025-02-22  9:28 UTC (permalink / raw)
  To: ming.lei, tj, josef, axboe
  Cc: cgroups, linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang,
	yangerkun

From: Yu Kuai <yukuai3@huawei.com>

Yu Kuai (2):
  blk-throttle: cleanup throtl_extend_slice()
  blk-throttle: fix off-by-one jiffies wait_time

 block/blk-throttle.c | 38 ++++++++++++++++++++++----------------
 block/blk-throttle.h | 12 ++++++++----
 2 files changed, 30 insertions(+), 20 deletions(-)

-- 
2.39.2


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/2] blk-throttle: cleanup throtl_extend_slice()
  2025-02-22  9:28 [PATCH 0/2] blk-throttle: fix off-by-one jiffies wait_time Yu Kuai
@ 2025-02-22  9:28 ` Yu Kuai
  2025-02-25 20:30   ` Tejun Heo
  2025-02-22  9:28 ` [PATCH 2/2] blk-throttle: fix off-by-one jiffies wait_time Yu Kuai
  2025-02-25 10:51 ` [PATCH 0/2] " Michal Koutný
  2 siblings, 1 reply; 18+ messages in thread
From: Yu Kuai @ 2025-02-22  9:28 UTC (permalink / raw)
  To: ming.lei, tj, josef, axboe
  Cc: cgroups, linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang,
	yangerkun

From: Yu Kuai <yukuai3@huawei.com>

Merge time_before() from caller into throtl_extend_slice() to make code
cleaner.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-throttle.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 8d149aff9fd0..0096e486b1e3 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -524,6 +524,9 @@ static inline void throtl_set_slice_end(struct throtl_grp *tg, bool rw,
 static inline void throtl_extend_slice(struct throtl_grp *tg, bool rw,
 				       unsigned long jiffy_end)
 {
+	if (!time_before(tg->slice_end[rw], jiffy_end))
+		return;
+
 	throtl_set_slice_end(tg, rw, jiffy_end);
 	throtl_log(&tg->service_queue,
 		   "[%c] extend slice start=%lu end=%lu jiffies=%lu",
@@ -778,12 +781,8 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	 */
 	if (throtl_slice_used(tg, rw) && !(tg->service_queue.nr_queued[rw]))
 		throtl_start_new_slice(tg, rw, true);
-	else {
-		if (time_before(tg->slice_end[rw],
-		    jiffies + tg->td->throtl_slice))
-			throtl_extend_slice(tg, rw,
-				jiffies + tg->td->throtl_slice);
-	}
+	else
+		throtl_extend_slice(tg, rw, jiffies + tg->td->throtl_slice);
 
 	bps_wait = tg_within_bps_limit(tg, bio, bps_limit);
 	iops_wait = tg_within_iops_limit(tg, bio, iops_limit);
@@ -794,12 +793,9 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	}
 
 	max_wait = max(bps_wait, iops_wait);
-
 	if (wait)
 		*wait = max_wait;
-
-	if (time_before(tg->slice_end[rw], jiffies + max_wait))
-		throtl_extend_slice(tg, rw, jiffies + max_wait);
+	throtl_extend_slice(tg, rw, jiffies + max_wait);
 
 	return false;
 }
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] blk-throttle: cleanup throtl_extend_slice()
  2025-02-22  9:28 ` [PATCH 1/2] blk-throttle: cleanup throtl_extend_slice() Yu Kuai
@ 2025-02-25 20:30   ` Tejun Heo
  0 siblings, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2025-02-25 20:30 UTC (permalink / raw)
  To: Yu Kuai
  Cc: ming.lei, josef, axboe, cgroups, linux-block, linux-kernel,
	yukuai3, yi.zhang, yangerkun

On Sat, Feb 22, 2025 at 05:28:22PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Merge time_before() from caller into throtl_extend_slice() to make code
> cleaner.
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 2/2] blk-throttle: fix off-by-one jiffies wait_time
  2025-02-22  9:28 [PATCH 0/2] blk-throttle: fix off-by-one jiffies wait_time Yu Kuai
  2025-02-22  9:28 ` [PATCH 1/2] blk-throttle: cleanup throtl_extend_slice() Yu Kuai
@ 2025-02-22  9:28 ` Yu Kuai
  2025-02-22 12:16   ` Ming Lei
  2025-02-25 10:51 ` [PATCH 0/2] " Michal Koutný
  2 siblings, 1 reply; 18+ messages in thread
From: Yu Kuai @ 2025-02-22  9:28 UTC (permalink / raw)
  To: ming.lei, tj, josef, axboe
  Cc: cgroups, linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang,
	yangerkun

From: Yu Kuai <yukuai3@huawei.com>

wait_time is based on jiffies, and the calculation is:

wait_time = extra_bytes * HZ / bps_limit;

If wait time is not zero, the remainder is ignored, means wait time is
short by at most 1 jiffes; On the other hand, if wait time is zero, it
will be used as 1 jiffies, means it's excessive by at most 1 jiffes.

This way causes blktests throtl/001 failure in case of CONFIG_HZ_100=y,
fix the problem by recording remainder as debt, and repay the debt
later.

Reported-and-tested-by: Ming Lei <ming.lei@redhat.com>
Closes: https://lore.kernel.org/all/20250220111735.1187999-1-ming.lei@redhat.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-throttle.c | 24 +++++++++++++++++-------
 block/blk-throttle.h | 12 ++++++++----
 2 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 0096e486b1e3..3828c6535605 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -703,9 +703,10 @@ static unsigned long tg_within_iops_limit(struct throtl_grp *tg, struct bio *bio
 }
 
 static unsigned long tg_within_bps_limit(struct throtl_grp *tg, struct bio *bio,
-				u64 bps_limit)
+				u64 bps_limit, bool *has_debt)
 {
 	bool rw = bio_data_dir(bio);
+	long long carryover_bytes;
 	long long bytes_allowed;
 	u64 extra_bytes;
 	unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
@@ -730,10 +731,16 @@ static unsigned long tg_within_bps_limit(struct throtl_grp *tg, struct bio *bio,
 
 	/* Calc approx time to dispatch */
 	extra_bytes = tg->bytes_disp[rw] + bio_size - bytes_allowed;
-	jiffy_wait = div64_u64(extra_bytes * HZ, bps_limit);
-
-	if (!jiffy_wait)
-		jiffy_wait = 1;
+	jiffy_wait = div64_u64_rem(extra_bytes * HZ, bps_limit, &carryover_bytes);
+	if (carryover_bytes) {
+		/*
+		 * If extra_bytes is not divisible, the remainder is recorded as
+		 * debt. Caller must ensure the current slice has at least 1
+		 * more jiffies to repay the debt.
+		 */
+		*has_debt = true;
+		tg->carryover_bytes[rw] -= div64_u64(carryover_bytes, HZ);
+	}
 
 	/*
 	 * This wait time is without taking into consideration the rounding
@@ -754,6 +761,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
 	u64 bps_limit = tg_bps_limit(tg, rw);
 	u32 iops_limit = tg_iops_limit(tg, rw);
+	bool has_debt = false;
 
 	/*
  	 * Currently whole state machine of group depends on first bio
@@ -784,18 +792,20 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	else
 		throtl_extend_slice(tg, rw, jiffies + tg->td->throtl_slice);
 
-	bps_wait = tg_within_bps_limit(tg, bio, bps_limit);
+	bps_wait = tg_within_bps_limit(tg, bio, bps_limit, &has_debt);
 	iops_wait = tg_within_iops_limit(tg, bio, iops_limit);
 	if (bps_wait + iops_wait == 0) {
 		if (wait)
 			*wait = 0;
+		if (has_debt)
+			throtl_extend_slice(tg, rw, jiffies + 1);
 		return true;
 	}
 
 	max_wait = max(bps_wait, iops_wait);
 	if (wait)
 		*wait = max_wait;
-	throtl_extend_slice(tg, rw, jiffies + max_wait);
+	throtl_extend_slice(tg, rw, jiffies + max_wait + has_debt);
 
 	return false;
 }
diff --git a/block/blk-throttle.h b/block/blk-throttle.h
index 1a36d1278eea..56dcb5ce412f 100644
--- a/block/blk-throttle.h
+++ b/block/blk-throttle.h
@@ -110,10 +110,14 @@ struct throtl_grp {
 	unsigned int last_io_disp[2];
 
 	/*
-	 * The following two fields are updated when new configuration is
-	 * submitted while some bios are still throttled, they record how many
-	 * bytes/ios are waited already in previous configuration, and they will
-	 * be used to calculate wait time under new configuration.
+	 * The following two fields are updated when:
+	 * 1) new configuration is submitted while some bios are still
+	 * throttled, they record how many bytes/ios are waited already in
+	 * previous configuration;
+	 * 2) IOs which may cause priority inversions are dispatched while tg is
+	 * over limit, these IOs will be dispatched directly;
+	 * 3) While calculating wait_time for IO, extra_bytes * HZ is not
+	 * divisible by bps_limit, the remainder will be recorded;
 	 */
 	long long carryover_bytes[2];
 	int carryover_ios[2];
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] blk-throttle: fix off-by-one jiffies wait_time
  2025-02-22  9:28 ` [PATCH 2/2] blk-throttle: fix off-by-one jiffies wait_time Yu Kuai
@ 2025-02-22 12:16   ` Ming Lei
  2025-02-24  2:39     ` Yu Kuai
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2025-02-22 12:16 UTC (permalink / raw)
  To: Yu Kuai
  Cc: tj, josef, axboe, cgroups, linux-block, linux-kernel, yukuai3,
	yi.zhang, yangerkun

Hi Yukuai,

On Sat, Feb 22, 2025 at 05:28:23PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> wait_time is based on jiffies, and the calculation is:
> 
> wait_time = extra_bytes * HZ / bps_limit;
> 
> If wait time is not zero, the remainder is ignored, means wait time is
> short by at most 1 jiffes; On the other hand, if wait time is zero, it
> will be used as 1 jiffies, means it's excessive by at most 1 jiffes.
> 
> This way causes blktests throtl/001 failure in case of CONFIG_HZ_100=y,
> fix the problem by recording remainder as debt, and repay the debt
> later.

After further analysis, I figured out that this one extra jiffy isn't the
only reason for throtl/001 failure.

In blktests throtl/001, bps_limit is 1MB/sec, and BS is 4k, and
COFIG_HZ=100, and default throttle slice is 2 jiffies(20ms):

- 20ms can submit 5 bios: 1024K/50(5*4k=20KB)

- the 6th bio is throttled, and the calculated wait is 1 jiffy from
tg_within_bps_limit()

- given all the 6 bios are handled in the time of jiffies A, so A + 1(wait) + 2(slice)
is programmed to start pending timer for scheduling dispatch

- when the pending timer is expired, the 6th bio is submitted, then the
  current slice is trim/reset since the throttled 6th bio is dispatched.

Now in the whole throttle slice period, 6 bios(24KB) are submitted, and 3
jiffies are taken, so 256 bios will take (256/6) * 30ms = 1.3sec.

But blktests throtl/001 still should pass since the allowed deviation is 0.5sec,
and 1.3 < 1.5.

Actually there is another reason of timer delay, looks one extra jiffy is
delayed when the timer is triggered, which can be observed reliably by:

bpftrace -e 'kfunc:throtl_pending_timer_fn { @timer_expire_delay = lhist(jiffies - args->t->expires, 0, 16, 1);}'

Then 256 bios will take (256/6) * 40ms = 1.7sec, which does match with
observation in throtl/001.

Yeah, killing one jiffy may pass blktests throtl/001, but, ...

> 
> Reported-and-tested-by: Ming Lei <ming.lei@redhat.com>
> Closes: https://lore.kernel.org/all/20250220111735.1187999-1-ming.lei@redhat.com/
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  block/blk-throttle.c | 24 +++++++++++++++++-------
>  block/blk-throttle.h | 12 ++++++++----
>  2 files changed, 25 insertions(+), 11 deletions(-)
> 
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 0096e486b1e3..3828c6535605 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -703,9 +703,10 @@ static unsigned long tg_within_iops_limit(struct throtl_grp *tg, struct bio *bio
>  }
>  
>  static unsigned long tg_within_bps_limit(struct throtl_grp *tg, struct bio *bio,
> -				u64 bps_limit)
> +				u64 bps_limit, bool *has_debt)
>  {
>  	bool rw = bio_data_dir(bio);
> +	long long carryover_bytes;
>  	long long bytes_allowed;
>  	u64 extra_bytes;
>  	unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
> @@ -730,10 +731,16 @@ static unsigned long tg_within_bps_limit(struct throtl_grp *tg, struct bio *bio,
>  
>  	/* Calc approx time to dispatch */
>  	extra_bytes = tg->bytes_disp[rw] + bio_size - bytes_allowed;
> -	jiffy_wait = div64_u64(extra_bytes * HZ, bps_limit);
> -
> -	if (!jiffy_wait)
> -		jiffy_wait = 1;
> +	jiffy_wait = div64_u64_rem(extra_bytes * HZ, bps_limit, &carryover_bytes);
> +	if (carryover_bytes) {
> +		/*
> +		 * If extra_bytes is not divisible, the remainder is recorded as
> +		 * debt. Caller must ensure the current slice has at least 1
> +		 * more jiffies to repay the debt.
> +		 */
> +		*has_debt = true;
> +		tg->carryover_bytes[rw] -= div64_u64(carryover_bytes, HZ);
> +	}

Thinking of further, it may not be good to use ->carryover_bytes[]:

- if tg_within_bps_limit() returns 0 and the bio is dispatched
  immediately, throtl_trim_slice() is called and ->carryover_bytes[]
  is cleared.

- if tg_within_bps_limit() returns >0, this bio will be throttled, and
  tg_within_bps_limit() may be called more than one time, so
  tg->carryover_bytes[] could be over-counted.

Actually this patch changes tg_within_bps_limit() to one stateful function...

>  
>  	/*
>  	 * This wait time is without taking into consideration the rounding
> @@ -754,6 +761,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
>  	unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
>  	u64 bps_limit = tg_bps_limit(tg, rw);
>  	u32 iops_limit = tg_iops_limit(tg, rw);
> +	bool has_debt = false;
>  
>  	/*
>   	 * Currently whole state machine of group depends on first bio
> @@ -784,18 +792,20 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
>  	else
>  		throtl_extend_slice(tg, rw, jiffies + tg->td->throtl_slice);
>  
> -	bps_wait = tg_within_bps_limit(tg, bio, bps_limit);
> +	bps_wait = tg_within_bps_limit(tg, bio, bps_limit, &has_debt);
>  	iops_wait = tg_within_iops_limit(tg, bio, iops_limit);
>  	if (bps_wait + iops_wait == 0) {
>  		if (wait)
>  			*wait = 0;
> +		if (has_debt)
> +			throtl_extend_slice(tg, rw, jiffies + 1);
>  		return true;
>  	}
>  
>  	max_wait = max(bps_wait, iops_wait);
>  	if (wait)
>  		*wait = max_wait;
> -	throtl_extend_slice(tg, rw, jiffies + max_wait);
> +	throtl_extend_slice(tg, rw, jiffies + max_wait + has_debt);
>  
>  	return false;
>  }
> diff --git a/block/blk-throttle.h b/block/blk-throttle.h
> index 1a36d1278eea..56dcb5ce412f 100644
> --- a/block/blk-throttle.h
> +++ b/block/blk-throttle.h
> @@ -110,10 +110,14 @@ struct throtl_grp {
>  	unsigned int last_io_disp[2];
>  
>  	/*
> -	 * The following two fields are updated when new configuration is
> -	 * submitted while some bios are still throttled, they record how many
> -	 * bytes/ios are waited already in previous configuration, and they will
> -	 * be used to calculate wait time under new configuration.
> +	 * The following two fields are updated when:
> +	 * 1) new configuration is submitted while some bios are still
> +	 * throttled, they record how many bytes/ios are waited already in
> +	 * previous configuration;
> +	 * 2) IOs which may cause priority inversions are dispatched while tg is
> +	 * over limit, these IOs will be dispatched directly;
> +	 * 3) While calculating wait_time for IO, extra_bytes * HZ is not
> +	 * divisible by bps_limit, the remainder will be recorded;
>  	 */

blk-throttle takes token bucket algorithm, and the implementation
shouldn't be sensitive with the above two factors, because the bps
rate is controlled over the whole bucket(slice). Meantime it is still
tricky to maintain ->carryover_bytes during slice cycle, which can't
cover timer delay too.

Another way is to avoid to trim slice too soon in case of owning too
much debt, something like the following does avoid this issue:


diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 8d149aff9fd0..a778ebbb6887 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -615,6 +615,14 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
 	if (bytes_trim <= 0 && io_trim <= 0)
 		return;
 
+	if (tg_bps_limit(tg, rw) != U64_MAX) {
+		long long disp = tg->bytes_disp[rw];
+
+		if (bytes_trim > disp && (bytes_trim - disp) > disp / 2 && time_elapsed <
+				8 * tg->td->throtl_slice)
+			return;
+	}
+
 	tg->carryover_bytes[rw] = 0;
 	if ((long long)tg->bytes_disp[rw] >= bytes_trim)
 		tg->bytes_disp[rw] -= bytes_trim;


Thanks,
Ming


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] blk-throttle: fix off-by-one jiffies wait_time
  2025-02-22 12:16   ` Ming Lei
@ 2025-02-24  2:39     ` Yu Kuai
  2025-02-24  3:28       ` Ming Lei
  0 siblings, 1 reply; 18+ messages in thread
From: Yu Kuai @ 2025-02-24  2:39 UTC (permalink / raw)
  To: Ming Lei, Yu Kuai
  Cc: tj, josef, axboe, cgroups, linux-block, linux-kernel, yi.zhang,
	yangerkun, yukuai (C)

Hi,

在 2025/02/22 20:16, Ming Lei 写道:
> Hi Yukuai,
> 
> On Sat, Feb 22, 2025 at 05:28:23PM +0800, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> wait_time is based on jiffies, and the calculation is:
>>
>> wait_time = extra_bytes * HZ / bps_limit;
>>
>> If wait time is not zero, the remainder is ignored, means wait time is
>> short by at most 1 jiffes; On the other hand, if wait time is zero, it
>> will be used as 1 jiffies, means it's excessive by at most 1 jiffes.
>>
>> This way causes blktests throtl/001 failure in case of CONFIG_HZ_100=y,
>> fix the problem by recording remainder as debt, and repay the debt
>> later.
> 
> After further analysis, I figured out that this one extra jiffy isn't the
> only reason for throtl/001 failure.
> 
> In blktests throtl/001, bps_limit is 1MB/sec, and BS is 4k, and
> COFIG_HZ=100, and default throttle slice is 2 jiffies(20ms):
> 
> - 20ms can submit 5 bios: 1024K/50(5*4k=20KB)
> 
> - the 6th bio is throttled, and the calculated wait is 1 jiffy from
> tg_within_bps_limit()
> 
> - given all the 6 bios are handled in the time of jiffies A, so A + 1(wait) + 2(slice)
> is programmed to start pending timer for scheduling dispatch
> 
> - when the pending timer is expired, the 6th bio is submitted, then the
>    current slice is trim/reset since the throttled 6th bio is dispatched.
> 
> Now in the whole throttle slice period, 6 bios(24KB) are submitted, and 3
> jiffies are taken, so 256 bios will take (256/6) * 30ms = 1.3sec.
> 
> But blktests throtl/001 still should pass since the allowed deviation is 0.5sec,
> and 1.3 < 1.5.
> 
> Actually there is another reason of timer delay, looks one extra jiffy is
> delayed when the timer is triggered, which can be observed reliably by:
> 
> bpftrace -e 'kfunc:throtl_pending_timer_fn { @timer_expire_delay = lhist(jiffies - args->t->expires, 0, 16, 1);}'
> 
> Then 256 bios will take (256/6) * 40ms = 1.7sec, which does match with
> observation in throtl/001.

Thanks for the detailed explanation.
> 
> Yeah, killing one jiffy may pass blktests throtl/001, but, ...
> 
>>
>> Reported-and-tested-by: Ming Lei <ming.lei@redhat.com>
>> Closes: https://lore.kernel.org/all/20250220111735.1187999-1-ming.lei@redhat.com/
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>>   block/blk-throttle.c | 24 +++++++++++++++++-------
>>   block/blk-throttle.h | 12 ++++++++----
>>   2 files changed, 25 insertions(+), 11 deletions(-)
>>
>> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
>> index 0096e486b1e3..3828c6535605 100644
>> --- a/block/blk-throttle.c
>> +++ b/block/blk-throttle.c
>> @@ -703,9 +703,10 @@ static unsigned long tg_within_iops_limit(struct throtl_grp *tg, struct bio *bio
>>   }
>>   
>>   static unsigned long tg_within_bps_limit(struct throtl_grp *tg, struct bio *bio,
>> -				u64 bps_limit)
>> +				u64 bps_limit, bool *has_debt)
>>   {
>>   	bool rw = bio_data_dir(bio);
>> +	long long carryover_bytes;
>>   	long long bytes_allowed;
>>   	u64 extra_bytes;
>>   	unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
>> @@ -730,10 +731,16 @@ static unsigned long tg_within_bps_limit(struct throtl_grp *tg, struct bio *bio,
>>   
>>   	/* Calc approx time to dispatch */
>>   	extra_bytes = tg->bytes_disp[rw] + bio_size - bytes_allowed;
>> -	jiffy_wait = div64_u64(extra_bytes * HZ, bps_limit);
>> -
>> -	if (!jiffy_wait)
>> -		jiffy_wait = 1;
>> +	jiffy_wait = div64_u64_rem(extra_bytes * HZ, bps_limit, &carryover_bytes);
>> +	if (carryover_bytes) {
>> +		/*
>> +		 * If extra_bytes is not divisible, the remainder is recorded as
>> +		 * debt. Caller must ensure the current slice has at least 1
>> +		 * more jiffies to repay the debt.
>> +		 */
>> +		*has_debt = true;
>> +		tg->carryover_bytes[rw] -= div64_u64(carryover_bytes, HZ);
>> +	}
> 
> Thinking of further, it may not be good to use ->carryover_bytes[]:
> 
> - if tg_within_bps_limit() returns 0 and the bio is dispatched
>    immediately, throtl_trim_slice() is called and ->carryover_bytes[]
>    is cleared.
> 
I think this should not happen, because:

If the slice is extended in consideration of carryover_bytes,
throtl_slice_used() is checked first in throtl_trim_slice().

> - if tg_within_bps_limit() returns >0, this bio will be throttled, and
>    tg_within_bps_limit() may be called more than one time, so
>    tg->carryover_bytes[] could be over-counted.
> 
Yes, carryover_bytes can be over-counted with this patch. For now, the
only way that I can think of to fix this is to handle carryover_bytes
from the caller of tg_may_dispatch():
  - tg_within_limit(), only handle if bio is dispatched directly;
  - tg_update_disptime(), never handle;
  - throtl_dispatch_tg(), always handle;

The idea is that only handle when bio is dispatched, like following
patch(not tested yet):

[yukuai@localhost linux-next]$ git diff
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 3828c6535605..20f6bab07960 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -703,10 +703,9 @@ static unsigned long tg_within_iops_limit(struct 
throtl_grp *tg, struct bio *bio
  }

  static unsigned long tg_within_bps_limit(struct throtl_grp *tg, struct 
bio *bio,
-                               u64 bps_limit, bool *has_debt)
+                               u64 bps_limit, long long *carryover_bytes)
  {
         bool rw = bio_data_dir(bio);
-       long long carryover_bytes;
         long long bytes_allowed;
         u64 extra_bytes;
         unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
@@ -731,16 +730,7 @@ static unsigned long tg_within_bps_limit(struct 
throtl_grp *tg, struct bio *bio,

         /* Calc approx time to dispatch */
         extra_bytes = tg->bytes_disp[rw] + bio_size - bytes_allowed;
-       jiffy_wait = div64_u64_rem(extra_bytes * HZ, bps_limit, 
&carryover_bytes);
-       if (carryover_bytes) {
-               /*
-                * If extra_bytes is not divisible, the remainder is 
recorded as
-                * debt. Caller must ensure the current slice has at least 1
-                * more jiffies to repay the debt.
-                */
-               *has_debt = true;
-               tg->carryover_bytes[rw] -= div64_u64(carryover_bytes, HZ);
-       }
+       jiffy_wait = div64_u64_rem(extra_bytes * HZ, bps_limit, 
carryover_bytes);

         /*
          * This wait time is without taking into consideration the rounding
@@ -755,13 +745,12 @@ static unsigned long tg_within_bps_limit(struct 
throtl_grp *tg, struct bio *bio,
   * of jiffies to wait before this bio is with-in IO rate and can be 
dispatched
   */
  static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
-                           unsigned long *wait)
+                           unsigned long *wait, long long *carryover_bytes)
  {
         bool rw = bio_data_dir(bio);
         unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
         u64 bps_limit = tg_bps_limit(tg, rw);
         u32 iops_limit = tg_iops_limit(tg, rw);
-       bool has_debt = false;

         /*
          * Currently whole state machine of group depends on first bio
@@ -792,20 +781,18 @@ static bool tg_may_dispatch(struct throtl_grp *tg, 
struct bio *bio,
         else
                 throtl_extend_slice(tg, rw, jiffies + 
tg->td->throtl_slice);

-       bps_wait = tg_within_bps_limit(tg, bio, bps_limit, &has_debt);
+       bps_wait = tg_within_bps_limit(tg, bio, bps_limit, carryover_bytes);
         iops_wait = tg_within_iops_limit(tg, bio, iops_limit);
         if (bps_wait + iops_wait == 0) {
                 if (wait)
                         *wait = 0;
-               if (has_debt)
-                       throtl_extend_slice(tg, rw, jiffies + 1);
                 return true;
         }

         max_wait = max(bps_wait, iops_wait);
         if (wait)
                 *wait = max_wait;
-       throtl_extend_slice(tg, rw, jiffies + max_wait + has_debt);
+       throtl_extend_slice(tg, rw, jiffies + max_wait);

         return false;
  }
@@ -858,19 +845,33 @@ static void throtl_add_bio_tg(struct bio *bio, 
struct throtl_qnode *qn,
         throtl_enqueue_tg(tg);
  }

+static void handle_tg_carryover_bytes(struct throtl_grp *tg,
+                                     long long carryover_bytes, int rw)
+{
+       if (carryover_bytes == 0)
+               return;
+       /*
+        * IO is dispatched without waiting for @carryover_bytes, make sure
+        * current slice has 1 more jiffies to repay the debt.
+        */
+       tg->carryover_bytes[rw] -= carryover_bytes;
+       throtl_extend_slice(tg, rw, tg->slice_end[rw] + 1);
+}
+
  static void tg_update_disptime(struct throtl_grp *tg)
  {
         struct throtl_service_queue *sq = &tg->service_queue;
         unsigned long read_wait = -1, write_wait = -1, min_wait = -1, 
disptime;
+       long long carryover_bytes = 0;
         struct bio *bio;

         bio = throtl_peek_queued(&sq->queued[READ]);
         if (bio)
-               tg_may_dispatch(tg, bio, &read_wait);
+               tg_may_dispatch(tg, bio, &read_wait, &carryover_bytes);

         bio = throtl_peek_queued(&sq->queued[WRITE]);
         if (bio)
-               tg_may_dispatch(tg, bio, &write_wait);
+               tg_may_dispatch(tg, bio, &write_wait, &carryover_bytes);

         min_wait = min(read_wait, write_wait);
         disptime = jiffies + min_wait;
@@ -943,12 +944,15 @@ static int throtl_dispatch_tg(struct throtl_grp *tg)
         unsigned int nr_reads = 0, nr_writes = 0;
         unsigned int max_nr_reads = THROTL_GRP_QUANTUM * 3 / 4;
         unsigned int max_nr_writes = THROTL_GRP_QUANTUM - max_nr_reads;
+       long long carryover_bytes = 0;
         struct bio *bio;

         /* Try to dispatch 75% READS and 25% WRITES */

         while ((bio = throtl_peek_queued(&sq->queued[READ])) &&
-              tg_may_dispatch(tg, bio, NULL)) {
+              tg_may_dispatch(tg, bio, NULL, &carryover_bytes)) {
+               handle_tg_carryover_bytes(tg, carryover_bytes, READ);
+               carryover_bytes = 0;

                 tg_dispatch_one_bio(tg, READ);
                 nr_reads++;
@@ -958,7 +962,9 @@ static int throtl_dispatch_tg(struct throtl_grp *tg)
         }

         while ((bio = throtl_peek_queued(&sq->queued[WRITE])) &&
-              tg_may_dispatch(tg, bio, NULL)) {
+              tg_may_dispatch(tg, bio, NULL, &carryover_bytes)) {
+               handle_tg_carryover_bytes(tg, carryover_bytes, WRITE);
+               carryover_bytes = 0;

                 tg_dispatch_one_bio(tg, WRITE);
                 nr_writes++;
@@ -1613,11 +1619,18 @@ void blk_throtl_cancel_bios(struct gendisk *disk)

  static bool tg_within_limit(struct throtl_grp *tg, struct bio *bio, 
bool rw)
  {
+       long long carryover_bytes = 0;
+
         /* throtl is FIFO - if bios are already queued, should queue */
         if (tg->service_queue.nr_queued[rw])
                 return false;

-       return tg_may_dispatch(tg, bio, NULL);
+       if (tg_may_dispatch(tg, bio, NULL, &carryover_bytes)) {
+               handle_tg_carryover_bytes(tg, carryover_bytes, rw);
+               return true;
+       }
+
+       return false;
  }

  static void tg_dispatch_in_debt(struct throtl_grp *tg, struct bio 
*bio, bool rw)


> Actually this patch changes tg_within_bps_limit() to one stateful function...
> 
>>   
>>   	/*
>>   	 * This wait time is without taking into consideration the rounding
>> @@ -754,6 +761,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
>>   	unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
>>   	u64 bps_limit = tg_bps_limit(tg, rw);
>>   	u32 iops_limit = tg_iops_limit(tg, rw);
>> +	bool has_debt = false;
>>   
>>   	/*
>>    	 * Currently whole state machine of group depends on first bio
>> @@ -784,18 +792,20 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
>>   	else
>>   		throtl_extend_slice(tg, rw, jiffies + tg->td->throtl_slice);
>>   
>> -	bps_wait = tg_within_bps_limit(tg, bio, bps_limit);
>> +	bps_wait = tg_within_bps_limit(tg, bio, bps_limit, &has_debt);
>>   	iops_wait = tg_within_iops_limit(tg, bio, iops_limit);
>>   	if (bps_wait + iops_wait == 0) {
>>   		if (wait)
>>   			*wait = 0;
>> +		if (has_debt)
>> +			throtl_extend_slice(tg, rw, jiffies + 1);
>>   		return true;
>>   	}
>>   
>>   	max_wait = max(bps_wait, iops_wait);
>>   	if (wait)
>>   		*wait = max_wait;
>> -	throtl_extend_slice(tg, rw, jiffies + max_wait);
>> +	throtl_extend_slice(tg, rw, jiffies + max_wait + has_debt);
>>   
>>   	return false;
>>   }
>> diff --git a/block/blk-throttle.h b/block/blk-throttle.h
>> index 1a36d1278eea..56dcb5ce412f 100644
>> --- a/block/blk-throttle.h
>> +++ b/block/blk-throttle.h
>> @@ -110,10 +110,14 @@ struct throtl_grp {
>>   	unsigned int last_io_disp[2];
>>   
>>   	/*
>> -	 * The following two fields are updated when new configuration is
>> -	 * submitted while some bios are still throttled, they record how many
>> -	 * bytes/ios are waited already in previous configuration, and they will
>> -	 * be used to calculate wait time under new configuration.
>> +	 * The following two fields are updated when:
>> +	 * 1) new configuration is submitted while some bios are still
>> +	 * throttled, they record how many bytes/ios are waited already in
>> +	 * previous configuration;
>> +	 * 2) IOs which may cause priority inversions are dispatched while tg is
>> +	 * over limit, these IOs will be dispatched directly;
>> +	 * 3) While calculating wait_time for IO, extra_bytes * HZ is not
>> +	 * divisible by bps_limit, the remainder will be recorded;
>>   	 */
> 
> blk-throttle takes token bucket algorithm, and the implementation
> shouldn't be sensitive with the above two factors, because the bps
> rate is controlled over the whole bucket(slice). Meantime it is still
> tricky to maintain ->carryover_bytes during slice cycle, which can't
> cover timer delay too.
> 
> Another way is to avoid to trim slice too soon in case of owning too
> much debt, something like the following does avoid this issue:


> 
> 
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 8d149aff9fd0..a778ebbb6887 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -615,6 +615,14 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
>   	if (bytes_trim <= 0 && io_trim <= 0)
>   		return;
>   
> +	if (tg_bps_limit(tg, rw) != U64_MAX) {
> +		long long disp = tg->bytes_disp[rw];
> +
> +		if (bytes_trim > disp && (bytes_trim - disp) > disp / 2 && time_elapsed <
> +				8 * tg->td->throtl_slice)
> +			return;
> +	}

I'm not sure why we need this, why will there be too much debt? debt
from remainder should be at most 1 jiffies, new debt can only happen
when old debt is paid.

If you mean we don't clear debt in time, and the historical debt
accumulated too much, then we might want to trim slice sooner and a new
branch try to clear historical debt.

BTW, throtl_trim_slice() looks like problematic:

-       if (bytes_trim <= 0 && io_trim <= 0)
+       if (bytes_trim <= 0 || io_trim <= 0 ||
+           tg->bytes_disp[rw] < bytes_trim || tg->io_disp[rw] < io_trim)
                 return;

Otherwise, debt can be ignored.

Thanks,
Kuai

> +
>   	tg->carryover_bytes[rw] = 0;
>   	if ((long long)tg->bytes_disp[rw] >= bytes_trim)
>   		tg->bytes_disp[rw] -= bytes_trim;
> 
> 
> Thanks,
> Ming
> 
> 
> .
> 


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] blk-throttle: fix off-by-one jiffies wait_time
  2025-02-24  2:39     ` Yu Kuai
@ 2025-02-24  3:28       ` Ming Lei
  2025-02-24  7:03         ` Yu Kuai
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2025-02-24  3:28 UTC (permalink / raw)
  To: Yu Kuai
  Cc: tj, josef, axboe, cgroups, linux-block, linux-kernel, yi.zhang,
	yangerkun, yukuai (C)

On Mon, Feb 24, 2025 at 10:39:13AM +0800, Yu Kuai wrote:
> Hi,
> 
> 在 2025/02/22 20:16, Ming Lei 写道:
> > Hi Yukuai,
> > 
> > On Sat, Feb 22, 2025 at 05:28:23PM +0800, Yu Kuai wrote:
> > > From: Yu Kuai <yukuai3@huawei.com>
> > > 
> > > wait_time is based on jiffies, and the calculation is:
> > > 
> > > wait_time = extra_bytes * HZ / bps_limit;
> > > 
> > > If wait time is not zero, the remainder is ignored, means wait time is
> > > short by at most 1 jiffes; On the other hand, if wait time is zero, it
> > > will be used as 1 jiffies, means it's excessive by at most 1 jiffes.
> > > 
> > > This way causes blktests throtl/001 failure in case of CONFIG_HZ_100=y,
> > > fix the problem by recording remainder as debt, and repay the debt
> > > later.
> > 
> > After further analysis, I figured out that this one extra jiffy isn't the
> > only reason for throtl/001 failure.
> > 
> > In blktests throtl/001, bps_limit is 1MB/sec, and BS is 4k, and
> > COFIG_HZ=100, and default throttle slice is 2 jiffies(20ms):
> > 
> > - 20ms can submit 5 bios: 1024K/50(5*4k=20KB)
> > 
> > - the 6th bio is throttled, and the calculated wait is 1 jiffy from
> > tg_within_bps_limit()
> > 
> > - given all the 6 bios are handled in the time of jiffies A, so A + 1(wait) + 2(slice)
> > is programmed to start pending timer for scheduling dispatch
> > 
> > - when the pending timer is expired, the 6th bio is submitted, then the
> >    current slice is trim/reset since the throttled 6th bio is dispatched.
> > 
> > Now in the whole throttle slice period, 6 bios(24KB) are submitted, and 3
> > jiffies are taken, so 256 bios will take (256/6) * 30ms = 1.3sec.
> > 
> > But blktests throtl/001 still should pass since the allowed deviation is 0.5sec,
> > and 1.3 < 1.5.
> > 
> > Actually there is another reason of timer delay, looks one extra jiffy is
> > delayed when the timer is triggered, which can be observed reliably by:
> > 
> > bpftrace -e 'kfunc:throtl_pending_timer_fn { @timer_expire_delay = lhist(jiffies - args->t->expires, 0, 16, 1);}'
> > 
> > Then 256 bios will take (256/6) * 40ms = 1.7sec, which does match with
> > observation in throtl/001.
> 
> Thanks for the detailed explanation.
> > 
> > Yeah, killing one jiffy may pass blktests throtl/001, but, ...
> > 
> > > 
> > > Reported-and-tested-by: Ming Lei <ming.lei@redhat.com>
> > > Closes: https://lore.kernel.org/all/20250220111735.1187999-1-ming.lei@redhat.com/
> > > Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> > > ---
> > >   block/blk-throttle.c | 24 +++++++++++++++++-------
> > >   block/blk-throttle.h | 12 ++++++++----
> > >   2 files changed, 25 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> > > index 0096e486b1e3..3828c6535605 100644
> > > --- a/block/blk-throttle.c
> > > +++ b/block/blk-throttle.c
> > > @@ -703,9 +703,10 @@ static unsigned long tg_within_iops_limit(struct throtl_grp *tg, struct bio *bio
> > >   }
> > >   static unsigned long tg_within_bps_limit(struct throtl_grp *tg, struct bio *bio,
> > > -				u64 bps_limit)
> > > +				u64 bps_limit, bool *has_debt)
> > >   {
> > >   	bool rw = bio_data_dir(bio);
> > > +	long long carryover_bytes;
> > >   	long long bytes_allowed;
> > >   	u64 extra_bytes;
> > >   	unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
> > > @@ -730,10 +731,16 @@ static unsigned long tg_within_bps_limit(struct throtl_grp *tg, struct bio *bio,
> > >   	/* Calc approx time to dispatch */
> > >   	extra_bytes = tg->bytes_disp[rw] + bio_size - bytes_allowed;
> > > -	jiffy_wait = div64_u64(extra_bytes * HZ, bps_limit);
> > > -
> > > -	if (!jiffy_wait)
> > > -		jiffy_wait = 1;
> > > +	jiffy_wait = div64_u64_rem(extra_bytes * HZ, bps_limit, &carryover_bytes);
> > > +	if (carryover_bytes) {
> > > +		/*
> > > +		 * If extra_bytes is not divisible, the remainder is recorded as
> > > +		 * debt. Caller must ensure the current slice has at least 1
> > > +		 * more jiffies to repay the debt.
> > > +		 */
> > > +		*has_debt = true;
> > > +		tg->carryover_bytes[rw] -= div64_u64(carryover_bytes, HZ);
> > > +	}
> > 
> > Thinking of further, it may not be good to use ->carryover_bytes[]:
> > 
> > - if tg_within_bps_limit() returns 0 and the bio is dispatched
> >    immediately, throtl_trim_slice() is called and ->carryover_bytes[]
> >    is cleared.
> > 
> I think this should not happen, because:
> 
> If the slice is extended in consideration of carryover_bytes,
> throtl_slice_used() is checked first in throtl_trim_slice().

throtl_trim_slice() returns immediately if throtl_slice_used()
is true.

And throtl_slice_used() checks jiffies in [start, end] via time_in_range(),
so if `start <= jiffies <= end', it still returns false.

So if the 6 bio comes at the end of the slice exactly:

- tg_within_bps_limit() returns 0 from tg_within_bps_limit() with your
patch

- the bio is dispatched

- throtl_trim_slice() still clears ->carryover_bytes[] since
  throtl_slice_used() returns false.


> 
> > - if tg_within_bps_limit() returns >0, this bio will be throttled, and
> >    tg_within_bps_limit() may be called more than one time, so
> >    tg->carryover_bytes[] could be over-counted.
> > 
> Yes, carryover_bytes can be over-counted with this patch. For now, the
> only way that I can think of to fix this is to handle carryover_bytes
> from the caller of tg_may_dispatch():

As I explained, there are two reason which contributes to throtl/001,
and the throttle algorithm needn't to take the single time delay into
account, and it can dispatch more if less bytes are dispatched in
previous period of this slice because of either jiffy_wait roundup or
timer expire delay.

The opposite, probably carryover_bytes[] could be killed in future.


>  - tg_within_limit(), only handle if bio is dispatched directly;
>  - tg_update_disptime(), never handle;
>  - throtl_dispatch_tg(), always handle;
> 
> The idea is that only handle when bio is dispatched, like following
> patch(not tested yet):
> 
> [yukuai@localhost linux-next]$ git diff
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 3828c6535605..20f6bab07960 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -703,10 +703,9 @@ static unsigned long tg_within_iops_limit(struct
> throtl_grp *tg, struct bio *bio
>  }
> 
>  static unsigned long tg_within_bps_limit(struct throtl_grp *tg, struct bio
> *bio,
> -                               u64 bps_limit, bool *has_debt)
> +                               u64 bps_limit, long long *carryover_bytes)
>  {
>         bool rw = bio_data_dir(bio);
> -       long long carryover_bytes;
>         long long bytes_allowed;
>         u64 extra_bytes;
>         unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
> @@ -731,16 +730,7 @@ static unsigned long tg_within_bps_limit(struct
> throtl_grp *tg, struct bio *bio,
> 
>         /* Calc approx time to dispatch */
>         extra_bytes = tg->bytes_disp[rw] + bio_size - bytes_allowed;
> -       jiffy_wait = div64_u64_rem(extra_bytes * HZ, bps_limit,
> &carryover_bytes);
> -       if (carryover_bytes) {
> -               /*
> -                * If extra_bytes is not divisible, the remainder is
> recorded as
> -                * debt. Caller must ensure the current slice has at least 1
> -                * more jiffies to repay the debt.
> -                */
> -               *has_debt = true;
> -               tg->carryover_bytes[rw] -= div64_u64(carryover_bytes, HZ);
> -       }
> +       jiffy_wait = div64_u64_rem(extra_bytes * HZ, bps_limit,
> carryover_bytes);
> 
>         /*
>          * This wait time is without taking into consideration the rounding
> @@ -755,13 +745,12 @@ static unsigned long tg_within_bps_limit(struct
> throtl_grp *tg, struct bio *bio,
>   * of jiffies to wait before this bio is with-in IO rate and can be
> dispatched
>   */
>  static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
> -                           unsigned long *wait)
> +                           unsigned long *wait, long long *carryover_bytes)
>  {
>         bool rw = bio_data_dir(bio);
>         unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
>         u64 bps_limit = tg_bps_limit(tg, rw);
>         u32 iops_limit = tg_iops_limit(tg, rw);
> -       bool has_debt = false;
> 
>         /*
>          * Currently whole state machine of group depends on first bio
> @@ -792,20 +781,18 @@ static bool tg_may_dispatch(struct throtl_grp *tg,
> struct bio *bio,
>         else
>                 throtl_extend_slice(tg, rw, jiffies + tg->td->throtl_slice);
> 
> -       bps_wait = tg_within_bps_limit(tg, bio, bps_limit, &has_debt);
> +       bps_wait = tg_within_bps_limit(tg, bio, bps_limit, carryover_bytes);
>         iops_wait = tg_within_iops_limit(tg, bio, iops_limit);
>         if (bps_wait + iops_wait == 0) {
>                 if (wait)
>                         *wait = 0;
> -               if (has_debt)
> -                       throtl_extend_slice(tg, rw, jiffies + 1);
>                 return true;
>         }
> 
>         max_wait = max(bps_wait, iops_wait);
>         if (wait)
>                 *wait = max_wait;
> -       throtl_extend_slice(tg, rw, jiffies + max_wait + has_debt);
> +       throtl_extend_slice(tg, rw, jiffies + max_wait);
> 
>         return false;
>  }
> @@ -858,19 +845,33 @@ static void throtl_add_bio_tg(struct bio *bio, struct
> throtl_qnode *qn,
>         throtl_enqueue_tg(tg);
>  }
> 
> +static void handle_tg_carryover_bytes(struct throtl_grp *tg,
> +                                     long long carryover_bytes, int rw)
> +{
> +       if (carryover_bytes == 0)
> +               return;
> +       /*
> +        * IO is dispatched without waiting for @carryover_bytes, make sure
> +        * current slice has 1 more jiffies to repay the debt.
> +        */
> +       tg->carryover_bytes[rw] -= carryover_bytes;
> +       throtl_extend_slice(tg, rw, tg->slice_end[rw] + 1);
> +}
> +
>  static void tg_update_disptime(struct throtl_grp *tg)
>  {
>         struct throtl_service_queue *sq = &tg->service_queue;
>         unsigned long read_wait = -1, write_wait = -1, min_wait = -1,
> disptime;
> +       long long carryover_bytes = 0;
>         struct bio *bio;
> 
>         bio = throtl_peek_queued(&sq->queued[READ]);
>         if (bio)
> -               tg_may_dispatch(tg, bio, &read_wait);
> +               tg_may_dispatch(tg, bio, &read_wait, &carryover_bytes);
> 
>         bio = throtl_peek_queued(&sq->queued[WRITE]);
>         if (bio)
> -               tg_may_dispatch(tg, bio, &write_wait);
> +               tg_may_dispatch(tg, bio, &write_wait, &carryover_bytes);
> 
>         min_wait = min(read_wait, write_wait);
>         disptime = jiffies + min_wait;
> @@ -943,12 +944,15 @@ static int throtl_dispatch_tg(struct throtl_grp *tg)
>         unsigned int nr_reads = 0, nr_writes = 0;
>         unsigned int max_nr_reads = THROTL_GRP_QUANTUM * 3 / 4;
>         unsigned int max_nr_writes = THROTL_GRP_QUANTUM - max_nr_reads;
> +       long long carryover_bytes = 0;
>         struct bio *bio;
> 
>         /* Try to dispatch 75% READS and 25% WRITES */
> 
>         while ((bio = throtl_peek_queued(&sq->queued[READ])) &&
> -              tg_may_dispatch(tg, bio, NULL)) {
> +              tg_may_dispatch(tg, bio, NULL, &carryover_bytes)) {
> +               handle_tg_carryover_bytes(tg, carryover_bytes, READ);
> +               carryover_bytes = 0;
> 
>                 tg_dispatch_one_bio(tg, READ);
>                 nr_reads++;
> @@ -958,7 +962,9 @@ static int throtl_dispatch_tg(struct throtl_grp *tg)
>         }
> 
>         while ((bio = throtl_peek_queued(&sq->queued[WRITE])) &&
> -              tg_may_dispatch(tg, bio, NULL)) {
> +              tg_may_dispatch(tg, bio, NULL, &carryover_bytes)) {
> +               handle_tg_carryover_bytes(tg, carryover_bytes, WRITE);
> +               carryover_bytes = 0;
> 
>                 tg_dispatch_one_bio(tg, WRITE);
>                 nr_writes++;
> @@ -1613,11 +1619,18 @@ void blk_throtl_cancel_bios(struct gendisk *disk)
> 
>  static bool tg_within_limit(struct throtl_grp *tg, struct bio *bio, bool
> rw)
>  {
> +       long long carryover_bytes = 0;
> +
>         /* throtl is FIFO - if bios are already queued, should queue */
>         if (tg->service_queue.nr_queued[rw])
>                 return false;
> 
> -       return tg_may_dispatch(tg, bio, NULL);
> +       if (tg_may_dispatch(tg, bio, NULL, &carryover_bytes)) {
> +               handle_tg_carryover_bytes(tg, carryover_bytes, rw);
> +               return true;
> +       }
> +
> +       return false;
>  }
> 
>  static void tg_dispatch_in_debt(struct throtl_grp *tg, struct bio *bio,
> bool rw)
> 
> 
> > Actually this patch changes tg_within_bps_limit() to one stateful function...
> > 
> > >   	/*
> > >   	 * This wait time is without taking into consideration the rounding
> > > @@ -754,6 +761,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
> > >   	unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
> > >   	u64 bps_limit = tg_bps_limit(tg, rw);
> > >   	u32 iops_limit = tg_iops_limit(tg, rw);
> > > +	bool has_debt = false;
> > >   	/*
> > >    	 * Currently whole state machine of group depends on first bio
> > > @@ -784,18 +792,20 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
> > >   	else
> > >   		throtl_extend_slice(tg, rw, jiffies + tg->td->throtl_slice);
> > > -	bps_wait = tg_within_bps_limit(tg, bio, bps_limit);
> > > +	bps_wait = tg_within_bps_limit(tg, bio, bps_limit, &has_debt);
> > >   	iops_wait = tg_within_iops_limit(tg, bio, iops_limit);
> > >   	if (bps_wait + iops_wait == 0) {
> > >   		if (wait)
> > >   			*wait = 0;
> > > +		if (has_debt)
> > > +			throtl_extend_slice(tg, rw, jiffies + 1);
> > >   		return true;
> > >   	}
> > >   	max_wait = max(bps_wait, iops_wait);
> > >   	if (wait)
> > >   		*wait = max_wait;
> > > -	throtl_extend_slice(tg, rw, jiffies + max_wait);
> > > +	throtl_extend_slice(tg, rw, jiffies + max_wait + has_debt);
> > >   	return false;
> > >   }
> > > diff --git a/block/blk-throttle.h b/block/blk-throttle.h
> > > index 1a36d1278eea..56dcb5ce412f 100644
> > > --- a/block/blk-throttle.h
> > > +++ b/block/blk-throttle.h
> > > @@ -110,10 +110,14 @@ struct throtl_grp {
> > >   	unsigned int last_io_disp[2];
> > >   	/*
> > > -	 * The following two fields are updated when new configuration is
> > > -	 * submitted while some bios are still throttled, they record how many
> > > -	 * bytes/ios are waited already in previous configuration, and they will
> > > -	 * be used to calculate wait time under new configuration.
> > > +	 * The following two fields are updated when:
> > > +	 * 1) new configuration is submitted while some bios are still
> > > +	 * throttled, they record how many bytes/ios are waited already in
> > > +	 * previous configuration;
> > > +	 * 2) IOs which may cause priority inversions are dispatched while tg is
> > > +	 * over limit, these IOs will be dispatched directly;
> > > +	 * 3) While calculating wait_time for IO, extra_bytes * HZ is not
> > > +	 * divisible by bps_limit, the remainder will be recorded;
> > >   	 */
> > 
> > blk-throttle takes token bucket algorithm, and the implementation
> > shouldn't be sensitive with the above two factors, because the bps
> > rate is controlled over the whole bucket(slice). Meantime it is still
> > tricky to maintain ->carryover_bytes during slice cycle, which can't
> > cover timer delay too.
> > 
> > Another way is to avoid to trim slice too soon in case of owning too
> > much debt, something like the following does avoid this issue:
> 
> 
> > 
> > 
> > diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> > index 8d149aff9fd0..a778ebbb6887 100644
> > --- a/block/blk-throttle.c
> > +++ b/block/blk-throttle.c
> > @@ -615,6 +615,14 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
> >   	if (bytes_trim <= 0 && io_trim <= 0)
> >   		return;
> > +	if (tg_bps_limit(tg, rw) != U64_MAX) {
> > +		long long disp = tg->bytes_disp[rw];
> > +
> > +		if (bytes_trim > disp && (bytes_trim - disp) > disp / 2 && time_elapsed <
> > +				8 * tg->td->throtl_slice)
> > +			return;
> > +	}
> 
> I'm not sure why we need this, why will there be too much debt? debt
> from remainder should be at most 1 jiffies, new debt can only happen
> when old debt is paid.

Maybe term of `debt` is not accurate.

In throtl_trim_slice(), if we dispatched much less bytes than expected(bytes_trim),
there should be extra delay introduced, such as, extra one jiffy because
of roundup(), or timer expire delay, we shouldn't reset the slice because actual
rate(bps here) is much less than bps_limit.

> 
> If you mean we don't clear debt in time, and the historical debt
> accumulated too much, then we might want to trim slice sooner and a new
> branch try to clear historical debt.
> 
> BTW, throtl_trim_slice() looks like problematic:
> 
> -       if (bytes_trim <= 0 && io_trim <= 0)
> +       if (bytes_trim <= 0 || io_trim <= 0 ||
> +           tg->bytes_disp[rw] < bytes_trim || tg->io_disp[rw] < io_trim)
>                 return;

That is exactly what my patch is doing, just taking deviation and
timeout into account, also U64_MAX limit has to be excluded.

If you test the patch, it works just fine. My patch controls bytes_exp 
<= 1.5 * disp, then throtl/001 can be completed in 1.5sec, and if it is
changed to 1.25 * disp, the test can be completed in 1.25sec.

With this fix, why do we have to play the complicated carryover
trick?


Thanks,
Ming


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] blk-throttle: fix off-by-one jiffies wait_time
  2025-02-24  3:28       ` Ming Lei
@ 2025-02-24  7:03         ` Yu Kuai
  2025-02-24  8:56           ` Ming Lei
  0 siblings, 1 reply; 18+ messages in thread
From: Yu Kuai @ 2025-02-24  7:03 UTC (permalink / raw)
  To: Ming Lei, Yu Kuai
  Cc: tj, josef, axboe, cgroups, linux-block, linux-kernel, yi.zhang,
	yangerkun, yukuai (C)

Hi, Ming!

在 2025/02/24 11:28, Ming Lei 写道:
> throtl_trim_slice() returns immediately if throtl_slice_used()
> is true.
> 
> And throtl_slice_used() checks jiffies in [start, end] via time_in_range(),
> so if `start <= jiffies <= end', it still returns false.

Yes, I misread the code, by thinking throtl_slice_used() will return
true if the slice is still used. :(

>> BTW, throtl_trim_slice() looks like problematic:
>>
>> -       if (bytes_trim <= 0 && io_trim <= 0)
>> +       if (bytes_trim <= 0 || io_trim <= 0 ||
>> +           tg->bytes_disp[rw] < bytes_trim || tg->io_disp[rw] < io_trim)
>>                  return;
> That is exactly what my patch is doing, just taking deviation and
> timeout into account, also U64_MAX limit has to be excluded.
Yes, perhaps you can add some comments in the last two conditions of
your patch. I think maybe you're concerned about the case IO is
throttled by IOs and bytes_disp should really erase extra bytes that
does not reach bps_limit.

> 
> If you test the patch, it works just fine. My patch controls bytes_exp
> <= 1.5 * disp, then throtl/001 can be completed in 1.5sec, and if it is
> changed to 1.25 * disp, the test can be completed in 1.25sec.
> 
> With this fix, why do we have to play the complicated carryover
> trick?
> 

I understand your code now. carryover_bytes in this case is wrong, as
long as new slice is not started, and trim slice doesn't erase extra
bytes by mistake, throttle time should be accurate over time because
bytes_disp is accurate.

And root cause really is throtl_trim_slice().

However, by code review, I think throtl_start_new_slice() should be
handled as well, the same as throtl_trim_slice(), if the old bio is
dispatched with one more jiffies wait time, issue a new bio in the same
jiffies will have the same problem. For example:

Caller do sync IO, with io size: (bps_limit / (HZ / throtl_slice) + 1),
and caller will issue new IO when old IO is done. Then in this case,
each IO will start a new slice and wait for throtl_slice + 1 jiffies. I
believe this can be fixed as well so that BIOs after the first one will
only wait for throtl_silce jiffies.

> Thanks,
> Ming
> 
> 
> .
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] blk-throttle: fix off-by-one jiffies wait_time
  2025-02-24  7:03         ` Yu Kuai
@ 2025-02-24  8:56           ` Ming Lei
  2025-02-24 12:03             ` Yu Kuai
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2025-02-24  8:56 UTC (permalink / raw)
  To: Yu Kuai
  Cc: tj, josef, axboe, cgroups, linux-block, linux-kernel, yi.zhang,
	yangerkun, yukuai (C)

On Mon, Feb 24, 2025 at 03:03:18PM +0800, Yu Kuai wrote:
> Hi, Ming!
> 
> 在 2025/02/24 11:28, Ming Lei 写道:
> > throtl_trim_slice() returns immediately if throtl_slice_used()
> > is true.
> > 
> > And throtl_slice_used() checks jiffies in [start, end] via time_in_range(),
> > so if `start <= jiffies <= end', it still returns false.
> 
> Yes, I misread the code, by thinking throtl_slice_used() will return
> true if the slice is still used. :(
> 
> 
> > > BTW, throtl_trim_slice() looks like problematic:
> > > 
> > > -       if (bytes_trim <= 0 && io_trim <= 0)
> > > +       if (bytes_trim <= 0 || io_trim <= 0 ||
> > > +           tg->bytes_disp[rw] < bytes_trim || tg->io_disp[rw] < io_trim)
> > >                  return;
> > That is exactly what my patch is doing, just taking deviation and
> > timeout into account, also U64_MAX limit has to be excluded.
> Yes, perhaps you can add some comments in the last two conditions of
> your patch.

Yes, we need to add comment on the check, how about the following words?

```

If actually rate doesn't match with expected rate, do not trim slice
otherwise the present rate control info is lost, we don't have chance
to compensate it in the following period of this slice any more.

Add one deviation threshold since bio size is usually not divisible by
present rate, and bio dispatch should be done or nothing

Also limit max slice size for avoiding to extend the window too big.

```

> I think maybe you're concerned about the case IO is
> throttled by IOs and bytes_disp should really erase extra bytes that
> does not reach bps_limit.
> 
> > 
> > If you test the patch, it works just fine. My patch controls bytes_exp
> > <= 1.5 * disp, then throtl/001 can be completed in 1.5sec, and if it is
> > changed to 1.25 * disp, the test can be completed in 1.25sec.
> > 
> > With this fix, why do we have to play the complicated carryover
> > trick?
> > 
> 
> I understand your code now. carryover_bytes in this case is wrong, as
> long as new slice is not started, and trim slice doesn't erase extra
> bytes by mistake, throttle time should be accurate over time because
> bytes_disp is accurate.

Yes.

More or less wait for handling single bio can just affect instantaneous rate,
but the algorithm is adaptive so it can adjust the following wait if the
slice isn't over.

> 
> And root cause really is throtl_trim_slice().
> 
> However, by code review, I think throtl_start_new_slice() should be
> handled as well, the same as throtl_trim_slice(), if the old bio is
> dispatched with one more jiffies wait time, issue a new bio in the same
> jiffies will have the same problem. For example:

throtl_start_new_slice() is called when nothing is queued and the current
slice is used, so probably it is fine.

throtl_start_new_slice_with_credit() is called when dispatching one
bio, and it is still called when the current slice is used.

> 
> Caller do sync IO, with io size: (bps_limit / (HZ / throtl_slice) + 1),

This will cause wait for every single IO.

But once the IO is throttled because of the wait, throtl_start_new_slice()
won't succeed any more, meantime throtl_trim_slice() will try to fix the
rate control for the extra 1 jiffies.

So in-tree code with trim fix works just fine if consecutive IOs are
coming.

> and caller will issue new IO when old IO is done. Then in this case,
> each IO will start a new slice and wait for throtl_slice + 1 jiffies. I
> believe this can be fixed as well so that BIOs after the first one will
> only wait for throtl_silce jiffies.

I guess the above case only exists in theory when you submit new IO
after the old IO is dispatched immediately. Not sure this case is really
meaningful because submission period/latency is added/faked by user.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] blk-throttle: fix off-by-one jiffies wait_time
  2025-02-24  8:56           ` Ming Lei
@ 2025-02-24 12:03             ` Yu Kuai
  2025-02-25  1:24               ` Ming Lei
  0 siblings, 1 reply; 18+ messages in thread
From: Yu Kuai @ 2025-02-24 12:03 UTC (permalink / raw)
  To: Ming Lei, Yu Kuai
  Cc: tj, josef, axboe, cgroups, linux-block, linux-kernel, yi.zhang,
	yangerkun, yukuai (C)

Hi,

在 2025/02/24 16:56, Ming Lei 写道:
> On Mon, Feb 24, 2025 at 03:03:18PM +0800, Yu Kuai wrote:
>> Hi, Ming!
>>
>> 在 2025/02/24 11:28, Ming Lei 写道:
>>> throtl_trim_slice() returns immediately if throtl_slice_used()
>>> is true.
>>>
>>> And throtl_slice_used() checks jiffies in [start, end] via time_in_range(),
>>> so if `start <= jiffies <= end', it still returns false.
>>
>> Yes, I misread the code, by thinking throtl_slice_used() will return
>> true if the slice is still used. :(
>>
>>
>>>> BTW, throtl_trim_slice() looks like problematic:
>>>>
>>>> -       if (bytes_trim <= 0 && io_trim <= 0)
>>>> +       if (bytes_trim <= 0 || io_trim <= 0 ||
>>>> +           tg->bytes_disp[rw] < bytes_trim || tg->io_disp[rw] < io_trim)
>>>>                   return;
>>> That is exactly what my patch is doing, just taking deviation and
>>> timeout into account, also U64_MAX limit has to be excluded.
>> Yes, perhaps you can add some comments in the last two conditions of
>> your patch.
> 
> Yes, we need to add comment on the check, how about the following words?
> 
> ```
> 
> If actually rate doesn't match with expected rate, do not trim slice
> otherwise the present rate control info is lost, we don't have chance
> to compensate it in the following period of this slice any more.

So, I just give your patch a test, and result is 1.3s while 1s is
expected. While debuging, a new idea come up in mind. :)

How about keep at least one slice out of consideration from
throtl_trim_slice()? With following patch, the result is between
1.01-1.03s in my VM.

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 8d149aff9fd0..5207c85098a5 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -604,9 +604,12 @@ static inline void throtl_trim_slice(struct 
throtl_grp *tg, bool rw)

         time_elapsed = rounddown(jiffies - tg->slice_start[rw],
                                  tg->td->throtl_slice);
-       if (!time_elapsed)
+       /* don't trim slice until at least 2 slice is used */
+       if (time_elapsed < tg->td->throtl_slice * 2)
                 return;

+       /* dispite the last slice, trim previous slice */
+       time_elapsed -= tg->td->throtl_slice;
         bytes_trim = calculate_bytes_allowed(tg_bps_limit(tg, rw),
                                              time_elapsed) +
                      tg->carryover_bytes[rw];

> 
> Add one deviation threshold since bio size is usually not divisible by
> present rate, and bio dispatch should be done or nothing
> 
> Also limit max slice size for avoiding to extend the window too big.
> 
> ```
> 
> 
>> I think maybe you're concerned about the case IO is
>> throttled by IOs and bytes_disp should really erase extra bytes that
>> does not reach bps_limit.
>>
>>>
>>> If you test the patch, it works just fine. My patch controls bytes_exp
>>> <= 1.5 * disp, then throtl/001 can be completed in 1.5sec, and if it is
>>> changed to 1.25 * disp, the test can be completed in 1.25sec.
>>>
>>> With this fix, why do we have to play the complicated carryover
>>> trick?
>>>
>>
>> I understand your code now. carryover_bytes in this case is wrong, as
>> long as new slice is not started, and trim slice doesn't erase extra
>> bytes by mistake, throttle time should be accurate over time because
>> bytes_disp is accurate.
> 
> Yes.
> 
> More or less wait for handling single bio can just affect instantaneous rate,
> but the algorithm is adaptive so it can adjust the following wait if the
> slice isn't over.
> 
>>
>> And root cause really is throtl_trim_slice().
>>
>> However, by code review, I think throtl_start_new_slice() should be
>> handled as well, the same as throtl_trim_slice(), if the old bio is
>> dispatched with one more jiffies wait time, issue a new bio in the same
>> jiffies will have the same problem. For example:
> 
> throtl_start_new_slice() is called when nothing is queued and the current
> slice is used, so probably it is fine.
> 
> throtl_start_new_slice_with_credit() is called when dispatching one
> bio, and it is still called when the current slice is used.
> 
>>
>> Caller do sync IO, with io size: (bps_limit / (HZ / throtl_slice) + 1),
> 
> This will cause wait for every single IO.
> 
> But once the IO is throttled because of the wait, throtl_start_new_slice()
> won't succeed any more, meantime throtl_trim_slice() will try to fix the
> rate control for the extra 1 jiffies.
> 
> So in-tree code with trim fix works just fine if consecutive IOs are
> coming.
> 
>> and caller will issue new IO when old IO is done. Then in this case,
>> each IO will start a new slice and wait for throtl_slice + 1 jiffies. I
>> believe this can be fixed as well so that BIOs after the first one will
>> only wait for throtl_silce jiffies.
> 
> I guess the above case only exists in theory when you submit new IO
> after the old IO is dispatched immediately. Not sure this case is really
> meaningful because submission period/latency is added/faked by user.

Yes, this case is just a theory, we don't need to care for now. :)

Thanks,
Kuai

> 
> 
> Thanks,
> Ming
> 
> 
> .
> 


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] blk-throttle: fix off-by-one jiffies wait_time
  2025-02-24 12:03             ` Yu Kuai
@ 2025-02-25  1:24               ` Ming Lei
  2025-02-25  2:07                 ` Yu Kuai
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2025-02-25  1:24 UTC (permalink / raw)
  To: Yu Kuai
  Cc: tj, josef, axboe, cgroups, linux-block, linux-kernel, yi.zhang,
	yangerkun, yukuai (C)

On Mon, Feb 24, 2025 at 08:03:32PM +0800, Yu Kuai wrote:
> Hi,
> 
> 在 2025/02/24 16:56, Ming Lei 写道:
> > On Mon, Feb 24, 2025 at 03:03:18PM +0800, Yu Kuai wrote:
> > > Hi, Ming!
> > > 
> > > 在 2025/02/24 11:28, Ming Lei 写道:
> > > > throtl_trim_slice() returns immediately if throtl_slice_used()
> > > > is true.
> > > > 
> > > > And throtl_slice_used() checks jiffies in [start, end] via time_in_range(),
> > > > so if `start <= jiffies <= end', it still returns false.
> > > 
> > > Yes, I misread the code, by thinking throtl_slice_used() will return
> > > true if the slice is still used. :(
> > > 
> > > 
> > > > > BTW, throtl_trim_slice() looks like problematic:
> > > > > 
> > > > > -       if (bytes_trim <= 0 && io_trim <= 0)
> > > > > +       if (bytes_trim <= 0 || io_trim <= 0 ||
> > > > > +           tg->bytes_disp[rw] < bytes_trim || tg->io_disp[rw] < io_trim)
> > > > >                   return;
> > > > That is exactly what my patch is doing, just taking deviation and
> > > > timeout into account, also U64_MAX limit has to be excluded.
> > > Yes, perhaps you can add some comments in the last two conditions of
> > > your patch.
> > 
> > Yes, we need to add comment on the check, how about the following words?
> > 
> > ```
> > 
> > If actually rate doesn't match with expected rate, do not trim slice
> > otherwise the present rate control info is lost, we don't have chance
> > to compensate it in the following period of this slice any more.
> 
> So, I just give your patch a test, and result is 1.3s while 1s is
> expected. While debuging, a new idea come up in mind. :)
> 
> How about keep at least one slice out of consideration from
> throtl_trim_slice()? With following patch, the result is between
> 1.01-1.03s in my VM.

That is easy to get the same result with the approach I suggested,
another big benefit: it is adaptive, and blk-throttle may get
simplified.

> 
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 8d149aff9fd0..5207c85098a5 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -604,9 +604,12 @@ static inline void throtl_trim_slice(struct throtl_grp
> *tg, bool rw)
> 
>         time_elapsed = rounddown(jiffies - tg->slice_start[rw],
>                                  tg->td->throtl_slice);
> -       if (!time_elapsed)
> +       /* don't trim slice until at least 2 slice is used */
> +       if (time_elapsed < tg->td->throtl_slice * 2)
>                 return;

If you just want to fix throtl/001, the above patch might
work(sometimes, it might not, and timer may expire by 2 jiffies), but it
is easy to fail other tests, such as, reduce the bps limit a bit, and
increase BS a bit to make the IO cross exactly two slices.

Also the big question is that how you can make sure that rate is always
good when the window is >= 2 slice?


Thanks,
Ming


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] blk-throttle: fix off-by-one jiffies wait_time
  2025-02-25  1:24               ` Ming Lei
@ 2025-02-25  2:07                 ` Yu Kuai
  2025-02-25  2:28                   ` Ming Lei
  0 siblings, 1 reply; 18+ messages in thread
From: Yu Kuai @ 2025-02-25  2:07 UTC (permalink / raw)
  To: Ming Lei, Yu Kuai
  Cc: tj, josef, axboe, cgroups, linux-block, linux-kernel, yi.zhang,
	yangerkun, yukuai (C)

Hi,

在 2025/02/25 9:24, Ming Lei 写道:
>> -       if (!time_elapsed)
>> +       /* don't trim slice until at least 2 slice is used */
>> +       if (time_elapsed < tg->td->throtl_slice * 2)
>>                  return;
> 
> If you just want to fix throtl/001, the above patch might
> work(sometimes, it might not, and timer may expire by 2 jiffies), but it
> is easy to fail other tests, such as, reduce the bps limit a bit, and
> increase BS a bit to make the IO cross exactly two slices.

That's fine, the key point is the following code, above code is
just to make sure there is still at least one slice to trim after
removing the last slice.

+       /* dispite the last slice, trim previous slice */
+       time_elapsed -= tg->td->throtl_slice;

In this case, if one BIO cross 1+ slices, the rate is the same as
expected in the previous slices, we can trim them without any negative
impact.

Thanks,
Kuai

> 
> Also the big question is that how you can make sure that rate is always
> good when the window is >= 2 slice?
> 
> 
> Thanks,
> Ming
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] blk-throttle: fix off-by-one jiffies wait_time
  2025-02-25  2:07                 ` Yu Kuai
@ 2025-02-25  2:28                   ` Ming Lei
  2025-02-25  3:12                     ` Yu Kuai
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2025-02-25  2:28 UTC (permalink / raw)
  To: Yu Kuai
  Cc: tj, josef, axboe, cgroups, linux-block, linux-kernel, yi.zhang,
	yangerkun, yukuai (C)

On Tue, Feb 25, 2025 at 10:07:06AM +0800, Yu Kuai wrote:
> Hi,
> 
> 在 2025/02/25 9:24, Ming Lei 写道:
> > > -       if (!time_elapsed)
> > > +       /* don't trim slice until at least 2 slice is used */
> > > +       if (time_elapsed < tg->td->throtl_slice * 2)
> > >                  return;
> > 
> > If you just want to fix throtl/001, the above patch might
> > work(sometimes, it might not, and timer may expire by 2 jiffies), but it
> > is easy to fail other tests, such as, reduce the bps limit a bit, and
> > increase BS a bit to make the IO cross exactly two slices.
> 
> That's fine, the key point is the following code, above code is
> just to make sure there is still at least one slice to trim after
> removing the last slice.
> 
> +       /* dispite the last slice, trim previous slice */
> +       time_elapsed -= tg->td->throtl_slice;
> 
> In this case, if one BIO cross 1+ slices, the rate is the same as
> expected in the previous slices, we can trim them without any negative
> impact.

Can you explain in details why it signals that the rate is expected now?

If rate isn't expected, it will cause trouble to trim, even just the
previous part.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] blk-throttle: fix off-by-one jiffies wait_time
  2025-02-25  2:28                   ` Ming Lei
@ 2025-02-25  3:12                     ` Yu Kuai
  2025-02-25  8:21                       ` Ming Lei
  0 siblings, 1 reply; 18+ messages in thread
From: Yu Kuai @ 2025-02-25  3:12 UTC (permalink / raw)
  To: Ming Lei, Yu Kuai
  Cc: tj, josef, axboe, cgroups, linux-block, linux-kernel, yi.zhang,
	yangerkun, yukuai (C)

Hi, Ming!

在 2025/02/25 10:28, Ming Lei 写道:
> Can you explain in details why it signals that the rate is expected now?
> 
> If rate isn't expected, it will cause trouble to trim, even just the
> previous part.

Ok, for example, assume bps_limit is 1000bytes, 1 jiffes is 10ms, and
slice is 20ms(2 jiffies).

expected rate is 1000 / 1000 * 20 = 20 bytes per slice.

If user issue two 21 bytes IO, then wait time will be 30ms:

bytes_allowed = 20, extra_bytes = 1;
jiffy_wait = 1 + 2 = 3 jiffies

and consider
extra 1 jiffies by timer, throtl_trim_slice() will be called at:

jiffies = 40ms
slice_start = 0ms, slice_end= 40ms
bytes_disp = 21

In this case, before the patch, real rate in the first two slice is is
10.5 bytes per slice, and slice will be updated:

jiffies = 40ms
slice_start = 40ms, slice_end = 60ms,
bytes_disp = 0;

Hence the second IO will have to wait another 30ms;

With the patch, the real rate in the first slice is 20 bytes per slice,
which is the same as expected, and slice will be updated:

jiffies=40ms,
slice_start = 20ms, slice_end = 60ms,
bytes_disp = 1;

And now, the second IO will only have to wait 10ms;

Thanks,
Kuai


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] blk-throttle: fix off-by-one jiffies wait_time
  2025-02-25  3:12                     ` Yu Kuai
@ 2025-02-25  8:21                       ` Ming Lei
  2025-02-25 11:09                         ` Yu Kuai
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2025-02-25  8:21 UTC (permalink / raw)
  To: Yu Kuai
  Cc: tj, josef, axboe, cgroups, linux-block, linux-kernel, yi.zhang,
	yangerkun, yukuai (C)

On Tue, Feb 25, 2025 at 11:12:24AM +0800, Yu Kuai wrote:
> Hi, Ming!
> 
> 在 2025/02/25 10:28, Ming Lei 写道:
> > Can you explain in details why it signals that the rate is expected now?
> > 
> > If rate isn't expected, it will cause trouble to trim, even just the
> > previous part.
> 
> Ok, for example, assume bps_limit is 1000bytes, 1 jiffes is 10ms, and
> slice is 20ms(2 jiffies).
> 

We all know how it works, but I didn't understand the behind idea why it
is correct. Now I figured it out: 

1) increase default slice window to 2 * td->throttle_slice

2) slice window is set as [jiffies - td->throttle_slice, jiffies + td->throttle_slice]

3) initialize td->bytes_disp[]/td->io_dis[] as actual dispatched bytes/ios
done [jiffies - td->throttle_slice, 0]

This approach looks smart, and it should work well for any deviation which is <= 1
throttle_slice.

Probably it is enough for fixing the issue in throtl/001, even though 2 jiffies
timer drift still may be observed, see the below log collected in my VM(HZ_100)
by just running one time of blktests './check throtl':

@timer_expire_delay:
[1, 2)               387 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2, 3)                11 |@                                                   |

bpftrace -e 'kfunc:throtl_pending_timer_fn { @timer_expire_delay = lhist(jiffies - args->t->expires, 0, 16, 1);}'


Also I'd suggest to remove ->carryover_bytes/ios since blk-throttle algorithm is
supposed to be adaptive, and the approach I suggested may cover this area,
what do you think of this cleanup? I have one local patchset, which can
pass all blktest throtl tests with removing ->carryover_bytes/ios.



Thanks,
Ming


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] blk-throttle: fix off-by-one jiffies wait_time
  2025-02-25  8:21                       ` Ming Lei
@ 2025-02-25 11:09                         ` Yu Kuai
  2025-02-25 12:00                           ` Ming Lei
  0 siblings, 1 reply; 18+ messages in thread
From: Yu Kuai @ 2025-02-25 11:09 UTC (permalink / raw)
  To: Ming Lei, Yu Kuai
  Cc: tj, josef, axboe, cgroups, linux-block, linux-kernel, yi.zhang,
	yangerkun, yukuai (C)

Hi,

在 2025/02/25 16:21, Ming Lei 写道:
> On Tue, Feb 25, 2025 at 11:12:24AM +0800, Yu Kuai wrote:
>> Hi, Ming!
>>
>> 在 2025/02/25 10:28, Ming Lei 写道:
>>> Can you explain in details why it signals that the rate is expected now?
>>>
>>> If rate isn't expected, it will cause trouble to trim, even just the
>>> previous part.
>>
>> Ok, for example, assume bps_limit is 1000bytes, 1 jiffes is 10ms, and
>> slice is 20ms(2 jiffies).
>>
> 
> We all know how it works, but I didn't understand the behind idea why it
> is correct. Now I figured it out:
> 
> 1) increase default slice window to 2 * td->throttle_slice
> 
> 2) slice window is set as [jiffies - td->throttle_slice, jiffies + td->throttle_slice]
> 
> 3) initialize td->bytes_disp[]/td->io_dis[] as actual dispatched bytes/ios
> done [jiffies - td->throttle_slice, 0]
> 
> This approach looks smart, and it should work well for any deviation which is <= 1
> throttle_slice.
> 
> Probably it is enough for fixing the issue in throtl/001, even though 2 jiffies
> timer drift still may be observed, see the below log collected in my VM(HZ_100)
> by just running one time of blktests './check throtl':
> 
> @timer_expire_delay:
> [1, 2)               387 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [2, 3)                11 |@                                                   |
> 
> bpftrace -e 'kfunc:throtl_pending_timer_fn { @timer_expire_delay = lhist(jiffies - args->t->expires, 0, 16, 1);}'
> 
> 
> Also I'd suggest to remove ->carryover_bytes/ios since blk-throttle algorithm is
> supposed to be adaptive, and the approach I suggested may cover this area,
> what do you think of this cleanup? I have one local patchset, which can
> pass all blktest throtl tests with removing ->carryover_bytes/ios.
> 

It's always welcome for such cleanup. BTW, do you have plans to support
bio merge for iops limit in blk-throttle? Since bio split is handled. I
was thinking about using carryover_ios, perhaps you can handle this as
well.

Thanks,
Kuai
> 
> 
> Thanks,
> Ming
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] blk-throttle: fix off-by-one jiffies wait_time
  2025-02-25 11:09                         ` Yu Kuai
@ 2025-02-25 12:00                           ` Ming Lei
  0 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2025-02-25 12:00 UTC (permalink / raw)
  To: Yu Kuai
  Cc: tj, josef, axboe, cgroups, linux-block, linux-kernel, yi.zhang,
	yangerkun, yukuai (C)

On Tue, Feb 25, 2025 at 07:09:30PM +0800, Yu Kuai wrote:
> Hi,
> 
> 在 2025/02/25 16:21, Ming Lei 写道:
> > On Tue, Feb 25, 2025 at 11:12:24AM +0800, Yu Kuai wrote:
> > > Hi, Ming!
> > > 
> > > 在 2025/02/25 10:28, Ming Lei 写道:
> > > > Can you explain in details why it signals that the rate is expected now?
> > > > 
> > > > If rate isn't expected, it will cause trouble to trim, even just the
> > > > previous part.
> > > 
> > > Ok, for example, assume bps_limit is 1000bytes, 1 jiffes is 10ms, and
> > > slice is 20ms(2 jiffies).
> > > 
> > 
> > We all know how it works, but I didn't understand the behind idea why it
> > is correct. Now I figured it out:
> > 
> > 1) increase default slice window to 2 * td->throttle_slice
> > 
> > 2) slice window is set as [jiffies - td->throttle_slice, jiffies + td->throttle_slice]
> > 
> > 3) initialize td->bytes_disp[]/td->io_dis[] as actual dispatched bytes/ios
> > done [jiffies - td->throttle_slice, 0]
> > 
> > This approach looks smart, and it should work well for any deviation which is <= 1
> > throttle_slice.
> > 
> > Probably it is enough for fixing the issue in throtl/001, even though 2 jiffies
> > timer drift still may be observed, see the below log collected in my VM(HZ_100)
> > by just running one time of blktests './check throtl':
> > 
> > @timer_expire_delay:
> > [1, 2)               387 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [2, 3)                11 |@                                                   |
> > 
> > bpftrace -e 'kfunc:throtl_pending_timer_fn { @timer_expire_delay = lhist(jiffies - args->t->expires, 0, 16, 1);}'
> > 
> > 
> > Also I'd suggest to remove ->carryover_bytes/ios since blk-throttle algorithm is
> > supposed to be adaptive, and the approach I suggested may cover this area,
> > what do you think of this cleanup? I have one local patchset, which can
> > pass all blktest throtl tests with removing ->carryover_bytes/ios.
> > 
> 
> It's always welcome for such cleanup. BTW, do you have plans to support
> bio merge for iops limit in blk-throttle?
> Since bio split is handled. I
> was thinking about using carryover_ios, perhaps you can handle this as
> well.

I don't know the two problems.

Let's focus on fixing throtl/001 first.

I raised the cleanup on carryover_ios because the fix I proposed in [1]
may help to cover carryover_ios too.

But I guess your patch of doubling splice window is better for fixing
throtl/001, can you send a formal patch with comment for fixing this
issue first?


[1] https://lore.kernel.org/linux-block/Z7nAJSKGANoC0Glb@fedora/



Thanks,
Ming


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/2] blk-throttle: fix off-by-one jiffies wait_time
  2025-02-22  9:28 [PATCH 0/2] blk-throttle: fix off-by-one jiffies wait_time Yu Kuai
  2025-02-22  9:28 ` [PATCH 1/2] blk-throttle: cleanup throtl_extend_slice() Yu Kuai
  2025-02-22  9:28 ` [PATCH 2/2] blk-throttle: fix off-by-one jiffies wait_time Yu Kuai
@ 2025-02-25 10:51 ` Michal Koutný
  2 siblings, 0 replies; 18+ messages in thread
From: Michal Koutný @ 2025-02-25 10:51 UTC (permalink / raw)
  To: Yu Kuai
  Cc: ming.lei, tj, josef, axboe, cgroups, linux-block, linux-kernel,
	yukuai3, yi.zhang, yangerkun

[-- Attachment #1: Type: text/plain, Size: 358 bytes --]

On Sat, Feb 22, 2025 at 05:28:21PM +0800, Yu Kuai <yukuai1@huaweicloud.com> wrote:
> Yu Kuai (2):
>   blk-throttle: cleanup throtl_extend_slice()
>   blk-throttle: fix off-by-one jiffies wait_time

(I haven't read through all of the 2/2 discussion but) if the 1/2 patch
is only a cleanup, it's more backport friendly to put it after the fix.

Thanks,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-02-25 20:30 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-22  9:28 [PATCH 0/2] blk-throttle: fix off-by-one jiffies wait_time Yu Kuai
2025-02-22  9:28 ` [PATCH 1/2] blk-throttle: cleanup throtl_extend_slice() Yu Kuai
2025-02-25 20:30   ` Tejun Heo
2025-02-22  9:28 ` [PATCH 2/2] blk-throttle: fix off-by-one jiffies wait_time Yu Kuai
2025-02-22 12:16   ` Ming Lei
2025-02-24  2:39     ` Yu Kuai
2025-02-24  3:28       ` Ming Lei
2025-02-24  7:03         ` Yu Kuai
2025-02-24  8:56           ` Ming Lei
2025-02-24 12:03             ` Yu Kuai
2025-02-25  1:24               ` Ming Lei
2025-02-25  2:07                 ` Yu Kuai
2025-02-25  2:28                   ` Ming Lei
2025-02-25  3:12                     ` Yu Kuai
2025-02-25  8:21                       ` Ming Lei
2025-02-25 11:09                         ` Yu Kuai
2025-02-25 12:00                           ` Ming Lei
2025-02-25 10:51 ` [PATCH 0/2] " Michal Koutný

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).