* 10x I/O await times in 6.12
@ 2025-04-21 8:53 Matt Fleming
2025-04-21 12:21 ` Yu Kuai
2025-04-21 15:22 ` Keith Busch
0 siblings, 2 replies; 16+ messages in thread
From: Matt Fleming @ 2025-04-21 8:53 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-block, linux-kernel, kernel-team
Hey there,
We're moving to 6.12 at Cloudflare and noticed that write await times
in iostat are 10x what they were in 6.6. After a bit of bpftracing
(script to find all plug times above 10ms below), it seems like this
is an accounting error caused by the plug->cur_ktime optimisation
rather than anything more material.
It appears as though a task can enter __submit_bio() with ->plug set
and a very stale cur_ktime value on the order of milliseconds. Is this
expected behaviour? It looks like it leads to inaccurate I/O times.
Thanks,
Matt
---->8----
$ sudo bpftrace -e 'k:__submit_bio {
        $p = curtask->plug;
        if ($p != 0) {
                if (((nsecs - $p->cur_ktime) / 1000) > 10000) {
                        @[kstack] = count();
                }
        }
}'
Attaching 1 probe...
^C
@[
__submit_bio+1
submit_bio_noacct_nocheck+390
submit_bio_wait+92
swap_writepage_bdev_sync+262
swap_writepage+315
pageout+291
shrink_folio_list+1835
shrink_lruvec+1683
shrink_node+784
balance_pgdat+877
kswapd+496
kthread+207
ret_from_fork+49
ret_from_fork_asm+26
]: 184
@[
__submit_bio+1
submit_bio_noacct_nocheck+390
_xfs_buf_ioapply+599
__xfs_buf_submit+110
xfs_buf_delwri_submit_buffers+399
xfsaild+691
kthread+207
ret_from_fork+49
ret_from_fork_asm+26
]: 28123
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: 10x I/O await times in 6.12
2025-04-21 8:53 10x I/O await times in 6.12 Matt Fleming
@ 2025-04-21 12:21 ` Yu Kuai
2025-04-23 10:51 ` Matt Fleming
2026-04-23 18:18 ` Matt Fleming
2025-04-21 15:22 ` Keith Busch
1 sibling, 2 replies; 16+ messages in thread
From: Yu Kuai @ 2025-04-21 12:21 UTC (permalink / raw)
To: Matt Fleming, Jens Axboe
Cc: linux-block, linux-kernel, kernel-team, yukuai (C)
Hi,
在 2025/04/21 16:53, Matt Fleming 写道:
> Hey there,
>
> We're moving to 6.12 at Cloudflare and noticed that write await times
> in iostat are 10x what they were in 6.6. After a bit of bpftracing
> (script to find all plug times above 10ms below), it seems like this
> is an accounting error caused by the plug->cur_ktime optimisation
> rather than anything more material.
What is the base value? If it's microseconds, IO performance should
actually be better.
>
> It appears as though a task can enter __submit_bio() with ->plug set
> and a very stale cur_ktime value on the order of milliseconds. Is this
> expected behaviour? It looks like it leads to inaccurate I/O times.
>
For microseconds I think it's expected; however, I don't think
milliseconds is acceptable.
> Thanks,
> Matt
>
> ---->8----
> $ sudo bpftrace -e 'k:__submit_bio { $p = curtask->plug; if ($p != 0)
> { if (((nsecs - $p->cur_ktime) / 1000) > 10000) { @[kstack] =
> count();}}}'
Can you drop this expensive bpftrace script, which might affect IO
performance, and trace __blk_flush_plug directly instead of
__submit_bio? If nsecs - plug->cur_ktime is still milliseconds, can
you check if the following patch fixes your problem?
Thanks,
Kuai
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ae8494d88897..37197502147e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1095,7 +1095,9 @@ static inline void blk_account_io_start(struct request *req)
 		return;
 
 	req->rq_flags |= RQF_IO_STAT;
-	req->start_time_ns = blk_time_get_ns();
+
+	if (!current->plug)
+		req->start_time_ns = blk_time_get_ns();
 
 	/*
 	 * All non-passthrough requests are created from a bio with one
@@ -2874,6 +2876,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 {
 	struct request *rq;
 	unsigned int depth;
+	u64 now;
 
 	/*
 	 * We may have been called recursively midway through handling
@@ -2887,6 +2890,10 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 	depth = plug->rq_count;
 	plug->rq_count = 0;
 
+	now = ktime_get_ns();
+	rq_list_for_each(&plug->mq_list, rq)
+		rq->start_time_ns = now;
+
 	if (!plug->multiple_queues && !plug->has_elevator && !from_schedule) {
 		struct request_queue *q;
> Attaching 1 probe...
> ^C
>
> @[
> __submit_bio+1
> submit_bio_noacct_nocheck+390
> submit_bio_wait+92
> swap_writepage_bdev_sync+262
> swap_writepage+315
> pageout+291
> shrink_folio_list+1835
> shrink_lruvec+1683
> shrink_node+784
> balance_pgdat+877
> kswapd+496
> kthread+207
> ret_from_fork+49
> ret_from_fork_asm+26
> ]: 184
> @[
> __submit_bio+1
> submit_bio_noacct_nocheck+390
> _xfs_buf_ioapply+599
> __xfs_buf_submit+110
> xfs_buf_delwri_submit_buffers+399
> xfsaild+691
> kthread+207
> ret_from_fork+49
> ret_from_fork_asm+26
> ]: 28123
>
> .
>
* Re: 10x I/O await times in 6.12
2025-04-21 8:53 10x I/O await times in 6.12 Matt Fleming
2025-04-21 12:21 ` Yu Kuai
@ 2025-04-21 15:22 ` Keith Busch
2025-04-21 18:35 ` Matt Fleming
2025-04-22 1:28 ` Yu Kuai
1 sibling, 2 replies; 16+ messages in thread
From: Keith Busch @ 2025-04-21 15:22 UTC (permalink / raw)
To: Matt Fleming; +Cc: Jens Axboe, linux-block, linux-kernel, kernel-team
On Mon, Apr 21, 2025 at 09:53:10AM +0100, Matt Fleming wrote:
> Hey there,
>
> We're moving to 6.12 at Cloudflare and noticed that write await times
> in iostat are 10x what they were in 6.6. After a bit of bpftracing
> (script to find all plug times above 10ms below), it seems like this
> is an accounting error caused by the plug->cur_ktime optimisation
> rather than anything more material.
>
> It appears as though a task can enter __submit_bio() with ->plug set
> and a very stale cur_ktime value on the order of milliseconds. Is this
> expected behaviour? It looks like it leads to inaccurate I/O times.
There are places with a block plug that call cond_resched(), which
doesn't invalidate the plug's cached ktime. You could end up with a
stale ktime if your process is scheduled out.
* Re: 10x I/O await times in 6.12
2025-04-21 15:22 ` Keith Busch
@ 2025-04-21 18:35 ` Matt Fleming
2025-04-21 18:53 ` Keith Busch
2025-04-22 1:28 ` Yu Kuai
1 sibling, 1 reply; 16+ messages in thread
From: Matt Fleming @ 2025-04-21 18:35 UTC (permalink / raw)
To: Keith Busch; +Cc: Jens Axboe, linux-block, linux-kernel, kernel-team
On Mon, 21 Apr 2025 at 16:22, Keith Busch <kbusch@kernel.org> wrote:
>
> On Mon, Apr 21, 2025 at 09:53:10AM +0100, Matt Fleming wrote:
> > Hey there,
> >
> > We're moving to 6.12 at Cloudflare and noticed that write await times
> > in iostat are 10x what they were in 6.6. After a bit of bpftracing
> > (script to find all plug times above 10ms below), it seems like this
> > is an accounting error caused by the plug->cur_ktime optimisation
> > rather than anything more material.
> >
> > It appears as though a task can enter __submit_bio() with ->plug set
> > and a very stale cur_ktime value on the order of milliseconds. Is this
> > expected behaviour? It looks like it leads to inaccurate I/O times.
>
> There are places with a block plug that call cond_resched(), which
> doesn't invalidate the plug's cached ktime. You could end up with a
> stale ktime if your process is scheduled out.
Is that intentional? I know the cached time is invalidated when
calling schedule(). Does the invalidation need to get pushed down into
__schedule_loop()?
* Re: 10x I/O await times in 6.12
2025-04-21 18:35 ` Matt Fleming
@ 2025-04-21 18:53 ` Keith Busch
2025-04-21 19:10 ` Matt Fleming
0 siblings, 1 reply; 16+ messages in thread
From: Keith Busch @ 2025-04-21 18:53 UTC (permalink / raw)
To: Matt Fleming; +Cc: Jens Axboe, linux-block, linux-kernel, kernel-team
On Mon, Apr 21, 2025 at 07:35:24PM +0100, Matt Fleming wrote:
> On Mon, 21 Apr 2025 at 16:22, Keith Busch <kbusch@kernel.org> wrote:
> >
> > On Mon, Apr 21, 2025 at 09:53:10AM +0100, Matt Fleming wrote:
> > > Hey there,
> > >
> > > We're moving to 6.12 at Cloudflare and noticed that write await times
> > > in iostat are 10x what they were in 6.6. After a bit of bpftracing
> > > (script to find all plug times above 10ms below), it seems like this
> > > is an accounting error caused by the plug->cur_ktime optimisation
> > > rather than anything more material.
> > >
> > > It appears as though a task can enter __submit_bio() with ->plug set
> > > and a very stale cur_ktime value on the order of milliseconds. Is this
> > > expected behaviour? It looks like it leads to inaccurate I/O times.
> >
> > There are places with a block plug that call cond_resched(), which
> > doesn't invalidate the plug's cached ktime. You could end up with a
> > stale ktime if your process is scheduled out.
>
> Is that intentional? I know the cached time is invalidated when
> calling schedule(). Does the invalidation need to get pushed down into
> __schedule_loop()?
>
Not sure. I'm also guessing cond_resched is the reason for your
observation, so it might be worth confirming that's what's happening
in whatever IO paths your workload is taking, in case there's some
other explanation.
fs-writeback happens to work around it by unplugging if it knows
cond_resched is going to schedule. The decision to unplug here wasn't
necessarily because of the plug's ktime, but it gets the job done:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/fs-writeback.c?h=v6.15-rc3#n1984
Doesn't really scale well to copy this for every caller of
cond_resched(), though. An io specific helper implementation of
cond_resched might help.
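Such a helper might look like the sketch below (hypothetical and
untested, not a proposed patch: io_cond_resched is a made-up name,
though need_resched(), blk_flush_plug() and cond_resched() are real
kernel interfaces):

```c
/*
 * Hypothetical sketch only: flush the plug before yielding so no
 * plugged request carries a stale cached timestamp across the
 * reschedule.
 */
static inline int io_cond_resched(void)
{
	if (need_resched() && current->plug)
		blk_flush_plug(current->plug, false);
	return cond_resched();
}
```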
Or if we don't want cond_resched to unplug (though I feel like you would
normally want that), I think we could invalidate the ktime when
scheduling to get the stats to read the current ktime after the process
is scheduled back in.
---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6978,6 +6978,9 @@ static void __sched notrace preempt_schedule_common(void)
* between schedule and now.
*/
} while (need_resched());
+
+ if (current->flags & PF_BLOCK_TS)
+ blk_plug_invalidate_ts(current);
}
#ifdef CONFIG_PREEMPTION
--
* Re: 10x I/O await times in 6.12
2025-04-21 18:53 ` Keith Busch
@ 2025-04-21 19:10 ` Matt Fleming
2025-04-22 1:39 ` Yu Kuai
0 siblings, 1 reply; 16+ messages in thread
From: Matt Fleming @ 2025-04-21 19:10 UTC (permalink / raw)
To: Keith Busch; +Cc: Jens Axboe, linux-block, linux-kernel, kernel-team
On Mon, 21 Apr 2025 at 19:53, Keith Busch <kbusch@kernel.org> wrote:
>
> Not sure. I'm also guessing cond_resched is the reason for your
> observation, so that might be worth confirming is happening in whatever
> IO paths you're workload is taking in case there's some other
> explanation.
Yep, you're spot on. We're hitting cond_resched() from various code
paths (xfs_buf_delwri_submit_buffers(), swap_writepage(),
rmap_walk_file(), etc, etc).
sudo bpftrace -e 'k:psi_task_switch {
        $prev = (struct task_struct *)arg0;
        if ($prev->plug != 0) {
                if ($prev->plug->cur_ktime) {
                        @[kstack(3)] = count();
                }
        }
}'
Attaching 1 probe...
^C
@[
psi_task_switch+5
__schedule+2081
__cond_resched+51
]: 3044
> fs-writeback happens to work around it by unplugging if it knows
> cond_resched is going to schedule. The decision to unplug here wasn't
> necessarily because of the plug's ktime, but it gets the job done:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/fs-writeback.c?h=v6.15-rc3#n1984
>
> Doesn't really scale well to copy this for every caller of
> cond_resched(), though. An io specific helper implementation of
> cond_resched might help.
>
> Or if we don't want cond_resched to unplug (though I feel like you would
> normally want that), I think we could invalidate the ktime when
> scheduling to get the stats to read the current ktime after the process
> is scheduled back in.
Thanks. Makes sense to me. I'll try this out and report back.
> ---
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6978,6 +6978,9 @@ static void __sched notrace preempt_schedule_common(void)
> * between schedule and now.
> */
> } while (need_resched());
> +
> + if (current->flags & PF_BLOCK_TS)
> + blk_plug_invalidate_ts(current);
> }
>
> #ifdef CONFIG_PREEMPTION
> --
* Re: 10x I/O await times in 6.12
2025-04-21 15:22 ` Keith Busch
2025-04-21 18:35 ` Matt Fleming
@ 2025-04-22 1:28 ` Yu Kuai
2025-04-22 1:39 ` Keith Busch
1 sibling, 1 reply; 16+ messages in thread
From: Yu Kuai @ 2025-04-22 1:28 UTC (permalink / raw)
To: Keith Busch, Matt Fleming
Cc: Jens Axboe, linux-block, linux-kernel, kernel-team, yukuai (C)
Hi,
在 2025/04/21 23:22, Keith Busch 写道:
> On Mon, Apr 21, 2025 at 09:53:10AM +0100, Matt Fleming wrote:
>> Hey there,
>>
>> We're moving to 6.12 at Cloudflare and noticed that write await times
>> in iostat are 10x what they were in 6.6. After a bit of bpftracing
>> (script to find all plug times above 10ms below), it seems like this
>> is an accounting error caused by the plug->cur_ktime optimisation
>> rather than anything more material.
>>
>> It appears as though a task can enter __submit_bio() with ->plug set
>> and a very stale cur_ktime value on the order of milliseconds. Is this
>> expected behaviour? It looks like it leads to inaccurate I/O times.
>
> There are places with a block plug that call cond_resched(), which
> doesn't invalidate the plug's cached ktime. You could end up with a
> stale ktime if your process is scheduled out.
This is wrong; being scheduled out will clear the cached ktime. You
can check it easily since there are not many callers that clear the
ktime.
Thanks,
Kuai
>
> .
>
* Re: 10x I/O await times in 6.12
2025-04-21 19:10 ` Matt Fleming
@ 2025-04-22 1:39 ` Yu Kuai
0 siblings, 0 replies; 16+ messages in thread
From: Yu Kuai @ 2025-04-22 1:39 UTC (permalink / raw)
To: Matt Fleming, Keith Busch
Cc: Jens Axboe, linux-block, linux-kernel, kernel-team, yukuai (C)
Hi,
在 2025/04/22 3:10, Matt Fleming 写道:
> On Mon, 21 Apr 2025 at 19:53, Keith Busch <kbusch@kernel.org> wrote:
>>
>> Not sure. I'm also guessing cond_resched is the reason for your
>> observation, so that might be worth confirming is happening in whatever
>> IO paths you're workload is taking in case there's some other
>> explanation.
>
> Yep, you're spot on. We're hitting cond_resched() from various code
> paths (xfs_buf_delwri_submit_buffers(), swap_writepage(),
> rmap_walk_file(), etc, etc).
All plugged IO must be submitted before being scheduled out, so there
is no point in this direction. :(
Please check the other mail where I replied to your original report;
it makes sense if a task keeps running on one CPU for milliseconds.
Thanks,
Kuai
>
> sudo bpftrace -e 'k:psi_task_switch { $prev = (struct
> task_struct *)arg0; if ($prev->plug != 0) {
> if ($prev->plug->cur_ktime) {
> @[kstack(3)] = count();
> }
> }
> }'
> Attaching 1 probe...
> ^C
>
> @[
> psi_task_switch+5
> __schedule+2081
> __cond_resched+51
> ]: 3044
>
>> fs-writeback happens to work around it by unplugging if it knows
>> cond_resched is going to schedule. The decision to unplug here wasn't
>> necessarily because of the plug's ktime, but it gets the job done:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/fs-writeback.c?h=v6.15-rc3#n1984
>>
>> Doesn't really scale well to copy this for every caller of
>> cond_resched(), though. An io specific helper implementation of
>> cond_resched might help.
>>
>> Or if we don't want cond_resched to unplug (though I feel like you would
>> normally want that), I think we could invalidate the ktime when
>> scheduling to get the stats to read the current ktime after the process
>> is scheduled back in.
>
> Thanks. Makes sense to me. I'll try this out and report back.
>
>> ---
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -6978,6 +6978,9 @@ static void __sched notrace preempt_schedule_common(void)
>> * between schedule and now.
>> */
>> } while (need_resched());
>> +
>> + if (current->flags & PF_BLOCK_TS)
>> + blk_plug_invalidate_ts(current);
>> }
>>
>> #ifdef CONFIG_PREEMPTION
>> --
>
> .
>
* Re: 10x I/O await times in 6.12
2025-04-22 1:28 ` Yu Kuai
@ 2025-04-22 1:39 ` Keith Busch
2025-04-22 3:02 ` Yu Kuai
0 siblings, 1 reply; 16+ messages in thread
From: Keith Busch @ 2025-04-22 1:39 UTC (permalink / raw)
To: Yu Kuai
Cc: Matt Fleming, Jens Axboe, linux-block, linux-kernel, kernel-team,
yukuai (C)
On Tue, Apr 22, 2025 at 09:28:02AM +0800, Yu Kuai wrote:
> Hi,
>
> 在 2025/04/21 23:22, Keith Busch 写道:
> > On Mon, Apr 21, 2025 at 09:53:10AM +0100, Matt Fleming wrote:
> > > Hey there,
> > >
> > > We're moving to 6.12 at Cloudflare and noticed that write await times
> > > in iostat are 10x what they were in 6.6. After a bit of bpftracing
> > > (script to find all plug times above 10ms below), it seems like this
> > > is an accounting error caused by the plug->cur_ktime optimisation
> > > rather than anything more material.
> > >
> > > It appears as though a task can enter __submit_bio() with ->plug set
> > > and a very stale cur_ktime value on the order of milliseconds. Is this
> > > expected behaviour? It looks like it leads to inaccurate I/O times.
> >
> > There are places with a block plug that call cond_resched(), which
> > doesn't invalidate the plug's cached ktime. You could end up with a
> > stale ktime if your process is scheduled out.
>
> This is wrong, scheduled out will clear cached ktime. You can check
> it easily since there are not much caller to clear ktime.
Huh? cond_resched() calls __schedule() directly via
preempt_schedule_common(), which most certainly does not clear the
plug's time.
The timestamp is only invalidated from schedule() or
rt_mutex_post_schedule(). You can check it ... "easily".
* Re: 10x I/O await times in 6.12
2025-04-22 1:39 ` Keith Busch
@ 2025-04-22 3:02 ` Yu Kuai
2025-04-22 10:45 ` Matt Fleming
0 siblings, 1 reply; 16+ messages in thread
From: Yu Kuai @ 2025-04-22 3:02 UTC (permalink / raw)
To: Keith Busch, Yu Kuai
Cc: Matt Fleming, Jens Axboe, linux-block, linux-kernel, kernel-team,
yukuai (C)
Hi,
在 2025/04/22 9:39, Keith Busch 写道:
> On Tue, Apr 22, 2025 at 09:28:02AM +0800, Yu Kuai wrote:
>> Hi,
>>
>> 在 2025/04/21 23:22, Keith Busch 写道:
>>> On Mon, Apr 21, 2025 at 09:53:10AM +0100, Matt Fleming wrote:
>>>> Hey there,
>>>>
>>>> We're moving to 6.12 at Cloudflare and noticed that write await times
>>>> in iostat are 10x what they were in 6.6. After a bit of bpftracing
>>>> (script to find all plug times above 10ms below), it seems like this
>>>> is an accounting error caused by the plug->cur_ktime optimisation
>>>> rather than anything more material.
>>>>
>>>> It appears as though a task can enter __submit_bio() with ->plug set
>>>> and a very stale cur_ktime value on the order of milliseconds. Is this
>>>> expected behaviour? It looks like it leads to inaccurate I/O times.
>>>
>>> There are places with a block plug that call cond_resched(), which
>>> doesn't invalidate the plug's cached ktime. You could end up with a
>>> stale ktime if your process is scheduled out.
>>
>> This is wrong, scheduled out will clear cached ktime. You can check
>> it easily since there are not much caller to clear ktime.
>
> Huh? cond_resched() calls __schedule() directly via
> preempt_schedule_common(), which most certainly does not clear the
> plug's time.
Yes, this is the preempt case, where plugged IO is not issued; this
behaviour is already known. I thought you meant the normal case, like
you said below. :(
>
> The timestamp is only invalidated from schedule() or
> rt_mutex_post_schedule(). You can check it ... "easily".
So either preemption taking a long time, or generating lots of bios
into the plug taking a long time, can result in larger iostat IO
latency. I still think delaying the setting of the request start_time
until blk_mq_flush_plug_list() might be a reasonable fix.
Thanks,
Kuai
>
> .
>
* Re: 10x I/O await times in 6.12
2025-04-22 3:02 ` Yu Kuai
@ 2025-04-22 10:45 ` Matt Fleming
2025-04-23 3:36 ` Yu Kuai
0 siblings, 1 reply; 16+ messages in thread
From: Matt Fleming @ 2025-04-22 10:45 UTC (permalink / raw)
To: Yu Kuai
Cc: Keith Busch, Jens Axboe, linux-block, linux-kernel, kernel-team,
yukuai (C)
On Tue, 22 Apr 2025 at 04:03, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> So, either preempt takes a long time, or generate lots of bio to plug
> takes a long time can both results in larger iostat IO latency. I still
> think delay setting request start_time to blk_mq_flush_plug_list() might
> be a reasonable fix.
I'll try out your proposed fix also. Is it not possible for a task to
be preempted during a blk_mq_flush_plug_list() call, e.g. in the
driver layer?
I understand that you might not want to issue I/O on preempt, but
that's a distinct problem from clearing the cached ktime no? There is
no upper bound on the amount of time a task might be scheduled out due
to preempt which means there is no limit to the staleness of that
value. I would assume the only safe thing to do (like is done for
various other timestamps) is reset it when the task gets scheduled
out.
* Re: 10x I/O await times in 6.12
2025-04-22 10:45 ` Matt Fleming
@ 2025-04-23 3:36 ` Yu Kuai
0 siblings, 0 replies; 16+ messages in thread
From: Yu Kuai @ 2025-04-23 3:36 UTC (permalink / raw)
To: Matt Fleming, Yu Kuai
Cc: Keith Busch, Jens Axboe, linux-block, linux-kernel, kernel-team,
yukuai (C)
Hi,
在 2025/04/22 18:45, Matt Fleming 写道:
> On Tue, 22 Apr 2025 at 04:03, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> So, either preempt takes a long time, or generate lots of bio to plug
>> takes a long time can both results in larger iostat IO latency. I still
>> think delay setting request start_time to blk_mq_flush_plug_list() might
>> be a reasonable fix.
>
> I'll try out your proposed fix also. Is it not possible for a task to
> be preempted during a blk_mq_flush_plug_list() call, e.g. in the
> driver layer?
Let's focus on your regression first; preemption during flush plug
doesn't introduce a new gap, since rq->start_time_ns is already set
before that, back when the time was cached in the plug.
>
> I understand that you might not want to issue I/O on preempt, but
> that's a distinct problem from clearing the cached ktime no? There is
> no upper bound on the amount of time a task might be scheduled out due
> to preempt which means there is no limit to the staleness of that
> value. I would assume the only safe thing to do (like is done for
> various other timestamps) is reset it when the task gets scheduled
> out.
Yes, it's reasonable to clear the cached time for the preempt case.
What I'm concerned about is that even if the task is never scheduled
out, the time can still be stale by milliseconds. I think that is
possible if a lot of bios end up in the same round of plugging (a lot
of IO merging).
Thanks,
Kuai
>
> .
>
* Re: 10x I/O await times in 6.12
2025-04-21 12:21 ` Yu Kuai
@ 2025-04-23 10:51 ` Matt Fleming
2025-04-24 2:35 ` Yu Kuai
2026-04-23 18:18 ` Matt Fleming
1 sibling, 1 reply; 16+ messages in thread
From: Matt Fleming @ 2025-04-23 10:51 UTC (permalink / raw)
To: Yu Kuai; +Cc: Jens Axboe, linux-block, linux-kernel, kernel-team, yukuai (C)
On Mon, 21 Apr 2025 at 13:21, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Can you drop this expensive bpftrace script which might affect IO
> performance, and replace __submit_bio directly to __blk_flush_plug? If
> nsecs - plug->cur_ktime is still milliseconds, can you check if the
> following patch can fix your problem?
Yep, the below patch fixes the regression and restores I/O wait times
that are comparable to 6.6. Thanks!
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index ae8494d88897..37197502147e 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1095,7 +1095,9 @@ static inline void blk_account_io_start(struct request *req)
> return;
>
> req->rq_flags |= RQF_IO_STAT;
> - req->start_time_ns = blk_time_get_ns();
> +
> + if (!current->plug)
> + req->start_time_ns = blk_time_get_ns();
>
> /*
> * All non-passthrough requests are created from a bio with one
> @@ -2874,6 +2876,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
> {
> struct request *rq;
> unsigned int depth;
> + u64 now;
>
> /*
> * We may have been called recursively midway through handling
> @@ -2887,6 +2890,10 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
> depth = plug->rq_count;
> plug->rq_count = 0;
>
> + now = ktime_get_ns();
> + rq_list_for_each(&plug->mq_list, rq)
> + rq->start_time_ns = now;
> +
> if (!plug->multiple_queues && !plug->has_elevator &&
> !from_schedule) {
> struct request_queue *q;
>
* Re: 10x I/O await times in 6.12
2025-04-23 10:51 ` Matt Fleming
@ 2025-04-24 2:35 ` Yu Kuai
2025-05-01 9:52 ` Matt Fleming
0 siblings, 1 reply; 16+ messages in thread
From: Yu Kuai @ 2025-04-24 2:35 UTC (permalink / raw)
To: Matt Fleming, Yu Kuai
Cc: Jens Axboe, linux-block, linux-kernel, kernel-team, yukuai (C)
Hi, Jens!
在 2025/04/23 18:51, Matt Fleming 写道:
> On Mon, 21 Apr 2025 at 13:21, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Can you drop this expensive bpftrace script which might affect IO
>> performance, and replace __submit_bio directly to __blk_flush_plug? If
>> nsecs - plug->cur_ktime is still milliseconds, can you check if the
>> following patch can fix your problem?
>
> Yep, the below patch fixes the regression and restores I/O wait times
> that are comparable to 6.6. Thanks!
Thanks for the test.
Do you have any suggestions? For now, I can think of two different
fixes:
1) Clear the cached time for the preempt case. On the one hand,
preemption can be frequent, causing performance degradation; on the
other hand, this can't solve all the problems, for example lots of
interrupts, or lots of BIOs issued into one round of plugging.
2) Delay setting the rq start time until the plug is flushed; however,
the latency value is still not accurate, and will be slightly smaller
instead of greater.
Thanks,
Kuai
>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index ae8494d88897..37197502147e 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -1095,7 +1095,9 @@ static inline void blk_account_io_start(struct request *req)
>> return;
>>
>> req->rq_flags |= RQF_IO_STAT;
>> - req->start_time_ns = blk_time_get_ns();
>> +
>> + if (!current->plug)
>> + req->start_time_ns = blk_time_get_ns();
>>
>> /*
>> * All non-passthrough requests are created from a bio with one
>> @@ -2874,6 +2876,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>> {
>> struct request *rq;
>> unsigned int depth;
>> + u64 now;
>>
>> /*
>> * We may have been called recursively midway through handling
>> @@ -2887,6 +2890,10 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>> depth = plug->rq_count;
>> plug->rq_count = 0;
>>
>> + now = ktime_get_ns();
>> + rq_list_for_each(&plug->mq_list, rq)
>> + rq->start_time_ns = now;
>> +
>> if (!plug->multiple_queues && !plug->has_elevator &&
>> !from_schedule) {
>> struct request_queue *q;
>>
>
> .
>
* Re: 10x I/O await times in 6.12
2025-04-24 2:35 ` Yu Kuai
@ 2025-05-01 9:52 ` Matt Fleming
0 siblings, 0 replies; 16+ messages in thread
From: Matt Fleming @ 2025-05-01 9:52 UTC (permalink / raw)
To: Yu Kuai; +Cc: Jens Axboe, linux-block, linux-kernel, kernel-team, yukuai (C)
On Thu, 24 Apr 2025 at 03:35, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Thanks for the test.
>
> Do you have any suggestions? For now, I can think of two different
> fixes:
>
> 1) clear time for the preempt case. On the one hand, preempt can be
> frequent, cauing performance degradation; On the other hand, this can't
> solve all the problems, for example, lots of interrupt or lots of BIO
> issued into one round of plug.
>
> 2) delay setting rq start time to flush plug, however, the latency value
> is still not accurate, and will be slightly smaller instead of greater.
Hi Kuai and Jens, is there any news here?
Thanks,
Matt
* Re: 10x I/O await times in 6.12
2025-04-21 12:21 ` Yu Kuai
2025-04-23 10:51 ` Matt Fleming
@ 2026-04-23 18:18 ` Matt Fleming
1 sibling, 0 replies; 16+ messages in thread
From: Matt Fleming @ 2026-04-23 18:18 UTC (permalink / raw)
To: Yu Kuai, Jens Axboe
Cc: Matt Fleming, linux-block, linux-kernel, kernel-team, yukuai (C)
Hi Yu and Jens,
On Mon, Apr 21, 2025 at 08:21:13PM +0800, Yu Kuai wrote:
> If nsecs - plug->cur_ktime is still milliseconds, can you check if
> the following patch can fix your problem?
>
> Thanks,
> Kuai
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index ae8494d88897..37197502147e 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1095,7 +1095,9 @@ static inline void blk_account_io_start(struct request *req)
> return;
>
> req->rq_flags |= RQF_IO_STAT;
> - req->start_time_ns = blk_time_get_ns();
> +
> + if (!current->plug)
> + req->start_time_ns = blk_time_get_ns();
>
> /*
> * All non-passthrough requests are created from a bio with one
> @@ -2874,6 +2876,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
> {
> struct request *rq;
> unsigned int depth;
> + u64 now;
>
> /*
> * We may have been called recursively midway through handling
> @@ -2887,6 +2890,10 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
> depth = plug->rq_count;
> plug->rq_count = 0;
>
> + now = ktime_get_ns();
> + rq_list_for_each(&plug->mq_list, rq)
> + rq->start_time_ns = now;
> +
> if (!plug->multiple_queues && !plug->has_elevator && !from_schedule)
> {
> struct request_queue *q;
>
It's April again, and this time we're moving to 6.18.
The I/O accounting bug reported at the start of this thread is still an
issue for us, and we've been carrying the above patch for 12 months.
Digging into this a bit more, the issue appears to be big writeback
(e.g. xfsaild invoking xfs_buf_delwri_submit_nowait()), where one plug
covers hundreds of writes for around 10ms on average.
xfsaild can stay on-CPU for long stretches which leads to inflated I/O
times in /proc/diskstats. Tracing the requests shows that they're not
actually any slower than previous kernel versions -- it's purely an
accounting bug.
Of course, it doesn't look to be an XFS-specific thing. It's just
tasks doing massive writes without getting preempted/interrupted.
Any interest in merging the above patch?
Thanks,
Matt
end of thread, other threads:[~2026-04-23 18:18 UTC | newest]
Thread overview: 16+ messages
2025-04-21 8:53 10x I/O await times in 6.12 Matt Fleming
2025-04-21 12:21 ` Yu Kuai
2025-04-23 10:51 ` Matt Fleming
2025-04-24 2:35 ` Yu Kuai
2025-05-01 9:52 ` Matt Fleming
2026-04-23 18:18 ` Matt Fleming
2025-04-21 15:22 ` Keith Busch
2025-04-21 18:35 ` Matt Fleming
2025-04-21 18:53 ` Keith Busch
2025-04-21 19:10 ` Matt Fleming
2025-04-22 1:39 ` Yu Kuai
2025-04-22 1:28 ` Yu Kuai
2025-04-22 1:39 ` Keith Busch
2025-04-22 3:02 ` Yu Kuai
2025-04-22 10:45 ` Matt Fleming
2025-04-23 3:36 ` Yu Kuai