linux-block.vger.kernel.org archive mirror
* Re: [PATCH] iomap: allow iomap using the per-cpu bio cache
       [not found] ` <20250822150550.GP7942@frogsfrogsfrogs>
@ 2025-08-22 15:42   ` Matthew Wilcox
  2025-08-22 16:07     ` Ritesh Harjani
  0 siblings, 1 reply; 14+ messages in thread
From: Matthew Wilcox @ 2025-08-22 15:42 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Fengnan Chang, brauner, linux-xfs, linux-fsdevel, Jens Axboe,
	linux-block

On Fri, Aug 22, 2025 at 08:05:50AM -0700, Darrick J. Wong wrote:
> On Fri, Aug 22, 2025 at 04:26:06PM +0800, Fengnan Chang wrote:
> > When using io_uring with direct I/O, we can use the per-cpu bio
> > cache from bio_alloc_bioset(), so pass the IOCB_ALLOC_CACHE flag
> > to the bio allocation helper.
> >  
> > +	if (iter->flags & IOMAP_ALLOC_CACHE)
> > +		bio_opf |= REQ_ALLOC_CACHE;
> 
> Is there a reason /not/ to use the per-cpu bio cache unconditionally?

AIUI it's not safe because completions might happen on a different CPU
from the submission.  At least, there's nowhere that sets REQ_ALLOC_CACHE
unconditionally.

This could do with some better documentation ..

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] iomap: allow iomap using the per-cpu bio cache
  2025-08-22 15:42   ` [PATCH] iomap: allow iomap using the per-cpu bio cache Matthew Wilcox
@ 2025-08-22 16:07     ` Ritesh Harjani
  2025-08-22 16:51       ` Matthew Wilcox
  0 siblings, 1 reply; 14+ messages in thread
From: Ritesh Harjani @ 2025-08-22 16:07 UTC (permalink / raw)
  To: Matthew Wilcox, Darrick J. Wong
  Cc: Fengnan Chang, brauner, linux-xfs, linux-fsdevel, Jens Axboe,
	linux-block

Matthew Wilcox <willy@infradead.org> writes:

> On Fri, Aug 22, 2025 at 08:05:50AM -0700, Darrick J. Wong wrote:
>> On Fri, Aug 22, 2025 at 04:26:06PM +0800, Fengnan Chang wrote:
>> > When using io_uring with direct I/O, we can use the per-cpu bio
>> > cache from bio_alloc_bioset(), so pass the IOCB_ALLOC_CACHE flag
>> > to the bio allocation helper.
>> >  
>> > +	if (iter->flags & IOMAP_ALLOC_CACHE)
>> > +		bio_opf |= REQ_ALLOC_CACHE;
>> 
>> Is there a reason /not/ to use the per-cpu bio cache unconditionally?
>
> AIUI it's not safe because completions might happen on a different CPU
> from the submission.

At most, a bio dequeued from CPU X's cache can be returned to CPU Y's
cache; that shouldn't be unsafe, right? See e.g. bio_put_percpu_cache().
Not optimal for performance, though.

Also, even for io_uring, the IRQ completions (non-polling requests) can
get routed to a different CPU than the submitting CPU, correct?
Then the completions (bio completion processing) are handled via IPIs on
the submitting CPU or based on the cache topology, right?

> At least, there's nowhere that sets REQ_ALLOC_CACHE unconditionally.
>
> This could do with some better documentation ..

Agreed. Looking at the history, this got added for polling mode first but
later got enabled even for IRQ-driven io_uring rw requests [1]. So it
makes sense to understand whether this can be enabled unconditionally for
DIO requests or not.

[1]: https://lore.kernel.org/io-uring/aab3521d49fd6c1ff6ea194c9e63d05565efc103.1666347703.git.asml.silence@gmail.com/

-ritesh


* Re: [PATCH] iomap: allow iomap using the per-cpu bio cache
  2025-08-22 16:07     ` Ritesh Harjani
@ 2025-08-22 16:51       ` Matthew Wilcox
  2025-08-23  4:15         ` Ritesh Harjani
  0 siblings, 1 reply; 14+ messages in thread
From: Matthew Wilcox @ 2025-08-22 16:51 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Darrick J. Wong, Fengnan Chang, brauner, linux-xfs, linux-fsdevel,
	Jens Axboe, linux-block

On Fri, Aug 22, 2025 at 09:37:32PM +0530, Ritesh Harjani wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> > On Fri, Aug 22, 2025 at 08:05:50AM -0700, Darrick J. Wong wrote:
> >> Is there a reason /not/ to use the per-cpu bio cache unconditionally?
> >
> > AIUI it's not safe because completions might happen on a different CPU
> > from the submission.
> 
> At most, a bio dequeued from CPU X's cache can be returned to CPU Y's
> cache; that shouldn't be unsafe, right? See e.g. bio_put_percpu_cache().
> Not optimal for performance, though.
> 
> Also, even for io_uring, the IRQ completions (non-polling requests) can
> get routed to a different CPU than the submitting CPU, correct?
> Then the completions (bio completion processing) are handled via IPIs on
> the submitting CPU or based on the cache topology, right?
> 
> > At least, there's nowhere that sets REQ_ALLOC_CACHE unconditionally.
> >
> > This could do with some better documentation ..
> 
> Agreed. Looking at the history, this got added for polling mode first but
> later got enabled even for IRQ-driven io_uring rw requests [1]. So it
> makes sense to understand whether this can be enabled unconditionally for
> DIO requests or not.

So why does the flag now exist at all?  Why not use the cache
unconditionally?


* Re: [PATCH] iomap: allow iomap using the per-cpu bio cache
  2025-08-22 16:51       ` Matthew Wilcox
@ 2025-08-23  4:15         ` Ritesh Harjani
  2025-08-25  8:51           ` Fengnan Chang
  2025-09-03  9:53           ` Pavel Begunkov
  0 siblings, 2 replies; 14+ messages in thread
From: Ritesh Harjani @ 2025-08-23  4:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, Fengnan Chang, brauner, linux-xfs, linux-fsdevel,
	Jens Axboe, linux-block

Matthew Wilcox <willy@infradead.org> writes:

> On Fri, Aug 22, 2025 at 09:37:32PM +0530, Ritesh Harjani wrote:
>> Matthew Wilcox <willy@infradead.org> writes:
>> > On Fri, Aug 22, 2025 at 08:05:50AM -0700, Darrick J. Wong wrote:
>> >> Is there a reason /not/ to use the per-cpu bio cache unconditionally?
>> >
>> > AIUI it's not safe because completions might happen on a different CPU
>> > from the submission.
>> 
>> At most, a bio dequeued from CPU X's cache can be returned to CPU Y's
>> cache; that shouldn't be unsafe, right? See e.g. bio_put_percpu_cache().
>> Not optimal for performance, though.
>> 
>> Also, even for io_uring, the IRQ completions (non-polling requests) can
>> get routed to a different CPU than the submitting CPU, correct?
>> Then the completions (bio completion processing) are handled via IPIs on
>> the submitting CPU or based on the cache topology, right?
>> 
>> > At least, there's nowhere that sets REQ_ALLOC_CACHE unconditionally.
>> >
>> > This could do with some better documentation ..
>> 
>> Agreed. Looking at the history, this got added for polling mode first but
>> later got enabled even for IRQ-driven io_uring rw requests [1]. So it
>> makes sense to understand whether this can be enabled unconditionally for
>> DIO requests or not.
>
> So why does the flag now exist at all?  Why not use the cache
> unconditionally?

I am hoping the author of this patch, or folks with io_uring expertise
(who added the per-cpu bio cache in the first place), could answer
this better, i.e.:

Now that the per-cpu bio cache is used by io_uring rw requests for
both polled and non-polled I/O, does that mean we can kill the
IOCB_ALLOC_CACHE check from the iomap DIO path completely and use the
per-cpu bio cache unconditionally by passing the REQ_ALLOC_CACHE flag?
That would mean all DIO requests via iomap could use this per-cpu bio
cache, not just the ones initiated via the io_uring path.

Or are there still restrictions on using this per-cpu bio cache which
limit it to the io_uring path? If yes, what are they? And can this be
documented somewhere?

-ritesh



* Re: [PATCH] iomap: allow iomap using the per-cpu bio cache
  2025-08-23  4:15         ` Ritesh Harjani
@ 2025-08-25  8:51           ` Fengnan Chang
  2025-08-25  9:21             ` Christoph Hellwig
  2025-09-03  9:53           ` Pavel Begunkov
  1 sibling, 1 reply; 14+ messages in thread
From: Fengnan Chang @ 2025-08-25  8:51 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Matthew Wilcox, Darrick J. Wong, brauner, linux-xfs,
	linux-fsdevel, Jens Axboe, linux-block

Sorry for the late reply.

Ritesh Harjani <ritesh.list@gmail.com> wrote on Sat, Aug 23, 2025 at 12:35:
>
> Matthew Wilcox <willy@infradead.org> writes:
>
> > On Fri, Aug 22, 2025 at 09:37:32PM +0530, Ritesh Harjani wrote:
> >> Matthew Wilcox <willy@infradead.org> writes:
> >> > On Fri, Aug 22, 2025 at 08:05:50AM -0700, Darrick J. Wong wrote:
> >> >> Is there a reason /not/ to use the per-cpu bio cache unconditionally?
> >> >
> >> > AIUI it's not safe because completions might happen on a different CPU
> >> > from the submission.
> >>
> >> At most, a bio dequeued from CPU X's cache can be returned to CPU Y's
> >> cache; that shouldn't be unsafe, right? See e.g. bio_put_percpu_cache().
> >> Not optimal for performance, though.
> >>
> >> Also, even for io_uring, the IRQ completions (non-polling requests) can
> >> get routed to a different CPU than the submitting CPU, correct?
> >> Then the completions (bio completion processing) are handled via IPIs on
> >> the submitting CPU or based on the cache topology, right?
> >>
> >> > At least, there's nowhere that sets REQ_ALLOC_CACHE unconditionally.
> >> >
> >> > This could do with some better documentation ..
> >>
> >> Agreed. Looking at the history, this got added for polling mode first but
> >> later got enabled even for IRQ-driven io_uring rw requests [1]. So it
> >> makes sense to understand whether this can be enabled unconditionally for
> >> DIO requests or not.
> >
> > So why does the flag now exist at all?  Why not use the cache
> > unconditionally?

I think it's a historical issue. When the REQ_ALLOC_CACHE flag was first
added in
https://lore.kernel.org/all/20220324203526.62306-2-snitzer@kernel.org/,
it was renamed from BIOSET_PERCPU_CACHE and only worked for iopoll:
"If REQ_ALLOC_CACHE is set, the final put of the bio MUST be done from
process context, not hard/soft IRQ." Then
https://lore.kernel.org/all/cover.1667384020.git.asml.silence@gmail.com/
removed this limit, so per-cpu bio caching can be used for IRQ I/O:
"Currently, it's only enabled for previous REQ_ALLOC_CACHE users but will
be turned on system-wide later."

>
> I am hoping the author of this patch, or folks with io_uring expertise
> (who added the per-cpu bio cache in the first place), could answer
> this better, i.e.:
>
> Now that the per-cpu bio cache is used by io_uring rw requests for
> both polled and non-polled I/O, does that mean we can kill the
> IOCB_ALLOC_CACHE check from the iomap DIO path completely and use the
> per-cpu bio cache unconditionally by passing the REQ_ALLOC_CACHE flag?
> That would mean all DIO requests via iomap could use this per-cpu bio
> cache, not just the ones initiated via the io_uring path.
>
> Or are there still restrictions on using this per-cpu bio cache which
> limit it to the io_uring path? If yes, what are they? And can this be
> documented somewhere?

No restrictions for now; I think we can enable this by default.
Maybe a better solution is to modify bio.c?  Let me do some testing first.

>
> -ritesh
>


* Re: [PATCH] iomap: allow iomap using the per-cpu bio cache
  2025-08-25  8:51           ` Fengnan Chang
@ 2025-08-25  9:21             ` Christoph Hellwig
  2025-08-25  9:41               ` Fengnan Chang
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2025-08-25  9:21 UTC (permalink / raw)
  To: Fengnan Chang
  Cc: Ritesh Harjani, Matthew Wilcox, Darrick J. Wong, brauner,
	linux-xfs, linux-fsdevel, Jens Axboe, linux-block

On Mon, Aug 25, 2025 at 04:51:27PM +0800, Fengnan Chang wrote:
> No restrictions for now; I think we can enable this by default.
> Maybe a better solution is to modify bio.c?  Let me do some testing first.

Any kind of numbers you see where this makes a difference, including
the workloads, would also be very valuable here.



* Re: [PATCH] iomap: allow iomap using the per-cpu bio cache
  2025-08-25  9:21             ` Christoph Hellwig
@ 2025-08-25  9:41               ` Fengnan Chang
  2025-08-25 10:47                 ` Christoph Hellwig
  2025-08-26 16:53                 ` Ritesh Harjani
  0 siblings, 2 replies; 14+ messages in thread
From: Fengnan Chang @ 2025-08-25  9:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ritesh Harjani, Matthew Wilcox, Darrick J. Wong, brauner,
	linux-xfs, linux-fsdevel, Jens Axboe, linux-block

Christoph Hellwig <hch@infradead.org> wrote on Mon, Aug 25, 2025 at 17:21:
>
> On Mon, Aug 25, 2025 at 04:51:27PM +0800, Fengnan Chang wrote:
> > No restrictions for now; I think we can enable this by default.
> > Maybe a better solution is to modify bio.c?  Let me do some testing first.
>
> Any kind of numbers you see where this makes a difference, including
> the workloads, would also be very valuable here.
I'm testing random direct read performance on io_uring+ext4 and comparing
it to io_uring+raw blkdev; io_uring+ext4 is quite poor. While trying to
improve this, I found ext4 behaves quite differently from blkdev when
running bio_alloc_bioset(): the blkdev path uses the per-cpu bio cache,
but the ext4 path does not. So I made this modification.
My test command is:
/fio/t/io_uring -p0 -d128 -b4096 -s1 -c1 -F1 -B1 -R1 -X1 -n1 -P1 -t0
/data01/testfile
Without this patch:
BW is 1950MB/s.
With this patch:
BW is 2001MB/s.


>


* Re: [PATCH] iomap: allow iomap using the per-cpu bio cache
  2025-08-25  9:41               ` Fengnan Chang
@ 2025-08-25 10:47                 ` Christoph Hellwig
  2025-08-26  9:46                   ` Fengnan Chang
  2025-08-26 16:53                 ` Ritesh Harjani
  1 sibling, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2025-08-25 10:47 UTC (permalink / raw)
  To: Fengnan Chang
  Cc: Christoph Hellwig, Ritesh Harjani, Matthew Wilcox,
	Darrick J. Wong, brauner, linux-xfs, linux-fsdevel, Jens Axboe,
	linux-block

On Mon, Aug 25, 2025 at 05:41:57PM +0800, Fengnan Chang wrote:
> I'm testing random direct read performance on io_uring+ext4 and comparing
> it to io_uring+raw blkdev; io_uring+ext4 is quite poor. While trying to
> improve this, I found ext4 behaves quite differently from blkdev when
> running bio_alloc_bioset(): the blkdev path uses the per-cpu bio cache,
> but the ext4 path does not. So I made this modification.
> My test command is:
> /fio/t/io_uring -p0 -d128 -b4096 -s1 -c1 -F1 -B1 -R1 -X1 -n1 -P1 -t0
> /data01/testfile
> Without this patch:
> BW is 1950MB/s.
> With this patch:
> BW is 2001MB/s.

Interesting.  This is with the not-yet-merged ext4 iomap patches, I guess?
Do you see similar numbers with XFS?

> 


* Re: [PATCH] iomap: allow iomap using the per-cpu bio cache
  2025-08-25 10:47                 ` Christoph Hellwig
@ 2025-08-26  9:46                   ` Fengnan Chang
  2025-08-26 13:15                     ` Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread
From: Fengnan Chang @ 2025-08-26  9:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ritesh Harjani, Matthew Wilcox, Darrick J. Wong, brauner,
	linux-xfs, linux-fsdevel, Jens Axboe, linux-block

Christoph Hellwig <hch@infradead.org> wrote on Mon, Aug 25, 2025 at 18:48:
>
> On Mon, Aug 25, 2025 at 05:41:57PM +0800, Fengnan Chang wrote:
> > I'm testing random direct read performance on io_uring+ext4 and comparing
> > it to io_uring+raw blkdev; io_uring+ext4 is quite poor. While trying to
> > improve this, I found ext4 behaves quite differently from blkdev when
> > running bio_alloc_bioset(): the blkdev path uses the per-cpu bio cache,
> > but the ext4 path does not. So I made this modification.
> > My test command is:
> > /fio/t/io_uring -p0 -d128 -b4096 -s1 -c1 -F1 -B1 -R1 -X1 -n1 -P1 -t0
> > /data01/testfile
> > Without this patch:
> > BW is 1950MB/s.
> > With this patch:
> > BW is 2001MB/s.
>
> Interesting.  This is with the not-yet-merged ext4 iomap patches, I guess?
> Do you see similar numbers with XFS?
Yes, similar numbers with XFS.

>
> >


* Re: [PATCH] iomap: allow iomap using the per-cpu bio cache
  2025-08-26  9:46                   ` Fengnan Chang
@ 2025-08-26 13:15                     ` Christoph Hellwig
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2025-08-26 13:15 UTC (permalink / raw)
  To: Fengnan Chang
  Cc: Christoph Hellwig, Ritesh Harjani, Matthew Wilcox,
	Darrick J. Wong, brauner, linux-xfs, linux-fsdevel, Jens Axboe,
	linux-block

On Tue, Aug 26, 2025 at 05:46:03PM +0800, Fengnan Chang wrote:
> > Interesting.  This is with the not-yet-merged ext4 iomap patches, I guess?
> > Do you see similar numbers with XFS?
> Yes, similar numbers with XFS.

Can you add the numbers to the next submission of the patch?



* Re: [PATCH] iomap: allow iomap using the per-cpu bio cache
  2025-08-25  9:41               ` Fengnan Chang
  2025-08-25 10:47                 ` Christoph Hellwig
@ 2025-08-26 16:53                 ` Ritesh Harjani
  2025-08-29  4:26                   ` [External] " Fengnan Chang
  2025-09-03  8:28                   ` Fengnan Chang
  1 sibling, 2 replies; 14+ messages in thread
From: Ritesh Harjani @ 2025-08-26 16:53 UTC (permalink / raw)
  To: Fengnan Chang
  Cc: Matthew Wilcox, Darrick J. Wong, brauner, linux-xfs,
	linux-fsdevel, Jens Axboe, linux-block, Christoph Hellwig

Fengnan Chang <changfengnan@bytedance.com> writes:

> Christoph Hellwig <hch@infradead.org> wrote on Mon, Aug 25, 2025 at 17:21:
>>
>> On Mon, Aug 25, 2025 at 04:51:27PM +0800, Fengnan Chang wrote:
>> > No restrictions for now; I think we can enable this by default.
>> > Maybe a better solution is to modify bio.c?  Let me do some testing first.

If there are other implications to consider for using the per-cpu bio
cache by default, then maybe we can first get the optimizations for
iomap in for at least the REQ_ALLOC_CACHE users, and later work on
seeing whether this can be enabled by default for other users too.
Unless someone else thinks otherwise.

Why I am thinking this is: since the per-cpu bio cache is limited, if
everyone uses it for their bio submissions, we may not get the best
performance where it is needed. So that might require us to come up
with a different approach.

>>
>> Any kind of numbers you see where this makes a difference, including
>> the workloads, would also be very valuable here.
> I'm testing random direct read performance on io_uring+ext4 and comparing
> it to io_uring+raw blkdev; io_uring+ext4 is quite poor. While trying to
> improve this, I found ext4 behaves quite differently from blkdev when
> running bio_alloc_bioset(): the blkdev path uses the per-cpu bio cache,
> but the ext4 path does not. So I made this modification.

I am assuming you meant to say: DIO with io_uring+raw_blkdev uses the
per-cpu bio cache, whereas io_uring+(ext4/xfs) does not use it.
Hence you added this patch, which will enable the use of it and
should also improve the performance of io_uring+(ext4/xfs).

That makes sense to me.

> My test command is:
> /fio/t/io_uring -p0 -d128 -b4096 -s1 -c1 -F1 -B1 -R1 -X1 -n1 -P1 -t0
> /data01/testfile
> Without this patch:
> BW is 1950MB/s.
> With this patch:
> BW is 2001MB/s.

Ok. That's around 2.6% improvement. Is that what you were expecting to
see too? Is that because you were testing with -p0 (non-polled I/O)? 

Looking at the numbers here [1] & [2], I was hoping this could give
maybe around 5-6% improvement ;) 

[1]: https://lore.kernel.org/io-uring/cover.1666347703.git.asml.silence@gmail.com/
[2]: https://lore.kernel.org/all/20220806152004.382170-3-axboe@kernel.dk/


-ritesh


* Re: [External] Re: [PATCH] iomap: allow iomap using the per-cpu bio cache
  2025-08-26 16:53                 ` Ritesh Harjani
@ 2025-08-29  4:26                   ` Fengnan Chang
  2025-09-03  8:28                   ` Fengnan Chang
  1 sibling, 0 replies; 14+ messages in thread
From: Fengnan Chang @ 2025-08-29  4:26 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Matthew Wilcox, Darrick J. Wong, brauner, linux-xfs,
	linux-fsdevel, Jens Axboe, linux-block, Christoph Hellwig

Sorry, I need to wait a few more days for this; too busy recently.

Ritesh Harjani <ritesh.list@gmail.com> wrote on Wed, Aug 27, 2025 at 01:26:
>
> Fengnan Chang <changfengnan@bytedance.com> writes:
>
> Christoph Hellwig <hch@infradead.org> wrote on Mon, Aug 25, 2025 at 17:21:
> >>
> >> On Mon, Aug 25, 2025 at 04:51:27PM +0800, Fengnan Chang wrote:
> >> > No restrictions for now; I think we can enable this by default.
> >> > Maybe a better solution is to modify bio.c?  Let me do some testing first.
>
> If there are other implications to consider for using the per-cpu bio
> cache by default, then maybe we can first get the optimizations for
> iomap in for at least the REQ_ALLOC_CACHE users, and later work on
> seeing whether this can be enabled by default for other users too.
> Unless someone else thinks otherwise.
>
> Why I am thinking this is: since the per-cpu bio cache is limited, if
> everyone uses it for their bio submissions, we may not get the best
> performance where it is needed. So that might require us to come up
> with a different approach.
>
> >>
> >> Any kind of numbers you see where this makes a difference, including
> >> the workloads, would also be very valuable here.
> > I'm testing random direct read performance on io_uring+ext4 and comparing
> > it to io_uring+raw blkdev; io_uring+ext4 is quite poor. While trying to
> > improve this, I found ext4 behaves quite differently from blkdev when
> > running bio_alloc_bioset(): the blkdev path uses the per-cpu bio cache,
> > but the ext4 path does not. So I made this modification.
>
> I am assuming you meant to say: DIO with io_uring+raw_blkdev uses the
> per-cpu bio cache, whereas io_uring+(ext4/xfs) does not use it.
> Hence you added this patch, which will enable the use of it and
> should also improve the performance of io_uring+(ext4/xfs).
>
> That makes sense to me.
>
> > My test command is:
> > /fio/t/io_uring -p0 -d128 -b4096 -s1 -c1 -F1 -B1 -R1 -X1 -n1 -P1 -t0
> > /data01/testfile
> > Without this patch:
> > BW is 1950MB/s.
> > With this patch:
> > BW is 2001MB/s.
>
> Ok. That's around 2.6% improvement. Is that what you were expecting to
> see too? Is that because you were testing with -p0 (non-polled I/O)?
>
> Looking at the numbers here [1] & [2], I was hoping this could give
> maybe around 5-6% improvement ;)
>
> [1]: https://lore.kernel.org/io-uring/cover.1666347703.git.asml.silence@gmail.com/
> [2]: https://lore.kernel.org/all/20220806152004.382170-3-axboe@kernel.dk/
>
>
> -ritesh


* Re: [External] Re: [PATCH] iomap: allow iomap using the per-cpu bio cache
  2025-08-26 16:53                 ` Ritesh Harjani
  2025-08-29  4:26                   ` [External] " Fengnan Chang
@ 2025-09-03  8:28                   ` Fengnan Chang
  1 sibling, 0 replies; 14+ messages in thread
From: Fengnan Chang @ 2025-09-03  8:28 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Matthew Wilcox, Darrick J. Wong, brauner, linux-xfs,
	linux-fsdevel, Jens Axboe, linux-block, Christoph Hellwig

Ritesh Harjani <ritesh.list@gmail.com> wrote on Wed, Aug 27, 2025 at 01:26:
>
> Fengnan Chang <changfengnan@bytedance.com> writes:
>
> Christoph Hellwig <hch@infradead.org> wrote on Mon, Aug 25, 2025 at 17:21:
> >>
> >> On Mon, Aug 25, 2025 at 04:51:27PM +0800, Fengnan Chang wrote:
> >> > No restrictions for now; I think we can enable this by default.
> >> > Maybe a better solution is to modify bio.c?  Let me do some testing first.
>
> If there are other implications to consider for using the per-cpu bio
> cache by default, then maybe we can first get the optimizations for
> iomap in for at least the REQ_ALLOC_CACHE users, and later work on
> seeing whether this can be enabled by default for other users too.
> Unless someone else thinks otherwise.
>
> Why I am thinking this is: since the per-cpu bio cache is limited, if
> everyone uses it for their bio submissions, we may not get the best
> performance where it is needed. So that might require us to come up
> with a different approach.

Agreed; if everyone uses it for their bio submissions, we cannot get the
best performance.

>
> >>
> >> Any kind of numbers you see where this makes a difference, including
> >> the workloads, would also be very valuable here.
> > I'm testing random direct read performance on io_uring+ext4 and comparing
> > it to io_uring+raw blkdev; io_uring+ext4 is quite poor. While trying to
> > improve this, I found ext4 behaves quite differently from blkdev when
> > running bio_alloc_bioset(): the blkdev path uses the per-cpu bio cache,
> > but the ext4 path does not. So I made this modification.
>
> I am assuming you meant to say: DIO with io_uring+raw_blkdev uses the
> per-cpu bio cache, whereas io_uring+(ext4/xfs) does not use it.
> Hence you added this patch, which will enable the use of it and
> should also improve the performance of io_uring+(ext4/xfs).

Yes. DIO+io_uring+raw_blkdev vs DIO+io_uring+(ext4/xfs).

>
> That makes sense to me.
>
> > My test command is:
> > /fio/t/io_uring -p0 -d128 -b4096 -s1 -c1 -F1 -B1 -R1 -X1 -n1 -P1 -t0
> > /data01/testfile
> > Without this patch:
> > BW is 1950MB/s.
> > With this patch:
> > BW is 2001MB/s.
>
> Ok. That's around 2.6% improvement. Is that what you were expecting to
> see too? Is that because you were testing with -p0 (non-polled I/O)?

I don't have a quantitative target for expectations; 2.6% seems reasonable.
It's not related to -p0; with -p1 I see about a 3.1% improvement.
Why can't we get a 5-6% improvement? I think the biggest bottlenecks are
in ext4/xfs, mostly in ext4_es_lookup_extent().

>
> Looking at the numbers here [1] & [2], I was hoping this could give
> maybe around 5-6% improvement ;)
>
> [1]: https://lore.kernel.org/io-uring/cover.1666347703.git.asml.silence@gmail.com/
> [2]: https://lore.kernel.org/all/20220806152004.382170-3-axboe@kernel.dk/
>
>
> -ritesh


* Re: [PATCH] iomap: allow iomap using the per-cpu bio cache
  2025-08-23  4:15         ` Ritesh Harjani
  2025-08-25  8:51           ` Fengnan Chang
@ 2025-09-03  9:53           ` Pavel Begunkov
  1 sibling, 0 replies; 14+ messages in thread
From: Pavel Begunkov @ 2025-09-03  9:53 UTC (permalink / raw)
  To: Ritesh Harjani (IBM), Matthew Wilcox
  Cc: Darrick J. Wong, Fengnan Chang, brauner, linux-xfs, linux-fsdevel,
	Jens Axboe, linux-block

On 8/23/25 05:15, Ritesh Harjani (IBM) wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> 
>> On Fri, Aug 22, 2025 at 09:37:32PM +0530, Ritesh Harjani wrote:
>>> Matthew Wilcox <willy@infradead.org> writes:
>>>> On Fri, Aug 22, 2025 at 08:05:50AM -0700, Darrick J. Wong wrote:
>>>>> Is there a reason /not/ to use the per-cpu bio cache unconditionally?
>>>>
>>>> AIUI it's not safe because completions might happen on a different CPU
>>>> from the submission.
>>>
>>> At most, a bio dequeued from CPU X's cache can be returned to CPU Y's
>>> cache; that shouldn't be unsafe, right? See e.g. bio_put_percpu_cache().
>>> Not optimal for performance, though.
>>>
>>> Also, even for io_uring, the IRQ completions (non-polling requests) can
>>> get routed to a different CPU than the submitting CPU, correct?
>>> Then the completions (bio completion processing) are handled via IPIs on
>>> the submitting CPU or based on the cache topology, right?
>>>
>>>> At least, there's nowhere that sets REQ_ALLOC_CACHE unconditionally.
>>>>
>>>> This could do with some better documentation ..
>>>
>>> Agreed. Looking at the history, this got added for polling mode first but
>>> later got enabled even for IRQ-driven io_uring rw requests [1]. So it
>>> makes sense to understand whether this can be enabled unconditionally for
>>> DIO requests or not.
>>
>> So why does the flag now exist at all?  Why not use the cache
>> unconditionally?
> 
> I am hoping the author of this patch, or folks with io_uring expertise
> (who added the per-cpu bio cache in the first place), could answer
> this better, i.e.:

CC'ing would help :)

> Now that the per-cpu bio cache is used by io_uring rw requests for
> both polled and non-polled I/O, does that mean we can kill the
> IOCB_ALLOC_CACHE check from the iomap DIO path completely and use the
> per-cpu bio cache unconditionally by passing the REQ_ALLOC_CACHE flag?
> That would mean all DIO requests via iomap could use this per-cpu bio
> cache, not just the ones initiated via the io_uring path.
> 
> Or are there still restrictions on using this per-cpu bio cache which
> limit it to the io_uring path? If yes, what are they? And can this be
> documented somewhere?

It should be safe to use for task context allocations (struct
bio_alloc_cache::free_list is [soft]irq unsafe).

IOCB_ALLOC_CACHE shouldn't be needed, but IIRC I played it
conservatively to not impact paths I didn't specifically benchmark.
FWIW, I couldn't measure any negative impact with io_uring at the
time for requests completed on a different CPU (same NUMA), but if
it's a problem, to offset the effect we can probably add a CPU
check => bio_free and/or try to batch de-allocate when the cache is
full.

-- 
Pavel Begunkov



end of thread, other threads:[~2025-09-03  9:52 UTC | newest]

Thread overview: 14+ messages
     [not found] <20250822082606.66375-1-changfengnan@bytedance.com>
     [not found] ` <20250822150550.GP7942@frogsfrogsfrogs>
2025-08-22 15:42   ` [PATCH] iomap: allow iomap using the per-cpu bio cache Matthew Wilcox
2025-08-22 16:07     ` Ritesh Harjani
2025-08-22 16:51       ` Matthew Wilcox
2025-08-23  4:15         ` Ritesh Harjani
2025-08-25  8:51           ` Fengnan Chang
2025-08-25  9:21             ` Christoph Hellwig
2025-08-25  9:41               ` Fengnan Chang
2025-08-25 10:47                 ` Christoph Hellwig
2025-08-26  9:46                   ` Fengnan Chang
2025-08-26 13:15                     ` Christoph Hellwig
2025-08-26 16:53                 ` Ritesh Harjani
2025-08-29  4:26                   ` [External] " Fengnan Chang
2025-09-03  8:28                   ` Fengnan Chang
2025-09-03  9:53           ` Pavel Begunkov
