* Re: [PATCH] iomap: allow iomap using the per-cpu bio cache

From: Matthew Wilcox @ 2025-08-22 15:42 UTC
To: Darrick J. Wong
Cc: Fengnan Chang, brauner, linux-xfs, linux-fsdevel, Jens Axboe, linux-block

On Fri, Aug 22, 2025 at 08:05:50AM -0700, Darrick J. Wong wrote:
> On Fri, Aug 22, 2025 at 04:26:06PM +0800, Fengnan Chang wrote:
> > When using io_uring with direct IO, we can use the per-cpu bio cache
> > from bio_alloc_bioset(), so pass the IOCB_ALLOC_CACHE flag through to
> > the bio allocation helper.
> >
> > +	if (iter->flags & IOMAP_ALLOC_CACHE)
> > +		bio_opf |= REQ_ALLOC_CACHE;
>
> Is there a reason /not/ to use the per-cpu bio cache unconditionally?

AIUI it's not safe because completions might happen on a different CPU
from the submission.  At least, there's nowhere that sets
REQ_ALLOC_CACHE unconditionally.

This could do with some better documentation ...
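(For context, a minimal sketch of the flag plumbing under discussion:
io_uring marks the kiocb with IOCB_ALLOC_CACHE, the iomap DIO path
translates that into IOMAP_ALLOC_CACHE on the iter, and the quoted hunk
turns it into REQ_ALLOC_CACHE on the bio.  Only the two quoted lines
come from the patch itself; the helper below is a simplified paraphrase
of the upstream iomap_dio_alloc_bio() and may not match the actual
submission:)

	/* fs/iomap/direct-io.c -- simplified sketch, not the actual patch */
	static struct bio *iomap_dio_alloc_bio(const struct iomap_iter *iter,
			struct iomap_dio *dio, unsigned short nr_vecs,
			blk_opf_t opf)
	{
		/* Only kiocbs that opted in (io_uring sets IOCB_ALLOC_CACHE)
		 * carry IOMAP_ALLOC_CACHE down to this point. */
		if (iter->flags & IOMAP_ALLOC_CACHE)
			opf |= REQ_ALLOC_CACHE;

		if (dio->dops && dio->dops->bio_set)
			return bio_alloc_bioset(iter->iomap.bdev, nr_vecs, opf,
						GFP_KERNEL, dio->dops->bio_set);
		return bio_alloc(iter->iomap.bdev, nr_vecs, opf, GFP_KERNEL);
	}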
* Re: [PATCH] iomap: allow iomap using the per-cpu bio cache

From: Ritesh Harjani @ 2025-08-22 16:07 UTC
To: Matthew Wilcox, Darrick J. Wong
Cc: Fengnan Chang, brauner, linux-xfs, linux-fsdevel, Jens Axboe, linux-block

Matthew Wilcox <willy@infradead.org> writes:

> On Fri, Aug 22, 2025 at 08:05:50AM -0700, Darrick J. Wong wrote:
>> On Fri, Aug 22, 2025 at 04:26:06PM +0800, Fengnan Chang wrote:
>> > When using io_uring with direct IO, we can use the per-cpu bio cache
>> > from bio_alloc_bioset(), so pass the IOCB_ALLOC_CACHE flag through to
>> > the bio allocation helper.
>> >
>> > +	if (iter->flags & IOMAP_ALLOC_CACHE)
>> > +		bio_opf |= REQ_ALLOC_CACHE;
>>
>> Is there a reason /not/ to use the per-cpu bio cache unconditionally?
>
> AIUI it's not safe because completions might happen on a different CPU
> from the submission.

At worst, a bio dequeued on CPU X can be returned to CPU Y's cache;
that shouldn't be unsafe, right?  e.g. bio_put_percpu_cache().  It's
just not optimal for performance.

Also, even for io_uring, IRQ completions (non-polled requests) can get
routed to a different CPU than the submitting one, correct?  The
completions (bio completion processing) are then handled via IPIs on
the submitting CPU, or based on the cache topology, right?

> At least, there's nowhere that sets REQ_ALLOC_CACHE unconditionally.
>
> This could do with some better documentation ...

Agreed.  Looking at the history, this got added for polling mode first,
but later got enabled even for irq-driven io_uring rw requests [1].  So
it makes sense to understand whether this can be enabled
unconditionally for DIO requests or not.

[1]: https://lore.kernel.org/io-uring/aab3521d49fd6c1ff6ea194c9e63d05565efc103.1666347703.git.asml.silence@gmail.com/

-ritesh
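(For reference, the shape of the return path mentioned above -- a
simplified paraphrase of bio_put_percpu_cache() from block/bio.c,
written from memory rather than copied, so details may differ across
kernel versions:)

	static void bio_put_percpu_cache(struct bio *bio)
	{
		struct bio_alloc_cache *cache;

		/* This is the *current* CPU's cache: a bio allocated on
		 * CPU X may be returned to CPU Y's list.  Safe, just not
		 * cache-optimal. */
		cache = per_cpu_ptr(bio->bi_pool->cache, get_cpu());
		if (cache->nr + cache->nr_irq > ALLOC_CACHE_MAX)
			goto out_free;

		if (in_task()) {
			bio_uninit(bio);
			bio->bi_next = cache->free_list;	/* irq-unsafe list */
			cache->free_list = bio;
			cache->nr++;
		} else if (in_hardirq()) {
			bio_uninit(bio);
			bio->bi_next = cache->free_list_irq;	/* irq-safe side */
			cache->free_list_irq = bio;
			cache->nr_irq++;
		} else {
			goto out_free;
		}
		put_cpu();
		return;

	out_free:
		put_cpu();
		bio_free(bio);
	}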
* Re: [PATCH] iomap: allow iomap using the per-cpu bio cache

From: Matthew Wilcox @ 2025-08-22 16:51 UTC
To: Ritesh Harjani
Cc: Darrick J. Wong, Fengnan Chang, brauner, linux-xfs, linux-fsdevel, Jens Axboe, linux-block

On Fri, Aug 22, 2025 at 09:37:32PM +0530, Ritesh Harjani wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> > On Fri, Aug 22, 2025 at 08:05:50AM -0700, Darrick J. Wong wrote:
> >> Is there a reason /not/ to use the per-cpu bio cache unconditionally?
> >
> > AIUI it's not safe because completions might happen on a different CPU
> > from the submission.
>
> At worst, a bio dequeued on CPU X can be returned to CPU Y's cache;
> that shouldn't be unsafe, right?  e.g. bio_put_percpu_cache().  It's
> just not optimal for performance.
>
> Also, even for io_uring, IRQ completions (non-polled requests) can get
> routed to a different CPU than the submitting one, correct?  The
> completions (bio completion processing) are then handled via IPIs on
> the submitting CPU, or based on the cache topology, right?
>
> > At least, there's nowhere that sets REQ_ALLOC_CACHE unconditionally.
> >
> > This could do with some better documentation ...
>
> Agreed.  Looking at the history, this got added for polling mode first,
> but later got enabled even for irq-driven io_uring rw requests [1].  So
> it makes sense to understand whether this can be enabled
> unconditionally for DIO requests or not.

So why does the flag now exist at all?  Why not use the cache
unconditionally?
* Re: [PATCH] iomap: allow iomap using the per-cpu bio cache

From: Ritesh Harjani @ 2025-08-23  4:15 UTC
To: Matthew Wilcox
Cc: Darrick J. Wong, Fengnan Chang, brauner, linux-xfs, linux-fsdevel, Jens Axboe, linux-block

Matthew Wilcox <willy@infradead.org> writes:

> On Fri, Aug 22, 2025 at 09:37:32PM +0530, Ritesh Harjani wrote:
>> [...]
>> Agreed.  Looking at the history, this got added for polling mode first,
>> but later got enabled even for irq-driven io_uring rw requests [1].  So
>> it makes sense to understand whether this can be enabled
>> unconditionally for DIO requests or not.
>
> So why does the flag now exist at all?  Why not use the cache
> unconditionally?

I am hoping the author of this patch, or folks with io_uring expertise
(who added the per-cpu bio cache in the first place), can answer this
better.  i.e.:

Now that the per-cpu bio cache is used by io_uring rw requests for both
polled and non-polled I/O, does that mean we can kill the
IOCB_ALLOC_CACHE check from the iomap DIO path completely and use the
per-cpu bio cache unconditionally by always passing REQ_ALLOC_CACHE?
That would mean all DIO requests via iomap could use the per-cpu bio
cache, not just the ones initiated via the io_uring path.

Or are there still restrictions on using the per-cpu bio cache that
limit it to the io_uring path?  If yes, what are they?  And can this be
documented somewhere?

-ritesh
* Re: [PATCH] iomap: allow iomap using the per-cpu bio cache

From: Fengnan Chang @ 2025-08-25  8:51 UTC
To: Ritesh Harjani
Cc: Matthew Wilcox, Darrick J. Wong, brauner, linux-xfs, linux-fsdevel, Jens Axboe, linux-block

Sorry for the late reply.

Ritesh Harjani <ritesh.list@gmail.com> wrote on Sat, 23 Aug 2025 at 12:35:
>
> Matthew Wilcox <willy@infradead.org> writes:
>
> > So why does the flag now exist at all?  Why not use the cache
> > unconditionally?

I think it's a historical artifact.  When the REQ_ALLOC_CACHE flag was
first added in
https://lore.kernel.org/all/20220324203526.62306-2-snitzer@kernel.org/,
it was renamed from BIOSET_PERCPU_CACHE and only worked for iopoll:

"If REQ_ALLOC_CACHE is set, the final put of the bio MUST be done from
process context, not hard/soft IRQ."

Then
https://lore.kernel.org/all/cover.1667384020.git.asml.silence@gmail.com/
removed that limit, so per-cpu bio caching can be used for IRQ I/O:

"Currently, it's only enabled for previous REQ_ALLOC_CACHE users but
will be turned on system-wide later."

> Now that the per-cpu bio cache is used by io_uring rw requests for both
> polled and non-polled I/O, does that mean we can kill the
> IOCB_ALLOC_CACHE check from the iomap DIO path completely and use the
> per-cpu bio cache unconditionally by always passing REQ_ALLOC_CACHE?
> That would mean all DIO requests via iomap could use the per-cpu bio
> cache, not just the ones initiated via the io_uring path.
>
> Or are there still restrictions on using the per-cpu bio cache that
> limit it to the io_uring path?  If yes, what are they?  And can this be
> documented somewhere?

No restrictions for now; I think we can enable this by default.  Maybe
the better solution is to modify bio.c?  Let me do some testing first.
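(A hypothetical sketch of the "modify bio.c" idea floated above --
letting the block layer opt into the per-cpu cache itself, instead of
requiring every caller to pass REQ_ALLOC_CACHE.  The helper name is
invented for illustration; no such patch exists in this thread:)

	/* Hypothetical helper, not an existing kernel API: decide inside
	 * the block layer whether an allocation may use the per-cpu bio
	 * cache -- any task-context allocation from a bio_set that has a
	 * cache would qualify. */
	static inline blk_opf_t bio_opf_maybe_cached(struct bio_set *bs,
						     blk_opf_t opf)
	{
		if (bs && bs->cache && in_task())
			opf |= REQ_ALLOC_CACHE;
		return opf;
	}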
* Re: [PATCH] iomap: allow iomap using the per-cpu bio cache

From: Christoph Hellwig @ 2025-08-25  9:21 UTC
To: Fengnan Chang
Cc: Ritesh Harjani, Matthew Wilcox, Darrick J. Wong, brauner, linux-xfs, linux-fsdevel, Jens Axboe, linux-block

On Mon, Aug 25, 2025 at 04:51:27PM +0800, Fengnan Chang wrote:
> No restrictions for now; I think we can enable this by default.  Maybe
> the better solution is to modify bio.c?  Let me do some testing first.

Any numbers showing where this makes a difference, including the
workloads, would also be very valuable here.
* Re: [PATCH] iomap: allow iomap using the per-cpu bio cache

From: Fengnan Chang @ 2025-08-25  9:41 UTC
To: Christoph Hellwig
Cc: Ritesh Harjani, Matthew Wilcox, Darrick J. Wong, brauner, linux-xfs, linux-fsdevel, Jens Axboe, linux-block

Christoph Hellwig <hch@infradead.org> wrote on Mon, 25 Aug 2025 at 17:21:
>
> On Mon, Aug 25, 2025 at 04:51:27PM +0800, Fengnan Chang wrote:
> > No restrictions for now; I think we can enable this by default.  Maybe
> > the better solution is to modify bio.c?  Let me do some testing first.
>
> Any numbers showing where this makes a difference, including the
> workloads, would also be very valuable here.

I tested random direct read performance on io_uring + ext4 and compared
it to io_uring + raw blkdev; io_uring + ext4 is quite poor, and I'm
trying to improve it.  I found that ext4 behaves quite differently from
blkdev in bio_alloc_bioset(): the blkdev path uses the per-cpu bio
cache, but the ext4 path does not.  So I made this modification.

My test command is:

	/fio/t/io_uring -p0 -d128 -b4096 -s1 -c1 -F1 -B1 -R1 -X1 -n1 -P1 -t0 /data01/testfile

Without this patch, BW is 1950MB; with this patch, BW is 2001MB.
* Re: [PATCH] iomap: allow iomap using the per-cpu bio cache

From: Christoph Hellwig @ 2025-08-25 10:47 UTC
To: Fengnan Chang
Cc: Christoph Hellwig, Ritesh Harjani, Matthew Wilcox, Darrick J. Wong, brauner, linux-xfs, linux-fsdevel, Jens Axboe, linux-block

On Mon, Aug 25, 2025 at 05:41:57PM +0800, Fengnan Chang wrote:
> I tested random direct read performance on io_uring + ext4 and compared
> it to io_uring + raw blkdev; io_uring + ext4 is quite poor, and I'm
> trying to improve it.  I found that ext4 behaves quite differently from
> blkdev in bio_alloc_bioset(): the blkdev path uses the per-cpu bio
> cache, but the ext4 path does not.  So I made this modification.
> My test command is:
> /fio/t/io_uring -p0 -d128 -b4096 -s1 -c1 -F1 -B1 -R1 -X1 -n1 -P1 -t0 /data01/testfile
> Without this patch, BW is 1950MB; with this patch, BW is 2001MB.

Interesting.  This is with the not-yet-merged ext4 iomap patches, I
guess?  Do you see similar numbers with XFS?
* Re: [PATCH] iomap: allow iomap using the per-cpu bio cache

From: Fengnan Chang @ 2025-08-26  9:46 UTC
To: Christoph Hellwig
Cc: Ritesh Harjani, Matthew Wilcox, Darrick J. Wong, brauner, linux-xfs, linux-fsdevel, Jens Axboe, linux-block

Christoph Hellwig <hch@infradead.org> wrote on Mon, 25 Aug 2025 at 18:48:
>
> Interesting.  This is with the not-yet-merged ext4 iomap patches, I
> guess?  Do you see similar numbers with XFS?

Yes, similar numbers with XFS.
* Re: [PATCH] iomap: allow iomap using the per-cpu bio cache

From: Christoph Hellwig @ 2025-08-26 13:15 UTC
To: Fengnan Chang
Cc: Christoph Hellwig, Ritesh Harjani, Matthew Wilcox, Darrick J. Wong, brauner, linux-xfs, linux-fsdevel, Jens Axboe, linux-block

On Tue, Aug 26, 2025 at 05:46:03PM +0800, Fengnan Chang wrote:
> > Interesting.  This is with the not-yet-merged ext4 iomap patches, I
> > guess?  Do you see similar numbers with XFS?
> Yes, similar numbers with XFS.

Can you add the numbers to the next submission of the patch?
* Re: [PATCH] iomap: allow iomap using the per-cpu bio cache

From: Ritesh Harjani @ 2025-08-26 16:53 UTC
To: Fengnan Chang
Cc: Matthew Wilcox, Darrick J. Wong, brauner, linux-xfs, linux-fsdevel, Jens Axboe, linux-block, Christoph Hellwig

Fengnan Chang <changfengnan@bytedance.com> writes:

> Christoph Hellwig <hch@infradead.org> wrote on Mon, 25 Aug 2025 at 17:21:
>>
>> On Mon, Aug 25, 2025 at 04:51:27PM +0800, Fengnan Chang wrote:
>> > No restrictions for now; I think we can enable this by default.  Maybe
>> > the better solution is to modify bio.c?  Let me do some testing first.

If there are other implications to consider in using the per-cpu bio
cache by default, then maybe we can first get the optimization into
iomap for at least the REQ_ALLOC_CACHE users, and later work out
whether this can be enabled by default for other users too -- unless
someone else thinks otherwise.

Why I am thinking this: because the per-cpu bio cache is limited in
size, if everyone uses it for their bio submissions we may not get the
best performance where it is needed, and that might require us to come
up with a different approach.

>> Any numbers showing where this makes a difference, including the
>> workloads, would also be very valuable here.
>
> I tested random direct read performance on io_uring + ext4 and compared
> it to io_uring + raw blkdev; io_uring + ext4 is quite poor, and I'm
> trying to improve it.  I found that ext4 behaves quite differently from
> blkdev in bio_alloc_bioset(): the blkdev path uses the per-cpu bio
> cache, but the ext4 path does not.  So I made this modification.

I am assuming you meant to say: DIO with io_uring + raw blkdev uses the
per-cpu bio cache, whereas io_uring + ext4/xfs does not.  Hence you
added this patch to enable its use, which should also improve the
performance of io_uring + ext4/xfs.

That makes sense to me.

> My test command is:
> /fio/t/io_uring -p0 -d128 -b4096 -s1 -c1 -F1 -B1 -R1 -X1 -n1 -P1 -t0 /data01/testfile
> Without this patch, BW is 1950MB; with this patch, BW is 2001MB.

OK, that's around a 2.6% improvement.  Is that what you were expecting
to see, too?  Is it because you were testing with -p0 (non-polled I/O)?

Looking at the numbers here [1] & [2], I was hoping this could give
maybe around a 5-6% improvement ;)

[1]: https://lore.kernel.org/io-uring/cover.1666347703.git.asml.silence@gmail.com/
[2]: https://lore.kernel.org/all/20220806152004.382170-3-axboe@kernel.dk/

-ritesh
* Re: [External] Re: [PATCH] iomap: allow iomap using the per-cpu bio cache

From: Fengnan Chang @ 2025-08-29  4:26 UTC
To: Ritesh Harjani
Cc: Matthew Wilcox, Darrick J. Wong, brauner, linux-xfs, linux-fsdevel, Jens Axboe, linux-block, Christoph Hellwig

Sorry, this needs to wait a few more days; I've been too busy recently.

Ritesh Harjani <ritesh.list@gmail.com> wrote on Wed, 27 Aug 2025 at 01:26:
> [...]
* Re: [External] Re: [PATCH] iomap: allow iomap using the per-cpu bio cache

From: Fengnan Chang @ 2025-09-03  8:28 UTC
To: Ritesh Harjani
Cc: Matthew Wilcox, Darrick J. Wong, brauner, linux-xfs, linux-fsdevel, Jens Axboe, linux-block, Christoph Hellwig

Ritesh Harjani <ritesh.list@gmail.com> wrote on Wed, 27 Aug 2025 at 01:26:
>
> Why I am thinking this: because the per-cpu bio cache is limited in
> size, if everyone uses it for their bio submissions we may not get the
> best performance where it is needed, and that might require us to come
> up with a different approach.

Agreed.  If everyone uses it for their bio submissions, we cannot get
the best performance.

> I am assuming you meant to say: DIO with io_uring + raw blkdev uses the
> per-cpu bio cache, whereas io_uring + ext4/xfs does not.  Hence you
> added this patch to enable its use, which should also improve the
> performance of io_uring + ext4/xfs.

Yes.  DIO + io_uring + raw blkdev vs. DIO + io_uring + ext4/xfs.

> OK, that's around a 2.6% improvement.  Is that what you were expecting
> to see, too?  Is it because you were testing with -p0 (non-polled I/O)?

I don't have a quantitative target; 2.6% seems reasonable.  It's not
related to -p0: with -p1, it's about a 3.1% improvement.

Why can't we get a 5-6% improvement?  I think the biggest bottlenecks
are in ext4/xfs, mostly in ext4_es_lookup_extent.
* Re: [External] Re: [PATCH] iomap: allow iomap using the per-cpu bio cache

From: Ritesh Harjani @ 2025-09-06  4:25 UTC
To: Fengnan Chang
Cc: Matthew Wilcox, Darrick J. Wong, brauner, linux-xfs, linux-fsdevel, Jens Axboe, linux-block, Christoph Hellwig

Fengnan Chang <changfengnan@bytedance.com> writes:

> Ritesh Harjani <ritesh.list@gmail.com> wrote on Wed, 27 Aug 2025 at 01:26:
> [...]
>> > My test command is:
>> > /fio/t/io_uring -p0 -d128 -b4096 -s1 -c1 -F1 -B1 -R1 -X1 -n1 -P1 -t0 /data01/testfile
>> > Without this patch, BW is 1950MB; with this patch, BW is 2001MB.

I guess here you meant BW: XXXX MB/s.

>> OK, that's around a 2.6% improvement.  Is that what you were expecting
>> to see, too?  Is it because you were testing with -p0 (non-polled I/O)?
>
> I don't have a quantitative target; 2.6% seems reasonable.  It's not
> related to -p0: with -p1, it's about a 3.1% improvement.
> Why can't we get a 5-6% improvement?  I think the biggest bottlenecks
> are in ext4/xfs, mostly in ext4_es_lookup_extent.

Sure, thanks for sharing the details.  Could you add the performance
improvement numbers, along with the io_uring command you shared above,
to the commit message in v2?

With that, please feel free to add:

Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

-ritesh
* Re: [PATCH] iomap: allow iomap using the per-cpu bio cache

From: Pavel Begunkov @ 2025-09-03  9:53 UTC
To: Ritesh Harjani (IBM), Matthew Wilcox
Cc: Darrick J. Wong, Fengnan Chang, brauner, linux-xfs, linux-fsdevel, Jens Axboe, linux-block

On 8/23/25 05:15, Ritesh Harjani (IBM) wrote:
> Matthew Wilcox <willy@infradead.org> writes:
>
>> So why does the flag now exist at all?  Why not use the cache
>> unconditionally?
>
> I am hoping the author of this patch, or folks with io_uring expertise
> (who added the per-cpu bio cache in the first place), can answer this
> better.  i.e.:

CC'ing would help :)

> Now that the per-cpu bio cache is used by io_uring rw requests for both
> polled and non-polled I/O, does that mean we can kill the
> IOCB_ALLOC_CACHE check from the iomap DIO path completely and use the
> per-cpu bio cache unconditionally by always passing REQ_ALLOC_CACHE?
> That would mean all DIO requests via iomap could use the per-cpu bio
> cache, not just the ones initiated via the io_uring path.
>
> Or are there still restrictions on using the per-cpu bio cache that
> limit it to the io_uring path?  If yes, what are they?  And can this be
> documented somewhere?

It should be safe to use for task-context allocations (struct
bio_alloc_cache::free_list is [soft]irq unsafe).  IOCB_ALLOC_CACHE
shouldn't be needed, but IIRC I played it conservatively so as not to
impact paths I didn't specifically benchmark.

FWIW, I couldn't measure any negative impact with io_uring at the time
for requests completed on a different CPU (same NUMA node), but if it's
a problem, to offset the effect we can probably add a CPU check =>
bio_free and/or try to batch-deallocate when the cache is full.

--
Pavel Begunkov
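(A sketch of the mitigation Pavel floats at the end -- freeing instead
of caching when a bio completes on a foreign CPU.  The bi_alloc_cpu
field is assumed for illustration; no such field exists in struct bio
today:)

	/* Hypothetical: record the allocating CPU in the bio and skip the
	 * cache on cross-CPU completion rather than polluting the remote
	 * free list.  'bi_alloc_cpu' is an assumed new field. */
	static void bio_put_percpu_cache_checked(struct bio *bio)
	{
		if (bio->bi_alloc_cpu != smp_processor_id()) {
			bio_free(bio);	/* cheap fallback, keeps caches local */
			return;
		}
		bio_put_percpu_cache(bio);	/* same-CPU fast path */
	}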