* Re: [PATCH v3 00/10] Add dmabuf read/write via io_uring
[not found] <cover.1777475843.git.asml.silence@gmail.com>
@ 2026-05-04 15:29 ` Ming Lei
2026-05-06 9:02 ` Pavel Begunkov
2026-05-12 7:00 ` Christoph Hellwig
` (7 subsequent siblings)
8 siblings, 1 reply; 16+ messages in thread
From: Ming Lei @ 2026-05-04 15:29 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig,
Nitesh Shetty, Kanchan Joshi, Anuj Gupta, Tushar Gohad,
William Power, Phil Cayton, Jason Gunthorpe
On Wed, Apr 29, 2026 at 04:25:46PM +0100, Pavel Begunkov wrote:
> The patch set allows registering a dmabuf with an io_uring instance for
> a specified file and using it with io_uring read / write requests. The
> infrastructure is not tied to io_uring and there could be more users
> in the future. A similar idea was attempted some years ago by Keith [1],
> from which I borrowed a good number of changes, and it was later brought
> up by Tushar and Vishal from Intel.
>
> It's an opt-in feature for files, and they need to implement a new
> file operation to use it. Only NVMe block devices are supported in this
> series. The user API is built on top of io_uring's "registered buffers",
> where a dmabuf is registered in a special way, but afterwards it can be
> used as any other "registered buffer" with IORING_OP_{READ,WRITE}_FIXED
> requests. It's created via a new file operation, and the resulting map is
> then passed through the I/O stack in a new iterator type. There is some
> additional infrastructure to bind it all together, which also counts
> requests using a dmabuf map and manages lifetimes; this is used to
> implement map invalidation.
>
> It was tested for GPU <-> NVMe transfers. Also, as it maintains a
> long-term dma mapping, it helps with the IOMMU cost. The numbers
> below are for udmabuf reads previously run by Anuj for different
> IOMMU modes:
A plain registered buffer is long-lived too, which raises the question: does
this framework need to take it into account from the beginning?
BTW, inspired by this approach, I added a similar feature to ublk via
UBLK_IO_F_SHMEM_ZC, which can maintain a long-term vfio dma mapping over a
registered, user-space aligned buffer.
Thanks,
Ming
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 00/10] Add dmabuf read/write via io_uring
2026-05-04 15:29 ` [PATCH v3 00/10] Add dmabuf read/write via io_uring Ming Lei
@ 2026-05-06 9:02 ` Pavel Begunkov
2026-05-07 9:50 ` Ming Lei
0 siblings, 1 reply; 16+ messages in thread
From: Pavel Begunkov @ 2026-05-06 9:02 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig,
Nitesh Shetty, Kanchan Joshi, Anuj Gupta, Tushar Gohad,
William Power, Phil Cayton, Jason Gunthorpe
Hey Ming,
On 5/4/26 16:29, Ming Lei wrote:
> On Wed, Apr 29, 2026 at 04:25:46PM +0100, Pavel Begunkov wrote:
>> The patch set allows registering a dmabuf with an io_uring instance for
>> a specified file and using it with io_uring read / write requests. The
>> infrastructure is not tied to io_uring and there could be more users
>> in the future. A similar idea was attempted some years ago by Keith [1],
>> from which I borrowed a good number of changes, and it was later brought
>> up by Tushar and Vishal from Intel.
>>
>> It's an opt-in feature for files, and they need to implement a new
>> file operation to use it. Only NVMe block devices are supported in this
>> series. The user API is built on top of io_uring's "registered buffers",
>> where a dmabuf is registered in a special way, but afterwards it can be
>> used as any other "registered buffer" with IORING_OP_{READ,WRITE}_FIXED
>> requests. It's created via a new file operation, and the resulting map is
>> then passed through the I/O stack in a new iterator type. There is some
>> additional infrastructure to bind it all together, which also counts
>> requests using a dmabuf map and manages lifetimes; this is used to
>> implement map invalidation.
>>
>> It was tested for GPU <-> NVMe transfers. Also, as it maintains a
>> long-term dma mapping, it helps with the IOMMU cost. The numbers
>> below are for udmabuf reads previously run by Anuj for different
>> IOMMU modes:
>
> A plain registered buffer is long-lived too, which raises the question: does
> this framework need to take it into account from the beginning?
Not sure I follow, mind expanding on what should be accounted?
Are you suggesting that we might want to use normal registered
buffers in a similar way? I.e. giving the driver an ability to
pre-register them?
> BTW, inspired by this approach, I added a similar feature to ublk via
> UBLK_IO_F_SHMEM_ZC, which can maintain a long-term vfio dma mapping over a
> registered, user-space aligned buffer.
Interesting, I just took a glance, and it looks like what David Wei
was thinking of adding to fuse, but IIUC he gave up exactly because the
client would need to cooperate and that could be troublesome.
Should we try to push everything under the same interface instead of
keeping a ublk specific one? Again to the point that it requires
a cooperative client, but if it's something more generic, the user
might just try to use it as a general optimisation. In the same way
it'll be helpful to fuse, and as a bonus you wouldn't need tree
lookups (but it mandates clients using registered buffers as a downside).
It'd need to be shaped to somehow work better with host memory, as I
assume you want to be able to map it into the server in the common case.
Switch-casing on whether it's a udmabuf is not the greatest approach,
but maybe we can figure out something else.
--
Pavel Begunkov
* Re: [PATCH v3 00/10] Add dmabuf read/write via io_uring
2026-05-06 9:02 ` Pavel Begunkov
@ 2026-05-07 9:50 ` Ming Lei
2026-05-12 9:30 ` Pavel Begunkov
0 siblings, 1 reply; 16+ messages in thread
From: Ming Lei @ 2026-05-07 9:50 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig,
Nitesh Shetty, Kanchan Joshi, Anuj Gupta, Tushar Gohad,
William Power, Phil Cayton, Jason Gunthorpe
On Wed, May 06, 2026 at 10:02:11AM +0100, Pavel Begunkov wrote:
> Hey Ming,
>
> On 5/4/26 16:29, Ming Lei wrote:
> > On Wed, Apr 29, 2026 at 04:25:46PM +0100, Pavel Begunkov wrote:
> > > The patch set allows registering a dmabuf with an io_uring instance for
> > > a specified file and using it with io_uring read / write requests. The
> > > infrastructure is not tied to io_uring and there could be more users
> > > in the future. A similar idea was attempted some years ago by Keith [1],
> > > from which I borrowed a good number of changes, and it was later brought
> > > up by Tushar and Vishal from Intel.
> > >
> > > It's an opt-in feature for files, and they need to implement a new
> > > file operation to use it. Only NVMe block devices are supported in this
> > > series. The user API is built on top of io_uring's "registered buffers",
> > > where a dmabuf is registered in a special way, but afterwards it can be
> > > used as any other "registered buffer" with IORING_OP_{READ,WRITE}_FIXED
> > > requests. It's created via a new file operation, and the resulting map is
> > > then passed through the I/O stack in a new iterator type. There is some
> > > additional infrastructure to bind it all together, which also counts
> > > requests using a dmabuf map and manages lifetimes; this is used to
> > > implement map invalidation.
> > >
> > > It was tested for GPU <-> NVMe transfers. Also, as it maintains a
> > > long-term dma mapping, it helps with the IOMMU cost. The numbers
> > > below are for udmabuf reads previously run by Anuj for different
> > > IOMMU modes:
> >
> > A plain registered buffer is long-lived too, which raises the question: does
> > this framework need to take it into account from the beginning?
>
> Not sure I follow, mind expanding on what should be accounted?
> Are you suggesting that we might want to use normal registered
> buffers in a similar way? I.e. giving the driver an ability to
> pre-register them?
Yeah, a normal registered buffer is long-lived too, which is exactly
what the driver cares about for the long-term dma mapping motivation.
>
> > BTW, inspired by this approach, I added a similar feature to ublk via
> > UBLK_IO_F_SHMEM_ZC, which can maintain a long-term vfio dma mapping over a
> > registered, user-space aligned buffer.
>
> Interesting, I just took a glance, and it looks like what David Wei
> was thinking of adding to fuse, but IIUC he gave up exactly because the
> client would need to cooperate and that could be troublesome.
Here the cooperation is minimized, maybe one shmem/hugetlb path, or memfd;
it is an opt-in optimization, with a fallback to the normal path
if the application doesn't cooperate.
>
> Should we try to push everything under the same interface instead of
> keeping a ublk specific one? Again to the point that it requires
If a generic interface can be figured out, it shouldn't be a big deal for
ublk to switch to it, and the usage is actually simple.
So far, ublk supports both FS and nvme block devices.
And the cooperation can't be avoided for this usage no matter whether a
generic or driver-specific implementation is taken, for both fuse & ublk.
> a cooperative client, but if it's something more generic, the user
> might just try to use it as a general optimisation. In the same way
> it'll be helpful to fuse, and as a bonus you wouldn't need tree
> lookups (but it mandates clients using registered buffers as a downside).
Yeah, but the tree lookup is fast enough with huge pages for a typical
application, and it is simple in concept.
Thanks,
Ming
* Re: [PATCH v3 00/10] Add dmabuf read/write via io_uring
[not found] <cover.1777475843.git.asml.silence@gmail.com>
2026-05-04 15:29 ` [PATCH v3 00/10] Add dmabuf read/write via io_uring Ming Lei
@ 2026-05-12 7:00 ` Christoph Hellwig
2026-05-12 9:30 ` Pavel Begunkov
[not found] ` <ae941457cf6cacb9d4c16b6ec904da9ef7fed97f.1777475843.git.asml.silence@gmail.com>
` (6 subsequent siblings)
8 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2026-05-12 7:00 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig,
Nitesh Shetty, Kanchan Joshi, Anuj Gupta, Tushar Gohad,
William Power, Phil Cayton, Jason Gunthorpe
What tree is this against? I can't apply it against the usual
candidates, even accounting for the time lag in getting to it.
Can you provide a git tree?
* Re: [PATCH v3 00/10] Add dmabuf read/write via io_uring
2026-05-12 7:00 ` Christoph Hellwig
@ 2026-05-12 9:30 ` Pavel Begunkov
0 siblings, 0 replies; 16+ messages in thread
From: Pavel Begunkov @ 2026-05-12 9:30 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Alexander Viro,
Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig,
Nitesh Shetty, Kanchan Joshi, Anuj Gupta, Tushar Gohad,
William Power, Phil Cayton, Jason Gunthorpe
On 5/12/26 08:00, Christoph Hellwig wrote:
> What tree is this against? I can't apply it against the usual
> candidates, even accounting for the time lag in getting to it.
It should've been Jens' for-next.
> Can you provide a git tree?
git: https://github.com/isilence/linux.git rw-dmabuf-v4
url: https://github.com/isilence/linux/tree/rw-dmabuf-v4
It's a wip branch, for now it's just v3 + 2 fixes.
--
Pavel Begunkov
* Re: [PATCH v3 00/10] Add dmabuf read/write via io_uring
2026-05-07 9:50 ` Ming Lei
@ 2026-05-12 9:30 ` Pavel Begunkov
0 siblings, 0 replies; 16+ messages in thread
From: Pavel Begunkov @ 2026-05-12 9:30 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig,
Nitesh Shetty, Kanchan Joshi, Anuj Gupta, Tushar Gohad,
William Power, Phil Cayton, Jason Gunthorpe
On 5/7/26 10:50, Ming Lei wrote:
...
>>> BTW, inspired by this approach, I added a similar feature to ublk via
>>> UBLK_IO_F_SHMEM_ZC, which can maintain a long-term vfio dma mapping over a
>>> registered, user-space aligned buffer.
>>
>> Interesting, I just took a glance, and it looks like what David Wei
>> was thinking of adding to fuse, but IIUC he gave up exactly because the
>> client would need to cooperate and that could be troublesome.
>
> Here the cooperation is minimized, maybe one shmem/hugetlb path, or memfd;
> it is an opt-in optimization, with a fallback to the normal path
> if the application doesn't cooperate.
My point is that with a widely enough adopted interface the user will be
able to use it opportunistically without knowing anything about the file,
i.e. not knowing whether it's ublk or something else. But as you mentioned
below, it'd be a cooperative interface in either case.
>> Should we try to push everything under the same interface instead of
>> keeping a ublk specific one? Again to the point that it requires
>
> If a generic interface can be figured out, it shouldn't be a big deal for
> ublk to switch to it, and the usage is actually simple.
Sure, you'd just need to maintain both as there is a mismatch between
interfaces.
> So far, ublk supports both FS and nvme block devices.
>
> And the cooperation can't be avoided for this usage no matter whether a
> generic or driver-specific implementation is taken, for both fuse & ublk.
--
Pavel Begunkov
* Re: [PATCH v3 01/10] file: add callback for creating long-term dmabuf maps
[not found] ` <6cce2f4d-7400-4618-82ce-cbd5004c92a4@gmail.com>
@ 2026-05-13 8:11 ` Christoph Hellwig
0 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2026-05-13 8:11 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Christian König, Jens Axboe, Keith Busch, Christoph Hellwig,
Sagi Grimberg, Alexander Viro, Christian Brauner, Andrew Morton,
Sumit Semwal, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig,
Nitesh Shetty, Kanchan Joshi, Anuj Gupta, Tushar Gohad,
William Power, Phil Cayton, Jason Gunthorpe
On Thu, Apr 30, 2026 at 07:33:39PM +0100, Pavel Begunkov wrote:
>> Then the patch should probably define the full interface and not just add the callback here and then the structure in a follow-up patch.
>
> I strongly prefer splitting patches so that they touch one tree at
> a time whenever possible. tbh, I don't see much of a problem with it being
> undefined as it's just forwarded in the first patches, but I can shuffle
> it around in the series so that definitions come first.
File operations without users are pointless. This really should go
with "lib: add dmabuf token infrastructure" as it is the only way for
the reviewer to make any sense of it. I'll move my discussion of the
interface there for that reason.
* Re: [PATCH v3 02/10] iov_iter: add iterator type for dmabuf maps
[not found] ` <20a233d2f35274817aa643cc0fe113707eb47e72.1777475843.git.asml.silence@gmail.com>
@ 2026-05-13 8:11 ` Christoph Hellwig
2026-05-13 10:05 ` David Laight
1 sibling, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2026-05-13 8:11 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig,
Nitesh Shetty, Kanchan Joshi, Anuj Gupta, Tushar Gohad,
William Power, Phil Cayton, Jason Gunthorpe
On Wed, Apr 29, 2026 at 04:25:48PM +0100, Pavel Begunkov wrote:
> Introduce a new iterator type for dmabuf maps. The map is an opaque
> object with internals and a format specific to the subsystem / driver, and
> only that subsystem / driver can use it for issuing IO. The task of the
> middle layers is to pass the map / iterator further down, maybe doing
> basic splitting and length checking. The iterator can only be used by
> operations of the file the associated map was created for.
Nice, this ended up way simpler than what I would have imagined.
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH v3 03/10] block: move bvec init into __bio_clone
[not found] ` <43a91f54d61d3329316e40c69ace781b4d35fe0b.1777475843.git.asml.silence@gmail.com>
@ 2026-05-13 8:12 ` Christoph Hellwig
0 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2026-05-13 8:12 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig,
Nitesh Shetty, Kanchan Joshi, Anuj Gupta, Tushar Gohad,
William Power, Phil Cayton, Jason Gunthorpe
On Wed, Apr 29, 2026 at 04:25:49PM +0100, Pavel Begunkov wrote:
> To quote Cristoph: "Historically __bio_clone itself does not clone the
It's Christoph.
> payload, just the bio. But we got rid of the callers that want to clone
> a bio but not the payload long time ago". So let's move ->bi_io_vec
> assignment into __bio_clone(), so we have a single point where it's set.
but this he-said, she-said is totally irrelevant.
Please state here why this is useful/important, which matters for anyone
trying to understand the code in the future.
* Re: [PATCH v3 04/10] block: introduce dma map backed bio type
[not found] ` <646ecd6fde8d9e146cb051efb514deb27ce3883e.1777475843.git.asml.silence@gmail.com>
@ 2026-05-13 8:19 ` Christoph Hellwig
2026-05-13 8:39 ` Christoph Hellwig
1 sibling, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2026-05-13 8:19 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig,
Nitesh Shetty, Kanchan Joshi, Anuj Gupta, Tushar Gohad,
William Power, Phil Cayton, Jason Gunthorpe
> + if (!bio_flagged(bio_src, BIO_DMABUF_MAP)) {
> + bio->bi_io_vec = bio_src->bi_io_vec;
> + } else {
> + bio->dmabuf_map = bio_src->dmabuf_map;
> + bio_set_flag(bio, BIO_DMABUF_MAP);
> + }
This is backwards, please avoid pointless negations:
if (bio_flagged(bio_src, BIO_DMABUF_MAP)) {
bio->dmabuf_map = bio_src->dmabuf_map;
bio_set_flag(bio, BIO_DMABUF_MAP);
} else {
bio->bi_io_vec = bio_src->bi_io_vec;
}
> + if (bio_flagged(bio, BIO_DMABUF_MAP)) {
> + nsegs = 1;
> +
> + if ((bio->bi_iter.bi_bvec_done & lim->dma_alignment) ||
> + (bio->bi_iter.bi_size & len_align_mask))
> + return -EINVAL;
> + if (bio->bi_iter.bi_size > max_bytes) {
> + bytes = max_bytes;
> + goto split;
> + }
Please add a comment explaining why nsegs is always 1 here.
> @@ -424,7 +424,8 @@ static inline struct bio *__bio_split_to_limits(struct bio *bio,
> switch (bio_op(bio)) {
> case REQ_OP_READ:
> case REQ_OP_WRITE:
> - if (bio_may_need_split(bio, lim))
> + if (bio_may_need_split(bio, lim) ||
> + bio_flagged(bio, BIO_DMABUF_MAP))
> return bio_split_rw(bio, lim, nr_segs);
The BIO_DMABUF_MAP check should go into bio_may_need_split.
> +static inline void bio_advance_iter_dmabuf_map(struct bvec_iter *iter,
> + unsigned int bytes)
> +{
> + iter->bi_bvec_done += bytes;
> + iter->bi_size -= bytes;
> +}
> +
> static inline void bio_advance_iter(const struct bio *bio,
> struct bvec_iter *iter, unsigned int bytes)
> {
> iter->bi_sector += bytes >> 9;
>
> - if (bio_no_advance_iter(bio))
> + if (bio_no_advance_iter(bio)) {
> iter->bi_size -= bytes;
> - else
> + } else if (bio_flagged(bio, BIO_DMABUF_MAP)) {
> + bio_advance_iter_dmabuf_map(iter, bytes);
This is a bit of a mess. You're using bi_bvec_done for something that
is not bvec_done, which makes the naming very confusing. That is even
more confusing than the existing usage, which isn't great. Also we
add yet another conditional to heavily inlined code. I'd suggest
the following:
- add a prep patch to rename bi_bvec_done to bi_offset, as even for
the existing usage it is the offset into the current bio_vec as
much as it is the count of bytes done, as those must be the same
and it is used both ways
- add a prep patch to also increase bi_offset for bio_no_advance_iter.
It is not actually used there, but incrementing it is harmless and
this will avoid a new special case
- please also document this new usage in the comment in struct bvec_iter.
- then just add the dma buf mapping to the bio_no_advance_iter condition
- figure out what to do about dm_bio_rewind_iter, which pokes into these
things that really should be block layer internal
> }
> @@ -391,7 +403,7 @@ static inline void bio_wouldblock_error(struct bio *bio)
> */
> static inline int bio_iov_vecs_to_alloc(struct iov_iter *iter, int max_segs)
> {
> - if (iov_iter_is_bvec(iter))
> + if (iov_iter_is_bvec(iter) || iov_iter_is_dmabuf_map(iter))
> return 0;
> return iov_iter_npages(iter, max_segs);
> }
Please update the comment for this helper.
> @@ -322,6 +327,7 @@ enum {
> BIO_REMAPPED,
> BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
> BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */
> + BIO_DMABUF_MAP, /* Using premapped dma buffers */
Shouldn't this be a REQ_ flag as we should never mix and match bios with
and without this flag in a single request?
* Re: [PATCH v3 05/10] lib: add dmabuf token infrastructure
[not found] ` <c61e6d928f86f4cb253ae350272e6039faefd3a6.1777475843.git.asml.silence@gmail.com>
@ 2026-05-13 8:24 ` Christoph Hellwig
0 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2026-05-13 8:24 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig,
Nitesh Shetty, Kanchan Joshi, Anuj Gupta, Tushar Gohad,
William Power, Phil Cayton, Jason Gunthorpe
Naming and placement:
This is about dma-buf based I/O. So I'd expect it to be named dma-buf-io
and not io-dmabuf, and to live in drivers/dma-buf and not the unrelated lib/.
But I'd like to hear from the dma-buf maintainers about that.
Config option: as this is built unconditionally when DMA_SHARED_BUFFER is
enabled, why does it need a separate config option?
Interface: io_dmabuf_token_create / ->create_dmabuf_token filling
in a structure allocated by the caller feels odd. My gut feeling
would be to move most of io_dmabuf_token_create into a helper called
by ->create_dmabuf_token so that the token is allocated in the
driver data structure and returned from create_dmabuf_token.
* Re: [PATCH v3 06/10] block: forward create_dmabuf_token to drivers
[not found] ` <559756c5e22dcfa183080a979de039910d1b896d.1777475843.git.asml.silence@gmail.com>
@ 2026-05-13 8:25 ` Christoph Hellwig
0 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2026-05-13 8:25 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig,
Nitesh Shetty, Kanchan Joshi, Anuj Gupta, Tushar Gohad,
William Power, Phil Cayton, Jason Gunthorpe
On Wed, Apr 29, 2026 at 04:25:52PM +0100, Pavel Begunkov wrote:
> Add a trivial implementation of the create_dmabuf_token call for
> block devices that forwards the call to a new blk-mq callback if it's
> available.
This should go into block_device_operations as there is nothing blk-mq
specific about this. I.e. even if this patchset doesn't handle stacking
drivers yet, it should be easy enough to add them in the future.
* Re: [PATCH v3 07/10] nvme-pci: implement dma_token backed requests
[not found] ` <5cecb1157ab784f9f303a91449fdf11b03aa6002.1777475843.git.asml.silence@gmail.com>
@ 2026-05-13 8:38 ` Christoph Hellwig
0 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2026-05-13 8:38 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig,
Nitesh Shetty, Kanchan Joshi, Anuj Gupta, Tushar Gohad,
William Power, Phil Cayton, Jason Gunthorpe
FYI, I really want SGL support before this gets merged, but ignoring that
for now:
> +struct nvme_dmabuf_map {
> + struct io_dmabuf_map base;
> + dma_addr_t *dma_list;
> + struct sg_table *sgt;
> + unsigned nr_entries;
I'd make dma_list a variable-sized array at the end of the structure to avoid
an extra allocation and pointer dereference.
>
> +static void nvme_dmabuf_map_sync(struct nvme_dev *nvme_dev, struct request *req,
> + bool for_cpu)
> +{
> + int length = blk_rq_payload_bytes(req);
> + struct device *dev = nvme_dev->dev;
> + enum dma_data_direction dma_dir;
> + struct bio *bio = req->bio;
> + struct nvme_dmabuf_map *map;
> + dma_addr_t *dma_list;
> + int offset, map_idx;
> +
> + dma_dir = rq_data_dir(req) == READ ? DMA_FROM_DEVICE : DMA_TO_DEVICE;
> + map = container_of(bio->dmabuf_map, struct nvme_dmabuf_map, base);
> + dma_list = map->dma_list;
> +
> + offset = bio->bi_iter.bi_bvec_done;
> + map_idx = offset / NVME_CTRL_PAGE_SIZE;
> + length += offset & (NVME_CTRL_PAGE_SIZE - 1);
Please initialize the variables at declaration time and use or add proper
helpers to simplify this:
static inline struct nvme_dmabuf_map *
to_nvme_dmabuf_map(struct io_dmabuf_map *map)
{
return container_of(map, struct nvme_dmabuf_map, base);
}
....
enum dma_data_direction dma_dir = rq_dma_dir(req);
struct device *dev = nvme_dev->dev;
struct bio *bio = req->bio;
struct nvme_dmabuf_map *map = to_nvme_dmabuf_map(bio->bi_dmabuf_map);
dma_addr_t *dma_list = map->dma_list;
int offset = bio->bi_iter.bi_bvec_done;
int map_idx = offset / NVME_CTRL_PAGE_SIZE;
int length = blk_rq_payload_bytes(req) +
	(offset & (NVME_CTRL_PAGE_SIZE - 1));
Also a lot of these ints sound like they should be unsigned.
> +
> + while (length > 0) {
> + u64 dma_addr = dma_list[map_idx++];
> +
> + if (for_cpu)
> + __dma_sync_single_for_cpu(dev, dma_addr,
> + NVME_CTRL_PAGE_SIZE, dma_dir);
> + else
> + __dma_sync_single_for_device(dev, dma_addr,
> + NVME_CTRL_PAGE_SIZE,
> + dma_dir);
> + length -= NVME_CTRL_PAGE_SIZE;
> + }
> +}
Nothing should be using these __dma_sync helpers that are internal
details. Using them means you call into sync code that should be skipped
on most common server class systems.
Also the for_cpu argument is a bit ugly. I'd rather have separate
routines as in the core dma-mapping code, even if that means a little bit
of code duplication.
> +static blk_status_t nvme_rq_setup_dmabuf_map(struct request *req,
> + struct nvme_queue *nvmeq)
> +{
> + struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> + int length = blk_rq_payload_bytes(req);
> + u64 dma_addr, prp1_dma, prp2_dma;
> + struct bio *bio = req->bio;
> + struct nvme_dmabuf_map *map;
> + dma_addr_t *dma_list;
> + dma_addr_t prp_dma;
> + __le64 *prp_list;
> + int i, map_idx;
> + int offset;
> +
> + nvme_dmabuf_map_sync(nvmeq->dev, req, false);
> +
> + map = container_of(bio->dmabuf_map, struct nvme_dmabuf_map, base);
> + dma_list = map->dma_list;
> +
> + offset = bio->bi_iter.bi_bvec_done;
> + map_idx = offset / NVME_CTRL_PAGE_SIZE;
> + offset &= (NVME_CTRL_PAGE_SIZE - 1);
> + prp1_dma = dma_list[map_idx++] + offset;
Same comments as for the sync helper above.
> + length -= (NVME_CTRL_PAGE_SIZE - offset);
> + if (length <= 0) {
> + prp2_dma = 0;
> + goto done;
> + }
> +
> + if (length <= NVME_CTRL_PAGE_SIZE) {
> + prp2_dma = dma_list[map_idx];
> + goto done;
> + }
> +
> + if (DIV_ROUND_UP(length, NVME_CTRL_PAGE_SIZE) <=
> + NVME_SMALL_POOL_SIZE / sizeof(__le64))
> + iod->flags |= IOD_SMALL_DESCRIPTOR;
> +
> + prp_list = dma_pool_alloc(nvme_dma_pool(nvmeq, iod), GFP_ATOMIC,
> + &prp_dma);
> + if (!prp_list)
> + return BLK_STS_RESOURCE;
> +
> + iod->descriptors[iod->nr_descriptors++] = prp_list;
> + prp2_dma = prp_dma;
And I really hate how this duplicates all the nasty PRP building logic,
although right now I don't have a good answer to that.
> +static inline bool nvme_rq_is_dmabuf_attached(struct request *req)
> +{
> + if (!IS_ENABLED(CONFIG_DMABUF_TOKEN))
> + return false;
> + return req->bio && bio_flagged(req->bio, BIO_DMABUF_MAP);
> +}
This is something that should go into the block layer.
* Re: [PATCH v3 04/10] block: introduce dma map backed bio type
[not found] ` <646ecd6fde8d9e146cb051efb514deb27ce3883e.1777475843.git.asml.silence@gmail.com>
2026-05-13 8:19 ` [PATCH v3 04/10] block: introduce dma map backed bio type Christoph Hellwig
@ 2026-05-13 8:39 ` Christoph Hellwig
1 sibling, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2026-05-13 8:39 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig,
Nitesh Shetty, Kanchan Joshi, Anuj Gupta, Tushar Gohad,
William Power, Phil Cayton, Jason Gunthorpe
> + union {
> + struct bio_vec *bi_io_vec;
> + /* Driver specific dma map, present only with BIO_DMABUF_MAP */
> + struct io_dmabuf_map *dmabuf_map;
> + };
... and please add the bi_ prefix we're using for all (well except for
one oddity) fields in struct bio.
* Re: [PATCH v3 02/10] iov_iter: add iterator type for dmabuf maps
[not found] ` <20a233d2f35274817aa643cc0fe113707eb47e72.1777475843.git.asml.silence@gmail.com>
2026-05-13 8:11 ` [PATCH v3 02/10] iov_iter: add iterator type for " Christoph Hellwig
@ 2026-05-13 10:05 ` David Laight
2026-05-13 13:29 ` David Laight
1 sibling, 1 reply; 16+ messages in thread
From: David Laight @ 2026-05-13 10:05 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig,
Nitesh Shetty, Kanchan Joshi, Anuj Gupta, Tushar Gohad,
William Power, Phil Cayton, Jason Gunthorpe
On Wed, 29 Apr 2026 16:25:48 +0100
Pavel Begunkov <asml.silence@gmail.com> wrote:
> Introduce a new iterator type for dmabuf maps. The map is an opaque
> object with internals and a format specific to the subsystem / driver, and
> only that subsystem / driver can use it for issuing IO. The task of the
> middle layers is to pass the map / iterator further down, maybe doing
> basic splitting and length checking. The iterator can only be used by
> operations of the file the associated map was created for.
>
> Suggested-by: Keith Busch <kbusch@kernel.org>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
> include/linux/uio.h | 11 +++++++++++
> lib/iov_iter.c | 29 +++++++++++++++++++++++------
> 2 files changed, 34 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index a9bc5b3067e3..75051aed70de 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -12,6 +12,7 @@
>
> struct page;
> struct folio_queue;
> +struct io_dmabuf_map;
>
> typedef unsigned int __bitwise iov_iter_extraction_t;
>
> @@ -29,6 +30,7 @@ enum iter_type {
> ITER_FOLIOQ,
> ITER_XARRAY,
> ITER_DISCARD,
> + ITER_DMABUF_MAP,
> };
>
> #define ITER_SOURCE 1 // == WRITE
> @@ -71,6 +73,7 @@ struct iov_iter {
> const struct folio_queue *folioq;
> struct xarray *xarray;
> void __user *ubuf;
> + struct io_dmabuf_map *dmabuf_map;
> };
> size_t count;
> };
> @@ -155,6 +158,11 @@ static inline bool iov_iter_is_xarray(const struct iov_iter *i)
> return iov_iter_type(i) == ITER_XARRAY;
> }
>
> +static inline bool iov_iter_is_dmabuf_map(const struct iov_iter *i)
> +{
> + return iov_iter_type(i) == ITER_DMABUF_MAP;
> +}
> +
> static inline unsigned char iov_iter_rw(const struct iov_iter *i)
> {
> return i->data_source ? WRITE : READ;
> @@ -300,6 +308,9 @@ void iov_iter_folio_queue(struct iov_iter *i, unsigned int direction,
> unsigned int first_slot, unsigned int offset, size_t count);
> void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
> loff_t start, size_t count);
> +void iov_iter_dmabuf_map(struct iov_iter *i, unsigned int direction,
> + struct io_dmabuf_map *map,
> + loff_t off, size_t count);
> ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
> size_t maxsize, unsigned maxpages, size_t *start);
> ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 243662af1af7..e2253684b991 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -575,7 +575,8 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
> {
> if (unlikely(i->count < size))
> size = i->count;
> - if (likely(iter_is_ubuf(i)) || unlikely(iov_iter_is_xarray(i))) {
> + if (likely(iter_is_ubuf(i)) || unlikely(iov_iter_is_xarray(i)) ||
> + unlikely(iov_iter_is_dmabuf_map(i))) {
Doesn't the extra check add more code to all the non-ubuf cases?
This could be fixed by either making iter_type a bitmask (with one bit set)
or writing an iter_is_one_of(i, ITER_xxx, ITER_yyy) define that uses
'(1 << i->iter_type) & ((1 << ITER_xxx) | ...)'
(look at the nolibc printf code for an example).
> i->iov_offset += size;
> i->count -= size;
> } else if (likely(iter_is_iovec(i) || iov_iter_is_kvec(i))) {
> @@ -631,7 +632,8 @@ void iov_iter_revert(struct iov_iter *i, size_t unroll)
> return;
> }
> unroll -= i->iov_offset;
> - if (iov_iter_is_xarray(i) || iter_is_ubuf(i)) {
> + if (iov_iter_is_xarray(i) || iter_is_ubuf(i) ||
iter_is_ubuf() should have been first here.
-- David
> + iov_iter_is_dmabuf_map(i)) {
> BUG(); /* We should never go beyond the start of the specified
> * range since we might then be straying into pages that
> * aren't pinned.
> @@ -775,6 +777,20 @@ void iov_iter_xarray(struct iov_iter *i, unsigned int direction,
> }
> EXPORT_SYMBOL(iov_iter_xarray);
>
> +void iov_iter_dmabuf_map(struct iov_iter *i, unsigned int direction,
> + struct io_dmabuf_map *map,
> + loff_t off, size_t count)
> +{
> + WARN_ON(direction & ~(READ | WRITE));
> + *i = (struct iov_iter){
> + .iter_type = ITER_DMABUF_MAP,
> + .data_source = direction,
> + .dmabuf_map = map,
> + .count = count,
> + .iov_offset = off,
> + };
> +}
> +
> /**
> * iov_iter_discard - Initialise an I/O iterator that discards data
> * @i: The iterator to initialise.
> @@ -841,7 +857,7 @@ static unsigned long iov_iter_alignment_bvec(const struct iov_iter *i)
>
> unsigned long iov_iter_alignment(const struct iov_iter *i)
> {
> - if (likely(iter_is_ubuf(i))) {
> + if (likely(iter_is_ubuf(i)) || iov_iter_is_dmabuf_map(i)) {
> size_t size = i->count;
> if (size)
> return ((unsigned long)i->ubuf + i->iov_offset) | size;
> @@ -872,7 +888,7 @@ unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
> size_t size = i->count;
> unsigned k;
>
> - if (iter_is_ubuf(i))
> + if (iter_is_ubuf(i) || iov_iter_is_dmabuf_map(i))
> return 0;
>
> if (WARN_ON(!iter_is_iovec(i)))
> @@ -1469,11 +1485,12 @@ EXPORT_SYMBOL_GPL(import_ubuf);
> void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
> {
> if (WARN_ON_ONCE(!iov_iter_is_bvec(i) && !iter_is_iovec(i) &&
> - !iter_is_ubuf(i)) && !iov_iter_is_kvec(i))
> + !iter_is_ubuf(i) && !iov_iter_is_kvec(i) &&
> + !iov_iter_is_dmabuf_map(i)))
> return;
> i->iov_offset = state->iov_offset;
> i->count = state->count;
> - if (iter_is_ubuf(i))
> + if (iter_is_ubuf(i) || iov_iter_is_dmabuf_map(i))
> return;
> /*
> * For the *vec iters, nr_segs + iov is constant - if we increment
* Re: [PATCH v3 02/10] iov_iter: add iterator type for dmabuf maps
2026-05-13 10:05 ` David Laight
@ 2026-05-13 13:29 ` David Laight
0 siblings, 0 replies; 16+ messages in thread
From: David Laight @ 2026-05-13 13:29 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig,
Nitesh Shetty, Kanchan Joshi, Anuj Gupta, Tushar Gohad,
William Power, Phil Cayton, Jason Gunthorpe
On Wed, 13 May 2026 11:05:57 +0100
David Laight <david.laight.linux@gmail.com> wrote:
...
> > @@ -575,7 +575,8 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
> > {
> > if (unlikely(i->count < size))
> > size = i->count;
> > - if (likely(iter_is_ubuf(i)) || unlikely(iov_iter_is_xarray(i))) {
> > + if (likely(iter_is_ubuf(i)) || unlikely(iov_iter_is_xarray(i)) ||
> > + unlikely(iov_iter_is_dmabuf_map(i))) {
>
>
> Doesn't the extra check add more code to all the non-ubuf cases?
> This could be fixed by either making iter_type a bitmask (with one bit set)
> or writing an iter_is_one_of(i, ITER_xxx, ITER_yyy) define that uses
> '(1 << i->iter_type) & ((1 << ITER_xxx) | ...)'
This seems to DTRT:
#define _ITER_IS_ONE_OF(iter, t1, t2, t3, t4, t5, t6, t7, t8, ...) \
((1u << (iter)->iter_type) & ((1u << ITER_##t1) | (1u << ITER_##t2) | \
(1u << ITER_##t3) | (1u << ITER_##t4) | (1u << ITER_##t5) | \
(1u << ITER_##t6) | (1u << ITER_##t7) | (1u << ITER_##t8)))
#define ITER_IS_ONE_OF(iter, t, ...) \
_ITER_IS_ONE_OF(iter, t, ## __VA_ARGS__, t, t, t, t, t, t, t)
int foo(void *);
int f(struct iov_iter *i)
{
return ITER_IS_ONE_OF(i, UBUF, KVEC) ? foo(i) : 0;
}
See https://godbolt.org/z/sMz93zah1
Pasting ITER_ on the front ensures the values are constants of the right type.
OTOH it makes it harder to search for uses of each type.
You could paste _ITER_ on the front, elsewhere define _ITER_ITER_UVEC
to be ITER_UVEC (etc), and require the caller use the full name.
-- David
end of thread, other threads:[~2026-05-13 13:29 UTC | newest]
Thread overview: 16+ messages
[not found] <cover.1777475843.git.asml.silence@gmail.com>
2026-05-04 15:29 ` [PATCH v3 00/10] Add dmabuf read/write via io_uring Ming Lei
2026-05-06 9:02 ` Pavel Begunkov
2026-05-07 9:50 ` Ming Lei
2026-05-12 9:30 ` Pavel Begunkov
2026-05-12 7:00 ` Christoph Hellwig
2026-05-12 9:30 ` Pavel Begunkov
[not found] ` <ae941457cf6cacb9d4c16b6ec904da9ef7fed97f.1777475843.git.asml.silence@gmail.com>
[not found] ` <f0dd8f89-835e-4331-b593-4405ec59f4fe@amd.com>
[not found] ` <6cce2f4d-7400-4618-82ce-cbd5004c92a4@gmail.com>
2026-05-13 8:11 ` [PATCH v3 01/10] file: add callback for creating long-term dmabuf maps Christoph Hellwig
[not found] ` <20a233d2f35274817aa643cc0fe113707eb47e72.1777475843.git.asml.silence@gmail.com>
2026-05-13 8:11 ` [PATCH v3 02/10] iov_iter: add iterator type for " Christoph Hellwig
2026-05-13 10:05 ` David Laight
2026-05-13 13:29 ` David Laight
[not found] ` <43a91f54d61d3329316e40c69ace781b4d35fe0b.1777475843.git.asml.silence@gmail.com>
2026-05-13 8:12 ` [PATCH v3 03/10] block: move bvec init into __bio_clone Christoph Hellwig
[not found] ` <646ecd6fde8d9e146cb051efb514deb27ce3883e.1777475843.git.asml.silence@gmail.com>
2026-05-13 8:19 ` [PATCH v3 04/10] block: introduce dma map backed bio type Christoph Hellwig
2026-05-13 8:39 ` Christoph Hellwig
[not found] ` <c61e6d928f86f4cb253ae350272e6039faefd3a6.1777475843.git.asml.silence@gmail.com>
2026-05-13 8:24 ` [PATCH v3 05/10] lib: add dmabuf token infrastructure Christoph Hellwig
[not found] ` <559756c5e22dcfa183080a979de039910d1b896d.1777475843.git.asml.silence@gmail.com>
2026-05-13 8:25 ` [PATCH v3 06/10] block: forward create_dmabuf_token to drivers Christoph Hellwig
[not found] ` <5cecb1157ab784f9f303a91449fdf11b03aa6002.1777475843.git.asml.silence@gmail.com>
2026-05-13 8:38 ` [PATCH v3 07/10] nvme-pci: implement dma_token backed requests Christoph Hellwig