* [RFC v2 00/11] Add dmabuf read/write via io_uring
@ 2025-11-23 22:51 Pavel Begunkov
  2025-11-23 22:51 ` [RFC v2 01/11] file: add callback for pre-mapping dmabuf Pavel Begunkov
                   ` (12 more replies)
  0 siblings, 13 replies; 35+ messages in thread
From: Pavel Begunkov @ 2025-11-23 22:51 UTC (permalink / raw)
  To: linux-block, io-uring
  Cc: Vishal Verma, tushar.gohad, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, Pavel Begunkov, linux-kernel, linux-nvme,
	linux-fsdevel, linux-media, dri-devel, linaro-mm-sig

Picking up the work on supporting dmabuf in the read/write path. There
are two main changes. First, it doesn't pass a dma address directly but
rather wraps it into an opaque structure, which is extended and
understood by the target driver.

The second big change is support for dynamic attachments, which added a
good deal of complexity (see Patch 5). I kept the main machinery in nvme
at first, but move_notify can ask to kill the dma mapping asynchronously,
and any new IO would then need to wait during submission, so it was moved
to blk-mq. That also introduced an extra callback layer between the
driver and blk-mq.

There are some rough corners, and I'm not perfectly happy with the
complexity and layering. For v3 I'll try to move the waiting up the
stack into io_uring, wrapped into library helpers.

For now, I'm interested in what the best way to test move_notify is,
and in how dma_resv_reserve_fences() errors should be handled in
move_notify.

The uapi didn't change: after registration the buffer looks like a
normal io_uring registered buffer and can be used as such. Only
non-vectored fixed reads/writes are allowed. Pseudo code:

// registration
reg_buf_idx = 0;
io_uring_update_buffer(ring, reg_buf_idx, { dma_buf_fd, file_fd });

// request creation
io_uring_prep_read_fixed(sqe, file_fd, buffer_offset,
                         buffer_size, file_offset, reg_buf_idx);
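
Roughly, the registration helper above boils down to the following
against the raw uapi added in Patches 9-10 (a sketch only; error
handling is omitted and the iovec semantics for the dmabuf case are
defined in Patch 10, liburing wraps the syscall):

struct io_uring_reg_buffer rb = {
        .iov_uaddr = (__u64)(uintptr_t)&iov,   /* iovec for the range, see Patch 10 */
        .target_fd = file_fd,
        .dmabuf_fd = dma_buf_fd,
};
struct io_uring_rsrc_update2 up = {
        .offset = reg_buf_idx,
        .flags  = IORING_RSRC_F_EXTENDED_UPDATE,
        .data   = (__u64)(uintptr_t)&rb,
        .nr     = 1,
};
syscall(__NR_io_uring_register, ring_fd, IORING_REGISTER_BUFFERS_UPDATE,
        &up, sizeof(up));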

As previously, a good chunk of the code was taken from Keith's series [1].

liburing based example:

git: https://github.com/isilence/liburing.git dmabuf-rw
link: https://github.com/isilence/liburing/tree/dmabuf-rw

[1] https://lore.kernel.org/io-uring/20220805162444.3985535-1-kbusch@fb.com/

Pavel Begunkov (11):
  file: add callback for pre-mapping dmabuf
  iov_iter: introduce iter type for pre-registered dma
  block: move around bio flagging helpers
  block: introduce dma token backed bio type
  block: add infra to handle dmabuf tokens
  nvme-pci: add support for dmabuf registration
  nvme-pci: implement dma_token backed requests
  io_uring/rsrc: add imu flags
  io_uring/rsrc: extended reg buffer registration
  io_uring/rsrc: add dmabuf-backed buffer registration
  io_uring/rsrc: implement dmabuf regbuf import

 block/Makefile                   |   1 +
 block/bdev.c                     |  14 ++
 block/bio.c                      |  21 +++
 block/blk-merge.c                |  23 +++
 block/blk-mq-dma-token.c         | 236 +++++++++++++++++++++++++++++++
 block/blk-mq.c                   |  20 +++
 block/blk.h                      |   3 +-
 block/fops.c                     |   3 +
 drivers/nvme/host/pci.c          | 217 ++++++++++++++++++++++++++++
 include/linux/bio.h              |  49 ++++---
 include/linux/blk-mq-dma-token.h |  60 ++++++++
 include/linux/blk-mq.h           |  21 +++
 include/linux/blk_types.h        |   8 +-
 include/linux/blkdev.h           |   3 +
 include/linux/dma_token.h        |  35 +++++
 include/linux/fs.h               |   4 +
 include/linux/uio.h              |  10 ++
 include/uapi/linux/io_uring.h    |  13 +-
 io_uring/rsrc.c                  | 201 +++++++++++++++++++++++---
 io_uring/rsrc.h                  |  23 ++-
 io_uring/rw.c                    |   7 +-
 lib/iov_iter.c                   |  30 +++-
 22 files changed, 948 insertions(+), 54 deletions(-)
 create mode 100644 block/blk-mq-dma-token.c
 create mode 100644 include/linux/blk-mq-dma-token.h
 create mode 100644 include/linux/dma_token.h

-- 
2.52.0


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC v2 01/11] file: add callback for pre-mapping dmabuf
  2025-11-23 22:51 [RFC v2 00/11] Add dmabuf read/write via io_uring Pavel Begunkov
@ 2025-11-23 22:51 ` Pavel Begunkov
  2025-12-04 10:42   ` Christoph Hellwig
  2025-12-04 10:46   ` Christian König
  2025-11-23 22:51 ` [RFC v2 02/11] iov_iter: introduce iter type for pre-registered dma Pavel Begunkov
                   ` (11 subsequent siblings)
  12 siblings, 2 replies; 35+ messages in thread
From: Pavel Begunkov @ 2025-11-23 22:51 UTC (permalink / raw)
  To: linux-block, io-uring
  Cc: Vishal Verma, tushar.gohad, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, Pavel Begunkov, linux-kernel, linux-nvme,
	linux-fsdevel, linux-media, dri-devel, linaro-mm-sig

Add a file callback that maps a dmabuf for the given file and returns
an opaque token of type struct dma_token representing the mapping. The
implementation details are hidden from the caller, and the implementors
are normally expected to extend the structure.

Callers of the callback will be able to pass the token with an IO
request, which is implemented in the following patches as a new iterator
type. The user should release the token once it's no longer needed by
calling the provided release callback via the appropriate helpers.
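
For illustration, a minimal usage sketch from a caller's point of view
(the dmabuf import and file lookup are assumed to have happened
elsewhere):

struct dma_token_params params = {
        .dmabuf = dmabuf,
        .dir    = DMA_BIDIRECTIONAL,
};
struct dma_token *token;

token = dma_token_create(file, &params);
if (IS_ERR(token))
        return PTR_ERR(token);

/* pass the token along with IO requests, see the following patches */

dma_token_release(token);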

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/linux/dma_token.h | 35 +++++++++++++++++++++++++++++++++++
 include/linux/fs.h        |  4 ++++
 2 files changed, 39 insertions(+)
 create mode 100644 include/linux/dma_token.h

diff --git a/include/linux/dma_token.h b/include/linux/dma_token.h
new file mode 100644
index 000000000000..9194b34282c2
--- /dev/null
+++ b/include/linux/dma_token.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_DMA_TOKEN_H
+#define _LINUX_DMA_TOKEN_H
+
+#include <linux/dma-buf.h>
+
+struct dma_token_params {
+	struct dma_buf			*dmabuf;
+	enum dma_data_direction		dir;
+};
+
+struct dma_token {
+	void (*release)(struct dma_token *);
+};
+
+static inline void dma_token_release(struct dma_token *token)
+{
+	token->release(token);
+}
+
+static inline struct dma_token *
+dma_token_create(struct file *file, struct dma_token_params *params)
+{
+	struct dma_token *res;
+
+	if (!file->f_op->dma_map)
+		return ERR_PTR(-EOPNOTSUPP);
+	res = file->f_op->dma_map(file, params);
+
+	WARN_ON_ONCE(!IS_ERR(res) && !res->release);
+
+	return res;
+}
+
+#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c895146c1444..0ce9a53fabec 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2262,6 +2262,8 @@ struct dir_context {
 struct iov_iter;
 struct io_uring_cmd;
 struct offset_ctx;
+struct dma_token;
+struct dma_token_params;
 
 typedef unsigned int __bitwise fop_flags_t;
 
@@ -2309,6 +2311,8 @@ struct file_operations {
 	int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
 				unsigned int poll_flags);
 	int (*mmap_prepare)(struct vm_area_desc *);
+	struct dma_token *(*dma_map)(struct file *,
+				     struct dma_token_params *);
 } __randomize_layout;
 
 /* Supports async buffered reads */
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC v2 02/11] iov_iter: introduce iter type for pre-registered dma
  2025-11-23 22:51 [RFC v2 00/11] Add dmabuf read/write via io_uring Pavel Begunkov
  2025-11-23 22:51 ` [RFC v2 01/11] file: add callback for pre-mapping dmabuf Pavel Begunkov
@ 2025-11-23 22:51 ` Pavel Begunkov
  2025-12-04 10:43   ` Christoph Hellwig
  2025-11-23 22:51 ` [RFC v2 03/11] block: move around bio flagging helpers Pavel Begunkov
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: Pavel Begunkov @ 2025-11-23 22:51 UTC (permalink / raw)
  To: linux-block, io-uring
  Cc: Vishal Verma, tushar.gohad, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, Pavel Begunkov, linux-kernel, linux-nvme,
	linux-fsdevel, linux-media, dri-devel, linaro-mm-sig

Introduce a new iterator type backed by a pre-mapped dmabuf represented
by struct dma_token. The token is specific to the file for which it was
created, and the user must avoid passing the token and the iterator to
any other file. This limitation will be softened in the future.
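
A short usage sketch, assuming the token was created via
dma_token_create() against the same file the iterator is later
submitted to:

struct iov_iter iter;

/* a read of `len` bytes at offset `off` within the pre-mapped buffer */
iov_iter_dma_token(&iter, ITER_DEST, token, off, len);
/* for writes out of the buffer, pass ITER_SOURCE instead */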

Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/linux/uio.h | 10 ++++++++++
 lib/iov_iter.c      | 30 ++++++++++++++++++++++++------
 2 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 5b127043a151..1b22594ca35b 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -29,6 +29,7 @@ enum iter_type {
 	ITER_FOLIOQ,
 	ITER_XARRAY,
 	ITER_DISCARD,
+	ITER_DMA_TOKEN,
 };
 
 #define ITER_SOURCE	1	// == WRITE
@@ -71,6 +72,7 @@ struct iov_iter {
 				const struct folio_queue *folioq;
 				struct xarray *xarray;
 				void __user *ubuf;
+				struct dma_token *dma_token;
 			};
 			size_t count;
 		};
@@ -155,6 +157,11 @@ static inline bool iov_iter_is_xarray(const struct iov_iter *i)
 	return iov_iter_type(i) == ITER_XARRAY;
 }
 
+static inline bool iov_iter_is_dma_token(const struct iov_iter *i)
+{
+	return iov_iter_type(i) == ITER_DMA_TOKEN;
+}
+
 static inline unsigned char iov_iter_rw(const struct iov_iter *i)
 {
 	return i->data_source ? WRITE : READ;
@@ -300,6 +307,9 @@ void iov_iter_folio_queue(struct iov_iter *i, unsigned int direction,
 			  unsigned int first_slot, unsigned int offset, size_t count);
 void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
 		     loff_t start, size_t count);
+void iov_iter_dma_token(struct iov_iter *i, unsigned int direction,
+			struct dma_token *token,
+			loff_t off, size_t count);
 ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
 			size_t maxsize, unsigned maxpages, size_t *start);
 ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 2fe66a6b8789..26fa8f8f13c0 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -563,7 +563,8 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
 {
 	if (unlikely(i->count < size))
 		size = i->count;
-	if (likely(iter_is_ubuf(i)) || unlikely(iov_iter_is_xarray(i))) {
+	if (likely(iter_is_ubuf(i)) || unlikely(iov_iter_is_xarray(i)) ||
+	    unlikely(iov_iter_is_dma_token(i))) {
 		i->iov_offset += size;
 		i->count -= size;
 	} else if (likely(iter_is_iovec(i) || iov_iter_is_kvec(i))) {
@@ -619,7 +620,8 @@ void iov_iter_revert(struct iov_iter *i, size_t unroll)
 		return;
 	}
 	unroll -= i->iov_offset;
-	if (iov_iter_is_xarray(i) || iter_is_ubuf(i)) {
+	if (iov_iter_is_xarray(i) || iter_is_ubuf(i) ||
+	    iov_iter_is_dma_token(i)) {
 		BUG(); /* We should never go beyond the start of the specified
 			* range since we might then be straying into pages that
 			* aren't pinned.
@@ -763,6 +765,21 @@ void iov_iter_xarray(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_xarray);
 
+void iov_iter_dma_token(struct iov_iter *i, unsigned int direction,
+			struct dma_token *token,
+			loff_t off, size_t count)
+{
+	WARN_ON(direction & ~(READ | WRITE));
+	*i = (struct iov_iter){
+		.iter_type = ITER_DMA_TOKEN,
+		.data_source = direction,
+		.dma_token = token,
+		.iov_offset = 0,
+		.count = count,
+		.iov_offset = off,
+	};
+}
+
 /**
  * iov_iter_discard - Initialise an I/O iterator that discards data
  * @i: The iterator to initialise.
@@ -829,7 +846,7 @@ static unsigned long iov_iter_alignment_bvec(const struct iov_iter *i)
 
 unsigned long iov_iter_alignment(const struct iov_iter *i)
 {
-	if (likely(iter_is_ubuf(i))) {
+	if (likely(iter_is_ubuf(i)) || iov_iter_is_dma_token(i)) {
 		size_t size = i->count;
 		if (size)
 			return ((unsigned long)i->ubuf + i->iov_offset) | size;
@@ -860,7 +877,7 @@ unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
 	size_t size = i->count;
 	unsigned k;
 
-	if (iter_is_ubuf(i))
+	if (iter_is_ubuf(i) || iov_iter_is_dma_token(i))
 		return 0;
 
 	if (WARN_ON(!iter_is_iovec(i)))
@@ -1457,11 +1474,12 @@ EXPORT_SYMBOL_GPL(import_ubuf);
 void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
 {
 	if (WARN_ON_ONCE(!iov_iter_is_bvec(i) && !iter_is_iovec(i) &&
-			 !iter_is_ubuf(i)) && !iov_iter_is_kvec(i))
+			 !iter_is_ubuf(i) && !iov_iter_is_kvec(i) &&
+			 !iov_iter_is_dma_token(i)))
 		return;
 	i->iov_offset = state->iov_offset;
 	i->count = state->count;
-	if (iter_is_ubuf(i))
+	if (iter_is_ubuf(i) || iov_iter_is_dma_token(i))
 		return;
 	/*
 	 * For the *vec iters, nr_segs + iov is constant - if we increment
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC v2 03/11] block: move around bio flagging helpers
  2025-11-23 22:51 [RFC v2 00/11] Add dmabuf read/write via io_uring Pavel Begunkov
  2025-11-23 22:51 ` [RFC v2 01/11] file: add callback for pre-mapping dmabuf Pavel Begunkov
  2025-11-23 22:51 ` [RFC v2 02/11] iov_iter: introduce iter type for pre-registered dma Pavel Begunkov
@ 2025-11-23 22:51 ` Pavel Begunkov
  2025-12-04 10:43   ` Christoph Hellwig
  2025-11-23 22:51 ` [RFC v2 04/11] block: introduce dma token backed bio type Pavel Begunkov
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: Pavel Begunkov @ 2025-11-23 22:51 UTC (permalink / raw)
  To: linux-block, io-uring
  Cc: Vishal Verma, tushar.gohad, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, Pavel Begunkov, linux-kernel, linux-nvme,
	linux-fsdevel, linux-media, dri-devel, linaro-mm-sig

We'll need bio_flagged() earlier in bio.h in the next patch. Move it
together with all related helpers, and mark bio_flagged()'s bio argument
as const.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/linux/bio.h | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index ad2d57908c1c..c75a9b3672aa 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -46,6 +46,21 @@ static inline unsigned int bio_max_segs(unsigned int nr_segs)
 #define bio_data_dir(bio) \
 	(op_is_write(bio_op(bio)) ? WRITE : READ)
 
+static inline bool bio_flagged(const struct bio *bio, unsigned int bit)
+{
+	return bio->bi_flags & (1U << bit);
+}
+
+static inline void bio_set_flag(struct bio *bio, unsigned int bit)
+{
+	bio->bi_flags |= (1U << bit);
+}
+
+static inline void bio_clear_flag(struct bio *bio, unsigned int bit)
+{
+	bio->bi_flags &= ~(1U << bit);
+}
+
 /*
  * Check whether this bio carries any data or not. A NULL bio is allowed.
  */
@@ -225,21 +240,6 @@ static inline void bio_cnt_set(struct bio *bio, unsigned int count)
 	atomic_set(&bio->__bi_cnt, count);
 }
 
-static inline bool bio_flagged(struct bio *bio, unsigned int bit)
-{
-	return bio->bi_flags & (1U << bit);
-}
-
-static inline void bio_set_flag(struct bio *bio, unsigned int bit)
-{
-	bio->bi_flags |= (1U << bit);
-}
-
-static inline void bio_clear_flag(struct bio *bio, unsigned int bit)
-{
-	bio->bi_flags &= ~(1U << bit);
-}
-
 static inline struct bio_vec *bio_first_bvec_all(struct bio *bio)
 {
 	WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC v2 04/11] block: introduce dma token backed bio type
  2025-11-23 22:51 [RFC v2 00/11] Add dmabuf read/write via io_uring Pavel Begunkov
                   ` (2 preceding siblings ...)
  2025-11-23 22:51 ` [RFC v2 03/11] block: move around bio flagging helpers Pavel Begunkov
@ 2025-11-23 22:51 ` Pavel Begunkov
  2025-12-04 10:48   ` Christoph Hellwig
  2025-11-23 22:51 ` [RFC v2 05/11] block: add infra to handle dmabuf tokens Pavel Begunkov
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: Pavel Begunkov @ 2025-11-23 22:51 UTC (permalink / raw)
  To: linux-block, io-uring
  Cc: Vishal Verma, tushar.gohad, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, Pavel Begunkov, linux-kernel, linux-nvme,
	linux-fsdevel, linux-media, dri-devel, linaro-mm-sig

Premapped buffers don't require a generic bio_vec since they have
already been dma mapped. Repurpose the bi_io_vec space for the dma
token as the two are mutually exclusive, and provide the setup to
support dma tokens.

In order to use this, a driver must implement the dma_map blk-mq op,
in which case it must be aware that any given bio may be carrying a
dma_token instead of a bio_vec.
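
For example, a driver-side helper distinguishing the two cases could
look like this (purely illustrative, not part of the patch):

static struct dma_token *bio_dma_token_or_null(struct bio *bio)
{
        /* bi_io_vec and dma_token share storage, the flag disambiguates */
        if (bio_flagged(bio, BIO_DMA_TOKEN))
                return bio->dma_token;
        /* regular bio, bi_io_vec is valid */
        return NULL;
}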

Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 block/bio.c               | 21 +++++++++++++++++++++
 block/blk-merge.c         | 23 +++++++++++++++++++++++
 block/blk.h               |  3 ++-
 block/fops.c              |  2 ++
 include/linux/bio.h       | 19 ++++++++++++++++---
 include/linux/blk_types.h |  8 +++++++-
 6 files changed, 71 insertions(+), 5 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 7b13bdf72de0..8793f1ee559d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -843,6 +843,11 @@ static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
 		bio_clone_blkg_association(bio, bio_src);
 	}
 
+	if (bio_flagged(bio_src, BIO_DMA_TOKEN)) {
+		bio->dma_token = bio_src->dma_token;
+		bio_set_flag(bio, BIO_DMA_TOKEN);
+	}
+
 	if (bio_crypt_clone(bio, bio_src, gfp) < 0)
 		return -ENOMEM;
 	if (bio_integrity(bio_src) &&
@@ -1167,6 +1172,18 @@ void bio_iov_bvec_set(struct bio *bio, const struct iov_iter *iter)
 	bio_set_flag(bio, BIO_CLONED);
 }
 
+void bio_iov_dma_token_set(struct bio *bio, struct iov_iter *iter)
+{
+	WARN_ON_ONCE(bio->bi_max_vecs);
+
+	bio->dma_token = iter->dma_token;
+	bio->bi_vcnt = 0;
+	bio->bi_iter.bi_bvec_done = iter->iov_offset;
+	bio->bi_iter.bi_size = iov_iter_count(iter);
+	bio->bi_opf |= REQ_NOMERGE;
+	bio_set_flag(bio, BIO_DMA_TOKEN);
+}
+
 static unsigned int get_contig_folio_len(unsigned int *num_pages,
 					 struct page **pages, unsigned int i,
 					 struct folio *folio, size_t left,
@@ -1349,6 +1366,10 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
 		bio_iov_bvec_set(bio, iter);
 		iov_iter_advance(iter, bio->bi_iter.bi_size);
 		return 0;
+	} else if (iov_iter_is_dma_token(iter)) {
+		bio_iov_dma_token_set(bio, iter);
+		iov_iter_advance(iter, bio->bi_iter.bi_size);
+		return 0;
 	}
 
 	if (iov_iter_extract_will_pin(iter))
diff --git a/block/blk-merge.c b/block/blk-merge.c
index d3115d7469df..c02a5f9c99e6 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -328,6 +328,29 @@ int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
 	unsigned nsegs = 0, bytes = 0, gaps = 0;
 	struct bvec_iter iter;
 
+	if (bio_flagged(bio, BIO_DMA_TOKEN)) {
+		int offset = offset_in_page(bio->bi_iter.bi_bvec_done);
+
+		nsegs = ALIGN(bio->bi_iter.bi_size + offset, PAGE_SIZE);
+		nsegs >>= PAGE_SHIFT;
+
+		if (offset & lim->dma_alignment || bytes & len_align_mask)
+			return -EINVAL;
+
+		if (bio->bi_iter.bi_size > max_bytes) {
+			bytes = max_bytes;
+			nsegs = (bytes + offset) >> PAGE_SHIFT;
+			goto split;
+		} else if (nsegs > lim->max_segments) {
+			nsegs = lim->max_segments;
+			bytes = PAGE_SIZE * nsegs - offset;
+			goto split;
+		}
+
+		*segs = nsegs;
+		return 0;
+	}
+
 	bio_for_each_bvec(bv, bio, iter) {
 		if (bv.bv_offset & lim->dma_alignment ||
 		    bv.bv_len & len_align_mask)
diff --git a/block/blk.h b/block/blk.h
index e4c433f62dfc..2c72f2630faf 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -398,7 +398,8 @@ static inline struct bio *__bio_split_to_limits(struct bio *bio,
 	switch (bio_op(bio)) {
 	case REQ_OP_READ:
 	case REQ_OP_WRITE:
-		if (bio_may_need_split(bio, lim))
+		if (bio_may_need_split(bio, lim) ||
+		    bio_flagged(bio, BIO_DMA_TOKEN))
 			return bio_split_rw(bio, lim, nr_segs);
 		*nr_segs = 1;
 		return bio;
diff --git a/block/fops.c b/block/fops.c
index 5e3db9fead77..41f8795874a9 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -354,6 +354,8 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 		 * bio_iov_iter_get_pages() and set the bvec directly.
 		 */
 		bio_iov_bvec_set(bio, iter);
+	} else if (iov_iter_is_dma_token(iter)) {
+		bio_iov_dma_token_set(bio, iter);
 	} else {
 		ret = blkdev_iov_iter_get_pages(bio, iter, bdev);
 		if (unlikely(ret))
diff --git a/include/linux/bio.h b/include/linux/bio.h
index c75a9b3672aa..f83342640e71 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -108,16 +108,26 @@ static inline bool bio_next_segment(const struct bio *bio,
 #define bio_for_each_segment_all(bvl, bio, iter) \
 	for (bvl = bvec_init_iter_all(&iter); bio_next_segment((bio), &iter); )
 
+static inline void bio_advance_iter_dma_token(struct bvec_iter *iter,
+						unsigned int bytes)
+{
+	iter->bi_bvec_done += bytes;
+	iter->bi_size -= bytes;
+}
+
 static inline void bio_advance_iter(const struct bio *bio,
 				    struct bvec_iter *iter, unsigned int bytes)
 {
 	iter->bi_sector += bytes >> 9;
 
-	if (bio_no_advance_iter(bio))
+	if (bio_no_advance_iter(bio)) {
 		iter->bi_size -= bytes;
-	else
+	} else if (bio_flagged(bio, BIO_DMA_TOKEN)) {
+		bio_advance_iter_dma_token(iter, bytes);
+	} else {
 		bvec_iter_advance(bio->bi_io_vec, iter, bytes);
 		/* TODO: It is reasonable to complete bio with error here. */
+	}
 }
 
 /* @bytes should be less or equal to bvec[i->bi_idx].bv_len */
@@ -129,6 +139,8 @@ static inline void bio_advance_iter_single(const struct bio *bio,
 
 	if (bio_no_advance_iter(bio))
 		iter->bi_size -= bytes;
+	else if (bio_flagged(bio, BIO_DMA_TOKEN))
+		bio_advance_iter_dma_token(iter, bytes);
 	else
 		bvec_iter_advance_single(bio->bi_io_vec, iter, bytes);
 }
@@ -398,7 +410,7 @@ static inline void bio_wouldblock_error(struct bio *bio)
  */
 static inline int bio_iov_vecs_to_alloc(struct iov_iter *iter, int max_segs)
 {
-	if (iov_iter_is_bvec(iter))
+	if (iov_iter_is_bvec(iter) || iov_iter_is_dma_token(iter))
 		return 0;
 	return iov_iter_npages(iter, max_segs);
 }
@@ -452,6 +464,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
 		unsigned len_align_mask);
 
 void bio_iov_bvec_set(struct bio *bio, const struct iov_iter *iter);
+void bio_iov_dma_token_set(struct bio *bio, struct iov_iter *iter);
 void __bio_release_pages(struct bio *bio, bool mark_dirty);
 extern void bio_set_pages_dirty(struct bio *bio);
 extern void bio_check_pages_dirty(struct bio *bio);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index cbbcb9051ec3..3bc7f89d4e66 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -275,7 +275,12 @@ struct bio {
 
 	atomic_t		__bi_cnt;	/* pin count */
 
-	struct bio_vec		*bi_io_vec;	/* the actual vec list */
+	union {
+		struct bio_vec		*bi_io_vec;	/* the actual vec list */
+		/* Driver specific dma map, present only with BIO_DMA_TOKEN */
+		struct dma_token	*dma_token;
+	};
+
 
 	struct bio_set		*bi_pool;
 };
@@ -315,6 +320,7 @@ enum {
 	BIO_REMAPPED,
 	BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
 	BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */
+	BIO_DMA_TOKEN, /* Using premapped dma buffers */
 	BIO_FLAG_LAST
 };
 
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC v2 05/11] block: add infra to handle dmabuf tokens
  2025-11-23 22:51 [RFC v2 00/11] Add dmabuf read/write via io_uring Pavel Begunkov
                   ` (3 preceding siblings ...)
  2025-11-23 22:51 ` [RFC v2 04/11] block: introduce dma token backed bio type Pavel Begunkov
@ 2025-11-23 22:51 ` Pavel Begunkov
  2025-11-24 13:38   ` Anuj gupta
                     ` (2 more replies)
  2025-11-23 22:51 ` [RFC v2 06/11] nvme-pci: add support for dmabuf registration Pavel Begunkov
                   ` (7 subsequent siblings)
  12 siblings, 3 replies; 35+ messages in thread
From: Pavel Begunkov @ 2025-11-23 22:51 UTC (permalink / raw)
  To: linux-block, io-uring
  Cc: Vishal Verma, tushar.gohad, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, Pavel Begunkov, linux-kernel, linux-nvme,
	linux-fsdevel, linux-media, dri-devel, linaro-mm-sig

Add blk-mq infrastructure to handle dmabuf tokens. There are two main
objects. The first is struct blk_mq_dma_token, which is an extension of
struct dma_token and is passed in an iterator. The second is struct
blk_mq_dma_map, which holds the actual mapping and, unlike the token,
can be ejected (e.g. by move_notify) and recreated.

The token keeps an rcu protected pointer to the mapping, so when blk-mq
resolves a token into a mapping to pass to a request, it does an rcu
protected lookup and takes a percpu reference on the mapping.

If there is no current mapping attached to a token, it needs to be
created by calling into the driver (e.g. nvme) via a new callback. That
requires waiting, therefore it can't be done for nowait requests and
couldn't happen deeper in the stack, e.g. during nvme request submission.

The structure split is needed because move_notify can request to
invalidate the dma mapping at any moment, and we need a way to
concurrently remove it and wait for the inflight requests using the
previous mapping to complete.
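
Condensed, the per-request resolution done by blk_rq_assign_dma_map()
looks roughly as follows (error handling trimmed):

map = blk_mq_get_token_map(token);      /* rcu lookup + percpu_ref_tryget */
if (!map) {
        if (rq->cmd_flags & REQ_NOWAIT)
                return BLK_STS_AGAIN;
        /* slow path: may wait on the resv and call ->dma_map() */
        map = blk_mq_create_dma_map(token);
}
rq->dma_map = map;      /* reference dropped in __blk_mq_free_request() */
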

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 block/Makefile                   |   1 +
 block/bdev.c                     |  14 ++
 block/blk-mq-dma-token.c         | 236 +++++++++++++++++++++++++++++++
 block/blk-mq.c                   |  20 +++
 block/fops.c                     |   1 +
 include/linux/blk-mq-dma-token.h |  60 ++++++++
 include/linux/blk-mq.h           |  21 +++
 include/linux/blkdev.h           |   3 +
 8 files changed, 356 insertions(+)
 create mode 100644 block/blk-mq-dma-token.c
 create mode 100644 include/linux/blk-mq-dma-token.h

diff --git a/block/Makefile b/block/Makefile
index c65f4da93702..0190e5aa9f00 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_BLK_INLINE_ENCRYPTION)	+= blk-crypto.o blk-crypto-profile.o \
 					   blk-crypto-sysfs.o
 obj-$(CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK)	+= blk-crypto-fallback.o
 obj-$(CONFIG_BLOCK_HOLDER_DEPRECATED)	+= holder.o
+obj-$(CONFIG_DMA_SHARED_BUFFER) += blk-mq-dma-token.o
diff --git a/block/bdev.c b/block/bdev.c
index 810707cca970..da89d20f33f3 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -28,6 +28,7 @@
 #include <linux/part_stat.h>
 #include <linux/uaccess.h>
 #include <linux/stat.h>
+#include <linux/blk-mq-dma-token.h>
 #include "../fs/internal.h"
 #include "blk.h"
 
@@ -61,6 +62,19 @@ struct block_device *file_bdev(struct file *bdev_file)
 }
 EXPORT_SYMBOL(file_bdev);
 
+struct dma_token *blkdev_dma_map(struct file *file,
+				 struct dma_token_params *params)
+{
+	struct request_queue *q = bdev_get_queue(file_bdev(file));
+
+	if (!(file->f_flags & O_DIRECT))
+		return ERR_PTR(-EINVAL);
+	if (!q->mq_ops)
+		return ERR_PTR(-EINVAL);
+
+	return blk_mq_dma_map(q, params);
+}
+
 static void bdev_write_inode(struct block_device *bdev)
 {
 	struct inode *inode = BD_INODE(bdev);
diff --git a/block/blk-mq-dma-token.c b/block/blk-mq-dma-token.c
new file mode 100644
index 000000000000..cd62c4d09422
--- /dev/null
+++ b/block/blk-mq-dma-token.c
@@ -0,0 +1,236 @@
+#include <linux/blk-mq-dma-token.h>
+#include <linux/dma-resv.h>
+
+struct blk_mq_dma_fence {
+	struct dma_fence base;
+	spinlock_t lock;
+};
+
+static const char *blk_mq_fence_drv_name(struct dma_fence *fence)
+{
+	return "blk-mq";
+}
+
+const struct dma_fence_ops blk_mq_dma_fence_ops = {
+	.get_driver_name = blk_mq_fence_drv_name,
+	.get_timeline_name = blk_mq_fence_drv_name,
+};
+
+static void blk_mq_dma_token_free(struct blk_mq_dma_token *token)
+{
+	token->q->mq_ops->clean_dma_token(token->q, token);
+	dma_buf_put(token->dmabuf);
+	kfree(token);
+}
+
+static inline void blk_mq_dma_token_put(struct blk_mq_dma_token *token)
+{
+	if (refcount_dec_and_test(&token->refs))
+		blk_mq_dma_token_free(token);
+}
+
+static void blk_mq_dma_mapping_free(struct blk_mq_dma_map *map)
+{
+	struct blk_mq_dma_token *token = map->token;
+
+	if (map->sgt)
+		token->q->mq_ops->dma_unmap(token->q, map);
+
+	dma_fence_put(&map->fence->base);
+	percpu_ref_exit(&map->refs);
+	kfree(map);
+	blk_mq_dma_token_put(token);
+}
+
+static void blk_mq_dma_map_work_free(struct work_struct *work)
+{
+	struct blk_mq_dma_map *map = container_of(work, struct blk_mq_dma_map,
+						free_work);
+
+	dma_fence_signal(&map->fence->base);
+	blk_mq_dma_mapping_free(map);
+}
+
+static void blk_mq_dma_map_refs_free(struct percpu_ref *ref)
+{
+	struct blk_mq_dma_map *map = container_of(ref, struct blk_mq_dma_map, refs);
+
+	INIT_WORK(&map->free_work, blk_mq_dma_map_work_free);
+	queue_work(system_wq, &map->free_work);
+}
+
+static struct blk_mq_dma_map *blk_mq_alloc_dma_mapping(struct blk_mq_dma_token *token)
+{
+	struct blk_mq_dma_fence *fence = NULL;
+	struct blk_mq_dma_map *map;
+	int ret = -ENOMEM;
+
+	map = kzalloc(sizeof(*map), GFP_KERNEL);
+	if (!map)
+		return ERR_PTR(-ENOMEM);
+
+	fence = kzalloc(sizeof(*fence), GFP_KERNEL);
+	if (!fence)
+		goto err;
+
+	ret = percpu_ref_init(&map->refs, blk_mq_dma_map_refs_free, 0,
+			      GFP_KERNEL);
+	if (ret)
+		goto err;
+
+	dma_fence_init(&fence->base, &blk_mq_dma_fence_ops, &fence->lock,
+			token->fence_ctx, atomic_inc_return(&token->fence_seq));
+	spin_lock_init(&fence->lock);
+	map->fence = fence;
+	map->token = token;
+	refcount_inc(&token->refs);
+	return map;
+err:
+	kfree(map);
+	kfree(fence);
+	return ERR_PTR(ret);
+}
+
+static inline
+struct blk_mq_dma_map *blk_mq_get_token_map(struct blk_mq_dma_token *token)
+{
+	struct blk_mq_dma_map *map;
+
+	guard(rcu)();
+
+	map = rcu_dereference(token->map);
+	if (unlikely(!map || !percpu_ref_tryget_live_rcu(&map->refs)))
+		return NULL;
+	return map;
+}
+
+static struct blk_mq_dma_map *
+blk_mq_create_dma_map(struct blk_mq_dma_token *token)
+{
+	struct dma_buf *dmabuf = token->dmabuf;
+	struct blk_mq_dma_map *map;
+	long ret;
+
+	guard(mutex)(&token->mapping_lock);
+
+	map = blk_mq_get_token_map(token);
+	if (map)
+		return map;
+
+	map = blk_mq_alloc_dma_mapping(token);
+	if (IS_ERR(map))
+		return map;
+
+	dma_resv_lock(dmabuf->resv, NULL);
+	ret = dma_resv_wait_timeout(dmabuf->resv, DMA_RESV_USAGE_BOOKKEEP,
+				    true, MAX_SCHEDULE_TIMEOUT);
+	ret = ret ? ret : -ETIME;
+	if (ret > 0)
+		ret = token->q->mq_ops->dma_map(token->q, map);
+	dma_resv_unlock(dmabuf->resv);
+
+	if (ret)
+		return ERR_PTR(ret);
+
+	percpu_ref_get(&map->refs);
+	rcu_assign_pointer(token->map, map);
+	return map;
+}
+
+static void blk_mq_dma_map_remove(struct blk_mq_dma_token *token)
+{
+	struct dma_buf *dmabuf = token->dmabuf;
+	struct blk_mq_dma_map *map;
+	int ret;
+
+	dma_resv_assert_held(dmabuf->resv);
+
+	ret = dma_resv_reserve_fences(dmabuf->resv, 1);
+	if (WARN_ON_ONCE(ret))
+		return;
+
+	map = rcu_dereference_protected(token->map,
+					dma_resv_held(dmabuf->resv));
+	if (!map)
+		return;
+	rcu_assign_pointer(token->map, NULL);
+
+	dma_resv_add_fence(dmabuf->resv, &map->fence->base,
+			   DMA_RESV_USAGE_KERNEL);
+	percpu_ref_kill(&map->refs);
+}
+
+blk_status_t blk_rq_assign_dma_map(struct request *rq,
+				   struct blk_mq_dma_token *token)
+{
+	struct blk_mq_dma_map *map;
+
+	map = blk_mq_get_token_map(token);
+	if (map)
+		goto complete;
+
+	if (rq->cmd_flags & REQ_NOWAIT)
+		return BLK_STS_AGAIN;
+
+	map = blk_mq_create_dma_map(token);
+	if (IS_ERR(map))
+		return BLK_STS_RESOURCE;
+complete:
+	rq->dma_map = map;
+	return BLK_STS_OK;
+}
+
+void blk_mq_dma_map_move_notify(struct blk_mq_dma_token *token)
+{
+	blk_mq_dma_map_remove(token);
+}
+
+static void blk_mq_release_dma_mapping(struct dma_token *base_token)
+{
+	struct blk_mq_dma_token *token = dma_token_to_blk_mq(base_token);
+	struct dma_buf *dmabuf = token->dmabuf;
+
+	dma_resv_lock(dmabuf->resv, NULL);
+	blk_mq_dma_map_remove(token);
+	dma_resv_unlock(dmabuf->resv);
+
+	blk_mq_dma_token_put(token);
+}
+
+struct dma_token *blk_mq_dma_map(struct request_queue *q,
+				  struct dma_token_params *params)
+{
+	struct dma_buf *dmabuf = params->dmabuf;
+	struct blk_mq_dma_token *token;
+	int ret;
+
+	if (!q->mq_ops->dma_map || !q->mq_ops->dma_unmap ||
+	    !q->mq_ops->init_dma_token || !q->mq_ops->clean_dma_token)
+		return ERR_PTR(-EINVAL);
+
+	token = kzalloc(sizeof(*token), GFP_KERNEL);
+	if (!token)
+		return ERR_PTR(-ENOMEM);
+
+	get_dma_buf(dmabuf);
+	token->fence_ctx = dma_fence_context_alloc(1);
+	token->dmabuf = dmabuf;
+	token->dir = params->dir;
+	token->base.release = blk_mq_release_dma_mapping;
+	token->q = q;
+	refcount_set(&token->refs, 1);
+	mutex_init(&token->mapping_lock);
+
+	if (!blk_get_queue(q)) {
+		kfree(token);
+		return ERR_PTR(-EFAULT);
+	}
+
+	ret = token->q->mq_ops->init_dma_token(token->q, token);
+	if (ret) {
+		kfree(token);
+		blk_put_queue(q);
+		return ERR_PTR(ret);
+	}
+	return &token->base;
+}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f2650c97a75e..1ff3a7e3191b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -29,6 +29,7 @@
 #include <linux/blk-crypto.h>
 #include <linux/part_stat.h>
 #include <linux/sched/isolation.h>
+#include <linux/blk-mq-dma-token.h>
 
 #include <trace/events/block.h>
 
@@ -439,6 +440,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	rq->nr_integrity_segments = 0;
 	rq->end_io = NULL;
 	rq->end_io_data = NULL;
+	rq->dma_map = NULL;
 
 	blk_crypto_rq_set_defaults(rq);
 	INIT_LIST_HEAD(&rq->queuelist);
@@ -794,6 +796,7 @@ static void __blk_mq_free_request(struct request *rq)
 	blk_pm_mark_last_busy(rq);
 	rq->mq_hctx = NULL;
 
+	blk_rq_drop_dma_map(rq);
 	if (rq->tag != BLK_MQ_NO_TAG) {
 		blk_mq_dec_active_requests(hctx);
 		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
@@ -3214,6 +3217,23 @@ void blk_mq_submit_bio(struct bio *bio)
 
 	blk_mq_bio_to_request(rq, bio, nr_segs);
 
+	if (bio_flagged(bio, BIO_DMA_TOKEN)) {
+		struct blk_mq_dma_token *token;
+		blk_status_t ret;
+
+		token = dma_token_to_blk_mq(bio->dma_token);
+		ret = blk_rq_assign_dma_map(rq, token);
+		if (ret) {
+			if (ret == BLK_STS_AGAIN) {
+				bio_wouldblock_error(bio);
+			} else {
+				bio->bi_status = BLK_STS_RESOURCE;
+				bio_endio(bio);
+			}
+			goto queue_exit;
+		}
+	}
+
 	ret = blk_crypto_rq_get_keyslot(rq);
 	if (ret != BLK_STS_OK) {
 		bio->bi_status = ret;
diff --git a/block/fops.c b/block/fops.c
index 41f8795874a9..ac52fe1a4b8d 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -973,6 +973,7 @@ const struct file_operations def_blk_fops = {
 	.fallocate	= blkdev_fallocate,
 	.uring_cmd	= blkdev_uring_cmd,
 	.fop_flags	= FOP_BUFFER_RASYNC,
+	.dma_map	= blkdev_dma_map,
 };
 
 static __init int blkdev_init(void)
diff --git a/include/linux/blk-mq-dma-token.h b/include/linux/blk-mq-dma-token.h
new file mode 100644
index 000000000000..4a8d84addc06
--- /dev/null
+++ b/include/linux/blk-mq-dma-token.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef BLK_MQ_DMA_TOKEN_H
+#define BLK_MQ_DMA_TOKEN_H
+
+#include <linux/blk-mq.h>
+#include <linux/dma_token.h>
+#include <linux/percpu-refcount.h>
+
+struct blk_mq_dma_token;
+struct blk_mq_dma_fence;
+
+struct blk_mq_dma_map {
+	void				*private;
+
+	struct percpu_ref		refs;
+	struct sg_table			*sgt;
+	struct blk_mq_dma_token		*token;
+	struct blk_mq_dma_fence		*fence;
+	struct work_struct		free_work;
+};
+
+struct blk_mq_dma_token {
+	struct dma_token		base;
+	enum dma_data_direction		dir;
+
+	void				*private;
+
+	struct dma_buf			*dmabuf;
+	struct blk_mq_dma_map __rcu	*map;
+	struct request_queue		*q;
+
+	struct mutex			mapping_lock;
+	refcount_t			refs;
+
+	atomic_t			fence_seq;
+	u64				fence_ctx;
+};
+
+static inline
+struct blk_mq_dma_token *dma_token_to_blk_mq(struct dma_token *token)
+{
+	return container_of(token, struct blk_mq_dma_token, base);
+}
+
+blk_status_t blk_rq_assign_dma_map(struct request *req,
+				   struct blk_mq_dma_token *token);
+
+static inline void blk_rq_drop_dma_map(struct request *rq)
+{
+	if (rq->dma_map) {
+		percpu_ref_put(&rq->dma_map->refs);
+		rq->dma_map = NULL;
+	}
+}
+
+void blk_mq_dma_map_move_notify(struct blk_mq_dma_token *token);
+struct dma_token *blk_mq_dma_map(struct request_queue *q,
+				 struct dma_token_params *params);
+
+#endif /* BLK_MQ_DMA_TOKEN_H */
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b54506b3b76d..4745d1e183f2 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -94,6 +94,9 @@ enum mq_rq_state {
 	MQ_RQ_COMPLETE		= 2,
 };
 
+struct blk_mq_dma_map;
+struct blk_mq_dma_token;
+
 /*
  * Try to put the fields that are referenced together in the same cacheline.
  *
@@ -170,6 +173,8 @@ struct request {
 
 	unsigned long deadline;
 
+	struct blk_mq_dma_map	*dma_map;
+
 	/*
 	 * The hash is used inside the scheduler, and killed once the
 	 * request reaches the dispatch list. The ipi_list is only used
@@ -675,6 +680,21 @@ struct blk_mq_ops {
 	 */
 	void (*map_queues)(struct blk_mq_tag_set *set);
 
+	/**
+	 * @dma_map: Allows drivers to pre-map a dmabuf. The resulting driver
+	 * specific mapping will be wrapped into dma_token and passed to the
+	 * read / write path in an iterator.
+	 */
+	int (*dma_map)(struct request_queue *q, struct blk_mq_dma_map *);
+	void (*dma_unmap)(struct request_queue *q, struct blk_mq_dma_map *);
+	int (*init_dma_token)(struct request_queue *q,
+			      struct blk_mq_dma_token *token);
+	void (*clean_dma_token)(struct request_queue *q,
+				struct blk_mq_dma_token *token);
+
+	struct dma_buf_attachment *(*dma_attach)(struct request_queue *q,
+					struct dma_token_params *params);
+
 #ifdef CONFIG_BLK_DEBUG_FS
 	/**
 	 * @show_rq: Used by the debugfs implementation to show driver-specific
@@ -946,6 +966,7 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 void blk_mq_tagset_wait_completed_request(struct blk_mq_tag_set *tagset);
 void blk_mq_freeze_queue_nomemsave(struct request_queue *q);
 void blk_mq_unfreeze_queue_nomemrestore(struct request_queue *q);
+
 static inline unsigned int __must_check
 blk_mq_freeze_queue(struct request_queue *q)
 {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index cb4ba09959ee..dec75348f8dc 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1777,6 +1777,9 @@ struct block_device *file_bdev(struct file *bdev_file);
 bool disk_live(struct gendisk *disk);
 unsigned int block_size(struct block_device *bdev);
 
+struct dma_token *blkdev_dma_map(struct file *file,
+				 struct dma_token_params *params);
+
 #ifdef CONFIG_BLOCK
 void invalidate_bdev(struct block_device *bdev);
 int sync_blockdev(struct block_device *bdev);
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC v2 06/11] nvme-pci: add support for dmabuf registration
  2025-11-23 22:51 [RFC v2 00/11] Add dmabuf read/write via io_uring Pavel Begunkov
                   ` (4 preceding siblings ...)
  2025-11-23 22:51 ` [RFC v2 05/11] block: add infra to handle dmabuf tokens Pavel Begunkov
@ 2025-11-23 22:51 ` Pavel Begunkov
  2025-11-24 13:40   ` Anuj gupta
  2025-12-04 11:00   ` Christoph Hellwig
  2025-11-23 22:51 ` [RFC v2 07/11] nvme-pci: implement dma_token backed requests Pavel Begunkov
                   ` (6 subsequent siblings)
  12 siblings, 2 replies; 35+ messages in thread
From: Pavel Begunkov @ 2025-11-23 22:51 UTC (permalink / raw)
  To: linux-block, io-uring
  Cc: Vishal Verma, tushar.gohad, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, Pavel Begunkov, linux-kernel, linux-nvme,
	linux-fsdevel, linux-media, dri-devel, linaro-mm-sig

Implement dma-token related callbacks for nvme block devices.
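
The attachment's sg_table is flattened into an array holding one dma
address per NVME_CTRL_PAGE_SIZE chunk of the dmabuf, so request setup
in the next patch can resolve a byte offset into the buffer with simple
arithmetic, roughly:

/* dma_list[] has one entry per NVME_CTRL_PAGE_SIZE chunk of the dmabuf */
map_idx  = offset / NVME_CTRL_PAGE_SIZE;
page_off = offset & (NVME_CTRL_PAGE_SIZE - 1);
dma_addr = dma_list[map_idx] + page_off;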

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 drivers/nvme/host/pci.c | 95 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 95 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e5ca8301bb8b..63e03c3dc044 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -27,6 +27,7 @@
 #include <linux/io-64-nonatomic-lo-hi.h>
 #include <linux/io-64-nonatomic-hi-lo.h>
 #include <linux/sed-opal.h>
+#include <linux/blk-mq-dma-token.h>
 
 #include "trace.h"
 #include "nvme.h"
@@ -482,6 +483,92 @@ static void nvme_release_descriptor_pools(struct nvme_dev *dev)
 	}
 }
 
+static void nvme_dmabuf_move_notify(struct dma_buf_attachment *attach)
+{
+	blk_mq_dma_map_move_notify(attach->importer_priv);
+}
+
+const struct dma_buf_attach_ops nvme_dmabuf_importer_ops = {
+	.move_notify = nvme_dmabuf_move_notify,
+	.allow_peer2peer = true,
+};
+
+static int nvme_init_dma_token(struct request_queue *q,
+				struct blk_mq_dma_token *token)
+{
+	struct dma_buf_attachment *attach;
+	struct nvme_ns *ns = q->queuedata;
+	struct nvme_dev *dev = to_nvme_dev(ns->ctrl);
+	struct dma_buf *dmabuf = token->dmabuf;
+
+	if (dmabuf->size % NVME_CTRL_PAGE_SIZE)
+		return -EINVAL;
+
+	attach = dma_buf_dynamic_attach(dmabuf, dev->dev,
+					&nvme_dmabuf_importer_ops, token);
+	if (IS_ERR(attach))
+		return PTR_ERR(attach);
+
+	token->private = attach;
+	return 0;
+}
+
+static void nvme_clean_dma_token(struct request_queue *q,
+				 struct blk_mq_dma_token *token)
+{
+	struct dma_buf_attachment *attach = token->private;
+
+	dma_buf_detach(token->dmabuf, attach);
+}
+
+static int nvme_dma_map(struct request_queue *q, struct blk_mq_dma_map *map)
+{
+	struct blk_mq_dma_token *token = map->token;
+	struct dma_buf_attachment *attach = token->private;
+	unsigned nr_entries;
+	unsigned long tmp, i = 0;
+	struct scatterlist *sg;
+	struct sg_table *sgt;
+	dma_addr_t *dma_list;
+
+	nr_entries = token->dmabuf->size / NVME_CTRL_PAGE_SIZE;
+	dma_list = kmalloc_array(nr_entries, sizeof(dma_list[0]), GFP_KERNEL);
+	if (!dma_list)
+		return -ENOMEM;
+
+	sgt = dma_buf_map_attachment(attach, token->dir);
+	if (IS_ERR(sgt)) {
+		kfree(dma_list);
+		return PTR_ERR(sgt);
+	}
+	map->sgt = sgt;
+
+	for_each_sgtable_dma_sg(sgt, sg, tmp) {
+		dma_addr_t dma = sg_dma_address(sg);
+		unsigned long sg_len = sg_dma_len(sg);
+
+		while (sg_len) {
+			dma_list[i++] = dma;
+			dma += NVME_CTRL_PAGE_SIZE;
+			sg_len -= NVME_CTRL_PAGE_SIZE;
+		}
+	}
+
+	map->private = dma_list;
+	return 0;
+}
+
+static void nvme_dma_unmap(struct request_queue *q, struct blk_mq_dma_map *map)
+{
+	struct blk_mq_dma_token *token = map->token;
+	struct dma_buf_attachment *attach = token->private;
+	dma_addr_t *dma_list = map->private;
+
+	dma_buf_unmap_attachment_unlocked(attach, map->sgt, token->dir);
+	map->sgt = NULL;
+	kfree(dma_list);
+}
+
 static int nvme_init_hctx_common(struct blk_mq_hw_ctx *hctx, void *data,
 		unsigned qid)
 {
@@ -1067,6 +1154,9 @@ static blk_status_t nvme_map_data(struct request *req)
 	struct blk_dma_iter iter;
 	blk_status_t ret;
 
+	if (req->bio && bio_flagged(req->bio, BIO_DMA_TOKEN))
+		return BLK_STS_RESOURCE;
+
 	/*
 	 * Try to skip the DMA iterator for single segment requests, as that
 	 * significantly improves performances for small I/O sizes.
@@ -2093,6 +2183,11 @@ static const struct blk_mq_ops nvme_mq_ops = {
 	.map_queues	= nvme_pci_map_queues,
 	.timeout	= nvme_timeout,
 	.poll		= nvme_poll,
+
+	.dma_map	= nvme_dma_map,
+	.dma_unmap 	= nvme_dma_unmap,
+	.init_dma_token =  nvme_init_dma_token,
+	.clean_dma_token = nvme_clean_dma_token,
 };
 
 static void nvme_dev_remove_admin(struct nvme_dev *dev)
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC v2 07/11] nvme-pci: implement dma_token backed requests
  2025-11-23 22:51 [RFC v2 00/11] Add dmabuf read/write via io_uring Pavel Begunkov
                   ` (5 preceding siblings ...)
  2025-11-23 22:51 ` [RFC v2 06/11] nvme-pci: add support for dmabuf registration Pavel Begunkov
@ 2025-11-23 22:51 ` Pavel Begunkov
  2025-12-04 11:04   ` Christoph Hellwig
  2025-11-23 22:51 ` [RFC v2 08/11] io_uring/rsrc: add imu flags Pavel Begunkov
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: Pavel Begunkov @ 2025-11-23 22:51 UTC (permalink / raw)
  To: linux-block, io-uring
  Cc: Vishal Verma, tushar.gohad, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, Pavel Begunkov, linux-kernel, linux-nvme,
	linux-fsdevel, linux-media, dri-devel, linaro-mm-sig

Enable BIO_DMA_TOKEN backed requests. They require special handling to
set up the nvme request from the mapping prepared in advance, tear it
down, and sync the buffers.
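
As a worked example, with the 4KiB NVME_CTRL_PAGE_SIZE: a 12KiB request
starting at byte offset 6144 into the buffer resolves to map_idx 1 with
an in-page offset of 2048, so prp1 becomes dma_list[1] + 2048, and the
remaining 10KiB is described by a small PRP list built from
dma_list[2..4].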

Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 drivers/nvme/host/pci.c | 126 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 124 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 63e03c3dc044..ac377416b088 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -797,6 +797,123 @@ static void nvme_free_descriptors(struct request *req)
 	}
 }
 
+static void nvme_sync_dma(struct nvme_dev *nvme_dev, struct request *req,
+			  enum dma_data_direction dir)
+{
+	struct blk_mq_dma_map *map = req->dma_map;
+	int length = blk_rq_payload_bytes(req);
+	bool for_cpu = dir == DMA_FROM_DEVICE;
+	struct device *dev = nvme_dev->dev;
+	dma_addr_t *dma_list = map->private;
+	struct bio *bio = req->bio;
+	int offset, map_idx;
+
+	offset = bio->bi_iter.bi_bvec_done;
+	map_idx = offset / NVME_CTRL_PAGE_SIZE;
+	length += offset & (NVME_CTRL_PAGE_SIZE - 1);
+
+	while (length > 0) {
+		u64 dma_addr = dma_list[map_idx++];
+
+		if (for_cpu)
+			__dma_sync_single_for_cpu(dev, dma_addr,
+						  NVME_CTRL_PAGE_SIZE, dir);
+		else
+			__dma_sync_single_for_device(dev, dma_addr,
+						     NVME_CTRL_PAGE_SIZE, dir);
+		length -= NVME_CTRL_PAGE_SIZE;
+	}
+}
+
+static void nvme_unmap_premapped_data(struct nvme_dev *dev,
+				      struct request *req)
+{
+	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+
+	if (rq_data_dir(req) == READ)
+		nvme_sync_dma(dev, req, DMA_FROM_DEVICE);
+	if (!(iod->flags & IOD_SINGLE_SEGMENT))
+		nvme_free_descriptors(req);
+}
+
+static blk_status_t nvme_dma_premapped(struct request *req,
+				       struct nvme_queue *nvmeq)
+{
+	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	int length = blk_rq_payload_bytes(req);
+	struct blk_mq_dma_map *map = req->dma_map;
+	u64 dma_addr, prp1_dma, prp2_dma;
+	struct bio *bio = req->bio;
+	dma_addr_t *dma_list;
+	dma_addr_t prp_dma;
+	__le64 *prp_list;
+	int i, map_idx;
+	int offset;
+
+	dma_list = map->private;
+
+	if (rq_data_dir(req) == WRITE)
+		nvme_sync_dma(nvmeq->dev, req, DMA_TO_DEVICE);
+
+	offset = bio->bi_iter.bi_bvec_done;
+	map_idx = offset / NVME_CTRL_PAGE_SIZE;
+	offset &= (NVME_CTRL_PAGE_SIZE - 1);
+
+	prp1_dma = dma_list[map_idx++] + offset;
+
+	length -= (NVME_CTRL_PAGE_SIZE - offset);
+	if (length <= 0) {
+		prp2_dma = 0;
+		goto done;
+	}
+
+	if (length <= NVME_CTRL_PAGE_SIZE) {
+		prp2_dma = dma_list[map_idx];
+		goto done;
+	}
+
+	if (DIV_ROUND_UP(length, NVME_CTRL_PAGE_SIZE) <=
+	    NVME_SMALL_POOL_SIZE / sizeof(__le64))
+		iod->flags |= IOD_SMALL_DESCRIPTOR;
+
+	prp_list = dma_pool_alloc(nvme_dma_pool(nvmeq, iod), GFP_ATOMIC,
+			&prp_dma);
+	if (!prp_list)
+		return BLK_STS_RESOURCE;
+
+	iod->descriptors[iod->nr_descriptors++] = prp_list;
+	prp2_dma = prp_dma;
+	i = 0;
+	for (;;) {
+		if (i == NVME_CTRL_PAGE_SIZE >> 3) {
+			__le64 *old_prp_list = prp_list;
+
+			prp_list = dma_pool_alloc(nvmeq->descriptor_pools.large,
+					GFP_ATOMIC, &prp_dma);
+			if (!prp_list)
+				goto free_prps;
+			iod->descriptors[iod->nr_descriptors++] = prp_list;
+			prp_list[0] = old_prp_list[i - 1];
+			old_prp_list[i - 1] = cpu_to_le64(prp_dma);
+			i = 1;
+		}
+
+		dma_addr = dma_list[map_idx++];
+		prp_list[i++] = cpu_to_le64(dma_addr);
+
+		length -= NVME_CTRL_PAGE_SIZE;
+		if (length <= 0)
+			break;
+	}
+done:
+	iod->cmd.common.dptr.prp1 = cpu_to_le64(prp1_dma);
+	iod->cmd.common.dptr.prp2 = cpu_to_le64(prp2_dma);
+	return BLK_STS_OK;
+free_prps:
+	nvme_free_descriptors(req);
+	return BLK_STS_RESOURCE;
+}
+
 static void nvme_free_prps(struct request *req, unsigned int attrs)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -875,6 +992,11 @@ static void nvme_unmap_data(struct request *req)
 	struct device *dma_dev = nvmeq->dev->dev;
 	unsigned int attrs = 0;
 
+	if (req->bio && bio_flagged(req->bio, BIO_DMA_TOKEN)) {
+		nvme_unmap_premapped_data(nvmeq->dev, req);
+		return;
+	}
+
 	if (iod->flags & IOD_SINGLE_SEGMENT) {
 		static_assert(offsetof(union nvme_data_ptr, prp1) ==
 				offsetof(union nvme_data_ptr, sgl.addr));
@@ -1154,8 +1276,8 @@ static blk_status_t nvme_map_data(struct request *req)
 	struct blk_dma_iter iter;
 	blk_status_t ret;
 
-	if (req->bio && bio_flagged(req->bio, BIO_DMA_TOKEN))
-		return BLK_STS_RESOURCE;
+	if (req->dma_map)
+		return nvme_dma_premapped(req, nvmeq);
 
 	/*
 	 * Try to skip the DMA iterator for single segment requests, as that
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC v2 08/11] io_uring/rsrc: add imu flags
  2025-11-23 22:51 [RFC v2 00/11] Add dmabuf read/write via io_uring Pavel Begunkov
                   ` (6 preceding siblings ...)
  2025-11-23 22:51 ` [RFC v2 07/11] nvme-pci: implement dma_token backed requests Pavel Begunkov
@ 2025-11-23 22:51 ` Pavel Begunkov
  2025-11-23 22:51 ` [RFC v2 09/11] io_uring/rsrc: extended reg buffer registration Pavel Begunkov
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: Pavel Begunkov @ 2025-11-23 22:51 UTC (permalink / raw)
  To: linux-block, io-uring
  Cc: Vishal Verma, tushar.gohad, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, Pavel Begunkov, linux-kernel, linux-nvme,
	linux-fsdevel, linux-media, dri-devel, linaro-mm-sig

Replace is_kbuf with a flags field in io_mapped_ubuf. There will be new
flags shortly, and bit fields are often not as convenient to work with.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/rsrc.c | 12 ++++++------
 io_uring/rsrc.h |  6 +++++-
 io_uring/rw.c   |  3 ++-
 3 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 3765a50329a8..21548942e80d 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -828,7 +828,7 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
 	imu->folio_shift = PAGE_SHIFT;
 	imu->release = io_release_ubuf;
 	imu->priv = imu;
-	imu->is_kbuf = false;
+	imu->flags = 0;
 	imu->dir = IO_IMU_DEST | IO_IMU_SOURCE;
 	if (coalesced)
 		imu->folio_shift = data.folio_shift;
@@ -985,7 +985,7 @@ int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
 	refcount_set(&imu->refs, 1);
 	imu->release = release;
 	imu->priv = rq;
-	imu->is_kbuf = true;
+	imu->flags = IO_IMU_F_KBUF;
 	imu->dir = 1 << rq_data_dir(rq);
 
 	rq_for_each_bvec(bv, rq, rq_iter)
@@ -1020,7 +1020,7 @@ int io_buffer_unregister_bvec(struct io_uring_cmd *cmd, unsigned int index,
 		ret = -EINVAL;
 		goto unlock;
 	}
-	if (!node->buf->is_kbuf) {
+	if (!(node->buf->flags & IO_IMU_F_KBUF)) {
 		ret = -EBUSY;
 		goto unlock;
 	}
@@ -1086,7 +1086,7 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
 
 	offset = buf_addr - imu->ubuf;
 
-	if (imu->is_kbuf)
+	if (imu->flags & IO_IMU_F_KBUF)
 		return io_import_kbuf(ddir, iter, imu, len, offset);
 
 	/*
@@ -1511,7 +1511,7 @@ int io_import_reg_vec(int ddir, struct iov_iter *iter,
 	iovec_off = vec->nr - nr_iovs;
 	iov = vec->iovec + iovec_off;
 
-	if (imu->is_kbuf) {
+	if (imu->flags & IO_IMU_F_KBUF) {
 		int ret = io_kern_bvec_size(iov, nr_iovs, imu, &nr_segs);
 
 		if (unlikely(ret))
@@ -1549,7 +1549,7 @@ int io_import_reg_vec(int ddir, struct iov_iter *iter,
 		req->flags |= REQ_F_NEED_CLEANUP;
 	}
 
-	if (imu->is_kbuf)
+	if (imu->flags & IO_IMU_F_KBUF)
 		return io_vec_fill_kern_bvec(ddir, iter, imu, iov, nr_iovs, vec);
 
 	return io_vec_fill_bvec(ddir, iter, imu, iov, nr_iovs, vec);
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index d603f6a47f5e..7c1128a856ec 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -28,6 +28,10 @@ enum {
 	IO_IMU_SOURCE	= 1 << ITER_SOURCE,
 };
 
+enum {
+	IO_IMU_F_KBUF			= 1,
+};
+
 struct io_mapped_ubuf {
 	u64		ubuf;
 	unsigned int	len;
@@ -37,7 +41,7 @@ struct io_mapped_ubuf {
 	unsigned long	acct_pages;
 	void		(*release)(void *);
 	void		*priv;
-	bool		is_kbuf;
+	u8		flags;
 	u8		dir;
 	struct bio_vec	bvec[] __counted_by(nr_bvecs);
 };
diff --git a/io_uring/rw.c b/io_uring/rw.c
index a7b568c3dfe8..a3eb4e7bf992 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -706,7 +706,8 @@ static ssize_t loop_rw_iter(int ddir, struct io_rw *rw, struct iov_iter *iter)
 	if ((kiocb->ki_flags & IOCB_NOWAIT) &&
 	    !(kiocb->ki_filp->f_flags & O_NONBLOCK))
 		return -EAGAIN;
-	if ((req->flags & REQ_F_BUF_NODE) && req->buf_node->buf->is_kbuf)
+	if ((req->flags & REQ_F_BUF_NODE) &&
+	    (req->buf_node->buf->flags & IO_IMU_F_KBUF))
 		return -EFAULT;
 
 	ppos = io_kiocb_ppos(kiocb);
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC v2 09/11] io_uring/rsrc: extended reg buffer registration
  2025-11-23 22:51 [RFC v2 00/11] Add dmabuf read/write via io_uring Pavel Begunkov
                   ` (7 preceding siblings ...)
  2025-11-23 22:51 ` [RFC v2 08/11] io_uring/rsrc: add imu flags Pavel Begunkov
@ 2025-11-23 22:51 ` Pavel Begunkov
  2025-11-23 22:51 ` [RFC v2 10/11] io_uring/rsrc: add dmabuf-backed buffer registration Pavel Begunkov
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: Pavel Begunkov @ 2025-11-23 22:51 UTC (permalink / raw)
  To: linux-block, io-uring
  Cc: Vishal Verma, tushar.gohad, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, Pavel Begunkov, linux-kernel, linux-nvme,
	linux-fsdevel, linux-media, dri-devel, linaro-mm-sig

We'll need to pass extra information for buffer registration apart from
the iovec. Add a flag to struct io_uring_rsrc_update2 that indicates
that its data field points to an extended registration structure, i.e.
struct io_uring_reg_buffer. To do a normal registration the user has to
set the target_fd and dmabuf_fd fields to -1; any other combination is
currently rejected.
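
For clarity, a userspace sketch of the expected field conventions
(dmabuf-backed registration itself only starts being accepted in the
following patches):

/* plain registration, equivalent to passing just the iovec */
struct io_uring_reg_buffer rb = {
        .iov_uaddr = (__u64)(uintptr_t)&iov,
        .target_fd = -1,
        .dmabuf_fd = -1,
};

/* dmabuf-backed registration, enabled by the following patches */
rb.target_fd = file_fd;
rb.dmabuf_fd = dma_buf_fd;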

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/uapi/linux/io_uring.h | 13 ++++++++-
 io_uring/rsrc.c               | 53 +++++++++++++++++++++++++++--------
 2 files changed, 54 insertions(+), 12 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index deb772222b6d..f64d1f246b93 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -765,15 +765,26 @@ struct io_uring_rsrc_update {
 	__aligned_u64 data;
 };
 
+/* struct io_uring_rsrc_update2::flags */
+enum io_uring_rsrc_reg_flags {
+	IORING_RSRC_F_EXTENDED_UPDATE		= 1,
+};
+
 struct io_uring_rsrc_update2 {
 	__u32 offset;
-	__u32 resv;
+	__u32 flags;
 	__aligned_u64 data;
 	__aligned_u64 tags;
 	__u32 nr;
 	__u32 resv2;
 };
 
+struct io_uring_reg_buffer {
+	__aligned_u64		iov_uaddr;
+	__s32			target_fd;
+	__s32			dmabuf_fd;
+};
+
 /* Skip updating fd indexes set to this value in the fd table */
 #define IORING_REGISTER_FILES_SKIP	(-2)
 
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 21548942e80d..691f9645d04c 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -27,7 +27,8 @@ struct io_rsrc_update {
 	u32				offset;
 };
 
-static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
+static struct io_rsrc_node *
+io_sqe_buffer_register(struct io_ring_ctx *ctx, struct io_uring_reg_buffer *rb,
 			struct iovec *iov, struct page **last_hpage);
 
 /* only define max */
@@ -234,6 +235,8 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 
 	if (!ctx->file_table.data.nr)
 		return -ENXIO;
+	if (up->flags)
+		return -EINVAL;
 	if (up->offset + nr_args > ctx->file_table.data.nr)
 		return -EINVAL;
 
@@ -288,10 +291,18 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 	return done ? done : err;
 }
 
+static inline void io_default_reg_buf(struct io_uring_reg_buffer *rb)
+{
+	memset(rb, 0, sizeof(*rb));
+	rb->target_fd = -1;
+	rb->dmabuf_fd = -1;
+}
+
 static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 				   struct io_uring_rsrc_update2 *up,
 				   unsigned int nr_args)
 {
+	bool extended_entry = up->flags & IORING_RSRC_F_EXTENDED_UPDATE;
 	u64 __user *tags = u64_to_user_ptr(up->tags);
 	struct iovec fast_iov, *iov;
 	struct page *last_hpage = NULL;
@@ -302,14 +313,32 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 
 	if (!ctx->buf_table.nr)
 		return -ENXIO;
+	if (up->flags & ~IORING_RSRC_F_EXTENDED_UPDATE)
+		return -EINVAL;
 	if (up->offset + nr_args > ctx->buf_table.nr)
 		return -EINVAL;
 
 	for (done = 0; done < nr_args; done++) {
+		struct io_uring_reg_buffer rb;
 		struct io_rsrc_node *node;
 		u64 tag = 0;
 
-		uvec = u64_to_user_ptr(user_data);
+		if (extended_entry) {
+			if (copy_from_user(&rb, u64_to_user_ptr(user_data),
+					   sizeof(rb)))
+				return -EFAULT;
+			user_data += sizeof(rb);
+		} else {
+			io_default_reg_buf(&rb);
+			rb.iov_uaddr = user_data;
+
+			if (ctx->compat)
+				user_data += sizeof(struct compat_iovec);
+			else
+				user_data += sizeof(struct iovec);
+		}
+
+		uvec = u64_to_user_ptr(rb.iov_uaddr);
 		iov = iovec_from_user(uvec, 1, 1, &fast_iov, ctx->compat);
 		if (IS_ERR(iov)) {
 			err = PTR_ERR(iov);
@@ -322,7 +351,7 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 		err = io_buffer_validate(iov);
 		if (err)
 			break;
-		node = io_sqe_buffer_register(ctx, iov, &last_hpage);
+		node = io_sqe_buffer_register(ctx, &rb, iov, &last_hpage);
 		if (IS_ERR(node)) {
 			err = PTR_ERR(node);
 			break;
@@ -337,10 +366,6 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 		i = array_index_nospec(up->offset + done, ctx->buf_table.nr);
 		io_reset_rsrc_node(ctx, &ctx->buf_table, i);
 		ctx->buf_table.nodes[i] = node;
-		if (ctx->compat)
-			user_data += sizeof(struct compat_iovec);
-		else
-			user_data += sizeof(struct iovec);
 	}
 	return done ? done : err;
 }
@@ -375,7 +400,7 @@ int io_register_files_update(struct io_ring_ctx *ctx, void __user *arg,
 	memset(&up, 0, sizeof(up));
 	if (copy_from_user(&up, arg, sizeof(struct io_uring_rsrc_update)))
 		return -EFAULT;
-	if (up.resv || up.resv2)
+	if (up.resv2)
 		return -EINVAL;
 	return __io_register_rsrc_update(ctx, IORING_RSRC_FILE, &up, nr_args);
 }
@@ -389,7 +414,7 @@ int io_register_rsrc_update(struct io_ring_ctx *ctx, void __user *arg,
 		return -EINVAL;
 	if (copy_from_user(&up, arg, sizeof(up)))
 		return -EFAULT;
-	if (!up.nr || up.resv || up.resv2)
+	if (!up.nr || up.resv2)
 		return -EINVAL;
 	return __io_register_rsrc_update(ctx, type, &up, up.nr);
 }
@@ -493,7 +518,7 @@ int io_files_update(struct io_kiocb *req, unsigned int issue_flags)
 	up2.data = up->arg;
 	up2.nr = 0;
 	up2.tags = 0;
-	up2.resv = 0;
+	up2.flags = 0;
 	up2.resv2 = 0;
 
 	if (up->offset == IORING_FILE_INDEX_ALLOC) {
@@ -778,6 +803,7 @@ bool io_check_coalesce_buffer(struct page **page_array, int nr_pages,
 }
 
 static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
+						   struct io_uring_reg_buffer *rb,
 						   struct iovec *iov,
 						   struct page **last_hpage)
 {
@@ -790,6 +816,9 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
 	struct io_imu_folio_data data;
 	bool coalesced = false;
 
+	if (rb->dmabuf_fd != -1 || rb->target_fd != -1)
+		return NULL;
+
 	if (!iov->iov_base)
 		return NULL;
 
@@ -887,6 +916,7 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 		memset(iov, 0, sizeof(*iov));
 
 	for (i = 0; i < nr_args; i++) {
+		struct io_uring_reg_buffer rb;
 		struct io_rsrc_node *node;
 		u64 tag = 0;
 
@@ -913,7 +943,8 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 			}
 		}
 
-		node = io_sqe_buffer_register(ctx, iov, &last_hpage);
+		io_default_reg_buf(&rb);
+		node = io_sqe_buffer_register(ctx, &rb, iov, &last_hpage);
 		if (IS_ERR(node)) {
 			ret = PTR_ERR(node);
 			break;
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC v2 10/11] io_uring/rsrc: add dmabuf-backed buffer registeration
  2025-11-23 22:51 [RFC v2 00/11] Add dmabuf read/write via io_uring Pavel Begunkov
                   ` (8 preceding siblings ...)
  2025-11-23 22:51 ` [RFC v2 09/11] io_uring/rsrc: extended reg buffer registration Pavel Begunkov
@ 2025-11-23 22:51 ` Pavel Begunkov
  2025-11-23 22:51 ` [RFC v2 11/11] io_uring/rsrc: implement dmabuf regbuf import Pavel Begunkov
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: Pavel Begunkov @ 2025-11-23 22:51 UTC (permalink / raw)
  To: linux-block, io-uring
  Cc: Vishal Verma, tushar.gohad, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, Pavel Begunkov, linux-kernel, linux-nvme,
	linux-fsdevel, linux-media, dri-devel, linaro-mm-sig, David Wei

Add the ability to register a dmabuf-backed io_uring buffer. It also
needs to know which device to use for the attachment; for that it takes
target_fd and extracts the device through the new file op. Unlike normal
buffers, it also retains the target file so that imports from ineligible
requests can be rejected in the following patches.

Suggested-by: Vishal Verma <vishal1.verma@intel.com>
Suggested-by: David Wei <dw@davidwei.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/rsrc.c | 106 +++++++++++++++++++++++++++++++++++++++++++++++-
 io_uring/rsrc.h |   1 +
 2 files changed, 106 insertions(+), 1 deletion(-)

diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 691f9645d04c..7dfebf459dd0 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -10,6 +10,8 @@
 #include <linux/compat.h>
 #include <linux/io_uring.h>
 #include <linux/io_uring/cmd.h>
+#include <linux/dma-buf.h>
+#include <linux/dma_token.h>
 
 #include <uapi/linux/io_uring.h>
 
@@ -802,6 +804,106 @@ bool io_check_coalesce_buffer(struct page **page_array, int nr_pages,
 	return true;
 }
 
+struct io_regbuf_dma {
+	struct dma_token		*token;
+	struct file			*target_file;
+	struct dma_buf			*dmabuf;
+};
+
+static void io_release_reg_dmabuf(void *priv)
+{
+	struct io_regbuf_dma *db = priv;
+
+	dma_token_release(db->token);
+	dma_buf_put(db->dmabuf);
+	fput(db->target_file);
+	kfree(db);
+}
+
+static struct io_rsrc_node *io_register_dmabuf(struct io_ring_ctx *ctx,
+						struct io_uring_reg_buffer *rb,
+						struct iovec *iov)
+{
+	struct dma_token_params params = {};
+	struct io_rsrc_node *node = NULL;
+	struct io_mapped_ubuf *imu = NULL;
+	struct io_regbuf_dma *regbuf = NULL;
+	struct file *target_file = NULL;
+	struct dma_buf *dmabuf = NULL;
+	struct dma_token *token;
+	int ret;
+
+	if (iov->iov_base || iov->iov_len)
+		return ERR_PTR(-EFAULT);
+
+	node = io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER);
+	if (!node) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	imu = io_alloc_imu(ctx, 0);
+	if (!imu) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	regbuf = kzalloc(sizeof(*regbuf), GFP_KERNEL);
+	if (!regbuf) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	target_file = fget(rb->target_fd);
+	if (!target_file) {
+		ret = -EBADF;
+		goto err;
+	}
+
+	dmabuf = dma_buf_get(rb->dmabuf_fd);
+	if (IS_ERR(dmabuf)) {
+		ret = PTR_ERR(dmabuf);
+		dmabuf = NULL;
+		goto err;
+	}
+
+	params.dmabuf = dmabuf;
+	params.dir = DMA_BIDIRECTIONAL;
+	token = dma_token_create(target_file, &params);
+	if (IS_ERR(token)) {
+		ret = PTR_ERR(token);
+		goto err;
+	}
+
+	regbuf->target_file = target_file;
+	regbuf->token = token;
+	regbuf->dmabuf = dmabuf;
+
+	imu->nr_bvecs = 1;
+	imu->ubuf = 0;
+	imu->len = dmabuf->size;
+	imu->folio_shift = 0;
+	imu->release = io_release_reg_dmabuf;
+	imu->priv = regbuf;
+	imu->flags = IO_IMU_F_DMA;
+	imu->dir = IO_IMU_DEST | IO_IMU_SOURCE;
+	refcount_set(&imu->refs, 1);
+	node->buf = imu;
+	return node;
+err:
+	if (regbuf)
+		kfree(regbuf);
+	if (imu)
+		io_free_imu(ctx, imu);
+	if (node)
+		io_cache_free(&ctx->node_cache, node);
+	if (target_file)
+		fput(target_file);
+	if (dmabuf)
+		dma_buf_put(dmabuf);
+	return ERR_PTR(ret);
+}
+
 static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
 						   struct io_uring_reg_buffer *rb,
 						   struct iovec *iov,
@@ -817,7 +919,7 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
 	bool coalesced = false;
 
 	if (rb->dmabuf_fd != -1 || rb->target_fd != -1)
-		return NULL;
+		return io_register_dmabuf(ctx, rb, iov);
 
 	if (!iov->iov_base)
 		return NULL;
@@ -1117,6 +1219,8 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
 
 	offset = buf_addr - imu->ubuf;
 
+	if (imu->flags & IO_IMU_F_DMA)
+		return -EOPNOTSUPP;
 	if (imu->flags & IO_IMU_F_KBUF)
 		return io_import_kbuf(ddir, iter, imu, len, offset);
 
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 7c1128a856ec..280d3988abf3 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -30,6 +30,7 @@ enum {
 
 enum {
 	IO_IMU_F_KBUF			= 1,
+	IO_IMU_F_DMA			= 2,
 };
 
 struct io_mapped_ubuf {
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC v2 11/11] io_uring/rsrc: implement dmabuf regbuf import
  2025-11-23 22:51 [RFC v2 00/11] Add dmabuf read/write via io_uring Pavel Begunkov
                   ` (9 preceding siblings ...)
  2025-11-23 22:51 ` [RFC v2 10/11] io_uring/rsrc: add dmabuf-backed buffer registeration Pavel Begunkov
@ 2025-11-23 22:51 ` Pavel Begunkov
  2025-11-24 10:33 ` [RFC v2 00/11] Add dmabuf read/write via io_uring Christian König
  2025-11-24 13:35 ` Anuj gupta
  12 siblings, 0 replies; 35+ messages in thread
From: Pavel Begunkov @ 2025-11-23 22:51 UTC (permalink / raw)
  To: linux-block, io-uring
  Cc: Vishal Verma, tushar.gohad, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, Pavel Begunkov, linux-kernel, linux-nvme,
	linux-fsdevel, linux-media, dri-devel, linaro-mm-sig, David Wei

Allow importing dmabuf-backed registered buffers. It's an opt-in feature
for requests, which need to pass a flag allowing it. Furthermore, the
import will fail if the request's file doesn't match the file for which
the buffer was registered. This way, it's also limited to files that
support the feature by implementing the corresponding file op. Enable it
for read/write requests.

Suggested-by: David Wei <dw@davidwei.uk>
Suggested-by: Vishal Verma <vishal1.verma@intel.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/rsrc.c | 36 +++++++++++++++++++++++++++++-------
 io_uring/rsrc.h | 16 +++++++++++++++-
 io_uring/rw.c   |  4 ++--
 3 files changed, 46 insertions(+), 10 deletions(-)

diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 7dfebf459dd0..a5d88dae536e 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -1201,9 +1201,27 @@ static int io_import_kbuf(int ddir, struct iov_iter *iter,
 	return 0;
 }
 
-static int io_import_fixed(int ddir, struct iov_iter *iter,
+static int io_import_dmabuf(struct io_kiocb *req,
+			   int ddir, struct iov_iter *iter,
 			   struct io_mapped_ubuf *imu,
-			   u64 buf_addr, size_t len)
+			   size_t len, size_t offset)
+{
+	struct io_regbuf_dma *db = imu->priv;
+
+	if (!len)
+		return -EFAULT;
+	if (req->file != db->target_file)
+		return -EBADF;
+
+	iov_iter_dma_token(iter, ddir, db->token, offset, len);
+	return 0;
+}
+
+static int io_import_fixed(struct io_kiocb *req,
+			   int ddir, struct iov_iter *iter,
+			   struct io_mapped_ubuf *imu,
+			   u64 buf_addr, size_t len,
+			   unsigned import_flags)
 {
 	const struct bio_vec *bvec;
 	size_t folio_mask;
@@ -1219,8 +1237,11 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
 
 	offset = buf_addr - imu->ubuf;
 
-	if (imu->flags & IO_IMU_F_DMA)
-		return -EOPNOTSUPP;
+	if (imu->flags & IO_IMU_F_DMA) {
+		if (!(import_flags & IO_REGBUF_IMPORT_ALLOW_DMA))
+			return -EFAULT;
+		return io_import_dmabuf(req, ddir, iter, imu, len, offset);
+	}
 	if (imu->flags & IO_IMU_F_KBUF)
 		return io_import_kbuf(ddir, iter, imu, len, offset);
 
@@ -1274,16 +1295,17 @@ inline struct io_rsrc_node *io_find_buf_node(struct io_kiocb *req,
 	return NULL;
 }
 
-int io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter,
+int __io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter,
 			u64 buf_addr, size_t len, int ddir,
-			unsigned issue_flags)
+			unsigned issue_flags, unsigned import_flags)
 {
 	struct io_rsrc_node *node;
 
 	node = io_find_buf_node(req, issue_flags);
 	if (!node)
 		return -EFAULT;
-	return io_import_fixed(ddir, iter, node->buf, buf_addr, len);
+	return io_import_fixed(req, ddir, iter, node->buf, buf_addr, len,
+				import_flags);
 }
 
 /* Lock two rings at once. The rings must be different! */
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 280d3988abf3..e0eafce976f3 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -33,6 +33,10 @@ enum {
 	IO_IMU_F_DMA			= 2,
 };
 
+enum {
+	IO_REGBUF_IMPORT_ALLOW_DMA		= 1,
+};
+
 struct io_mapped_ubuf {
 	u64		ubuf;
 	unsigned int	len;
@@ -66,9 +70,19 @@ int io_rsrc_data_alloc(struct io_rsrc_data *data, unsigned nr);
 
 struct io_rsrc_node *io_find_buf_node(struct io_kiocb *req,
 				      unsigned issue_flags);
+int __io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter,
+			u64 buf_addr, size_t len, int ddir,
+			unsigned issue_flags, unsigned import_flags);
+
+static inline
 int io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter,
 			u64 buf_addr, size_t len, int ddir,
-			unsigned issue_flags);
+			unsigned issue_flags)
+{
+	return __io_import_reg_buf(req, iter, buf_addr, len, ddir,
+				   issue_flags, 0);
+}
+
 int io_import_reg_vec(int ddir, struct iov_iter *iter,
 			struct io_kiocb *req, struct iou_vec *vec,
 			unsigned nr_iovs, unsigned issue_flags);
diff --git a/io_uring/rw.c b/io_uring/rw.c
index a3eb4e7bf992..0d9d99695801 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -374,8 +374,8 @@ static int io_init_rw_fixed(struct io_kiocb *req, unsigned int issue_flags,
 	if (io->bytes_done)
 		return 0;
 
-	ret = io_import_reg_buf(req, &io->iter, rw->addr, rw->len, ddir,
-				issue_flags);
+	ret = __io_import_reg_buf(req, &io->iter, rw->addr, rw->len, ddir,
+				  issue_flags, IO_REGBUF_IMPORT_ALLOW_DMA);
 	iov_iter_save_state(&io->iter, &io->iter_state);
 	return ret;
 }
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [RFC v2 00/11] Add dmabuf read/write via io_uring
  2025-11-23 22:51 [RFC v2 00/11] Add dmabuf read/write via io_uring Pavel Begunkov
                   ` (10 preceding siblings ...)
  2025-11-23 22:51 ` [RFC v2 11/11] io_uring/rsrc: implement dmabuf regbuf import Pavel Begunkov
@ 2025-11-24 10:33 ` Christian König
  2025-11-24 11:30   ` Pavel Begunkov
  2025-11-24 13:35 ` Anuj gupta
  12 siblings, 1 reply; 35+ messages in thread
From: Christian König @ 2025-11-24 10:33 UTC (permalink / raw)
  To: Pavel Begunkov, linux-block, io-uring
  Cc: Vishal Verma, tushar.gohad, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal, linux-kernel,
	linux-nvme, linux-fsdevel, linux-media, dri-devel, linaro-mm-sig

On 11/23/25 23:51, Pavel Begunkov wrote:
> Picking up the work on supporting dmabuf in the read/write path.

IIRC that work was completely stopped because it violated core dma_fence and DMA-buf rules and after some private discussion was considered not doable in general.

Or am I mixing something up here? Since I don't see any dma_fence implementation at all that might actually be the case.

On the other hand we have direct I/O from DMA-buf working for quite a while, just not upstream and without io_uring support.

Regards,
Christian.

> There
> are two main changes. First, it doesn't pass a dma addresss directly by
> rather wraps it into an opaque structure, which is extended and
> understood by the target driver.
> 
> The second big change is support for dynamic attachments, which added a
> good part of complexity (see Patch 5). I kept the main machinery in nvme
> at first, but move_notify can ask to kill the dma mapping asynchronously,
> and any new IO would need to wait during submission, thus it was moved
> to blk-mq. That also introduced an extra callback layer b/w driver and
> blk-mq.
> 
> There are some rough corners, and I'm not perfectly happy about the
> complexity and layering. For v3 I'll try to move the waiting up in the
> stack to io_uring wrapped into library helpers.
> 
> For now, I'm interested what is the best way to test move_notify? And
> how dma_resv_reserve_fences() errors should be handled in move_notify?
> 
> The uapi didn't change, after registration it looks like a normal
> io_uring registered buffer and can be used as such. Only non-vectored
> fixed reads/writes are allowed. Pseudo code:
> 
> // registration
> reg_buf_idx = 0;
> io_uring_update_buffer(ring, reg_buf_idx, { dma_buf_fd, file_fd });
> 
> // request creation
> io_uring_prep_read_fixed(sqe, file_fd, buffer_offset,
>                          buffer_size, file_offset, reg_buf_idx);
> 
> And as previously, a good bunch of code was taken from Keith's series [1].
> 
> liburing based example:
> 
> git: https://github.com/isilence/liburing.git dmabuf-rw
> link: https://github.com/isilence/liburing/tree/dmabuf-rw
> 
> [1] https://lore.kernel.org/io-uring/20220805162444.3985535-1-kbusch@fb.com/
> 
> Pavel Begunkov (11):
>   file: add callback for pre-mapping dmabuf
>   iov_iter: introduce iter type for pre-registered dma
>   block: move around bio flagging helpers
>   block: introduce dma token backed bio type
>   block: add infra to handle dmabuf tokens
>   nvme-pci: add support for dmabuf reggistration
>   nvme-pci: implement dma_token backed requests
>   io_uring/rsrc: add imu flags
>   io_uring/rsrc: extended reg buffer registration
>   io_uring/rsrc: add dmabuf-backed buffer registeration
>   io_uring/rsrc: implement dmabuf regbuf import
> 
>  block/Makefile                   |   1 +
>  block/bdev.c                     |  14 ++
>  block/bio.c                      |  21 +++
>  block/blk-merge.c                |  23 +++
>  block/blk-mq-dma-token.c         | 236 +++++++++++++++++++++++++++++++
>  block/blk-mq.c                   |  20 +++
>  block/blk.h                      |   3 +-
>  block/fops.c                     |   3 +
>  drivers/nvme/host/pci.c          | 217 ++++++++++++++++++++++++++++
>  include/linux/bio.h              |  49 ++++---
>  include/linux/blk-mq-dma-token.h |  60 ++++++++
>  include/linux/blk-mq.h           |  21 +++
>  include/linux/blk_types.h        |   8 +-
>  include/linux/blkdev.h           |   3 +
>  include/linux/dma_token.h        |  35 +++++
>  include/linux/fs.h               |   4 +
>  include/linux/uio.h              |  10 ++
>  include/uapi/linux/io_uring.h    |  13 +-
>  io_uring/rsrc.c                  | 201 +++++++++++++++++++++++---
>  io_uring/rsrc.h                  |  23 ++-
>  io_uring/rw.c                    |   7 +-
>  lib/iov_iter.c                   |  30 +++-
>  22 files changed, 948 insertions(+), 54 deletions(-)
>  create mode 100644 block/blk-mq-dma-token.c
>  create mode 100644 include/linux/blk-mq-dma-token.h
>  create mode 100644 include/linux/dma_token.h
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v2 00/11] Add dmabuf read/write via io_uring
  2025-11-24 10:33 ` [RFC v2 00/11] Add dmabuf read/write via io_uring Christian König
@ 2025-11-24 11:30   ` Pavel Begunkov
  2025-11-24 14:17     ` Christian König
  0 siblings, 1 reply; 35+ messages in thread
From: Pavel Begunkov @ 2025-11-24 11:30 UTC (permalink / raw)
  To: Christian König, linux-block, io-uring
  Cc: Vishal Verma, tushar.gohad, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal, linux-kernel,
	linux-nvme, linux-fsdevel, linux-media, dri-devel, linaro-mm-sig

On 11/24/25 10:33, Christian König wrote:
> On 11/23/25 23:51, Pavel Begunkov wrote:
>> Picking up the work on supporting dmabuf in the read/write path.
> 
> IIRC that work was completely stopped because it violated core dma_fence and DMA-buf rules and after some private discussion was considered not doable in general.
> 
> Or am I mixing something up here?

The time gap is purely due to me being busy. I wasn't CC'ed to those private
discussions you mentioned, but the v1 feedback was to use dynamic attachments
and avoid passing dma address arrays directly.

https://lore.kernel.org/all/cover.1751035820.git.asml.silence@gmail.com/

I'm lost on what part is not doable. Can you elaborate on the core
dma-fence dma-buf rules?

> Since I don't see any dma_fence implementation at all that might actually be the case.

See Patch 5, struct blk_mq_dma_fence. It's used in the move_notify
callback and is signaled when all inflight IO using the current
mapping are complete. All new IO requests will try to recreate the
mapping, and hence potentially wait with dma_resv_wait_timeout().

> On the other hand we have direct I/O from DMA-buf working for quite a while, just not upstream and without io_uring support.

Have any reference?

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v2 00/11] Add dmabuf read/write via io_uring
  2025-11-23 22:51 [RFC v2 00/11] Add dmabuf read/write via io_uring Pavel Begunkov
                   ` (11 preceding siblings ...)
  2025-11-24 10:33 ` [RFC v2 00/11] Add dmabuf read/write via io_uring Christian König
@ 2025-11-24 13:35 ` Anuj gupta
  2025-11-25 12:35   ` Pavel Begunkov
  12 siblings, 1 reply; 35+ messages in thread
From: Anuj gupta @ 2025-11-24 13:35 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: linux-block, io-uring, Vishal Verma, tushar.gohad, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, linux-kernel, linux-nvme, linux-fsdevel,
	linux-media, dri-devel, linaro-mm-sig

This series significantly reduces the IOMMU/DMA overhead for I/O,
particularly when the IOMMU is configured in STRICT or LAZY mode. I
modified t/io_uring in fio to exercise this path and tested with an
Intel Optane device. On my setup, I see the following improvement:

- STRICT: before = 570 KIOPS, after = 5.01 MIOPS
- LAZY: before = 1.93 MIOPS, after = 5.01 MIOPS
- PASSTHROUGH: before = 5.01 MIOPS, after = 5.01 MIOPS

The STRICT/LAZY numbers clearly show the benefit of avoiding per-I/O
dma_map/dma_unmap and reusing the pre-mapped DMA addresses.
--
Anuj Gupta

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v2 05/11] block: add infra to handle dmabuf tokens
  2025-11-23 22:51 ` [RFC v2 05/11] block: add infra to handle dmabuf tokens Pavel Begunkov
@ 2025-11-24 13:38   ` Anuj gupta
  2025-12-04 10:56   ` Christoph Hellwig
  2025-12-04 13:08   ` Christoph Hellwig
  2 siblings, 0 replies; 35+ messages in thread
From: Anuj gupta @ 2025-11-24 13:38 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: linux-block, io-uring, Vishal Verma, tushar.gohad, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, linux-kernel, linux-nvme, linux-fsdevel,
	linux-media, dri-devel, linaro-mm-sig

> +void blk_mq_dma_map_move_notify(struct blk_mq_dma_token *token)
> +{
> +       blk_mq_dma_map_remove(token);
> +}
This needs to be exported as it is referenced from the nvme-pci driver;
otherwise we get a build error.
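
Presumably something along these lines in block/blk-mq-dma-token.c
(just a sketch; whether it should be a GPL-only export is my assumption):

void blk_mq_dma_map_move_notify(struct blk_mq_dma_token *token)
{
	blk_mq_dma_map_remove(token);
}
EXPORT_SYMBOL_GPL(blk_mq_dma_map_move_notify);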

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v2 06/11] nvme-pci: add support for dmabuf reggistration
  2025-11-23 22:51 ` [RFC v2 06/11] nvme-pci: add support for dmabuf reggistration Pavel Begunkov
@ 2025-11-24 13:40   ` Anuj gupta
  2025-12-04 11:00   ` Christoph Hellwig
  1 sibling, 0 replies; 35+ messages in thread
From: Anuj gupta @ 2025-11-24 13:40 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: linux-block, io-uring, Vishal Verma, tushar.gohad, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, linux-kernel, linux-nvme, linux-fsdevel,
	linux-media, dri-devel, linaro-mm-sig

nit:
s/reggistration/registration/ in subject

Also a MODULE_IMPORT_NS("DMA_BUF") needs to be added, since it now uses
symbols from the DMA_BUF namespace; otherwise we get a build error.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v2 00/11] Add dmabuf read/write via io_uring
  2025-11-24 11:30   ` Pavel Begunkov
@ 2025-11-24 14:17     ` Christian König
  2025-11-25 13:52       ` Pavel Begunkov
  0 siblings, 1 reply; 35+ messages in thread
From: Christian König @ 2025-11-24 14:17 UTC (permalink / raw)
  To: Pavel Begunkov, linux-block, io-uring
  Cc: Vishal Verma, tushar.gohad, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal, linux-kernel,
	linux-nvme, linux-fsdevel, linux-media, dri-devel, linaro-mm-sig

On 11/24/25 12:30, Pavel Begunkov wrote:
> On 11/24/25 10:33, Christian König wrote:
>> On 11/23/25 23:51, Pavel Begunkov wrote:
>>> Picking up the work on supporting dmabuf in the read/write path.
>>
>> IIRC that work was completely stopped because it violated core dma_fence and DMA-buf rules and after some private discussion was considered not doable in general.
>>
>> Or am I mixing something up here?
> 
> The time gap is purely due to me being busy. I wasn't CC'ed to those private
> discussions you mentioned, but the v1 feedback was to use dynamic attachments
> and avoid passing dma address arrays directly.
> 
> https://lore.kernel.org/all/cover.1751035820.git.asml.silence@gmail.com/
> 
> I'm lost on what part is not doable. Can you elaborate on the core
> dma-fence dma-buf rules?

I most likely mixed that up, in other words that was a different discussion.

When you use dma_fences to indicate async completion of events you need to be super duper careful that you only do this for in flight events, have the fence creation in the right order etc...

For example once the fence is created you can't make any memory allocations any more, that's why we have this dance of reserving fence slots, creating the fence and then adding it.

>> Since I don't see any dma_fence implementation at all that might actually be the case.
> 
> See Patch 5, struct blk_mq_dma_fence. It's used in the move_notify
> callback and is signaled when all inflight IO using the current
> mapping are complete. All new IO requests will try to recreate the
> mapping, and hence potentially wait with dma_resv_wait_timeout().

Without looking at the code that approach sounds more or less correct to me.

>> On the other hand we have direct I/O from DMA-buf working for quite a while, just not upstream and without io_uring support.
> 
> Have any reference?

There is a WIP feature in AMDs GPU driver package for ROCm.

But that can't be used as general purpose DMA-buf approach, because it makes use of internal knowledge about how the GPU driver is using the backing store.

BTW when you use DMA addresses from DMA-buf always keep in mind that this memory can be written by others at the same time, e.g. you can't do things like compute a CRC first, then write to backing store and finally compare CRC.

Regards,
Christian.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v2 00/11] Add dmabuf read/write via io_uring
  2025-11-24 13:35 ` Anuj gupta
@ 2025-11-25 12:35   ` Pavel Begunkov
  0 siblings, 0 replies; 35+ messages in thread
From: Pavel Begunkov @ 2025-11-25 12:35 UTC (permalink / raw)
  To: Anuj gupta
  Cc: linux-block, io-uring, Vishal Verma, tushar.gohad, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, linux-kernel, linux-nvme, linux-fsdevel,
	linux-media, dri-devel, linaro-mm-sig

On 11/24/25 13:35, Anuj gupta wrote:
> This series significantly reduces the IOMMU/DMA overhead for I/O,
> particularly when the IOMMU is configured in STRICT or LAZY mode. I
> modified t/io_uring in fio to exercise this path and tested with an
> Intel Optane device. On my setup, I see the following improvement:
> 
> - STRICT: before = 570 KIOPS, after = 5.01 MIOPS
> - LAZY: before = 1.93 MIOPS, after = 5.01 MIOPS
> - PASSTHROUGH: before = 5.01 MIOPS, after = 5.01 MIOPS
> 
> The STRICT/LAZY numbers clearly show the benefit of avoiding per-I/O
> dma_map/dma_unmap and reusing the pre-mapped DMA addresses.

Thanks for giving it a run. Looks indeed promising, and I believe
that was the main use case Keith was pursuing. I'll fix up the
build problems for v3

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v2 00/11] Add dmabuf read/write via io_uring
  2025-11-24 14:17     ` Christian König
@ 2025-11-25 13:52       ` Pavel Begunkov
  2025-11-25 14:21         ` Christian König
  0 siblings, 1 reply; 35+ messages in thread
From: Pavel Begunkov @ 2025-11-25 13:52 UTC (permalink / raw)
  To: Christian König, linux-block, io-uring
  Cc: Vishal Verma, tushar.gohad, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal, linux-kernel,
	linux-nvme, linux-fsdevel, linux-media, dri-devel, linaro-mm-sig

On 11/24/25 14:17, Christian König wrote:
> On 11/24/25 12:30, Pavel Begunkov wrote:
>> On 11/24/25 10:33, Christian König wrote:
>>> On 11/23/25 23:51, Pavel Begunkov wrote:
>>>> Picking up the work on supporting dmabuf in the read/write path.
>>>
>>> IIRC that work was completely stopped because it violated core dma_fence and DMA-buf rules and after some private discussion was considered not doable in general.
>>>
>>> Or am I mixing something up here?
>>
>> The time gap is purely due to me being busy. I wasn't CC'ed to those private
>> discussions you mentioned, but the v1 feedback was to use dynamic attachments
>> and avoid passing dma address arrays directly.
>>
>> https://lore.kernel.org/all/cover.1751035820.git.asml.silence@gmail.com/
>>
>> I'm lost on what part is not doable. Can you elaborate on the core
>> dma-fence dma-buf rules?
> 
> I most likely mixed that up, in other words that was a different discussion.
> 
> When you use dma_fences to indicate async completion of events you need to be super duper careful that you only do this for in flight events, have the fence creation in the right order etc...

I'm curious, what can happen if there is new IO using a
move_notify()ed mapping, but let's say it's guaranteed to complete
strictly before dma_buf_unmap_attachment() and the fence is signaled?
Is there some loss of data or corruption that can happen?

sg_table = map_attach()         |
move_notify()                   |
   -> add_fence(fence)           |
                                 | issue_IO(sg_table)
                                 | // IO completed
unmap_attachment(sg_table)      |
signal_fence(fence)             |

> For example once the fence is created you can't make any memory allocations any more, that's why we have this dance of reserving fence slots, creating the fence and then adding it.

Looks I have some terminology gap here. By "memory allocations" you
don't mean kmalloc, right? I assume it's about new users of the
mapping.

>>> Since I don't see any dma_fence implementation at all that might actually be the case.
>>
>> See Patch 5, struct blk_mq_dma_fence. It's used in the move_notify
>> callback and is signaled when all inflight IO using the current
>> mapping are complete. All new IO requests will try to recreate the
>> mapping, and hence potentially wait with dma_resv_wait_timeout().
> 
> Without looking at the code that approach sounds more or less correct to me.
> 
>>> On the other hand we have direct I/O from DMA-buf working for quite a while, just not upstream and without io_uring support.
>>
>> Have any reference?
> 
> There is a WIP feature in AMDs GPU driver package for ROCm.
> 
> But that can't be used as general purpose DMA-buf approach, because it makes use of internal knowledge about how the GPU driver is using the backing store.

Got it

> BTW when you use DMA addresses from DMA-buf always keep in mind that this memory can be written by others at the same time, e.g. you can't do things like compute a CRC first, then write to backing store and finally compare CRC.

Right. The direct IO path also works with user pages, so the
constraints are similar in this regard.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v2 00/11] Add dmabuf read/write via io_uring
  2025-11-25 13:52       ` Pavel Begunkov
@ 2025-11-25 14:21         ` Christian König
  2025-11-25 19:40           ` Pavel Begunkov
  0 siblings, 1 reply; 35+ messages in thread
From: Christian König @ 2025-11-25 14:21 UTC (permalink / raw)
  To: Pavel Begunkov, linux-block, io-uring
  Cc: Vishal Verma, tushar.gohad, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal, linux-kernel,
	linux-nvme, linux-fsdevel, linux-media, dri-devel, linaro-mm-sig

On 11/25/25 14:52, Pavel Begunkov wrote:
> On 11/24/25 14:17, Christian König wrote:
>> On 11/24/25 12:30, Pavel Begunkov wrote:
>>> On 11/24/25 10:33, Christian König wrote:
>>>> On 11/23/25 23:51, Pavel Begunkov wrote:
>>>>> Picking up the work on supporting dmabuf in the read/write path.
>>>>
>>>> IIRC that work was completely stopped because it violated core dma_fence and DMA-buf rules and after some private discussion was considered not doable in general.
>>>>
>>>> Or am I mixing something up here?
>>>
>>> The time gap is purely due to me being busy. I wasn't CC'ed to those private
>>> discussions you mentioned, but the v1 feedback was to use dynamic attachments
>>> and avoid passing dma address arrays directly.
>>>
>>> https://lore.kernel.org/all/cover.1751035820.git.asml.silence@gmail.com/
>>>
>>> I'm lost on what part is not doable. Can you elaborate on the core
>>> dma-fence dma-buf rules?
>>
>> I most likely mixed that up, in other words that was a different discussion.
>>
>> When you use dma_fences to indicate async completion of events you need to be super duper careful that you only do this for in flight events, have the fence creation in the right order etc...
> 
> I'm curious, what can happen if there is new IO using a
> move_notify()ed mapping, but let's say it's guaranteed to complete
> strictly before dma_buf_unmap_attachment() and the fence is signaled?
> Is there some loss of data or corruption that can happen?

The problem is that you can't guarantee that because you run into deadlocks.

As soon as a dma_fence is created and published by calling add_fence, memory management can loop back and depend on that fence.

So you actually can't issue any new IO which might block the unmap operation.

> 
> sg_table = map_attach()         |
> move_notify()                   |
>   -> add_fence(fence)           |
>                                 | issue_IO(sg_table)
>                                 | // IO completed
> unmap_attachment(sg_table)      |
> signal_fence(fence)             |
> 
>> For example once the fence is created you can't make any memory allocations any more, that's why we have this dance of reserving fence slots, creating the fence and then adding it.
> 
> Looks I have some terminology gap here. By "memory allocations" you
> don't mean kmalloc, right? I assume it's about new users of the
> mapping.

kmalloc() as well as get_free_page() is exactly what is meant here.

You can't make any memory allocation any more after creating/publishing a dma_fence.

The usual flow is the following:

1. Lock dma_resv object
2. Prepare I/O operation, make all memory allocations etc...
3. Allocate dma_fence object
4. Push I/O operation to the HW, making sure that you don't allocate memory any more.
5. Call dma_resv_add_fence(with fence allocate in #3).
6. Unlock dma_resv object

If you stray from that you most likely end up in a deadlock sooner or later.
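
Roughly, as a minimal code fragment sketching that ordering (everything
except the dma_resv_*() calls and DMA_RESV_USAGE_BOOKKEEP is a
placeholder, not real driver code, and the fence allocation is assumed
to succeed):

	dma_resv_lock(resv, NULL);				/* 1 */
	ret = prepare_io_and_allocate_everything();		/* 2 */
	if (!ret)
		ret = dma_resv_reserve_fences(resv, 1);		/* 2 */
	if (!ret) {
		fence = allocate_and_init_fence();		/* 3 */
		push_io_to_hw(fence);				/* 4: no allocations */
		dma_resv_add_fence(resv, fence,			/* 5 */
				   DMA_RESV_USAGE_BOOKKEEP);
	}
	dma_resv_unlock(resv);					/* 6 */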

Regards,
Christian.

>>>> Since I don't see any dma_fence implementation at all that might actually be the case.
>>>
>>> See Patch 5, struct blk_mq_dma_fence. It's used in the move_notify
>>> callback and is signaled when all inflight IO using the current
>>> mapping are complete. All new IO requests will try to recreate the
>>> mapping, and hence potentially wait with dma_resv_wait_timeout().
>>
>> Without looking at the code that approach sounds more or less correct to me.
>>
>>>> On the other hand we have direct I/O from DMA-buf working for quite a while, just not upstream and without io_uring support.
>>>
>>> Have any reference?
>>
>> There is a WIP feature in AMDs GPU driver package for ROCm.
>>
>> But that can't be used as general purpose DMA-buf approach, because it makes use of internal knowledge about how the GPU driver is using the backing store.
> 
> Got it
> 
>> BTW when you use DMA addresses from DMA-buf always keep in mind that this memory can be written by others at the same time, e.g. you can't do things like compute a CRC first, then write to backing store and finally compare CRC.
> 
> Right. The direct IO path also works with user pages, so the
> constraints are similar in this regard.
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v2 00/11] Add dmabuf read/write via io_uring
  2025-11-25 14:21         ` Christian König
@ 2025-11-25 19:40           ` Pavel Begunkov
  0 siblings, 0 replies; 35+ messages in thread
From: Pavel Begunkov @ 2025-11-25 19:40 UTC (permalink / raw)
  To: Christian König, linux-block, io-uring
  Cc: Vishal Verma, tushar.gohad, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal, linux-kernel,
	linux-nvme, linux-fsdevel, linux-media, dri-devel, linaro-mm-sig

On 11/25/25 14:21, Christian König wrote:
> On 11/25/25 14:52, Pavel Begunkov wrote:
>> On 11/24/25 14:17, Christian König wrote:
>>> On 11/24/25 12:30, Pavel Begunkov wrote:
>>>> On 11/24/25 10:33, Christian König wrote:
>>>>> On 11/23/25 23:51, Pavel Begunkov wrote:
>>>>>> Picking up the work on supporting dmabuf in the read/write path.
>>>>>
>>>>> IIRC that work was completely stopped because it violated core dma_fence and DMA-buf rules and after some private discussion was considered not doable in general.
>>>>>
>>>>> Or am I mixing something up here?
>>>>
>>>> The time gap is purely due to me being busy. I wasn't CC'ed to those private
>>>> discussions you mentioned, but the v1 feedback was to use dynamic attachments
>>>> and avoid passing dma address arrays directly.
>>>>
>>>> https://lore.kernel.org/all/cover.1751035820.git.asml.silence@gmail.com/
>>>>
>>>> I'm lost on what part is not doable. Can you elaborate on the core
>>>> dma-fence dma-buf rules?
>>>
>>> I most likely mixed that up, in other words that was a different discussion.
>>>
>>> When you use dma_fences to indicate async completion of events you need to be super duper careful that you only do this for in flight events, have the fence creation in the right order etc...
>>
>> I'm curious, what can happen if there is new IO using a
>> move_notify()ed mapping, but let's say it's guaranteed to complete
>> strictly before dma_buf_unmap_attachment() and the fence is signaled?
>> Is there some loss of data or corruption that can happen?
> 
> The problem is that you can't guarantee that because you run into deadlocks.
> 
> As soon as a dma_fence is created and published by calling add_fence, memory management can loop back and depend on that fence.

I think I got the idea, thanks

> So you actually can't issue any new IO which might block the unmap operation.
> 
>>
>> sg_table = map_attach()         |
>> move_notify()                   |
>>    -> add_fence(fence)           |
>>                                  | issue_IO(sg_table)
>>                                  | // IO completed
>> unmap_attachment(sg_table)      |
>> signal_fence(fence)             |
>>
>>> For example once the fence is created you can't make any memory allocations any more, that's why we have this dance of reserving fence slots, creating the fence and then adding it.
>>
>> Looks I have some terminology gap here. By "memory allocations" you
>> don't mean kmalloc, right? I assume it's about new users of the
>> mapping.
> 
> kmalloc() as well as get_free_page() is exactly what is meant here.
> 
> You can't make any memory allocation any more after creating/publishing a dma_fence.

I see, thanks

> The usual flow is the following:
> 
> 1. Lock dma_resv object
> 2. Prepare I/O operation, make all memory allocations etc...
> 3. Allocate dma_fence object
> 4. Push I/O operation to the HW, making sure that you don't allocate memory any more.
> 5. Call dma_resv_add_fence(with fence allocate in #3).
> 6. Unlock dma_resv object
> 
> If you stray from that you most likely end up in a deadlock sooner or later.
-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v2 01/11] file: add callback for pre-mapping dmabuf
  2025-11-23 22:51 ` [RFC v2 01/11] file: add callback for pre-mapping dmabuf Pavel Begunkov
@ 2025-12-04 10:42   ` Christoph Hellwig
  2025-12-04 10:46   ` Christian König
  1 sibling, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2025-12-04 10:42 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: linux-block, io-uring, Vishal Verma, tushar.gohad, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, linux-kernel, linux-nvme, linux-fsdevel,
	linux-media, dri-devel, linaro-mm-sig

On Sun, Nov 23, 2025 at 10:51:21PM +0000, Pavel Begunkov wrote:
> +static inline struct dma_token *
> +dma_token_create(struct file *file, struct dma_token_params *params)
> +{
> +	struct dma_token *res;
> +
> +	if (!file->f_op->dma_map)
> +		return ERR_PTR(-EOPNOTSUPP);
> +	res = file->f_op->dma_map(file, params);

Calling the file operation ->dma_map feels really misleading.

create_token as in the function name is already much better, but
it really is not just dma, but dmabuf related, and that should really
be encoded in the name.

Also why not pass the dmabuf and direction directly instead of wrapping
it in the odd params struct making the whole thing hard to follow?
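
E.g. something along these lines, just to illustrate the suggestion (the
op name is made up, not a concrete proposal):

	struct dma_token *(*dmabuf_create_token)(struct file *file,
						 struct dma_buf *dmabuf,
						 enum dma_data_direction dir);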

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v2 02/11] iov_iter: introduce iter type for pre-registered dma
  2025-11-23 22:51 ` [RFC v2 02/11] iov_iter: introduce iter type for pre-registered dma Pavel Begunkov
@ 2025-12-04 10:43   ` Christoph Hellwig
  0 siblings, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2025-12-04 10:43 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: linux-block, io-uring, Vishal Verma, tushar.gohad, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, linux-kernel, linux-nvme, linux-fsdevel,
	linux-media, dri-devel, linaro-mm-sig

On Sun, Nov 23, 2025 at 10:51:22PM +0000, Pavel Begunkov wrote:
> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index 5b127043a151..1b22594ca35b 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -29,6 +29,7 @@ enum iter_type {
>  	ITER_FOLIOQ,
>  	ITER_XARRAY,
>  	ITER_DISCARD,
> +	ITER_DMA_TOKEN,

Please use DMABUF/dmabuf naming everywhere, this is about dmabufs and
not dma in general.

Otherwise this looks good.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v2 03/11] block: move around bio flagging helpers
  2025-11-23 22:51 ` [RFC v2 03/11] block: move around bio flagging helpers Pavel Begunkov
@ 2025-12-04 10:43   ` Christoph Hellwig
  0 siblings, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2025-12-04 10:43 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: linux-block, io-uring, Vishal Verma, tushar.gohad, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, linux-kernel, linux-nvme, linux-fsdevel,
	linux-media, dri-devel, linaro-mm-sig

On Sun, Nov 23, 2025 at 10:51:23PM +0000, Pavel Begunkov wrote:
> We'll need bio_flagged() earlier in bio.h in the next patch, move it
> together with all related helpers, and mark the bio_flagged()'s bio
> argument as const.
> 
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

Maybe ask Jens to queue it up ASAP to get it out of the way?


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v2 01/11] file: add callback for pre-mapping dmabuf
  2025-11-23 22:51 ` [RFC v2 01/11] file: add callback for pre-mapping dmabuf Pavel Begunkov
  2025-12-04 10:42   ` Christoph Hellwig
@ 2025-12-04 10:46   ` Christian König
  2025-12-04 11:07     ` Christoph Hellwig
  1 sibling, 1 reply; 35+ messages in thread
From: Christian König @ 2025-12-04 10:46 UTC (permalink / raw)
  To: Pavel Begunkov, linux-block, io-uring
  Cc: Vishal Verma, tushar.gohad, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal, linux-kernel,
	linux-nvme, linux-fsdevel, linux-media, dri-devel, linaro-mm-sig

On 11/23/25 23:51, Pavel Begunkov wrote:
> Add a file callback that maps a dmabuf for the given file and returns
> an opaque token of type struct dma_token representing the mapping.

I'm really scratching my head over what you mean by that?

And why the heck would we need to pass a DMA-buf to a struct file?

Regards,
Christian.


> The
> implementation details are hidden from the caller, and the implementors
> are normally expected to extend the structure.
> 
> The callback callers will be able to pass the token with an IO request,
> which is implemented in the following patches as a new iterator type. The user
> should release the token once it's not needed by calling the provided
> release callback via appropriate helpers.
> 
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  include/linux/dma_token.h | 35 +++++++++++++++++++++++++++++++++++
>  include/linux/fs.h        |  4 ++++
>  2 files changed, 39 insertions(+)
>  create mode 100644 include/linux/dma_token.h
> 
> diff --git a/include/linux/dma_token.h b/include/linux/dma_token.h
> new file mode 100644
> index 000000000000..9194b34282c2
> --- /dev/null
> +++ b/include/linux/dma_token.h
> @@ -0,0 +1,35 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_DMA_TOKEN_H
> +#define _LINUX_DMA_TOKEN_H
> +
> +#include <linux/dma-buf.h>
> +
> +struct dma_token_params {
> +	struct dma_buf			*dmabuf;
> +	enum dma_data_direction		dir;
> +};
> +
> +struct dma_token {
> +	void (*release)(struct dma_token *);
> +};
> +
> +static inline void dma_token_release(struct dma_token *token)
> +{
> +	token->release(token);
> +}
> +
> +static inline struct dma_token *
> +dma_token_create(struct file *file, struct dma_token_params *params)
> +{
> +	struct dma_token *res;
> +
> +	if (!file->f_op->dma_map)
> +		return ERR_PTR(-EOPNOTSUPP);
> +	res = file->f_op->dma_map(file, params);
> +
> +	WARN_ON_ONCE(!IS_ERR(res) && !res->release);
> +
> +	return res;
> +}
> +
> +#endif
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index c895146c1444..0ce9a53fabec 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2262,6 +2262,8 @@ struct dir_context {
>  struct iov_iter;
>  struct io_uring_cmd;
>  struct offset_ctx;
> +struct dma_token;
> +struct dma_token_params;
>  
>  typedef unsigned int __bitwise fop_flags_t;
>  
> @@ -2309,6 +2311,8 @@ struct file_operations {
>  	int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
>  				unsigned int poll_flags);
>  	int (*mmap_prepare)(struct vm_area_desc *);
> +	struct dma_token *(*dma_map)(struct file *,
> +				     struct dma_token_params *);
>  } __randomize_layout;
>  
>  /* Supports async buffered reads */


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v2 04/11] block: introduce dma token backed bio type
  2025-11-23 22:51 ` [RFC v2 04/11] block: introduce dma token backed bio type Pavel Begunkov
@ 2025-12-04 10:48   ` Christoph Hellwig
  0 siblings, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2025-12-04 10:48 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: linux-block, io-uring, Vishal Verma, tushar.gohad, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, linux-kernel, linux-nvme, linux-fsdevel,
	linux-media, dri-devel, linaro-mm-sig

> diff --git a/block/bio.c b/block/bio.c
> index 7b13bdf72de0..8793f1ee559d 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -843,6 +843,11 @@ static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
>  		bio_clone_blkg_association(bio, bio_src);
>  	}
>  
> +	if (bio_flagged(bio_src, BIO_DMA_TOKEN)) {
> +		bio->dma_token = bio_src->dma_token;
> +		bio_set_flag(bio, BIO_DMA_TOKEN);
> +	}

Historically __bio_clone itself does not clone the payload, just the
bio.  But we got rid of the callers that want to clone a bio but not
the payload a long time ago.

I'd suggest a prep patch that moves assigning bi_io_vec from
bio_alloc_clone and bio_init_clone into __bio_clone, and given that they
are the same field that'll take care of the dma token as well.
Alternatively do it in an if/else that the compiler will hopefully
optimize away.
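
I.e. roughly, assuming the token really does share storage with
bi_io_vec:

	/* in __bio_clone() */
	bio->bi_io_vec = bio_src->bi_io_vec;

and then dropping the equivalent assignments from bio_alloc_clone() and
bio_init_clone().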

> @@ -1349,6 +1366,10 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
>  		bio_iov_bvec_set(bio, iter);
>  		iov_iter_advance(iter, bio->bi_iter.bi_size);
>  		return 0;
> +	} else if (iov_iter_is_dma_token(iter)) {

No else after a return please.

> +++ b/block/blk-merge.c
> @@ -328,6 +328,29 @@ int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
>  	unsigned nsegs = 0, bytes = 0, gaps = 0;
>  	struct bvec_iter iter;
>  
> +	if (bio_flagged(bio, BIO_DMA_TOKEN)) {

Please split the dmabuf logic into a self-contained
helper here.

> +		int offset = offset_in_page(bio->bi_iter.bi_bvec_done);
> +
> +		nsegs = ALIGN(bio->bi_iter.bi_size + offset, PAGE_SIZE);
> +		nsegs >>= PAGE_SHIFT;

Why are we hardcoding PAGE_SIZE based "segments" here?

> +
> +		if (offset & lim->dma_alignment || bytes & len_align_mask)
> +			return -EINVAL;
> +
> +		if (bio->bi_iter.bi_size > max_bytes) {
> +			bytes = max_bytes;
> +			nsegs = (bytes + offset) >> PAGE_SHIFT;
> +			goto split;
> +		} else if (nsegs > lim->max_segments) {

No else after a goto either.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v2 05/11] block: add infra to handle dmabuf tokens
  2025-11-23 22:51 ` [RFC v2 05/11] block: add infra to handle dmabuf tokens Pavel Begunkov
  2025-11-24 13:38   ` Anuj gupta
@ 2025-12-04 10:56   ` Christoph Hellwig
  2025-12-04 13:08   ` Christoph Hellwig
  2 siblings, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2025-12-04 10:56 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: linux-block, io-uring, Vishal Verma, tushar.gohad, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, linux-kernel, linux-nvme, linux-fsdevel,
	linux-media, dri-devel, linaro-mm-sig

On Sun, Nov 23, 2025 at 10:51:25PM +0000, Pavel Begunkov wrote:
> Add blk-mq infrastructure to handle dmabuf tokens. There are two main

Please spell out infrastructure in the subject as well.

> +struct dma_token *blkdev_dma_map(struct file *file,
> +				 struct dma_token_params *params)
> +{
> +	struct request_queue *q = bdev_get_queue(file_bdev(file));
> +
> +	if (!(file->f_flags & O_DIRECT))
> +		return ERR_PTR(-EINVAL);

Shouldn't the O_DIRECT check be in the caller?

> +++ b/block/blk-mq-dma-token.c

Missing SPDX and Copyright statement.

> @@ -0,0 +1,236 @@
> +#include <linux/blk-mq-dma-token.h>
> +#include <linux/dma-resv.h>
> +
> +struct blk_mq_dma_fence {
> +	struct dma_fence base;
> +	spinlock_t lock;
> +};

And a high-level comment explaining the fencing logic would be nice
as well.

> +	struct blk_mq_dma_map *map = container_of(ref, struct blk_mq_dma_map, refs);

Overly long line.

> +static struct blk_mq_dma_map *blk_mq_alloc_dma_mapping(struct blk_mq_dma_token *token)

Another one.  Also kinda inconsistent between _map in the data structure
and _mapping in the function name.

> +static inline
> +struct blk_mq_dma_map *blk_mq_get_token_map(struct blk_mq_dma_token *token)

Really odd return value / scope formatting.

> +{
> +	struct blk_mq_dma_map *map;
> +
> +	guard(rcu)();
> +
> +	map = rcu_dereference(token->map);
> +	if (unlikely(!map || !percpu_ref_tryget_live_rcu(&map->refs)))
> +		return NULL;
> +	return map;

Please use good old rcu_read_unlock to make this readable.

> +	guard(mutex)(&token->mapping_lock);

Same.

> +
> +	map = blk_mq_get_token_map(token);
> +	if (map)
> +		return map;
> +
> +	map = blk_mq_alloc_dma_mapping(token);
> +	if (IS_ERR(map))
> +		return NULL;
> +
> +	dma_resv_lock(dmabuf->resv, NULL);
> +	ret = dma_resv_wait_timeout(dmabuf->resv, DMA_RESV_USAGE_BOOKKEEP,
> +				    true, MAX_SCHEDULE_TIMEOUT);
> +	ret = ret ? ret : -ETIME;

	if (!ret)
		ret = -ETIME;

> +blk_status_t blk_rq_assign_dma_map(struct request *rq,
> +				   struct blk_mq_dma_token *token)
> +{
> +	struct blk_mq_dma_map *map;
> +
> +	map = blk_mq_get_token_map(token);
> +	if (map)
> +		goto complete;
> +
> +	if (rq->cmd_flags & REQ_NOWAIT)
> +		return BLK_STS_AGAIN;
> +
> +	map = blk_mq_create_dma_map(token);
> +	if (IS_ERR(map))
> +		return BLK_STS_RESOURCE;

Having a few comments that say this is creating the map lazily
would probably help the reader.  Also why not keep the !map
case in the branch, as the map case should be the fast path and
thus usually be straight line in the function?

> +void blk_mq_dma_map_move_notify(struct blk_mq_dma_token *token)
> +{
> +	blk_mq_dma_map_remove(token);
> +}

Is there a good reason for having this blk_mq_dma_map_move_notify
wrapper?

> +	if (bio_flagged(bio, BIO_DMA_TOKEN)) {
> +		struct blk_mq_dma_token *token;
> +		blk_status_t ret;
> +
> +		token = dma_token_to_blk_mq(bio->dma_token);
> +		ret = blk_rq_assign_dma_map(rq, token);
> +		if (ret) {
> +			if (ret == BLK_STS_AGAIN) {
> +				bio_wouldblock_error(bio);
> +			} else {
> +				bio->bi_status = BLK_STS_RESOURCE;
> +				bio_endio(bio);
> +			}
> +			goto queue_exit;
> +		}
> +	}

Any reason to not just keep the dma_token_to_blk_mq?  Also why is this
overriding non-BLK_STS_AGAIN errors with BLK_STS_RESOURCE?

(I really wish we could make all BLK_STS_AGAIN errors be quiet without
the explicit setting of BIO_QUIET, which is a bit annoying, but that's
not for this patch).

> +static inline
> +struct blk_mq_dma_token *dma_token_to_blk_mq(struct dma_token *token)

More odd formatting.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v2 06/11] nvme-pci: add support for dmabuf reggistration
  2025-11-23 22:51 ` [RFC v2 06/11] nvme-pci: add support for dmabuf reggistration Pavel Begunkov
  2025-11-24 13:40   ` Anuj gupta
@ 2025-12-04 11:00   ` Christoph Hellwig
  2025-12-04 19:07     ` Keith Busch
  1 sibling, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2025-12-04 11:00 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: linux-block, io-uring, Vishal Verma, tushar.gohad, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, linux-kernel, linux-nvme, linux-fsdevel,
	linux-media, dri-devel, linaro-mm-sig

Splitting this trivial stub from the substantial parts in the next patch
feels odd.  Please merge them.

(and better commit logs and comments really would be useful for others
to understand what you've done).

> +const struct dma_buf_attach_ops nvme_dmabuf_importer_ops = {
> +	.move_notify = nvme_dmabuf_move_notify,
> +	.allow_peer2peer = true,
> +};

Tab-align the =, please.

> +static int nvme_init_dma_token(struct request_queue *q,
> +				struct blk_mq_dma_token *token)
> +{
> +	struct dma_buf_attachment *attach;
> +	struct nvme_ns *ns = q->queuedata;
> +	struct nvme_dev *dev = to_nvme_dev(ns->ctrl);
> +	struct dma_buf *dmabuf = token->dmabuf;
> +
> +	if (dmabuf->size % NVME_CTRL_PAGE_SIZE)
> +		return -EINVAL;

Why do you care about alignment to the controller page size?

> +	for_each_sgtable_dma_sg(sgt, sg, tmp) {
> +		dma_addr_t dma = sg_dma_address(sg);
> +		unsigned long sg_len = sg_dma_len(sg);
> +
> +		while (sg_len) {
> +			dma_list[i++] = dma;
> +			dma += NVME_CTRL_PAGE_SIZE;
> +			sg_len -= NVME_CTRL_PAGE_SIZE;
> +		}
> +	}

Why does this build controller-page-sized chunks?


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v2 07/11] nvme-pci: implement dma_token backed requests
  2025-11-23 22:51 ` [RFC v2 07/11] nvme-pci: implement dma_token backed requests Pavel Begunkov
@ 2025-12-04 11:04   ` Christoph Hellwig
  0 siblings, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2025-12-04 11:04 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: linux-block, io-uring, Vishal Verma, tushar.gohad, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, linux-kernel, linux-nvme, linux-fsdevel,
	linux-media, dri-devel, linaro-mm-sig

> +static void nvme_sync_dma(struct nvme_dev *nvme_dev, struct request *req,
> +			  enum dma_data_direction dir)
> +{
> +	struct blk_mq_dma_map *map = req->dma_map;
> +	int length = blk_rq_payload_bytes(req);
> +	bool for_cpu = dir == DMA_FROM_DEVICE;
> +	struct device *dev = nvme_dev->dev;
> +	dma_addr_t *dma_list = map->private;
> +	struct bio *bio = req->bio;
> +	int offset, map_idx;
> +
> +	offset = bio->bi_iter.bi_bvec_done;
> +	map_idx = offset / NVME_CTRL_PAGE_SIZE;
> +	length += offset & (NVME_CTRL_PAGE_SIZE - 1);
> +
> +	while (length > 0) {
> +		u64 dma_addr = dma_list[map_idx++];
> +
> +		if (for_cpu)
> +			__dma_sync_single_for_cpu(dev, dma_addr,
> +						  NVME_CTRL_PAGE_SIZE, dir);
> +		else
> +			__dma_sync_single_for_device(dev, dma_addr,
> +						     NVME_CTRL_PAGE_SIZE, dir);
> +		length -= NVME_CTRL_PAGE_SIZE;
> +	}

This looks really inefficient.  Usually the ranges in the dmabuf should
be much larger than a controller page.
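
Rough idea of what that could look like while keeping the fixed-size
dma_list (untested sketch; assumes adjacent contiguous entries may be
synced as a single range):

	while (length > 0) {
		dma_addr_t start = dma_list[map_idx++];
		unsigned int run = NVME_CTRL_PAGE_SIZE;

		/* extend the run while the next entry is dma-contiguous */
		while (run < length && dma_list[map_idx] == start + run) {
			run += NVME_CTRL_PAGE_SIZE;
			map_idx++;
		}

		if (for_cpu)
			__dma_sync_single_for_cpu(dev, start, run, dir);
		else
			__dma_sync_single_for_device(dev, start, run, dir);
		length -= run;
	}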


> +static void nvme_unmap_premapped_data(struct nvme_dev *dev,
> +				      struct request *req)
> +{
> +	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> +
> +	if (rq_data_dir(req) == READ)
> +		nvme_sync_dma(dev, req, DMA_FROM_DEVICE);
> +	if (!(iod->flags & IOD_SINGLE_SEGMENT))
> +		nvme_free_descriptors(req);
> +}

This doesn't really unmap anything :)

Also the dma ownership rules say that you always need to call the
sync_to_device helpers before I/O and the sync_to_cpu helpers after I/O,
no matter whether it is a read or a write.  The implementation then
makes them a no-op where possible.
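
I.e. roughly (sketch only; dma_addr/len stand for each mapped range
behind the token):

	enum dma_data_direction dir = rq_dma_dir(req);

	/* submission path: hand the buffer to the device */
	__dma_sync_single_for_device(dev, dma_addr, len, dir);

	/* completion path: hand it back to the CPU */
	__dma_sync_single_for_cpu(dev, dma_addr, len, dir);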

> +
> +	offset = bio->bi_iter.bi_bvec_done;
> +	map_idx = offset / NVME_CTRL_PAGE_SIZE;
> +	offset &= (NVME_CTRL_PAGE_SIZE - 1);
> +
> +	prp1_dma = dma_list[map_idx++] + offset;
> +
> +	length -= (NVME_CTRL_PAGE_SIZE - offset);
> +	if (length <= 0) {
> +		prp2_dma = 0;

Urgg, why is this building PRPs instead of SGLs?  Yes, SGLs are an
optional feature, but for devices where you want to micro-optimize
like this I think we should simply require them.  This should cut
down on both the memory use and the amount of special mapping code.
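
For a single contiguous range that boils down to something like (sketch
only; cmnd is the nvme_rw_command being built, and a multi-range mapping
would need an SGL segment list):

	cmnd->flags = NVME_CMD_SGL_METABUF;
	cmnd->dptr.sgl.addr = cpu_to_le64(prp1_dma);
	cmnd->dptr.sgl.length = cpu_to_le32(blk_rq_payload_bytes(req));
	cmnd->dptr.sgl.type = NVME_SGL_FMT_DATA_DESC << 4;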


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v2 01/11] file: add callback for pre-mapping dmabuf
  2025-12-04 10:46   ` Christian König
@ 2025-12-04 11:07     ` Christoph Hellwig
  2025-12-04 11:09       ` Christian König
  0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2025-12-04 11:07 UTC (permalink / raw)
  To: Christian König
  Cc: Pavel Begunkov, linux-block, io-uring, Vishal Verma, tushar.gohad,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
	linux-kernel, linux-nvme, linux-fsdevel, linux-media, dri-devel,
	linaro-mm-sig

On Thu, Dec 04, 2025 at 11:46:45AM +0100, Christian König wrote:
> On 11/23/25 23:51, Pavel Begunkov wrote:
> > Add a file callback that maps a dmabuf for the given file and returns
> > an opaque token of type struct dma_token representing the mapping.
> 
> I'm really scratching my head what you mean with that?
> 
> And why the heck would we need to pass a DMA-buf to a struct file?

I find the naming pretty confusing as well.  But what this does is to
tell the file system/driver that it should expect a future
read_iter/write_iter operation that takes data from / puts data into
the dmabuf passed to this operation.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v2 01/11] file: add callback for pre-mapping dmabuf
  2025-12-04 11:07     ` Christoph Hellwig
@ 2025-12-04 11:09       ` Christian König
  2025-12-04 13:10         ` Christoph Hellwig
  0 siblings, 1 reply; 35+ messages in thread
From: Christian König @ 2025-12-04 11:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pavel Begunkov, linux-block, io-uring, Vishal Verma, tushar.gohad,
	Keith Busch, Jens Axboe, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal, linux-kernel,
	linux-nvme, linux-fsdevel, linux-media, dri-devel, linaro-mm-sig

On 12/4/25 12:07, Christoph Hellwig wrote:
> On Thu, Dec 04, 2025 at 11:46:45AM +0100, Christian König wrote:
>> On 11/23/25 23:51, Pavel Begunkov wrote:
>>> Add a file callback that maps a dmabuf for the given file and returns
>>> an opaque token of type struct dma_token representing the mapping.
>>
>> I'm really scratching my head what you mean with that?
>>
>> And why the heck would we need to pass a DMA-buf to a struct file?
> 
> I find the naming pretty confusing as well.  But what this does is to
> tell the file system/driver that it should expect a future
> read_iter/write_iter operation that takes data from / puts data into
> the dmabuf passed to this operation.

That explanation makes much more sense.

The remaining question is: why does the underlying file system / driver need to know that it will get addresses from a DMA-buf?

Regards,
Christian.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v2 05/11] block: add infra to handle dmabuf tokens
  2025-11-23 22:51 ` [RFC v2 05/11] block: add infra to handle dmabuf tokens Pavel Begunkov
  2025-11-24 13:38   ` Anuj gupta
  2025-12-04 10:56   ` Christoph Hellwig
@ 2025-12-04 13:08   ` Christoph Hellwig
  2 siblings, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2025-12-04 13:08 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: linux-block, io-uring, Vishal Verma, tushar.gohad, Keith Busch,
	Jens Axboe, Sagi Grimberg, Alexander Viro, Christian Brauner,
	Andrew Morton, Sumit Semwal, Christian König, linux-kernel,
	linux-nvme, linux-fsdevel, linux-media, dri-devel, linaro-mm-sig

On Sun, Nov 23, 2025 at 10:51:25PM +0000, Pavel Begunkov wrote:
> +struct dma_token *blkdev_dma_map(struct file *file,
> +				 struct dma_token_params *params)

Given that this is a direct file operation instance, it should be
in block/fops.c.  If we do want a generic helper below it, it
should take a struct block_device instead.  But we can probably
defer that until a user for it shows up.
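
E.g. (sketch; blk_mq_dma_map_bdev is a made-up name for the lower-level
helper):

/* block/fops.c */
static struct dma_token *blkdev_dma_map(struct file *file,
					struct dma_token_params *params)
{
	struct block_device *bdev = I_BDEV(file->f_mapping->host);

	return blk_mq_dma_map_bdev(bdev, params);
}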


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v2 01/11] file: add callback for pre-mapping dmabuf
  2025-12-04 11:09       ` Christian König
@ 2025-12-04 13:10         ` Christoph Hellwig
  0 siblings, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2025-12-04 13:10 UTC (permalink / raw)
  To: Christian König
  Cc: Christoph Hellwig, Pavel Begunkov, linux-block, io-uring,
	Vishal Verma, tushar.gohad, Keith Busch, Jens Axboe,
	Sagi Grimberg, Alexander Viro, Christian Brauner, Andrew Morton,
	Sumit Semwal, linux-kernel, linux-nvme, linux-fsdevel,
	linux-media, dri-devel, linaro-mm-sig

On Thu, Dec 04, 2025 at 12:09:46PM +0100, Christian König wrote:
> > I find the naming pretty confusing as well.  But what this does is to
> > tell the file system/driver that it should expect a future
> > read_iter/write_iter operation that takes data from / puts data into
> > the dmabuf passed to this operation.
> 
> That explanation makes much more sense.
> 
> The remaining question is: why does the underlying file system / driver
> need to know that it will get addresses from a DMA-buf?

This eventually ends up calling dma_buf_dynamic_attach and provides
a way to find the dma_buf_attachment later in the I/O path.
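
Roughly like this (sketch compressing what patches 5/6 do;
importer_ops/importer_priv and the token->attach field are illustrative):

	attach = dma_buf_dynamic_attach(dmabuf, dma_dev,
					importer_ops, importer_priv);
	if (IS_ERR(attach))
		return ERR_CAST(attach);

	/* stashed in the token so the I/O path can find it later */
	token->attach = attach;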


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC v2 06/11] nvme-pci: add support for dmabuf registration
  2025-12-04 11:00   ` Christoph Hellwig
@ 2025-12-04 19:07     ` Keith Busch
  0 siblings, 0 replies; 35+ messages in thread
From: Keith Busch @ 2025-12-04 19:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pavel Begunkov, linux-block, io-uring, Vishal Verma, tushar.gohad,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, Alexander Viro,
	Christian Brauner, Andrew Morton, Sumit Semwal,
	Christian König, linux-kernel, linux-nvme, linux-fsdevel,
	linux-media, dri-devel, linaro-mm-sig

On Thu, Dec 04, 2025 at 03:00:02AM -0800, Christoph Hellwig wrote:
> Why do you care about alignment to the controller page size?
> 
> > +	for_each_sgtable_dma_sg(sgt, sg, tmp) {
> > +		dma_addr_t dma = sg_dma_address(sg);
> > +		unsigned long sg_len = sg_dma_len(sg);
> > +
> > +		while (sg_len) {
> > +			dma_list[i++] = dma;
> > +			dma += NVME_CTRL_PAGE_SIZE;
> > +			sg_len -= NVME_CTRL_PAGE_SIZE;
> > +		}
> > +	}
> 
> Why does this build controller-page-sized chunks?

I think the idea was that with fixed-size entries aligned to the
device's PRP unit it is efficient to jump to the correct index for any
given offset.  A vector of mixed sizes would require walking the list
to find the correct starting point, which we want to avoid.

This is similar to the way io_uring registered memory is set up, though
io_uring has extra logic to use the largest common contiguous segment
size, or even just one segment if it coalesces.  We could probably do
that too.

Anyway, that representation naturally translates to the PRP format, but
this could be done in the SGL format too.

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2025-12-04 19:07 UTC | newest]

Thread overview: 35+ messages
2025-11-23 22:51 [RFC v2 00/11] Add dmabuf read/write via io_uring Pavel Begunkov
2025-11-23 22:51 ` [RFC v2 01/11] file: add callback for pre-mapping dmabuf Pavel Begunkov
2025-12-04 10:42   ` Christoph Hellwig
2025-12-04 10:46   ` Christian König
2025-12-04 11:07     ` Christoph Hellwig
2025-12-04 11:09       ` Christian König
2025-12-04 13:10         ` Christoph Hellwig
2025-11-23 22:51 ` [RFC v2 02/11] iov_iter: introduce iter type for pre-registered dma Pavel Begunkov
2025-12-04 10:43   ` Christoph Hellwig
2025-11-23 22:51 ` [RFC v2 03/11] block: move around bio flagging helpers Pavel Begunkov
2025-12-04 10:43   ` Christoph Hellwig
2025-11-23 22:51 ` [RFC v2 04/11] block: introduce dma token backed bio type Pavel Begunkov
2025-12-04 10:48   ` Christoph Hellwig
2025-11-23 22:51 ` [RFC v2 05/11] block: add infra to handle dmabuf tokens Pavel Begunkov
2025-11-24 13:38   ` Anuj gupta
2025-12-04 10:56   ` Christoph Hellwig
2025-12-04 13:08   ` Christoph Hellwig
2025-11-23 22:51 ` [RFC v2 06/11] nvme-pci: add support for dmabuf registration Pavel Begunkov
2025-11-24 13:40   ` Anuj gupta
2025-12-04 11:00   ` Christoph Hellwig
2025-12-04 19:07     ` Keith Busch
2025-11-23 22:51 ` [RFC v2 07/11] nvme-pci: implement dma_token backed requests Pavel Begunkov
2025-12-04 11:04   ` Christoph Hellwig
2025-11-23 22:51 ` [RFC v2 08/11] io_uring/rsrc: add imu flags Pavel Begunkov
2025-11-23 22:51 ` [RFC v2 09/11] io_uring/rsrc: extended reg buffer registration Pavel Begunkov
2025-11-23 22:51 ` [RFC v2 10/11] io_uring/rsrc: add dmabuf-backed buffer registration Pavel Begunkov
2025-11-23 22:51 ` [RFC v2 11/11] io_uring/rsrc: implement dmabuf regbuf import Pavel Begunkov
2025-11-24 10:33 ` [RFC v2 00/11] Add dmabuf read/write via io_uring Christian König
2025-11-24 11:30   ` Pavel Begunkov
2025-11-24 14:17     ` Christian König
2025-11-25 13:52       ` Pavel Begunkov
2025-11-25 14:21         ` Christian König
2025-11-25 19:40           ` Pavel Begunkov
2025-11-24 13:35 ` Anuj gupta
2025-11-25 12:35   ` Pavel Begunkov
