linux-fsdevel.vger.kernel.org archive mirror
* [RFC PATCH 0/4] iov_iter: Add composite, scatterlist and skbuff iterator types
@ 2025-03-21 16:14 David Howells
  2025-03-21 16:14 ` [RFC PATCH 1/4] iov_iter: Move ITER_DISCARD and ITER_XARRAY iteration out-of-line David Howells
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: David Howells @ 2025-03-21 16:14 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: David Howells, Christian Brauner, Matthew Wilcox, Chuck Lever,
	Steve French, Ilya Dryomov, netfs, linux-fsdevel, linux-block,
	linux-mm, linux-kernel

Hi Leon,

Here are some patches that illustrate some of what I'm thinking of doing to
iov iterators.  Note that they are incomplete as I won't have time to
finish or test them before LSF, but I thought I'd post them for use as a
discussion point.

So the first thing I want to do is to move certain iterators out of line
from the main inline iteration multiplexor.  This inline code gets
duplicated at every call site, and each additional iterator type expands
all of those places.  So the DISCARD iterator (which is just a simple
short circuit) and the XARRAY iterator (which is obsolete) move out of
line.

Then I want to add three more types, for now:

 (1) ITER_ITERLIST.  A compound iterator that takes an array of iterators
     of disparate types.  The aim here is to make it possible to fabricate
     a network message in one go (say an RPC call) and pass it to a socket
     without the need for corking.

 (2) ITER_SCATTERLIST.  An iterator that takes a scatterlist.  This can be
     used to act as a bridge in converting interfaces that currently take a
     scatterlist (e.g. crypto).  It requires extra fields to be added to
     the iov_iter struct because chained scatterlists do not have a rewind
     capability, and so iov_iter_revert() must go back to the beginning and
     fast-forward.

 (3) ITER_SKBUFF.  An iterator that takes a network buffer.  The aim here
     is to render skb_to_sgvec() unnecessary for doing crypto operations.
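
To illustrate (2): because a chained scatterlist can only be walked
forwards, reversion has to be done by restarting from the head and
fast-forwarding.  Here is a rough userspace model of that bookkeeping (not
kernel code; the struct and function names are simplified stand-ins for
the iov_iter fields):

```c
#include <assert.h>
#include <stddef.h>

/* A forward-only segment chain, like a chained scatterlist: nodes have
 * no back pointers, so the iterator must remember the chain head and
 * the original count to be able to revert. */
struct seg {
	size_t len;
	struct seg *next;	/* forward-only, like sg_next() */
};

struct sg_iter {
	struct seg *head;	/* models sglist_head */
	struct seg *cur;	/* models sglist */
	size_t off;		/* offset within cur (iov_offset) */
	size_t count;		/* bytes remaining */
	size_t orig_count;	/* total at init time */
};

static void sg_iter_init(struct sg_iter *it, struct seg *head, size_t count)
{
	it->head = it->cur = head;
	it->off = 0;
	it->count = it->orig_count = count;
}

static void sg_iter_advance(struct sg_iter *it, size_t size)
{
	it->count -= size;
	size += it->off;
	while (it->cur && size >= it->cur->len) {
		size -= it->cur->len;
		it->cur = it->cur->next;
	}
	it->off = size;
}

/* Revert by starting over at the head and replaying the consumed
 * distance minus the unroll amount. */
static void sg_iter_revert(struct sg_iter *it, size_t unroll)
{
	size_t consumed = it->orig_count - it->count;

	it->cur = it->head;
	it->off = 0;
	it->count = it->orig_count;
	sg_iter_advance(it, consumed - unroll);
}
```

Recording the chain head and the original count in the iterator is what
makes the replay possible; that's the role of the extra fields mentioned
above.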

The patches can also be found here:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-experimental

David

David Howells (4):
  iov_iter: Move ITER_DISCARD and ITER_XARRAY iteration out-of-line
  iov_iter: Add an iterator-of-iterators
  iov_iter: Add a scatterlist iterator type
  iov_iter: Add a scatterlist iterator type [INCOMPLETE]

 include/linux/iov_iter.h |  77 +----
 include/linux/uio.h      |  37 +++
 lib/iov_iter.c           | 675 ++++++++++++++++++++++++++++++++++++++-
 3 files changed, 710 insertions(+), 79 deletions(-)


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [RFC PATCH 1/4] iov_iter: Move ITER_DISCARD and ITER_XARRAY iteration out-of-line
  2025-03-21 16:14 [RFC PATCH 0/4] iov_iter: Add composite, scatterlist and skbuff iterator types David Howells
@ 2025-03-21 16:14 ` David Howells
  2025-03-21 16:14 ` [RFC PATCH 2/4] iov_iter: Add an iterator-of-iterators David Howells
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: David Howells @ 2025-03-21 16:14 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: David Howells, Christian Brauner, Matthew Wilcox, Chuck Lever,
	Steve French, Ilya Dryomov, netfs, linux-fsdevel, linux-block,
	linux-mm, linux-kernel

Move ITER_DISCARD and ITER_XARRAY iteration out-of-line in preparation for
adding further iteration types that will also be handled out-of-line.

Signed-off-by: David Howells <dhowells@redhat.com>
---
 include/linux/iov_iter.h | 77 +++-----------------------------------
 lib/iov_iter.c           | 81 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 86 insertions(+), 72 deletions(-)

diff --git a/include/linux/iov_iter.h b/include/linux/iov_iter.h
index c4aa58032faf..0c47933df517 100644
--- a/include/linux/iov_iter.h
+++ b/include/linux/iov_iter.h
@@ -17,6 +17,9 @@ typedef size_t (*iov_step_f)(void *iter_base, size_t progress, size_t len,
 typedef size_t (*iov_ustep_f)(void __user *iter_base, size_t progress, size_t len,
 			      void *priv, void *priv2);
 
+size_t __iterate_and_advance2(struct iov_iter *iter, size_t len, void *priv,
+			      void *priv2, iov_ustep_f ustep, iov_step_f step);
+
 /*
  * Handle ITER_UBUF.
  */
@@ -195,72 +198,6 @@ size_t iterate_folioq(struct iov_iter *iter, size_t len, void *priv, void *priv2
 	return progress;
 }
 
-/*
- * Handle ITER_XARRAY.
- */
-static __always_inline
-size_t iterate_xarray(struct iov_iter *iter, size_t len, void *priv, void *priv2,
-		      iov_step_f step)
-{
-	struct folio *folio;
-	size_t progress = 0;
-	loff_t start = iter->xarray_start + iter->iov_offset;
-	pgoff_t index = start / PAGE_SIZE;
-	XA_STATE(xas, iter->xarray, index);
-
-	rcu_read_lock();
-	xas_for_each(&xas, folio, ULONG_MAX) {
-		size_t remain, consumed, offset, part, flen;
-
-		if (xas_retry(&xas, folio))
-			continue;
-		if (WARN_ON(xa_is_value(folio)))
-			break;
-		if (WARN_ON(folio_test_hugetlb(folio)))
-			break;
-
-		offset = offset_in_folio(folio, start + progress);
-		flen = min(folio_size(folio) - offset, len);
-
-		while (flen) {
-			void *base = kmap_local_folio(folio, offset);
-
-			part = min_t(size_t, flen,
-				     PAGE_SIZE - offset_in_page(offset));
-			remain = step(base, progress, part, priv, priv2);
-			kunmap_local(base);
-
-			consumed = part - remain;
-			progress += consumed;
-			len -= consumed;
-
-			if (remain || len == 0)
-				goto out;
-			flen -= consumed;
-			offset += consumed;
-		}
-	}
-
-out:
-	rcu_read_unlock();
-	iter->iov_offset += progress;
-	iter->count -= progress;
-	return progress;
-}
-
-/*
- * Handle ITER_DISCARD.
- */
-static __always_inline
-size_t iterate_discard(struct iov_iter *iter, size_t len, void *priv, void *priv2,
-		      iov_step_f step)
-{
-	size_t progress = len;
-
-	iter->count -= progress;
-	return progress;
-}
-
 /**
  * iterate_and_advance2 - Iterate over an iterator
  * @iter: The iterator to iterate over.
@@ -306,9 +243,7 @@ size_t iterate_and_advance2(struct iov_iter *iter, size_t len, void *priv,
 		return iterate_kvec(iter, len, priv, priv2, step);
 	if (iov_iter_is_folioq(iter))
 		return iterate_folioq(iter, len, priv, priv2, step);
-	if (iov_iter_is_xarray(iter))
-		return iterate_xarray(iter, len, priv, priv2, step);
-	return iterate_discard(iter, len, priv, priv2, step);
+	return __iterate_and_advance2(iter, len, priv, priv2, ustep, step);
 }
 
 /**
@@ -370,9 +305,7 @@ size_t iterate_and_advance_kernel(struct iov_iter *iter, size_t len, void *priv,
 		return iterate_kvec(iter, len, priv, priv2, step);
 	if (iov_iter_is_folioq(iter))
 		return iterate_folioq(iter, len, priv, priv2, step);
-	if (iov_iter_is_xarray(iter))
-		return iterate_xarray(iter, len, priv, priv2, step);
-	return iterate_discard(iter, len, priv, priv2, step);
+	return __iterate_and_advance2(iter, len, priv, priv2, NULL, step);
 }
 
 #endif /* _LINUX_IOV_ITER_H */
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 65f550cb5081..33a8746e593e 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1927,3 +1927,84 @@ ssize_t iov_iter_extract_pages(struct iov_iter *i,
 	return -EFAULT;
 }
 EXPORT_SYMBOL_GPL(iov_iter_extract_pages);
+
+/*
+ * Handle ITER_XARRAY.
+ */
+static __always_inline
+size_t iterate_xarray(struct iov_iter *iter, size_t len, void *priv, void *priv2,
+		      iov_step_f step)
+{
+	struct folio *folio;
+	size_t progress = 0;
+	loff_t start = iter->xarray_start + iter->iov_offset;
+	pgoff_t index = start / PAGE_SIZE;
+	XA_STATE(xas, iter->xarray, index);
+
+	rcu_read_lock();
+	xas_for_each(&xas, folio, ULONG_MAX) {
+		size_t remain, consumed, offset, part, flen;
+
+		if (xas_retry(&xas, folio))
+			continue;
+		if (WARN_ON(xa_is_value(folio)))
+			break;
+		if (WARN_ON(folio_test_hugetlb(folio)))
+			break;
+
+		offset = offset_in_folio(folio, start + progress);
+		flen = min(folio_size(folio) - offset, len);
+
+		while (flen) {
+			void *base = kmap_local_folio(folio, offset);
+
+			part = min_t(size_t, flen,
+				     PAGE_SIZE - offset_in_page(offset));
+			remain = step(base, progress, part, priv, priv2);
+			kunmap_local(base);
+
+			consumed = part - remain;
+			progress += consumed;
+			len -= consumed;
+
+			if (remain || len == 0)
+				goto out;
+			flen -= consumed;
+			offset += consumed;
+		}
+	}
+
+out:
+	rcu_read_unlock();
+	iter->iov_offset += progress;
+	iter->count -= progress;
+	return progress;
+}
+
+/*
+ * Handle ITER_DISCARD.
+ */
+static __always_inline
+size_t iterate_discard(struct iov_iter *iter, size_t len, void *priv, void *priv2,
+		      iov_step_f step)
+{
+	size_t progress = len;
+
+	iter->count -= progress;
+	return progress;
+}
+
+/*
+ * Out of line iteration for iterator types that don't need such fast handling.
+ */
+size_t __iterate_and_advance2(struct iov_iter *iter, size_t len, void *priv,
+			      void *priv2, iov_ustep_f ustep, iov_step_f step)
+{
+	if (iov_iter_is_discard(iter))
+		return iterate_discard(iter, len, priv, priv2, step);
+	if (iov_iter_is_xarray(iter))
+		return iterate_xarray(iter, len, priv, priv2, step);
+	WARN_ON(1);
+	return 0;
+}
+EXPORT_SYMBOL(__iterate_and_advance2);



* [RFC PATCH 2/4] iov_iter: Add an iterator-of-iterators
  2025-03-21 16:14 [RFC PATCH 0/4] iov_iter: Add composite, scatterlist and skbuff iterator types David Howells
  2025-03-21 16:14 ` [RFC PATCH 1/4] iov_iter: Move ITER_DISCARD and ITER_XARRAY iteration out-of-line David Howells
@ 2025-03-21 16:14 ` David Howells
  2025-03-21 16:14 ` [RFC PATCH 3/4] iov_iter: Add a scatterlist iterator type David Howells
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: David Howells @ 2025-03-21 16:14 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: David Howells, Christian Brauner, Matthew Wilcox, Chuck Lever,
	Steve French, Ilya Dryomov, netfs, linux-fsdevel, linux-block,
	linux-mm, linux-kernel, Trond Myklebust

Add a new I/O iterator type, ITER_ITERLIST, that allows iteration over a
series of I/O iterators, provided the iterators are all the same direction
(all ITER_SOURCE or all ITER_DEST) and none of them are themselves of
ITER_ITERLIST type (the iteration code recurses into each sub-iterator, so
nesting is disallowed to bound the recursion depth).

To make reversion possible, I've added an 'orig_count' member to the
iov_iter struct so that reversion of an ITER_ITERLIST can know when to
move backwards through the iter list.  It might make more sense to make the
iterator list element, say:

	struct itervec {
		struct iov_iter iter;
		size_t orig_count;
	};

rather than expanding struct iov_iter itself and have iov_iter_iterlist()
set vec[i].orig_count from vec[i].iter.count.

Also, for the moment, I've only permitted its use with source iterators
(eg. sendmsg).
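
A rough userspace model of the advance/revert bookkeeping (again, not the
kernel code; the types and names are simplified stand-ins) may make the
orig_count handling clearer:

```c
#include <assert.h>
#include <stddef.h>

/* Model of the ITER_ITERLIST bookkeeping: each element tracks how much
 * of the sub-iterator remains and what it started with (orig_count). */
struct sub {
	size_t count;		/* bytes left in this sub-iterator */
	size_t orig_count;	/* bytes it started with */
};

struct list_iter {
	struct sub *cur;	/* models i->iterlist */
	size_t nr_segs;
	size_t count;		/* total bytes left */
};

static void list_advance(struct list_iter *i, size_t size)
{
	i->count -= size;
	for (;;) {
		size_t part = size < i->cur->count ? size : i->cur->count;

		i->cur->count -= part;
		size -= part;
		if (!size)
			break;
		i->cur++;
		i->nr_segs--;
	}
}

static void list_revert(struct list_iter *i, size_t unroll)
{
	i->count += unroll;
	for (;;) {
		/* Bytes already consumed from the current sub-iterator;
		 * this is what orig_count exists to answer. */
		size_t used = i->cur->orig_count - i->cur->count;
		size_t part = unroll < used ? unroll : used;

		i->cur->count += part;
		unroll -= part;
		if (!unroll)
			break;
		i->cur--;
		i->nr_segs++;
	}
}
```

Without orig_count there would be no way to tell, when stepping backwards,
how far into each earlier sub-iterator the revert should stop.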

To use this, you allocate an array of iterators and point the list iterator
at it, e.g.:

	struct iov_iter iters[3];
	struct msghdr msg;

	iov_iter_bvec(&iters[0], ITER_SOURCE, &head_bv, 1,
		      sizeof(marker) + head->iov_len);
	iov_iter_xarray(&iters[1], ITER_SOURCE, xdr->pages,
			xdr->page_fpos, xdr->page_len);
	iov_iter_kvec(&iters[2], ITER_SOURCE, &tail_kv, 1,
		      tail->iov_len);
	iov_iter_iterlist(&msg.msg_iter, ITER_SOURCE, iters, 3, size);

This can be used by network filesystem protocols, such as sunrpc, to glue a
header and a trailer on to some data to form a message and then dump the
entire message onto the socket in a single go.

[!] Note: I'm not entirely sure that this is a good idea: the problem is
    that it's reasonably common practice to copy an iterator by direct
    assignment - and that works for the existing iterators... but not this
    one.  With the iterator-of-iterators, the list of iterators has to be
    modified if we recurse.  It's probably fine just for calling sendmsg()
    from network filesystems, but I'm not 100% sure of that.

Suggested-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: David Howells <dhowells@redhat.com>
---
 include/linux/uio.h |  15 +++++
 lib/iov_iter.c      | 158 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 172 insertions(+), 1 deletion(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 8ada84e85447..59a586333e1b 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -29,6 +29,7 @@ enum iter_type {
 	ITER_FOLIOQ,
 	ITER_XARRAY,
 	ITER_DISCARD,
+	ITER_ITERLIST,
 };
 
 #define ITER_SOURCE	1	// == WRITE
@@ -71,6 +72,7 @@ struct iov_iter {
 				const struct folio_queue *folioq;
 				struct xarray *xarray;
 				void __user *ubuf;
+				struct iov_iterlist *iterlist;
 			};
 			size_t count;
 		};
@@ -82,6 +84,11 @@ struct iov_iter {
 	};
 };
 
+struct iov_iterlist {
+	struct iov_iter	iter;
+	size_t		orig_count;
+};
+
 typedef __u16 uio_meta_flags_t;
 
 struct uio_meta {
@@ -149,6 +156,11 @@ static inline bool iov_iter_is_xarray(const struct iov_iter *i)
 	return iov_iter_type(i) == ITER_XARRAY;
 }
 
+static inline bool iov_iter_is_iterlist(const struct iov_iter *i)
+{
+	return iov_iter_type(i) == ITER_ITERLIST;
+}
+
 static inline unsigned char iov_iter_rw(const struct iov_iter *i)
 {
 	return i->data_source ? WRITE : READ;
@@ -302,6 +314,9 @@ void iov_iter_folio_queue(struct iov_iter *i, unsigned int direction,
 			  unsigned int first_slot, unsigned int offset, size_t count);
 void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
 		     loff_t start, size_t count);
+void iov_iter_iterlist(struct iov_iter *i, unsigned int direction,
+		       struct iov_iterlist *iterlist, unsigned long nr_segs,
+		       size_t count);
 ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
 			size_t maxsize, unsigned maxpages, size_t *start);
 ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 33a8746e593e..1d9190abfeb5 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -578,6 +578,19 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
 		iov_iter_folioq_advance(i, size);
 	} else if (iov_iter_is_discard(i)) {
 		i->count -= size;
+	} else if (iov_iter_is_iterlist(i)) {
+		i->count -= size;
+		for (;;) {
+			size_t part = umin(size, i->iterlist->iter.count);
+
+			if (part > 0)
+				iov_iter_advance(&i->iterlist->iter, part);
+			size -= part;
+			if (!size)
+				break;
+			i->iterlist++;
+			i->nr_segs--;
+		}
 	}
 }
 EXPORT_SYMBOL(iov_iter_advance);
@@ -608,6 +621,23 @@ static void iov_iter_folioq_revert(struct iov_iter *i, size_t unroll)
 	i->folioq = folioq;
 }
 
+static void iov_iter_revert_iterlist(struct iov_iter *i, size_t unroll)
+{
+	for (;;) {
+		struct iov_iterlist *il = i->iterlist;
+
+		size_t part = umin(unroll, il->orig_count - il->iter.count);
+
+		if (part > 0)
+			iov_iter_revert(&il->iter, part);
+		unroll -= part;
+		if (!unroll)
+			break;
+		i->iterlist--;
+		i->nr_segs++;
+	}
+}
+
 void iov_iter_revert(struct iov_iter *i, size_t unroll)
 {
 	if (!unroll)
@@ -617,6 +647,8 @@ void iov_iter_revert(struct iov_iter *i, size_t unroll)
 	i->count += unroll;
 	if (unlikely(iov_iter_is_discard(i)))
 		return;
+	if (unlikely(iov_iter_is_iterlist(i)))
+		return iov_iter_revert_iterlist(i, unroll);
 	if (unroll <= i->iov_offset) {
 		i->iov_offset -= unroll;
 		return;
@@ -663,6 +695,8 @@ EXPORT_SYMBOL(iov_iter_revert);
  */
 size_t iov_iter_single_seg_count(const struct iov_iter *i)
 {
+	if (iov_iter_is_iterlist(i))
+		i = &i->iterlist->iter;
 	if (i->nr_segs > 1) {
 		if (likely(iter_is_iovec(i) || iov_iter_is_kvec(i)))
 			return min(i->count, iter_iov(i)->iov_len - i->iov_offset);
@@ -787,6 +821,41 @@ void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count)
 }
 EXPORT_SYMBOL(iov_iter_discard);
 
+/**
+ * iov_iter_iterlist - Initialise an I/O iterator that is a list of iterators
+ * @iter: The iterator to initialise.
+ * @direction: The direction of the transfer.
+ * @iterlist: The list of iterators
+ * @nr_segs: The number of elements in the list
+ * @count: The size of the I/O buffer in bytes.
+ *
+ * Set up an I/O iterator that walks over an array of other iterators.  It's
+ * only available as a source iterator (for WRITE) and none of the iterators in
+ * the array can be of ITER_ITERLIST type to prevent infinite recursion.
+ */
+void iov_iter_iterlist(struct iov_iter *iter, unsigned int direction,
+		       struct iov_iterlist *iterlist, unsigned long nr_segs,
+		       size_t count)
+{
+	unsigned long i;
+
+	BUG_ON(direction != WRITE);
+	for (i = 0; i < nr_segs; i++) {
+		BUG_ON(iterlist[i].iter.iter_type == ITER_ITERLIST);
+		BUG_ON(iterlist[i].iter.data_source != direction);
+		iterlist[i].orig_count = iterlist[i].iter.count;
+	}
+
+	*iter = (struct iov_iter){
+		.iter_type	= ITER_ITERLIST,
+		.data_source	= true,
+		.count		= count,
+		.iterlist	= iterlist,
+		.nr_segs	= nr_segs,
+	};
+}
+EXPORT_SYMBOL(iov_iter_iterlist);
+
 static bool iov_iter_aligned_iovec(const struct iov_iter *i, unsigned addr_mask,
 				   unsigned len_mask)
 {
@@ -947,6 +1016,15 @@ unsigned long iov_iter_alignment(const struct iov_iter *i)
 	if (iov_iter_is_xarray(i))
 		return (i->xarray_start + i->iov_offset) | i->count;
 
+	if (iov_iter_is_iterlist(i)) {
+		unsigned long align = 0;
+		unsigned int j;
+
+		for (j = 0; j < i->nr_segs; j++)
+			align |= iov_iter_alignment(&i->iterlist[j].iter);
+		return align;
+	}
+
 	return 0;
 }
 EXPORT_SYMBOL(iov_iter_alignment);
@@ -1206,6 +1284,18 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		return iter_folioq_get_pages(i, pages, maxsize, maxpages, start);
 	if (iov_iter_is_xarray(i))
 		return iter_xarray_get_pages(i, pages, maxsize, maxpages, start);
+	if (iov_iter_is_iterlist(i)) {
+		ssize_t size;
+
+		while (!i->iterlist->iter.count) {
+			i->iterlist++;
+			i->nr_segs--;
+		}
+		size = __iov_iter_get_pages_alloc(&i->iterlist->iter,
+						  pages, maxsize, maxpages, start);
+		if (size > 0)
+			i->count -= size;
+		return size;
+	}
 	return -EFAULT;
 }
 
@@ -1274,6 +1364,21 @@ static int bvec_npages(const struct iov_iter *i, int maxpages)
 	return npages;
 }
 
+static int iterlist_npages(const struct iov_iter *i, int maxpages)
+{
+	const struct iov_iterlist *p;
+	ssize_t size = i->count;
+	int npages = 0;
+
+	for (p = i->iterlist; size; p++) {
+		size -= p->iter.count;
+		npages += iov_iter_npages(&p->iter, maxpages - npages);
+		if (unlikely(npages >= maxpages))
+			return maxpages;
+	}
+	return npages;
+}
+
 int iov_iter_npages(const struct iov_iter *i, int maxpages)
 {
 	if (unlikely(!i->count))
@@ -1298,6 +1403,8 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
 		int npages = DIV_ROUND_UP(offset + i->count, PAGE_SIZE);
 		return min(npages, maxpages);
 	}
+	if (iov_iter_is_iterlist(i))
+		return iterlist_npages(i, maxpages);
 	return 0;
 }
 EXPORT_SYMBOL(iov_iter_npages);
@@ -1309,11 +1416,14 @@ const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags)
 		return new->bvec = kmemdup(new->bvec,
 				    new->nr_segs * sizeof(struct bio_vec),
 				    flags);
-	else if (iov_iter_is_kvec(new) || iter_is_iovec(new))
+	if (iov_iter_is_kvec(new) || iter_is_iovec(new))
 		/* iovec and kvec have identical layout */
 		return new->__iov = kmemdup(new->__iov,
 				   new->nr_segs * sizeof(struct iovec),
 				   flags);
+	if (WARN_ON_ONCE(iov_iter_is_iterlist(old)))
+		/* Don't allow dup'ing of iterlist as the cleanup is complicated */
+		return NULL;
 	return NULL;
 }
 EXPORT_SYMBOL(dup_iter);
@@ -1924,6 +2034,23 @@ ssize_t iov_iter_extract_pages(struct iov_iter *i,
 		return iov_iter_extract_xarray_pages(i, pages, maxsize,
 						     maxpages, extraction_flags,
 						     offset0);
+	if (iov_iter_is_iterlist(i)) {
+		ssize_t size;
+
+		while (i->nr_segs && !i->iterlist->iter.count) {
+			i->iterlist++;
+			i->nr_segs--;
+		}
+		if (!i->nr_segs) {
+			WARN_ON_ONCE(i->count);
+			return 0;
+		}
+		size = iov_iter_extract_pages(&i->iterlist->iter,
+					      pages, maxsize, maxpages,
+					      extraction_flags, offset0);
+		if (size > 0)
+			i->count -= size;
+		return size;
+	}
 	return -EFAULT;
 }
 EXPORT_SYMBOL_GPL(iov_iter_extract_pages);
@@ -1994,6 +2121,33 @@ size_t iterate_discard(struct iov_iter *iter, size_t len, void *priv, void *priv
 	return progress;
 }
 
+/*
+ * Handle iteration over ITER_ITERLIST.
+ */
+static size_t iterate_iterlist(struct iov_iter *iter, size_t len, void *priv, void *priv2,
+			       iov_ustep_f ustep, iov_step_f step)
+{
+	struct iov_iterlist *p = iter->iterlist;
+	size_t progress = 0;
+
+	do {
+		size_t consumed;
+
+		consumed = iterate_and_advance2(&p->iter, len, priv, priv2, ustep, step);
+
+		len -= consumed;
+		progress += consumed;
+		if (p->iter.count)
+			break;
+		p++;
+	} while (len);
+
+	iter->nr_segs -= p - iter->iterlist;
+	iter->iterlist = p;
+	iter->count -= progress;
+	return progress;
+}
+
 /*
  * Out of line iteration for iterator types that don't need such fast handling.
  */
@@ -2004,6 +2158,8 @@ size_t __iterate_and_advance2(struct iov_iter *iter, size_t len, void *priv,
 		return iterate_discard(iter, len, priv, priv2, step);
 	if (iov_iter_is_xarray(iter))
 		return iterate_xarray(iter, len, priv, priv2, step);
+	if (iov_iter_is_iterlist(iter))
+		return iterate_iterlist(iter, len, priv, priv2, ustep, step);
 	WARN_ON(1);
 	return 0;
 }



* [RFC PATCH 3/4] iov_iter: Add a scatterlist iterator type
  2025-03-21 16:14 [RFC PATCH 0/4] iov_iter: Add composite, scatterlist and skbuff iterator types David Howells
  2025-03-21 16:14 ` [RFC PATCH 1/4] iov_iter: Move ITER_DISCARD and ITER_XARRAY iteration out-of-line David Howells
  2025-03-21 16:14 ` [RFC PATCH 2/4] iov_iter: Add an iterator-of-iterators David Howells
@ 2025-03-21 16:14 ` David Howells
  2025-03-21 16:14 ` [RFC PATCH 4/4] iov_iter: Add a scatterlist iterator type [INCOMPLETE] David Howells
  2025-03-23  6:21 ` [RFC PATCH 0/4] iov_iter: Add composite, scatterlist and skbuff iterator types Christoph Hellwig
  4 siblings, 0 replies; 9+ messages in thread
From: David Howells @ 2025-03-21 16:14 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: David Howells, Christian Brauner, Matthew Wilcox, Chuck Lever,
	Steve French, Ilya Dryomov, netfs, linux-fsdevel, linux-block,
	linux-mm, linux-kernel

Add an iterator type that can iterate over a scatterlist.  This can be used
as a bridge to help convert things that take scatterlists into things that
take I/O iterators.
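
As a rough illustration of the bridging idea (a userspace sketch, not the
kernel API; all names here are illustrative): a consumer written against a
step-function iteration interface, as iterate_and_advance2() provides, can
be fed directly from scatterlist-style (base, length) segments without an
sgvec conversion:

```c
#include <assert.h>
#include <stddef.h>

/* A scatterlist-like segment: a base pointer and a length. */
struct segment {
	const void *base;
	size_t len;
};

/* Step function: consume up to len bytes starting at base, return how
 * many bytes were NOT consumed (0 means "took everything"). */
typedef size_t (*step_fn)(const void *base, size_t progress, size_t len,
			  void *priv);

/* Drive a step function across the segments, stopping early if the
 * step function declines some of the data. */
static size_t iterate_segments(const struct segment *segs, size_t nsegs,
			       size_t len, void *priv, step_fn step)
{
	size_t progress = 0;

	for (size_t n = 0; n < nsegs && len; n++) {
		size_t part = segs[n].len < len ? segs[n].len : len;
		size_t remain = step(segs[n].base, progress, part, priv);
		size_t consumed = part - remain;

		progress += consumed;
		len -= consumed;
		if (remain)
			break;
	}
	return progress;
}

/* Example consumer: a toy byte-sum standing in for a crypto step. */
static size_t sum_step(const void *base, size_t progress, size_t len,
		       void *priv)
{
	unsigned long *sum = priv;
	const unsigned char *p = base;

	(void)progress;
	for (size_t k = 0; k < len; k++)
		*sum += p[k];
	return 0;	/* consumed everything */
}
```

With an ITER_SCATTERLIST type, the same consumer code works whether the
data arrives as iovecs, bvecs or a scatterlist.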

Signed-off-by: David Howells <dhowells@redhat.com>
---
 include/linux/uio.h |  12 ++
 lib/iov_iter.c      | 315 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 321 insertions(+), 6 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 59a586333e1b..0e50f4af6877 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -12,6 +12,7 @@
 
 struct page;
 struct folio_queue;
+struct scatterlist;
 
 typedef unsigned int __bitwise iov_iter_extraction_t;
 
@@ -30,6 +31,7 @@ enum iter_type {
 	ITER_XARRAY,
 	ITER_DISCARD,
 	ITER_ITERLIST,
+	ITER_SCATTERLIST,
 };
 
 #define ITER_SOURCE	1	// == WRITE
@@ -46,6 +48,7 @@ struct iov_iter {
 	bool nofault;
 	bool data_source;
 	size_t iov_offset;
+	size_t orig_count;
 	/*
 	 * Hack alert: overlay ubuf_iovec with iovec + count, so
 	 * that the members resolve correctly regardless of the type
@@ -73,11 +76,13 @@ struct iov_iter {
 				struct xarray *xarray;
 				void __user *ubuf;
 				struct iov_iterlist *iterlist;
+				struct scatterlist *sglist;
 			};
 			size_t count;
 		};
 	};
 	union {
+		struct scatterlist *sglist_head;
 		unsigned long nr_segs;
 		u8 folioq_slot;
 		loff_t xarray_start;
@@ -161,6 +166,11 @@ static inline bool iov_iter_is_iterlist(const struct iov_iter *i)
 	return iov_iter_type(i) == ITER_ITERLIST;
 }
 
+static inline bool iov_iter_is_scatterlist(const struct iov_iter *i)
+{
+	return iov_iter_type(i) == ITER_SCATTERLIST;
+}
+
 static inline unsigned char iov_iter_rw(const struct iov_iter *i)
 {
 	return i->data_source ? WRITE : READ;
@@ -317,6 +327,8 @@ void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *
 void iov_iter_iterlist(struct iov_iter *i, unsigned int direction,
 		       struct iov_iterlist *iterlist, unsigned long nr_segs,
 		       size_t count);
+void iov_iter_scatterlist(struct iov_iter *i, unsigned int direction,
+			  struct scatterlist *sglist, size_t count);
 ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
 			size_t maxsize, unsigned maxpages, size_t *start);
 ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 1d9190abfeb5..ed9859af3c5d 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -562,6 +562,26 @@ static void iov_iter_folioq_advance(struct iov_iter *i, size_t size)
 	i->folioq = folioq;
 }
 
+static void iov_iter_scatterlist_advance(struct iov_iter *i, size_t size)
+{
+	struct scatterlist *sg;
+
+	if (!i->count)
+		return;
+	i->count -= size;
+
+	size += i->iov_offset;
+
+	for (sg = i->sglist; sg; sg = sg_next(sg)) {
+		if (likely(size < sg->length))
+			break;
+		size -= sg->length;
+	}
+	WARN_ON(!sg && size > 0);
+	i->iov_offset = size;
+	i->sglist = sg;
+}
+
 void iov_iter_advance(struct iov_iter *i, size_t size)
 {
 	if (unlikely(i->count < size))
@@ -591,6 +611,8 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
 			i->iterlist++;
 			i->nr_segs--;
 		}
+	} else if (iov_iter_is_scatterlist(i)) {
+		iov_iter_scatterlist_advance(i, size);
 	}
 }
 EXPORT_SYMBOL(iov_iter_advance);
@@ -638,6 +660,15 @@ static void iov_iter_revert_iterlist(struct iov_iter *i, size_t unroll)
 	}
 }
 
+static void iov_iter_revert_scatterlist(struct iov_iter *i)
+{
+	size_t skip = i->orig_count - i->count;
+
+	i->sglist = i->sglist_head;
+	i->count = i->orig_count;
+	iov_iter_advance(i, skip);
+}
+
 void iov_iter_revert(struct iov_iter *i, size_t unroll)
 {
 	if (!unroll)
@@ -649,6 +680,8 @@ void iov_iter_revert(struct iov_iter *i, size_t unroll)
 		return;
 	if (unlikely(iov_iter_is_iterlist(i)))
 		return iov_iter_revert_iterlist(i, unroll);
+	if (unlikely(iov_iter_is_scatterlist(i)))
+		return iov_iter_revert_scatterlist(i);
 	if (unroll <= i->iov_offset) {
 		i->iov_offset -= unroll;
 		return;
@@ -706,6 +739,8 @@ size_t iov_iter_single_seg_count(const struct iov_iter *i)
 	if (unlikely(iov_iter_is_folioq(i)))
 		return !i->count ? 0 :
 			umin(folioq_folio_size(i->folioq, i->folioq_slot), i->count);
+	if (unlikely(iov_iter_is_scatterlist(i)))
+		return !i->sglist ? 0 : umin(i->count, i->sglist->length - i->iov_offset);
 	return i->count;
 }
 EXPORT_SYMBOL(iov_iter_single_seg_count);
@@ -856,6 +891,33 @@ void iov_iter_iterlist(struct iov_iter *iter, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_iterlist);
 
+/**
+ * iov_iter_scatterlist - Initialise an I/O iterator for a scatterlist chain
+ * @iter: The iterator to initialise.
+ * @direction: The direction of the transfer.
+ * @sglist: The head of the scatterlist
+ * @count: The size of the I/O buffer in bytes.
+ *
+ * Set up an I/O iterator that walks over a scatterlist.  Because scatterlists
+ * can be chained and have no back pointers, reversion requires starting again
+ * at the beginning and counting forwards.
+ */
+void iov_iter_scatterlist(struct iov_iter *iter, unsigned int direction,
+			  struct scatterlist *sglist, size_t count)
+{
+	WARN_ON(direction & ~(READ | WRITE));
+	*iter = (struct iov_iter){
+		.iter_type	= ITER_SCATTERLIST,
+		.data_source	= direction,
+		.sglist		= sglist,
+		.sglist_head	= sglist,
+		.iov_offset	= 0,
+		.count		= count,
+		.orig_count	= count,
+	};
+}
+EXPORT_SYMBOL(iov_iter_scatterlist);
+
 static bool iov_iter_aligned_iovec(const struct iov_iter *i, unsigned addr_mask,
 				   unsigned len_mask)
 {
@@ -994,6 +1056,26 @@ static unsigned long iov_iter_alignment_bvec(const struct iov_iter *i)
 	return res;
 }
 
+static unsigned long iov_iter_alignment_scatterlist(const struct iov_iter *i)
+{
+	struct scatterlist *sg;
+	unsigned skip = i->iov_offset;
+	unsigned res = 0;
+	size_t size = i->count;
+
+	for (sg = i->sglist; sg && size; sg = sg_next(sg)) {
+		size_t len = sg->length - skip;
+
+		res |= (unsigned long)sg->offset + skip;
+		if (len > size)
+			len = size;
+		res |= len;
+		size -= len;
+		skip = 0;
+	}
+
+	return res;
+}
+
 unsigned long iov_iter_alignment(const struct iov_iter *i)
 {
 	if (likely(iter_is_ubuf(i))) {
@@ -1024,6 +1106,8 @@ unsigned long iov_iter_alignment(const struct iov_iter *i)
 			align |= iov_iter_alignment(&i->iterlist[j].iter);
 		return align;
 	}
+	if (iov_iter_is_scatterlist(i))
+		return iov_iter_alignment_scatterlist(i);
 
 	return 0;
 }
@@ -1058,13 +1142,8 @@ unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
 }
 EXPORT_SYMBOL(iov_iter_gap_alignment);
 
-static int want_pages_array(struct page ***res, size_t size,
-			    size_t start, unsigned int maxpages)
+static int __want_pages_array(struct page ***res, unsigned int count)
 {
-	unsigned int count = DIV_ROUND_UP(size + start, PAGE_SIZE);
-
-	if (count > maxpages)
-		count = maxpages;
 	WARN_ON(!count);	// caller should've prevented that
 	if (!*res) {
 		*res = kvmalloc_array(count, sizeof(struct page *), GFP_KERNEL);
@@ -1074,6 +1153,16 @@ static int want_pages_array(struct page ***res, size_t size,
 	return count;
 }
 
+static int want_pages_array(struct page ***res, size_t size,
+			    size_t start, unsigned int maxpages)
+{
+	size_t count = DIV_ROUND_UP(size + start, PAGE_SIZE);
+
+	if (count > maxpages)
+		count = maxpages;
+	return __want_pages_array(res, count);
+}
+
 static ssize_t iter_folioq_get_pages(struct iov_iter *iter,
 				     struct page ***ppages, size_t maxsize,
 				     unsigned maxpages, size_t *_start_offset)
@@ -1186,6 +1275,52 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i,
 	return maxsize;
 }
 
+static struct page *first_scatterlist_segment(const struct iov_iter *i,
+					      size_t *size, size_t *start)
+{
+	struct scatterlist *sg = i->sglist;
+	struct page *page;
+	size_t skip = i->iov_offset, len;
+
+	if (!sg)
+		return NULL;
+
+	len = sg->length - skip;
+	if (*size > len)
+		*size = len;
+	skip += sg->offset;
+	page = sg_page(sg) + skip / PAGE_SIZE;
+	*start = skip % PAGE_SIZE;
+	return page;
+}
+
+static ssize_t iter_scatterlist_get_pages(struct iov_iter *i,
+					  struct page ***pages, size_t maxsize,
+					  unsigned maxpages, size_t *start)
+{
+	struct page **p, *page;
+	unsigned int n;
+
+	page = first_scatterlist_segment(i, &maxsize, start);
+	if (!page)
+		return -EFAULT;
+	n = want_pages_array(pages, maxsize, *start, maxpages);
+	if (!n)
+		return -ENOMEM;
+	p = *pages;
+	for (int k = 0; k < n; k++)
+		get_page(p[k] = page + k);
+	maxsize = min_t(size_t, maxsize, n * PAGE_SIZE - *start);
+	i->count -= maxsize;
+	i->iov_offset += maxsize;
+	if (i->iov_offset == i->sglist->length) {
+		/* nr_segs overlays sglist_head here, so don't touch it */
+		i->iov_offset = 0;
+		i->sglist = sg_next(i->sglist);
+	}
+	return maxsize;
+}
+
 /* must be done on non-empty ITER_UBUF or ITER_IOVEC one */
 static unsigned long first_iovec_segment(const struct iov_iter *i, size_t *size)
 {
@@ -1296,6 +1431,8 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		i->count -= size;
 		return size;
 	}
+	if (iov_iter_is_scatterlist(i))
+		return iter_scatterlist_get_pages(i, pages, maxsize, maxpages, start);
 	return -EFAULT;
 }
 
@@ -1379,6 +1516,25 @@ static int iterlist_npages(const struct iov_iter *i, int maxpages)
 	return npages;
 }
 
+static int scatterlist_npages(const struct iov_iter *i, int maxpages)
+{
+	struct scatterlist *sg;
+	size_t skip = i->iov_offset, size = i->count;
+	int npages = 0;
+
+	for (sg = i->sglist; sg && size; sg = sg_next(sg)) {
+		unsigned offs = (sg->offset + skip) % PAGE_SIZE;
+		size_t len = umin(sg->length - skip, size);
+
+		size -= len;
+		npages += DIV_ROUND_UP(offs + len, PAGE_SIZE);
+		if (unlikely(npages > maxpages))
+			return maxpages;
+		skip = 0;
+	}
+	return npages;
+}
+
 int iov_iter_npages(const struct iov_iter *i, int maxpages)
 {
 	if (unlikely(!i->count))
@@ -1405,6 +1561,8 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
 	}
 	if (iov_iter_is_iterlist(i))
 		return iterlist_npages(i, maxpages);
+	if (iov_iter_is_scatterlist(i))
+		return scatterlist_npages(i, maxpages);
 	return 0;
 }
 EXPORT_SYMBOL(iov_iter_npages);
@@ -1792,6 +1950,107 @@ static ssize_t iov_iter_extract_xarray_pages(struct iov_iter *i,
 	return maxsize;
 }
 
+/*
+ * Count the number of virtually contiguous pages in a scatterlist iterator
+ * from the current point.
+ */
+static size_t count_scatterlist_contig_pages(const struct iov_iter *i,
+					     size_t maxpages, size_t maxsize)
+{
+	struct scatterlist *sg;
+	size_t npages = 0;
+	size_t skip = i->iov_offset, size = umin(i->count, maxsize);
+
+	for (sg = i->sglist; sg && size; sg = sg_next(sg)) {
+		size_t offs = (sg->offset + skip) % PAGE_SIZE;
+		size_t part = umin(sg->length - skip, size);
+
+		if (!part)
+			break;
+		size -= part;
+		npages += DIV_ROUND_UP(offs + part, PAGE_SIZE);
+		if (unlikely(npages > maxpages))
+			return maxpages;
+		if (((offs + part) % PAGE_SIZE) != 0)
+			break;
+		skip = 0;
+	}
+	return npages;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_SCATTERLIST iterator.  This does
+ * not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_scatterlist_pages(struct iov_iter *i,
+						  struct page ***pages, size_t maxsize,
+						  unsigned int maxpages,
+						  iov_iter_extraction_t extraction_flags,
+						  size_t *offset0)
+{
+	struct scatterlist *sg = i->sglist;
+	struct page **p;
+	size_t npages, skip, size = 0;
+	int nr = 0;
+
+	if (!sg)
+		return 0;
+
+	while (skip = i->iov_offset,
+	       skip == sg->length) {
+		sg = sg_next(sg);
+		i->sglist = sg;
+		i->iov_offset = 0;
+		if (!sg)
+			return 0;
+	}
+
+	npages = count_scatterlist_contig_pages(i, maxpages, maxsize);
+
+	maxpages = __want_pages_array(pages, npages);
+	if (!maxpages)
+		return -ENOMEM;
+	*offset0 = (sg->offset + skip) & ~PAGE_MASK;
+	p = *pages;
+
+	for (sg = i->sglist; sg; sg = sg_next(sg)) {
+		struct page *page = sg_page(sg);
+		size_t part = umin(sg->length - skip, maxsize);
+		size_t off = sg->offset + skip;
+
+		if (!part)
+			break;
+
+		page += off / PAGE_SIZE;
+		off %= PAGE_SIZE;
+
+		do {
+			size_t chunk = umin(part, PAGE_SIZE - off);
+
+			p[nr++] = page;
+			page++;
+			maxpages--;
+			maxsize -= chunk;
+			size += chunk;
+			skip += chunk;
+			part -= chunk;
+			off = 0;
+		} while (part && maxsize && maxpages);
+
+		if (((sg->offset + skip + part) % PAGE_SIZE) != 0)
+			break;
+		if (!maxsize || !maxpages) {
+			if (!part)
+				sg = sg_next(sg);
+			break;
+		}
+		skip = 0;
+	}
+
+	iov_iter_advance(i, size);
+	return size;
+}
+
 /*
  * Extract a list of virtually contiguous pages from an ITER_BVEC iterator.
  * This does not get references on the pages, nor does it get a pin on them.
@@ -2051,6 +2310,10 @@ ssize_t iov_iter_extract_pages(struct iov_iter *i,
 		i->count -= size;
 		return size;
 	}
+	if (iov_iter_is_scatterlist(i))
+		return iov_iter_extract_scatterlist_pages(i, pages, maxsize,
+							  maxpages, extraction_flags,
+							  offset0);
 	return -EFAULT;
 }
 EXPORT_SYMBOL_GPL(iov_iter_extract_pages);
@@ -2148,6 +2411,44 @@ static size_t iterate_iterlist(struct iov_iter *iter, size_t len, void *priv, vo
 	return progress;
 }
 
+/*
+ * Handle iteration over ITER_SCATTERLIST.
+ */
+static size_t iterate_scatterlist(struct iov_iter *iter, size_t len, void *priv, void *priv2,
+				  iov_step_f step)
+{
+	struct scatterlist *sg = iter->sglist;
+	size_t progress = 0, skip = iter->iov_offset;
+
+	do {
+		struct page *page = sg_page(sg);
+		size_t remain, consumed;
+		size_t offset = sg->offset + skip, part;
+		void *kaddr = kmap_local_page(page + offset / PAGE_SIZE);
+
+		part = min3(len,
+			   (size_t)(sg->length - skip),
+			   (size_t)(PAGE_SIZE - offset % PAGE_SIZE));
+		remain = step(kaddr + offset % PAGE_SIZE, progress, part, priv, priv2);
+		kunmap_local(kaddr);
+		consumed = part - remain;
+		len -= consumed;
+		progress += consumed;
+		skip += consumed;
+		if (skip >= sg->length) {
+			skip = 0;
+			sg = sg_next(sg);
+		}
+		if (remain)
+			break;
+	} while (len && sg);
+
+	iter->sglist = sg;
+	iter->iov_offset = skip;
+	iter->count -= progress;
+	return progress;
+}
+
 /*
  * Out of line iteration for iterator types that don't need such fast handling.
  */
@@ -2160,6 +2461,8 @@ size_t __iterate_and_advance2(struct iov_iter *iter, size_t len, void *priv,
 		return iterate_xarray(iter, len, priv, priv2, step);
 	if (iov_iter_is_iterlist(iter))
 		return iterate_iterlist(iter, len, priv, priv2, ustep, step);
+	if (iov_iter_is_scatterlist(iter))
+		return iterate_scatterlist(iter, len, priv, priv2, step);
 	WARN_ON(1);
 	return 0;
 }


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 4/4] iov_iter: Add an skbuff iterator type [INCOMPLETE]
  2025-03-21 16:14 [RFC PATCH 0/4] iov_iter: Add composite, scatterlist and skbuff iterator types David Howells
                   ` (2 preceding siblings ...)
  2025-03-21 16:14 ` [RFC PATCH 3/4] iov_iter: Add a scatterlist iterator type David Howells
@ 2025-03-21 16:14 ` David Howells
  2025-03-23  6:21 ` [RFC PATCH 0/4] iov_iter: Add composite, scatterlist and skbuff iterator types Christoph Hellwig
  4 siblings, 0 replies; 9+ messages in thread
From: David Howells @ 2025-03-21 16:14 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: David Howells, Christian Brauner, Matthew Wilcox, Chuck Lever,
	Steve French, Ilya Dryomov, netfs, linux-fsdevel, linux-block,
	linux-mm, linux-kernel

Add an iterator type that can iterate over a socket buffer.

[!] Note this is not yet completely implemented and won't compile.
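The intended usage is roughly the following (a sketch only; the
surrounding variables are illustrative and, as noted above, this
won't build yet):

	struct iov_iter iter;

	/* Present the skb's linear head, page frags and frag list as
	 * one contiguous byte stream of skb->len bytes.
	 */
	iov_iter_skbuff(&iter, ITER_SOURCE, skb, skb->len);

	/* Any existing iov_iter consumer can then read the packet. */
	copied = copy_from_iter(buffer, size, &iter);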

Signed-off-by: David Howells <dhowells@redhat.com>
---
 include/linux/uio.h |  10 ++++
 lib/iov_iter.c      | 121 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 131 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 0e50f4af6877..87d6ba660489 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -13,6 +13,7 @@
 struct page;
 struct folio_queue;
 struct scatterlist;
+struct sk_buff;
 
 typedef unsigned int __bitwise iov_iter_extraction_t;
 
@@ -32,6 +33,7 @@ enum iter_type {
 	ITER_DISCARD,
 	ITER_ITERLIST,
 	ITER_SCATTERLIST,
+	ITER_SKBUFF,
 };
 
 #define ITER_SOURCE	1	// == WRITE
@@ -77,6 +79,7 @@ struct iov_iter {
 				void __user *ubuf;
 				struct iov_iterlist *iterlist;
 				struct scatterlist *sglist;
+				const struct sk_buff *skb;
 			};
 			size_t count;
 		};
@@ -171,6 +174,11 @@ static inline bool iov_iter_is_scatterlist(const struct iov_iter *i)
 	return iov_iter_type(i) == ITER_SCATTERLIST;
 }
 
+static inline bool iov_iter_is_skbuff(const struct iov_iter *i)
+{
+	return iov_iter_type(i) == ITER_SKBUFF;
+}
+
 static inline unsigned char iov_iter_rw(const struct iov_iter *i)
 {
 	return i->data_source ? WRITE : READ;
@@ -329,6 +337,8 @@ void iov_iter_iterlist(struct iov_iter *i, unsigned int direction,
 		       size_t count);
 void iov_iter_scatterlist(struct iov_iter *i, unsigned int direction,
 			  struct scatterlist *sglist, size_t count);
+void iov_iter_skbuff(struct iov_iter *i, unsigned int direction,
+		     const struct sk_buff *skb, size_t count);
 ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
 			size_t maxsize, unsigned maxpages, size_t *start);
 ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index ed9859af3c5d..01215316d272 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -12,6 +12,7 @@
 #include <linux/scatterlist.h>
 #include <linux/instrumented.h>
 #include <linux/iov_iter.h>
+#include <linux/skbuff.h>
 
 static __always_inline
 size_t copy_to_user_iter(void __user *iter_to, size_t progress,
@@ -918,6 +919,29 @@ void iov_iter_scatterlist(struct iov_iter *iter, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_scatterlist);
 
+/**
+ * iov_iter_skbuff - Initialise an I/O iterator for a socket buffer
+ * @iter: The iterator to initialise.
+ * @direction: The direction of the transfer.
+ * @skb: The socket buffer
+ * @count: The size of the I/O buffer in bytes.
+ *
+ * Set up an I/O iterator that walks over a socket buffer.
+ */
+void iov_iter_skbuff(struct iov_iter *iter, unsigned int direction,
+		     const struct sk_buff *skb, size_t count)
+{
+	WARN_ON(direction & ~(READ | WRITE));
+	*iter = (struct iov_iter){
+		.iter_type	= ITER_SKBUFF,
+		.data_source	= direction,
+		.skb		= skb,
+		.iov_offset	= 0,
+		.count		= count,
+	};
+}
+EXPORT_SYMBOL(iov_iter_skbuff);
+
 static bool iov_iter_aligned_iovec(const struct iov_iter *i, unsigned addr_mask,
 				   unsigned len_mask)
 {
@@ -2314,6 +2338,10 @@ ssize_t iov_iter_extract_pages(struct iov_iter *i,
 		return iov_iter_extract_scatterlist_pages(i, pages, maxsize,
 							  maxpages, extraction_flags,
 							  offset0);
+	if (iov_iter_is_skbuff(i))
+		return iov_iter_extract_skbuff_pages(i, pages, maxsize,
+						     maxpages, extraction_flags,
+						     offset0);
 	return -EFAULT;
 }
 EXPORT_SYMBOL_GPL(iov_iter_extract_pages);
@@ -2449,6 +2477,97 @@ static size_t iterate_scatterlist(struct iov_iter *iter, size_t len, void *priv,
 	return progress;
 }
 
+struct skbuff_iter_ctx {
+	iov_step_f	step;
+	size_t		progress;
+	void		*priv;
+	void		*priv2;
+};
+
+static bool iterate_skbuff_frag(const struct sk_buff *skb, struct skbuff_iter_ctx *ctx,
+				int offset, int len, int recursion_level)
+{
+	struct sk_buff *frag_iter;
+	size_t skip = offset, part, remain, consumed;
+
+	if (unlikely(recursion_level >= 24))
+		return false;
+
+	part = skb_headlen(skb);
+	if (skip < part) {
+		part = umin(part - skip, len);
+		remain = ctx->step(skb->data + skip, ctx->progress, part,
+				   ctx->priv, ctx->priv2);
+		consumed = part - remain;
+		ctx->progress += consumed;
+		len -= consumed;
+		if (remain > 0 || len <= 0)
+			return false;
+		skip = 0;
+	} else {
+		skip -= part;
+	}
+
+	for (int i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+		const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+		size_t fsize = skb_frag_size(frag);
+
+		if (skip >= fsize) {
+			skip -= fsize;
+			continue;
+		}
+
+		part = umin(fsize - skip, len);
+		remain = ctx->step(skb_frag_address(frag) + skip,
+				   ctx->progress, part, ctx->priv, ctx->priv2);
+		consumed = part - remain;
+		ctx->progress += consumed;
+		len -= consumed;
+		if (remain > 0 || len <= 0)
+			return false;
+		skip = 0;
+	}
+
+	skb_walk_frags(skb, frag_iter) {
+		size_t fsize = frag_iter->len;
+
+		if (skip >= fsize) {
+			skip -= fsize;
+			continue;
+		}
+
+		part = umin(fsize - skip, len);
+		if (!iterate_skbuff_frag(frag_iter, ctx, skip,
+					 part, recursion_level + 1))
+			return false;
+		len -= part;
+		if (len <= 0)
+			return false;
+		skip = 0;
+	}
+	return true;
+}
+
+/*
+ * Handle iteration over ITER_SKBUFF.  Modelled on __skb_to_sgvec().
+ */
+static size_t iterate_skbuff(struct iov_iter *iter, size_t len, void *priv, void *priv2,
+			     iov_step_f step)
+{
+	struct skbuff_iter_ctx ctx = {
+		.step		= step,
+		.progress	= 0,
+		.priv		= priv,
+		.priv2		= priv2,
+	};
+
+	iterate_skbuff_frag(iter->skb, &ctx, iter->iov_offset, len, 0);
+
+	iter->iov_offset += ctx.progress;
+	iter->count -= ctx.progress;
+	return ctx.progress;
+}
+
 /*
  * Out of line iteration for iterator types that don't need such fast handling.
  */
@@ -2463,6 +2582,8 @@ size_t __iterate_and_advance2(struct iov_iter *iter, size_t len, void *priv,
 		return iterate_iterlist(iter, len, priv, priv2, ustep, step);
 	if (iov_iter_is_scatterlist(iter))
 		return iterate_scatterlist(iter, len, priv, priv2, step);
+	if (iov_iter_is_skbuff(iter))
+		return iterate_skbuff(iter, len, priv, priv2, step);
 	WARN_ON(1);
 	return 0;
 }



* Re: [RFC PATCH 0/4] iov_iter: Add composite, scatterlist and skbuff iterator types
  2025-03-21 16:14 [RFC PATCH 0/4] iov_iter: Add composite, scatterlist and skbuff iterator types David Howells
                   ` (3 preceding siblings ...)
  2025-03-21 16:14 ` [RFC PATCH 4/4] iov_iter: Add an skbuff iterator type [INCOMPLETE] David Howells
@ 2025-03-23  6:21 ` Christoph Hellwig
  2025-03-23 13:39   ` Hannes Reinecke
  2025-03-23 14:33   ` Matthew Wilcox
  4 siblings, 2 replies; 9+ messages in thread
From: Christoph Hellwig @ 2025-03-23  6:21 UTC (permalink / raw)
  To: David Howells
  Cc: Leon Romanovsky, Christian Brauner, Matthew Wilcox, Chuck Lever,
	Steve French, Ilya Dryomov, netfs, linux-fsdevel, linux-block,
	linux-mm, linux-kernel

This is going entirely in the wrong direction.  We don't need more iter
types but fewer.  The reason why we have too many is that the underlying
representation of the ranges is a mess which goes deeper than just the
iterator, because it also means we have to convert between the
underlying representations all the time.

E.g. the socket code should have (and either has for a while or at least
there were patches) been using bio_vecs instead of reinventing them as sk
fragment.  The crypto code should not be using scatterlists, which are a
horrible data structure because they mix up the physical memory
description and the dma mapping information which isn't even used for
most uses, etc.

So instead of more iters let's convert everyone to a common
scatter/gather memory definition, which simplifies the iters.  For now
that is the bio_vec, which really should be converted from storing a
struct page to a phys_addr_t (and maybe renamed if that helps adoption).
That allows to trivially kill the kvec for example.

As for the head/tail - that seems to be an odd NFS/sunrpc fetish.  I've
actually started a little project to just convert the sunrpc code to
use bio_vecs, which massively simplifies the code, and allows directly
passing it to the iters in the socket API.  It doesn't quite work yet
but shows how all these custom (and in this case rather ad-hoc) memory
fragment representations cause a huge mess.

I don't think the iterlist can work in practice, but it would be nice
to have for a few use cases.  If it worked it should hopefully allow
to kill off the odd xarray iterator.



* Re: [RFC PATCH 0/4] iov_iter: Add composite, scatterlist and skbuff iterator types
  2025-03-23  6:21 ` [RFC PATCH 0/4] iov_iter: Add composite, scatterlist and skbuff iterator types Christoph Hellwig
@ 2025-03-23 13:39   ` Hannes Reinecke
  2025-03-23 14:35     ` Matthew Wilcox
  2025-03-23 14:33   ` Matthew Wilcox
  1 sibling, 1 reply; 9+ messages in thread
From: Hannes Reinecke @ 2025-03-23 13:39 UTC (permalink / raw)
  To: Christoph Hellwig, David Howells
  Cc: Leon Romanovsky, Christian Brauner, Matthew Wilcox, Chuck Lever,
	Steve French, Ilya Dryomov, netfs, linux-fsdevel, linux-block,
	linux-mm, linux-kernel

On 3/23/25 07:21, Christoph Hellwig wrote:
> This is going entirely in the wrong direction.  We don't need more iter
> types but fewer.  The reason why we have too many is that the underlying
> representation of the ranges is a mess which goes deeper than just the
> iterator, because it also means we have to convert between the
> underlying representations all the time.
> 
> E.g. the socket code should have (and either has for a while or at least
> there were patches) been using bio_vecs instead of reinventing them as sk
> fragment.  The crypto code should not be using scatterlists, which are a
> horrible data structure because they mix up the physical memory
> description and the dma mapping information which isn't even used for
> most uses, etc.
> 
> So instead of more iters let's convert everyone to a common
> scatter/gather memory definition, which simplifies the iters.  For now
> that is the bio_vec, which really should be converted from storing a
> struct page to a phys_addr_t (and maybe renamed if that helps adoption).
> That allows to trivially kill the kvec for example.
> 
> As for the head/tail - that seems to be an odd NFS/sunrpc fetish.  I've
> actually started a little project to just convert the sunrpc code to
> use bio_vecs, which massively simplifies the code, and allows directly
> passing it to the iters in the socket API.  It doesn't quite work yet
> but shows how all these custom (and in this case rather ad-hoc) memory
> fragment representations cause a huge mess.
> 
> I don't think the iterlist can work in practice, but it would be nice
> to have for a few use cases.  If it worked it should hopefully allow
> to kill off the odd xarray iterator.
> 
Can we have a session around this?
I.e. define how iterators should be used, and what the iterator elements
should be. If we do it properly, this will also resolve the frozen page
discussion we're having; if we define iterators whose data elements
are _not_ pages then clearly one cannot take a reference to them.

But in either case, we should define the long-term goal such that
people can start converting stuff.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


* Re: [RFC PATCH 0/4] iov_iter: Add composite, scatterlist and skbuff iterator types
  2025-03-23  6:21 ` [RFC PATCH 0/4] iov_iter: Add composite, scatterlist and skbuff iterator types Christoph Hellwig
  2025-03-23 13:39   ` Hannes Reinecke
@ 2025-03-23 14:33   ` Matthew Wilcox
  1 sibling, 0 replies; 9+ messages in thread
From: Matthew Wilcox @ 2025-03-23 14:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Howells, Leon Romanovsky, Christian Brauner, Chuck Lever,
	Steve French, Ilya Dryomov, netfs, linux-fsdevel, linux-block,
	linux-mm, linux-kernel

On Sat, Mar 22, 2025 at 11:21:28PM -0700, Christoph Hellwig wrote:
> This is going entirely in the wrong direction.  We don't need more iter
> types but less.  The reason why we have to many is because the underlying
> representation of the ranges is a mess which goes deeper than just the
> iterator, because it also means we have to convert between the
> underlying representations all the time.
> 
> E.g. the socket code should have (and either has for a while or at least
> there were patches) been using bio_vecs instead of reinventing them as sk
> fragment.  The crypto code should not be using scatterlists, which are a

I did this work six years ago -- see 8842d285bafa

Unfortunately, networking is full of inconsiderate arseholes who backed
it out without even talking to me in 21d2e6737c97



* Re: [RFC PATCH 0/4] iov_iter: Add composite, scatterlist and skbuff iterator types
  2025-03-23 13:39   ` Hannes Reinecke
@ 2025-03-23 14:35     ` Matthew Wilcox
  0 siblings, 0 replies; 9+ messages in thread
From: Matthew Wilcox @ 2025-03-23 14:35 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, David Howells, Leon Romanovsky,
	Christian Brauner, Chuck Lever, Steve French, Ilya Dryomov, netfs,
	linux-fsdevel, linux-block, linux-mm, linux-kernel

On Sun, Mar 23, 2025 at 02:39:25PM +0100, Hannes Reinecke wrote:
> Can we have a session around this?

Wednesday, 10:30


end of thread, other threads:[~2025-03-23 14:35 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-03-21 16:14 [RFC PATCH 0/4] iov_iter: Add composite, scatterlist and skbuff iterator types David Howells
2025-03-21 16:14 ` [RFC PATCH 1/4] iov_iter: Move ITER_DISCARD and ITER_XARRAY iteration out-of-line David Howells
2025-03-21 16:14 ` [RFC PATCH 2/4] iov_iter: Add an iterator-of-iterators David Howells
2025-03-21 16:14 ` [RFC PATCH 3/4] iov_iter: Add a scatterlist iterator type David Howells
2025-03-21 16:14 ` [RFC PATCH 4/4] iov_iter: Add an skbuff iterator type [INCOMPLETE] David Howells
2025-03-23  6:21 ` [RFC PATCH 0/4] iov_iter: Add composite, scatterlist and skbuff iterator types Christoph Hellwig
2025-03-23 13:39   ` Hannes Reinecke
2025-03-23 14:35     ` Matthew Wilcox
2025-03-23 14:33   ` Matthew Wilcox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).