From: David Howells <dhowells@redhat.com>
To: netdev@vger.kernel.org
Cc: David Howells <dhowells@redhat.com>,
	Alexander Duyck <alexander.duyck@gmail.com>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Willem de Bruijn <willemdebruijn.kernel@gmail.com>,
	David Ahern <dsahern@kernel.org>,
	Matthew Wilcox <willy@infradead.org>,
	Jens Axboe <axboe@kernel.dk>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Menglong Dong <imagedong@tencent.com>
Subject: [PATCH net-next v3 01/18] net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)
Date: Tue, 20 Jun 2023 15:53:20 +0100
Message-ID: <20230620145338.1300897-2-dhowells@redhat.com>
In-Reply-To: <20230620145338.1300897-1-dhowells@redhat.com>

If sendmsg() is passed MSG_SPLICE_PAGES and is given a buffer that contains
some data that's resident in the slab, copy it rather than returning EIO.
This can be used by a number of drivers in the kernel, including:
iwarp, ceph/rds, dlm, nvme, ocfs2, drbd.  It could also be used by iscsi,
rxrpc, sunrpc, cifs and probably others.

skb_splice_from_iter() is given its own fragment allocator as
page_frag_alloc_align() can't be used because it does no locking to prevent
parallel callers from racing.  alloc_skb_frag() uses a separate folio for
each cpu and pins the caller to that cpu whilst allocating, reenabling cpu
migration around folio allocation.
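
As a rough illustration of how fragments are carved out of the cached
folio, here is a user-space sketch of the downward bump allocation only.
It is not the kernel code: struct frag_cache, frag_cache_init(),
frag_alloc() and FRAG_ALIGN are hypothetical names, FRAG_ALIGN is an
assumed stand-in for SMP_CACHE_BYTES, and the per-cpu handling, refcount
bias and folio refill are all left out.

```c
#include <assert.h>
#include <stddef.h>

#define FRAG_ALIGN 64u	/* assumed stand-in for SMP_CACHE_BYTES */

/* Hypothetical user-space analogue of one per-cpu cache entry. */
struct frag_cache {
	unsigned char	*virt;		/* start of the backing buffer */
	size_t		size;		/* total size of the buffer */
	size_t		offset;		/* bytes below this are still free */
};

static void frag_cache_init(struct frag_cache *c, void *buf, size_t size)
{
	c->virt = buf;
	c->size = size;
	c->offset = size;		/* allocation proceeds downwards */
}

/* Carve a fragment off the top of the free region.  Returns NULL when the
 * request does not fit; the patch would try to refurbish or replace the
 * folio at that point, which this sketch does not model.
 */
static void *frag_alloc(struct frag_cache *c, size_t fragsz)
{
	size_t offset;

	if (fragsz == 0)
		fragsz = 1;
	if (fragsz > c->offset)
		return NULL;

	/* Equivalent of ALIGN_DOWN(offset - fragsz, FRAG_ALIGN). */
	offset = (c->offset - fragsz) & ~(size_t)(FRAG_ALIGN - 1);
	c->offset = offset;
	return c->virt + offset;
}
```

Each fragment therefore starts on a FRAG_ALIGN boundary relative to the
start of the buffer, and the free region shrinks from the top down.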

This could instead allocate a whole page for each fragment to be copied, as
alloc_skb_with_frags() would do, but that would waste a lot of space (most
of the fragments look like they're going to be small).
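
To put a number on the space argument, the sketch below estimates how
many fragments share one buffer when each fragment is rounded up to the
alignment.  frag_footprint() and frags_per_buffer() are hypothetical
helpers, the sizes used are illustrative rather than measured, and the
buffer size is assumed to be a multiple of the alignment.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes consumed by one fragment once rounded up to the alignment. */
static size_t frag_footprint(size_t fragsz, size_t align)
{
	if (fragsz == 0)
		fragsz = 1;
	return (fragsz + align - 1) / align * align;
}

/* Number of fragments of one size that fit in a buffer of bufsz bytes,
 * assuming bufsz is a multiple of align.
 */
static size_t frags_per_buffer(size_t bufsz, size_t fragsz, size_t align)
{
	return bufsz / frag_footprint(fragsz, align);
}
```

With a 4096-byte buffer and 64-byte alignment, 40-byte fragments pack 64
to a buffer, whereas a page per fragment would leave 4056 of every 4096
bytes unused.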

This allows an entire message that consists of, say, a protocol header or
two, a number of pages of data and a protocol footer to be sent using a
single call to sock_sendmsg().
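
For readers less familiar with the in-kernel interface, a user-space
analogue of "header, data and footer in one call" is a scatter-gather
sendmsg().  This is only an analogy: it uses the ordinary sendmsg(2)
syscall with three iovecs, not sock_sendmsg() or MSG_SPLICE_PAGES, and
send_framed() is a hypothetical helper name.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send header + payload + footer with a single sendmsg() call rather than
 * three separate sends.  Returns the number of bytes queued or -1.
 */
static ssize_t send_framed(int fd,
			   const void *hdr, size_t hdr_len,
			   const void *data, size_t data_len,
			   const void *ftr, size_t ftr_len)
{
	struct iovec iov[3] = {
		{ .iov_base = (void *)hdr,  .iov_len = hdr_len  },
		{ .iov_base = (void *)data, .iov_len = data_len },
		{ .iov_base = (void *)ftr,  .iov_len = ftr_len  },
	};
	struct msghdr msg;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = iov;
	msg.msg_iovlen = 3;

	return sendmsg(fd, &msg, 0);
}
```

In the kernel case the iterator would describe pages and slab-resident
buffers rather than user pointers, with the slab parts being copied as
described above.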

The callers could be made to copy the data into fragments before calling
sendmsg(), but that then penalises them if MSG_SPLICE_PAGES gets ignored.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Alexander Duyck <alexander.duyck@gmail.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: David Ahern <dsahern@kernel.org>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Menglong Dong <imagedong@tencent.com>
cc: netdev@vger.kernel.org
---

Notes:
    ver #3)
     - Remove duplicate decl of skb_splice_from_iter().
    
    ver #2)
     - Fix parameter to put_cpu_ptr() to have an '&'.

 include/linux/skbuff.h |   2 +
 net/core/skbuff.c      | 171 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 170 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 91ed66952580..5f53bd5d375d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -5037,6 +5037,8 @@ static inline void skb_mark_for_recycle(struct sk_buff *skb)
 #endif
 }
 
+void *alloc_skb_frag(size_t fragsz, gfp_t gfp);
+void *copy_skb_frag(const void *s, size_t len, gfp_t gfp);
 ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter,
 			     ssize_t maxsize, gfp_t gfp);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index fee2b1c105fe..d962c93a429d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -6755,6 +6755,145 @@ nodefer:	__kfree_skb(skb);
 		smp_call_function_single_async(cpu, &sd->defer_csd);
 }
 
+struct skb_splice_frag_cache {
+	struct folio	*folio;
+	void		*virt;
+	unsigned int	offset;
+	/* We maintain a pagecount bias so that we don't dirty the cache line
+	 * containing page->_refcount every time we allocate a fragment.
+	 */
+	unsigned int	pagecnt_bias;
+	bool		pfmemalloc;
+};
+
+static DEFINE_PER_CPU(struct skb_splice_frag_cache, skb_splice_frag_cache);
+
+/**
+ * alloc_skb_frag - Allocate a page fragment for use in a socket
+ * @fragsz: The size of fragment required
+ * @gfp: Allocation flags
+ */
+void *alloc_skb_frag(size_t fragsz, gfp_t gfp)
+{
+	struct skb_splice_frag_cache *cache;
+	struct folio *folio, *spare = NULL;
+	size_t offset, fsize;
+	void *p;
+
+	if (WARN_ON_ONCE(fragsz == 0))
+		fragsz = 1;
+
+	cache = get_cpu_ptr(&skb_splice_frag_cache);
+reload:
+	folio = cache->folio;
+	offset = cache->offset;
+try_again:
+	if (fragsz > offset)
+		goto insufficient_space;
+
+	/* Make the allocation. */
+	cache->pagecnt_bias--;
+	offset = ALIGN_DOWN(offset - fragsz, SMP_CACHE_BYTES);
+	cache->offset = offset;
+	p = cache->virt + offset;
+	put_cpu_ptr(&skb_splice_frag_cache);
+	if (spare)
+		folio_put(spare);
+	return p;
+
+insufficient_space:
+	/* See if we can refurbish the current folio. */
+	if (!folio || !folio_ref_sub_and_test(folio, cache->pagecnt_bias))
+		goto get_new_folio;
+	if (unlikely(cache->pfmemalloc)) {
+		__folio_put(folio);
+		goto get_new_folio;
+	}
+
+	fsize = folio_size(folio);
+	if (unlikely(fragsz > fsize))
+		goto frag_too_big;
+
+	/* OK, page count is 0, we can safely set it */
+	folio_set_count(folio, PAGE_FRAG_CACHE_MAX_SIZE + 1);
+
+	/* Reset page count bias and offset to start of new frag */
+	cache->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
+	offset = fsize;
+	goto try_again;
+
+get_new_folio:
+	if (!spare) {
+		cache->folio = NULL;
+		put_cpu_ptr(&skb_splice_frag_cache);
+
+#if PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE
+		spare = folio_alloc(gfp | __GFP_NOWARN | __GFP_NORETRY |
+				    __GFP_NOMEMALLOC,
+				    PAGE_FRAG_CACHE_MAX_ORDER);
+		if (!spare)
+#endif
+			spare = folio_alloc(gfp, 0);
+		if (!spare)
+			return NULL;
+
+		cache = get_cpu_ptr(&skb_splice_frag_cache);
+		/* We may now be on a different cpu and/or someone else may
+		 * have refilled it
+		 */
+		cache->pfmemalloc = folio_is_pfmemalloc(spare);
+		if (cache->folio)
+			goto reload;
+	}
+
+	cache->folio = spare;
+	cache->virt  = folio_address(spare);
+	folio = spare;
+	spare = NULL;
+
+	/* Even if we own the page, we do not use atomic_set().  This would
+	 * break get_page_unless_zero() users.
+	 */
+	folio_ref_add(folio, PAGE_FRAG_CACHE_MAX_SIZE);
+
+	/* Reset page count bias and offset to start of new frag */
+	cache->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
+	offset = folio_size(folio);
+	goto try_again;
+
+frag_too_big:
+	/* The caller is trying to allocate a fragment with fragsz > PAGE_SIZE
+	 * but the cache isn't big enough to satisfy the request.  This may
+	 * happen in low memory conditions.  We don't release the cache page
+	 * because it could make memory pressure worse, so we simply return
+	 * NULL here.
+	 */
+	cache->offset = offset;
+	put_cpu_ptr(&skb_splice_frag_cache);
+	if (spare)
+		folio_put(spare);
+	return NULL;
+}
+EXPORT_SYMBOL(alloc_skb_frag);
+
+/**
+ * copy_skb_frag - Copy data into a page fragment.
+ * @s: The data to copy
+ * @len: The size of the data
+ * @gfp: Allocation flags
+ */
+void *copy_skb_frag(const void *s, size_t len, gfp_t gfp)
+{
+	void *p;
+
+	p = alloc_skb_frag(len, gfp);
+	if (!p)
+		return NULL;
+
+	return memcpy(p, s, len);
+}
+EXPORT_SYMBOL(copy_skb_frag);
+
 static void skb_splice_csum_page(struct sk_buff *skb, struct page *page,
 				 size_t offset, size_t len)
 {
@@ -6808,17 +6947,43 @@ ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter,
 			break;
 		}
 
+		if (space == 0 &&
+		    !skb_can_coalesce(skb, skb_shinfo(skb)->nr_frags,
+				      pages[0], off)) {
+			iov_iter_revert(iter, len);
+			break;
+		}
+
 		i = 0;
 		do {
 			struct page *page = pages[i++];
 			size_t part = min_t(size_t, PAGE_SIZE - off, len);
-
-			ret = -EIO;
-			if (WARN_ON_ONCE(!sendpage_ok(page)))
+			bool put = false;
+
+			if (PageSlab(page)) {
+				const void *p;
+				void *q;
+
+				p = kmap_local_page(page);
+				q = copy_skb_frag(p + off, part, gfp);
+				kunmap_local(p);
+				if (!q) {
+					iov_iter_revert(iter, len);
+					ret = -ENOMEM;
+					goto out;
+				}
+				page = virt_to_page(q);
+				off = offset_in_page(q);
+				put = true;
+			} else if (WARN_ON_ONCE(!sendpage_ok(page))) {
+				ret = -EIO;
 				goto out;
+			}
 
 			ret = skb_append_pagefrags(skb, page, off, part,
 						   frag_limit);
+			if (put)
+				put_page(page);
 			if (ret < 0) {
 				iov_iter_revert(iter, len);
 				goto out;



