From: David Wei <dw@davidwei.uk>
To: io-uring@vger.kernel.org, netdev@vger.kernel.org
Cc: David Wei <dw@davidwei.uk>, Jens Axboe <axboe@kernel.dk>,
Pavel Begunkov <asml.silence@gmail.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jesper Dangaard Brouer <hawk@kernel.org>,
David Ahern <dsahern@kernel.org>,
Mina Almasry <almasrymina@google.com>
Subject: [PATCH v1 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
Date: Mon, 7 Oct 2024 15:15:59 -0700 [thread overview]
Message-ID: <20241007221603.1703699-12-dw@davidwei.uk> (raw)
In-Reply-To: <20241007221603.1703699-1-dw@davidwei.uk>
From: Pavel Begunkov <asml.silence@gmail.com>
Implement a page pool memory provider for io_uring to receieve in a
zero copy fashion. For that, the provider allocates user pages wrapped
around into struct net_iovs, that are stored in a previously registered
struct net_iov_area.
Unlike with traditional receives, for which pages from a page pool can
be deallocated right after the user receives data, e.g. via recv(2),
we extend the lifetime by recycling buffers only after the user space
acknowledges that it's done processing the data via the refill queue.
Before handing buffers to the user, we mark them by bumping the refcount
by a bias value IO_ZC_RX_UREF, which will be checked when the buffer is
returned back. When the corresponding io_uring instance and/or page pool
are destroyed, we'll force back all buffers that are currently in the
user space in ->io_pp_zc_scrub by clearing the bias.
Refcounting and lifetime:
Initially, all buffers are considered unallocated and stored in
->freelist, at which point they are not yet directly exposed to the core
page pool code and not accounted to page pool's pages_state_hold_cnt.
The ->alloc_netmems callback will allocate them by placing into the
page pool's cache, setting the refcount to 1 as usual and adjusting
pages_state_hold_cnt.
Then, either the buffer is dropped and returns back to the page pool
into the ->freelist via io_pp_zc_release_netmem, in which case the page
pool will match hold_cnt for us with ->pages_state_release_cnt. Or more
likely the buffer will go through the network/protocol stacks and end up
in the corresponding socket's receive queue. From there the user can get
it via an new io_uring request implemented in following patches. As
mentioned above, before giving a buffer to the user we bump the refcount
by IO_ZC_RX_UREF.
Once the user is done with the buffer processing, it must return it back
via the refill queue, from where our ->alloc_netmems implementation can
grab it, check references, put IO_ZC_RX_UREF, and recycle the buffer if
there are no more users left. As we place such buffers right back into
the page pools fast cache and they didn't go through the normal pp
release path, they are still considered "allocated" and no pp hold_cnt
is required. For the same reason we dma sync buffers for the device
in io_zc_add_pp_cache().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
---
include/linux/io_uring/net.h | 5 +
io_uring/zcrx.c | 229 +++++++++++++++++++++++++++++++++++
io_uring/zcrx.h | 6 +
3 files changed, 240 insertions(+)
diff --git a/include/linux/io_uring/net.h b/include/linux/io_uring/net.h
index b58f39fed4d5..610b35b451fd 100644
--- a/include/linux/io_uring/net.h
+++ b/include/linux/io_uring/net.h
@@ -5,6 +5,11 @@
struct io_uring_cmd;
#if defined(CONFIG_IO_URING)
+
+#if defined(CONFIG_PAGE_POOL)
+extern const struct memory_provider_ops io_uring_pp_zc_ops;
+#endif
+
int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags);
#else
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 8382129402ac..6cd3dee8b90a 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -2,7 +2,11 @@
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/mm.h>
+#include <linux/nospec.h>
+#include <linux/netdevice.h>
#include <linux/io_uring.h>
+#include <net/page_pool/helpers.h>
+#include <trace/events/page_pool.h>
#include <uapi/linux/io_uring.h>
@@ -16,6 +20,13 @@
#if defined(CONFIG_PAGE_POOL) && defined(CONFIG_INET)
+static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
+{
+ struct net_iov_area *owner = net_iov_owner(niov);
+
+ return container_of(owner, struct io_zcrx_area, nia);
+}
+
static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
struct io_uring_zcrx_ifq_reg *reg)
{
@@ -101,6 +112,9 @@ static int io_zcrx_create_area(struct io_ring_ctx *ctx,
goto err;
for (i = 0; i < nr_pages; i++) {
+ struct net_iov *niov = &area->nia.niovs[i];
+
+ niov->owner = &area->nia;
area->freelist[i] = i;
}
@@ -233,4 +247,219 @@ void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
lockdep_assert_held(&ctx->uring_lock);
}
+static bool io_zcrx_niov_put(struct net_iov *niov, int nr)
+{
+ return atomic_long_sub_and_test(nr, &niov->pp_ref_count);
+}
+
+static bool io_zcrx_put_niov_uref(struct net_iov *niov)
+{
+ if (atomic_long_read(&niov->pp_ref_count) < IO_ZC_RX_UREF)
+ return false;
+
+ return io_zcrx_niov_put(niov, IO_ZC_RX_UREF);
+}
+
+static inline void io_zc_add_pp_cache(struct page_pool *pp,
+ struct net_iov *niov)
+{
+ netmem_ref netmem = net_iov_to_netmem(niov);
+
+#if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)
+ if (pp->dma_sync && dma_dev_need_sync(pp->p.dev)) {
+ dma_addr_t dma_addr = page_pool_get_dma_addr_netmem(netmem);
+
+ dma_sync_single_range_for_device(pp->p.dev, dma_addr,
+ pp->p.offset, pp->p.max_len,
+ pp->p.dma_dir);
+ }
+#endif
+
+ page_pool_fragment_netmem(netmem, 1);
+ pp->alloc.cache[pp->alloc.count++] = netmem;
+}
+
+static inline u32 io_zcrx_rqring_entries(struct io_zcrx_ifq *ifq)
+{
+ u32 entries;
+
+ entries = smp_load_acquire(&ifq->rq_ring->tail) - ifq->cached_rq_head;
+ return min(entries, ifq->rq_entries);
+}
+
+static struct io_uring_zcrx_rqe *io_zcrx_get_rqe(struct io_zcrx_ifq *ifq,
+ unsigned mask)
+{
+ unsigned int idx = ifq->cached_rq_head++ & mask;
+
+ return &ifq->rqes[idx];
+}
+
+static void io_zcrx_ring_refill(struct page_pool *pp,
+ struct io_zcrx_ifq *ifq)
+{
+ unsigned int entries = io_zcrx_rqring_entries(ifq);
+ unsigned int mask = ifq->rq_entries - 1;
+
+ entries = min_t(unsigned, entries, PP_ALLOC_CACHE_REFILL - pp->alloc.count);
+ if (unlikely(!entries))
+ return;
+
+ do {
+ struct io_uring_zcrx_rqe *rqe = io_zcrx_get_rqe(ifq, mask);
+ struct io_zcrx_area *area;
+ struct net_iov *niov;
+ unsigned niov_idx, area_idx;
+
+ area_idx = rqe->off >> IORING_ZCRX_AREA_SHIFT;
+ niov_idx = (rqe->off & ~IORING_ZCRX_AREA_MASK) / PAGE_SIZE;
+
+ if (unlikely(rqe->__pad || area_idx))
+ continue;
+ area = ifq->area;
+
+ if (unlikely(niov_idx >= area->nia.num_niovs))
+ continue;
+ niov_idx = array_index_nospec(niov_idx, area->nia.num_niovs);
+
+ niov = &area->nia.niovs[niov_idx];
+ if (!io_zcrx_put_niov_uref(niov))
+ continue;
+ io_zc_add_pp_cache(pp, niov);
+ } while (--entries);
+
+ smp_store_release(&ifq->rq_ring->head, ifq->cached_rq_head);
+}
+
+static void io_zcrx_refill_slow(struct page_pool *pp, struct io_zcrx_ifq *ifq)
+{
+ struct io_zcrx_area *area = ifq->area;
+
+ spin_lock_bh(&area->freelist_lock);
+ while (area->free_count && pp->alloc.count < PP_ALLOC_CACHE_REFILL) {
+ struct net_iov *niov;
+ u32 pgid;
+
+ pgid = area->freelist[--area->free_count];
+ niov = &area->nia.niovs[pgid];
+
+ io_zc_add_pp_cache(pp, niov);
+
+ pp->pages_state_hold_cnt++;
+ trace_page_pool_state_hold(pp, net_iov_to_netmem(niov),
+ pp->pages_state_hold_cnt);
+ }
+ spin_unlock_bh(&area->freelist_lock);
+}
+
+static void io_zcrx_recycle_niov(struct net_iov *niov)
+{
+ struct io_zcrx_area *area = io_zcrx_iov_to_area(niov);
+
+ spin_lock_bh(&area->freelist_lock);
+ area->freelist[area->free_count++] = net_iov_idx(niov);
+ spin_unlock_bh(&area->freelist_lock);
+}
+
+static netmem_ref io_pp_zc_alloc_netmems(struct page_pool *pp, gfp_t gfp)
+{
+ struct io_zcrx_ifq *ifq = pp->mp_priv;
+
+ /* pp should already be ensuring that */
+ if (unlikely(pp->alloc.count))
+ goto out_return;
+
+ io_zcrx_ring_refill(pp, ifq);
+ if (likely(pp->alloc.count))
+ goto out_return;
+
+ io_zcrx_refill_slow(pp, ifq);
+ if (!pp->alloc.count)
+ return 0;
+out_return:
+ return pp->alloc.cache[--pp->alloc.count];
+}
+
+static bool io_pp_zc_release_netmem(struct page_pool *pp, netmem_ref netmem)
+{
+ struct net_iov *niov;
+
+ if (WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
+ return false;
+
+ niov = netmem_to_net_iov(netmem);
+
+ if (io_zcrx_niov_put(niov, 1))
+ io_zcrx_recycle_niov(niov);
+ return false;
+}
+
+static void io_pp_zc_scrub(struct page_pool *pp)
+{
+ struct io_zcrx_ifq *ifq = pp->mp_priv;
+ struct io_zcrx_area *area = ifq->area;
+ int i;
+
+ /* Reclaim back all buffers given to the user space. */
+ for (i = 0; i < area->nia.num_niovs; i++) {
+ struct net_iov *niov = &area->nia.niovs[i];
+ int count;
+
+ if (!io_zcrx_put_niov_uref(niov))
+ continue;
+ io_zcrx_recycle_niov(niov);
+
+ count = atomic_inc_return_relaxed(&pp->pages_state_release_cnt);
+ trace_page_pool_state_release(pp, net_iov_to_netmem(niov), count);
+ }
+}
+
+static int io_pp_zc_init(struct page_pool *pp)
+{
+ struct io_zcrx_ifq *ifq = pp->mp_priv;
+ struct io_zcrx_area *area = ifq->area;
+ int ret;
+
+ if (!ifq)
+ return -EINVAL;
+ if (pp->p.order != 0)
+ return -EINVAL;
+ if (!pp->p.napi)
+ return -EINVAL;
+ if (!pp->p.napi->napi_id)
+ return -EINVAL;
+
+ ret = page_pool_init_paged_area(pp, &area->nia, area->pages);
+ if (ret)
+ return ret;
+
+ ifq->napi_id = pp->p.napi->napi_id;
+ percpu_ref_get(&ifq->ctx->refs);
+ ifq->pp = pp;
+ return 0;
+}
+
+static void io_pp_zc_destroy(struct page_pool *pp)
+{
+ struct io_zcrx_ifq *ifq = pp->mp_priv;
+ struct io_zcrx_area *area = ifq->area;
+
+ page_pool_release_area(pp, &ifq->area->nia);
+
+ ifq->pp = NULL;
+ ifq->napi_id = 0;
+
+ if (WARN_ON_ONCE(area->free_count != area->nia.num_niovs))
+ return;
+ percpu_ref_put(&ifq->ctx->refs);
+}
+
+const struct memory_provider_ops io_uring_pp_zc_ops = {
+ .alloc_netmems = io_pp_zc_alloc_netmems,
+ .release_netmem = io_pp_zc_release_netmem,
+ .init = io_pp_zc_init,
+ .destroy = io_pp_zc_destroy,
+ .scrub = io_pp_zc_scrub,
+};
+
#endif
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
index 2fcbeb3d5501..67512fc69cc4 100644
--- a/io_uring/zcrx.h
+++ b/io_uring/zcrx.h
@@ -5,6 +5,9 @@
#include <linux/io_uring_types.h>
#include <net/page_pool/types.h>
+#define IO_ZC_RX_UREF 0x10000
+#define IO_ZC_RX_KREF_MASK (IO_ZC_RX_UREF - 1)
+
struct io_zcrx_area {
struct net_iov_area nia;
struct io_zcrx_ifq *ifq;
@@ -22,15 +25,18 @@ struct io_zcrx_ifq {
struct io_ring_ctx *ctx;
struct net_device *dev;
struct io_zcrx_area *area;
+ struct page_pool *pp;
struct io_uring *rq_ring;
struct io_uring_zcrx_rqe *rqes;
u32 rq_entries;
+ u32 cached_rq_head;
unsigned short n_rqe_pages;
struct page **rqe_pages;
u32 if_rxq;
+ unsigned napi_id;
};
#if defined(CONFIG_PAGE_POOL) && defined(CONFIG_INET)
--
2.43.5
next prev parent reply other threads:[~2024-10-07 22:16 UTC|newest]
Thread overview: 126+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-07 22:15 [PATCH v1 00/15] io_uring zero copy rx David Wei
2024-10-07 22:15 ` [PATCH v1 01/15] net: devmem: pull struct definitions out of ifdef David Wei
2024-10-09 20:17 ` Mina Almasry
2024-10-09 23:16 ` Pavel Begunkov
2024-10-10 18:01 ` Mina Almasry
2024-10-10 18:57 ` Pavel Begunkov
2024-10-13 22:38 ` Pavel Begunkov
2024-10-07 22:15 ` [PATCH v1 02/15] net: prefix devmem specific helpers David Wei
2024-10-09 20:19 ` Mina Almasry
2024-10-07 22:15 ` [PATCH v1 03/15] net: generalise net_iov chunk owners David Wei
2024-10-08 15:46 ` Stanislav Fomichev
2024-10-08 16:34 ` Pavel Begunkov
2024-10-09 16:28 ` Stanislav Fomichev
2024-10-11 18:44 ` David Wei
2024-10-11 22:02 ` Pavel Begunkov
2024-10-11 22:25 ` Mina Almasry
2024-10-11 23:12 ` Pavel Begunkov
2024-10-09 20:44 ` Mina Almasry
2024-10-09 22:13 ` Pavel Begunkov
2024-10-09 22:19 ` Pavel Begunkov
2024-10-07 22:15 ` [PATCH v1 04/15] net: page_pool: create hooks for custom page providers David Wei
2024-10-09 20:49 ` Mina Almasry
2024-10-09 22:02 ` Pavel Begunkov
2024-10-07 22:15 ` [PATCH v1 05/15] net: prepare for non devmem TCP memory providers David Wei
2024-10-09 20:56 ` Mina Almasry
2024-10-09 21:45 ` Pavel Begunkov
2024-10-13 22:33 ` Pavel Begunkov
2024-10-07 22:15 ` [PATCH v1 06/15] net: page_pool: add ->scrub mem provider callback David Wei
2024-10-09 21:00 ` Mina Almasry
2024-10-09 21:59 ` Pavel Begunkov
2024-10-10 17:54 ` Mina Almasry
2024-10-13 17:25 ` David Wei
2024-10-14 13:37 ` Pavel Begunkov
2024-10-14 22:58 ` Mina Almasry
2024-10-16 17:42 ` Pavel Begunkov
2024-11-01 17:18 ` Mina Almasry
2024-11-01 18:35 ` Pavel Begunkov
2024-11-01 19:24 ` Mina Almasry
2024-11-01 21:38 ` Pavel Begunkov
2024-11-04 20:42 ` Mina Almasry
2024-11-04 21:27 ` Pavel Begunkov
2024-10-07 22:15 ` [PATCH v1 07/15] net: page pool: add helper creating area from pages David Wei
2024-10-09 21:11 ` Mina Almasry
2024-10-09 21:34 ` Pavel Begunkov
2024-10-07 22:15 ` [PATCH v1 08/15] net: add helper executing custom callback from napi David Wei
2024-10-08 22:25 ` Joe Damato
2024-10-09 15:09 ` Pavel Begunkov
2024-10-09 16:13 ` Joe Damato
2024-10-09 19:12 ` Pavel Begunkov
2024-10-07 22:15 ` [PATCH v1 09/15] io_uring/zcrx: add interface queue and refill queue David Wei
2024-10-09 17:50 ` Jens Axboe
2024-10-09 18:09 ` Jens Axboe
2024-10-09 19:08 ` Pavel Begunkov
2024-10-11 22:11 ` Pavel Begunkov
2024-10-13 17:32 ` David Wei
2024-10-07 22:15 ` [PATCH v1 10/15] io_uring/zcrx: add io_zcrx_area David Wei
2024-10-09 18:02 ` Jens Axboe
2024-10-09 19:05 ` Pavel Begunkov
2024-10-09 19:06 ` Jens Axboe
2024-10-09 21:29 ` Mina Almasry
2024-10-07 22:15 ` David Wei [this message]
2024-10-09 18:10 ` [PATCH v1 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider Jens Axboe
2024-10-09 22:01 ` Mina Almasry
2024-10-09 22:58 ` Pavel Begunkov
2024-10-10 18:19 ` Mina Almasry
2024-10-10 20:26 ` Pavel Begunkov
2024-10-10 20:53 ` Mina Almasry
2024-10-10 20:58 ` Mina Almasry
2024-10-10 21:22 ` Pavel Begunkov
2024-10-11 0:32 ` Mina Almasry
2024-10-11 1:49 ` Pavel Begunkov
2024-10-07 22:16 ` [PATCH v1 12/15] io_uring/zcrx: add io_recvzc request David Wei
2024-10-09 18:28 ` Jens Axboe
2024-10-09 18:51 ` Pavel Begunkov
2024-10-09 19:01 ` Jens Axboe
2024-10-09 19:27 ` Pavel Begunkov
2024-10-09 19:42 ` Jens Axboe
2024-10-09 19:47 ` Pavel Begunkov
2024-10-09 19:50 ` Jens Axboe
2024-10-07 22:16 ` [PATCH v1 13/15] io_uring/zcrx: add copy fallback David Wei
2024-10-08 15:58 ` Stanislav Fomichev
2024-10-08 16:39 ` Pavel Begunkov
2024-10-08 16:40 ` David Wei
2024-10-09 16:30 ` Stanislav Fomichev
2024-10-09 23:05 ` Pavel Begunkov
2024-10-11 6:22 ` David Wei
2024-10-11 14:43 ` Stanislav Fomichev
2024-10-09 18:38 ` Jens Axboe
2024-10-07 22:16 ` [PATCH v1 14/15] io_uring/zcrx: set pp memory provider for an rx queue David Wei
2024-10-09 18:42 ` Jens Axboe
2024-10-10 13:09 ` Pavel Begunkov
2024-10-10 13:19 ` Jens Axboe
2024-10-07 22:16 ` [PATCH v1 15/15] io_uring/zcrx: throttle receive requests David Wei
2024-10-09 18:43 ` Jens Axboe
2024-10-07 22:20 ` [PATCH v1 00/15] io_uring zero copy rx David Wei
2024-10-08 23:10 ` Joe Damato
2024-10-09 15:07 ` Pavel Begunkov
2024-10-09 16:10 ` Joe Damato
2024-10-09 16:12 ` Jens Axboe
2024-10-11 6:15 ` David Wei
2024-10-09 15:27 ` Jens Axboe
2024-10-09 15:38 ` David Ahern
2024-10-09 15:43 ` Jens Axboe
2024-10-09 15:49 ` Pavel Begunkov
2024-10-09 15:50 ` Jens Axboe
2024-10-09 16:35 ` David Ahern
2024-10-09 16:50 ` Jens Axboe
2024-10-09 16:53 ` Jens Axboe
2024-10-09 17:12 ` Jens Axboe
2024-10-10 14:21 ` Jens Axboe
2024-10-10 15:03 ` David Ahern
2024-10-10 15:15 ` Jens Axboe
2024-10-10 18:11 ` Jens Axboe
2024-10-14 8:42 ` David Laight
2024-10-09 16:55 ` Mina Almasry
2024-10-09 16:57 ` Jens Axboe
2024-10-09 19:32 ` Mina Almasry
2024-10-09 19:43 ` Pavel Begunkov
2024-10-09 19:47 ` Jens Axboe
2024-10-09 17:19 ` David Ahern
2024-10-09 18:21 ` Pedro Tammela
2024-10-10 13:19 ` Pavel Begunkov
2024-10-11 0:35 ` David Wei
2024-10-11 14:28 ` Pedro Tammela
2024-10-11 0:29 ` David Wei
2024-10-11 19:43 ` Mina Almasry
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20241007221603.1703699-12-dw@davidwei.uk \
--to=dw@davidwei.uk \
--cc=almasrymina@google.com \
--cc=asml.silence@gmail.com \
--cc=axboe@kernel.dk \
--cc=davem@davemloft.net \
--cc=dsahern@kernel.org \
--cc=edumazet@google.com \
--cc=hawk@kernel.org \
--cc=io-uring@vger.kernel.org \
--cc=kuba@kernel.org \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).