From: Pavel Begunkov <asml.silence@gmail.com>
To: Jakub Kicinski <kuba@kernel.org>, Mina Almasry <almasrymina@google.com>
Cc: David Wei <dw@davidwei.uk>,
io-uring@vger.kernel.org, netdev@vger.kernel.org,
Jens Axboe <axboe@kernel.dk>, Paolo Abeni <pabeni@redhat.com>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jesper Dangaard Brouer <hawk@kernel.org>,
David Ahern <dsahern@kernel.org>,
Stanislav Fomichev <stfomichev@gmail.com>,
Joe Damato <jdamato@fastly.com>,
Pedro Tammela <pctammela@mojatatu.com>
Subject: Re: [PATCH v7 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
Date: Wed, 4 Dec 2024 19:04:53 +0000 [thread overview]
Message-ID: <2a9d89e3-94ba-4707-8e7b-726c2eeb5bfc@gmail.com> (raw)
In-Reply-To: <6e687570-c7b7-4c22-a601-d7ea6d620afe@gmail.com>
On 11/26/24 15:26, Pavel Begunkov wrote:
> On 11/15/24 23:14, Jakub Kicinski wrote:
>> On Thu, 14 Nov 2024 12:56:14 -0800 Mina Almasry wrote:
>>> But I've been bringing this up a lot in the past (and offering
>>> alternatives that don't introduce this overloading) and I think this
>>> conversation has run its course. I'm unsure about this approach and
>>> this could use another pair of eyes. Jakub, sorry to bother you but
>>> you probably are the one that reviewed the whole net_iov stuff most
>>> closely. Any chance you would take a look and provide direction here?
>>> Maybe my concern is overblown...
>>
>> Sorry I haven't read this code very closely (still!?) really hoping
>> for next version to come during the merge window when time is more
>> abundant :S
>>
>> From scanning the quoted context I gather you're asking about using
>> the elevated ->pp_ref_count for user-owned pages? If yes - I also
>> find that part.. borderline incorrect. The page pool stats will show
>> these pages as allocated which normally means held by the driver or
>> the stack. Pages given to the user should effectively leave the pool.
>
> It can't just drop all net_iov refs here, otherwise the buffer might
> be returned back to page pool and reused while the user still reading
> the data. We can't even be smart in the release callback as it might
> never get there and be reallocated purely via alloc.cache. And either
> way, tunneling all buffers into ->release_netmem would be horrible
> for performance, and it'd probably even need a check in
> page_pool_recycle_in_cache().
>
> Fixing it for devmem TCP (which also holds a net_iov ref while it's
> given to user, so we're not unique here) sounds even harder as
> they're stashed in a socket xarray page pool knows nothing about,
> so it might need some extra counting on top?
>
> This set has a problem with page_pool_alloc_frag*() helpers, so
> we'd either need to explicitly chip away some bits from ->pp_ref_count
> or move user counting out of net_iov and double the cost of one
> of the main sources of overhead, and then being very inventive
> optimising it in the future, but that won't solve the "should
> leave the pool" problem.
>
> If it's just stats printing, it should be quite easy to fix
> for the current set, ala some kind of "mask out bits responsible
> for user refs". And I don't immediately have an idea of how to
> address it for devmem TCP.
>
> Also note, that if sth happens with io_uring or such, those
> "user" refs are going to be dropped by the kernel off a page
> pool callback, so it's not about leaking buffers.
>
>> So I'm guessing this is some optimization.
>> Same for patch 8.
>
> This one will need some more explanation, otherwise it'd be a guess
> game. What is incorrect? The overall approach or some implementation
> aspect? It's also worth to note that it's a private queue, stopping
> the napi attached to it shouldn't interfere with other queues and
> users in the system, that's it assuming steering configured right.
>
>> But let me not get sucked into this before we wrap up the net-next PR..
Jakub, there is hardly anything I can do with the series if you veto it
but remain silent on any details. It appears you've had some private
conversations with David I wasn't privy to. And it's hard to tell but it
sounds that your ask number (1) is that it should be able to serve
multiple page pools in parallel. And (2), you don't like something in
regards to refcounting.
(2) As I explained in the previous message, I can separate refcounting
to get rid of the io_uring ref bias, but it's not going to help with
"elevated ->pp_ref_count for user-owned pages" IIUC what you meant
here. Moreover, devmem is suffering from the same thing, and if you
feel like it's really a big deal, we should revert devmem TCP, which
would undesirable, or somehow solve it otherwise.
Another problem is performance, it doubles refcounting, which is just
not acceptable, so I'd need to do something about it. For example,
the syscall part of the path, ala recv(2), would do something like:
for (i = 0; i < nr_frags; i++) {
niov = skb->frags[i].netmem;
// for each user ref we also take a kernel ref so that
// the net_iov is not reallocated while the user space reads data.
atomic_inc(io_uring->user_refs[idx(niov)]);
atomic_inc(net_iov->pp_ref_count);
post_completion();
}
One way out would be to steal frags from an skb together with a
reference it holds.
bool steal_frag_refs = !skb->cloned;
for (i = 0; i < nr_frags; i++) {
niov = skb->frags[i].netmem;
atomic_inc(io_uring->user_refs[idx(niov)]);
if (!steal_frag_refs)
atomic_inc(net_iov->pp_ref_count);
post_completion();
}
// we reused the frag refs, so took it away from core net
if (steal_frag_refs)
skb->nr_frags = 0;
Does that sound reasonable to you? Or do you have any other suggestions?
(1) is also doable, but that can't be done without performance penalties,
or specifically it'll tie hands from many future optimisations for more
or less same reasons why page pools are not shared and can't be pulled
from by mulitple CPUs in parallel. For example, if you think that
page_pool_napi_local() is a good optimisation, then it'd also make
sense to optimise user refcounts in a similar fashion.
Perhaps, it can have a fast path that assumes a single page pool, assume
that the state when there are multiple pp's is transient, and do some
additional sync at queue restart and/or pp init/destroy callbacks when
switching between 1 -> many and many -> 1, which would likely need
to briefly stop the queue / napi in one way or another. Does that
sound reasonable?
--
Pavel Begunkov
next prev parent reply other threads:[~2024-12-04 19:04 UTC|newest]
Thread overview: 48+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-29 23:05 [PATCH v7 00/15] io_uring zero copy rx David Wei
2024-10-29 23:05 ` [PATCH v7 01/15] net: prefix devmem specific helpers David Wei
2024-11-01 14:57 ` Mina Almasry
2024-10-29 23:05 ` [PATCH v7 02/15] net: generalise net_iov chunk owners David Wei
2024-10-29 23:05 ` [PATCH v7 03/15] net: page_pool: create hooks for custom page providers David Wei
2024-11-05 16:28 ` Alexander Lobakin
2024-11-06 1:08 ` Pavel Begunkov
2024-10-29 23:05 ` [PATCH v7 04/15] net: prepare for non devmem TCP memory providers David Wei
2024-11-01 17:09 ` Mina Almasry
2024-11-01 17:41 ` Pavel Begunkov
2024-11-04 20:20 ` Mina Almasry
2024-11-04 21:24 ` Pavel Begunkov
2024-11-11 19:01 ` David Wei
2024-11-15 0:43 ` Mina Almasry
2024-10-29 23:05 ` [PATCH v7 05/15] net: page_pool: add ->scrub mem provider callback David Wei
2024-10-29 23:05 ` [PATCH v7 06/15] net: page pool: add helper creating area from pages David Wei
2024-11-01 17:33 ` Mina Almasry
2024-11-01 18:16 ` Pavel Begunkov
2024-11-05 23:34 ` Mina Almasry
2024-11-06 0:50 ` Pavel Begunkov
2024-10-29 23:05 ` [PATCH v7 07/15] net: page_pool: introduce page_pool_mp_return_in_cache David Wei
2024-11-01 20:05 ` Mina Almasry
2024-10-29 23:05 ` [PATCH v7 08/15] net: add helper executing custom callback from napi David Wei
2024-10-29 23:05 ` [PATCH v7 09/15] io_uring/zcrx: add interface queue and refill queue David Wei
2024-10-29 23:05 ` [PATCH v7 10/15] io_uring/zcrx: add io_zcrx_area David Wei
2024-10-29 23:05 ` [PATCH v7 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider David Wei
2024-11-01 20:06 ` Mina Almasry
2024-11-01 21:09 ` Pavel Begunkov
2024-11-03 21:52 ` Pavel Begunkov
2024-11-04 19:54 ` Mina Almasry
2024-11-04 21:14 ` Pavel Begunkov
2024-11-04 21:53 ` Pavel Begunkov
2024-11-11 21:15 ` David Wei
2024-11-14 20:56 ` Mina Almasry
2024-11-15 23:14 ` Jakub Kicinski
2024-11-26 15:26 ` Pavel Begunkov
2024-12-04 19:04 ` Pavel Begunkov [this message]
2024-10-29 23:05 ` [PATCH v7 12/15] io_uring/zcrx: add io_recvzc request David Wei
2024-11-01 20:11 ` Mina Almasry
2024-11-01 21:17 ` Pavel Begunkov
2024-11-05 23:09 ` Mina Almasry
2024-11-11 22:02 ` David Wei
2024-10-29 23:05 ` [PATCH v7 13/15] io_uring/zcrx: set pp memory provider for an rx queue David Wei
2024-11-01 20:16 ` Mina Almasry
2024-11-01 20:35 ` Jens Axboe
2024-11-01 21:12 ` Pavel Begunkov
2024-10-29 23:05 ` [PATCH v7 14/15] io_uring/zcrx: add copy fallback David Wei
2024-10-29 23:05 ` [PATCH v7 15/15] io_uring/zcrx: throttle receive requests David Wei
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=2a9d89e3-94ba-4707-8e7b-726c2eeb5bfc@gmail.com \
--to=asml.silence@gmail.com \
--cc=almasrymina@google.com \
--cc=axboe@kernel.dk \
--cc=davem@davemloft.net \
--cc=dsahern@kernel.org \
--cc=dw@davidwei.uk \
--cc=edumazet@google.com \
--cc=hawk@kernel.org \
--cc=io-uring@vger.kernel.org \
--cc=jdamato@fastly.com \
--cc=kuba@kernel.org \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=pctammela@mojatatu.com \
--cc=stfomichev@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox