From: Pavel Begunkov <asml.silence@gmail.com>
To: Mina Almasry <almasrymina@google.com>
Cc: David Wei <dw@davidwei.uk>,
io-uring@vger.kernel.org, netdev@vger.kernel.org,
Jens Axboe <axboe@kernel.dk>, Jakub Kicinski <kuba@kernel.org>,
Paolo Abeni <pabeni@redhat.com>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jesper Dangaard Brouer <hawk@kernel.org>,
David Ahern <dsahern@kernel.org>
Subject: Re: [PATCH v1 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
Date: Thu, 10 Oct 2024 22:22:47 +0100
Message-ID: <096387ce-64f0-402f-a5d2-6b51653f9539@gmail.com>
In-Reply-To: <CAHS8izMrPuQNvwGwAUjh7GAY-CoC81rc5BD1ZMmy-nNds3xDgA@mail.gmail.com>
On 10/10/24 21:53, Mina Almasry wrote:
> On Thu, Oct 10, 2024 at 1:26 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
...
>>>
>>> Sorry I wasn't clear. By 'this' I'm referring to:
>>>
>>> "from where our ->alloc_netmems implementation can grab it, check
>>> references, put IO_ZC_RX_UREF, and recycle the buffer if there are no
>>> more users left"
>>>
>>> This is the part that I'm not able to stomach at the moment. Maybe if
>>> I look deeper it would make more sense, but my first feeling is that
>>> it's really not acceptable.
>>>
>>> alloc_netmems (and more generically page_pool_alloc_netmem) just
>>> allocates a netmem and gives it to the page_pool code to decide
>>
>> That's how it works because that's how devmem needs it and you
>> tailored it that way, not the other way around. It could just as
>> well have been a callback that fills the cache as an intermediate
>> step, from which the page pool can grab netmems and return them to
>> the user, and it would have been a pretty clean interface as well.
>>
>
> It could have been, but that would be a much worse design IMO. The
> whole point of memory providers is that they provide memory to the
> page_pool and the page_pool does its thing (among which is recycling)
> with that memory. In this patch you seem to have implemented a
> provider where, if the page is returned by io_uring, it's not
> returned to the page_pool but directly to the provider. In
> other code paths the memory will be returned to the page_pool.
>
> I.e. allocation is always:
> provider -> pp -> driver
>
> freeing from io_uring is:
> io_uring -> provider -> pp
>
> freeing from tcp stack or driver I'm guessing will be:
> tcp stack/driver -> pp -> provider
>
> I'm recommending that the model for memory providers be in line
> with what we do for pages, devmem TCP, and Jakub's out-of-tree huge
> page provider (i.e. everything else using the page_pool). The model
> is streamlined:
Let's not bring normal pages into this, because 1) in the general
case it can't work any other way: the buffer has to cross contexts
from wherever the page is back to the napi / page pool, and 2)
devmem TCP and io_uring already deviate from the standard page
pool by extending buffer lifetimes into user space, and more.
Which is exactly my point: you're recommending it be aligned with
devmem TCP. And let's not forget that you had to add batching to
that very syscall return path because of performance...
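
For reference, the batched return on that path looks roughly like the
sketch below from the userspace side: a whole batch of fragment tokens
goes back in a single setsockopt() call rather than one call per
fragment. This is only an illustrative sketch assuming the
SO_DEVMEM_DONTNEED / struct dmabuf_token uAPI from the devmem TCP
series and recent enough kernel headers; the helper name and batch
size are made up.

#include <sys/socket.h>
#include <linux/types.h>
#include <linux/uio.h>	/* struct dmabuf_token, assuming the devmem TCP uAPI */

#define TOKEN_BATCH 64

/*
 * Illustrative helper (not from the patches): return a batch of
 * received devmem fragments to the kernel in one setsockopt() call
 * instead of a syscall per fragment. frag_tokens[] would be collected
 * from the SCM_DEVMEM_DMABUF control messages of earlier recvmsg()s.
 */
static int return_frag_batch(int fd, const __u32 *frag_tokens, unsigned int nr)
{
	struct dmabuf_token tokens[TOKEN_BATCH];
	unsigned int i, n = nr < TOKEN_BATCH ? nr : TOKEN_BATCH;

	for (i = 0; i < n; i++) {
		tokens[i].token_start = frag_tokens[i];
		tokens[i].token_count = 1;
	}
	/* one syscall covers the whole batch */
	return setsockopt(fd, SOL_SOCKET, SO_DEVMEM_DONTNEED,
			  tokens, n * sizeof(tokens[0]));
}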
...
>>> I doubt this is true or at least there needs to be more info here. The
>>
>> If you don't believe me, then please go ahead and do your own testing,
>> or look through patches addressing it across the stack like [1];
>> you'll be able to find many more. I don't have any recent numbers
>> on indirect calls, but I did a fair share of testing before for
>> different kinds of overhead: it has always been expensive, easily
>> 1-2% per fast block request, and it could be much worse if it's per
>> page.
>>
>> [1] https://lore.kernel.org/netdev/cover.1543836966.git.pabeni@redhat.com/
>>
>>
>>> page_pool_alloc_netmem() pretty much allocates 1 buffer per callback
>>> for all its current users (regular memory & dmabuf), and that's good
>>> enough to drive 200gbps NICs. What is special about the io_uring use
>>> case that makes this not good enough?
>>>
>>> The reason it is good enough in my experience is that
>>> page_pool_alloc_netmem() is a slow path. netmems are allocated from
>>> that function and heavily recycled by the page_pool afterwards.
>>
>> That's because of how you return buffers to the page pool; with
>> io_uring it is a hot path, even though it's amortised precisely
>> because it doesn't return just one buffer at a time.
>>
>
> Right, I guess I understand now. You need to implement your own
> recycling in the provider because your model has bypassed the
> page_pool recycling - which to me is 90% of the utility of the
So the utility of the page pool is the fast return path for the
standard page mode, i.e. napi_pp_put_page. It is, and it is
important, I agree. But then, even though we have what is IMO a
better approach for this "buffer lifetime extended to user space"
scenario, it has to use that very same return path because...?
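
To make the comparison concrete, below is a minimal sketch of the
return path described earlier in the thread: userspace puts buffer
indices into a shared refill ring, and the provider's ->alloc_netmems
callback drains that ring, puts IO_ZC_RX_UREF, and pushes buffers with
no users left straight into the page pool's alloc cache. All names
here (zcrx_ifq, zcrx_ring_refill) are hypothetical, not the patch's
actual code, and the hook is assumed to return a single netmem with
the cache acting as the bulk store.

/* Kernel-internal sketch; would need net/netmem.h and page_pool/types.h. */
static netmem_ref zcrx_alloc_netmems(struct page_pool *pp, gfp_t gfp)
{
	struct zcrx_ifq *ifq = pp->mp_priv;	/* hypothetical provider state */

	/* gfp is unused in this sketch; buffers come from the refill ring. */
	if (!pp->alloc.count) {
		/*
		 * Drain buffer indices userspace pushed into the shared
		 * refill ring: for each one, put IO_ZC_RX_UREF and, if no
		 * users are left, refill pp->alloc.cache in bulk.
		 */
		zcrx_ring_refill(pp, ifq);	/* hypothetical helper */
		if (!pp->alloc.count)
			return 0;		/* ring empty, nothing to recycle */
	}
	return pp->alloc.cache[--pp->alloc.count];
}

The point being argued is that recycling happens here, in the
provider's bulk refill of the cache, rather than per buffer through
napi_pp_put_page().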
> page_pool. To make matters worse, the bypass is only there if the
> netmems are returned from io_uring, not when the netmems are
> returned from the driver/tcp stack. I'm guessing that if you reused
> the page_pool recycling in the io_uring return path, it would remove
> the need for your provider to implement its own recycling for the
> io_uring return case.
>
> Is letting providers bypass and override the page_pool's recycling in
> some code paths OK? IMO, no. A maintainer will make the judgement call
Mina, frankly, that's nonsense. If we extend the same logic:
devmem overrides page allocation rules with callbacks; devmem
overrides and violates page pool buffer lifetimes by extending
them to user space; devmem violates and overrides the page pool
object lifetime by binding buffers to sockets. And yet I'd rather
call all of that extending and enhancing the page pool to fit the
devmem use case.
> and speak authoritatively here and I will follow, but I do think it's
> a (much) worse design.
Sure, and I hold the completely opposite opinion: this is a much
better approach than returning buffers through a syscall. But I
agree with you that ultimately the maintainers will say whether
it's acceptable for the networking stack or not.
--
Pavel Begunkov