From: Pavel Begunkov <asml.silence@gmail.com>
To: Mina Almasry <almasrymina@google.com>
Cc: linux-doc@vger.kernel.org, "Kaiyuan Zhang" <kaiyuanz@google.com>,
dri-devel@lists.freedesktop.org,
"Eric Dumazet" <edumazet@google.com>,
linux-kselftest@vger.kernel.org, "Shuah Khan" <shuah@kernel.org>,
"Sumit Semwal" <sumit.semwal@linaro.org>,
linux-arch@vger.kernel.org,
"Willem de Bruijn" <willemdebruijn.kernel@gmail.com>,
"Jeroen de Borst" <jeroendb@google.com>,
"Jonathan Corbet" <corbet@lwn.net>,
"Jakub Kicinski" <kuba@kernel.org>,
"Paolo Abeni" <pabeni@redhat.com>,
linux-media@vger.kernel.org,
"Jesper Dangaard Brouer" <hawk@kernel.org>,
"Arnd Bergmann" <arnd@arndb.de>,
"Shailend Chand" <shailend@google.com>,
"Shakeel Butt" <shakeelb@google.com>,
"Harshitha Ramamurthy" <hramamurthy@google.com>,
"Willem de Bruijn" <willemb@google.com>,
netdev@vger.kernel.org, "David Ahern" <dsahern@kernel.org>,
"Ilias Apalodimas" <ilias.apalodimas@linaro.org>,
linux-kernel@vger.kernel.org,
"Christian König" <christian.koenig@amd.com>,
"Yunsheng Lin" <linyunsheng@huawei.com>,
"Praveen Kaligineedi" <pkaligineedi@google.com>,
bpf@vger.kernel.org, "David S. Miller" <davem@davemloft.net>
Subject: Re: [net-next v1 08/16] memory-provider: dmabuf devmem memory provider
Date: Mon, 11 Dec 2023 20:35:54 +0000
Message-ID: <661c1bae-d7d3-457e-b545-5f67b9ef4197@gmail.com>
In-Reply-To: <CAHS8izPry13h49v+PqrmWSREZKZjYpPesxUTyPQy7AGyFwzo4g@mail.gmail.com>
On 12/11/23 02:30, Mina Almasry wrote:
> On Sat, Dec 9, 2023 at 7:05 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> On 12/8/23 23:25, Mina Almasry wrote:
>>> On Fri, Dec 8, 2023 at 2:56 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>>>
>>>> On 12/8/23 00:52, Mina Almasry wrote:
>>> ...
>>>>> + if (pool->p.queue)
>>>>> + binding = READ_ONCE(pool->p.queue->binding);
>>>>> +
>>>>> + if (binding) {
>>>>> + pool->mp_ops = &dmabuf_devmem_ops;
>>>>> + pool->mp_priv = binding;
>>>>> + }
>>>>
>>>> Hmm, I don't understand why we would replace a nice transparent
>>>> api with page pool relying on a queue having a devmem specific
>>>> pointer? It seemed more flexible and cleaner in the last RFC.
>>>>
>>>
>>> Jakub requested this change and may chime in, but I suspect it's to
>>> further abstract the devmem changes from driver. In this iteration,
>>> the driver grabs the netdev_rx_queue and passes it to the page_pool,
>>> and any future configurations between the net stack and page_pool can
>>> be passed this way with the driver unbothered.
>>
>> Ok, that makes sense, but even if passed via an rx queue I'd
>> at least hope it keeps abstract provider parameters, e.g.
>> ops, and isn't hard coded with devmem specific code.
>>
>> It might even be better done with a helper like
>> create_page_pool_from_queue(), unless some deeper
>> interaction b/w pp and rx queues is predicted.
>>
>
> Off hand I don't see the need for a new create_page_pool_from_queue().
> page_pool_create() already takes in a param arg that lets us pass in
> the queue as well as any other params.
>
>>>>> +
>>>>> if (pool->mp_ops) {
>>>>> err = pool->mp_ops->init(pool);
>>>>> if (err) {
>>>>> @@ -1020,3 +1033,77 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
>>>>> }
>>>>> }
>>>>> EXPORT_SYMBOL(page_pool_update_nid);
>>>>> +
>>>>> +void __page_pool_iov_free(struct page_pool_iov *ppiov)
>>>>> +{
>>>>> + if (WARN_ON(ppiov->pp->mp_ops != &dmabuf_devmem_ops))
>>>>> + return;
>>>>> +
>>>>> + netdev_free_dmabuf(ppiov);
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(__page_pool_iov_free);
>>>>
>>>> I didn't look too deep but I don't think I immediately follow
>>>> the pp refcounting. It increments pages_state_hold_cnt on
>>>> allocation, but IIUC doesn't mark skbs for recycle? Then, they all
>>>> will be put down via page_pool_iov_put_many() bypassing
>>>> page_pool_return_page() and friends. That will call
>>>> netdev_free_dmabuf(), which doesn't bump pages_state_release_cnt.
>>>>
>>>> At least I couldn't make it work with io_uring, and for my purposes,
>>>> I forced all puts to go through page_pool_return_page(), which calls
>>>> the ->release_page callback. The callback will put the reference and
>>>> ask its page pool to account release_cnt. It also gets rid of
>>>> __page_pool_iov_free(), as we'd need to add a hook there for
>>>> customization otherwise.
>>>>
>>>> I didn't care about overhead because the hot path for me is getting
>>>> buffers from a ring, which is somewhat analogous to sock_devmem_dontneed(),
>>>> but done on pp allocations under napi, and it's done separately.
>>>>
>>>> Completely untested with TCP devmem:
>>>>
>>>> https://github.com/isilence/linux/commit/14bd56605183dc80b540999e8058c79ac92ae2d8
>>>>
>>>
>>> This was a mistake in the last RFC, which should be fixed in v1. In
>>> the RFC I was not marking the skbs as skb_mark_for_recycle(), so the
>>> unreffing path wasn't as expected.
>>>
>>> In this iteration, that should be completely fixed. I suspect since I
>>> just posted this you're actually referring to the issue tested on the
>>> last RFC? Correct me if wrong.
>>
>> Right, it was with RFCv3
>>
>>> In this iteration, the reffing story:
>>>
>>> - memory provider allocs ppiov and returns it to the page pool with
>>> ppiov->refcount == 1.
>>> - The page_pool gives the page to the driver. The driver may
>>> obtain/release references with page_pool_page_[get|put]_many(), but
>>> the driver is likely not doing that unless it's doing its own page
>>> recycling.
>>> - The net stack obtains references via skb_frag_ref() ->
>>> page_pool_page_get_many()
>>> - The net stack drops references via skb_frag_unref() ->
>>> napi_pp_put_page() -> page_pool_return_page() and friends.
>>>
>>> Thus, the issue where the unref path was skipping
>>> page_pool_return_page() and friends should be resolved in this
>>> iteration, let me know if you think otherwise, but I think this was an
>>> issue limited to the last RFC.
>>
>> Then page_pool_iov_put_many() should and supposedly would never be
>> called by non devmap code because all puts must circle back into
>> ->release_page. Why add it into page_pool_page_put_many()?
>>
>> @@ -731,6 +731,29 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>> + if (page_is_page_pool_iov(page)) {
>> ...
>> + page_pool_page_put_many(page, 1);
>> + return NULL;
>> + }
>>
>> Well, I'm looking at this new branch from Patch 10, it can put
>> the buffer, but what if we race and it's actually the final put?
>> Looks like nobody is going to bump up pages_state_release_cnt
>>
>
> Good catch, I think indeed the release_cnt would be incorrect in this
> case. I think the race is benign in the sense that the ppiov will be
> freed correctly and available for allocation when the page_pool next
> needs it; the issue is with the stats AFAICT.

The hold_cnt / release_cnt pair is used for refcounting. In this case
it'll leak the pool when you try to destroy it.

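To illustrate why a missed release_cnt bump is more than a stats problem, here is a minimal userspace sketch. It assumes pool teardown can only complete once the number of held buffers minus the number of released buffers reaches zero. The struct and all function names below are hypothetical models for illustration, not the kernel's page_pool code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; not the kernel's struct page_pool. */
struct pool_model {
	uint32_t hold_cnt;    /* bumped when a buffer leaves the pool */
	uint32_t release_cnt; /* bumped when a buffer is finally released */
};

/* Buffers still outstanding; unsigned subtraction tolerates wraparound. */
static int32_t pool_inflight(const struct pool_model *p)
{
	return (int32_t)(p->hold_cnt - p->release_cnt);
}

/* In this model, teardown may only finish once nothing is in flight. */
static bool pool_can_destroy(const struct pool_model *p)
{
	return pool_inflight(p) == 0;
}

/* Allocation path: account one more held buffer. */
static void pool_alloc(struct pool_model *p)
{
	p->hold_cnt++;
}

/* Final put that goes through the accounting path. */
static void pool_release_accounted(struct pool_model *p)
{
	assert(pool_inflight(p) > 0);
	p->release_cnt++;
}

/* Final put that frees the buffer but skips the release_cnt bump. */
static void pool_release_unaccounted(struct pool_model *p)
{
	(void)p; /* buffer is gone, but the pool never learns about it */
}
```

In this model, one pool_alloc() followed by pool_release_unaccounted() leaves pool_inflight() at 1 forever, so pool_can_destroy() never becomes true even though the buffer itself was freed.
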
>> If you remove the branch, let it fall into ->release and rely
>> on refcounting there, then the callback could also fix up
>> release_cnt or ask pp to do it, like in the patch I linked above
>>
>
> Sadly I don't think this is possible due to the reasons I mention in
> the commit message of that patch. Prematurely releasing ppiov and not
> having them be candidates for recycling shows me a 4-5x degradation in
> performance.

I don't think I follow. The concept is to only recycle a buffer (i.e.
make it available for allocation) when its refs drop to zero, which is
IMHO the only way it can work, and IIUC what this patchset is doing.
That's also what I suggest doing, but through a slightly different path.

Let's say at some moment there are 2 refs (e.g. 1 for an skb and
1 for userspace/xarray).

Say it first puts the skb:

napi_pp_put_page()
  -> page_pool_return_page()
    -> mp_ops->release_page()
       -> need_to_free = put_buf()
          // not last ref, need_to_free==false,
          // don't recycle, don't increase release_cnt

Then you put the last ref:

page_pool_iov_put_many()
  -> page_pool_return_page()
    -> mp_ops->release_page()
       -> need_to_free = put_buf()
          // last ref, need_to_free==true,
          // recycle and release_cnt++

And that last put can even be recycled right into the
pp / ptr_ring, in which case it doesn't need to touch
release_cnt. Does it make sense? I don't see where
4-5x degradation would come from.

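The two-put sequence above can be modelled in plain C. This is a rough userspace sketch, not code from the patchset or from the linked commit: buf_model, pool_model, return_page() and two_puts_example() are hypothetical names, and release_page() here only stands in for the mp_ops->release_page callback. It does not model the direct recycle into the pp / ptr_ring.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the page pool and a provider buffer. */
struct pool_model {
	unsigned int hold_cnt;
	unsigned int release_cnt;
};

struct buf_model {
	struct pool_model *pool;
	int refs;
};

/* Drops one reference; returns true only for the final put. */
static bool put_buf(struct buf_model *buf)
{
	assert(buf->refs > 0);
	return --buf->refs == 0;
}

/* Stand-in for mp_ops->release_page(): reports whether the buffer is done. */
static bool release_page(struct buf_model *buf)
{
	return put_buf(buf);
}

/* Stand-in for page_pool_return_page(): every put funnels through here. */
static void return_page(struct buf_model *buf)
{
	bool need_to_free = release_page(buf);

	/* Only the last reference accounts the release. */
	if (need_to_free)
		buf->pool->release_cnt++;
}

/* Runs the scenario: 2 refs, skb put first, then the last put. */
static void two_puts_example(unsigned int *after_first,
			     unsigned int *after_last)
{
	struct pool_model pool = { .hold_cnt = 1, .release_cnt = 0 };
	struct buf_model buf = { .pool = &pool, .refs = 2 };

	return_page(&buf);		/* skb put, not the last ref */
	*after_first = pool.release_cnt;

	return_page(&buf);		/* last ref */
	*after_last = pool.release_cnt;
}
```

Calling two_puts_example() leaves release_cnt untouched after the first put and bumps it exactly once after the second, regardless of which holder happens to drop the final reference.
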
> What I could do here is detect that the refcount was dropped to 0 and
> fix up the stats in that case.
--
Pavel Begunkov