From: Alexander Lobakin <aleksander.lobakin@intel.com>
To: Jesper Dangaard Brouer <hawk@kernel.org>,
Jakub Kicinski <kuba@kernel.org>
Cc: Alexander H Duyck <alexander.duyck@gmail.com>,
Ratheesh Kannoth <rkannoth@marvell.com>, <netdev@vger.kernel.org>,
<linux-kernel@vger.kernel.org>, <davem@davemloft.net>,
<edumazet@google.com>, <pabeni@redhat.com>,
Ilias Apalodimas <ilias.apalodimas@linaro.org>,
Yunsheng Lin <linyunsheng@huawei.com>
Subject: Re: [PATCH net-next] page_pool: Clamp ring size to 32K
Date: Tue, 8 Aug 2023 15:29:02 +0200 [thread overview]
Message-ID: <7c3aa2c5-ecf3-e2cf-8955-04155f37d609@intel.com> (raw)
In-Reply-To: <15d32b22-22b0-64e3-a49e-88d780c24616@kernel.org>
From: Jesper Dangaard Brouer <hawk@kernel.org>
Date: Mon, 7 Aug 2023 22:11:35 +0200
>
>
> On 07/08/2023 19.20, Jakub Kicinski wrote:
>> On Mon, 07 Aug 2023 07:18:21 -0700 Alexander H Duyck wrote:
>>>> Page pool (PP) is just a cache of pages. The driver octeontx2 (in
>>>> the link) is creating an excessively large cache of pages. The
>>>> driver's RX descriptor ring size should be independent of the PP
>>>> ptr_ring size, as the latter is just a cache that grows as a
>>>> function of the in-flight packet workload; it functions as a
>>>> "shock absorber".
>>>>
>>>> 32768 pages (4 KiB each) is approx 128 MiB, and this is per RX-queue.
>>>>
>>>> The RX-desc ring (obviously) pins down these pages immediately, but
>>>> the PP ring starts empty. As the workload varies, the "shock
>>>> absorber" effect will let more pages into the system, which will
>>>> travel through the PP ptr_ring. As all pages originating from the
>>>> same PP instance get recycled, the number of in-flight pages in the
>>>> "system" (PP ptr_ring) will grow over time.
>>>>
>>>> The PP design has the problem that it never releases or reduces the
>>>> pages in this shock-absorber "closed" system. (Cc'ing PP
>>>> people/devel) we should consider implementing an MM shrinker
>>>> callback (include/linux/shrinker.h).
>>>>
>>>> Are the systems using the octeontx2 driver ready to handle 128 MiB
>>>> of memory per RX-queue getting pinned down over time? (This could
>>>> lead to some strange, hard-to-debug situations if memory is not
>>>> sufficient.)
>>>
>>> I'm with Jesper on this. It doesn't make sense to be tying the
>>> page_pool size strictly to the ring size. The amount of recycling you
>>> get will depend on how long the packets are on the stack, not in the
>>> driver.
>>>
>
> Thanks for agreeing with me, and I agree with you :-)
>
>>> For example, in the case of something like a software router or bridge
>>> that is just taking the Rx packets and routing them to Tx you could
>>> theoretically get away with a multiple of NAPI_POLL_WEIGHT since you
>>> would likely never need much more than that as the Tx would likely be
>>> cleaned about as fast as the Rx can consume the pages.
>>>
>
> I agree.
>
>>> Rather than overriding the size here, wouldn't it make more sense to
>>> do it in the octeontx2 driver? With that, at least you would know
>>> that you were the one that limited the size, instead of having the
>>> value modified out from underneath you.
>>>
>
> I'm not fully agreeing here. I don't think we can expect driver
> developers to be experts on page_pool cache dynamics. I'm more on
> Jakub's side here, as perhaps we/net-core can come up with some control
> system, even if this means we change this underneath drivers.
>
>
>>> That said, one change that might help to enable this kind of change
>>> would be to look at adding a #define so that this value wouldn't be
>>> so much a magic number and would be visible to the drivers should it
>>> ever be changed in the future.
>>
>> All the points y'all are making are valid; sizing the cache is a hard
>> problem. But the proposed solution goes in the wrong direction, IMO.
>> The driver doesn't know. I started hacking together page pool control
>> over netlink. I think that the pool size selection logic should be in
>> the core, with inputs taken from user space / workload (via netlink).
>>
>> If it wasn't for the fact that I'm working on that API I'd probably
>> side with you. And 64k descriptors is impractically large.
>>
>> Copy / pasting from the discussion on previous version:
>>
>> Tuning this in the driver relies on the assumption that the HW /
>> driver is the thing that matters. I'd think that the workload,
>> platform (CPU) and config (e.g. is IOMMU enabled?) will matter at
>> least as much, while driver developers will end up tuning to whatever
>> servers they have, a random single config, and most likely... iperf.
>>
>> IMO it's much better to re-purpose "pool_size" and treat it as the
>> ring size, because that's what most drivers end up putting there.
>
> I disagree here, as driver developers should not treat "pool_size" as
> the ring size. That seems to be copy-paste programming without
> understanding PP dynamics.
+1. That's why I wrote in the previous thread that pool_size must be the
minimum value which gives optimal performance. I don't believe Otx2 HW
needs 32k entries in PP's ptr_ring to have optimal performance.
That's why I wrote that developers must check whether there's any
benefit in using bigger pool_size values. Values bigger than 2k don't
seem reasonable to me, especially now that we use direct recycling much
more aggressively -- oftentimes the ptr_ring isn't used at all.
Jakub thought that my "please test whether bigger sizes make sense"
meant "please tune Page Pool to your servers", which is not exactly
what I wanted to say =\ I only said "please keep pool_size reasonable;
it's your right to have 2^32 descriptors on the ring, but don't do that
with Page Pool".
>
>> Defer tuning of the effective ring size to the core and user input
>> (via the "it will be added any minute now" netlink API for configuring
>> page pools)...
>>
>
> I agree here, that tuning ring size is a hard problem, and this is
> better handled in the core. Happy to hear, that/if Jakub is working on
> this.
>
>> So capping the recycle ring to 32k instead of returning the error
>> seems like an okay solution for now.
>
> As a temporary solution, I'm actually fine with capping at 32k.
> Driver developers lose some feedback control, but perhaps that is
> okay, if we can agree that the net-core should control tuning this
> anyhow.
>
> --Jesper
Thanks,
Olek
Thread overview: 7+ messages
2023-08-07 3:49 [PATCH net-next] page_pool: Clamp ring size to 32K Ratheesh Kannoth
2023-08-07 11:42 ` Jesper Dangaard Brouer
2023-08-07 14:18 ` Alexander H Duyck
2023-08-07 17:20 ` Jakub Kicinski
2023-08-07 20:11 ` Jesper Dangaard Brouer
2023-08-08 2:26 ` [EXT] " Ratheesh Kannoth
2023-08-08 13:29 ` Alexander Lobakin [this message]