From: Alexander Lobakin <aleksander.lobakin@intel.com>
To: Jesper Dangaard Brouer <hawk@kernel.org>,
Jakub Kicinski <kuba@kernel.org>
Cc: Alexander H Duyck <alexander.duyck@gmail.com>,
Ratheesh Kannoth <rkannoth@marvell.com>, <netdev@vger.kernel.org>,
<linux-kernel@vger.kernel.org>, <davem@davemloft.net>,
<edumazet@google.com>, <pabeni@redhat.com>,
Ilias Apalodimas <ilias.apalodimas@linaro.org>,
Yunsheng Lin <linyunsheng@huawei.com>
Subject: Re: [PATCH net-next] page_pool: Clamp ring size to 32K
Date: Tue, 8 Aug 2023 15:29:02 +0200 [thread overview]
Message-ID: <7c3aa2c5-ecf3-e2cf-8955-04155f37d609@intel.com> (raw)
In-Reply-To: <15d32b22-22b0-64e3-a49e-88d780c24616@kernel.org>
From: Jesper Dangaard Brouer <hawk@kernel.org>
Date: Mon, 7 Aug 2023 22:11:35 +0200
>
>
> On 07/08/2023 19.20, Jakub Kicinski wrote:
>> On Mon, 07 Aug 2023 07:18:21 -0700 Alexander H Duyck wrote:
>>>> Page pool (PP) is just a cache of pages. The driver octeontx2 (in
>>>> the link) is creating an excessively large cache of pages. The
>>>> driver's RX descriptor ring size should be independent of the PP
>>>> ptr_ring size, as the latter is just a cache that grows as a
>>>> function of the in-flight packet workload; it functions as a
>>>> "shock absorber".
>>>>
>>>> 32768 pages (4 KiB each) is approx 128 MiB, and this is per RX-queue.
>>>>
>>>> The RX-desc ring (obviously) pins down these pages immediately, but
>>>> the PP ring starts empty. As the workload varies, the "shock
>>>> absorber" effect will let more pages into the system, which will
>>>> travel through the PP ptr_ring. As all pages originating from the
>>>> same PP instance get recycled, the number of in-flight pages in the
>>>> "system" (PP ptr_ring) will grow over time.
>>>>
>>>> The PP design has the problem that it never releases or reduces the
>>>> pages in this shock-absorber "closed" system. (Cc'ing PP
>>>> people/devel) we should consider implementing an MM shrinker
>>>> callback (include/linux/shrinker.h).
>>>>
>>>> Are the systems using the octeontx2 driver ready to handle 128 MiB
>>>> of memory per RX-queue getting pinned down over time? (This could
>>>> lead to some strange, hard-to-debug situations if memory is not
>>>> sufficient.)
>>>
>>> I'm with Jesper on this. It doesn't make sense to be tying the
>>> page_pool size strictly to the ring size. The amount of recycling you
>>> get will depend on how long the packets are on the stack, not in the
>>> driver.
>>>
>
> Thanks for agreeing with me, and I agree with you :-)
>
>>> For example, in the case of something like a software router or bridge
>>> that is just taking the Rx packets and routing them to Tx you could
>>> theoretically get away with a multiple of NAPI_POLL_WEIGHT since you
>>> would likely never need much more than that as the Tx would likely be
>>> cleaned about as fast as the Rx can consume the pages.
>>>
>
> I agree.
>
>>> Rather than overriding the size here, wouldn't it make more sense to
>>> do it in the octeontx2 driver? With that, at least you would know
>>> that you were the one that limited the size, instead of having the
>>> value modified out from underneath you.
>>>
>
> I'm not fully agreeing here. I don't think we can expect driver
> developers to be experts on page_pool cache dynamics. I'm more on
> Jakub's side here, as perhaps we/net-core can come up with some control
> system, even if this means we change this underneath drivers.
>
>
>>> That said, one change that might help to enable this kind of change
>>> would be to look at adding a #define so that this value wouldn't be
>>> so much a magic number and would be visible to the drivers should it
>>> ever be changed in the future.
>>
>> All the points y'all are making are valid; sizing the cache is a hard
>> problem. But the proposed solution goes in the wrong direction, IMO.
>> The driver doesn't know. I started hacking together page pool control
>> over netlink. I think that the pool size selection logic should be in
>> the core, with inputs taken from user space / workload (via netlink).
>>
>> If it wasn't for the fact that I'm working on that API I'd probably
>> side with you. And 64k descriptors is impractically large.
>>
>> Copy / pasting from the discussion on previous version:
>>
>> Tuning this in the driver relies on the assumption that the HW /
>> driver is the thing that matters. I'd think that the workload,
>> platform (CPU) and config (e.g. is IOMMU enabled?) will matter at
>> least as much, while driver developers will end up tuning to whatever
>> servers they have, a random single config, and most likely... iperf.
>>
>> IMO it's much better to re-purpose "pool_size" and treat it as the
>> ring size, because that's what most drivers end up putting there.
>
> I disagree here, as driver developers should not treat "pool_size" as
> the ring size. That seems to be copy-paste programming without
> understanding PP dynamics.
+1. That's why I wrote in the previous thread that pool_size must be the
minimum value which gives optimal performance. I don't believe Otx2 HW
needs 32k entries in PP's ptr_ring to have optimal performance.
That's why I wrote that developers must check whether there's any
benefit in using bigger pool_size values. Values bigger than 2k don't
seem reasonable to me, especially now that we use direct recycling much
more aggressively -- oftentimes the ptr_ring isn't used at all.
Jakub thought that my "please test whether bigger sizes make sense"
meant "please tune Page Pool to your servers", which is not exactly
what I wanted to say =\ I only said "please keep pool_size reasonable;
it's your right to have 2^32 descriptors on the ring, but don't do that
with Page Pool".
>
>> Defer tuning of the effective ring size to the core and user input
>> (via the "it will be added any minute now" netlink API for configuring
>> page pools)...
>>
>
> I agree here, that tuning ring size is a hard problem, and this is
> better handled in the core. Happy to hear, that/if Jakub is working on
> this.
>
>> So capping the recycle ring to 32k instead of returning the error
>> seems like an okay solution for now.
>
> As a temporary solution, I'm actually fine with capping at 32k.
> Driver developers lose some feedback control, but perhaps that is
> okay, if we can agree that the net-core should control tuning this
> anyhow.
>
> --Jesper
Thanks,
Olek
Thread overview: 7+ messages
2023-08-07 3:49 [PATCH net-next] page_pool: Clamp ring size to 32K Ratheesh Kannoth
2023-08-07 11:42 ` Jesper Dangaard Brouer
2023-08-07 14:18 ` Alexander H Duyck
2023-08-07 17:20 ` Jakub Kicinski
2023-08-07 20:11 ` Jesper Dangaard Brouer
2023-08-08 2:26 ` [EXT] " Ratheesh Kannoth
2023-08-08 13:29 ` Alexander Lobakin [this message]