Re: [PATCH net-next] page_pool: Clamp ring size to 32K


From: Alexander Lobakin <aleksander.lobakin@intel.com>
To: Jesper Dangaard Brouer <hawk@kernel.org>,
	Jakub Kicinski <kuba@kernel.org>
Cc: Alexander H Duyck <alexander.duyck@gmail.com>,
	Ratheesh Kannoth <rkannoth@marvell.com>, <netdev@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, <davem@davemloft.net>,
	<edumazet@google.com>, <pabeni@redhat.com>,
	Ilias Apalodimas <ilias.apalodimas@linaro.org>,
	Yunsheng Lin <linyunsheng@huawei.com>
Subject: Re: [PATCH net-next] page_pool: Clamp ring size to 32K
Date: Tue, 8 Aug 2023 15:29:02 +0200	[thread overview]
Message-ID: <7c3aa2c5-ecf3-e2cf-8955-04155f37d609@intel.com> (raw)
In-Reply-To: <15d32b22-22b0-64e3-a49e-88d780c24616@kernel.org>

From: Jesper Dangaard Brouer <hawk@kernel.org>
Date: Mon, 7 Aug 2023 22:11:35 +0200

> 
> 
> On 07/08/2023 19.20, Jakub Kicinski wrote:
>> On Mon, 07 Aug 2023 07:18:21 -0700 Alexander H Duyck wrote:
>>>> Page pool (PP) is just a cache of pages.  The driver octeontx2 (in
>>>> link) is creating an excessively large cache of pages.  The driver's
>>>> RX descriptor ring size should be independent of the PP ptr_ring
>>>> size, as it is just a cache that grows as a function of the in-flight
>>>> packet workload; it functions as a "shock absorber".
>>>>
>>>> 32768 pages (4KiB) is approx 128 MiB, and this will be per RX-queue.
>>>>
>>>> The RX-desc ring (obviously) pins down these pages (immediately),
>>>> but PP ring starts empty.  As the workload varies the "shock
>>>> absorber" effect will let more pages into the system, that will
>>>> travel the PP ptr_ring.
>>>> As all pages originating from the same PP instance will get recycled,
>>>> the in-flight pages in the "system" (PP ptr_ring) will grow over time.
>>>>
>>>> The PP design has the problem that it never releases or reduces pages
>>>> in this shock absorber "closed" system. (Cc. PP people/devel) we should
>>>> consider implementing a MM shrinker callback
>>>> (include/linux/shrinker.h).
>>>>
>>>> Are the systems using driver octeontx2 ready to handle 128MiB memory
>>>> per RX-queue getting pinned down over time? (this could lead to some
>>>> strange, hard-to-debug situations if the memory is not sufficient)
>>>
>>> I'm with Jesper on this. It doesn't make sense to be tying the
>>> page_pool size strictly to the ring size. The amount of recycling you
>>> get will depend on how long the packets are on the stack, not in the
>>> driver.
>>>
> 
> Thanks for agreeing with me, and I agree with you :-)
> 
>>> For example, in the case of something like a software router or bridge
>>> that is just taking the Rx packets and routing them to Tx you could
>>> theoretically get away with a multiple of NAPI_POLL_WEIGHT since you
>>> would likely never need much more than that as the Tx would likely be
>>> cleaned about as fast as the Rx can consume the pages.
>>>
> 
> I agree.
> 
>>> Rather than overriding the size here wouldn't it make more sense to do
>>> it in the octeontx2 driver? With that at least you would know that you
>>> were the one that limited the size instead of having the value modified
>>> out from underneath you.
>>>
> 
> I'm not fully agreeing here.  I don't think we can expect driver
> developers to be experts on page_pool cache dynamics.  I'm more on
> Jakub's side here, as perhaps we/net-core can come up with some control
> system, even if this means we change this underneath drivers.
> 
> 
>>> That said, one change that might help to enable this kind of change
>>> would be to look at adding a #define so that this value wouldn't be so
>>> much a magic number and would be visible to the drivers should it ever
>>> be changed in the future.
>>
>> All the points y'all making are valid, sizing the cache is a hard
>> problem. But the proposed solution goes in the wrong direction, IMO.
>> The driver doesn't know. I started hacking together page pool control
>> over netlink. I think that the pool size selection logic should be in
>> the core, with inputs taken from user space / workload (via netlink).
>>
>> If it wasn't for the fact that I'm working on that API I'd probably
>> side with you. And 64k descriptors is impractically large.
>>
>> Copy / pasting from the discussion on previous version:
>>
>>    Tuning this in the driver relies on the assumption that the HW /
>>    driver is the thing that matters. I'd think that the workload,
>>    platform (CPU) and config (e.g. is IOMMU enabled?) will matter at
>>    least as much. While driver developers will end up tuning to whatever
>>    servers they have, random single config and most likely.. iperf.
>>
>>    IMO it's much better to re-purpose "pool_size" and treat it as the
>>    ring size, because that's what most drivers end up putting there.
> 
> I disagree here, as driver developers should not treat "pool_size" as
> the ring size.  It seems to be a copy-paste-programming scheme without
> understanding PP dynamics.

+1. That's why I wrote in the previous thread that pool_size must be the
minimum value which gives optimal performance. I don't believe Otx2 HW
needs 32k entries in PP's ptr_ring to have optimal performance.
That's also why I wrote that developers must check whether there's any
benefit in using bigger pool_size values. Values bigger than 2k don't
seem reasonable to me, especially now that we use direct recycling way
more aggressively -- oftentimes the ptr_ring is not used at all.
Jakub thought that my "pls test whether bigger sizes make sense" meant
"please tune Page Pool to your servers", which is not exactly what I
wanted to say =\ All I said was "please keep pool_size reasonable, it's
your right to have 2^32 descriptors on the ring, but don't do that with
Page Pool".
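
For scale, here is a rough user-space sketch of the arithmetic behind
those numbers. It assumes 4 KiB pages, as in Jesper's estimate, and a
completely full ptr_ring; pp_ring_worst_case_bytes() and
pp_ring_worst_case_mib() are made-up helpers for illustration, not
page_pool APIs.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): bytes held by one page pool's
 * ptr_ring if every slot contains a page of page_size bytes.
 */
static uint64_t pp_ring_worst_case_bytes(uint64_t pool_size, uint64_t page_size)
{
	/* Refuse inputs whose product would overflow 64 bits. */
	assert(page_size != 0 && pool_size <= UINT64_MAX / page_size);

	return pool_size * page_size;
}

/* Same figure in MiB, rounded down. */
static uint64_t pp_ring_worst_case_mib(uint64_t pool_size, uint64_t page_size)
{
	return pp_ring_worst_case_bytes(pool_size, page_size) >> 20;
}
```

With 4096-byte pages, pp_ring_worst_case_mib(32768, 4096) gives 128 per
RX queue, while pp_ring_worst_case_mib(2048, 4096) gives 8.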

> 
>>    Defer tuning of the effective ring size to the core and user input
>>    (via the "it will be added any minute now" netlink API for configuring
>>    page pools)...
>>
> 
> I agree here, that tuning ring size is a hard problem, and this is
> better handled in the core.  Happy to hear, that/if Jakub is working on
> this.
> 
>>    So capping the recycle ring to 32k instead of returning the error
>>    seems like an okay solution for now.
> 
> As a temporary solution, I'm actually fine with capping at 32k.
> Driver developers lose some feedback control, but perhaps that is okay,
> if we can agree that the net-core should control tuning this anyhow.
> 
> --Jesper
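
A minimal user-space sketch of what "cap instead of error" amounts to,
with the limit behind a named constant as Alexander Duyck suggested.
PP_RING_QSIZE_DEFAULT, PP_RING_QSIZE_MAX and pp_ring_qsize() are
hypothetical names, and the default of 1024 used when pool_size is 0 is
an assumption about the core, not something stated in this thread.

```c
#include <assert.h>

/* Hypothetical names; 1024 as the default is an assumption. */
#define PP_RING_QSIZE_DEFAULT	1024u
#define PP_RING_QSIZE_MAX	32768u	/* the 32K cap discussed here */

/* Pick the ptr_ring size: clamp oversized requests instead of failing. */
static unsigned int pp_ring_qsize(unsigned int pool_size)
{
	unsigned int ring_qsize = pool_size ? pool_size : PP_RING_QSIZE_DEFAULT;

	if (ring_qsize > PP_RING_QSIZE_MAX)
		ring_qsize = PP_RING_QSIZE_MAX;

	/* Postcondition: the result is always within the cap. */
	assert(ring_qsize > 0 && ring_qsize <= PP_RING_QSIZE_MAX);
	return ring_qsize;
}
```

A driver passing 65536 would silently get 32768, which is the loss of
feedback Jesper mentions; a request of 2048 passes through unchanged.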

Thanks,
Olek
