From: Jakub Kicinski <kuba@kernel.org>
To: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, andrew+netdev@lunn.ch, davem@davemloft.net,
edumazet@google.com, pabeni@redhat.com, leon@kernel.org,
longli@microsoft.com, kotaranov@microsoft.com, horms@kernel.org,
shradhagupta@linux.microsoft.com, ssengar@linux.microsoft.com,
ernis@linux.microsoft.com, shirazsaleem@microsoft.com,
linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
dipayanroy@microsoft.com
Subject: Re: [PATCH net-next, v3] net: mana: Force full-page RX buffers for 4K page size on specific systems.
Date: Fri, 20 Mar 2026 17:29:08 -0700 [thread overview]
Message-ID: <20260320172908.1840229d@kernel.org> (raw)
In-Reply-To: <ab2T8LgRiDHDIUHV@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>
On Fri, 20 Mar 2026 11:37:36 -0700 Dipayaan Roy wrote:
> On Sat, Mar 14, 2026 at 12:50:53PM -0700, Jakub Kicinski wrote:
> > On Tue, 10 Mar 2026 21:00:49 -0700 Dipayaan Roy wrote:
> > > On certain systems configured with 4K PAGE_SIZE, utilizing page_pool
> > > fragments for RX buffers results in a significant throughput regression.
> > > Profiling reveals that this regression correlates with high overhead in the
> > > fragment allocation and reference counting paths on these specific
> > > platforms, rendering the multi-buffer-per-page strategy counterproductive.
> >
> > Can you say more ? We could technically take two references on the page
> > right away if MTU is small and avoid some of the cost.
>
> There is a 15-20% shortfall in achieving line rate for MANA (180+ Gbps)
> on a particular ARM64 SKU. The issue is specific to this processor SKU —
> it is not seen on other ARM64 SKUs (e.g., GB200) or x86 SKUs. Critically, the
> regression only manifests beyond 16 TCP connections, which strongly indicates
> it is only seen when there is high contention and traffic.
>
> no. of | rx buf backed | rx buf backed
> connections | with page fragments | with full page
> -------------+---------------------+---------------
> 4 | 139 Gbps | 138 Gbps
> 8 | 140 Gbps | 162 Gbps
> 16 | 186 Gbps | 186 Gbps
These results look a bit odd, 4 and 16 streams have the same perf,
while all other cases indeed show a delta. What I was hoping for was
a more precise attribution of the performance issue. Like perf top
showing that it's indeed the atomic ops on the refcount that stall.
> 32 | 136 Gbps | 183 Gbps
> 48 | 159 Gbps | 185 Gbps
> 64 | 165 Gbps | 184 Gbps
> 128 | 170 Gbps | 180 Gbps
>
> The HW team is still working to root-cause this HW behaviour.
>
> Regarding "We could technically take two references on the page right
> away", are you suggesting moving the page reference counting logic into
> the driver instead of relying on the page pool?
Yes, either that or adjust the page pool APIs.
page_pool_alloc_frag_netmem() currently sets the refcount to BIAS
which it then has to subtract later. So we get:
set(BIAS)
.. driver allocates chunks ..
sub(BIAS_MAX - pool->frag_users)
Instead of using BIAS we could make the page pool guess that the caller
will keep asking for the same frame size. So initially take
(PAGE_SIZE/size) references.
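To make the bias scheme above concrete, here is a minimal user-space toy model of it (not the actual page pool code — the struct and function names are made up for illustration, and BIAS_MAX is a stand-in constant): the refcount is pre-loaded once when the page is opened, frag allocations then avoid touching it, and the unused portion of the bias is subtracted when the page is retired.

```c
#include <assert.h>

#define BIAS_MAX 100 /* stand-in for the real bias constant */

/* Hypothetical toy types; field names mirror the sketch above. */
struct toy_page {
	int refcount;
};

struct toy_pool {
	struct toy_page page;
	int frag_users; /* fragments handed out from the current page */
};

/* Fresh page: pre-load the refcount with the bias up front,
 * i.e. the set(BIAS) step. */
static void pool_open_page(struct toy_pool *p)
{
	p->page.refcount = BIAS_MAX;
	p->frag_users = 0;
}

/* Per-frag allocation: no atomic op on the page refcount here,
 * only the frag counter moves. */
static void pool_alloc_frag(struct toy_pool *p)
{
	p->frag_users++;
}

/* Page retired: drop the unused part of the bias, i.e. the
 * sub(BIAS_MAX - frag_users) step, so the refcount ends up equal
 * to the number of outstanding fragments. */
static void pool_close_page(struct toy_pool *p)
{
	p->page.refcount -= BIAS_MAX - p->frag_users;
}
```

The alternative Jakub sketches would replace the BIAS_MAX pre-load with a guess of (PAGE_SIZE/size) references taken at open time, trading the later subtraction for a possible correction if the frag size changes.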
> > The driver doesn't seem to set skb->truesize accordingly after this
> > change. So you're lying to the stack about how much memory each packet
> > consumes. This is a blocker for the change.
> >
> ACK. I will send out a separate patch with a Fixes tag to fix the skb
> truesize.
>
> > > To mitigate this, bypass the page_pool fragment path and force a single RX
> > > packet per page allocation when all the following conditions are met:
> > > 1. The system is configured with a 4K PAGE_SIZE.
> > > 2. A processor-specific quirk is detected via SMBIOS Type 4 data.
> >
> > I don't think we want the kernel to be in the business of carrying
> > matching on platform names and providing optimal config by default.
> > This sort of logic needs to live in user space or the hypervisor
> > (which can then pass a single bit to the driver to enable the behavior)
> >
> As per our internal discussion, the hypervisor cannot provide the CPU
> version info (in VM as well as in bare-metal offerings).
Why? I suppose it's much more effort for you but it's much more effort
for the community to carry the workaround. So..
> On handling it from the user side, are you suggesting introducing a new
> ethtool private flag and having udev rules for the driver to set the
> private flag and switch to full-page RX buffers? Given the wide number of
> distros we support, this might be harder to maintain/backport.
>
> Also, the DMI parsing design was influenced by other wireless net
> drivers such as /wireless/ath/ath10k/core.c. If this approach is not
> acceptable for the MANA driver then we will have to take an alternate
> route based on the discussion right above.
There are plenty of ugly hacks in the kernel; that's no excuse for adding more.
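For reference, the ath10k-style quirk matching under debate boils down to a string-match table keyed on SMBIOS data. A minimal user-space sketch of that pattern follows — it is not the kernel DMI API, and the processor-version strings and names here are entirely hypothetical:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical quirk table keyed on the SMBIOS Type 4
 * "Processor Version" string; entries are made up. */
struct quirk_entry {
	const char *processor_version;
	int force_full_page_rx;
};

static const struct quirk_entry quirk_table[] = {
	{ "Example-ARM64-SKU-A", 1 }, /* hypothetical affected SKU */
	{ NULL, 0 },                  /* table terminator */
};

/* Walk the table and return the quirk flag for a matching
 * platform, or 0 when the platform is not listed. */
static int platform_needs_full_page_rx(const char *proc_version)
{
	const struct quirk_entry *q;

	for (q = quirk_table; q->processor_version; q++)
		if (strcmp(q->processor_version, proc_version) == 0)
			return q->force_full_page_rx;
	return 0;
}
```

The maintenance burden Jakub objects to is visible here: every new affected SKU means another table entry carried in the driver forever.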
2026-03-11 4:00 [PATCH net-next, v3] net: mana: Force full-page RX buffers for 4K page size on specific systems Dipayaan Roy
2026-03-14 19:50 ` Jakub Kicinski
2026-03-20 18:37 ` Dipayaan Roy
2026-03-21 0:29 ` Jakub Kicinski [this message]