From: Chris Arges <carges@cloudflare.com>
To: Dragos Tatulea <dtatulea@nvidia.com>,
netdev@vger.kernel.org, bpf@vger.kernel.org
Cc: netdev@vger.kernel.org, bpf@vger.kernel.org,
kernel-team <kernel-team@cloudflare.com>,
Jesper Dangaard Brouer <hawk@kernel.org>,
tariqt@nvidia.com, saeedm@nvidia.com,
Leon Romanovsky <leon@kernel.org>,
Andrew Lunn <andrew+netdev@lunn.ch>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
John Fastabend <john.fastabend@gmail.com>,
Simon Horman <horms@kernel.org>,
Andrew Rzeznik <arzeznik@cloudflare.com>,
Yan Zhai <yan@cloudflare.com>
Subject: Re: [BUG] mlx5_core memory management issue
Date: Mon, 7 Jul 2025 17:07:02 -0500
Message-ID: <aGw-2geTw7Y0UXg2@861G6M3>
In-Reply-To: <md46ky57c74xrw2l2y5biwnw4vzgn6juiovqkx7tzdwks6smab@vpfd5hmclioa>
On Fri, Jul 04, 2025 at 08:14:20PM +0000, Dragos Tatulea wrote:
> On Fri, Jul 04, 2025 at 12:37:36PM +0000, Dragos Tatulea wrote:
> > On Thu, Jul 03, 2025 at 10:49:20AM -0500, Chris Arges wrote:
> > > When running iperf through a set of XDP programs we were able to crash
> > > machines with NICs using the mlx5_core driver. We were able to confirm
> > > that other NICs/drivers did not exhibit the same problem, and suspect
> > > this could be a memory management issue in the driver code.
> > > Specifically we found a WARNING at include/net/page_pool/helpers.h:277
> > > mlx5e_page_release_fragmented.isra. We are able to demonstrate this
> > > issue in production using hardware, but cannot easily bisect because
> > > we don’t have a simple reproducer.
> > >
> > Thanks for the report! We will investigate.
> >
> > > I wanted to share stack traces in
> > > order to help us further debug and understand if anyone else has run
> > > into this issue. We are currently working on getting more crashdumps
> > > and doing further analysis.
> > >
> > >
> > > The test setup looks like the following:
> > > ┌─────┐
> > > │mlx5 │
> > > │NIC │
> > > └──┬──┘
> > > │xdp ebpf program (does encap and XDP_TX)
> > > │
> > > ▼
> > > ┌──────────────────────┐
> > > │xdp.frags │
> > > │ │
> > > └──┬───────────────────┘
> > > │tailcall
> > > │BPF_REDIRECT_MAP (using CPUMAP bpf type)
> > > ▼
> > > ┌──────────────────────┐
> > > │xdp.frags/cpumap │
> > > │ │
> > > └──┬───────────────────┘
> > > │BPF_REDIRECT to veth (*potential trigger for issue)
> > > │
> > > ▼
> > > ┌──────┐
> > > │veth │
> > > │ │
> > > └──┬───┘
> > > │
> > > │
> > > ▼
> > >
> > > Here an mlx5 NIC has an xdp.frags program attached which tailcalls via
> > > BPF_REDIRECT_MAP into an xdp.frags/cpumap. For our reproducer we can
> > > choose a random valid CPU to reproduce the issue. Once that packet
> > > reaches the xdp.frags/cpumap program we then do another BPF_REDIRECT
> > > to a veth device which has an XDP program which redirects to an
> > > XSKMAP. It wasn’t until we added the additional BPF_REDIRECT to the
> > > veth device that we noticed this issue.
> > >
> > Would it be possible to try to use a single program that redirects to
> > the XSKMAP and check that the issue reproduces?
> >
> I forgot to ask: what is the MTU size?
> Also, are you setting any other special config on the device?
>
> Thanks,
> Dragos
Dragos,
The device has the following settings:
2: ext0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1600 xdp qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 1c:34:da:48:7f:e8 brd ff:ff:ff:ff:ff:ff promiscuity 0 allmulti 0 minmtu 68 maxmtu 9978 addrgenmode eui64 numtxqueues 520 numrxqueues 65 gso_max_size 65536 gso_max_segs 65535 tso_max_size 524280 tso_max_segs 65535 gro_max_size 65536 gso_ipv4_max_size 65536 gro_ipv4_max_size 65536 portname p0 switchid e87f480003da341c parentbus pci parentdev 0000:c1:00.0
    prog/xdp id 173
To help narrow down the problem, we also tested the following packet paths:
1) Fails: XDP (mlx5 nic) -> CPU MAP -> DEV MAP (to veth) -> XSK
2) Works: XDP (mlx5 nic) -> CPU MAP -> Linux routing (to veth) -> XSK
3) Works: XDP (mlx5 nic) -> Linux routing (to veth) -> XSK
Given those results, I would expect a single program that redirects straight
to the XSKMAP to work fine as well.
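To spell out the elimination behind that guess, here is a small sketch. The
hop labels are just the informal names from the three paths listed above (not
kernel identifiers), and `paths` and `suspects` are hypothetical names used
only for illustration. It takes the hops of the failing path and drops every
hop that also shows up in a working path:

```python
# Hop labels are informal names copied from the three tested paths,
# not kernel identifiers.
paths = {
    "1 (fails)": ["XDP (mlx5 nic)", "CPU MAP", "DEV MAP (to veth)", "XSK"],
    "2 (works)": ["XDP (mlx5 nic)", "CPU MAP", "Linux routing (to veth)", "XSK"],
    "3 (works)": ["XDP (mlx5 nic)", "Linux routing (to veth)", "XSK"],
}

failing = set(paths["1 (fails)"])
working = set().union(*(hops for name, hops in paths.items() if "works" in name))

# Hops that appear only in the failing path are the first things to look at.
suspects = sorted(failing - working)
print(suspects)  # ['DEV MAP (to veth)']
```

Only the DEV MAP hop is unique to the failing path. This is just set
arithmetic over the three cases, so it points at where to look rather than
proving anything about the driver.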
Thanks,
--chris