Re: [BUG] mlx5_core memory management issue

From: Chris Arges <carges@cloudflare.com>
To: Dragos Tatulea <dtatulea@nvidia.com>
Cc: netdev@vger.kernel.org, bpf@vger.kernel.org,
	kernel-team <kernel-team@cloudflare.com>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	tariqt@nvidia.com, saeedm@nvidia.com,
	Leon Romanovsky <leon@kernel.org>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	John Fastabend <john.fastabend@gmail.com>,
	Simon Horman <horms@kernel.org>,
	Andrew Rzeznik <arzeznik@cloudflare.com>,
	Yan Zhai <yan@cloudflare.com>
Subject: Re: [BUG] mlx5_core memory management issue
Date: Wed, 23 Jul 2025 13:48:07 -0500
Message-ID: <aIEuZy6fUj_4wtQ6@861G6M3>
In-Reply-To: <dhqeshvesjhyxeimyh6nttlkrrhoxwpmjpn65tesani3tmne5v@msusvzdhuuin>

On 2025-07-04 12:37:36, Dragos Tatulea wrote:
> On Thu, Jul 03, 2025 at 10:49:20AM -0500, Chris Arges wrote:
> > When running iperf through a set of XDP programs we were able to crash
> > machines with NICs using the mlx5_core driver. We were able to confirm
> > that other NICs/drivers did not exhibit the same problem, and suspect
> > this could be a memory management issue in the driver code.
> > Specifically we found a WARNING at include/net/page_pool/helpers.h:277
> > mlx5e_page_release_fragmented.isra. We are able to demonstrate this
> > issue in production using hardware, but cannot easily bisect because
> > we don’t have a simple reproducer.
> >
> Thanks for the report! We will investigate.
> 
> > I wanted to share stack traces in
> > order to help us further debug and understand if anyone else has run
> > into this issue. We are currently working on getting more crashdumps
> > and doing further analysis.
> > 
> > 
> > The test setup looks like the following:
> >   ┌─────┐
> >   │mlx5 │
> >   │NIC  │
> >   └──┬──┘
> >      │xdp ebpf program (does encap and XDP_TX)
> >      │
> >      ▼
> >   ┌──────────────────────┐
> >   │xdp.frags             │
> >   │                      │
> >   └──┬───────────────────┘
> >      │tailcall
> >      │BPF_REDIRECT_MAP (using CPUMAP bpf type)
> >      ▼
> >   ┌──────────────────────┐
> >   │xdp.frags/cpumap      │
> >   │                      │
> >   └──┬───────────────────┘
> >      │BPF_REDIRECT to veth (*potential trigger for issue)
> >      │
> >      ▼
> >   ┌──────┐
> >   │veth  │
> >   │      │
> >   └──┬───┘
> >      │
> >      │
> >      ▼
> > 
> > Here an mlx5 NIC has an xdp.frags program attached which tailcalls via
> > BPF_REDIRECT_MAP into an xdp.frags/cpumap. For our reproducer we can
> > choose a random valid CPU to reproduce the issue. Once that packet
> > reaches the xdp.frags/cpumap program we then do another BPF_REDIRECT
> > to a veth device which has an XDP program which redirects to an
> > XSKMAP. It wasn’t until we added the additional BPF_REDIRECT to the
> > veth device that we noticed this issue.
> > 
> Would it be possible to try to use a single program that redirects to
> the XSKMAP and check that the issue reproduces?
> 
> > When running with 6.12.30 to 6.12.32 kernels we are able to see the
> > following KASAN use-after-free WARNINGs followed by a page fault which
> > crashes the machine. We have not been able to test earlier or later
> > kernels. I’ve tried to map symbols to lines of code for clarity.
> >
> Thanks for the KASAN reports, they are very useful. Keep us posted
> if you have other updates. A first quick look didn't reveal anything
> obvious from our side but we will keep looking.
> 
> Thanks,
> Dragos

Ok, we can reproduce this problem!

I tried to simplify this reproducer, but it seems like what's needed is:
- xdp program attached to mlx5 NIC
- cpumap redirect
- device redirect (map or just bpf_redirect)
- frame gets turned into an skb
Then from another machine send many flows of UDP traffic to trigger the problem.

I've put together a program that reproduces the issue here:
- https://github.com/arges/xdp-redirector

In general the failure manifests with many different WARNs such as:
include/net/page_pool/helpers.h:277 mlx5e_page_release_fragmented.isra.0+0xf7/0x150 [mlx5_core]
Then the machine crashes.

I was able to get a crashdump which shows:
```
PID: 0        TASK: ffff8c0910134380  CPU: 76   COMMAND: "swapper/76"
 #0 [fffffe10906d3ea8] crash_nmi_callback at ffffffffadc5c4fd
 #1 [fffffe10906d3eb0] default_do_nmi at ffffffffae9524f0
 #2 [fffffe10906d3ed0] exc_nmi at ffffffffae952733
 #3 [fffffe10906d3ef0] end_repeat_nmi at ffffffffaea01bfd
    [exception RIP: io_serial_in+25]
    RIP: ffffffffae4cd489  RSP: ffffb3c60d6049e8  RFLAGS: 00000002
    RAX: ffffffffae4cd400  RBX: 00000000000025d8  RCX: 0000000000000000
    RDX: 00000000000002fd  RSI: 0000000000000005  RDI: ffffffffb10a9cb0
    RBP: 0000000000000000   R8: 2d2d2d2d2d2d2d2d   R9: 656820747563205b
    R10: 000000002d2d2d2d  R11: 000000002d2d2d2d  R12: ffffffffb0fa5610
    R13: 0000000000000000  R14: 0000000000000000  R15: ffffffffb10a9cb0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffffb3c60d6049e8] io_serial_in at ffffffffae4cd489
 #5 [ffffb3c60d6049e8] serial8250_console_write at ffffffffae4d2fcf
 #6 [ffffb3c60d604a80] console_flush_all at ffffffffadd1cf26
 #7 [ffffb3c60d604b00] console_unlock at ffffffffadd1d1df
 #8 [ffffb3c60d604b48] vprintk_emit at ffffffffadd1dda1
 #9 [ffffb3c60d604b98] _printk at ffffffffae90250c
#10 [ffffb3c60d604bf8] report_bug.cold at ffffffffae95001d
#11 [ffffb3c60d604c38] handle_bug at ffffffffae950e91
#12 [ffffb3c60d604c58] exc_invalid_op at ffffffffae9512b7
#13 [ffffb3c60d604c70] asm_exc_invalid_op at ffffffffaea0123a
    [exception RIP: mlx5e_page_release_fragmented+85]
    RIP: ffffffffc25f75c5  RSP: ffffb3c60d604d20  RFLAGS: 00010293
    RAX: 000000000000003f  RBX: ffff8bfa8f059fd0  RCX: ffffe3bf1992a180
    RDX: 000000000000003d  RSI: ffffe3bf1992a180  RDI: ffff8bf9b0784000
    RBP: 0000000000000040   R8: 00000000000001d2   R9: 0000000000000006
    R10: ffff8c06de22f380  R11: ffff8bfcfe6cd680  R12: 00000000000001d2
    R13: 000000000000002b  R14: ffff8bf9b0784000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#14 [ffffb3c60d604d20] mlx5e_free_rx_wqes at ffffffffc25f7e2f [mlx5_core]
#15 [ffffb3c60d604d58] mlx5e_post_rx_wqes at ffffffffc25f877c [mlx5_core]
#16 [ffffb3c60d604dc0] mlx5e_napi_poll at ffffffffc25fdd27 [mlx5_core]
#17 [ffffb3c60d604e20] __napi_poll at ffffffffae6a8ddb
#18 [ffffb3c60d604e90] __napi_poll at ffffffffae6a8db5
#19 [ffffb3c60d604e98] net_rx_action at ffffffffae6a95f1
#20 [ffffb3c60d604f98] handle_softirqs at ffffffffadc9d4bf
#21 [ffffb3c60d604fe8] irq_exit_rcu at ffffffffadc9e057
#22 [ffffb3c60d604ff0] common_interrupt at ffffffffae952015
--- <IRQ stack> ---
#23 [ffffb3c60c837de8] asm_common_interrupt at ffffffffaea01466
    [exception RIP: cpuidle_enter_state+184]
    RIP: ffffffffae955c38  RSP: ffffb3c60c837e98  RFLAGS: 00000202
    RAX: ffff8c0cffc00000  RBX: ffff8c0911002400  RCX: 0000000000000000
    RDX: 00003c630b2d073a  RSI: ffffffe519600d10  RDI: 0000000000000000
    RBP: 0000000000000001   R8: 0000000000000002   R9: 0000000000000001
    R10: ffff8c0cffc330c4  R11: 071c71c71c71c71c  R12: ffffffffb05ff820
    R13: 00003c630b2d073a  R14: 0000000000000001  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#24 [ffffb3c60c837ed0] cpuidle_enter at ffffffffae64b4ad
#25 [ffffb3c60c837ef0] do_idle at ffffffffadcfa7c6
#26 [ffffb3c60c837f30] cpu_startup_entry at ffffffffadcfaa09
#27 [ffffb3c60c837f40] start_secondary at ffffffffadc5ec77
#28 [ffffb3c60c837f50] common_startup_64 at ffffffffadc24d5d
```

Assuming the x86_64 calling convention, where the first two arguments are
passed in RDI and RSI:
RDI=ffff8bf9b0784000 (rq)
RSI=ffffe3bf1992a180 (frag_page)

```
static void mlx5e_page_release_fragmented(struct mlx5e_rq *rq,
                                          struct mlx5e_frag_page *frag_page)
{
        u16 drain_count = MLX5E_PAGECNT_BIAS_MAX - frag_page->frags;
        struct page *page = frag_page->page;

        if (page_pool_unref_page(page, drain_count) == 0)
                page_pool_put_unrefed_page(rq->page_pool, page, -1, true);
}
```

crash> struct mlx5e_frag_page ffffe3bf1992a180
struct mlx5e_frag_page {
  page = 0x26ffff800000000,
  frags = 49856
}

With frags = 49856, drain_count could end up as an unexpected number (assuming
that we expect it to be less than MLX5E_PAGECNT_BIAS_MAX).
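
To make the arithmetic concrete, here is a small userspace sketch (not driver
code) of the first line of mlx5e_page_release_fragmented(). It assumes 4 KiB
pages and that MLX5E_PAGECNT_BIAS_MAX is PAGE_SIZE / 64, i.e. 64. Both values,
and the names PAGE_SIZE_ASSUMED, PAGECNT_BIAS_MAX_ASSUMED, drain_count_for()
and check_examples(), are assumptions for illustration and should be checked
against the tree being debugged.

```
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages and a bias of PAGE_SIZE / 64. */
#define PAGE_SIZE_ASSUMED        4096
#define PAGECNT_BIAS_MAX_ASSUMED (PAGE_SIZE_ASSUMED / 64)   /* 64 */

/*
 * Same computation as "u16 drain_count = MLX5E_PAGECNT_BIAS_MAX - frags":
 * the subtraction is done in int, then truncated to 16 bits on assignment.
 */
uint16_t drain_count_for(uint16_t frags)
{
        return (uint16_t)(PAGECNT_BIAS_MAX_ASSUMED - frags);
}

/* Sanity checks for a plausible frags value and for the dumped one. */
void check_examples(void)
{
        /* frags within the bias: drain_count is the remaining bias. */
        assert(drain_count_for(3) == 61);

        /* frags from the dump: 64 - 49856 = -49792, which wraps mod 65536. */
        assert(drain_count_for(49856) == 15744);
}
```

Under those assumptions, the dumped frags value would give a drain_count of
15744, far above the bias of 64 that the page was presumably set up with.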

Let me know what additional experiments would be useful here.

--chris
