[PATCH net v2 0/1] i40e: xsk: advance next_to

bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH net v2 0/1] i40e: xsk: advance next_to_clean on status descriptors
@ 2025-10-21 17:31 Alessandro Decina
  2025-10-21 17:32 ` [PATCH net v2 1/1] " Alessandro Decina
  0 siblings, 1 reply; 12+ messages in thread
From: Alessandro Decina @ 2025-10-21 17:31 UTC (permalink / raw)
  To: netdev
  Cc: Maciej Fijalkowski, David S. Miller, Alexei Starovoitov,
	Andrew Lunn, Daniel Borkmann, Eric Dumazet, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Paolo Abeni,
	Przemek Kitszel, Stanislav Fomichev, Tirthendu Sarkar,
	Tony Nguyen, bpf, intel-wired-lan, linux-kernel,
	Alessandro Decina

Submitting v2 since while linting v1 I moved the ntp definition in the
wrong place. Apologies for the noise.

Link to v1: https://lore.kernel.org/netdev/20251021135913.5253-1-alessandro.d@gmail.com/T/#u

Changes since v1:
 * advance next_to_process after accessing the current descriptor

Alessandro Decina (1):
  i40e: xsk: advance next_to_clean on status descriptors

 drivers/net/ethernet/intel/i40e/i40e_xsk.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)


base-commit: 49d34f3dd8519581030547eb7543a62f9ab5fa08
-- 
2.43.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH net v2 1/1] i40e: xsk: advance next_to_clean on status descriptors
  2025-10-21 17:31 [PATCH net v2 0/1] i40e: xsk: advance next_to_clean on status descriptors Alessandro Decina
@ 2025-10-21 17:32 ` Alessandro Decina
  2025-10-22  3:11   ` Jason Xing
                     ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Alessandro Decina @ 2025-10-21 17:32 UTC (permalink / raw)
  To: netdev
  Cc: Maciej Fijalkowski, David S. Miller, Alexei Starovoitov,
	Andrew Lunn, Daniel Borkmann, Eric Dumazet, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Paolo Abeni,
	Przemek Kitszel, Stanislav Fomichev, Tirthendu Sarkar,
	Tony Nguyen, bpf, intel-wired-lan, linux-kernel,
	Alessandro Decina

Whenever a status descriptor is received, i40e processes and skips over
it, correctly updating next_to_process but forgetting to update
next_to_clean. In the next iteration this accidentally causes the
creation of an invalid multi-buffer xdp_buff where the first fragment
is the status descriptor.

If then a skb is constructed from such an invalid buffer - because the
eBPF program returns XDP_PASS - a panic occurs:

[ 5866.367317] BUG: unable to handle page fault for address: ffd31c37eab1c980
[ 5866.375050] #PF: supervisor read access in kernel mode
[ 5866.380825] #PF: error_code(0x0000) - not-present page
[ 5866.386602] PGD 0
[ 5866.388867] Oops: Oops: 0000 [#1] SMP NOPTI
[ 5866.393575] CPU: 34 UID: 0 PID: 0 Comm: swapper/34 Not tainted 6.17.0-custom #1 PREEMPT(voluntary)
[ 5866.403740] Hardware name: Supermicro AS -2115GT-HNTR/H13SST-G, BIOS 3.2 03/20/2025
[ 5866.412339] RIP: 0010:memcpy+0x8/0x10
[ 5866.416454] Code: cc cc 90 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 d1 <f3> a4 e9 fc 26 c0 fe 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[ 5866.437538] RSP: 0018:ff428d9ec0bb0ca8 EFLAGS: 00010286
[ 5866.443415] RAX: ff2dd26dbd8f0000 RBX: ff2dd265ad161400 RCX: 00000000000004e1
[ 5866.451435] RDX: 00000000000004e1 RSI: ffd31c37eab1c980 RDI: ff2dd26dbd8f0000
[ 5866.459454] RBP: ff428d9ec0bb0d40 R08: 0000000000000000 R09: 0000000000000000
[ 5866.467470] R10: 0000000000000000 R11: 0000000000000000 R12: ff428d9eec726ef8
[ 5866.475490] R13: ff2dd26dbd8f0000 R14: ff2dd265ca2f9fc0 R15: ff2dd26548548b80
[ 5866.483509] FS:  0000000000000000(0000) GS:ff2dd2c363592000(0000) knlGS:0000000000000000
[ 5866.492600] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5866.499060] CR2: ffd31c37eab1c980 CR3: 0000000178d7b040 CR4: 0000000000f71ef0
[ 5866.507079] PKRU: 55555554
[ 5866.510125] Call Trace:
[ 5866.512867]  <IRQ>
[ 5866.515132]  ? i40e_clean_rx_irq_zc+0xc50/0xe60 [i40e]
[ 5866.520921]  i40e_napi_poll+0x2d8/0x1890 [i40e]
[ 5866.526022]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 5866.531408]  ? raise_softirq+0x24/0x70
[ 5866.535623]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 5866.541011]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 5866.546397]  ? rcu_sched_clock_irq+0x225/0x1800
[ 5866.551493]  __napi_poll+0x30/0x230
[ 5866.555423]  net_rx_action+0x20b/0x3f0
[ 5866.559643]  handle_softirqs+0xe4/0x340
[ 5866.563962]  __irq_exit_rcu+0x10e/0x130
[ 5866.568283]  irq_exit_rcu+0xe/0x20
[ 5866.572110]  common_interrupt+0xb6/0xe0
[ 5866.576425]  </IRQ>
[ 5866.578791]  <TASK>

Advance next_to_clean to ensure invalid xdp_buff(s) aren't created.

Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support")
Signed-off-by: Alessandro Decina <alessandro.d@gmail.com>
---
 drivers/net/ethernet/intel/i40e/i40e_xsk.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 9f47388eaba5..dbc19083bbb7 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -441,13 +441,18 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
 		dma_rmb();
 
 		if (i40e_rx_is_programming_status(qword)) {
+			u16 ntp;
+
 			i40e_clean_programming_status(rx_ring,
 						      rx_desc->raw.qword[0],
 						      qword);
 			bi = *i40e_rx_bi(rx_ring, next_to_process);
 			xsk_buff_free(bi);
-			if (++next_to_process == count)
+			ntp = next_to_process++;
+			if (next_to_process == count)
 				next_to_process = 0;
+			if (next_to_clean == ntp)
+				next_to_clean = next_to_process;
 			continue;
 		}
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH net v2 1/1] i40e: xsk: advance next_to_clean on status descriptors
  2025-10-21 17:32 ` [PATCH net v2 1/1] " Alessandro Decina
@ 2025-10-22  3:11   ` Jason Xing
  2025-10-22  5:41     ` Sarkar, Tirthendu
  2025-10-22 16:17     ` Alessandro Decina
  2025-10-22  6:41   ` [Intel-wired-lan] " Loktionov, Aleksandr
  2025-10-22 17:17   ` Maciej Fijalkowski
  2 siblings, 2 replies; 12+ messages in thread
From: Jason Xing @ 2025-10-22  3:11 UTC (permalink / raw)
  To: Alessandro Decina
  Cc: netdev, Maciej Fijalkowski, David S. Miller, Alexei Starovoitov,
	Andrew Lunn, Daniel Borkmann, Eric Dumazet, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Paolo Abeni,
	Przemek Kitszel, Stanislav Fomichev, Tirthendu Sarkar,
	Tony Nguyen, bpf, intel-wired-lan, linux-kernel

On Wed, Oct 22, 2025 at 1:33 AM Alessandro Decina
<alessandro.d@gmail.com> wrote:
>
> Whenever a status descriptor is received, i40e processes and skips over
> it, correctly updating next_to_process but forgetting to update
> next_to_clean. In the next iteration this accidentally causes the
> creation of an invalid multi-buffer xdp_buff where the first fragment
> is the status descriptor.
>
> If then a skb is constructed from such an invalid buffer - because the
> eBPF program returns XDP_PASS - a panic occurs:
>
> [ 5866.367317] BUG: unable to handle page fault for address: ffd31c37eab1c980
> [ 5866.375050] #PF: supervisor read access in kernel mode
> [ 5866.380825] #PF: error_code(0x0000) - not-present page
> [ 5866.386602] PGD 0
> [ 5866.388867] Oops: Oops: 0000 [#1] SMP NOPTI
> [ 5866.393575] CPU: 34 UID: 0 PID: 0 Comm: swapper/34 Not tainted 6.17.0-custom #1 PREEMPT(voluntary)
> [ 5866.403740] Hardware name: Supermicro AS -2115GT-HNTR/H13SST-G, BIOS 3.2 03/20/2025
> [ 5866.412339] RIP: 0010:memcpy+0x8/0x10
> [ 5866.416454] Code: cc cc 90 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 d1 <f3> a4 e9 fc 26 c0 fe 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
> [ 5866.437538] RSP: 0018:ff428d9ec0bb0ca8 EFLAGS: 00010286
> [ 5866.443415] RAX: ff2dd26dbd8f0000 RBX: ff2dd265ad161400 RCX: 00000000000004e1
> [ 5866.451435] RDX: 00000000000004e1 RSI: ffd31c37eab1c980 RDI: ff2dd26dbd8f0000
> [ 5866.459454] RBP: ff428d9ec0bb0d40 R08: 0000000000000000 R09: 0000000000000000
> [ 5866.467470] R10: 0000000000000000 R11: 0000000000000000 R12: ff428d9eec726ef8
> [ 5866.475490] R13: ff2dd26dbd8f0000 R14: ff2dd265ca2f9fc0 R15: ff2dd26548548b80
> [ 5866.483509] FS:  0000000000000000(0000) GS:ff2dd2c363592000(0000) knlGS:0000000000000000
> [ 5866.492600] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 5866.499060] CR2: ffd31c37eab1c980 CR3: 0000000178d7b040 CR4: 0000000000f71ef0
> [ 5866.507079] PKRU: 55555554
> [ 5866.510125] Call Trace:
> [ 5866.512867]  <IRQ>
> [ 5866.515132]  ? i40e_clean_rx_irq_zc+0xc50/0xe60 [i40e]
> [ 5866.520921]  i40e_napi_poll+0x2d8/0x1890 [i40e]
> [ 5866.526022]  ? srso_alias_return_thunk+0x5/0xfbef5
> [ 5866.531408]  ? raise_softirq+0x24/0x70
> [ 5866.535623]  ? srso_alias_return_thunk+0x5/0xfbef5
> [ 5866.541011]  ? srso_alias_return_thunk+0x5/0xfbef5
> [ 5866.546397]  ? rcu_sched_clock_irq+0x225/0x1800
> [ 5866.551493]  __napi_poll+0x30/0x230
> [ 5866.555423]  net_rx_action+0x20b/0x3f0
> [ 5866.559643]  handle_softirqs+0xe4/0x340
> [ 5866.563962]  __irq_exit_rcu+0x10e/0x130
> [ 5866.568283]  irq_exit_rcu+0xe/0x20
> [ 5866.572110]  common_interrupt+0xb6/0xe0
> [ 5866.576425]  </IRQ>
> [ 5866.578791]  <TASK>
>
> Advance next_to_clean to ensure invalid xdp_buff(s) aren't created.
>
> Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support")
> Signed-off-by: Alessandro Decina <alessandro.d@gmail.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> index 9f47388eaba5..dbc19083bbb7 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> @@ -441,13 +441,18 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
>                 dma_rmb();
>
>                 if (i40e_rx_is_programming_status(qword)) {
> +                       u16 ntp;
> +
>                         i40e_clean_programming_status(rx_ring,
>                                                       rx_desc->raw.qword[0],
>                                                       qword);
>                         bi = *i40e_rx_bi(rx_ring, next_to_process);
>                         xsk_buff_free(bi);
> -                       if (++next_to_process == count)
> +                       ntp = next_to_process++;
> +                       if (next_to_process == count)
>                                 next_to_process = 0;
> +                       if (next_to_clean == ntp)
> +                               next_to_clean = next_to_process;
>                         continue;
>                 }
>
> --
> 2.43.0
>
>

I'm copying your reply from v1 as shown below so that we can continue
with the discussion :)

> It really depends on whether a status descriptor can be received in the
> middle of multi-buffer packet. Based on the existing code, I assumed it
> can. Therefore, consider this case:
>
> [valid_1st_packet][status_descriptor][valid_2nd_packet]
>
> In this case you want to skip status_descriptor but keep the existing
> logic that leads to:
>
>     first = next_to_clean = valid_1st_packet
>
> so then you can go and add valid_2nd_packet as a fragment to the first.

Sorry, honestly, I still don't follow you.

Looking at the case you provided, I think @first always pointing to
valid_1st_packet is valid which does not bring any trouble. You mean
the case is what you're trying to handle?

You patch updates next_to_clean that is only used at the very
beginning, so it will not affect @first. Imaging the following case:

     [status_descriptor][valid_1st_packet][valid_2nd_packet]

Even if the next_to_clean is updated, the @first still points to
[status_descriptor] that is invalid and that will later cause the
panic when constructing the skb.

I'm afraid that we're not on the same page. Let me confirm that it is
@first that points to the status descriptor that causes the panic,
right? Could you share with us the exact case just like you did as
above. Thank you.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH net v2 1/1] i40e: xsk: advance next_to_clean on status descriptors
  2025-10-22  3:11   ` Jason Xing
@ 2025-10-22  5:41     ` Sarkar, Tirthendu
  2025-10-22  7:28       ` Jason Xing
  2025-10-22 16:28       ` Your Name
  2025-10-22 16:17     ` Alessandro Decina
  1 sibling, 2 replies; 12+ messages in thread
From: Sarkar, Tirthendu @ 2025-10-22  5:41 UTC (permalink / raw)
  To: Jason Xing, Alessandro Decina
  Cc: netdev@vger.kernel.org, Fijalkowski, Maciej, David S. Miller,
	Alexei Starovoitov, Andrew Lunn, Daniel Borkmann, Eric Dumazet,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Paolo Abeni, Kitszel, Przemyslaw, Stanislav Fomichev,
	Nguyen, Anthony L, bpf@vger.kernel.org,
	intel-wired-lan@lists.osuosl.org, linux-kernel@vger.kernel.org

> From: Jason Xing <kerneljasonxing@gmail.com>
> Sent: 22 October 2025 08:41
> On Wed, Oct 22, 2025 at 1:33 AM Alessandro Decina
> <alessandro.d@gmail.com> wrote:
> >
> > Whenever a status descriptor is received, i40e processes and skips over
> > it, correctly updating next_to_process but forgetting to update
> > next_to_clean. In the next iteration this accidentally causes the
> > creation of an invalid multi-buffer xdp_buff where the first fragment
> > is the status descriptor.
> >
> > If then a skb is constructed from such an invalid buffer - because the
> > eBPF program returns XDP_PASS - a panic occurs:
> >
> > [ 5866.367317] BUG: unable to handle page fault for address:
> ffd31c37eab1c980
> > [ 5866.375050] #PF: supervisor read access in kernel mode
> > [ 5866.380825] #PF: error_code(0x0000) - not-present page
> > [ 5866.386602] PGD 0
> > [ 5866.388867] Oops: Oops: 0000 [#1] SMP NOPTI
> > [ 5866.393575] CPU: 34 UID: 0 PID: 0 Comm: swapper/34 Not tainted
> 6.17.0-custom #1 PREEMPT(voluntary)
> > [ 5866.403740] Hardware name: Supermicro AS -2115GT-HNTR/H13SST-G,
> BIOS 3.2 03/20/2025
> > [ 5866.412339] RIP: 0010:memcpy+0x8/0x10
> > [ 5866.416454] Code: cc cc 90 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 90 90
> 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 d1 <f3> a4
> e9 fc 26 c0 fe 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
> > [ 5866.437538] RSP: 0018:ff428d9ec0bb0ca8 EFLAGS: 00010286
> > [ 5866.443415] RAX: ff2dd26dbd8f0000 RBX: ff2dd265ad161400 RCX:
> 00000000000004e1
> > [ 5866.451435] RDX: 00000000000004e1 RSI: ffd31c37eab1c980 RDI:
> ff2dd26dbd8f0000
> > [ 5866.459454] RBP: ff428d9ec0bb0d40 R08: 0000000000000000 R09:
> 0000000000000000
> > [ 5866.467470] R10: 0000000000000000 R11: 0000000000000000 R12:
> ff428d9eec726ef8
> > [ 5866.475490] R13: ff2dd26dbd8f0000 R14: ff2dd265ca2f9fc0 R15:
> ff2dd26548548b80
> > [ 5866.483509] FS:  0000000000000000(0000)
> GS:ff2dd2c363592000(0000) knlGS:0000000000000000
> > [ 5866.492600] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 5866.499060] CR2: ffd31c37eab1c980 CR3: 0000000178d7b040 CR4:
> 0000000000f71ef0
> > [ 5866.507079] PKRU: 55555554
> > [ 5866.510125] Call Trace:
> > [ 5866.512867]  <IRQ>
> > [ 5866.515132]  ? i40e_clean_rx_irq_zc+0xc50/0xe60 [i40e]
> > [ 5866.520921]  i40e_napi_poll+0x2d8/0x1890 [i40e]
> > [ 5866.526022]  ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 5866.531408]  ? raise_softirq+0x24/0x70
> > [ 5866.535623]  ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 5866.541011]  ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 5866.546397]  ? rcu_sched_clock_irq+0x225/0x1800
> > [ 5866.551493]  __napi_poll+0x30/0x230
> > [ 5866.555423]  net_rx_action+0x20b/0x3f0
> > [ 5866.559643]  handle_softirqs+0xe4/0x340
> > [ 5866.563962]  __irq_exit_rcu+0x10e/0x130
> > [ 5866.568283]  irq_exit_rcu+0xe/0x20
> > [ 5866.572110]  common_interrupt+0xb6/0xe0
> > [ 5866.576425]  </IRQ>
> > [ 5866.578791]  <TASK>
> >
> > Advance next_to_clean to ensure invalid xdp_buff(s) aren't created.
> >
> > Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support")
> > Signed-off-by: Alessandro Decina <alessandro.d@gmail.com>
> > ---
> >  drivers/net/ethernet/intel/i40e/i40e_xsk.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > index 9f47388eaba5..dbc19083bbb7 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > @@ -441,13 +441,18 @@ int i40e_clean_rx_irq_zc(struct i40e_ring
> *rx_ring, int budget)
> >                 dma_rmb();
> >
> >                 if (i40e_rx_is_programming_status(qword)) {
> > +                       u16 ntp;
> > +
> >                         i40e_clean_programming_status(rx_ring,
> >                                                       rx_desc->raw.qword[0],
> >                                                       qword);
> >                         bi = *i40e_rx_bi(rx_ring, next_to_process);
> >                         xsk_buff_free(bi);
> > -                       if (++next_to_process == count)
> > +                       ntp = next_to_process++;
> > +                       if (next_to_process == count)
> >                                 next_to_process = 0;
> > +                       if (next_to_clean == ntp)
> > +                               next_to_clean = next_to_process;
> >                         continue;
> >                 }
> >
> > --
> > 2.43.0
> >
> >
> 
> I'm copying your reply from v1 as shown below so that we can continue
> with the discussion :)
> 
> > It really depends on whether a status descriptor can be received in the
> > middle of multi-buffer packet. Based on the existing code, I assumed it
> > can. Therefore, consider this case:
> >
> > [valid_1st_packet][status_descriptor][valid_2nd_packet]
> >
> > In this case you want to skip status_descriptor but keep the existing
> > logic that leads to:
> >
> >     first = next_to_clean = valid_1st_packet
> >
> > so then you can go and add valid_2nd_packet as a fragment to the first.
> 
> Sorry, honestly, I still don't follow you.
> 
> Looking at the case you provided, I think @first always pointing to
> valid_1st_packet is valid which does not bring any trouble. You mean
> the case is what you're trying to handle?
> 
> You patch updates next_to_clean that is only used at the very
> beginning, so it will not affect @first. Imaging the following case:
> 
>      [status_descriptor][valid_1st_packet][valid_2nd_packet]
> 
> Even if the next_to_clean is updated, the @first still points to
> [status_descriptor] that is invalid and that will later cause the
> panic when constructing the skb.
> 
> I'm afraid that we're not on the same page. Let me confirm that it is
> @first that points to the status descriptor that causes the panic,
> right? Could you share with us the exact case just like you did as
> above. Thank you.
> 
> Thanks,
> Jason

I believe the issue is not that status_descriptor is getting into multi-buffer packet but not updating next_to_clean results in I40E_DESC_UNUSED() to return incorrect values.
A similar issue was reported and fixed on the non-ZC path: https://lore.kernel.org/netdev/20231004083454.20143-1-tirthendu.sarkar@intel.com/

Thanks,
Tirthendu

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net v2 1/1] i40e: xsk: advance next_to_clean on status descriptors
  2025-10-22  5:41     ` Sarkar, Tirthendu
@ 2025-10-22  7:28       ` Jason Xing
  2025-10-22 16:28       ` Your Name
  1 sibling, 0 replies; 12+ messages in thread
From: Jason Xing @ 2025-10-22  7:28 UTC (permalink / raw)
  To: Sarkar, Tirthendu
  Cc: Alessandro Decina, netdev@vger.kernel.org, Fijalkowski, Maciej,
	David S. Miller, Alexei Starovoitov, Andrew Lunn, Daniel Borkmann,
	Eric Dumazet, Jakub Kicinski, Jesper Dangaard Brouer,
	John Fastabend, Paolo Abeni, Kitszel, Przemyslaw,
	Stanislav Fomichev, Nguyen, Anthony L, bpf@vger.kernel.org,
	intel-wired-lan@lists.osuosl.org, linux-kernel@vger.kernel.org

On Wed, Oct 22, 2025 at 1:41 PM Sarkar, Tirthendu
<tirthendu.sarkar@intel.com> wrote:
>
> > From: Jason Xing <kerneljasonxing@gmail.com>
> > Sent: 22 October 2025 08:41
> > On Wed, Oct 22, 2025 at 1:33 AM Alessandro Decina
> > <alessandro.d@gmail.com> wrote:
> > >
> > > Whenever a status descriptor is received, i40e processes and skips over
> > > it, correctly updating next_to_process but forgetting to update
> > > next_to_clean. In the next iteration this accidentally causes the
> > > creation of an invalid multi-buffer xdp_buff where the first fragment
> > > is the status descriptor.
> > >
> > > If then a skb is constructed from such an invalid buffer - because the
> > > eBPF program returns XDP_PASS - a panic occurs:
> > >
> > > [ 5866.367317] BUG: unable to handle page fault for address:
> > ffd31c37eab1c980
> > > [ 5866.375050] #PF: supervisor read access in kernel mode
> > > [ 5866.380825] #PF: error_code(0x0000) - not-present page
> > > [ 5866.386602] PGD 0
> > > [ 5866.388867] Oops: Oops: 0000 [#1] SMP NOPTI
> > > [ 5866.393575] CPU: 34 UID: 0 PID: 0 Comm: swapper/34 Not tainted
> > 6.17.0-custom #1 PREEMPT(voluntary)
> > > [ 5866.403740] Hardware name: Supermicro AS -2115GT-HNTR/H13SST-G,
> > BIOS 3.2 03/20/2025
> > > [ 5866.412339] RIP: 0010:memcpy+0x8/0x10
> > > [ 5866.416454] Code: cc cc 90 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 90 90
> > 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 d1 <f3> a4
> > e9 fc 26 c0 fe 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
> > > [ 5866.437538] RSP: 0018:ff428d9ec0bb0ca8 EFLAGS: 00010286
> > > [ 5866.443415] RAX: ff2dd26dbd8f0000 RBX: ff2dd265ad161400 RCX:
> > 00000000000004e1
> > > [ 5866.451435] RDX: 00000000000004e1 RSI: ffd31c37eab1c980 RDI:
> > ff2dd26dbd8f0000
> > > [ 5866.459454] RBP: ff428d9ec0bb0d40 R08: 0000000000000000 R09:
> > 0000000000000000
> > > [ 5866.467470] R10: 0000000000000000 R11: 0000000000000000 R12:
> > ff428d9eec726ef8
> > > [ 5866.475490] R13: ff2dd26dbd8f0000 R14: ff2dd265ca2f9fc0 R15:
> > ff2dd26548548b80
> > > [ 5866.483509] FS:  0000000000000000(0000)
> > GS:ff2dd2c363592000(0000) knlGS:0000000000000000
> > > [ 5866.492600] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [ 5866.499060] CR2: ffd31c37eab1c980 CR3: 0000000178d7b040 CR4:
> > 0000000000f71ef0
> > > [ 5866.507079] PKRU: 55555554
> > > [ 5866.510125] Call Trace:
> > > [ 5866.512867]  <IRQ>
> > > [ 5866.515132]  ? i40e_clean_rx_irq_zc+0xc50/0xe60 [i40e]
> > > [ 5866.520921]  i40e_napi_poll+0x2d8/0x1890 [i40e]
> > > [ 5866.526022]  ? srso_alias_return_thunk+0x5/0xfbef5
> > > [ 5866.531408]  ? raise_softirq+0x24/0x70
> > > [ 5866.535623]  ? srso_alias_return_thunk+0x5/0xfbef5
> > > [ 5866.541011]  ? srso_alias_return_thunk+0x5/0xfbef5
> > > [ 5866.546397]  ? rcu_sched_clock_irq+0x225/0x1800
> > > [ 5866.551493]  __napi_poll+0x30/0x230
> > > [ 5866.555423]  net_rx_action+0x20b/0x3f0
> > > [ 5866.559643]  handle_softirqs+0xe4/0x340
> > > [ 5866.563962]  __irq_exit_rcu+0x10e/0x130
> > > [ 5866.568283]  irq_exit_rcu+0xe/0x20
> > > [ 5866.572110]  common_interrupt+0xb6/0xe0
> > > [ 5866.576425]  </IRQ>
> > > [ 5866.578791]  <TASK>
> > >
> > > Advance next_to_clean to ensure invalid xdp_buff(s) aren't created.
> > >
> > > Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support")
> > > Signed-off-by: Alessandro Decina <alessandro.d@gmail.com>
> > > ---
> > >  drivers/net/ethernet/intel/i40e/i40e_xsk.c | 7 ++++++-
> > >  1 file changed, 6 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > > index 9f47388eaba5..dbc19083bbb7 100644
> > > --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > > +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > > @@ -441,13 +441,18 @@ int i40e_clean_rx_irq_zc(struct i40e_ring
> > *rx_ring, int budget)
> > >                 dma_rmb();
> > >
> > >                 if (i40e_rx_is_programming_status(qword)) {
> > > +                       u16 ntp;
> > > +
> > >                         i40e_clean_programming_status(rx_ring,
> > >                                                       rx_desc->raw.qword[0],
> > >                                                       qword);
> > >                         bi = *i40e_rx_bi(rx_ring, next_to_process);
> > >                         xsk_buff_free(bi);
> > > -                       if (++next_to_process == count)
> > > +                       ntp = next_to_process++;
> > > +                       if (next_to_process == count)
> > >                                 next_to_process = 0;
> > > +                       if (next_to_clean == ntp)
> > > +                               next_to_clean = next_to_process;
> > >                         continue;
> > >                 }
> > >
> > > --
> > > 2.43.0
> > >
> > >
> >
> > I'm copying your reply from v1 as shown below so that we can continue
> > with the discussion :)
> >
> > > It really depends on whether a status descriptor can be received in the
> > > middle of multi-buffer packet. Based on the existing code, I assumed it
> > > can. Therefore, consider this case:
> > >
> > > [valid_1st_packet][status_descriptor][valid_2nd_packet]
> > >
> > > In this case you want to skip status_descriptor but keep the existing
> > > logic that leads to:
> > >
> > >     first = next_to_clean = valid_1st_packet
> > >
> > > so then you can go and add valid_2nd_packet as a fragment to the first.
> >
> > Sorry, honestly, I still don't follow you.
> >
> > Looking at the case you provided, I think @first always pointing to
> > valid_1st_packet is valid which does not bring any trouble. You mean
> > the case is what you're trying to handle?
> >
> > You patch updates next_to_clean that is only used at the very
> > beginning, so it will not affect @first. Imaging the following case:
> >
> >      [status_descriptor][valid_1st_packet][valid_2nd_packet]
> >
> > Even if the next_to_clean is updated, the @first still points to
> > [status_descriptor] that is invalid and that will later cause the
> > panic when constructing the skb.
> >
> > I'm afraid that we're not on the same page. Let me confirm that it is
> > @first that points to the status descriptor that causes the panic,
> > right? Could you share with us the exact case just like you did as
> > above. Thank you.
> >
> > Thanks,
> > Jason
>
> I believe the issue is not that status_descriptor is getting into multi-buffer packet but not updating next_to_clean results in I40E_DESC_UNUSED() to return incorrect values.
> A similar issue was reported and fixed on the non-ZC path: https://lore.kernel.org/netdev/20231004083454.20143-1-tirthendu.sarkar@intel.com/

Great, thanks! Now everything is clear to me. I think it would be
great if Alessandro can add/revise something like this in the commit
message.

Anyway, as to the code itself, feel free to add my tag:
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

Thanks,
Jason

>
> Thanks,
> Tirthendu

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net v2 1/1] i40e: xsk: advance next_to_clean on status descriptors
  2025-10-22  5:41     ` Sarkar, Tirthendu
  2025-10-22  7:28       ` Jason Xing
@ 2025-10-22 16:28       ` Your Name
  2025-10-23  6:11         ` Sarkar, Tirthendu
  1 sibling, 1 reply; 12+ messages in thread
From: Your Name @ 2025-10-22 16:28 UTC (permalink / raw)
  To: Sarkar, Tirthendu
  Cc: Jason Xing, Alessandro Decina, netdev@vger.kernel.org,
	Fijalkowski, Maciej, David S. Miller, Alexei Starovoitov,
	Andrew Lunn, Daniel Borkmann, Eric Dumazet, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Paolo Abeni,
	Kitszel, Przemyslaw, Stanislav Fomichev, Nguyen, Anthony L,
	bpf@vger.kernel.org, intel-wired-lan@lists.osuosl.org,
	linux-kernel@vger.kernel.org

On Wed, Oct 22, 2025 at 05:41:06AM +0000, Sarkar, Tirthendu wrote:
> > From: Jason Xing <kerneljasonxing@gmail.com>
> 
> I believe the issue is not that status_descriptor is getting into
> multi-buffer packet but not updating next_to_clean results in
> I40E_DESC_UNUSED() to return incorrect values.

I don't think this is true? next_to_clean can be < next_to_process by
design, see

	if (next_to_process != next_to_clean)
		first = *i40e_rx_bi(rx_ring, next_to_clean);

at the start of i40e_clean_rx_irq_zc. This condition is normal and means
when we exited the function - for example because we ran out of budget - 
we were in the middle of a multi-buffer packet and now we must continue.

If I understand the code, I think that in that case we just set
entries_to_alloc to a lower number and return fewer buffers to the
hardware. 


> A similar issue was
> reported and fixed on the non-ZC path:
> https://lore.kernel.org/netdev/20231004083454.20143-1-tirthendu.sarkar@intel.com/

This is indeed exactly the same issue, but I'm not yet sold on the
diagnosis :D 

Ciao,
Alessandro

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH net v2 1/1] i40e: xsk: advance next_to_clean on status descriptors
  2025-10-22 16:28       ` Your Name
@ 2025-10-23  6:11         ` Sarkar, Tirthendu
  0 siblings, 0 replies; 12+ messages in thread
From: Sarkar, Tirthendu @ 2025-10-23  6:11 UTC (permalink / raw)
  To: Your Name
  Cc: Jason Xing, netdev@vger.kernel.org, Fijalkowski, Maciej,
	David S. Miller, Alexei Starovoitov, Andrew Lunn, Daniel Borkmann,
	Eric Dumazet, Jakub Kicinski, Jesper Dangaard Brouer,
	John Fastabend, Paolo Abeni, Kitszel, Przemyslaw,
	Stanislav Fomichev, Nguyen, Anthony L, bpf@vger.kernel.org,
	intel-wired-lan@lists.osuosl.org, linux-kernel@vger.kernel.org

> From: Your Name <alessandro.d@gmail.com>
> Sent: 22 October 2025 21:58
> 
> On Wed, Oct 22, 2025 at 05:41:06AM +0000, Sarkar, Tirthendu wrote:
> > > From: Jason Xing <kerneljasonxing@gmail.com>
> >
> > I believe the issue is not that status_descriptor is getting into
> > multi-buffer packet but not updating next_to_clean results in
> > I40E_DESC_UNUSED() to return incorrect values.
> 
> I don't think this is true? next_to_clean can be < next_to_process by
> design, see
> 
> 	if (next_to_process != next_to_clean)
> 		first = *i40e_rx_bi(rx_ring, next_to_clean);
> 
> at the start of i40e_clean_rx_irq_zc. This condition is normal and means
> when we exited the function - for example because we ran out of budget -
> we were in the middle of a multi-buffer packet and now we must continue.
> 

Ah, yes. Missed that. BTW, we won't run out of budget when we see status_descriptor or in the middle of a multi-buffer packet since those do not increase total_rx_packets. 
However,  if we see a status descriptor and no packets after that (size = 0), we will break the loop thus making next_to_clean and next_to_process out-of-sync with next_to_clean still pointing to status descriptor when we resume.

Thanks,
Tirthendu

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net v2 1/1] i40e: xsk: advance next_to_clean on status descriptors
  2025-10-22  3:11   ` Jason Xing
  2025-10-22  5:41     ` Sarkar, Tirthendu
@ 2025-10-22 16:17     ` Alessandro Decina
  1 sibling, 0 replies; 12+ messages in thread
From: Alessandro Decina @ 2025-10-22 16:17 UTC (permalink / raw)
  To: Jason Xing
  Cc: Alessandro Decina, netdev, Maciej Fijalkowski, David S. Miller,
	Alexei Starovoitov, Andrew Lunn, Daniel Borkmann, Eric Dumazet,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Paolo Abeni, Przemek Kitszel, Stanislav Fomichev,
	Tirthendu Sarkar, Tony Nguyen, bpf, intel-wired-lan, linux-kernel

On Wed, Oct 22, 2025 at 11:11:01AM +0800, Jason Xing wrote:
> On Wed, Oct 22, 2025 at 1:33 AM Alessandro Decina
> <alessandro.d@gmail.com> wrote:
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > index 9f47388eaba5..dbc19083bbb7 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > @@ -441,13 +441,18 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
> >                 dma_rmb();
> >
> >                 if (i40e_rx_is_programming_status(qword)) {
> > +                       u16 ntp;
> > +
> >                         i40e_clean_programming_status(rx_ring,
> >                                                       rx_desc->raw.qword[0],
> >                                                       qword);
> >                         bi = *i40e_rx_bi(rx_ring, next_to_process);
> >                         xsk_buff_free(bi);
> > -                       if (++next_to_process == count)
> > +                       ntp = next_to_process++;
> > +                       if (next_to_process == count)
> >                                 next_to_process = 0;
> > +                       if (next_to_clean == ntp)
> > +                               next_to_clean = next_to_process;
> >                         continue;
> >                 }
> >
> > --
> > 2.43.0
> >
> >
> 
> I'm copying your reply from v1 as shown below so that we can continue
> with the discussion :)
> 
> > It really depends on whether a status descriptor can be received in the
> > middle of multi-buffer packet. Based on the existing code, I assumed it
> > can. Therefore, consider this case:
> >
> > [valid_1st_packet][status_descriptor][valid_2nd_packet]
> >
> > In this case you want to skip status_descriptor but keep the existing
> > logic that leads to:
> >
> >     first = next_to_clean = valid_1st_packet
> >
> > so then you can go and add valid_2nd_packet as a fragment to the first.
> 
> Sorry, honestly, I still don't follow you.
> 
> Looking at the case you provided, I think @first always pointing to
> valid_1st_packet is valid which does not bring any trouble. You mean
> the case is what you're trying to handle?

Yes, I mean this case needs to keep working, so we can't move
next_to_clean unconditionally, we can only move it when
next_to_clean == ntp, which is equivalent to checking that
first == NULL. See below.
 
> You patch updates next_to_clean that is only used at the very
> beginning, so it will not affect @first. Imaging the following case:
> 
>      [status_descriptor][valid_1st_packet][valid_2nd_packet]
> 
> Even if the next_to_clean is updated, the @first still points to
> [status_descriptor] that is invalid and that will later cause the
> panic when constructing the skb.

Exactly - the key is to make sure we never get into this state :)

At the beginning of the function - outside the loop - first is only
assigned if (next_to_process != next_to_clean). This condition means: if
we previously exited the function in the middle of a multi-buffer
packet, we must resume from the start of that packet (next_to_clean) and
process the next fragment in it (next_to_process).

Consider the case you just gave:

> [status_descriptor][valid_1st_mb_packet][valid_2nd_mb_packet]

Assume we enter the function and next_to_process == next_to_clean, we
don't assign first, so first = NULL.

We find the status descriptor: without this patch, we increment
next_to_process, don't increment next_to_clean, say we run out of budget
and break the loop, the next time the function is entered we set first =
status_descriptor because next_to_process != next_to_clean. This is
exactly what we want to avoid.

With this patch upon processing the status descriptor, we see
next_to_clean == ntp, we increment next_to_clean, the next time the
function is entered next_to_process == next_to_clean, first is correctly
set to NULL and the next packet starts from valid_1st_mb_packet.

So I've covered both the scenarios:

a) [status][mb_packet1][mb_packet2]
b) [mb_packet1][status][mb_packet2]

The last case

c) [packet1][packet2][status] is actually just a), because at packet2
we'd find the EOP marker and close off the previous multi-buffer packet.

I hope I was more clear and please check my logic :)

Ciao,
Alessandro

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [Intel-wired-lan] [PATCH net v2 1/1] i40e: xsk: advance next_to_clean on status descriptors
  2025-10-21 17:32 ` [PATCH net v2 1/1] " Alessandro Decina
  2025-10-22  3:11   ` Jason Xing
@ 2025-10-22  6:41   ` Loktionov, Aleksandr
  2025-10-22 17:17   ` Maciej Fijalkowski
  2 siblings, 0 replies; 12+ messages in thread
From: Loktionov, Aleksandr @ 2025-10-22  6:41 UTC (permalink / raw)
  To: Alessandro Decina, netdev@vger.kernel.org
  Cc: Fijalkowski, Maciej, David S. Miller, Alexei Starovoitov,
	Andrew Lunn, Daniel Borkmann, Eric Dumazet, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Paolo Abeni,
	Kitszel, Przemyslaw, Stanislav Fomichev, Sarkar, Tirthendu,
	Nguyen, Anthony L, bpf@vger.kernel.org,
	intel-wired-lan@lists.osuosl.org, linux-kernel@vger.kernel.org



> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf
> Of Alessandro Decina
> Sent: Tuesday, October 21, 2025 7:32 PM
> To: netdev@vger.kernel.org
> Cc: Fijalkowski, Maciej <maciej.fijalkowski@intel.com>; David S.
> Miller <davem@davemloft.net>; Alexei Starovoitov <ast@kernel.org>;
> Andrew Lunn <andrew+netdev@lunn.ch>; Daniel Borkmann
> <daniel@iogearbox.net>; Eric Dumazet <edumazet@google.com>; Jakub
> Kicinski <kuba@kernel.org>; Jesper Dangaard Brouer <hawk@kernel.org>;
> John Fastabend <john.fastabend@gmail.com>; Paolo Abeni
> <pabeni@redhat.com>; Kitszel, Przemyslaw
> <przemyslaw.kitszel@intel.com>; Stanislav Fomichev <sdf@fomichev.me>;
> Sarkar, Tirthendu <tirthendu.sarkar@intel.com>; Nguyen, Anthony L
> <anthony.l.nguyen@intel.com>; bpf@vger.kernel.org; intel-wired-
> lan@lists.osuosl.org; linux-kernel@vger.kernel.org; Alessandro Decina
> <alessandro.d@gmail.com>
> Subject: [Intel-wired-lan] [PATCH net v2 1/1] i40e: xsk: advance
> next_to_clean on status descriptors
> 
> Whenever a status descriptor is received, i40e processes and skips
> over it, correctly updating next_to_process but forgetting to update
> next_to_clean. In the next iteration this accidentally causes the
> creation of an invalid multi-buffer xdp_buff where the first fragment
> is the status descriptor.
> 
> If then a skb is constructed from such an invalid buffer - because the
> eBPF program returns XDP_PASS - a panic occurs:
> 
> [ 5866.367317] BUG: unable to handle page fault for address:
> ffd31c37eab1c980 [ 5866.375050] #PF: supervisor read access in kernel
> mode [ 5866.380825] #PF: error_code(0x0000) - not-present page [
> 5866.386602] PGD 0 [ 5866.388867] Oops: Oops: 0000 [#1] SMP NOPTI [
> 5866.393575] CPU: 34 UID: 0 PID: 0 Comm: swapper/34 Not tainted
> 6.17.0-custom #1 PREEMPT(voluntary) [ 5866.403740] Hardware name:
> Supermicro AS -2115GT-HNTR/H13SST-G, BIOS 3.2 03/20/2025 [
> 5866.412339] RIP: 0010:memcpy+0x8/0x10 [ 5866.416454] Code: cc cc 90
> cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 90 90 90 90 90 90 90 90
> 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 d1 <f3> a4 e9 fc 26 c0 fe
> 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 [ 5866.437538] RSP:
> 0018:ff428d9ec0bb0ca8 EFLAGS: 00010286 [ 5866.443415] RAX:
> ff2dd26dbd8f0000 RBX: ff2dd265ad161400 RCX: 00000000000004e1 [
> 5866.451435] RDX: 00000000000004e1 RSI: ffd31c37eab1c980 RDI:
> ff2dd26dbd8f0000 [ 5866.459454] RBP: ff428d9ec0bb0d40 R08:
> 0000000000000000 R09: 0000000000000000 [ 5866.467470] R10:
> 0000000000000000 R11: 0000000000000000 R12: ff428d9eec726ef8 [
> 5866.475490] R13: ff2dd26dbd8f0000 R14: ff2dd265ca2f9fc0 R15:
> ff2dd26548548b80 [ 5866.483509] FS:  0000000000000000(0000)
> GS:ff2dd2c363592000(0000) knlGS:0000000000000000 [ 5866.492600] CS:
> 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 5866.499060] CR2:
> ffd31c37eab1c980 CR3: 0000000178d7b040 CR4: 0000000000f71ef0 [
> 5866.507079] PKRU: 55555554 [ 5866.510125] Call Trace:
> [ 5866.512867]  <IRQ>
> [ 5866.515132]  ? i40e_clean_rx_irq_zc+0xc50/0xe60 [i40e] [
> 5866.520921]  i40e_napi_poll+0x2d8/0x1890 [i40e] [ 5866.526022]  ?
> srso_alias_return_thunk+0x5/0xfbef5
> [ 5866.531408]  ? raise_softirq+0x24/0x70 [ 5866.535623]  ?
> srso_alias_return_thunk+0x5/0xfbef5
> [ 5866.541011]  ? srso_alias_return_thunk+0x5/0xfbef5
> [ 5866.546397]  ? rcu_sched_clock_irq+0x225/0x1800 [ 5866.551493]
> __napi_poll+0x30/0x230 [ 5866.555423]  net_rx_action+0x20b/0x3f0 [
> 5866.559643]  handle_softirqs+0xe4/0x340 [ 5866.563962]
> __irq_exit_rcu+0x10e/0x130 [ 5866.568283]  irq_exit_rcu+0xe/0x20 [
> 5866.572110]  common_interrupt+0xb6/0xe0 [ 5866.576425]  </IRQ> [
> 5866.578791]  <TASK>
> 
> Advance next_to_clean to ensure invalid xdp_buff(s) aren't created.
> 
> Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support")
> Signed-off-by: Alessandro Decina <alessandro.d@gmail.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> index 9f47388eaba5..dbc19083bbb7 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> @@ -441,13 +441,18 @@ int i40e_clean_rx_irq_zc(struct i40e_ring
> *rx_ring, int budget)
>  		dma_rmb();
> 
>  		if (i40e_rx_is_programming_status(qword)) {
> +			u16 ntp;
> +
>  			i40e_clean_programming_status(rx_ring,
>  						      rx_desc->raw.qword[0],
>  						      qword);
>  			bi = *i40e_rx_bi(rx_ring, next_to_process);
>  			xsk_buff_free(bi);
> -			if (++next_to_process == count)
> +			ntp = next_to_process++;
> +			if (next_to_process == count)
>  				next_to_process = 0;
> +			if (next_to_clean == ntp)
> +				next_to_clean = next_to_process;
>  			continue;
>  		}
> 
> --
> 2.43.0


Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net v2 1/1] i40e: xsk: advance next_to_clean on status descriptors
  2025-10-21 17:32 ` [PATCH net v2 1/1] " Alessandro Decina
  2025-10-22  3:11   ` Jason Xing
  2025-10-22  6:41   ` [Intel-wired-lan] " Loktionov, Aleksandr
@ 2025-10-22 17:17   ` Maciej Fijalkowski
  2025-10-22 17:39     ` Your Name
  2 siblings, 1 reply; 12+ messages in thread
From: Maciej Fijalkowski @ 2025-10-22 17:17 UTC (permalink / raw)
  To: Alessandro Decina
  Cc: netdev, David S. Miller, Alexei Starovoitov, Andrew Lunn,
	Daniel Borkmann, Eric Dumazet, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Paolo Abeni,
	Przemek Kitszel, Stanislav Fomichev, Tirthendu Sarkar,
	Tony Nguyen, bpf, intel-wired-lan, linux-kernel

On Wed, Oct 22, 2025 at 12:32:00AM +0700, Alessandro Decina wrote:

Hi Alessandro,

> Whenever a status descriptor is received, i40e processes and skips over
> it, correctly updating next_to_process but forgetting to update
> next_to_clean. In the next iteration this accidentally causes the
> creation of an invalid multi-buffer xdp_buff where the first fragment
> is the status descriptor.
> 
> If then a skb is constructed from such an invalid buffer - because the
> eBPF program returns XDP_PASS - a panic occurs:

can you elaborate on the test case that would reproduce this? I suppose
AF_XDP ZC with jumbo frames, doing XDP_PASS, but what was FDIR setup that
caused status descriptors?

> 
> [ 5866.367317] BUG: unable to handle page fault for address: ffd31c37eab1c980
> [ 5866.375050] #PF: supervisor read access in kernel mode
> [ 5866.380825] #PF: error_code(0x0000) - not-present page
> [ 5866.386602] PGD 0
> [ 5866.388867] Oops: Oops: 0000 [#1] SMP NOPTI
> [ 5866.393575] CPU: 34 UID: 0 PID: 0 Comm: swapper/34 Not tainted 6.17.0-custom #1 PREEMPT(voluntary)
> [ 5866.403740] Hardware name: Supermicro AS -2115GT-HNTR/H13SST-G, BIOS 3.2 03/20/2025
> [ 5866.412339] RIP: 0010:memcpy+0x8/0x10
> [ 5866.416454] Code: cc cc 90 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 d1 <f3> a4 e9 fc 26 c0 fe 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
> [ 5866.437538] RSP: 0018:ff428d9ec0bb0ca8 EFLAGS: 00010286
> [ 5866.443415] RAX: ff2dd26dbd8f0000 RBX: ff2dd265ad161400 RCX: 00000000000004e1
> [ 5866.451435] RDX: 00000000000004e1 RSI: ffd31c37eab1c980 RDI: ff2dd26dbd8f0000
> [ 5866.459454] RBP: ff428d9ec0bb0d40 R08: 0000000000000000 R09: 0000000000000000
> [ 5866.467470] R10: 0000000000000000 R11: 0000000000000000 R12: ff428d9eec726ef8
> [ 5866.475490] R13: ff2dd26dbd8f0000 R14: ff2dd265ca2f9fc0 R15: ff2dd26548548b80
> [ 5866.483509] FS:  0000000000000000(0000) GS:ff2dd2c363592000(0000) knlGS:0000000000000000
> [ 5866.492600] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 5866.499060] CR2: ffd31c37eab1c980 CR3: 0000000178d7b040 CR4: 0000000000f71ef0
> [ 5866.507079] PKRU: 55555554
> [ 5866.510125] Call Trace:
> [ 5866.512867]  <IRQ>
> [ 5866.515132]  ? i40e_clean_rx_irq_zc+0xc50/0xe60 [i40e]
> [ 5866.520921]  i40e_napi_poll+0x2d8/0x1890 [i40e]
> [ 5866.526022]  ? srso_alias_return_thunk+0x5/0xfbef5
> [ 5866.531408]  ? raise_softirq+0x24/0x70
> [ 5866.535623]  ? srso_alias_return_thunk+0x5/0xfbef5
> [ 5866.541011]  ? srso_alias_return_thunk+0x5/0xfbef5
> [ 5866.546397]  ? rcu_sched_clock_irq+0x225/0x1800
> [ 5866.551493]  __napi_poll+0x30/0x230
> [ 5866.555423]  net_rx_action+0x20b/0x3f0
> [ 5866.559643]  handle_softirqs+0xe4/0x340
> [ 5866.563962]  __irq_exit_rcu+0x10e/0x130
> [ 5866.568283]  irq_exit_rcu+0xe/0x20
> [ 5866.572110]  common_interrupt+0xb6/0xe0
> [ 5866.576425]  </IRQ>
> [ 5866.578791]  <TASK>
> 
> Advance next_to_clean to ensure invalid xdp_buff(s) aren't created.
> 
> Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support")
> Signed-off-by: Alessandro Decina <alessandro.d@gmail.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> index 9f47388eaba5..dbc19083bbb7 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> @@ -441,13 +441,18 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
>  		dma_rmb();
>  
>  		if (i40e_rx_is_programming_status(qword)) {
> +			u16 ntp;
> +
>  			i40e_clean_programming_status(rx_ring,
>  						      rx_desc->raw.qword[0],
>  						      qword);
>  			bi = *i40e_rx_bi(rx_ring, next_to_process);
>  			xsk_buff_free(bi);
> -			if (++next_to_process == count)
> +			ntp = next_to_process++;
> +			if (next_to_process == count)
>  				next_to_process = 0;
> +			if (next_to_clean == ntp)
> +				next_to_clean = next_to_process;

I wonder if this is more readable?

diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 9f47388eaba5..36f412a2d836 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -446,6 +446,10 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
 						      qword);
 			bi = *i40e_rx_bi(rx_ring, next_to_process);
 			xsk_buff_free(bi);
+			if (next_to_clean == next_to_process) {
+				if (++next_to_clean == count)
+					next_to_clean = 0;
+			}
 			if (++next_to_process == count)
 				next_to_process = 0;
 			continue;

>  			continue;
>  		}
>  
> -- 
> 2.43.0
> 
> 

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH net v2 1/1] i40e: xsk: advance next_to_clean on status descriptors
  2025-10-22 17:17   ` Maciej Fijalkowski
@ 2025-10-22 17:39     ` Your Name
  2025-11-03 15:14       ` Maciej Fijalkowski
  0 siblings, 1 reply; 12+ messages in thread
From: Your Name @ 2025-10-22 17:39 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Alessandro Decina, netdev, David S. Miller, Alexei Starovoitov,
	Andrew Lunn, Daniel Borkmann, Eric Dumazet, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Paolo Abeni,
	Przemek Kitszel, Stanislav Fomichev, Tirthendu Sarkar,
	Tony Nguyen, bpf, intel-wired-lan, linux-kernel

On Wed, Oct 22, 2025 at 07:17:20PM +0200, Maciej Fijalkowski wrote:
> On Wed, Oct 22, 2025 at 12:32:00AM +0700, Alessandro Decina wrote:
> 
> Hi Alessandro,

Hey,

Thanks for the review!


> 
> > Whenever a status descriptor is received, i40e processes and skips over
> > it, correctly updating next_to_process but forgetting to update
> > next_to_clean. In the next iteration this accidentally causes the
> > creation of an invalid multi-buffer xdp_buff where the first fragment
> > is the status descriptor.
> > 
> > If then a skb is constructed from such an invalid buffer - because the
> > eBPF program returns XDP_PASS - a panic occurs:
> 
> can you elaborate on the test case that would reproduce this? I suppose
> AF_XDP ZC with jumbo frames, doing XDP_PASS, but what was FDIR setup that
> caused status descriptors?

Doesn't have to be jumbo or multi-frag, anything that does XDP_PASS
reproduces, as long as status descriptors are posted. 

See the scenarios here https://lore.kernel.org/netdev/aPkDtuVgbS4J-Og_@lima-default/

As for what's causing the status descriptors, I haven't been able to
figure that out. I just know that I periodically get
I40E_RX_PROG_STATUS_DESC_FD_FILTER_STATUS. Happy to dig deeper if you
have any ideas!

> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > index 9f47388eaba5..dbc19083bbb7 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > @@ -441,13 +441,18 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
> >  		dma_rmb();
> >  
> >  		if (i40e_rx_is_programming_status(qword)) {
> > +			u16 ntp;
> > +
> >  			i40e_clean_programming_status(rx_ring,
> >  						      rx_desc->raw.qword[0],
> >  						      qword);
> >  			bi = *i40e_rx_bi(rx_ring, next_to_process);
> >  			xsk_buff_free(bi);
> > -			if (++next_to_process == count)
> > +			ntp = next_to_process++;
> > +			if (next_to_process == count)
> >  				next_to_process = 0;
> > +			if (next_to_clean == ntp)
> > +				next_to_clean = next_to_process;
> 
> I wonder if this is more readable?
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> index 9f47388eaba5..36f412a2d836 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> @@ -446,6 +446,10 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
>  						      qword);
>  			bi = *i40e_rx_bi(rx_ring, next_to_process);
>  			xsk_buff_free(bi);
> +			if (next_to_clean == next_to_process) {
> +				if (++next_to_clean == count)
> +					next_to_clean = 0;
> +			}
>  			if (++next_to_process == count)
>  				next_to_process = 0;
>  			continue;
> 
> >  			continue;
> >  		}

Probably because I've looked at it for longer, I find my version clearer
(I think I copied it from another driver actually). But I don't really
mind, happy to switch to yours if you prefer!

Ciao
Alessandro


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net v2 1/1] i40e: xsk: advance next_to_clean on status descriptors
  2025-10-22 17:39     ` Your Name
@ 2025-11-03 15:14       ` Maciej Fijalkowski
  0 siblings, 0 replies; 12+ messages in thread
From: Maciej Fijalkowski @ 2025-11-03 15:14 UTC (permalink / raw)
  To: Your Name
  Cc: netdev, David S. Miller, Alexei Starovoitov, Andrew Lunn,
	Daniel Borkmann, Eric Dumazet, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Paolo Abeni,
	Przemek Kitszel, Stanislav Fomichev, Tirthendu Sarkar,
	Tony Nguyen, bpf, intel-wired-lan, linux-kernel

On Thu, Oct 23, 2025 at 04:39:29AM +1100, Your Name wrote:
> On Wed, Oct 22, 2025 at 07:17:20PM +0200, Maciej Fijalkowski wrote:
> > On Wed, Oct 22, 2025 at 12:32:00AM +0700, Alessandro Decina wrote:
> > 
> > Hi Alessandro,
> 
> Hey,
> 
> Thanks for the review!
> 
> 
> > 
> > > Whenever a status descriptor is received, i40e processes and skips over
> > > it, correctly updating next_to_process but forgetting to update
> > > next_to_clean. In the next iteration this accidentally causes the
> > > creation of an invalid multi-buffer xdp_buff where the first fragment
> > > is the status descriptor.
> > > 
> > > If then a skb is constructed from such an invalid buffer - because the
> > > eBPF program returns XDP_PASS - a panic occurs:
> > 
> > can you elaborate on the test case that would reproduce this? I suppose
> > AF_XDP ZC with jumbo frames, doing XDP_PASS, but what was FDIR setup that
> > caused status descriptors?
> 
> Doesn't have to be jumbo or multi-frag, anything that does XDP_PASS
> reproduces, as long as status descriptors are posted. 
> 
> See the scenarios here https://lore.kernel.org/netdev/aPkDtuVgbS4J-Og_@lima-default/
> 
> As for what's causing the status descriptors, I haven't been able to
> figure that out. I just know that I periodically get
> I40E_RX_PROG_STATUS_DESC_FD_FILTER_STATUS. Happy to dig deeper if you
> have any ideas!
> 
> > > diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > > index 9f47388eaba5..dbc19083bbb7 100644
> > > --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > > +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > > @@ -441,13 +441,18 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
> > >  		dma_rmb();
> > >  
> > >  		if (i40e_rx_is_programming_status(qword)) {
> > > +			u16 ntp;
> > > +
> > >  			i40e_clean_programming_status(rx_ring,
> > >  						      rx_desc->raw.qword[0],
> > >  						      qword);
> > >  			bi = *i40e_rx_bi(rx_ring, next_to_process);
> > >  			xsk_buff_free(bi);
> > > -			if (++next_to_process == count)
> > > +			ntp = next_to_process++;
> > > +			if (next_to_process == count)
> > >  				next_to_process = 0;
> > > +			if (next_to_clean == ntp)
> > > +				next_to_clean = next_to_process;
> > 
> > I wonder if this is more readable?
> > 
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > index 9f47388eaba5..36f412a2d836 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > @@ -446,6 +446,10 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
> >  						      qword);
> >  			bi = *i40e_rx_bi(rx_ring, next_to_process);
> >  			xsk_buff_free(bi);
> > +			if (next_to_clean == next_to_process) {
> > +				if (++next_to_clean == count)
> > +					next_to_clean = 0;
> > +			}
> >  			if (++next_to_process == count)
> >  				next_to_process = 0;
> >  			continue;
> > 
> > >  			continue;
> > >  		}
> 
> Probably because I've looked at it for longer, I find my version clearer
> (I think I copied it from another driver actually). But I don't really
> mind, happy to switch to yours if you prefer!

Hmm. After taking a second look, how about we make it into a common
function shared between i40e_clean_rx_irq() and i40e_clean_rx_irq_zc() ?

> 
> Ciao
> Alessandro
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-11-03 15:14 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-21 17:31 [PATCH net v2 0/1] i40e: xsk: advance next_to_clean on status descriptors Alessandro Decina
2025-10-21 17:32 ` [PATCH net v2 1/1] " Alessandro Decina
2025-10-22  3:11   ` Jason Xing
2025-10-22  5:41     ` Sarkar, Tirthendu
2025-10-22  7:28       ` Jason Xing
2025-10-22 16:28       ` Your Name
2025-10-23  6:11         ` Sarkar, Tirthendu
2025-10-22 16:17     ` Alessandro Decina
2025-10-22  6:41   ` [Intel-wired-lan] " Loktionov, Aleksandr
2025-10-22 17:17   ` Maciej Fijalkowski
2025-10-22 17:39     ` Your Name
2025-11-03 15:14       ` Maciej Fijalkowski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).