* [Intel-wired-lan] [PATCH net v4 0/1] i40e: xsk: advance next_to_clean on status descriptors
@ 2025-11-18 11:31 Alessandro Decina
2025-11-18 11:31 ` [Intel-wired-lan] [PATCH net v4 1/1] " Alessandro Decina
0 siblings, 1 reply; 7+ messages in thread
From: Alessandro Decina @ 2025-11-18 11:31 UTC (permalink / raw)
To: netdev
Cc: Maciej Fijalkowski, David S. Miller, Alexei Starovoitov,
Andrew Lunn, Daniel Borkmann, Eric Dumazet, Jakub Kicinski,
Jesper Dangaard Brouer, John Fastabend, Paolo Abeni,
Przemek Kitszel, Stanislav Fomichev, Tirthendu Sarkar,
Tony Nguyen, bpf, intel-wired-lan, linux-kernel,
Alessandro Decina
Alright, hopefully last time I spam you with this.
The status descriptor handling code is now shared in
i40e_clean_programming_status.
Changes since v3:
* move the shared code in i40e_clean_programming_status
Alessandro Decina (1):
i40e: xsk: advance next_to_clean on status descriptors
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 59 +++++++++++--------
.../ethernet/intel/i40e/i40e_txrx_common.h | 5 +-
drivers/net/ethernet/intel/i40e/i40e_xsk.c | 14 ++---
3 files changed, 42 insertions(+), 36 deletions(-)
base-commit: 896f1a2493b59beb2b5ccdf990503dbb16cb2256
--
2.43.0
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Intel-wired-lan] [PATCH net v4 1/1] i40e: xsk: advance next_to_clean on status descriptors
2025-11-18 11:31 [Intel-wired-lan] [PATCH net v4 0/1] i40e: xsk: advance next_to_clean on status descriptors Alessandro Decina
@ 2025-11-18 11:31 ` Alessandro Decina
2025-11-18 12:13 ` Loktionov, Aleksandr
2025-11-20 15:41 ` Loktionov, Aleksandr
0 siblings, 2 replies; 7+ messages in thread
From: Alessandro Decina @ 2025-11-18 11:31 UTC (permalink / raw)
To: netdev
Cc: Maciej Fijalkowski, David S. Miller, Alexei Starovoitov,
Andrew Lunn, Daniel Borkmann, Eric Dumazet, Jakub Kicinski,
Jesper Dangaard Brouer, John Fastabend, Paolo Abeni,
Przemek Kitszel, Stanislav Fomichev, Tirthendu Sarkar,
Tony Nguyen, bpf, intel-wired-lan, linux-kernel,
Alessandro Decina
Whenever a status descriptor is received, i40e processes and skips over
it, correctly updating next_to_process but forgetting to update
next_to_clean. In the next iteration this accidentally causes the
creation of an invalid multi-buffer xdp_buff where the first fragment
is the status descriptor.
If then a skb is constructed from such an invalid buffer - because the
eBPF program returns XDP_PASS - a panic occurs:
[ 5866.367317] BUG: unable to handle page fault for address: ffd31c37eab1c980
[ 5866.375050] #PF: supervisor read access in kernel mode
[ 5866.380825] #PF: error_code(0x0000) - not-present page
[ 5866.386602] PGD 0
[ 5866.388867] Oops: Oops: 0000 [#1] SMP NOPTI
[ 5866.393575] CPU: 34 UID: 0 PID: 0 Comm: swapper/34 Not tainted 6.17.0-custom #1 PREEMPT(voluntary)
[ 5866.403740] Hardware name: Supermicro AS -2115GT-HNTR/H13SST-G, BIOS 3.2 03/20/2025
[ 5866.412339] RIP: 0010:memcpy+0x8/0x10
[ 5866.416454] Code: cc cc 90 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 d1 <f3> a4 e9 fc 26 c0 fe 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[ 5866.437538] RSP: 0018:ff428d9ec0bb0ca8 EFLAGS: 00010286
[ 5866.443415] RAX: ff2dd26dbd8f0000 RBX: ff2dd265ad161400 RCX: 00000000000004e1
[ 5866.451435] RDX: 00000000000004e1 RSI: ffd31c37eab1c980 RDI: ff2dd26dbd8f0000
[ 5866.459454] RBP: ff428d9ec0bb0d40 R08: 0000000000000000 R09: 0000000000000000
[ 5866.467470] R10: 0000000000000000 R11: 0000000000000000 R12: ff428d9eec726ef8
[ 5866.475490] R13: ff2dd26dbd8f0000 R14: ff2dd265ca2f9fc0 R15: ff2dd26548548b80
[ 5866.483509] FS: 0000000000000000(0000) GS:ff2dd2c363592000(0000) knlGS:0000000000000000
[ 5866.492600] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5866.499060] CR2: ffd31c37eab1c980 CR3: 0000000178d7b040 CR4: 0000000000f71ef0
[ 5866.507079] PKRU: 55555554
[ 5866.510125] Call Trace:
[ 5866.512867] <IRQ>
[ 5866.515132] ? i40e_clean_rx_irq_zc+0xc50/0xe60 [i40e]
[ 5866.520921] i40e_napi_poll+0x2d8/0x1890 [i40e]
[ 5866.526022] ? srso_alias_return_thunk+0x5/0xfbef5
[ 5866.531408] ? raise_softirq+0x24/0x70
[ 5866.535623] ? srso_alias_return_thunk+0x5/0xfbef5
[ 5866.541011] ? srso_alias_return_thunk+0x5/0xfbef5
[ 5866.546397] ? rcu_sched_clock_irq+0x225/0x1800
[ 5866.551493] __napi_poll+0x30/0x230
[ 5866.555423] net_rx_action+0x20b/0x3f0
[ 5866.559643] handle_softirqs+0xe4/0x340
[ 5866.563962] __irq_exit_rcu+0x10e/0x130
[ 5866.568283] irq_exit_rcu+0xe/0x20
[ 5866.572110] common_interrupt+0xb6/0xe0
[ 5866.576425] </IRQ>
[ 5866.578791] <TASK>
Advance next_to_clean to ensure invalid xdp_buff(s) aren't created.
Move the common logic to i40e_clean_programming_status and update both
i40e_clean_rx_irq and i40e_clean_rx_irq_zc accordingly.
Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support")
Signed-off-by: Alessandro Decina <alessandro.d@gmail.com>
---
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 59 +++++++++++--------
.../ethernet/intel/i40e/i40e_txrx_common.h | 5 +-
drivers/net/ethernet/intel/i40e/i40e_xsk.c | 14 ++---
3 files changed, 42 insertions(+), 36 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index cc0b9efc2637..fe2190f4b9bc 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1386,22 +1386,36 @@ static void i40e_reuse_rx_page(struct i40e_ring *rx_ring,
* @rx_ring: the rx ring that has this descriptor
* @qword0_raw: qword0
* @qword1: qword1 representing status_error_len in CPU ordering
+ * @next_to_process: pointer to next_to_process index
+ * @next_to_clean: pointer to next_to_clean index
*
* Flow director should handle FD_FILTER_STATUS to check its filter programming
* status being successful or not and take actions accordingly. FCoE should
* handle its context/filter programming/invalidation status and take actions.
*
- * Returns an i40e_rx_buffer to reuse if the cleanup occurred, otherwise NULL.
+ * Returns false if what is passed is not a status descriptor.
**/
-void i40e_clean_programming_status(struct i40e_ring *rx_ring, u64 qword0_raw,
- u64 qword1)
+bool i40e_clean_programming_status(struct i40e_ring *rx_ring, u64 qword0_raw,
+ u64 qword1, u16 *next_to_process,
+ u16 *next_to_clean)
{
+ u16 ntp = *next_to_process;
u8 id;
+ if (!i40e_rx_is_programming_status(qword1))
+ return false;
+
id = FIELD_GET(I40E_RX_PROG_STATUS_DESC_QW1_PROGID_MASK, qword1);
if (id == I40E_RX_PROG_STATUS_DESC_FD_FILTER_STATUS)
i40e_fd_handle_status(rx_ring, qword0_raw, qword1, id);
+
+ if (++*next_to_process == rx_ring->count)
+ *next_to_process = 0;
+ if (ntp == *next_to_clean)
+ *next_to_clean = *next_to_process;
+
+ return true;
}
/**
@@ -1971,19 +1985,18 @@ static void i40e_rx_buffer_flip(struct i40e_rx_buffer *rx_buffer,
}
/**
- * i40e_get_rx_buffer - Fetch Rx buffer and synchronize data for use
+ * i40e_prepare_rx_buffer - Synchronize the buffer for use by the CPU
* @rx_ring: rx descriptor ring to transact packets on
+ * @rx_buffer: the rx buffer
* @size: size of buffer to add to skb
*
- * This function will pull an Rx buffer from the ring and synchronize it
- * for use by the CPU.
+ * This function will synchronize the given buffer for use by the CPU.
*/
-static struct i40e_rx_buffer *i40e_get_rx_buffer(struct i40e_ring *rx_ring,
- const unsigned int size)
+static struct i40e_rx_buffer *
+i40e_prepare_rx_buffer(struct i40e_ring *rx_ring,
+ struct i40e_rx_buffer *rx_buffer,
+ const unsigned int size)
{
- struct i40e_rx_buffer *rx_buffer;
-
- rx_buffer = i40e_rx_bi(rx_ring, rx_ring->next_to_process);
rx_buffer->page_count =
#if (PAGE_SIZE < 8192)
page_count(rx_buffer->page);
@@ -2450,6 +2463,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget,
while (likely(total_rx_packets < (unsigned int)budget)) {
u16 ntp = rx_ring->next_to_process;
+ u16 ntc = rx_ring->next_to_clean;
struct i40e_rx_buffer *rx_buffer;
union i40e_rx_desc *rx_desc;
struct sk_buff *skb;
@@ -2480,21 +2494,15 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget,
*/
dma_rmb();
- if (i40e_rx_is_programming_status(qword)) {
- i40e_clean_programming_status(rx_ring,
- rx_desc->raw.qword[0],
- qword);
- rx_buffer = i40e_rx_bi(rx_ring, ntp);
- i40e_inc_ntp(rx_ring);
+ rx_buffer = i40e_rx_bi(rx_ring, ntp);
+
+ if (i40e_clean_programming_status(rx_ring,
+ rx_desc->raw.qword[0], qword,
+ &rx_ring->next_to_process,
+ &rx_ring->next_to_clean)) {
i40e_reuse_rx_page(rx_ring, rx_buffer);
- /* Update ntc and bump cleaned count if not in the
- * middle of mb packet.
- */
- if (rx_ring->next_to_clean == ntp) {
- rx_ring->next_to_clean =
- rx_ring->next_to_process;
+ if (ntc != rx_ring->next_to_clean)
cleaned_count++;
- }
continue;
}
@@ -2503,8 +2511,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget,
break;
i40e_trace(clean_rx_irq, rx_ring, rx_desc, xdp);
- /* retrieve a buffer from the ring */
- rx_buffer = i40e_get_rx_buffer(rx_ring, size);
+ i40e_prepare_rx_buffer(rx_ring, rx_buffer, size);
neop = i40e_is_non_eop(rx_ring, rx_desc);
i40e_inc_ntp(rx_ring);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
index e26807fd2123..21d9ed878bf0 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
@@ -7,8 +7,9 @@
#include "i40e.h"
int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp, struct i40e_ring *xdp_ring);
-void i40e_clean_programming_status(struct i40e_ring *rx_ring, u64 qword0_raw,
- u64 qword1);
+bool i40e_clean_programming_status(struct i40e_ring *rx_ring, u64 qword0_raw,
+ u64 qword1, u16 *next_to_clean,
+ u16 *next_to_process);
void i40e_process_skb_fields(struct i40e_ring *rx_ring,
union i40e_rx_desc *rx_desc, struct sk_buff *skb);
void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 9f47388eaba5..f8accc266c2c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -440,14 +440,13 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
*/
dma_rmb();
- if (i40e_rx_is_programming_status(qword)) {
- i40e_clean_programming_status(rx_ring,
- rx_desc->raw.qword[0],
- qword);
- bi = *i40e_rx_bi(rx_ring, next_to_process);
+ bi = *i40e_rx_bi(rx_ring, next_to_process);
+
+ if (i40e_clean_programming_status(rx_ring,
+ rx_desc->raw.qword[0],
+ qword, &next_to_process,
+ &next_to_clean)) {
xsk_buff_free(bi);
- if (++next_to_process == count)
- next_to_process = 0;
continue;
}
@@ -455,7 +454,6 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
if (!size)
break;
- bi = *i40e_rx_bi(rx_ring, next_to_process);
xsk_buff_set_size(bi, size);
xsk_buff_dma_sync_for_cpu(bi);
--
2.43.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [Intel-wired-lan] [PATCH net v4 1/1] i40e: xsk: advance next_to_clean on status descriptors
2025-11-18 11:31 ` [Intel-wired-lan] [PATCH net v4 1/1] " Alessandro Decina
@ 2025-11-18 12:13 ` Loktionov, Aleksandr
2025-11-18 18:45 ` Maciej Fijalkowski
2025-11-20 15:41 ` Loktionov, Aleksandr
1 sibling, 1 reply; 7+ messages in thread
From: Loktionov, Aleksandr @ 2025-11-18 12:13 UTC (permalink / raw)
To: Alessandro Decina, netdev@vger.kernel.org
Cc: Fijalkowski, Maciej, David S. Miller, Alexei Starovoitov,
Andrew Lunn, Daniel Borkmann, Eric Dumazet, Jakub Kicinski,
Jesper Dangaard Brouer, John Fastabend, Paolo Abeni,
Kitszel, Przemyslaw, Stanislav Fomichev, Sarkar, Tirthendu,
Nguyen, Anthony L, bpf@vger.kernel.org,
intel-wired-lan@lists.osuosl.org, linux-kernel@vger.kernel.org
> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf
> Of Alessandro Decina
> Sent: Tuesday, November 18, 2025 12:31 PM
> To: netdev@vger.kernel.org
> Cc: Fijalkowski, Maciej <maciej.fijalkowski@intel.com>; David S.
> Miller <davem@davemloft.net>; Alexei Starovoitov <ast@kernel.org>;
> Andrew Lunn <andrew+netdev@lunn.ch>; Daniel Borkmann
> <daniel@iogearbox.net>; Eric Dumazet <edumazet@google.com>; Jakub
> Kicinski <kuba@kernel.org>; Jesper Dangaard Brouer <hawk@kernel.org>;
> John Fastabend <john.fastabend@gmail.com>; Paolo Abeni
> <pabeni@redhat.com>; Kitszel, Przemyslaw
> <przemyslaw.kitszel@intel.com>; Stanislav Fomichev <sdf@fomichev.me>;
> Sarkar, Tirthendu <tirthendu.sarkar@intel.com>; Nguyen, Anthony L
> <anthony.l.nguyen@intel.com>; bpf@vger.kernel.org; intel-wired-
> lan@lists.osuosl.org; linux-kernel@vger.kernel.org; Alessandro Decina
> <alessandro.d@gmail.com>
> Subject: [Intel-wired-lan] [PATCH net v4 1/1] i40e: xsk: advance
> next_to_clean on status descriptors
>
> Whenever a status descriptor is received, i40e processes and skips
> over it, correctly updating next_to_process but forgetting to update
> next_to_clean. In the next iteration this accidentally causes the
> creation of an invalid multi-buffer xdp_buff where the first fragment
> is the status descriptor.
>
> If then a skb is constructed from such an invalid buffer - because the
> eBPF program returns XDP_PASS - a panic occurs:
>
> [ 5866.367317] BUG: unable to handle page fault for address:
> ffd31c37eab1c980 [ 5866.375050] #PF: supervisor read access in kernel
> mode [ 5866.380825] #PF: error_code(0x0000) - not-present page [
> 5866.386602] PGD 0 [ 5866.388867] Oops: Oops: 0000 [#1] SMP NOPTI [
> 5866.393575] CPU: 34 UID: 0 PID: 0 Comm: swapper/34 Not tainted
> 6.17.0-custom #1 PREEMPT(voluntary) [ 5866.403740] Hardware name:
> Supermicro AS -2115GT-HNTR/H13SST-G, BIOS 3.2 03/20/2025 [
> 5866.412339] RIP: 0010:memcpy+0x8/0x10 [ 5866.416454] Code: cc cc 90
> cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 90 90 90 90 90 90 90 90
> 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 d1 <f3> a4 e9 fc 26 c0 fe
> 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 [ 5866.437538] RSP:
> 0018:ff428d9ec0bb0ca8 EFLAGS: 00010286 [ 5866.443415] RAX:
> ff2dd26dbd8f0000 RBX: ff2dd265ad161400 RCX: 00000000000004e1 [
> 5866.451435] RDX: 00000000000004e1 RSI: ffd31c37eab1c980 RDI:
> ff2dd26dbd8f0000 [ 5866.459454] RBP: ff428d9ec0bb0d40 R08:
> 0000000000000000 R09: 0000000000000000 [ 5866.467470] R10:
> 0000000000000000 R11: 0000000000000000 R12: ff428d9eec726ef8 [
> 5866.475490] R13: ff2dd26dbd8f0000 R14: ff2dd265ca2f9fc0 R15:
> ff2dd26548548b80 [ 5866.483509] FS: 0000000000000000(0000)
> GS:ff2dd2c363592000(0000) knlGS:0000000000000000 [ 5866.492600] CS:
> 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 5866.499060] CR2:
> ffd31c37eab1c980 CR3: 0000000178d7b040 CR4: 0000000000f71ef0 [
> 5866.507079] PKRU: 55555554 [ 5866.510125] Call Trace:
> [ 5866.512867] <IRQ>
> [ 5866.515132] ? i40e_clean_rx_irq_zc+0xc50/0xe60 [i40e] [
> 5866.520921] i40e_napi_poll+0x2d8/0x1890 [i40e] [ 5866.526022] ?
> srso_alias_return_thunk+0x5/0xfbef5
> [ 5866.531408] ? raise_softirq+0x24/0x70 [ 5866.535623] ?
> srso_alias_return_thunk+0x5/0xfbef5
> [ 5866.541011] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 5866.546397] ? rcu_sched_clock_irq+0x225/0x1800 [ 5866.551493]
> __napi_poll+0x30/0x230 [ 5866.555423] net_rx_action+0x20b/0x3f0 [
> 5866.559643] handle_softirqs+0xe4/0x340 [ 5866.563962]
> __irq_exit_rcu+0x10e/0x130 [ 5866.568283] irq_exit_rcu+0xe/0x20 [
> 5866.572110] common_interrupt+0xb6/0xe0 [ 5866.576425] </IRQ> [
> 5866.578791] <TASK>
>
Commit message lacks test description.
Please add how you validated the fix (kernel version, NIC model, reproducer steps).
> Advance next_to_clean to ensure invalid xdp_buff(s) aren't created.
>
> Move the common logic to i40e_clean_programming_status and update both
> i40e_clean_rx_irq and i40e_clean_rx_irq_zc accordingly.
>
> Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support")
> Signed-off-by: Alessandro Decina <alessandro.d@gmail.com>
> ---
> drivers/net/ethernet/intel/i40e/i40e_txrx.c | 59 +++++++++++-------
> -
> .../ethernet/intel/i40e/i40e_txrx_common.h | 5 +-
> drivers/net/ethernet/intel/i40e/i40e_xsk.c | 14 ++---
> 3 files changed, 42 insertions(+), 36 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index cc0b9efc2637..fe2190f4b9bc 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -1386,22 +1386,36 @@ static void i40e_reuse_rx_page(struct
> i40e_ring *rx_ring,
> * @rx_ring: the rx ring that has this descriptor
> * @qword0_raw: qword0
> * @qword1: qword1 representing status_error_len in CPU ordering
> + * @next_to_process: pointer to next_to_process index
> + * @next_to_clean: pointer to next_to_clean index
> *
> * Flow director should handle FD_FILTER_STATUS to check its filter
> programming
> * status being successful or not and take actions accordingly. FCoE
> should
> * handle its context/filter programming/invalidation status and take
> actions.
> *
> - * Returns an i40e_rx_buffer to reuse if the cleanup occurred,
> otherwise NULL.
> + * Returns false if what is passed is not a status descriptor.
> **/
Please use kdoc Return: tag.
Kernel-doc is incomplete, please also document what 'true' means (status descriptor processed).
> -void i40e_clean_programming_status(struct i40e_ring *rx_ring, u64
> qword0_raw,
> - u64 qword1)
> +bool i40e_clean_programming_status(struct i40e_ring *rx_ring, u64
> qword0_raw,
> + u64 qword1, u16 *next_to_process,
> + u16 *next_to_clean)
> {
> + u16 ntp = *next_to_process;
> u8 id;
>
> + if (!i40e_rx_is_programming_status(qword1))
> + return false;
> +
> id = FIELD_GET(I40E_RX_PROG_STATUS_DESC_QW1_PROGID_MASK,
> qword1);
>
> if (id == I40E_RX_PROG_STATUS_DESC_FD_FILTER_STATUS)
> i40e_fd_handle_status(rx_ring, qword0_raw, qword1, id);
> +
> + if (++*next_to_process == rx_ring->count)
> + *next_to_process = 0;
Consider using u16_inc_wrap() instead of manual wrap.
> + if (ntp == *next_to_clean)
> + *next_to_clean = *next_to_process;
> +
> + return true;
> }
>
> /**
> @@ -1971,19 +1985,18 @@ static void i40e_rx_buffer_flip(struct
> i40e_rx_buffer *rx_buffer, }
>
> /**
> - * i40e_get_rx_buffer - Fetch Rx buffer and synchronize data for use
> + * i40e_prepare_rx_buffer - Synchronize the buffer for use by the CPU
> * @rx_ring: rx descriptor ring to transact packets on
> + * @rx_buffer: the rx buffer
> * @size: size of buffer to add to skb
> *
> - * This function will pull an Rx buffer from the ring and synchronize
> it
> - * for use by the CPU.
> + * This function will synchronize the given buffer for use by the
> CPU.
> */
> -static struct i40e_rx_buffer *i40e_get_rx_buffer(struct i40e_ring
> *rx_ring,
> - const unsigned int size)
> +static struct i40e_rx_buffer *
> +i40e_prepare_rx_buffer(struct i40e_ring *rx_ring,
> + struct i40e_rx_buffer *rx_buffer,
> + const unsigned int size)
> {
> - struct i40e_rx_buffer *rx_buffer;
> -
> - rx_buffer = i40e_rx_bi(rx_ring, rx_ring->next_to_process);
> rx_buffer->page_count =
> #if (PAGE_SIZE < 8192)
> page_count(rx_buffer->page);
> @@ -2450,6 +2463,7 @@ static int i40e_clean_rx_irq(struct i40e_ring
> *rx_ring, int budget,
>
> while (likely(total_rx_packets < (unsigned int)budget)) {
> u16 ntp = rx_ring->next_to_process;
> + u16 ntc = rx_ring->next_to_clean;
> struct i40e_rx_buffer *rx_buffer;
> union i40e_rx_desc *rx_desc;
> struct sk_buff *skb;
> @@ -2480,21 +2494,15 @@ static int i40e_clean_rx_irq(struct i40e_ring
> *rx_ring, int budget,
> */
> dma_rmb();
>
> - if (i40e_rx_is_programming_status(qword)) {
> - i40e_clean_programming_status(rx_ring,
> - rx_desc->raw.qword[0],
> - qword);
> - rx_buffer = i40e_rx_bi(rx_ring, ntp);
> - i40e_inc_ntp(rx_ring);
> + rx_buffer = i40e_rx_bi(rx_ring, ntp);
> +
> + if (i40e_clean_programming_status(rx_ring,
> + rx_desc->raw.qword[0],
> qword,
> + &rx_ring->next_to_process,
> + &rx_ring->next_to_clean)) {
Wrap the call to i40e_clean_programming_status to stay within 80 columns.
> i40e_reuse_rx_page(rx_ring, rx_buffer);
> - /* Update ntc and bump cleaned count if not in
> the
> - * middle of mb packet.
> - */
> - if (rx_ring->next_to_clean == ntp) {
> - rx_ring->next_to_clean =
> - rx_ring->next_to_process;
> + if (ntc != rx_ring->next_to_clean)
> cleaned_count++;
> - }
> continue;
> }
>
> @@ -2503,8 +2511,7 @@ static int i40e_clean_rx_irq(struct i40e_ring
> *rx_ring, int budget,
> break;
>
> i40e_trace(clean_rx_irq, rx_ring, rx_desc, xdp);
> - /* retrieve a buffer from the ring */
> - rx_buffer = i40e_get_rx_buffer(rx_ring, size);
> + i40e_prepare_rx_buffer(rx_ring, rx_buffer, size);
>
> neop = i40e_is_non_eop(rx_ring, rx_desc);
> i40e_inc_ntp(rx_ring);
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
> b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
> index e26807fd2123..21d9ed878bf0 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
> @@ -7,8 +7,9 @@
> #include "i40e.h"
>
> int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp, struct i40e_ring
> *xdp_ring); -void i40e_clean_programming_status(struct i40e_ring
> *rx_ring, u64 qword0_raw,
> - u64 qword1);
> +bool i40e_clean_programming_status(struct i40e_ring *rx_ring, u64
> qword0_raw,
> + u64 qword1, u16 *next_to_clean,
> + u16 *next_to_process);
Argument order differs between header and implementation (next_to_clean vs next_to_process).
This will compile but is misleading and dangerous for future maintenance.
Please fix argument order to match.
> void i40e_process_skb_fields(struct i40e_ring *rx_ring,
> union i40e_rx_desc *rx_desc, struct sk_buff
> *skb); void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring);
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> index 9f47388eaba5..f8accc266c2c 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> @@ -440,14 +440,13 @@ int i40e_clean_rx_irq_zc(struct i40e_ring
> *rx_ring, int budget)
> */
> dma_rmb();
>
> - if (i40e_rx_is_programming_status(qword)) {
> - i40e_clean_programming_status(rx_ring,
> - rx_desc->raw.qword[0],
> - qword);
> - bi = *i40e_rx_bi(rx_ring, next_to_process);
> + bi = *i40e_rx_bi(rx_ring, next_to_process);
> +
> + if (i40e_clean_programming_status(rx_ring,
> + rx_desc->raw.qword[0],
> + qword, &next_to_process,
> + &next_to_clean)) {
> xsk_buff_free(bi);
> - if (++next_to_process == count)
> - next_to_process = 0;
> continue;
> }
>
> @@ -455,7 +454,6 @@ int i40e_clean_rx_irq_zc(struct i40e_ring
> *rx_ring, int budget)
> if (!size)
> break;
>
> - bi = *i40e_rx_bi(rx_ring, next_to_process);
> xsk_buff_set_size(bi, size);
> xsk_buff_dma_sync_for_cpu(bi);
>
> --
> 2.43.0
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Intel-wired-lan] [PATCH net v4 1/1] i40e: xsk: advance next_to_clean on status descriptors
2025-11-18 12:13 ` Loktionov, Aleksandr
@ 2025-11-18 18:45 ` Maciej Fijalkowski
2025-11-19 0:49 ` Alessandro Decina
0 siblings, 1 reply; 7+ messages in thread
From: Maciej Fijalkowski @ 2025-11-18 18:45 UTC (permalink / raw)
To: Loktionov, Aleksandr
Cc: Alessandro Decina, netdev@vger.kernel.org, David S. Miller,
Alexei Starovoitov, Andrew Lunn, Daniel Borkmann, Eric Dumazet,
Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
Paolo Abeni, Kitszel, Przemyslaw, Stanislav Fomichev,
Sarkar, Tirthendu, Nguyen, Anthony L, bpf@vger.kernel.org,
intel-wired-lan@lists.osuosl.org, linux-kernel@vger.kernel.org
On Tue, Nov 18, 2025 at 01:13:33PM +0100, Loktionov, Aleksandr wrote:
>
>
> > -----Original Message-----
> > From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf
> > Of Alessandro Decina
> > Sent: Tuesday, November 18, 2025 12:31 PM
> > To: netdev@vger.kernel.org
> > Cc: Fijalkowski, Maciej <maciej.fijalkowski@intel.com>; David S.
> > Miller <davem@davemloft.net>; Alexei Starovoitov <ast@kernel.org>;
> > Andrew Lunn <andrew+netdev@lunn.ch>; Daniel Borkmann
> > <daniel@iogearbox.net>; Eric Dumazet <edumazet@google.com>; Jakub
> > Kicinski <kuba@kernel.org>; Jesper Dangaard Brouer <hawk@kernel.org>;
> > John Fastabend <john.fastabend@gmail.com>; Paolo Abeni
> > <pabeni@redhat.com>; Kitszel, Przemyslaw
> > <przemyslaw.kitszel@intel.com>; Stanislav Fomichev <sdf@fomichev.me>;
> > Sarkar, Tirthendu <tirthendu.sarkar@intel.com>; Nguyen, Anthony L
> > <anthony.l.nguyen@intel.com>; bpf@vger.kernel.org; intel-wired-
> > lan@lists.osuosl.org; linux-kernel@vger.kernel.org; Alessandro Decina
> > <alessandro.d@gmail.com>
> > Subject: [Intel-wired-lan] [PATCH net v4 1/1] i40e: xsk: advance
> > next_to_clean on status descriptors
> >
> > Whenever a status descriptor is received, i40e processes and skips
> > over it, correctly updating next_to_process but forgetting to update
> > next_to_clean. In the next iteration this accidentally causes the
> > creation of an invalid multi-buffer xdp_buff where the first fragment
> > is the status descriptor.
> >
> > If then a skb is constructed from such an invalid buffer - because the
> > eBPF program returns XDP_PASS - a panic occurs:
> >
> > [ 5866.367317] BUG: unable to handle page fault for address:
> > ffd31c37eab1c980 [ 5866.375050] #PF: supervisor read access in kernel
> > mode [ 5866.380825] #PF: error_code(0x0000) - not-present page [
> > 5866.386602] PGD 0 [ 5866.388867] Oops: Oops: 0000 [#1] SMP NOPTI [
> > 5866.393575] CPU: 34 UID: 0 PID: 0 Comm: swapper/34 Not tainted
> > 6.17.0-custom #1 PREEMPT(voluntary) [ 5866.403740] Hardware name:
> > Supermicro AS -2115GT-HNTR/H13SST-G, BIOS 3.2 03/20/2025 [
> > 5866.412339] RIP: 0010:memcpy+0x8/0x10 [ 5866.416454] Code: cc cc 90
> > cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 90 90 90 90 90 90 90 90
> > 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 d1 <f3> a4 e9 fc 26 c0 fe
> > 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 [ 5866.437538] RSP:
> > 0018:ff428d9ec0bb0ca8 EFLAGS: 00010286 [ 5866.443415] RAX:
> > ff2dd26dbd8f0000 RBX: ff2dd265ad161400 RCX: 00000000000004e1 [
> > 5866.451435] RDX: 00000000000004e1 RSI: ffd31c37eab1c980 RDI:
> > ff2dd26dbd8f0000 [ 5866.459454] RBP: ff428d9ec0bb0d40 R08:
> > 0000000000000000 R09: 0000000000000000 [ 5866.467470] R10:
> > 0000000000000000 R11: 0000000000000000 R12: ff428d9eec726ef8 [
> > 5866.475490] R13: ff2dd26dbd8f0000 R14: ff2dd265ca2f9fc0 R15:
> > ff2dd26548548b80 [ 5866.483509] FS: 0000000000000000(0000)
> > GS:ff2dd2c363592000(0000) knlGS:0000000000000000 [ 5866.492600] CS:
> > 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 5866.499060] CR2:
> > ffd31c37eab1c980 CR3: 0000000178d7b040 CR4: 0000000000f71ef0 [
> > 5866.507079] PKRU: 55555554 [ 5866.510125] Call Trace:
> > [ 5866.512867] <IRQ>
> > [ 5866.515132] ? i40e_clean_rx_irq_zc+0xc50/0xe60 [i40e] [
> > 5866.520921] i40e_napi_poll+0x2d8/0x1890 [i40e] [ 5866.526022] ?
> > srso_alias_return_thunk+0x5/0xfbef5
> > [ 5866.531408] ? raise_softirq+0x24/0x70 [ 5866.535623] ?
> > srso_alias_return_thunk+0x5/0xfbef5
> > [ 5866.541011] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 5866.546397] ? rcu_sched_clock_irq+0x225/0x1800 [ 5866.551493]
> > __napi_poll+0x30/0x230 [ 5866.555423] net_rx_action+0x20b/0x3f0 [
> > 5866.559643] handle_softirqs+0xe4/0x340 [ 5866.563962]
> > __irq_exit_rcu+0x10e/0x130 [ 5866.568283] irq_exit_rcu+0xe/0x20 [
> > 5866.572110] common_interrupt+0xb6/0xe0 [ 5866.576425] </IRQ> [
> > 5866.578791] <TASK>
Something is odd with your editor, it would be better to preserve the
formatting above.
> >
> Commit message lacks test description.
> Please add how you validated the fix (kernel version, NIC model, reproducer steps).
Repro steps would be nice to have, rest would be rather redundant to me.
>
> > Advance next_to_clean to ensure invalid xdp_buff(s) aren't created.
> >
> > Move the common logic to i40e_clean_programming_status and update both
> > i40e_clean_rx_irq and i40e_clean_rx_irq_zc accordingly.
> >
> > Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support")
> > Signed-off-by: Alessandro Decina <alessandro.d@gmail.com>
> > ---
> > drivers/net/ethernet/intel/i40e/i40e_txrx.c | 59 +++++++++++-------
> > -
> > .../ethernet/intel/i40e/i40e_txrx_common.h | 5 +-
> > drivers/net/ethernet/intel/i40e/i40e_xsk.c | 14 ++---
> > 3 files changed, 42 insertions(+), 36 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > index cc0b9efc2637..fe2190f4b9bc 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > @@ -1386,22 +1386,36 @@ static void i40e_reuse_rx_page(struct
> > i40e_ring *rx_ring,
> > * @rx_ring: the rx ring that has this descriptor
> > * @qword0_raw: qword0
> > * @qword1: qword1 representing status_error_len in CPU ordering
> > + * @next_to_process: pointer to next_to_process index
> > + * @next_to_clean: pointer to next_to_clean index
> > *
> > * Flow director should handle FD_FILTER_STATUS to check its filter
> > programming
> > * status being successful or not and take actions accordingly. FCoE
> > should
> > * handle its context/filter programming/invalidation status and take
> > actions.
> > *
> > - * Returns an i40e_rx_buffer to reuse if the cleanup occurred,
> > otherwise NULL.
> > + * Returns false if what is passed is not a status descriptor.
> > **/
> Please use kdoc Return: tag.
> Kernel-doc is incomplete, please also document what 'true' means (status descriptor processed).
>
> > -void i40e_clean_programming_status(struct i40e_ring *rx_ring, u64
> > qword0_raw,
> > - u64 qword1)
> > +bool i40e_clean_programming_status(struct i40e_ring *rx_ring, u64
> > qword0_raw,
> > + u64 qword1, u16 *next_to_process,
> > + u16 *next_to_clean)
> > {
> > + u16 ntp = *next_to_process;
> > u8 id;
> >
> > + if (!i40e_rx_is_programming_status(qword1))
> > + return false;
> > +
> > id = FIELD_GET(I40E_RX_PROG_STATUS_DESC_QW1_PROGID_MASK,
> > qword1);
> >
> > if (id == I40E_RX_PROG_STATUS_DESC_FD_FILTER_STATUS)
> > i40e_fd_handle_status(rx_ring, qword0_raw, qword1, id);
> > +
> > + if (++*next_to_process == rx_ring->count)
> > + *next_to_process = 0;
> Consider using u16_inc_wrap() instead of manual wrap.
I don't see anything like that in kernel. Did AI hallucinated for you or
is it something from OOT realm? Or am I missing something.
This is a method we use throughout the whole driver. Even if such wrapper
exists, I'd rather see a refactor patch that addresses all the manual
wrappings, not a single one.
Let's keep it as-is.
>
> > + if (ntp == *next_to_clean)
> > + *next_to_clean = *next_to_process;
> > +
> > + return true;
> > }
> >
> > /**
> > @@ -1971,19 +1985,18 @@ static void i40e_rx_buffer_flip(struct
> > i40e_rx_buffer *rx_buffer, }
> >
> > /**
> > - * i40e_get_rx_buffer - Fetch Rx buffer and synchronize data for use
> > + * i40e_prepare_rx_buffer - Synchronize the buffer for use by the CPU
> > * @rx_ring: rx descriptor ring to transact packets on
> > + * @rx_buffer: the rx buffer
> > * @size: size of buffer to add to skb
> > *
> > - * This function will pull an Rx buffer from the ring and synchronize
> > it
> > - * for use by the CPU.
> > + * This function will synchronize the given buffer for use by the
> > CPU.
> > */
> > -static struct i40e_rx_buffer *i40e_get_rx_buffer(struct i40e_ring
> > *rx_ring,
> > - const unsigned int size)
> > +static struct i40e_rx_buffer *
> > +i40e_prepare_rx_buffer(struct i40e_ring *rx_ring,
> > + struct i40e_rx_buffer *rx_buffer,
> > + const unsigned int size)
> > {
> > - struct i40e_rx_buffer *rx_buffer;
> > -
> > - rx_buffer = i40e_rx_bi(rx_ring, rx_ring->next_to_process);
> > rx_buffer->page_count =
> > #if (PAGE_SIZE < 8192)
> > page_count(rx_buffer->page);
> > @@ -2450,6 +2463,7 @@ static int i40e_clean_rx_irq(struct i40e_ring
> > *rx_ring, int budget,
> >
> > while (likely(total_rx_packets < (unsigned int)budget)) {
> > u16 ntp = rx_ring->next_to_process;
> > + u16 ntc = rx_ring->next_to_clean;
> > struct i40e_rx_buffer *rx_buffer;
> > union i40e_rx_desc *rx_desc;
> > struct sk_buff *skb;
> > @@ -2480,21 +2494,15 @@ static int i40e_clean_rx_irq(struct i40e_ring
> > *rx_ring, int budget,
> > */
> > dma_rmb();
> >
> > - if (i40e_rx_is_programming_status(qword)) {
> > - i40e_clean_programming_status(rx_ring,
> > - rx_desc->raw.qword[0],
> > - qword);
> > - rx_buffer = i40e_rx_bi(rx_ring, ntp);
> > - i40e_inc_ntp(rx_ring);
> > + rx_buffer = i40e_rx_bi(rx_ring, ntp);
> > +
> > + if (i40e_clean_programming_status(rx_ring,
> > + rx_desc->raw.qword[0],
> > qword,
> > + &rx_ring->next_to_process,
> > + &rx_ring->next_to_clean)) {
> Wrap the call to i40e_clean_programming_status to stay within 80 columns.
>
> > i40e_reuse_rx_page(rx_ring, rx_buffer);
> > - /* Update ntc and bump cleaned count if not in
> > the
> > - * middle of mb packet.
> > - */
> > - if (rx_ring->next_to_clean == ntp) {
> > - rx_ring->next_to_clean =
> > - rx_ring->next_to_process;
> > + if (ntc != rx_ring->next_to_clean)
> > cleaned_count++;
> > - }
> > continue;
> > }
> >
> > @@ -2503,8 +2511,7 @@ static int i40e_clean_rx_irq(struct i40e_ring
> > *rx_ring, int budget,
> > break;
> >
> > i40e_trace(clean_rx_irq, rx_ring, rx_desc, xdp);
> > - /* retrieve a buffer from the ring */
> > - rx_buffer = i40e_get_rx_buffer(rx_ring, size);
> > + i40e_prepare_rx_buffer(rx_ring, rx_buffer, size);
> >
> > neop = i40e_is_non_eop(rx_ring, rx_desc);
> > i40e_inc_ntp(rx_ring);
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
> > b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
> > index e26807fd2123..21d9ed878bf0 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
> > @@ -7,8 +7,9 @@
> > #include "i40e.h"
> >
> > int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp, struct i40e_ring
> > *xdp_ring); -void i40e_clean_programming_status(struct i40e_ring
> > *rx_ring, u64 qword0_raw,
> > - u64 qword1);
> > +bool i40e_clean_programming_status(struct i40e_ring *rx_ring, u64
> > qword0_raw,
> > + u64 qword1, u16 *next_to_clean,
> > + u16 *next_to_process);
> Argument order differs between header and implementation (next_to_clean vs next_to_process).
> This will compile but is misleading and dangerous for future maintenance.
> Please fix argument order to match.
Good catch
>
> > void i40e_process_skb_fields(struct i40e_ring *rx_ring,
> > union i40e_rx_desc *rx_desc, struct sk_buff
> > *skb); void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring);
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > index 9f47388eaba5..f8accc266c2c 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > @@ -440,14 +440,13 @@ int i40e_clean_rx_irq_zc(struct i40e_ring
> > *rx_ring, int budget)
> > */
> > dma_rmb();
> >
> > - if (i40e_rx_is_programming_status(qword)) {
> > - i40e_clean_programming_status(rx_ring,
> > - rx_desc->raw.qword[0],
> > - qword);
> > - bi = *i40e_rx_bi(rx_ring, next_to_process);
> > + bi = *i40e_rx_bi(rx_ring, next_to_process);
> > +
> > + if (i40e_clean_programming_status(rx_ring,
> > + rx_desc->raw.qword[0],
> > + qword, &next_to_process,
> > + &next_to_clean)) {
> > xsk_buff_free(bi);
> > - if (++next_to_process == count)
> > - next_to_process = 0;
> > continue;
> > }
> >
> > @@ -455,7 +454,6 @@ int i40e_clean_rx_irq_zc(struct i40e_ring
> > *rx_ring, int budget)
> > if (!size)
> > break;
> >
> > - bi = *i40e_rx_bi(rx_ring, next_to_process);
> > xsk_buff_set_size(bi, size);
> > xsk_buff_dma_sync_for_cpu(bi);
> >
> > --
> > 2.43.0
>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Intel-wired-lan] [PATCH net v4 1/1] i40e: xsk: advance next_to_clean on status descriptors
2025-11-18 18:45 ` Maciej Fijalkowski
@ 2025-11-19 0:49 ` Alessandro Decina
2025-11-19 16:11 ` Maciej Fijalkowski
0 siblings, 1 reply; 7+ messages in thread
From: Alessandro Decina @ 2025-11-19 0:49 UTC (permalink / raw)
To: Maciej Fijalkowski
Cc: Loktionov, Aleksandr, netdev@vger.kernel.org, David S. Miller,
Alexei Starovoitov, Andrew Lunn, Daniel Borkmann, Eric Dumazet,
Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
Paolo Abeni, Kitszel, Przemyslaw, Stanislav Fomichev,
Sarkar, Tirthendu, Nguyen, Anthony L, bpf@vger.kernel.org,
intel-wired-lan@lists.osuosl.org, linux-kernel@vger.kernel.org
On Tue, Nov 18, 2025 at 07:45:24PM +0100, Maciej Fijalkowski wrote:
> Repro steps would be nice to have, rest would be rather redundant to me.
Yeah unfortunately I don't really know how to _manually_ reproduce.
I don't know why I'm getting these status descriptors, but I'm getting
them reproducibly every few minutes on ubuntu 24.04 across 3 machines
where I've hit this bug/tested the fix. The machines are doing ~300Mbps
of UDP traffic, some of which is done using AF_XDP. The AF_XDP code is
TX only, so it's executing the build-skb-in-zc-path all the time as all
the ingress traffic goes to sockets.
If you have any idea on how to reliably produce the descriptors I'm
happy to give it a try.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Intel-wired-lan] [PATCH net v4 1/1] i40e: xsk: advance next_to_clean on status descriptors
2025-11-19 0:49 ` Alessandro Decina
@ 2025-11-19 16:11 ` Maciej Fijalkowski
0 siblings, 0 replies; 7+ messages in thread
From: Maciej Fijalkowski @ 2025-11-19 16:11 UTC (permalink / raw)
To: Alessandro Decina
Cc: Loktionov, Aleksandr, netdev@vger.kernel.org, David S. Miller,
Alexei Starovoitov, Andrew Lunn, Daniel Borkmann, Eric Dumazet,
Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
Paolo Abeni, Kitszel, Przemyslaw, Stanislav Fomichev,
Sarkar, Tirthendu, Nguyen, Anthony L, bpf@vger.kernel.org,
intel-wired-lan@lists.osuosl.org, linux-kernel@vger.kernel.org
On Wed, Nov 19, 2025 at 11:49:44AM +1100, Alessandro Decina wrote:
> On Tue, Nov 18, 2025 at 07:45:24PM +0100, Maciej Fijalkowski wrote:
> > Repro steps would be nice to have, rest would be rather redundant to me.
>
> Yeah unfortunately I don't really know how to _manually_ reproduce.
>
> I don't know why I'm getting these status descriptors, but I'm getting
> them reproducibly every few minutes on ubuntu 24.04 across 3 machines
Might be a HW quirk, I don't remember. Other example I have handy is HW
that is served by ice driver can sometimes produce 0-length descriptors
for jumbo frames and we do handle this in driver.
Describing your setup in commit message would be enough, if you ask me.
> where I've hit this bug/tested the fix. The machines are doing ~300Mbps
> of UDP traffic, some of which is done using AF_XDP. The AF_XDP code is
> TX only, so it's executing the build-skb-in-zc-path all the time as all
> the ingress traffic goes to sockets.
>
> If you have any idea on how to reliably produce the descriptors I'm
> happy to give it a try.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Intel-wired-lan] [PATCH net v4 1/1] i40e: xsk: advance next_to_clean on status descriptors
2025-11-18 11:31 ` [Intel-wired-lan] [PATCH net v4 1/1] " Alessandro Decina
2025-11-18 12:13 ` Loktionov, Aleksandr
@ 2025-11-20 15:41 ` Loktionov, Aleksandr
1 sibling, 0 replies; 7+ messages in thread
From: Loktionov, Aleksandr @ 2025-11-20 15:41 UTC (permalink / raw)
To: Alessandro Decina, netdev@vger.kernel.org
Cc: Fijalkowski, Maciej, David S. Miller, Alexei Starovoitov,
Andrew Lunn, Daniel Borkmann, Eric Dumazet, Jakub Kicinski,
Jesper Dangaard Brouer, John Fastabend, Paolo Abeni,
Kitszel, Przemyslaw, Stanislav Fomichev, Sarkar, Tirthendu,
Nguyen, Anthony L, bpf@vger.kernel.org,
intel-wired-lan@lists.osuosl.org, linux-kernel@vger.kernel.org
> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf
> Of Alessandro Decina
> Sent: Tuesday, November 18, 2025 12:31 PM
> To: netdev@vger.kernel.org
> Cc: Fijalkowski, Maciej <maciej.fijalkowski@intel.com>; David S.
> Miller <davem@davemloft.net>; Alexei Starovoitov <ast@kernel.org>;
> Andrew Lunn <andrew+netdev@lunn.ch>; Daniel Borkmann
> <daniel@iogearbox.net>; Eric Dumazet <edumazet@google.com>; Jakub
> Kicinski <kuba@kernel.org>; Jesper Dangaard Brouer <hawk@kernel.org>;
> John Fastabend <john.fastabend@gmail.com>; Paolo Abeni
> <pabeni@redhat.com>; Kitszel, Przemyslaw
> <przemyslaw.kitszel@intel.com>; Stanislav Fomichev <sdf@fomichev.me>;
> Sarkar, Tirthendu <tirthendu.sarkar@intel.com>; Nguyen, Anthony L
> <anthony.l.nguyen@intel.com>; bpf@vger.kernel.org; intel-wired-
> lan@lists.osuosl.org; linux-kernel@vger.kernel.org; Alessandro Decina
> <alessandro.d@gmail.com>
> Subject: [Intel-wired-lan] [PATCH net v4 1/1] i40e: xsk: advance
> next_to_clean on status descriptors
>
> Whenever a status descriptor is received, i40e processes and skips
> over it, correctly updating next_to_process but forgetting to update
> next_to_clean. In the next iteration this accidentally causes the
> creation of an invalid multi-buffer xdp_buff where the first fragment
> is the status descriptor.
>
> If then a skb is constructed from such an invalid buffer - because the
> eBPF program returns XDP_PASS - a panic occurs:
>
> [ 5866.367317] BUG: unable to handle page fault for address:
> ffd31c37eab1c980 [ 5866.375050] #PF: supervisor read access in kernel
> mode [ 5866.380825] #PF: error_code(0x0000) - not-present page [
> 5866.386602] PGD 0 [ 5866.388867] Oops: Oops: 0000 [#1] SMP NOPTI [
> 5866.393575] CPU: 34 UID: 0 PID: 0 Comm: swapper/34 Not tainted
> 6.17.0-custom #1 PREEMPT(voluntary) [ 5866.403740] Hardware name:
> Supermicro AS -2115GT-HNTR/H13SST-G, BIOS 3.2 03/20/2025 [
> 5866.412339] RIP: 0010:memcpy+0x8/0x10 [ 5866.416454] Code: cc cc 90
> cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 90 90 90 90 90 90 90 90
> 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 d1 <f3> a4 e9 fc 26 c0 fe
> 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 [ 5866.437538] RSP:
> 0018:ff428d9ec0bb0ca8 EFLAGS: 00010286 [ 5866.443415] RAX:
> ff2dd26dbd8f0000 RBX: ff2dd265ad161400 RCX: 00000000000004e1 [
> 5866.451435] RDX: 00000000000004e1 RSI: ffd31c37eab1c980 RDI:
> ff2dd26dbd8f0000 [ 5866.459454] RBP: ff428d9ec0bb0d40 R08:
> 0000000000000000 R09: 0000000000000000 [ 5866.467470] R10:
> 0000000000000000 R11: 0000000000000000 R12: ff428d9eec726ef8 [
> 5866.475490] R13: ff2dd26dbd8f0000 R14: ff2dd265ca2f9fc0 R15:
> ff2dd26548548b80 [ 5866.483509] FS: 0000000000000000(0000)
> GS:ff2dd2c363592000(0000) knlGS:0000000000000000 [ 5866.492600] CS:
> 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 5866.499060] CR2:
> ffd31c37eab1c980 CR3: 0000000178d7b040 CR4: 0000000000f71ef0 [
> 5866.507079] PKRU: 55555554 [ 5866.510125] Call Trace:
> [ 5866.512867] <IRQ>
> [ 5866.515132] ? i40e_clean_rx_irq_zc+0xc50/0xe60 [i40e] [
> 5866.520921] i40e_napi_poll+0x2d8/0x1890 [i40e] [ 5866.526022] ?
> srso_alias_return_thunk+0x5/0xfbef5
> [ 5866.531408] ? raise_softirq+0x24/0x70 [ 5866.535623] ?
> srso_alias_return_thunk+0x5/0xfbef5
> [ 5866.541011] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 5866.546397] ? rcu_sched_clock_irq+0x225/0x1800 [ 5866.551493]
> __napi_poll+0x30/0x230 [ 5866.555423] net_rx_action+0x20b/0x3f0 [
> 5866.559643] handle_softirqs+0xe4/0x340 [ 5866.563962]
> __irq_exit_rcu+0x10e/0x130 [ 5866.568283] irq_exit_rcu+0xe/0x20 [
> 5866.572110] common_interrupt+0xb6/0xe0 [ 5866.576425] </IRQ> [
> 5866.578791] <TASK>
>
> Advance next_to_clean to ensure invalid xdp_buff(s) aren't created.
>
> Move the common logic to i40e_clean_programming_status and update both
> i40e_clean_rx_irq and i40e_clean_rx_irq_zc accordingly.
>
> Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support")
> Signed-off-by: Alessandro Decina <alessandro.d@gmail.com>
> ---
> drivers/net/ethernet/intel/i40e/i40e_txrx.c | 59 +++++++++++-------
...
> */
> -static struct i40e_rx_buffer *i40e_get_rx_buffer(struct i40e_ring
> *rx_ring,
> - const unsigned int size)
> +static struct i40e_rx_buffer *
> +i40e_prepare_rx_buffer(struct i40e_ring *rx_ring,
> + struct i40e_rx_buffer *rx_buffer,
> + const unsigned int size)
Now i40e_prepare_rx_buffer() is used solely for side effect, so returning struct i40e_rx_buffer * could be misleading.
Can you consider to make it void?
> {
> - struct i40e_rx_buffer *rx_buffer;
> -
> - rx_buffer = i40e_rx_bi(rx_ring, rx_ring->next_to_process);
> rx_buffer->page_count =
> #if (PAGE_SIZE < 8192)
> page_count(rx_buffer->page);
> @@ -2450,6 +2463,7 @@ static int i40e_clean_rx_irq(struct i40e_ring
> *rx_ring, int budget,
>
> while (likely(total_rx_packets < (unsigned int)budget)) {
> u16 ntp = rx_ring->next_to_process;
> + u16 ntc = rx_ring->next_to_clean;
> struct i40e_rx_buffer *rx_buffer;
> union i40e_rx_desc *rx_desc;
> struct sk_buff *skb;
> @@ -2480,21 +2494,15 @@ static int i40e_clean_rx_irq(struct i40e_ring
...
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2025-11-20 15:42 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-18 11:31 [Intel-wired-lan] [PATCH net v4 0/1] i40e: xsk: advance next_to_clean on status descriptors Alessandro Decina
2025-11-18 11:31 ` [Intel-wired-lan] [PATCH net v4 1/1] " Alessandro Decina
2025-11-18 12:13 ` Loktionov, Aleksandr
2025-11-18 18:45 ` Maciej Fijalkowski
2025-11-19 0:49 ` Alessandro Decina
2025-11-19 16:11 ` Maciej Fijalkowski
2025-11-20 15:41 ` Loktionov, Aleksandr
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).