* [PATCH net 0/6][pull request] Intel Wired LAN Driver Updates 2025-08-15 (ice, ixgbe, igc)
@ 2025-08-15 20:41 Tony Nguyen
2025-08-15 20:41 ` [PATCH net 1/6] ice: fix NULL pointer dereference in ice_unplug_aux_dev() on reset Tony Nguyen
` (5 more replies)
0 siblings, 6 replies; 13+ messages in thread
From: Tony Nguyen @ 2025-08-15 20:41 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, andrew+netdev, netdev; +Cc: Tony Nguyen
For ice:
Emil adds a check to ensure the auxiliary device was created before
tear down, to prevent a NULL pointer dereference, and adds an unwind
error path to auxiliary device creation to stop a possible memory leak.
Jake fixes handling when hardware returns a zero-sized descriptor, which
can cause pages to be leaked.
For ixgbe:
Jason Xing corrects a condition in which an improper decrement can cause
the budget value to underflow.
Maciej extends the down states in which XDP cannot transmit and excludes
XDP rings from Tx hang checks.
For igc:
ValdikSS moves the setting of hardware device information earlier to
allow a proper check of the device ID.
The following are changes since commit 065c31f2c6915b38f45b1c817b31f41f62eaa774:
rtase: Fix Rx descriptor CRC error bit definition
and are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue 100GbE
Emil Tantilov (2):
ice: fix NULL pointer dereference in ice_unplug_aux_dev() on reset
ice: fix possible leak in ice_plug_aux_dev() error path
Jacob Keller (1):
ice: fix Rx page leak on multi-buffer frames
Jason Xing (1):
ixgbe: xsk: resolve the negative overflow of budget in ixgbe_xmit_zc
Maciej Fijalkowski (1):
ixgbe: fix ndo_xdp_xmit() workloads
ValdikSS (1):
igc: fix disabling L1.2 PCI-E link substate on I226 on init
drivers/net/ethernet/intel/ice/ice.h | 1 +
drivers/net/ethernet/intel/ice/ice_idc.c | 29 ++++---
drivers/net/ethernet/intel/ice/ice_txrx.c | 81 ++++++++-----------
drivers/net/ethernet/intel/ice/ice_txrx.h | 1 -
drivers/net/ethernet/intel/igc/igc_main.c | 14 ++--
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 34 +++-----
drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 4 +-
7 files changed, 72 insertions(+), 92 deletions(-)
--
2.47.1
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH net 1/6] ice: fix NULL pointer dereference in ice_unplug_aux_dev() on reset
2025-08-15 20:41 [PATCH net 0/6][pull request] Intel Wired LAN Driver Updates 2025-08-15 (ice, ixgbe, igc) Tony Nguyen
@ 2025-08-15 20:41 ` Tony Nguyen
2025-08-15 20:41 ` [PATCH net 2/6] ice: fix possible leak in ice_plug_aux_dev() error path Tony Nguyen
` (4 subsequent siblings)
5 siblings, 0 replies; 13+ messages in thread
From: Tony Nguyen @ 2025-08-15 20:41 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, andrew+netdev, netdev
Cc: Emil Tantilov, anthony.l.nguyen, david.m.ertman,
tatyana.e.nikolova, Przemek Kitszel
From: Emil Tantilov <emil.s.tantilov@intel.com>
Issuing a reset when the driver is loaded without RDMA support will
result in a crash as it attempts to remove RDMA's non-existent auxbus
device:
echo 1 > /sys/class/net/<if>/device/reset
BUG: kernel NULL pointer dereference, address: 0000000000000008
...
RIP: 0010:ice_unplug_aux_dev+0x29/0x70 [ice]
...
Call Trace:
<TASK>
ice_prepare_for_reset+0x77/0x260 [ice]
pci_dev_save_and_disable+0x2c/0x70
pci_reset_function+0x88/0x130
reset_store+0x5a/0xa0
kernfs_fop_write_iter+0x15e/0x210
vfs_write+0x273/0x520
ksys_write+0x6b/0xe0
do_syscall_64+0x79/0x3b0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
ice_unplug_aux_dev() checks pf->cdev_info->adev for NULL pointer, but
pf->cdev_info will also be NULL, leading to the deref in the trace above.
Introduce a flag to be set when the creation of the auxbus device is
successful, to avoid multiple NULL pointer checks in ice_unplug_aux_dev().
Fixes: c24a65b6a27c7 ("iidc/ice/irdma: Update IDC to support multiple consumers")
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
drivers/net/ethernet/intel/ice/ice.h | 1 +
drivers/net/ethernet/intel/ice/ice_idc.c | 10 ++++++----
2 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index 2098f00b3cd3..8a8a01a4bb40 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -510,6 +510,7 @@ enum ice_pf_flags {
ICE_FLAG_LINK_LENIENT_MODE_ENA,
ICE_FLAG_PLUG_AUX_DEV,
ICE_FLAG_UNPLUG_AUX_DEV,
+ ICE_FLAG_AUX_DEV_CREATED,
ICE_FLAG_MTU_CHANGED,
ICE_FLAG_GNSS, /* GNSS successfully initialized */
ICE_FLAG_DPLL, /* SyncE/PTP dplls initialized */
diff --git a/drivers/net/ethernet/intel/ice/ice_idc.c b/drivers/net/ethernet/intel/ice/ice_idc.c
index 6ab53e430f91..420d45c2558b 100644
--- a/drivers/net/ethernet/intel/ice/ice_idc.c
+++ b/drivers/net/ethernet/intel/ice/ice_idc.c
@@ -336,6 +336,7 @@ int ice_plug_aux_dev(struct ice_pf *pf)
mutex_lock(&pf->adev_mutex);
cdev->adev = adev;
mutex_unlock(&pf->adev_mutex);
+ set_bit(ICE_FLAG_AUX_DEV_CREATED, pf->flags);
return 0;
}
@@ -347,15 +348,16 @@ void ice_unplug_aux_dev(struct ice_pf *pf)
{
struct auxiliary_device *adev;
+ if (!test_and_clear_bit(ICE_FLAG_AUX_DEV_CREATED, pf->flags))
+ return;
+
mutex_lock(&pf->adev_mutex);
adev = pf->cdev_info->adev;
pf->cdev_info->adev = NULL;
mutex_unlock(&pf->adev_mutex);
- if (adev) {
- auxiliary_device_delete(adev);
- auxiliary_device_uninit(adev);
- }
+ auxiliary_device_delete(adev);
+ auxiliary_device_uninit(adev);
}
/**
--
2.47.1
* [PATCH net 2/6] ice: fix possible leak in ice_plug_aux_dev() error path
2025-08-15 20:41 [PATCH net 0/6][pull request] Intel Wired LAN Driver Updates 2025-08-15 (ice, ixgbe, igc) Tony Nguyen
2025-08-15 20:41 ` [PATCH net 1/6] ice: fix NULL pointer dereference in ice_unplug_aux_dev() on reset Tony Nguyen
@ 2025-08-15 20:41 ` Tony Nguyen
2025-08-15 20:41 ` [PATCH net 3/6] ice: fix Rx page leak on multi-buffer frames Tony Nguyen
` (3 subsequent siblings)
5 siblings, 0 replies; 13+ messages in thread
From: Tony Nguyen @ 2025-08-15 20:41 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, andrew+netdev, netdev
Cc: Emil Tantilov, anthony.l.nguyen, david.m.ertman,
tatyana.e.nikolova, Przemek Kitszel, Aleksandr Loktionov
From: Emil Tantilov <emil.s.tantilov@intel.com>
Fix a memory leak in the error path where kfree(iadev) was not called
following an error in auxiliary_device_add().
Fixes: f9f5301e7e2d ("ice: Register auxiliary device to provide RDMA")
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
drivers/net/ethernet/intel/ice/ice_idc.c | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_idc.c b/drivers/net/ethernet/intel/ice/ice_idc.c
index 420d45c2558b..8c4a3dc22a7c 100644
--- a/drivers/net/ethernet/intel/ice/ice_idc.c
+++ b/drivers/net/ethernet/intel/ice/ice_idc.c
@@ -322,16 +322,12 @@ int ice_plug_aux_dev(struct ice_pf *pf)
"roce" : "iwarp";
ret = auxiliary_device_init(adev);
- if (ret) {
- kfree(iadev);
- return ret;
- }
+ if (ret)
+ goto free_iadev;
ret = auxiliary_device_add(adev);
- if (ret) {
- auxiliary_device_uninit(adev);
- return ret;
- }
+ if (ret)
+ goto aux_dev_uninit;
mutex_lock(&pf->adev_mutex);
cdev->adev = adev;
@@ -339,6 +335,13 @@ int ice_plug_aux_dev(struct ice_pf *pf)
set_bit(ICE_FLAG_AUX_DEV_CREATED, pf->flags);
return 0;
+
+aux_dev_uninit:
+ auxiliary_device_uninit(adev);
+free_iadev:
+ kfree(iadev);
+
+ return ret;
}
/* ice_unplug_aux_dev - unregister and free AUX device
--
2.47.1
* [PATCH net 3/6] ice: fix Rx page leak on multi-buffer frames
2025-08-15 20:41 [PATCH net 0/6][pull request] Intel Wired LAN Driver Updates 2025-08-15 (ice, ixgbe, igc) Tony Nguyen
2025-08-15 20:41 ` [PATCH net 1/6] ice: fix NULL pointer dereference in ice_unplug_aux_dev() on reset Tony Nguyen
2025-08-15 20:41 ` [PATCH net 2/6] ice: fix possible leak in ice_plug_aux_dev() error path Tony Nguyen
@ 2025-08-15 20:41 ` Tony Nguyen
2025-08-18 11:05 ` Jesper Dangaard Brouer
2025-08-15 20:42 ` [PATCH net 4/6] ixgbe: xsk: resolve the negative overflow of budget in ixgbe_xmit_zc Tony Nguyen
` (2 subsequent siblings)
5 siblings, 1 reply; 13+ messages in thread
From: Tony Nguyen @ 2025-08-15 20:41 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, andrew+netdev, netdev
Cc: Jacob Keller, anthony.l.nguyen, maciej.fijalkowski,
magnus.karlsson, ast, daniel, hawk, john.fastabend, sdf, bpf,
horms, przemyslaw.kitszel, aleksander.lobakin, jaroslav.pulchart,
jdamato, christoph.petrausch, Rinitha S, Priya Singh
From: Jacob Keller <jacob.e.keller@intel.com>
The ice_put_rx_mbuf() function handles calling ice_put_rx_buf() for each
buffer in the current frame. This function was introduced as part of
handling multi-buffer XDP support in the ice driver.
It works by iterating over the buffers from first_desc up to 1 plus the
total number of fragments in the frame, cached from before the XDP program
was executed.
If the hardware posts a descriptor with a size of 0, the logic used in
ice_put_rx_mbuf() breaks. Such descriptors get skipped and don't get added
as fragments in ice_add_xdp_frag. Since the buffer isn't counted as a
fragment, we do not iterate over it in ice_put_rx_mbuf(), and thus we don't
call ice_put_rx_buf().
Because we don't call ice_put_rx_buf(), we don't attempt to re-use the
page or free it. This leaves a stale page in the ring, as we don't
increment next_to_alloc.
ice_reuse_rx_page() assumes that next_to_alloc has been incremented
properly, and that it always points to a buffer with a NULL page. Since
this function doesn't check, it will happily recycle a page over the top
of the next_to_alloc buffer, losing track of the old page.
Note that this leak only occurs for multi-buffer frames. The
ice_put_rx_mbuf() function always handles at least one buffer, so a
single-buffer frame will always get handled correctly. It is not clear
precisely why the hardware hands us descriptors with a size of 0 sometimes,
but it happens somewhat regularly with "jumbo frames" used by 9K MTU.
To fix ice_put_rx_mbuf(), we need to make sure to call ice_put_rx_buf() on
all buffers between first_desc and next_to_clean. Borrow the logic of a
similar function in i40e used for this same purpose. Use the same logic
also in ice_get_pgcnts().
Instead of iterating over just the number of fragments, use a loop which
iterates until the current index reaches to the next_to_clean element just
past the current frame. Check the current number of fragments (post XDP
program). For all buffers up to 1 more than the number of fragments, we'll
update the pagecnt_bias. For any buffers past this, pagecnt_bias is left
as-is. This ensures that fragments released by the XDP program, as well as
any buffers with zero-size won't have their pagecnt_bias updated
incorrectly. Unlike i40e, the ice_put_rx_mbuf() function does call
ice_put_rx_buf() on the last buffer of the frame indicating end of packet.
The xdp_xmit value only needs to be updated if an XDP program is run, and
only once per packet. Drop the xdp_xmit pointer argument from
ice_put_rx_mbuf(). Instead, set xdp_xmit in the ice_clean_rx_irq() function
directly. This avoids needing to pass the argument and avoids an extra
bit-wise OR for each buffer in the frame.
Move the increment of the ntc local variable to ensure it is updated *before*
all calls to ice_get_pgcnts() or ice_put_rx_mbuf(), as the loop logic
requires the index of the element just after the current frame.
This has the advantage that we also no longer need to track or cache the
number of fragments in the rx_ring, which saves a few bytes in the ring.
Cc: Christoph Petrausch <christoph.petrausch@deepl.com>
Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Closes: https://lore.kernel.org/netdev/CAK8fFZ4hY6GUJNENz3wY9jaYLZXGfpr7dnZxzGMYoE44caRbgw@mail.gmail.com/
Fixes: 743bbd93cf29 ("ice: put Rx buffers after being done with current frame")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Priya Singh <priyax.singh@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
drivers/net/ethernet/intel/ice/ice_txrx.c | 81 +++++++++--------------
drivers/net/ethernet/intel/ice/ice_txrx.h | 1 -
2 files changed, 33 insertions(+), 49 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 29e0088ab6b2..93907ab2eac7 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -894,10 +894,6 @@ ice_add_xdp_frag(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
__skb_fill_page_desc_noacc(sinfo, sinfo->nr_frags++, rx_buf->page,
rx_buf->page_offset, size);
sinfo->xdp_frags_size += size;
- /* remember frag count before XDP prog execution; bpf_xdp_adjust_tail()
- * can pop off frags but driver has to handle it on its own
- */
- rx_ring->nr_frags = sinfo->nr_frags;
if (page_is_pfmemalloc(rx_buf->page))
xdp_buff_set_frag_pfmemalloc(xdp);
@@ -968,20 +964,20 @@ ice_get_rx_buf(struct ice_rx_ring *rx_ring, const unsigned int size,
/**
* ice_get_pgcnts - grab page_count() for gathered fragments
* @rx_ring: Rx descriptor ring to store the page counts on
+ * @ntc: the next to clean element (not included in this frame!)
*
* This function is intended to be called right before running XDP
* program so that the page recycling mechanism will be able to take
* a correct decision regarding underlying pages; this is done in such
* way as XDP program can change the refcount of page
*/
-static void ice_get_pgcnts(struct ice_rx_ring *rx_ring)
+static void ice_get_pgcnts(struct ice_rx_ring *rx_ring, unsigned int ntc)
{
- u32 nr_frags = rx_ring->nr_frags + 1;
u32 idx = rx_ring->first_desc;
struct ice_rx_buf *rx_buf;
u32 cnt = rx_ring->count;
- for (int i = 0; i < nr_frags; i++) {
+ while (idx != ntc) {
rx_buf = &rx_ring->rx_buf[idx];
rx_buf->pgcnt = page_count(rx_buf->page);
@@ -1154,62 +1150,48 @@ ice_put_rx_buf(struct ice_rx_ring *rx_ring, struct ice_rx_buf *rx_buf)
}
/**
- * ice_put_rx_mbuf - ice_put_rx_buf() caller, for all frame frags
+ * ice_put_rx_mbuf - ice_put_rx_buf() caller, for all buffers in frame
* @rx_ring: Rx ring with all the auxiliary data
* @xdp: XDP buffer carrying linear + frags part
- * @xdp_xmit: XDP_TX/XDP_REDIRECT verdict storage
- * @ntc: a current next_to_clean value to be stored at rx_ring
+ * @ntc: the next to clean element (not included in this frame!)
* @verdict: return code from XDP program execution
*
- * Walk through gathered fragments and satisfy internal page
- * recycle mechanism; we take here an action related to verdict
- * returned by XDP program;
+ * Called after XDP program is completed, or on error with verdict set to
+ * ICE_XDP_CONSUMED.
+ *
+ * Walk through buffers from first_desc to the end of the frame, releasing
+ * buffers and satisfying internal page recycle mechanism. The action depends
+ * on verdict from XDP program.
*/
static void ice_put_rx_mbuf(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
- u32 *xdp_xmit, u32 ntc, u32 verdict)
+ u32 ntc, u32 verdict)
{
- u32 nr_frags = rx_ring->nr_frags + 1;
+ u32 nr_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
u32 idx = rx_ring->first_desc;
u32 cnt = rx_ring->count;
- u32 post_xdp_frags = 1;
struct ice_rx_buf *buf;
- int i;
-
- if (unlikely(xdp_buff_has_frags(xdp)))
- post_xdp_frags += xdp_get_shared_info_from_buff(xdp)->nr_frags;
+ int i = 0;
- for (i = 0; i < post_xdp_frags; i++) {
+ while (idx != ntc) {
buf = &rx_ring->rx_buf[idx];
+ if (++idx == cnt)
+ idx = 0;
- if (verdict & (ICE_XDP_TX | ICE_XDP_REDIR)) {
+ /* An XDP program could release fragments from the end of the
+ * buffer. For these, we need to keep the pagecnt_bias as-is.
+ * To do this, only adjust pagecnt_bias for fragments up to
+ * the total remaining after the XDP program has run.
+ */
+ if (verdict != ICE_XDP_CONSUMED)
ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
- *xdp_xmit |= verdict;
- } else if (verdict & ICE_XDP_CONSUMED) {
+ else if (i++ <= nr_frags)
buf->pagecnt_bias++;
- } else if (verdict == ICE_XDP_PASS) {
- ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
- }
ice_put_rx_buf(rx_ring, buf);
-
- if (++idx == cnt)
- idx = 0;
- }
- /* handle buffers that represented frags released by XDP prog;
- * for these we keep pagecnt_bias as-is; refcount from struct page
- * has been decremented within XDP prog and we do not have to increase
- * the biased refcnt
- */
- for (; i < nr_frags; i++) {
- buf = &rx_ring->rx_buf[idx];
- ice_put_rx_buf(rx_ring, buf);
- if (++idx == cnt)
- idx = 0;
}
xdp->data = NULL;
rx_ring->first_desc = ntc;
- rx_ring->nr_frags = 0;
}
/**
@@ -1317,6 +1299,10 @@ static int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
/* retrieve a buffer from the ring */
rx_buf = ice_get_rx_buf(rx_ring, size, ntc);
+ /* Increment ntc before calls to ice_put_rx_mbuf() */
+ if (++ntc == cnt)
+ ntc = 0;
+
if (!xdp->data) {
void *hard_start;
@@ -1325,24 +1311,23 @@ static int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
xdp_prepare_buff(xdp, hard_start, offset, size, !!offset);
xdp_buff_clear_frags_flag(xdp);
} else if (ice_add_xdp_frag(rx_ring, xdp, rx_buf, size)) {
- ice_put_rx_mbuf(rx_ring, xdp, NULL, ntc, ICE_XDP_CONSUMED);
+ ice_put_rx_mbuf(rx_ring, xdp, ntc, ICE_XDP_CONSUMED);
break;
}
- if (++ntc == cnt)
- ntc = 0;
/* skip if it is NOP desc */
if (ice_is_non_eop(rx_ring, rx_desc))
continue;
- ice_get_pgcnts(rx_ring);
+ ice_get_pgcnts(rx_ring, ntc);
xdp_verdict = ice_run_xdp(rx_ring, xdp, xdp_prog, xdp_ring, rx_desc);
if (xdp_verdict == ICE_XDP_PASS)
goto construct_skb;
total_rx_bytes += xdp_get_buff_len(xdp);
total_rx_pkts++;
- ice_put_rx_mbuf(rx_ring, xdp, &xdp_xmit, ntc, xdp_verdict);
+ ice_put_rx_mbuf(rx_ring, xdp, ntc, xdp_verdict);
+ xdp_xmit |= xdp_verdict & (ICE_XDP_TX | ICE_XDP_REDIR);
continue;
construct_skb:
@@ -1355,7 +1340,7 @@ static int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
rx_ring->ring_stats->rx_stats.alloc_page_failed++;
xdp_verdict = ICE_XDP_CONSUMED;
}
- ice_put_rx_mbuf(rx_ring, xdp, &xdp_xmit, ntc, xdp_verdict);
+ ice_put_rx_mbuf(rx_ring, xdp, ntc, xdp_verdict);
if (!skb)
break;
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index fef750c5f288..2fd8e78178a2 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -358,7 +358,6 @@ struct ice_rx_ring {
struct ice_tx_ring *xdp_ring;
struct ice_rx_ring *next; /* pointer to next ring in q_vector */
struct xsk_buff_pool *xsk_pool;
- u32 nr_frags;
u16 max_frame;
u16 rx_buf_len;
dma_addr_t dma; /* physical address of ring */
--
2.47.1
* [PATCH net 4/6] ixgbe: xsk: resolve the negative overflow of budget in ixgbe_xmit_zc
2025-08-15 20:41 [PATCH net 0/6][pull request] Intel Wired LAN Driver Updates 2025-08-15 (ice, ixgbe, igc) Tony Nguyen
` (2 preceding siblings ...)
2025-08-15 20:41 ` [PATCH net 3/6] ice: fix Rx page leak on multi-buffer frames Tony Nguyen
@ 2025-08-15 20:42 ` Tony Nguyen
2025-08-15 20:42 ` [PATCH net 5/6] ixgbe: fix ndo_xdp_xmit() workloads Tony Nguyen
2025-08-15 20:42 ` [PATCH net 6/6] igc: fix disabling L1.2 PCI-E link substate on I226 on init Tony Nguyen
5 siblings, 0 replies; 13+ messages in thread
From: Tony Nguyen @ 2025-08-15 20:42 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, andrew+netdev, netdev
Cc: Jason Xing, anthony.l.nguyen, maciej.fijalkowski, magnus.karlsson,
ast, daniel, hawk, john.fastabend, sdf, bpf, bjorn,
przemyslaw.kitszel, larysa.zaremba, Paul Menzel,
Aleksandr Loktionov
From: Jason Xing <kernelxing@tencent.com>
Resolve the budget negative overflow which leads to returning true in
ixgbe_xmit_zc() even when the budget of descriptors is fully consumed.
Before this patch, when the budget is decremented to zero after sending
the last allowed descriptor in ixgbe_xmit_zc(), the loop turns back and
evaluates the while() condition once more to see if it should keep
processing packets, but in doing so it unexpectedly decrements the
unsigned value again, from 0 to UINT_MAX. ixgbe_xmit_zc() then returns
true, claiming 'we completed cleaning the budget'. That also means
'clean_complete = true' in ixgbe_poll().
The underlying reasoning is: if the full budget of descriptors has been
consumed, more descriptors might still be pending, so ixgbe_xmit_zc()
should return false to tell the NAPI poll to schedule another round to
handle the rest of the descriptors. On the contrary, returning true means
the job is done: all possible descriptors were finished this time and no
new NAPI poll is intended.
The current behavior clearly goes against those expectations. See also
how ixgbe_clean_tx_irq() handles the problem: it uses a do..while()
statement to make sure the budget can be decreased to zero at most, so
the negative overflow never happens.
The patch adds 'likely' because the loop condition is rarely false,
since the standard budget is 256.
Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
index ac58964b2f08..7b941505a9d0 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -398,7 +398,7 @@ static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget)
dma_addr_t dma;
u32 cmd_type;
- while (budget-- > 0) {
+ while (likely(budget)) {
if (unlikely(!ixgbe_desc_unused(xdp_ring))) {
work_done = false;
break;
@@ -433,6 +433,8 @@ static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget)
xdp_ring->next_to_use++;
if (xdp_ring->next_to_use == xdp_ring->count)
xdp_ring->next_to_use = 0;
+
+ budget--;
}
if (tx_desc) {
--
2.47.1
* [PATCH net 5/6] ixgbe: fix ndo_xdp_xmit() workloads
2025-08-15 20:41 [PATCH net 0/6][pull request] Intel Wired LAN Driver Updates 2025-08-15 (ice, ixgbe, igc) Tony Nguyen
` (3 preceding siblings ...)
2025-08-15 20:42 ` [PATCH net 4/6] ixgbe: xsk: resolve the negative overflow of budget in ixgbe_xmit_zc Tony Nguyen
@ 2025-08-15 20:42 ` Tony Nguyen
2025-08-15 20:42 ` [PATCH net 6/6] igc: fix disabling L1.2 PCI-E link substate on I226 on init Tony Nguyen
5 siblings, 0 replies; 13+ messages in thread
From: Tony Nguyen @ 2025-08-15 20:42 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, andrew+netdev, netdev
Cc: Maciej Fijalkowski, anthony.l.nguyen, magnus.karlsson, ast,
daniel, hawk, john.fastabend, sdf, bpf, Tobias Böhm,
Marcus Wichelmann, Aleksandr Loktionov
From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Currently the ixgbe driver periodically checks in its watchdog subtask
whether there is anything to be transmitted (considering both Tx and XDP
rings) while the carrier is not 'ok'. Such an event is interpreted as a
Tx hang and therefore results in an interface reset.
This is currently problematic for ndo_xdp_xmit() as it is allowed to
produce descriptors while the interface is going through a reset or its
carrier is turned off.
Furthermore, XDP rings should not really be objects of Tx hang
detection. That mechanism is rather a matter of ndo_tx_timeout() being
called from dev_watchdog against Tx rings exposed to the networking
stack.
Taking into account the issues described above, make a twofold fix: do
not consider XDP rings in the local ixgbe watchdog, and do not produce
Tx descriptors in the ndo_xdp_xmit callback when there is currently a
problem with the carrier. For now, keep the Tx hang checks in the clean
Tx irq routine, but adjust them to not execute for XDP rings.
Cc: Tobias Böhm <tobias.boehm@hetzner-cloud.de>
Reported-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Closes: https://lore.kernel.org/netdev/eca1880f-253a-4955-afe6-732d7c6926ee@hetzner-cloud.de/
Fixes: 6453073987ba ("ixgbe: add initial support for xdp redirect")
Fixes: 33fdc82f0883 ("ixgbe: add support for XDP_TX action")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 34 ++++++-------------
1 file changed, 11 insertions(+), 23 deletions(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 6122a0abb41f..80e6a2ef1350 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -968,10 +968,6 @@ static void ixgbe_update_xoff_rx_lfc(struct ixgbe_adapter *adapter)
for (i = 0; i < adapter->num_tx_queues; i++)
clear_bit(__IXGBE_HANG_CHECK_ARMED,
&adapter->tx_ring[i]->state);
-
- for (i = 0; i < adapter->num_xdp_queues; i++)
- clear_bit(__IXGBE_HANG_CHECK_ARMED,
- &adapter->xdp_ring[i]->state);
}
static void ixgbe_update_xoff_received(struct ixgbe_adapter *adapter)
@@ -1214,7 +1210,7 @@ static void ixgbe_pf_handle_tx_hang(struct ixgbe_ring *tx_ring,
struct ixgbe_adapter *adapter = netdev_priv(tx_ring->netdev);
struct ixgbe_hw *hw = &adapter->hw;
- e_err(drv, "Detected Tx Unit Hang%s\n"
+ e_err(drv, "Detected Tx Unit Hang\n"
" Tx Queue <%d>\n"
" TDH, TDT <%x>, <%x>\n"
" next_to_use <%x>\n"
@@ -1222,16 +1218,14 @@ static void ixgbe_pf_handle_tx_hang(struct ixgbe_ring *tx_ring,
"tx_buffer_info[next_to_clean]\n"
" time_stamp <%lx>\n"
" jiffies <%lx>\n",
- ring_is_xdp(tx_ring) ? " (XDP)" : "",
tx_ring->queue_index,
IXGBE_READ_REG(hw, IXGBE_TDH(tx_ring->reg_idx)),
IXGBE_READ_REG(hw, IXGBE_TDT(tx_ring->reg_idx)),
tx_ring->next_to_use, next,
tx_ring->tx_buffer_info[next].time_stamp, jiffies);
- if (!ring_is_xdp(tx_ring))
- netif_stop_subqueue(tx_ring->netdev,
- tx_ring->queue_index);
+ netif_stop_subqueue(tx_ring->netdev,
+ tx_ring->queue_index);
}
/**
@@ -1451,6 +1445,9 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
total_bytes);
adapter->tx_ipsec += total_ipsec;
+ if (ring_is_xdp(tx_ring))
+ return !!budget;
+
if (check_for_tx_hang(tx_ring) && ixgbe_check_tx_hang(tx_ring)) {
if (adapter->hw.mac.type == ixgbe_mac_e610)
ixgbe_handle_mdd_event(adapter, tx_ring);
@@ -1468,9 +1465,6 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
return true;
}
- if (ring_is_xdp(tx_ring))
- return !!budget;
-
#define TX_WAKE_THRESHOLD (DESC_NEEDED * 2)
txq = netdev_get_tx_queue(tx_ring->netdev, tx_ring->queue_index);
if (!__netif_txq_completed_wake(txq, total_packets, total_bytes,
@@ -7974,12 +7968,9 @@ static void ixgbe_check_hang_subtask(struct ixgbe_adapter *adapter)
return;
/* Force detection of hung controller */
- if (netif_carrier_ok(adapter->netdev)) {
+ if (netif_carrier_ok(adapter->netdev))
for (i = 0; i < adapter->num_tx_queues; i++)
set_check_for_tx_hang(adapter->tx_ring[i]);
- for (i = 0; i < adapter->num_xdp_queues; i++)
- set_check_for_tx_hang(adapter->xdp_ring[i]);
- }
if (!(adapter->flags & IXGBE_FLAG_MSIX_ENABLED)) {
/*
@@ -8199,13 +8190,6 @@ static bool ixgbe_ring_tx_pending(struct ixgbe_adapter *adapter)
return true;
}
- for (i = 0; i < adapter->num_xdp_queues; i++) {
- struct ixgbe_ring *ring = adapter->xdp_ring[i];
-
- if (ring->next_to_use != ring->next_to_clean)
- return true;
- }
-
return false;
}
@@ -11005,6 +10989,10 @@ static int ixgbe_xdp_xmit(struct net_device *dev, int n,
if (unlikely(test_bit(__IXGBE_DOWN, &adapter->state)))
return -ENETDOWN;
+ if (!netif_carrier_ok(adapter->netdev) ||
+ !netif_running(adapter->netdev))
+ return -ENETDOWN;
+
if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
return -EINVAL;
--
2.47.1
* [PATCH net 6/6] igc: fix disabling L1.2 PCI-E link substate on I226 on init
2025-08-15 20:41 [PATCH net 0/6][pull request] Intel Wired LAN Driver Updates 2025-08-15 (ice, ixgbe, igc) Tony Nguyen
` (4 preceding siblings ...)
2025-08-15 20:42 ` [PATCH net 5/6] ixgbe: fix ndo_xdp_xmit() workloads Tony Nguyen
@ 2025-08-15 20:42 ` Tony Nguyen
5 siblings, 0 replies; 13+ messages in thread
From: Tony Nguyen @ 2025-08-15 20:42 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, andrew+netdev, netdev
Cc: ValdikSS, anthony.l.nguyen, vitaly.lifshits, dima.ruinskiy,
Paul Menzel
From: ValdikSS <iam@valdikss.org.ru>
Device ID comparison in igc_is_device_id_i226() is performed before the
ID is set, resulting in a check that always fails on init.
Before the patch:
* L1.2 is not disabled on init
* L1.2 is properly disabled after suspend-resume cycle
With the patch:
* L1.2 is properly disabled both on init and after suspend-resume
How to test:
Connect to a 1G link with 300+ Mbit/s Internet speed and run a download
speed test, such as:
curl -o /dev/null http://speedtest.selectel.ru/1GB
Without L1.2 disabled, the speed will be no more than ~200 Mbit/s.
With L1.2 disabled, the speed will reach 1 Gbit/s.
Note: the latency between your host and the remote must be around
3-5 ms; a test inside a LAN (<1 ms latency) won't trigger the issue.
Link: https://lore.kernel.org/intel-wired-lan/15248b4f-3271-42dd-8e35-02bfc92b25e1@intel.com
Fixes: 0325143b59c6 ("igc: disable L1.2 PCI-E link substate to avoid performance issue")
Signed-off-by: ValdikSS <iam@valdikss.org.ru>
Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
drivers/net/ethernet/intel/igc/igc_main.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
index 458e5eaa92e5..e79b14d50b24 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -7149,6 +7149,13 @@ static int igc_probe(struct pci_dev *pdev,
adapter->port_num = hw->bus.func;
adapter->msg_enable = netif_msg_init(debug, DEFAULT_MSG_ENABLE);
+ /* PCI config space info */
+ hw->vendor_id = pdev->vendor;
+ hw->device_id = pdev->device;
+ hw->revision_id = pdev->revision;
+ hw->subsystem_vendor_id = pdev->subsystem_vendor;
+ hw->subsystem_device_id = pdev->subsystem_device;
+
/* Disable ASPM L1.2 on I226 devices to avoid packet loss */
if (igc_is_device_id_i226(hw))
pci_disable_link_state(pdev, PCIE_LINK_STATE_L1_2);
@@ -7175,13 +7182,6 @@ static int igc_probe(struct pci_dev *pdev,
netdev->mem_start = pci_resource_start(pdev, 0);
netdev->mem_end = pci_resource_end(pdev, 0);
- /* PCI config space info */
- hw->vendor_id = pdev->vendor;
- hw->device_id = pdev->device;
- hw->revision_id = pdev->revision;
- hw->subsystem_vendor_id = pdev->subsystem_vendor;
- hw->subsystem_device_id = pdev->subsystem_device;
-
/* Copy the default MAC and PHY function pointers */
memcpy(&hw->mac.ops, ei->mac_ops, sizeof(hw->mac.ops));
memcpy(&hw->phy.ops, ei->phy_ops, sizeof(hw->phy.ops));
--
2.47.1
* Re: [PATCH net 3/6] ice: fix Rx page leak on multi-buffer frames
2025-08-15 20:41 ` [PATCH net 3/6] ice: fix Rx page leak on multi-buffer frames Tony Nguyen
@ 2025-08-18 11:05 ` Jesper Dangaard Brouer
2025-08-19 0:38 ` Jacob Keller
0 siblings, 1 reply; 13+ messages in thread
From: Jesper Dangaard Brouer @ 2025-08-18 11:05 UTC (permalink / raw)
To: Tony Nguyen, davem, kuba, pabeni, edumazet, andrew+netdev, netdev
Cc: Jacob Keller, maciej.fijalkowski, magnus.karlsson, ast, daniel,
john.fastabend, sdf, bpf, horms, przemyslaw.kitszel,
aleksander.lobakin, jaroslav.pulchart, jdamato,
christoph.petrausch, Rinitha S, Priya Singh, Eelco Chaudron
On 15/08/2025 22.41, Tony Nguyen wrote:
> From: Jacob Keller <jacob.e.keller@intel.com>
>
> The ice_put_rx_mbuf() function handles calling ice_put_rx_buf() for each
> buffer in the current frame. This function was introduced as part of
> handling multi-buffer XDP support in the ice driver.
>
> It works by iterating over the buffers from first_desc up to 1 plus the
> total number of fragments in the frame, cached from before the XDP program
> was executed.
>
> If the hardware posts a descriptor with a size of 0, the logic used in
> ice_put_rx_mbuf() breaks. Such descriptors get skipped and don't get added
> as fragments in ice_add_xdp_frag. Since the buffer isn't counted as a
> fragment, we do not iterate over it in ice_put_rx_mbuf(), and thus we don't
> call ice_put_rx_buf().
>
> Because we don't call ice_put_rx_buf(), we don't attempt to re-use the
> page or free it. This leaves a stale page in the ring, as we don't
> increment next_to_alloc.
>
> The ice_reuse_rx_page() assumes that the next_to_alloc has been incremented
> properly, and that it always points to a buffer with a NULL page. Since
> this function doesn't check, it will happily recycle a page over the top
> of the next_to_alloc buffer, losing track of the old page.
>
> Note that this leak only occurs for multi-buffer frames. The
> ice_put_rx_mbuf() function always handles at least one buffer, so a
> single-buffer frame will always get handled correctly. It is not clear
> precisely why the hardware hands us descriptors with a size of 0 sometimes,
> but it happens somewhat regularly with "jumbo frames" used by 9K MTU.
>
> To fix ice_put_rx_mbuf(), we need to make sure to call ice_put_rx_buf() on
> all buffers between first_desc and next_to_clean. Borrow the logic of a
> similar function in i40e used for this same purpose. Use the same logic
> also in ice_get_pgcnts().
>
> Instead of iterating over just the number of fragments, use a loop which
> iterates until the current index reaches the next_to_clean element just
> past the current frame. Check the current number of fragments (post XDP
> program). For all buffers up to 1 more than the number of fragments, we'll
> update the pagecnt_bias. For any buffers past this, pagecnt_bias is left
> as-is. This ensures that fragments released by the XDP program, as well as
> any buffers with zero-size won't have their pagecnt_bias updated
> incorrectly. Unlike i40e, the ice_put_rx_mbuf() function does call
> ice_put_rx_buf() on the last buffer of the frame indicating end of packet.
>
> The xdp_xmit value only needs to be updated if an XDP program is run, and
> only once per packet. Drop the xdp_xmit pointer argument from
> ice_put_rx_mbuf(). Instead, set xdp_xmit in the ice_clean_rx_irq() function
> directly. This avoids needing to pass the argument and avoids an extra
> bit-wise OR for each buffer in the frame.
>
> Move the increment of the ntc local variable to ensure it's updated *before*
> all calls to ice_get_pgcnts() or ice_put_rx_mbuf(), as the loop logic
> requires the index of the element just after the current frame.
>
> This has the advantage that we also no longer need to track or cache the
> number of fragments in the rx_ring, which saves a few bytes in the ring.
>
Has anyone tested the performance impact for XDP_DROP?
(with standard non-multi-buffer frames)
The code change below will always touch cache lines in the shared_info area.
Before, it was guarded with an xdp_buff_has_frags() check.
> Cc: Christoph Petrausch <christoph.petrausch@deepl.com>
> Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
> Closes: https://lore.kernel.org/netdev/CAK8fFZ4hY6GUJNENz3wY9jaYLZXGfpr7dnZxzGMYoE44caRbgw@mail.gmail.com/
> Fixes: 743bbd93cf29 ("ice: put Rx buffers after being done with current frame")
> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> Tested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> Tested-by: Priya Singh <priyax.singh@intel.com>
> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
> ---
> drivers/net/ethernet/intel/ice/ice_txrx.c | 81 +++++++++--------------
> drivers/net/ethernet/intel/ice/ice_txrx.h | 1 -
> 2 files changed, 33 insertions(+), 49 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
> index 29e0088ab6b2..93907ab2eac7 100644
> --- a/drivers/net/ethernet/intel/ice/ice_txrx.c
> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
> @@ -894,10 +894,6 @@ ice_add_xdp_frag(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
> __skb_fill_page_desc_noacc(sinfo, sinfo->nr_frags++, rx_buf->page,
> rx_buf->page_offset, size);
> sinfo->xdp_frags_size += size;
> - /* remember frag count before XDP prog execution; bpf_xdp_adjust_tail()
> - * can pop off frags but driver has to handle it on its own
> - */
> - rx_ring->nr_frags = sinfo->nr_frags;
>
> if (page_is_pfmemalloc(rx_buf->page))
> xdp_buff_set_frag_pfmemalloc(xdp);
> @@ -968,20 +964,20 @@ ice_get_rx_buf(struct ice_rx_ring *rx_ring, const unsigned int size,
> /**
> * ice_get_pgcnts - grab page_count() for gathered fragments
> * @rx_ring: Rx descriptor ring to store the page counts on
> + * @ntc: the next to clean element (not included in this frame!)
> *
> * This function is intended to be called right before running XDP
> * program so that the page recycling mechanism will be able to take
> * a correct decision regarding underlying pages; this is done in such
> * way as XDP program can change the refcount of page
> */
> -static void ice_get_pgcnts(struct ice_rx_ring *rx_ring)
> +static void ice_get_pgcnts(struct ice_rx_ring *rx_ring, unsigned int ntc)
> {
> - u32 nr_frags = rx_ring->nr_frags + 1;
> u32 idx = rx_ring->first_desc;
> struct ice_rx_buf *rx_buf;
> u32 cnt = rx_ring->count;
>
> - for (int i = 0; i < nr_frags; i++) {
> + while (idx != ntc) {
> rx_buf = &rx_ring->rx_buf[idx];
> rx_buf->pgcnt = page_count(rx_buf->page);
>
> @@ -1154,62 +1150,48 @@ ice_put_rx_buf(struct ice_rx_ring *rx_ring, struct ice_rx_buf *rx_buf)
> }
>
> /**
> - * ice_put_rx_mbuf - ice_put_rx_buf() caller, for all frame frags
> + * ice_put_rx_mbuf - ice_put_rx_buf() caller, for all buffers in frame
> * @rx_ring: Rx ring with all the auxiliary data
> * @xdp: XDP buffer carrying linear + frags part
> - * @xdp_xmit: XDP_TX/XDP_REDIRECT verdict storage
> - * @ntc: a current next_to_clean value to be stored at rx_ring
> + * @ntc: the next to clean element (not included in this frame!)
> * @verdict: return code from XDP program execution
> *
> - * Walk through gathered fragments and satisfy internal page
> - * recycle mechanism; we take here an action related to verdict
> - * returned by XDP program;
> + * Called after XDP program is completed, or on error with verdict set to
> + * ICE_XDP_CONSUMED.
> + *
> + * Walk through buffers from first_desc to the end of the frame, releasing
> + * buffers and satisfying internal page recycle mechanism. The action depends
> + * on verdict from XDP program.
> */
> static void ice_put_rx_mbuf(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
> - u32 *xdp_xmit, u32 ntc, u32 verdict)
> + u32 ntc, u32 verdict)
> {
> - u32 nr_frags = rx_ring->nr_frags + 1;
> + u32 nr_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
Here we unconditionally access the skb_shared_info area.
> u32 idx = rx_ring->first_desc;
> u32 cnt = rx_ring->count;
> - u32 post_xdp_frags = 1;
> struct ice_rx_buf *buf;
> - int i;
> -
> - if (unlikely(xdp_buff_has_frags(xdp)))
Previously we only touched the shared_info area when this was a multi-buff frame.
> - post_xdp_frags += xdp_get_shared_info_from_buff(xdp)->nr_frags;
> + int i = 0;
>
> - for (i = 0; i < post_xdp_frags; i++) {
> + while (idx != ntc) {
> buf = &rx_ring->rx_buf[idx];
> + if (++idx == cnt)
> + idx = 0;
>
> - if (verdict & (ICE_XDP_TX | ICE_XDP_REDIR)) {
> + /* An XDP program could release fragments from the end of the
> + * buffer. For these, we need to keep the pagecnt_bias as-is.
> + * To do this, only adjust pagecnt_bias for fragments up to
> + * the total remaining after the XDP program has run.
> + */
> + if (verdict != ICE_XDP_CONSUMED)
> ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
> - *xdp_xmit |= verdict;
> - } else if (verdict & ICE_XDP_CONSUMED) {
> + else if (i++ <= nr_frags)
> buf->pagecnt_bias++;
> - } else if (verdict == ICE_XDP_PASS) {
> - ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
> - }
>
> ice_put_rx_buf(rx_ring, buf);
> -
> - if (++idx == cnt)
> - idx = 0;
> - }
> - /* handle buffers that represented frags released by XDP prog;
> - * for these we keep pagecnt_bias as-is; refcount from struct page
> - * has been decremented within XDP prog and we do not have to increase
> - * the biased refcnt
> - */
> - for (; i < nr_frags; i++) {
> - buf = &rx_ring->rx_buf[idx];
> - ice_put_rx_buf(rx_ring, buf);
> - if (++idx == cnt)
> - idx = 0;
> }
>
> xdp->data = NULL;
> rx_ring->first_desc = ntc;
> - rx_ring->nr_frags = 0;
> }
>
> /**
> @@ -1317,6 +1299,10 @@ static int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
> /* retrieve a buffer from the ring */
> rx_buf = ice_get_rx_buf(rx_ring, size, ntc);
>
> + /* Increment ntc before calls to ice_put_rx_mbuf() */
> + if (++ntc == cnt)
> + ntc = 0;
> +
> if (!xdp->data) {
> void *hard_start;
>
> @@ -1325,24 +1311,23 @@ static int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
> xdp_prepare_buff(xdp, hard_start, offset, size, !!offset);
> xdp_buff_clear_frags_flag(xdp);
> } else if (ice_add_xdp_frag(rx_ring, xdp, rx_buf, size)) {
> - ice_put_rx_mbuf(rx_ring, xdp, NULL, ntc, ICE_XDP_CONSUMED);
> + ice_put_rx_mbuf(rx_ring, xdp, ntc, ICE_XDP_CONSUMED);
> break;
> }
> - if (++ntc == cnt)
> - ntc = 0;
>
> /* skip if it is NOP desc */
> if (ice_is_non_eop(rx_ring, rx_desc))
> continue;
>
> - ice_get_pgcnts(rx_ring);
> + ice_get_pgcnts(rx_ring, ntc);
> xdp_verdict = ice_run_xdp(rx_ring, xdp, xdp_prog, xdp_ring, rx_desc);
> if (xdp_verdict == ICE_XDP_PASS)
> goto construct_skb;
> total_rx_bytes += xdp_get_buff_len(xdp);
> total_rx_pkts++;
>
> - ice_put_rx_mbuf(rx_ring, xdp, &xdp_xmit, ntc, xdp_verdict);
> + ice_put_rx_mbuf(rx_ring, xdp, ntc, xdp_verdict);
> + xdp_xmit |= xdp_verdict & (ICE_XDP_TX | ICE_XDP_REDIR);
>
> continue;
> construct_skb:
> @@ -1355,7 +1340,7 @@ static int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
> rx_ring->ring_stats->rx_stats.alloc_page_failed++;
> xdp_verdict = ICE_XDP_CONSUMED;
> }
> - ice_put_rx_mbuf(rx_ring, xdp, &xdp_xmit, ntc, xdp_verdict);
> + ice_put_rx_mbuf(rx_ring, xdp, ntc, xdp_verdict);
>
> if (!skb)
> break;
> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
> index fef750c5f288..2fd8e78178a2 100644
> --- a/drivers/net/ethernet/intel/ice/ice_txrx.h
> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
> @@ -358,7 +358,6 @@ struct ice_rx_ring {
> struct ice_tx_ring *xdp_ring;
> struct ice_rx_ring *next; /* pointer to next ring in q_vector */
> struct xsk_buff_pool *xsk_pool;
> - u32 nr_frags;
> u16 max_frame;
> u16 rx_buf_len;
> dma_addr_t dma; /* physical address of ring */
* Re: [PATCH net 3/6] ice: fix Rx page leak on multi-buffer frames
2025-08-18 11:05 ` Jesper Dangaard Brouer
@ 2025-08-19 0:38 ` Jacob Keller
2025-08-19 16:44 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 13+ messages in thread
From: Jacob Keller @ 2025-08-19 0:38 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Tony Nguyen, davem, kuba, pabeni,
edumazet, andrew+netdev, netdev
Cc: maciej.fijalkowski, magnus.karlsson, ast, daniel, john.fastabend,
sdf, bpf, horms, przemyslaw.kitszel, aleksander.lobakin,
jaroslav.pulchart, jdamato, christoph.petrausch, Rinitha S,
Priya Singh, Eelco Chaudron
On 8/18/2025 4:05 AM, Jesper Dangaard Brouer wrote:
> On 15/08/2025 22.41, Tony Nguyen wrote:
>> This has the advantage that we also no longer need to track or cache the
>> number of fragments in the rx_ring, which saves a few bytes in the ring.
>>
>
> Has anyone tested the performance impact for XDP_DROP?
> (with standard non-multi-buffer frames)
>
> The code change below will always touch cache lines in the shared_info area.
> Before, it was guarded with an xdp_buff_has_frags() check.
>
I did some basic testing with XDP_DROP previously using the xdp-bench
tool, and do not recall noticing an issue. I don't recall the actual
numbers now, though, so I did some quick tests again.
without patch...
Client:
$ iperf3 -u -c 192.168.93.1 -t86400 -l1200 -P20 -b5G
...
[SUM] 10.00-10.33 sec 626 MBytes 16.0 Gbits/sec 546909
$ iperf3 -s -B 192.168.93.1%ens260f0
[SUM] 0.00-10.00 sec 17.7 GBytes 15.2 Gbits/sec 0.011 ms
9712/15888183 (0.061%) receiver
$ xdp-bench drop ens260f0
Summary 1,778,935 rx/s 0 err/s
Summary 2,041,087 rx/s 0 err/s
Summary 2,005,052 rx/s 0 err/s
Summary 1,918,967 rx/s 0 err/s
with patch...
Client:
$ iperf3 -u -c 192.168.93.1 -t86400 -l1200 -P20 -b5G
...
[SUM] 78.00-78.90 sec 2.01 GBytes 19.1 Gbits/sec 1801284
Server:
$ iperf3 -s -B 192.168.93.1%ens260f0
[SUM] 77.00-78.00 sec 2.14 GBytes 18.4 Gbits/sec 0.012 ms
9373/1921186 (0.49%)
xdp-bench:
$ xdp-bench drop ens260f0
Dropping packets on ens260f0 (ifindex 8; driver ice)
Summary 1,910,918 rx/s 0 err/s
Summary 1,866,562 rx/s 0 err/s
Summary 1,901,233 rx/s 0 err/s
Summary 1,859,854 rx/s 0 err/s
Summary 1,593,493 rx/s 0 err/s
Summary 1,891,426 rx/s 0 err/s
Summary 1,880,673 rx/s 0 err/s
Summary 1,866,043 rx/s 0 err/s
Summary 1,872,845 rx/s 0 err/s
I ran it a few times and it seemed to waffle a bit between 15 Gbit/sec
and 20 Gbit/sec, with throughput varying regardless of whether the patch
was applied. I actually tended to see slightly higher numbers with this
fix applied, but it was not consistent and hard to measure.
without the patch:
Without xdp-bench running the XDP program, top showed a CPU usage of
740% and an idle score of ~86.
With xdp-bench running, the iperf cpu drops off the top listing and the
CPU idle score goes up to 99.9
with the patch:
The iperf3 CPU use seems to go up, but so does the throughput. It is
hard to get an isolated measure. I don't have an immediate setup for
fine tuned performance testing available to do anything more rigorous.
Personally, I think it's overall in the noise, as I saw the same peak
performance and CPU usages with and without the patch.
I also tried testing TCP and didn't see a significant difference
with or without the patch. Though, testing xdp-bench with TCP is not
that useful since the client stops transmitting once the packets are
dropped instead of handled.
$ iperf3 -c 192.168.93.1 -t86400 -l8000 -P5
Without patch:
[SUM] 24.00-25.00 sec 7.80 GBytes 67.0 Gbits/sec
With patch:
[SUM] 28.00-29.00 sec 7.85 GBytes 67.4 Gbits/sec
Again, it ranges from 60 to 68 Gbit/sec in both cases, though I think
the peak is slightly higher with the fix applied; sometimes I saw it
spike up to 70 Gbit/sec, but it mostly hovers around 67 Gbit/sec.
I'm sure there are a lot of factors impacting the performance here,
but I think there's not much evidence that it's significantly different.
>> Cc: Christoph Petrausch <christoph.petrausch@deepl.com>
>> Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
>> Closes: https://lore.kernel.org/netdev/CAK8fFZ4hY6GUJNENz3wY9jaYLZXGfpr7dnZxzGMYoE44caRbgw@mail.gmail.com/
>> Fixes: 743bbd93cf29 ("ice: put Rx buffers after being done with current frame")
>> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
>> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
>> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
>> Tested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
>> Tested-by: Priya Singh <priyax.singh@intel.com>
>> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
>> ---
>> drivers/net/ethernet/intel/ice/ice_txrx.c | 81 +++++++++--------------
>> drivers/net/ethernet/intel/ice/ice_txrx.h | 1 -
>> 2 files changed, 33 insertions(+), 49 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
>> index 29e0088ab6b2..93907ab2eac7 100644
>> --- a/drivers/net/ethernet/intel/ice/ice_txrx.c
>> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
>> @@ -894,10 +894,6 @@ ice_add_xdp_frag(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
>> __skb_fill_page_desc_noacc(sinfo, sinfo->nr_frags++, rx_buf->page,
>> rx_buf->page_offset, size);
>> sinfo->xdp_frags_size += size;
>> - /* remember frag count before XDP prog execution; bpf_xdp_adjust_tail()
>> - * can pop off frags but driver has to handle it on its own
>> - */
>> - rx_ring->nr_frags = sinfo->nr_frags;
>>
>> if (page_is_pfmemalloc(rx_buf->page))
>> xdp_buff_set_frag_pfmemalloc(xdp);
>> @@ -968,20 +964,20 @@ ice_get_rx_buf(struct ice_rx_ring *rx_ring, const unsigned int size,
>> /**
>> * ice_get_pgcnts - grab page_count() for gathered fragments
>> * @rx_ring: Rx descriptor ring to store the page counts on
>> + * @ntc: the next to clean element (not included in this frame!)
>> *
>> * This function is intended to be called right before running XDP
>> * program so that the page recycling mechanism will be able to take
>> * a correct decision regarding underlying pages; this is done in such
>> * way as XDP program can change the refcount of page
>> */
>> -static void ice_get_pgcnts(struct ice_rx_ring *rx_ring)
>> +static void ice_get_pgcnts(struct ice_rx_ring *rx_ring, unsigned int ntc)
>> {
>> - u32 nr_frags = rx_ring->nr_frags + 1;
>> u32 idx = rx_ring->first_desc;
>> struct ice_rx_buf *rx_buf;
>> u32 cnt = rx_ring->count;
>>
>> - for (int i = 0; i < nr_frags; i++) {
>> + while (idx != ntc) {
>> rx_buf = &rx_ring->rx_buf[idx];
>> rx_buf->pgcnt = page_count(rx_buf->page);
>>
>> @@ -1154,62 +1150,48 @@ ice_put_rx_buf(struct ice_rx_ring *rx_ring, struct ice_rx_buf *rx_buf)
>> }
>>
>> /**
>> - * ice_put_rx_mbuf - ice_put_rx_buf() caller, for all frame frags
>> + * ice_put_rx_mbuf - ice_put_rx_buf() caller, for all buffers in frame
>> * @rx_ring: Rx ring with all the auxiliary data
>> * @xdp: XDP buffer carrying linear + frags part
>> - * @xdp_xmit: XDP_TX/XDP_REDIRECT verdict storage
>> - * @ntc: a current next_to_clean value to be stored at rx_ring
>> + * @ntc: the next to clean element (not included in this frame!)
>> * @verdict: return code from XDP program execution
>> *
>> - * Walk through gathered fragments and satisfy internal page
>> - * recycle mechanism; we take here an action related to verdict
>> - * returned by XDP program;
>> + * Called after XDP program is completed, or on error with verdict set to
>> + * ICE_XDP_CONSUMED.
>> + *
>> + * Walk through buffers from first_desc to the end of the frame, releasing
>> + * buffers and satisfying internal page recycle mechanism. The action depends
>> + * on verdict from XDP program.
>> */
>> static void ice_put_rx_mbuf(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
>> - u32 *xdp_xmit, u32 ntc, u32 verdict)
>> + u32 ntc, u32 verdict)
>> {
>> - u32 nr_frags = rx_ring->nr_frags + 1;
>> + u32 nr_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
>
> Here we unconditionally access the skb_shared_info area.
>
>> u32 idx = rx_ring->first_desc;
>> u32 cnt = rx_ring->count;
>> - u32 post_xdp_frags = 1;
>> struct ice_rx_buf *buf;
>> - int i;
>> -
>> - if (unlikely(xdp_buff_has_frags(xdp)))
>
> Previously we only touched the shared_info area when this was a multi-buff frame.
>
I'm not certain, but reading the helpers it might be correct to do
something like this:
if (unlikely(xdp_buff_has_frags(xdp)))
	nr_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
else
	nr_frags = 1;
either in the driver code or by adding a new xdp helper function.
I'm not sure it's worth it though. We have pending work from our
development team to refactor ice to use page pool and switch to libeth
XDP helpers which eliminates all of this driver-specific logic.
I don't personally think it's worth holding up this series and this
important memory leak fix for a minor potential code change that I can't
measure an obvious improvement from.
Thanks,
Jake
* Re: [PATCH net 3/6] ice: fix Rx page leak on multi-buffer frames
2025-08-19 0:38 ` Jacob Keller
@ 2025-08-19 16:44 ` Jesper Dangaard Brouer
2025-08-19 19:53 ` Jacob Keller
0 siblings, 1 reply; 13+ messages in thread
From: Jesper Dangaard Brouer @ 2025-08-19 16:44 UTC (permalink / raw)
To: Jacob Keller, Tony Nguyen, ast, davem, kuba, pabeni, netdev
Cc: maciej.fijalkowski, magnus.karlsson, andrew+netdev, daniel,
john.fastabend, sdf, bpf, horms, przemyslaw.kitszel,
aleksander.lobakin, jaroslav.pulchart, jdamato,
christoph.petrausch, Rinitha S, Priya Singh, Eelco Chaudron,
edumazet
On 19/08/2025 02.38, Jacob Keller wrote:
>
>
> On 8/18/2025 4:05 AM, Jesper Dangaard Brouer wrote:
>> On 15/08/2025 22.41, Tony Nguyen wrote:
>>> This has the advantage that we also no longer need to track or cache the
>>> number of fragments in the rx_ring, which saves a few bytes in the ring.
>>>
>>
>> Has anyone tested the performance impact for XDP_DROP?
>> (with standard non-multi-buffer frames)
>>
>> The code change below will always touch cache lines in the shared_info area.
>> Before, it was guarded with an xdp_buff_has_frags() check.
>>
>
> I did some basic testing with XDP_DROP previously using the xdp-bench
> tool, and do not recall noticing an issue. I don't recall the actual
> numbers now, though, so I did some quick tests again.
>
> without patch...
>
> Client:
> $ iperf3 -u -c 192.168.93.1 -t86400 -l1200 -P20 -b5G
> ...
> [SUM] 10.00-10.33 sec 626 MBytes 16.0 Gbits/sec 546909
>
> $ iperf3 -s -B 192.168.93.1%ens260f0
> [SUM] 0.00-10.00 sec 17.7 GBytes 15.2 Gbits/sec 0.011 ms
> 9712/15888183 (0.061%) receiver
>
> $ xdp-bench drop ens260f0
> Summary 1,778,935 rx/s 0 err/s
> Summary 2,041,087 rx/s 0 err/s
> Summary 2,005,052 rx/s 0 err/s
> Summary 1,918,967 rx/s 0 err/s
>
> with patch...
>
> Client:
> $ iperf3 -u -c 192.168.93.1 -t86400 -l1200 -P20 -b5G
> ...
> [SUM] 78.00-78.90 sec 2.01 GBytes 19.1 Gbits/sec 1801284
>
> Server:
> $ iperf3 -s -B 192.168.93.1%ens260f0
> [SUM] 77.00-78.00 sec 2.14 GBytes 18.4 Gbits/sec 0.012 ms
> 9373/1921186 (0.49%)
>
> xdp-bench:
> $ xdp-bench drop ens260f0
> Dropping packets on ens260f0 (ifindex 8; driver ice)
> Summary 1,910,918 rx/s 0 err/s
> Summary 1,866,562 rx/s 0 err/s
> Summary 1,901,233 rx/s 0 err/s
> Summary 1,859,854 rx/s 0 err/s
> Summary 1,593,493 rx/s 0 err/s
> Summary 1,891,426 rx/s 0 err/s
> Summary 1,880,673 rx/s 0 err/s
> Summary 1,866,043 rx/s 0 err/s
> Summary 1,872,845 rx/s 0 err/s
>
>
> I ran it a few times and it seemed to waffle a bit between 15 Gbit/sec
> and 20 Gbit/sec, with throughput varying regardless of whether the patch
> was applied. I actually tended to see slightly higher numbers with this
> fix applied, but it was not consistent and hard to measure.
>
The above testing is not a valid XDP_DROP test.
The packet generator needs to be much, much faster, as XDP_DROP is for
DDoS protection use cases (one of Cloudflare's main products).
I recommend using the pktgen script in the kernel tree:
samples/pktgen/pktgen_sample03_burst_single_flow.sh
Example:
./pktgen_sample03_burst_single_flow.sh -vi mlx5p2 -d 198.18.100.1 \
    -m b4:96:91:ad:0b:09 -t $(nproc)
> without the patch:
On my testlab with CPU: AMD EPYC 9684X (SRSO=IBPB) running:
- sudo ./xdp-bench drop ice4 # (defaults to no-touch)
XDP_DROP (with no-touch)
Without patch : 54,052,300 rx/s = 18.50 nanosec/packet
With the patch: 33,420,619 rx/s = 29.92 nanosec/packet
Diff: 11.42 nanosec
Using perf stat I can see an increase in cache-misses.
The difference is smaller if we read packet data, running:
- sudo ./xdp-bench drop ice4 --packet-operation read-data
XDP_DROP (with read-data)
Without patch : 27,200,683 rx/s = 36.76 nanosec/packet
With the patch: 24,348,751 rx/s = 41.07 nanosec/packet
Diff: 4.31 nanosec
On this CPU we don't have DDIO/DCA, so we take a big hit reading the
packet data in XDP. This will be needed by our DDoS bpf_prog.
The nanosec diff isn't the same, so it seems this change can hide a
little behind the cache miss in the XDP bpf_prog.
> Without xdp-bench running the XDP program, top showed a CPU usage of
> 740% and an idle score of ~86.
>
We don't want a scaling test for this. For this XDP_DROP/DDoS test we
want to target a single CPU. This is most easily done by generating a
single flow (hint: the pktgen script is called _single_flow). We want to
see a single CPU running ksoftirqd 100% of the time.
> With xdp-bench running, the iperf cpu drops off the top listing and the
> CPU idle score goes up to 99.9
>
With the single flow generator DoS "attacking" a single CPU, we want to
see a single CPU running ksoftirqd 100% of the time, for XDP_DROP case.
>
> with the patch:
>
> The iperf3 CPU use seems to go up, but so does the throughput. It is
> hard to get an isolated measure. I don't have an immediate setup for
> fine tuned performance testing available to do anything more rigorous.
>
> Personally, I think it's overall in the noise, as I saw the same peak
> performance and CPU usages with and without the patch.
>
> I also tried testing TCP and didn't see a significant difference
> with or without the patch. Though, testing xdp-bench with TCP is not
> that useful since the client stops transmitting once the packets are
> dropped instead of handled.
>
> $ iperf3 -c 192.168.93.1 -t86400 -l8000 -P5
>
> Without patch:
> [SUM] 24.00-25.00 sec 7.80 GBytes 67.0 Gbits/sec
>
> With patch:
> [SUM] 28.00-29.00 sec 7.85 GBytes 67.4 Gbits/sec
>
> Again, it ranges from 60 to 68 Gbit/sec in both cases, though I think
> the peak is slightly higher with the fix applied; sometimes I saw it
> spike up to 70 Gbit/sec, but it mostly hovers around 67 Gbit/sec.
>
> I'm sure there are a lot of factors impacting the performance here,
> but I think there's not much evidence that it's significantly different.
>>> Cc: Christoph Petrausch <christoph.petrausch@deepl.com>
>>> Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
>>> Closes: https://lore.kernel.org/netdev/CAK8fFZ4hY6GUJNENz3wY9jaYLZXGfpr7dnZxzGMYoE44caRbgw@mail.gmail.com/
>>> Fixes: 743bbd93cf29 ("ice: put Rx buffers after being done with current frame")
>>> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
>>> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
>>> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
>>> Tested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
>>> Tested-by: Priya Singh <priyax.singh@intel.com>
>>> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
>>> ---
>>> drivers/net/ethernet/intel/ice/ice_txrx.c | 81 +++++++++--------------
>>> drivers/net/ethernet/intel/ice/ice_txrx.h | 1 -
>>> 2 files changed, 33 insertions(+), 49 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
>>> index 29e0088ab6b2..93907ab2eac7 100644
>>> --- a/drivers/net/ethernet/intel/ice/ice_txrx.c
>>> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
>>> @@ -894,10 +894,6 @@ ice_add_xdp_frag(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
>>> __skb_fill_page_desc_noacc(sinfo, sinfo->nr_frags++, rx_buf->page,
>>> rx_buf->page_offset, size);
>>> sinfo->xdp_frags_size += size;
>>> - /* remember frag count before XDP prog execution; bpf_xdp_adjust_tail()
>>> - * can pop off frags but driver has to handle it on its own
>>> - */
>>> - rx_ring->nr_frags = sinfo->nr_frags;
>>>
>>> if (page_is_pfmemalloc(rx_buf->page))
>>> xdp_buff_set_frag_pfmemalloc(xdp);
>>> @@ -968,20 +964,20 @@ ice_get_rx_buf(struct ice_rx_ring *rx_ring, const unsigned int size,
>>> /**
>>> * ice_get_pgcnts - grab page_count() for gathered fragments
>>> * @rx_ring: Rx descriptor ring to store the page counts on
>>> + * @ntc: the next to clean element (not included in this frame!)
>>> *
>>> * This function is intended to be called right before running XDP
>>> * program so that the page recycling mechanism will be able to take
>>> * a correct decision regarding underlying pages; this is done in such
>>> * way as XDP program can change the refcount of page
>>> */
>>> -static void ice_get_pgcnts(struct ice_rx_ring *rx_ring)
>>> +static void ice_get_pgcnts(struct ice_rx_ring *rx_ring, unsigned int ntc)
>>> {
>>> - u32 nr_frags = rx_ring->nr_frags + 1;
>>> u32 idx = rx_ring->first_desc;
>>> struct ice_rx_buf *rx_buf;
>>> u32 cnt = rx_ring->count;
>>>
>>> - for (int i = 0; i < nr_frags; i++) {
>>> + while (idx != ntc) {
>>> rx_buf = &rx_ring->rx_buf[idx];
>>> rx_buf->pgcnt = page_count(rx_buf->page);
>>>
>>> @@ -1154,62 +1150,48 @@ ice_put_rx_buf(struct ice_rx_ring *rx_ring, struct ice_rx_buf *rx_buf)
>>> }
>>>
>>> /**
>>> - * ice_put_rx_mbuf - ice_put_rx_buf() caller, for all frame frags
>>> + * ice_put_rx_mbuf - ice_put_rx_buf() caller, for all buffers in frame
>>> * @rx_ring: Rx ring with all the auxiliary data
>>> * @xdp: XDP buffer carrying linear + frags part
>>> - * @xdp_xmit: XDP_TX/XDP_REDIRECT verdict storage
>>> - * @ntc: a current next_to_clean value to be stored at rx_ring
>>> + * @ntc: the next to clean element (not included in this frame!)
>>> * @verdict: return code from XDP program execution
>>> *
>>> - * Walk through gathered fragments and satisfy internal page
>>> - * recycle mechanism; we take here an action related to verdict
>>> - * returned by XDP program;
>>> + * Called after XDP program is completed, or on error with verdict set to
>>> + * ICE_XDP_CONSUMED.
>>> + *
>>> + * Walk through buffers from first_desc to the end of the frame, releasing
>>> + * buffers and satisfying internal page recycle mechanism. The action depends
>>> + * on verdict from XDP program.
>>> */
>>> static void ice_put_rx_mbuf(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
>>> - u32 *xdp_xmit, u32 ntc, u32 verdict)
>>> + u32 ntc, u32 verdict)
>>> {
>>> - u32 nr_frags = rx_ring->nr_frags + 1;
>>> + u32 nr_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
>>
>> Here we unconditionally access the skb_shared_info area.
>>
>>> u32 idx = rx_ring->first_desc;
>>> u32 cnt = rx_ring->count;
>>> - u32 post_xdp_frags = 1;
>>> struct ice_rx_buf *buf;
>>> - int i;
>>> -
>>> - if (unlikely(xdp_buff_has_frags(xdp)))
>>
>> Previously we only touched the shared_info area when this was a multi-buff frame.
>>
>
> I'm not certain, but reading the helpers it might be correct to do
> something like this:
>
> if (unlikely(xdp_buff_has_frags(xdp)))
> 	nr_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
> else
> 	nr_frags = 1;
Yes, that looks like a correct pattern.
> either in the driver code or by adding a new xdp helper function.
>
> I'm not sure it's worth it though. We have pending work from our
> development team to refactor ice to use page pool and switch to libeth
> XDP helpers which eliminates all of this driver-specific logic.
Please do proper testing of the XDP_DROP case when making this change.
> I don't personally think it's worth holding up this series and this
> important memory leak fix for a minor potential code change that I can't
> measure an obvious improvement from.
IMHO you included an optimization (that wasn't a gain) in a fix patch.
I think you can fix the memory leak without the "optimization" part.
pw-bot: cr
--Jesper
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net 3/6] ice: fix Rx page leak on multi-buffer frames
2025-08-19 16:44 ` Jesper Dangaard Brouer
@ 2025-08-19 19:53 ` Jacob Keller
2025-08-19 20:44 ` Tony Nguyen
2025-08-21 16:27 ` Jacob Keller
0 siblings, 2 replies; 13+ messages in thread
From: Jacob Keller @ 2025-08-19 19:53 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Tony Nguyen, ast, davem, kuba, pabeni,
netdev
Cc: maciej.fijalkowski, magnus.karlsson, andrew+netdev, daniel,
john.fastabend, sdf, bpf, horms, przemyslaw.kitszel,
aleksander.lobakin, jaroslav.pulchart, jdamato,
christoph.petrausch, Rinitha S, Priya Singh, Eelco Chaudron,
edumazet
On 8/19/2025 9:44 AM, Jesper Dangaard Brouer wrote:
>
>
> On 19/08/2025 02.38, Jacob Keller wrote:
>>
>>
>> On 8/18/2025 4:05 AM, Jesper Dangaard Brouer wrote:
>>> On 15/08/2025 22.41, Tony Nguyen wrote:
>>>> This has the advantage that we also no longer need to track or cache the
>>>> number of fragments in the rx_ring, which saves a few bytes in the ring.
>>>>
>>>
>>> Has anyone tested the performance impact for XDP_DROP?
>>> (with standard non-multi-buffer frames)
>>>
>>> The code change below will always touch cache lines in the shared_info area.
>>> Before, it was guarded with an xdp_buff_has_frags() check.
>>>
>>
>> I did some basic testing with XDP_DROP previously using the xdp-bench
>> tool, and do not recall noticing an issue. I don't recall the actual
>> numbers now, though, so I did some quick tests again.
>>
>> without patch...
>>
>> Client:
>> $ iperf3 -u -c 192.168.93.1 -t86400 -l1200 -P20 -b5G
>> ...
>> [SUM] 10.00-10.33 sec 626 MBytes 16.0 Gbits/sec 546909
>>
>> $ iperf3 -s -B 192.168.93.1%ens260f0
>> [SUM] 0.00-10.00 sec 17.7 GBytes 15.2 Gbits/sec 0.011 ms
>> 9712/15888183 (0.061%) receiver
>>
>> $ xdp-bench drop ens260f0
>> Summary 1,778,935 rx/s 0 err/s
>> Summary 2,041,087 rx/s 0 err/s
>> Summary 2,005,052 rx/s 0 err/s
>> Summary 1,918,967 rx/s 0 err/s
>>
>> with patch...
>>
>> Client:
>> $ iperf3 -u -c 192.168.93.1 -t86400 -l1200 -P20 -b5G
>> ...
>> [SUM] 78.00-78.90 sec 2.01 GBytes 19.1 Gbits/sec 1801284
>>
>> Server:
>> $ iperf3 -s -B 192.168.93.1%ens260f0
>> [SUM] 77.00-78.00 sec 2.14 GBytes 18.4 Gbits/sec 0.012 ms
>> 9373/1921186 (0.49%)
>>
>> xdp-bench:
>> $ xdp-bench drop ens260f0
>> Dropping packets on ens260f0 (ifindex 8; driver ice)
>> Summary 1,910,918 rx/s 0 err/s
>> Summary 1,866,562 rx/s 0 err/s
>> Summary 1,901,233 rx/s 0 err/s
>> Summary 1,859,854 rx/s 0 err/s
>> Summary 1,593,493 rx/s 0 err/s
>> Summary 1,891,426 rx/s 0 err/s
>> Summary 1,880,673 rx/s 0 err/s
>> Summary 1,866,043 rx/s 0 err/s
>> Summary 1,872,845 rx/s 0 err/s
>>
>>
>> I ran a few times and it seemed to waffle between 15Gbit/sec and
>> 20Gbit/sec, with throughput varying regardless of which patch was
>> applied. I actually tended to see slightly higher numbers with this fix
>> applied, but it was not consistent and hard to measure.
>>
>
> Above testing is not a valid XDP_DROP test.
>
Fair. I'm no XDP expert, so I have a lot to learn here :)
> The packet generator needs to be much, much faster, as XDP_DROP is for
> DDoS protection use-cases (one of Cloudflare's main products).
>
> I recommend using the script for pktgen in kernel tree:
> samples/pktgen/pktgen_sample03_burst_single_flow.sh
>
> Example:
> ./pktgen_sample03_burst_single_flow.sh -vi mlx5p2 -d 198.18.100.1 -m
> b4:96:91:ad:0b:09 -t $(nproc)
>
>
>> without the patch:
>
> On my testlab with CPU: AMD EPYC 9684X (SRSO=IBPB) running:
> - sudo ./xdp-bench drop ice4 # (defaults to no-touch)
>
> XDP_DROP (with no-touch)
> Without patch : 54,052,300 rx/s = 18.50 nanosec/packet
> With the patch: 33,420,619 rx/s = 29.92 nanosec/packet
> Diff: 11.42 nanosec
>
Oof. Yeah, that's not good.
> Using perf stat I can see an increase in cache-misses.
>
> The difference is smaller if we read packet data, running:
> - sudo ./xdp-bench drop ice4 --packet-operation read-data
>
> XDP_DROP (with read-data)
> Without patch : 27,200,683 rx/s = 36.76 nanosec/packet
> With the patch: 24,348,751 rx/s = 41.07 nanosec/packet
> Diff: 4.31 nanosec
>
> On this CPU we don't have DDIO/DCA, so we take a big hit reading the
> packet data in XDP. This will be needed by our DDoS bpf_prog.
> The nanosec diff isn't the same, so it seems this change can hide a
> little behind the cache misses in the XDP bpf_prog.
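The per-packet costs quoted above follow directly from the reported rates as 1e9 / pps; a trivial standalone sketch of the arithmetic:

```c
/* Sanity check of the per-packet figures quoted above:
 * nanoseconds per packet is simply 1e9 / (packets per second). */
static double ns_per_packet(double pps)
{
	return 1e9 / pps;
}

/* no-touch, without patch: 54,052,300 pps -> ~18.50 ns/packet
 * no-touch, with patch:    33,420,619 pps -> ~29.92 ns/packet
 * regression:                              ~11.42 ns/packet  */
```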
>
>
>> Without xdp-bench running the XDP program, top showed a CPU usage of
>> 740% and an ~86 idle score.
>>
>
> We don't want a scaling test for this. For this XDP_DROP/DDoS test we
> want to target a single CPU. This is most easily done by generating a
> single flow (hint: the pktgen script is called _single_flow). We want to
> see a single CPU running ksoftirqd 100% of the time.
>
Ok.
>>
>> I'm not certain, but reading the helpers it might be correct to do
>> something like this:
>>
>> if (unlikely(xdp_buff_has_frags(xdp)))
>> nr_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
>> else
>> nr_frags = 1;
>
> Yes, that looks like a correct pattern.
>
>> either in the driver code or by adding a new xdp helper function.
>>
>> I'm not sure it's worth it, though. We have pending work from our
>> development team to refactor ice to use page pool and switch to libeth
>> XDP helpers, which eliminate all of this driver-specific logic.
>
> Please do proper testing of the XDP_DROP case when doing this change.
>
>> I don't personally think it's worth holding up this series and this
>> important memory leak fix for a minor potential code change that I can't
>> measure an obvious improvement on.
>
> IMHO you included an optimization (that wasn't a gain) in a fix patch.
> I think you can fix the memory leak without the "optimization" part.
>
It wasn't intended as an optimization in any case, but rather an attempt
to make it easier to keep track of what the driver was doing; obviously
it ended up regressing here.
@Jakub, @Tony, I guess we'll have to drop this patch from the series,
and I'll work on a v2 to avoid this regression.
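For reference, the shape of the walk being fixed can be sketched standalone (mock code, not the driver: in ice_put_rx_mbuf() the index runs from first_desc up to ntc with wraparound at the ring size, and any buffer the walk skips is a leaked page):

```c
/* Standalone mock of the ice_put_rx_mbuf() index walk: visit every
 * descriptor of the frame from first_desc up to (but not including)
 * ntc, wrapping at the end of the ring.  In the driver, each visited
 * index releases one Rx buffer; any index not visited leaks its page. */
static unsigned int walk_frame_buffers(unsigned int first_desc,
				       unsigned int ntc,
				       unsigned int ring_count)
{
	unsigned int idx = first_desc;
	unsigned int released = 0;

	while (idx != ntc) {
		/* real driver: put/recycle rx_ring->rx_buf[idx] here */
		released++;
		if (++idx == ring_count)
			idx = 0; /* wrap around the descriptor ring */
	}
	return released;
}
```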
> pw-bot: cr
>
> --Jesper
>
* Re: [PATCH net 3/6] ice: fix Rx page leak on multi-buffer frames
2025-08-19 19:53 ` Jacob Keller
@ 2025-08-19 20:44 ` Tony Nguyen
2025-08-21 16:27 ` Jacob Keller
1 sibling, 0 replies; 13+ messages in thread
From: Tony Nguyen @ 2025-08-19 20:44 UTC (permalink / raw)
To: Jacob Keller, Jesper Dangaard Brouer, ast, davem, kuba, pabeni,
netdev
Cc: maciej.fijalkowski, magnus.karlsson, andrew+netdev, daniel,
john.fastabend, sdf, bpf, horms, przemyslaw.kitszel,
aleksander.lobakin, jaroslav.pulchart, jdamato,
christoph.petrausch, Rinitha S, Priya Singh, Eelco Chaudron,
edumazet
On 8/19/2025 12:53 PM, Jacob Keller wrote:
> On 8/19/2025 9:44 AM, Jesper Dangaard Brouer wrote:
...
>
> @Jakub, @Tony, I guess we'll have to drop this patch from the series,
> and I'll work on a v2 to avoid this regression.
Ok. I'll re-send the rest of the series with this dropped.
Thanks,
Tony
>> pw-bot: cr
I don't think the bot picked this up so...
pw-bot: changes-requested
>>
>> --Jesper
>>
>
* Re: [PATCH net 3/6] ice: fix Rx page leak on multi-buffer frames
2025-08-19 19:53 ` Jacob Keller
2025-08-19 20:44 ` Tony Nguyen
@ 2025-08-21 16:27 ` Jacob Keller
1 sibling, 0 replies; 13+ messages in thread
From: Jacob Keller @ 2025-08-21 16:27 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Tony Nguyen, ast, davem, kuba, pabeni,
netdev
Cc: maciej.fijalkowski, magnus.karlsson, andrew+netdev, daniel,
john.fastabend, sdf, bpf, horms, przemyslaw.kitszel,
aleksander.lobakin, jaroslav.pulchart, jdamato,
christoph.petrausch, Rinitha S, Priya Singh, Eelco Chaudron,
edumazet
On 8/19/2025 12:53 PM, Jacob Keller wrote:
>
>
> On 8/19/2025 9:44 AM, Jesper Dangaard Brouer wrote:
>>
>>
>> On 19/08/2025 02.38, Jacob Keller wrote:
>>>
>>>
>>> On 8/18/2025 4:05 AM, Jesper Dangaard Brouer wrote:
>>>> On 15/08/2025 22.41, Tony Nguyen wrote:
>>>>> This has the advantage that we also no longer need to track or cache the
>>>>> number of fragments in the rx_ring, which saves a few bytes in the ring.
>>>>>
>>>>
>>>> Has anyone tested the performance impact for XDP_DROP?
>>>> (with standard non-multi-buffer frames)
>>>>
>>>> The code change below will always touch cache lines in the shared_info area.
>>>> Before, it was guarded with an xdp_buff_has_frags() check.
>>>>
>>>
>>> I did some basic testing with XDP_DROP previously using the xdp-bench
>>> tool, and do not recall noticing an issue. I don't recall the actual
>>> numbers now, though, so I did some quick tests again.
>>>
>>> without patch...
>>>
>>> Client:
>>> $ iperf3 -u -c 192.168.93.1 -t86400 -l1200 -P20 -b5G
>>> ...
>>> [SUM] 10.00-10.33 sec 626 MBytes 16.0 Gbits/sec 546909
>>>
>>> $ iperf3 -s -B 192.168.93.1%ens260f0
>>> [SUM] 0.00-10.00 sec 17.7 GBytes 15.2 Gbits/sec 0.011 ms
>>> 9712/15888183 (0.061%) receiver
>>>
>>> $ xdp-bench drop ens260f0
>>> Summary 1,778,935 rx/s 0 err/s
>>> Summary 2,041,087 rx/s 0 err/s
>>> Summary 2,005,052 rx/s 0 err/s
>>> Summary 1,918,967 rx/s 0 err/s
>>>
>>> with patch...
>>>
>>> Client:
>>> $ iperf3 -u -c 192.168.93.1 -t86400 -l1200 -P20 -b5G
>>> ...
>>> [SUM] 78.00-78.90 sec 2.01 GBytes 19.1 Gbits/sec 1801284
>>>
>>> Server:
>>> $ iperf3 -s -B 192.168.93.1%ens260f0
>>> [SUM] 77.00-78.00 sec 2.14 GBytes 18.4 Gbits/sec 0.012 ms
>>> 9373/1921186 (0.49%)
>>>
>>> xdp-bench:
>>> $ xdp-bench drop ens260f0
>>> Dropping packets on ens260f0 (ifindex 8; driver ice)
>>> Summary 1,910,918 rx/s 0 err/s
>>> Summary 1,866,562 rx/s 0 err/s
>>> Summary 1,901,233 rx/s 0 err/s
>>> Summary 1,859,854 rx/s 0 err/s
>>> Summary 1,593,493 rx/s 0 err/s
>>> Summary 1,891,426 rx/s 0 err/s
>>> Summary 1,880,673 rx/s 0 err/s
>>> Summary 1,866,043 rx/s 0 err/s
>>> Summary 1,872,845 rx/s 0 err/s
>>>
>>>
>>> I ran a few times and it seemed to waffle between 15Gbit/sec and
>>> 20Gbit/sec, with throughput varying regardless of which patch was
>>> applied. I actually tended to see slightly higher numbers with this fix
>>> applied, but it was not consistent and hard to measure.
>>>
>>
>> Above testing is not a valid XDP_DROP test.
>>
>
> Fair. I'm no XDP expert, so I have a lot to learn here :)
>
>> The packet generator needs to be much, much faster, as XDP_DROP is for
>> DDoS protection use-cases (one of Cloudflare's main products).
>>
>> I recommend using the script for pktgen in kernel tree:
>> samples/pktgen/pktgen_sample03_burst_single_flow.sh
>>
>> Example:
>> ./pktgen_sample03_burst_single_flow.sh -vi mlx5p2 -d 198.18.100.1 -m
>> b4:96:91:ad:0b:09 -t $(nproc)
>>
>>
>>> without the patch:
>>
>> On my testlab with CPU: AMD EPYC 9684X (SRSO=IBPB) running:
>> - sudo ./xdp-bench drop ice4 # (defaults to no-touch)
>>
>> XDP_DROP (with no-touch)
>> Without patch : 54,052,300 rx/s = 18.50 nanosec/packet
>> With the patch: 33,420,619 rx/s = 29.92 nanosec/packet
>> Diff: 11.42 nanosec
>>
>
> Oof. Yeah, that's not good.
>
>> Using perf stat I can see an increase in cache-misses.
>>
>> The difference is smaller if we read packet data, running:
>> - sudo ./xdp-bench drop ice4 --packet-operation read-data
>>
>> XDP_DROP (with read-data)
>> Without patch : 27,200,683 rx/s = 36.76 nanosec/packet
>> With the patch: 24,348,751 rx/s = 41.07 nanosec/packet
>> Diff: 4.31 nanosec
>>
>> On this CPU we don't have DDIO/DCA, so we take a big hit reading the
>> packet data in XDP. This will be needed by our DDoS bpf_prog.
>> The nanosec diff isn't the same, so it seems this change can hide a
>> little behind the cache misses in the XDP bpf_prog.
>>
>>
>>> Without xdp-bench running the XDP program, top showed a CPU usage of
>>> 740% and an ~86 idle score.
>>>
>>
>> We don't want a scaling test for this. For this XDP_DROP/DDoS test we
>> want to target a single CPU. This is most easily done by generating a
>> single flow (hint: the pktgen script is called _single_flow). We want to
>> see a single CPU running ksoftirqd 100% of the time.
>>
>
> Ok.
>
>>>
>>> I'm not certain, but reading the helpers it might be correct to do
>>> something like this:
>>>
>>> if (unlikely(xdp_buff_has_frags(xdp)))
>>> nr_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
>>> else
>>> nr_frags = 1;
>>
>> Yes, that looks like a correct pattern.
>>
It looks like i40e has the same mistake, but perhaps it's less impacted
because of lower network speeds.
This mistake crept in because i40e_process_rx_buffs() (which I borrowed
this logic from) unconditionally checks the shared info for nr_frags.
In actuality, this counts the number of fragments excluding the initial
descriptor, but the check in the loop body is aware of that and accounts
for it.
Thus, I think what we really want here is to set nr_frags to 0 if
xdp_buff_has_frags() is false, not 1. A helper function seems like the
best solution, and I can submit a change to i40e to fix that code,
assuming I can measure the difference there as well.
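A minimal standalone sketch of that helper's intended semantics (mock types and a hypothetical name, not the kernel API: only the shape of the guard is taken from the discussion above):

```c
#include <stdbool.h>

/* Mock stand-ins for illustration only; the real kernel code tests the
 * frags flag via xdp_buff_has_frags() and reads nr_frags from the
 * skb_shared_info in the buffer tailroom, which costs an extra
 * cache-line touch that single-buffer frames should avoid. */
struct mock_shared_info {
	unsigned int nr_frags;
};

struct mock_xdp_buff {
	bool has_frags;                  /* stand-in for xdp_buff_has_frags() */
	struct mock_shared_info *sinfo;  /* stand-in for tailroom shared_info */
};

/* Hypothetical helper: report 0 extra fragments for single-buffer
 * frames without ever dereferencing the shared_info area. */
static unsigned int mock_xdp_buff_get_nr_frags(const struct mock_xdp_buff *xdp)
{
	if (xdp->has_frags)
		return xdp->sinfo->nr_frags;
	return 0;
}
```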
Thanks,
Jake
end of thread, other threads: [~2025-08-21 16:27 UTC | newest]
Thread overview: 13+ messages -- links below jump to the message on this page --
2025-08-15 20:41 [PATCH net 0/6][pull request] Intel Wired LAN Driver Updates 2025-08-15 (ice, ixgbe, igc) Tony Nguyen
2025-08-15 20:41 ` [PATCH net 1/6] ice: fix NULL pointer dereference in ice_unplug_aux_dev() on reset Tony Nguyen
2025-08-15 20:41 ` [PATCH net 2/6] ice: fix possible leak in ice_plug_aux_dev() error path Tony Nguyen
2025-08-15 20:41 ` [PATCH net 3/6] ice: fix Rx page leak on multi-buffer frames Tony Nguyen
2025-08-18 11:05 ` Jesper Dangaard Brouer
2025-08-19 0:38 ` Jacob Keller
2025-08-19 16:44 ` Jesper Dangaard Brouer
2025-08-19 19:53 ` Jacob Keller
2025-08-19 20:44 ` Tony Nguyen
2025-08-21 16:27 ` Jacob Keller
2025-08-15 20:42 ` [PATCH net 4/6] ixgbe: xsk: resolve the negative overflow of budget in ixgbe_xmit_zc Tony Nguyen
2025-08-15 20:42 ` [PATCH net 5/6] ixgbe: fix ndo_xdp_xmit() workloads Tony Nguyen
2025-08-15 20:42 ` [PATCH net 6/6] igc: fix disabling L1.2 PCI-E link substate on I226 on init Tony Nguyen