netdev.vger.kernel.org archive mirror
* RE: [Intel-wired-lan] [PATCH iwl-net v2] ice: fix Rx page leak on multi-buffer frames
  2025-07-12  0:23 Jacob Keller
@ 2025-07-30 12:11 ` Rinitha, SX
       [not found] ` <PH0PR11MB5013AD3FEC58E277297A7D4F962AA@PH0PR11MB5013.namprd11.prod.outlook.com>
  1 sibling, 0 replies; 7+ messages in thread
From: Rinitha, SX @ 2025-07-30 12:11 UTC (permalink / raw)
  To: Keller, Jacob E, Fijalkowski, Maciej, Kitszel, Przemyslaw,
	Intel Wired LAN, Lobakin, Aleksander
  Cc: Joe Damato, Nguyen, Anthony L, netdev@vger.kernel.org,
	Christoph Petrausch, Jaroslav Pulchart, Keller, Jacob E

> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf Of Jacob Keller
> Sent: 12 July 2025 05:54
> To: Fijalkowski, Maciej <maciej.fijalkowski@intel.com>; Kitszel, Przemyslaw <przemyslaw.kitszel@intel.com>; Intel Wired LAN <intel-wired-lan@lists.osuosl.org>; Lobakin, Aleksander <aleksander.lobakin@intel.com>
> Cc: Joe Damato <jdamato@fastly.com>; Nguyen, Anthony L <anthony.l.nguyen@intel.com>; netdev@vger.kernel.org; Christoph Petrausch <christoph.petrausch@deepl.com>; Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>; Keller, Jacob E <jacob.e.keller@intel.com>
> Subject: [Intel-wired-lan] [PATCH iwl-net v2] ice: fix Rx page leak on multi-buffer frames
>
> The ice_put_rx_mbuf() function handles calling ice_put_rx_buf() for each buffer in the current frame. This function was introduced as part of handling multi-buffer XDP support in the ice driver.
>
> It works by iterating over the buffers from first_desc up to 1 plus the total number of fragments in the frame, cached from before the XDP program was executed.
>
> If the hardware posts a descriptor with a size of 0, the logic used in
> ice_put_rx_mbuf() breaks. Such descriptors get skipped and don't get added as fragments in ice_add_xdp_frag. Since the buffer isn't counted as a fragment, we do not iterate over it in ice_put_rx_mbuf(), and thus we don't call ice_put_rx_buf().
>
> Because we don't call ice_put_rx_buf(), we don't attempt to re-use the page or free it. This leaves a stale page in the ring, as we don't increment next_to_alloc.
>
> The ice_reuse_rx_page() assumes that the next_to_alloc has been incremented properly, and that it always points to a buffer with a NULL page. Since this function doesn't check, it will happily recycle a page over the top of the next_to_alloc buffer, losing track of the old page.
>
> Note that this leak only occurs for multi-buffer frames. The
> ice_put_rx_mbuf() function always handles at least one buffer, so a single-buffer frame will always get handled correctly. It is not clear precisely why the hardware hands us descriptors with a size of 0 sometimes, but it happens somewhat regularly with "jumbo frames" used by 9K MTU.
>
> To fix ice_put_rx_mbuf(), we need to make sure to call ice_put_rx_buf() on all buffers between first_desc and next_to_clean. Borrow the logic of a similar function in i40e used for this same purpose. Use the same logic also in ice_get_pgcnts().
>
> Instead of iterating over just the number of fragments, use a loop which iterates until the current index reaches the next_to_clean element just past the current frame. Check the current number of fragments (post XDP program). For all buffers up to 1 more than the number of fragments, we'll update the pagecnt_bias. For any buffers past this, pagecnt_bias is left as-is. This ensures that fragments released by the XDP program, as well as any buffers with zero size, won't have their pagecnt_bias updated incorrectly. Unlike i40e, the ice_put_rx_mbuf() function does call
> ice_put_rx_buf() on the last buffer of the frame indicating end of packet.
>
> The xdp_xmit value only needs to be updated if an XDP program is run, and only once per packet. Drop the xdp_xmit pointer argument from ice_put_rx_mbuf(). Instead, set xdp_xmit in the ice_clean_rx_irq() function directly. This avoids needing to pass the argument and avoids an extra bit-wise OR for each buffer in the frame.
>
> Move the increment of the ntc local variable to ensure it's updated *before* all calls to ice_get_pgcnts() or ice_put_rx_mbuf(), as the loop logic requires the index of the element just after the current frame.
>
> This has the advantage that we also no longer need to track or cache the number of fragments in the rx_ring, which saves a few bytes in the ring.
>
> Cc: Christoph Petrausch <christoph.petrausch@deepl.com>
> Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
> Closes: https://lore.kernel.org/netdev/CAK8fFZ4hY6GUJNENz3wY9jaYLZXGfpr7dnZxzGMYoE44caRbgw@mail.gmail.com/
> Fixes: 743bbd93cf29 ("ice: put Rx buffers after being done with current frame")
> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> ---
> I've tested this in a setup with MTU 9000, using a combination of iperf3 and wrk generated traffic.
>
> I tested this in a couple of ways. First, I check memory allocations using
> /proc/allocinfo:
>
> awk '/ice_alloc_mapped_page/ { printf("%s %s\n", $1, $2) }' /proc/allocinfo | numfmt --to=iec
>
> Second, I ported some stats from i40e written by Joe Damato to track the page allocation and busy counts. I consistently saw that the allocate stat increased without the busy or waive stats increasing. I also added a stat to track directly when we overwrote a page pointer that was non-NULL in ice_reuse_rx_page(), and saw it increment consistently.
>
> With this fix, all of these indicators are fixed. I've tested both 1500 byte and 9000 byte MTU and no longer see the leak. With the counters I was able to immediately see a leak within a few minutes of iperf3, so I am confident that I've resolved the leak with this fix.
>
> I've now also tested with xdp-bench and confirm that XDP_TX and XDP_REDIR work properly with the fix for updating xdp_xmit.
> ---
> Changes in v2:
> - Fix XDP Tx/Redirect (Thanks Maciej!)
> - Link to v1: https://lore.kernel.org/r/20250709-jk-ice-fix-rx-mem-leak-v1-1-cfdd7eeea905@intel.com
> ---
> drivers/net/ethernet/intel/ice/ice_txrx.h |  1 -
> drivers/net/ethernet/intel/ice/ice_txrx.c | 81 +++++++++++++------------------
> 2 files changed, 33 insertions(+), 49 deletions(-)
>

Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)


* RE: [Intel-wired-lan] [PATCH iwl-net v2] ice: fix Rx page leak on multi-buffer frames
       [not found] ` <PH0PR11MB5013AD3FEC58E277297A7D4F962AA@PH0PR11MB5013.namprd11.prod.outlook.com>
@ 2025-08-13 13:39   ` Singh, PriyaX
  2025-08-13 13:43     ` Singh, PriyaX
  0 siblings, 1 reply; 7+ messages in thread
From: Singh, PriyaX @ 2025-08-13 13:39 UTC (permalink / raw)
  To: Fijalkowski, Maciej, Kitszel, Przemyslaw, Intel Wired LAN,
	Lobakin, Aleksander
  Cc: Joe Damato, Nguyen, Anthony L, netdev@vger.kernel.org,
	Christoph Petrausch, Jaroslav Pulchart, Keller, Jacob E

> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf Of
> Jacob Keller
> Sent: Saturday, July 12, 2025 5:54 AM
> To: Fijalkowski, Maciej <maciej.fijalkowski@intel.com>; Kitszel, Przemyslaw
> <przemyslaw.kitszel@intel.com>; Intel Wired LAN <intel-wired-
> lan@lists.osuosl.org>; Lobakin, Aleksander <aleksander.lobakin@intel.com>
> Cc: Joe Damato <jdamato@fastly.com>; Nguyen, Anthony L
> <anthony.l.nguyen@intel.com>; netdev@vger.kernel.org; Christoph
> Petrausch <christoph.petrausch@deepl.com>; Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com>; Keller, Jacob E
> <jacob.e.keller@intel.com>
> Subject: [Intel-wired-lan] [PATCH iwl-net v2] ice: fix Rx page leak on multi-
> buffer frames
> 
> The ice_put_rx_mbuf() function handles calling ice_put_rx_buf() for each
> buffer in the current frame. This function was introduced as part of handling
> multi-buffer XDP support in the ice driver.
> 
> It works by iterating over the buffers from first_desc up to 1 plus the total
> number of fragments in the frame, cached from before the XDP program
> was executed.
> 
> If the hardware posts a descriptor with a size of 0, the logic used in
> ice_put_rx_mbuf() breaks. Such descriptors get skipped and don't get added
> as fragments in ice_add_xdp_frag. Since the buffer isn't counted as a
> fragment, we do not iterate over it in ice_put_rx_mbuf(), and thus we don't
> call ice_put_rx_buf().
> 
> Because we don't call ice_put_rx_buf(), we don't attempt to re-use the page
> or free it. This leaves a stale page in the ring, as we don't increment
> next_to_alloc.
> 
> The ice_reuse_rx_page() assumes that the next_to_alloc has been
> incremented properly, and that it always points to a buffer with a NULL
> page. Since this function doesn't check, it will happily recycle a page over
> the top of the next_to_alloc buffer, losing track of the old page.
> 
> Note that this leak only occurs for multi-buffer frames. The
> ice_put_rx_mbuf() function always handles at least one buffer, so a single-
> buffer frame will always get handled correctly. It is not clear precisely why
> the hardware hands us descriptors with a size of 0 sometimes, but it
> happens somewhat regularly with "jumbo frames" used by 9K MTU.
> 
> To fix ice_put_rx_mbuf(), we need to make sure to call ice_put_rx_buf() on
> all buffers between first_desc and next_to_clean. Borrow the logic of a
> similar function in i40e used for this same purpose. Use the same logic also
> in ice_get_pgcnts().
> 
> Instead of iterating over just the number of fragments, use a loop which
> iterates until the current index reaches the next_to_clean element just
> past the current frame. Check the current number of fragments (post XDP
> program). For all buffers up to 1 more than the number of fragments, we'll
> update the pagecnt_bias. For any buffers past this, pagecnt_bias is left as-is.
> This ensures that fragments released by the XDP program, as well as any
> buffers with zero size, won't have their pagecnt_bias updated incorrectly.
> Unlike i40e, the ice_put_rx_mbuf() function does call
> ice_put_rx_buf() on the last buffer of the frame indicating end of packet.
> 
> The xdp_xmit value only needs to be updated if an XDP program is run, and
> only once per packet. Drop the xdp_xmit pointer argument from
> ice_put_rx_mbuf(). Instead, set xdp_xmit in the ice_clean_rx_irq() function
> directly. This avoids needing to pass the argument and avoids an extra bit-
> wise OR for each buffer in the frame.
> 
> Move the increment of the ntc local variable to ensure it's updated
> all calls to ice_get_pgcnts() or ice_put_rx_mbuf(), as the loop logic requires
> the index of the element just after the current frame.
> 
> This has the advantage that we also no longer need to track or cache the
> number of fragments in the rx_ring, which saves a few bytes in the ring.
> 
> Cc: Christoph Petrausch <christoph.petrausch@deepl.com>
> Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
> Closes:
> https://lore.kernel.org/netdev/CAK8fFZ4hY6GUJNENz3wY9jaYLZXGfpr7dnZxz
> GMYoE44caRbgw@mail.gmail.com/
> Fixes: 743bbd93cf29 ("ice: put Rx buffers after being done with current
> frame")
> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> ---
> I've tested this in a setup with MTU 9000, using a combination of iperf3 and
> wrk generated traffic.
> 
> I tested this in a couple of ways. First, I check memory allocations using
> /proc/allocinfo:
> 
>   awk '/ice_alloc_mapped_page/ { printf("%s %s\n", $1, $2) }' /proc/allocinfo
> | numfmt --to=iec
> 
> Second, I ported some stats from i40e written by Joe Damato to track the
> page allocation and busy counts. I consistently saw that the allocate stat
> increased without the busy or waive stats increasing. I also added a stat to
> track directly when we overwrote a page pointer that was non-NULL in
> ice_reuse_rx_page(), and saw it increment consistently.
> 
> With this fix, all of these indicators are fixed. I've tested both 1500 byte and
> 9000 byte MTU and no longer see the leak. With the counters I was able to
> immediately see a leak within a few minutes of iperf3, so I am confident that
> I've resolved the leak with this fix.
> 
> I've now also tested with xdp-bench and confirm that XDP_TX and
> XDP_REDIR work properly with the fix for updating xdp_xmit.
> ---
> Changes in v2:
> - Fix XDP Tx/Redirect (Thanks Maciej!)
> - Link to v1: https://lore.kernel.org/r/20250709-jk-ice-fix-rx-mem-leak-v1-1-
> cfdd7eeea905@intel.com
> ---
>  drivers/net/ethernet/intel/ice/ice_txrx.h |  1 -
> drivers/net/ethernet/intel/ice/ice_txrx.c | 81 +++++++++++++------------------
>  2 files changed, 33 insertions(+), 49 deletions(-)

Tested-by: Priya Singh <priyax.singh@intel.com>




* RE: [Intel-wired-lan] [PATCH iwl-net v2] ice: fix Rx page leak on multi-buffer frames
  2025-08-13 13:39   ` Singh, PriyaX
@ 2025-08-13 13:43     ` Singh, PriyaX
  0 siblings, 0 replies; 7+ messages in thread
From: Singh, PriyaX @ 2025-08-13 13:43 UTC (permalink / raw)
  To: Fijalkowski, Maciej, Kitszel, Przemyslaw, Intel Wired LAN,
	Lobakin, Aleksander
  Cc: Joe Damato, Nguyen, Anthony L, netdev@vger.kernel.org,
	Christoph Petrausch, Jaroslav Pulchart, Keller, Jacob E,
	Buvaneswaran, Sujai

++ Sujai

> -----Original Message-----
> From: Singh, PriyaX
> Sent: Wednesday, August 13, 2025 7:10 PM
> To: Fijalkowski, Maciej <maciej.fijalkowski@intel.com>; Kitszel, Przemyslaw
> <przemyslaw.kitszel@intel.com>; Intel Wired LAN <intel-wired-
> lan@lists.osuosl.org>; Lobakin, Aleksander <aleksander.lobakin@intel.com>
> Cc: Joe Damato <jdamato@fastly.com>; Nguyen, Anthony L
> <anthony.l.nguyen@intel.com>; netdev@vger.kernel.org; Christoph
> Petrausch <christoph.petrausch@deepl.com>; Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com>; Keller, Jacob E
> <jacob.e.keller@intel.com>
> Subject: RE: [Intel-wired-lan] [PATCH iwl-net v2] ice: fix Rx page leak on
> multi-buffer frames
> 
> > -----Original Message-----
> > From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf
> > Of Jacob Keller
> > Sent: Saturday, July 12, 2025 5:54 AM
> > To: Fijalkowski, Maciej <maciej.fijalkowski@intel.com>; Kitszel,
> > Przemyslaw <przemyslaw.kitszel@intel.com>; Intel Wired LAN
> > <intel-wired-lan@lists.osuosl.org>; Lobakin, Aleksander
> > <aleksander.lobakin@intel.com>
> > Cc: Joe Damato <jdamato@fastly.com>; Nguyen, Anthony L
> > <anthony.l.nguyen@intel.com>; netdev@vger.kernel.org; Christoph
> > Petrausch <christoph.petrausch@deepl.com>; Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com>; Keller, Jacob E
> > <jacob.e.keller@intel.com>
> > Subject: [Intel-wired-lan] [PATCH iwl-net v2] ice: fix Rx page leak on
> > multi-buffer frames
> >
> > The ice_put_rx_mbuf() function handles calling ice_put_rx_buf() for
> > each buffer in the current frame. This function was introduced as part
> > of handling multi-buffer XDP support in the ice driver.
> >
> > It works by iterating over the buffers from first_desc up to 1 plus
> > the total number of fragments in the frame, cached from before the XDP
> > program was executed.
> >
> > If the hardware posts a descriptor with a size of 0, the logic used in
> > ice_put_rx_mbuf() breaks. Such descriptors get skipped and don't get
> > added as fragments in ice_add_xdp_frag. Since the buffer isn't counted
> > as a fragment, we do not iterate over it in ice_put_rx_mbuf(), and
> > thus we don't call ice_put_rx_buf().
> >
> > Because we don't call ice_put_rx_buf(), we don't attempt to re-use the
> > page or free it. This leaves a stale page in the ring, as we don't
> > increment next_to_alloc.
> >
> > The ice_reuse_rx_page() assumes that the next_to_alloc has been
> > incremented properly, and that it always points to a buffer with a
> > NULL page. Since this function doesn't check, it will happily recycle
> > a page over the top of the next_to_alloc buffer, losing track of the old
> > page.
> >
> > Note that this leak only occurs for multi-buffer frames. The
> > ice_put_rx_mbuf() function always handles at least one buffer, so a
> > single-buffer frame will always get handled correctly. It is not
> > clear precisely why the hardware hands us descriptors with a size of 0
> > sometimes, but it happens somewhat regularly with "jumbo frames" used
> > by 9K MTU.
> >
> > To fix ice_put_rx_mbuf(), we need to make sure to call
> > ice_put_rx_buf() on all buffers between first_desc and next_to_clean.
> > Borrow the logic of a similar function in i40e used for this same
> > purpose. Use the same logic also in ice_get_pgcnts().
> >
> > Instead of iterating over just the number of fragments, use a loop
> > which iterates until the current index reaches to the next_to_clean
> > element just past the current frame. Check the current number of
> > fragments (post XDP program). For all buffers up 1 more than the
> > number of fragments, we'll update the pagecnt_bias. For any buffers past
> > this, pagecnt_bias is left as-is.
> > This ensures that fragments released by the XDP program, as well as
> > any buffers with zero size, won't have their pagecnt_bias updated
> > incorrectly.
> > Unlike i40e, the ice_put_rx_mbuf() function does call
> > ice_put_rx_buf() on the last buffer of the frame indicating end of packet.
> >
> > The xdp_xmit value only needs to be updated if an XDP program is run,
> > and only once per packet. Drop the xdp_xmit pointer argument from
> > ice_put_rx_mbuf(). Instead, set xdp_xmit in the ice_clean_rx_irq()
> > function directly. This avoids needing to pass the argument and avoids
> > an extra bit-wise OR for each buffer in the frame.
> >
> > Move the increment of the ntc local variable to ensure it's updated
> > *before* all calls to ice_get_pgcnts() or ice_put_rx_mbuf(), as the
> > loop logic requires the index of the element just after the current frame.
> >
> > This has the advantage that we also no longer need to track or cache
> > the number of fragments in the rx_ring, which saves a few bytes in the
> > ring.
> >
> > Cc: Christoph Petrausch <christoph.petrausch@deepl.com>
> > Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
> > Closes:
> >
> > https://lore.kernel.org/netdev/CAK8fFZ4hY6GUJNENz3wY9jaYLZXGfpr7dnZxz
> > GMYoE44caRbgw@mail.gmail.com/
> > Fixes: 743bbd93cf29 ("ice: put Rx buffers after being done with
> > current
> > frame")
> > Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> > ---
> > I've tested this in a setup with MTU 9000, using a combination of
> > iperf3 and wrk generated traffic.
> >
> > I tested this in a couple of ways. First, I check memory allocations
> > using
> > /proc/allocinfo:
> >
> >   awk '/ice_alloc_mapped_page/ { printf("%s %s\n", $1, $2) }'
> > /proc/allocinfo
> > | numfmt --to=iec
> >
> > Second, I ported some stats from i40e written by Joe Damato to track
> > the page allocation and busy counts. I consistently saw that the
> > allocate stat increased without the busy or waive stats increasing. I
> > also added a stat to track directly when we overwrote a page pointer
> > that was non-NULL in ice_reuse_rx_page(), and saw it increment
> consistently.
> >
> > With this fix, all of these indicators are fixed. I've tested both
> > 1500 byte and
> > 9000 byte MTU and no longer see the leak. With the counters I was able
> > to immediately see a leak within a few minutes of iperf3, so I am
> > confident that I've resolved the leak with this fix.
> >
> > I've now also tested with xdp-bench and confirm that XDP_TX and
> > XDP_REDIR work properly with the fix for updating xdp_xmit.
> > ---
> > Changes in v2:
> > - Fix XDP Tx/Redirect (Thanks Maciej!)
> > - Link to v1:
> > https://lore.kernel.org/r/20250709-jk-ice-fix-rx-mem-leak-v1-1-
> > cfdd7eeea905@intel.com
> > ---
> >  drivers/net/ethernet/intel/ice/ice_txrx.h |  1 -
> > drivers/net/ethernet/intel/ice/ice_txrx.c | 81
> > +++++++++++++------------------
> >  2 files changed, 33 insertions(+), 49 deletions(-)
> 
> Tested-by: Priya Singh <priyax.singh@intel.com>
> 



* [PATCH iwl-net v2] ice: fix Rx page leak on multi-buffer frames
@ 2025-08-25 23:00 Jacob Keller
  2025-08-26  8:29 ` [Intel-wired-lan] " Paul Menzel
  2025-08-26  8:35 ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 7+ messages in thread
From: Jacob Keller @ 2025-08-25 23:00 UTC (permalink / raw)
  To: Michal Kubiak, Anthony Nguyen, Intel Wired LAN, netdev
  Cc: Christoph Petrausch, Jesper Dangaard Brouer, Jaroslav Pulchart,
	Jacob Keller

The ice_put_rx_mbuf() function handles calling ice_put_rx_buf() for each
buffer in the current frame. This function was introduced as part of
handling multi-buffer XDP support in the ice driver.

It works by iterating over the buffers from first_desc up to 1 plus the
total number of fragments in the frame, cached from before the XDP program
was executed.

If the hardware posts a descriptor with a size of 0, the logic used in
ice_put_rx_mbuf() breaks. Such descriptors get skipped and don't get added
as fragments in ice_add_xdp_frag. Since the buffer isn't counted as a
fragment, we do not iterate over it in ice_put_rx_mbuf(), and thus we don't
call ice_put_rx_buf().

Because we don't call ice_put_rx_buf(), we don't attempt to re-use the
page or free it. This leaves a stale page in the ring, as we don't
increment next_to_alloc.

The ice_reuse_rx_page() assumes that the next_to_alloc has been incremented
properly, and that it always points to a buffer with a NULL page. Since
this function doesn't check, it will happily recycle a page over the top
of the next_to_alloc buffer, losing track of the old page.
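
To make the failure mode concrete, here is a minimal sketch of the recycle
path (simplified from the driver's ice_reuse_rx_page(); not the literal
code, and with the leaking stores called out):

  /* Simplified sketch of the recycle path in ice_txrx.c */
  static void ice_reuse_rx_page(struct ice_rx_ring *rx_ring,
                                struct ice_rx_buf *old_buf)
  {
          u16 nta = rx_ring->next_to_alloc;
          struct ice_rx_buf *new_buf = &rx_ring->rx_buf[nta];

          nta++;
          rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;

          /* If a zero-sized buffer was skipped, next_to_alloc was never
           * advanced past it, so new_buf->page may still point at a live
           * DMA-mapped page. The stores below overwrite that pointer
           * without unmapping or freeing it -- this is the leak.
           */
          new_buf->dma = old_buf->dma;
          new_buf->page = old_buf->page;
          new_buf->page_offset = old_buf->page_offset;
          new_buf->pagecnt_bias = old_buf->pagecnt_bias;
  }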

Note that this leak only occurs for multi-buffer frames. The
ice_put_rx_mbuf() function always handles at least one buffer, so a
single-buffer frame will always get handled correctly. It is not clear
precisely why the hardware hands us descriptors with a size of 0 sometimes,
but it happens somewhat regularly with "jumbo frames" used by 9K MTU.

To fix ice_put_rx_mbuf(), we need to make sure to call ice_put_rx_buf() on
all buffers between first_desc and next_to_clean. Borrow the logic of a
similar function in i40e used for this same purpose. Use the same logic
also in ice_get_pgcnts().

Instead of iterating over just the number of fragments, use a loop which
iterates until the current index reaches the next_to_clean element just
past the current frame. Unlike i40e, the ice_put_rx_mbuf() function does
call ice_put_rx_buf() on the last buffer of the frame indicating the end of
packet.

For non-linear (multi-buffer) frames, we need to take care when adjusting
the pagecnt_bias. An XDP program might release fragments from the tail of
the frame, in which case that fragment page is already released. Only
update the pagecnt_bias for the first descriptor and fragments still
remaining post-XDP program. Take care to only access the shared info for
fragmented buffers, as this avoids a significant cache miss.

The xdp_xmit value only needs to be updated if an XDP program is run, and
only once per packet. Drop the xdp_xmit pointer argument from
ice_put_rx_mbuf(). Instead, set xdp_xmit in the ice_clean_rx_irq() function
directly. This avoids needing to pass the argument and avoids an extra
bit-wise OR for each buffer in the frame.

Move the increment of the ntc local variable to ensure it's updated *before*
all calls to ice_get_pgcnts() or ice_put_rx_mbuf(), as the loop logic
requires the index of the element just after the current frame.

Now that we use an index pointer in the ring to identify the packet, we no
longer need to track or cache the number of fragments in the rx_ring.

Cc: Christoph Petrausch <christoph.petrausch@deepl.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Closes: https://lore.kernel.org/netdev/CAK8fFZ4hY6GUJNENz3wY9jaYLZXGfpr7dnZxzGMYoE44caRbgw@mail.gmail.com/
Fixes: 743bbd93cf29 ("ice: put Rx buffers after being done with current frame")
Tested-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
I've tested this in a setup with MTU 9000, using a combination of iperf3
and wrk generated traffic.

I tested this in a couple of ways. First, I check memory allocations using
/proc/allocinfo:

  awk '/ice_alloc_mapped_page/ { printf("%s %s\n", $1, $2) }' /proc/allocinfo | numfmt --to=iec

Second, I ported some stats from i40e written by Joe Damato to track the
page allocation and busy counts. I consistently saw that the allocate stat
increased without the busy or waive stats increasing. I also added a stat
to track directly when we overwrote a page pointer that was non-NULL in
ice_reuse_rx_page(), and saw it increment consistently.

With this fix, all of these indicators are fixed. I've tested both 1500
byte and 9000 byte MTU and no longer see the leak. With the counters I was
able to immediately see a leak within a few minutes of iperf3, so I am
confident that I've resolved the leak with this fix.

I've now also tested with xdp-bench and confirm that XDP_TX and XDP_REDIR work
properly with the fix for updating xdp_xmit.
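
A sketch of one way to run such a check, assuming xdp-tools' xdp-bench and
placeholder interface names:

  xdp-bench tx eth0              # exercise the XDP_TX path
  xdp-bench redirect eth0 eth1   # exercise the XDP_REDIRECT path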

This version (v2) avoids the cache miss regression reported by Jesper. I
refactored a bit to only check the shared info if the XDP buffer is
fragmented. I considered adding a helper function to do this to the XDP
header file. However, I scanned several drivers and noticed that only ice
and i40e access the nr_frags in this way. I believe the ice variation will
be removed with the conversion to page pool, so I don't think the addition
of a helper is warranted.

XDP_DROP performance has been tested for this version, thanks to work from
Michal Kubiak. The results are quite promising, with 3 versions being
compared:

* baseline net-next tree
* v1 applied
* v2 applied

Michal said:

  I ran the XDP_DROP performance comparison tests on my setup in the way I
  usually do. I didn't have the pktgen configured on my link partner, but I
  used 6 instances of the xdpsock running in Tx-only mode to generate
  high-bandwidth traffic. Also, I tried to replicate the conditions according
  to Jesper's description, making sure that all the traffic is directed to a
  single Rx queue and one CPU is 100% loaded.

The performance hit from v1 is replicated, and shown to be gone in v2, with
our results showing even an increase compared to baseline instead of a
drop. I've included the relative packet per second deltas compared against
a baseline test with neither v1 nor v2.

baseline to v1, no-touch:
  -8,387,677 packets per second (17%) decrease.

baseline to v2, no-touch:
  +4,057,000 packets per second (8%) increase!

baseline to v1, read data:
  -411,709 packets per second (1%) decrease.

baseline to v2, read data:
  +4,331,857 packets per second (11%) increase!
---
Changes in v2:
- Only access shared info for fragmented frames
- Link to v1: https://lore.kernel.org/netdev/20250815204205.1407768-4-anthony.l.nguyen@intel.com/
---
 drivers/net/ethernet/intel/ice/ice_txrx.h |  1 -
 drivers/net/ethernet/intel/ice/ice_txrx.c | 80 +++++++++++++------------------
 2 files changed, 34 insertions(+), 47 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index fef750c5f288..2fd8e78178a2 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -358,7 +358,6 @@ struct ice_rx_ring {
 	struct ice_tx_ring *xdp_ring;
 	struct ice_rx_ring *next;	/* pointer to next ring in q_vector */
 	struct xsk_buff_pool *xsk_pool;
-	u32 nr_frags;
 	u16 max_frame;
 	u16 rx_buf_len;
 	dma_addr_t dma;			/* physical address of ring */
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 29e0088ab6b2..fc92d7a66ad0 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -894,10 +894,6 @@ ice_add_xdp_frag(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
 	__skb_fill_page_desc_noacc(sinfo, sinfo->nr_frags++, rx_buf->page,
 				   rx_buf->page_offset, size);
 	sinfo->xdp_frags_size += size;
-	/* remember frag count before XDP prog execution; bpf_xdp_adjust_tail()
-	 * can pop off frags but driver has to handle it on its own
-	 */
-	rx_ring->nr_frags = sinfo->nr_frags;
 
 	if (page_is_pfmemalloc(rx_buf->page))
 		xdp_buff_set_frag_pfmemalloc(xdp);
@@ -968,20 +964,20 @@ ice_get_rx_buf(struct ice_rx_ring *rx_ring, const unsigned int size,
 /**
  * ice_get_pgcnts - grab page_count() for gathered fragments
  * @rx_ring: Rx descriptor ring to store the page counts on
+ * @ntc: the next to clean element (not included in this frame!)
  *
  * This function is intended to be called right before running XDP
  * program so that the page recycling mechanism will be able to take
  * a correct decision regarding underlying pages; this is done in such
  * way as XDP program can change the refcount of page
  */
-static void ice_get_pgcnts(struct ice_rx_ring *rx_ring)
+static void ice_get_pgcnts(struct ice_rx_ring *rx_ring, unsigned int ntc)
 {
-	u32 nr_frags = rx_ring->nr_frags + 1;
 	u32 idx = rx_ring->first_desc;
 	struct ice_rx_buf *rx_buf;
 	u32 cnt = rx_ring->count;
 
-	for (int i = 0; i < nr_frags; i++) {
+	while (idx != ntc) {
 		rx_buf = &rx_ring->rx_buf[idx];
 		rx_buf->pgcnt = page_count(rx_buf->page);
 
@@ -1154,62 +1150,51 @@ ice_put_rx_buf(struct ice_rx_ring *rx_ring, struct ice_rx_buf *rx_buf)
 }
 
 /**
- * ice_put_rx_mbuf - ice_put_rx_buf() caller, for all frame frags
+ * ice_put_rx_mbuf - ice_put_rx_buf() caller, for all buffers in frame
  * @rx_ring: Rx ring with all the auxiliary data
  * @xdp: XDP buffer carrying linear + frags part
- * @xdp_xmit: XDP_TX/XDP_REDIRECT verdict storage
- * @ntc: a current next_to_clean value to be stored at rx_ring
+ * @ntc: the next to clean element (not included in this frame!)
  * @verdict: return code from XDP program execution
  *
- * Walk through gathered fragments and satisfy internal page
- * recycle mechanism; we take here an action related to verdict
- * returned by XDP program;
+ * Called after XDP program is completed, or on error with verdict set to
+ * ICE_XDP_CONSUMED.
+ *
+ * Walk through buffers from first_desc to the end of the frame, releasing
+ * buffers and satisfying internal page recycle mechanism. The action depends
+ * on verdict from XDP program.
  */
 static void ice_put_rx_mbuf(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
-			    u32 *xdp_xmit, u32 ntc, u32 verdict)
+			    u32 ntc, u32 verdict)
 {
-	u32 nr_frags = rx_ring->nr_frags + 1;
 	u32 idx = rx_ring->first_desc;
 	u32 cnt = rx_ring->count;
-	u32 post_xdp_frags = 1;
 	struct ice_rx_buf *buf;
-	int i;
+	u32 xdp_frags = 0;
+	int i = 0;
 
 	if (unlikely(xdp_buff_has_frags(xdp)))
-		post_xdp_frags += xdp_get_shared_info_from_buff(xdp)->nr_frags;
+		xdp_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
 
-	for (i = 0; i < post_xdp_frags; i++) {
+	while (idx != ntc) {
 		buf = &rx_ring->rx_buf[idx];
+		if (++idx == cnt)
+			idx = 0;
 
-		if (verdict & (ICE_XDP_TX | ICE_XDP_REDIR)) {
+		/* An XDP program could release fragments from the end of the
+		 * buffer. For these, we need to keep the pagecnt_bias as-is.
+		 * To do this, only adjust pagecnt_bias for fragments up to
+		 * the total remaining after the XDP program has run.
+		 */
+		if (verdict != ICE_XDP_CONSUMED)
 			ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
-			*xdp_xmit |= verdict;
-		} else if (verdict & ICE_XDP_CONSUMED) {
+		else if (i++ <= xdp_frags)
 			buf->pagecnt_bias++;
-		} else if (verdict == ICE_XDP_PASS) {
-			ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
-		}
 
 		ice_put_rx_buf(rx_ring, buf);
-
-		if (++idx == cnt)
-			idx = 0;
-	}
-	/* handle buffers that represented frags released by XDP prog;
-	 * for these we keep pagecnt_bias as-is; refcount from struct page
-	 * has been decremented within XDP prog and we do not have to increase
-	 * the biased refcnt
-	 */
-	for (; i < nr_frags; i++) {
-		buf = &rx_ring->rx_buf[idx];
-		ice_put_rx_buf(rx_ring, buf);
-		if (++idx == cnt)
-			idx = 0;
 	}
 
 	xdp->data = NULL;
 	rx_ring->first_desc = ntc;
-	rx_ring->nr_frags = 0;
 }
 
 /**
@@ -1317,6 +1302,10 @@ static int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
 		/* retrieve a buffer from the ring */
 		rx_buf = ice_get_rx_buf(rx_ring, size, ntc);
 
+		/* Increment ntc before calls to ice_put_rx_mbuf() */
+		if (++ntc == cnt)
+			ntc = 0;
+
 		if (!xdp->data) {
 			void *hard_start;
 
@@ -1325,24 +1314,23 @@ static int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
 			xdp_prepare_buff(xdp, hard_start, offset, size, !!offset);
 			xdp_buff_clear_frags_flag(xdp);
 		} else if (ice_add_xdp_frag(rx_ring, xdp, rx_buf, size)) {
-			ice_put_rx_mbuf(rx_ring, xdp, NULL, ntc, ICE_XDP_CONSUMED);
+			ice_put_rx_mbuf(rx_ring, xdp, ntc, ICE_XDP_CONSUMED);
 			break;
 		}
-		if (++ntc == cnt)
-			ntc = 0;
 
 		/* skip if it is NOP desc */
 		if (ice_is_non_eop(rx_ring, rx_desc))
 			continue;
 
-		ice_get_pgcnts(rx_ring);
+		ice_get_pgcnts(rx_ring, ntc);
 		xdp_verdict = ice_run_xdp(rx_ring, xdp, xdp_prog, xdp_ring, rx_desc);
 		if (xdp_verdict == ICE_XDP_PASS)
 			goto construct_skb;
 		total_rx_bytes += xdp_get_buff_len(xdp);
 		total_rx_pkts++;
 
-		ice_put_rx_mbuf(rx_ring, xdp, &xdp_xmit, ntc, xdp_verdict);
+		ice_put_rx_mbuf(rx_ring, xdp, ntc, xdp_verdict);
+		xdp_xmit |= xdp_verdict & (ICE_XDP_TX | ICE_XDP_REDIR);
 
 		continue;
 construct_skb:
@@ -1355,7 +1343,7 @@ static int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
 			rx_ring->ring_stats->rx_stats.alloc_page_failed++;
 			xdp_verdict = ICE_XDP_CONSUMED;
 		}
-		ice_put_rx_mbuf(rx_ring, xdp, &xdp_xmit, ntc, xdp_verdict);
+		ice_put_rx_mbuf(rx_ring, xdp, ntc, xdp_verdict);
 
 		if (!skb)
 			break;

---
base-commit: ec79003c5f9d2c7f9576fc69b8dbda80305cbe3a
change-id: 20250708-jk-ice-fix-rx-mem-leak-3a328edc4797

Best regards,
--  
Jacob Keller <jacob.e.keller@intel.com>



* Re: [Intel-wired-lan] [PATCH iwl-net v2] ice: fix Rx page leak on multi-buffer frames
  2025-08-25 23:00 [PATCH iwl-net v2] ice: fix Rx page leak on multi-buffer frames Jacob Keller
@ 2025-08-26  8:29 ` Paul Menzel
  2025-08-26  8:35 ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 7+ messages in thread
From: Paul Menzel @ 2025-08-26  8:29 UTC (permalink / raw)
  To: Jacob Keller
  Cc: Michal Kubiak, Anthony Nguyen, intel-wired-lan, netdev,
	Christoph Petrausch, Jesper Dangaard Brouer, Jaroslav Pulchart

Dear Jacob,


Thank you for your patch.

On 26.08.25 at 01:00, Jacob Keller wrote:
> The ice_put_rx_mbuf() function handles calling ice_put_rx_buf() for each
> buffer in the current frame. This function was introduced as part of
> handling multi-buffer XDP support in the ice driver.
> 
> It works by iterating over the buffers from first_desc up to 1 plus the
> total number of fragments in the frame, cached from before the XDP program
> was executed.
> 
> If the hardware posts a descriptor with a size of 0, the logic used in
> ice_put_rx_mbuf() breaks. Such descriptors get skipped and don't get added
> as fragments in ice_add_xdp_frag. Since the buffer isn't counted as a
> fragment, we do not iterate over it in ice_put_rx_mbuf(), and thus we don't
> call ice_put_rx_buf().
> 
> Because we don't call ice_put_rx_buf(), we don't attempt to re-use the
> page or free it. This leaves a stale page in the ring, as we don't
> increment next_to_alloc.
> 
> The ice_reuse_rx_page() assumes that the next_to_alloc has been incremented
> properly, and that it always points to a buffer with a NULL page. Since
> this function doesn't check, it will happily recycle a page over the top
> of the next_to_alloc buffer, losing track of the old page.
> 
> Note that this leak only occurs for multi-buffer frames. The
> ice_put_rx_mbuf() function always handles at least one buffer, so a
> single-buffer frame will always get handled correctly. It is not clear
> precisely why the hardware hands us descriptors with a size of 0 sometimes,
> but it happens somewhat regularly with "jumbo frames" used by 9K MTU.
> 
> To fix ice_put_rx_mbuf(), we need to make sure to call ice_put_rx_buf() on
> all buffers between first_desc and next_to_clean. Borrow the logic of a
> similar function in i40e used for this same purpose. Use the same logic
> also in ice_get_pgcnts().
> 
> Instead of iterating over just the number of fragments, use a loop which
> iterates until the current index reaches the next_to_clean element just
> past the current frame. Unlike i40e, the ice_put_rx_mbuf() function does
> call ice_put_rx_buf() on the last buffer of the frame indicating the end of
> packet.
> 
> For non-linear (multi-buffer) frames, we need to take care when adjusting
> the pagecnt_bias. An XDP program might release fragments from the tail of
> the frame, in which case that fragment page is already released. Only
> update the pagecnt_bias for the first descriptor and fragments still
> remaining post-XDP program. Take care to only access the shared info for
> fragmented buffers, as this avoids a significant cache miss.
> 
> The xdp_xmit value only needs to be updated if an XDP program is run, and
> only once per packet. Drop the xdp_xmit pointer argument from
> ice_put_rx_mbuf(). Instead, set xdp_xmit in the ice_clean_rx_irq() function
> directly. This avoids needing to pass the argument and avoids an extra
> bit-wise OR for each buffer in the frame.
> 
> Move the increment of the ntc local variable to ensure it's updated *before*
> all calls to ice_get_pgcnts() or ice_put_rx_mbuf(), as the loop logic
> requires the index of the element just after the current frame.
> 
> Now that we use an index pointer in the ring to identify the packet, we no
> longer need to track or cache the number of fragments in the rx_ring.
> 
> Cc: Christoph Petrausch <christoph.petrausch@deepl.com>
> Cc: Jesper Dangaard Brouer <hawk@kernel.org>
> Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
> Closes: https://lore.kernel.org/netdev/CAK8fFZ4hY6GUJNENz3wY9jaYLZXGfpr7dnZxzGMYoE44caRbgw@mail.gmail.com/
> Fixes: 743bbd93cf29 ("ice: put Rx buffers after being done with current frame")
> Tested-by: Michal Kubiak <michal.kubiak@intel.com>
> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> ---
> I've tested this in a setup with MTU 9000, using a combination of iperf3
> and wrk generated traffic.
> 
> I tested this in a couple of ways. First, I check memory allocations using
> /proc/allocinfo:
> 
>    awk '/ice_alloc_mapped_page/ { printf("%s %s\n", $1, $2) }' /proc/allocinfo | numfmt --to=iec
> 
> Second, I ported some stats from i40e written by Joe Damato to track the
> page allocation and busy counts. I consistently saw that the allocate stat
> increased without the busy or waive stats increasing. I also added a stat
> to track directly when we overwrote a page pointer that was non-NULL in
> ice_reuse_rx_page(), and saw it increment consistently.
> 
> With this fix, all of these indicators are fixed. I've tested both 1500
> byte and 9000 byte MTU and no longer see the leak. With the counters I was
> able to immediately see a leak within a few minutes of iperf3, so I am
> confident that I've resolved the leak with this fix.
> 
> I've now also tested with xdp-bench and confirm that XDP_TX and XDP_REDIR work
> properly with the fix for updating xdp_xmit.
> 
> This version (v2) avoids the cache miss regression reported by Jesper. I
> refactored a bit to only check the shared info if the XDP buffer is
> fragmented. I considered adding a helper function to do this to the XDP
> header file. However, I scanned several drivers and noticed that only ice
> and i40e access the nr_frags in this way. I believe the ice variation will
> be removed with the conversion to page pool, so I don't think the addition
> of a helper is warranted.
> 
> XDP_DROP performance has been tested for this version, thanks to work from
> Michal Kubiak. The results are quite promising, with 3 versions being
> compared:
> 
> * baseline net-next tree
> * v1 applied
> * v2 applied
> 
> Michal said:
> 
>    I ran the XDP_DROP performance comparison tests on my setup in the way I
>    usually do. I didn't have the pktgen configured on my link partner, but I
>    used 6 instances of the xdpsock running in Tx-only mode to generate
>    high-bandwidth traffic. Also, I tried to replicate the conditions according
>    to Jesper's description, making sure that all the traffic is directed to a
>    single Rx queue and one CPU is 100% loaded.
> 
> The performance hit from v1 is replicated, and shown to be gone in v2, with
> our results showing even an increase compared to baseline instead of a
> drop. I've included the relative packet per second deltas compared against
> a baseline test with neither v1 nor v2.
> 
> baseline to v1, no-touch:
>    -8,387,677 packets per second (17%) decrease.
> 
> baseline to v2, no-touch:
>    +4,057,000 packets per second (8%) increase!
> 
> baseline to v1, read data:
>    -411,709 packets per second (1%) decrease.
> 
> baseline to v2, read data:
>    +4,331,857 packets per second (11%) increase!

I would love to see this in the commit message.

> ---
> Changes in v2:
> - Only access shared info for fragmented frames
> - Link to v1: https://lore.kernel.org/netdev/20250815204205.1407768-4-anthony.l.nguyen@intel.com/
> ---
>   drivers/net/ethernet/intel/ice/ice_txrx.h |  1 -
>   drivers/net/ethernet/intel/ice/ice_txrx.c | 80 +++++++++++++------------------
>   2 files changed, 34 insertions(+), 47 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
> index fef750c5f288..2fd8e78178a2 100644
> --- a/drivers/net/ethernet/intel/ice/ice_txrx.h
> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
> @@ -358,7 +358,6 @@ struct ice_rx_ring {
>   	struct ice_tx_ring *xdp_ring;
>   	struct ice_rx_ring *next;	/* pointer to next ring in q_vector */
>   	struct xsk_buff_pool *xsk_pool;
> -	u32 nr_frags;
>   	u16 max_frame;
>   	u16 rx_buf_len;
>   	dma_addr_t dma;			/* physical address of ring */
> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
> index 29e0088ab6b2..fc92d7a66ad0 100644
> --- a/drivers/net/ethernet/intel/ice/ice_txrx.c
> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
> @@ -894,10 +894,6 @@ ice_add_xdp_frag(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
>   	__skb_fill_page_desc_noacc(sinfo, sinfo->nr_frags++, rx_buf->page,
>   				   rx_buf->page_offset, size);
>   	sinfo->xdp_frags_size += size;
> -	/* remember frag count before XDP prog execution; bpf_xdp_adjust_tail()
> -	 * can pop off frags but driver has to handle it on its own
> -	 */
> -	rx_ring->nr_frags = sinfo->nr_frags;
>   
>   	if (page_is_pfmemalloc(rx_buf->page))
>   		xdp_buff_set_frag_pfmemalloc(xdp);
> @@ -968,20 +964,20 @@ ice_get_rx_buf(struct ice_rx_ring *rx_ring, const unsigned int size,
>   /**
>    * ice_get_pgcnts - grab page_count() for gathered fragments
>    * @rx_ring: Rx descriptor ring to store the page counts on
> + * @ntc: the next to clean element (not included in this frame!)
>    *
>    * This function is intended to be called right before running XDP
>    * program so that the page recycling mechanism will be able to take
>    * a correct decision regarding underlying pages; this is done in such
>    * way as XDP program can change the refcount of page
>    */
> -static void ice_get_pgcnts(struct ice_rx_ring *rx_ring)
> +static void ice_get_pgcnts(struct ice_rx_ring *rx_ring, unsigned int ntc)
>   {
> -	u32 nr_frags = rx_ring->nr_frags + 1;
>   	u32 idx = rx_ring->first_desc;
>   	struct ice_rx_buf *rx_buf;
>   	u32 cnt = rx_ring->count;
>   
> -	for (int i = 0; i < nr_frags; i++) {
> +	while (idx != ntc) {
>   		rx_buf = &rx_ring->rx_buf[idx];
>   		rx_buf->pgcnt = page_count(rx_buf->page);
>   
> @@ -1154,62 +1150,51 @@ ice_put_rx_buf(struct ice_rx_ring *rx_ring, struct ice_rx_buf *rx_buf)
>   }
>   
>   /**
> - * ice_put_rx_mbuf - ice_put_rx_buf() caller, for all frame frags
> + * ice_put_rx_mbuf - ice_put_rx_buf() caller, for all buffers in frame
>    * @rx_ring: Rx ring with all the auxiliary data
>    * @xdp: XDP buffer carrying linear + frags part
> - * @xdp_xmit: XDP_TX/XDP_REDIRECT verdict storage
> - * @ntc: a current next_to_clean value to be stored at rx_ring
> + * @ntc: the next to clean element (not included in this frame!)
>    * @verdict: return code from XDP program execution
>    *
> - * Walk through gathered fragments and satisfy internal page
> - * recycle mechanism; we take here an action related to verdict
> - * returned by XDP program;
> + * Called after XDP program is completed, or on error with verdict set to
> + * ICE_XDP_CONSUMED.
> + *
> + * Walk through buffers from first_desc to the end of the frame, releasing
> + * buffers and satisfying internal page recycle mechanism. The action depends
> + * on verdict from XDP program.
>    */
>   static void ice_put_rx_mbuf(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
> -			    u32 *xdp_xmit, u32 ntc, u32 verdict)
> +			    u32 ntc, u32 verdict)
>   {
> -	u32 nr_frags = rx_ring->nr_frags + 1;
>   	u32 idx = rx_ring->first_desc;
>   	u32 cnt = rx_ring->count;
> -	u32 post_xdp_frags = 1;
>   	struct ice_rx_buf *buf;
> -	int i;
> +	u32 xdp_frags = 0;
> +	int i = 0;
>   
>   	if (unlikely(xdp_buff_has_frags(xdp)))
> -		post_xdp_frags += xdp_get_shared_info_from_buff(xdp)->nr_frags;
> +		xdp_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
>   
> -	for (i = 0; i < post_xdp_frags; i++) {
> +	while (idx != ntc) {
>   		buf = &rx_ring->rx_buf[idx];
> +		if (++idx == cnt)
> +			idx = 0;
>   
> -		if (verdict & (ICE_XDP_TX | ICE_XDP_REDIR)) {
> +		/* An XDP program could release fragments from the end of the
> +		 * buffer. For these, we need to keep the pagecnt_bias as-is.
> +		 * To do this, only adjust pagecnt_bias for fragments up to
> +		 * the total remaining after the XDP program has run.
> +		 */
> +		if (verdict != ICE_XDP_CONSUMED)
>   			ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
> -			*xdp_xmit |= verdict;
> -		} else if (verdict & ICE_XDP_CONSUMED) {
> +		else if (i++ <= xdp_frags)
>   			buf->pagecnt_bias++;
> -		} else if (verdict == ICE_XDP_PASS) {
> -			ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
> -		}
>   
>   		ice_put_rx_buf(rx_ring, buf);
> -
> -		if (++idx == cnt)
> -			idx = 0;
> -	}
> -	/* handle buffers that represented frags released by XDP prog;
> -	 * for these we keep pagecnt_bias as-is; refcount from struct page
> -	 * has been decremented within XDP prog and we do not have to increase
> -	 * the biased refcnt
> -	 */
> -	for (; i < nr_frags; i++) {
> -		buf = &rx_ring->rx_buf[idx];
> -		ice_put_rx_buf(rx_ring, buf);
> -		if (++idx == cnt)
> -			idx = 0;
>   	}
>   
>   	xdp->data = NULL;
>   	rx_ring->first_desc = ntc;
> -	rx_ring->nr_frags = 0;
>   }
>   
>   /**
> @@ -1317,6 +1302,10 @@ static int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
>   		/* retrieve a buffer from the ring */
>   		rx_buf = ice_get_rx_buf(rx_ring, size, ntc);
>   
> +		/* Increment ntc before calls to ice_put_rx_mbuf() */
> +		if (++ntc == cnt)
> +			ntc = 0;
> +
>   		if (!xdp->data) {
>   			void *hard_start;
>   
> @@ -1325,24 +1314,23 @@ static int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
>   			xdp_prepare_buff(xdp, hard_start, offset, size, !!offset);
>   			xdp_buff_clear_frags_flag(xdp);
>   		} else if (ice_add_xdp_frag(rx_ring, xdp, rx_buf, size)) {
> -			ice_put_rx_mbuf(rx_ring, xdp, NULL, ntc, ICE_XDP_CONSUMED);
> +			ice_put_rx_mbuf(rx_ring, xdp, ntc, ICE_XDP_CONSUMED);
>   			break;
>   		}
> -		if (++ntc == cnt)
> -			ntc = 0;
>   
>   		/* skip if it is NOP desc */
>   		if (ice_is_non_eop(rx_ring, rx_desc))
>   			continue;
>   
> -		ice_get_pgcnts(rx_ring);
> +		ice_get_pgcnts(rx_ring, ntc);
>   		xdp_verdict = ice_run_xdp(rx_ring, xdp, xdp_prog, xdp_ring, rx_desc);
>   		if (xdp_verdict == ICE_XDP_PASS)
>   			goto construct_skb;
>   		total_rx_bytes += xdp_get_buff_len(xdp);
>   		total_rx_pkts++;
>   
> -		ice_put_rx_mbuf(rx_ring, xdp, &xdp_xmit, ntc, xdp_verdict);
> +		ice_put_rx_mbuf(rx_ring, xdp, ntc, xdp_verdict);
> +		xdp_xmit |= xdp_verdict & (ICE_XDP_TX | ICE_XDP_REDIR);
>   
>   		continue;
>   construct_skb:
> @@ -1355,7 +1343,7 @@ static int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
>   			rx_ring->ring_stats->rx_stats.alloc_page_failed++;
>   			xdp_verdict = ICE_XDP_CONSUMED;
>   		}
> -		ice_put_rx_mbuf(rx_ring, xdp, &xdp_xmit, ntc, xdp_verdict);
> +		ice_put_rx_mbuf(rx_ring, xdp, ntc, xdp_verdict);
>   
>   		if (!skb)
>   			break;

Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>


Kind regards,

Paul


* Re: [PATCH iwl-net v2] ice: fix Rx page leak on multi-buffer frames
  2025-08-25 23:00 [PATCH iwl-net v2] ice: fix Rx page leak on multi-buffer frames Jacob Keller
  2025-08-26  8:29 ` [Intel-wired-lan] " Paul Menzel
@ 2025-08-26  8:35 ` Jesper Dangaard Brouer
  2025-08-26 10:24   ` Michal Kubiak
  1 sibling, 1 reply; 7+ messages in thread
From: Jesper Dangaard Brouer @ 2025-08-26  8:35 UTC (permalink / raw)
  To: Jacob Keller, Michal Kubiak, Anthony Nguyen, Intel Wired LAN,
	netdev
  Cc: Christoph Petrausch, Jaroslav Pulchart, kernel-team



On 26/08/2025 01.00, Jacob Keller wrote:
> XDP_DROP performance has been tested for this version, thanks to work from
> Michal Kubiak. The results are quite promising, with 3 versions being
> compared:
> 
> * baseline net-next tree
> * v1 applied
> * v2 applied
> 
> Michal said:
> 
>    I ran the XDP_DROP performance comparison tests on my setup in the way I
>    usually do. I didn't have the pktgen configured on my link partner, but I
>    used 6 instances of the xdpsock running in Tx-only mode to generate
>    high-bandwidth traffic. Also, I tried to replicate the conditions according
>    to Jesper's description, making sure that all the traffic is directed to a
>    single Rx queue and one CPU is 100% loaded.
> 

Thank you for replicating the test setup.  Using xdpsock as a traffic
generator is fine, as long as we make sure that the generator TX speed
exceeds the Device Under Test RX XDP_DROP speed.  It is also important
for the test that packets hit a single RX queue and we verify one CPU
is 100% loaded, as you describe.

As a reminder, the pktgen kernel module comes with ready-to-use sample
shell scripts [1].

  [1] https://elixir.bootlin.com/linux/v6.16.3/source/samples/pktgen
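
For example, the single-flow burst sample can be started with something
like this (device, destination IP, and MAC below are placeholders):

  ./pktgen_sample03_burst_single_flow.sh -i eth0 -d 198.18.0.2 \
        -m aa:bb:cc:dd:ee:ff -t 4 -s 64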

> The performance hit from v1 is replicated, and shown to be gone in v2, with
> our results showing even an increase compared to baseline instead of a
> drop. I've included the relative packet per second deltas compared against
> a baseline test with neither v1 nor v2.
> 

Thanks for also replicating the performance hit from v1 as I did in [2].

To Michal: What CPU did you use?
  - I used CPU: AMD EPYC 9684X (with SRSO=IBPB)

One of the reasons that I saw a larger percentage drop is that this CPU
doesn't have DDIO/DCA, which delivers the packet to the L3 cache (and an L2
cache-miss will obviously take less time than a full main-memory cache-
miss). (Details: Newer AMD CPUs will get something called PCIe TLP
Processing Hints (TPH), which resembles DDIO).

The point is that I see some opportunities in the driver to move some of the
prefetches earlier. But we want to make sure it benefits both CPU types,
and I can test on the AMD platform. (This CPU is a large part of our
fleet so it makes sense for us to optimize this).
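
To sketch the idea (purely illustrative and untested; names follow
ice_txrx.c), it would be something along the lines of issuing the data
prefetch earlier in the clean loop, before the XDP program runs:

  net_prefetch(page_address(rx_buf->page) + rx_buf->page_offset);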

> baseline to v1, no-touch:
>    -8,387,677 packets per second (17%) decrease.
> 
> baseline to v2, no-touch:
>    +4,057,000 packets per second (8%) increase!
> 
> baseline to v1, read data:
>    -411,709 packets per second (1%) decrease.
> 
> baseline to v2, read data:
>    +4,331,857 packets per second (11%) increase!

Thanks for providing these numbers.
I would also like to know the throughput PPS packet numbers before and
after, as this allows me to calculate the nanosec difference. Using
percentages are usually useful, but it can be misleading when dealing
with XDP_DROP speeds, because a small nanosec change will get
"magnified" too much.

> ---
> Changes in v2:
> - Only access shared info for fragmented frames
> - Link to v1: https://lore.kernel.org/netdev/20250815204205.1407768-4-anthony.l.nguyen@intel.com/

[2] 
https://lore.kernel.org/netdev/6e2cbea1-8c70-4bfa-9ce4-1d07b545a705@kernel.org/

> ---
>   drivers/net/ethernet/intel/ice/ice_txrx.h |  1 -
>   drivers/net/ethernet/intel/ice/ice_txrx.c | 80 +++++++++++++------------------
>   2 files changed, 34 insertions(+), 47 deletions(-)

Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>


* Re: [PATCH iwl-net v2] ice: fix Rx page leak on multi-buffer frames
  2025-08-26  8:35 ` Jesper Dangaard Brouer
@ 2025-08-26 10:24   ` Michal Kubiak
  0 siblings, 0 replies; 7+ messages in thread
From: Michal Kubiak @ 2025-08-26 10:24 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Jacob Keller, Anthony Nguyen, Intel Wired LAN, netdev,
	Christoph Petrausch, Jaroslav Pulchart, kernel-team

On Tue, Aug 26, 2025 at 10:35:30AM +0200, Jesper Dangaard Brouer wrote:
> 
> 
> On 26/08/2025 01.00, Jacob Keller wrote:
> > XDP_DROP performance has been tested for this version, thanks to work from
> > Michal Kubiak. The results are quite promising, with 3 versions being
> > compared:
> > 
> > * baseline net-next tree
> > * v1 applied
> > * v2 applied
> > 
> > Michal said:
> > 
> >    I ran the XDP_DROP performance comparison tests on my setup in the way I
> >    usually do. I didn't have the pktgen configured on my link partner, but I
> >    used 6 instances of the xdpsock running in Tx-only mode to generate
> >    high-bandwidth traffic. Also, I tried to replicate the conditions according
> >    to Jesper's description, making sure that all the traffic is directed to a
> >    single Rx queue and one CPU is 100% loaded.
> > 
> 
> Thank you for replicating the test setup.  Using xdpsock as a traffic
> generator is fine, as long as we make sure that the generator TX speed
> exceeds the Device Under Test RX XDP_DROP speed.  It is also important
> for the test that packets hit a single RX queue and we verify one CPU is
> 100% loaded, as you describe.
> 
> As a reminder, the pktgen kernel module comes with ready-to-use sample
> shell scripts [1].
> 
>  [1] https://elixir.bootlin.com/linux/v6.16.3/source/samples/pktgen
> 

Thank you! I am aware of that and also use those scripts.
The xdpsock solution was just the quickest option for that specific
moment, so I decided not to change my link partner setup (since I
successfully reproduced the performance drop in v1).

> > The performance hit from v1 is replicated, and shown to be gone in v2, with
> > our results showing even an increase compared to baseline instead of a
> > drop. I've included the relative packet per second deltas compared against
> > a baseline test with neither v1 or v2.
> > 
> 
> Thanks for also replicating the performance hit from v1 as I did in [2].
> 
> To Michal: What CPU did you use?
>  - I used CPU: AMD EPYC 9684X (with SRSO=IBPB)

In my test I used: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz

> 
> One of the reasons that I saw a larger percentage drop is that this CPU
> doesn't have DDIO/DCA, which delivers the packet to the L3 cache (and an L2
> cache-miss will obviously take less time than a full main-memory cache-
> miss). (Details: Newer AMD CPUs will get something called PCIe TLP
> Processing Hints (TPH), which resembles DDIO).
> 
> The point is that I see some opportunities in the driver to move some of the
> prefetches earlier. But we want to make sure it benefits both CPU types,
> and I can test on the AMD platform. (This CPU is a large part of our
> fleet so it makes sense for us to optimize this).
> 
> > baseline to v1, no-touch:
> >    -8,387,677 packets per second (17%) decrease.
> > 
> > baseline to v2, no-touch:
> >    +4,057,000 packets per second (8%) increase!
> > 
> > baseline to v1, read data:
> >    -411,709 packets per second (1%) decrease.
> > 
> > baseline to v2, read data:
> >    +4,331,857 packets per second (11%) increase!
> 
> Thanks for providing these numbers.
> I would also like to know the raw throughput PPS numbers before and
> after, as this allows me to calculate the nanosec difference. Using
> percentages is usually useful, but it can be misleading when dealing
> with XDP_DROP speeds, because a small nanosec change will get
> "magnified" too much.
> 

I have usually been told to share percentage data, because absolute numbers
may depend on various circumstances.
However, I understand your point regarding XDP_DROP; in such a case it may
be justified. Please see my raw results (from xdp-bench summary) below:


net-next (main) (drop, no touch)
  Duration            : 105.7s
  Packets received    : 4,960,778,583
  Average packets/s   : 46,951,873
  Rx dropped          : 4,960,778,583


net-next (main) (drop, read data)
  Duration            : 94.5s
  Packets received    : 3,524,346,352
  Average packets/s   : 37,295,056
  Rx dropped          : 3,524,346,352


net-next (main+v1) (drop, no touch)
  Duration            : 122.5s
  Packets received    : 4,722,510,839
  Average packets/s   : 38,564,196
  Rx dropped          : 4,722,510,839


net-next (main+v1) (drop, read data)
  Duration            : 115.7s
  Packets received    : 4,265,991,147
  Average packets/s   : 36,883,347
  Rx dropped          : 4,265,991,147


net-next (main+v2) (drop, no touch)
  Duration            : 130.6s
  Packets received    : 6,664,104,907
  Average packets/s   : 51,008,873
  Rx dropped          : 6,664,104,907


net-next (main+v2) (drop, read data)
  Duration            : 143.6s
  Packets received    : 5,975,991,044
  Average packets/s   : 41,626,913
  Rx dropped          : 5,975,991,044
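
Converted to the per-packet view (10^9 / pps), those averages work out
roughly as:

  main,    no touch : 1e9 / 46,951,873 ~= 21.3 ns/pkt
  main+v1, no touch : 1e9 / 38,564,196 ~= 25.9 ns/pkt (+4.6 ns)
  main+v2, no touch : 1e9 / 51,008,873 ~= 19.6 ns/pkt (-1.7 ns)
  main,    read data: 1e9 / 37,295,056 ~= 26.8 ns/pkt
  main+v1, read data: 1e9 / 36,883,347 ~= 27.1 ns/pkt (+0.3 ns)
  main+v2, read data: 1e9 / 41,626,913 ~= 24.0 ns/pkt (-2.8 ns)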


Thanks,
Michal

> > ---
> > Changes in v2:
> > - Only access shared info for fragmented frames
> > - Link to v1: https://lore.kernel.org/netdev/20250815204205.1407768-4-anthony.l.nguyen@intel.com/
> 
> [2] https://lore.kernel.org/netdev/6e2cbea1-8c70-4bfa-9ce4-1d07b545a705@kernel.org/
> 
> > ---
> >   drivers/net/ethernet/intel/ice/ice_txrx.h |  1 -
> >   drivers/net/ethernet/intel/ice/ice_txrx.c | 80 +++++++++++++------------------
> >   2 files changed, 34 insertions(+), 47 deletions(-)
> 
> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>


end of thread

Thread overview: 7+ messages
2025-08-25 23:00 [PATCH iwl-net v2] ice: fix Rx page leak on multi-buffer frames Jacob Keller
2025-08-26  8:29 ` [Intel-wired-lan] " Paul Menzel
2025-08-26  8:35 ` Jesper Dangaard Brouer
2025-08-26 10:24   ` Michal Kubiak
  -- strict thread matches above, loose matches on Subject: below --
2025-07-12  0:23 Jacob Keller
2025-07-30 12:11 ` [Intel-wired-lan] " Rinitha, SX
     [not found] ` <PH0PR11MB5013AD3FEC58E277297A7D4F962AA@PH0PR11MB5013.namprd11.prod.outlook.com>
2025-08-13 13:39   ` Singh, PriyaX
2025-08-13 13:43     ` Singh, PriyaX
