[PATCH net v3] ice: fix packet corruption due to extraneous page flip

Netdev List
 help / color / mirror / Atom feed

* [PATCH net v3] ice: fix packet corruption due to extraneous page flip
@ 2026-05-12 18:19 John Ousterhout
  2026-05-13  9:07 ` David Laight
  2026-05-14  9:08 ` Loktionov, Aleksandr
  0 siblings, 2 replies; 12+ messages in thread
From: John Ousterhout @ 2026-05-12 18:19 UTC (permalink / raw)
  To: stable
  Cc: anthony.l.nguyen, intel-wired-lan, przemyslaw.kitszel, netdev,
	jacob.e.keller, John Ousterhout

Consider the following sequence of events:
* The bottom half of a buffer page is filled with data from
  packet A. The page has a net reference count (reference count
  - bias) of 1. The page is returned to the NIC, flipped to
  use the top half.
* Before the reference on the page is released, the NIC returns
  the page with no data in it ('size' is zero in ice_clean_rx_irq).
  In this case the bias does not get decremented. The page still
  has a net reference count of 1, so it gets returned to the NIC.
  However, ice_put_rx_mbuf flipped the page so that the bottom
  half is active.
* If the NIC stores another packet in the page before packet A
  has released its reference, the data in packet A will be
  overwritten with data from the new packet.
* Unfortunately zero-length buffers occur frequently: they seem
  to occur whenever a packet uses every available byte in a
  buffer, ending precisely at the end of the buffer. When this
  happens the NIC seems to generate an extra zero-length
  buffer.
The fix is for ice_put_rx_mbuf not to flip pages that have a
size of 0.

This patch applies directly to longterm stable versions 6.18.27
and 6.12.86; it also seems relevant for 6.6.137 but would need
modifcations for that version. I have not examined earlier
versions.

Unfortunately there is no upstream commit id for this patch because
the ICE driver has undergone a major revision (libeth refactor and
pagepool conversion) that eliminated the buggy code. Thus the
problem no longer exists in the main line.

Cc: stable@vger.kernel.org # 6.12+
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
 drivers/net/ethernet/intel/ice/ice_txrx.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 51c459a3e722..081c7a7392b7 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -1215,6 +1215,13 @@ static void ice_put_rx_mbuf(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
 		xdp_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
 
 	while (idx != ntc) {
+		union ice_32b_rx_flex_desc *rx_desc;
+		unsigned int size;
+
+		rx_desc = ICE_RX_DESC(rx_ring, idx);
+		size = le16_to_cpu(rx_desc->wb.pkt_len) &
+		       ICE_RX_FLX_DESC_PKT_LEN_M;
+
 		buf = &rx_ring->rx_buf[idx];
 		if (++idx == cnt)
 			idx = 0;
@@ -1224,10 +1231,20 @@ static void ice_put_rx_mbuf(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
 		 * To do this, only adjust pagecnt_bias for fragments up to
 		 * the total remaining after the XDP program has run.
 		 */
-		if (verdict != ICE_XDP_CONSUMED)
-			ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
-		else if (i++ <= xdp_frags)
+		if (verdict != ICE_XDP_CONSUMED) {
+			/* Don't "flip" the page if size is 0: in this case
+			 * the data in the current half will not be used so
+			 * it's OK to reuse that half. And, since the bias
+			 * didn't get decremented for this half, the page can
+			 * be returned to the NIC even if the other half is
+			 * still in use, so flipping the page could cause
+			 * live packet data to be overwritten.
+			 */
+			if (size != 0)
+				ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
+		} else if (i++ <= xdp_frags) {
 			buf->pagecnt_bias++;
+		}
 
 		ice_put_rx_buf(rx_ring, buf);
 	}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH net v3] ice: fix packet corruption due to extraneous page flip
  2026-05-12 18:19 [PATCH net v3] ice: fix packet corruption due to extraneous page flip John Ousterhout
@ 2026-05-13  9:07 ` David Laight
  2026-05-13 16:28   ` John Ousterhout
  2026-05-14  9:08 ` Loktionov, Aleksandr
  1 sibling, 1 reply; 12+ messages in thread
From: David Laight @ 2026-05-13  9:07 UTC (permalink / raw)
  To: John Ousterhout
  Cc: stable, anthony.l.nguyen, intel-wired-lan, przemyslaw.kitszel,
	netdev, jacob.e.keller

On Tue, 12 May 2026 11:19:53 -0700
John Ousterhout <ouster@cs.stanford.edu> wrote:

> Consider the following sequence of events:
> * The bottom half of a buffer page is filled with data from
>   packet A. The page has a net reference count (reference count
>   - bias) of 1. The page is returned to the NIC, flipped to
>   use the top half.
> * Before the reference on the page is released, the NIC returns
>   the page with no data in it ('size' is zero in ice_clean_rx_irq).
>   In this case the bias does not get decremented. The page still
>   has a net reference count of 1, so it gets returned to the NIC.
>   However, ice_put_rx_mbuf flipped the page so that the bottom
>   half is active.
> * If the NIC stores another packet in the page before packet A
>   has released its reference, the data in packet A will be
>   overwritten with data from the new packet.
> * Unfortunately zero-length buffers occur frequently: they seem
>   to occur whenever a packet uses every available byte in a
>   buffer, ending precisely at the end of the buffer. When this
>   happens the NIC seems to generate an extra zero-length
>   buffer.
> The fix is for ice_put_rx_mbuf not to flip pages that have a
> size of 0.

How is this different from packet B (in the top half) being
freed before packet A (in the bottom half)?

> This patch applies directly to longterm stable versions 6.18.27
> and 6.12.86; it also seems relevant for 6.6.137 but would need
> modifcations for that version. I have not examined earlier
> versions.
> 
> Unfortunately there is no upstream commit id for this patch because
> the ICE driver has undergone a major revision (libeth refactor and
> pagepool conversion) that eliminated the buggy code. Thus the
> problem no longer exists in the main line.
> 
> Cc: stable@vger.kernel.org # 6.12+
> Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
> ---
>  drivers/net/ethernet/intel/ice/ice_txrx.c | 23 ++++++++++++++++++++---
>  1 file changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
> index 51c459a3e722..081c7a7392b7 100644
> --- a/drivers/net/ethernet/intel/ice/ice_txrx.c
> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
> @@ -1215,6 +1215,13 @@ static void ice_put_rx_mbuf(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
>  		xdp_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
>  
>  	while (idx != ntc) {
> +		union ice_32b_rx_flex_desc *rx_desc;
> +		unsigned int size;
> +
> +		rx_desc = ICE_RX_DESC(rx_ring, idx);
> +		size = le16_to_cpu(rx_desc->wb.pkt_len) &
> +		       ICE_RX_FLX_DESC_PKT_LEN_M;
> +

Looks like you only need to calculate 'size' for the !ICE_XDP_CONSUMED path.
You could also use the (likely cheaper) test for zero:
		if (!(rx_desc->wb.pkt_len & cpu_to_le16(ICE_RX_FLX_DESC_PKT_LEN_M))

-- David

>  		buf = &rx_ring->rx_buf[idx];
>  		if (++idx == cnt)
>  			idx = 0;
> @@ -1224,10 +1231,20 @@ static void ice_put_rx_mbuf(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
>  		 * To do this, only adjust pagecnt_bias for fragments up to
>  		 * the total remaining after the XDP program has run.
>  		 */
> -		if (verdict != ICE_XDP_CONSUMED)
> -			ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
> -		else if (i++ <= xdp_frags)
> +		if (verdict != ICE_XDP_CONSUMED) {
> +			/* Don't "flip" the page if size is 0: in this case
> +			 * the data in the current half will not be used so
> +			 * it's OK to reuse that half. And, since the bias
> +			 * didn't get decremented for this half, the page can
> +			 * be returned to the NIC even if the other half is
> +			 * still in use, so flipping the page could cause
> +			 * live packet data to be overwritten.
> +			 */
> +			if (size != 0)
> +				ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
> +		} else if (i++ <= xdp_frags) {
>  			buf->pagecnt_bias++;
> +		}
>  
>  		ice_put_rx_buf(rx_ring, buf);
>  	}


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net v3] ice: fix packet corruption due to extraneous page flip
  2026-05-13  9:07 ` David Laight
@ 2026-05-13 16:28   ` John Ousterhout
  2026-05-13 20:49     ` David Laight
  0 siblings, 1 reply; 12+ messages in thread
From: John Ousterhout @ 2026-05-13 16:28 UTC (permalink / raw)
  To: David Laight
  Cc: stable, anthony.l.nguyen, intel-wired-lan, przemyslaw.kitszel,
	netdev, jacob.e.keller

On Wed, May 13, 2026 at 2:07 AM David Laight
<david.laight.linux@gmail.com> wrote:
>
> On Tue, 12 May 2026 11:19:53 -0700
> John Ousterhout <ouster@cs.stanford.edu> wrote:
>
> > Consider the following sequence of events:
> > * The bottom half of a buffer page is filled with data from
> >   packet A. The page has a net reference count (reference count
> >   - bias) of 1. The page is returned to the NIC, flipped to
> >   use the top half.
> > * Before the reference on the page is released, the NIC returns
> >   the page with no data in it ('size' is zero in ice_clean_rx_irq).
> >   In this case the bias does not get decremented. The page still
> >   has a net reference count of 1, so it gets returned to the NIC.
> >   However, ice_put_rx_mbuf flipped the page so that the bottom
> >   half is active.
> > * If the NIC stores another packet in the page before packet A
> >   has released its reference, the data in packet A will be
> >   overwritten with data from the new packet.
> > * Unfortunately zero-length buffers occur frequently: they seem
> >   to occur whenever a packet uses every available byte in a
> >   buffer, ending precisely at the end of the buffer. When this
> >   happens the NIC seems to generate an extra zero-length
> >   buffer.
> > The fix is for ice_put_rx_mbuf not to flip pages that have a
> > size of 0.
>
> How is this different from packet B (in the top half) being
> freed before packet A (in the bottom half)?

I'm not sure exactly what you're referring to here. Are you asking
about a situation where both halves of the page get filled with packet
data and then the second half to be filled is the first to be freed? I
believe that the ICE driver abandons a page if both halves are ever
occupied simultaneously; the page will be returned to the system once
both halves have dropped their references. Thus it doesn't matter
which half is freed first.

> > This patch applies directly to longterm stable versions 6.18.27
> > and 6.12.86; it also seems relevant for 6.6.137 but would need
> > modifcations for that version. I have not examined earlier
> > versions.
> >
> > Unfortunately there is no upstream commit id for this patch because
> > the ICE driver has undergone a major revision (libeth refactor and
> > pagepool conversion) that eliminated the buggy code. Thus the
> > problem no longer exists in the main line.
> >
> > Cc: stable@vger.kernel.org # 6.12+
> > Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
> > ---
> >  drivers/net/ethernet/intel/ice/ice_txrx.c | 23 ++++++++++++++++++++---
> >  1 file changed, 20 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
> > index 51c459a3e722..081c7a7392b7 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_txrx.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
> > @@ -1215,6 +1215,13 @@ static void ice_put_rx_mbuf(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
> >               xdp_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
> >
> >       while (idx != ntc) {
> > +             union ice_32b_rx_flex_desc *rx_desc;
> > +             unsigned int size;
> > +
> > +             rx_desc = ICE_RX_DESC(rx_ring, idx);
> > +             size = le16_to_cpu(rx_desc->wb.pkt_len) &
> > +                    ICE_RX_FLX_DESC_PKT_LEN_M;
> > +
>
> Looks like you only need to calculate 'size' for the !ICE_XDP_CONSUMED path.
> You could also use the (likely cheaper) test for zero:
>                 if (!(rx_desc->wb.pkt_len & cpu_to_le16(ICE_RX_FLX_DESC_PKT_LEN_M))
>
> -- David
>
> >               buf = &rx_ring->rx_buf[idx];
> >               if (++idx == cnt)
> >                       idx = 0;
> > @@ -1224,10 +1231,20 @@ static void ice_put_rx_mbuf(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
> >                * To do this, only adjust pagecnt_bias for fragments up to
> >                * the total remaining after the XDP program has run.
> >                */
> > -             if (verdict != ICE_XDP_CONSUMED)
> > -                     ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
> > -             else if (i++ <= xdp_frags)
> > +             if (verdict != ICE_XDP_CONSUMED) {
> > +                     /* Don't "flip" the page if size is 0: in this case
> > +                      * the data in the current half will not be used so
> > +                      * it's OK to reuse that half. And, since the bias
> > +                      * didn't get decremented for this half, the page can
> > +                      * be returned to the NIC even if the other half is
> > +                      * still in use, so flipping the page could cause
> > +                      * live packet data to be overwritten.
> > +                      */
> > +                     if (size != 0)
> > +                             ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
> > +             } else if (i++ <= xdp_frags) {
> >                       buf->pagecnt_bias++;
> > +             }
> >
> >               ice_put_rx_buf(rx_ring, buf);
> >       }
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net v3] ice: fix packet corruption due to extraneous page flip
  2026-05-13 16:28   ` John Ousterhout
@ 2026-05-13 20:49     ` David Laight
  2026-05-14  4:47       ` John Ousterhout
  0 siblings, 1 reply; 12+ messages in thread
From: David Laight @ 2026-05-13 20:49 UTC (permalink / raw)
  To: John Ousterhout
  Cc: stable, anthony.l.nguyen, intel-wired-lan, przemyslaw.kitszel,
	netdev, jacob.e.keller

On Wed, 13 May 2026 09:28:40 -0700
John Ousterhout <ouster@cs.stanford.edu> wrote:

> On Wed, May 13, 2026 at 2:07 AM David Laight
> <david.laight.linux@gmail.com> wrote:
> >
> > On Tue, 12 May 2026 11:19:53 -0700
> > John Ousterhout <ouster@cs.stanford.edu> wrote:
> >  
> > > Consider the following sequence of events:
> > > * The bottom half of a buffer page is filled with data from
> > >   packet A. The page has a net reference count (reference count
> > >   - bias) of 1. The page is returned to the NIC, flipped to
> > >   use the top half.
> > > * Before the reference on the page is released, the NIC returns
> > >   the page with no data in it ('size' is zero in ice_clean_rx_irq).
> > >   In this case the bias does not get decremented. The page still
> > >   has a net reference count of 1, so it gets returned to the NIC.
> > >   However, ice_put_rx_mbuf flipped the page so that the bottom
> > >   half is active.
> > > * If the NIC stores another packet in the page before packet A
> > >   has released its reference, the data in packet A will be
> > >   overwritten with data from the new packet.
> > > * Unfortunately zero-length buffers occur frequently: they seem
> > >   to occur whenever a packet uses every available byte in a
> > >   buffer, ending precisely at the end of the buffer. When this
> > >   happens the NIC seems to generate an extra zero-length
> > >   buffer.
> > > The fix is for ice_put_rx_mbuf not to flip pages that have a
> > > size of 0.  
> >
> > How is this different from packet B (in the top half) being
> > freed before packet A (in the bottom half)?  
> 
> I'm not sure exactly what you're referring to here. Are you asking
> about a situation where both halves of the page get filled with packet
> data and then the second half to be filled is the first to be freed? I
> believe that the ICE driver abandons a page if both halves are ever
> occupied simultaneously; the page will be returned to the system once
> both halves have dropped their references. Thus it doesn't matter
> which half is freed first.

That is what I was thinking, seems like the logic is over complicated.

If you need to put 4k pages into some kind of iommu rather than 2k buffers
(to contain 1536 byte ethernet packets) then I'd have thought you'd
initially put both halves into adjacent tx ring entries.
If a rx buffer is discarded (eg a zero length fragment or a CRC error,
or even 'copy break' for short packets) then, as an optimisation,
you could reuse the buffer for another receive.
The same could be done if the page is freed by an application.

However it sounds like it doesn't use the 2nd half until the first
completes - otherwise you'd never 'flip' to make the other half
active.

Thinks...
By only putting half of each 4k 'page' into the rx ring the code
will usually save (expensive) iommu setup in the (probably) normal
case where the buffers are freed 'reasonably quickly'.
But that really requires a 'free/with_nic/busy' state for each half
rather then trying to guess from a reference count.

But if the low-level code is recycling the rx buffer (for any reason)
it wants to use the same buffer.

The ethernet driver I wrote (a long time ago, early 90s) allocated
64k as 128 512byte buffers and did an aligned word-sized copy of
every receive frame - most frames were in contiguous memory.
The simplicity of it made up for the cost of the copy, especially
since that was an iommu system. 

-- David

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net v3] ice: fix packet corruption due to extraneous page flip
  2026-05-13 20:49     ` David Laight
@ 2026-05-14  4:47       ` John Ousterhout
  2026-05-14 10:01         ` David Laight
  0 siblings, 1 reply; 12+ messages in thread
From: John Ousterhout @ 2026-05-14  4:47 UTC (permalink / raw)
  To: David Laight
  Cc: stable, anthony.l.nguyen, intel-wired-lan, przemyslaw.kitszel,
	netdev, jacob.e.keller

On Wed, May 13, 2026 at 1:49 PM David Laight
<david.laight.linux@gmail.com> wrote:
>
> On Wed, 13 May 2026 09:28:40 -0700
> John Ousterhout <ouster@cs.stanford.edu> wrote:
>
> > On Wed, May 13, 2026 at 2:07 AM David Laight
> > <david.laight.linux@gmail.com> wrote:
> > >
> > > On Tue, 12 May 2026 11:19:53 -0700
> > > John Ousterhout <ouster@cs.stanford.edu> wrote:
> > >
> > > > Consider the following sequence of events:
> > > > * The bottom half of a buffer page is filled with data from
> > > >   packet A. The page has a net reference count (reference count
> > > >   - bias) of 1. The page is returned to the NIC, flipped to
> > > >   use the top half.
> > > > * Before the reference on the page is released, the NIC returns
> > > >   the page with no data in it ('size' is zero in ice_clean_rx_irq).
> > > >   In this case the bias does not get decremented. The page still
> > > >   has a net reference count of 1, so it gets returned to the NIC.
> > > >   However, ice_put_rx_mbuf flipped the page so that the bottom
> > > >   half is active.
> > > > * If the NIC stores another packet in the page before packet A
> > > >   has released its reference, the data in packet A will be
> > > >   overwritten with data from the new packet.
> > > > * Unfortunately zero-length buffers occur frequently: they seem
> > > >   to occur whenever a packet uses every available byte in a
> > > >   buffer, ending precisely at the end of the buffer. When this
> > > >   happens the NIC seems to generate an extra zero-length
> > > >   buffer.
> > > > The fix is for ice_put_rx_mbuf not to flip pages that have a
> > > > size of 0.
> > >
> > > How is this different from packet B (in the top half) being
> > > freed before packet A (in the bottom half)?
> >
> > I'm not sure exactly what you're referring to here. Are you asking
> > about a situation where both halves of the page get filled with packet
> > data and then the second half to be filled is the first to be freed? I
> > believe that the ICE driver abandons a page if both halves are ever
> > occupied simultaneously; the page will be returned to the system once
> > both halves have dropped their references. Thus it doesn't matter
> > which half is freed first.
>
> That is what I was thinking, seems like the logic is over complicated.
>
> If you need to put 4k pages into some kind of iommu rather than 2k buffers
> (to contain 1536 byte ethernet packets) then I'd have thought you'd
> initially put both halves into adjacent tx ring entries.
> If a rx buffer is discarded (eg a zero length fragment or a CRC error,
> or even 'copy break' for short packets) then, as an optimisation,
> you could reuse the buffer for another receive.
> The same could be done if the page is freed by an application.
>
> However it sounds like it doesn't use the 2nd half until the first
> completes - otherwise you'd never 'flip' to make the other half
> active.
>
> Thinks...
> By only putting half of each 4k 'page' into the rx ring the code
> will usually save (expensive) iommu setup in the (probably) normal
> case where the buffers are freed 'reasonably quickly'.
> But that really requires a 'free/with_nic/busy' state for each half
> rather then trying to guess from a reference count.
>
> But if the low-level code is recycling the rx buffer (for any reason)
> it wants to use the same buffer.
>
> The ethernet driver I wrote (a long time ago, early 90s) allocated
> 64k as 128 512byte buffers and did an aligned word-sized copy of
> every receive frame - most frames were in contiguous memory.
> The simplicity of it made up for the cost of the copy, especially
> since that was an iommu system.

I'm not here to defend the logic (and it has been replaced with
something that is probably simpler and more efficient); I'm just
suggesting a bug fix for the stable releases that still have this
logic.

-John-

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [Intel-wired-lan] [PATCH net v3] ice: fix packet corruption due to extraneous page flip
  2026-05-12 18:19 [PATCH net v3] ice: fix packet corruption due to extraneous page flip John Ousterhout
  2026-05-13  9:07 ` David Laight
@ 2026-05-14  9:08 ` Loktionov, Aleksandr
  1 sibling, 0 replies; 12+ messages in thread
From: Loktionov, Aleksandr @ 2026-05-14  9:08 UTC (permalink / raw)
  To: John Ousterhout, stable@vger.kernel.org
  Cc: Nguyen, Anthony L, intel-wired-lan@lists.osuosl.org,
	Kitszel, Przemyslaw, netdev@vger.kernel.org, Keller, Jacob E



> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf
> Of John Ousterhout
> Sent: Tuesday, May 12, 2026 8:20 PM
> To: stable@vger.kernel.org
> Cc: Nguyen, Anthony L <anthony.l.nguyen@intel.com>; intel-wired-
> lan@lists.osuosl.org; Kitszel, Przemyslaw
> <przemyslaw.kitszel@intel.com>; netdev@vger.kernel.org; Keller, Jacob
> E <jacob.e.keller@intel.com>; John Ousterhout <ouster@cs.stanford.edu>
> Subject: [Intel-wired-lan] [PATCH net v3] ice: fix packet corruption
> due to extraneous page flip
> 
> Consider the following sequence of events:
> * The bottom half of a buffer page is filled with data from
>   packet A. The page has a net reference count (reference count
>   - bias) of 1. The page is returned to the NIC, flipped to
>   use the top half.
> * Before the reference on the page is released, the NIC returns
>   the page with no data in it ('size' is zero in ice_clean_rx_irq).
>   In this case the bias does not get decremented. The page still
>   has a net reference count of 1, so it gets returned to the NIC.
>   However, ice_put_rx_mbuf flipped the page so that the bottom
>   half is active.
> * If the NIC stores another packet in the page before packet A
>   has released its reference, the data in packet A will be
>   overwritten with data from the new packet.
> * Unfortunately zero-length buffers occur frequently: they seem
>   to occur whenever a packet uses every available byte in a
>   buffer, ending precisely at the end of the buffer. When this
>   happens the NIC seems to generate an extra zero-length
>   buffer.
> The fix is for ice_put_rx_mbuf not to flip pages that have a size of
> 0.
> 
> This patch applies directly to longterm stable versions 6.18.27 and
> 6.12.86; it also seems relevant for 6.6.137 but would need
> modifcations for that version. I have not examined earlier versions.
> 
> Unfortunately there is no upstream commit id for this patch because
> the ICE driver has undergone a major revision (libeth refactor and
> pagepool conversion) that eliminated the buggy code. Thus the problem
> no longer exists in the main line.
> 
> Cc: stable@vger.kernel.org # 6.12+
> Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
> ---
>  drivers/net/ethernet/intel/ice/ice_txrx.c | 23 ++++++++++++++++++++--
> -
>  1 file changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c
> b/drivers/net/ethernet/intel/ice/ice_txrx.c
> index 51c459a3e722..081c7a7392b7 100644
> --- a/drivers/net/ethernet/intel/ice/ice_txrx.c
> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
> @@ -1215,6 +1215,13 @@ static void ice_put_rx_mbuf(struct ice_rx_ring
> *rx_ring, struct xdp_buff *xdp,
>  		xdp_frags = xdp_get_shared_info_from_buff(xdp)-
> >nr_frags;
> 
>  	while (idx != ntc) {
> +		union ice_32b_rx_flex_desc *rx_desc;
> +		unsigned int size;
> +
> +		rx_desc = ICE_RX_DESC(rx_ring, idx);
> +		size = le16_to_cpu(rx_desc->wb.pkt_len) &
> +		       ICE_RX_FLX_DESC_PKT_LEN_M;
> +
>  		buf = &rx_ring->rx_buf[idx];
>  		if (++idx == cnt)
>  			idx = 0;
> @@ -1224,10 +1231,20 @@ static void ice_put_rx_mbuf(struct ice_rx_ring
> *rx_ring, struct xdp_buff *xdp,
>  		 * To do this, only adjust pagecnt_bias for fragments up
> to
>  		 * the total remaining after the XDP program has run.
>  		 */
> -		if (verdict != ICE_XDP_CONSUMED)
> -			ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
> -		else if (i++ <= xdp_frags)
> +		if (verdict != ICE_XDP_CONSUMED) {
> +			/* Don't "flip" the page if size is 0: in this
> case
> +			 * the data in the current half will not be used
> so
> +			 * it's OK to reuse that half. And, since the
> bias
> +			 * didn't get decremented for this half, the page
> can
> +			 * be returned to the NIC even if the other half
> is
> +			 * still in use, so flipping the page could cause
> +			 * live packet data to be overwritten.
> +			 */
> +			if (size != 0)
> +				ice_rx_buf_adjust_pg_offset(buf, xdp-
> >frame_sz);
> +		} else if (i++ <= xdp_frags) {
>  			buf->pagecnt_bias++;
> +		}
> 
>  		ice_put_rx_buf(rx_ring, buf);
>  	}
> --
> 2.43.0

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net v3] ice: fix packet corruption due to extraneous page flip
  2026-05-14  4:47       ` John Ousterhout
@ 2026-05-14 10:01         ` David Laight
  2026-05-14 16:43           ` Jacob Keller
  0 siblings, 1 reply; 12+ messages in thread
From: David Laight @ 2026-05-14 10:01 UTC (permalink / raw)
  To: John Ousterhout
  Cc: stable, anthony.l.nguyen, intel-wired-lan, przemyslaw.kitszel,
	netdev, jacob.e.keller

On Wed, 13 May 2026 21:47:11 -0700
John Ousterhout <ouster@cs.stanford.edu> wrote:

> On Wed, May 13, 2026 at 1:49 PM David Laight
> <david.laight.linux@gmail.com> wrote:
> >
> > On Wed, 13 May 2026 09:28:40 -0700
> > John Ousterhout <ouster@cs.stanford.edu> wrote:
> >  
> > > On Wed, May 13, 2026 at 2:07 AM David Laight
> > > <david.laight.linux@gmail.com> wrote:  
> > > >
> > > > On Tue, 12 May 2026 11:19:53 -0700
> > > > John Ousterhout <ouster@cs.stanford.edu> wrote:
> > > >  
> > > > > Consider the following sequence of events:
> > > > > * The bottom half of a buffer page is filled with data from
> > > > >   packet A. The page has a net reference count (reference count
> > > > >   - bias) of 1. The page is returned to the NIC, flipped to
> > > > >   use the top half.
> > > > > * Before the reference on the page is released, the NIC returns
> > > > >   the page with no data in it ('size' is zero in ice_clean_rx_irq).
> > > > >   In this case the bias does not get decremented. The page still
> > > > >   has a net reference count of 1, so it gets returned to the NIC.
> > > > >   However, ice_put_rx_mbuf flipped the page so that the bottom
> > > > >   half is active.
> > > > > * If the NIC stores another packet in the page before packet A
> > > > >   has released its reference, the data in packet A will be
> > > > >   overwritten with data from the new packet.
> > > > > * Unfortunately zero-length buffers occur frequently: they seem
> > > > >   to occur whenever a packet uses every available byte in a
> > > > >   buffer, ending precisely at the end of the buffer. When this
> > > > >   happens the NIC seems to generate an extra zero-length
> > > > >   buffer.
> > > > > The fix is for ice_put_rx_mbuf not to flip pages that have a
> > > > > size of 0.  
> > > >
> > > > How is this different from packet B (in the top half) being
> > > > freed before packet A (in the bottom half)?  
> > >
> > > I'm not sure exactly what you're referring to here. Are you asking
> > > about a situation where both halves of the page get filled with packet
> > > data and then the second half to be filled is the first to be freed? I
> > > believe that the ICE driver abandons a page if both halves are ever
> > > occupied simultaneously; the page will be returned to the system once
> > > both halves have dropped their references. Thus it doesn't matter
> > > which half is freed first.  
> >
> > That is what I was thinking, seems like the logic is over complicated.
> >
> > If you need to put 4k pages into some kind of iommu rather than 2k buffers
> > (to contain 1536 byte ethernet packets) then I'd have thought you'd
> > initially put both halves into adjacent tx ring entries.
> > If a rx buffer is discarded (eg a zero length fragment or a CRC error,
> > or even 'copy break' for short packets) then, as an optimisation,
> > you could reuse the buffer for another receive.
> > The same could be done if the page is freed by an application.
> >
> > However it sounds like it doesn't use the 2nd half until the first
> > completes - otherwise you'd never 'flip' to make the other half
> > active.
> >
> > Thinks...
> > By only putting half of each 4k 'page' into the rx ring the code
> > will usually save (expensive) iommu setup in the (probably) normal
> > case where the buffers are freed 'reasonably quickly'.
> > But that really requires a 'free/with_nic/busy' state for each half
> > rather then trying to guess from a reference count.
> >
> > But if the low-level code is recycling the rx buffer (for any reason)
> > it wants to use the same buffer.
> >
> > The ethernet driver I wrote (a long time ago, early 90s) allocated
> > 64k as 128 512byte buffers and did an aligned word-sized copy of
> > every receive frame - most frames were in contiguous memory.
> > The simplicity of it made up for the cost of the copy, especially
> > since that was an iommu system.  
> 
> I'm not here to defend the logic (and it has been replaced with
> something that is probably simpler and more efficient); I'm just
> suggesting a bug fix for the stable releases that still have this
> logic.

You've forced me to look at all of the function :-)
I've noticed a few things:
- If ice_add_xdp_frag() fails (because there are too many fragments)
  then the rest of the fragments are left in the tx ring (instead
  of being discarded) - so are likely to be treated as a full packet
  later on.
- Frames with status errors (crc, framing etc) are discarded after
  the skb is built - surely that should happen before the xdp 'program'
  is called.
- If the remote system send a very very long frame (traditionally the PHY's
  'jabber detect' didn't always work) you can end up with all of the rx
  ring being full of a single partial packet.

I think you need to avoid calling ice_add_xdp_frag() when 'size == 0'.
Then in ice_put_rx_mbuf() unconditionally call ice_put_rx_buf() for
zero length fragments.
The comment would be 'zero length fragments can always be reused'.

The zero length fragments almost certainly exist because the mac hardware
advances the the new buffer expecting more data - but only gets the 
4 byte CRC. So the zero length buffer contains the receive status.

-- David

> 
> -John-


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net v3] ice: fix packet corruption due to extraneous page flip
  2026-05-14 10:01         ` David Laight
@ 2026-05-14 16:43           ` Jacob Keller
  2026-05-26 12:47             ` [Intel-wired-lan] " Petr Oros
  0 siblings, 1 reply; 12+ messages in thread
From: Jacob Keller @ 2026-05-14 16:43 UTC (permalink / raw)
  To: David Laight, John Ousterhout
  Cc: stable, anthony.l.nguyen, intel-wired-lan, przemyslaw.kitszel,
	netdev

On 5/14/2026 3:01 AM, David Laight wrote:
> On Wed, 13 May 2026 21:47:11 -0700
> John Ousterhout <ouster@cs.stanford.edu> wrote:
> 
>> On Wed, May 13, 2026 at 1:49 PM David Laight
>> <david.laight.linux@gmail.com> wrote:
>>>
>>> On Wed, 13 May 2026 09:28:40 -0700
>>> John Ousterhout <ouster@cs.stanford.edu> wrote:
>>>  
>>>> On Wed, May 13, 2026 at 2:07 AM David Laight
>>>> <david.laight.linux@gmail.com> wrote:  
>>>>>
>>>>> On Tue, 12 May 2026 11:19:53 -0700
>>>>> John Ousterhout <ouster@cs.stanford.edu> wrote:
>>>>>  
>>>>>> Consider the following sequence of events:
>>>>>> * The bottom half of a buffer page is filled with data from
>>>>>>   packet A. The page has a net reference count (reference count
>>>>>>   - bias) of 1. The page is returned to the NIC, flipped to
>>>>>>   use the top half.
>>>>>> * Before the reference on the page is released, the NIC returns
>>>>>>   the page with no data in it ('size' is zero in ice_clean_rx_irq).
>>>>>>   In this case the bias does not get decremented. The page still
>>>>>>   has a net reference count of 1, so it gets returned to the NIC.
>>>>>>   However, ice_put_rx_mbuf flipped the page so that the bottom
>>>>>>   half is active.
>>>>>> * If the NIC stores another packet in the page before packet A
>>>>>>   has released its reference, the data in packet A will be
>>>>>>   overwritten with data from the new packet.
>>>>>> * Unfortunately zero-length buffers occur frequently: they seem
>>>>>>   to occur whenever a packet uses every available byte in a
>>>>>>   buffer, ending precisely at the end of the buffer. When this
>>>>>>   happens the NIC seems to generate an extra zero-length
>>>>>>   buffer.
>>>>>> The fix is for ice_put_rx_mbuf not to flip pages that have a
>>>>>> size of 0.  
>>>>>
>>>>> How is this different from packet B (in the top half) being
>>>>> freed before packet A (in the bottom half)?  
>>>>
>>>> I'm not sure exactly what you're referring to here. Are you asking
>>>> about a situation where both halves of the page get filled with packet
>>>> data and then the second half to be filled is the first to be freed? I
>>>> believe that the ICE driver abandons a page if both halves are ever
>>>> occupied simultaneously; the page will be returned to the system once
>>>> both halves have dropped their references. Thus it doesn't matter
>>>> which half is freed first.  
>>>
>>> That is what I was thinking, seems like the logic is over complicated.
>>>
>>> If you need to put 4k pages into some kind of iommu rather than 2k buffers
>>> (to contain 1536 byte ethernet packets) then I'd have thought you'd
>>> initially put both halves into adjacent tx ring entries.
>>> If a rx buffer is discarded (eg a zero length fragment or a CRC error,
>>> or even 'copy break' for short packets) then, as an optimisation,
>>> you could reuse the buffer for another receive.
>>> The same could be done if the page is freed by an application.
>>>
>>> However it sounds like it doesn't use the 2nd half until the first
>>> completes - otherwise you'd never 'flip' to make the other half
>>> active.
>>>
>>> Thinks...
>>> By only putting half of each 4k 'page' into the rx ring the code
>>> will usually save (expensive) iommu setup in the (probably) normal
>>> case where the buffers are freed 'reasonably quickly'.
>>> But that really requires a 'free/with_nic/busy' state for each half
>>> rather then trying to guess from a reference count.
>>>
>>> But if the low-level code is recycling the rx buffer (for any reason)
>>> it wants to use the same buffer.
>>>
>>> The ethernet driver I wrote (a long time ago, early 90s) allocated
>>> 64k as 128 512byte buffers and did an aligned word-sized copy of
>>> every receive frame - most frames were in contiguous memory.
>>> The simplicity of it made up for the cost of the copy, especially
>>> since that was an iommu system.  
>>
>> I'm not here to defend the logic (and it has been replaced with
>> something that is probably simpler and more efficient); I'm just
>> suggesting a bug fix for the stable releases that still have this
>> logic.

Right. We definitely want a fix for the possible data corruption in
stable. Ideally one as simple as possible.

> 
> You've forced me to look at all of the function :-)
> I've noticed a few things:
> - If ice_add_xdp_frag() fails (because there are too many fragments)
>   then the rest of the fragments are left in the tx ring (instead
>   of being discarded) - so are likely to be treated as a full packet
>   later on.
> - Frames with status errors (crc, framing etc) are discarded after
>   the skb is built - surely that should happen before the xdp 'program'
>   is called.
> - If the remote system send a very very long frame (traditionally the PHY's
>   'jabber detect' didn't always work) you can end up with all of the rx
>   ring being full of a single partial packet.
> 
> I think you need to avoid calling ice_add_xdp_frag() when 'size == 0'.
> Then in ice_put_rx_mbuf() unconditionally call ice_put_rx_buf() for
> zero length fragments.
> The comment would be 'zero length fragments can always be reused'.
> 

That seems correct.

> The zero length fragments almost certainly exist because the mac hardware
> advances the the new buffer expecting more data - but only gets the 
> 4 byte CRC. So the zero length buffer contains the receive status.
> 

That matches my understanding.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Intel-wired-lan] [PATCH net v3] ice: fix packet corruption due to extraneous page flip
  2026-05-14 16:43           ` Jacob Keller
@ 2026-05-26 12:47             ` Petr Oros
  2026-05-26 15:11               ` David Laight
                                 ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Petr Oros @ 2026-05-26 12:47 UTC (permalink / raw)
  To: Jacob Keller, David Laight, John Ousterhout
  Cc: stable, anthony.l.nguyen, intel-wired-lan, przemyslaw.kitszel,
	netdev


On 5/14/26 18:43, Jacob Keller wrote:
> On 5/14/2026 3:01 AM, David Laight wrote:
>> On Wed, 13 May 2026 21:47:11 -0700
>> John Ousterhout <ouster@cs.stanford.edu> wrote:
>>
>>> On Wed, May 13, 2026 at 1:49 PM David Laight
>>> <david.laight.linux@gmail.com> wrote:
>>>> On Wed, 13 May 2026 09:28:40 -0700
>>>> John Ousterhout <ouster@cs.stanford.edu> wrote:
>>>>   
>>>>> On Wed, May 13, 2026 at 2:07 AM David Laight
>>>>> <david.laight.linux@gmail.com> wrote:
>>>>>> On Tue, 12 May 2026 11:19:53 -0700
>>>>>> John Ousterhout <ouster@cs.stanford.edu> wrote:
>>>>>>   
>>>>>>> Consider the following sequence of events:
>>>>>>> * The bottom half of a buffer page is filled with data from
>>>>>>>    packet A. The page has a net reference count (reference count
>>>>>>>    - bias) of 1. The page is returned to the NIC, flipped to
>>>>>>>    use the top half.
>>>>>>> * Before the reference on the page is released, the NIC returns
>>>>>>>    the page with no data in it ('size' is zero in ice_clean_rx_irq).
>>>>>>>    In this case the bias does not get decremented. The page still
>>>>>>>    has a net reference count of 1, so it gets returned to the NIC.
>>>>>>>    However, ice_put_rx_mbuf flipped the page so that the bottom
>>>>>>>    half is active.
>>>>>>> * If the NIC stores another packet in the page before packet A
>>>>>>>    has released its reference, the data in packet A will be
>>>>>>>    overwritten with data from the new packet.
>>>>>>> * Unfortunately zero-length buffers occur frequently: they seem
>>>>>>>    to occur whenever a packet uses every available byte in a
>>>>>>>    buffer, ending precisely at the end of the buffer. When this
>>>>>>>    happens the NIC seems to generate an extra zero-length
>>>>>>>    buffer.
>>>>>>> The fix is for ice_put_rx_mbuf not to flip pages that have a
>>>>>>> size of 0.
>>>>>> How is this different from packet B (in the top half) being
>>>>>> freed before packet A (in the bottom half)?
>>>>> I'm not sure exactly what you're referring to here. Are you asking
>>>>> about a situation where both halves of the page get filled with packet
>>>>> data and then the second half to be filled is the first to be freed? I
>>>>> believe that the ICE driver abandons a page if both halves are ever
>>>>> occupied simultaneously; the page will be returned to the system once
>>>>> both halves have dropped their references. Thus it doesn't matter
>>>>> which half is freed first.
>>>> That is what I was thinking, seems like the logic is over complicated.
>>>>
>>>> If you need to put 4k pages into some kind of iommu rather than 2k buffers
>>>> (to contain 1536 byte ethernet packets) then I'd have thought you'd
>>>> initially put both halves into adjacent tx ring entries.
>>>> If a rx buffer is discarded (eg a zero length fragment or a CRC error,
>>>> or even 'copy break' for short packets) then, as an optimisation,
>>>> you could reuse the buffer for another receive.
>>>> The same could be done if the page is freed by an application.
>>>>
>>>> However it sounds like it doesn't use the 2nd half until the first
>>>> completes - otherwise you'd never 'flip' to make the other half
>>>> active.
>>>>
>>>> Thinks...
>>>> By only putting half of each 4k 'page' into the rx ring the code
>>>> will usually save (expensive) iommu setup in the (probably) normal
>>>> case where the buffers are freed 'reasonably quickly'.
>>>> But that really requires a 'free/with_nic/busy' state for each half
>>>> rather then trying to guess from a reference count.
>>>>
>>>> But if the low-level code is recycling the rx buffer (for any reason)
>>>> it wants to use the same buffer.
>>>>
>>>> The ethernet driver I wrote (a long time ago, early 90s) allocated
>>>> 64k as 128 512byte buffers and did an aligned word-sized copy of
>>>> every receive frame - most frames were in contiguous memory.
>>>> The simplicity of it made up for the cost of the copy, especially
>>>> since that was an iommu system.
>>> I'm not here to defend the logic (and it has been replaced with
>>> something that is probably simpler and more efficient); I'm just
>>> suggesting a bug fix for the stable releases that still have this
>>> logic.
> Right. We definitely want a fix for the possible data corruption in
> stable. Ideally one as simple as possible.
>
>> You've forced me to look at all of the function :-)
>> I've noticed a few things:
>> - If ice_add_xdp_frag() fails (because there are too many fragments)
>>    then the rest of the fragments are left in the tx ring (instead
>>    of being discarded) - so are likely to be treated as a full packet
>>    later on.
>> - Frames with status errors (crc, framing etc) are discarded after
>>    the skb is built - surely that should happen before the xdp 'program'
>>    is called.
>> - If the remote system send a very very long frame (traditionally the PHY's
>>    'jabber detect' didn't always work) you can end up with all of the rx
>>    ring being full of a single partial packet.
>>
>> I think you need to avoid calling ice_add_xdp_frag() when 'size == 0'.
>> Then in ice_put_rx_mbuf() unconditionally call ice_put_rx_buf() for
>> zero length fragments.
>> The comment would be 'zero length fragments can always be reused'.
>>
> That seems correct.
>
>> The zero length fragments almost certainly exist because the mac hardware
>> advances the the new buffer expecting more data - but only gets the
>> 4 byte CRC. So the zero length buffer contains the receive status.
>>
> That matches my understanding.
Hi John,

I have been looking at the same area in the pre-page-pool ice code and
I want to ask whether you observed memory growth during your Homa runs
that exposed the corruption, because in my testing the same bias mismatch
also produces a slow page leak that your v3 does not close.

Short version of the leak path, in the PASS (!CONSUMED) branch:

   1. ice_get_rx_buf(size=0) does pagecnt_bias-- unconditionally
      (added by commit ef68094cb09e ("ice: Fix kernel panic due to page
      refcount underflow") as the fix for the matching panic).
   2. ice_add_xdp_frag() then returns 0 for size==0, so that page is
      never attached to the xdp_buff/SKB. Nobody downstream will ever
      call put_page() to balance the pagecnt_bias-- from step 1.
   3. Your v3 in ice_put_rx_mbuf() correctly skips the page flip for
      size==0, which closes the corruption window. But it does not
      restore pagecnt_bias for that zero size buffer, so the page is
      handed back to ice_reuse_rx_page() with a permanent deficit of 1.
   4. On the next reuse of that page with size > 0, pagecnt_bias drops
      again. ice_can_reuse_rx_page() now sees pgcnt - bias == 2 and
      drains via __page_frag_cache_drain(page, pagecnt_bias). Because
      pagecnt_bias is one too low, the drain undershoots by 1: page
      refcount stays at 2 instead of 1.
   5. The SKB eventually releases its reference (refcount -> 1), but
      nothing ever brings it to 0. The page is leaked.
      ice_alloc_rx_bufs() just allocates a fresh page to fill the slot.

At the zero size frequency you mentioned (thousands per second), this
adds up to roughly MB/s of leaked page cache, which Jaroslav Pulchart
originally reported against 6.13.y on NUMA nodes and which motivated
the libeth/page_pool conversion in mainline. So in stable trees the
leak side of this bug is still live.

Two questions:

   - Did you monitor RSS / page allocator stats over the duration of
     your Homa runs? If you did and did not see growth, I would like
     to understand what is different about your setup, because by my
     reading of the code the leak should fire whenever both halves of
     a page end up in SKBs simultaneously and one of them carried a
     zero size descriptor along the way.

   - If your focus was specifically the corruption, would you be open
     to extending v3 (or replacing it) with a fix that also restores
     pagecnt_bias for the size==0 case? The minimal extension is one
     extra branch in ice_put_rx_mbuf:

         if (verdict != ICE_XDP_CONSUMED && size != 0)
                 ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
         else
                 buf->pagecnt_bias++;

     which restores bias on every path where the page is not actually
     going out to an SKB. (I have a slightly different variant that
     tracks has_data in struct ice_rx_buf to also handle the broken
     positional 'i <= xdp_frags' counter in the CONSUMED path, where
     zero size descriptors in the middle of a frame steal bias++ slots
     from real fragments. Happy to share it if useful.)

Regards,
Petr


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Intel-wired-lan] [PATCH net v3] ice: fix packet corruption due to extraneous page flip
  2026-05-26 12:47             ` [Intel-wired-lan] " Petr Oros
@ 2026-05-26 15:11               ` David Laight
  2026-05-26 22:17               ` John Ousterhout
  2026-06-01 15:51               ` Maciej Fijalkowski
  2 siblings, 0 replies; 12+ messages in thread
From: David Laight @ 2026-05-26 15:11 UTC (permalink / raw)
  To: Petr Oros
  Cc: Jacob Keller, John Ousterhout, stable, anthony.l.nguyen,
	intel-wired-lan, przemyslaw.kitszel, netdev

On Tue, 26 May 2026 14:47:42 +0200
Petr Oros <poros@redhat.com> wrote:

> On 5/14/26 18:43, Jacob Keller wrote:
> > On 5/14/2026 3:01 AM, David Laight wrote:  
> >> On Wed, 13 May 2026 21:47:11 -0700
> >> John Ousterhout <ouster@cs.stanford.edu> wrote:
> >>  
> >>> On Wed, May 13, 2026 at 1:49 PM David Laight
> >>> <david.laight.linux@gmail.com> wrote:  
> >>>> On Wed, 13 May 2026 09:28:40 -0700
> >>>> John Ousterhout <ouster@cs.stanford.edu> wrote:
> >>>>     
> >>>>> On Wed, May 13, 2026 at 2:07 AM David Laight
> >>>>> <david.laight.linux@gmail.com> wrote:  
> >>>>>> On Tue, 12 May 2026 11:19:53 -0700
> >>>>>> John Ousterhout <ouster@cs.stanford.edu> wrote:
> >>>>>>     
> >>>>>>> Consider the following sequence of events:
> >>>>>>> * The bottom half of a buffer page is filled with data from
> >>>>>>>    packet A. The page has a net reference count (reference count
> >>>>>>>    - bias) of 1. The page is returned to the NIC, flipped to
> >>>>>>>    use the top half.
> >>>>>>> * Before the reference on the page is released, the NIC returns
> >>>>>>>    the page with no data in it ('size' is zero in ice_clean_rx_irq).
> >>>>>>>    In this case the bias does not get decremented. The page still
> >>>>>>>    has a net reference count of 1, so it gets returned to the NIC.
> >>>>>>>    However, ice_put_rx_mbuf flipped the page so that the bottom
> >>>>>>>    half is active.
> >>>>>>> * If the NIC stores another packet in the page before packet A
> >>>>>>>    has released its reference, the data in packet A will be
> >>>>>>>    overwritten with data from the new packet.
> >>>>>>> * Unfortunately zero-length buffers occur frequently: they seem
> >>>>>>>    to occur whenever a packet uses every available byte in a
> >>>>>>>    buffer, ending precisely at the end of the buffer. When this
> >>>>>>>    happens the NIC seems to generate an extra zero-length
> >>>>>>>    buffer.
> >>>>>>> The fix is for ice_put_rx_mbuf not to flip pages that have a
> >>>>>>> size of 0.  
> >>>>>> How is this different from packet B (in the top half) being
> >>>>>> freed before packet A (in the bottom half)?  
> >>>>> I'm not sure exactly what you're referring to here. Are you asking
> >>>>> about a situation where both halves of the page get filled with packet
> >>>>> data and then the second half to be filled is the first to be freed? I
> >>>>> believe that the ICE driver abandons a page if both halves are ever
> >>>>> occupied simultaneously; the page will be returned to the system once
> >>>>> both halves have dropped their references. Thus it doesn't matter
> >>>>> which half is freed first.  
> >>>> That is what I was thinking, seems like the logic is over complicated.
> >>>>
> >>>> If you need to put 4k pages into some kind of iommu rather than 2k buffers
> >>>> (to contain 1536 byte ethernet packets) then I'd have thought you'd
> >>>> initially put both halves into adjacent tx ring entries.
> >>>> If a rx buffer is discarded (eg a zero length fragment or a CRC error,
> >>>> or even 'copy break' for short packets) then, as an optimisation,
> >>>> you could reuse the buffer for another receive.
> >>>> The same could be done if the page is freed by an application.
> >>>>
> >>>> However it sounds like it doesn't use the 2nd half until the first
> >>>> completes - otherwise you'd never 'flip' to make the other half
> >>>> active.
> >>>>
> >>>> Thinks...
> >>>> By only putting half of each 4k 'page' into the rx ring the code
> >>>> will usually save (expensive) iommu setup in the (probably) normal
> >>>> case where the buffers are freed 'reasonably quickly'.
> >>>> But that really requires a 'free/with_nic/busy' state for each half
> >>>> rather then trying to guess from a reference count.
> >>>>
> >>>> But if the low-level code is recycling the rx buffer (for any reason)
> >>>> it wants to use the same buffer.
> >>>>
> >>>> The ethernet driver I wrote (a long time ago, early 90s) allocated
> >>>> 64k as 128 512byte buffers and did an aligned word-sized copy of
> >>>> every receive frame - most frames were in contiguous memory.
> >>>> The simplicity of it made up for the cost of the copy, especially
> >>>> since that was an iommu system.  
> >>> I'm not here to defend the logic (and it has been replaced with
> >>> something that is probably simpler and more efficient); I'm just
> >>> suggesting a bug fix for the stable releases that still have this
> >>> logic.  
> > Right. We definitely want a fix for the possible data corruption in
> > stable. Ideally one as simple as possible.
> >  
> >> You've forced me to look at all of the function :-)
> >> I've noticed a few things:
> >> - If ice_add_xdp_frag() fails (because there are too many fragments)
> >>    then the rest of the fragments are left in the tx ring (instead
> >>    of being discarded) - so are likely to be treated as a full packet
> >>    later on.
> >> - Frames with status errors (crc, framing etc) are discarded after
> >>    the skb is built - surely that should happen before the xdp 'program'
> >>    is called.
> >> - If the remote system send a very very long frame (traditionally the PHY's
> >>    'jabber detect' didn't always work) you can end up with all of the rx
> >>    ring being full of a single partial packet.
> >>
> >> I think you need to avoid calling ice_add_xdp_frag() when 'size == 0'.
> >> Then in ice_put_rx_mbuf() unconditionally call ice_put_rx_buf() for
> >> zero length fragments.
> >> The comment would be 'zero length fragments can always be reused'.
> >>  
> > That seems correct.
> >  
> >> The zero length fragments almost certainly exist because the mac hardware
> >> advances the the new buffer expecting more data - but only gets the
> >> 4 byte CRC. So the zero length buffer contains the receive status.
> >>  
> > That matches my understanding.  
> Hi John,
> 
> I have been looking at the same area in the pre-page-pool ice code and
> I want to ask whether you observed memory growth during your Homa runs
> that exposed the corruption, because in my testing the same bias mismatch
> also produces a slow page leak that your v3 does not close.
> 
> Short version of the leak path, in the PASS (!CONSUMED) branch:
> 
>    1. ice_get_rx_buf(size=0) does pagecnt_bias-- unconditionally
>       (added by commit ef68094cb09e ("ice: Fix kernel panic due to page
>       refcount underflow") as the fix for the matching panic).
>    2. ice_add_xdp_frag() then returns 0 for size==0, so that page is
>       never attached to the xdp_buff/SKB. Nobody downstream will ever
>       call put_page() to balance the pagecnt_bias-- from step 1.
>    3. Your v3 in ice_put_rx_mbuf() correctly skips the page flip for
>       size==0, which closes the corruption window. But it does not
>       restore pagecnt_bias for that zero size buffer, so the page is
>       handed back to ice_reuse_rx_page() with a permanent deficit of 1.
>    4. On the next reuse of that page with size > 0, pagecnt_bias drops
>       again. ice_can_reuse_rx_page() now sees pgcnt - bias == 2 and
>       drains via __page_frag_cache_drain(page, pagecnt_bias). Because
>       pagecnt_bias is one too low, the drain undershoots by 1: page
>       refcount stays at 2 instead of 1.
>    5. The SKB eventually releases its reference (refcount -> 1), but
>       nothing ever brings it to 0. The page is leaked.
>       ice_alloc_rx_bufs() just allocates a fresh page to fill the slot.
> 
> At the zero size frequency you mentioned (thousands per second), this
> adds up to roughly MB/s of leaked page cache, which Jaroslav Pulchart
> originally reported against 6.13.y on NUMA nodes and which motivated
> the libeth/page_pool conversion in mainline. So in stable trees the
> leak side of this bug is still live.
> 
> Two questions:
> 
>    - Did you monitor RSS / page allocator stats over the duration of
>      your Homa runs? If you did and did not see growth, I would like
>      to understand what is different about your setup, because by my
>      reading of the code the leak should fire whenever both halves of
>      a page end up in SKBs simultaneously and one of them carried a
>      zero size descriptor along the way.
> 
>    - If your focus was specifically the corruption, would you be open
>      to extending v3 (or replacing it) with a fix that also restores
>      pagecnt_bias for the size==0 case? The minimal extension is one
>      extra branch in ice_put_rx_mbuf:
> 
>          if (verdict != ICE_XDP_CONSUMED && size != 0)
>                  ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
>          else
>                  buf->pagecnt_bias++;
> 
>      which restores bias on every path where the page is not actually
>      going out to an SKB. (I have a slightly different variant that
>      tracks has_data in struct ice_rx_buf to also handle the broken
>      positional 'i <= xdp_frags' counter in the CONSUMED path, where
>      zero size descriptors in the middle of a frame steal bias++ slots
>      from real fragments. Happy to share it if useful.)

By thought was:

I think you need to avoid calling ice_add_xdp_frag() when 'size == 0'.
Then in ice_put_rx_mbuf() unconditionally call ice_put_rx_buf() for
zero length fragments (regardless of verdict).
The comment would be 'zero length fragments can always be reused'.

I think that path always reuses the same half of the page without
going near the 'bias' code paths (which I didn't manage to grok).
It is the same path that is used for frames with bad CRC (ignoring
the broken paths when xdp is enabled).

-- David

> 
> Regards,
> Petr
> 
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Intel-wired-lan] [PATCH net v3] ice: fix packet corruption due to extraneous page flip
  2026-05-26 12:47             ` [Intel-wired-lan] " Petr Oros
  2026-05-26 15:11               ` David Laight
@ 2026-05-26 22:17               ` John Ousterhout
  2026-06-01 15:51               ` Maciej Fijalkowski
  2 siblings, 0 replies; 12+ messages in thread
From: John Ousterhout @ 2026-05-26 22:17 UTC (permalink / raw)
  To: Petr Oros
  Cc: Jacob Keller, David Laight, stable, anthony.l.nguyen,
	intel-wired-lan, przemyslaw.kitszel, netdev

On Tue, May 26, 2026 at 5:47 AM Petr Oros <poros@redhat.com> wrote:
>
>
> On 5/14/26 18:43, Jacob Keller wrote:
> > On 5/14/2026 3:01 AM, David Laight wrote:
> >> On Wed, 13 May 2026 21:47:11 -0700
> >> John Ousterhout <ouster@cs.stanford.edu> wrote:
> >>
> >>> On Wed, May 13, 2026 at 1:49 PM David Laight
> >>> <david.laight.linux@gmail.com> wrote:
> >>>> On Wed, 13 May 2026 09:28:40 -0700
> >>>> John Ousterhout <ouster@cs.stanford.edu> wrote:
> >>>>
> >>>>> On Wed, May 13, 2026 at 2:07 AM David Laight
> >>>>> <david.laight.linux@gmail.com> wrote:
> >>>>>> On Tue, 12 May 2026 11:19:53 -0700
> >>>>>> John Ousterhout <ouster@cs.stanford.edu> wrote:
> >>>>>>
> >>>>>>> Consider the following sequence of events:
> >>>>>>> * The bottom half of a buffer page is filled with data from
> >>>>>>>    packet A. The page has a net reference count (reference count
> >>>>>>>    - bias) of 1. The page is returned to the NIC, flipped to
> >>>>>>>    use the top half.
> >>>>>>> * Before the reference on the page is released, the NIC returns
> >>>>>>>    the page with no data in it ('size' is zero in ice_clean_rx_irq).
> >>>>>>>    In this case the bias does not get decremented. The page still
> >>>>>>>    has a net reference count of 1, so it gets returned to the NIC.
> >>>>>>>    However, ice_put_rx_mbuf flipped the page so that the bottom
> >>>>>>>    half is active.
> >>>>>>> * If the NIC stores another packet in the page before packet A
> >>>>>>>    has released its reference, the data in packet A will be
> >>>>>>>    overwritten with data from the new packet.
> >>>>>>> * Unfortunately zero-length buffers occur frequently: they seem
> >>>>>>>    to occur whenever a packet uses every available byte in a
> >>>>>>>    buffer, ending precisely at the end of the buffer. When this
> >>>>>>>    happens the NIC seems to generate an extra zero-length
> >>>>>>>    buffer.
> >>>>>>> The fix is for ice_put_rx_mbuf not to flip pages that have a
> >>>>>>> size of 0.
> >>>>>> How is this different from packet B (in the top half) being
> >>>>>> freed before packet A (in the bottom half)?
> >>>>> I'm not sure exactly what you're referring to here. Are you asking
> >>>>> about a situation where both halves of the page get filled with packet
> >>>>> data and then the second half to be filled is the first to be freed? I
> >>>>> believe that the ICE driver abandons a page if both halves are ever
> >>>>> occupied simultaneously; the page will be returned to the system once
> >>>>> both halves have dropped their references. Thus it doesn't matter
> >>>>> which half is freed first.
> >>>> That is what I was thinking, seems like the logic is over complicated.
> >>>>
> >>>> If you need to put 4k pages into some kind of iommu rather than 2k buffers
> >>>> (to contain 1536 byte ethernet packets) then I'd have thought you'd
> >>>> initially put both halves into adjacent tx ring entries.
> >>>> If a rx buffer is discarded (eg a zero length fragment or a CRC error,
> >>>> or even 'copy break' for short packets) then, as an optimisation,
> >>>> you could reuse the buffer for another receive.
> >>>> The same could be done if the page is freed by an application.
> >>>>
> >>>> However it sounds like it doesn't use the 2nd half until the first
> >>>> completes - otherwise you'd never 'flip' to make the other half
> >>>> active.
> >>>>
> >>>> Thinks...
> >>>> By only putting half of each 4k 'page' into the rx ring the code
> >>>> will usually save (expensive) iommu setup in the (probably) normal
> >>>> case where the buffers are freed 'reasonably quickly'.
> >>>> But that really requires a 'free/with_nic/busy' state for each half
> >>>> rather then trying to guess from a reference count.
> >>>>
> >>>> But if the low-level code is recycling the rx buffer (for any reason)
> >>>> it wants to use the same buffer.
> >>>>
> >>>> The ethernet driver I wrote (a long time ago, early 90s) allocated
> >>>> 64k as 128 512byte buffers and did an aligned word-sized copy of
> >>>> every receive frame - most frames were in contiguous memory.
> >>>> The simplicity of it made up for the cost of the copy, especially
> >>>> since that was an iommu system.
> >>> I'm not here to defend the logic (and it has been replaced with
> >>> something that is probably simpler and more efficient); I'm just
> >>> suggesting a bug fix for the stable releases that still have this
> >>> logic.
> > Right. We definitely want a fix for the possible data corruption in
> > stable. Ideally one as simple as possible.
> >
> >> You've forced me to look at all of the function :-)
> >> I've noticed a few things:
> >> - If ice_add_xdp_frag() fails (because there are too many fragments)
> >>    then the rest of the fragments are left in the tx ring (instead
> >>    of being discarded) - so are likely to be treated as a full packet
> >>    later on.
> >> - Frames with status errors (crc, framing etc) are discarded after
> >>    the skb is built - surely that should happen before the xdp 'program'
> >>    is called.
> >> - If the remote system send a very very long frame (traditionally the PHY's
> >>    'jabber detect' didn't always work) you can end up with all of the rx
> >>    ring being full of a single partial packet.
> >>
> >> I think you need to avoid calling ice_add_xdp_frag() when 'size == 0'.
> >> Then in ice_put_rx_mbuf() unconditionally call ice_put_rx_buf() for
> >> zero length fragments.
> >> The comment would be 'zero length fragments can always be reused'.
> >>
> > That seems correct.
> >
> >> The zero length fragments almost certainly exist because the mac hardware
> >> advances the the new buffer expecting more data - but only gets the
> >> 4 byte CRC. So the zero length buffer contains the receive status.
> >>
> > That matches my understanding.
> Hi John,
>
> I have been looking at the same area in the pre-page-pool ice code and
> I want to ask whether you observed memory growth during your Homa runs
> that exposed the corruption, because in my testing the same bias mismatch
> also produces a slow page leak that your v3 does not close.
>
> Short version of the leak path, in the PASS (!CONSUMED) branch:
>
>    1. ice_get_rx_buf(size=0) does pagecnt_bias-- unconditionally
>       (added by commit ef68094cb09e ("ice: Fix kernel panic due to page
>       refcount underflow") as the fix for the matching panic).
>    2. ice_add_xdp_frag() then returns 0 for size==0, so that page is
>       never attached to the xdp_buff/SKB. Nobody downstream will ever
>       call put_page() to balance the pagecnt_bias-- from step 1.
>    3. Your v3 in ice_put_rx_mbuf() correctly skips the page flip for
>       size==0, which closes the corruption window. But it does not
>       restore pagecnt_bias for that zero size buffer, so the page is
>       handed back to ice_reuse_rx_page() with a permanent deficit of 1.
>    4. On the next reuse of that page with size > 0, pagecnt_bias drops
>       again. ice_can_reuse_rx_page() now sees pgcnt - bias == 2 and
>       drains via __page_frag_cache_drain(page, pagecnt_bias). Because
>       pagecnt_bias is one too low, the drain undershoots by 1: page
>       refcount stays at 2 instead of 1.
>    5. The SKB eventually releases its reference (refcount -> 1), but
>       nothing ever brings it to 0. The page is leaked.
>       ice_alloc_rx_bufs() just allocates a fresh page to fill the slot.
>
> At the zero size frequency you mentioned (thousands per second), this
> adds up to roughly MB/s of leaked page cache, which Jaroslav Pulchart
> originally reported against 6.13.y on NUMA nodes and which motivated
> the libeth/page_pool conversion in mainline. So in stable trees the
> leak side of this bug is still live.
>
> Two questions:
>
>    - Did you monitor RSS / page allocator stats over the duration of
>      your Homa runs? If you did and did not see growth, I would like
>      to understand what is different about your setup, because by my
>      reading of the code the leak should fire whenever both halves of
>      a page end up in SKBs simultaneously and one of them carried a
>      zero size descriptor along the way.

I have not monitored the page allocator stats. I'm not sure I know the
best way to do this; I tried slabtop but it didn't seem to show
significant growth in memory usage.

>    - If your focus was specifically the corruption, would you be open
>      to extending v3 (or replacing it) with a fix that also restores
>      pagecnt_bias for the size==0 case? The minimal extension is one
>      extra branch in ice_put_rx_mbuf:
>
>          if (verdict != ICE_XDP_CONSUMED && size != 0)
>                  ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
>          else
>                  buf->pagecnt_bias++;
>
>      which restores bias on every path where the page is not actually
>      going out to an SKB. (I have a slightly different variant that
>      tracks has_data in struct ice_rx_buf to also handle the broken
>      positional 'i <= xdp_frags' counter in the CONSUMED path, where
>      zero size descriptors in the middle of a frame steal bias++ slots
>      from real fragments. Happy to share it if useful.)

My understanding of the ice driver is extremely limited. You may be
right about the proposed fix, but I don't currently know enough to get
comfortable with it. I think it might be better to separate your
change into a different patch, which can be shepherded by people with
appropriate understanding.

-John-

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Intel-wired-lan] [PATCH net v3] ice: fix packet corruption due to extraneous page flip
  2026-05-26 12:47             ` [Intel-wired-lan] " Petr Oros
  2026-05-26 15:11               ` David Laight
  2026-05-26 22:17               ` John Ousterhout
@ 2026-06-01 15:51               ` Maciej Fijalkowski
  2 siblings, 0 replies; 12+ messages in thread
From: Maciej Fijalkowski @ 2026-06-01 15:51 UTC (permalink / raw)
  To: Petr Oros
  Cc: Jacob Keller, David Laight, John Ousterhout, stable,
	anthony.l.nguyen, intel-wired-lan, przemyslaw.kitszel, netdev

On Tue, May 26, 2026 at 02:47:42PM +0200, Petr Oros wrote:
> 
> On 5/14/26 18:43, Jacob Keller wrote:
> > On 5/14/2026 3:01 AM, David Laight wrote:
> > > On Wed, 13 May 2026 21:47:11 -0700
> > > John Ousterhout <ouster@cs.stanford.edu> wrote:
> > > 
> > > > On Wed, May 13, 2026 at 1:49 PM David Laight
> > > > <david.laight.linux@gmail.com> wrote:
> > > > > On Wed, 13 May 2026 09:28:40 -0700
> > > > > John Ousterhout <ouster@cs.stanford.edu> wrote:
> > > > > > On Wed, May 13, 2026 at 2:07 AM David Laight
> > > > > > <david.laight.linux@gmail.com> wrote:
> > > > > > > On Tue, 12 May 2026 11:19:53 -0700
> > > > > > > John Ousterhout <ouster@cs.stanford.edu> wrote:
> > > > > > > > Consider the following sequence of events:
> > > > > > > > * The bottom half of a buffer page is filled with data from
> > > > > > > >    packet A. The page has a net reference count (reference count
> > > > > > > >    - bias) of 1. The page is returned to the NIC, flipped to
> > > > > > > >    use the top half.
> > > > > > > > * Before the reference on the page is released, the NIC returns
> > > > > > > >    the page with no data in it ('size' is zero in ice_clean_rx_irq).
> > > > > > > >    In this case the bias does not get decremented. The page still
> > > > > > > >    has a net reference count of 1, so it gets returned to the NIC.
> > > > > > > >    However, ice_put_rx_mbuf flipped the page so that the bottom
> > > > > > > >    half is active.
> > > > > > > > * If the NIC stores another packet in the page before packet A
> > > > > > > >    has released its reference, the data in packet A will be
> > > > > > > >    overwritten with data from the new packet.
> > > > > > > > * Unfortunately zero-length buffers occur frequently: they seem
> > > > > > > >    to occur whenever a packet uses every available byte in a
> > > > > > > >    buffer, ending precisely at the end of the buffer. When this
> > > > > > > >    happens the NIC seems to generate an extra zero-length
> > > > > > > >    buffer.
> > > > > > > > The fix is for ice_put_rx_mbuf not to flip pages that have a
> > > > > > > > size of 0.
> > > > > > > How is this different from packet B (in the top half) being
> > > > > > > freed before packet A (in the bottom half)?
> > > > > > I'm not sure exactly what you're referring to here. Are you asking
> > > > > > about a situation where both halves of the page get filled with packet
> > > > > > data and then the second half to be filled is the first to be freed? I
> > > > > > believe that the ICE driver abandons a page if both halves are ever
> > > > > > occupied simultaneously; the page will be returned to the system once
> > > > > > both halves have dropped their references. Thus it doesn't matter
> > > > > > which half is freed first.
> > > > > That is what I was thinking, seems like the logic is over complicated.
> > > > > 
> > > > > If you need to put 4k pages into some kind of iommu rather than 2k buffers
> > > > > (to contain 1536 byte ethernet packets) then I'd have thought you'd
> > > > > initially put both halves into adjacent tx ring entries.
> > > > > If a rx buffer is discarded (eg a zero length fragment or a CRC error,
> > > > > or even 'copy break' for short packets) then, as an optimisation,
> > > > > you could reuse the buffer for another receive.
> > > > > The same could be done if the page is freed by an application.
> > > > > 
> > > > > However it sounds like it doesn't use the 2nd half until the first
> > > > > completes - otherwise you'd never 'flip' to make the other half
> > > > > active.
> > > > > 
> > > > > Thinks...
> > > > > By only putting half of each 4k 'page' into the rx ring the code
> > > > > will usually save (expensive) iommu setup in the (probably) normal
> > > > > case where the buffers are freed 'reasonably quickly'.
> > > > > But that really requires a 'free/with_nic/busy' state for each half
> > > > > rather then trying to guess from a reference count.
> > > > > 
> > > > > But if the low-level code is recycling the rx buffer (for any reason)
> > > > > it wants to use the same buffer.
> > > > > 
> > > > > The ethernet driver I wrote (a long time ago, early 90s) allocated
> > > > > 64k as 128 512byte buffers and did an aligned word-sized copy of
> > > > > every receive frame - most frames were in contiguous memory.
> > > > > The simplicity of it made up for the cost of the copy, especially
> > > > > since that was an iommu system.
> > > > I'm not here to defend the logic (and it has been replaced with
> > > > something that is probably simpler and more efficient); I'm just
> > > > suggesting a bug fix for the stable releases that still have this
> > > > logic.
> > Right. We definitely want a fix for the possible data corruption in
> > stable. Ideally one as simple as possible.
> > 
> > > You've forced me to look at all of the function :-)
> > > I've noticed a few things:
> > > - If ice_add_xdp_frag() fails (because there are too many fragments)
> > >    then the rest of the fragments are left in the tx ring (instead
> > >    of being discarded) - so are likely to be treated as a full packet
> > >    later on.
> > > - Frames with status errors (crc, framing etc) are discarded after
> > >    the skb is built - surely that should happen before the xdp 'program'
> > >    is called.
> > > - If the remote system send a very very long frame (traditionally the PHY's
> > >    'jabber detect' didn't always work) you can end up with all of the rx
> > >    ring being full of a single partial packet.
> > > 
> > > I think you need to avoid calling ice_add_xdp_frag() when 'size == 0'.
> > > Then in ice_put_rx_mbuf() unconditionally call ice_put_rx_buf() for
> > > zero length fragments.
> > > The comment would be 'zero length fragments can always be reused'.
> > > 
> > That seems correct.
> > 
> > > The zero length fragments almost certainly exist because the mac hardware
> > > advances the the new buffer expecting more data - but only gets the
> > > 4 byte CRC. So the zero length buffer contains the receive status.
> > > 
> > That matches my understanding.
> Hi John,
> 
> I have been looking at the same area in the pre-page-pool ice code and
> I want to ask whether you observed memory growth during your Homa runs
> that exposed the corruption, because in my testing the same bias mismatch
> also produces a slow page leak that your v3 does not close.
> 
> Short version of the leak path, in the PASS (!CONSUMED) branch:
> 
>   1. ice_get_rx_buf(size=0) does pagecnt_bias-- unconditionally
>      (added by commit ef68094cb09e ("ice: Fix kernel panic due to page
>      refcount underflow") as the fix for the matching panic).

you mean in ice_get_rx_buf() ? I do not see this part anywhere. Must be
your custom recipe when you ported 84bf1ac85af8 ("ice: fix Rx page leak on
multi-buffer frames") ?

Given the amount of issues 0 sized frags gave us I was thinking whether
they should be treated in the same manner as normal frags.

But please rule this out on your side to align with John so we could
proceed with this fix.

Thanks
Maciej

>   2. ice_add_xdp_frag() then returns 0 for size==0, so that page is
>      never attached to the xdp_buff/SKB. Nobody downstream will ever
>      call put_page() to balance the pagecnt_bias-- from step 1.
>   3. Your v3 in ice_put_rx_mbuf() correctly skips the page flip for
>      size==0, which closes the corruption window. But it does not
>      restore pagecnt_bias for that zero size buffer, so the page is
>      handed back to ice_reuse_rx_page() with a permanent deficit of 1.
>   4. On the next reuse of that page with size > 0, pagecnt_bias drops
>      again. ice_can_reuse_rx_page() now sees pgcnt - bias == 2 and
>      drains via __page_frag_cache_drain(page, pagecnt_bias). Because
>      pagecnt_bias is one too low, the drain undershoots by 1: page
>      refcount stays at 2 instead of 1.
>   5. The SKB eventually releases its reference (refcount -> 1), but
>      nothing ever brings it to 0. The page is leaked.
>      ice_alloc_rx_bufs() just allocates a fresh page to fill the slot.
> 
> At the zero size frequency you mentioned (thousands per second), this
> adds up to roughly MB/s of leaked page cache, which Jaroslav Pulchart
> originally reported against 6.13.y on NUMA nodes and which motivated
> the libeth/page_pool conversion in mainline. So in stable trees the
> leak side of this bug is still live.
> 
> Two questions:
> 
>   - Did you monitor RSS / page allocator stats over the duration of
>     your Homa runs? If you did and did not see growth, I would like
>     to understand what is different about your setup, because by my
>     reading of the code the leak should fire whenever both halves of
>     a page end up in SKBs simultaneously and one of them carried a
>     zero size descriptor along the way.
> 
>   - If your focus was specifically the corruption, would you be open
>     to extending v3 (or replacing it) with a fix that also restores
>     pagecnt_bias for the size==0 case? The minimal extension is one
>     extra branch in ice_put_rx_mbuf:
> 
>         if (verdict != ICE_XDP_CONSUMED && size != 0)
>                 ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
>         else
>                 buf->pagecnt_bias++;
> 
>     which restores bias on every path where the page is not actually
>     going out to an SKB. (I have a slightly different variant that
>     tracks has_data in struct ice_rx_buf to also handle the broken
>     positional 'i <= xdp_frags' counter in the CONSUMED path, where
>     zero size descriptors in the middle of a frame steal bias++ slots
>     from real fragments. Happy to share it if useful.)
> 
> Regards,
> Petr
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2026-06-01 15:51 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-12 18:19 [PATCH net v3] ice: fix packet corruption due to extraneous page flip John Ousterhout
2026-05-13  9:07 ` David Laight
2026-05-13 16:28   ` John Ousterhout
2026-05-13 20:49     ` David Laight
2026-05-14  4:47       ` John Ousterhout
2026-05-14 10:01         ` David Laight
2026-05-14 16:43           ` Jacob Keller
2026-05-26 12:47             ` [Intel-wired-lan] " Petr Oros
2026-05-26 15:11               ` David Laight
2026-05-26 22:17               ` John Ousterhout
2026-06-01 15:51               ` Maciej Fijalkowski
2026-05-14  9:08 ` Loktionov, Aleksandr

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox