Netdev List
 help / color / mirror / Atom feed
From: David Laight <david.laight.linux@gmail.com>
To: Petr Oros <poros@redhat.com>
Cc: Jacob Keller <jacob.e.keller@intel.com>,
	John Ousterhout <ouster@cs.stanford.edu>,
	stable@vger.kernel.org, anthony.l.nguyen@intel.com,
	intel-wired-lan@lists.osuosl.org, przemyslaw.kitszel@intel.com,
	netdev@vger.kernel.org
Subject: Re: [Intel-wired-lan] [PATCH net v3] ice: fix packet corruption due to extraneous page flip
Date: Tue, 26 May 2026 16:11:08 +0100	[thread overview]
Message-ID: <20260526161108.645c47a1@pumpkin> (raw)
In-Reply-To: <e1ce1387-ae6b-4b43-b5d8-a1141c4a4f1c@redhat.com>

On Tue, 26 May 2026 14:47:42 +0200
Petr Oros <poros@redhat.com> wrote:

> On 5/14/26 18:43, Jacob Keller wrote:
> > On 5/14/2026 3:01 AM, David Laight wrote:  
> >> On Wed, 13 May 2026 21:47:11 -0700
> >> John Ousterhout <ouster@cs.stanford.edu> wrote:
> >>  
> >>> On Wed, May 13, 2026 at 1:49 PM David Laight
> >>> <david.laight.linux@gmail.com> wrote:  
> >>>> On Wed, 13 May 2026 09:28:40 -0700
> >>>> John Ousterhout <ouster@cs.stanford.edu> wrote:
> >>>>     
> >>>>> On Wed, May 13, 2026 at 2:07 AM David Laight
> >>>>> <david.laight.linux@gmail.com> wrote:  
> >>>>>> On Tue, 12 May 2026 11:19:53 -0700
> >>>>>> John Ousterhout <ouster@cs.stanford.edu> wrote:
> >>>>>>     
> >>>>>>> Consider the following sequence of events:
> >>>>>>> * The bottom half of a buffer page is filled with data from
> >>>>>>>    packet A. The page has a net reference count (reference count
> >>>>>>>    - bias) of 1. The page is returned to the NIC, flipped to
> >>>>>>>    use the top half.
> >>>>>>> * Before the reference on the page is released, the NIC returns
> >>>>>>>    the page with no data in it ('size' is zero in ice_clean_rx_irq).
> >>>>>>>    In this case the bias does not get decremented. The page still
> >>>>>>>    has a net reference count of 1, so it gets returned to the NIC.
> >>>>>>>    However, ice_put_rx_mbuf flipped the page so that the bottom
> >>>>>>>    half is active.
> >>>>>>> * If the NIC stores another packet in the page before packet A
> >>>>>>>    has released its reference, the data in packet A will be
> >>>>>>>    overwritten with data from the new packet.
> >>>>>>> * Unfortunately zero-length buffers occur frequently: they seem
> >>>>>>>    to occur whenever a packet uses every available byte in a
> >>>>>>>    buffer, ending precisely at the end of the buffer. When this
> >>>>>>>    happens the NIC seems to generate an extra zero-length
> >>>>>>>    buffer.
> >>>>>>> The fix is for ice_put_rx_mbuf not to flip pages that have a
> >>>>>>> size of 0.  
> >>>>>> How is this different from packet B (in the top half) being
> >>>>>> freed before packet A (in the bottom half)?  
> >>>>> I'm not sure exactly what you're referring to here. Are you asking
> >>>>> about a situation where both halves of the page get filled with packet
> >>>>> data and then the second half to be filled is the first to be freed? I
> >>>>> believe that the ICE driver abandons a page if both halves are ever
> >>>>> occupied simultaneously; the page will be returned to the system once
> >>>>> both halves have dropped their references. Thus it doesn't matter
> >>>>> which half is freed first.  
> >>>> That is what I was thinking, seems like the logic is over complicated.
> >>>>
> >>>> If you need to put 4k pages into some kind of iommu rather than 2k buffers
> >>>> (to contain 1536 byte ethernet packets) then I'd have thought you'd
> >>>> initially put both halves into adjacent tx ring entries.
> >>>> If a rx buffer is discarded (eg a zero length fragment or a CRC error,
> >>>> or even 'copy break' for short packets) then, as an optimisation,
> >>>> you could reuse the buffer for another receive.
> >>>> The same could be done if the page is freed by an application.
> >>>>
> >>>> However it sounds like it doesn't use the 2nd half until the first
> >>>> completes - otherwise you'd never 'flip' to make the other half
> >>>> active.
> >>>>
> >>>> Thinks...
> >>>> By only putting half of each 4k 'page' into the rx ring the code
> >>>> will usually save (expensive) iommu setup in the (probably) normal
> >>>> case where the buffers are freed 'reasonably quickly'.
> >>>> But that really requires a 'free/with_nic/busy' state for each half
> >>>> rather then trying to guess from a reference count.
> >>>>
> >>>> But if the low-level code is recycling the rx buffer (for any reason)
> >>>> it wants to use the same buffer.
> >>>>
> >>>> The ethernet driver I wrote (a long time ago, early 90s) allocated
> >>>> 64k as 128 512byte buffers and did an aligned word-sized copy of
> >>>> every receive frame - most frames were in contiguous memory.
> >>>> The simplicity of it made up for the cost of the copy, especially
> >>>> since that was an iommu system.  
> >>> I'm not here to defend the logic (and it has been replaced with
> >>> something that is probably simpler and more efficient); I'm just
> >>> suggesting a bug fix for the stable releases that still have this
> >>> logic.  
> > Right. We definitely want a fix for the possible data corruption in
> > stable. Ideally one as simple as possible.
> >  
> >> You've forced me to look at all of the function :-)
> >> I've noticed a few things:
> >> - If ice_add_xdp_frag() fails (because there are too many fragments)
> >>    then the rest of the fragments are left in the tx ring (instead
> >>    of being discarded) - so are likely to be treated as a full packet
> >>    later on.
> >> - Frames with status errors (crc, framing etc) are discarded after
> >>    the skb is built - surely that should happen before the xdp 'program'
> >>    is called.
> >> - If the remote system send a very very long frame (traditionally the PHY's
> >>    'jabber detect' didn't always work) you can end up with all of the rx
> >>    ring being full of a single partial packet.
> >>
> >> I think you need to avoid calling ice_add_xdp_frag() when 'size == 0'.
> >> Then in ice_put_rx_mbuf() unconditionally call ice_put_rx_buf() for
> >> zero length fragments.
> >> The comment would be 'zero length fragments can always be reused'.
> >>  
> > That seems correct.
> >  
> >> The zero length fragments almost certainly exist because the mac hardware
> >> advances the the new buffer expecting more data - but only gets the
> >> 4 byte CRC. So the zero length buffer contains the receive status.
> >>  
> > That matches my understanding.  
> Hi John,
> 
> I have been looking at the same area in the pre-page-pool ice code and
> I want to ask whether you observed memory growth during your Homa runs
> that exposed the corruption, because in my testing the same bias mismatch
> also produces a slow page leak that your v3 does not close.
> 
> Short version of the leak path, in the PASS (!CONSUMED) branch:
> 
>    1. ice_get_rx_buf(size=0) does pagecnt_bias-- unconditionally
>       (added by commit ef68094cb09e ("ice: Fix kernel panic due to page
>       refcount underflow") as the fix for the matching panic).
>    2. ice_add_xdp_frag() then returns 0 for size==0, so that page is
>       never attached to the xdp_buff/SKB. Nobody downstream will ever
>       call put_page() to balance the pagecnt_bias-- from step 1.
>    3. Your v3 in ice_put_rx_mbuf() correctly skips the page flip for
>       size==0, which closes the corruption window. But it does not
>       restore pagecnt_bias for that zero size buffer, so the page is
>       handed back to ice_reuse_rx_page() with a permanent deficit of 1.
>    4. On the next reuse of that page with size > 0, pagecnt_bias drops
>       again. ice_can_reuse_rx_page() now sees pgcnt - bias == 2 and
>       drains via __page_frag_cache_drain(page, pagecnt_bias). Because
>       pagecnt_bias is one too low, the drain undershoots by 1: page
>       refcount stays at 2 instead of 1.
>    5. The SKB eventually releases its reference (refcount -> 1), but
>       nothing ever brings it to 0. The page is leaked.
>       ice_alloc_rx_bufs() just allocates a fresh page to fill the slot.
> 
> At the zero size frequency you mentioned (thousands per second), this
> adds up to roughly MB/s of leaked page cache, which Jaroslav Pulchart
> originally reported against 6.13.y on NUMA nodes and which motivated
> the libeth/page_pool conversion in mainline. So in stable trees the
> leak side of this bug is still live.
> 
> Two questions:
> 
>    - Did you monitor RSS / page allocator stats over the duration of
>      your Homa runs? If you did and did not see growth, I would like
>      to understand what is different about your setup, because by my
>      reading of the code the leak should fire whenever both halves of
>      a page end up in SKBs simultaneously and one of them carried a
>      zero size descriptor along the way.
> 
>    - If your focus was specifically the corruption, would you be open
>      to extending v3 (or replacing it) with a fix that also restores
>      pagecnt_bias for the size==0 case? The minimal extension is one
>      extra branch in ice_put_rx_mbuf:
> 
>          if (verdict != ICE_XDP_CONSUMED && size != 0)
>                  ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
>          else
>                  buf->pagecnt_bias++;
> 
>      which restores bias on every path where the page is not actually
>      going out to an SKB. (I have a slightly different variant that
>      tracks has_data in struct ice_rx_buf to also handle the broken
>      positional 'i <= xdp_frags' counter in the CONSUMED path, where
>      zero size descriptors in the middle of a frame steal bias++ slots
>      from real fragments. Happy to share it if useful.)

By thought was:

I think you need to avoid calling ice_add_xdp_frag() when 'size == 0'.
Then in ice_put_rx_mbuf() unconditionally call ice_put_rx_buf() for
zero length fragments (regardless of verdict).
The comment would be 'zero length fragments can always be reused'.

I think that path always reuses the same half of the page without
going near the 'bias' code paths (which I didn't manage to grok).
It is the same path that is used for frames with bad CRC (ignoring
the broken paths when xdp is enabled).

-- David

> 
> Regards,
> Petr
> 
> 


  reply	other threads:[~2026-05-26 15:11 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-12 18:19 [PATCH net v3] ice: fix packet corruption due to extraneous page flip John Ousterhout
2026-05-13  9:07 ` David Laight
2026-05-13 16:28   ` John Ousterhout
2026-05-13 20:49     ` David Laight
2026-05-14  4:47       ` John Ousterhout
2026-05-14 10:01         ` David Laight
2026-05-14 16:43           ` Jacob Keller
2026-05-26 12:47             ` [Intel-wired-lan] " Petr Oros
2026-05-26 15:11               ` David Laight [this message]
2026-05-26 22:17               ` John Ousterhout
2026-05-14  9:08 ` Loktionov, Aleksandr

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260526161108.645c47a1@pumpkin \
    --to=david.laight.linux@gmail.com \
    --cc=anthony.l.nguyen@intel.com \
    --cc=intel-wired-lan@lists.osuosl.org \
    --cc=jacob.e.keller@intel.com \
    --cc=netdev@vger.kernel.org \
    --cc=ouster@cs.stanford.edu \
    --cc=poros@redhat.com \
    --cc=przemyslaw.kitszel@intel.com \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox