Re: [Intel-wired-lan] [PATCH net v3] ice: fix packet corruption due to extraneous page flip

Netdev List
 help / color / mirror / Atom feed

From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
To: Petr Oros <poros@redhat.com>
Cc: Jacob Keller <jacob.e.keller@intel.com>,
	David Laight <david.laight.linux@gmail.com>,
	John Ousterhout <ouster@cs.stanford.edu>,
	<stable@vger.kernel.org>, <anthony.l.nguyen@intel.com>,
	<intel-wired-lan@lists.osuosl.org>,
	<przemyslaw.kitszel@intel.com>, <netdev@vger.kernel.org>
Subject: Re: [Intel-wired-lan] [PATCH net v3] ice: fix packet corruption due to extraneous page flip
Date: Mon, 1 Jun 2026 17:51:12 +0200	[thread overview]
Message-ID: <ah2qcDLxMlGYNhgf@boxer> (raw)
In-Reply-To: <e1ce1387-ae6b-4b43-b5d8-a1141c4a4f1c@redhat.com>

On Tue, May 26, 2026 at 02:47:42PM +0200, Petr Oros wrote:
> 
> On 5/14/26 18:43, Jacob Keller wrote:
> > On 5/14/2026 3:01 AM, David Laight wrote:
> > > On Wed, 13 May 2026 21:47:11 -0700
> > > John Ousterhout <ouster@cs.stanford.edu> wrote:
> > > 
> > > > On Wed, May 13, 2026 at 1:49 PM David Laight
> > > > <david.laight.linux@gmail.com> wrote:
> > > > > On Wed, 13 May 2026 09:28:40 -0700
> > > > > John Ousterhout <ouster@cs.stanford.edu> wrote:
> > > > > > On Wed, May 13, 2026 at 2:07 AM David Laight
> > > > > > <david.laight.linux@gmail.com> wrote:
> > > > > > > On Tue, 12 May 2026 11:19:53 -0700
> > > > > > > John Ousterhout <ouster@cs.stanford.edu> wrote:
> > > > > > > > Consider the following sequence of events:
> > > > > > > > * The bottom half of a buffer page is filled with data from
> > > > > > > >    packet A. The page has a net reference count (reference count
> > > > > > > >    - bias) of 1. The page is returned to the NIC, flipped to
> > > > > > > >    use the top half.
> > > > > > > > * Before the reference on the page is released, the NIC returns
> > > > > > > >    the page with no data in it ('size' is zero in ice_clean_rx_irq).
> > > > > > > >    In this case the bias does not get decremented. The page still
> > > > > > > >    has a net reference count of 1, so it gets returned to the NIC.
> > > > > > > >    However, ice_put_rx_mbuf flipped the page so that the bottom
> > > > > > > >    half is active.
> > > > > > > > * If the NIC stores another packet in the page before packet A
> > > > > > > >    has released its reference, the data in packet A will be
> > > > > > > >    overwritten with data from the new packet.
> > > > > > > > * Unfortunately zero-length buffers occur frequently: they seem
> > > > > > > >    to occur whenever a packet uses every available byte in a
> > > > > > > >    buffer, ending precisely at the end of the buffer. When this
> > > > > > > >    happens the NIC seems to generate an extra zero-length
> > > > > > > >    buffer.
> > > > > > > > The fix is for ice_put_rx_mbuf not to flip pages that have a
> > > > > > > > size of 0.
> > > > > > > How is this different from packet B (in the top half) being
> > > > > > > freed before packet A (in the bottom half)?
> > > > > > I'm not sure exactly what you're referring to here. Are you asking
> > > > > > about a situation where both halves of the page get filled with packet
> > > > > > data and then the second half to be filled is the first to be freed? I
> > > > > > believe that the ICE driver abandons a page if both halves are ever
> > > > > > occupied simultaneously; the page will be returned to the system once
> > > > > > both halves have dropped their references. Thus it doesn't matter
> > > > > > which half is freed first.
> > > > > That is what I was thinking, seems like the logic is over complicated.
> > > > > 
> > > > > If you need to put 4k pages into some kind of iommu rather than 2k buffers
> > > > > (to contain 1536 byte ethernet packets) then I'd have thought you'd
> > > > > initially put both halves into adjacent tx ring entries.
> > > > > If a rx buffer is discarded (eg a zero length fragment or a CRC error,
> > > > > or even 'copy break' for short packets) then, as an optimisation,
> > > > > you could reuse the buffer for another receive.
> > > > > The same could be done if the page is freed by an application.
> > > > > 
> > > > > However it sounds like it doesn't use the 2nd half until the first
> > > > > completes - otherwise you'd never 'flip' to make the other half
> > > > > active.
> > > > > 
> > > > > Thinks...
> > > > > By only putting half of each 4k 'page' into the rx ring the code
> > > > > will usually save (expensive) iommu setup in the (probably) normal
> > > > > case where the buffers are freed 'reasonably quickly'.
> > > > > But that really requires a 'free/with_nic/busy' state for each half
> > > > > rather then trying to guess from a reference count.
> > > > > 
> > > > > But if the low-level code is recycling the rx buffer (for any reason)
> > > > > it wants to use the same buffer.
> > > > > 
> > > > > The ethernet driver I wrote (a long time ago, early 90s) allocated
> > > > > 64k as 128 512byte buffers and did an aligned word-sized copy of
> > > > > every receive frame - most frames were in contiguous memory.
> > > > > The simplicity of it made up for the cost of the copy, especially
> > > > > since that was an iommu system.
> > > > I'm not here to defend the logic (and it has been replaced with
> > > > something that is probably simpler and more efficient); I'm just
> > > > suggesting a bug fix for the stable releases that still have this
> > > > logic.
> > Right. We definitely want a fix for the possible data corruption in
> > stable. Ideally one as simple as possible.
> > 
> > > You've forced me to look at all of the function :-)
> > > I've noticed a few things:
> > > - If ice_add_xdp_frag() fails (because there are too many fragments)
> > >    then the rest of the fragments are left in the tx ring (instead
> > >    of being discarded) - so are likely to be treated as a full packet
> > >    later on.
> > > - Frames with status errors (crc, framing etc) are discarded after
> > >    the skb is built - surely that should happen before the xdp 'program'
> > >    is called.
> > > - If the remote system send a very very long frame (traditionally the PHY's
> > >    'jabber detect' didn't always work) you can end up with all of the rx
> > >    ring being full of a single partial packet.
> > > 
> > > I think you need to avoid calling ice_add_xdp_frag() when 'size == 0'.
> > > Then in ice_put_rx_mbuf() unconditionally call ice_put_rx_buf() for
> > > zero length fragments.
> > > The comment would be 'zero length fragments can always be reused'.
> > > 
> > That seems correct.
> > 
> > > The zero length fragments almost certainly exist because the mac hardware
> > > advances the the new buffer expecting more data - but only gets the
> > > 4 byte CRC. So the zero length buffer contains the receive status.
> > > 
> > That matches my understanding.
> Hi John,
> 
> I have been looking at the same area in the pre-page-pool ice code and
> I want to ask whether you observed memory growth during your Homa runs
> that exposed the corruption, because in my testing the same bias mismatch
> also produces a slow page leak that your v3 does not close.
> 
> Short version of the leak path, in the PASS (!CONSUMED) branch:
> 
>   1. ice_get_rx_buf(size=0) does pagecnt_bias-- unconditionally
>      (added by commit ef68094cb09e ("ice: Fix kernel panic due to page
>      refcount underflow") as the fix for the matching panic).

you mean in ice_get_rx_buf() ? I do not see this part anywhere. Must be
your custom recipe when you ported 84bf1ac85af8 ("ice: fix Rx page leak on
multi-buffer frames") ?

Given the amount of issues 0 sized frags gave us I was thinking whether
they should be treated in the same manner as normal frags.

But please rule this out on your side to align with John so we could
proceed with this fix.

Thanks
Maciej

>   2. ice_add_xdp_frag() then returns 0 for size==0, so that page is
>      never attached to the xdp_buff/SKB. Nobody downstream will ever
>      call put_page() to balance the pagecnt_bias-- from step 1.
>   3. Your v3 in ice_put_rx_mbuf() correctly skips the page flip for
>      size==0, which closes the corruption window. But it does not
>      restore pagecnt_bias for that zero size buffer, so the page is
>      handed back to ice_reuse_rx_page() with a permanent deficit of 1.
>   4. On the next reuse of that page with size > 0, pagecnt_bias drops
>      again. ice_can_reuse_rx_page() now sees pgcnt - bias == 2 and
>      drains via __page_frag_cache_drain(page, pagecnt_bias). Because
>      pagecnt_bias is one too low, the drain undershoots by 1: page
>      refcount stays at 2 instead of 1.
>   5. The SKB eventually releases its reference (refcount -> 1), but
>      nothing ever brings it to 0. The page is leaked.
>      ice_alloc_rx_bufs() just allocates a fresh page to fill the slot.
> 
> At the zero size frequency you mentioned (thousands per second), this
> adds up to roughly MB/s of leaked page cache, which Jaroslav Pulchart
> originally reported against 6.13.y on NUMA nodes and which motivated
> the libeth/page_pool conversion in mainline. So in stable trees the
> leak side of this bug is still live.
> 
> Two questions:
> 
>   - Did you monitor RSS / page allocator stats over the duration of
>     your Homa runs? If you did and did not see growth, I would like
>     to understand what is different about your setup, because by my
>     reading of the code the leak should fire whenever both halves of
>     a page end up in SKBs simultaneously and one of them carried a
>     zero size descriptor along the way.
> 
>   - If your focus was specifically the corruption, would you be open
>     to extending v3 (or replacing it) with a fix that also restores
>     pagecnt_bias for the size==0 case? The minimal extension is one
>     extra branch in ice_put_rx_mbuf:
> 
>         if (verdict != ICE_XDP_CONSUMED && size != 0)
>                 ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
>         else
>                 buf->pagecnt_bias++;
> 
>     which restores bias on every path where the page is not actually
>     going out to an SKB. (I have a slightly different variant that
>     tracks has_data in struct ice_rx_buf to also handle the broken
>     positional 'i <= xdp_frags' counter in the CONSUMED path, where
>     zero size descriptors in the middle of a frame steal bias++ slots
>     from real fragments. Happy to share it if useful.)
> 
> Regards,
> Petr
> 
>

next prev parent reply	other threads:[~2026-06-01 15:51 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-12 18:19 [PATCH net v3] ice: fix packet corruption due to extraneous page flip John Ousterhout
2026-05-13  9:07 ` David Laight
2026-05-13 16:28   ` John Ousterhout
2026-05-13 20:49     ` David Laight
2026-05-14  4:47       ` John Ousterhout
2026-05-14 10:01         ` David Laight
2026-05-14 16:43           ` Jacob Keller
2026-05-26 12:47             ` [Intel-wired-lan] " Petr Oros
2026-05-26 15:11               ` David Laight
2026-05-26 22:17               ` John Ousterhout
2026-06-01 15:51               ` Maciej Fijalkowski [this message]
2026-05-14  9:08 ` Loktionov, Aleksandr

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ah2qcDLxMlGYNhgf@boxer \
    --to=maciej.fijalkowski@intel.com \
    --cc=anthony.l.nguyen@intel.com \
    --cc=david.laight.linux@gmail.com \
    --cc=intel-wired-lan@lists.osuosl.org \
    --cc=jacob.e.keller@intel.com \
    --cc=netdev@vger.kernel.org \
    --cc=ouster@cs.stanford.edu \
    --cc=poros@redhat.com \
    --cc=przemyslaw.kitszel@intel.com \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox