From: Stanislav Fomichev <sdf.kernel@gmail.com>
To: Jason Xing <kerneljasonxing@gmail.com>
Cc: davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
pabeni@redhat.com, bjorn@kernel.org, magnus.karlsson@intel.com,
maciej.fijalkowski@intel.com, jonathan.lemon@gmail.com,
sdf@fomichev.me, ast@kernel.org, daniel@iogearbox.net,
hawk@kernel.org, john.fastabend@gmail.com, horms@kernel.org,
andrew+netdev@lunn.ch, bpf@vger.kernel.org,
netdev@vger.kernel.org, Jason Xing <kernelxing@tencent.com>
Subject: Re: [PATCH net 3/4] xsk: drain continuation descs after overflow in xsk_build_skb()
Date: Fri, 15 May 2026 08:41:03 -0700 [thread overview]
Message-ID: <agc8xgt2wXI-1ryb@devvm7509.cco0.facebook.com> (raw)
In-Reply-To: <CAL+tcoDA1tezc4XBUA1-4wDrpckXpQuPGFDsNPCkpUP4qyXeYQ@mail.gmail.com>
On 05/15, Jason Xing wrote:
> On Fri, May 15, 2026 at 8:29 AM Stanislav Fomichev <sdf.kernel@gmail.com> wrote:
> >
> > On 05/14, Jason Xing wrote:
> > > On Thu, May 14, 2026 at 12:27 AM Stanislav Fomichev
> > > <sdf.kernel@gmail.com> wrote:
> > > >
> > > > On 05/10, Jason Xing wrote:
> > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > >
> > > > > When a multi-buffer packet exceeds MAX_SKB_FRAGS and triggers -EOVERFLOW,
> > > > > only the current descriptor is released from the TX ring. The remaining
> > > > > continuation descriptors of the same packet stay in the ring. Since
> > > > > xs->skb is set to NULL after the drop, the TX loop picks up these
> > > > > leftover frags and misinterprets each one as the beginning of a new
> > > > > packet, corrupting the packet stream.
> > > > >
> > > > > Fix this by adding a drain_cont flag to xdp_sock. When overflow occurs
> > > > > and the dropped descriptor has XDP_PKT_CONTD set, the flag is raised.
> > > > > The main TX loop in __xsk_generic_xmit() then handles continuation
> > > > > descriptors one at a time: each gets a normal CQ reservation (with
> > > > > backpressure), its address is submitted to the completion queue, and
> > > > > the descriptor is released from the TX ring. When the last fragment
> > > > > (without XDP_PKT_CONTD) is processed, the flag is cleared and the
> > > > > function returns -EOVERFLOW so the next call starts with a fresh
> > > > > budget for normal packets.
> > > > >
> > > > > This reuses the existing CQ backpressure and budget mechanisms, so if
> > > > > the CQ is full the function returns -EAGAIN and userspace drains the
> > > > > CQ before retrying. Zero buffer leakage, zero packet stream corruption.
> > > > >
> > > > > Closes: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
> > > > > Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
> > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > > ---
> > > > > include/net/xdp_sock.h | 1 +
> > > > > net/xdp/xsk.c | 22 ++++++++++++++++++++++
> > > > > 2 files changed, 23 insertions(+)
> > > > >
> > > > > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > > > > index 23e8861e8b25..1958d19d9925 100644
> > > > > --- a/include/net/xdp_sock.h
> > > > > +++ b/include/net/xdp_sock.h
> > > > > @@ -80,6 +80,7 @@ struct xdp_sock {
> > > > > * call of __xsk_generic_xmit().
> > > > > */
> > > > > struct sk_buff *skb;
> > > > > + bool drain_cont;
> > > > >
> > > > > struct list_head map_list;
> > > > > /* Protects map_list */
> > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > > index 3f1e590c855d..232dd7126905 100644
> > > > > --- a/net/xdp/xsk.c
> > > > > +++ b/net/xdp/xsk.c
> > > > > @@ -936,6 +936,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > > > xs->tx->invalid_descs++;
> > > > > }
> > > > > xskq_cons_release(xs->tx);
> > > > > + if (xp_mb_desc(desc))
> > > > > + xs->drain_cont = true;
> > > > > } else {
> > > > > /* Let application retry */
> > > > > xsk_cq_cancel_locked(xs->pool, 1);
> > > > > @@ -982,6 +984,26 @@ static int __xsk_generic_xmit(struct sock *sk)
> > > > > goto out;
> > > > > }
> > > > >
> > > > > + if (unlikely(xs->drain_cont)) {
> > > > > + unsigned long flags;
> > > > > + u32 idx;
> > > > > +
> > > >
> > > > [..]
> > > >
> > > > > + spin_lock_irqsave(&xs->pool->cq_prod_lock, flags);
> > > > > + idx = xskq_get_prod(xs->pool->cq);
> > > > > + xskq_prod_write_addr(xs->pool->cq, idx, desc.addr);
> > > > > + xskq_prod_submit_n(xs->pool->cq, 1);
> > > > > + spin_unlock_irqrestore(&xs->pool->cq_prod_lock, flags);
> > > >
> > > > Not sure I understand why you want this if you're still marking the desc
> > > > invalid.
> > >
> > > The key point is that as long as we read the desc from txq, the desc
> > > should be either put back to txq again or publish it in the cq (for
> > > application to keep track of this) at this point. Or else, the
> > > application will lose track of this desc, which breaks the whole
> > > logic.
> >
> > Yes, makes sense!
> >
> > > > > +
> > > > > + xs->tx->invalid_descs++;
> > > > > + xskq_cons_release(xs->tx);
> > > > > + if (!xp_mb_desc(&desc)) {
> > > > > + xs->drain_cont = false;
> > > > > + err = -EOVERFLOW;
> > > > > + goto out;
> > > > > + }
> > > >
> > > > I also don't understand why you want to return -EOVERFLOW again? Why not
> > > > (quietly) swallow these invalid xp_mb_desc from the previous packet and move
> > > > on?
> > >
> > > This particular desc is really one of overflow cases, right? We should
> > > warn users to handle this with this error code.
> >
> > Right, but didn't we already return -EOVERFLOW to the user once?
> > The part where you set drain_count=true will -EOVERFLOW. Then this
> > remainder will also return -EOVERFLOW?
>
> Right, my intention is to alert users twice under this kind of
> circumstance. We silently drain the remaining portion of the skb the
> second time, which looks a bit strange, doesn't it?
>
> >
> > But let's maybe step back, what's the expectation on the user
> > when we return -EOVERFLOW? In theory, the user can re-submit new
> > shorter packet (at the current prod index), right? And then your
> > !xp_mb_desc logic will break/swallow/-EOVERFLOW it?
>
> People would be aware of the skb that is too big to handle when
> receiving two times of overflow warning. To put it in a simpler way,
> the second time in xmit path only handling the remaining is enough, no
> more descs that belong to another skb should be taken care of.
>
> If the user is able to quickly react to this case, I think the whole
> skb should be re-put into the txq again instead of adding the
> remaining part. I'm not so sure if I interpret the "break" correctly
> here.
Let's say the user puts a packet with too many descriptors. We find that
mid-way and return -EOVERFLOW. What is user supposed to do with that error?
In my mind, the user should drain TX ring completely and start posting
shorter packets. You seem to be trying to handle the case where the user
leaves the remainder of the previously EOVERFLOW'd packet in place, why?
next prev parent reply other threads:[~2026-05-15 15:41 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-10 1:23 [PATCH net 0/4] xsk: fix meta and publish of cq issues Jason Xing
2026-05-10 1:23 ` [PATCH net 1/4] xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata() Jason Xing
2026-05-11 15:03 ` Stanislav Fomichev
2026-05-12 14:32 ` Jason Xing
2026-05-12 22:34 ` Stanislav Fomichev
2026-05-13 14:21 ` Jason Xing
2026-05-13 15:37 ` Stanislav Fomichev
2026-05-14 0:11 ` Jason Xing
2026-05-10 1:23 ` [PATCH net 2/4] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx Jason Xing
2026-05-10 1:23 ` [PATCH net 3/4] xsk: drain continuation descs after overflow in xsk_build_skb() Jason Xing
2026-05-13 16:27 ` Stanislav Fomichev
2026-05-14 0:21 ` Jason Xing
2026-05-15 0:29 ` Stanislav Fomichev
2026-05-15 2:36 ` Jason Xing
2026-05-15 15:41 ` Stanislav Fomichev [this message]
2026-05-10 1:23 ` [PATCH net 4/4] xsk: drain continuation descs on invalid descriptor in __xsk_generic_xmit() Jason Xing
2026-05-11 14:16 ` [PATCH net 0/4] xsk: fix meta and publish of cq issues Jakub Kicinski
2026-05-12 14:29 ` Jason Xing
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=agc8xgt2wXI-1ryb@devvm7509.cco0.facebook.com \
--to=sdf.kernel@gmail.com \
--cc=andrew+netdev@lunn.ch \
--cc=ast@kernel.org \
--cc=bjorn@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=hawk@kernel.org \
--cc=horms@kernel.org \
--cc=john.fastabend@gmail.com \
--cc=jonathan.lemon@gmail.com \
--cc=kerneljasonxing@gmail.com \
--cc=kernelxing@tencent.com \
--cc=kuba@kernel.org \
--cc=maciej.fijalkowski@intel.com \
--cc=magnus.karlsson@intel.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=sdf@fomichev.me \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.