From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
To: Jason Xing <kerneljasonxing@gmail.com>
Cc: <davem@davemloft.net>, <edumazet@google.com>, <kuba@kernel.org>,
<pabeni@redhat.com>, <bjorn@kernel.org>,
<magnus.karlsson@intel.com>, <jonathan.lemon@gmail.com>,
<sdf@fomichev.me>, <ast@kernel.org>, <daniel@iogearbox.net>,
<hawk@kernel.org>, <john.fastabend@gmail.com>, <horms@kernel.org>,
<andrew+netdev@lunn.ch>, <bpf@vger.kernel.org>,
<netdev@vger.kernel.org>, Jason Xing <kernelxing@tencent.com>
Subject: Re: [PATCH net v4 0/5] xsk: fix meta and publish of cq issues
Date: Tue, 26 May 2026 21:43:08 +0200
Message-ID: <ahX3zCRWTD2v7kwn@boxer>
In-Reply-To: <CAL+tcoBud6+nZ=Zeq-Ja+nnOcfwt71Z_QBofOS3qyrNt3Tkkvw@mail.gmail.com>
On Sat, May 23, 2026 at 07:49:00AM +0800, Jason Xing wrote:
> On Sat, May 23, 2026 at 2:34 AM Maciej Fijalkowski
> <maciej.fijalkowski@intel.com> wrote:
> >
> > On Fri, May 22, 2026 at 09:48:39PM +0800, Jason Xing wrote:
> > > On Fri, May 22, 2026 at 4:55 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > On Thu, May 21, 2026 at 10:24 PM Maciej Fijalkowski
> > > > <maciej.fijalkowski@intel.com> wrote:
> > > > >
> > > > > On Thu, May 21, 2026 at 09:07:30PM +0800, Jason Xing wrote:
> > > > > > On Thu, May 21, 2026 at 9:00 PM Maciej Fijalkowski
> > > > > > <maciej.fijalkowski@intel.com> wrote:
> > > > > > >
> > > > > > > On Thu, May 21, 2026 at 08:41:08PM +0800, Jason Xing wrote:
> > > > > > > > On Thu, May 21, 2026 at 8:24 PM Maciej Fijalkowski
> > > > > > > > <maciej.fijalkowski@intel.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, May 20, 2026 at 08:42:39AM +0800, Jason Xing wrote:
> > > > > > > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > > > > > > >
> > > > > > > > > > The series is the product of previous review from sashiko[1].
> > > > > > > > > >
> > > > > > > > > > 1) META
> > > > > > > > > > patch 1: address TOCTOU around metadata.
> > > > > > > > > >
> > > > > > > > > > 2) PUBLISH of CQ
> > > > > > > > > > patch 2: make sure xsk_addr->addrs[] can be published to cq when
> > > > > > > > > > overflow occurs.
> > > > > > > > > > patch 3: keep cleaning up the continuation descs (more than 17) and
> > > > > > > > > > publish its address when overflow occurs.
> > > > > > > > > > patch 4: like patch 3, but only handles the invalid descs cases.
> > > > > > > > > >
> > > > > > > > > > [1]: https://lore.kernel.org/all/20260502200722.53960-1-kerneljasonxing@gmail.com/
> > > > > > > > > >
> > > > > > > > > > ---
> > > > > > > > > > V4
> > > > > > > > > > Link: https://lore.kernel.org/all/20260517063311.28921-1-kerneljasonxing@gmail.com/
> > > > > > > > > > 1. correct the description of xmit path in patch 3 (sashiko)
> > > > > > > > > > 2. move set logic into xmit path in patch 3 (Stan)
> > > > > > > > > >
> > > > > > > > > > V3
> > > > > > > > > > Link: https://lore.kernel.org/all/20260515123018.80147-1-kerneljasonxing@gmail.com/
> > > > > > > > > > > 1. avoid breaking previous usage of sendto, and silently handle
> > > > > > > > > > overflow case (Stan, sashiko)
> > > > > > > > > > 2. add one particular exception process in patch 4 (sashiko)
> > > > > > > > > > > 3. adjust the selftest to make sure it passes in either virtual or
> > > > > > > > > > > physical machines, which includes adding usleep to support physical machine.
> > > > > > > > > >
> > > > > > > > > > V2
> > > > > > > > > > Link: https://lore.kernel.org/all/20260510012310.88570-1-kerneljasonxing@gmail.com/
> > > > > > > > > > 1. adjust selftests (Jakub)
> > > > > > > > > > 2. add READ_ONCE in patch 1 (Stan)
> > > > > > > > >
> > > > > > > > > FWIW I still get test failures (yes with patch 5 applied). PTAL.
> > > > > > > >
> > > > > > > > Thanks for the test. But I've tried with ixgbe driver...
> > > > > > > >
> > > > > > > > I noticed there are some flaky tests which have nothing to do with the
> > > > > > > > series. Can you confirm that it's not caused because of the series?
> > > > > > >
> > > > > > > That explains the different results as i am using i40e/ice which have
> > > > > > > multi-buffer support whereas ixgbe does not even support mbuf at XDP.
> > > > > > > Broken tests are from mbuf cases.
> > > > > >
> > > > > > That's weird. I never expected the failed tests to be about multi-buffer.
> > > > > >
> > > > > > Are they the same as the output you attached last time? Or something
> > > > > > new? Could you please share it so that I can investigate the root
> > > > > > cause?
[...]
> > > >
> > > > Sorry, Maciej. I managed to get one server with i40e nic but still
> > > > couldn't reproduce it. Can you try the attachment (that is the
> > > > replacement for v4-0005) instead? I removed those nasty CONT test
> > > > cases...
> > >
> > > Ah, I think I eventually figured out a solution. Maciej, could you
> > > please test the 2nd patch instead?
> > >
> > > This patch reworks the CONTD test cases. Cross finger.
> >
> > Please don't rush things here, I believe we need to think a bit more here.
> > I have second thoughts about overall approach.
> >
> > My understanding wrt CQ was that it is a container that holds descriptors
> > which have been successfully transmitted. Now we want to add also leftover
> > descriptors from broken packets, which might confuse user space sides in
> > case they were relying on behavior described above.
> >
> > The intent is right of course as we don't want to lose UMEM descs, but I
> > feel like we need a separate mechanism for that rather than putting
> > invalid descs to CQ.
>
> I don't sense anything strange here if we stick to put those
> invalid/overflowed descriptors into cq. AF_XDP is only a tunnel that
> transfers the data. That's it. A bit like how the physical link works,
> which means it possibly drops data because of congestion.
>
> Upper protocol is used to guarantee when to (re)transmit a packet -
> the mechanism is the ACK driven in terms of TCP. TCP is absolutely
> capable of finding such an abnormal thing happening by checking the
> seq of incoming ack. My takeaway from this is we don't need to
> deliberately design new stuff to fulfill direct and immediate
> communication.

AF_XDP is often used for L2/L3 forwarding, UDP, custom transports, so I
was afraid some existing solutions might be relying on CQ entries implying
successful Tx.

Generally this issue is highly unlikely, yet it is something we need to
address, so let's follow your approach. For that we need to update the
documentation and align the ZC side so that the test cases would not have
to deviate between modes.
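
To illustrate what the updated docs would ask of user space, here is a
rough sketch of application-side bookkeeping that treats a CQ entry only as
"this UMEM frame is yours again" and never as "this frame was sent". It is
a toy model, not libxdp or uAPI code: struct frame_pool, pool_submit(),
pool_complete(), NUM_FRAMES and FRAME_SIZE are all names made up for this
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_FRAMES 8u		/* hypothetical UMEM size in frames */
#define FRAME_SIZE 2048u	/* hypothetical frame size in bytes */

/* Hypothetical user-space bookkeeping for UMEM frame ownership. */
struct frame_pool {
	bool in_flight[NUM_FRAMES];	/* frame currently owned by the kernel */
	unsigned int free_cnt;		/* frames the app may reuse */
};

static void pool_init(struct frame_pool *p)
{
	assert(p != NULL);
	for (size_t i = 0; i < NUM_FRAMES; i++)
		p->in_flight[i] = false;
	p->free_cnt = NUM_FRAMES;
}

/* Record that the frame at @addr was placed on the Tx ring. */
static bool pool_submit(struct frame_pool *p, uint64_t addr)
{
	uint64_t idx = addr / FRAME_SIZE;

	if (idx >= NUM_FRAMES || p->in_flight[idx])
		return false;
	p->in_flight[idx] = true;
	p->free_cnt--;
	return true;
}

/*
 * Handle one address read from CQ. The frame is reclaimed, nothing more:
 * no "packet sent" counter is touched, because the entry may belong to a
 * packet that was dropped.
 */
static bool pool_complete(struct frame_pool *p, uint64_t addr)
{
	uint64_t idx = addr / FRAME_SIZE;

	if (idx >= NUM_FRAMES || !p->in_flight[idx])
		return false;
	p->in_flight[idx] = false;
	p->free_cnt++;
	return true;
}
```

An app written this way should not care whether a completed address came
from a transmitted packet or from a dropped one. Apps that bump a Tx counter
per CQ entry are the ones I was worried about.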
>
> CQ works somehow as a notification that tells user space whether the
> kernel receives the data from the app and handles them. Without
> putting them in the CQ, the only thing for userspace to do is simply
> wait.

I don't follow the last part of the sentence, but let's disregard it.
Userspace gets an errno/retcode in generic xmit, so it is aware of
underlying issues, and then it's the app's job to act upon it.
>
> With that said, IMHO, I cannot figure out why we need a separate queue
> or something like that. Of course, a new notification that handles all
> the possible/potential exceptions and contributes to the performance
> of the upper layer is worth a try :) The latter is crucial.

Magnus and I discussed that it would be good to have a dedicated xsk_queue
stat for that case, such as 'oversized_descs', which would be bumped by the
number of descs produced to CQ.
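
A minimal sketch of how such a counter could be used, assuming it counts
exactly the descs of dropped oversized packets that were produced to CQ.
Only the name 'oversized_descs' comes from this thread. struct q_stats and
both helpers are hypothetical stand-ins, not existing kernel or libxdp
code, and the subtraction on the reader side is my assumption about how an
app might consume the stat.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-queue statistics block. */
struct q_stats {
	uint64_t oversized_descs;	/* descs of dropped packets put on CQ */
};

/* Producer side: account for @n descs of a dropped oversized packet. */
static void q_stats_add_oversized(struct q_stats *s, uint32_t n)
{
	assert(s != NULL);
	s->oversized_descs += n;
}

/*
 * Reader side: estimate how many of @cq_total completed descs belonged to
 * packets that were really handed to the driver. Clamps at zero so a stale
 * @cq_total cannot underflow.
 */
static uint64_t sent_descs(uint64_t cq_total, const struct q_stats *s)
{
	assert(s != NULL);
	if (cq_total <= s->oversized_descs)
		return 0;
	return cq_total - s->oversized_descs;
}
```

With this, an app that saw 20 CQ entries while the stat reads 18 would
conclude that 2 descs were actually transmitted.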
>
> >
> > Does it make sense?
> >
> > Besides, even though we would stay with proposed changes, behavior between
> > modes should be aligned. Right now ZC seems to be broken in touched
> > regions here - when we hit the limit of frags via pool->xdp_zc_max_segs,
> > we break the loop and discard the packet, never post it to CQ and these
> > descs are lost from user space POV. Then we would continue on next call
> > and interpret the rest of too big packet as a separate one (clamped) and
> > therefore submit corrupted packet to HW.
>
> Right, this is how the previous selftests changes pollute the
> subsequent tests after that. I think the new version of the attachment
> should pass all the tests since I put all the CONTD tests separately
> into another two functions? It's pointless to test those in the zc
> mode.

We need an analogous fix on ZC, then no such quirks in tests should exist.
The test is doing the same thing regardless of the underlying mode. Only
the range differs (MAX_SKB_FRAGS vs pool->xdp_zc_max_segs).
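
Roughly, the mode-agnostic behavior I have in mind looks like the toy model
below: an oversized packet is dropped as a whole, all of its descs still go
to CQ, and the next packet starts at a clean boundary. This is a user-space
simulation, not code from net/xdp/xsk.c. struct desc, struct cq,
consume_packet() and PKT_CONTD are made up here (PKT_CONTD only mirrors the
role of XDP_PKT_CONTD), and max_segs stands for whatever limit the mode
uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PKT_CONTD (1u << 0)	/* stand-in for XDP_PKT_CONTD: more frags follow */
#define CQ_SIZE 64u		/* hypothetical completion queue capacity */

struct desc {
	uint64_t addr;
	uint32_t len;
	uint32_t options;
};

struct cq {
	uint64_t addrs[CQ_SIZE];
	size_t n;		/* number of valid entries in addrs[] */
};

/*
 * Consume one packet starting at descs[*pos].
 * Returns 1 if it fits in @max_segs (would be sent), 0 if it is oversized
 * (dropped), -1 if it is incomplete or CQ lacks room (nothing consumed).
 * In the first two cases every desc address of the packet is produced to
 * @cq and *pos moves past the end of the packet.
 */
static int consume_packet(const struct desc *descs, size_t n, size_t *pos,
			  uint32_t max_segs, struct cq *cq)
{
	size_t end, segs;

	assert(descs != NULL && pos != NULL && cq != NULL);

	/* The last desc of a packet is the first one without PKT_CONTD. */
	end = *pos;
	while (end < n && (descs[end].options & PKT_CONTD))
		end++;
	if (end >= n)
		return -1;

	segs = end - *pos + 1;
	if (cq->n + segs > CQ_SIZE)
		return -1;

	/* Sent or dropped, the frames go back to user space. */
	for (size_t i = *pos; i <= end; i++)
		cq->addrs[cq->n++] = descs[i].addr;
	*pos = end + 1;

	return segs <= max_segs;
}
```

The point is that the continuation descs are drained together with the
head of the packet, so the remainder can never be interpreted as a separate
packet on the next call.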

To wrap up, I see it like this, moving forward:
1. fix docs
2. add ss stat
3. wait for me with ZC fixes (I'm slow!)
4. inspect if tests will fly

Let me know your thoughts! Maybe Stan wants to chime in?

Maciej

>
> As to the series, if no objections or any suggestions jump into the
> thread, I'll post the series within a week.
>
> Thanks,
> Jason
>
> >
> > I'll be looking at ZC API but i do think we need a common approach,
> > mode-agnostic.
> >
> > Thanks,
> > Maciej
> >
> > >
> > > Thanks,
> > > Jason
> > >
> > > >
> > > > Really I don't think I have much time to spend on these tests which
> > > > makes me feel extremely annoyed... It's not easy to analyze the code
> > > > without a reproducer. The good news is that now I highly suspect that
> > > > this kind of CONT test cases pollute the whole cq which affects other
> > > > tests. Before I give up on the 0003/0004 patches, I'd like to hear
> > > > some advice from you. Thank you.
> > > >
> > > > My original intention was to push batch xmit forward but at that time
> > > > sashiko pointed out some unrelated bugs accidentally.
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Jason
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Jason
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Jason Xing (5):
> > > > > > > > > > xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()
> > > > > > > > > > xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
> > > > > > > > > > xsk: drain continuation descs after overflow in xsk_build_skb()
> > > > > > > > > > xsk: drain continuation descs on invalid descriptor in
> > > > > > > > > > __xsk_generic_xmit()
> > > > > > > > > > selftests/xsk: drain CQ to wait for TX completion
> > > > > > > > > >
> > > > > > > > > > include/net/xdp_sock.h | 1 +
> > > > > > > > > > net/xdp/xsk.c | 44 +++++++++++++----
> > > > > > > > > > .../selftests/bpf/prog_tests/test_xsk.c | 48 +++++++++++--------
> > > > > > > > > > 3 files changed, 63 insertions(+), 30 deletions(-)
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > 2.43.7
> > > > > > > > > >
> >
> >