From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
To: Jason Xing <kerneljasonxing@gmail.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>,
<davem@davemloft.net>, <edumazet@google.com>, <kuba@kernel.org>,
<pabeni@redhat.com>, <bjorn@kernel.org>,
<magnus.karlsson@intel.com>, <jonathan.lemon@gmail.com>,
<sdf@fomichev.me>, <ast@kernel.org>, <daniel@iogearbox.net>,
<hawk@kernel.org>, <john.fastabend@gmail.com>, <joe@dama.to>,
<willemdebruijn.kernel@gmail.com>, <fmancera@suse.de>,
<csmate@nop.hu>, <bpf@vger.kernel.org>, <netdev@vger.kernel.org>,
Jason Xing <kernelxing@tencent.com>
Subject: Re: [PATCH RFC net-next 2/2] xsk: introduce a cached cq to temporarily store descriptor addrs
Date: Tue, 18 Nov 2025 19:28:49 +0100 [thread overview]
Message-ID: <aRy64Wr2UBhr4KLF@boxer> (raw)
In-Reply-To: <CAL+tcoCPiDq807u4wmqNx+j_jMmYYzNVA5ySGmp_V5gDLYz02A@mail.gmail.com>
On Tue, Nov 18, 2025 at 07:40:52PM +0800, Jason Xing wrote:
> On Tue, Nov 18, 2025 at 6:15 PM Maciej Fijalkowski
> <maciej.fijalkowski@intel.com> wrote:
> >
> > On Tue, Nov 18, 2025 at 08:01:52AM +0800, Jason Xing wrote:
> > > On Tue, Nov 18, 2025 at 12:05 AM Maciej Fijalkowski
> > > <maciej.fijalkowski@intel.com> wrote:
> > > >
> > > > On Sat, Nov 15, 2025 at 07:46:40AM +0800, Jason Xing wrote:
> > > > > On Fri, Nov 14, 2025 at 11:53 PM Maciej Fijalkowski
> > > > > <maciej.fijalkowski@intel.com> wrote:
> > > > > >
> > > > > > On Tue, Nov 11, 2025 at 10:02:58PM +0800, Jason Xing wrote:
> > > > > > > Hi Magnus,
> > > > > > >
> > > > > > > On Tue, Nov 11, 2025 at 9:44 PM Magnus Karlsson
> > > > > > > <magnus.karlsson@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, 11 Nov 2025 at 14:06, Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi Maciej,
> > > > > > > > >
> > > > > > > > > On Mon, Nov 3, 2025 at 11:00 PM Maciej Fijalkowski
> > > > > > > > > <maciej.fijalkowski@intel.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Sat, Nov 01, 2025 at 07:59:36AM +0800, Jason Xing wrote:
> > > > > > > > > > > On Fri, Oct 31, 2025 at 10:02 PM Maciej Fijalkowski
> > > > > > > > > > > <maciej.fijalkowski@intel.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Oct 31, 2025 at 05:32:30PM +0800, Jason Xing wrote:
> > > > > > > > > > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > > > > > > > > > >
> > > > > > > > > > > > > Before commit 30f241fcf52a ("xsk: Fix immature cq descriptor
> > > > > > > > > > > > > production"), there was an issue[1] that caused descriptors to be
> > > > > > > > > > > > > published incorrectly under a race condition. The above commit
> > > > > > > > > > > > > fixes the issue but adds more memory operations in the xmit hot
> > > > > > > > > > > > > path and in interrupt context, which can hurt performance.
> > > > > > > > > > > > >
> > > > > > > > > > > > > This patch tries to propose a new solution to fix the problem
> > > > > > > > > > > > > without manipulating the allocation and deallocation of memory. One
> > > > > > > > > > > > > of the key points is that I borrowed the idea from the above commit
> > > > > > > > > > > > > that postpones updating the ring->descs in xsk_destruct_skb()
> > > > > > > > > > > > > instead of in __xsk_generic_xmit().
> > > > > > > > > > > > >
> > > > > > > > > > > > > The core logic is as follows:
> > > > > > > > > > > > > 1. allocate a new local queue. Only its cached_prod member is used.
> > > > > > > > > > > > > 2. write the descriptors into the local queue in the xmit path. And
> > > > > > > > > > > > > record the cached_prod as @start_addr that reflects the
> > > > > > > > > > > > > start position of this queue so that later the skb can easily
> > > > > > > > > > > > > find where its addrs are written in the destruction phase.
> > > > > > > > > > > > > 3. initialize the upper 24 bits of destructor_arg to store @start_addr
> > > > > > > > > > > > > in xsk_skb_init_misc().
> > > > > > > > > > > > > 4. Initialize the lower 8 bits of destructor_arg to store how many
> > > > > > > > > > > > > descriptors the skb owns in xsk_update_num_desc().
> > > > > > > > > > > > > 5. write the desc addr(s) from the @start_addr from the cached cq
> > > > > > > > > > > > > one by one into the real cq in xsk_destruct_skb(). In turn sync
> > > > > > > > > > > > > the global state of the cq.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The format of destructor_arg is designed as:
> > > > > > > > > > > > > ------------------------ --------
> > > > > > > > > > > > > | start_addr | num |
> > > > > > > > > > > > > ------------------------ --------
> > > > > > > > > > > > > The upper 24 bits are enough to hold the start position of the
> > > > > > > > > > > > > temporary descriptors, and the lower 8 bits are enough to hold
> > > > > > > > > > > > > the number of descriptors that one skb owns.
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1]: https://lore.kernel.org/all/20250530095957.43248-1-e.kubanski@partner.samsung.com/
> > > > > > > > > > > > >
> > > > > > > > > > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > > > > > > > > > > ---
> > > > > > > > > > > > > I posted the series as an RFC because I'd like to hear more opinions on
> > > > > > > > > > > > > the current rough approach, so that the fix[2] can be avoided and its
> > > > > > > > > > > > > performance impact mitigated. This patch might still have bugs; I
> > > > > > > > > > > > > decided to spend more time on it after we come to an agreement. Please
> > > > > > > > > > > > > review the overall concept. Thanks!
> > > > > > > > > > > > >
> > > > > > > > > > > > > Maciej, could you share with me how you tested jumbo frames? I used
> > > > > > > > > > > > > ./xdpsock -i enp2s0f1 -t -q 1 -S -s 9728, but xdpsock utilizes the
> > > > > > > > > > > > > NIC at more than 90%, which means I cannot see the performance impact.
> > > > > > > > > > >
> > > > > > > > > > > Could you provide the command you used? Thanks :)
> > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > [2]:https://lore.kernel.org/all/20251030140355.4059-1-fmancera@suse.de/
> > > > > > > > > > > > > ---
> > > > > > > > > > > > > include/net/xdp_sock.h | 1 +
> > > > > > > > > > > > > include/net/xsk_buff_pool.h | 1 +
> > > > > > > > > > > > > net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++--------
> > > > > > > > > > > > > net/xdp/xsk_buff_pool.c | 1 +
> > > > > > > > > > > > > 4 files changed, 84 insertions(+), 23 deletions(-)
> > > > > > > > > > > >
> > > > > > > > > > > > (...)
> > > > > > > > > > > >
> > > > > > > > > > > > > diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> > > > > > > > > > > > > index aa9788f20d0d..6e170107dec7 100644
> > > > > > > > > > > > > --- a/net/xdp/xsk_buff_pool.c
> > > > > > > > > > > > > +++ b/net/xdp/xsk_buff_pool.c
> > > > > > > > > > > > > @@ -99,6 +99,7 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
> > > > > > > > > > > > >
> > > > > > > > > > > > > pool->fq = xs->fq_tmp;
> > > > > > > > > > > > > pool->cq = xs->cq_tmp;
> > > > > > > > > > > > > + pool->cached_cq = xs->cached_cq;
> > > > > > > > > > > >
> > > > > > > > > > > > Jason,
> > > > > > > > > > > >
> > > > > > > > > > > > pool can be shared between multiple sockets that bind to same <netdev,qid>
> > > > > > > > > > > > tuple. I believe here you're opening up for the very same issue Eryk
> > > > > > > > > > > > initially reported.
> > > > > > > > > > >
> > > > > > > > > > > Actually it shouldn't happen, because cached_cq is more of a
> > > > > > > > > > > temporary array that helps each skb record its start position. The
> > > > > > > > > > > cached_prod of cached_cq can only be increased, never decreased. In
> > > > > > > > > > > the skb destruction phase, only skbs that reach the end of their
> > > > > > > > > > > lifetime need to sync their descs from cached_cq to cq. For skbs
> > > > > > > > > > > that are released before tx completion, we don't need to clear their
> > > > > > > > > > > records in cached_cq at all, and cq remains untouched.
> > > > > > > > > > >
> > > > > > > > > > > To put it simply, the patch you proposed uses kmem_cache* helpers
> > > > > > > > > > > to store the addr and writes it into cq at the end of the
> > > > > > > > > > > lifecycle, while the current patch stores it in pre-allocated
> > > > > > > > > > > memory, thus avoiding the allocation and deallocation.
> > > > > > > > > > >
> > > > > > > > > > > Unless I'm missing something important, I'm still convinced this
> > > > > > > > > > > temporary queue can solve the problem, since it's essentially a
> > > > > > > > > > > better substitute for the kmem cache that retains high performance.
> > > > > >
> > > > > > Back after health issues!
> > > > >
> > > > > Hi Maciej,
> > > > >
> > > > > Hope you're fully recovered:)
> > > > >
> > > > > >
> > > > > > Jason, I am still not convinced about this solution.
> > > > > >
> > > > > > In shared pool setups, the temp cq will also be shared, which means that
> > > > > > two parallel processes can produce addresses onto the temp cq and
> > > > > > therefore expose an address to a socket that it does not belong to. To
> > > > > > make this work, you would have to know the descriptor count of a given
> > > > > > frame upfront and reserve it while processing the first descriptor.
> > > > > >
> > > > > > socket 0 socket 1
> > > > > > prod addr 0xAA
> > > > > > prod addr 0xBB
> > > > > > prod addr 0xDD
> > > > > > prod addr 0xCC
> > > > > > prod addr 0xEE
> > > > > >
> > > > > > socket 0 calls the skb destructor with num desc == 3, placing 0xDD,
> > > > > > which has not been sent yet, onto the cq and therefore potentially
> > > > > > corrupting it.
> > > > >
> > > > > Thanks for spotting this case!
> > > > >
> > > > > Yes, it can happen, so let's switch to per-xsk granularity? If each
> > > > > xsk has its own temp queue, the problem disappears, and the good news
> > > > > is that we don't need extra locks like pool->cq_lock to prevent
> > > > > multiple xsks from accessing the temp queue in parallel.
> > > >
> > > > Sure, when you're confident this is a working solution, you can post it.
> > > > But from my POV we should go with Fernando's patch, and then you can
> > > > send patches to bpf-next as improvements. There are people out there
> > > > with broken xsk waiting for a fix.
> > >
> > > Fine, I will officially post it against the -next branch. But at that
> > > point, will I have to revert both patches (yours and Fernando's)? Will
> > > his patch be applied only to the stable branch, so that I can base
> > > mine on the -next branch?
> >
> > Give it some time and let Fernando repost the patch; once it is applied,
> > all the backport machinery will kick in. I suppose you could step in
> > after the bpf->bpf-next merge, and in the meantime I assume you've got
> > plenty of stuff to work on. My point here is that we have already
> > hesitated too much on this matter, IMHO.
>
> I have no intention to stop that patch from being merged :)
>
> I meant that his patch will _only_ be merged into the net/stable branch
> and _not_ into the -next branch, right? If so, I can continue my
> approach targeting only the -next branch, without worrying about
> conflicts with his patch.
The net/bpf branches are merged periodically into their -next variants,
AFAICT. New kernels carry the fixes as well. The purpose of the net/bpf
branches is to deliver developed fixes to the stable kernels.
>
> Thanks,
> Jason
2025-10-31 9:32 [PATCH RFC net-next 0/2] xsk: fix immature cq descriptor production (II) Jason Xing
2025-10-31 9:32 ` [PATCH RFC net-next 1/2] Revert "xsk: Fix immature cq descriptor production" Jason Xing
2025-10-31 9:32 ` [PATCH RFC net-next 2/2] xsk: introduce a cached cq to temporarily store descriptor addrs Jason Xing
2025-10-31 14:02 ` Maciej Fijalkowski
2025-10-31 23:59 ` Jason Xing
2025-11-03 15:00 ` Maciej Fijalkowski
2025-11-03 23:29 ` Jason Xing
2025-11-11 13:03 ` Jason Xing
2025-11-11 13:44 ` Magnus Karlsson
2025-11-11 14:02 ` Jason Xing
2025-11-14 15:52 ` Maciej Fijalkowski
2025-11-14 23:46 ` Jason Xing
2025-11-17 16:05 ` Maciej Fijalkowski
2025-11-18 0:01 ` Jason Xing
2025-11-18 10:14 ` Maciej Fijalkowski
2025-11-18 11:40 ` Jason Xing
2025-11-18 18:28 ` Maciej Fijalkowski [this message]
2025-11-18 23:37 ` Jason Xing
2025-11-19 9:20 ` Jason Xing