From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
To: Stanislav Fomichev <stfomichev@gmail.com>
Cc: Eryk Kubanski <e.kubanski@partner.samsung.com>,
"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"bjorn@kernel.org" <bjorn@kernel.org>,
"magnus.karlsson@intel.com" <magnus.karlsson@intel.com>,
"jonathan.lemon@gmail.com" <jonathan.lemon@gmail.com>
Subject: Re: Re: [PATCH bpf v2] xsk: Fix out of order segment free in __xsk_generic_xmit()
Date: Mon, 2 Jun 2025 18:03:01 +0200
Message-ID: <aD3LNcG0qHHwPbiw@boxer>
In-Reply-To: <aD3DM4elo_Xt82LE@mini-arch>
On Mon, Jun 02, 2025 at 08:28:51AM -0700, Stanislav Fomichev wrote:
> On 06/02, Eryk Kubanski wrote:
> > > I'm not sure I understand what's the issue here. If you're using the
> > > same XSK from different CPUs, you should take care of the ordering
> > > yourself on the userspace side?
> >
> > It's not a problem with user-space Completion Queue READER side.
> > I'm talking exclusively about the kernel-space Completion Queue WRITE side.
> >
> > This problem can occur when multiple sockets are bound to the same
> > umem, device, queue id. In this situation Completion Queue is shared.
> > This means it can be accessed by multiple threads on kernel-side.
> > Any use is indeed protected by spinlock, however any write sequence
> > (Acquire write slot as writer, write to slot, submit write slot to reader)
> > isn't atomic in any way and it's possible to submit not-yet-sent packet
> > descriptors back to user-space as TX completed.
> >
> > Up until now, all write-back operations had two phases, each phase
> > locks the spinlock and unlocks it:
> > 1) Acquire slot + Write descriptor (increase cached-writer by N + write values)
> > 2) Submit slot to the reader (increase writer by N)
> >
> > Slot submission was solely based on the timing. Let's consider situation,
> > where two different threads issue a syscall for two different AF_XDP sockets
> > that are bound to the same umem, dev, queue-id.
> >
> > AF_XDP setup:
> >
> >                  kernel-space
> >
> >                Write     Read
> >                +--+      +--+
> >                |  |      |  |
> >                |  |      |  |
> >                |  |      |  |
> >    Completion  |  |      |  |  Fill
> >    Queue       |  |      |  |  Queue
> >                |  |      |  |
> >                |  |      |  |
> >                |  |      |  |
> >                |  |      |  |
> >                +--+      +--+
> >                Read      Write
> >                  user-space
> >
> >
> >         +--------+      +--------+
> >         | AF_XDP |      | AF_XDP |
> >         +--------+      +--------+
> >
> >
> >
> >
> >
> > Possible out-of-order scenario:
> >
> >
> >                              writer        cached_writer1                      cached_writer2
> >                                 |                 |                                   |
> >                                 |                 |                                   |
> >                                 |                 |                                   |
> >                                 |                 |                                   |
> >                  +--------------|--------|--------|--------|--------|--------|--------|----------------------------------------------+
> >                  |              |        |        |        |        |        |        |                                              |
> > Completion Queue |              |        |        |        |        |        |        |                                              |
> >                  |              |        |        |        |        |        |        |                                              |
> >                  +--------------|--------|--------|--------|--------|--------|--------|----------------------------------------------+
> >                                 |                 |                                   |
> >                                 |                 |                                   |
> >                                 |-----------------|                                   |
> >                                    A) T1 syscall  |                                   |
> >                                      writes 2     |                                   |
> >                                     descriptors   |-----------------------------------|
> >                                                    B) T2 syscall writes 4 descriptors
> >
> >
> >
> >
> > Notes:
> > 1) T1 and T2 AF_XDP sockets are two different sockets,
> > __xsk_generic_xmit will obtain two different mutexes.
> > 2) T1 and T2 can be executed simultaneously, there is no
> > critical section whatsoever between them.
>
> XSK represents a single queue and each queue is single producer single
> consumer. The fact that you can dup a socket and call sendmsg from
> different threads/processes does not lift that restriction. I think
> if you add synchronization on the userspace (lock(); sendmsg();
> unlock();), that should help, right?

Eryk, can you tell us a bit more about the HW you're using? The problem you
described simply cannot happen on HW with in-order completions: you can't
complete the descriptor in slot 5 without first completing the one in
slot 3. So our assumption is that you're using HW with out-of-order
completions, correct?

If that is the case, then we have to think about possible solutions, which
probably won't be straightforward. As Stan said, the current fix is a no-go.
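
To make the scenario under discussion concrete, here is a toy Python model of
the two-phase write described in the quoted text (reserve a slot and write the
descriptor, then submit N slots to the reader). It is only a sketch, not the
kernel code: the CompletionRing class, its method names, and the "T1-a"-style
addresses are made up for illustration, and it simply assumes that T2's
packets complete before T1's, which is exactly the out-of-order case being
asked about.

```python
class CompletionRing:
    """Toy model of a shared completion queue (hypothetical, not kernel code)."""

    def __init__(self, size):
        self.slots = [None] * size
        self.cached_prod = 0  # next slot a writer may claim
        self.producer = 0     # slots below this index are visible to user space

    def reserve(self, addr):
        # Phase 1: claim the next slot and write the descriptor address.
        self.slots[self.cached_prod % len(self.slots)] = addr
        self.cached_prod += 1

    def submit(self, n):
        # Phase 2: publish n more slots. Nothing ties these n slots
        # to the slots the caller reserved in phase 1.
        self.producer += n

    def visible(self):
        # What the user-space reader sees (consumer assumed to be at 0).
        return [self.slots[i % len(self.slots)] for i in range(self.producer)]


ring = CompletionRing(8)
t1 = ["T1-a", "T1-b"]                  # A) T1 reserves 2 descriptors
t2 = ["T2-a", "T2-b", "T2-c", "T2-d"]  # B) T2 reserves 4 descriptors

for addr in t1:
    ring.reserve(addr)
for addr in t2:
    ring.reserve(addr)

# Assumption: T2's packets are completed first, so T2 submits its 4 slots
# while T1's packets are still in flight.
completed = set(t2)
ring.submit(len(t2))

exposed = ring.visible()
premature = [a for a in exposed if a not in completed]
hidden = [a for a in t2 if a not in exposed]

print(exposed)    # ['T1-a', 'T1-b', 'T2-a', 'T2-b']
print(premature)  # ['T1-a', 'T1-b']
print(hidden)     # ['T2-c', 'T2-d']
```

In this model the reader is handed T1's two descriptors before they were
sent, while two of T2's completed descriptors stay hidden until T1 submits.
If completions always arrive in reservation order, the submit calls happen in
the same order as the reserve calls and the mismatch does not appear.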