From: Stanislav Fomichev <sdf@google.com>
To: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: "Kaiyuan Zhang" <kaiyuanz@google.com>,
dri-devel@lists.freedesktop.org,
"Eric Dumazet" <edumazet@google.com>,
linux-kselftest@vger.kernel.org, "Shuah Khan" <shuah@kernel.org>,
"Sumit Semwal" <sumit.semwal@linaro.org>,
"Mina Almasry" <almasrymina@google.com>,
"Jeroen de Borst" <jeroendb@google.com>,
"Jakub Kicinski" <kuba@kernel.org>,
"Paolo Abeni" <pabeni@redhat.com>,
linux-media@vger.kernel.org, linux-arch@vger.kernel.org,
"Jesper Dangaard Brouer" <hawk@kernel.org>,
"Arnd Bergmann" <arnd@arndb.de>,
linaro-mm-sig@lists.linaro.org,
"Shakeel Butt" <shakeelb@google.com>,
"Willem de Bruijn" <willemb@google.com>,
netdev@vger.kernel.org, "David Ahern" <dsahern@kernel.org>,
"Ilias Apalodimas" <ilias.apalodimas@linaro.org>,
linux-kernel@vger.kernel.org,
"David S. Miller" <davem@davemloft.net>,
"Praveen Kaligineedi" <pkaligineedi@google.com>,
"Christian König" <christian.koenig@amd.com>
Subject: Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags
Date: Tue, 7 Nov 2023 10:14:25 -0800
Message-ID: <ZUp-gYT7OMb9wun3@google.com>
In-Reply-To: <CAF=yD-Ltd0REhOS78q_t8bSEpefQsZuJV_Aq7zxXmFDh+BmJhg@mail.gmail.com>

On 11/07, Willem de Bruijn wrote:
> On Tue, Nov 7, 2023 at 12:44 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On 11/06, Willem de Bruijn wrote:
> > > > > > > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
> > > > > > > it somehow implies that I have an option of passing or not passing it
> > > > > > > for an individual system call.
> > > > > > > If we know that we're going to use dmabuf with the socket, maybe we
> > > > > > > should move this flag to the socket() syscall?
> > > > > > >
> > > > > > > fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);
> > > > > > >
> > > > > > > ?
> > > > > >
> > > > > > I think it should then be a setsockopt called before any data is
> > > > > > exchanged, with no chance of modifying mode later. We generally use
> > > > > > setsockopts for the mode of a socket. This use of the protocol field
> > > > > > in socket() for setting a mode would be novel. Also, it might miss
> > > > > > passively opened connections, or be overly restrictive: one approach
> > > > > > for all accepted child sockets.
> > > > >
> > > > > I was thinking this is similar to SOCK_CLOEXEC or SOCK_NONBLOCK? There
> > > > > are plenty of bits we can grab. But setsockopt works as well!
> > > >
> > > > To follow up: if we have this flag on a socket, not on a per-message
> > > > basis, can we also use recvmsg for the recycling part maybe?
> > > >
> > > > while (true) {
> > > > memset(msg, 0, ...);
> > > >
> > > > /* receive the tokens */
> > > > ret = recvmsg(fd, &msg, 0);
> > > >
> > > > /* recycle the tokens from the above recvmsg() */
> > > > ret = recvmsg(fd, &msg, MSG_RECYCLE);
> > > > }
> > > >
> > > > recvmsg + MSG_RECYCLE can parse the same format that regular recvmsg
> > > > exports (SO_DEVMEM_OFFSET) and we can also add extra cmsg option
> > > > to recycle a range.
> > > >
> > > > Will this be more straightforward than a setsockopt(SO_DEVMEM_DONTNEED)?
> > > > Or is it more confusing?
> > >
> > > It would have to be sendmsg, as recvmsg is a copy_to_user operation.
> > >
> > >
> > > I am not aware of any precedent in multiplexing the data stream and a
> > > control operation stream in this manner. It would also require adding
> > > a branch in the sendmsg hot path.
> >
> > Is it too much plumbing to copy_from_user msg_control deep in recvmsg
> > stack where we need it? Mixing in sendmsg is indeed ugly :-(
>
> I tried exactly the inverse of that when originally adding
> MSG_ZEROCOPY: to allow piggy-backing zerocopy completion notifications
> on sendmsg calls by writing to sendmsg msg_control on return to user.
> It required significant code churn, which the performance gains did
> not warrant. Doing so also breaks the simple rule that recv is for
> reading and send is for writing.

We're breaking so many rules here, so not sure we should be super
constrained :-D

> > Regarding hot path: aren't we already doing copy_to_user for the tokens in
> > this hot path, so having one extra condition shouldn't hurt too much?
>
> We're doing that in the optional cmsg handling of recvmsg, which is
> already a slow path (compared to the data read() itself).
>
> > > The memory is associated with the socket, freed when the socket is
> > > closed as well as on SO_DEVMEM_DONTNEED. Fundamentally it is a socket
> > > state operation, for which setsockopt is the socket interface.
> > >
> > > Is your request purely a dislike, or is there some technical concern
> > > with BPF and setsockopt?
> >
> > It's mostly because I've been bitten too much by custom socket options that
> > are not really on/off/update-value operations:
> >
> > 29ebbba7d461 - bpf: Don't EFAULT for {g,s}setsockopt with wrong optlen
> > 00e74ae08638 - bpf: Don't EFAULT for getsockopt with optval=NULL
> > 9cacf81f8161 - bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE
> > d8fe449a9c51 - bpf: Don't return EINVAL from {get,set}sockopt when optlen > PAGE_SIZE
> >
> > I do agree that this particular case of SO_DEVMEM_DONTNEED seems ok, but
> > things tend to evolve and change.
>
> I see. I'm a bit concerned if we start limiting what we can do in
> sockets because of dependencies that BPF processing places on them.
> The use case for BPF [gs]etsockopt is limited to specific control mode
> calls. Would it make sense to just exclude calls like
> SO_DEVMEM_DONTNEED from this interpositioning?

Yup, that's why I'm asking. We already have ->bpf_bypass_getsockopt()
to special-case TCP zerocopy. We might add another bpf_bypass_setsockopt
to special-case SO_DEVMEM_DONTNEED. That's why I'm trying to see if
there is a better alternative.

> At a high level what we really want is a high rate metadata path from
> user to kernel. And there are no perfect solutions. From kernel to
> user we use the socket error queue for this. That was never intended
> for high event rate itself, dealing with ICMP errors and the like
> before timestamps and zerocopy notifications were added.
>
> If I squint hard enough I can see some prior art in mixing data and
> high rate state changes within the same channel in NIC descriptor
> queues, where some devices do this, e.g., { "insert encryption key",
> "send packet" }. But fundamentally I think we should keep the socket
> queues for data only.

+1, we keep taking the easy route of using sockopt for this :-(
Anyway, let's see if any better suggestions pop up. Worst case - we stick
with a socket option and will add a bypass on the bpf side.
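
For illustration only (not part of the posted series): a minimal sketch
of the userspace side of the setsockopt-based recycling discussed above.
It walks the control messages returned by recvmsg() to collect frag
tokens, then hands each token back with a setsockopt() call. The option
numbers, both struct layouts, and the helper names are assumptions made
for this sketch; only the CMSG_* macros and setsockopt() are standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* HYPOTHETICAL placeholders: the real option numbers would come from the
 * uapi headers of the series, not from this sketch. */
#define SKETCH_SO_DEVMEM_DONTNEED 97
#define SKETCH_SO_DEVMEM_OFFSET   99

/* ASSUMED payload of one SO_DEVMEM_OFFSET control message. */
struct sketch_devmem_frag {
	uint64_t frag_offset;
	uint32_t frag_size;
	uint32_t frag_token;
};

/* ASSUMED argument of the SO_DEVMEM_DONTNEED setsockopt. */
struct sketch_devmem_token {
	uint32_t token_start;
	uint32_t token_count;
};

/* Copy the frag tokens found in msg's control buffer into tokens[].
 * Returns the number of tokens stored (at most max). */
static size_t collect_tokens(struct msghdr *msg, uint32_t *tokens, size_t max)
{
	struct cmsghdr *cm;
	size_t n = 0;

	assert(msg != NULL && tokens != NULL);

	for (cm = CMSG_FIRSTHDR(msg); cm != NULL; cm = CMSG_NXTHDR(msg, cm)) {
		struct sketch_devmem_frag frag;

		/* Skip control messages that are not devmem frag records. */
		if (cm->cmsg_level != SOL_SOCKET ||
		    cm->cmsg_type != SKETCH_SO_DEVMEM_OFFSET)
			continue;
		if (cm->cmsg_len < CMSG_LEN(sizeof(frag)))
			continue;
		if (n == max)
			break;

		/* CMSG_DATA() may be unaligned for the payload type. */
		memcpy(&frag, CMSG_DATA(cm), sizeof(frag));
		tokens[n++] = frag.frag_token;
	}
	return n;
}

/* Return each token to the kernel, one setsockopt() call per token.
 * Returns 0 on success, -1 on the first failure (errno is set). */
static int recycle_tokens(int fd, const uint32_t *tokens, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		struct sketch_devmem_token t = {
			.token_start = tokens[i],
			.token_count = 1,
		};

		if (setsockopt(fd, SOL_SOCKET, SKETCH_SO_DEVMEM_DONTNEED,
			       &t, sizeof(t)) < 0)
			return -1;
	}
	return 0;
}
```

In a receive loop this would be collect_tokens() after each recvmsg(),
followed by recycle_tokens() once the application is done with the data.
On a kernel without the series applied, the setsockopt() call is
expected to fail.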