dri-devel.lists.freedesktop.org archive mirror
From: Stanislav Fomichev <sdf@google.com>
To: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: "Kaiyuan Zhang" <kaiyuanz@google.com>,
	dri-devel@lists.freedesktop.org,
	"Eric Dumazet" <edumazet@google.com>,
	linux-kselftest@vger.kernel.org, "Shuah Khan" <shuah@kernel.org>,
	"Sumit Semwal" <sumit.semwal@linaro.org>,
	"Mina Almasry" <almasrymina@google.com>,
	"Jeroen de Borst" <jeroendb@google.com>,
	"Jakub Kicinski" <kuba@kernel.org>,
	"Paolo Abeni" <pabeni@redhat.com>,
	linux-media@vger.kernel.org, linux-arch@vger.kernel.org,
	"Jesper Dangaard Brouer" <hawk@kernel.org>,
	"Arnd Bergmann" <arnd@arndb.de>,
	linaro-mm-sig@lists.linaro.org,
	"Shakeel Butt" <shakeelb@google.com>,
	"Willem de Bruijn" <willemb@google.com>,
	netdev@vger.kernel.org, "David Ahern" <dsahern@kernel.org>,
	"Ilias Apalodimas" <ilias.apalodimas@linaro.org>,
	linux-kernel@vger.kernel.org,
	"David S. Miller" <davem@davemloft.net>,
	"Praveen Kaligineedi" <pkaligineedi@google.com>,
	"Christian König" <christian.koenig@amd.com>
Subject: Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags
Date: Mon, 6 Nov 2023 16:59:49 -0800
Message-ID: <ZUmMBZpLPQkRS9bg@google.com>
In-Reply-To: <ZUmBf7E8ZoTQwThL@google.com>

On 11/06, Stanislav Fomichev wrote:
> On 11/06, Willem de Bruijn wrote:
> > On Mon, Nov 6, 2023 at 3:55 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On Mon, Nov 6, 2023 at 3:27 PM Mina Almasry <almasrymina@google.com> wrote:
> > > >
> > > > On Mon, Nov 6, 2023 at 2:59 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > >
> > > > > On 11/06, Mina Almasry wrote:
> > > > > > On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > > > >
> > > > > > > On 11/06, Mina Almasry wrote:
> > > > > > > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern <dsahern@kernel.org> wrote:
> > > > > > > > >
> > > > > > > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > > > > > > On 11/05, Mina Almasry wrote:
> > > > > > > > > >> For device memory TCP, we expect the skb headers to be available in host
> > > > > > > > > >> memory for access, and we expect the skb frags to be in device memory
> > > > > > > > > >> and unaccessible to the host. We expect there to be no mixing and
> > > > > > > > > >> matching of device memory frags (unaccessible) with host memory frags
> > > > > > > > > >> (accessible) in the same skb.
> > > > > > > > > >>
> > > > > > > > > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > > > > > > > > >> are device memory frags or not.
> > > > > > > > > >>
> > > > > > > > > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > > > > > > > > >> and marks the skb as skb->devmem accordingly.
> > > > > > > > > >>
> > > > > > > > > >> Add checks through the network stack to avoid accessing the frags of
> > > > > > > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > > > > > > > > >>
> > > > > > > > > >> Signed-off-by: Willem de Bruijn <willemb@google.com>
> > > > > > > > > >> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> > > > > > > > > >> Signed-off-by: Mina Almasry <almasrymina@google.com>
> > > > > > > > > >>
> > > > > > > > > >> ---
> > > > > > > > > >>  include/linux/skbuff.h | 14 +++++++-
> > > > > > > > > >>  include/net/tcp.h      |  5 +--
> > > > > > > > > >>  net/core/datagram.c    |  6 ++++
> > > > > > > > > >>  net/core/gro.c         |  5 ++-
> > > > > > > > > >>  net/core/skbuff.c      | 77 ++++++++++++++++++++++++++++++++++++------
> > > > > > > > > >>  net/ipv4/tcp.c         |  6 ++++
> > > > > > > > > >>  net/ipv4/tcp_input.c   | 13 +++++--
> > > > > > > > > >>  net/ipv4/tcp_output.c  |  5 ++-
> > > > > > > > > >>  net/packet/af_packet.c |  4 +--
> > > > > > > > > >>  9 files changed, 115 insertions(+), 20 deletions(-)
> > > > > > > > > >>
> > > > > > > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > > > > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > > > > > > >> --- a/include/linux/skbuff.h
> > > > > > > > > >> +++ b/include/linux/skbuff.h
> > > > > > > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > > > > > > >>   *  @csum_level: indicates the number of consecutive checksums found in
> > > > > > > > > >>   *          the packet minus one that have been verified as
> > > > > > > > > >>   *          CHECKSUM_UNNECESSARY (max 3)
> > > > > > > > > >> + *  @devmem: indicates that all the fragments in this skb are backed by
> > > > > > > > > >> + *          device memory.
> > > > > > > > > >>   *  @dst_pending_confirm: need to confirm neighbour
> > > > > > > > > >>   *  @decrypted: Decrypted SKB
> > > > > > > > > >>   *  @slow_gro: state present at GRO time, slower prepare step required
> > > > > > > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > > > > > > >>  #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > > > > > > >>      __u8                    csum_not_inet:1;
> > > > > > > > > >>  #endif
> > > > > > > > > >> -
> > > > > > > > > >> +    __u8                    devmem:1;
> > > > > > > > > >>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > > > > > > >>      __u16                   tc_index;       /* traffic control index */
> > > > > > > > > >>  #endif
> > > > > > > > > >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > > > > > > >>              __skb_zcopy_downgrade_managed(skb);
> > > > > > > > > >>  }
> > > > > > > > > >>
> > > > > > > > > >> +/* Return true if frags in this skb are not readable by the host. */
> > > > > > > > > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > > > > > > > >> +{
> > > > > > > > > >> +    return skb->devmem;
> > > > > > > > > >
> > > > > > > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > > > > > > > > > It better communicates the fact that the stack shouldn't dereference the
> > > > > > > > > > frags (because it has 'devmem' fragments or for some other potential
> > > > > > > > > > future reason).
> > > > > > > > >
> > > > > > > > > +1.
> > > > > > > > >
> > > > > > > > > Also, the flag on the skb is an optimization - a high level signal that
> > > > > > > > > one or more frags is in unreadable memory. There is no requirement that
> > > > > > > > > all of the frags are in the same memory type.
> > > > > > >
> > > > > > > David: maybe there should be such a requirement (that they all are
> > > > > > > unreadable)? Might be easier to support initially; we can relax later
> > > > > > > on.
> > > > > > >
> > > > > >
> > > > > > Currently devmem == not_readable, and the restriction is that all the
> > > > > > frags in the same skb must be either all readable or all unreadable
> > > > > > (all devmem or all non-devmem).
> > > > > >
> > > > > > > > The flag indicates that the skb contains all devmem dma-buf memory
> > > > > > > > specifically, not generic 'not_readable' frags as the comment says:
> > > > > > > >
> > > > > > > > + *     @devmem: indicates that all the fragments in this skb are backed by
> > > > > > > > + *             device memory.
> > > > > > > >
> > > > > > > > The reason it's not a generic 'not_readable' flag is because handing
> > > > > > > > off a generic not_readable skb to the userspace is semantically not
> > > > > > > > what we're doing. recvmsg() is augmented in this patch series to
> > > > > > > > return a devmem skb to the user via a cmsg_devmem struct which refers
> > > > > > > > specifically to the memory in the dma-buf. recvmsg() in this patch
> > > > > > > > series is not augmented to give any 'not_readable' skb to the
> > > > > > > > userspace.
> > > > > > > >
> > > > > > > > IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> > > > > > > > > correct. If a new type of unreadable skb is introduced to the stack,
> > > > > > > > I imagine the stack would implement:
> > > > > > > >
> > > > > > > > 1. new header flag: skb->newmem
> > > > > > > > 2.
> > > > > > > >
> > > > > > > > > static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > > > > > > {
> > > > > > > >     return skb->devmem || skb->newmem;
> > > > > > > > }
> > > > > > > >
> > > > > > > > > 3. tcp_recvmsg_devmem() would handle skb->devmem skbs as in this patch
> > > > > > > > series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.
> > > > > > >
> > > > > > > You copy it to the userspace in a special way because your frags
> > > > > > > are page_is_page_pool_iov(). I agree with David, the skb bit is
> > > > > > > > just an optimization.
> > > > > > >
> > > > > > > For most of the core stack, it doesn't matter why your skb is not
> > > > > > > readable. For a few places where it matters (recvmsg?), you can
> > > > > > > double-check your frags (all or some) with page_is_page_pool_iov.
> > > > > > >
> > > > > >
> > > > > > I see, we can do that then. I.e. make the header flag 'not_readable'
> > > > > > and check the frags to decide to delegate to tcp_recvmsg_devmem() or
> > > > > > > something else. We can even assume not_readable == devmem because
> > > > > > > devmem is currently the only type of unreadable frag.
> > > > > >
> > > > > > > Unrelated: we probably need socket to dmabuf association as well (via
> > > > > > > netlink or something).
> > > > > >
> > > > > > Not sure this is possible. The dma-buf is bound to the rx-queue, and
> > > > > > any packets that land on that rx-queue are bound to that dma-buf,
> > > > > > regardless of which socket that packet belongs to. So the association
> > > > > > IMO must be rx-queue to dma-buf, not socket to dma-buf.
> > > > >
> > > > > But there is still always 1 dmabuf to 1 socket association (on rx), right?
> > > > > Because otherwise, there is no way currently to tell, at recvmsg, which
> > > > > dmabuf the received token belongs to.
> > > > >
> > > >
> > > > Yes, but this 1 dma-buf to 1 socket association happens because the
> > > > user binds the dma-buf to an rx-queue and configures flow steering of
> > > > the socket to that rx-queue.
> > >
> > > It's still fixed and won't change during the socket lifetime, right?
> > > And the socket has to know this association; otherwise those tokens
> > > are useless since they don't carry anything to identify the dmabuf.
> > >
> > > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
> > > it somehow implies that I have an option of passing or not passing it
> > > for an individual system call.
> > > If we know that we're going to use dmabuf with the socket, maybe we
> > > should move this flag to the socket() syscall?
> > >
> > > fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);
> > >
> > > ?
> > 
> > I think it should then be a setsockopt called before any data is
> > > exchanged, with no chance of modifying mode later. We generally use
> > setsockopts for the mode of a socket. This use of the protocol field
> > in socket() for setting a mode would be novel. Also, it might miss
> > passively opened connections, or be overly restrictive: one approach
> > for all accepted child sockets.
> 
> I was thinking this is similar to SOCK_CLOEXEC or SOCK_NONBLOCK? There
> are plenty of bits we can grab. But setsockopt works as well!

To follow up: if we have this flag on a socket, not on a per-message
basis, can we also use recvmsg for the recycling part maybe?

while (true) {
	memset(&msg, 0, ...);

	/* receive the tokens */
	ret = recvmsg(fd, &msg, 0);

	/* recycle the tokens from the above recvmsg() */
	ret = recvmsg(fd, &msg, MSG_RECYCLE);
}

recvmsg + MSG_RECYCLE can parse the same format that regular recvmsg
exports (SO_DEVMEM_OFFSET), and we can also add an extra cmsg option
to recycle a range.
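
To make "parse the same format" a bit more concrete, here is a rough
userspace-only sketch of packing token records into a control buffer
and walking them back with the standard CMSG_* macros. Everything
prefixed hyp_/HYP_ is made up for illustration: the record layout and
the cmsg type value are assumptions, not what the patch series
defines, and MSG_RECYCLE itself is only a proposal, so no syscall is
made here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical cmsg type and payload; NOT taken from the patch series. */
#define HYP_SCM_DEVMEM_OFFSET 0x7001

struct hyp_devmem_token {
	uint64_t frag_offset;	/* offset of the frag inside the dma-buf */
	uint32_t frag_size;	/* length of the frag in bytes */
	uint32_t frag_token;	/* opaque id to hand back when recycling */
};

/* Append one token record at *used; returns 0, or -1 if it does not fit. */
static int hyp_put_token(struct msghdr *msg, size_t *used,
			 const struct hyp_devmem_token *tok)
{
	size_t space = CMSG_SPACE(sizeof(*tok));
	struct cmsghdr *cm;

	if (*used + space > msg->msg_controllen)
		return -1;

	cm = (struct cmsghdr *)((char *)msg->msg_control + *used);
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = HYP_SCM_DEVMEM_OFFSET;
	cm->cmsg_len = CMSG_LEN(sizeof(*tok));
	memcpy(CMSG_DATA(cm), tok, sizeof(*tok));
	*used += space;
	return 0;
}

/* Walk the control buffer and copy out up to max token records. */
static size_t hyp_collect_tokens(struct msghdr *msg,
				 struct hyp_devmem_token *out, size_t max)
{
	struct cmsghdr *cm;
	size_t n = 0;

	for (cm = CMSG_FIRSTHDR(msg); cm && n < max; cm = CMSG_NXTHDR(msg, cm)) {
		if (cm->cmsg_level != SOL_SOCKET ||
		    cm->cmsg_type != HYP_SCM_DEVMEM_OFFSET)
			continue;
		if (cm->cmsg_len < CMSG_LEN(sizeof(out[0])))
			continue;
		memcpy(&out[n], CMSG_DATA(cm), sizeof(out[0]));
		n++;
	}
	return n;
}

/* Pack two made-up tokens, walk them back, return how many round-tripped. */
static size_t hyp_roundtrip_demo(void)
{
	union {
		char buf[2 * CMSG_SPACE(sizeof(struct hyp_devmem_token))];
		struct cmsghdr align;	/* forces cmsghdr alignment */
	} ctrl;
	struct hyp_devmem_token in[2] = {
		{ .frag_offset = 0,    .frag_size = 4096, .frag_token = 1 },
		{ .frag_offset = 4096, .frag_size = 2048, .frag_token = 2 },
	};
	struct hyp_devmem_token out[2];
	struct msghdr msg;
	size_t used = 0, n, i, ok = 0;

	memset(&ctrl, 0, sizeof(ctrl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	for (i = 0; i < 2; i++)
		if (hyp_put_token(&msg, &used, &in[i]) != 0)
			return 0;

	/* Shrink to the bytes actually filled, as recvmsg() would report. */
	msg.msg_controllen = used;

	n = hyp_collect_tokens(&msg, out, 2);
	for (i = 0; i < n; i++)
		if (out[i].frag_token == in[i].frag_token &&
		    out[i].frag_size == in[i].frag_size)
			ok++;
	return ok;
}
```

If the recycle side took the same records, userspace could pass the
control buffer back largely as received instead of translating it into
a separate setsockopt argument.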

Will this be more straightforward than a setsockopt(SO_DEVMEM_DONTNEED)?
Or is it more confusing?
