Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH bpf 1/6] netkit: Add tstats per-CPU traffic counters
From: Daniel Borkmann @ 2023-11-06 23:42 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: martin.lau, netdev, bpf, Nikolay Aleksandrov
In-Reply-To: <20231106132845.6356bc72@kernel.org>

On 11/6/23 10:28 PM, Jakub Kicinski wrote:
> On Fri,  3 Nov 2023 23:27:43 +0100 Daniel Borkmann wrote:
>> Add dev->tstats traffic accounting to netkit. The latter contains per-CPU
>> RX and TX counters.
>>
>> The dev's TX counters are bumped upon pass/unspec as well as redirect
>> verdicts, in other words, on everything except for drops.
>>
>> The dev's RX counters are bumped upon successful __netif_rx(), as well
>> as from skb_do_redirect() (not part of this commit here).
>>
>> Using dev->lstats with having just a single packets/bytes counter and
>> inferring one another's RX counters from the peer dev's lstats is not
>> possible given skb_do_redirect() can also bump the device's stats.
> 
> sorry for the delay in replying, I'll comment here instead of on:
> 
> https://lore.kernel.org/all/6d5cb0ef-fabc-7ca3-94b2-5b1925a6805f@iogearbox.net/
> 
> What I had in mind was to have the driver just set the type of stats.
> That way it doesn't have to bother with error handling either
> (allocation failure checking, making sure free happens in the right
> spot etc. all happen in the core). Here's a completely untested diff:

Ah perfect, thanks! I'll take a look and integrate this into a v2 this
week if that's okay with you. And add sth to bail out if the ndo is in
place and NETDEV_PCPU_STAT_TSTAT not selected for the time being.

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH bpf 4/6] bpf, netkit: Add indirect call wrapper for fetching peer dev
From: Daniel Borkmann @ 2023-11-06 23:44 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: martin.lau, netdev, bpf, Nikolay Aleksandrov
In-Reply-To: <20231106133250.0d49a487@kernel.org>

On 11/6/23 10:32 PM, Jakub Kicinski wrote:
> On Fri,  3 Nov 2023 23:27:46 +0100 Daniel Borkmann wrote:
>> ndo_get_peer_dev is used in tcx BPF fast path, therefore make use of
>> indirect call wrapper and therefore optimize the bpf_redirect_peer()
>> internal handling a bit. Add a small skb_get_peer_dev() wrapper which
>> utilizes the INDIRECT_CALL_1() macro instead of open coding.
> 
> Why don't we kill the ndo and put the pointer in struct net_device?

I can take a stab at this some time after LPC and probably makes sense
also after Coco's cacheline optimizations landed.

Thanks,
Daniel

^ permalink raw reply

* Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator
From: David Ahern @ 2023-11-06 23:44 UTC (permalink / raw)
  To: Mina Almasry, netdev, linux-kernel, linux-arch, linux-kselftest,
	linux-media, dri-devel, linaro-mm-sig
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	Willem de Bruijn, Shuah Khan, Sumit Semwal, Christian König,
	Shakeel Butt, Jeroen de Borst, Praveen Kaligineedi,
	Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <20231106024413.2801438-6-almasrymina@google.com>

On 11/5/23 7:44 PM, Mina Almasry wrote:
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index eeeda849115c..1c351c138a5b 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
>  };
>  
>  #ifdef CONFIG_DMA_SHARED_BUFFER
> +struct page_pool_iov *
> +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
> +void netdev_free_devmem(struct page_pool_iov *ppiov);

netdev_{alloc,free}_dmabuf?

I say that because a dmabuf can be host memory, at least I am not aware
of a restriction that a dmabuf is device memory.


^ permalink raw reply

* Re: [PATCH] netlink: introduce netlink poll to resolve fast return issue
From: Jakub Kicinski @ 2023-11-06 23:48 UTC (permalink / raw)
  To: Jong eon Park, Paolo Abeni
  Cc: David S. Miller, Eric Dumazet, netdev, linux-kernel, Dong ha Kang
In-Reply-To: <20231103072209.1005409-1-jongeon.park@samsung.com>

On Fri,  3 Nov 2023 16:22:09 +0900 Jong eon Park wrote:
> In very rare cases, there was an issue where a user's poll function
> waiting for a uevent would continuously return very quickly, causing
> excessive CPU usage due to the following scenario.
> 
> Once sk_rcvbuf becomes full netlink_broadcast_deliver returns an error and
> netlink_overrun is called. However, if netlink_overrun was called in a
> context just before a another context returns from the poll and recv is
> invoked, emptying the rcvbuf, sk->sk_err = ENOBUF is written to the
> netlink socket belatedly and it enters the NETLINK_S_CONGESTED state.
> If the user does not check for POLLERR, they cannot consume and clean
> sk_err and repeatedly enter the situation where they call poll again but
> return immediately.
> 
> To address this issue, I would like to introduce the following netlink
> poll.
> 
> After calling the datagram_poll, netlink poll checks the
> NETLINK_S_CONGESTED status and rcv queue, and this make the user to be
> readable once more even if the user has already emptied rcv queue. This
> allows the user to be able to consume sk->sk_err value through
> netlink_recvmsg, thus the situation described above can be avoided

The explanation makes sense, but I'm not able to make the jump in
understanding how this is a netlink problem. datagram_poll() returns
EPOLLERR because sk_err is set, what makes netlink special?
The fact that we can have an sk_err with nothing in the recv queue?

Paolo understands this better, maybe he can weigh in tomorrow...

^ permalink raw reply

* Re: [RFC PATCH v3 06/12] memory-provider: dmabuf devmem memory provider
From: David Ahern @ 2023-11-06 23:49 UTC (permalink / raw)
  To: Mina Almasry, netdev, linux-kernel, linux-arch, linux-kselftest,
	linux-media, dri-devel, linaro-mm-sig
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	Willem de Bruijn, Shuah Khan, Sumit Semwal, Christian König,
	Shakeel Butt, Jeroen de Borst, Praveen Kaligineedi,
	Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <20231106024413.2801438-7-almasrymina@google.com>

On 11/5/23 7:44 PM, Mina Almasry wrote:
> diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
> index 78cbb040af94..b93243c2a640 100644
> --- a/include/net/page_pool/helpers.h
> +++ b/include/net/page_pool/helpers.h
> @@ -111,6 +112,45 @@ page_pool_iov_binding(const struct page_pool_iov *ppiov)
>  	return page_pool_iov_owner(ppiov)->binding;
>  }
>  
> +static inline int page_pool_iov_refcount(const struct page_pool_iov *ppiov)
> +{
> +	return refcount_read(&ppiov->refcount);
> +}
> +
> +static inline void page_pool_iov_get_many(struct page_pool_iov *ppiov,
> +					  unsigned int count)
> +{
> +	refcount_add(count, &ppiov->refcount);
> +}
> +
> +void __page_pool_iov_free(struct page_pool_iov *ppiov);
> +
> +static inline void page_pool_iov_put_many(struct page_pool_iov *ppiov,
> +					  unsigned int count)
> +{
> +	if (!refcount_sub_and_test(count, &ppiov->refcount))
> +		return;
> +
> +	__page_pool_iov_free(ppiov);
> +}
> +
> +/* page pool mm helpers */
> +
> +static inline bool page_is_page_pool_iov(const struct page *page)
> +{
> +	return (unsigned long)page & PP_DEVMEM;

This is another one where the code can be more generic to not force a
lot changes later.  e.g., PP_CUSTOM or PP_NO_PAGE. Then io_uring use
case with host memory can leverage the iov pool in a similar manner.

That does mean skb->devmem needs to be a flag on the page pool and not
just assume iov == device memory.



^ permalink raw reply

* Re: [RFC PATCH v3 10/12] tcp: RX path for devmem TCP
From: David Ahern @ 2023-11-06 23:55 UTC (permalink / raw)
  To: Stanislav Fomichev, Willem de Bruijn
  Cc: Mina Almasry, netdev, linux-kernel, linux-arch, linux-kselftest,
	linux-media, dri-devel, linaro-mm-sig, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Arnd Bergmann, Shuah Khan, Sumit Semwal,
	Christian König, Shakeel Butt, Jeroen de Borst,
	Praveen Kaligineedi, Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <CAKH8qBueYgpxQTvTwngOs6RNjy9yvLF92s1p5nFrobw_UprNMQ@mail.gmail.com>

On 11/6/23 4:32 PM, Stanislav Fomichev wrote:
>> The concise notification API returns tokens as a range for
>> compression, encoding as two 32-bit unsigned integers start + length.
>> It allows for even further batching by returning multiple such ranges
>> in a single call.
> 
> Tangential: should tokens be u64? Otherwise we can't have more than
> 4gb unacknowledged. Or that's a reasonable constraint?
> 

Was thinking the same and with bits reserved for a dmabuf id to allow
multiple dmabufs in a single rx queue (future extension, but build the
capability in now). e.g., something like a 37b offset (128GB dmabuf
size), 19b length (large GRO), 8b dmabuf id (lots of dmabufs to a queue).

^ permalink raw reply

* Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags
From: Stanislav Fomichev @ 2023-11-06 23:55 UTC (permalink / raw)
  To: Mina Almasry
  Cc: David Ahern, netdev, linux-kernel, linux-arch, linux-kselftest,
	linux-media, dri-devel, linaro-mm-sig, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Arnd Bergmann, Willem de Bruijn, Shuah Khan,
	Sumit Semwal, Christian König, Shakeel Butt, Jeroen de Borst,
	Praveen Kaligineedi, Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <CAHS8izMQ5Um_ScY0VgAjaEaT-hRh4tFoTgc6Xr9Tj5rEj0fijA@mail.gmail.com>

On Mon, Nov 6, 2023 at 3:27 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Mon, Nov 6, 2023 at 2:59 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On 11/06, Mina Almasry wrote:
> > > On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > > On 11/06, Mina Almasry wrote:
> > > > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern <dsahern@kernel.org> wrote:
> > > > > >
> > > > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > > > On 11/05, Mina Almasry wrote:
> > > > > > >> For device memory TCP, we expect the skb headers to be available in host
> > > > > > >> memory for access, and we expect the skb frags to be in device memory
> > > > > > >> and unaccessible to the host. We expect there to be no mixing and
> > > > > > >> matching of device memory frags (unaccessible) with host memory frags
> > > > > > >> (accessible) in the same skb.
> > > > > > >>
> > > > > > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > > > > > >> are device memory frags or not.
> > > > > > >>
> > > > > > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > > > > > >> and marks the skb as skb->devmem accordingly.
> > > > > > >>
> > > > > > >> Add checks through the network stack to avoid accessing the frags of
> > > > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > > > > > >>
> > > > > > >> Signed-off-by: Willem de Bruijn <willemb@google.com>
> > > > > > >> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> > > > > > >> Signed-off-by: Mina Almasry <almasrymina@google.com>
> > > > > > >>
> > > > > > >> ---
> > > > > > >>  include/linux/skbuff.h | 14 +++++++-
> > > > > > >>  include/net/tcp.h      |  5 +--
> > > > > > >>  net/core/datagram.c    |  6 ++++
> > > > > > >>  net/core/gro.c         |  5 ++-
> > > > > > >>  net/core/skbuff.c      | 77 ++++++++++++++++++++++++++++++++++++------
> > > > > > >>  net/ipv4/tcp.c         |  6 ++++
> > > > > > >>  net/ipv4/tcp_input.c   | 13 +++++--
> > > > > > >>  net/ipv4/tcp_output.c  |  5 ++-
> > > > > > >>  net/packet/af_packet.c |  4 +--
> > > > > > >>  9 files changed, 115 insertions(+), 20 deletions(-)
> > > > > > >>
> > > > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > > > >> --- a/include/linux/skbuff.h
> > > > > > >> +++ b/include/linux/skbuff.h
> > > > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > > > >>   *  @csum_level: indicates the number of consecutive checksums found in
> > > > > > >>   *          the packet minus one that have been verified as
> > > > > > >>   *          CHECKSUM_UNNECESSARY (max 3)
> > > > > > >> + *  @devmem: indicates that all the fragments in this skb are backed by
> > > > > > >> + *          device memory.
> > > > > > >>   *  @dst_pending_confirm: need to confirm neighbour
> > > > > > >>   *  @decrypted: Decrypted SKB
> > > > > > >>   *  @slow_gro: state present at GRO time, slower prepare step required
> > > > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > > > >>  #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > > > >>      __u8                    csum_not_inet:1;
> > > > > > >>  #endif
> > > > > > >> -
> > > > > > >> +    __u8                    devmem:1;
> > > > > > >>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > > > >>      __u16                   tc_index;       /* traffic control index */
> > > > > > >>  #endif
> > > > > > >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > > > >>              __skb_zcopy_downgrade_managed(skb);
> > > > > > >>  }
> > > > > > >>
> > > > > > >> +/* Return true if frags in this skb are not readable by the host. */
> > > > > > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > > > > >> +{
> > > > > > >> +    return skb->devmem;
> > > > > > >
> > > > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > > > > > > It better communicates the fact that the stack shouldn't dereference the
> > > > > > > frags (because it has 'devmem' fragments or for some other potential
> > > > > > > future reason).
> > > > > >
> > > > > > +1.
> > > > > >
> > > > > > Also, the flag on the skb is an optimization - a high level signal that
> > > > > > one or more frags is in unreadable memory. There is no requirement that
> > > > > > all of the frags are in the same memory type.
> > > >
> > > > David: maybe there should be such a requirement (that they all are
> > > > unreadable)? Might be easier to support initially; we can relax later
> > > > on.
> > > >
> > >
> > > Currently devmem == not_readable, and the restriction is that all the
> > > frags in the same skb must be either all readable or all unreadable
> > > (all devmem or all non-devmem).
> > >
> > > > > The flag indicates that the skb contains all devmem dma-buf memory
> > > > > specifically, not generic 'not_readable' frags as the comment says:
> > > > >
> > > > > + *     @devmem: indicates that all the fragments in this skb are backed by
> > > > > + *             device memory.
> > > > >
> > > > > The reason it's not a generic 'not_readable' flag is because handing
> > > > > off a generic not_readable skb to the userspace is semantically not
> > > > > what we're doing. recvmsg() is augmented in this patch series to
> > > > > return a devmem skb to the user via a cmsg_devmem struct which refers
> > > > > specifically to the memory in the dma-buf. recvmsg() in this patch
> > > > > series is not augmented to give any 'not_readable' skb to the
> > > > > userspace.
> > > > >
> > > > > IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> > > > > correct. If a new type of unreadable skbs are introduced to the stack,
> > > > > I imagine the stack would implement:
> > > > >
> > > > > 1. new header flag: skb->newmem
> > > > > 2.
> > > > >
> > > > > static inline bool skb_frags_not_readable(const struct skb_buff *skb)
> > > > > {
> > > > >     return skb->devmem || skb->newmem;
> > > > > }
> > > > >
> > > > > 3. tcp_recvmsg_devmem() would handle skb->devmem skbs is in this patch
> > > > > series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.
> > > >
> > > > You copy it to the userspace in a special way because your frags
> > > > are page_is_page_pool_iov(). I agree with David, the skb bit is
> > > > just and optimization.
> > > >
> > > > For most of the core stack, it doesn't matter why your skb is not
> > > > readable. For a few places where it matters (recvmsg?), you can
> > > > double-check your frags (all or some) with page_is_page_pool_iov.
> > > >
> > >
> > > I see, we can do that then. I.e. make the header flag 'not_readable'
> > > and check the frags to decide to delegate to tcp_recvmsg_devmem() or
> > > something else. We can even assume not_readable == devmem because
> > > currently devmem is the only type of unreadable frag currently.
> > >
> > > > Unrelated: we probably need socket to dmabuf association as well (via
> > > > netlink or something).
> > >
> > > Not sure this is possible. The dma-buf is bound to the rx-queue, and
> > > any packets that land on that rx-queue are bound to that dma-buf,
> > > regardless of which socket that packet belongs to. So the association
> > > IMO must be rx-queue to dma-buf, not socket to dma-buf.
> >
> > But there is still always 1 dmabuf to 1 socket association (on rx), right?
> > Because otherwise, there is no way currently to tell, at recvmsg, which
> > dmabuf the received token belongs to.
> >
>
> Yes, but this 1 dma-buf to 1 socket association happens because the
> user binds the dma-buf to an rx-queue and configures flow steering of
> the socket to that rx-queue.

It's still fixed and won't change during the socket lifetime, right?
And the socket has to know this association; otherwise those tokens
are useless since they don't carry anything to identify the dmabuf.

I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
it somehow implies that I have an option of passing or not passing it
for an individual system call.
If we know that we're going to use dmabuf with the socket, maybe we
should move this flag to the socket() syscall?

fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);

?

> > So why not have a separate control channel action to say: this socket fd
> > is supposed to receive into this dmabuf fd?
> > This action would put
> > the socket into permanent 'MSG_SOCK_DEVMEM' mode. Maybe you can also
> > put some checks at the lower level to to enforce this dmabuf
> > association. (to avoid any potential issues with flow steering)
> >
>
> setsockopt(SO_DEVMEM_ASSERT_DMA_BUF, dmabuf_fd)? Sounds interesting,
> but maybe a bit of a weird API to me. Because the API can't enforce
> the socket to receive packets on a dma-buf (rx-queue binding + flow
> steering does that), but the API can assert that incoming packets are
> received on said dma-buf. I guess it would check packets before they
> are acked and would drop packets that landed on the wrong queue.
>
> I'm a bit unsure about defensively programming features (and uapi no
> less) to 'avoid any potential issues with flow steering'. Flow
> steering is supposed to work.
>
> Also if we wanted to defensively program something to avoid flow
> steering issues, then I'd suggest adding to cmsg_devmem the dma-buf fd
> that the data is on, not this setsockopt() that asserts. IMO it's a
> weird API for the userspace to ask the kernel to assert some condition
> (at least I haven't seen it before or commonly).
>
> But again, in general, I'm a bit unsure about defensively designing
> uapi around a feature like flow steering that's supposed to work.

^ permalink raw reply

* Re: [RFC PATCH v3 10/12] tcp: RX path for devmem TCP
From: Willem de Bruijn @ 2023-11-07  0:02 UTC (permalink / raw)
  To: David Ahern
  Cc: Stanislav Fomichev, Mina Almasry, netdev, linux-kernel,
	linux-arch, linux-kselftest, linux-media, dri-devel,
	linaro-mm-sig, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Arnd Bergmann, Shuah Khan, Sumit Semwal, Christian König,
	Shakeel Butt, Jeroen de Borst, Praveen Kaligineedi,
	Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <93eb6a2b-a991-40ca-8f26-f520c986729a@kernel.org>

On Mon, Nov 6, 2023 at 3:55 PM David Ahern <dsahern@kernel.org> wrote:
>
> On 11/6/23 4:32 PM, Stanislav Fomichev wrote:
> >> The concise notification API returns tokens as a range for
> >> compression, encoding as two 32-bit unsigned integers start + length.
> >> It allows for even further batching by returning multiple such ranges
> >> in a single call.
> >
> > Tangential: should tokens be u64? Otherwise we can't have more than
> > 4gb unacknowledged. Or that's a reasonable constraint?
> >
>
> Was thinking the same and with bits reserved for a dmabuf id to allow
> multiple dmabufs in a single rx queue (future extension, but build the
> capability in now). e.g., something like a 37b offset (128GB dmabuf
> size), 19b length (large GRO), 8b dmabuf id (lots of dmabufs to a queue).

Agreed. Converting to 64b now sounds like a good forward looking revision.

^ permalink raw reply

* Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags
From: Mina Almasry @ 2023-11-07  0:03 UTC (permalink / raw)
  To: David Ahern
  Cc: Stanislav Fomichev, netdev, linux-kernel, linux-arch,
	linux-kselftest, linux-media, dri-devel, linaro-mm-sig,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	Willem de Bruijn, Shuah Khan, Sumit Semwal, Christian König,
	Shakeel Butt, Jeroen de Borst, Praveen Kaligineedi,
	Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <00883386-0c4b-4ba7-84c6-553f468304e6@kernel.org>

On Mon, Nov 6, 2023 at 3:37 PM David Ahern <dsahern@kernel.org> wrote:
>
> On 11/6/23 3:18 PM, Mina Almasry wrote:
> >>>>>> @@ -991,7 +993,7 @@ struct sk_buff {
> >>>>>>  #if IS_ENABLED(CONFIG_IP_SCTP)
> >>>>>>      __u8                    csum_not_inet:1;
> >>>>>>  #endif
> >>>>>> -
> >>>>>> +    __u8                    devmem:1;
> >>>>>>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> >>>>>>      __u16                   tc_index;       /* traffic control index */
> >>>>>>  #endif
> >>>>>> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> >>>>>>              __skb_zcopy_downgrade_managed(skb);
> >>>>>>  }
> >>>>>>
> >>>>>> +/* Return true if frags in this skb are not readable by the host. */
> >>>>>> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> >>>>>> +{
> >>>>>> +    return skb->devmem;
> >>>>>
> >>>>> bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> >>>>> It better communicates the fact that the stack shouldn't dereference the
> >>>>> frags (because it has 'devmem' fragments or for some other potential
> >>>>> future reason).
> >>>>
> >>>> +1.
> >>>>
> >>>> Also, the flag on the skb is an optimization - a high level signal that
> >>>> one or more frags is in unreadable memory. There is no requirement that
> >>>> all of the frags are in the same memory type.
> >>
> >> David: maybe there should be such a requirement (that they all are
> >> unreadable)? Might be easier to support initially; we can relax later
> >> on.
> >>
> >
> > Currently devmem == not_readable, and the restriction is that all the
> > frags in the same skb must be either all readable or all unreadable
> > (all devmem or all non-devmem).
>
> What requires that restriction? In all of the uses of skb->devmem and
> skb_frags_not_readable() what matters is if any frag is not readable,
> then frag list walk or collapse is avoided.
>
>

Currently only tcp_recvmsg_devmem(), I think. tcp_recvmsg_locked()
delegates to tcp_recvmsg_devmem() if skb->devmem, and
tcp_recvmsg_devmem() net_err's if it finds a non-iov frag in the skb.
This is done for some simplicity, because iov's are given to the user
via cmsg, but pages are copied into the linear buffer. I think it
would be confusing for the user if we simultaneously copied some data
to the linear buffer and gave them a devmem cmsgs in the same
recvmsg() call.

So, my simplicity is:

1. in a single skb, all frags must be devmem or non-devmem, no mixing.
2. In a single recvmsg() call, we only process devmem or non-devmem
skbs, no mixing.

-- 
Thanks,
Mina

^ permalink raw reply

* Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags
From: Willem de Bruijn @ 2023-11-07  0:07 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Mina Almasry, David Ahern, netdev, linux-kernel, linux-arch,
	linux-kselftest, linux-media, dri-devel, linaro-mm-sig,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	Shuah Khan, Sumit Semwal, Christian König, Shakeel Butt,
	Jeroen de Borst, Praveen Kaligineedi, Willem de Bruijn,
	Kaiyuan Zhang
In-Reply-To: <CAKH8qBsbh8qYxNHZ6111RQFFpNWbWZtg0LDXkn15xcsbAq4R6w@mail.gmail.com>

On Mon, Nov 6, 2023 at 3:55 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Mon, Nov 6, 2023 at 3:27 PM Mina Almasry <almasrymina@google.com> wrote:
> >
> > On Mon, Nov 6, 2023 at 2:59 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On 11/06, Mina Almasry wrote:
> > > > On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > >
> > > > > On 11/06, Mina Almasry wrote:
> > > > > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern <dsahern@kernel.org> wrote:
> > > > > > >
> > > > > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > > > > On 11/05, Mina Almasry wrote:
> > > > > > > >> For device memory TCP, we expect the skb headers to be available in host
> > > > > > > >> memory for access, and we expect the skb frags to be in device memory
> > > > > > > >> and unaccessible to the host. We expect there to be no mixing and
> > > > > > > >> matching of device memory frags (unaccessible) with host memory frags
> > > > > > > >> (accessible) in the same skb.
> > > > > > > >>
> > > > > > > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > > > > > > >> are device memory frags or not.
> > > > > > > >>
> > > > > > > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > > > > > > >> and marks the skb as skb->devmem accordingly.
> > > > > > > >>
> > > > > > > >> Add checks through the network stack to avoid accessing the frags of
> > > > > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > > > > > > >>
> > > > > > > >> Signed-off-by: Willem de Bruijn <willemb@google.com>
> > > > > > > >> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> > > > > > > >> Signed-off-by: Mina Almasry <almasrymina@google.com>
> > > > > > > >>
> > > > > > > >> ---
> > > > > > > >>  include/linux/skbuff.h | 14 +++++++-
> > > > > > > >>  include/net/tcp.h      |  5 +--
> > > > > > > >>  net/core/datagram.c    |  6 ++++
> > > > > > > >>  net/core/gro.c         |  5 ++-
> > > > > > > >>  net/core/skbuff.c      | 77 ++++++++++++++++++++++++++++++++++++------
> > > > > > > >>  net/ipv4/tcp.c         |  6 ++++
> > > > > > > >>  net/ipv4/tcp_input.c   | 13 +++++--
> > > > > > > >>  net/ipv4/tcp_output.c  |  5 ++-
> > > > > > > >>  net/packet/af_packet.c |  4 +--
> > > > > > > >>  9 files changed, 115 insertions(+), 20 deletions(-)
> > > > > > > >>
> > > > > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > > > > >> --- a/include/linux/skbuff.h
> > > > > > > >> +++ b/include/linux/skbuff.h
> > > > > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > > > > >>   *  @csum_level: indicates the number of consecutive checksums found in
> > > > > > > >>   *          the packet minus one that have been verified as
> > > > > > > >>   *          CHECKSUM_UNNECESSARY (max 3)
> > > > > > > >> + *  @devmem: indicates that all the fragments in this skb are backed by
> > > > > > > >> + *          device memory.
> > > > > > > >>   *  @dst_pending_confirm: need to confirm neighbour
> > > > > > > >>   *  @decrypted: Decrypted SKB
> > > > > > > >>   *  @slow_gro: state present at GRO time, slower prepare step required
> > > > > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > > > > >>  #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > > > > >>      __u8                    csum_not_inet:1;
> > > > > > > >>  #endif
> > > > > > > >> -
> > > > > > > >> +    __u8                    devmem:1;
> > > > > > > >>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > > > > >>      __u16                   tc_index;       /* traffic control index */
> > > > > > > >>  #endif
> > > > > > > >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > > > > >>              __skb_zcopy_downgrade_managed(skb);
> > > > > > > >>  }
> > > > > > > >>
> > > > > > > >> +/* Return true if frags in this skb are not readable by the host. */
> > > > > > > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > > > > > >> +{
> > > > > > > >> +    return skb->devmem;
> > > > > > > >
> > > > > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > > > > > > > It better communicates the fact that the stack shouldn't dereference the
> > > > > > > > frags (because it has 'devmem' fragments or for some other potential
> > > > > > > > future reason).
> > > > > > >
> > > > > > > +1.
> > > > > > >
> > > > > > > Also, the flag on the skb is an optimization - a high level signal that
> > > > > > > one or more frags is in unreadable memory. There is no requirement that
> > > > > > > all of the frags are in the same memory type.
> > > > >
> > > > > David: maybe there should be such a requirement (that they all are
> > > > > unreadable)? Might be easier to support initially; we can relax later
> > > > > on.
> > > > >
> > > >
> > > > Currently devmem == not_readable, and the restriction is that all the
> > > > frags in the same skb must be either all readable or all unreadable
> > > > (all devmem or all non-devmem).
> > > >
> > > > > > The flag indicates that the skb contains all devmem dma-buf memory
> > > > > > specifically, not generic 'not_readable' frags as the comment says:
> > > > > >
> > > > > > + *     @devmem: indicates that all the fragments in this skb are backed by
> > > > > > + *             device memory.
> > > > > >
> > > > > > The reason it's not a generic 'not_readable' flag is because handing
> > > > > > off a generic not_readable skb to the userspace is semantically not
> > > > > > what we're doing. recvmsg() is augmented in this patch series to
> > > > > > return a devmem skb to the user via a cmsg_devmem struct which refers
> > > > > > specifically to the memory in the dma-buf. recvmsg() in this patch
> > > > > > series is not augmented to give any 'not_readable' skb to the
> > > > > > userspace.
> > > > > >
> > > > > > IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> > > > > > correct. If a new type of unreadable skbs are introduced to the stack,
> > > > > > I imagine the stack would implement:
> > > > > >
> > > > > > 1. new header flag: skb->newmem
> > > > > > 2.
> > > > > >
> > > > > > static inline bool skb_frags_not_readable(const struct skb_buff *skb)
> > > > > > {
> > > > > >     return skb->devmem || skb->newmem;
> > > > > > }
> > > > > >
> > > > > > 3. tcp_recvmsg_devmem() would handle skb->devmem skbs is in this patch
> > > > > > series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.
> > > > >
> > > > > You copy it to the userspace in a special way because your frags
> > > > > are page_is_page_pool_iov(). I agree with David, the skb bit is
> > > > > just and optimization.
> > > > >
> > > > > For most of the core stack, it doesn't matter why your skb is not
> > > > > readable. For a few places where it matters (recvmsg?), you can
> > > > > double-check your frags (all or some) with page_is_page_pool_iov.
> > > > >
> > > >
> > > > I see, we can do that then. I.e. make the header flag 'not_readable'
> > > > and check the frags to decide to delegate to tcp_recvmsg_devmem() or
> > > > something else. We can even assume not_readable == devmem because
> > > > currently devmem is the only type of unreadable frag currently.
> > > >
> > > > > Unrelated: we probably need socket to dmabuf association as well (via
> > > > > netlink or something).
> > > >
> > > > Not sure this is possible. The dma-buf is bound to the rx-queue, and
> > > > any packets that land on that rx-queue are bound to that dma-buf,
> > > > regardless of which socket that packet belongs to. So the association
> > > > IMO must be rx-queue to dma-buf, not socket to dma-buf.
> > >
> > > But there is still always 1 dmabuf to 1 socket association (on rx), right?
> > > Because otherwise, there is no way currently to tell, at recvmsg, which
> > > dmabuf the received token belongs to.
> > >
> >
> > Yes, but this 1 dma-buf to 1 socket association happens because the
> > user binds the dma-buf to an rx-queue and configures flow steering of
> > the socket to that rx-queue.
>
> It's still fixed and won't change during the socket lifetime, right?
> And the socket has to know this association; otherwise those tokens
> are useless since they don't carry anything to identify the dmabuf.
>
> I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
> it somehow implies that I have an option of passing or not passing it
> for an individual system call.
> If we know that we're going to use dmabuf with the socket, maybe we
> should move this flag to the socket() syscall?
>
> fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);
>
> ?

I think it should then be a setsockopt called before any data is
exchanged, with no change of modifying mode later. We generally use
setsockopts for the mode of a socket. This use of the protocol field
in socket() for setting a mode would be novel. Also, it might miss
passively opened connections, or be overly restrictive: one approach
for all accepted child sockets.

^ permalink raw reply

* Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags
From: Stanislav Fomichev @ 2023-11-07  0:14 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Mina Almasry, David Ahern, netdev, linux-kernel, linux-arch,
	linux-kselftest, linux-media, dri-devel, linaro-mm-sig,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	Shuah Khan, Sumit Semwal, Christian König, Shakeel Butt,
	Jeroen de Borst, Praveen Kaligineedi, Willem de Bruijn,
	Kaiyuan Zhang
In-Reply-To: <CAF=yD-+BuKXoVL8UF+No1s0TsHSzBTz7UrB1Djt_BrM74uLLcg@mail.gmail.com>

On 11/06, Willem de Bruijn wrote:
> On Mon, Nov 6, 2023 at 3:55 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Mon, Nov 6, 2023 at 3:27 PM Mina Almasry <almasrymina@google.com> wrote:
> > >
> > > On Mon, Nov 6, 2023 at 2:59 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > > On 11/06, Mina Almasry wrote:
> > > > > On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > > >
> > > > > > On 11/06, Mina Almasry wrote:
> > > > > > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern <dsahern@kernel.org> wrote:
> > > > > > > >
> > > > > > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > > > > > On 11/05, Mina Almasry wrote:
> > > > > > > > >> For device memory TCP, we expect the skb headers to be available in host
> > > > > > > > >> memory for access, and we expect the skb frags to be in device memory
> > > > > > > > >> and unaccessible to the host. We expect there to be no mixing and
> > > > > > > > >> matching of device memory frags (unaccessible) with host memory frags
> > > > > > > > >> (accessible) in the same skb.
> > > > > > > > >>
> > > > > > > > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > > > > > > > >> are device memory frags or not.
> > > > > > > > >>
> > > > > > > > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > > > > > > > >> and marks the skb as skb->devmem accordingly.
> > > > > > > > >>
> > > > > > > > >> Add checks through the network stack to avoid accessing the frags of
> > > > > > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > > > > > > > >>
> > > > > > > > >> Signed-off-by: Willem de Bruijn <willemb@google.com>
> > > > > > > > >> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> > > > > > > > >> Signed-off-by: Mina Almasry <almasrymina@google.com>
> > > > > > > > >>
> > > > > > > > >> ---
> > > > > > > > >>  include/linux/skbuff.h | 14 +++++++-
> > > > > > > > >>  include/net/tcp.h      |  5 +--
> > > > > > > > >>  net/core/datagram.c    |  6 ++++
> > > > > > > > >>  net/core/gro.c         |  5 ++-
> > > > > > > > >>  net/core/skbuff.c      | 77 ++++++++++++++++++++++++++++++++++++------
> > > > > > > > >>  net/ipv4/tcp.c         |  6 ++++
> > > > > > > > >>  net/ipv4/tcp_input.c   | 13 +++++--
> > > > > > > > >>  net/ipv4/tcp_output.c  |  5 ++-
> > > > > > > > >>  net/packet/af_packet.c |  4 +--
> > > > > > > > >>  9 files changed, 115 insertions(+), 20 deletions(-)
> > > > > > > > >>
> > > > > > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > > > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > > > > > >> --- a/include/linux/skbuff.h
> > > > > > > > >> +++ b/include/linux/skbuff.h
> > > > > > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > > > > > >>   *  @csum_level: indicates the number of consecutive checksums found in
> > > > > > > > >>   *          the packet minus one that have been verified as
> > > > > > > > >>   *          CHECKSUM_UNNECESSARY (max 3)
> > > > > > > > >> + *  @devmem: indicates that all the fragments in this skb are backed by
> > > > > > > > >> + *          device memory.
> > > > > > > > >>   *  @dst_pending_confirm: need to confirm neighbour
> > > > > > > > >>   *  @decrypted: Decrypted SKB
> > > > > > > > >>   *  @slow_gro: state present at GRO time, slower prepare step required
> > > > > > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > > > > > >>  #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > > > > > >>      __u8                    csum_not_inet:1;
> > > > > > > > >>  #endif
> > > > > > > > >> -
> > > > > > > > >> +    __u8                    devmem:1;
> > > > > > > > >>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > > > > > >>      __u16                   tc_index;       /* traffic control index */
> > > > > > > > >>  #endif
> > > > > > > > >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > > > > > >>              __skb_zcopy_downgrade_managed(skb);
> > > > > > > > >>  }
> > > > > > > > >>
> > > > > > > > >> +/* Return true if frags in this skb are not readable by the host. */
> > > > > > > > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > > > > > > >> +{
> > > > > > > > >> +    return skb->devmem;
> > > > > > > > >
> > > > > > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > > > > > > > > It better communicates the fact that the stack shouldn't dereference the
> > > > > > > > > frags (because it has 'devmem' fragments or for some other potential
> > > > > > > > > future reason).
> > > > > > > >
> > > > > > > > +1.
> > > > > > > >
> > > > > > > > Also, the flag on the skb is an optimization - a high level signal that
> > > > > > > > one or more frags is in unreadable memory. There is no requirement that
> > > > > > > > all of the frags are in the same memory type.
> > > > > >
> > > > > > David: maybe there should be such a requirement (that they all are
> > > > > > unreadable)? Might be easier to support initially; we can relax later
> > > > > > on.
> > > > > >
> > > > >
> > > > > Currently devmem == not_readable, and the restriction is that all the
> > > > > frags in the same skb must be either all readable or all unreadable
> > > > > (all devmem or all non-devmem).
> > > > >
> > > > > > > The flag indicates that the skb contains all devmem dma-buf memory
> > > > > > > specifically, not generic 'not_readable' frags as the comment says:
> > > > > > >
> > > > > > > + *     @devmem: indicates that all the fragments in this skb are backed by
> > > > > > > + *             device memory.
> > > > > > >
> > > > > > > The reason it's not a generic 'not_readable' flag is because handing
> > > > > > > off a generic not_readable skb to the userspace is semantically not
> > > > > > > what we're doing. recvmsg() is augmented in this patch series to
> > > > > > > return a devmem skb to the user via a cmsg_devmem struct which refers
> > > > > > > specifically to the memory in the dma-buf. recvmsg() in this patch
> > > > > > > series is not augmented to give any 'not_readable' skb to the
> > > > > > > userspace.
> > > > > > >
> > > > > > > IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> > > > > > > correct. If a new type of unreadable skbs are introduced to the stack,
> > > > > > > I imagine the stack would implement:
> > > > > > >
> > > > > > > 1. new header flag: skb->newmem
> > > > > > > 2.
> > > > > > >
> > > > > > > static inline bool skb_frags_not_readable(const struct skb_buff *skb)
> > > > > > > {
> > > > > > >     return skb->devmem || skb->newmem;
> > > > > > > }
> > > > > > >
> > > > > > > 3. tcp_recvmsg_devmem() would handle skb->devmem skbs is in this patch
> > > > > > > series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.
> > > > > >
> > > > > > You copy it to the userspace in a special way because your frags
> > > > > > are page_is_page_pool_iov(). I agree with David, the skb bit is
> > > > > > just and optimization.
> > > > > >
> > > > > > For most of the core stack, it doesn't matter why your skb is not
> > > > > > readable. For a few places where it matters (recvmsg?), you can
> > > > > > double-check your frags (all or some) with page_is_page_pool_iov.
> > > > > >
> > > > >
> > > > > I see, we can do that then. I.e. make the header flag 'not_readable'
> > > > > and check the frags to decide to delegate to tcp_recvmsg_devmem() or
> > > > > something else. We can even assume not_readable == devmem because
> > > > > currently devmem is the only type of unreadable frag currently.
> > > > >
> > > > > > Unrelated: we probably need socket to dmabuf association as well (via
> > > > > > netlink or something).
> > > > >
> > > > > Not sure this is possible. The dma-buf is bound to the rx-queue, and
> > > > > any packets that land on that rx-queue are bound to that dma-buf,
> > > > > regardless of which socket that packet belongs to. So the association
> > > > > IMO must be rx-queue to dma-buf, not socket to dma-buf.
> > > >
> > > > But there is still always 1 dmabuf to 1 socket association (on rx), right?
> > > > Because otherwise, there is no way currently to tell, at recvmsg, which
> > > > dmabuf the received token belongs to.
> > > >
> > >
> > > Yes, but this 1 dma-buf to 1 socket association happens because the
> > > user binds the dma-buf to an rx-queue and configures flow steering of
> > > the socket to that rx-queue.
> >
> > It's still fixed and won't change during the socket lifetime, right?
> > And the socket has to know this association; otherwise those tokens
> > are useless since they don't carry anything to identify the dmabuf.
> >
> > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
> > it somehow implies that I have an option of passing or not passing it
> > for an individual system call.
> > If we know that we're going to use dmabuf with the socket, maybe we
> > should move this flag to the socket() syscall?
> >
> > fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);
> >
> > ?
> 
> I think it should then be a setsockopt called before any data is
> exchanged, with no change of modifying mode later. We generally use
> setsockopts for the mode of a socket. This use of the protocol field
> in socket() for setting a mode would be novel. Also, it might miss
> passively opened connections, or be overly restrictive: one approach
> for all accepted child sockets.

I was thinking this is similar to SOCK_CLOEXEC or SOCK_NONBLOCK? There
are plenty of bits we can grab. But setsockopt works as well!

^ permalink raw reply

* Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags
From: David Ahern @ 2023-11-07  0:16 UTC (permalink / raw)
  To: Mina Almasry, netdev, linux-kernel, linux-arch, linux-kselftest,
	linux-media, dri-devel, linaro-mm-sig
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	Willem de Bruijn, Shuah Khan, Sumit Semwal, Christian König,
	Shakeel Butt, Jeroen de Borst, Praveen Kaligineedi,
	Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <20231106024413.2801438-10-almasrymina@google.com>

On 11/5/23 7:44 PM, Mina Almasry wrote:
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index 176eb5834746..cdd4fb129968 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -425,6 +425,9 @@ static int __skb_datagram_iter(const struct sk_buff *skb, int offset,
>  			return 0;
>  	}
>  
> +	if (skb_frags_not_readable(skb))
> +		goto short_copy;
> +
>  	/* Copy paged appendix. Hmm... why does this look so complicated? */
>  	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
>  		int end;
> @@ -616,6 +619,9 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
>  {
>  	int frag;
>  
> +	if (skb_frags_not_readable(skb))
> +		return -EFAULT;

This check ....
> +
>  	if (msg && msg->msg_ubuf && msg->sg_from_iter)
>  		return msg->sg_from_iter(sk, skb, from, length);


... should go here. That allows custome sg_from_iter to have access to
the skb. What matters is not expecting struct page (e.g., refcounting);
if the custom iter does not do that then all is well. io_uring's iter
does not look at the pages, so all good.

>  
> diff --git a/net/core/gro.c b/net/core/gro.c
> index 42d7f6755f32..56046d65386a 100644
> --- a/net/core/gro.c
> +++ b/net/core/gro.c
> @@ -390,6 +390,9 @@ static void gro_pull_from_frag0(struct sk_buff *skb, int grow)
>  {
>  	struct skb_shared_info *pinfo = skb_shinfo(skb);
>  
> +	if (WARN_ON_ONCE(skb_frags_not_readable(skb)))
> +		return;
> +
>  	BUG_ON(skb->end - skb->tail < grow);
>  
>  	memcpy(skb_tail_pointer(skb), NAPI_GRO_CB(skb)->frag0, grow);
> @@ -411,7 +414,7 @@ static void gro_try_pull_from_frag0(struct sk_buff *skb)
>  {
>  	int grow = skb_gro_offset(skb) - skb_headlen(skb);
>  
> -	if (grow > 0)
> +	if (grow > 0 && !skb_frags_not_readable(skb))
>  		gro_pull_from_frag0(skb, grow);
>  }
>  
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 13eca4fd25e1..f01673ed2eff 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1230,6 +1230,14 @@ void skb_dump(const char *level, const struct sk_buff *skb, bool full_pkt)
>  		struct page *p;
>  		u8 *vaddr;
>  
> +		if (skb_frag_is_page_pool_iov(frag)) {

Why skb_frag_is_page_pool_iov here vs skb_frags_not_readable?

^ permalink raw reply

* Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags
From: Mina Almasry @ 2023-11-07  0:20 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Stanislav Fomichev, David Ahern, netdev, linux-kernel, linux-arch,
	linux-kselftest, linux-media, dri-devel, linaro-mm-sig,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	Shuah Khan, Sumit Semwal, Christian König, Shakeel Butt,
	Jeroen de Borst, Praveen Kaligineedi, Willem de Bruijn,
	Kaiyuan Zhang
In-Reply-To: <CAF=yD-+BuKXoVL8UF+No1s0TsHSzBTz7UrB1Djt_BrM74uLLcg@mail.gmail.com>

On Mon, Nov 6, 2023 at 4:08 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Mon, Nov 6, 2023 at 3:55 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Mon, Nov 6, 2023 at 3:27 PM Mina Almasry <almasrymina@google.com> wrote:
> > >
> > > On Mon, Nov 6, 2023 at 2:59 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > > On 11/06, Mina Almasry wrote:
> > > > > On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > > >
> > > > > > On 11/06, Mina Almasry wrote:
> > > > > > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern <dsahern@kernel.org> wrote:
> > > > > > > >
> > > > > > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > > > > > On 11/05, Mina Almasry wrote:
> > > > > > > > >> For device memory TCP, we expect the skb headers to be available in host
> > > > > > > > >> memory for access, and we expect the skb frags to be in device memory
> > > > > > > > >> and unaccessible to the host. We expect there to be no mixing and
> > > > > > > > >> matching of device memory frags (unaccessible) with host memory frags
> > > > > > > > >> (accessible) in the same skb.
> > > > > > > > >>
> > > > > > > > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > > > > > > > >> are device memory frags or not.
> > > > > > > > >>
> > > > > > > > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > > > > > > > >> and marks the skb as skb->devmem accordingly.
> > > > > > > > >>
> > > > > > > > >> Add checks through the network stack to avoid accessing the frags of
> > > > > > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > > > > > > > >>
> > > > > > > > >> Signed-off-by: Willem de Bruijn <willemb@google.com>
> > > > > > > > >> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> > > > > > > > >> Signed-off-by: Mina Almasry <almasrymina@google.com>
> > > > > > > > >>
> > > > > > > > >> ---
> > > > > > > > >>  include/linux/skbuff.h | 14 +++++++-
> > > > > > > > >>  include/net/tcp.h      |  5 +--
> > > > > > > > >>  net/core/datagram.c    |  6 ++++
> > > > > > > > >>  net/core/gro.c         |  5 ++-
> > > > > > > > >>  net/core/skbuff.c      | 77 ++++++++++++++++++++++++++++++++++++------
> > > > > > > > >>  net/ipv4/tcp.c         |  6 ++++
> > > > > > > > >>  net/ipv4/tcp_input.c   | 13 +++++--
> > > > > > > > >>  net/ipv4/tcp_output.c  |  5 ++-
> > > > > > > > >>  net/packet/af_packet.c |  4 +--
> > > > > > > > >>  9 files changed, 115 insertions(+), 20 deletions(-)
> > > > > > > > >>
> > > > > > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > > > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > > > > > >> --- a/include/linux/skbuff.h
> > > > > > > > >> +++ b/include/linux/skbuff.h
> > > > > > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > > > > > >>   *  @csum_level: indicates the number of consecutive checksums found in
> > > > > > > > >>   *          the packet minus one that have been verified as
> > > > > > > > >>   *          CHECKSUM_UNNECESSARY (max 3)
> > > > > > > > >> + *  @devmem: indicates that all the fragments in this skb are backed by
> > > > > > > > >> + *          device memory.
> > > > > > > > >>   *  @dst_pending_confirm: need to confirm neighbour
> > > > > > > > >>   *  @decrypted: Decrypted SKB
> > > > > > > > >>   *  @slow_gro: state present at GRO time, slower prepare step required
> > > > > > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > > > > > >>  #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > > > > > >>      __u8                    csum_not_inet:1;
> > > > > > > > >>  #endif
> > > > > > > > >> -
> > > > > > > > >> +    __u8                    devmem:1;
> > > > > > > > >>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > > > > > >>      __u16                   tc_index;       /* traffic control index */
> > > > > > > > >>  #endif
> > > > > > > > >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > > > > > >>              __skb_zcopy_downgrade_managed(skb);
> > > > > > > > >>  }
> > > > > > > > >>
> > > > > > > > >> +/* Return true if frags in this skb are not readable by the host. */
> > > > > > > > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > > > > > > >> +{
> > > > > > > > >> +    return skb->devmem;
> > > > > > > > >
> > > > > > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > > > > > > > > It better communicates the fact that the stack shouldn't dereference the
> > > > > > > > > frags (because it has 'devmem' fragments or for some other potential
> > > > > > > > > future reason).
> > > > > > > >
> > > > > > > > +1.
> > > > > > > >
> > > > > > > > Also, the flag on the skb is an optimization - a high level signal that
> > > > > > > > one or more frags is in unreadable memory. There is no requirement that
> > > > > > > > all of the frags are in the same memory type.
> > > > > >
> > > > > > David: maybe there should be such a requirement (that they all are
> > > > > > unreadable)? Might be easier to support initially; we can relax later
> > > > > > on.
> > > > > >
> > > > >
> > > > > Currently devmem == not_readable, and the restriction is that all the
> > > > > frags in the same skb must be either all readable or all unreadable
> > > > > (all devmem or all non-devmem).
> > > > >
> > > > > > > The flag indicates that the skb contains all devmem dma-buf memory
> > > > > > > specifically, not generic 'not_readable' frags as the comment says:
> > > > > > >
> > > > > > > + *     @devmem: indicates that all the fragments in this skb are backed by
> > > > > > > + *             device memory.
> > > > > > >
> > > > > > > The reason it's not a generic 'not_readable' flag is because handing
> > > > > > > off a generic not_readable skb to the userspace is semantically not
> > > > > > > what we're doing. recvmsg() is augmented in this patch series to
> > > > > > > return a devmem skb to the user via a cmsg_devmem struct which refers
> > > > > > > specifically to the memory in the dma-buf. recvmsg() in this patch
> > > > > > > series is not augmented to give any 'not_readable' skb to the
> > > > > > > userspace.
> > > > > > >
> > > > > > > IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> > > > > > > correct. If a new type of unreadable skbs are introduced to the stack,
> > > > > > > I imagine the stack would implement:
> > > > > > >
> > > > > > > 1. new header flag: skb->newmem
> > > > > > > 2.
> > > > > > >
> > > > > > > static inline bool skb_frags_not_readable(const struct skb_buff *skb)
> > > > > > > {
> > > > > > >     return skb->devmem || skb->newmem;
> > > > > > > }
> > > > > > >
> > > > > > > 3. tcp_recvmsg_devmem() would handle skb->devmem skbs is in this patch
> > > > > > > series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.
> > > > > >
> > > > > > You copy it to the userspace in a special way because your frags
> > > > > > are page_is_page_pool_iov(). I agree with David, the skb bit is
> > > > > > just and optimization.
> > > > > >
> > > > > > For most of the core stack, it doesn't matter why your skb is not
> > > > > > readable. For a few places where it matters (recvmsg?), you can
> > > > > > double-check your frags (all or some) with page_is_page_pool_iov.
> > > > > >
> > > > >
> > > > > I see, we can do that then. I.e. make the header flag 'not_readable'
> > > > > and check the frags to decide to delegate to tcp_recvmsg_devmem() or
> > > > > something else. We can even assume not_readable == devmem because
> > > > > currently devmem is the only type of unreadable frag currently.
> > > > >
> > > > > > Unrelated: we probably need socket to dmabuf association as well (via
> > > > > > netlink or something).
> > > > >
> > > > > Not sure this is possible. The dma-buf is bound to the rx-queue, and
> > > > > any packets that land on that rx-queue are bound to that dma-buf,
> > > > > regardless of which socket that packet belongs to. So the association
> > > > > IMO must be rx-queue to dma-buf, not socket to dma-buf.
> > > >
> > > > But there is still always 1 dmabuf to 1 socket association (on rx), right?
> > > > Because otherwise, there is no way currently to tell, at recvmsg, which
> > > > dmabuf the received token belongs to.
> > > >
> > >
> > > Yes, but this 1 dma-buf to 1 socket association happens because the
> > > user binds the dma-buf to an rx-queue and configures flow steering of
> > > the socket to that rx-queue.
> >
> > It's still fixed and won't change during the socket lifetime, right?

Technically, no.

The user is free to modify or delete flow steering rules outside of
the lifetime of the socket. Technically it's possible for the user to
reconfigure flow steering while the socket is simultaneously
receiving, and the result will be packets switching
 from devmem to non-devmem. For a reasonably correctly configured
application the application would probably want to steer 1 flow to 1
dma-buf and never change it, but this is not something we enforce, but
rather the user orchestrates. In theory someone can find a use case
for configuring and unconfigure flow steering during a connection.

> > And the socket has to know this association; otherwise those tokens
> > are useless since they don't carry anything to identify the dmabuf.
> >
> > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
> > it somehow implies that I have an option of passing or not passing it
> > for an individual system call.

You do have the option of passing it or not passing it per system
call. The MSG_SOCK_DEVMEM says the application is willing to receive
devmem cmsgs - that's all. The application doesn't get to decide
whether it's actually going to receive a devmem cmsg or not, because
that's dictated by the type of skb that is present in the receive
queue, and not up to the application. I should explain this in the
commit message...

> > If we know that we're going to use dmabuf with the socket, maybe we
> > should move this flag to the socket() syscall?
> >
> > fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);
> >
> > ?
>
> I think it should then be a setsockopt called before any data is
> exchanged, with no change of modifying mode later. We generally use
> setsockopts for the mode of a socket. This use of the protocol field
> in socket() for setting a mode would be novel. Also, it might miss
> passively opened connections, or be overly restrictive: one approach
> for all accepted child sockets.

We can definitely move SOCK_DEVMEM to a setsockopt(). Seems more than
reasonable.

-- 
Thanks,
Mina

^ permalink raw reply

* [netdev call] Nov 7th
From: Jakub Kicinski @ 2023-11-07  0:23 UTC (permalink / raw)
  To: netdev, netdev-driver-reviewers

Hi,

The bi-weekly netdev call at https://bbb.lwn.net/b/jak-wkr-seg-hjn
is scheduled tomorrow at 8:30 am (PT) / 5:30 pm (~EU).

Nothing on the agenda at this point, please send topics.

We could discuss any follow up / comments relating to
just-concluded netdev.conf, or upcoming LPC?

^ permalink raw reply

* Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags
From: Mina Almasry @ 2023-11-07  0:23 UTC (permalink / raw)
  To: David Ahern
  Cc: netdev, linux-kernel, linux-arch, linux-kselftest, linux-media,
	dri-devel, linaro-mm-sig, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Arnd Bergmann, Willem de Bruijn, Shuah Khan,
	Sumit Semwal, Christian König, Shakeel Butt, Jeroen de Borst,
	Praveen Kaligineedi, Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <fa44c3d1-92b9-4686-ab3b-4fcda257aafd@kernel.org>

On Mon, Nov 6, 2023 at 4:16 PM David Ahern <dsahern@kernel.org> wrote:
>
> On 11/5/23 7:44 PM, Mina Almasry wrote:
> > diff --git a/net/core/datagram.c b/net/core/datagram.c
> > index 176eb5834746..cdd4fb129968 100644
> > --- a/net/core/datagram.c
> > +++ b/net/core/datagram.c
> > @@ -425,6 +425,9 @@ static int __skb_datagram_iter(const struct sk_buff *skb, int offset,
> >                       return 0;
> >       }
> >
> > +     if (skb_frags_not_readable(skb))
> > +             goto short_copy;
> > +
> >       /* Copy paged appendix. Hmm... why does this look so complicated? */
> >       for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
> >               int end;
> > @@ -616,6 +619,9 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
> >  {
> >       int frag;
> >
> > +     if (skb_frags_not_readable(skb))
> > +             return -EFAULT;
>
> This check ....
> > +
> >       if (msg && msg->msg_ubuf && msg->sg_from_iter)
> >               return msg->sg_from_iter(sk, skb, from, length);
>
>
> ... should go here. That allows custome sg_from_iter to have access to
> the skb. What matters is not expecting struct page (e.g., refcounting);
> if the custom iter does not do that then all is well. io_uring's iter
> does not look at the pages, so all good.
>
> >
> > diff --git a/net/core/gro.c b/net/core/gro.c
> > index 42d7f6755f32..56046d65386a 100644
> > --- a/net/core/gro.c
> > +++ b/net/core/gro.c
> > @@ -390,6 +390,9 @@ static void gro_pull_from_frag0(struct sk_buff *skb, int grow)
> >  {
> >       struct skb_shared_info *pinfo = skb_shinfo(skb);
> >
> > +     if (WARN_ON_ONCE(skb_frags_not_readable(skb)))
> > +             return;
> > +
> >       BUG_ON(skb->end - skb->tail < grow);
> >
> >       memcpy(skb_tail_pointer(skb), NAPI_GRO_CB(skb)->frag0, grow);
> > @@ -411,7 +414,7 @@ static void gro_try_pull_from_frag0(struct sk_buff *skb)
> >  {
> >       int grow = skb_gro_offset(skb) - skb_headlen(skb);
> >
> > -     if (grow > 0)
> > +     if (grow > 0 && !skb_frags_not_readable(skb))
> >               gro_pull_from_frag0(skb, grow);
> >  }
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 13eca4fd25e1..f01673ed2eff 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -1230,6 +1230,14 @@ void skb_dump(const char *level, const struct sk_buff *skb, bool full_pkt)
> >               struct page *p;
> >               u8 *vaddr;
> >
> > +             if (skb_frag_is_page_pool_iov(frag)) {
>
> Why skb_frag_is_page_pool_iov here vs skb_frags_not_readable?

Seems like a silly choice on my end. I should probably check
skb_frags_not_readable() and not kmap any frags in that case. Will do.

-- 
Thanks,
Mina

^ permalink raw reply

* [PATCH net 0/2][pull request] Intel Wired LAN Driver Updates 2023-11-06 (i40e)
From: Tony Nguyen @ 2023-11-07  0:35 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet, netdev; +Cc: Tony Nguyen, ivecera, jiri

This series contains updates to i40e driver only.

Ivan Vecera resolves a couple issues with devlink; removing a call to
devlink_port_type_clear() and ensuring devlink port is unregistered
after the net device.

The following are changes since commit c1ed833e0b3b7b9edc82b97b73b2a8a10ceab241:
  Merge branch 'smc-fixes'
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue 40GbE

Ivan Vecera (2):
  i40e: Do not call devlink_port_type_clear()
  i40e: Fix devlink port unregistering

 drivers/net/ethernet/intel/i40e/i40e_devlink.c |  1 -
 drivers/net/ethernet/intel/i40e/i40e_main.c    | 10 ++++++----
 2 files changed, 6 insertions(+), 5 deletions(-)

-- 
2.41.0


^ permalink raw reply

* [PATCH net 1/2] i40e: Do not call devlink_port_type_clear()
From: Tony Nguyen @ 2023-11-07  0:35 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet, netdev
  Cc: Ivan Vecera, anthony.l.nguyen, Jiri Pirko, Pucha Himasekhar Reddy
In-Reply-To: <20231107003600.653796-1-anthony.l.nguyen@intel.com>

From: Ivan Vecera <ivecera@redhat.com>

Do not call devlink_port_type_clear() prior devlink port unregister
and let devlink core to take care about it.

Reproducer:
[root@host ~]# rmmod i40e
[ 4539.964699] i40e 0000:02:00.0: devlink port type for port 0 cleared without a software interface reference, device type not supported by the kernel?
[ 4540.319811] i40e 0000:02:00.1: devlink port type for port 1 cleared without a software interface reference, device type not supported by the kernel?

Fixes: 9e479d64dc58 ("i40e: Add initial devlink support")
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_devlink.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_devlink.c b/drivers/net/ethernet/intel/i40e/i40e_devlink.c
index 74bc111b4849..cc4e9e2addb7 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_devlink.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_devlink.c
@@ -231,6 +231,5 @@ int i40e_devlink_create_port(struct i40e_pf *pf)
  **/
 void i40e_devlink_destroy_port(struct i40e_pf *pf)
 {
-	devlink_port_type_clear(&pf->devlink_port);
 	devlink_port_unregister(&pf->devlink_port);
 }
-- 
2.41.0


^ permalink raw reply related

* [PATCH net 2/2] i40e: Fix devlink port unregistering
From: Tony Nguyen @ 2023-11-07  0:35 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet, netdev
  Cc: Ivan Vecera, anthony.l.nguyen, Jiri Pirko, Pucha Himasekhar Reddy
In-Reply-To: <20231107003600.653796-1-anthony.l.nguyen@intel.com>

From: Ivan Vecera <ivecera@redhat.com>

Ensure that devlink port is unregistered after unregistering
of net device.

Reproducer:
[root@host ~]# rmmod i40e
[ 4742.939386] i40e 0000:02:00.1: i40e_ptp_stop: removed PHC on enp2s0f1np1
[ 4743.059269] ------------[ cut here ]------------
[ 4743.063900] WARNING: CPU: 21 PID: 10766 at net/devlink/port.c:1078 devl_port_unregister+0x69/0x80
...

Fixes: 9e479d64dc58 ("i40e: Add initial devlink support")
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 3157d14d9b12..f7a332e51524 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -14213,8 +14213,7 @@ int i40e_vsi_release(struct i40e_vsi *vsi)
 	}
 	set_bit(__I40E_VSI_RELEASING, vsi->state);
 	uplink_seid = vsi->uplink_seid;
-	if (vsi->type == I40E_VSI_MAIN)
-		i40e_devlink_destroy_port(pf);
+
 	if (vsi->type != I40E_VSI_SRIOV) {
 		if (vsi->netdev_registered) {
 			vsi->netdev_registered = false;
@@ -14228,6 +14227,9 @@ int i40e_vsi_release(struct i40e_vsi *vsi)
 		i40e_vsi_disable_irq(vsi);
 	}
 
+	if (vsi->type == I40E_VSI_MAIN)
+		i40e_devlink_destroy_port(pf);
+
 	spin_lock_bh(&vsi->mac_filter_hash_lock);
 
 	/* clear the sync flag on all filters */
@@ -14402,14 +14404,14 @@ static struct i40e_vsi *i40e_vsi_reinit_setup(struct i40e_vsi *vsi)
 
 err_rings:
 	i40e_vsi_free_q_vectors(vsi);
-	if (vsi->type == I40E_VSI_MAIN)
-		i40e_devlink_destroy_port(pf);
 	if (vsi->netdev_registered) {
 		vsi->netdev_registered = false;
 		unregister_netdev(vsi->netdev);
 		free_netdev(vsi->netdev);
 		vsi->netdev = NULL;
 	}
+	if (vsi->type == I40E_VSI_MAIN)
+		i40e_devlink_destroy_port(pf);
 	i40e_aq_delete_element(&pf->hw, vsi->seid, NULL);
 err_vsi:
 	i40e_vsi_clear(vsi);
-- 
2.41.0


^ permalink raw reply related

* [PATCH net 0/4][pull request] Intel Wired LAN Driver Updates 2023-11-06 (ice)
From: Tony Nguyen @ 2023-11-07  0:48 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet, netdev; +Cc: Tony Nguyen

This series contains updates to ice driver only.

Dave removes SR-IOV LAG attribute for only the interface being disabled
to allow for proper unwinding of all interfaces.

Michal Schmidt changes some LAG allocations from GFP_KERNEL to GFP_ATOMIC
due to non-allowed sleeping.

Aniruddha and Marcin fix redirection and drop rules for switchdev by
properly setting and marking egress/ingress type.

The following are changes since commit c1ed833e0b3b7b9edc82b97b73b2a8a10ceab241:
  Merge branch 'smc-fixes'
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue 100GbE

Aniruddha Paul (1):
  ice: Fix VF-VF filter rules in switchdev mode

Dave Ertman (1):
  ice: Fix SRIOV LAG disable on non-compliant aggregate

Marcin Szycik (1):
  ice: Fix VF-VF direction matching in drop rule in switchdev

Michal Schmidt (1):
  ice: lag: in RCU, use atomic allocation

 drivers/net/ethernet/intel/ice/ice_lag.c    |  18 ++--
 drivers/net/ethernet/intel/ice/ice_tc_lib.c | 114 +++++++++++++++-----
 2 files changed, 91 insertions(+), 41 deletions(-)

-- 
2.41.0


^ permalink raw reply

* [PATCH net 1/4] ice: Fix SRIOV LAG disable on non-compliant aggregate
From: Tony Nguyen @ 2023-11-07  0:48 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet, netdev
  Cc: Dave Ertman, anthony.l.nguyen, daniel.machon, Wojciech Drewek,
	Simon Horman, Sujai Buvaneswaran
In-Reply-To: <20231107004844.655549-1-anthony.l.nguyen@intel.com>

From: Dave Ertman <david.m.ertman@intel.com>

If an attribute of an aggregate interface disqualifies it from supporting
SRIOV, the driver will unwind the SRIOV support.  Currently the driver is
clearing the feature bit for all interfaces in the aggregate, but this is
not allowing the other interfaces to unwind successfully on driver unload.

Only clear the feature bit for the interface that is currently unwinding.

Fixes: bf65da2eb279 ("ice: enforce interface eligibility and add messaging for SRIOV LAG")
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_lag.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_lag.c b/drivers/net/ethernet/intel/ice/ice_lag.c
index b980f89dc892..95e46bde54fe 100644
--- a/drivers/net/ethernet/intel/ice/ice_lag.c
+++ b/drivers/net/ethernet/intel/ice/ice_lag.c
@@ -1555,18 +1555,12 @@ static void ice_lag_chk_disabled_bond(struct ice_lag *lag, void *ptr)
  */
 static void ice_lag_disable_sriov_bond(struct ice_lag *lag)
 {
-	struct ice_lag_netdev_list *entry;
 	struct ice_netdev_priv *np;
-	struct net_device *netdev;
 	struct ice_pf *pf;
 
-	list_for_each_entry(entry, lag->netdev_head, node) {
-		netdev = entry->netdev;
-		np = netdev_priv(netdev);
-		pf = np->vsi->back;
-
-		ice_clear_feature_support(pf, ICE_F_SRIOV_LAG);
-	}
+	np = netdev_priv(lag->netdev);
+	pf = np->vsi->back;
+	ice_clear_feature_support(pf, ICE_F_SRIOV_LAG);
 }
 
 /**
-- 
2.41.0


^ permalink raw reply related

* [PATCH net 2/4] ice: lag: in RCU, use atomic allocation
From: Tony Nguyen @ 2023-11-07  0:48 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet, netdev
  Cc: Michal Schmidt, anthony.l.nguyen, daniel.machon, Wojciech Drewek,
	Pucha Himasekhar Reddy, Simon Horman
In-Reply-To: <20231107004844.655549-1-anthony.l.nguyen@intel.com>

From: Michal Schmidt <mschmidt@redhat.com>

Sleeping is not allowed in RCU read-side critical sections.
Use atomic allocations under rcu_read_lock.

Fixes: 1e0f9881ef79 ("ice: Flesh out implementation of support for SRIOV on bonded interface")
Fixes: 41ccedf5ca8f ("ice: implement lag netdev event handler")
Fixes: 3579aa86fb40 ("ice: update reset path for SRIOV LAG support")
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_lag.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_lag.c b/drivers/net/ethernet/intel/ice/ice_lag.c
index 95e46bde54fe..cd065ec48c87 100644
--- a/drivers/net/ethernet/intel/ice/ice_lag.c
+++ b/drivers/net/ethernet/intel/ice/ice_lag.c
@@ -628,7 +628,7 @@ void ice_lag_move_new_vf_nodes(struct ice_vf *vf)
 		INIT_LIST_HEAD(&ndlist.node);
 		rcu_read_lock();
 		for_each_netdev_in_bond_rcu(lag->upper_netdev, tmp_nd) {
-			nl = kzalloc(sizeof(*nl), GFP_KERNEL);
+			nl = kzalloc(sizeof(*nl), GFP_ATOMIC);
 			if (!nl)
 				break;
 
@@ -1692,7 +1692,7 @@ ice_lag_event_handler(struct notifier_block *notif_blk, unsigned long event,
 
 		rcu_read_lock();
 		for_each_netdev_in_bond_rcu(upper_netdev, tmp_nd) {
-			nd_list = kzalloc(sizeof(*nd_list), GFP_KERNEL);
+			nd_list = kzalloc(sizeof(*nd_list), GFP_ATOMIC);
 			if (!nd_list)
 				break;
 
@@ -2069,7 +2069,7 @@ void ice_lag_rebuild(struct ice_pf *pf)
 		INIT_LIST_HEAD(&ndlist.node);
 		rcu_read_lock();
 		for_each_netdev_in_bond_rcu(lag->upper_netdev, tmp_nd) {
-			nl = kzalloc(sizeof(*nl), GFP_KERNEL);
+			nl = kzalloc(sizeof(*nl), GFP_ATOMIC);
 			if (!nl)
 				break;
 
-- 
2.41.0


^ permalink raw reply related

* [PATCH net 3/4] ice: Fix VF-VF filter rules in switchdev mode
From: Tony Nguyen @ 2023-11-07  0:48 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet, netdev
  Cc: Aniruddha Paul, anthony.l.nguyen, Przemek Kitszel,
	Wojciech Drewek, Sujai Buvaneswaran
In-Reply-To: <20231107004844.655549-1-anthony.l.nguyen@intel.com>

From: Aniruddha Paul <aniruddha.paul@intel.com>

Any packet leaving VSI i.e VF's VSI is considered as
egress traffic by HW, thus failing to match the added
rule.

Mark the direction for redirect rules as below:
1. VF-VF - Egress
2. Uplink-VF - Ingress
3. VF-Uplink - Egress
4. Link_Partner-Uplink - Ingress
5. Link_Partner-VF - Ingress

Fixes: 0960a27bd479 ("ice: Add direction metadata")
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Signed-off-by: Aniruddha Paul <aniruddha.paul@intel.com>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_tc_lib.c | 90 ++++++++++++++-------
 1 file changed, 62 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_tc_lib.c b/drivers/net/ethernet/intel/ice/ice_tc_lib.c
index 37b54db91df2..0e75fc6b3c06 100644
--- a/drivers/net/ethernet/intel/ice/ice_tc_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_tc_lib.c
@@ -630,32 +630,61 @@ bool ice_is_tunnel_supported(struct net_device *dev)
 	return ice_tc_tun_get_type(dev) != TNL_LAST;
 }
 
-static int
-ice_eswitch_tc_parse_action(struct ice_tc_flower_fltr *fltr,
-			    struct flow_action_entry *act)
+static bool ice_tc_is_dev_uplink(struct net_device *dev)
+{
+	return netif_is_ice(dev) || ice_is_tunnel_supported(dev);
+}
+
+static int ice_tc_setup_redirect_action(struct net_device *filter_dev,
+					struct ice_tc_flower_fltr *fltr,
+					struct net_device *target_dev)
 {
 	struct ice_repr *repr;
 
+	fltr->action.fltr_act = ICE_FWD_TO_VSI;
+
+	if (ice_is_port_repr_netdev(filter_dev) &&
+	    ice_is_port_repr_netdev(target_dev)) {
+		repr = ice_netdev_to_repr(target_dev);
+
+		fltr->dest_vsi = repr->src_vsi;
+		fltr->direction = ICE_ESWITCH_FLTR_EGRESS;
+	} else if (ice_is_port_repr_netdev(filter_dev) &&
+		   ice_tc_is_dev_uplink(target_dev)) {
+		repr = ice_netdev_to_repr(filter_dev);
+
+		fltr->dest_vsi = repr->src_vsi->back->switchdev.uplink_vsi;
+		fltr->direction = ICE_ESWITCH_FLTR_EGRESS;
+	} else if (ice_tc_is_dev_uplink(filter_dev) &&
+		   ice_is_port_repr_netdev(target_dev)) {
+		repr = ice_netdev_to_repr(target_dev);
+
+		fltr->dest_vsi = repr->src_vsi;
+		fltr->direction = ICE_ESWITCH_FLTR_INGRESS;
+	} else {
+		NL_SET_ERR_MSG_MOD(fltr->extack,
+				   "Unsupported netdevice in switchdev mode");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int ice_eswitch_tc_parse_action(struct net_device *filter_dev,
+				       struct ice_tc_flower_fltr *fltr,
+				       struct flow_action_entry *act)
+{
+	int err;
+
 	switch (act->id) {
 	case FLOW_ACTION_DROP:
 		fltr->action.fltr_act = ICE_DROP_PACKET;
 		break;
 
 	case FLOW_ACTION_REDIRECT:
-		fltr->action.fltr_act = ICE_FWD_TO_VSI;
-
-		if (ice_is_port_repr_netdev(act->dev)) {
-			repr = ice_netdev_to_repr(act->dev);
-
-			fltr->dest_vsi = repr->src_vsi;
-			fltr->direction = ICE_ESWITCH_FLTR_INGRESS;
-		} else if (netif_is_ice(act->dev) ||
-			   ice_is_tunnel_supported(act->dev)) {
-			fltr->direction = ICE_ESWITCH_FLTR_EGRESS;
-		} else {
-			NL_SET_ERR_MSG_MOD(fltr->extack, "Unsupported netdevice in switchdev mode");
-			return -EINVAL;
-		}
+		err = ice_tc_setup_redirect_action(filter_dev, fltr, act->dev);
+		if (err)
+			return err;
 
 		break;
 
@@ -696,10 +725,6 @@ ice_eswitch_add_tc_fltr(struct ice_vsi *vsi, struct ice_tc_flower_fltr *fltr)
 		goto exit;
 	}
 
-	/* egress traffic is always redirect to uplink */
-	if (fltr->direction == ICE_ESWITCH_FLTR_EGRESS)
-		fltr->dest_vsi = vsi->back->switchdev.uplink_vsi;
-
 	rule_info.sw_act.fltr_act = fltr->action.fltr_act;
 	if (fltr->action.fltr_act != ICE_DROP_PACKET)
 		rule_info.sw_act.vsi_handle = fltr->dest_vsi->idx;
@@ -713,13 +738,21 @@ ice_eswitch_add_tc_fltr(struct ice_vsi *vsi, struct ice_tc_flower_fltr *fltr)
 	rule_info.flags_info.act_valid = true;
 
 	if (fltr->direction == ICE_ESWITCH_FLTR_INGRESS) {
+		/* Uplink to VF */
 		rule_info.sw_act.flag |= ICE_FLTR_RX;
 		rule_info.sw_act.src = hw->pf_id;
 		rule_info.flags_info.act = ICE_SINGLE_ACT_LB_ENABLE;
-	} else {
+	} else if (fltr->direction == ICE_ESWITCH_FLTR_EGRESS &&
+		   fltr->dest_vsi == vsi->back->switchdev.uplink_vsi) {
+		/* VF to Uplink */
 		rule_info.sw_act.flag |= ICE_FLTR_TX;
 		rule_info.sw_act.src = vsi->idx;
 		rule_info.flags_info.act = ICE_SINGLE_ACT_LAN_ENABLE;
+	} else {
+		/* VF to VF */
+		rule_info.sw_act.flag |= ICE_FLTR_TX;
+		rule_info.sw_act.src = vsi->idx;
+		rule_info.flags_info.act = ICE_SINGLE_ACT_LB_ENABLE;
 	}
 
 	/* specify the cookie as filter_rule_id */
@@ -1745,16 +1778,17 @@ ice_tc_parse_action(struct ice_vsi *vsi, struct ice_tc_flower_fltr *fltr,
 
 /**
  * ice_parse_tc_flower_actions - Parse the actions for a TC filter
+ * @filter_dev: Pointer to device on which filter is being added
  * @vsi: Pointer to VSI
  * @cls_flower: Pointer to TC flower offload structure
  * @fltr: Pointer to TC flower filter structure
  *
  * Parse the actions for a TC filter
  */
-static int
-ice_parse_tc_flower_actions(struct ice_vsi *vsi,
-			    struct flow_cls_offload *cls_flower,
-			    struct ice_tc_flower_fltr *fltr)
+static int ice_parse_tc_flower_actions(struct net_device *filter_dev,
+				       struct ice_vsi *vsi,
+				       struct flow_cls_offload *cls_flower,
+				       struct ice_tc_flower_fltr *fltr)
 {
 	struct flow_rule *rule = flow_cls_offload_flow_rule(cls_flower);
 	struct flow_action *flow_action = &rule->action;
@@ -1769,7 +1803,7 @@ ice_parse_tc_flower_actions(struct ice_vsi *vsi,
 
 	flow_action_for_each(i, act, flow_action) {
 		if (ice_is_eswitch_mode_switchdev(vsi->back))
-			err = ice_eswitch_tc_parse_action(fltr, act);
+			err = ice_eswitch_tc_parse_action(filter_dev, fltr, act);
 		else
 			err = ice_tc_parse_action(vsi, fltr, act);
 		if (err)
@@ -1856,7 +1890,7 @@ ice_add_tc_fltr(struct net_device *netdev, struct ice_vsi *vsi,
 	if (err < 0)
 		goto err;
 
-	err = ice_parse_tc_flower_actions(vsi, f, fltr);
+	err = ice_parse_tc_flower_actions(netdev, vsi, f, fltr);
 	if (err < 0)
 		goto err;
 
-- 
2.41.0


^ permalink raw reply related

* [PATCH net 4/4] ice: Fix VF-VF direction matching in drop rule in switchdev
From: Tony Nguyen @ 2023-11-07  0:48 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet, netdev
  Cc: Marcin Szycik, anthony.l.nguyen, Michal Swiatkowski,
	Sujai Buvaneswaran, Simon Horman
In-Reply-To: <20231107004844.655549-1-anthony.l.nguyen@intel.com>

From: Marcin Szycik <marcin.szycik@linux.intel.com>

When adding a drop rule on a VF, rule direction is not being set, which
results in it always being set to ingress (ICE_ESWITCH_FLTR_INGRESS
equals 0). Because of this, drop rules added on port representors don't
match any packets.

To fix it, set rule direction in drop action to egress when netdev is a
port representor, otherwise set it to ingress.

Fixes: 0960a27bd479 ("ice: Add direction metadata")
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_tc_lib.c | 24 ++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_tc_lib.c b/drivers/net/ethernet/intel/ice/ice_tc_lib.c
index 0e75fc6b3c06..dd03cb69ad26 100644
--- a/drivers/net/ethernet/intel/ice/ice_tc_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_tc_lib.c
@@ -670,6 +670,25 @@ static int ice_tc_setup_redirect_action(struct net_device *filter_dev,
 	return 0;
 }
 
+static int
+ice_tc_setup_drop_action(struct net_device *filter_dev,
+			 struct ice_tc_flower_fltr *fltr)
+{
+	fltr->action.fltr_act = ICE_DROP_PACKET;
+
+	if (ice_is_port_repr_netdev(filter_dev)) {
+		fltr->direction = ICE_ESWITCH_FLTR_EGRESS;
+	} else if (ice_tc_is_dev_uplink(filter_dev)) {
+		fltr->direction = ICE_ESWITCH_FLTR_INGRESS;
+	} else {
+		NL_SET_ERR_MSG_MOD(fltr->extack,
+				   "Unsupported netdevice in switchdev mode");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 static int ice_eswitch_tc_parse_action(struct net_device *filter_dev,
 				       struct ice_tc_flower_fltr *fltr,
 				       struct flow_action_entry *act)
@@ -678,7 +697,10 @@ static int ice_eswitch_tc_parse_action(struct net_device *filter_dev,
 
 	switch (act->id) {
 	case FLOW_ACTION_DROP:
-		fltr->action.fltr_act = ICE_DROP_PACKET;
+		err = ice_tc_setup_drop_action(filter_dev, fltr);
+		if (err)
+			return err;
+
 		break;
 
 	case FLOW_ACTION_REDIRECT:
-- 
2.41.0


^ permalink raw reply related

* Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags
From: Stanislav Fomichev @ 2023-11-07  0:59 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Mina Almasry, David Ahern, netdev, linux-kernel, linux-arch,
	linux-kselftest, linux-media, dri-devel, linaro-mm-sig,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	Shuah Khan, Sumit Semwal, Christian König, Shakeel Butt,
	Jeroen de Borst, Praveen Kaligineedi, Willem de Bruijn,
	Kaiyuan Zhang
In-Reply-To: <ZUmBf7E8ZoTQwThL@google.com>

On 11/06, Stanislav Fomichev wrote:
> On 11/06, Willem de Bruijn wrote:
> > On Mon, Nov 6, 2023 at 3:55 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On Mon, Nov 6, 2023 at 3:27 PM Mina Almasry <almasrymina@google.com> wrote:
> > > >
> > > > On Mon, Nov 6, 2023 at 2:59 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > >
> > > > > On 11/06, Mina Almasry wrote:
> > > > > > On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > > > >
> > > > > > > On 11/06, Mina Almasry wrote:
> > > > > > > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern <dsahern@kernel.org> wrote:
> > > > > > > > >
> > > > > > > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > > > > > > On 11/05, Mina Almasry wrote:
> > > > > > > > > >> For device memory TCP, we expect the skb headers to be available in host
> > > > > > > > > >> memory for access, and we expect the skb frags to be in device memory
> > > > > > > > > >> and unaccessible to the host. We expect there to be no mixing and
> > > > > > > > > >> matching of device memory frags (unaccessible) with host memory frags
> > > > > > > > > >> (accessible) in the same skb.
> > > > > > > > > >>
> > > > > > > > > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > > > > > > > > >> are device memory frags or not.
> > > > > > > > > >>
> > > > > > > > > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > > > > > > > > >> and marks the skb as skb->devmem accordingly.
> > > > > > > > > >>
> > > > > > > > > >> Add checks through the network stack to avoid accessing the frags of
> > > > > > > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > > > > > > > > >>
> > > > > > > > > >> Signed-off-by: Willem de Bruijn <willemb@google.com>
> > > > > > > > > >> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> > > > > > > > > >> Signed-off-by: Mina Almasry <almasrymina@google.com>
> > > > > > > > > >>
> > > > > > > > > >> ---
> > > > > > > > > >>  include/linux/skbuff.h | 14 +++++++-
> > > > > > > > > >>  include/net/tcp.h      |  5 +--
> > > > > > > > > >>  net/core/datagram.c    |  6 ++++
> > > > > > > > > >>  net/core/gro.c         |  5 ++-
> > > > > > > > > >>  net/core/skbuff.c      | 77 ++++++++++++++++++++++++++++++++++++------
> > > > > > > > > >>  net/ipv4/tcp.c         |  6 ++++
> > > > > > > > > >>  net/ipv4/tcp_input.c   | 13 +++++--
> > > > > > > > > >>  net/ipv4/tcp_output.c  |  5 ++-
> > > > > > > > > >>  net/packet/af_packet.c |  4 +--
> > > > > > > > > >>  9 files changed, 115 insertions(+), 20 deletions(-)
> > > > > > > > > >>
> > > > > > > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > > > > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > > > > > > >> --- a/include/linux/skbuff.h
> > > > > > > > > >> +++ b/include/linux/skbuff.h
> > > > > > > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > > > > > > >>   *  @csum_level: indicates the number of consecutive checksums found in
> > > > > > > > > >>   *          the packet minus one that have been verified as
> > > > > > > > > >>   *          CHECKSUM_UNNECESSARY (max 3)
> > > > > > > > > >> + *  @devmem: indicates that all the fragments in this skb are backed by
> > > > > > > > > >> + *          device memory.
> > > > > > > > > >>   *  @dst_pending_confirm: need to confirm neighbour
> > > > > > > > > >>   *  @decrypted: Decrypted SKB
> > > > > > > > > >>   *  @slow_gro: state present at GRO time, slower prepare step required
> > > > > > > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > > > > > > >>  #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > > > > > > >>      __u8                    csum_not_inet:1;
> > > > > > > > > >>  #endif
> > > > > > > > > >> -
> > > > > > > > > >> +    __u8                    devmem:1;
> > > > > > > > > >>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > > > > > > >>      __u16                   tc_index;       /* traffic control index */
> > > > > > > > > >>  #endif
> > > > > > > > > >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > > > > > > >>              __skb_zcopy_downgrade_managed(skb);
> > > > > > > > > >>  }
> > > > > > > > > >>
> > > > > > > > > >> +/* Return true if frags in this skb are not readable by the host. */
> > > > > > > > > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > > > > > > > >> +{
> > > > > > > > > >> +    return skb->devmem;
> > > > > > > > > >
> > > > > > > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > > > > > > > > > It better communicates the fact that the stack shouldn't dereference the
> > > > > > > > > > frags (because it has 'devmem' fragments or for some other potential
> > > > > > > > > > future reason).
> > > > > > > > >
> > > > > > > > > +1.
> > > > > > > > >
> > > > > > > > > Also, the flag on the skb is an optimization - a high level signal that
> > > > > > > > > one or more frags is in unreadable memory. There is no requirement that
> > > > > > > > > all of the frags are in the same memory type.
> > > > > > >
> > > > > > > David: maybe there should be such a requirement (that they all are
> > > > > > > unreadable)? Might be easier to support initially; we can relax later
> > > > > > > on.
> > > > > > >
> > > > > >
> > > > > > Currently devmem == not_readable, and the restriction is that all the
> > > > > > frags in the same skb must be either all readable or all unreadable
> > > > > > (all devmem or all non-devmem).
> > > > > >
> > > > > > > > The flag indicates that the skb contains all devmem dma-buf memory
> > > > > > > > specifically, not generic 'not_readable' frags as the comment says:
> > > > > > > >
> > > > > > > > + *     @devmem: indicates that all the fragments in this skb are backed by
> > > > > > > > + *             device memory.
> > > > > > > >
> > > > > > > > The reason it's not a generic 'not_readable' flag is because handing
> > > > > > > > off a generic not_readable skb to the userspace is semantically not
> > > > > > > > what we're doing. recvmsg() is augmented in this patch series to
> > > > > > > > return a devmem skb to the user via a cmsg_devmem struct which refers
> > > > > > > > specifically to the memory in the dma-buf. recvmsg() in this patch
> > > > > > > > series is not augmented to give any 'not_readable' skb to the
> > > > > > > > userspace.
> > > > > > > >
> > > > > > > > IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> > > > > > > > correct. If a new type of unreadable skbs are introduced to the stack,
> > > > > > > > I imagine the stack would implement:
> > > > > > > >
> > > > > > > > 1. new header flag: skb->newmem
> > > > > > > > 2.
> > > > > > > >
> > > > > > > > static inline bool skb_frags_not_readable(const struct skb_buff *skb)
> > > > > > > > {
> > > > > > > >     return skb->devmem || skb->newmem;
> > > > > > > > }
> > > > > > > >
> > > > > > > > 3. tcp_recvmsg_devmem() would handle skb->devmem skbs is in this patch
> > > > > > > > series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.
> > > > > > >
> > > > > > > You copy it to the userspace in a special way because your frags
> > > > > > > are page_is_page_pool_iov(). I agree with David, the skb bit is
> > > > > > > just and optimization.
> > > > > > >
> > > > > > > For most of the core stack, it doesn't matter why your skb is not
> > > > > > > readable. For a few places where it matters (recvmsg?), you can
> > > > > > > double-check your frags (all or some) with page_is_page_pool_iov.
> > > > > > >
> > > > > >
> > > > > > I see, we can do that then. I.e. make the header flag 'not_readable'
> > > > > > and check the frags to decide to delegate to tcp_recvmsg_devmem() or
> > > > > > something else. We can even assume not_readable == devmem because
> > > > > > currently devmem is the only type of unreadable frag currently.
> > > > > >
> > > > > > > Unrelated: we probably need socket to dmabuf association as well (via
> > > > > > > netlink or something).
> > > > > >
> > > > > > Not sure this is possible. The dma-buf is bound to the rx-queue, and
> > > > > > any packets that land on that rx-queue are bound to that dma-buf,
> > > > > > regardless of which socket that packet belongs to. So the association
> > > > > > IMO must be rx-queue to dma-buf, not socket to dma-buf.
> > > > >
> > > > > But there is still always 1 dmabuf to 1 socket association (on rx), right?
> > > > > Because otherwise, there is no way currently to tell, at recvmsg, which
> > > > > dmabuf the received token belongs to.
> > > > >
> > > >
> > > > Yes, but this 1 dma-buf to 1 socket association happens because the
> > > > user binds the dma-buf to an rx-queue and configures flow steering of
> > > > the socket to that rx-queue.
> > >
> > > It's still fixed and won't change during the socket lifetime, right?
> > > And the socket has to know this association; otherwise those tokens
> > > are useless since they don't carry anything to identify the dmabuf.
> > >
> > > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
> > > it somehow implies that I have an option of passing or not passing it
> > > for an individual system call.
> > > If we know that we're going to use dmabuf with the socket, maybe we
> > > should move this flag to the socket() syscall?
> > >
> > > fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);
> > >
> > > ?
> > 
> > I think it should then be a setsockopt called before any data is
> > exchanged, with no change of modifying mode later. We generally use
> > setsockopts for the mode of a socket. This use of the protocol field
> > in socket() for setting a mode would be novel. Also, it might miss
> > passively opened connections, or be overly restrictive: one approach
> > for all accepted child sockets.
> 
> I was thinking this is similar to SOCK_CLOEXEC or SOCK_NONBLOCK? There
> are plenty of bits we can grab. But setsockopt works as well!

To follow up: if we have this flag on a socket, not on a per-message
basis, can we also use recvmsg for the recycling part maybe?

while (true) {
	memset(msg, 0, ...);

	/* receive the tokens */
	ret = recvmsg(fd, &msg, 0);

	/* recycle the tokens from the above recvmsg() */
	ret = recvmsg(fd, &msg, MSG_RECYCLE);
}

recvmsg + MSG_RECYCLE can parse the same format that regular recvmsg
exports (SO_DEVMEM_OFFSET) and we can also add extra cmsg option
to recycle a range.

Will this be more straightforward than a setsockopt(SO_DEVMEM_DONTNEED)?
Or is it more confusing?

^ permalink raw reply

* Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags
From: Stanislav Fomichev @ 2023-11-07  1:06 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Willem de Bruijn, David Ahern, netdev, linux-kernel, linux-arch,
	linux-kselftest, linux-media, dri-devel, linaro-mm-sig,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	Shuah Khan, Sumit Semwal, Christian König, Shakeel Butt,
	Jeroen de Borst, Praveen Kaligineedi, Willem de Bruijn,
	Kaiyuan Zhang
In-Reply-To: <CAHS8izNxKHhW5uCqmfau6n3c18=hE3RXzA+ng5LEGiKj12nGcg@mail.gmail.com>

On 11/06, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 4:08 PM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > On Mon, Nov 6, 2023 at 3:55 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On Mon, Nov 6, 2023 at 3:27 PM Mina Almasry <almasrymina@google.com> wrote:
> > > >
> > > > On Mon, Nov 6, 2023 at 2:59 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > >
> > > > > On 11/06, Mina Almasry wrote:
> > > > > > On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > > > >
> > > > > > > On 11/06, Mina Almasry wrote:
> > > > > > > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern <dsahern@kernel.org> wrote:
> > > > > > > > >
> > > > > > > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > > > > > > On 11/05, Mina Almasry wrote:
> > > > > > > > > >> For device memory TCP, we expect the skb headers to be available in host
> > > > > > > > > >> memory for access, and we expect the skb frags to be in device memory
> > > > > > > > > >> and unaccessible to the host. We expect there to be no mixing and
> > > > > > > > > >> matching of device memory frags (unaccessible) with host memory frags
> > > > > > > > > >> (accessible) in the same skb.
> > > > > > > > > >>
> > > > > > > > > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > > > > > > > > >> are device memory frags or not.
> > > > > > > > > >>
> > > > > > > > > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > > > > > > > > >> and marks the skb as skb->devmem accordingly.
> > > > > > > > > >>
> > > > > > > > > >> Add checks through the network stack to avoid accessing the frags of
> > > > > > > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > > > > > > > > >>
> > > > > > > > > >> Signed-off-by: Willem de Bruijn <willemb@google.com>
> > > > > > > > > >> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> > > > > > > > > >> Signed-off-by: Mina Almasry <almasrymina@google.com>
> > > > > > > > > >>
> > > > > > > > > >> ---
> > > > > > > > > >>  include/linux/skbuff.h | 14 +++++++-
> > > > > > > > > >>  include/net/tcp.h      |  5 +--
> > > > > > > > > >>  net/core/datagram.c    |  6 ++++
> > > > > > > > > >>  net/core/gro.c         |  5 ++-
> > > > > > > > > >>  net/core/skbuff.c      | 77 ++++++++++++++++++++++++++++++++++++------
> > > > > > > > > >>  net/ipv4/tcp.c         |  6 ++++
> > > > > > > > > >>  net/ipv4/tcp_input.c   | 13 +++++--
> > > > > > > > > >>  net/ipv4/tcp_output.c  |  5 ++-
> > > > > > > > > >>  net/packet/af_packet.c |  4 +--
> > > > > > > > > >>  9 files changed, 115 insertions(+), 20 deletions(-)
> > > > > > > > > >>
> > > > > > > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > > > > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > > > > > > >> --- a/include/linux/skbuff.h
> > > > > > > > > >> +++ b/include/linux/skbuff.h
> > > > > > > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > > > > > > >>   *  @csum_level: indicates the number of consecutive checksums found in
> > > > > > > > > >>   *          the packet minus one that have been verified as
> > > > > > > > > >>   *          CHECKSUM_UNNECESSARY (max 3)
> > > > > > > > > >> + *  @devmem: indicates that all the fragments in this skb are backed by
> > > > > > > > > >> + *          device memory.
> > > > > > > > > >>   *  @dst_pending_confirm: need to confirm neighbour
> > > > > > > > > >>   *  @decrypted: Decrypted SKB
> > > > > > > > > >>   *  @slow_gro: state present at GRO time, slower prepare step required
> > > > > > > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > > > > > > >>  #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > > > > > > >>      __u8                    csum_not_inet:1;
> > > > > > > > > >>  #endif
> > > > > > > > > >> -
> > > > > > > > > >> +    __u8                    devmem:1;
> > > > > > > > > >>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > > > > > > >>      __u16                   tc_index;       /* traffic control index */
> > > > > > > > > >>  #endif
> > > > > > > > > >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > > > > > > >>              __skb_zcopy_downgrade_managed(skb);
> > > > > > > > > >>  }
> > > > > > > > > >>
> > > > > > > > > >> +/* Return true if frags in this skb are not readable by the host. */
> > > > > > > > > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > > > > > > > >> +{
> > > > > > > > > >> +    return skb->devmem;
> > > > > > > > > >
> > > > > > > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > > > > > > > > > It better communicates the fact that the stack shouldn't dereference the
> > > > > > > > > > frags (because it has 'devmem' fragments or for some other potential
> > > > > > > > > > future reason).
> > > > > > > > >
> > > > > > > > > +1.
> > > > > > > > >
> > > > > > > > > Also, the flag on the skb is an optimization - a high level signal that
> > > > > > > > > one or more frags is in unreadable memory. There is no requirement that
> > > > > > > > > all of the frags are in the same memory type.
> > > > > > >
> > > > > > > David: maybe there should be such a requirement (that they all are
> > > > > > > unreadable)? Might be easier to support initially; we can relax later
> > > > > > > on.
> > > > > > >
> > > > > >
> > > > > > Currently devmem == not_readable, and the restriction is that all the
> > > > > > frags in the same skb must be either all readable or all unreadable
> > > > > > (all devmem or all non-devmem).
> > > > > >
> > > > > > > > The flag indicates that the skb contains all devmem dma-buf memory
> > > > > > > > specifically, not generic 'not_readable' frags as the comment says:
> > > > > > > >
> > > > > > > > + *     @devmem: indicates that all the fragments in this skb are backed by
> > > > > > > > + *             device memory.
> > > > > > > >
> > > > > > > > The reason it's not a generic 'not_readable' flag is because handing
> > > > > > > > off a generic not_readable skb to the userspace is semantically not
> > > > > > > > what we're doing. recvmsg() is augmented in this patch series to
> > > > > > > > return a devmem skb to the user via a cmsg_devmem struct which refers
> > > > > > > > specifically to the memory in the dma-buf. recvmsg() in this patch
> > > > > > > > series is not augmented to give any 'not_readable' skb to the
> > > > > > > > userspace.
> > > > > > > >
> > > > > > > > IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> > > > > > > > correct. If a new type of unreadable skbs are introduced to the stack,
> > > > > > > > I imagine the stack would implement:
> > > > > > > >
> > > > > > > > 1. new header flag: skb->newmem
> > > > > > > > 2.
> > > > > > > >
> > > > > > > > static inline bool skb_frags_not_readable(const struct skb_buff *skb)
> > > > > > > > {
> > > > > > > >     return skb->devmem || skb->newmem;
> > > > > > > > }
> > > > > > > >
> > > > > > > > 3. tcp_recvmsg_devmem() would handle skb->devmem skbs is in this patch
> > > > > > > > series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.
> > > > > > >
> > > > > > > You copy it to the userspace in a special way because your frags
> > > > > > > are page_is_page_pool_iov(). I agree with David, the skb bit is
> > > > > > > just and optimization.
> > > > > > >
> > > > > > > For most of the core stack, it doesn't matter why your skb is not
> > > > > > > readable. For a few places where it matters (recvmsg?), you can
> > > > > > > double-check your frags (all or some) with page_is_page_pool_iov.
> > > > > > >
> > > > > >
> > > > > > I see, we can do that then. I.e. make the header flag 'not_readable'
> > > > > > and check the frags to decide to delegate to tcp_recvmsg_devmem() or
> > > > > > something else. We can even assume not_readable == devmem because
> > > > > > currently devmem is the only type of unreadable frag currently.
> > > > > >
> > > > > > > Unrelated: we probably need socket to dmabuf association as well (via
> > > > > > > netlink or something).
> > > > > >
> > > > > > Not sure this is possible. The dma-buf is bound to the rx-queue, and
> > > > > > any packets that land on that rx-queue are bound to that dma-buf,
> > > > > > regardless of which socket that packet belongs to. So the association
> > > > > > IMO must be rx-queue to dma-buf, not socket to dma-buf.
> > > > >
> > > > > But there is still always 1 dmabuf to 1 socket association (on rx), right?
> > > > > Because otherwise, there is no way currently to tell, at recvmsg, which
> > > > > dmabuf the received token belongs to.
> > > > >
> > > >
> > > > Yes, but this 1 dma-buf to 1 socket association happens because the
> > > > user binds the dma-buf to an rx-queue and configures flow steering of
> > > > the socket to that rx-queue.
> > >
> > > It's still fixed and won't change during the socket lifetime, right?
> 
> Technically, no.
> 
> The user is free to modify or delete flow steering rules outside of
> the lifetime of the socket. Technically it's possible for the user to
> reconfigure flow steering while the socket is simultaneously
> receiving, and the result will be packets switching
>  from devmem to non-devmem. For a reasonably correctly configured
> application the application would probably want to steer 1 flow to 1
> dma-buf and never change it, but this is not something we enforce, but
> rather the user orchestrates. In theory someone can find a use case
> for configuring and unconfigure flow steering during a connection.

If we do want to support this flexible configuration then we also
should export some dmabuf id along with the token?
 
> > > And the socket has to know this association; otherwise those tokens
> > > are useless since they don't carry anything to identify the dmabuf.
> > >
> > > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
> > > it somehow implies that I have an option of passing or not passing it
> > > for an individual system call.
> 
> You do have the option of passing it or not passing it per system
> call. The MSG_SOCK_DEVMEM says the application is willing to receive
> devmem cmsgs - that's all. The application doesn't get to decide
> whether it's actually going to receive a devmem cmsg or not, because
> that's dictated by the type of skb that is present in the receive
> queue, and not up to the application. I should explain this in the
> commit message...

What would be the case of passing it or not passing it? Some fallback to
the host memory after flow steering update? Yeah, would be useful to
document those constrains. I'd lean on starting stricter and relaxing
those conditions if we find the use-cases.

> > > If we know that we're going to use dmabuf with the socket, maybe we
> > > should move this flag to the socket() syscall?
> > >
> > > fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);
> > >
> > > ?
> >
> > I think it should then be a setsockopt called before any data is
> > exchanged, with no change of modifying mode later. We generally use
> > setsockopts for the mode of a socket. This use of the protocol field
> > in socket() for setting a mode would be novel. Also, it might miss
> > passively opened connections, or be overly restrictive: one approach
> > for all accepted child sockets.
> 
> We can definitely move SOCK_DEVMEM to a setsockopt(). Seems more than
> reasonable.

SG, added another suggestion for SO_DEVMEM_DONTNEED on another thread
with Willem. LMK what you think.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox