Netdev List
 help / color / mirror / Atom feed
* [PATCH net] net: phylink: initialize carrier state at creation
From: Klaus Kudielka @ 2023-11-06 18:05 UTC (permalink / raw)
  To: Russell King, Andrew Lunn, Heiner Kallweit
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, linux-kernel, Klaus Kudielka

Background: Turris Omnia (Armada 385); eth2 (mvneta) connected to SFP bus;
SFP module is present, but no fiber connected, so definitely no carrier.

After booting, eth2 is down, but netdev LED trigger surprisingly reports
link active. Then, after "ip link set eth2 up", the link indicator goes
away - as I would have expected it from the beginning.

It turns out, that the default carrier state after netdev creation is
"carrier ok". Some ethernet drivers explicitly call netif_carrier_off
during probing, others (like mvneta) don't - which explains the current
behaviour: only when the device is brought up, phylink_start calls
netif_carrier_off.

Fix this for all drivers, by calling netif_carrier_off in phylink_create.

Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Klaus Kudielka <klaus.kudielka@gmail.com>
---
 drivers/net/phy/phylink.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c
index 6712883498..a28da80bde 100644
--- a/drivers/net/phy/phylink.c
+++ b/drivers/net/phy/phylink.c
@@ -1616,6 +1616,7 @@ struct phylink *phylink_create(struct phylink_config *config,
 	pl->config = config;
 	if (config->type == PHYLINK_NETDEV) {
 		pl->netdev = to_net_dev(config->dev);
+		netif_carrier_off(pl->netdev);
 	} else if (config->type == PHYLINK_DEV) {
 		pl->dev = config->dev;
 	} else {
-- 
2.42.0


^ permalink raw reply related

* Re: [PATCH bpf 4/6] bpf, netkit: Add indirect call wrapper for fetching peer dev
From: Daniel Borkmann @ 2023-11-06 18:21 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: martin.lau, kuba, netdev, bpf, Nikolay Aleksandrov
In-Reply-To: <ZUkgtxlK9MRGHx8v@google.com>

On 11/6/23 6:21 PM, Stanislav Fomichev wrote:
[...]
>> +static struct net_device *skb_get_peer_dev(struct net_device *dev)
>> +{
>> +	const struct net_device_ops *ops = dev->netdev_ops;
>> +
>> +	if (likely(ops->ndo_get_peer_dev))
>> +		return INDIRECT_CALL_1(ops->ndo_get_peer_dev,
>> +				       netkit_peer_dev, dev);
> 
> nit: why not put both netkit and veth here under INDIRECT_CALL_2 ?
> Presumably should help with the veth deployments as well?

Yes, I'm also planning to add it there as well, it's a slightly larger
change since also a new header needs to be added, but I'll follow-up on it.

Thanks for review,
Daniel

^ permalink raw reply

* Re: [PATCH net] net: phylink: initialize carrier state at creation
From: Russell King (Oracle) @ 2023-11-06 18:36 UTC (permalink / raw)
  To: Klaus Kudielka
  Cc: Andrew Lunn, Heiner Kallweit, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, netdev, linux-kernel
In-Reply-To: <20231106180506.2665-1-klaus.kudielka@gmail.com>

On Mon, Nov 06, 2023 at 07:05:06PM +0100, Klaus Kudielka wrote:
> Background: Turris Omnia (Armada 385); eth2 (mvneta) connected to SFP bus;
> SFP module is present, but no fiber connected, so definitely no carrier.
> 
> After booting, eth2 is down, but netdev LED trigger surprisingly reports
> link active. Then, after "ip link set eth2 up", the link indicator goes
> away - as I would have expected it from the beginning.
> 
> It turns out, that the default carrier state after netdev creation is
> "carrier ok". Some ethernet drivers explicitly call netif_carrier_off
> during probing, others (like mvneta) don't - which explains the current
> behaviour: only when the device is brought up, phylink_start calls
> netif_carrier_off.
> 
> Fix this for all drivers, by calling netif_carrier_off in phylink_create.
> 
> Suggested-by: Andrew Lunn <andrew@lunn.ch>
> Signed-off-by: Klaus Kudielka <klaus.kudielka@gmail.com>

Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>

Thanks!

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* Re: [RFC PATCH v3 10/12] tcp: RX path for devmem TCP
From: Stanislav Fomichev @ 2023-11-06 18:44 UTC (permalink / raw)
  To: Mina Almasry
  Cc: netdev, linux-kernel, linux-arch, linux-kselftest, linux-media,
	dri-devel, linaro-mm-sig, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Arnd Bergmann, David Ahern, Willem de Bruijn,
	Shuah Khan, Sumit Semwal, Christian König, Shakeel Butt,
	Jeroen de Borst, Praveen Kaligineedi, Willem de Bruijn,
	Kaiyuan Zhang
In-Reply-To: <20231106024413.2801438-11-almasrymina@google.com>

On 11/05, Mina Almasry wrote:
> In tcp_recvmsg_locked(), detect if the skb being received by the user
> is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM
> flag - pass it to tcp_recvmsg_devmem() for custom handling.
> 
> tcp_recvmsg_devmem() copies any data in the skb header to the linear
> buffer, and returns a cmsg to the user indicating the number of bytes
> returned in the linear buffer.
> 
> tcp_recvmsg_devmem() then loops over the unaccessible devmem skb frags,
> and returns to the user a cmsg_devmem indicating the location of the
> data in the dmabuf device memory. cmsg_devmem contains this information:
> 
> 1. the offset into the dmabuf where the payload starts. 'frag_offset'.
> 2. the size of the frag. 'frag_size'.
> 3. an opaque token 'frag_token' to return to the kernel when the buffer
> is to be released.
> 
> The pages awaiting freeing are stored in the newly added
> sk->sk_user_pages, and each page passed to userspace is get_page()'d.
> This reference is dropped once the userspace indicates that it is
> done reading this page.  All pages are released when the socket is
> destroyed.
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> 
> ---
> 
> RFC v3:
> - Fixed issue with put_cmsg() failing silently.
> 
> ---
>  include/linux/socket.h            |   1 +
>  include/net/page_pool/helpers.h   |   9 ++
>  include/net/sock.h                |   2 +
>  include/uapi/asm-generic/socket.h |   5 +
>  include/uapi/linux/uio.h          |   6 +
>  net/ipv4/tcp.c                    | 189 +++++++++++++++++++++++++++++-
>  net/ipv4/tcp_ipv4.c               |   7 ++
>  7 files changed, 214 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/socket.h b/include/linux/socket.h
> index cfcb7e2c3813..fe2b9e2081bb 100644
> --- a/include/linux/socket.h
> +++ b/include/linux/socket.h
> @@ -326,6 +326,7 @@ struct ucred {
>  					  * plain text and require encryption
>  					  */
>  
> +#define MSG_SOCK_DEVMEM 0x2000000	/* Receive devmem skbs as cmsg */

Sharing the feedback that I've been providing internally on the public list:

IMHO, we need a better UAPI to receive the tokens and give them back to
the kernel. CMSG + setsockopt(SO_DEVMEM_DONTNEED) get the job done,
but look dated and hacky :-(

We should either do some kind of user/kernel shared memory queue to
receive/return the tokens (similar to what Jonathan was doing in his
proposal?) or bite the bullet and switch to io_uring.

I was also suggesting to do it via netlink initially, but it's probably
a bit slow for these purpose, idk.

^ permalink raw reply

* Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags
From: Stanislav Fomichev @ 2023-11-06 18:47 UTC (permalink / raw)
  To: Mina Almasry
  Cc: netdev, linux-kernel, linux-arch, linux-kselftest, linux-media,
	dri-devel, linaro-mm-sig, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Arnd Bergmann, David Ahern, Willem de Bruijn,
	Shuah Khan, Sumit Semwal, Christian König, Shakeel Butt,
	Jeroen de Borst, Praveen Kaligineedi, Willem de Bruijn,
	Kaiyuan Zhang
In-Reply-To: <20231106024413.2801438-10-almasrymina@google.com>

On 11/05, Mina Almasry wrote:
> For device memory TCP, we expect the skb headers to be available in host
> memory for access, and we expect the skb frags to be in device memory
> and unaccessible to the host. We expect there to be no mixing and
> matching of device memory frags (unaccessible) with host memory frags
> (accessible) in the same skb.
> 
> Add a skb->devmem flag which indicates whether the frags in this skb
> are device memory frags or not.
> 
> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> and marks the skb as skb->devmem accordingly.
> 
> Add checks through the network stack to avoid accessing the frags of
> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> 
> ---
>  include/linux/skbuff.h | 14 +++++++-
>  include/net/tcp.h      |  5 +--
>  net/core/datagram.c    |  6 ++++
>  net/core/gro.c         |  5 ++-
>  net/core/skbuff.c      | 77 ++++++++++++++++++++++++++++++++++++------
>  net/ipv4/tcp.c         |  6 ++++
>  net/ipv4/tcp_input.c   | 13 +++++--
>  net/ipv4/tcp_output.c  |  5 ++-
>  net/packet/af_packet.c |  4 +--
>  9 files changed, 115 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 1fae276c1353..8fb468ff8115 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
>   *	@csum_level: indicates the number of consecutive checksums found in
>   *		the packet minus one that have been verified as
>   *		CHECKSUM_UNNECESSARY (max 3)
> + *	@devmem: indicates that all the fragments in this skb are backed by
> + *		device memory.
>   *	@dst_pending_confirm: need to confirm neighbour
>   *	@decrypted: Decrypted SKB
>   *	@slow_gro: state present at GRO time, slower prepare step required
> @@ -991,7 +993,7 @@ struct sk_buff {
>  #if IS_ENABLED(CONFIG_IP_SCTP)
>  	__u8			csum_not_inet:1;
>  #endif
> -
> +	__u8			devmem:1;
>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
>  	__u16			tc_index;	/* traffic control index */
>  #endif
> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
>  		__skb_zcopy_downgrade_managed(skb);
>  }
>  
> +/* Return true if frags in this skb are not readable by the host. */
> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> +{
> +	return skb->devmem;

bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
It better communicates the fact that the stack shouldn't dereference the
frags (because it has 'devmem' fragments or for some other potential
future reason).

^ permalink raw reply

* Re: [PATCH v9 9/17] bpf,lsm: refactor bpf_prog_alloc/bpf_prog_free LSM hooks
From: Andrii Nakryiko @ 2023-11-06 19:03 UTC (permalink / raw)
  To: Paul Moore
  Cc: Andrii Nakryiko, bpf, netdev, brauner, linux-fsdevel,
	linux-security-module, keescook, kernel-team, sargun
In-Reply-To: <7ff273d368f3f7dd383444928ca478ef.paul@paul-moore.com>

On Sun, Nov 5, 2023 at 9:01 PM Paul Moore <paul@paul-moore.com> wrote:
>
> On Nov  3, 2023 Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > Based on upstream discussion ([0]), rework existing
> > bpf_prog_alloc_security LSM hook. Rename it to bpf_prog_load and instead
> > of passing bpf_prog_aux, pass proper bpf_prog pointer for a full BPF
> > program struct. Also, we pass bpf_attr union with all the user-provided
> > arguments for BPF_PROG_LOAD command.  This will give LSMs as much
> > information as we can basically provide.
> >
> > The hook is also BPF token-aware now, and optional bpf_token struct is
> > passed as a third argument. bpf_prog_load LSM hook is called after
> > a bunch of sanity checks were performed, bpf_prog and bpf_prog_aux were
> > allocated and filled out, but right before performing full-fledged BPF
> > verification step.
> >
> > bpf_prog_free LSM hook is now accepting struct bpf_prog argument, for
> > consistency. SELinux code is adjusted to all new names, types, and
> > signatures.
> >
> > Note, given that bpf_prog_load (previously bpf_prog_alloc) hook can be
> > used by some LSMs to allocate extra security blob, but also by other
> > LSMs to reject BPF program loading, we need to make sure that
> > bpf_prog_free LSM hook is called after bpf_prog_load/bpf_prog_alloc one
> > *even* if the hook itself returned error. If we don't do that, we run
> > the risk of leaking memory. This seems to be possible today when
> > combining SELinux and BPF LSM, as one example, depending on their
> > relative ordering.
> >
> > Also, for BPF LSM setup, add bpf_prog_load and bpf_prog_free to
> > sleepable LSM hooks list, as they are both executed in sleepable
> > context. Also drop bpf_prog_load hook from untrusted, as there is no
> > issue with refcount or anything else anymore, that originally forced us
> > to add it to untrusted list in c0c852dd1876 ("bpf: Do not mark certain LSM
> > hook arguments as trusted"). We now trigger this hook much later and it
> > should not be an issue anymore.
>
> See my comment below, but it isn't clear to me if this means it is okay
> to have `BTF_ID(func, bpf_lsm_bpf_prog_free)` called twice.  It probably
> would be a good idea to get KP, BPF LSM maintainer, to review this change
> as well to make sure this looks good to him.
>
> >   [0] https://lore.kernel.org/bpf/9fe88aef7deabbe87d3fc38c4aea3c69.paul@paul-moore.com/
> >
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  include/linux/lsm_hook_defs.h |  5 +++--
> >  include/linux/security.h      | 12 +++++++-----
> >  kernel/bpf/bpf_lsm.c          |  5 +++--
> >  kernel/bpf/syscall.c          | 25 +++++++++++++------------
> >  security/security.c           | 25 +++++++++++++++----------
> >  security/selinux/hooks.c      | 15 ++++++++-------
> >  6 files changed, 49 insertions(+), 38 deletions(-)
>
> ...
>
> > diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c
> > index e14c822f8911..3e956f6302f3 100644
> > --- a/kernel/bpf/bpf_lsm.c
> > +++ b/kernel/bpf/bpf_lsm.c
> > @@ -263,6 +263,8 @@ BTF_ID(func, bpf_lsm_bpf_map)
> >  BTF_ID(func, bpf_lsm_bpf_map_alloc_security)
> >  BTF_ID(func, bpf_lsm_bpf_map_free_security)
> >  BTF_ID(func, bpf_lsm_bpf_prog)
> > +BTF_ID(func, bpf_lsm_bpf_prog_load)
> > +BTF_ID(func, bpf_lsm_bpf_prog_free)
> >  BTF_ID(func, bpf_lsm_bprm_check_security)
> >  BTF_ID(func, bpf_lsm_bprm_committed_creds)
> >  BTF_ID(func, bpf_lsm_bprm_committing_creds)
> > @@ -346,8 +348,7 @@ BTF_SET_END(sleepable_lsm_hooks)
> >
> >  BTF_SET_START(untrusted_lsm_hooks)
> >  BTF_ID(func, bpf_lsm_bpf_map_free_security)
> > -BTF_ID(func, bpf_lsm_bpf_prog_alloc_security)
> > -BTF_ID(func, bpf_lsm_bpf_prog_free_security)
> > +BTF_ID(func, bpf_lsm_bpf_prog_free)
> >  BTF_ID(func, bpf_lsm_file_alloc_security)
> >  BTF_ID(func, bpf_lsm_file_free_security)
> >  #ifdef CONFIG_SECURITY_NETWORK
>
> It looks like you're calling the BTF_ID() macro on bpf_lsm_bpf_prog_free
> twice?  I would have expected a only one macro call for each bpf_prog_load
> and bpf_prog_free, is that a bad assuption?
>

Yeah, there is no problem having multiple BTF_ID() invocations for the
same function. BTF_ID() macro (conceptually) emits a relocation that
will instruct resolve_btfids tool to put an actual BTF ID for the
specified function in a designated 4-byte slot.

In this case, we have two separate lists: sleepable_lsm_hooks and
untrusted_lsm_hooks, so we need two separate BTF_ID() entries for the
same function. It's expected to be duplicated.

> > diff --git a/security/security.c b/security/security.c
> > index dcb3e7014f9b..5773d446210e 100644
> > --- a/security/security.c
> > +++ b/security/security.c
> > @@ -5180,16 +5180,21 @@ int security_bpf_map_alloc(struct bpf_map *map)
> >  }
> >
> >  /**
> > - * security_bpf_prog_alloc() - Allocate a bpf program LSM blob
> > - * @aux: bpf program aux info struct
> > + * security_bpf_prog_load() - Check if loading of BPF program is allowed
> > + * @prog: BPF program object
> > + * @attr: BPF syscall attributes used to create BPF program
> > + * @token: BPF token used to grant user access to BPF subsystem
> >   *
> > - * Initialize the security field inside bpf program.
> > + * Do a check when the kernel allocates BPF program object and is about to
> > + * pass it to BPF verifier for additional correctness checks. This is also the
> > + * point where LSM blob is allocated for LSMs that need them.
>
> This is pretty nitpicky, but I'm guessing you may need to make another
> revision to this patchset, if you do please drop the BPF verifier remark
> from the comment above.
>
> Example: "Perform an access control check when the kernel loads a BPF
> program and allocates the associated BPF program object.  This hook is
> also responsibile for allocating any required LSM state for the BPF
> program."

Done, no problem.

>
> >   * Return: Returns 0 on success, error on failure.
> >   */
> > -int security_bpf_prog_alloc(struct bpf_prog_aux *aux)
> > +int security_bpf_prog_load(struct bpf_prog *prog, union bpf_attr *attr,
> > +                        struct bpf_token *token)
> >  {
> > -     return call_int_hook(bpf_prog_alloc_security, 0, aux);
> > +     return call_int_hook(bpf_prog_load, 0, prog, attr, token);
> >  }
>
> --
> paul-moore.com

^ permalink raw reply

* Re: [PATCH v9 11/17] bpf,lsm: add BPF token LSM hooks
From: Andrii Nakryiko @ 2023-11-06 19:17 UTC (permalink / raw)
  To: Paul Moore
  Cc: Andrii Nakryiko, bpf, netdev, brauner, linux-fsdevel,
	linux-security-module, keescook, kernel-team, sargun
In-Reply-To: <9d2b920cb7e59dfd56f763bdd4e53abd.paul@paul-moore.com>

On Sun, Nov 5, 2023 at 9:01 PM Paul Moore <paul@paul-moore.com> wrote:
>
> On Nov  3, 2023 Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > Wire up bpf_token_create and bpf_token_free LSM hooks, which allow to
> > allocate LSM security blob (we add `void *security` field to struct
> > bpf_token for that), but also control who can instantiate BPF token.
> > This follows existing pattern for BPF map and BPF prog.
> >
> > Also add security_bpf_token_allow_cmd() and security_bpf_token_capable()
> > LSM hooks that allow LSM implementation to control and negate (if
> > necessary) BPF token's delegation of a specific bpf_cmd and capability,
> > respectively.
> >
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  include/linux/bpf.h           |  3 ++
> >  include/linux/lsm_hook_defs.h |  5 +++
> >  include/linux/security.h      | 25 +++++++++++++++
> >  kernel/bpf/bpf_lsm.c          |  4 +++
> >  kernel/bpf/token.c            | 13 ++++++--
> >  security/security.c           | 60 +++++++++++++++++++++++++++++++++++
> >  6 files changed, 107 insertions(+), 3 deletions(-)
>
> ...
>
> > diff --git a/include/linux/security.h b/include/linux/security.h
> > index 08fd777cbe94..1d6edbf45d1c 100644
> > --- a/include/linux/security.h
> > +++ b/include/linux/security.h
> > @@ -60,6 +60,7 @@ struct fs_parameter;
> >  enum fs_value_type;
> >  struct watch;
> >  struct watch_notification;
> > +enum bpf_cmd;
>
> Yes, I think it's fine to include bpf.h in security.h instead of the
> forward declaration.
>
> >  /* Default (no) options for the capable function */
> >  #define CAP_OPT_NONE 0x0
> > @@ -2031,6 +2032,11 @@ extern void security_bpf_map_free(struct bpf_map *map);
> >  extern int security_bpf_prog_load(struct bpf_prog *prog, union bpf_attr *attr,
> >                                 struct bpf_token *token);
> >  extern void security_bpf_prog_free(struct bpf_prog *prog);
> > +extern int security_bpf_token_create(struct bpf_token *token, union bpf_attr *attr,
> > +                                  struct path *path);
> > +extern void security_bpf_token_free(struct bpf_token *token);
> > +extern int security_bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd);
> > +extern int security_bpf_token_capable(const struct bpf_token *token, int cap);
> >  #else
> >  static inline int security_bpf(int cmd, union bpf_attr *attr,
> >                                            unsigned int size)
> > @@ -2065,6 +2071,25 @@ static inline int security_bpf_prog_load(struct bpf_prog *prog, union bpf_attr *
> >
> >  static inline void security_bpf_prog_free(struct bpf_prog *prog)
> >  { }
> > +
> > +static inline int security_bpf_token_create(struct bpf_token *token, union bpf_attr *attr,
> > +                                  struct path *path)
> > +{
> > +     return 0;
> > +}
> > +
> > +static inline void security_bpf_token_free(struct bpf_token *token)
> > +{ }
> > +
> > +static inline int security_bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd)
> > +{
> > +     return 0;
> > +}
> > +
> > +static inline int security_bpf_token_capable(const struct bpf_token *token, int cap)
> > +{
> > +     return 0;
> > +}
>
> Another nitpick, but I would prefer to shorten
> security_bpf_token_allow_cmd() renamed to security_bpf_token_cmd() both
> to shorten the name and to better fit convention.  I realize the caller
> is named bpf_token_allow_cmd() but I'd still rather see the LSM hook
> with the shorter name.

Makes sense, renamed to security_bpf_token_cmd() and updated hook name as well

>
> > diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
> > index 35e6f55c2a41..5d04da54faea 100644
> > --- a/kernel/bpf/token.c
> > +++ b/kernel/bpf/token.c
> > @@ -7,11 +7,12 @@
> >  #include <linux/idr.h>
> >  #include <linux/namei.h>
> >  #include <linux/user_namespace.h>
> > +#include <linux/security.h>
> >
> >  bool bpf_token_capable(const struct bpf_token *token, int cap)
> >  {
> >       /* BPF token allows ns_capable() level of capabilities */
> > -     if (token) {
> > +     if (token && security_bpf_token_capable(token, cap) == 0) {
> >               if (ns_capable(token->userns, cap))
> >                       return true;
> >               if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
>
> We typically perform the capability based access controls prior to the
> LSM controls, meaning if we want to the token controls to work in a
> similar way we should do something like this:
>
>   bool bpf_token_capable(...)
>   {
>     if (token) {
>       if (ns_capable(token, cap) ||
>           (cap != ADMIN && ns_capable(token, ADMIN)))
>         return security_bpf_token_capable(token, cap);
>     }
>     return capable(cap) || (cap != ADMIN && capable(...))
>   }

yep, makes sense, I changed it as you suggested above

>
> > @@ -28,6 +29,7 @@ void bpf_token_inc(struct bpf_token *token)
> >
> >  static void bpf_token_free(struct bpf_token *token)
> >  {
> > +     security_bpf_token_free(token);
> >       put_user_ns(token->userns);
> >       kvfree(token);
> >  }
> > @@ -172,6 +174,10 @@ int bpf_token_create(union bpf_attr *attr)
> >       token->allowed_progs = mnt_opts->delegate_progs;
> >       token->allowed_attachs = mnt_opts->delegate_attachs;
> >
> > +     err = security_bpf_token_create(token, attr, &path);
> > +     if (err)
> > +             goto out_token;
> > +
> >       fd = get_unused_fd_flags(O_CLOEXEC);
> >       if (fd < 0) {
> >               err = fd;
> > @@ -216,8 +222,9 @@ bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd)
> >  {
> >       if (!token)
> >               return false;
> > -
> > -     return token->allowed_cmds & (1ULL << cmd);
> > +     if (!(token->allowed_cmds & (1ULL << cmd)))
> > +             return false;
> > +     return security_bpf_token_allow_cmd(token, cmd) == 0;
>
> I'm not sure how much it really matters, but someone might prefer
> the '!!' approach/style over '== 0'.

it would have to be !security_bpf_token_cmd(), right? And that single
negation is just very confusing when dealing with int-returning
function. I find it much easier to make sure the logic is correct when
we have explicit `== 0`.

Like, when I see `return !security_bpf_token_cmd(...);`, my immediate
read of that is "return whether bpf_token_cmd is not allowed" or
something along those lines, giving me a huge pause... I have the same
relationship with strcmp(), btw, while people seem totally fine with
`!strcmp()` (which to me also reads backwards).

Anyways, unless you really feel strongly, I'd keep == 0 here and above
for security_bpf_token_capable(), just because it's int-returning
function result conversion to bool-returning result.

>
> >  }
> >
> >  bool bpf_token_allow_map_type(const struct bpf_token *token, enum bpf_map_type type)
>
> --
> paul-moore.com

^ permalink raw reply

* Re: [PATCH v9 17/17] bpf,selinux: allocate bpf_security_struct per BPF token
From: Andrii Nakryiko @ 2023-11-06 19:18 UTC (permalink / raw)
  To: Paul Moore
  Cc: Andrii Nakryiko, bpf, netdev, brauner, linux-fsdevel,
	linux-security-module, keescook, kernel-team, sargun
In-Reply-To: <257c16919acd2ec98dac2e6c4f21c906.paul@paul-moore.com>

On Sun, Nov 5, 2023 at 9:01 PM Paul Moore <paul@paul-moore.com> wrote:
>
> On Nov  3, 2023 Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > Utilize newly added bpf_token_create/bpf_token_free LSM hooks to
> > allocate struct bpf_security_struct for each BPF token object in
> > SELinux. This just follows similar pattern for BPF prog and map.
> >
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  security/selinux/hooks.c | 25 +++++++++++++++++++++++++
> >  1 file changed, 25 insertions(+)
>
> Thanks Andrii, we'll need some additional code to fully enable the
> BPF tokens on a SELinux system but I can help provide that if you'd
> like.  Although I might not be able to get to that until after the
> merge window closes.

Yep, I'd appreciate your help with the SELinux side. Until after the
merge window is fine, yes.

Thanks for reviewing the patch set!

>
> > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> > index 002351ab67b7..1501e95366a1 100644
> > --- a/security/selinux/hooks.c
> > +++ b/security/selinux/hooks.c
> > @@ -6828,6 +6828,29 @@ static void selinux_bpf_prog_free(struct bpf_prog *prog)
> >       prog->aux->security = NULL;
> >       kfree(bpfsec);
> >  }
> > +
> > +static int selinux_bpf_token_create(struct bpf_token *token, union bpf_attr *attr,
> > +                                 struct path *path)
> > +{
> > +     struct bpf_security_struct *bpfsec;
> > +
> > +     bpfsec = kzalloc(sizeof(*bpfsec), GFP_KERNEL);
> > +     if (!bpfsec)
> > +             return -ENOMEM;
> > +
> > +     bpfsec->sid = current_sid();
> > +     token->security = bpfsec;
> > +
> > +     return 0;
> > +}
> > +
> > +static void selinux_bpf_token_free(struct bpf_token *token)
> > +{
> > +     struct bpf_security_struct *bpfsec = token->security;
> > +
> > +     token->security = NULL;
> > +     kfree(bpfsec);
> > +}
> >  #endif
> >
> >  struct lsm_blob_sizes selinux_blob_sizes __ro_after_init = {
> > @@ -7183,6 +7206,7 @@ static struct security_hook_list selinux_hooks[] __ro_after_init = {
> >       LSM_HOOK_INIT(bpf_prog, selinux_bpf_prog),
> >       LSM_HOOK_INIT(bpf_map_free, selinux_bpf_map_free),
> >       LSM_HOOK_INIT(bpf_prog_free, selinux_bpf_prog_free),
> > +     LSM_HOOK_INIT(bpf_token_free, selinux_bpf_token_free),
> >  #endif
> >
> >  #ifdef CONFIG_PERF_EVENTS
> > @@ -7241,6 +7265,7 @@ static struct security_hook_list selinux_hooks[] __ro_after_init = {
> >  #ifdef CONFIG_BPF_SYSCALL
> >       LSM_HOOK_INIT(bpf_map_create, selinux_bpf_map_create),
> >       LSM_HOOK_INIT(bpf_prog_load, selinux_bpf_prog_load),
> > +     LSM_HOOK_INIT(bpf_token_create, selinux_bpf_token_create),
> >  #endif
> >  #ifdef CONFIG_PERF_EVENTS
> >       LSM_HOOK_INIT(perf_event_alloc, selinux_perf_event_alloc),
> > --
> > 2.34.1
>
> --
> paul-moore.com

^ permalink raw reply

* Re: [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to reduce spinlock contention
From: Mirsad Todorovac @ 2023-11-06 19:26 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Heiner Kallweit, linux-kernel, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, netdev, nic_swsd
In-Reply-To: <7f0e29b4-a34d-4e38-bba3-b179d0484942@lunn.ch>



On 11/5/23 21:36, Andrew Lunn wrote:
>> The command used for generating the assembly was taken from .o.cmd file and
>> added -save-temps as the only change:
> 
> make drivers/net/ethernet/realtek/r8169_main.lst
> 
> is simpler. You get the C and the generated assembler listed together.
> 
>     Andrew

Here, Sir:

----------------------------------------------------------------------------------------------
0000000000001950 <r8168_mac_ocp_write>:
{
     1950:       e8 00 00 00 00          call   1955 <r8168_mac_ocp_write+0x5>
                         1951: R_X86_64_PLT32    __fentry__-0x4
     1955:       55                      push   %rbp
     1956:       48 89 e5                mov    %rsp,%rbp
     1959:       41 57                   push   %r15
         raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
     195b:       4c 8d bf 24 1a 00 00    lea    0x1a24(%rdi),%r15
{
     1962:       41 56                   push   %r14
     1964:       41 89 d6                mov    %edx,%r14d
     1967:       41 55                   push   %r13
     1969:       41 54                   push   %r12
     196b:       49 89 fc                mov    %rdi,%r12
         raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
     196e:       4c 89 ff                mov    %r15,%rdi
{
     1971:       53                      push   %rbx
     1972:       89 f3                   mov    %esi,%ebx
         raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
     1974:       e8 00 00 00 00          call   1979 <r8168_mac_ocp_write+0x29>
                         1975: R_X86_64_PLT32    _raw_spin_lock_irqsave-0x4
         if (rtl_ocp_reg_failure(reg))
     1979:       89 df                   mov    %ebx,%edi
         raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
     197b:       49 89 c5                mov    %rax,%r13
         if (rtl_ocp_reg_failure(reg))
     197e:       e8 1d f2 ff ff          call   ba0 <rtl_ocp_reg_failure>
     1983:       84 c0                   test   %al,%al
     1985:       75 16                   jne    199d <r8168_mac_ocp_write+0x4d>
         RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
     1987:       c1 e3 0f                shl    $0xf,%ebx
     198a:       49 8b 04 24             mov    (%r12),%rax
     198e:       44 09 f3                or     %r14d,%ebx
     1991:       81 cb 00 00 00 80       or     $0x80000000,%ebx
build_mmio_write(writel, "l", unsigned int, "r", :"memory")
     1997:       89 98 b0 00 00 00       mov    %ebx,0xb0(%rax)
         raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
     199d:       4c 89 ee                mov    %r13,%rsi
     19a0:       4c 89 ff                mov    %r15,%rdi
     19a3:       e8 00 00 00 00          call   19a8 <r8168_mac_ocp_write+0x58>
                         19a4: R_X86_64_PLT32    _raw_spin_unlock_irqrestore-0x4
----------------------------------------------------------------------------------------------

Actually, this time rtl_hw_start_8411_2() has r8169_mac_ocp_write mostly() inlined
without an explicit "inline".

I cannot explain why the r8169_main.s looked different with the same .config.

----------------------------------------------------------------------------------------------
0000000000008370 <rtl_hw_start_8411_2>:
{
     8370:       e8 00 00 00 00          call   8375 <rtl_hw_start_8411_2+0x5>
                         8371: R_X86_64_PLT32    __fentry__-0x4
     8375:       55                      push   %rbp
     8376:       48 89 e5                mov    %rsp,%rbp
     8379:       41 56                   push   %r14
     837b:       41 55                   push   %r13
     837d:       49 89 fd                mov    %rdi,%r13
     8380:       41 54                   push   %r12
         raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
     8382:       4d 8d a5 24 1a 00 00    lea    0x1a24(%r13),%r12
         rtl_hw_start_8168g(tp);
     8389:       e8 02 c7 ff ff          call   4a90 <rtl_hw_start_8168g>
         rtl_ephy_init(tp, e_info_8411_2);
     838e:       ba 0a 00 00 00          mov    $0xa,%edx
     8393:       48 c7 c6 00 00 00 00    mov    $0x0,%rsi
                         8396: R_X86_64_32S      .rodata+0x440
     839a:       4c 89 ef                mov    %r13,%rdi
     839d:       e8 be 9d ff ff          call   2160 <__rtl_ephy_init>
         raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
     83a2:       4c 89 e7                mov    %r12,%rdi
     83a5:       e8 00 00 00 00          call   83aa <rtl_hw_start_8411_2+0x3a>
                         83a6: R_X86_64_PLT32    _raw_spin_lock_irqsave-0x4
     83aa:       ba 00 00 14 fe          mov    $0xfe140000,%edx
     83af:       48 89 c6                mov    %rax,%rsi
         RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
     83b2:       49 8b 45 00             mov    0x0(%r13),%rax
     83b6:       89 90 b0 00 00 00       mov    %edx,0xb0(%rax)
         raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
     83bc:       4c 89 e7                mov    %r12,%rdi
     83bf:       e8 00 00 00 00          call   83c4 <rtl_hw_start_8411_2+0x54>
                         83c0: R_X86_64_PLT32    _raw_spin_unlock_irqrestore-0x4
         raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
     83c4:       4c 89 e7                mov    %r12,%rdi
     83c7:       e8 00 00 00 00          call   83cc <rtl_hw_start_8411_2+0x5c>
                         83c8: R_X86_64_PLT32    _raw_spin_lock_irqsave-0x4
     83cc:       ba 00 00 15 fe          mov    $0xfe150000,%edx
     83d1:       48 89 c6                mov    %rax,%rsi
         RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
     83d4:       49 8b 45 00             mov    0x0(%r13),%rax
     83d8:       89 90 b0 00 00 00       mov    %edx,0xb0(%rax)
         raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
     83de:       4c 89 e7                mov    %r12,%rdi
     83e1:       e8 00 00 00 00          call   83e6 <rtl_hw_start_8411_2+0x76>
                         83e2: R_X86_64_PLT32    _raw_spin_unlock_irqrestore-0x4
         r8168_mac_ocp_write(tp, 0xFC2C, 0x0000);
     83e6:       31 d2                   xor    %edx,%edx
     83e8:       be 2c fc 00 00          mov    $0xfc2c,%esi
     83ed:       4c 89 ef                mov    %r13,%rdi
     83f0:       e8 5b 95 ff ff          call   1950 <r8168_mac_ocp_write>
         raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
     83f5:       4c 89 e7                mov    %r12,%rdi
     83f8:       e8 00 00 00 00          call   83fd <rtl_hw_start_8411_2+0x8d>
                         83f9: R_X86_64_PLT32    _raw_spin_lock_irqsave-0x4
     83fd:       ba 00 00 17 fe          mov    $0xfe170000,%edx
     8402:       48 89 c6                mov    %rax,%rsi
         RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
     8405:       49 8b 45 00             mov    0x0(%r13),%rax
     8409:       89 90 b0 00 00 00       mov    %edx,0xb0(%rax)
         raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
     840f:       4c 89 e7                mov    %r12,%rdi
     8412:       e8 00 00 00 00          call   8417 <rtl_hw_start_8411_2+0xa7>
                         8413: R_X86_64_PLT32    _raw_spin_unlock_irqrestore-0x4
         raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
     8417:       4c 89 e7                mov    %r12,%rdi
     841a:       e8 00 00 00 00          call   841f <rtl_hw_start_8411_2+0xaf>
                         841b: R_X86_64_PLT32    _raw_spin_lock_irqsave-0x4
     841f:       ba 00 00 18 fe          mov    $0xfe180000,%edx
     8424:       48 89 c6                mov    %rax,%rsi
         RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
     8427:       49 8b 45 00             mov    0x0(%r13),%rax
     842b:       89 90 b0 00 00 00       mov    %edx,0xb0(%rax)
         raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
     8431:       4c 89 e7                mov    %r12,%rdi
     8434:       e8 00 00 00 00          call   8439 <rtl_hw_start_8411_2+0xc9>
                         8435: R_X86_64_PLT32    _raw_spin_unlock_irqrestore-0x4
         raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
     8439:       4c 89 e7                mov    %r12,%rdi
     843c:       e8 00 00 00 00          call   8441 <rtl_hw_start_8411_2+0xd1>
                         843d: R_X86_64_PLT32    _raw_spin_lock_irqsave-0x4
     8441:       ba 00 00 19 fe          mov    $0xfe190000,%edx
     8446:       48 89 c6                mov    %rax,%rsi
         RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
     8449:       49 8b 45 00             mov    0x0(%r13),%rax
     844d:       89 90 b0 00 00 00       mov    %edx,0xb0(%rax)
         raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
     8453:       4c 89 e7                mov    %r12,%rdi
     8456:       e8 00 00 00 00          call   845b <rtl_hw_start_8411_2+0xeb>
                         8457: R_X86_64_PLT32    _raw_spin_unlock_irqrestore-0x4
         raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
     845b:       4c 89 e7                mov    %r12,%rdi
     845e:       e8 00 00 00 00          call   8463 <rtl_hw_start_8411_2+0xf3>
                         845f: R_X86_64_PLT32    _raw_spin_lock_irqsave-0x4
     8463:       ba 00 00 1a fe          mov    $0xfe1a0000,%edx
     8468:       48 89 c6                mov    %rax,%rsi
         RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
     846b:       49 8b 45 00             mov    0x0(%r13),%rax
     846f:       89 90 b0 00 00 00       mov    %edx,0xb0(%rax)
         raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
     8475:       4c 89 e7                mov    %r12,%rdi
     8478:       e8 00 00 00 00          call   847d <rtl_hw_start_8411_2+0x10d>
                         8479: R_X86_64_PLT32    _raw_spin_unlock_irqrestore-0x4
         r8168_mac_ocp_write(tp, 0xFC36, 0x0000);
     847d:       31 d2                   xor    %edx,%edx
     847f:       be 36 fc 00 00          mov    $0xfc36,%esi
     8484:       4c 89 ef                mov    %r13,%rdi
     8487:       e8 c4 94 ff ff          call   1950 <r8168_mac_ocp_write>
         mdelay(3);
----------------------------------------------------------------------------------------------


With the patch applied, and an explicit "inline" on:

static inline bool rtl_ocp_reg_failure(u32 reg);
static inline void __r8168_mac_ocp_write_seqlen(struct rtl8169_private *tp, ...);
static inline void r8168_mac_ocp_write_seqlen(struct rtl8169_private *tp, ...);

the code looks like:

----------------------------------------------------------------------------------------------
0000000000004d60 <rtl_hw_start_8411_2>:
{
     4d60:       e8 00 00 00 00          call   4d65 <rtl_hw_start_8411_2+0x5>
                         4d61: R_X86_64_PLT32    __fentry__-0x4
     4d65:       55                      push   %rbp
     4d66:       48 89 e5                mov    %rsp,%rbp
     4d69:       41 57                   push   %r15
     4d6b:       41 56                   push   %r14
         for (p = array; len--; p++)
     4d6d:       49 c7 c6 00 00 00 00    mov    $0x0,%r14
                         4d70: R_X86_64_32S      .rodata+0x820
{
     4d74:       41 55                   push   %r13
     4d76:       41 54                   push   %r12
         raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
     4d78:       41 bc 28 fc 00 00       mov    $0xfc28,%r12d
{
     4d7e:       53                      push   %rbx
     4d7f:       48 89 fb                mov    %rdi,%rbx
         raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
     4d82:       4c 8d ab 24 1a 00 00    lea    0x1a24(%rbx),%r13
{
     4d89:       48 83 ec 08             sub    $0x8,%rsp
         rtl_hw_start_8168g(tp);
     4d8d:       e8 3e fe ff ff          call   4bd0 <rtl_hw_start_8168g>
         rtl_ephy_init(tp, e_info_8411_2);
     4d92:       ba 0a 00 00 00          mov    $0xa,%edx
     4d97:       48 c7 c6 00 00 00 00    mov    $0x0,%rsi
                         4d9a: R_X86_64_32S      .rodata+0x860
     4d9e:       48 89 df                mov    %rbx,%rdi
     4da1:       e8 fa d4 ff ff          call   22a0 <__rtl_ephy_init>
         raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
     4da6:       4c 89 ef                mov    %r13,%rdi
     4da9:       e8 00 00 00 00          call   4dae <rtl_hw_start_8411_2+0x4e>
                         4daa: R_X86_64_PLT32    _raw_spin_lock_irqsave-0x4
     4dae:       31 c9                   xor    %ecx,%ecx
     4db0:       49 89 c7                mov    %rax,%r15
         for (p = array; len--; p++)
     4db3:       eb 07                   jmp    4dbc <rtl_hw_start_8411_2+0x5c>
                 __r8168_mac_ocp_write(tp, p->reg, p->data);
     4db5:       41 8b 4e 04             mov    0x4(%r14),%ecx
     4db9:       45 8b 26                mov    (%r14),%r12d
         return WARN_ONCE(reg & 0xffff0001, "Invalid ocp reg %x!\n", reg);
     4dbc:       41 f7 c4 01 00 ff ff    test   $0xffff0001,%r12d
     4dc3:       0f 85 9b 01 00 00       jne    4f64 <rtl_hw_start_8411_2+0x204>
         RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
     4dc9:       41 c1 e4 0f             shl    $0xf,%r12d
     4dcd:       48 8b 03                mov    (%rbx),%rax
     4dd0:       41 09 cc                or     %ecx,%r12d
     4dd3:       41 81 cc 00 00 00 80    or     $0x80000000,%r12d
build_mmio_write(writel, "l", unsigned int, "r", :"memory")
     4dda:       44 89 a0 b0 00 00 00    mov    %r12d,0xb0(%rax)
         for (p = array; len--; p++)
     4de1:       49 83 c6 08             add    $0x8,%r14
     4de5:       49 81 fe 00 00 00 00    cmp    $0x0,%r14
                         4de8: R_X86_64_32S      .rodata+0x860
     4dec:       75 c7                   jne    4db5 <rtl_hw_start_8411_2+0x55>
         raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
     4dee:       4c 89 fe                mov    %r15,%rsi
     4df1:       4c 89 ef                mov    %r13,%rdi
         raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
     4df4:       41 bc 00 f8 00 00       mov    $0xf800,%r12d
         for (p = array; len--; p++)
     4dfa:       49 c7 c6 00 00 00 00    mov    $0x0,%r14
                         4dfd: R_X86_64_32S      .rodata+0x4a0
         raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
     4e01:       e8 00 00 00 00          call   4e06 <rtl_hw_start_8411_2+0xa6>
                         4e02: R_X86_64_PLT32    _raw_spin_unlock_irqrestore-0x4
         mdelay(3);
     4e06:       bf 08 9c c4 00          mov    $0xc49c08,%edi
     4e0b:       e8 00 00 00 00          call   4e10 <rtl_hw_start_8411_2+0xb0>
                         4e0c: R_X86_64_PLT32    __const_udelay-0x4
         r8168_mac_ocp_write(tp, 0xFC26, 0x0000);
     4e10:       31 d2                   xor    %edx,%edx
     4e12:       be 26 fc 00 00          mov    $0xfc26,%esi
     4e17:       48 89 df                mov    %rbx,%rdi
     4e1a:       e8 71 d0 ff ff          call   1e90 <r8168_mac_ocp_write>
         raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
     4e1f:       4c 89 ef                mov    %r13,%rdi
     4e22:       e8 00 00 00 00          call   4e27 <rtl_hw_start_8411_2+0xc7>
                         4e23: R_X86_64_PLT32    _raw_spin_lock_irqsave-0x4
     4e27:       b9 08 e0 00 00          mov    $0xe008,%ecx
     4e2c:       49 89 c7                mov    %rax,%r15
         for (p = array; len--; p++)
     4e2f:       eb 07                   jmp    4e38 <rtl_hw_start_8411_2+0xd8>
                 __r8168_mac_ocp_write(tp, p->reg, p->data);
     4e31:       41 8b 4e 04             mov    0x4(%r14),%ecx
     4e35:       45 8b 26                mov    (%r14),%r12d
         return WARN_ONCE(reg & 0xffff0001, "Invalid ocp reg %x!\n", reg);
     4e38:       41 f7 c4 01 00 ff ff    test   $0xffff0001,%r12d
     4e3f:       0f 85 eb 00 00 00       jne    4f30 <rtl_hw_start_8411_2+0x1d0>
         RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
     4e45:       41 c1 e4 0f             shl    $0xf,%r12d
     4e49:       48 8b 03                mov    (%rbx),%rax
     4e4c:       41 09 cc                or     %ecx,%r12d
     4e4f:       41 81 cc 00 00 00 80    or     $0x80000000,%r12d
     4e56:       44 89 a0 b0 00 00 00    mov    %r12d,0xb0(%rax)
         for (p = array; len--; p++)
     4e5d:       49 83 c6 08             add    $0x8,%r14
     4e61:       49 81 fe 00 00 00 00    cmp    $0x0,%r14
                         4e64: R_X86_64_32S      .rodata+0x818
     4e68:       75 c7                   jne    4e31 <rtl_hw_start_8411_2+0xd1>
         raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
     4e6a:       4c 89 fe                mov    %r15,%rsi
     4e6d:       4c 89 ef                mov    %r13,%rdi
         raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
     4e70:       41 bc 2a fc 00 00       mov    $0xfc2a,%r12d
         for (p = array; len--; p++)
     4e76:       49 c7 c6 00 00 00 00    mov    $0x0,%r14
                         4e79: R_X86_64_32S      .rodata+0x460
         raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
     4e7d:       e8 00 00 00 00          call   4e82 <rtl_hw_start_8411_2+0x122>
                         4e7e: R_X86_64_PLT32    _raw_spin_unlock_irqrestore-0x4
         r8168_mac_ocp_write(tp, 0xFC26, 0x8000);
     4e82:       ba 00 80 00 00          mov    $0x8000,%edx
     4e87:       be 26 fc 00 00          mov    $0xfc26,%esi
     4e8c:       48 89 df                mov    %rbx,%rdi
     4e8f:       e8 fc cf ff ff          call   1e90 <r8168_mac_ocp_write>
         raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
     4e94:       4c 89 ef                mov    %r13,%rdi
     4e97:       e8 00 00 00 00          call   4e9c <rtl_hw_start_8411_2+0x13c>
                         4e98: R_X86_64_PLT32    _raw_spin_lock_irqsave-0x4
     4e9c:       b9 43 07 00 00          mov    $0x743,%ecx
     4ea1:       49 89 c7                mov    %rax,%r15
         for (p = array; len--; p++)
     4ea4:       eb 07                   jmp    4ead <rtl_hw_start_8411_2+0x14d>
                 __r8168_mac_ocp_write(tp, p->reg, p->data);
     4ea6:       41 8b 4e 04             mov    0x4(%r14),%ecx
     4eaa:       45 8b 26                mov    (%r14),%r12d
         return WARN_ONCE(reg & 0xffff0001, "Invalid ocp reg %x!\n", reg);
     4ead:       41 f7 c4 01 00 ff ff    test   $0xffff0001,%r12d
     4eb4:       75 4d                   jne    4f03 <rtl_hw_start_8411_2+0x1a3>
         RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
     4eb6:       41 c1 e4 0f             shl    $0xf,%r12d
     4eba:       48 8b 03                mov    (%rbx),%rax
     4ebd:       41 09 cc                or     %ecx,%r12d
     4ec0:       41 81 cc 00 00 00 80    or     $0x80000000,%r12d
     4ec7:       44 89 a0 b0 00 00 00    mov    %r12d,0xb0(%rax)
         for (p = array; len--; p++)
     4ece:       49 83 c6 08             add    $0x8,%r14
     4ed2:       49 81 fe 00 00 00 00    cmp    $0x0,%r14
                         4ed5: R_X86_64_32S      .rodata+0x498
     4ed9:       75 cb                   jne    4ea6 <rtl_hw_start_8411_2+0x146>
         raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
     4edb:       4c 89 fe                mov    %r15,%rsi
     4ede:       4c 89 ef                mov    %r13,%rdi
     4ee1:       e8 00 00 00 00          call   4ee6 <rtl_hw_start_8411_2+0x186>
                         4ee2: R_X86_64_PLT32    _raw_spin_unlock_irqrestore-0x4
}
     4ee6:       48 83 c4 08             add    $0x8,%rsp
     4eea:       5b                      pop    %rbx
     4eeb:       41 5c                   pop    %r12
     4eed:       41 5d                   pop    %r13
     4eef:       41 5e                   pop    %r14
     4ef1:       41 5f                   pop    %r15
     4ef3:       5d                      pop    %rbp
     4ef4:       31 c0                   xor    %eax,%eax
     4ef6:       31 d2                   xor    %edx,%edx
     4ef8:       31 c9                   xor    %ecx,%ecx
     4efa:       31 f6                   xor    %esi,%esi
     4efc:       31 ff                   xor    %edi,%edi
     4efe:       e9 00 00 00 00          jmp    4f03 <rtl_hw_start_8411_2+0x1a3>
                         4eff: R_X86_64_PLT32    __x86_return_thunk-0x4
         return WARN_ONCE(reg & 0xffff0001, "Invalid ocp reg %x!\n", reg);
     4f03:       0f b6 05 00 00 00 00    movzbl 0x0(%rip),%eax        # 4f0a <rtl_hw_start_8411_2+0x1aa>
                         4f06: R_X86_64_PC32     .data.once-0x2
     4f0a:       3c 01                   cmp    $0x1,%al
     4f0c:       0f 87 00 00 00 00       ja     4f12 <rtl_hw_start_8411_2+0x1b2>
                         4f0e: R_X86_64_PC32     .text.unlikely+0xd0
     4f12:       a8 01                   test   $0x1,%al
     4f14:       75 b8                   jne    4ece <rtl_hw_start_8411_2+0x16e>
     4f16:       44 89 e6                mov    %r12d,%esi
     4f19:       48 c7 c7 00 00 00 00    mov    $0x0,%rdi
                         4f1c: R_X86_64_32S      .rodata.str1.1+0x53
     4f20:       c6 05 00 00 00 00 01    movb   $0x1,0x0(%rip)        # 4f27 <rtl_hw_start_8411_2+0x1c7>
                         4f22: R_X86_64_PC32     .data.once-0x3
     4f27:       e8 00 00 00 00          call   4f2c <rtl_hw_start_8411_2+0x1cc>
                         4f28: R_X86_64_PLT32    __warn_printk-0x4
     4f2c:       0f 0b                   ud2
     4f2e:       eb 9e                   jmp    4ece <rtl_hw_start_8411_2+0x16e>
     4f30:       0f b6 05 00 00 00 00    movzbl 0x0(%rip),%eax        # 4f37 <rtl_hw_start_8411_2+0x1d7>
                         4f33: R_X86_64_PC32     .data.once-0x2
     4f37:       3c 01                   cmp    $0x1,%al
     4f39:       0f 87 00 00 00 00       ja     4f3f <rtl_hw_start_8411_2+0x1df>
                         4f3b: R_X86_64_PC32     .text.unlikely+0x106
     4f3f:       a8 01                   test   $0x1,%al
     4f41:       0f 85 16 ff ff ff       jne    4e5d <rtl_hw_start_8411_2+0xfd>
     4f47:       44 89 e6                mov    %r12d,%esi
     4f4a:       48 c7 c7 00 00 00 00    mov    $0x0,%rdi
                         4f4d: R_X86_64_32S      .rodata.str1.1+0x53
     4f51:       c6 05 00 00 00 00 01    movb   $0x1,0x0(%rip)        # 4f58 <rtl_hw_start_8411_2+0x1f8>
                         4f53: R_X86_64_PC32     .data.once-0x3
     4f58:       e8 00 00 00 00          call   4f5d <rtl_hw_start_8411_2+0x1fd>
                         4f59: R_X86_64_PLT32    __warn_printk-0x4
     4f5d:       0f 0b                   ud2
     4f5f:       e9 f9 fe ff ff          jmp    4e5d <rtl_hw_start_8411_2+0xfd>
     4f64:       0f b6 05 00 00 00 00    movzbl 0x0(%rip),%eax        # 4f6b <rtl_hw_start_8411_2+0x20b>
                         4f67: R_X86_64_PC32     .data.once-0x2
     4f6b:       3c 01                   cmp    $0x1,%al
     4f6d:       0f 87 00 00 00 00       ja     4f73 <rtl_hw_start_8411_2+0x213>
                         4f6f: R_X86_64_PC32     .text.unlikely+0xeb
     4f73:       a8 01                   test   $0x1,%al
     4f75:       0f 85 66 fe ff ff       jne    4de1 <rtl_hw_start_8411_2+0x81>
     4f7b:       44 89 e6                mov    %r12d,%esi
     4f7e:       48 c7 c7 00 00 00 00    mov    $0x0,%rdi
                         4f81: R_X86_64_32S      .rodata.str1.1+0x53
     4f85:       c6 05 00 00 00 00 01    movb   $0x1,0x0(%rip)        # 4f8c <rtl_hw_start_8411_2+0x22c>
                         4f87: R_X86_64_PC32     .data.once-0x3
     4f8c:       e8 00 00 00 00          call   4f91 <rtl_hw_start_8411_2+0x231>
                         4f8d: R_X86_64_PLT32    __warn_printk-0x4
     4f91:       0f 0b                   ud2
     4f93:       e9 49 fe ff ff          jmp    4de1 <rtl_hw_start_8411_2+0x81>
     4f98:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
     4f9f:       00
     4fa0:       90                      nop
     4fa1:       90                      nop
     4fa2:       90                      nop
     4fa3:       90                      nop
     4fa4:       90                      nop
     4fa5:       90                      nop
     4fa6:       90                      nop
     4fa7:       90                      nop
     4fa8:       90                      nop
     4fa9:       90                      nop
     4faa:       90                      nop
     4fab:       90                      nop
     4fac:       90                      nop
     4fad:       90                      nop
     4fae:       90                      nop
     4faf:       90                      nop
----------------------------------------------------------------------------------------------

Best regards,
Mirsad Todorovac


^ permalink raw reply

* Re: [RFC PATCH v3 10/12] tcp: RX path for devmem TCP
From: Mina Almasry @ 2023-11-06 19:29 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: netdev, linux-kernel, linux-arch, linux-kselftest, linux-media,
	dri-devel, linaro-mm-sig, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Arnd Bergmann, David Ahern, Willem de Bruijn,
	Shuah Khan, Sumit Semwal, Christian König, Shakeel Butt,
	Jeroen de Borst, Praveen Kaligineedi, Willem de Bruijn,
	Kaiyuan Zhang
In-Reply-To: <ZUk0FGuJ28s1d9OX@google.com>

On Mon, Nov 6, 2023 at 10:44 AM Stanislav Fomichev <sdf@google.com> wrote:
>
> On 11/05, Mina Almasry wrote:
> > In tcp_recvmsg_locked(), detect if the skb being received by the user
> > is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM
> > flag - pass it to tcp_recvmsg_devmem() for custom handling.
> >
> > tcp_recvmsg_devmem() copies any data in the skb header to the linear
> > buffer, and returns a cmsg to the user indicating the number of bytes
> > returned in the linear buffer.
> >
> > tcp_recvmsg_devmem() then loops over the unaccessible devmem skb frags,
> > and returns to the user a cmsg_devmem indicating the location of the
> > data in the dmabuf device memory. cmsg_devmem contains this information:
> >
> > 1. the offset into the dmabuf where the payload starts. 'frag_offset'.
> > 2. the size of the frag. 'frag_size'.
> > 3. an opaque token 'frag_token' to return to the kernel when the buffer
> > is to be released.
> >
> > The pages awaiting freeing are stored in the newly added
> > sk->sk_user_pages, and each page passed to userspace is get_page()'d.
> > This reference is dropped once the userspace indicates that it is
> > done reading this page.  All pages are released when the socket is
> > destroyed.
> >
> > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> >
> > ---
> >
> > RFC v3:
> > - Fixed issue with put_cmsg() failing silently.
> >
> > ---
> >  include/linux/socket.h            |   1 +
> >  include/net/page_pool/helpers.h   |   9 ++
> >  include/net/sock.h                |   2 +
> >  include/uapi/asm-generic/socket.h |   5 +
> >  include/uapi/linux/uio.h          |   6 +
> >  net/ipv4/tcp.c                    | 189 +++++++++++++++++++++++++++++-
> >  net/ipv4/tcp_ipv4.c               |   7 ++
> >  7 files changed, 214 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/socket.h b/include/linux/socket.h
> > index cfcb7e2c3813..fe2b9e2081bb 100644
> > --- a/include/linux/socket.h
> > +++ b/include/linux/socket.h
> > @@ -326,6 +326,7 @@ struct ucred {
> >                                         * plain text and require encryption
> >                                         */
> >
> > +#define MSG_SOCK_DEVMEM 0x2000000    /* Receive devmem skbs as cmsg */
>
> Sharing the feedback that I've been providing internally on the public list:
>

There may have been a miscommunication. I don't recall hearing this
specific feedback from you, at least in the last few months. Sorry if
it seemed like I'm ignoring feedback :)

> IMHO, we need a better UAPI to receive the tokens and give them back to
> the kernel. CMSG + setsockopt(SO_DEVMEM_DONTNEED) get the job done,
> but look dated and hacky :-(
>
> We should either do some kind of user/kernel shared memory queue to
> receive/return the tokens (similar to what Jonathan was doing in his
> proposal?)

I'll take a look at Jonathan's proposal, sorry, I'm not immediately
familiar but I wanted to respond :-) But is the suggestion here to
build a new kernel-user communication channel primitive for the
purpose of passing the information in the devmem cmsg? IMHO that seems
like an overkill. Why add 100-200 lines of code to the kernel to add
something that can already be done with existing primitives? I don't
see anything concretely wrong with cmsg & setsockopt approach, and if
we switch to something I'd prefer to switch to an existing primitive
for simplicity?

The only other existing primitive to pass data outside of the linear
buffer is the MSG_ERRQUEUE that is used for zerocopy. Is that
preferred? Any other suggestions or existing primitives I'm not aware
of?

> or bite the bullet and switch to io_uring.
>

IMO io_uring & socket support are orthogonal, and one doesn't preclude
the other. As you know we like to use sockets and I believe there are
issues with io_uring adoption at Google that I'm not familiar with
(and could be wrong). I'm interested in exploring io_uring support as
a follow up but I think David Wei will be interested in io_uring
support as well anyway.

> I was also suggesting to do it via netlink initially, but it's probably
> a bit slow for these purpose, idk.

Yeah, I hear netlink is reserved for control paths and is
inappropriate for data path, but I'll let folks correct me if wrong.

-- 
Thanks,
Mina

^ permalink raw reply

* Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags
From: David Ahern @ 2023-11-06 19:34 UTC (permalink / raw)
  To: Stanislav Fomichev, Mina Almasry
  Cc: netdev, linux-kernel, linux-arch, linux-kselftest, linux-media,
	dri-devel, linaro-mm-sig, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Arnd Bergmann, Willem de Bruijn, Shuah Khan,
	Sumit Semwal, Christian König, Shakeel Butt, Jeroen de Borst,
	Praveen Kaligineedi, Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <ZUk03DhWxV-bOFJL@google.com>

On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> On 11/05, Mina Almasry wrote:
>> For device memory TCP, we expect the skb headers to be available in host
>> memory for access, and we expect the skb frags to be in device memory
>> and unaccessible to the host. We expect there to be no mixing and
>> matching of device memory frags (unaccessible) with host memory frags
>> (accessible) in the same skb.
>>
>> Add a skb->devmem flag which indicates whether the frags in this skb
>> are device memory frags or not.
>>
>> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
>> and marks the skb as skb->devmem accordingly.
>>
>> Add checks through the network stack to avoid accessing the frags of
>> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
>>
>> Signed-off-by: Willem de Bruijn <willemb@google.com>
>> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
>> Signed-off-by: Mina Almasry <almasrymina@google.com>
>>
>> ---
>>  include/linux/skbuff.h | 14 +++++++-
>>  include/net/tcp.h      |  5 +--
>>  net/core/datagram.c    |  6 ++++
>>  net/core/gro.c         |  5 ++-
>>  net/core/skbuff.c      | 77 ++++++++++++++++++++++++++++++++++++------
>>  net/ipv4/tcp.c         |  6 ++++
>>  net/ipv4/tcp_input.c   | 13 +++++--
>>  net/ipv4/tcp_output.c  |  5 ++-
>>  net/packet/af_packet.c |  4 +--
>>  9 files changed, 115 insertions(+), 20 deletions(-)
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index 1fae276c1353..8fb468ff8115 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
>>   *	@csum_level: indicates the number of consecutive checksums found in
>>   *		the packet minus one that have been verified as
>>   *		CHECKSUM_UNNECESSARY (max 3)
>> + *	@devmem: indicates that all the fragments in this skb are backed by
>> + *		device memory.
>>   *	@dst_pending_confirm: need to confirm neighbour
>>   *	@decrypted: Decrypted SKB
>>   *	@slow_gro: state present at GRO time, slower prepare step required
>> @@ -991,7 +993,7 @@ struct sk_buff {
>>  #if IS_ENABLED(CONFIG_IP_SCTP)
>>  	__u8			csum_not_inet:1;
>>  #endif
>> -
>> +	__u8			devmem:1;
>>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
>>  	__u16			tc_index;	/* traffic control index */
>>  #endif
>> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
>>  		__skb_zcopy_downgrade_managed(skb);
>>  }
>>  
>> +/* Return true if frags in this skb are not readable by the host. */
>> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
>> +{
>> +	return skb->devmem;
> 
> bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> It better communicates the fact that the stack shouldn't dereference the
> frags (because it has 'devmem' fragments or for some other potential
> future reason).

+1.

Also, the flag on the skb is an optimization - a high level signal that
one or more frags is in unreadable memory. There is no requirement that
all of the frags are in the same memory type.

^ permalink raw reply

* [PATCH bpf-next v11 03/13] bpf, net: introduce bpf_struct_ops_desc.
From: thinker.li @ 2023-11-06 20:12 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii, drosen
  Cc: sinquersw, kuifeng, Kui-Feng Lee, netdev
In-Reply-To: <20231106201252.1568931-1-thinker.li@gmail.com>

From: Kui-Feng Lee <thinker.li@gmail.com>

Move some of members of bpf_struct_ops to bpf_struct_ops_desc.  When we
introduce the new API to register new bpf_struct_ops types from modules,
bpf_struct_ops may destroyed when the module is unloaded.  Moving these
members to bpf_struct_ops_desc make these data available even when the
module is unloaded.

type_id is unavailabe in bpf_struct_ops anymore. Modules should get it from
the btf received by kmod's init function.

Cc: netdev@vger.kernel.org
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 include/linux/bpf.h            | 13 ++++--
 kernel/bpf/bpf_struct_ops.c    | 80 +++++++++++++++++-----------------
 kernel/bpf/verifier.c          |  8 ++--
 net/bpf/bpf_dummy_struct_ops.c |  9 +++-
 net/ipv4/bpf_tcp_ca.c          |  8 +++-
 5 files changed, 70 insertions(+), 48 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index b4825d3cdb29..b55e27162df0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1626,17 +1626,22 @@ struct bpf_struct_ops {
 	void (*unreg)(void *kdata);
 	int (*update)(void *kdata, void *old_kdata);
 	int (*validate)(void *kdata);
-	const struct btf_type *type;
-	const struct btf_type *value_type;
 	const char *name;
 	struct btf_func_model func_models[BPF_STRUCT_OPS_MAX_NR_MEMBERS];
+};
+
+struct bpf_struct_ops_desc {
+	struct bpf_struct_ops *st_ops;
+
+	const struct btf_type *type;
+	const struct btf_type *value_type;
 	u32 type_id;
 	u32 value_id;
 };
 
 #if defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL)
 #define BPF_MODULE_OWNER ((void *)((0xeB9FUL << 2) + POISON_POINTER_DELTA))
-const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id);
+const struct bpf_struct_ops_desc *bpf_struct_ops_find(u32 type_id);
 void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log);
 bool bpf_struct_ops_get(const void *kdata);
 void bpf_struct_ops_put(const void *kdata);
@@ -1679,7 +1684,7 @@ int bpf_struct_ops_test_run(struct bpf_prog *prog, const union bpf_attr *kattr,
 			    union bpf_attr __user *uattr);
 #endif
 #else
-static inline const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
+static inline const struct bpf_struct_ops_desc *bpf_struct_ops_find(u32 type_id)
 {
 	return NULL;
 }
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 4ca4ca4998e0..d804801c7864 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -32,7 +32,7 @@ struct bpf_struct_ops_value {
 struct bpf_struct_ops_map {
 	struct bpf_map map;
 	struct rcu_head rcu;
-	const struct bpf_struct_ops *st_ops;
+	const struct bpf_struct_ops_desc *st_ops_desc;
 	/* protect map_update */
 	struct mutex lock;
 	/* link has all the bpf_links that is populated
@@ -92,9 +92,9 @@ enum {
 	__NR_BPF_STRUCT_OPS_TYPE,
 };
 
-static struct bpf_struct_ops * const bpf_struct_ops[] = {
+static struct bpf_struct_ops_desc bpf_struct_ops[] = {
 #define BPF_STRUCT_OPS_TYPE(_name)				\
-	[BPF_STRUCT_OPS_TYPE_##_name] = &bpf_##_name,
+	[BPF_STRUCT_OPS_TYPE_##_name] = { .st_ops = &bpf_##_name },
 #include "bpf_struct_ops_types.h"
 #undef BPF_STRUCT_OPS_TYPE
 };
@@ -115,10 +115,11 @@ enum {
 	IDX_MODULE_ID,
 };
 
-static void bpf_struct_ops_init_one(struct bpf_struct_ops *st_ops,
-				    struct btf *btf,
-				    struct bpf_verifier_log *log)
+static void bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc,
+				     struct btf *btf,
+				     struct bpf_verifier_log *log)
 {
+	struct bpf_struct_ops *st_ops = st_ops_desc->st_ops;
 	const struct btf_member *member;
 	const struct btf_type *t;
 	s32 type_id, value_id;
@@ -190,18 +191,18 @@ static void bpf_struct_ops_init_one(struct bpf_struct_ops *st_ops,
 			pr_warn("Error in init bpf_struct_ops %s\n",
 				st_ops->name);
 		} else {
-			st_ops->type_id = type_id;
-			st_ops->type = t;
-			st_ops->value_id = value_id;
-			st_ops->value_type = btf_type_by_id(btf,
-							    value_id);
+			st_ops_desc->type_id = type_id;
+			st_ops_desc->type = t;
+			st_ops_desc->value_id = value_id;
+			st_ops_desc->value_type = btf_type_by_id(btf,
+								 value_id);
 		}
 	}
 }
 
 void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log)
 {
-	struct bpf_struct_ops *st_ops;
+	struct bpf_struct_ops_desc *st_ops_desc;
 	u32 i;
 
 	/* Ensure BTF type is emitted for "struct bpf_struct_ops_##_name" */
@@ -210,14 +211,14 @@ void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log)
 #undef BPF_STRUCT_OPS_TYPE
 
 	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
-		st_ops = bpf_struct_ops[i];
-		bpf_struct_ops_init_one(st_ops, btf, log);
+		st_ops_desc = &bpf_struct_ops[i];
+		bpf_struct_ops_desc_init(st_ops_desc, btf, log);
 	}
 }
 
 extern struct btf *btf_vmlinux;
 
-static const struct bpf_struct_ops *
+static const struct bpf_struct_ops_desc *
 bpf_struct_ops_find_value(u32 value_id)
 {
 	unsigned int i;
@@ -226,14 +227,14 @@ bpf_struct_ops_find_value(u32 value_id)
 		return NULL;
 
 	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
-		if (bpf_struct_ops[i]->value_id == value_id)
-			return bpf_struct_ops[i];
+		if (bpf_struct_ops[i].value_id == value_id)
+			return &bpf_struct_ops[i];
 	}
 
 	return NULL;
 }
 
-const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
+const struct bpf_struct_ops_desc *bpf_struct_ops_find(u32 type_id)
 {
 	unsigned int i;
 
@@ -241,8 +242,8 @@ const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
 		return NULL;
 
 	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
-		if (bpf_struct_ops[i]->type_id == type_id)
-			return bpf_struct_ops[i];
+		if (bpf_struct_ops[i].type_id == type_id)
+			return &bpf_struct_ops[i];
 	}
 
 	return NULL;
@@ -302,7 +303,7 @@ static void *bpf_struct_ops_map_lookup_elem(struct bpf_map *map, void *key)
 
 static void bpf_struct_ops_map_put_progs(struct bpf_struct_ops_map *st_map)
 {
-	const struct btf_type *t = st_map->st_ops->type;
+	const struct btf_type *t = st_map->st_ops_desc->type;
 	u32 i;
 
 	for (i = 0; i < btf_type_vlen(t); i++) {
@@ -376,11 +377,12 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
 					   void *value, u64 flags)
 {
 	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
-	const struct bpf_struct_ops *st_ops = st_map->st_ops;
+	const struct bpf_struct_ops_desc *st_ops_desc = st_map->st_ops_desc;
+	const struct bpf_struct_ops *st_ops = st_ops_desc->st_ops;
 	struct bpf_struct_ops_value *uvalue, *kvalue;
 	const struct btf_type *module_type;
 	const struct btf_member *member;
-	const struct btf_type *t = st_ops->type;
+	const struct btf_type *t = st_ops_desc->type;
 	struct bpf_tramp_links *tlinks;
 	void *udata, *kdata;
 	int prog_fd, err;
@@ -393,7 +395,7 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
 	if (*(u32 *)key != 0)
 		return -E2BIG;
 
-	err = check_zero_holes(st_ops->value_type, value);
+	err = check_zero_holes(st_ops_desc->value_type, value);
 	if (err)
 		return err;
 
@@ -486,7 +488,7 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
 		}
 
 		if (prog->type != BPF_PROG_TYPE_STRUCT_OPS ||
-		    prog->aux->attach_btf_id != st_ops->type_id ||
+		    prog->aux->attach_btf_id != st_ops_desc->type_id ||
 		    prog->expected_attach_type != i) {
 			bpf_prog_put(prog);
 			err = -EINVAL;
@@ -582,7 +584,7 @@ static long bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key)
 			     BPF_STRUCT_OPS_STATE_TOBEFREE);
 	switch (prev_state) {
 	case BPF_STRUCT_OPS_STATE_INUSE:
-		st_map->st_ops->unreg(&st_map->kvalue.data);
+		st_map->st_ops_desc->st_ops->unreg(&st_map->kvalue.data);
 		bpf_map_put(map);
 		return 0;
 	case BPF_STRUCT_OPS_STATE_TOBEFREE:
@@ -663,22 +665,22 @@ static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr)
 
 static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
 {
-	const struct bpf_struct_ops *st_ops;
+	const struct bpf_struct_ops_desc *st_ops_desc;
 	size_t st_map_size;
 	struct bpf_struct_ops_map *st_map;
 	const struct btf_type *t, *vt;
 	struct bpf_map *map;
 	int ret;
 
-	st_ops = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id);
-	if (!st_ops)
+	st_ops_desc = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id);
+	if (!st_ops_desc)
 		return ERR_PTR(-ENOTSUPP);
 
-	vt = st_ops->value_type;
+	vt = st_ops_desc->value_type;
 	if (attr->value_size != vt->size)
 		return ERR_PTR(-EINVAL);
 
-	t = st_ops->type;
+	t = st_ops_desc->type;
 
 	st_map_size = sizeof(*st_map) +
 		/* kvalue stores the
@@ -690,7 +692,7 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
 	if (!st_map)
 		return ERR_PTR(-ENOMEM);
 
-	st_map->st_ops = st_ops;
+	st_map->st_ops_desc = st_ops_desc;
 	map = &st_map->map;
 
 	ret = bpf_jit_charge_modmem(PAGE_SIZE);
@@ -728,8 +730,8 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
 static u64 bpf_struct_ops_map_mem_usage(const struct bpf_map *map)
 {
 	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
-	const struct bpf_struct_ops *st_ops = st_map->st_ops;
-	const struct btf_type *vt = st_ops->value_type;
+	const struct bpf_struct_ops_desc *st_ops_desc = st_map->st_ops_desc;
+	const struct btf_type *vt = st_ops_desc->value_type;
 	u64 usage;
 
 	usage = sizeof(*st_map) +
@@ -803,7 +805,7 @@ static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
 		/* st_link->map can be NULL if
 		 * bpf_struct_ops_link_create() fails to register.
 		 */
-		st_map->st_ops->unreg(&st_map->kvalue.data);
+		st_map->st_ops_desc->st_ops->unreg(&st_map->kvalue.data);
 		bpf_map_put(&st_map->map);
 	}
 	kfree(st_link);
@@ -850,7 +852,7 @@ static int bpf_struct_ops_map_link_update(struct bpf_link *link, struct bpf_map
 	if (!bpf_struct_ops_valid_to_reg(new_map))
 		return -EINVAL;
 
-	if (!st_map->st_ops->update)
+	if (!st_map->st_ops_desc->st_ops->update)
 		return -EOPNOTSUPP;
 
 	mutex_lock(&update_mutex);
@@ -863,12 +865,12 @@ static int bpf_struct_ops_map_link_update(struct bpf_link *link, struct bpf_map
 
 	old_st_map = container_of(old_map, struct bpf_struct_ops_map, map);
 	/* The new and old struct_ops must be the same type. */
-	if (st_map->st_ops != old_st_map->st_ops) {
+	if (st_map->st_ops_desc != old_st_map->st_ops_desc) {
 		err = -EINVAL;
 		goto err_out;
 	}
 
-	err = st_map->st_ops->update(st_map->kvalue.data, old_st_map->kvalue.data);
+	err = st_map->st_ops_desc->st_ops->update(st_map->kvalue.data, old_st_map->kvalue.data);
 	if (err)
 		goto err_out;
 
@@ -919,7 +921,7 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
 	if (err)
 		goto err_out;
 
-	err = st_map->st_ops->reg(st_map->kvalue.data);
+	err = st_map->st_ops_desc->st_ops->reg(st_map->kvalue.data);
 	if (err) {
 		bpf_link_cleanup(&link_primer);
 		link = NULL;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 857d76694517..290e3a7ee72f 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -20081,6 +20081,7 @@ static void print_verification_stats(struct bpf_verifier_env *env)
 static int check_struct_ops_btf_id(struct bpf_verifier_env *env)
 {
 	const struct btf_type *t, *func_proto;
+	const struct bpf_struct_ops_desc *st_ops_desc;
 	const struct bpf_struct_ops *st_ops;
 	const struct btf_member *member;
 	struct bpf_prog *prog = env->prog;
@@ -20093,14 +20094,15 @@ static int check_struct_ops_btf_id(struct bpf_verifier_env *env)
 	}
 
 	btf_id = prog->aux->attach_btf_id;
-	st_ops = bpf_struct_ops_find(btf_id);
-	if (!st_ops) {
+	st_ops_desc = bpf_struct_ops_find(btf_id);
+	if (!st_ops_desc) {
 		verbose(env, "attach_btf_id %u is not a supported struct\n",
 			btf_id);
 		return -ENOTSUPP;
 	}
+	st_ops = st_ops_desc->st_ops;
 
-	t = st_ops->type;
+	t = st_ops_desc->type;
 	member_idx = prog->expected_attach_type;
 	if (member_idx >= btf_type_vlen(t)) {
 		verbose(env, "attach to invalid member idx %u of struct %s\n",
diff --git a/net/bpf/bpf_dummy_struct_ops.c b/net/bpf/bpf_dummy_struct_ops.c
index 5918d1b32e19..ffa224053a6c 100644
--- a/net/bpf/bpf_dummy_struct_ops.c
+++ b/net/bpf/bpf_dummy_struct_ops.c
@@ -17,6 +17,8 @@ struct bpf_dummy_ops_test_args {
 	struct bpf_dummy_ops_state state;
 };
 
+static struct btf *bpf_dummy_ops_btf;
+
 static struct bpf_dummy_ops_test_args *
 dummy_ops_init_args(const union bpf_attr *kattr, unsigned int nr)
 {
@@ -85,9 +87,13 @@ int bpf_struct_ops_test_run(struct bpf_prog *prog, const union bpf_attr *kattr,
 	void *image = NULL;
 	unsigned int op_idx;
 	int prog_ret;
+	u32 type_id;
 	int err;
 
-	if (prog->aux->attach_btf_id != st_ops->type_id)
+	type_id = btf_find_by_name_kind(bpf_dummy_ops_btf,
+					bpf_bpf_dummy_ops.name,
+					BTF_KIND_STRUCT);
+	if (prog->aux->attach_btf_id != type_id)
 		return -EOPNOTSUPP;
 
 	func_proto = prog->aux->attach_func_proto;
@@ -143,6 +149,7 @@ int bpf_struct_ops_test_run(struct bpf_prog *prog, const union bpf_attr *kattr,
 
 static int bpf_dummy_init(struct btf *btf)
 {
+	bpf_dummy_ops_btf = btf;
 	return 0;
 }
 
diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
index 39dcccf0f174..3c8b76578a2a 100644
--- a/net/ipv4/bpf_tcp_ca.c
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -20,6 +20,7 @@ static u32 unsupported_ops[] = {
 
 static const struct btf_type *tcp_sock_type;
 static u32 tcp_sock_id, sock_id;
+static const struct btf_type *tcp_congestion_ops_type;
 
 static int bpf_tcp_ca_init(struct btf *btf)
 {
@@ -36,6 +37,11 @@ static int bpf_tcp_ca_init(struct btf *btf)
 	tcp_sock_id = type_id;
 	tcp_sock_type = btf_type_by_id(btf, tcp_sock_id);
 
+	type_id = btf_find_by_name_kind(btf, "tcp_congestion_ops", BTF_KIND_STRUCT);
+	if (type_id < 0)
+		return -EINVAL;
+	tcp_congestion_ops_type = btf_type_by_id(btf, type_id);
+
 	return 0;
 }
 
@@ -149,7 +155,7 @@ static u32 prog_ops_moff(const struct bpf_prog *prog)
 	u32 midx;
 
 	midx = prog->expected_attach_type;
-	t = bpf_tcp_congestion_ops.type;
+	t = tcp_congestion_ops_type;
 	m = &btf_type_member(t)[midx];
 
 	return __btf_member_bit_offset(t, m) / 8;
-- 
2.34.1


^ permalink raw reply related

* [PATCH bpf-next v11 10/13] bpf, net: switch to dynamic registration
From: thinker.li @ 2023-11-06 20:12 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii, drosen
  Cc: sinquersw, kuifeng, Kui-Feng Lee, netdev
In-Reply-To: <20231106201252.1568931-1-thinker.li@gmail.com>

From: Kui-Feng Lee <thinker.li@gmail.com>

Replace the static list of struct_ops types with per-btf struct_ops_tab to
enable dynamic registration.

Both bpf_dummy_ops and bpf_tcp_ca now utilize the registration function
instead of being listed in bpf_struct_ops_types.h.

Cc: netdev@vger.kernel.org
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 include/linux/bpf.h               | 24 +++++++--
 include/linux/btf.h               |  2 +
 kernel/bpf/bpf_struct_ops.c       | 90 ++++++++++---------------------
 kernel/bpf/bpf_struct_ops_types.h | 12 -----
 kernel/bpf/btf.c                  | 40 ++++++++++++--
 net/bpf/bpf_dummy_struct_ops.c    | 14 ++++-
 net/ipv4/bpf_tcp_ca.c             | 16 ++++--
 7 files changed, 109 insertions(+), 89 deletions(-)
 delete mode 100644 kernel/bpf/bpf_struct_ops_types.h

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 48e97a255945..432c088d4001 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1643,7 +1643,6 @@ struct bpf_struct_ops_desc {
 #if defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL)
 #define BPF_MODULE_OWNER ((void *)((0xeB9FUL << 2) + POISON_POINTER_DELTA))
 const struct bpf_struct_ops_desc *bpf_struct_ops_find(struct btf *btf, u32 type_id);
-void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log);
 bool bpf_struct_ops_get(const void *kdata);
 void bpf_struct_ops_put(const void *kdata);
 int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key,
@@ -1689,10 +1688,6 @@ static inline const struct bpf_struct_ops_desc *bpf_struct_ops_find(struct btf *
 {
 	return NULL;
 }
-static inline void bpf_struct_ops_init(struct btf *btf,
-				       struct bpf_verifier_log *log)
-{
-}
 static inline bool bpf_try_module_get(const void *data, struct module *owner)
 {
 	return try_module_get(owner);
@@ -3232,6 +3227,8 @@ static inline bool bpf_is_subprog(const struct bpf_prog *prog)
 }
 
 #ifdef CONFIG_BPF_JIT
+int register_bpf_struct_ops(struct bpf_struct_ops *st_ops);
+
 enum bpf_struct_ops_state {
 	BPF_STRUCT_OPS_STATE_INIT,
 	BPF_STRUCT_OPS_STATE_INUSE,
@@ -3243,6 +3240,23 @@ struct bpf_struct_ops_common_value {
 	refcount_t refcnt;
 	enum bpf_struct_ops_state state;
 };
+
+/* bpf_struct_ops_##_name (e.g. bpf_struct_ops_tcp_congestion_ops) is
+ * the map's value exposed to the userspace and its btf-type-id is
+ * stored at the map->btf_vmlinux_value_type_id.
+ *
+ */
+#define DEFINE_STRUCT_OPS_VALUE_TYPE(_name)			\
+extern struct bpf_struct_ops bpf_##_name;			\
+								\
+struct bpf_struct_ops_##_name {					\
+	struct bpf_struct_ops_common_value common;		\
+	struct _name data ____cacheline_aligned_in_smp;		\
+}
+
+extern int bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc,
+				    struct btf *btf,
+				    struct bpf_verifier_log *log);
 #endif /* CONFIG_BPF_JIT */
 
 #endif /* _LINUX_BPF_H */
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 07ee6740e06a..d6fd6c20b3e2 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -12,6 +12,8 @@
 #include <uapi/linux/bpf.h>
 
 #define BTF_TYPE_EMIT(type) ((void)(type *)0)
+#define BTF_STRUCT_OPS_TYPE_EMIT(type) \
+	((void)(struct bpf_struct_ops_##type *)0)
 #define BTF_TYPE_EMIT_ENUM(enum_val) ((void)enum_val)
 
 /* These need to be macros, as the expressions are used in assembler input */
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 2d853431bf09..caacc251655a 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -61,35 +61,6 @@ static DEFINE_MUTEX(update_mutex);
 #define VALUE_PREFIX "bpf_struct_ops_"
 #define VALUE_PREFIX_LEN (sizeof(VALUE_PREFIX) - 1)
 
-/* bpf_struct_ops_##_name (e.g. bpf_struct_ops_tcp_congestion_ops) is
- * the map's value exposed to the userspace and its btf-type-id is
- * stored at the map->btf_vmlinux_value_type_id.
- *
- */
-#define BPF_STRUCT_OPS_TYPE(_name)				\
-extern struct bpf_struct_ops bpf_##_name;			\
-								\
-struct bpf_struct_ops_##_name {					\
-	struct bpf_struct_ops_common_value common;		\
-	struct _name data ____cacheline_aligned_in_smp;		\
-};
-#include "bpf_struct_ops_types.h"
-#undef BPF_STRUCT_OPS_TYPE
-
-enum {
-#define BPF_STRUCT_OPS_TYPE(_name) BPF_STRUCT_OPS_TYPE_##_name,
-#include "bpf_struct_ops_types.h"
-#undef BPF_STRUCT_OPS_TYPE
-	__NR_BPF_STRUCT_OPS_TYPE,
-};
-
-static struct bpf_struct_ops_desc bpf_struct_ops[] = {
-#define BPF_STRUCT_OPS_TYPE(_name)				\
-	[BPF_STRUCT_OPS_TYPE_##_name] = { .st_ops = &bpf_##_name },
-#include "bpf_struct_ops_types.h"
-#undef BPF_STRUCT_OPS_TYPE
-};
-
 const struct bpf_verifier_ops bpf_struct_ops_verifier_ops = {
 };
 
@@ -144,9 +115,9 @@ static bool is_valid_value_type(struct btf *btf, s32 value_id,
 	return true;
 }
 
-static void bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc,
-				     struct btf *btf,
-				     struct bpf_verifier_log *log)
+int bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc,
+			     struct btf *btf,
+			     struct bpf_verifier_log *log)
 {
 	struct bpf_struct_ops *st_ops = st_ops_desc->st_ops;
 	const struct btf_member *member;
@@ -160,7 +131,7 @@ static void bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc,
 	    sizeof(value_name)) {
 		pr_warn("struct_ops name %s is too long\n",
 			st_ops->name);
-		return;
+		return -EINVAL;
 	}
 	sprintf(value_name, "%s%s", VALUE_PREFIX, st_ops->name);
 
@@ -169,13 +140,13 @@ static void bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc,
 	if (type_id < 0) {
 		pr_warn("Cannot find struct %s in btf_vmlinux\n",
 			st_ops->name);
-		return;
+		return -EINVAL;
 	}
 	t = btf_type_by_id(btf, type_id);
 	if (btf_type_vlen(t) > BPF_STRUCT_OPS_MAX_NR_MEMBERS) {
 		pr_warn("Cannot support #%u members in struct %s\n",
 			btf_type_vlen(t), st_ops->name);
-		return;
+		return -EINVAL;
 	}
 
 	value_id = btf_find_by_name_kind(btf, value_name,
@@ -183,10 +154,10 @@ static void bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc,
 	if (value_id < 0) {
 		pr_warn("Cannot find struct %s in btf_vmlinux\n",
 			value_name);
-		return;
+		return -EINVAL;
 	}
 	if (!is_valid_value_type(btf, value_id, t, value_name))
-		return;
+		return -EINVAL;
 
 	for_each_member(i, t, member) {
 		const struct btf_type *func_proto;
@@ -195,13 +166,13 @@ static void bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc,
 		if (!*mname) {
 			pr_warn("anon member in struct %s is not supported\n",
 				st_ops->name);
-			break;
+			return -EOPNOTSUPP;
 		}
 
 		if (__btf_member_bitfield_size(t, member)) {
 			pr_warn("bit field member %s in struct %s is not supported\n",
 				mname, st_ops->name);
-			break;
+			return -EOPNOTSUPP;
 		}
 
 		func_proto = btf_type_resolve_func_ptr(btf,
@@ -213,7 +184,7 @@ static void bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc,
 					   &st_ops->func_models[i])) {
 			pr_warn("Error in parsing func ptr %s in struct %s\n",
 				mname, st_ops->name);
-			break;
+			return -EINVAL;
 		}
 	}
 
@@ -221,6 +192,7 @@ static void bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc,
 		if (st_ops->init(btf)) {
 			pr_warn("Error in init bpf_struct_ops %s\n",
 				st_ops->name);
+			return -EINVAL;
 		} else {
 			st_ops_desc->type_id = type_id;
 			st_ops_desc->type = t;
@@ -229,35 +201,24 @@ static void bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc,
 								 value_id);
 		}
 	}
-}
 
-void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log)
-{
-	struct bpf_struct_ops_desc *st_ops_desc;
-	u32 i;
-
-	/* Ensure BTF type is emitted for "struct bpf_struct_ops_##_name" */
-#define BPF_STRUCT_OPS_TYPE(_name) BTF_TYPE_EMIT(struct bpf_struct_ops_##_name);
-#include "bpf_struct_ops_types.h"
-#undef BPF_STRUCT_OPS_TYPE
-
-	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
-		st_ops_desc = &bpf_struct_ops[i];
-		bpf_struct_ops_desc_init(st_ops_desc, btf, log);
-	}
+	return 0;
 }
 
 static const struct bpf_struct_ops_desc *
 bpf_struct_ops_find_value(struct btf *btf, u32 value_id)
 {
+	const struct bpf_struct_ops_desc *st_ops_list;
 	unsigned int i;
+	u32 cnt = 0;
 
-	if (!value_id || !btf)
+	if (!value_id)
 		return NULL;
 
-	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
-		if (bpf_struct_ops[i].value_id == value_id)
-			return &bpf_struct_ops[i];
+	st_ops_list = btf_get_struct_ops(btf, &cnt);
+	for (i = 0; i < cnt; i++) {
+		if (st_ops_list[i].value_id == value_id)
+			return &st_ops_list[i];
 	}
 
 	return NULL;
@@ -266,14 +227,17 @@ bpf_struct_ops_find_value(struct btf *btf, u32 value_id)
 const struct bpf_struct_ops_desc *
 bpf_struct_ops_find(struct btf *btf, u32 type_id)
 {
+	const struct bpf_struct_ops_desc *st_ops_list;
 	unsigned int i;
+	u32 cnt;
 
-	if (!type_id || !btf)
+	if (!type_id)
 		return NULL;
 
-	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
-		if (bpf_struct_ops[i].type_id == type_id)
-			return &bpf_struct_ops[i];
+	st_ops_list = btf_get_struct_ops(btf, &cnt);
+	for (i = 0; i < cnt; i++) {
+		if (st_ops_list[i].type_id == type_id)
+			return &st_ops_list[i];
 	}
 
 	return NULL;
diff --git a/kernel/bpf/bpf_struct_ops_types.h b/kernel/bpf/bpf_struct_ops_types.h
deleted file mode 100644
index 5678a9ddf817..000000000000
--- a/kernel/bpf/bpf_struct_ops_types.h
+++ /dev/null
@@ -1,12 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/* internal file - do not include directly */
-
-#ifdef CONFIG_BPF_JIT
-#ifdef CONFIG_NET
-BPF_STRUCT_OPS_TYPE(bpf_dummy_ops)
-#endif
-#ifdef CONFIG_INET
-#include <net/tcp.h>
-BPF_STRUCT_OPS_TYPE(tcp_congestion_ops)
-#endif
-#endif
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 263715af10cb..7165a1beeed5 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -19,6 +19,7 @@
 #include <linux/bpf_verifier.h>
 #include <linux/btf.h>
 #include <linux/btf_ids.h>
+#include <linux/bpf.h>
 #include <linux/bpf_lsm.h>
 #include <linux/skmsg.h>
 #include <linux/perf_event.h>
@@ -5796,8 +5797,6 @@ struct btf *btf_parse_vmlinux(void)
 	/* btf_parse_vmlinux() runs under bpf_verifier_lock */
 	bpf_ctx_convert.t = btf_type_by_id(btf, bpf_ctx_convert_btf_id[0]);
 
-	bpf_struct_ops_init(btf, log);
-
 	refcount_set(&btf->refcnt, 1);
 
 	err = btf_alloc_id(btf);
@@ -8628,10 +8627,11 @@ bool btf_type_ids_nocast_alias(struct bpf_verifier_log *log,
 #ifdef CONFIG_BPF_JIT
 
 static int
-btf_add_struct_ops(struct btf *btf, struct bpf_struct_ops *st_ops)
+btf_add_struct_ops(struct btf *btf, struct bpf_struct_ops *st_ops,
+		   struct bpf_verifier_log *log)
 {
 	struct btf_struct_ops_tab *tab, *new_tab;
-	int i;
+	int i, err;
 
 	if (!btf)
 		return -ENOENT;
@@ -8668,6 +8668,10 @@ btf_add_struct_ops(struct btf *btf, struct bpf_struct_ops *st_ops)
 
 	tab->ops[btf->struct_ops_tab->cnt].st_ops = st_ops;
 
+	err = bpf_struct_ops_desc_init(&tab->ops[btf->struct_ops_tab->cnt], btf, log);
+	if (err)
+		return err;
+
 	btf->struct_ops_tab->cnt++;
 
 	return 0;
@@ -8684,4 +8688,32 @@ const struct bpf_struct_ops_desc *btf_get_struct_ops(struct btf *btf, u32 *ret_c
 	return (const struct bpf_struct_ops_desc *)btf->struct_ops_tab->ops;
 }
 
+int register_bpf_struct_ops(struct bpf_struct_ops *st_ops)
+{
+	struct bpf_verifier_log *log;
+	struct btf *btf;
+	int err = 0;
+
+	btf = btf_get_module_btf(st_ops->owner);
+	if (!btf)
+		return -EINVAL;
+
+	log = kzalloc(sizeof(*log), GFP_KERNEL | __GFP_NOWARN);
+	if (!log) {
+		err = -ENOMEM;
+		goto errout;
+	}
+
+	log->level = BPF_LOG_KERNEL;
+
+	err = btf_add_struct_ops(btf, st_ops, log);
+
+errout:
+	kfree(log);
+	btf_put(btf);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(register_bpf_struct_ops);
+
 #endif /* CONFIG_BPF_JIT */
diff --git a/net/bpf/bpf_dummy_struct_ops.c b/net/bpf/bpf_dummy_struct_ops.c
index ffa224053a6c..148a5851c4fa 100644
--- a/net/bpf/bpf_dummy_struct_ops.c
+++ b/net/bpf/bpf_dummy_struct_ops.c
@@ -7,7 +7,7 @@
 #include <linux/bpf.h>
 #include <linux/btf.h>
 
-extern struct bpf_struct_ops bpf_bpf_dummy_ops;
+static struct bpf_struct_ops bpf_bpf_dummy_ops;
 
 /* A common type for test_N with return value in bpf_dummy_ops */
 typedef int (*dummy_ops_test_ret_fn)(struct bpf_dummy_ops_state *state, ...);
@@ -223,11 +223,13 @@ static int bpf_dummy_reg(void *kdata)
 	return -EOPNOTSUPP;
 }
 
+DEFINE_STRUCT_OPS_VALUE_TYPE(bpf_dummy_ops);
+
 static void bpf_dummy_unreg(void *kdata)
 {
 }
 
-struct bpf_struct_ops bpf_bpf_dummy_ops = {
+static struct bpf_struct_ops bpf_bpf_dummy_ops = {
 	.verifier_ops = &bpf_dummy_verifier_ops,
 	.init = bpf_dummy_init,
 	.check_member = bpf_dummy_ops_check_member,
@@ -235,4 +237,12 @@ struct bpf_struct_ops bpf_bpf_dummy_ops = {
 	.reg = bpf_dummy_reg,
 	.unreg = bpf_dummy_unreg,
 	.name = "bpf_dummy_ops",
+	.owner = THIS_MODULE,
 };
+
+static int __init bpf_dummy_struct_ops_init(void)
+{
+	BTF_STRUCT_OPS_TYPE_EMIT(bpf_dummy_ops);
+	return register_bpf_struct_ops(&bpf_bpf_dummy_ops);
+}
+late_initcall(bpf_dummy_struct_ops_init);
diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
index 3c8b76578a2a..b36a19274e5b 100644
--- a/net/ipv4/bpf_tcp_ca.c
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -12,7 +12,7 @@
 #include <net/bpf_sk_storage.h>
 
 /* "extern" is to avoid sparse warning.  It is only used in bpf_struct_ops.c. */
-extern struct bpf_struct_ops bpf_tcp_congestion_ops;
+static struct bpf_struct_ops bpf_tcp_congestion_ops;
 
 static u32 unsupported_ops[] = {
 	offsetof(struct tcp_congestion_ops, get_info),
@@ -277,7 +277,9 @@ static int bpf_tcp_ca_validate(void *kdata)
 	return tcp_validate_congestion_control(kdata);
 }
 
-struct bpf_struct_ops bpf_tcp_congestion_ops = {
+DEFINE_STRUCT_OPS_VALUE_TYPE(tcp_congestion_ops);
+
+static struct bpf_struct_ops bpf_tcp_congestion_ops = {
 	.verifier_ops = &bpf_tcp_ca_verifier_ops,
 	.reg = bpf_tcp_ca_reg,
 	.unreg = bpf_tcp_ca_unreg,
@@ -287,10 +289,18 @@ struct bpf_struct_ops bpf_tcp_congestion_ops = {
 	.init = bpf_tcp_ca_init,
 	.validate = bpf_tcp_ca_validate,
 	.name = "tcp_congestion_ops",
+	.owner = THIS_MODULE,
 };
 
 static int __init bpf_tcp_ca_kfunc_init(void)
 {
-	return register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_tcp_ca_kfunc_set);
+	int ret;
+
+	BTF_STRUCT_OPS_TYPE_EMIT(tcp_congestion_ops);
+
+	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_tcp_ca_kfunc_set);
+	ret = ret ?: register_bpf_struct_ops(&bpf_tcp_congestion_ops);
+
+	return ret;
 }
 late_initcall(bpf_tcp_ca_kfunc_init);
-- 
2.34.1


^ permalink raw reply related

* Re: [RFC Draft net-next] docs: netdev: add section on using lei to manage netdev mail volume
From: Jakub Kicinski @ 2023-11-06 20:19 UTC (permalink / raw)
  To: David Wei; +Cc: netdev
In-Reply-To: <20231105185014.2523447-1-dw@davidwei.uk>

On Sun,  5 Nov 2023 10:50:14 -0800 David Wei wrote:
>  
> +Managing emails
> +~~~~~~~~~~~~~~~

How about adding this section before "Patch review" ?

> +netdev is a busy mailing list with on average over 200 emails received per day,
> +which can be overwhelming to beginners. Rather than subscribing to the entire
> +list, considering using ``lei`` to only subscribe to topics that you are
> +interested in. Konstantin Ryabitsev wrote excellent tutorials on using ``lei``:
> +
> + - https://people.kernel.org/monsieuricon/lore-lei-part-1-getting-started
> + - https://people.kernel.org/monsieuricon/lore-lei-part-2-now-with-imap
> +
> +As a netdev beginner, you may want to filter out driver changes and only focus
> +on core netdev changes. Try using the following query with ``lei q``::
> +
> +  lei q -o ~/Mail/netdev \
> +    -I https://lore.kernel.org/all \
> +    -t '(b:b/net/* AND tc:netdev@vger.kernel.org AND rt:2.week.ago..'

Let's add a sentence pointing out the b:b/net hack and why it's needed.

^ permalink raw reply

* Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags
From: Mina Almasry @ 2023-11-06 20:31 UTC (permalink / raw)
  To: David Ahern
  Cc: Stanislav Fomichev, netdev, linux-kernel, linux-arch,
	linux-kselftest, linux-media, dri-devel, linaro-mm-sig,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	Willem de Bruijn, Shuah Khan, Sumit Semwal, Christian König,
	Shakeel Butt, Jeroen de Borst, Praveen Kaligineedi,
	Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <19129763-6f74-4b04-8a5f-441255b76d34@kernel.org>

On Mon, Nov 6, 2023 at 11:34 AM David Ahern <dsahern@kernel.org> wrote:
>
> On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > On 11/05, Mina Almasry wrote:
> >> For device memory TCP, we expect the skb headers to be available in host
> >> memory for access, and we expect the skb frags to be in device memory
> >> and unaccessible to the host. We expect there to be no mixing and
> >> matching of device memory frags (unaccessible) with host memory frags
> >> (accessible) in the same skb.
> >>
> >> Add a skb->devmem flag which indicates whether the frags in this skb
> >> are device memory frags or not.
> >>
> >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> >> and marks the skb as skb->devmem accordingly.
> >>
> >> Add checks through the network stack to avoid accessing the frags of
> >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> >>
> >> Signed-off-by: Willem de Bruijn <willemb@google.com>
> >> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> >> Signed-off-by: Mina Almasry <almasrymina@google.com>
> >>
> >> ---
> >>  include/linux/skbuff.h | 14 +++++++-
> >>  include/net/tcp.h      |  5 +--
> >>  net/core/datagram.c    |  6 ++++
> >>  net/core/gro.c         |  5 ++-
> >>  net/core/skbuff.c      | 77 ++++++++++++++++++++++++++++++++++++------
> >>  net/ipv4/tcp.c         |  6 ++++
> >>  net/ipv4/tcp_input.c   | 13 +++++--
> >>  net/ipv4/tcp_output.c  |  5 ++-
> >>  net/packet/af_packet.c |  4 +--
> >>  9 files changed, 115 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> >> index 1fae276c1353..8fb468ff8115 100644
> >> --- a/include/linux/skbuff.h
> >> +++ b/include/linux/skbuff.h
> >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> >>   *  @csum_level: indicates the number of consecutive checksums found in
> >>   *          the packet minus one that have been verified as
> >>   *          CHECKSUM_UNNECESSARY (max 3)
> >> + *  @devmem: indicates that all the fragments in this skb are backed by
> >> + *          device memory.
> >>   *  @dst_pending_confirm: need to confirm neighbour
> >>   *  @decrypted: Decrypted SKB
> >>   *  @slow_gro: state present at GRO time, slower prepare step required
> >> @@ -991,7 +993,7 @@ struct sk_buff {
> >>  #if IS_ENABLED(CONFIG_IP_SCTP)
> >>      __u8                    csum_not_inet:1;
> >>  #endif
> >> -
> >> +    __u8                    devmem:1;
> >>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> >>      __u16                   tc_index;       /* traffic control index */
> >>  #endif
> >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> >>              __skb_zcopy_downgrade_managed(skb);
> >>  }
> >>
> >> +/* Return true if frags in this skb are not readable by the host. */
> >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> >> +{
> >> +    return skb->devmem;
> >
> > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > It better communicates the fact that the stack shouldn't dereference the
> > frags (because it has 'devmem' fragments or for some other potential
> > future reason).
>
> +1.
>
> Also, the flag on the skb is an optimization - a high level signal that
> one or more frags is in unreadable memory. There is no requirement that
> all of the frags are in the same memory type.

The flag indicates that the skb contains all devmem dma-buf memory
specifically, not generic 'not_readable' frags as the comment says:

+ *     @devmem: indicates that all the fragments in this skb are backed by
+ *             device memory.

The reason it's not a generic 'not_readable' flag is because handing
off a generic not_readable skb to the userspace is semantically not
what we're doing. recvmsg() is augmented in this patch series to
return a devmem skb to the user via a cmsg_devmem struct which refers
specifically to the memory in the dma-buf. recvmsg() in this patch
series is not augmented to give any 'not_readable' skb to the
userspace.

IMHO skb->devmem + an skb_frags_not_readable() as implemented is
correct. If a new type of unreadable skbs are introduced to the stack,
I imagine the stack would implement:

1. new header flag: skb->newmem
2.

static inline bool skb_frags_not_readable(const struct skb_buff *skb)
{
    return skb->devmem || skb->newmem;
}

3. tcp_recvmsg_devmem() would handle skb->devmem skbs is in this patch
series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.

-- 
Thanks,
Mina

^ permalink raw reply

* Re: [PATCH net v2] tcp: Fix -Wc23-extensions in tcp_options_write()
From: Jakub Kicinski @ 2023-11-06 20:52 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: Eric Dumazet, davem, dsahern, pabeni, ndesaulniers, trix,
	0x7f454c46, noureddine, hch, netdev, linux-kernel, llvm, patches
In-Reply-To: <20231106155806.GA1181828@dev-arch.thelio-3990X>

On Mon, 6 Nov 2023 08:58:06 -0700 Nathan Chancellor wrote:
> Ah, this suggestion is much better, thanks. I'll make this adjustment
> and send a v3 later today in case others have any suggested changes (I
> know netdev prefers waiting 24 hours for another revision but I'd like
> to get this warning cleared up by -rc1 so it does not proliferate into
> other trees and I sent v1 almost a week ago).

Definitely, sorry about the delay, feel free to post v3 ASAP.

^ permalink raw reply

* Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags
From: Stanislav Fomichev @ 2023-11-06 20:56 UTC (permalink / raw)
  To: Mina Almasry
  Cc: netdev, linux-kernel, linux-arch, linux-kselftest, linux-media,
	dri-devel, linaro-mm-sig, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Arnd Bergmann, David Ahern, Willem de Bruijn,
	Shuah Khan, Sumit Semwal, Christian König, Shakeel Butt,
	Jeroen de Borst, Praveen Kaligineedi, Willem de Bruijn,
	Kaiyuan Zhang
In-Reply-To: <20231106024413.2801438-10-almasrymina@google.com>

On 11/05, Mina Almasry wrote:
> For device memory TCP, we expect the skb headers to be available in host
> memory for access, and we expect the skb frags to be in device memory
> and unaccessible to the host. We expect there to be no mixing and
> matching of device memory frags (unaccessible) with host memory frags
> (accessible) in the same skb.
> 
> Add a skb->devmem flag which indicates whether the frags in this skb
> are device memory frags or not.
> 
> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> and marks the skb as skb->devmem accordingly.
> 
> Add checks through the network stack to avoid accessing the frags of
> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> Signed-off-by: Mina Almasry <almasrymina@google.com>

[..]
 
> -	snaplen = skb->len;
> +	snaplen = skb_frags_not_readable(skb) ? skb_headlen(skb) : skb->len;
>  
>  	res = run_filter(skb, sk, snaplen);
>  	if (!res)
> @@ -2279,7 +2279,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
>  		}
>  	}
>  
> -	snaplen = skb->len;
> +	snaplen = skb_frags_not_readable(skb) ? skb_headlen(skb) : skb->len;
>  
>  	res = run_filter(skb, sk, snaplen);
>  	if (!res)

Not sure it covers 100% of bpf. We might need to double-check bpf_xdp_copy_buf
which is having its own, non-skb shinfo and frags. And in general, xdp
can reference those shinfo frags early... (xdp part happens
before we create an skb with all devmem association)

^ permalink raw reply

* Re: [RFC PATCH v3 06/12] memory-provider: dmabuf devmem memory provider
From: Stanislav Fomichev @ 2023-11-06 21:02 UTC (permalink / raw)
  To: Mina Almasry
  Cc: netdev, linux-kernel, linux-arch, linux-kselftest, linux-media,
	dri-devel, linaro-mm-sig, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Arnd Bergmann, David Ahern, Willem de Bruijn,
	Shuah Khan, Sumit Semwal, Christian König, Shakeel Butt,
	Jeroen de Borst, Praveen Kaligineedi, Willem de Bruijn,
	Kaiyuan Zhang
In-Reply-To: <20231106024413.2801438-7-almasrymina@google.com>

On 11/05, Mina Almasry wrote:
> Implement a memory provider that allocates dmabuf devmem page_pool_iovs.
> 
> Support of PP_FLAG_DMA_MAP and PP_FLAG_DMA_SYNC_DEV is omitted for
> simplicity.
> 
> The provider receives a reference to the struct netdev_dmabuf_binding
> via the pool->mp_priv pointer. The driver needs to set this pointer for
> the provider in the page_pool_params.
> 
> The provider obtains a reference on the netdev_dmabuf_binding which
> guarantees the binding and the underlying mapping remains alive until
> the provider is destroyed.
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> 
> ---
>  include/net/page_pool/helpers.h | 40 +++++++++++++++++
>  include/net/page_pool/types.h   | 10 +++++
>  net/core/page_pool.c            | 76 +++++++++++++++++++++++++++++++++
>  3 files changed, 126 insertions(+)
> 
> diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
> index 78cbb040af94..b93243c2a640 100644
> --- a/include/net/page_pool/helpers.h
> +++ b/include/net/page_pool/helpers.h
> @@ -53,6 +53,7 @@
>  #define _NET_PAGE_POOL_HELPERS_H
>  
>  #include <net/page_pool/types.h>
> +#include <net/net_debug.h>
>  
>  #ifdef CONFIG_PAGE_POOL_STATS
>  int page_pool_ethtool_stats_get_count(void);
> @@ -111,6 +112,45 @@ page_pool_iov_binding(const struct page_pool_iov *ppiov)
>  	return page_pool_iov_owner(ppiov)->binding;
>  }
>  
> +static inline int page_pool_iov_refcount(const struct page_pool_iov *ppiov)
> +{
> +	return refcount_read(&ppiov->refcount);
> +}
> +
> +static inline void page_pool_iov_get_many(struct page_pool_iov *ppiov,
> +					  unsigned int count)
> +{
> +	refcount_add(count, &ppiov->refcount);
> +}
> +
> +void __page_pool_iov_free(struct page_pool_iov *ppiov);
> +
> +static inline void page_pool_iov_put_many(struct page_pool_iov *ppiov,
> +					  unsigned int count)
> +{
> +	if (!refcount_sub_and_test(count, &ppiov->refcount))
> +		return;
> +
> +	__page_pool_iov_free(ppiov);
> +}
> +
> +/* page pool mm helpers */
> +
> +static inline bool page_is_page_pool_iov(const struct page *page)
> +{
> +	return (unsigned long)page & PP_DEVMEM;
> +}

Speaking of bpf: one thing that might be problematic with this PP_DEVMEM
bit is that it will make debugging with bpftrace a bit (more)
complicated. If somebody were trying to get to that page_pool_iov from
the frags, they will have to do the equivalent of page_is_page_pool_iov,
but probably not a big deal? (thinking out loud)

^ permalink raw reply

* [PATCH net v3] tcp: Fix -Wc23-extensions in tcp_options_write()
From: Nathan Chancellor @ 2023-11-06 21:14 UTC (permalink / raw)
  To: edumazet, davem, dsahern, kuba, pabeni
  Cc: ndesaulniers, trix, 0x7f454c46, noureddine, hch, netdev,
	linux-kernel, llvm, patches, Nathan Chancellor

Clang warns (or errors with CONFIG_WERROR=y) when CONFIG_TCP_AO is set:

  net/ipv4/tcp_output.c:663:2: error: label at end of compound statement is a C23 extension [-Werror,-Wc23-extensions]
    663 |         }
        |         ^
  1 error generated.

On earlier releases (such as clang-11, the current minimum supported
version for building the kernel) that do not support C23, this was a
hard error unconditionally:

  net/ipv4/tcp_output.c:663:2: error: expected statement
          }
          ^
  1 error generated.

While adding a semicolon after the label would resolve this, it is more
in line with the kernel as a whole to refactor this block into a
standalone function, which means the goto a label construct can just be
replaced with a return statement. Do so to resolve the warning.

Closes: https://github.com/ClangBuiltLinux/linux/issues/1953
Fixes: 1e03d32bea8e ("net/tcp: Add TCP-AO sign to outgoing packets")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
---
Changes in v3:
- Don't use a pointer to a pointer for ptr parameter to avoid the extra
  indirection in process_tcp_ao_options(), just return the modified ptr
  value back to the caller (Eric)
- Link to v2: https://lore.kernel.org/r/20231106-tcp-ao-fix-label-in-compound-statement-warning-v2-1-91eff6e1648c@kernel.org

Changes in v2:
- Break out problematic block into its own function so that goto can be
  replaced with a simple return, instead of the simple semicolon
  approach of v1 (Christoph)
- Link to v1: https://lore.kernel.org/r/20231031-tcp-ao-fix-label-in-compound-statement-warning-v1-1-c9731d115f17@kernel.org
---
 net/ipv4/tcp_output.c | 70 ++++++++++++++++++++++++++++-----------------------
 1 file changed, 39 insertions(+), 31 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0d8dd5b7e2e5..eb13a55d660c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -601,6 +601,44 @@ static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb,
 }
 #endif
 
+static __be32 *process_tcp_ao_options(struct tcp_sock *tp,
+				      const struct tcp_request_sock *tcprsk,
+				      struct tcp_out_options *opts,
+				      struct tcp_key *key, __be32 *ptr)
+{
+#ifdef CONFIG_TCP_AO
+	u8 maclen = tcp_ao_maclen(key->ao_key);
+
+	if (tcprsk) {
+		u8 aolen = maclen + sizeof(struct tcp_ao_hdr);
+
+		*ptr++ = htonl((TCPOPT_AO << 24) | (aolen << 16) |
+			       (tcprsk->ao_keyid << 8) |
+			       (tcprsk->ao_rcv_next));
+	} else {
+		struct tcp_ao_key *rnext_key;
+		struct tcp_ao_info *ao_info;
+
+		ao_info = rcu_dereference_check(tp->ao_info,
+			lockdep_sock_is_held(&tp->inet_conn.icsk_inet.sk));
+		rnext_key = READ_ONCE(ao_info->rnext_key);
+		if (WARN_ON_ONCE(!rnext_key))
+			return ptr;
+		*ptr++ = htonl((TCPOPT_AO << 24) |
+			       (tcp_ao_len(key->ao_key) << 16) |
+			       (key->ao_key->sndid << 8) |
+			       (rnext_key->rcvid));
+	}
+	opts->hash_location = (__u8 *)ptr;
+	ptr += maclen / sizeof(*ptr);
+	if (unlikely(maclen % sizeof(*ptr))) {
+		memset(ptr, TCPOPT_NOP, sizeof(*ptr));
+		ptr++;
+	}
+#endif
+	return ptr;
+}
+
 /* Write previously computed TCP options to the packet.
  *
  * Beware: Something in the Internet is very sensitive to the ordering of
@@ -629,37 +667,7 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
 		opts->hash_location = (__u8 *)ptr;
 		ptr += 4;
 	} else if (tcp_key_is_ao(key)) {
-#ifdef CONFIG_TCP_AO
-		u8 maclen = tcp_ao_maclen(key->ao_key);
-
-		if (tcprsk) {
-			u8 aolen = maclen + sizeof(struct tcp_ao_hdr);
-
-			*ptr++ = htonl((TCPOPT_AO << 24) | (aolen << 16) |
-				       (tcprsk->ao_keyid << 8) |
-				       (tcprsk->ao_rcv_next));
-		} else {
-			struct tcp_ao_key *rnext_key;
-			struct tcp_ao_info *ao_info;
-
-			ao_info = rcu_dereference_check(tp->ao_info,
-				lockdep_sock_is_held(&tp->inet_conn.icsk_inet.sk));
-			rnext_key = READ_ONCE(ao_info->rnext_key);
-			if (WARN_ON_ONCE(!rnext_key))
-				goto out_ao;
-			*ptr++ = htonl((TCPOPT_AO << 24) |
-				       (tcp_ao_len(key->ao_key) << 16) |
-				       (key->ao_key->sndid << 8) |
-				       (rnext_key->rcvid));
-		}
-		opts->hash_location = (__u8 *)ptr;
-		ptr += maclen / sizeof(*ptr);
-		if (unlikely(maclen % sizeof(*ptr))) {
-			memset(ptr, TCPOPT_NOP, sizeof(*ptr));
-			ptr++;
-		}
-out_ao:
-#endif
+		ptr = process_tcp_ao_options(tp, tcprsk, opts, key, ptr);
 	}
 	if (unlikely(opts->mss)) {
 		*ptr++ = htonl((TCPOPT_MSS << 24) |

---
base-commit: c1ed833e0b3b7b9edc82b97b73b2a8a10ceab241
change-id: 20231031-tcp-ao-fix-label-in-compound-statement-warning-ebd6c9978498

Best regards,
-- 
Nathan Chancellor <nathan@kernel.org>


^ permalink raw reply related

* Re: [RFC PATCH v3 10/12] tcp: RX path for devmem TCP
From: Willem de Bruijn @ 2023-11-06 21:14 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Stanislav Fomichev, netdev, linux-kernel, linux-arch,
	linux-kselftest, linux-media, dri-devel, linaro-mm-sig,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	David Ahern, Shuah Khan, Sumit Semwal, Christian König,
	Shakeel Butt, Jeroen de Borst, Praveen Kaligineedi,
	Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <CAHS8izNFv7r6vqYR_TYqcCuDO61F+nnNMhsSu=DrYWSr3sVgrA@mail.gmail.com>

> > IMHO, we need a better UAPI to receive the tokens and give them back to
> > the kernel. CMSG + setsockopt(SO_DEVMEM_DONTNEED) get the job done,
> > but look dated and hacky :-(
> >
> > We should either do some kind of user/kernel shared memory queue to
> > receive/return the tokens (similar to what Jonathan was doing in his
> > proposal?)
>
> I'll take a look at Jonathan's proposal, sorry, I'm not immediately
> familiar but I wanted to respond :-) But is the suggestion here to
> build a new kernel-user communication channel primitive for the
> purpose of passing the information in the devmem cmsg? IMHO that seems
> like an overkill. Why add 100-200 lines of code to the kernel to add
> something that can already be done with existing primitives? I don't
> see anything concretely wrong with cmsg & setsockopt approach, and if
> we switch to something I'd prefer to switch to an existing primitive
> for simplicity?
>
> The only other existing primitive to pass data outside of the linear
> buffer is the MSG_ERRQUEUE that is used for zerocopy. Is that
> preferred? Any other suggestions or existing primitives I'm not aware
> of?
>
> > or bite the bullet and switch to io_uring.
> >
>
> IMO io_uring & socket support are orthogonal, and one doesn't preclude
> the other. As you know we like to use sockets and I believe there are
> issues with io_uring adoption at Google that I'm not familiar with
> (and could be wrong). I'm interested in exploring io_uring support as
> a follow up but I think David Wei will be interested in io_uring
> support as well anyway.

I also disagree that we need to replace a standard socket interface
with something "faster", in quotes.

This interface is not the bottleneck to the target workload.

Replacing the synchronous sockets interface with something more
performant for workloads where it is, is an orthogonal challenge.
However we do that, I think that traditional sockets should continue
to be supported.

The feature may already even work with io_uring, as both recvmsg with
cmsg and setsockopt have io_uring support now.

^ permalink raw reply

* Re: [PATCH net v2] tcp: Fix -Wc23-extensions in tcp_options_write()
From: Nathan Chancellor @ 2023-11-06 21:16 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Eric Dumazet, davem, dsahern, pabeni, ndesaulniers, trix,
	0x7f454c46, noureddine, hch, netdev, linux-kernel, llvm, patches
In-Reply-To: <20231106125257.43f52b1f@kernel.org>

On Mon, Nov 06, 2023 at 12:52:57PM -0800, Jakub Kicinski wrote:
> On Mon, 6 Nov 2023 08:58:06 -0700 Nathan Chancellor wrote:
> > Ah, this suggestion is much better, thanks. I'll make this adjustment
> > and send a v3 later today in case others have any suggested changes (I
> > know netdev prefers waiting 24 hours for another revision but I'd like
> > to get this warning cleared up by -rc1 so it does not proliferate into
> > other trees and I sent v1 almost a week ago).
> 
> Definitely, sorry about the delay, feel free to post v3 ASAP.

Done, thanks Jakub!

https://lore.kernel.org/20231106-tcp-ao-fix-label-in-compound-statement-warning-v3-1-b54a64602a85@kernel.org/

Cheers,
Nathan

^ permalink raw reply

* Re: [RFC PATCH v3 10/12] tcp: RX path for devmem TCP
From: Stanislav Fomichev @ 2023-11-06 21:17 UTC (permalink / raw)
  To: Mina Almasry
  Cc: netdev, linux-kernel, linux-arch, linux-kselftest, linux-media,
	dri-devel, linaro-mm-sig, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Arnd Bergmann, David Ahern, Willem de Bruijn,
	Shuah Khan, Sumit Semwal, Christian König, Shakeel Butt,
	Jeroen de Borst, Praveen Kaligineedi, Willem de Bruijn,
	Kaiyuan Zhang
In-Reply-To: <CAHS8izNFv7r6vqYR_TYqcCuDO61F+nnNMhsSu=DrYWSr3sVgrA@mail.gmail.com>

On 11/06, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 10:44 AM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On 11/05, Mina Almasry wrote:
> > > In tcp_recvmsg_locked(), detect if the skb being received by the user
> > > is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM
> > > flag - pass it to tcp_recvmsg_devmem() for custom handling.
> > >
> > > tcp_recvmsg_devmem() copies any data in the skb header to the linear
> > > buffer, and returns a cmsg to the user indicating the number of bytes
> > > returned in the linear buffer.
> > >
> > > tcp_recvmsg_devmem() then loops over the unaccessible devmem skb frags,
> > > and returns to the user a cmsg_devmem indicating the location of the
> > > data in the dmabuf device memory. cmsg_devmem contains this information:
> > >
> > > 1. the offset into the dmabuf where the payload starts. 'frag_offset'.
> > > 2. the size of the frag. 'frag_size'.
> > > 3. an opaque token 'frag_token' to return to the kernel when the buffer
> > > is to be released.
> > >
> > > The pages awaiting freeing are stored in the newly added
> > > sk->sk_user_pages, and each page passed to userspace is get_page()'d.
> > > This reference is dropped once the userspace indicates that it is
> > > done reading this page.  All pages are released when the socket is
> > > destroyed.
> > >
> > > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > > Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> > > Signed-off-by: Mina Almasry <almasrymina@google.com>
> > >
> > > ---
> > >
> > > RFC v3:
> > > - Fixed issue with put_cmsg() failing silently.
> > >
> > > ---
> > >  include/linux/socket.h            |   1 +
> > >  include/net/page_pool/helpers.h   |   9 ++
> > >  include/net/sock.h                |   2 +
> > >  include/uapi/asm-generic/socket.h |   5 +
> > >  include/uapi/linux/uio.h          |   6 +
> > >  net/ipv4/tcp.c                    | 189 +++++++++++++++++++++++++++++-
> > >  net/ipv4/tcp_ipv4.c               |   7 ++
> > >  7 files changed, 214 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/include/linux/socket.h b/include/linux/socket.h
> > > index cfcb7e2c3813..fe2b9e2081bb 100644
> > > --- a/include/linux/socket.h
> > > +++ b/include/linux/socket.h
> > > @@ -326,6 +326,7 @@ struct ucred {
> > >                                         * plain text and require encryption
> > >                                         */
> > >
> > > +#define MSG_SOCK_DEVMEM 0x2000000    /* Receive devmem skbs as cmsg */
> >
> > Sharing the feedback that I've been providing internally on the public list:
> >
> 
> There may have been a miscommunication. I don't recall hearing this
> specific feedback from you, at least in the last few months. Sorry if
> it seemed like I'm ignoring feedback :)

No worries, there was a thread long time ago about this whole token
interface and whether it should support out-of-order refills, etc.

> > IMHO, we need a better UAPI to receive the tokens and give them back to
> > the kernel. CMSG + setsockopt(SO_DEVMEM_DONTNEED) get the job done,
> > but look dated and hacky :-(
> >
> > We should either do some kind of user/kernel shared memory queue to
> > receive/return the tokens (similar to what Jonathan was doing in his
> > proposal?)
> 
> I'll take a look at Jonathan's proposal, sorry, I'm not immediately
> familiar but I wanted to respond :-) But is the suggestion here to
> build a new kernel-user communication channel primitive for the
> purpose of passing the information in the devmem cmsg? IMHO that seems
> like an overkill. Why add 100-200 lines of code to the kernel to add
> something that can already be done with existing primitives? I don't
> see anything concretely wrong with cmsg & setsockopt approach, and if
> we switch to something I'd prefer to switch to an existing primitive
> for simplicity?
> 
> The only other existing primitive to pass data outside of the linear
> buffer is the MSG_ERRQUEUE that is used for zerocopy. Is that
> preferred? Any other suggestions or existing primitives I'm not aware
> of?

I guess I'm just wondering whether other people have any suggestions
here. Not sure Jonathan's way was better, but we fundamentally
have two queues between the kernel and the userspace:
- userspace receiving tokens (recvmsg + magical flag)
- userspace refilling tokens (setsockopt + magical flag)

So having some kind of shared memory producer-consumer queue feels natural.
And using 'classic' socket api here feels like a stretch, idk.

But maybe I'm overthinking and overcomplicating :-)

> > or bite the bullet and switch to io_uring.
> >
> 
> IMO io_uring & socket support are orthogonal, and one doesn't preclude
> the other. As you know we like to use sockets and I believe there are
> issues with io_uring adoption at Google that I'm not familiar with
> (and could be wrong). I'm interested in exploring io_uring support as
> a follow up but I think David Wei will be interested in io_uring
> support as well anyway.

Ack, might be one more reason on our side to adopt iouring :-p

> > I was also suggesting to do it via netlink initially, but it's probably
> > a bit slow for these purpose, idk.
> 
> Yeah, I hear netlink is reserved for control paths and is
> inappropriate for data path, but I'll let folks correct me if wrong.
> 
> -- 
> Thanks,
> Mina

^ permalink raw reply

* Re: [PATCH net] net: phylink: initialize carrier state at creation
From: Andrew Lunn @ 2023-11-06 21:26 UTC (permalink / raw)
  To: Klaus Kudielka
  Cc: Russell King, Heiner Kallweit, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, netdev, linux-kernel
In-Reply-To: <20231106180506.2665-1-klaus.kudielka@gmail.com>

On Mon, Nov 06, 2023 at 07:05:06PM +0100, Klaus Kudielka wrote:
> Background: Turris Omnia (Armada 385); eth2 (mvneta) connected to SFP bus;
> SFP module is present, but no fiber connected, so definitely no carrier.
> 
> After booting, eth2 is down, but netdev LED trigger surprisingly reports
> link active. Then, after "ip link set eth2 up", the link indicator goes
> away - as I would have expected it from the beginning.
> 
> It turns out, that the default carrier state after netdev creation is
> "carrier ok". Some ethernet drivers explicitly call netif_carrier_off
> during probing, others (like mvneta) don't - which explains the current
> behaviour: only when the device is brought up, phylink_start calls
> netif_carrier_off.
> 
> Fix this for all drivers, by calling netif_carrier_off in phylink_create.

I would actually say: Fix this for all drivers using phylink, by
calling...

You marked this patch for net, so it should be backported. Ideally you
should include a Fixes: tag, indicating when the problem was
introduced. That is bit hard in this case. Its been broken
forever. But adding LED support made this observable. So maybe a
Fixes: tag based on when the LED trigger was added?

You should also add:

Cc: stable@vger.kernel.org

There is more about this in:

https://docs.kernel.org/process/stable-kernel-rules.html

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH net v3] tcp: Fix -Wc23-extensions in tcp_options_write()
From: Dmitry Safonov @ 2023-11-06 21:26 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: ndesaulniers, trix, noureddine, hch, netdev, linux-kernel, llvm,
	patches, edumazet, davem, dsahern, kuba, pabeni
In-Reply-To: <20231106-tcp-ao-fix-label-in-compound-statement-warning-v3-1-b54a64602a85@kernel.org>

On 11/6/23 21:14, Nathan Chancellor wrote:
> Clang warns (or errors with CONFIG_WERROR=y) when CONFIG_TCP_AO is set:
> 
>   net/ipv4/tcp_output.c:663:2: error: label at end of compound statement is a C23 extension [-Werror,-Wc23-extensions]
>     663 |         }
>         |         ^
>   1 error generated.
> 
> On earlier releases (such as clang-11, the current minimum supported
> version for building the kernel) that do not support C23, this was a
> hard error unconditionally:
> 
>   net/ipv4/tcp_output.c:663:2: error: expected statement
>           }
>           ^
>   1 error generated.
> 
> While adding a semicolon after the label would resolve this, it is more
> in line with the kernel as a whole to refactor this block into a
> standalone function, which means the goto a label construct can just be
> replaced with a return statement. Do so to resolve the warning.
> 
> Closes: https://github.com/ClangBuiltLinux/linux/issues/1953
> Fixes: 1e03d32bea8e ("net/tcp: Add TCP-AO sign to outgoing packets")
> Signed-off-by: Nathan Chancellor <nathan@kernel.org>

Seems like exactly the fix that my git testing tree had, with an
exception to naming the helper tcp_ao_options_write().
But then I found* your patch-v1 and decided not to send an alternative
patch.

Thanks for fixing this,
Reviewed-by: Dmitry Safonov <dima@arista.com>

*had to fix my Gmail lkml filter to label not only emails with cc/to my
name, but also the raw email address (usually, I got them to/cc "Dmitry
Safonov", but this one didn't have the name and got lost in the lkml pile).

> ---
> Changes in v3:
> - Don't use a pointer to a pointer for ptr parameter to avoid the extra
>   indirection in process_tcp_ao_options(), just return the modified ptr
>   value back to the caller (Eric)
> - Link to v2: https://lore.kernel.org/r/20231106-tcp-ao-fix-label-in-compound-statement-warning-v2-1-91eff6e1648c@kernel.org
> 
> Changes in v2:
> - Break out problematic block into its own function so that goto can be
>   replaced with a simple return, instead of the simple semicolon
>   approach of v1 (Christoph)
> - Link to v1: https://lore.kernel.org/r/20231031-tcp-ao-fix-label-in-compound-statement-warning-v1-1-c9731d115f17@kernel.org
> ---
>  net/ipv4/tcp_output.c | 70 ++++++++++++++++++++++++++++-----------------------
>  1 file changed, 39 insertions(+), 31 deletions(-)
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 0d8dd5b7e2e5..eb13a55d660c 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -601,6 +601,44 @@ static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb,
>  }
>  #endif
>  
> +static __be32 *process_tcp_ao_options(struct tcp_sock *tp,
> +				      const struct tcp_request_sock *tcprsk,
> +				      struct tcp_out_options *opts,
> +				      struct tcp_key *key, __be32 *ptr)
> +{
> +#ifdef CONFIG_TCP_AO
> +	u8 maclen = tcp_ao_maclen(key->ao_key);
> +
> +	if (tcprsk) {
> +		u8 aolen = maclen + sizeof(struct tcp_ao_hdr);
> +
> +		*ptr++ = htonl((TCPOPT_AO << 24) | (aolen << 16) |
> +			       (tcprsk->ao_keyid << 8) |
> +			       (tcprsk->ao_rcv_next));
> +	} else {
> +		struct tcp_ao_key *rnext_key;
> +		struct tcp_ao_info *ao_info;
> +
> +		ao_info = rcu_dereference_check(tp->ao_info,
> +			lockdep_sock_is_held(&tp->inet_conn.icsk_inet.sk));
> +		rnext_key = READ_ONCE(ao_info->rnext_key);
> +		if (WARN_ON_ONCE(!rnext_key))
> +			return ptr;
> +		*ptr++ = htonl((TCPOPT_AO << 24) |
> +			       (tcp_ao_len(key->ao_key) << 16) |
> +			       (key->ao_key->sndid << 8) |
> +			       (rnext_key->rcvid));
> +	}
> +	opts->hash_location = (__u8 *)ptr;
> +	ptr += maclen / sizeof(*ptr);
> +	if (unlikely(maclen % sizeof(*ptr))) {
> +		memset(ptr, TCPOPT_NOP, sizeof(*ptr));
> +		ptr++;
> +	}
> +#endif
> +	return ptr;
> +}
> +
>  /* Write previously computed TCP options to the packet.
>   *
>   * Beware: Something in the Internet is very sensitive to the ordering of
> @@ -629,37 +667,7 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
>  		opts->hash_location = (__u8 *)ptr;
>  		ptr += 4;
>  	} else if (tcp_key_is_ao(key)) {
> -#ifdef CONFIG_TCP_AO
> -		u8 maclen = tcp_ao_maclen(key->ao_key);
> -
> -		if (tcprsk) {
> -			u8 aolen = maclen + sizeof(struct tcp_ao_hdr);
> -
> -			*ptr++ = htonl((TCPOPT_AO << 24) | (aolen << 16) |
> -				       (tcprsk->ao_keyid << 8) |
> -				       (tcprsk->ao_rcv_next));
> -		} else {
> -			struct tcp_ao_key *rnext_key;
> -			struct tcp_ao_info *ao_info;
> -
> -			ao_info = rcu_dereference_check(tp->ao_info,
> -				lockdep_sock_is_held(&tp->inet_conn.icsk_inet.sk));
> -			rnext_key = READ_ONCE(ao_info->rnext_key);
> -			if (WARN_ON_ONCE(!rnext_key))
> -				goto out_ao;
> -			*ptr++ = htonl((TCPOPT_AO << 24) |
> -				       (tcp_ao_len(key->ao_key) << 16) |
> -				       (key->ao_key->sndid << 8) |
> -				       (rnext_key->rcvid));
> -		}
> -		opts->hash_location = (__u8 *)ptr;
> -		ptr += maclen / sizeof(*ptr);
> -		if (unlikely(maclen % sizeof(*ptr))) {
> -			memset(ptr, TCPOPT_NOP, sizeof(*ptr));
> -			ptr++;
> -		}
> -out_ao:
> -#endif
> +		ptr = process_tcp_ao_options(tp, tcprsk, opts, key, ptr);
>  	}
>  	if (unlikely(opts->mss)) {
>  		*ptr++ = htonl((TCPOPT_MSS << 24) |
> 
> ---
> base-commit: c1ed833e0b3b7b9edc82b97b73b2a8a10ceab241
> change-id: 20231031-tcp-ao-fix-label-in-compound-statement-warning-ebd6c9978498

Thanks,
             Dmitry


^ permalink raw reply

* Re: [PATCH bpf 1/6] netkit: Add tstats per-CPU traffic counters
From: Jakub Kicinski @ 2023-11-06 21:28 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: martin.lau, netdev, bpf, Nikolay Aleksandrov
In-Reply-To: <20231103222748.12551-2-daniel@iogearbox.net>

On Fri,  3 Nov 2023 23:27:43 +0100 Daniel Borkmann wrote:
> Add dev->tstats traffic accounting to netkit. The latter contains per-CPU
> RX and TX counters.
> 
> The dev's TX counters are bumped upon pass/unspec as well as redirect
> verdicts, in other words, on everything except for drops.
> 
> The dev's RX counters are bumped upon successful __netif_rx(), as well
> as from skb_do_redirect() (not part of this commit here).
> 
> Using dev->lstats with having just a single packets/bytes counter and
> inferring one another's RX counters from the peer dev's lstats is not
> possible given skb_do_redirect() can also bump the device's stats.

sorry for the delay in replying, I'll comment here instead of on:

https://lore.kernel.org/all/6d5cb0ef-fabc-7ca3-94b2-5b1925a6805f@iogearbox.net/

What I had in mind was to have the driver just set the type of stats.
That way it doesn't have to bother with error handling either
(allocation failure checking, making sure free happens in the right
spot etc. all happen in the core). Here's a completely untested diff:



diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 9980517ed8b0..c23cb7dc0122 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1506,25 +1506,12 @@ static void veth_free_queues(struct net_device *dev)
 
 static int veth_dev_init(struct net_device *dev)
 {
-	int err;
-
-	dev->lstats = netdev_alloc_pcpu_stats(struct pcpu_lstats);
-	if (!dev->lstats)
-		return -ENOMEM;
-
-	err = veth_alloc_queues(dev);
-	if (err) {
-		free_percpu(dev->lstats);
-		return err;
-	}
-
-	return 0;
+	return veth_alloc_queues(dev);
 }
 
 static void veth_dev_free(struct net_device *dev)
 {
 	veth_free_queues(dev);
-	free_percpu(dev->lstats);
 }
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
@@ -1802,6 +1789,8 @@ static void veth_setup(struct net_device *dev)
 	dev->hw_enc_features = VETH_FEATURES;
 	dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
 	netif_set_tso_max_size(dev, GSO_MAX_SIZE);
+
+	dev->pcpu_stat_type = NETDEV_PCPU_STAT_LSTAT;
 }
 
 /*
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 208c63f177f4..25e71480ca58 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1797,6 +1797,13 @@ enum netdev_ml_priv_type {
 	ML_PRIV_CAN,
 };
 
+enum netdev_stat_type {
+	NETDEV_PCPU_STAT_NONE,
+	NETDEV_PCPU_STAT_LSTAT, /* struct pcpu_lstats */
+	NETDEV_PCPU_STAT_TSTAT, /* struct pcpu_sw_netstats */
+	NETDEV_PCPU_STAT_DSTAT, /* struct pcpu_dstats */
+};
+
 /**
  *	struct net_device - The DEVICE structure.
  *
@@ -2354,6 +2361,8 @@ struct net_device {
 	void				*ml_priv;
 	enum netdev_ml_priv_type	ml_priv_type;
 
+	/** @pcpu_stat_type: type of per-CPU stats in use */
+	enum netdev_stat_type pcpu_stat_type:8;
 	union {
 		struct pcpu_lstats __percpu		*lstats;
 		struct pcpu_sw_netstats __percpu	*tstats;
diff --git a/net/core/dev.c b/net/core/dev.c
index 0d548431f3fa..15fec94c7d24 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10049,6 +10049,45 @@ void netif_tx_stop_all_queues(struct net_device *dev)
 }
 EXPORT_SYMBOL(netif_tx_stop_all_queues);
 
+static int netdev_do_alloc_pcpu_stats(struct net_device *dev)
+{
+	void __percpu *v;
+
+	switch (dev->pcpu_stat_type) {
+	case NETDEV_PCPU_STAT_NONE:
+		return 0;
+	case NETDEV_PCPU_STAT_LSTAT:
+		v = dev->lstats = netdev_alloc_pcpu_stats(struct pcpu_lstats);
+		break;
+	case NETDEV_PCPU_STAT_TSTAT:
+		v = dev->tstats =
+			netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
+		break;
+	case NETDEV_PCPU_STAT_DSTAT:
+		v = dev->dstats = netdev_alloc_pcpu_stats(struct pcpu_dstats);
+		break;
+	}
+
+	return v ? 0 : -ENOMEM;
+}
+
+static void netdev_do_free_pcpu_stats(struct net_device *dev)
+{
+	switch (dev->pcpu_stat_type) {
+	case NETDEV_PCPU_STAT_NONE:
+		return;
+	case NETDEV_PCPU_STAT_LSTAT:
+		free_percpu(dev->lstats);
+		break;
+	case NETDEV_PCPU_STAT_TSTAT:
+		free_percpu(dev->tstats);
+		break;
+	case NETDEV_PCPU_STAT_DSTAT:
+		free_percpu(dev->dstats);
+		break;
+	}
+}
+
 /**
  * register_netdevice() - register a network device
  * @dev: device to register
@@ -10109,9 +10148,13 @@ int register_netdevice(struct net_device *dev)
 		goto err_uninit;
 	}
 
+	ret = netdev_do_alloc_pcpu_stats(dev);
+	if (ret)
+		goto err_uninit;
+
 	ret = dev_index_reserve(net, dev->ifindex);
 	if (ret < 0)
-		goto err_uninit;
+		goto err_free_pcpu;
 	dev->ifindex = ret;
 
 	/* Transfer changeable features to wanted_features and enable
@@ -10217,6 +10260,8 @@ int register_netdevice(struct net_device *dev)
 	call_netdevice_notifiers(NETDEV_PRE_UNINIT, dev);
 err_ifindex_release:
 	dev_index_release(net, dev->ifindex);
+err_free_pcpu:
+	netdev_do_free_pcpu_stats(dev);
 err_uninit:
 	if (dev->netdev_ops->ndo_uninit)
 		dev->netdev_ops->ndo_uninit(dev);
@@ -10469,6 +10514,7 @@ void netdev_run_todo(void)
 		WARN_ON(rcu_access_pointer(dev->ip_ptr));
 		WARN_ON(rcu_access_pointer(dev->ip6_ptr));
 
+		netdev_do_free_pcpu_stats(dev);
 		if (dev->priv_destructor)
 			dev->priv_destructor(dev);
 		if (dev->needs_free_netdev)

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox