Netdev List
 help / color / mirror / Atom feed
* Re: [RFC bpf-next 2/8] net: introduce XDP features flag
From: Jakub Kicinski @ 2022-12-20 23:39 UTC (permalink / raw)
  To: Marek Majtyka
  Cc: Lorenzo Bianconi, Lorenzo Bianconi, bpf, netdev, ast, daniel,
	andrii, davem, hawk, pabeni, edumazet, toke, memxor, saeedm,
	anthony.l.nguyen, gospo, vladimir.oltean, nbd, john, leon,
	simon.horman, aelior, christophe.jaillet, ecree.xilinx,
	grygorii.strashko, mst, bjorn, magnus.karlsson,
	maciej.fijalkowski, intel-wired-lan
In-Reply-To: <CAAOQfrF963NoMhQUTdGXyzLMdAjHfUmvzvxpOL0A1Cv4NhY97w@mail.gmail.com>

On Tue, 20 Dec 2022 23:51:31 +0100 Marek Majtyka wrote:
> Everybody is allowed to make a good use of it. Every improvement is
> highly appreciated. Thanks Lorenzo for taking this over.

IIUC this comment refers to the rtnl -> genl/yaml conversion.
In which case, unless someone objects, I'll take a stab at it 
in an hour or two and push the result out my kernel.org tree ...

^ permalink raw reply

* Re: [RFC bpf-next 2/8] net: introduce XDP features flag
From: Stanislav Fomichev @ 2022-12-20 23:36 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: Lorenzo Bianconi, Marek Majtyka, bpf, netdev, ast, daniel, andrii,
	davem, kuba, hawk, pabeni, edumazet, toke, memxor, saeedm,
	anthony.l.nguyen, gospo, vladimir.oltean, nbd, john, leon,
	simon.horman, aelior, christophe.jaillet, ecree.xilinx,
	grygorii.strashko, mst, bjorn, magnus.karlsson,
	maciej.fijalkowski, intel-wired-lan
In-Reply-To: <Y6I2VyBCz7YRxxTR@localhost.localdomain>

On Tue, Dec 20, 2022 at 2:25 PM Lorenzo Bianconi <lorenzo@kernel.org> wrote:
>
> > On Tue, Dec 20, 2022 at 2:11 AM Lorenzo Bianconi
> > <lorenzo.bianconi@redhat.com> wrote:
> > >
> > > On Dec 19, Stanislav Fomichev wrote:
> > > > On Mon, Dec 19, 2022 at 3:51 PM Marek Majtyka <alardam@gmail.com> wrote:
> > > > >
> > > > > At the time of writing, I wanted to be able to read additional information about the XDP capabilities of each network interface using ethtool. This change was intended for Linux users/admins, and not for XDP experts who mostly don't need it and prefer tasting XDP with netlink and bpf rather than reading network interface features with ethtool.
> > > >
> > > > Anything preventing ethtool from doing probing similar to 'bpftool
> > > > feature probe'?
> > > > The problem with these feature bits is that they might diverge and/or
> > > > not work at all for the backported patches (where the fix/feature has
> > > > been backported, but the part that exports the bit hasn't) :-(
> > > > OTOH, I'm not sure we can probe everything from your list, but we
> > > > might try and see what's missing..
> > >
> > > Hi Stanislav,
> > >
> > > I have not added the ethtool support to this series yet since userspace part is
> > > still missing but I think we can consider XDP as a sort of sw offload so it
> > > would be nice for the user/sysadmin (not xdp or bpf developer) to check the NIC
> > > XDP capabilities similar to what we can already do for other hw offload
> > > features.
> >
> > [..]
> >
> > > Moreover let's consider XDP_REDIRECT of a scatter-gather XDP frame into a
> > > devmap. I do not think there is a way to test if the 'target' device supports
> > > SG and so we are forced to disable this feature until all drivers support it.
> >
> > See below for more questions, but why "target device has prog
> > installed and the aux->xdp_has_frags == true" won't work for the
> > internal kernel consumers?
>
> There are some drivers (e.g. all intel ones) that currently do not support
> non-linear xdp buff in the driver napi callback (XDP_FRAG_RX) but implement
> non-linear xdp buff support in ndo_xdp_xmit callback (XDP_FRAG_TARGET).
> Moreover, I guess for a sysadmin it would be better to check NIC capabilities in
> the same way he/she is used to with other features (e.g. ethool -k ...).
>
> >
> > > Introducing XDP features we can enable it on per-driver basis.
> > > I think the same apply for other capabilities as well and just assuming a given
> > > feature is not supported if an e2e test is not working seems a bit inaccurate.
> >
> > Ok, I see that these bits are used in the later patches in xsk and
> > devmap. But I guess I'm still confused about why we add all these
> > flags, but only use mostly XDP_F_REDIRECT_TARGET; maybe start with
> > that one? And why does it have to be exposed to the userspace?
> > (userspace can still probe per-device features by trying to load
> > different progs?)
>
> There are some drivers (e.g. ixgbevf or cavium thunder) that do not support
> XDP_REDIRECT but just XDP_PASS, XDP_DROP and XDP_TX, so I think we should
> differentiate between XDP_BASIC (XDP_PASS | XDP_DROP | XDP_TX) and XDP_FULL
> (XDP_BASIC | XDP_REDIRECT).
>
> >
> > Also, it seems like XDP_F_REDIRECT_TARGET really means "the bpf
> > program has been installed on this device". Instead of a flag, why not
> > explicitly check whether the target device has a prog installed (and,
> > if needed, whether the installed program has frags support)?
>
> XDP_F_REDIRECT_TARGET is used to inform if netdev implements ndo_xdp_xmit
> callback (most of the XDP drivers do not require to load a bpf program to
> XDP_REDIRECT into them).

All of the above makes sense, thanks for the details. In this case,
agreed that it's probably not possible to probe these easily without
explicit flags :-(

Let me bikeshed the names a bit as well, feel free to ignore...

1. Maybe group XDP_{ABORTED,DROP,PASS,TX,REDIRECT} with some common
prefix? XDP_ACT_xxx (ACT for action)? Or XDP_RET_xxx?

2. Maybe: XDP_SOCK_ZEROCOPY -> XSK_ZEROCOPY ?

3. XDP_HW_OFFLOAD we don't seem to set anywhere? nfp/netdevsim changes
are missing or out of scope?

4. Agree with Jakub, not sure XDP_TX_LOCK doesn't seem relevant?

5. XDP_REDIRECT_TARGET -> XDP_RCV_REDIRECT (can 'receive' and handle
redirects? in this case XDP_ACT_REDIRECT means can 'generate'
redirects)

6. For frags, maybe:

XDP_FRAG_RX     -> XDP_SG_RX
XDP_FRAG_TARGET -> XDP_SG_RCV_REDIRECT (so this is that same as
XDP_RCV_REDIRECT but can handle frags)

But also probably fine to keep FRAG instead of SG to match BPF_F_XDP_HAS_FRAGS?

> Regards,
> Lorenzo
>
> >
> > > Regards,
> > > Lorenzo
> > >
> > > >
> > > > > On Mon, Dec 19, 2022 at 9:03 PM <sdf@google.com> wrote:
> > > > >>
> > > > >> On 12/19, Lorenzo Bianconi wrote:
> > > > >> > From: Marek Majtyka <alardam@gmail.com>
> > > > >>
> > > > >> > Implement support for checking what kind of XDP features a netdev
> > > > >> > supports. Previously, there was no way to do this other than to try to
> > > > >> > create an AF_XDP socket on the interface or load an XDP program and see
> > > > >> > if it worked. This commit changes this by adding a new variable which
> > > > >> > describes all xdp supported functions on pretty detailed level:
> > > > >>
> > > > >> >   - aborted
> > > > >> >   - drop
> > > > >> >   - pass
> > > > >> >   - tx
> > > > >> >   - redirect
> > > > >> >   - sock_zerocopy
> > > > >> >   - hw_offload
> > > > >> >   - redirect_target
> > > > >> >   - tx_lock
> > > > >> >   - frag_rx
> > > > >> >   - frag_target
> > > > >>
> > > > >> > Zerocopy mode requires that redirect XDP operation is implemented in a
> > > > >> > driver and the driver supports also zero copy mode. Full mode requires
> > > > >> > that all XDP operation are implemented in the driver. Basic mode is just
> > > > >> > full mode without redirect operation. Frag target requires
> > > > >> > redirect_target one is supported by the driver.
> > > > >>
> > > > >> Can you share more about _why_ is it needed? If we can already obtain
> > > > >> most of these signals via probing, why export the flags?
> > > > >>
> > > > >> > Initially, these new flags are disabled for all drivers by default.
> > > > >>
> > > > >> > Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > > >> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > > >> > Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > > > >> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > > > >> > Signed-off-by: Marek Majtyka <alardam@gmail.com>
> > > > >> > ---
> > > > >> >   .../networking/netdev-xdp-features.rst        | 60 +++++++++++++++++
> > > > >> >   include/linux/netdevice.h                     |  2 +
> > > > >> >   include/linux/xdp_features.h                  | 64 +++++++++++++++++++
> > > > >> >   include/uapi/linux/if_link.h                  |  7 ++
> > > > >> >   include/uapi/linux/xdp_features.h             | 34 ++++++++++
> > > > >> >   net/core/rtnetlink.c                          | 34 ++++++++++
> > > > >> >   tools/include/uapi/linux/if_link.h            |  7 ++
> > > > >> >   tools/include/uapi/linux/xdp_features.h       | 34 ++++++++++
> > > > >> >   8 files changed, 242 insertions(+)
> > > > >> >   create mode 100644 Documentation/networking/netdev-xdp-features.rst
> > > > >> >   create mode 100644 include/linux/xdp_features.h
> > > > >> >   create mode 100644 include/uapi/linux/xdp_features.h
> > > > >> >   create mode 100644 tools/include/uapi/linux/xdp_features.h
> > > > >>
> > > > >> > diff --git a/Documentation/networking/netdev-xdp-features.rst
> > > > >> > b/Documentation/networking/netdev-xdp-features.rst
> > > > >> > new file mode 100644
> > > > >> > index 000000000000..1dc803fe72dd
> > > > >> > --- /dev/null
> > > > >> > +++ b/Documentation/networking/netdev-xdp-features.rst
> > > > >> > @@ -0,0 +1,60 @@
> > > > >> > +.. SPDX-License-Identifier: GPL-2.0
> > > > >> > +
> > > > >> > +=====================
> > > > >> > +Netdev XDP features
> > > > >> > +=====================
> > > > >> > +
> > > > >> > + * XDP FEATURES FLAGS
> > > > >> > +
> > > > >> > +Following netdev xdp features flags can be retrieved over route netlink
> > > > >> > +interface (compact form) - the same way as netdev feature flags.
> > > > >> > +These features flags are read only and cannot be change at runtime.
> > > > >> > +
> > > > >> > +*  XDP_ABORTED
> > > > >> > +
> > > > >> > +This feature informs if netdev supports xdp aborted action.
> > > > >> > +
> > > > >> > +*  XDP_DROP
> > > > >> > +
> > > > >> > +This feature informs if netdev supports xdp drop action.
> > > > >> > +
> > > > >> > +*  XDP_PASS
> > > > >> > +
> > > > >> > +This feature informs if netdev supports xdp pass action.
> > > > >> > +
> > > > >> > +*  XDP_TX
> > > > >> > +
> > > > >> > +This feature informs if netdev supports xdp tx action.
> > > > >> > +
> > > > >> > +*  XDP_REDIRECT
> > > > >> > +
> > > > >> > +This feature informs if netdev supports xdp redirect action.
> > > > >> > +It assumes the all beforehand mentioned flags are enabled.
> > > > >> > +
> > > > >> > +*  XDP_SOCK_ZEROCOPY
> > > > >> > +
> > > > >> > +This feature informs if netdev driver supports xdp zero copy.
> > > > >> > +It assumes the all beforehand mentioned flags are enabled.
> > > > >> > +
> > > > >> > +*  XDP_HW_OFFLOAD
> > > > >> > +
> > > > >> > +This feature informs if netdev driver supports xdp hw oflloading.
> > > > >> > +
> > > > >> > +*  XDP_TX_LOCK
> > > > >> > +
> > > > >> > +This feature informs if netdev ndo_xdp_xmit function requires locking.
> > > > >> > +
> > > > >> > +*  XDP_REDIRECT_TARGET
> > > > >> > +
> > > > >> > +This feature informs if netdev implements ndo_xdp_xmit callback.
> > > > >> > +
> > > > >> > +*  XDP_FRAG_RX
> > > > >> > +
> > > > >> > +This feature informs if netdev implements non-linear xdp buff support in
> > > > >> > +the driver napi callback.
> > > > >> > +
> > > > >> > +*  XDP_FRAG_TARGET
> > > > >> > +
> > > > >> > +This feature informs if netdev implements non-linear xdp buff support in
> > > > >> > +ndo_xdp_xmit callback. XDP_FRAG_TARGET requires XDP_REDIRECT_TARGET is
> > > > >> > properly
> > > > >> > +supported.
> > > > >> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > > >> > index aad12a179e54..ae5a8564383b 100644
> > > > >> > --- a/include/linux/netdevice.h
> > > > >> > +++ b/include/linux/netdevice.h
> > > > >> > @@ -43,6 +43,7 @@
> > > > >> >   #include <net/xdp.h>
> > > > >>
> > > > >> >   #include <linux/netdev_features.h>
> > > > >> > +#include <linux/xdp_features.h>
> > > > >> >   #include <linux/neighbour.h>
> > > > >> >   #include <uapi/linux/netdevice.h>
> > > > >> >   #include <uapi/linux/if_bonding.h>
> > > > >> > @@ -2362,6 +2363,7 @@ struct net_device {
> > > > >> >       struct rtnl_hw_stats64  *offload_xstats_l3;
> > > > >>
> > > > >> >       struct devlink_port     *devlink_port;
> > > > >> > +     xdp_features_t          xdp_features;
> > > > >> >   };
> > > > >> >   #define to_net_dev(d) container_of(d, struct net_device, dev)
> > > > >>
> > > > >> > diff --git a/include/linux/xdp_features.h b/include/linux/xdp_features.h
> > > > >> > new file mode 100644
> > > > >> > index 000000000000..4e72a86ef329
> > > > >> > --- /dev/null
> > > > >> > +++ b/include/linux/xdp_features.h
> > > > >> > @@ -0,0 +1,64 @@
> > > > >> > +/* SPDX-License-Identifier: GPL-2.0-or-later */
> > > > >> > +/*
> > > > >> > + * Network device xdp features.
> > > > >> > + */
> > > > >> > +#ifndef _LINUX_XDP_FEATURES_H
> > > > >> > +#define _LINUX_XDP_FEATURES_H
> > > > >> > +
> > > > >> > +#include <linux/types.h>
> > > > >> > +#include <linux/bitops.h>
> > > > >> > +#include <asm/byteorder.h>
> > > > >> > +#include <uapi/linux/xdp_features.h>
> > > > >> > +
> > > > >> > +typedef u32 xdp_features_t;
> > > > >> > +
> > > > >> > +#define __XDP_F_BIT(bit)     ((xdp_features_t)1 << (bit))
> > > > >> > +#define __XDP_F(name)                __XDP_F_BIT(XDP_F_##name##_BIT)
> > > > >> > +
> > > > >> > +#define XDP_F_ABORTED                __XDP_F(ABORTED)
> > > > >> > +#define XDP_F_DROP           __XDP_F(DROP)
> > > > >> > +#define XDP_F_PASS           __XDP_F(PASS)
> > > > >> > +#define XDP_F_TX             __XDP_F(TX)
> > > > >> > +#define XDP_F_REDIRECT               __XDP_F(REDIRECT)
> > > > >> > +#define XDP_F_REDIRECT_TARGET        __XDP_F(REDIRECT_TARGET)
> > > > >> > +#define XDP_F_SOCK_ZEROCOPY  __XDP_F(SOCK_ZEROCOPY)
> > > > >> > +#define XDP_F_HW_OFFLOAD     __XDP_F(HW_OFFLOAD)
> > > > >> > +#define XDP_F_TX_LOCK                __XDP_F(TX_LOCK)
> > > > >> > +#define XDP_F_FRAG_RX                __XDP_F(FRAG_RX)
> > > > >> > +#define XDP_F_FRAG_TARGET    __XDP_F(FRAG_TARGET)
> > > > >> > +
> > > > >> > +#define XDP_F_BASIC          (XDP_F_ABORTED | XDP_F_DROP |   \
> > > > >> > +                              XDP_F_PASS | XDP_F_TX)
> > > > >> > +
> > > > >> > +#define XDP_F_FULL           (XDP_F_BASIC | XDP_F_REDIRECT)
> > > > >> > +
> > > > >> > +#define XDP_F_FULL_ZC                (XDP_F_FULL | XDP_F_SOCK_ZEROCOPY)
> > > > >> > +
> > > > >> > +#define XDP_FEATURES_ABORTED_STR             "xdp-aborted"
> > > > >> > +#define XDP_FEATURES_DROP_STR                        "xdp-drop"
> > > > >> > +#define XDP_FEATURES_PASS_STR                        "xdp-pass"
> > > > >> > +#define XDP_FEATURES_TX_STR                  "xdp-tx"
> > > > >> > +#define XDP_FEATURES_REDIRECT_STR            "xdp-redirect"
> > > > >> > +#define XDP_FEATURES_REDIRECT_TARGET_STR     "xdp-redirect-target"
> > > > >> > +#define XDP_FEATURES_SOCK_ZEROCOPY_STR               "xdp-sock-zerocopy"
> > > > >> > +#define XDP_FEATURES_HW_OFFLOAD_STR          "xdp-hw-offload"
> > > > >> > +#define XDP_FEATURES_TX_LOCK_STR             "xdp-tx-lock"
> > > > >> > +#define XDP_FEATURES_FRAG_RX_STR             "xdp-frag-rx"
> > > > >> > +#define XDP_FEATURES_FRAG_TARGET_STR         "xdp-frag-target"
> > > > >> > +
> > > > >> > +#define DECLARE_XDP_FEATURES_TABLE(name, length)                             \
> > > > >> > +     const char name[][length] = {                                           \
> > > > >> > +             [XDP_F_ABORTED_BIT] = XDP_FEATURES_ABORTED_STR,                 \
> > > > >> > +             [XDP_F_DROP_BIT] = XDP_FEATURES_DROP_STR,                       \
> > > > >> > +             [XDP_F_PASS_BIT] = XDP_FEATURES_PASS_STR,                       \
> > > > >> > +             [XDP_F_TX_BIT] = XDP_FEATURES_TX_STR,                           \
> > > > >> > +             [XDP_F_REDIRECT_BIT] = XDP_FEATURES_REDIRECT_STR,               \
> > > > >> > +             [XDP_F_REDIRECT_TARGET_BIT] = XDP_FEATURES_REDIRECT_TARGET_STR, \
> > > > >> > +             [XDP_F_SOCK_ZEROCOPY_BIT] = XDP_FEATURES_SOCK_ZEROCOPY_STR,     \
> > > > >> > +             [XDP_F_HW_OFFLOAD_BIT] = XDP_FEATURES_HW_OFFLOAD_STR,           \
> > > > >> > +             [XDP_F_TX_LOCK_BIT] = XDP_FEATURES_TX_LOCK_STR,                 \
> > > > >> > +             [XDP_F_FRAG_RX_BIT] = XDP_FEATURES_FRAG_RX_STR,                 \
> > > > >> > +             [XDP_F_FRAG_TARGET_BIT] = XDP_FEATURES_FRAG_TARGET_STR,         \
> > > > >> > +     }
> > > > >> > +
> > > > >> > +#endif /* _LINUX_XDP_FEATURES_H */
> > > > >> > diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
> > > > >> > index 1021a7e47a86..971c658ceaea 100644
> > > > >> > --- a/include/uapi/linux/if_link.h
> > > > >> > +++ b/include/uapi/linux/if_link.h
> > > > >> > @@ -374,6 +374,8 @@ enum {
> > > > >>
> > > > >> >       IFLA_DEVLINK_PORT,
> > > > >>
> > > > >> > +     IFLA_XDP_FEATURES,
> > > > >> > +
> > > > >> >       __IFLA_MAX
> > > > >> >   };
> > > > >>
> > > > >> > @@ -1318,6 +1320,11 @@ enum {
> > > > >>
> > > > >> >   #define IFLA_XDP_MAX (__IFLA_XDP_MAX - 1)
> > > > >>
> > > > >> > +enum {
> > > > >> > +     IFLA_XDP_FEATURES_WORD_UNSPEC = 0,
> > > > >> > +     IFLA_XDP_FEATURES_BITS_WORD,
> > > > >> > +};
> > > > >> > +
> > > > >> >   enum {
> > > > >> >       IFLA_EVENT_NONE,
> > > > >> >       IFLA_EVENT_REBOOT,              /* internal reset / reboot */
> > > > >> > diff --git a/include/uapi/linux/xdp_features.h
> > > > >> > b/include/uapi/linux/xdp_features.h
> > > > >> > new file mode 100644
> > > > >> > index 000000000000..48eb42069bcd
> > > > >> > --- /dev/null
> > > > >> > +++ b/include/uapi/linux/xdp_features.h
> > > > >> > @@ -0,0 +1,34 @@
> > > > >> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > > > >> > +/*
> > > > >> > + * Copyright (c) 2020 Intel
> > > > >> > + */
> > > > >> > +
> > > > >> > +#ifndef __UAPI_LINUX_XDP_FEATURES__
> > > > >> > +#define __UAPI_LINUX_XDP_FEATURES__
> > > > >> > +
> > > > >> > +enum {
> > > > >> > +     XDP_F_ABORTED_BIT,
> > > > >> > +     XDP_F_DROP_BIT,
> > > > >> > +     XDP_F_PASS_BIT,
> > > > >> > +     XDP_F_TX_BIT,
> > > > >> > +     XDP_F_REDIRECT_BIT,
> > > > >> > +     XDP_F_REDIRECT_TARGET_BIT,
> > > > >> > +     XDP_F_SOCK_ZEROCOPY_BIT,
> > > > >> > +     XDP_F_HW_OFFLOAD_BIT,
> > > > >> > +     XDP_F_TX_LOCK_BIT,
> > > > >> > +     XDP_F_FRAG_RX_BIT,
> > > > >> > +     XDP_F_FRAG_TARGET_BIT,
> > > > >> > +     /*
> > > > >> > +      * Add your fresh new property above and remember to update
> > > > >> > +      * documentation.
> > > > >> > +      */
> > > > >> > +     XDP_FEATURES_COUNT,
> > > > >> > +};
> > > > >> > +
> > > > >> > +#define XDP_FEATURES_WORDS                   ((XDP_FEATURES_COUNT + 32 - 1) / 32)
> > > > >> > +#define XDP_FEATURES_WORD(blocks, index)     ((blocks)[(index) / 32U])
> > > > >> > +#define XDP_FEATURES_FIELD_FLAG(index)               (1U << (index) % 32U)
> > > > >> > +#define XDP_FEATURES_BIT_IS_SET(blocks, index)        \
> > > > >> > +     (XDP_FEATURES_WORD(blocks, index) & XDP_FEATURES_FIELD_FLAG(index))
> > > > >> > +
> > > > >> > +#endif  /* __UAPI_LINUX_XDP_FEATURES__ */
> > > > >> > diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> > > > >> > index 64289bc98887..1c299746b614 100644
> > > > >> > --- a/net/core/rtnetlink.c
> > > > >> > +++ b/net/core/rtnetlink.c
> > > > >> > @@ -1016,6 +1016,14 @@ static size_t rtnl_xdp_size(void)
> > > > >> >       return xdp_size;
> > > > >> >   }
> > > > >>
> > > > >> > +static size_t rtnl_xdp_features_size(void)
> > > > >> > +{
> > > > >> > +     size_t xdp_size = nla_total_size(0) +   /* nest IFLA_XDP_FEATURES */
> > > > >> > +                       XDP_FEATURES_WORDS * nla_total_size(4);
> > > > >> > +
> > > > >> > +     return xdp_size;
> > > > >> > +}
> > > > >> > +
> > > > >> >   static size_t rtnl_prop_list_size(const struct net_device *dev)
> > > > >> >   {
> > > > >> >       struct netdev_name_node *name_node;
> > > > >> > @@ -1103,6 +1111,7 @@ static noinline size_t if_nlmsg_size(const struct
> > > > >> > net_device *dev,
> > > > >> >              + rtnl_prop_list_size(dev)
> > > > >> >              + nla_total_size(MAX_ADDR_LEN) /* IFLA_PERM_ADDRESS */
> > > > >> >              + rtnl_devlink_port_size(dev)
> > > > >> > +            + rtnl_xdp_features_size() /* IFLA_XDP_FEATURES */
> > > > >> >              + 0;
> > > > >> >   }
> > > > >>
> > > > >> > @@ -1546,6 +1555,27 @@ static int rtnl_xdp_fill(struct sk_buff *skb,
> > > > >> > struct net_device *dev)
> > > > >> >       return err;
> > > > >> >   }
> > > > >>
> > > > >> > +static int rtnl_xdp_features_fill(struct sk_buff *skb, struct net_device
> > > > >> > *dev)
> > > > >> > +{
> > > > >> > +     struct nlattr *attr;
> > > > >> > +
> > > > >> > +     attr = nla_nest_start_noflag(skb, IFLA_XDP_FEATURES);
> > > > >> > +     if (!attr)
> > > > >> > +             return -EMSGSIZE;
> > > > >> > +
> > > > >> > +     BUILD_BUG_ON(XDP_FEATURES_WORDS != 1);
> > > > >> > +     if (nla_put_u32(skb, IFLA_XDP_FEATURES_BITS_WORD, dev->xdp_features))
> > > > >> > +             goto err_cancel;
> > > > >> > +
> > > > >> > +     nla_nest_end(skb, attr);
> > > > >> > +
> > > > >> > +     return 0;
> > > > >> > +
> > > > >> > +err_cancel:
> > > > >> > +     nla_nest_cancel(skb, attr);
> > > > >> > +     return -EMSGSIZE;
> > > > >> > +}
> > > > >> > +
> > > > >> >   static u32 rtnl_get_event(unsigned long event)
> > > > >> >   {
> > > > >> >       u32 rtnl_event_type = IFLA_EVENT_NONE;
> > > > >> > @@ -1904,6 +1934,9 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb,
> > > > >> >       if (rtnl_fill_devlink_port(skb, dev))
> > > > >> >               goto nla_put_failure;
> > > > >>
> > > > >> > +     if (rtnl_xdp_features_fill(skb, dev))
> > > > >> > +             goto nla_put_failure;
> > > > >> > +
> > > > >> >       nlmsg_end(skb, nlh);
> > > > >> >       return 0;
> > > > >>
> > > > >> > @@ -1968,6 +2001,7 @@ static const struct nla_policy
> > > > >> > ifla_policy[IFLA_MAX+1] = {
> > > > >> >       [IFLA_TSO_MAX_SIZE]     = { .type = NLA_REJECT },
> > > > >> >       [IFLA_TSO_MAX_SEGS]     = { .type = NLA_REJECT },
> > > > >> >       [IFLA_ALLMULTI]         = { .type = NLA_REJECT },
> > > > >> > +     [IFLA_XDP_FEATURES]     = { .type = NLA_NESTED },
> > > > >> >   };
> > > > >>
> > > > >> >   static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
> > > > >> > diff --git a/tools/include/uapi/linux/if_link.h
> > > > >> > b/tools/include/uapi/linux/if_link.h
> > > > >> > index 82fe18f26db5..994228e9909a 100644
> > > > >> > --- a/tools/include/uapi/linux/if_link.h
> > > > >> > +++ b/tools/include/uapi/linux/if_link.h
> > > > >> > @@ -354,6 +354,8 @@ enum {
> > > > >>
> > > > >> >       IFLA_DEVLINK_PORT,
> > > > >>
> > > > >> > +     IFLA_XDP_FEATURES,
> > > > >> > +
> > > > >> >       __IFLA_MAX
> > > > >> >   };
> > > > >>
> > > > >> > @@ -1222,6 +1224,11 @@ enum {
> > > > >>
> > > > >> >   #define IFLA_XDP_MAX (__IFLA_XDP_MAX - 1)
> > > > >>
> > > > >> > +enum {
> > > > >> > +     IFLA_XDP_FEATURES_WORD_UNSPEC = 0,
> > > > >> > +     IFLA_XDP_FEATURES_BITS_WORD,
> > > > >> > +};
> > > > >> > +
> > > > >> >   enum {
> > > > >> >       IFLA_EVENT_NONE,
> > > > >> >       IFLA_EVENT_REBOOT,              /* internal reset / reboot */
> > > > >> > diff --git a/tools/include/uapi/linux/xdp_features.h
> > > > >> > b/tools/include/uapi/linux/xdp_features.h
> > > > >> > new file mode 100644
> > > > >> > index 000000000000..48eb42069bcd
> > > > >> > --- /dev/null
> > > > >> > +++ b/tools/include/uapi/linux/xdp_features.h
> > > > >> > @@ -0,0 +1,34 @@
> > > > >> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > > > >> > +/*
> > > > >> > + * Copyright (c) 2020 Intel
> > > > >> > + */
> > > > >> > +
> > > > >> > +#ifndef __UAPI_LINUX_XDP_FEATURES__
> > > > >> > +#define __UAPI_LINUX_XDP_FEATURES__
> > > > >> > +
> > > > >> > +enum {
> > > > >> > +     XDP_F_ABORTED_BIT,
> > > > >> > +     XDP_F_DROP_BIT,
> > > > >> > +     XDP_F_PASS_BIT,
> > > > >> > +     XDP_F_TX_BIT,
> > > > >> > +     XDP_F_REDIRECT_BIT,
> > > > >> > +     XDP_F_REDIRECT_TARGET_BIT,
> > > > >> > +     XDP_F_SOCK_ZEROCOPY_BIT,
> > > > >> > +     XDP_F_HW_OFFLOAD_BIT,
> > > > >> > +     XDP_F_TX_LOCK_BIT,
> > > > >> > +     XDP_F_FRAG_RX_BIT,
> > > > >> > +     XDP_F_FRAG_TARGET_BIT,
> > > > >> > +     /*
> > > > >> > +      * Add your fresh new property above and remember to update
> > > > >> > +      * documentation.
> > > > >> > +      */
> > > > >> > +     XDP_FEATURES_COUNT,
> > > > >> > +};
> > > > >> > +
> > > > >> > +#define XDP_FEATURES_WORDS                   ((XDP_FEATURES_COUNT + 32 - 1) / 32)
> > > > >> > +#define XDP_FEATURES_WORD(blocks, index)     ((blocks)[(index) / 32U])
> > > > >> > +#define XDP_FEATURES_FIELD_FLAG(index)               (1U << (index) % 32U)
> > > > >> > +#define XDP_FEATURES_BIT_IS_SET(blocks, index)        \
> > > > >> > +     (XDP_FEATURES_WORD(blocks, index) & XDP_FEATURES_FIELD_FLAG(index))
> > > > >> > +
> > > > >> > +#endif  /* __UAPI_LINUX_XDP_FEATURES__ */
> > > > >> > --
> > > > >> > 2.38.1
> > > > >>
> > > >

^ permalink raw reply

* Re: [PATCH bpf 2/2] selftests/bpf: tunnel: add sanity test for checksums
From: Jakub Kicinski @ 2022-12-20 23:36 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: bpf, netdev, daniel
In-Reply-To: <7372590a-f40b-17d1-f780-3bd1ce4f30bb@linux.dev>

On Tue, 20 Dec 2022 15:21:08 -0800 Martin KaFai Lau wrote:
> Thanks for the fix and the idea on how to test it.
> 
> I have posted a patch to translate this test to a test for test_progs that can 
> finish and exit such that it can be run continuously in CI.  The test attaches a 
> tc-bpf at lo and the bpf prog directly checks for the skb->ip_summed == 
> CHECKSUM_NONE and the broken csum_start condition.
> 
> If the test_progs patch looks good, patch 1 can be landed first and then land 
> the test_progs patch.  wdyt?

I'm not attached to the test, your patch looks good!

^ permalink raw reply

* Re: [PATCH v7 01/11] leds: add support for hardware driven LEDs
From: Andrew Lunn @ 2022-12-20 23:35 UTC (permalink / raw)
  To: Christian Marangi
  Cc: Russell King (Oracle), Florian Fainelli, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Rob Herring, Krzysztof Kozlowski, Jonathan Corbet, Pavel Machek,
	John Crispin, netdev, devicetree, linux-kernel, linux-doc,
	linux-leds, Tim Harvey, Alexander Stein, Rasmus Villemoes,
	Marek Behún
In-Reply-To: <639ca0a4.050a0220.99395.8fd3@mx.google.com>

On Fri, Dec 16, 2022 at 05:45:25PM +0100, Christian Marangi wrote:
> On Thu, Dec 15, 2022 at 04:13:03PM +0000, Russell King (Oracle) wrote:
> > Hi Christian,
> > 
> > Thanks for the patch.
> > 
> > I think Andrew's email is offline at the moment.
> >
> 
> Notice by gmail spamming me "I CAN'T SEND IT AHHHHH"
> Holidy times I guess?

That was part of the problem. Away travelling when foot gun hit foot,
and i didn't have access to the tools to fix it while away.

    Andrew

^ permalink raw reply

* Re: [PATCH v7 11/11] dt-bindings: net: dsa: qca8k: add LEDs definition example
From: Andrew Lunn @ 2022-12-20 23:23 UTC (permalink / raw)
  To: Christian Marangi
  Cc: Florian Fainelli, Vladimir Oltean, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Rob Herring, Krzysztof Kozlowski,
	Jonathan Corbet, Pavel Machek, Russell King (Oracle),
	John Crispin, netdev, devicetree, linux-kernel, linux-doc,
	linux-leds, Tim Harvey, Alexander Stein, Rasmus Villemoes
In-Reply-To: <20221214235438.30271-12-ansuelsmth@gmail.com>

On Thu, Dec 15, 2022 at 12:54:38AM +0100, Christian Marangi wrote:
> Add LEDs definition example for qca8k using the offload trigger as the
> default trigger and add all the supported offload triggers by the
> switch.
> 
> Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
> ---
>  .../devicetree/bindings/net/dsa/qca8k.yaml    | 24 +++++++++++++++++++
>  1 file changed, 24 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/net/dsa/qca8k.yaml b/Documentation/devicetree/bindings/net/dsa/qca8k.yaml
> index 978162df51f7..4090cf65c41c 100644
> --- a/Documentation/devicetree/bindings/net/dsa/qca8k.yaml
> +++ b/Documentation/devicetree/bindings/net/dsa/qca8k.yaml
> @@ -65,6 +65,8 @@ properties:
>                   internal mdio access is used.
>                   With the legacy mapping the reg corresponding to the internal
>                   mdio is the switch reg with an offset of -1.
> +                 Each phy have at least 3 LEDs connected and can be declared
> +                 using the standard LEDs structure.
>  
>  patternProperties:
>    "^(ethernet-)?ports$":
> @@ -202,6 +204,7 @@ examples:
>      };
>    - |
>      #include <dt-bindings/gpio/gpio.h>
> +    #include <dt-bindings/leds/common.h>
>  
>      mdio {
>          #address-cells = <1>;
> @@ -284,6 +287,27 @@ examples:
>  
>                  internal_phy_port1: ethernet-phy@0 {
>                      reg = <0>;
> +
> +                    leds {
> +                        #address-cells = <1>;
> +                        #size-cells = <0>;
> +
> +                        led@0 {
> +                            reg = <0>;
> +                            color = <LED_COLOR_ID_WHITE>;
> +                            function = LED_FUNCTION_LAN;
> +                            function-enumerator = <1>;
> +                            linux,default-trigger = "netdev";
> +                        };
> +
> +                        led@1 {
> +                            reg = <1>;
> +                            color = <LED_COLOR_ID_AMBER>;
> +                            function = LED_FUNCTION_LAN;
> +                            function-enumerator = <1>;
> +                            linux,default-trigger = "netdev";
> +                        };
> +                    };
>                  };

I don't see anything here which is specific to the QCA8K. I really
hope the same binding should work for any PHY which has LEDs. So
please move this into ethernet-phy.yaml.

       Andrew

^ permalink raw reply

* Re: [PATCH bpf 2/2] selftests/bpf: tunnel: add sanity test for checksums
From: Martin KaFai Lau @ 2022-12-20 23:21 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: bpf, netdev, daniel
In-Reply-To: <20221220004701.402165-2-kuba@kernel.org>

On 12/19/22 4:47 PM, Jakub Kicinski wrote:
> Simple netdevsim based test. Netdevsim will validate xmit'ed
> packets, in particular we care about checksum sanity (along
> the lines of checks inside skb_checksum_help()). Triggering
> skb_checksum_help() directly would require the right HW device
> or a crypto device setup, netdevsim is much simpler.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>   drivers/net/netdevsim/netdev.c                |  5 ++++
>   tools/testing/selftests/bpf/test_tc_tunnel.sh | 27 +++++++++++++++++++
>   2 files changed, 32 insertions(+)
> 
> diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
> index 6db6a75ff9b9..e4808a6d37a4 100644
> --- a/drivers/net/netdevsim/netdev.c
> +++ b/drivers/net/netdevsim/netdev.c
> @@ -33,6 +33,11 @@ static netdev_tx_t nsim_start_xmit(struct sk_buff *skb, struct net_device *dev)
>   	if (!nsim_ipsec_tx(ns, skb))
>   		goto out;
>   
> +	/* Validate the packet */
> +	if (skb->ip_summed == CHECKSUM_PARTIAL)
> +		WARN_ON_ONCE((unsigned int)skb_checksum_start_offset(skb) >=
> +			     skb_headlen(skb));
> +
>   	u64_stats_update_begin(&ns->syncp);
>   	ns->tx_packets++;
>   	ns->tx_bytes += skb->len;
> diff --git a/tools/testing/selftests/bpf/test_tc_tunnel.sh b/tools/testing/selftests/bpf/test_tc_tunnel.sh
> index 334bdfeab940..4dac87f6a6fa 100755
> --- a/tools/testing/selftests/bpf/test_tc_tunnel.sh
> +++ b/tools/testing/selftests/bpf/test_tc_tunnel.sh
> @@ -15,6 +15,7 @@ readonly ns1_v4=192.168.1.1
>   readonly ns2_v4=192.168.1.2
>   readonly ns1_v6=fd::1
>   readonly ns2_v6=fd::2
> +readonly nsim_v4=192.168.2.1
>   
>   # Must match port used by bpf program
>   readonly udpport=5555
> @@ -67,6 +68,10 @@ cleanup() {
>   	if [[ -n $server_pid ]]; then
>   		kill $server_pid 2> /dev/null
>   	fi
> +
> +	if [ -e /sys/bus/netdevsim/devices/netdevsim1 ]; then
> +	    echo 1 > /sys/bus/netdevsim/del_device
> +	fi
>   }
>   
>   server_listen() {
> @@ -93,6 +98,25 @@ verify_data() {
>   	fi
>   }
>   
> +decap_sanity() {
> +    echo "test decap sanity"
> +    modprobe netdevsim
> +    echo 1 1 > /sys/bus/netdevsim/new_device
> +    udevadm settle
> +    nsim=$(ls /sys/bus/netdevsim/devices/netdevsim1/net/)
> +    ip link set dev $nsim up
> +    ip addr add dev $nsim $nsim_v4/24
> +
> +    tc qdisc add dev $nsim clsact
> +    tc filter add dev $nsim egress \
> +       bpf direct-action object-file ${BPF_FILE} section decap
> +
> +    echo abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz | \
> +	nc -u 192.168.2.2 7777

Thanks for the fix and the idea on how to test it.

I have posted a patch to translate this test to a test for test_progs that can 
finish and exit such that it can be run continuously in CI.  The test attaches a 
tc-bpf at lo and the bpf prog directly checks for the skb->ip_summed == 
CHECKSUM_NONE and the broken csum_start condition.

If the test_progs patch looks good, patch 1 can be landed first and then land 
the test_progs patch.  wdyt?

^ permalink raw reply

* Re: [PATCH v7 11/11] dt-bindings: net: dsa: qca8k: add LEDs definition example
From: Andrew Lunn @ 2022-12-20 23:20 UTC (permalink / raw)
  To: Rob Herring
  Cc: Christian Marangi, Florian Fainelli, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Krzysztof Kozlowski, Jonathan Corbet, Pavel Machek,
	Russell King (Oracle), John Crispin, netdev, devicetree,
	linux-kernel, linux-doc, linux-leds, Tim Harvey, Alexander Stein,
	Rasmus Villemoes
In-Reply-To: <20221220173958.GA784285-robh@kernel.org>

On Tue, Dec 20, 2022 at 11:39:58AM -0600, Rob Herring wrote:
> On Thu, Dec 15, 2022 at 12:54:38AM +0100, Christian Marangi wrote:
> > Add LEDs definition example for qca8k using the offload trigger as the
> > default trigger and add all the supported offload triggers by the
> > switch.
> > 
> > Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
> > ---
> >  .../devicetree/bindings/net/dsa/qca8k.yaml    | 24 +++++++++++++++++++
> >  1 file changed, 24 insertions(+)
> > 
> > diff --git a/Documentation/devicetree/bindings/net/dsa/qca8k.yaml b/Documentation/devicetree/bindings/net/dsa/qca8k.yaml
> > index 978162df51f7..4090cf65c41c 100644
> > --- a/Documentation/devicetree/bindings/net/dsa/qca8k.yaml
> > +++ b/Documentation/devicetree/bindings/net/dsa/qca8k.yaml
> > @@ -65,6 +65,8 @@ properties:
> >                   internal mdio access is used.
> >                   With the legacy mapping the reg corresponding to the internal
> >                   mdio is the switch reg with an offset of -1.
> > +                 Each phy have at least 3 LEDs connected and can be declared
> > +                 using the standard LEDs structure.
> >  
> >  patternProperties:
> >    "^(ethernet-)?ports$":
> > @@ -202,6 +204,7 @@ examples:
> >      };
> >    - |
> >      #include <dt-bindings/gpio/gpio.h>
> > +    #include <dt-bindings/leds/common.h>
> >  
> >      mdio {
> >          #address-cells = <1>;
> > @@ -284,6 +287,27 @@ examples:
> >  
> >                  internal_phy_port1: ethernet-phy@0 {
> >                      reg = <0>;
> > +
> > +                    leds {
> > +                        #address-cells = <1>;
> > +                        #size-cells = <0>;
> > +
> > +                        led@0 {
> > +                            reg = <0>;
> > +                            color = <LED_COLOR_ID_WHITE>;
> > +                            function = LED_FUNCTION_LAN;
> > +                            function-enumerator = <1>;
> > +                            linux,default-trigger = "netdev";
> 
> 'function' should replace this. Don't encourage more users. 
> 
> Also, 'netdev' is not documented which leaves me wondering why there's 
> no warning? Either this patch didn't apply or there's a problem in the 
> schema that's not checking this node.

It is probably the usual limitation that the tools require a
compatible, where as the kernel does not.

> > +                        };
> > +
> > +                        led@1 {
> > +                            reg = <1>;
> > +                            color = <LED_COLOR_ID_AMBER>;
> > +                            function = LED_FUNCTION_LAN;
> > +                            function-enumerator = <1>;
> 
> Typo? These are supposed to be unique. Can't you use 'reg' in your case?

reg in this context is the address of the PHY on the MDIO bus. This is
an Ethernet switch, so has many PHYs, each with its own address.

   Andrew

^ permalink raw reply

* [PATCH bpf] selftests/bpf: Test bpf_skb_adjust_room on CHECKSUM_PARTIAL
From: Martin KaFai Lau @ 2022-12-20 23:13 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, netdev,
	kernel-team
In-Reply-To: <20221220004701.402165-2-kuba@kernel.org>

From: Martin KaFai Lau <martin.lau@kernel.org>

When the bpf_skb_adjust_room() shrinks the skb such that
its csum_start is invalid, the skb->ip_summed should
be reset from CHECKSUM_PARTIAL to CHECKSUM_NONE.

This patch adds a test to ensure the skb->ip_summed changed
from CHECKSUM_PARTIAL to CHECKSUM_NONE after bpf_skb_adjust_room().

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
---
 .../selftests/bpf/prog_tests/decap_sanity.c   | 83 +++++++++++++++++++
 .../selftests/bpf/progs/bpf_tracing_net.h     |  6 ++
 .../selftests/bpf/progs/decap_sanity.c        | 68 +++++++++++++++
 3 files changed, 157 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/decap_sanity.c
 create mode 100644 tools/testing/selftests/bpf/progs/decap_sanity.c

diff --git a/tools/testing/selftests/bpf/prog_tests/decap_sanity.c b/tools/testing/selftests/bpf/prog_tests/decap_sanity.c
new file mode 100644
index 000000000000..2fbb3017b740
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/decap_sanity.c
@@ -0,0 +1,83 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <net/if.h>
+#include <linux/in6.h>
+
+#include "test_progs.h"
+#include "network_helpers.h"
+#include "decap_sanity.skel.h"
+
+#define SYS(fmt, ...)						\
+	({							\
+		char cmd[1024];					\
+		snprintf(cmd, sizeof(cmd), fmt, ##__VA_ARGS__);	\
+		if (!ASSERT_OK(system(cmd), cmd))		\
+			goto fail;				\
+	})
+
+#define NS_TEST "decap_sanity_ns"
+#define IPV6_IFACE_ADDR "face::1"
+#define UDP_TEST_PORT 7777
+
+void test_decap_sanity(void)
+{
+	LIBBPF_OPTS(bpf_tc_hook, qdisc_hook, .attach_point = BPF_TC_EGRESS);
+	LIBBPF_OPTS(bpf_tc_opts, tc_attach);
+	struct nstoken *nstoken = NULL;
+	struct decap_sanity *skel;
+	struct sockaddr_in6 addr;
+	socklen_t addrlen;
+	char buf[128] = {};
+	int sockfd, err;
+
+	skel = decap_sanity__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel open_and_load"))
+		return;
+
+	SYS("ip netns add %s", NS_TEST);
+	SYS("ip -net %s -6 addr add %s/128 dev lo nodad", NS_TEST, IPV6_IFACE_ADDR);
+	SYS("ip -net %s link set dev lo up", NS_TEST);
+
+	nstoken = open_netns(NS_TEST);
+	if (!ASSERT_OK_PTR(nstoken, "open_netns"))
+		goto fail;
+
+	qdisc_hook.ifindex = if_nametoindex("lo");
+	if (!ASSERT_GT(qdisc_hook.ifindex, 0, "if_nametoindex lo"))
+		goto fail;
+
+	err = bpf_tc_hook_create(&qdisc_hook);
+	if (!ASSERT_OK(err, "create qdisc hook"))
+		goto fail;
+
+	tc_attach.prog_fd = bpf_program__fd(skel->progs.decap_sanity);
+	err = bpf_tc_attach(&qdisc_hook, &tc_attach);
+	if (!ASSERT_OK(err, "attach filter"))
+		goto fail;
+
+	addrlen = sizeof(addr);
+	err = make_sockaddr(AF_INET6, IPV6_IFACE_ADDR, UDP_TEST_PORT,
+			    (void *)&addr, &addrlen);
+	if (!ASSERT_OK(err, "make_sockaddr"))
+		goto fail;
+	sockfd = socket(AF_INET6, SOCK_DGRAM, 0);
+	if (!ASSERT_NEQ(sockfd, -1, "socket"))
+		goto fail;
+	err = sendto(sockfd, buf, sizeof(buf), 0, (void *)&addr, addrlen);
+	close(sockfd);
+	if (!ASSERT_EQ(err, sizeof(buf), "send"))
+		goto fail;
+
+	ASSERT_EQ(skel->bss->init_csum_partial, true, "init_csum_partial");
+	ASSERT_EQ(skel->bss->final_csum_none, true, "final_csum_none");
+	ASSERT_EQ(skel->bss->broken_csum_start, false, "broken_csum_start");
+
+fail:
+	if (nstoken)
+		close_netns(nstoken);
+	system("ip netns del " NS_TEST " >& /dev/null");
+	decap_sanity__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
index b394817126cf..cfed4df490f3 100644
--- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
+++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
@@ -50,6 +50,12 @@
 #define ICSK_TIME_LOSS_PROBE	5
 #define ICSK_TIME_REO_TIMEOUT	6
 
+#define ETH_HLEN		14
+#define ETH_P_IPV6		0x86DD
+
+#define CHECKSUM_NONE		0
+#define CHECKSUM_PARTIAL	3
+
 #define IFNAMSIZ		16
 
 #define RTF_GATEWAY		0x0002
diff --git a/tools/testing/selftests/bpf/progs/decap_sanity.c b/tools/testing/selftests/bpf/progs/decap_sanity.c
new file mode 100644
index 000000000000..b85113554cbf
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/decap_sanity.c
@@ -0,0 +1,68 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include "bpf_tracing_net.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+#define UDP_TEST_PORT 7777
+
+void *bpf_cast_to_kern_ctx(void *) __ksym;
+bool init_csum_partial = false;
+bool final_csum_none = false;
+bool broken_csum_start = false;
+
+static inline unsigned int skb_headlen(const struct sk_buff *skb)
+{
+	return skb->len - skb->data_len;
+}
+
+static unsigned int skb_headroom(const struct sk_buff *skb)
+{
+	return skb->data - skb->head;
+}
+
+static int skb_checksum_start_offset(const struct sk_buff *skb)
+{
+	return skb->csum_start - skb_headroom(skb);
+}
+
+SEC("tc")
+int decap_sanity(struct __sk_buff *skb)
+{
+	struct sk_buff *kskb;
+	struct ipv6hdr ip6h;
+	struct udphdr udph;
+	int err;
+
+	if (skb->protocol != __bpf_constant_htons(ETH_P_IPV6))
+		return TC_ACT_SHOT;
+
+	if (bpf_skb_load_bytes(skb, ETH_HLEN, &ip6h, sizeof(ip6h)))
+		return TC_ACT_SHOT;
+
+	if (ip6h.nexthdr != IPPROTO_UDP)
+		return TC_ACT_SHOT;
+
+	if (bpf_skb_load_bytes(skb, ETH_HLEN + sizeof(ip6h), &udph, sizeof(udph)))
+		return TC_ACT_SHOT;
+
+	if (udph.dest != __bpf_constant_htons(UDP_TEST_PORT))
+		return TC_ACT_SHOT;
+
+	kskb = bpf_cast_to_kern_ctx(skb);
+	init_csum_partial = (kskb->ip_summed == CHECKSUM_PARTIAL);
+	err = bpf_skb_adjust_room(skb, -(s32)(ETH_HLEN + sizeof(ip6h) + sizeof(udph)),
+				  1, BPF_F_ADJ_ROOM_FIXED_GSO);
+	if (err)
+		return TC_ACT_SHOT;
+	final_csum_none = (kskb->ip_summed == CHECKSUM_NONE);
+	if (kskb->ip_summed == CHECKSUM_PARTIAL &&
+	    (unsigned int)skb_checksum_start_offset(kskb) >= skb_headlen(kskb))
+		broken_csum_start = true;
+
+	return TC_ACT_SHOT;
+}
+
+char __license[] SEC("license") = "GPL";
-- 
2.30.2


^ permalink raw reply related

* Re: [PATCH v7 10/11] net: dsa: qca8k: add LEDs support
From: Andrew Lunn @ 2022-12-20 23:11 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Christian Marangi, Florian Fainelli, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Rob Herring, Krzysztof Kozlowski, Jonathan Corbet, Pavel Machek,
	John Crispin, netdev, devicetree, linux-kernel, linux-doc,
	linux-leds, Tim Harvey, Alexander Stein, Rasmus Villemoes
In-Reply-To: <Y5teRQ5mv1aTix4w@shell.armlinux.org.uk>

> > +qca8k_cled_trigger_offload(struct led_classdev *ldev, bool enable)
> > +{
> > +	struct qca8k_led *led = container_of(ldev, struct qca8k_led, cdev);
> > +
> > +	struct qca8k_led_pattern_en reg_info;
> > +	struct qca8k_priv *priv = led->priv;
> > +	u32 val = QCA8K_LED_ALWAYS_OFF;
> > +
> > +	qca8k_get_enable_led_reg(led->port_num, led->led_num, &reg_info);
> > +
> > +	if (enable)
> > +		val = QCA8K_LED_RULE_CONTROLLED;
> > +
> > +	return regmap_update_bits(priv->regmap, reg_info.reg,
> > +				  GENMASK(1, 0) << reg_info.shift,
> > +				  val << reg_info.shift);
> 
> 88e151x doesn't have the ability to change in this way - we have
> a register with a 4-bit field which selects the LED mode from one
> of many, or forces the LED on/off/hi-z/blink.
> 
> Not specifically for this patch, but talking generally about this
> approach, the other issue I forsee with this is that yes, 88e151x has
> three LEDs, but the LED modes are also used to implement control
> signals (e.g., on a SFP, LOS can be implemented by programming mode
> 0 on LED2 (which makes it indicate link or not.) If we expose all the
> LEDs we run the risk of the LED subsystem trampling over that
> configuration and essentially messing up such modules. So the Marvell
> PHY driver would need to know when it is appropriate to expose these
> things to the LED subsystem.
> 
> I guess doing it dependent on firmware description as you do in
> this driver would work - if there's no firmware description, they're
> not exposed.
> 

I expect there will always be some sort of firmware involved,
describing the hardware. Without that, we have no idea how many LEDs
are actually connected to pins of the PHY, for example.

I've not yet looked at this patchset in detail, but i hope the DT
binding code is reusable, so all PHYs will have the same basic
binding.

    Andrew

^ permalink raw reply

* RE: [PATCH v4] epoll: use refcount to reduce ep_mutex contention
From: Keller, Jacob E @ 2022-12-20 23:03 UTC (permalink / raw)
  To: Paolo Abeni, linux-fsdevel@vger.kernel.org
  Cc: Soheil Hassas Yeganeh, Al Viro, Davidlohr Bueso, Jason Baron,
	netdev@vger.kernel.org, Carlos Maiolino, Eric Biggers
In-Reply-To: <9d8ad7995e51ad3aecdfe6f7f9e72231b8c9d3b5.1671569682.git.pabeni@redhat.com>



> -----Original Message-----
> From: Paolo Abeni <pabeni@redhat.com>
> Sent: Tuesday, December 20, 2022 12:56 PM
> To: linux-fsdevel@vger.kernel.org
> Cc: Soheil Hassas Yeganeh <soheil@google.com>; Al Viro
> <viro@zeniv.linux.org.uk>; Davidlohr Bueso <dave@stgolabs.net>; Jason Baron
> <jbaron@akamai.com>; netdev@vger.kernel.org; Carlos Maiolino
> <cmaiolino@redhat.com>; Eric Biggers <ebiggers@kernel.org>; Keller, Jacob E
> <jacob.e.keller@intel.com>
> Subject: [PATCH v4] epoll: use refcount to reduce ep_mutex contention
> 
> We are observing huge contention on the epmutex during an http
> connection/rate test:
> 
>  83.17% 0.25%  nginx            [kernel.kallsyms]         [k]
> entry_SYSCALL_64_after_hwframe
> [...]
>            |--66.96%--__fput
>                       |--60.04%--eventpoll_release_file
>                                  |--58.41%--__mutex_lock.isra.6
>                                            |--56.56%--osq_lock
> 
> The application is multi-threaded, creates a new epoll entry for
> each incoming connection, and does not delete it before the
> connection shutdown - that is, before the connection's fd close().
> 
> Many different threads compete frequently for the epmutex lock,
> affecting the overall performance.
> 
> To reduce the contention this patch introduces explicit reference counting
> for the eventpoll struct. Each registered event acquires a reference,
> and references are released at ep_remove() time.
> 
> Additionally, this introduces a new 'dying' flag to prevent races between
> the EP file close() and the monitored file close().
> ep_eventpoll_release() marks, under f_lock spinlock, each epitem as before
> removing it, while EP file close() does not touch dying epitems.
> 
> The eventpoll struct is released by whoever - among EP file close() and
> and the monitored file close() drops its last reference.
> 
> With all the above in place, we can drop the epmutex usage at disposal time.
> 
> Overall this produces a significant performance improvement in the
> mentioned connection/rate scenario: the mutex operations disappear from
> the topmost offenders in the perf report, and the measured connections/rate
> grows by ~60%.
> 
> To make the change more readable this additionally renames ep_free() to
> ep_clear_and_put(), and moves the actual memory cleanup in a separate
> ep_free() helper.
> 

Thanks for switching to refcount! This version looks good to me.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

> Tested-by: Xiumei Mu <xmu@redhat.com>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> v4: (addresses comments from Eric Biggers and Jacob)
>  - replace a refcount_t
>  - rename the ep book-keeping helpers
> 
> v3 at:
> https://lore.kernel.org/linux-
> fsdevel/1aedd7e87097bc4352ba658ac948c585a655785a.1669657846.git.pabeni
> @redhat.com/
> 
> v2 at:
> https://lore.kernel.org/linux-
> fsdevel/f35e58ed5af8131f0f402c3dc6c3033fa96d1843.1669312208.git.pabeni@re
> dhat.com/
> 
> v1 at:
> https://lore.kernel.org/linux-
> fsdevel/f35e58ed5af8131f0f402c3dc6c3033fa96d1843.1669312208.git.pabeni@re
> dhat.com/
> 
> Previous related effort at:
> https://lore.kernel.org/linux-fsdevel/20190727113542.162213-1-
> cj.chengjian@huawei.com/
> https://lkml.org/lkml/2017/10/28/81
> ---
>  fs/eventpoll.c | 185 +++++++++++++++++++++++++++++++------------------
>  1 file changed, 116 insertions(+), 69 deletions(-)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 64659b110973..a43ccb02133c 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -57,13 +57,7 @@
>   * we need a lock that will allow us to sleep. This lock is a
>   * mutex (ep->mtx). It is acquired during the event transfer loop,
>   * during epoll_ctl(EPOLL_CTL_DEL) and during eventpoll_release_file().
> - * Then we also need a global mutex to serialize eventpoll_release_file()
> - * and ep_free().
> - * This mutex is acquired by ep_free() during the epoll file
> - * cleanup path and it is also acquired by eventpoll_release_file()
> - * if a file has been pushed inside an epoll set and it is then
> - * close()d without a previous call to epoll_ctl(EPOLL_CTL_DEL).
> - * It is also acquired when inserting an epoll fd onto another epoll
> + * The epmutex is acquired when inserting an epoll fd onto another epoll
>   * fd. We do this so that we walk the epoll tree and ensure that this
>   * insertion does not create a cycle of epoll file descriptors, which
>   * could lead to deadlock. We need a global mutex to prevent two
> @@ -153,6 +147,13 @@ struct epitem {
>  	/* The file descriptor information this item refers to */
>  	struct epoll_filefd ffd;
> 
> +	/*
> +	 * Protected by file->f_lock, true for to-be-released epitem already
> +	 * removed from the "struct file" items list; together with
> +	 * eventpoll->refcount orchestrates "struct eventpoll" disposal
> +	 */
> +	bool dying;
> +
>  	/* List containing poll wait queues */
>  	struct eppoll_entry *pwqlist;
> 
> @@ -217,6 +218,12 @@ struct eventpoll {
>  	u64 gen;
>  	struct hlist_head refs;
> 
> +	/*
> +	 * usage count, used together with epitem->dying to
> +	 * orchestrate the disposal of this struct
> +	 */
> +	refcount_t refcount;
> +
>  #ifdef CONFIG_NET_RX_BUSY_POLL
>  	/* used to track busy poll napi_id */
>  	unsigned int napi_id;
> @@ -240,9 +247,7 @@ struct ep_pqueue {
>  /* Maximum number of epoll watched descriptors, per user */
>  static long max_user_watches __read_mostly;
> 
> -/*
> - * This mutex is used to serialize ep_free() and eventpoll_release_file().
> - */
> +/* Used for cycles detection */
>  static DEFINE_MUTEX(epmutex);
> 
>  static u64 loop_check_gen = 0;
> @@ -557,8 +562,7 @@ static void ep_remove_wait_queue(struct eppoll_entry
> *pwq)
> 
>  /*
>   * This function unregisters poll callbacks from the associated file
> - * descriptor.  Must be called with "mtx" held (or "epmutex" if called from
> - * ep_free).
> + * descriptor.  Must be called with "mtx" held.
>   */
>  static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
>  {
> @@ -681,11 +685,38 @@ static void epi_rcu_free(struct rcu_head *head)
>  	kmem_cache_free(epi_cache, epi);
>  }
> 
> +static void ep_get(struct eventpoll *ep)
> +{
> +	refcount_inc(&ep->refcount);
> +}
> +
> +/*
> + * Returns true if the event poll can be disposed
> + */
> +static bool ep_refcount_dec_and_test(struct eventpoll *ep)
> +{
> +	if (!refcount_dec_and_test(&ep->refcount))
> +		return false;
> +
> +	WARN_ON_ONCE(!RB_EMPTY_ROOT(&ep->rbr.rb_root));
> +	return true;
> +}
> +
> +static void ep_free(struct eventpoll *ep)
> +{
> +	mutex_destroy(&ep->mtx);
> +	free_uid(ep->user);
> +	wakeup_source_unregister(ep->ws);
> +	kfree(ep);
> +}
> +
>  /*
>   * Removes a "struct epitem" from the eventpoll RB tree and deallocates
>   * all the associated resources. Must be called with "mtx" held.
> + * If the dying flag is set, do the removal only if force is true.
> + * Returns true if the eventpoll can be disposed.
>   */
> -static int ep_remove(struct eventpoll *ep, struct epitem *epi)
> +static bool __ep_remove(struct eventpoll *ep, struct epitem *epi, bool force)
>  {
>  	struct file *file = epi->ffd.file;
>  	struct epitems_head *to_free;
> @@ -700,6 +731,11 @@ static int ep_remove(struct eventpoll *ep, struct epitem
> *epi)
> 
>  	/* Remove the current item from the list of epoll hooks */
>  	spin_lock(&file->f_lock);
> +	if (epi->dying && !force) {
> +		spin_unlock(&file->f_lock);
> +		return false;
> +	}
> +
>  	to_free = NULL;
>  	head = file->f_ep;
>  	if (head->first == &epi->fllink && !epi->fllink.next) {
> @@ -733,28 +769,28 @@ static int ep_remove(struct eventpoll *ep, struct
> epitem *epi)
>  	call_rcu(&epi->rcu, epi_rcu_free);
> 
>  	percpu_counter_dec(&ep->user->epoll_watches);
> +	return ep_refcount_dec_and_test(ep);
> +}
> 
> -	return 0;
> +/*
> + * ep_remove variant for callers owing an additional reference to the ep
> + */
> +static void ep_remove_safe(struct eventpoll *ep, struct epitem *epi)
> +{
> +	WARN_ON_ONCE(__ep_remove(ep, epi, false));
>  }
> 
> -static void ep_free(struct eventpoll *ep)
> +static void ep_clear_and_put(struct eventpoll *ep)
>  {
>  	struct rb_node *rbp;
>  	struct epitem *epi;
> +	bool dispose;
> 
>  	/* We need to release all tasks waiting for these file */
>  	if (waitqueue_active(&ep->poll_wait))
>  		ep_poll_safewake(ep, NULL, 0);
> 
> -	/*
> -	 * We need to lock this because we could be hit by
> -	 * eventpoll_release_file() while we're freeing the "struct eventpoll".
> -	 * We do not need to hold "ep->mtx" here because the epoll file
> -	 * is on the way to be removed and no one has references to it
> -	 * anymore. The only hit might come from eventpoll_release_file() but
> -	 * holding "epmutex" is sufficient here.
> -	 */
> -	mutex_lock(&epmutex);
> +	mutex_lock(&ep->mtx);
> 
>  	/*
>  	 * Walks through the whole tree by unregistering poll callbacks.
> @@ -768,25 +804,21 @@ static void ep_free(struct eventpoll *ep)
> 
>  	/*
>  	 * Walks through the whole tree by freeing each "struct epitem". At this
> -	 * point we are sure no poll callbacks will be lingering around, and also by
> -	 * holding "epmutex" we can be sure that no file cleanup code will hit
> -	 * us during this operation. So we can avoid the lock on "ep->lock".
> -	 * We do not need to lock ep->mtx, either, we only do it to prevent
> -	 * a lockdep warning.
> +	 * point we are sure no poll callbacks will be lingering around.
> +	 * Since we still own a reference to the eventpoll struct, the loop can't
> +	 * dispose it.
>  	 */
> -	mutex_lock(&ep->mtx);
>  	while ((rbp = rb_first_cached(&ep->rbr)) != NULL) {
>  		epi = rb_entry(rbp, struct epitem, rbn);
> -		ep_remove(ep, epi);
> +		ep_remove_safe(ep, epi);
>  		cond_resched();
>  	}
> +
> +	dispose = ep_refcount_dec_and_test(ep);
>  	mutex_unlock(&ep->mtx);
> 
> -	mutex_unlock(&epmutex);
> -	mutex_destroy(&ep->mtx);
> -	free_uid(ep->user);
> -	wakeup_source_unregister(ep->ws);
> -	kfree(ep);
> +	if (dispose)
> +		ep_free(ep);
>  }
> 
>  static int ep_eventpoll_release(struct inode *inode, struct file *file)
> @@ -794,7 +826,7 @@ static int ep_eventpoll_release(struct inode *inode, struct
> file *file)
>  	struct eventpoll *ep = file->private_data;
> 
>  	if (ep)
> -		ep_free(ep);
> +		ep_clear_and_put(ep);
> 
>  	return 0;
>  }
> @@ -906,33 +938,35 @@ void eventpoll_release_file(struct file *file)
>  {
>  	struct eventpoll *ep;
>  	struct epitem *epi;
> -	struct hlist_node *next;
> +	bool dispose;
> 
>  	/*
> -	 * We don't want to get "file->f_lock" because it is not
> -	 * necessary. It is not necessary because we're in the "struct file"
> -	 * cleanup path, and this means that no one is using this file anymore.
> -	 * So, for example, epoll_ctl() cannot hit here since if we reach this
> -	 * point, the file counter already went to zero and fget() would fail.
> -	 * The only hit might come from ep_free() but by holding the mutex
> -	 * will correctly serialize the operation. We do need to acquire
> -	 * "ep->mtx" after "epmutex" because ep_remove() requires it when
> called
> -	 * from anywhere but ep_free().
> -	 *
> -	 * Besides, ep_remove() acquires the lock, so we can't hold it here.
> +	 * Use the 'dying' flag to prevent a concurrent ep_cleat_and_put() from
> +	 * touching the epitems list before eventpoll_release_file() can access
> +	 * the ep->mtx.
>  	 */
> -	mutex_lock(&epmutex);
> -	if (unlikely(!file->f_ep)) {
> -		mutex_unlock(&epmutex);
> -		return;
> -	}
> -	hlist_for_each_entry_safe(epi, next, file->f_ep, fllink) {
> +again:
> +	spin_lock(&file->f_lock);
> +	if (file->f_ep && file->f_ep->first) {
> +		/* detach from ep tree */
> +		epi = hlist_entry(file->f_ep->first, struct epitem, fllink);
> +		epi->dying = true;
> +		spin_unlock(&file->f_lock);
> +
> +		/*
> +		 * ep access is safe as we still own a reference to the ep
> +		 * struct
> +		 */
>  		ep = epi->ep;
> -		mutex_lock_nested(&ep->mtx, 0);
> -		ep_remove(ep, epi);
> +		mutex_lock(&ep->mtx);
> +		dispose = __ep_remove(ep, epi, true);
>  		mutex_unlock(&ep->mtx);
> +
> +		if (dispose)
> +			ep_free(ep);
> +		goto again;
>  	}
> -	mutex_unlock(&epmutex);
> +	spin_unlock(&file->f_lock);
>  }
> 
>  static int ep_alloc(struct eventpoll **pep)
> @@ -955,6 +989,7 @@ static int ep_alloc(struct eventpoll **pep)
>  	ep->rbr = RB_ROOT_CACHED;
>  	ep->ovflist = EP_UNACTIVE_PTR;
>  	ep->user = user;
> +	refcount_set(&ep->refcount, 1);
> 
>  	*pep = ep;
> 
> @@ -1223,10 +1258,10 @@ static int ep_poll_callback(wait_queue_entry_t
> *wait, unsigned mode, int sync, v
>  		 */
>  		list_del_init(&wait->entry);
>  		/*
> -		 * ->whead != NULL protects us from the race with ep_free()
> -		 * or ep_remove(), ep_remove_wait_queue() takes whead->lock
> -		 * held by the caller. Once we nullify it, nothing protects
> -		 * ep/epi or even wait.
> +		 * ->whead != NULL protects us from the race with
> +		 * ep_clear_and_put() or ep_remove(), ep_remove_wait_queue()
> +		 * takes whead->lock held by the caller. Once we nullify it,
> +		 * nothing protects ep/epi or even wait.
>  		 */
>  		smp_store_release(&ep_pwq_from_wait(wait)->whead, NULL);
>  	}
> @@ -1496,16 +1531,22 @@ static int ep_insert(struct eventpoll *ep, const struct
> epoll_event *event,
>  	if (tep)
>  		mutex_unlock(&tep->mtx);
> 
> +	/*
> +	 * ep_remove_safe() calls in the later error paths can't lead to
> +	 * ep_free() as the ep file itself still holds an ep reference.
> +	 */
> +	ep_get(ep);
> +
>  	/* now check if we've created too many backpaths */
>  	if (unlikely(full_check && reverse_path_check())) {
> -		ep_remove(ep, epi);
> +		ep_remove_safe(ep, epi);
>  		return -EINVAL;
>  	}
> 
>  	if (epi->event.events & EPOLLWAKEUP) {
>  		error = ep_create_wakeup_source(epi);
>  		if (error) {
> -			ep_remove(ep, epi);
> +			ep_remove_safe(ep, epi);
>  			return error;
>  		}
>  	}
> @@ -1529,7 +1570,7 @@ static int ep_insert(struct eventpoll *ep, const struct
> epoll_event *event,
>  	 * high memory pressure.
>  	 */
>  	if (unlikely(!epq.epi)) {
> -		ep_remove(ep, epi);
> +		ep_remove_safe(ep, epi);
>  		return -ENOMEM;
>  	}
> 
> @@ -2025,7 +2066,7 @@ static int do_epoll_create(int flags)
>  out_free_fd:
>  	put_unused_fd(fd);
>  out_free_ep:
> -	ep_free(ep);
> +	ep_clear_and_put(ep);
>  	return error;
>  }
> 
> @@ -2167,10 +2208,16 @@ int do_epoll_ctl(int epfd, int op, int fd, struct
> epoll_event *epds,
>  			error = -EEXIST;
>  		break;
>  	case EPOLL_CTL_DEL:
> -		if (epi)
> -			error = ep_remove(ep, epi);
> -		else
> +		if (epi) {
> +			/*
> +			 * The eventpoll itself is still alive: the refcount
> +			 * can't go to zero here.
> +			 */
> +			ep_remove_safe(ep, epi);
> +			error = 0;
> +		} else {
>  			error = -ENOENT;
> +		}
>  		break;
>  	case EPOLL_CTL_MOD:
>  		if (epi) {
> --
> 2.38.1


^ permalink raw reply

* Re: [PATCH net-next] net: wangxun: Adjust code structure
From: Andrew Lunn @ 2022-12-20 22:56 UTC (permalink / raw)
  To: Jiawen Wu; +Cc: netdev, mengyuanlou
In-Reply-To: <20221213063543.2408987-1-jiawenwu@trustnetic.com>

On Tue, Dec 13, 2022 at 02:35:43PM +0800, Jiawen Wu wrote:
> From: Mengyuan Lou <mengyuanlou@net-swift.com>
> 
> Remove useless structs 'txgbe_hw' and 'ngbe_hw' make the codes clear.
> And move the same codes which sets MAC address between txgbe and ngbe to
> libwx.

So this patch appears to do three things. So ideally it should be
three patches. I will then but much easier to review.

      Andrew

^ permalink raw reply

* Re: [RFC bpf-next 2/8] net: introduce XDP features flag
From: Marek Majtyka @ 2022-12-20 22:51 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: Jakub Kicinski, Lorenzo Bianconi, bpf, netdev, ast, daniel,
	andrii, davem, hawk, pabeni, edumazet, toke, memxor, saeedm,
	anthony.l.nguyen, gospo, vladimir.oltean, nbd, john, leon,
	simon.horman, aelior, christophe.jaillet, ecree.xilinx,
	grygorii.strashko, mst, bjorn, magnus.karlsson,
	maciej.fijalkowski, intel-wired-lan
In-Reply-To: <Y6F+YJSkI19m/kMv@lore-desk>

On Tue, Dec 20, 2022 at 10:20 AM Lorenzo Bianconi
<lorenzo.bianconi@redhat.com> wrote:
>
> > On Mon, 19 Dec 2022 16:41:31 +0100 Lorenzo Bianconi wrote:
> > > +=====================
> > > +Netdev XDP features
> > > +=====================
> > > +
> > > + * XDP FEATURES FLAGS
> > > +
> > > +Following netdev xdp features flags can be retrieved over route netlink
> > > +interface (compact form) - the same way as netdev feature flags.
> >
> > How likely is it that I'll be able to convince you that cramming more
> > stuff in rtnl is a bad idea? I can convert this for you to a YAML-
> > -compatible genetlink family for you in a jiffy, just say yes :S
> >
> > rtnl is hard to parse, and already overloaded with random stuff.
> > And the messages are enormous.
>
> Hi Jakub,
>
> I am fine to use YAML for this, but I will let Marek comment since he is the
> original author of this patch.
>
> >
> > > +These features flags are read only and cannot be change at runtime.
> > > +
> > > +*  XDP_ABORTED
> > > +
> > > +This feature informs if netdev supports xdp aborted action.
> > > +
> > > +*  XDP_DROP
> > > +
> > > +This feature informs if netdev supports xdp drop action.
> > > +
> > > +*  XDP_PASS
> > > +
> > > +This feature informs if netdev supports xdp pass action.
> > > +
> > > +*  XDP_TX
> > > +
> > > +This feature informs if netdev supports xdp tx action.
> > > +
> > > +*  XDP_REDIRECT
> > > +
> > > +This feature informs if netdev supports xdp redirect action.
> > > +It assumes the all beforehand mentioned flags are enabled.
> > > +
> > > +*  XDP_SOCK_ZEROCOPY
> > > +
> > > +This feature informs if netdev driver supports xdp zero copy.
> > > +It assumes the all beforehand mentioned flags are enabled.
> >
> > Why is this "assumption" worth documenting?
>
> I guess we can remove it.
> @Marek: any comment?
>
> >
> > > +*  XDP_HW_OFFLOAD
> > > +
> > > +This feature informs if netdev driver supports xdp hw oflloading.
> > > +
> > > +*  XDP_TX_LOCK
> > > +
> > > +This feature informs if netdev ndo_xdp_xmit function requires locking.
> >
> > Why is it relevant to the user?
>
> Probably not, I kept it since it was in Marek's original patch.
> @Marek: any comment?
>
> >
> > > +*  XDP_REDIRECT_TARGET
> > > +
> > > +This feature informs if netdev implements ndo_xdp_xmit callback.
> >
> > Does it make sense to rename XDP_REDIRECT -> XDP_REDIRECT_SOURCE then?
>
> yes, naming is always hard :)
>
> >
> > > +*  XDP_FRAG_RX
> > > +
> > > +This feature informs if netdev implements non-linear xdp buff support in
> > > +the driver napi callback.
> >
> > Who's the target audience? Maybe FRAG is not the best name?
> > Scatter-gather or multi-buf may be more widely understood.
>
> ack, fine. I will rename it in the formal series.
>
> Regards,
> Lorenzo
>
> >
> > > +*  XDP_FRAG_TARGET
> > > +
> > > +This feature informs if netdev implements non-linear xdp buff support in
> > > +ndo_xdp_xmit callback. XDP_FRAG_TARGET requires XDP_REDIRECT_TARGET is properly
> > > +supported.
> >
Everybody is allowed to make a good use of it. Every improvement is
highly appreciated. Thanks Lorenzo for taking this over.
Regards
Marek

^ permalink raw reply

* Re: [bpf-next v2 0/5] samples/bpf: enhance syscall tracing program
From: Andrii Nakryiko @ 2022-12-20 22:40 UTC (permalink / raw)
  To: Daniel T. Lee
  Cc: Daniel Borkmann, Alexei Starovoitov, Yonghong Song, bpf, netdev
In-Reply-To: <20221220115928.11979-1-danieltimlee@gmail.com>

On Tue, Dec 20, 2022 at 3:59 AM Daniel T. Lee <danieltimlee@gmail.com> wrote:
>
> Syscall tracing using kprobe is quite unstable. Since it uses the exact
> name of the kernel function, the program might broke due to the rename
> of a function. The problem can also be caused by a changes in the
> arguments of the function to which the kprobe connects. This commit
> enhances syscall tracing program with the following instruments.
>
> In this patchset, ksyscall is used instead of kprobe. By using
> ksyscall, libbpf will detect the appropriate kernel function name.
> (e.g. sys_write -> __s390_sys_write). This eliminates the need to worry
> about which wrapper function to attach in order to parse arguments.
> Also ksyscall provides more fine method with attaching system call, the
> coarse SYSCALL helper at trace_common.h can be removed.
>
> Next, BPF_SYSCALL is used to reduce the inconvenience of parsing
> arguments. Since the nature of SYSCALL_WRAPPER function wraps the
> argument once, additional process of argument extraction is required
> to properly parse the argument. The BPF_SYSCALL macro will reduces the
> hassle of parsing arguments from pt_regs.
>
> Lastly, vmlinux.h is applied to syscall tracing program. This change
> allows the bpf program to refer to the internal structure as a single
> "vmlinux.h" instead of including each header referenced by the bpf
> program.
>
> Additionally, this patchset changes the suffix of _kern to .bpf to make
> use of the new compile rule (CLANG-BPF) which is more simple and neat.
> By just changing the _kern suffix to .bpf will inherit the benefit of
> the new CLANG-BPF compile target.
>
> Also, this commit adds dummy gnu/stub.h to the samples/bpf directory.
> This will fix the compiling problem with 'clang -target bpf'.
>
> ---
> Changes in V2:
> - add gnu/stub.h hack to fix compile error with 'clang -target bpf'
>
> Daniel T. Lee (5):
>   samples/bpf: use kyscall instead of kprobe in syscall tracing program
>   samples/bpf: use vmlinux.h instead of implicit headers in syscall
>     tracing program
>   samples/bpf: change _kern suffix to .bpf with syscall tracing program
>   samples/bpf: fix tracex2 by using BPF_KSYSCALL macro
>   samples/bpf: use BPF_KSYSCALL macro in syscall tracing programs
>

Nice set of changes, thanks for cleaning these up! I don't see
anything obviously wrong, but these changes seem to break s390x build
(see [0]), please check what's going on.

  [0] https://github.com/kernel-patches/bpf/actions/runs/3740339876/jobs/6348606866


>  samples/bpf/Makefile                          | 10 ++--
>  samples/bpf/gnu/stubs.h                       |  1 +
>  ...p_perf_test_kern.c => map_perf_test.bpf.c} | 48 ++++++++-----------
>  samples/bpf/map_perf_test_user.c              |  2 +-
>  ...c => test_current_task_under_cgroup.bpf.c} | 11 ++---
>  .../bpf/test_current_task_under_cgroup_user.c |  2 +-
>  samples/bpf/test_map_in_map_kern.c            |  1 -
>  ...ser_kern.c => test_probe_write_user.bpf.c} | 20 ++++----
>  samples/bpf/test_probe_write_user_user.c      |  2 +-
>  samples/bpf/trace_common.h                    | 13 -----
>  ...trace_output_kern.c => trace_output.bpf.c} |  6 +--
>  samples/bpf/trace_output_user.c               |  2 +-
>  samples/bpf/{tracex2_kern.c => tracex2.bpf.c} | 13 ++---
>  samples/bpf/tracex2_user.c                    |  2 +-
>  14 files changed, 52 insertions(+), 81 deletions(-)
>  create mode 100644 samples/bpf/gnu/stubs.h
>  rename samples/bpf/{map_perf_test_kern.c => map_perf_test.bpf.c} (85%)
>  rename samples/bpf/{test_current_task_under_cgroup_kern.c => test_current_task_under_cgroup.bpf.c} (84%)
>  rename samples/bpf/{test_probe_write_user_kern.c => test_probe_write_user.bpf.c} (71%)
>  delete mode 100644 samples/bpf/trace_common.h
>  rename samples/bpf/{trace_output_kern.c => trace_output.bpf.c} (82%)
>  rename samples/bpf/{tracex2_kern.c => tracex2.bpf.c} (89%)
>
> --
> 2.34.1
>

^ permalink raw reply

* Re: [RFC bpf-next 2/8] net: introduce XDP features flag
From: Lorenzo Bianconi @ 2022-12-20 22:25 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Lorenzo Bianconi, Marek Majtyka, bpf, netdev, ast, daniel, andrii,
	davem, kuba, hawk, pabeni, edumazet, toke, memxor, saeedm,
	anthony.l.nguyen, gospo, vladimir.oltean, nbd, john, leon,
	simon.horman, aelior, christophe.jaillet, ecree.xilinx,
	grygorii.strashko, mst, bjorn, magnus.karlsson,
	maciej.fijalkowski, intel-wired-lan
In-Reply-To: <CAKH8qBts19wxSDAKk0SBk76ftvdK+sW6d3ufcBWoV5cMa2ENpA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 22364 bytes --]

> On Tue, Dec 20, 2022 at 2:11 AM Lorenzo Bianconi
> <lorenzo.bianconi@redhat.com> wrote:
> >
> > On Dec 19, Stanislav Fomichev wrote:
> > > On Mon, Dec 19, 2022 at 3:51 PM Marek Majtyka <alardam@gmail.com> wrote:
> > > >
> > > > At the time of writing, I wanted to be able to read additional information about the XDP capabilities of each network interface using ethtool. This change was intended for Linux users/admins, and not for XDP experts who mostly don't need it and prefer tasting XDP with netlink and bpf rather than reading network interface features with ethtool.
> > >
> > > Anything preventing ethtool from doing probing similar to 'bpftool
> > > feature probe'?
> > > The problem with these feature bits is that they might diverge and/or
> > > not work at all for the backported patches (where the fix/feature has
> > > been backported, but the part that exports the bit hasn't) :-(
> > > OTOH, I'm not sure we can probe everything from your list, but we
> > > might try and see what's missing..
> >
> > Hi Stanislav,
> >
> > I have not added the ethtool support to this series yet since userspace part is
> > still missing but I think we can consider XDP as a sort of sw offload so it
> > would be nice for the user/sysadmin (not xdp or bpf developer) to check the NIC
> > XDP capabilities similar to what we can already do for other hw offload
> > features.
> 
> [..]
> 
> > Moreover let's consider XDP_REDIRECT of a scatter-gather XDP frame into a
> > devmap. I do not think there is a way to test if the 'target' device supports
> > SG and so we are forced to disable this feature until all drivers support it.
> 
> See below for more questions, but why "target device has prog
> installed and the aux->xdp_has_frags == true" won't work for the
> internal kernel consumers?

There are some drivers (e.g. all intel ones) that currently do not support
non-linear xdp buff in the driver napi callback (XDP_FRAG_RX) but implement
non-linear xdp buff support in ndo_xdp_xmit callback (XDP_FRAG_TARGET).
Moreover, I guess for a sysadmin it would be better to check NIC capabilities in
the same way he/she is used to with other features (e.g. ethool -k ...).

> 
> > Introducing XDP features we can enable it on per-driver basis.
> > I think the same apply for other capabilities as well and just assuming a given
> > feature is not supported if an e2e test is not working seems a bit inaccurate.
> 
> Ok, I see that these bits are used in the later patches in xsk and
> devmap. But I guess I'm still confused about why we add all these
> flags, but only use mostly XDP_F_REDIRECT_TARGET; maybe start with
> that one? And why does it have to be exposed to the userspace?
> (userspace can still probe per-device features by trying to load
> different progs?)

There are some drivers (e.g. ixgbevf or cavium thunder) that do not support
XDP_REDIRECT but just XDP_PASS, XDP_DROP and XDP_TX, so I think we should
differentiate between XDP_BASIC (XDP_PASS | XDP_DROP | XDP_TX) and XDP_FULL
(XDP_BASIC | XDP_REDIRECT).

> 
> Also, it seems like XDP_F_REDIRECT_TARGET really means "the bpf
> program has been installed on this device". Instead of a flag, why not
> explicitly check whether the target device has a prog installed (and,
> if needed, whether the installed program has frags support)?

XDP_F_REDIRECT_TARGET is used to inform if netdev implements ndo_xdp_xmit
callback (most of the XDP drivers do not require to load a bpf program to
XDP_REDIRECT into them).

Regards,
Lorenzo

> 
> > Regards,
> > Lorenzo
> >
> > >
> > > > On Mon, Dec 19, 2022 at 9:03 PM <sdf@google.com> wrote:
> > > >>
> > > >> On 12/19, Lorenzo Bianconi wrote:
> > > >> > From: Marek Majtyka <alardam@gmail.com>
> > > >>
> > > >> > Implement support for checking what kind of XDP features a netdev
> > > >> > supports. Previously, there was no way to do this other than to try to
> > > >> > create an AF_XDP socket on the interface or load an XDP program and see
> > > >> > if it worked. This commit changes this by adding a new variable which
> > > >> > describes all xdp supported functions on pretty detailed level:
> > > >>
> > > >> >   - aborted
> > > >> >   - drop
> > > >> >   - pass
> > > >> >   - tx
> > > >> >   - redirect
> > > >> >   - sock_zerocopy
> > > >> >   - hw_offload
> > > >> >   - redirect_target
> > > >> >   - tx_lock
> > > >> >   - frag_rx
> > > >> >   - frag_target
> > > >>
> > > >> > Zerocopy mode requires that redirect XDP operation is implemented in a
> > > >> > driver and the driver supports also zero copy mode. Full mode requires
> > > >> > that all XDP operation are implemented in the driver. Basic mode is just
> > > >> > full mode without redirect operation. Frag target requires
> > > >> > redirect_target one is supported by the driver.
> > > >>
> > > >> Can you share more about _why_ is it needed? If we can already obtain
> > > >> most of these signals via probing, why export the flags?
> > > >>
> > > >> > Initially, these new flags are disabled for all drivers by default.
> > > >>
> > > >> > Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > >> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > >> > Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > > >> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > > >> > Signed-off-by: Marek Majtyka <alardam@gmail.com>
> > > >> > ---
> > > >> >   .../networking/netdev-xdp-features.rst        | 60 +++++++++++++++++
> > > >> >   include/linux/netdevice.h                     |  2 +
> > > >> >   include/linux/xdp_features.h                  | 64 +++++++++++++++++++
> > > >> >   include/uapi/linux/if_link.h                  |  7 ++
> > > >> >   include/uapi/linux/xdp_features.h             | 34 ++++++++++
> > > >> >   net/core/rtnetlink.c                          | 34 ++++++++++
> > > >> >   tools/include/uapi/linux/if_link.h            |  7 ++
> > > >> >   tools/include/uapi/linux/xdp_features.h       | 34 ++++++++++
> > > >> >   8 files changed, 242 insertions(+)
> > > >> >   create mode 100644 Documentation/networking/netdev-xdp-features.rst
> > > >> >   create mode 100644 include/linux/xdp_features.h
> > > >> >   create mode 100644 include/uapi/linux/xdp_features.h
> > > >> >   create mode 100644 tools/include/uapi/linux/xdp_features.h
> > > >>
> > > >> > diff --git a/Documentation/networking/netdev-xdp-features.rst
> > > >> > b/Documentation/networking/netdev-xdp-features.rst
> > > >> > new file mode 100644
> > > >> > index 000000000000..1dc803fe72dd
> > > >> > --- /dev/null
> > > >> > +++ b/Documentation/networking/netdev-xdp-features.rst
> > > >> > @@ -0,0 +1,60 @@
> > > >> > +.. SPDX-License-Identifier: GPL-2.0
> > > >> > +
> > > >> > +=====================
> > > >> > +Netdev XDP features
> > > >> > +=====================
> > > >> > +
> > > >> > + * XDP FEATURES FLAGS
> > > >> > +
> > > >> > +Following netdev xdp features flags can be retrieved over route netlink
> > > >> > +interface (compact form) - the same way as netdev feature flags.
> > > >> > +These features flags are read only and cannot be change at runtime.
> > > >> > +
> > > >> > +*  XDP_ABORTED
> > > >> > +
> > > >> > +This feature informs if netdev supports xdp aborted action.
> > > >> > +
> > > >> > +*  XDP_DROP
> > > >> > +
> > > >> > +This feature informs if netdev supports xdp drop action.
> > > >> > +
> > > >> > +*  XDP_PASS
> > > >> > +
> > > >> > +This feature informs if netdev supports xdp pass action.
> > > >> > +
> > > >> > +*  XDP_TX
> > > >> > +
> > > >> > +This feature informs if netdev supports xdp tx action.
> > > >> > +
> > > >> > +*  XDP_REDIRECT
> > > >> > +
> > > >> > +This feature informs if netdev supports xdp redirect action.
> > > >> > +It assumes the all beforehand mentioned flags are enabled.
> > > >> > +
> > > >> > +*  XDP_SOCK_ZEROCOPY
> > > >> > +
> > > >> > +This feature informs if netdev driver supports xdp zero copy.
> > > >> > +It assumes the all beforehand mentioned flags are enabled.
> > > >> > +
> > > >> > +*  XDP_HW_OFFLOAD
> > > >> > +
> > > >> > +This feature informs if netdev driver supports xdp hw oflloading.
> > > >> > +
> > > >> > +*  XDP_TX_LOCK
> > > >> > +
> > > >> > +This feature informs if netdev ndo_xdp_xmit function requires locking.
> > > >> > +
> > > >> > +*  XDP_REDIRECT_TARGET
> > > >> > +
> > > >> > +This feature informs if netdev implements ndo_xdp_xmit callback.
> > > >> > +
> > > >> > +*  XDP_FRAG_RX
> > > >> > +
> > > >> > +This feature informs if netdev implements non-linear xdp buff support in
> > > >> > +the driver napi callback.
> > > >> > +
> > > >> > +*  XDP_FRAG_TARGET
> > > >> > +
> > > >> > +This feature informs if netdev implements non-linear xdp buff support in
> > > >> > +ndo_xdp_xmit callback. XDP_FRAG_TARGET requires XDP_REDIRECT_TARGET is
> > > >> > properly
> > > >> > +supported.
> > > >> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > >> > index aad12a179e54..ae5a8564383b 100644
> > > >> > --- a/include/linux/netdevice.h
> > > >> > +++ b/include/linux/netdevice.h
> > > >> > @@ -43,6 +43,7 @@
> > > >> >   #include <net/xdp.h>
> > > >>
> > > >> >   #include <linux/netdev_features.h>
> > > >> > +#include <linux/xdp_features.h>
> > > >> >   #include <linux/neighbour.h>
> > > >> >   #include <uapi/linux/netdevice.h>
> > > >> >   #include <uapi/linux/if_bonding.h>
> > > >> > @@ -2362,6 +2363,7 @@ struct net_device {
> > > >> >       struct rtnl_hw_stats64  *offload_xstats_l3;
> > > >>
> > > >> >       struct devlink_port     *devlink_port;
> > > >> > +     xdp_features_t          xdp_features;
> > > >> >   };
> > > >> >   #define to_net_dev(d) container_of(d, struct net_device, dev)
> > > >>
> > > >> > diff --git a/include/linux/xdp_features.h b/include/linux/xdp_features.h
> > > >> > new file mode 100644
> > > >> > index 000000000000..4e72a86ef329
> > > >> > --- /dev/null
> > > >> > +++ b/include/linux/xdp_features.h
> > > >> > @@ -0,0 +1,64 @@
> > > >> > +/* SPDX-License-Identifier: GPL-2.0-or-later */
> > > >> > +/*
> > > >> > + * Network device xdp features.
> > > >> > + */
> > > >> > +#ifndef _LINUX_XDP_FEATURES_H
> > > >> > +#define _LINUX_XDP_FEATURES_H
> > > >> > +
> > > >> > +#include <linux/types.h>
> > > >> > +#include <linux/bitops.h>
> > > >> > +#include <asm/byteorder.h>
> > > >> > +#include <uapi/linux/xdp_features.h>
> > > >> > +
> > > >> > +typedef u32 xdp_features_t;
> > > >> > +
> > > >> > +#define __XDP_F_BIT(bit)     ((xdp_features_t)1 << (bit))
> > > >> > +#define __XDP_F(name)                __XDP_F_BIT(XDP_F_##name##_BIT)
> > > >> > +
> > > >> > +#define XDP_F_ABORTED                __XDP_F(ABORTED)
> > > >> > +#define XDP_F_DROP           __XDP_F(DROP)
> > > >> > +#define XDP_F_PASS           __XDP_F(PASS)
> > > >> > +#define XDP_F_TX             __XDP_F(TX)
> > > >> > +#define XDP_F_REDIRECT               __XDP_F(REDIRECT)
> > > >> > +#define XDP_F_REDIRECT_TARGET        __XDP_F(REDIRECT_TARGET)
> > > >> > +#define XDP_F_SOCK_ZEROCOPY  __XDP_F(SOCK_ZEROCOPY)
> > > >> > +#define XDP_F_HW_OFFLOAD     __XDP_F(HW_OFFLOAD)
> > > >> > +#define XDP_F_TX_LOCK                __XDP_F(TX_LOCK)
> > > >> > +#define XDP_F_FRAG_RX                __XDP_F(FRAG_RX)
> > > >> > +#define XDP_F_FRAG_TARGET    __XDP_F(FRAG_TARGET)
> > > >> > +
> > > >> > +#define XDP_F_BASIC          (XDP_F_ABORTED | XDP_F_DROP |   \
> > > >> > +                              XDP_F_PASS | XDP_F_TX)
> > > >> > +
> > > >> > +#define XDP_F_FULL           (XDP_F_BASIC | XDP_F_REDIRECT)
> > > >> > +
> > > >> > +#define XDP_F_FULL_ZC                (XDP_F_FULL | XDP_F_SOCK_ZEROCOPY)
> > > >> > +
> > > >> > +#define XDP_FEATURES_ABORTED_STR             "xdp-aborted"
> > > >> > +#define XDP_FEATURES_DROP_STR                        "xdp-drop"
> > > >> > +#define XDP_FEATURES_PASS_STR                        "xdp-pass"
> > > >> > +#define XDP_FEATURES_TX_STR                  "xdp-tx"
> > > >> > +#define XDP_FEATURES_REDIRECT_STR            "xdp-redirect"
> > > >> > +#define XDP_FEATURES_REDIRECT_TARGET_STR     "xdp-redirect-target"
> > > >> > +#define XDP_FEATURES_SOCK_ZEROCOPY_STR               "xdp-sock-zerocopy"
> > > >> > +#define XDP_FEATURES_HW_OFFLOAD_STR          "xdp-hw-offload"
> > > >> > +#define XDP_FEATURES_TX_LOCK_STR             "xdp-tx-lock"
> > > >> > +#define XDP_FEATURES_FRAG_RX_STR             "xdp-frag-rx"
> > > >> > +#define XDP_FEATURES_FRAG_TARGET_STR         "xdp-frag-target"
> > > >> > +
> > > >> > +#define DECLARE_XDP_FEATURES_TABLE(name, length)                             \
> > > >> > +     const char name[][length] = {                                           \
> > > >> > +             [XDP_F_ABORTED_BIT] = XDP_FEATURES_ABORTED_STR,                 \
> > > >> > +             [XDP_F_DROP_BIT] = XDP_FEATURES_DROP_STR,                       \
> > > >> > +             [XDP_F_PASS_BIT] = XDP_FEATURES_PASS_STR,                       \
> > > >> > +             [XDP_F_TX_BIT] = XDP_FEATURES_TX_STR,                           \
> > > >> > +             [XDP_F_REDIRECT_BIT] = XDP_FEATURES_REDIRECT_STR,               \
> > > >> > +             [XDP_F_REDIRECT_TARGET_BIT] = XDP_FEATURES_REDIRECT_TARGET_STR, \
> > > >> > +             [XDP_F_SOCK_ZEROCOPY_BIT] = XDP_FEATURES_SOCK_ZEROCOPY_STR,     \
> > > >> > +             [XDP_F_HW_OFFLOAD_BIT] = XDP_FEATURES_HW_OFFLOAD_STR,           \
> > > >> > +             [XDP_F_TX_LOCK_BIT] = XDP_FEATURES_TX_LOCK_STR,                 \
> > > >> > +             [XDP_F_FRAG_RX_BIT] = XDP_FEATURES_FRAG_RX_STR,                 \
> > > >> > +             [XDP_F_FRAG_TARGET_BIT] = XDP_FEATURES_FRAG_TARGET_STR,         \
> > > >> > +     }
> > > >> > +
> > > >> > +#endif /* _LINUX_XDP_FEATURES_H */
> > > >> > diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
> > > >> > index 1021a7e47a86..971c658ceaea 100644
> > > >> > --- a/include/uapi/linux/if_link.h
> > > >> > +++ b/include/uapi/linux/if_link.h
> > > >> > @@ -374,6 +374,8 @@ enum {
> > > >>
> > > >> >       IFLA_DEVLINK_PORT,
> > > >>
> > > >> > +     IFLA_XDP_FEATURES,
> > > >> > +
> > > >> >       __IFLA_MAX
> > > >> >   };
> > > >>
> > > >> > @@ -1318,6 +1320,11 @@ enum {
> > > >>
> > > >> >   #define IFLA_XDP_MAX (__IFLA_XDP_MAX - 1)
> > > >>
> > > >> > +enum {
> > > >> > +     IFLA_XDP_FEATURES_WORD_UNSPEC = 0,
> > > >> > +     IFLA_XDP_FEATURES_BITS_WORD,
> > > >> > +};
> > > >> > +
> > > >> >   enum {
> > > >> >       IFLA_EVENT_NONE,
> > > >> >       IFLA_EVENT_REBOOT,              /* internal reset / reboot */
> > > >> > diff --git a/include/uapi/linux/xdp_features.h
> > > >> > b/include/uapi/linux/xdp_features.h
> > > >> > new file mode 100644
> > > >> > index 000000000000..48eb42069bcd
> > > >> > --- /dev/null
> > > >> > +++ b/include/uapi/linux/xdp_features.h
> > > >> > @@ -0,0 +1,34 @@
> > > >> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > > >> > +/*
> > > >> > + * Copyright (c) 2020 Intel
> > > >> > + */
> > > >> > +
> > > >> > +#ifndef __UAPI_LINUX_XDP_FEATURES__
> > > >> > +#define __UAPI_LINUX_XDP_FEATURES__
> > > >> > +
> > > >> > +enum {
> > > >> > +     XDP_F_ABORTED_BIT,
> > > >> > +     XDP_F_DROP_BIT,
> > > >> > +     XDP_F_PASS_BIT,
> > > >> > +     XDP_F_TX_BIT,
> > > >> > +     XDP_F_REDIRECT_BIT,
> > > >> > +     XDP_F_REDIRECT_TARGET_BIT,
> > > >> > +     XDP_F_SOCK_ZEROCOPY_BIT,
> > > >> > +     XDP_F_HW_OFFLOAD_BIT,
> > > >> > +     XDP_F_TX_LOCK_BIT,
> > > >> > +     XDP_F_FRAG_RX_BIT,
> > > >> > +     XDP_F_FRAG_TARGET_BIT,
> > > >> > +     /*
> > > >> > +      * Add your fresh new property above and remember to update
> > > >> > +      * documentation.
> > > >> > +      */
> > > >> > +     XDP_FEATURES_COUNT,
> > > >> > +};
> > > >> > +
> > > >> > +#define XDP_FEATURES_WORDS                   ((XDP_FEATURES_COUNT + 32 - 1) / 32)
> > > >> > +#define XDP_FEATURES_WORD(blocks, index)     ((blocks)[(index) / 32U])
> > > >> > +#define XDP_FEATURES_FIELD_FLAG(index)               (1U << (index) % 32U)
> > > >> > +#define XDP_FEATURES_BIT_IS_SET(blocks, index)        \
> > > >> > +     (XDP_FEATURES_WORD(blocks, index) & XDP_FEATURES_FIELD_FLAG(index))
> > > >> > +
> > > >> > +#endif  /* __UAPI_LINUX_XDP_FEATURES__ */
> > > >> > diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> > > >> > index 64289bc98887..1c299746b614 100644
> > > >> > --- a/net/core/rtnetlink.c
> > > >> > +++ b/net/core/rtnetlink.c
> > > >> > @@ -1016,6 +1016,14 @@ static size_t rtnl_xdp_size(void)
> > > >> >       return xdp_size;
> > > >> >   }
> > > >>
> > > >> > +static size_t rtnl_xdp_features_size(void)
> > > >> > +{
> > > >> > +     size_t xdp_size = nla_total_size(0) +   /* nest IFLA_XDP_FEATURES */
> > > >> > +                       XDP_FEATURES_WORDS * nla_total_size(4);
> > > >> > +
> > > >> > +     return xdp_size;
> > > >> > +}
> > > >> > +
> > > >> >   static size_t rtnl_prop_list_size(const struct net_device *dev)
> > > >> >   {
> > > >> >       struct netdev_name_node *name_node;
> > > >> > @@ -1103,6 +1111,7 @@ static noinline size_t if_nlmsg_size(const struct
> > > >> > net_device *dev,
> > > >> >              + rtnl_prop_list_size(dev)
> > > >> >              + nla_total_size(MAX_ADDR_LEN) /* IFLA_PERM_ADDRESS */
> > > >> >              + rtnl_devlink_port_size(dev)
> > > >> > +            + rtnl_xdp_features_size() /* IFLA_XDP_FEATURES */
> > > >> >              + 0;
> > > >> >   }
> > > >>
> > > >> > @@ -1546,6 +1555,27 @@ static int rtnl_xdp_fill(struct sk_buff *skb,
> > > >> > struct net_device *dev)
> > > >> >       return err;
> > > >> >   }
> > > >>
> > > >> > +static int rtnl_xdp_features_fill(struct sk_buff *skb, struct net_device
> > > >> > *dev)
> > > >> > +{
> > > >> > +     struct nlattr *attr;
> > > >> > +
> > > >> > +     attr = nla_nest_start_noflag(skb, IFLA_XDP_FEATURES);
> > > >> > +     if (!attr)
> > > >> > +             return -EMSGSIZE;
> > > >> > +
> > > >> > +     BUILD_BUG_ON(XDP_FEATURES_WORDS != 1);
> > > >> > +     if (nla_put_u32(skb, IFLA_XDP_FEATURES_BITS_WORD, dev->xdp_features))
> > > >> > +             goto err_cancel;
> > > >> > +
> > > >> > +     nla_nest_end(skb, attr);
> > > >> > +
> > > >> > +     return 0;
> > > >> > +
> > > >> > +err_cancel:
> > > >> > +     nla_nest_cancel(skb, attr);
> > > >> > +     return -EMSGSIZE;
> > > >> > +}
> > > >> > +
> > > >> >   static u32 rtnl_get_event(unsigned long event)
> > > >> >   {
> > > >> >       u32 rtnl_event_type = IFLA_EVENT_NONE;
> > > >> > @@ -1904,6 +1934,9 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb,
> > > >> >       if (rtnl_fill_devlink_port(skb, dev))
> > > >> >               goto nla_put_failure;
> > > >>
> > > >> > +     if (rtnl_xdp_features_fill(skb, dev))
> > > >> > +             goto nla_put_failure;
> > > >> > +
> > > >> >       nlmsg_end(skb, nlh);
> > > >> >       return 0;
> > > >>
> > > >> > @@ -1968,6 +2001,7 @@ static const struct nla_policy
> > > >> > ifla_policy[IFLA_MAX+1] = {
> > > >> >       [IFLA_TSO_MAX_SIZE]     = { .type = NLA_REJECT },
> > > >> >       [IFLA_TSO_MAX_SEGS]     = { .type = NLA_REJECT },
> > > >> >       [IFLA_ALLMULTI]         = { .type = NLA_REJECT },
> > > >> > +     [IFLA_XDP_FEATURES]     = { .type = NLA_NESTED },
> > > >> >   };
> > > >>
> > > >> >   static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
> > > >> > diff --git a/tools/include/uapi/linux/if_link.h
> > > >> > b/tools/include/uapi/linux/if_link.h
> > > >> > index 82fe18f26db5..994228e9909a 100644
> > > >> > --- a/tools/include/uapi/linux/if_link.h
> > > >> > +++ b/tools/include/uapi/linux/if_link.h
> > > >> > @@ -354,6 +354,8 @@ enum {
> > > >>
> > > >> >       IFLA_DEVLINK_PORT,
> > > >>
> > > >> > +     IFLA_XDP_FEATURES,
> > > >> > +
> > > >> >       __IFLA_MAX
> > > >> >   };
> > > >>
> > > >> > @@ -1222,6 +1224,11 @@ enum {
> > > >>
> > > >> >   #define IFLA_XDP_MAX (__IFLA_XDP_MAX - 1)
> > > >>
> > > >> > +enum {
> > > >> > +     IFLA_XDP_FEATURES_WORD_UNSPEC = 0,
> > > >> > +     IFLA_XDP_FEATURES_BITS_WORD,
> > > >> > +};
> > > >> > +
> > > >> >   enum {
> > > >> >       IFLA_EVENT_NONE,
> > > >> >       IFLA_EVENT_REBOOT,              /* internal reset / reboot */
> > > >> > diff --git a/tools/include/uapi/linux/xdp_features.h
> > > >> > b/tools/include/uapi/linux/xdp_features.h
> > > >> > new file mode 100644
> > > >> > index 000000000000..48eb42069bcd
> > > >> > --- /dev/null
> > > >> > +++ b/tools/include/uapi/linux/xdp_features.h
> > > >> > @@ -0,0 +1,34 @@
> > > >> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > > >> > +/*
> > > >> > + * Copyright (c) 2020 Intel
> > > >> > + */
> > > >> > +
> > > >> > +#ifndef __UAPI_LINUX_XDP_FEATURES__
> > > >> > +#define __UAPI_LINUX_XDP_FEATURES__
> > > >> > +
> > > >> > +enum {
> > > >> > +     XDP_F_ABORTED_BIT,
> > > >> > +     XDP_F_DROP_BIT,
> > > >> > +     XDP_F_PASS_BIT,
> > > >> > +     XDP_F_TX_BIT,
> > > >> > +     XDP_F_REDIRECT_BIT,
> > > >> > +     XDP_F_REDIRECT_TARGET_BIT,
> > > >> > +     XDP_F_SOCK_ZEROCOPY_BIT,
> > > >> > +     XDP_F_HW_OFFLOAD_BIT,
> > > >> > +     XDP_F_TX_LOCK_BIT,
> > > >> > +     XDP_F_FRAG_RX_BIT,
> > > >> > +     XDP_F_FRAG_TARGET_BIT,
> > > >> > +     /*
> > > >> > +      * Add your fresh new property above and remember to update
> > > >> > +      * documentation.
> > > >> > +      */
> > > >> > +     XDP_FEATURES_COUNT,
> > > >> > +};
> > > >> > +
> > > >> > +#define XDP_FEATURES_WORDS                   ((XDP_FEATURES_COUNT + 32 - 1) / 32)
> > > >> > +#define XDP_FEATURES_WORD(blocks, index)     ((blocks)[(index) / 32U])
> > > >> > +#define XDP_FEATURES_FIELD_FLAG(index)               (1U << (index) % 32U)
> > > >> > +#define XDP_FEATURES_BIT_IS_SET(blocks, index)        \
> > > >> > +     (XDP_FEATURES_WORD(blocks, index) & XDP_FEATURES_FIELD_FLAG(index))
> > > >> > +
> > > >> > +#endif  /* __UAPI_LINUX_XDP_FEATURES__ */
> > > >> > --
> > > >> > 2.38.1
> > > >>
> > >

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* [PATCH bpf-next v5 17/17] selftests/bpf: Simple program to dump XDP RX metadata
From: Stanislav Fomichev @ 2022-12-20 22:20 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev
In-Reply-To: <20221220222043.3348718-1-sdf@google.com>

To be used for verification of driver implementations. Note that
the skb path is gone from the series, but I'm still keeping the
implementation for any possible future work.

$ xdp_hw_metadata <ifname>

On the other machine:

$ echo -n xdp | nc -u -q1 <target> 9091 # for AF_XDP
$ echo -n skb | nc -u -q1 <target> 9092 # for skb

Sample output:

  # xdp
  xsk_ring_cons__peek: 1
  0x19f9090: rx_desc[0]->addr=100000000008000 addr=8100 comp_addr=8000
  rx_timestamp_supported: 1
  rx_timestamp: 1667850075063948829
  0x19f9090: complete idx=8 addr=8000

  # skb
  found skb hwtstamp = 1668314052.854274681

Decoding:
  # xdp
  rx_timestamp=1667850075.063948829

  $ date -d @1667850075
  Mon Nov  7 11:41:15 AM PST 2022
  $ date
  Mon Nov  7 11:42:05 AM PST 2022

  # skb
  $ date -d @1668314052
  Sat Nov 12 08:34:12 PM PST 2022
  $ date
  Sat Nov 12 08:37:06 PM PST 2022

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/.gitignore        |   1 +
 tools/testing/selftests/bpf/Makefile          |   6 +-
 .../selftests/bpf/progs/xdp_hw_metadata.c     |  81 ++++
 tools/testing/selftests/bpf/xdp_hw_metadata.c | 405 ++++++++++++++++++
 4 files changed, 492 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
 create mode 100644 tools/testing/selftests/bpf/xdp_hw_metadata.c

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 07d2d0a8c5cb..01e3baeefd4f 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -46,3 +46,4 @@ test_cpp
 xskxceiver
 xdp_redirect_multi
 xdp_synproxy
+xdp_hw_metadata
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index e6cbc04a7920..b7d5d3aa554e 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -83,7 +83,7 @@ TEST_PROGS_EXTENDED := with_addr.sh \
 TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
 	flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
 	test_lirc_mode2_user xdping test_cpp runqslower bench bpf_testmod.ko \
-	xskxceiver xdp_redirect_multi xdp_synproxy veristat
+	xskxceiver xdp_redirect_multi xdp_synproxy veristat xdp_hw_metadata
 
 TEST_CUSTOM_PROGS = $(OUTPUT)/urandom_read $(OUTPUT)/sign-file
 TEST_GEN_FILES += liburandom_read.so
@@ -241,6 +241,9 @@ $(OUTPUT)/test_maps: $(TESTING_HELPERS)
 $(OUTPUT)/test_verifier: $(TESTING_HELPERS) $(CAP_HELPERS)
 $(OUTPUT)/xsk.o: $(BPFOBJ)
 $(OUTPUT)/xskxceiver: $(OUTPUT)/xsk.o
+$(OUTPUT)/xdp_hw_metadata: $(OUTPUT)/xsk.o $(OUTPUT)/xdp_hw_metadata.skel.h
+$(OUTPUT)/xdp_hw_metadata: $(OUTPUT)/network_helpers.o
+$(OUTPUT)/xdp_hw_metadata: LDFLAGS += -static
 
 BPFTOOL ?= $(DEFAULT_BPFTOOL)
 $(DEFAULT_BPFTOOL): $(wildcard $(BPFTOOLDIR)/*.[ch] $(BPFTOOLDIR)/Makefile)    \
@@ -383,6 +386,7 @@ linked_maps.skel.h-deps := linked_maps1.bpf.o linked_maps2.bpf.o
 test_subskeleton.skel.h-deps := test_subskeleton_lib2.bpf.o test_subskeleton_lib.bpf.o test_subskeleton.bpf.o
 test_subskeleton_lib.skel.h-deps := test_subskeleton_lib2.bpf.o test_subskeleton_lib.bpf.o
 test_usdt.skel.h-deps := test_usdt.bpf.o test_usdt_multispec.bpf.o
+xdp_hw_metadata.skel.h-deps := xdp_hw_metadata.bpf.o
 
 LINKED_BPF_SRCS := $(patsubst %.bpf.o,%.c,$(foreach skel,$(LINKED_SKELS),$($(skel)-deps)))
 
diff --git a/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
new file mode 100644
index 000000000000..25b8178735ee
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
@@ -0,0 +1,81 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <vmlinux.h>
+#include "xdp_metadata.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+struct {
+	__uint(type, BPF_MAP_TYPE_XSKMAP);
+	__uint(max_entries, 256);
+	__type(key, __u32);
+	__type(value, __u32);
+} xsk SEC(".maps");
+
+extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
+					 __u64 *timestamp) __ksym;
+extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx,
+				    __u32 *hash) __ksym;
+
+SEC("xdp")
+int rx(struct xdp_md *ctx)
+{
+	void *data, *data_meta, *data_end;
+	struct ipv6hdr *ip6h = NULL;
+	struct ethhdr *eth = NULL;
+	struct udphdr *udp = NULL;
+	struct iphdr *iph = NULL;
+	struct xdp_meta *meta;
+	int ret;
+
+	data = (void *)(long)ctx->data;
+	data_end = (void *)(long)ctx->data_end;
+	eth = data;
+	if (eth + 1 < data_end) {
+		if (eth->h_proto == bpf_htons(ETH_P_IP)) {
+			iph = (void *)(eth + 1);
+			if (iph + 1 < data_end && iph->protocol == IPPROTO_UDP)
+				udp = (void *)(iph + 1);
+		}
+		if (eth->h_proto == bpf_htons(ETH_P_IPV6)) {
+			ip6h = (void *)(eth + 1);
+			if (ip6h + 1 < data_end && ip6h->nexthdr == IPPROTO_UDP)
+				udp = (void *)(ip6h + 1);
+		}
+		if (udp && udp + 1 > data_end)
+			udp = NULL;
+	}
+
+	if (!udp)
+		return XDP_PASS;
+
+	if (udp->dest != bpf_htons(9091))
+		return XDP_PASS;
+
+	bpf_printk("forwarding UDP:9091 to AF_XDP");
+
+	ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(struct xdp_meta));
+	if (ret != 0) {
+		bpf_printk("bpf_xdp_adjust_meta returned %d", ret);
+		return XDP_PASS;
+	}
+
+	data = (void *)(long)ctx->data;
+	data_meta = (void *)(long)ctx->data_meta;
+	meta = data_meta;
+
+	if (meta + 1 > data) {
+		bpf_printk("bpf_xdp_adjust_meta doesn't appear to work");
+		return XDP_PASS;
+	}
+
+	if (!bpf_xdp_metadata_rx_timestamp(ctx, &meta->rx_timestamp))
+		bpf_printk("populated rx_timestamp with %u", meta->rx_timestamp);
+
+	if (!bpf_xdp_metadata_rx_hash(ctx, &meta->rx_hash))
+		bpf_printk("populated rx_hash with %u", meta->rx_hash);
+
+	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c b/tools/testing/selftests/bpf/xdp_hw_metadata.c
new file mode 100644
index 000000000000..b8b7600dc275
--- /dev/null
+++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
@@ -0,0 +1,405 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* Reference program for verifying XDP metadata on real HW. Functional test
+ * only, doesn't test the performance.
+ *
+ * RX:
+ * - UDP 9091 packets are diverted into AF_XDP
+ * - Metadata verified:
+ *   - rx_timestamp
+ *   - rx_hash
+ *
+ * TX:
+ * - TBD
+ */
+
+#include <test_progs.h>
+#include <network_helpers.h>
+#include "xdp_hw_metadata.skel.h"
+#include "xsk.h"
+
+#include <error.h>
+#include <linux/errqueue.h>
+#include <linux/if_link.h>
+#include <linux/net_tstamp.h>
+#include <linux/udp.h>
+#include <linux/sockios.h>
+#include <sys/mman.h>
+#include <net/if.h>
+#include <poll.h>
+
+#include "xdp_metadata.h"
+
+#define UMEM_NUM 16
+#define UMEM_FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE
+#define UMEM_SIZE (UMEM_FRAME_SIZE * UMEM_NUM)
+#define XDP_FLAGS (XDP_FLAGS_DRV_MODE | XDP_FLAGS_REPLACE)
+
+struct xsk {
+	void *umem_area;
+	struct xsk_umem *umem;
+	struct xsk_ring_prod fill;
+	struct xsk_ring_cons comp;
+	struct xsk_ring_prod tx;
+	struct xsk_ring_cons rx;
+	struct xsk_socket *socket;
+};
+
+struct xdp_hw_metadata *bpf_obj;
+struct xsk *rx_xsk;
+const char *ifname;
+int ifindex;
+int rxq;
+
+void test__fail(void) { /* for network_helpers.c */ }
+
+static int open_xsk(const char *ifname, struct xsk *xsk, __u32 queue_id)
+{
+	int mmap_flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE;
+	const struct xsk_socket_config socket_config = {
+		.rx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.libbpf_flags = XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD,
+		.xdp_flags = XDP_FLAGS,
+		.bind_flags = XDP_COPY,
+	};
+	const struct xsk_umem_config umem_config = {
+		.fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
+		.frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE,
+		.flags = XDP_UMEM_UNALIGNED_CHUNK_FLAG,
+	};
+	__u32 idx;
+	u64 addr;
+	int ret;
+	int i;
+
+	xsk->umem_area = mmap(NULL, UMEM_SIZE, PROT_READ | PROT_WRITE, mmap_flags, -1, 0);
+	if (xsk->umem_area == MAP_FAILED)
+		return -ENOMEM;
+
+	ret = xsk_umem__create(&xsk->umem,
+			       xsk->umem_area, UMEM_SIZE,
+			       &xsk->fill,
+			       &xsk->comp,
+			       &umem_config);
+	if (ret)
+		return ret;
+
+	ret = xsk_socket__create(&xsk->socket, ifname, queue_id,
+				 xsk->umem,
+				 &xsk->rx,
+				 &xsk->tx,
+				 &socket_config);
+	if (ret)
+		return ret;
+
+	/* First half of umem is for TX. This way address matches 1-to-1
+	 * to the completion queue index.
+	 */
+
+	for (i = 0; i < UMEM_NUM / 2; i++) {
+		addr = i * UMEM_FRAME_SIZE;
+		printf("%p: tx_desc[%d] -> %lx\n", xsk, i, addr);
+	}
+
+	/* Second half of umem is for RX. */
+
+	ret = xsk_ring_prod__reserve(&xsk->fill, UMEM_NUM / 2, &idx);
+	for (i = 0; i < UMEM_NUM / 2; i++) {
+		addr = (UMEM_NUM / 2 + i) * UMEM_FRAME_SIZE;
+		printf("%p: rx_desc[%d] -> %lx\n", xsk, i, addr);
+		*xsk_ring_prod__fill_addr(&xsk->fill, i) = addr;
+	}
+	xsk_ring_prod__submit(&xsk->fill, ret);
+
+	return 0;
+}
+
+static void close_xsk(struct xsk *xsk)
+{
+	if (xsk->umem)
+		xsk_umem__delete(xsk->umem);
+	if (xsk->socket)
+		xsk_socket__delete(xsk->socket);
+	munmap(xsk->umem, UMEM_SIZE);
+}
+
+static void refill_rx(struct xsk *xsk, __u64 addr)
+{
+	__u32 idx;
+
+	if (xsk_ring_prod__reserve(&xsk->fill, 1, &idx) == 1) {
+		printf("%p: complete idx=%u addr=%llx\n", xsk, idx, addr);
+		*xsk_ring_prod__fill_addr(&xsk->fill, idx) = addr;
+		xsk_ring_prod__submit(&xsk->fill, 1);
+	}
+}
+
+static void verify_xdp_metadata(void *data)
+{
+	struct xdp_meta *meta;
+
+	meta = data - sizeof(*meta);
+
+	printf("rx_timestamp: %llu\n", meta->rx_timestamp);
+	printf("rx_hash: %u\n", meta->rx_hash);
+}
+
+static void verify_skb_metadata(int fd)
+{
+	char cmsg_buf[1024];
+	char packet_buf[128];
+
+	struct scm_timestamping *ts;
+	struct iovec packet_iov;
+	struct cmsghdr *cmsg;
+	struct msghdr hdr;
+
+	memset(&hdr, 0, sizeof(hdr));
+	hdr.msg_iov = &packet_iov;
+	hdr.msg_iovlen = 1;
+	packet_iov.iov_base = packet_buf;
+	packet_iov.iov_len = sizeof(packet_buf);
+
+	hdr.msg_control = cmsg_buf;
+	hdr.msg_controllen = sizeof(cmsg_buf);
+
+	if (recvmsg(fd, &hdr, 0) < 0)
+		error(-1, errno, "recvmsg");
+
+	for (cmsg = CMSG_FIRSTHDR(&hdr); cmsg != NULL;
+	     cmsg = CMSG_NXTHDR(&hdr, cmsg)) {
+
+		if (cmsg->cmsg_level != SOL_SOCKET)
+			continue;
+
+		switch (cmsg->cmsg_type) {
+		case SCM_TIMESTAMPING:
+			ts = (struct scm_timestamping *)CMSG_DATA(cmsg);
+			if (ts->ts[2].tv_sec || ts->ts[2].tv_nsec) {
+				printf("found skb hwtstamp = %lu.%lu\n",
+				       ts->ts[2].tv_sec, ts->ts[2].tv_nsec);
+				return;
+			}
+			break;
+		default:
+			break;
+		}
+	}
+
+	printf("skb hwtstamp is not found!\n");
+}
+
+static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd)
+{
+	const struct xdp_desc *rx_desc;
+	struct pollfd fds[rxq + 1];
+	__u64 comp_addr;
+	__u64 addr;
+	__u32 idx;
+	int ret;
+	int i;
+
+	for (i = 0; i < rxq; i++) {
+		fds[i].fd = xsk_socket__fd(rx_xsk[i].socket);
+		fds[i].events = POLLIN;
+		fds[i].revents = 0;
+	}
+
+	fds[rxq].fd = server_fd;
+	fds[rxq].events = POLLIN;
+	fds[rxq].revents = 0;
+
+	while (true) {
+		errno = 0;
+		ret = poll(fds, rxq + 1, 1000);
+		printf("poll: %d (%d)\n", ret, errno);
+		if (ret < 0)
+			break;
+		if (ret == 0)
+			continue;
+
+		if (fds[rxq].revents)
+			verify_skb_metadata(server_fd);
+
+		for (i = 0; i < rxq; i++) {
+			if (fds[i].revents == 0)
+				continue;
+
+			struct xsk *xsk = &rx_xsk[i];
+
+			ret = xsk_ring_cons__peek(&xsk->rx, 1, &idx);
+			printf("xsk_ring_cons__peek: %d\n", ret);
+			if (ret != 1)
+				continue;
+
+			rx_desc = xsk_ring_cons__rx_desc(&xsk->rx, idx);
+			comp_addr = xsk_umem__extract_addr(rx_desc->addr);
+			addr = xsk_umem__add_offset_to_addr(rx_desc->addr);
+			printf("%p: rx_desc[%u]->addr=%llx addr=%llx comp_addr=%llx\n",
+			       xsk, idx, rx_desc->addr, addr, comp_addr);
+			verify_xdp_metadata(xsk_umem__get_data(xsk->umem_area, addr));
+			xsk_ring_cons__release(&xsk->rx, 1);
+			refill_rx(xsk, comp_addr);
+		}
+	}
+
+	return 0;
+}
+
+struct ethtool_channels {
+	__u32	cmd;
+	__u32	max_rx;
+	__u32	max_tx;
+	__u32	max_other;
+	__u32	max_combined;
+	__u32	rx_count;
+	__u32	tx_count;
+	__u32	other_count;
+	__u32	combined_count;
+};
+
+#define ETHTOOL_GCHANNELS	0x0000003c /* Get no of channels */
+
+static int rxq_num(const char *ifname)
+{
+	struct ethtool_channels ch = {
+		.cmd = ETHTOOL_GCHANNELS,
+	};
+
+	struct ifreq ifr = {
+		.ifr_data = (void *)&ch,
+	};
+	strcpy(ifr.ifr_name, ifname);
+	int fd, ret;
+
+	fd = socket(AF_UNIX, SOCK_DGRAM, 0);
+	if (fd < 0)
+		error(-1, errno, "socket");
+
+	ret = ioctl(fd, SIOCETHTOOL, &ifr);
+	if (ret < 0)
+		error(-1, errno, "socket");
+
+	close(fd);
+
+	return ch.rx_count + ch.combined_count;
+}
+
+static void cleanup(void)
+{
+	LIBBPF_OPTS(bpf_xdp_attach_opts, opts);
+	int ret;
+	int i;
+
+	if (bpf_obj) {
+		opts.old_prog_fd = bpf_program__fd(bpf_obj->progs.rx);
+		if (opts.old_prog_fd >= 0) {
+			printf("detaching bpf program....\n");
+			ret = bpf_xdp_detach(ifindex, XDP_FLAGS, &opts);
+			if (ret)
+				printf("failed to detach XDP program: %d\n", ret);
+		}
+	}
+
+	for (i = 0; i < rxq; i++)
+		close_xsk(&rx_xsk[i]);
+
+	if (bpf_obj)
+		xdp_hw_metadata__destroy(bpf_obj);
+}
+
+static void handle_signal(int sig)
+{
+	/* interrupting poll() is all we need */
+}
+
+static void timestamping_enable(int fd, int val)
+{
+	int ret;
+
+	ret = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
+	if (ret < 0)
+		error(-1, errno, "setsockopt(SO_TIMESTAMPING)");
+}
+
+int main(int argc, char *argv[])
+{
+	int server_fd = -1;
+	int ret;
+	int i;
+
+	struct bpf_program *prog;
+
+	if (argc != 2) {
+		fprintf(stderr, "pass device name\n");
+		return -1;
+	}
+
+	ifname = argv[1];
+	ifindex = if_nametoindex(ifname);
+	rxq = rxq_num(ifname);
+
+	printf("rxq: %d\n", rxq);
+
+	rx_xsk = malloc(sizeof(struct xsk) * rxq);
+	if (!rx_xsk)
+		error(-1, ENOMEM, "malloc");
+
+	for (i = 0; i < rxq; i++) {
+		printf("open_xsk(%s, %p, %d)\n", ifname, &rx_xsk[i], i);
+		ret = open_xsk(ifname, &rx_xsk[i], i);
+		if (ret)
+			error(-1, -ret, "open_xsk");
+
+		printf("xsk_socket__fd() -> %d\n", xsk_socket__fd(rx_xsk[i].socket));
+	}
+
+	printf("open bpf program...\n");
+	bpf_obj = xdp_hw_metadata__open();
+	if (libbpf_get_error(bpf_obj))
+		error(-1, libbpf_get_error(bpf_obj), "xdp_hw_metadata__open");
+
+	prog = bpf_object__find_program_by_name(bpf_obj->obj, "rx");
+	bpf_program__set_ifindex(prog, ifindex);
+	bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
+
+	printf("load bpf program...\n");
+	ret = xdp_hw_metadata__load(bpf_obj);
+	if (ret)
+		error(-1, -ret, "xdp_hw_metadata__load");
+
+	printf("prepare skb endpoint...\n");
+	server_fd = start_server(AF_INET6, SOCK_DGRAM, NULL, 9092, 1000);
+	if (server_fd < 0)
+		error(-1, errno, "start_server");
+	timestamping_enable(server_fd,
+			    SOF_TIMESTAMPING_SOFTWARE |
+			    SOF_TIMESTAMPING_RAW_HARDWARE);
+
+	printf("prepare xsk map...\n");
+	for (i = 0; i < rxq; i++) {
+		int sock_fd = xsk_socket__fd(rx_xsk[i].socket);
+		__u32 queue_id = i;
+
+		printf("map[%d] = %d\n", queue_id, sock_fd);
+		ret = bpf_map_update_elem(bpf_map__fd(bpf_obj->maps.xsk), &queue_id, &sock_fd, 0);
+		if (ret)
+			error(-1, -ret, "bpf_map_update_elem");
+	}
+
+	printf("attach bpf program...\n");
+	ret = bpf_xdp_attach(ifindex,
+			     bpf_program__fd(bpf_obj->progs.rx),
+			     XDP_FLAGS, NULL);
+	if (ret)
+		error(-1, -ret, "bpf_xdp_attach");
+
+	signal(SIGINT, handle_signal);
+	ret = verify_metadata(rx_xsk, rxq, server_fd);
+	close(server_fd);
+	cleanup();
+	if (ret)
+		error(-1, -ret, "verify_metadata");
+}
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related

* [PATCH bpf-next v5 16/17] net/mlx5e: Support RX XDP metadata
From: Stanislav Fomichev @ 2022-12-20 22:20 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, Toke Høiland-Jørgensen,
	Saeed Mahameed, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev
In-Reply-To: <20221220222043.3348718-1-sdf@google.com>

From: Toke Høiland-Jørgensen <toke@redhat.com>

Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe
pointer to the mlx5e_skb_from* functions so it can be retrieved from the
XDP ctx to do this.

Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  | 10 +++-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 23 +++++++++
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  5 ++
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   | 10 ++++
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.h   |  2 +
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  6 +++
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 51 ++++++++++---------
 7 files changed, 81 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index af663978d1b4..af0be59b956e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -627,10 +627,11 @@ struct mlx5e_rq;
 typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq*, struct mlx5_cqe64*);
 typedef struct sk_buff *
 (*mlx5e_fp_skb_from_cqe_mpwrq)(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-			       u16 cqe_bcnt, u32 head_offset, u32 page_idx);
+			       struct mlx5_cqe64 *cqe, u16 cqe_bcnt,
+			       u32 head_offset, u32 page_idx);
 typedef struct sk_buff *
 (*mlx5e_fp_skb_from_cqe)(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
-			 u32 cqe_bcnt);
+			 struct mlx5_cqe64 *cqe, u32 cqe_bcnt);
 typedef bool (*mlx5e_fp_post_rx_wqes)(struct mlx5e_rq *rq);
 typedef void (*mlx5e_fp_dealloc_wqe)(struct mlx5e_rq*, u16);
 typedef void (*mlx5e_fp_shampo_dealloc_hd)(struct mlx5e_rq*, u16, u16, bool);
@@ -1036,6 +1037,11 @@ int mlx5e_vlan_rx_kill_vid(struct net_device *dev, __always_unused __be16 proto,
 			   u16 vid);
 void mlx5e_timestamp_init(struct mlx5e_priv *priv);
 
+static inline bool mlx5e_rx_hw_stamp(struct hwtstamp_config *config)
+{
+	return config->rx_filter == HWTSTAMP_FILTER_ALL;
+}
+
 struct mlx5e_xsk_param;
 
 struct mlx5e_rq_param;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 31bb6806bf5d..d10d31e12ba2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -156,6 +156,29 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
 	return true;
 }
 
+int mlx5e_xdp_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp)
+{
+	const struct mlx5e_xdp_buff *_ctx = (void *)ctx;
+
+	if (unlikely(!mlx5e_rx_hw_stamp(_ctx->rq->tstamp)))
+		return -EOPNOTSUPP;
+
+	*timestamp =  mlx5e_cqe_ts_to_ns(_ctx->rq->ptp_cyc2time,
+					 _ctx->rq->clock, get_cqe_ts(_ctx->cqe));
+	return 0;
+}
+
+int mlx5e_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash)
+{
+	const struct mlx5e_xdp_buff *_ctx = (void *)ctx;
+
+	if (unlikely(!(_ctx->xdp.rxq->dev->features & NETIF_F_RXHASH)))
+		return -EOPNOTSUPP;
+
+	*hash = be32_to_cpu(_ctx->cqe->rss_hash_result);
+	return 0;
+}
+
 /* returns true if packet was consumed by xdp */
 bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct page *page,
 		      struct bpf_prog *prog, struct mlx5e_xdp_buff *mxbuf)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
index 389818bf6833..cb568c62aba0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
@@ -46,6 +46,8 @@
 
 struct mlx5e_xdp_buff {
 	struct xdp_buff xdp;
+	struct mlx5_cqe64 *cqe;
+	struct mlx5e_rq *rq;
 };
 
 struct mlx5e_xsk_param;
@@ -60,6 +62,9 @@ void mlx5e_xdp_rx_poll_complete(struct mlx5e_rq *rq);
 int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
 		   u32 flags);
 
+int mlx5e_xdp_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp);
+int mlx5e_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash);
+
 INDIRECT_CALLABLE_DECLARE(bool mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq,
 							  struct mlx5e_xmit_data *xdptxd,
 							  struct skb_shared_info *sinfo,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
index 9cff82d764e3..8bf3029abd3c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
@@ -49,6 +49,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 			umr_wqe->inline_mtts[i] = (struct mlx5_mtt) {
 				.ptag = cpu_to_be64(addr | MLX5_EN_WR),
 			};
+			wi->alloc_units[i].mxbuf->rq = rq;
 		}
 	} else if (unlikely(rq->mpwqe.umr_mode == MLX5E_MPWRQ_UMR_MODE_UNALIGNED)) {
 		for (i = 0; i < batch; i++) {
@@ -58,6 +59,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 				.key = rq->mkey_be,
 				.va = cpu_to_be64(addr),
 			};
+			wi->alloc_units[i].mxbuf->rq = rq;
 		}
 	} else if (likely(rq->mpwqe.umr_mode == MLX5E_MPWRQ_UMR_MODE_TRIPLE)) {
 		u32 mapping_size = 1 << (rq->mpwqe.page_shift - 2);
@@ -81,6 +83,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 				.key = rq->mkey_be,
 				.va = cpu_to_be64(rq->wqe_overflow.addr),
 			};
+			wi->alloc_units[i].mxbuf->rq = rq;
 		}
 	} else {
 		__be32 pad_size = cpu_to_be32((1 << rq->mpwqe.page_shift) -
@@ -100,6 +103,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 				.va = cpu_to_be64(rq->wqe_overflow.addr),
 				.bcount = pad_size,
 			};
+			wi->alloc_units[i].mxbuf->rq = rq;
 		}
 	}
 
@@ -230,6 +234,7 @@ static struct sk_buff *mlx5e_xsk_construct_skb(struct mlx5e_rq *rq, struct xdp_b
 
 struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 						    struct mlx5e_mpw_info *wi,
+						    struct mlx5_cqe64 *cqe,
 						    u16 cqe_bcnt,
 						    u32 head_offset,
 						    u32 page_idx)
@@ -250,6 +255,8 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 	 */
 	WARN_ON_ONCE(head_offset);
 
+	/* mxbuf->rq is set on allocation, but cqe is per-packet so set it here */
+	mxbuf->cqe = cqe;
 	xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt);
 	xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool);
 	net_prefetch(mxbuf->xdp.data);
@@ -284,6 +291,7 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 
 struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
 					      struct mlx5e_wqe_frag_info *wi,
+					      struct mlx5_cqe64 *cqe,
 					      u32 cqe_bcnt)
 {
 	struct mlx5e_xdp_buff *mxbuf = wi->au->mxbuf;
@@ -296,6 +304,8 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
 	 */
 	WARN_ON_ONCE(wi->offset);
 
+	/* mxbuf->rq is set on allocation, but cqe is per-packet so set it here */
+	mxbuf->cqe = cqe;
 	xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt);
 	xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool);
 	net_prefetch(mxbuf->xdp.data);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
index 087c943bd8e9..cefc0ef6105d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
@@ -13,11 +13,13 @@ int mlx5e_xsk_alloc_rx_wqes_batched(struct mlx5e_rq *rq, u16 ix, int wqe_bulk);
 int mlx5e_xsk_alloc_rx_wqes(struct mlx5e_rq *rq, u16 ix, int wqe_bulk);
 struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 						    struct mlx5e_mpw_info *wi,
+						    struct mlx5_cqe64 *cqe,
 						    u16 cqe_bcnt,
 						    u32 head_offset,
 						    u32 page_idx);
 struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
 					      struct mlx5e_wqe_frag_info *wi,
+					      struct mlx5_cqe64 *cqe,
 					      u32 cqe_bcnt);
 
 #endif /* __MLX5_EN_XSK_RX_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 8d36e2de53a9..2dddb05d2e60 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4913,6 +4913,11 @@ const struct net_device_ops mlx5e_netdev_ops = {
 #endif
 };
 
+static const struct xdp_metadata_ops mlx5_xdp_metadata_ops = {
+	.xmo_rx_timestamp		= mlx5e_xdp_rx_timestamp,
+	.xmo_rx_hash			= mlx5e_xdp_rx_hash,
+};
+
 static u32 mlx5e_choose_lro_timeout(struct mlx5_core_dev *mdev, u32 wanted_timeout)
 {
 	int i;
@@ -5053,6 +5058,7 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
 	SET_NETDEV_DEV(netdev, mdev->device);
 
 	netdev->netdev_ops = &mlx5e_netdev_ops;
+	netdev->xdp_metadata_ops = &mlx5_xdp_metadata_ops;
 
 	mlx5e_dcbnl_build_netdev(netdev);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index c8a2b26de36e..10d45064e613 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -62,10 +62,12 @@
 
 static struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				u16 cqe_bcnt, u32 head_offset, u32 page_idx);
+				struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
+				u32 page_idx);
 static struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				   u16 cqe_bcnt, u32 head_offset, u32 page_idx);
+				   struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
+				   u32 page_idx);
 static void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 static void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 static void mlx5e_handle_rx_cqe_mpwrq_shampo(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
@@ -76,11 +78,6 @@ const struct mlx5e_rx_handlers mlx5e_rx_handlers_nic = {
 	.handle_rx_cqe_mpwqe_shampo = mlx5e_handle_rx_cqe_mpwrq_shampo,
 };
 
-static inline bool mlx5e_rx_hw_stamp(struct hwtstamp_config *config)
-{
-	return config->rx_filter == HWTSTAMP_FILTER_ALL;
-}
-
 static inline void mlx5e_read_cqe_slot(struct mlx5_cqwq *wq,
 				       u32 cqcc, void *data)
 {
@@ -1575,16 +1572,19 @@ struct sk_buff *mlx5e_build_linear_skb(struct mlx5e_rq *rq, void *va,
 	return skb;
 }
 
-static void mlx5e_fill_xdp_buff(struct mlx5e_rq *rq, void *va, u16 headroom,
-				u32 len, struct mlx5e_xdp_buff *mxbuf)
+static void mlx5e_fill_xdp_buff(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
+				void *va, u16 headroom, u32 len,
+				struct mlx5e_xdp_buff *mxbuf)
 {
 	xdp_init_buff(&mxbuf->xdp, rq->buff.frame0_sz, &rq->xdp_rxq);
 	xdp_prepare_buff(&mxbuf->xdp, va, headroom, len, true);
+	mxbuf->cqe = cqe;
+	mxbuf->rq = rq;
 }
 
 static struct sk_buff *
 mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
-			  u32 cqe_bcnt)
+			  struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
 {
 	union mlx5e_alloc_unit *au = wi->au;
 	u16 rx_headroom = rq->buff.headroom;
@@ -1609,7 +1609,7 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
 		struct mlx5e_xdp_buff mxbuf;
 
 		net_prefetchw(va); /* xdp_frame data area */
-		mlx5e_fill_xdp_buff(rq, va, rx_headroom, cqe_bcnt, &mxbuf);
+		mlx5e_fill_xdp_buff(rq, cqe, va, rx_headroom, cqe_bcnt, &mxbuf);
 		if (mlx5e_xdp_handle(rq, au->page, prog, &mxbuf))
 			return NULL; /* page/packet was consumed by XDP */
 
@@ -1630,7 +1630,7 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
 
 static struct sk_buff *
 mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
-			     u32 cqe_bcnt)
+			     struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
 {
 	struct mlx5e_rq_frag_info *frag_info = &rq->wqe.info.arr[0];
 	struct mlx5e_wqe_frag_info *head_wi = wi;
@@ -1654,7 +1654,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 	net_prefetchw(va); /* xdp_frame data area */
 	net_prefetch(va + rx_headroom);
 
-	mlx5e_fill_xdp_buff(rq, va, rx_headroom, frag_consumed_bytes, &mxbuf);
+	mlx5e_fill_xdp_buff(rq, cqe, va, rx_headroom, frag_consumed_bytes, &mxbuf);
 	sinfo = xdp_get_shared_info_from_buff(&mxbuf.xdp);
 	truesize = 0;
 
@@ -1777,7 +1777,7 @@ static void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 			      mlx5e_skb_from_cqe_linear,
 			      mlx5e_skb_from_cqe_nonlinear,
 			      mlx5e_xsk_skb_from_cqe_linear,
-			      rq, wi, cqe_bcnt);
+			      rq, wi, cqe, cqe_bcnt);
 	if (!skb) {
 		/* probably for XDP */
 		if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
@@ -1830,7 +1830,7 @@ static void mlx5e_handle_rx_cqe_rep(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	skb = INDIRECT_CALL_2(rq->wqe.skb_from_cqe,
 			      mlx5e_skb_from_cqe_linear,
 			      mlx5e_skb_from_cqe_nonlinear,
-			      rq, wi, cqe_bcnt);
+			      rq, wi, cqe, cqe_bcnt);
 	if (!skb) {
 		/* probably for XDP */
 		if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
@@ -1889,7 +1889,7 @@ static void mlx5e_handle_rx_cqe_mpwrq_rep(struct mlx5e_rq *rq, struct mlx5_cqe64
 	skb = INDIRECT_CALL_2(rq->mpwqe.skb_from_cqe_mpwrq,
 			      mlx5e_skb_from_cqe_mpwrq_linear,
 			      mlx5e_skb_from_cqe_mpwrq_nonlinear,
-			      rq, wi, cqe_bcnt, head_offset, page_idx);
+			      rq, wi, cqe, cqe_bcnt, head_offset, page_idx);
 	if (!skb)
 		goto mpwrq_cqe_out;
 
@@ -1940,7 +1940,8 @@ mlx5e_fill_skb_data(struct sk_buff *skb, struct mlx5e_rq *rq,
 
 static struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				   u16 cqe_bcnt, u32 head_offset, u32 page_idx)
+				   struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
+				   u32 page_idx)
 {
 	union mlx5e_alloc_unit *au = &wi->alloc_units[page_idx];
 	u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
@@ -1979,7 +1980,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 
 static struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				u16 cqe_bcnt, u32 head_offset, u32 page_idx)
+				struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
+				u32 page_idx)
 {
 	union mlx5e_alloc_unit *au = &wi->alloc_units[page_idx];
 	u16 rx_headroom = rq->buff.headroom;
@@ -2010,7 +2012,7 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 		struct mlx5e_xdp_buff mxbuf;
 
 		net_prefetchw(va); /* xdp_frame data area */
-		mlx5e_fill_xdp_buff(rq, va, rx_headroom, cqe_bcnt, &mxbuf);
+		mlx5e_fill_xdp_buff(rq, cqe, va, rx_headroom, cqe_bcnt, &mxbuf);
 		if (mlx5e_xdp_handle(rq, au->page, prog, &mxbuf)) {
 			if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
 				__set_bit(page_idx, wi->xdp_xmit_bitmap); /* non-atomic */
@@ -2174,8 +2176,8 @@ static void mlx5e_handle_rx_cqe_mpwrq_shampo(struct mlx5e_rq *rq, struct mlx5_cq
 		if (likely(head_size))
 			*skb = mlx5e_skb_from_cqe_shampo(rq, wi, cqe, header_index);
 		else
-			*skb = mlx5e_skb_from_cqe_mpwrq_nonlinear(rq, wi, cqe_bcnt, data_offset,
-								  page_idx);
+			*skb = mlx5e_skb_from_cqe_mpwrq_nonlinear(rq, wi, cqe, cqe_bcnt,
+								  data_offset, page_idx);
 		if (unlikely(!*skb))
 			goto free_hd_entry;
 
@@ -2249,7 +2251,8 @@ static void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cq
 			      mlx5e_skb_from_cqe_mpwrq_linear,
 			      mlx5e_skb_from_cqe_mpwrq_nonlinear,
 			      mlx5e_xsk_skb_from_cqe_mpwrq_linear,
-			      rq, wi, cqe_bcnt, head_offset, page_idx);
+			      rq, wi, cqe, cqe_bcnt, head_offset,
+			      page_idx);
 	if (!skb)
 		goto mpwrq_cqe_out;
 
@@ -2494,7 +2497,7 @@ static void mlx5i_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	skb = INDIRECT_CALL_2(rq->wqe.skb_from_cqe,
 			      mlx5e_skb_from_cqe_linear,
 			      mlx5e_skb_from_cqe_nonlinear,
-			      rq, wi, cqe_bcnt);
+			      rq, wi, cqe, cqe_bcnt);
 	if (!skb)
 		goto wq_free_wqe;
 
@@ -2586,7 +2589,7 @@ static void mlx5e_trap_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe
 		goto free_wqe;
 	}
 
-	skb = mlx5e_skb_from_cqe_nonlinear(rq, wi, cqe_bcnt);
+	skb = mlx5e_skb_from_cqe_nonlinear(rq, wi, cqe, cqe_bcnt);
 	if (!skb)
 		goto free_wqe;
 
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related

* [PATCH bpf-next v5 15/17] net/mlx5e: Introduce wrapper for xdp_buff
From: Stanislav Fomichev @ 2022-12-20 22:20 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, Toke Høiland-Jørgensen,
	Saeed Mahameed, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev
In-Reply-To: <20221220222043.3348718-1-sdf@google.com>

From: Toke Høiland-Jørgensen <toke@redhat.com>

Preparation for implementing HW metadata kfuncs. No functional change.

Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  1 +
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  3 +-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  6 +-
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   | 25 +++++----
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 56 +++++++++----------
 5 files changed, 49 insertions(+), 42 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 2d77fb8a8a01..af663978d1b4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -469,6 +469,7 @@ struct mlx5e_txqsq {
 union mlx5e_alloc_unit {
 	struct page *page;
 	struct xdp_buff *xsk;
+	struct mlx5e_xdp_buff *mxbuf;
 };
 
 /* XDP packets can be transmitted in different ways. On completion, we need to
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 20507ef2f956..31bb6806bf5d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -158,8 +158,9 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
 
 /* returns true if packet was consumed by xdp */
 bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct page *page,
-		      struct bpf_prog *prog, struct xdp_buff *xdp)
+		      struct bpf_prog *prog, struct mlx5e_xdp_buff *mxbuf)
 {
+	struct xdp_buff *xdp = &mxbuf->xdp;
 	u32 act;
 	int err;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
index bc2d9034af5b..389818bf6833 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
@@ -44,10 +44,14 @@
 	(MLX5E_XDP_INLINE_WQE_MAX_DS_CNT * MLX5_SEND_WQE_DS - \
 	 sizeof(struct mlx5_wqe_inline_seg))
 
+struct mlx5e_xdp_buff {
+	struct xdp_buff xdp;
+};
+
 struct mlx5e_xsk_param;
 int mlx5e_xdp_max_mtu(struct mlx5e_params *params, struct mlx5e_xsk_param *xsk);
 bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct page *page,
-		      struct bpf_prog *prog, struct xdp_buff *xdp);
+		      struct bpf_prog *prog, struct mlx5e_xdp_buff *mlctx);
 void mlx5e_xdp_mpwqe_complete(struct mlx5e_xdpsq *sq);
 bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq);
 void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
index c91b54d9ff27..9cff82d764e3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
@@ -22,6 +22,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 		goto err;
 
 	BUILD_BUG_ON(sizeof(wi->alloc_units[0]) != sizeof(wi->alloc_units[0].xsk));
+	XSK_CHECK_PRIV_TYPE(struct mlx5e_xdp_buff);
 	batch = xsk_buff_alloc_batch(rq->xsk_pool, (struct xdp_buff **)wi->alloc_units,
 				     rq->mpwqe.pages_per_wqe);
 
@@ -233,7 +234,7 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 						    u32 head_offset,
 						    u32 page_idx)
 {
-	struct xdp_buff *xdp = wi->alloc_units[page_idx].xsk;
+	struct mlx5e_xdp_buff *mxbuf = wi->alloc_units[page_idx].mxbuf;
 	struct bpf_prog *prog;
 
 	/* Check packet size. Note LRO doesn't use linear SKB */
@@ -249,9 +250,9 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 	 */
 	WARN_ON_ONCE(head_offset);
 
-	xsk_buff_set_size(xdp, cqe_bcnt);
-	xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool);
-	net_prefetch(xdp->data);
+	xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt);
+	xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool);
+	net_prefetch(mxbuf->xdp.data);
 
 	/* Possible flows:
 	 * - XDP_REDIRECT to XSKMAP:
@@ -269,7 +270,7 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 	 */
 
 	prog = rcu_dereference(rq->xdp_prog);
-	if (likely(prog && mlx5e_xdp_handle(rq, NULL, prog, xdp))) {
+	if (likely(prog && mlx5e_xdp_handle(rq, NULL, prog, mxbuf))) {
 		if (likely(__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)))
 			__set_bit(page_idx, wi->xdp_xmit_bitmap); /* non-atomic */
 		return NULL; /* page/packet was consumed by XDP */
@@ -278,14 +279,14 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 	/* XDP_PASS: copy the data from the UMEM to a new SKB and reuse the
 	 * frame. On SKB allocation failure, NULL is returned.
 	 */
-	return mlx5e_xsk_construct_skb(rq, xdp);
+	return mlx5e_xsk_construct_skb(rq, &mxbuf->xdp);
 }
 
 struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
 					      struct mlx5e_wqe_frag_info *wi,
 					      u32 cqe_bcnt)
 {
-	struct xdp_buff *xdp = wi->au->xsk;
+	struct mlx5e_xdp_buff *mxbuf = wi->au->mxbuf;
 	struct bpf_prog *prog;
 
 	/* wi->offset is not used in this function, because xdp->data and the
@@ -295,17 +296,17 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
 	 */
 	WARN_ON_ONCE(wi->offset);
 
-	xsk_buff_set_size(xdp, cqe_bcnt);
-	xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool);
-	net_prefetch(xdp->data);
+	xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt);
+	xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool);
+	net_prefetch(mxbuf->xdp.data);
 
 	prog = rcu_dereference(rq->xdp_prog);
-	if (likely(prog && mlx5e_xdp_handle(rq, NULL, prog, xdp)))
+	if (likely(prog && mlx5e_xdp_handle(rq, NULL, prog, mxbuf)))
 		return NULL; /* page/packet was consumed by XDP */
 
 	/* XDP_PASS: copy the data from the UMEM to a new SKB. The frame reuse
 	 * will be handled by mlx5e_free_rx_wqe.
 	 * On SKB allocation failure, NULL is returned.
 	 */
-	return mlx5e_xsk_construct_skb(rq, xdp);
+	return mlx5e_xsk_construct_skb(rq, &mxbuf->xdp);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index c8820ab22169..c8a2b26de36e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -1576,10 +1576,10 @@ struct sk_buff *mlx5e_build_linear_skb(struct mlx5e_rq *rq, void *va,
 }
 
 static void mlx5e_fill_xdp_buff(struct mlx5e_rq *rq, void *va, u16 headroom,
-				u32 len, struct xdp_buff *xdp)
+				u32 len, struct mlx5e_xdp_buff *mxbuf)
 {
-	xdp_init_buff(xdp, rq->buff.frame0_sz, &rq->xdp_rxq);
-	xdp_prepare_buff(xdp, va, headroom, len, true);
+	xdp_init_buff(&mxbuf->xdp, rq->buff.frame0_sz, &rq->xdp_rxq);
+	xdp_prepare_buff(&mxbuf->xdp, va, headroom, len, true);
 }
 
 static struct sk_buff *
@@ -1606,16 +1606,16 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
 
 	prog = rcu_dereference(rq->xdp_prog);
 	if (prog) {
-		struct xdp_buff xdp;
+		struct mlx5e_xdp_buff mxbuf;
 
 		net_prefetchw(va); /* xdp_frame data area */
-		mlx5e_fill_xdp_buff(rq, va, rx_headroom, cqe_bcnt, &xdp);
-		if (mlx5e_xdp_handle(rq, au->page, prog, &xdp))
+		mlx5e_fill_xdp_buff(rq, va, rx_headroom, cqe_bcnt, &mxbuf);
+		if (mlx5e_xdp_handle(rq, au->page, prog, &mxbuf))
 			return NULL; /* page/packet was consumed by XDP */
 
-		rx_headroom = xdp.data - xdp.data_hard_start;
-		metasize = xdp.data - xdp.data_meta;
-		cqe_bcnt = xdp.data_end - xdp.data;
+		rx_headroom = mxbuf.xdp.data - mxbuf.xdp.data_hard_start;
+		metasize = mxbuf.xdp.data - mxbuf.xdp.data_meta;
+		cqe_bcnt = mxbuf.xdp.data_end - mxbuf.xdp.data;
 	}
 	frag_size = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
 	skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt, metasize);
@@ -1637,9 +1637,9 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 	union mlx5e_alloc_unit *au = wi->au;
 	u16 rx_headroom = rq->buff.headroom;
 	struct skb_shared_info *sinfo;
+	struct mlx5e_xdp_buff mxbuf;
 	u32 frag_consumed_bytes;
 	struct bpf_prog *prog;
-	struct xdp_buff xdp;
 	struct sk_buff *skb;
 	dma_addr_t addr;
 	u32 truesize;
@@ -1654,8 +1654,8 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 	net_prefetchw(va); /* xdp_frame data area */
 	net_prefetch(va + rx_headroom);
 
-	mlx5e_fill_xdp_buff(rq, va, rx_headroom, frag_consumed_bytes, &xdp);
-	sinfo = xdp_get_shared_info_from_buff(&xdp);
+	mlx5e_fill_xdp_buff(rq, va, rx_headroom, frag_consumed_bytes, &mxbuf);
+	sinfo = xdp_get_shared_info_from_buff(&mxbuf.xdp);
 	truesize = 0;
 
 	cqe_bcnt -= frag_consumed_bytes;
@@ -1673,13 +1673,13 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 		dma_sync_single_for_cpu(rq->pdev, addr + wi->offset,
 					frag_consumed_bytes, rq->buff.map_dir);
 
-		if (!xdp_buff_has_frags(&xdp)) {
+		if (!xdp_buff_has_frags(&mxbuf.xdp)) {
 			/* Init on the first fragment to avoid cold cache access
 			 * when possible.
 			 */
 			sinfo->nr_frags = 0;
 			sinfo->xdp_frags_size = 0;
-			xdp_buff_set_frags_flag(&xdp);
+			xdp_buff_set_frags_flag(&mxbuf.xdp);
 		}
 
 		frag = &sinfo->frags[sinfo->nr_frags++];
@@ -1688,7 +1688,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 		skb_frag_size_set(frag, frag_consumed_bytes);
 
 		if (page_is_pfmemalloc(au->page))
-			xdp_buff_set_frag_pfmemalloc(&xdp);
+			xdp_buff_set_frag_pfmemalloc(&mxbuf.xdp);
 
 		sinfo->xdp_frags_size += frag_consumed_bytes;
 		truesize += frag_info->frag_stride;
@@ -1701,7 +1701,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 	au = head_wi->au;
 
 	prog = rcu_dereference(rq->xdp_prog);
-	if (prog && mlx5e_xdp_handle(rq, au->page, prog, &xdp)) {
+	if (prog && mlx5e_xdp_handle(rq, au->page, prog, &mxbuf)) {
 		if (test_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
 			int i;
 
@@ -1711,22 +1711,22 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 		return NULL; /* page/packet was consumed by XDP */
 	}
 
-	skb = mlx5e_build_linear_skb(rq, xdp.data_hard_start, rq->buff.frame0_sz,
-				     xdp.data - xdp.data_hard_start,
-				     xdp.data_end - xdp.data,
-				     xdp.data - xdp.data_meta);
+	skb = mlx5e_build_linear_skb(rq, mxbuf.xdp.data_hard_start, rq->buff.frame0_sz,
+				     mxbuf.xdp.data - mxbuf.xdp.data_hard_start,
+				     mxbuf.xdp.data_end - mxbuf.xdp.data,
+				     mxbuf.xdp.data - mxbuf.xdp.data_meta);
 	if (unlikely(!skb))
 		return NULL;
 
 	page_ref_inc(au->page);
 
-	if (unlikely(xdp_buff_has_frags(&xdp))) {
+	if (unlikely(xdp_buff_has_frags(&mxbuf.xdp))) {
 		int i;
 
 		/* sinfo->nr_frags is reset by build_skb, calculate again. */
 		xdp_update_skb_shared_info(skb, wi - head_wi - 1,
 					   sinfo->xdp_frags_size, truesize,
-					   xdp_buff_is_frag_pfmemalloc(&xdp));
+					   xdp_buff_is_frag_pfmemalloc(&mxbuf.xdp));
 
 		for (i = 0; i < sinfo->nr_frags; i++) {
 			skb_frag_t *frag = &sinfo->frags[i];
@@ -2007,19 +2007,19 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 
 	prog = rcu_dereference(rq->xdp_prog);
 	if (prog) {
-		struct xdp_buff xdp;
+		struct mlx5e_xdp_buff mxbuf;
 
 		net_prefetchw(va); /* xdp_frame data area */
-		mlx5e_fill_xdp_buff(rq, va, rx_headroom, cqe_bcnt, &xdp);
-		if (mlx5e_xdp_handle(rq, au->page, prog, &xdp)) {
+		mlx5e_fill_xdp_buff(rq, va, rx_headroom, cqe_bcnt, &mxbuf);
+		if (mlx5e_xdp_handle(rq, au->page, prog, &mxbuf)) {
 			if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
 				__set_bit(page_idx, wi->xdp_xmit_bitmap); /* non-atomic */
 			return NULL; /* page/packet was consumed by XDP */
 		}
 
-		rx_headroom = xdp.data - xdp.data_hard_start;
-		metasize = xdp.data - xdp.data_meta;
-		cqe_bcnt = xdp.data_end - xdp.data;
+		rx_headroom = mxbuf.xdp.data - mxbuf.xdp.data_hard_start;
+		metasize = mxbuf.xdp.data - mxbuf.xdp.data_meta;
+		cqe_bcnt = mxbuf.xdp.data_end - mxbuf.xdp.data;
 	}
 	frag_size = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
 	skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt, metasize);
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related

* [PATCH bpf-next v5 14/17] xsk: Add cb area to struct xdp_buff_xsk
From: Stanislav Fomichev @ 2022-12-20 22:20 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, Toke Høiland-Jørgensen,
	David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev
In-Reply-To: <20221220222043.3348718-1-sdf@google.com>

From: Toke Høiland-Jørgensen <toke@redhat.com>

Add an area after the xdp_buff in struct xdp_buff_xsk that drivers can use
to stash extra information to use in metadata kfuncs. The maximum size of
24 bytes means the full xdp_buff_xsk structure will take up exactly two
cache lines (with the cb field spanning both). Also add a macro drivers can
use to check their own wrapping structs against the available size.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/net/xsk_buff_pool.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index f787c3f524b0..3e952e569418 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -19,8 +19,11 @@ struct xdp_sock;
 struct device;
 struct page;
 
+#define XSK_PRIV_MAX 24
+
 struct xdp_buff_xsk {
 	struct xdp_buff xdp;
+	u8 cb[XSK_PRIV_MAX];
 	dma_addr_t dma;
 	dma_addr_t frame_dma;
 	struct xsk_buff_pool *pool;
@@ -28,6 +31,8 @@ struct xdp_buff_xsk {
 	struct list_head free_list_node;
 };
 
+#define XSK_CHECK_PRIV_TYPE(t) BUILD_BUG_ON(sizeof(t) > offsetofend(struct xdp_buff_xsk, cb))
+
 struct xsk_dma_map {
 	dma_addr_t *dma_pages;
 	struct device *dev;
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related

* [PATCH bpf-next v5 13/17] net/mlx4_en: Support RX XDP metadata
From: Stanislav Fomichev @ 2022-12-20 22:20 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, Tariq Toukan
In-Reply-To: <20221220222043.3348718-1-sdf@google.com>

RX timestamp and hash for now. Tested using the prog from the next
patch.

Also enabling xdp metadata support; don't see why it's disabled,
there is enough headroom..

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_clock.c | 13 +++++---
 .../net/ethernet/mellanox/mlx4/en_netdev.c    |  6 ++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c    | 33 ++++++++++++++++++-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h  |  5 +++
 4 files changed, 52 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_clock.c b/drivers/net/ethernet/mellanox/mlx4/en_clock.c
index 98b5ffb4d729..9e3b76182088 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_clock.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_clock.c
@@ -58,9 +58,7 @@ u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
 	return hi | lo;
 }
 
-void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
-			    struct skb_shared_hwtstamps *hwts,
-			    u64 timestamp)
+u64 mlx4_en_get_hwtstamp(struct mlx4_en_dev *mdev, u64 timestamp)
 {
 	unsigned int seq;
 	u64 nsec;
@@ -70,8 +68,15 @@ void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
 		nsec = timecounter_cyc2time(&mdev->clock, timestamp);
 	} while (read_seqretry(&mdev->clock_lock, seq));
 
+	return ns_to_ktime(nsec);
+}
+
+void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
+			    struct skb_shared_hwtstamps *hwts,
+			    u64 timestamp)
+{
 	memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
-	hwts->hwtstamp = ns_to_ktime(nsec);
+	hwts->hwtstamp = mlx4_en_get_hwtstamp(mdev, timestamp);
 }
 
 /**
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 8800d3f1f55c..af4c4858f397 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -2889,6 +2889,11 @@ static const struct net_device_ops mlx4_netdev_ops_master = {
 	.ndo_bpf		= mlx4_xdp,
 };
 
+static const struct xdp_metadata_ops mlx4_xdp_metadata_ops = {
+	.xmo_rx_timestamp		= mlx4_en_xdp_rx_timestamp,
+	.xmo_rx_hash			= mlx4_en_xdp_rx_hash,
+};
+
 struct mlx4_en_bond {
 	struct work_struct work;
 	struct mlx4_en_priv *priv;
@@ -3310,6 +3315,7 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 		dev->netdev_ops = &mlx4_netdev_ops_master;
 	else
 		dev->netdev_ops = &mlx4_netdev_ops;
+	dev->xdp_metadata_ops = &mlx4_xdp_metadata_ops;
 	dev->watchdog_timeo = MLX4_EN_WATCHDOG_TIMEOUT;
 	netif_set_real_num_tx_queues(dev, priv->tx_ring_num[TX]);
 	netif_set_real_num_rx_queues(dev, priv->rx_ring_num);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 014a80af2813..0869d4fff17b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -663,8 +663,35 @@ static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, void *va,
 
 struct mlx4_en_xdp_buff {
 	struct xdp_buff xdp;
+	struct mlx4_cqe *cqe;
+	struct mlx4_en_dev *mdev;
+	struct mlx4_en_rx_ring *ring;
+	struct net_device *dev;
 };
 
+int mlx4_en_xdp_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp)
+{
+	struct mlx4_en_xdp_buff *_ctx = (void *)ctx;
+
+	if (unlikely(_ctx->ring->hwtstamp_rx_filter != HWTSTAMP_FILTER_ALL))
+		return -EOPNOTSUPP;
+
+	*timestamp = mlx4_en_get_hwtstamp(_ctx->mdev,
+					  mlx4_en_get_cqe_ts(_ctx->cqe));
+	return 0;
+}
+
+int mlx4_en_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash)
+{
+	struct mlx4_en_xdp_buff *_ctx = (void *)ctx;
+
+	if (unlikely(!(_ctx->dev->features & NETIF_F_RXHASH)))
+		return -EOPNOTSUPP;
+
+	*hash = be32_to_cpu(_ctx->cqe->immed_rss_invalid);
+	return 0;
+}
+
 int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -781,8 +808,12 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 						DMA_FROM_DEVICE);
 
 			xdp_prepare_buff(&mxbuf.xdp, va - frags[0].page_offset,
-					 frags[0].page_offset, length, false);
+					 frags[0].page_offset, length, true);
 			orig_data = mxbuf.xdp.data;
+			mxbuf.cqe = cqe;
+			mxbuf.mdev = priv->mdev;
+			mxbuf.ring = ring;
+			mxbuf.dev = dev;
 
 			act = bpf_prog_run_xdp(xdp_prog, &mxbuf.xdp);
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 3d4226ddba5e..544e09b97483 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -796,10 +796,15 @@ void mlx4_en_update_pfc_stats_bitmap(struct mlx4_dev *dev,
 int mlx4_en_netdev_event(struct notifier_block *this,
 			 unsigned long event, void *ptr);
 
+struct xdp_md;
+int mlx4_en_xdp_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp);
+int mlx4_en_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash);
+
 /*
  * Functions for time stamping
  */
 u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe);
+u64 mlx4_en_get_hwtstamp(struct mlx4_en_dev *mdev, u64 timestamp);
 void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
 			    struct skb_shared_hwtstamps *hwts,
 			    u64 timestamp);
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related

* [PATCH bpf-next v5 12/17] net/mlx4_en: Introduce wrapper for xdp_buff
From: Stanislav Fomichev @ 2022-12-20 22:20 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, Tariq Toukan
In-Reply-To: <20221220222043.3348718-1-sdf@google.com>

No functional changes. Boilerplate to allow stuffing more data after xdp_buff.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 26 +++++++++++++---------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 8f762fc170b3..014a80af2813 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -661,9 +661,14 @@ static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, void *va,
 #define MLX4_CQE_STATUS_IP_ANY (MLX4_CQE_STATUS_IPV4)
 #endif
 
+struct mlx4_en_xdp_buff {
+	struct xdp_buff xdp;
+};
+
 int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
+	struct mlx4_en_xdp_buff mxbuf = {};
 	int factor = priv->cqe_factor;
 	struct mlx4_en_rx_ring *ring;
 	struct bpf_prog *xdp_prog;
@@ -671,7 +676,6 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	bool doorbell_pending;
 	bool xdp_redir_flush;
 	struct mlx4_cqe *cqe;
-	struct xdp_buff xdp;
 	int polled = 0;
 	int index;
 
@@ -681,7 +685,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	ring = priv->rx_ring[cq_ring];
 
 	xdp_prog = rcu_dereference_bh(ring->xdp_prog);
-	xdp_init_buff(&xdp, priv->frag_info[0].frag_stride, &ring->xdp_rxq);
+	xdp_init_buff(&mxbuf.xdp, priv->frag_info[0].frag_stride, &ring->xdp_rxq);
 	doorbell_pending = false;
 	xdp_redir_flush = false;
 
@@ -776,24 +780,24 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 						priv->frag_info[0].frag_size,
 						DMA_FROM_DEVICE);
 
-			xdp_prepare_buff(&xdp, va - frags[0].page_offset,
+			xdp_prepare_buff(&mxbuf.xdp, va - frags[0].page_offset,
 					 frags[0].page_offset, length, false);
-			orig_data = xdp.data;
+			orig_data = mxbuf.xdp.data;
 
-			act = bpf_prog_run_xdp(xdp_prog, &xdp);
+			act = bpf_prog_run_xdp(xdp_prog, &mxbuf.xdp);
 
-			length = xdp.data_end - xdp.data;
-			if (xdp.data != orig_data) {
-				frags[0].page_offset = xdp.data -
-					xdp.data_hard_start;
-				va = xdp.data;
+			length = mxbuf.xdp.data_end - mxbuf.xdp.data;
+			if (mxbuf.xdp.data != orig_data) {
+				frags[0].page_offset = mxbuf.xdp.data -
+					mxbuf.xdp.data_hard_start;
+				va = mxbuf.xdp.data;
 			}
 
 			switch (act) {
 			case XDP_PASS:
 				break;
 			case XDP_REDIRECT:
-				if (likely(!xdp_do_redirect(dev, &xdp, xdp_prog))) {
+				if (likely(!xdp_do_redirect(dev, &mxbuf.xdp, xdp_prog))) {
 					ring->xdp_redirect++;
 					xdp_redir_flush = true;
 					frags[0].page = NULL;
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related

* [PATCH bpf-next v5 11/17] selftests/bpf: Verify xdp_metadata xdp->af_xdp path
From: Stanislav Fomichev @ 2022-12-20 22:20 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev
In-Reply-To: <20221220222043.3348718-1-sdf@google.com>

- create new netns
- create veth pair (veTX+veRX)
- setup AF_XDP socket for both interfaces
- attach bpf to veRX
- send packet via veTX
- verify the packet has expected metadata at veRX

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/Makefile          |   2 +-
 .../selftests/bpf/prog_tests/xdp_metadata.c   | 412 ++++++++++++++++++
 .../selftests/bpf/progs/xdp_metadata.c        |  64 +++
 .../selftests/bpf/progs/xdp_metadata2.c       |  23 +
 tools/testing/selftests/bpf/xdp_metadata.h    |  15 +
 5 files changed, 515 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_metadata.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_metadata2.c
 create mode 100644 tools/testing/selftests/bpf/xdp_metadata.h

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index c22c43bbee19..e6cbc04a7920 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -527,7 +527,7 @@ TRUNNER_BPF_PROGS_DIR := progs
 TRUNNER_EXTRA_SOURCES := test_progs.c cgroup_helpers.c trace_helpers.c	\
 			 network_helpers.c testing_helpers.c		\
 			 btf_helpers.c flow_dissector_load.h		\
-			 cap_helpers.c test_loader.c
+			 cap_helpers.c test_loader.c xsk.c
 TRUNNER_EXTRA_FILES := $(OUTPUT)/urandom_read $(OUTPUT)/bpf_testmod.ko	\
 		       $(OUTPUT)/liburandom_read.so			\
 		       $(OUTPUT)/xdp_synproxy				\
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
new file mode 100644
index 000000000000..55f4c8cfba66
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
@@ -0,0 +1,412 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <network_helpers.h>
+#include "xdp_metadata.skel.h"
+#include "xdp_metadata2.skel.h"
+#include "xdp_metadata.h"
+#include "xsk.h"
+
+#include <bpf/btf.h>
+#include <linux/errqueue.h>
+#include <linux/if_link.h>
+#include <linux/net_tstamp.h>
+#include <linux/udp.h>
+#include <sys/mman.h>
+#include <net/if.h>
+#include <poll.h>
+
+#define TX_NAME "veTX"
+#define RX_NAME "veRX"
+
+#define UDP_PAYLOAD_BYTES 4
+
+#define AF_XDP_SOURCE_PORT 1234
+#define AF_XDP_CONSUMER_PORT 8080
+
+#define UMEM_NUM 16
+#define UMEM_FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE
+#define UMEM_SIZE (UMEM_FRAME_SIZE * UMEM_NUM)
+#define XDP_FLAGS XDP_FLAGS_DRV_MODE
+#define QUEUE_ID 0
+
+#define TX_ADDR "10.0.0.1"
+#define RX_ADDR "10.0.0.2"
+#define PREFIX_LEN "8"
+#define FAMILY AF_INET
+
+#define SYS(cmd) ({ \
+	if (!ASSERT_OK(system(cmd), (cmd))) \
+		goto out; \
+})
+
+struct xsk {
+	void *umem_area;
+	struct xsk_umem *umem;
+	struct xsk_ring_prod fill;
+	struct xsk_ring_cons comp;
+	struct xsk_ring_prod tx;
+	struct xsk_ring_cons rx;
+	struct xsk_socket *socket;
+};
+
+static int open_xsk(const char *ifname, struct xsk *xsk)
+{
+	int mmap_flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE;
+	const struct xsk_socket_config socket_config = {
+		.rx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.libbpf_flags = XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD,
+		.xdp_flags = XDP_FLAGS,
+		.bind_flags = XDP_COPY,
+	};
+	const struct xsk_umem_config umem_config = {
+		.fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
+		.frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE,
+		.flags = XDP_UMEM_UNALIGNED_CHUNK_FLAG,
+	};
+	__u32 idx;
+	u64 addr;
+	int ret;
+	int i;
+
+	xsk->umem_area = mmap(NULL, UMEM_SIZE, PROT_READ | PROT_WRITE, mmap_flags, -1, 0);
+	if (!ASSERT_NEQ(xsk->umem_area, MAP_FAILED, "mmap"))
+		return -1;
+
+	ret = xsk_umem__create(&xsk->umem,
+			       xsk->umem_area, UMEM_SIZE,
+			       &xsk->fill,
+			       &xsk->comp,
+			       &umem_config);
+	if (!ASSERT_OK(ret, "xsk_umem__create"))
+		return ret;
+
+	ret = xsk_socket__create(&xsk->socket, ifname, QUEUE_ID,
+				 xsk->umem,
+				 &xsk->rx,
+				 &xsk->tx,
+				 &socket_config);
+	if (!ASSERT_OK(ret, "xsk_socket__create"))
+		return ret;
+
+	/* First half of umem is for TX. This way address matches 1-to-1
+	 * to the completion queue index.
+	 */
+
+	for (i = 0; i < UMEM_NUM / 2; i++) {
+		addr = i * UMEM_FRAME_SIZE;
+		printf("%p: tx_desc[%d] -> %lx\n", xsk, i, addr);
+	}
+
+	/* Second half of umem is for RX. */
+
+	ret = xsk_ring_prod__reserve(&xsk->fill, UMEM_NUM / 2, &idx);
+	if (!ASSERT_EQ(UMEM_NUM / 2, ret, "xsk_ring_prod__reserve"))
+		return ret;
+	if (!ASSERT_EQ(idx, 0, "fill idx != 0"))
+		return -1;
+
+	for (i = 0; i < UMEM_NUM / 2; i++) {
+		addr = (UMEM_NUM / 2 + i) * UMEM_FRAME_SIZE;
+		printf("%p: rx_desc[%d] -> %lx\n", xsk, i, addr);
+		*xsk_ring_prod__fill_addr(&xsk->fill, i) = addr;
+	}
+	xsk_ring_prod__submit(&xsk->fill, ret);
+
+	return 0;
+}
+
+static void close_xsk(struct xsk *xsk)
+{
+	if (xsk->umem)
+		xsk_umem__delete(xsk->umem);
+	if (xsk->socket)
+		xsk_socket__delete(xsk->socket);
+	munmap(xsk->umem, UMEM_SIZE);
+}
+
+static void ip_csum(struct iphdr *iph)
+{
+	__u32 sum = 0;
+	__u16 *p;
+	int i;
+
+	iph->check = 0;
+	p = (void *)iph;
+	for (i = 0; i < sizeof(*iph) / sizeof(*p); i++)
+		sum += p[i];
+
+	while (sum >> 16)
+		sum = (sum & 0xffff) + (sum >> 16);
+
+	iph->check = ~sum;
+}
+
+static int generate_packet(struct xsk *xsk, __u16 dst_port)
+{
+	struct xdp_desc *tx_desc;
+	struct udphdr *udph;
+	struct ethhdr *eth;
+	struct iphdr *iph;
+	void *data;
+	__u32 idx;
+	int ret;
+
+	ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
+	if (!ASSERT_EQ(ret, 1, "xsk_ring_prod__reserve"))
+		return -1;
+
+	tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
+	tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE;
+	printf("%p: tx_desc[%u]->addr=%llx\n", xsk, idx, tx_desc->addr);
+	data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
+
+	eth = data;
+	iph = (void *)(eth + 1);
+	udph = (void *)(iph + 1);
+
+	memcpy(eth->h_dest, "\x00\x00\x00\x00\x00\x02", ETH_ALEN);
+	memcpy(eth->h_source, "\x00\x00\x00\x00\x00\x01", ETH_ALEN);
+	eth->h_proto = htons(ETH_P_IP);
+
+	iph->version = 0x4;
+	iph->ihl = 0x5;
+	iph->tos = 0x9;
+	iph->tot_len = htons(sizeof(*iph) + sizeof(*udph) + UDP_PAYLOAD_BYTES);
+	iph->id = 0;
+	iph->frag_off = 0;
+	iph->ttl = 0;
+	iph->protocol = IPPROTO_UDP;
+	ASSERT_EQ(inet_pton(FAMILY, TX_ADDR, &iph->saddr), 1, "inet_pton(TX_ADDR)");
+	ASSERT_EQ(inet_pton(FAMILY, RX_ADDR, &iph->daddr), 1, "inet_pton(RX_ADDR)");
+	ip_csum(iph);
+
+	udph->source = htons(AF_XDP_SOURCE_PORT);
+	udph->dest = htons(dst_port);
+	udph->len = htons(sizeof(*udph) + UDP_PAYLOAD_BYTES);
+	udph->check = 0;
+
+	memset(udph + 1, 0xAA, UDP_PAYLOAD_BYTES);
+
+	tx_desc->len = sizeof(*eth) + sizeof(*iph) + sizeof(*udph) + UDP_PAYLOAD_BYTES;
+	xsk_ring_prod__submit(&xsk->tx, 1);
+
+	ret = sendto(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, 0);
+	if (!ASSERT_GE(ret, 0, "sendto"))
+		return ret;
+
+	return 0;
+}
+
+static void complete_tx(struct xsk *xsk)
+{
+	__u32 idx;
+	__u64 addr;
+
+	if (ASSERT_EQ(xsk_ring_cons__peek(&xsk->comp, 1, &idx), 1, "xsk_ring_cons__peek")) {
+		addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
+
+		printf("%p: refill idx=%u addr=%llx\n", xsk, idx, addr);
+		*xsk_ring_prod__fill_addr(&xsk->fill, idx) = addr;
+		xsk_ring_prod__submit(&xsk->fill, 1);
+	}
+}
+
+static void refill_rx(struct xsk *xsk, __u64 addr)
+{
+	__u32 idx;
+
+	if (ASSERT_EQ(xsk_ring_prod__reserve(&xsk->fill, 1, &idx), 1, "xsk_ring_prod__reserve")) {
+		printf("%p: complete idx=%u addr=%llx\n", xsk, idx, addr);
+		*xsk_ring_prod__fill_addr(&xsk->fill, idx) = addr;
+		xsk_ring_prod__submit(&xsk->fill, 1);
+	}
+}
+
+static int verify_xsk_metadata(struct xsk *xsk)
+{
+	const struct xdp_desc *rx_desc;
+	struct pollfd fds = {};
+	struct xdp_meta *meta;
+	struct ethhdr *eth;
+	struct iphdr *iph;
+	__u64 comp_addr;
+	void *data;
+	__u64 addr;
+	__u32 idx;
+	int ret;
+
+	ret = recvfrom(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, NULL);
+	if (!ASSERT_EQ(ret, 0, "recvfrom"))
+		return -1;
+
+	fds.fd = xsk_socket__fd(xsk->socket);
+	fds.events = POLLIN;
+
+	ret = poll(&fds, 1, 1000);
+	if (!ASSERT_GT(ret, 0, "poll"))
+		return -1;
+
+	ret = xsk_ring_cons__peek(&xsk->rx, 1, &idx);
+	if (!ASSERT_EQ(ret, 1, "xsk_ring_cons__peek"))
+		return -2;
+
+	rx_desc = xsk_ring_cons__rx_desc(&xsk->rx, idx);
+	comp_addr = xsk_umem__extract_addr(rx_desc->addr);
+	addr = xsk_umem__add_offset_to_addr(rx_desc->addr);
+	printf("%p: rx_desc[%u]->addr=%llx addr=%llx comp_addr=%llx\n",
+	       xsk, idx, rx_desc->addr, addr, comp_addr);
+	data = xsk_umem__get_data(xsk->umem_area, addr);
+
+	/* Make sure we got the packet offset correctly. */
+
+	eth = data;
+	ASSERT_EQ(eth->h_proto, htons(ETH_P_IP), "eth->h_proto");
+	iph = (void *)(eth + 1);
+	ASSERT_EQ((int)iph->version, 4, "iph->version");
+
+	/* custom metadata */
+
+	meta = data - sizeof(struct xdp_meta);
+
+	if (!ASSERT_NEQ(meta->rx_timestamp, 0, "rx_timestamp"))
+		return -1;
+
+	if (!ASSERT_NEQ(meta->rx_hash, 0, "rx_hash"))
+		return -1;
+
+	xsk_ring_cons__release(&xsk->rx, 1);
+	refill_rx(xsk, comp_addr);
+
+	return 0;
+}
+
+void test_xdp_metadata(void)
+{
+	struct xdp_metadata2 *bpf_obj2 = NULL;
+	struct xdp_metadata *bpf_obj = NULL;
+	struct bpf_program *new_prog, *prog;
+	struct nstoken *tok = NULL;
+	__u32 queue_id = QUEUE_ID;
+	struct bpf_map *prog_arr;
+	struct xsk tx_xsk = {};
+	struct xsk rx_xsk = {};
+	__u32 val, key = 0;
+	int retries = 10;
+	int rx_ifindex;
+	int sock_fd;
+	int ret;
+
+	/* Setup new networking namespace, with a veth pair. */
+
+	SYS("ip netns add xdp_metadata");
+	tok = open_netns("xdp_metadata");
+	SYS("ip link add numtxqueues 1 numrxqueues 1 " TX_NAME
+	    " type veth peer " RX_NAME " numtxqueues 1 numrxqueues 1");
+	SYS("ip link set dev " TX_NAME " address 00:00:00:00:00:01");
+	SYS("ip link set dev " RX_NAME " address 00:00:00:00:00:02");
+	SYS("ip link set dev " TX_NAME " up");
+	SYS("ip link set dev " RX_NAME " up");
+	SYS("ip addr add " TX_ADDR "/" PREFIX_LEN " dev " TX_NAME);
+	SYS("ip addr add " RX_ADDR "/" PREFIX_LEN " dev " RX_NAME);
+
+	rx_ifindex = if_nametoindex(RX_NAME);
+
+	/* Setup separate AF_XDP for TX and RX interfaces. */
+
+	ret = open_xsk(TX_NAME, &tx_xsk);
+	if (!ASSERT_OK(ret, "open_xsk(TX_NAME)"))
+		goto out;
+
+	ret = open_xsk(RX_NAME, &rx_xsk);
+	if (!ASSERT_OK(ret, "open_xsk(RX_NAME)"))
+		goto out;
+
+	bpf_obj = xdp_metadata__open();
+	if (!ASSERT_OK_PTR(bpf_obj, "open skeleton"))
+		goto out;
+
+	prog = bpf_object__find_program_by_name(bpf_obj->obj, "rx");
+	bpf_program__set_ifindex(prog, rx_ifindex);
+	bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
+
+	if (!ASSERT_OK(xdp_metadata__load(bpf_obj), "load skeleton"))
+		goto out;
+
+	/* Make sure we can't add dev-bound programs to prog maps. */
+	prog_arr = bpf_object__find_map_by_name(bpf_obj->obj, "prog_arr");
+	if (!ASSERT_OK_PTR(prog_arr, "no prog_arr map"))
+		goto out;
+
+	val = bpf_program__fd(prog);
+	if (!ASSERT_ERR(bpf_map__update_elem(prog_arr, &key, sizeof(key),
+					     &val, sizeof(val), BPF_ANY),
+			"update prog_arr"))
+		goto out;
+
+	/* Attach BPF program to RX interface. */
+
+	ret = bpf_xdp_attach(rx_ifindex,
+			     bpf_program__fd(bpf_obj->progs.rx),
+			     XDP_FLAGS, NULL);
+	if (!ASSERT_GE(ret, 0, "bpf_xdp_attach"))
+		goto out;
+
+	sock_fd = xsk_socket__fd(rx_xsk.socket);
+	ret = bpf_map_update_elem(bpf_map__fd(bpf_obj->maps.xsk), &queue_id, &sock_fd, 0);
+	if (!ASSERT_GE(ret, 0, "bpf_map_update_elem"))
+		goto out;
+
+	/* Send packet destined to RX AF_XDP socket. */
+	if (!ASSERT_GE(generate_packet(&tx_xsk, AF_XDP_CONSUMER_PORT), 0,
+		       "generate AF_XDP_CONSUMER_PORT"))
+		goto out;
+
+	/* Verify AF_XDP RX packet has proper metadata. */
+	if (!ASSERT_GE(verify_xsk_metadata(&rx_xsk), 0,
+		       "verify_xsk_metadata"))
+		goto out;
+
+	complete_tx(&tx_xsk);
+
+	/* Make sure freplace correctly picks up original bound device
+	 * and doesn't crash.
+	 */
+
+	bpf_obj2 = xdp_metadata2__open();
+	if (!ASSERT_OK_PTR(bpf_obj2, "open skeleton"))
+		goto out;
+
+	new_prog = bpf_object__find_program_by_name(bpf_obj2->obj, "freplace_rx");
+	bpf_program__set_attach_target(new_prog, bpf_program__fd(prog), "rx");
+
+	if (!ASSERT_OK(xdp_metadata2__load(bpf_obj2), "load freplace skeleton"))
+		goto out;
+
+	if (!ASSERT_OK(xdp_metadata2__attach(bpf_obj2), "attach freplace"))
+		goto out;
+
+	/* Send packet to trigger . */
+	if (!ASSERT_GE(generate_packet(&tx_xsk, AF_XDP_CONSUMER_PORT), 0,
+		       "generate freplace packet"))
+		goto out;
+
+	while (!retries--) {
+		if (bpf_obj2->bss->called)
+			break;
+		usleep(10);
+	}
+	ASSERT_GT(bpf_obj2->bss->called, 0, "not called");
+
+out:
+	close_xsk(&rx_xsk);
+	close_xsk(&tx_xsk);
+	if (bpf_obj2)
+		xdp_metadata2__destroy(bpf_obj2);
+	if (bpf_obj)
+		xdp_metadata__destroy(bpf_obj);
+	system("ip netns del xdp_metadata");
+	if (tok)
+		close_netns(tok);
+}
diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c
new file mode 100644
index 000000000000..77678b034389
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c
@@ -0,0 +1,64 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <vmlinux.h>
+#include "xdp_metadata.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+struct {
+	__uint(type, BPF_MAP_TYPE_XSKMAP);
+	__uint(max_entries, 4);
+	__type(key, __u32);
+	__type(value, __u32);
+} xsk SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, __u32);
+} prog_arr SEC(".maps");
+
+extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
+					 __u64 *timestamp) __ksym;
+extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx,
+				    __u32 *hash) __ksym;
+
+SEC("xdp")
+int rx(struct xdp_md *ctx)
+{
+	void *data, *data_meta;
+	struct xdp_meta *meta;
+	u64 timestamp = -1;
+	int ret;
+
+	/* Reserve enough for all custom metadata. */
+
+	ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(struct xdp_meta));
+	if (ret != 0)
+		return XDP_DROP;
+
+	data = (void *)(long)ctx->data;
+	data_meta = (void *)(long)ctx->data_meta;
+
+	if (data_meta + sizeof(struct xdp_meta) > data)
+		return XDP_DROP;
+
+	meta = data_meta;
+
+	/* Export metadata. */
+
+	/* We expect veth bpf_xdp_metadata_rx_timestamp to return 0 HW
+	 * timestamp, so put some non-zero value into AF_XDP frame for
+	 * the userspace.
+	 */
+	bpf_xdp_metadata_rx_timestamp(ctx, &timestamp);
+	if (timestamp == 0)
+		meta->rx_timestamp = 1;
+
+	bpf_xdp_metadata_rx_hash(ctx, &meta->rx_hash);
+
+	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata2.c b/tools/testing/selftests/bpf/progs/xdp_metadata2.c
new file mode 100644
index 000000000000..cf69d05451c3
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/xdp_metadata2.c
@@ -0,0 +1,23 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <vmlinux.h>
+#include "xdp_metadata.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx,
+				    __u32 *hash) __ksym;
+
+int called;
+
+SEC("freplace/rx")
+int freplace_rx(struct xdp_md *ctx)
+{
+	u32 hash = 0;
+	/* Call _any_ metadata function to make sure we don't crash. */
+	bpf_xdp_metadata_rx_hash(ctx, &hash);
+	called++;
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/xdp_metadata.h b/tools/testing/selftests/bpf/xdp_metadata.h
new file mode 100644
index 000000000000..f6780fbb0a21
--- /dev/null
+++ b/tools/testing/selftests/bpf/xdp_metadata.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#pragma once
+
+#ifndef ETH_P_IP
+#define ETH_P_IP 0x0800
+#endif
+
+#ifndef ETH_P_IPV6
+#define ETH_P_IPV6 0x86DD
+#endif
+
+struct xdp_meta {
+	__u64 rx_timestamp;
+	__u32 rx_hash;
+};
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related

* [PATCH bpf-next v5 10/17] veth: Support RX XDP metadata
From: Stanislav Fomichev @ 2022-12-20 22:20 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev
In-Reply-To: <20221220222043.3348718-1-sdf@google.com>

The goal is to enable end-to-end testing of the metadata for AF_XDP.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/veth.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 04ffd8cb2945..71966de4e942 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -118,6 +118,7 @@ static struct {
 
 struct veth_xdp_buff {
 	struct xdp_buff xdp;
+	struct sk_buff *skb;
 };
 
 static int veth_get_link_ksettings(struct net_device *dev,
@@ -602,6 +603,7 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
 
 		xdp_convert_frame_to_buff(frame, xdp);
 		xdp->rxq = &rq->xdp_rxq;
+		vxbuf.skb = NULL;
 
 		act = bpf_prog_run_xdp(xdp_prog, xdp);
 
@@ -823,6 +825,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	__skb_push(skb, skb->data - skb_mac_header(skb));
 	if (veth_convert_skb_to_xdp_buff(rq, xdp, &skb))
 		goto drop;
+	vxbuf.skb = skb;
 
 	orig_data = xdp->data;
 	orig_data_end = xdp->data_end;
@@ -1601,6 +1604,28 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	}
 }
 
+static int veth_xdp_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp)
+{
+	struct veth_xdp_buff *_ctx = (void *)ctx;
+
+	if (!_ctx->skb)
+		return -EOPNOTSUPP;
+
+	*timestamp = skb_hwtstamps(_ctx->skb)->hwtstamp;
+	return 0;
+}
+
+static int veth_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash)
+{
+	struct veth_xdp_buff *_ctx = (void *)ctx;
+
+	if (!_ctx->skb)
+		return -EOPNOTSUPP;
+
+	*hash = skb_get_hash(_ctx->skb);
+	return 0;
+}
+
 static const struct net_device_ops veth_netdev_ops = {
 	.ndo_init            = veth_dev_init,
 	.ndo_open            = veth_open,
@@ -1622,6 +1647,11 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_get_peer_dev	= veth_peer_dev,
 };
 
+static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
+	.xmo_rx_timestamp		= veth_xdp_rx_timestamp,
+	.xmo_rx_hash			= veth_xdp_rx_hash,
+};
+
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
 		       NETIF_F_RXCSUM | NETIF_F_SCTP_CRC | NETIF_F_HIGHDMA | \
 		       NETIF_F_GSO_SOFTWARE | NETIF_F_GSO_ENCAP_ALL | \
@@ -1638,6 +1668,7 @@ static void veth_setup(struct net_device *dev)
 	dev->priv_flags |= IFF_PHONY_HEADROOM;
 
 	dev->netdev_ops = &veth_netdev_ops;
+	dev->xdp_metadata_ops = &veth_xdp_metadata_ops;
 	dev->ethtool_ops = &veth_ethtool_ops;
 	dev->features |= NETIF_F_LLTX;
 	dev->features |= VETH_FEATURES;
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related

* [PATCH bpf-next v5 09/17] veth: Introduce veth_xdp_buff wrapper for xdp_buff
From: Stanislav Fomichev @ 2022-12-20 22:20 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev
In-Reply-To: <20221220222043.3348718-1-sdf@google.com>

No functional changes. Boilerplate to allow stuffing more data after xdp_buff.

Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/veth.c | 56 +++++++++++++++++++++++++---------------------
 1 file changed, 31 insertions(+), 25 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index ac7c0653695f..04ffd8cb2945 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -116,6 +116,10 @@ static struct {
 	{ "peer_ifindex" },
 };
 
+struct veth_xdp_buff {
+	struct xdp_buff xdp;
+};
+
 static int veth_get_link_ksettings(struct net_device *dev,
 				   struct ethtool_link_ksettings *cmd)
 {
@@ -592,23 +596,24 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
 	rcu_read_lock();
 	xdp_prog = rcu_dereference(rq->xdp_prog);
 	if (likely(xdp_prog)) {
-		struct xdp_buff xdp;
+		struct veth_xdp_buff vxbuf;
+		struct xdp_buff *xdp = &vxbuf.xdp;
 		u32 act;
 
-		xdp_convert_frame_to_buff(frame, &xdp);
-		xdp.rxq = &rq->xdp_rxq;
+		xdp_convert_frame_to_buff(frame, xdp);
+		xdp->rxq = &rq->xdp_rxq;
 
-		act = bpf_prog_run_xdp(xdp_prog, &xdp);
+		act = bpf_prog_run_xdp(xdp_prog, xdp);
 
 		switch (act) {
 		case XDP_PASS:
-			if (xdp_update_frame_from_buff(&xdp, frame))
+			if (xdp_update_frame_from_buff(xdp, frame))
 				goto err_xdp;
 			break;
 		case XDP_TX:
 			orig_frame = *frame;
-			xdp.rxq->mem = frame->mem;
-			if (unlikely(veth_xdp_tx(rq, &xdp, bq) < 0)) {
+			xdp->rxq->mem = frame->mem;
+			if (unlikely(veth_xdp_tx(rq, xdp, bq) < 0)) {
 				trace_xdp_exception(rq->dev, xdp_prog, act);
 				frame = &orig_frame;
 				stats->rx_drops++;
@@ -619,8 +624,8 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
 			goto xdp_xmit;
 		case XDP_REDIRECT:
 			orig_frame = *frame;
-			xdp.rxq->mem = frame->mem;
-			if (xdp_do_redirect(rq->dev, &xdp, xdp_prog)) {
+			xdp->rxq->mem = frame->mem;
+			if (xdp_do_redirect(rq->dev, xdp, xdp_prog)) {
 				frame = &orig_frame;
 				stats->rx_drops++;
 				goto err_xdp;
@@ -801,7 +806,8 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 {
 	void *orig_data, *orig_data_end;
 	struct bpf_prog *xdp_prog;
-	struct xdp_buff xdp;
+	struct veth_xdp_buff vxbuf;
+	struct xdp_buff *xdp = &vxbuf.xdp;
 	u32 act, metalen;
 	int off;
 
@@ -815,22 +821,22 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	}
 
 	__skb_push(skb, skb->data - skb_mac_header(skb));
-	if (veth_convert_skb_to_xdp_buff(rq, &xdp, &skb))
+	if (veth_convert_skb_to_xdp_buff(rq, xdp, &skb))
 		goto drop;
 
-	orig_data = xdp.data;
-	orig_data_end = xdp.data_end;
+	orig_data = xdp->data;
+	orig_data_end = xdp->data_end;
 
-	act = bpf_prog_run_xdp(xdp_prog, &xdp);
+	act = bpf_prog_run_xdp(xdp_prog, xdp);
 
 	switch (act) {
 	case XDP_PASS:
 		break;
 	case XDP_TX:
-		veth_xdp_get(&xdp);
+		veth_xdp_get(xdp);
 		consume_skb(skb);
-		xdp.rxq->mem = rq->xdp_mem;
-		if (unlikely(veth_xdp_tx(rq, &xdp, bq) < 0)) {
+		xdp->rxq->mem = rq->xdp_mem;
+		if (unlikely(veth_xdp_tx(rq, xdp, bq) < 0)) {
 			trace_xdp_exception(rq->dev, xdp_prog, act);
 			stats->rx_drops++;
 			goto err_xdp;
@@ -839,10 +845,10 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 		rcu_read_unlock();
 		goto xdp_xmit;
 	case XDP_REDIRECT:
-		veth_xdp_get(&xdp);
+		veth_xdp_get(xdp);
 		consume_skb(skb);
-		xdp.rxq->mem = rq->xdp_mem;
-		if (xdp_do_redirect(rq->dev, &xdp, xdp_prog)) {
+		xdp->rxq->mem = rq->xdp_mem;
+		if (xdp_do_redirect(rq->dev, xdp, xdp_prog)) {
 			stats->rx_drops++;
 			goto err_xdp;
 		}
@@ -862,7 +868,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	rcu_read_unlock();
 
 	/* check if bpf_xdp_adjust_head was used */
-	off = orig_data - xdp.data;
+	off = orig_data - xdp->data;
 	if (off > 0)
 		__skb_push(skb, off);
 	else if (off < 0)
@@ -871,21 +877,21 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	skb_reset_mac_header(skb);
 
 	/* check if bpf_xdp_adjust_tail was used */
-	off = xdp.data_end - orig_data_end;
+	off = xdp->data_end - orig_data_end;
 	if (off != 0)
 		__skb_put(skb, off); /* positive on grow, negative on shrink */
 
 	/* XDP frag metadata (e.g. nr_frags) are updated in eBPF helpers
 	 * (e.g. bpf_xdp_adjust_tail), we need to update data_len here.
 	 */
-	if (xdp_buff_has_frags(&xdp))
+	if (xdp_buff_has_frags(xdp))
 		skb->data_len = skb_shinfo(skb)->xdp_frags_size;
 	else
 		skb->data_len = 0;
 
 	skb->protocol = eth_type_trans(skb, rq->dev);
 
-	metalen = xdp.data - xdp.data_meta;
+	metalen = xdp->data - xdp->data_meta;
 	if (metalen)
 		skb_metadata_set(skb, metalen);
 out:
@@ -898,7 +904,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	return NULL;
 err_xdp:
 	rcu_read_unlock();
-	xdp_return_buff(&xdp);
+	xdp_return_buff(xdp);
 xdp_xmit:
 	return NULL;
 }
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related

* [PATCH bpf-next v5 07/17] bpf: XDP metadata RX kfuncs
From: Stanislav Fomichev @ 2022-12-20 22:20 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev
In-Reply-To: <20221220222043.3348718-1-sdf@google.com>

Define a new kfunc set (xdp_metadata_kfunc_ids) which implements all possible
XDP metatada kfuncs. Not all devices have to implement them. If kfunc is not
supported by the target device, the default implementation is called instead.
The verifier, at load time, replaces a call to the generic kfunc with a call
to the per-device one. Per-device kfunc pointers are stored in separate
struct xdp_metadata_ops.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/linux/bpf.h       |  9 ++++++-
 include/linux/netdevice.h |  7 ++++++
 include/net/xdp.h         | 21 ++++++++++++++++
 kernel/bpf/core.c         |  8 +++++++
 kernel/bpf/offload.c      | 28 ++++++++++++++++++++++
 kernel/bpf/verifier.c     | 41 +++++++++++++++++++++++++++++++-
 net/bpf/test_run.c        |  3 +++
 net/core/xdp.c            | 50 +++++++++++++++++++++++++++++++++++++++
 8 files changed, 165 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 4c9649e69993..969f53691dd4 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2480,6 +2480,7 @@ bool bpf_offload_dev_match(struct bpf_prog *prog, struct net_device *netdev);
 void unpriv_ebpf_notify(int new_state);
 
 #if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL)
+void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id);
 int bpf_prog_dev_bound_init(struct bpf_prog *prog, union bpf_attr *attr);
 void bpf_dev_bound_netdev_unregister(struct net_device *dev);
 
@@ -2514,8 +2515,14 @@ void sock_map_unhash(struct sock *sk);
 void sock_map_destroy(struct sock *sk);
 void sock_map_close(struct sock *sk, long timeout);
 #else
+static inline void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog,
+						u32 func_id)
+{
+	return NULL;
+}
+
 static inline int bpf_prog_dev_bound_init(struct bpf_prog *prog,
-					union bpf_attr *attr)
+					  union bpf_attr *attr)
 {
 	return -EOPNOTSUPP;
 }
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index aad12a179e54..b41d18490595 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -74,6 +74,7 @@ struct udp_tunnel_nic_info;
 struct udp_tunnel_nic;
 struct bpf_prog;
 struct xdp_buff;
+struct xdp_md;
 
 void synchronize_net(void);
 void netdev_set_default_ethtool_ops(struct net_device *dev,
@@ -1618,6 +1619,11 @@ struct net_device_ops {
 						  bool cycles);
 };
 
+struct xdp_metadata_ops {
+	int	(*xmo_rx_timestamp)(const struct xdp_md *ctx, u64 *timestamp);
+	int	(*xmo_rx_hash)(const struct xdp_md *ctx, u32 *hash);
+};
+
 /**
  * enum netdev_priv_flags - &struct net_device priv_flags
  *
@@ -2050,6 +2056,7 @@ struct net_device {
 	unsigned int		flags;
 	unsigned long long	priv_flags;
 	const struct net_device_ops *netdev_ops;
+	const struct xdp_metadata_ops *xdp_metadata_ops;
 	int			ifindex;
 	unsigned short		gflags;
 	unsigned short		hard_header_len;
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 55dbc68bfffc..91292aa13bc0 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -409,4 +409,25 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
 
 #define DEV_MAP_BULK_SIZE XDP_BULK_QUEUE_SIZE
 
+#define XDP_METADATA_KFUNC_xxx	\
+	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP, \
+			   bpf_xdp_metadata_rx_timestamp) \
+	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_HASH, \
+			   bpf_xdp_metadata_rx_hash) \
+
+enum {
+#define XDP_METADATA_KFUNC(name, _) name,
+XDP_METADATA_KFUNC_xxx
+#undef XDP_METADATA_KFUNC
+MAX_XDP_METADATA_KFUNC,
+};
+
+#ifdef CONFIG_NET
+u32 bpf_xdp_metadata_kfunc_id(int id);
+bool bpf_dev_bound_kfunc_id(u32 btf_id);
+#else
+static inline u32 bpf_xdp_metadata_kfunc_id(int id) { return 0; }
+static inline bool bpf_dev_bound_kfunc_id(u32 btf_id) { return false; }
+#endif
+
 #endif /* __LINUX_NET_XDP_H__ */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index bafcb7a3ae6f..6d81b14361e3 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2097,6 +2097,14 @@ bool bpf_prog_map_compatible(struct bpf_map *map,
 	if (fp->kprobe_override)
 		return false;
 
+	/* XDP programs inserted into maps are not guaranteed to run on
+	 * a particular netdev (and can run outside driver context entirely
+	 * in the case of devmap and cpumap). Until device checks
+	 * are implemented, prohibit adding dev-bound programs to program maps.
+	 */
+	if (bpf_prog_is_dev_bound(fp->aux))
+		return false;
+
 	spin_lock(&map->owner.lock);
 	if (!map->owner.type) {
 		/* There's no owner yet where we could check for
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index e803824c1ddd..0e3fc743e0a8 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -767,6 +767,34 @@ void bpf_dev_bound_netdev_unregister(struct net_device *dev)
 	up_write(&bpf_devs_lock);
 }
 
+void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
+{
+	const struct xdp_metadata_ops *ops;
+	void *p = NULL;
+
+	/* We don't hold bpf_devs_lock while resolving several
+	 * kfuncs and can race with the unregister_netdevice().
+	 * We rely on bpf_dev_bound_match() check at attach
+	 * to render this program unusable.
+	 */
+	down_read(&bpf_devs_lock);
+	if (!prog->aux->offload || !prog->aux->offload->netdev)
+		goto out;
+
+	ops = prog->aux->offload->netdev->xdp_metadata_ops;
+	if (!ops)
+		goto out;
+
+	if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP))
+		p = ops->xmo_rx_timestamp;
+	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_HASH))
+		p = ops->xmo_rx_hash;
+out:
+	up_read(&bpf_devs_lock);
+
+	return p;
+}
+
 static int __init bpf_offload_init(void)
 {
 	return rhashtable_init(&offdevs, &offdevs_params);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index fdfdcab4a59d..320451a0be3e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2081,6 +2081,22 @@ static struct btf *find_kfunc_desc_btf(struct bpf_verifier_env *env, s16 offset)
 	return btf_vmlinux ?: ERR_PTR(-ENOENT);
 }
 
+int bpf_dev_bound_kfunc_check(struct bpf_verifier_env *env,
+			      struct bpf_prog_aux *prog_aux)
+{
+	if (!bpf_prog_is_dev_bound(prog_aux)) {
+		verbose(env, "metadata kfuncs require device-bound program\n");
+		return -EINVAL;
+	}
+
+	if (bpf_prog_is_offloaded(prog_aux)) {
+		verbose(env, "metadata kfuncs can't be offloaded\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 static int add_kfunc_call(struct bpf_verifier_env *env, u32 func_id, s16 offset)
 {
 	const struct btf_type *func, *func_proto;
@@ -2183,6 +2199,12 @@ static int add_kfunc_call(struct bpf_verifier_env *env, u32 func_id, s16 offset)
 		return -EINVAL;
 	}
 
+	if (bpf_dev_bound_kfunc_id(func_id)) {
+		err = bpf_dev_bound_kfunc_check(env, prog_aux);
+		if (err)
+			return err;
+	}
+
 	desc = &tab->descs[tab->nr_descs++];
 	desc->func_id = func_id;
 	desc->imm = call_imm;
@@ -15480,12 +15502,25 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 			    struct bpf_insn *insn_buf, int insn_idx, int *cnt)
 {
 	const struct bpf_kfunc_desc *desc;
+	void *xdp_kfunc;
 
 	if (!insn->imm) {
 		verbose(env, "invalid kernel function call not eliminated in verifier pass\n");
 		return -EINVAL;
 	}
 
+	*cnt = 0;
+
+	if (bpf_dev_bound_kfunc_id(insn->imm)) {
+		xdp_kfunc = bpf_dev_bound_resolve_kfunc(env->prog, insn->imm);
+		if (xdp_kfunc) {
+			insn->imm = BPF_CALL_IMM(xdp_kfunc);
+			return 0;
+		}
+
+		/* fallback to default kfunc when not supported by netdev */
+	}
+
 	/* insn->imm has the btf func_id. Replace it with
 	 * an address (relative to __bpf_call_base).
 	 */
@@ -15496,7 +15531,6 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 		return -EFAULT;
 	}
 
-	*cnt = 0;
 	insn->imm = desc->imm;
 	if (insn->off)
 		return 0;
@@ -16503,6 +16537,11 @@ int bpf_check_attach_target(struct bpf_verifier_log *log,
 	if (tgt_prog) {
 		struct bpf_prog_aux *aux = tgt_prog->aux;
 
+		if (bpf_prog_is_dev_bound(tgt_prog->aux)) {
+			bpf_log(log, "Replacing device-bound programs not supported\n");
+			return -EINVAL;
+		}
+
 		for (i = 0; i < aux->func_info_cnt; i++)
 			if (aux->func_info[i].type_id == btf_id) {
 				subprog = i;
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 2723623429ac..8da0d73b368e 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -1300,6 +1300,9 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
 	if (kattr->test.flags & ~BPF_F_TEST_XDP_LIVE_FRAMES)
 		return -EINVAL;
 
+	if (bpf_prog_is_dev_bound(prog->aux))
+		return -EINVAL;
+
 	if (do_live) {
 		if (!batch_size)
 			batch_size = NAPI_POLL_WEIGHT;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 844c9d99dc0e..ec016ee5c046 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -4,6 +4,7 @@
  * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
  */
 #include <linux/bpf.h>
+#include <linux/btf_ids.h>
 #include <linux/filter.h>
 #include <linux/types.h>
 #include <linux/mm.h>
@@ -709,3 +710,52 @@ struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
 
 	return nxdpf;
 }
+
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+		  "Global functions as their definitions will be in vmlinux BTF");
+
+int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp)
+{
+	return -EOPNOTSUPP;
+}
+
+int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, u32 *hash)
+{
+	return -EOPNOTSUPP;
+}
+
+__diag_pop();
+
+BTF_SET8_START(xdp_metadata_kfunc_ids)
+#define XDP_METADATA_KFUNC(_, name) BTF_ID_FLAGS(func, name, 0)
+XDP_METADATA_KFUNC_xxx
+#undef XDP_METADATA_KFUNC
+BTF_SET8_END(xdp_metadata_kfunc_ids)
+
+static const struct btf_kfunc_id_set xdp_metadata_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &xdp_metadata_kfunc_ids,
+};
+
+BTF_ID_LIST(xdp_metadata_kfunc_ids_unsorted)
+#define XDP_METADATA_KFUNC(name, str) BTF_ID(func, str)
+XDP_METADATA_KFUNC_xxx
+#undef XDP_METADATA_KFUNC
+
+u32 bpf_xdp_metadata_kfunc_id(int id)
+{
+	/* xdp_metadata_kfunc_ids is sorted and can't be used */
+	return xdp_metadata_kfunc_ids_unsorted[id];
+}
+
+bool bpf_dev_bound_kfunc_id(u32 btf_id)
+{
+	return btf_id_set8_contains(&xdp_metadata_kfunc_ids, btf_id);
+}
+
+static int __init xdp_metadata_init(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &xdp_metadata_kfunc_set);
+}
+late_initcall(xdp_metadata_init);
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related

* [PATCH bpf-next v5 06/17] selftests/bpf: Update expected test_offload.py messages
From: Stanislav Fomichev @ 2022-12-20 22:20 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev
In-Reply-To: <20221220222043.3348718-1-sdf@google.com>

Generic check has a different error message, update the selftest.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/test_offload.py | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_offload.py b/tools/testing/selftests/bpf/test_offload.py
index 7cb1bc05e5cf..40cba8d368d9 100755
--- a/tools/testing/selftests/bpf/test_offload.py
+++ b/tools/testing/selftests/bpf/test_offload.py
@@ -1039,7 +1039,7 @@ netns = []
     offload = bpf_pinned("/sys/fs/bpf/offload")
     ret, _, err = sim.set_xdp(offload, "drv", fail=False, include_stderr=True)
     fail(ret == 0, "attached offloaded XDP program to drv")
-    check_extack(err, "Using device-bound program without HW_MODE flag is not supported.", args)
+    check_extack(err, "Using offloaded program without HW_MODE flag is not supported.", args)
     rm("/sys/fs/bpf/offload")
     sim.wait_for_flush()
 
@@ -1088,12 +1088,12 @@ netns = []
     ret, _, err = sim.set_xdp(pinned, "offload",
                               fail=False, include_stderr=True)
     fail(ret == 0, "Pinned program loaded for a different device accepted")
-    check_extack_nsim(err, "program bound to different dev.", args)
+    check_extack(err, "Program bound to different device.", args)
     simdev2.remove()
     ret, _, err = sim.set_xdp(pinned, "offload",
                               fail=False, include_stderr=True)
     fail(ret == 0, "Pinned program loaded for a removed device accepted")
-    check_extack_nsim(err, "xdpoffload of non-bound program.", args)
+    check_extack(err, "Program bound to different device.", args)
     rm(pin_file)
     bpftool_prog_list_wait(expected=0)
 
@@ -1334,12 +1334,12 @@ netns = []
     ret, _, err = simA.set_xdp(progB, "offload", force=True, JSON=False,
                                fail=False, include_stderr=True)
     fail(ret == 0, "cross-ASIC program allowed")
-    check_extack_nsim(err, "program bound to different dev.", args)
+    check_extack(err, "Program bound to different device.", args)
     for d in simdevB.nsims:
         ret, _, err = d.set_xdp(progA, "offload", force=True, JSON=False,
                                 fail=False, include_stderr=True)
         fail(ret == 0, "cross-ASIC program allowed")
-        check_extack_nsim(err, "program bound to different dev.", args)
+        check_extack(err, "Program bound to different device.", args)
 
     start_test("Test multi-dev ASIC cross-dev map reuse...")
 
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox