* Re: [PATCH net-next 2/2] net: phy: broadcom: add 1000Base-X support for BCM54616S
From: Tao Ren @ 2019-07-31 0:15 UTC (permalink / raw)
To: Andrew Lunn, Vladimir Oltean
Cc: Florian Fainelli, Heiner Kallweit, David S . Miller,
Arun Parameswaran, Justin Chen, netdev, lkml, Andrew Jeffery,
openbmc@lists.ozlabs.org
In-Reply-To: <20190730131719.GA28552@lunn.ch>
On 7/30/19 6:17 AM, Andrew Lunn wrote:
>> Again, I don't think Linux has generic support for overwriting (or
>> even describing) the operating mode of a PHY, although maybe that's a
>> direction we would want to push the discussion towards. RGMII to
>> copper, RGMII to fiber, SGMII to copper, copper to fiber (media
>> converter), even RGMII to SGMII (RTL8211FS supports this) - lots of
>> modes, and this is only for gigabit PHYs...
>
> This is something Russell King has PHYLINK patches for, which have not
> yet been merged. There are some boards which use a PHY as a media
> converter, placed between the MAC and an SFP.
Thank you Andrew. Let me see if the operating mode can be auto-detected on BCM54616S chip, and I will update back soon.
Thanks,
Tao
^ permalink raw reply
* Re: [PATCH net-next 1/2] net: phy: broadcom: set features explicitly for BCM54616S
From: Tao Ren @ 2019-07-31 0:12 UTC (permalink / raw)
To: Heiner Kallweit, Andrew Lunn
Cc: Florian Fainelli, David S . Miller, Arun Parameswaran,
Justin Chen, Vladimir Oltean, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, Andrew Jeffery,
openbmc@lists.ozlabs.org
In-Reply-To: <bdfe07d3-66b4-061a-a149-aa2aef94b9b7@gmail.com>
On 7/29/19 11:00 PM, Heiner Kallweit wrote:
> On 30.07.2019 07:05, Tao Ren wrote:
>> On 7/29/19 8:35 PM, Andrew Lunn wrote:
>>> On Mon, Jul 29, 2019 at 05:25:32PM -0700, Tao Ren wrote:
>>>> BCM54616S feature "PHY_GBIT_FEATURES" was removed by commit dcdecdcfe1fc
>>>> ("net: phy: switch drivers to use dynamic feature detection"). As dynamic
>>>> feature detection doesn't work when BCM54616S is working in RGMII-Fiber
>>>> mode (different sets of MII Control/Status registers being used), let's
>>>> set "PHY_GBIT_FEATURES" for BCM54616S explicitly.
>>>
>>> Hi Tao
>>>
>>> What exactly does it get wrong?
>>>
>>> Thanks
>>> Andrew
>>
>> Hi Andrew,
>>
>> BCM54616S is set to RGMII-Fiber (1000Base-X) mode on my platform, and none of the features (1000BaseT/100BaseT/10BaseT) can be detected by genphy_read_abilities(), because the PHY only reports 1000BaseX_Full|Half ability in this mode.
>>
> Are you going to use the PHY in copper or fibre mode?
> In case you use fibre mode, why do you need the copper modes set as supported?
> Or does the PHY just start in fibre mode and you want to switch it to copper mode?
Hi Heiner,
The phy starts in fiber mode and that's the mode I want.
My observation is: phydev->link is always 0 (Link status bit is never set in MII_BMSR) by using dynamic ability detection on my machine. I checked phydev->supported and it's set to "AutoNeg | TP | MII | Pause | Asym_Pause" by dynamic ability detection. Is it normal/expected? Or maybe the fix should go to different places? Thank you for your help.
Thanks,
Tao
^ permalink raw reply
* Re: [PATCH 0/2] tools: bpftool: add net (un)load command to load XDP
From: Jakub Kicinski @ 2019-07-31 0:07 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Daniel T. Lee, Stephen Hemminger, David Ahern,
Jesper Dangaard Brouer, John Fastabend, Daniel Borkmann,
Alexei Starovoitov, netdev
In-Reply-To: <20190730231754.efh3fj4mnsbv445l@ast-mbp>
On Tue, 30 Jul 2019 16:17:56 -0700, Alexei Starovoitov wrote:
> On Tue, Jul 30, 2019 at 03:59:15PM -0700, Jakub Kicinski wrote:
> > On Wed, 31 Jul 2019 03:48:19 +0900, Daniel T. Lee wrote:
> > > Currently, bpftool net only supports dumping progs loaded on the
> > > interface. To load XDP prog on interface, user must use other tool
> > > (eg. iproute2). By this patch, with `bpftool net (un)load`, user can
> > > (un)load XDP prog on interface.
> >
> > I don't understand why using another tool is a bad thing :(
> > What happened to the Unix philosophy?
> >
> > I remain opposed to duplicating iproute2's functionality under
> > bpftool net :( The way to attach bpf programs in the networking
> > subsystem is through the iproute2 commends - ip and tc..
> >
> > It seems easy enough to add a feature to bpftool but from
> > a perspective of someone adding a new feature to the kernel,
> > and wanting to update user space components it's quite painful :(
> >
> > So could you describe to me in more detail why this is a good idea?
> > Perhaps others can chime in?
>
> I don't think it has anything to do with 'unix philosophy'.
> Here the proposal to teach bpftool to attach xdp progs.
> I see nothing wrong with that.
Nothing meaning you disagree it's duplicated effort and unnecessary
LoC the community has to maintain, review, test..?
> Another reason is iproute2 is still far away from adopting libbpf.
> So all the latest goodness like BTF, introspection, etc will not
> be available to iproute2 users for some time.
Duplicating the same features in bpftool will only diminish the
incentive for moving iproute2 to libbpf. And for folks who deal
with a wide variety of customers, often novices maintaining two
ways of doing the same thing is a hassle :(
> Even when iproute2 is ready it would be convenient for folks like me
> (who need to debug stuff in production) to remember cmd line of
> bpftool only to introspect the server. Debugging often includes
> detaching/attaching progs. Not only doing 'bpftool p s'.
Let's just put the two commands next to each other:
ip link set xdp $PROG dev $DEV
bpftool net attach xdp $PROG dev $DEV
Are they that different?
> If bpftool was taught to do equivalent of 'ip link' that would be
> very different story and I would be opposed to that.
Yes, that'd be pretty clear cut, only the XDP stuff is a bit more
of a judgement call.
^ permalink raw reply
* Re: [PATCH v2 bpf-next 02/12] libbpf: implement BPF CO-RE offset relocation algorithm
From: Andrii Nakryiko @ 2019-07-30 23:55 UTC (permalink / raw)
To: Song Liu
Cc: Andrii Nakryiko, bpf, netdev@vger.kernel.org, Alexei Starovoitov,
daniel@iogearbox.net, Yonghong Song, Kernel Team
In-Reply-To: <9C1CFF6F-F661-46F4-B6EB-B42D7F4204F0@fb.com>
On Tue, Jul 30, 2019 at 4:44 PM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Jul 30, 2019, at 12:53 PM, Andrii Nakryiko <andriin@fb.com> wrote:
> >
> > This patch implements the core logic for BPF CO-RE offsets relocations.
> > Every instruction that needs to be relocated has corresponding
> > bpf_offset_reloc as part of BTF.ext. Relocations are performed by trying
> > to match recorded "local" relocation spec against potentially many
> > compatible "target" types, creating corresponding spec. Details of the
> > algorithm are noted in corresponding comments in the code.
> >
> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > ---
> > tools/lib/bpf/libbpf.c | 915 ++++++++++++++++++++++++++++++++++++++++-
> > tools/lib/bpf/libbpf.h | 1 +
> > 2 files changed, 909 insertions(+), 7 deletions(-)
> >
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > index ead915aec349..75da90928257 100644
> > --- a/tools/lib/bpf/libbpf.c
> > +++ b/tools/lib/bpf/libbpf.c
> > @@ -38,6 +38,7 @@
[...]
> >
> > -static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
> > - __u32 id)
> > +static const struct btf_type *
> > +skip_mods_and_typedefs(const struct btf *btf, __u32 id, __u32 *res_id)
> > {
> > const struct btf_type *t = btf__type_by_id(btf, id);
> >
> > + if (res_id)
> > + *res_id = id;
> > +
> > while (true) {
> > switch (BTF_INFO_KIND(t->info)) {
> > case BTF_KIND_VOLATILE:
> > case BTF_KIND_CONST:
> > case BTF_KIND_RESTRICT:
> > case BTF_KIND_TYPEDEF:
> > + if (res_id)
> > + *res_id = t->type;
> > t = btf__type_by_id(btf, t->type);
>
> So btf->types[*res_id] == retval, right? Then with retval and btf, we can
> calculate *res_id without this change?
Unless I'm missing something very clever here, no. btf->types is array
of pointers (it's an index into a variable-sized types). This function
returns `struct btf_type *`, which is one of the **values** stored in
that array. You are claiming that by having value of one of array
elements you can easily find element's index? If it was possible to do
in O(1), we wouldn't have so many algorithms and data structures for
search and indexing. You can do that only with linear search, not some
clever pointer arithmetic or at least binary search. So I'm not sure
what you are proposing here...
The way BTF is defined, struct btf_type doesn't know its own type ID,
which is often inconvenient and requires to keep track of that ID, if
it's necessary, but that's how it is.
But then again, what are we trying to achieve here? Eliminate
returning id and pointer? I could always return id and easily look up
pointer, but having both is super convenient and makes code simpler
and shorter, so I'd like to keep it.
>
> > break;
> > default:
> > @@ -1044,7 +1051,7 @@ static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
> > static bool get_map_field_int(const char *map_name, const struct btf *btf,
> > const struct btf_type *def,
> > const struct btf_member *m, __u32 *res) {
[...]
^ permalink raw reply
* Re: [PATCH net-next 2/2] net: phy: broadcom: add 1000Base-X support for BCM54616S
From: Tao Ren @ 2019-07-30 23:44 UTC (permalink / raw)
To: Vladimir Oltean
Cc: Andrew Lunn, Florian Fainelli, Heiner Kallweit, David S . Miller,
Arun Parameswaran, Justin Chen, netdev, lkml, Andrew Jeffery,
openbmc@lists.ozlabs.org
In-Reply-To: <CA+h21ho1KOGS3WsNBHzfHkpSyE4k5HTE1tV9wUtnkZhjUZGeUw@mail.gmail.com>
On 7/30/19 3:15 AM, Vladimir Oltean wrote:
> On Tue, 30 Jul 2019 at 07:52, Tao Ren <taoren@fb.com> wrote:
>>
>> On 7/29/19 6:32 PM, Vladimir Oltean wrote:
>>> Hi Tao,
>>>
>>> On Tue, 30 Jul 2019 at 03:31, Tao Ren <taoren@fb.com> wrote:
>>>>
>>>> Configure the BCM54616S for 1000Base-X mode when "brcm-phy-mode-1000bx"
>>>> is set in device tree. This is needed when the PHY is used for fiber and
>>>> backplane connections.
>>>>
>>>> The patch is inspired by commit cd9af3dac6d1 ("PHYLIB: Add 1000Base-X
>>>> support for Broadcom bcm5482").
>>>
>>> As far as I can see, for the commit you referenced,
>>> PHY_BCM_FLAGS_MODE_1000BX is referenced from nowhere in the entire
>>> mainline kernel:
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__elixir.bootlin.com_linux_latest_ident_PHY-5FBCM-5FFLAGS-5FMODE-5F1000BX&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=iYElT7HC77pRZ3byVvW8ng&m=gy6Y-3Ylme-_GQcGF4fvOX10irgAT4xh253Weo0np38&s=KL__E2bvsmvUL-hBL9hUmOS5vyPQ92EMj6fEfByn8t8&e=
>>> (it is supposed to be put by the MAC driver in phydev->dev_flags prior
>>> to calling phy_connect). But I don't see the point to this - can't you
>>> check for phydev->interface == PHY_INTERFACE_MODE_1000BASEX?
>>> This has the advantage that no MAC driver will need to know that it's
>>> talking to a Broadcom PHY. Additionally, no custom DT bindings are
>>> needed.
>>> Also, for backplane connections you probably want 1000Base-KX which
>>> has its own AN/LT, not plain 1000Base-X.
>>
>> Thank you Vladimir for the quick review!
>> Perhaps I misunderstood the purpose of phydev->interface, and I thought it was usually used to defined the interface between MAC and PHY. For example, if I need to pass both "rgmii-id" and "1000base-x" from MAC to PHY driver, what would be the preferred way?
>>
>
> Ohhhhhh, now I understand what you're trying to do, sorry, somehow I
> was too tired and I thought of something totally unrelated.
Thank you for spending time on the patch, and I really appreciate it.
> Let me see if I can explain: you've got the INTF_SEL pin strapping
> configured for something else (like RGMII to copper mode) and then
> you're changing the operating mode at runtime through MDIO? Is this
> intended to be for production code, or is it just some quick hack to
> fix a bad board design?
The INTF_SEL pins report correct mode (RGMII-Fiber) on my machine, but there are 2 "sub-modes" (1000Base-X and 100Base-FX) and I couldn't find a proper/safe way to auto-detect which "sub-mode" is active. The datasheet just describes instructions to enable a specific mode, but it doesn't say 1000Base-X/100Base-FX mode will be auto-selected. And that's why I came up with the patch to specify 1000Base-X mode.
> I think what's supposed to happen (Heiner can comment) is that
> genphy_config_init will automatically read the out-of-reset PHY
> registers and figure out which link modes are supported. This includes
> the 1000Base-X media type, *if* the PHY is strapped correctly.
> But you are changing the strapping configuration too late (again, in
> .config_init), so phylib doesn't pick up the new Base-X modes. What
> happens if you do the switchover from the .probe callback of the
> driver, instead of .config_init?
I checked the 1000base-x mode control bit and it's already set on my machine when genphy_config_init is executed, means I'm writing the same value to the register.
I write the register explicitly because I'm suspecting the mode control bit is set by my boot loader and the value is not changed by software reset.
Let me see if I can disable net/phy in boot loader and see what happens. If the bit is still on, maybe we can use the bit to auto-detect sub-mode (although the datasheet doesn't mention it)?
Anyways, let me move the logic to .probe callback. Thank you.
> I think what got me confused was your "add support for 1000Base-X"
> commit message. If I understand correctly, you're not adding support,
> you're just forcing it.
> Again, I don't think Linux has generic support for overwriting (or
> even describing) the operating mode of a PHY, although maybe that's a
> direction we would want to push the discussion towards. RGMII to
> copper, RGMII to fiber, SGMII to copper, copper to fiber (media
> converter), even RGMII to SGMII (RTL8211FS supports this) - lots of
> modes, and this is only for gigabit PHYs...
>
>>>> Signed-off-by: Tao Ren <taoren@fb.com>
>>>> ---
>>>> drivers/net/phy/broadcom.c | 58 +++++++++++++++++++++++++++++++++++---
>>>> include/linux/brcmphy.h | 4 +--
>>>> 2 files changed, 56 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c
>>>> index 2b4e41a9d35a..6c22ac3a844b 100644
>>>> --- a/drivers/net/phy/broadcom.c
>>>> +++ b/drivers/net/phy/broadcom.c
>>>> @@ -383,9 +383,9 @@ static int bcm5482_config_init(struct phy_device *phydev)
>>>> /*
>>>> * Select 1000BASE-X register set (primary SerDes)
>>>> */
>>>> - reg = bcm_phy_read_shadow(phydev, BCM5482_SHD_MODE);
>>>> - bcm_phy_write_shadow(phydev, BCM5482_SHD_MODE,
>>>> - reg | BCM5482_SHD_MODE_1000BX);
>>>> + reg = bcm_phy_read_shadow(phydev, BCM54XX_SHD_MODE);
>>>> + bcm_phy_write_shadow(phydev, BCM54XX_SHD_MODE,
>>>> + reg | BCM54XX_SHD_MODE_1000BX);
>>>>
>>>> /*
>>>> * LED1=ACTIVITYLED, LED3=LINKSPD[2]
>>>> @@ -451,6 +451,34 @@ static int bcm5481_config_aneg(struct phy_device *phydev)
>>>> return ret;
>>>> }
>>>>
>>>> +static int bcm54616s_config_init(struct phy_device *phydev)
>>>> +{
>>>> + int err, reg;
>>>> + struct device_node *np = phydev->mdio.dev.of_node;
>>>> +
>>>> + err = bcm54xx_config_init(phydev);
>>>> +
>>>> + if (of_property_read_bool(np, "brcm-phy-mode-1000bx")) {
>>>> + /* Select 1000BASE-X register set. */
>>>> + reg = bcm_phy_read_shadow(phydev, BCM54XX_SHD_MODE);
>>>> + bcm_phy_write_shadow(phydev, BCM54XX_SHD_MODE,
>>>> + reg | BCM54XX_SHD_MODE_1000BX);
>>>> +
>>>> + /* Auto-negotiation doesn't seem to work quite right
>>>> + * in this mode, so we disable it and force it to the
>>>> + * right speed/duplex setting. Only 'link status'
>>>> + * is important.
>>>> + */
>>>> + phydev->autoneg = AUTONEG_DISABLE;
>>>> + phydev->speed = SPEED_1000;
>>>> + phydev->duplex = DUPLEX_FULL;
>>>> +
>>>
>>> 1000Base-X AN does not include speed negotiation, so hardcoding
>>> SPEED_1000 is probably correct.
>>> What is wrong with the AN of duplex settings?
>>
>> FULL_DUPLEX bit is set on my platform by default. Let me enable AN and test it out; will share you results tomorrow.
Duplex setting is correct when AN is enabled. So I will just hardcode speed settings.
>>>> + phydev->dev_flags |= PHY_BCM_FLAGS_MODE_1000BX;
>>>> + }
>>>> +
>>>> + return err;
>>>> +}
>>>> +
>>>> static int bcm54616s_config_aneg(struct phy_device *phydev)
>>>> {
>>>> int ret;
>>>> @@ -464,6 +492,27 @@ static int bcm54616s_config_aneg(struct phy_device *phydev)
>>>> return ret;
>>>> }
>>>>
>>>> +static int bcm54616s_read_status(struct phy_device *phydev)
>>>> +{
>>>> + int ret;
>>>> +
>>>> + ret = genphy_read_status(phydev);
>>>> + if (ret < 0)
>>>> + return ret;
>>>> +
>>>> + if (phydev->dev_flags & PHY_BCM_FLAGS_MODE_1000BX) {
>>>> + /* Only link status matters for 1000Base-X mode, so force
>>>> + * 1000 Mbit/s full-duplex status.
>>>> + */
>>>> + if (phydev->link) {
>>>> + phydev->speed = SPEED_1000;
>>>> + phydev->duplex = DUPLEX_FULL;
>>>> + }
>>>> + }
>>>> +
>>>> + return 0;
>>>> +}
>>>> +
>>>> static int brcm_phy_setbits(struct phy_device *phydev, int reg, int set)
>>>> {
>>>> int val;
>>>> @@ -651,8 +700,9 @@ static struct phy_driver broadcom_drivers[] = {
>>>> .phy_id_mask = 0xfffffff0,
>>>> .name = "Broadcom BCM54616S",
>>>> .features = PHY_GBIT_FEATURES,
>>>> - .config_init = bcm54xx_config_init,
>>>> + .config_init = bcm54616s_config_init,
>>>> .config_aneg = bcm54616s_config_aneg,
>>>> + .read_status = bcm54616s_read_status,
>>>> .ack_interrupt = bcm_phy_ack_intr,
>>>> .config_intr = bcm_phy_config_intr,
>>>> }, {
>>>> diff --git a/include/linux/brcmphy.h b/include/linux/brcmphy.h
>>>> index 6db2d9a6e503..82030155558c 100644
>>>> --- a/include/linux/brcmphy.h
>>>> +++ b/include/linux/brcmphy.h
>>>> @@ -200,8 +200,8 @@
>>>> #define BCM5482_SHD_SSD 0x14 /* 10100: Secondary SerDes control */
>>>> #define BCM5482_SHD_SSD_LEDM 0x0008 /* SSD LED Mode enable */
>>>> #define BCM5482_SHD_SSD_EN 0x0001 /* SSD enable */
>>>> -#define BCM5482_SHD_MODE 0x1f /* 11111: Mode Control Register */
>>>> -#define BCM5482_SHD_MODE_1000BX 0x0001 /* Enable 1000BASE-X registers */
>>>> +#define BCM54XX_SHD_MODE 0x1f /* 11111: Mode Control Register */
>>>> +#define BCM54XX_SHD_MODE_1000BX 0x0001 /* Enable 1000BASE-X registers */
>>>
>>> These registers are also present on my BCM5464, probably safe to
>>> assume they're generic for the entire family.
>>> So if you make the registers definitions common, you can probably make
>>> the 1000Base-X configuration common as well.
>>
>> If I understand correctly, your recommendation is to add a common function (such as "bcm54xx_config_1000bx") so it can be used by other BCM chips? Sure, I will take care of it.
>>
>>
>> Thanks,
>>
>> Tao
>
> Regards,
> -Vladimir
>
^ permalink raw reply
* Re: [PATCH v2 bpf-next 02/12] libbpf: implement BPF CO-RE offset relocation algorithm
From: Song Liu @ 2019-07-30 23:43 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: bpf, netdev@vger.kernel.org, Alexei Starovoitov,
daniel@iogearbox.net, Yonghong Song, andrii.nakryiko@gmail.com,
Kernel Team
In-Reply-To: <20190730195408.670063-3-andriin@fb.com>
> On Jul 30, 2019, at 12:53 PM, Andrii Nakryiko <andriin@fb.com> wrote:
>
> This patch implements the core logic for BPF CO-RE offsets relocations.
> Every instruction that needs to be relocated has corresponding
> bpf_offset_reloc as part of BTF.ext. Relocations are performed by trying
> to match recorded "local" relocation spec against potentially many
> compatible "target" types, creating corresponding spec. Details of the
> algorithm are noted in corresponding comments in the code.
>
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> ---
> tools/lib/bpf/libbpf.c | 915 ++++++++++++++++++++++++++++++++++++++++-
> tools/lib/bpf/libbpf.h | 1 +
> 2 files changed, 909 insertions(+), 7 deletions(-)
>
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index ead915aec349..75da90928257 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -38,6 +38,7 @@
> #include <sys/stat.h>
> #include <sys/types.h>
> #include <sys/vfs.h>
> +#include <sys/utsname.h>
> #include <tools/libc_compat.h>
> #include <libelf.h>
> #include <gelf.h>
> @@ -47,6 +48,7 @@
> #include "btf.h"
> #include "str_error.h"
> #include "libbpf_internal.h"
> +#include "hashmap.h"
>
> #ifndef EM_BPF
> #define EM_BPF 247
> @@ -1015,17 +1017,22 @@ static int bpf_object__init_user_maps(struct bpf_object *obj, bool strict)
> return 0;
> }
>
> -static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
> - __u32 id)
> +static const struct btf_type *
> +skip_mods_and_typedefs(const struct btf *btf, __u32 id, __u32 *res_id)
> {
> const struct btf_type *t = btf__type_by_id(btf, id);
>
> + if (res_id)
> + *res_id = id;
> +
> while (true) {
> switch (BTF_INFO_KIND(t->info)) {
> case BTF_KIND_VOLATILE:
> case BTF_KIND_CONST:
> case BTF_KIND_RESTRICT:
> case BTF_KIND_TYPEDEF:
> + if (res_id)
> + *res_id = t->type;
> t = btf__type_by_id(btf, t->type);
So btf->types[*res_id] == retval, right? Then with retval and btf, we can
calculate *res_id without this change?
> break;
> default:
> @@ -1044,7 +1051,7 @@ static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
> static bool get_map_field_int(const char *map_name, const struct btf *btf,
> const struct btf_type *def,
> const struct btf_member *m, __u32 *res) {
> - const struct btf_type *t = skip_mods_and_typedefs(btf, m->type);
> + const struct btf_type *t = skip_mods_and_typedefs(btf, m->type, NULL);
> const char *name = btf__name_by_offset(btf, m->name_off);
> const struct btf_array *arr_info;
> const struct btf_type *arr_t;
> @@ -1110,7 +1117,7 @@ static int bpf_object__init_user_btf_map(struct bpf_object *obj,
> return -EOPNOTSUPP;
> }
>
> - def = skip_mods_and_typedefs(obj->btf, var->type);
> + def = skip_mods_and_typedefs(obj->btf, var->type, NULL);
> if (BTF_INFO_KIND(def->info) != BTF_KIND_STRUCT) {
> pr_warning("map '%s': unexpected def kind %u.\n",
> map_name, BTF_INFO_KIND(var->info));
> @@ -2292,6 +2299,893 @@ bpf_program_reloc_btf_ext(struct bpf_program *prog, struct bpf_object *obj,
> return 0;
> }
>
> +#define BPF_CORE_SPEC_MAX_LEN 64
> +
> +/* represents BPF CO-RE field or array element accessor */
> +struct bpf_core_accessor {
> + __u32 type_id; /* struct/union type or array element type */
> + __u32 idx; /* field index or array index */
> + const char *name; /* field name or NULL for array accessor */
> +};
> +
> +struct bpf_core_spec {
> + const struct btf *btf;
> + /* high-level spec: named fields and array indices only */
> + struct bpf_core_accessor spec[BPF_CORE_SPEC_MAX_LEN];
> + /* high-level spec length */
> + int len;
> + /* raw, low-level spec: 1-to-1 with accessor spec string */
> + int raw_spec[BPF_CORE_SPEC_MAX_LEN];
> + /* raw spec length */
> + int raw_len;
> + /* field byte offset represented by spec */
> + __u32 offset;
> +};
> +
> +static bool str_is_empty(const char *s)
> +{
> + return !s || !s[0];
> +}
> +
> +static int btf_kind(const struct btf_type *t)
> +{
> + return BTF_INFO_KIND(t->info);
> +}
> +
> +static bool btf_is_composite(const struct btf_type *t)
> +{
> + int kind = btf_kind(t);
> +
> + return kind == BTF_KIND_STRUCT || kind == BTF_KIND_UNION;
> +}
> +
> +static bool btf_is_array(const struct btf_type *t)
> +{
> + return btf_kind(t) == BTF_KIND_ARRAY;
> +}
> +
> +/*
> + * Turn bpf_offset_reloc into a low- and high-level spec representation,
> + * validating correctness along the way, as well as calculating resulting
> + * field offset (in bytes), specified by accessor string. Low-level spec
> + * captures every single level of nestedness, including traversing anonymous
> + * struct/union members. High-level one only captures semantically meaningful
> + * "turning points": named fields and array indicies.
> + * E.g., for this case:
> + *
> + * struct sample {
> + * int __unimportant;
> + * struct {
> + * int __1;
> + * int __2;
> + * int a[7];
> + * };
> + * };
> + *
> + * struct sample *s = ...;
> + *
> + * int x = &s->a[3]; // access string = '0:1:2:3'
> + *
> + * Low-level spec has 1:1 mapping with each element of access string (it's
> + * just a parsed access string representation): [0, 1, 2, 3].
> + *
> + * High-level spec will capture only 3 points:
> + * - intial zero-index access by pointer (&s->... is the same as &s[0]...);
> + * - field 'a' access (corresponds to '2' in low-level spec);
> + * - array element #3 access (corresponds to '3' in low-level spec).
> + *
> + */
> +static int bpf_core_spec_parse(const struct btf *btf,
> + __u32 type_id,
> + const char *spec_str,
> + struct bpf_core_spec *spec)
> +{
> + int access_idx, parsed_len, i;
> + const struct btf_type *t;
> + const char *name;
> + __u32 id;
> + __s64 sz;
> +
> + if (str_is_empty(spec_str) || *spec_str == ':')
> + return -EINVAL;
> +
> + memset(spec, 0, sizeof(*spec));
> + spec->btf = btf;
> +
> + /* parse spec_str="0:1:2:3:4" into array raw_spec=[0, 1, 2, 3, 4] */
> + while (*spec_str) {
> + if (*spec_str == ':')
> + ++spec_str;
> + if (sscanf(spec_str, "%d%n", &access_idx, &parsed_len) != 1)
> + return -EINVAL;
> + if (spec->raw_len == BPF_CORE_SPEC_MAX_LEN)
> + return -E2BIG;
> + spec_str += parsed_len;
> + spec->raw_spec[spec->raw_len++] = access_idx;
> + }
> +
> + if (spec->raw_len == 0)
> + return -EINVAL;
> +
> + /* first spec value is always reloc type array index */
> + t = skip_mods_and_typedefs(btf, type_id, &id);
> + if (!t)
> + return -EINVAL;
> +
> + access_idx = spec->raw_spec[0];
> + spec->spec[0].type_id = id;
> + spec->spec[0].idx = access_idx;
> + spec->len++;
> +
> + sz = btf__resolve_size(btf, id);
> + if (sz < 0)
> + return sz;
> + spec->offset = access_idx * sz;
> +
> + for (i = 1; i < spec->raw_len; i++) {
> + t = skip_mods_and_typedefs(btf, id, &id);
> + if (!t)
> + return -EINVAL;
> +
> + access_idx = spec->raw_spec[i];
> +
> + if (btf_is_composite(t)) {
> + const struct btf_member *m = (void *)(t + 1);
> + __u32 offset;
> +
> + if (access_idx >= BTF_INFO_VLEN(t->info))
> + return -EINVAL;
> +
> + m = &m[access_idx];
> +
> + if (BTF_INFO_KFLAG(t->info)) {
> + if (BTF_MEMBER_BITFIELD_SIZE(m->offset))
> + return -EINVAL;
> + offset = BTF_MEMBER_BIT_OFFSET(m->offset);
> + } else {
> + offset = m->offset;
> + }
> + if (m->offset % 8)
> + return -EINVAL;
> + spec->offset += offset / 8;
> +
> + if (m->name_off) {
> + name = btf__name_by_offset(btf, m->name_off);
> + if (str_is_empty(name))
> + return -EINVAL;
> +
> + spec->spec[spec->len].type_id = id;
> + spec->spec[spec->len].idx = access_idx;
> + spec->spec[spec->len].name = name;
> + spec->len++;
> + }
> +
> + id = m->type;
> + } else if (btf_is_array(t)) {
> + const struct btf_array *a = (void *)(t + 1);
> +
> + t = skip_mods_and_typedefs(btf, a->type, &id);
> + if (!t || access_idx >= a->nelems)
> + return -EINVAL;
> +
> + spec->spec[spec->len].type_id = id;
> + spec->spec[spec->len].idx = access_idx;
> + spec->len++;
> +
> + sz = btf__resolve_size(btf, id);
> + if (sz < 0)
> + return sz;
> + spec->offset += access_idx * sz;
> + } else {
> + pr_warning("relo for [%u] %s (at idx %d) captures type [%d] of unexpected kind %d\n",
> + type_id, spec_str, i, id, btf_kind(t));
> + return -EINVAL;
> + }
> + }
> +
> + return 0;
> +}
> +
> +/* Given 'some_struct_name___with_flavor' return the length of a name prefix
> + * before last triple underscore. Struct name part after last triple
> + * underscore is ignored by BPF CO-RE relocation during relocation matching.
> + */
> +static size_t bpf_core_essential_name_len(const char *name)
> +{
> + size_t n = strlen(name);
> + int i = n - 3;
> +
> + while (i > 0) {
> + if (name[i] == '_' && name[i + 1] == '_' && name[i + 2] == '_')
> + return i;
> + i--;
> + }
> + return n;
> +}
> +
> +/* dynamically sized list of type IDs */
> +struct ids_vec {
> + __u32 *data;
> + int len;
> +};
> +
> +static void bpf_core_free_cands(struct ids_vec *cand_ids)
> +{
> + free(cand_ids->data);
> + free(cand_ids);
> +}
> +
> +static struct ids_vec *bpf_core_find_cands(const struct btf *local_btf,
> + __u32 local_type_id,
> + const struct btf *targ_btf)
> +{
> + size_t local_essent_len, targ_essent_len;
> + const char *local_name, *targ_name;
> + const struct btf_type *t;
> + struct ids_vec *cand_ids;
> + __u32 *new_ids;
> + int i, err, n;
> +
> + t = btf__type_by_id(local_btf, local_type_id);
> + if (!t)
> + return ERR_PTR(-EINVAL);
> +
> + local_name = btf__name_by_offset(local_btf, t->name_off);
> + if (str_is_empty(local_name))
> + return ERR_PTR(-EINVAL);
> + local_essent_len = bpf_core_essential_name_len(local_name);
> +
> + cand_ids = calloc(1, sizeof(*cand_ids));
> + if (!cand_ids)
> + return ERR_PTR(-ENOMEM);
> +
> + n = btf__get_nr_types(targ_btf);
> + for (i = 1; i <= n; i++) {
> + t = btf__type_by_id(targ_btf, i);
> + targ_name = btf__name_by_offset(targ_btf, t->name_off);
> + if (str_is_empty(targ_name))
> + continue;
> +
> + targ_essent_len = bpf_core_essential_name_len(targ_name);
> + if (targ_essent_len != local_essent_len)
> + continue;
> +
> + if (strncmp(local_name, targ_name, local_essent_len) == 0) {
> + pr_debug("[%d] (%s): found candidate [%d] (%s)\n",
> + local_type_id, local_name, i, targ_name);
> + new_ids = realloc(cand_ids->data, cand_ids->len + 1);
> + if (!new_ids) {
> + err = -ENOMEM;
> + goto err_out;
> + }
> + cand_ids->data = new_ids;
> + cand_ids->data[cand_ids->len++] = i;
> + }
> + }
> + return cand_ids;
> +err_out:
> + bpf_core_free_cands(cand_ids);
> + return ERR_PTR(err);
> +}
> +
> +/* Check two types for compatibility, skipping const/volatile/restrict and
> + * typedefs, to ensure we are relocating offset to the compatible entities:
> + * - any two STRUCTs/UNIONs are compatible and can be mixed;
> + * - any two FWDs are compatible;
> + * - any two PTRs are always compatible;
> + * - for ENUMs, check sizes, names are ignored;
> + * - for INT, size and bitness should match, signedness is ignored;
> + * - for ARRAY, dimensionality is ignored, element types are checked for
> + * compatibility recursively;
> + * - everything else shouldn't be ever a target of relocation.
> + * These rules are not set in stone and probably will be adjusted as we get
> + * more experience with using BPF CO-RE relocations.
> + */
> +static int bpf_core_fields_are_compat(const struct btf *local_btf,
> + __u32 local_id,
> + const struct btf *targ_btf,
> + __u32 targ_id)
> +{
> + const struct btf_type *local_type, *targ_type;
> + __u16 kind;
> +
> +recur:
> + local_type = skip_mods_and_typedefs(local_btf, local_id, &local_id);
> + targ_type = skip_mods_and_typedefs(targ_btf, targ_id, &targ_id);
> + if (!local_type || !targ_type)
> + return -EINVAL;
> +
> + if (btf_is_composite(local_type) && btf_is_composite(targ_type))
> + return 1;
> + if (BTF_INFO_KIND(local_type->info) != BTF_INFO_KIND(targ_type->info))
> + return 0;
> +
> + kind = BTF_INFO_KIND(local_type->info);
> + switch (kind) {
> + case BTF_KIND_FWD:
> + case BTF_KIND_PTR:
> + return 1;
> + case BTF_KIND_ENUM:
> + return local_type->size == targ_type->size;
> + case BTF_KIND_INT: {
> + __u32 loc_int = *(__u32 *)(local_type + 1);
> + __u32 targ_int = *(__u32 *)(targ_type + 1);
> +
> + return BTF_INT_OFFSET(loc_int) == 0 &&
> + BTF_INT_OFFSET(targ_int) == 0 &&
> + local_type->size == targ_type->size &&
> + BTF_INT_BITS(loc_int) == BTF_INT_BITS(targ_int);
> + }
> + case BTF_KIND_ARRAY: {
> + const struct btf_array *loc_a, *targ_a;
> +
> + loc_a = (void *)(local_type + 1);
> + targ_a = (void *)(targ_type + 1);
> + local_id = loc_a->type;
> + targ_id = targ_a->type;
> + goto recur;
> + }
> + default:
> + pr_warning("unexpected kind %d relocated, local [%d], target [%d]\n",
> + kind, local_id, targ_id);
> + return 0;
> + }
> +}
> +
> +/*
> + * Given single high-level named field accessor in local type, find
> + * corresponding high-level accessor for a target type. Along the way,
> + * maintain low-level spec for target as well. Also keep updating target
> + * offset.
> + *
> + * Searching is performed through recursive exhaustive enumeration of all
> + * fields of a struct/union. If there are any anonymous (embedded)
> + * structs/unions, they are recursively searched as well. If field with
> + * desired name is found, check compatibility between local and target types,
> + * before returning result.
> + *
> + * 1 is returned, if field is found.
> + * 0 is returned if no compatible field is found.
> + * <0 is returned on error.
> + */
> +static int bpf_core_match_member(const struct btf *local_btf,
> + const struct bpf_core_accessor *local_acc,
> + const struct btf *targ_btf,
> + __u32 targ_id,
> + struct bpf_core_spec *spec,
> + __u32 *next_targ_id)
> +{
> + const struct btf_type *local_type, *targ_type;
> + const struct btf_member *local_member, *m;
> + const char *local_name, *targ_name;
> + __u32 local_id;
> + int i, n, found;
> +
> + targ_type = skip_mods_and_typedefs(targ_btf, targ_id, &targ_id);
> + if (!targ_type)
> + return -EINVAL;
> + if (!btf_is_composite(targ_type))
> + return 0;
> +
> + local_id = local_acc->type_id;
> + local_type = btf__type_by_id(local_btf, local_id);
> + local_member = (void *)(local_type + 1);
> + local_member += local_acc->idx;
> + local_name = btf__name_by_offset(local_btf, local_member->name_off);
> +
> + n = BTF_INFO_VLEN(targ_type->info);
> + m = (void *)(targ_type + 1);
> + for (i = 0; i < n; i++, m++) {
> + __u32 offset;
> +
> + /* bitfield relocations not supported */
> + if (BTF_INFO_KFLAG(targ_type->info)) {
> + if (BTF_MEMBER_BITFIELD_SIZE(m->offset))
> + continue;
> + offset = BTF_MEMBER_BIT_OFFSET(m->offset);
> + } else {
> + offset = m->offset;
> + }
> + if (offset % 8)
> + continue;
> +
> + /* too deep struct/union/array nesting */
> + if (spec->raw_len == BPF_CORE_SPEC_MAX_LEN)
> + return -E2BIG;
> +
> + /* speculate this member will be the good one */
> + spec->offset += offset / 8;
> + spec->raw_spec[spec->raw_len++] = i;
> +
> + targ_name = btf__name_by_offset(targ_btf, m->name_off);
> + if (str_is_empty(targ_name)) {
> + /* embedded struct/union, we need to go deeper */
> + found = bpf_core_match_member(local_btf, local_acc,
> + targ_btf, m->type,
> + spec, next_targ_id);
> + if (found) /* either found or error */
> + return found;
> + } else if (strcmp(local_name, targ_name) == 0) {
> + /* matching named field */
> + struct bpf_core_accessor *targ_acc;
> +
> + targ_acc = &spec->spec[spec->len++];
> + targ_acc->type_id = targ_id;
> + targ_acc->idx = i;
> + targ_acc->name = targ_name;
> +
> + *next_targ_id = m->type;
> + found = bpf_core_fields_are_compat(local_btf,
> + local_member->type,
> + targ_btf, m->type);
> + if (!found)
> + spec->len--; /* pop accessor */
> + return found;
> + }
> + /* member turned out not to be what we looked for */
> + spec->offset -= offset / 8;
> + spec->raw_len--;
> + }
> +
> + return 0;
> +}
> +
> +/*
> + * Try to match local spec to a target type and, if successful, produce full
> + * target spec (high-level, low-level + offset).
> + */
> +static int bpf_core_spec_match(struct bpf_core_spec *local_spec,
> + const struct btf *targ_btf, __u32 targ_id,
> + struct bpf_core_spec *targ_spec)
> +{
> + const struct btf_type *targ_type;
> + const struct bpf_core_accessor *local_acc;
> + struct bpf_core_accessor *targ_acc;
> + int i, sz, matched;
> +
> + memset(targ_spec, 0, sizeof(*targ_spec));
> + targ_spec->btf = targ_btf;
> +
> + local_acc = &local_spec->spec[0];
> + targ_acc = &targ_spec->spec[0];
> +
> + for (i = 0; i < local_spec->len; i++, local_acc++, targ_acc++) {
> + targ_type = skip_mods_and_typedefs(targ_spec->btf, targ_id,
> + &targ_id);
> + if (!targ_type)
> + return -EINVAL;
> +
> + if (local_acc->name) {
> + matched = bpf_core_match_member(local_spec->btf,
> + local_acc,
> + targ_btf, targ_id,
> + targ_spec, &targ_id);
> + if (matched <= 0)
> + return matched;
> + } else {
> + /* for i=0, targ_id is already treated as array element
> + * type (because it's the original struct), for others
> + * we should find array element type first
> + */
> + if (i > 0) {
> + const struct btf_array *a;
> +
> + if (!btf_is_array(targ_type))
> + return 0;
> +
> + a = (void *)(targ_type + 1);
> + if (local_acc->idx >= a->nelems)
> + return 0;
> + if (!skip_mods_and_typedefs(targ_btf, a->type,
> + &targ_id))
> + return -EINVAL;
> + }
> +
> + /* too deep struct/union/array nesting */
> + if (targ_spec->raw_len == BPF_CORE_SPEC_MAX_LEN)
> + return -E2BIG;
> +
> + targ_acc->type_id = targ_id;
> + targ_acc->idx = local_acc->idx;
> + targ_acc->name = NULL;
> + targ_spec->len++;
> + targ_spec->raw_spec[targ_spec->raw_len] = targ_acc->idx;
> + targ_spec->raw_len++;
> +
> + sz = btf__resolve_size(targ_btf, targ_id);
> + if (sz < 0)
> + return sz;
> + targ_spec->offset += local_acc->idx * sz;
> + }
> + }
> +
> + return 1;
> +}
> +
> +/*
> + * Patch relocatable BPF instruction.
> + * Expected insn->imm value is provided for validation, as well as the new
> + * relocated value.
> + *
> + * Currently three kinds of BPF instructions are supported:
> + * 1. rX = <imm> (assignment with immediate operand);
> + * 2. rX += <imm> (arithmetic operations with immediate operand);
> + * 3. *(rX) = <imm> (indirect memory assignment with immediate operand).
> + *
> + * If actual insn->imm value is wrong, bail out.
> + */
> +static int bpf_core_reloc_insn(struct bpf_program *prog, int insn_off,
> + __u32 orig_off, __u32 new_off)
> +{
> + struct bpf_insn *insn;
> + int insn_idx;
> + __u8 class;
> +
> + if (insn_off % sizeof(struct bpf_insn))
> + return -EINVAL;
> + insn_idx = insn_off / sizeof(struct bpf_insn);
> +
> + insn = &prog->insns[insn_idx];
> + class = BPF_CLASS(insn->code);
> +
> + if (class == BPF_ALU || class == BPF_ALU64) {
> + if (BPF_SRC(insn->code) != BPF_K)
> + return -EINVAL;
> + if (insn->imm != orig_off)
> + return -EINVAL;
> + insn->imm = new_off;
> + pr_debug("prog '%s': patched insn #%d (ALU/ALU64) imm %d -> %d\n",
> + bpf_program__title(prog, false),
> + insn_idx, orig_off, new_off);
> + } else {
> + pr_warning("prog '%s': trying to relocate unrecognized insn #%d, code:%x, src:%x, dst:%x, off:%x, imm:%x\n",
> + bpf_program__title(prog, false),
> + insn_idx, insn->code, insn->src_reg, insn->dst_reg,
> + insn->off, insn->imm);
> + return -EINVAL;
> + }
> + return 0;
> +}
> +
> +/*
> + * Probe few well-known locations for vmlinux kernel image and try to load BTF
> + * data out of it to use for target BTF.
> + */
> +static struct btf *bpf_core_find_kernel_btf(void)
> +{
> + const char *locations[] = {
> + "/lib/modules/%1$s/vmlinux-%1$s",
> + "/usr/lib/modules/%1$s/kernel/vmlinux",
> + };
> + char path[PATH_MAX + 1];
> + struct utsname buf;
> + struct btf *btf;
> + int i, err;
> +
> + err = uname(&buf);
> + if (err) {
> + pr_warning("failed to uname(): %d\n", err);
> + return ERR_PTR(err);
> + }
> +
> + for (i = 0; i < ARRAY_SIZE(locations); i++) {
> + snprintf(path, PATH_MAX, locations[i], buf.release);
> + pr_debug("attempting to load kernel BTF from '%s'\n", path);
> +
> + if (access(path, R_OK))
> + continue;
> +
> + btf = btf__parse_elf(path, NULL);
> + if (IS_ERR(btf))
> + continue;
> +
> + pr_debug("successfully loaded kernel BTF from '%s'\n", path);
> + return btf;
> + }
> +
> + pr_warning("failed to find valid kernel BTF\n");
> + return ERR_PTR(-ESRCH);
> +}
> +
> +/* Output spec definition in the format:
> + * [<type-id>] (<type-name>) + <raw-spec> => <offset>@<spec>,
> + * where <spec> is a C-syntax view of recorded field access, e.g.: x.a[3].b
> + */
> +static void bpf_core_dump_spec(int level, const struct bpf_core_spec *spec)
> +{
> + const struct btf_type *t;
> + const char *s;
> + __u32 type_id;
> + int i;
> +
> + type_id = spec->spec[0].type_id;
> + t = btf__type_by_id(spec->btf, type_id);
> + s = btf__name_by_offset(spec->btf, t->name_off);
> + libbpf_print(level, "[%u] (%s) + ", type_id, s);
> +
> + for (i = 0; i < spec->raw_len; i++)
> + libbpf_print(level, "%d%s", spec->raw_spec[i],
> + i == spec->raw_len - 1 ? " => " : ":");
> +
> + libbpf_print(level, "%u @ &x", spec->offset);
> +
> + for (i = 0; i < spec->len; i++) {
> + if (spec->spec[i].name)
> + libbpf_print(level, ".%s", spec->spec[i].name);
> + else
> + libbpf_print(level, "[%u]", spec->spec[i].idx);
> + }
> +
> +}
> +
> +static size_t bpf_core_hash_fn(const void *key, void *ctx)
> +{
> + return (size_t)key;
> +}
> +
> +static bool bpf_core_equal_fn(const void *k1, const void *k2, void *ctx)
> +{
> + return k1 == k2;
> +}
> +
> +static void *u32_to_ptr(__u32 x)
> +{
> + return (void *)(uintptr_t)x;
> +}
> +
> +/*
> + * CO-RE relocate single instruction.
> + *
> + * The outline and important points of the algorithm:
> + * 1. For given local type, find corresponding candidate target types.
> + * Candidate type is a type with the same "essential" name, ignoring
> + * everything after last triple underscore (___). E.g., `sample`,
> + * `sample___flavor_one`, `sample___flavor_another_one`, are all candidates
> + * for each other. Names with triple underscore are referred to as
> + * "flavors" and are useful, among other things, to allow to
> + * specify/support incompatible variations of the same kernel struct, which
> + * might differ between different kernel versions and/or build
> + * configurations.
> + *
> + * N.B. Struct "flavors" could be generated by bpftool's BTF-to-C
> + * converter, when deduplicated BTF of a kernel still contains more than
> + * one different types with the same name. In that case, ___2, ___3, etc
> + * are appended starting from second name conflict. But start flavors are
> + * also useful to be defined "locally", in BPF program, to extract same
> + * data from incompatible changes between different kernel
> + * versions/configurations. For instance, to handle field renames between
> + * kernel versions, one can use two flavors of the struct name with the
> + * same common name and use conditional relocations to extract that field,
> + * depending on target kernel version.
> + * 2. For each candidate type, try to match local specification to this
> + * candidate target type. Matching involves finding corresponding
> + * high-level spec accessors, meaning that all named fields should match,
> + * as well as all array accesses should be within the actual bounds. Also,
> + * types should be compatible (see bpf_core_fields_are_compat for details).
> + * 3. It is supported and expected that there might be multiple flavors
> + * matching the spec. As long as all the specs resolve to the same set of
> + * offsets across all candidates, there is not error. If there is any
> + * ambiguity, CO-RE relocation will fail. This is necessary to accomodate
> + * imprefection of BTF deduplication, which can cause slight duplication of
> + * the same BTF type, if some directly or indirectly referenced (by
> + * pointer) type gets resolved to different actual types in different
> + * object files. If such situation occurs, deduplicated BTF will end up
> + * with two (or more) structurally identical types, which differ only in
> + * types they refer to through pointer. This should be OK in most cases and
> + * is not an error.
> + * 4. Candidate types search is performed by linearly scanning through all
> + * types in target BTF. It is anticipated that this is overall more
> + * efficient memory-wise and not significantly worse (if not better)
> + * CPU-wise compared to prebuilding a map from all local type names to
> + * a list of candidate type names. It's also sped up by caching resolved
> + * list of matching candidates per each local "root" type ID, that has at
> + * least one bpf_offset_reloc associated with it. This list is shared
> + * between multiple relocations for the same type ID and is updated as some
> + * of the candidates are pruned due to structural incompatibility.
> + */
> +static int bpf_core_reloc_offset(struct bpf_program *prog,
> + const struct bpf_offset_reloc *relo,
> + int relo_idx,
> + const struct btf *local_btf,
> + const struct btf *targ_btf,
> + struct hashmap *cand_cache)
> +{
> + const char *prog_name = bpf_program__title(prog, false);
> + struct bpf_core_spec local_spec, cand_spec, targ_spec;
> + const void *type_key = u32_to_ptr(relo->type_id);
> + const struct btf_type *local_type, *cand_type;
> + const char *local_name, *cand_name;
> + struct ids_vec *cand_ids;
> + __u32 local_id, cand_id;
> + const char *spec_str;
> + int i, j, err;
> +
> + local_id = relo->type_id;
> + local_type = btf__type_by_id(local_btf, local_id);
> + if (!local_type)
> + return -EINVAL;
> +
> + local_name = btf__name_by_offset(local_btf, local_type->name_off);
> + if (str_is_empty(local_name))
> + return -EINVAL;
> +
> + spec_str = btf__name_by_offset(local_btf, relo->access_str_off);
> + if (str_is_empty(spec_str))
> + return -EINVAL;
> +
> + err = bpf_core_spec_parse(local_btf, local_id, spec_str, &local_spec);
> + if (err) {
> + pr_warning("prog '%s': relo #%d: parsing [%d] (%s) + %s failed: %d\n",
> + prog_name, relo_idx, local_id, local_name, spec_str,
> + err);
> + return -EINVAL;
> + }
> +
> + pr_debug("prog '%s': relo #%d: spec is ", prog_name, relo_idx);
> + bpf_core_dump_spec(LIBBPF_DEBUG, &local_spec);
> + libbpf_print(LIBBPF_DEBUG, "\n");
> +
> + if (!hashmap__find(cand_cache, type_key, (void **)&cand_ids)) {
> + cand_ids = bpf_core_find_cands(local_btf, local_id, targ_btf);
> + if (IS_ERR(cand_ids)) {
> + pr_warning("prog '%s': relo #%d: target candidate search failed for [%d] (%s): %ld",
> + prog_name, relo_idx, local_id, local_name,
> + PTR_ERR(cand_ids));
> + return PTR_ERR(cand_ids);
> + }
> + err = hashmap__set(cand_cache, type_key, cand_ids, NULL, NULL);
> + if (err) {
> + bpf_core_free_cands(cand_ids);
> + return err;
> + }
> + }
> +
> + for (i = 0, j = 0; i < cand_ids->len; i++) {
> + cand_id = cand_ids->data[i];
> + cand_type = btf__type_by_id(targ_btf, cand_id);
> + cand_name = btf__name_by_offset(targ_btf, cand_type->name_off);
> +
> + err = bpf_core_spec_match(&local_spec, targ_btf,
> + cand_id, &cand_spec);
> + if (err < 0) {
> + pr_warning("prog '%s': relo #%d: failed to match spec ",
> + prog_name, relo_idx);
> + bpf_core_dump_spec(LIBBPF_WARN, &local_spec);
> + libbpf_print(LIBBPF_WARN,
> + " to candidate #%d [%d] (%s): %d\n",
> + i, cand_id, cand_name, err);
> + return err;
> + }
> + if (err == 0) {
> + pr_debug("prog '%s': relo #%d: candidate #%d [%d] (%s) doesn't match spec ",
> + prog_name, relo_idx, i, cand_id, cand_name);
> + bpf_core_dump_spec(LIBBPF_DEBUG, &local_spec);
> + libbpf_print(LIBBPF_DEBUG, "\n");
> + continue;
> + }
> +
> + pr_debug("prog '%s': relo #%d: candidate #%d matched as spec ",
> + prog_name, relo_idx, i);
> + bpf_core_dump_spec(LIBBPF_DEBUG, &cand_spec);
> + libbpf_print(LIBBPF_DEBUG, "\n");
> +
> + if (j == 0) {
> + targ_spec = cand_spec;
> + } else if (cand_spec.offset != targ_spec.offset) {
> + /* if there are many candidates, they should all
> + * resolve to the same offset
> + */
> + pr_warning("prog '%s': relo #%d: candidate #%d spec ",
> + prog_name, relo_idx, i);
> + bpf_core_dump_spec(LIBBPF_WARN, &cand_spec);
> + libbpf_print(LIBBPF_WARN,
> + " conflicts with target spec ");
> + bpf_core_dump_spec(LIBBPF_WARN, &targ_spec);
> + libbpf_print(LIBBPF_WARN, "\n");
> + return -EINVAL;
> + }
> +
> + cand_ids->data[j++] = cand_spec.spec[0].type_id;
> + }
> +
> + cand_ids->len = j;
> + if (cand_ids->len == 0) {
> + pr_warning("prog '%s': relo #%d: no matching targets found for [%d] (%s) + %s\n",
> + prog_name, relo_idx, local_id, local_name, spec_str);
> + return -ESRCH;
> + }
> +
> + err = bpf_core_reloc_insn(prog, relo->insn_off,
> + local_spec.offset, targ_spec.offset);
> + if (err) {
> + pr_warning("prog '%s': relo #%d: failed to patch insn at offset %d: %d\n",
> + prog_name, relo_idx, relo->insn_off, err);
> + return -EINVAL;
> + }
> +
> + return 0;
> +}
> +
> +static int
> +bpf_core_reloc_offsets(struct bpf_object *obj, const char *targ_btf_path)
> +{
> + const struct btf_ext_info_sec *sec;
> + const struct bpf_offset_reloc *rec;
> + const struct btf_ext_info *seg;
> + struct hashmap_entry *entry;
> + struct hashmap *cand_cache = NULL;
> + struct bpf_program *prog;
> + struct btf *targ_btf;
> + const char *sec_name;
> + int i, err = 0;
> +
> + if (targ_btf_path)
> + targ_btf = btf__parse_elf(targ_btf_path, NULL);
> + else
> + targ_btf = bpf_core_find_kernel_btf();
> + if (IS_ERR(targ_btf)) {
> + pr_warning("failed to get target BTF: %ld\n",
> + PTR_ERR(targ_btf));
> + return PTR_ERR(targ_btf);
> + }
> +
> + cand_cache = hashmap__new(bpf_core_hash_fn, bpf_core_equal_fn, NULL);
> + if (IS_ERR(cand_cache)) {
> + err = PTR_ERR(cand_cache);
> + goto out;
> + }
> +
> + seg = &obj->btf_ext->offset_reloc_info;
> + for_each_btf_ext_sec(seg, sec) {
> + sec_name = btf__name_by_offset(obj->btf, sec->sec_name_off);
> + if (str_is_empty(sec_name)) {
> + err = -EINVAL;
> + goto out;
> + }
> + prog = bpf_object__find_program_by_title(obj, sec_name);
> + if (!prog) {
> + pr_warning("failed to find program '%s' for CO-RE offset relocation\n",
> + sec_name);
> + err = -EINVAL;
> + goto out;
> + }
> +
> + pr_debug("prog '%s': performing %d CO-RE offset relocs\n",
> + sec_name, sec->num_info);
> +
> + for_each_btf_ext_rec(seg, sec, i, rec) {
> + err = bpf_core_reloc_offset(prog, rec, i, obj->btf,
> + targ_btf, cand_cache);
> + if (err) {
> + pr_warning("prog '%s': relo #%d: failed to relocate: %d\n",
> + sec_name, i, err);
> + goto out;
> + }
> + }
> + }
> +
> +out:
> + btf__free(targ_btf);
> + if (!IS_ERR_OR_NULL(cand_cache)) {
> + hashmap__for_each_entry(cand_cache, entry, i) {
> + bpf_core_free_cands(entry->value);
> + }
> + hashmap__free(cand_cache);
> + }
> + return err;
> +}
> +
> +static int
> +bpf_object__relocate_core(struct bpf_object *obj, const char *targ_btf_path)
> +{
> + int err = 0;
> +
> + if (obj->btf_ext->offset_reloc_info.len)
> + err = bpf_core_reloc_offsets(obj, targ_btf_path);
> +
> + return err;
> +}
> +
> static int
> bpf_program__reloc_text(struct bpf_program *prog, struct bpf_object *obj,
> struct reloc_desc *relo)
> @@ -2399,14 +3293,21 @@ bpf_program__relocate(struct bpf_program *prog, struct bpf_object *obj)
> return 0;
> }
>
> -
> static int
> -bpf_object__relocate(struct bpf_object *obj)
> +bpf_object__relocate(struct bpf_object *obj, const char *targ_btf_path)
> {
> struct bpf_program *prog;
> size_t i;
> int err;
>
> + if (obj->btf_ext) {
> + err = bpf_object__relocate_core(obj, targ_btf_path);
> + if (err) {
> + pr_warning("failed to perform CO-RE relocations: %d\n",
> + err);
> + return err;
> + }
> + }
> for (i = 0; i < obj->nr_programs; i++) {
> prog = &obj->programs[i];
>
> @@ -2807,7 +3708,7 @@ int bpf_object__load_xattr(struct bpf_object_load_attr *attr)
> obj->loaded = true;
>
> CHECK_ERR(bpf_object__create_maps(obj), err, out);
> - CHECK_ERR(bpf_object__relocate(obj), err, out);
> + CHECK_ERR(bpf_object__relocate(obj, attr->target_btf_path), err, out);
> CHECK_ERR(bpf_object__load_progs(obj, attr->log_level), err, out);
>
> return 0;
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index 8a9d462a6f6d..e8f70977d137 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -92,6 +92,7 @@ LIBBPF_API void bpf_object__close(struct bpf_object *object);
> struct bpf_object_load_attr {
> struct bpf_object *obj;
> int log_level;
> + const char *target_btf_path;
> };
>
> /* Load/unload object into/from kernel */
> --
> 2.17.1
>
^ permalink raw reply
* Re: [PATCH v4 net-next 15/19] ionic: Add netdev-event handling
From: Stephen Hemminger @ 2019-07-30 23:39 UTC (permalink / raw)
To: Shannon Nelson; +Cc: netdev, davem
In-Reply-To: <20190722214023.9513-16-snelson@pensando.io>
On Mon, 22 Jul 2019 14:40:19 -0700
Shannon Nelson <snelson@pensando.io> wrote:
> +
> +static void ionic_lif_set_netdev_info(struct lif *lif)
> +{
> + struct ionic_admin_ctx ctx = {
> + .work = COMPLETION_INITIALIZER_ONSTACK(ctx.work),
> + .cmd.lif_setattr = {
> + .opcode = CMD_OPCODE_LIF_SETATTR,
> + .index = cpu_to_le16(lif->index),
> + .attr = IONIC_LIF_ATTR_NAME,
> + },
> + };
> +
> + strlcpy(ctx.cmd.lif_setattr.name, lif->netdev->name,
> + sizeof(ctx.cmd.lif_setattr.name));
> +
> + dev_info(lif->ionic->dev, "NETDEV_CHANGENAME %s %s\n",
> + lif->name, ctx.cmd.lif_setattr.name);
> +
There is already a kernel message for this. Repeating the same thing in the
driver is redundant.
^ permalink raw reply
* Re: [PATCH 0/2] tools: bpftool: add net (un)load command to load XDP
From: Alexei Starovoitov @ 2019-07-30 23:17 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Daniel T. Lee, Stephen Hemminger, David Ahern,
Jesper Dangaard Brouer, John Fastabend, Daniel Borkmann,
Alexei Starovoitov, netdev
In-Reply-To: <20190730155915.5bbe3a03@cakuba.netronome.com>
On Tue, Jul 30, 2019 at 03:59:15PM -0700, Jakub Kicinski wrote:
> On Wed, 31 Jul 2019 03:48:19 +0900, Daniel T. Lee wrote:
> > Currently, bpftool net only supports dumping progs loaded on the
> > interface. To load XDP prog on interface, user must use other tool
> > (eg. iproute2). By this patch, with `bpftool net (un)load`, user can
> > (un)load XDP prog on interface.
>
> I don't understand why using another tool is a bad thing :(
> What happened to the Unix philosophy?
>
> I remain opposed to duplicating iproute2's functionality under
> bpftool net :( The way to attach bpf programs in the networking
> subsystem is through the iproute2 commends - ip and tc..
>
> It seems easy enough to add a feature to bpftool but from
> a perspective of someone adding a new feature to the kernel,
> and wanting to update user space components it's quite painful :(
>
> So could you describe to me in more detail why this is a good idea?
> Perhaps others can chime in?
I don't think it has anything to do with 'unix philosophy'.
Here the proposal to teach bpftool to attach xdp progs.
I see nothing wrong with that.
Another reason is iproute2 is still far away from adopting libbpf.
So all the latest goodness like BTF, introspection, etc will not
be available to iproute2 users for some time.
Even when iproute2 is ready it would be convenient for folks like me
(who need to debug stuff in production) to remember cmd line of
bpftool only to introspect the server. Debugging often includes
detaching/attaching progs. Not only doing 'bpftool p s'.
If bpftool was taught to do equivalent of 'ip link' that would be
very different story and I would be opposed to that.
^ permalink raw reply
* Re: [PATCH bpf-next] libbpf : make libbpf_num_possible_cpus function thread safe
From: Jakub Kicinski @ 2019-07-30 23:05 UTC (permalink / raw)
To: Takshak Chahande; +Cc: netdev, ast, daniel, rdna, kernel-team, hechaol
In-Reply-To: <20190730222447.3918919-1-ctakshak@fb.com>
On Tue, 30 Jul 2019 15:24:47 -0700, Takshak Chahande wrote:
> Having static variable `cpus` in libbpf_num_possible_cpus function without
> guarding it with mutex makes this function thread-unsafe.
>
> If multiple threads accessing this function, in the current form; it
> leads to incrementing the static variable value `cpus` in the multiple
> of total available CPUs.
>
> Let caching the number of possile CPUs handled by libbpf's users than
> this library itself;
Can we just use stack variable for the calculations and
READ_ONCE()/WRITE_ONCE() for assignment to the static?
libbpf itself uses this helper so caller caching wouldn't
work there.
> and let this function be rock bottom one which reads
> and parse the file (/sys/devices/system/cpu/possible) everytime it gets
> called to simplify the things.
I don't understand can you rephrase?
> Fixes: 6446b3155521 (bpf: add a new API libbpf_num_possible_cpus())
>
No new line after the fixes tag, also I think you're missing quotation
marks around the commit title?
> Signed-off-by: Takshak Chahande <ctakshak@fb.com>
> Acked-by: Andrey Ignatov <rdna@fb.com>
^ permalink raw reply
* Re: [PATCH 0/2] tools: bpftool: add net (un)load command to load XDP
From: Jakub Kicinski @ 2019-07-30 22:59 UTC (permalink / raw)
To: Daniel T. Lee, Stephen Hemminger, David Ahern,
Jesper Dangaard Brouer, John Fastabend
Cc: Daniel Borkmann, Alexei Starovoitov, netdev
In-Reply-To: <20190730184821.10833-1-danieltimlee@gmail.com>
On Wed, 31 Jul 2019 03:48:19 +0900, Daniel T. Lee wrote:
> Currently, bpftool net only supports dumping progs loaded on the
> interface. To load XDP prog on interface, user must use other tool
> (eg. iproute2). By this patch, with `bpftool net (un)load`, user can
> (un)load XDP prog on interface.
I don't understand why using another tool is a bad thing :(
What happened to the Unix philosophy?
I remain opposed to duplicating iproute2's functionality under
bpftool net :( The way to attach bpf programs in the networking
subsystem is through the iproute2 commends - ip and tc..
It seems easy enough to add a feature to bpftool but from
a perspective of someone adding a new feature to the kernel,
and wanting to update user space components it's quite painful :(
So could you describe to me in more detail why this is a good idea?
Perhaps others can chime in?
> $ ./bpftool prog
> ...
> 208: xdp name xdp_prog1 tag ad822e38b629553f gpl
> loaded_at 2019-07-28T18:03:11+0900 uid 0
> ...
> $ ./bpftool net load id 208 xdpdrv enp6s0np1
> $ ./bpftool net
> xdp:
> enp6s0np1(5) driver id 208
> ...
> $ ./bpftool net unload xdpdrv enp6s0np1
> $ ./bpftool net
> xdp:
> ...
>
> The word 'load' is used instead of 'attach', since XDP program is not
> considered as 'bpf_attach_type' and can't be attached with
> 'BPF_PROG_ATTACH'. In this context, the meaning of 'load' is, prog will
> be loaded on interface.
>
> While this patch only contains support for XDP, through `net (un)load`,
> bpftool can further support other prog attach types.
>
> XDP (un)load tested on Netronome Agilio.
^ permalink raw reply
* Re: [PATCH v4 net-next 15/19] ionic: Add netdev-event handling
From: Shannon Nelson @ 2019-07-30 22:53 UTC (permalink / raw)
To: Saeed Mahameed, netdev@vger.kernel.org, davem@davemloft.net
In-Reply-To: <a27ba11984c8872a35206bd9fbeee0800ba7b050.camel@mellanox.com>
On 7/25/19 4:55 PM, Saeed Mahameed wrote:
> On Mon, 2019-07-22 at 14:40 -0700, Shannon Nelson wrote:
>> When the netdev gets a new name from userland, pass that name
>> down to the NIC for internal tracking.
>>
> Just out of curiosity, why your NIC internal device/firmware need to
> keep tracking of the netdev name ?
>
>
It is helpful in a debugging method inside the NIC firmware.
sln
^ permalink raw reply
* Re: [patch net-next v2 3/3] netdevsim: create devlink and netdev instances in namespace
From: Jakub Kicinski @ 2019-07-30 22:40 UTC (permalink / raw)
To: Jiri Pirko; +Cc: netdev, davem, sthemmin, dsahern, mlxsw
In-Reply-To: <20190730085734.31504-4-jiri@resnulli.us>
On Tue, 30 Jul 2019 10:57:34 +0200, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
>
> When user does create new netdevsim instance using sysfs bus file,
> create the devlink instance and related netdev instance in the namespace
> of the caller.
>
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> v1->v2:
> - remove net_namespace.h include and forward decralared net struct
> - add comment to initial_net pointer
Thanks!
^ permalink raw reply
* Re: [patch net-next v2 2/3] net: devlink: export devlink net set/get helpers
From: Jakub Kicinski @ 2019-07-30 22:40 UTC (permalink / raw)
To: Jiri Pirko; +Cc: netdev, davem, sthemmin, dsahern, mlxsw
In-Reply-To: <20190730085734.31504-3-jiri@resnulli.us>
On Tue, 30 Jul 2019 10:57:33 +0200, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
>
> Allow drivers to set/get net struct for devlink instance. Set is only
> allowed for newly allocated devlink instance.
>
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
^ permalink raw reply
* Re: [patch net-next v2 1/3] net: devlink: allow to change namespaces
From: Jakub Kicinski @ 2019-07-30 22:39 UTC (permalink / raw)
To: Jiri Pirko; +Cc: netdev, davem, sthemmin, dsahern, mlxsw
In-Reply-To: <20190730085734.31504-2-jiri@resnulli.us>
On Tue, 30 Jul 2019 10:57:32 +0200, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
>
> All devlink instances are created in init_net and stay there for a
> lifetime. Allow user to be able to move devlink instances into
> namespaces.
>
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
I'm no namespace expert, but seems reasonable, so FWIW:
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
If I read things right we will only send the devlink instance
notification to other namespaces when it moves, but not
notifications for sub-objects like ports. Is the expectation
that the user space dumps the objects it cares about or will
the other notifications be added as needed (or option 3 - I
misread the code).
I was also wondering it moving the devlink instance during a
multi-part transaction could be an issue. But AFAIU it should
be fine - the flashing doesn't release the instance lock, and
health stuff should complete correctly even if devlink is moved
half way through?
^ permalink raw reply
* [PATCH bpf-next] libbpf : make libbpf_num_possible_cpus function thread safe
From: Takshak Chahande @ 2019-07-30 22:24 UTC (permalink / raw)
To: netdev; +Cc: ast, daniel, rdna, ctakshak, kernel-team, hechaol
Having static variable `cpus` in libbpf_num_possible_cpus function without
guarding it with mutex makes this function thread-unsafe.
If multiple threads accessing this function, in the current form; it
leads to incrementing the static variable value `cpus` in the multiple
of total available CPUs.
Let caching the number of possile CPUs handled by libbpf's users than
this library itself; and let this function be rock bottom one which reads
and parse the file (/sys/devices/system/cpu/possible) everytime it gets
called to simplify the things.
Fixes: 6446b3155521 (bpf: add a new API libbpf_num_possible_cpus())
Signed-off-by: Takshak Chahande <ctakshak@fb.com>
Acked-by: Andrey Ignatov <rdna@fb.com>
---
tools/lib/bpf/libbpf.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index ead915aec349..e7ac0e02287e 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -4998,14 +4998,11 @@ int libbpf_num_possible_cpus(void)
static const char *fcpu = "/sys/devices/system/cpu/possible";
int len = 0, n = 0, il = 0, ir = 0;
unsigned int start = 0, end = 0;
- static int cpus;
char buf[128];
int error = 0;
+ int cpus = 0;
int fd = -1;
- if (cpus > 0)
- return cpus;
-
fd = open(fcpu, O_RDONLY);
if (fd < 0) {
error = errno;
--
2.17.1
^ permalink raw reply related
* [RFC PATCH v2 net-next 3/3] net: use listified RX for handling GRO_NORMAL skbs
From: Edward Cree @ 2019-07-30 22:21 UTC (permalink / raw)
To: David Miller; +Cc: linux-net-drivers, netdev, Eric Dumazet
In-Reply-To: <9bcebf59-a0e7-f461-36ef-8564ecb33282@solarflare.com>
When GRO decides not to coalesce a packet, in napi_frags_finish(), instead
of passing it to the stack immediately, place it on a list in the napi
struct. Then, at flush time (napi_complete_done(), napi_poll(), or
napi_busy_loop()), call netif_receive_skb_list_internal() on the list.
We'd like to do that in napi_gro_flush(), but it's not called if
!napi->gro_bitmask, so we have to do it in the callers instead. (There are
a handful of drivers that call napi_gro_flush() themselves, but it's not
clear why, or whether this will affect them.)
Because a full 64 packets is an inefficiently large batch, also consume the
list whenever it exceeds gro_normal_batch, a new net/core sysctl that
defaults to 8.
Signed-off-by: Edward Cree <ecree@solarflare.com>
---
include/linux/netdevice.h | 3 +++
net/core/dev.c | 44 +++++++++++++++++++++++++++++++++++---
net/core/sysctl_net_core.c | 8 +++++++
3 files changed, 52 insertions(+), 3 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 88292953aa6f..55ac223553f8 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -332,6 +332,8 @@ struct napi_struct {
struct net_device *dev;
struct gro_list gro_hash[GRO_HASH_BUCKETS];
struct sk_buff *skb;
+ struct list_head rx_list; /* Pending GRO_NORMAL skbs */
+ int rx_count; /* length of rx_list */
struct hrtimer timer;
struct list_head dev_list;
struct hlist_node napi_hash_node;
@@ -4239,6 +4241,7 @@ extern int dev_weight_rx_bias;
extern int dev_weight_tx_bias;
extern int dev_rx_weight;
extern int dev_tx_weight;
+extern int gro_normal_batch;
bool netdev_has_upper_dev(struct net_device *dev, struct net_device *upper_dev);
struct net_device *netdev_upper_get_next_dev_rcu(struct net_device *dev,
diff --git a/net/core/dev.c b/net/core/dev.c
index fc676b2610e3..b749eb2bfb0c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3963,6 +3963,8 @@ int dev_weight_rx_bias __read_mostly = 1; /* bias for backlog weight */
int dev_weight_tx_bias __read_mostly = 1; /* bias for output_queue quota */
int dev_rx_weight __read_mostly = 64;
int dev_tx_weight __read_mostly = 64;
+/* Maximum number of GRO_NORMAL skbs to batch up for list-RX */
+int gro_normal_batch __read_mostly = 8;
/* Called with irq disabled */
static inline void ____napi_schedule(struct softnet_data *sd,
@@ -5742,6 +5744,26 @@ struct sk_buff *napi_get_frags(struct napi_struct *napi)
}
EXPORT_SYMBOL(napi_get_frags);
+/* Pass the currently batched GRO_NORMAL SKBs up to the stack. */
+static void gro_normal_list(struct napi_struct *napi)
+{
+ if (!napi->rx_count)
+ return;
+ netif_receive_skb_list_internal(&napi->rx_list);
+ INIT_LIST_HEAD(&napi->rx_list);
+ napi->rx_count = 0;
+}
+
+/* Queue one GRO_NORMAL SKB up for list processing. If batch size exceeded,
+ * pass the whole batch up to the stack.
+ */
+static void gro_normal_one(struct napi_struct *napi, struct sk_buff *skb)
+{
+ list_add_tail(&skb->list, &napi->rx_list);
+ if (++napi->rx_count >= gro_normal_batch)
+ gro_normal_list(napi);
+}
+
static gro_result_t napi_frags_finish(struct napi_struct *napi,
struct sk_buff *skb,
gro_result_t ret)
@@ -5751,8 +5773,8 @@ static gro_result_t napi_frags_finish(struct napi_struct *napi,
case GRO_HELD:
__skb_push(skb, ETH_HLEN);
skb->protocol = eth_type_trans(skb, skb->dev);
- if (ret == GRO_NORMAL && netif_receive_skb_internal(skb))
- ret = GRO_DROP;
+ if (ret == GRO_NORMAL)
+ gro_normal_one(napi, skb);
break;
case GRO_DROP:
@@ -6029,6 +6051,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
NAPIF_STATE_IN_BUSY_POLL)))
return false;
+ gro_normal_list(n);
+
if (n->gro_bitmask) {
unsigned long timeout = 0;
@@ -6114,10 +6138,19 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock)
* Ideally, a new ndo_busy_poll_stop() could avoid another round.
*/
rc = napi->poll(napi, BUSY_POLL_BUDGET);
+ /* We can't gro_normal_list() here, because napi->poll() might have
+ * rearmed the napi (napi_complete_done()) in which case it could
+ * already be running on another CPU.
+ */
trace_napi_poll(napi, rc, BUSY_POLL_BUDGET);
netpoll_poll_unlock(have_poll_lock);
- if (rc == BUSY_POLL_BUDGET)
+ if (rc == BUSY_POLL_BUDGET) {
+ /* As the whole budget was spent, we still own the napi so can
+ * safely handle the rx_list.
+ */
+ gro_normal_list(napi);
__napi_schedule(napi);
+ }
local_bh_enable();
}
@@ -6162,6 +6195,7 @@ void napi_busy_loop(unsigned int napi_id,
}
work = napi_poll(napi, BUSY_POLL_BUDGET);
trace_napi_poll(napi, work, BUSY_POLL_BUDGET);
+ gro_normal_list(napi);
count:
if (work > 0)
__NET_ADD_STATS(dev_net(napi->dev),
@@ -6267,6 +6301,8 @@ void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
napi->timer.function = napi_watchdog;
init_gro_hash(napi);
napi->skb = NULL;
+ INIT_LIST_HEAD(&napi->rx_list);
+ napi->rx_count = 0;
napi->poll = poll;
if (weight > NAPI_POLL_WEIGHT)
netdev_err_once(dev, "%s() called with weight %d\n", __func__,
@@ -6363,6 +6399,8 @@ static int napi_poll(struct napi_struct *n, struct list_head *repoll)
goto out_unlock;
}
+ gro_normal_list(n);
+
if (n->gro_bitmask) {
/* flush too old packets
* If HZ < 1000, flush all packets.
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index f9204719aeee..dba52f53eace 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -569,6 +569,14 @@ static struct ctl_table net_core_table[] = {
.mode = 0644,
.proc_handler = proc_do_static_key,
},
+ {
+ .procname = "gro_normal_batch",
+ .data = &gro_normal_batch,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &one,
+ },
{ }
};
^ permalink raw reply related
* [RFC PATCH v2 net-next 2/3] sfc: falcon: don't score irq moderation points for GRO
From: Edward Cree @ 2019-07-30 22:21 UTC (permalink / raw)
To: linux-net-drivers, David Miller; +Cc: netdev, Eric Dumazet
In-Reply-To: <9bcebf59-a0e7-f461-36ef-8564ecb33282@solarflare.com>
Same rationale as for sfc, except that this wasn't performance-tested.
Signed-off-by: Edward Cree <ecree@solarflare.com>
---
drivers/net/ethernet/sfc/falcon/rx.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/sfc/falcon/rx.c b/drivers/net/ethernet/sfc/falcon/rx.c
index fd850d3d8ec0..05ea3523890a 100644
--- a/drivers/net/ethernet/sfc/falcon/rx.c
+++ b/drivers/net/ethernet/sfc/falcon/rx.c
@@ -424,7 +424,6 @@ ef4_rx_packet_gro(struct ef4_channel *channel, struct ef4_rx_buffer *rx_buf,
unsigned int n_frags, u8 *eh)
{
struct napi_struct *napi = &channel->napi_str;
- gro_result_t gro_result;
struct ef4_nic *efx = channel->efx;
struct sk_buff *skb;
@@ -460,9 +459,7 @@ ef4_rx_packet_gro(struct ef4_channel *channel, struct ef4_rx_buffer *rx_buf,
skb_record_rx_queue(skb, channel->rx_queue.core_index);
- gro_result = napi_gro_frags(napi);
- if (gro_result != GRO_DROP)
- channel->irq_mod_score += 2;
+ napi_gro_frags(napi);
}
/* Allocate and construct an SKB around page fragments */
^ permalink raw reply related
* [RFC PATCH v2 net-next 1/3] sfc: don't score irq moderation points for GRO
From: Edward Cree @ 2019-07-30 22:20 UTC (permalink / raw)
To: linux-net-drivers, David Miller; +Cc: netdev, Eric Dumazet
In-Reply-To: <9bcebf59-a0e7-f461-36ef-8564ecb33282@solarflare.com>
We already scored points when handling the RX event, no-one else does this,
and looking at the history it appears this was originally meant to only
score on merges, not on GRO_NORMAL. Moreover, it gets in the way of
changing GRO to not immediately pass GRO_NORMAL skbs to the stack.
Performance testing with four TCP streams received on a single CPU (where
throughput was line rate of 9.4Gbps in all tests) showed a 13.7% reduction
in RX CPU usage (n=6, p=0.03).
Signed-off-by: Edward Cree <ecree@solarflare.com>
---
drivers/net/ethernet/sfc/rx.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index d5db045535d3..85ec07f5a674 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -412,7 +412,6 @@ efx_rx_packet_gro(struct efx_channel *channel, struct efx_rx_buffer *rx_buf,
unsigned int n_frags, u8 *eh)
{
struct napi_struct *napi = &channel->napi_str;
- gro_result_t gro_result;
struct efx_nic *efx = channel->efx;
struct sk_buff *skb;
@@ -449,9 +448,7 @@ efx_rx_packet_gro(struct efx_channel *channel, struct efx_rx_buffer *rx_buf,
skb_record_rx_queue(skb, channel->rx_queue.core_index);
- gro_result = napi_gro_frags(napi);
- if (gro_result != GRO_DROP)
- channel->irq_mod_score += 2;
+ napi_gro_frags(napi);
}
/* Allocate and construct an SKB around page fragments */
^ permalink raw reply related
* [RFC PATCH v2 net-next 0/3] net: batched receive in GRO path
From: Edward Cree @ 2019-07-30 22:18 UTC (permalink / raw)
To: David Miller; +Cc: linux-net-drivers, netdev, Eric Dumazet
This series listifies part of GRO processing, in a manner which allows those
packets which are not GROed (i.e. for which dev_gro_receive returns
GRO_NORMAL) to be passed on to the listified regular receive path.
dev_gro_receive() itself is not listified, nor the per-protocol GRO
callback, since GRO's need to hold packets on lists under napi->gro_hash
makes keeping the packets on other lists awkward, and since the GRO control
block state of held skbs can refer only to one 'new' skb at a time.
Instead, when napi_frags_finish() handles a GRO_NORMAL result, stash the skb
onto a list in the napi struct, which is received at the end of the napi
poll or when its length exceeds the (new) sysctl net.core.gro_normal_batch.
Performance figures with this series, collected on a back-to-back pair of
Solarflare sfn8522-r2 NICs with 120-second NetPerf tests. In the stats,
sample size n for old and new code is 6 runs each; p is from a Welch t-test.
Tests were run both with GRO enabled and disabled, the latter simulating
uncoalesceable packets (e.g. due to IP or TCP options). The receive side
(which was the device under test) had the NetPerf process pinned to one CPU,
and the device interrupts pinned to a second CPU. CPU utilisation figures
(used in cases of line-rate performance) are summed across all CPUs.
net.core.gro_normal_batch was left at its default value of 8.
TCP 4 streams, GRO on: all results line rate (9.415Gbps)
net-next: 210.3% cpu
after #1: 181.5% cpu (-13.7%, p=0.031 vs net-next)
after #3: 196.7% cpu (- 8.4%, p=0.136 vs net-next)
TCP 4 streams, GRO off:
net-next: 8.017 Gbps
after #1: 7.785 Gbps (- 2.9%, p=0.385 vs net-next)
after #3: 7.604 Gbps (- 5.1%, p=0.282 vs net-next. But note *)
TCP 1 stream, GRO off:
net-next: 6.553 Gbps
after #1: 6.444 Gbps (- 1.7%, p=0.302 vs net-next)
after #3: 6.790 Gbps (+ 3.6%, p=0.169 vs net-next)
TCP 1 stream, GRO on, busy_read = 50: all results line rate
net-next: 156.0% cpu
after #1: 174.5% cpu (+11.9%, p=0.015 vs net-next)
after #3: 165.0% cpu (+ 5.8%, p=0.147 vs net-next)
TCP 1 stream, GRO off, busy_read = 50:
net-next: 6.488 Gbps
after #1: 6.625 Gbps (+ 2.1%, p=0.059 vs net-next)
after #3: 7.351 Gbps (+13.3%, p=0.026 vs net-next)
TCP_RR 100 streams, GRO off, 8000 byte payload
net-next: 995.083 us
after #1: 969.167 us (- 2.6%, p=0.204 vs net-next)
after #3: 976.433 us (- 1.9%, p=0.254 vs net-next)
TCP_RR 100 streams, GRO off, 8000 byte payload, busy_read = 50:
net-next: 2.851 ms
after #1: 2.871 ms (+ 0.7%, p=0.134 vs net-next)
after #3: 2.937 ms (+ 3.0%, p<0.001 vs net-next)
TCP_RR 100 streams, GRO off, 1 byte payload, busy_read = 50:
net-next: 867.317 us
after #1: 865.717 us (- 0.2%, p=0.334 vs net-next)
after #3: 868.517 us (+ 0.1%, p=0.414 vs net-next)
(*) These tests produced a mixture of line-rate and below-line-rate results,
meaning that statistically speaking the results were 'censored' by the
upper bound, and were thus not normally distributed, making a Welch t-test
mathematically invalid. I therefore also calculated estimators according
to [1], which gave the following:
net-next: 8.133 Gbps
after #1: 8.130 Gbps (- 0.0%, p=0.499 vs net-next)
after #3: 7.680 Gbps (- 5.6%, p=0.285 vs net-next)
(though my procedure for determining ν wasn't mathematically well-founded
either, so take that p-value with a grain of salt).
A further check came from dividing the bandwidth figure by the CPU usage for
each test run, giving:
net-next: 3.461
after #1: 3.198 (- 7.6%, p=0.145 vs net-next)
after #3: 3.641 (+ 5.2%, p=0.280 vs net-next)
The above results are fairly mixed, and in most cases not statistically
significant. But I think we can roughly conclude that the series
marginally improves non-GROable throughput, without hurting latency
(except in the large-payload busy-polling case, which in any case yields
horrid performance even on net-next (almost triple the latency without
busy-poll). Also, drivers which, unlike sfc, pass UDP traffic to GRO
would expect to see a benefit from gaining access to batching.
Changed in v2:
* During busy poll, call gro_normal_list() to receive batched packets
after each cycle of the napi busy loop. See comments in Patch #3 for
complications of doing the same in busy_poll_stop().
[1]: Cohen 1959, doi: 10.1080/00401706.1959.10489859
Edward Cree (3):
sfc: don't score irq moderation points for GRO
sfc: falcon: don't score irq moderation points for GRO
net: use listified RX for handling GRO_NORMAL skbs
drivers/net/ethernet/sfc/falcon/rx.c | 5 +---
drivers/net/ethernet/sfc/rx.c | 5 +---
include/linux/netdevice.h | 3 ++
net/core/dev.c | 44 ++++++++++++++++++++++++++--
net/core/sysctl_net_core.c | 8 +++++
5 files changed, 54 insertions(+), 11 deletions(-)
^ permalink raw reply
* Re: [PATCH V4 0/3] net: dsa: ksz: Add Microchip KSZ87xx support
From: David Miller @ 2019-07-30 22:14 UTC (permalink / raw)
To: marex; +Cc: netdev, andrew, f.fainelli, Tristram.Ha, vivien.didelot,
woojung.huh
In-Reply-To: <20190729174947.10103-1-marex@denx.de>
From: Marek Vasut <marex@denx.de>
Date: Mon, 29 Jul 2019 19:49:44 +0200
> This series adds support for Microchip KSZ87xx switches, which are
> slightly simpler compared to KSZ9xxx .
>
> Signed-off-by: Marek Vasut <marex@denx.de>
Series applied to net-next, thank you.
^ permalink raw reply
* Re: [PATCH v3] net: dsa: qca8k: enable port flow control
From: David Miller @ 2019-07-30 22:08 UTC (permalink / raw)
To: xiaofeis
Cc: vkoul, netdev, andrew, linux-arm-msm, bjorn.andersson,
vivien.didelot, f.fainelli, niklas.cassel, xiazha
In-Reply-To: <1564275470-52666-1-git-send-email-xiaofeis@codeaurora.org>
From: xiaofeis <xiaofeis@codeaurora.org>
Date: Sun, 28 Jul 2019 08:57:50 +0800
> Set phy device advertising to enable MAC flow control.
>
> Signed-off-by: Xiaofei Shen <xiaofeis@codeaurora.org>
I've read the discussion over a few times and if internal PHY is the
only thing supported, and the specific setup this driver supports uses
internal signalling to sync the PHY and MAC settings I guess this is
OK although suboptimal.
So applied, thanks.
Thanks.
^ permalink raw reply
* Re: [PATCH bpf-next v2] tools: bpftool: add support for reporting the effective cgroup progs
From: Takshak Chahande @ 2019-07-30 22:01 UTC (permalink / raw)
To: Jakub Kicinski
Cc: alexei.starovoitov@gmail.com, daniel@iogearbox.net,
netdev@vger.kernel.org, bpf@vger.kernel.org,
oss-drivers@netronome.com, Kernel Team, Quentin Monnet
In-Reply-To: <20190730210300.13113-1-jakub.kicinski@netronome.com>
Jakub Kicinski <jakub.kicinski@netronome.com> wrote on Tue [2019-Jul-30 14:03:00 -0700]:
> Takshak said in the original submission:
>
> With different bpf attach_flags available to attach bpf programs specially
> with BPF_F_ALLOW_OVERRIDE and BPF_F_ALLOW_MULTI, the list of effective
> bpf-programs available to any sub-cgroups really needs to be available for
> easy debugging.
>
> Using BPF_F_QUERY_EFFECTIVE flag, one can get the list of not only attached
> bpf-programs to a cgroup but also the inherited ones from parent cgroup.
>
> So a new option is introduced to use BPF_F_QUERY_EFFECTIVE query flag here
> to list all the effective bpf-programs available for execution at a specified
> cgroup.
>
> Reused modified test program test_cgroup_attach from tools/testing/selftests/bpf:
> # ./test_cgroup_attach
>
> With old bpftool:
>
> # bpftool cgroup show /sys/fs/cgroup/cgroup-test-work-dir/cg1/
> ID AttachType AttachFlags Name
> 271 egress multi pkt_cntr_1
> 272 egress multi pkt_cntr_2
>
> Attached new program pkt_cntr_4 in cg2 gives following:
>
> # bpftool cgroup show /sys/fs/cgroup/cgroup-test-work-dir/cg1/cg2
> ID AttachType AttachFlags Name
> 273 egress override pkt_cntr_4
>
> And with new "effective" option it shows all effective programs for cg2:
>
> # bpftool cgroup show /sys/fs/cgroup/cgroup-test-work-dir/cg1/cg2 effective
> ID AttachType AttachFlags Name
> 273 egress override pkt_cntr_4
> 271 egress override pkt_cntr_1
> 272 egress override pkt_cntr_2
>
> Compared to original submission use a local flag instead of global
> option.
>
> We need to clear query_flags on every command, in case batch mode
> wants to use varying settings.
>
> v2: (Takshak)
> - forbid duplicated flags;
> - fix cgroup path freeing.
>
> Signed-off-by: Takshak Chahande <ctakshak@fb.com>
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
> ---
> .../bpftool/Documentation/bpftool-cgroup.rst | 16 +++-
> tools/bpf/bpftool/bash-completion/bpftool | 15 ++--
> tools/bpf/bpftool/cgroup.c | 83 ++++++++++++-------
> 3 files changed, 76 insertions(+), 38 deletions(-)
>
> diff --git a/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst b/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
> index 585f270c2d25..06a28b07787d 100644
> --- a/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
> +++ b/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
> @@ -20,8 +20,8 @@ SYNOPSIS
> CGROUP COMMANDS
> ===============
>
> -| **bpftool** **cgroup { show | list }** *CGROUP*
> -| **bpftool** **cgroup tree** [*CGROUP_ROOT*]
> +| **bpftool** **cgroup { show | list }** *CGROUP* [**effective**]
> +| **bpftool** **cgroup tree** [*CGROUP_ROOT*] [**effective**]
> | **bpftool** **cgroup attach** *CGROUP* *ATTACH_TYPE* *PROG* [*ATTACH_FLAGS*]
> | **bpftool** **cgroup detach** *CGROUP* *ATTACH_TYPE* *PROG*
> | **bpftool** **cgroup help**
> @@ -35,13 +35,17 @@ CGROUP COMMANDS
>
> DESCRIPTION
> ===========
> - **bpftool cgroup { show | list }** *CGROUP*
> + **bpftool cgroup { show | list }** *CGROUP* [**effective**]
> List all programs attached to the cgroup *CGROUP*.
>
> Output will start with program ID followed by attach type,
> attach flags and program name.
>
> - **bpftool cgroup tree** [*CGROUP_ROOT*]
> + If **effective** is specified retrieve effective programs that
> + will execute for events within a cgroup. This includes
> + inherited along with attached ones.
> +
> + **bpftool cgroup tree** [*CGROUP_ROOT*] [**effective**]
> Iterate over all cgroups in *CGROUP_ROOT* and list all
> attached programs. If *CGROUP_ROOT* is not specified,
> bpftool uses cgroup v2 mountpoint.
> @@ -50,6 +54,10 @@ DESCRIPTION
> commands: it starts with absolute cgroup path, followed by
> program ID, attach type, attach flags and program name.
>
> + If **effective** is specified retrieve effective programs that
> + will execute for events within a cgroup. This includes
> + inherited along with attached ones.
> +
> **bpftool cgroup attach** *CGROUP* *ATTACH_TYPE* *PROG* [*ATTACH_FLAGS*]
> Attach program *PROG* to the cgroup *CGROUP* with attach type
> *ATTACH_TYPE* and optional *ATTACH_FLAGS*.
> diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
> index 6b961a5ed100..df16c5415444 100644
> --- a/tools/bpf/bpftool/bash-completion/bpftool
> +++ b/tools/bpf/bpftool/bash-completion/bpftool
> @@ -710,12 +710,15 @@ _bpftool()
> ;;
> cgroup)
> case $command in
> - show|list)
> - _filedir
> - return 0
> - ;;
> - tree)
> - _filedir
> + show|list|tree)
> + case $cword in
> + 3)
> + _filedir
> + ;;
> + 4)
> + COMPREPLY=( $( compgen -W 'effective' -- "$cur" ) )
> + ;;
> + esac
> return 0
> ;;
> attach|detach)
> diff --git a/tools/bpf/bpftool/cgroup.c b/tools/bpf/bpftool/cgroup.c
> index f3c05b08c68c..44352b5aca85 100644
> --- a/tools/bpf/bpftool/cgroup.c
> +++ b/tools/bpf/bpftool/cgroup.c
> @@ -29,6 +29,8 @@
> " recvmsg4 | recvmsg6 | sysctl |\n" \
> " getsockopt | setsockopt }"
>
> +static unsigned int query_flags;
> +
> static const char * const attach_type_strings[] = {
> [BPF_CGROUP_INET_INGRESS] = "ingress",
> [BPF_CGROUP_INET_EGRESS] = "egress",
> @@ -107,7 +109,8 @@ static int count_attached_bpf_progs(int cgroup_fd, enum bpf_attach_type type)
> __u32 prog_cnt = 0;
> int ret;
>
> - ret = bpf_prog_query(cgroup_fd, type, 0, NULL, NULL, &prog_cnt);
> + ret = bpf_prog_query(cgroup_fd, type, query_flags, NULL,
> + NULL, &prog_cnt);
> if (ret)
> return -1;
>
> @@ -125,8 +128,8 @@ static int show_attached_bpf_progs(int cgroup_fd, enum bpf_attach_type type,
> int ret;
>
> prog_cnt = ARRAY_SIZE(prog_ids);
> - ret = bpf_prog_query(cgroup_fd, type, 0, &attach_flags, prog_ids,
> - &prog_cnt);
> + ret = bpf_prog_query(cgroup_fd, type, query_flags, &attach_flags,
> + prog_ids, &prog_cnt);
> if (ret)
> return ret;
>
> @@ -158,20 +161,34 @@ static int show_attached_bpf_progs(int cgroup_fd, enum bpf_attach_type type,
> static int do_show(int argc, char **argv)
> {
> enum bpf_attach_type type;
> + const char *path;
> int cgroup_fd;
> int ret = -1;
>
> - if (argc < 1) {
> - p_err("too few parameters for cgroup show");
> - goto exit;
> - } else if (argc > 1) {
> - p_err("too many parameters for cgroup show");
> - goto exit;
> + query_flags = 0;
> +
> + if (!REQ_ARGS(1))
> + return -1;
> + path = GET_ARG();
> +
> + while (argc) {
> + if (is_prefix(*argv, "effective")) {
> + if (query_flags & BPF_F_QUERY_EFFECTIVE) {
> + p_err("duplicated argument: %s", *argv);
> + return -1;
> + }
> + query_flags |= BPF_F_QUERY_EFFECTIVE;
> + NEXT_ARG();
> + } else {
> + p_err("expected no more arguments, 'effective', got: '%s'?",
> + *argv);
> + return -1;
> + }
> }
>
> - cgroup_fd = open(argv[0], O_RDONLY);
> + cgroup_fd = open(path, O_RDONLY);
> if (cgroup_fd < 0) {
> - p_err("can't open cgroup %s", argv[0]);
> + p_err("can't open cgroup %s", path);
> goto exit;
> }
>
> @@ -294,26 +311,37 @@ static char *find_cgroup_root(void)
>
> static int do_show_tree(int argc, char **argv)
> {
> - char *cgroup_root;
> + char *cgroup_root, *cgroup_alloced = NULL;
> int ret;
>
> - switch (argc) {
> - case 0:
> - cgroup_root = find_cgroup_root();
> - if (!cgroup_root) {
> + query_flags = 0;
> +
> + if (!argc) {
> + cgroup_alloced = find_cgroup_root();
> + if (!cgroup_alloced) {
> p_err("cgroup v2 isn't mounted");
> return -1;
> }
> - break;
> - case 1:
> - cgroup_root = argv[0];
> - break;
> - default:
> - p_err("too many parameters for cgroup tree");
> - return -1;
> + cgroup_root = cgroup_alloced;
> + } else {
> + cgroup_root = GET_ARG();
> +
> + while (argc) {
> + if (is_prefix(*argv, "effective")) {
> + if (query_flags & BPF_F_QUERY_EFFECTIVE) {
> + p_err("duplicated argument: %s", *argv);
> + return -1;
> + }
> + query_flags |= BPF_F_QUERY_EFFECTIVE;
> + NEXT_ARG();
> + } else {
> + p_err("expected no more arguments, 'effective', got: '%s'?",
> + *argv);
> + return -1;
> + }
> + }
> }
>
> -
> if (json_output)
> jsonw_start_array(json_wtr);
> else
> @@ -338,8 +366,7 @@ static int do_show_tree(int argc, char **argv)
> if (json_output)
> jsonw_end_array(json_wtr);
>
> - if (argc == 0)
> - free(cgroup_root);
> + free(cgroup_alloced);
>
> return ret;
> }
> @@ -459,8 +486,8 @@ static int do_help(int argc, char **argv)
> }
>
> fprintf(stderr,
> - "Usage: %s %s { show | list } CGROUP\n"
> - " %s %s tree [CGROUP_ROOT]\n"
> + "Usage: %s %s { show | list } CGROUP [**effective**]\n"
> + " %s %s tree [CGROUP_ROOT] [**effective**]\n"
> " %s %s attach CGROUP ATTACH_TYPE PROG [ATTACH_FLAGS]\n"
> " %s %s detach CGROUP ATTACH_TYPE PROG\n"
> " %s %s help\n"
> --
> 2.21.0
>
Thanks for v2; looks good.
Reviewed-by: Takshak Chahande <ctakshak@fb.com>
^ permalink raw reply
* Re: [PATCH v4 net-next 13/19] ionic: Add initial ethtool support
From: Shannon Nelson @ 2019-07-30 21:51 UTC (permalink / raw)
To: Andrew Lunn; +Cc: netdev, davem
In-Reply-To: <20190725133506.GD21952@lunn.ch>
On 7/25/19 6:35 AM, Andrew Lunn wrote:
>> +static int ionic_get_module_eeprom(struct net_device *netdev,
>> + struct ethtool_eeprom *ee,
>> + u8 *data)
>> +{
>> + struct lif *lif = netdev_priv(netdev);
>> + struct ionic_dev *idev = &lif->ionic->idev;
>> + struct xcvr_status *xcvr;
>> + u32 len;
>> +
>> + /* The NIC keeps the module prom up-to-date in the DMA space
>> + * so we can simply copy the module bytes into the data buffer.
>> + */
>> + xcvr = &idev->port_info->status.xcvr;
>> + len = min_t(u32, sizeof(xcvr->sprom), ee->len);
>> + memcpy(data, xcvr->sprom, len);
>> +
>> + return 0;
>> +}
> Is the firmware doing this DMA update atomically? The diagnostic
> values are u16s. Is there any chance we do this memcpy at the same
> time the DMA is active and we get a mix of old and new data?
>
> Often in cases like this you do the copy twice and ensure you get the
> same values each time. If not, keep repeating the copy until you do
> get the same values twice.
Regardless of how the structs are all aligned and our PCI block does
large writes, I can see how an unoptimized memcpy() that is doing
byte-by-byte copy rather than by words might result in a mangled value.
I think this is the only buffer that may be susceptible to this. Sure,
doing a double copy should work here.
sln
^ permalink raw reply
* Re: [PATCH v5 12/29] compat_ioctl: move drivers to compat_ptr_ioctl
From: David Miller @ 2019-07-30 21:43 UTC (permalink / raw)
To: arnd
Cc: viro, linux-fsdevel, linux-kernel, gregkh, mst, jarkko.sakkinen,
jgg, jkosina, stefanha, linux-integrity, linux1394-devel,
linux-usb, linux-input, linux-stm32, linux-arm-kernel, linux-mtd,
netdev, devel, kvm, virtualization, ceph-devel
In-Reply-To: <20190730195227.742215-1-arnd@arndb.de>
From: Arnd Bergmann <arnd@arndb.de>
Date: Tue, 30 Jul 2019 21:50:28 +0200
> Each of these drivers has a copy of the same trivial helper function to
> convert the pointer argument and then call the native ioctl handler.
>
> We now have a generic implementation of that, so use it.
>
> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
> Reviewed-by: Jiri Kosina <jkosina@suse.cz>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
I assume this has to go via your series, thus:
Acked-by: David S. Miller <davem@davemloft.net>
^ permalink raw reply
* Re: [PATCH v5 09/29] compat_ioctl: pppoe: fix PPPOEIOCSFWD handling
From: David Miller @ 2019-07-30 21:42 UTC (permalink / raw)
To: arnd
Cc: viro, linux-fsdevel, linux-kernel, g.nault, mostrows, xeb,
jchapman, netdev
In-Reply-To: <20190730192552.4014288-10-arnd@arndb.de>
From: Arnd Bergmann <arnd@arndb.de>
Date: Tue, 30 Jul 2019 21:25:20 +0200
> Support for handling the PPPOEIOCSFWD ioctl in compat mode was added in
> linux-2.5.69 along with hundreds of other commands, but was always broken
> sincen only the structure is compatible, but the command number is not,
> due to the size being sizeof(size_t), or at first sizeof(sizeof((struct
> sockaddr_pppox)), which is different on 64-bit architectures.
>
> Guillaume Nault adds:
>
> And the implementation was broken until 2016 (see 29e73269aa4d ("pppoe:
> fix reference counting in PPPoE proxy")), and nobody ever noticed. I
> should probably have removed this ioctl entirely instead of fixing it.
> Clearly, it has never been used.
>
> Fix it by adding a compat_ioctl handler for all pppoe variants that
> translates the command number and then calls the regular ioctl function.
>
> All other ioctl commands handled by pppoe are compatible between 32-bit
> and 64-bit, and require compat_ptr() conversion.
>
> This should apply to all stable kernels.
>
> Acked-by: Guillaume Nault <g.nault@alphalink.fr>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Applied and queued up for -stable, thanks everyone.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox