Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH bpf-next] bpf: use bpf_prog_run_array_cg_flags everywhere
From: Martin KaFai Lau @ 2022-04-25 20:37 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Andrii Nakryiko, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko
In-Reply-To: <CAKH8qBsTiQA5knxoBSqxCYav89QdSN0j6t1EWX1MEVbAqLj6kg@mail.gmail.com>

On Wed, Apr 20, 2022 at 03:30:43PM -0700, Stanislav Fomichev wrote:
> On Wed, Apr 20, 2022 at 3:04 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Tue, Apr 19, 2022 at 3:23 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > Rename bpf_prog_run_array_cg_flags to bpf_prog_run_array_cg and
> > > use it everywhere. check_return_code already enforces sane
> > > return ranges for all cgroup types. (only egress and bind hooks have
> > > uncanonical return ranges, the rest is using [0, 1])
> > >
> > > No functional changes.
> > >
> > > Suggested-by: Alexei Starovoitov <ast@kernel.org>
> > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > ---
> > >  include/linux/bpf-cgroup.h |  8 ++---
> > >  kernel/bpf/cgroup.c        | 70 ++++++++++++--------------------------
> > >  2 files changed, 24 insertions(+), 54 deletions(-)
> > >
> > > diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> > > index 88a51b242adc..669d96d074ad 100644
> > > --- a/include/linux/bpf-cgroup.h
> > > +++ b/include/linux/bpf-cgroup.h
> > > @@ -225,24 +225,20 @@ static inline bool cgroup_bpf_sock_enabled(struct sock *sk,
> > >
> > >  #define BPF_CGROUP_RUN_SA_PROG(sk, uaddr, atype)                                      \
> > >  ({                                                                            \
> > > -       u32 __unused_flags;                                                    \
> > >         int __ret = 0;                                                         \
> > >         if (cgroup_bpf_enabled(atype))                                         \
> > >                 __ret = __cgroup_bpf_run_filter_sock_addr(sk, uaddr, atype,     \
> > > -                                                         NULL,                \
> > > -                                                         &__unused_flags);    \
> > > +                                                         NULL, NULL);         \
> > >         __ret;                                                                 \
> > >  })
> > >
> > >  #define BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, atype, t_ctx)                  \
> > >  ({                                                                            \
> > > -       u32 __unused_flags;                                                    \
> > >         int __ret = 0;                                                         \
> > >         if (cgroup_bpf_enabled(atype))  {                                      \
> > >                 lock_sock(sk);                                                 \
> > >                 __ret = __cgroup_bpf_run_filter_sock_addr(sk, uaddr, atype,     \
> > > -                                                         t_ctx,               \
> > > -                                                         &__unused_flags);    \
> > > +                                                         t_ctx, NULL);        \
> > >                 release_sock(sk);                                              \
> > >         }                                                                      \
> > >         __ret;                                                                 \
> > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> > > index 0cb6211fcb58..f61eca32c747 100644
> > > --- a/kernel/bpf/cgroup.c
> > > +++ b/kernel/bpf/cgroup.c
> > > @@ -25,50 +25,18 @@ EXPORT_SYMBOL(cgroup_bpf_enabled_key);
> > >  /* __always_inline is necessary to prevent indirect call through run_prog
> > >   * function pointer.
> > >   */
> > > -static __always_inline int
> > > -bpf_prog_run_array_cg_flags(const struct cgroup_bpf *cgrp,
> > > -                           enum cgroup_bpf_attach_type atype,
> > > -                           const void *ctx, bpf_prog_run_fn run_prog,
> > > -                           int retval, u32 *ret_flags)
> > > -{
> > > -       const struct bpf_prog_array_item *item;
> > > -       const struct bpf_prog *prog;
> > > -       const struct bpf_prog_array *array;
> > > -       struct bpf_run_ctx *old_run_ctx;
> > > -       struct bpf_cg_run_ctx run_ctx;
> > > -       u32 func_ret;
> > > -
> > > -       run_ctx.retval = retval;
> > > -       migrate_disable();
> > > -       rcu_read_lock();
> > > -       array = rcu_dereference(cgrp->effective[atype]);
> > > -       item = &array->items[0];
> > > -       old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
> > > -       while ((prog = READ_ONCE(item->prog))) {
> > > -               run_ctx.prog_item = item;
> > > -               func_ret = run_prog(prog, ctx);
> > > -               if (!(func_ret & 1) && !IS_ERR_VALUE((long)run_ctx.retval))
> > > -                       run_ctx.retval = -EPERM;
> > > -               *(ret_flags) |= (func_ret >> 1);
> > > -               item++;
> > > -       }
> > > -       bpf_reset_run_ctx(old_run_ctx);
> > > -       rcu_read_unlock();
> > > -       migrate_enable();
> > > -       return run_ctx.retval;
> > > -}
> > > -
> > >  static __always_inline int
> > >  bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
> > >                       enum cgroup_bpf_attach_type atype,
> > >                       const void *ctx, bpf_prog_run_fn run_prog,
> > > -                     int retval)
> > > +                     int retval, u32 *ret_flags)
> > >  {
> > >         const struct bpf_prog_array_item *item;
> > >         const struct bpf_prog *prog;
> > >         const struct bpf_prog_array *array;
> > >         struct bpf_run_ctx *old_run_ctx;
> > >         struct bpf_cg_run_ctx run_ctx;
> > > +       u32 func_ret;
> > >
> > >         run_ctx.retval = retval;
> > >         migrate_disable();
> > > @@ -78,8 +46,11 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
> > >         old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
> > >         while ((prog = READ_ONCE(item->prog))) {
> > >                 run_ctx.prog_item = item;
> > > -               if (!run_prog(prog, ctx) && !IS_ERR_VALUE((long)run_ctx.retval))
> > > +               func_ret = run_prog(prog, ctx);
> > > +               if (!(func_ret & 1) && !IS_ERR_VALUE((long)run_ctx.retval))
> >
> > to be completely true to previous behavior, shouldn't there be
> >
> > if (ret_flags)
> >     func_ret &= 1;
> > if (!func_ret && !IS_ERR_VALUE(...))
> >
> > here?
> >
> > This might have been discussed previously and I missed it. If that's
> > so, please ignore.
> 
> We are converting the cases where run_prog(prog, ctx) returns 0 or 1,
> so it seems like we don't have to reproduce the existing behavior
> 1-to-1?
> So I'm not sure it matters, or am I missing something?
A nit, how about testing 'if (ret_flags)' first such that
it is obvious which case will use higher bits in the return value.
The compiler may be able to optimize the ret_flags == NULL case also ?

Something like:

	func_ret = run_prog(prog, ctx);
	/* The cg bpf prog uses the higher bits of the return value */
	if (ret_flags) {
		*(ret_flags) |= (func_ret >> 1);
		func_ret &= 1;
	}
	if (!func_ret && !IS_ERR_VALUE((long)run_ctx.retval))
		run_ctx.retval = -EPERM;

^ permalink raw reply

* Re: [PATCH net-next 5/5] net: lan966x: Add support for PTP_PF_EXTTS
From: Horatiu Vultur @ 2022-04-25 20:31 UTC (permalink / raw)
  To: Richard Cochran
  Cc: netdev, devicetree, linux-kernel, davem, kuba, pabeni, robh+dt,
	krzysztof.kozlowski+dt, UNGLinuxDriver
In-Reply-To: <20220424150909.GA29569@hoboy.vegasvil.org>

The 04/24/2022 08:09, Richard Cochran wrote:

Hi Richard,

> 
> On Sun, Apr 24, 2022 at 04:58:24PM +0200, Horatiu Vultur wrote:
> 
> > @@ -321,6 +321,63 @@ irqreturn_t lan966x_ptp_irq_handler(int irq, void *args)
> >       return IRQ_HANDLED;
> >  }
> >
> > +irqreturn_t lan966x_ptp_ext_irq_handler(int irq, void *args)
> > +{
> > +     struct lan966x *lan966x = args;
> > +     struct lan966x_phc *phc;
> > +     unsigned long flags;
> > +     u64 time = 0;
> > +     time64_t s;
> > +     int pin, i;
> > +     s64 ns;
> > +
> > +     if (!(lan_rd(lan966x, PTP_PIN_INTR)))
> > +             return IRQ_NONE;
> > +
> > +     /* Go through all domains and see which pin generated the interrupt */
> > +     for (i = 0; i < LAN966X_PHC_COUNT; ++i) {
> > +             struct ptp_clock_event ptp_event = {0};
> > +
> > +             phc = &lan966x->phc[i];
> > +             pin = ptp_find_pin(phc->clock, PTP_PF_EXTTS, 0);
> 
> Not safe to call ptp_find_pin() from ISR.  See comment in include/linux/ptp_clock_kernel.h

Good catch.
From what I can see, I should be able to use ptp_find_pin_unlocked.

> 
> Thanks,
> Richard

-- 
/Horatiu

^ permalink raw reply

* Re: [PATCH net] net: dsa: lantiq_gswip: Don't set GSWIP_MII_CFG_RMII_CLK
From: Hauke Mehrtens @ 2022-04-25 20:05 UTC (permalink / raw)
  To: Martin Blumenstingl, netdev
  Cc: linux-kernel, andrew, vivien.didelot, olteanv, davem, kuba,
	pabeni, stable, Jan Hoffmann
In-Reply-To: <20220425152027.2220750-1-martin.blumenstingl@googlemail.com>

On 4/25/22 17:20, Martin Blumenstingl wrote:
> Commit 4b5923249b8fa4 ("net: dsa: lantiq_gswip: Configure all remaining
> GSWIP_MII_CFG bits") added all known bits in the GSWIP_MII_CFGp
> register. It helped bring this register into a well-defined state so the
> driver has to rely less on the bootloader to do things right.
> Unfortunately it also sets the GSWIP_MII_CFG_RMII_CLK bit without any
> possibility to configure it. Upon further testing it turns out that all
> boards which are supported by the GSWIP driver in OpenWrt which use an
> RMII PHY have a dedicated oscillator on the board which provides the
> 50MHz RMII reference clock.
> 
> Don't set the GSWIP_MII_CFG_RMII_CLK bit (but keep the code which always
> clears it) to fix support for the Fritz!Box 7362 SL in OpenWrt. This is
> a board with two Atheros AR8030 RMII PHYs. With the "RMII clock" bit set
> the MAC also generates the RMII reference clock whose signal then
> conflicts with the signal from the oscillator on the board. This results
> in a constant cycle of the PHY detecting link up/down (and as a result
> of that: the two ports using the AR8030 PHYs are not working).
> 
> At the time of writing this patch there's no known board where the MAC
> (GSWIP) has to generate the RMII reference clock. If needed this can be
> implemented in future by providing a device-tree flag so the
> GSWIP_MII_CFG_RMII_CLK bit can be toggled per port.
> 
> Fixes: 4b5923249b8fa4 ("net: dsa: lantiq_gswip: Configure all remaining GSWIP_MII_CFG bits")
> Cc: stable@vger.kernel.org
> Tested-by: Jan Hoffmann <jan@3e8.eu>
> Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>

Acked-by: Hauke Mehrtens <hauke@hauke-m.de>

Looks like Linux does not have a standard device tree flag to indicate 
that MAC should provide the RMII clock. Deactivating it is probably a 
good solution.

> ---
>   drivers/net/dsa/lantiq_gswip.c | 3 ---
>   1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/net/dsa/lantiq_gswip.c b/drivers/net/dsa/lantiq_gswip.c
> index a416240d001b..12c15da55664 100644
> --- a/drivers/net/dsa/lantiq_gswip.c
> +++ b/drivers/net/dsa/lantiq_gswip.c
> @@ -1681,9 +1681,6 @@ static void gswip_phylink_mac_config(struct dsa_switch *ds, int port,
>   		break;
>   	case PHY_INTERFACE_MODE_RMII:
>   		miicfg |= GSWIP_MII_CFG_MODE_RMIIM;
> -
> -		/* Configure the RMII clock as output: */
> -		miicfg |= GSWIP_MII_CFG_RMII_CLK;
>   		break;
>   	case PHY_INTERFACE_MODE_RGMII:
>   	case PHY_INTERFACE_MODE_RGMII_ID:


^ permalink raw reply

* Re: [PATCH net-next RESEND] net: bcmgenet: hide status block before TX timestamping
From: Florian Fainelli @ 2022-04-25 20:03 UTC (permalink / raw)
  To: Jonathan Lemon, opendmb, f.fainelli, bcm-kernel-feedback-list
  Cc: netdev, kernel-team, kuba
In-Reply-To: <20220424165307.591145-1-jonathan.lemon@gmail.com>

On 4/24/22 09:53, Jonathan Lemon wrote:
> The hardware checksum offloading requires use of a transmit
> status block inserted before the outgoing frame data, this was
> updated in '9a9ba2a4aaaa ("net: bcmgenet: always enable status blocks")'
> 
> However, skb_tx_timestamp() assumes that it is passed a raw frame
> and PTP parsing chokes on this status block.
> 
> Fix this by calling __skb_pull(), which hides the TSB before calling
> skb_tx_timestamp(), so an outgoing PTP packet is parsed correctly.
> 
> As the data in the skb has already been set up for DMA, and the
> dma_unmap_* calls use a separately stored address, there is no
> no effective change in the data transmission.
> 
> Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>

Without restructuring the way we do time-stamping within the driver to 
be before bcmgenet_insert_tsb() this appears to get you what you want 
without being too intrusive. I would slap a:

Fixes: d03825fba459 ("net: bcmgenet: add skb_tx_timestamp call")

since this is a bug fix that should be fixed back when this was added. 
It was only made recently problematic by default with the enabling of 
the transmit checksum offload which implicit forces the use of a 64 byte 
transmit status block.

With that said:

Acked-by: Florian Fainelli <f.fainelli@gmail.com>

Now, if we wanted to avoid playing tricks with the sk_buff, we could 
place skb_tx_timestamp() earlier in the transmit path with the risk that 
we are no longer as close as possible from hitting the DMA, and subject 
to DMA mapping or allocation failures. Food for thought.

Thanks!
-- 
Florian

^ permalink raw reply

* Re: [net-next v4 3/3] selftests/sysctl: add sysctl macro test
From: Jakub Kicinski @ 2022-04-25 19:58 UTC (permalink / raw)
  To: xiangxia.m.yue
  Cc: netdev, linux-fsdevel, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	David S. Miller, Paolo Abeni, Hideaki YOSHIFUJI, David Ahern,
	Simon Horman, Julian Anastasov, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, Shuah Khan, Andrew Morton,
	Alexei Starovoitov, Eric Dumazet, Lorenz Bauer, Akhmat Karakotov
In-Reply-To: <20220422070141.39397-4-xiangxia.m.yue@gmail.com>

On Fri, 22 Apr 2022 15:01:41 +0800 xiangxia.m.yue@gmail.com wrote:
>  static int __init test_sysctl_init(void)
>  {
> +	test_data.match_int[0] = *(int *)SYSCTL_ZERO,
> +	test_data.match_int[1] = *(int *)SYSCTL_ONE,
> +	test_data.match_int[2] = *(int *)SYSCTL_TWO,
> +	test_data.match_int[3] = *(int *)SYSCTL_THREE,
> +	test_data.match_int[4] = *(int *)SYSCTL_FOUR,
> +	test_data.match_int[5] = *(int *)SYSCTL_ONE_HUNDRED,
> +	test_data.match_int[6] = *(int *)SYSCTL_TWO_HUNDRED,
> +	test_data.match_int[7] = *(int *)SYSCTL_ONE_THOUSAND,
> +	test_data.match_int[8] = *(int *)SYSCTL_THREE_THOUSAND,
> +	test_data.match_int[9] = *(int *)SYSCTL_INT_MAX,
> +	test_data.match_int[10] = *(int *)SYSCTL_MAXOLDUID,
> +	test_data.match_int[11] = *(int *)SYSCTL_NEG_ONE,

> +	local VALUES=(0 1 2 3 4 100 200 1000 3000 $INT_MAX 65535 -1)

How does this test work? Am I reading it right that it checks if this
bash array is in sync with the kernel code?

I'd be better if we were checking the values of the constants against
literals / defines.

^ permalink raw reply

* Re: [net-next v4 0/3] use standard sysctl macro
From: Jakub Kicinski @ 2022-04-25 19:56 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: xiangxia.m.yue, netdev, linux-fsdevel, Kees Cook, Iurii Zaikin,
	David S. Miller, Paolo Abeni, Hideaki YOSHIFUJI, David Ahern,
	Simon Horman, Julian Anastasov, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, Shuah Khan, Andrew Morton,
	Alexei Starovoitov, Eric Dumazet, Lorenz Bauer, Akhmat Karakotov
In-Reply-To: <Ymb6ukQNDh6VBT59@bombadil.infradead.org>

On Mon, 25 Apr 2022 12:47:06 -0700 Luis Chamberlain wrote:
> I have a better option. I checked to see the diff stat between
> the proposed patch to see what the chances of a conflict are
> and so far I don't see any conflict so I think this patchset
> should just go through your tree.
> 
> So feel free to take it in! Let me know if that's OK!

Ok, assuming the netfilter and bpf patches I saw were the only other
conversions we can resolve the conflicts before code reaches Linus...

^ permalink raw reply

* Re: [PATCH net-next 00/11] mlxsw: extend line card model by devices and info
From: Jakub Kicinski @ 2022-04-25 19:52 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Ido Schimmel, netdev, davem, pabeni, jiri, petrm, dsahern, andrew,
	mlxsw
In-Reply-To: <Ymb5DQonnrnIBG3c@shredder>

On Mon, 25 Apr 2022 22:39:57 +0300 Ido Schimmel wrote:
> > :/ what is a line card device? You must provide document what you're
> > doing, this:
> > 
> >  .../networking/devlink/devlink-linecard.rst   |   4 +
> > 
> > is not enough.
> > 
> > How many operations and attributes are you going to copy&paste?
> > 
> > Is linking devlink instances into a hierarchy a better approach?  
> 
> In this particular case, these devices are gearboxes. They are running
> their own firmware and we want user space to be able to query and update
> the running firmware version.

Nothing too special, then, we don't create "devices" for every
component of the system which can have a separate FW. That's where
"components" are intended to be used..

> The idea (implemented in the next patchset) is to let these devices
> expose their own "component name", which can then be plugged into the
> existing flash command:
> 
>     $ devlink lc show pci/0000:01:00.0 lc 8
>     pci/0000:01:00.0:
>       lc 8 state active type 16x100G
>         supported_types:
>            16x100G
>         devices:
>           device 0 flashable true component lc8_dev0
>           device 1 flashable false
>           device 2 flashable false
>           device 3 flashable false
>     $ devlink dev flash pci/0000:01:00.0 file some_file.mfa2 component lc8_dev0

IDK if it's just me or this assumes deep knowledge of the system.
I don't understand why we need to list devices 1-3 at all. And they
don't even have names. No information is exposed. 

There are many components on any networking device, including plenty
40G-R4 -> 25G-R1 gearboxes out there.

> Registering a separate devlink instance for these devices sounds like an
> overkill to me. If you are not OK with a separate command (e.g.,
> DEVLINK_CMD_LINECARD_INFO_GET), then extending DEVLINK_CMD_INFO_GET is
> also an option. We discussed this during internal review, but felt that
> the current approach is cleaner.

I don't know what you have queued, so if you don't need a full devlink
instance (IOW line cards won't need more individual config) that's fine.
For just FW flashing you can list many devices and update the
components... no need to introduce new objects or uAPI.

> > Would you mind if I revert this?  
> 
> I can't stop you, but keep in mind that it's already late here and that
> tomorrow I'm AFK (reserve duty) and won't be able to tag it. Jiri should
> be available to continue this discussion tomorrow morning, so probably
> best to wait for his feedback.

Sure, no rush.

^ permalink raw reply

* Re: [net-next v4 0/3] use standard sysctl macro
From: Luis Chamberlain @ 2022-04-25 19:47 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: xiangxia.m.yue, netdev, linux-fsdevel, Kees Cook, Iurii Zaikin,
	David S. Miller, Paolo Abeni, Hideaki YOSHIFUJI, David Ahern,
	Simon Horman, Julian Anastasov, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, Shuah Khan, Andrew Morton,
	Alexei Starovoitov, Eric Dumazet, Lorenz Bauer, Akhmat Karakotov
In-Reply-To: <20220422124340.2382da79@kernel.org>

On Fri, Apr 22, 2022 at 12:43:40PM -0700, Jakub Kicinski wrote:
> On Fri, 22 Apr 2022 07:44:12 -0700 Luis Chamberlain wrote:
> > On Fri, Apr 22, 2022 at 03:01:38PM +0800, xiangxia.m.yue@gmail.com wrote:
> > > From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> > > 
> > > This patchset introduce sysctl macro or replace var
> > > with macro.
> > > 
> > > Tonghao Zhang (3):
> > >   net: sysctl: use shared sysctl macro
> > >   net: sysctl: introduce sysctl SYSCTL_THREE
> > >   selftests/sysctl: add sysctl macro test  
> > 
> > I see these are based on net-next, to avoid conflicts with
> > sysctl development this may be best based on sysctl-next
> > though. Jakub?
> 
> I guess the base should be whatever we are going to use as
> a base for a branch, the branch we can both pull in?
> 
> How many patches like that do you see flying around, tho?
> I feel like I've seen at least 3 - netfilter, net core and bpf.
> It's starting to feel like we should have one patch that adds all 
> the constants and self test, put that in a branch anyone can pull in,
> and then do the conversions in separate patches..
> 
> Option number two - rename the statics in the subsystems to SYSCTL_x,
> and we can do a much smaller cleanup in the next cycle which would
> replace those with a centralized instances? That should have minimal
> chance of conflicts so no need to do special branches.
> 
> Option number three defer all this until the merge window.

I have a better option. I checked to see the diff stat between
the proposed patch to see what the chances of a conflict are
and so far I don't see any conflict so I think this patchset
should just go through your tree.

So feel free to take it in! Let me know if that's OK!

The proposed pathset diffstat:

  fs/proc/proc_sysctl.c                    |  2 +-
  include/linux/sysctl.h                   |  9 ++---
  lib/test_sysctl.c                        | 21 ++++++++++++
  net/core/sysctl_net_core.c               | 13 +++----
  net/ipv4/sysctl_net_ipv4.c               | 16 ++++-----
  net/ipv6/sysctl_net_ipv6.c               |  6 ++--
  net/netfilter/ipvs/ip_vs_ctl.c           |  4 +--
  tools/testing/selftests/sysctl/sysctl.sh | 43 ++++++++++++++++++++++++
  8 files changed, 84 insertions(+), 30 deletions(-)

The sysctl-next diff stat:

 fs/proc/proc_sysctl.c        |  88 ++++++-----
 include/linux/acct.h         |   1 -
 include/linux/delayacct.h    |   3 -
 include/linux/ftrace.h       |   3 -
 include/linux/initrd.h       |   2 -
 include/linux/latencytop.h   |   3 -
 include/linux/lockdep.h      |   4 -
 include/linux/oom.h          |   4 -
 include/linux/panic.h        |   6 -
 include/linux/reboot.h       |   4 -
 include/linux/sched/sysctl.h |  41 -----
 include/linux/writeback.h    |  15 --
 init/do_mounts_initrd.c      |  22 ++-
 kernel/acct.c                |  22 ++-
 kernel/bpf/syscall.c         |  87 ++++++++++
 kernel/delayacct.c           |  22 ++-
 kernel/latencytop.c          |  41 +++--
 kernel/locking/lockdep.c     |  35 ++++-
 kernel/panic.c               |  26 ++-
 kernel/rcu/rcu.h             |   2 +
 kernel/reboot.c              |  34 +++-
 kernel/sched/core.c          |  69 +++++---
 kernel/sched/deadline.c      |  42 ++++-
 kernel/sched/fair.c          |  32 +++-
 kernel/sched/rt.c            |  63 +++++++-
 kernel/sched/sched.h         |   7 +
 kernel/sched/topology.c      |  25 ++-
 kernel/sysctl.c              | 366 -------------------------------------------
 kernel/trace/ftrace.c        | 101 +++++++-----
 mm/oom_kill.c                |  38 ++++-
 mm/page-writeback.c          | 104 ++++++++++--
 31 files changed, 718 insertions(+), 594 deletions(-)

   Luis

^ permalink raw reply

* Re: [PATCH net-next 00/11] mlxsw: extend line card model by devices and info
From: Ido Schimmel @ 2022-04-25 19:39 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Ido Schimmel, netdev, davem, pabeni, jiri, petrm, dsahern, andrew,
	mlxsw
In-Reply-To: <20220425090021.32e9a98f@kernel.org>

On Mon, Apr 25, 2022 at 09:00:21AM -0700, Jakub Kicinski wrote:
> On Mon, 25 Apr 2022 06:44:20 +0300 Ido Schimmel wrote:
> > This patchset is extending the line card model by three items:
> > 1) line card devices
> > 2) line card info
> > 3) line card device info
> > 
> > First three patches are introducing the necessary changes in devlink
> > core.
> > 
> > Then, all three extensions are implemented in mlxsw alongside with
> > selftest.
> 
> :/ what is a line card device? You must provide document what you're
> doing, this:
> 
>  .../networking/devlink/devlink-linecard.rst   |   4 +
> 
> is not enough.
> 
> How many operations and attributes are you going to copy&paste?
> 
> Is linking devlink instances into a hierarchy a better approach?

In this particular case, these devices are gearboxes. They are running
their own firmware and we want user space to be able to query and update
the running firmware version.

The idea (implemented in the next patchset) is to let these devices
expose their own "component name", which can then be plugged into the
existing flash command:

    $ devlink lc show pci/0000:01:00.0 lc 8
    pci/0000:01:00.0:
      lc 8 state active type 16x100G
        supported_types:
           16x100G
        devices:
          device 0 flashable true component lc8_dev0
          device 1 flashable false
          device 2 flashable false
          device 3 flashable false
    $ devlink dev flash pci/0000:01:00.0 file some_file.mfa2 component lc8_dev0

Registering a separate devlink instance for these devices sounds like an
overkill to me. If you are not OK with a separate command (e.g.,
DEVLINK_CMD_LINECARD_INFO_GET), then extending DEVLINK_CMD_INFO_GET is
also an option. We discussed this during internal review, but felt that
the current approach is cleaner.

> Would you mind if I revert this?

I can't stop you, but keep in mind that it's already late here and that
tomorrow I'm AFK (reserve duty) and won't be able to tag it. Jiri should
be available to continue this discussion tomorrow morning, so probably
best to wait for his feedback.

^ permalink raw reply

* Re: [PATCH v6 00/10] ieee802154: Better Tx error handling
From: Stefan Schmidt @ 2022-04-25 19:25 UTC (permalink / raw)
  To: Alexander Aring, Miquel Raynal
  Cc: linux-wpan - ML, David S. Miller, Jakub Kicinski,
	open list:NETWORKING [GENERAL], David Girault, Romuald Despres,
	Frederic Blain, Nicolas Schodet, Thomas Petazzoni
In-Reply-To: <CAB_54W4_oDrfNFLrRMnOBqE=yxTGh97OK94Fiip1FovbHNaKBQ@mail.gmail.com>

Hello.

On 25.04.22 14:37, Alexander Aring wrote:
> Hi,
> 
> On Thu, Apr 7, 2022 at 6:09 AM Miquel Raynal <miquel.raynal@bootlin.com> wrote:
>>
>> The idea here is to provide a fully synchronous Tx API and also be able
>> to be sure that a transfer has finished. This will be used later by
>> another series. However, while working on this task, it appeared
>> necessary to first rework the way MLME errors were (not) propagated to
>> the upper layers. This small series tries to tackle exactly that, before
>> introducing the synchronous API.
>>
> 
> Acked-by: Alexander Aring <aahringo@redhat.com>
> 
> Thanks!


These patches have been applied to the wpan-next tree and will be
part of the next pull request to net-next. Thanks!

regards
Stefan Schmidt

^ permalink raw reply

* Re: [PATCH net-next 03/28] sfc: Copy shared files needed for Siena
From: Jakub Kicinski @ 2022-04-25 19:14 UTC (permalink / raw)
  To: Martin Habets; +Cc: pabeni, davem, netdev, ecree.xilinx
In-Reply-To: <20220425072257.sfsmelc42favw2th@gmail.com>

On Mon, 25 Apr 2022 08:22:57 +0100 Martin Habets wrote:
> > This ginormous patch does not make it thru the mail systems.
> > I'm guessing there is a (perfectly reasonable) 1MB limit somewhere.  
> 
> I think the issue is with mcdi_pcol.h, which is 1.1MB of defines
> generated from the hardware databases. It has grown slowely over the
> years.
> I'll split up this patch and see if I can manually cut down mcdi_pcol.h.

FWIW the patch did finally make an appearance on the ML but the point
stands. Chiseling down the auto-generated protocol definition for a EoL
part seems like a very nice solution.

^ permalink raw reply

* Re: [PATCH v2] wwan_hwsim: Avoid flush_scheduled_work() usage
From: Jakub Kicinski @ 2022-04-25 19:11 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Loic Poulain, Sergey Ryazanov, Johannes Berg, David S. Miller,
	Paolo Abeni, Network Development
In-Reply-To: <7390d51f-60e2-3cee-5277-b819a55ceabe@I-love.SAKURA.ne.jp>

On Fri, 22 Apr 2022 12:01:24 +0900 Tetsuo Handa wrote:
> Flushing system-wide workqueues is dangerous and will be forbidden.
> Replace system_wq with local wwan_wq.
> 
> While we are at it, make err_clean_devs: label of wwan_hwsim_init()
> behave like wwan_hwsim_exit(), for it is theoretically possible to call
> wwan_hwsim_debugfs_devcreate_write()/wwan_hwsim_debugfs_devdestroy_write()
> by the moment wwan_hwsim_init_devs() returns.
> 
> Link: https://lkml.kernel.org/r/49925af7-78a8-a3dd-bce6-cfc02e1a9236@I-love.SAKURA.ne.jp
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Reviewed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>

Commit cc271ab86606 ("wwan_hwsim: Avoid flush_scheduled_work() usage")
in net-next now, thanks!

^ permalink raw reply

* Re: [PATCH net 0/2] net/smc: Two fixes for smc fallback
From: patchwork-bot+netdevbpf @ 2022-04-25 19:10 UTC (permalink / raw)
  To: Wen Gu; +Cc: kgraul, davem, kuba, linux-s390, netdev, linux-kernel
In-Reply-To: <1650614179-11529-1-git-send-email-guwen@linux.alibaba.com>

Hello:

This series was applied to netdev/net.git (master)
by Jakub Kicinski <kuba@kernel.org>:

On Fri, 22 Apr 2022 15:56:17 +0800 you wrote:
> This patch set includes two fixes for smc fallback:
> 
> Patch 1/2 introduces some simple helpers to wrap the replacement
> and restore of clcsock's callback functions. Make sure that only
> the original callbacks will be saved and not overwritten.
> 
> Patch 2/2 fixes a syzbot reporting slab-out-of-bound issue where
> smc_fback_error_report() accesses the already freed smc sock (see
> https://lore.kernel.org/r/00000000000013ca8105d7ae3ada@google.com/).
> The patch fixes it by resetting sk_user_data and restoring clcsock
> callback functions timely in fallback situation.
> 
> [...]

Here is the summary with links:
  - [net,1/2] net/smc: Only save the original clcsock callback functions
    https://git.kernel.org/netdev/net/c/97b9af7a7093
  - [net,2/2] net/smc: Fix slab-out-of-bounds issue in fallback
    https://git.kernel.org/netdev/net/c/0558226cebee

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [Patch bpf-next v1 1/4] tcp: introduce tcp_read_skb()
From: Jakub Kicinski @ 2022-04-25 19:07 UTC (permalink / raw)
  To: Cong Wang
  Cc: netdev, Cong Wang, Eric Dumazet, John Fastabend, Daniel Borkmann,
	Jakub Sitnicki
In-Reply-To: <20220410161042.183540-2-xiyou.wangcong@gmail.com>

On Sun, 10 Apr 2022 09:10:39 -0700 Cong Wang wrote:
> +int tcp_read_skb(struct sock *sk, read_descriptor_t *desc,
> +		 sk_read_actor_t recv_actor)
> +{
> +	struct sk_buff *skb;
> +	struct tcp_sock *tp = tcp_sk(sk);
> +	u32 seq = tp->copied_seq;
> +	u32 offset;
> +	int copied = 0;
> +
> +	if (sk->sk_state == TCP_LISTEN)
> +		return -ENOTCONN;
> +	while ((skb = tcp_recv_skb(sk, seq, &offset, true)) != NULL) {
> +		if (offset < skb->len) {
> +			int used;
> +			size_t len;
> +
> +			len = skb->len - offset;
> +			used = recv_actor(desc, skb, offset, len);
> +			if (used <= 0) {
> +				if (!copied)
> +					copied = used;
> +				break;
> +			}
> +			if (WARN_ON_ONCE(used > len))
> +				used = len;
> +			seq += used;
> +			copied += used;
> +			offset += used;
> +
> +			if (offset != skb->len)
> +				continue;
> +		}
> +		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
> +			kfree_skb(skb);
> +			++seq;
> +			break;
> +		}
> +		kfree_skb(skb);
> +		if (!desc->count)
> +			break;
> +		WRITE_ONCE(tp->copied_seq, seq);
> +	}
> +	WRITE_ONCE(tp->copied_seq, seq);
> +
> +	tcp_rcv_space_adjust(sk);
> +
> +	/* Clean up data we have read: This will do ACK frames. */
> +	if (copied > 0)
> +		tcp_cleanup_rbuf(sk, copied);
> +
> +	return copied;
> +}
> +EXPORT_SYMBOL(tcp_read_skb);

I started prototyping a similar patch for TLS a while back but I have
two functions - one to get the skb and another to consume it. I thought
that's better for TLS, otherwise skbs stuck in the middle layer are not
counted towards the rbuf. Any thoughts on structuring the API that way?
I guess we can refactor that later, since TLS TCP-only we don't need
proto_ops plumbing there.

Overall 👍 for adding such API.

^ permalink raw reply

* Re: [PATCH net-next 0/2] Make [IP6]GRE[TAP] devices always NETIF_F_LLTX
From: Peilin Ye @ 2022-04-25 19:05 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Paolo Abeni,
	Peilin Ye, Cong Wang, Eric Dumazet, netdev, linux-kernel
In-Reply-To: <20220425110259.389ed44b@kernel.org>

Hi Jakub,

On Mon, Apr 25, 2022 at 11:02:59AM -0700, Jakub Kicinski wrote:
> On Thu, 21 Apr 2022 15:43:42 -0700 Peilin Ye wrote:
> > This patchset depends on these fixes [1].  Since o_seqno is now atomic_t,
> > we can always turn on NETIF_F_LLTX for [IP6]GRE[TAP] devices, since we no
> > longer need the TX lock (&txq->_xmit_lock).
> 
> LGTM, thanks, but please repost on Thu / Fri. The fixes make their way
> into net-next on Thu so until then we can't apply.

Thanks for the review!

> > We could probably do the same thing to [IP6]ERSPAN devices as well, but
> > I'm not familiar with them yet.  For example, ERSPAN devices are
> > initialized as |= GRE_FEATURES in erspan_tunnel_init(), but I don't see
> > IP6ERSPAN devices being initialized as |= GRE6_FEATURES.  Please suggest
> > if I'm missing something, thanks!
> 
> Probably good to CC William when you repost.

Sure, I will resend then.

Thanks,
Peilin Ye


^ permalink raw reply

* Re: [PATCH net v2 1/2] dt-bindings: net: dsa: realtek: cleanup compatible strings
From: Rob Herring @ 2022-04-25 18:48 UTC (permalink / raw)
  To: Luiz Angelo Daros de Luca
  Cc: netdev, linus.walleij, alsi, andrew, vivien.didelot, f.fainelli,
	olteanv, davem, kuba, pabeni, krzk+dt, arinc.unal, devicetree
In-Reply-To: <20220418233558.13541-1-luizluca@gmail.com>

On Mon, Apr 18, 2022 at 08:35:57PM -0300, Luiz Angelo Daros de Luca wrote:
> Compatible strings are used to help the driver find the chip ID/version
> register for each chip family. After that, the driver can setup the
> switch accordingly. Keep only the first supported model for each family
> as a compatible string and reference other chip models in the
> description.

The power supplies needing power before you can actually read the ID
registers are the same for all the variations?

The RTL8366s has a serdes power supply while the RTL8370 does not. Maybe 
that doesn't matter as the PHYs probably don't need power to access 
registers, but I didn't look at more than 2 datasheets. If there's *any* 
differences in power sequencing then you need specific compatibles.

Rob

^ permalink raw reply

* [PATCH v3 bpf-next 2/2] selftests/bpf: handle batch operations for map-in-map bpf-maps
From: Takshak Chahande @ 2022-04-25 18:41 UTC (permalink / raw)
  To: netdev, bpf; +Cc: andrii, ast, ctakshak, ndixit, kafai, andriin, daniel
In-Reply-To: <20220425184149.1173545-1-ctakshak@fb.com>

This patch adds up test cases that handles 4 combinations:
a) outer map: BPF_MAP_TYPE_ARRAY_OF_MAPS
   inner maps: BPF_MAP_TYPE_ARRAY and BPF_MAP_TYPE_HASH
b) outer map: BPF_MAP_TYPE_HASH_OF_MAPS
   inner maps: BPF_MAP_TYPE_ARRAY and BPF_MAP_TYPE_HASH

v2->v3:
- Handled transient ENOSPC correctly, bug was found in BPF CI (Daniel)

v1->v2:
- Fixed no format arguments error (Andrii)

Signed-off-by: Takshak Chahande <ctakshak@fb.com>
---
 .../bpf/map_tests/map_in_map_batch_ops.c      | 239 ++++++++++++++++++
 1 file changed, 239 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/map_tests/map_in_map_batch_ops.c

diff --git a/tools/testing/selftests/bpf/map_tests/map_in_map_batch_ops.c b/tools/testing/selftests/bpf/map_tests/map_in_map_batch_ops.c
new file mode 100644
index 000000000000..f1eee580ba2e
--- /dev/null
+++ b/tools/testing/selftests/bpf/map_tests/map_in_map_batch_ops.c
@@ -0,0 +1,239 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <stdio.h>
+#include <errno.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include <test_maps.h>
+
+#define OUTER_MAP_ENTRIES 10
+
+static __u32 get_map_id_from_fd(int map_fd)
+{
+	struct bpf_map_info map_info = {};
+	uint32_t info_len = sizeof(map_info);
+	int ret;
+
+	ret = bpf_obj_get_info_by_fd(map_fd, &map_info, &info_len);
+	CHECK(ret < 0, "Finding map info failed", "error:%s\n",
+	      strerror(errno));
+
+	return map_info.id;
+}
+
+/* This creates number of OUTER_MAP_ENTRIES maps that will be stored
+ * in outer map and return the created map_fds
+ */
+static void create_inner_maps(enum bpf_map_type map_type,
+			      __u32 *inner_map_fds)
+{
+	int map_fd, map_index, ret;
+	__u32 map_key = 0, map_id;
+	char map_name[15];
+
+	for (map_index = 0; map_index < OUTER_MAP_ENTRIES; map_index++) {
+		memset(map_name, 0, sizeof(map_name));
+		sprintf(map_name, "inner_map_fd_%d", map_index);
+		map_fd = bpf_map_create(map_type, map_name, sizeof(__u32),
+					sizeof(__u32), 1, NULL);
+		CHECK(map_fd < 0,
+		      "inner bpf_map_create() failed",
+		      "map_type=(%d) map_name(%s), error:%s\n",
+		      map_type, map_name, strerror(errno));
+
+		/* keep track of the inner map fd as it is required
+		 * to add records in outer map
+		 */
+		inner_map_fds[map_index] = map_fd;
+
+		/* Add entry into this created map
+		 * eg: map1 key = 0, value = map1's map id
+		 *     map2 key = 0, value = map2's map id
+		 */
+		map_id = get_map_id_from_fd(map_fd);
+		ret = bpf_map_update_elem(map_fd, &map_key, &map_id, 0);
+		CHECK(ret != 0,
+		      "bpf_map_update_elem failed",
+		      "map_type=(%d) map_name(%s), error:%s\n",
+		      map_type, map_name, strerror(errno));
+	}
+}
+
+static int create_outer_map(enum bpf_map_type map_type, __u32 inner_map_fd)
+{
+	int outer_map_fd;
+
+	LIBBPF_OPTS(bpf_map_create_opts, attr);
+	attr.inner_map_fd = inner_map_fd;
+	outer_map_fd = bpf_map_create(map_type, "outer_map", sizeof(__u32),
+				      sizeof(__u32), OUTER_MAP_ENTRIES,
+				      &attr);
+	CHECK(outer_map_fd < 0,
+	      "outer bpf_map_create()",
+	      "map_type=(%d), error:%s\n",
+	      map_type, strerror(errno));
+
+	return outer_map_fd;
+}
+
+static void validate_fetch_results(int outer_map_fd, __u32 *inner_map_fds,
+				   __u32 *fetched_keys, __u32 *fetched_values,
+				   __u32 max_entries_fetched)
+{
+	__u32 inner_map_key, inner_map_value;
+	int inner_map_fd, entry, err;
+	__u32 outer_map_value;
+
+	for (entry = 0; entry < max_entries_fetched; ++entry) {
+		outer_map_value = fetched_values[entry];
+		inner_map_fd = bpf_map_get_fd_by_id(outer_map_value);
+		CHECK(inner_map_fd < 0,
+		      "Failed to get inner map fd",
+		      "from id(%d), error=%s\n",
+		      outer_map_value, strerror(errno));
+		err = bpf_map_get_next_key(inner_map_fd, NULL, &inner_map_key);
+		CHECK(err != 0,
+		      "Failed to get inner map key",
+		      "error=%s\n", strerror(errno));
+
+		err = bpf_map_lookup_elem(inner_map_fd, &inner_map_key,
+					  &inner_map_value);
+		CHECK(err != 0,
+		      "Failed to get inner map value",
+		      "for key(%d), error=%s\n",
+		      inner_map_key, strerror(errno));
+
+		/* Actual value validation */
+		CHECK(outer_map_value != inner_map_value,
+		      "Failed to validate inner map value",
+		      "fetched(%d) and lookedup(%d)!\n",
+		      outer_map_value, inner_map_value);
+	}
+}
+
+static void fetch_and_validate(int outer_map_fd,
+			       __u32 *inner_map_fds,
+			       struct bpf_map_batch_opts *opts,
+			       __u32 batch_size, bool delete_entries)
+{
+	__u32 *fetched_keys, *fetched_values, total_fetched = 0;
+	__u32 batch_key = 0, fetch_count, step_size;
+	int err, max_entries = OUTER_MAP_ENTRIES;
+	__u32 value_size = sizeof(__u32);
+
+	/* Total entries needs to be fetched */
+	fetched_keys = calloc(max_entries, value_size);
+	fetched_values = calloc(max_entries, value_size);
+
+	for (step_size = batch_size; step_size <= max_entries; step_size += batch_size) {
+		fetch_count = step_size;
+		err = delete_entries
+		      ? bpf_map_lookup_and_delete_batch(outer_map_fd,
+				      total_fetched ? &batch_key : NULL,
+				      &batch_key,
+				      fetched_keys + total_fetched,
+				      fetched_values + total_fetched,
+				      &fetch_count, opts)
+		      : bpf_map_lookup_batch(outer_map_fd,
+				      total_fetched ? &batch_key : NULL,
+				      &batch_key,
+				      fetched_keys + total_fetched,
+				      fetched_values + total_fetched,
+				      &fetch_count, opts);
+
+		if (err && errno == ENOSPC) {
+			/* Fetch again with higher batch size */
+			total_fetched = 0;
+			continue;
+		}
+
+		CHECK((err < 0 && (errno != ENOENT)),
+		      "lookup with steps failed",
+		      "error: %s\n", strerror(errno));
+
+		/* Update the total fetched number */
+		total_fetched += fetch_count;
+		if (err)
+			break;
+	}
+
+	CHECK((total_fetched != max_entries),
+	      "Unable to fetch expected entries !",
+	      "total_fetched(%d) and max_entries(%d) error: (%d):%s\n",
+	      total_fetched, max_entries, errno, strerror(errno));
+
+	/* validate the fetched entries */
+	validate_fetch_results(outer_map_fd, inner_map_fds, fetched_keys,
+			       fetched_values, total_fetched);
+	printf("batch_op(%s) is successful with batch_size(%d)\n",
+		delete_entries ? "LOOKUP_AND_DELETE" : "LOOKUP", batch_size);
+
+	free(fetched_keys);
+	free(fetched_values);
+}
+
+static void _map_in_map_batch_ops(enum bpf_map_type outer_map_type,
+				  enum bpf_map_type inner_map_type)
+{
+	__u32 *outer_map_keys, *inner_map_fds;
+	__u32 max_entries = OUTER_MAP_ENTRIES;
+	__u32 value_size = sizeof(__u32);
+	int batch_size[2] = {5, 10};
+	__u32 map_index, op_index;
+	int outer_map_fd, ret;
+	DECLARE_LIBBPF_OPTS(bpf_map_batch_opts, opts,
+			    .elem_flags = 0,
+			    .flags = 0,
+	);
+
+	outer_map_keys = calloc(max_entries, value_size);
+	inner_map_fds = calloc(max_entries, value_size);
+	create_inner_maps(inner_map_type, inner_map_fds);
+
+	outer_map_fd = create_outer_map(outer_map_type, *inner_map_fds);
+	/* create outer map keys */
+	for (map_index = 0; map_index < max_entries; map_index++)
+		outer_map_keys[map_index] =
+			((outer_map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
+			 ? 9 : 1000) - map_index;
+
+	/* batch operation - map_update */
+	ret = bpf_map_update_batch(outer_map_fd, outer_map_keys,
+				   inner_map_fds, &max_entries, &opts);
+	CHECK(ret != 0,
+	      "Failed to update the outer map batch ops",
+	      "error=%s\n", strerror(errno));
+
+	/* batch operation - map_lookup */
+	for (op_index = 0; op_index < 2; ++op_index)
+		fetch_and_validate(outer_map_fd, inner_map_fds, &opts,
+				   batch_size[op_index], false);
+
+	/* batch operation - map_lookup_delete */
+	if (outer_map_type == BPF_MAP_TYPE_HASH_OF_MAPS)
+		fetch_and_validate(outer_map_fd, inner_map_fds, &opts,
+				   max_entries, true /*delete*/);
+
+	free(inner_map_fds);
+	free(outer_map_keys);
+}
+
+void test_map_in_map_batch_ops_array(void)
+{
+	_map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_ARRAY);
+	printf("%s:PASS with inner ARRAY map\n", __func__);
+	_map_in_map_batch_ops(BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_HASH);
+	printf("%s:PASS with inner HASH map\n", __func__);
+}
+
+void test_map_in_map_batch_ops_hash(void)
+{
+	_map_in_map_batch_ops(BPF_MAP_TYPE_HASH_OF_MAPS, BPF_MAP_TYPE_ARRAY);
+	printf("%s:PASS with inner ARRAY map\n", __func__);
+	_map_in_map_batch_ops(BPF_MAP_TYPE_HASH_OF_MAPS, BPF_MAP_TYPE_HASH);
+	printf("%s:PASS with inner HASH map\n", __func__);
+}
-- 
2.30.2


^ permalink raw reply related

* [PATCH v3 bpf-next 1/2] bpf: Extend batch operations for map-in-map bpf-maps
From: Takshak Chahande @ 2022-04-25 18:41 UTC (permalink / raw)
  To: netdev, bpf; +Cc: andrii, ast, ctakshak, ndixit, kafai, andriin, daniel

This patch extends batch operations support for map-in-map map-types:
BPF_MAP_TYPE_HASH_OF_MAPS and BPF_MAP_TYPE_ARRAY_OF_MAPS

A usecase where outer HASH map holds hundred of VIP entries and its
associated reuse-ports per VIP stored in REUSEPORT_SOCKARRAY type
inner map, needs to do batch operation for performance gain.

This patch leverages the exiting generic functions for most of the batch
operations. As map-in-map's value contains the actual reference of the inner map,
for BPF_MAP_TYPE_HASH_OF_MAPS type, it needed an extra step to fetch the
map_id from the reference value.

selftests are added in next patch that has v1->v3 changes

Signed-off-by: Takshak Chahande <ctakshak@fb.com>
---
 kernel/bpf/arraymap.c |  2 ++
 kernel/bpf/hashtab.c  | 12 ++++++++++--
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 7f145aefbff8..f0852b6617cc 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -1344,6 +1344,8 @@ const struct bpf_map_ops array_of_maps_map_ops = {
 	.map_fd_put_ptr = bpf_map_fd_put_ptr,
 	.map_fd_sys_lookup_elem = bpf_map_fd_sys_lookup_elem,
 	.map_gen_lookup = array_of_map_gen_lookup,
+	.map_lookup_batch = generic_map_lookup_batch,
+	.map_update_batch = generic_map_update_batch,
 	.map_check_btf = map_check_no_btf,
 	.map_btf_name = "bpf_array",
 	.map_btf_id = &array_of_maps_map_btf_id,
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index c68fbebc8c00..fd537bfba84c 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -139,7 +139,7 @@ static inline bool htab_use_raw_lock(const struct bpf_htab *htab)
 
 static void htab_init_buckets(struct bpf_htab *htab)
 {
-	unsigned i;
+	unsigned int i;
 
 	for (i = 0; i < htab->n_buckets; i++) {
 		INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
@@ -1594,7 +1594,7 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 	void __user *uvalues = u64_to_user_ptr(attr->batch.values);
 	void __user *ukeys = u64_to_user_ptr(attr->batch.keys);
 	void __user *ubatch = u64_to_user_ptr(attr->batch.in_batch);
-	u32 batch, max_count, size, bucket_size;
+	u32 batch, max_count, size, bucket_size, map_id;
 	struct htab_elem *node_to_free = NULL;
 	u64 elem_map_flags, map_flags;
 	struct hlist_nulls_head *head;
@@ -1719,6 +1719,13 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 			}
 		} else {
 			value = l->key + roundup_key_size;
+			if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS) {
+				struct bpf_map **inner_map = value;
+				 /* Actual value is the id of the inner map */
+				map_id = map->ops->map_fd_sys_lookup_elem(*inner_map);
+				value = &map_id;
+			}
+
 			if (elem_map_flags & BPF_F_LOCK)
 				copy_map_value_locked(map, dst_val, value,
 						      true);
@@ -2425,6 +2432,7 @@ const struct bpf_map_ops htab_of_maps_map_ops = {
 	.map_fd_sys_lookup_elem = bpf_map_fd_sys_lookup_elem,
 	.map_gen_lookup = htab_of_map_gen_lookup,
 	.map_check_btf = map_check_no_btf,
+	BATCH_OPS(htab),
 	.map_btf_name = "bpf_htab",
 	.map_btf_id = &htab_of_maps_map_btf_id,
 };
-- 
2.30.2


^ permalink raw reply related

* Re: [PATCH bpf-next 2/7] bpf: add bpf_skc_to_mptcp_sock_proto
From: Mat Martineau @ 2022-04-25 18:35 UTC (permalink / raw)
  To: Geliang Tang
  Cc: Daniel Borkmann, netdev, bpf, ast, andrii, mptcp,
	Nicolas Rybowski, Matthieu Baerts
In-Reply-To: <903108df-161e-515b-da3d-bff4fb49de39@iogearbox.net>

On Mon, 25 Apr 2022, Daniel Borkmann wrote:

> On 4/21/22 12:24 AM, Mat Martineau wrote:
> [...]
>>   static const struct bpf_func_proto *
>>   bpf_sk_base_func_proto(enum bpf_func_id func_id);
>> @@ -11279,6 +11280,19 @@ const struct bpf_func_proto 
>> bpf_skc_to_unix_sock_proto = {
>>   	.ret_btf_id		= &btf_sock_ids[BTF_SOCK_TYPE_UNIX],
>>   };
>>   +BPF_CALL_1(bpf_skc_to_mptcp_sock, struct sock *, sk)
>> +{
>> +	return (unsigned long)bpf_mptcp_sock_from_subflow(sk);
>> +}
>> +
>> +static const struct bpf_func_proto bpf_skc_to_mptcp_sock_proto = {
>> +	.func		= bpf_skc_to_mptcp_sock,
>> +	.gpl_only	= false,
>> +	.ret_type	= RET_PTR_TO_BTF_ID_OR_NULL,
>> +	.arg1_type	= ARG_PTR_TO_SOCK_COMMON,
>> +	.ret_btf_id	= &btf_sock_ids[BTF_SOCK_TYPE_MPTCP],
>> +};
>
> BPF CI 
> (https://github.com/kernel-patches/bpf/runs/6136052684?check_suite_focus=true) 
> fails with:
>
>  #7   base:FAIL
>  libbpf: prog '_sockops': BPF program load failed: Invalid argument
>  libbpf: prog '_sockops': -- BEGIN PROG LOAD LOG --
>  0: R1=ctx(off=0,imm=0) R10=fp0
>  ; int op = (int)ctx->op;
>  0: (61) r2 = *(u32 *)(r1 +0)          ; R1=ctx(off=0,imm=0) 
> R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
>  ; if (op != BPF_SOCK_OPS_TCP_CONNECT_CB)
>  1: (56) if w2 != 0x3 goto pc+50       ; R2_w=3
>  ; sk = ctx->sk;
>  2: (79) r6 = *(u64 *)(r1 +184)        ; R1=ctx(off=0,imm=0) 
> R6_w=sock_or_null(id=1,off=0,imm=0)
>  ; if (!sk)
>  3: (15) if r6 == 0x0 goto pc+48       ; R6_w=sock(off=0,imm=0)
>  ; tcp_sk = bpf_tcp_sock(sk);
>  4: (bf) r1 = r6                       ; R1_w=sock(off=0,imm=0) 
> R6_w=sock(off=0,imm=0)
>  5: (85) call bpf_tcp_sock#96          ; 
> R0_w=tcp_sock_or_null(id=2,off=0,imm=0)
>  6: (bf) r7 = r0                       ; 
> R0=tcp_sock_or_null(id=2,off=0,imm=0) R7=tcp_sock_or_null(id=2,off=0,imm=0)
>  ; if (!tcp_sk)
>  7: (15) if r7 == 0x0 goto pc+44       ; R7=tcp_sock(off=0,imm=0)
>  ; if (!tcp_sk->is_mptcp) {
>  8: (61) r1 = *(u32 *)(r7 +112)        ; 
> R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) 
> R7=tcp_sock(off=0,imm=0)
>  ; if (!tcp_sk->is_mptcp) {
>  9: (56) if w1 != 0x0 goto pc+14 24: R0=tcp_sock(off=0,imm=0) 
> R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R6=sock(off=0,imm=0) 
> R7=tcp_sock(off=0,imm=0) R10=fp0
>  ; msk = bpf_skc_to_mptcp_sock(sk);
>  24: (bf) r1 = r6                      ; R1_w=sock(off=0,imm=0) 
> R6=sock(off=0,imm=0)
>  25: (85) call bpf_skc_to_mptcp_sock#194
>  invalid return type 8 of func bpf_skc_to_mptcp_sock#194
>  processed 34 insns (limit 1000000) max_states_per_insn 0 total_states 3 
> peak_states 3 mark_read 1
>  -- END PROG LOAD LOG --
>  libbpf: failed to load program '_sockops'
>  libbpf: failed to load object './mptcp_sock.o'
>  run_test:FAIL:165
>  test_base:FAIL:227
>  (network_helpers.c:88: errno: Protocol not supported) Failed to create 
> server socket
>  test_base:FAIL:241
>  RTNETLINK answers: No such file or directory
>  Error talking to the kernel
> [...]
>
> Looking at bpf_skc_to_tcp6_sock(), do we similarly need a BTF_TYPE_EMIT() 
> here?
>

Geliang, in addition to the BTF_TYPE_EMIT() can you also take a look at 
the places in kernel/trace/bpf_trace.c and kernel/bpf/verifier.c where 
bpf_skc_to_tcp6_sock and bpf_skc_to_tcp6_sock_proto are referenced?

--
Mat Martineau
Intel

^ permalink raw reply

* Re: [PATCH bpf-next 2/7] bpf: add bpf_skc_to_mptcp_sock_proto
From: Mat Martineau @ 2022-04-25 18:11 UTC (permalink / raw)
  To: Daniel Borkmann, Geliang Tang
  Cc: netdev, bpf, ast, andrii, mptcp, Nicolas Rybowski,
	Matthieu Baerts
In-Reply-To: <513485c6-82ea-9cf0-1df0-3ea75935809c@iogearbox.net>

On Mon, 25 Apr 2022, Daniel Borkmann wrote:

> On 4/21/22 12:24 AM, Mat Martineau wrote:
> [...]
>> diff --git a/include/net/mptcp.h b/include/net/mptcp.h
>> index 0a3b0fb04a3b..5b3a6f783182 100644
>> --- a/include/net/mptcp.h
>> +++ b/include/net/mptcp.h
>> @@ -283,4 +283,10 @@ static inline int mptcpv6_init(void) { return 0; }
>>   static inline void mptcpv6_handle_mapped(struct sock *sk, bool mapped) { 
>> }
>>   #endif
>>   +#if defined(CONFIG_MPTCP) && defined(CONFIG_BPF_JIT) && 
>> defined(CONFIG_BPF_SYSCALL)
>> +struct mptcp_sock *bpf_mptcp_sock_from_subflow(struct sock *sk);
>> +#else
>> +static inline struct mptcp_sock *bpf_mptcp_sock_from_subflow(struct sock 
>> *sk) { return NULL; }
>> +#endif
>> +
>
> Where is this relevant to JIT specifically?
>

That's carried over from the build conditions for bpf_tcp_ca.c in 
net/ipv4/Makefile:

ifeq ($(CONFIG_BPF_JIT),y)
obj-$(CONFIG_BPF_SYSCALL) += bpf_tcp_ca.o
endif

Looks like the reasoning for that (in the CA code) is the use of 
bpf_struct_ops in bpf_tcp_ca.c

While this patch series for MPTCP does not use bpf_struct_ops, and JIT is 
not necessary for bpf_mptcp_sock_from_subflow(), the upcoming MPTCP 
scheduler-in-BPF patches do use bpf_struct_ops. So that dependency found 
its way in to this series - but now that you point it out, 
bpf_mptcp_sock_from_subflow() shouldn't be limited by CONFIG_BPF_JIT and 
we can separately check for the JIT dependency for the scheduler code. 
Will fix that in v2.

--
Mat Martineau
Intel

^ permalink raw reply

* Re: [PATCH net-next 0/2] Make [IP6]GRE[TAP] devices always NETIF_F_LLTX
From: Jakub Kicinski @ 2022-04-25 18:02 UTC (permalink / raw)
  To: Peilin Ye
  Cc: David S. Miller, Hideaki YOSHIFUJI, David Ahern, Paolo Abeni,
	Peilin Ye, Cong Wang, Eric Dumazet, netdev, linux-kernel
In-Reply-To: <cover.1650580763.git.peilin.ye@bytedance.com>

On Thu, 21 Apr 2022 15:43:42 -0700 Peilin Ye wrote:
> This patchset depends on these fixes [1].  Since o_seqno is now atomic_t,
> we can always turn on NETIF_F_LLTX for [IP6]GRE[TAP] devices, since we no
> longer need the TX lock (&txq->_xmit_lock).

LGTM, thanks, but please repost on Thu / Fri. The fixes make their way
into net-next on Thu so until then we can't apply.

> We could probably do the same thing to [IP6]ERSPAN devices as well, but
> I'm not familiar with them yet.  For example, ERSPAN devices are
> initialized as |= GRE_FEATURES in erspan_tunnel_init(), but I don't see
> IP6ERSPAN devices being initialized as |= GRE6_FEATURES.  Please suggest
> if I'm missing something, thanks!

Probably good to CC William when you repost.

^ permalink raw reply

* Re: [PATCH v2 net] tcp: md5: incorrect tcp_header_len for incoming connections
From: Francesco Ruggeri @ 2022-04-25 17:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Paolo Abeni, Jakub Kicinski, David Ahern, Hideaki YOSHIFUJI,
	David Miller, LKML, netdev
In-Reply-To: <CANn89iKM-f=VLfwb9wq8+_bmaLjP_Xg5CanqJWhans2DXE=v5w@mail.gmail.com>

On Sun, Apr 24, 2022 at 12:37 PM Eric Dumazet <edumazet@google.com> wrote:
>
> Wait a minute.
>
> Are you sure treq->af_specific is initialized at this point ?
>
> I should have tested this one liner patch really :/
>
> I think that for syncookies, treq->af_specific is not initialized,
> because we do not go through
> tcp_conn_request() helper, but instead use cookie_tcp_reqsk_alloc()
>
> Before your patch treq->af_specific was only used during SYNACK
> generation, which does not happen in syncookie more while receiving
> the third packet.
>
> I will test something like this patch. We could move the init after
> cookie_tcp_reqsk_alloc() has been called, but I prefer using the same
> construct than tcp_conn_request()
>

Thanks for fixing this Eric. I had not considered syncookies.

Francesco

^ permalink raw reply

* Re: [PATCH RFC 2/5] tls: build proto after context has been initialized
From: Chuck Lever III @ 2022-04-25 17:51 UTC (permalink / raw)
  To: Jakub Kicinski, Hannes Reinecke
  Cc: netdev, Linux NFS Mailing List, linux-nvme@lists.infradead.org,
	CIFS, linux-fsdevel@vger.kernel.org, ak@tempesta-tech.com,
	borisp@nvidia.com, simo@redhat.com
In-Reply-To: <20220425101126.0efa9f51@kernel.org>



> On Apr 25, 2022, at 1:11 PM, Jakub Kicinski <kuba@kernel.org> wrote:
> 
> On Mon, 18 Apr 2022 12:49:36 -0400 Chuck Lever wrote:
>> From: Hannes Reinecke <hare@suse.de>
>> 
>> We have to build the proto ops only after the context has been
>> initialized, as otherwise we might crash when I/O is ongoing
>> during initialisation.
> 
> Can you say more about the crash you see? protos are not used until 
> at least one socket calls update_sk_prot().

Hannes reported the crash and supplied the fix. I didn't have a
problem with this patch series without this patch. Hannes?



--
Chuck Lever




^ permalink raw reply

* Re: [PATCH] net: linkwatch: ignore events for unregistered netdevs
From: Eric Dumazet @ 2022-04-25 17:24 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Jann Horn, Jakub Kicinski, Paolo Abeni, Oliver Neukum,
	David S. Miller, Oleksij Rempel, netdev, USB list, Andrew Lunn,
	Jacky Chou, Willy Tarreau, Lino Sanfilippo, Philipp Rosenberger,
	Heiner Kallweit, Greg Kroah-Hartman
In-Reply-To: <20220425172014.GA24181@wunner.de>

On Mon, Apr 25, 2022 at 10:20 AM Lukas Wunner <lukas@wunner.de> wrote:
>
> On Mon, Apr 25, 2022 at 05:18:51PM +0200, Jann Horn wrote:
> > Well, it's not quite a refcount. It's a count that can be incremented
> > and decremented but can't be read while the device is alive, and then
> > at some point it turns into a count that can be read and decremented
> > but can't be incremented
>
> Pardon me for being dense, but most other subsystems use the refcounting
> built into struct device (or rather, its kobject) and tear it down
> when the refcount reaches zero (e.g. pci_release_dev(), spidev_release()).
>
> What's the rationale for struct net_device rolling its own refcounting?
> Historic artifact?


Yes, probably. This was there way before new fancy mechanisms were invented.

>
>
> I think a lot of these issues would solve themselves if that was done away
> with and replaced with the generic kobject refcounting.  It's a pity that
> the tracking infrastructure is now netdev-specific and other subsystems
> cannot benefit from it.

Make sure that whatever replaces it, heavy dev_hold()/dev_put() users
do not come to a crawl.

af_packet is using this stuff.

Some users want to send millions of packets per second, without having
to bypass the kernel because it is suddenly too slow.


>
> Thanks,
>
> Lukas

^ permalink raw reply

* Re: [PATCH] net: linkwatch: ignore events for unregistered netdevs
From: Lukas Wunner @ 2022-04-25 17:20 UTC (permalink / raw)
  To: Jann Horn
  Cc: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Oliver Neukum,
	David S. Miller, Oleksij Rempel, netdev, USB list, Andrew Lunn,
	Jacky Chou, Willy Tarreau, Lino Sanfilippo, Philipp Rosenberger,
	Heiner Kallweit, Greg Kroah-Hartman
In-Reply-To: <CAG48ez0nw7coDXYozaUOTThWLkHWZuKVUpMosY2hgVSSfeM4Pw@mail.gmail.com>

On Mon, Apr 25, 2022 at 05:18:51PM +0200, Jann Horn wrote:
> Well, it's not quite a refcount. It's a count that can be incremented
> and decremented but can't be read while the device is alive, and then
> at some point it turns into a count that can be read and decremented
> but can't be incremented

Pardon me for being dense, but most other subsystems use the refcounting
built into struct device (or rather, its kobject) and tear it down
when the refcount reaches zero (e.g. pci_release_dev(), spidev_release()).

What's the rationale for struct net_device rolling its own refcounting?
Historic artifact?

I think a lot of these issues would solve themselves if that was done away
with and replaced with the generic kobject refcounting.  It's a pity that
the tracking infrastructure is now netdev-specific and other subsystems
cannot benefit from it.

Thanks,

Lukas

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox