* Re: [PATCH net 1/7] secure_seq: return the full 64-bit of the siphash
From: Eric Dumazet @ 2022-04-27 16:50 UTC (permalink / raw)
To: Willy Tarreau
Cc: kernel test robot, netdev, kbuild-all, Jakub Kicinski, Moshe Kol,
Yossi Gilad, Amit Klein, LKML, Jason A . Donenfeld
In-Reply-To: <20220427163554.GA3746@1wt.eu>
On Wed, Apr 27, 2022 at 9:35 AM Willy Tarreau <w@1wt.eu> wrote:
>
> On Wed, Apr 27, 2022 at 12:07:14PM +0200, Willy Tarreau wrote:
> > On Wed, Apr 27, 2022 at 05:56:41PM +0800, kernel test robot wrote:
> > > Hi Willy,
> > >
> > > I love your patch! Yet something to improve:
> > >
> > > [auto build test ERROR on net/master]
> > >
> > > url: https://github.com/intel-lab-lkp/linux/commits/Willy-Tarreau/insufficient-TCP-source-port-randomness/20220427-145651
> > > base: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 71cffebf6358a7f5031f5b208bbdc1cb4db6e539
> > > config: i386-randconfig-r026-20220425 (https://download.01.org/0day-ci/archive/20220427/202204271705.VrWNPv7n-lkp@intel.com/config)
> > > compiler: gcc-11 (Debian 11.2.0-20) 11.2.0
> > > reproduce (this is a W=1 build):
> > > # https://github.com/intel-lab-lkp/linux/commit/01b26e522b598adf346b809075880feab3dcdc08
> > > git remote add linux-review https://github.com/intel-lab-lkp/linux
> > > git fetch --no-tags linux-review Willy-Tarreau/insufficient-TCP-source-port-randomness/20220427-145651
> > > git checkout 01b26e522b598adf346b809075880feab3dcdc08
> > > # save the config file
> > > mkdir build_dir && cp config build_dir/.config
> > > make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash
> > >
> > > If you fix the issue, kindly add following tag as appropriate
> > > Reported-by: kernel test robot <lkp@intel.com>
> > >
> > > All errors (new ones prefixed by >>):
> > >
> > > ld: net/ipv4/inet_hashtables.o: in function `__inet_hash_connect':
> > > >> inet_hashtables.c:(.text+0x187d): undefined reference to `__umoddi3'
> >
> > Argh! indeed, we spoke about using div_u64_rem() at the beginning and
> > that one vanished over time. Will respin it.
>
> I fixed it, built it for i386 and x86_64, tested it on x86_64 and confirmed
> that it still does what I need. The change is only this:
>
> - offset = (READ_ONCE(table_perturb[index]) + (port_offset >> 32)) % remaining;
> + div_u64_rem(READ_ONCE(table_perturb[index]) + (port_offset >> 32), remaining, &offset);
>
> I'll send a v2 series in a few hours if there are no more comments.
We really do not need 33 bits here.
I would suggest using a 32bit divide.
offset = READ_ONCE(table_perturb[index]) + (port_offset >> 32));
offset %= remaining;
^ permalink raw reply
* Re: [PATCH] memcg: accounting for objects allocated for new netdevice
From: Shakeel Butt @ 2022-04-27 16:52 UTC (permalink / raw)
To: Michal Koutný
Cc: Vasily Averin, Roman Gushchin, Vlastimil Babka, kernel,
Florian Westphal, LKML, Michal Hocko, Cgroups, netdev,
David S. Miller, Jakub Kicinski, Paolo Abeni, Greg Kroah-Hartman,
Tejun Heo, Luis Chamberlain, Kees Cook, Iurii Zaikin,
linux-fsdevel
In-Reply-To: <20220427140153.GC9823@blackbody.suse.cz>
On Wed, Apr 27, 2022 at 7:01 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello Vasily.
>
> On Wed, Apr 27, 2022 at 01:37:50PM +0300, Vasily Averin <vvs@openvz.org> wrote:
> > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> > index cfa79715fc1a..2881aeeaa880 100644
> > --- a/fs/kernfs/mount.c
> > +++ b/fs/kernfs/mount.c
> > @@ -391,7 +391,7 @@ void __init kernfs_init(void)
> > {
> > kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
> > sizeof(struct kernfs_node),
> > - 0, SLAB_PANIC, NULL);
> > + 0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
>
> kernfs accounting you say?
> kernfs backs up also cgroups, so the parent-child accounting comes to my
> mind.
> See the temporary switch to parent memcg in mem_cgroup_css_alloc().
>
> (I mean this makes some sense but I'd suggest unlumping the kernfs into
> a separate path for possible discussion and its not-only-netdevice
> effects.)
>
I agree with Michal that kernfs accounting should be its own patch.
Internally at Google, we actually have enabled the memcg accounting of
kernfs nodes. We have workloads which create 100s of subcontainers and
without memcg accounting of kernfs we see high system overhead.
^ permalink raw reply
* Re: [PATCH net-next 08/14] eth: bgnet: remove a copy of the NAPI_POLL_WEIGHT define
From: Jakub Kicinski @ 2022-04-27 16:53 UTC (permalink / raw)
To: Florian Fainelli; +Cc: davem, pabeni, netdev, rafal, bcm-kernel-feedback-list
In-Reply-To: <56654c2f-d144-5bcf-0d2c-db3f891169cb@gmail.com>
On Wed, 27 Apr 2022 09:09:36 -0700 Florian Fainelli wrote:
> Looks fine, however this is a new subject prefix, do you mind using:
>
> net: bgmac: remove a copy of the NAPI_POLL_WEIGHT define
Ayy, sorry. benet, bgmac... I'll fix when applying.
^ permalink raw reply
* Re: [PATCH v2 net-next] net: generalize skb freeing deferral to per-cpu lists
From: Eric Dumazet @ 2022-04-27 16:53 UTC (permalink / raw)
To: Ido Schimmel
Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
netdev
In-Reply-To: <YmlilMi5MmApVqTX@shredder>
On Wed, Apr 27, 2022 at 8:34 AM Ido Schimmel <idosch@idosch.org> wrote:
>
>
> Eric, with this patch I'm seeing memory leaks such as these [1][2] after
> boot. The system is using the igb driver for its management interface
> [3]. The leaks disappear after reverting the patch.
>
> Any ideas?
>
No idea, skbs allocated to send an ACK can not be stored in receive
queue, I guess this is a kmemleak false positive.
Stress your host for hours, and check if there are real kmemleaks, as
in memory being depleted.
> Let me know if you need more info. I can easily test a patch.
>
> Thanks
>
> [1]
> # cat /sys/kernel/debug/kmemleak
> unreferenced object 0xffff888170143740 (size 216):
> comm "softirq", pid 0, jiffies 4294825261 (age 95.244s)
> hex dump (first 32 bytes):
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00 00 17 0f 81 88 ff ff 00 00 00 00 00 00 00 00 ................
> backtrace:
> [<ffffffff82571fc0>] napi_skb_cache_get+0xf0/0x180
> [<ffffffff8257206a>] __napi_build_skb+0x1a/0x60
> [<ffffffff8257b0f3>] napi_build_skb+0x23/0x350
> [<ffffffffa0469592>] igb_poll+0x2b72/0x5880 [igb]
> [<ffffffff825f9584>] __napi_poll.constprop.0+0xb4/0x480
> [<ffffffff825f9d5a>] net_rx_action+0x40a/0xc60
> [<ffffffff82e00295>] __do_softirq+0x295/0x9fe
> [<ffffffff81185bcc>] __irq_exit_rcu+0x11c/0x180
> [<ffffffff8118622a>] irq_exit_rcu+0xa/0x20
> [<ffffffff82bbed39>] common_interrupt+0xa9/0xc0
> [<ffffffff82c00b5e>] asm_common_interrupt+0x1e/0x40
> [<ffffffff824a186e>] cpuidle_enter_state+0x27e/0xcb0
> [<ffffffff824a236f>] cpuidle_enter+0x4f/0xa0
> [<ffffffff8126a290>] do_idle+0x3b0/0x4b0
> [<ffffffff8126a869>] cpu_startup_entry+0x19/0x20
> [<ffffffff810f4725>] start_secondary+0x265/0x340
>
> [2]
> # cat /sys/kernel/debug/kmemleak
> unreferenced object 0xffff88810ce3aac0 (size 216):
> comm "softirq", pid 0, jiffies 4294861408 (age 64.607s)
> hex dump (first 32 bytes):
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00 c0 7b 07 81 88 ff ff 00 00 00 00 00 00 00 00 ..{.............
> backtrace:
> [<ffffffff82575539>] __alloc_skb+0x229/0x360
> [<ffffffff8290bd3c>] __tcp_send_ack.part.0+0x6c/0x760
> [<ffffffff8291a062>] tcp_send_ack+0x82/0xa0
> [<ffffffff828cb6db>] __tcp_ack_snd_check+0x15b/0xa00
> [<ffffffff828f17fe>] tcp_rcv_established+0x198e/0x2120
> [<ffffffff829363b5>] tcp_v4_do_rcv+0x665/0x9a0
> [<ffffffff8293d8ae>] tcp_v4_rcv+0x2c1e/0x32f0
> [<ffffffff828610b3>] ip_protocol_deliver_rcu+0x53/0x2c0
> [<ffffffff828616eb>] ip_local_deliver+0x3cb/0x620
> [<ffffffff8285e66f>] ip_sublist_rcv_finish+0x9f/0x2c0
> [<ffffffff82860895>] ip_list_rcv_finish.constprop.0+0x525/0x6f0
> [<ffffffff82861f88>] ip_list_rcv+0x318/0x460
> [<ffffffff825f5e61>] __netif_receive_skb_list_core+0x541/0x8f0
> [<ffffffff825f8043>] netif_receive_skb_list_internal+0x763/0xdc0
> [<ffffffff826c3025>] napi_gro_complete.constprop.0+0x5a5/0x700
> [<ffffffff826c44ed>] dev_gro_receive+0xf2d/0x23f0
> unreferenced object 0xffff888175e1afc0 (size 216):
> comm "sshd", pid 1024, jiffies 4294861424 (age 64.591s)
> hex dump (first 32 bytes):
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00 c0 7b 07 81 88 ff ff 00 00 00 00 00 00 00 00 ..{.............
> backtrace:
> [<ffffffff82575539>] __alloc_skb+0x229/0x360
> [<ffffffff8258201c>] alloc_skb_with_frags+0x9c/0x720
> [<ffffffff8255f333>] sock_alloc_send_pskb+0x7b3/0x940
> [<ffffffff82876af4>] __ip_append_data+0x1874/0x36d0
> [<ffffffff8287f283>] ip_make_skb+0x263/0x2e0
> [<ffffffff82978afa>] udp_sendmsg+0x1c8a/0x29d0
> [<ffffffff829af94e>] inet_sendmsg+0x9e/0xe0
> [<ffffffff8255082d>] __sys_sendto+0x23d/0x360
> [<ffffffff82550a31>] __x64_sys_sendto+0xe1/0x1b0
> [<ffffffff82bbde05>] do_syscall_64+0x35/0x80
> [<ffffffff82c00068>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> [3]
> # ethtool -i enp8s0
> driver: igb
> version: 5.18.0-rc3-custom-91743-g481c1b
> firmware-version: 3.25, 0x80000708, 1.1824.0
> expansion-rom-version:
> bus-info: 0000:08:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: yes
^ permalink raw reply
* Re: [RFC patch net-next 3/3] net: dsa: ksz: moved ksz9477 port mirror to ksz_common.c
From: Vladimir Oltean @ 2022-04-27 16:57 UTC (permalink / raw)
To: Arun Ramadoss
Cc: linux-kernel, netdev, Paolo Abeni, Jakub Kicinski,
David S. Miller, Florian Fainelli, Vivien Didelot, Andrew Lunn,
UNGLinuxDriver, Woojung Huh
In-Reply-To: <20220427162343.18092-4-arun.ramadoss@microchip.com>
On Wed, Apr 27, 2022 at 09:53:43PM +0530, Arun Ramadoss wrote:
> Moved the port_mirror_add and port_mirror_del function from ksz9477 to
Present tense (move)
> ksz_common, to make it generic function which can be used by KSZ9477
> based switch.
Presumably you mean "which can be used by other switches" (it can
already be used by ksz9477, so that can't be the argument for moving it)
>
> Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
> ---
Looks good, except for the spelling mistakes in the code that is being
moved (introduced in patch 1), which I expect you will update in the new
code as well.
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
> diff --git a/drivers/net/dsa/microchip/ksz_reg.h b/drivers/net/dsa/microchip/ksz_reg.h
> new file mode 100644
> index 000000000000..ccd4a6568e34
> --- /dev/null
> +++ b/drivers/net/dsa/microchip/ksz_reg.h
> @@ -0,0 +1,29 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Microchip KSZ Switch register definitions
> + *
> + * Copyright (C) 2017-2022 Microchip Technology Inc.
> + */
> +
> +#ifndef __KSZ_REGS_H
> +#define __KSZ_REGS_H
> +
> +#define REG_SW_MRI_CTRL_0 0x0370
> +
> +#define SW_IGMP_SNOOP BIT(6)
> +#define SW_IPV6_MLD_OPTION BIT(3)
> +#define SW_IPV6_MLD_SNOOP BIT(2)
> +#define SW_MIRROR_RX_TX BIT(0)
> +
> +/* 8 - Classification and Policing */
> +#define REG_PORT_MRI_MIRROR_CTRL 0x0800
> +
> +#define PORT_MIRROR_RX BIT(6)
> +#define PORT_MIRROR_TX BIT(5)
> +#define PORT_MIRROR_SNIFFER BIT(1)
> +
> +#define P_MIRROR_CTRL REG_PORT_MRI_MIRROR_CTRL
> +
> +#define S_MIRROR_CTRL REG_SW_MRI_CTRL_0
Small comment: if P_MIRROR_CTRL and S_MIRROR_CTRL are expected to be at
the same register offset for all switch families, why is there a macro
behind a macro for their addresses?
> +
> +#endif
> --
> 2.33.0
>
^ permalink raw reply
* Re: [PATCH net 1/7] secure_seq: return the full 64-bit of the siphash
From: Willy Tarreau @ 2022-04-27 16:56 UTC (permalink / raw)
To: Eric Dumazet
Cc: kernel test robot, netdev, kbuild-all, Jakub Kicinski, Moshe Kol,
Yossi Gilad, Amit Klein, LKML, Jason A . Donenfeld
In-Reply-To: <CANn89iJTg8KZvDQ2wY=psThvS5eFzv0N15FF3CTf3i6qui=wsQ@mail.gmail.com>
On Wed, Apr 27, 2022 at 09:50:06AM -0700, Eric Dumazet wrote:
> On Wed, Apr 27, 2022 at 9:35 AM Willy Tarreau <w@1wt.eu> wrote:
> >
> > On Wed, Apr 27, 2022 at 12:07:14PM +0200, Willy Tarreau wrote:
> > > On Wed, Apr 27, 2022 at 05:56:41PM +0800, kernel test robot wrote:
> > > > Hi Willy,
> > > >
> > > > I love your patch! Yet something to improve:
> > > >
> > > > [auto build test ERROR on net/master]
> > > >
> > > > url: https://github.com/intel-lab-lkp/linux/commits/Willy-Tarreau/insufficient-TCP-source-port-randomness/20220427-145651
> > > > base: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 71cffebf6358a7f5031f5b208bbdc1cb4db6e539
> > > > config: i386-randconfig-r026-20220425 (https://download.01.org/0day-ci/archive/20220427/202204271705.VrWNPv7n-lkp@intel.com/config)
> > > > compiler: gcc-11 (Debian 11.2.0-20) 11.2.0
> > > > reproduce (this is a W=1 build):
> > > > # https://github.com/intel-lab-lkp/linux/commit/01b26e522b598adf346b809075880feab3dcdc08
> > > > git remote add linux-review https://github.com/intel-lab-lkp/linux
> > > > git fetch --no-tags linux-review Willy-Tarreau/insufficient-TCP-source-port-randomness/20220427-145651
> > > > git checkout 01b26e522b598adf346b809075880feab3dcdc08
> > > > # save the config file
> > > > mkdir build_dir && cp config build_dir/.config
> > > > make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash
> > > >
> > > > If you fix the issue, kindly add following tag as appropriate
> > > > Reported-by: kernel test robot <lkp@intel.com>
> > > >
> > > > All errors (new ones prefixed by >>):
> > > >
> > > > ld: net/ipv4/inet_hashtables.o: in function `__inet_hash_connect':
> > > > >> inet_hashtables.c:(.text+0x187d): undefined reference to `__umoddi3'
> > >
> > > Argh! indeed, we spoke about using div_u64_rem() at the beginning and
> > > that one vanished over time. Will respin it.
> >
> > I fixed it, built it for i386 and x86_64, tested it on x86_64 and confirmed
> > that it still does what I need. The change is only this:
> >
> > - offset = (READ_ONCE(table_perturb[index]) + (port_offset >> 32)) % remaining;
> > + div_u64_rem(READ_ONCE(table_perturb[index]) + (port_offset >> 32), remaining, &offset);
> >
> > I'll send a v2 series in a few hours if there are no more comments.
>
> We really do not need 33 bits here.
>
> I would suggest using a 32bit divide.
>
> offset = READ_ONCE(table_perturb[index]) + (port_offset >> 32));
> offset %= remaining;
Yeah much better indeed, I'll do that. Thanks Eric!
Willy
^ permalink raw reply
* Re: [PATCH] memcg: enable accounting for veth queues
From: Jakub Kicinski @ 2022-04-27 16:58 UTC (permalink / raw)
To: Vasily Averin
Cc: Roman Gushchin, Vlastimil Babka, Shakeel Butt, kernel,
linux-kernel, Michal Hocko, cgroups, netdev, David S. Miller,
Paolo Abeni
In-Reply-To: <1c338b99-8133-6126-2ff2-94a4d3f26451@openvz.org>
On Wed, 27 Apr 2022 13:34:29 +0300 Vasily Averin wrote:
> Subject: [PATCH] memcg: enable accounting for veth queues
This is a pure networking patch, right? The prefix should be "net: ",
I think.
> veth netdevice defines own rx queues and allocates array containing
> up to 4095 ~750-bytes-long 'struct veth_rq' elements. Such allocation
> is quite huge and should be accounted to memcg.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>
^ permalink raw reply
* Re: [PATCH v2 net-next] net: generalize skb freeing deferral to per-cpu lists
From: Eric Dumazet @ 2022-04-27 17:11 UTC (permalink / raw)
To: Ido Schimmel
Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
netdev
In-Reply-To: <CANn89i+x44iM97YmGa6phMMUx6L5a3Cn86aNwq3OsbQf3iVgWA@mail.gmail.com>
On Wed, Apr 27, 2022 at 9:53 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Apr 27, 2022 at 8:34 AM Ido Schimmel <idosch@idosch.org> wrote:
> >
>
> >
> > Eric, with this patch I'm seeing memory leaks such as these [1][2] after
> > boot. The system is using the igb driver for its management interface
> > [3]. The leaks disappear after reverting the patch.
> >
> > Any ideas?
> >
>
> No idea, skbs allocated to send an ACK can not be stored in receive
> queue, I guess this is a kmemleak false positive.
>
> Stress your host for hours, and check if there are real kmemleaks, as
> in memory being depleted.
AT first when I saw your report I wondered if the following was needed,
but I do not think so. Nothing in __kfree_skb(skb) cares about skb->next.
But you might give it a try, maybe I am wrong.
diff --git a/net/core/dev.c b/net/core/dev.c
index 611bd719706412723561c27753150b27e1dc4e7a..9dc3443649b962586cc059899ac3d71e9c7a3559
100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6594,6 +6594,7 @@ static void skb_defer_free_flush(struct softnet_data *sd)
while (skb != NULL) {
next = skb->next;
+ skb_mark_not_on_list(skb);
__kfree_skb(skb);
skb = next;
}
^ permalink raw reply
* Re: [PATCH net 1/7] secure_seq: return the full 64-bit of the siphash
From: Jason A. Donenfeld @ 2022-04-27 17:18 UTC (permalink / raw)
To: Willy Tarreau
Cc: netdev, David Miller, Jakub Kicinski, Eric Dumazet, Moshe Kol,
Yossi Gilad, Amit Klein, linux-kernel
In-Reply-To: <20220427065233.2075-2-w@1wt.eu>
Hi Willy,
On Wed, Apr 27, 2022 at 08:52:27AM +0200, Willy Tarreau wrote:
> diff --git a/include/net/secure_seq.h b/include/net/secure_seq.h
> index d7d2495f83c2..5cea9ed9c773 100644
> --- a/include/net/secure_seq.h
> +++ b/include/net/secure_seq.h
> @@ -4,7 +4,7 @@
>
> #include <linux/types.h>
>
> -u32 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport);
> +u64 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport);
> u32 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr,
> __be16 dport);
> u32 secure_tcp_seq(__be32 saddr, __be32 daddr,
> diff --git a/net/core/secure_seq.c b/net/core/secure_seq.c
> index 9b8443774449..2cdd43a63f64 100644
> --- a/net/core/secure_seq.c
> +++ b/net/core/secure_seq.c
> @@ -142,7 +142,7 @@ u32 secure_tcp_seq(__be32 saddr, __be32 daddr,
> }
> EXPORT_SYMBOL_GPL(secure_tcp_seq);
>
> -u32 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport)
> +u64 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport)
> {
> net_secret_init();
> return siphash_3u32((__force u32)saddr, (__force u32)daddr,
Should you be doing the same with secure_ipv6_port_ephemeral() too? Why
the asymmetry?
Jason
^ permalink raw reply
* Re: [PATCH bpf-next v6 5/6] bpf: Add selftests for raw syncookie helpers
From: Maxim Mikityanskiy @ 2022-04-27 17:19 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Networking, Tariq Toukan, Martin KaFai Lau,
Song Liu, Yonghong Song, John Fastabend, KP Singh,
David S. Miller, Jakub Kicinski, Petar Penkov, Lorenz Bauer,
Eric Dumazet, Hideaki YOSHIFUJI, David Ahern, Shuah Khan,
Jesper Dangaard Brouer, Nathan Chancellor, Nick Desaulniers,
Joe Stringer, Florent Revest, open list:KERNEL SELFTEST FRAMEWORK,
Toke Høiland-Jørgensen, Kumar Kartikeya Dwivedi,
Florian Westphal, pabeni
In-Reply-To: <CAEf4BzZhjY+F9JYmT7k+m87UZ1qKuO8_Mjjq4CGgkr=z9BGDCg@mail.gmail.com>
On 2022-04-27 01:11, Andrii Nakryiko wrote:
> On Tue, Apr 26, 2022 at 11:29 AM Maxim Mikityanskiy <maximmi@nvidia.com> wrote:
>>
>> On 2022-04-26 09:26, Andrii Nakryiko wrote:
>>> On Mon, Apr 25, 2022 at 5:12 PM Alexei Starovoitov
>>> <alexei.starovoitov@gmail.com> wrote:
>>>>
>>>> On Fri, Apr 22, 2022 at 08:24:21PM +0300, Maxim Mikityanskiy wrote:
>>>>> +void test_xdp_synproxy(void)
>>>>> +{
>>>>> + int server_fd = -1, client_fd = -1, accept_fd = -1;
>>>>> + struct nstoken *ns = NULL;
>>>>> + FILE *ctrl_file = NULL;
>>>>> + char buf[1024];
>>>>> + size_t size;
>>>>> +
>>>>> + SYS("ip netns add synproxy");
>>>>> +
>>>>> + SYS("ip link add tmp0 type veth peer name tmp1");
>>>>> + SYS("ip link set tmp1 netns synproxy");
>>>>> + SYS("ip link set tmp0 up");
>>>>> + SYS("ip addr replace 198.18.0.1/24 dev tmp0");
>>>>> +
>>>>> + // When checksum offload is enabled, the XDP program sees wrong
>>>>> + // checksums and drops packets.
>>>>> + SYS("ethtool -K tmp0 tx off");
>>>>
>>>> BPF CI image doesn't have ethtool installed.
>>>> It will take some time to get it updated. Until then we cannot land the patch set.
>>>> Can you think of a way to run this test without shelling to ethtool?
>>>
>>> Good news: we got updated CI image with ethtool, so that shouldn't be
>>> a problem anymore.
>>>
>>> Bad news: this selftest still fails, but in different place:
>>>
>>> test_synproxy:FAIL:iptables -t raw -I PREROUTING -i tmp1 -p tcp -m tcp
>>> --syn --dport 8080 -j CT --notrack unexpected error: 512 (errno 2)
>>
>> That's simply a matter of missing kernel config options:
>>
>> CONFIG_NETFILTER_SYNPROXY=y
>> CONFIG_NETFILTER_XT_TARGET_CT=y
>> CONFIG_NETFILTER_XT_MATCH_STATE=y
>> CONFIG_IP_NF_FILTER=y
>> CONFIG_IP_NF_TARGET_SYNPROXY=y
>> CONFIG_IP_NF_RAW=y
>>
>> Shall I create a pull request on github to add these options to
>> https://github.com/libbpf/libbpf/tree/master/travis-ci/vmtest/configs?
>>
>
> Yes, please. But also for [0], that's the one that tests all the
> not-yet-applied patches
>
> [0] https://github.com/kernel-patches/vmtest/
Created pull requests:
https://github.com/kernel-patches/vmtest/pull/79
https://github.com/libbpf/libbpf/pull/490
>>> See [0].
>>>
>>> [0] https://github.com/kernel-patches/bpf/runs/6169439612?check_suite_focus=true
>>
^ permalink raw reply
* Re: [net: PATCH] net: dsa: add missing refcount decrementation
From: Vladimir Oltean @ 2022-04-27 17:25 UTC (permalink / raw)
To: Marcin Wojtas
Cc: linux-kernel, netdev, davem, kuba, andrew, vivien.didelot,
f.fainelli, upstream
In-Reply-To: <20220425094708.2769275-1-mw@semihalf.com>
On Mon, Apr 25, 2022 at 11:47:08AM +0200, Marcin Wojtas wrote:
> After obtaining the "phy-handle" node, decrementing
> refcount is required. Fix that.
>
> Fixes: a20f997010c4 ("net: dsa: Don't instantiate phylink for CPU/DSA ports unless needed")
> Signed-off-by: Marcin Wojtas <mw@semihalf.com>
> ---
> net/dsa/port.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/net/dsa/port.c b/net/dsa/port.c
> index 32d472a82241..cdc56ba11f52 100644
> --- a/net/dsa/port.c
> +++ b/net/dsa/port.c
> @@ -1620,8 +1620,10 @@ int dsa_port_link_register_of(struct dsa_port *dp)
> if (ds->ops->phylink_mac_link_down)
> ds->ops->phylink_mac_link_down(ds, port,
> MLO_AN_FIXED, PHY_INTERFACE_MODE_NA);
> + of_node_put(phy_np);
> return dsa_port_phylink_register(dp);
> }
> + of_node_put(phy_np);
> return 0;
> }
>
> --
> 2.29.0
>
Thanks for the patch.
commit fc06b2867f4cea543505acfb194c2be4ebf0c7d3
Author: Miaoqian Lin <linmq006@gmail.com>
Date: Wed Apr 20 19:04:08 2022 +0800
net: dsa: Add missing of_node_put() in dsa_port_link_register_of
The device_node pointer is returned by of_parse_phandle() with refcount
incremented. We should use of_node_put() on it when done.
of_node_put() will check for NULL value.
Fixes: a20f997010c4 ("net: dsa: Don't instantiate phylink for CPU/DSA ports unless needed")
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
^ permalink raw reply
* Re: [PATCH net-next v2 1/2] net: phy: microchip: update LAN88xx phy ID and phy ID mask.
From: Andrew Lunn @ 2022-04-27 17:30 UTC (permalink / raw)
To: Yuiko.Oshino; +Cc: Woojung.Huh, davem, netdev, Ravi.Hegde, UNGLinuxDriver
In-Reply-To: <CH0PR11MB5561E9C01C5500D6301E43728EFA9@CH0PR11MB5561.namprd11.prod.outlook.com>
> >> The current phy IDs on the available hardware.
> >> LAN8742 0x0007C130, 0x0007C131
> >> LAN88xx 0x0007C132
> >>
> >> Signed-off-by: Yuiko Oshino <yuiko.oshino@microchip.com>
> >> ---
> >> drivers/net/phy/microchip.c | 6 +++---
> >> 1 file changed, 3 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/net/phy/microchip.c b/drivers/net/phy/microchip.c
> >> index 9f1f2b6c97d4..131caf659ed2 100644
> >> --- a/drivers/net/phy/microchip.c
> >> +++ b/drivers/net/phy/microchip.c
> >> @@ -344,8 +344,8 @@ static int lan88xx_config_aneg(struct phy_device
> >> *phydev)
> >>
> >> static struct phy_driver microchip_phy_driver[] = { {
> >> - .phy_id = 0x0007c130,
> >> - .phy_id_mask = 0xfffffff0,
> >> + .phy_id = 0x0007c132,
> >> + .phy_id_mask = 0xfffffff2,
> >
> >Just so my comment on the previous version does not get lost, is this the correct
> >mask, because it is very odd. I think you really want 0xfffffffe?
> >
> > Andrew
>
> Hi Andrew,
>
> thank you for your review.
> Yes, 0xfffffff2 is correct for us.
> We would like to have bits for future revisions of the hardware without updating the driver source code in the future.
> If we use 0xfffffffe, then we need to submit a patch again when we have a new hardware revision.
This has some risks. Do you really have enough control over the
hardware people to say that:
LAN8742 0x0007C130, 0x0007C131, 0x0007C134, 0x0007C135, 0x0007C138, 0x0007C139, ...
LAN88xx 0x0007C132, 0x0007C133, 0x0007C136, 0x0007C137, 0x0007C13a, 0x0007C13b, ...
It seems safer to wait until there is new hardware and then update the
list given whatever they decide on is next.
At minimum, please add a comment in the code, otherwise you are going
to get asked, is this correct, it looks wrong?
Andrew
^ permalink raw reply
* Re: Optimizing kernel compilation / alignments for network performance
From: Rafał Miłecki @ 2022-04-27 17:31 UTC (permalink / raw)
To: Alexander Lobakin
Cc: Network Development, linux-arm-kernel, Russell King, Andrew Lunn,
Felix Fietkau, openwrt-devel@lists.openwrt.org, Florian Fainelli
In-Reply-To: <20220427125658.3127816-1-alexandr.lobakin@intel.com>
On 27.04.2022 14:56, Alexander Lobakin wrote:
> From: Rafał Miłecki <zajec5@gmail.com>
> Date: Wed, 27 Apr 2022 14:04:54 +0200
>
>> I noticed years ago that kernel changes touching code - that I don't use
>> at all - can affect network performance for me.
>>
>> I work with home routers based on Broadcom Northstar platform. Those
>> are SoCs with not-so-powerful 2 x ARM Cortex-A9 CPU cores. Main task of
>> those devices is NAT masquerade and that is what I test with iperf
>> running on two x86 machines.
>>
>> ***
>>
>> Example of such unused code change:
>> ce5013ff3bec ("mtd: spi-nor: Add support for XM25QH64A and XM25QH128A").
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce5013ff3bec05cf2a8a05c75fcd520d9914d92b
>> It lowered my NAT speed from 381 Mb/s to 367 Mb/s (-3,5%).
>>
>> I first reported that issue it in the e-mail thread:
>> ARM router NAT performance affected by random/unrelated commits
>> https://lkml.org/lkml/2019/5/21/349
>> https://www.spinics.net/lists/linux-block/msg40624.html
>>
>> Back then it was commit 5b0890a97204 ("flow_dissector: Parse batman-adv
>> unicast headers")
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9316a9ed6895c4ad2f0cde171d486f80c55d8283
>> that increased my NAT speed from 741 Mb/s to 773 Mb/s (+4,3%).
>>
>> ***
>>
>> It appears Northstar CPUs have little cache size and so any change in
>> location of kernel symbols can affect NAT performance. That explains why
>> changing unrelated code affects anything & it has been partially proven
>> aligning some of cache-v7.S code.
>>
>> My question is: is there a way to find out & force an optimal symbols
>> locations?
>
> Take a look at CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B[0]. I've been
> fighting with the same issue on some Realtek MIPS boards: random
> code changes in random kernel core parts were affecting NAT /
> network performance. This option resolved this I'd say, for the cost
> of slightly increased vmlinux size (almost no change in vmlinuz
> size).
> The only thing is that it was recently restricted to a set of
> architectures and MIPS and ARM32 are not included now lol. So it's
> either a matter of expanding the list (since it was restricted only
> because `-falign-functions=` is not supported on some architectures)
> or you can just do:
>
> make KCFLAGS=-falign-functions=64 # replace 64 with your I-cache size
>
> The actual alignment is something to play with, I stopped on the
> cacheline size, 32 in my case.
> Also, this does not provide any guarantees that you won't suffer
> from random data cacheline changes. There were some initiatives to
> introduce debug alignment of data as well, but since function are
> often bigger than 32, while variables are usually much smaller, it
> was increasing the vmlinux size by a ton (imagine each u32 variable
> occupying 32-64 bytes instead of 4). But the chance of catching this
> is much lower than to suffer from I-cache function misplacement.
Thank you Alexander, this appears to be helpful! I decided to ignore
CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B for now and just adjust CFLAGS
manually.
1. Without ce5013ff3bec and with -falign-functions=32
387 Mb/s
2. Without ce5013ff3bec and with -falign-functions=64
377 Mb/s
3. With ce5013ff3bec and with -falign-functions=32
384 Mb/s
4. With ce5013ff3bec and with -falign-functions=64
377 Mb/s
So it seems that:
1. -falign-functions=32 = pretty stable high speed
2. -falign-functions=64 = very stable slightly lower speed
I'm going to perform tests on more commits but if it stays so reliable
as above that will be a huge success for me.
^ permalink raw reply
* Re: [PATCH net-next 08/14] eth: bgnet: remove a copy of the NAPI_POLL_WEIGHT define
From: Florian Fainelli @ 2022-04-27 17:32 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: davem, pabeni, netdev, rafal, bcm-kernel-feedback-list
In-Reply-To: <20220427095350.73ffc15d@kernel.org>
On 4/27/2022 9:53 AM, Jakub Kicinski wrote:
> On Wed, 27 Apr 2022 09:09:36 -0700 Florian Fainelli wrote:
>> Looks fine, however this is a new subject prefix, do you mind using:
>>
>> net: bgmac: remove a copy of the NAPI_POLL_WEIGHT define
>
> Ayy, sorry. benet, bgmac... I'll fix when applying.
Sounds good, sorry, subjects are my favorite pet peeves.
--
Florian
^ permalink raw reply
* [PATCH net 1/1] ixgbe: ensure IPsec VF<->PF compatibility
From: Tony Nguyen @ 2022-04-27 17:31 UTC (permalink / raw)
To: davem, kuba, pabeni
Cc: Leon Romanovsky, netdev, anthony.l.nguyen, Raed Salem,
Shannon Nelson, Konrad Jankowski
From: Leon Romanovsky <leonro@nvidia.com>
The VF driver can forward any IPsec flags and such makes the function
is not extendable and prone to backward/forward incompatibility.
If new software runs on VF, it won't know that PF configured something
completely different as it "knows" only XFRM_OFFLOAD_INBOUND flag.
Fixes: eda0333ac293 ("ixgbe: add VF IPsec management")
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Shannon Nelson <snelson@pensando.io>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index e596e1a9fc75..69d11ff7677d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -903,7 +903,8 @@ int ixgbe_ipsec_vf_add_sa(struct ixgbe_adapter *adapter, u32 *msgbuf, u32 vf)
/* Tx IPsec offload doesn't seem to work on this
* device, so block these requests for now.
*/
- if (!(sam->flags & XFRM_OFFLOAD_INBOUND)) {
+ sam->flags = sam->flags & ~XFRM_OFFLOAD_IPV6;
+ if (sam->flags != XFRM_OFFLOAD_INBOUND) {
err = -EOPNOTSUPP;
goto err_out;
}
--
2.31.1
^ permalink raw reply related
* Re: [net: PATCH] net: dsa: add missing refcount decrementation
From: Marcin Wojtas @ 2022-04-27 17:38 UTC (permalink / raw)
To: Vladimir Oltean
Cc: Linux Kernel Mailing List, netdev, David S. Miller,
Jakub Kicinski, Andrew Lunn, vivien.didelot, Florian Fainelli,
upstream
In-Reply-To: <20220427172514.n4musn42dhygzbu2@skbuf>
śr., 27 kwi 2022 o 19:25 Vladimir Oltean <olteanv@gmail.com> napisał(a):
>
> On Mon, Apr 25, 2022 at 11:47:08AM +0200, Marcin Wojtas wrote:
> > After obtaining the "phy-handle" node, decrementing
> > refcount is required. Fix that.
> >
> > Fixes: a20f997010c4 ("net: dsa: Don't instantiate phylink for CPU/DSA ports unless needed")
> > Signed-off-by: Marcin Wojtas <mw@semihalf.com>
> > ---
> > net/dsa/port.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/net/dsa/port.c b/net/dsa/port.c
> > index 32d472a82241..cdc56ba11f52 100644
> > --- a/net/dsa/port.c
> > +++ b/net/dsa/port.c
> > @@ -1620,8 +1620,10 @@ int dsa_port_link_register_of(struct dsa_port *dp)
> > if (ds->ops->phylink_mac_link_down)
> > ds->ops->phylink_mac_link_down(ds, port,
> > MLO_AN_FIXED, PHY_INTERFACE_MODE_NA);
> > + of_node_put(phy_np);
> > return dsa_port_phylink_register(dp);
> > }
> > + of_node_put(phy_np);
> > return 0;
> > }
> >
> > --
> > 2.29.0
> >
>
> Thanks for the patch.
>
> commit fc06b2867f4cea543505acfb194c2be4ebf0c7d3
> Author: Miaoqian Lin <linmq006@gmail.com>
> Date: Wed Apr 20 19:04:08 2022 +0800
>
> net: dsa: Add missing of_node_put() in dsa_port_link_register_of
>
> The device_node pointer is returned by of_parse_phandle() with refcount
> incremented. We should use of_node_put() on it when done.
> of_node_put() will check for NULL value.
>
> Fixes: a20f997010c4 ("net: dsa: Don't instantiate phylink for CPU/DSA ports unless needed")
> Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
Great, I'll rebase then and I can drop my patch.
Thanks,
Marcin
^ permalink raw reply
* Re: [PATCH net-next 02/14] eth: remove NAPI_WEIGHT defines
From: Greg KH @ 2022-04-27 17:41 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, pabeni, netdev, steve.glendinning, david.kershner,
liujunqi, sparmaintainer, linux-staging
In-Reply-To: <20220427154111.529975-3-kuba@kernel.org>
On Wed, Apr 27, 2022 at 08:40:59AM -0700, Jakub Kicinski wrote:
> Defining local versions of NAPI_POLL_WEIGHT with the same
> values in the drivers just makes refactoring harder.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> CC: steve.glendinning@shawell.net
> CC: david.kershner@unisys.com
> CC: gregkh@linuxfoundation.org
> CC: liujunqi@pku.edu.cn
> CC: sparmaintainer@unisys.com
> CC: linux-staging@lists.linux.dev
> ---
> drivers/net/ethernet/smsc/smsc9420.c | 2 +-
> drivers/net/ethernet/smsc/smsc9420.h | 1 -
> drivers/staging/unisys/visornic/visornic_main.c | 4 ++--
This file is gone in my tree and in linux-next, so no need to worry
about it anymore.
thanks,
greg k-h
^ permalink raw reply
* Re: [PATCH bpf-next v5 3/8] bpf: per-cgroup lsm flavor
From: Stanislav Fomichev @ 2022-04-27 17:47 UTC (permalink / raw)
To: Martin KaFai Lau; +Cc: netdev, bpf, ast, daniel, andrii
In-Reply-To: <20220427001006.dr5dl5mocufskmvv@kafai-mbp.dhcp.thefacebook.com>
On Tue, Apr 26, 2022 at 5:10 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Tue, Apr 19, 2022 at 12:00:48PM -0700, Stanislav Fomichev wrote:
> > +static void bpf_cgroup_storages_unlink(struct bpf_cgroup_storage *storages[])
> > +{
> > + enum bpf_cgroup_storage_type stype;
> > +
> > + for_each_cgroup_storage_type(stype)
> > + bpf_cgroup_storage_unlink(storages[stype]);
> > +}
> > +
> > /* Called when bpf_cgroup_link is auto-detached from dying cgroup.
> > * It drops cgroup and bpf_prog refcounts, and marks bpf_link as defunct. It
> > * doesn't free link memory, which will eventually be done by bpf_link's
> > @@ -166,6 +256,16 @@ static void bpf_cgroup_link_auto_detach(struct bpf_cgroup_link *link)
> > link->cgroup = NULL;
> > }
> >
> > +static void bpf_cgroup_lsm_shim_release(struct bpf_prog *prog,
> > + enum cgroup_bpf_attach_type atype)
> > +{
> > + if (prog->aux->cgroup_atype < CGROUP_LSM_START ||
> > + prog->aux->cgroup_atype > CGROUP_LSM_END)
> > + return;
> > +
> > + bpf_trampoline_unlink_cgroup_shim(prog);
> > +}
> > +
> > /**
> > * cgroup_bpf_release() - put references of all bpf programs and
> > * release all cgroup bpf data
> > @@ -190,10 +290,18 @@ static void cgroup_bpf_release(struct work_struct *work)
> >
> > hlist_for_each_entry_safe(pl, pltmp, progs, node) {
> > hlist_del(&pl->node);
> > - if (pl->prog)
> > + if (pl->prog) {
> > + if (atype == BPF_LSM_CGROUP)
> > + bpf_cgroup_lsm_shim_release(pl->prog,
> > + atype);
> > bpf_prog_put(pl->prog);
> > - if (pl->link)
> > + }
> > + if (pl->link) {
> > + if (atype == BPF_LSM_CGROUP)
> > + bpf_cgroup_lsm_shim_release(pl->link->link.prog,
> > + atype);
> > bpf_cgroup_link_auto_detach(pl->link);
> > + }
> > kfree(pl);
> > static_branch_dec(&cgroup_bpf_enabled_key[atype]);
> > }
> > @@ -506,6 +614,7 @@ static int __cgroup_bpf_attach(struct cgroup *cgrp,
> > struct bpf_prog *old_prog = NULL;
> > struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE] = {};
> > struct bpf_cgroup_storage *new_storage[MAX_BPF_CGROUP_STORAGE_TYPE] = {};
> > + struct bpf_attach_target_info tgt_info = {};
> > enum cgroup_bpf_attach_type atype;
> > struct bpf_prog_list *pl;
> > struct hlist_head *progs;
> > @@ -522,9 +631,35 @@ static int __cgroup_bpf_attach(struct cgroup *cgrp,
> > /* replace_prog implies BPF_F_REPLACE, and vice versa */
> > return -EINVAL;
> >
> > - atype = to_cgroup_bpf_attach_type(type);
> > - if (atype < 0)
> > - return -EINVAL;
> > + if (type == BPF_LSM_CGROUP) {
> > + struct bpf_prog *p = prog ? : link->link.prog;
> > +
> > + if (replace_prog) {
> > + /* Reusing shim from the original program.
> > + */
> > + if (replace_prog->aux->attach_btf_id !=
> > + p->aux->attach_btf_id)
> > + return -EINVAL;
> > +
> > + atype = replace_prog->aux->cgroup_atype;
> > + } else {
> > + err = bpf_check_attach_target(NULL, p, NULL,
> > + p->aux->attach_btf_id,
> > + &tgt_info);
> > + if (err)
> > + return -EINVAL;
> > +
> > + atype = bpf_lsm_attach_type_get(p->aux->attach_btf_id);
> > + if (atype < 0)
> > + return atype;
> > + }
> > +
> > + p->aux->cgroup_atype = atype;
> > + } else {
> > + atype = to_cgroup_bpf_attach_type(type);
> > + if (atype < 0)
> > + return -EINVAL;
> > + }
> >
> > progs = &cgrp->bpf.progs[atype];
> >
> > @@ -580,13 +715,26 @@ static int __cgroup_bpf_attach(struct cgroup *cgrp,
> > if (err)
> > goto cleanup;
> >
> > + bpf_cgroup_storages_link(new_storage, cgrp, type);
> After looking between this patch 3 and the next patch 4, I can't
> quite think this through quickly, so it may be faster to ask :)
>
> I have questions on the ordering between update_effective_progs(),
> bpf_cgroup_storages_link(), and bpf_trampoline_link_cgroup_shim().
>
> Why bpf_cgroup_storages_link() has to be moved up and done before
> bpf_trampoline_link_cgroup_shim() ?
Looking at it again: I think I'm confusing bpf_cgroup_storages_assign
with bpf_cgroup_storages_link and we don't need to move the latter.
For context: my reasoning here was to make sure the prog has cgroup
storage before bpf_trampoline_link_cgroup_shim installs the actual
trampoline which might trigger the programs.
> > +
> > + if (type == BPF_LSM_CGROUP && !old_prog) {
> > + struct bpf_prog *p = prog ? : link->link.prog;
> > +
> > + err = bpf_trampoline_link_cgroup_shim(p, &tgt_info);
> update_effective_progs() was done a few lines above, so the effective[atype]
> array has the new_prog now.
>
> If bpf_trampoline_link_cgroup_shim() did fail to add the
> shim_prog to the trampoline, the new_prog will still be left in
> effective[atype]. There is no shim_prog to execute effective[].
> However, there may be places that access effective[]. e.g.
> __cgroup_bpf_query() although I think to_cgroup_bpf_attach_type()
> is not handling BPF_LSM_CGROUP now. More on __cgroup_bpf_query()
> later.
>
> Doing bpf_trampoline_link_cgroup_shim() just before activate_effective_progs() ?
Yeah, you're right, I thought that there was a cleanup path that
undoes update_effective_progs action :-( Moving link_cgroup_shim
before update_effective_progs/activate_effective_progs makes sense,
thank you!
> Have you thought about what is needed to support __cgroup_bpf_query() ?
> bpf_attach_type and cgroup_bpf_attach_type is no longer a 1:1 relationship.
> Looping through cgroup_lsm_atype_usecnt[] and output them under BPF_LSM_CGROUP ?
> Same goes for local_storage. All lsm-cgrp attaching to different
> attach_btf_id sharing one local_storage because the key is only
> cgroup-id and attach_type. Is it enough to start with that
> first and the key could be extended later with a new map_flag?
> This is related to the API.
Ugh, I think I was still under the impression that it would just work
out. But I haven't thought about __cgroup_bpf_query in the context of
the next change which breaks that 1:1 relationship. Good catch, I have
to look into that and will add a test to verify.
Regarding cgroup_id+attach_type key of local storage: maybe prohibit
that mode for BPF_LSM_CGROUP ? We have two modes: (1) keyed by
cgroup_id+attach_type and (2) keyed by cgroup_id only (and might be
shared across attach_types). The first one never made much sense to
me; the second one behaves exactly like the rest of local storages
(file/sk/etc). WDYT? (we can enable (1) if we ever decide that it's
needed)
^ permalink raw reply
* [PATCH net-next] tls: Add opt-in zerocopy mode of sendfile()
From: Maxim Mikityanskiy @ 2022-04-27 17:50 UTC (permalink / raw)
To: Jakub Kicinski, David S. Miller
Cc: Daniel Borkmann, Paolo Abeni, Boris Pismenny, Tariq Toukan,
Saeed Mahameed, Gal Pressman, netdev, Maxim Mikityanskiy
From: Boris Pismenny <borisp@nvidia.com>
TLS device offload copies sendfile data to a bounce buffer before
transmitting. It allows to maintain the valid MAC on TLS records when
the file contents change and a part of TLS record has to be
retransmitted on TCP level.
In many common use cases (like serving static files over HTTPS) the file
contents are not changed on the fly. In many use cases breaking the
connection is totally acceptable if the file is changed during
transmission, because it would be received corrupted in any case.
This commit allows to optimize performance for such use cases to
providing a new optional mode of TLS sendfile(), in which the extra copy
is skipped. Removing this copy improves performance significantly, as
TLS and TCP sendfile perform the same operations, and the only overhead
is TLS header/trailer insertion.
The new mode can only be enabled with the new socket option named
TLS_TX_ZEROCOPY_SENDFILE on per-socket basis. It preserves backwards
compatibility with existing applications that rely on the copying
behavior.
The new mode is safe, meaning that unsolicited modifications of the file
being sent can't break integrity of the kernel. The worst thing that can
happen is sending a corrupted TLS record, which is in any case not
forbidden when using regular TCP sockets.
Sockets other than TLS device offload are not affected by the new socket
option, and attempts to configure it will fail.
Performance numbers in a single-core test with 12 HTTPS streams on
nginx, under 100% CPU load:
* non-zerocopy: up to 23.5 Gbit/s
* zerocopy: up to 50.0 Gbit/s
Signed-off-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
---
Following up on the discussion [1], this is the patch that adds an
opt-in feature of zerocopy sendfile for TLS offload. Note that this
patch depends on bugfix [2], which should be applied first.
[1]: https://lore.kernel.org/netdev/DM4PR12MB5150C0ACA2781ABD70DB99E8DC0A9@DM4PR12MB5150.namprd12.prod.outlook.com/T/#u
[2]: https://lore.kernel.org/all/20220426154949.159055-1-maximmi@nvidia.com/
include/net/tls.h | 1 +
include/uapi/linux/tls.h | 1 +
net/tls/tls_device.c | 62 +++++++++++++++++++++++++++++-----------
net/tls/tls_main.c | 55 +++++++++++++++++++++++++++++++++++
4 files changed, 103 insertions(+), 16 deletions(-)
diff --git a/include/net/tls.h b/include/net/tls.h
index b59f0a63292b..fc291e2747b5 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -164,6 +164,7 @@ struct tls_record_info {
struct tls_offload_context_tx {
struct crypto_aead *aead_send;
+ bool zerocopy_sendfile;
spinlock_t lock; /* protects records list */
struct list_head records_list;
struct tls_record_info *open_record;
diff --git a/include/uapi/linux/tls.h b/include/uapi/linux/tls.h
index 5f38be0ec0f3..47989b77ebcf 100644
--- a/include/uapi/linux/tls.h
+++ b/include/uapi/linux/tls.h
@@ -39,6 +39,7 @@
/* TLS socket options */
#define TLS_TX 1 /* Set transmit parameters */
#define TLS_RX 2 /* Set receive parameters */
+#define TLS_TX_ZEROCOPY_SENDFILE 3 /* transmit zerocopy sendfile */
/* Supported versions */
#define TLS_VERSION_MINOR(ver) ((ver) & 0xFF)
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index b12f81a2b44c..715401b20c8b 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -411,10 +411,16 @@ static int tls_device_copy_data(void *addr, size_t bytes, struct iov_iter *i)
return 0;
}
+union tls_iter_offset {
+ struct iov_iter *msg_iter;
+ int offset;
+};
+
static int tls_push_data(struct sock *sk,
- struct iov_iter *msg_iter,
+ union tls_iter_offset iter_offset,
size_t size, int flags,
- unsigned char record_type)
+ unsigned char record_type,
+ struct page *zc_page)
{
struct tls_context *tls_ctx = tls_get_ctx(sk);
struct tls_prot_info *prot = &tls_ctx->prot_info;
@@ -480,15 +486,29 @@ static int tls_push_data(struct sock *sk,
}
record = ctx->open_record;
- copy = min_t(size_t, size, (pfrag->size - pfrag->offset));
- copy = min_t(size_t, copy, (max_open_record_len - record->len));
-
- if (copy) {
- rc = tls_device_copy_data(page_address(pfrag->page) +
- pfrag->offset, copy, msg_iter);
- if (rc)
- goto handle_error;
- tls_append_frag(record, pfrag, copy);
+
+ if (!zc_page) {
+ copy = min_t(size_t, size, pfrag->size - pfrag->offset);
+ copy = min_t(size_t, copy, max_open_record_len - record->len);
+
+ if (copy) {
+ rc = tls_device_copy_data(page_address(pfrag->page) +
+ pfrag->offset, copy,
+ iter_offset.msg_iter);
+ if (rc)
+ goto handle_error;
+ tls_append_frag(record, pfrag, copy);
+ }
+ } else {
+ copy = min_t(size_t, size, max_open_record_len - record->len);
+ if (copy) {
+ struct page_frag _pfrag;
+
+ _pfrag.page = zc_page;
+ _pfrag.offset = iter_offset.offset;
+ _pfrag.size = copy;
+ tls_append_frag(record, &_pfrag, copy);
+ }
}
size -= copy;
@@ -551,8 +571,8 @@ int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
goto out;
}
- rc = tls_push_data(sk, &msg->msg_iter, size,
- msg->msg_flags, record_type);
+ rc = tls_push_data(sk, (union tls_iter_offset)&msg->msg_iter, size,
+ msg->msg_flags, record_type, NULL);
out:
release_sock(sk);
@@ -564,11 +584,14 @@ int tls_device_sendpage(struct sock *sk, struct page *page,
int offset, size_t size, int flags)
{
struct tls_context *tls_ctx = tls_get_ctx(sk);
+ struct tls_offload_context_tx *ctx;
struct iov_iter msg_iter;
char *kaddr;
struct kvec iov;
int rc;
+ ctx = tls_offload_ctx_tx(tls_ctx);
+
if (flags & MSG_SENDPAGE_NOTLAST)
flags |= MSG_MORE;
@@ -580,12 +603,18 @@ int tls_device_sendpage(struct sock *sk, struct page *page,
goto out;
}
+ if (ctx->zerocopy_sendfile) {
+ rc = tls_push_data(sk, (union tls_iter_offset)offset, size,
+ flags, TLS_RECORD_TYPE_DATA, page);
+ goto out;
+ }
+
kaddr = kmap(page);
iov.iov_base = kaddr + offset;
iov.iov_len = size;
iov_iter_kvec(&msg_iter, WRITE, &iov, 1, size);
- rc = tls_push_data(sk, &msg_iter, size,
- flags, TLS_RECORD_TYPE_DATA);
+ rc = tls_push_data(sk, (union tls_iter_offset)&msg_iter, size,
+ flags, TLS_RECORD_TYPE_DATA, NULL);
kunmap(page);
out:
@@ -659,7 +688,8 @@ static int tls_device_push_pending_record(struct sock *sk, int flags)
struct iov_iter msg_iter;
iov_iter_kvec(&msg_iter, WRITE, NULL, 0, 0);
- return tls_push_data(sk, &msg_iter, 0, flags, TLS_RECORD_TYPE_DATA);
+ return tls_push_data(sk, (union tls_iter_offset)&msg_iter, 0, flags,
+ TLS_RECORD_TYPE_DATA, NULL);
}
void tls_device_write_space(struct sock *sk, struct tls_context *ctx)
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 7b2b0e7ffee4..8ef86e04f571 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -513,6 +513,31 @@ static int do_tls_getsockopt_conf(struct sock *sk, char __user *optval,
return rc;
}
+static int do_tls_getsockopt_tx_zc(struct sock *sk, char __user *optval,
+ int __user *optlen)
+{
+ struct tls_context *tls_ctx = tls_get_ctx(sk);
+ struct tls_offload_context_tx *ctx;
+ int len, value;
+
+ if (get_user(len, optlen))
+ return -EFAULT;
+
+ if (len != sizeof(value))
+ return -EINVAL;
+
+ if (!tls_ctx || tls_ctx->tx_conf != TLS_HW)
+ return -EBUSY;
+
+ ctx = tls_offload_ctx_tx(tls_ctx);
+
+ value = ctx->zerocopy_sendfile;
+ if (copy_to_user(optval, &value, sizeof(value)))
+ return -EFAULT;
+
+ return 0;
+}
+
static int do_tls_getsockopt(struct sock *sk, int optname,
char __user *optval, int __user *optlen)
{
@@ -524,6 +549,9 @@ static int do_tls_getsockopt(struct sock *sk, int optname,
rc = do_tls_getsockopt_conf(sk, optval, optlen,
optname == TLS_TX);
break;
+ case TLS_TX_ZEROCOPY_SENDFILE:
+ rc = do_tls_getsockopt_tx_zc(sk, optval, optlen);
+ break;
default:
rc = -ENOPROTOOPT;
break;
@@ -675,6 +703,28 @@ static int do_tls_setsockopt_conf(struct sock *sk, sockptr_t optval,
return rc;
}
+static int do_tls_setsockopt_tx_zc(struct sock *sk, sockptr_t optval,
+ unsigned int optlen)
+{
+ struct tls_context *tls_ctx = tls_get_ctx(sk);
+ struct tls_offload_context_tx *ctx;
+ int val;
+
+ if (!tls_ctx || tls_ctx->tx_conf != TLS_HW)
+ return -EINVAL;
+
+ if (sockptr_is_null(optval) || optlen < sizeof(val))
+ return -EINVAL;
+
+ if (copy_from_sockptr(&val, optval, sizeof(val)))
+ return -EFAULT;
+
+ ctx = tls_offload_ctx_tx(tls_ctx);
+ ctx->zerocopy_sendfile = val;
+
+ return 0;
+}
+
static int do_tls_setsockopt(struct sock *sk, int optname, sockptr_t optval,
unsigned int optlen)
{
@@ -688,6 +738,11 @@ static int do_tls_setsockopt(struct sock *sk, int optname, sockptr_t optval,
optname == TLS_TX);
release_sock(sk);
break;
+ case TLS_TX_ZEROCOPY_SENDFILE:
+ lock_sock(sk);
+ rc = do_tls_setsockopt_tx_zc(sk, optval, optlen);
+ release_sock(sk);
+ break;
default:
rc = -ENOPROTOOPT;
break;
--
2.25.1
^ permalink raw reply related
* Re: [PATCH v2 net-next] net: generalize skb freeing deferral to per-cpu lists
From: Eric Dumazet @ 2022-04-27 17:59 UTC (permalink / raw)
To: Ido Schimmel
Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
netdev
In-Reply-To: <CANn89iLue8fy-6TTEsTwzWAog-KnAcsG19up34621W8Bp+0=NQ@mail.gmail.com>
On Wed, Apr 27, 2022 at 10:11 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Apr 27, 2022 at 9:53 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, Apr 27, 2022 at 8:34 AM Ido Schimmel <idosch@idosch.org> wrote:
> > >
> >
> > >
> > > Eric, with this patch I'm seeing memory leaks such as these [1][2] after
> > > boot. The system is using the igb driver for its management interface
> > > [3]. The leaks disappear after reverting the patch.
> > >
> > > Any ideas?
> > >
> >
> > No idea, skbs allocated to send an ACK can not be stored in receive
> > queue, I guess this is a kmemleak false positive.
> >
> > Stress your host for hours, and check if there are real kmemleaks, as
> > in memory being depleted.
>
> AT first when I saw your report I wondered if the following was needed,
> but I do not think so. Nothing in __kfree_skb(skb) cares about skb->next.
>
> But you might give it a try, maybe I am wrong.
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 611bd719706412723561c27753150b27e1dc4e7a..9dc3443649b962586cc059899ac3d71e9c7a3559
> 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6594,6 +6594,7 @@ static void skb_defer_free_flush(struct softnet_data *sd)
>
> while (skb != NULL) {
> next = skb->next;
> + skb_mark_not_on_list(skb);
> __kfree_skb(skb);
> skb = next;
> }
Oh well, there is definitely a leak, sorry for that.
^ permalink raw reply
* Re: [PATCH wpan-next 08/11] net: mac802154: Add a warning in the hot path
From: Alexander Aring @ 2022-04-27 18:01 UTC (permalink / raw)
To: Miquel Raynal
Cc: Stefan Schmidt, linux-wpan - ML, David S. Miller, Jakub Kicinski,
open list:NETWORKING [GENERAL], David Girault, Romuald Despres,
Frederic Blain, Nicolas Schodet, Thomas Petazzoni
In-Reply-To: <20220427164659.106447-9-miquel.raynal@bootlin.com>
Hi,
On Wed, Apr 27, 2022 at 12:47 PM Miquel Raynal
<miquel.raynal@bootlin.com> wrote:
>
> We should never start a transmission after the queue has been stopped.
>
> But because it might work we don't kill the function here but rather
> warn loudly the user that something is wrong.
>
> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
> ---
> net/mac802154/ieee802154_i.h | 8 ++++++++
> net/mac802154/tx.c | 2 ++
> net/mac802154/util.c | 18 ++++++++++++++++++
> 3 files changed, 28 insertions(+)
>
> diff --git a/net/mac802154/ieee802154_i.h b/net/mac802154/ieee802154_i.h
> index b55fdefb0b34..cb61a4abaf37 100644
> --- a/net/mac802154/ieee802154_i.h
> +++ b/net/mac802154/ieee802154_i.h
> @@ -178,6 +178,14 @@ bool ieee802154_queue_is_held(struct ieee802154_local *local);
> */
> void ieee802154_stop_queue(struct ieee802154_local *local);
>
> +/**
> + * ieee802154_queue_is_stopped - check whether ieee802154 queue was stopped
> + * @local: main mac object
> + *
> + * Goes through all the interfaces and indicates if they are all stopped or not.
> + */
> +bool ieee802154_queue_is_stopped(struct ieee802154_local *local);
> +
> /* MIB callbacks */
> void mac802154_dev_set_page_channel(struct net_device *dev, u8 page, u8 chan);
>
> diff --git a/net/mac802154/tx.c b/net/mac802154/tx.c
> index a8a83f0167bf..021dddfea542 100644
> --- a/net/mac802154/tx.c
> +++ b/net/mac802154/tx.c
> @@ -124,6 +124,8 @@ bool ieee802154_queue_is_held(struct ieee802154_local *local)
> static netdev_tx_t
> ieee802154_hot_tx(struct ieee802154_local *local, struct sk_buff *skb)
> {
> + WARN_ON_ONCE(ieee802154_queue_is_stopped(local));
> +
> return ieee802154_tx(local, skb);
> }
>
> diff --git a/net/mac802154/util.c b/net/mac802154/util.c
> index 847e0864b575..cfd17a7db532 100644
> --- a/net/mac802154/util.c
> +++ b/net/mac802154/util.c
> @@ -44,6 +44,24 @@ void ieee802154_stop_queue(struct ieee802154_local *local)
> rcu_read_unlock();
> }
>
> +bool ieee802154_queue_is_stopped(struct ieee802154_local *local)
> +{
> + struct ieee802154_sub_if_data *sdata;
> + bool stopped = true;
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(sdata, &local->interfaces, list) {
> + if (!sdata->dev)
> + continue;
> +
> + if (!netif_queue_stopped(sdata->dev))
> + stopped = false;
> + }
> + rcu_read_unlock();
> +
> + return stopped;
> +}
sorry this makes no sense, you using net core functionality to check
if a queue is stopped in a net core netif callback. Whereas the sense
here for checking if the queue is really stopped is when 802.15.4
thinks the queue is stopped vs net core netif callback running. It
means for MLME-ops there are points we want to make sure that net core
is not handling any xmit and we should check this point and not
introducing net core functionality checks. btw: if it's hit your if
branch the first time you can break?
I am not done with the review, this is just what I see now and we can
discuss that. Please be patient.
- Alex
^ permalink raw reply
* Re: [PATCH net] usbnet: smsc95xx: Fix deadlock on runtime resume
From: Alan Stern @ 2022-04-27 18:19 UTC (permalink / raw)
To: Lukas Wunner
Cc: Andrew Lunn, Steve Glendinning, UNGLinuxDriver, Oliver Neukum,
David S. Miller, Jakub Kicinski, Paolo Abeni, netdev, linux-usb,
Andre Edich, Oleksij Rempel, Martyn Welch, Gabriel Hojda,
Christoph Fritz, Lino Sanfilippo, Philipp Rosenberger,
Heiner Kallweit, Russell King
In-Reply-To: <20220427153851.GA15329@wunner.de>
On Wed, Apr 27, 2022 at 05:38:51PM +0200, Lukas Wunner wrote:
> On Wed, Apr 27, 2022 at 05:24:50PM +0200, Andrew Lunn wrote:
> > On Wed, Apr 27, 2022 at 08:41:49AM +0200, Lukas Wunner wrote:
> > > Commit 05b35e7eb9a1 ("smsc95xx: add phylib support") amended
> > > smsc95xx_resume() to call phy_init_hw(). That function waits for the
> > > device to runtime resume even though it is placed in the runtime resume
> > > path, causing a deadlock.
> >
> > You have looked at this code, tried a few different things, so this is
> > probably a dumb question.
> >
> > Do you actually need to call phy_init_hw()?
> >
> > mdio_bus_phy_resume() will call phy_init_hw(). So long as you first
> > resume the MAC and then the PHY, shouldn't this just work?
>
> mdio_bus_phy_resume() is only called for system sleep. But this is about
> *runtime* PM.
>
> mdio_bus_phy_pm_ops does not define runtime PM callbacks. So runtime PM
> is disabled on PHYs, no callback is invoked for them when the MAC runtime
> suspends, hence the onus is on the MAC to runtime suspend the PHY (which
> is a child of the MAC). Same on runtime resume.
>
> Let's say I enable runtime PM on the PHY and use pm_runtime_force_suspend()
> to suspend the PHY from the MAC's ->runtime_suspend callback. At that
> point the MAC already has status RPM_SUSPENDING. Boom, deadlock.
>
> The runtime PM core lacks the capability to declare that children should
> be force runtime suspended before a device can runtime suspend, that's
> the problem.
This might work out if you copy the scheme we use for USB devices and
interfaces.
A USB interface is only a logical part of its parent device, and as such
does not have a separate runtime power state of its own (in USB-2, at
least). Therefore the USB core calls pm_runtime_no_callbacks() for each
interface as it is created, and handles the runtime power management for
the interface (i.e., invoking the interface driver's runtime_suspend and
runtime_resume callbacks) from within the device's runtime PM routines
-- independent of the PM core's notion of what the interface's power
state should be.
Similarly, you could call pm_runtime_no_callbacks() for the PHY when it
is created, and manage the PHY's actual power state from within the
MAC's runtime-PM routines directly (i.e., without going through the PM
core).
Alan Stern
^ permalink raw reply
* Re: [PATCH perf/core 1/5] libbpf: Add bpf_program__set_insns function
From: Andrii Nakryiko @ 2022-04-27 18:47 UTC (permalink / raw)
To: Jiri Olsa
Cc: Daniel Borkmann, Jiri Olsa, Arnaldo Carvalho de Melo,
Alexei Starovoitov, Andrii Nakryiko, linux-perf-use., Networking,
bpf, Ingo Molnar, Namhyung Kim, Alexander Shishkin,
Peter Zijlstra, Martin KaFai Lau, Song Liu, Yonghong Song,
John Fastabend, Ian Rogers
In-Reply-To: <YmkB6XxM6avEZdSf@krava>
On Wed, Apr 27, 2022 at 1:42 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Tue, Apr 26, 2022 at 08:58:12AM -0700, Andrii Nakryiko wrote:
> > On Mon, Apr 25, 2022 at 11:57 PM Jiri Olsa <olsajiri@gmail.com> wrote:
> > >
> > > On Mon, Apr 25, 2022 at 11:19:09PM -0700, Andrii Nakryiko wrote:
> > > > On Mon, Apr 25, 2022 at 9:22 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > > > >
> > > > > On 4/22/22 12:00 PM, Jiri Olsa wrote:
> > > > > > Adding bpf_program__set_insns that allows to set new
> > > > > > instructions for program.
> > > > > >
> > > > > > Also moving bpf_program__attach_kprobe_multi_opts on
> > > > > > the proper name sorted place in map file.
> > > >
> > > > would make sense to fix it as a separate patch, it has nothing to do
> > > > with bpf_program__set_insns() API itself
> > >
> > > np
> > >
> > > >
> > > > > >
> > > > > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > > > > ---
> > > > > > tools/lib/bpf/libbpf.c | 8 ++++++++
> > > > > > tools/lib/bpf/libbpf.h | 12 ++++++++++++
> > > > > > tools/lib/bpf/libbpf.map | 3 ++-
> > > > > > 3 files changed, 22 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > > > > > index 809fe209cdcc..284790d81c1b 100644
> > > > > > --- a/tools/lib/bpf/libbpf.c
> > > > > > +++ b/tools/lib/bpf/libbpf.c
> > > > > > @@ -8457,6 +8457,14 @@ size_t bpf_program__insn_cnt(const struct bpf_program *prog)
> > > > > > return prog->insns_cnt;
> > > > > > }
> > > > > >
> > > > > > +void bpf_program__set_insns(struct bpf_program *prog,
> > > > > > + struct bpf_insn *insns, size_t insns_cnt)
> > > > > > +{
> > > > > > + free(prog->insns);
> > > > > > + prog->insns = insns;
> > > > > > + prog->insns_cnt = insns_cnt;
> > > >
> > > > let's not store user-provided pointer here. Please realloc prog->insns
> > > > as necessary and copy over insns into it.
> > > >
> > > > Also let's at least add the check for prog->loaded and return -EBUSY
> > > > in such a case. And of course this API should return int, not void.
> > >
> > > ok, will change
> > >
> > > >
> > > > > > +}
> > > > > > +
> > > > > > int bpf_program__set_prep(struct bpf_program *prog, int nr_instances,
> > > > > > bpf_program_prep_t prep)
> > > > > > {
> > > > > > diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> > > > > > index 05dde85e19a6..b31ad58d335f 100644
> > > > > > --- a/tools/lib/bpf/libbpf.h
> > > > > > +++ b/tools/lib/bpf/libbpf.h
> > > > > > @@ -323,6 +323,18 @@ struct bpf_insn;
> > > > > > * different.
> > > > > > */
> > > > > > LIBBPF_API const struct bpf_insn *bpf_program__insns(const struct bpf_program *prog);
> > > > > > +
> > > > > > +/**
> > > > > > + * @brief **bpf_program__set_insns()** can set BPF program's underlying
> > > > > > + * BPF instructions.
> > > > > > + * @param prog BPF program for which to return instructions
> > > > > > + * @param insn a pointer to an array of BPF instructions
> > > > > > + * @param insns_cnt number of `struct bpf_insn`'s that form
> > > > > > + * specified BPF program
> > > > > > + */
> > > >
> > > > This API makes me want to cry... but I can't come up with anything
> > > > better for perf's use case.
> > > >
> >
> > So thinking about this some more. If we make libbpf not to close maps
> > and prog FDs on BPF program load failure automatically and instead
> > doing it in bpf_object__close(), which would seem to be a totally fine
> > semantics and won't break any reasonable application as they always
> > have to call bpf_object__close() anyways to clean up all the
> > resources; we wouldn't need this horror of bpf_program__set_insns().
> > Your BPF program would fail to load, but you'll get its fully prepared
> > instructions with bpf_program__insns(), then you can just append
> > correct preamble. Meanwhile, all the maps will be created (they are
> > always created before the first program load), so all the FDs will be
> > correct.
> >
> > This is certainly advanced knowledge of libbpf behavior, but the use
> > case you are trying to solve is also very unique and advanced (and I
> > wouldn't recommend anyone trying to do this anyways). WDYT? Would that
> > work?
>
> hm, so verifier will fail after all maps are set up during the walk
> of the program instructions.. I guess that could work, I'll give it
> a try, should be easy change in libbpf (like below) and also in perf
>
> but still the bpf_program__set_insns seems less horror to me
let's keep set_insns API then, but please do add a lot of warnings
into the description to make it very-very scary :)
>
> jirka
>
>
> ---
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index c8df74e5f658..1eb75d4231ff 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -7577,19 +7577,6 @@ static int bpf_object_load(struct bpf_object *obj, int extra_log_level, const ch
> obj->btf_vmlinux = NULL;
>
> obj->loaded = true; /* doesn't matter if successfully or not */
> -
> - if (err)
> - goto out;
> -
> - return 0;
> -out:
> - /* unpin any maps that were auto-pinned during load */
> - for (i = 0; i < obj->nr_maps; i++)
> - if (obj->maps[i].pinned && !obj->maps[i].reused)
> - bpf_map__unpin(&obj->maps[i], NULL);
> -
> - bpf_object_unload(obj);
> - pr_warn("failed to load object '%s'\n", obj->path);
> return libbpf_err(err);
> }
>
^ permalink raw reply
* Re: [PATCH bpf-next 2/4] libbpf: Add helpers for pinning bpf prog through bpf object skeleton
From: Andrii Nakryiko @ 2022-04-27 18:51 UTC (permalink / raw)
To: Yafang Shao
Cc: Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko, Martin Lau,
Song Liu, Yonghong Song, john fastabend, KP Singh, netdev, bpf
In-Reply-To: <CALOAHbB079w_KrJGP8ABJyBQd1HghP4Xza1mDwaJV6bX-=SHwA@mail.gmail.com>
On Wed, Apr 27, 2022 at 8:48 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Wed, Apr 27, 2022 at 10:45 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Wed, Apr 27, 2022 at 7:16 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Tue, Apr 26, 2022 at 8:59 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Mon, Apr 25, 2022 at 9:57 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > > > >
> > > > > On 4/23/22 4:00 PM, Yafang Shao wrote:
> > > > > > Currently there're helpers for allowing to open/load/attach BPF object
> > > > > > through BPF object skeleton. Let's also add helpers for pinning through
> > > > > > BPF object skeleton. It could simplify BPF userspace code which wants to
> > > > > > pin the progs into bpffs.
> > > > >
> > > > > Please elaborate some more on your use case/rationale for the commit message,
> > > > > do you have orchestration code that will rely on these specifically?
> > > > >
> > > >
> > > > We have a bpf manager on our production environment to maintain the
> > > > bpf programs, some of which need to be pinned in bpffs, for example
> > > > tracing bpf programs, perf_event programs and other bpf hooks added by
> > > > ourselves for performance tuning. These bpf programs don't need a
> > > > user agent, while they really work like a kernel module, that is why
> > > > we pin them. For these kinds of bpf programs, the bpf manager can help
> > > > to simplify the development and deployment. Take the improvement on
> > > > development for example, the user doesn't need to write userspace
> > > > code while he focuses on the kernel side only, and then bpf manager
> > > > will do all the other things. Below is a simple example,
> > > > Step1, gen the skeleton for the user provided bpf object file,
> > > > $ bpftool gen skeleton test.bpf.o > simple.skel.h
> > > > Step2, Compile the bpf object file into a runnable binary
> > > > #include "simple.skel.h"
> > > >
> > > > #define SIMPLE_BPF_PIN(name, path) \
> > > > ({ \
> > > > struct name##_bpf *obj; \
> > > > int err = 0; \
> > > > \
> > > > obj = name##_bpf__open(); \
> > > > if (!obj) { \
> > > > err = -errno; \
> > > > goto cleanup; \
> > > > } \
> > > > \
> > > > err = name##_bpf__load(obj); \
> > > > if (err) \
> > > > goto cleanup; \
> > > > \
> > > > err = name##_bpf__attach(obj); \
> > > > if (err) \
> > > > goto cleanup; \
> > > > \
> > > > err = name##_bpf__pin_prog(obj, path); \
> > > > if (err) \
> > > > goto cleanup; \
> > > > \
> > > > goto end; \
> > > > \
> > > > cleanup: \
> > > > name##_bpf__destroy(obj); \
> > > > end: \
> > > > err; \
> > > > })
> > > >
> > > > SIMPLE_BPF_PIN(test, "/sys/fs/bpf");
> > > >
> > > > As the userspace code of FD-based bpf objects are all
> > > > the same, so we can abstract them as above. The pathset means to add
> > > > the non-exist "name##_bpf__pin_prog(obj, path)" for it.
> > > >
> > >
> > > Your BPF manager is user-space code that you control, right? I'm not
> > > sure how skeleton is helpful here given your BPF manager is generic
> > > and doesn't work with any specific skeleton, if I understand the idea.
> > > But let's assume that you use skeleton to also embed BPF ELF bytes and
> > > pass them to your manager for "activation". Once you open and load
> > > bpf_object, your BPF manager can generically iterate all BPF programs
> > > using bpf_object_for_each_program(), attempt to attach them with
> > > bpf_program__attach() (see how bpf_object__attach_skeleton is handling
> > > non-auto-attachable programs) and immediately pin the link (no need to
> > > even store it, you can destroy it after pinning immediately). All this
> > > is using generic libbpf APIs and requires no code generation.
> >
> > Many thanks for the detailed explanation. Your suggestion can also
> > work, but with the skeleton we can also generate a binary which can
> > run independently. (Technically speaking, the binary is the same as
> > './bpf_install target.bpf.o').
> >
>
> Forgot to mention that with skeleton we can also modify the global
> data defined in bpf object file, that may need to be abstracted as a
> new common helper. The bpf_object__* functions can't do it, right ?
I must be missing something because I don't see how you can have
code-generated skeleton and generic BPF manager at the same time. I'm
not saying don't use skeleton, I'm saying you can write this link
pinning code yourself and reuse it in your applications. You can get
access to struct bpf_object through skel->obj.
>
> > > But keep
> > > in mind that not all struct bpf_link in libbpf are pinnable (not all
> > > links have FD-based BPF link in kernel associated with them), so
> > > you'll have to deal with that somehow (and what you didn't do in this
> > > patch for libbpf implementation).
> > >
> >
> > Right, I have found it. If I understand it correctly, only the link
> > types defined in enum bpf_link_type (which is in
> > include/uapi/linux/bpf.h) are pinnable, right?
> >
It's more complicated. For kprobe/tracepoint, for example, depending
on host kernel version it could be a "fake" libbpf-side-only link, or
it could be a proper kernel object backing it. So as always, it
depends.
> > BTW, is it possible to support pinning all struct bpf_link in libbpf ?
No, it depends on kernel support, libbpf can't do much about this.
> >
> >
> > --
> > Regards
> > Yafang
>
>
>
> --
> Regards
> Yafang
^ permalink raw reply
* Re: [PATCH bpf-next 08/11] samples: bpf: fix shifting unsigned long by 32 positions
From: Andrii Nakryiko @ 2022-04-27 18:53 UTC (permalink / raw)
To: Yonghong Song
Cc: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Martin KaFai Lau, Song Liu,
John Fastabend, KP Singh, David S. Miller, Jakub Kicinski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Jonathan Lemon, Nathan Chancellor, Nick Desaulniers,
Dmitrii Dolgov, Quentin Monnet, Tiezhu Yang,
Kumar Kartikeya Dwivedi, Chenbo Feng, Willem de Bruijn,
Daniel Wagner, Thomas Graf, Ong Boon Leong, linux-perf-use.,
open list, Networking, bpf, llvm
In-Reply-To: <05d21d85-7b59-a8f9-73dc-89189986db11@fb.com>
On Wed, Apr 27, 2022 at 8:55 AM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 4/20/22 10:18 AM, Andrii Nakryiko wrote:
> > On Thu, Apr 14, 2022 at 3:46 PM Alexander Lobakin <alobakin@pm.me> wrote:
> >>
> >> On 32 bit systems, shifting an unsigned long by 32 positions
> >> yields the following warning:
> >>
> >> samples/bpf/tracex2_kern.c:60:23: warning: shift count >= width of type [-Wshift-count-overflow]
> >> unsigned int hi = v >> 32;
> >> ^ ~~
> >>
> >
> > long is always 64-bit in BPF, but I suspect this is due to
> > samples/bpf/Makefile still using this clang + llc combo, where clang
> > is called with native target and llc for -target bpf. Not sure if we
> > are ready to ditch that complicated combination. Yonghong, do we still
> > need that or can we just use -target bpf in samples/bpf?
>
> Current most bpf programs in samples/bpf do not use vmlinux.h and CO-RE.
> They direct use kernel header files. That is why clang C -> IR
> compilation still needs to be native.
>
> We could just use -target bpf for the whole compilation but that needs
> to change the code to use vmlinux.h and CO-RE. There are already a
> couple of sample bpf programs did this.
Right, I guess I'm proposing to switch samples/bpf to vmlinux.h. Only
purely networking BPF apps can get away with not using vmlinux.h
because they might avoid dependency on kernel types. But even then a
lot of modern networking apps seem to be gaining elements of more
generic tracing and would rely on CO-RE for staying "portable" between
kernels. So it might be totally fine to just use CO-RE universally in
samples/bpf?
>
> >
> >
> >> The usual way to avoid this is to shift by 16 two times (see
> >> upper_32_bits() macro in the kernel). Use it across the BPF sample
> >> code as well.
> >>
> >> Fixes: d822a1926849 ("samples/bpf: Add counting example for kfree_skb() function calls and the write() syscall")
> >> Fixes: 0fb1170ee68a ("bpf: BPF based latency tracing")
> >> Fixes: f74599f7c530 ("bpf: Add tests and samples for LWT-BPF")
> >> Signed-off-by: Alexander Lobakin <alobakin@pm.me>
> >> ---
> >> samples/bpf/lathist_kern.c | 2 +-
> >> samples/bpf/lwt_len_hist_kern.c | 2 +-
> >> samples/bpf/tracex2_kern.c | 2 +-
> >> 3 files changed, 3 insertions(+), 3 deletions(-)
> >>
> >
> > [...]
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox