Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] net: linkwatch: ignore events for unregistered netdevs
From: Eric Dumazet @ 2022-04-25 15:23 UTC (permalink / raw)
  To: Jann Horn
  Cc: Jakub Kicinski, Lukas Wunner, Paolo Abeni, Oliver Neukum,
	David S. Miller, Oleksij Rempel, netdev, USB list, Andrew Lunn,
	Jacky Chou, Willy Tarreau, Lino Sanfilippo, Philipp Rosenberger,
	Heiner Kallweit, Greg Kroah-Hartman
In-Reply-To: <CAG48ez0nw7coDXYozaUOTThWLkHWZuKVUpMosY2hgVSSfeM4Pw@mail.gmail.com>

On Mon, Apr 25, 2022 at 8:19 AM Jann Horn <jannh@google.com> wrote:
>
> On Mon, Apr 25, 2022 at 5:13 PM Eric Dumazet <edumazet@google.com> wrote:
> > On Mon, Apr 25, 2022 at 8:01 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > On Mon, 25 Apr 2022 16:49:34 +0200 Jann Horn wrote:
> > > > > Doesn't mean we should make it legal. We can add a warning to catch
> > > > > abuses.
> > > >
> > > > That was the idea with
> > > > https://lore.kernel.org/netdev/20220128014303.2334568-1-jannh@google.com/,
> > > > but I didn't get any replies when I asked what the precise semantics
> > > > of dev_hold() are supposed to be
> > > > (https://lore.kernel.org/netdev/CAG48ez1-OyZETvrYAfaHicYW1LbrQUVp=C0EukSWqZrYMej73w@mail.gmail.com/),
> > > > so I don't know how to proceed...
> > >
> > > Yeah, I think after you pointed out that the netdev per cpu refcounting
> > > is fundamentally broken everybody decided to hit themselves with the
> > > obliviate spell :S
> >
> > dev_hold() has been an increment of a refcount, and dev_put() a decrement.
> >
> > Not sure why it is fundamentally broken.
>
> Well, it's not quite a refcount. It's a count that can be incremented
> and decremented but can't be read while the device is alive, and then
> at some point it turns into a count that can be read and decremented
> but can't be incremented (see
> https://lore.kernel.org/netdev/CAG48ez1-OyZETvrYAfaHicYW1LbrQUVp=C0EukSWqZrYMej73w@mail.gmail.com/).
> Normal refcounts allow anyone who is holding a reference to add
> another reference.

On a live netdev nothing wants to read the 'current refcount'.
We basically do not care.

>
> > There are specific steps at device dismantles making sure no more
> > users can dev_hold()
>
> So you're saying it's intentional that even if you're already holding
> a dev_hold() reference, you may not be allowed to call dev_hold()
> again?

I think you can/should not.
We might add a test in dev_hold() and catch offenders.

Then add a new api, (dev_hold() is void and can not propagate an
error), and eventually
fix offenders.

^ permalink raw reply

* [PATCH net] net: dsa: lantiq_gswip: Don't set GSWIP_MII_CFG_RMII_CLK
From: Martin Blumenstingl @ 2022-04-25 15:20 UTC (permalink / raw)
  To: netdev
  Cc: hauke, linux-kernel, andrew, vivien.didelot, olteanv, davem, kuba,
	pabeni, Martin Blumenstingl, stable, Jan Hoffmann

Commit 4b5923249b8fa4 ("net: dsa: lantiq_gswip: Configure all remaining
GSWIP_MII_CFG bits") added all known bits in the GSWIP_MII_CFGp
register. It helped bring this register into a well-defined state so the
driver has to rely less on the bootloader to do things right.
Unfortunately it also sets the GSWIP_MII_CFG_RMII_CLK bit without any
possibility to configure it. Upon further testing it turns out that all
boards which are supported by the GSWIP driver in OpenWrt which use an
RMII PHY have a dedicated oscillator on the board which provides the
50MHz RMII reference clock.

Don't set the GSWIP_MII_CFG_RMII_CLK bit (but keep the code which always
clears it) to fix support for the Fritz!Box 7362 SL in OpenWrt. This is
a board with two Atheros AR8030 RMII PHYs. With the "RMII clock" bit set
the MAC also generates the RMII reference clock whose signal then
conflicts with the signal from the oscillator on the board. This results
in a constant cycle of the PHY detecting link up/down (and as a result
of that: the two ports using the AR8030 PHYs are not working).

At the time of writing this patch there's no known board where the MAC
(GSWIP) has to generate the RMII reference clock. If needed this can be
implemented in future by providing a device-tree flag so the
GSWIP_MII_CFG_RMII_CLK bit can be toggled per port.

Fixes: 4b5923249b8fa4 ("net: dsa: lantiq_gswip: Configure all remaining GSWIP_MII_CFG bits")
Cc: stable@vger.kernel.org
Tested-by: Jan Hoffmann <jan@3e8.eu>
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
---
 drivers/net/dsa/lantiq_gswip.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/net/dsa/lantiq_gswip.c b/drivers/net/dsa/lantiq_gswip.c
index a416240d001b..12c15da55664 100644
--- a/drivers/net/dsa/lantiq_gswip.c
+++ b/drivers/net/dsa/lantiq_gswip.c
@@ -1681,9 +1681,6 @@ static void gswip_phylink_mac_config(struct dsa_switch *ds, int port,
 		break;
 	case PHY_INTERFACE_MODE_RMII:
 		miicfg |= GSWIP_MII_CFG_MODE_RMIIM;
-
-		/* Configure the RMII clock as output: */
-		miicfg |= GSWIP_MII_CFG_RMII_CLK;
 		break;
 	case PHY_INTERFACE_MODE_RGMII:
 	case PHY_INTERFACE_MODE_RGMII_ID:
-- 
2.36.0

^ permalink raw reply related

* Re: [PATCH] net: linkwatch: ignore events for unregistered netdevs
From: Jann Horn @ 2022-04-25 15:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jakub Kicinski, Lukas Wunner, Paolo Abeni, Oliver Neukum,
	David S. Miller, Oleksij Rempel, netdev, USB list, Andrew Lunn,
	Jacky Chou, Willy Tarreau, Lino Sanfilippo, Philipp Rosenberger,
	Heiner Kallweit, Greg Kroah-Hartman
In-Reply-To: <CANn89iLwvqUJHBNifLESJyBQ85qjK42sK85Fs=QV4M7HqUXmxQ@mail.gmail.com>

On Mon, Apr 25, 2022 at 5:13 PM Eric Dumazet <edumazet@google.com> wrote:
> On Mon, Apr 25, 2022 at 8:01 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Mon, 25 Apr 2022 16:49:34 +0200 Jann Horn wrote:
> > > > Doesn't mean we should make it legal. We can add a warning to catch
> > > > abuses.
> > >
> > > That was the idea with
> > > https://lore.kernel.org/netdev/20220128014303.2334568-1-jannh@google.com/,
> > > but I didn't get any replies when I asked what the precise semantics
> > > of dev_hold() are supposed to be
> > > (https://lore.kernel.org/netdev/CAG48ez1-OyZETvrYAfaHicYW1LbrQUVp=C0EukSWqZrYMej73w@mail.gmail.com/),
> > > so I don't know how to proceed...
> >
> > Yeah, I think after you pointed out that the netdev per cpu refcounting
> > is fundamentally broken everybody decided to hit themselves with the
> > obliviate spell :S
>
> dev_hold() has been an increment of a refcount, and dev_put() a decrement.
>
> Not sure why it is fundamentally broken.

Well, it's not quite a refcount. It's a count that can be incremented
and decremented but can't be read while the device is alive, and then
at some point it turns into a count that can be read and decremented
but can't be incremented (see
https://lore.kernel.org/netdev/CAG48ez1-OyZETvrYAfaHicYW1LbrQUVp=C0EukSWqZrYMej73w@mail.gmail.com/).
Normal refcounts allow anyone who is holding a reference to add
another reference.

> There are specific steps at device dismantles making sure no more
> users can dev_hold()

So you're saying it's intentional that even if you're already holding
a dev_hold() reference, you may not be allowed to call dev_hold()
again?

^ permalink raw reply

* Re: [PATCH] net: linkwatch: ignore events for unregistered netdevs
From: Eric Dumazet @ 2022-04-25 15:13 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jann Horn, Lukas Wunner, Paolo Abeni, Oliver Neukum,
	David S. Miller, Oleksij Rempel, netdev, USB list, Andrew Lunn,
	Jacky Chou, Willy Tarreau, Lino Sanfilippo, Philipp Rosenberger,
	Heiner Kallweit, Greg Kroah-Hartman
In-Reply-To: <20220425080057.0fc4ef66@kernel.org>

On Mon, Apr 25, 2022 at 8:01 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 25 Apr 2022 16:49:34 +0200 Jann Horn wrote:
> > > Doesn't mean we should make it legal. We can add a warning to catch
> > > abuses.
> >
> > That was the idea with
> > https://lore.kernel.org/netdev/20220128014303.2334568-1-jannh@google.com/,
> > but I didn't get any replies when I asked what the precise semantics
> > of dev_hold() are supposed to be
> > (https://lore.kernel.org/netdev/CAG48ez1-OyZETvrYAfaHicYW1LbrQUVp=C0EukSWqZrYMej73w@mail.gmail.com/),
> > so I don't know how to proceed...
>
> Yeah, I think after you pointed out that the netdev per cpu refcounting
> is fundamentally broken everybody decided to hit themselves with the
> obliviate spell :S

dev_hold() has been an increment of a refcount, and dev_put() a decrement.

Not sure why it is fundamentally broken.

There are specific steps at device dismantles making sure no more
users can dev_hold()

It is a contract. Any buggy layer can overwrite any piece of memory,
including a refcount_t.

Traditionally we could not add a test in dev_hold() to prevent an
increment if the device is in dismantle phase.
Maybe the situation is better nowadays.

^ permalink raw reply

* Re: net: stmmac: dwmac-imx: half duplex crash
From: Marcel Ziswiler @ 2022-04-25 15:06 UTC (permalink / raw)
  To: andrew@lunn.ch
  Cc: linux-imx@nxp.com, peppe.cavallaro@st.com,
	linux-stm32@st-md-mailman.stormreply.com, davem@davemloft.net,
	linux-kernel@vger.kernel.org, pabeni@redhat.com,
	shawnguo@kernel.org, joabreu@synopsys.com, kernel@pengutronix.de,
	s.hauer@pengutronix.de, kuba@kernel.org,
	alexandre.torgue@foss.st.com, mcoquelin.stm32@gmail.com,
	linux-arm-kernel@lists.infradead.org, netdev@vger.kernel.org,
	festevam@gmail.com
In-Reply-To: <YmXIo6q8vVkL6zLp@lunn.ch>

Hi Andrew

Thanks for your help!

On Mon, 2022-04-25 at 00:01 +0200, Andrew Lunn wrote:
> On Sat, Apr 23, 2022 at 10:58:07PM +0000, Marcel Ziswiler wrote:
> > Hi there
> > 
> > We lately tried operating the IMX8MPEVK ENET_QOS imx-dwmac driver in half-duplex modes which crashes as
> > follows:
> 
> How many transmit queues do you have in operation:

Looks like NXP defaults to 5 RX and TX queues each being setup via device tree [1]. Unfortunately, upon boot
the driver only reports the RX side of things:

[    7.239604] imx-dwmac 30bf0000.ethernet eth1: Register MEM_TYPE_PAGE_POOL RxQ-0

>        /* Half-Duplex can only work with single queue */
>         if (priv->plat->tx_queues_to_use > 1)
>                 priv->phylink_config.mac_capabilities &=
>                         ~(MAC_10HD | MAC_100HD | MAC_1000HD);
> 
> What does ethtool show before you force it? Does it actually list the
> HALF modes?

Good point. I was blinded by NXP downstream which, while listing all incl. 10baseT/Half and 100baseT/Half as
supported link modes, also does not work. However, upstream indeed shows only full-duplex modes as supported:

root@verdin-imx8mp-07106916:~# ethtool eth1
Settings for eth1:
        Supported ports: [ TP MII ]
        Supported link modes:   10baseT/Full 
                                100baseT/Full 
                                1000baseT/Full 
...

Once I remove them queues being setup via device tree it shows all modes as supported again:

root@verdin-imx8mp-07106916:~# ethtool eth1
Settings for eth1:
        Supported ports: [ TP MII ]
        Supported link modes:   10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
...

However, 10baseT/Half, while no longer just crashing, still does not seem to work right. Looking at wireshark
traces it does send packets but seems not to ever get neither ARP nor DHCP answers (as well as any other packet
for that matter). Looks like the same actually applies to 10baseT/Full as well. While 100baseT/Half and
100baseT/Full work fine now.

Any idea what else could still be going wrong with them 10baseT modes?

I do know that eth0 which is FEC based instead, works just fine with any and all those modes so my network
setup otherwise seems sound.

Also the question remains, why ethtool allows such illegal configuration and even worse why the kernel
subsequently just crashes. Not sure how exactly this could be prevented though.

On a side note, besides modifying the device tree for such single-queue setup being half-duplex capable, is
there any easier way? Much nicer would, of course, be if it justworkedTM (e.g. advertise all modes but once a
half-duplex mode is chosen revert to such single-queue operation). Then, on the other hand, who still uses
half-duplex communication in this day and age (;-p).

[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/boot/dts/freescale/imx8mp-evk.dts?h=v5.18-rc4#n110

>      Andrew

Cheers

Marcel

^ permalink raw reply

* Re: [PATCH net] tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT
From: Eric Dumazet @ 2022-04-25 15:05 UTC (permalink / raw)
  To: Dave Taht
  Cc: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
	netdev, Doug Porter, Soheil Hassas Yeganeh, Neal Cardwell
In-Reply-To: <CAA93jw7uoMCWT9oF52NhgozgCTJ4TAxpgNMNAZTaKqCpMR7uOg@mail.gmail.com>

On Mon, Apr 25, 2022 at 6:16 AM Dave Taht <dave.taht@gmail.com> wrote:
>

>
> Thx! We have been having very good results with TCP_NOTSENT_LOWAT set
> to 32k or less behind an apache traffic server... and had some really
> puzzling ones at geosync RTTs. Now I gotta go retest.
>
> Side question: Is there a guide/set of recommendations to setting this
> value more appropriately, under what circumstances? Could it
> autoconfigure?

It is a tradeoff between memory usage in the kernel (storing data in
the socket transmit queue),
and the number of times the application is woken up to let it push
another chunk of data.

32k really means : No more than one skb at a time.  (An skb can
usually store about 64KB of payload)

At Google we have been using 2MB, but I suspect this was to work
around the bug I just fixed.
We probably could use a smaller value like 1MB, leading to one
EPOLLOUT for 1/2 MB.

Precise values also depend on how much work is needed in
tcp_sendmsg(), zerocopy can play a role here.

autoconfigure ? Not sure how. Once you have provisioned a server for a
given workload,
you are all set. No need to tune the value as a function of the load.

>
> net.ipv4.tcp_notsent_lowat = 32768

This probably has caused many stalls on long rtt / lossy links...

>
> --
> FQ World Domination pending: https://blog.cerowrt.org/post/state_of_fq_codel/
> Dave Täht CEO, TekLibre, LLC

^ permalink raw reply

* Re: [PATCH bpf-next v3 2/7] ftrace: Fix deadloop caused by direct call in ftrace selftest
From: Steven Rostedt @ 2022-04-25 15:05 UTC (permalink / raw)
  To: Xu Kuohai
  Cc: bpf, linux-arm-kernel, linux-kernel, netdev, linux-kselftest,
	Catalin Marinas, Will Deacon, Ingo Molnar, Daniel Borkmann,
	Alexei Starovoitov, Zi Shen Lim, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, David S . Miller, Hideaki YOSHIFUJI, David Ahern,
	Thomas Gleixner, Borislav Petkov, Dave Hansen, x86, hpa,
	Shuah Khan, Jakub Kicinski, Jesper Dangaard Brouer, Mark Rutland,
	Pasha Tatashin, Ard Biesheuvel, Daniel Kiss, Steven Price,
	Sudeep Holla, Marc Zyngier, Peter Collingbourne, Mark Brown,
	Delyan Kratunov, Kumar Kartikeya Dwivedi
In-Reply-To: <20220424154028.1698685-3-xukuohai@huawei.com>

On Sun, 24 Apr 2022 11:40:23 -0400
Xu Kuohai <xukuohai@huawei.com> wrote:

> diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
> index abcadbe933bb..d2eff2b1d743 100644
> --- a/kernel/trace/trace_selftest.c
> +++ b/kernel/trace/trace_selftest.c
> @@ -785,8 +785,24 @@ static struct fgraph_ops fgraph_ops __initdata  = {
>  };
>  
>  #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
> +#ifdef CONFIG_ARM64

Please find a way to add this in arm specific code. Do not add architecture
defines in generic code.

You could add:

#ifndef ARCH_HAVE_FTRACE_DIRECT_TEST_FUNC
noinline __noclone static void trace_direct_tramp(void) { }
#endif

here, and in arch/arm64/include/ftrace.h

#define ARCH_HAVE_FTRACE_DIRECT_TEST_FUNC

and define your test function in the arm64 specific code.

-- Steve




> +extern void trace_direct_tramp(void);
> +
> +asm (
> +"	.pushsection	.text, \"ax\", @progbits\n"
> +"	.type		trace_direct_tramp, %function\n"
> +"	.global		trace_direct_tramp\n"
> +"trace_direct_tramp:"
> +"	mov	x10, x30\n"
> +"	mov	x30, x9\n"
> +"	ret	x10\n"
> +"	.size		trace_direct_tramp, .-trace_direct_tramp\n"
> +"	.popsection\n"
> +);
> +#else
>  noinline __noclone static void trace_direct_tramp(void) { }
>  #endif
> +#endif
>  
>  /*
>   * Pretty much the same than for the function tracer from which the selftest


^ permalink raw reply

* Re: [PATCH] net: linkwatch: ignore events for unregistered netdevs
From: Jakub Kicinski @ 2022-04-25 15:00 UTC (permalink / raw)
  To: Jann Horn
  Cc: Lukas Wunner, Paolo Abeni, Oliver Neukum, David S. Miller,
	Oleksij Rempel, Eric Dumazet, netdev, linux-usb, Andrew Lunn,
	Jacky Chou, Willy Tarreau, Lino Sanfilippo, Philipp Rosenberger,
	Heiner Kallweit, Greg Kroah-Hartman
In-Reply-To: <CAG48ez3ibQjhs9Qxb0AAKE4-UZiZ5UdXG1JWcPWHAWBoO-1fVw@mail.gmail.com>

On Mon, 25 Apr 2022 16:49:34 +0200 Jann Horn wrote:
> > Doesn't mean we should make it legal. We can add a warning to catch
> > abuses.  
> 
> That was the idea with
> https://lore.kernel.org/netdev/20220128014303.2334568-1-jannh@google.com/,
> but I didn't get any replies when I asked what the precise semantics
> of dev_hold() are supposed to be
> (https://lore.kernel.org/netdev/CAG48ez1-OyZETvrYAfaHicYW1LbrQUVp=C0EukSWqZrYMej73w@mail.gmail.com/),
> so I don't know how to proceed...

Yeah, I think after you pointed out that the netdev per cpu refcounting
is fundamentally broken everybody decided to hit themselves with the
obliviate spell :S

^ permalink raw reply

* Re: [PATCH bpf-next] bpftoo: Support user defined vmlinux path
From: Daniel Borkmann @ 2022-04-25 14:54 UTC (permalink / raw)
  To: Jianlin Lv, bpf
  Cc: ast, andrii, kafai, quentin, jean-philippe, mauricio, ytcoode,
	linux-kernel, netdev, jianlv
In-Reply-To: <20220425075724.48540-1-jianlv@ebay.com>

On 4/25/22 9:57 AM, Jianlin Lv wrote:
> From: Jianlin Lv <iecedge@gmail.com>
> 
> Add EXTERNAL_PATH variable that define unconventional vmlinux path
> 
> Signed-off-by: Jianlin Lv <iecedge@gmail.com>
> ---
> When building Ubuntu-5.15.0 kernel, '../../../vmlinux' cannot locate
> compiled vmlinux image. Incorrect vmlinux generated vmlinux.h missing some
> structure definitions that broken compiling pipe.

You should already be able to define custom VMLINUX_BTF_PATHS, no?

See commit :

commit ec23eb705620234421fd48fc2382490fcfbafc37
Author: Andrii Nakryiko <andriin@fb.com>
Date:   Mon Jun 29 17:47:58 2020 -0700

     tools/bpftool: Allow substituting custom vmlinux.h for the build

> ---
>   tools/bpf/bpftool/Makefile | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile
> index c6d2c77d0252..fefa3b763eb7 100644
> --- a/tools/bpf/bpftool/Makefile
> +++ b/tools/bpf/bpftool/Makefile
> @@ -160,6 +160,7 @@ $(OBJS): $(LIBBPF) $(LIBBPF_INTERNAL_HDRS)
>   VMLINUX_BTF_PATHS ?= $(if $(O),$(O)/vmlinux)				\
>   		     $(if $(KBUILD_OUTPUT),$(KBUILD_OUTPUT)/vmlinux)	\
>   		     ../../../vmlinux					\
> +		     $(if $(EXTERNAL_PATH),$(EXTERNAL_PATH)/vmlinux)	\
>   		     /sys/kernel/btf/vmlinux				\
>   		     /boot/vmlinux-$(shell uname -r)
>   VMLINUX_BTF ?= $(abspath $(firstword $(wildcard $(VMLINUX_BTF_PATHS))))
> 


^ permalink raw reply

* Re: [PATCH net-next 08/10] tls: rx: use async as an in-out argument
From: Jakub Kicinski @ 2022-04-25 14:54 UTC (permalink / raw)
  To: Gal Pressman
  Cc: davem, pabeni, netdev, borisp, john.fastabend, daniel, vfedorenko
In-Reply-To: <01081d46-249f-a081-f130-e0a09180d4d3@nvidia.com>

On Mon, 25 Apr 2022 10:19:45 +0300 Gal Pressman wrote:
> On 11/04/2022 22:19, Jakub Kicinski wrote:
> > Propagating EINPROGRESS thru multiple layers of functions is
> > error prone. Use darg->async as an in/out argument, like we
> > use darg->zc today. On input it tells the code if async is
> > allowed, on output if it took place.
> >
> > Signed-off-by: Jakub Kicinski <kuba@kernel.org>  
> 
> I know this is not much to go on, but this patch broke our tls workflows
> when device offload is enabled.
> I'm still looking into it, but maybe you have an idea what might have
> went wrong?

Oof right, sorry. When packet is already decrypted by HW we'll skip 
the decrypt completely and leave async to whatever it was at input.

Something like this?

--->8---------

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index ddbe05ec5489..80094528eadb 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1562,6 +1562,7 @@ static int decrypt_skb_update(struct sock *sk, struct sk_buff *skb,
 
 	if (tlm->decrypted) {
 		darg->zc = false;
+		darg->async = false;
 		return 0;
 	}
 
@@ -1572,6 +1573,7 @@ static int decrypt_skb_update(struct sock *sk, struct sk_buff *skb,
 		if (err > 0) {
 			tlm->decrypted = 1;
 			darg->zc = false;
+			darg->async = false;
 			goto decrypt_done;
 		}
 	}

^ permalink raw reply related

* Re: [PATCH] net: linkwatch: ignore events for unregistered netdevs
From: Jann Horn @ 2022-04-25 14:49 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Lukas Wunner, Paolo Abeni, Oliver Neukum, David S. Miller,
	Oleksij Rempel, Eric Dumazet, netdev, linux-usb, Andrew Lunn,
	Jacky Chou, Willy Tarreau, Lino Sanfilippo, Philipp Rosenberger,
	Heiner Kallweit, Greg Kroah-Hartman
In-Reply-To: <20220425074146.1fa27d5f@kernel.org>

On Mon, Apr 25, 2022 at 4:41 PM Jakub Kicinski <kuba@kernel.org> wrote:
> On Sat, 23 Apr 2022 18:07:23 +0200 Lukas Wunner wrote:
> > > Looking at the original report it looks like the issue could be
> > > resolved with a more usb-specific change: e.g. it looks like
> > > usbnet_defer_kevent() is not acquiring a dev reference as it should.
> > >
> > > Have you considered that path?
> >
> > First of all, the diffstat of the patch shows this is an opportunity
> > to reduce LoC as well as simplify and speed up device teardown.
> >
> > Second, the approach you're proposing won't work if a driver calls
> > netif_carrier_on/off() after unregister_netdev().
> >
> > It seems prudent to prevent such a misbehavior in *any* driver,
> > not just usbnet.  usbnet may not be the only one doing it wrong.
> > Jann pointed out that there are more syzbot reports related
> > to a UAF in linkwatch:
> >
> > https://lore.kernel.org/netdev/?q=__linkwatch_run_queue+syzbot
> >
> > Third, I think an API which schedules work, invisibly to the driver,
> > is dangerous and misguided.  If it is illegal to call
> > netif_carrier_on/off() for an unregistered but not yet freed netdev,
> > catch that in core networking code and don't expect drivers to respect
> > a rule which isn't even documented.
>
> Doesn't mean we should make it legal. We can add a warning to catch
> abuses.

That was the idea with
https://lore.kernel.org/netdev/20220128014303.2334568-1-jannh@google.com/,
but I didn't get any replies when I asked what the precise semantics
of dev_hold() are supposed to be
(https://lore.kernel.org/netdev/CAG48ez1-OyZETvrYAfaHicYW1LbrQUVp=C0EukSWqZrYMej73w@mail.gmail.com/),
so I don't know how to proceed...

^ permalink raw reply

* Re: [PATCH] net: linkwatch: ignore events for unregistered netdevs
From: Jakub Kicinski @ 2022-04-25 14:41 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Paolo Abeni, Oliver Neukum, David S. Miller, Jann Horn,
	Oleksij Rempel, Eric Dumazet, netdev, linux-usb, Andrew Lunn,
	Jacky Chou, Willy Tarreau, Lino Sanfilippo, Philipp Rosenberger,
	Heiner Kallweit, Greg Kroah-Hartman
In-Reply-To: <20220423160723.GA20330@wunner.de>

On Sat, 23 Apr 2022 18:07:23 +0200 Lukas Wunner wrote:
> > Looking at the original report it looks like the issue could be
> > resolved with a more usb-specific change: e.g. it looks like
> > usbnet_defer_kevent() is not acquiring a dev reference as it should.
> > 
> > Have you considered that path?  
> 
> First of all, the diffstat of the patch shows this is an opportunity
> to reduce LoC as well as simplify and speed up device teardown.
> 
> Second, the approach you're proposing won't work if a driver calls
> netif_carrier_on/off() after unregister_netdev().
> 
> It seems prudent to prevent such a misbehavior in *any* driver,
> not just usbnet.  usbnet may not be the only one doing it wrong.
> Jann pointed out that there are more syzbot reports related
> to a UAF in linkwatch:
> 
> https://lore.kernel.org/netdev/?q=__linkwatch_run_queue+syzbot
> 
> Third, I think an API which schedules work, invisibly to the driver,
> is dangerous and misguided.  If it is illegal to call
> netif_carrier_on/off() for an unregistered but not yet freed netdev,
> catch that in core networking code and don't expect drivers to respect
> a rule which isn't even documented.

Doesn't mean we should make it legal. We can add a warning to catch
abuses.

^ permalink raw reply

* Re: [PATCH bpf-next 2/7] bpf: add bpf_skc_to_mptcp_sock_proto
From: Daniel Borkmann @ 2022-04-25 14:33 UTC (permalink / raw)
  To: Mat Martineau, netdev, bpf
  Cc: Geliang Tang, ast, andrii, mptcp, Nicolas Rybowski,
	Matthieu Baerts
In-Reply-To: <20220420222459.307649-3-mathew.j.martineau@linux.intel.com>

On 4/21/22 12:24 AM, Mat Martineau wrote:
[...]
> diff --git a/include/net/mptcp.h b/include/net/mptcp.h
> index 0a3b0fb04a3b..5b3a6f783182 100644
> --- a/include/net/mptcp.h
> +++ b/include/net/mptcp.h
> @@ -283,4 +283,10 @@ static inline int mptcpv6_init(void) { return 0; }
>   static inline void mptcpv6_handle_mapped(struct sock *sk, bool mapped) { }
>   #endif
>   
> +#if defined(CONFIG_MPTCP) && defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL)
> +struct mptcp_sock *bpf_mptcp_sock_from_subflow(struct sock *sk);
> +#else
> +static inline struct mptcp_sock *bpf_mptcp_sock_from_subflow(struct sock *sk) { return NULL; }
> +#endif
> +

Where is this relevant to JIT specifically?

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH bpf-next v2 4/6] bpf, arm64: Impelment bpf_arch_text_poke() for arm64
From: Jakub Sitnicki @ 2022-04-25 14:26 UTC (permalink / raw)
  To: Xu Kuohai
  Cc: bpf, linux-arm-kernel, linux-kernel, netdev, linux-kselftest,
	Catalin Marinas, Will Deacon, Steven Rostedt, Ingo Molnar,
	Daniel Borkmann, Alexei Starovoitov, Zi Shen Lim, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, David S . Miller, Hideaki YOSHIFUJI, David Ahern,
	Thomas Gleixner, Borislav Petkov, Dave Hansen, x86, hpa,
	Shuah Khan, Mark Rutland, Ard Biesheuvel, Pasha Tatashin,
	Peter Collingbourne, Daniel Kiss, Sudeep Holla, Steven Price,
	Marc Zyngier, Mark Brown, Kumar Kartikeya Dwivedi,
	Delyan Kratunov, kernel-team
In-Reply-To: <13cd161b-43a2-ce66-6a27-6662fc36e063@huawei.com>

On Sun, Apr 24, 2022 at 01:05 PM +08, Xu Kuohai wrote:
> Thanks for your testing and suggestion! I added bpf2bpf poking to this
> series and rebased it to [2] a few days ago, so there are some conflicts
> with the bpf-next branch. I'll rebase it to bpf-next and send v3.
>
> [2] https://lore.kernel.org/bpf/20220416042940.656344-1-kuifeng@fb.com/

Looking forward to it.

I think it would be okay to post v3 saying that it depends on the
"Attach a cookie to a tracing program" series and won't apply cleanly to
bpf-next with out.

It would give us more time to review.

^ permalink raw reply

* Re: [PATCH bpf-next 2/7] bpf: add bpf_skc_to_mptcp_sock_proto
From: Daniel Borkmann @ 2022-04-25 14:29 UTC (permalink / raw)
  To: Mat Martineau, netdev, bpf
  Cc: Geliang Tang, ast, andrii, mptcp, Nicolas Rybowski,
	Matthieu Baerts
In-Reply-To: <903108df-161e-515b-da3d-bff4fb49de39@iogearbox.net>

On 4/25/22 4:26 PM, Daniel Borkmann wrote:
[...]
> 
> Looking at bpf_skc_to_tcp6_sock(), do we similarly need a BTF_TYPE_EMIT() here?

(Plus, CONFIG_MPTCP should be enabled in CI config (Andrii).)

Cheers,
Daniel

^ permalink raw reply

* Re: [RFC PATCH v4 07/15] landlock: user space API network support
From: Konstantin Meskhidze @ 2022-04-25 14:29 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: willemdebruijn.kernel, linux-security-module, netdev,
	netfilter-devel, yusongping, artem.kuzin, anton.sirazetdinov
In-Reply-To: <d4724117-167d-00b0-1f10-749b35bffc2f@digikod.net>



4/12/2022 2:21 PM, Mickaël Salaün пишет:
> 
> On 09/03/2022 14:44, Konstantin Meskhidze wrote:
>> User space API was refactored to support
>> network actions. New network access flags,
>> network rule and network attributes were
>> added.
>>
>> Signed-off-by: Konstantin Meskhidze <konstantin.meskhidze@huawei.com>
>> ---
>>
>> Changes since v3:
>> * Split commit.
>> * Refactoring User API for network rule type.
>>
>> ---
>>   include/uapi/linux/landlock.h | 48 +++++++++++++++++++++++++++++++++++
>>   security/landlock/syscalls.c  |  5 ++--
>>   2 files changed, 51 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/uapi/linux/landlock.h 
>> b/include/uapi/linux/landlock.h
>> index b3d952067f59..4fc6c793fdf4 100644
>> --- a/include/uapi/linux/landlock.h
>> +++ b/include/uapi/linux/landlock.h
>> @@ -25,6 +25,13 @@ struct landlock_ruleset_attr {
>>        * compatibility reasons.
>>        */
>>       __u64 handled_access_fs;
>> +
>> +    /**
>> +     * @handled_access_net: Bitmask of actions (cf. `Network flags`_)
>> +     * that is handled by this ruleset and should then be forbidden 
>> if no
>> +     * rule explicitly allow them.
>> +     */
>> +    __u64 handled_access_net;
>>   };
>>
>>   /*
>> @@ -46,6 +53,11 @@ enum landlock_rule_type {
>>        * landlock_path_beneath_attr .
>>        */
>>       LANDLOCK_RULE_PATH_BENEATH = 1,
>> +    /**
>> +     * @LANDLOCK_RULE_NET_SERVICE: Type of a &struct
>> +     * landlock_net_service_attr .
>> +     */
>> +    LANDLOCK_RULE_NET_SERVICE = 2,
>>   };
>>
>>   /**
>> @@ -70,6 +82,24 @@ struct landlock_path_beneath_attr {
>>        */
>>   } __attribute__((packed));
>>
>> +/**
>> + * struct landlock_net_service_attr - TCP subnet definition
>> + *
>> + * Argument of sys_landlock_add_rule().
>> + */
>> +struct landlock_net_service_attr {
>> +    /**
>> +     * @allowed_access: Bitmask of allowed access network for services
>> +     * (cf. `Network flags`_).
>> +     */
>> +    __u64 allowed_access;
>> +    /**
>> +     * @port: Network port
>> +     */
>> +    __u16 port;
>> +
>> +} __attribute__((packed));
>> +
>>   /**
>>    * DOC: fs_access
>>    *
>> @@ -134,4 +164,22 @@ struct landlock_path_beneath_attr {
>>   #define LANDLOCK_ACCESS_FS_MAKE_BLOCK            (1ULL << 11)
>>   #define LANDLOCK_ACCESS_FS_MAKE_SYM            (1ULL << 12)
>>
>> +/**
>> + * DOC: net_access
>> + *
>> + * Network flags
>> + * ~~~~~~~~~~~~~~~~
>> + *
>> + * These flags enable to restrict a sandboxed process to a set of 
>> network
>> + * actions.
>> + *
>> + * TCP sockets with allowed actions:
>> + *
>> + * - %LANDLOCK_ACCESS_NET_BIND_TCP: Bind a TCP socket to a local port.
>> + * - %LANDLOCK_ACCESS_NET_CONNECT_TCP: Connect an active TCP socket to
>> + *   a remote port.
>> + */
>> +#define LANDLOCK_ACCESS_NET_BIND_TCP            (1ULL << 0)
>> +#define LANDLOCK_ACCESS_NET_CONNECT_TCP            (1ULL << 1)
>> +
>>   #endif /* _UAPI_LINUX_LANDLOCK_H */
>> diff --git a/security/landlock/syscalls.c b/security/landlock/syscalls.c
>> index 8c0f6165fe3a..fcbce83d64ef 100644
>> --- a/security/landlock/syscalls.c
>> +++ b/security/landlock/syscalls.c
>> @@ -81,8 +81,9 @@ static void build_check_abi(void)
>>        * struct size.
>>        */
>>       ruleset_size = sizeof(ruleset_attr.handled_access_fs);
>> +    ruleset_size += sizeof(ruleset_attr.handled_access_net);
>>       BUILD_BUG_ON(sizeof(ruleset_attr) != ruleset_size);
>> -    BUILD_BUG_ON(sizeof(ruleset_attr) != 8);
>> +    BUILD_BUG_ON(sizeof(ruleset_attr) != 16);
>>
>>       path_beneath_size = sizeof(path_beneath_attr.allowed_access);
>>       path_beneath_size += sizeof(path_beneath_attr.parent_fd);
>> @@ -184,7 +185,7 @@ SYSCALL_DEFINE3(landlock_create_ruleset,
>>
>>       /* Checks content (and 32-bits cast). */
>>       if ((ruleset_attr.handled_access_fs | LANDLOCK_MASK_ACCESS_FS) !=
>> -            LANDLOCK_MASK_ACCESS_FS)
>> +             LANDLOCK_MASK_ACCESS_FS)
> 
> Don't add cosmetic changes. FYI, I'm relying on the way Vim does line 
> cuts, which is mostly tabs. Please try to do the same.
> 
   Ok. I'm using VsCode as an editor. It also could be set up to 
different code styles.
> 
>>           return -EINVAL;
>>       access_mask_set.fs = ruleset_attr.handled_access_fs;
>>
>> -- 
>> 2.25.1
>>
> 
> You need to also update Documentation/userspace-api/landlock.rst 
> accordingly. You can check you changes by building the HTML doc.

   OK. I got it. Thnaks for the comment.
> .

^ permalink raw reply

* Re: [PATCH bpf-next 2/7] bpf: add bpf_skc_to_mptcp_sock_proto
From: Daniel Borkmann @ 2022-04-25 14:26 UTC (permalink / raw)
  To: Mat Martineau, netdev, bpf
  Cc: Geliang Tang, ast, andrii, mptcp, Nicolas Rybowski,
	Matthieu Baerts
In-Reply-To: <20220420222459.307649-3-mathew.j.martineau@linux.intel.com>

On 4/21/22 12:24 AM, Mat Martineau wrote:
[...]
>   static const struct bpf_func_proto *
>   bpf_sk_base_func_proto(enum bpf_func_id func_id);
> @@ -11279,6 +11280,19 @@ const struct bpf_func_proto bpf_skc_to_unix_sock_proto = {
>   	.ret_btf_id		= &btf_sock_ids[BTF_SOCK_TYPE_UNIX],
>   };
>   
> +BPF_CALL_1(bpf_skc_to_mptcp_sock, struct sock *, sk)
> +{
> +	return (unsigned long)bpf_mptcp_sock_from_subflow(sk);
> +}
> +
> +static const struct bpf_func_proto bpf_skc_to_mptcp_sock_proto = {
> +	.func		= bpf_skc_to_mptcp_sock,
> +	.gpl_only	= false,
> +	.ret_type	= RET_PTR_TO_BTF_ID_OR_NULL,
> +	.arg1_type	= ARG_PTR_TO_SOCK_COMMON,
> +	.ret_btf_id	= &btf_sock_ids[BTF_SOCK_TYPE_MPTCP],
> +};

BPF CI (https://github.com/kernel-patches/bpf/runs/6136052684?check_suite_focus=true) fails with:

   #7   base:FAIL
   libbpf: prog '_sockops': BPF program load failed: Invalid argument
   libbpf: prog '_sockops': -- BEGIN PROG LOAD LOG --
   0: R1=ctx(off=0,imm=0) R10=fp0
   ; int op = (int)ctx->op;
   0: (61) r2 = *(u32 *)(r1 +0)          ; R1=ctx(off=0,imm=0) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
   ; if (op != BPF_SOCK_OPS_TCP_CONNECT_CB)
   1: (56) if w2 != 0x3 goto pc+50       ; R2_w=3
   ; sk = ctx->sk;
   2: (79) r6 = *(u64 *)(r1 +184)        ; R1=ctx(off=0,imm=0) R6_w=sock_or_null(id=1,off=0,imm=0)
   ; if (!sk)
   3: (15) if r6 == 0x0 goto pc+48       ; R6_w=sock(off=0,imm=0)
   ; tcp_sk = bpf_tcp_sock(sk);
   4: (bf) r1 = r6                       ; R1_w=sock(off=0,imm=0) R6_w=sock(off=0,imm=0)
   5: (85) call bpf_tcp_sock#96          ; R0_w=tcp_sock_or_null(id=2,off=0,imm=0)
   6: (bf) r7 = r0                       ; R0=tcp_sock_or_null(id=2,off=0,imm=0) R7=tcp_sock_or_null(id=2,off=0,imm=0)
   ; if (!tcp_sk)
   7: (15) if r7 == 0x0 goto pc+44       ; R7=tcp_sock(off=0,imm=0)
   ; if (!tcp_sk->is_mptcp) {
   8: (61) r1 = *(u32 *)(r7 +112)        ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R7=tcp_sock(off=0,imm=0)
   ; if (!tcp_sk->is_mptcp) {
   9: (56) if w1 != 0x0 goto pc+14 24: R0=tcp_sock(off=0,imm=0) R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R6=sock(off=0,imm=0) R7=tcp_sock(off=0,imm=0) R10=fp0
   ; msk = bpf_skc_to_mptcp_sock(sk);
   24: (bf) r1 = r6                      ; R1_w=sock(off=0,imm=0) R6=sock(off=0,imm=0)
   25: (85) call bpf_skc_to_mptcp_sock#194
   invalid return type 8 of func bpf_skc_to_mptcp_sock#194
   processed 34 insns (limit 1000000) max_states_per_insn 0 total_states 3 peak_states 3 mark_read 1
   -- END PROG LOAD LOG --
   libbpf: failed to load program '_sockops'
   libbpf: failed to load object './mptcp_sock.o'
   run_test:FAIL:165
   test_base:FAIL:227
   (network_helpers.c:88: errno: Protocol not supported) Failed to create server socket
   test_base:FAIL:241
   RTNETLINK answers: No such file or directory
   Error talking to the kernel
[...]

Looking at bpf_skc_to_tcp6_sock(), do we similarly need a BTF_TYPE_EMIT() here?

Thanks,
Daniel

^ permalink raw reply

* [PATCH net-next v4 2/2] net: vxlan: vxlan_core.c: Add extack support to vxlan_fdb_delete
From: Alaa Mohamed @ 2022-04-25 14:25 UTC (permalink / raw)
  To: netdev
  Cc: outreachy, roopa, jdenham, sbrivio, jesse.brandeburg,
	anthony.l.nguyen, davem, kuba, pabeni, vladimir.oltean,
	claudiu.manoil, alexandre.belloni, shshaikh, manishc, razor,
	intel-wired-lan, linux-kernel, UNGLinuxDriver, GR-Linux-NIC-Dev,
	bridge, eng.alaamohamedsoliman.am
In-Reply-To: <cover.1650896000.git.eng.alaamohamedsoliman.am@gmail.com>

This patch adds extack msg support to vxlan_fdb_delete and vxlan_fdb_parse.
extack is used to propagate meaningful error msgs to the user of vxlan
fdb netlink api

Signed-off-by: Alaa Mohamed <eng.alaamohamedsoliman.am@gmail.com>
---
changes in V2:
	- fix spelling vxlan_fdb_delete
	- add missing braces
	- edit error message
---
changes in V3:
	fix errors reported by checkpatch.pl
---
changes in V4:
	- fix errors reported by checkpatch.pl
	- edit commit message.
---
 drivers/net/vxlan/vxlan_core.c | 37 ++++++++++++++++++++++++----------
 1 file changed, 26 insertions(+), 11 deletions(-)

diff --git a/drivers/net/vxlan/vxlan_core.c b/drivers/net/vxlan/vxlan_core.c
index cf2f60037340..ef69aeb058b8 100644
--- a/drivers/net/vxlan/vxlan_core.c
+++ b/drivers/net/vxlan/vxlan_core.c
@@ -1129,19 +1129,24 @@ static void vxlan_fdb_dst_destroy(struct vxlan_dev *vxlan, struct vxlan_fdb *f,

 static int vxlan_fdb_parse(struct nlattr *tb[], struct vxlan_dev *vxlan,
 			   union vxlan_addr *ip, __be16 *port, __be32 *src_vni,
-			   __be32 *vni, u32 *ifindex, u32 *nhid)
+			   __be32 *vni, u32 *ifindex, u32 *nhid, struct netlink_ext_ack *extack)
 {
 	struct net *net = dev_net(vxlan->dev);
 	int err;

 	if (tb[NDA_NH_ID] && (tb[NDA_DST] || tb[NDA_VNI] || tb[NDA_IFINDEX] ||
-	    tb[NDA_PORT]))
-		return -EINVAL;
+	    tb[NDA_PORT])) {
+			NL_SET_ERR_MSG(extack,
+						   "DST, VNI, ifindex and port are mutually exclusive with NH_ID");
+			return -EINVAL;
+		}

 	if (tb[NDA_DST]) {
 		err = vxlan_nla_get_addr(ip, tb[NDA_DST]);
-		if (err)
+		if (err) {
+			NL_SET_ERR_MSG(extack, "Unsupported address family");
 			return err;
+		}
 	} else {
 		union vxlan_addr *remote = &vxlan->default_dst.remote_ip;

@@ -1157,24 +1162,30 @@ static int vxlan_fdb_parse(struct nlattr *tb[], struct vxlan_dev *vxlan,
 	}

 	if (tb[NDA_PORT]) {
-		if (nla_len(tb[NDA_PORT]) != sizeof(__be16))
+		if (nla_len(tb[NDA_PORT]) != sizeof(__be16)) {
+			NL_SET_ERR_MSG(extack, "Invalid vxlan port");
 			return -EINVAL;
+		}
 		*port = nla_get_be16(tb[NDA_PORT]);
 	} else {
 		*port = vxlan->cfg.dst_port;
 	}

 	if (tb[NDA_VNI]) {
-		if (nla_len(tb[NDA_VNI]) != sizeof(u32))
+		if (nla_len(tb[NDA_VNI]) != sizeof(u32)) {
+			NL_SET_ERR_MSG(extack, "Invalid vni");
 			return -EINVAL;
+		}
 		*vni = cpu_to_be32(nla_get_u32(tb[NDA_VNI]));
 	} else {
 		*vni = vxlan->default_dst.remote_vni;
 	}

 	if (tb[NDA_SRC_VNI]) {
-		if (nla_len(tb[NDA_SRC_VNI]) != sizeof(u32))
+		if (nla_len(tb[NDA_SRC_VNI]) != sizeof(u32)) {
+			NL_SET_ERR_MSG(extack, "Invalid src vni");
 			return -EINVAL;
+		}
 		*src_vni = cpu_to_be32(nla_get_u32(tb[NDA_SRC_VNI]));
 	} else {
 		*src_vni = vxlan->default_dst.remote_vni;
@@ -1183,12 +1194,16 @@ static int vxlan_fdb_parse(struct nlattr *tb[], struct vxlan_dev *vxlan,
 	if (tb[NDA_IFINDEX]) {
 		struct net_device *tdev;

-		if (nla_len(tb[NDA_IFINDEX]) != sizeof(u32))
+		if (nla_len(tb[NDA_IFINDEX]) != sizeof(u32)) {
+			NL_SET_ERR_MSG(extack, "Invalid ifindex");
 			return -EINVAL;
+		}
 		*ifindex = nla_get_u32(tb[NDA_IFINDEX]);
 		tdev = __dev_get_by_index(net, *ifindex);
-		if (!tdev)
+		if (!tdev) {
+			NL_SET_ERR_MSG(extack, "Device not found");
 			return -EADDRNOTAVAIL;
+		}
 	} else {
 		*ifindex = 0;
 	}
@@ -1226,7 +1241,7 @@ static int vxlan_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
 		return -EINVAL;

 	err = vxlan_fdb_parse(tb, vxlan, &ip, &port, &src_vni, &vni, &ifindex,
-			      &nhid);
+			      &nhid, extack);
 	if (err)
 		return err;

@@ -1291,7 +1306,7 @@ static int vxlan_fdb_delete(struct ndmsg *ndm, struct nlattr *tb[],
 	int err;

 	err = vxlan_fdb_parse(tb, vxlan, &ip, &port, &src_vni, &vni, &ifindex,
-			      &nhid);
+			      &nhid, extack);
 	if (err)
 		return err;

--
2.36.0


^ permalink raw reply related

* [PATCH net-next v4 1/2] rtnetlink: add extack support in fdb del handlers
From: Alaa Mohamed @ 2022-04-25 14:25 UTC (permalink / raw)
  To: netdev
  Cc: outreachy, roopa, jdenham, sbrivio, jesse.brandeburg,
	anthony.l.nguyen, davem, kuba, pabeni, vladimir.oltean,
	claudiu.manoil, alexandre.belloni, shshaikh, manishc, razor,
	intel-wired-lan, linux-kernel, UNGLinuxDriver, GR-Linux-NIC-Dev,
	bridge, eng.alaamohamedsoliman.am
In-Reply-To: <cover.1650896000.git.eng.alaamohamedsoliman.am@gmail.com>

Add extack support to .ndo_fdb_del in netdevice.h and
all related methods.

Signed-off-by: Alaa Mohamed <eng.alaamohamedsoliman.am@gmail.com>
---
changes in V3:
        fix errors reported by checkpatch.pl
---
changes in V4:
        fix errors reported by checkpatch.pl
---
 drivers/net/ethernet/intel/ice/ice_main.c        | 2 +-
 drivers/net/ethernet/mscc/ocelot_net.c           | 4 ++--
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c | 2 +-
 drivers/net/macvlan.c                            | 2 +-
 drivers/net/vxlan/vxlan_core.c                   | 2 +-
 include/linux/netdevice.h                        | 2 +-
 net/bridge/br_fdb.c                              | 2 +-
 net/bridge/br_private.h                          | 3 ++-
 net/core/rtnetlink.c                             | 4 ++--
 9 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index d768925785ca..4fd32163729e 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -5678,7 +5678,7 @@ ice_fdb_add(struct ndmsg *ndm, struct nlattr __always_unused *tb[],
 static int
 ice_fdb_del(struct ndmsg *ndm, __always_unused struct nlattr *tb[],
 	    struct net_device *dev, const unsigned char *addr,
-	    __always_unused u16 vid)
+	    __always_unused u16 vid, struct netlink_ext_ack *extack)
 {
 	int err;

diff --git a/drivers/net/ethernet/mscc/ocelot_net.c b/drivers/net/ethernet/mscc/ocelot_net.c
index 247bc105bdd2..e07c64e3159c 100644
--- a/drivers/net/ethernet/mscc/ocelot_net.c
+++ b/drivers/net/ethernet/mscc/ocelot_net.c
@@ -774,14 +774,14 @@ static int ocelot_port_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],

 static int ocelot_port_fdb_del(struct ndmsg *ndm, struct nlattr *tb[],
 			       struct net_device *dev,
-			       const unsigned char *addr, u16 vid)
+			       const unsigned char *addr, u16 vid, struct netlink_ext_ack *extack)
 {
 	struct ocelot_port_private *priv = netdev_priv(dev);
 	struct ocelot_port *ocelot_port = &priv->port;
 	struct ocelot *ocelot = ocelot_port->ocelot;
 	int port = priv->chip_port;

-	return ocelot_fdb_del(ocelot, port, addr, vid, ocelot_port->bridge);
+	return ocelot_fdb_del(ocelot, port, addr, vid, ocelot_port->bridge, extack);
 }

 static int ocelot_port_fdb_dump(struct sk_buff *skb,
diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
index d320567b2cca..51fa23418f6a 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
@@ -368,7 +368,7 @@ static int qlcnic_set_mac(struct net_device *netdev, void *p)

 static int qlcnic_fdb_del(struct ndmsg *ndm, struct nlattr *tb[],
 			struct net_device *netdev,
-			const unsigned char *addr, u16 vid)
+			const unsigned char *addr, u16 vid, struct netlink_ext_ack *extack)
 {
 	struct qlcnic_adapter *adapter = netdev_priv(netdev);
 	int err = -EOPNOTSUPP;
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 069e8824c264..ffd34d9f7049 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -1017,7 +1017,7 @@ static int macvlan_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],

 static int macvlan_fdb_del(struct ndmsg *ndm, struct nlattr *tb[],
 			   struct net_device *dev,
-			   const unsigned char *addr, u16 vid)
+			   const unsigned char *addr, u16 vid, struct netlink_ext_ack *extack)
 {
 	struct macvlan_dev *vlan = netdev_priv(dev);
 	int err = -EINVAL;
diff --git a/drivers/net/vxlan/vxlan_core.c b/drivers/net/vxlan/vxlan_core.c
index de97ff98d36e..cf2f60037340 100644
--- a/drivers/net/vxlan/vxlan_core.c
+++ b/drivers/net/vxlan/vxlan_core.c
@@ -1280,7 +1280,7 @@ int __vxlan_fdb_delete(struct vxlan_dev *vxlan,
 /* Delete entry (via netlink) */
 static int vxlan_fdb_delete(struct ndmsg *ndm, struct nlattr *tb[],
 			    struct net_device *dev,
-			    const unsigned char *addr, u16 vid)
+			    const unsigned char *addr, u16 vid, struct netlink_ext_ack *extack)
 {
 	struct vxlan_dev *vxlan = netdev_priv(dev);
 	union vxlan_addr ip;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 28ea4f8269d4..d0d2a8f33c73 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1509,7 +1509,7 @@ struct net_device_ops {
 					       struct nlattr *tb[],
 					       struct net_device *dev,
 					       const unsigned char *addr,
-					       u16 vid);
+					       u16 vid, struct netlink_ext_ack *extack);
 	int			(*ndo_fdb_dump)(struct sk_buff *skb,
 						struct netlink_callback *cb,
 						struct net_device *dev,
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index 6ccda68bd473..5bfce2e9a553 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -1110,7 +1110,7 @@ static int __br_fdb_delete(struct net_bridge *br,
 /* Remove neighbor entry with RTM_DELNEIGH */
 int br_fdb_delete(struct ndmsg *ndm, struct nlattr *tb[],
 		  struct net_device *dev,
-		  const unsigned char *addr, u16 vid)
+		  const unsigned char *addr, u16 vid, struct netlink_ext_ack *extack)
 {
 	struct net_bridge_vlan_group *vg;
 	struct net_bridge_port *p = NULL;
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 18ccc3d5d296..14ea6e73d786 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -780,7 +780,8 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
 		   const unsigned char *addr, u16 vid, unsigned long flags);

 int br_fdb_delete(struct ndmsg *ndm, struct nlattr *tb[],
-		  struct net_device *dev, const unsigned char *addr, u16 vid);
+		  struct net_device *dev, const unsigned char *addr, u16 vid,
+		  struct netlink_ext_ack *extack);
 int br_fdb_add(struct ndmsg *nlh, struct nlattr *tb[], struct net_device *dev,
 	       const unsigned char *addr, u16 vid, u16 nlh_flags,
 	       struct netlink_ext_ack *extack);
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 4041b3e2e8ec..99b30ae58a47 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -4223,7 +4223,7 @@ static int rtnl_fdb_del(struct sk_buff *skb, struct nlmsghdr *nlh,
 		const struct net_device_ops *ops = br_dev->netdev_ops;

 		if (ops->ndo_fdb_del)
-			err = ops->ndo_fdb_del(ndm, tb, dev, addr, vid);
+			err = ops->ndo_fdb_del(ndm, tb, dev, addr, vid, extack);

 		if (err)
 			goto out;
@@ -4235,7 +4235,7 @@ static int rtnl_fdb_del(struct sk_buff *skb, struct nlmsghdr *nlh,
 	if (ndm->ndm_flags & NTF_SELF) {
 		if (dev->netdev_ops->ndo_fdb_del)
 			err = dev->netdev_ops->ndo_fdb_del(ndm, tb, dev, addr,
-							   vid);
+							   vid, extack);
 		else
 			err = ndo_dflt_fdb_del(ndm, tb, dev, addr, vid);

--
2.36.0


^ permalink raw reply related

* [PATCH net-next v4 0/2] propagate extack to vxlan_fdb_delete
From: Alaa Mohamed @ 2022-04-25 14:25 UTC (permalink / raw)
  To: netdev
  Cc: outreachy, roopa, jdenham, sbrivio, jesse.brandeburg,
	anthony.l.nguyen, davem, kuba, pabeni, vladimir.oltean,
	claudiu.manoil, alexandre.belloni, shshaikh, manishc, razor,
	intel-wired-lan, linux-kernel, UNGLinuxDriver, GR-Linux-NIC-Dev,
	bridge, eng.alaamohamedsoliman.am

In order to propagate extack to vxlan_fdb_delete and vxlan_fdb_parse,
add extack to .ndo_fdb_del and edit all fdb del handelers

Alaa Mohamed (2):
  rtnetlink: add extack support in fdb del handlers
  net: vxlan: vxlan_core.c: Add extack support to vxlan_fdb_delete

 drivers/net/ethernet/intel/ice/ice_main.c     |  2 +-
 drivers/net/ethernet/mscc/ocelot_net.c        |  4 +-
 .../net/ethernet/qlogic/qlcnic/qlcnic_main.c  |  2 +-
 drivers/net/macvlan.c                         |  2 +-
 drivers/net/vxlan/vxlan_core.c                | 39 +++++++++++++------
 include/linux/netdevice.h                     |  2 +-
 net/bridge/br_fdb.c                           |  2 +-
 net/bridge/br_private.h                       |  3 +-
 net/core/rtnetlink.c                          |  4 +-
 9 files changed, 38 insertions(+), 22 deletions(-)

-- 
2.36.0


^ permalink raw reply

* Re: 9p EBADF with cache enabled (Was: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected))
From: David Howells @ 2022-04-25 14:10 UTC (permalink / raw)
  To: asmadeus
  Cc: dhowells, Christian Schoenebeck, David Kahurani, davem, ericvh,
	kuba, linux-kernel, lucho, netdev, v9fs-developer, Greg Kurz
In-Reply-To: <YmKp68xvZEjBFell@codewreck.org>

There may be a quick and dirty workaround.  I think the problem is that unless
the O_APPEND read starts at the beginning of a page, netfs is going to enforce
a read.  Does the attached patch fix the problem?  (note that it's untested)

Also, can you get the contents of /proc/fs/fscache/stats from after
reproducing the problem?

David
---
diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index 501128188343..5f61fdb950b0 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -291,16 +291,25 @@ static int v9fs_write_end(struct file *filp, struct address_space *mapping,
 	struct folio *folio = page_folio(subpage);
 	struct inode *inode = mapping->host;
 	struct v9fs_inode *v9inode = V9FS_I(inode);
+	size_t fsize = folio_size(folio);
+	size_t offset = pos & (fsize - 1);
+	/* With multipage folio support, we may be given len > fsize */
+	size_t copy_size = min_t(size_t, len, fsize - offset);
 
 	p9_debug(P9_DEBUG_VFS, "filp %p, mapping %p\n", filp, mapping);
 
 	if (!folio_test_uptodate(folio)) {
-		if (unlikely(copied < len)) {
+		if (unlikely(copied < copy_size)) {
 			copied = 0;
 			goto out;
 		}
-
-		folio_mark_uptodate(folio);
+		if (offset == 0) {
+			if (copied == fsize)
+				folio_mark_uptodate(folio);
+			/* Could clear to end of page if last_pos == new EOF
+			 * and then mark uptodate
+			 */
+		}
 	}
 
 	/*
diff --git a/fs/netfs/buffered_read.c b/fs/netfs/buffered_read.c
index 281a88a5b8dc..78439f628c23 100644
--- a/fs/netfs/buffered_read.c
+++ b/fs/netfs/buffered_read.c
@@ -364,6 +364,12 @@ int netfs_write_begin(struct file *file, struct address_space *mapping,
 	if (folio_test_uptodate(folio))
 		goto have_folio;
 
+	if (!netfs_is_cache_enabled(ctx) &&
+	    (file->f_flags & (O_APPEND | O_ACCMODE)) == (O_APPEND | O_WRONLY)) {
+		netfs_stat(&netfs_n_rh_write_append);
+		goto have_folio_no_wait;
+	}
+
 	/* If the page is beyond the EOF, we want to clear it - unless it's
 	 * within the cache granule containing the EOF, in which case we need
 	 * to preload the granule.
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index b7b0e3d18d9e..a1cd649197dc 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -67,6 +67,7 @@ extern atomic_t netfs_n_rh_read_failed;
 extern atomic_t netfs_n_rh_zero;
 extern atomic_t netfs_n_rh_short_read;
 extern atomic_t netfs_n_rh_write;
+extern atomic_t netfs_n_rh_write_append;
 extern atomic_t netfs_n_rh_write_begin;
 extern atomic_t netfs_n_rh_write_done;
 extern atomic_t netfs_n_rh_write_failed;
diff --git a/fs/netfs/stats.c b/fs/netfs/stats.c
index 5510a7a14a40..fce87f86f950 100644
--- a/fs/netfs/stats.c
+++ b/fs/netfs/stats.c
@@ -23,6 +23,7 @@ atomic_t netfs_n_rh_read_failed;
 atomic_t netfs_n_rh_zero;
 atomic_t netfs_n_rh_short_read;
 atomic_t netfs_n_rh_write;
+atomic_t netfs_n_rh_write_append;
 atomic_t netfs_n_rh_write_begin;
 atomic_t netfs_n_rh_write_done;
 atomic_t netfs_n_rh_write_failed;
@@ -37,10 +38,11 @@ void netfs_stats_show(struct seq_file *m)
 		   atomic_read(&netfs_n_rh_write_zskip),
 		   atomic_read(&netfs_n_rh_rreq),
 		   atomic_read(&netfs_n_rh_sreq));
-	seq_printf(m, "RdHelp : ZR=%u sh=%u sk=%u\n",
+	seq_printf(m, "RdHelp : ZR=%u sh=%u sk=%u wa=%u\n",
 		   atomic_read(&netfs_n_rh_zero),
 		   atomic_read(&netfs_n_rh_short_read),
-		   atomic_read(&netfs_n_rh_write_zskip));
+		   atomic_read(&netfs_n_rh_write_zskip),
+		   atomic_read(&netfs_n_rh_write_append));
 	seq_printf(m, "RdHelp : DL=%u ds=%u df=%u di=%u\n",
 		   atomic_read(&netfs_n_rh_download),
 		   atomic_read(&netfs_n_rh_download_done),


^ permalink raw reply related

* Re: [PATCH bpf-next 4/4] bpftool: Generate helpers for pinning prog through bpf object skeleton
From: Daniel Borkmann @ 2022-04-25 14:03 UTC (permalink / raw)
  To: Yafang Shao, ast, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh
  Cc: netdev, bpf
In-Reply-To: <20220423140058.54414-5-laoar.shao@gmail.com>

On 4/23/22 4:00 PM, Yafang Shao wrote:
> After this change, with command 'bpftool gen skeleton XXX.bpf.o', the
> helpers for pinning BPF prog will be generated in BPF object skeleton. It
> could simplify userspace code which wants to pin bpf progs in bpffs.
> 
> The new helpers are named with __{pin, unpin}_prog, because it only pins
> bpf progs. If the user also wants to pin bpf maps in bpffs, he can use
> LIBBPF_PIN_BY_NAME.
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>   tools/bpf/bpftool/gen.c | 16 ++++++++++++++++
>   1 file changed, 16 insertions(+)
> 
> diff --git a/tools/bpf/bpftool/gen.c b/tools/bpf/bpftool/gen.c
> index 8f76d8d9996c..1d06ebde723b 100644
> --- a/tools/bpf/bpftool/gen.c
> +++ b/tools/bpf/bpftool/gen.c
> @@ -1087,6 +1087,8 @@ static int do_skeleton(int argc, char **argv)
>   			static inline int load(struct %1$s *skel);	    \n\
>   			static inline int attach(struct %1$s *skel);	    \n\
>   			static inline void detach(struct %1$s *skel);	    \n\
> +			static inline int pin_prog(struct %1$s *skel, const char *path);\n\
> +			static inline void unpin_prog(struct %1$s *skel);   \n\
>   			static inline void destroy(struct %1$s *skel);	    \n\
>   			static inline const void *elf_bytes(size_t *sz);    \n\
>   		#endif /* __cplusplus */				    \n\
> @@ -1172,6 +1174,18 @@ static int do_skeleton(int argc, char **argv)
>   		%1$s__detach(struct %1$s *obj)				    \n\
>   		{							    \n\
>   			bpf_object__detach_skeleton(obj->skeleton);	    \n\
> +		}							    \n\
> +									    \n\
> +		static inline int					    \n\
> +		%1$s__pin_prog(struct %1$s *obj, const char *path)	    \n\
> +		{							    \n\
> +			return bpf_object__pin_skeleton_prog(obj->skeleton, path);\n\
> +		}							    \n\
> +									    \n\
> +		static inline void					    \n\
> +		%1$s__unpin_prog(struct %1$s *obj)			    \n\
> +		{							    \n\
> +			bpf_object__unpin_skeleton_prog(obj->skeleton);	    \n\
>   		}							    \n\

(This should also have BPF selftest code as in-tree user.)

>   		",
>   		obj_name
> @@ -1237,6 +1251,8 @@ static int do_skeleton(int argc, char **argv)
>   		int %1$s::load(struct %1$s *skel) { return %1$s__load(skel); }		\n\
>   		int %1$s::attach(struct %1$s *skel) { return %1$s__attach(skel); }	\n\
>   		void %1$s::detach(struct %1$s *skel) { %1$s__detach(skel); }		\n\
> +		int %1$s::pin_prog(struct %1$s *skel, const char *path) { return %1$s__pin_prog(skel, path); }\n\
> +		void %1$s::unpin_prog(struct %1$s *skel) { %1$s__unpin_prog(skel); }	\n\
>   		void %1$s::destroy(struct %1$s *skel) { %1$s__destroy(skel); }		\n\
>   		const void *%1$s::elf_bytes(size_t *sz) { return %1$s__elf_bytes(sz); } \n\
>   		#endif /* __cplusplus */				    \n\
> 


^ permalink raw reply

* Re: [PATCH bpf-next 2/4] libbpf: Add helpers for pinning bpf prog through bpf object skeleton
From: Daniel Borkmann @ 2022-04-25 13:57 UTC (permalink / raw)
  To: Yafang Shao, ast, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh
  Cc: netdev, bpf
In-Reply-To: <20220423140058.54414-3-laoar.shao@gmail.com>

On 4/23/22 4:00 PM, Yafang Shao wrote:
> Currently there're helpers for allowing to open/load/attach BPF object
> through BPF object skeleton. Let's also add helpers for pinning through
> BPF object skeleton. It could simplify BPF userspace code which wants to
> pin the progs into bpffs.

Please elaborate some more on your use case/rationale for the commit message,
do you have orchestration code that will rely on these specifically?

> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>   tools/lib/bpf/libbpf.c   | 59 ++++++++++++++++++++++++++++++++++++++++
>   tools/lib/bpf/libbpf.h   |  4 +++
>   tools/lib/bpf/libbpf.map |  2 ++
>   3 files changed, 65 insertions(+)
> 
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 13fcf91e9e0e..e7ed6c53c525 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -12726,6 +12726,65 @@ void bpf_object__detach_skeleton(struct bpf_object_skeleton *s)
>   	}
>   }
>   
> +int bpf_object__pin_skeleton_prog(struct bpf_object_skeleton *s,
> +				  const char *path)

These pin the link, not the prog itself, so the func name is a bit misleading? Also,
what if is this needs to be more customized in future? It doesn't seem very generic.

> +{
> +	struct bpf_link *link;
> +	int err;
> +	int i;
> +
> +	if (!s->prog_cnt)
> +		return libbpf_err(-EINVAL);
> +
> +	if (!path)
> +		path = DEFAULT_BPFFS;
> +
> +	for (i = 0; i < s->prog_cnt; i++) {
> +		char buf[PATH_MAX];
> +		int len;
> +
> +		len = snprintf(buf, PATH_MAX, "%s/%s", path, s->progs[i].name);
> +		if (len < 0) {
> +			err = -EINVAL;
> +			goto err_unpin_prog;
> +		} else if (len >= PATH_MAX) {
> +			err = -ENAMETOOLONG;
> +			goto err_unpin_prog;
> +		}
> +
> +		link = *s->progs[i].link;
> +		if (!link) {
> +			err = -EINVAL;
> +			goto err_unpin_prog;
> +		}
> +
> +		err = bpf_link__pin(link, buf);
> +		if (err)
> +			goto err_unpin_prog;
> +	}
> +
> +	return 0;
> +
> +err_unpin_prog:
> +	bpf_object__unpin_skeleton_prog(s);
> +
> +	return libbpf_err(err);
> +}
> +
> +void bpf_object__unpin_skeleton_prog(struct bpf_object_skeleton *s)
> +{
> +	struct bpf_link *link;
> +	int i;
> +
> +	for (i = 0; i < s->prog_cnt; i++) {
> +		link = *s->progs[i].link;
> +		if (!link || !link->pin_path)
> +			continue;
> +
> +		bpf_link__unpin(link);
> +	}
> +}
> +
>   void bpf_object__destroy_skeleton(struct bpf_object_skeleton *s)
>   {
>   	if (!s)
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index 3784867811a4..af44b0968cca 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -1427,6 +1427,10 @@ bpf_object__open_skeleton(struct bpf_object_skeleton *s,
>   LIBBPF_API int bpf_object__load_skeleton(struct bpf_object_skeleton *s);
>   LIBBPF_API int bpf_object__attach_skeleton(struct bpf_object_skeleton *s);
>   LIBBPF_API void bpf_object__detach_skeleton(struct bpf_object_skeleton *s);
> +LIBBPF_API int
> +bpf_object__pin_skeleton_prog(struct bpf_object_skeleton *s, const char *path);
> +LIBBPF_API void
> +bpf_object__unpin_skeleton_prog(struct bpf_object_skeleton *s);
>   LIBBPF_API void bpf_object__destroy_skeleton(struct bpf_object_skeleton *s);

Please also add API documentation.

>   struct bpf_var_skeleton {
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index 82f6d62176dd..4e3e37b84b3a 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -55,6 +55,8 @@ LIBBPF_0.0.1 {
>   		bpf_object__unload;
>   		bpf_object__unpin_maps;
>   		bpf_object__unpin_programs;
> +		bpf_object__pin_skeleton_prog;
> +		bpf_object__unpin_skeleton_prog;

This would have to go under LIBBPF_0.8.0 if so.

>   		bpf_perf_event_read_simple;
>   		bpf_prog_attach;
>   		bpf_prog_detach;
> 


^ permalink raw reply

* Re: [PATCHv2 net-next] net/af_packet: add VLAN support for AF_PACKET SOCK_RAW GSO
From: Michael S. Tsirkin @ 2022-04-25 13:52 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: netdev, Jason Wang, David S . Miller, Jakub Kicinski, Paolo Abeni,
	Maxim Mikityanskiy, Willem de Bruijn, Balazs Nemeth,
	Mike Pattrick, Eric Dumazet
In-Reply-To: <20220425014502.985464-1-liuhangbin@gmail.com>

On Mon, Apr 25, 2022 at 09:45:02AM +0800, Hangbin Liu wrote:
> Currently, the kernel drops GSO VLAN tagged packet if it's created with
> socket(AF_PACKET, SOCK_RAW, 0) plus virtio_net_hdr.
> 
> The reason is AF_PACKET doesn't adjust the skb network header if there is
> a VLAN tag. Then after virtio_net_hdr_set_proto() called, the skb->protocol
> will be set to ETH_P_IP/IPv6. And in later inet/ipv6_gso_segment() the skb
> is dropped as network header position is invalid.
> 
> Let's handle VLAN packets by adjusting network header position in
> packet_parse_headers(). The adjustment is safe and does not affect the
> later xmit as tap device also did that.
> 
> In packet_snd(), packet_parse_headers() need to be moved before calling
> virtio_net_hdr_set_proto(), so we can set correct skb->protocol and
> network header first.
> 
> There is no need to update tpacket_snd() as it calls packet_parse_headers()
> in tpacket_fill_skb(), which is already before calling virtio_net_hdr_*
> functions.
> 
> skb->no_fcs setting is also moved upper to make all skb settings together
> and keep consistency with function packet_sendmsg_spkt().
> 
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> ---
> 
> v2: Rewrite commit description, no code update.
> ---
>  net/packet/af_packet.c | 18 +++++++++++++-----
>  1 file changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 243566129784..fd31334cf688 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -1924,12 +1924,20 @@ static int packet_rcv_spkt(struct sk_buff *skb, struct net_device *dev,
>  
>  static void packet_parse_headers(struct sk_buff *skb, struct socket *sock)
>  {
> +	int depth;
> +
>  	if ((!skb->protocol || skb->protocol == htons(ETH_P_ALL)) &&
>  	    sock->type == SOCK_RAW) {
>  		skb_reset_mac_header(skb);
>  		skb->protocol = dev_parse_header_protocol(skb);
>  	}
>  
> +	/* Move network header to the right position for VLAN tagged packets */
> +	if (likely(skb->dev->type == ARPHRD_ETHER) &&
> +	    eth_type_vlan(skb->protocol) &&
> +	    __vlan_get_protocol(skb, skb->protocol, &depth) != 0)
> +		skb_set_network_header(skb, depth);
> +
>  	skb_probe_transport_header(skb);
>  }
>  
> @@ -3047,6 +3055,11 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
>  	skb->mark = sockc.mark;
>  	skb->tstamp = sockc.transmit_time;
>  
> +	if (unlikely(extra_len == 4))
> +		skb->no_fcs = 1;
> +
> +	packet_parse_headers(skb, sock);
> +
>  	if (has_vnet_hdr) {
>  		err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le());
>  		if (err)
> @@ -3055,11 +3068,6 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
>  		virtio_net_hdr_set_proto(skb, &vnet_hdr);
>  	}
>  
> -	packet_parse_headers(skb, sock);
> -
> -	if (unlikely(extra_len == 4))
> -		skb->no_fcs = 1;
> -
>  	err = po->xmit(skb);
>  	if (unlikely(err != 0)) {
>  		if (err > 0)
> -- 
> 2.35.1


^ permalink raw reply

* Re: [PATCH net v3] virtio_net: fix wrong buf address calculation when using xdp
From: Michael S. Tsirkin @ 2022-04-25 13:52 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: netdev, kuba, davem, stable, Jason Wang, Xuan Zhuo,
	Daniel Borkmann, virtualization
In-Reply-To: <20220425103703.3067292-1-razor@blackwall.org>

On Mon, Apr 25, 2022 at 01:37:03PM +0300, Nikolay Aleksandrov wrote:
> We received a report[1] of kernel crashes when Cilium is used in XDP
> mode with virtio_net after updating to newer kernels. After
> investigating the reason it turned out that when using mergeable bufs
> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
> calculates the build_skb address wrong because the offset can become less
> than the headroom so it gets the address of the previous page (-X bytes
> depending on how lower offset is):
>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
> 
> This is a pr_err() I added in the beginning of page_to_skb which clearly
> shows offset that is less than headroom by adding 4 bytes of metadata
> via an xdp prog. The calculations done are:
>  receive_mergeable():
>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>  offset = xdp.data - page_address(xdp_page) -
>           vi->hdr_len - metasize;
> 
>  page_to_skb():
>  p = page_address(page) + offset;
>  ...
>  buf = p - headroom;
> 
> Now buf goes -4 bytes from the page's starting address as can be seen
> above which is set as skb->head and skb->data by build_skb later. Depending
> on what's done with the skb (when it's freed most often) we get all kinds
> of corruptions and BUG_ON() triggers in mm[2]. We have to recalculate
> the new headroom after the xdp program has run, similar to how offset
> and len are recalculated. Headroom is directly related to
> data_hard_start, data and data_meta, so we use them to get the new size.
> The result is correct (similar pr_err() in page_to_skb, one case of
> xdp_page and one case of virtnet buf):
>  a) Case with 4 bytes of metadata
>  [  115.949641] page_to_skb: page addr ffff8b4dcfad2000 offset 252 headroom 252
>  [  121.084105] page_to_skb: page addr ffff8b4dcf018000 offset 20732 headroom 252
>  b) Case of pushing data +32 bytes
>  [  153.181401] page_to_skb: page addr ffff8b4dd0c4d000 offset 288 headroom 288
>  [  158.480421] page_to_skb: page addr ffff8b4dd00b0000 offset 24864 headroom 288
>  c) Case of pushing data -33 bytes
>  [  835.906830] page_to_skb: page addr ffff8b4dd3270000 offset 223 headroom 223
>  [  840.839910] page_to_skb: page addr ffff8b4dcdd68000 offset 12511 headroom 223
> 
> Offset and headroom are equal because offset points to the start of
> reserved bytes for the virtio_net header which are at buf start +
> headroom, while data points at buf start + vnet hdr size + headroom so
> when data or data_meta are adjusted by the xdp prog both the headroom size
> and the offset change equally. We can use data_hard_start to compute the
> new headroom after the xdp prog (linearized / page start case, the
> virtnet buf case is similar just with bigger base offset):
>  xdp.data_hard_start = page_address + vnet_hdr
>  xdp.data = page_address + vnet_hdr + headroom
>  new headroom after xdp prog = xdp.data - xdp.data_hard_start - metasize
> 
> An example reproducer xdp prog[3] is below.
> 
> [1] https://github.com/cilium/cilium/issues/19453
> 
> [2] Two of the many traces:
>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
>  [   41.300891] kernel BUG at include/linux/mm.h:720!
>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
>  [   41.321387] Call Trace:
>  [   41.321819]  <TASK>
>  [   41.322193]  skb_release_data+0x13f/0x1c0
>  [   41.322902]  __kfree_skb+0x20/0x30
>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
>  [   41.384102]  inet_recvmsg+0x42/0x100
>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
>  [   41.428201]  sock_read_iter+0x84/0xd0
>  [   41.445592]  ? 0xffffffffa3000000
>  [   41.462442]  new_sync_read+0x148/0x160
>  [   41.479314]  ? 0xffffffffa3000000
>  [   41.496937]  vfs_read+0x138/0x190
>  [   41.517198]  ksys_read+0x87/0xc0
>  [   41.535336]  do_syscall_64+0x3b/0x90
>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>  [   41.568050] RIP: 0033:0x48765b
>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
>  [   41.744254]  </TASK>
>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
> 
>  and
> 
>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
>  [   33.532471] page dumped because: nonzero mapcount
>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>  [   33.532484] Call Trace:
>  [   33.532496]  <TASK>
>  [   33.532500]  dump_stack_lvl+0x45/0x5a
>  [   33.532506]  bad_page.cold+0x63/0x94
>  [   33.532510]  free_pcp_prepare+0x290/0x420
>  [   33.532515]  free_unref_page+0x1b/0x100
>  [   33.532518]  skb_release_data+0x13f/0x1c0
>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
> 
> [3] XDP program to reproduce(xdp_pass.c):
>  #include <linux/bpf.h>
>  #include <bpf/bpf_helpers.h>
> 
>  SEC("xdp_pass")
>  int xdp_pkt_pass(struct xdp_md *ctx)
>  {
>           bpf_xdp_adjust_head(ctx, -(int)32);
>           return XDP_PASS;
>  }
> 
>  char _license[] SEC("license") = "GPL";
> 
>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
> 
> CC: stable@vger.kernel.org
> CC: Jason Wang <jasowang@redhat.com>
> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> CC: Daniel Borkmann <daniel@iogearbox.net>
> CC: "Michael S. Tsirkin" <mst@redhat.com>
> CC: virtualization@lists.linux-foundation.org
> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>

Acked-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> v3: Add a comment explaining why offset and headroom are equal,
>     no code changes
> v2: Recalculate headroom based on data, data_hard_start and data_meta
> 
>  drivers/net/virtio_net.c | 20 +++++++++++++++++++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 87838cbe38cf..cbba9d2e8f32 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1005,6 +1005,24 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  			 * xdp.data_meta were adjusted
>  			 */
>  			len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
> +
> +			/* recalculate headroom if xdp.data or xdp_data_meta
> +			 * were adjusted, note that offset should always point
> +			 * to the start of the reserved bytes for virtio_net
> +			 * header which are followed by xdp.data, that means
> +			 * that offset is equal to the headroom (when buf is
> +			 * starting at the beginning of the page, otherwise
> +			 * there is a base offset inside the page) but it's used
> +			 * with a different starting point (buf start) than
> +			 * xdp.data (buf start + vnet hdr size). If xdp.data or
> +			 * data_meta were adjusted by the xdp prog then the
> +			 * headroom size has changed and so has the offset, we
> +			 * can use data_hard_start, which points at buf start +
> +			 * vnet hdr size, to calculate the new headroom and use
> +			 * it later to compute buf start in page_to_skb()
> +			 */
> +			headroom = xdp.data - xdp.data_hard_start - metasize;
> +
>  			/* We can only create skb based on xdp_page. */
>  			if (unlikely(xdp_page != page)) {
>  				rcu_read_unlock();
> @@ -1012,7 +1030,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  				head_skb = page_to_skb(vi, rq, xdp_page, offset,
>  						       len, PAGE_SIZE, false,
>  						       metasize,
> -						       VIRTIO_XDP_HEADROOM);
> +						       headroom);
>  				return head_skb;
>  			}
>  			break;
> -- 
> 2.35.1


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox