Netdev List
 help / color / mirror / Atom feed
* Re: [RFC net-next 2/2] ipv6: ioam: Support for Buffer occupancy data field
From: Justin Iurman @ 2021-12-09 14:10 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, dsahern, yoshfuji, linux-mm, cl, penberg, rientjes,
	iamjoonsoo kim, akpm, vbabka
In-Reply-To: <20211208141825.3091923c@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>

>> True. But keep also in mind the scope of IOAM which is not to be
>> deployed widely on the Internet. It is deployed on limited (aka private)
>> domains where each node is therefore managed by the operator. So, I'm
>> not really sure why you think that the implementation specific thing is
>> a problem here. Context of "unit" is provided by the IOAM Namespace-ID
>> attached to the trace, as well as each Node-ID if included. Again, it's
>> up to the operator to interpret values accordingly, depending on each
>> node (i.e., the operator has a large and detailed view of his domain; he
>> knows if the buffer occupancy value "X" is abnormal or not for a
>> specific node, he knows which unit is used for a specific node, etc).
> 
> It's quite likely I'm missing the point.

Let me try again to make it all clear on your mind. Here are some quoted
paragraphs from the spec:

  "Generic data: Format-free information where syntax and semantic of
   the information is defined by the operator in a specific
   deployment.  For a specific IOAM-Namespace, all IOAM nodes have to
   interpret the generic data the same way.  Examples for generic
   IOAM data include geo-location information (location of the node
   at the time the packet was processed), buffer queue fill level or
   cache fill level at the time the packet was processed, or even a
   battery charge level."

This one basically says that the IOAM Namespace-ID (in the IOAM Trace
Option-Type header) is responsible for providing context to data fields
(i.e., for "units" too, in case of generic fields such as queue depth or
buffer occupancy). So it's up to the operator to gather similar nodes
within a same IOAM Namespace. And, even if "different" kind of nodes are
within an IOAM Namespace, you still have a fallback solution if Node IDs
are part of the trace (the "hop-lim & node-id" data field, bit 0 in the
trace type). Indeed, the operator (or the collector/interpretor) knows if
node A uses "bytes" or any other units for a generic data field.

  "It should be noted that the semantics of some of the node data fields
   that are defined below, such as the queue depth and buffer occupancy,
   are implementation specific.  This approach is intended to allow IOAM
   nodes with various different architectures."

The last sentence is important here and is, in fact, related to what you
describe below. Having genericity on units for such data fields allows
for supporting multiple architectures. Same idea for the following
paragraph:

  "Data fields and associated data types for each of the IOAM-Data-
   Fields are specified in the following sections.  The definition of
   IOAM-Data-Fields focuses on the syntax of the data-fields and avoids
   specifying the semantics where feasible.  This is why no units are
   defined for data-fields like e.g., "buffer occupancy" or "queue
   depth".  With this approach, nodes can supply the information in
   their native format and are not required to perform unit or format
   conversions.  Systems that further process IOAM information, like
   e.g., a network management system are assumed to also handle unit
   conversions as part of their IOAM data-fields processing.  The
   combination of a particular data-field and the namespace-id provides
   for the context to interpret the provided data appropriately."

Does it make more sense now on why it's not really a problem to have
implementation specific units for such data fields?

>> [...]
>>
>> Do you believe this patch does not provide what is defined in the spec?
>> If so, I'm open to any suggestions.
> 
> The opposite, in a sense. I think the patch does implement behavior
> within a reasonable interpretation of the standard. But the feature
> itself seems more useful for forwarding ASICs than Linux routers,

Good point. Actually, it's probably why such data field was defined as it
is.

> because Linux routers can run a full telemetry stack and all sort
> of advanced SW instrumentation. The use case for reporting kernel
> memory use via IOAM's constrained interface does not seem particularly
> practical since it's not providing a very strong signal on what's
> going on.

I agree and disagree. I disagree because this value definitely tells you
that something (potentially bad) is going on, when it increases
significantly enough to reach a critical threshold. Basically, we need
more skb's, but oh, the pool is exhausted. OK, not a problem, expand the
pool. Oh wait, no memory left. Why? Is it only due to too much
(temporary?) load? Should I put the blame on the NIC? Is it a memory
issue? Is it something else? Or maybe several issues combined? Well, you
might not know exactly why (though you know there is a problem), which is
also why I agree with you. But, this is also why you have other data
fields available (i.e., detecting a problem might require 2+ symptoms
instead of just one).

> For switches running Linux the switch ASIC buffer occupancy can be read
> via devlink-sb that'd seem like a better fit for me, but unfortunately
> the devlink calls can sleep so we can't read such device info from the
> datapath.

Indeed, would be a better fit. I didn't know about this one, thanks for
that. It's a shame it can't be used in this context, though. But, at the
end of the day, we're left with nothing regarding buffer occupancy. So
I'm wondering if "something" is not better than "nothing" in this case.
And, for that, we're back to my previous answer on why I agree and
disagree with what you said about its utility.

^ permalink raw reply

* Re: [PATCH net-next v3 5/6] net: lan966x: Add vlan support
From: Vladimir Oltean @ 2021-12-09 13:59 UTC (permalink / raw)
  To: Horatiu Vultur
  Cc: netdev@vger.kernel.org, devicetree@vger.kernel.org,
	linux-kernel@vger.kernel.org, davem@davemloft.net,
	kuba@kernel.org, robh+dt@kernel.org, UNGLinuxDriver@microchip.com,
	linux@armlinux.org.uk, f.fainelli@gmail.com,
	vivien.didelot@gmail.com, andrew@lunn.ch
In-Reply-To: <20211209094615.329379-6-horatiu.vultur@microchip.com>

On Thu, Dec 09, 2021 at 10:46:14AM +0100, Horatiu Vultur wrote:
> +int lan966x_vlan_port_set_vid(struct lan966x_port *port, u16 vid,
> +			      bool pvid, bool untagged)
> +{
> +	struct lan966x *lan966x = port->lan966x;
> +
> +	/* Egress vlan classification */
> +	if (untagged && port->vid != vid) {
> +		if (port->vid) {
> +			dev_err(lan966x->dev,
> +				"Port already has a native VLAN: %d\n",
> +				port->vid);
> +			return -EBUSY;

Are you interested in supporting the use case from 0da1a1c48911 ("net:
mscc: ocelot: allow a config where all bridge VLANs are egress-untagged")?
Because it would be good if the driver was structured that way from the
get-go instead of patching it later.

> +		}
> +		port->vid = vid;
> +	}
> +
> +	/* Default ingress vlan classification */
> +	if (pvid)
> +		port->pvid = vid;
> +
> +	return 0;
> +}

^ permalink raw reply

* Re: [PATCH v2, 2/2] net: Add DM9051 driver
From: Leon Romanovsky @ 2021-12-09 13:53 UTC (permalink / raw)
  To: JosephCHANG
  Cc: David S . Miller, Jakub Kicinski, Rob Herring, joseph_chang,
	netdev, devicetree, linux-kernel
In-Reply-To: <20211209100702.5609-3-josright123@gmail.com>

On Thu, Dec 09, 2021 at 06:07:02PM +0800, JosephCHANG wrote:
> Add davicom dm9051 SPI ethernet driver. The driver work with dts
> for:
> 
>      - spi bus number
>      - spi chip select
>      - spi clock frequency
>      - interrupt gpio pin
>      - interrupt polarity fixed as low
> 
>     Test OK with Rpi 2 and Rpi 4. Max spi speed is 31200000.

Please don't do this type of formatting, where you added extra
indentations without reason.

> 
> Signed-off-by: JosephCHANG <josright123@gmail.com>
> [Submit v1 has Reported-by: kernel test robot <lkp@intel.com>]
> ---
>  drivers/net/ethernet/davicom/Kconfig  |  30 +
>  drivers/net/ethernet/davicom/Makefile |   1 +
>  drivers/net/ethernet/davicom/dm9051.c | 967 ++++++++++++++++++++++++++
>  drivers/net/ethernet/davicom/dm9051.h | 248 +++++++
>  4 files changed, 1246 insertions(+)
>  create mode 100644 drivers/net/ethernet/davicom/dm9051.c
>  create mode 100644 drivers/net/ethernet/davicom/dm9051.h
> 
> diff --git a/drivers/net/ethernet/davicom/Kconfig b/drivers/net/ethernet/davicom/Kconfig
> index 7af86b6d4150..9c00328f6e05 100644
> --- a/drivers/net/ethernet/davicom/Kconfig
> +++ b/drivers/net/ethernet/davicom/Kconfig
> @@ -3,6 +3,20 @@
>  # Davicom device configuration
>  #
>  
> +config NET_VENDOR_DAVICOM
> +	bool "Davicom devices"
> +	depends on ARM || MIPS || COLDFIRE || NIOS2 || COMPILE_TEST || SPI
> +	default y
> +	help
> +	  If you have a network (Ethernet) card belonging to this class, say Y.
> +
> +	  Note that the answer to this question doesn't directly affect the
> +	  kernel: saying N will just cause the configurator to skip all
> +	  the questions about Davicom devices. If you say Y, you will be asked
> +	  for your specific card in the following selections.
> +
> +if NET_VENDOR_DAVICOM
> +
>  config DM9000
>  	tristate "DM9000 support"
>  	depends on ARM || MIPS || COLDFIRE || NIOS2 || COMPILE_TEST
> @@ -22,3 +36,19 @@ config DM9000_FORCE_SIMPLE_PHY_POLL
>  	  bit to determine if the link is up or down instead of the more
>  	  costly MII PHY reads. Note, this will not work if the chip is
>  	  operating with an external PHY.
> +
> +config DM9051
> +	tristate "DM9051 SPI support"
> +	depends on SPI
> +	select CRC32
> +	select MII
> +	help
> +	  Support for DM9051 SPI chipset.
> +
> +	  To compile this driver as a module, choose M here.  The module
> +	  will be called dm9051.
> +
> +	  The SPI mode for the host's SPI master to access DM9051 is mode
> +	  0 on the SPI bus.
> +
> +endif # NET_VENDOR_DAVICOM
> diff --git a/drivers/net/ethernet/davicom/Makefile b/drivers/net/ethernet/davicom/Makefile
> index 173c87d21076..225f85bc1f53 100644
> --- a/drivers/net/ethernet/davicom/Makefile
> +++ b/drivers/net/ethernet/davicom/Makefile
> @@ -4,3 +4,4 @@
>  #
>  
>  obj-$(CONFIG_DM9000) += dm9000.o
> +obj-$(CONFIG_DM9051) += dm9051.o
> diff --git a/drivers/net/ethernet/davicom/dm9051.c b/drivers/net/ethernet/davicom/dm9051.c
> new file mode 100644
> index 000000000000..cdcf9d37ed7f
> --- /dev/null
> +++ b/drivers/net/ethernet/davicom/dm9051.c
> @@ -0,0 +1,967 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + *	Ethernet driver for the Davicom DM9051 chip.
> + *
> + *	This program is free software; you can redistribute it and/or
> + *	modify it under the terms of the GNU General Public License
> + *	as published by the Free Software Foundation; either version 2
> + *	of the License, or (at your option) any later version.
> + *
> + *	This program is distributed in the hope that it will be useful,
> + *	but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *	GNU General Public License for more details.

Please don't add license text. The SPDX line is more than enough.

> + *
> + *	Copyright 2021 Davicom Semiconductor,Inc.
> + *	http://www.davicom.com.tw/
> + *	Joseph CHANG <joseph_chang@davicom.com.tw>
> + *	20211110b, Total 933 lines
> + */
> +
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/netdevice.h>
> +#include <linux/etherdevice.h>
> +#include <linux/interrupt.h>
> +#include <linux/skbuff.h>
> +#include <linux/spinlock.h>
> +#include <linux/cache.h>
> +#include <linux/crc32.h>
> +#include <linux/mii.h>
> +#include <linux/ethtool.h>
> +#include <linux/delay.h>
> +#include <linux/irq.h>
> +#include <linux/slab.h>
> +#include <linux/gpio.h>
> +#include <linux/iopoll.h>
> +#include <linux/of_gpio.h>
> +#include <linux/spi/spi.h>
> +
> +#include "dm9051.h"
> +
> +#define	DRV_PRODUCT_NAME	"dm9051"
> +#define	DRV_VERSION_CODE	DM_VERSION(5, 0, 5)			//(VER5.0.0= 0x050000)
> +#define	DRV_VERSION_DATE	"20211209"				//(update)"

Two points.
1. Try to avoid vertical alignment. It adds churn when backporting.
2. Don't add DRV_VERSION_CODE and DRV_VERSION_DATE. It is useless in
upstream kernel.

> +
> +/* spi-spi_sync, low level code */
> +static int burst_xfer(struct board_info *db, u8 cmdphase, u8 *txb, u8 *rxb, unsigned int len)
> +{
> +	struct device *dev = &db->spidev->dev;
> +	int ret = 0;
> +
> +	db->cmd[0] = cmdphase;
> +	db->spi_xfer2[0].tx_buf = &db->cmd[0];
> +	db->spi_xfer2[0].rx_buf = NULL;
> +	db->spi_xfer2[0].len = 1;
> +	if (!rxb) { //write

"//"is C++ coding style, which is also in C99 standard.
Is it ok to use such comment style in netdev?

<...>

> +	//rcr_reg_start(db, db->rcr_all); //rx enable later

I spotted many commented functions. They shouldn't be part of
submission.

<...>

> +static void dm_msg_open(struct net_device *ndev)
> +{
> +	struct board_info *db = netdev_priv(ndev);
> +	struct device *dev = &db->spidev->dev;
> +
> +	snprintf(db->DRV_VERSION, sizeof(db->DRV_VERSION), "%s_V%d.%d.%d_date_%s",
> +		 DRV_PRODUCT_NAME, (DRV_VERSION_CODE >> 16 & 0xff),
> +		 (DRV_VERSION_CODE >> 8 & 0xff),
> +		 (DRV_VERSION_CODE & 0xff),
> +		 DRV_VERSION_DATE);
> +	dev_info(dev, "version: %s\n", db->DRV_VERSION);
> +}

This should go.

<...>

> +++ b/drivers/net/ethernet/davicom/dm9051.h
> @@ -0,0 +1,248 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright 2021 Davicom Semiconductor,Inc.
> + *	http://www.davicom.com.tw
> + *	2014/03/11  Joseph CHANG  v1.0  Create
> + *	2021/10/26  Joseph CHANG  v5.0.1  Update
> + *	2021/12/09  Joseph CHANG  v5.0.5  Update

It is new file, there is no value in off-tree history.

> + *
> + * DM9051 register definitions
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.

No need.

<...>

> +#define DM9051_PIDL		0x2A
> +#define DM9051_PIDH		0x2B
> +#define DM9051_SMCR		0x2F
> +#define	DM9051_ATCR		0x30
> +#define	DM9051_SPIBCR		0x38

Please be consistent.

Thanks

^ permalink raw reply

* [PATCH] bpf: return EOPNOTSUPP when JIT is needed and not possible
From: Thadeu Lima de Souza Cascardo @ 2021-12-09 13:40 UTC (permalink / raw)
  To: bpf; +Cc: netdev, ast, daniel, linux-kernel

When a CBPF program is JITed and CONFIG_BPF_JIT_ALWAYS_ON is enabled, and
the JIT fails, it would return ENOTSUPP, which is not a valid userspace
error code.  Instead, EOPNOTSUPP should be returned.

Fixes: 290af86629b2 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
---
 kernel/bpf/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index de3e5bc6781f..5c89bae0d6f9 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1931,7 +1931,7 @@ struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
 		fp = bpf_int_jit_compile(fp);
 		bpf_prog_jit_attempt_done(fp);
 		if (!fp->jited && jit_needed) {
-			*err = -ENOTSUPP;
+			*err = -EOPNOTSUPP;
 			return fp;
 		}
 	} else {
-- 
2.32.0


^ permalink raw reply related

* Re: [PATCH net-next v3 6/6] net: lan966x: Add switchdev support
From: Vladimir Oltean @ 2021-12-09 13:36 UTC (permalink / raw)
  To: Horatiu Vultur
  Cc: netdev@vger.kernel.org, devicetree@vger.kernel.org,
	linux-kernel@vger.kernel.org, davem@davemloft.net,
	kuba@kernel.org, robh+dt@kernel.org, UNGLinuxDriver@microchip.com,
	linux@armlinux.org.uk, f.fainelli@gmail.com,
	vivien.didelot@gmail.com, andrew@lunn.ch
In-Reply-To: <20211209094615.329379-7-horatiu.vultur@microchip.com>

On Thu, Dec 09, 2021 at 10:46:15AM +0100, Horatiu Vultur wrote:
> This adds support for switchdev in lan966x.
> It offloads to the HW basic forwarding and vlan filtering. To be able to
> offload this to the HW, it is required to disable promisc mode for ports
> that are part of the bridge.
> 
> Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
> ---
>  .../net/ethernet/microchip/lan966x/Makefile   |   3 +-
>  .../ethernet/microchip/lan966x/lan966x_main.c |  41 +-
>  .../ethernet/microchip/lan966x/lan966x_main.h |  18 +
>  .../microchip/lan966x/lan966x_switchdev.c     | 548 ++++++++++++++++++
>  .../ethernet/microchip/lan966x/lan966x_vlan.c |  12 +-
>  5 files changed, 610 insertions(+), 12 deletions(-)
>  create mode 100644 drivers/net/ethernet/microchip/lan966x/lan966x_switchdev.c
> 
> diff --git a/drivers/net/ethernet/microchip/lan966x/Makefile b/drivers/net/ethernet/microchip/lan966x/Makefile
> index f7e6068a91cb..d82e896c2e53 100644
> --- a/drivers/net/ethernet/microchip/lan966x/Makefile
> +++ b/drivers/net/ethernet/microchip/lan966x/Makefile
> @@ -6,4 +6,5 @@
>  obj-$(CONFIG_LAN966X_SWITCH) += lan966x-switch.o
>  
>  lan966x-switch-objs  := lan966x_main.o lan966x_phylink.o lan966x_port.o \
> -			lan966x_mac.o lan966x_ethtool.o lan966x_vlan.o
> +			lan966x_mac.o lan966x_ethtool.o lan966x_switchdev.o \
> +			lan966x_vlan.o
> diff --git a/drivers/net/ethernet/microchip/lan966x/lan966x_main.c b/drivers/net/ethernet/microchip/lan966x/lan966x_main.c
> index 1b4c7e6b4f85..aee36c1cfa17 100644
> --- a/drivers/net/ethernet/microchip/lan966x/lan966x_main.c
> +++ b/drivers/net/ethernet/microchip/lan966x/lan966x_main.c
> @@ -306,7 +306,7 @@ static int lan966x_port_xmit(struct sk_buff *skb, struct net_device *dev)
>  	return lan966x_port_ifh_xmit(skb, ifh, dev);
>  }
>  
> -static void lan966x_set_promisc(struct lan966x_port *port, bool enable)
> +void lan966x_set_promisc(struct lan966x_port *port, bool enable)
>  {
>  	struct lan966x *lan966x = port->lan966x;
>  

My documentation of CPU_SRC_COPY_ENA says:

If set, all frames received on this port are
copied to the CPU extraction queue given by
CPUQ_CFG.CPUQ_SRC_COPY.

I think it was established a while ago that this isn't what promiscuous
mode is about? Instead it is about accepting packets on a port
regardless of whether the MAC DA is in their RX filter or not.

Hence the oddity of your change. I understand what it intends to do:
if this is a standalone port you support IFF_UNICAST_FLT, so you drop
frames with unknown MAC DA. But if IFF_PROMISC is set, then why do you
copy all frames to the CPU? Why don't you just put the CPU in the
unknown flooding mask? There's a difference between "force a packet to
get copied to the CPU" and "copy it to the CPU only if it may have
business there". Then this change would be compatible with bridge mode.
You want the bridge to receive unknown traffic too, it may need to
forward it in software.

> @@ -318,14 +318,18 @@ static void lan966x_set_promisc(struct lan966x_port *port, bool enable)
>  static void lan966x_port_change_rx_flags(struct net_device *dev, int flags)
>  {
>  	struct lan966x_port *port = netdev_priv(dev);
> +	bool enable;
>  
>  	if (!(flags & IFF_PROMISC))
>  		return;
>  
> -	if (dev->flags & IFF_PROMISC)
> -		lan966x_set_promisc(port, true);
> -	else
> -		lan966x_set_promisc(port, false);
> +	enable = dev->flags & IFF_PROMISC ? true : false;
> +	port->promisc = enable;
> +
> +	if (port->bridge)
> +		return;
> +
> +	lan966x_set_promisc(port, enable);
>  }
>  
>  static int lan966x_port_change_mtu(struct net_device *dev, int new_mtu)
> @@ -340,7 +344,7 @@ static int lan966x_port_change_mtu(struct net_device *dev, int new_mtu)
>  	return 0;
>  }
>  
> -static int lan966x_mc_unsync(struct net_device *dev, const unsigned char *addr)
> +int lan966x_mc_unsync(struct net_device *dev, const unsigned char *addr)
>  {
>  	struct lan966x_port *port = netdev_priv(dev);
>  	struct lan966x *lan966x = port->lan966x;
> @@ -348,7 +352,7 @@ static int lan966x_mc_unsync(struct net_device *dev, const unsigned char *addr)
>  	return lan966x_mac_forget(lan966x, addr, port->pvid, ENTRYTYPE_LOCKED);
>  }
>  
> -static int lan966x_mc_sync(struct net_device *dev, const unsigned char *addr)
> +int lan966x_mc_sync(struct net_device *dev, const unsigned char *addr)
>  {
>  	struct lan966x_port *port = netdev_priv(dev);
>  	struct lan966x *lan966x = port->lan966x;
> @@ -401,6 +405,11 @@ static const struct net_device_ops lan966x_port_netdev_ops = {
>  	.ndo_vlan_rx_kill_vid		= lan966x_vlan_rx_kill_vid,
>  };
>  
> +bool lan966x_netdevice_check(const struct net_device *dev)
> +{
> +	return dev && (dev->netdev_ops == &lan966x_port_netdev_ops);

Can "dev" ever be NULL?

> +}
> +
>  static int lan966x_port_xtr_status(struct lan966x *lan966x, u8 grp)
>  {
>  	return lan_rd(lan966x, QS_XTR_RD(grp));
> @@ -537,6 +546,11 @@ static irqreturn_t lan966x_xtr_irq_handler(int irq, void *args)
>  
>  		skb->protocol = eth_type_trans(skb, dev);
>  
> +#ifdef CONFIG_NET_SWITCHDEV
> +		if (lan966x->ports[src_port]->bridge)
> +			skb->offload_fwd_mark = 1;
> +#endif
> +
>  		netif_rx_ni(skb);
>  		dev->stats.rx_bytes += len;
>  		dev->stats.rx_packets++;
> @@ -619,13 +633,16 @@ static int lan966x_probe_port(struct lan966x *lan966x, u32 p,
>  
>  	dev->netdev_ops = &lan966x_port_netdev_ops;
>  	dev->ethtool_ops = &lan966x_ethtool_ops;
> +	dev->hw_features |= NETIF_F_HW_VLAN_CTAG_FILTER |
> +			    NETIF_F_RXFCS;
> +	dev->features |= NETIF_F_HW_VLAN_CTAG_FILTER |
> +			 NETIF_F_HW_VLAN_CTAG_TX |
> +			 NETIF_F_HW_VLAN_STAG_TX;
> +	dev->priv_flags |= IFF_UNICAST_FLT;

Too many changes in one patch. IFF_UNICAST_FLT and the handling of
promiscuous mode have nothing to do with switchdev. Neither VLAN
filtering via dev->features (should have been part of the previous
patch), or RXFCS. It seems that each one of these changes was previously
missed and is now snuck in without the explanation it deserves.

>  	dev->needed_headroom = IFH_LEN * sizeof(u32);
>  
>  	eth_hw_addr_gen(dev, lan966x->base_mac, p + 1);
>  
> -	lan966x_mac_learn(lan966x, PGID_CPU, dev->dev_addr, port->pvid,
> -			  ENTRYTYPE_LOCKED);
> -

Why is this deleted? If the port uses IFF_UNICAST_FLT and isn't
promiscuous, how does it get the unicast traffic it needs?
I may be not realizing something because the changes aren't properly
split.

>  	port->phylink_config.dev = &port->dev->dev;
>  	port->phylink_config.type = PHYLINK_NETDEV;
>  	port->phylink_pcs.poll = true;
> @@ -949,6 +966,8 @@ static int lan966x_probe(struct platform_device *pdev)
>  		lan966x_port_init(lan966x->ports[p]);
>  	}
>  
> +	lan966x_register_notifier_blocks(lan966x);
> +
>  	return 0;
>  
>  cleanup_ports:
> @@ -967,6 +986,8 @@ static int lan966x_remove(struct platform_device *pdev)
>  {
>  	struct lan966x *lan966x = platform_get_drvdata(pdev);
>  
> +	lan966x_unregister_notifier_blocks(lan966x);
> +
>  	lan966x_cleanup_ports(lan966x);
>  
>  	cancel_delayed_work_sync(&lan966x->stats_work);
> diff --git a/drivers/net/ethernet/microchip/lan966x/lan966x_main.h b/drivers/net/ethernet/microchip/lan966x/lan966x_main.h
> index ec3eccf634b3..4a0988087167 100644
> --- a/drivers/net/ethernet/microchip/lan966x/lan966x_main.h
> +++ b/drivers/net/ethernet/microchip/lan966x/lan966x_main.h
> @@ -80,6 +80,11 @@ struct lan966x {
>  	struct list_head mac_entries;
>  	spinlock_t mac_lock; /* lock for mac_entries list */
>  
> +	/* Notifiers */
> +	struct notifier_block netdevice_nb;
> +	struct notifier_block switchdev_nb;
> +	struct notifier_block switchdev_blocking_nb;
> +
>  	u16 vlan_mask[VLAN_N_VID];
>  	DECLARE_BITMAP(cpu_vlan_mask, VLAN_N_VID);
>  
> @@ -112,6 +117,10 @@ struct lan966x_port {
>  	struct net_device *dev;
>  	struct lan966x *lan966x;
>  
> +	struct net_device *bridge;
> +	u8 stp_state;
> +	u8 promisc;
> +
>  	u8 chip_port;
>  	u16 pvid;
>  	u16 vid;
> @@ -129,6 +138,14 @@ extern const struct phylink_mac_ops lan966x_phylink_mac_ops;
>  extern const struct phylink_pcs_ops lan966x_phylink_pcs_ops;
>  extern const struct ethtool_ops lan966x_ethtool_ops;
>  
> +int lan966x_mc_unsync(struct net_device *dev, const unsigned char *addr);
> +int lan966x_mc_sync(struct net_device *dev, const unsigned char *addr);
> +
> +bool lan966x_netdevice_check(const struct net_device *dev);
> +
> +int lan966x_register_notifier_blocks(struct lan966x *lan966x);
> +void lan966x_unregister_notifier_blocks(struct lan966x *lan966x);
> +
>  void lan966x_stats_get(struct net_device *dev,
>  		       struct rtnl_link_stats64 *stats);
>  int lan966x_stats_init(struct lan966x *lan966x);
> @@ -139,6 +156,7 @@ void lan966x_port_status_get(struct lan966x_port *port,
>  			     struct phylink_link_state *state);
>  int lan966x_port_pcs_set(struct lan966x_port *port,
>  			 struct lan966x_port_config *config);
> +void lan966x_set_promisc(struct lan966x_port *port, bool enable);
>  void lan966x_port_init(struct lan966x_port *port);
>  
>  int lan966x_mac_learn(struct lan966x *lan966x, int port,
> diff --git a/drivers/net/ethernet/microchip/lan966x/lan966x_switchdev.c b/drivers/net/ethernet/microchip/lan966x/lan966x_switchdev.c
> new file mode 100644
> index 000000000000..ed6ec78d2d9a
> --- /dev/null
> +++ b/drivers/net/ethernet/microchip/lan966x/lan966x_switchdev.c
> @@ -0,0 +1,548 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +
> +#include <linux/if_bridge.h>
> +#include <net/switchdev.h>
> +
> +#include "lan966x_main.h"
> +
> +static struct workqueue_struct *lan966x_owq;
> +
> +struct lan966x_fdb_event_work {
> +	struct work_struct work;
> +	struct switchdev_notifier_fdb_info fdb_info;
> +	struct net_device *dev;
> +	struct lan966x *lan966x;
> +	unsigned long event;
> +};
> +
> +static void lan966x_port_attr_bridge_flags(struct lan966x_port *port,
> +					   struct switchdev_brport_flags flags)
> +{
> +	u32 val = lan_rd(port->lan966x, ANA_PGID(PGID_MC));
> +
> +	val = ANA_PGID_PGID_GET(val);
> +
> +	if (flags.mask & BR_MCAST_FLOOD) {
> +		if (flags.val & BR_MCAST_FLOOD)
> +			val |= BIT(port->chip_port);
> +		else
> +			val &= ~BIT(port->chip_port);
> +	}
> +
> +	lan_rmw(ANA_PGID_PGID_SET(val),
> +		ANA_PGID_PGID,
> +		port->lan966x, ANA_PGID(PGID_MC));
> +}
> +
> +static u32 lan966x_get_fwd_mask(struct lan966x_port *port)
> +{
> +	struct net_device *bridge = port->bridge;
> +	struct lan966x *lan966x = port->lan966x;
> +	u8 ingress_src = port->chip_port;
> +	u32 mask = 0;
> +	int p;
> +
> +	if (port->stp_state != BR_STATE_FORWARDING)
> +		goto skip_forwarding;
> +
> +	for (p = 0; p < lan966x->num_phys_ports; p++) {
> +		port = lan966x->ports[p];
> +
> +		if (!port)
> +			continue;
> +
> +		if (port->stp_state == BR_STATE_FORWARDING &&
> +		    port->bridge == bridge)
> +			mask |= BIT(p);
> +	}
> +
> +skip_forwarding:
> +	mask &= ~BIT(ingress_src);
> +
> +	return mask;
> +}
> +
> +static void lan966x_update_fwd_mask(struct lan966x *lan966x)
> +{
> +	int p;
> +
> +	for (p = 0; p < lan966x->num_phys_ports; p++) {
> +		struct lan966x_port *port = lan966x->ports[p];
> +		unsigned long mask = 0;
> +
> +		if (port->bridge)
> +			mask = lan966x_get_fwd_mask(port);
> +
> +		mask |= BIT(CPU_PORT);
> +
> +		lan_wr(ANA_PGID_PGID_SET(mask),
> +		       lan966x, ANA_PGID(PGID_SRC + p));
> +	}
> +}
> +
> +static void lan966x_attr_stp_state_set(struct lan966x_port *port,
> +				       u8 state)
> +{
> +	struct lan966x *lan966x = port->lan966x;
> +	bool learn_ena = 0;
> +
> +	port->stp_state = state;
> +
> +	if (state == BR_STATE_FORWARDING || state == BR_STATE_LEARNING)
> +		learn_ena = 1;

Please use true/false for bool types.

> +
> +	lan_rmw(ANA_PORT_CFG_LEARN_ENA_SET(learn_ena),
> +		ANA_PORT_CFG_LEARN_ENA,
> +		lan966x, ANA_PORT_CFG(port->chip_port));
> +
> +	lan966x_update_fwd_mask(lan966x);
> +}
> +
> +static void lan966x_port_attr_ageing_set(struct lan966x_port *port,
> +					 unsigned long ageing_clock_t)
> +{
> +	unsigned long ageing_jiffies = clock_t_to_jiffies(ageing_clock_t);
> +	u32 ageing_time = jiffies_to_msecs(ageing_jiffies) / 1000;
> +
> +	lan966x_mac_set_ageing(port->lan966x, ageing_time);
> +}
> +
> +static int lan966x_port_attr_set(struct net_device *dev, const void *ctx,
> +				 const struct switchdev_attr *attr,
> +				 struct netlink_ext_ack *extack)
> +{
> +	struct lan966x_port *port = netdev_priv(dev);
> +
> +	switch (attr->id) {
> +	case SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS:
> +		lan966x_port_attr_bridge_flags(port, attr->u.brport_flags);
> +		break;

no SWITCHDEV_ATTR_ID_PORT_PRE_BRIDGE_FLAGS? I think it doesn't even work
if you don't handle that?

br_switchdev_set_port_flag():

	attr.id = SWITCHDEV_ATTR_ID_PORT_PRE_BRIDGE_FLAGS;
	attr.u.brport_flags.val = flags;
	attr.u.brport_flags.mask = mask;

	/* We run from atomic context here */
	err = call_switchdev_notifiers(SWITCHDEV_PORT_ATTR_SET, p->dev,
				       &info.info, extack);
	err = notifier_to_errno(err);
	if (err == -EOPNOTSUPP)
		return 0;

Anyway, a big blob of "switchdev support" is hard to follow and review.
If you add bridge port flags you could as well add more comprehensive
support for them, but in a separate change please. Forwarding domain is
one thing, FDB/MDB is another, VLAN is another.

> +	case SWITCHDEV_ATTR_ID_PORT_STP_STATE:
> +		lan966x_attr_stp_state_set(port, attr->u.stp_state);
> +		break;
> +	case SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME:
> +		lan966x_port_attr_ageing_set(port, attr->u.ageing_time);
> +		break;
> +	case SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING:
> +		lan966x_vlan_port_set_vlan_aware(port, attr->u.vlan_filtering);
> +		lan966x_vlan_port_apply(port);
> +		lan966x_vlan_cpu_set_vlan_aware(port);
> +		break;
> +	default:
> +		return -EOPNOTSUPP;
> +	}
> +
> +	return 0;
> +}
> +
> +static int lan966x_port_bridge_join(struct lan966x_port *port,
> +				    struct net_device *bridge,
> +				    struct netlink_ext_ack *extack)
> +{
> +	struct net_device *dev = port->dev;
> +	int err;
> +
> +	err = switchdev_bridge_port_offload(dev, dev, NULL, NULL, NULL,
> +					    false, extack);
> +	if (err)
> +		return err;
> +
> +	port->bridge = bridge;
> +
> +	/* Port enters in bridge mode therefor don't need to copy to CPU
> +	 * frames for multicast in case the bridge is not requesting them
> +	 */
> +	__dev_mc_unsync(dev, lan966x_mc_unsync);

Why is there a need to unsync the addresses, though? A driver that
supports proper MAC table isolation between standalone ports and
VLAN-unaware bridge ports should use separate pvids for the two modes of
operation (if it doesn't support other isolation mechanisms apart from VLAN,
which it looks like lan966x does not). I see that your driver even does
this in lan966x_vlan_port_get_pvid(). So the non-bridge multicast
addresses could sit there even when the port is in a bridge.

> +
> +	/* make sure that the promisc is disabled when entering under the bridge
> +	 * because we don't want all the frames to come to CPU
> +	 */
> +	lan966x_set_promisc(port, false);

What's the story here? Why don't other switchdev drivers handle promisc
in this way (copy all frames to the CPU)?

> +
> +	return 0;
> +}
> +
> +static void lan966x_port_bridge_leave(struct lan966x_port *port,
> +				      struct net_device *bridge)
> +{
> +	struct lan966x *lan966x = port->lan966x;
> +
> +	switchdev_bridge_port_unoffload(port->dev, NULL, NULL, NULL);

The bridge offers a facility to sync and unsync the host addresses on
joins and leaves. For this to work, the switchdev_bridge_port_unoffload
call should be during the NETDEV_PRECHANGEUPPER notifier event.

> +	port->bridge = NULL;
> +
> +	/* Set the port back to host mode */
> +	lan966x_vlan_port_set_vlan_aware(port, 0);

s/0/false/

> +	lan966x_vlan_port_set_vid(port, HOST_PVID, false, false);
> +	lan966x_vlan_port_apply(port);
> +
> +	lan966x_mac_cpu_learn(lan966x, port->dev->dev_addr, HOST_PVID);
> +
> +	/* Port enters in host more therefore restore mc list */
> +	__dev_mc_sync(port->dev, lan966x_mc_sync, lan966x_mc_unsync);
> +
> +	/* Restore back the promisc as it was before the interfaces was added to
> +	 * the bridge
> +	 */
> +	lan966x_set_promisc(port, port->promisc);
> +}
> +
> +static int lan966x_port_changeupper(struct net_device *dev,
> +				    struct netdev_notifier_changeupper_info *info)
> +{
> +	struct lan966x_port *port = netdev_priv(dev);
> +	struct netlink_ext_ack *extack;
> +	int err = 0;
> +
> +	extack = netdev_notifier_info_to_extack(&info->info);
> +
> +	if (netif_is_bridge_master(info->upper_dev)) {
> +		if (info->linking)
> +			err = lan966x_port_bridge_join(port, info->upper_dev,
> +						       extack);
> +		else
> +			lan966x_port_bridge_leave(port, info->upper_dev);
> +	}
> +
> +	return err;
> +}
> +
> +static int lan966x_port_add_addr(struct net_device *dev, bool up)
> +{
> +	struct lan966x_port *port = netdev_priv(dev);
> +	struct lan966x *lan966x = port->lan966x;
> +	u16 vid;
> +
> +	vid = lan966x_vlan_port_get_pvid(port);
> +
> +	if (up)
> +		lan966x_mac_cpu_learn(lan966x, dev->dev_addr, vid);
> +	else
> +		lan966x_mac_cpu_forget(lan966x, dev->dev_addr, vid);

For which uppers is this intended? The bridge? Because the bridge
notifies you of all the entries it needs, if you also consider the
replayed events and provide non-NULL pointers to
switchdev_bridge_port_offload.

> +
> +	return 0;
> +}
> +
> +static int lan966x_netdevice_port_event(struct net_device *dev,
> +					struct notifier_block *nb,
> +					unsigned long event, void *ptr)
> +{
> +	int err = 0;
> +
> +	if (!lan966x_netdevice_check(dev))
> +		return 0;
> +
> +	switch (event) {
> +	case NETDEV_CHANGEUPPER:
> +		err = lan966x_port_changeupper(dev, ptr);
> +		break;
> +	case NETDEV_PRE_UP:
> +		err = lan966x_port_add_addr(dev, true);
> +		break;
> +	case NETDEV_DOWN:
> +		err = lan966x_port_add_addr(dev, false);
> +		break;
> +	}
> +
> +	return err;
> +}
> +
> +static int lan966x_netdevice_event(struct notifier_block *nb,
> +				   unsigned long event, void *ptr)
> +{
> +	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
> +	int ret;
> +
> +	ret = lan966x_netdevice_port_event(dev, nb, event, ptr);
> +
> +	return notifier_from_errno(ret);
> +}
> +
> +static void lan966x_fdb_event_work(struct work_struct *work)
> +{
> +	struct lan966x_fdb_event_work *fdb_work =
> +		container_of(work, struct lan966x_fdb_event_work, work);
> +	struct switchdev_notifier_fdb_info *fdb_info;
> +	struct net_device *dev = fdb_work->dev;
> +	struct lan966x_port *port;
> +	struct lan966x *lan966x;
> +
> +	rtnl_lock();

rtnl_lock() shouldn't be needed.

> +
> +	fdb_info = &fdb_work->fdb_info;
> +	lan966x = fdb_work->lan966x;
> +
> +	if (lan966x_netdevice_check(dev)) {
> +		port = netdev_priv(dev);
> +
> +		switch (fdb_work->event) {
> +		case SWITCHDEV_FDB_ADD_TO_DEVICE:
> +			if (!fdb_info->added_by_user)
> +				break;

If you get notified of a MAC address dynamically learned by the software
bridge on a lan966x port, you will have allocated memory for the work
item, and scheduled it, for nothing. Please try to exit unnecessary work
early.

> +			lan966x_mac_add_entry(lan966x, port, fdb_info->addr,
> +					      fdb_info->vid);
> +			break;
> +		case SWITCHDEV_FDB_DEL_TO_DEVICE:
> +			if (!fdb_info->added_by_user)
> +				break;
> +			lan966x_mac_del_entry(lan966x, fdb_info->addr, fdb_info->vid);
> +			break;
> +		}
> +	} else {
> +		if (!netif_is_bridge_master(dev))
> +			goto out;
> +
> +		/* If the CPU is not part of the vlan then there is no point
> +		 * to copy the frames to the CPU because they will be dropped
> +		 */
> +		if (!lan966x_vlan_cpu_member_vlan_mask(lan966x, fdb_info->vid))
> +			goto out;

It isn't part of the VLAN now, but what about later? I don't see that
you keep these FDB entries anywhere and restore them when a port joins
that VLAN.

> +
> +		/* In case the bridge is called */
> +		switch (fdb_work->event) {
> +		case SWITCHDEV_FDB_ADD_TO_DEVICE:
> +			/* If there is no front port in this vlan, there is no
> +			 * point to copy the frame to CPU because it would be
> +			 * just dropped at later point. So add it only if
> +			 * there is a port
> +			 */
> +			if (!lan966x_vlan_port_any_vlan_mask(lan966x, fdb_info->vid))
> +				break;
> +
> +			lan966x_mac_cpu_learn(lan966x, fdb_info->addr, fdb_info->vid);

Does the lan966x_mac_cpu_learn() operation trigger interrupts, or only
the dynamic learning process? How do you handle migration of an FDB
entry pointing towards the CPU, towards one pointing towards a port?

> +			break;
> +		case SWITCHDEV_FDB_DEL_TO_DEVICE:
> +			/* It is OK to always forget the entry it */

Forget the entry it?

> +			lan966x_mac_cpu_forget(lan966x, fdb_info->addr, fdb_info->vid);
> +			break;
> +		}
> +	}
> +
> +out:
> +	rtnl_unlock();
> +	kfree(fdb_work->fdb_info.addr);
> +	kfree(fdb_work);
> +	dev_put(dev);
> +}
> +
> +static int lan966x_switchdev_event(struct notifier_block *nb,
> +				   unsigned long event, void *ptr)
> +{
> +	struct lan966x *lan966x = container_of(nb, struct lan966x, switchdev_nb);
> +	struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
> +	struct switchdev_notifier_fdb_info *fdb_info;
> +	struct switchdev_notifier_info *info = ptr;
> +	struct lan966x_fdb_event_work *fdb_work;
> +	int err;
> +
> +	switch (event) {
> +	case SWITCHDEV_PORT_ATTR_SET:
> +		err = switchdev_handle_port_attr_set(dev, ptr,
> +						     lan966x_netdevice_check,
> +						     lan966x_port_attr_set);
> +		return notifier_from_errno(err);
> +	case SWITCHDEV_FDB_ADD_TO_DEVICE:
> +		fallthrough;

"fallthrough;" not needed here.

> +	case SWITCHDEV_FDB_DEL_TO_DEVICE:
> +		fdb_work = kzalloc(sizeof(*fdb_work), GFP_ATOMIC);
> +		if (!fdb_work)
> +			return NOTIFY_BAD;
> +
> +		fdb_info = container_of(info,
> +					struct switchdev_notifier_fdb_info,
> +					info);
> +
> +		fdb_work->dev = dev;
> +		fdb_work->lan966x = lan966x;
> +		fdb_work->event = event;
> +		INIT_WORK(&fdb_work->work, lan966x_fdb_event_work);
> +		memcpy(&fdb_work->fdb_info, ptr, sizeof(fdb_work->fdb_info));
> +		fdb_work->fdb_info.addr = kzalloc(ETH_ALEN, GFP_ATOMIC);
> +		if (!fdb_work->fdb_info.addr)
> +			goto err_addr_alloc;
> +
> +		ether_addr_copy((u8 *)fdb_work->fdb_info.addr, fdb_info->addr);
> +		dev_hold(dev);
> +
> +		queue_work(lan966x_owq, &fdb_work->work);
> +		break;
> +	}
> +
> +	return NOTIFY_DONE;
> +err_addr_alloc:
> +	kfree(fdb_work);
> +	return NOTIFY_BAD;
> +}
> +
> +static int lan966x_handle_port_vlan_add(struct net_device *dev,
> +					struct notifier_block *nb,
> +					const struct switchdev_obj_port_vlan *v)
> +{
> +	struct lan966x_port *port;
> +	struct lan966x *lan966x;
> +
> +	/* When adding a port to a vlan, we get a callback for the port but
> +	 * also for the bridge. When get the callback for the bridge just bail
> +	 * out. Then when the bridge is added to the vlan, then we get a
> +	 * callback here but in this case the flags has set:
> +	 * BRIDGE_VLAN_INFO_BRENTRY. In this case it means that the CPU
> +	 * port is added to the vlan, so the broadcast frames and unicast frames
> +	 * with dmac of the bridge should be foward to CPU.
> +	 */
> +	if (netif_is_bridge_master(dev) &&
> +	    !(v->flags & BRIDGE_VLAN_INFO_BRENTRY))
> +		return 0;
> +
> +	lan966x = container_of(nb, struct lan966x, switchdev_blocking_nb);
> +
> +	/* In case the port gets called */
> +	if (!(netif_is_bridge_master(dev))) {
> +		if (!lan966x_netdevice_check(dev))
> +			return -EOPNOTSUPP;
> +
> +		port = netdev_priv(dev);
> +		return lan966x_vlan_port_add_vlan(port, v->vid,
> +						  v->flags & BRIDGE_VLAN_INFO_PVID,
> +						  v->flags & BRIDGE_VLAN_INFO_UNTAGGED);
> +	}
> +
> +	/* In case the bridge gets called */
> +	if (netif_is_bridge_master(dev))
> +		return lan966x_vlan_cpu_add_vlan(lan966x, dev, v->vid);
> +
> +	return 0;
> +}
> +
> +static int lan966x_handle_port_obj_add(struct net_device *dev,
> +				       struct notifier_block *nb,
> +				       struct switchdev_notifier_port_obj_info *info)
> +{
> +	const struct switchdev_obj *obj = info->obj;
> +	int err;
> +
> +	switch (obj->id) {
> +	case SWITCHDEV_OBJ_ID_PORT_VLAN:
> +		err = lan966x_handle_port_vlan_add(dev, nb,
> +						   SWITCHDEV_OBJ_PORT_VLAN(obj));
> +		break;
> +	default:
> +		err = -EOPNOTSUPP;
> +		break;
> +	}
> +
> +	info->handled = true;
> +	return err;
> +}
> +
> +static int lan966x_handle_port_vlan_del(struct net_device *dev,
> +					struct notifier_block *nb,
> +					const struct switchdev_obj_port_vlan *v)
> +{
> +	struct lan966x_port *port;
> +	struct lan966x *lan966x;
> +
> +	lan966x = container_of(nb, struct lan966x, switchdev_blocking_nb);
> +
> +	/* In case the physical port gets called */
> +	if (!netif_is_bridge_master(dev)) {
> +		if (!lan966x_netdevice_check(dev))
> +			return -EOPNOTSUPP;
> +
> +		port = netdev_priv(dev);
> +		return lan966x_vlan_port_del_vlan(port, v->vid);
> +	}
> +
> +	/* In case the bridge gets called */
> +	if (netif_is_bridge_master(dev))
> +		return lan966x_vlan_cpu_del_vlan(lan966x, dev, v->vid);
> +
> +	return 0;
> +}
> +
> +static int lan966x_handle_port_obj_del(struct net_device *dev,
> +				       struct notifier_block *nb,
> +				       struct switchdev_notifier_port_obj_info *info)
> +{
> +	const struct switchdev_obj *obj = info->obj;
> +	int err;
> +
> +	switch (obj->id) {
> +	case SWITCHDEV_OBJ_ID_PORT_VLAN:
> +		err = lan966x_handle_port_vlan_del(dev, nb,
> +						   SWITCHDEV_OBJ_PORT_VLAN(obj));
> +		break;
> +	default:
> +		err = -EOPNOTSUPP;
> +		break;
> +	}
> +
> +	info->handled = true;
> +	return err;
> +}
> +
> +static int lan966x_switchdev_blocking_event(struct notifier_block *nb,
> +					    unsigned long event,
> +					    void *ptr)
> +{
> +	struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
> +	int err;
> +
> +	switch (event) {
> +	case SWITCHDEV_PORT_OBJ_ADD:
> +		err = lan966x_handle_port_obj_add(dev, nb, ptr);
> +		return notifier_from_errno(err);
> +	case SWITCHDEV_PORT_OBJ_DEL:
> +		err = lan966x_handle_port_obj_del(dev, nb, ptr);
> +		return notifier_from_errno(err);
> +	case SWITCHDEV_PORT_ATTR_SET:
> +		err = switchdev_handle_port_attr_set(dev, ptr,
> +						     lan966x_netdevice_check,
> +						     lan966x_port_attr_set);
> +		return notifier_from_errno(err);
> +	}
> +
> +	return NOTIFY_DONE;
> +}
> +
> +int lan966x_register_notifier_blocks(struct lan966x *lan966x)
> +{
> +	int err;
> +
> +	lan966x->netdevice_nb.notifier_call = lan966x_netdevice_event;
> +	err = register_netdevice_notifier(&lan966x->netdevice_nb);
> +	if (err)
> +		return err;
> +
> +	lan966x->switchdev_nb.notifier_call = lan966x_switchdev_event;
> +	err = register_switchdev_notifier(&lan966x->switchdev_nb);
> +	if (err)
> +		goto err_switchdev_nb;
> +
> +	lan966x->switchdev_blocking_nb.notifier_call = lan966x_switchdev_blocking_event;
> +	err = register_switchdev_blocking_notifier(&lan966x->switchdev_blocking_nb);
> +	if (err)
> +		goto err_switchdev_blocking_nb;
> +
> +	lan966x_owq = alloc_ordered_workqueue("lan966x_order", 0);
> +	if (!lan966x_owq) {
> +		err = -ENOMEM;
> +		goto err_switchdev_blocking_nb;
> +	}

These should be singleton objects, otherwise things get problematic if
you have more than one switch device instantiated in the system.

> +
> +	return 0;
> +
> +err_switchdev_blocking_nb:
> +	unregister_switchdev_notifier(&lan966x->switchdev_nb);
> +err_switchdev_nb:
> +	unregister_netdevice_notifier(&lan966x->netdevice_nb);
> +
> +	return err;
> +}
> +
> +void lan966x_unregister_notifier_blocks(struct lan966x *lan966x)
> +{
> +	destroy_workqueue(lan966x_owq);
> +
> +	unregister_switchdev_blocking_notifier(&lan966x->switchdev_blocking_nb);
> +	unregister_switchdev_notifier(&lan966x->switchdev_nb);
> +	unregister_netdevice_notifier(&lan966x->netdevice_nb);
> +}
> diff --git a/drivers/net/ethernet/microchip/lan966x/lan966x_vlan.c b/drivers/net/ethernet/microchip/lan966x/lan966x_vlan.c
> index e47552775d06..26644503b4e6 100644
> --- a/drivers/net/ethernet/microchip/lan966x/lan966x_vlan.c
> +++ b/drivers/net/ethernet/microchip/lan966x/lan966x_vlan.c
> @@ -155,6 +155,9 @@ static bool lan966x_vlan_cpu_member_cpu_vlan_mask(struct lan966x *lan966x, u16 v
>  
>  u16 lan966x_vlan_port_get_pvid(struct lan966x_port *port)
>  {
> +	if (!port->bridge)
> +		return HOST_PVID;
> +
>  	return port->vlan_aware ? port->pvid : UNAWARE_PVID;
>  }
>  
> @@ -210,6 +213,8 @@ void lan966x_vlan_cpu_set_vlan_aware(struct lan966x_port *port)
>  		 * table for the front port and the CPU
>  		 */
>  		lan966x_mac_cpu_learn(lan966x, port->dev->dev_addr, UNAWARE_PVID);
> +		lan966x_mac_cpu_learn(lan966x, port->bridge->dev_addr,
> +				      UNAWARE_PVID);
>  
>  		lan966x_vlan_port_add_vlan_mask(port, UNAWARE_PVID);
>  		lan966x_vlan_port_apply(port);
> @@ -218,6 +223,8 @@ void lan966x_vlan_cpu_set_vlan_aware(struct lan966x_port *port)
>  		 * to vlan unaware
>  		 */
>  		lan966x_mac_cpu_forget(lan966x, port->dev->dev_addr, UNAWARE_PVID);
> +		lan966x_mac_cpu_forget(lan966x, port->bridge->dev_addr,
> +				       UNAWARE_PVID);
>  
>  		lan966x_vlan_port_del_vlan_mask(port, UNAWARE_PVID);
>  		lan966x_vlan_port_apply(port);
> @@ -293,6 +300,7 @@ int lan966x_vlan_port_add_vlan(struct lan966x_port *port,
>  	 */
>  	if (lan966x_vlan_cpu_member_cpu_vlan_mask(lan966x, vid)) {
>  		lan966x_mac_cpu_learn(lan966x, port->dev->dev_addr, vid);
> +		lan966x_mac_cpu_learn(lan966x, port->bridge->dev_addr, vid);
>  		lan966x_vlan_cpu_add_vlan_mask(lan966x, vid);
>  	}
>  
> @@ -322,8 +330,10 @@ int lan966x_vlan_port_del_vlan(struct lan966x_port *port,
>  	 * that vlan but still keep it in the mask because it may be needed
>  	 * again then another port gets added in tha vlan
>  	 */
> -	if (!lan966x_vlan_port_any_vlan_mask(lan966x, vid))
> +	if (!lan966x_vlan_port_any_vlan_mask(lan966x, vid)) {
> +		lan966x_mac_cpu_forget(lan966x, port->bridge->dev_addr, vid);
>  		lan966x_vlan_cpu_del_vlan_mask(lan966x, vid);
> +	}
>  
>  	return 0;
>  }
> -- 
> 2.33.0
>

^ permalink raw reply

* Re: [PATCH v1 1/1] include/linux/unaligned: Replace kernel.h with the necessary inclusions
From: Arend van Spriel @ 2021-12-09 13:23 UTC (permalink / raw)
  To: Andy Shevchenko, linux-wireless, brcm80211-dev-list.pdl,
	SHA-cyfmac-dev-list, netdev, linux-kernel
  Cc: Arend van Spriel, Franky Lin, Hante Meuleman, Chi-hsien Lin,
	Wright Feng, Chung-hsien Hsu, Kalle Valo, David S. Miller,
	Jakub Kicinski, Andrew Morton, heikki.krogerus
In-Reply-To: <20211209123823.20425-1-andriy.shevchenko@linux.intel.com>

[-- Attachment #1: Type: text/plain, Size: 745 bytes --]

On 12/9/2021 1:38 PM, Andy Shevchenko wrote:
> When kernel.h is used in the headers it adds a lot into dependency hell,
> especially when there are circular dependencies are involved.
> 
> Replace kernel.h inclusion with the list of what is really being used.
> 
> The rest of the changes are induced by the above and may not be split.

For brcmfmac change:

Acked-by: Arend van Spriel <arend.vanspriel@broadcom.com>
> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> ---
>   drivers/net/wireless/broadcom/brcm80211/brcmfmac/xtlv.c | 2 ++
>   include/linux/unaligned/packed_struct.h                 | 2 +-
>   lib/lz4/lz4defs.h                                       | 2 ++
>   3 files changed, 5 insertions(+), 1 deletion(-)

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4219 bytes --]

^ permalink raw reply

* Re: [syzbot] BUG: sleeping function called from invalid context in hci_cmd_sync_cancel
From: Oliver Neukum @ 2021-12-09 13:21 UTC (permalink / raw)
  To: Benjamin Berg, Oliver Neukum, syzbot, Thinh.Nguyen, changbin.du,
	christian.brauner, davem, edumazet, gregkh, johan.hedberg, kuba,
	linux-bluetooth, linux-kernel, linux-usb, luiz.dentz,
	luiz.von.dentz, marcel, mathias.nyman, netdev, stern,
	syzkaller-bugs, yajun.deng
In-Reply-To: <14584c1a1e449cc20b5af7918b411ee27cf1570b.camel@redhat.com>


On 09.12.21 13:46, Benjamin Berg wrote:
Hi,
> On Thu, 2021-12-09 at 11:06 +0100, Oliver Neukum wrote:
>> As __cancel_work_timer can be called from hci_cmd_sync_cancel() this is
>> just not
>> an approach you can take. It looks like asynchronously canceling the
>> scheduled work
>> would result in a race, so I would for now just revert.
> Right, so this needs to be pushed into a workqueue instead, I suppose.
It looks like overkill, but I have no good alternative to offer either.
>
>> What issue exactly is this trying to fix or improve?
> The problem is aborting long-running synchronous operations. i.e.
> without this patchset, USB enumeration will hang for 10s if a USB
> bluetooth device disappears during firmware loading. This is because
> even though the USB device is gone and all URB submissions fail, the
> operation will only be aborted after the internal timeout happens.
>
I see. Something ought to be done. The issue is in good hands.

    Thanks
        Oliver


^ permalink raw reply

* Re: [PATCH net-next 1/5] net: phylink: add legacy_pre_march2020 indicator
From: Russell King (Oracle) @ 2021-12-09 13:14 UTC (permalink / raw)
  To: Chris Snook, Felix Fietkau, Florian Fainelli, John Crispin,
	Mark Lee, Matthias Brugger, Sean Wang, Vivien Didelot,
	Vladimir Oltean
  Cc: Andrew Lunn, Heiner Kallweit, linux-arm-kernel, linux-mediatek,
	netdev
In-Reply-To: <E1mucmf-00EyCl-KA@rmk-PC.armlinux.org.uk>

This series was incorrectly threaded to its cover letter; the patches
have now been re-sent with the correct message-ID for their cover
letter. Sadly, this mistake was not obvious until I looked at patchwork
to work out why they haven't been applied yet.

On Tue, Dec 07, 2021 at 03:53:37PM +0000, Russell King (Oracle) wrote:
> Add a boolean to phylink_config to indicate whether a driver has not
> been updated for the changes in commit 7cceb599d15d ("net: phylink:
> avoid mac_config calls"), and thus are reliant on the old behaviour.
> 
> We were currently keying the phylink behaviour on the presence of a
> PCS, but this is sub-optimal for modern drivers that may not have a
> PCS.
> 
> This commit merely introduces the new flag, but does not add any use,
> since we need all legacy drivers to set this flag before it can be
> used. Once these legacy drivers have been updated, we can remove this
> flag.
> 
> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
> ---
>  include/linux/phylink.h | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/include/linux/phylink.h b/include/linux/phylink.h
> index 01224235df0f..d005b8e36048 100644
> --- a/include/linux/phylink.h
> +++ b/include/linux/phylink.h
> @@ -84,6 +84,8 @@ enum phylink_op_type {
>   * struct phylink_config - PHYLINK configuration structure
>   * @dev: a pointer to a struct device associated with the MAC
>   * @type: operation type of PHYLINK instance
> + * @legacy_pre_march2020: driver has not been updated for March 2020 updates
> + *	(See commit 7cceb599d15d ("net: phylink: avoid mac_config calls")
>   * @pcs_poll: MAC PCS cannot provide link change interrupt
>   * @poll_fixed_state: if true, starts link_poll,
>   *		      if MAC link is at %MLO_AN_FIXED mode.
> @@ -97,6 +99,7 @@ enum phylink_op_type {
>  struct phylink_config {
>  	struct device *dev;
>  	enum phylink_op_type type;
> +	bool legacy_pre_march2020;
>  	bool pcs_poll;
>  	bool poll_fixed_state;
>  	bool ovr_an_inband;
> -- 
> 2.30.2
> 
> 

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* [PATCH net-next 5/5] net: ag71xx: remove unnecessary legacy methods
From: Russell King (Oracle) @ 2021-12-09 13:11 UTC (permalink / raw)
  To: Chris Snook, Felix Fietkau, Florian Fainelli, John Crispin,
	Mark Lee, Matthias Brugger, Sean Wang, Vivien Didelot,
	Vladimir Oltean
  Cc: Andrew Lunn, Heiner Kallweit, linux-arm-kernel, linux-mediatek,
	netdev, David S. Miller, Jakub Kicinski
In-Reply-To: <Ya+DGaGmGgWrlVkW@shell.armlinux.org.uk>

ag71xx may have a PCS, but it does not appear to support configuration
of the PCS in the current code. The functions to get its state merely
report that the link is down, and the AN restart function is empty.

Since neither of these functions will be called unless phylink's legacy
flag is set, we can safely remove these functions and indicate this is
a modern driver.

Should PCS support be added later, it will need to be modelled using
the phylink_pcs support rather than operating as a legacy driver.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
Resent with the correct cover-letter message-ID.

 drivers/net/ethernet/atheros/ag71xx.c | 13 -------------
 1 file changed, 13 deletions(-)

diff --git a/drivers/net/ethernet/atheros/ag71xx.c b/drivers/net/ethernet/atheros/ag71xx.c
index ff924f06581e..270c2935591b 100644
--- a/drivers/net/ethernet/atheros/ag71xx.c
+++ b/drivers/net/ethernet/atheros/ag71xx.c
@@ -1024,17 +1024,6 @@ static void ag71xx_mac_config(struct phylink_config *config, unsigned int mode,
 	ag71xx_wr(ag, AG71XX_REG_FIFO_CFG3, ag->fifodata[2]);
 }
 
-static void ag71xx_mac_pcs_get_state(struct phylink_config *config,
-				     struct phylink_link_state *state)
-{
-	state->link = 0;
-}
-
-static void ag71xx_mac_an_restart(struct phylink_config *config)
-{
-	/* Not Supported */
-}
-
 static void ag71xx_mac_link_down(struct phylink_config *config,
 				 unsigned int mode, phy_interface_t interface)
 {
@@ -1098,8 +1087,6 @@ static void ag71xx_mac_link_up(struct phylink_config *config,
 
 static const struct phylink_mac_ops ag71xx_phylink_mac_ops = {
 	.validate = phylink_generic_validate,
-	.mac_pcs_get_state = ag71xx_mac_pcs_get_state,
-	.mac_an_restart = ag71xx_mac_an_restart,
 	.mac_config = ag71xx_mac_config,
 	.mac_link_down = ag71xx_mac_link_down,
 	.mac_link_up = ag71xx_mac_link_up,
-- 
2.30.2


^ permalink raw reply related

* [PATCH net-next 4/5] net: phylink: use legacy_pre_march2020
From: Russell King (Oracle) @ 2021-12-09 13:11 UTC (permalink / raw)
  To: Chris Snook, Felix Fietkau, Florian Fainelli, John Crispin,
	Mark Lee, Matthias Brugger, Sean Wang, Vivien Didelot,
	Vladimir Oltean
  Cc: Andrew Lunn, Heiner Kallweit, linux-arm-kernel, linux-mediatek,
	netdev, David S. Miller, Jakub Kicinski
In-Reply-To: <Ya+DGaGmGgWrlVkW@shell.armlinux.org.uk>

Use the legacy flag to indicate whether we should operate in legacy
mode. This allows us to stop using the presence of a PCS as an
indicator to the age of the phylink user, and make PCS presence
optional.

Legacy mode involves:
1) calling mac_config() whenever the link comes up
2) calling mac_config() whenever the inband advertisement changes,
   possibly followed by a call to mac_an_restart()
3) making use of mac_an_restart()
4) making use of mac_pcs_get_state()

All the above functionality was moved to a seperate "PCS" block of
operations in March 2020.

Update the documents to indicate that the differences that this flag
makes.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
Resent with the correct cover-letter message-ID.

 drivers/net/phy/phylink.c | 12 ++++++------
 include/linux/phylink.h   | 17 +++++++++++++++++
 2 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c
index 8e3861f09b4f..e47f2baf4b07 100644
--- a/drivers/net/phy/phylink.c
+++ b/drivers/net/phy/phylink.c
@@ -742,7 +742,7 @@ static void phylink_mac_pcs_an_restart(struct phylink *pl)
 	    phylink_autoneg_inband(pl->cur_link_an_mode)) {
 		if (pl->pcs_ops)
 			pl->pcs_ops->pcs_an_restart(pl->pcs);
-		else
+		else if (pl->config->legacy_pre_march2020)
 			pl->mac_ops->mac_an_restart(pl->config);
 	}
 }
@@ -803,7 +803,7 @@ static int phylink_change_inband_advert(struct phylink *pl)
 	if (test_bit(PHYLINK_DISABLE_STOPPED, &pl->phylink_disable_state))
 		return 0;
 
-	if (!pl->pcs_ops) {
+	if (!pl->pcs_ops && pl->config->legacy_pre_march2020) {
 		/* Legacy method */
 		phylink_mac_config(pl, &pl->link_config);
 		phylink_mac_pcs_an_restart(pl);
@@ -854,7 +854,8 @@ static void phylink_mac_pcs_get_state(struct phylink *pl,
 
 	if (pl->pcs_ops)
 		pl->pcs_ops->pcs_get_state(pl->pcs, state);
-	else if (pl->mac_ops->mac_pcs_get_state)
+	else if (pl->mac_ops->mac_pcs_get_state &&
+		 pl->config->legacy_pre_march2020)
 		pl->mac_ops->mac_pcs_get_state(pl->config, state);
 	else
 		state->link = 0;
@@ -1048,12 +1049,11 @@ static void phylink_resolve(struct work_struct *w)
 			}
 			phylink_major_config(pl, false, &link_state);
 			pl->link_config.interface = link_state.interface;
-		} else if (!pl->pcs_ops) {
+		} else if (!pl->pcs_ops && pl->config->legacy_pre_march2020) {
 			/* The interface remains unchanged, only the speed,
 			 * duplex or pause settings have changed. Call the
 			 * old mac_config() method to configure the MAC/PCS
-			 * only if we do not have a PCS installed (an
-			 * unconverted user.)
+			 * only if we do not have a legacy MAC driver.
 			 */
 			phylink_mac_config(pl, &link_state);
 		}
diff --git a/include/linux/phylink.h b/include/linux/phylink.h
index d005b8e36048..a2f266cc3442 100644
--- a/include/linux/phylink.h
+++ b/include/linux/phylink.h
@@ -190,6 +190,10 @@ void validate(struct phylink_config *config, unsigned long *supported,
  * negotiation completion state in @state->an_complete, and link up state
  * in @state->link. If possible, @state->lp_advertising should also be
  * populated.
+ *
+ * Note: This is a legacy method. This function will not be called unless
+ * legacy_pre_march2020 is set in &struct phylink_config and there is no
+ * PCS attached.
  */
 void mac_pcs_get_state(struct phylink_config *config,
 		       struct phylink_link_state *state);
@@ -230,6 +234,15 @@ int mac_prepare(struct phylink_config *config, unsigned int mode,
  * guaranteed to be correct, and so any mac_config() implementation must
  * never reference these fields.
  *
+ * Note: For legacy March 2020 drivers (drivers with legacy_pre_march2020 set
+ * in their &phylnk_config and which don't have a PCS), this function will be
+ * called on each link up event, and to also change the in-band advert. For
+ * non-legacy drivers, it will only be called to reconfigure the MAC for a
+ * "major" change in e.g. interface mode. It will not be called for changes
+ * in speed, duplex or pause modes or to change the in-band advertisement.
+ * In any case, it is strongly preferred that speed, duplex and pause settings
+ * are handled in the mac_link_up() method and not in this method.
+ *
  * (this requires a rewrite - please refer to mac_link_up() for situations
  *  where the PCS and MAC are not tightly integrated.)
  *
@@ -314,6 +327,10 @@ int mac_finish(struct phylink_config *config, unsigned int mode,
 /**
  * mac_an_restart() - restart 802.3z BaseX autonegotiation
  * @config: a pointer to a &struct phylink_config.
+ *
+ * Note: This is a legacy method. This function will not be called unless
+ * legacy_pre_march2020 is set in &struct phylink_config and there is no
+ * PCS attached.
  */
 void mac_an_restart(struct phylink_config *config);
 
-- 
2.30.2


^ permalink raw reply related

* [PATCH net-next 3/5] net: mtk_eth_soc: mark as a legacy_pre_march2020 driver
From: Russell King (Oracle) @ 2021-12-09 13:11 UTC (permalink / raw)
  To: Chris Snook, Felix Fietkau, Florian Fainelli, John Crispin,
	Mark Lee, Matthias Brugger, Sean Wang, Vivien Didelot,
	Vladimir Oltean
  Cc: Andrew Lunn, Heiner Kallweit, linux-arm-kernel, linux-mediatek,
	netdev, David S. Miller, Jakub Kicinski
In-Reply-To: <Ya+DGaGmGgWrlVkW@shell.armlinux.org.uk>

mtk_eth_soc has not been updated for commit 7cceb599d15d ("net: phylink:
avoid mac_config calls"), and makes use of state->speed and
state->duplex in contravention of the phylink documentation. This makes
reliant on the legacy behaviours, so mark it as a legacy driver.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
Resent with the correct cover-letter message-ID.

 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index de4152e2e3e4..a068cf5c970f 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -2923,6 +2923,10 @@ static int mtk_add_mac(struct mtk_eth *eth, struct device_node *np)
 
 	mac->phylink_config.dev = &eth->netdev[id]->dev;
 	mac->phylink_config.type = PHYLINK_NETDEV;
+	/* This driver makes use of state->speed/state->duplex in
+	 * mac_config
+	 */
+	mac->phylink_config.legacy_pre_march2020 = true;
 	mac->phylink_config.mac_capabilities = MAC_ASYM_PAUSE | MAC_SYM_PAUSE |
 		MAC_10 | MAC_100 | MAC_1000 | MAC_2500FD;
 
-- 
2.30.2


^ permalink raw reply related

* [PATCH net-next 2/5] net: dsa: mark DSA phylink as legacy_pre_march2020
From: Russell King (Oracle) @ 2021-12-09 13:11 UTC (permalink / raw)
  To: Chris Snook, Felix Fietkau, Florian Fainelli, John Crispin,
	Mark Lee, Matthias Brugger, Sean Wang, Vivien Didelot,
	Vladimir Oltean
  Cc: Andrew Lunn, Heiner Kallweit, linux-arm-kernel, linux-mediatek,
	netdev, David S. Miller, Jakub Kicinski
In-Reply-To: <Ya+DGaGmGgWrlVkW@shell.armlinux.org.uk>

The majority of DSA drivers do not make use of the PCS support, and
thus operate in legacy mode. In order to preserve this behaviour in
future, we need to set the legacy_pre_march2020 flag so phylink knows
this may require the legacy calls.

There are some DSA drivers that do make use of PCS support, and these
will continue operating as before - legacy_pre_march2020 will not
prevent split-PCS support enabling the newer phylink behaviour.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
Resent with the correct cover-letter message-ID.

 net/dsa/port.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/net/dsa/port.c b/net/dsa/port.c
index 6d5ebe61280b..3b8d18e5b72c 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -1094,6 +1094,13 @@ int dsa_port_phylink_create(struct dsa_port *dp)
 	if (err)
 		mode = PHY_INTERFACE_MODE_NA;
 
+	/* Presence of phylink_mac_link_state or phylink_mac_an_restart is
+	 * an indicator of a legacy phylink driver.
+	 */
+	if (ds->ops->phylink_mac_link_state ||
+	    ds->ops->phylink_mac_an_restart)
+		dp->pl_config.legacy_pre_march2020 = true;
+
 	if (ds->ops->phylink_get_caps)
 		ds->ops->phylink_get_caps(ds, dp->index, &dp->pl_config);
 
-- 
2.30.2


^ permalink raw reply related

* [PATCH net-next 1/5] net: phylink: add legacy_pre_march2020 indicator
From: Russell King (Oracle) @ 2021-12-09 13:11 UTC (permalink / raw)
  To: Chris Snook, Felix Fietkau, Florian Fainelli, John Crispin,
	Mark Lee, Matthias Brugger, Sean Wang, Vivien Didelot,
	Vladimir Oltean
  Cc: Andrew Lunn, Heiner Kallweit, linux-arm-kernel, linux-mediatek,
	netdev
In-Reply-To: <Ya+DGaGmGgWrlVkW@shell.armlinux.org.uk>

Add a boolean to phylink_config to indicate whether a driver has not
been updated for the changes in commit 7cceb599d15d ("net: phylink:
avoid mac_config calls"), and thus are reliant on the old behaviour.

We were currently keying the phylink behaviour on the presence of a
PCS, but this is sub-optimal for modern drivers that may not have a
PCS.

This commit merely introduces the new flag, but does not add any use,
since we need all legacy drivers to set this flag before it can be
used. Once these legacy drivers have been updated, we can remove this
flag.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
Resent with the correct cover-letter message-ID.

 include/linux/phylink.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/phylink.h b/include/linux/phylink.h
index 01224235df0f..d005b8e36048 100644
--- a/include/linux/phylink.h
+++ b/include/linux/phylink.h
@@ -84,6 +84,8 @@ enum phylink_op_type {
  * struct phylink_config - PHYLINK configuration structure
  * @dev: a pointer to a struct device associated with the MAC
  * @type: operation type of PHYLINK instance
+ * @legacy_pre_march2020: driver has not been updated for March 2020 updates
+ *	(See commit 7cceb599d15d ("net: phylink: avoid mac_config calls")
  * @pcs_poll: MAC PCS cannot provide link change interrupt
  * @poll_fixed_state: if true, starts link_poll,
  *		      if MAC link is at %MLO_AN_FIXED mode.
@@ -97,6 +99,7 @@ enum phylink_op_type {
 struct phylink_config {
 	struct device *dev;
 	enum phylink_op_type type;
+	bool legacy_pre_march2020;
 	bool pcs_poll;
 	bool poll_fixed_state;
 	bool ovr_an_inband;
-- 
2.30.2


^ permalink raw reply related

* [PATCH net 1/1] net: stmmac: fix tc flower deletion for VLAN priority Rx steering
From: Ong Boon Leong @ 2021-12-09 13:03 UTC (permalink / raw)
  To: Giuseppe Cavallaro, Alexandre Torgue, Jose Abreu,
	David S . Miller, Jakub Kicinski, Maxime Coquelin,
	alexandre.torgue
  Cc: Kurt Kanzenbach, netdev, linux-stm32, linux-arm-kernel,
	Ong Boon Leong

To replicate the issue:-

1) Add 2 flower filters for VLAN Priority based frame steering:-
$ IFDEVNAME=eth0
$ tc qdisc add dev $IFDEVNAME ingress
$ tc qdisc add dev $IFDEVNAME root mqprio num_tc 8 \
   map 0 1 2 3 4 5 6 7 0 0 0 0 0 0 0 0 \
   queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 0
$ tc filter add dev $IFDEVNAME parent ffff: protocol 802.1Q \
   flower vlan_prio 0 hw_tc 0
$ tc filter add dev $IFDEVNAME parent ffff: protocol 802.1Q \
   flower vlan_prio 1 hw_tc 1

2) Get the 'pref' id
$ tc filter show dev $IFDEVNAME ingress

3) Delete a specific tc flower record
$ tc filter del dev $IFDEVNAME parent ffff: pref 49151

From dmesg, we will observe kernel NULL pointer ooops

[  197.170464] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  197.171367] #PF: supervisor read access in kernel mode
[  197.171367] #PF: error_code(0x0000) - not-present page
[  197.171367] PGD 0 P4D 0
[  197.171367] Oops: 0000 [#1] PREEMPT SMP NOPTI
[  197.171367] CPU: 0 PID: 3216 Comm: tc Tainted: G     U      E     5.16.0-rc2+ #12
[  197.171367] Hardware name: Intel Corporation Elkhart Lake Embedded Platform/ElkhartLake LPDDR4x T3 CRB, BIOS EHLSFWI1.R00.3273.A04.2107240322 07/24/2021
[  197.171367] RIP: 0010:tc_setup_cls+0x20b/0x4a0 [stmmac]
[  197.171367] Code: fe ff ff c7 43 14 00 00 00 00 48 c7 03 00 00 00 00 c7 43 1c 00 00 00 00 49 8b 44 24 28 48 8b bd b0 00 00 00 41 0f b7 54 24 58 <48> 8b 00 0f bf 8f 38 08 00 00 81 ea e0 ff 00 00 8b 00 25 00 04 00
[  197.171367] RSP: 0018:ffff940940a037c0 EFLAGS: 00010246
[  197.171367] RAX: 0000000000000000 RBX: ffff92e826cae2c8 RCX: ffff92e825f39000
[  197.171367] RDX: 0000000000000000 RSI: ffff92e826cae2a8 RDI: ffff92e82f0c0000
[  197.171367] RBP: ffff92e82f0c0940 R08: 0000000000000000 R09: ffff92e825f39434
[  197.171367] R10: ffff92e826c5af00 R11: ffff940940a038a8 R12: ffff940940a038a8
[  197.171367] R13: 0000000000000000 R14: 0000000000000000 R15: ffff92e830a5b600
[  197.171367] FS:  00007fa7b0c47740(0000) GS:ffff92e964200000(0000) knlGS:0000000000000000
[  197.171367] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  197.171367] CR2: 0000000000000000 CR3: 0000000124c50000 CR4: 0000000000350ef0
[  197.171367] Call Trace:
[  197.171367]  <TASK>
[  197.171367]  ? __stmmac_disable_all_queues+0xa8/0xe0 [stmmac]
[  197.171367]  stmmac_setup_tc_block_cb+0x70/0x110 [stmmac]
[  197.171367]  tc_setup_cb_destroy+0xb3/0x180
[  197.171367]  fl_hw_destroy_filter+0x94/0xc0 [cls_flower]
[  197.171367]  __fl_delete+0x16a/0x180 [cls_flower]
[  197.171367]  fl_destroy+0xb9/0x140 [cls_flower]
[  197.171367]  tcf_proto_destroy+0x1d/0xa0
[  197.171367]  tc_del_tfilter+0x3c9/0x7b0
[  197.171367]  ? tc_dump_tfilter+0x310/0x310
[  197.171367]  rtnetlink_rcv_msg+0x2bf/0x370
[  197.171367]  ? preempt_count_add+0x68/0xa0
[  197.171367]  ? _raw_spin_lock_irqsave+0x19/0x40
[  197.171367]  ? _raw_spin_unlock_irqrestore+0x1f/0x31
[  197.171367]  ? rtnl_calcit.isra.0+0x130/0x130
[  197.171367]  netlink_rcv_skb+0x4e/0x100
[  197.171367]  netlink_unicast+0x18e/0x230
[  197.171367]  netlink_sendmsg+0x245/0x480
[  197.171367]  sock_sendmsg+0x5b/0x60
[  197.171367]  ____sys_sendmsg+0x20b/0x280
[  197.171367]  ? copy_msghdr_from_user+0x5c/0x90
[  197.171367]  ___sys_sendmsg+0x7c/0xc0
[  197.171367]  ? folio_add_lru+0x52/0x80
[  197.171367]  ? __sys_sendto+0xee/0x160
[  197.171367]  __sys_sendmsg+0x59/0xa0
[  197.171367]  do_syscall_64+0x40/0x90
[  197.171367]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  197.171367] RIP: 0033:0x7fa7b0d64397
[  197.171367] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
[  197.171367] RSP: 002b:00007ffdd88b58e8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[  197.171367] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fa7b0d64397
[  197.171367] RDX: 0000000000000000 RSI: 00007ffdd88b5960 RDI: 0000000000000003
[  197.171367] RBP: 0000000061b05c21 R08: 0000000000000001 R09: 0000564584e47890
[  197.171367] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
[  197.171367] R13: 00007ffdd88b9a80 R14: 00000000bfff0000 R15: 0000564584e3e420
[  197.171367]  </TASK>
[  197.171367] Modules linked in: cls_flower sch_mqprio sch_ingress dwmac_intel(E) stmmac(E) pcs_xpcs phylink marvell marvell10g libphy 8021q bnep bluetooth ecryptfs nfsd sch_fq_codel uio uhid snd_soc_dmic snd_sof_pci_intel_tgl x86_pkg_temp_thermal snd_sof_intel_hda_common kvm_intel iTCO_wdt iTCO_vendor_support soundwire_intel mei_hdcp kvm soundwire_generic_allocation soundwire_cadence soundwire_bus irqbypass snd_sof_xtensa_dsp ax88179_178a snd_soc_acpi_intel_match intel_rapl_msr pcspkr usbnet snd_soc_acpi mii snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec i2c_i801 snd_hda_core intel_ish_ipc tpm_crb 8250_lpss intel_ishtp tpm_tis i915 mei_me i2c_smbus mei tpm_tis_core dw_dmac_core tpm spi_dw_pci parport_pc intel_pmc_core spi_dw thermal parport ttm fuse configfs snd_sof_pci snd_sof snd_soc_core snd_compress ac97_bus ledtrig_audio snd_pcm snd_timer snd soundcore [last unloaded: libphy]
[  197.171367] CR2: 0000000000000000
[  197.171367] ---[ end trace 8b8d1c617c39093d ]---

This patch reimplements the tc flower rx frame steering for VLAN priority
by keeping a record of flow_cls_offload added. The implementation also
makes way to support EtherType based RX frame steering later.

Fixes: 0e039f5cf86c ("net: stmmac: add RX frame steering based on VLAN priority in tc flower")
Tested-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac.h  | 17 ++++
 .../net/ethernet/stmicro/stmmac/stmmac_tc.c   | 86 ++++++++++++++++---
 2 files changed, 90 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index 4f5292cadf5..18a262ef17f 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -171,6 +171,19 @@ struct stmmac_flow_entry {
 	int is_l4;
 };
 
+/* Rx Frame Steering */
+enum stmmac_rfs_type {
+	STMMAC_RFS_T_VLAN,
+	STMMAC_RFS_T_MAX,
+};
+
+struct stmmac_rfs_entry {
+	unsigned long cookie;
+	int in_use;
+	int type;
+	int tc;
+};
+
 struct stmmac_priv {
 	/* Frequently used values are kept adjacent for cache effect */
 	u32 tx_coal_frames[MTL_MAX_TX_QUEUES];
@@ -288,6 +301,10 @@ struct stmmac_priv {
 	struct stmmac_tc_entry *tc_entries;
 	unsigned int flow_entries_max;
 	struct stmmac_flow_entry *flow_entries;
+	unsigned int rfs_entries_max[STMMAC_RFS_T_MAX];
+	unsigned int rfs_entries_cnt[STMMAC_RFS_T_MAX];
+	unsigned int rfs_entries_total;
+	struct stmmac_rfs_entry *rfs_entries;
 
 	/* Pulse Per Second output */
 	struct stmmac_pps_cfg pps[STMMAC_PPS_MAX];
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c
index 1c4ea0b1b84..d0a2b289f46 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c
@@ -232,11 +232,33 @@ static int tc_setup_cls_u32(struct stmmac_priv *priv,
 	}
 }
 
+static int tc_rfs_init(struct stmmac_priv *priv)
+{
+	int i;
+
+	priv->rfs_entries_max[STMMAC_RFS_T_VLAN] = 8;
+
+	for (i = 0; i < STMMAC_RFS_T_MAX; i++)
+		priv->rfs_entries_total += priv->rfs_entries_max[i];
+
+	priv->rfs_entries = devm_kcalloc(priv->device,
+					 priv->rfs_entries_total,
+					 sizeof(*priv->rfs_entries),
+					 GFP_KERNEL);
+	if (!priv->rfs_entries)
+		return -ENOMEM;
+
+	dev_info(priv->device, "Enabled RFS Flow TC (entries=%d)\n",
+		 priv->rfs_entries_total);
+
+	return 0;
+}
+
 static int tc_init(struct stmmac_priv *priv)
 {
 	struct dma_features *dma_cap = &priv->dma_cap;
 	unsigned int count;
-	int i;
+	int ret, i;
 
 	if (dma_cap->l3l4fnum) {
 		priv->flow_entries_max = dma_cap->l3l4fnum;
@@ -250,10 +272,14 @@ static int tc_init(struct stmmac_priv *priv)
 		for (i = 0; i < priv->flow_entries_max; i++)
 			priv->flow_entries[i].idx = i;
 
-		dev_info(priv->device, "Enabled Flow TC (entries=%d)\n",
+		dev_info(priv->device, "Enabled L3L4 Flow TC (entries=%d)\n",
 			 priv->flow_entries_max);
 	}
 
+	ret = tc_rfs_init(priv);
+	if (ret)
+		return -ENOMEM;
+
 	if (!priv->plat->fpe_cfg) {
 		priv->plat->fpe_cfg = devm_kzalloc(priv->device,
 						   sizeof(*priv->plat->fpe_cfg),
@@ -607,16 +633,45 @@ static int tc_del_flow(struct stmmac_priv *priv,
 	return ret;
 }
 
+static struct stmmac_rfs_entry *tc_find_rfs(struct stmmac_priv *priv,
+					    struct flow_cls_offload *cls,
+					    bool get_free)
+{
+	int i;
+
+	for (i = 0; i < priv->rfs_entries_total; i++) {
+		struct stmmac_rfs_entry *entry = &priv->rfs_entries[i];
+
+		if (entry->cookie == cls->cookie)
+			return entry;
+		if (get_free && entry->in_use == false)
+			return entry;
+	}
+
+	return NULL;
+}
+
 #define VLAN_PRIO_FULL_MASK (0x07)
 
 static int tc_add_vlan_flow(struct stmmac_priv *priv,
 			    struct flow_cls_offload *cls)
 {
+	struct stmmac_rfs_entry *entry = tc_find_rfs(priv, cls, false);
 	struct flow_rule *rule = flow_cls_offload_flow_rule(cls);
 	struct flow_dissector *dissector = rule->match.dissector;
 	int tc = tc_classid_to_hwtc(priv->dev, cls->classid);
 	struct flow_match_vlan match;
 
+	if (!entry) {
+		entry = tc_find_rfs(priv, cls, true);
+		if (!entry)
+			return -ENOENT;
+	}
+
+	if (priv->rfs_entries_cnt[STMMAC_RFS_T_VLAN] >=
+	    priv->rfs_entries_max[STMMAC_RFS_T_VLAN])
+		return -ENOENT;
+
 	/* Nothing to do here */
 	if (!dissector_uses_key(dissector, FLOW_DISSECTOR_KEY_VLAN))
 		return -EINVAL;
@@ -638,6 +693,12 @@ static int tc_add_vlan_flow(struct stmmac_priv *priv,
 
 		prio = BIT(match.key->vlan_priority);
 		stmmac_rx_queue_prio(priv, priv->hw, prio, tc);
+
+		entry->in_use = true;
+		entry->cookie = cls->cookie;
+		entry->tc = tc;
+		entry->type = STMMAC_RFS_T_VLAN;
+		priv->rfs_entries_cnt[STMMAC_RFS_T_VLAN]++;
 	}
 
 	return 0;
@@ -646,20 +707,19 @@ static int tc_add_vlan_flow(struct stmmac_priv *priv,
 static int tc_del_vlan_flow(struct stmmac_priv *priv,
 			    struct flow_cls_offload *cls)
 {
-	struct flow_rule *rule = flow_cls_offload_flow_rule(cls);
-	struct flow_dissector *dissector = rule->match.dissector;
-	int tc = tc_classid_to_hwtc(priv->dev, cls->classid);
+	struct stmmac_rfs_entry *entry = tc_find_rfs(priv, cls, false);
 
-	/* Nothing to do here */
-	if (!dissector_uses_key(dissector, FLOW_DISSECTOR_KEY_VLAN))
-		return -EINVAL;
+	if (!entry || !entry->in_use || entry->type != STMMAC_RFS_T_VLAN)
+		return -ENOENT;
 
-	if (tc < 0) {
-		netdev_err(priv->dev, "Invalid traffic class\n");
-		return -EINVAL;
-	}
+	stmmac_rx_queue_prio(priv, priv->hw, 0, entry->tc);
+
+	entry->in_use = false;
+	entry->cookie = 0;
+	entry->tc = 0;
+	entry->type = 0;
 
-	stmmac_rx_queue_prio(priv, priv->hw, 0, tc);
+	priv->rfs_entries_cnt[STMMAC_RFS_T_VLAN]--;
 
 	return 0;
 }
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH] libceph, ceph: potential dereference of null pointer
From: Xiubo Li @ 2021-12-09 12:59 UTC (permalink / raw)
  To: Jeff Layton, Jiasheng Jiang, idryomov, davem, kuba
  Cc: ceph-devel, netdev, linux-kernel
In-Reply-To: <2a79206d3472b279079fbef5c9507f8805061c47.camel@kernel.org>


On 12/9/21 7:20 PM, Jeff Layton wrote:
> On Thu, 2021-12-09 at 10:50 +0800, Jiasheng Jiang wrote:
>> The return value of kzalloc() needs to be checked.
>> To avoid use of null pointer in case of the failure of alloc.
>>
>> Fixes: 3d14c5d2b6e1 ("ceph: factor out libceph from Ceph file system")
>> Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn>
>> ---
>>   net/ceph/osd_client.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
>> index ff8624a7c964..3203e8a34370 100644
>> --- a/net/ceph/osd_client.c
>> +++ b/net/ceph/osd_client.c
>> @@ -1234,6 +1234,8 @@ static struct ceph_osd *create_osd(struct ceph_osd_client *osdc, int onum)
>>   	WARN_ON(onum == CEPH_HOMELESS_OSD);
>>   
>>   	osd = kzalloc(sizeof(*osd), GFP_NOIO | __GFP_NOFAIL);
>> +	if (!osd)
>> +		return NULL;
>>   	osd_init(osd);
>>   	osd->o_osdc = osdc;
>>   	osd->o_osd = onum;
> __GFP_NOFAIL should ensure that it never returns NULL, right?

Yeah, from the comment, it make no sense to test for failure here:


204  * %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the 
caller
205  * cannot handle allocation failures. The allocation could block
206  * indefinitely but will never return with failure. Testing for
207  * failure is pointless.
208  * New users should be evaluated carefully (and the flag should be
209  * used only when there is no reasonable failure policy) but it is
210  * definitely preferable to use the flag rather than opencode endless
211  * loop around allocator.
212  * Using this flag for costly allocations is _highly_ discouraged.
213  */



> Also, if you're going to fix this up to handle that error then you
> probably also need to fix lookup_create_osd to handle a NULL return from
> create_osd as well.


^ permalink raw reply

* Re: [RFC PATCH net-next 2/2] net: Reset forwarded skb->tstamp before delivering to user space
From: Daniel Borkmann @ 2021-12-09 12:58 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Willem de Bruijn, netdev, Alexei Starovoitov, David Miller,
	Eric Dumazet, Jakub Kicinski, kernel-team
In-Reply-To: <20211208221924.v4gqpkzzrbhgi2xe@kafai-mbp.dhcp.thefacebook.com>

On 12/8/21 11:19 PM, Martin KaFai Lau wrote:
> On Wed, Dec 08, 2021 at 08:27:45PM +0100, Daniel Borkmann wrote:
>> On 12/8/21 9:30 AM, Martin KaFai Lau wrote:
>>> On Wed, Dec 08, 2021 at 12:18:46AM -0800, Martin KaFai Lau wrote:
>>>> On Tue, Dec 07, 2021 at 10:48:53PM +0100, Daniel Borkmann wrote:
>> [...]
>>>>> One other thing I wonder, BPF progs at host-facing veth's tc ingress which
>>>>> are not aware of skb->tstamp will then see a tstamp from future given we
>>>>> intentionally bypass the net_timestamp_check() and might get confused (or
>>>>> would confuse higher-layer application logic)? Not quite sure yet if they
>>>>> would be the only affected user.
>>>> Considering the variety of clock used in skb->tstamp (real/mono, and also
>>>> tai in SO_TXTIME),  in general I am not sure if the tc-bpf can assume anything
>>>> in the skb->tstamp now.
>>
>> But today that's either only 0 or real via __net_timestamp() if skb->tstamp is
>> read at bpf@ingress@veth@host, no?
> I think I was trying to say the CLOCK_REALTIME in __sk_buff->tstamp
> is not practically useful in bpf@ingress other than an increasing number.
> No easy way to get the 'now' in CLOCK_REALTIME to compare with
> in order to learn if it is future or current time.  Setting
> it based on bpf_ktime_get_ns() in MONO is likely broken currently
> regardless of the skb is forwarded or delivered locally.
> 
> We have a use case that wants to change the forwarded EDT
> in bpf@ingress@veth@host before forwarding.  If it needs to
> keep __sk_buff->tstamp as the 'now' CLOCK_REALTIME in ingress,
> it needs to provide a separate way for the bpf to check/change
> the forwarded EDT.
> 
> Daniel, do you have suggestion on where to temporarily store
> the forwarded EDT so that the bpf@ingress can access?

Hm, was thinking maybe moving skb->skb_mstamp_ns into the shared info as
in skb_hwtstamps(skb)->skb_mstamp_ns could work. In other words, as a union
with hwtstamp to not bloat it further. And TCP stack as well as everything
else (like sch_fq) could switch to it natively (hwtstamp might only be used
on RX or TX completion from driver side if I'm not mistaken).

But then while this would solve the netns transfer, we would run into the
/same/ issue again when implementing a hairpinning LB where we loop from RX
to TX given this would have to be cleared somewhere again if driver populates
hwtstamp, so not really feasible and bloating shared info with a second
tstamp would bump it by one cacheline. :(

A cleaner BUT still non-generic solution compared to the previous diff I could
think of might be the below. So no change in behavior in general, but if the
bpf@ingress@veth@host needs to access the original tstamp, it could do so
via existing mapping we already have in BPF, and then it could transfer it
for all or certain traffic (up to the prog) via BPF code setting ...

   skb->tstamp = skb->hwtstamp

... and do the redirect from there to the phys dev with BPF_F_KEEP_TSTAMP
flag. Minimal intrusive, but unfortunately only accessible for BPF. Maybe use
of skb_hwtstamps(skb)->nststamp could be extended though (?)

Thanks,
Daniel

  include/linux/skbuff.h   | 14 +++++++++++++-
  include/uapi/linux/bpf.h |  1 +
  net/core/filter.c        | 37 ++++++++++++++++++++++---------------
  net/core/skbuff.c        |  2 +-
  4 files changed, 37 insertions(+), 17 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c8cb7e697d47..b7c72fe7e1bb 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -418,7 +418,10 @@ static inline bool skb_frag_must_loop(struct page *p)
   * &skb_shared_info. Use skb_hwtstamps() to get a pointer.
   */
  struct skb_shared_hwtstamps {
-	ktime_t	hwtstamp;
+	union {
+		ktime_t	hwtstamp;
+		ktime_t	nststamp;
+	};
  };

  /* Definitions for tx_flags in struct skb_shared_info */
@@ -3711,6 +3714,15 @@ int skb_mpls_dec_ttl(struct sk_buff *skb);
  struct sk_buff *pskb_extract(struct sk_buff *skb, int off, int to_copy,
  			     gfp_t gfp);

+static inline void skb_xfer_tstamp(struct sk_buff *skb)
+{
+	/* initns might still need access to the original
+	 * skb->tstamp from a netns, e.g. for EDT.
+	 */
+	skb_hwtstamps(skb)->nststamp = skb->tstamp;
+	skb->tstamp = 0;
+}
+
  static inline int memcpy_from_msg(void *data, struct msghdr *msg, int len)
  {
  	return copy_from_iter_full(data, len, &msg->msg_iter) ? 0 : -EFAULT;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ba5af15e25f5..12d10ab8bff7 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5157,6 +5157,7 @@ enum {
  /* BPF_FUNC_clone_redirect and BPF_FUNC_redirect flags. */
  enum {
  	BPF_F_INGRESS			= (1ULL << 0),
+	BPF_F_KEEP_TSTAMP		= (1ULL << 1),
  };

  /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
diff --git a/net/core/filter.c b/net/core/filter.c
index 6102f093d59a..e233b998387c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2097,7 +2097,8 @@ static inline int __bpf_rx_skb_no_mac(struct net_device *dev,
  	return ret;
  }

-static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb)
+static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb,
+			       bool keep_tstamp)
  {
  	int ret;

@@ -2108,7 +2109,8 @@ static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb)
  	}

  	skb->dev = dev;
-	skb->tstamp = 0;
+	if (!keep_tstamp)
+		skb_xfer_tstamp(skb);

  	dev_xmit_recursion_inc();
  	ret = dev_queue_xmit(skb);
@@ -2136,7 +2138,8 @@ static int __bpf_redirect_no_mac(struct sk_buff *skb, struct net_device *dev,
  	skb_pop_mac_header(skb);
  	skb_reset_mac_len(skb);
  	return flags & BPF_F_INGRESS ?
-	       __bpf_rx_skb_no_mac(dev, skb) : __bpf_tx_skb(dev, skb);
+	       __bpf_rx_skb_no_mac(dev, skb) :
+	       __bpf_tx_skb(dev, skb, flags & BPF_F_KEEP_TSTAMP);
  }

  static int __bpf_redirect_common(struct sk_buff *skb, struct net_device *dev,
@@ -2150,7 +2153,8 @@ static int __bpf_redirect_common(struct sk_buff *skb, struct net_device *dev,

  	bpf_push_mac_rcsum(skb);
  	return flags & BPF_F_INGRESS ?
-	       __bpf_rx_skb(dev, skb) : __bpf_tx_skb(dev, skb);
+	       __bpf_rx_skb(dev, skb) :
+	       __bpf_tx_skb(dev, skb, flags & BPF_F_KEEP_TSTAMP);
  }

  static int __bpf_redirect(struct sk_buff *skb, struct net_device *dev,
@@ -2177,7 +2181,6 @@ static int bpf_out_neigh_v6(struct net *net, struct sk_buff *skb,
  	}

  	skb->dev = dev;
-	skb->tstamp = 0;

  	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
  		skb = skb_expand_head(skb, hh_len);
@@ -2275,7 +2278,6 @@ static int bpf_out_neigh_v4(struct net *net, struct sk_buff *skb,
  	}

  	skb->dev = dev;
-	skb->tstamp = 0;

  	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
  		skb = skb_expand_head(skb, hh_len);
@@ -2367,7 +2369,7 @@ static int __bpf_redirect_neigh_v4(struct sk_buff *skb, struct net_device *dev,
  #endif /* CONFIG_INET */

  static int __bpf_redirect_neigh(struct sk_buff *skb, struct net_device *dev,
-				struct bpf_nh_params *nh)
+				struct bpf_nh_params *nh, bool keep_tstamp)
  {
  	struct ethhdr *ethh = eth_hdr(skb);

@@ -2380,6 +2382,8 @@ static int __bpf_redirect_neigh(struct sk_buff *skb, struct net_device *dev,
  	skb_pull(skb, sizeof(*ethh));
  	skb_unset_mac_header(skb);
  	skb_reset_network_header(skb);
+	if (!keep_tstamp)
+		skb_xfer_tstamp(skb);

  	if (skb->protocol == htons(ETH_P_IP))
  		return __bpf_redirect_neigh_v4(skb, dev, nh);
@@ -2392,9 +2396,9 @@ static int __bpf_redirect_neigh(struct sk_buff *skb, struct net_device *dev,

  /* Internal, non-exposed redirect flags. */
  enum {
-	BPF_F_NEIGH	= (1ULL << 1),
-	BPF_F_PEER	= (1ULL << 2),
-	BPF_F_NEXTHOP	= (1ULL << 3),
+	BPF_F_NEIGH	= (1ULL << 2),
+	BPF_F_PEER	= (1ULL << 3),
+	BPF_F_NEXTHOP	= (1ULL << 4),
  #define BPF_F_REDIRECT_INTERNAL	(BPF_F_NEIGH | BPF_F_PEER | BPF_F_NEXTHOP)
  };

@@ -2468,8 +2472,9 @@ int skb_do_redirect(struct sk_buff *skb)
  		return -EAGAIN;
  	}
  	return flags & BPF_F_NEIGH ?
-	       __bpf_redirect_neigh(skb, dev, flags & BPF_F_NEXTHOP ?
-				    &ri->nh : NULL) :
+	       __bpf_redirect_neigh(skb, dev,
+				    flags & BPF_F_NEXTHOP ? &ri->nh : NULL,
+				    flags & BPF_F_KEEP_TSTAMP) :
  	       __bpf_redirect(skb, dev, flags);
  out_drop:
  	kfree_skb(skb);
@@ -2480,7 +2485,8 @@ BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
  {
  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);

-	if (unlikely(flags & (~(BPF_F_INGRESS) | BPF_F_REDIRECT_INTERNAL)))
+	if (unlikely(flags & (~(BPF_F_INGRESS | BPF_F_KEEP_TSTAMP) |
+			      BPF_F_REDIRECT_INTERNAL)))
  		return TC_ACT_SHOT;

  	ri->flags = flags;
@@ -2523,10 +2529,11 @@ BPF_CALL_4(bpf_redirect_neigh, u32, ifindex, struct bpf_redir_neigh *, params,
  {
  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);

-	if (unlikely((plen && plen < sizeof(*params)) || flags))
+	if (unlikely((plen && plen < sizeof(*params)) ||
+		     (flags & ~BPF_F_KEEP_TSTAMP)))
  		return TC_ACT_SHOT;

-	ri->flags = BPF_F_NEIGH | (plen ? BPF_F_NEXTHOP : 0);
+	ri->flags = BPF_F_NEIGH | (plen ? BPF_F_NEXTHOP : 0) | flags;
  	ri->tgt_index = ifindex;

  	BUILD_BUG_ON(sizeof(struct bpf_redir_neigh) != sizeof(struct bpf_nh_params));
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ba2f38246f07..782b152a000c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5472,7 +5472,7 @@ void skb_scrub_packet(struct sk_buff *skb, bool xnet)

  	ipvs_reset(skb);
  	skb->mark = 0;
-	skb->tstamp = 0;
+	skb_xfer_tstamp(skb);
  }
  EXPORT_SYMBOL_GPL(skb_scrub_packet);

-- 
2.21.0



^ permalink raw reply related

* AW: [net-next 6/6] can: mcp251xfd: mcp251xfd_regmap_crc_read(): work around broken CRC on TBC register
From: Sven Schuchmann @ 2021-12-09 12:53 UTC (permalink / raw)
  To: Marc Kleine-Budde
  Cc: Thomas.Kopp@microchip.com, pavel.modilaynen@volvocars.com,
	drew@beagleboard.org, linux-can@vger.kernel.org,
	menschel.p@posteo.de, netdev@vger.kernel.org, will@macchina.cc
In-Reply-To: <20211209112748.yixzz4xskw6qm7bw@pengutronix.de>

Hello Marc,

> Von: Marc Kleine-Budde <mkl@pengutronix.de>
> Gesendet: Donnerstag, 9. Dezember 2021 12:28
> An: Sven Schuchmann <schuchmann@schleissheimer.de>
> 
> On 09.12.2021 11:17:09, Sven Schuchmann wrote:
> > we are also seeing the CRC Errors in our setup (rpi4, Kernel 5.10.x)
> > from time to time. I just wanted to post here what I am seeing, maybe
> > it helps...
> >
> > [    6.761711] spi_master spi1: will run message pump with realtime priority
> > [    6.778063] mcp251xfd spi1.0 can1: MCP2518FD rev0.0 (-RX_INT -MAB_NO_WARN +CRC_REG
> +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz r:17.00MHz e:16.66MHz) successfully
> initialized.
> >
> > [ 4327.107856] mcp251xfd spi1.0 canfd1: CRC read error at address 0x0010 (length=4,
> data=00 cc 62 c4, CRC=0xa3a0) retrying.
> > [ 7770.163335] mcp251xfd spi1.0 canfd1: CRC read error at address 0x0010 (length=4,
> data=00 bf 16 d5, CRC=0x9d3c) retrying.
> > [ 8000.565955] mcp251xfd spi1.0 canfd1: CRC read error at address 0x0010 (length=4,
> data=00 40 66 fa, CRC=0x31d7) retrying.
> > [ 9753.658173] mcp251xfd spi1.0 canfd1: CRC read error at address 0x0010 (length=4,
> data=80 e9 01 4e, CRC=0xe862) retrying.
> 
> You are using the a back port of my HW timestamp in your v5.10 branch.
> So every 45 seconds the TBC register (address 0x0010) is read,
> additionally for every CAN error frame.
>
> In the mean time, I've implemented a workaround for the CRC read errors:
> 
> | c7eb923c3caf can: mcp251xfd: mcp251xfd_regmap_crc_read(): work around broken CRC on TBC
> register
> | ef7a8c3e7599 can: mcp251xfd: mcp251xfd_regmap_crc_read_one(): Factor out crc check into
> separate function
> 
> It fixes the CRC read error, if the first data byte is 0x00 or 0x80.
> 
> These messages should disappear, if you cherry-pick the above patches.

Sorry for the confusion, you are right.
I picked the two patches and so far no more CRC read errors.
Thanks a lot.

Sven

^ permalink raw reply

* Re: [syzbot] BUG: sleeping function called from invalid context in hci_cmd_sync_cancel
From: Benjamin Berg @ 2021-12-09 12:46 UTC (permalink / raw)
  To: Oliver Neukum, syzbot, Thinh.Nguyen, changbin.du,
	christian.brauner, davem, edumazet, gregkh, johan.hedberg, kuba,
	linux-bluetooth, linux-kernel, linux-usb, luiz.dentz,
	luiz.von.dentz, marcel, mathias.nyman, netdev, stern,
	syzkaller-bugs, yajun.deng
In-Reply-To: <3e8cba55-5d34-eab3-0625-687b66bb9449@suse.com>

[-- Attachment #1: Type: text/plain, Size: 1013 bytes --]

Hi,

On Thu, 2021-12-09 at 11:06 +0100, Oliver Neukum wrote:
> As __cancel_work_timer can be called from hci_cmd_sync_cancel() this is
> just not
> an approach you can take. It looks like asynchronously canceling the
> scheduled work
> would result in a race, so I would for now just revert.

Right, so this needs to be pushed into a workqueue instead, I suppose.

> What issue exactly is this trying to fix or improve?

The problem is aborting long-running synchronous operations. i.e.
without this patchset, USB enumeration will hang for 10s if a USB
bluetooth device disappears during firmware loading. This is because
even though the USB device is gone and all URB submissions fail, the
operation will only be aborted after the internal timeout happens.

The device in turn disappears because an rfkill switch is blocked and
the platform removes it from the bus. Overall, this can lead to
graphical login to hang as fprintd cannot initialise as it hangs in USB
enumeration.

Benjamin

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH 1/1] sctp: Protect cached endpoints to prevent possible UAF
From: Marcelo Ricardo Leitner @ 2021-12-09 12:42 UTC (permalink / raw)
  To: Lee Jones
  Cc: linux-kernel, Vlad Yasevich, Neil Horman, David S. Miller,
	Jakub Kicinski, lksctp developers, H.P. Yarroll, Karl Knutson,
	Jon Grimm, Xingang Guo, Hui Huang, Sridhar Samudrala, Daisy Chang,
	Ryan Layer, Kevin Gao, netdev
In-Reply-To: <20211208165434.2962062-1-lee.jones@linaro.org>

On Wed, Dec 08, 2021 at 04:54:34PM +0000, Lee Jones wrote:
> To prevent this from happening we need to take a reference on the
> to-be-used/dereferenced 'struct sctp_endpoint' until such a time when
> we know it can be safely released.
> 
> When KASAN is not enabled, a similar, but slightly different NULL
> pointer derefernce crash occurs later along the thread of execution in
> inet_sctp_diag_fill() this time.

Hey Lee, did you try running lksctp-tools [1] func tests with this patch?
I'm getting failures here.

[root@vm1 func_tests]# make v4test
./test_assoc_abort
test_assoc_abort.c  1 PASS : ABORT an association using SCTP_ABORT
test_assoc_abort passes

./test_assoc_shutdown
test_assoc_shutdown.c  1 BROK : bind: Address already in use
DUMP_CORE ../../src/testlib/sctputil.h: 145
/bin/sh: line 1:  3727 Segmentation fault      (core dumped) ./$a
test_assoc_shutdown fails
make: *** [Makefile:1648: v4test] Error 1

I didn't check it closely but it would seem that the ep is beind held
forever. If I simply retry after a few seconds, it's still there (now the 1st
test fails):

[root@vm1 func_tests]# make v4test
./test_assoc_abort
test_assoc_abort.c  1 BROK : bind: Address already in use
DUMP_CORE ../../src/testlib/sctputil.h: 145
/bin/sh: line 1:  3751 Segmentation fault      (core dumped) ./$a
test_assoc_abort fails

1.https://github.com/sctp/lksctp-tools

  Marcelo

^ permalink raw reply

* [PATCH v1 1/1] include/linux/unaligned: Replace kernel.h with the necessary inclusions
From: Andy Shevchenko @ 2021-12-09 12:38 UTC (permalink / raw)
  To: Arend van Spriel, Andy Shevchenko, linux-wireless,
	brcm80211-dev-list.pdl, SHA-cyfmac-dev-list, netdev, linux-kernel
  Cc: Arend van Spriel, Franky Lin, Hante Meuleman, Chi-hsien Lin,
	Wright Feng, Chung-hsien Hsu, Kalle Valo, David S. Miller,
	Jakub Kicinski, Andrew Morton, heikki.krogerus

When kernel.h is used in the headers it adds a lot into dependency hell,
especially when there are circular dependencies are involved.

Replace kernel.h inclusion with the list of what is really being used.

The rest of the changes are induced by the above and may not be split.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
---
 drivers/net/wireless/broadcom/brcm80211/brcmfmac/xtlv.c | 2 ++
 include/linux/unaligned/packed_struct.h                 | 2 +-
 lib/lz4/lz4defs.h                                       | 2 ++
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/xtlv.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/xtlv.c
index 2f3c451148db..2f8908074303 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/xtlv.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/xtlv.c
@@ -4,6 +4,8 @@
  */
 
 #include <asm/unaligned.h>
+
+#include <linux/math.h>
 #include <linux/string.h>
 #include <linux/bug.h>
 
diff --git a/include/linux/unaligned/packed_struct.h b/include/linux/unaligned/packed_struct.h
index c0d817de4df2..f4c8eaf4d012 100644
--- a/include/linux/unaligned/packed_struct.h
+++ b/include/linux/unaligned/packed_struct.h
@@ -1,7 +1,7 @@
 #ifndef _LINUX_UNALIGNED_PACKED_STRUCT_H
 #define _LINUX_UNALIGNED_PACKED_STRUCT_H
 
-#include <linux/kernel.h>
+#include <linux/types.h>
 
 struct __una_u16 { u16 x; } __packed;
 struct __una_u32 { u32 x; } __packed;
diff --git a/lib/lz4/lz4defs.h b/lib/lz4/lz4defs.h
index 673bd206aa98..330aa539b46e 100644
--- a/lib/lz4/lz4defs.h
+++ b/lib/lz4/lz4defs.h
@@ -36,6 +36,8 @@
  */
 
 #include <asm/unaligned.h>
+
+#include <linux/bitops.h>
 #include <linux/string.h>	 /* memset, memcpy */
 
 #define FORCE_INLINE __always_inline
-- 
2.33.0


^ permalink raw reply related

* Re: [PATCH 1/1] sctp: Protect cached endpoints to prevent possible UAF
From: Marcelo Ricardo Leitner @ 2021-12-09 12:21 UTC (permalink / raw)
  To: Lee Jones
  Cc: linux-kernel, Vlad Yasevich, Neil Horman, David S. Miller,
	Jakub Kicinski, lksctp developers, H.P. Yarroll, Karl Knutson,
	Jon Grimm, Xingang Guo, Hui Huang, Sridhar Samudrala, Daisy Chang,
	Ryan Layer, Kevin Gao, netdev
In-Reply-To: <YbDlFFVPm/MYEoOQ@google.com>

On Wed, Dec 08, 2021 at 05:02:12PM +0000, Lee Jones wrote:
> On Wed, 08 Dec 2021, Lee Jones wrote:
> 
> > The cause of the resultant dump_stack() reported below is a
> > dereference of a freed pointer to 'struct sctp_endpoint' in
> > sctp_sock_dump().
> > 
> > This race condition occurs when a transport is cached into its
> > associated hash table then freed prior to its subsequent use in
> > sctp_diag_dump() which uses sctp_for_each_transport() to walk the
> > (now out of date) hash table calling into sctp_sock_dump() where the
> > dereference occurs.
> > 
> > To prevent this from happening we need to take a reference on the
> > to-be-used/dereferenced 'struct sctp_endpoint' until such a time when
> > we know it can be safely released.
> > 
> > When KASAN is not enabled, a similar, but slightly different NULL
> > pointer derefernce crash occurs later along the thread of execution in
> > inet_sctp_diag_fill() this time.
> > 
> >   BUG: KASAN: use-after-free in sctp_sock_dump+0xa8/0x438 [sctp_diag]
> >   Call trace:
> >    dump_backtrace+0x0/0x2dc
> >    show_stack+0x20/0x2c
> >    dump_stack+0x120/0x144
> >    print_address_description+0x80/0x2f4
> >    __kasan_report+0x174/0x194
> >    kasan_report+0x10/0x18
> >    __asan_load8+0x84/0x8c
> >    sctp_sock_dump+0xa8/0x438 [sctp_diag]
> >    sctp_for_each_transport+0x1e0/0x26c [sctp]
> >    sctp_diag_dump+0x180/0x1f0 [sctp_diag]
> >    inet_diag_dump+0x12c/0x168
> >    netlink_dump+0x24c/0x5b8
> >    __netlink_dump_start+0x274/0x2a8
> >    inet_diag_handler_cmd+0x224/0x274
> >    sock_diag_rcv_msg+0x21c/0x230
> >    netlink_rcv_skb+0xe0/0x1bc
> >    sock_diag_rcv+0x34/0x48
> >    netlink_unicast+0x3b4/0x430
> >    netlink_sendmsg+0x4f0/0x574
> >    sock_write_iter+0x18c/0x1f0
> >    do_iter_readv_writev+0x230/0x2a8
> >    do_iter_write+0xc8/0x2b4
> >    vfs_writev+0xf8/0x184
> >    do_writev+0xb0/0x1a8
> >    __arm64_sys_writev+0x4c/0x5c
> >    el0_svc_common+0x118/0x250
> >    el0_svc_handler+0x3c/0x9c
> >    el0_svc+0x8/0xc
> 
> This looks related (reported 3 years ago!)
> 
>   https://lore.kernel.org/all/20181122131344.GD31918@localhost.localdomain/

Agree, seems related. Thanks for root causing it.

^ permalink raw reply

* [PATCH] bridge: extend BR_ISOLATE to full split-horizon
From: David Lamparter @ 2021-12-09 12:14 UTC (permalink / raw)
  To: netdev; +Cc: David Lamparter, Nikolay Aleksandrov, Alexandra Winter

Split-horizon essentially just means being able to create multiple
groups of isolated ports that are isolated within the group, but not
with respect to each other.

The intent is very different, while isolation is a policy feature,
split-horizon is intended to provide functional "multiple member ports
are treated as one for loop avoidance."  But it boils down to the same
thing in the end.

Signed-off-by: David Lamparter <equinox@diac24.net>
Cc: Nikolay Aleksandrov <nikolay@nvidia.com>
Cc: Alexandra Winter <wintera@linux.ibm.com>
---

Alexandra, could you double check my change to the qeth_l2 driver?  I
can't really test it...

Cheers,

David
---
 drivers/s390/net/qeth_l2_main.c | 10 ++++++----
 include/linux/if_bridge.h       |  9 ++++++++-
 include/uapi/linux/if_link.h    |  1 +
 net/bridge/br_if.c              | 12 ++++++++++++
 net/bridge/br_input.c           |  2 +-
 net/bridge/br_netlink.c         | 33 +++++++++++++++++++++++++++++++--
 net/bridge/br_private.h         | 13 ++++++++++---
 net/bridge/br_sysfs_if.c        | 33 ++++++++++++++++++++++++++++++++-
 8 files changed, 101 insertions(+), 12 deletions(-)

diff --git a/drivers/s390/net/qeth_l2_main.c b/drivers/s390/net/qeth_l2_main.c
index 303461d70af3..405d36757c22 100644
--- a/drivers/s390/net/qeth_l2_main.c
+++ b/drivers/s390/net/qeth_l2_main.c
@@ -729,8 +729,8 @@ static bool qeth_l2_must_learn(struct net_device *netdev,
 	priv = netdev_priv(netdev);
 	return (netdev != dstdev &&
 		(priv->brport_features & BR_LEARNING_SYNC) &&
-		!(br_port_flag_is_set(netdev, BR_ISOLATED) &&
-		  br_port_flag_is_set(dstdev, BR_ISOLATED)) &&
+		!(br_port_horizon_group(netdev) != 0 &&
+		  br_port_horizon_group(netdev) == br_port_horizon_group(dstdev)) &&
 		(netdev->netdev_ops == &qeth_l2_iqd_netdev_ops ||
 		 netdev->netdev_ops == &qeth_l2_osa_netdev_ops));
 }
@@ -757,6 +757,7 @@ static void qeth_l2_br2dev_worker(struct work_struct *work)
 	struct net_device *lowerdev;
 	struct list_head *iter;
 	int err = 0;
+	u32 horizon_group;
 
 	kfree(br2dev_event_work);
 	QETH_CARD_TEXT_(card, 4, "b2dw%04lx", event);
@@ -770,12 +771,13 @@ static void qeth_l2_br2dev_worker(struct work_struct *work)
 	if (!qeth_l2_must_learn(lsyncdev, dstdev))
 		goto unlock;
 
-	if (br_port_flag_is_set(lsyncdev, BR_ISOLATED)) {
+	horizon_group = br_port_horizon_group(lsyncdev);
+	if (horizon_group) {
 		/* Update lsyncdev and its isolated sibling(s): */
 		iter = &brdev->adj_list.lower;
 		lowerdev = netdev_next_lower_dev_rcu(brdev, &iter);
 		while (lowerdev) {
-			if (br_port_flag_is_set(lowerdev, BR_ISOLATED)) {
+			if (br_port_horizon_group(lowerdev) == horizon_group) {
 				switch (event) {
 				case SWITCHDEV_FDB_ADD_TO_DEVICE:
 					err = dev_uc_add(lowerdev, addr);
diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index 509e18c7e740..a9738cb40066 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -53,7 +53,7 @@ struct br_ip_list {
 #define BR_VLAN_TUNNEL		BIT(13)
 #define BR_BCAST_FLOOD		BIT(14)
 #define BR_NEIGH_SUPPRESS	BIT(15)
-#define BR_ISOLATED		BIT(16)
+/* BR_ISOLATED - previously BIT(16) */
 #define BR_MRP_AWARE		BIT(17)
 #define BR_MRP_LOST_CONT	BIT(18)
 #define BR_MRP_LOST_IN_CONT	BIT(19)
@@ -158,6 +158,7 @@ struct net_device *br_fdb_find_port(const struct net_device *br_dev,
 				    __u16 vid);
 void br_fdb_clear_offload(const struct net_device *dev, u16 vid);
 bool br_port_flag_is_set(const struct net_device *dev, unsigned long flag);
+u32 br_port_horizon_group(const struct net_device *dev);
 u8 br_port_get_stp_state(const struct net_device *dev);
 clock_t br_get_ageing_time(const struct net_device *br_dev);
 #else
@@ -179,6 +180,12 @@ br_port_flag_is_set(const struct net_device *dev, unsigned long flag)
 	return false;
 }
 
+static inline u32
+br_port_horizon_group(const struct net_device *dev)
+{
+	return 0;
+}
+
 static inline u8 br_port_get_stp_state(const struct net_device *dev)
 {
 	return BR_STATE_DISABLED;
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 4ac53b30b6dc..d3a65ae33a62 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -536,6 +536,7 @@ enum {
 	IFLA_BRPORT_MRP_IN_OPEN,
 	IFLA_BRPORT_MCAST_EHT_HOSTS_LIMIT,
 	IFLA_BRPORT_MCAST_EHT_HOSTS_CNT,
+	IFLA_BRPORT_HORIZON_GROUP,	/* split-horizon (isolation) index */
 	__IFLA_BRPORT_MAX
 };
 #define IFLA_BRPORT_MAX (__IFLA_BRPORT_MAX - 1)
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index a52ad81596b7..837d8e2fd191 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -784,3 +784,15 @@ bool br_port_flag_is_set(const struct net_device *dev, unsigned long flag)
 	return p->flags & flag;
 }
 EXPORT_SYMBOL_GPL(br_port_flag_is_set);
+
+u32 br_port_horizon_group(const struct net_device *dev)
+{
+	struct net_bridge_port *p;
+
+	p = br_port_get_rtnl_rcu(dev);
+	if (!p)
+		return 0;
+
+	return p->horizon_group;
+}
+EXPORT_SYMBOL_GPL(br_port_horizon_group);
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index b50382f957c1..b0d71d01bc02 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -112,7 +112,7 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
 		goto drop;
 
 	BR_INPUT_SKB_CB(skb)->brdev = br->dev;
-	BR_INPUT_SKB_CB(skb)->src_port_isolated = !!(p->flags & BR_ISOLATED);
+	BR_INPUT_SKB_CB(skb)->src_horizon_group = p->horizon_group;
 
 	if (IS_ENABLED(CONFIG_INET) &&
 	    (skb->protocol == htons(ETH_P_ARP) ||
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 0c8b5f1a15bc..eb139a73cbbf 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -203,6 +203,7 @@ static inline size_t br_port_info_size(void)
 		+ nla_total_size(sizeof(u8))	/* IFLA_BRPORT_MRP_IN_OPEN */
 		+ nla_total_size(sizeof(u32))	/* IFLA_BRPORT_MCAST_EHT_HOSTS_LIMIT */
 		+ nla_total_size(sizeof(u32))	/* IFLA_BRPORT_MCAST_EHT_HOSTS_CNT */
+		+ nla_total_size(sizeof(u32))	/* IFLA_BRPORT_HORIZON_GROUP */
 		+ 0;
 }
 
@@ -269,7 +270,8 @@ static int br_port_fill_attrs(struct sk_buff *skb,
 							  BR_MRP_LOST_CONT)) ||
 	    nla_put_u8(skb, IFLA_BRPORT_MRP_IN_OPEN,
 		       !!(p->flags & BR_MRP_LOST_IN_CONT)) ||
-	    nla_put_u8(skb, IFLA_BRPORT_ISOLATED, !!(p->flags & BR_ISOLATED)))
+	    nla_put_u8(skb, IFLA_BRPORT_ISOLATED, !!p->horizon_group) ||
+	    nla_put_u32(skb, IFLA_BRPORT_HORIZON_GROUP, p->horizon_group))
 		return -EMSGSIZE;
 
 	timerval = br_timer_value(&p->message_age_timer);
@@ -829,6 +831,7 @@ static const struct nla_policy br_port_policy[IFLA_BRPORT_MAX + 1] = {
 	[IFLA_BRPORT_ISOLATED]	= { .type = NLA_U8 },
 	[IFLA_BRPORT_BACKUP_PORT] = { .type = NLA_U32 },
 	[IFLA_BRPORT_MCAST_EHT_HOSTS_LIMIT] = { .type = NLA_U32 },
+	[IFLA_BRPORT_HORIZON_GROUP] = NLA_POLICY_MIN(NLA_S32, 0),
 };
 
 /* Change the state of the port and notify spanning tree */
@@ -874,6 +877,21 @@ static int br_setport(struct net_bridge_port *p, struct nlattr *tb[],
 	bool br_vlan_tunnel_old;
 	int err;
 
+	if (tb[IFLA_BRPORT_ISOLATED] && tb[IFLA_BRPORT_HORIZON_GROUP]) {
+		if (nla_get_u8(tb[IFLA_BRPORT_ISOLATED]) &&
+		    nla_get_u32(tb[IFLA_BRPORT_HORIZON_GROUP]) == 0) {
+			NL_SET_ERR_MSG(extack,
+				       "HORIZON_GROUP must be non-zero for ISOLATED flag");
+			return -EINVAL;
+		}
+		if (!nla_get_u8(tb[IFLA_BRPORT_ISOLATED]) &&
+		    nla_get_u32(tb[IFLA_BRPORT_HORIZON_GROUP]) != 0) {
+			NL_SET_ERR_MSG(extack,
+				       "ISOLATED cannot be unset with nonzero HORIZON_GROUP");
+			return -EINVAL;
+		}
+	}
+
 	old_flags = p->flags;
 	br_vlan_tunnel_old = (old_flags & BR_VLAN_TUNNEL) ? true : false;
 
@@ -892,7 +910,6 @@ static int br_setport(struct net_bridge_port *p, struct nlattr *tb[],
 	br_set_port_flag(p, tb, IFLA_BRPORT_PROXYARP_WIFI, BR_PROXYARP_WIFI);
 	br_set_port_flag(p, tb, IFLA_BRPORT_VLAN_TUNNEL, BR_VLAN_TUNNEL);
 	br_set_port_flag(p, tb, IFLA_BRPORT_NEIGH_SUPPRESS, BR_NEIGH_SUPPRESS);
-	br_set_port_flag(p, tb, IFLA_BRPORT_ISOLATED, BR_ISOLATED);
 
 	changed_mask = old_flags ^ p->flags;
 
@@ -973,6 +990,18 @@ static int br_setport(struct net_bridge_port *p, struct nlattr *tb[],
 			return err;
 	}
 
+	if (tb[IFLA_BRPORT_HORIZON_GROUP]) {
+		p->horizon_group = nla_get_u32(tb[IFLA_BRPORT_HORIZON_GROUP]);
+	} else if (tb[IFLA_BRPORT_ISOLATED]) {
+		u8 isolated = nla_get_u8(tb[IFLA_BRPORT_ISOLATED]);
+
+		/* don't trample other values set by HORIZON_GROUP */
+		if (isolated && p->horizon_group == 0)
+			p->horizon_group = 1;
+		else if (!isolated)
+			p->horizon_group = 0;
+	}
+
 	return 0;
 }
 
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index af2b3512d86c..7916f0aa39de 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -353,6 +353,12 @@ struct net_bridge_port {
 #endif
 	struct net_bridge_port		__rcu *backup_port;
 
+	/* Ports with the *same* non-zero horizon_group are isolated from
+	 * each other.  Zero or *differing* horizon_group forwards normally.
+	 * UAPI limited to positive signed int. (recommended ifindex namespace)
+	 */
+	u32				horizon_group;
+
 	/* STP */
 	u8				priority;
 	u8				state;
@@ -539,13 +545,14 @@ struct net_bridge {
 struct br_input_skb_cb {
 	struct net_device *brdev;
 
+	u32 src_horizon_group;
+
 	u16 frag_max_size;
 #ifdef CONFIG_BRIDGE_IGMP_SNOOPING
 	u8 igmp;
 	u8 mrouters_only:1;
 #endif
 	u8 proxyarp_replied:1;
-	u8 src_port_isolated:1;
 #ifdef CONFIG_BRIDGE_VLAN_FILTERING
 	u8 vlan_filtered:1;
 #endif
@@ -811,8 +818,8 @@ void br_flood(struct net_bridge *br, struct sk_buff *skb,
 static inline bool br_skb_isolated(const struct net_bridge_port *to,
 				   const struct sk_buff *skb)
 {
-	return BR_INPUT_SKB_CB(skb)->src_port_isolated &&
-	       (to->flags & BR_ISOLATED);
+	return BR_INPUT_SKB_CB(skb)->src_horizon_group &&
+		BR_INPUT_SKB_CB(skb)->src_horizon_group == to->horizon_group;
 }
 
 /* br_if.c */
diff --git a/net/bridge/br_sysfs_if.c b/net/bridge/br_sysfs_if.c
index 07fa76080512..0344315b6f37 100644
--- a/net/bridge/br_sysfs_if.c
+++ b/net/bridge/br_sysfs_if.c
@@ -239,7 +239,37 @@ BRPORT_ATTR_FLAG(proxyarp_wifi, BR_PROXYARP_WIFI);
 BRPORT_ATTR_FLAG(multicast_flood, BR_MCAST_FLOOD);
 BRPORT_ATTR_FLAG(broadcast_flood, BR_BCAST_FLOOD);
 BRPORT_ATTR_FLAG(neigh_suppress, BR_NEIGH_SUPPRESS);
-BRPORT_ATTR_FLAG(isolated, BR_ISOLATED);
+
+static ssize_t show_isolated(struct net_bridge_port *p, char *buf)
+{
+	return sprintf(buf, "%u\n", !!p->horizon_group);
+}
+
+static int br_set_isolated(struct net_bridge_port *p, unsigned long val)
+{
+	if (val == 0)
+		p->horizon_group = 0;
+	else if (p->horizon_group == 0)
+		p->horizon_group = 1;
+	return 0;
+}
+static BRPORT_ATTR(isolated, 0644, show_isolated, br_set_isolated);
+
+static ssize_t show_horizon_group(struct net_bridge_port *p, char *buf)
+{
+	return sprintf(buf, "%u\n", p->horizon_group);
+}
+
+static int br_set_horizon_group(struct net_bridge_port *p, unsigned long val)
+{
+	if (val > INT_MAX)
+		return -EINVAL;
+
+	p->horizon_group = val;
+	return 0;
+}
+static BRPORT_ATTR(horizon_group, 0644, show_horizon_group,
+		   br_set_horizon_group);
 
 #ifdef CONFIG_BRIDGE_IGMP_SNOOPING
 static ssize_t show_multicast_router(struct net_bridge_port *p, char *buf)
@@ -292,6 +322,7 @@ static const struct brport_attribute *brport_attrs[] = {
 	&brport_attr_group_fwd_mask,
 	&brport_attr_neigh_suppress,
 	&brport_attr_isolated,
+	&brport_attr_horizon_group,
 	&brport_attr_backup_port,
 	NULL
 };
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH v2 net-next] tcp: Warn if sock_owned_by_user() is true in tcp_child_process().
From: Kuniyuki Iwashima @ 2021-12-09 12:16 UTC (permalink / raw)
  To: pabeni; +Cc: benh, davem, edumazet, kuba, kuni1840, kuniyu, netdev
In-Reply-To: <bd537766c4b70da71153a9972e6f6ee12e92ff92.camel@redhat.com>

From:   Paolo Abeni <pabeni@redhat.com>
Date:   Thu, 09 Dec 2021 12:59:21 +0100
> On Thu, 2021-12-09 at 20:07 +0900, Kuniyuki Iwashima wrote:
>> From:   Eric Dumazet <edumazet@google.com>
>> Date:   Thu, 9 Dec 2021 00:00:35 -0800
>>> On Wed, Dec 8, 2021 at 5:33 PM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
>>>> 
>>>> While creating a child socket from ACK (not TCP Fast Open case), before
>>>> v2.3.41, we used to call bh_lock_sock() later than now; it was called just
>>>> before tcp_rcv_state_process().  The full socket was put into an accept
>>>> queue and exposed to other CPUs before bh_lock_sock() so that process
>>>> context might have acquired the lock by then.  Thus, we had to check if any
>>>> process context was accessing the socket before tcp_rcv_state_process().
>>>> 
>>> 
>>> I think you misunderstood me.
>>> 
>>> I think this code is not dead yet, so I would :
>>> 
>>> Not include a Fixes: tag to avoid unnecessary backports (of a patch
>>> and its revert)
>>> 
>>> If you want to get syzbot coverage for few releases, especially with
>>> MPTCP and synflood,
>>> you  can then submit a patch like the following.
>> 
>> Sorry, I got on the same page.
>> Let me take a look at MPTCP, then if I still think it is dead code, I will
>> submit the patch.
> 
> For the records, I think the 'else' branch should be reachble with
> MPTCP in some non trivial scenario, e.g. MPJ subflows 3WHS racing with
> setsockopt on the main MPTCP socket. I'm unsure if syzbot could catch
> that, as it needs mptcp endpoints configuration.

Ah, I was wrong.
Thanks for explaining!


> 
> Cheers,
> 
> Paolo

^ permalink raw reply

* Re: [PATCH 1/1] sctp: Protect cached endpoints to prevent possible UAF
From: Marcelo Ricardo Leitner @ 2021-12-09 12:16 UTC (permalink / raw)
  To: Lee Jones
  Cc: linux-kernel, Vlad Yasevich, Neil Horman, David S. Miller,
	Jakub Kicinski, lksctp developers, H.P. Yarroll, Karl Knutson,
	Jon Grimm, Xingang Guo, Hui Huang, Sridhar Samudrala, Daisy Chang,
	Ryan Layer, Kevin Gao, netdev
In-Reply-To: <20211208165434.2962062-1-lee.jones@linaro.org>

On Wed, Dec 08, 2021 at 04:54:34PM +0000, Lee Jones wrote:
> The cause of the resultant dump_stack() reported below is a
> dereference of a freed pointer to 'struct sctp_endpoint' in
> sctp_sock_dump().

Hi,

Please give me another day to review this one.

Thanks,
Marcelo

^ permalink raw reply

* Re: [PATCH net v2 0/3] net/sched: Fix ct zone matching for invalid conntrack state
From: Marcelo Ricardo Leitner @ 2021-12-09 12:11 UTC (permalink / raw)
  To: Paul Blakey
  Cc: dev, netdev, Saeed Mahameed, Cong Wang, Jamal Hadi Salim,
	Pravin B Shelar, davem, Jiri Pirko, wenxu, Oz Shlomo, Vlad Buslov,
	Roi Dayan
In-Reply-To: <20211209075734.10199-1-paulb@nvidia.com>

On Thu, Dec 09, 2021 at 09:57:31AM +0200, Paul Blakey wrote:
> Changelog:
> 	1->2:
> 	  Cover letter wording
> 	  Added blamed CCs

Thanks.

> 
> Paul Blakey (3):
>   net/sched: Extend qdisc control block with tc control block
>   net/sched: flow_dissector: Fix matching on zone id for invalid conns
>   net: openvswitch: Fix matching zone id for invalid conns arriving from tc

I keep getting surprised by how much metadata we have on CT other than
skb->_nfct. :-)

Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox