Netdev List

Netdev List
 help / color / mirror / Atom feed

* Optimizing kernel compilation / alignments for network performance
From: Rafał Miłecki @ 2022-04-27 12:04 UTC (permalink / raw)
  To: Network Development, linux-arm-kernel, Russell King, Andrew Lunn,
	Felix Fietkau
  Cc: openwrt-devel@lists.openwrt.org, Florian Fainelli

Hi,

I noticed years ago that kernel changes touching code - that I don't use
at all - can affect network performance for me.

I work with home routers based on Broadcom Northstar platform. Those
are SoCs with not-so-powerful 2 x ARM Cortex-A9 CPU cores. Main task of
those devices is NAT masquerade and that is what I test with iperf
running on two x86 machines.

***

Example of such unused code change:
ce5013ff3bec ("mtd: spi-nor: Add support for XM25QH64A and XM25QH128A").
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce5013ff3bec05cf2a8a05c75fcd520d9914d92b
It lowered my NAT speed from 381 Mb/s to 367 Mb/s (-3,5%).

I first reported that issue it in the e-mail thread:
ARM router NAT performance affected by random/unrelated commits
https://lkml.org/lkml/2019/5/21/349
https://www.spinics.net/lists/linux-block/msg40624.html

Back then it was commit 5b0890a97204 ("flow_dissector: Parse batman-adv
unicast headers")
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9316a9ed6895c4ad2f0cde171d486f80c55d8283
that increased my NAT speed from 741 Mb/s to 773 Mb/s (+4,3%).

***

It appears Northstar CPUs have little cache size and so any change in
location of kernel symbols can affect NAT performance. That explains why
changing unrelated code affects anything & it has been partially proven
aligning some of cache-v7.S code.

My question is: is there a way to find out & force an optimal symbols
locations?

Adding .align 5 to the cache-v7.S is a partial success. I'd like to find
out what other functions are worth optimizing (aligning) and force that
(I guess  __attribute__((aligned(32))) could be used).

I can't really draw any conclusions from comparing System.map before and
after above commits as they relocate thousands of symbols in one go.

Optimizing is pretty important for me for two reasons:
1. I want to reach maximum possible NAT masquerade performance
2. I need stable performance across random commits to detect regressions

^ permalink raw reply

* Re: [PATCH net-next 5/7] usbnet: smsc95xx: Forward PHY interrupts to PHY driver to avoid polling
From: Andrew Lunn @ 2022-04-27 12:03 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Steve Glendinning, UNGLinuxDriver, Oliver Neukum, David S. Miller,
	Jakub Kicinski, Paolo Abeni, netdev, linux-usb, Andre Edich,
	Oleksij Rempel, Martyn Welch, Gabriel Hojda, Christoph Fritz,
	Lino Sanfilippo, Philipp Rosenberger, Heiner Kallweit,
	Russell King
In-Reply-To: <276a1b50cf9fcca5168ca2770a863cb56069a277.1651037513.git.lukas@wunner.de>

> @@ -606,11 +616,20 @@ static void smsc95xx_status(struct usbnet *dev, struct urb *urb)
>  	intdata = get_unaligned_le32(urb->transfer_buffer);
>  	netif_dbg(dev, link, dev->net, "intdata: 0x%08X\n", intdata);
>  
> +	/* USB interrupts are received in softirq (tasklet) context.
> +	 * Switch to hardirq context to make genirq code happy.
> +	 */
> +	local_irq_save(flags);
> +	__irq_enter_raw();
> +
>  	if (intdata & INT_ENP_PHY_INT_)
> -		;
> +		generic_handle_domain_irq(pdata->irqdomain, PHY_HWIRQ);
>  	else
>  		netdev_warn(dev->net, "unexpected interrupt, intdata=0x%08X\n",
>  			    intdata);

Ah, O.K, forget my previous comment. Maybe add something to the commit
message that the ; will soon be replaced by a call to actually handle
the interrupt.

	Andrew

^ permalink raw reply

* [PATCH net-next 2/2] net: phy: smsc: add LAN8742 phy support.
From: Yuiko Oshino @ 2022-04-27 11:51 UTC (permalink / raw)
  To: woojung.huh, yuiko.oshino, davem, netdev, andrew, ravi.hegde,
	UNGLinuxDriver
In-Reply-To: <20220427115142.12642-1-yuiko.oshino@microchip.com>

The current phy IDs on the available hardware.
        LAN8742 0x0007C130, 0x0007C131

Signed-off-by: Yuiko Oshino <yuiko.oshino@microchip.com>
---
 drivers/net/phy/smsc.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/drivers/net/phy/smsc.c b/drivers/net/phy/smsc.c
index d8cac02a79b9..2b3b5d0ad6f3 100644
--- a/drivers/net/phy/smsc.c
+++ b/drivers/net/phy/smsc.c
@@ -483,6 +483,32 @@ static struct phy_driver smsc_phy_driver[] = {
 
 	.suspend	= genphy_suspend,
 	.resume		= genphy_resume,
+}, {
+	.phy_id	= 0x0007c130,	/* 0x0007c130 and 0x0007c131 */
+	.phy_id_mask	= 0xfffffff2,
+	.name		= "Microchip LAN8742",
+
+	/* PHY_BASIC_FEATURES */
+	.flags		= PHY_RST_AFTER_CLK_EN,
+
+	.probe		= smsc_phy_probe,
+
+	/* basic functions */
+	.read_status	= lan87xx_read_status,
+	.config_init	= smsc_phy_config_init,
+	.soft_reset	= smsc_phy_reset,
+
+	/* IRQ related */
+	.config_intr	= smsc_phy_config_intr,
+	.handle_interrupt = smsc_phy_handle_interrupt,
+
+	/* Statistics */
+	.get_sset_count = smsc_get_sset_count,
+	.get_strings	= smsc_get_strings,
+	.get_stats	= smsc_get_stats,
+
+	.suspend	= genphy_suspend,
+	.resume	= genphy_resume,
 } };
 
 module_phy_driver(smsc_phy_driver);
@@ -498,6 +524,7 @@ static struct mdio_device_id __maybe_unused smsc_tbl[] = {
 	{ 0x0007c0d0, 0xfffffff0 },
 	{ 0x0007c0f0, 0xfffffff0 },
 	{ 0x0007c110, 0xfffffff0 },
+	{ 0x0007c130, 0xfffffff2 },
 	{ }
 };
 
-- 
2.25.1


^ permalink raw reply related

* [REVIEW REQUEST3 PATCH net-next 1/2] net: phy: microchip: update LAN88xx phy ID and phy ID mask.
From: Yuiko Oshino @ 2022-04-27 11:51 UTC (permalink / raw)
  To: woojung.huh, yuiko.oshino, davem, netdev, andrew, ravi.hegde,
	UNGLinuxDriver
In-Reply-To: <20220427115142.12642-1-yuiko.oshino@microchip.com>

update LAN88xx phy ID and phy ID mask because the existing code conflicts with the LAN8742 phy.

The current phy IDs on the available hardware.
        LAN8742 0x0007C130, 0x0007C131
        LAN88xx 0x0007C132

Signed-off-by: Yuiko Oshino <yuiko.oshino@microchip.com>
---
 drivers/net/phy/microchip.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/phy/microchip.c b/drivers/net/phy/microchip.c
index 9f1f2b6c97d4..131caf659ed2 100644
--- a/drivers/net/phy/microchip.c
+++ b/drivers/net/phy/microchip.c
@@ -344,8 +344,8 @@ static int lan88xx_config_aneg(struct phy_device *phydev)
 
 static struct phy_driver microchip_phy_driver[] = {
 {
-	.phy_id		= 0x0007c130,
-	.phy_id_mask	= 0xfffffff0,
+	.phy_id		= 0x0007c132,
+	.phy_id_mask	= 0xfffffff2,
 	.name		= "Microchip LAN88xx",
 
 	/* PHY_GBIT_FEATURES */
@@ -369,7 +369,7 @@ static struct phy_driver microchip_phy_driver[] = {
 module_phy_driver(microchip_phy_driver);
 
 static struct mdio_device_id __maybe_unused microchip_tbl[] = {
-	{ 0x0007c130, 0xfffffff0 },
+	{ 0x0007c132, 0xfffffff2 },
 	{ }
 };
 
-- 
2.25.1


^ permalink raw reply related

* [PATCH net-next 0/2] net: phy: add LAN8742 phy support
From: Yuiko Oshino @ 2022-04-27 11:51 UTC (permalink / raw)
  To: woojung.huh, yuiko.oshino, davem, netdev, andrew, ravi.hegde,
	UNGLinuxDriver

add LAN8742 phy support
update LAN88xx phy ID and phy ID mask so that it can coexist with LAN8742

The current phy IDs on the available hardware.
    LAN8742 0x0007C130, 0x0007C131
    LAN88xx 0x0007C132

Yuiko Oshino (2):
  net: phy: microchip: update LAN88xx phy ID and phy ID mask.
  net: phy: smsc: add LAN8742 phy support.

 drivers/net/phy/microchip.c |  6 +++---
 drivers/net/phy/smsc.c      | 27 +++++++++++++++++++++++++++
 2 files changed, 30 insertions(+), 3 deletions(-)

-- 
2.25.1


^ permalink raw reply

* Re: [PATCH net-next 4/7] usbnet: smsc95xx: Avoid link settings race on interrupt reception
From: Andrew Lunn @ 2022-04-27 11:50 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Steve Glendinning, UNGLinuxDriver, Oliver Neukum, David S. Miller,
	Jakub Kicinski, Paolo Abeni, netdev, linux-usb, Andre Edich,
	Oleksij Rempel, Martyn Welch, Gabriel Hojda, Christoph Fritz,
	Lino Sanfilippo, Philipp Rosenberger, Heiner Kallweit,
	Russell King
In-Reply-To: <28fd5a3d53a401cdf8ae0a116108702a46d2c7a2.1651037513.git.lukas@wunner.de>

>  static void smsc95xx_status(struct usbnet *dev, struct urb *urb)
> @@ -607,7 +607,7 @@ static void smsc95xx_status(struct usbnet *dev, struct urb *urb)
>  	netif_dbg(dev, link, dev->net, "intdata: 0x%08X\n", intdata);
>  
>  	if (intdata & INT_ENP_PHY_INT_)
> -		usbnet_defer_kevent(dev, EVENT_LINK_RESET);
> +		;
>  	else
>  		netdev_warn(dev->net, "unexpected interrupt, intdata=0x%08X\n",
>  			    intdata);

Hi Lukas

It would probably make sense to invert the condition and remove the ;

   Andrew

^ permalink raw reply

* [PATCH bpf-next] bpf, sockmap: Call skb_linearize only when required in sk_psock_skb_ingress_enqueue
From: Liu Jian @ 2022-04-27 11:51 UTC (permalink / raw)
  To: john.fastabend, daniel, jakub, davem, kuba, pabeni, ast, andrii,
	kafai, songliubraving, yhs, kpsingh, netdev, bpf
  Cc: liujian56

The skb_to_sgvec fails only when the number of frag_list and frags exceeds
MAX_MSG_FRAGS. Therefore, we can call skb_linearize only when the
conversion fails.

Signed-off-by: Liu Jian <liujian56@huawei.com>
---
 net/core/skmsg.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index cc381165ea08..22b983ade0e7 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -524,16 +524,20 @@ static int sk_psock_skb_ingress_enqueue(struct sk_buff *skb,
 {
 	int num_sge, copied;
 
-	/* skb linearize may fail with ENOMEM, but lets simply try again
-	 * later if this happens. Under memory pressure we don't want to
-	 * drop the skb. We need to linearize the skb so that the mapping
-	 * in skb_to_sgvec can not error.
-	 */
-	if (skb_linearize(skb))
-		return -EAGAIN;
 	num_sge = skb_to_sgvec(skb, msg->sg.data, off, len);
-	if (unlikely(num_sge < 0))
-		return num_sge;
+	if (num_sge < 0) {
+		/* skb linearize may fail with ENOMEM, but lets simply try again
+		 * later if this happens. Under memory pressure we don't want to
+		 * drop the skb. We need to linearize the skb so that the mapping
+		 * in skb_to_sgvec can not error.
+		 */
+		if (skb_linearize(skb))
+			return -EAGAIN;
+
+		num_sge = skb_to_sgvec(skb, msg->sg.data, off, len);
+		if (unlikely(num_sge < 0))
+			return num_sge;
+	}
 
 	copied = len;
 	msg->sg.start = 0;
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH net-next 0/6] net: remove non-Ethernet drivers using virt_to_bus()
From: patchwork-bot+netdevbpf @ 2022-04-27 11:30 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: davem, pabeni, netdev
In-Reply-To: <20220426175436.417283-1-kuba@kernel.org>

Hello:

This series was applied to netdev/net-next.git (master)
by David S. Miller <davem@davemloft.net>:

On Tue, 26 Apr 2022 10:54:30 -0700 you wrote:
> Networking is currently the main offender in using virt_to_bus().
> Frankly all the drivers which use it are super old and unlikely
> to be used today. They are just an ongoing maintenance burden.
> 
> In other words this series is using virt_to_bus() as an excuse
> to shed some old stuff. Having done the tree-wide dev_addr_set()
> conversion recently I have limited sympathy for carrying dead
> code.
> 
> [...]

Here is the summary with links:
  - [net-next,1/6] net: atm: remove support for Fujitsu FireStream ATM devices
    https://git.kernel.org/netdev/net-next/c/41c335c82123
  - [net-next,2/6] net: atm: remove support for Madge Horizon ATM devices
    https://git.kernel.org/netdev/net-next/c/5b74a20d35ab
  - [net-next,3/6] net: atm: remove support for ZeitNet ZN122x ATM devices
    https://git.kernel.org/netdev/net-next/c/052e1f01bfae
  - [net-next,4/6] net: wan: remove support for COSA and SRP synchronous serial boards
    https://git.kernel.org/netdev/net-next/c/89fbca3307d4
  - [net-next,5/6] net: wan: remove support for Z85230-based devices
    https://git.kernel.org/netdev/net-next/c/bc6df26f1f78
  - [net-next,6/6] net: hamradio: remove support for DMA SCC devices
    https://git.kernel.org/netdev/net-next/c/865e2eb08f51

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [RFC v1 2/3] net: dsa: mt753x: make CPU-Port dynamic
From: Vladimir Oltean @ 2022-04-27 11:28 UTC (permalink / raw)
  To: Frank Wunderlich
  Cc: linux-mediatek, linux-rockchip, Frank Wunderlich, Rob Herring,
	Krzysztof Kozlowski, Heiko Stuebner, Sean Wang, Landen Chao,
	DENG Qingfang, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	David S. Miller, Jakub Kicinski, Paolo Abeni, Matthias Brugger,
	Peter Geis, devicetree, linux-arm-kernel, linux-kernel, netdev
In-Reply-To: <20220426134924.30372-3-linux@fw-web.de>

On Tue, Apr 26, 2022 at 03:49:23PM +0200, Frank Wunderlich wrote:
> From: Frank Wunderlich <frank-w@public-files.de>
> 
> Currently CPU-Port is hardcoded to Port 6.
> 
> On BPI-R2-Pro board this port is not connected and only Port 5 is
> connected to gmac of SoC.
> 
> Replace this hardcoded CPU-Port with a member in mt7530_priv struct
> which is set in mt753x_cpu_port_enable to the right port.
> 
> I defined a default in probe (in case no CPU-Port will be setup) and
> if both cpu-port were setup port 6 will be used like the const prior
> this patch.
> 
> In mt7531_setup first access is before we know which port should be used
> (mt753x_cpu_port_enable) so section "BPDU to CPU port" needs to be moved
> down.
> 
> Signed-off-by: Frank Wunderlich <frank-w@public-files.de>
> ---
>  drivers/net/dsa/mt7530.c | 46 ++++++++++++++++++++++------------------
>  drivers/net/dsa/mt7530.h |  2 +-
>  2 files changed, 26 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c
> index ccf4cb944167..4789105b8137 100644
> --- a/drivers/net/dsa/mt7530.c
> +++ b/drivers/net/dsa/mt7530.c
> @@ -1004,6 +1004,7 @@ mt753x_cpu_port_enable(struct dsa_switch *ds, int port)
>  			return ret;
>  	}
>  
> +	priv->cpu_port = port;
>  	/* Enable Mediatek header mode on the cpu port */
>  	mt7530_write(priv, MT7530_PVC_P(port),
>  		     PORT_SPEC_TAG);
> @@ -1041,7 +1042,7 @@ mt7530_port_enable(struct dsa_switch *ds, int port,
>  	 * restore the port matrix if the port is the member of a certain
>  	 * bridge.
>  	 */
> -	priv->ports[port].pm |= PCR_MATRIX(BIT(MT7530_CPU_PORT));
> +	priv->ports[port].pm |= PCR_MATRIX(BIT(priv->cpu_port));

This is called with an "int port" argument, so could you please do:

	struct dsa_port *dp = dsa_to_port(ds, port);

	if (dsa_port_is_user(dp)) {
		struct dsa_port *cpu_dp = dp->cpu_dp;

		priv->ports[port].pm |= PCR_MATRIX(BIT(cpu_dp->index));
	}

>  	priv->ports[port].enable = true;
>  	mt7530_rmw(priv, MT7530_PCR_P(port), PCR_MATRIX_MASK,
>  		   priv->ports[port].pm);
> @@ -1190,8 +1191,8 @@ mt7530_port_bridge_join(struct dsa_switch *ds, int port,
>  			struct netlink_ext_ack *extack)
>  {
>  	struct dsa_port *dp = dsa_to_port(ds, port), *other_dp;
> -	u32 port_bitmap = BIT(MT7530_CPU_PORT);
>  	struct mt7530_priv *priv = ds->priv;
> +	u32 port_bitmap = BIT(priv->cpu_port);

Here we know for certain that "port" is a user port, so could you please:

	struct dsa_port *cpu_dp = dp->cpu_dp;
	u32 port_bitmap = BIT(cpu_dp->index);

>  
>  	mutex_lock(&priv->reg_mutex);
>  
> @@ -1267,9 +1268,9 @@ mt7530_port_set_vlan_unaware(struct dsa_switch *ds, int port)
>  	 * the CPU port get out of VLAN filtering mode.
>  	 */
>  	if (all_user_ports_removed) {
> -		mt7530_write(priv, MT7530_PCR_P(MT7530_CPU_PORT),
> +		mt7530_write(priv, MT7530_PCR_P(priv->cpu_port),
>  			     PCR_MATRIX(dsa_user_ports(priv->ds)));
> -		mt7530_write(priv, MT7530_PVC_P(MT7530_CPU_PORT), PORT_SPEC_TAG
> +		mt7530_write(priv, MT7530_PVC_P(priv->cpu_port), PORT_SPEC_TAG
>  			     | PVC_EG_TAG(MT7530_VLAN_EG_CONSISTENT));
>  	}
>  }

Same idea here, go through dp->cpu_dp.

> @@ -1335,8 +1336,8 @@ mt7530_port_bridge_leave(struct dsa_switch *ds, int port,
>  	 */
>  	if (priv->ports[port].enable)
>  		mt7530_rmw(priv, MT7530_PCR_P(port), PCR_MATRIX_MASK,
> -			   PCR_MATRIX(BIT(MT7530_CPU_PORT)));
> -	priv->ports[port].pm = PCR_MATRIX(BIT(MT7530_CPU_PORT));
> +			   PCR_MATRIX(BIT(priv->cpu_port)));
> +	priv->ports[port].pm = PCR_MATRIX(BIT(priv->cpu_port));

Same.

>  
>  	/* When a port is removed from the bridge, the port would be set up
>  	 * back to the default as is at initial boot which is a VLAN-unaware
> @@ -1503,6 +1504,7 @@ static int
>  mt7530_port_vlan_filtering(struct dsa_switch *ds, int port, bool vlan_filtering,
>  			   struct netlink_ext_ack *extack)
>  {
> +	struct mt7530_priv *priv = ds->priv;
>  	if (vlan_filtering) {
>  		/* The port is being kept as VLAN-unaware port when bridge is
>  		 * set up with vlan_filtering not being set, Otherwise, the
> @@ -1510,7 +1512,7 @@ mt7530_port_vlan_filtering(struct dsa_switch *ds, int port, bool vlan_filtering,
>  		 * for becoming a VLAN-aware port.
>  		 */
>  		mt7530_port_set_vlan_aware(ds, port);
> -		mt7530_port_set_vlan_aware(ds, MT7530_CPU_PORT);
> +		mt7530_port_set_vlan_aware(ds, priv->cpu_port);
>  	} else {
>  		mt7530_port_set_vlan_unaware(ds, port);
>  	}

Same.

> @@ -1526,7 +1528,7 @@ mt7530_hw_vlan_add(struct mt7530_priv *priv,
>  	u32 val;
>  
>  	new_members = entry->old_members | BIT(entry->port) |
> -		      BIT(MT7530_CPU_PORT);
> +		      BIT(priv->cpu_port);
>  
>  	/* Validate the entry with independent learning, create egress tag per
>  	 * VLAN and joining the port as one of the port members.
> @@ -1550,8 +1552,8 @@ mt7530_hw_vlan_add(struct mt7530_priv *priv,
>  	 * DSA tag.
>  	 */
>  	mt7530_rmw(priv, MT7530_VAWD2,
> -		   ETAG_CTRL_P_MASK(MT7530_CPU_PORT),
> -		   ETAG_CTRL_P(MT7530_CPU_PORT,
> +		   ETAG_CTRL_P_MASK(priv->cpu_port),
> +		   ETAG_CTRL_P(priv->cpu_port,
>  			       MT7530_VLAN_EGRESS_STACK));
>  }

This one is interesting. For some reason, the driver author felt the
need to explicitly add BIT(MT7530_CPU_PORT) to new_members, even though
mt7530_port_vlan_add() will be called on the CPU port too

bridge vlan add dev swp0 vid 100
bridge vlan add dev br0 vid 100 self		# here

I would first make a patch that leaves it up to DSA to call
port_vlan_add for the CPU port, rather than doing it implicitly.
There is a certain advantage to that too. You can do autonomous
forwarding in a certain VLAN, but not add br0 to that VLAN, and then,
you could avoid flooding the CPU with those packets, if you know that
software doesn't need to process them.

The modified function would look like this:

static void
mt7530_hw_vlan_add(struct mt7530_priv *priv,
		   struct mt7530_hw_vlan_entry *entry)
{
	struct dsa_port *dp = dsa_to_port(priv->ds, entry->port);
	u8 new_members;
	u32 val;

	new_members = entry->old_members | BIT(entry->port);

	/* Validate the entry with independent learning, create egress tag per
	 * VLAN and joining the port as one of the port members.
	 */
	val = IVL_MAC | VTAG_EN | PORT_MEM(new_members) | FID(FID_BRIDGED) |
	      VLAN_VALID;
	mt7530_write(priv, MT7530_VAWD1, val);

	/* Decide whether adding tag or not for those outgoing packets from the
	 * port inside the VLAN.
	 * CPU port is always taken as a tagged port for serving more than one
	 * VLANs across and also being applied with egress type stack mode for
	 * that VLAN tags would be appended after hardware special tag used as
	 * DSA tag.
	 */
	if (dsa_port_is_cpu(dp))
		val = MT7530_VLAN_EGRESS_STACK;
	else if (entry->untagged)
		val = MT7530_VLAN_EGRESS_UNTAG;
	else
		val = MT7530_VLAN_EGRESS_TAG;
	mt7530_rmw(priv, MT7530_VAWD2,
		   ETAG_CTRL_P_MASK(entry->port),
		   ETAG_CTRL_P(entry->port, val));
}

Then please confirm that there aren't regressions in local traffic
termination for a VLAN-aware bridge. Pinging over a bridge with
vlan_filtering=1 should be sufficient, since that is done in the default
pvid of 1.

With this change, you no longer need to concern yourself about the fixed
value of MT7530_CPU_PORT.

>  
> @@ -1575,7 +1577,7 @@ mt7530_hw_vlan_del(struct mt7530_priv *priv,
>  	 * the entry would be kept valid. Otherwise, the entry is got to be
>  	 * disabled.
>  	 */
> -	if (new_members && new_members != BIT(MT7530_CPU_PORT)) {
> +	if (new_members && new_members != BIT(priv->cpu_port)) {
>  		val = IVL_MAC | VTAG_EN | PORT_MEM(new_members) |
>  		      VLAN_VALID;
>  		mt7530_write(priv, MT7530_VAWD1, val);

Here the logic is again too convoluted for its own good. Let's think
before changing.

What I think was happening was this. DSA used to be super blind when
adding a VLAN to the CPU port. Since there is no netdev for the CPU
port, then what DSA used to do was to implicitly program a bridge VLAN to
the CPU port whenever that VLAN was programmed on a user port. The
downside is that you never know when you can remove that VLAN, since the
CPU port is shared between multiple user ports and refcounting was
impossible due to technical switchdev API details.

Here, the mt7530 driver tries to outsmart DSA by saying "hey, if the CPU
port is the only remaining member of the VLAN, this VLAN is garbage,
let's remove it". Which, I mean, fair point.

But since commit 134ef2388e7f ("net: dsa: add explicit support for host
bridge VLANs"), DSA does a better job of keeping track of VLANs added on
CPU ports, and it doesn't leave dangling VLANs behind. You need to trust it.

So in this case, you would have to change the code like this:

	if (new_members) {
		val = IVL_MAC | VTAG_EN | PORT_MEM(new_members) |
		      VLAN_VALID;
		mt7530_write(priv, MT7530_VAWD1, val);
	} else {
		mt7530_write(priv, MT7530_VAWD1, 0);
		mt7530_write(priv, MT7530_VAWD2, 0);
	}

and then delete the comment, it isn't relevant any longer.

> @@ -2105,7 +2107,7 @@ mt7530_setup(struct dsa_switch *ds)
>  	 * controller also is the container for two GMACs nodes representing
>  	 * as two netdev instances.
>  	 */
> -	dn = dsa_to_port(ds, MT7530_CPU_PORT)->master->dev.of_node->parent;
> +	dn = dsa_to_port(ds, priv->cpu_port)->master->dev.of_node->parent;

Holy ####, this is convoluted. This gets us the OF node of the container
of the DSA master, alright.

Could you do something more generic? Something like this:

	struct device_node *dn = NULL;
	struct dsa_port *cpu_dp;

	dsa_switch_for_each_cpu_port(cpu_dp, ds) {
		dn = cpu_dp->master->dev.of_node->parent;
		/* It doesn't matter which CPU port is found first,
		 * their masters should share the same parent OF node
		 */
		break;
	}

	if (!dn) {
		ERROR ERROR ERROR, no CPU port
	}

>  	ds->assisted_learning_on_cpu_port = true;
>  	ds->mtu_enforcement_ingress = true;
>  
> @@ -2337,15 +2339,6 @@ mt7531_setup(struct dsa_switch *ds)
>  	mt7531_ind_c45_phy_write(priv, MT753X_CTRL_PHY_ADDR, MDIO_MMD_VEND2,
>  				 CORE_PLL_GROUP4, val);
>  
> -	/* BPDU to CPU port */
> -	mt7530_rmw(priv, MT7531_CFC, MT7531_CPU_PMAP_MASK,
> -		   BIT(MT7530_CPU_PORT));
> -	mt7530_rmw(priv, MT753X_BPC, MT753X_BPDU_PORT_FW_MASK,
> -		   MT753X_BPDU_CPU_ONLY);
> -
> -	/* Enable and reset MIB counters */
> -	mt7530_mib_reset(ds);
> -
>  	for (i = 0; i < MT7530_NUM_PORTS; i++) {
>  		/* Disable forwarding by default on all ports */
>  		mt7530_rmw(priv, MT7530_PCR_P(i), PCR_MATRIX_MASK,
> @@ -2373,6 +2366,15 @@ mt7531_setup(struct dsa_switch *ds)
>  			   PVC_EG_TAG(MT7530_VLAN_EG_CONSISTENT));
>  	}
>  
> +	/* BPDU to CPU port */
> +	mt7530_rmw(priv, MT7531_CFC, MT7531_CPU_PMAP_MASK,
> +		   BIT(priv->cpu_port));
> +	mt7530_rmw(priv, MT753X_BPC, MT753X_BPDU_PORT_FW_MASK,
> +		   MT753X_BPDU_CPU_ONLY);
> +
> +	/* Enable and reset MIB counters */
> +	mt7530_mib_reset(ds);
> +
>  	/* Setup VLAN ID 0 for VLAN-unaware bridges */
>  	ret = mt7530_setup_vlan0(priv);
>  	if (ret)
> @@ -3213,6 +3215,8 @@ mt7530_probe(struct mdio_device *mdiodev)
>  	if (!priv)
>  		return -ENOMEM;
>  
> +	priv->cpu_port = 6;
> +
>  	priv->ds = devm_kzalloc(&mdiodev->dev, sizeof(*priv->ds), GFP_KERNEL);
>  	if (!priv->ds)
>  		return -ENOMEM;
> diff --git a/drivers/net/dsa/mt7530.h b/drivers/net/dsa/mt7530.h
> index 91508e2feef9..62df8d10f6d4 100644
> --- a/drivers/net/dsa/mt7530.h
> +++ b/drivers/net/dsa/mt7530.h
> @@ -8,7 +8,6 @@
>  
>  #define MT7530_NUM_PORTS		7
>  #define MT7530_NUM_PHYS			5
> -#define MT7530_CPU_PORT			6
>  #define MT7530_NUM_FDB_RECORDS		2048
>  #define MT7530_ALL_MEMBERS		0xff
>  
> @@ -823,6 +822,7 @@ struct mt7530_priv {
>  	u8			mirror_tx;
>  
>  	struct mt7530_port	ports[MT7530_NUM_PORTS];
> +	int			cpu_port;

And this should no longer be needed.

>  	/* protect among processes for registers access*/
>  	struct mutex reg_mutex;
>  	int irq;
> -- 
> 2.25.1
> 


^ permalink raw reply

* Re: [PATCH net-next v2 0/5] net: lan966x: Add support for PTP programmable pins
From: patchwork-bot+netdevbpf @ 2022-04-27 11:20 UTC (permalink / raw)
  To: Horatiu Vultur
  Cc: netdev, devicetree, linux-kernel, davem, kuba, pabeni, robh+dt,
	krzysztof.kozlowski+dt, UNGLinuxDriver, richardcochran
In-Reply-To: <20220427065127.3765659-1-horatiu.vultur@microchip.com>

Hello:

This series was applied to netdev/net-next.git (master)
by David S. Miller <davem@davemloft.net>:

On Wed, 27 Apr 2022 08:51:22 +0200 you wrote:
> Lan966x has 8 PTP programmable pins. The last pin is hardcoded to be used
> by PHC0 and all the rest are shareable between the PHCs. The PTP pins can
> implement both extts and perout functions.
> 
> v1->v2:
> - use ptp_find_pin_unlocked instead of ptp_find_pin inside the irq handler.
> 
> [...]

Here is the summary with links:
  - [net-next,v2,1/5] dt-bindings: net: lan966x: Extend with the ptp external interrupt.
    https://git.kernel.org/netdev/net-next/c/c1a519919d04
  - [net-next,v2,2/5] net: lan966x: Change the PTP pin used to read/write the PHC.
    https://git.kernel.org/netdev/net-next/c/77f2accb501a
  - [net-next,v2,3/5] net: lan966x: Add registers used to configure the PTP pin
    https://git.kernel.org/netdev/net-next/c/3adc11e5fc5f
  - [net-next,v2,4/5] net: lan966x: Add support for PTP_PF_PEROUT
    https://git.kernel.org/netdev/net-next/c/2b7ff2588ec2
  - [net-next,v2,5/5] net: lan966x: Add support for PTP_PF_EXTTS
    https://git.kernel.org/netdev/net-next/c/f3d8e0a9c28b

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* [PATCH net] netfilter: conn: fix udp offload timeout sysctl
From: Volodymyr Mytnyk @ 2022-04-27 11:09 UTC (permalink / raw)
  To: netdev
  Cc: Taras Chornyi, Serhiy Pshyk, Volodymyr Mytnyk, Oz Shlomo,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	David S. Miller, Jakub Kicinski, Paolo Abeni, Paul Blakey,
	netfilter-devel, coreteam, linux-kernel

`nf_flowtable_udp_timeout` sysctl option is available only
if CONFIG_NFT_FLOW_OFFLOAD enabled. But infra for this flow
offload UDP timeout was added under CONFIG_NF_FLOW_TABLE
config option. So, if you have CONFIG_NFT_FLOW_OFFLOAD
disabled and CONFIG_NF_FLOW_TABLE enabled, the
`nf_flowtable_udp_timeout` is not present in sysfs.
Please note, that TCP flow offload timeout sysctl option
is present even CONFIG_NFT_FLOW_OFFLOAD is disabled.

I suppose it was a typo in commit that adds UDP flow offload
timeout and CONFIG_NF_FLOW_TABLE should be used instead.

Fixes: 975c57504da1 ("netfilter: conntrack: Introduce udp offload timeout configuration")
Signed-off-by: Volodymyr Mytnyk <volodymyr.mytnyk@plvision.eu>
---
 net/netfilter/nf_conntrack_standalone.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
index 3e1afd10a9b6..55aa55b252b2 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -823,7 +823,7 @@ static struct ctl_table nf_ct_sysctl_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
 	},
-#if IS_ENABLED(CONFIG_NFT_FLOW_OFFLOAD)
+#if IS_ENABLED(CONFIG_NF_FLOW_TABLE)
 	[NF_SYSCTL_CT_PROTO_TIMEOUT_UDP_OFFLOAD] = {
 		.procname	= "nf_flowtable_udp_timeout",
 		.maxlen		= sizeof(unsigned int),
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH v2 net-next 07/10] net: dsa: request drivers to perform FDB isolation
From: Hans Schultz @ 2022-04-27 11:06 UTC (permalink / raw)
  To: Vladimir Oltean, Hans Schultz
  Cc: Andrew Lunn, netdev@vger.kernel.org, Jakub Kicinski,
	David S. Miller, Florian Fainelli, Vivien Didelot,
	Vladimir Oltean, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	UNGLinuxDriver@microchip.com, Sean Wang, Landen Chao,
	DENG Qingfang, Claudiu Manoil, Alexandre Belloni, Linus Walleij,
	Alvin Šipraga, George McCollister
In-Reply-To: <20220427103209.luyfereepqaha7dw@skbuf>

On ons, apr 27, 2022 at 10:32, Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> On Wed, Apr 27, 2022 at 10:38:18AM +0200, Hans Schultz wrote:
>> 
>> I see that there is definitions for 64bit mac addresses out there, which
>> might also be needed to be supported at some point?
>
> I don't know about 64-bit MAC addresses, do you have more information?
>

I have only seen that there is a section about it in rfc7043:
https://www.rfc-editor.org/rfc/rfc7043

and some in rfc7042 also.

^ permalink raw reply

* Re: [PATCH net 0/4][pull request] Intel Wired LAN Driver Updates 2022-04-26
From: patchwork-bot+netdevbpf @ 2022-04-27 10:00 UTC (permalink / raw)
  To: Tony Nguyen; +Cc: davem, kuba, pabeni, netdev
In-Reply-To: <20220426203018.3790378-1-anthony.l.nguyen@intel.com>

Hello:

This series was applied to netdev/net.git (master)
by Tony Nguyen <anthony.l.nguyen@intel.com>:

On Tue, 26 Apr 2022 13:30:14 -0700 you wrote:
> This series contains updates to ice driver only.
> 
> Ivan Vecera removes races related to VF message processing by changing
> mutex_trylock() call to mutex_lock() and moving additional operations
> to occur under mutex.
> 
> Petr Oros increases wait time after firmware flash as current time is
> not sufficient.
> 
> [...]

Here is the summary with links:
  - [net,1/4] ice: Fix incorrect locking in ice_vc_process_vf_msg()
    https://git.kernel.org/netdev/net/c/aaf461af729b
  - [net,2/4] ice: Protect vf_state check by cfg_lock in ice_vc_process_vf_msg()
    https://git.kernel.org/netdev/net/c/77d64d285be5
  - [net,3/4] ice: wait 5 s for EMP reset after firmware flash
    https://git.kernel.org/netdev/net/c/b537752e6cbf
  - [net,4/4] ice: fix use-after-free when deinitializing mailbox snapshot
    https://git.kernel.org/netdev/net/c/b668f4cd715a

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* [PATCH] memcg: accounting for objects allocated for new netdevice
From: Vasily Averin @ 2022-04-27 10:37 UTC (permalink / raw)
  To: Roman Gushchin, Vlastimil Babka, Shakeel Butt
  Cc: kernel, Florian Westphal, linux-kernel, Michal Hocko, cgroups,
	netdev, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Greg Kroah-Hartman, Tejun Heo, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, linux-fsdevel

Creating a new netdevice allocates at least ~50Kb of memory for various
kernel objects, but only ~5Kb of them are accounted to memcg. As a result,
creating an unlimited number of netdevice inside a memcg-limited container
does not fall within memcg restrictions, consumes a significant part
of the host's memory, can cause global OOM and lead to random kills of
host processes.

The main consumers of non-accounted memory are:
 ~10Kb   80+ kernfs nodes
 ~6Kb    ipv6_add_dev() allocations
  6Kb    __register_sysctl_table() allocations
  4Kb    neigh_sysctl_register() allocations
  4Kb    __devinet_sysctl_register() allocations
  4Kb    __addrconf_sysctl_register() allocations

Accounting of these objects allows to increase the share of memcg-related
memory up to 60-70% (~38Kb accounted vs ~54Kb total for dummy netdevice
on typical VM with default Fedora 35 kernel) and this should be enough
to somehow protect the host from misuse inside container.

Other related objects are quite small and may not be taken into account
to minimize the expected performance degradation.

It should be separately mentonied ~300 bytes of percpu allocation
of struct ipstats_mib in snmp6_alloc_dev(), on huge multi-cpu nodes
it can become the main consumer of memory.

Signed-off-by: Vasily Averin <vvs@openvz.org>
---
RFC was discussed here:
https://lore.kernel.org/all/a5e09e93-106d-0527-5b1e-48dbf3b48b4e@virtuozzo.com/
---
 fs/kernfs/mount.c     | 2 +-
 fs/proc/proc_sysctl.c | 2 +-
 net/core/neighbour.c  | 2 +-
 net/ipv4/devinet.c    | 2 +-
 net/ipv6/addrconf.c   | 8 ++++----
 5 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index cfa79715fc1a..2881aeeaa880 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -391,7 +391,7 @@ void __init kernfs_init(void)
 {
 	kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
 					      sizeof(struct kernfs_node),
-					      0, SLAB_PANIC, NULL);
+					      0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	/* Creates slab cache for kernfs inode attributes */
 	kernfs_iattrs_cache  = kmem_cache_create("kernfs_iattrs_cache",
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 7d9cfc730bd4..df4604fea4f8 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -1333,7 +1333,7 @@ struct ctl_table_header *__register_sysctl_table(
 		nr_entries++;
 
 	header = kzalloc(sizeof(struct ctl_table_header) +
-			 sizeof(struct ctl_node)*nr_entries, GFP_KERNEL);
+			 sizeof(struct ctl_node)*nr_entries, GFP_KERNEL_ACCOUNT);
 	if (!header)
 		return NULL;
 
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index ec0bf737b076..3dcda2a54f86 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -3728,7 +3728,7 @@ int neigh_sysctl_register(struct net_device *dev, struct neigh_parms *p,
 	char neigh_path[ sizeof("net//neigh/") + IFNAMSIZ + IFNAMSIZ ];
 	char *p_name;
 
-	t = kmemdup(&neigh_sysctl_template, sizeof(*t), GFP_KERNEL);
+	t = kmemdup(&neigh_sysctl_template, sizeof(*t), GFP_KERNEL_ACCOUNT);
 	if (!t)
 		goto err;
 
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index fba2bffd65f7..47523fe5b891 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -2566,7 +2566,7 @@ static int __devinet_sysctl_register(struct net *net, char *dev_name,
 	struct devinet_sysctl_table *t;
 	char path[sizeof("net/ipv4/conf/") + IFNAMSIZ];
 
-	t = kmemdup(&devinet_sysctl, sizeof(*t), GFP_KERNEL);
+	t = kmemdup(&devinet_sysctl, sizeof(*t), GFP_KERNEL_ACCOUNT);
 	if (!t)
 		goto out;
 
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index f908e2fd30b2..e79621ee4a0a 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -342,7 +342,7 @@ static int snmp6_alloc_dev(struct inet6_dev *idev)
 {
 	int i;
 
-	idev->stats.ipv6 = alloc_percpu(struct ipstats_mib);
+	idev->stats.ipv6 = alloc_percpu_gfp(struct ipstats_mib, GFP_KERNEL_ACCOUNT);
 	if (!idev->stats.ipv6)
 		goto err_ip;
 
@@ -358,7 +358,7 @@ static int snmp6_alloc_dev(struct inet6_dev *idev)
 	if (!idev->stats.icmpv6dev)
 		goto err_icmp;
 	idev->stats.icmpv6msgdev = kzalloc(sizeof(struct icmpv6msg_mib_device),
-					   GFP_KERNEL);
+					   GFP_KERNEL_ACCOUNT);
 	if (!idev->stats.icmpv6msgdev)
 		goto err_icmpmsg;
 
@@ -382,7 +382,7 @@ static struct inet6_dev *ipv6_add_dev(struct net_device *dev)
 	if (dev->mtu < IPV6_MIN_MTU)
 		return ERR_PTR(-EINVAL);
 
-	ndev = kzalloc(sizeof(struct inet6_dev), GFP_KERNEL);
+	ndev = kzalloc(sizeof(struct inet6_dev), GFP_KERNEL_ACCOUNT);
 	if (!ndev)
 		return ERR_PTR(err);
 
@@ -7029,7 +7029,7 @@ static int __addrconf_sysctl_register(struct net *net, char *dev_name,
 	struct ctl_table *table;
 	char path[sizeof("net/ipv6/conf/") + IFNAMSIZ];
 
-	table = kmemdup(addrconf_sysctl, sizeof(addrconf_sysctl), GFP_KERNEL);
+	table = kmemdup(addrconf_sysctl, sizeof(addrconf_sysctl), GFP_KERNEL_ACCOUNT);
 	if (!table)
 		goto out;
 
-- 
2.31.1


^ permalink raw reply related

* [PATCH] memcg: enable accounting for veth queues
From: Vasily Averin @ 2022-04-27 10:34 UTC (permalink / raw)
  To: Roman Gushchin, Vlastimil Babka, Shakeel Butt
  Cc: kernel, linux-kernel, Michal Hocko, cgroups, netdev,
	David S. Miller, Jakub Kicinski, Paolo Abeni

veth netdevice defines own rx queues and allocates array containing
up to 4095 ~750-bytes-long 'struct veth_rq' elements. Such allocation
is quite huge and should be accounted to memcg.

Signed-off-by: Vasily Averin <vvs@openvz.org>
---
 drivers/net/veth.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index d29fb9759cc9..bd67f458641a 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1310,7 +1310,7 @@ static int veth_alloc_queues(struct net_device *dev)
 	struct veth_priv *priv = netdev_priv(dev);
 	int i;
 
-	priv->rq = kcalloc(dev->num_rx_queues, sizeof(*priv->rq), GFP_KERNEL);
+	priv->rq = kcalloc(dev->num_rx_queues, sizeof(*priv->rq), GFP_KERNEL_ACCOUNT);
 	if (!priv->rq)
 		return -ENOMEM;
 
-- 
2.31.1


^ permalink raw reply related

* Re: [PATCH v2 net-next 07/10] net: dsa: request drivers to perform FDB isolation
From: Vladimir Oltean @ 2022-04-27 10:32 UTC (permalink / raw)
  To: Hans Schultz
  Cc: Andrew Lunn, netdev@vger.kernel.org, Jakub Kicinski,
	David S. Miller, Florian Fainelli, Vivien Didelot,
	Vladimir Oltean, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	UNGLinuxDriver@microchip.com, Sean Wang, Landen Chao,
	DENG Qingfang, Claudiu Manoil, Alexandre Belloni, Linus Walleij,
	Alvin Šipraga, George McCollister
In-Reply-To: <86ilquapl1.fsf@gmail.com>

On Wed, Apr 27, 2022 at 10:38:18AM +0200, Hans Schultz wrote:
> > But we surely wouldn't pass _all_ parameters of port_fdb_add through
> > some giant struct args_of_port_fdb_add, would we?  Not ds, port, db,
> > just what is naturally grouped together as an FDB entry: addr, vid,
> > maybe your new static/dynamic thing.
> >
> > If we group the addr and vid in port_fdb_add into a structure that
> > represents an FDB entry, what do we do about those in port_fdb_del?
> > Group those as well, maybe for consistency?
> 
> I think the 'old' interface that several other functions use should have
> one struct... e.g. port, addr and vid. But somehow it would be good to
> have something more dynamic. There could be two layer of structs, but
> generally i think that for these op functions now in relation to fdb
> should only have structs as parameters in a logical way that is
> expandable and thus future oriented.

As a disadvantage to your proposal, such a change would not only modify
the port_fdb_add prototype of these functions once again, but it would
also modify the way in which they *access* their arguments (instead of
accessing "addr" they now need to access "fdb->addr" for example).

You can argue that this would be the change to end all changes, but what
if we then need to add more unrelated arguments to port_fdb_add, like an
"extack". You still need to modify the prototypes for all drivers again.
I think it's a non-goal, you may disagree.

> Something else to consider is what do switchcore drivers that don't use
> 'include/net/dsa.h' do and why?

They tend to copy-paste bad coding patterns from each other. The
SWITCHDEV_FDB_ADD_TO_DEVICE and SWITCHDEV_FDB_DEL_TO_DEVICE handlers are
quite a big mess, but that's a story for another time.

Otherwise, generally speaking, they have access from the atomic notifier
to the struct switchdev_notifier_fdb_info *fdb_info, then they allocate
a work item on a private work queue, copy the stuff from this notifier
object that they find relevant, and use that private structure from the
deferred context.

DSA does basically the same thing, except for the fact that the hardware
access is abstracted behind an indirect call to port_fdb_add().

> > Restated: do we want to treat the "static/dynamic" info as a property of
> > an FDB entry (i.e. a member of the structure), or as the way in which a
> > certain FDB entry can be added to hardware (case in which it is relevant
> > only to _add and to _dump)?  After all, an FDB entry for {addr, vid}
> > learned statically, and another FDB entry for the same {addr, vid} but
> > learned dynamically, are fundamentally the same object.
> 
> I cannot answer for the workings of all switchcores, but for my sake I
> use a debug tool to show the age of a dynamic entry in the ATU, so I
> don't think that it has much relevance outside of that.
> 
> > And if we don't go with a big struct args_of_port_fdb_add (which would
> > be silly if we did), who guarantees that the argument list of port_fdb_add
> > won't grow in the future anyway? Like in the example I just gave above,
> > where "static/dynamic" doesn't fully appear to group naturally with
> > "addr" and "vid", and would probably still be a separate boolean,
> > rendering the whole point moot.
> >
> > And even grouping only those last 2 together is maybe debatable due to
> > practical reasons - where do we declare this small structure? We have a
> > struct switchdev_notifier_fdb_info with some more stuff that we
> > deliberately do not want to expose, and {addr, vid} are all that remain.
> >
> > Although maybe there are benefits to having a small {addr, vid} structure
> > defined somewhere publicly, too, and referenced consistently by driver
> > code. Like for example to avoid bad patterns from proliferating.
> > Currently we have "const unsigned char *addr, u16 vid", so on a 64 bit
> > machine, this is a pointer plus an unsigned short, 10 bytes, padded up
> > by the compiler, maybe to 16. But the Ethernet addresses are 6 bytes,
> > those are shorter than the pointer itself, so on a 64-bit machine,
> > having "unsigned char addr[ETH_ALEN], u16 vid" makes a lot more space,
> > saves some real memory.
> 
> I see that there is definitions for 64bit mac addresses out there, which
> might also be needed to be supported at some point?

I don't know about 64-bit MAC addresses, do you have more information?

> > Anyway, I'm side tracking. If you want to make changes, propose a
> > patch, but come up with a more real argument than "reducing churn"
> > (while effectively producing churn).
> 
> Unfortunately I don't have the time to make such a patch suggestion for
> some time to come as I also have other patch sets coming up, and I
> should study a bit what your patch set with the dsa_db is about also, so
> maybe I must just add the bool to port_fdb_add() for now.

Yeah, I should update the DSA documentation with clear info about that,
it's just the commit messages for now.

> > To give you a counter example, phylink_mac_config() used to have the
> > opposite problem, the arguments were all neatly packed into a struct
> > phylink_link_state, but as the kerneldocs from include/linux/phylink.h
> > put it, not all members of that structure were guaranteed to contain
> > valid values. So there were bugs due to people not realizing this, and
> > consequently, phylink_mac_link_up() took the opposite approach, of
> > explicitly passing all known parameters of the resolved link state as
> > individual arguments. Now there are 10 arguments to that function, but
> > somehow at least this appears to have worked better, in the sense that
> > there isn't an explanatory note saying what's valid and what isn't.
> 
> Yes, I can see the danger of it, but something like phylink is also
> different as it is more hardware related, which has a slower development
> cycle than feature/protocol stuff.

My advice is just add what you need to add. The prototype changes for
port_fdb_add took place in:

2022-02-25 c26933639b54 ("net: dsa: request drivers to perform FDB isolation")
2017-08-06 1b6dd556c304 ("net: dsa: Remove prepare phase for FDB")
2017-08-06 6c2c1dcb185f ("net: dsa: Change DSA slave FDB API to be switchdev independent")
2016-04-06 8497aa618dd6 ("net: dsa: make the FDB add function return void")
2015-10-08 1f36faf26943 ("net: dsa: push prepare phase in port_fdb_add")
2015-08-10 2a778e1b5899 ("net: dsa: change FDB routines prototypes")
2015-08-06 55045ddded0f ("net: dsa: add support for switchdev FDB objects")
2015-03-26 339d82626d22 ("net: dsa: Add basic framework to support ndo_fdb functions")

As you can see, those aren't a lot of changes. I'm happy to see new
development in this area, but there was a long period of stability,
which is likely to continue.

Also, if you study the phylink code, you'll notice that it does't have
"a slower development cycle", quite the contrary, it is even aggressively
converting drivers to make use of new API, and marking unconverted drivers
as deprecated/legacy.

^ permalink raw reply

* [PATCH mlx5-next 3/5] vfio/mlx5: Manage the VF attach/detach callback from the PF
From: Yishai Hadas @ 2022-04-27  9:31 UTC (permalink / raw)
  To: alex.williamson, jgg, saeedm
  Cc: kvm, netdev, kuba, leonro, yishaih, maorg, cohuck
In-Reply-To: <20220427093120.161402-1-yishaih@nvidia.com>

Manage the VF attach/detach callback from the PF.

This lets the driver to enable parallel VFs migration as will be
introduced in the next patch.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vfio/pci/mlx5/cmd.c  | 59 +++++++++++++++++++++++++++++++++---
 drivers/vfio/pci/mlx5/cmd.h  | 23 +++++++++++++-
 drivers/vfio/pci/mlx5/main.c | 25 ++++-----------
 3 files changed, 82 insertions(+), 25 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index d608b8167f58..1f84d7b9b9e5 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -71,21 +71,70 @@ int mlx5vf_cmd_query_vhca_migration_state(struct pci_dev *pdev, u16 vhca_id,
 	return ret;
 }
 
-bool mlx5vf_cmd_is_migratable(struct pci_dev *pdev)
+static int mlx5fv_vf_event(struct notifier_block *nb,
+			   unsigned long event, void *data)
 {
-	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
+	struct mlx5vf_pci_core_device *mvdev =
+		container_of(nb, struct mlx5vf_pci_core_device, nb);
+
+	mutex_lock(&mvdev->state_mutex);
+	switch (event) {
+	case MLX5_PF_NOTIFY_ENABLE_VF:
+		mvdev->mdev_detach = false;
+		break;
+	case MLX5_PF_NOTIFY_DISABLE_VF:
+		mvdev->mdev_detach = true;
+		break;
+	default:
+		break;
+	}
+	mlx5vf_state_mutex_unlock(mvdev);
+	return 0;
+}
+
+void mlx5vf_cmd_remove_migratable(struct mlx5vf_pci_core_device *mvdev)
+{
+	mlx5_sriov_blocking_notifier_unregister(mvdev->mdev, mvdev->vf_id,
+						&mvdev->nb);
+}
+
+bool mlx5vf_cmd_is_migratable(struct mlx5vf_pci_core_device *mvdev)
+{
+	struct pci_dev *pdev = mvdev->core_device.pdev;
 	bool migratable = false;
+	int ret;
 
-	if (!mdev)
+	mvdev->mdev = mlx5_vf_get_core_dev(pdev);
+	if (!mvdev->mdev)
 		return false;
+	if (!MLX5_CAP_GEN(mvdev->mdev, migration))
+		goto end;
+	mvdev->vf_id = pci_iov_vf_id(pdev);
+	if (mvdev->vf_id < 0)
+		goto end;
 
-	if (!MLX5_CAP_GEN(mdev, migration))
+	mutex_init(&mvdev->state_mutex);
+	spin_lock_init(&mvdev->reset_lock);
+	mvdev->nb.notifier_call = mlx5fv_vf_event;
+	ret = mlx5_sriov_blocking_notifier_register(mvdev->mdev, mvdev->vf_id,
+						    &mvdev->nb);
+	if (ret)
 		goto end;
 
+	mutex_lock(&mvdev->state_mutex);
+	if (mvdev->mdev_detach)
+		goto unreg;
+
+	mlx5vf_state_mutex_unlock(mvdev);
 	migratable = true;
+	goto end;
 
+unreg:
+	mlx5vf_state_mutex_unlock(mvdev);
+	mlx5_sriov_blocking_notifier_unregister(mvdev->mdev, mvdev->vf_id,
+						&mvdev->nb);
 end:
-	mlx5_vf_put_core_dev(mdev);
+	mlx5_vf_put_core_dev(mvdev->mdev);
 	return migratable;
 }
 
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index 2da6a1c0ec5c..f47174eab4b8 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -7,6 +7,7 @@
 #define MLX5_VFIO_CMD_H
 
 #include <linux/kernel.h>
+#include <linux/vfio_pci_core.h>
 #include <linux/mlx5/driver.h>
 
 struct mlx5_vf_migration_file {
@@ -24,14 +25,34 @@ struct mlx5_vf_migration_file {
 	unsigned long last_offset;
 };
 
+struct mlx5vf_pci_core_device {
+	struct vfio_pci_core_device core_device;
+	int vf_id;
+	u16 vhca_id;
+	u8 migrate_cap:1;
+	u8 deferred_reset:1;
+	/* protect migration state */
+	struct mutex state_mutex;
+	enum vfio_device_mig_state mig_state;
+	/* protect the reset_done flow */
+	spinlock_t reset_lock;
+	struct mlx5_vf_migration_file *resuming_migf;
+	struct mlx5_vf_migration_file *saving_migf;
+	struct notifier_block nb;
+	struct mlx5_core_dev *mdev;
+	u8 mdev_detach:1;
+};
+
 int mlx5vf_cmd_suspend_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod);
 int mlx5vf_cmd_resume_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod);
 int mlx5vf_cmd_query_vhca_migration_state(struct pci_dev *pdev, u16 vhca_id,
 					  size_t *state_size);
 int mlx5vf_cmd_get_vhca_id(struct pci_dev *pdev, u16 function_id, u16 *vhca_id);
-bool mlx5vf_cmd_is_migratable(struct pci_dev *pdev);
+bool mlx5vf_cmd_is_migratable(struct mlx5vf_pci_core_device *mvdev);
+void mlx5vf_cmd_remove_migratable(struct mlx5vf_pci_core_device *mvdev);
 int mlx5vf_cmd_save_vhca_state(struct pci_dev *pdev, u16 vhca_id,
 			       struct mlx5_vf_migration_file *migf);
 int mlx5vf_cmd_load_vhca_state(struct pci_dev *pdev, u16 vhca_id,
 			       struct mlx5_vf_migration_file *migf);
+void mlx5vf_state_mutex_unlock(struct mlx5vf_pci_core_device *mvdev);
 #endif /* MLX5_VFIO_CMD_H */
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index 2578f61eaeae..445c516d38d9 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -17,7 +17,6 @@
 #include <linux/uaccess.h>
 #include <linux/vfio.h>
 #include <linux/sched/mm.h>
-#include <linux/vfio_pci_core.h>
 #include <linux/anon_inodes.h>
 
 #include "cmd.h"
@@ -25,20 +24,6 @@
 /* Arbitrary to prevent userspace from consuming endless memory */
 #define MAX_MIGRATION_SIZE (512*1024*1024)
 
-struct mlx5vf_pci_core_device {
-	struct vfio_pci_core_device core_device;
-	u16 vhca_id;
-	u8 migrate_cap:1;
-	u8 deferred_reset:1;
-	/* protect migration state */
-	struct mutex state_mutex;
-	enum vfio_device_mig_state mig_state;
-	/* protect the reset_done flow */
-	spinlock_t reset_lock;
-	struct mlx5_vf_migration_file *resuming_migf;
-	struct mlx5_vf_migration_file *saving_migf;
-};
-
 static struct page *
 mlx5vf_get_migration_page(struct mlx5_vf_migration_file *migf,
 			  unsigned long offset)
@@ -444,7 +429,7 @@ mlx5vf_pci_step_device_state_locked(struct mlx5vf_pci_core_device *mvdev,
  * This function is called in all state_mutex unlock cases to
  * handle a 'deferred_reset' if exists.
  */
-static void mlx5vf_state_mutex_unlock(struct mlx5vf_pci_core_device *mvdev)
+void mlx5vf_state_mutex_unlock(struct mlx5vf_pci_core_device *mvdev)
 {
 again:
 	spin_lock(&mvdev->reset_lock);
@@ -597,13 +582,11 @@ static int mlx5vf_pci_probe(struct pci_dev *pdev,
 		return -ENOMEM;
 	vfio_pci_core_init_device(&mvdev->core_device, pdev, &mlx5vf_pci_ops);
 
-	if (pdev->is_virtfn && mlx5vf_cmd_is_migratable(pdev)) {
+	if (pdev->is_virtfn && mlx5vf_cmd_is_migratable(mvdev)) {
 		mvdev->migrate_cap = 1;
 		mvdev->core_device.vdev.migration_flags =
 			VFIO_MIGRATION_STOP_COPY |
 			VFIO_MIGRATION_P2P;
-		mutex_init(&mvdev->state_mutex);
-		spin_lock_init(&mvdev->reset_lock);
 	}
 
 	ret = vfio_pci_core_register_device(&mvdev->core_device);
@@ -614,6 +597,8 @@ static int mlx5vf_pci_probe(struct pci_dev *pdev,
 	return 0;
 
 out_free:
+	if (mvdev->migrate_cap)
+		mlx5vf_cmd_remove_migratable(mvdev);
 	vfio_pci_core_uninit_device(&mvdev->core_device);
 	kfree(mvdev);
 	return ret;
@@ -624,6 +609,8 @@ static void mlx5vf_pci_remove(struct pci_dev *pdev)
 	struct mlx5vf_pci_core_device *mvdev = dev_get_drvdata(&pdev->dev);
 
 	vfio_pci_core_unregister_device(&mvdev->core_device);
+	if (mvdev->migrate_cap)
+		mlx5vf_cmd_remove_migratable(mvdev);
 	vfio_pci_core_uninit_device(&mvdev->core_device);
 	kfree(mvdev);
 }
-- 
2.18.1


^ permalink raw reply related

* [PATCH mlx5-next 4/5] vfio/mlx5: Refactor to enable VFs migration in parallel
From: Yishai Hadas @ 2022-04-27  9:31 UTC (permalink / raw)
  To: alex.williamson, jgg, saeedm
  Cc: kvm, netdev, kuba, leonro, yishaih, maorg, cohuck
In-Reply-To: <20220427093120.161402-1-yishaih@nvidia.com>

Refactor to enable different VFs to run their commands over the PF
command interface in parallel and to not block one each other.

This is done by not using the global PF lock that was used before but
relying on the VF attach/detach mechanism to sync.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vfio/pci/mlx5/cmd.c  | 102 +++++++++++++++--------------------
 drivers/vfio/pci/mlx5/cmd.h  |  11 ++--
 drivers/vfio/pci/mlx5/main.c |  44 ++++-----------
 3 files changed, 58 insertions(+), 99 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index 1f84d7b9b9e5..ba06b797d630 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -5,70 +5,65 @@
 
 #include "cmd.h"
 
-int mlx5vf_cmd_suspend_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod)
+static int mlx5vf_cmd_get_vhca_id(struct mlx5_core_dev *mdev, u16 function_id,
+				  u16 *vhca_id);
+
+int mlx5vf_cmd_suspend_vhca(struct mlx5vf_pci_core_device *mvdev, u16 op_mod)
 {
-	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
 	u32 out[MLX5_ST_SZ_DW(suspend_vhca_out)] = {};
 	u32 in[MLX5_ST_SZ_DW(suspend_vhca_in)] = {};
-	int ret;
 
-	if (!mdev)
+	lockdep_assert_held(&mvdev->state_mutex);
+	if (mvdev->mdev_detach)
 		return -ENOTCONN;
 
 	MLX5_SET(suspend_vhca_in, in, opcode, MLX5_CMD_OP_SUSPEND_VHCA);
-	MLX5_SET(suspend_vhca_in, in, vhca_id, vhca_id);
+	MLX5_SET(suspend_vhca_in, in, vhca_id, mvdev->vhca_id);
 	MLX5_SET(suspend_vhca_in, in, op_mod, op_mod);
 
-	ret = mlx5_cmd_exec_inout(mdev, suspend_vhca, in, out);
-	mlx5_vf_put_core_dev(mdev);
-	return ret;
+	return mlx5_cmd_exec_inout(mvdev->mdev, suspend_vhca, in, out);
 }
 
-int mlx5vf_cmd_resume_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod)
+int mlx5vf_cmd_resume_vhca(struct mlx5vf_pci_core_device *mvdev, u16 op_mod)
 {
-	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
 	u32 out[MLX5_ST_SZ_DW(resume_vhca_out)] = {};
 	u32 in[MLX5_ST_SZ_DW(resume_vhca_in)] = {};
-	int ret;
 
-	if (!mdev)
+	lockdep_assert_held(&mvdev->state_mutex);
+	if (mvdev->mdev_detach)
 		return -ENOTCONN;
 
 	MLX5_SET(resume_vhca_in, in, opcode, MLX5_CMD_OP_RESUME_VHCA);
-	MLX5_SET(resume_vhca_in, in, vhca_id, vhca_id);
+	MLX5_SET(resume_vhca_in, in, vhca_id, mvdev->vhca_id);
 	MLX5_SET(resume_vhca_in, in, op_mod, op_mod);
 
-	ret = mlx5_cmd_exec_inout(mdev, resume_vhca, in, out);
-	mlx5_vf_put_core_dev(mdev);
-	return ret;
+	return mlx5_cmd_exec_inout(mvdev->mdev, resume_vhca, in, out);
 }
 
-int mlx5vf_cmd_query_vhca_migration_state(struct pci_dev *pdev, u16 vhca_id,
+int mlx5vf_cmd_query_vhca_migration_state(struct mlx5vf_pci_core_device *mvdev,
 					  size_t *state_size)
 {
-	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
 	u32 out[MLX5_ST_SZ_DW(query_vhca_migration_state_out)] = {};
 	u32 in[MLX5_ST_SZ_DW(query_vhca_migration_state_in)] = {};
 	int ret;
 
-	if (!mdev)
+	lockdep_assert_held(&mvdev->state_mutex);
+	if (mvdev->mdev_detach)
 		return -ENOTCONN;
 
 	MLX5_SET(query_vhca_migration_state_in, in, opcode,
 		 MLX5_CMD_OP_QUERY_VHCA_MIGRATION_STATE);
-	MLX5_SET(query_vhca_migration_state_in, in, vhca_id, vhca_id);
+	MLX5_SET(query_vhca_migration_state_in, in, vhca_id, mvdev->vhca_id);
 	MLX5_SET(query_vhca_migration_state_in, in, op_mod, 0);
 
-	ret = mlx5_cmd_exec_inout(mdev, query_vhca_migration_state, in, out);
+	ret = mlx5_cmd_exec_inout(mvdev->mdev, query_vhca_migration_state, in,
+				  out);
 	if (ret)
-		goto end;
+		return ret;
 
 	*state_size = MLX5_GET(query_vhca_migration_state_out, out,
 			       required_umem_size);
-
-end:
-	mlx5_vf_put_core_dev(mdev);
-	return ret;
+	return 0;
 }
 
 static int mlx5fv_vf_event(struct notifier_block *nb,
@@ -125,6 +120,9 @@ bool mlx5vf_cmd_is_migratable(struct mlx5vf_pci_core_device *mvdev)
 	if (mvdev->mdev_detach)
 		goto unreg;
 
+	if (mlx5vf_cmd_get_vhca_id(mvdev->mdev, mvdev->vf_id + 1, &mvdev->vhca_id))
+		goto unreg;
+
 	mlx5vf_state_mutex_unlock(mvdev);
 	migratable = true;
 	goto end;
@@ -138,23 +136,18 @@ bool mlx5vf_cmd_is_migratable(struct mlx5vf_pci_core_device *mvdev)
 	return migratable;
 }
 
-int mlx5vf_cmd_get_vhca_id(struct pci_dev *pdev, u16 function_id, u16 *vhca_id)
+static int mlx5vf_cmd_get_vhca_id(struct mlx5_core_dev *mdev, u16 function_id,
+				  u16 *vhca_id)
 {
-	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
 	u32 in[MLX5_ST_SZ_DW(query_hca_cap_in)] = {};
 	int out_size;
 	void *out;
 	int ret;
 
-	if (!mdev)
-		return -ENOTCONN;
-
 	out_size = MLX5_ST_SZ_BYTES(query_hca_cap_out);
 	out = kzalloc(out_size, GFP_KERNEL);
-	if (!out) {
-		ret = -ENOMEM;
-		goto end;
-	}
+	if (!out)
+		return -ENOMEM;
 
 	MLX5_SET(query_hca_cap_in, in, opcode, MLX5_CMD_OP_QUERY_HCA_CAP);
 	MLX5_SET(query_hca_cap_in, in, other_function, 1);
@@ -172,8 +165,6 @@ int mlx5vf_cmd_get_vhca_id(struct pci_dev *pdev, u16 function_id, u16 *vhca_id)
 
 err_exec:
 	kfree(out);
-end:
-	mlx5_vf_put_core_dev(mdev);
 	return ret;
 }
 
@@ -218,21 +209,23 @@ static int _create_state_mkey(struct mlx5_core_dev *mdev, u32 pdn,
 	return err;
 }
 
-int mlx5vf_cmd_save_vhca_state(struct pci_dev *pdev, u16 vhca_id,
+int mlx5vf_cmd_save_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 			       struct mlx5_vf_migration_file *migf)
 {
-	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
 	u32 out[MLX5_ST_SZ_DW(save_vhca_state_out)] = {};
 	u32 in[MLX5_ST_SZ_DW(save_vhca_state_in)] = {};
+	struct mlx5_core_dev *mdev;
 	u32 pdn, mkey;
 	int err;
 
-	if (!mdev)
+	lockdep_assert_held(&mvdev->state_mutex);
+	if (mvdev->mdev_detach)
 		return -ENOTCONN;
 
+	mdev = mvdev->mdev;
 	err = mlx5_core_alloc_pd(mdev, &pdn);
 	if (err)
-		goto end;
+		return err;
 
 	err = dma_map_sgtable(mdev->device, &migf->table.sgt, DMA_FROM_DEVICE,
 			      0);
@@ -246,7 +239,7 @@ int mlx5vf_cmd_save_vhca_state(struct pci_dev *pdev, u16 vhca_id,
 	MLX5_SET(save_vhca_state_in, in, opcode,
 		 MLX5_CMD_OP_SAVE_VHCA_STATE);
 	MLX5_SET(save_vhca_state_in, in, op_mod, 0);
-	MLX5_SET(save_vhca_state_in, in, vhca_id, vhca_id);
+	MLX5_SET(save_vhca_state_in, in, vhca_id, mvdev->vhca_id);
 	MLX5_SET(save_vhca_state_in, in, mkey, mkey);
 	MLX5_SET(save_vhca_state_in, in, size, migf->total_length);
 
@@ -254,37 +247,28 @@ int mlx5vf_cmd_save_vhca_state(struct pci_dev *pdev, u16 vhca_id,
 	if (err)
 		goto err_exec;
 
-	migf->total_length =
-		MLX5_GET(save_vhca_state_out, out, actual_image_size);
-
-	mlx5_core_destroy_mkey(mdev, mkey);
-	mlx5_core_dealloc_pd(mdev, pdn);
-	dma_unmap_sgtable(mdev->device, &migf->table.sgt, DMA_FROM_DEVICE, 0);
-	mlx5_vf_put_core_dev(mdev);
-
-	return 0;
-
+	migf->total_length = MLX5_GET(save_vhca_state_out, out,
+				      actual_image_size);
 err_exec:
 	mlx5_core_destroy_mkey(mdev, mkey);
 err_create_mkey:
 	dma_unmap_sgtable(mdev->device, &migf->table.sgt, DMA_FROM_DEVICE, 0);
 err_dma_map:
 	mlx5_core_dealloc_pd(mdev, pdn);
-end:
-	mlx5_vf_put_core_dev(mdev);
 	return err;
 }
 
-int mlx5vf_cmd_load_vhca_state(struct pci_dev *pdev, u16 vhca_id,
+int mlx5vf_cmd_load_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 			       struct mlx5_vf_migration_file *migf)
 {
-	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
+	struct mlx5_core_dev *mdev;
 	u32 out[MLX5_ST_SZ_DW(save_vhca_state_out)] = {};
 	u32 in[MLX5_ST_SZ_DW(save_vhca_state_in)] = {};
 	u32 pdn, mkey;
 	int err;
 
-	if (!mdev)
+	lockdep_assert_held(&mvdev->state_mutex);
+	if (mvdev->mdev_detach)
 		return -ENOTCONN;
 
 	mutex_lock(&migf->lock);
@@ -293,6 +277,7 @@ int mlx5vf_cmd_load_vhca_state(struct pci_dev *pdev, u16 vhca_id,
 		goto end;
 	}
 
+	mdev = mvdev->mdev;
 	err = mlx5_core_alloc_pd(mdev, &pdn);
 	if (err)
 		goto end;
@@ -308,7 +293,7 @@ int mlx5vf_cmd_load_vhca_state(struct pci_dev *pdev, u16 vhca_id,
 	MLX5_SET(load_vhca_state_in, in, opcode,
 		 MLX5_CMD_OP_LOAD_VHCA_STATE);
 	MLX5_SET(load_vhca_state_in, in, op_mod, 0);
-	MLX5_SET(load_vhca_state_in, in, vhca_id, vhca_id);
+	MLX5_SET(load_vhca_state_in, in, vhca_id, mvdev->vhca_id);
 	MLX5_SET(load_vhca_state_in, in, mkey, mkey);
 	MLX5_SET(load_vhca_state_in, in, size, migf->total_length);
 
@@ -320,7 +305,6 @@ int mlx5vf_cmd_load_vhca_state(struct pci_dev *pdev, u16 vhca_id,
 err_reg:
 	mlx5_core_dealloc_pd(mdev, pdn);
 end:
-	mlx5_vf_put_core_dev(mdev);
 	mutex_unlock(&migf->lock);
 	return err;
 }
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index f47174eab4b8..3246c73395bc 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -43,16 +43,15 @@ struct mlx5vf_pci_core_device {
 	u8 mdev_detach:1;
 };
 
-int mlx5vf_cmd_suspend_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod);
-int mlx5vf_cmd_resume_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod);
-int mlx5vf_cmd_query_vhca_migration_state(struct pci_dev *pdev, u16 vhca_id,
+int mlx5vf_cmd_suspend_vhca(struct mlx5vf_pci_core_device *mvdev, u16 op_mod);
+int mlx5vf_cmd_resume_vhca(struct mlx5vf_pci_core_device *mvdev, u16 op_mod);
+int mlx5vf_cmd_query_vhca_migration_state(struct mlx5vf_pci_core_device *mvdev,
 					  size_t *state_size);
-int mlx5vf_cmd_get_vhca_id(struct pci_dev *pdev, u16 function_id, u16 *vhca_id);
 bool mlx5vf_cmd_is_migratable(struct mlx5vf_pci_core_device *mvdev);
 void mlx5vf_cmd_remove_migratable(struct mlx5vf_pci_core_device *mvdev);
-int mlx5vf_cmd_save_vhca_state(struct pci_dev *pdev, u16 vhca_id,
+int mlx5vf_cmd_save_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 			       struct mlx5_vf_migration_file *migf);
-int mlx5vf_cmd_load_vhca_state(struct pci_dev *pdev, u16 vhca_id,
+int mlx5vf_cmd_load_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 			       struct mlx5_vf_migration_file *migf);
 void mlx5vf_state_mutex_unlock(struct mlx5vf_pci_core_device *mvdev);
 #endif /* MLX5_VFIO_CMD_H */
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index 445c516d38d9..f9793a627c24 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -208,8 +208,8 @@ mlx5vf_pci_save_device_data(struct mlx5vf_pci_core_device *mvdev)
 	stream_open(migf->filp->f_inode, migf->filp);
 	mutex_init(&migf->lock);
 
-	ret = mlx5vf_cmd_query_vhca_migration_state(
-		mvdev->core_device.pdev, mvdev->vhca_id, &migf->total_length);
+	ret = mlx5vf_cmd_query_vhca_migration_state(mvdev,
+						    &migf->total_length);
 	if (ret)
 		goto out_free;
 
@@ -218,8 +218,7 @@ mlx5vf_pci_save_device_data(struct mlx5vf_pci_core_device *mvdev)
 	if (ret)
 		goto out_free;
 
-	ret = mlx5vf_cmd_save_vhca_state(mvdev->core_device.pdev,
-					 mvdev->vhca_id, migf);
+	ret = mlx5vf_cmd_save_vhca_state(mvdev, migf);
 	if (ret)
 		goto out_free;
 	return migf;
@@ -346,8 +345,7 @@ mlx5vf_pci_step_device_state_locked(struct mlx5vf_pci_core_device *mvdev,
 	int ret;
 
 	if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && new == VFIO_DEVICE_STATE_STOP) {
-		ret = mlx5vf_cmd_suspend_vhca(
-			mvdev->core_device.pdev, mvdev->vhca_id,
+		ret = mlx5vf_cmd_suspend_vhca(mvdev,
 			MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_RESPONDER);
 		if (ret)
 			return ERR_PTR(ret);
@@ -355,8 +353,7 @@ mlx5vf_pci_step_device_state_locked(struct mlx5vf_pci_core_device *mvdev,
 	}
 
 	if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_RUNNING_P2P) {
-		ret = mlx5vf_cmd_resume_vhca(
-			mvdev->core_device.pdev, mvdev->vhca_id,
+		ret = mlx5vf_cmd_resume_vhca(mvdev,
 			MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_RESPONDER);
 		if (ret)
 			return ERR_PTR(ret);
@@ -364,8 +361,7 @@ mlx5vf_pci_step_device_state_locked(struct mlx5vf_pci_core_device *mvdev,
 	}
 
 	if (cur == VFIO_DEVICE_STATE_RUNNING && new == VFIO_DEVICE_STATE_RUNNING_P2P) {
-		ret = mlx5vf_cmd_suspend_vhca(
-			mvdev->core_device.pdev, mvdev->vhca_id,
+		ret = mlx5vf_cmd_suspend_vhca(mvdev,
 			MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_INITIATOR);
 		if (ret)
 			return ERR_PTR(ret);
@@ -373,8 +369,7 @@ mlx5vf_pci_step_device_state_locked(struct mlx5vf_pci_core_device *mvdev,
 	}
 
 	if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && new == VFIO_DEVICE_STATE_RUNNING) {
-		ret = mlx5vf_cmd_resume_vhca(
-			mvdev->core_device.pdev, mvdev->vhca_id,
+		ret = mlx5vf_cmd_resume_vhca(mvdev,
 			MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_INITIATOR);
 		if (ret)
 			return ERR_PTR(ret);
@@ -409,8 +404,7 @@ mlx5vf_pci_step_device_state_locked(struct mlx5vf_pci_core_device *mvdev,
 	}
 
 	if (cur == VFIO_DEVICE_STATE_RESUMING && new == VFIO_DEVICE_STATE_STOP) {
-		ret = mlx5vf_cmd_load_vhca_state(mvdev->core_device.pdev,
-						 mvdev->vhca_id,
+		ret = mlx5vf_cmd_load_vhca_state(mvdev,
 						 mvdev->resuming_migf);
 		if (ret)
 			return ERR_PTR(ret);
@@ -517,34 +511,16 @@ static int mlx5vf_pci_open_device(struct vfio_device *core_vdev)
 	struct mlx5vf_pci_core_device *mvdev = container_of(
 		core_vdev, struct mlx5vf_pci_core_device, core_device.vdev);
 	struct vfio_pci_core_device *vdev = &mvdev->core_device;
-	int vf_id;
 	int ret;
 
 	ret = vfio_pci_core_enable(vdev);
 	if (ret)
 		return ret;
 
-	if (!mvdev->migrate_cap) {
-		vfio_pci_core_finish_enable(vdev);
-		return 0;
-	}
-
-	vf_id = pci_iov_vf_id(vdev->pdev);
-	if (vf_id < 0) {
-		ret = vf_id;
-		goto out_disable;
-	}
-
-	ret = mlx5vf_cmd_get_vhca_id(vdev->pdev, vf_id + 1, &mvdev->vhca_id);
-	if (ret)
-		goto out_disable;
-
-	mvdev->mig_state = VFIO_DEVICE_STATE_RUNNING;
+	if (mvdev->migrate_cap)
+		mvdev->mig_state = VFIO_DEVICE_STATE_RUNNING;
 	vfio_pci_core_finish_enable(vdev);
 	return 0;
-out_disable:
-	vfio_pci_core_disable(vdev);
-	return ret;
 }
 
 static void mlx5vf_pci_close_device(struct vfio_device *core_vdev)
-- 
2.18.1


^ permalink raw reply related

* Re: [PATCH net 1/7] secure_seq: return the full 64-bit of the siphash
From: kernel test robot @ 2022-04-27  9:56 UTC (permalink / raw)
  To: Willy Tarreau, netdev
  Cc: kbuild-all, Jakub Kicinski, Eric Dumazet, Moshe Kol, Yossi Gilad,
	Amit Klein, linux-kernel, Willy Tarreau, Jason A . Donenfeld
In-Reply-To: <20220427065233.2075-2-w@1wt.eu>

Hi Willy,

I love your patch! Yet something to improve:

[auto build test ERROR on net/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Willy-Tarreau/insufficient-TCP-source-port-randomness/20220427-145651
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 71cffebf6358a7f5031f5b208bbdc1cb4db6e539
config: i386-randconfig-r026-20220425 (https://download.01.org/0day-ci/archive/20220427/202204271705.VrWNPv7n-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.2.0-20) 11.2.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/01b26e522b598adf346b809075880feab3dcdc08
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Willy-Tarreau/insufficient-TCP-source-port-randomness/20220427-145651
        git checkout 01b26e522b598adf346b809075880feab3dcdc08
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   ld: net/ipv4/inet_hashtables.o: in function `__inet_hash_connect':
>> inet_hashtables.c:(.text+0x187d): undefined reference to `__umoddi3'

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply

* [PATCH mlx5-next 1/5] vfio/mlx5: Reorganize the VF is migratable code
From: Yishai Hadas @ 2022-04-27  9:31 UTC (permalink / raw)
  To: alex.williamson, jgg, saeedm
  Cc: kvm, netdev, kuba, leonro, yishaih, maorg, cohuck
In-Reply-To: <20220427093120.161402-1-yishaih@nvidia.com>

Reorganize the VF is migratable code to be in a separate function, next
patches from the series may use this.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vfio/pci/mlx5/cmd.c  | 18 ++++++++++++++++++
 drivers/vfio/pci/mlx5/cmd.h  |  1 +
 drivers/vfio/pci/mlx5/main.c | 22 +++++++---------------
 3 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index 5c9f9218cc1d..d608b8167f58 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -71,6 +71,24 @@ int mlx5vf_cmd_query_vhca_migration_state(struct pci_dev *pdev, u16 vhca_id,
 	return ret;
 }
 
+bool mlx5vf_cmd_is_migratable(struct pci_dev *pdev)
+{
+	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
+	bool migratable = false;
+
+	if (!mdev)
+		return false;
+
+	if (!MLX5_CAP_GEN(mdev, migration))
+		goto end;
+
+	migratable = true;
+
+end:
+	mlx5_vf_put_core_dev(mdev);
+	return migratable;
+}
+
 int mlx5vf_cmd_get_vhca_id(struct pci_dev *pdev, u16 function_id, u16 *vhca_id)
 {
 	struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev);
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index 1392a11a9cc0..2da6a1c0ec5c 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -29,6 +29,7 @@ int mlx5vf_cmd_resume_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod);
 int mlx5vf_cmd_query_vhca_migration_state(struct pci_dev *pdev, u16 vhca_id,
 					  size_t *state_size);
 int mlx5vf_cmd_get_vhca_id(struct pci_dev *pdev, u16 function_id, u16 *vhca_id);
+bool mlx5vf_cmd_is_migratable(struct pci_dev *pdev);
 int mlx5vf_cmd_save_vhca_state(struct pci_dev *pdev, u16 vhca_id,
 			       struct mlx5_vf_migration_file *migf);
 int mlx5vf_cmd_load_vhca_state(struct pci_dev *pdev, u16 vhca_id,
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index bbec5d288fee..2578f61eaeae 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -597,21 +597,13 @@ static int mlx5vf_pci_probe(struct pci_dev *pdev,
 		return -ENOMEM;
 	vfio_pci_core_init_device(&mvdev->core_device, pdev, &mlx5vf_pci_ops);
 
-	if (pdev->is_virtfn) {
-		struct mlx5_core_dev *mdev =
-			mlx5_vf_get_core_dev(pdev);
-
-		if (mdev) {
-			if (MLX5_CAP_GEN(mdev, migration)) {
-				mvdev->migrate_cap = 1;
-				mvdev->core_device.vdev.migration_flags =
-					VFIO_MIGRATION_STOP_COPY |
-					VFIO_MIGRATION_P2P;
-				mutex_init(&mvdev->state_mutex);
-				spin_lock_init(&mvdev->reset_lock);
-			}
-			mlx5_vf_put_core_dev(mdev);
-		}
+	if (pdev->is_virtfn && mlx5vf_cmd_is_migratable(pdev)) {
+		mvdev->migrate_cap = 1;
+		mvdev->core_device.vdev.migration_flags =
+			VFIO_MIGRATION_STOP_COPY |
+			VFIO_MIGRATION_P2P;
+		mutex_init(&mvdev->state_mutex);
+		spin_lock_init(&mvdev->reset_lock);
 	}
 
 	ret = vfio_pci_core_register_device(&mvdev->core_device);
-- 
2.18.1


^ permalink raw reply related

* [PATCH mlx5-next 2/5] net/mlx5: Expose mlx5_sriov_blocking_notifier_register /  unregister APIs
From: Yishai Hadas @ 2022-04-27  9:31 UTC (permalink / raw)
  To: alex.williamson, jgg, saeedm
  Cc: kvm, netdev, kuba, leonro, yishaih, maorg, cohuck
In-Reply-To: <20220427093120.161402-1-yishaih@nvidia.com>

Expose mlx5_sriov_blocking_notifier_register / unregister APIs to let a
VF register to be notified for its enablement / disablement by the PF.

Upon VF probe it will call mlx5_sriov_blocking_notifier_register() with
its notifier block and upon VF remove it will call
mlx5_sriov_blocking_notifier_unregister() to drop its registration.

This can give a VF the ability to clean some resources upon disable
before that the command interface goes down and on the other hand sets
some stuff before that it's enabled.

This may be used by a VF which is migration capable in few cases.(e.g.
PF load/unload upon an health recovery).

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/sriov.c   | 65 ++++++++++++++++++-
 include/linux/mlx5/driver.h                   | 12 ++++
 2 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
index 887ee0f729d1..2935614f6fa9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
@@ -87,6 +87,11 @@ static int mlx5_device_enable_sriov(struct mlx5_core_dev *dev, int num_vfs)
 enable_vfs_hca:
 	num_msix_count = mlx5_get_default_msix_vec_count(dev, num_vfs);
 	for (vf = 0; vf < num_vfs; vf++) {
+		/* Notify the VF before its enablement to let it set
+		 * some stuff.
+		 */
+		blocking_notifier_call_chain(&sriov->vfs_ctx[vf].notifier,
+					     MLX5_PF_NOTIFY_ENABLE_VF, dev);
 		err = mlx5_core_enable_hca(dev, vf + 1);
 		if (err) {
 			mlx5_core_warn(dev, "failed to enable VF %d (%d)\n", vf, err);
@@ -127,6 +132,11 @@ mlx5_device_disable_sriov(struct mlx5_core_dev *dev, int num_vfs, bool clear_vf)
 	for (vf = num_vfs - 1; vf >= 0; vf--) {
 		if (!sriov->vfs_ctx[vf].enabled)
 			continue;
+		/* Notify the VF before its disablement to let it clean
+		 * some resources.
+		 */
+		blocking_notifier_call_chain(&sriov->vfs_ctx[vf].notifier,
+					     MLX5_PF_NOTIFY_DISABLE_VF, dev);
 		err = mlx5_core_disable_hca(dev, vf + 1);
 		if (err) {
 			mlx5_core_warn(dev, "failed to disable VF %d\n", vf);
@@ -257,7 +267,7 @@ int mlx5_sriov_init(struct mlx5_core_dev *dev)
 {
 	struct mlx5_core_sriov *sriov = &dev->priv.sriov;
 	struct pci_dev *pdev = dev->pdev;
-	int total_vfs;
+	int total_vfs, i;
 
 	if (!mlx5_core_is_pf(dev))
 		return 0;
@@ -269,6 +279,9 @@ int mlx5_sriov_init(struct mlx5_core_dev *dev)
 	if (!sriov->vfs_ctx)
 		return -ENOMEM;
 
+	for (i = 0; i < total_vfs; i++)
+		BLOCKING_INIT_NOTIFIER_HEAD(&sriov->vfs_ctx[i].notifier);
+
 	return 0;
 }
 
@@ -281,3 +294,53 @@ void mlx5_sriov_cleanup(struct mlx5_core_dev *dev)
 
 	kfree(sriov->vfs_ctx);
 }
+
+/**
+ * mlx5_sriov_blocking_notifier_unregister - Unregister a VF from
+ * a notification block chain.
+ *
+ * @mdev: The mlx5 core device.
+ * @vf_id: The VF id.
+ * @nb: The notifier block to be unregistered.
+ */
+void mlx5_sriov_blocking_notifier_unregister(struct mlx5_core_dev *mdev,
+					     int vf_id,
+					     struct notifier_block *nb)
+{
+	struct mlx5_vf_context *vfs_ctx;
+	struct mlx5_core_sriov *sriov;
+
+	sriov = &mdev->priv.sriov;
+	if (WARN_ON(vf_id < 0 || vf_id >= sriov->num_vfs))
+		return;
+
+	vfs_ctx = &sriov->vfs_ctx[vf_id];
+	blocking_notifier_chain_unregister(&vfs_ctx->notifier, nb);
+}
+EXPORT_SYMBOL(mlx5_sriov_blocking_notifier_unregister);
+
+/**
+ * mlx5_sriov_blocking_notifier_register - Register a VF notification
+ * block chain.
+ *
+ * @mdev: The mlx5 core device.
+ * @vf_id: The VF id.
+ * @nb: The notifier block to be called upon the VF events.
+ *
+ * Returns 0 on success or an error code.
+ */
+int mlx5_sriov_blocking_notifier_register(struct mlx5_core_dev *mdev,
+					  int vf_id,
+					  struct notifier_block *nb)
+{
+	struct mlx5_vf_context *vfs_ctx;
+	struct mlx5_core_sriov *sriov;
+
+	sriov = &mdev->priv.sriov;
+	if (vf_id < 0 || vf_id >= sriov->num_vfs)
+		return -EINVAL;
+
+	vfs_ctx = &sriov->vfs_ctx[vf_id];
+	return blocking_notifier_chain_register(&vfs_ctx->notifier, nb);
+}
+EXPORT_SYMBOL(mlx5_sriov_blocking_notifier_register);
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 9424503eb8d3..3d1594bad4ec 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -445,6 +445,11 @@ struct mlx5_qp_table {
 	struct radix_tree_root	tree;
 };
 
+enum {
+	MLX5_PF_NOTIFY_DISABLE_VF,
+	MLX5_PF_NOTIFY_ENABLE_VF,
+};
+
 struct mlx5_vf_context {
 	int	enabled;
 	u64	port_guid;
@@ -455,6 +460,7 @@ struct mlx5_vf_context {
 	u8	port_guid_valid:1;
 	u8	node_guid_valid:1;
 	enum port_state_policy	policy;
+	struct blocking_notifier_head notifier;
 };
 
 struct mlx5_core_sriov {
@@ -1155,6 +1161,12 @@ int mlx5_dm_sw_icm_dealloc(struct mlx5_core_dev *dev, enum mlx5_sw_icm_type type
 struct mlx5_core_dev *mlx5_vf_get_core_dev(struct pci_dev *pdev);
 void mlx5_vf_put_core_dev(struct mlx5_core_dev *mdev);
 
+int mlx5_sriov_blocking_notifier_register(struct mlx5_core_dev *mdev,
+					  int vf_id,
+					  struct notifier_block *nb);
+void mlx5_sriov_blocking_notifier_unregister(struct mlx5_core_dev *mdev,
+					     int vf_id,
+					     struct notifier_block *nb);
 #ifdef CONFIG_MLX5_CORE_IPOIB
 struct net_device *mlx5_rdma_netdev_alloc(struct mlx5_core_dev *mdev,
 					  struct ib_device *ibdev,
-- 
2.18.1


^ permalink raw reply related

* Re: [PATCH net 1/7] secure_seq: return the full 64-bit of the siphash
From: Willy Tarreau @ 2022-04-27 10:07 UTC (permalink / raw)
  To: kernel test robot
  Cc: netdev, kbuild-all, Jakub Kicinski, Eric Dumazet, Moshe Kol,
	Yossi Gilad, Amit Klein, linux-kernel, Jason A . Donenfeld
In-Reply-To: <202204271705.VrWNPv7n-lkp@intel.com>

On Wed, Apr 27, 2022 at 05:56:41PM +0800, kernel test robot wrote:
> Hi Willy,
> 
> I love your patch! Yet something to improve:
> 
> [auto build test ERROR on net/master]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Willy-Tarreau/insufficient-TCP-source-port-randomness/20220427-145651
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 71cffebf6358a7f5031f5b208bbdc1cb4db6e539
> config: i386-randconfig-r026-20220425 (https://download.01.org/0day-ci/archive/20220427/202204271705.VrWNPv7n-lkp@intel.com/config)
> compiler: gcc-11 (Debian 11.2.0-20) 11.2.0
> reproduce (this is a W=1 build):
>         # https://github.com/intel-lab-lkp/linux/commit/01b26e522b598adf346b809075880feab3dcdc08
>         git remote add linux-review https://github.com/intel-lab-lkp/linux
>         git fetch --no-tags linux-review Willy-Tarreau/insufficient-TCP-source-port-randomness/20220427-145651
>         git checkout 01b26e522b598adf346b809075880feab3dcdc08
>         # save the config file
>         mkdir build_dir && cp config build_dir/.config
>         make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
> 
> All errors (new ones prefixed by >>):
> 
>    ld: net/ipv4/inet_hashtables.o: in function `__inet_hash_connect':
> >> inet_hashtables.c:(.text+0x187d): undefined reference to `__umoddi3'

Argh! indeed, we spoke about using div_u64_rem() at the beginning and
that one vanished over time. Will respin it.

Thanks,
Willy

^ permalink raw reply

* Re: [PATCH net-next 0/7] mptcp: Timeout for MP_FAIL response
From: patchwork-bot+netdevbpf @ 2022-04-27  9:50 UTC (permalink / raw)
  To: Mat Martineau; +Cc: netdev, davem, kuba, pabeni, matthieu.baerts, mptcp
In-Reply-To: <20220426215717.129506-1-mathew.j.martineau@linux.intel.com>

Hello:

This series was applied to netdev/net-next.git (master)
by David S. Miller <davem@davemloft.net>:

On Tue, 26 Apr 2022 14:57:10 -0700 you wrote:
> When one peer sends an infinite mapping to coordinate fallback from
> MPTCP to regular TCP, the other peer is expected to send a packet with
> the MPTCP MP_FAIL option to acknowledge the infinite mapping. Rather
> than leave the connection in some half-fallback state, this series adds
> a timeout after which the infinite mapping sender will reset the
> connection.
> 
> [...]

Here is the summary with links:
  - [net-next,1/7] selftests: mptcp: add infinite map testcase
    https://git.kernel.org/netdev/net-next/c/b6e074e171bc
  - [net-next,2/7] mptcp: use mptcp_stop_timer
    https://git.kernel.org/netdev/net-next/c/bcf3cf93f645
  - [net-next,3/7] mptcp: add data lock for sk timers
    https://git.kernel.org/netdev/net-next/c/4293248c6704
  - [net-next,4/7] mptcp: add MP_FAIL response support
    https://git.kernel.org/netdev/net-next/c/9c81be0dbc89
  - [net-next,5/7] mptcp: reset subflow when MP_FAIL doesn't respond
    https://git.kernel.org/netdev/net-next/c/49fa1919d6bc
  - [net-next,6/7] selftests: mptcp: check MP_FAIL response mibs
    https://git.kernel.org/netdev/net-next/c/1f7d325f7d49
  - [net-next,7/7] selftests: mptcp: print extra msg in chk_csum_nr
    https://git.kernel.org/netdev/net-next/c/53f368bfff31

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* [PATCH mlx5-next 5/5] vfio/mlx5: Run the SAVE state command in an async mode
From: Yishai Hadas @ 2022-04-27  9:31 UTC (permalink / raw)
  To: alex.williamson, jgg, saeedm
  Cc: kvm, netdev, kuba, leonro, yishaih, maorg, cohuck
In-Reply-To: <20220427093120.161402-1-yishaih@nvidia.com>

Use the PF asynchronous command mode for the SAVE state command.

This enables returning earlier to user space upon issuing successfully
the command and improve latency by let things run in parallel.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/pci/mlx5/cmd.c  | 72 ++++++++++++++++++++++++++++++++++--
 drivers/vfio/pci/mlx5/cmd.h  | 17 +++++++++
 drivers/vfio/pci/mlx5/main.c | 54 ++++++++++++++++++++++++---
 3 files changed, 134 insertions(+), 9 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index ba06b797d630..fad648d0c088 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -78,6 +78,7 @@ static int mlx5fv_vf_event(struct notifier_block *nb,
 		mvdev->mdev_detach = false;
 		break;
 	case MLX5_PF_NOTIFY_DISABLE_VF:
+		mlx5vf_disable_fds(mvdev);
 		mvdev->mdev_detach = true;
 		break;
 	default:
@@ -209,11 +210,56 @@ static int _create_state_mkey(struct mlx5_core_dev *mdev, u32 pdn,
 	return err;
 }
 
+void mlx5vf_mig_file_cleanup_cb(struct work_struct *_work)
+{
+	struct mlx5vf_async_data *async_data = container_of(_work,
+		struct mlx5vf_async_data, work);
+	struct mlx5_vf_migration_file *migf = container_of(async_data,
+		struct mlx5_vf_migration_file, async_data);
+	struct mlx5_core_dev *mdev = migf->mvdev->mdev;
+
+	mutex_lock(&migf->lock);
+	if (async_data->status) {
+		migf->is_err = true;
+		wake_up_interruptible(&migf->poll_wait);
+	}
+	mutex_unlock(&migf->lock);
+
+	mlx5_core_destroy_mkey(mdev, async_data->mkey);
+	dma_unmap_sgtable(mdev->device, &migf->table.sgt, DMA_FROM_DEVICE, 0);
+	mlx5_core_dealloc_pd(mdev, async_data->pdn);
+	kvfree(async_data->out);
+	fput(migf->filp);
+}
+
+static void mlx5vf_save_callback(int status, struct mlx5_async_work *context)
+{
+	struct mlx5vf_async_data *async_data = container_of(context,
+			struct mlx5vf_async_data, cb_work);
+	struct mlx5_vf_migration_file *migf = container_of(async_data,
+			struct mlx5_vf_migration_file, async_data);
+
+	if (!status) {
+		WRITE_ONCE(migf->total_length,
+			   MLX5_GET(save_vhca_state_out, async_data->out,
+				    actual_image_size));
+		wake_up_interruptible(&migf->poll_wait);
+	}
+
+	/*
+	 * The error and the cleanup flows can't run from an
+	 * interrupt context
+	 */
+	async_data->status = status;
+	queue_work(migf->mvdev->cb_wq, &async_data->work);
+}
+
 int mlx5vf_cmd_save_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 			       struct mlx5_vf_migration_file *migf)
 {
-	u32 out[MLX5_ST_SZ_DW(save_vhca_state_out)] = {};
+	u32 out_size = MLX5_ST_SZ_BYTES(save_vhca_state_out);
 	u32 in[MLX5_ST_SZ_DW(save_vhca_state_in)] = {};
+	struct mlx5vf_async_data *async_data;
 	struct mlx5_core_dev *mdev;
 	u32 pdn, mkey;
 	int err;
@@ -243,13 +289,31 @@ int mlx5vf_cmd_save_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 	MLX5_SET(save_vhca_state_in, in, mkey, mkey);
 	MLX5_SET(save_vhca_state_in, in, size, migf->total_length);
 
-	err = mlx5_cmd_exec_inout(mdev, save_vhca_state, in, out);
+	async_data = &migf->async_data;
+	async_data->out = kvzalloc(out_size, GFP_KERNEL);
+	if (!async_data->out) {
+		err = -ENOMEM;
+		goto err_out;
+	}
+
+	/* no data exists till the callback comes back */
+	migf->total_length = 0;
+	get_file(migf->filp);
+	async_data->mkey = mkey;
+	async_data->pdn = pdn;
+	err = mlx5_cmd_exec_cb(&migf->async_ctx, in, sizeof(in),
+			       async_data->out,
+			       out_size, mlx5vf_save_callback,
+			       &async_data->cb_work);
 	if (err)
 		goto err_exec;
 
-	migf->total_length = MLX5_GET(save_vhca_state_out, out,
-				      actual_image_size);
+	return 0;
+
 err_exec:
+	fput(migf->filp);
+	kvfree(async_data->out);
+err_out:
 	mlx5_core_destroy_mkey(mdev, mkey);
 err_create_mkey:
 	dma_unmap_sgtable(mdev->device, &migf->table.sgt, DMA_FROM_DEVICE, 0);
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index 3246c73395bc..f8f273faa5a8 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -10,10 +10,20 @@
 #include <linux/vfio_pci_core.h>
 #include <linux/mlx5/driver.h>
 
+struct mlx5vf_async_data {
+	struct mlx5_async_work cb_work;
+	struct work_struct work;
+	int status;
+	u32 pdn;
+	u32 mkey;
+	void *out;
+};
+
 struct mlx5_vf_migration_file {
 	struct file *filp;
 	struct mutex lock;
 	bool disabled;
+	u8 is_err:1;
 
 	struct sg_append_table table;
 	size_t total_length;
@@ -23,6 +33,10 @@ struct mlx5_vf_migration_file {
 	struct scatterlist *last_offset_sg;
 	unsigned int sg_last_entry;
 	unsigned long last_offset;
+	struct mlx5vf_pci_core_device *mvdev;
+	wait_queue_head_t poll_wait;
+	struct mlx5_async_ctx async_ctx;
+	struct mlx5vf_async_data async_data;
 };
 
 struct mlx5vf_pci_core_device {
@@ -38,6 +52,7 @@ struct mlx5vf_pci_core_device {
 	spinlock_t reset_lock;
 	struct mlx5_vf_migration_file *resuming_migf;
 	struct mlx5_vf_migration_file *saving_migf;
+	struct workqueue_struct *cb_wq;
 	struct notifier_block nb;
 	struct mlx5_core_dev *mdev;
 	u8 mdev_detach:1;
@@ -54,4 +69,6 @@ int mlx5vf_cmd_save_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 int mlx5vf_cmd_load_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 			       struct mlx5_vf_migration_file *migf);
 void mlx5vf_state_mutex_unlock(struct mlx5vf_pci_core_device *mvdev);
+void mlx5vf_disable_fds(struct mlx5vf_pci_core_device *mvdev);
+void mlx5vf_mig_file_cleanup_cb(struct work_struct *_work);
 #endif /* MLX5_VFIO_CMD_H */
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index f9793a627c24..6df7ad2dfa6d 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -134,12 +134,22 @@ static ssize_t mlx5vf_save_read(struct file *filp, char __user *buf, size_t len,
 		return -ESPIPE;
 	pos = &filp->f_pos;
 
+	if (!(filp->f_flags & O_NONBLOCK)) {
+		if (wait_event_interruptible(migf->poll_wait,
+			     READ_ONCE(migf->total_length) || migf->is_err))
+			return -ERESTARTSYS;
+	}
+
 	mutex_lock(&migf->lock);
+	if ((filp->f_flags & O_NONBLOCK) && !READ_ONCE(migf->total_length)) {
+		done = -EAGAIN;
+		goto out_unlock;
+	}
 	if (*pos > migf->total_length) {
 		done = -EINVAL;
 		goto out_unlock;
 	}
-	if (migf->disabled) {
+	if (migf->disabled || migf->is_err) {
 		done = -ENODEV;
 		goto out_unlock;
 	}
@@ -179,9 +189,28 @@ static ssize_t mlx5vf_save_read(struct file *filp, char __user *buf, size_t len,
 	return done;
 }
 
+static __poll_t mlx5vf_save_poll(struct file *filp,
+				 struct poll_table_struct *wait)
+{
+	struct mlx5_vf_migration_file *migf = filp->private_data;
+	__poll_t pollflags = 0;
+
+	poll_wait(filp, &migf->poll_wait, wait);
+
+	mutex_lock(&migf->lock);
+	if (migf->disabled || migf->is_err)
+		pollflags = EPOLLIN | EPOLLRDNORM | EPOLLRDHUP;
+	else if (READ_ONCE(migf->total_length))
+		pollflags = EPOLLIN | EPOLLRDNORM;
+	mutex_unlock(&migf->lock);
+
+	return pollflags;
+}
+
 static const struct file_operations mlx5vf_save_fops = {
 	.owner = THIS_MODULE,
 	.read = mlx5vf_save_read,
+	.poll = mlx5vf_save_poll,
 	.release = mlx5vf_release_file,
 	.llseek = no_llseek,
 };
@@ -207,7 +236,9 @@ mlx5vf_pci_save_device_data(struct mlx5vf_pci_core_device *mvdev)
 
 	stream_open(migf->filp->f_inode, migf->filp);
 	mutex_init(&migf->lock);
-
+	init_waitqueue_head(&migf->poll_wait);
+	mlx5_cmd_init_async_ctx(mvdev->mdev, &migf->async_ctx);
+	INIT_WORK(&migf->async_data.work, mlx5vf_mig_file_cleanup_cb);
 	ret = mlx5vf_cmd_query_vhca_migration_state(mvdev,
 						    &migf->total_length);
 	if (ret)
@@ -218,6 +249,7 @@ mlx5vf_pci_save_device_data(struct mlx5vf_pci_core_device *mvdev)
 	if (ret)
 		goto out_free;
 
+	migf->mvdev = mvdev;
 	ret = mlx5vf_cmd_save_vhca_state(mvdev, migf);
 	if (ret)
 		goto out_free;
@@ -323,7 +355,7 @@ mlx5vf_pci_resume_device_data(struct mlx5vf_pci_core_device *mvdev)
 	return migf;
 }
 
-static void mlx5vf_disable_fds(struct mlx5vf_pci_core_device *mvdev)
+void mlx5vf_disable_fds(struct mlx5vf_pci_core_device *mvdev)
 {
 	if (mvdev->resuming_migf) {
 		mlx5vf_disable_fd(mvdev->resuming_migf);
@@ -331,6 +363,8 @@ static void mlx5vf_disable_fds(struct mlx5vf_pci_core_device *mvdev)
 		mvdev->resuming_migf = NULL;
 	}
 	if (mvdev->saving_migf) {
+		mlx5_cmd_cleanup_async_ctx(&mvdev->saving_migf->async_ctx);
+		cancel_work_sync(&mvdev->saving_migf->async_data.work);
 		mlx5vf_disable_fd(mvdev->saving_migf);
 		fput(mvdev->saving_migf->filp);
 		mvdev->saving_migf = NULL;
@@ -560,6 +594,11 @@ static int mlx5vf_pci_probe(struct pci_dev *pdev,
 
 	if (pdev->is_virtfn && mlx5vf_cmd_is_migratable(mvdev)) {
 		mvdev->migrate_cap = 1;
+		mvdev->cb_wq = alloc_ordered_workqueue("mlx5vf_wq", 0);
+		if (!mvdev->cb_wq) {
+			ret = -ENOMEM;
+			goto out_free;
+		}
 		mvdev->core_device.vdev.migration_flags =
 			VFIO_MIGRATION_STOP_COPY |
 			VFIO_MIGRATION_P2P;
@@ -573,8 +612,11 @@ static int mlx5vf_pci_probe(struct pci_dev *pdev,
 	return 0;
 
 out_free:
-	if (mvdev->migrate_cap)
+	if (mvdev->migrate_cap) {
 		mlx5vf_cmd_remove_migratable(mvdev);
+		if (mvdev->cb_wq)
+			destroy_workqueue(mvdev->cb_wq);
+	}
 	vfio_pci_core_uninit_device(&mvdev->core_device);
 	kfree(mvdev);
 	return ret;
@@ -585,8 +627,10 @@ static void mlx5vf_pci_remove(struct pci_dev *pdev)
 	struct mlx5vf_pci_core_device *mvdev = dev_get_drvdata(&pdev->dev);
 
 	vfio_pci_core_unregister_device(&mvdev->core_device);
-	if (mvdev->migrate_cap)
+	if (mvdev->migrate_cap) {
 		mlx5vf_cmd_remove_migratable(mvdev);
+		destroy_workqueue(mvdev->cb_wq);
+	}
 	vfio_pci_core_uninit_device(&mvdev->core_device);
 	kfree(mvdev);
 }
-- 
2.18.1


^ permalink raw reply related

* [PATCH mlx5-next 0/5] Improve mlx5 live migration driver
From: Yishai Hadas @ 2022-04-27  9:31 UTC (permalink / raw)
  To: alex.williamson, jgg, saeedm
  Cc: kvm, netdev, kuba, leonro, yishaih, maorg, cohuck

This series improves mlx5 live migration driver in few aspects as of
below.

Refactor to enable running migration commands in parallel over the PF
command interface.

To achieve that we exposed from mlx5_core an API to let the VF be
notified before that the PF command interface goes down/up. (e.g. PF
reload upon health recovery).

Once having the above functionality in place mlx5 vfio doesn't need any
more to obtain the global PF lock upon using the command interface but
can rely on the above mechanism to be in sync with the PF.

This can enable parallel VFs migration over the PF command interface
from kernel driver point of view.

In addition,
Moved to use the PF async command mode for the SAVE state command.
This enables returning earlier to user space upon issuing successfully
the command and improve latency by let things run in parallel.

Alex, as this series touches mlx5_core we may need to send this in a
pull request format to VFIO to avoid conflicts before acceptance.

Yishai

Yishai Hadas (5):
  vfio/mlx5: Reorganize the VF is migratable code
  net/mlx5: Expose mlx5_sriov_blocking_notifier_register /  unregister
    APIs
  vfio/mlx5: Manage the VF attach/detach callback from the PF
  vfio/mlx5: Refactor to enable VFs migration in parallel
  vfio/mlx5: Run the SAVE state command in an async mode

 .../net/ethernet/mellanox/mlx5/core/sriov.c   |  65 ++++-
 drivers/vfio/pci/mlx5/cmd.c                   | 229 +++++++++++++-----
 drivers/vfio/pci/mlx5/cmd.h                   |  50 +++-
 drivers/vfio/pci/mlx5/main.c                  | 133 +++++-----
 include/linux/mlx5/driver.h                   |  12 +
 5 files changed, 358 insertions(+), 131 deletions(-)

-- 
2.18.1

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox