Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH v4.3] net: phy: Add driver for Motorcomm yt8521 gigabit ethernet phy
From: Frank @ 2022-08-16 11:17 UTC (permalink / raw)
  To: Peter Geis, Andrew Lunn, Heiner Kallweit, Russell King,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: yinghong.zhang, fei.zhang, hua.sun, netdev, linux-kernel, Frank

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=y, Size: 38468 bytes --]

 Add a driver for the motorcomm yt8521 gigabit ethernet phy. We have verified
 the driver on StarFive VisionFive development board, which is developed by
 Shanghai StarFive Technology Co., Ltd.. On the board, yt8521 gigabit ethernet
 phy works in utp mode, RGMII interface, supports 1000M/100M/10M speeds, and
 wol(magic package).

Signed-off-by: Frank <Frank.Sae@motor-comm.com>
---

patch v4.3:
 Hi Jakub
 Kernel 6.0-rc1 is cut on 2022-08-14, we update new patch for Kernel 6.0-rc1.

patch v4.2:
 fix "=0D" and "=20" which in patch v4.1.
 
patch v4.1:
 Hi Andrew
 Thanks very much and based on your comments we modified the patch v4.1 as below.

> You have interrupt bits defined, you enable the WoL interrupt, but you
> don't have an interrupt handler. PHY interrupts are generally level
> interrupts, so on WoL, don't you end up with an interrupt storm,
> because you don't have anything clearing the WoL interrupt?

 in the driver, only the interrupt for WOL is used according to uplayer WOL
 configuration via .set_wol. If WOL is enalbed, then the WOL interrupt is
 enabled accordingly; you know, for a system support WOL, the WOL interrupt
 will be handled by user's WOL circuit, and it will not cause interrupt storm
 to CPU core system.

 for WOL interrupt, by default, we configure to pulse trigger mode, that is a
 pulse from High to low change, and the low level will hold for 672ms.
 user's WOL circuit will detected such volt level change as a WOL event trigger.

patch v4:
 Hi Andrew,Jakub
 Thanks very much and based on your comments we modified the patch v4 as below.

 We evaluated the Marvell 10G driver and at803x you suggested. The 2 drivers
 implement SFP module attach/detach functions and we think these functions do
 not help yt8521 to do UTP/Fiber register space arbitration which you may
 concern in previous patch.

 Yt8521 can detect utp/fiber media link status automatically. For the case of
 both media are connected, driver arbitrates the priority of the media (by
 default, driver takes fiber as higher priority) and report the media status
 to up layer(MAC).

patch v3:
 Hi Andrew
 Thanks and based on your comments we modified the patch as below.

> It is generally not that simple. Fibre, you probably want 1000BaseX,
> unless the fibre module is actually copper, and then you want
> SGMII. So you need something to talk to the fibre module and ask it
> what it is. That something is phylink. Phylink does not support both
> copper and fibre at the same time for one MAC.

 yes, you said it and for MAC, it does not support copper and Fiber at same time.
 but from Physical Layer, you know, sometimes both Copper and Fiber cables are
 connected. in this case, Phy driver should do arbitration and report to MAC
 which media should be used and Link-up.
 This is the reason that the driver has a "polling mode", and by default, also
 this driver takes fiber as first priority which matches phy chip default behavior.

patch v2:
 Hi Andrew, Russell King, Peter,
 Thanks and based on your comments we modified the patch as below.
 
> So there's only two possible pages that can be used in the extended
>register space?
 
 Yes,there is only two register space (utp and fiber).
 
> > +/* Extended Register's Data Register */
> > +#define YTPHY_PAGE_DATA                                0x1F
>
> These are defined exactly the same way as below. Please reuse code
> where possible.
 
 Yes, code will be reuse, but "YT8511_PAGE" need to be rename like
 "YTPHY_PAGE_DATA",as it is common register for yt phys.

patch v1:
 Add a driver for the motorcomm yt8521 gigabit ethernet phy. We have verified
 the driver on StarFive VisionFive development board, which is developed by
 Shanghai StarFive Technology Co., Ltd.. On the board, yt8521 gigabit ethernet
 phy works in utp mode, RGMII interface, supports 1000M/100M/10M speeds, and
 wol(magic package).

Thanks and BR,
Frank

 MAINTAINERS                 |    1 +
 drivers/net/phy/Kconfig     |    2 +-
 drivers/net/phy/motorcomm.c | 1162 ++++++++++++++++++++++++++++++++++-
 3 files changed, 1162 insertions(+), 3 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 8a5012b..994919b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13780,6 +13780,7 @@ F:	include/uapi/linux/meye.h
 
 MOTORCOMM PHY DRIVER
 M:	Peter Geis <pgwipeout@gmail.com>
+M:	Frank <Frank.Sae@motor-comm.com>
 L:	netdev@vger.kernel.org
 S:	Maintained
 F:	drivers/net/phy/motorcomm.c
diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index c57a026..040c8bf 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -260,7 +260,7 @@ config MOTORCOMM_PHY
 	tristate "Motorcomm PHYs"
 	help
 	  Enables support for Motorcomm network PHYs.
-	  Currently supports the YT8511 gigabit PHY.
+	  Currently supports the YT8511, YT8521 Gigabit Ethernet PHYs.
 
 config NATIONAL_PHY
 	tristate "National Semiconductor PHYs"
diff --git a/drivers/net/phy/motorcomm.c b/drivers/net/phy/motorcomm.c
index 7e6ac2c..bae56f5 100644
--- a/drivers/net/phy/motorcomm.c
+++ b/drivers/net/phy/motorcomm.c
@@ -1,15 +1,102 @@
 // SPDX-License-Identifier: GPL-2.0+
 /*
- * Driver for Motorcomm PHYs
+ * Motorcomm 8511/8521 PHY driver.
  *
  * Author: Peter Geis <pgwipeout@gmail.com>
+ * Author: Frank <Frank.Sae@motor-comm.com>
  */
 
+#include <linux/etherdevice.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/phy.h>
 
 #define PHY_ID_YT8511		0x0000010a
+#define PHY_ID_YT8521				0x0000011A
+
+/* YT8521 Register Overview
+ *	UTP Register space	|	FIBER Register space
+ *  ------------------------------------------------------------
+ * |	UTP MII			|	FIBER MII		|
+ * |	UTP MMD			|				|
+ * |	UTP Extended		|	FIBER Extended		|
+ *  ------------------------------------------------------------
+ * |			Common Extended				|
+ *  ------------------------------------------------------------
+ */
+
+/* 0x10 ~ 0x15 , 0x1E and 0x1F are common MII registers of yt phy */
+
+/* Specific Function Control Register */
+#define YTPHY_SPECIFIC_FUNCTION_CONTROL_REG	0x10
+
+/* 2b00 Manual MDI configuration
+ * 2b01 Manual MDIX configuration
+ * 2b10 Reserved
+ * 2b11 Enable automatic crossover for all modes  *default*
+ */
+#define YTPHY_SFCR_MDI_CROSSOVER_MODE_MASK	(BIT(6) | BIT(5))
+#define YTPHY_SFCR_CROSSOVER_EN			BIT(3)
+#define YTPHY_SFCR_SQE_TEST_EN			BIT(2)
+#define YTPHY_SFCR_POLARITY_REVERSAL_EN		BIT(1)
+#define YTPHY_SFCR_JABBER_DIS			BIT(0)
+
+/* Specific Status Register */
+#define YTPHY_SPECIFIC_STATUS_REG		0x11
+#define YTPHY_SSR_SPEED_MODE_OFFSET		14
+
+#define YTPHY_SSR_SPEED_MODE_MASK		(BIT(15) | BIT(14))
+#define YTPHY_SSR_SPEED_10M			0x0
+#define YTPHY_SSR_SPEED_100M			0x1
+#define YTPHY_SSR_SPEED_1000M			0x2
+#define YTPHY_SSR_DUPLEX_OFFSET			13
+#define YTPHY_SSR_DUPLEX			BIT(13)
+#define YTPHY_SSR_PAGE_RECEIVED			BIT(12)
+#define YTPHY_SSR_SPEED_DUPLEX_RESOLVED		BIT(11)
+#define YTPHY_SSR_LINK				BIT(10)
+#define YTPHY_SSR_MDIX_CROSSOVER		BIT(6)
+#define YTPHY_SSR_DOWNGRADE			BIT(5)
+#define YTPHY_SSR_TRANSMIT_PAUSE		BIT(3)
+#define YTPHY_SSR_RECEIVE_PAUSE			BIT(2)
+#define YTPHY_SSR_POLARITY			BIT(1)
+#define YTPHY_SSR_JABBER			BIT(0)
+
+/* Interrupt enable Register */
+#define YTPHY_INTERRUPT_ENABLE_REG		0x12
+#define YTPHY_IER_WOL				BIT(6)
+
+/* Interrupt Status Register */
+#define YTPHY_INTERRUPT_STATUS_REG		0x13
+#define YTPHY_ISR_AUTONEG_ERR			BIT(15)
+#define YTPHY_ISR_SPEED_CHANGED			BIT(14)
+#define YTPHY_ISR_DUPLEX_CHANGED		BIT(13)
+#define YTPHY_ISR_PAGE_RECEIVED			BIT(12)
+#define YTPHY_ISR_LINK_FAILED			BIT(11)
+#define YTPHY_ISR_LINK_SUCCESSED		BIT(10)
+#define YTPHY_ISR_WOL				BIT(6)
+#define YTPHY_ISR_WIRESPEED_DOWNGRADE		BIT(5)
+#define YTPHY_ISR_SERDES_LINK_FAILED		BIT(3)
+#define YTPHY_ISR_SERDES_LINK_SUCCESSED		BIT(2)
+#define YTPHY_ISR_POLARITY_CHANGED		BIT(1)
+#define YTPHY_ISR_JABBER_HAPPENED		BIT(0)
+
+/* Speed Auto Downgrade Control Register */
+#define YTPHY_SPEED_AUTO_DOWNGRADE_CONTROL_REG	0x14
+#define YTPHY_SADCR_SPEED_DOWNGRADE_EN		BIT(5)
+
+/* If these bits are set to 3, the PHY attempts five times ( 3(set value) +
+ * additional 2) before downgrading, default 0x3
+ */
+#define YTPHY_SADCR_SPEED_RETRY_LIMIT		(0x3 << 2)
+
+/* Rx Error Counter Register */
+#define YTPHY_RX_ERROR_COUNTER_REG		0x15
+
+/* Extended Register's Address Offset Register */
+#define YTPHY_PAGE_SELECT			0x1E
+
+/* Extended Register's Data Register */
+#define YTPHY_PAGE_DATA				0x1F
 
 #define YT8511_PAGE_SELECT	0x1e
 #define YT8511_PAGE		0x1f
@@ -38,6 +125,347 @@
 #define YT8511_DELAY_FE_TX_EN	(0xf << 12)
 #define YT8511_DELAY_FE_TX_DIS	(0x2 << 12)
 
+/* Extended register is different from MMD Register and MII Register.
+ * We can use ytphy_read_ext/ytphy_write_ext/ytphy_modify_ext function to
+ * operate extended register.
+ * Extended Register  start
+ */
+
+/* Phy gmii clock gating Register */
+#define YT8521_CLOCK_GATING_REG			0xC
+#define YT8521_CGR_RX_CLK_EN			BIT(12)
+
+#define YT8521_EXTREG_SLEEP_CONTROL1_REG	0x27
+#define YT8521_ESC1R_SLEEP_SW			BIT(15)
+#define YT8521_ESC1R_PLLON_SLP			BIT(14)
+
+/* Phy fiber Link timer cfg2 Register */
+#define YT8521_LINK_TIMER_CFG2_REG		0xA5
+#define YT8521_LTCR_EN_AUTOSEN			BIT(15)
+
+/* 0xA000, 0xA001, 0xA003 ,and 0xA006 ~ 0xA00A  are common ext registers
+ * of yt8521 phy. There is no need to switch reg space when operating these
+ * registers.
+ */
+
+#define YT8521_REG_SPACE_SELECT_REG		0xA000
+#define YT8521_RSSR_SPACE_MASK			BIT(1)
+#define YT8521_RSSR_FIBER_SPACE			(0x1 << 1)
+#define YT8521_RSSR_UTP_SPACE			(0x0 << 1)
+#define YT8521_RSSR_TO_BE_ARBITRATED		(0xFF)
+
+#define YT8521_CHIP_CONFIG_REG			0xA001
+#define YT8521_CCR_SW_RST			BIT(15)
+
+#define YT8521_CCR_MODE_SEL_MASK		(BIT(2) | BIT(1) | BIT(0))
+#define YT8521_CCR_MODE_UTP_TO_RGMII		0
+#define YT8521_CCR_MODE_FIBER_TO_RGMII		1
+#define YT8521_CCR_MODE_UTP_FIBER_TO_RGMII	2
+#define YT8521_CCR_MODE_UTP_TO_SGMII		3
+#define YT8521_CCR_MODE_SGPHY_TO_RGMAC		4
+#define YT8521_CCR_MODE_SGMAC_TO_RGPHY		5
+#define YT8521_CCR_MODE_UTP_TO_FIBER_AUTO	6
+#define YT8521_CCR_MODE_UTP_TO_FIBER_FORCE	7
+
+/* 3 phy polling modes,poll mode combines utp and fiber mode*/
+#define YT8521_MODE_FIBER			0x1
+#define YT8521_MODE_UTP				0x2
+#define YT8521_MODE_POLL			0x3
+
+#define YT8521_RGMII_CONFIG1_REG		0xA003
+
+/* TX Gig-E Delay is bits 3:0, default 0x1
+ * TX Fast-E Delay is bits 7:4, default 0xf
+ * RX Delay is bits 13:10, default 0x0
+ * Delay = 150ps * N
+ * On = 2250ps, off = 0ps
+ */
+#define YT8521_RC1R_RX_DELAY_MASK		(0xF << 10)
+#define YT8521_RC1R_RX_DELAY_EN			(0xF << 10)
+#define YT8521_RC1R_RX_DELAY_DIS		(0x0 << 10)
+#define YT8521_RC1R_FE_TX_DELAY_MASK		(0xF << 4)
+#define YT8521_RC1R_FE_TX_DELAY_EN		(0xF << 4)
+#define YT8521_RC1R_FE_TX_DELAY_DIS		(0x0 << 4)
+#define YT8521_RC1R_GE_TX_DELAY_MASK		(0xF << 0)
+#define YT8521_RC1R_GE_TX_DELAY_EN		(0xF << 0)
+#define YT8521_RC1R_GE_TX_DELAY_DIS		(0x0 << 0)
+
+#define YTPHY_MISC_CONFIG_REG			0xA006
+#define YTPHY_MCR_FIBER_SPEED_MASK		BIT(0)
+#define YTPHY_MCR_FIBER_1000BX			(0x1 << 0)
+#define YTPHY_MCR_FIBER_100FX			(0x0 << 0)
+
+/* WOL MAC ADDR: MACADDR2(highest), MACADDR1(middle), MACADDR0(lowest) */
+#define YTPHY_WOL_MACADDR2_REG			0xA007
+#define YTPHY_WOL_MACADDR1_REG			0xA008
+#define YTPHY_WOL_MACADDR0_REG			0xA009
+
+#define YTPHY_WOL_CONFIG_REG			0xA00A
+#define YTPHY_WCR_INTR_SEL			BIT(6)
+#define YTPHY_WCR_ENABLE			BIT(3)
+
+/* 2b00 84ms
+ * 2b01 168ms  *default*
+ * 2b10 336ms
+ * 2b11 672ms
+ */
+#define YTPHY_WCR_PULSE_WIDTH_MASK		(BIT(2) | BIT(1))
+#define YTPHY_WCR_PULSE_WIDTH_672MS		(BIT(2) | BIT(1))
+
+/* 1b0 Interrupt and WOL events is level triggered and active LOW  *default*
+ * 1b1 Interrupt and WOL events is pulse triggered and active LOW
+ */
+#define YTPHY_WCR_TYPE_PULSE			BIT(0)
+
+/* Extended Register  end */
+
+struct yt8521_priv {
+	/* YT8521_RSSR_UTP_SPACE / YT8521_RSSR_FIBER_SPACE / YT8521_RSSR_TO_BE_ARBITRATED */
+	u8 reg_page;
+	/* YT8521_MODE_FIBER / YT8521_MODE_UTP / YT8521_MODE_POLL*/
+	u8 polling_mode;
+	u8 strap_mode; /* 8 working modes  */
+};
+
+/**
+ * ytphy_read_ext() - read a PHY's extended register
+ * @phydev: a pointer to a &struct phy_device
+ * @regnum: register number to read
+ *
+ * NOTE:The caller must have taken the MDIO bus lock.
+ *
+ * returns the value of regnum reg or negative error code
+ */
+static int ytphy_read_ext(struct phy_device *phydev, u16 regnum)
+{
+	int ret;
+
+	ret = __phy_write(phydev, YTPHY_PAGE_SELECT, regnum);
+	if (ret < 0)
+		return ret;
+
+	return __phy_read(phydev, YTPHY_PAGE_DATA);
+}
+
+/**
+ * ytphy_read_ext_with_lock() - read a PHY's extended register
+ * @phydev: a pointer to a &struct phy_device
+ * @regnum: register number to read
+ *
+ * returns the value of regnum reg or negative error code
+ */
+static int ytphy_read_ext_with_lock(struct phy_device *phydev, u16 regnum)
+{
+	int ret;
+
+	phy_lock_mdio_bus(phydev);
+	ret = ytphy_read_ext(phydev, regnum);
+	phy_unlock_mdio_bus(phydev);
+
+	return ret;
+}
+
+/**
+ * ytphy_write_ext() - write a PHY's extended register
+ * @phydev: a pointer to a &struct phy_device
+ * @regnum: register number to write
+ * @val: value to write to @regnum
+ *
+ * NOTE:The caller must have taken the MDIO bus lock.
+ *
+ * returns 0 or negative error code
+ */
+static int ytphy_write_ext(struct phy_device *phydev, u16 regnum, u16 val)
+{
+	int ret;
+
+	ret = __phy_write(phydev, YTPHY_PAGE_SELECT, regnum);
+	if (ret < 0)
+		return ret;
+
+	return __phy_write(phydev, YTPHY_PAGE_DATA, val);
+}
+
+/**
+ * ytphy_write_ext_with_lock() - write a PHY's extended register
+ * @phydev: a pointer to a &struct phy_device
+ * @regnum: register number to write
+ * @val: value to write to @regnum
+ *
+ * returns 0 or negative error code
+ */
+static int ytphy_write_ext_with_lock(struct phy_device *phydev, u16 regnum,
+				     u16 val)
+{
+	int ret;
+
+	phy_lock_mdio_bus(phydev);
+	ret = ytphy_write_ext(phydev, regnum, val);
+	phy_unlock_mdio_bus(phydev);
+
+	return ret;
+}
+
+/**
+ * ytphy_modify_ext() - bits modify a PHY's extended register
+ * @phydev: a pointer to a &struct phy_device
+ * @regnum: register number to write
+ * @mask: bit mask of bits to clear
+ * @set: bit mask of bits to set
+ *
+ * NOTE: Convenience function which allows a PHY‘s extended register to be
+ * modified as new register value = (old register value & ~mask) | set.
+ * The caller must have taken the MDIO bus lock.
+ *
+ * returns 0 or negative error code
+ */
+static int ytphy_modify_ext(struct phy_device *phydev, u16 regnum, u16 mask,
+			    u16 set)
+{
+	u16 val;
+	int ret;
+
+	ret = ytphy_read_ext(phydev, regnum);
+	if (ret < 0)
+		return ret;
+
+	val = ret & 0xffff;
+	val &= ~mask;
+	val |= set;
+
+	return ytphy_write_ext(phydev, regnum, val);
+}
+
+/**
+ * ytphy_modify_ext_with_lock() - bits modify a PHY's extended register
+ * @phydev: a pointer to a &struct phy_device
+ * @regnum: register number to write
+ * @mask: bit mask of bits to clear
+ * @set: bit mask of bits to set
+ *
+ * NOTE: Convenience function which allows a PHY‘s extended register to be
+ * modified as new register value = (old register value & ~mask) | set.
+ *
+ * returns 0 or negative error code
+ */
+static int ytphy_modify_ext_with_lock(struct phy_device *phydev, u16 regnum,
+				      u16 mask, u16 set)
+{
+	int ret;
+
+	phy_lock_mdio_bus(phydev);
+	ret = ytphy_modify_ext(phydev, regnum, mask, set);
+	phy_unlock_mdio_bus(phydev);
+
+	return ret;
+}
+
+/**
+ * ytphy_get_wol() - report whether wake-on-lan is enabled
+ * @phydev: a pointer to a &struct phy_device
+ * @wol: a pointer to a &struct ethtool_wolinfo
+ *
+ * NOTE: YTPHY_WOL_CONFIG_REG is common ext reg.
+ */
+static void ytphy_get_wol(struct phy_device *phydev,
+			  struct ethtool_wolinfo *wol)
+{
+	int wol_config;
+
+	wol->supported = WAKE_MAGIC;
+	wol->wolopts = 0;
+
+	wol_config = ytphy_read_ext_with_lock(phydev, YTPHY_WOL_CONFIG_REG);
+	if (wol_config < 0)
+		return;
+
+	if (wol_config & YTPHY_WCR_ENABLE)
+		wol->wolopts |= WAKE_MAGIC;
+}
+
+/**
+ * ytphy_set_wol() - turn wake-on-lan on or off
+ * @phydev: a pointer to a &struct phy_device
+ * @wol: a pointer to a &struct ethtool_wolinfo
+ *
+ * NOTE: YTPHY_WOL_CONFIG_REG, YTPHY_WOL_MACADDR2_REG, YTPHY_WOL_MACADDR1_REG
+ * and YTPHY_WOL_MACADDR0_REG are common ext reg. the YTPHY_INTERRUPT_ENABLE_REG
+ * of UTP is special, fiber also use this register.
+ *
+ * returns 0 or negative errno code
+ */
+static int ytphy_set_wol(struct phy_device *phydev, struct ethtool_wolinfo *wol)
+{
+	struct net_device *p_attached_dev;
+	const u16 mac_addr_reg[] = {
+		YTPHY_WOL_MACADDR2_REG,
+		YTPHY_WOL_MACADDR1_REG,
+		YTPHY_WOL_MACADDR0_REG,
+	};
+	const u8 *mac_addr;
+	int old_page;
+	int ret = 0;
+	u16 mask;
+	u16 val;
+	u8 i;
+
+	if (wol->wolopts & WAKE_MAGIC) {
+		p_attached_dev = phydev->attached_dev;
+		if (!p_attached_dev)
+			return -ENODEV;
+
+		mac_addr = (const u8 *)p_attached_dev->dev_addr;
+		if (!is_valid_ether_addr(mac_addr))
+			return -EINVAL;
+
+		/* lock mdio bus then switch to utp reg space */
+		old_page = phy_select_page(phydev, YT8521_RSSR_UTP_SPACE);
+		if (old_page < 0)
+			goto err_restore_page;
+
+		/* Store the device address for the magic packet */
+		for (i = 0; i < 3; i++) {
+			ret = ytphy_write_ext(phydev, mac_addr_reg[i],
+					      ((mac_addr[i * 2] << 8)) |
+						      (mac_addr[i * 2 + 1]));
+			if (ret < 0)
+				goto err_restore_page;
+		}
+
+		/* Enable WOL feature */
+		mask = YTPHY_WCR_PULSE_WIDTH_MASK | YTPHY_WCR_INTR_SEL;
+		val = YTPHY_WCR_ENABLE | YTPHY_WCR_INTR_SEL;
+		val |= YTPHY_WCR_TYPE_PULSE | YTPHY_WCR_PULSE_WIDTH_672MS;
+		ret = ytphy_modify_ext(phydev, YTPHY_WOL_CONFIG_REG, mask, val);
+		if (ret < 0)
+			goto err_restore_page;
+
+		/* Enable WOL interrupt */
+		ret = __phy_modify(phydev, YTPHY_INTERRUPT_ENABLE_REG, 0,
+				   YTPHY_IER_WOL);
+		if (ret < 0)
+			goto err_restore_page;
+
+	} else {
+		old_page = phy_select_page(phydev, YT8521_RSSR_UTP_SPACE);
+		if (old_page < 0)
+			goto err_restore_page;
+
+		/* Disable WOL feature */
+		mask = YTPHY_WCR_ENABLE | YTPHY_WCR_INTR_SEL;
+		ret = ytphy_modify_ext(phydev, YTPHY_WOL_CONFIG_REG, mask, 0);
+
+		/* Disable WOL interrupt */
+		ret = __phy_modify(phydev, YTPHY_INTERRUPT_ENABLE_REG,
+				   YTPHY_IER_WOL, 0);
+		if (ret < 0)
+			goto err_restore_page;
+	}
+
+err_restore_page:
+	return phy_restore_page(phydev, old_page, ret);
+}
+
 static int yt8511_read_page(struct phy_device *phydev)
 {
 	return __phy_read(phydev, YT8511_PAGE_SELECT);
@@ -111,6 +539,717 @@ static int yt8511_config_init(struct phy_device *phydev)
 	return phy_restore_page(phydev, oldpage, ret);
 }
 
+/**
+ * yt8521_read_page() - read reg page
+ * @phydev: a pointer to a &struct phy_device
+ *
+ * returns current reg space of yt8521 (YT8521_RSSR_FIBER_SPACE/
+ * YT8521_RSSR_UTP_SPACE) or negative errno code
+ */
+static int yt8521_read_page(struct phy_device *phydev)
+{
+	int old_page;
+
+	old_page = ytphy_read_ext(phydev, YT8521_REG_SPACE_SELECT_REG);
+	if (old_page < 0)
+		return old_page;
+
+	if ((old_page & YT8521_RSSR_SPACE_MASK) == YT8521_RSSR_FIBER_SPACE)
+		return YT8521_RSSR_FIBER_SPACE;
+
+	return YT8521_RSSR_UTP_SPACE;
+};
+
+/**
+ * yt8521_write_page() - write reg page
+ * @phydev: a pointer to a &struct phy_device
+ * @page: The reg page(YT8521_RSSR_FIBER_SPACE/YT8521_RSSR_UTP_SPACE) to write.
+ *
+ * returns 0 or negative errno code
+ */
+static int yt8521_write_page(struct phy_device *phydev, int page)
+{
+	int mask = YT8521_RSSR_SPACE_MASK;
+	int set;
+
+	if ((page & YT8521_RSSR_SPACE_MASK) == YT8521_RSSR_FIBER_SPACE)
+		set = YT8521_RSSR_FIBER_SPACE;
+	else
+		set = YT8521_RSSR_UTP_SPACE;
+
+	return ytphy_modify_ext(phydev, YT8521_REG_SPACE_SELECT_REG, mask, set);
+};
+
+/**
+ * yt8521_probe() - read chip config then set suitable polling_mode
+ * @phydev: a pointer to a &struct phy_device
+ *
+ * returns 0 or negative errno code
+ */
+static int yt8521_probe(struct phy_device *phydev)
+{
+	struct device *dev = &phydev->mdio.dev;
+	struct yt8521_priv *priv;
+	int chip_config;
+	int ret;
+
+	priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+
+	phydev->priv = priv;
+
+	chip_config = ytphy_read_ext_with_lock(phydev, YT8521_CHIP_CONFIG_REG);
+	if (chip_config < 0)
+		return chip_config;
+
+	priv->strap_mode = chip_config & YT8521_CCR_MODE_SEL_MASK;
+	switch (priv->strap_mode) {
+	case YT8521_CCR_MODE_FIBER_TO_RGMII:
+	case YT8521_CCR_MODE_SGPHY_TO_RGMAC:
+	case YT8521_CCR_MODE_SGMAC_TO_RGPHY:
+		priv->polling_mode = YT8521_MODE_FIBER;
+		priv->reg_page = YT8521_RSSR_FIBER_SPACE;
+		break;
+	case YT8521_CCR_MODE_UTP_FIBER_TO_RGMII:
+	case YT8521_CCR_MODE_UTP_TO_FIBER_AUTO:
+	case YT8521_CCR_MODE_UTP_TO_FIBER_FORCE:
+		priv->polling_mode = YT8521_MODE_POLL;
+		priv->reg_page = YT8521_RSSR_TO_BE_ARBITRATED;
+		break;
+	case YT8521_CCR_MODE_UTP_TO_SGMII:
+	case YT8521_CCR_MODE_UTP_TO_RGMII:
+		priv->polling_mode = YT8521_MODE_UTP;
+		priv->reg_page = YT8521_RSSR_UTP_SPACE;
+		break;
+	}
+	/* set default reg space */
+	if (priv->reg_page != YT8521_RSSR_TO_BE_ARBITRATED) {
+		ret = ytphy_write_ext_with_lock(phydev,
+						YT8521_REG_SPACE_SELECT_REG,
+						priv->reg_page);
+		if (ret < 0)
+			return ret;
+	}
+
+	return 0;
+}
+
+/**
+ * yt8521_adjust_status() - update speed and duplex to phydev. when in fiber
+ * mode, adjust speed and duplex.
+ *
+ * @phydev: a pointer to a &struct phy_device
+ * @status: yt8521 status read from YTPHY_SPECIFIC_STATUS_REG
+ * @is_utp: false(yt8521 work in fiber mode) or true(yt8521 work in utp mode)
+ *
+ * NOTE: there is no 10M speed mode in fiber mode, so need adjust.
+ *
+ * returns 0
+ */
+static int yt8521_adjust_status(struct phy_device *phydev, int status,
+				bool is_utp)
+{
+	int speed_mode, duplex;
+	int speed;
+
+	if (is_utp)
+		duplex = (status & YTPHY_SSR_DUPLEX) >> YTPHY_SSR_DUPLEX_OFFSET;
+	else
+		duplex = DUPLEX_FULL;
+
+	speed_mode = (status & YTPHY_SSR_SPEED_MODE_MASK) >>
+		     YTPHY_SSR_SPEED_MODE_OFFSET;
+
+	switch (speed_mode) {
+	case YTPHY_SSR_SPEED_10M:
+		if (is_utp)
+			speed = SPEED_10;
+		else
+			speed = SPEED_UNKNOWN;
+		break;
+	case YTPHY_SSR_SPEED_100M:
+		speed = SPEED_100;
+		break;
+	case YTPHY_SSR_SPEED_1000M:
+		speed = SPEED_1000;
+		break;
+	default:
+		speed = SPEED_UNKNOWN;
+		break;
+	}
+
+	phydev->speed = speed;
+	phydev->duplex = duplex;
+
+	return 0;
+}
+
+/**
+ * yt8521_read_status_paged() -  determines the speed and duplex of one page
+ * @phydev: a pointer to a &struct phy_device
+ * @page: The reg page(YT8521_RSSR_FIBER_SPACE/YT8521_RSSR_UTP_SPACE) to operate
+ *
+ * returns 1 (utp or fiber link),0 (no link) or negative errno code
+ */
+static int yt8521_read_status_paged(struct phy_device *phydev, int page)
+{
+	int fiber_latch_val;
+	int fiber_curr_val;
+	int old_page;
+	int ret = 0;
+	int status;
+	int link;
+
+	/* YT8521 has two reg space (utp/fiber) for linkup with utp/fiber
+	 * respectively. but for utp/fiber combo mode, reg space should be
+	 * arbitrated based on media priority. by default, fiber takes priority.
+	 * reg space should be properly set before read YTPHY_SPECIFIC_STATUS_REG.
+	 */
+
+	page &= YT8521_RSSR_SPACE_MASK;
+	old_page = phy_select_page(phydev, page);
+	if (old_page < 0)
+		goto err_restore_page;
+
+	/* Read YTPHY_SPECIFIC_STATUS_REG, which indicates the speed and duplex
+	 * of the PHY is actually using.
+	 */
+	ret = __phy_read(phydev, YTPHY_SPECIFIC_STATUS_REG);
+	if (ret < 0)
+		goto err_restore_page;
+
+	status = ret;
+	link = status & YTPHY_SSR_LINK;
+
+	/* When PHY is in fiber mode, speed transferred from 1000Mbps to 100Mbps,
+	 * there is not link down from YTPHY_SPECIFIC_STATUS_REG, so we need
+	 * check MII_BMSR to identify such case.
+	 */
+	if (page == YT8521_RSSR_FIBER_SPACE) {
+		ret = __phy_read(phydev, MII_BMSR);
+		if (ret < 0)
+			goto err_restore_page;
+
+		fiber_latch_val = ret;
+		ret = __phy_read(phydev, MII_BMSR);
+		if (ret < 0)
+			goto err_restore_page;
+
+		fiber_curr_val = ret;
+		if (link && fiber_latch_val != fiber_curr_val) {
+			link = 0;
+			phydev_info(phydev,
+				    "%s, phy addr: %d, fiber link down detect, latch = %04x, curr = %04x\n",
+				    __func__, phydev->mdio.addr, fiber_latch_val,
+				    fiber_curr_val);
+		}
+	}
+
+	if (link) {
+		ret = 1;
+		if (page == YT8521_RSSR_UTP_SPACE)
+			yt8521_adjust_status(phydev, status, true);
+		else
+			yt8521_adjust_status(phydev, status, false);
+
+	} else {
+		ret = 0;
+	}
+
+err_restore_page:
+	return phy_restore_page(phydev, old_page, ret);
+}
+
+/**
+ * yt8521_read_status() -  determines the negotiated speed and duplex
+ * @phydev: a pointer to a &struct phy_device
+ *
+ * returns 0 or negative errno code
+ */
+static int yt8521_read_status(struct phy_device *phydev)
+{
+	struct yt8521_priv *priv = phydev->priv;
+	int link_utp = 0;
+	int link_fiber;
+	int link;
+	int ret;
+
+	if (priv->reg_page != YT8521_RSSR_TO_BE_ARBITRATED) {
+		link = yt8521_read_status_paged(phydev, priv->reg_page);
+		if (link < 0)
+			return link;
+	} else {
+		/* when page is YT8521_RSSR_TO_BE_ARBITRATED, arbitration is
+		 * needed. by default, fiber is of higher priority.
+		 */
+
+		link_fiber = yt8521_read_status_paged(phydev,
+						      YT8521_RSSR_FIBER_SPACE);
+		if (link_fiber < 0)
+			return link_fiber;
+
+		if (!link_fiber) {
+			link_utp = yt8521_read_status_paged(phydev,
+							    YT8521_RSSR_UTP_SPACE);
+			if (link_utp < 0)
+				return link_utp;
+		}
+
+		link = link_utp || link_fiber;
+	}
+
+	if (link) {
+		if (phydev->link == 0) {
+			/* arbitrate reg space based on linkup media type. */
+			if (priv->polling_mode == YT8521_MODE_POLL) {
+				if (link_fiber)
+					priv->reg_page =
+						YT8521_RSSR_FIBER_SPACE;
+				else
+					priv->reg_page = YT8521_RSSR_UTP_SPACE;
+
+				ret = ytphy_write_ext_with_lock(phydev,
+								YT8521_REG_SPACE_SELECT_REG,
+								priv->reg_page);
+				if (ret < 0)
+					return ret;
+			}
+
+			phydev_info(phydev,
+				    "%s, phy addr: %d, link up, media: %s\n",
+				    __func__, phydev->mdio.addr,
+				    (priv->reg_page == YT8521_RSSR_UTP_SPACE) ?
+					    "UTP" :
+						  "Fiber");
+		}
+		phydev->link = 1;
+	} else {
+		if (phydev->link == 1) {
+			/* When in YT8521_MODE_POLL mode, need prepare for next
+			 * arbitration.
+			 */
+			if (priv->polling_mode == YT8521_MODE_POLL)
+				priv->reg_page = YT8521_RSSR_TO_BE_ARBITRATED;
+
+			phydev_info(phydev, "%s, phy addr: %d, link down\n",
+				    __func__, phydev->mdio.addr);
+		}
+
+		phydev->link = 0;
+	}
+
+	return 0;
+}
+
+/**
+ * yt8521_modify_bmcr_paged - bits modify a PHY's BMCR register of one page
+ * @phydev: the phy_device struct
+ * @page: The reg page(YT8521_RSSR_FIBER_SPACE/YT8521_RSSR_UTP_SPACE) to operate
+ * @mask: bit mask of bits to clear
+ * @set: bit mask of bits to set
+ *
+ * NOTE: Convenience function which allows a PHY‘s BMCR register to be
+ * modified as new register value = (old register value & ~mask) | set.
+ * YT8521 has two space (utp/fiber) and three mode (utp/fiber/poll), each space
+ * has MII_BMCR. poll mode combines utp and faber,so need do both.
+ * If it is reset, it will wait for completion.
+ *
+ * returns 0 or negative errno code
+ */
+static int yt8521_modify_bmcr_paged(struct phy_device *phydev, int page,
+				    u16 mask, u16 set)
+{
+	int max_cnt = 500; /* the max wait time of reset ~ 500 ms */
+	int old_page;
+	int ret = 0;
+
+	old_page = phy_select_page(phydev, page & YT8521_RSSR_SPACE_MASK);
+	if (old_page < 0)
+		goto err_restore_page;
+
+	ret = __phy_modify(phydev, MII_BMCR, mask, set);
+	if (ret < 0)
+		goto err_restore_page;
+
+	/* If it is reset, need to wait for the reset to complete */
+	if (set == BMCR_RESET) {
+		while (max_cnt--) {
+			/* unlock mdio bus during sleep */
+			phy_unlock_mdio_bus(phydev);
+			usleep_range(1000, 1100);
+			phy_lock_mdio_bus(phydev);
+
+			ret = __phy_read(phydev, MII_BMCR);
+			if (ret < 0)
+				goto err_restore_page;
+
+			if (!(ret & BMCR_RESET))
+				return phy_restore_page(phydev, old_page, 0);
+		}
+		if (max_cnt <= 0)
+			ret = -ETIMEDOUT;
+	}
+
+err_restore_page:
+	return phy_restore_page(phydev, old_page, ret);
+}
+
+/**
+ * yt8521_modify_utp_fiber_bmcr - bits modify a PHY's BMCR register
+ * @phydev: the phy_device struct
+ * @mask: bit mask of bits to clear
+ * @set: bit mask of bits to set
+ *
+ * NOTE: Convenience function which allows a PHY‘s BMCR register to be
+ * modified as new register value = (old register value & ~mask) | set.
+ * YT8521 has two space (utp/fiber) and three mode (utp/fiber/poll), each space
+ * has MII_BMCR. poll mode combines utp and faber,so need do both.
+ *
+ * returns 0 or negative errno code
+ */
+static int yt8521_modify_utp_fiber_bmcr(struct phy_device *phydev, u16 mask,
+					u16 set)
+{
+	struct yt8521_priv *priv = phydev->priv;
+	int ret;
+
+	if (priv->reg_page != YT8521_RSSR_TO_BE_ARBITRATED) {
+		ret = yt8521_modify_bmcr_paged(phydev, priv->reg_page, mask,
+					       set);
+		if (ret < 0)
+			return ret;
+	} else {
+		ret = yt8521_modify_bmcr_paged(phydev, YT8521_RSSR_FIBER_SPACE,
+					       mask, set);
+		if (ret < 0)
+			return ret;
+
+		ret = yt8521_modify_bmcr_paged(phydev, YT8521_RSSR_UTP_SPACE,
+					       mask, set);
+		if (ret < 0)
+			return ret;
+	}
+	return 0;
+}
+
+/**
+ * yt8521_soft_reset() - called to issue a PHY software reset
+ * @phydev: a pointer to a &struct phy_device
+ *
+ * returns 0 or negative errno code
+ */
+static int yt8521_soft_reset(struct phy_device *phydev)
+{
+	return yt8521_modify_utp_fiber_bmcr(phydev, 0, BMCR_RESET);
+}
+
+/**
+ * yt8521_suspend() - suspend the hardware
+ * @phydev: a pointer to a &struct phy_device
+ *
+ * returns 0 or negative errno code
+ */
+static int yt8521_suspend(struct phy_device *phydev)
+{
+	int wol_config;
+
+	/* YTPHY_WOL_CONFIG_REG is common ext reg */
+	wol_config = ytphy_read_ext_with_lock(phydev, YTPHY_WOL_CONFIG_REG);
+	if (wol_config < 0)
+		return wol_config;
+
+	/* if wol enable, do nothing */
+	if (wol_config & YTPHY_WCR_ENABLE)
+		return 0;
+
+	return yt8521_modify_utp_fiber_bmcr(phydev, 0, BMCR_PDOWN);
+}
+
+/**
+ * yt8521_resume() - resume the hardware
+ * @phydev: a pointer to a &struct phy_device
+ *
+ * returns 0 or negative errno code
+ */
+static int yt8521_resume(struct phy_device *phydev)
+{
+	int ret;
+	int wol_config;
+
+	/* disable auto sleep */
+	ret = ytphy_modify_ext_with_lock(phydev,
+					 YT8521_EXTREG_SLEEP_CONTROL1_REG,
+					 YT8521_ESC1R_SLEEP_SW, 0);
+	if (ret < 0)
+		return ret;
+
+	wol_config = ytphy_read_ext_with_lock(phydev, YTPHY_WOL_CONFIG_REG);
+	if (wol_config < 0)
+		return wol_config;
+
+	/* if wol enable, do nothing */
+	if (wol_config & YTPHY_WCR_ENABLE)
+		return 0;
+
+	return yt8521_modify_utp_fiber_bmcr(phydev, BMCR_PDOWN, 0);
+}
+
+/**
+ * yt8521_config_init() - called to initialize the PHY
+ * @phydev: a pointer to a &struct phy_device
+ *
+ * returns 0 or negative errno code
+ */
+static int yt8521_config_init(struct phy_device *phydev)
+{
+	int old_page;
+	int ret = 0;
+
+	old_page = phy_select_page(phydev, YT8521_RSSR_UTP_SPACE);
+	if (old_page < 0)
+		goto err_restore_page;
+
+	/* disable auto sleep */
+	ret = ytphy_modify_ext(phydev, YT8521_EXTREG_SLEEP_CONTROL1_REG,
+			       YT8521_ESC1R_SLEEP_SW, 0);
+	if (ret < 0)
+		goto err_restore_page;
+
+	/* enable RXC clock when no wire plug */
+	ret = ytphy_modify_ext(phydev, YT8521_CLOCK_GATING_REG,
+			       YT8521_CGR_RX_CLK_EN, 0);
+	if (ret < 0)
+		goto err_restore_page;
+
+err_restore_page:
+	return phy_restore_page(phydev, old_page, ret);
+}
+
+/**
+ * yt8521_fiber_setup_forced - configures/forces speed from @phydev
+ * @phydev: target phy_device struct
+ *
+ * returns 0 or negative errno code
+ */
+static int yt8521_fiber_setup_forced(struct phy_device *phydev)
+{
+	int max_cnt = 500; /* the max wait time of reset ~ 500 ms */
+	u16 val;
+	int ret;
+
+	if (phydev->speed == SPEED_1000)
+		val = YTPHY_MCR_FIBER_1000BX;
+	else if (phydev->speed == SPEED_100)
+		val = YTPHY_MCR_FIBER_100FX;
+	else
+		return -EINVAL;
+
+	ret =  phy_modify(phydev,  MII_BMCR, BMCR_ANENABLE, 0);
+	if (ret < 0)
+		return ret;
+
+	/* disable Fiber auto sensing */
+	ret =  ytphy_modify_ext_with_lock(phydev, YT8521_LINK_TIMER_CFG2_REG,
+					  YT8521_LTCR_EN_AUTOSEN, 0);
+	if (ret < 0)
+		return ret;
+
+	phydev->pause = 0;
+	phydev->asym_pause = 0;
+
+	ret =  ytphy_modify_ext_with_lock(phydev, YTPHY_MISC_CONFIG_REG,
+					  YTPHY_MCR_FIBER_SPEED_MASK, val);
+	if (ret < 0)
+		return ret;
+
+	ret =  ytphy_modify_ext_with_lock(phydev, YT8521_CHIP_CONFIG_REG,
+					  YT8521_CCR_SW_RST, 0);
+	if (ret < 0)
+		return ret;
+
+	while (max_cnt--) {
+		usleep_range(1000, 1100);
+		ret = ytphy_read_ext_with_lock(phydev, YT8521_CHIP_CONFIG_REG);
+		if (ret < 0)
+			return ret;
+
+		if (!(ret & YT8521_CCR_SW_RST))
+			return 0;
+	}
+
+	if (max_cnt <= 0)
+		return -ETIMEDOUT;
+
+	return 0;
+}
+
+/**
+ * yt8521_fiber_config_aneg - restart auto-negotiation or write YTPHY_MISC_CONFIG_REG
+ * @phydev: target phy_device struct
+ *
+ * returns 0 or negative errno code
+ */
+static int yt8521_fiber_config_aneg(struct phy_device *phydev)
+{
+	int err, changed;
+	u16 adv;
+
+	if (phydev->autoneg != AUTONEG_ENABLE)
+		return yt8521_fiber_setup_forced(phydev);
+
+	err =  ytphy_modify_ext_with_lock(phydev, YTPHY_MISC_CONFIG_REG,
+					  YTPHY_MCR_FIBER_SPEED_MASK, YTPHY_MCR_FIBER_1000BX);
+	if (err < 0)
+		return err;
+
+	/* enable Fiber auto sensing */
+	err =  ytphy_modify_ext_with_lock(phydev, YT8521_LINK_TIMER_CFG2_REG,
+					  0, YT8521_LTCR_EN_AUTOSEN);
+	if (err < 0)
+		return err;
+
+	/* Setup fiber advertisement */
+	adv = ADVERTISE_1000XFULL | ADVERTISE_1000XPAUSE | ADVERTISE_1000XPSE_ASYM;
+	err = phy_modify_changed(phydev, MII_ADVERTISE,
+				 ADVERTISE_1000XHALF | ADVERTISE_1000XFULL |
+				 ADVERTISE_1000XPAUSE | ADVERTISE_1000XPSE_ASYM,
+				 adv);
+	if (err < 0)
+		return err;
+
+	if (err > 0)
+		changed = 1;
+
+	return genphy_check_and_restart_aneg(phydev, changed);
+}
+
+/**
+ * yt8521_config_aneg_paged() - switch reg space then call genphy_config_aneg
+ * of one page
+ * @phydev: a pointer to a &struct phy_device
+ * @page: The reg page(YT8521_RSSR_FIBER_SPACE/YT8521_RSSR_UTP_SPACE) to operate
+ *
+ * returns 0 or negative errno code
+ */
+static int yt8521_config_aneg_paged(struct phy_device *phydev, int page)
+{
+	int old_page;
+	int ret = 0;
+
+	page &= YT8521_RSSR_SPACE_MASK;
+
+	old_page = phy_select_page(phydev, page);
+	if (old_page < 0)
+		goto err_restore_page;
+
+	/* unlock mdio bus during genphy_config_aneg/yt8521_fiber_config_aneg,
+	 * because it will operate this lock.
+	 */
+	phy_unlock_mdio_bus(phydev);
+	if (page == YT8521_RSSR_FIBER_SPACE)
+		ret = yt8521_fiber_config_aneg(phydev);
+	else
+		ret = genphy_config_aneg(phydev);
+
+	phy_lock_mdio_bus(phydev);
+
+err_restore_page:
+	return phy_restore_page(phydev, old_page, ret);
+}
+
+/**
+ * yt8521_config_aneg() - change reg space then call genphy_config_aneg
+ * @phydev: a pointer to a &struct phy_device
+ *
+ * returns 0 or negative errno code
+ */
+static int yt8521_config_aneg(struct phy_device *phydev)
+{
+	struct yt8521_priv *priv = phydev->priv;
+	int ret;
+
+	if (priv->reg_page != YT8521_RSSR_TO_BE_ARBITRATED) {
+		ret = yt8521_config_aneg_paged(phydev, priv->reg_page);
+		if (ret < 0)
+			return ret;
+	} else {
+		ret = yt8521_config_aneg_paged(phydev, YT8521_RSSR_FIBER_SPACE);
+		if (ret < 0)
+			return ret;
+
+		ret = yt8521_config_aneg_paged(phydev, YT8521_RSSR_UTP_SPACE);
+		if (ret < 0)
+			return ret;
+	}
+
+	return 0;
+}
+
+/**
+ * yt8521_aneg_done_paged() - determines the auto negotiation result of one page
+ * @phydev: a pointer to a &struct phy_device
+ * @page: The reg page(YT8521_RSSR_FIBER_SPACE/YT8521_RSSR_UTP_SPACE) to operate
+ *
+ * returns 0(no link)or 1(fiber or utp link) or negative errno code
+ */
+static int yt8521_aneg_done_paged(struct phy_device *phydev, int page)
+{
+	int old_page;
+	int ret = 0;
+	int link;
+
+	old_page = phy_select_page(phydev, page & YT8521_RSSR_SPACE_MASK);
+	if (old_page < 0)
+		goto err_restore_page;
+
+	ret = __phy_read(phydev, YTPHY_SPECIFIC_STATUS_REG);
+	if (ret < 0)
+		goto err_restore_page;
+
+	link = !!(ret & YTPHY_SSR_LINK);
+	ret = link;
+
+err_restore_page:
+	return phy_restore_page(phydev, old_page, ret);
+}
+
+/**
+ * yt8521_aneg_done() - determines the auto negotiation result
+ * @phydev: a pointer to a &struct phy_device
+ *
+ * returns 0(no link)or 1(fiber or utp link) or negative errno code
+ */
+static int yt8521_aneg_done(struct phy_device *phydev)
+{
+	struct yt8521_priv *priv = phydev->priv;
+	int link_utp = 0;
+	int link_fiber;
+	int link;
+
+	if (priv->reg_page != YT8521_RSSR_TO_BE_ARBITRATED) {
+		link = yt8521_aneg_done_paged(phydev, priv->reg_page);
+	} else {
+		link_fiber =
+			yt8521_aneg_done_paged(phydev, YT8521_RSSR_FIBER_SPACE);
+		if (link_fiber < 0)
+			return link_fiber;
+
+		if (!link_fiber) {
+			link_utp = yt8521_aneg_done_paged(phydev,
+							  YT8521_RSSR_UTP_SPACE);
+			if (link_utp < 0)
+				return link_utp;
+		}
+		link = link_fiber || link_utp;
+		phydev_info(phydev,
+			    "%s, phy addr: %d, link_fiber: %d, link_utp: %d\n",
+			    __func__, phydev->mdio.addr, link_fiber, link_utp);
+	}
+
+	return link;
+}
+
 static struct phy_driver motorcomm_phy_drvs[] = {
 	{
 		PHY_ID_MATCH_EXACT(PHY_ID_YT8511),
@@ -121,16 +1260,35 @@ static struct phy_driver motorcomm_phy_drvs[] = {
 		.read_page	= yt8511_read_page,
 		.write_page	= yt8511_write_page,
 	},
+	{
+		PHY_ID_MATCH_EXACT(PHY_ID_YT8521),
+		.name		= "YT8521 Gigabit Ethernet",
+		.features	= PHY_GBIT_FEATURES,
+		.probe		= yt8521_probe,
+		.read_page	= yt8521_read_page,
+		.write_page	= yt8521_write_page,
+		.get_wol	= ytphy_get_wol,
+		.set_wol	= ytphy_set_wol,
+		.config_aneg	= yt8521_config_aneg,
+		.aneg_done	= yt8521_aneg_done,
+		.config_init	= yt8521_config_init,
+		.read_status	= yt8521_read_status,
+		.soft_reset	= yt8521_soft_reset,
+		.suspend	= yt8521_suspend,
+		.resume		= yt8521_resume,
+	},
 };
 
 module_phy_driver(motorcomm_phy_drvs);
 
-MODULE_DESCRIPTION("Motorcomm PHY driver");
+MODULE_DESCRIPTION("Motorcomm 8511/8521 PHY driver");
 MODULE_AUTHOR("Peter Geis");
+MODULE_AUTHOR("Frank");
 MODULE_LICENSE("GPL");
 
 static const struct mdio_device_id __maybe_unused motorcomm_tbl[] = {
 	{ PHY_ID_MATCH_EXACT(PHY_ID_YT8511) },
+	{ PHY_ID_MATCH_EXACT(PHY_ID_YT8521) },
 	{ /* sentinal */ }
 };
 
-- 
2.31.0.windows.1


^ permalink raw reply related

* Re: [PATCH net] net: phy: Warn about incorrect mdio_bus_phy_resume() state
From: Marek Szyprowski @ 2022-08-16 11:18 UTC (permalink / raw)
  To: Florian Fainelli, netdev, Steve Glendinning
  Cc: opendmb, Andrew Lunn, Heiner Kallweit, Russell King,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	open list
In-Reply-To: <8c21e530-8e8f-ce2a-239e-9d3a354996cf@gmail.com>


On 12.08.2022 18:32, Florian Fainelli wrote:
> On 8/12/22 04:19, Marek Szyprowski wrote:
>> Hi All,
>>
>> On 02.08.2022 01:34, Florian Fainelli wrote:
>>> Calling mdio_bus_phy_resume() with neither the PHY state machine set to
>>> PHY_HALTED nor phydev->mac_managed_pm set to true is a good indication
>>> that we can produce a race condition looking like this:
>>>
>>> CPU0                        CPU1
>>> bcmgenet_resume
>>>    -> phy_resume
>>>      -> phy_init_hw
>>>    -> phy_start
>>>      -> phy_resume
>>> phy_start_aneg()
>>> mdio_bus_phy_resume
>>>    -> phy_resume
>>>       -> phy_write(..., BMCR_RESET)
>>>        -> usleep()                                  -> phy_read()
>>>
>>> with the phy_resume() function triggering a PHY behavior that might 
>>> have
>>> to be worked around with (see bf8bfc4336f7 ("net: phy: broadcom: Fix
>>> brcm_fet_config_init()") for instance) that ultimately leads to an 
>>> error
>>> reading from the PHY.
>>>
>>> Fixes: fba863b81604 ("net: phy: make PHY PM ops a no-op if MAC 
>>> driver manages PHY PM")
>>> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
>>
>> This patch, as probably intended, triggers a warning during system
>> suspend/resume cycle in the SMSC911x driver. I've observed it on ARM
>> Juno R1 board on the kernel compiled from next-202208010:
>>
>>    ------------[ cut here ]------------
>>    WARNING: CPU: 1 PID: 398 at drivers/net/phy/phy_device.c:323
>> mdio_bus_phy_resume+0x34/0xc8
>>    Modules linked in: smsc911x cpufreq_powersave cpufreq_conservative
>> crct10dif_ce ip_tables x_tables ipv6 [last unloaded: smsc911x]
>>    CPU: 1 PID: 398 Comm: rtcwake Not tainted 5.19.0+ #940
>>    Hardware name: ARM Juno development board (r1) (DT)
>>    pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>    pc : mdio_bus_phy_resume+0x34/0xc8
>>    lr : dpm_run_callback+0x74/0x350
>>    ...
>>    Call trace:
>>     mdio_bus_phy_resume+0x34/0xc8
>>     dpm_run_callback+0x74/0x350
>>     device_resume+0xb8/0x258
>>     dpm_resume+0x120/0x4a8
>>     dpm_resume_end+0x14/0x28
>>     suspend_devices_and_enter+0x164/0xa60
>>     pm_suspend+0x25c/0x3a8
>>     state_store+0x84/0x108
>>     kobj_attr_store+0x14/0x28
>>     sysfs_kf_write+0x60/0x70
>>     kernfs_fop_write_iter+0x124/0x1a8
>>     new_sync_write+0xd0/0x190
>>     vfs_write+0x208/0x478
>>     ksys_write+0x64/0xf0
>>     __arm64_sys_write+0x14/0x20
>>     invoke_syscall+0x40/0xf8
>>     el0_svc_common.constprop.3+0x8c/0x120
>>     do_el0_svc+0x28/0xc8
>>     el0_svc+0x48/0xd0
>>     el0t_64_sync_handler+0x94/0xb8
>>     el0t_64_sync+0x15c/0x160
>>    irq event stamp: 24406
>>    hardirqs last  enabled at (24405): [<ffff8000090c4734>]
>> _raw_spin_unlock_irqrestore+0x8c/0x90
>>    hardirqs last disabled at (24406): [<ffff8000090b3164>] 
>> el1_dbg+0x24/0x88
>>    softirqs last  enabled at (24144): [<ffff800008010488>] 
>> _stext+0x488/0x5cc
>>    softirqs last disabled at (24139): [<ffff80000809bf98>]
>> irq_exit_rcu+0x168/0x1a8
>>    ---[ end trace 0000000000000000 ]---
>>
>> I hope the above information will help fixing the driver.
>
> Yes this is catching an actual issue in the driver in that the PHY 
> state machine is still running while the system is trying to suspend. 
> We could go about fixing it in a different number of ways, though I 
> believe this one is probably correct enough to work and fix the warning:

Right, this fixes the warning (after fixing the minor compile issue, see 
below).

Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>

Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>

>
> diff --git a/drivers/net/ethernet/smsc/smsc911x.c 
> b/drivers/net/ethernet/smsc/smsc911x.c
> index 3bf20211cceb..e9c0668a4dc0 100644
> --- a/drivers/net/ethernet/smsc/smsc911x.c
> +++ b/drivers/net/ethernet/smsc/smsc911x.c
> @@ -1037,6 +1037,8 @@ static int smsc911x_mii_probe(struct net_device 
> *dev)
>                 return ret;
>         }
>
> +       /* Indicate that the MAC is responsible for managing PHY PM */
> +       phydev->mac_managed_pm = true;
>         phy_attached_info(phydev);
>
>         phy_set_max_speed(phydev, SPEED_100);
> @@ -2587,6 +2589,8 @@ static int smsc911x_suspend(struct device *dev)
>         if (netif_running(ndev)) {
>                 netif_stop_queue(ndev);
>                 netif_device_detach(ndev);
> +               if (!device_may_wakeup(dev))
> +                       phy_suspend(dev->phydev);
                         phy_suspend(ndev->phydev);
> }
>
>         /* enable wake on LAN, energy detection and the external PME
> @@ -2628,6 +2632,8 @@ static int smsc911x_resume(struct device *dev)
>         if (netif_running(ndev)) {
>                 netif_device_attach(ndev);
>                 netif_start_queue(ndev);
> +               if (!device_may_wakeup(dev))
> +                       phy_resume(dev->phydev);
                         phy_resume(ndev->phydev);
> }
>
>         return 0;
>
Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


^ permalink raw reply

* Re: [PATCH 1/2] dt-bindings: vertexcom-mse102x: Update email address
From: Krzysztof Kozlowski @ 2022-08-16 11:18 UTC (permalink / raw)
  To: Stefan Wahren, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Rob Herring, Krzysztof Kozlowski
  Cc: netdev, devicetree, stefan.wahren
In-Reply-To: <6b3bfb9d-1acf-8445-7181-d5e1ecf4745d@i2se.com>

On 16/08/2022 14:14, Stefan Wahren wrote:
> Hi Krzystof,
> 
> Am 16.08.22 um 12:24 schrieb Krzysztof Kozlowski:
>> On 15/08/2022 11:06, Stefan Wahren wrote:
>>> in-tech smart charging is now chargebyte. So update the email address
>>> accordingly.
>>>
>>> Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
>> Yet you used third email address... which is fine, just a bit confusing.
>> Since in-tech.com still works and you might be (or not) different
>> Stefan, it's difficult to judge...
> 
> sorry about the confusion. I'm that person. Unfortuntely the other 
> accounts are using a M$ Exchange server, so i switched to the i2se 
> account for this setup before sending broken patches ...
> 
> in-tech was the parent company, but the accounts will unavailable in the 
> future.

Email used for authorship and for sending can be different. It makes
things more obvious.


Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>


Best regards,
Krzysztof

^ permalink raw reply

* [PATCH] net: dsa: mv88e6xxx: support RGMII cmode
From: Marcus Carlberg @ 2022-08-16 11:45 UTC (permalink / raw)
  To: Andrew Lunn, Vivien Didelot, Florian Fainelli, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: kernel, Marcus Carlberg, netdev, linux-kernel

Since the probe defaults all interfaces to the highest speed possible
(10GBASE-X in mv88e6393x) before the phy mode configuration from the
devicetree is considered it is currently impossible to use port 0 in
RGMII mode.

This change will allow RGMII modes to be configurable for port 0
enabling port 0 to be configured as RGMII as well as serial depending
on configuration.

Signed-off-by: Marcus Carlberg <marcus.carlberg@axis.com>
---
 drivers/net/dsa/mv88e6xxx/port.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/dsa/mv88e6xxx/port.c b/drivers/net/dsa/mv88e6xxx/port.c
index 90c55f23b7c9..2e005449e733 100644
--- a/drivers/net/dsa/mv88e6xxx/port.c
+++ b/drivers/net/dsa/mv88e6xxx/port.c
@@ -517,6 +517,12 @@ static int mv88e6xxx_port_set_cmode(struct mv88e6xxx_chip *chip, int port,
 	case PHY_INTERFACE_MODE_RMII:
 		cmode = MV88E6XXX_PORT_STS_CMODE_RMII;
 		break;
+	case PHY_INTERFACE_MODE_RGMII:
+	case PHY_INTERFACE_MODE_RGMII_ID:
+	case PHY_INTERFACE_MODE_RGMII_RXID:
+	case PHY_INTERFACE_MODE_RGMII_TXID:
+		cmode = MV88E6XXX_PORT_STS_CMODE_RGMII;
+		break;
 	case PHY_INTERFACE_MODE_1000BASEX:
 		cmode = MV88E6XXX_PORT_STS_CMODE_1000BASEX;
 		break;
-- 
2.20.1


^ permalink raw reply related

* [RESEND PATCH 0/2] Macb PTP enhancements
From: Harini Katakam @ 2022-08-16 11:54 UTC (permalink / raw)
  To: nicolas.ferre, davem, richardcochran, claudiu.beznea,
	andrei.pistirica, kuba, edumazet, pabeni
  Cc: netdev, linux-kernel, michal.simek, harinikatakamlinux,
	harini.katakam, michal.simek, harini.katakam, radhey.shyam.pandey

This series is a follow up for patches 2 and 3 from a previous series:
https://lore.kernel.org/all/ca4c97c9-1117-a465-5202-e1bf276fe75b@microchip.com/
https://lore.kernel.org/all/20220517135525.GC3344@hoboy.vegasvil.org/
Sorry for the delay.

ACK is added only to patch 3 (now patch 2).
Patch 1 is updated with check for gem_has_ptp as per Claudiu's comments.

Resending as net-next was closed when this series was sent a few weeks ago.

Harini Katakam (2):
  net: macb: Enable PTP unicast
  net: macb: Optimize reading HW timestamp

 drivers/net/ethernet/cadence/macb.h      |  4 ++++
 drivers/net/ethernet/cadence/macb_main.c | 13 ++++++++++++-
 drivers/net/ethernet/cadence/macb_ptp.c  |  8 ++++++--
 3 files changed, 22 insertions(+), 3 deletions(-)

-- 
2.17.1

^ permalink raw reply

* [RESEND PATCH 1/2] net: macb: Enable PTP unicast
From: Harini Katakam @ 2022-08-16 11:54 UTC (permalink / raw)
  To: nicolas.ferre, davem, richardcochran, claudiu.beznea,
	andrei.pistirica, kuba, edumazet, pabeni
  Cc: netdev, linux-kernel, michal.simek, harinikatakamlinux,
	harini.katakam, michal.simek, harini.katakam, radhey.shyam.pandey
In-Reply-To: <20220816115500.353-1-harini.katakam@amd.com>

From: Harini Katakam <harini.katakam@xilinx.com>

Enable transmission and reception of PTP unicast packets by
updating PTP unicast config bit and setting current HW mac
address as allowed address in PTP unicast filter registers.

Signed-off-by: Harini Katakam <harini.katakam@xilinx.com>
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com>
---
Added check for gem_has_ptp as per Claudiu's comments.

 drivers/net/ethernet/cadence/macb.h      |  4 ++++
 drivers/net/ethernet/cadence/macb_main.c | 13 ++++++++++++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
index 9c410f93a103..1aa578c1ca4a 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -95,6 +95,8 @@
 #define GEM_SA4B		0x00A0 /* Specific4 Bottom */
 #define GEM_SA4T		0x00A4 /* Specific4 Top */
 #define GEM_WOL			0x00b8 /* Wake on LAN */
+#define GEM_RXPTPUNI		0x00D4 /* PTP RX Unicast address */
+#define GEM_TXPTPUNI		0x00D8 /* PTP TX Unicast address */
 #define GEM_EFTSH		0x00e8 /* PTP Event Frame Transmitted Seconds Register 47:32 */
 #define GEM_EFRSH		0x00ec /* PTP Event Frame Received Seconds Register 47:32 */
 #define GEM_PEFTSH		0x00f0 /* PTP Peer Event Frame Transmitted Seconds Register 47:32 */
@@ -245,6 +247,8 @@
 #define MACB_TZQ_OFFSET		12 /* Transmit zero quantum pause frame */
 #define MACB_TZQ_SIZE		1
 #define MACB_SRTSM_OFFSET	15 /* Store Receive Timestamp to Memory */
+#define MACB_PTPUNI_OFFSET	20 /* PTP Unicast packet enable */
+#define MACB_PTPUNI_SIZE	1
 #define MACB_OSSMODE_OFFSET	24 /* Enable One Step Synchro Mode */
 #define MACB_OSSMODE_SIZE	1
 #define MACB_MIIONRGMII_OFFSET	28 /* MII Usage on RGMII Interface */
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index 494fe961a49d..4699699a1593 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -287,6 +287,13 @@ static void macb_set_hwaddr(struct macb *bp)
 	top = cpu_to_le16(*((u16 *)(bp->dev->dev_addr + 4)));
 	macb_or_gem_writel(bp, SA1T, top);
 
+#ifdef CONFIG_MACB_USE_HWSTAMP
+	if (gem_has_ptp(bp)) {
+		gem_writel(bp, RXPTPUNI, bottom);
+		gem_writel(bp, TXPTPUNI, bottom);
+	}
+#endif
+
 	/* Clear unused address register sets */
 	macb_or_gem_writel(bp, SA2B, 0);
 	macb_or_gem_writel(bp, SA2T, 0);
@@ -720,7 +727,11 @@ static void macb_mac_link_up(struct phylink_config *config,
 
 	spin_unlock_irqrestore(&bp->lock, flags);
 
-	/* Enable Rx and Tx */
+	/* Enable Rx and Tx; Enable PTP unicast */
+#ifdef CONFIG_MACB_USE_HWSTAMP
+	if (gem_has_ptp(bp))
+		macb_writel(bp, NCR, macb_readl(bp, NCR) | MACB_BIT(PTPUNI));
+#endif
 	macb_writel(bp, NCR, macb_readl(bp, NCR) | MACB_BIT(RE) | MACB_BIT(TE));
 
 	netif_tx_wake_all_queues(ndev);
-- 
2.17.1


^ permalink raw reply related

* [RESEND PATCH 2/2] net: macb: Optimize reading HW timestamp
From: Harini Katakam @ 2022-08-16 11:55 UTC (permalink / raw)
  To: nicolas.ferre, davem, richardcochran, claudiu.beznea,
	andrei.pistirica, kuba, edumazet, pabeni
  Cc: netdev, linux-kernel, michal.simek, harinikatakamlinux,
	harini.katakam, michal.simek, harini.katakam, radhey.shyam.pandey
In-Reply-To: <20220816115500.353-1-harini.katakam@amd.com>

From: Harini Katakam <harini.katakam@xilinx.com>

The seconds input from BD (6 bits) just needs to be ORed with the
upper bits from timer in this function. Avoid +/- operations every
single time. Check for seconds rollover at BIT 5 and subtract the
overhead only in that case.

Signed-off-by: Harini Katakam <harini.katakam@xilinx.com>
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
---
 drivers/net/ethernet/cadence/macb_ptp.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb_ptp.c b/drivers/net/ethernet/cadence/macb_ptp.c
index e6cb20aaa76a..674002661366 100644
--- a/drivers/net/ethernet/cadence/macb_ptp.c
+++ b/drivers/net/ethernet/cadence/macb_ptp.c
@@ -247,6 +247,7 @@ static int gem_hw_timestamp(struct macb *bp, u32 dma_desc_ts_1,
 			    u32 dma_desc_ts_2, struct timespec64 *ts)
 {
 	struct timespec64 tsu;
+	bool sec_rollover = false;
 
 	ts->tv_sec = (GEM_BFEXT(DMA_SECH, dma_desc_ts_2) << GEM_DMA_SECL_SIZE) |
 			GEM_BFEXT(DMA_SECL, dma_desc_ts_1);
@@ -264,9 +265,12 @@ static int gem_hw_timestamp(struct macb *bp, u32 dma_desc_ts_1,
 	 */
 	if ((ts->tv_sec & (GEM_DMA_SEC_TOP >> 1)) &&
 	    !(tsu.tv_sec & (GEM_DMA_SEC_TOP >> 1)))
-		ts->tv_sec -= GEM_DMA_SEC_TOP;
+		sec_rollover = true;
+
+	ts->tv_sec |= ((~GEM_DMA_SEC_MASK) & tsu.tv_sec);
 
-	ts->tv_sec += ((~GEM_DMA_SEC_MASK) & tsu.tv_sec);
+	if (sec_rollover)
+		ts->tv_sec -= GEM_DMA_SEC_TOP;
 
 	return 0;
 }
-- 
2.17.1


^ permalink raw reply related

* [PATCH V2 net-next] net: phy: realtek: add support for RTL8221F(D)(I)-VD-CG
From: wei.fang @ 2022-08-16 19:48 UTC (permalink / raw)
  To: andrew, hkallweit1, linux, davem, edumazet, kuba, pabeni, netdev
  Cc: xiaoning.wang

From: Clark Wang <xiaoning.wang@nxp.com>

RTL8221F(D)(I)-VD-CG is the pin-to-pin upgrade chip from
RTL8221F(D)(I)-CG.

Add new PHY ID for this chip.
It does not support RTL8211F_PHYCR2 anymore, so remove the w/r operation
of this register.

Signed-off-by: Clark Wang <xiaoning.wang@nxp.com>
Signed-off-by: Wei Fang <wei.fang@nxp.com>
---
V2 change:
1. Commit message changed, RTL8221 instead of RTL8821.
2. Add has_phycr2 to struct rtl821x_priv.
---
 drivers/net/phy/realtek.c | 44 ++++++++++++++++++++++++++++-----------
 1 file changed, 32 insertions(+), 12 deletions(-)

diff --git a/drivers/net/phy/realtek.c b/drivers/net/phy/realtek.c
index a5671ab896b3..3d99fd6664d7 100644
--- a/drivers/net/phy/realtek.c
+++ b/drivers/net/phy/realtek.c
@@ -70,6 +70,7 @@
 #define RTLGEN_SPEED_MASK			0x0630
 
 #define RTL_GENERIC_PHYID			0x001cc800
+#define RTL_8211FVD_PHYID			0x001cc878
 
 MODULE_DESCRIPTION("Realtek PHY driver");
 MODULE_AUTHOR("Johnson Leung");
@@ -78,6 +79,7 @@ MODULE_LICENSE("GPL");
 struct rtl821x_priv {
 	u16 phycr1;
 	u16 phycr2;
+	bool has_phycr2;
 };
 
 static int rtl821x_read_page(struct phy_device *phydev)
@@ -94,6 +96,7 @@ static int rtl821x_probe(struct phy_device *phydev)
 {
 	struct device *dev = &phydev->mdio.dev;
 	struct rtl821x_priv *priv;
+	u32 phy_id = phydev->drv->phy_id;
 	int ret;
 
 	priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
@@ -108,13 +111,16 @@ static int rtl821x_probe(struct phy_device *phydev)
 	if (of_property_read_bool(dev->of_node, "realtek,aldps-enable"))
 		priv->phycr1 |= RTL8211F_ALDPS_PLL_OFF | RTL8211F_ALDPS_ENABLE | RTL8211F_ALDPS_XTAL_OFF;
 
-	ret = phy_read_paged(phydev, 0xa43, RTL8211F_PHYCR2);
-	if (ret < 0)
-		return ret;
+	priv->has_phycr2 = !(phy_id == RTL_8211FVD_PHYID);
+	if (priv->has_phycr2) {
+		ret = phy_read_paged(phydev, 0xa43, RTL8211F_PHYCR2);
+		if (ret < 0)
+			return ret;
 
-	priv->phycr2 = ret & RTL8211F_CLKOUT_EN;
-	if (of_property_read_bool(dev->of_node, "realtek,clkout-disable"))
-		priv->phycr2 &= ~RTL8211F_CLKOUT_EN;
+		priv->phycr2 = ret & RTL8211F_CLKOUT_EN;
+		if (of_property_read_bool(dev->of_node, "realtek,clkout-disable"))
+			priv->phycr2 &= ~RTL8211F_CLKOUT_EN;
+	}
 
 	phydev->priv = priv;
 
@@ -400,12 +406,14 @@ static int rtl8211f_config_init(struct phy_device *phydev)
 			val_rxdly ? "enabled" : "disabled");
 	}
 
-	ret = phy_modify_paged(phydev, 0xa43, RTL8211F_PHYCR2,
-			       RTL8211F_CLKOUT_EN, priv->phycr2);
-	if (ret < 0) {
-		dev_err(dev, "clkout configuration failed: %pe\n",
-			ERR_PTR(ret));
-		return ret;
+	if (priv->has_phycr2) {
+		ret = phy_modify_paged(phydev, 0xa43, RTL8211F_PHYCR2,
+				       RTL8211F_CLKOUT_EN, priv->phycr2);
+		if (ret < 0) {
+			dev_err(dev, "clkout configuration failed: %pe\n",
+				ERR_PTR(ret));
+			return ret;
+		}
 	}
 
 	return genphy_soft_reset(phydev);
@@ -923,6 +931,18 @@ static struct phy_driver realtek_drvs[] = {
 		.resume		= rtl821x_resume,
 		.read_page	= rtl821x_read_page,
 		.write_page	= rtl821x_write_page,
+	}, {
+		PHY_ID_MATCH_EXACT(RTL_8211FVD_PHYID),
+		.name		= "RTL8211F-VD Gigabit Ethernet",
+		.probe		= rtl821x_probe,
+		.config_init	= &rtl8211f_config_init,
+		.read_status	= rtlgen_read_status,
+		.config_intr	= &rtl8211f_config_intr,
+		.handle_interrupt = rtl8211f_handle_interrupt,
+		.suspend	= genphy_suspend,
+		.resume		= rtl821x_resume,
+		.read_page	= rtl821x_read_page,
+		.write_page	= rtl821x_write_page,
 	}, {
 		.name		= "Generic FE-GE Realtek PHY",
 		.match_phy_device = rtlgen_match_phy_device,
-- 
2.25.1


^ permalink raw reply related

* [PATCH 8/8] tty: Make ->set_termios() old ktermios const
From: Ilpo Järvinen @ 2022-08-16 11:57 UTC (permalink / raw)
  To: linux-serial, Greg Kroah-Hartman, Jiri Slaby, Andy Shevchenko,
	Arnd Bergmann, Samuel Iglesias Gonsalvez, Jens Taprogge,
	Ulf Hansson, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David Lin, Johan Hovold,
	Alex Elder, Shawn Guo, Sascha Hauer, Pengutronix Kernel Team,
	Fabio Estevam, NXP Linux Team, Oliver Neukum, Marcel Holtmann,
	Johan Hedberg, Luiz Augusto von Dentz, linux-kernel,
	industrypack-devel, linux-mmc, linux-usb, netdev, linux-s390,
	linux-staging, greybus-dev, linux-arm-kernel, linux-bluetooth
  Cc: Ilpo Järvinen
In-Reply-To: <20220816115739.10928-1-ilpo.jarvinen@linux.intel.com>

There should be no reason to adjust old ktermios which is going to get
discarded anyway.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
---
 drivers/char/pcmcia/synclink_cs.c   | 3 ++-
 drivers/ipack/devices/ipoctal.c     | 2 +-
 drivers/mmc/core/sdio_uart.c        | 4 ++--
 drivers/net/usb/hso.c               | 3 ++-
 drivers/s390/char/tty3270.c         | 2 +-
 drivers/staging/fwserial/fwserial.c | 3 ++-
 drivers/staging/greybus/uart.c      | 2 +-
 drivers/tty/amiserial.c             | 6 +++---
 drivers/tty/moxa.c                  | 9 +++++----
 drivers/tty/mxser.c                 | 6 ++++--
 drivers/tty/n_gsm.c                 | 3 ++-
 drivers/tty/pty.c                   | 2 +-
 drivers/tty/serial/serial_core.c    | 6 +++---
 drivers/tty/synclink_gt.c           | 3 ++-
 drivers/usb/class/cdc-acm.c         | 4 ++--
 drivers/usb/serial/usb-serial.c     | 3 ++-
 include/linux/tty_driver.h          | 4 ++--
 net/bluetooth/rfcomm/tty.c          | 3 ++-
 18 files changed, 39 insertions(+), 29 deletions(-)

diff --git a/drivers/char/pcmcia/synclink_cs.c b/drivers/char/pcmcia/synclink_cs.c
index 8fc49b038372..b2735be81ab2 100644
--- a/drivers/char/pcmcia/synclink_cs.c
+++ b/drivers/char/pcmcia/synclink_cs.c
@@ -2274,7 +2274,8 @@ static int mgslpc_ioctl(struct tty_struct *tty,
  *	tty		pointer to tty structure
  *	termios		pointer to buffer to hold returned old termios
  */
-static void mgslpc_set_termios(struct tty_struct *tty, struct ktermios *old_termios)
+static void mgslpc_set_termios(struct tty_struct *tty,
+			       const struct ktermios *old_termios)
 {
 	MGSLPC_INFO *info = (MGSLPC_INFO *)tty->driver_data;
 	unsigned long flags;
diff --git a/drivers/ipack/devices/ipoctal.c b/drivers/ipack/devices/ipoctal.c
index 20d2b9ec1227..fc00274070b6 100644
--- a/drivers/ipack/devices/ipoctal.c
+++ b/drivers/ipack/devices/ipoctal.c
@@ -497,7 +497,7 @@ static unsigned int ipoctal_chars_in_buffer(struct tty_struct *tty)
 }
 
 static void ipoctal_set_termios(struct tty_struct *tty,
-				struct ktermios *old_termios)
+				const struct ktermios *old_termios)
 {
 	unsigned int cflag;
 	unsigned char mr1 = 0;
diff --git a/drivers/mmc/core/sdio_uart.c b/drivers/mmc/core/sdio_uart.c
index 414aa82abc39..ae7ef2e038be 100644
--- a/drivers/mmc/core/sdio_uart.c
+++ b/drivers/mmc/core/sdio_uart.c
@@ -246,7 +246,7 @@ static inline void sdio_uart_update_mctrl(struct sdio_uart_port *port,
 
 static void sdio_uart_change_speed(struct sdio_uart_port *port,
 				   struct ktermios *termios,
-				   struct ktermios *old)
+				   const struct ktermios *old)
 {
 	unsigned char cval, fcr = 0;
 	unsigned int baud, quot;
@@ -859,7 +859,7 @@ static void sdio_uart_unthrottle(struct tty_struct *tty)
 }
 
 static void sdio_uart_set_termios(struct tty_struct *tty,
-						struct ktermios *old_termios)
+				  const struct ktermios *old_termios)
 {
 	struct sdio_uart_port *port = tty->driver_data;
 	unsigned int cflag = tty->termios.c_cflag;
diff --git a/drivers/net/usb/hso.c b/drivers/net/usb/hso.c
index f8221a7acf62..ce1f6081d582 100644
--- a/drivers/net/usb/hso.c
+++ b/drivers/net/usb/hso.c
@@ -1380,7 +1380,8 @@ static void hso_serial_cleanup(struct tty_struct *tty)
 }
 
 /* setup the term */
-static void hso_serial_set_termios(struct tty_struct *tty, struct ktermios *old)
+static void hso_serial_set_termios(struct tty_struct *tty,
+				   const struct ktermios *old)
 {
 	struct hso_serial *serial = tty->driver_data;
 	unsigned long flags;
diff --git a/drivers/s390/char/tty3270.c b/drivers/s390/char/tty3270.c
index 5c83f71c1d0e..26e3995ac062 100644
--- a/drivers/s390/char/tty3270.c
+++ b/drivers/s390/char/tty3270.c
@@ -1760,7 +1760,7 @@ tty3270_flush_chars(struct tty_struct *tty)
  * Check for visible/invisible input switches
  */
 static void
-tty3270_set_termios(struct tty_struct *tty, struct ktermios *old)
+tty3270_set_termios(struct tty_struct *tty, const struct ktermios *old)
 {
 	struct tty3270 *tp;
 	int new;
diff --git a/drivers/staging/fwserial/fwserial.c b/drivers/staging/fwserial/fwserial.c
index e8fa7f53cd5e..81b06d88ed0d 100644
--- a/drivers/staging/fwserial/fwserial.c
+++ b/drivers/staging/fwserial/fwserial.c
@@ -1267,7 +1267,8 @@ static int fwtty_ioctl(struct tty_struct *tty, unsigned int cmd,
 	return err;
 }
 
-static void fwtty_set_termios(struct tty_struct *tty, struct ktermios *old)
+static void fwtty_set_termios(struct tty_struct *tty,
+			      const struct ktermios *old)
 {
 	struct fwtty_port *port = tty->driver_data;
 	unsigned int baud;
diff --git a/drivers/staging/greybus/uart.c b/drivers/staging/greybus/uart.c
index dc4ed0ff1ae2..90ff07f2cbf7 100644
--- a/drivers/staging/greybus/uart.c
+++ b/drivers/staging/greybus/uart.c
@@ -480,7 +480,7 @@ static int gb_tty_break_ctl(struct tty_struct *tty, int state)
 }
 
 static void gb_tty_set_termios(struct tty_struct *tty,
-			       struct ktermios *termios_old)
+			       const struct ktermios *termios_old)
 {
 	struct gb_uart_set_line_coding_request newline;
 	struct gb_tty *gb_tty = tty->driver_data;
diff --git a/drivers/tty/amiserial.c b/drivers/tty/amiserial.c
index 81e7f64c1739..f52266766df9 100644
--- a/drivers/tty/amiserial.c
+++ b/drivers/tty/amiserial.c
@@ -94,7 +94,7 @@ static struct tty_driver *serial_driver;
 static unsigned char current_ctl_bits;
 
 static void change_speed(struct tty_struct *tty, struct serial_state *info,
-		struct ktermios *old);
+			 const struct ktermios *old);
 static void rs_wait_until_sent(struct tty_struct *tty, int timeout);
 
 
@@ -566,7 +566,7 @@ static void shutdown(struct tty_struct *tty, struct serial_state *info)
  * the specified baud rate for a serial port.
  */
 static void change_speed(struct tty_struct *tty, struct serial_state *info,
-			 struct ktermios *old_termios)
+			 const struct ktermios *old_termios)
 {
 	struct tty_port *port = &info->tport;
 	int	quot = 0, baud_base, baud;
@@ -1169,7 +1169,7 @@ static int rs_ioctl(struct tty_struct *tty,
 	return 0;
 }
 
-static void rs_set_termios(struct tty_struct *tty, struct ktermios *old_termios)
+static void rs_set_termios(struct tty_struct *tty, const struct ktermios *old_termios)
 {
 	struct serial_state *info = tty->driver_data;
 	unsigned long flags;
diff --git a/drivers/tty/moxa.c b/drivers/tty/moxa.c
index f3c72ab1476c..35b6fddf0341 100644
--- a/drivers/tty/moxa.c
+++ b/drivers/tty/moxa.c
@@ -491,7 +491,7 @@ static int moxa_write(struct tty_struct *, const unsigned char *, int);
 static unsigned int moxa_write_room(struct tty_struct *);
 static void moxa_flush_buffer(struct tty_struct *);
 static unsigned int moxa_chars_in_buffer(struct tty_struct *);
-static void moxa_set_termios(struct tty_struct *, struct ktermios *);
+static void moxa_set_termios(struct tty_struct *, const struct ktermios *);
 static void moxa_stop(struct tty_struct *);
 static void moxa_start(struct tty_struct *);
 static void moxa_hangup(struct tty_struct *);
@@ -499,7 +499,7 @@ static int moxa_tiocmget(struct tty_struct *tty);
 static int moxa_tiocmset(struct tty_struct *tty,
 			 unsigned int set, unsigned int clear);
 static void moxa_poll(struct timer_list *);
-static void moxa_set_tty_param(struct tty_struct *, struct ktermios *);
+static void moxa_set_tty_param(struct tty_struct *, const struct ktermios *);
 static void moxa_shutdown(struct tty_port *);
 static int moxa_carrier_raised(struct tty_port *);
 static void moxa_dtr_rts(struct tty_port *, int);
@@ -1602,7 +1602,7 @@ static int moxa_tiocmset(struct tty_struct *tty,
 }
 
 static void moxa_set_termios(struct tty_struct *tty,
-		struct ktermios *old_termios)
+		             const struct ktermios *old_termios)
 {
 	struct moxa_port *ch = tty->driver_data;
 
@@ -1761,7 +1761,8 @@ static void moxa_poll(struct timer_list *unused)
 
 /******************************************************************************/
 
-static void moxa_set_tty_param(struct tty_struct *tty, struct ktermios *old_termios)
+static void moxa_set_tty_param(struct tty_struct *tty,
+			       const struct ktermios *old_termios)
 {
 	register struct ktermios *ts = &tty->termios;
 	struct moxa_port *ch = tty->driver_data;
diff --git a/drivers/tty/mxser.c b/drivers/tty/mxser.c
index 70b982b2c6b2..3413bd77beed 100644
--- a/drivers/tty/mxser.c
+++ b/drivers/tty/mxser.c
@@ -571,7 +571,8 @@ static void mxser_handle_cts(struct tty_struct *tty, struct mxser_port *info,
  * This routine is called to set the UART divisor registers to match
  * the specified baud rate for a serial port.
  */
-static void mxser_change_speed(struct tty_struct *tty, struct ktermios *old_termios)
+static void mxser_change_speed(struct tty_struct *tty,
+			       const struct ktermios *old_termios)
 {
 	struct mxser_port *info = tty->driver_data;
 	unsigned cflag, cval;
@@ -1348,7 +1349,8 @@ static void mxser_start(struct tty_struct *tty)
 	spin_unlock_irqrestore(&info->slock, flags);
 }
 
-static void mxser_set_termios(struct tty_struct *tty, struct ktermios *old_termios)
+static void mxser_set_termios(struct tty_struct *tty,
+			      const struct ktermios *old_termios)
 {
 	struct mxser_port *info = tty->driver_data;
 	unsigned long flags;
diff --git a/drivers/tty/n_gsm.c b/drivers/tty/n_gsm.c
index caa5c14ed57f..97cd8d67c866 100644
--- a/drivers/tty/n_gsm.c
+++ b/drivers/tty/n_gsm.c
@@ -3647,7 +3647,8 @@ static int gsmtty_ioctl(struct tty_struct *tty,
 	}
 }
 
-static void gsmtty_set_termios(struct tty_struct *tty, struct ktermios *old)
+static void gsmtty_set_termios(struct tty_struct *tty,
+			       const struct ktermios *old)
 {
 	struct gsm_dlci *dlci = tty->driver_data;
 	if (dlci->state == DLCI_CLOSED)
diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
index 752dab3356d7..07394fdaf522 100644
--- a/drivers/tty/pty.c
+++ b/drivers/tty/pty.c
@@ -240,7 +240,7 @@ static int pty_open(struct tty_struct *tty, struct file *filp)
 }
 
 static void pty_set_termios(struct tty_struct *tty,
-					struct ktermios *old_termios)
+			    const struct ktermios *old_termios)
 {
 	/* See if packet mode change of state. */
 	if (tty->link && tty->link->ctrl.packet) {
diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
index c8295904b331..a8202b582bcc 100644
--- a/drivers/tty/serial/serial_core.c
+++ b/drivers/tty/serial/serial_core.c
@@ -49,7 +49,7 @@ static struct lock_class_key port_lock_key;
 #define RS485_MAX_RTS_DELAY	100 /* msecs */
 
 static void uart_change_speed(struct tty_struct *tty, struct uart_state *state,
-					struct ktermios *old_termios);
+			      const struct ktermios *old_termios);
 static void uart_wait_until_sent(struct tty_struct *tty, int timeout);
 static void uart_change_pm(struct uart_state *state,
 			   enum uart_pm_state pm_state);
@@ -492,7 +492,7 @@ EXPORT_SYMBOL(uart_get_divisor);
 
 /* Caller holds port mutex */
 static void uart_change_speed(struct tty_struct *tty, struct uart_state *state,
-					struct ktermios *old_termios)
+			      const struct ktermios *old_termios)
 {
 	struct uart_port *uport = uart_port_check(state);
 	struct ktermios *termios;
@@ -1619,7 +1619,7 @@ static void uart_set_ldisc(struct tty_struct *tty)
 }
 
 static void uart_set_termios(struct tty_struct *tty,
-						struct ktermios *old_termios)
+			     const struct ktermios *old_termios)
 {
 	struct uart_state *state = tty->driver_data;
 	struct uart_port *uport;
diff --git a/drivers/tty/synclink_gt.c b/drivers/tty/synclink_gt.c
index 9bc2a9265277..4a003e929776 100644
--- a/drivers/tty/synclink_gt.c
+++ b/drivers/tty/synclink_gt.c
@@ -707,7 +707,8 @@ static void hangup(struct tty_struct *tty)
 	wake_up_interruptible(&info->port.open_wait);
 }
 
-static void set_termios(struct tty_struct *tty, struct ktermios *old_termios)
+static void set_termios(struct tty_struct *tty,
+			const struct ktermios *old_termios)
 {
 	struct slgt_info *info = tty->driver_data;
 	unsigned long flags;
diff --git a/drivers/usb/class/cdc-acm.c b/drivers/usb/class/cdc-acm.c
index 483bcb1213f7..46dbf907e4b5 100644
--- a/drivers/usb/class/cdc-acm.c
+++ b/drivers/usb/class/cdc-acm.c
@@ -51,7 +51,7 @@ static DEFINE_IDR(acm_minors);
 static DEFINE_MUTEX(acm_minors_lock);
 
 static void acm_tty_set_termios(struct tty_struct *tty,
-				struct ktermios *termios_old);
+				const struct ktermios *termios_old);
 
 /*
  * acm_minors accessors
@@ -1049,7 +1049,7 @@ static int acm_tty_ioctl(struct tty_struct *tty,
 }
 
 static void acm_tty_set_termios(struct tty_struct *tty,
-						struct ktermios *termios_old)
+				const struct ktermios *termios_old)
 {
 	struct acm *acm = tty->driver_data;
 	struct ktermios *termios = &tty->termios;
diff --git a/drivers/usb/serial/usb-serial.c b/drivers/usb/serial/usb-serial.c
index e35bea2235c1..164521ee10c6 100644
--- a/drivers/usb/serial/usb-serial.c
+++ b/drivers/usb/serial/usb-serial.c
@@ -519,7 +519,8 @@ static int serial_ioctl(struct tty_struct *tty,
 	return retval;
 }
 
-static void serial_set_termios(struct tty_struct *tty, struct ktermios *old)
+static void serial_set_termios(struct tty_struct *tty,
+		               const struct ktermios *old)
 {
 	struct usb_serial_port *port = tty->driver_data;
 
diff --git a/include/linux/tty_driver.h b/include/linux/tty_driver.h
index 4841d8069c07..b2456b545ba0 100644
--- a/include/linux/tty_driver.h
+++ b/include/linux/tty_driver.h
@@ -141,7 +141,7 @@ struct serial_struct;
  *
  *	Optional.
  *
- * @set_termios: ``void ()(struct tty_struct *tty, struct ktermios *old)``
+ * @set_termios: ``void ()(struct tty_struct *tty, const struct ktermios *old)``
  *
  *	This routine allows the @tty driver to be notified when device's
  *	termios settings have changed. New settings are in @tty->termios.
@@ -365,7 +365,7 @@ struct tty_operations {
 		    unsigned int cmd, unsigned long arg);
 	long (*compat_ioctl)(struct tty_struct *tty,
 			     unsigned int cmd, unsigned long arg);
-	void (*set_termios)(struct tty_struct *tty, struct ktermios * old);
+	void (*set_termios)(struct tty_struct *tty, const struct ktermios *old);
 	void (*throttle)(struct tty_struct * tty);
 	void (*unthrottle)(struct tty_struct * tty);
 	void (*stop)(struct tty_struct *tty);
diff --git a/net/bluetooth/rfcomm/tty.c b/net/bluetooth/rfcomm/tty.c
index ebd78fdbd6e8..b9536641161c 100644
--- a/net/bluetooth/rfcomm/tty.c
+++ b/net/bluetooth/rfcomm/tty.c
@@ -855,7 +855,8 @@ static int rfcomm_tty_ioctl(struct tty_struct *tty, unsigned int cmd, unsigned l
 	return -ENOIOCTLCMD;
 }
 
-static void rfcomm_tty_set_termios(struct tty_struct *tty, struct ktermios *old)
+static void rfcomm_tty_set_termios(struct tty_struct *tty,
+				   const struct ktermios *old)
 {
 	struct ktermios *new = &tty->termios;
 	int old_baud_rate = tty_termios_baud_rate(old);
-- 
2.30.2


^ permalink raw reply related

* Re: [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections
From: Tony Lu @ 2022-08-16 12:40 UTC (permalink / raw)
  To: Jan Karcher
  Cc: D. Wythe, kgraul, wenjia, kuba, davem, netdev, linux-s390,
	linux-rdma
In-Reply-To: <2182efbc-99f8-17ba-d344-95a467536b05@linux.ibm.com>

On Tue, Aug 16, 2022 at 11:35:15AM +0200, Jan Karcher wrote:
> 
> 
> On 10.08.2022 19:47, D. Wythe wrote:
> > From: "D. Wythe" <alibuda@linux.alibaba.com>
> > 
> > This patch set attempts to optimize the parallelism of SMC-R connections,
> > mainly to reduce unnecessary blocking on locks, and to fix exceptions that
> > occur after thoses optimization.
> > 
> 
> Thank you again for your submission!
> Let me give you a quick update from our side:
> We tested your patches on top of the net-next kernel on our s390 systems.
> They did crash our systems. After verifying our environment we pulled
> console logs and now we can tell that there is indeed a problem with your
> patches regarding SMC-D. So please do not integrate this change as of right
> now. I'm going to do more in depth reviews of your patches but i need some
> time for them so here is a quick a description of the problem:
> 
> It is a SMC-D problem, that occurs while building up the connection. In
> smc_conn_create you set struct smc_lnk_cluster *lnkc = NULL. For the SMC-R
> path you do grab the pointer, for SMC-D that never happens. Still you are
> using this refernce for SMC-D => Crash. This problem can be reproduced using
> the SMC-D path. Here is an example console output:

Got it.

> 
> [  779.516382] Unable to handle kernel pointer dereference in virtual kernel
> address space
> [  779.516389] Failing address: 0000000000000000 TEID: 0000000000000483
> [  779.516391] Fault in home space mode while using kernel ASCE.
> [  779.516395] AS:0000000069628007 R3:00000000ffbf0007 S:00000000ffbef800
> P:000000000000003d
> [  779.516431] Oops: 0004 ilc:2 [#1] SMP
> [  779.516436] Modules linked in: tcp_diag inet_diag ism mlx5_ib ib_uverbs
> mlx5_core smc_diag smc ib_core nft_fib_inet nft_fib_ipv4
> nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6
> nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv
> 6 nf_defrag_ipv4 ip_set nf_tables n
> [  779.516470] CPU: 0 PID: 24 Comm: kworker/0:1 Not tainted
> 5.19.0-13940-g22a46254655a #3
> [  779.516476] Hardware name: IBM 8561 T01 701 (z/VM 7.2.0)
> 
> [  779.522738] Workqueue: smc_hs_wq smc_listen_work [smc]
> [  779.522755] Krnl PSW : 0704c00180000000 000003ff803da89c
> (smc_conn_create+0x174/0x968 [smc])
> [  779.522766]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0
> RI:0 EA:3
> [  779.522770] Krnl GPRS: 0000000000000002 0000000000000000 0000000000000001
> 0000000000000000
> [  779.522773]            000000008a4128a0 000003ff803f21aa 000000008e30d640
> 0000000086d72000
> [  779.522776]            0000000086d72000 000000008a412803 000000008a412800
> 000000008e30d650
> [  779.522779]            0000000080934200 0000000000000000 000003ff803cb954
> 00000380002dfa88
> [  779.522789] Krnl Code: 000003ff803da88e: e310f0e80024        stg
> %r1,232(%r15)
> [  779.522789]            000003ff803da894: a7180000            lhi %r1,0
> [  779.522789]           #000003ff803da898: 582003ac            l %r2,940
> [  779.522789]           >000003ff803da89c: ba123020            cs
> %r1,%r2,32(%r3)
> [  779.522789]            000003ff803da8a0: ec1603be007e        cij
> %r1,0,6,000003ff803db01c
> 
> [  779.522789]            000003ff803da8a6: 4110b002            la
> %r1,2(%r11)
> [  779.522789]            000003ff803da8aa: e310f0f00024        stg
> %r1,240(%r15)
> [  779.522789]            000003ff803da8b0: e310f0c00004        lg
> %r1,192(%r15)
> [  779.522870] Call Trace:
> [  779.522873]  [<000003ff803da89c>] smc_conn_create+0x174/0x968 [smc]
> [  779.522884]  [<000003ff803cb954>] smc_find_ism_v2_device_serv+0x1b4/0x300
> [smc]
> 01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop
> from CPU 01.
> 01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop
> from CPU 00.
> [  779.522894]  [<000003ff803cbace>] smc_listen_find_device+0x2e/0x370 [smc]
> 
> 
> I'm going to send the review for the first patch right away (which is the
> one causing the crash), so far I'm done with it. The others are going to
> follow. Maybe you can look over the problem and come up with a solution,
> otherwise we are going to decide if we want to look into it as soon as I'm
> done with the reviews. Thank you for your patience.

Thanks for pointing this issue. We will fix this soon in v2.

Tony Lu

^ permalink raw reply

* Re: [net v2 1/1] ice: Fix crash by keep old cfg when update TCs more than queues
From: Ding Hui @ 2022-08-16 12:44 UTC (permalink / raw)
  To: Anatolii Gerasymenko, jesse.brandeburg, anthony.l.nguyen, davem,
	edumazet, kuba, pabeni, keescook, intel-wired-lan
  Cc: netdev, linux-kernel, linux-hardening
In-Reply-To: <dd4f9e5d-d8d7-3069-21ee-7897b3d10d3d@intel.com>

On 2022/8/16 17:13, Anatolii Gerasymenko wrote:
> On 15.08.2022 03:18, Ding Hui wrote:
>> There are problems if allocated queues less than Traffic Classes.
>>
>> Commit a632b2a4c920 ("ice: ethtool: Prohibit improper channel config
>> for DCB") already disallow setting less queues than TCs.
>>
>> Another case is if we first set less queues, and later update more TCs
>> config due to LLDP, ice_vsi_cfg_tc() will failed but left dirty
>> num_txq/rxq and tc_cfg in vsi, that will cause invalid porinter access.
> 
> Nice catch. Looks good to me.

Thanks, I'll send v3 later, could I add Acked-by: tag too?

>   
>> [   95.968089] ice 0000:3b:00.1: More TCs defined than queues/rings allocated.
>> [   95.968092] ice 0000:3b:00.1: Trying to use more Rx queues (8), than were allocated (1)!
>> [   95.968093] ice 0000:3b:00.1: Failed to config TC for VSI index: 0
>> [   95.969621] general protection fault: 0000 [#1] SMP NOPTI
>> [   95.969705] CPU: 1 PID: 58405 Comm: lldpad Kdump: loaded Tainted: G     U  W  O     --------- -t - 4.18.0 #1
>> [   95.969867] Hardware name: O.E.M/BC11SPSCB10, BIOS 8.23 12/30/2021
>> [   95.969992] RIP: 0010:devm_kmalloc+0xa/0x60
>> [   95.970052] Code: 5c ff ff ff 31 c0 5b 5d 41 5c c3 b8 f4 ff ff ff eb f4 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 89 d1 <8b> 97 60 02 00 00 48 8d 7e 18 48 39 f7 72 3f 55 89 ce 53 48 8b 4c
>> [   95.970344] RSP: 0018:ffffc9003f553888 EFLAGS: 00010206
>> [   95.970425] RAX: dead000000000200 RBX: ffffea003c425b00 RCX: 00000000006080c0
>> [   95.970536] RDX: 00000000006080c0 RSI: 0000000000000200 RDI: dead000000000200
>> [   95.970648] RBP: dead000000000200 R08: 00000000000463c0 R09: ffff888ffa900000
>> [   95.970760] R10: 0000000000000000 R11: 0000000000000002 R12: ffff888ff6b40100
>> [   95.970870] R13: ffff888ff6a55018 R14: 0000000000000000 R15: ffff888ff6a55460
>> [   95.970981] FS:  00007f51b7d24700(0000) GS:ffff88903ee80000(0000) knlGS:0000000000000000
>> [   95.971108] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   95.971197] CR2: 00007fac5410d710 CR3: 0000000f2c1de002 CR4: 00000000007606e0
>> [   95.971309] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [   95.971419] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [   95.971530] PKRU: 55555554
>> [   95.971573] Call Trace:
>> [   95.971622]  ice_setup_rx_ring+0x39/0x110 [ice]
>> [   95.971695]  ice_vsi_setup_rx_rings+0x54/0x90 [ice]
>> [   95.971774]  ice_vsi_open+0x25/0x120 [ice]
>> [   95.971843]  ice_open_internal+0xb8/0x1f0 [ice]
>> [   95.971919]  ice_ena_vsi+0x4f/0xd0 [ice]
>> [   95.971987]  ice_dcb_ena_dis_vsi.constprop.5+0x29/0x90 [ice]
>> [   95.972082]  ice_pf_dcb_cfg+0x29a/0x380 [ice]
>> [   95.972154]  ice_dcbnl_setets+0x174/0x1b0 [ice]
>> [   95.972220]  dcbnl_ieee_set+0x89/0x230
>> [   95.972279]  ? dcbnl_ieee_del+0x150/0x150
>> [   95.972341]  dcb_doit+0x124/0x1b0
>> [   95.972392]  rtnetlink_rcv_msg+0x243/0x2f0
>> [   95.972457]  ? dcb_doit+0x14d/0x1b0
>> [   95.972510]  ? __kmalloc_node_track_caller+0x1d3/0x280
>> [   95.972591]  ? rtnl_calcit.isra.31+0x100/0x100
>> [   95.972661]  netlink_rcv_skb+0xcf/0xf0
>> [   95.972720]  netlink_unicast+0x16d/0x220
>> [   95.972781]  netlink_sendmsg+0x2ba/0x3a0
>> [   95.975891]  sock_sendmsg+0x4c/0x50
>> [   95.979032]  ___sys_sendmsg+0x2e4/0x300
>> [   95.982147]  ? kmem_cache_alloc+0x13e/0x190
>> [   95.985242]  ? __wake_up_common_lock+0x79/0x90
>> [   95.988338]  ? __check_object_size+0xac/0x1b0
>> [   95.991440]  ? _copy_to_user+0x22/0x30
>> [   95.994539]  ? move_addr_to_user+0xbb/0xd0
>> [   95.997619]  ? __sys_sendmsg+0x53/0x80
>> [   96.000664]  __sys_sendmsg+0x53/0x80
>> [   96.003747]  do_syscall_64+0x5b/0x1d0
>> [   96.006862]  entry_SYSCALL_64_after_hwframe+0x65/0xca
>>
>> Only update num_txq/rxq when passed check, and restore tc_cfg if setup
>> queue map failed.
>>
>> Signed-off-by: Ding Hui <dinghui@sangfor.com.cn>
> 
> Please, also add Fixes tag.
> 
>> ---
>>   drivers/net/ethernet/intel/ice/ice_lib.c | 42 +++++++++++++++---------
>>   1 file changed, 26 insertions(+), 16 deletions(-)
>>
>> ---
>> v1:
>> https://patchwork.kernel.org/project/netdevbpf/patch/20220812123933.5481-1-dinghui@sangfor.com.cn/
>>
>> v2:
>>    rewrite subject
>>    rebase to net
>>
>> diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
>> index a830f7f9aed0..6e64cca30351 100644
>> --- a/drivers/net/ethernet/intel/ice/ice_lib.c
>> +++ b/drivers/net/ethernet/intel/ice/ice_lib.c
>> @@ -914,7 +914,7 @@ static void ice_set_dflt_vsi_ctx(struct ice_hw *hw, struct ice_vsi_ctx *ctxt)
>>    */
>>   static int ice_vsi_setup_q_map(struct ice_vsi *vsi, struct ice_vsi_ctx *ctxt)
>>   {
>> -	u16 offset = 0, qmap = 0, tx_count = 0, pow = 0;
>> +	u16 offset = 0, qmap = 0, tx_count = 0, rx_count = 0, pow = 0;
>>   	u16 num_txq_per_tc, num_rxq_per_tc;
>>   	u16 qcount_tx = vsi->alloc_txq;
>>   	u16 qcount_rx = vsi->alloc_rxq;
>> @@ -981,23 +981,25 @@ static int ice_vsi_setup_q_map(struct ice_vsi *vsi, struct ice_vsi_ctx *ctxt)
>>   	 * at least 1)
>>   	 */
>>   	if (offset)
>> -		vsi->num_rxq = offset;
>> +		rx_count = offset;
>>   	else
>> -		vsi->num_rxq = num_rxq_per_tc;
>> +		rx_count = num_rxq_per_tc;
>>   
>> -	if (vsi->num_rxq > vsi->alloc_rxq) {
>> +	if (rx_count > vsi->alloc_rxq) {
>>   		dev_err(ice_pf_to_dev(vsi->back), "Trying to use more Rx queues (%u), than were allocated (%u)!\n",
>> -			vsi->num_rxq, vsi->alloc_rxq);
>> +			rx_count, vsi->alloc_rxq);
>>   		return -EINVAL;
>>   	}
>>   
>> -	vsi->num_txq = tx_count;
>> -	if (vsi->num_txq > vsi->alloc_txq) {
>> +	if (tx_count > vsi->alloc_txq) {
>>   		dev_err(ice_pf_to_dev(vsi->back), "Trying to use more Tx queues (%u), than were allocated (%u)!\n",
>> -			vsi->num_txq, vsi->alloc_txq);
>> +			tx_count, vsi->alloc_txq);
>>   		return -EINVAL;
>>   	}
>>   
>> +	vsi->num_txq = tx_count;
>> +	vsi->num_rxq = rx_count;
>> +
>>   	if (vsi->type == ICE_VSI_VF && vsi->num_txq != vsi->num_rxq) {
>>   		dev_dbg(ice_pf_to_dev(vsi->back), "VF VSI should have same number of Tx and Rx queues. Hence making them equal\n");
>>   		/* since there is a chance that num_rxq could have been changed
>> @@ -3492,6 +3494,7 @@ ice_vsi_setup_q_map_mqprio(struct ice_vsi *vsi, struct ice_vsi_ctx *ctxt,
>>   	int tc0_qcount = vsi->mqprio_qopt.qopt.count[0];
>>   	u8 netdev_tc = 0;
>>   	int i;
>> +	u16 new_txq, new_rxq;
> 
> Please follow the Reverse Christmas Tree (RCT) convention.
> 
>>   	vsi->tc_cfg.ena_tc = ena_tc ? ena_tc : 1;
>>   
>> @@ -3530,21 +3533,24 @@ ice_vsi_setup_q_map_mqprio(struct ice_vsi *vsi, struct ice_vsi_ctx *ctxt,
>>   		}
>>   	}
>>   
>> -	/* Set actual Tx/Rx queue pairs */
>> -	vsi->num_txq = offset + qcount_tx;
>> -	if (vsi->num_txq > vsi->alloc_txq) {
>> +	new_txq = offset + qcount_tx;
>> +	if (new_txq > vsi->alloc_txq) {
>>   		dev_err(ice_pf_to_dev(vsi->back), "Trying to use more Tx queues (%u), than were allocated (%u)!\n",
>> -			vsi->num_txq, vsi->alloc_txq);
>> +			new_txq, vsi->alloc_txq);
>>   		return -EINVAL;
>>   	}
>>   
>> -	vsi->num_rxq = offset + qcount_rx;
>> -	if (vsi->num_rxq > vsi->alloc_rxq) {
>> +	new_rxq = offset + qcount_rx;
>> +	if (new_rxq > vsi->alloc_rxq) {
>>   		dev_err(ice_pf_to_dev(vsi->back), "Trying to use more Rx queues (%u), than were allocated (%u)!\n",
>> -			vsi->num_rxq, vsi->alloc_rxq);
>> +			new_rxq, vsi->alloc_rxq);
>>   		return -EINVAL;
>>   	}
>>   
>> +	/* Set actual Tx/Rx queue pairs */
>> +	vsi->num_txq = new_txq;
>> +	vsi->num_rxq = new_rxq;
>> +
>>   	/* Setup queue TC[0].qmap for given VSI context */
>>   	ctxt->info.tc_mapping[0] = cpu_to_le16(qmap);
>>   	ctxt->info.q_mapping[0] = cpu_to_le16(vsi->rxq_map[0]);
>> @@ -3580,6 +3586,7 @@ int ice_vsi_cfg_tc(struct ice_vsi *vsi, u8 ena_tc)
>>   	struct device *dev;
>>   	int i, ret = 0;
>>   	u8 num_tc = 0;
>> +	struct ice_tc_cfg old_tc_cfg;
> 
> RCT here also.
>    
>>   	dev = ice_pf_to_dev(pf);
>>   	if (vsi->tc_cfg.ena_tc == ena_tc &&
>> @@ -3600,6 +3607,7 @@ int ice_vsi_cfg_tc(struct ice_vsi *vsi, u8 ena_tc)
>>   			max_txqs[i] = vsi->num_txq;
>>   	}
>>   
>> +	memcpy(&old_tc_cfg, &vsi->tc_cfg, sizeof(old_tc_cfg));
>>   	vsi->tc_cfg.ena_tc = ena_tc;
>>   	vsi->tc_cfg.numtc = num_tc;
>>   
>> @@ -3616,8 +3624,10 @@ int ice_vsi_cfg_tc(struct ice_vsi *vsi, u8 ena_tc)
>>   	else
>>   		ret = ice_vsi_setup_q_map(vsi, ctx);
>>   
>> -	if (ret)
>> +	if (ret) {
>> +		memcpy(&vsi->tc_cfg, &old_tc_cfg, sizeof(vsi->tc_cfg));
>>   		goto out;
>> +	}
>>   
>>   	/* must to indicate which section of VSI context are being modified */
>>   	ctx->info.valid_sections = cpu_to_le16(ICE_AQ_VSI_PROP_RXQ_MAP_VALID);


-- 
Thanks,
- Ding Hui

^ permalink raw reply

* Re: [PATCH net-next 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending
From: Tony Lu @ 2022-08-16 12:47 UTC (permalink / raw)
  To: Jan Karcher
  Cc: D. Wythe, kgraul, wenjia, kuba, davem, netdev, linux-s390,
	linux-rdma
In-Reply-To: <853f7c69-8690-1ceb-9256-af5c2a5b2ae0@linux.ibm.com>

On Tue, Aug 16, 2022 at 11:43:23AM +0200, Jan Karcher wrote:
> 
> 
> On 10.08.2022 19:47, D. Wythe wrote:
> > From: "D. Wythe" <alibuda@linux.alibaba.com>
> > 
> > This patch attempts to remove locks named smc_client_lgr_pending and
> > smc_server_lgr_pending, which aim to serialize the creation of link
> > group. However, once link group existed already, those locks are
> > meaningless, worse still, they make incoming connections have to be
> > queued one after the other.
> > 
> > Now, the creation of link group is no longer generated by competition,
> > but allocated through following strategy.
> > 
> > 1. Try to find a suitable link group, if successd, current connection
> > is considered as NON first contact connection. ends.
> > 
> > 2. Check the number of connections currently waiting for a suitable
> > link group to be created, if it is not less that the number of link
> > groups to be created multiplied by (SMC_RMBS_PER_LGR_MAX - 1), then
> > increase the number of link groups to be created, current connection
> > is considered as the first contact connection. ends.
> > 
> > 3. Increase the number of connections currently waiting, and wait
> > for woken up.
> > 
> > 4. Decrease the number of connections currently waiting, goto 1.
> > 
> > We wake up the connection that was put to sleep in stage 3 through
> > the SMC link state change event. Once the link moves out of the
> > SMC_LNK_ACTIVATING state, decrease the number of link groups to
> > be created, and then wake up at most (SMC_RMBS_PER_LGR_MAX - 1)
> > connections.
> > 
> > In the iplementation, we introduce the concept of lnk cluster, which is
> > a collection of links with the same characteristics (see
> > smcr_lnk_cluster_cmpfn() with more details), which makes it possible to
> > wake up efficiently in the scenario of N v.s 1.
> > 
> > Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> > ---
> >   net/smc/af_smc.c   |  11 +-
> >   net/smc/smc_core.c | 356 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
> >   net/smc/smc_core.h |  48 ++++++++
> >   net/smc/smc_llc.c  |   9 +-
> >   4 files changed, 411 insertions(+), 13 deletions(-)
> > 
> > diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> > index 79c1318..af4b0aa 100644
> > --- a/net/smc/af_smc.c
> > +++ b/net/smc/af_smc.c
> > @@ -1194,10 +1194,8 @@ static int smc_connect_rdma(struct smc_sock *smc,
> >   	if (reason_code)
> >   		return reason_code;
> > 
> > -	mutex_lock(&smc_client_lgr_pending);
> >   	reason_code = smc_conn_create(smc, ini);
> >   	if (reason_code) {
> > -		mutex_unlock(&smc_client_lgr_pending);
> >   		return reason_code;
> >   	}
> > 
> > @@ -1289,7 +1287,6 @@ static int smc_connect_rdma(struct smc_sock *smc,
> >   		if (reason_code)
> >   			goto connect_abort;
> >   	}
> > -	mutex_unlock(&smc_client_lgr_pending);
> > 
> >   	smc_copy_sock_settings_to_clc(smc);
> >   	smc->connect_nonblock = 0;
> > @@ -1299,7 +1296,6 @@ static int smc_connect_rdma(struct smc_sock *smc,
> >   	return 0;
> >   connect_abort:
> >   	smc_conn_abort(smc, ini->first_contact_local);
> > -	mutex_unlock(&smc_client_lgr_pending);
> >   	smc->connect_nonblock = 0;
> > 
> >   	return reason_code;
> 
> 
> You are removing the locking mechanism out of this function completly, which
> is fine because it is only called for a SMC-R connection.
> 
> 
> > @@ -2377,7 +2373,8 @@ static void smc_listen_work(struct work_struct *work)
> >   	if (rc)
> >   		goto out_decl;
> > 
> > -	mutex_lock(&smc_server_lgr_pending);
> > +	if (ini->is_smcd)
> > +		mutex_lock(&smc_server_lgr_pending);
> >   	smc_close_init(new_smc);
> >   	smc_rx_init(new_smc);
> >   	smc_tx_init(new_smc);
> > @@ -2415,7 +2412,6 @@ static void smc_listen_work(struct work_struct *work)
> >   					    ini->first_contact_local, ini);
> >   		if (rc)
> >   			goto out_unlock;
> > -		mutex_unlock(&smc_server_lgr_pending);
> >   	}
> >   	smc_conn_save_peer_info(new_smc, cclc);
> >   	smc_listen_out_connected(new_smc);
> > @@ -2423,7 +2419,8 @@ static void smc_listen_work(struct work_struct *work)
> >   	goto out_free;
> > 
> >   out_unlock:
> > -	mutex_unlock(&smc_server_lgr_pending);
> > +	if (ini->is_smcd)
> > +		mutex_unlock(&smc_server_lgr_pending);
> 
> 
> You want to remove the mutex lock for SMC-R so you are only locking for a
> SMC-D connection. So far so good. I think you could also remove this unlock
> call since it is only called in the case of a SMC-R connection - which means
> it is obsolete:
> 
> l2398 ff. (with your patch on net-next)
> 
>     /* receive SMC Confirm CLC message */
>     memset(buf, 0, sizeof(*buf));
>     cclc = (struct smc_clc_msg_accept_confirm *)buf;
>     rc = smc_clc_wait_msg(new_smc, cclc, sizeof(*buf),
>                   SMC_CLC_CONFIRM, CLC_WAIT_TIME);
>     if (rc) {
> x        if (!ini->is_smcd)
> x            goto out_unlock;
>         goto out_decl;
>     }
> 
> >   out_decl:
> >   	smc_listen_decline(new_smc, rc, ini ? ini->first_contact_local : 0,
> >   			   proposal_version);
> > diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
> > index ff49a11..a3338cc 100644
> > --- a/net/smc/smc_core.c
> > +++ b/net/smc/smc_core.c
> > @@ -46,6 +46,10 @@ struct smc_lgr_list smc_lgr_list = {	/* established link groups */
> >   	.num = 0,
> >   };
> > 
> > +struct smc_lgr_manager smc_lgr_manager = {
> > +	.lock = __SPIN_LOCK_UNLOCKED(smc_lgr_manager.lock),
> > +};
> > +
> >   static atomic_t lgr_cnt = ATOMIC_INIT(0); /* number of existing link groups */
> >   static DECLARE_WAIT_QUEUE_HEAD(lgrs_deleted);
> > 
> > @@ -55,6 +59,282 @@ static void smc_buf_free(struct smc_link_group *lgr, bool is_rmb,
> > 
> >   static void smc_link_down_work(struct work_struct *work);
> > 
> > +/* SMC-R lnk cluster compare func
> > + * All lnks that meet the description conditions of this function
> > + * are logically aggregated, called lnk cluster.
> > + * For the server side, lnk cluster is used to determine whether
> > + * a new group needs to be created when processing new imcoming connections.
> > + * For the client side, lnk cluster is used to determine whether
> > + * to wait for link ready (in other words, first contact ready).
> > + */
> > +static int smcr_lnk_cluster_cmpfn(struct rhashtable_compare_arg *arg, const void *obj)
> > +{
> > +	const struct smc_lnk_cluster_compare_arg *key = arg->key;
> > +	const struct smc_lnk_cluster *lnkc = obj;
> > +
> > +	if (memcmp(key->peer_systemid, lnkc->peer_systemid, SMC_SYSTEMID_LEN))
> > +		return 1;
> > +
> > +	if (memcmp(key->peer_gid, lnkc->peer_gid, SMC_GID_SIZE))
> > +		return 1;
> > +
> > +	if ((key->role == SMC_SERV || key->clcqpn == lnkc->clcqpn) &&
> > +	    (key->smcr_version == SMC_V2 ||
> > +	    !memcmp(key->peer_mac, lnkc->peer_mac, ETH_ALEN)))
> > +		return 0;
> > +
> > +	return 1;
> > +}
> > +
> > +/* SMC-R lnk cluster hash func */
> > +static u32 smcr_lnk_cluster_hashfn(const void *data, u32 len, u32 seed)
> > +{
> > +	const struct smc_lnk_cluster *lnkc = data;
> > +
> > +	return jhash2((u32 *)lnkc->peer_systemid, SMC_SYSTEMID_LEN / sizeof(u32), seed)
> > +		+ (lnkc->role == SMC_SERV) ? 0 : lnkc->clcqpn;
> > +}
> > +
> > +/* SMC-R lnk cluster compare arg hash func */
> > +static u32 smcr_lnk_cluster_compare_arg_hashfn(const void *data, u32 len, u32 seed)
> > +{
> > +	const struct smc_lnk_cluster_compare_arg *key = data;
> > +
> > +	return jhash2((u32 *)key->peer_systemid, SMC_SYSTEMID_LEN / sizeof(u32), seed)
> > +		+ (key->role == SMC_SERV) ? 0 : key->clcqpn;
> > +}
> > +
> > +static const struct rhashtable_params smcr_lnk_cluster_rhl_params = {
> > +	.head_offset = offsetof(struct smc_lnk_cluster, rnode),
> > +	.key_len = sizeof(struct smc_lnk_cluster_compare_arg),
> > +	.obj_cmpfn = smcr_lnk_cluster_cmpfn,
> > +	.obj_hashfn = smcr_lnk_cluster_hashfn,
> > +	.hashfn = smcr_lnk_cluster_compare_arg_hashfn,
> > +	.automatic_shrinking = true,
> > +};
> > +
> > +/* hold a reference for smc_lnk_cluster */
> > +static inline void smc_lnk_cluster_hold(struct smc_lnk_cluster *lnkc)
> > +{
> > +	if (likely(lnkc))
> > +		refcount_inc(&lnkc->ref);
> > +}
> > +
> > +/* release a reference for smc_lnk_cluster */
> > +static inline void smc_lnk_cluster_put(struct smc_lnk_cluster *lnkc)
> > +{
> > +	bool do_free = false;
> > +
> > +	if (!lnkc)
> > +		return;
> > +
> > +	if (refcount_dec_not_one(&lnkc->ref))
> > +		return;
> > +
> > +	spin_lock_bh(&smc_lgr_manager.lock);
> > +	/* last ref */
> > +	if (refcount_dec_and_test(&lnkc->ref)) {
> > +		do_free = true;
> > +		rhashtable_remove_fast(&smc_lgr_manager.lnk_cluster_maps, &lnkc->rnode,
> > +				       smcr_lnk_cluster_rhl_params);
> > +	}
> > +	spin_unlock_bh(&smc_lgr_manager.lock);
> > +	if (do_free)
> > +		kfree(lnkc);
> > +}
> > +
> > +/* Get or create smc_lnk_cluster by key
> > + * This function will hold a reference of returned smc_lnk_cluster
> > + * or create a new smc_lnk_cluster with the reference initialized to 1。
> > + * caller MUST call smc_lnk_cluster_put after this.
> > + */
> > +static inline struct smc_lnk_cluster *
> > +smcr_lnk_get_or_create_cluster(struct smc_lnk_cluster_compare_arg *key)
> > +{
> > +	struct smc_lnk_cluster *lnkc, *tmp_lnkc;
> > +	bool busy_retry;
> > +	int err;
> > +
> > +	/* serving a hardware or software interrupt, or preemption is disabled */
> > +	busy_retry = !in_interrupt();
> > +
> > +	spin_lock_bh(&smc_lgr_manager.lock);
> > +	lnkc = rhashtable_lookup_fast(&smc_lgr_manager.lnk_cluster_maps, key,
> > +				      smcr_lnk_cluster_rhl_params);
> > +	if (!lnkc) {
> > +		lnkc = kzalloc(sizeof(*lnkc), GFP_ATOMIC);
> > +		if (unlikely(!lnkc))
> > +			goto fail;
> > +
> > +		/* init cluster */
> > +		spin_lock_init(&lnkc->lock);
> > +		lnkc->role = key->role;
> > +		if (key->role == SMC_CLNT)
> > +			lnkc->clcqpn = key->clcqpn;
> > +		init_waitqueue_head(&lnkc->first_contact_waitqueue);
> > +		memcpy(lnkc->peer_systemid, key->peer_systemid, SMC_SYSTEMID_LEN);
> > +		memcpy(lnkc->peer_gid, key->peer_gid, SMC_GID_SIZE);
> > +		memcpy(lnkc->peer_mac, key->peer_mac, ETH_ALEN);
> > +		refcount_set(&lnkc->ref, 1);
> > +
> > +		do {
> > +			err = rhashtable_insert_fast(&smc_lgr_manager.lnk_cluster_maps,
> > +						     &lnkc->rnode, smcr_lnk_cluster_rhl_params);
> > +
> > +			/* success or fatal error */
> > +			if (err != -EBUSY)
> > +				break;
> > +
> > +			/* impossible in fact right now */
> > +			if (unlikely(!busy_retry)) {
> > +				pr_warn_ratelimited("smc: create lnk cluster in softirq\n");
> > +				break;
> > +			}
> > +
> > +			spin_unlock_bh(&smc_lgr_manager.lock);
> > +			/* yeild */
> > +			cond_resched();
> > +			spin_lock_bh(&smc_lgr_manager.lock);
> > +
> > +			/* after spin_unlock_bh(), lnk_cluster_maps may be changed */
> > +			tmp_lnkc = rhashtable_lookup_fast(&smc_lgr_manager.lnk_cluster_maps, key,
> > +							  smcr_lnk_cluster_rhl_params);
> > +
> > +			if (unlikely(tmp_lnkc)) {
> > +				pr_warn_ratelimited("smc: create cluster failed dues to duplicat key");
> > +				kfree(lnkc);
> > +				lnkc = NULL;
> > +				goto fail;
> > +			}
> > +		} while (1);
> > +
> > +		if (unlikely(err)) {
> > +			pr_warn_ratelimited("smc: rhashtable_insert_fast failed (%d)", err);
> > +			kfree(lnkc);
> > +			lnkc = NULL;
> > +		}
> > +	} else {
> > +		smc_lnk_cluster_hold(lnkc);
> > +	}
> > +fail:
> > +	spin_unlock_bh(&smc_lgr_manager.lock);
> > +	return lnkc;
> > +}
> > +
> > +/* Get or create a smc_lnk_cluster by lnk
> > + * caller MUST call smc_lnk_cluster_put after this.
> > + */
> > +static inline struct smc_lnk_cluster *smcr_lnk_get_cluster(struct smc_link *lnk)
> > +{
> > +	struct smc_lnk_cluster_compare_arg key;
> > +	struct smc_link_group *lgr;
> > +
> > +	lgr = lnk->lgr;
> > +	if (!lgr || lgr->is_smcd)
> > +		return NULL;
> > +
> > +	key.smcr_version = lgr->smc_version;
> > +	key.peer_systemid = lgr->peer_systemid;
> > +	key.peer_gid = lnk->peer_gid;
> > +	key.peer_mac = lnk->peer_mac;
> > +	key.role	 = lgr->role;
> > +	if (key.role == SMC_CLNT)
> > +		key.clcqpn = lnk->peer_qpn;
> > +
> > +	return smcr_lnk_get_or_create_cluster(&key);
> > +}
> > +
> > +/* Get or create a smc_lnk_cluster by ini
> > + * caller MUST call smc_lnk_cluster_put after this.
> > + */
> > +static inline struct smc_lnk_cluster *
> > +smcr_lnk_get_cluster_by_ini(struct smc_init_info *ini, int role)
> > +{
> > +	struct smc_lnk_cluster_compare_arg key;
> > +
> > +	if (ini->is_smcd)
> > +		return NULL;
> > +
> > +	key.smcr_version = ini->smcr_version;
> > +	key.peer_systemid = ini->peer_systemid;
> > +	key.peer_gid = ini->peer_gid;
> > +	key.peer_mac = ini->peer_mac;
> > +	key.role	= role;
> > +	if (role == SMC_CLNT)
> > +		key.clcqpn	= ini->ib_clcqpn;
> > +
> > +	return smcr_lnk_get_or_create_cluster(&key);
> > +}
> > +
> > +/* callback when smc link state change */
> > +void smcr_lnk_cluster_on_lnk_state(struct smc_link *lnk)
> > +{
> > +	struct smc_lnk_cluster *lnkc;
> > +	int nr = 0;
> > +
> > +	/* barrier for lnk->state */
> > +	smp_mb();
> > +
> > +	/* only first link can made connections block on
> > +	 * first_contact_waitqueue
> > +	 */
> > +	if (lnk->link_idx != SMC_SINGLE_LINK)
> > +		return;
> > +
> > +	/* state already seen  */
> > +	if (lnk->state_record & SMC_LNK_STATE_BIT(lnk->state))
> > +		return;
> > +
> > +	lnkc = smcr_lnk_get_cluster(lnk);
> > +
> > +	if (unlikely(!lnkc))
> > +		return;
> > +
> > +	spin_lock_bh(&lnkc->lock);
> > +
> > +	/* all lnk state change should be
> > +	 * 1. SMC_LNK_UNUSED -> SMC_LNK_TEAR_DWON (link init failed)
> 
> Should this really be DWON and not DOWN?
> 
> > +	 * 2. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_TEAR_DWON
> > +	 * 3. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_INACTIVE -> SMC_LNK_TEAR_DWON
> > +	 * 4. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_INACTIVE -> SMC_LNK_TEAR_DWON
> > +	 * 5. SMC_LNK_UNUSED -> SMC_LNK_ATIVATING -> SMC_LNK_ACTIVE ->SMC_LNK_INACTIVE
> > +	 * -> SMC_LNK_TEAR_DWON
> > +	 */
> > +	switch (lnk->state) {
> > +	case SMC_LNK_ACTIVATING:
> > +		/* It's safe to hold a reference without lock
> > +		 * dues to the smcr_lnk_get_cluster already hold one
> > +		 */
> > +		smc_lnk_cluster_hold(lnkc);
> > +		break;
> > +	case SMC_LNK_TEAR_DWON:
> > +		if (lnk->state_record & SMC_LNK_STATE_BIT(SMC_LNK_ACTIVATING))
> > +			/* smc_lnk_cluster_hold in SMC_LNK_ACTIVATING */
> > +			smc_lnk_cluster_put(lnkc);
> > +		fallthrough;
> > +	case SMC_LNK_ACTIVE:
> > +	case SMC_LNK_INACTIVE:
> > +		if (!(lnk->state_record &
> > +			(SMC_LNK_STATE_BIT(SMC_LNK_ACTIVE)
> > +			| SMC_LNK_STATE_BIT(SMC_LNK_INACTIVE)))) {
> > +			lnkc->pending_capability -= (SMC_RMBS_PER_LGR_MAX - 1);
> > +			/* TODO: wakeup just one to perfrom first contact
> > +			 * if record state has no SMC_LNK_ACTIVE
> > +			 */
> 
> 
> Todo in a patch.
> 
> > +			nr = SMC_RMBS_PER_LGR_MAX - 1;
> > +		}
> > +		break;
> > +	case SMC_LNK_UNUSED:
> > +		pr_warn_ratelimited("net/smc: invalid lnk state. ");
> > +		break;
> > +	}
> > +	SMC_LNK_STATE_RECORD(lnk, lnk->state);
> > +	spin_unlock_bh(&lnkc->lock);
> > +	if (nr)
> > +		wake_up_nr(&lnkc->first_contact_waitqueue, nr);
> > +	smc_lnk_cluster_put(lnkc);	/* smc_lnk_cluster_hold in smcr_lnk_get_cluster */
> > +}
> > +
> >   /* return head of link group list and its lock for a given link group */
> >   static inline struct list_head *smc_lgr_list_head(struct smc_link_group *lgr,
> >   						  spinlock_t **lgr_lock)
> > @@ -651,8 +931,10 @@ static void smcr_lgr_link_deactivate_all(struct smc_link_group *lgr)
> >   	for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
> >   		struct smc_link *lnk = &lgr->lnk[i];
> > 
> > -		if (smc_link_sendable(lnk))
> > +		if (smc_link_sendable(lnk)) {
> >   			lnk->state = SMC_LNK_INACTIVE;
> > +			smcr_lnk_cluster_on_lnk_state(lnk);
> > +		}
> >   	}
> >   	wake_up_all(&lgr->llc_msg_waiter);
> >   	wake_up_all(&lgr->llc_flow_waiter);
> > @@ -762,6 +1044,9 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
> >   	atomic_set(&lnk->conn_cnt, 0);
> >   	smc_llc_link_set_uid(lnk);
> >   	INIT_WORK(&lnk->link_down_wrk, smc_link_down_work);
> > +	lnk->peer_qpn = ini->ib_clcqpn;
> > +	memcpy(lnk->peer_gid, ini->peer_gid, SMC_GID_SIZE);
> > +	memcpy(lnk->peer_mac, ini->peer_mac, sizeof(lnk->peer_mac));
> >   	if (!lnk->smcibdev->initialized) {
> >   		rc = (int)smc_ib_setup_per_ibdev(lnk->smcibdev);
> >   		if (rc)
> > @@ -792,6 +1077,7 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
> >   	if (rc)
> >   		goto destroy_qp;
> >   	lnk->state = SMC_LNK_ACTIVATING;
> > +	smcr_lnk_cluster_on_lnk_state(lnk);
> >   	return 0;
> > 
> >   destroy_qp:
> > @@ -806,6 +1092,8 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
> >   	smc_ibdev_cnt_dec(lnk);
> >   	put_device(&lnk->smcibdev->ibdev->dev);
> >   	smcibdev = lnk->smcibdev;
> > +	lnk->state = SMC_LNK_TEAR_DWON;
> > +	smcr_lnk_cluster_on_lnk_state(lnk);
> >   	memset(lnk, 0, sizeof(struct smc_link));
> >   	lnk->state = SMC_LNK_UNUSED;
> >   	if (!atomic_dec_return(&smcibdev->lnk_cnt))
> > @@ -1263,6 +1551,8 @@ void smcr_link_clear(struct smc_link *lnk, bool log)
> >   	if (!lnk->lgr || lnk->clearing ||
> >   	    lnk->state == SMC_LNK_UNUSED)
> >   		return;
> > +	lnk->state = SMC_LNK_TEAR_DWON;
> > +	smcr_lnk_cluster_on_lnk_state(lnk);
> >   	lnk->clearing = 1;
> >   	lnk->peer_qpn = 0;
> >   	smc_llc_link_clear(lnk, log);
> > @@ -1712,6 +2002,7 @@ void smcr_link_down_cond(struct smc_link *lnk)
> >   {
> >   	if (smc_link_downing(&lnk->state)) {
> >   		trace_smcr_link_down(lnk, __builtin_return_address(0));
> > +		smcr_lnk_cluster_on_lnk_state(lnk);
> >   		smcr_link_down(lnk);
> >   	}
> >   }
> > @@ -1721,6 +2012,7 @@ void smcr_link_down_cond_sched(struct smc_link *lnk)
> >   {
> >   	if (smc_link_downing(&lnk->state)) {
> >   		trace_smcr_link_down(lnk, __builtin_return_address(0));
> > +		smcr_lnk_cluster_on_lnk_state(lnk);
> >   		schedule_work(&lnk->link_down_wrk);
> >   	}
> >   }
> > @@ -1850,11 +2142,13 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
> >   {
> >   	struct smc_connection *conn = &smc->conn;
> >   	struct net *net = sock_net(&smc->sk);
> > +	DECLARE_WAITQUEUE(wait, current);
> > +	struct smc_lnk_cluster *lnkc = NULL;
> 
> Declared as NULL.
> 
> >   	struct list_head *lgr_list;
> >   	struct smc_link_group *lgr;
> >   	enum smc_lgr_role role;
> >   	spinlock_t *lgr_lock;
> > -	int rc = 0;
> > +	int rc = 0, timeo = CLC_WAIT_TIME;
> > 
> >   	lgr_list = ini->is_smcd ? &ini->ism_dev[ini->ism_selected]->lgr_list :
> >   				  &smc_lgr_list.list;
> > @@ -1862,12 +2156,26 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
> >   				  &smc_lgr_list.lock;
> >   	ini->first_contact_local = 1;
> >   	role = smc->listen_smc ? SMC_SERV : SMC_CLNT;
> > -	if (role == SMC_CLNT && ini->first_contact_peer)
> > +
> > +	if (!ini->is_smcd) {
> > +		lnkc = smcr_lnk_get_cluster_by_ini(ini, role);
> 
> Here linkc is set if it is SMC-R.
> 
> > +		if (unlikely(!lnkc))
> > +			return SMC_CLC_DECL_INTERR;
> > +	}
> > +
> > +	if (role == SMC_CLNT && ini->first_contact_peer) {
> > +		/* first_contact */
> > +		spin_lock_bh(&lnkc->lock);
> 
> And here SMC-D dies because of the NULL address. This kills our Systems if
> we try to talk via SMC-D.

Got it, thanks.

> 
> [  779.516389] Failing address: 0000000000000000 TEID: 0000000000000483
> [  779.516391] Fault in home space mode while using kernel ASCE.
> [  779.516395] AS:0000000069628007 R3:00000000ffbf0007 S:00000000ffbef800
> P:000000000000003d
> [  779.516431] Oops: 0004 ilc:2 [#1] SMP
> [  779.516436] Modules linked in: tcp_diag inet_diag ism mlx5_ib ib_uverbs
> mlx5_core smc_diag smc ib_core nft_fib_inet nft_fib_ipv4
> nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6
> nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv
> 6 nf_defrag_ipv4 ip_set nf_tables n
> [  779.516470] CPU: 0 PID: 24 Comm: kworker/0:1 Not tainted
> 5.19.0-13940-g22a46254655a #3
> [  779.516476] Hardware name: IBM 8561 T01 701 (z/VM 7.2.0)
> 
> [  779.522738] Workqueue: smc_hs_wq smc_listen_work [smc]
> [  779.522755] Krnl PSW : 0704c00180000000 000003ff803da89c
> (smc_conn_create+0x174/0x968 [smc])
> [  779.522766]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0
> RI:0 EA:3
> [  779.522770] Krnl GPRS: 0000000000000002 0000000000000000 0000000000000001
> 0000000000000000
> [  779.522773]            000000008a4128a0 000003ff803f21aa 000000008e30d640
> 0000000086d72000
> [  779.522776]            0000000086d72000 000000008a412803 000000008a412800
> 000000008e30d650
> [  779.522779]            0000000080934200 0000000000000000 000003ff803cb954
> 00000380002dfa88
> [  779.522789] Krnl Code: 000003ff803da88e: e310f0e80024        stg
> %r1,232(%r15)
> [  779.522789]            000003ff803da894: a7180000            lhi %r1,0
> [  779.522789]           #000003ff803da898: 582003ac            l %r2,940
> [  779.522789]           >000003ff803da89c: ba123020            cs
> %r1,%r2,32(%r3)
> [  779.522789]            000003ff803da8a0: ec1603be007e        cij
> %r1,0,6,000003ff803db01c
> 
> [  779.522789]            000003ff803da8a6: 4110b002            la
> %r1,2(%r11)
> [  779.522789]            000003ff803da8aa: e310f0f00024        stg
> %r1,240(%r15)
> [  779.522789]            000003ff803da8b0: e310f0c00004        lg
> %r1,192(%r15)
> [  779.522870] Call Trace:
> [  779.522873]  [<000003ff803da89c>] smc_conn_create+0x174/0x968 [smc]
> [  779.522884]  [<000003ff803cb954>] smc_find_ism_v2_device_serv+0x1b4/0x300
> [smc]
> 

D. Wythe is on vacation now. We will fix it soon.

Thanks again for your reviews. 

Cheers,
Tony Lu

> > +		lnkc->pending_capability += (SMC_RMBS_PER_LGR_MAX - 1);
> > +		spin_unlock_bh(&lnkc->lock);
> >   		/* create new link group as well */
> >   		goto create;
> > +	}
> > 
> >   	/* determine if an existing link group can be reused */
> >   	spin_lock_bh(lgr_lock);
> > +	spin_lock(&lnkc->lock);
> > +again:
> >   	list_for_each_entry(lgr, lgr_list, list) {
> >   		write_lock_bh(&lgr->conns_lock);
> >   		if ((ini->is_smcd ?
> > @@ -1894,9 +2202,33 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
> >   		}
> >   		write_unlock_bh(&lgr->conns_lock);
> >   	}
> > +	if (lnkc && ini->first_contact_local) {
> > +		if (lnkc->pending_capability > lnkc->conns_pending) {
> > +			lnkc->conns_pending++;
> > +			add_wait_queue(&lnkc->first_contact_waitqueue, &wait);
> > +			spin_unlock(&lnkc->lock);
> > +			spin_unlock_bh(lgr_lock);
> > +			set_current_state(TASK_INTERRUPTIBLE);
> > +			/* need to wait at least once first contact done */
> > +			timeo = schedule_timeout(timeo);
> > +			set_current_state(TASK_RUNNING);
> > +			remove_wait_queue(&lnkc->first_contact_waitqueue, &wait);
> > +			spin_lock_bh(lgr_lock);
> > +			spin_lock(&lnkc->lock);
> > +
> > +			lnkc->conns_pending--;
> > +			if (timeo)
> > +				goto again;
> > +		}
> > +		if (role == SMC_SERV) {
> > +			/* first_contact */
> > +			lnkc->pending_capability += (SMC_RMBS_PER_LGR_MAX - 1);
> > +		}
> > +	}
> > +	spin_unlock(&lnkc->lock);
> >   	spin_unlock_bh(lgr_lock);
> >   	if (rc)
> > -		return rc;
> > +		goto out;
> > 
> >   	if (role == SMC_CLNT && !ini->first_contact_peer &&
> >   	    ini->first_contact_local) {
> > @@ -1904,7 +2236,8 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
> >   		 * a new one
> >   		 * send out_of_sync decline, reason synchr. error
> >   		 */
> > -		return SMC_CLC_DECL_SYNCERR;
> > +		rc = SMC_CLC_DECL_SYNCERR;
> > +		goto out;
> >   	}
> > 
> >   create:
> > @@ -1941,6 +2274,8 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
> >   #endif
> > 
> >   out:
> > +	/* smc_lnk_cluster_hold in smcr_lnk_get_or_create_cluster */
> > +	smc_lnk_cluster_put(lnkc);
> >   	return rc;
> >   }
> > 
> > @@ -2599,12 +2934,23 @@ static int smc_core_reboot_event(struct notifier_block *this,
> > 
> >   int __init smc_core_init(void)
> >   {
> > +	/* init smc lnk cluster maps */
> > +	rhashtable_init(&smc_lgr_manager.lnk_cluster_maps, &smcr_lnk_cluster_rhl_params);
> >   	return register_reboot_notifier(&smc_reboot_notifier);
> >   }
> > 
> > +static void smc_lnk_cluster_free_cb(void *ptr, void *arg)
> > +{
> > +	pr_warn("smc: smc lnk cluster refcnt leak.\n");
> > +	kfree(ptr);
> > +}
> > +
> >   /* Called (from smc_exit) when module is removed */
> >   void smc_core_exit(void)
> >   {
> >   	unregister_reboot_notifier(&smc_reboot_notifier);
> >   	smc_lgrs_shutdown();
> > +	/* destroy smc lnk cluster maps */
> > +	rhashtable_free_and_destroy(&smc_lgr_manager.lnk_cluster_maps, smc_lnk_cluster_free_cb,
> > +				    NULL);
> >   }
> > diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
> > index fe8b524..199f533 100644
> > --- a/net/smc/smc_core.h
> > +++ b/net/smc/smc_core.h
> > @@ -15,6 +15,7 @@
> >   #include <linux/atomic.h>
> >   #include <linux/smc.h>
> >   #include <linux/pci.h>
> > +#include <linux/rhashtable.h>
> >   #include <rdma/ib_verbs.h>
> >   #include <net/genetlink.h>
> > 
> > @@ -29,18 +30,62 @@ struct smc_lgr_list {			/* list of link group definition */
> >   	u32			num;	/* unique link group number */
> >   };
> > 
> > +struct smc_lgr_manager {		/* manager for link group */
> > +	struct rhashtable	lnk_cluster_maps;	/* maps of smc_lnk_cluster */
> > +	spinlock_t		lock;	/* lock for lgr_cm_maps */
> > +};
> > +
> > +struct smc_lnk_cluster {
> > +	struct rhash_head	rnode;	/* node for rhashtable */
> > +	struct wait_queue_head	first_contact_waitqueue;
> > +					/* queue for non first contact to wait
> > +					 * first contact to be established.
> > +					 */
> > +	spinlock_t		lock;	/* protection for link group */
> > +	refcount_t		ref;	/* refcount for cluster */
> > +	unsigned long		pending_capability;
> > +					/* maximum pending number of connections that
> > +					 * need wait first contact complete.
> > +					 */
> > +	unsigned long		conns_pending;
> > +					/* connections that are waiting for first contact
> > +					 * complete
> > +					 */
> > +	u8		peer_systemid[SMC_SYSTEMID_LEN];
> > +	u8		peer_mac[ETH_ALEN];	/* = gid[8:10||13:15] */
> > +	u8		peer_gid[SMC_GID_SIZE];	/* gid of peer*/
> > +	int		clcqpn;
> > +	int		role;
> > +};
> > +
> >   enum smc_lgr_role {		/* possible roles of a link group */
> >   	SMC_CLNT,	/* client */
> >   	SMC_SERV	/* server */
> >   };
> > 
> > +struct smc_lnk_cluster_compare_arg	/* key for smc_lnk_cluster */
> > +{
> > +	int	smcr_version;
> > +	enum smc_lgr_role role;
> > +	u8	*peer_systemid;
> > +	u8	*peer_gid;
> > +	u8	*peer_mac;
> > +	int clcqpn;
> > +};
> > +
> >   enum smc_link_state {			/* possible states of a link */
> >   	SMC_LNK_UNUSED,		/* link is unused */
> >   	SMC_LNK_INACTIVE,	/* link is inactive */
> >   	SMC_LNK_ACTIVATING,	/* link is being activated */
> >   	SMC_LNK_ACTIVE,		/* link is active */
> > +	SMC_LNK_TEAR_DWON,	/* link is tear down */
> >   };
> > 
> > +#define SMC_LNK_STATE_BIT(state)	(1 << (state))
> > +
> > +#define	SMC_LNK_STATE_RECORD(lnk, state)	\
> > +	((lnk)->state_record |= SMC_LNK_STATE_BIT(state))
> > +
> >   #define SMC_WR_BUF_SIZE		48	/* size of work request buffer */
> >   #define SMC_WR_BUF_V2_SIZE	8192	/* size of v2 work request buffer */
> > 
> > @@ -145,6 +190,7 @@ struct smc_link {
> >   	int			ndev_ifidx; /* network device ifindex */
> > 
> >   	enum smc_link_state	state;		/* state of link */
> > +	int			state_record;		/* record of previous state */
> >   	struct delayed_work	llc_testlink_wrk; /* testlink worker */
> >   	struct completion	llc_testlink_resp; /* wait for rx of testlink */
> >   	int			llc_testlink_time; /* testlink interval */
> > @@ -557,6 +603,8 @@ struct smc_link *smc_switch_conns(struct smc_link_group *lgr,
> >   int smcr_nl_get_link(struct sk_buff *skb, struct netlink_callback *cb);
> >   int smcd_nl_get_lgr(struct sk_buff *skb, struct netlink_callback *cb);
> > 
> > +void smcr_lnk_cluster_on_lnk_state(struct smc_link *lnk);
> > +
> >   static inline struct smc_link_group *smc_get_lgr(struct smc_link *link)
> >   {
> >   	return link->lgr;
> > diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
> > index 175026a..8134c15 100644
> > --- a/net/smc/smc_llc.c
> > +++ b/net/smc/smc_llc.c
> > @@ -1099,6 +1099,7 @@ int smc_llc_cli_add_link(struct smc_link *link, struct smc_llc_qentry *qentry)
> >   		goto out;
> >   out_clear_lnk:
> >   	lnk_new->state = SMC_LNK_INACTIVE;
> > +	smcr_lnk_cluster_on_lnk_state(lnk_new);
> >   	smcr_link_clear(lnk_new, false);
> >   out_reject:
> >   	smc_llc_cli_add_link_reject(qentry);
> > @@ -1278,6 +1279,7 @@ static void smc_llc_delete_asym_link(struct smc_link_group *lgr)
> >   		return; /* no asymmetric link */
> >   	if (!smc_link_downing(&lnk_asym->state))
> >   		return;
> > +	smcr_lnk_cluster_on_lnk_state(lnk_asym);
> >   	lnk_new = smc_switch_conns(lgr, lnk_asym, false);
> >   	smc_wr_tx_wait_no_pending_sends(lnk_asym);
> >   	if (!lnk_new)
> > @@ -1492,6 +1494,7 @@ int smc_llc_srv_add_link(struct smc_link *link,
> >   out_err:
> >   	if (link_new) {
> >   		link_new->state = SMC_LNK_INACTIVE;
> > +		smcr_lnk_cluster_on_lnk_state(link_new);
> >   		smcr_link_clear(link_new, false);
> >   	}
> >   out:
> > @@ -1602,8 +1605,10 @@ static void smc_llc_process_cli_delete_link(struct smc_link_group *lgr)
> >   	del_llc->reason = 0;
> >   	smc_llc_send_message(lnk, &qentry->msg); /* response */
> > 
> > -	if (smc_link_downing(&lnk_del->state))
> > +	if (smc_link_downing(&lnk_del->state)) {
> > +		smcr_lnk_cluster_on_lnk_state(lnk);
> >   		smc_switch_conns(lgr, lnk_del, false);
> > +	}
> >   	smcr_link_clear(lnk_del, true);
> > 
> >   	active_links = smc_llc_active_link_count(lgr);
> > @@ -1676,6 +1681,7 @@ static void smc_llc_process_srv_delete_link(struct smc_link_group *lgr)
> >   		goto out; /* asymmetric link already deleted */
> > 
> >   	if (smc_link_downing(&lnk_del->state)) {
> > +		smcr_lnk_cluster_on_lnk_state(lnk);
> >   		if (smc_switch_conns(lgr, lnk_del, false))
> >   			smc_wr_tx_wait_no_pending_sends(lnk_del);
> >   	}
> > @@ -2167,6 +2173,7 @@ void smc_llc_link_active(struct smc_link *link)
> >   		schedule_delayed_work(&link->llc_testlink_wrk,
> >   				      link->llc_testlink_time);
> >   	}
> > +	smcr_lnk_cluster_on_lnk_state(link);
> >   }
> > 
> >   /* called in worker context */

^ permalink raw reply

* Re: [PATCH net-next 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending
From: Tony Lu @ 2022-08-16 12:52 UTC (permalink / raw)
  To: D. Wythe; +Cc: kgraul, wenjia, kuba, davem, netdev, linux-s390, linux-rdma
In-Reply-To: <075ff0be35660efac638448cdae7f7e7e04199d4.1660152975.git.alibuda@linux.alibaba.com>

On Thu, Aug 11, 2022 at 01:47:32AM +0800, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
> 
> This patch attempts to remove locks named smc_client_lgr_pending and
> smc_server_lgr_pending, which aim to serialize the creation of link
> group. However, once link group existed already, those locks are
> meaningless, worse still, they make incoming connections have to be
> queued one after the other.
> 
> Now, the creation of link group is no longer generated by competition,
> but allocated through following strategy.
> 
> 1. Try to find a suitable link group, if successd, current connection
> is considered as NON first contact connection. ends.
> 
> 2. Check the number of connections currently waiting for a suitable
> link group to be created, if it is not less that the number of link
> groups to be created multiplied by (SMC_RMBS_PER_LGR_MAX - 1), then
> increase the number of link groups to be created, current connection
> is considered as the first contact connection. ends.
> 
> 3. Increase the number of connections currently waiting, and wait
> for woken up.
> 
> 4. Decrease the number of connections currently waiting, goto 1.
> 
> We wake up the connection that was put to sleep in stage 3 through
> the SMC link state change event. Once the link moves out of the
> SMC_LNK_ACTIVATING state, decrease the number of link groups to
> be created, and then wake up at most (SMC_RMBS_PER_LGR_MAX - 1)
> connections.
> 
> In the iplementation, we introduce the concept of lnk cluster, which is
> a collection of links with the same characteristics (see
> smcr_lnk_cluster_cmpfn() with more details), which makes it possible to
> wake up efficiently in the scenario of N v.s 1.
> 
> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> ---
>  net/smc/af_smc.c   |  11 +-
>  net/smc/smc_core.c | 356 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  net/smc/smc_core.h |  48 ++++++++
>  net/smc/smc_llc.c  |   9 +-
>  4 files changed, 411 insertions(+), 13 deletions(-)
> 
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index 79c1318..af4b0aa 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -1194,10 +1194,8 @@ static int smc_connect_rdma(struct smc_sock *smc,
>  	if (reason_code)
>  		return reason_code;
>  
> -	mutex_lock(&smc_client_lgr_pending);
>  	reason_code = smc_conn_create(smc, ini);
>  	if (reason_code) {
> -		mutex_unlock(&smc_client_lgr_pending);
>  		return reason_code;
>  	}
>  
> @@ -1289,7 +1287,6 @@ static int smc_connect_rdma(struct smc_sock *smc,
>  		if (reason_code)
>  			goto connect_abort;
>  	}
> -	mutex_unlock(&smc_client_lgr_pending);
>  
>  	smc_copy_sock_settings_to_clc(smc);
>  	smc->connect_nonblock = 0;
> @@ -1299,7 +1296,6 @@ static int smc_connect_rdma(struct smc_sock *smc,
>  	return 0;
>  connect_abort:
>  	smc_conn_abort(smc, ini->first_contact_local);
> -	mutex_unlock(&smc_client_lgr_pending);
>  	smc->connect_nonblock = 0;
>  
>  	return reason_code;
> @@ -2377,7 +2373,8 @@ static void smc_listen_work(struct work_struct *work)
>  	if (rc)
>  		goto out_decl;
>  
> -	mutex_lock(&smc_server_lgr_pending);
> +	if (ini->is_smcd)
> +		mutex_lock(&smc_server_lgr_pending);
>  	smc_close_init(new_smc);
>  	smc_rx_init(new_smc);
>  	smc_tx_init(new_smc);
> @@ -2415,7 +2412,6 @@ static void smc_listen_work(struct work_struct *work)
>  					    ini->first_contact_local, ini);
>  		if (rc)
>  			goto out_unlock;
> -		mutex_unlock(&smc_server_lgr_pending);
>  	}
>  	smc_conn_save_peer_info(new_smc, cclc);
>  	smc_listen_out_connected(new_smc);
> @@ -2423,7 +2419,8 @@ static void smc_listen_work(struct work_struct *work)
>  	goto out_free;
>  
>  out_unlock:
> -	mutex_unlock(&smc_server_lgr_pending);
> +	if (ini->is_smcd)
> +		mutex_unlock(&smc_server_lgr_pending);
>  out_decl:
>  	smc_listen_decline(new_smc, rc, ini ? ini->first_contact_local : 0,
>  			   proposal_version);
> diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
> index ff49a11..a3338cc 100644
> --- a/net/smc/smc_core.c
> +++ b/net/smc/smc_core.c
> @@ -46,6 +46,10 @@ struct smc_lgr_list smc_lgr_list = {	/* established link groups */
>  	.num = 0,
>  };
>  
> +struct smc_lgr_manager smc_lgr_manager = {
> +	.lock = __SPIN_LOCK_UNLOCKED(smc_lgr_manager.lock),
> +};
> +
>  static atomic_t lgr_cnt = ATOMIC_INIT(0); /* number of existing link groups */
>  static DECLARE_WAIT_QUEUE_HEAD(lgrs_deleted);
>  
> @@ -55,6 +59,282 @@ static void smc_buf_free(struct smc_link_group *lgr, bool is_rmb,
>  
>  static void smc_link_down_work(struct work_struct *work);
>  
> +/* SMC-R lnk cluster compare func
> + * All lnks that meet the description conditions of this function
> + * are logically aggregated, called lnk cluster.
> + * For the server side, lnk cluster is used to determine whether
> + * a new group needs to be created when processing new imcoming connections.
> + * For the client side, lnk cluster is used to determine whether
> + * to wait for link ready (in other words, first contact ready).
> + */
> +static int smcr_lnk_cluster_cmpfn(struct rhashtable_compare_arg *arg, const void *obj)
> +{
> +	const struct smc_lnk_cluster_compare_arg *key = arg->key;
> +	const struct smc_lnk_cluster *lnkc = obj;
> +
> +	if (memcmp(key->peer_systemid, lnkc->peer_systemid, SMC_SYSTEMID_LEN))
> +		return 1;
> +
> +	if (memcmp(key->peer_gid, lnkc->peer_gid, SMC_GID_SIZE))
> +		return 1;
> +
> +	if ((key->role == SMC_SERV || key->clcqpn == lnkc->clcqpn) &&
> +	    (key->smcr_version == SMC_V2 ||
> +	    !memcmp(key->peer_mac, lnkc->peer_mac, ETH_ALEN)))
> +		return 0;
> +
> +	return 1;
> +}

Also, SMC prefers to use *link* for function name and struct name. It's
okay to use *lnk* for local variables.

> +
> +/* SMC-R lnk cluster hash func */
> +static u32 smcr_lnk_cluster_hashfn(const void *data, u32 len, u32 seed)
> +{
> +	const struct smc_lnk_cluster *lnkc = data;
> +
> +	return jhash2((u32 *)lnkc->peer_systemid, SMC_SYSTEMID_LEN / sizeof(u32), seed)
> +		+ (lnkc->role == SMC_SERV) ? 0 : lnkc->clcqpn;
> +}
> +
> +/* SMC-R lnk cluster compare arg hash func */
> +static u32 smcr_lnk_cluster_compare_arg_hashfn(const void *data, u32 len, u32 seed)
> +{
> +	const struct smc_lnk_cluster_compare_arg *key = data;
> +
> +	return jhash2((u32 *)key->peer_systemid, SMC_SYSTEMID_LEN / sizeof(u32), seed)
> +		+ (key->role == SMC_SERV) ? 0 : key->clcqpn;
> +}
> +
> +static const struct rhashtable_params smcr_lnk_cluster_rhl_params = {
> +	.head_offset = offsetof(struct smc_lnk_cluster, rnode),
> +	.key_len = sizeof(struct smc_lnk_cluster_compare_arg),
> +	.obj_cmpfn = smcr_lnk_cluster_cmpfn,
> +	.obj_hashfn = smcr_lnk_cluster_hashfn,
> +	.hashfn = smcr_lnk_cluster_compare_arg_hashfn,
> +	.automatic_shrinking = true,
> +};
> +
> +/* hold a reference for smc_lnk_cluster */
> +static inline void smc_lnk_cluster_hold(struct smc_lnk_cluster *lnkc)
> +{
> +	if (likely(lnkc))
> +		refcount_inc(&lnkc->ref);
> +}
> +
> +/* release a reference for smc_lnk_cluster */
> +static inline void smc_lnk_cluster_put(struct smc_lnk_cluster *lnkc)
> +{
> +	bool do_free = false;
> +
> +	if (!lnkc)
> +		return;
> +
> +	if (refcount_dec_not_one(&lnkc->ref))
> +		return;
> +
> +	spin_lock_bh(&smc_lgr_manager.lock);
> +	/* last ref */
> +	if (refcount_dec_and_test(&lnkc->ref)) {
> +		do_free = true;
> +		rhashtable_remove_fast(&smc_lgr_manager.lnk_cluster_maps, &lnkc->rnode,
> +				       smcr_lnk_cluster_rhl_params);
> +	}
> +	spin_unlock_bh(&smc_lgr_manager.lock);
> +	if (do_free)
> +		kfree(lnkc);
> +}
> +
> +/* Get or create smc_lnk_cluster by key
> + * This function will hold a reference of returned smc_lnk_cluster
> + * or create a new smc_lnk_cluster with the reference initialized to 1。
> + * caller MUST call smc_lnk_cluster_put after this.
> + */
> +static inline struct smc_lnk_cluster *
> +smcr_lnk_get_or_create_cluster(struct smc_lnk_cluster_compare_arg *key)
> +{
> +	struct smc_lnk_cluster *lnkc, *tmp_lnkc;
> +	bool busy_retry;
> +	int err;
> +
> +	/* serving a hardware or software interrupt, or preemption is disabled */
> +	busy_retry = !in_interrupt();
> +
> +	spin_lock_bh(&smc_lgr_manager.lock);
> +	lnkc = rhashtable_lookup_fast(&smc_lgr_manager.lnk_cluster_maps, key,
> +				      smcr_lnk_cluster_rhl_params);
> +	if (!lnkc) {
> +		lnkc = kzalloc(sizeof(*lnkc), GFP_ATOMIC);
> +		if (unlikely(!lnkc))
> +			goto fail;
> +
> +		/* init cluster */
> +		spin_lock_init(&lnkc->lock);
> +		lnkc->role = key->role;
> +		if (key->role == SMC_CLNT)
> +			lnkc->clcqpn = key->clcqpn;
> +		init_waitqueue_head(&lnkc->first_contact_waitqueue);
> +		memcpy(lnkc->peer_systemid, key->peer_systemid, SMC_SYSTEMID_LEN);
> +		memcpy(lnkc->peer_gid, key->peer_gid, SMC_GID_SIZE);
> +		memcpy(lnkc->peer_mac, key->peer_mac, ETH_ALEN);
> +		refcount_set(&lnkc->ref, 1);
> +
> +		do {
> +			err = rhashtable_insert_fast(&smc_lgr_manager.lnk_cluster_maps,
> +						     &lnkc->rnode, smcr_lnk_cluster_rhl_params);
> +
> +			/* success or fatal error */
> +			if (err != -EBUSY)
> +				break;
> +
> +			/* impossible in fact right now */
> +			if (unlikely(!busy_retry)) {
> +				pr_warn_ratelimited("smc: create lnk cluster in softirq\n");
> +				break;
> +			}
> +
> +			spin_unlock_bh(&smc_lgr_manager.lock);
> +			/* yeild */
> +			cond_resched();
> +			spin_lock_bh(&smc_lgr_manager.lock);
> +
> +			/* after spin_unlock_bh(), lnk_cluster_maps may be changed */
> +			tmp_lnkc = rhashtable_lookup_fast(&smc_lgr_manager.lnk_cluster_maps, key,
> +							  smcr_lnk_cluster_rhl_params);
> +
> +			if (unlikely(tmp_lnkc)) {
> +				pr_warn_ratelimited("smc: create cluster failed dues to duplicat key");
> +				kfree(lnkc);
> +				lnkc = NULL;
> +				goto fail;
> +			}
> +		} while (1);
> +
> +		if (unlikely(err)) {
> +			pr_warn_ratelimited("smc: rhashtable_insert_fast failed (%d)", err);
> +			kfree(lnkc);
> +			lnkc = NULL;
> +		}
> +	} else {
> +		smc_lnk_cluster_hold(lnkc);
> +	}
> +fail:
> +	spin_unlock_bh(&smc_lgr_manager.lock);
> +	return lnkc;
> +}
> +
> +/* Get or create a smc_lnk_cluster by lnk
> + * caller MUST call smc_lnk_cluster_put after this.
> + */
> +static inline struct smc_lnk_cluster *smcr_lnk_get_cluster(struct smc_link *lnk)
> +{
> +	struct smc_lnk_cluster_compare_arg key;
> +	struct smc_link_group *lgr;
> +
> +	lgr = lnk->lgr;
> +	if (!lgr || lgr->is_smcd)
> +		return NULL;
> +
> +	key.smcr_version = lgr->smc_version;
> +	key.peer_systemid = lgr->peer_systemid;
> +	key.peer_gid = lnk->peer_gid;
> +	key.peer_mac = lnk->peer_mac;
> +	key.role	 = lgr->role;
> +	if (key.role == SMC_CLNT)
> +		key.clcqpn = lnk->peer_qpn;
> +
> +	return smcr_lnk_get_or_create_cluster(&key);
> +}
> +
> +/* Get or create a smc_lnk_cluster by ini
> + * caller MUST call smc_lnk_cluster_put after this.
> + */
> +static inline struct smc_lnk_cluster *
> +smcr_lnk_get_cluster_by_ini(struct smc_init_info *ini, int role)
> +{
> +	struct smc_lnk_cluster_compare_arg key;
> +
> +	if (ini->is_smcd)
> +		return NULL;
> +
> +	key.smcr_version = ini->smcr_version;
> +	key.peer_systemid = ini->peer_systemid;
> +	key.peer_gid = ini->peer_gid;
> +	key.peer_mac = ini->peer_mac;
> +	key.role	= role;
> +	if (role == SMC_CLNT)
> +		key.clcqpn	= ini->ib_clcqpn;
> +
> +	return smcr_lnk_get_or_create_cluster(&key);
> +}
> +
> +/* callback when smc link state change */
> +void smcr_lnk_cluster_on_lnk_state(struct smc_link *lnk)
> +{
> +	struct smc_lnk_cluster *lnkc;
> +	int nr = 0;
> +
> +	/* barrier for lnk->state */
> +	smp_mb();
> +
> +	/* only first link can made connections block on
> +	 * first_contact_waitqueue
> +	 */
> +	if (lnk->link_idx != SMC_SINGLE_LINK)
> +		return;
> +
> +	/* state already seen  */
> +	if (lnk->state_record & SMC_LNK_STATE_BIT(lnk->state))
> +		return;
> +
> +	lnkc = smcr_lnk_get_cluster(lnk);
> +
> +	if (unlikely(!lnkc))
> +		return;
> +
> +	spin_lock_bh(&lnkc->lock);
> +
> +	/* all lnk state change should be
> +	 * 1. SMC_LNK_UNUSED -> SMC_LNK_TEAR_DWON (link init failed)
> +	 * 2. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_TEAR_DWON
> +	 * 3. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_INACTIVE -> SMC_LNK_TEAR_DWON
> +	 * 4. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_INACTIVE -> SMC_LNK_TEAR_DWON
> +	 * 5. SMC_LNK_UNUSED -> SMC_LNK_ATIVATING -> SMC_LNK_ACTIVE ->SMC_LNK_INACTIVE
> +	 * -> SMC_LNK_TEAR_DWON
> +	 */
> +	switch (lnk->state) {
> +	case SMC_LNK_ACTIVATING:
> +		/* It's safe to hold a reference without lock
> +		 * dues to the smcr_lnk_get_cluster already hold one
> +		 */
> +		smc_lnk_cluster_hold(lnkc);
> +		break;
> +	case SMC_LNK_TEAR_DWON:
> +		if (lnk->state_record & SMC_LNK_STATE_BIT(SMC_LNK_ACTIVATING))
> +			/* smc_lnk_cluster_hold in SMC_LNK_ACTIVATING */
> +			smc_lnk_cluster_put(lnkc);
> +		fallthrough;
> +	case SMC_LNK_ACTIVE:
> +	case SMC_LNK_INACTIVE:
> +		if (!(lnk->state_record &
> +			(SMC_LNK_STATE_BIT(SMC_LNK_ACTIVE)
> +			| SMC_LNK_STATE_BIT(SMC_LNK_INACTIVE)))) {
> +			lnkc->pending_capability -= (SMC_RMBS_PER_LGR_MAX - 1);
> +			/* TODO: wakeup just one to perfrom first contact
> +			 * if record state has no SMC_LNK_ACTIVE
> +			 */
> +			nr = SMC_RMBS_PER_LGR_MAX - 1;
> +		}
> +		break;
> +	case SMC_LNK_UNUSED:
> +		pr_warn_ratelimited("net/smc: invalid lnk state. ");
> +		break;
> +	}
> +	SMC_LNK_STATE_RECORD(lnk, lnk->state);
> +	spin_unlock_bh(&lnkc->lock);
> +	if (nr)
> +		wake_up_nr(&lnkc->first_contact_waitqueue, nr);
> +	smc_lnk_cluster_put(lnkc);	/* smc_lnk_cluster_hold in smcr_lnk_get_cluster */
> +}
> +
>  /* return head of link group list and its lock for a given link group */
>  static inline struct list_head *smc_lgr_list_head(struct smc_link_group *lgr,
>  						  spinlock_t **lgr_lock)
> @@ -651,8 +931,10 @@ static void smcr_lgr_link_deactivate_all(struct smc_link_group *lgr)
>  	for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
>  		struct smc_link *lnk = &lgr->lnk[i];
>  
> -		if (smc_link_sendable(lnk))
> +		if (smc_link_sendable(lnk)) {
>  			lnk->state = SMC_LNK_INACTIVE;
> +			smcr_lnk_cluster_on_lnk_state(lnk);
> +		}
>  	}
>  	wake_up_all(&lgr->llc_msg_waiter);
>  	wake_up_all(&lgr->llc_flow_waiter);
> @@ -762,6 +1044,9 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
>  	atomic_set(&lnk->conn_cnt, 0);
>  	smc_llc_link_set_uid(lnk);
>  	INIT_WORK(&lnk->link_down_wrk, smc_link_down_work);
> +	lnk->peer_qpn = ini->ib_clcqpn;
> +	memcpy(lnk->peer_gid, ini->peer_gid, SMC_GID_SIZE);
> +	memcpy(lnk->peer_mac, ini->peer_mac, sizeof(lnk->peer_mac));
>  	if (!lnk->smcibdev->initialized) {
>  		rc = (int)smc_ib_setup_per_ibdev(lnk->smcibdev);
>  		if (rc)
> @@ -792,6 +1077,7 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
>  	if (rc)
>  		goto destroy_qp;
>  	lnk->state = SMC_LNK_ACTIVATING;
> +	smcr_lnk_cluster_on_lnk_state(lnk);
>  	return 0;
>  
>  destroy_qp:
> @@ -806,6 +1092,8 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
>  	smc_ibdev_cnt_dec(lnk);
>  	put_device(&lnk->smcibdev->ibdev->dev);
>  	smcibdev = lnk->smcibdev;
> +	lnk->state = SMC_LNK_TEAR_DWON;
> +	smcr_lnk_cluster_on_lnk_state(lnk);
>  	memset(lnk, 0, sizeof(struct smc_link));
>  	lnk->state = SMC_LNK_UNUSED;
>  	if (!atomic_dec_return(&smcibdev->lnk_cnt))
> @@ -1263,6 +1551,8 @@ void smcr_link_clear(struct smc_link *lnk, bool log)
>  	if (!lnk->lgr || lnk->clearing ||
>  	    lnk->state == SMC_LNK_UNUSED)
>  		return;
> +	lnk->state = SMC_LNK_TEAR_DWON;
> +	smcr_lnk_cluster_on_lnk_state(lnk);
>  	lnk->clearing = 1;
>  	lnk->peer_qpn = 0;
>  	smc_llc_link_clear(lnk, log);
> @@ -1712,6 +2002,7 @@ void smcr_link_down_cond(struct smc_link *lnk)
>  {
>  	if (smc_link_downing(&lnk->state)) {
>  		trace_smcr_link_down(lnk, __builtin_return_address(0));
> +		smcr_lnk_cluster_on_lnk_state(lnk);
>  		smcr_link_down(lnk);
>  	}
>  }
> @@ -1721,6 +2012,7 @@ void smcr_link_down_cond_sched(struct smc_link *lnk)
>  {
>  	if (smc_link_downing(&lnk->state)) {
>  		trace_smcr_link_down(lnk, __builtin_return_address(0));
> +		smcr_lnk_cluster_on_lnk_state(lnk);
>  		schedule_work(&lnk->link_down_wrk);
>  	}
>  }
> @@ -1850,11 +2142,13 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
>  {
>  	struct smc_connection *conn = &smc->conn;
>  	struct net *net = sock_net(&smc->sk);
> +	DECLARE_WAITQUEUE(wait, current);
> +	struct smc_lnk_cluster *lnkc = NULL;
>  	struct list_head *lgr_list;
>  	struct smc_link_group *lgr;
>  	enum smc_lgr_role role;
>  	spinlock_t *lgr_lock;
> -	int rc = 0;
> +	int rc = 0, timeo = CLC_WAIT_TIME;
>  
>  	lgr_list = ini->is_smcd ? &ini->ism_dev[ini->ism_selected]->lgr_list :
>  				  &smc_lgr_list.list;
> @@ -1862,12 +2156,26 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
>  				  &smc_lgr_list.lock;
>  	ini->first_contact_local = 1;
>  	role = smc->listen_smc ? SMC_SERV : SMC_CLNT;
> -	if (role == SMC_CLNT && ini->first_contact_peer)
> +
> +	if (!ini->is_smcd) {
> +		lnkc = smcr_lnk_get_cluster_by_ini(ini, role);
> +		if (unlikely(!lnkc))
> +			return SMC_CLC_DECL_INTERR;
> +	}
> +
> +	if (role == SMC_CLNT && ini->first_contact_peer) {
> +		/* first_contact */
> +		spin_lock_bh(&lnkc->lock);
> +		lnkc->pending_capability += (SMC_RMBS_PER_LGR_MAX - 1);
> +		spin_unlock_bh(&lnkc->lock);
>  		/* create new link group as well */
>  		goto create;
> +	}
>  
>  	/* determine if an existing link group can be reused */
>  	spin_lock_bh(lgr_lock);
> +	spin_lock(&lnkc->lock);
> +again:
>  	list_for_each_entry(lgr, lgr_list, list) {
>  		write_lock_bh(&lgr->conns_lock);
>  		if ((ini->is_smcd ?
> @@ -1894,9 +2202,33 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
>  		}
>  		write_unlock_bh(&lgr->conns_lock);
>  	}
> +	if (lnkc && ini->first_contact_local) {
> +		if (lnkc->pending_capability > lnkc->conns_pending) {
> +			lnkc->conns_pending++;
> +			add_wait_queue(&lnkc->first_contact_waitqueue, &wait);
> +			spin_unlock(&lnkc->lock);
> +			spin_unlock_bh(lgr_lock);
> +			set_current_state(TASK_INTERRUPTIBLE);
> +			/* need to wait at least once first contact done */
> +			timeo = schedule_timeout(timeo);
> +			set_current_state(TASK_RUNNING);
> +			remove_wait_queue(&lnkc->first_contact_waitqueue, &wait);
> +			spin_lock_bh(lgr_lock);
> +			spin_lock(&lnkc->lock);
> +
> +			lnkc->conns_pending--;
> +			if (timeo)
> +				goto again;
> +		}
> +		if (role == SMC_SERV) {
> +			/* first_contact */
> +			lnkc->pending_capability += (SMC_RMBS_PER_LGR_MAX - 1);
> +		}
> +	}
> +	spin_unlock(&lnkc->lock);
>  	spin_unlock_bh(lgr_lock);
>  	if (rc)
> -		return rc;
> +		goto out;
>  
>  	if (role == SMC_CLNT && !ini->first_contact_peer &&
>  	    ini->first_contact_local) {
> @@ -1904,7 +2236,8 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
>  		 * a new one
>  		 * send out_of_sync decline, reason synchr. error
>  		 */
> -		return SMC_CLC_DECL_SYNCERR;
> +		rc = SMC_CLC_DECL_SYNCERR;
> +		goto out;
>  	}
>  
>  create:
> @@ -1941,6 +2274,8 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
>  #endif
>  
>  out:
> +	/* smc_lnk_cluster_hold in smcr_lnk_get_or_create_cluster */
> +	smc_lnk_cluster_put(lnkc);
>  	return rc;
>  }
>  
> @@ -2599,12 +2934,23 @@ static int smc_core_reboot_event(struct notifier_block *this,
>  
>  int __init smc_core_init(void)
>  {
> +	/* init smc lnk cluster maps */
> +	rhashtable_init(&smc_lgr_manager.lnk_cluster_maps, &smcr_lnk_cluster_rhl_params);
>  	return register_reboot_notifier(&smc_reboot_notifier);
>  }
>  
> +static void smc_lnk_cluster_free_cb(void *ptr, void *arg)
> +{
> +	pr_warn("smc: smc lnk cluster refcnt leak.\n");
> +	kfree(ptr);
> +}
> +
>  /* Called (from smc_exit) when module is removed */
>  void smc_core_exit(void)
>  {
>  	unregister_reboot_notifier(&smc_reboot_notifier);
>  	smc_lgrs_shutdown();
> +	/* destroy smc lnk cluster maps */
> +	rhashtable_free_and_destroy(&smc_lgr_manager.lnk_cluster_maps, smc_lnk_cluster_free_cb,
> +				    NULL);
>  }
> diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
> index fe8b524..199f533 100644
> --- a/net/smc/smc_core.h
> +++ b/net/smc/smc_core.h
> @@ -15,6 +15,7 @@
>  #include <linux/atomic.h>
>  #include <linux/smc.h>
>  #include <linux/pci.h>
> +#include <linux/rhashtable.h>
>  #include <rdma/ib_verbs.h>
>  #include <net/genetlink.h>
>  
> @@ -29,18 +30,62 @@ struct smc_lgr_list {			/* list of link group definition */
>  	u32			num;	/* unique link group number */
>  };
>  
> +struct smc_lgr_manager {		/* manager for link group */
> +	struct rhashtable	lnk_cluster_maps;	/* maps of smc_lnk_cluster */
> +	spinlock_t		lock;	/* lock for lgr_cm_maps */
> +};
> +
> +struct smc_lnk_cluster {
> +	struct rhash_head	rnode;	/* node for rhashtable */
> +	struct wait_queue_head	first_contact_waitqueue;
> +					/* queue for non first contact to wait
> +					 * first contact to be established.
> +					 */
> +	spinlock_t		lock;	/* protection for link group */
> +	refcount_t		ref;	/* refcount for cluster */
> +	unsigned long		pending_capability;
> +					/* maximum pending number of connections that
> +					 * need wait first contact complete.
> +					 */
> +	unsigned long		conns_pending;
> +					/* connections that are waiting for first contact
> +					 * complete
> +					 */
> +	u8		peer_systemid[SMC_SYSTEMID_LEN];
> +	u8		peer_mac[ETH_ALEN];	/* = gid[8:10||13:15] */
> +	u8		peer_gid[SMC_GID_SIZE];	/* gid of peer*/
> +	int		clcqpn;
> +	int		role;
> +};
> +
>  enum smc_lgr_role {		/* possible roles of a link group */
>  	SMC_CLNT,	/* client */
>  	SMC_SERV	/* server */
>  };
>  
> +struct smc_lnk_cluster_compare_arg	/* key for smc_lnk_cluster */
> +{
> +	int	smcr_version;
> +	enum smc_lgr_role role;
> +	u8	*peer_systemid;
> +	u8	*peer_gid;
> +	u8	*peer_mac;
> +	int clcqpn;
> +};
> +
>  enum smc_link_state {			/* possible states of a link */
>  	SMC_LNK_UNUSED,		/* link is unused */
>  	SMC_LNK_INACTIVE,	/* link is inactive */
>  	SMC_LNK_ACTIVATING,	/* link is being activated */
>  	SMC_LNK_ACTIVE,		/* link is active */
> +	SMC_LNK_TEAR_DWON,	/* link is tear down */
>  };
>  
> +#define SMC_LNK_STATE_BIT(state)	(1 << (state))
> +
> +#define	SMC_LNK_STATE_RECORD(lnk, state)	\
> +	((lnk)->state_record |= SMC_LNK_STATE_BIT(state))
> +
>  #define SMC_WR_BUF_SIZE		48	/* size of work request buffer */
>  #define SMC_WR_BUF_V2_SIZE	8192	/* size of v2 work request buffer */
>  
> @@ -145,6 +190,7 @@ struct smc_link {
>  	int			ndev_ifidx; /* network device ifindex */
>  
>  	enum smc_link_state	state;		/* state of link */
> +	int			state_record;		/* record of previous state */
>  	struct delayed_work	llc_testlink_wrk; /* testlink worker */
>  	struct completion	llc_testlink_resp; /* wait for rx of testlink */
>  	int			llc_testlink_time; /* testlink interval */
> @@ -557,6 +603,8 @@ struct smc_link *smc_switch_conns(struct smc_link_group *lgr,
>  int smcr_nl_get_link(struct sk_buff *skb, struct netlink_callback *cb);
>  int smcd_nl_get_lgr(struct sk_buff *skb, struct netlink_callback *cb);
>  
> +void smcr_lnk_cluster_on_lnk_state(struct smc_link *lnk);
> +
>  static inline struct smc_link_group *smc_get_lgr(struct smc_link *link)
>  {
>  	return link->lgr;
> diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
> index 175026a..8134c15 100644
> --- a/net/smc/smc_llc.c
> +++ b/net/smc/smc_llc.c
> @@ -1099,6 +1099,7 @@ int smc_llc_cli_add_link(struct smc_link *link, struct smc_llc_qentry *qentry)
>  		goto out;
>  out_clear_lnk:
>  	lnk_new->state = SMC_LNK_INACTIVE;
> +	smcr_lnk_cluster_on_lnk_state(lnk_new);
>  	smcr_link_clear(lnk_new, false);
>  out_reject:
>  	smc_llc_cli_add_link_reject(qentry);
> @@ -1278,6 +1279,7 @@ static void smc_llc_delete_asym_link(struct smc_link_group *lgr)
>  		return; /* no asymmetric link */
>  	if (!smc_link_downing(&lnk_asym->state))
>  		return;
> +	smcr_lnk_cluster_on_lnk_state(lnk_asym);
>  	lnk_new = smc_switch_conns(lgr, lnk_asym, false);
>  	smc_wr_tx_wait_no_pending_sends(lnk_asym);
>  	if (!lnk_new)
> @@ -1492,6 +1494,7 @@ int smc_llc_srv_add_link(struct smc_link *link,
>  out_err:
>  	if (link_new) {
>  		link_new->state = SMC_LNK_INACTIVE;
> +		smcr_lnk_cluster_on_lnk_state(link_new);
>  		smcr_link_clear(link_new, false);
>  	}
>  out:
> @@ -1602,8 +1605,10 @@ static void smc_llc_process_cli_delete_link(struct smc_link_group *lgr)
>  	del_llc->reason = 0;
>  	smc_llc_send_message(lnk, &qentry->msg); /* response */
>  
> -	if (smc_link_downing(&lnk_del->state))
> +	if (smc_link_downing(&lnk_del->state)) {
> +		smcr_lnk_cluster_on_lnk_state(lnk);
>  		smc_switch_conns(lgr, lnk_del, false);
> +	}
>  	smcr_link_clear(lnk_del, true);
>  
>  	active_links = smc_llc_active_link_count(lgr);
> @@ -1676,6 +1681,7 @@ static void smc_llc_process_srv_delete_link(struct smc_link_group *lgr)
>  		goto out; /* asymmetric link already deleted */
>  
>  	if (smc_link_downing(&lnk_del->state)) {
> +		smcr_lnk_cluster_on_lnk_state(lnk);
>  		if (smc_switch_conns(lgr, lnk_del, false))
>  			smc_wr_tx_wait_no_pending_sends(lnk_del);
>  	}
> @@ -2167,6 +2173,7 @@ void smc_llc_link_active(struct smc_link *link)
>  		schedule_delayed_work(&link->llc_testlink_wrk,
>  				      link->llc_testlink_time);
>  	}
> +	smcr_lnk_cluster_on_lnk_state(link);
>  }
>  
>  /* called in worker context */
> -- 
> 1.8.3.1

^ permalink raw reply

* [net-next 1/1] stmmac: intel: remove unused 'has_crossts' flag
From: Vee Khee Wong @ 2022-08-16  7:30 UTC (permalink / raw)
  To: peppe.cavallaro, alexandre.torgue, joabreu, davem, edumazet, kuba,
	pabeni, mcoquelin.stm32
  Cc: tee.min.tan, netdev, linux-stm32, linux-arm-kernel, linux-kernel


The 'has_crossts' flag was not used anywhere in the stmmac driver,
removing it from both header file and dwmac-intel driver.

Signed-off-by: Wong Vee Khee <veekhee@apple.com>
---
drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c | 1 -
include/linux/stmmac.h                            | 1 -
2 files changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c
index 52f9ed8db9c9..1d96ca96009b 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c
@@ -610,7 +610,6 @@ static int intel_mgbe_common_data(struct pci_dev *pdev,
	plat->int_snapshot_num = AUX_SNAPSHOT1;
	plat->ext_snapshot_num = AUX_SNAPSHOT0;

-	plat->has_crossts = true;
	plat->crosststamp = intel_crosststamp;
	plat->int_snapshot_en = 0;

diff --git a/include/linux/stmmac.h b/include/linux/stmmac.h
index 8df475db88c0..fb2e88614f5d 100644
--- a/include/linux/stmmac.h
+++ b/include/linux/stmmac.h
@@ -257,7 +257,6 @@ struct plat_stmmacenet_data {
	u8 vlan_fail_q;
	unsigned int eee_usecs_rate;
	struct pci_dev *pdev;
-	bool has_crossts;
	int int_snapshot_num;
	int ext_snapshot_num;
	bool int_snapshot_en;
-- 
2.32.1 (Apple Git-133)


^ permalink raw reply related

* Re: [PATCH] net: Fix suspicious RCU usage in bpf_sk_reuseport_detach()
From: David Howells @ 2022-08-16 13:09 UTC (permalink / raw)
  To: Hawkins Jiawei
  Cc: dhowells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, linux-kernel, netdev, linux-kernel-mentees, bpf
In-Reply-To: <20220816103452.479281-1-yin31149@gmail.com>

Hawkins Jiawei <yin31149@gmail.com> wrote:

>  	if (socks) {
>  		WRITE_ONCE(sk->sk_user_data, NULL);

Btw, shouldn't this be rcu_assign_pointer() or RCU_INIT_POINTER(), not
WRITE_ONCE()?

David


^ permalink raw reply

* Re: [PATCH net v3 2/2] bonding: 802.3ad: fix no transmission of LACPDUs
From: Jay Vosburgh @ 2022-08-16 13:11 UTC (permalink / raw)
  To: Jonathan Toppins
  Cc: netdev, liuhangbin, Veaceslav Falico, Andy Gospodarek,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	linux-kernel
In-Reply-To: <0639f1e3d366c5098d561a947fd416fa5277e7f4.1660572700.git.jtoppins@redhat.com>

Jonathan Toppins <jtoppins@redhat.com> wrote:

>This is caused by the global variable ad_ticks_per_sec being zero as
>demonstrated by the reproducer script discussed below. This causes
>all timer values in __ad_timer_to_ticks to be zero, resulting
>in the periodic timer to never fire.
>
>To reproduce:
>Run the script in
>`tools/testing/selftests/drivers/net/bonding/bond-break-lacpdu-tx.sh` which
>puts bonding into a state where it never transmits LACPDUs.
>
>line 44: ip link add fbond type bond mode 4 miimon 200 \
>            xmit_hash_policy 1 ad_actor_sys_prio 65535 lacp_rate fast
>setting bond param: ad_actor_sys_prio
>given:
>    params.ad_actor_system = 0
>call stack:
>    bond_option_ad_actor_sys_prio()
>    -> bond_3ad_update_ad_actor_settings()
>       -> set ad.system.sys_priority = bond->params.ad_actor_sys_prio
>       -> ad.system.sys_mac_addr = bond->dev->dev_addr; because
>            params.ad_actor_system == 0
>results:
>     ad.system.sys_mac_addr = bond->dev->dev_addr
>
>line 48: ip link set fbond address 52:54:00:3B:7C:A6
>setting bond MAC addr
>call stack:
>    bond->dev->dev_addr = new_mac
>
>line 52: ip link set fbond type bond ad_actor_sys_prio 65535
>setting bond param: ad_actor_sys_prio
>given:
>    params.ad_actor_system = 0
>call stack:
>    bond_option_ad_actor_sys_prio()
>    -> bond_3ad_update_ad_actor_settings()
>       -> set ad.system.sys_priority = bond->params.ad_actor_sys_prio
>       -> ad.system.sys_mac_addr = bond->dev->dev_addr; because
>            params.ad_actor_system == 0
>results:
>     ad.system.sys_mac_addr = bond->dev->dev_addr
>
>line 60: ip link set veth1-bond down master fbond
>given:
>    params.ad_actor_system = 0
>    params.mode = BOND_MODE_8023AD
>    ad.system.sys_mac_addr == bond->dev->dev_addr
>call stack:
>    bond_enslave
>    -> bond_3ad_initialize(); because first slave
>       -> if ad.system.sys_mac_addr != bond->dev->dev_addr
>          return
>results:
>     Nothing is run in bond_3ad_initialize() because dev_add equals
>     sys_mac_addr leaving the global ad_ticks_per_sec zero as it is
>     never initialized anywhere else.
>
>Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
>Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
>---
>
>Notes:
>    v2:
>     * split this fix from the reproducer
>    v3:
>     * rebased to latest net/master
>
> drivers/net/bonding/bond_3ad.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
>index d7fb33c078e8..957d30db6f95 100644
>--- a/drivers/net/bonding/bond_3ad.c
>+++ b/drivers/net/bonding/bond_3ad.c
>@@ -84,7 +84,8 @@ enum ad_link_speed_type {
> static const u8 null_mac_addr[ETH_ALEN + 2] __long_aligned = {
> 	0, 0, 0, 0, 0, 0
> };
>-static u16 ad_ticks_per_sec;
>+
>+static u16 ad_ticks_per_sec = 1000 / AD_TIMER_INTERVAL;
> static const int ad_delta_in_ticks = (AD_TIMER_INTERVAL * HZ) / 1000;

	I still feel like this is kind of a hack, as it's not really
fixing bond_3ad_initialize to actually work (which is the real problem
as I understand it).  If it's ok to skip all that for this case, then
why do we ever need to call bond_3ad_initialize?

	-J

> static const u8 lacpdu_mcast_addr[ETH_ALEN + 2] __long_aligned =
>-- 
>2.31.1
>

---
	-Jay Vosburgh, jay.vosburgh@canonical.com

^ permalink raw reply

* Re: [PATCH net] net: phy: Warn about incorrect mdio_bus_phy_resume() state
From: Geert Uytterhoeven @ 2022-08-16 13:20 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Marek Szyprowski, netdev, Steve Glendinning, Doug Berger,
	Andrew Lunn, Heiner Kallweit, Russell King, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, open list,
	Linux-Renesas, Sergey Shtylyov
In-Reply-To: <8c21e530-8e8f-ce2a-239e-9d3a354996cf@gmail.com>

Hi Florian,

On Fri, Aug 12, 2022 at 6:39 PM Florian Fainelli <f.fainelli@gmail.com> wrote:
> On 8/12/22 04:19, Marek Szyprowski wrote:
> > On 02.08.2022 01:34, Florian Fainelli wrote:
> >> Calling mdio_bus_phy_resume() with neither the PHY state machine set to
> >> PHY_HALTED nor phydev->mac_managed_pm set to true is a good indication
> >> that we can produce a race condition looking like this:
> >>
> >> CPU0                                         CPU1
> >> bcmgenet_resume
> >>    -> phy_resume
> >>      -> phy_init_hw
> >>    -> phy_start
> >>      -> phy_resume
> >>                                                   phy_start_aneg()
> >> mdio_bus_phy_resume
> >>    -> phy_resume
> >>       -> phy_write(..., BMCR_RESET)
> >>        -> usleep()                                  -> phy_read()
> >>
> >> with the phy_resume() function triggering a PHY behavior that might have
> >> to be worked around with (see bf8bfc4336f7 ("net: phy: broadcom: Fix
> >> brcm_fet_config_init()") for instance) that ultimately leads to an error
> >> reading from the PHY.
> >>
> >> Fixes: fba863b81604 ("net: phy: make PHY PM ops a no-op if MAC driver manages PHY PM")
> >> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
> >
> > This patch, as probably intended, triggers a warning during system
> > suspend/resume cycle in the SMSC911x driver. I've observed it on ARM
> > Juno R1 board on the kernel compiled from next-202208010:
> >
> >    ------------[ cut here ]------------
> >    WARNING: CPU: 1 PID: 398 at drivers/net/phy/phy_device.c:323
> > mdio_bus_phy_resume+0x34/0xc8

I am seeing the same on the ape6evm and kzm9g development
boards with smsc911x Ethernet, and on various boards with Renesas
Ethernet (sh_eth or ravb) if Wake-on-LAN is disabled.

> Yes this is catching an actual issue in the driver in that the PHY state
> machine is still running while the system is trying to suspend. We could
> go about fixing it in a different number of ways, though I believe this
> one is probably correct enough to work and fix the warning:

> --- a/drivers/net/ethernet/smsc/smsc911x.c
> +++ b/drivers/net/ethernet/smsc/smsc911x.c
> @@ -1037,6 +1037,8 @@ static int smsc911x_mii_probe(struct net_device *dev)
>                  return ret;
>          }
>
> +       /* Indicate that the MAC is responsible for managing PHY PM */
> +       phydev->mac_managed_pm = true;
>          phy_attached_info(phydev);
>
>          phy_set_max_speed(phydev, SPEED_100);
> @@ -2587,6 +2589,8 @@ static int smsc911x_suspend(struct device *dev)
>          if (netif_running(ndev)) {
>                  netif_stop_queue(ndev);
>                  netif_device_detach(ndev);
> +               if (!device_may_wakeup(dev))
> +                       phy_suspend(dev->phydev);
>          }
>
>          /* enable wake on LAN, energy detection and the external PME
> @@ -2628,6 +2632,8 @@ static int smsc911x_resume(struct device *dev)
>          if (netif_running(ndev)) {
>                  netif_device_attach(ndev);
>                  netif_start_queue(ndev);
> +               if (!device_may_wakeup(dev))
> +                       phy_resume(dev->phydev);
>          }
>
>          return 0;

Thanks for your patch, but unfortunately this does not work on ape6evm
and kzm9g, where the smsc911x device is connected to a power-managed
bus.  It looks like the PHY registers are accessed while the device
is already suspended, causing a crash during system suspend:

8<--- cut here ---
Unhandled fault: imprecise external abort (0x1406) at 0x00000000
[00000000] *pgd=00000000
Internal error: : 1406 [#1] SMP ARM
CPU: 2 PID: 75 Comm: kworker/2:2 Not tainted
6.0.0-rc1-ape6evm-00977-gdc70725fbca5-dirty #375
Hardware name: Generic R8A73A4 (Flattened Device Tree)
Workqueue: events_power_efficient phy_state_machine
PC is at smsc911x_reg_read+0x30/0x48
LR is at smsc911x_reg_read+0x30/0x48
pc : [<c03807cc>]    lr : [<c03807cc>]    psr: 20030093
sp : f0891e30  ip : 00000000  fp : eff98605
r10: c092af80  r9 : c202e6e8  r8 : 20030013
r7 : c202e70c  r6 : 000000a4  r5 : 20030093  r4 : c202e6c0
r3 : f0a31000  r2 : 00000002  r1 : f0a310a4  r0 : 00000000
Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 4552006a  DAC: 00000051
Register r0 information: NULL pointer
Register r1 information: 0-page vmalloc region starting at 0xf0a31000
allocated at smsc911x_drv_probe+0x108/0x934
Register r2 information: non-paged memory
Register r3 information: 0-page vmalloc region starting at 0xf0a31000
allocated at smsc911x_drv_probe+0x108/0x934
Register r4 information: slab kmalloc-4k start c202e000 pointer offset
1728 size 4096
Register r5 information: non-paged memory
Register r6 information: non-paged memory
Register r7 information: slab kmalloc-4k start c202e000 pointer offset
1804 size 4096
Register r8 information: non-paged memory
Register r9 information: slab kmalloc-4k start c202e000 pointer offset
1768 size 4096
Register r10 information: non-slab/vmalloc memory
Register r11 information: non-slab/vmalloc memory
Register r12 information: NULL pointer
Process kworker/2:2 (pid: 75, stack limit = 0x5239c21f)
Stack: (0xf0891e30 to 0xf0892000)
1e20:                                     c202e6c0 00000006 c19da000 c202e6c0
1e40: 20030013 c03815bc 00000000 00000001 c19da000 c0381820 00000001 c19da000
1e60: c19da000 00000000 c39eacc0 00000000 c092af80 c037e694 c19da000 00000001
1e80: 00000000 c19da758 c19da000 00000001 00000000 c037e888 c29a0000 c29a03f4
1ea0: 00000078 c29a0448 c39eacc0 c037c878 c29a0000 c037cab4 c29a0000 c29a03f4
1ec0: c29a0000 c037fbc8 c29a0000 c29a03f4 c29a0000 c29a0448 00000004 c0377540
1ee0: c3ac8f00 c29a03f4 c29a0000 c0378588 009a03f4 c07cce88 c3ac8f00 c29a03f4
1f00: eff96780 00000000 eff98600 00000080 c092af80 c00426f0 00000001 00000000
1f20: c00425c4 c07cce88 c07bb574 00000000 c13fa5b8 c0da3f6c 00000000 c071c1da
1f40: 00000000 c07cce88 00000000 c3ac8f00 c3ac8f18 eff96780 c092a665 c07c9d00
1f60: eff967bc c0934e20 00000000 c0042b30 c2b8a500 c2ba65c0 c3ac8880 f0901ebc
1f80: c00428f0 c3ac8f00 00000000 c00494c4 c2ba65c0 c00493f4 00000000 00000000
1fa0: 00000000 00000000 00000000 c0009108 00000000 00000000 00000000 00000000
1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
1fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
 smsc911x_reg_read from smsc911x_mac_read+0x4c/0xa0
 smsc911x_mac_read from smsc911x_mii_read+0x38/0xb4
 smsc911x_mii_read from __mdiobus_read+0x70/0xc4
 __mdiobus_read from mdiobus_read+0x34/0x48
 mdiobus_read from genphy_update_link+0x10/0xc8
 genphy_update_link from genphy_read_status+0x10/0xc4
 genphy_read_status from lan87xx_read_status+0x10/0x11c
 lan87xx_read_status from phy_check_link_status+0x5c/0xbc
 phy_check_link_status from phy_state_machine+0x78/0x218
 phy_state_machine from process_one_work+0x2f0/0x4c4
 process_one_work from worker_thread+0x240/0x2d0
 worker_thread from kthread+0xd0/0xe0
 kthread from ret_from_fork+0x14/0x2c
Exception stack(0xf0891fb0 to 0xf0891ff8)
1fa0:                                     00000000 00000000 00000000 00000000
1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
1fe0: 00000000 00000000 00000000 00000000 00000013 00000000
Code: e5933000 e1a05000 e1a00004 e12fff33 (e1a01005)
---[ end trace 0000000000000000 ]---

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

* Re: [PATCH for-rc] RDMA/cxgb4: fix accept failure due to increased cpl_t5_pass_accept_rpl size
From: Leon Romanovsky @ 2022-08-16 13:21 UTC (permalink / raw)
  To: Rahul Lakkireddy
  Cc: linux-rdma, netdev, jgg, davem, kuba, edumazet, pabeni, keescook,
	bharat
In-Reply-To: <20220809184118.2029-1-rahul.lakkireddy@chelsio.com>

On Wed, Aug 10, 2022 at 12:11:18AM +0530, Rahul Lakkireddy wrote:
> From: Potnuri Bharat Teja <bharat@chelsio.com>
> 
> Commit 'c2ed5611afd7' has increased the cpl_t5_pass_accept_rpl{} structure
> size by 8B to avoid roundup. cpl_t5_pass_accept_rpl{} is a HW specific
> structure and increasing its size will lead to unwanted adapter errors.
> Current commit reverts the cpl_t5_pass_accept_rpl{} back to its original
> and allocates zeroed skb buffer there by avoiding the memset for iss field.
> Reorder code to minimize chip type checks.
> 
> Fixes: c2ed5611afd7 ("iw_cxgb4: Use memset_startat() for cpl_t5_pass_accept_rpl")
> Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
> Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
> ---
>  drivers/infiniband/hw/cxgb4/cm.c            | 25 ++++++++-------------
>  drivers/net/ethernet/chelsio/cxgb4/t4_msg.h |  2 +-
>  2 files changed, 10 insertions(+), 17 deletions(-)
> 

Thanks, applied to -rc.

^ permalink raw reply

* Re: [PATCH V2 net-next] net: phy: realtek: add support for RTL8221F(D)(I)-VD-CG
From: Heiner Kallweit @ 2022-08-16 13:22 UTC (permalink / raw)
  To: wei.fang, andrew, linux, davem, edumazet, kuba, pabeni, netdev
  Cc: xiaoning.wang
In-Reply-To: <20220816194859.2369-1-wei.fang@nxp.com>

On 16.08.2022 21:48, wei.fang@nxp.com wrote:
> From: Clark Wang <xiaoning.wang@nxp.com>
> 
> RTL8221F(D)(I)-VD-CG is the pin-to-pin upgrade chip from
> RTL8221F(D)(I)-CG.

Here you talk about RTL8221, in the driver struct definition you say RTL8211.
You changed the naming already for v2. It would be time for you to clarify
which chip you actually mean.
RTL8221 is a 2.5Gbps PHY, however you don't handle this mode in your code.

> 
> Add new PHY ID for this chip.
> It does not support RTL8211F_PHYCR2 anymore, so remove the w/r operation
> of this register.
> 
> Signed-off-by: Clark Wang <xiaoning.wang@nxp.com>
> Signed-off-by: Wei Fang <wei.fang@nxp.com>
> ---
> V2 change:
> 1. Commit message changed, RTL8221 instead of RTL8821.
> 2. Add has_phycr2 to struct rtl821x_priv.
> ---
>  drivers/net/phy/realtek.c | 44 ++++++++++++++++++++++++++++-----------
>  1 file changed, 32 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/net/phy/realtek.c b/drivers/net/phy/realtek.c
> index a5671ab896b3..3d99fd6664d7 100644
> --- a/drivers/net/phy/realtek.c
> +++ b/drivers/net/phy/realtek.c
> @@ -70,6 +70,7 @@
>  #define RTLGEN_SPEED_MASK			0x0630
>  
>  #define RTL_GENERIC_PHYID			0x001cc800
> +#define RTL_8211FVD_PHYID			0x001cc878
>  
>  MODULE_DESCRIPTION("Realtek PHY driver");
>  MODULE_AUTHOR("Johnson Leung");
> @@ -78,6 +79,7 @@ MODULE_LICENSE("GPL");
>  struct rtl821x_priv {
>  	u16 phycr1;
>  	u16 phycr2;
> +	bool has_phycr2;
>  };
>  
>  static int rtl821x_read_page(struct phy_device *phydev)
> @@ -94,6 +96,7 @@ static int rtl821x_probe(struct phy_device *phydev)
>  {
>  	struct device *dev = &phydev->mdio.dev;
>  	struct rtl821x_priv *priv;
> +	u32 phy_id = phydev->drv->phy_id;
>  	int ret;
>  
>  	priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
> @@ -108,13 +111,16 @@ static int rtl821x_probe(struct phy_device *phydev)
>  	if (of_property_read_bool(dev->of_node, "realtek,aldps-enable"))
>  		priv->phycr1 |= RTL8211F_ALDPS_PLL_OFF | RTL8211F_ALDPS_ENABLE | RTL8211F_ALDPS_XTAL_OFF;
>  
> -	ret = phy_read_paged(phydev, 0xa43, RTL8211F_PHYCR2);
> -	if (ret < 0)
> -		return ret;
> +	priv->has_phycr2 = !(phy_id == RTL_8211FVD_PHYID);
> +	if (priv->has_phycr2) {
> +		ret = phy_read_paged(phydev, 0xa43, RTL8211F_PHYCR2);
> +		if (ret < 0)
> +			return ret;
>  
> -	priv->phycr2 = ret & RTL8211F_CLKOUT_EN;
> -	if (of_property_read_bool(dev->of_node, "realtek,clkout-disable"))
> -		priv->phycr2 &= ~RTL8211F_CLKOUT_EN;
> +		priv->phycr2 = ret & RTL8211F_CLKOUT_EN;
> +		if (of_property_read_bool(dev->of_node, "realtek,clkout-disable"))
> +			priv->phycr2 &= ~RTL8211F_CLKOUT_EN;
> +	}
>  
>  	phydev->priv = priv;
>  
> @@ -400,12 +406,14 @@ static int rtl8211f_config_init(struct phy_device *phydev)
>  			val_rxdly ? "enabled" : "disabled");
>  	}
>  
> -	ret = phy_modify_paged(phydev, 0xa43, RTL8211F_PHYCR2,
> -			       RTL8211F_CLKOUT_EN, priv->phycr2);
> -	if (ret < 0) {
> -		dev_err(dev, "clkout configuration failed: %pe\n",
> -			ERR_PTR(ret));
> -		return ret;
> +	if (priv->has_phycr2) {
> +		ret = phy_modify_paged(phydev, 0xa43, RTL8211F_PHYCR2,
> +				       RTL8211F_CLKOUT_EN, priv->phycr2);
> +		if (ret < 0) {
> +			dev_err(dev, "clkout configuration failed: %pe\n",
> +				ERR_PTR(ret));
> +			return ret;
> +		}
>  	}
>  
>  	return genphy_soft_reset(phydev);
> @@ -923,6 +931,18 @@ static struct phy_driver realtek_drvs[] = {
>  		.resume		= rtl821x_resume,
>  		.read_page	= rtl821x_read_page,
>  		.write_page	= rtl821x_write_page,
> +	}, {
> +		PHY_ID_MATCH_EXACT(RTL_8211FVD_PHYID),
> +		.name		= "RTL8211F-VD Gigabit Ethernet",

This conflicts with RTL8221 in the commit message.

> +		.probe		= rtl821x_probe,
> +		.config_init	= &rtl8211f_config_init,
> +		.read_status	= rtlgen_read_status,
> +		.config_intr	= &rtl8211f_config_intr,
> +		.handle_interrupt = rtl8211f_handle_interrupt,
> +		.suspend	= genphy_suspend,
> +		.resume		= rtl821x_resume,
> +		.read_page	= rtl821x_read_page,
> +		.write_page	= rtl821x_write_page,
>  	}, {
>  		.name		= "Generic FE-GE Realtek PHY",
>  		.match_phy_device = rtlgen_match_phy_device,


^ permalink raw reply

* [PATCH net v2] net: Fix suspicious RCU usage in bpf_sk_reuseport_detach()
From: David Howells @ 2022-08-16 13:26 UTC (permalink / raw)
  To: yin31149; +Cc: Jakub Kicinski, netdev, dhowells, linux-kernel

bpf_sk_reuseport_detach() calls __rcu_dereference_sk_user_data_with_flags()
to obtain the value of sk->sk_user_data, but that function is only usable
if the RCU read lock is held, and neither that function nor any of its
callers hold it.

Fix this by adding a new helper,
__rcu_dereference_sk_user_data_with_flags_check() that checks to see if
sk->sk_callback_lock() is held and use that here instead.

__rcu_dereference_sk_user_data_with_flags() then calls that, supplying false
as condition indicating only the RCU read lock should be checked.

Without this, the following warning can be occasionally observed:

=============================
WARNING: suspicious RCU usage
6.0.0-rc1-build2+ #563 Not tainted
-----------------------------
include/net/sock.h:592 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
5 locks held by locktest/29873:
 #0: ffff88812734b550 (&sb->s_type->i_mutex_key#9){+.+.}-{3:3}, at: __sock_release+0x77/0x121
 #1: ffff88812f5621b0 (sk_lock-AF_INET){+.+.}-{0:0}, at: tcp_close+0x1c/0x70
 #2: ffff88810312f5c8 (&h->lhash2[i].lock){+.+.}-{2:2}, at: inet_unhash+0x76/0x1c0
 #3: ffffffff83768bb8 (reuseport_lock){+...}-{2:2}, at: reuseport_detach_sock+0x18/0xdd
 #4: ffff88812f562438 (clock-AF_INET){++..}-{2:2}, at: bpf_sk_reuseport_detach+0x24/0xa4

stack backtrace:
CPU: 1 PID: 29873 Comm: locktest Not tainted 6.0.0-rc1-build2+ #563
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x4c/0x5f
 bpf_sk_reuseport_detach+0x6d/0xa4
 reuseport_detach_sock+0x75/0xdd
 inet_unhash+0xa5/0x1c0
 tcp_set_state+0x169/0x20f
 ? lockdep_sock_is_held+0x3a/0x3a
 ? __lock_release.isra.0+0x13e/0x220
 ? reacquire_held_locks+0x1bb/0x1bb
 ? hlock_class+0x31/0x96
 ? mark_lock+0x9e/0x1af
 __tcp_close+0x50/0x4b6
 tcp_close+0x28/0x70
 inet_release+0x8e/0xa7
 __sock_release+0x95/0x121
 sock_close+0x14/0x17
 __fput+0x20f/0x36a
 task_work_run+0xa3/0xcc
 exit_to_user_mode_prepare+0x9c/0x14d
 syscall_exit_to_user_mode+0x18/0x44
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Changes
=======
ver #2)
 - Changed to suggestion from Hawkins Jiawei to have a ..._check() function
   and make the original a special case of that.

Fixes: cf8c1e967224 ("net: refactor bpf_sk_reuseport_detach()")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Hawkins Jiawei <yin31149@gmail.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: netdev@vger.kernel.org
Link: https://lore.kernel.org/r/166064248071.3502205.10036394558814861778.stgit@warthog.procyon.org.uk # v1
---

 include/net/sock.h           |   18 ++++++++++++++----
 kernel/bpf/reuseport_array.c |    3 ++-
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 05a1bbdf5805..6464da28e842 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -578,18 +578,22 @@ static inline bool sk_user_data_is_nocopy(const struct sock *sk)
 #define __sk_user_data(sk) ((*((void __rcu **)&(sk)->sk_user_data)))
 
 /**
- * __rcu_dereference_sk_user_data_with_flags - return the pointer
+ * __rcu_dereference_sk_user_data_with_flags_check - return the pointer
  * only if argument flags all has been set in sk_user_data. Otherwise
  * return NULL
  *
  * @sk: socket
  * @flags: flag bits
+ * @condition: Condition under which non-RCU access may take place
+ *
+ * The caller must be holding the RCU read lock 
  */
 static inline void *
-__rcu_dereference_sk_user_data_with_flags(const struct sock *sk,
-					  uintptr_t flags)
+__rcu_dereference_sk_user_data_with_flags_check(const struct sock *sk,
+						uintptr_t flags, bool condition)
 {
-	uintptr_t sk_user_data = (uintptr_t)rcu_dereference(__sk_user_data(sk));
+	uintptr_t sk_user_data =
+		(uintptr_t)rcu_dereference_check(__sk_user_data(sk), condition);
 
 	WARN_ON_ONCE(flags & SK_USER_DATA_PTRMASK);
 
@@ -598,6 +602,12 @@ __rcu_dereference_sk_user_data_with_flags(const struct sock *sk,
 	return NULL;
 }
 
+static inline void *
+__rcu_dereference_sk_user_data_with_flags(const struct sock *sk, uintptr_t flags)
+{
+	return __rcu_dereference_sk_user_data_with_flags_check(sk, flags, false);
+}
+
 #define rcu_dereference_sk_user_data(sk)				\
 	__rcu_dereference_sk_user_data_with_flags(sk, 0)
 #define __rcu_assign_sk_user_data_with_flags(sk, ptr, flags)		\
diff --git a/kernel/bpf/reuseport_array.c b/kernel/bpf/reuseport_array.c
index 85fa9dbfa8bf..856c360a591d 100644
--- a/kernel/bpf/reuseport_array.c
+++ b/kernel/bpf/reuseport_array.c
@@ -24,7 +24,8 @@ void bpf_sk_reuseport_detach(struct sock *sk)
 	struct sock __rcu **socks;
 
 	write_lock_bh(&sk->sk_callback_lock);
-	socks = __rcu_dereference_sk_user_data_with_flags(sk, SK_USER_DATA_BPF);
+	socks = __rcu_dereference_sk_user_data_with_flags_check(
+		sk, SK_USER_DATA_BPF, lockdep_is_held(&sk->sk_callback_lock));
 	if (socks) {
 		WRITE_ONCE(sk->sk_user_data, NULL);
 		/*



^ permalink raw reply related

* Re: [PATCH v1 0/3] Bring back driver_deferred_probe_check_state() for now
From: Luca Weiss @ 2022-08-16 13:30 UTC (permalink / raw)
  To: Saravana Kannan
  Cc: Tony Lindgren, Greg Kroah-Hartman, Rafael J. Wysocki,
	Kevin Hilman, Ulf Hansson, Pavel Machek, Len Brown, Andrew Lunn,
	Heiner Kallweit, Russell King, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, naresh.kamboju, kernel-team,
	linux-kernel, linux-pm, netdev
In-Reply-To: <CAGETcx_6oh=GVLP7-1gN_4DW7UHJ1MZQ6T1U2hupc_ZYDnXcNA@mail.gmail.com>

Hi Saravana,

On Tue Aug 16, 2022 at 1:36 AM CEST, Saravana Kannan wrote:
> On Mon, Aug 15, 2022 at 9:57 AM Luca Weiss <luca.weiss@fairphone.com> wrote:
> >
> > On Mon Aug 15, 2022 at 1:01 PM CEST, Tony Lindgren wrote:
> > > * Saravana Kannan <saravanak@google.com> [700101 02:00]:
> > > > More fixes/changes are needed before driver_deferred_probe_check_state()
> > > > can be deleted. So, bring it back for now.
> > > >
> > > > Greg,
> > > >
> > > > Can we get this into 5.19? If not, it might not be worth picking up this
> > > > series. I could just do the other/more fixes in time for 5.20.
> > >
> > > Yes please pick this as fixes for v6.0-rc series, it fixes booting for
> > > me. I've replied with fixes tags for the two patches that were causing
> > > regressions for me.
> > >
> >
> > Hi,
> >
> > for me Patch 1+3 fix display probe on Qualcomm SM6350 (although display
> > for this SoC isn't upstream yet, there are lots of other SoCs with very
> > similar setup).
> >
> > Probe for DPU silently fails, with CONFIG_DEBUG_DRIVER=y we get this:
> >
> > msm-mdss ae00000.mdss: __genpd_dev_pm_attach() failed to find PM domain: -2
> >
> > While I'm not familiar with the specifics of fw_devlink, the dtsi has
> > power-domains = <&dispcc MDSS_GDSC> for this node but it doesn't pick
> > that up for some reason.
> >
> > We can also see that a bit later dispcc finally probes.
>
> Luca,
>
> Can you test with this series instead and see if it fixes this issue?
> https://lore.kernel.org/lkml/20220810060040.321697-1-saravanak@google.com/
>

Unfortunately it doesn't seem to work with the 9 patches, and the
attached diff also doesn't seem to make a difference. I do see this in
dmesg which I haven't seen in the past:

[    0.056554] platform 1d87000.phy: Fixed dependency cycle(s) with /soc@0/ufs@1d84000
[    0.060070] platform ae00000.mdss: Fixed dependency cycle(s) with /soc@0/clock-controller@af00000
[    0.060150] platform ae00000.mdss: Failed to create device link with ae00000.mdss
[    0.060188] platform ae00000.mdss: Failed to create device link with ae00000.mdss
[    0.061135] platform c440000.spmi: Failed to create device link with c440000.spmi
[    0.061157] platform c440000.spmi: Failed to create device link with c440000.spmi
[    0.061180] platform c440000.spmi: Failed to create device link with c440000.spmi
[    0.061198] platform c440000.spmi: Failed to create device link with c440000.spmi
[    0.061215] platform c440000.spmi: Failed to create device link with c440000.spmi
[    0.061231] platform c440000.spmi: Failed to create device link with c440000.spmi
[    0.061252] platform c440000.spmi: Failed to create device link with c440000.spmi

Also I'm going to be on holiday from today for about 2 weeks so I won't
be able to test anything in that time.

And in case it's interesting, here's the full dmesg to initramfs:
https://pastebin.com/raw/Fc8W4MVi

Regards
Luca

> You might also need to add this delta on top of the series if the
> series itself isn't sufficient.
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index 2f012e826986..866755d8ad95 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -2068,7 +2068,11 @@ static int fw_devlink_create_devlink(struct device *con,
>                 device_links_write_unlock();
>         }
>
> -       sup_dev = get_dev_from_fwnode(sup_handle);
> +       if (sup_handle->flags & FWNODE_FLAG_NOT_DEVICE)
> +               sup_dev = fwnode_get_next_parent_dev(sup_handle);
> +       else
> +               sup_dev = get_dev_from_fwnode(sup_handle);
> +
>         if (sup_dev) {
>                 /*
>                  * If it's one of those drivers that don't actually bind to
>
> -Saravana


^ permalink raw reply

* Re: [PATCH net-next] tcp: Make SYN ACK RTO tunable by BPF programs with TFO
From: Neal Cardwell @ 2022-08-16 13:36 UTC (permalink / raw)
  To: Jie Meng; +Cc: netdev, kafai, kuba, edumazet, bpf, Yuchung Cheng, Wei Wang
In-Reply-To: <20220815202900.3961097-1-jmeng@fb.com>

On Mon, Aug 15, 2022 at 8:30 PM Jie Meng <jmeng@fb.com> wrote:
>
> Instead of the hardcoded TCP_TIMEOUT_INIT, this diff calls tcp_timeout_init
> to initiate req->timeout like the non TFO SYN ACK case.
>
> Tested using the following packetdrill script, on a host with a BPF
> program that sets the initial connect timeout to 10ms.
>
> `../../common/defaults.sh`
>
> // Initialize connection
>     0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>    +0 setsockopt(3, SOL_TCP, TCP_FASTOPEN, [1], 4) = 0
>    +0 bind(3, ..., ...) = 0
>    +0 listen(3, 1) = 0
>
>    +0 < S 0:0(0) win 32792 <mss 1000,sackOK,FO TFO_COOKIE>
>    +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
>    +.01 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
>    +.02 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
>    +.04 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
>    +.01 < . 1:1(0) ack 1 win 32792
>
>    +0 accept(3, ..., ...) = 4
>
> Signed-off-by: Jie Meng <jmeng@fb.com>
> ---
>  net/ipv4/tcp_fastopen.c | 3 ++-
>  net/ipv4/tcp_timer.c    | 2 +-
>  2 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
> index 825b216d11f5..45cc7f1ca296 100644
> --- a/net/ipv4/tcp_fastopen.c
> +++ b/net/ipv4/tcp_fastopen.c
> @@ -272,8 +272,9 @@ static struct sock *tcp_fastopen_create_child(struct sock *sk,
>          * The request socket is not added to the ehash
>          * because it's been added to the accept queue directly.
>          */
> +       req->timeout = tcp_timeout_init(child);
>         inet_csk_reset_xmit_timer(child, ICSK_TIME_RETRANS,
> -                                 TCP_TIMEOUT_INIT, TCP_RTO_MAX);
> +                                 req->timeout, TCP_RTO_MAX);
>
>         refcount_set(&req->rsk_refcnt, 2);
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index b4dfb82d6ecb..cb79127f45c3 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -428,7 +428,7 @@ static void tcp_fastopen_synack_timer(struct sock *sk, struct request_sock *req)
>         if (!tp->retrans_stamp)
>                 tp->retrans_stamp = tcp_time_stamp(tp);
>         inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
> -                         TCP_TIMEOUT_INIT << req->num_timeout, TCP_RTO_MAX);
> +                         req->timeout << req->num_timeout, TCP_RTO_MAX);
>  }
>
>
> --

Looks good to me. Thanks for the feature!

Acked-by: Neal Cardwell <ncardwell@google.com>

neal

^ permalink raw reply

* Re: [PATCH net v3 2/2] bonding: 802.3ad: fix no transmission of LACPDUs
From: Jonathan Toppins @ 2022-08-16 13:41 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: netdev, liuhangbin, Veaceslav Falico, Andy Gospodarek,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	linux-kernel
In-Reply-To: <17000.1660655501@famine>

On 8/16/22 09:11, Jay Vosburgh wrote:
> Jonathan Toppins <jtoppins@redhat.com> wrote:
> 
>> This is caused by the global variable ad_ticks_per_sec being zero as
>> demonstrated by the reproducer script discussed below. This causes
>> all timer values in __ad_timer_to_ticks to be zero, resulting
>> in the periodic timer to never fire.
>>
>> To reproduce:
>> Run the script in
>> `tools/testing/selftests/drivers/net/bonding/bond-break-lacpdu-tx.sh` which
>> puts bonding into a state where it never transmits LACPDUs.
>>
>> line 44: ip link add fbond type bond mode 4 miimon 200 \
>>             xmit_hash_policy 1 ad_actor_sys_prio 65535 lacp_rate fast
>> setting bond param: ad_actor_sys_prio
>> given:
>>     params.ad_actor_system = 0
>> call stack:
>>     bond_option_ad_actor_sys_prio()
>>     -> bond_3ad_update_ad_actor_settings()
>>        -> set ad.system.sys_priority = bond->params.ad_actor_sys_prio
>>        -> ad.system.sys_mac_addr = bond->dev->dev_addr; because
>>             params.ad_actor_system == 0
>> results:
>>      ad.system.sys_mac_addr = bond->dev->dev_addr
>>
>> line 48: ip link set fbond address 52:54:00:3B:7C:A6
>> setting bond MAC addr
>> call stack:
>>     bond->dev->dev_addr = new_mac
>>
>> line 52: ip link set fbond type bond ad_actor_sys_prio 65535
>> setting bond param: ad_actor_sys_prio
>> given:
>>     params.ad_actor_system = 0
>> call stack:
>>     bond_option_ad_actor_sys_prio()
>>     -> bond_3ad_update_ad_actor_settings()
>>        -> set ad.system.sys_priority = bond->params.ad_actor_sys_prio
>>        -> ad.system.sys_mac_addr = bond->dev->dev_addr; because
>>             params.ad_actor_system == 0
>> results:
>>      ad.system.sys_mac_addr = bond->dev->dev_addr
>>
>> line 60: ip link set veth1-bond down master fbond
>> given:
>>     params.ad_actor_system = 0
>>     params.mode = BOND_MODE_8023AD
>>     ad.system.sys_mac_addr == bond->dev->dev_addr
>> call stack:
>>     bond_enslave
>>     -> bond_3ad_initialize(); because first slave
>>        -> if ad.system.sys_mac_addr != bond->dev->dev_addr
>>           return
>> results:
>>      Nothing is run in bond_3ad_initialize() because dev_add equals
>>      sys_mac_addr leaving the global ad_ticks_per_sec zero as it is
>>      never initialized anywhere else.
>>
>> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
>> Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
>> ---
>>
>> Notes:
>>     v2:
>>      * split this fix from the reproducer
>>     v3:
>>      * rebased to latest net/master
>>
>> drivers/net/bonding/bond_3ad.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
>> index d7fb33c078e8..957d30db6f95 100644
>> --- a/drivers/net/bonding/bond_3ad.c
>> +++ b/drivers/net/bonding/bond_3ad.c
>> @@ -84,7 +84,8 @@ enum ad_link_speed_type {
>> static const u8 null_mac_addr[ETH_ALEN + 2] __long_aligned = {
>> 	0, 0, 0, 0, 0, 0
>> };
>> -static u16 ad_ticks_per_sec;
>> +
>> +static u16 ad_ticks_per_sec = 1000 / AD_TIMER_INTERVAL;
>> static const int ad_delta_in_ticks = (AD_TIMER_INTERVAL * HZ) / 1000;
> 
> 	I still feel like this is kind of a hack, as it's not really
> fixing bond_3ad_initialize to actually work (which is the real problem
> as I understand it).  If it's ok to skip all that for this case, then
> why do we ever need to call bond_3ad_initialize?
> 

The way it is currently written you still need to call 
bond_3ad_initialize() just not for setting the tick resolution. The 
issue here is ad_ticks_per_sec is used in several places to calculate 
timer periods, __ad_timer_to_ticks(), for various timers in the 802.3ad 
protocol. And if this variable, ad_ticks_per_sec, is left uninitialized 
all of these timer periods go to zero. Since the value passed in 
bond_3ad_initialize() is an immediate value I simply moved it off of the 
call stack and set the static global variable instead.

To fix bond_3ad_initialize(), probably something like the below is 
needed, but I do not understand why the guard if check was placed in 
bond_3ad_initialize().

diff --git i/drivers/net/bonding/bond_3ad.c w/drivers/net/bonding/bond_3ad.c
index d7fb33c078e8..5b5146f5c4ea 100644
--- i/drivers/net/bonding/bond_3ad.c
+++ w/drivers/net/bonding/bond_3ad.c
@@ -2005,32 +2005,21 @@ void bond_3ad_initiate_agg_selection(struct 
bonding *bond, int timeout)
   *
   * Can be called only after the mac address of the bond is set.
   */
-void bond_3ad_initialize(struct bonding *bond, u16 tick_resolution)
+void bond_3ad_initialize(struct bonding *bond)
  {
-	/* check that the bond is not initialized yet */
-	if (!MAC_ADDRESS_EQUAL(&(BOND_AD_INFO(bond).system.sys_mac_addr),
-				bond->dev->dev_addr)) {
-
-		BOND_AD_INFO(bond).aggregator_identifier = 0;
-
-		BOND_AD_INFO(bond).system.sys_priority =
-			bond->params.ad_actor_sys_prio;
-		if (is_zero_ether_addr(bond->params.ad_actor_system))
-			BOND_AD_INFO(bond).system.sys_mac_addr =
-			    *((struct mac_addr *)bond->dev->dev_addr);
-		else
-			BOND_AD_INFO(bond).system.sys_mac_addr =
-			    *((struct mac_addr *)bond->params.ad_actor_system);
-
-		/* initialize how many times this module is called in one
-		 * second (should be about every 100ms)
-		 */
-		ad_ticks_per_sec = tick_resolution;
+	BOND_AD_INFO(bond).aggregator_identifier = 0;
+	BOND_AD_INFO(bond).system.sys_priority =
+		bond->params.ad_actor_sys_prio;
+	if (is_zero_ether_addr(bond->params.ad_actor_system))
+		BOND_AD_INFO(bond).system.sys_mac_addr =
+		    *((struct mac_addr *)bond->dev->dev_addr);
+	else
+		BOND_AD_INFO(bond).system.sys_mac_addr =
+		    *((struct mac_addr *)bond->params.ad_actor_system);

-		bond_3ad_initiate_agg_selection(bond,
-						AD_AGGREGATOR_SELECTION_TIMER *
-						ad_ticks_per_sec);
-	}
+	bond_3ad_initiate_agg_selection(bond,
+					AD_AGGREGATOR_SELECTION_TIMER *
+					ad_ticks_per_sec);
  }

  /**
diff --git i/drivers/net/bonding/bond_main.c 
w/drivers/net/bonding/bond_main.c
index 50e60843020c..5f56af9dc3ba 100644
--- i/drivers/net/bonding/bond_main.c
+++ w/drivers/net/bonding/bond_main.c
@@ -2078,10 +2078,10 @@ int bond_enslave(struct net_device *bond_dev, 
struct net_device *slave_dev,
  		/* if this is the first slave */
  		if (!prev_slave) {
  			SLAVE_AD_INFO(new_slave)->id = 1;
-			/* Initialize AD with the number of times that the AD timer is 
called in 1 second
-			 * can be called only after the mac address of the bond is set
+			/* can be called only after the mac address of the
+			 * bond is set
  			 */
-			bond_3ad_initialize(bond, 1000/AD_TIMER_INTERVAL);
+			bond_3ad_initialize(bond);
  		} else {
  			SLAVE_AD_INFO(new_slave)->id =
  				SLAVE_AD_INFO(prev_slave)->id + 1;
diff --git i/include/net/bond_3ad.h w/include/net/bond_3ad.h
index 184105d68294..be2992e6de5d 100644
--- i/include/net/bond_3ad.h
+++ w/include/net/bond_3ad.h
@@ -290,7 +290,7 @@ static inline const char 
*bond_3ad_churn_desc(churn_state_t state)
  }

  /* ========== AD Exported functions to the main bonding code ========== */
-void bond_3ad_initialize(struct bonding *bond, u16 tick_resolution);
+void bond_3ad_initialize(struct bonding *bond);
  void bond_3ad_bind_slave(struct slave *slave);
  void bond_3ad_unbind_slave(struct slave *slave);
  void bond_3ad_state_machine_handler(struct work_struct *);

-Jon


^ permalink raw reply related

* Re: [ RFC net-next 2/3] net: flow_offload: add action stats api
From: Edward Cree @ 2022-08-16 13:42 UTC (permalink / raw)
  To: Oz Shlomo, netdev
  Cc: Jiri Pirko, Jamal Hadi Salim, Simon Horman, Baowen Zheng,
	Vlad Buslov, Ido Schimmel, Roi Dayan
In-Reply-To: <20220816092338.12613-3-ozsh@nvidia.com>

On 16/08/2022 10:23, Oz Shlomo wrote:
> The current offload api provides visibility to flow hw stats.
> This works as long as the flow stats values apply to all the flow's
> actions. However, this assumption breaks when an action, such as police,
> decides to drop or jump over other actions.
> 
> Extend the flow_offload api to return stat record per action instance.
> Use the per action stats value, if available, when updating the action
> instance counters.
> 
> Signed-off-by: Oz Shlomo <ozsh@nvidia.com>

When I worked on this before I tried with a similar "array of action
 stats" API [1], but after some discussion it seemed cleaner to have
 a "get stats for one single action" callback [2] which then could
 be called in a loop for filter dumps but also called singly for
 action dumps (RTM_GETACTION).  I recommend this approach to your
 consideration.

[1]: https://lore.kernel.org/all/9804a392-c9fd-8d03-7900-e01848044fea@solarflare.com/
[2]: https://lore.kernel.org/all/a3f0a79a-7e2c-4cdc-8c97-dfebe959ab1f@solarflare.com/

> diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
> index 7da3337c4356..7dc8a62796b5 100644
> --- a/net/sched/cls_flower.c
> +++ b/net/sched/cls_flower.c
> @@ -499,7 +499,9 @@ static void fl_hw_update_stats(struct tcf_proto *tp, struct cls_fl_filter *f,
>  	tc_setup_cb_call(block, TC_SETUP_CLSFLOWER, &cls_flower, false,
>  			 rtnl_held);
>  
> -	tcf_exts_hw_stats_update(&f->exts, &cls_flower.stats);
> +	tcf_exts_hw_stats_update(&f->exts, &cls_flower.stats, cls_flower.act_stats);
> +
> +	kfree(cls_flower.act_stats);
>  }

Perhaps I'm being dumb, but I don't see this being allocated
 anywhere.  Is the driver supposed to be responsible for doing so?
 That seems inelegant.

-ed

^ permalink raw reply

* Re: [PATCH iproute2]rdma: modify the command output message of rdma statistic help based on man manual
From: Leon Romanovsky @ 2022-08-16 13:49 UTC (permalink / raw)
  To: jiangheng; +Cc: netdev
In-Reply-To: <25ce8a4f.6db0.182910b625a.Coremail.15720603159@163.com>

On Fri, Aug 12, 2022 at 03:53:38PM +0800, jiangheng wrote:
> 
> From 4ce8d60f34c5cdcff8b25e3c3891fc053736225b Mon Sep 17 00:00:00 2001
> From: jinag <jinag12138@gmail.com>
> Date: Fri, 12 Aug 2022 15:48:07 +0800
> Subject: [PATCH] rdma: modify the command output message of rdma statistic help
>  based on man manual
> 
> 
> ---
>  rdma/stat.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)

iproute2 follows kernel coding style, both in the code
and in the patches. Please follow it.

Thanks

> 
> 
> diff --git a/rdma/stat.c b/rdma/stat.c
> index aad8815c..1e869c25 100644
> --- a/rdma/stat.c
> +++ b/rdma/stat.c
> @@ -24,10 +24,10 @@ static int stat_help(struct rd *rd)
>     pr_out("       %s statistic mode [ supported ] link [ DEV/PORT_INDEX ]\n", rd->filename);
>     pr_out("       %s statistic set link [ DEV/PORT_INDEX ] optional-counters [ OPTIONAL-COUNTERS ]\n", rd->filename);
>     pr_out("       %s statistic unset link [ DEV/PORT_INDEX ] optional-counters\n", rd->filename);
> -   pr_out("where  OBJECT: = { qp }\n");
> -   pr_out("       CRITERIA : = { type }\n");
> +   pr_out("where  OBJECT: = { qp | mr }\n");
> +   pr_out("       CRITERIA : = { type | pid }\n");
>     pr_out("       COUNTER_SCOPE: = { link | dev }\n");
> -   pr_out("       FILTER_NAME: = { cntn | lqpn | pid }\n");
> +   pr_out("       FILTER_NAME: = { cntn | lqpn | pid | qp_type }\n");
>     pr_out("Examples:\n");
>     pr_out("       %s statistic qp show\n", rd->filename);
>     pr_out("       %s statistic qp show link mlx5_2/1\n", rd->filename);
> --
> 2.23.0

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox