Netdev List


* [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to reduce spinlock contention
From: Mirsad Goran Todorovac @ 2023-11-04 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: David S. Miller, Eric Dumazet, Heiner Kallweit, Jakub Kicinski,
	Paolo Abeni, netdev, nic_swsd, Mirsad Goran Todorovac

The motivation for these helpers was the locking overhead of 130 consecutive
r8168_mac_ocp_write() calls in the RTL8411b reset after the NIC gets confused
if the PHY is powered-down.

To quote Heiner:

    On RTL8411b the RX unit gets confused if the PHY is powered-down.
    This was reported in [0] and confirmed by Realtek. Realtek provided
    a sequence to fix the RX unit after PHY wakeup.

A series of about 130 r8168_mac_ocp_write() calls is performed to program the
RTL registers for recovery, each doing an expensive spin_lock_irqsave() and
spin_unlock_irqrestore().

Each mac ocp write is made of:

    static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
                      u32 data)
    {
        if (rtl_ocp_reg_failure(reg))
            return;

        RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
    }

    static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
                    u32 data)
    {
        unsigned long flags;

        raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
        __r8168_mac_ocp_write(tp, reg, data);
        raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
    }

Register programming is done through the RTL_W32() macro, which expands into:

    #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg))

which is further (on Alpha):

    extern inline void writel(u32 b, volatile void __iomem *addr)
    {
        mb();
        __raw_writel(b, addr);
    }

or on i386/x86_64:

    #define build_mmio_write(name, size, type, reg, barrier) \
    static inline void name(type val, volatile void __iomem *addr) \
    { asm volatile("mov" size " %0,%1": :reg (val), \
    "m" (*(volatile type __force *)addr) barrier); }

    build_mmio_write(writel, "l", unsigned int, "r", :"memory")

This obviously involves at least a compiler barrier.

mb() expands into something like this, e.g. on 32-bit x86:

    #define mb()    asm volatile("lock; addl $0,0(%%esp)" ::: "memory")

This means a whole lot of memory bus stalls: for spin_lock_irqsave(),
memory barrier, writel(), and spin_unlock_irqrestore().

With about 130 of these sequential calls to r8168_mac_ocp_write(), this looks like
a lock storm that will stall all of the cores and CPUs on the same memory controller
for the time the I/O takes to finish.

In the sequential case of RTL register programming, the writes to the RTL registers
can be coalesced under a single raw spinlock. This can dramatically decrease the
number of bus stalls in a multicore or multi-CPU system.

Macro helpers r8168_mac_ocp_write_seq() and r8168_mac_ocp_modify_seq() are
provided to reduce lock contention:

    static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
    {

        ...

        /* The following Realtek-provided magic fixes an issue with the RX unit
         * getting confused after the PHY having been powered-down.
         */

        static const struct recover_8411b_info init_zero_seq[] = {
            { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 },
            ...
        };

        ...

        r8168_mac_ocp_write_seq(tp, init_zero_seq);

        ...

    }

The hex data is preserved intact: the conversion used only the substitutions
s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/, which change the function name
and the end of each line, so the actual hex data is unchanged.

To repeat, the original commit was introduced to enable recovery of the RX unit
on the RTL8411b, which gets confused by a powered-down PHY. The resulting sequence
of r8168_mac_ocp_write() calls amplifies the problem into a series of 500+ memory
bus locks, most of them waiting for a main memory read, modify and write under a
LOCK. The memory barrier in RTL_W32() should suffice for the programming sequence
to reach the RTL NIC registers.

[0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075

v6:
 proceeded according to Jacob Keller's suggestions by creating a cover page and reducing
 the text within the commits. Applying to the net-next tree as Heiner Kallweit requested.

v5:
 attempted some new optimisations, which were rejected, but not all and not completely.

v4:
 fixed complaints as advised by Heiner and checkpatch.pl.
 split the patch into five sections to be more easily manipulated and reviewed
 introduced r8168_mac_ocp_write_seq()
 applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B

v3:
 removed register/mask pair array sentinels, so using ARRAY_SIZE().
 avoided duplication of RTL_W32() call code as advised by Heiner.

Mirsad Goran Todorovac (5):
  r8169: Coalesce r8169_mac_ocp_write/modify calls to reduce spinlock
    stalls
  r8169: Coalesce RTL8411b PHY power-down recovery calls to reduce
    spinlock stalls
  r8169: Coalesce mac ocp write and modify for 8168H start to reduce
    spinlocks
  r8169: Coalesce mac ocp commands for 8125 and 8125B start to reduce
    spinlock contention
  r8169: Coalesce mac ocp commands for rtl_hw_init_8125 to reduce
    spinlocks

 drivers/net/ethernet/realtek/r8169_main.c | 304 +++++++++++-----------
 1 file changed, 150 insertions(+), 154 deletions(-)

-- 
2.34.1


* [PATCH net-next v6 1/5] r8169: Coalesce r8169_mac_ocp_write/modify calls to reduce spinlock stalls
From: Mirsad Goran Todorovac @ 2023-11-04 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: David S. Miller, Eric Dumazet, Heiner Kallweit, Jakub Kicinski,
	Paolo Abeni, netdev, nic_swsd, Mirsad Goran Todorovac,
	Marco Elver, Jacob Keller
In-Reply-To: <20231104221514.45821-1-mirsad.todorovac@alu.unizg.hr>

A pair of new helpers r8168_mac_ocp_write_seq() and r8168_mac_ocp_modify_seq()
are introduced.

They are meant to minimise the locking and unlocking overhead in cases where it is
enough to guarantee that the mac ocp registers are programmed sequentially, as the
Realtek specs require. That ordering is assured by the compiler barrier in the
writel() call made by the low-level RTL_W32() primitive.

Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Marco Elver <elver@google.com>
Cc: nic_swsd@realtek.com
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Link: https://lore.kernel.org/lkml/20231028005153.2180411-1-mirsad.todorovac@alu.unizg.hr/
Link: https://lore.kernel.org/lkml/20231028110459.2644926-1-mirsad.todorovac@alu.unizg.hr/
Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
---
v6:
 proceeded according to Jacob Keller's suggestions by creating a cover page and reducing
 the text within the commits. Applying to the net-next tree as Heiner Kallweit requested.

v5:
 attempted some new optimisations, which were rejected, but not all and not completely.

v4:
 fixed complaints as advised by Heiner and checkpatch.pl.
 split the patch into five sections to be more easily manipulated and reviewed
 introduced r8168_mac_ocp_write_seq()
 applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B

v3:
 removed register/mask pair array sentinels, so using ARRAY_SIZE().
 avoided duplication of RTL_W32() call code as advised by Heiner.

 drivers/net/ethernet/realtek/r8169_main.c | 57 +++++++++++++++++++++++
 1 file changed, 57 insertions(+)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index a987defb575c..e39b5777d67b 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -939,6 +939,63 @@ static void r8168_mac_ocp_modify(struct rtl8169_private *tp, u32 reg, u16 mask,
 	raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
 }
 
+struct e_info_regdata {
+	u32	reg;
+	u32	data;
+};
+
+struct e_info_regmaskset {
+	u32	reg;
+	u16	mask;
+	u16	set;
+};
+
+static void __r8168_mac_ocp_write_seqlen(struct rtl8169_private *tp,
+					 const struct e_info_regdata *array, int len)
+{
+	struct e_info_regdata const *p;
+
+	for (p = array; len--; p++)
+		__r8168_mac_ocp_write(tp, p->reg, p->data);
+}
+
+static void r8168_mac_ocp_write_seqlen(struct rtl8169_private *tp,
+				       const struct e_info_regdata *array, int len)
+{
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
+	__r8168_mac_ocp_write_seqlen(tp, array, len);
+	raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
+}
+
+static void __r8168_mac_ocp_modify_seqlen(struct rtl8169_private *tp,
+					  const struct e_info_regmaskset *array, int len)
+{
+	struct e_info_regmaskset const *p;
+	u16 data;
+
+	for (p = array; len--; p++) {
+		data = __r8168_mac_ocp_read(tp, p->reg);
+		__r8168_mac_ocp_write(tp, p->reg, (data & ~p->mask) | p->set);
+	}
+}
+
+static void r8168_mac_ocp_modify_seqlen(struct rtl8169_private *tp,
+					const struct e_info_regmaskset *array, int len)
+{
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
+	__r8168_mac_ocp_modify_seqlen(tp, array, len);
+	raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
+}
+
+#define r8168_mac_ocp_write_seq(tp, a) r8168_mac_ocp_write_seqlen(tp, a, ARRAY_SIZE(a))
+#define r8168_mac_ocp_modify_seq(tp, a) r8168_mac_ocp_modify_seqlen(tp, a, ARRAY_SIZE(a))
+#define __r8168_mac_ocp_write_seq(tp, a) __r8168_mac_ocp_write_seqlen(tp, a, ARRAY_SIZE(a))
+#define __r8168_mac_ocp_modify_seq(tp, a) __r8168_mac_ocp_modify_seqlen(tp, a, ARRAY_SIZE(a))
+
 /* Work around a hw issue with RTL8168g PHY, the quirk disables
  * PHY MCU interrupts before PHY power-down.
  */
-- 
2.34.1



* Re: [PATCH net v3] net: dsa: tag_rtl4_a: Bump min packet size
From: Linus Walleij @ 2023-11-04 22:17 UTC (permalink / raw)
  To: Simon Horman
  Cc: Andrew Lunn, Florian Fainelli, Vladimir Oltean, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Luiz Angelo Daros de Luca, netdev, linux-kernel
In-Reply-To: <20231104141031.GF891380@kernel.org>

On Sat, Nov 4, 2023 at 3:10 PM Simon Horman <horms@kernel.org> wrote:

> In this case it may not have activated the automation, but
> I do see that the patch is now marked as "Changes Requested"
> in patchwork, so all is well.

Yeah, in this case it should even be

pw-bot: reject

because I found the real problem elsewhere.

> FWIW, pw-bot is (slightly) documented here:
>
>   https://docs.kernel.org/process/maintainer-netdev.html#updating-patch-status

Thanks, I'm getting better at it!

Yours,
Linus Walleij


* [PATCH net-next v6 2/5] r8169: Coalesce RTL8411b PHY power-down recovery calls to reduce spinlock stalls
From: Mirsad Goran Todorovac @ 2023-11-04 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: David S. Miller, Eric Dumazet, Heiner Kallweit, Jakub Kicinski,
	Paolo Abeni, netdev, nic_swsd, Mirsad Goran Todorovac,
	Marco Elver, Jacob Keller
In-Reply-To: <20231104221514.45821-1-mirsad.todorovac@alu.unizg.hr>

On RTL8411b the RX unit gets confused if the PHY is powered-down.
This was reported in [0] and confirmed by Realtek. Realtek provided
a sequence to fix the RX unit after PHY wakeup.

A series of about 130 r8168_mac_ocp_write() calls is performed to
program the RTL registers for recovery.

With about 130 of these sequential calls to r8168_mac_ocp_write(), this looks like
a lock storm that will stall all of the cores sharing the same cache for the time
the I/O takes to finish.

In the sequential case of RTL register programming, a sequence of writes to the RTL
registers can be coalesced under a single raw spinlock. This can dramatically decrease
the number of bus stalls in a multicore or multi-CPU system.

Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Marco Elver <elver@google.com>
Cc: nic_swsd@realtek.com
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Link: https://lore.kernel.org/lkml/20231028005153.2180411-1-mirsad.todorovac@alu.unizg.hr/
Link: https://lore.kernel.org/lkml/20231028110459.2644926-1-mirsad.todorovac@alu.unizg.hr/
Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
---
v6:
 proceeded according to Jacob Keller's suggestions by creating a cover page and reducing
 the text within the commits. Applying to the net-next tree as Heiner Kallweit requested.

v5:
 attempted some new optimisations, which were rejected, but not all and not completely.

v4:
 fixed complaints as advised by Heiner and checkpatch.pl.
 split the patch into five sections to be more easily manipulated and reviewed
 introduced r8168_mac_ocp_write_seq()
 applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B

v3:
 removed register/mask pair array sentinels, so using ARRAY_SIZE().
 avoided duplication of RTL_W32() call code as advised by Heiner.

 drivers/net/ethernet/realtek/r8169_main.c | 173 ++++++----------------
 1 file changed, 46 insertions(+), 127 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index e39b5777d67b..5515c51b6e3c 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -3157,145 +3157,64 @@ static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
 		{ 0x1d, 0x0000,	0x4000 },
 	};
 
-	rtl_hw_start_8168g(tp);
+	static const struct e_info_regdata init_zero_seq[] = {
+		{ 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 }, { 0xFC2E, 0x0000 },
+		{ 0xFC30, 0x0000 }, { 0xFC32, 0x0000 }, { 0xFC34, 0x0000 }, { 0xFC36, 0x0000 },
+	};
 
+	static const struct e_info_regdata recover_seq[] = {
+		{ 0xF800, 0xE008 }, { 0xF802, 0xE00A }, { 0xF804, 0xE00C }, { 0xF806, 0xE00E },
+		{ 0xF808, 0xE027 }, { 0xF80A, 0xE04F }, { 0xF80C, 0xE05E }, { 0xF80E, 0xE065 },
+		{ 0xF810, 0xC602 }, { 0xF812, 0xBE00 }, { 0xF814, 0x0000 }, { 0xF816, 0xC502 },
+		{ 0xF818, 0xBD00 }, { 0xF81A, 0x074C }, { 0xF81C, 0xC302 }, { 0xF81E, 0xBB00 },
+		{ 0xF820, 0x080A }, { 0xF822, 0x6420 }, { 0xF824, 0x48C2 }, { 0xF826, 0x8C20 },
+		{ 0xF828, 0xC516 }, { 0xF82A, 0x64A4 }, { 0xF82C, 0x49C0 }, { 0xF82E, 0xF009 },
+		{ 0xF830, 0x74A2 }, { 0xF832, 0x8CA5 }, { 0xF834, 0x74A0 }, { 0xF836, 0xC50E },
+		{ 0xF838, 0x9CA2 }, { 0xF83A, 0x1C11 }, { 0xF83C, 0x9CA0 }, { 0xF83E, 0xE006 },
+		{ 0xF840, 0x74F8 }, { 0xF842, 0x48C4 }, { 0xF844, 0x8CF8 }, { 0xF846, 0xC404 },
+		{ 0xF848, 0xBC00 }, { 0xF84A, 0xC403 }, { 0xF84C, 0xBC00 }, { 0xF84E, 0x0BF2 },
+		{ 0xF850, 0x0C0A }, { 0xF852, 0xE434 }, { 0xF854, 0xD3C0 }, { 0xF856, 0x49D9 },
+		{ 0xF858, 0xF01F }, { 0xF85A, 0xC526 }, { 0xF85C, 0x64A5 }, { 0xF85E, 0x1400 },
+		{ 0xF860, 0xF007 }, { 0xF862, 0x0C01 }, { 0xF864, 0x8CA5 }, { 0xF866, 0x1C15 },
+		{ 0xF868, 0xC51B }, { 0xF86A, 0x9CA0 }, { 0xF86C, 0xE013 }, { 0xF86E, 0xC519 },
+		{ 0xF870, 0x74A0 }, { 0xF872, 0x48C4 }, { 0xF874, 0x8CA0 }, { 0xF876, 0xC516 },
+		{ 0xF878, 0x74A4 }, { 0xF87A, 0x48C8 }, { 0xF87C, 0x48CA }, { 0xF87E, 0x9CA4 },
+		{ 0xF880, 0xC512 }, { 0xF882, 0x1B00 }, { 0xF884, 0x9BA0 }, { 0xF886, 0x1B1C },
+		{ 0xF888, 0x483F }, { 0xF88A, 0x9BA2 }, { 0xF88C, 0x1B04 }, { 0xF88E, 0xC508 },
+		{ 0xF890, 0x9BA0 }, { 0xF892, 0xC505 }, { 0xF894, 0xBD00 }, { 0xF896, 0xC502 },
+		{ 0xF898, 0xBD00 }, { 0xF89A, 0x0300 }, { 0xF89C, 0x051E }, { 0xF89E, 0xE434 },
+		{ 0xF8A0, 0xE018 }, { 0xF8A2, 0xE092 }, { 0xF8A4, 0xDE20 }, { 0xF8A6, 0xD3C0 },
+		{ 0xF8A8, 0xC50F }, { 0xF8AA, 0x76A4 }, { 0xF8AC, 0x49E3 }, { 0xF8AE, 0xF007 },
+		{ 0xF8B0, 0x49C0 }, { 0xF8B2, 0xF103 }, { 0xF8B4, 0xC607 }, { 0xF8B6, 0xBE00 },
+		{ 0xF8B8, 0xC606 }, { 0xF8BA, 0xBE00 }, { 0xF8BC, 0xC602 }, { 0xF8BE, 0xBE00 },
+		{ 0xF8C0, 0x0C4C }, { 0xF8C2, 0x0C28 }, { 0xF8C4, 0x0C2C }, { 0xF8C6, 0xDC00 },
+		{ 0xF8C8, 0xC707 }, { 0xF8CA, 0x1D00 }, { 0xF8CC, 0x8DE2 }, { 0xF8CE, 0x48C1 },
+		{ 0xF8D0, 0xC502 }, { 0xF8D2, 0xBD00 }, { 0xF8D4, 0x00AA }, { 0xF8D6, 0xE0C0 },
+		{ 0xF8D8, 0xC502 }, { 0xF8DA, 0xBD00 }, { 0xF8DC, 0x0132 },
+	};
+
+	static const struct e_info_regdata final_seq[] = {
+		{ 0xFC2A, 0x0743 }, { 0xFC2C, 0x0801 }, { 0xFC2E, 0x0BE9 }, { 0xFC30, 0x02FD },
+		{ 0xFC32, 0x0C25 }, { 0xFC34, 0x00A9 }, { 0xFC36, 0x012D },
+	};
+
+	rtl_hw_start_8168g(tp);
 	rtl_ephy_init(tp, e_info_8411_2);
 
 	/* The following Realtek-provided magic fixes an issue with the RX unit
 	 * getting confused after the PHY having been powered-down.
 	 */
-	r8168_mac_ocp_write(tp, 0xFC28, 0x0000);
-	r8168_mac_ocp_write(tp, 0xFC2A, 0x0000);
-	r8168_mac_ocp_write(tp, 0xFC2C, 0x0000);
-	r8168_mac_ocp_write(tp, 0xFC2E, 0x0000);
-	r8168_mac_ocp_write(tp, 0xFC30, 0x0000);
-	r8168_mac_ocp_write(tp, 0xFC32, 0x0000);
-	r8168_mac_ocp_write(tp, 0xFC34, 0x0000);
-	r8168_mac_ocp_write(tp, 0xFC36, 0x0000);
+
+	r8168_mac_ocp_write_seq(tp, init_zero_seq);
 	mdelay(3);
 	r8168_mac_ocp_write(tp, 0xFC26, 0x0000);
 
-	r8168_mac_ocp_write(tp, 0xF800, 0xE008);
-	r8168_mac_ocp_write(tp, 0xF802, 0xE00A);
-	r8168_mac_ocp_write(tp, 0xF804, 0xE00C);
-	r8168_mac_ocp_write(tp, 0xF806, 0xE00E);
-	r8168_mac_ocp_write(tp, 0xF808, 0xE027);
-	r8168_mac_ocp_write(tp, 0xF80A, 0xE04F);
-	r8168_mac_ocp_write(tp, 0xF80C, 0xE05E);
-	r8168_mac_ocp_write(tp, 0xF80E, 0xE065);
-	r8168_mac_ocp_write(tp, 0xF810, 0xC602);
-	r8168_mac_ocp_write(tp, 0xF812, 0xBE00);
-	r8168_mac_ocp_write(tp, 0xF814, 0x0000);
-	r8168_mac_ocp_write(tp, 0xF816, 0xC502);
-	r8168_mac_ocp_write(tp, 0xF818, 0xBD00);
-	r8168_mac_ocp_write(tp, 0xF81A, 0x074C);
-	r8168_mac_ocp_write(tp, 0xF81C, 0xC302);
-	r8168_mac_ocp_write(tp, 0xF81E, 0xBB00);
-	r8168_mac_ocp_write(tp, 0xF820, 0x080A);
-	r8168_mac_ocp_write(tp, 0xF822, 0x6420);
-	r8168_mac_ocp_write(tp, 0xF824, 0x48C2);
-	r8168_mac_ocp_write(tp, 0xF826, 0x8C20);
-	r8168_mac_ocp_write(tp, 0xF828, 0xC516);
-	r8168_mac_ocp_write(tp, 0xF82A, 0x64A4);
-	r8168_mac_ocp_write(tp, 0xF82C, 0x49C0);
-	r8168_mac_ocp_write(tp, 0xF82E, 0xF009);
-	r8168_mac_ocp_write(tp, 0xF830, 0x74A2);
-	r8168_mac_ocp_write(tp, 0xF832, 0x8CA5);
-	r8168_mac_ocp_write(tp, 0xF834, 0x74A0);
-	r8168_mac_ocp_write(tp, 0xF836, 0xC50E);
-	r8168_mac_ocp_write(tp, 0xF838, 0x9CA2);
-	r8168_mac_ocp_write(tp, 0xF83A, 0x1C11);
-	r8168_mac_ocp_write(tp, 0xF83C, 0x9CA0);
-	r8168_mac_ocp_write(tp, 0xF83E, 0xE006);
-	r8168_mac_ocp_write(tp, 0xF840, 0x74F8);
-	r8168_mac_ocp_write(tp, 0xF842, 0x48C4);
-	r8168_mac_ocp_write(tp, 0xF844, 0x8CF8);
-	r8168_mac_ocp_write(tp, 0xF846, 0xC404);
-	r8168_mac_ocp_write(tp, 0xF848, 0xBC00);
-	r8168_mac_ocp_write(tp, 0xF84A, 0xC403);
-	r8168_mac_ocp_write(tp, 0xF84C, 0xBC00);
-	r8168_mac_ocp_write(tp, 0xF84E, 0x0BF2);
-	r8168_mac_ocp_write(tp, 0xF850, 0x0C0A);
-	r8168_mac_ocp_write(tp, 0xF852, 0xE434);
-	r8168_mac_ocp_write(tp, 0xF854, 0xD3C0);
-	r8168_mac_ocp_write(tp, 0xF856, 0x49D9);
-	r8168_mac_ocp_write(tp, 0xF858, 0xF01F);
-	r8168_mac_ocp_write(tp, 0xF85A, 0xC526);
-	r8168_mac_ocp_write(tp, 0xF85C, 0x64A5);
-	r8168_mac_ocp_write(tp, 0xF85E, 0x1400);
-	r8168_mac_ocp_write(tp, 0xF860, 0xF007);
-	r8168_mac_ocp_write(tp, 0xF862, 0x0C01);
-	r8168_mac_ocp_write(tp, 0xF864, 0x8CA5);
-	r8168_mac_ocp_write(tp, 0xF866, 0x1C15);
-	r8168_mac_ocp_write(tp, 0xF868, 0xC51B);
-	r8168_mac_ocp_write(tp, 0xF86A, 0x9CA0);
-	r8168_mac_ocp_write(tp, 0xF86C, 0xE013);
-	r8168_mac_ocp_write(tp, 0xF86E, 0xC519);
-	r8168_mac_ocp_write(tp, 0xF870, 0x74A0);
-	r8168_mac_ocp_write(tp, 0xF872, 0x48C4);
-	r8168_mac_ocp_write(tp, 0xF874, 0x8CA0);
-	r8168_mac_ocp_write(tp, 0xF876, 0xC516);
-	r8168_mac_ocp_write(tp, 0xF878, 0x74A4);
-	r8168_mac_ocp_write(tp, 0xF87A, 0x48C8);
-	r8168_mac_ocp_write(tp, 0xF87C, 0x48CA);
-	r8168_mac_ocp_write(tp, 0xF87E, 0x9CA4);
-	r8168_mac_ocp_write(tp, 0xF880, 0xC512);
-	r8168_mac_ocp_write(tp, 0xF882, 0x1B00);
-	r8168_mac_ocp_write(tp, 0xF884, 0x9BA0);
-	r8168_mac_ocp_write(tp, 0xF886, 0x1B1C);
-	r8168_mac_ocp_write(tp, 0xF888, 0x483F);
-	r8168_mac_ocp_write(tp, 0xF88A, 0x9BA2);
-	r8168_mac_ocp_write(tp, 0xF88C, 0x1B04);
-	r8168_mac_ocp_write(tp, 0xF88E, 0xC508);
-	r8168_mac_ocp_write(tp, 0xF890, 0x9BA0);
-	r8168_mac_ocp_write(tp, 0xF892, 0xC505);
-	r8168_mac_ocp_write(tp, 0xF894, 0xBD00);
-	r8168_mac_ocp_write(tp, 0xF896, 0xC502);
-	r8168_mac_ocp_write(tp, 0xF898, 0xBD00);
-	r8168_mac_ocp_write(tp, 0xF89A, 0x0300);
-	r8168_mac_ocp_write(tp, 0xF89C, 0x051E);
-	r8168_mac_ocp_write(tp, 0xF89E, 0xE434);
-	r8168_mac_ocp_write(tp, 0xF8A0, 0xE018);
-	r8168_mac_ocp_write(tp, 0xF8A2, 0xE092);
-	r8168_mac_ocp_write(tp, 0xF8A4, 0xDE20);
-	r8168_mac_ocp_write(tp, 0xF8A6, 0xD3C0);
-	r8168_mac_ocp_write(tp, 0xF8A8, 0xC50F);
-	r8168_mac_ocp_write(tp, 0xF8AA, 0x76A4);
-	r8168_mac_ocp_write(tp, 0xF8AC, 0x49E3);
-	r8168_mac_ocp_write(tp, 0xF8AE, 0xF007);
-	r8168_mac_ocp_write(tp, 0xF8B0, 0x49C0);
-	r8168_mac_ocp_write(tp, 0xF8B2, 0xF103);
-	r8168_mac_ocp_write(tp, 0xF8B4, 0xC607);
-	r8168_mac_ocp_write(tp, 0xF8B6, 0xBE00);
-	r8168_mac_ocp_write(tp, 0xF8B8, 0xC606);
-	r8168_mac_ocp_write(tp, 0xF8BA, 0xBE00);
-	r8168_mac_ocp_write(tp, 0xF8BC, 0xC602);
-	r8168_mac_ocp_write(tp, 0xF8BE, 0xBE00);
-	r8168_mac_ocp_write(tp, 0xF8C0, 0x0C4C);
-	r8168_mac_ocp_write(tp, 0xF8C2, 0x0C28);
-	r8168_mac_ocp_write(tp, 0xF8C4, 0x0C2C);
-	r8168_mac_ocp_write(tp, 0xF8C6, 0xDC00);
-	r8168_mac_ocp_write(tp, 0xF8C8, 0xC707);
-	r8168_mac_ocp_write(tp, 0xF8CA, 0x1D00);
-	r8168_mac_ocp_write(tp, 0xF8CC, 0x8DE2);
-	r8168_mac_ocp_write(tp, 0xF8CE, 0x48C1);
-	r8168_mac_ocp_write(tp, 0xF8D0, 0xC502);
-	r8168_mac_ocp_write(tp, 0xF8D2, 0xBD00);
-	r8168_mac_ocp_write(tp, 0xF8D4, 0x00AA);
-	r8168_mac_ocp_write(tp, 0xF8D6, 0xE0C0);
-	r8168_mac_ocp_write(tp, 0xF8D8, 0xC502);
-	r8168_mac_ocp_write(tp, 0xF8DA, 0xBD00);
-	r8168_mac_ocp_write(tp, 0xF8DC, 0x0132);
+	r8168_mac_ocp_write_seq(tp, recover_seq);
 
 	r8168_mac_ocp_write(tp, 0xFC26, 0x8000);
 
-	r8168_mac_ocp_write(tp, 0xFC2A, 0x0743);
-	r8168_mac_ocp_write(tp, 0xFC2C, 0x0801);
-	r8168_mac_ocp_write(tp, 0xFC2E, 0x0BE9);
-	r8168_mac_ocp_write(tp, 0xFC30, 0x02FD);
-	r8168_mac_ocp_write(tp, 0xFC32, 0x0C25);
-	r8168_mac_ocp_write(tp, 0xFC34, 0x00A9);
-	r8168_mac_ocp_write(tp, 0xFC36, 0x012D);
+	r8168_mac_ocp_write_seq(tp, final_seq);
+
 }
 
 static void rtl_hw_start_8168h_1(struct rtl8169_private *tp)
-- 
2.34.1



* [PATCH net-next v6 3/5] r8169: Coalesce mac ocp write and modify for 8168H start to reduce spinlocks
From: Mirsad Goran Todorovac @ 2023-11-04 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: David S. Miller, Eric Dumazet, Heiner Kallweit, Jakub Kicinski,
	Paolo Abeni, netdev, nic_swsd, Mirsad Goran Todorovac,
	Marco Elver, Jacob Keller
In-Reply-To: <20231104221514.45821-1-mirsad.todorovac@alu.unizg.hr>

Repeated calls to r8168_mac_ocp_write() and r8168_mac_ocp_modify() in
the startup of 8168H involve implicit spin_lock_irqsave() and spin_unlock_irqrestore()
on each invocation.

Coalesced by the corresponding helpers, r8168_mac_ocp_write_seq() and
r8168_mac_ocp_modify_seq(), under a single lock/unlock, these calls reduce overall
lock contention.

Fixes: ef712ede3541d ("r8169: add helper r8168_mac_ocp_modify")
Fixes: 6e1d0b8988188 ("r8169:add support for RTL8168H and RTL8107E")
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Marco Elver <elver@google.com>
Cc: nic_swsd@realtek.com
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Link: https://lore.kernel.org/lkml/20231028005153.2180411-1-mirsad.todorovac@alu.unizg.hr/
Link: https://lore.kernel.org/lkml/20231028110459.2644926-1-mirsad.todorovac@alu.unizg.hr/
Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
---
 drivers/net/ethernet/realtek/r8169_main.c | 26 +++++++++++++++--------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 5515c51b6e3c..0fb34d217205 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -3227,6 +3227,21 @@ static void rtl_hw_start_8168h_1(struct rtl8169_private *tp)
 		{ 0x04, 0xffff,	0x854a },
 		{ 0x01, 0xffff,	0x068b }
 	};
+
+	static const struct e_info_regmaskset e_info_regmaskset_8168h_1[] = {
+		{ 0xe056, 0x00f0, 0x0070 },
+		{ 0xe052, 0x6000, 0x8008 },
+		{ 0xe0d6, 0x01ff, 0x017f },
+		{ 0xd420, 0x0fff, 0x047f },
+	};
+
+	static const struct e_info_regdata e_info_regdata_8168h_1[] = {
+		{ 0xe63e, 0x0001 },
+		{ 0xe63e, 0x0000 },
+		{ 0xc094, 0x0000 },
+		{ 0xc09e, 0x0000 },
+	};
+
 	int rg_saw_cnt;
 
 	rtl_ephy_init(tp, e_info_8168h_1);
@@ -3267,15 +3282,8 @@ static void rtl_hw_start_8168h_1(struct rtl8169_private *tp)
 		r8168_mac_ocp_modify(tp, 0xd412, 0x0fff, sw_cnt_1ms_ini);
 	}
 
-	r8168_mac_ocp_modify(tp, 0xe056, 0x00f0, 0x0070);
-	r8168_mac_ocp_modify(tp, 0xe052, 0x6000, 0x8008);
-	r8168_mac_ocp_modify(tp, 0xe0d6, 0x01ff, 0x017f);
-	r8168_mac_ocp_modify(tp, 0xd420, 0x0fff, 0x047f);
-
-	r8168_mac_ocp_write(tp, 0xe63e, 0x0001);
-	r8168_mac_ocp_write(tp, 0xe63e, 0x0000);
-	r8168_mac_ocp_write(tp, 0xc094, 0x0000);
-	r8168_mac_ocp_write(tp, 0xc09e, 0x0000);
+	r8168_mac_ocp_modify_seq(tp, e_info_regmaskset_8168h_1);
+	r8168_mac_ocp_write_seq(tp, e_info_regdata_8168h_1);
 }
 
 static void rtl_hw_start_8168ep(struct rtl8169_private *tp)
-- 
2.34.1



* [PATCH net-next v6 4/5] r8169: Coalesce mac ocp commands for 8125 and 8125B start to reduce spinlock contention
From: Mirsad Goran Todorovac @ 2023-11-04 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: David S. Miller, Eric Dumazet, Heiner Kallweit, Jakub Kicinski,
	Paolo Abeni, netdev, nic_swsd, Mirsad Goran Todorovac,
	Marco Elver, Jacob Keller
In-Reply-To: <20231104221514.45821-1-mirsad.todorovac@alu.unizg.hr>

Repeated calls to r8168_mac_ocp_write() and r8168_mac_ocp_modify() in
the startup of 8125 and 8125B involve implicit spin_lock_irqsave() and
spin_unlock_irqrestore() on each invocation.

Coalesced by the corresponding helpers r8168_mac_ocp_write_seq() and
r8168_mac_ocp_modify_seq() into a sequential write or modify under a single lock/unlock,
these calls reduce overall lock contention.

Fixes: f1bce4ad2f1ce ("r8169: add support for RTL8125")
Fixes: 0439297be9511 ("r8169: add support for RTL8125B")
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Marco Elver <elver@google.com>
Cc: nic_swsd@realtek.com
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Link: https://lore.kernel.org/lkml/20231028005153.2180411-1-mirsad.todorovac@alu.unizg.hr/
Link: https://lore.kernel.org/lkml/20231028110459.2644926-1-mirsad.todorovac@alu.unizg.hr/
Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
---
 drivers/net/ethernet/realtek/r8169_main.c | 38 ++++++++++++++---------
 1 file changed, 23 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 0fb34d217205..056fe5b3930b 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -3539,6 +3539,27 @@ DECLARE_RTL_COND(rtl_mac_ocp_e00e_cond)
 
 static void rtl_hw_start_8125_common(struct rtl8169_private *tp)
 {
+
+	static const struct e_info_regmaskset e_info_8125_common_1[] = {
+		{ 0xd3e2, 0x0fff, 0x03a9 },
+		{ 0xd3e4, 0x00ff, 0x0000 },
+		{ 0xe860, 0x0000, 0x0080 },
+	};
+
+	static const struct e_info_regmaskset e_info_8125_common_2[] = {
+		{ 0xc0b4, 0x0000, 0x000c },
+		{ 0xeb6a, 0x00ff, 0x0033 },
+		{ 0xeb50, 0x03e0, 0x0040 },
+		{ 0xe056, 0x00f0, 0x0030 },
+		{ 0xe040, 0x1000, 0x0000 },
+		{ 0xea1c, 0x0003, 0x0001 },
+		{ 0xe0c0, 0x4f0f, 0x4403 },
+		{ 0xe052, 0x0080, 0x0068 },
+		{ 0xd430, 0x0fff, 0x047f },
+		{ 0xea1c, 0x0004, 0x0000 },
+		{ 0xeb54, 0x0000, 0x0001 },
+	};
+
 	rtl_pcie_state_l2l3_disable(tp);
 
 	RTL_W16(tp, 0x382, 0x221b);
@@ -3553,9 +3574,7 @@ static void rtl_hw_start_8125_common(struct rtl8169_private *tp)
 	r8168_mac_ocp_write(tp, 0xc140, 0xffff);
 	r8168_mac_ocp_write(tp, 0xc142, 0xffff);
 
-	r8168_mac_ocp_modify(tp, 0xd3e2, 0x0fff, 0x03a9);
-	r8168_mac_ocp_modify(tp, 0xd3e4, 0x00ff, 0x0000);
-	r8168_mac_ocp_modify(tp, 0xe860, 0x0000, 0x0080);
+	r8168_mac_ocp_modify_seq(tp, e_info_8125_common_1);
 
 	/* disable new tx descriptor format */
 	r8168_mac_ocp_modify(tp, 0xeb58, 0x0001, 0x0000);
@@ -3570,18 +3589,7 @@ static void rtl_hw_start_8125_common(struct rtl8169_private *tp)
 	else
 		r8168_mac_ocp_modify(tp, 0xe63e, 0x0c30, 0x0020);
 
-	r8168_mac_ocp_modify(tp, 0xc0b4, 0x0000, 0x000c);
-	r8168_mac_ocp_modify(tp, 0xeb6a, 0x00ff, 0x0033);
-	r8168_mac_ocp_modify(tp, 0xeb50, 0x03e0, 0x0040);
-	r8168_mac_ocp_modify(tp, 0xe056, 0x00f0, 0x0030);
-	r8168_mac_ocp_modify(tp, 0xe040, 0x1000, 0x0000);
-	r8168_mac_ocp_modify(tp, 0xea1c, 0x0003, 0x0001);
-	r8168_mac_ocp_modify(tp, 0xe0c0, 0x4f0f, 0x4403);
-	r8168_mac_ocp_modify(tp, 0xe052, 0x0080, 0x0068);
-	r8168_mac_ocp_modify(tp, 0xd430, 0x0fff, 0x047f);
-
-	r8168_mac_ocp_modify(tp, 0xea1c, 0x0004, 0x0000);
-	r8168_mac_ocp_modify(tp, 0xeb54, 0x0000, 0x0001);
+	r8168_mac_ocp_modify_seq(tp, e_info_8125_common_2);
 	udelay(1);
 	r8168_mac_ocp_modify(tp, 0xeb54, 0x0001, 0x0000);
 	RTL_W16(tp, 0x1880, RTL_R16(tp, 0x1880) & ~0x0030);
-- 
2.34.1



* [PATCH net-next v6 5/5] r8169: Coalesce mac ocp commands for rtl_hw_init_8125 to reduce spinlocks
From: Mirsad Goran Todorovac @ 2023-11-04 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: David S. Miller, Eric Dumazet, Heiner Kallweit, Jakub Kicinski,
	Paolo Abeni, netdev, nic_swsd, Mirsad Goran Todorovac,
	Marco Elver, Jacob Keller
In-Reply-To: <20231104221514.45821-1-mirsad.todorovac@alu.unizg.hr>

Repeated calls to r8168_mac_ocp_write() and r8168_mac_ocp_modify() in
the init sequence of the 8125 involve implicit spin_lock_irqsave() and
spin_unlock_irqrestore() on each invocation.

Coalesced by the corresponding helpers r8168_mac_ocp_write_seq() and
r8168_mac_ocp_modify_seq() into a sequential write or modify under a single lock/unlock,
these calls reduce overall lock contention.

Fixes: f1bce4ad2f1ce ("r8169: add support for RTL8125")
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Marco Elver <elver@google.com>
Cc: nic_swsd@realtek.com
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Link: https://lore.kernel.org/lkml/20231028005153.2180411-1-mirsad.todorovac@alu.unizg.hr/
Link: https://lore.kernel.org/lkml/20231028110459.2644926-1-mirsad.todorovac@alu.unizg.hr/
Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
---
 drivers/net/ethernet/realtek/r8169_main.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 056fe5b3930b..42f0a7486151 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -5074,6 +5074,12 @@ static void rtl_hw_init_8168g(struct rtl8169_private *tp)
 
 static void rtl_hw_init_8125(struct rtl8169_private *tp)
 {
+	static const struct e_info_regdata hw_init_8125_1[] = {
+		{ 0xc0aa, 0x07d0 },
+		{ 0xc0a6, 0x0150 },
+		{ 0xc01e, 0x5555 },
+	};
+
 	rtl_enable_rxdvgate(tp);
 
 	RTL_W8(tp, ChipCmd, RTL_R8(tp, ChipCmd) & ~(CmdTxEnb | CmdRxEnb));
@@ -5083,9 +5089,7 @@ static void rtl_hw_init_8125(struct rtl8169_private *tp)
 	r8168_mac_ocp_modify(tp, 0xe8de, BIT(14), 0);
 	r8168g_wait_ll_share_fifo_ready(tp);
 
-	r8168_mac_ocp_write(tp, 0xc0aa, 0x07d0);
-	r8168_mac_ocp_write(tp, 0xc0a6, 0x0150);
-	r8168_mac_ocp_write(tp, 0xc01e, 0x5555);
+	r8168_mac_ocp_write_seq(tp, hw_init_8125_1);
 	r8168g_wait_ll_share_fifo_ready(tp);
 }
 
-- 
2.34.1


^ permalink raw reply related
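
The coalescing idea in the patch above can be illustrated outside the kernel.
What follows is only a minimal user-space sketch, not the driver's code: the
names reg_val, fake_regs, locked_write() and locked_write_seq() are
hypothetical, and a C11 atomic_flag stands in for the driver's raw spinlock.
It shows that a table-driven sequence takes the lock once, where per-call
locking takes it once per register.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register/value pair, modelled on the tables in the patch. */
struct reg_val {
	uint16_t reg;
	uint16_t val;
};

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Stand-ins for the device: a fake register file and a toy spinlock. */
static uint16_t fake_regs[0x10000];
static atomic_flag ocp_lock = ATOMIC_FLAG_INIT;
static unsigned int lock_count;	/* number of lock acquisitions so far */

static void ocp_lock_acquire(void)
{
	while (atomic_flag_test_and_set_explicit(&ocp_lock, memory_order_acquire))
		;	/* spin until the flag was previously clear */
	lock_count++;
}

static void ocp_lock_release(void)
{
	atomic_flag_clear_explicit(&ocp_lock, memory_order_release);
}

/* Unlocked write; callers must hold ocp_lock. */
static void raw_write(uint16_t reg, uint16_t val)
{
	fake_regs[reg] = val;
}

/* One lock/unlock pair per register write. */
static void locked_write(uint16_t reg, uint16_t val)
{
	ocp_lock_acquire();
	raw_write(reg, val);
	ocp_lock_release();
}

/* One lock/unlock pair for the whole sequence. */
static void locked_write_seq(const struct reg_val *seq, size_t n)
{
	size_t i;

	assert(seq != NULL || n == 0);

	ocp_lock_acquire();
	for (i = 0; i < n; i++)
		raw_write(seq[i].reg, seq[i].val);
	ocp_lock_release();
}
```

Calling locked_write() three times raises lock_count by three, while one
locked_write_seq() call over the same three entries raises it by one. Whether
that difference matters in a rarely executed init path is exactly what the
replies below dispute.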

* Re: [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to reduce spinlock contention
From: Heiner Kallweit @ 2023-11-04 22:37 UTC (permalink / raw)
  To: Mirsad Goran Todorovac, linux-kernel
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, nic_swsd
In-Reply-To: <20231104221514.45821-1-mirsad.todorovac@alu.unizg.hr>

On 04.11.2023 23:15, Mirsad Goran Todorovac wrote:
> The motivation for these helpers was the locking overhead of 130 consecutive
> r8168_mac_ocp_write() calls in the RTL8411b reset after the NIC gets confused
> if the PHY is powered-down.
> 
> To quote Heiner:
> 
>     On RTL8411b the RX unit gets confused if the PHY is powered-down.
>     This was reported in [0] and confirmed by Realtek. Realtek provided
>     a sequence to fix the RX unit after PHY wakeup.
> 
> A series of about 130 r8168_mac_ocp_write() calls is performed to program the
> RTL registers for recovery, each doing an expensive spin_lock_irqsave() and
> spin_unlock_irqrestore().
> 
> Each mac ocp write is made of:
> 
>     static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
>                       u32 data)
>     {
>         if (rtl_ocp_reg_failure(reg))
>             return;
> 
>         RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
>     }
> 
>     static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
>                     u32 data)
>     {
>         unsigned long flags;
> 
>         raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>         __r8168_mac_ocp_write(tp, reg, data);
>         raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>     }
> 
> Register programming is done through RTL_W32() macro which expands into
> 
>     #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg))
> 
> which is further (on Alpha):
> 
>     extern inline void writel(u32 b, volatile void __iomem *addr)
>     {
>         mb();
>         __raw_writel(b, addr);
>     }
> 
> or on i386/x86_64:
> 
>     #define build_mmio_write(name, size, type, reg, barrier) \
>     static inline void name(type val, volatile void __iomem *addr) \
>     { asm volatile("mov" size " %0,%1": :reg (val), \
>     "m" (*(volatile type __force *)addr) barrier); }
> 
>     build_mmio_write(writel, "l", unsigned int, "r", :"memory")
> 
> This obviously involves at least a compiler barrier.
> 
> mb() expands into something like this i.e. on x86_64:
> 
>     #define mb()    asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
> 
> This means a whole lot of memory bus stalls: for spin_lock_irqsave(),
> memory barrier, writel(), and spin_unlock_irqrestore().
> 
> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
> a lock storm that will stall all of the cores and CPUs on the same memory controller
> for certain time I/O takes to finish.
> 
> In a sequential case of RTL register programming, the writes to RTL registers
> can be coalesced under a same raw spinlock. This can dramatically decrease the
> number of bus stalls in a multicore or multi-CPU system.
> 
> Macro helpers r8168_mac_ocp_write_seq() and r8168_mac_ocp_modify_seq() are
> provided to reduce lock contention:
> 
>     static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
>     {
> 
>         ...
> 
>         /* The following Realtek-provided magic fixes an issue with the RX unit
>          * getting confused after the PHY having been powered-down.
>          */
> 
>         static const struct recover_8411b_info init_zero_seq[] = {
>             { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 },
>             ...
>         };
> 
>         ...
> 
>         r8168_mac_ocp_write_seq(tp, init_zero_seq);
> 
>         ...
> 
>     }
> 
> The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/
> functions that only changed the function names and the ending of the line, so the actual
> hex data is unchanged.
> 
> To repeat, the reason for the introduction of the original commit
> was to enable recovery of the RX unit on the RTL8411b which was confused by the
> powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem
> into a series of about 500+ memory bus locks, most waiting for the main memory read,
> modify and write under a LOCK. The memory barrier in RTL_W32 should suffice for
> the programming sequence to reach RTL NIC registers.
> 
> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075
> 
> v6:
>  proceeded according to Jacob Keller's suggestions by creating a cover page and reducing
>  the text within the commits. Applying to the net-next tree as Heiner Kallweit requested.
> 
> v5:
>  attempted some new optimisations, which were rejected, but not all and not completely.
> 
> v4:
>  fixed complaints as advised by Heiner and checkpatch.pl.
>  split the patch into five sections to be more easily manipulated and reviewed
>  introduced r8168_mac_ocp_write_seq()
>  applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B
> 
> v3:
>  removed register/mask pair array sentinels, so using ARRAY_SIZE().
>  avoided duplication of RTL_W32() call code as advised by Heiner.
> 
> Mirsad Goran Todorovac (5):
>   r8169: Coalesce r8169_mac_ocp_write/modify calls to reduce spinlock
>     stalls
>   r8169: Coalesce RTL8411b PHY power-down recovery calls to reduce
>     spinlock stalls
>   r8169: Coalesce mac ocp write and modify for 8168H start to reduce
>     spinlocks
>   r8169: Coalesce mac ocp commands for 8125 and 8125B start to reduce
>     spinlock contention
>   r8169: Coalesce mac ocp commands for rtl_hw_init_8125 to reduce
>     spinlocks
> 
>  drivers/net/ethernet/realtek/r8169_main.c | 304 +++++++++++-----------
>  1 file changed, 150 insertions(+), 154 deletions(-)
> 

You still write:
"a lock storm that will stall all of the cores and CPUs on the same memory controller"
even though you were informed that that's not the case.
There's no actual problem, therefore your Fixes tags are incorrect.
Also net-next is closed at the moment.
In patches 3-5 I see no benefit. And I have doubts whether the small benefit in
patch 2 is worth adding all the helpers in patch 1.


^ permalink raw reply

* Re: [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to reduce spinlock contention
From: Mirsad Todorovac @ 2023-11-05  0:15 UTC (permalink / raw)
  To: Heiner Kallweit, linux-kernel
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, nic_swsd
In-Reply-To: <da4409f3-d509-413b-8433-f222acbbb1be@gmail.com>



On 11/4/23 23:37, Heiner Kallweit wrote:
> On 04.11.2023 23:15, Mirsad Goran Todorovac wrote:
>> The motivation for these helpers was the locking overhead of 130 consecutive
>> r8168_mac_ocp_write() calls in the RTL8411b reset after the NIC gets confused
>> if the PHY is powered-down.
>>
>> To quote Heiner:
>>
>>      On RTL8411b the RX unit gets confused if the PHY is powered-down.
>>      This was reported in [0] and confirmed by Realtek. Realtek provided
>>      a sequence to fix the RX unit after PHY wakeup.
>>
>> A series of about 130 r8168_mac_ocp_write() calls is performed to program the
>> RTL registers for recovery, each doing an expensive spin_lock_irqsave() and
>> spin_unlock_irqrestore().
>>
>> Each mac ocp write is made of:
>>
>>      static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
>>                        u32 data)
>>      {
>>          if (rtl_ocp_reg_failure(reg))
>>              return;
>>
>>          RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
>>      }
>>
>>      static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
>>                      u32 data)
>>      {
>>          unsigned long flags;
>>
>>          raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>>          __r8168_mac_ocp_write(tp, reg, data);
>>          raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>      }
>>
>> Register programming is done through RTL_W32() macro which expands into
>>
>>      #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg))
>>
>> which is further (on Alpha):
>>
>>      extern inline void writel(u32 b, volatile void __iomem *addr)
>>      {
>>          mb();
>>          __raw_writel(b, addr);
>>      }
>>
>> or on i386/x86_64:
>>
>>      #define build_mmio_write(name, size, type, reg, barrier) \
>>      static inline void name(type val, volatile void __iomem *addr) \
>>      { asm volatile("mov" size " %0,%1": :reg (val), \
>>      "m" (*(volatile type __force *)addr) barrier); }
>>
>>      build_mmio_write(writel, "l", unsigned int, "r", :"memory")
>>
>> This obviously involves at least a compiler barrier.
>>
>> mb() expands into something like this i.e. on x86_64:
>>
>>      #define mb()    asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
>>
>> This means a whole lot of memory bus stalls: for spin_lock_irqsave(),
>> memory barrier, writel(), and spin_unlock_irqrestore().
>>
>> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
>> a lock storm that will stall all of the cores and CPUs on the same memory controller
>> for certain time I/O takes to finish.
>>
>> In a sequential case of RTL register programming, the writes to RTL registers
>> can be coalesced under a same raw spinlock. This can dramatically decrease the
>> number of bus stalls in a multicore or multi-CPU system.
>>
>> Macro helpers r8168_mac_ocp_write_seq() and r8168_mac_ocp_modify_seq() are
>> provided to reduce lock contention:
>>
>>      static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
>>      {
>>
>>          ...
>>
>>          /* The following Realtek-provided magic fixes an issue with the RX unit
>>           * getting confused after the PHY having been powered-down.
>>           */
>>
>>          static const struct recover_8411b_info init_zero_seq[] = {
>>              { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 },
>>              ...
>>          };
>>
>>          ...
>>
>>          r8168_mac_ocp_write_seq(tp, init_zero_seq);
>>
>>          ...
>>
>>      }
>>
>> The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/
>> functions that only changed the function names and the ending of the line, so the actual
>> hex data is unchanged.
>>
>> To repeat, the reason for the introduction of the original commit
>> was to enable recovery of the RX unit on the RTL8411b which was confused by the
>> powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem
>> into a series of about 500+ memory bus locks, most waiting for the main memory read,
>> modify and write under a LOCK. The memory barrier in RTL_W32 should suffice for
>> the programming sequence to reach RTL NIC registers.
>>
>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075
>>
>> v6:
>>   proceeded according to Jacob Keller's suggestions by creating a cover page and reducing
>>   the text within the commits. Applying to the net-next tree as Heiner Kallweit requested.
>>
>> v5:
>>   attempted some new optimisations, which were rejected, but not all and not completely.
>>
>> v4:
>>   fixed complaints as advised by Heiner and checkpatch.pl.
>>   split the patch into five sections to be more easily manipulated and reviewed
>>   introduced r8168_mac_ocp_write_seq()
>>   applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B
>>
>> v3:
>>   removed register/mask pair array sentinels, so using ARRAY_SIZE().
>>   avoided duplication of RTL_W32() call code as advised by Heiner.
>>
>> Mirsad Goran Todorovac (5):
>>    r8169: Coalesce r8169_mac_ocp_write/modify calls to reduce spinlock
>>      stalls
>>    r8169: Coalesce RTL8411b PHY power-down recovery calls to reduce
>>      spinlock stalls
>>    r8169: Coalesce mac ocp write and modify for 8168H start to reduce
>>      spinlocks
>>    r8169: Coalesce mac ocp commands for 8125 and 8125B start to reduce
>>      spinlock contention
>>    r8169: Coalesce mac ocp commands for rtl_hw_init_8125 to reduce
>>      spinlocks
>>
>>   drivers/net/ethernet/realtek/r8169_main.c | 304 +++++++++++-----------
>>   1 file changed, 150 insertions(+), 154 deletions(-)
>>

Hi, Mr. Kallweit,

So good to hear so soon from you. I'm encouraged that you are positive about improving
the speed and reducing the size of the Realtek drivers.

> You still write:
> "a lock storm that will stall all of the cores and CPUs on the same memory controller"
> even though you were informed that that's not the case.

I was not convinced. There is no such thing as a free lunch, and there is no locking
without affecting other cores, or locking would not make sense.

> There's no actual problem, therefore your Fixes tags are incorrect.

Mea culpa - my mistake, I will fix that in the next version.

> Also net-next is closed at the moment.

There is no problem with that, as these are only optimisation fixes, not zero day
exploits. I am a patient person.

> In patches 3-5 I see no benefit. And I have doubts whether the small benefit in
> patch 2 is worth adding all the helpers in patch 1.

I merely followed and mimicked the driver style from constructions like this one:

         static const struct ephy_info e_info_8168e_1[] = {
                 { 0x00, 0x0200, 0x0100 },
                 { 0x00, 0x0000, 0x0004 },
                 { 0x06, 0x0002, 0x0001 },
                 { 0x06, 0x0000, 0x0030 },
                 { 0x07, 0x0000, 0x2000 },
                 { 0x00, 0x0000, 0x0020 },
                 { 0x03, 0x5800, 0x2000 },
                 { 0x03, 0x0000, 0x0001 },
                 { 0x01, 0x0800, 0x1000 },
                 { 0x07, 0x0000, 0x4000 },
                 { 0x1e, 0x0000, 0x2000 },
                 { 0x19, 0xffff, 0xfe6c },
                 { 0x0a, 0x0000, 0x0040 }
         };

         rtl_set_def_aspm_entry_latency(tp);

         rtl_ephy_init(tp, e_info_8168e_1);

Here you did not think that introducing an array reduced code readability.

My ideal is a lockless driver using RCU, and you seem to prefer lock/unlock
on each RTL_W32() write. I am convinced that a driver with less
raw_spin_lock_irqsave()/raw_spin_unlock_irqrestore() pairs would scale better
with more NICs and more cores.

You said nothing to convince me otherwise.

But I am merely defending my point; this by no means implies disrespect or overlooking
your contribution to the source as a coder and a maintainer.

Realtek NICs are known as cheap NICs for motherboards, but they are becoming more ubiquitous,
and it is logical to use less locking, as locking is expensive. The "barrier" in writel()
guarantees sequential ordering of writes, and locking and unlocking on each read/modify/write
is unnecessary overhead, IMHO.

In conclusion, I would like to emphasise that improving lock contention for the code
is by no means a personal attack on the maintainer or a breach of the Code of Conduct.

If you are so much against the changes which Mr. Jacob Keller from Intel reviewed,
maybe we can cool emotions and start thinking rationally.

Additionally, I would like to "inline" many functions, as I think that call/return
sequences with stack frame generation /destruction are more expensive than inlining the
small one liners.

But I will certainly respect your opinion on the matter as a maintainer.

I realise that I might be optimising the cold paths of the code, but from your emails
it seems like nothing is worth optimising in this driver, and with all due respect, Sir,
I think that is dead wrong.

Of course, I am tempted to comply with authority as a kernel newbie, but I was reminded
in the spirit that this is exactly what the guys in Chernobyl did while maintaining the
reactor that malfunctioned: they did not dare to question the authority telling them that
everything is alright.

Have a nice evening, and please do not take these words as a breach of the Code or a
personal attack. I believe we are on the same side, and that is making this driver better.

The Linux kernel developer community was my last hope that this human race has the force
to improve mankind and make it worth surviving.

But sometimes it is more honourable to go down with the ship and preserve the honour.

Best regards,
Mirsad Todorovac

^ permalink raw reply

* [PATCH net-next V6] ptp: fix corrupted list in ptp_open
From: Edward Adam Davis @ 2023-11-05  2:12 UTC (permalink / raw)
  To: richardcochran
  Cc: davem, habetsm.xilinx, jeremy, linux-kernel, netdev, reibax,
	syzbot+df3f3ef31f60781fa911

There is no lock protection when writing ptp->tsevqs in ptp_open() and
ptp_release(), which can cause data corruption. Use a spin lock to avoid this
issue.

Moreover, ptp_release() should not be used to release the queue in ptp_read(),
so remove that call as well.

Reported-and-tested-by: syzbot+df3f3ef31f60781fa911@syzkaller.appspotmail.com
Fixes: 8f5de6fb2453 ("ptp: support multiple timestamp event readers")
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
---
 drivers/ptp/ptp_chardev.c | 11 +++++++----
 drivers/ptp/ptp_clock.c   |  1 +
 drivers/ptp/ptp_private.h |  1 +
 3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/ptp/ptp_chardev.c b/drivers/ptp/ptp_chardev.c
index 282cd7d24077..31594f40a21e 100644
--- a/drivers/ptp/ptp_chardev.c
+++ b/drivers/ptp/ptp_chardev.c
@@ -108,6 +108,7 @@ int ptp_open(struct posix_clock_context *pccontext, fmode_t fmode)
 		container_of(pccontext->clk, struct ptp_clock, clock);
 	struct timestamp_event_queue *queue;
 	char debugfsname[32];
+	unsigned long flags;
 
 	queue = kzalloc(sizeof(*queue), GFP_KERNEL);
 	if (!queue)
@@ -119,8 +120,10 @@ int ptp_open(struct posix_clock_context *pccontext, fmode_t fmode)
 	}
 	bitmap_set(queue->mask, 0, PTP_MAX_CHANNELS);
 	spin_lock_init(&queue->lock);
+	spin_lock_irqsave(&ptp->tsevqs_lock, flags);
 	list_add_tail(&queue->qlist, &ptp->tsevqs);
 	pccontext->private_clkdata = queue;
+	spin_unlock_irqrestore(&ptp->tsevqs_lock, flags);
 
 	/* Debugfs contents */
 	sprintf(debugfsname, "0x%p", queue);
@@ -139,13 +142,15 @@ int ptp_release(struct posix_clock_context *pccontext)
 {
 	struct timestamp_event_queue *queue = pccontext->private_clkdata;
 	unsigned long flags;
+	struct ptp_clock *ptp =
+		container_of(pccontext->clk, struct ptp_clock, clock);
 
 	if (queue) {
 		debugfs_remove(queue->debugfs_instance);
+		spin_lock_irqsave(&ptp->tsevqs_lock, flags);
 		pccontext->private_clkdata = NULL;
-		spin_lock_irqsave(&queue->lock, flags);
 		list_del(&queue->qlist);
-		spin_unlock_irqrestore(&queue->lock, flags);
+		spin_unlock_irqrestore(&ptp->tsevqs_lock, flags);
 		bitmap_free(queue->mask);
 		kfree(queue);
 	}
@@ -585,7 +590,5 @@ ssize_t ptp_read(struct posix_clock_context *pccontext, uint rdflags,
 free_event:
 	kfree(event);
 exit:
-	if (result < 0)
-		ptp_release(pccontext);
 	return result;
 }
diff --git a/drivers/ptp/ptp_clock.c b/drivers/ptp/ptp_clock.c
index 3d1b0a97301c..ea82648ad557 100644
--- a/drivers/ptp/ptp_clock.c
+++ b/drivers/ptp/ptp_clock.c
@@ -247,6 +247,7 @@ struct ptp_clock *ptp_clock_register(struct ptp_clock_info *info,
 	if (!queue)
 		goto no_memory_queue;
 	list_add_tail(&queue->qlist, &ptp->tsevqs);
+	spin_lock_init(&ptp->tsevqs_lock);
 	queue->mask = bitmap_alloc(PTP_MAX_CHANNELS, GFP_KERNEL);
 	if (!queue->mask)
 		goto no_memory_bitmap;
diff --git a/drivers/ptp/ptp_private.h b/drivers/ptp/ptp_private.h
index 52f87e394aa6..63af246f17eb 100644
--- a/drivers/ptp/ptp_private.h
+++ b/drivers/ptp/ptp_private.h
@@ -44,6 +44,7 @@ struct ptp_clock {
 	struct pps_device *pps_source;
 	long dialed_frequency; /* remembers the frequency adjustment */
 	struct list_head tsevqs; /* timestamp fifo list */
+	spinlock_t tsevqs_lock; /* one process at a time writing the timestamp fifo list*/
 	struct mutex pincfg_mux; /* protect concurrent info->pin_config access */
 	wait_queue_head_t tsev_wq;
 	int defunct; /* tells readers to go away when clock is being removed */
-- 
2.25.1


^ permalink raw reply related
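
The locking change in the ptp patch above moves protection from the individual
queue to the list that owns it. What follows is only a hedged user-space
sketch of that pattern, not the ptp code: struct node, struct clock,
clock_open() and clock_release() are hypothetical stand-ins, and a C11
atomic_flag replaces the kernel spinlock. The point it illustrates is that
every writer to the shared list takes the same list-level lock, so a
concurrent add and delete cannot corrupt the links.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a circular doubly linked list node. */
struct node {
	struct node *prev, *next;
};

struct clock {
	struct node queues;		/* list head */
	atomic_flag queues_lock;	/* protects every write to 'queues' */
};

struct queue {
	struct node link;
	int id;
};

static void clock_init(struct clock *c)
{
	assert(c != NULL);
	c->queues.prev = c->queues.next = &c->queues;
	atomic_flag_clear(&c->queues_lock);
}

static void clock_lock(struct clock *c)
{
	while (atomic_flag_test_and_set_explicit(&c->queues_lock,
						 memory_order_acquire))
		;	/* spin */
}

static void clock_unlock(struct clock *c)
{
	atomic_flag_clear_explicit(&c->queues_lock, memory_order_release);
}

/* Allocate a queue and append it to the list under the list lock. */
static struct queue *clock_open(struct clock *c, int id)
{
	struct queue *q = calloc(1, sizeof(*q));

	if (!q)
		return NULL;
	q->id = id;

	clock_lock(c);
	q->link.prev = c->queues.prev;
	q->link.next = &c->queues;
	c->queues.prev->next = &q->link;
	c->queues.prev = &q->link;
	clock_unlock(c);
	return q;
}

/* Unlink the queue under the same list lock, then free it. */
static void clock_release(struct clock *c, struct queue *q)
{
	if (!q)
		return;

	clock_lock(c);
	q->link.prev->next = q->link.next;
	q->link.next->prev = q->link.prev;
	clock_unlock(c);
	free(q);
}

/* Count the entries; readers take the list lock too in this sketch. */
static size_t clock_count(struct clock *c)
{
	struct node *p;
	size_t n = 0;

	clock_lock(c);
	for (p = c->queues.next; p != &c->queues; p = p->next)
		n++;
	clock_unlock(c);
	return n;
}
```

Locking the element being removed, as the pre-patch ptp_release() did with
queue->lock, does not serialise against another thread that is adding a
different element to the same list; only a lock shared by all writers does.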

* Re: [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to reduce spinlock contention
From: Mirsad Todorovac @ 2023-11-05  2:18 UTC (permalink / raw)
  To: Heiner Kallweit, linux-kernel
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, nic_swsd
In-Reply-To: <da4409f3-d509-413b-8433-f222acbbb1be@gmail.com>



On 11/4/23 23:37, Heiner Kallweit wrote:
> On 04.11.2023 23:15, Mirsad Goran Todorovac wrote:
>> The motivation for these helpers was the locking overhead of 130 consecutive
>> r8168_mac_ocp_write() calls in the RTL8411b reset after the NIC gets confused
>> if the PHY is powered-down.
>>
>> To quote Heiner:
>>
>>      On RTL8411b the RX unit gets confused if the PHY is powered-down.
>>      This was reported in [0] and confirmed by Realtek. Realtek provided
>>      a sequence to fix the RX unit after PHY wakeup.
>>
>> A series of about 130 r8168_mac_ocp_write() calls is performed to program the
>> RTL registers for recovery, each doing an expensive spin_lock_irqsave() and
>> spin_unlock_irqrestore().
>>
>> Each mac ocp write is made of:
>>
>>      static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
>>                        u32 data)
>>      {
>>          if (rtl_ocp_reg_failure(reg))
>>              return;
>>
>>          RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
>>      }
>>
>>      static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
>>                      u32 data)
>>      {
>>          unsigned long flags;
>>
>>          raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>>          __r8168_mac_ocp_write(tp, reg, data);
>>          raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>      }
>>
>> Register programming is done through RTL_W32() macro which expands into
>>
>>      #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg))
>>
>> which is further (on Alpha):
>>
>>      extern inline void writel(u32 b, volatile void __iomem *addr)
>>      {
>>          mb();
>>          __raw_writel(b, addr);
>>      }
>>
>> or on i386/x86_64:
>>
>>      #define build_mmio_write(name, size, type, reg, barrier) \
>>      static inline void name(type val, volatile void __iomem *addr) \
>>      { asm volatile("mov" size " %0,%1": :reg (val), \
>>      "m" (*(volatile type __force *)addr) barrier); }
>>
>>      build_mmio_write(writel, "l", unsigned int, "r", :"memory")
>>
>> This obviously involves at least a compiler barrier.
>>
>> mb() expands into something like this i.e. on x86_64:
>>
>>      #define mb()    asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
>>
>> This means a whole lot of memory bus stalls: for spin_lock_irqsave(),
>> memory barrier, writel(), and spin_unlock_irqrestore().
>>
>> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
>> a lock storm that will stall all of the cores and CPUs on the same memory controller
>> for certain time I/O takes to finish.
>>
>> In a sequential case of RTL register programming, the writes to RTL registers
>> can be coalesced under a same raw spinlock. This can dramatically decrease the
>> number of bus stalls in a multicore or multi-CPU system.
>>
>> Macro helpers r8168_mac_ocp_write_seq() and r8168_mac_ocp_modify_seq() are
>> provided to reduce lock contention:
>>
>>      static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
>>      {
>>
>>          ...
>>
>>          /* The following Realtek-provided magic fixes an issue with the RX unit
>>           * getting confused after the PHY having been powered-down.
>>           */
>>
>>          static const struct recover_8411b_info init_zero_seq[] = {
>>              { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 },
>>              ...
>>          };
>>
>>          ...
>>
>>          r8168_mac_ocp_write_seq(tp, init_zero_seq);
>>
>>          ...
>>
>>      }
>>
>> The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/
>> functions that only changed the function names and the ending of the line, so the actual
>> hex data is unchanged.
>>
>> To repeat, the reason for the introduction of the original commit
>> was to enable recovery of the RX unit on the RTL8411b which was confused by the
>> powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem
>> into a series of about 500+ memory bus locks, most waiting for the main memory read,
>> modify and write under a LOCK. The memory barrier in RTL_W32 should suffice for
>> the programming sequence to reach RTL NIC registers.
>>
>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075
>>
>> v6:
>>   proceeded according to Jacob Keller's suggestions by creating a cover page and reducing
>>   the text within the commits. Applying to the net-next tree as Heiner Kallweit requested.
>>
>> v5:
>>   attempted some new optimisations, which were rejected, but not all and not completely.
>>
>> v4:
>>   fixed complaints as advised by Heiner and checkpatch.pl.
>>   split the patch into five sections to be more easily manipulated and reviewed
>>   introduced r8168_mac_ocp_write_seq()
>>   applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B
>>
>> v3:
>>   removed register/mask pair array sentinels, so using ARRAY_SIZE().
>>   avoided duplication of RTL_W32() call code as advised by Heiner.
>>
>> Mirsad Goran Todorovac (5):
>>    r8169: Coalesce r8169_mac_ocp_write/modify calls to reduce spinlock
>>      stalls
>>    r8169: Coalesce RTL8411b PHY power-down recovery calls to reduce
>>      spinlock stalls
>>    r8169: Coalesce mac ocp write and modify for 8168H start to reduce
>>      spinlocks
>>    r8169: Coalesce mac ocp commands for 8125 and 8125B start to reduce
>>      spinlock contention
>>    r8169: Coalesce mac ocp commands for rtl_hw_init_8125 to reduce
>>      spinlocks
>>
>>   drivers/net/ethernet/realtek/r8169_main.c | 304 +++++++++++-----------
>>   1 file changed, 150 insertions(+), 154 deletions(-)
>>
> 
> You still write:
> "a lock storm that will stall all of the cores and CPUs on the same memory controller"
> even though you were informed that that's not the case.
> There's no actual problem, therefore your Fixes tags are incorrect.
> Also net-next is closed at the moment.
> In patches 3-5 I see no benefit. And I have doubts whether the small benefit in
> patch 2 is worth adding all the helpers in patch 1.

After some thought, I would like to have a consensus on these patches, rather than have
someone feel defeated or outvoted.

So I will try to reach some common ground, if you think the cause is worth it.

Why is adding six lines of a helper a problem worse than removing 130 lines of callers?

I would hate to think that the Linux kernel developer community became the place where
Authority has higher weight than Reason and Logic.

I have no personal gain from improving these drivers other than the Galactic credits.

One thing I wouldn't like and do not like is the Windows drivers being better because
their programmers are more innovative.

Best regards,
Mirsad Todorovac

^ permalink raw reply

* Re: Bypass qdiscs?
From: John Ousterhout @ 2023-11-05  2:47 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: netdev
In-Reply-To: <29217dab-e00e-4e4c-8d6a-4088d8e79c8e@lunn.ch>

I haven't tried creating a "pass through" qdisc, but that seems like a
reasonable approach if (as it seems) there isn't something already
built-in that provides equivalent functionality.

-John-

P.S. If hardware starts supporting Homa, I hope that it will be
possible to move the entire transport to the NIC, so that applications
can bypass the kernel entirely, as with RDMA.

On Sat, Nov 4, 2023 at 8:08 AM Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Fri, Nov 03, 2023 at 04:55:35PM -0700, John Ousterhout wrote:
> > Is there a way to mark an skb (or its socket) before invoking
> > ip_queue_xmit/ip6_xmit so that the packet will bypass the qdiscs and
> > be transmitted immediately? Is doing such a thing considered bad
> > practice?
> >
> > (Homa has its own packet scheduling mechanism so the qdiscs are just
> > getting in the way and adding delays)
>
> Hi John
>
> One thing to think about is what happens when hardware starts
> supporting Homa. Can the packet scheduling be moved into the hardware?
> Ideally you want to make use of the existing mechanisms to offload
> scheduling to the hardware, rather than add a Homa specific one.
>
> Did you try adding a Homa specific qdisc implementing the scheduling
> algorithm? Did it kill performance? We prefer to try to fix problems,
> rather than bypass them.
>
>        Andrew

^ permalink raw reply

* Re: [PATCH bpf-next v10 10/13] bpf, net: switch to dynamic registration
From: kernel test robot @ 2023-11-05  5:16 UTC (permalink / raw)
  To: thinker.li, bpf, ast, martin.lau, song, kernel-team, andrii,
	drosen
  Cc: oe-kbuild-all, sinquersw, kuifeng, Kui-Feng Lee, netdev
In-Reply-To: <20231103232202.3664407-11-thinker.li@gmail.com>

Hi,

kernel test robot noticed the following build errors:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/thinker-li-gmail-com/bpf-refactory-struct_ops-type-initialization-to-a-function/20231104-072528
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20231103232202.3664407-11-thinker.li%40gmail.com
patch subject: [PATCH bpf-next v10 10/13] bpf, net: switch to dynamic registration
config: riscv-randconfig-002-20231105 (https://download.01.org/0day-ci/archive/20231105/202311051202.DeubcWTl-lkp@intel.com/config)
compiler: riscv64-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231105/202311051202.DeubcWTl-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311051202.DeubcWTl-lkp@intel.com/

All errors (new ones prefixed by >>):

   riscv64-linux-ld: kernel/bpf/btf.o: in function `btf_array_show':
>> kernel/bpf/btf.c:3044:(.text+0x6c38): undefined reference to `bpf_struct_ops_desc_init'


vim +3044 kernel/bpf/btf.c

31d0bc81637d8d Alan Maguire     2020-09-28  3032  
31d0bc81637d8d Alan Maguire     2020-09-28  3033  static void btf_array_show(const struct btf *btf, const struct btf_type *t,
31d0bc81637d8d Alan Maguire     2020-09-28  3034  			   u32 type_id, void *data, u8 bits_offset,
31d0bc81637d8d Alan Maguire     2020-09-28  3035  			   struct btf_show *show)
31d0bc81637d8d Alan Maguire     2020-09-28  3036  {
31d0bc81637d8d Alan Maguire     2020-09-28  3037  	const struct btf_member *m = show->state.member;
31d0bc81637d8d Alan Maguire     2020-09-28  3038  
31d0bc81637d8d Alan Maguire     2020-09-28  3039  	/*
31d0bc81637d8d Alan Maguire     2020-09-28  3040  	 * First check if any members would be shown (are non-zero).
31d0bc81637d8d Alan Maguire     2020-09-28  3041  	 * See comments above "struct btf_show" definition for more
31d0bc81637d8d Alan Maguire     2020-09-28  3042  	 * details on how this works at a high-level.
31d0bc81637d8d Alan Maguire     2020-09-28  3043  	 */
31d0bc81637d8d Alan Maguire     2020-09-28 @3044  	if (show->state.depth > 0 && !(show->flags & BTF_SHOW_ZERO)) {
31d0bc81637d8d Alan Maguire     2020-09-28  3045  		if (!show->state.depth_check) {
31d0bc81637d8d Alan Maguire     2020-09-28  3046  			show->state.depth_check = show->state.depth + 1;
31d0bc81637d8d Alan Maguire     2020-09-28  3047  			show->state.depth_to_show = 0;
31d0bc81637d8d Alan Maguire     2020-09-28  3048  		}
31d0bc81637d8d Alan Maguire     2020-09-28  3049  		__btf_array_show(btf, t, type_id, data, bits_offset, show);
31d0bc81637d8d Alan Maguire     2020-09-28  3050  		show->state.member = m;
31d0bc81637d8d Alan Maguire     2020-09-28  3051  
31d0bc81637d8d Alan Maguire     2020-09-28  3052  		if (show->state.depth_check != show->state.depth + 1)
31d0bc81637d8d Alan Maguire     2020-09-28  3053  			return;
31d0bc81637d8d Alan Maguire     2020-09-28  3054  		show->state.depth_check = 0;
31d0bc81637d8d Alan Maguire     2020-09-28  3055  
31d0bc81637d8d Alan Maguire     2020-09-28  3056  		if (show->state.depth_to_show <= show->state.depth)
31d0bc81637d8d Alan Maguire     2020-09-28  3057  			return;
31d0bc81637d8d Alan Maguire     2020-09-28  3058  		/*
31d0bc81637d8d Alan Maguire     2020-09-28  3059  		 * Reaching here indicates we have recursed and found
31d0bc81637d8d Alan Maguire     2020-09-28  3060  		 * non-zero array member(s).
31d0bc81637d8d Alan Maguire     2020-09-28  3061  		 */
31d0bc81637d8d Alan Maguire     2020-09-28  3062  	}
31d0bc81637d8d Alan Maguire     2020-09-28  3063  	__btf_array_show(btf, t, type_id, data, bits_offset, show);
b00b8daec828dd Martin KaFai Lau 2018-04-18  3064  }
b00b8daec828dd Martin KaFai Lau 2018-04-18  3065  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH v9 bpf-next 01/17] bpf: align CAP_NET_ADMIN checks with bpf_capable() approach
From: Yafang Shao @ 2023-11-05  6:33 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, netdev, paul, brauner, linux-fsdevel, linux-security-module,
	keescook, kernel-team, sargun
In-Reply-To: <20231103190523.6353-2-andrii@kernel.org>

On Sat, Nov 4, 2023 at 3:05 AM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> Within BPF syscall handling code CAP_NET_ADMIN checks stand out a bit
> compared to CAP_BPF and CAP_PERFMON checks. For the latter, CAP_BPF or
> CAP_PERFMON are checked first, but if they are not set, CAP_SYS_ADMIN
> takes over and grants whatever part of BPF syscall is required.
>
> Similar kind of checks that involve CAP_NET_ADMIN are not so consistent.
> One out of four uses does follow CAP_BPF/CAP_PERFMON model: during
> BPF_PROG_LOAD, if the type of BPF program is "network-related" either
> CAP_NET_ADMIN or CAP_SYS_ADMIN is required to proceed.
>
> But in three other cases CAP_NET_ADMIN is required even if CAP_SYS_ADMIN
> is set:
>   - when creating DEVMAP/XSKMAP/CPU_MAP maps;
>   - when attaching CGROUP_SKB programs;
>   - when handling BPF_PROG_QUERY command.
>
> This patch is changing the latter three cases to follow BPF_PROG_LOAD
> model, that is allowing to proceed under either CAP_NET_ADMIN or
> CAP_SYS_ADMIN.
>
> This also makes it cleaner in subsequent BPF token patches to switch
> wholesomely to a generic bpf_token_capable(int cap) check, that always
> falls back to CAP_SYS_ADMIN if requested capability is missing.
>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

Acked-by: Yafang Shao <laoar.shao@gmail.com>

> ---
>  kernel/bpf/syscall.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 0ed286b8a0f0..ad4d8e433ccc 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1096,6 +1096,11 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
>         return ret;
>  }
>
> +static bool bpf_net_capable(void)
> +{
> +       return capable(CAP_NET_ADMIN) || capable(CAP_SYS_ADMIN);
> +}
> +
>  #define BPF_MAP_CREATE_LAST_FIELD map_extra
>  /* called via syscall */
>  static int map_create(union bpf_attr *attr)
> @@ -1199,7 +1204,7 @@ static int map_create(union bpf_attr *attr)
>         case BPF_MAP_TYPE_DEVMAP:
>         case BPF_MAP_TYPE_DEVMAP_HASH:
>         case BPF_MAP_TYPE_XSKMAP:
> -               if (!capable(CAP_NET_ADMIN))
> +               if (!bpf_net_capable())
>                         return -EPERM;
>                 break;
>         default:
> @@ -2599,7 +2604,7 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
>             !bpf_capable())
>                 return -EPERM;
>
> -       if (is_net_admin_prog_type(type) && !capable(CAP_NET_ADMIN) && !capable(CAP_SYS_ADMIN))
> +       if (is_net_admin_prog_type(type) && !bpf_net_capable())
>                 return -EPERM;
>         if (is_perfmon_prog_type(type) && !perfmon_capable())
>                 return -EPERM;
> @@ -3751,7 +3756,7 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
>         case BPF_PROG_TYPE_SK_LOOKUP:
>                 return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
>         case BPF_PROG_TYPE_CGROUP_SKB:
> -               if (!capable(CAP_NET_ADMIN))
> +               if (!bpf_net_capable())
>                         /* cg-skb progs can be loaded by unpriv user.
>                          * check permissions at attach time.
>                          */
> @@ -3954,7 +3959,7 @@ static int bpf_prog_detach(const union bpf_attr *attr)
>  static int bpf_prog_query(const union bpf_attr *attr,
>                           union bpf_attr __user *uattr)
>  {
> -       if (!capable(CAP_NET_ADMIN))
> +       if (!bpf_net_capable())
>                 return -EPERM;
>         if (CHECK_ATTR(BPF_PROG_QUERY))
>                 return -EINVAL;
> --
> 2.34.1
>
>


-- 
Regards
Yafang
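
For readers skimming the quoted patch, the behavioural change can be
modelled in a few lines of userspace C. This is only a sketch: the bit
flags and the model_* names below are hypothetical stand-ins for the
kernel's capable() checks, not kernel API.

```c
#include <stdbool.h>

/* Hypothetical capability bits, standing in for CAP_NET_ADMIN and
 * CAP_SYS_ADMIN as tested by capable() in the kernel.
 */
#define MODEL_CAP_NET_ADMIN (1u << 0)
#define MODEL_CAP_SYS_ADMIN (1u << 1)

static bool model_capable(unsigned int caps, unsigned int cap)
{
	return (caps & cap) != 0;
}

/* Check used by the three inconsistent call sites before the patch:
 * CAP_NET_ADMIN is required even when CAP_SYS_ADMIN is set.
 */
static bool model_net_check_old(unsigned int caps)
{
	return model_capable(caps, MODEL_CAP_NET_ADMIN);
}

/* Check after the patch, mirroring the quoted bpf_net_capable():
 * CAP_SYS_ADMIN takes over when CAP_NET_ADMIN is missing.
 */
static bool model_net_check_new(unsigned int caps)
{
	return model_capable(caps, MODEL_CAP_NET_ADMIN) ||
	       model_capable(caps, MODEL_CAP_SYS_ADMIN);
}
```

The only input for which the two checks disagree is a task holding
CAP_SYS_ADMIN without CAP_NET_ADMIN, which is the case the patch
description calls out.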

^ permalink raw reply

* Re: [PATCH net] net: xt_recent: fix (increase) ipv6 literal buffer length
From: Jan Engelhardt @ 2023-11-05  7:08 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: Maciej Żenczykowski, David S . Miller, Pablo Neira Ayuso,
	Florian Westphal, Linux Network Development Mailing List,
	Netfilter Development Mailing List, Patrick McHardy
In-Reply-To: <20231104210053.343149-1-maze@google.com>


On Saturday 2023-11-04 22:00, Maciej Żenczykowski wrote:
>
>IPv4 in IPv6 is supported by in6_pton [...]
>but the provided buffer is too short:

If in6_pton were to support tunnel traffic... wait, that sounds
unusual, and it would require dst to be at least 20 bytes, which the
function documentation contradicts.

The RFCs make no precise name proposition

	(IPv6 Text Representation, third alternative,
	IPv4 "decimal value" of the "four low-order 8-bit pieces")

so let's just call it

	"low-32-bit dot-decimal representation"

which should avoid the tunnel term.
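
To make the buffer-length point concrete, here is a small userspace
sketch. It assumes that POSIX inet_pton() is an acceptable stand-in for
the kernel's in6_pton() when it comes to accepting this textual form;
the macro and helper names are made up for illustration.

```c
#include <arpa/inet.h>
#include <string.h>

/* Longest literal written purely in hex groups: 39 characters. */
#define MODEL_LONGEST_HEX   "ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff"
/* Longest literal with the low 32 bits in dot-decimal: 45 characters. */
#define MODEL_LONGEST_MIXED "ffff:ffff:ffff:ffff:ffff:ffff:255.255.255.255"

/* Parse an IPv6 literal and copy out its low 32 bits.
 * Returns 1 on success, 0 if the text is not a valid IPv6 literal.
 */
static int model_parse_low32(const char *text, unsigned char low32[4])
{
	struct in6_addr addr;

	if (inet_pton(AF_INET6, text, &addr) != 1)
		return 0;
	memcpy(low32, &addr.s6_addr[12], 4);
	return 1;
}
```

A buffer sized for the all-hex form (40 bytes including the NUL) is
six bytes short of what the dot-decimal form can need (46 bytes, which
is also the value of INET6_ADDRSTRLEN).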

^ permalink raw reply

* Re: [PATCH] net: phy: at803x: add QCA8084 ethernet phy support
From: Jie Luo @ 2023-11-05  7:42 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: hkallweit1, linux, davem, edumazet, kuba, pabeni, netdev,
	linux-kernel
In-Reply-To: <1aaf7a4b-fc7e-4203-af71-740bc187d046@lunn.ch>



On 11/4/2023 10:19 PM, Andrew Lunn wrote:
> On Sat, Nov 04, 2023 at 02:25:25PM +0800, Jie Luo wrote:
>>
>>
>> On 11/3/2023 9:01 PM, Andrew Lunn wrote:
>>>>    #define QCA8081_PHY_ID				0x004dd101
>>>> +#define QCA8081_PHY_MASK			0xffffff00
>>>
>>> That is an unusual mask. Please check it is correct. All you should
>>> need its PHY_ID_MATCH_EXACT, PHY_ID_MATCH_MODEL, PHY_ID_MATCH_VENDOR.
>>
>> Thanks Andrew for the review.
>> The PHY ID of qca8084 is correct, i will update to use PHY_ID_MATCH_EXACT in
>> the new added entry for qca8084.
> 
> Note, i asked about the mask, not the ID. Is PHY_ID_MATCH_EXACT maybe
> too exact? Is there the option for different revisions of the PHY? Can
> one entry in the table be used for multiple revisions?
> 
> 
>      Andrew
> 
> ---
> pw-bot: cr

Sure, Andrew. The qca8084 is a different chip model from the qca8081, and 
there is currently only one PHY ID shared by the qca8084 (4-port) and 
qca8082 (2-port), but we can use PHY_ID_MATCH_MODEL for the newly added 
entry covering qca8084 and qca8082.

^ permalink raw reply

* Re: [PATCH] net: phy: at803x: add QCA8084 ethernet phy support
From: Jie Luo @ 2023-11-05  8:09 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: andrew, hkallweit1, davem, edumazet, kuba, pabeni, netdev,
	linux-kernel
In-Reply-To: <ZUadBJQLFA4f/gQY@shell.armlinux.org.uk>



On 11/5/2023 3:35 AM, Russell King (Oracle) wrote:
> On Fri, Nov 03, 2023 at 08:35:37PM +0800, Luo Jie wrote:
>> Add qca8084 PHY support, which is four-port PHY with maximum
>> link capability 2.5G, the features of each port is almost same
>> as QCA8081 and slave seed config is not needed.
>>
>> There are some initialization configurations needed.
>> 1. Configuring qca8084 related initializations including
>> MSE detect threshold and ADC clock edge invert.
>> 2. Add the additional configurations for the CDT feature.
>>
>> Signed-off-by: Luo Jie <quic_luoj@quicinc.com>
>> ---
>>   drivers/net/phy/at803x.c | 40 +++++++++++++++++++++++++++++++++++++---
>>   1 file changed, 37 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/phy/at803x.c b/drivers/net/phy/at803x.c
>> index 37fb033e1c29..4124eb76d835 100644
>> --- a/drivers/net/phy/at803x.c
>> +++ b/drivers/net/phy/at803x.c
>> @@ -176,6 +176,8 @@
>>   #define AT8030_PHY_ID_MASK			0xffffffef
>>   
>>   #define QCA8081_PHY_ID				0x004dd101
>> +#define QCA8081_PHY_MASK			0xffffff00
>> +#define QCA8084_PHY_ID				0x004dd180
> ...
>> @@ -2207,8 +2240,9 @@ static struct phy_driver at803x_driver[] = {
>>   	.resume			= qca83xx_resume,
>>   }, {
>>   	/* Qualcomm QCA8081 */
>> -	PHY_ID_MATCH_EXACT(QCA8081_PHY_ID),
>> -	.name			= "Qualcomm QCA8081",
>> +	.phy_id			= QCA8081_PHY_ID,
>> +	.phy_id_mask		= QCA8081_PHY_MASK,
>> +	.name			= "Qualcomm QCA808X",
> ...
>> @@ -2241,7 +2275,7 @@ static struct mdio_device_id __maybe_unused atheros_tbl[] = {
>>   	{ PHY_ID_MATCH_EXACT(QCA8327_A_PHY_ID) },
>>   	{ PHY_ID_MATCH_EXACT(QCA8327_B_PHY_ID) },
>>   	{ PHY_ID_MATCH_EXACT(QCA9561_PHY_ID) },
>> -	{ PHY_ID_MATCH_EXACT(QCA8081_PHY_ID) },
>> +	{ QCA8081_PHY_ID, QCA8081_PHY_MASK},
> 
> So, in summary from the above, what you're doing is using the pair of
> QCA8081_PHY_ID, QCA8081_PHY_MASK to match not only QCA8081 but also
> QCA8084. This is confusing.

Yes, Russell.
qca8084 is a different PHY model from the existing qca8081; 
qca8084 needs extra PCS configuration and clock configuration, which 
will be pushed to the PCS driver.

I will update the patch to use PHY_ID_MATCH_MODEL for the newly added 
qca8084 entry to distinguish it from qca8081.

> 
> Are there any other parts that QCA808X would correspond with which
> would not be compatible with the above? E.g. QCA8082, QCA8083, QCA8088
> etc.

The newly added PHY chips qca8082, qca8084 and qca8085 use the same PHY ID, 
so PHY_ID_MATCH_MODEL should be able to cover the newly added entry.

> 
> If there are, then the correct approach would be to list them
> separately in atheros_tbl, and also have separate driver entries in
> at803x_driver so it's unambiguous.
> 
> If we keep this approach, then I would suggest:
> 
> #define QCA808X_PHY_ID		0x004dd100
> #define QCA808X_PHY_MASK	GENMASK(31, 8)
> 
> to make it explicit that this phy ID/mask pair is matching several
> devices, rather than re-using the QCA8081_PHY_ID definition.

Thanks for the suggestion. PHY_ID_MATCH_MODEL should match the newly 
added PHY devices.

> 
> 
> The next point - what about the revision field which occupies bits 3:0
> in these:

bits[3:0] hold the revision number; currently there is only one PHY ID 
(revision number 0) used for the newly added series of PHY chips.

> 
>>   static bool qca808x_has_fast_retrain_or_slave_seed(struct phy_device *phydev)
>>   {
>> +	if (phydev->phy_id == QCA8084_PHY_ID)
>> +		return false;
>> +
> ...
>> @@ -1767,6 +1781,20 @@ static int qca808x_config_init(struct phy_device *phydev)
>>   {
>>   	int ret;
>>   
>> +	if (phydev->phy_id == QCA8084_PHY_ID) {
>> +		/* Invert ADC clock edge */
> ...
>> @@ -1958,6 +1986,11 @@ static int qca808x_cable_test_start(struct phy_device *phydev)
>>   	phy_write_mmd(phydev, MDIO_MMD_PCS, 0x807a, 0xc060);
>>   	phy_write_mmd(phydev, MDIO_MMD_PCS, 0x807e, 0xb060);
>>   
>> +	if (phydev->phy_id == QCA8084_PHY_ID) {
> 
> Do these need to be exact matches, or should the revision field be
> ignored? If so, consider using phy_id_compare(), or if you end up with
> separate driver entries, consider using phydev_id_compare().
> 
> Thanks.
> 
OK, I will double-check this configuration and consider using the 
suggested API in the next version of the patch. Thanks, Russell, for the review.
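
As a rough illustration of the mask question raised above, the sketch
below compares the two IDs from the patch under the masks that
PHY_ID_MATCH_EXACT, PHY_ID_MATCH_MODEL and PHY_ID_MATCH_VENDOR select
(GENMASK(31, 0), GENMASK(31, 4) and GENMASK(31, 10)), plus the
0xffffff00 mask from the patch. The MODEL_* names and model_id_match()
are hypothetical userspace stand-ins; the comparison follows the same
XOR-and-mask logic as phy_id_compare().

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_QCA8081_ID  0x004dd101u
#define MODEL_QCA8084_ID  0x004dd180u

#define MODEL_MASK_EXACT  0xffffffffu /* GENMASK(31, 0)  */
#define MODEL_MASK_MODEL  0xfffffff0u /* GENMASK(31, 4)  */
#define MODEL_MASK_VENDOR 0xfffffc00u /* GENMASK(31, 10) */
#define MODEL_MASK_PATCH  0xffffff00u /* QCA8081_PHY_MASK in the patch */

/* True when the two IDs agree on every bit selected by the mask. */
static bool model_id_match(uint32_t id1, uint32_t id2, uint32_t mask)
{
	return !((id1 ^ id2) & mask);
}
```

The two IDs differ only in bits 7 and 0, so the 0xffffff00 mask from
the patch makes a QCA8081 table entry match a QCA8084 as well, while an
entry using the model mask keeps the two apart and still ignores the
revision field in bits 3:0.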

^ permalink raw reply

* Re: [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to reduce spinlock contention
From: Heiner Kallweit @ 2023-11-05  9:20 UTC (permalink / raw)
  To: Mirsad Todorovac, linux-kernel
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, nic_swsd
In-Reply-To: <edee64f4-442d-4670-a91b-e5b83117dd40@alu.unizg.hr>

On 05.11.2023 01:15, Mirsad Todorovac wrote:
> 
> 
> On 11/4/23 23:37, Heiner Kallweit wrote:
>> On 04.11.2023 23:15, Mirsad Goran Todorovac wrote:
>>> The motivation for these helpers was the locking overhead of 130 consecutive
>>> r8168_mac_ocp_write() calls in the RTL8411b reset after the NIC gets confused
>>> if the PHY is powered-down.
>>>
>>> To quote Heiner:
>>>
>>>      On RTL8411b the RX unit gets confused if the PHY is powered-down.
>>>      This was reported in [0] and confirmed by Realtek. Realtek provided
>>>      a sequence to fix the RX unit after PHY wakeup.
>>>
>>> A series of about 130 r8168_mac_ocp_write() calls is performed to program the
>>> RTL registers for recovery, each doing an expensive spin_lock_irqsave() and
>>> spin_unlock_irqrestore().
>>>
>>> Each mac ocp write is made of:
>>>
>>>      static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
>>>                        u32 data)
>>>      {
>>>          if (rtl_ocp_reg_failure(reg))
>>>              return;
>>>
>>>          RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
>>>      }
>>>
>>>      static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
>>>                      u32 data)
>>>      {
>>>          unsigned long flags;
>>>
>>>          raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>>>          __r8168_mac_ocp_write(tp, reg, data);
>>>          raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>      }
>>>
>>> Register programming is done through RTL_W32() macro which expands into
>>>
>>>      #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg))
>>>
>>> which is further (on Alpha):
>>>
>>>      extern inline void writel(u32 b, volatile void __iomem *addr)
>>>      {
>>>          mb();
>>>          __raw_writel(b, addr);
>>>      }
>>>
>>> or on i386/x86_64:
>>>
>>>      #define build_mmio_write(name, size, type, reg, barrier) \
>>>      static inline void name(type val, volatile void __iomem *addr) \
>>>      { asm volatile("mov" size " %0,%1": :reg (val), \
>>>      "m" (*(volatile type __force *)addr) barrier); }
>>>
>>>      build_mmio_write(writel, "l", unsigned int, "r", :"memory")
>>>
>>> This obviously involves at least a compiler barrier.
>>>
>>> mb() expands into something like this i.e. on x86_64:
>>>
>>>      #define mb()    asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
>>>
>>> This means a whole lot of memory bus stalls: for spin_lock_irqsave(),
>>> memory barrier, writel(), and spin_unlock_irqrestore().
>>>
>>> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
>>> a lock storm that will stall all of the cores and CPUs on the same memory controller
>>> for certain time I/O takes to finish.
>>>
>>> In a sequential case of RTL register programming, the writes to RTL registers
>>> can be coalesced under a same raw spinlock. This can dramatically decrease the
>>> number of bus stalls in a multicore or multi-CPU system.
>>>
>>> Macro helpers r8168_mac_ocp_write_seq() and r8168_mac_ocp_modify_seq() are
>>> provided to reduce lock contention:
>>>
>>>      static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
>>>      {
>>>
>>>          ...
>>>
>>>          /* The following Realtek-provided magic fixes an issue with the RX unit
>>>           * getting confused after the PHY having been powered-down.
>>>           */
>>>
>>>          static const struct recover_8411b_info init_zero_seq[] = {
>>>              { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 },
>>>              ...
>>>          };
>>>
>>>          ...
>>>
>>>          r8168_mac_ocp_write_seq(tp, init_zero_seq);
>>>
>>>          ...
>>>
>>>      }
>>>
>>> The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/
>>> functions that only changed the function names and the ending of the line, so the actual
>>> hex data is unchanged.
>>>
>>> To repeat, the reason for the introduction of the original commit
>>> was to enable recovery of the RX unit on the RTL8411b which was confused by the
>>> powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem
>>> into a series of about 500+ memory bus locks, most waiting for the main memory read,
>>> modify and write under a LOCK. The memory barrier in RTL_W32 should suffice for
>>> the programming sequence to reach RTL NIC registers.
>>>
>>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075
>>>
>>> v6:
>>>   proceeded according to Jacob Keller's suggestions by creating a cover page and reducing
>>>   the text within the commits. Applying to the net-next tree as Heiner Kallweit requested.
>>>
>>> v5:
>>>   attempted some new optimisations, which were rejected, but not all and not completely.
>>>
>>> v4:
>>>   fixed complaints as advised by Heiner and checkpatch.pl.
>>>   split the patch into five sections to be more easily manipulated and reviewed
>>>   introduced r8168_mac_ocp_write_seq()
>>>   applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B
>>>
>>> v3:
>>>   removed register/mask pair array sentinels, so using ARRAY_SIZE().
>>>   avoided duplication of RTL_W32() call code as advised by Heiner.
>>>
>>> Mirsad Goran Todorovac (5):
>>>    r8169: Coalesce r8169_mac_ocp_write/modify calls to reduce spinlock
>>>      stalls
>>>    r8169: Coalesce RTL8411b PHY power-down recovery calls to reduce
>>>      spinlock stalls
>>>    r8169: Coalesce mac ocp write and modify for 8168H start to reduce
>>>      spinlocks
>>>    r8169: Coalesce mac ocp commands for 8125 and 8125B start to reduce
>>>      spinlock contention
>>>    r8169: Coalesce mac ocp commands for rtl_hw_init_8125 to reduce
>>>      spinlocks
>>>
>>>   drivers/net/ethernet/realtek/r8169_main.c | 304 +++++++++++-----------
>>>   1 file changed, 150 insertions(+), 154 deletions(-)
>>>
> 
> Hi, Mr. Kallweit,
> 
> So good to hear so soon from you. I'm encouraged that you are positive about improving
> the speed and reducing the size of the Realtek drivers.
> 
>> You still write:
>> "a lock storm that will stall all of the cores and CPUs on the same memory controller"
>> even though you were informed that that's not the case.
> 
> I was not convinced. There is no such thing as a free lunch, and there is no locking
> without affecting other cores, or locking would not make sense.
> 
>> There's no actual problem, therefore your Fixes tags are incorrect.
> 
> Mea culpa - my mistake, I will fix that in the next version.
> 
>> Also net-next is closed at the moment.
> 
> There is no problem with that, as these are only optimisation fixes, not zero day
> exploits. I am a patient person.
> 
>> In patches 3-5 I see no benefit. And I have doubts whether the small benefit in
>> patch 2 is worth adding all the helpers in patch 1.
> 
> I merely followed and mimed driver style from the constructions like this one:
> 
>         static const struct ephy_info e_info_8168e_1[] = {
>                 { 0x00, 0x0200, 0x0100 },
>                 { 0x00, 0x0000, 0x0004 },
>                 { 0x06, 0x0002, 0x0001 },
>                 { 0x06, 0x0000, 0x0030 },
>                 { 0x07, 0x0000, 0x2000 },
>                 { 0x00, 0x0000, 0x0020 },
>                 { 0x03, 0x5800, 0x2000 },
>                 { 0x03, 0x0000, 0x0001 },
>                 { 0x01, 0x0800, 0x1000 },
>                 { 0x07, 0x0000, 0x4000 },
>                 { 0x1e, 0x0000, 0x2000 },
>                 { 0x19, 0xffff, 0xfe6c },
>                 { 0x0a, 0x0000, 0x0040 }
>         };
> 
>         rtl_set_def_aspm_entry_latency(tp);
> 
>         rtl_ephy_init(tp, e_info_8168e_1);
> 
> Here you did not think that introducing an array reduced code readability.
> 
> My ideal is a lockless driver using RCU, and you seem to prefer lock/unlock
> on each RTL_W32() write. I am convinced that a driver with less
> raw_spin_lock_irqsave()/raw_spin_unlock_irqrestore() pairs would scale better
> with more NICs and more cores.
> 
Then please focus on hot paths where it actually could make a difference,
and provide numbers instead of a purely theoretical discussion.
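
For reference, the two locking patterns under discussion can be
sketched in userspace as follows. This is a single-threaded counting
model, not driver code: the lock is replaced by a counter, and all
model_* names are hypothetical. It shows only how many lock round trips
each pattern takes, and says nothing about whether the difference is
measurable on real hardware.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_OCPAR_FLAG 0x80000000u /* stand-in for OCPAR_FLAG */

struct model_reg_write {
	uint16_t reg;
	uint16_t data;
};

struct model_nic {
	unsigned long lock_round_trips; /* lock/unlock pairs taken */
	unsigned long writes;           /* register writes issued */
	uint32_t last_value;            /* last value "written" to OCPDR */
};

/* Counting stand-ins for raw_spin_lock_irqsave()/..._irqrestore(). */
static void model_lock(struct model_nic *nic)
{
	nic->lock_round_trips++;
}

static void model_unlock(struct model_nic *nic)
{
	(void)nic;
}

/* Caller must hold the lock. */
static void model_write_unlocked(struct model_nic *nic, uint32_t reg,
				 uint32_t data)
{
	nic->last_value = MODEL_OCPAR_FLAG | (reg << 15) | data;
	nic->writes++;
}

/* Pattern 1: one lock round trip per register write. */
static void model_write(struct model_nic *nic, uint32_t reg, uint32_t data)
{
	model_lock(nic);
	model_write_unlocked(nic, reg, data);
	model_unlock(nic);
}

/* Pattern 2: one lock round trip for the whole sequence. */
static void model_write_seq(struct model_nic *nic,
			    const struct model_reg_write *seq, size_t n)
{
	size_t i;

	model_lock(nic);
	for (i = 0; i < n; i++)
		model_write_unlocked(nic, seq[i].reg, seq[i].data);
	model_unlock(nic);
}
```

For a sequence of n writes the first pattern takes n lock round trips
and the second takes one; both issue the same n register writes in the
same order.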

> You said nothing to convince me otherwise.
> 
> But I am merely defending my point, this by no means implies disrespect or overlooking
> your contribution to the source as a coder and a maintainer.
> 
> Realtek NICs are known as cheap NIC for motherboards, but they are becoming more ubiquitous,
> and it is logical to use less locking, as locking is expensive. "barrier" in writel()
> guarantees sequential orders of write, and locking and unlocking on each read/modify/write
> is unnecessary overhead, IMHO.
> 
> As the conclusion, I would like to emphasise that improving lock contention for the code
> is by no means a personal attack on the maintainer or a breach of the Code of Conduct.
> 
> If you are so much against the changes which Mr. Jacob Keller from Intel reviewed,
> maybe we can cool emotions and start thinking rationally.
> 
> Additionally, I would like to "inline" many functions, as I think that call/return
> sequences with stack frame generation /destruction are more expensive than inlining the
> small one liners.
> 

Mainline standard is to let the compiler decide on inlining.

> But I will certainly respect your opinion on the matter as a maintainer.
> 
> What I realise that I might be optimising the cold paths of the code, but from your emails
> it seems like nothing is worth optimising in this driver, and with all due respect Sir,
> I think that is dead wrong.
> 

Nobody ever said that, and if you look at the history of the driver you'll see a lot of
optimizations that have been added over time. Ideally an optimization improves both
performance and code readability.
Code readability is important for maintainability and weighs higher for me than a minor
performance optimization in a code path that is very rarely used.

> Of course, I am tempted to comply to the authority as a kernel newbie, but I was reminded
> in the spirit that this is exactly what the guys in Chernobyl did while maintaining the
> reactor that malfunctioned: they did not dare to question the authority telling them that
> everything is alright.
> 
> Have a nice evening, and please do not take these words as a breach of the Code or a
> personal attack. I believe we are on the same side, and that is making this driver better.
> 
> The Linux kernel developer community was my last hope that this human race has a force
> to improve the mankind and make it worth surviving.
> 
> But sometimes it is more honourable to go down with the ship and preserve the honour.
> 
> Best regards,
> Mirsad Todorovac


^ permalink raw reply

* [PATCH net v2] net: ti: icssg-prueth: Add missing icss_iep_put to error path
From: Jan Kiszka @ 2023-11-05  9:51 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	MD Danish Anwar
  Cc: netdev, linux-kernel, Lopes Ivo, Diogo Miguel (T CED IFD-PT),
	Nishanth Menon, Su, Bao Cheng (RC-CN DF FA R&D)

From: Jan Kiszka <jan.kiszka@siemens.com>

Analogously to prueth_remove.

Fixes: 186734c15886 ("net: ti: icssg-prueth: add packet timestamping and ptp support")
Fixes: 443a2367ba3c ("net: ti: icssg-prueth: am65x SR2.0 add 10M full duplex support")
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---

Changes in v2:
 - add proper tags

This was lost from the TI SDK version while ripping out SR1.0 support - 
which we are currently restoring for upstream.

 drivers/net/ethernet/ti/icssg/icssg_prueth.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/ti/icssg/icssg_prueth.c b/drivers/net/ethernet/ti/icssg/icssg_prueth.c
index 6c4b64227ac8..d119b2bb8158 100644
--- a/drivers/net/ethernet/ti/icssg/icssg_prueth.c
+++ b/drivers/net/ethernet/ti/icssg/icssg_prueth.c
@@ -2206,6 +2206,9 @@ static int prueth_probe(struct platform_device *pdev)
 	if (prueth->pdata.quirk_10m_link_issue)
 		icss_iep_exit_fw(prueth->iep1);
 
+	icss_iep_put(prueth->iep1);
+	icss_iep_put(prueth->iep0);
+
 free_pool:
 	gen_pool_free(prueth->sram_pool,
 		      (unsigned long)prueth->msmcram.va, msmc_ram_size);
-- 
2.35.3

^ permalink raw reply related

* [PATCH net v2] net: ti: icssg-prueth: Fix error cleanup on failing pruss_request_mem_region
From: Jan Kiszka @ 2023-11-05  9:51 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	MD Danish Anwar
  Cc: netdev, linux-kernel, Lopes Ivo, Diogo Miguel (T CED IFD-PT),
	Nishanth Menon, Su, Bao Cheng (RC-CN DF FA R&D)

From: Jan Kiszka <jan.kiszka@siemens.com>

We were just continuing in this case, surely not desired.

Fixes: 128d5874c082 ("net: ti: icssg-prueth: Add ICSSG ethernet driver")
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---

Changes in v2:
 - add proper tags

 drivers/net/ethernet/ti/icssg/icssg_prueth.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ti/icssg/icssg_prueth.c b/drivers/net/ethernet/ti/icssg/icssg_prueth.c
index d119b2bb8158..845e8a782d3a 100644
--- a/drivers/net/ethernet/ti/icssg/icssg_prueth.c
+++ b/drivers/net/ethernet/ti/icssg/icssg_prueth.c
@@ -2063,7 +2063,7 @@ static int prueth_probe(struct platform_device *pdev)
 				       &prueth->shram);
 	if (ret) {
 		dev_err(dev, "unable to get PRUSS SHRD RAM2: %d\n", ret);
-		pruss_put(prueth->pruss);
+		goto put_pruss;
 	}
 
 	prueth->sram_pool = of_gen_pool_get(np, "sram", 0);
@@ -2215,6 +2215,8 @@ static int prueth_probe(struct platform_device *pdev)
 
 put_mem:
 	pruss_release_mem_region(prueth->pruss, &prueth->shram);
+
+put_pruss:
 	pruss_put(prueth->pruss);
 
 put_cores:
-- 
2.35.3
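
The control-flow change in this patch can be illustrated with a small
userspace model. This is a sketch under assumptions: the counters stand
in for the real pruss_get()/pruss_put() and memory-region calls, and
model_probe() is a hypothetical reduction of prueth_probe() to three
steps.

```c
#include <stdbool.h>

struct model_trace {
	int pruss_get;     /* handle acquired */
	int pruss_put;     /* handle released */
	int mem_requested; /* memory region successfully requested */
	int mem_released;  /* memory region released */
	int later_steps;   /* probe steps that run after the request */
};

/* Probe model with the fixed error path: a failing memory-region
 * request jumps to put_pruss instead of releasing the handle inline
 * and then carrying on with the rest of the probe.
 */
static int model_probe(struct model_trace *t, bool mem_fails,
		       bool later_fails)
{
	int ret;

	t->pruss_get++;

	if (mem_fails) {
		ret = -1;
		goto put_pruss;
	}
	t->mem_requested++;

	t->later_steps++;
	if (later_fails) {
		ret = -2;
		goto put_mem;
	}

	return 0;

put_mem:
	t->mem_released++;
put_pruss:
	t->pruss_put++;
	return ret;
}
```

With the labels stacked in reverse order of acquisition, each failure
point releases exactly what was acquired before it, and no later step
runs after a failed request.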

^ permalink raw reply related

* Re: [PATCH bpf-next v5 06/13] xsk: Document tx_metadata_len layout
From: Simon Horman @ 2023-11-05 12:45 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
	yoong.siang.song, netdev, xdp-hints
In-Reply-To: <20231102225837.1141915-7-sdf@google.com>

On Thu, Nov 02, 2023 at 03:58:30PM -0700, Stanislav Fomichev wrote:
> - how to use
> - how to query features
> - pointers to the examples
> 
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

...

> diff --git a/Documentation/networking/xsk-tx-metadata.rst b/Documentation/networking/xsk-tx-metadata.rst
> new file mode 100644
> index 000000000000..4f376560b23f
> --- /dev/null
> +++ b/Documentation/networking/xsk-tx-metadata.rst
> @@ -0,0 +1,70 @@

Hi Stan,

a minor nit from my side: an SPDX licence identifier tag should probably go
here.

> +==================
> +AF_XDP TX Metadata
> +==================
> +
> +This document describes how to enable offloads when transmitting packets
> +via :doc:`af_xdp`. Refer to :doc:`xdp-rx-metadata` on how to access similar
> +metadata on the receive side.

...

^ permalink raw reply

* Re: [PATCH net-next V6] ptp: fix corrupted list in ptp_open
From: Richard Cochran @ 2023-11-05 13:36 UTC (permalink / raw)
  To: Edward Adam Davis
  Cc: davem, habetsm.xilinx, jeremy, linux-kernel, netdev, reibax,
	syzbot+df3f3ef31f60781fa911
In-Reply-To: <tencent_856E1C97CCE9E2ED66CC087B526CD42ED50A@qq.com>

Edward!

On Sun, Nov 05, 2023 at 10:12:08AM +0800, Edward Adam Davis wrote:
> There is no lock protection when writing ptp->tsevqs in ptp_open() and
> ptp_release(), which can cause data corruption, use spin lock to avoid this
> issue.
> 
> Moreover, ptp_release() should not be used to release the queue in ptp_read(),
> and it should be deleted together.

Change to:  "it should be deleted altogether"
 
> Reported-and-tested-by: syzbot+df3f3ef31f60781fa911@syzkaller.appspotmail.com
> Fixes: 8f5de6fb2453 ("ptp: support multiple timestamp event readers")
> Signed-off-by: Edward Adam Davis <eadavis@qq.com>
> ---
>  drivers/ptp/ptp_chardev.c | 11 +++++++----
>  drivers/ptp/ptp_clock.c   |  1 +
>  drivers/ptp/ptp_private.h |  1 +
>  3 files changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/ptp/ptp_chardev.c b/drivers/ptp/ptp_chardev.c
> index 282cd7d24077..31594f40a21e 100644
> --- a/drivers/ptp/ptp_chardev.c
> +++ b/drivers/ptp/ptp_chardev.c
> @@ -108,6 +108,7 @@ int ptp_open(struct posix_clock_context *pccontext, fmode_t fmode)
>  		container_of(pccontext->clk, struct ptp_clock, clock);
>  	struct timestamp_event_queue *queue;
>  	char debugfsname[32];
> +	unsigned long flags;
>  
>  	queue = kzalloc(sizeof(*queue), GFP_KERNEL);
>  	if (!queue)
> @@ -119,8 +120,10 @@ int ptp_open(struct posix_clock_context *pccontext, fmode_t fmode)
>  	}
>  	bitmap_set(queue->mask, 0, PTP_MAX_CHANNELS);
>  	spin_lock_init(&queue->lock);
> +	spin_lock_irqsave(&ptp->tsevqs_lock, flags);
>  	list_add_tail(&queue->qlist, &ptp->tsevqs);
>  	pccontext->private_clkdata = queue;

Move this assignment outside of locked region, i.e. after spin_unlock_irqrestore().

> +	spin_unlock_irqrestore(&ptp->tsevqs_lock, flags);
>  
>  	/* Debugfs contents */
>  	sprintf(debugfsname, "0x%p", queue);
> @@ -139,13 +142,15 @@ int ptp_release(struct posix_clock_context *pccontext)
>  {
>  	struct timestamp_event_queue *queue = pccontext->private_clkdata;
>  	unsigned long flags;
> +	struct ptp_clock *ptp =
> +		container_of(pccontext->clk, struct ptp_clock, clock);
>  
>  	if (queue) {

Please remove this test.  Since you removed ptp_release() from
ptp_read(), the queue cannot be NULL.

>  		debugfs_remove(queue->debugfs_instance);
> +		spin_lock_irqsave(&ptp->tsevqs_lock, flags);
>  		pccontext->private_clkdata = NULL;

Move this assignment outside of locked region.

> -		spin_lock_irqsave(&queue->lock, flags);
>  		list_del(&queue->qlist);
> -		spin_unlock_irqrestore(&queue->lock, flags);
> +		spin_unlock_irqrestore(&ptp->tsevqs_lock, flags);
>  		bitmap_free(queue->mask);
>  		kfree(queue);
>  	}
> @@ -585,7 +590,5 @@ ssize_t ptp_read(struct posix_clock_context *pccontext, uint rdflags,
>  free_event:
>  	kfree(event);
>  exit:
> -	if (result < 0)
> -		ptp_release(pccontext);

This is good, but please put it into a separate patch, along with the
removal of the bogus "if (queue)" test in ptp_release().

>  	return result;
>  }
> diff --git a/drivers/ptp/ptp_clock.c b/drivers/ptp/ptp_clock.c
> index 3d1b0a97301c..ea82648ad557 100644
> --- a/drivers/ptp/ptp_clock.c
> +++ b/drivers/ptp/ptp_clock.c
> @@ -247,6 +247,7 @@ struct ptp_clock *ptp_clock_register(struct ptp_clock_info *info,
>  	if (!queue)
>  		goto no_memory_queue;
>  	list_add_tail(&queue->qlist, &ptp->tsevqs);
> +	spin_lock_init(&ptp->tsevqs_lock);
>  	queue->mask = bitmap_alloc(PTP_MAX_CHANNELS, GFP_KERNEL);
>  	if (!queue->mask)
>  		goto no_memory_bitmap;
> diff --git a/drivers/ptp/ptp_private.h b/drivers/ptp/ptp_private.h
> index 52f87e394aa6..63af246f17eb 100644
> --- a/drivers/ptp/ptp_private.h
> +++ b/drivers/ptp/ptp_private.h
> @@ -44,6 +44,7 @@ struct ptp_clock {
>  	struct pps_device *pps_source;
>  	long dialed_frequency; /* remembers the frequency adjustment */
>  	struct list_head tsevqs; /* timestamp fifo list */
> +	spinlock_t tsevqs_lock; /* one process at a time writing the timestamp fifo list*/

Please change this comment to "protects tsevqs from concurrent access"

>  	struct mutex pincfg_mux; /* protect concurrent info->pin_config access */
>  	wait_queue_head_t tsev_wq;
>  	int defunct; /* tells readers to go away when clock is being removed */
> -- 
> 2.25.1
> 

Since v6.6 now contains the original commit, please change subject to [net].

Thanks,
Richard


* Re: [PATCH net-next V6] ptp: fix corrupted list in ptp_open
From: Richard Cochran @ 2023-11-05 14:02 UTC (permalink / raw)
  To: Edward Adam Davis
  Cc: davem, habetsm.xilinx, jeremy, linux-kernel, netdev, reibax,
	syzbot+df3f3ef31f60781fa911
In-Reply-To: <ZUeaaVHlYCaq2NwG@hoboy.vegasvil.org>

On Sun, Nov 05, 2023 at 05:36:41AM -0800, Richard Cochran wrote:

> > @@ -44,6 +44,7 @@ struct ptp_clock {
> >  	struct pps_device *pps_source;
> >  	long dialed_frequency; /* remembers the frequency adjustment */
> >  	struct list_head tsevqs; /* timestamp fifo list */
> > +	spinlock_t tsevqs_lock; /* one process at a time writing the timestamp fifo list*/
> 
> Please change this comment to "protects tsevqs from concurrent access"

And please don't forget to take the spin lock around
list_for_each_entry() in ptp_clock_event().
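
As an illustration only, here is a small userspace model of walking the queue
list under the same lock that guards insertion. The assumptions: a plain singly
linked list replaces list_for_each_entry(), pthread_mutex_t replaces the
spinlock, and clock_model, queue_model, add_queue_model() and
clock_event_model() are made-up names. In the driver this would be
spin_lock_irqsave()/spin_unlock_irqrestore() on ptp->tsevqs_lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical per-reader queue; "events" counts delivered timestamps. */
struct queue_model {
	struct queue_model *next;
	int events;
};

/* Hypothetical stand-in for struct ptp_clock. */
struct clock_model {
	struct queue_model *tsevqs;	/* list of reader queues */
	pthread_mutex_t tsevqs_lock;	/* protects tsevqs from concurrent access */
};

/* Writers change the list only while holding tsevqs_lock. */
static void add_queue_model(struct clock_model *clk, struct queue_model *q)
{
	assert(q != NULL);
	pthread_mutex_lock(&clk->tsevqs_lock);
	q->next = clk->tsevqs;
	clk->tsevqs = q;
	pthread_mutex_unlock(&clk->tsevqs_lock);
}

/*
 * The event path takes the same lock for the whole walk, so no queue can
 * be unlinked and freed while the loop still holds a pointer to it.
 * Returns the number of queues that received the event.
 */
static int clock_event_model(struct clock_model *clk)
{
	struct queue_model *q;
	int delivered = 0;

	pthread_mutex_lock(&clk->tsevqs_lock);
	for (q = clk->tsevqs; q; q = q->next) {
		q->events++;
		delivered++;
	}
	pthread_mutex_unlock(&clk->tsevqs_lock);

	return delivered;
}
```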

Thanks,
Richard


* Re: [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to reduce spinlock contention
From: Mirsad Todorovac @ 2023-11-05 15:00 UTC (permalink / raw)
  To: Heiner Kallweit, linux-kernel
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, nic_swsd
In-Reply-To: <716b3dc3-231e-4fd2-b892-707c0d636d00@gmail.com>

On 11/5/23 10:20, Heiner Kallweit wrote:
> On 05.11.2023 01:15, Mirsad Todorovac wrote:
>>
>>
>> On 11/4/23 23:37, Heiner Kallweit wrote:
>>> On 04.11.2023 23:15, Mirsad Goran Todorovac wrote:
>>>> The motivation for these helpers was the locking overhead of 130 consecutive
>>>> r8168_mac_ocp_write() calls in the RTL8411b reset after the NIC gets confused
>>>> if the PHY is powered-down.
>>>>
>>>> To quote Heiner:
>>>>
>>>>       On RTL8411b the RX unit gets confused if the PHY is powered-down.
>>>>       This was reported in [0] and confirmed by Realtek. Realtek provided
>>>>       a sequence to fix the RX unit after PHY wakeup.
>>>>
>>>> A series of about 130 r8168_mac_ocp_write() calls is performed to program the
>>>> RTL registers for recovery, each doing an expensive spin_lock_irqsave() and
>>>> spin_unlock_irqrestore().
>>>>
>>>> Each mac ocp write is made of:
>>>>
>>>>       static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
>>>>                         u32 data)
>>>>       {
>>>>           if (rtl_ocp_reg_failure(reg))
>>>>               return;
>>>>
>>>>           RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
>>>>       }
>>>>
>>>>       static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
>>>>                       u32 data)
>>>>       {
>>>>           unsigned long flags;
>>>>
>>>>           raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
>>>>           __r8168_mac_ocp_write(tp, reg, data);
>>>>           raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
>>>>       }
>>>>
>>>> Register programming is done through RTL_W32() macro which expands into
>>>>
>>>>       #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg))
>>>>
>>>> which is further (on Alpha):
>>>>
>>>>       extern inline void writel(u32 b, volatile void __iomem *addr)
>>>>       {
>>>>           mb();
>>>>           __raw_writel(b, addr);
>>>>       }
>>>>
>>>> or on i386/x86_64:
>>>>
>>>>       #define build_mmio_write(name, size, type, reg, barrier) \
>>>>       static inline void name(type val, volatile void __iomem *addr) \
>>>>       { asm volatile("mov" size " %0,%1": :reg (val), \
>>>>       "m" (*(volatile type __force *)addr) barrier); }
>>>>
>>>>       build_mmio_write(writel, "l", unsigned int, "r", :"memory")
>>>>
>>>> This obviously involves at least a compiler barrier.
>>>>
>>>> mb() expands into something like this i.e. on x86_64:
>>>>
>>>>       #define mb()    asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
>>>>
>>>> This means a whole lot of memory bus stalls: for spin_lock_irqsave(),
>>>> memory barrier, writel(), and spin_unlock_irqrestore().
>>>>
>>>> With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
>>>> a lock storm that will stall all of the cores and CPUs on the same memory controller
>>>> for certain time I/O takes to finish.
>>>>
>>>> In a sequential case of RTL register programming, the writes to RTL registers
>>>> can be coalesced under a same raw spinlock. This can dramatically decrease the
>>>> number of bus stalls in a multicore or multi-CPU system.
>>>>
>>>> Macro helpers r8168_mac_ocp_write_seq() and r8168_mac_ocp_modify_seq() are
>>>> provided to reduce lock contention:
>>>>
>>>>       static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
>>>>       {
>>>>
>>>>           ...
>>>>
>>>>           /* The following Realtek-provided magic fixes an issue with the RX unit
>>>>            * getting confused after the PHY having been powered-down.
>>>>            */
>>>>
>>>>           static const struct recover_8411b_info init_zero_seq[] = {
>>>>               { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 },
>>>>               ...
>>>>           };
>>>>
>>>>           ...
>>>>
>>>>           r8168_mac_ocp_write_seq(tp, init_zero_seq);
>>>>
>>>>           ...
>>>>
>>>>       }
>>>>
>>>> The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/
>>>> functions that only changed the function names and the ending of the line, so the actual
>>>> hex data is unchanged.
>>>>
>>>> To repeat, the reason for the introduction of the original commit
>>>> was to enable recovery of the RX unit on the RTL8411b which was confused by the
>>>> powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem
>>>> into a series of about 500+ memory bus locks, most waiting for the main memory read,
>>>> modify and write under a LOCK. The memory barrier in RTL_W32 should suffice for
>>>> the programming sequence to reach RTL NIC registers.
>>>>
>>>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075
>>>>
>>>> v6:
>>>>    proceeded according to Jacob Keller's suggestions by creating a cover page and reducing
>>>>    the text within the commits. Applying to the net-next tree as Heiner Kallweit requested.
>>>>
>>>> v5:
>>>>    attempted some new optimisations, which were rejected, but not all and not completely.
>>>>
>>>> v4:
>>>>    fixed complaints as advised by Heiner and checkpatch.pl.
>>>>    split the patch into five sections to be more easily manipulated and reviewed
>>>>    introduced r8168_mac_ocp_write_seq()
>>>>    applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B
>>>>
>>>> v3:
>>>>    removed register/mask pair array sentinels, so using ARRAY_SIZE().
>>>>    avoided duplication of RTL_W32() call code as advised by Heiner.
>>>>
>>>> Mirsad Goran Todorovac (5):
>>>>     r8169: Coalesce r8169_mac_ocp_write/modify calls to reduce spinlock
>>>>       stalls
>>>>     r8169: Coalesce RTL8411b PHY power-down recovery calls to reduce
>>>>       spinlock stalls
>>>>     r8169: Coalesce mac ocp write and modify for 8168H start to reduce
>>>>       spinlocks
>>>>     r8169: Coalesce mac ocp commands for 8125 and 8125B start to reduce
>>>>       spinlock contention
>>>>     r8169: Coalesce mac ocp commands for rtl_hw_init_8125 to reduce
>>>>       spinlocks
>>>>
>>>>    drivers/net/ethernet/realtek/r8169_main.c | 304 +++++++++++-----------
>>>>    1 file changed, 150 insertions(+), 154 deletions(-)
>>>>
>>
>> Hi, Mr. Kallweit,
>>
>> So good to hear so soon from you. I'm encouraged that you are positive about improving
>> the speed and reducing the size of the Realtek drivers.
>>
>>> You still write:
>>> "a lock storm that will stall all of the cores and CPUs on the same memory controller"
>>> even though you were informed that that's not the case.
>>
>> I was not convinced. There is no such thing as a free lunch, and there is no locking
>> without affecting other cores, or locking would not make sense.
>>
>>> There's no actual problem, therefore your Fixes tags are incorrect.
>>
>> Mea culpa - my mistake, I will fix that in the next version.
>>
>>> Also net-next is closed at the moment.
>>
>> There is no problem with that, as these are only optimisation fixes, not zero day
>> exploits. I am a patient person.
>>
>>> In patches 3-5 I see no benefit. And I have doubts whether the small benefit in
>>> patch 2 is worth adding all the helpers in patch 1.
>>
>> I merely followed and mimicked the driver style of constructions like this one:
>>
>>          static const struct ephy_info e_info_8168e_1[] = {
>>                  { 0x00, 0x0200, 0x0100 },
>>                  { 0x00, 0x0000, 0x0004 },
>>                  { 0x06, 0x0002, 0x0001 },
>>                  { 0x06, 0x0000, 0x0030 },
>>                  { 0x07, 0x0000, 0x2000 },
>>                  { 0x00, 0x0000, 0x0020 },
>>                  { 0x03, 0x5800, 0x2000 },
>>                  { 0x03, 0x0000, 0x0001 },
>>                  { 0x01, 0x0800, 0x1000 },
>>                  { 0x07, 0x0000, 0x4000 },
>>                  { 0x1e, 0x0000, 0x2000 },
>>                  { 0x19, 0xffff, 0xfe6c },
>>                  { 0x0a, 0x0000, 0x0040 }
>>          };
>>
>>          rtl_set_def_aspm_entry_latency(tp);
>>
>>          rtl_ephy_init(tp, e_info_8168e_1);
>>
>> Here you did not think that introducing an array reduced code readability.
>>
>> My ideal is a lockless driver using RCU, and you seem to prefer lock/unlock
>> on each RTL_W32() write. I am convinced that a driver with less
>> raw_spin_lock_irqsave()/raw_spin_unlock_irqrestore() pairs would scale better
>> with more NICs and more cores.
>>
> Then please focus on hot paths where it actually could make a difference,
> and provide numbers instead of a purely theoretical discussion.

I will comply.

An RTL8411b losing its PHY and needing this expensive reset probably doesn't
happen on Linux servers anyway. :-/

I have done my homework and I see that you are also co-maintainer of the net PHYLIB,
so your insight on this matter is undoubtedly greater after five years of experience
in maintaining the driver.

Learning about the network stack and the PHY layer is, however, a formidable though
very interesting task. The whole area of making multimedia more responsive on Linux
and Windows graphic interface is very challenging, and I could pass it with my day
job as research.

But as I said, I have to catch up with a lot of homework.

>> You said nothing to convince me otherwise.
>>
>> But I am merely defending my point, this by no means implies disrespect or overlooking
>> your contribution to the source as a coder and a maintainer.
>>
>> Realtek NICs are known as cheap NICs for motherboards, but they are becoming more ubiquitous,
>> and it is logical to use less locking, as locking is expensive. The "barrier" in writel()
>> guarantees sequential ordering of writes, and locking and unlocking on each read/modify/write
>> is unnecessary overhead, IMHO.
>>
>> As the conclusion, I would like to emphasise that improving lock contention for the code
>> is by no means a personal attack on the maintainer or a breach of the Code of Conduct.
>>
>> If you are so much against the changes which Mr. Jacob Keller from Intel reviewed,
>> maybe we can cool emotions and start thinking rationally.
>>
>> Additionally, I would like to "inline" many functions, as I think that call/return
>> sequences with stack frame generation /destruction are more expensive than inlining the
>> small one liners.

> Mainline standard is to let the compiler decide on inlining.
  
>> But I will certainly respect your opinion on the matter as a maintainer.
>>
>> I realise that I might be optimising the cold paths of the code, but from your emails
>> it seems like nothing is worth optimising in this driver, and with all due respect Sir,
>> I think that is dead wrong.
> 
> Nobody ever said that, and if you look at the history of the driver you'll see a lot of
> optimizations that have been added over time. Ideally an optimization improves both:
> performance and code readability
> Code readability is important for maintainability and weighs higher for me than a minor
> performance optimization in a code path that is very rarely used.

I see.

However, you do use lookup tables for programming with fn rtl_ephy_init().

If this would be more readable, I can unroll the table so it is one entry per line
like e_info_8168e_1 was made?

Then the actual function call adds nothing to readability and the ease of maintenance,
as the principle { address, value } and { address, mask, value } would be preserved.
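
To make the { address, value } principle concrete, here is a minimal userspace
sketch of what I mean by coalescing, not the patch itself. Assumptions: a plain
array stands in for the NIC registers, pthread_mutex_t stands in for
tp->mac_ocp_lock, and nic_model, reg_val and the *_model() functions are names
invented for the example. The lock_count field exists only to show how often
the lock is taken.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical model of the NIC: fake 16-bit registers plus one lock. */
struct nic_model {
	pthread_mutex_t mac_ocp_lock;
	unsigned int lock_count;	/* number of lock acquisitions */
	uint16_t regs[0x8000];		/* indexed by register address / 2 */
};

/* One { address, value } table entry. */
struct reg_val {
	uint16_t reg;
	uint16_t val;
};

/* Unlocked write; the caller must hold mac_ocp_lock. */
static void mac_ocp_write_unlocked(struct nic_model *tp, uint16_t reg,
				   uint16_t val)
{
	assert((reg & 1) == 0);	/* model assumption: even addresses only */
	tp->regs[reg / 2] = val;
}

/* Per-write locking: one lock/unlock pair for every register. */
static void mac_ocp_write_model(struct nic_model *tp, uint16_t reg,
				uint16_t val)
{
	pthread_mutex_lock(&tp->mac_ocp_lock);
	tp->lock_count++;
	mac_ocp_write_unlocked(tp, reg, val);
	pthread_mutex_unlock(&tp->mac_ocp_lock);
}

/* Coalesced variant: one lock/unlock pair for the whole table. */
static void mac_ocp_write_seq_model(struct nic_model *tp,
				    const struct reg_val *seq, size_t n)
{
	size_t i;

	pthread_mutex_lock(&tp->mac_ocp_lock);
	tp->lock_count++;
	for (i = 0; i < n; i++)
		mac_ocp_write_unlocked(tp, seq[i].reg, seq[i].val);
	pthread_mutex_unlock(&tp->mac_ocp_lock);
}
```

A caller would pass a static const table together with ARRAY_SIZE(table), in the
same way rtl_ephy_init() is handed its table. Whether the saved lock round trips
matter on real hardware is exactly what the numbers you asked for would have to
show.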

In fact, some programming books advise separating data from code for the sake of
readability and efficiency.

I admit that the 8125 optimisation I proposed is minimal locking but hard to read and
maintain. (It defies the KISS principle.)

Thank you for your time and patience with me.

I always prefer that things are explained through Spock logic rather than by an appeal to
authority (which is BTW the argumentum ad verecundiam logical fallacy).

(This is of course not the case with the sacred texts, where I use quotes from the relevant
authorities.)

Have a nice day and I wish you a blessed Sunday.

I should probably catch up with the documentation before using any more of your valuable time
and energy. I hope to reciprocate.

Best regards,
Mirsad Todorovac
  
>> Of course, I am tempted to comply to the authority as a kernel newbie, but I was reminded
>> in the spirit that this is exactly what the guys in Chernobyl did while maintaining the
>> reactor that malfunctioned: they did not dare to question the authority telling them that
>> everything is alright.
>>
>> Have a nice evening, and please do not take these words as a breach of the Code or a
>> personal attack. I believe we are on the same side, and that is making this driver better.
>>
>> The Linux kernel developer community was my last hope that this human race has a force
>> to improve the mankind and make it worth surviving.
>>
>> But sometimes it is more honourable to go down with the ship and preserve the honour.
>>
>> Best regards,
>> Mirsad Todorovac


* Re: [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to reduce spinlock contention
From: Andrew Lunn @ 2023-11-05 15:33 UTC (permalink / raw)
  To: Mirsad Todorovac
  Cc: Heiner Kallweit, linux-kernel, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, netdev, nic_swsd
In-Reply-To: <edee64f4-442d-4670-a91b-e5b83117dd40@alu.unizg.hr>

> > > With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
> > > a lock storm that will stall all of the cores and CPUs on the same memory controller
> > > for certain time I/O takes to finish.

Please provide benchmark data to show this is a real issue, and the
patch fixes it.

> Additionally, I would like to "inline" many functions, as I think that call/return
> sequences with stack frame generation /destruction are more expensive than inlining the
> small one liners.

Please provide benchmarks to show the compiler is getting this wrong,
and inline really is needed.

Until there are benchmarks: NACK.

    Andrew
