* [PATCH net-next 0/2] net: dsa: mv88e6xxx: Improve indirect addressing performance
@ 2022-01-26 23:12 Tobias Waldekranz
  2022-01-26 23:12 ` [PATCH net-next 1/2] net: dsa: mv88e6xxx: Improve performance of busy bit polling Tobias Waldekranz
  2022-01-26 23:12 ` [PATCH net-next 2/2] net: dsa: mv88e6xxx: Improve indirect addressing performance Tobias Waldekranz
  0 siblings, 2 replies; 8+ messages in thread
From: Tobias Waldekranz @ 2022-01-26 23:12 UTC
To: davem, kuba; +Cc: netdev

The individual patches have all the details. This work was triggered
by recent work on a platform that took 16s (sic) to load the mv88e6xxx
module.

The first patch gets rid of most of that time by replacing a very long
delay with a tighter poll loop to wait for the busy bit to clear.

The second patch shaves off some more time by avoiding redundant
busy-bit-checks, saving 1 out of 4 MDIO operations for every register
read/write in the optimal case.

Tobias Waldekranz (2):
  net: dsa: mv88e6xxx: Improve performance of busy bit polling
  net: dsa: mv88e6xxx: Improve indirect addressing performance

 drivers/net/dsa/mv88e6xxx/chip.c |  8 ++++----
 drivers/net/dsa/mv88e6xxx/chip.h |  1 +
 drivers/net/dsa/mv88e6xxx/smi.c  | 32 ++++++++++++++++++--------------
 3 files changed, 23 insertions(+), 18 deletions(-)

-- 
2.25.1

* [PATCH net-next 1/2] net: dsa: mv88e6xxx: Improve performance of busy bit polling
  2022-01-26 23:12 [PATCH net-next 0/2] net: dsa: mv88e6xxx: Improve indirect addressing performance Tobias Waldekranz
@ 2022-01-26 23:12 ` Tobias Waldekranz
  2022-01-26 23:45   ` Andrew Lunn
  2022-01-26 23:54   ` Andrew Lunn
  2022-01-26 23:12 ` [PATCH net-next 2/2] net: dsa: mv88e6xxx: Improve indirect addressing performance Tobias Waldekranz
  1 sibling, 2 replies; 8+ messages in thread
From: Tobias Waldekranz @ 2022-01-26 23:12 UTC
To: davem, kuba
Cc: netdev, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Vladimir Oltean, linux-kernel

Avoid a long delay when a busy bit is still set and has to be polled
again.

Measurements on a system with 2 Opals (6097F) and one Agate (6352)
show that even with this much tighter loop, we have about a 50% chance
of the bit being cleared on the first poll, all other accesses see the
bit being cleared on the second poll.

On a standard MDIO bus running MDC at 2.5MHz, a single access with 32
bits of preamble plus 32 bits of data takes 64*(1/2.5MHz) = 25.6us.

This means that mv88e6xxx_smi_direct_wait took 26us + CPU overhead in
the fast scenario, but 26us + 1500us + 26us + CPU overhead in the slow
case - bringing the average close to 1ms.

With this change in place, the slow case is closer to 2*26us + CPU
overhead, with the average well below 100us - a 10x improvement.

This translates to real-world winnings. On a 3-chip 20-port system,
the modprobe time drops by 88%:

Before:

root@coronet:~# time modprobe mv88e6xxx
real    0m 15.99s
user    0m 0.00s
sys     0m 1.52s

After:

root@coronet:~# time modprobe mv88e6xxx
real    0m 2.21s
user    0m 0.00s
sys     0m 1.54s

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 8 ++++----
 drivers/net/dsa/mv88e6xxx/smi.c  | 8 ++++----
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 58ca684d73f7..3566617143cf 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -86,12 +86,12 @@ int mv88e6xxx_write(struct mv88e6xxx_chip *chip, int addr, int reg, u16 val)
 int mv88e6xxx_wait_mask(struct mv88e6xxx_chip *chip, int addr, int reg,
 			u16 mask, u16 val)
 {
+	const unsigned long timeout = jiffies + msecs_to_jiffies(50);
 	u16 data;
 	int err;
-	int i;
 
 	/* There's no bus specific operation to wait for a mask */
-	for (i = 0; i < 16; i++) {
+	do {
 		err = mv88e6xxx_read(chip, addr, reg, &data);
 		if (err)
 			return err;
@@ -99,8 +99,8 @@ int mv88e6xxx_wait_mask(struct mv88e6xxx_chip *chip, int addr, int reg,
 		if ((data & mask) == val)
 			return 0;
 
-		usleep_range(1000, 2000);
-	}
+		cpu_relax();
+	} while (time_before(jiffies, timeout));
 
 	dev_err(chip->dev, "Timeout while waiting for switch\n");
 	return -ETIMEDOUT;
diff --git a/drivers/net/dsa/mv88e6xxx/smi.c b/drivers/net/dsa/mv88e6xxx/smi.c
index 282fe08db050..a59f32243e08 100644
--- a/drivers/net/dsa/mv88e6xxx/smi.c
+++ b/drivers/net/dsa/mv88e6xxx/smi.c
@@ -55,11 +55,11 @@ static int mv88e6xxx_smi_direct_write(struct mv88e6xxx_chip *chip,
 static int mv88e6xxx_smi_direct_wait(struct mv88e6xxx_chip *chip,
 				     int dev, int reg, int bit, int val)
 {
+	const unsigned long timeout = jiffies + msecs_to_jiffies(50);
 	u16 data;
 	int err;
-	int i;
 
-	for (i = 0; i < 16; i++) {
+	do {
 		err = mv88e6xxx_smi_direct_read(chip, dev, reg, &data);
 		if (err)
 			return err;
@@ -67,8 +67,8 @@ static int mv88e6xxx_smi_direct_wait(struct mv88e6xxx_chip *chip,
 		if (!!(data & BIT(bit)) == !!val)
 			return 0;
 
-		usleep_range(1000, 2000);
-	}
+		cpu_relax();
+	} while (time_before(jiffies, timeout));
 
 	return -ETIMEDOUT;
 }
-- 
2.25.1

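The timing argument in the commit message above can be checked with a few
lines of C. This is only a sketch: the helper names mdio_access_us and
avg_wait_us are hypothetical, the 1500us gap is assumed to be the midpoint of
the old usleep_range(1000, 2000), the 50% first-poll hit rate is the measured
figure quoted in the commit message, and CPU overhead is ignored.

```c
#include <assert.h>

/* Duration of one MDIO access in microseconds: bits on the wire divided
 * by the MDC frequency in MHz (1 MHz clocks one bit per microsecond).
 */
double mdio_access_us(unsigned int bits, double mdc_mhz)
{
	assert(mdc_mhz > 0.0);
	return bits / mdc_mhz;
}

/* Expected duration of a busy-bit wait, assuming (hypothetically) that a
 * fraction p_first of waits succeed on the first poll and the rest on
 * the second poll, with gap_us spent between the two polls.
 */
double avg_wait_us(double access_us, double gap_us, double p_first)
{
	const double fast = access_us;
	const double slow = access_us + gap_us + access_us;

	assert(p_first >= 0.0 && p_first <= 1.0);
	return p_first * fast + (1.0 - p_first) * slow;
}
```

Under these assumptions, mdio_access_us(64, 2.5) gives 25.6us,
avg_wait_us(25.6, 1500.0, 0.5) gives roughly 788us for the old loop, and
avg_wait_us(25.6, 0.0, 0.5) gives roughly 38us for the new one, which is in
line with the "close to 1ms" and "well below 100us" figures above.
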
* Re: [PATCH net-next 1/2] net: dsa: mv88e6xxx: Improve performance of busy bit polling
  2022-01-26 23:12 ` [PATCH net-next 1/2] net: dsa: mv88e6xxx: Improve performance of busy bit polling Tobias Waldekranz
@ 2022-01-26 23:45   ` Andrew Lunn
  2022-01-27 12:58     ` Tobias Waldekranz
  2022-01-26 23:54   ` Andrew Lunn
  1 sibling, 1 reply; 8+ messages in thread
From: Andrew Lunn @ 2022-01-26 23:45 UTC
To: Tobias Waldekranz
Cc: davem, kuba, netdev, Vivien Didelot, Florian Fainelli,
	Vladimir Oltean, linux-kernel

> @@ -86,12 +86,12 @@ int mv88e6xxx_write(struct mv88e6xxx_chip *chip, int addr, int reg, u16 val)
>  int mv88e6xxx_wait_mask(struct mv88e6xxx_chip *chip, int addr, int reg,
>  			u16 mask, u16 val)
>  {
> +	const unsigned long timeout = jiffies + msecs_to_jiffies(50);
>  	u16 data;
>  	int err;
> -	int i;
>  
>  	/* There's no bus specific operation to wait for a mask */
> -	for (i = 0; i < 16; i++) {
> +	do {
>  		err = mv88e6xxx_read(chip, addr, reg, &data);
>  		if (err)
>  			return err;
> @@ -99,8 +99,8 @@ int mv88e6xxx_wait_mask(struct mv88e6xxx_chip *chip, int addr, int reg,
>  		if ((data & mask) == val)
>  			return 0;
>  
> -		usleep_range(1000, 2000);
> -	}
> +		cpu_relax();
> +	} while (time_before(jiffies, timeout));

I don't know if this is an issue or not...

There are a few bit-banging systems out there. For those, i wonder if
50ms is too short? With the old code, they had 16 chances, no matter
how slow they were. With the new code, if they take 50ms for one
transaction, they don't get a second chance.

But if they have taken 50ms, around 37ms has been spent with the
preamble, start, op, phy address, and register address. I assume at
that point the switch actually looks at the register, and given your
timings, it really should be ready, so a second loop is probably not
required?

O.K, so this seems safe.

     Andrew

* Re: [PATCH net-next 1/2] net: dsa: mv88e6xxx: Improve performance of busy bit polling
  2022-01-26 23:45   ` Andrew Lunn
@ 2022-01-27 12:58     ` Tobias Waldekranz
  2022-01-27 13:06       ` Andrew Lunn
  0 siblings, 1 reply; 8+ messages in thread
From: Tobias Waldekranz @ 2022-01-27 12:58 UTC
To: Andrew Lunn
Cc: davem, kuba, netdev, Vivien Didelot, Florian Fainelli,
	Vladimir Oltean, linux-kernel

On Thu, Jan 27, 2022 at 00:45, Andrew Lunn <andrew@lunn.ch> wrote:
> There are a few bit-banging systems out there. For those, i wonder if
> 50ms is too short? With the old code, they had 16 chances, no matter
> how slow they were. With the new code, if they take 50ms for one
> transaction, they don't get a second chance.
>
> But if they have taken 50ms, around 37ms has been spent with the
> preamble, start, op, phy address, and register address. I assume at
> that point the switch actually looks at the register, and given your
> timings, it really should be ready, so a second loop is probably not
> required?
>
> O.K, so this seems safe.

I think you raise a good point though. Say that you then have this
series of events:

1. Bang out ST
2. Bang out OP
3. Bang out PHYADR
4. Bang out REGADR
5. Clock out TA
6. schedule()
7. A SCHED_FIFO/P99 task runs
8. Clock in DATA

- Steps 1 through 5 could plausibly be completed before the bit clears
  if you are running over some memory mapped GPIO lines
- Step 7 could execute for more than 50ms
- After step 8, you would see the busy bit set, but your time is up

All of this is of course _very_ unlikely, but not impossible. Should we
ensure that you always get at least two bites at the apple?

* Re: [PATCH net-next 1/2] net: dsa: mv88e6xxx: Improve performance of busy bit polling
  2022-01-27 12:58     ` Tobias Waldekranz
@ 2022-01-27 13:06       ` Andrew Lunn
  0 siblings, 0 replies; 8+ messages in thread
From: Andrew Lunn @ 2022-01-27 13:06 UTC
To: Tobias Waldekranz
Cc: davem, kuba, netdev, Vivien Didelot, Florian Fainelli,
	Vladimir Oltean, linux-kernel

On Thu, Jan 27, 2022 at 01:58:12PM +0100, Tobias Waldekranz wrote:
> On Thu, Jan 27, 2022 at 00:45, Andrew Lunn <andrew@lunn.ch> wrote:
> > There are a few bit-banging systems out there. For those, i wonder if
> > 50ms is too short? With the old code, they had 16 chances, no matter
> > how slow they were. With the new code, if they take 50ms for one
> > transaction, they don't get a second chance.
> >
> > But if they have taken 50ms, around 37ms has been spent with the
> > preamble, start, op, phy address, and register address. I assume at
> > that point the switch actually looks at the register, and given your
> > timings, it really should be ready, so a second loop is probably not
> > required?
> >
> > O.K, so this seems safe.
>
> I think you raise a good point though. Say that you then have this
> series of events:
>
> 1. Bang out ST
> 2. Bang out OP
> 3. Bang out PHYADR
> 4. Bang out REGADR
> 5. Clock out TA
> 6. schedule()
> 7. A SCHED_FIFO/P99 task runs
> 8. Clock in DATA
>
> - Steps 1 through 5 could plausibly be completed before the bit clears
>   if you are running over some memory mapped GPIO lines
> - Step 7 could execute for more than 50ms
> - After step 8, you would see the busy bit set, but your time is up

So this is the opposite case i was thinking about. A very fast bit
banger. Yes, in theory this could happen.

> All of this is of course _very_ unlikely, but not impossible. Should we
> ensure that you always get at least two bites at the apple?

This is why i always point people at include/linux/iopoll.h. It
handles conditions like this by doing one more poll after the timeout
just to be sure the scheduler has not interfered. So a minimum of 2
would be good.

     Andrew

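The "at least two polls" idea discussed above can be sketched in userspace C.
The loop below is loosely modelled on the behaviour Andrew describes for
include/linux/iopoll.h (one extra read once the deadline has passed); it is
not the kernel macro and not necessarily the code that was eventually merged.
struct fake_bus, fake_now, fake_read and poll_bit_clear are all hypothetical
names, and the clock is simulated so that the example is deterministic.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define BUSY_BIT 0x8000u

/* Hypothetical simulated bus: every read costs read_cost_us of fake time
 * and reports the busy bit for the first busy_reads reads.
 */
struct fake_bus {
	uint64_t now_us;
	uint64_t read_cost_us;
	unsigned int busy_reads;
	unsigned int reads;
};

uint64_t fake_now(struct fake_bus *bus)
{
	return bus->now_us;
}

int fake_read(struct fake_bus *bus, uint16_t *val)
{
	assert(val);
	bus->now_us += bus->read_cost_us;
	bus->reads++;
	if (bus->busy_reads > 0) {
		bus->busy_reads--;
		*val = BUSY_BIT;
	} else {
		*val = 0;
	}
	return 0;
}

/* Wait for BUSY_BIT to clear. Once the deadline has passed, read one
 * more time before giving up, so that a caller which was delayed in the
 * middle of a read still gets a second look at the register.
 */
int poll_bit_clear(struct fake_bus *bus, uint64_t timeout_us)
{
	const uint64_t deadline = fake_now(bus) + timeout_us;
	uint16_t val;
	int err;

	for (;;) {
		err = fake_read(bus, &val);
		if (err)
			return err;
		if (!(val & BUSY_BIT))
			return 0;
		if (fake_now(bus) > deadline) {
			/* Final poll after the timeout has expired. */
			err = fake_read(bus, &val);
			if (err)
				return err;
			return (val & BUSY_BIT) ? -ETIMEDOUT : 0;
		}
	}
}
```

With a simulated read that takes 60ms and sees the busy bit once, a 50ms
timeout still succeeds on the second read; a bit that never clears still ends
in -ETIMEDOUT. The design choice is that the deadline bounds how long the
loop keeps retrying, but never prevents the last read from being trusted.
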
* Re: [PATCH net-next 1/2] net: dsa: mv88e6xxx: Improve performance of busy bit polling
  2022-01-26 23:12 ` [PATCH net-next 1/2] net: dsa: mv88e6xxx: Improve performance of busy bit polling Tobias Waldekranz
  2022-01-26 23:45   ` Andrew Lunn
@ 2022-01-26 23:54   ` Andrew Lunn
  1 sibling, 0 replies; 8+ messages in thread
From: Andrew Lunn @ 2022-01-26 23:54 UTC
To: Tobias Waldekranz
Cc: davem, kuba, netdev, Vivien Didelot, Florian Fainelli,
	Vladimir Oltean, linux-kernel

On Thu, Jan 27, 2022 at 12:12:38AM +0100, Tobias Waldekranz wrote:
> Avoid a long delay when a busy bit is still set and has to be polled
> again.
>
> Measurements on a system with 2 Opals (6097F) and one Agate (6352)
> show that even with this much tighter loop, we have about a 50% chance
> of the bit being cleared on the first poll, all other accesses see the
> bit being cleared on the second poll.
>
> On a standard MDIO bus running MDC at 2.5MHz, a single access with 32
> bits of preamble plus 32 bits of data takes 64*(1/2.5MHz) = 25.6us.
>
> This means that mv88e6xxx_smi_direct_wait took 26us + CPU overhead in
> the fast scenario, but 26us + 1500us + 26us + CPU overhead in the slow
> case - bringing the average close to 1ms.
>
> With this change in place, the slow case is closer to 2*26us + CPU
> overhead, with the average well below 100us - a 10x improvement.
>
> This translates to real-world winnings. On a 3-chip 20-port system,
> the modprobe time drops by 88%:
>
> Before:
>
> root@coronet:~# time modprobe mv88e6xxx
> real    0m 15.99s
> user    0m 0.00s
> sys     0m 1.52s
>
> After:
>
> root@coronet:~# time modprobe mv88e6xxx
> real    0m 2.21s
> user    0m 0.00s
> sys     0m 1.54s
>
> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

* [PATCH net-next 2/2] net: dsa: mv88e6xxx: Improve indirect addressing performance
  2022-01-26 23:12 [PATCH net-next 0/2] net: dsa: mv88e6xxx: Improve indirect addressing performance Tobias Waldekranz
  2022-01-26 23:12 ` [PATCH net-next 1/2] net: dsa: mv88e6xxx: Improve performance of busy bit polling Tobias Waldekranz
@ 2022-01-26 23:12 ` Tobias Waldekranz
  2022-01-26 23:53   ` Andrew Lunn
  1 sibling, 1 reply; 8+ messages in thread
From: Tobias Waldekranz @ 2022-01-26 23:12 UTC
To: davem, kuba
Cc: netdev, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Vladimir Oltean, linux-kernel

Before this change, both the read and write callback would start out
by asserting that the chip's busy flag was cleared. However, both
callbacks also made sure to wait for the clearing of the busy bit
before returning - making the initial check superfluous. The only
time that would ever have an effect was if the busy bit was initially
set for some reason.

With that in mind, make sure to perform an initial check of the busy
bit, after which both read and write can rely on the previous operation
to have waited for the bit to clear.

This cuts the number of operations on the underlying MDIO bus by 25%.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
---
 drivers/net/dsa/mv88e6xxx/chip.h |  1 +
 drivers/net/dsa/mv88e6xxx/smi.c  | 24 ++++++++++++++----------
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.h b/drivers/net/dsa/mv88e6xxx/chip.h
index 8271b8aa7b71..438cee853d07 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.h
+++ b/drivers/net/dsa/mv88e6xxx/chip.h
@@ -392,6 +392,7 @@ struct mv88e6xxx_chip {
 struct mv88e6xxx_bus_ops {
 	int (*read)(struct mv88e6xxx_chip *chip, int addr, int reg, u16 *val);
 	int (*write)(struct mv88e6xxx_chip *chip, int addr, int reg, u16 val);
+	int (*init)(struct mv88e6xxx_chip *chip);
 };
 
 struct mv88e6xxx_mdio_bus {
diff --git a/drivers/net/dsa/mv88e6xxx/smi.c b/drivers/net/dsa/mv88e6xxx/smi.c
index a59f32243e08..1ebdaa55e710 100644
--- a/drivers/net/dsa/mv88e6xxx/smi.c
+++ b/drivers/net/dsa/mv88e6xxx/smi.c
@@ -104,11 +104,6 @@ static int mv88e6xxx_smi_indirect_read(struct mv88e6xxx_chip *chip,
 {
 	int err;
 
-	err = mv88e6xxx_smi_direct_wait(chip, chip->sw_addr,
-					MV88E6XXX_SMI_CMD, 15, 0);
-	if (err)
-		return err;
-
 	err = mv88e6xxx_smi_direct_write(chip, chip->sw_addr,
 					 MV88E6XXX_SMI_CMD,
 					 MV88E6XXX_SMI_CMD_BUSY |
@@ -132,11 +127,6 @@ static int mv88e6xxx_smi_indirect_write(struct mv88e6xxx_chip *chip,
 {
 	int err;
 
-	err = mv88e6xxx_smi_direct_wait(chip, chip->sw_addr,
-					MV88E6XXX_SMI_CMD, 15, 0);
-	if (err)
-		return err;
-
 	err = mv88e6xxx_smi_direct_write(chip, chip->sw_addr,
 					 MV88E6XXX_SMI_DATA, data);
 	if (err)
@@ -155,9 +145,20 @@ static int mv88e6xxx_smi_indirect_write(struct mv88e6xxx_chip *chip,
 					 MV88E6XXX_SMI_CMD, 15, 0);
 }
 
+static int mv88e6xxx_smi_indirect_init(struct mv88e6xxx_chip *chip)
+{
+	/* Ensure that the chip starts out in the ready state. As both
+	 * reads and writes always ensure this on return, they can
+	 * safely depend on the chip not being busy on entry.
+	 */
+	return mv88e6xxx_smi_direct_wait(chip, chip->sw_addr,
+					 MV88E6XXX_SMI_CMD, 15, 0);
+}
+
 static const struct mv88e6xxx_bus_ops mv88e6xxx_smi_indirect_ops = {
 	.read = mv88e6xxx_smi_indirect_read,
 	.write = mv88e6xxx_smi_indirect_write,
+	.init = mv88e6xxx_smi_indirect_init,
 };
 
 int mv88e6xxx_smi_init(struct mv88e6xxx_chip *chip,
@@ -175,5 +176,8 @@ int mv88e6xxx_smi_init(struct mv88e6xxx_chip *chip,
 	chip->bus = bus;
 	chip->sw_addr = sw_addr;
 
+	if (chip->smi_ops->init)
+		return chip->smi_ops->init(chip);
+
 	return 0;
 }
-- 
2.25.1

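The 25% figure in the commit message above can be illustrated by counting
bus operations against a toy model of the indirect access scheme. This is a
sketch under stated assumptions, not driver code: struct fake_chip, the
SMI_* constants and every function below are hypothetical stand-ins, and the
fake device clears its busy bit as soon as a command is written, which
corresponds to the "optimal case" where each wait needs a single read.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative register numbers and busy flag for the toy model. */
#define SMI_CMD      0x00
#define SMI_DATA     0x01
#define SMI_CMD_BUSY 0x8000

/* Toy device: two registers plus a counter of direct bus operations. */
struct fake_chip {
	uint16_t cmd;
	uint16_t data;
	unsigned int ops;
};

static uint16_t direct_read(struct fake_chip *c, int reg)
{
	c->ops++;
	return reg == SMI_CMD ? c->cmd : c->data;
}

static void direct_write(struct fake_chip *c, int reg, uint16_t val)
{
	c->ops++;
	if (reg == SMI_CMD)
		c->cmd = (uint16_t)(val & ~SMI_CMD_BUSY); /* completes at once */
	else
		c->data = val;
}

/* Poll the command register until the busy bit is clear. */
static void direct_wait(struct fake_chip *c)
{
	while (direct_read(c, SMI_CMD) & SMI_CMD_BUSY)
		;
}

/* Indirect read; wait_first selects the old behaviour (check the busy
 * bit on entry as well as after issuing the command).
 */
static uint16_t indirect_read(struct fake_chip *c, int wait_first)
{
	if (wait_first)
		direct_wait(c);
	direct_write(c, SMI_CMD, SMI_CMD_BUSY);
	direct_wait(c);
	return direct_read(c, SMI_DATA);
}

/* Number of direct bus operations one indirect read costs. */
unsigned int ops_per_read(int wait_first)
{
	struct fake_chip c = { 0 };

	(void)indirect_read(&c, wait_first);
	return c.ops;
}
```

In this model ops_per_read(1) counts four direct operations (wait, command
write, wait, data read) and ops_per_read(0) counts three, which matches the
"1 out of 4" saving described in the cover letter. The saving relies on the
invariant stated in the patch: the busy bit is checked once at init, and
every read or write leaves the chip idle on return.
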
* Re: [PATCH net-next 2/2] net: dsa: mv88e6xxx: Improve indirect addressing performance
  2022-01-26 23:12 ` [PATCH net-next 2/2] net: dsa: mv88e6xxx: Improve indirect addressing performance Tobias Waldekranz
@ 2022-01-26 23:53   ` Andrew Lunn
  0 siblings, 0 replies; 8+ messages in thread
From: Andrew Lunn @ 2022-01-26 23:53 UTC
To: Tobias Waldekranz
Cc: davem, kuba, netdev, Vivien Didelot, Florian Fainelli,
	Vladimir Oltean, linux-kernel

On Thu, Jan 27, 2022 at 12:12:39AM +0100, Tobias Waldekranz wrote:
> Before this change, both the read and write callback would start out
> by asserting that the chip's busy flag was cleared. However, both
> callbacks also made sure to wait for the clearing of the busy bit
> before returning - making the initial check superfluous. The only
> time that would ever have an effect was if the busy bit was initially
> set for some reason.
>
> With that in mind, make sure to perform an initial check of the busy
> bit, after which both read and write can rely on the previous operation
> to have waited for the bit to clear.
>
> This cuts the number of operations on the underlying MDIO bus by 25%.
>
> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

end of thread [~2022-01-27 13:06 UTC]

Thread overview: 8+ messages
2022-01-26 23:12 [PATCH net-next 0/2] net: dsa: mv88e6xxx: Improve indirect addressing performance Tobias Waldekranz
2022-01-26 23:12 ` [PATCH net-next 1/2] net: dsa: mv88e6xxx: Improve performance of busy bit polling Tobias Waldekranz
2022-01-26 23:45   ` Andrew Lunn
2022-01-27 12:58     ` Tobias Waldekranz
2022-01-27 13:06       ` Andrew Lunn
2022-01-26 23:54   ` Andrew Lunn
2022-01-26 23:12 ` [PATCH net-next 2/2] net: dsa: mv88e6xxx: Improve indirect addressing performance Tobias Waldekranz
2022-01-26 23:53   ` Andrew Lunn