From: "Rafał Miłecki" <zajec5@gmail.com>
To: Arnd Bergmann <arnd@arndb.de>
Cc: Andrew Lunn <andrew@lunn.ch>,
Alexander Lobakin <alexandr.lobakin@intel.com>,
Network Development <netdev@vger.kernel.org>,
linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
Russell King <linux@armlinux.org.uk>,
Felix Fietkau <nbd@nbd.name>,
"openwrt-devel@lists.openwrt.org"
<openwrt-devel@lists.openwrt.org>,
Florian Fainelli <f.fainelli@gmail.com>
Subject: Re: Optimizing kernel compilation / alignments for network performance
Date: Tue, 10 May 2022 13:23:38 +0200 [thread overview]
Message-ID: <bc628101-8772-9994-3aa7-c141ae151f15@gmail.com> (raw)
In-Reply-To: <CAK8P3a0Rouw8jHHqGhKtMu-ks--bqpVYj_+u4-Pt9VoFOK7nMw@mail.gmail.com>
On 6.05.2022 10:45, Arnd Bergmann wrote:
> On Fri, May 6, 2022 at 9:44 AM Rafał Miłecki <zajec5@gmail.com> wrote:
>>
>> On 5.05.2022 18:04, Andrew Lunn wrote:
>>>> you'll see that most used functions are:
>>>> v7_dma_inv_range
>>>> __irqentry_text_end
>>>> l2c210_inv_range
>>>> v7_dma_clean_range
>>>> bcma_host_soc_read32
>>>> __netif_receive_skb_core
>>>> arch_cpu_idle
>>>> l2c210_clean_range
>>>> fib_table_lookup
>>>
>>> There is a lot of cache management functions here.
>
> Indeed, so optimizing the coherency management (see Felix' reply)
> is likely to help most in making the driver faster, but that does not
> explain why the alignment of the object code has such a big impact
> on performance.
>
> To investigate the alignment further, what I was actually looking for
> is a comparison of the profile of the slow and fast case. Here I would
> expect that the slow case spends more time in one of the functions
> that don't deal with cache management (maybe fib_table_lookup or
> __netif_receive_skb_core).
>
> A few other thoughts:
>
> - bcma_host_soc_read32() is a fundamentally slow operation, maybe
> some of the calls can turned into a relaxed read, like the readback
> in bgmac_chip_intrs_off() or the 'poll again' at the end bgmac_poll(),
> though obviously not the one in bgmac_dma_rx_read().
> It may be possible to even avoid some of the reads entirely, checking
> for more data in bgmac_poll() may actually be counterproductive
> depending on the workload.
I'll experiment with that, hopefully I can optimize it a bit.
> - The higher-end networking SoCs are usually cache-coherent and
> can avoid the cache management entirely. There is a slim chance
> that this chip is designed that way and it just needs to be enabled
> properly. Most low-end chips don't implement the coherent
> interconnect though, and I suppose you have checked this already.
To my best knowledge Northstar platform doesn't support hw coherency.
I just took an extra look at Broadcom's SDK and them seem to have some
driver for selected chipsets but BCM708 isn't there.
config BCM_GLB_COHERENCY
bool "Global Hardware Cache Coherency"
default n
depends on BCM963158 || BCM96846 || BCM96858 || BCM96856 || BCM963178 || BCM947622 || BCM963146 || BCM94912 || BCM96813 || BCM96756 || BCM96855
> - bgmac_dma_rx_update_index() and bgmac_dma_tx_add() appear
> to have an extraneous dma_wmb(), which should be implied by the
> non-relaxed writel() in bgmac_write().
I tried dropping wmb() calls.
With wmb(): 421 Mb/s
Without: 418 Mb/s
I also tried dropping bgmac_read() from bgmac_chip_intrs_off() which
seems to be a flushing readback.
With bgmac_read(): 421 Mb/s
Without: 413 Mb/s
> - accesses to the DMA descriptor don't show up in the profile here,
> but look like they can get misoptimized by the compiler. I would
> generally use READ_ONCE() and WRITE_ONCE() for these to
> ensure that you don't end up with extra or out-of-order accesses.
> This also makes it clearer to the reader that something special
> happens here.
Should I use something as below?
FWIW it doesn't seem to change NAT performance.
Without WRITE_ONCE: 421 Mb/s
With: 419 Mb/s
diff --git a/drivers/net/ethernet/broadcom/bgmac.c b/drivers/net/ethernet/broadcom/bgmac.c
index 87700072..ce98f2a9 100644
--- a/drivers/net/ethernet/broadcom/bgmac.c
+++ b/drivers/net/ethernet/broadcom/bgmac.c
@@ -119,10 +119,10 @@ bgmac_dma_tx_add_buf(struct bgmac *bgmac, struct bgmac_dma_ring *ring,
slot = &ring->slots[i];
dma_desc = &ring->cpu_base[i];
- dma_desc->addr_low = cpu_to_le32(lower_32_bits(slot->dma_addr));
- dma_desc->addr_high = cpu_to_le32(upper_32_bits(slot->dma_addr));
- dma_desc->ctl0 = cpu_to_le32(ctl0);
- dma_desc->ctl1 = cpu_to_le32(ctl1);
+ WRITE_ONCE(dma_desc->addr_low, cpu_to_le32(lower_32_bits(slot->dma_addr)));
+ WRITE_ONCE(dma_desc->addr_high, cpu_to_le32(upper_32_bits(slot->dma_addr)));
+ WRITE_ONCE(dma_desc->ctl0, cpu_to_le32(ctl0));
+ WRITE_ONCE(dma_desc->ctl1, cpu_to_le32(ctl1));
}
static netdev_tx_t bgmac_dma_tx_add(struct bgmac *bgmac,
@@ -387,10 +387,10 @@ static void bgmac_dma_rx_setup_desc(struct bgmac *bgmac,
* B43_DMA64_DCTL1_ADDREXT_MASK;
*/
- dma_desc->addr_low = cpu_to_le32(lower_32_bits(ring->slots[desc_idx].dma_addr));
- dma_desc->addr_high = cpu_to_le32(upper_32_bits(ring->slots[desc_idx].dma_addr));
- dma_desc->ctl0 = cpu_to_le32(ctl0);
- dma_desc->ctl1 = cpu_to_le32(ctl1);
+ WRITE_ONCE(dma_desc->addr_low, cpu_to_le32(lower_32_bits(ring->slots[desc_idx].dma_addr)));
+ WRITE_ONCE(dma_desc->addr_high, cpu_to_le32(upper_32_bits(ring->slots[desc_idx].dma_addr)));
+ WRITE_ONCE(dma_desc->ctl0, cpu_to_le32(ctl0));
+ WRITE_ONCE(dma_desc->ctl1, cpu_to_le32(ctl1));
ring->end = desc_idx;
}
next prev parent reply other threads:[~2022-05-10 11:23 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-04-27 12:04 Optimizing kernel compilation / alignments for network performance Rafał Miłecki
2022-04-27 12:56 ` Alexander Lobakin
2022-04-27 17:31 ` Rafał Miłecki
2022-04-29 14:18 ` Rafał Miłecki
2022-04-29 14:49 ` Arnd Bergmann
2022-05-05 15:42 ` Rafał Miłecki
2022-05-05 16:04 ` Andrew Lunn
2022-05-05 16:46 ` Felix Fietkau
2022-05-06 7:47 ` Rafał Miłecki
2022-05-06 12:42 ` Andrew Lunn
2022-05-10 10:29 ` Rafał Miłecki
2022-05-10 14:09 ` Dave Taht
2022-05-10 19:15 ` Dave Taht
2022-05-06 7:44 ` Rafał Miłecki
2022-05-06 8:45 ` Arnd Bergmann
2022-05-06 8:55 ` Rafał Miłecki
2022-05-06 9:44 ` Arnd Bergmann
2022-05-10 12:51 ` Rafał Miłecki
2022-05-10 13:19 ` Arnd Bergmann
2022-05-10 11:23 ` Rafał Miłecki [this message]
2022-05-10 13:18 ` Arnd Bergmann
2022-05-08 9:53 ` Rafał Miłecki
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=bc628101-8772-9994-3aa7-c141ae151f15@gmail.com \
--to=zajec5@gmail.com \
--cc=alexandr.lobakin@intel.com \
--cc=andrew@lunn.ch \
--cc=arnd@arndb.de \
--cc=f.fainelli@gmail.com \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux@armlinux.org.uk \
--cc=nbd@nbd.name \
--cc=netdev@vger.kernel.org \
--cc=openwrt-devel@lists.openwrt.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).