* v7_dma_inv_range performance/high expense
@ 2016-05-27 14:40 Andrew Lunn
2016-05-27 14:58 ` Russell King - ARM Linux
2016-05-27 16:14 ` Mark Rutland
0 siblings, 2 replies; 5+ messages in thread
From: Andrew Lunn @ 2016-05-27 14:40 UTC (permalink / raw)
To: linux-arm-kernel
Hi folks
I have an imx6q, which is a quad-core ARMv7 processor. Attached to it via
PCIe I have an Intel i210 Ethernet controller.
When transmitting, I can get gigabit line rate while loading one core to
about 35%. When receiving, I get around 700Mbps and ksoftirqd/0 is
loading a core at 98%.
Using perf to profile the ksoftirqd/0 pid I see:
46.38% [kernel] [k] v7_dma_inv_range
21.25% [kernel] [k] l2c210_inv_range
10.90% [kernel] [k] igb_poll
1.69% [kernel] [k] dma_cache_maint_page
1.27% [kernel] [k] eth_type_trans
1.20% [kernel] [k] skb_add_rx_frag
Digging deeper into v7_dma_inv_range I see:
801182c0 <v7_dma_inv_range>:
v7_dma_inv_range():
0.26 mrc 15, 0, r3, cr0, cr0, {1}
0.07 lsr r3, r3, #16
and r3, r3, #15
0.04 mov r2, #4
lsl r2, r2, r3
0.04 sub r3, r2, #1
tst r0, r3
0.02 bic r0, r0, r3
0.03 dsb sy
3.01 mcrne 15, 0, r0, cr7, cr14, {1}
0.54 tst r1, r3
bic r1, r1, r3
0.08 mcrne 15, 0, r1, cr7, cr14, {1}
3.82 34: mcr 15, 0, r0, cr7, cr6, {1}
88.32 add r0, r0, r2
cmp r0, r1
1.97 bcc 34
0.43 dsb st
1.37 bx lr
I'm assuming perf is off by one here, and the add is not taking 88.32%
of the load; rather, it is the mcr instruction before it.
The original code in arch/arm/mm/cache-v7.S says:
mcr p15, 0, r0, c7, c6, 1 @ invalidate D / U line
I don't get why a cache invalidate instruction should be so expensive.
It is just throwing away the contents of the cache line, not flushing
it out to DRAM. Should I trust perf? Is a cache invalidate really so
expensive? Or am I totally missing something here?
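For reference, here is roughly what that assembly is doing, sketched in
userspace C with the CP15 operations replaced by counting stubs
(dcimvac/dccimvac are hypothetical helper names; the real operations are
the MCRs to c7,c6,1 and c7,c14,1, and the DSBs are omitted):

```c
#include <stdint.h>

static unsigned invalidated, cleaned;

/* Hypothetical stand-ins for the CP15 cache maintenance operations. */
static void dcimvac(uintptr_t mva)  { (void)mva; invalidated++; } /* mcr p15,0,Rt,c7,c6,1: invalidate D-line by MVA */
static void dccimvac(uintptr_t mva) { (void)mva; cleaned++; }     /* mcr p15,0,Rt,c7,c14,1: clean+invalidate D-line */

/* ctr is the Cache Type Register value that the real code reads with
 * mrc p15,0,Rt,c0,c0,1. DminLine (bits 19:16) is log2 of the smallest
 * D-cache line size in words, so line size in bytes is 4 << DminLine. */
static void v7_dma_inv_range_c(uintptr_t start, uintptr_t end, uint32_t ctr)
{
    unsigned dminline = (ctr >> 16) & 0xf;
    uintptr_t linesize = 4u << dminline;
    uintptr_t mask = linesize - 1;

    /* Unaligned edges may share a line with live data, so those lines
     * are cleaned (written back) as well as invalidated. */
    if (start & mask) { start &= ~mask; dccimvac(start); }
    if (end & mask)   { end &= ~mask;   dccimvac(end); }

    /* The hot loop: one invalidate MCR per cache line. */
    for (uintptr_t mva = start; mva < end; mva += linesize)
        dcimvac(mva);
}
```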
Thanks
Andrew
* v7_dma_inv_range performance/high expense
2016-05-27 14:40 v7_dma_inv_range performance/high expense Andrew Lunn
@ 2016-05-27 14:58 ` Russell King - ARM Linux
2016-05-27 15:38 ` Andrew Lunn
2016-05-27 16:14 ` Mark Rutland
1 sibling, 1 reply; 5+ messages in thread
From: Russell King - ARM Linux @ 2016-05-27 14:58 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, May 27, 2016 at 04:40:45PM +0200, Andrew Lunn wrote:
> 0.26 mrc 15, 0, r3, cr0, cr0, {1}
> 0.07 lsr r3, r3, #16
> and r3, r3, #15
> 0.04 mov r2, #4
> lsl r2, r2, r3
> 0.04 sub r3, r2, #1
> tst r0, r3
> 0.02 bic r0, r0, r3
> 0.03 dsb sy
> 3.01 mcrne 15, 0, r0, cr7, cr14, {1}
> 0.54 tst r1, r3
> bic r1, r1, r3
> 0.08 mcrne 15, 0, r1, cr7, cr14, {1}
> 3.82 34: mcr 15, 0, r0, cr7, cr6, {1}
> 88.32 add r0, r0, r2
> cmp r0, r1
> 1.97 bcc 34
> 0.43 dsb st
> 1.37 bx lr
>
> I'm assuming perf is off by one here, and the add is not taking 88.32%
> of the load; rather, it is the mcr instruction before it.
Possibly, but I'm not sure that merely subtracting four from the PC (or
two for Thumb) would be the correct solution - what if we've branched
to a function and we've taken the exception with the PC pointing at the
very first instruction? We'd wind it back by one place, and it would be
pointing at the instruction before the function (not the previously
executed instruction).
So, I think folk just have to get used to reading ARM perf traces
differently[*] - the PC points at the _next_ instruction to be executed
after the exception which recorded the event returns.
* - maybe it is the same as x86, I've never looked at an x86 perf trace,
but I don't see that it would be any different.
> The original code in arch/arm/mm/cache-v7.S says:
>
> mcr p15, 0, r0, c7, c6, 1 @ invalidate D / U line
>
> I don't get why a cache invalidate instruction should be so expensive.
> It is just throwing away the contents of the cache line, not flushing
> it out to DRAM. Should I trust perf? Is a cache invalidate really so
> expensive? Or am i totally missing something here?
If we're being asked to do a large region, then flushing the cache one
line at a time _is_ expensive. There's no real getting away from that.
The only thing that saves you from having to do that is having DMA
coherency with the cache, something which I've pointed out in some
meetings I've had with ARM over the years.
The response was along the lines that you'd expect... It's only
relatively recently, with SMP (which needs coherency), that ARM systems
have had a coherent bus, and even among the systems which have one,
relatively few SoCs make use of it.
--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
* v7_dma_inv_range performance/high expense
2016-05-27 14:58 ` Russell King - ARM Linux
@ 2016-05-27 15:38 ` Andrew Lunn
2016-05-27 16:37 ` Russell King - ARM Linux
0 siblings, 1 reply; 5+ messages in thread
From: Andrew Lunn @ 2016-05-27 15:38 UTC (permalink / raw)
To: linux-arm-kernel
> > The original code in arch/arm/mm/cache-v7.S says:
> >
> > mcr p15, 0, r0, c7, c6, 1 @ invalidate D / U line
> >
> > I don't get why a cache invalidate instruction should be so expensive.
> > It is just throwing away the contents of the cache line, not flushing
> > it out to DRAM. Should I trust perf? Is a cache invalidate really so
> > expensive? Or am i totally missing something here?
>
> If we're being asked to do a large region, then flushing the cache one
> line at a time _is_ expensive.
Hi Russell
It is a 2K block, i.e. space for one Ethernet frame.
You say flush here. Yet we are not flushing, we are invalidating.
What we logically want to happen is that the DMA engine copies the
packet into DRAM. Once that completes, we invalidate the cache, and the
next read would cause a cache miss that pulls the Ethernet frame in.
Looking at these numbers, the invalidate is much more expensive than
the cache miss.
You say one line at a time is expensive. Do you have any idea where
the break-even point is for invalidating the whole cache? Having said
that, v7_invalidate_l1 seems to do it a line at a time as well.
Thanks
Andrew
* v7_dma_inv_range performance/high expense
2016-05-27 14:40 v7_dma_inv_range performance/high expense Andrew Lunn
2016-05-27 14:58 ` Russell King - ARM Linux
@ 2016-05-27 16:14 ` Mark Rutland
1 sibling, 0 replies; 5+ messages in thread
From: Mark Rutland @ 2016-05-27 16:14 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, May 27, 2016 at 04:40:45PM +0200, Andrew Lunn wrote:
> Hi folks
>
> I have an imx6q, which is a quad-core ARMv7 processor. Attached to it via
> PCIe I have an Intel i210 Ethernet controller.
>
> When transmitting, I can get gigabit line rate while loading one core to
> about 35%. When receiving, I get around 700Mbps and ksoftirqd/0 is
> loading a core at 98%.
>
> Using perf to profile the ksoftirqd/0 pid I see:
>
> 46.38% [kernel] [k] v7_dma_inv_range
> 21.25% [kernel] [k] l2c210_inv_range
> 10.90% [kernel] [k] igb_poll
> 1.69% [kernel] [k] dma_cache_maint_page
> 1.27% [kernel] [k] eth_type_trans
> 1.20% [kernel] [k] skb_add_rx_frag
>
> Digging deeper into v7_dma_inv_range I see:
>
> 801182c0 <v7_dma_inv_range>:
> v7_dma_inv_range():
> 0.26 mrc 15, 0, r3, cr0, cr0, {1}
> 0.07 lsr r3, r3, #16
> and r3, r3, #15
> 0.04 mov r2, #4
> lsl r2, r2, r3
> 0.04 sub r3, r2, #1
> tst r0, r3
> 0.02 bic r0, r0, r3
> 0.03 dsb sy
> 3.01 mcrne 15, 0, r0, cr7, cr14, {1}
> 0.54 tst r1, r3
> bic r1, r1, r3
> 0.08 mcrne 15, 0, r1, cr7, cr14, {1}
> 3.82 34: mcr 15, 0, r0, cr7, cr6, {1}
> 88.32 add r0, r0, r2
> cmp r0, r1
> 1.97 bcc 34
> 0.43 dsb st
> 1.37 bx lr
>
> I'm assuming perf is off by one here, and the add is not taking 88.32%
> of the load; rather, it is the mcr instruction before it.
The address perf reports is the PC at the moment the PMU overflow
interrupt was architecturally taken by the core. Reporting anything else
would require us to make up bogus PC values (e.g. if a branch was just
taken, you can't reconstruct the previous PC).
If the PMU overflow interrupt comes in (asynchronously) while an
expensive instruction is in progress, the CPU will likely have to wait
for that to complete before it can handle the interrupt.
So yes, the MCR is very likely to be the expensive instruction here.
> The original code in arch/arm/mm/cache-v7.S says:
>
> mcr p15, 0, r0, c7, c6, 1 @ invalidate D / U line
>
> I don't get why a cache invalidate instruction should be so expensive.
> It is just throwing away the contents of the cache line, not flushing
> it out to DRAM.
This really depends on the microarchitecture and integration.
The cache maintenance operations likely have to synchronise with some
logic in other cores to safely invalidate all copies of the line, there
may be some limit on the number of outstanding operations, etc.
> Should I trust perf?
I don't see a reason not to. Nothing above implies that perf is
providing erroneous information.
Thanks,
Mark.
* v7_dma_inv_range performance/high expense
2016-05-27 15:38 ` Andrew Lunn
@ 2016-05-27 16:37 ` Russell King - ARM Linux
0 siblings, 0 replies; 5+ messages in thread
From: Russell King - ARM Linux @ 2016-05-27 16:37 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, May 27, 2016 at 05:38:37PM +0200, Andrew Lunn wrote:
> You say flush here. Yet we are not flushing, we are invalidating.
Yes, I meant invalidating, sorry.
> What we logically want to happen is that the DMA engine copies the
> packet into DRAM. Once that completes, we invalidate the cache, and the
> next read would cause a cache miss that pulls the Ethernet frame in.
Yes, so you read the data which was DMAd, rather than any data that may
be in the cache from previous accesses _or_ speculative prefetches.
> Looking at these numbers, the invalidate is much more expensive than
> the cache miss.
>
> You say one line at a time is expensive. Do you have any idea where
> the break even is for invalidating the whole cache? Having said that,
> v7_invalidate_l1 seems to be doing it a line at a time as well.
Yes, v7_invalidate_l1 also does it one line at a time, but by set/way
instead, and set/way doesn't tell you whether the cache line overlaps
the memory region you're invalidating, so you would end up discarding
dirty data from other memory regions.
The alternative is to flush all cache lines in the cache. Even so, for
either approach to be cheaper, you need to be touching fewer lines than
the present method does. For 2K worth of data, it's unlikely to be
cheaper.
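As a back-of-the-envelope check on that (a sketch; the 32-byte line and
32KB L1D figures are assumptions based on the Cortex-A9 in the i.MX6Q -
check your SoC's TRM):

```c
/* Maintenance operations needed to cover `bytes` by MVA, one line at
 * a time, as v7_dma_inv_range does. */
static unsigned ops_by_mva(unsigned bytes, unsigned line_bytes)
{
    return (bytes + line_bytes - 1) / line_bytes;
}

/* Set/way maintenance touches every line in the cache, regardless of
 * how small the buffer is - that is the whole-cache alternative. */
static unsigned ops_by_set_way(unsigned cache_bytes, unsigned line_bytes)
{
    return cache_bytes / line_bytes;
}
```

So a 2K buffer costs 64 per-line operations by MVA, versus 1024
operations to walk a 32KB L1 by set/way - the per-line approach wins by
a wide margin for buffers this small.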
I guess something which may be worth trying is to unroll the loop a
little and see what effect it has on the perf numbers... if things
like the branch predictor are working correctly, I'd expect little
difference (except to spread the cost over more of the function).
It may be worth just proving that point.
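For what it's worth, the unrolling would look something like this,
sketched in C with the MCR replaced by a counting stub (dcimvac is a
hypothetical helper, and start/end are assumed line-aligned for
brevity):

```c
#include <stdint.h>

static unsigned ops;

/* Stand-in for mcr p15,0,Rt,c7,c6,1 (invalidate D-line by MVA). */
static void dcimvac(uintptr_t mva) { (void)mva; ops++; }

/* Body unrolled by four, so the cmp/bcc pair executes once per four
 * MCRs instead of once per MCR; the tail loop handles the remainder. */
static void inv_range_unrolled(uintptr_t start, uintptr_t end, uintptr_t line)
{
    uintptr_t mva = start;
    while (end - mva >= 4 * line) {
        dcimvac(mva);
        dcimvac(mva + line);
        dcimvac(mva + 2 * line);
        dcimvac(mva + 3 * line);
        mva += 4 * line;
    }
    for (; mva < end; mva += line)
        dcimvac(mva);
}
```

The number of MCRs issued is unchanged; only the loop overhead shrinks,
which is why, if the MCRs themselves dominate, the perf numbers should
barely move.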