Performance bug in c-r4k.c cache handling code

All of lore.kernel.org
 help / color / mirror / Atom feed

* Performance bug in c-r4k.c cache handling code
@ 2005-09-19 15:40 Thiemo Seufer
  2005-09-19 16:54 ` Atsushi Nemoto
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Thiemo Seufer @ 2005-09-19 15:40 UTC (permalink / raw)
  To: linux-mips

Hello All,

I found an performance bug in c-r4k.c:r4k_dma_cache_inv, where a
Hit_Writeback_Inv instead of Hit_Invalidate is done. Ralf mentioned
this is probably due to broken Hit_Invalidate cache ops on some
CPUs, does anybody have more information about this? The appended
patch works apparently fine on R4400, R4600v2.0, R5000.


Thiemo


Index: arch/mips/mm/c-r4k.c
===================================================================
RCS file: /home/cvs/linux/arch/mips/mm/c-r4k.c,v
retrieving revision 1.119
diff -u -p -r1.119 c-r4k.c
--- arch/mips/mm/c-r4k.c	9 Sep 2005 20:26:54 -0000	1.119
+++ arch/mips/mm/c-r4k.c	19 Sep 2005 15:33:35 -0000
@@ -685,7 +685,7 @@ static void r4k_dma_cache_inv(unsigned l
 		a = addr & ~(sc_lsize - 1);
 		end = (addr + size - 1) & ~(sc_lsize - 1);
 		while (1) {
-			flush_scache_line(a);	/* Hit_Writeback_Inv_SD */
+			invalidate_scache_line(a);	/* Hit_Invalidate_SD */
 			if (a == end)
 				break;
 			a += sc_lsize;
@@ -702,7 +702,7 @@ static void r4k_dma_cache_inv(unsigned l
 		a = addr & ~(dc_lsize - 1);
 		end = (addr + size - 1) & ~(dc_lsize - 1);
 		while (1) {
-			flush_dcache_line(a);	/* Hit_Writeback_Inv_D */
+			invalidate_dcache_line(a);	/* Hit_Invalidate_D */
 			if (a == end)
 				break;
 			a += dc_lsize;

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Performance bug in c-r4k.c cache handling code
  2005-09-19 15:40 Performance bug in c-r4k.c cache handling code Thiemo Seufer
@ 2005-09-19 16:54 ` Atsushi Nemoto
  2005-09-19 18:31   ` Thiemo Seufer
  2005-09-19 17:01 ` Maciej W. Rozycki
  2005-09-19 17:31 ` peter fuerst
  2 siblings, 1 reply; 10+ messages in thread
From: Atsushi Nemoto @ 2005-09-19 16:54 UTC (permalink / raw)
  To: ths; +Cc: linux-mips

>>>>> On Mon, 19 Sep 2005 17:40:56 +0200, Thiemo Seufer <ths@networkno.de> said:

ths> I found an performance bug in c-r4k.c:r4k_dma_cache_inv, where a
ths> Hit_Writeback_Inv instead of Hit_Invalidate is done. Ralf
ths> mentioned this is probably due to broken Hit_Invalidate cache ops
ths> on some CPUs, does anybody have more information about this? The
ths> appended patch works apparently fine on R4400, R4600v2.0, R5000.

Just a question: Are there any performance advantage of using
Hit_Invalidate instead of Hit_Writeback_Inv if the target line was
CLEAN?

---
Atsushi Nemoto

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Performance bug in c-r4k.c cache handling code
  2005-09-19 15:40 Performance bug in c-r4k.c cache handling code Thiemo Seufer
  2005-09-19 16:54 ` Atsushi Nemoto
@ 2005-09-19 17:01 ` Maciej W. Rozycki
  2005-09-20  9:09   ` Dominic Sweetman
  2005-09-19 17:31 ` peter fuerst
  2 siblings, 1 reply; 10+ messages in thread
From: Maciej W. Rozycki @ 2005-09-19 17:01 UTC (permalink / raw)
  To: Thiemo Seufer; +Cc: linux-mips

On Mon, 19 Sep 2005, Thiemo Seufer wrote:

> I found an performance bug in c-r4k.c:r4k_dma_cache_inv, where a
> Hit_Writeback_Inv instead of Hit_Invalidate is done. Ralf mentioned
> this is probably due to broken Hit_Invalidate cache ops on some
> CPUs, does anybody have more information about this? The appended
> patch works apparently fine on R4400, R4600v2.0, R5000.

 It's actually been on my to-do list for research for quite a while.  
These functions are called through pointers, so even if there are errata 
in some processors, I'd be more than happy to use pure invalidations for 
these that work whenever possible.

 FYI, for R4k DECstations the need to flush the cache for newly allocated 
skbs reduces throughput of FDDI reception by about a half (!), down from 
about 90Mbps (that's for the /260); hopefully with no writebacks the 
performance hit will be at least a bit smaller.

  Maciej

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Performance bug in c-r4k.c cache handling code
  2005-09-19 15:40 Performance bug in c-r4k.c cache handling code Thiemo Seufer
  2005-09-19 16:54 ` Atsushi Nemoto
  2005-09-19 17:01 ` Maciej W. Rozycki
@ 2005-09-19 17:31 ` peter fuerst
  2 siblings, 0 replies; 10+ messages in thread
From: peter fuerst @ 2005-09-19 17:31 UTC (permalink / raw)
  To: linux-mips-bounce



Hello,

r4k_dma_cache_wback_inv and r4k_dma_cache_inv always had the same function
body.  With some invocations (on some machine at least ;) this does not only
affect performance, but also is corrupting (dma-)memory, so it had to be
adjusted in the IP28-patches from the beginning.  I had some correspondence
with Ralf about it some months ago. He hesitated to take over this changes
because of the reasons mentioned below. (so btw. IP28 got its own
r10k_dma_cache_inv :)

kind regards

pf


On Mon, 19 Sep 2005, Thiemo Seufer wrote:

> Date: Mon, 19 Sep 2005 17:40:56 +0200
> From: Thiemo Seufer <ths@networkno.de>
> Reply-To: linux-mips-bounce@linux-mips.org
> To: linux-mips@linux-mips.org
> Subject: Performance bug in c-r4k.c cache handling code
>
> Hello All,
>
> I found an performance bug in c-r4k.c:r4k_dma_cache_inv, where a
> Hit_Writeback_Inv instead of Hit_Invalidate is done. Ralf mentioned
> this is probably due to broken Hit_Invalidate cache ops on some
> CPUs, does anybody have more information about this? The appended
> patch works apparently fine on R4400, R4600v2.0, R5000.
>
>
> Thiemo
>
>
> Index: arch/mips/mm/c-r4k.c
> ===================================================================
> RCS file: /home/cvs/linux/arch/mips/mm/c-r4k.c,v
> retrieving revision 1.119
> diff -u -p -r1.119 c-r4k.c
> --- arch/mips/mm/c-r4k.c	9 Sep 2005 20:26:54 -0000	1.119
> +++ arch/mips/mm/c-r4k.c	19 Sep 2005 15:33:35 -0000
> @@ -685,7 +685,7 @@ static void r4k_dma_cache_inv(unsigned l
>  		a = addr & ~(sc_lsize - 1);
>  		end = (addr + size - 1) & ~(sc_lsize - 1);
>  		while (1) {
> -			flush_scache_line(a);	/* Hit_Writeback_Inv_SD */
> +			invalidate_scache_line(a);	/* Hit_Invalidate_SD */
>  			if (a == end)
>  				break;
>  			a += sc_lsize;
> @@ -702,7 +702,7 @@ static void r4k_dma_cache_inv(unsigned l
>  		a = addr & ~(dc_lsize - 1);
>  		end = (addr + size - 1) & ~(dc_lsize - 1);
>  		while (1) {
> -			flush_dcache_line(a);	/* Hit_Writeback_Inv_D */
> +			invalidate_dcache_line(a);	/* Hit_Invalidate_D */
>  			if (a == end)
>  				break;
>  			a += dc_lsize;
>
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Performance bug in c-r4k.c cache handling code
  2005-09-19 16:54 ` Atsushi Nemoto
@ 2005-09-19 18:31   ` Thiemo Seufer
  0 siblings, 0 replies; 10+ messages in thread
From: Thiemo Seufer @ 2005-09-19 18:31 UTC (permalink / raw)
  To: Atsushi Nemoto; +Cc: linux-mips

Atsushi Nemoto wrote:
> >>>>> On Mon, 19 Sep 2005 17:40:56 +0200, Thiemo Seufer <ths@networkno.de> said:
> 
> ths> I found an performance bug in c-r4k.c:r4k_dma_cache_inv, where a
> ths> Hit_Writeback_Inv instead of Hit_Invalidate is done. Ralf
> ths> mentioned this is probably due to broken Hit_Invalidate cache ops
> ths> on some CPUs, does anybody have more information about this? The
> ths> appended patch works apparently fine on R4400, R4600v2.0, R5000.
> 
> Just a question: Are there any performance advantage of using
> Hit_Invalidate instead of Hit_Writeback_Inv if the target line was
> CLEAN?

I wouldn't think so, but it depends on the particular implementation.


Thiemo

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Performance bug in c-r4k.c cache handling code
  2005-09-19 17:01 ` Maciej W. Rozycki
@ 2005-09-20  9:09   ` Dominic Sweetman
  2005-09-20 12:22     ` Ralf Baechle
  2005-09-20 12:37     ` Maciej W. Rozycki
  0 siblings, 2 replies; 10+ messages in thread
From: Dominic Sweetman @ 2005-09-20  9:09 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: Thiemo Seufer, linux-mips

> > I found an performance bug in c-r4k.c:r4k_dma_cache_inv, where a
> > Hit_Writeback_Inv instead of Hit_Invalidate is done.

The MIPS64 spec (which is really all there is to set standards in this
area) regards Hit_Invalidate as optional.  So it would be nice not to
use it.  CPUs have no standard "configuration" register you can read
to establish which cacheops work, so to identify capable CPUs you must
use a table of CPU attributes indexed by the CPU ID, which encourages
the crime of building software which can't possibly run on a new CPU...

So long as the buffer is in fact clean, then in most implementations a
Hit_Writeback_Invalidate will be just as efficient.

Moreover, CPUs always "post" writes to some extent, so a small
percentage of dirty lines can be handled without any great overhead.
So a significant advantage can only occur when the buffer you want to
invalidate (prior to DMA-in) was fairly recently densely written by
the CPU; and this is only safe when all that data can be guaranteed to
now be of no importance to anyone.

Randomly and retrospectively discarding writes could generate some
very interesting bugs, or (indeed) usually hide some very interesting
bugs.  It's the kind of thing one would lik to avoid!

I suppose where DMA data subsequently gets decorated by the CPU then
handed on to some other layer, then the buffer is freed...?

> FYI, for R4k DECstations the need to flush the cache for newly allocated 
> skbs reduces throughput of FDDI reception by about a half (!), down from 
> about 90Mbps (that's for the /260)...

How did you measure the high throughput?  Have you got a
machine with DMA-coherency you can turn on and off?

--
Dominic

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Performance bug in c-r4k.c cache handling code
  2005-09-20  9:09   ` Dominic Sweetman
@ 2005-09-20 12:22     ` Ralf Baechle
  2005-09-20 12:37     ` Maciej W. Rozycki
  1 sibling, 0 replies; 10+ messages in thread
From: Ralf Baechle @ 2005-09-20 12:22 UTC (permalink / raw)
  To: Dominic Sweetman; +Cc: Maciej W. Rozycki, Thiemo Seufer, linux-mips

On Tue, Sep 20, 2005 at 10:09:20AM +0100, Dominic Sweetman wrote:

> > > I found an performance bug in c-r4k.c:r4k_dma_cache_inv, where a
> > > Hit_Writeback_Inv instead of Hit_Invalidate is done.
> 
> The MIPS64 spec (which is really all there is to set standards in this
> area) regards Hit_Invalidate as optional.  So it would be nice not to
> use it.  CPUs have no standard "configuration" register you can read
> to establish which cacheops work, so to identify capable CPUs you must
> use a table of CPU attributes indexed by the CPU ID, which encourages
> the crime of building software which can't possibly run on a new CPU...
> 
> So long as the buffer is in fact clean, then in most implementations a
> Hit_Writeback_Invalidate will be just as efficient.

This are R4700 numbers, the only I was able to find in a quick search.

  Hit_Invalidate_D		 7 cycles for a cache miss
				 9 cycles for a cache hit
  Hit_Writeback_Invalidate_D	 7 cycles for a cache miss
				12 cycles for a cache hit if the cache line
				   is clean.
				14 cycles for a cache hit if the cache line
				   is dirty (Writeback).
  Hit_Writeback_D		 7 cycles for a cache miss
				10 cycles for a cache hit if the cache line
				   is clean
				14 cycles for a cache hit if the cache line
				   is dirty (Writeback).

> Moreover, CPUs always "post" writes to some extent, so a small
> percentage of dirty lines can be handled without any great overhead.
> So a significant advantage can only occur when the buffer you want to
> invalidate (prior to DMA-in) was fairly recently densely written by
> the CPU; and this is only safe when all that data can be guaranteed to
> now be of no importance to anyone.

Linux has a well-defined ABI that DMA drivers are supposed to use; the
functions of this ABI that perform cache flushes also take a DMA
direction argument based on which the implementation can deciede on what
the best flush function for a particular case will be.

> Randomly and retrospectively discarding writes could generate some
> very interesting bugs, or (indeed) usually hide some very interesting
> bugs.  It's the kind of thing one would lik to avoid!
> 
> I suppose where DMA data subsequently gets decorated by the CPU then
> handed on to some other layer, then the buffer is freed...?
> 
> > FYI, for R4k DECstations the need to flush the cache for newly allocated 
> > skbs reduces throughput of FDDI reception by about a half (!), down from 
> > about 90Mbps (that's for the /260)...

Software coherency will result in many server / client type operations
approximate worst case as none of the data will reside in caches.  Routers
are going to be somewhat better off - as long as they don't peek to deep
into the packets, that is.

> How did you measure the high throughput?  Have you got a
> machine with DMA-coherency you can turn on and off?

Afaik AMD Alchemy processors have configurable coherency.

  Ralf

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Performance bug in c-r4k.c cache handling code
  2005-09-20  9:09   ` Dominic Sweetman
  2005-09-20 12:22     ` Ralf Baechle
@ 2005-09-20 12:37     ` Maciej W. Rozycki
  2005-09-20 13:18       ` Dominic Sweetman
  1 sibling, 1 reply; 10+ messages in thread
From: Maciej W. Rozycki @ 2005-09-20 12:37 UTC (permalink / raw)
  To: Dominic Sweetman; +Cc: Thiemo Seufer, linux-mips

On Tue, 20 Sep 2005, Dominic Sweetman wrote:

> > > I found an performance bug in c-r4k.c:r4k_dma_cache_inv, where a
> > > Hit_Writeback_Inv instead of Hit_Invalidate is done.
> 
> The MIPS64 spec (which is really all there is to set standards in this
> area) regards Hit_Invalidate as optional.  So it would be nice not to
> use it.  CPUs have no standard "configuration" register you can read
> to establish which cacheops work, so to identify capable CPUs you must
> use a table of CPU attributes indexed by the CPU ID, which encourages
> the crime of building software which can't possibly run on a new CPU...

 Or just using the safe fallback -- that shouldn't be a problem (these 
functions are called indirectly).  Besides new CPUs more often than not 
require changes to kernel-level software anyway.

> So long as the buffer is in fact clean, then in most implementations a
> Hit_Writeback_Invalidate will be just as efficient.

 I hope so, but who knows what's wired there in all those old systems?...

> I suppose where DMA data subsequently gets decorated by the CPU then
> handed on to some other layer, then the buffer is freed...?

 I don't think the buffer is modified, so cache lines should remain clean. 
For the usual case of IP data is used exactly once for copy_and_csum() 
(more or less) which moves it to another buffer.

> > FYI, for R4k DECstations the need to flush the cache for newly allocated 
> > skbs reduces throughput of FDDI reception by about a half (!), down from 
> > about 90Mbps (that's for the /260)...
> 
> How did you measure the high throughput?  Have you got a
> machine with DMA-coherency you can turn on and off?

 I just disabled invalidations. ;-)  Yes, that resulted in some corrupt 
data, but it was good enough to do benchmarking.  That was an R4400 with 
1MB of S-cache.

 Eventually I should benchmark both invalidation variations against each 
other with the system in question and see if it makes any difference.  
Ironically this is where the write-back cache of the R4k gives loss rather 
than gain as compared to the write-through cache of the R3k (the system 
supports daughtercards with either CPU, so useful comparison is possible) 
as for the former I have to invalidate cache spanning the whole 
newly-allocated buffer, i.e. ~4.5kB, while for the latter I may invalidate 
only the area actually used, once a frame has been received, its length is 
known and quite often much smaller than the maximum (especially if it's 
been routed from a network that has a smaller frame length limit, like 
Ethernet).

  Maciej

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Performance bug in c-r4k.c cache handling code
  2005-09-20 12:37     ` Maciej W. Rozycki
@ 2005-09-20 13:18       ` Dominic Sweetman
  2005-09-20 14:51         ` Maciej W. Rozycki
  0 siblings, 1 reply; 10+ messages in thread
From: Dominic Sweetman @ 2005-09-20 13:18 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: Dominic Sweetman, Thiemo Seufer, linux-mips

Maciej W. Rozycki (macro@linux-mips.org) writes:

> Besides new CPUs more often than not 
> require changes to kernel-level software anyway.

Making sure that isn't so is the reason why there's a MIPS32/64 spec
(with all the privileged operations defined).  Which also avoids the
undesirable development step of new hardware combined with new kernel
software... 

> > How did you measure the high throughput?  Have you got a
> > machine with DMA-coherency you can turn on and off?
> 
>  I just disabled invalidations. ;-)

Ouch.  So the effect could have come from a variety of sources.

> That was an R4400 with 1MB of S-cache.

With an R4400 S-cache, any difference between "would write it back but
it's clean" and "just invalidate" is likely to be small, since in
either case the time will be dominated by the (external) cache tag
memory RMW operation.

> Eventually I should benchmark both invalidation variations against each 
> other with the system in question and see if it makes any difference.  

Indeed.  And it might also be a good idea to test a more modern
system, too, to see how big an effect this might be.

> Ironically this is where the write-back cache of the R4k gives loss
> rather than gain as compared to the write-through cache of the R3k
> (the system supports daughtercards with either CPU, so useful
> comparison is possible)...

Maybe.  But remember, on the R3K every write was a write through, and
they all had a cost in bus congestion, which may have delayed a
following read and held up the CPU (or the write buffer may have
filled and stalled the CPU). 

I think up to about 33MHz write-through remained a tolerable policy
for 1988-era memory systems; any faster than that and you just sank
under a flood of writes.  2005-era memory systems are much faster when
bursting, but the time they take to process a single write cycle has
improved by less than 2x.  So write-through is still a really bad idea
for 100MHz CPUs using off-chip memory.

Even when your device requires you to push out all the data it can be
more efficient to write data to the cache and then force writeback to
memory: at least that way the data goes to the memory in efficient
burst cycles.

--
Dominic

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Performance bug in c-r4k.c cache handling code
  2005-09-20 13:18       ` Dominic Sweetman
@ 2005-09-20 14:51         ` Maciej W. Rozycki
  0 siblings, 0 replies; 10+ messages in thread
From: Maciej W. Rozycki @ 2005-09-20 14:51 UTC (permalink / raw)
  To: Dominic Sweetman; +Cc: Thiemo Seufer, linux-mips

On Tue, 20 Sep 2005, Dominic Sweetman wrote:

> > Besides new CPUs more often than not 
> > require changes to kernel-level software anyway.
> 
> Making sure that isn't so is the reason why there's a MIPS32/64 spec
> (with all the privileged operations defined).  Which also avoids the
> undesirable development step of new hardware combined with new kernel
> software... 

 Except that clairvoyance does not work reliably in the long run, so while 
you may be able to run old software with no changes on a new chip using a 
subset of its capabilities, you probably still want to update your kernel 
to better exploit the new design.

> >  I just disabled invalidations. ;-)
> 
> Ouch.  So the effect could have come from a variety of sources.

 Yes, I should have probably really done a better arrangement -- but I had 
limited time for testing, so I did just a rough estimate.  I should get 
back to it eventually as that piece of code requires further tuning.

> > Ironically this is where the write-back cache of the R4k gives loss
> > rather than gain as compared to the write-through cache of the R3k
> > (the system supports daughtercards with either CPU, so useful
> > comparison is possible)...
> 
> Maybe.  But remember, on the R3K every write was a write through, and
> they all had a cost in bus congestion, which may have delayed a
> following read and held up the CPU (or the write buffer may have
> filled and stalled the CPU). 

 Effective bandwidth of DRAM memory for the system is documented to be 
100MB/s and the R3k used (the PACEMIPS R3400) is clocked @40MHz (the R4400 
on the other daughterboard is clocked @60MHz).  Therefore while I believe 
congestion is indeed possible, it's not really that common with PIO, 
especially as the CPU has a priority over DMA traffic.  Except from 
partial word writes which require RMW cycles at the memory controller 
level due to ECC (but they are not used for bulk transfers anyway).

> I think up to about 33MHz write-through remained a tolerable policy
> for 1988-era memory systems; any faster than that and you just sank
> under a flood of writes.  2005-era memory systems are much faster when
> bursting, but the time they take to process a single write cycle has
> improved by less than 2x.  So write-through is still a really bad idea
> for 100MHz CPUs using off-chip memory.

 Certainly.  It's just systems with a lot of DMA traffic do really beg for 
hardware-maintained coherence.

> Even when your device requires you to push out all the data it can be
> more efficient to write data to the cache and then force writeback to
> memory: at least that way the data goes to the memory in efficient
> burst cycles.

 That's assuming cache does write-allocation, which is not always the 
case.

  Maciej

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2005-09-20 14:51 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-09-19 15:40 Performance bug in c-r4k.c cache handling code Thiemo Seufer
2005-09-19 16:54 ` Atsushi Nemoto
2005-09-19 18:31   ` Thiemo Seufer
2005-09-19 17:01 ` Maciej W. Rozycki
2005-09-20  9:09   ` Dominic Sweetman
2005-09-20 12:22     ` Ralf Baechle
2005-09-20 12:37     ` Maciej W. Rozycki
2005-09-20 13:18       ` Dominic Sweetman
2005-09-20 14:51         ` Maciej W. Rozycki
2005-09-19 17:31 ` peter fuerst

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.