* Possible bug in flush_dcache_all on 440GP
@ 2003-02-26 4:08 Eugene Surovegin
From: Eugene Surovegin @ 2003-02-26 4:08 UTC (permalink / raw)
To: linuxppc-embedded
Hi all!
I believe there is a bug in the flush_dcache_all implementation for
non-cache-coherent processors.
This function uses a simple algorithm to force a dcache flush by reading
"enough" data to completely reload the cache:
/*
 * 40x cores have 8K or 16K dcache and 32 byte line size.
 * 440 has a 32K dcache and 32 byte line size.
 * 8xx has 1, 2, 4, 8K variants.
 * For now, cover the worst case of the 440.
 * When we get a cputable cache size entry we can do the right thing.
 */
#define CACHE_NWAYS	64
#define CACHE_NLINES	16

_GLOBAL(flush_dcache_all)
	li	r4, (CACHE_NWAYS * CACHE_NLINES)
	mtctr	r4
	lis	r5, KERNELBASE@h
1:	lwz	r3, 0(r5)		/* Load one word from every line */
	addi	r5, r5, L1_CACHE_LINE_SIZE
	bdnz	1b
	blr
This function relies on the assumption that __every__ load operation will
cause a cache miss, so it executes CACHE_NWAYS * CACHE_NLINES loads to
force a complete cache reload. It uses memory from the beginning of the
kernel for this purpose.
A problem may arise if some of the addresses in this range (starting at
KERNELBASE) are already in the dcache (for example, from a _previous_ call
to flush_dcache_all).
Here are the technical details:
The dcache on the 440GP is 64-way set-associative. There is a register for
each cache set (called the data cache victim index register) which holds the
"way" number to be used for the next cache-miss-triggered load. It is
incremented in round-robin manner after each cache line fill.
flush_dcache_all _may_ cause up to 64 cache misses for each cache set, in
which case all ways will be reloaded. But if there are fewer than 64 misses
in a set (because some of the loads hit lines that are already cached), not
all ways will be reloaded, and dirty data may never reach physical memory.
Interestingly, the current flush_dcache_all implementation seems to be OK
for all CPUs with a dcache _smaller_ than 32K. This is because touching
_twice_ as much memory as the cache size will _always_ completely reload
the cache. (The loop above touches CACHE_NWAYS * CACHE_NLINES * 32 bytes =
32K, which is exactly the 440 dcache size, but at least twice the dcache
size of the smaller-cache CPUs.)
I can think of two possible ways to fix this function:

1) Use twice as much memory as the cache size. This solution is not very
efficient, but it doesn't add _any_ special requirements on the memory we
use to reload the cache.

2) Add "dccci 0, 0" just before "blr", as sketched below. This still
assumes that we use memory which normally is _not_ loaded into the dcache
(e.g. code at KERNELBASE).
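For illustration, option 2 would look something like this (untested, just
to show where the dccci goes; the loop itself is unchanged from the current
implementation):

_GLOBAL(flush_dcache_all)
	li	r4, (CACHE_NWAYS * CACHE_NLINES)
	mtctr	r4
	lis	r5, KERNELBASE@h
1:	lwz	r3, 0(r5)		/* Load one word from every line */
	addi	r5, r5, L1_CACHE_LINE_SIZE
	bdnz	1b
	dccci	0, 0			/* invalidate the dcache so no KERNELBASE
					   lines are left resident for the next call */
	blr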
Eugene.
* Re: Possible bug in flush_dcache_all on 440GP
@ 2003-02-26 15:39 Segher Boessenkool
From: Segher Boessenkool @ 2003-02-26 15:39 UTC (permalink / raw)
To: Eugene Surovegin; +Cc: linuxppc-embedded
Eugene Surovegin wrote:
> I believe there is a bug in the flush_dcache_all implementation for
> non-cache-coherent processors.
>
> This function uses a simple algorithm to force a dcache flush by reading
> "enough" data to completely reload the cache:
[snip]
So you're saying it doesn't use an LRU replacement algorithm
but a FIFO one?
> 1) Use twice as much memory as the cache size. This solution is not very
> efficient, but it doesn't add _any_ special requirements on the memory we
> use to reload the cache.
That doesn't work correctly, either, in that case. You have to read the
same memory region twice, not read a twice-as-large region once.
Segher
* Re: Possible bug in flush_dcache_all on 440GP
@ 2003-02-28 15:00 Ralph Blach
From: Ralph Blach @ 2003-02-28 15:00 UTC (permalink / raw)
To: linuxppc-embedded
This comment is from Thomas Sartorius:
Eugene is correct about the "generic" way requiring that you load twice as
many memory locations as would fit in the cache, in order to guarantee that
any previous "dirty" contents get written to memory and removed from the
cache.
Note that his second suggestion regarding dccci requires that the processor
be in supervisor mode, and assumes that there is no dirty data left in the
cache at the time of the dccci (or else one doesn't care that such dirty
data is discarded without ever being written back to memory).
An alternative (and likely faster) method is to use a series of dcbz
instructions (as many as there are lines in the cache) to a series of
"safe" addresses for which it is known that the cache does not currently
contain dirty data, and then use dccci at the end to eliminate this dirty
data without causing any of it to be written back to memory. This
technique should be much faster as it avoids having to actually read any
memory locations into the cache.
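For illustration, that sequence might look something like this (untested;
"safe_buffer" is just a placeholder name for a cache-line-aligned, mapped
region, at least as large as the dcache, that is known to have no dirty
lines in the cache):

	li	r4, (CACHE_NWAYS * CACHE_NLINES)
	mtctr	r4
	lis	r5, safe_buffer@ha
	addi	r5, r5, safe_buffer@l
1:	dcbz	0, r5			/* allocate and zero one line; each allocation
					   casts out one victim way, flushing it if dirty */
	addi	r5, r5, L1_CACHE_LINE_SIZE
	bdnz	1b
	dccci	0, 0			/* throw away the zeroed lines without
					   writing them back to safe_buffer */
	blr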
Another alternative is to use a loop that does a series of dcread/dcbf
instructions, where the information that is read into the GPR by the dcread
is then used by the dcbf to cause that line to be cast-out and invalidated.
Depending on the possible state of the cache, it might be necessary to test
the valid bit read by the dcread before trying to use the value for the
dcbf, to avoid any MMU exceptions.
One thing to note with regards to any of the techniques: you probably need
to guarantee that interrupts do not occur during the sequence to make sure
that the cache is cleanly flushed when the routine is finished.
One more thing to note with regards to any of the techniques: you need to
concern yourself with possible MMU exceptions during the sequence.
One last thing to note: if you're using any of the techniques other than
the dcread/dcbf sequence, then you need to concern yourself with the
"victim limit" values, and whether or not the cache has been partitioned
into "normal". "transient", and "locked" regions. The techniques described
all presume that "normal" storage access operations will cause the "victim
index" value to walk through all the values from 0 to 63, but if the cache
has been partitioned, this will not be the case.
In the end, I would suggest that the "safest", most robust technique is to
use the dcread/dcbf sequence loop, with proper testing of the dcread result
(e.g., for a valid bit) before executing the dcbf, and with proper MMU
setup ahead of time to make sure you don't get MMU exceptions during the
sequence.
One last thing: Eugene suggests that "40x" processors have 32-byte cache
lines, but that is not the case for the 403 and 401 (they have 16-byte
cache lines).
Segher Boessenkool <segher@koffie.nl>@lists.linuxppc.org on 02/26/2003
10:39:05 AM
Sent by: owner-linuxppc-embedded@lists.linuxppc.org
To: Eugene Surovegin <ebs@ebshome.net>
cc: linuxppc-embedded@lists.linuxppc.org
Subject: Re: Possible bug in flush_dcache_all on 440GP
Eugene Surovegin wrote:
> I believe there is a bug in the flush_dcache_all implementation for
> non-cache-coherent processors.
>
> This function uses a simple algorithm to force a dcache flush by reading
> "enough" data to completely reload the cache: