* Re: flush_icache_range
2005-03-15 12:40 flush_icache_range Zoltan Menyhart
@ 2005-03-15 18:21 ` David Mosberger
2005-03-16 10:58 ` flush_icache_range Zoltan Menyhart
` (29 subsequent siblings)
30 siblings, 0 replies; 32+ messages in thread
From: David Mosberger @ 2005-03-15 18:21 UTC (permalink / raw)
To: linux-ia64
>>>>> On Tue, 15 Mar 2005 13:40:21 +0100, Zoltan Menyhart <Zoltan.Menyhart@bull.net> said:
Zoltan> Apparently, the function flush_icache_range() flushes the
Zoltan> caches 32 bytes at a time.
Zoltan> According to some measurements on a Tiger box, an "fc" instruction
Zoltan> costs 200 nanosec. if no other CPU has the line in its cache,
Zoltan> there is no traffic on the bus, and everything is ideal.
Zoltan> If all the other CPUs have the line in their caches, they post
Zoltan> bus transactions, and then the cost of an "fc" instruction is 5
Zoltan> microsec.
Zoltan> To flush a full page of 64 Kbytes, it can take from 400 microsec.
Zoltan> to 10 millisec.
Zoltan> Can we not test the characteristics of the CPUs at boot
Zoltan> time and select the optimal flush_icache_range() ? E.g.:
Zoltan> - if the CPU has 64 bytes / L1 lines =>
Zoltan> flush by use of 64 byte steps
Zoltan> - if the CPU implements the "fc.i" instruction =>
Zoltan> flush the I-caches only
Does it actually make any difference? The expensive part of "fc" is
when it causes write-backs and you end up being memory-bandwidth
limited. With a 64-byte stride, the CPU would do less work, but you'd
still be bottlenecked by the write-back speed.
A 64-byte stride would help a bit when the cache is already clean.
IIRC, it didn't make much of a difference when I measured it last,
though.
OTOH, if it really is a performance advantage, we could relatively
easily do a runtime patch of the stride in the flush-icache routine.
As far as fc vs. fc.i goes: I submitted a patch to Tony for that a few
days/weeks ago. In practice, it's not going to make a difference on
current CPUs because fc.i is just an alias for fc.
--david
^ permalink raw reply [flat|nested] 32+ messages in thread

* Re: flush_icache_range
From: Zoltan Menyhart @ 2005-03-16 10:58 UTC (permalink / raw)
To: linux-ia64
David Mosberger wrote:
> Zoltan> Apparently, the function flush_icache_range() flushes the
> Zoltan> caches 32 bytes at a time.
> Zoltan> According to some measurements on a Tiger box, an "fc" instruction
> Zoltan> costs 200 nanosec. if no other CPU has the line in its cache,
> Zoltan> there is no traffic on the bus, and everything is ideal.
> Zoltan> If all the other CPUs have the line in their caches, they post
> Zoltan> bus transactions, and then the cost of an "fc" instruction is 5
> Zoltan> microsec.
> Zoltan> To flush a full page of 64 Kbytes, it can take from 400 microsec.
> Zoltan> to 10 millisec.
>
> Zoltan> Can we not test the characteristics of the CPUs at boot
> Zoltan> time and select the optimal flush_icache_range() ? E.g.:
> Zoltan> - if the CPU has 64 bytes / L1 lines =>
> Zoltan> flush by use of 64 byte steps
> Zoltan> - if the CPU implements the "fc.i" instruction =>
> Zoltan> flush the I-caches only
>
> Does it actually make any difference? The expensive part of "fc" is
> when it causes write-backs and you end up being memory-bandwidth
> limited. With a 64-byte stride, the CPU would do less work, but you'd
> still be bottlenecked by the write-back speed.
I ran flush_icache_range() 1000 times for the same page
(i.e. the "fc" really has nothing to do).
The other CPUs were idle; no traffic on the bus.
I simply took the ITC value before and after...
Here are the values (average for the 1000 runs):
With a 64-byte stride: 110143 nsec 187218 cycles
With a 32-byte stride: 225606 nsec 383477 cycles
processor : 7
vendor : GenuineIntel
arch : IA-64
family : Itanium 2
model : 2
revision : 1
archrev : 0
features : branchlong
cpu number : 0
cpu regs : 4
cpu MHz : 1699.762994
itc MHz : 1699.762994
BogoMIPS : 2541.74
I think the CPU sends out the snoop requests anyway.
I guess it can send out a second snoop request before the first
one is acknowledged, which is why it is somewhat quicker than the
400 microsec. I wrote about before.
I think saving more than 100 microsec. per page and reducing the bus
traffic is worthwhile.
Thanks,
Zoltan
^ permalink raw reply [flat|nested] 32+ messages in thread

* Re: flush_icache_range
From: Duraid Madina @ 2005-03-16 11:19 UTC (permalink / raw)
To: linux-ia64
naughty! :)
Duraid
^ permalink raw reply [flat|nested] 32+ messages in thread

* Re: flush_icache_range
From: David Mosberger @ 2005-03-16 18:31 UTC (permalink / raw)
To: linux-ia64
>>>>> On Wed, 16 Mar 2005 11:58:17 +0100, Zoltan Menyhart <Zoltan.Menyhart@bull.net> said:
Zoltan> I ran flush_icache_range() for 1000 times for the same page
Zoltan> (i.e. the "fc" has really nothing to do). The other CPUs
Zoltan> were idle. No traffic on the bus. I simply took the ITC
Zoltan> value before and after... Here are the values (average for
Zoltan> the 1000 runs):
Zoltan> With a 64-byte stride: 110143 nsec 187218 cycles
Zoltan> With a 32-byte stride: 225606 nsec 383477 cycles
That's definitely a worthwhile improvement. I re-checked and it turns
out that I misremembered what I measured: the test case I had was
checking whether a better-scheduled loop body would help. I think I
actually wrote it in the Merced days, so I couldn't even have tested a
64-byte stride at that time.
I re-ran the test case now and got these results:
 page     cache-line          stride
 size       state       32 bytes    64 bytes
 -------------------------------------------------------------
 16 KB      dirty        32,000      22,000  (86 cyc/line)
            clean        26,000      12,800  (50 cyc/line)
 -------------------------------------------------------------
 64 KB      dirty       130,000      85,000  (83 cyc/line)
            clean       105,000      54,000  (52 cyc/line)
 -------------------------------------------------------------
While all the numbers are substantially lower than what you're seeing,
clearly using a 64-byte stride is a big win. I assume the difference
between our results is due to chipsets. My measurements were done
with a 1.5GHz/6M Madison and the zx1 chipset, which doesn't go beyond
4-way (hence latency tends to be substantially better than with more
scalable chipsets).
--david
^ permalink raw reply [flat|nested] 32+ messages in thread

* Re: flush_icache_range
From: Zoltan Menyhart @ 2005-05-20 14:17 UTC (permalink / raw)
To: linux-ia64
[-- Attachment #1: Type: text/plain, Size: 1383 bytes --]
Here is a small patch that flushes the i-cache 64 bytes
at a time on Itanium 2 (or later).
Some measurements on a Tiger box with the indicated CPUs:
processor : 0
vendor : GenuineIntel
arch : IA-64
family : Itanium 2
model : 1
revision : 5
archrev : 0
features : branchlong
cpu number : 0
cpu regs : 4
cpu MHz : 1296.439995
itc MHz : 1296.439995
BogoMIPS : 1941.96
etc...
Flushing a page of 64 Kbytes (the other CPUs do not do anything;
they have none of my data in their caches):
With a 32-byte stride:
Modified in d-cache: cycles = 215 K, time = 169 usec
Valid: cycles = 222 K, time = 171 usec
Invalid: cycles = 222 K, time = 171 usec
Note that for the dirty case, only the 1st flush causes a write-
back from the L2 / L3 caches, the 3 other flushes find the cache
entries invalid in the L2 / L3 caches.
With a 64-byte stride:
Modified in d-cache: cycles = 63 K, time = 49 usec
Valid: cycles = 116 K, time = 89 usec
Invalid: cycles = 116 K, time = 89 usec
It is funny to see that the dirty lines can be flushed more
efficiently. I guess the CPU knows in such a case that the others
cannot have anything to flush, so the flush request may not even be
issued to the other CPUs.
I also tried issuing more than one flush per loop-body
iteration; it did not help.
Thanks,
Zoltan
[-- Attachment #2: diff --]
[-- Type: text/plain, Size: 1280 bytes --]
diff -Nru linux-2.6.11-old/arch/ia64/lib/flush.S linux-2.6.11/arch/ia64/lib/flush.S
--- linux-2.6.11-old/arch/ia64/lib/flush.S 2005-05-20 15:26:18.330498876 +0200
+++ linux-2.6.11/arch/ia64/lib/flush.S 2005-05-20 15:28:25.639091067 +0200
@@ -7,6 +7,23 @@
#include <asm/asmmacro.h>
#include <asm/page.h>
+
+/*
+ * Note that "L1_CACHE_SHIFT" and "L1_CACHE_BYTES" defined in
+ * include/asm-ia64/cache.h are not what their names suggest.
+ * They actually define the cache line size for L2.
+ *
+ * We have to flush the L1 i-cache, too.
+ */
+#if defined(CONFIG_ITANIUM)
+#define L1_CACHE_SHIFT 5
+#else
+#define L1_CACHE_SHIFT 6
+#endif
+
+#define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT)
+
+
/*
* flush_icache_range(start,end)
* Must flush range from start to end-1 but nothing else (need to
@@ -17,7 +34,7 @@
alloc r2=ar.pfs,2,0,0,0
sub r8=in1,in0,1
;;
- shr.u r8=r8,5 // we flush 32 bytes per iteration
+ shr.u r8=r8,L1_CACHE_SHIFT // we flush L1_CACHE_BYTES bytes per iteration
.save ar.lc, r3
mov r3=ar.lc // save ar.lc
;;
@@ -26,8 +43,12 @@
mov ar.lc=r8
;;
+#if defined(CONFIG_ITANIUM)
.Loop: fc in0 // issuable on M0 only
- add in0=32,in0
+#else
+.Loop: fc.i in0
+#endif
+ add in0=L1_CACHE_BYTES,in0
br.cloop.sptk.few .Loop
;;
sync.i
^ permalink raw reply [flat|nested] 32+ messages in thread

* Re: flush_icache_range
From: David Mosberger @ 2005-05-20 15:03 UTC (permalink / raw)
To: linux-ia64
>>>>> On Fri, 20 May 2005 16:17:51 +0200, Zoltan Menyhart <Zoltan.Menyhart@bull.net> said:
Zoltan> +#if defined(CONFIG_ITANIUM)
Zoltan> .Loop: fc in0 // issuable on M0 only
Zoltan> +#else
Zoltan> +.Loop: fc.i in0
Zoltan> +#endif
Why this? fc.i on Merced is a NOP (as it is on McKinley etc.).
--david
^ permalink raw reply [flat|nested] 32+ messages in thread

* Re: flush_icache_range
From: Zoltan Menyhart @ 2005-05-23 13:43 UTC (permalink / raw)
To: linux-ia64
[-- Attachment #1: Type: text/plain, Size: 2667 bytes --]
The Itanium 2 Processor Reference Manual for Software Development and
Optimization (May 2004) says in chapter 5.8:
"In Itanium 2 processor, each fc will invalidate 128 bytes corresponding
to the L3 cache line size. Since both the L1I and L1D have line sizes of
64 bytes, a single fc instruction can invalidate two lines."
Can someone please confirm that an equivalent statement is true for
"fc.i", too ?
Say:
"In Itanium 2 processor, each fc.i instruction will ensure that 128 bytes
(corresponding to the L3 cache line size) of the I-cache(s) be coherent with
the data caches. Since the L1I cache has line sizes of 64 bytes, a single
fc.i instruction can make coherent two lines."
This gave me the idea to try 128-byte strides
(the measurements are repeated 10 times):
Modified in d-cache:
cycles = 19,164 time = 14.782 usec
cycles = 18,060 time = 13.930 usec
cycles = 16,929 time = 13.058 usec
cycles = 17,597 time = 13.573 usec
cycles = 17,163 time = 13.239 usec
cycles = 16,990 time = 13.105 usec
cycles = 17,427 time = 13.442 usec
cycles = 17,028 time = 13.134 usec
cycles = 16,993 time = 13.107 usec
cycles = 16,930 time = 13.059 usec
Valid:
cycles = 13,514 time = 10.424 usec
cycles = 13,518 time = 10.427 usec
cycles = 13,518 time = 10.427 usec
cycles = 13,518 time = 10.427 usec
cycles = 13,518 time = 10.427 usec
cycles = 13,746 time = 10.603 usec
cycles = 13,866 time = 10.695 usec
cycles = 13,830 time = 10.668 usec
cycles = 13,790 time = 10.637 usec
cycles = 13,830 time = 10.668 usec
Invalid:
cycles = 13,794 time = 10.640 usec
cycles = 13,790 time = 10.637 usec
cycles = 13,830 time = 10.668 usec
cycles = 13,830 time = 10.668 usec
cycles = 13,966 time = 10.773 usec
cycles = 13,994 time = 10.794 usec
cycles = 14,074 time = 10.856 usec
cycles = 13,574 time = 10.470 usec
cycles = 13,902 time = 10.723 usec
cycles = 14,114 time = 10.887 usec
I got an incredibly low number of cycles,
compared to my previous results:
With a 32-byte stride:
Modified in d-cache: cycles = 215 K, time = 169 usec
Valid: cycles = 222 K, time = 171 usec
Invalid: cycles = 222 K, time = 171 usec
With a 64-byte stride:
Modified in d-cache: cycles = 63 K, time = 49 usec
Valid: cycles = 116 K, time = 89 usec
Invalid: cycles = 116 K, time = 89 usec
This is a Tiger box with the following CPUs:
processor : 0
vendor : GenuineIntel
arch : IA-64
family : Itanium 2
model : 1
revision : 5
archrev : 0
features : branchlong
cpu number : 0
cpu regs : 4
cpu MHz : 1296.435998
itc MHz : 1296.435998
BogoMIPS : 1941.96
etc...
Can these results be real?
Thanks,
Zoltan
[-- Attachment #2: diff --]
[-- Type: text/plain, Size: 1217 bytes --]
--- linux-2.6.11-orig/arch/ia64/lib/flush.S 2005-04-26 15:59:49.000000000 +0200
+++ linux-2.6.11/arch/ia64/lib/flush.S 2005-05-23 15:30:24.891935385 +0200
@@ -7,6 +7,22 @@
#include <asm/asmmacro.h>
#include <asm/page.h>
+
+#if defined(CONFIG_ITANIUM)
+#define CACHE_SHIFT 5
+#else
+/*
+ * In Itanium 2 processor, each fc.i instruction will ensure that 128 bytes
+ * (corresponding to the L3 cache line size) of the I-cache(s) be coherent with
+ * the data caches. Since the L1I cache has line sizes of 64 bytes, a single
+ * fc.i instruction can make coherent two lines.
+ */
+#define CACHE_SHIFT 7
+#endif
+
+#define CACHE_BYTES (1 << CACHE_SHIFT)
+
+
/*
* flush_icache_range(start,end)
* Must flush range from start to end-1 but nothing else (need to
@@ -17,7 +33,7 @@
alloc r2=ar.pfs,2,0,0,0
sub r8=in1,in0,1
;;
- shr.u r8=r8,5 // we flush 32 bytes per iteration
+ shr.u r8=r8,CACHE_SHIFT // we flush CACHE_BYTES bytes per iteration
.save ar.lc, r3
mov r3=ar.lc // save ar.lc
;;
@@ -26,8 +42,8 @@
mov ar.lc=r8
;;
-.Loop: fc in0 // issuable on M0 only
- add in0=32,in0
+.Loop: fc.i in0 // issuable on M0 only
+ add in0=CACHE_BYTES,in0
br.cloop.sptk.few .Loop
;;
sync.i
^ permalink raw reply [flat|nested] 32+ messages in thread

* Re: flush_icache_range
From: David Mosberger @ 2005-05-26 17:21 UTC (permalink / raw)
To: linux-ia64
>>>>> On Mon, 23 May 2005 15:43:16 +0200, Zoltan Menyhart <Zoltan.Menyhart@bull.net> said:
Zoltan> The Itanium 2 processor Reference Manual for SW development
Zoltan> & optimization (May 2004) says in the chapter 5.8:
Zoltan> "In Itanium 2 processor, each fc will invalidate 128 bytes
Zoltan> corresponding to the L3 cache line size. Since both the L1I
Zoltan> and L1D have line sizes of 64 bytes, a single fc instruction
Zoltan> can invalidate two lines."
Zoltan> Can someone please confirm that an equivalent statement is
Zoltan> true for the "fc.i", too ? Say:
Zoltan> "In Itanium 2 processor, each fc.i instruction will ensure
Zoltan> that 128 bytes (corresponding to the L3 cache line size) of
Zoltan> the I-cache(s) be coherent with the data caches. Since the
Zoltan> L1I cache has line sizes of 64 bytes, a single fc.i
Zoltan> instruction can make coherent two lines."
On Itanium 2, fc.i maps to fc, so I'd say by definition it is true
that fc.i has the same effect.
BTW: I would prefer it if we picked up the stride from the PAL info.
That way, we don't have to add ugly #ifdefs and hope that we get the
right stride. I just checked on a Madison and, indeed, it reports:
Data Cache level 1:
Size : 16384 bytes
Attributes : WriteThrough
Associativity : 4
Line size : 64 bytes
Stride : 128 bytes
:
Note that the "Stride" is 128 bytes. The PAL manual defines the
stride as:
the most effective stride in bytes for flushing the cache
So, what we could do is walk through the PAL cache-info, determine the
minimum stride and store the min. stride in a per-CPU variable.
Zoltan> Modified in d-cache:
Zoltan> cycles = 19,164 time = 14.782 usec
Zoltan> [snip...]
Zoltan> Can these results be real?
Can you remind me how much data you flushed?
--david
^ permalink raw reply [flat|nested] 32+ messages in thread

* RE: flush_icache_range
From: Seth, Rohit @ 2005-05-26 17:39 UTC (permalink / raw)
To: linux-ia64
David Mosberger <> wrote on Thursday, May 26, 2005 10:22 AM:
> BTW: I would prefer if we picked up the stride from the PAL info.
> That way, we don't have to add ugly #ifdef's and hope that we get the
> right stride. I just checked on a Madison and, indeed it reports:
>
Just want to confirm that this is the correct expectation: the stride
size (as given by pal_cache_info) is the minimal number of bytes that
will be flushed by fc.
^ permalink raw reply [flat|nested] 32+ messages in thread

* Re: flush_icache_range
From: Zoltan Menyhart @ 2005-05-27 15:45 UTC (permalink / raw)
To: linux-ia64
[-- Attachment #1: Type: text/plain, Size: 731 bytes --]
David Mosberger wrote:
> BTW: I would prefer if we picked up the stride from the PAL info.
> That way, we don't have to add ugly #ifdef's and hope that we get the
> right stride. I just checked on a Madison and, indeed it reports:
> [...]
> So, what we could do is walk through the PAL cache-info, determine the
> minimum stride and store the min. stride in a per-CPU variable.
Well, if we want to get rid of all the #ifdef CONFIG_ITANIUM's...
I propose to support homogeneous systems only.
(Not a real restriction.)
Should any inhomogeneity or error be found, my code falls back to the
old golden stride size of 32 bytes.
> Can you remind me how much data you flushed?
It was a page of 64 Kbytes each time.
Thanks,
Zoltan
[-- Attachment #2: diff --]
[-- Type: text/plain, Size: 4321 bytes --]
--- linux-2.6.11-orig/arch/ia64/lib/flush.S 2005-04-26 15:59:49.000000000 +0200
+++ linux-2.6.11/arch/ia64/lib/flush.S 2005-05-27 14:26:15.326080510 +0200
@@ -3,37 +3,51 @@
*
* Copyright (C) 1999-2001 Hewlett-Packard Co
* Copyright (C) 1999-2001 David Mosberger-Tang <davidm@hpl.hp.com>
+ *
+ * 05/28/05 Zoltan Menyhart Dynamic stride size
*/
+
#include <asm/asmmacro.h>
#include <asm/page.h>
+
/*
* flush_icache_range(start,end)
- * Must flush range from start to end-1 but nothing else (need to
+ *
+ * Make i-cache(s) coherent with d-caches.
+ *
+ * Must deal with range from start to end-1 but nothing else (need to
* be careful not to touch addresses that may be unmapped).
*/
GLOBAL_ENTRY(flush_icache_range)
+
.prologue
- alloc r2=ar.pfs,2,0,0,0
- sub r8=in1,in0,1
- ;;
- shr.u r8=r8,5 // we flush 32 bytes per iteration
- .save ar.lc, r3
- mov r3=ar.lc // save ar.lc
+ alloc r2=ar.pfs,2,0,0,0
+ movl r3=log_2_i_cache_stride_size
+ mov r21=1
+ ;;
+ ld8 r20=[r3] // r20: log2( stride size of the i-cache(s) )
+ sub r8=in1,in0,1
+ ;;
+ shl r21=r21,r20 // r21: stride size of the i-cache(s)
+ shr.u r8=r8,r20 // we flush "stride size" bytes per iteration
+
+ .save ar.lc, r3
+ mov r3=ar.lc // save ar.lc
;;
.body
- mov ar.lc=r8
+ mov ar.lc=r8
;;
-.Loop: fc in0 // issuable on M0 only
- add in0=32,in0
+.Loop: fc.i in0 // issuable on M0 only
+ add in0=r21,in0
br.cloop.sptk.few .Loop
;;
sync.i
;;
srlz.i
;;
- mov ar.lc=r3 // restore ar.lc
+ mov ar.lc=r3 // restore ar.lc
br.ret.sptk.many rp
END(flush_icache_range)
--- linux-2.6.11-orig/arch/ia64/kernel/setup.c 2005-04-26 15:59:49.000000000 +0200
+++ linux-2.6.11/arch/ia64/kernel/setup.c 2005-05-27 17:41:04.680429503 +0200
@@ -15,6 +15,7 @@
* 02/01/00 R.Seth fixed get_cpuinfo for SMP
* 01/07/99 S.Eranian added the support for command line argument
* 06/24/99 W.Drummond added boot_cpu_data.
+ * 05/28/05 Z. Menyhart Dynamic stride size for "flush_icache_range()"
*/
#include <linux/config.h>
#include <linux/module.h>
@@ -78,6 +79,14 @@
EXPORT_SYMBOL(io_space);
unsigned int num_io_spaces;
+/*
+ * "flush_icache_range()" needs to know what processor dependent stride size to use
+ * when it makes i-cache(s) coherent with d-caches.
+ */
+#define LOG_2_I_CACHE_STRIDE_SIZE 5 /* Safest way to go: 32 bytes by 32 bytes */
+unsigned int log_2_i_cache_stride_size;
+static unsigned int have_found_i_cache_stride_size; /* Not yet */
+
unsigned char aux_device_present = 0xaa; /* XXX remove this when legacy I/O is gone */
/*
@@ -624,6 +633,47 @@
ia64_max_cacheline_size = max;
}
+
+/*
+ * "flush_icache_range()" needs to know what processor dependent stride size to use
+ * when it makes i-cache(s) coherent with d-caches.
+ *
+ * Paranoia: all the CPUs are required to have the same stride size.
+ */
+static void
+get_i_cache_stride_size (void)
+{
+ pal_cache_config_info_t cci;
+ s64 status;
+
+ /*
+	 * We assume that the stride size of the L2I cache (if it exists) is the
+	 * same as that of the L1I cache.
+ */
+ status = ia64_pal_cache_config_info(/* cache_level ( 0 means L1 ) */ 0,
+ /* cache_type (instruction)= */ 1, &cci);
+ if (status != 0) {
+ printk(KERN_ERR
+ "%s: ia64_pal_cache_config_info(L1I) failed (status=%ld CPU=%d)\n",
+ __FUNCTION__, status, smp_processor_id());
+ log_2_i_cache_stride_size = LOG_2_I_CACHE_STRIDE_SIZE;
+ return;
+ }
+ if (have_found_i_cache_stride_size) {
+ if (log_2_i_cache_stride_size != cci.pcci_stride) {
+ printk(KERN_ERR
+ "%s: L1I cache stride size %d on CPU %d is incoherent "
+ "with previously seen value %d\n",
+ __FUNCTION__, 1 << cci.pcci_stride, smp_processor_id(),
+ 1 << log_2_i_cache_stride_size);
+ log_2_i_cache_stride_size = LOG_2_I_CACHE_STRIDE_SIZE;
+ }
+ return;
+ }
+ log_2_i_cache_stride_size = cci.pcci_stride;
+ have_found_i_cache_stride_size = 1;
+}
+
/*
* cpu_init() initializes state that is per-CPU. This function acts
* as a 'CPU state barrier', nothing should get across.
@@ -649,6 +699,7 @@
ia64_tpa(cpu_data) - (long) __per_cpu_start);
get_max_cacheline_size();
+ get_i_cache_stride_size();
/*
* We can't pass "local_cpu_data" to identify_cpu() because we haven't called
^ permalink raw reply [flat|nested] 32+ messages in thread

* Re: flush_icache_range
From: David Mosberger @ 2005-05-27 15:56 UTC (permalink / raw)
To: linux-ia64
>>>>> On Fri, 27 May 2005 17:45:01 +0200, Zoltan Menyhart <Zoltan.Menyhart@bull.net> said:
Zoltan> I propose to support homogeneous systems only.
Why? Using a per-CPU variable is just as easy.
--david
^ permalink raw reply [flat|nested] 32+ messages in thread

* Re: flush_icache_range
From: Zoltan Menyhart @ 2005-05-27 16:45 UTC (permalink / raw)
To: linux-ia64
David Mosberger wrote:
> Zoltan> I propose to support homogeneous systems only.
>
> Why? Using a per-CPU variable is just as easy.
>
> --david
I think we cannot use per-CPU data, and there is no need
to use per-CPU data, because fc.i is a global operation;
the stride size is a common, global value for a given machine.
Shall we use the system-wide minimum stride ?
Why doesn't the SAL calculate it ? :-)
Well, if there were some real machines with mixed CPUs...
Anyway, due to the usage of the #ifdef CONFIG_ITANIUM's
and the way they are used, I think the current kernel
does not support mixed Itanium 1 and 2 CPUs.
I think it is enough to support:
- either "N"-byte strides, if all the CPUs say so,
- or 32-byte strides otherwise, including on PAL errors.
To tell the truth, I have not tested my code on Merced
(not having any at hand).
Testing a mixed-CPU configuration would be even more
hopeless for me.
Have you got information on the forthcoming CPUs?
Thanks,
Zoltan
^ permalink raw reply [flat|nested] 32+ messages in thread

* Re: flush_icache_range
From: David Mosberger @ 2005-05-27 16:55 UTC (permalink / raw)
To: linux-ia64
>>>>> On Fri, 27 May 2005 18:45:19 +0200, Zoltan Menyhart <Zoltan.Menyhart@bull.net> said:
Zoltan> I think we cannot use per-CPU data
Why? I didn't think it was used until per-CPU is initialized.
Zoltan> and there is no need for using per-CPU data, because fc.i is a
Zoltan> global operation, the stride size is a common global value
Zoltan> for a given machine.
Ugh, where does it say that? I can easily imagine a processor where
your argument would not hold true.
Zoltan> Shall we use the system wide minimum stride ?
Yes, that would be fine, too.
Zoltan> Why does not the SAL calculate it ? :-)
BTW: instead of calculating the min log2 stride, I'd just export the
actual stride. No point doing that in flush_cache over and over
again.
--david
^ permalink raw reply [flat|nested] 32+ messages in thread

* Re: flush_icache_range
From: Grant Grundler @ 2005-05-27 18:27 UTC (permalink / raw)
To: linux-ia64
On Fri, May 27, 2005 at 06:45:19PM +0200, Zoltan Menyhart wrote:
> Anyway, due to the usage of the #ifdef CONFIG_ITANIUM's
> and the way how they are used, I think the current kernel
> does not support mixed Itanium 1 and 2 CPUs.
I'm not sure how the HW could support that, given the two
use different system buses (Merced vs. McKinley).
Even if someone made a chipset that could support it (NUMAlink?),
they might ban such configurations to avoid getting
dragged down to the lowest common denominator in performance,
or having to pay obscene amounts to test something few customers
will actually use.
grant
^ permalink raw reply [flat|nested] 32+ messages in thread

* Re: flush_icache_range
From: Russ Anderson @ 2005-05-27 19:00 UTC (permalink / raw)
To: linux-ia64
Grant Grundler wrote:
> On Fri, May 27, 2005 at 06:45:19PM +0200, Zoltan Menyhart wrote:
> > Anyway, due to the usage of the #ifdef CONFIG_ITANIUM's
> > and the way how they are used, I think the current kernel
> > does not support mixed Itanium 1 and 2 CPUs.
>
> I'm not sure how the HW could support that, given the two
> use different system buses (Merced vs. McKinley).
>
> Even if someone made a chipset that could support it (NUMAlink?),
> they might ban such configurations to avoid getting
> dragged down to the lowest common denominator in performance,
> or having to pay obscene amounts to test something few customers
> will actually use.
SGI does not mix Merced & McKinley in the same NUMAlinked system.
--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc rja@sgi.com
^ permalink raw reply [flat|nested] 32+ messages in thread

* Re: flush_icache_range
From: Menyhart, Zoltan @ 2005-05-29 20:23 UTC (permalink / raw)
To: linux-ia64
David Mosberger wrote:
Zoltan> I think we cannot use per-CPU data
> Why? I didn't think it was used until per-CPU is initialized.
The point is that if we used the per-CPU stride size reported by the PAL
for each CPU, and if we stored these values in their respective per-CPU
data areas, then they could be different on a hypothetical mixed system.
We need a stride size that suits all the CPUs.
Zoltan> and there is no need for using per-CPU data, because fc.i is a
Zoltan> global operation, the stride size is a common global value
Zoltan> for a given machine.
> Ugh, where that is say that? I can easily imagine a processor where
> your argument would not hold true.
fc.i is a global operation, as any of the CPUs may have some stale
instructions in its i-cache(s).
We need to use the minimum of the stride sizes (one that suits all the
CPUs) on a hypothetical mixed system.
If we use a pre-calculated, system-wide stride size, then why
replicate this common value in the per-CPU data areas? Why not
simply use a global variable?
I prefer not to support mixed configurations, as there is no such
real machine, so there is no need to calculate the min. stride size.
This is how my last patch works.
Even simpler, the "#ifdef CONFIG_ITANIUM" based solution
(see my patch of 2005-05-23) would be enough in this case:
http://marc.theaimsgroup.com/?l=linux-ia64&m=111685596128411&w=2
> BTW: instead of calculating the min log2 stride, I'd just export the
> actual stride. No point doing that in flush_cache over and over
> again.
In "flush_icache_range()" we use the equation:
	loop_counter = ( end_address - start_address - 1 ) / stride_size
It is more efficient to shift by log2(stride_size) than
to divide.
We also need the stride size itself to increment the address for fc.i.
Thanks,
Zoltan
^ permalink raw reply [flat|nested] 32+ messages in thread

* Re: flush_icache_range
From: David Mosberger @ 2005-06-01 23:50 UTC (permalink / raw)
To: linux-ia64
>>>>> On Sun, 29 May 2005 22:23:23 +0200, "Menyhart, Zoltan" <Zoltan.Menyhart@free.fr> said:
Zoltan> David Mosberger wrote:
Zoltan> I think we cannot use per-CPU data
>> Why? I didn't think it was used until per-CPU is initialized.
Zoltan> The point is if we used the per-CPU stride size reported by
Zoltan> the PAL for each CPU, and if we stored these values in their
Zoltan> respective per-CPU data area, then they could be different -
Zoltan> on a hypothetical mixed system. We need a stride size that
Zoltan> suits all the CPUs.
If a CPU advertises an optimal stride-size of 128 bytes, that better
work for all CPUs in the system, even if the optimal stride for
another CPU is smaller, say 64 bytes.
If you want to do a global min and store that in a global variable,
sure, that's fine too. I _think_ the per-CPU approach would come out
as simpler code, but I could be wrong.
Zoltan> I prefer not to support mixed configurations as there is not
Zoltan> such a real machine, => no need to calculate the min. stride
Zoltan> size.
We should write software that fits the architecture, not just today's
machines. Doing the latter just calls for trouble when new machines
are being introduced. Never underestimate the cleverness of future hw
designers...
>> BTW: instead of calculating the min log2 stride, I'd just export the
>> actual stride. No point doing that in flush_cache over and over
>> again.
Zoltan> We use in "flush_icache_range()" the loop_counter = (
Zoltan> end_address - start_address - 1 ) / stride_size equation. It
Zoltan> is more efficient to use a shift by log2(stride_size) than a
Zoltan> division. We also need the stride size to increment the
Zoltan> address for fc.i.
Ah, I forgot about the division. Yes, storing the log2 makes more
sense, then.
--david
^ permalink raw reply [flat|nested] 32+ messages in thread* RE: flush_icache_range
2005-03-15 12:40 flush_icache_range Zoltan Menyhart
` (16 preceding siblings ...)
2005-06-01 23:50 ` flush_icache_range David Mosberger
@ 2005-06-02 3:00 ` Jim Hull
2005-06-02 12:12 ` flush_icache_range Zoltan Menyhart
` (12 subsequent siblings)
30 siblings, 0 replies; 32+ messages in thread
From: Jim Hull @ 2005-06-02 3:00 UTC (permalink / raw)
To: linux-ia64
David:
> If a CPU advertises an optimal stride-size of 128 bytes, that better
> work for all CPUs in the system, even if the optimal stride for
> another CPU is smaller, say 64 bytes.
Unfortunately, that's not how the ia64 architecture works. Although the flushes
are broadcast throughout the cache coherence domain, the parameters returned by
PAL are for that specific processor only. This is because each PAL knows
nothing about the other (potentially heterogeneous) processors - that
would require system knowledge.
Because there is no SAL interface to query all the PALs and compute the minimum
for you, this task falls on the OS to do.
> If you want to do a global min and store that in a global variable,
> sure, that's fine too. I _think_ the per-CPU approach would come out
> as simpler code, but I could be wrong.
A single global is sufficient, so long as you don't aspire to optimally run a
single OS image across multiple cache coherence domains. If you did, then you'd
need one value per coherence domain. If you did want to (someday) allow for
this, then a per-CPU value, computed initially as the system-wide min, would be
easier to extend.
> Zoltan> I prefer not to support mixed configurations as there is not
> Zoltan> such a real machine, => no need to calculate the min. stride
> Zoltan> size.
>
> We should write software that fits the architecture, not just today's
> machines. Doing the latter just calls for trouble when new machines
> are being introduced. Never underestimate the cleverness of future hw
> designers...
It's not as farfetched as you might think. As I understand Intel's official
public position, they don't test, and therefore don't support such mixed
configurations. However, they also don't expect it not to work, and any OEM who
wishes to do the testing and assume the support costs is free to do so. (Note:
This same position applies to other types of mixing as well, such as multiple
frequencies.)
-- Jim
^ permalink raw reply [flat|nested] 32+ messages in thread* Re: flush_icache_range
2005-03-15 12:40 flush_icache_range Zoltan Menyhart
` (17 preceding siblings ...)
2005-06-02 3:00 ` flush_icache_range Jim Hull
@ 2005-06-02 12:12 ` Zoltan Menyhart
2005-06-02 14:25 ` flush_icache_range Zoltan Menyhart
` (11 subsequent siblings)
30 siblings, 0 replies; 32+ messages in thread
From: Zoltan Menyhart @ 2005-06-02 12:12 UTC (permalink / raw)
To: linux-ia64
[-- Attachment #1: Type: text/plain, Size: 316 bytes --]
David, Jim,
Here is my next try.
I calculate the minimum of the i-cache stride sizes.
(Good luck to those who want to use mixed systems :-).)
About a single OS image across multiple cache coherence domains:
I think Linux is designed for a single cache coherence domain,
see e.g. the atomic operations.
Zoltan
[-- Attachment #2: diff --]
[-- Type: text/plain, Size: 3799 bytes --]
--- linux-2.6.11-orig/arch/ia64/lib/flush.S 2005-04-26 15:59:49.000000000 +0200
+++ linux-2.6.11/arch/ia64/lib/flush.S 2005-05-27 14:26:15.000000000 +0200
@@ -3,37 +3,51 @@
*
* Copyright (C) 1999-2001 Hewlett-Packard Co
* Copyright (C) 1999-2001 David Mosberger-Tang <davidm@hpl.hp.com>
+ *
+ * 05/28/05 Zoltan Menyhart Dynamic stride size
*/
+
#include <asm/asmmacro.h>
#include <asm/page.h>
+
/*
* flush_icache_range(start,end)
- * Must flush range from start to end-1 but nothing else (need to
+ *
+ * Make i-cache(s) coherent with d-caches.
+ *
+ * Must deal with range from start to end-1 but nothing else (need to
* be careful not to touch addresses that may be unmapped).
*/
GLOBAL_ENTRY(flush_icache_range)
+
.prologue
- alloc r2=ar.pfs,2,0,0,0
- sub r8=in1,in0,1
- ;;
- shr.u r8=r8,5 // we flush 32 bytes per iteration
- .save ar.lc, r3
- mov r3=ar.lc // save ar.lc
+ alloc r2=ar.pfs,2,0,0,0
+ movl r3=log_2_i_cache_stride_size
+ mov r21=1
+ ;;
+ ld8 r20=[r3] // r20: log2( stride size of the i-cache(s) )
+ sub r8=in1,in0,1
+ ;;
+ shl r21=r21,r20 // r21: stride size of the i-cache(s)
+ shr.u r8=r8,r20 // we flush "stride size" bytes per iteration
+
+ .save ar.lc, r3
+ mov r3=ar.lc // save ar.lc
;;
.body
- mov ar.lc=r8
+ mov ar.lc=r8
;;
-.Loop: fc in0 // issuable on M0 only
- add in0=32,in0
+.Loop: fc.i in0 // issuable on M0 only
+ add in0=r21,in0
br.cloop.sptk.few .Loop
;;
sync.i
;;
srlz.i
;;
- mov ar.lc=r3 // restore ar.lc
+ mov ar.lc=r3 // restore ar.lc
br.ret.sptk.many rp
END(flush_icache_range)
--- linux-2.6.11-orig/arch/ia64/kernel/setup.c 2005-04-26 15:59:49.000000000 +0200
+++ linux-2.6.11/arch/ia64/kernel/setup.c 2005-06-02 13:55:23.448675412 +0200
@@ -15,6 +15,7 @@
* 02/01/00 R.Seth fixed get_cpuinfo for SMP
* 01/07/99 S.Eranian added the support for command line argument
* 06/24/99 W.Drummond added boot_cpu_data.
+ * 05/28/05 Z. Menyhart Dynamic stride size for "flush_icache_range()"
*/
#include <linux/config.h>
#include <linux/module.h>
@@ -78,6 +79,13 @@
EXPORT_SYMBOL(io_space);
unsigned int num_io_spaces;
+/*
+ * "flush_icache_range()" needs to know what processor dependent stride size to use
+ * when it makes i-cache(s) coherent with d-caches.
+ */
+#define LOG_2_I_CACHE_STRIDE_SIZE 5 /* Safest way to go: 32 bytes by 32 bytes */
+unsigned long log_2_i_cache_stride_size = ~0;
+
unsigned char aux_device_present = 0xaa; /* XXX remove this when legacy I/O is gone */
/*
@@ -624,6 +632,34 @@
ia64_max_cacheline_size = max;
}
+
+/*
+ * "flush_icache_range()" needs to know what processor dependent stride size to use
+ * when it makes i-cache(s) coherent with d-caches.
+ * The minimum of the i-cache stride sizes is calculated.
+ */
+static void
+get_i_cache_stride_size (void)
+{
+ pal_cache_config_info_t cci;
+ s64 status;
+
+ /*
+ * We assume that the stride size of the L2I cache (if it exists) is the same as
+ * that of the L1I cache.
+ */
+ status = ia64_pal_cache_config_info(/* cache_level ( 0 means L1 ) */ 0,
+ /* cache_type (instruction)= */ 1, &cci);
+ if (status != 0) {
+ printk(KERN_ERR
+ "%s: ia64_pal_cache_config_info(L1I) failed (status=%ld CPU=%d)\n",
+ __FUNCTION__, status, smp_processor_id());
+ cci.pcci_stride = LOG_2_I_CACHE_STRIDE_SIZE;
+ }
+ if (cci.pcci_stride < log_2_i_cache_stride_size)
+ log_2_i_cache_stride_size = cci.pcci_stride;
+}
+
/*
* cpu_init() initializes state that is per-CPU. This function acts
* as a 'CPU state barrier', nothing should get across.
@@ -649,6 +685,7 @@
ia64_tpa(cpu_data) - (long) __per_cpu_start);
get_max_cacheline_size();
+ get_i_cache_stride_size();
/*
* We can't pass "local_cpu_data" to identify_cpu() because we haven't called
^ permalink raw reply [flat|nested] 32+ messages in thread* Re: flush_icache_range
2005-03-15 12:40 flush_icache_range Zoltan Menyhart
` (18 preceding siblings ...)
2005-06-02 12:12 ` flush_icache_range Zoltan Menyhart
@ 2005-06-02 14:25 ` Zoltan Menyhart
2005-06-02 17:36 ` flush_icache_range David Mosberger
` (10 subsequent siblings)
30 siblings, 0 replies; 32+ messages in thread
From: Zoltan Menyhart @ 2005-06-02 14:25 UTC (permalink / raw)
To: linux-ia64
[-- Attachment #1: Type: text/plain, Size: 1574 bytes --]
Jack Steiner wrote:
> On Thu, Jun 02, 2005 at 02:12:02PM +0200, Zoltan Menyhart wrote:
>
>>+.Loop: fc.i in0 // issuable on M0 only
>>+ add in0=r21,in0
>> br.cloop.sptk.few .Loop
>> ;;
>
>
> I noticed that the flush loop has a single bundle loop. I know
> that this loop was not introduced by your code, but according to
> Intel, single bundle loops should not be used in performance critical code.
>
> We ran in to severe performance problems several years ago with single bundle
> loops. IIRC, the details were posted to the ia64 mail list & the
> resolution was "don't use single bundle loops". I don't know if the performance
> problem exists if the loop contains an fc instruction but you may want
> to unroll the loop one additional cycle.
>
> (The problem is that single bundle loops that are not aligned on a
> 0 mod 32 address will run significantly slower (we observed 3X slower) after
> an interrupt).
Thank you for your remark.
I added a "nop.b 0" to occupy the original slot of "br".
I hope it is fine that my "br" is shifted to the very last slot:
0xa000000100302d00 <flush_icache_range+64>: [MIB] fc.i r32
0xa000000100302d01 <flush_icache_range+65>: add r32=r21,r32
0xa000000100302d02 <flush_icache_range+66>: nop.b 0x0
0xa000000100302d10 <flush_icache_range+80>: [MFB] nop.m 0x0
0xa000000100302d11 <flush_icache_range+81>: nop.f 0x0
0xa000000100302d12 <flush_icache_range+82>: br.cloop.sptk.few 0xa000000100302d00
<flush_icache_range+64>;;
Zoltan
[-- Attachment #2: diff2 --]
[-- Type: text/plain, Size: 3887 bytes --]
--- linux-2.6.11-orig/arch/ia64/lib/flush.S 2005-04-26 15:59:49.000000000 +0200
+++ linux-2.6.11/arch/ia64/lib/flush.S 2005-06-02 16:12:08.655606148 +0200
@@ -3,37 +3,57 @@
*
* Copyright (C) 1999-2001 Hewlett-Packard Co
* Copyright (C) 1999-2001 David Mosberger-Tang <davidm@hpl.hp.com>
+ *
+ * 05/28/05 Zoltan Menyhart Dynamic stride size
*/
+
#include <asm/asmmacro.h>
#include <asm/page.h>
+
/*
* flush_icache_range(start,end)
- * Must flush range from start to end-1 but nothing else (need to
+ *
+ * Make i-cache(s) coherent with d-caches.
+ *
+ * Must deal with range from start to end-1 but nothing else (need to
* be careful not to touch addresses that may be unmapped).
*/
GLOBAL_ENTRY(flush_icache_range)
+
.prologue
- alloc r2=ar.pfs,2,0,0,0
- sub r8=in1,in0,1
- ;;
- shr.u r8=r8,5 // we flush 32 bytes per iteration
- .save ar.lc, r3
- mov r3=ar.lc // save ar.lc
+ alloc r2=ar.pfs,2,0,0,0
+ movl r3=log_2_i_cache_stride_size
+ mov r21=1
+ ;;
+ ld8 r20=[r3] // r20: log2( stride size of the i-cache(s) )
+ sub r8=in1,in0,1
+ ;;
+ shl r21=r21,r20 // r21: stride size of the i-cache(s)
+ shr.u r8=r8,r20 // we flush "stride size" bytes per iteration
+
+ .save ar.lc, r3
+ mov r3=ar.lc // save ar.lc
;;
.body
- mov ar.lc=r8
+ mov ar.lc=r8
;;
-.Loop: fc in0 // issuable on M0 only
- add in0=32,in0
+
+ /*
+ * 32 byte aligned loop, even number of (actually 2) bundles
+ */
+.Loop: fc.i in0 // issuable on M0 only
+ add in0=r21,in0
+ nop.b 0
br.cloop.sptk.few .Loop
;;
+
sync.i
;;
srlz.i
;;
- mov ar.lc=r3 // restore ar.lc
+ mov ar.lc=r3 // restore ar.lc
br.ret.sptk.many rp
END(flush_icache_range)
--- linux-2.6.11-orig/arch/ia64/kernel/setup.c 2005-04-26 15:59:49.000000000 +0200
+++ linux-2.6.11/arch/ia64/kernel/setup.c 2005-06-02 13:55:23.448675412 +0200
@@ -15,6 +15,7 @@
* 02/01/00 R.Seth fixed get_cpuinfo for SMP
* 01/07/99 S.Eranian added the support for command line argument
* 06/24/99 W.Drummond added boot_cpu_data.
+ * 05/28/05 Z. Menyhart Dynamic stride size for "flush_icache_range()"
*/
#include <linux/config.h>
#include <linux/module.h>
@@ -78,6 +79,13 @@
EXPORT_SYMBOL(io_space);
unsigned int num_io_spaces;
+/*
+ * "flush_icache_range()" needs to know what processor dependent stride size to use
+ * when it makes i-cache(s) coherent with d-caches.
+ */
+#define LOG_2_I_CACHE_STRIDE_SIZE 5 /* Safest way to go: 32 bytes by 32 bytes */
+unsigned long log_2_i_cache_stride_size = ~0;
+
unsigned char aux_device_present = 0xaa; /* XXX remove this when legacy I/O is gone */
/*
@@ -624,6 +632,34 @@
ia64_max_cacheline_size = max;
}
+
+/*
+ * "flush_icache_range()" needs to know what processor dependent stride size to use
+ * when it makes i-cache(s) coherent with d-caches.
+ * The minimum of the i-cache stride sizes is calculated.
+ */
+static void
+get_i_cache_stride_size (void)
+{
+ pal_cache_config_info_t cci;
+ s64 status;
+
+ /*
+ * We assume that the stride size of the L2I cache (if it exists) is the same as
+ * that of the L1I cache.
+ */
+ status = ia64_pal_cache_config_info(/* cache_level ( 0 means L1 ) */ 0,
+ /* cache_type (instruction)= */ 1, &cci);
+ if (status != 0) {
+ printk(KERN_ERR
+ "%s: ia64_pal_cache_config_info(L1I) failed (status=%ld CPU=%d)\n",
+ __FUNCTION__, status, smp_processor_id());
+ cci.pcci_stride = LOG_2_I_CACHE_STRIDE_SIZE;
+ }
+ if (cci.pcci_stride < log_2_i_cache_stride_size)
+ log_2_i_cache_stride_size = cci.pcci_stride;
+}
+
/*
* cpu_init() initializes state that is per-CPU. This function acts
* as a 'CPU state barrier', nothing should get across.
@@ -649,6 +685,7 @@
ia64_tpa(cpu_data) - (long) __per_cpu_start);
get_max_cacheline_size();
+ get_i_cache_stride_size();
/*
* We can't pass "local_cpu_data" to identify_cpu() because we haven't called
^ permalink raw reply [flat|nested] 32+ messages in thread* RE: flush_icache_range
2005-03-15 12:40 flush_icache_range Zoltan Menyhart
` (19 preceding siblings ...)
2005-06-02 14:25 ` flush_icache_range Zoltan Menyhart
@ 2005-06-02 17:36 ` David Mosberger
2005-06-02 18:28 ` flush_icache_range David Mosberger
` (9 subsequent siblings)
30 siblings, 0 replies; 32+ messages in thread
From: David Mosberger @ 2005-06-02 17:36 UTC (permalink / raw)
To: linux-ia64
>>>>> On Wed, 1 Jun 2005 20:00:30 -0700, "Jim Hull" <jim.hull@hp.com> said:
Jim> David:
>> If a CPU advertises an optimal stride-size of 128 bytes, that better
>> work for all CPUs in the system, even if the optimal stride for
>> another CPU is smaller, say 64 bytes.
Jim> Unfortunately, that's not how the ia64 architecture works.
Jim> Although the flushes are broadcast throughout the cache
Jim> coherence domain, the parameters returned by PAL are for that
Jim> specific processor only. This is because each PAL knows
Jim> nothing about the other (potentially heterogeneous) processors
Jim> - that depends on system knowledge.
Jim> Because there is no SAL interface to query all the PALs and
Jim> compute the minimum for you, this task falls on the OS to do.
Clearly PAL couldn't do that. What I had in mind was something else:
suppose you wanted to design a machine that can mix CPUs with L3
cache-line sizes of 128 and 256 bytes. In such a case, I'd have
_expected_ that either the 256 byte CPUs would also advertise a stride
of 128 (since they're designed to be backwards-compatible) or they
advertise a 256 byte stride but then take care of converting each "fc"
into a pair of 128-byte flushes on the bus.
I guess what you're saying is that the architecture imposes no such
constraint and the OS _must_ flush at the minimum stride across all
CPUs. Right?
--david
^ permalink raw reply [flat|nested] 32+ messages in thread* Re: flush_icache_range
2005-03-15 12:40 flush_icache_range Zoltan Menyhart
` (20 preceding siblings ...)
2005-06-02 17:36 ` flush_icache_range David Mosberger
@ 2005-06-02 18:28 ` David Mosberger
2005-06-02 18:31 ` flush_icache_range David Mosberger
` (8 subsequent siblings)
30 siblings, 0 replies; 32+ messages in thread
From: David Mosberger @ 2005-06-02 18:28 UTC (permalink / raw)
To: linux-ia64
>>>>> On Thu, 02 Jun 2005 14:12:02 +0200, Zoltan Menyhart <Zoltan.Menyhart@bull.net> said:
Zoltan> Here is my next try.
The changes to the assembly-file look mostly OK, except for the usual
white-space issues (trailing whitespace, introduction of new, useless
blank lines).
More importantly, it looks to me like there is an off-by-one bug:
ar.lc needs to be initialized to loop_count-1. Which raises the
question: how well has this been tested? Or am I missing something?
As for setup.c: I'd get rid of LOG_2_I_CACHE_STRIDE_SIZE and just
initialize log_2_i_cache_stride_size to 5 (there is no point in
initializing it with a random & useless value).
Also, I think you should do take the minimum of _all_ cache-levels,
not just level 1 (yes, I also have a hard time imagining a system
where the higher level has a smaller stride, but I don't think there
is anything that prevents such a system).
Thanks,
--david
^ permalink raw reply [flat|nested] 32+ messages in thread* Re: flush_icache_range
2005-03-15 12:40 flush_icache_range Zoltan Menyhart
` (21 preceding siblings ...)
2005-06-02 18:28 ` flush_icache_range David Mosberger
@ 2005-06-02 18:31 ` David Mosberger
2005-06-02 19:00 ` flush_icache_range Jim Hull
` (7 subsequent siblings)
30 siblings, 0 replies; 32+ messages in thread
From: David Mosberger @ 2005-06-02 18:31 UTC (permalink / raw)
To: linux-ia64
>>>>> On Thu, 02 Jun 2005 16:25:25 +0200, Zoltan Menyhart <Zoltan.Menyhart@bull.net> said:
Zoltan> I added a "nop.b 0" to occupy the original slot of "br".
Zoltan> I hope it is fine that my "br" is shifted to the very last slot:
Zoltan> 0xa000000100302d00 <flush_icache_range+64>: [MIB] fc.i r32
Zoltan> 0xa000000100302d01 <flush_icache_range+65>: add r32=r21,r32
Zoltan> 0xa000000100302d02 <flush_icache_range+66>: nop.b 0x0
Zoltan> 0xa000000100302d10 <flush_icache_range+80>: [MFB] nop.m 0x0
Zoltan> 0xa000000100302d11 <flush_icache_range+81>: nop.f 0x0
Zoltan> 0xa000000100302d12 <flush_icache_range+82>: br.cloop.sptk.few 0xa000000100302d00
Zoltan> <flush_icache_range+64>;;
I'd rather use an MMI or MII bundle instead. B slots are (somewhat)
liable to cause split issues (though they don't on Itanium 2 if the
B-slot contains a "nop.b").
--david
^ permalink raw reply [flat|nested] 32+ messages in thread* RE: flush_icache_range
2005-03-15 12:40 flush_icache_range Zoltan Menyhart
` (22 preceding siblings ...)
2005-06-02 18:31 ` flush_icache_range David Mosberger
@ 2005-06-02 19:00 ` Jim Hull
2005-06-02 21:37 ` flush_icache_range Menyhart, Zoltan
` (6 subsequent siblings)
30 siblings, 0 replies; 32+ messages in thread
From: Jim Hull @ 2005-06-02 19:00 UTC (permalink / raw)
To: linux-ia64
David:
> I guess what you're saying is that the architecture imposes no such
> constraint and the OS _must_ flush at the minium stride across all
> CPUs. Right?
Yes. And note that for current implementations, the minimum stride is
determined by the L1 cache, not the L3 (but you shouldn't count on that either -
just ask PAL about all the levels, and take the minimum).
-- Jim
^ permalink raw reply [flat|nested] 32+ messages in thread* Re: flush_icache_range
2005-03-15 12:40 flush_icache_range Zoltan Menyhart
` (23 preceding siblings ...)
2005-06-02 19:00 ` flush_icache_range Jim Hull
@ 2005-06-02 21:37 ` Menyhart, Zoltan
2005-06-02 22:23 ` flush_icache_range David Mosberger
` (5 subsequent siblings)
30 siblings, 0 replies; 32+ messages in thread
From: Menyhart, Zoltan @ 2005-06-02 21:37 UTC (permalink / raw)
To: linux-ia64
David Mosberger wrote:
>The changes to the assembly-file look mostly OK, except for the usual
>white-space issues (trailing whitespace, introduction of new, useless
>blank lines).
>
>
Well, if I add just 1 or 2 lines, as I did in my first patch, I respect
the original whitespace conventions.
Here I wanted to make the code easier to read; otherwise, silly errors
can slip in more easily.
>More importantly, it looks to me like there is an off-by-one bug:
>ar.lc needs to be initialized to loop_count-1. Which raises the
>question: how well has this been tested? Or am I missing something?
>
>
It is calculated as:
loop_counter = ( end_address - start_address - 1 ) / stride_size
In most cases we flush entire pages.
In these cases, "loop_counter" is correct.
Otherwise, if at least "start_address" is stride-size aligned
(ELF loader: text size is N * 16), we are still safe.
Otherwise, if only "end_address" is stride-size aligned
(not of much interest), we are still safe.
Otherwise, if neither of them is stride-size aligned
(e.g. a debugger may request flushing 2 bundles that span a
stride boundary), we will fail to flush the 2nd stride.
I propose to round down "start_address" to be stride size aligned.
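The proposed fix - round "start_address" down to a stride boundary and count strides by boundary index - can be modeled in C. The function and parameter names are illustrative; the eventual assembly (diff3 later in this thread) computes the same quantities in registers:

```c
#include <stdint.h>

/* Hypothetical sketch of the rounding fix: compute the first address to
 * pass to fc.i and the ar.lc value (iterations minus one) from stride
 * indices, so a range spanning a stride boundary is fully covered. */
static void flush_range_params(uint64_t start, uint64_t end,
                               unsigned int stride_shift,
                               uint64_t *first, uint64_t *loop_count)
{
    uint64_t first_stride = start >> stride_shift;      /* index of first stride */
    uint64_t last_stride  = (end - 1) >> stride_shift;  /* index of last stride  */

    *first = first_stride << stride_shift;      /* start rounded down       */
    *loop_count = last_stride - first_stride;   /* strides - 1, for ar.lc   */
}
```

With a 32-byte stride, flushing 32 bytes starting at 0x1c spans two strides: the old `(end - start - 1) >> 5` formula gives 0 (one iteration, missing the second stride), while the boundary-index computation correctly gives a loop count of 1 (two iterations).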
>As for setup.c: I'd get rid of LOG_2_I_CACHE_STRIDE_SIZE and just
>initialize log_2_i_cache_stride_size to 5 (there is no point in
>initializing it with a random & useless value).
>
>
"log_2_i_cache_stride_size" is not initialized to any real stride size;
it accumulates the minimum value.
Should "pal_cache_config_info" fail, you need something useful in order
to be able to boot.
I'd rather keep "LOG_2_I_CACHE_STRIDE_SIZE"; I like descriptive names.
Perhaps "LOG_2_DEFAULT_I_CACHE_STRIDE_SIZE" would be even better :-)
>Also, I think you should do take the minimum of _all_ cache-levels,
>not just level 1 (yes, I also have a hard time imagining a system
>where the higher level has a smaller stride, but I don't think there
>is anything that prevents such a system).
>
>
Well, things are getting complicated :-) I can add it...
I have a concern about the "unique_caches" value returned by
"pal_cache_summary()".
Let's assume that
unique_caches - cache_levels = 2
I could not find anything guaranteeing that in this case we have L1I
and L2I, rather than L1I and L3I (feeding through the unified L2).
Yes, I know there is no such CPU (at the moment), but the PAL spec.
does not exclude it :-)
Thanks,
Zoltan
^ permalink raw reply [flat|nested] 32+ messages in thread* Re: flush_icache_range
2005-03-15 12:40 flush_icache_range Zoltan Menyhart
` (24 preceding siblings ...)
2005-06-02 21:37 ` flush_icache_range Menyhart, Zoltan
@ 2005-06-02 22:23 ` David Mosberger
2005-06-02 22:55 ` flush_icache_range Menyhart, Zoltan
` (4 subsequent siblings)
30 siblings, 0 replies; 32+ messages in thread
From: David Mosberger @ 2005-06-02 22:23 UTC (permalink / raw)
To: linux-ia64
>>>>> On Thu, 02 Jun 2005 23:37:36 +0200, "Menyhart, Zoltan" <Zoltan.Menyhart@free.fr> said:
Zoltan> I propose to round down "start_address" to be stride size aligned.
Sounds reasonable.
>> As for setup.c: I'd get rid of LOG_2_I_CACHE_STRIDE_SIZE and just
>> initialize log_2_i_cache_stride_size to 5 (there is no point in
>> initializing it with a random & useless value).
Zoltan> "log_2_i_cache_stride_size" is not initialized to any stride
Zoltan> size, it calculates the min. value. Should
Zoltan> "pal_cache_config_info" fail, you need something useful to
Zoltan> be able to boot up.
Zoltan> I 'd rather keep "LOG_2_I_CACHE_STRIDE_SIZE", I like speaking names.
Zoltan> Perhaps "LOG_2_DEFAULT_I_CACHE_STRIDE_SIZE" would be even better :-)
Sure. The canonical way in Linux to say "log2" is "shift", so perhaps:
DEFAULT_I_CACHE_STRIDE_SHIFT
?
>> Also, I think you should do take the minimum of _all_ cache-levels,
>> not just level 1 (yes, I also have a hard time imagining a system
>> where the higher level has a smaller stride, but I don't think there
>> is anything that prevents such a system).
Zoltan> Well, things are getting complicated :-) I can add it...
Zoltan> I've got a concern about the "unique_caches" returned by
Zoltan> "pal_cache_summary()". Let's assume that
Zoltan> unique_caches - cache_levels = 2
Zoltan> I could not find anything making sure that we've got in this
Zoltan> case L1I and L2I, and not L1I and L3I (feeding through the
Zoltan> unified L2). Yes, I know there is no such a CPU (at the
Zoltan> moment) but the PAL spec. does not exclude it :-)
I don't think it matters whether you pick up the stride from an
i-cache or a unified cache. As long as you take the minimum stride,
it will work correctly.
--david
^ permalink raw reply [flat|nested] 32+ messages in thread* Re: flush_icache_range
2005-03-15 12:40 flush_icache_range Zoltan Menyhart
` (25 preceding siblings ...)
2005-06-02 22:23 ` flush_icache_range David Mosberger
@ 2005-06-02 22:55 ` Menyhart, Zoltan
2005-06-02 23:07 ` flush_icache_range David Mosberger
` (3 subsequent siblings)
30 siblings, 0 replies; 32+ messages in thread
From: Menyhart, Zoltan @ 2005-06-02 22:55 UTC (permalink / raw)
To: linux-ia64
David Mosberger wrote:
> Zoltan> unique_caches - cache_levels = 2
>
> Zoltan> I could not find anything making sure that we've got in this
> Zoltan> case L1I and L2I, and not L1I and L3I (feeding through the
> Zoltan> unified L2). Yes, I know there is no such a CPU (at the
> Zoltan> moment) but the PAL spec. does not exclude it :-)
>
>I don't think it matters whether you pick up the stride from an
>i-cache or a unified cache. As long as you take the minimum stride,
>it will work correctly.
>
>
I mean:
- if the cache of a given level is split, then we need to take the stride
size of the i-cache:
pal_cache_config_info(level, /* cache_type = */ 1,...)
- if the cache of a given level is unified, then we need to take the stride
size of the unified/data cache:
pal_cache_config_info(level, /* cache_type = */ 2,...)
In my example I know only that some (but not all) of the levels are split.
Guessing the existence of a split cache from the absence of an "Invalid
argument" error from "pal_cache_config_info()" is a rather weak method...
Thanks,
Zoltan
^ permalink raw reply [flat|nested] 32+ messages in thread* Re: flush_icache_range
2005-03-15 12:40 flush_icache_range Zoltan Menyhart
` (26 preceding siblings ...)
2005-06-02 22:55 ` flush_icache_range Menyhart, Zoltan
@ 2005-06-02 23:07 ` David Mosberger
2005-06-03 12:35 ` flush_icache_range Zoltan Menyhart
` (2 subsequent siblings)
30 siblings, 0 replies; 32+ messages in thread
From: David Mosberger @ 2005-06-02 23:07 UTC (permalink / raw)
To: linux-ia64
>>>>> On Fri, 03 Jun 2005 00:55:42 +0200, "Menyhart, Zoltan" <Zoltan.Menyhart@free.fr> said:
Zoltan> I mean:
Zoltan> - if the cache of a given level is split, then we need to
Zoltan> take the stride size of the i-cache:
Zoltan> pal_cache_config_info(level, /* cache_type = */ 1,...)
Zoltan> - if the cache of a given level is unified, then we need to
Zoltan> take the stride size of the unified/data cache:
Zoltan> pal_cache_config_info(level, /* cache_type = */ 2,...)
Zoltan> In my example I know only that some (but not all) the levels
Zoltan> are split. Guessing the existence of a split cache by not
Zoltan> obtaining an "Invalid argument" error form
Zoltan> "pal_cache_config_info()" is a bit weak method...
Ah, I misremembered: I thought a cache_type value of 1 means
"instruction-cache or unified", but no, it means _only_
instruction-cache. So the safe sequence seems to be:
pal_cache_config_info(level, 2, &info)
if (!info.pcci_unified)
pal_cache_config_info(level, 1, &info)
I omitted the failure checks, of course, but this doesn't look too
bad.
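The sequence reduces to "take the stride from the unified cache if the level is unified, otherwise from the i-cache", min-reduced across levels. A compilable model of that selection logic follows; the struct and the fallback constant are illustrative stand-ins (the real code uses pal_cache_config_info_t, ia64_pal_cache_config_info() and its pcci_unified / pcci_stride fields):

```c
/* Hypothetical model of one PAL-reported cache level. */
struct cache_level {
    int unified;                  /* stands in for pcci_unified            */
    unsigned int unified_stride;  /* stride from the type-2 (data) query   */
    unsigned int insn_stride;     /* stride from the type-1 (insn) query   */
};

#define DEFAULT_I_CACHE_STRIDE_SHIFT 5  /* safe fallback: 32-byte stride */

/* For each level, pick the i-cache stride when the cache is split and
 * the unified stride otherwise, then take the minimum over all levels.
 * Fall back to the safe default when no level supplies a value. */
static unsigned int min_i_cache_stride_shift(const struct cache_level *lvl,
                                             int nlevels)
{
    unsigned int min = ~0u;
    int l;

    for (l = 0; l < nlevels; l++) {
        unsigned int s = lvl[l].unified ? lvl[l].unified_stride
                                        : lvl[l].insn_stride;
        if (s < min)
            min = s;
    }
    return (min == ~0u) ? DEFAULT_I_CACHE_STRIDE_SHIFT : min;
}
```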
--david
^ permalink raw reply [flat|nested] 32+ messages in thread* Re: flush_icache_range
2005-03-15 12:40 flush_icache_range Zoltan Menyhart
` (27 preceding siblings ...)
2005-06-02 23:07 ` flush_icache_range David Mosberger
@ 2005-06-03 12:35 ` Zoltan Menyhart
2005-06-03 21:09 ` flush_icache_range David Mosberger
2005-06-13 11:20 ` flush_icache_range Zoltan Menyhart
30 siblings, 0 replies; 32+ messages in thread
From: Zoltan Menyhart @ 2005-06-03 12:35 UTC (permalink / raw)
To: linux-ia64
[-- Attachment #1: Type: text/plain, Size: 160 bytes --]
David, Jim,
Here is my next try. I hope I have not missed anything.
Thanks,
Zoltan
P.S.: next week I'll be away, should there be some minor problems...
[-- Attachment #2: diff3 --]
[-- Type: text/plain, Size: 4582 bytes --]
--- linux-2.6.11-orig/arch/ia64/lib/flush.S 2005-04-26 15:59:49.000000000 +0200
+++ linux-2.6.11/arch/ia64/lib/flush.S 2005-06-03 14:11:09.069041791 +0200
@@ -3,37 +3,59 @@
*
* Copyright (C) 1999-2001 Hewlett-Packard Co
* Copyright (C) 1999-2001 David Mosberger-Tang <davidm@hpl.hp.com>
+ *
+ * 05/28/05 Zoltan Menyhart Dynamic stride size
*/
+
#include <asm/asmmacro.h>
-#include <asm/page.h>
+
/*
* flush_icache_range(start,end)
- * Must flush range from start to end-1 but nothing else (need to
+ *
+ * Make i-cache(s) coherent with d-caches.
+ *
+ * Must deal with range from start to end-1 but nothing else (need to
* be careful not to touch addresses that may be unmapped).
+ *
+ * Note: "in0" and "in1" are preserved for debugging purposes.
*/
GLOBAL_ENTRY(flush_icache_range)
+
.prologue
- alloc r2=ar.pfs,2,0,0,0
- sub r8=in1,in0,1
- ;;
- shr.u r8=r8,5 // we flush 32 bytes per iteration
- .save ar.lc, r3
- mov r3=ar.lc // save ar.lc
+ alloc r2=ar.pfs,2,0,0,0
+ movl r3=ia64_i_cache_stride_shift
+ mov r21=1
+ ;;
+ ld8 r20=[r3] // r20: stride shift
+ sub r22=in1,r0,1 // last byte address
+ ;;
+ shr.u r23=in0,r20 // start / (stride size)
+ shr.u r22=r22,r20 // (last byte address) / (stride size)
+ shl r21=r21,r20 // r21: stride size of the i-cache(s)
+ ;;
+ sub r8=r22,r23 // number of strides - 1
+ shl r24=r23,r20 // r24: addresses for "fc.i" =
+ // "start" rounded down to stride boundary
+ .save ar.lc,r3
+ mov r3=ar.lc // save ar.lc
;;
.body
-
- mov ar.lc=r8
+ mov ar.lc=r8
;;
-.Loop: fc in0 // issuable on M0 only
- add in0=32,in0
+ /*
+ * 32 byte aligned loop, even number of (actually 2) bundles
+ */
+.Loop: fc.i r24 // issuable on M0 only
+ add r24=r21,r24 // we flush "stride size" bytes per iteration
+ nop.i 0
br.cloop.sptk.few .Loop
;;
sync.i
;;
srlz.i
;;
- mov ar.lc=r3 // restore ar.lc
+ mov ar.lc=r3 // restore ar.lc
br.ret.sptk.many rp
END(flush_icache_range)
--- linux-2.6.11-orig/arch/ia64/kernel/setup.c 2005-04-26 15:59:49.000000000 +0200
+++ linux-2.6.11/arch/ia64/kernel/setup.c 2005-06-03 14:06:23.779006224 +0200
@@ -15,6 +15,7 @@
* 02/01/00 R.Seth fixed get_cpuinfo for SMP
* 01/07/99 S.Eranian added the support for command line argument
* 06/24/99 W.Drummond added boot_cpu_data.
+ * 05/28/05 Z. Menyhart Dynamic stride size for "flush_icache_range()"
*/
#include <linux/config.h>
#include <linux/module.h>
@@ -78,6 +79,13 @@
EXPORT_SYMBOL(io_space);
unsigned int num_io_spaces;
+/*
+ * "flush_icache_range()" needs to know what processor dependent stride size to use
+ * when it makes i-cache(s) coherent with d-caches.
+ */
+#define I_CACHE_STRIDE_SHIFT 5 /* Safest way to go: 32 bytes by 32 bytes */
+unsigned long ia64_i_cache_stride_shift = ~0;
+
unsigned char aux_device_present = 0xaa; /* XXX remove this when legacy I/O is gone */
/*
@@ -590,6 +598,12 @@
/* start_kernel() requires this... */
}
+/*
+ * Calculate the max. cache line size.
+ *
+ * In addition, the minimum of the i-cache stride sizes is calculated for
+ * "flush_icache_range()".
+ */
static void
get_max_cacheline_size (void)
{
@@ -603,6 +617,8 @@
printk(KERN_ERR "%s: ia64_pal_cache_summary() failed (status=%ld)\n",
__FUNCTION__, status);
max = SMP_CACHE_BYTES;
+ /* Safest setup for "flush_icache_range()" */
+ ia64_i_cache_stride_shift = I_CACHE_STRIDE_SHIFT;
goto out;
}
@@ -611,14 +627,31 @@
&cci);
if (status != 0) {
printk(KERN_ERR
- "%s: ia64_pal_cache_config_info(l=%lu) failed (status=%ld)\n",
+ "%s: ia64_pal_cache_config_info(l=%lu, 2) failed (status=%ld)\n",
__FUNCTION__, l, status);
max = SMP_CACHE_BYTES;
+ /* The safest setup for "flush_icache_range()" */
+ cci.pcci_stride = I_CACHE_STRIDE_SHIFT;
+ cci.pcci_unified = 1;
}
line_size = 1 << cci.pcci_line_size;
if (line_size > max)
max = line_size;
- }
+ if (!cci.pcci_unified) {
+ status = ia64_pal_cache_config_info(l,
+ /* cache_type (instruction)= */ 1,
+ &cci);
+ if (status != 0) {
+ printk(KERN_ERR
+ "%s: ia64_pal_cache_config_info(l=%lu, 1) failed (status=%ld)\n",
+ __FUNCTION__, l, status);
+ /* The safest setup for "flush_icache_range()" */
+ cci.pcci_stride = I_CACHE_STRIDE_SHIFT;
+ }
+ }
+ if (cci.pcci_stride < ia64_i_cache_stride_shift)
+ ia64_i_cache_stride_shift = cci.pcci_stride;
+ }
out:
if (max > ia64_max_cacheline_size)
ia64_max_cacheline_size = max;
^ permalink raw reply [flat|nested] 32+ messages in thread

* Re: flush_icache_range
2005-03-15 12:40 flush_icache_range Zoltan Menyhart
` (28 preceding siblings ...)
2005-06-03 12:35 ` flush_icache_range Zoltan Menyhart
@ 2005-06-03 21:09 ` David Mosberger
2005-06-13 11:20 ` flush_icache_range Zoltan Menyhart
30 siblings, 0 replies; 32+ messages in thread
From: David Mosberger @ 2005-06-03 21:09 UTC (permalink / raw)
To: linux-ia64
Looks fine to me. The large number of shifts in the setup-code is a
bit unfortunate, but compared to the savings the larger stride
achieves on today's systems, this penalty is minor. Still, if you
could find a sequence that's nicer to McKinley-type cores (with a
single shifter), that would be a plus.
The other slight concern I have is that if somebody calls
flush_icache_range() before cpu_init() has been called the first time,
it won't work. We can probably live with it, but it would have been
nicer if there had been a clean way to default to stride=5 initially.
I think the patch is close enough that it should be put in the kernel.
--david
* Re: flush_icache_range
2005-03-15 12:40 flush_icache_range Zoltan Menyhart
` (29 preceding siblings ...)
2005-06-03 21:09 ` flush_icache_range David Mosberger
@ 2005-06-13 11:20 ` Zoltan Menyhart
30 siblings, 0 replies; 32+ messages in thread
From: Zoltan Menyhart @ 2005-06-13 11:20 UTC (permalink / raw)
To: linux-ia64
David Mosberger wrote:
> Looks fine to me. The large number of shifts in the setup-code is a
> bit unfortunate, but compared to the savings the larger stride
> achieves on today's systems, this penalty is minor. Still, if you
> could find a sequence that's nicer to McKinley-type cores (with a
> single shifter), that would be a plus.
We need:
- the stride size shift (for dividing)
- the stride size (for the address steps)
- some masking (for rounding down the start address)
I cannot find any sequence with fewer shifts.
( I do not want to store the shift, the value and the mask
separately in global variables. )
> The other slight concern I have is that if somebody calls
> flush_icache_range() before cpu_init() has been called the first time,
> it won't work. We can probably live with it, but it would have been
> nicer if there had been a clean way to default to stride=5 initially.
I checked where "flush_icache_range()" is used:
fs/binfmt_*.c
do_mmap_pgoff()
sys_init_module()
via flush_icache_user_range():
copy_to_user_page()
via update_mmu_cache():
swiotlb_map_single()
install_page()
install_file_pte()
break_cow()
do_wp_page()
do_swap_page()
do_anonymous_page()
do_no_page()
handle_pte_fault()
( I hope I have not missed anything. )
These all run well after my very first stride size init:
start_kernel():
setup_arch():
cpu_init():
get_max_cacheline_size()
As for the other CPUs: any __init'ed code can possibly
do some black magic.
start_kernel():
rest_init():
init():
smp_init():
for (<the other CPUs>)
cpu_up(i)
- Either the black magic happens before "cpu_up(i)":
CPU-i has not yet got anything about it in its cache,
- Or the black magic happens after "cpu_up(i)":
the (possibly smaller) stride size of CPU-i is
already taken into account.
We are safe either way.
As far as "sn_flush_all_caches()" is concerned,
can someone, please, from SGI have a look at it?
Debuggers should use their own safe way to flush caches:
the less they rely on standard kernel routines,
the less chance that a debugging session crashes :-).
Thanks,
Zoltan