The Itanium 2 Processor Reference Manual for Software Development and Optimization (May 2004) says in section 5.8:

"In Itanium 2 processor, each fc will invalidate 128 bytes corresponding to the L3 cache line size. Since both the L1I and L1D have line sizes of 64 bytes, a single fc instruction can invalidate two lines."

Can someone please confirm that an equivalent statement is true for "fc.i", too? Say:

"In Itanium 2 processor, each fc.i instruction will ensure that 128 bytes (corresponding to the L3 cache line size) of the I-cache(s) are coherent with the data caches. Since the L1I cache has a line size of 64 bytes, a single fc.i instruction can make two lines coherent."

This gave me the idea to try 128-byte strides (each measurement is repeated 10 times):

Modified in d-cache:
    cycles = 19,164    time = 14.782 usec
    cycles = 18,060    time = 13.930 usec
    cycles = 16,929    time = 13.058 usec
    cycles = 17,597    time = 13.573 usec
    cycles = 17,163    time = 13.239 usec
    cycles = 16,990    time = 13.105 usec
    cycles = 17,427    time = 13.442 usec
    cycles = 17,028    time = 13.134 usec
    cycles = 16,993    time = 13.107 usec
    cycles = 16,930    time = 13.059 usec

Valid:
    cycles = 13,514    time = 10.424 usec
    cycles = 13,518    time = 10.427 usec
    cycles = 13,518    time = 10.427 usec
    cycles = 13,518    time = 10.427 usec
    cycles = 13,518    time = 10.427 usec
    cycles = 13,746    time = 10.603 usec
    cycles = 13,866    time = 10.695 usec
    cycles = 13,830    time = 10.668 usec
    cycles = 13,790    time = 10.637 usec
    cycles = 13,830    time = 10.668 usec

Invalid:
    cycles = 13,794    time = 10.640 usec
    cycles = 13,790    time = 10.637 usec
    cycles = 13,830    time = 10.668 usec
    cycles = 13,830    time = 10.668 usec
    cycles = 13,966    time = 10.773 usec
    cycles = 13,994    time = 10.794 usec
    cycles = 14,074    time = 10.856 usec
    cycles = 13,574    time = 10.470 usec
    cycles = 13,902    time = 10.723 usec
    cycles = 14,114    time = 10.887 usec

These cycle counts are incredibly low compared to my previous results.

With a 32-byte stride:
    Modified in d-cache:    cycles = 215 K,    time = 169 usec
    Valid:                  cycles = 222 K,    time = 171 usec
    Invalid:                cycles = 222 K,    time = 171 usec

With a 64-byte stride:
    Modified in d-cache:    cycles = 63 K,     time = 49 usec
    Valid:                  cycles = 116 K,    time = 89 usec
    Invalid:                cycles = 116 K,    time = 89 usec

This is a Tiger box with the following CPUs:

    processor  : 0
    vendor     : GenuineIntel
    arch       : IA-64
    family     : Itanium 2
    model      : 1
    revision   : 5
    archrev    : 0
    features   : branchlong
    cpu number : 0
    cpu regs   : 4
    cpu MHz    : 1296.435998
    itc MHz    : 1296.435998
    BogoMIPS   : 1941.96
    etc...

Can these results be real?

Thanks,

Zoltan
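
As a sanity check on the figures above, here is a minimal sketch (Python, illustrative only) of the arithmetic behind the question. It converts cycle counts to microseconds using the "itc MHz" value from the cpuinfo output, and compares the measured 128-byte result against a purely hypothetical model in which the cost is proportional to the number of fc.i instructions issued, so that doubling the stride halves the cycle count. The function names and the linear model are assumptions made for illustration and are not taken from the manual.

```python
# Illustrative sketch: all input numbers are copied from the measurements
# quoted in this post; the linear scaling model is an assumption.

ITC_MHZ = 1296.435998  # "itc MHz" from the cpuinfo output (cycles per usec)


def cycles_to_usec(cycles, itc_mhz=ITC_MHZ):
    """Convert an ITC cycle count to microseconds."""
    return cycles / itc_mhz


def expected_cycles(base_cycles, base_stride, new_stride):
    """Hypothetical model: over a fixed-size region the number of fc.i
    instructions, and hence the cost, is inversely proportional to the stride."""
    return base_cycles * base_stride / new_stride


if __name__ == "__main__":
    # Cross-check the reported times against the reported cycle counts.
    for cycles in (19164, 13514):
        print(f"{cycles} cycles = {cycles_to_usec(cycles):.3f} usec")

    # "Valid" case: 116 K cycles at a 64-byte stride, 13,518 measured at 128.
    predicted = expected_cycles(116_000, 64, 128)
    measured = 13_518
    print(f"linear model predicts {predicted:.0f} cycles at a 128-byte stride")
    print(f"measured value is {predicted / measured:.1f}x lower than predicted")
```

Under this assumed model the 128-byte "Valid" run would take about half of the 64-byte figure, so the measured value is several times lower than simple scaling would suggest, which is what makes the result look suspicious.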