Here is a small patch that flushes the i-cache with a 64-byte stride
on Itanium 2 (or later).

Some measurements on a Tiger box with the following CPUs:

processor  : 0
vendor     : GenuineIntel
arch       : IA-64
family     : Itanium 2
model      : 1
revision   : 5
archrev    : 0
features   : branchlong
cpu number : 0
cpu regs   : 4
cpu MHz    : 1296.439995
itc MHz    : 1296.439995
BogoMIPS   : 1941.96
etc...

Flushing a 64-Kbyte page (the other CPUs have nothing to do, since
none of my data is in their caches):

With a 32-byte stride:

Modified in d-cache:  cycles = 215 K, time = 169 usec
Valid:                cycles = 222 K, time = 171 usec
Invalid:              cycles = 222 K, time = 171 usec

Note that in the dirty case, only the 1st flush causes a write-back
from the L2 / L3 caches; the 3 other flushes find the cache entries
already invalid in the L2 / L3 caches.

With a 64-byte stride:

Modified in d-cache:  cycles =  63 K, time =  49 usec
Valid:                cycles = 116 K, time =  89 usec
Invalid:              cycles = 116 K, time =  89 usec

It is interesting to see that dirty lines can be flushed more
efficiently. I guess the CPU knows in that case that the other CPUs
cannot have anything to flush, so the flush request may not even be
issued to them.

I also tried issuing more than one flush per loop-body iteration;
it did not help.

Thanks,

Zoltan
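
P.S. For readers who want to check the arithmetic, here is a minimal C sketch of the stride loop measured above. It is not the patch itself. The names flush_count(), flush_range() and noop_flush() are made up for illustration, and the callback is a stand-in for whatever per-line flush instruction the real code issues. It assumes the start address is aligned to the stride.

```c
#include <stddef.h>

/* Hypothetical callback type: flushes the cache line containing addr. */
typedef void (*flush_fn)(const void *addr);

/* Stand-in that does nothing; real code would issue a cache flush here. */
static void noop_flush(const void *addr)
{
    (void)addr;
}

/* Number of flush operations needed to cover len bytes at a given stride. */
static size_t flush_count(size_t len, size_t stride)
{
    return (len + stride - 1) / stride;
}

/*
 * Walk [start, start + len) in steps of stride bytes, calling flush_line
 * once per step. Returns the number of flushes issued.
 * Assumption: start is aligned to stride.
 */
static size_t flush_range(const char *start, size_t len, size_t stride,
                          flush_fn flush_line)
{
    size_t issued = 0;
    const char *end = start + len;

    for (const char *p = start; p < end; p += stride) {
        flush_line(p);
        issued++;
    }
    return issued;
}
```

For a 64-Kbyte page, flush_count(64 * 1024, 32) gives 2048 flushes and flush_count(64 * 1024, 64) gives 1024, so the 64-byte stride issues half as many flush operations. That is in line with the roughly halved cycle counts for the valid and invalid cases.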