* [PATCH] pte prefetching
@ 2005-03-24 7:18 Nick Piggin
2005-03-24 20:15 ` David Mosberger
2005-03-25 5:22 ` Nick Piggin
0 siblings, 2 replies; 3+ messages in thread
From: Nick Piggin @ 2005-03-24 7:18 UTC (permalink / raw)
To: linux-ia64
[-- Attachment #1: Type: text/plain, Size: 1184 bytes --]
Hi,
Sending this to the ia64 list, because that is so far the only platform
I have tested on, and because the patch may be more likely to have real
applications on ia64 systems.
I have been looking at different implementations of unmapping and page
table freeing recently. As a consequence, I came to notice that the
vast majority of L2 cache misses on ia64 (and probably all
architectures) in an unmapping workload come from the line:
pte_t ptent = *pte;
in zap_pte_range, i.e. walking the bottom level page table pages.
I should qualify that - that is the case when the page tables
aren't in cache - this does not apply to a simple lmbench fork/exit
test for example.
Anyway, I tried prefetching a line ahead of the one we're currently
working in, and put the prefetching into zap_pte_range, and
copy_pte_range (which does a similar pte walk to set up page tables
on fork()).
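As a rough user-space illustration of the idea (not the kernel code in the attached patch), the walk below touches an array of 8-byte entries and issues a prefetch for the following cache line each time it enters a new one. The 128-byte line size, the `count_present` name and the use of the GCC/Clang `__builtin_prefetch` builtin are assumptions made for the sketch, and the table is assumed to start on a line boundary.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed for illustration: 128-byte cache lines, 8-byte entries. */
#define LINE_BYTES 128u
#define ENTRIES_PER_LINE (LINE_BYTES / sizeof(uint64_t))

/*
 * Count the non-zero entries in table[0..n). Each time the walk reaches
 * the first entry of a cache line, it prefetches the next line, provided
 * that line still lies inside the table.
 */
static size_t count_present(const uint64_t *table, size_t n)
{
    size_t present = 0;

    for (size_t i = 0; i < n; i++) {
        if (i % ENTRIES_PER_LINE == 0 && i + ENTRIES_PER_LINE < n)
            __builtin_prefetch(&table[i + ENTRIES_PER_LINE], 0 /* read */);
        if (table[i] != 0)
            present++;
    }
    return present;
}
```

The prefetch is only a hint, so the result is the same with or without it; only the cache behaviour of the walk changes.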
microbenchmark results are pretty good - but I wonder if anyone might
have a real-world use for it?
After applying the recent freepgt patchset from Hugh (on lkml), the
time to fork+exit a process mapping 64GB of address (32MB of page
tables) is 0.471s. With the prefetch patch, this drops to 0.357s.
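The 64GB to 32MB relationship can be checked with a small sketch. It assumes a 16KB page size and 8-byte PTEs; neither is stated in this message, but those values reproduce the quoted figure. The helper name `pte_table_bytes` is made up for the example, and only bottom-level page tables are counted.

```c
#include <stdint.h>

/*
 * Bytes of bottom-level page table needed to map mapped_bytes,
 * given the page size and the size of one PTE.
 */
static uint64_t pte_table_bytes(uint64_t mapped_bytes, uint64_t page_bytes,
                                uint64_t pte_size)
{
    return mapped_bytes / page_bytes * pte_size;
}
```

With these assumed values, `pte_table_bytes(64ULL << 30, 16ULL << 10, 8)` gives 32MB, matching the figure above.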
[-- Attachment #2: pte-prefetch --]
[-- Type: text/plain, Size: 2843 bytes --]
Index: linux-2.6/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-generic/pgtable.h 2005-03-24 10:43:38.000000000 +1100
+++ linux-2.6/include/asm-generic/pgtable.h 2005-03-24 12:08:57.000000000 +1100
@@ -160,6 +160,39 @@ static inline void ptep_set_wrprotect(st
})
#endif
+#define PTES_PER_LINE (L1_CACHE_BYTES / sizeof(pte_t))
+#define PTE_LINE_MASK (~(PTES_PER_LINE - 1))
+#define ADDR_PER_LINE (PTES_PER_LINE << PAGE_SHIFT)
+#define ADDR_LINE_MASK (~((PTES_PER_LINE << PAGE_SHIFT) - 1))
+
+#define pte_prefetch(pte, addr, end) \
+({ \
+ unsigned long nextline = ((addr) + ADDR_PER_LINE) & ADDR_LINE_MASK; \
+ if (nextline < (end)) \
+ prefetch(pte + PTES_PER_LINE); \
+})
+
+#define pte_prefetch_next(pte, addr, end) \
+({ \
+ unsigned long _addr = (addr); \
+ if (!(_addr & ~ADDR_LINE_MASK)) /* We hit a new cacheline */ \
+ pte_prefetch(pte, _addr, end); \
+})
+
+#define pte_prefetchw(pte, addr, end) \
+({ \
+ unsigned long nextline = ((addr) + ADDR_PER_LINE) & ADDR_LINE_MASK; \
+ if (nextline < (end)) \
+ prefetchw(pte + PTES_PER_LINE); \
+})
+
+#define pte_prefetchw_next(pte, addr, end) \
+({ \
+ unsigned long _addr = (addr); \
+ if (!(_addr & ~ADDR_LINE_MASK)) /* We hit a new cacheline */ \
+ pte_prefetchw(pte, _addr, end); \
+})
+
#ifndef __ASSEMBLY__
/*
* When walking page tables, we usually want to skip any p?d_none entries;
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c 2005-03-24 12:08:43.000000000 +1100
+++ linux-2.6/mm/memory.c 2005-03-24 12:08:57.000000000 +1100
@@ -411,6 +411,7 @@ again:
progress = 0;
spin_lock(&src_mm->page_table_lock);
+ pte_prefetch(src_pte, addr, end);
do {
/*
* We are holding two locks at this point - either of them
@@ -426,7 +427,9 @@ again:
}
copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vm_flags, addr);
progress += 8;
- } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
+ } while (dst_pte++, src_pte++, addr += PAGE_SIZE,
+ pte_prefetch_next(src_pte, addr, end), addr != end);
+
spin_unlock(&src_mm->page_table_lock);
pte_unmap_nested(src_pte - 1);
@@ -512,6 +515,7 @@ static void zap_pte_range(struct mmu_gat
pte_t *pte;
pte = pte_offset_map(pmd, addr);
+ pte_prefetchw(pte, addr, end);
do {
pte_t ptent = *pte;
if (pte_none(ptent))
@@ -571,7 +575,8 @@ static void zap_pte_range(struct mmu_gat
if (!pte_file(ptent))
free_swap_and_cache(pte_to_swp_entry(ptent));
pte_clear(tlb->mm, addr, pte);
- } while (pte++, addr += PAGE_SIZE, addr != end);
+ } while (pte++, addr += PAGE_SIZE,
+ pte_prefetchw_next(pte, addr, end), addr != end);
pte_unmap(pte - 1);
}
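To make the address arithmetic in the pte_prefetch macros easier to follow, here is a stand-alone sketch of the same masking. The values PAGE_SHIFT = 14 and 16 PTEs per line are assumptions picked for illustration (the patch derives them from L1_CACHE_BYTES and sizeof(pte_t)), and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration only. */
#define PAGE_SHIFT_ASSUMED    14u
#define PTES_PER_LINE_ASSUMED 16u

/* Span of virtual address covered by one cache line of PTEs. */
#define ADDR_PER_LINE  ((uint64_t)PTES_PER_LINE_ASSUMED << PAGE_SHIFT_ASSUMED)
#define ADDR_LINE_MASK (~(ADDR_PER_LINE - 1))

/* True when addr is the first address mapped by a new line of PTEs. */
static bool starts_new_line(uint64_t addr)
{
    return (addr & ~ADDR_LINE_MASK) == 0;
}

/* First address mapped by the line of PTEs after the one covering addr. */
static uint64_t next_line_addr(uint64_t addr)
{
    return (addr + ADDR_PER_LINE) & ADDR_LINE_MASK;
}
```

Under these assumptions one line of PTEs covers 256KB of address space, so the per-iteration check fires once every 16 pages and the prefetch is skipped when the next line would start at or beyond `end`.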
* Re: [PATCH] pte prefetching
2005-03-24 7:18 [PATCH] pte prefetching Nick Piggin
@ 2005-03-24 20:15 ` David Mosberger
2005-03-25 5:22 ` Nick Piggin
1 sibling, 0 replies; 3+ messages in thread
From: David Mosberger @ 2005-03-24 20:15 UTC (permalink / raw)
To: linux-ia64
>>>>> On Thu, 24 Mar 2005 18:18:17 +1100, Nick Piggin <nickpiggin@yahoo.com.au> said:
Nick> After applying the recent freepgt patchset from Hugh (on
Nick> lkml), the time to fork+exit a process mapping 64GB of address
Nick> (32MB of page tables) is 0.471s. With the prefetch patch, this
Nick> drops to 0.357s.
Looks like a nice improvement to me.
Does prefetching 1 line ahead give the best results? That's only
128/8 = 16 PTEs. Assuming a 200 cycle latency, this would allow
for only 12.5 cycles/iteration. Especially for large (NUMA) machines,
prefetching further out might help more.
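The arithmetic above can be written out as a small sketch. The function names are hypothetical, and the 200-cycle latency is the assumed figure from the text, not a measurement.

```c
#include <stddef.h>

/* Number of PTEs that fit in one cache line. */
static size_t ptes_per_line(size_t line_bytes, size_t pte_bytes)
{
    return line_bytes / pte_bytes;
}

/*
 * Cycles available per loop iteration if a prefetch issued one line
 * ahead is to complete before the walk reaches that line.
 */
static double cycles_per_iteration(double latency_cycles, size_t line_bytes,
                                   size_t pte_bytes)
{
    return latency_cycles / (double)ptes_per_line(line_bytes, pte_bytes);
}
```

For a 128-byte line and 8-byte PTEs this gives 16 PTEs per line and a budget of 200 / 16 = 12.5 cycles per iteration.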
--david
* Re: [PATCH] pte prefetching
2005-03-24 7:18 [PATCH] pte prefetching Nick Piggin
2005-03-24 20:15 ` David Mosberger
@ 2005-03-25 5:22 ` Nick Piggin
1 sibling, 0 replies; 3+ messages in thread
From: Nick Piggin @ 2005-03-25 5:22 UTC (permalink / raw)
To: linux-ia64
David Mosberger wrote:
>>>>>>On Thu, 24 Mar 2005 18:18:17 +1100, Nick Piggin <nickpiggin@yahoo.com.au> said:
>
>
> Nick> After applying the recent freepgt patchset from Hugh (on
> Nick> lkml), the time to fork+exit a process mapping 64GB of address
> Nick> (32MB of page tables) is 0.471s. With the prefetch patch, this
> Nick> drops to 0.357s.
>
Sorry, above numbers were wrong:
0.118s versus 0.089s. Improvement ratio is the same, I just used the
wrong divisor.
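A quick check of the two pairs of timings, as a sketch (the helper name `time_reduction` is made up for the example):

```c
/* Fraction of run time saved when going from `before` to `after`. */
static double time_reduction(double before, double after)
{
    return (before - after) / before;
}
```

Both `time_reduction(0.471, 0.357)` and `time_reduction(0.118, 0.089)` come out between 24% and 25%, so the two sets of numbers agree to within rounding.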
> Looks like a nice improvement to me.
>
> Does prefetching 1 line ahead give the best results? That's only
> 128/8 = 16 PTEs. Assuming a 200 cycle latency, this would allow
> for only 12.5 cycles/iteration. Especially for large (NUMA) machines,
> prefetching further out might help more.
>
Hmm... yeah it may do. Although I don't think that changes your cycles
/ iteration ratio, does it? Just allows for a little bit more
variation.
I just retested, and prefetching 2 lines ahead gives virtually the same
performance.
But actually, my tests are set up so each pte page has only a single
'present' pte (I did it that way to speed up initial faulting time).
So the loop will almost always get stopped by the pte_none tests. So
perhaps that is able to complete in close to or less than 12 cycles.
Nick