linux-kernel.vger.kernel.org archive mirror
* [PATCH] x86_64 Avoid some atomic operations during address space destruction
From: Zachary Amsden @ 2005-08-07 12:16 UTC
  To: Andi Kleen, Linux Kernel Mailing List, Pratap Subrahmanyam

[-- Attachment #1: Type: text/plain, Size: 878 bytes --]

This turned out to be a huge win on 32-bit i386 in PAE mode, but it is 
likely not as significant on x86_64; I don't know because I haven't 
actually measured the cost.  I don't have 64-bit hardware that I have 
the luxury of rebooting right now, so this patch is untested, but if 
someone wants to try this out, it might actually show a measurable win 
on fork/exit.  I lost my cycle count measurement diffs, but I don't 
think they would apply cleanly to x86_64 anyways.  This patch at least 
looks good, and compiles cleanly on 2.6.13-rc5-mm1, thus passing some 
level of testing.

Also, it might show reduced latency on pre-emptible kernels during heavy 
fork/exit activity, possibly allowing ZAP_BLOCK_SIZE to be raised for 
some architectures (I measured a ~30-50% reduction in cycle timings for 
zap_pte_range on i386 with CONFIG_PREEMPT using the analogous patch).
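
As an aside, here is a minimal sketch of the kind of rdtsc timing harness
the lost diffs would have amounted to (read_tsc() and the scaffolding
around it are illustrative names, not from the original measurement patch):

#include <stdint.h>
#include <stdio.h>

/* Read the x86 time-stamp counter.  Serializing instructions
 * (cpuid/lfence) are omitted for brevity, so treat the results as
 * approximate. */
static inline uint64_t read_tsc(void)
{
	uint32_t lo, hi;
	__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	uint64_t t0 = read_tsc();
	/* ... code under test, e.g. a burst of PTE clears ... */
	uint64_t cycles = read_tsc() - t0;
	printf("%llu cycles\n", (unsigned long long)cycles);
	return 0;
}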

Zach

[-- Attachment #2: x86_64-pte-destruction --]
[-- Type: text/plain, Size: 1576 bytes --]

Any architecture that has hardware-updated A/D bits that require
synchronization against other processors during PTE operations
can benefit from doing non-atomic PTE updates during address space
destruction.  Originally done on i386, now ported to x86_64.

Doing a read/write pair instead of an xchg() operation saves the
implicit lock, which turns out to be a big win on 32-bit (especially with PAE).
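
To make the cost concrete, a stand-alone user-space illustration (hypothetical,
not part of the patch): the first helper compiles to an xchg instruction,
which is implicitly locked on x86 even without a LOCK prefix, while the
second compiles to a plain load/store pair.

#include <stdint.h>

/* Atomic flavor: compilers emit xchg here, and xchg with a memory
 * operand asserts the bus/cacheline lock implicitly. */
static inline uint64_t pte_clear_atomic(uint64_t *ptep)
{
	return __atomic_exchange_n(ptep, 0, __ATOMIC_SEQ_CST);
}

/* Non-atomic flavor: a plain read/write pair.  Only safe when no other
 * processor can update the PTE (e.g. its A/D bits) concurrently, which
 * is exactly the address-space-destruction case. */
static inline uint64_t pte_clear_plain(uint64_t *ptep)
{
	uint64_t old = *ptep;
	*ptep = 0;
	return old;
}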

Diffs-against: 2.6.13-rc5-mm1
Signed-off-by: Zachary Amsden <zach@vmware.com>
Index: linux-2.6.13-rc5-mm1/include/asm-x86_64/pgtable.h
===================================================================
--- linux-2.6.13-rc5-mm1.orig/include/asm-x86_64/pgtable.h	2005-08-07 04:56:37.000000000 -0700
+++ linux-2.6.13-rc5-mm1/include/asm-x86_64/pgtable.h	2005-08-07 04:59:18.601856096 -0700
@@ -104,6 +104,19 @@
 ((unsigned long) __va(pud_val(pud) & PHYSICAL_PAGE_MASK))
 
 #define ptep_get_and_clear(mm,addr,xp)	__pte(xchg(&(xp)->pte, 0))
+
+static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm, unsigned long addr, pte_t *ptep, int full)
+{
+	pte_t pte;
+	if (full) {
+		pte = *ptep;
+		*ptep = __pte(0);
+	} else {
+		pte = ptep_get_and_clear(mm, addr, ptep);
+	}
+	return pte;
+}
+
 #define pte_same(a, b)		((a).pte == (b).pte)
 
 #define PMD_SIZE	(1UL << PMD_SHIFT)
@@ -433,6 +446,7 @@
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
+#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define __HAVE_ARCH_PTE_SAME
 #include <asm-generic/pgtable.h>


* Re: [PATCH] x86_64 Avoid some atomic operations during address space destruction
From: Andi Kleen @ 2005-08-25 16:54 UTC
  To: Zachary Amsden; +Cc: Linux Kernel Mailing List, Pratap Subrahmanyam

On Sunday 07 August 2005 14:16, Zachary Amsden wrote:
> This turned out to be a huge win on 32-bit i386 in PAE mode, but it is
> likely not as significant on x86_64; I don't know because I haven't
> actually measured the cost.  I don't have 64-bit hardware that I have
> the luxury of rebooting right now, so this patch is untested, but if
> someone wants to try this out, it might actually show a measurable win
> on fork/exit.  I lost my cycle count measurement diffs, but I don't
> think they would apply cleanly to x86_64 anyways.  This patch at least
> looks good, and compiles cleanly on 2.6.13-rc5-mm1, thus passing some
> level of testing.

FYI I have queued it, but cannot apply it because the necessary generic
code support is still not in mainline.

Do you have any other optimizations pending for x86-64? 

There is still the iopl optimization that you did that is on my TODO list to 
add. Anything else?

-Andi


* Re: [PATCH] x86_64 Avoid some atomic operations during address space destruction
From: Zachary Amsden @ 2005-08-25 17:12 UTC
  To: Andi Kleen; +Cc: Linux Kernel Mailing List, Pratap Subrahmanyam, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 745 bytes --]

Andi Kleen wrote:

>On Sunday 07 August 2005 14:16, Zachary Amsden wrote:
>  
>
>FYI I have queued it, but cannot apply it because the necessary generic
>code support is still not in mainline.
>  
>

Here's the patch for generic / i386 support; it's already in the -mm tree.

>Do you have any other optimizations pending for x86-64? 
>
>There is still the iopl optimization that you did that is on my TODO list to 
>add. Anything else?
>  
>

I started porting the IOPL work, but got confused in my tree and ended up 
patching asm-i386 with x86-64 code.  The joy of unenforced source control!

I have some other MMU optimizations pending that will hopefully be a win 
for all architectures; still measuring which alternative is best there.

Zach

[-- Attachment #2: mmu-ptep-clear-optimization --]
[-- Type: text/plain, Size: 4007 bytes --]

Add a new accessor for PTEs, which passes the full hint from the mmu_gather
struct; this allows architectures with hardware pagetables to optimize away
atomic PTE operations when destroying an address space.  Removing the locked
operation should allow better pipelining of memory access in this loop.  I
measured an average savings of 30-35 cycles per zap_pte_range on the first 500
destructions on Pentium M, but I believe the optimization would win more on
older processors which still assert the bus lock on xchg for an exclusive
cacheline.

Update: I made some new measurements, and this saves exactly 26 cycles over
ptep_get_and_clear on Pentium M.  On P4, with a PAE kernel, this saves 180
cycles per ptep_get_and_clear, for a whopping savings of 92160 cycles
(180 cycles x 512 PTEs) on a full address space destruction.

pte_clear_full is not yet used, but is provided for future optimizations (in
particular, when running inside of a hypervisor that queues page table updates,
the full hint allows us to avoid queueing unnecessary page table updates for an
address space in the process of being destroyed).
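
A purely hypothetical sketch of that future hypervisor path
(queue_pte_update() is an invented name, not an interface this patch adds):

/* Hypothetical paravirtualized pte_clear_full: when the whole address
 * space is going away (full != 0), skip queueing a page table update
 * with the hypervisor, since nothing will ever look at this PTE again. */
static inline void pte_clear_full(struct mm_struct *mm, unsigned long addr,
				  pte_t *ptep, int full)
{
	if (full)
		*ptep = __pte(0);	/* no hypervisor round trip */
	else
		queue_pte_update(mm, addr, ptep, __pte(0)); /* invented name */
}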

This is not a huge win, but it does help a bit, and sets the stage for further
hypervisor optimization of the mm layer on all architectures.

Signed-off-by: Zachary Amsden <zach@vmware.com>
Index: linux-2.6.13/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.13.orig/include/asm-generic/pgtable.h	2005-07-29 11:03:10.000000000 -0700
+++ linux-2.6.13/include/asm-generic/pgtable.h	2005-07-29 15:26:58.000000000 -0700
@@ -101,6 +101,22 @@
 })
 #endif
 
+#ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
+#define ptep_get_and_clear_full(__mm, __address, __ptep, __full)	\
+({									\
+	pte_t __pte;							\
+	__pte = ptep_get_and_clear((__mm), (__address), (__ptep));	\
+	__pte;								\
+})
+#endif
+
+#ifndef __HAVE_ARCH_PTE_CLEAR_FULL
+#define pte_clear_full(__mm, __address, __ptep, __full)		\
+do {									\
+	pte_clear((__mm), (__address), (__ptep));			\
+} while (0)
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
 #define ptep_clear_flush(__vma, __address, __ptep)			\
 ({									\
Index: linux-2.6.13/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.13.orig/include/asm-i386/pgtable.h	2005-07-29 11:03:10.000000000 -0700
+++ linux-2.6.13/include/asm-i386/pgtable.h	2005-07-29 15:26:58.000000000 -0700
@@ -258,6 +258,18 @@
 	return test_and_clear_bit(_PAGE_BIT_ACCESSED, &ptep->pte_low);
 }
 
+static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm, unsigned long addr, pte_t *ptep, int full) 
+{
+	pte_t pte;
+	if (full) {
+		pte = *ptep;
+		*ptep = __pte(0);
+	} else {
+		pte = ptep_get_and_clear(mm, addr, ptep);
+	}
+	return pte;
+}
+
 static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
 	clear_bit(_PAGE_BIT_RW, &ptep->pte_low);
@@ -415,6 +427,7 @@
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
+#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define __HAVE_ARCH_PTE_SAME
 #include <asm-generic/pgtable.h>
Index: linux-2.6.13/mm/memory.c
===================================================================
--- linux-2.6.13.orig/mm/memory.c	2005-07-29 11:03:11.000000000 -0700
+++ linux-2.6.13/mm/memory.c	2005-07-29 15:26:58.000000000 -0700
@@ -551,7 +551,7 @@
 				     page->index > details->last_index))
 					continue;
 			}
-			ptent = ptep_get_and_clear(tlb->mm, addr, pte);
+			ptent = ptep_get_and_clear_full(tlb->mm, addr, pte, tlb->fullmm);
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			if (unlikely(!page))
 				continue;
@@ -579,7 +579,7 @@
 			continue;
 		if (!pte_file(ptent))
 			free_swap_and_cache(pte_to_swp_entry(ptent));
-		pte_clear(tlb->mm, addr, pte);
+		pte_clear_full(tlb->mm, addr, pte, tlb->fullmm);
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	pte_unmap(pte - 1);
 }
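
For context, the full hint plumbed through here comes from the mmu_gather
setup; paraphrasing the asm-generic/tlb.h of this era from memory (details
abridged, so treat this as a sketch rather than an exact quote):

/* tlb->fullmm is nonzero only when the gather spans the entire address
 * space, i.e. the exit-time teardown path -- exactly when no other user
 * of the mm can race with non-atomic PTE clears. */
static inline struct mmu_gather *
tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
{
	struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);

	tlb->mm = mm;
	tlb->fullmm = full_mm_flush;
	/* ... remaining initialization elided ... */
	return tlb;
}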


* Re: [PATCH] x86_64 Avoid some atomic operations during address space destruction
From: Andi Kleen @ 2005-08-25 17:26 UTC
  To: Zachary Amsden
  Cc: Linux Kernel Mailing List, Pratap Subrahmanyam, Andrew Morton

On Thursday 25 August 2005 19:12, Zachary Amsden wrote:
> Andi Kleen wrote:
> >On Sunday 07 August 2005 14:16, Zachary Amsden wrote:
> >
> >
> >FYI I have queued it, but cannot apply it because the necessary generic
> >code support is still not in mainline.
>
> Here's the patch for generic / i386 support; it's already in the -mm tree.

I'll probably not put that into my tree because I try to avoid carrying
other people's generic patches - I assume it's queued for mainline.

If you want, you can include the x86-64 patch with that submission too
(it's fine with me); alternatively, I'll submit it later when I do the next 
merge from i386 (though that might take some time) or whenever something 
else reminds me of it. 

>
> >Do you have any other optimizations pending for x86-64?
> >
> >There is still the iopl optimization that you did that is on my TODO list
> > to add. Anything else?
>
> I started porting the IOPL work, but got confused in my tree and ended up
> patching asm-i386 with x86-64 code.  The joy of unenforced source control!

Ok. If you don't get around to it, I'll get to it eventually.

>
> I have some other MMU optimizations pending that will hopefully be a win
> for all architectures; still measuring which alternative is best there.

Ok.  Thanks. Please keep me updated on that.

-Andi

