* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP
  [not found] ` <2uCTq-2wa-55@gated-at.bofh.it>
@ 2004-08-18 23:50   ` Rajesh Venkatasubramanian
  2004-08-19  0:01     ` William Lee Irwin III
  0 siblings, 1 reply; 18+ messages in thread

From: Rajesh Venkatasubramanian @ 2004-08-18 23:50 UTC (permalink / raw)
To: Hugh Dickins, William Lee Irwin III, David S. Miller, raybry, ak,
    benh, manfred, linux-ia64, linux-kernel

William Lee Irwin III wrote:
> On Wed, Aug 18, 2004 at 06:55:07PM +0100, Hugh Dickins wrote:
>> That's interesting, and disappointing.
>> The main lesson I took from your patch (I think wli was hinting at
>> the same) is that we ought now to question page_table_lock usage,
>> should be possible to cut it a lot.
>> I recall from exchanges with Dave McCracken 18 months ago that the
>> page_table_lock is _almost_ unnecessary in rmap.c, should be possible
>> to avoid it there and in some other places.
>> We take page_table_lock when making absent present and when making
>> present absent: I like your observation that those are exclusive cases.
>> But you've found that narrowing the width of the page_table_lock
>> in a particular path does not help. You sound surprised, me too.
>> Did you find out why that was?
>
> It also protects against vma tree modifications in mainline, but rmap.c
> shouldn't need it for vmas anymore, as the vma is rooted to the spot by
> mapping->i_shared_lock for file pages and anon_vma->lock for anonymous.

If I am reading the code correctly, then without page_table_lock in
page_referenced_one(), we can race with exit_mmap() and page table
pages can be freed under us.

William Lee Irwin III wrote:
> On Wed, Aug 18, 2004 at 06:55:07PM +0100, Hugh Dickins wrote:
>> I do think this will be a more fruitful direction than pte locking:
>> just looking through the arches for spare bits puts me off pte locking.
>
> Fortunately, spare bits aren't strictly necessary, and neither is
> cmpxchg. A single invalid value can serve in place of a bitflag. When
> using such an invalid value, just xchg()'ing it and looping when the
> invalid value is seen should suffice. This holds more generally for all
> radix trees, not just pagetables, and happily xchg() or emulation
> thereof is required by core code for all arches.

Good point.

Another solution may be to use the unused bytes (->lru or ->private)
in the page table's "struct page" as bit_spin_locks. We can use a
single bit to protect a small set of ptes (8, 16, or 32).

Rajesh

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP
  2004-08-18 23:50 ` page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP Rajesh Venkatasubramanian
@ 2004-08-19  0:01   ` William Lee Irwin III
  2004-08-19  0:07     ` Rajesh Venkatasubramanian
  0 siblings, 1 reply; 18+ messages in thread

From: William Lee Irwin III @ 2004-08-19 0:01 UTC (permalink / raw)
To: Rajesh Venkatasubramanian
Cc: Hugh Dickins, David S. Miller, raybry, ak, benh, manfred,
    linux-ia64, linux-kernel

William Lee Irwin III wrote:
>> It also protects against vma tree modifications in mainline, but rmap.c
>> shouldn't need it for vmas anymore, as the vma is rooted to the spot by
>> mapping->i_shared_lock for file pages and anon_vma->lock for anonymous.

On Wed, Aug 18, 2004 at 07:50:21PM -0400, Rajesh Venkatasubramanian wrote:
> If I am reading the code correctly, then without page_table_lock
> in page_referenced_one(), we can race with exit_mmap() and page
> table pages can be freed under us.

exit_mmap() has removed the vma from ->i_mmap and ->mmap prior to
unmapping the pages, so this should be safe unless that operation
can be caught while it's in progress.

William Lee Irwin III wrote:
>> Fortunately, spare bits aren't strictly necessary, and neither is
>> cmpxchg. A single invalid value can serve in place of a bitflag. When
>> using such an invalid value, just xchg()'ing it and looping when the
>> invalid value is seen should suffice. This holds more generally for all
>> radix trees, not just pagetables, and happily xchg() or emulation
>> thereof is required by core code for all arches.

On Wed, Aug 18, 2004 at 07:50:21PM -0400, Rajesh Venkatasubramanian wrote:
> Good point.
> Another solution may be to use the unused bytes (->lru or
> ->private) in page table "struct page" as bit_spin_locks. We can
> use a single bit to protect a small set of ptes (8, 16, or 32).

In general, bitwise operations are more expensive than ordinary
spinlocks, and a separately-allocated spinlock (not necessarily
kmalloc()'d; one sitting in struct page also counts, i.e. anything
separate from the pte) introduces another cacheline to be touched,
whereas with in-place locking of the pte only the pte's own cacheline
is needed.

-- wli
* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP
  2004-08-19  0:01 ` William Lee Irwin III
@ 2004-08-19  0:07   ` Rajesh Venkatasubramanian
  2004-08-19  0:20     ` William Lee Irwin III
  0 siblings, 1 reply; 18+ messages in thread

From: Rajesh Venkatasubramanian @ 2004-08-19 0:07 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Hugh Dickins, David S. Miller, raybry, ak, benh, manfred,
    linux-ia64, LKML

On Wed, 18 Aug 2004, William Lee Irwin III wrote:
> William Lee Irwin III wrote:
> >> It also protects against vma tree modifications in mainline, but rmap.c
> >> shouldn't need it for vmas anymore, as the vma is rooted to the spot by
> >> mapping->i_shared_lock for file pages and anon_vma->lock for anonymous.
>
> On Wed, Aug 18, 2004 at 07:50:21PM -0400, Rajesh Venkatasubramanian wrote:
> > If I am reading the code correctly, then without page_table_lock
> > in page_referenced_one(), we can race with exit_mmap() and page
> > table pages can be freed under us.
>
> exit_mmap() has removed the vma from ->i_mmap and ->mmap prior to
> unmapping the pages, so this should be safe unless that operation
> can be caught while it's in progress.

No. Unfortunately exit_mmap() removes vmas from ->i_mmap after removing
page table pages. Maybe we can reverse this, though.

Rajesh
* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP
  2004-08-19  0:07 ` Rajesh Venkatasubramanian
@ 2004-08-19  0:20   ` William Lee Irwin III
  2004-08-19  3:19     ` Rajesh Venkatasubramanian
  2004-08-23 22:00     ` Christoph Lameter
  0 siblings, 2 replies; 18+ messages in thread

From: William Lee Irwin III @ 2004-08-19 0:20 UTC (permalink / raw)
To: Rajesh Venkatasubramanian
Cc: Hugh Dickins, David S. Miller, raybry, ak, benh, manfred,
    linux-ia64, LKML

On Wed, 18 Aug 2004, William Lee Irwin III wrote:
>> exit_mmap() has removed the vma from ->i_mmap and ->mmap prior to
>> unmapping the pages, so this should be safe unless that operation
>> can be caught while it's in progress.

On Wed, Aug 18, 2004 at 08:07:24PM -0400, Rajesh Venkatasubramanian wrote:
> No. Unfortunately exit_mmap() removes vmas from ->i_mmap after removing
> page table pages. Maybe we can reverse this, though.

Something like this?


Index: mm1-2.6.8.1/mm/mmap.c
===================================================================
--- mm1-2.6.8.1.orig/mm/mmap.c	2004-08-16 23:47:16.000000000 -0700
+++ mm1-2.6.8.1/mm/mmap.c	2004-08-18 17:18:26.513559632 -0700
@@ -1810,6 +1810,14 @@
 	mm->map_count -= unmap_vmas(&tlb, mm, mm->mmap, 0,
 					~0UL, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
+	/*
+	 * Walk the list again, actually closing and freeing it.
+	 */
+	while (vma) {
+		struct vm_area_struct *next = vma->vm_next;
+		remove_vm_struct(vma);
+		vma = next;
+	}
 	BUG_ON(mm->map_count);	/* This is just debugging */
 	clear_page_tables(tlb, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);
 	tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm));
@@ -1822,16 +1830,6 @@
 	mm->locked_vm = 0;

 	spin_unlock(&mm->page_table_lock);
-
-	/*
-	 * Walk the list again, actually closing and freeing it
-	 * without holding any MM locks.
-	 */
-	while (vma) {
-		struct vm_area_struct *next = vma->vm_next;
-		remove_vm_struct(vma);
-		vma = next;
-	}
 }

 /* Insert vm structure into process list sorted by address
* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP
  2004-08-19  0:20 ` William Lee Irwin III
@ 2004-08-19  3:19   ` Rajesh Venkatasubramanian
  2004-08-19  3:31     ` William Lee Irwin III
  0 siblings, 1 reply; 18+ messages in thread

From: Rajesh Venkatasubramanian @ 2004-08-19 3:19 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Hugh Dickins, David S. Miller, raybry, ak, benh, manfred,
    linux-ia64, LKML

On Wed, 18 Aug 2004, William Lee Irwin III wrote:
> On Wed, 18 Aug 2004, William Lee Irwin III wrote:
> >> exit_mmap() has removed the vma from ->i_mmap and ->mmap prior to
> >> unmapping the pages, so this should be safe unless that operation
> >> can be caught while it's in progress.
>
> On Wed, Aug 18, 2004 at 08:07:24PM -0400, Rajesh Venkatasubramanian wrote:
> > No. Unfortunately exit_mmap() removes vmas from ->i_mmap after removing
> > page table pages. Maybe we can reverse this, though.
>
> Something like this?

Yeah, something similar... A small nitpick: page_table_lock nests
within i_mmap_lock and anon_lock.

do_unmap() also frees page tables first and then removes vmas from
i_mmap (and/or) the anon_vma list. Is there a reason to preserve
this ordering?

Rajesh

> Index: mm1-2.6.8.1/mm/mmap.c
> ===================================================================
> --- mm1-2.6.8.1.orig/mm/mmap.c	2004-08-16 23:47:16.000000000 -0700
> +++ mm1-2.6.8.1/mm/mmap.c	2004-08-18 17:18:26.513559632 -0700
> @@ -1810,6 +1810,14 @@
>  	mm->map_count -= unmap_vmas(&tlb, mm, mm->mmap, 0,
>  					~0UL, &nr_accounted, NULL);
>  	vm_unacct_memory(nr_accounted);
> +	/*
> +	 * Walk the list again, actually closing and freeing it.
> +	 */
> +	while (vma) {
> +		struct vm_area_struct *next = vma->vm_next;
> +		remove_vm_struct(vma);
> +		vma = next;
> +	}
>  	BUG_ON(mm->map_count);	/* This is just debugging */
>  	clear_page_tables(tlb, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);
>  	tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm));
> @@ -1822,16 +1830,6 @@
>  	mm->locked_vm = 0;
>
>  	spin_unlock(&mm->page_table_lock);
> -
> -	/*
> -	 * Walk the list again, actually closing and freeing it
> -	 * without holding any MM locks.
> -	 */
> -	while (vma) {
> -		struct vm_area_struct *next = vma->vm_next;
> -		remove_vm_struct(vma);
> -		vma = next;
> -	}
>  }
>
>  /* Insert vm structure into process list sorted by address
* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP
  2004-08-19  3:19 ` Rajesh Venkatasubramanian
@ 2004-08-19  3:31   ` William Lee Irwin III
  2004-08-19  3:41     ` William Lee Irwin III
  0 siblings, 1 reply; 18+ messages in thread

From: William Lee Irwin III @ 2004-08-19 3:31 UTC (permalink / raw)
To: Rajesh Venkatasubramanian
Cc: Hugh Dickins, David S. Miller, raybry, ak, benh, manfred,
    linux-ia64, LKML

On Wed, 18 Aug 2004, William Lee Irwin III wrote:
>> Something like this?

On Wed, Aug 18, 2004 at 11:19:25PM -0400, Rajesh Venkatasubramanian wrote:
> Yeah, something similar... A small nitpick: page_table_lock
> nests within i_mmap_lock and anon_lock.
> do_unmap() also frees page tables first and then removes vmas from
> i_mmap (and/or) anon_vma list. Is there a reason to preserve
> this ordering?

I don't see a reason to rearrange it. We can just as easily drop and
reacquire ->page_table_lock.

-- wli
* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP
  2004-08-19  3:31 ` William Lee Irwin III
@ 2004-08-19  3:41   ` William Lee Irwin III
  0 siblings, 0 replies; 18+ messages in thread

From: William Lee Irwin III @ 2004-08-19 3:41 UTC (permalink / raw)
To: Rajesh Venkatasubramanian, Hugh Dickins, David S. Miller, raybry,
    ak, benh, manfred, linux-ia64, LKML

On Wed, 18 Aug 2004, William Lee Irwin III wrote:
>>> Something like this?

On Wed, Aug 18, 2004 at 11:19:25PM -0400, Rajesh Venkatasubramanian wrote:
>> Yeah, something similar... A small nitpick: page_table_lock
>> nests within i_mmap_lock and anon_lock.
>> do_unmap() also frees page tables first and then removes vmas from
>> i_mmap (and/or) anon_vma list. Is there a reason to preserve
>> this ordering?

On Wed, Aug 18, 2004 at 08:31:27PM -0700, William Lee Irwin III wrote:
> I don't see a reason to rearrange it. We can just as easily drop and
> reacquire ->page_table_lock.

On closer examination it's not that simple.

-- wli
* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP
  2004-08-19  0:20 ` William Lee Irwin III
  2004-08-19  3:19   ` Rajesh Venkatasubramanian
@ 2004-08-23 22:00   ` Christoph Lameter
  2004-08-23 23:25     ` Rajesh Venkatasubramanian
  1 sibling, 1 reply; 18+ messages in thread

From: Christoph Lameter @ 2004-08-23 22:00 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Rajesh Venkatasubramanian, Hugh Dickins, David S. Miller, raybry,
    ak, benh, manfred, linux-ia64, LKML

Would it not be better if exit_mmap would hold a writelock on
mm->mmap_sem and the page_table_lock? This would ensure that page faults
would not change pte's. exit_mmap changes the memory map, so it seems
that the wrong lock is used here.

On Wed, 18 Aug 2004, William Lee Irwin III wrote:
> On Wed, 18 Aug 2004, William Lee Irwin III wrote:
> >> exit_mmap() has removed the vma from ->i_mmap and ->mmap prior to
> >> unmapping the pages, so this should be safe unless that operation
> >> can be caught while it's in progress.
>
> On Wed, Aug 18, 2004 at 08:07:24PM -0400, Rajesh Venkatasubramanian wrote:
> > No. Unfortunately exit_mmap() removes vmas from ->i_mmap after removing
> > page table pages. Maybe we can reverse this, though.
>
> Something like this?
>
>
> Index: mm1-2.6.8.1/mm/mmap.c
> ===================================================================
> --- mm1-2.6.8.1.orig/mm/mmap.c	2004-08-16 23:47:16.000000000 -0700
> +++ mm1-2.6.8.1/mm/mmap.c	2004-08-18 17:18:26.513559632 -0700
> @@ -1810,6 +1810,14 @@
>  	mm->map_count -= unmap_vmas(&tlb, mm, mm->mmap, 0,
>  					~0UL, &nr_accounted, NULL);
>  	vm_unacct_memory(nr_accounted);
> +	/*
> +	 * Walk the list again, actually closing and freeing it.
> +	 */
> +	while (vma) {
> +		struct vm_area_struct *next = vma->vm_next;
> +		remove_vm_struct(vma);
> +		vma = next;
> +	}
>  	BUG_ON(mm->map_count);	/* This is just debugging */
>  	clear_page_tables(tlb, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);
>  	tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm));
> @@ -1822,16 +1830,6 @@
>  	mm->locked_vm = 0;
>
>  	spin_unlock(&mm->page_table_lock);
> -
> -	/*
> -	 * Walk the list again, actually closing and freeing it
> -	 * without holding any MM locks.
> -	 */
> -	while (vma) {
> -		struct vm_area_struct *next = vma->vm_next;
> -		remove_vm_struct(vma);
> -		vma = next;
> -	}
>  }
>
>  /* Insert vm structure into process list sorted by address
* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP
  2004-08-23 22:00 ` Christoph Lameter
@ 2004-08-23 23:25   ` Rajesh Venkatasubramanian
  2004-08-23 23:35     ` Christoph Lameter
  0 siblings, 1 reply; 18+ messages in thread

From: Rajesh Venkatasubramanian @ 2004-08-23 23:25 UTC (permalink / raw)
To: Christoph Lameter
Cc: William Lee Irwin III, Hugh Dickins, David S. Miller, raybry, ak,
    benh, manfred, linux-ia64, LKML

On Mon, 23 Aug 2004, Christoph Lameter wrote:
> Would it not be better if exit_mmap would hold a writelock on
> mm->mmap_sem and the page_table_lock? This would insure that page faults
> would not change pte's. exit_mmap changes the memory map so it seems that
> the wrong lock is used here.

exit_mmap() is only called from kernel/fork.c:mmput(). We call
exit_mmap() only if (atomic_dec_and_lock(&mm->mm_users, &mmlist_lock)).
So there are no active threads (mm_users) other than the current
exit_mmap() thread. This gives thread exclusion, so we don't need
mm->mmap_sem.

However, we have to lock out truncate()->zap_pmd_range(), the rmap.c
functions, and other places from walking the page tables while we are
freeing the page tables in exit_mmap(). The page_table_lock in
exit_mmap() provides that exclusion.

That's my understanding. Correct me if I am wrong.

Thanks,
Rajesh

> On Wed, 18 Aug 2004, William Lee Irwin III wrote:
> > On Wed, 18 Aug 2004, William Lee Irwin III wrote:
> > >> exit_mmap() has removed the vma from ->i_mmap and ->mmap prior to
> > >> unmapping the pages, so this should be safe unless that operation
> > >> can be caught while it's in progress.
> >
> > On Wed, Aug 18, 2004 at 08:07:24PM -0400, Rajesh Venkatasubramanian wrote:
> > > No. Unfortunately exit_mmap() removes vmas from ->i_mmap after removing
> > > page table pages. Maybe we can reverse this, though.
> >
> > Something like this?
> >
> > [quoted patch to mm/mmap.c repeated above]
* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP
  2004-08-23 23:25 ` Rajesh Venkatasubramanian
@ 2004-08-23 23:35   ` Christoph Lameter
  0 siblings, 0 replies; 18+ messages in thread

From: Christoph Lameter @ 2004-08-23 23:35 UTC (permalink / raw)
To: Rajesh Venkatasubramanian
Cc: William Lee Irwin III, Hugh Dickins, David S. Miller, raybry, ak,
    benh, manfred, linux-ia64, LKML

On Mon, 23 Aug 2004, Rajesh Venkatasubramanian wrote:
> So there are no active threads (mm_users) other than the current
> exit_mmap() thread. This gives thread exclusion, so we don't need
> mm->mmap_sem.
>
> However, we have to lock out truncate()->zap_pmd_range(), the rmap.c
> functions, and other places from walking the page tables while we are
> freeing the page tables in exit_mmap(). The page_table_lock in
> exit_mmap() provides that exclusion.
>
> That's my understanding. Correct me if I am wrong.

Correct. I just looked through the function, and it unlinks the pgd
before unlocking the page_table_lock; it frees the pgd tree contents
later. Since no mm_user is active anymore, no pte operations occur,
and therefore atomic pte operations also do not need protection.
[parent not found: <fa.ofiojek.hkeujs@ifi.uio.no>]
[parent not found: <fa.o1kt2ua.1bm6n0c@ifi.uio.no>]
* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP
  [not found] ` <fa.o1kt2ua.1bm6n0c@ifi.uio.no>
@ 2004-08-18 20:13   ` Ray Bryant
  2004-08-18 20:48     ` William Lee Irwin III
  0 siblings, 1 reply; 18+ messages in thread

From: Ray Bryant @ 2004-08-18 20:13 UTC (permalink / raw)
To: Hugh Dickins
Cc: Christoph Lameter, William Lee Irwin III, David S. Miller, ak,
    benh, manfred, linux-ia64, linux-kernel

Hi Hugh,

Hugh Dickins wrote:
> On Tue, 17 Aug 2004, Christoph Lameter wrote:
>
> <snip>
>
> Just handling that one anonymous case is not worth it, when we know
> that the next day someone else from SGI will post a similar test
> which shows the same on file pages ;)

Hugh -- this is called full employment for kernel scalability analysts.
:-) :-)

Actually, disks are so slow that I wouldn't expect that scalability
problem to show up in the page fault code, but rather in the block I/O
or page cache management portions of the code instead.

<snip>

>> Introducing the page_table_lock even for a short time makes performance
>> drop to the level before the patch.
>
> That's interesting, and disappointing.

I think that the major impact here is actually grabbing the lock when
30 or more processors are trying to obtain it -- the amount of time that
the lock is actually held is insignificant in comparison.

> The main lesson I took from your patch (I think wli was hinting at
> the same) is that we ought now to question page_table_lock usage,
> should be possible to cut it a lot.

That would be a useful avenue to explore. Unfortunately, we are on kind
of a tight fuse here trying to get the next kernel release ready. At the
moment we are in the mode of moving fixes from 2.4.21 to 2.6, and this
is one such fix. I'd be willing to pursue both in parallel so that in a
future release we have gotten to page_table_lock reduction as well.
Does that make sense at all? (I just don't want to get bogged down in a
6-month effort here unless we can't avoid it.)

> I recall from exchanges with Dave McCracken 18 months ago that the
> page_table_lock is _almost_ unnecessary in rmap.c, should be possible
> to avoid it there and in some other places.
>
> We take page_table_lock when making absent present and when making
> present absent: I like your observation that those are exclusive cases.
>
> But you've found that narrowing the width of the page_table_lock
> in a particular path does not help. You sound surprised, me too.
> Did you find out why that was?

See above comment.

>> - One could avoid pte locking by introducing a pte_cmpxchg. cmpxchg
>> seems to be supported by all ia64 and i386 cpus except the original
>> 80386.
>
> I do think this will be a more fruitful direction than pte locking:
> just looking through the arches for spare bits puts me off pte locking.

The original patch that we had for 2.4.21 did exactly that; we shied
away from it due to concerns as to which processors allow you to update
a running pte using a cmpxchg (== the set of processors for which
set_pte() is a simple store). AFAIK, the only such processor is i386,
but if Christoph is correct, then more recent Intel x86 processors don't
even have that restriction. I'll admit that I encouraged Christoph not
to follow that path due to concerns about arch-dependent code creeping
into the do_anonymous_page() path.

Best Regards,

Ray

raybry@sgi.com
* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP
  2004-08-18 20:13 ` Ray Bryant
@ 2004-08-18 20:48   ` William Lee Irwin III
  0 siblings, 0 replies; 18+ messages in thread

From: William Lee Irwin III @ 2004-08-18 20:48 UTC (permalink / raw)
To: Ray Bryant
Cc: Hugh Dickins, Christoph Lameter, David S. Miller, ak, benh,
    manfred, linux-ia64, linux-kernel

Hugh Dickins wrote:
>> Just handling that one anonymous case is not worth it, when we know
>> that the next day someone else from SGI will post a similar test
>> which shows the same on file pages ;)

On Wed, Aug 18, 2004 at 03:13:39PM -0500, Ray Bryant wrote:
> Hugh -- this is called full employment for kernel scalability analysts.
> :-) :-)
> Actually, disks are so slow that I wouldn't expect that scalability
> problem to show up in the page fault code, but rather in the block I/O
> or page cache management portions of the code instead.

mapping->tree_lock is a near-certain culprit-to-be.

On Wed, Aug 18, 2004 at 03:13:39PM -0500, Ray Bryant wrote:
> I think that the major impact here is actually grabbing the lock when
> 30 or more processors are trying to obtain it -- the amount of time that
> the lock is actually held is insignificant in comparison.

The spinlocking algorithms in general use are rather weak. Queued locks,
with O(contenders) cacheline transfers instead of unbounded, may be
useful in such arrangements so that occasional contention is not so
catastrophic, and it appears you certainly have the space for them. In
general, everyone's favored rearranging algorithms for less contention
instead of fiddling with locking algorithms, but I suspect that in the
case of such large systems the number of nontrivial algorithms to devise
is daunting enough that locking algorithms may mitigate the issues.
Hugh Dickins wrote:
>> The main lesson I took from your patch (I think wli was hinting at
>> the same) is that we ought now to question page_table_lock usage,
>> should be possible to cut it a lot.

On Wed, Aug 18, 2004 at 03:13:39PM -0500, Ray Bryant wrote:
> That would be a useful avenue to explore. Unfortunately, we are on kind
> of a tight fuse here trying to get the next kernel release ready. At the
> moment we are in the mode of moving fixes from 2.4.21 to 2.6, and this
> is one such fix. I'd be willing to pursue both in parallel so that in a
> future release we have gotten to page_table_lock reduction as well.
> Does that make sense at all? (I just don't want to get bogged down in a
> 6-month effort here unless we can't avoid it.)

Scalability issues with fault handling have trivial enough hardware
requirements that it's likely feasible for others to work on it also.
Also fortunate for you is that others have these issues too, and on much
smaller systems.

Hugh Dickins wrote:
>> I do think this will be a more fruitful direction than pte locking:
>> just looking through the arches for spare bits puts me off pte locking.

On Wed, Aug 18, 2004 at 03:13:39PM -0500, Ray Bryant wrote:
> The original patch that we had for 2.4.21 did exactly that; we shied
> away from it due to concerns as to which processors allow you to update
> a running pte using a cmpxchg (== the set of processors for which
> set_pte() is a simple store). AFAIK, the only such processor is i386,
> but if Christoph is correct, then more recent Intel x86 processors don't
> even have that restriction. I'll admit that I encouraged Christoph not
> to follow that path due to concerns about arch-dependent code creeping
> into the do_anonymous_page() path.

I'm in general more concerned about how one synchronizes with TLB miss
handlers, particularly on machines where this would involve the TLB miss
handler looping until a valid value is fetched, in what is now a very
highly optimized performance-critical codepath.

-- wli
* page fault fastpath: Increasing SMP scalability by introducing pte locks?
@ 2004-08-15 13:50 Christoph Lameter
2004-08-15 20:09 ` David S. Miller
0 siblings, 1 reply; 18+ messages in thread
From: Christoph Lameter @ 2004-08-15 13:50 UTC (permalink / raw)
To: linux-ia64; +Cc: linux-kernel
Well, this is more an idea than a real patch yet. The page_table_lock
becomes a bottleneck if more than 4 CPUs are rapidly allocating and using
memory. "pft" is a program that measures the performance of page faults on
an SMP system. It allocates memory simultaneously in multiple threads,
thereby causing lots of page faults for anonymous pages.
Results for a standard 2.6.8.1 kernel. Allocating 2G of RAM in an 8
processor SMP system:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
2 3 1 0.094s 4.500s 4.059s 85561.646 85568.398
2 3 2 0.092s 6.390s 3.043s 60649.650 114521.474
2 3 4 0.081s 6.500s 1.093s 59740.813 203552.963
2 3 8 0.101s 12.001s 2.035s 32487.736 167082.560
Scalability problems set in beyond 4 CPUs.
With pte locks and the fastpath:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
2 3 1 0.071s 4.535s 4.061s 85362.102 85288.646
2 3 2 0.073s 4.793s 2.013s 80789.137 184196.199
2 3 4 0.087s 5.119s 1.057s 75516.326 249716.547
2 3 8 0.096s 7.089s 1.019s 54715.728 328540.988
Performance in SMP situations is significantly enhanced with this
patch.
Note that the patch does not address various race conditions that may
result from using a pte lock only in handle_mm_fault. Some rules
need to be developed for coordinating pte locks and the page_table_lock
in order to avoid these.
pte locks are realized by finding a spare bit in the ptes (TLB structures
on IA64 and i386) and setting that bit atomically via bitops for locking.
The fastpath does not take the page_table_lock but instead immediately
locks the pte. Thus the logic to release and later reacquire the
page_table_lock is avoided. Multiple page faults can run concurrently
using pte locks, avoiding the page_table_lock. Essentially, pte locks
allow a finer granularity of locking.
I would like to get some feedback on whether people feel that this is
the right way to solve the issue. Most of this is based on the work of
Ray Bryant and others at SGI.
Attached are:
1. pte lock patch for i386 and ia64
2. page_fault fastpath
3. page fault test program
4. test script
=========== PTE LOCK PATCH
Index: linux-2.6.8-rc4/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.8-rc4.orig/include/asm-generic/pgtable.h 2004-08-09 19:22:39.000000000 -0700
+++ linux-2.6.8-rc4/include/asm-generic/pgtable.h 2004-08-13 10:05:36.000000000 -0700
@@ -126,4 +126,11 @@
#define pgd_offset_gate(mm, addr) pgd_offset(mm, addr)
#endif
+#ifndef __HAVE_ARCH_PTE_LOCK
+/* need to fall back to the mm spinlock if PTE locks are not supported */
+#define ptep_lock(ptep) !spin_trylock(&mm->page_table_lock)
+#define ptep_unlock(ptep) spin_unlock(&mm->page_table_lock)
+#define pte_locked(pte) spin_is_locked(&mm->page_table_lock)
+#endif
+
#endif /* _ASM_GENERIC_PGTABLE_H */
Index: linux-2.6.8-rc4/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.8-rc4.orig/include/asm-ia64/pgtable.h 2004-08-09 19:22:39.000000000 -0700
+++ linux-2.6.8-rc4/include/asm-ia64/pgtable.h 2004-08-13 10:19:15.000000000 -0700
@@ -30,6 +30,8 @@
#define _PAGE_P_BIT 0
#define _PAGE_A_BIT 5
#define _PAGE_D_BIT 6
+#define _PAGE_IG_BITS 53
+#define _PAGE_LOCK_BIT (_PAGE_IG_BITS+3) /* bit 56. Aligned to 8 bits */
#define _PAGE_P (1 << _PAGE_P_BIT) /* page present bit */
#define _PAGE_MA_WB (0x0 << 2) /* write back memory attribute */
@@ -58,6 +60,7 @@
#define _PAGE_PPN_MASK (((__IA64_UL(1) << IA64_MAX_PHYS_BITS) - 1) & ~0xfffUL)
#define _PAGE_ED (__IA64_UL(1) << 52) /* exception deferral */
#define _PAGE_PROTNONE (__IA64_UL(1) << 63)
+#define _PAGE_LOCK (__IA64_UL(1) << _PAGE_LOCK_BIT)
/* Valid only for a PTE with the present bit cleared: */
#define _PAGE_FILE (1 << 1) /* see swap & file pte remarks below */
@@ -282,6 +285,13 @@
#define pte_mkclean(pte) (__pte(pte_val(pte) & ~_PAGE_D))
#define pte_mkdirty(pte) (__pte(pte_val(pte) | _PAGE_D))
+/*
+ * Lock functions for pte's
+*/
+#define ptep_lock(ptep) test_and_set_bit(_PAGE_LOCK_BIT,ptep)
+#define ptep_unlock(ptep) { clear_bit(_PAGE_LOCK_BIT,ptep);smp_mb__after_clear_bit(); }
+#define pte_locked(pte) ((pte_val(pte) & _PAGE_LOCK)!=0)
+
/*
* Macro to a page protection value as "uncacheable". Note that "protection" is really a
* misnomer here as the protection value contains the memory attribute bits, dirty bits,
@@ -558,6 +568,7 @@
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
#define __HAVE_ARCH_PGD_OFFSET_GATE
+#define __HAVE_ARCH_PTE_LOCK
#include <asm-generic/pgtable.h>
#endif /* _ASM_IA64_PGTABLE_H */
Index: linux-2.6.8-rc4/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.8-rc4.orig/include/asm-i386/pgtable.h 2004-08-09 19:23:35.000000000 -0700
+++ linux-2.6.8-rc4/include/asm-i386/pgtable.h 2004-08-13 10:04:19.000000000 -0700
@@ -101,7 +101,7 @@
#define _PAGE_BIT_DIRTY 6
#define _PAGE_BIT_PSE 7 /* 4 MB (or 2MB) page, Pentium+, if present.. */
#define _PAGE_BIT_GLOBAL 8 /* Global TLB entry PPro+ */
-#define _PAGE_BIT_UNUSED1 9 /* available for programmer */
+#define _PAGE_BIT_LOCK 9 /* available for programmer */
#define _PAGE_BIT_UNUSED2 10
#define _PAGE_BIT_UNUSED3 11
#define _PAGE_BIT_NX 63
@@ -115,7 +115,7 @@
#define _PAGE_DIRTY 0x040
#define _PAGE_PSE 0x080 /* 4 MB (or 2MB) page, Pentium+, if present.. */
#define _PAGE_GLOBAL 0x100 /* Global TLB entry PPro+ */
-#define _PAGE_UNUSED1 0x200 /* available for programmer */
+#define _PAGE_LOCK 0x200 /* available for programmer */
#define _PAGE_UNUSED2 0x400
#define _PAGE_UNUSED3 0x800
@@ -260,6 +260,10 @@
static inline void ptep_set_wrprotect(pte_t *ptep) { clear_bit(_PAGE_BIT_RW, &ptep->pte_low); }
static inline void ptep_mkdirty(pte_t *ptep) { set_bit(_PAGE_BIT_DIRTY, &ptep->pte_low); }
+#define ptep_lock(ptep) test_and_set_bit(_PAGE_BIT_LOCK, &(ptep)->pte_low)
+#define ptep_unlock(ptep) clear_bit(_PAGE_BIT_LOCK, &(ptep)->pte_low)
+#define pte_locked(pte) (((pte).pte_low & _PAGE_LOCK) != 0)
+
/*
* Macro to mark a page protection value as "uncacheable". On processors which do not support
* it, this is a no-op.
@@ -419,6 +423,7 @@
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
+#define __HAVE_ARCH_PTE_LOCK
#include <asm-generic/pgtable.h>
#endif /* _I386_PGTABLE_H */
======= PAGEFAULT FASTPATH
Index: linux-2.6.8-rc4/mm/memory.c
===================================================================
--- linux-2.6.8-rc4.orig/mm/memory.c 2004-08-09 19:23:02.000000000 -0700
+++ linux-2.6.8-rc4/mm/memory.c 2004-08-13 10:19:21.000000000 -0700
@@ -1680,6 +1680,10 @@
{
pgd_t *pgd;
pmd_t *pmd;
+#ifdef __HAVE_ARCH_PTE_LOCK
+ pte_t *pte;
+ pte_t entry;
+#endif
__set_current_state(TASK_RUNNING);
pgd = pgd_offset(mm, address);
@@ -1688,7 +1692,64 @@
if (is_vm_hugetlb_page(vma))
return VM_FAULT_SIGBUS; /* mapping truncation does this. */
+#ifdef __HAVE_ARCH_PTE_LOCK
+ /*
+ * Fast path for not-present faults on anonymous pages, avoiding
+ * the need to acquire the page_table_lock
+ */
+
+ if ((vma->vm_ops && vma->vm_ops->nopage) || pgd_none(*pgd)) goto use_page_table_lock;
+ pmd = pmd_offset(pgd,address);
+ if (pmd_none(*pmd)) goto use_page_table_lock;
+ pte = pte_offset_kernel(pmd,address);
+ if (pte_locked(*pte)) return VM_FAULT_MINOR;
+ if (!pte_none(*pte)) goto use_page_table_lock;
+
+ /*
+ * The page is not present, so kswapd and the SMP-safe pte updates
+ * cannot touch this pte; a pte lock alone is sufficient.
+ */
+
+ if (ptep_lock(pte)) return VM_FAULT_MINOR;
+ /*
+ * If the pte was already locked, this fault is already being handled
+ * on another processor, so just return. If something goes wrong on
+ * that processor, the faulting access simply raises another page
+ * fault, which can then deal with the error condition.
+ */
+
+ /* Read-only mapping of ZERO_PAGE. */
+ entry = pte_wrprotect(mk_pte(ZERO_PAGE(address), vma->vm_page_prot));
+
+ if (write_access) {
+ struct page *page;
+
+ if (unlikely(anon_vma_prepare(vma))) goto no_mem;
+
+ page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+ if (!page) goto no_mem;
+ clear_user_highpage(page, address);
+
+ mm->rss++;
+ entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,vma->vm_page_prot)),vma);
+ lru_cache_add_active(page);
+ mark_page_accessed(page);
+ page_add_anon_rmap(page, vma, address);
+ }
+ /* Setting the pte clears the pte lock so there is no need for unlocking */
+ set_pte(pte, entry);
+ pte_unmap(pte);
+
+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(vma, address, entry);
+ return VM_FAULT_MINOR; /* Minor fault */
+no_mem:
+ ptep_unlock(pte);
+ return VM_FAULT_OOM;
+
+use_page_table_lock:
+#endif
/*
* We need the page table lock to synchronize with kswapd
* and the SMP-safe atomic PTE updates.
======= PFT.C test program
#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <errno.h>
#include <sys/resource.h>
extern int optind, opterr;
extern char *optarg;
long bytes=16384;
long sleepsec=0;
long verbose=0;
long forkcnt=1;
long repeatcount=1;
long do_bzero=0;
long mypid;
int title=0;
volatile int go, state[128];
struct timespec wall;
struct rusage ruse;
long faults;
long pages;
long gbyte;
double faults_per_sec;
double faults_per_sec_per_cpu;
#define perrorx(s) (perror(s), exit(1))
#define NBPP 16384
void* test(void*);
void launch(void);
int main(int argc, char *argv[])
{
int i, j, c, stat, er=0;
static char optstr[] = "b:f:g:r:s:vzHt";
opterr=1;
while ((c = getopt(argc, argv, optstr)) != EOF)
switch (c) {
case 'g':
bytes = atol(optarg)*1024*1024*1024;
break;
case 'b':
bytes = atol(optarg);
break;
case 'f':
forkcnt = atol(optarg);
break;
case 'r':
repeatcount = atol(optarg);
break;
case 's':
sleepsec = atol(optarg);
break;
case 'v':
verbose++;
break;
case 'z':
do_bzero++;
break;
case 'H':
er++;
break;
case 't' :
title++;
break;
case '?':
er = 1;
break;
}
if (er) {
printf("usage: %s %s\n", argv[0], optstr);
exit(1);
}
pages = bytes*repeatcount/getpagesize();
gbyte = bytes/(1024*1024*1024);
bytes = bytes/forkcnt;
if (verbose) printf("Calculated pages=%ld pagesize=%d.\n",pages,getpagesize());
mypid = getpid();
setpgid(0, mypid);
for (i=0; i<repeatcount; i++) {
if (fork() == 0)
launch();
while (wait(&stat) > 0);
}
getrusage(RUSAGE_CHILDREN,&ruse);
clock_gettime(CLOCK_PROCESS_CPUTIME_ID,&wall);
if (verbose) printf("Calculated faults=%ld. Real minor faults=%ld, major faults=%ld\n",pages,ruse.ru_minflt,ruse.ru_majflt);
faults_per_sec=(double) pages / ((double) wall.tv_sec + (double) wall.tv_nsec / 1000000000.0);
faults_per_sec_per_cpu=(double) pages / (
(double) (ruse.ru_utime.tv_sec + ruse.ru_stime.tv_sec) + ((double) (ruse.ru_utime.tv_usec + ruse.ru_stime.tv_usec) / 1000000.0));
if (title) printf(" Gb Rep Threads User System Wall flt/cpu/s fault/wsec\n");
printf("%3ld %3ld %4ld %4ld.%03lds%7ld.%03lds%4ld.%03lds%10.3f %10.3f\n",
gbyte,repeatcount,forkcnt,
ruse.ru_utime.tv_sec,ruse.ru_utime.tv_usec/1000,
ruse.ru_stime.tv_sec,ruse.ru_stime.tv_usec/1000,
wall.tv_sec,wall.tv_nsec/1000000,
faults_per_sec_per_cpu,faults_per_sec);
exit(0);
}
char *
do_shm(long shmlen) {
char *p;
int shmid;
printf ("Try to allocate TOTAL shm segment of %ld bytes\n", shmlen);
if ((shmid = shmget(IPC_PRIVATE, shmlen, SHM_R|SHM_W)) == -1)
perrorx("shmget failed");
p=(char*)shmat(shmid, (void*)0, SHM_R|SHM_W);
printf(" created, adr: 0x%lx\n", (long)p);
printf(" attached\n");
bzero(p, shmlen);
printf(" zeroed\n");
// if (shmctl(shmid,IPC_RMID,0) == -1)
// perrorx("shmctl failed");
// printf(" deleted\n");
return p;
}
void
launch()
{
pthread_t ptid[128];
int i, j;
for (j=0; j<forkcnt; j++)
if (pthread_create(&ptid[j], NULL, test, (void*) (long)j) != 0)
perrorx("pthread create");
if(0) for (j=0; j<forkcnt; j++)
while(state[j] == 0);
go = 1;
if(0) for (j=0; j<forkcnt; j++)
while(state[j] == 1);
for (j=0; j<forkcnt; j++)
pthread_join(ptid[j], NULL);
exit(0);
}
void*
test(void *arg)
{
char *p, *pe;
long id;
id = (long) arg;
state[id] = 1;
while(!go);
p = malloc(bytes);
// p = do_shm(bytes);
if (p == 0) {
printf("malloc of %ld bytes failed.\n",bytes);
exit(1);
} else
if (verbose) printf("malloc of %ld bytes succeeded\n",bytes);
if (do_bzero)
bzero(p, bytes);
else {
for(pe=p+bytes; p<pe; p+=NBPP)
*p = 'r';
}
sleep(sleepsec);
state[id] = 2;
pthread_exit(0);
}
===== Test script
./pft -t -g2 -r3 -f1
./pft -g2 -r3 -f2
./pft -g2 -r3 -f4
./pft -g2 -r3 -f8
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks? 2004-08-15 13:50 page fault fastpath: Increasing SMP scalability by introducing pte locks? Christoph Lameter @ 2004-08-15 20:09 ` David S. Miller 2004-08-15 22:58 ` Christoph Lameter 0 siblings, 1 reply; 18+ messages in thread From: David S. Miller @ 2004-08-15 20:09 UTC (permalink / raw) To: Christoph Lameter; +Cc: linux-ia64, linux-kernel Is the read lock in the VMA semaphore enough to let you do the pgd/pmd walking without the page_table_lock? I think it is, but just checking. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks? 2004-08-15 20:09 ` David S. Miller @ 2004-08-15 22:58 ` Christoph Lameter 2004-08-15 23:58 ` David S. Miller 0 siblings, 1 reply; 18+ messages in thread From: Christoph Lameter @ 2004-08-15 22:58 UTC (permalink / raw) To: David S. Miller; +Cc: linux-ia64, linux-kernel On Sun, 15 Aug 2004, David S. Miller wrote: > > Is the read lock in the VMA semaphore enough to let you do > the pgd/pmd walking without the page_table_lock? > I think it is, but just checking. That would be great.... May I change the page_table_lock to be a read/write spinlock instead? I would then convert all spin_locks to write_locks and then use read locks to switch to a "pte locking mode". The read lock would allow simultaneous threads to operate on the page table, modifying individual pte's only via pte locks. Write locks would still exclude the readers, so the whole scheme should allow a gradual transition. Maybe such a locking policy could do some good. However, performance is only increased somewhat. Scalability is still bad with more than 32 CPUs despite my hack. 
More extensive work is needed <sigh>: Regular kernel 512 CPU's 16G allocation per thread: Gb Rep Threads User System Wall flt/cpu/s fault/wsec 16 3 1 0.748s 67.200s 67.098s 46295.921 46270.533 16 3 2 0.899s 100.189s 52.021s 31118.426 60242.544 16 3 4 1.517s 103.467s 31.021s 29963.479 100777.788 16 3 8 1.268s 166.023s 26.035s 18803.807 119350.434 16 3 16 6.296s 453.445s 33.082s 6842.371 92987.774 16 3 32 22.434s 1341.205s 48.026s 2306.860 65174.913 16 3 64 54.189s 4633.748s 81.089s 671.026 38411.466 16 3 128 244.333s 17584.111s 152.026s 176.444 20659.132 16 3 256 222.936s 8167.241s 73.018s 374.930 42983.366 16 3 512 207.464s 4259.264s 39.044s 704.258 79741.366 Modified kernel: Gb Rep Threads User System Wall flt/cpu/s fault/wsec 16 3 1 0.884s 64.241s 65.014s 48302.177 48287.787 16 3 2 0.931s 99.156s 51.058s 31429.640 60979.126 16 3 4 1.028s 88.451s 26.096s 35155.837 116669.999 16 3 8 1.957s 61.395s 12.099s 49654.307 242078.305 16 3 16 5.701s 81.382s 9.039s 36122.904 334774.381 16 3 32 15.207s 163.893s 9.094s 17564.021 316284.690 16 3 64 76.056s 440.771s 13.037s 6086.601 235120.800 16 3 128 203.843s 1535.909s 19.084s 1808.145 158495.679 16 3 256 274.815s 755.764s 12.058s 3052.387 250010.942 16 3 512 205.505s 381.106s 7.060s 5362.531 413531.352 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks? 2004-08-15 22:58 ` Christoph Lameter @ 2004-08-15 23:58 ` David S. Miller 2004-08-16 0:11 ` Christoph Lameter 0 siblings, 1 reply; 18+ messages in thread From: David S. Miller @ 2004-08-15 23:58 UTC (permalink / raw) To: Christoph Lameter; +Cc: linux-ia64, linux-kernel On Sun, 15 Aug 2004 15:58:27 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote: > On Sun, 15 Aug 2004, David S. Miller wrote: > > > > > Is the read lock in the VMA semaphore enough to let you do > > the pgd/pmd walking without the page_table_lock? > > I think it is, but just checking. > > That would be great.... May I change the page_table lock to > be a read write spinlock instead? No, I meant "is the read lock _ON_ the VMA semaphore". The VMA semaphore is a read/write semaphore, and we grab it for reading in the code path you're modifying. Please don't change page_table_lock to a rwlock, it's only needed for write accesses. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks? 2004-08-15 23:58 ` David S. Miller @ 2004-08-16 0:11 ` Christoph Lameter 2004-08-16 1:56 ` David S. Miller 0 siblings, 1 reply; 18+ messages in thread From: Christoph Lameter @ 2004-08-16 0:11 UTC (permalink / raw) To: David S. Miller; +Cc: linux-ia64, linux-kernel On Sun, 15 Aug 2004, David S. Miller wrote: > > On Sun, 15 Aug 2004, David S. Miller wrote: > > > Is the read lock in the VMA semaphore enough to let you do > > > the pgd/pmd walking without the page_table_lock? > > > I think it is, but just checking. > > > > That would be great.... May I change the page_table lock to > > be a read write spinlock instead? > > No, I means "is the read long _ON_ the VMA semaphore". > The VMA semaphore is a read/write semaphore, and we grab > it for reading in the code path you're modifying. > > Please don't change page_table_lock to a rwlock, it's > only needed for write accesses. pgd/pmd walking should always be possible even without the vma semaphore since the CPU can potentially walk the chain at any time. The modification of the pte is not without issue since there are other code paths that may modify pte's and rely on the page_table_lock to exclude others from modifying ptes. One known problem is the swap code, which sets the page to the pte_not_present condition to ensure that nothing else touches the page while it is figuring out where to put it. A page fault during that time (skipping the checking of the page_table_lock) will cause the fastpath to be taken, which will then assign new memory to it. We need some scheme by which finer-granularity locks can be realized. One possibility is to abuse the rw spinlock to not only allow exclusive access to the page tables (as done right now with the spinlock) but also allow shared access with pte locking after a read lock. Is there any other way to realize this? ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks? 2004-08-16 0:11 ` Christoph Lameter @ 2004-08-16 1:56 ` David S. Miller 2004-08-16 3:29 ` Christoph Lameter 0 siblings, 1 reply; 18+ messages in thread From: David S. Miller @ 2004-08-16 1:56 UTC (permalink / raw) To: Christoph Lameter; +Cc: linux-ia64, linux-kernel On Sun, 15 Aug 2004 17:11:53 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote: > pgd/pmd walking should be possible always even without the vma semaphore > since the CPU can potentially walk the chain at anytime. munmap() can destroy pmd and pte tables. Somehow we have to protect against that, and currently that means having the VMA semaphore held for reading, see free_pgtables(). ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks? 2004-08-16 1:56 ` David S. Miller @ 2004-08-16 3:29 ` Christoph Lameter 2004-08-16 14:39 ` William Lee Irwin III 0 siblings, 1 reply; 18+ messages in thread From: Christoph Lameter @ 2004-08-16 3:29 UTC (permalink / raw) To: David S. Miller; +Cc: linux-ia64, linux-kernel On Sun, 15 Aug 2004, David S. Miller wrote: > On Sun, 15 Aug 2004 17:11:53 -0700 (PDT) > Christoph Lameter <clameter@sgi.com> wrote: > > > pgd/pmd walking should be possible always even without the vma semaphore > > since the CPU can potentially walk the chain at anytime. > > munmap() can destroy pmd and pte tables. somehow we have > to protect against that, and currently that is having the > VMA semaphore held for reading, see free_pgtables(). It looks to me like the code takes care to provide the correct sequencing so that the integrity of pgd,pmd and pte links is guaranteed from the viewpoint of the MMU in the CPUs. munmap is there to protect one kernel thread messing with the addresses of these entities that might be stored in another threads register. Therefore it is safe to walk the chain only holding the semaphore read lock? If the mmap lock already guarantees the integrity of the pgd,pmd,pte system, then pte locking would be okay as long as integrity of the pgd,pmd and pte's is always guaranteed. Then also adding a lock bit would work. So then there are two ways of modifying the pgd,pmd and pte's. A) Processor obtains vma semaphore write lock and does large scale modifications to pgd,pmd,pte. B) Processor obtains vma semaphore read lock but is still free to do modifications on individual pte's while holding that vma lock. There is no need to acquire the page_table_lock. These changes must be atomic. The role of the page_table_lock is restricted *only* to the "struct page" stuff? It says in the comments regarding handle_mm_fault that the lock is taken for synchronization with kswapd in regards to the pte entries. 
Seems that this use of the page_table_lock is wrong. A or B should have been used. We could simply remove the page_table_lock from handle_mm_fault and provide the synchronization with kswapd with pte locks right? Both processes are essentially doing modifications on pte's while holding the vma read lock and I would be changing the way of synchronization between these two processes. F.e. something along these lines removing the page_table_lock from handle_mm_fault and friends. Surprisingly this will also avoid many rereads of the pte's since the pte's are really locked. This is just for illustrative purpose and unfinished... Index: linux-2.6.8.1/mm/memory.c =================================================================== --- linux-2.6.8.1.orig/mm/memory.c 2004-08-15 06:03:04.000000000 -0700 +++ linux-2.6.8.1/mm/memory.c 2004-08-15 20:26:29.000000000 -0700 @@ -1035,8 +1035,7 @@ * change only once the write actually happens. This avoids a few races, * and potentially makes it more efficient. * - * We hold the mm semaphore and the page_table_lock on entry and exit - * with the page_table_lock released. + * We hold the mm semaphore. */ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct * vma, unsigned long address, pte_t *page_table, pmd_t *pmd, pte_t pte) @@ -1051,10 +1050,10 @@ * at least the kernel stops what it's doing before it corrupts * data, but for the moment just pretend this is OOM. 
*/ + ptep_unlock(page_table); pte_unmap(page_table); printk(KERN_ERR "do_wp_page: bogus page at address %08lx\n", address); - spin_unlock(&mm->page_table_lock); return VM_FAULT_OOM; } old_page = pfn_to_page(pfn); @@ -1069,7 +1068,7 @@ ptep_set_access_flags(vma, address, page_table, entry, 1); update_mmu_cache(vma, address, entry); pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); + /* pte lock unlocked by ptep_set_access */ return VM_FAULT_MINOR; } } @@ -1080,7 +1079,7 @@ */ if (!PageReserved(old_page)) page_cache_get(old_page); - spin_unlock(&mm->page_table_lock); + ptep_unlock(page_table); if (unlikely(anon_vma_prepare(vma))) goto no_new_page; @@ -1090,26 +1089,21 @@ copy_cow_page(old_page,new_page,address); /* - * Re-check the pte - we dropped the lock + * There is no need to recheck. The pte was locked */ - spin_lock(&mm->page_table_lock); - page_table = pte_offset_map(pmd, address); - if (likely(pte_same(*page_table, pte))) { - if (PageReserved(old_page)) - ++mm->rss; - else - page_remove_rmap(old_page); - break_cow(vma, new_page, address, page_table); - lru_cache_add_active(new_page); - page_add_anon_rmap(new_page, vma, address); + if (PageReserved(old_page)) + ++mm->rss; + else + page_remove_rmap(old_page); + break_cow(vma, new_page, address, page_table); + lru_cache_add_active(new_page); + page_add_anon_rmap(new_page, vma, address); - /* Free the old page.. */ - new_page = old_page; - } + /* Free the old page.. */ + new_page = old_page; pte_unmap(page_table); page_cache_release(new_page); page_cache_release(old_page); - spin_unlock(&mm->page_table_lock); return VM_FAULT_MINOR; no_new_page: @@ -1314,8 +1308,8 @@ } /* - * We hold the mm semaphore and the page_table_lock on entry and - * should release the pagetable lock on exit.. + * We hold the mm semaphore and a pte lock n entry and + * should release the pte lock on exit.. 
*/ static int do_swap_page(struct mm_struct * mm, struct vm_area_struct * vma, unsigned long address, @@ -1327,27 +1321,10 @@ int ret = VM_FAULT_MINOR; pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); page = lookup_swap_cache(entry); if (!page) { swapin_readahead(entry, address, vma); page = read_swap_cache_async(entry, vma, address); - if (!page) { - /* - * Back out if somebody else faulted in this pte while - * we released the page table lock. - */ - spin_lock(&mm->page_table_lock); - page_table = pte_offset_map(pmd, address); - if (likely(pte_same(*page_table, orig_pte))) - ret = VM_FAULT_OOM; - else - ret = VM_FAULT_MINOR; - pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); - goto out; - } - /* Had to read the page from swap area: Major fault */ ret = VM_FAULT_MAJOR; inc_page_state(pgmajfault); @@ -1356,21 +1333,6 @@ mark_page_accessed(page); lock_page(page); - /* - * Back out if somebody else faulted in this pte while we - * released the page table lock. - */ - spin_lock(&mm->page_table_lock); - page_table = pte_offset_map(pmd, address); - if (unlikely(!pte_same(*page_table, orig_pte))) { - pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); - unlock_page(page); - page_cache_release(page); - ret = VM_FAULT_MINOR; - goto out; - } - /* The page isn't present yet, go ahead with the fault. */ swap_free(entry); @@ -1398,8 +1360,8 @@ /* No need to invalidate - it was non-present before */ update_mmu_cache(vma, address, pte); + ptep_unlock(page_table); pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); out: return ret; } @@ -1424,7 +1386,6 @@ if (write_access) { /* Allocate our own private page. 
*/ pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); if (unlikely(anon_vma_prepare(vma))) goto no_mem; @@ -1433,13 +1394,12 @@ goto no_mem; clear_user_highpage(page, addr); - spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, addr); if (!pte_none(*page_table)) { + ptep_unlock(page_table); pte_unmap(page_table); page_cache_release(page); - spin_unlock(&mm->page_table_lock); goto out; } mm->rss++; @@ -1456,7 +1416,6 @@ /* No need to invalidate - it was non-present before */ update_mmu_cache(vma, addr, entry); - spin_unlock(&mm->page_table_lock); out: return VM_FAULT_MINOR; no_mem: @@ -1472,8 +1431,8 @@ * As this is called only for pages that do not currently exist, we * do not need to flush old virtual caches or the TLB. * - * This is called with the MM semaphore held and the page table - * spinlock held. Exit with the spinlock released. + * This is called with the MM semaphore held and pte lock + * held. Exit with the pte lock released. */ static int do_no_page(struct mm_struct *mm, struct vm_area_struct *vma, @@ -1489,9 +1448,9 @@ if (!vma->vm_ops || !vma->vm_ops->nopage) return do_anonymous_page(mm, vma, page_table, pmd, write_access, address); + ptep_unlock(page_table); pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); - + if (vma->vm_file) { mapping = vma->vm_file->f_mapping; sequence = atomic_read(&mapping->truncate_count); @@ -1523,7 +1482,7 @@ anon = 1; } - spin_lock(&mm->page_table_lock); + while (ptep_lock(page_table)) ; /* * For a file-backed vma, someone could have truncated or otherwise * invalidated this page. If unmap_mapping_range got called, @@ -1532,7 +1491,7 @@ if (mapping && (unlikely(sequence != atomic_read(&mapping->truncate_count)))) { sequence = atomic_read(&mapping->truncate_count); - spin_unlock(&mm->page_table_lock); + ptep_unlock(page_table); page_cache_release(new_page); goto retry; } @@ -1565,15 +1524,15 @@ pte_unmap(page_table); } else { /* One of our sibling threads was faster, back out. 
*/ + ptep_unlock(page_table); pte_unmap(page_table); page_cache_release(new_page); - spin_unlock(&mm->page_table_lock); goto out; } /* no need to invalidate: a not-present page shouldn't be cached */ update_mmu_cache(vma, address, entry); - spin_unlock(&mm->page_table_lock); + ptep_unlock(page_table); out: return ret; oom: @@ -1606,8 +1565,8 @@ pgoff = pte_to_pgoff(*pte); + ptep_unlock(pte); pte_unmap(pte); - spin_unlock(&mm->page_table_lock); err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0); if (err == -ENOMEM) @@ -1644,13 +1603,11 @@ { pte_t entry; - entry = *pte; + entry = *pte; /* get the unlocked value so that we do not write the lock bit back */ + + if (ptep_lock(pte)) return VM_FAULT_MINOR; + if (!pte_present(entry)) { - /* - * If it truly wasn't present, we know that kswapd - * and the PTE updates will not touch it later. So - * drop the lock. - */ if (pte_none(entry)) return do_no_page(mm, vma, address, write_access, pte, pmd); if (pte_file(entry)) @@ -1668,7 +1625,6 @@ ptep_set_access_flags(vma, address, pte, entry, write_access); update_mmu_cache(vma, address, entry); pte_unmap(pte); - spin_unlock(&mm->page_table_lock); return VM_FAULT_MINOR; } @@ -1688,12 +1644,6 @@ if (is_vm_hugetlb_page(vma)) return VM_FAULT_SIGBUS; /* mapping truncation does this. */ - - /* - * We need the page table lock to synchronize with kswapd - * and the SMP-safe atomic PTE updates. - */ - spin_lock(&mm->page_table_lock); pmd = pmd_alloc(mm, pgd, address); if (pmd) { @@ -1701,7 +1651,6 @@ if (pte) return handle_pte_fault(mm, vma, address, write_access, pte, pmd); } - spin_unlock(&mm->page_table_lock); return VM_FAULT_OOM; } Index: linux-2.6.8.1/mm/rmap.c =================================================================== --- linux-2.6.8.1.orig/mm/rmap.c 2004-08-14 03:56:22.000000000 -0700 +++ linux-2.6.8.1/mm/rmap.c 2004-08-15 19:59:32.000000000 -0700 @@ -494,8 +494,14 @@ /* Nuke the page table entry. 
*/ flush_cache_page(vma, address); +#ifdef __HAVE_ARCH_PTE_LOCK + /* If we would simply zero the pte then handle_mm_fault might + * race against this code and reinstate an anonymous mapping + */ + pteval = ptep_clear_and_lock_flush(vma, address, pte); +#else pteval = ptep_clear_flush(vma, address, pte); - +#endif /* Move the dirty bit to the physical page now the pte is gone. */ if (pte_dirty(pteval)) set_page_dirty(page); @@ -508,9 +514,13 @@ */ BUG_ON(!PageSwapCache(page)); swap_duplicate(entry); + /* This is going to clear the lock that may have been set on the pte */ set_pte(pte, swp_entry_to_pte(entry)); BUG_ON(pte_file(*pte)); } +#ifdef __HAVE_ARCH_PTE_LOCK + else ptep_unlock(pte); +#endif mm->rss--; BUG_ON(!page->mapcount); ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks? 2004-08-16 3:29 ` Christoph Lameter @ 2004-08-16 14:39 ` William Lee Irwin III 2004-08-17 15:28 ` page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP Christoph Lameter 0 siblings, 1 reply; 18+ messages in thread From: William Lee Irwin III @ 2004-08-16 14:39 UTC (permalink / raw) To: Christoph Lameter; +Cc: David S. Miller, linux-ia64, linux-kernel On Sun, 15 Aug 2004, David S. Miller wrote: >> munmap() can destroy pmd and pte tables. somehow we have >> to protect against that, and currently that is having the >> VMA semaphore held for reading, see free_pgtables(). On Sun, Aug 15, 2004 at 08:29:11PM -0700, Christoph Lameter wrote: > It looks to me like the code takes care to provide the correct > sequencing so that the integrity of pgd,pmd and pte links is > guaranteed from the viewpoint of the MMU in the CPUs. munmap is there to > protect one kernel thread messing with the addresses of these entities > that might be stored in another threads register. > Therefore it is safe to walk the chain only holding the semaphore read > lock? Detached pagetables are assumed to be freeable after a TLB flush IPI. Previously holding ->page_table_lock would prevent the shootdowns of links to the pagetable page from executing concurrently with modifications to the pagetable page. Disabling interrupts or otherwise inhibiting the progress of the IPI'ing cpu is needed to prevent dereferencing freed pagetables and incorrect accounting based on contents of about-to-be-freed pagetables. Reference counting pagetable pages may help here, where the final put would be responsible for unaccounting the various things in the pagetable page. -- wli ^ permalink raw reply [flat|nested] 18+ messages in thread
* page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP 2004-08-16 14:39 ` William Lee Irwin III @ 2004-08-17 15:28 ` Christoph Lameter 2004-08-17 15:37 ` Christoph Hellwig ` (2 more replies) 0 siblings, 3 replies; 18+ messages in thread From: Christoph Lameter @ 2004-08-17 15:28 UTC (permalink / raw) To: William Lee Irwin III Cc: David S. Miller, raybry, ak, benh, manfred, linux-ia64, linux-kernel This is the second release of the page fault fastpath patch. The fast path avoids locking during the creation of page table entries for anonymous memory in a threaded application running on an SMP system. Performance increases significantly for more than 4 threads running concurrently. Changes: - Ensure that it is safe to call the various functions without holding the page_table_lock. - Fix cases in rmap.c where a pte could be cleared for a very short time before being set to another value, by introducing a pte_xchg function. This created a potential race condition with the fastpath code, which checks for a cleared pte without holding the page_table_lock. - i386 support - Various cleanups Issue remaining: - The fastpath increments mm->rss without acquiring the page_table_lock. Introducing the page_table_lock even for a short time makes performance drop to the level before the patch. Ideas: - One could avoid pte locking by introducing a pte_cmpxchg. cmpxchg seems to be supported by all ia64 and i386 cpus except the original 80386. - Make rss atomic or eliminate rss? 
==== 8 CPU SMP system

Unpatched:
 Gb Rep Threads    User      System     Wall   flt/cpu/s fault/wsec
  2   3    1      0.094s     4.500s   4.059s   85561.646  85568.398
  2   3    2      0.092s     6.390s   3.043s   60649.650 114521.474
  2   3    4      0.081s     6.500s   1.093s   59740.813 203552.963
  2   3    8      0.101s    12.001s   2.035s   32487.736 167082.560

With page fault fastpath patch:
 Gb Rep Threads    User      System     Wall   flt/cpu/s fault/wsec
  2   3    1      0.095s     4.544s   4.064s   84733.378  84699.952
  2   3    2      0.080s     4.749s   2.056s   81426.302 153163.463
  2   3    4      0.081s     5.173s   1.057s   74828.674 249792.084
  2   3    8      0.093s     7.097s   1.021s   54678.576 324072.260

==== 16 CPU system

Unpatched:
 Gb Rep Threads    User      System     Wall   flt/cpu/s fault/wsec
 16   3    1      0.627s    61.749s  62.038s   50430.908  50427.364
 16   3    2      0.579s    64.237s  33.068s   48532.874  93375.083
 16   3    4      0.608s    87.579s  28.011s   35670.888 111900.261
 16   3    8      0.612s   122.913s  19.074s   25466.233 159343.342
 16   3   16      0.617s   383.727s  26.091s    8184.648 116868.093
 16   3   32      2.492s   753.081s  25.031s    4163.364 124275.119

With page fault fastpath patch:
 Gb Rep Threads    User      System     Wall   flt/cpu/s fault/wsec
 16   3    1      0.572s    61.460s  62.003s   50710.367  50705.490
 16   3    2      0.571s    63.951s  33.057s   48753.975  93679.565
 16   3    4      0.593s    72.737s  24.078s   42897.603 126927.505
 16   3    8      0.625s    85.085s  15.008s   36701.575 208502.061
 16   3   16      0.560s    67.191s   6.096s   46430.048 451954.271
 16   3   32      1.599s   162.986s   5.079s   19112.972 543031.652

==== 512 CPU system

Unpatched:
 Gb Rep Threads    User      System     Wall   flt/cpu/s fault/wsec
 16   3    1      0.748s    67.200s  67.098s   46295.921  46270.533
 16   3    2      0.899s   100.189s  52.021s   31118.426  60242.544
 16   3    4      1.517s   103.467s  31.021s   29963.479 100777.788
 16   3    8      1.268s   166.023s  26.035s   18803.807 119350.434
 16   3   16      6.296s   453.445s  33.082s    6842.371  92987.774
 16   3   32     22.434s  1341.205s  48.026s    2306.860  65174.913
 16   3   64     54.189s  4633.748s  81.089s     671.026  38411.466
 16   3  128    244.333s 17584.111s 152.026s     176.444  20659.132
 16   3  256    222.936s  8167.241s  73.018s     374.930  42983.366
 16   3  512    207.464s  4259.264s  39.044s     704.258  79741.366

With page fault fastpath patch:
 Gb Rep Threads    User      System     Wall   flt/cpu/s fault/wsec
 16   3    1      0.884s    64.241s  65.014s   48302.177  48287.787
 16   3    2      0.931s    99.156s  51.058s   31429.640  60979.126
 16   3    4      1.028s    88.451s  26.096s   35155.837 116669.999
 16   3    8      1.957s    61.395s  12.099s   49654.307 242078.305
 16   3   16      5.701s    81.382s   9.039s   36122.904 334774.381
 16   3   32     15.207s   163.893s   9.094s   17564.021 316284.690
 16   3   64     76.056s   440.771s  13.037s    6086.601 235120.800
 16   3  128    203.843s  1535.909s  19.084s    1808.145 158495.679
 16   3  256    274.815s   755.764s  12.058s    3052.387 250010.942
 16   3  512    205.505s   381.106s   7.060s    5362.531 413531.352

Test program and scripts were posted with the first release of this patch.

Feedback welcome. I will be at a conference for the rest of the week and
may reply late to feedback.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

==== FASTPATH PATCH

Index: linux-2.6.8.1/mm/memory.c
===================================================================
--- linux-2.6.8.1.orig/mm/memory.c	2004-08-14 03:55:24.000000000 -0700
+++ linux-2.6.8.1/mm/memory.c	2004-08-16 21:37:39.000000000 -0700
@@ -1680,6 +1680,10 @@
 {
 	pgd_t *pgd;
 	pmd_t *pmd;
+#ifdef __HAVE_ARCH_PTE_LOCK
+	pte_t *pte;
+	pte_t entry;
+#endif
 	__set_current_state(TASK_RUNNING);
 	pgd = pgd_offset(mm, address);
@@ -1688,7 +1692,81 @@
 	if (is_vm_hugetlb_page(vma))
 		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */
+#ifdef __HAVE_ARCH_PTE_LOCK
+	/*
+	 * Fast path for anonymous pages, not found faults bypassing
+	 * the necessity to acquire the page_table_lock
+	 */
+
+	if ((vma->vm_ops && vma->vm_ops->nopage) || pgd_none(*pgd)) goto use_page_table_lock;
+	pmd = pmd_offset(pgd,address);
+	if (pmd_none(*pmd)) goto use_page_table_lock;
+	pte = pte_offset_kernel(pmd,address);
+	if (pte_locked(*pte)) return VM_FAULT_MINOR;
+	if (!pte_none(*pte)) goto use_page_table_lock;
+
+	/*
+	 * Page not present, so kswapd and PTE updates will not touch the pte
+	 * so we are able to just use a pte lock.
+	 */
+
+	/* Returning from the fault handler may cause another fault if the pte is still locked */
+	if (ptep_lock(pte)) return VM_FAULT_MINOR;
+	/* Someone could have set the pte to something else before we acquired the lock, so check */
+	if (!pte_none(pte_mkunlocked(*pte))) {
+		ptep_unlock(pte);
+		return VM_FAULT_MINOR;
+	}
+	/* Read-only mapping of ZERO_PAGE. */
+	entry = pte_wrprotect(mk_pte(ZERO_PAGE(address), vma->vm_page_prot));
+
+	if (write_access) {
+		struct page *page;
+
+		/*
+		 * anon_vma_prepare only requires the mmap_sem lock and
+		 * will acquire the page_table_lock if necessary
+		 */
+		if (unlikely(anon_vma_prepare(vma))) goto no_mem;
+
+		/* alloc_page_vma only requires mmap_sem lock */
+		page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+		if (!page) goto no_mem;
+
+		clear_user_highpage(page, address);
+
+		entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,vma->vm_page_prot)),vma);
+		/* lru_cache_add_active uses a cpu_var */
+		lru_cache_add_active(page);
+		mark_page_accessed(page);
+
+		/*
+		 * Incrementing rss usually requires the page_table_lock
+		 * We need something to make this atomic!
+		 * Adding a lock here will hurt performance significantly
+		 */
+		mm->rss++;
+
+		/*
+		 * Invoking page_add_anon_rmap without the page_table_lock since
+		 * page is a newly allocated page not yet managed by VM
+		 */
+		page_add_anon_rmap(page, vma, address);
+	}
+	/* Setting the pte clears the pte lock so there is no need for unlocking */
+	set_pte(pte, entry);
+	pte_unmap(pte);
+
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache(vma, address, entry);
+	return VM_FAULT_MINOR;	/* Minor fault */
+no_mem:
+	ptep_unlock(pte);
+	return VM_FAULT_OOM;
+
+use_page_table_lock:
+#endif
 	/*
 	 * We need the page table lock to synchronize with kswapd
 	 * and the SMP-safe atomic PTE updates.
Index: linux-2.6.8.1/mm/rmap.c
===================================================================
--- linux-2.6.8.1.orig/mm/rmap.c	2004-08-14 03:56:22.000000000 -0700
+++ linux-2.6.8.1/mm/rmap.c	2004-08-16 21:41:19.000000000 -0700
@@ -333,7 +333,10 @@
  * @vma: the vm area in which the mapping is added
  * @address: the user virtual address mapped
  *
- * The caller needs to hold the mm->page_table_lock.
+ * The caller needs to hold the mm->page_table_lock if page
+ * is pointing to something that is known by the vm.
+ * The lock does not need to be held if page is pointing
+ * to a newly allocated page.
  */
 void page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
@@ -494,11 +497,6 @@
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address);
-	pteval = ptep_clear_flush(vma, address, pte);
-
-	/* Move the dirty bit to the physical page now the pte is gone. */
-	if (pte_dirty(pteval))
-		set_page_dirty(page);
 	if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page->private };
@@ -508,9 +506,14 @@
 		 */
 		BUG_ON(!PageSwapCache(page));
 		swap_duplicate(entry);
-		set_pte(pte, swp_entry_to_pte(entry));
+		pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
 		BUG_ON(pte_file(*pte));
-	}
+	} else
+		pteval = ptep_clear_flush(vma, address, pte);
+
+	/* Move the dirty bit to the physical page now the pte is gone. */
+	if (pte_dirty(pteval))
+		set_page_dirty(page);
 	mm->rss--;
 	BUG_ON(!page->mapcount);
@@ -602,11 +605,12 @@
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address);
-	pteval = ptep_clear_flush(vma, address, pte);
 	/* If nonlinear, store the file page offset in the pte. */
 	if (page->index != linear_page_index(vma, address))
-		set_pte(pte, pgoff_to_pte(page->index));
+		pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+	else
+		pteval = ptep_clear_flush(vma, address, pte);
 	/* Move the dirty bit to the physical page now the pte is gone. */
 	if (pte_dirty(pteval))

===== PTE LOCK PATCH

Index: linux-2.6.8.1/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.8.1.orig/include/asm-generic/pgtable.h	2004-08-14 03:55:10.000000000 -0700
+++ linux-2.6.8.1/include/asm-generic/pgtable.h	2004-08-16 21:36:11.000000000 -0700
@@ -85,6 +85,15 @@
 }
 #endif
+#ifndef __HAVE_ARCH_PTEP_XCHG
+static inline pte_t ptep_xchg(pte_t *ptep,pte_t pteval)
+{
+	pte_t pte = *ptep;
+	set_pte(ptep, pteval);
+	return pte;
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
 #define ptep_clear_flush(__vma, __address, __ptep)	\
 ({	\
@@ -94,6 +103,16 @@
 })
 #endif
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval)	\
+({	\
+	pte_t __pte = ptep_xchg(__ptep, __pteval);	\
+	flush_tlb_page(__vma, __address);	\
+	__pte;	\
+})
+#endif
+
+
 #ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
 static inline void ptep_set_wrprotect(pte_t *ptep)
 {

Index: linux-2.6.8.1/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.8.1.orig/include/asm-ia64/pgtable.h	2004-08-14 03:55:10.000000000 -0700
+++ linux-2.6.8.1/include/asm-ia64/pgtable.h	2004-08-16 20:36:12.000000000 -0700
@@ -30,6 +30,8 @@
 #define _PAGE_P_BIT	0
 #define _PAGE_A_BIT	5
 #define _PAGE_D_BIT	6
+#define _PAGE_IG_BITS	53
+#define _PAGE_LOCK_BIT	(_PAGE_IG_BITS+3)	/* bit 56. Aligned to 8 bits */
 #define _PAGE_P	(1 << _PAGE_P_BIT)	/* page present bit */
 #define _PAGE_MA_WB	(0x0 << 2)	/* write back memory attribute */
@@ -58,6 +60,7 @@
 #define _PAGE_PPN_MASK	(((__IA64_UL(1) << IA64_MAX_PHYS_BITS) - 1) & ~0xfffUL)
 #define _PAGE_ED	(__IA64_UL(1) << 52)	/* exception deferral */
 #define _PAGE_PROTNONE	(__IA64_UL(1) << 63)
+#define _PAGE_LOCK	(__IA64_UL(1) << _PAGE_LOCK_BIT)
 /* Valid only for a PTE with the present bit cleared: */
 #define _PAGE_FILE	(1 << 1)	/* see swap & file pte remarks below */
@@ -281,6 +284,13 @@
 #define pte_mkyoung(pte)	(__pte(pte_val(pte) | _PAGE_A))
 #define pte_mkclean(pte)	(__pte(pte_val(pte) & ~_PAGE_D))
 #define pte_mkdirty(pte)	(__pte(pte_val(pte) | _PAGE_D))
+#define pte_mkunlocked(pte)	(__pte(pte_val(pte) & ~_PAGE_LOCK))
+/*
+ * Lock functions for pte's
+*/
+#define ptep_lock(ptep)	test_and_set_bit(_PAGE_LOCK_BIT,ptep)
+#define ptep_unlock(ptep)	{ clear_bit(_PAGE_LOCK_BIT,ptep);smp_mb__after_clear_bit(); }
+#define pte_locked(pte)	((pte_val(pte) & _PAGE_LOCK)!=0)
 /*
  * Macro to mark a page protection value as "uncacheable". Note that "protection" is really a
@@ -387,6 +397,18 @@
 #endif
 }
+static inline pte_t
+ptep_xchg (pte_t *ptep,pte_t pteval)
+{
+#ifdef CONFIG_SMP
+	return __pte(xchg((long *) ptep, pteval.pte));
+#else
+	pte_t pte = *ptep;
+	set_pte(ptep,pteval);
+	return pte;
+#endif
+}
+
 static inline void
 ptep_set_wrprotect (pte_t *ptep)
 {
@@ -554,10 +576,12 @@
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
+#define __HAVE_ARCH_PTEP_XCHG
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
 #define __HAVE_ARCH_PGD_OFFSET_GATE
+#define __HAVE_ARCH_PTE_LOCK
 #include <asm-generic/pgtable.h>
 #endif /* _ASM_IA64_PGTABLE_H */

Index: linux-2.6.8.1/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.8.1.orig/include/asm-i386/pgtable.h	2004-08-14 03:55:48.000000000 -0700
+++ linux-2.6.8.1/include/asm-i386/pgtable.h	2004-08-16 20:36:12.000000000 -0700
@@ -101,7 +101,7 @@
 #define _PAGE_BIT_DIRTY	6
 #define _PAGE_BIT_PSE	7	/* 4 MB (or 2MB) page, Pentium+, if present.. */
 #define _PAGE_BIT_GLOBAL	8	/* Global TLB entry PPro+ */
-#define _PAGE_BIT_UNUSED1	9	/* available for programmer */
+#define _PAGE_BIT_LOCK	9	/* available for programmer */
 #define _PAGE_BIT_UNUSED2	10
 #define _PAGE_BIT_UNUSED3	11
 #define _PAGE_BIT_NX	63
@@ -115,7 +115,7 @@
 #define _PAGE_DIRTY	0x040
 #define _PAGE_PSE	0x080	/* 4 MB (or 2MB) page, Pentium+, if present.. */
 #define _PAGE_GLOBAL	0x100	/* Global TLB entry PPro+ */
-#define _PAGE_UNUSED1	0x200	/* available for programmer */
+#define _PAGE_LOCK	0x200	/* available for programmer */
 #define _PAGE_UNUSED2	0x400
 #define _PAGE_UNUSED3	0x800
@@ -201,6 +201,7 @@
 extern unsigned long pg0[];
 #define pte_present(x)	((x).pte_low & (_PAGE_PRESENT | _PAGE_PROTNONE))
+#define pte_locked(x)	((x).pte_low & _PAGE_LOCK)
 #define pte_clear(xp)	do { set_pte(xp, __pte(0)); } while (0)
 #define pmd_none(x)	(!pmd_val(x))
@@ -236,6 +237,7 @@
 static inline pte_t pte_mkdirty(pte_t pte)	{ (pte).pte_low |= _PAGE_DIRTY; return pte; }
 static inline pte_t pte_mkyoung(pte_t pte)	{ (pte).pte_low |= _PAGE_ACCESSED; return pte; }
 static inline pte_t pte_mkwrite(pte_t pte)	{ (pte).pte_low |= _PAGE_RW; return pte; }
+static inline pte_t pte_mkunlocked(pte_t pte)	{ (pte).pte_low &= ~_PAGE_LOCK; return pte; }
 #ifdef CONFIG_X86_PAE
 # include <asm/pgtable-3level.h>
@@ -260,6 +262,9 @@
 static inline void ptep_set_wrprotect(pte_t *ptep)	{ clear_bit(_PAGE_BIT_RW, &ptep->pte_low); }
 static inline void ptep_mkdirty(pte_t *ptep)	{ set_bit(_PAGE_BIT_DIRTY, &ptep->pte_low); }
+#define ptep_lock(ptep)	test_and_set_bit(_PAGE_BIT_LOCK,&ptep->pte_low)
+#define ptep_unlock(ptep)	clear_bit(_PAGE_BIT_LOCK,&ptep->pte_low)
+
 /*
  * Macro to mark a page protection value as "uncacheable". On processors which do not support
  * it, this is a no-op.
@@ -416,9 +421,11 @@
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
+#define __HAVE_ARCH_PTEP_XCHG
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define __HAVE_ARCH_PTEP_MKDIRTY
 #define __HAVE_ARCH_PTE_SAME
+#define __HAVE_ARCH_PTE_LOCK
 #include <asm-generic/pgtable.h>
 #endif /* _I386_PGTABLE_H */

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP
  2004-08-17 15:28 ` page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP Christoph Lameter
@ 2004-08-17 15:37 ` Christoph Hellwig
  2004-08-17 15:51 ` William Lee Irwin III
  2004-08-18 17:55 ` Hugh Dickins
  2 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2004-08-17 15:37 UTC (permalink / raw)
To: Christoph Lameter
Cc: William Lee Irwin III, David S. Miller, raybry, ak, benh, manfred, linux-ia64, linux-kernel

On Tue, Aug 17, 2004 at 08:28:44AM -0700, Christoph Lameter wrote:
> This is the second release of the page fault fastpath patch. The fast path
> avoids locking during the creation of page table entries for anonymous
> memory in a threaded application running on a SMP system. The performance
> increases significantly for more than 4 threads running concurrently.

Please reformat your patch according to Documentation/CodingStyle (or just
look at the surrounding code..).

Also you're duplicating far too much code of the regular pagefault code,
this probably wants some inlined helpers.

Your ptep_lock should be called ptep_trylock.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP
  2004-08-17 15:28 ` page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP Christoph Lameter
  2004-08-17 15:37 ` Christoph Hellwig
@ 2004-08-17 15:51 ` William Lee Irwin III
  2004-08-18 17:55 ` Hugh Dickins
  2 siblings, 0 replies; 18+ messages in thread
From: William Lee Irwin III @ 2004-08-17 15:51 UTC (permalink / raw)
To: Christoph Lameter
Cc: David S. Miller, raybry, ak, benh, manfred, linux-ia64, linux-kernel

On Tue, Aug 17, 2004 at 08:28:44AM -0700, Christoph Lameter wrote:
> This is the second release of the page fault fastpath patch. The fast path
> avoids locking during the creation of page table entries for anonymous
> memory in a threaded application running on a SMP system. The performance
> increases significantly for more than 4 threads running concurrently.
> Changes:
> - Ensure that it is safe to call the various functions without holding
>   the page_table_lock.
> - Fix cases in rmap.c where a pte could be cleared for a very short time
>   before being set to another value by introducing a pte_xchg function. This
>   created a potential race condition with the fastpath code which checks for
>   a cleared pte without holding the page_table_lock.
> - i386 support
> - Various cleanups
> Issue remaining:
> - The fastpath increments mm->rss without acquiring the page_table_lock.
>   Introducing the page_table_lock even for a short time makes performance
>   drop to the level before the patch.

Hmm. I'm suspicious but I can't immediately poke a hole in it as it leaves
most uses of ->page_table_lock in place. I can't help thinking there's a
more comprehensive attack on the locking in this area, either.

-- wli

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP
  2004-08-17 15:28 ` page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP Christoph Lameter
  2004-08-17 15:37 ` Christoph Hellwig
  2004-08-17 15:51 ` William Lee Irwin III
@ 2004-08-18 17:55 ` Hugh Dickins
  2004-08-18 20:20 ` William Lee Irwin III
  2004-08-19 1:19 ` Christoph Lameter
  2 siblings, 2 replies; 18+ messages in thread
From: Hugh Dickins @ 2004-08-18 17:55 UTC (permalink / raw)
To: Christoph Lameter
Cc: William Lee Irwin III, David S. Miller, raybry, ak, benh, manfred, linux-ia64, linux-kernel

On Tue, 17 Aug 2004, Christoph Lameter wrote:
> This is the second release of the page fault fastpath patch. The fast path
> avoids locking during the creation of page table entries for anonymous
> memory in a threaded application running on a SMP system. The performance
> increases significantly for more than 4 threads running concurrently.

It is interesting. I don't like it at all in its current state,
#ifdef'ed special casing for one particular path through the code,
but it does seem worth taking further.

Just handling that one anonymous case is not worth it, when we know
that the next day someone else from SGI will post a similar test
which shows the same on file pages ;)

Your ptep lock bit avoids collision with pte bits, but does it not
also need to avoid collision with pte swap entry bits? And the
pte_file bit too, at least once it's extended to nopage areas.

I'm very suspicious of the way you just return VM_FAULT_MINOR when
you find the lock bit already set. Yes, you can do that, but the
lock bit is held right across the alloc_page_vma, so other threads
trying to fault the same pte will be spinning back out to user and
refaulting back into kernel while they wait: we'd usually use a
waitqueue and wakeup with that kind of lock; or not hold it across,
and make it a bitspin lock.

It's a realistic case, which I guess your test program won't be trying.
Feels livelocky to me, but I may be overreacting: it's not as
if you're changing the page_table_lock to be treated that way.

> Introducing the page_table_lock even for a short time makes performance
> drop to the level before the patch.

That's interesting, and disappointing.

The main lesson I took from your patch (I think wli was hinting at
the same) is that we ought now to question page_table_lock usage,
should be possible to cut it a lot.

I recall from exchanges with Dave McCracken 18 months ago that the
page_table_lock is _almost_ unnecessary in rmap.c, should be possible
to avoid it there and in some other places.

We take page_table_lock when making absent present and when making
present absent: I like your observation that those are exclusive cases.
But you've found that narrowing the width of the page_table_lock
in a particular path does not help. You sound surprised, me too.
Did you find out why that was?

> - One could avoid pte locking by introducing a pte_cmpxchg. cmpxchg
>   seems to be supported by all ia64 and i386 cpus except the original 80386.

I do think this will be a more fruitful direction than pte locking:
just looking through the arches for spare bits puts me off pte locking.

Hugh

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP
  2004-08-18 17:55 ` Hugh Dickins
@ 2004-08-18 20:20 ` William Lee Irwin III
  2004-08-19 1:19 ` Christoph Lameter
  1 sibling, 0 replies; 18+ messages in thread
From: William Lee Irwin III @ 2004-08-18 20:20 UTC (permalink / raw)
To: Hugh Dickins
Cc: Christoph Lameter, David S. Miller, raybry, ak, benh, manfred, linux-ia64, linux-kernel

On Wed, Aug 18, 2004 at 06:55:07PM +0100, Hugh Dickins wrote:
> It is interesting. I don't like it at all in its current state,
> #ifdef'ed special casing for one particular path through the code,
> but it does seem worth taking further.
> Just handling that one anonymous case is not worth it, when we know
> that the next day someone else from SGI will post a similar test
> which shows the same on file pages ;)
> Your ptep lock bit avoids collision with pte bits, but does it not
> also need to avoid collision with pte swap entry bits? And the
> pte_file bit too, at least once it's extended to nopage areas.
> I'm very suspicious of the way you just return VM_FAULT_MINOR when
> you find the lock bit already set. Yes, you can do that, but the
> lock bit is held right across the alloc_page_vma, so other threads
> trying to fault the same pte will be spinning back out to user and
> refaulting back into kernel while they wait: we'd usually use a
> waitqueue and wakeup with that kind of lock; or not hold it across,
> and make it a bitspin lock.
> It's a realistic case, which I guess your test program won't be trying.
> Feels livelocky to me, but I may be overreacting: it's not as
> if you're changing the page_table_lock to be treated that way.

Both points are valid; it should retry in-kernel for the pte lock bit
and arrange to use a bit not used for swap (there are at least
PAGE_SHIFT of these on all 64-bit arches).
On Tue, 17 Aug 2004, Christoph Lameter wrote:
>> Introducing the page_table_lock even for a short time makes performance
>> drop to the level before the patch.

On Wed, Aug 18, 2004 at 06:55:07PM +0100, Hugh Dickins wrote:
> That's interesting, and disappointing.
> The main lesson I took from your patch (I think wli was hinting at
> the same) is that we ought now to question page_table_lock usage,
> should be possible to cut it a lot.
> I recall from exchanges with Dave McCracken 18 months ago that the
> page_table_lock is _almost_ unnecessary in rmap.c, should be possible
> to avoid it there and in some other places.
> We take page_table_lock when making absent present and when making
> present absent: I like your observation that those are exclusive cases.
> But you've found that narrowing the width of the page_table_lock
> in a particular path does not help. You sound surprised, me too.
> Did you find out why that was?

It also protects against vma tree modifications in mainline, but rmap.c
shouldn't need it for vmas anymore, as the vma is rooted to the spot by
mapping->i_shared_lock for file pages and anon_vma->lock for anonymous.

On Tue, 17 Aug 2004, Christoph Lameter wrote:
>> - One could avoid pte locking by introducing a pte_cmpxchg. cmpxchg
>> seems to be supported by all ia64 and i386 cpus except the original 80386.

On Wed, Aug 18, 2004 at 06:55:07PM +0100, Hugh Dickins wrote:
> I do think this will be a more fruitful direction than pte locking:
> just looking through the arches for spare bits puts me off pte locking.

Fortunately, spare bits aren't strictly necessary, and neither is
cmpxchg. A single invalid value can serve in place of a bitflag. When
using such an invalid value, just xchg()'ing it and looping when the
invalid value is seen should suffice. This holds more generally for all
radix trees, not just pagetables, and happily xchg() or emulation
thereof is required by core code for all arches.
-- wli

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP
  2004-08-18 17:55 ` Hugh Dickins
  2004-08-18 20:20 ` William Lee Irwin III
@ 2004-08-19 1:19 ` Christoph Lameter
  1 sibling, 0 replies; 18+ messages in thread
From: Christoph Lameter @ 2004-08-19 1:19 UTC (permalink / raw)
To: Hugh Dickins
Cc: William Lee Irwin III, David S. Miller, raybry, ak, benh, manfred, linux-ia64, linux-kernel

> > - One could avoid pte locking by introducing a pte_cmpxchg. cmpxchg
> > seems to be supported by all ia64 and i386 cpus except the original 80386.
>
> I do think this will be a more fruitful direction than pte locking:
> just looking through the arches for spare bits puts me off pte locking.

Thanks for the support. Got a V3 here (not ready to post yet) that throws
out the locks and uses cmpxchg instead. It also removes the use of
page_table_lock completely from handle_mm_fault.

^ permalink raw reply	[flat|nested] 18+ messages in thread
end of thread, other threads:[~2004-08-23 23:39 UTC | newest]
Thread overview: 18+ messages
[not found] <2uexw-1Nn-1@gated-at.bofh.it>
[not found] ` <2uCTq-2wa-55@gated-at.bofh.it>
2004-08-18 23:50 ` page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP Rajesh Venkatasubramanian
2004-08-19 0:01 ` William Lee Irwin III
2004-08-19 0:07 ` Rajesh Venkatasubramanian
2004-08-19 0:20 ` William Lee Irwin III
2004-08-19 3:19 ` Rajesh Venkatasubramanian
2004-08-19 3:31 ` William Lee Irwin III
2004-08-19 3:41 ` William Lee Irwin III
2004-08-23 22:00 ` Christoph Lameter
2004-08-23 23:25 ` Rajesh Venkatasubramanian
2004-08-23 23:35 ` Christoph Lameter
[not found] <fa.ofiojek.hkeujs@ifi.uio.no>
[not found] ` <fa.o1kt2ua.1bm6n0c@ifi.uio.no>
2004-08-18 20:13 ` Ray Bryant
2004-08-18 20:48 ` William Lee Irwin III
2004-08-15 13:50 page fault fastpath: Increasing SMP scalability by introducing pte locks? Christoph Lameter
2004-08-15 20:09 ` David S. Miller
2004-08-15 22:58 ` Christoph Lameter
2004-08-15 23:58 ` David S. Miller
2004-08-16 0:11 ` Christoph Lameter
2004-08-16 1:56 ` David S. Miller
2004-08-16 3:29 ` Christoph Lameter
2004-08-16 14:39 ` William Lee Irwin III
2004-08-17 15:28 ` page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP Christoph Lameter
2004-08-17 15:37 ` Christoph Hellwig
2004-08-17 15:51 ` William Lee Irwin III
2004-08-18 17:55 ` Hugh Dickins
2004-08-18 20:20 ` William Lee Irwin III
2004-08-19 1:19 ` Christoph Lameter