Re: page fault fastpath patch v2: fix race conditions, stats for

All of lore.kernel.org
 help / color / mirror / Atom feed

* Re: page fault fastpath patch v2: fix race conditions, stats for
       [not found] ` <fa.o1kt2ua.1bm6n0c@ifi.uio.no>
@ 2004-08-18 20:13     ` Ray Bryant
  0 siblings, 0 replies; 10+ messages in thread
From: Ray Bryant @ 2004-08-18 20:13 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Christoph Lameter, William Lee Irwin III, David S. Miller, ak,
	benh, manfred, linux-ia64, linux-kernel

Hi Hugh,

Hugh Dickins wrote:
> On Tue, 17 Aug 2004, Christoph Lameter wrote:
> 
> 

<snip>

> 
> Just handling that one anonymous case is not worth it, when we know
> that the next day someone else from SGI will post a similar test
> which shows the same on file pages ;)
> 

Hugh -- this is called full employment for kernel scalability analysts.
:-) :-)

Actually, disks are so slow that I wouldn't expect that scalability problem to 
show up in the page fault code, but rather in the block I/O or page cache 
management portions of the code instead.

<snip>

> 
>>Introducing the page_table_lock even for a short time makes performance
>>drop to the level before the patch.
> 
> 
> That's interesting, and disappointing.
> 

I think that the major impact here is actually grabbing the lock when
30 or more processors are trying to obtain it -- the amount of time that the 
lock is actually held is insignificant in comparison.

> The main lesson I took from your patch (I think wli was hinting at
> the same) is that we ought now to question page_table_lock usage,
> should be possible to cut it a lot.
> 

That would be a useful avenue to explore.  Unfortunately, we are on kind of a 
tight fuse here trying to get the next kernel release ready.   At the moment 
we are in the mode of moving fixes from 2.4.21 to 2.6, and this is one such 
fix.   I'd be willing to pursue both in parallel so that in a future release
we have gotten to page_table_lock reduction as well.  Does that make sense at 
all?

(I just don't want to get bogged down in a 6-month effort here unless we can't 
avoid it.)

> I recall from exchanges with Dave McCracken 18 months ago that the
> page_table_lock is _almost_ unnecessary in rmap.c, should be possible
> to get avoid it there and in some other places.
> 
> We take page_table_lock when making absent present and when making
> present absent: I like your observation that those are exclusive cases.
> 
> But you've found that narrowing the width of the page_table_lock
> in a particular path does not help.  You sound surprised, me too.
> Did you find out why that was?
>  

See above comment.

> 
>>- One could avoid pte locking by introducing a pte_cmpxchg. cmpxchg
>>seems to be supported by all ia64 and i386 cpus except the original 80386.
> 
> 
> I do think this will be a more fruitful direction than pte locking:
> just looking through the arches for spare bits puts me off pte locking.
>

The original patch that we had for 2.4.21 did exactly that, we shied away from 
that due to concerns as to which processors allow you to update a running pte 
using a cmpxchg (= the set of processors for which set_pte() is a simple 
store.)  AFAIK, the only such processor is i386, but if Christoph is correct, 
then more recent Intel x86 processors don't even have that restriction.  I'll 
admit that I encouraged Christoph not to follow that path due to concerns of 
arch dependent code creeping into the do_anonymous_page() path.

Best Regards,

Ray

raybry@sgi.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32     and    512 cpu SMP
@ 2004-08-18 20:13     ` Ray Bryant
  0 siblings, 0 replies; 10+ messages in thread
From: Ray Bryant @ 2004-08-18 20:13 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Christoph Lameter, William Lee Irwin III, David S. Miller, ak,
	benh, manfred, linux-ia64, linux-kernel

Hi Hugh,

Hugh Dickins wrote:
> On Tue, 17 Aug 2004, Christoph Lameter wrote:
> 
> 

<snip>

> 
> Just handling that one anonymous case is not worth it, when we know
> that the next day someone else from SGI will post a similar test
> which shows the same on file pages ;)
> 

Hugh -- this is called full employment for kernel scalability analysts.
:-) :-)

Actually, disks are so slow that I wouldn't expect that scalability problem to 
show up in the page fault code, but rather in the block I/O or page cache 
management portions of the code instead.

<snip>

> 
>>Introducing the page_table_lock even for a short time makes performance
>>drop to the level before the patch.
> 
> 
> That's interesting, and disappointing.
> 

I think that the major impact here is actually grabbing the lock when
30 or more processors are trying to obtain it -- the amount of time that the 
lock is actually held is insignificant in comparison.

> The main lesson I took from your patch (I think wli was hinting at
> the same) is that we ought now to question page_table_lock usage,
> should be possible to cut it a lot.
> 

That would be a useful avenue to explore.  Unfortunately, we are on kind of a 
tight fuse here trying to get the next kernel release ready.   At the moment 
we are in the mode of moving fixes from 2.4.21 to 2.6, and this is one such 
fix.   I'd be willing to pursue both in parallel so that in a future release
we have gotten to page_table_lock reduction as well.  Does that make sense at 
all?

(I just don't want to get bogged down in a 6-month effort here unless we can't 
avoid it.)

> I recall from exchanges with Dave McCracken 18 months ago that the
> page_table_lock is _almost_ unnecessary in rmap.c, should be possible
> to get avoid it there and in some other places.
> 
> We take page_table_lock when making absent present and when making
> present absent: I like your observation that those are exclusive cases.
> 
> But you've found that narrowing the width of the page_table_lock
> in a particular path does not help.  You sound surprised, me too.
> Did you find out why that was?
>  

See above comment.

> 
>>- One could avoid pte locking by introducing a pte_cmpxchg. cmpxchg
>>seems to be supported by all ia64 and i386 cpus except the original 80386.
> 
> 
> I do think this will be a more fruitful direction than pte locking:
> just looking through the arches for spare bits puts me off pte locking.
>

The original patch that we had for 2.4.21 did exactly that, we shied away from 
that due to concerns as to which processors allow you to update a running pte 
using a cmpxchg (== the set of processors for which set_pte() is a simple 
store.)  AFAIK, the only such processor is i386, but if Christoph is correct, 
then more recent Intel x86 processors don't even have that restriction.  I'll 
admit that I encouraged Christoph not to follow that path due to concerns of 
arch dependent code creeping into the do_anonymous_page() path.

Best Regards,

Ray

raybry@sgi.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32     and    512 cpu SMP
  2004-08-18 20:13     ` page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP Ray Bryant
@ 2004-08-18 20:48       ` William Lee Irwin III
  -1 siblings, 0 replies; 10+ messages in thread
From: William Lee Irwin III @ 2004-08-18 20:48 UTC (permalink / raw)
  To: Ray Bryant
  Cc: Hugh Dickins, Christoph Lameter, David S. Miller, ak, benh,
	manfred, linux-ia64, linux-kernel

Hugh Dickins wrote:
>> Just handling that one anonymous case is not worth it, when we know
>> that the next day someone else from SGI will post a similar test
>> which shows the same on file pages ;)

On Wed, Aug 18, 2004 at 03:13:39PM -0500, Ray Bryant wrote:
> Hugh -- this is called full employment for kernel scalability analysts.
> :-) :-)
> Actually, disks are so slow that I wouldn't expect that scalability problem 
> to show up in the page fault code, but rather in the block I/O or page 
> cache management portions of the code instead.

mapping->tree_lock is a near-certain culprit-to-be.


On Wed, Aug 18, 2004 at 03:13:39PM -0500, Ray Bryant wrote:
> I think that the major impact here is actually grabbing the lock when
> 30 or more processors are trying to obtain it -- the amount of time that 
> the lock is actually held is insignificant in comparison.

The spinlocking algorithms in general use are rather weak. Queued locks
with O(contenders) cacheline transfers instead of unbounded may be
useful in such arrangements so that occasional contention is not so
catastrophic, and it appears you certainly have the space for them.
In general, everyone's favored rearranging algorithms for less
contention instead of fiddling with locking algorithms, but I suspect
in the case of such large systems the number of nontrivial algorithms
to devise is daunting enough locking algorithms may mitigate the issues.


Hugh Dickins wrote:
>> The main lesson I took from your patch (I think wli was hinting at
>> the same) is that we ought now to question page_table_lock usage,
>> should be possible to cut it a lot.

On Wed, Aug 18, 2004 at 03:13:39PM -0500, Ray Bryant wrote:
> That would be a useful avenue to explore.  Unfortunately, we are on kind of 
> a tight fuse here trying to get the next kernel release ready.   At the 
> moment we are in the mode of moving fixes from 2.4.21 to 2.6, and this is 
> one such fix.   I'd be willing to pursue both in parallel so that in a 
> future release
> we have gotten to page_table_lock reduction as well.  Does that make sense 
> at all?
> (I just don't want to get bogged down in a 6-month effort here unless we 
> can't avoid it.)

Scalability issues with fault handling have trivial enough hardware
requirements that it's likely feasible for others to work on it also.
Also fortunate for you is that others also have issues here, and those
on much smaller systems.


Hugh Dickins wrote:
>> I do think this will be a more fruitful direction than pte locking:
>> just looking through the arches for spare bits puts me off pte locking.

On Wed, Aug 18, 2004 at 03:13:39PM -0500, Ray Bryant wrote:
> The original patch that we had for 2.4.21 did exactly that, we shied away 
> from that due to concerns as to which processors allow you to update a 
> running pte using a cmpxchg (= the set of processors for which set_pte() 
> is a simple store.)  AFAIK, the only such processor is i386, but if 
> Christoph is correct, then more recent Intel x86 processors don't even have 
> that restriction.  I'll admit that I encouraged Christoph not to follow 
> that path due to concerns of arch dependent code creeping into the 
> do_anonymous_page() path.

I'm in general more concerned about how one synchronizes with TLB miss
handlers, particularly for machines where this would involve the TLB
miss handler looping until a valid value is fetched in what is now a
very highly optimized performance-critical codepath.


-- wli

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32     and    512 cpu SMP
@ 2004-08-18 20:48       ` William Lee Irwin III
  0 siblings, 0 replies; 10+ messages in thread
From: William Lee Irwin III @ 2004-08-18 20:48 UTC (permalink / raw)
  To: Ray Bryant
  Cc: Hugh Dickins, Christoph Lameter, David S. Miller, ak, benh,
	manfred, linux-ia64, linux-kernel

Hugh Dickins wrote:
>> Just handling that one anonymous case is not worth it, when we know
>> that the next day someone else from SGI will post a similar test
>> which shows the same on file pages ;)

On Wed, Aug 18, 2004 at 03:13:39PM -0500, Ray Bryant wrote:
> Hugh -- this is called full employment for kernel scalability analysts.
> :-) :-)
> Actually, disks are so slow that I wouldn't expect that scalability problem 
> to show up in the page fault code, but rather in the block I/O or page 
> cache management portions of the code instead.

mapping->tree_lock is a near-certain culprit-to-be.


On Wed, Aug 18, 2004 at 03:13:39PM -0500, Ray Bryant wrote:
> I think that the major impact here is actually grabbing the lock when
> 30 or more processors are trying to obtain it -- the amount of time that 
> the lock is actually held is insignificant in comparison.

The spinlocking algorithms in general use are rather weak. Queued locks
with O(contenders) cacheline transfers instead of unbounded may be
useful in such arrangements so that occasional contention is not so
catastrophic, and it appears you certainly have the space for them.
In general, everyone's favored rearranging algorithms for less
contention instead of fiddling with locking algorithms, but I suspect
in the case of such large systems the number of nontrivial algorithms
to devise is daunting enough locking algorithms may mitigate the issues.


Hugh Dickins wrote:
>> The main lesson I took from your patch (I think wli was hinting at
>> the same) is that we ought now to question page_table_lock usage,
>> should be possible to cut it a lot.

On Wed, Aug 18, 2004 at 03:13:39PM -0500, Ray Bryant wrote:
> That would be a useful avenue to explore.  Unfortunately, we are on kind of 
> a tight fuse here trying to get the next kernel release ready.   At the 
> moment we are in the mode of moving fixes from 2.4.21 to 2.6, and this is 
> one such fix.   I'd be willing to pursue both in parallel so that in a 
> future release
> we have gotten to page_table_lock reduction as well.  Does that make sense 
> at all?
> (I just don't want to get bogged down in a 6-month effort here unless we 
> can't avoid it.)

Scalability issues with fault handling have trivial enough hardware
requirements that it's likely feasible for others to work on it also.
Also fortunate for you is that others also have issues here, and those
on much smaller systems.


Hugh Dickins wrote:
>> I do think this will be a more fruitful direction than pte locking:
>> just looking through the arches for spare bits puts me off pte locking.

On Wed, Aug 18, 2004 at 03:13:39PM -0500, Ray Bryant wrote:
> The original patch that we had for 2.4.21 did exactly that, we shied away 
> from that due to concerns as to which processors allow you to update a 
> running pte using a cmpxchg (== the set of processors for which set_pte() 
> is a simple store.)  AFAIK, the only such processor is i386, but if 
> Christoph is correct, then more recent Intel x86 processors don't even have 
> that restriction.  I'll admit that I encouraged Christoph not to follow 
> that path due to concerns of arch dependent code creeping into the 
> do_anonymous_page() path.

I'm in general more concerned about how one synchronizes with TLB miss
handlers, particularly for machines where this would involve the TLB
miss handler looping until a valid value is fetched in what is now a
very highly optimized performance-critical codepath.


-- wli

^ permalink raw reply	[flat|nested] 10+ messages in thread

[parent not found: <2uexw-1Nn-1@gated-at.bofh.it>]

[parent not found: <2uCTq-2wa-55@gated-at.bofh.it>]

* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32     and    512 cpu SMP
       [not found] ` <2uCTq-2wa-55@gated-at.bofh.it>
@ 2004-08-18 23:50   ` Rajesh Venkatasubramanian
  2004-08-19  0:01     ` William Lee Irwin III
  0 siblings, 1 reply; 10+ messages in thread
From: Rajesh Venkatasubramanian @ 2004-08-18 23:50 UTC (permalink / raw)
  To: Hugh Dickins, William Lee Irwin III, David S. Miller, raybry, ak,
	benh, manfred, linux-ia64, linux-kernel

William Lee Irwin III wrote:
> On Wed, Aug 18, 2004 at 06:55:07PM +0100, Hugh Dickins wrote:
>> That's interesting, and disappointing.
>> The main lesson I took from your patch (I think wli was hinting at
>> the same) is that we ought now to question page_table_lock usage,
>> should be possible to cut it a lot.
>> I recall from exchanges with Dave McCracken 18 months ago that the
>> page_table_lock is _almost_ unnecessary in rmap.c, should be possible
>> to get avoid it there and in some other places.
>> We take page_table_lock when making absent present and when making
>> present absent: I like your observation that those are exclusive cases.
>> But you've found that narrowing the width of the page_table_lock
>> in a particular path does not help.  You sound surprised, me too.
>> Did you find out why that was?
>
> It also protects against vma tree modifications in mainline, but rmap.c
> shouldn't need it for vmas anymore, as the vma is rooted to the spot by
> mapping->i_shared_lock for file pages and anon_vma->lock for anonymous.

If I am reading the code correctly, then without page_table_lock
in page_referenced_one(), we can race with exit_mmap() and page 
table pages can be freed under us.

William Lee Irwin III wrote:
> On Wed, Aug 18, 2004 at 06:55:07PM +0100, Hugh Dickins wrote:
>> I do think this will be a more fruitful direction than pte locking:
>> just looking through the arches for spare bits puts me off pte locking.
>
> Fortunately, spare bits aren't strictly necessary, and neither is
> cmpxchg. A single invalid value can serve in place of a bitflag. When
> using such an invalid value, just xchg()'ing it and looping when the
> invalid value is seen should suffice. This holds more generally for all
> radix trees, not just pagetables, and happily xchg() or emulation
> thereof is required by core code for all arches.

Good point. 

Another solution may be to use the unused bytes (->lru or
->private) in page table "struct page" as bit_spin_locks. We can 
use a single bit to protect a small set of ptes (8, 16, or 32).

Rajesh
  



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32     and    512 cpu SMP
  2004-08-18 23:50   ` Rajesh Venkatasubramanian
@ 2004-08-19  0:01     ` William Lee Irwin III
  2004-08-19  0:07       ` page fault fastpath patch v2: fix race conditions, stats for Rajesh Venkatasubramanian
  0 siblings, 1 reply; 10+ messages in thread
From: William Lee Irwin III @ 2004-08-19  0:01 UTC (permalink / raw)
  To: Rajesh Venkatasubramanian
  Cc: Hugh Dickins, David S. Miller, raybry, ak, benh, manfred,
	linux-ia64, linux-kernel

William Lee Irwin III wrote:
>> It also protects against vma tree modifications in mainline, but rmap.c
>> shouldn't need it for vmas anymore, as the vma is rooted to the spot by
>> mapping->i_shared_lock for file pages and anon_vma->lock for anonymous.

On Wed, Aug 18, 2004 at 07:50:21PM -0400, Rajesh Venkatasubramanian wrote:
> If I am reading the code correctly, then without page_table_lock
> in page_referenced_one(), we can race with exit_mmap() and page 
> table pages can be freed under us.

exit_mmap() has removed the vma from ->i_mmap and ->mmap prior to
unmapping the pages, so this should be safe unless that operation
can be caught while it's in progress.


William Lee Irwin III wrote:
>> Fortunately, spare bits aren't strictly necessary, and neither is
>> cmpxchg. A single invalid value can serve in place of a bitflag. When
>> using such an invalid value, just xchg()'ing it and looping when the
>> invalid value is seen should suffice. This holds more generally for all
>> radix trees, not just pagetables, and happily xchg() or emulation
>> thereof is required by core code for all arches.

On Wed, Aug 18, 2004 at 07:50:21PM -0400, Rajesh Venkatasubramanian wrote:
> Good point. 
> Another solution may be to use the unused bytes (->lru or
> ->private) in page table "struct page" as bit_spin_locks. We can 
> use a single bit to protect a small set of ptes (8, 16, or 32).

In general the bitwise operations are more expensive than ordinary
spinlocks, and a separately-allocated spinlock (not necessarily
kmalloc()'d, sitting in struct page also counts, that is, separate from
the pte) introduces another cacheline to be touched where with in-place
locking of the pte only the pte's cacheline is needed.


-- wli

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: page fault fastpath patch v2: fix race conditions, stats for
  2004-08-19  0:01     ` William Lee Irwin III
@ 2004-08-19  0:07       ` Rajesh Venkatasubramanian
  2004-08-19  0:20         ` page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP William Lee Irwin III
  0 siblings, 1 reply; 10+ messages in thread
From: Rajesh Venkatasubramanian @ 2004-08-19  0:07 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Hugh Dickins, David S. Miller, raybry, ak, benh, manfred,
	linux-ia64, LKML



On Wed, 18 Aug 2004, William Lee Irwin III wrote:

> William Lee Irwin III wrote:
> >> It also protects against vma tree modifications in mainline, but rmap.c
> >> shouldn't need it for vmas anymore, as the vma is rooted to the spot by
> >> mapping->i_shared_lock for file pages and anon_vma->lock for anonymous.
>
> On Wed, Aug 18, 2004 at 07:50:21PM -0400, Rajesh Venkatasubramanian wrote:
> > If I am reading the code correctly, then without page_table_lock
> > in page_referenced_one(), we can race with exit_mmap() and page
> > table pages can be freed under us.
>
> exit_mmap() has removed the vma from ->i_mmap and ->mmap prior to
> unmapping the pages, so this should be safe unless that operation
> can be caught while it's in progress.

No. Unfortunately exit_mmap() removes vmas from ->i_mmap after removing
page table pages. Maybe we can reverse this, though.

Rajesh


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32     and    512 cpu SMP
  2004-08-19  0:07       ` page fault fastpath patch v2: fix race conditions, stats for Rajesh Venkatasubramanian
@ 2004-08-19  0:20         ` William Lee Irwin III
  2004-08-19  3:19           ` page fault fastpath patch v2: fix race conditions, stats for Rajesh Venkatasubramanian
  2004-08-23 22:00           ` Christoph Lameter
  0 siblings, 2 replies; 10+ messages in thread
From: William Lee Irwin III @ 2004-08-19  0:20 UTC (permalink / raw)
  To: Rajesh Venkatasubramanian
  Cc: Hugh Dickins, David S. Miller, raybry, ak, benh, manfred,
	linux-ia64, LKML

On Wed, 18 Aug 2004, William Lee Irwin III wrote:
>> exit_mmap() has removed the vma from ->i_mmap and ->mmap prior to
>> unmapping the pages, so this should be safe unless that operation
>> can be caught while it's in progress.

On Wed, Aug 18, 2004 at 08:07:24PM -0400, Rajesh Venkatasubramanian wrote:
> No. Unfortunately exit_mmap() removes vmas from ->i_mmap after removing
> page table pages. Maybe we can reverse this, though.

Something like this?


Index: mm1-2.6.8.1/mm/mmap.c
=================================--- mm1-2.6.8.1.orig/mm/mmap.c	2004-08-16 23:47:16.000000000 -0700
+++ mm1-2.6.8.1/mm/mmap.c	2004-08-18 17:18:26.513559632 -0700
@@ -1810,6 +1810,14 @@
 	mm->map_count -= unmap_vmas(&tlb, mm, mm->mmap, 0,
 					~0UL, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
+	/*
+	 * Walk the list again, actually closing and freeing it.
+	 */
+	while (vma) {
+		struct vm_area_struct *next = vma->vm_next;
+		remove_vm_struct(vma);
+		vma = next;
+	}
 	BUG_ON(mm->map_count);	/* This is just debugging */
 	clear_page_tables(tlb, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);
 	tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm));
@@ -1822,16 +1830,6 @@
 	mm->locked_vm = 0;
 
 	spin_unlock(&mm->page_table_lock);
-
-	/*
-	 * Walk the list again, actually closing and freeing it
-	 * without holding any MM locks.
-	 */
-	while (vma) {
-		struct vm_area_struct *next = vma->vm_next;
-		remove_vm_struct(vma);
-		vma = next;
-	}
 }
 
 /* Insert vm structure into process list sorted by address

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: page fault fastpath patch v2: fix race conditions, stats for
  2004-08-19  0:20         ` page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP William Lee Irwin III
@ 2004-08-19  3:19           ` Rajesh Venkatasubramanian
  2004-08-23 22:00           ` Christoph Lameter
  1 sibling, 0 replies; 10+ messages in thread
From: Rajesh Venkatasubramanian @ 2004-08-19  3:19 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Hugh Dickins, David S. Miller, raybry, ak, benh, manfred,
	linux-ia64, LKML



On Wed, 18 Aug 2004, William Lee Irwin III wrote:

> On Wed, 18 Aug 2004, William Lee Irwin III wrote:
> >> exit_mmap() has removed the vma from ->i_mmap and ->mmap prior to
> >> unmapping the pages, so this should be safe unless that operation
> >> can be caught while it's in progress.
>
> On Wed, Aug 18, 2004 at 08:07:24PM -0400, Rajesh Venkatasubramanian wrote:
> > No. Unfortunately exit_mmap() removes vmas from ->i_mmap after removing
> > page table pages. Maybe we can reverse this, though.
>
> Something like this?

Yeah, something similar... A small nitpick: page_table_lock
nests within i_mmap_lock and anon_lock.

do_unmap() also frees page tables first and then removes vmas from
i_mmap (and/or) anon_vma list. Is there a reason to preserve
this ordering ?

Rajesh

>
> Index: mm1-2.6.8.1/mm/mmap.c
> =================================> --- mm1-2.6.8.1.orig/mm/mmap.c	2004-08-16 23:47:16.000000000 -0700
> +++ mm1-2.6.8.1/mm/mmap.c	2004-08-18 17:18:26.513559632 -0700
> @@ -1810,6 +1810,14 @@
>  	mm->map_count -= unmap_vmas(&tlb, mm, mm->mmap, 0,
>  					~0UL, &nr_accounted, NULL);
>  	vm_unacct_memory(nr_accounted);
> +	/*
> +	 * Walk the list again, actually closing and freeing it.
> +	 */
> +	while (vma) {
> +		struct vm_area_struct *next = vma->vm_next;
> +		remove_vm_struct(vma);
> +		vma = next;
> +	}
>  	BUG_ON(mm->map_count);	/* This is just debugging */
>  	clear_page_tables(tlb, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);
>  	tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm));
> @@ -1822,16 +1830,6 @@
>  	mm->locked_vm = 0;
>
>  	spin_unlock(&mm->page_table_lock);
> -
> -	/*
> -	 * Walk the list again, actually closing and freeing it
> -	 * without holding any MM locks.
> -	 */
> -	while (vma) {
> -		struct vm_area_struct *next = vma->vm_next;
> -		remove_vm_struct(vma);
> -		vma = next;
> -	}
>  }
>
>  /* Insert vm structure into process list sorted by address
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: page fault fastpath patch v2: fix race conditions, stats for
  2004-08-19  0:20         ` page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP William Lee Irwin III
  2004-08-19  3:19           ` page fault fastpath patch v2: fix race conditions, stats for Rajesh Venkatasubramanian
@ 2004-08-23 22:00           ` Christoph Lameter
  2004-08-23 23:25             ` Rajesh Venkatasubramanian
  1 sibling, 1 reply; 10+ messages in thread
From: Christoph Lameter @ 2004-08-23 22:00 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Rajesh Venkatasubramanian, Hugh Dickins, David S. Miller, raybry,
	ak, benh, manfred, linux-ia64, LKML

Would it not be better if exit_mmap would hold a writelock on
mm->mmap_sem and the page_table_lock? This would insure that page faults
would not change pte's. exit_mmap changes the memory map so it seems that
the wrong lock is used here.

On Wed, 18 Aug 2004, William Lee Irwin III wrote:

> On Wed, 18 Aug 2004, William Lee Irwin III wrote:
> >> exit_mmap() has removed the vma from ->i_mmap and ->mmap prior to
> >> unmapping the pages, so this should be safe unless that operation
> >> can be caught while it's in progress.
>
> On Wed, Aug 18, 2004 at 08:07:24PM -0400, Rajesh Venkatasubramanian wrote:
> > No. Unfortunately exit_mmap() removes vmas from ->i_mmap after removing
> > page table pages. Maybe we can reverse this, though.
>
> Something like this?
>
>
> Index: mm1-2.6.8.1/mm/mmap.c
> =================================> --- mm1-2.6.8.1.orig/mm/mmap.c	2004-08-16 23:47:16.000000000 -0700
> +++ mm1-2.6.8.1/mm/mmap.c	2004-08-18 17:18:26.513559632 -0700
> @@ -1810,6 +1810,14 @@
>  	mm->map_count -= unmap_vmas(&tlb, mm, mm->mmap, 0,
>  					~0UL, &nr_accounted, NULL);
>  	vm_unacct_memory(nr_accounted);
> +	/*
> +	 * Walk the list again, actually closing and freeing it.
> +	 */
> +	while (vma) {
> +		struct vm_area_struct *next = vma->vm_next;
> +		remove_vm_struct(vma);
> +		vma = next;
> +	}
>  	BUG_ON(mm->map_count);	/* This is just debugging */
>  	clear_page_tables(tlb, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);
>  	tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm));
> @@ -1822,16 +1830,6 @@
>  	mm->locked_vm = 0;
>
>  	spin_unlock(&mm->page_table_lock);
> -
> -	/*
> -	 * Walk the list again, actually closing and freeing it
> -	 * without holding any MM locks.
> -	 */
> -	while (vma) {
> -		struct vm_area_struct *next = vma->vm_next;
> -		remove_vm_struct(vma);
> -		vma = next;
> -	}
>  }
>
>  /* Insert vm structure into process list sorted by address
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: page fault fastpath patch v2: fix race conditions, stats for
  2004-08-23 22:00           ` Christoph Lameter
@ 2004-08-23 23:25             ` Rajesh Venkatasubramanian
  2004-08-23 23:35               ` Christoph Lameter
  0 siblings, 1 reply; 10+ messages in thread
From: Rajesh Venkatasubramanian @ 2004-08-23 23:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: William Lee Irwin III, Hugh Dickins, David S. Miller, raybry, ak,
	benh, manfred, linux-ia64, LKML



On Mon, 23 Aug 2004, Christoph Lameter wrote:

> Would it not be better if exit_mmap would hold a writelock on
> mm->mmap_sem and the page_table_lock? This would insure that page faults
> would not change pte's. exit_mmap changes the memory map so it seems that
> the wrong lock is used here.

exit_mmap() is only called from kernel/fork.c/mmput(). We call exit_mmap()
only if (atomic_dec_and_lock(&mm->mm_users, &mmlist_lock)).

So there are no other active thread (mm_user) other than the current
exit_mmap() thread. This gives thread exclusion. So we don't need
mm->mmap_sem.

However, we have to lock out truncate()->zap_pmd_range(), rmap.c
functions, and other places from walking the page tables while we
are freeing the page tables in exit_mmap(). The page_table_lock in
exit_mmap() provides that exclusion.

That's my understanding. Correct me if I am wrong.

Thanks,
Rajesh

> On Wed, 18 Aug 2004, William Lee Irwin III wrote:
>
> > On Wed, 18 Aug 2004, William Lee Irwin III wrote:
> > >> exit_mmap() has removed the vma from ->i_mmap and ->mmap prior to
> > >> unmapping the pages, so this should be safe unless that operation
> > >> can be caught while it's in progress.
> >
> > On Wed, Aug 18, 2004 at 08:07:24PM -0400, Rajesh Venkatasubramanian wrote:
> > > No. Unfortunately exit_mmap() removes vmas from ->i_mmap after removing
> > > page table pages. Maybe we can reverse this, though.
> >
> > Something like this?
> >
> >
> > Index: mm1-2.6.8.1/mm/mmap.c
> > =================================> > --- mm1-2.6.8.1.orig/mm/mmap.c	2004-08-16 23:47:16.000000000 -0700
> > +++ mm1-2.6.8.1/mm/mmap.c	2004-08-18 17:18:26.513559632 -0700
> > @@ -1810,6 +1810,14 @@
> >  	mm->map_count -= unmap_vmas(&tlb, mm, mm->mmap, 0,
> >  					~0UL, &nr_accounted, NULL);
> >  	vm_unacct_memory(nr_accounted);
> > +	/*
> > +	 * Walk the list again, actually closing and freeing it.
> > +	 */
> > +	while (vma) {
> > +		struct vm_area_struct *next = vma->vm_next;
> > +		remove_vm_struct(vma);
> > +		vma = next;
> > +	}
> >  	BUG_ON(mm->map_count);	/* This is just debugging */
> >  	clear_page_tables(tlb, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);
> >  	tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm));
> > @@ -1822,16 +1830,6 @@
> >  	mm->locked_vm = 0;
> >
> >  	spin_unlock(&mm->page_table_lock);
> > -
> > -	/*
> > -	 * Walk the list again, actually closing and freeing it
> > -	 * without holding any MM locks.
> > -	 */
> > -	while (vma) {
> > -		struct vm_area_struct *next = vma->vm_next;
> > -		remove_vm_struct(vma);
> > -		vma = next;
> > -	}
> >  }
> >
> >  /* Insert vm structure into process list sorted by address
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> >
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: page fault fastpath patch v2: fix race conditions, stats for
  2004-08-23 23:25             ` Rajesh Venkatasubramanian
@ 2004-08-23 23:35               ` Christoph Lameter
  0 siblings, 0 replies; 10+ messages in thread
From: Christoph Lameter @ 2004-08-23 23:35 UTC (permalink / raw)
  To: Rajesh Venkatasubramanian
  Cc: William Lee Irwin III, Hugh Dickins, David S. Miller, raybry, ak,
	benh, manfred, linux-ia64, LKML

On Mon, 23 Aug 2004, Rajesh Venkatasubramanian wrote:

> So there are no other active thread (mm_user) other than the current
> exit_mmap() thread. This gives thread exclusion. So we don't need
> mm->mmap_sem.
>
> However, we have to lock out truncate()->zap_pmd_range(), rmap.c
> functions, and other places from walking the page tables while we
> are freeing the page tables in exit_mmap(). The page_table_lock in
> exit_mmap() provides that exclusion.
>
> That's my understanding. Correct me if I am wrong.

Correct. I just looked through the function and it unlinks the pgd before
unlocking page_table_lock. It frees the pgd tree content later.

Since no mm_user is active anymore no pte ops occur and therefore also
atomic pte operations do not need protection.



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: page fault fastpath patch v2: fix race conditions, stats for 8,32
@ 2004-08-18 17:55 Hugh Dickins
  2004-08-19  1:19 ` page fault fastpath patch v2: fix race conditions, stats for Christoph Lameter
  0 siblings, 1 reply; 10+ messages in thread
From: Hugh Dickins @ 2004-08-18 17:55 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: William Lee Irwin III, David S. Miller, raybry, ak, benh, manfred,
	linux-ia64, linux-kernel

On Tue, 17 Aug 2004, Christoph Lameter wrote:

> This is the second release of the page fault fastpath path. The fast path
> avoids locking during the creation of page table entries for anonymous
> memory in a threaded application running on a SMP system. The performance
> increases significantly for more than 4 threads running concurrently.

It is interesting.  I don't like it at all in its current state,
#ifdef'ed special casing for one particular path through the code,
but it does seem worth taking further.

Just handling that one anonymous case is not worth it, when we know
that the next day someone else from SGI will post a similar test
which shows the same on file pages ;)

Your ptep lock bit avoids collision with pte bits, but does it not
also need to avoid collision with pte swap entry bits?  And the
pte_file bit too, at least once it's extended to nopage areas.

I'm very suspicious of the way you just return VM_FAULT_MINOR when
you find the lock bit already set.  Yes, you can do that, but the
lock bit is held right across the alloc_page_vma, so other threads
trying to fault the same pte will be spinning back out to user and
refaulting back into kernel while they wait: we'd usually use a
waitqueue and wakeup with that kind of lock; or not hold it across,
and make it a bitspin lock.

It's a realistic case, which I guess your test program won't be trying.
Feels livelocky to me, but I may be overreacting against: it's not as
if you're changing the page_table_lock to be treated that way.

> Introducing the page_table_lock even for a short time makes performance
> drop to the level before the patch.

That's interesting, and disappointing.

The main lesson I took from your patch (I think wli was hinting at
the same) is that we ought now to question page_table_lock usage,
should be possible to cut it a lot.

I recall from exchanges with Dave McCracken 18 months ago that the
page_table_lock is _almost_ unnecessary in rmap.c, should be possible
to get avoid it there and in some other places.

We take page_table_lock when making absent present and when making
present absent: I like your observation that those are exclusive cases.

But you've found that narrowing the width of the page_table_lock
in a particular path does not help.  You sound surprised, me too.
Did you find out why that was?

> - One could avoid pte locking by introducing a pte_cmpxchg. cmpxchg
> seems to be supported by all ia64 and i386 cpus except the original 80386.

I do think this will be a more fruitful direction than pte locking:
just looking through the arches for spare bits puts me off pte locking.

Hugh

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: page fault fastpath patch v2: fix race conditions, stats for
  2004-08-18 17:55 page fault fastpath patch v2: fix race conditions, stats for 8,32 Hugh Dickins
@ 2004-08-19  1:19 ` Christoph Lameter
  0 siblings, 0 replies; 10+ messages in thread
From: Christoph Lameter @ 2004-08-19  1:19 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: William Lee Irwin III, David S. Miller, raybry, ak, benh, manfred,
	linux-ia64, linux-kernel

> > - One could avoid pte locking by introducing a pte_cmpxchg. cmpxchg
> > seems to be supported by all ia64 and i386 cpus except the original 80386.
>
> I do think this will be a more fruitful direction than pte locking:
> just looking through the arches for spare bits puts me off pte locking.

Thanks for the support. Got a V3 here (not ready to post yet) that throws
out the locks and uses cmpxchg instead. It also removes the use of
page_table_lock completely from handle_mm_fault.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2004-08-23 23:35 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <fa.ofiojek.hkeujs@ifi.uio.no>
     [not found] ` <fa.o1kt2ua.1bm6n0c@ifi.uio.no>
2004-08-18 20:13   ` page fault fastpath patch v2: fix race conditions, stats for Ray Bryant
2004-08-18 20:13     ` page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP Ray Bryant
2004-08-18 20:48     ` William Lee Irwin III
2004-08-18 20:48       ` William Lee Irwin III
     [not found] <2uexw-1Nn-1@gated-at.bofh.it>
     [not found] ` <2uCTq-2wa-55@gated-at.bofh.it>
2004-08-18 23:50   ` Rajesh Venkatasubramanian
2004-08-19  0:01     ` William Lee Irwin III
2004-08-19  0:07       ` page fault fastpath patch v2: fix race conditions, stats for Rajesh Venkatasubramanian
2004-08-19  0:20         ` page fault fastpath patch v2: fix race conditions, stats for 8,32 and 512 cpu SMP William Lee Irwin III
2004-08-19  3:19           ` page fault fastpath patch v2: fix race conditions, stats for Rajesh Venkatasubramanian
2004-08-23 22:00           ` Christoph Lameter
2004-08-23 23:25             ` Rajesh Venkatasubramanian
2004-08-23 23:35               ` Christoph Lameter
2004-08-18 17:55 page fault fastpath patch v2: fix race conditions, stats for 8,32 Hugh Dickins
2004-08-19  1:19 ` page fault fastpath patch v2: fix race conditions, stats for Christoph Lameter

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.