changing mm->mmap_sem (was: Re: system call for process information?)

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* changing mm->mmap_sem  (was: Re: system call for process information?)
  2001-03-14 19:19 system call for process information? Rik van Riel
@ 2001-03-15 12:24 ` Rik van Riel
  2001-03-16  9:49   ` Stephen C. Tweedie
  0 siblings, 1 reply; 16+ messages in thread
From: Rik van Riel @ 2001-03-15 12:24 UTC (permalink / raw)
  To: george anzinger; +Cc: Alexander Viro, linux-mm, bcrl, linux-kernel

On Wed, 14 Mar 2001, Rik van Riel wrote:
> On Wed, 14 Mar 2001, george anzinger wrote:
> 
> > Is it REALLY necessary to prevent them from seeing an
> > inconsistent state?  Seems to me that in the total picture (i.e.
> > system wide) they will never see a consistent state, so why be
> > concerned with a small corner of the system.
> 
> You're right.

Mmmm, I've looked at the code today and it turned out that
we're NOT right ;)

The mmap_sem is used in procfs to prevent the list of VMAs
from changing. In the page fault code it seems to be used
to prevent other page faults to happen at the same time with
the current page fault (and to prevent VMAs from changing
while a page fault is underway).

Maybe we should change the mmap_sem into a R/W semaphore ?

Since page faults seem to be the "common cause" of blocking
procfs access *and* since both page faults and procfs only
need to prevent the VMA list from changing, a read lock would
help here.

Write locks would be used in the code where we actually want
to change the VMA list and page faults would use an extra lock
to protect against each other (possibly a per-pagetable lock so
multithreaded apps can pagefault in different memory regions at
the same time ???).

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: changing mm->mmap_sem  (was: Re: system call for process information?)
  2001-03-15 12:24 ` changing mm->mmap_sem (was: Re: system call for process information?) Rik van Riel
@ 2001-03-16  9:49   ` Stephen C. Tweedie
  2001-03-16 11:50     ` Rik van Riel
  0 siblings, 1 reply; 16+ messages in thread
From: Stephen C. Tweedie @ 2001-03-16  9:49 UTC (permalink / raw)
  To: Rik van Riel
  Cc: george anzinger, Alexander Viro, linux-mm, bcrl, linux-kernel

Hi,

On Thu, Mar 15, 2001 at 09:24:59AM -0300, Rik van Riel wrote:
> On Wed, 14 Mar 2001, Rik van Riel wrote:

> The mmap_sem is used in procfs to prevent the list of VMAs
> from changing. In the page fault code it seems to be used
> to prevent other page faults to happen at the same time with
> the current page fault (and to prevent VMAs from changing
> while a page fault is underway).

The page table spinlock should be quite sufficient to let us avoid
races in the page fault code.  We've had to deal with this before
there was ever a mmap_sem anyway: in ancient times, every page fault
had to do things like check to see if the pte had changed after IO was
complete and once the BKL had been retaken.  We can do the same with
the page fault spinlock without much pain.

> Maybe we should change the mmap_sem into a R/W semaphore ?

Definitely.

> Write locks would be used in the code where we actually want
> to change the VMA list and page faults would use an extra lock
> to protect against each other (possibly a per-pagetable lock

Why do we need another lock?  The critical section where we do the
final update on the pte _already_ takes the page table spinlock to
avoid races against the swapper.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: changing mm->mmap_sem  (was: Re: system call for process information?)
  2001-03-16  9:49   ` Stephen C. Tweedie
@ 2001-03-16 11:50     ` Rik van Riel
  2001-03-16 12:53       ` Stephen C. Tweedie
  0 siblings, 1 reply; 16+ messages in thread
From: Rik van Riel @ 2001-03-16 11:50 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: george anzinger, Alexander Viro, linux-mm, bcrl, linux-kernel

On Fri, 16 Mar 2001, Stephen C. Tweedie wrote:
> On Thu, Mar 15, 2001 at 09:24:59AM -0300, Rik van Riel wrote:
> > On Wed, 14 Mar 2001, Rik van Riel wrote:
> 
> > The mmap_sem is used in procfs to prevent the list of VMAs
> > from changing. In the page fault code it seems to be used
> > to prevent other page faults to happen at the same time with
> > the current page fault (and to prevent VMAs from changing
> > while a page fault is underway).
> 
> The page table spinlock should be quite sufficient to let us avoid
> races in the page fault code.

> > Write locks would be used in the code where we actually want
> > to change the VMA list and page faults would use an extra lock
> > to protect against each other (possibly a per-pagetable lock
> 
> Why do we need another lock?  The critical section where we do the
> final update on the pte _already_ takes the page table spinlock to
> avoid races against the swapper.

The problem is that mmap_sem seems to be protecting the list
of VMAs, so taking _only_ the page_table_lock could let a VMA
change under us while a page fault is underway ...

Then again, I guess just making mmap_sem a R/W lock should fix
our problems ... and maybe even make it possible (in 2.5?) to
let multithreaded programs have pagefaults at the same time,
instead of having all threads queue up behind mmap_sem.

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: changing mm->mmap_sem  (was: Re: system call for process information?)
  2001-03-16 11:50     ` Rik van Riel
@ 2001-03-16 12:53       ` Stephen C. Tweedie
  2001-03-18  7:23         ` Rik van Riel
  0 siblings, 1 reply; 16+ messages in thread
From: Stephen C. Tweedie @ 2001-03-16 12:53 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Stephen C. Tweedie, george anzinger, Alexander Viro, linux-mm,
	bcrl, linux-kernel

Hi,

On Fri, Mar 16, 2001 at 08:50:25AM -0300, Rik van Riel wrote:
> On Fri, 16 Mar 2001, Stephen C. Tweedie wrote:
> 
> > > Write locks would be used in the code where we actually want
> > > to change the VMA list and page faults would use an extra lock
> > > to protect against each other (possibly a per-pagetable lock
> > 
> > Why do we need another lock?  The critical section where we do the
> > final update on the pte _already_ takes the page table spinlock to
> > avoid races against the swapper.
> 
> The problem is that mmap_sem seems to be protecting the list
> of VMAs, so taking _only_ the page_table_lock could let a VMA
> change under us while a page fault is underway ...

Right, I'm not suggesting removing that: making the mmap_sem
read/write is fine, but yes, we still need that semaphore.  But as for
the "page faults would use an extra lock to protect against each
other" bit --- we already have another lock, the page table lock,
which can be used in this way, so ANOTHER lock should be unnecessary.

--Stephen

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: changing mm->mmap_sem  (was: Re: system call for process information?)
  2001-03-16 12:53       ` Stephen C. Tweedie
@ 2001-03-18  7:23         ` Rik van Riel
  2001-03-18  9:56           ` Mike Galbraith
  0 siblings, 1 reply; 16+ messages in thread
From: Rik van Riel @ 2001-03-18  7:23 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: george anzinger, Alexander Viro, linux-mm, bcrl, linux-kernel

On Fri, 16 Mar 2001, Stephen C. Tweedie wrote:

> Right, I'm not suggesting removing that: making the mmap_sem
> read/write is fine, but yes, we still need that semaphore.

Initial patch (against 2.4.2-ac20) is available at
http://www.surriel.com/patches/

> But as for the "page faults would use an extra lock to protect against
> each other" bit --- we already have another lock, the page table lock,
> which can be used in this way, so ANOTHER lock should be unnecessary.

Tomorrow I'll take a look at the various ->nopage
functions and do_swap_page to see if these functions
would be able to take simultaneous faults at the same
address (from multiple threads).  If not, either we'll
need to modify these functions, or we could add a (few?)
extra lock to prevent these functions from faulting at
the same address at the same time in multiple threads.

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: changing mm->mmap_sem  (was: Re: system call for process information?)
@ 2001-03-18  9:34 Manfred Spraul
  2001-03-18 10:56 ` Rik van Riel
  2001-03-19 12:54 ` Stephen C. Tweedie
  0 siblings, 2 replies; 16+ messages in thread
From: Manfred Spraul @ 2001-03-18  9:34 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, Stephen C. Tweedie

>
> The problem is that mmap_sem seems to be protecting the list
> of VMAs, so taking _only_ the page_table_lock could let a VMA
> change under us while a page fault is underway ...

No, that can't happen.
VMA changes only happen if both the mmap_sem and the page table lock is
acquired. (check insert_vm() at the end of mm/mmap.c)
The page fault path uses the map_sem, kswaps uses page_table_lock.

<< from your patch:
--- linux-2.4.2-ac20-vm/mm/vmscan.c.orig	Sat Mar 17 11:30:49 2001
+++ linux-2.4.2-ac20-vm/mm/vmscan.c	Sat Mar 17 20:53:10 2001
@@ -231,6 +231,7 @@
 	 * Find the proper vm-area after freezing the vma chain
 	 * and ptes.
 	 */
+	down_read(&mm->mmap_sem);
                spin_lock(&mm->page_table_lock);
 >>>>

Why do you acquire the mmap semaphore in swapout_mm()? The old rule was
that kswapd should never sleep on the mmap semaphore. Isn't there a
deadlock if mmap sem is already acquired? I don't remember the details.

>
> The problem is that mmap_sem seems to be protecting the list
> of VMAs, so taking _only_ the page_table_lock could let a VMA
> change under us while a page fault is underway ...

I remember that the pmd_alloc() and pte_alloc() functions need
additional locking.

--
    Manfred

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: changing mm->mmap_sem  (was: Re: system call for process information?)
  2001-03-18  7:23         ` Rik van Riel
@ 2001-03-18  9:56           ` Mike Galbraith
  2001-03-18 10:46             ` Rik van Riel
  0 siblings, 1 reply; 16+ messages in thread
From: Mike Galbraith @ 2001-03-18  9:56 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel

On Sun, 18 Mar 2001, Rik van Riel wrote:

> On Fri, 16 Mar 2001, Stephen C. Tweedie wrote:
>
> > Right, I'm not suggesting removing that: making the mmap_sem
> > read/write is fine, but yes, we still need that semaphore.
>
> Initial patch (against 2.4.2-ac20) is available at
> http://www.surriel.com/patches/
>
> > But as for the "page faults would use an extra lock to protect against
> > each other" bit --- we already have another lock, the page table lock,
> > which can be used in this way, so ANOTHER lock should be unnecessary.
>
> Tomorrow I'll take a look at the various ->nopage
> functions and do_swap_page to see if these functions
> would be able to take simultaneous faults at the same
> address (from multiple threads).  If not, either we'll
> need to modify these functions, or we could add a (few?)
> extra lock to prevent these functions from faulting at
> the same address at the same time in multiple threads.

Hi Rik,

I gave this patch a try, and the initial results are extremely encouraging.
Not only do I have vmstat (SCHED_RR) info in realtime with zero delays :))
I also have a _nice_ throughput improvement.  There are some worrisome
warnings below along with the compile changes I made here, but for an
initial patch, things look pretty darn wonderful.

	Cheers,

	-Mike

--- ./include/linux/sched.h.org	Sun Mar 18 10:20:42 2001
+++ ./include/linux/sched.h	Sun Mar 18 10:27:48 2001
@@ -238,7 +238,7 @@
 	mm_users:	ATOMIC_INIT(2), 		\
 	mm_count:	ATOMIC_INIT(1), 		\
 	map_count:	1, 				\
-	mmap_sem:	__MUTEX_INITIALIZER(name.mmap_sem), \
+	mmap_sem:	__RWSEM_INITIALIZER(name.mmap_sem, RW_LOCK_BIAS), \
 	page_table_lock: SPIN_LOCK_UNLOCKED, 		\
 	mmlist:		LIST_HEAD_INIT(name.mmlist),	\
 }
--- ./include/linux/mm.h.org	Sun Mar 18 09:56:55 2001
+++ ./include/linux/mm.h	Sun Mar 18 10:27:59 2001
@@ -533,13 +533,13 @@
 	if (vma->vm_end - address > current->rlim[RLIMIT_STACK].rlim_cur ||
 	    ((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) > current->rlim[RLIMIT_AS].rlim_cur)
 		return -ENOMEM;
-	spin_lock(&mm->page_table_lock);
+	spin_lock(&vma->vm_mm->page_table_lock);
 	vma->vm_start = address;
 	vma->vm_pgoff -= grow;
 	vma->vm_mm->total_vm += grow;
 	if (vma->vm_flags & VM_LOCKED)
 		vma->vm_mm->locked_vm += grow;
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock(&vma->vm_mm->page_table_lock);
 	return 0;
 }

...
VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 196k freed
Adding Swap: 265064k swap-space (priority 2)
VM: Bad swap entry 00011e00
VM: Bad swap entry 00058d00
Unused swap offset entry in swap_dup 00058d00
Unused swap offset entry in swap_dup 00011e00
VM: Bad swap entry 00011e00
VM: Bad swap entry 00058d00
Unused swap offset entry in swap_dup 00058d00
VM: Bad swap entry 00058d00
Unused swap offset entry in swap_dup 00011e00
Unused swap offset entry in swap_dup 00058d00
VM: Bad swap entry 00011e00
VM: Bad swap entry 00058d00
Unused swap offset entry in swap_dup 00011e00
Unused swap offset entry in swap_dup 00058d00
VM: Bad swap entry 00011e00
VM: Bad swap entry 00058d00
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 006ef700
Unused swap offset entry in swap_dup 006ef700
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_count 00011e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 008f4e00
Unused swap offset entry in swap_dup 006ef700
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 008f4e00
Unused swap offset entry in swap_dup 006ef700
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
Unused swap offset entry in swap_dup 00011e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 00011e00
Unused swap offset entry in swap_count 00011e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 006ef700
Unused swap offset entry in swap_dup 008f4e00
VM: Bad swap entry 006ef700
VM: Bad swap entry 008f4e00
Unused swap offset entry in swap_dup 008f4e00
Unused swap offset entry in swap_dup 006ef700

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: changing mm->mmap_sem  (was: Re: system call for process information?)
  2001-03-18  9:56           ` Mike Galbraith
@ 2001-03-18 10:46             ` Rik van Riel
  2001-03-18 12:33               ` Mike Galbraith
  0 siblings, 1 reply; 16+ messages in thread
From: Rik van Riel @ 2001-03-18 10:46 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-mm, linux-kernel

On Sun, 18 Mar 2001, Mike Galbraith wrote:

> I gave this patch a try, and the initial results are extremely encouraging.
> Not only do I have vmstat (SCHED_RR) info in realtime with zero delays :))
> I also have a _nice_ throughput improvement.  There are some worrisome
> warnings below along with the compile changes I made here, but for an
> initial patch, things look pretty darn wonderful.

	[snip compile fixes .. integrated]

> VFS: Mounted root (ext2 filesystem) readonly.
> Freeing unused kernel memory: 196k freed
> Adding Swap: 265064k swap-space (priority 2)
> VM: Bad swap entry 00011e00
> VM: Bad swap entry 00058d00
> Unused swap offset entry in swap_dup 00058d00
> Unused swap offset entry in swap_dup 00011e00
> VM: Bad swap entry 00011e00
> VM: Bad swap entry 00058d00

Heh, I guess do_swap_page isn't too happy when multiple threads
of the same program take a page fault at the same address at the
same time.

I take it you were testing something like mysql, jvm or apache2 ?

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: changing mm->mmap_sem  (was: Re: system call for process information?)
  2001-03-18  9:34 changing mm->mmap_sem (was: Re: system call for process information?) Manfred Spraul
@ 2001-03-18 10:56 ` Rik van Riel
  2001-03-19 12:54 ` Stephen C. Tweedie
  1 sibling, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2001-03-18 10:56 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: linux-kernel, Stephen C. Tweedie

On Sun, 18 Mar 2001, Manfred Spraul wrote:

> > The problem is that mmap_sem seems to be protecting the list
> > of VMAs, so taking _only_ the page_table_lock could let a VMA
> > change under us while a page fault is underway ...
> 
> No, that can't happen.
> VMA changes only happen if both the mmap_sem and the page table lock is
> acquired. (check insert_vm() at the end of mm/mmap.c)
> The page fault path uses the map_sem, kswaps uses page_table_lock.

You're right here, I missed this "little detail"...

> << from your patch:
> --- linux-2.4.2-ac20-vm/mm/vmscan.c.orig	Sat Mar 17 11:30:49 2001
> +++ linux-2.4.2-ac20-vm/mm/vmscan.c	Sat Mar 17 20:53:10 2001
> @@ -231,6 +231,7 @@
>  	 * Find the proper vm-area after freezing the vma chain
>  	 * and ptes.
>  	 */
> +	down_read(&mm->mmap_sem);
>                 spin_lock(&mm->page_table_lock);
>  >>>>
> 
> Why do you acquire the mmap semaphore in swapout_mm()? The old rule was
> that kswapd should never sleep on the mmap semaphore. Isn't there a
> deadlock if mmap sem is already acquired? I don't remember the details.

You're right, kswapd shouldn't do this.  I have this removed from
my code right now...

> > The problem is that mmap_sem seems to be protecting the list
> > of VMAs, so taking _only_ the page_table_lock could let a VMA
> > change under us while a page fault is underway ...
> 
> I remember that the pmd_alloc() and pte_alloc() functions need
> additional locking.

Isn't this what the page_table_lock is for ?
(too bad they're not using it...)

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: changing mm->mmap_sem  (was: Re: system call for process information?)
  2001-03-18 10:46             ` Rik van Riel
@ 2001-03-18 12:33               ` Mike Galbraith
  0 siblings, 0 replies; 16+ messages in thread
From: Mike Galbraith @ 2001-03-18 12:33 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel

On Sun, 18 Mar 2001, Rik van Riel wrote:

> > VFS: Mounted root (ext2 filesystem) readonly.
> > Freeing unused kernel memory: 196k freed
> > Adding Swap: 265064k swap-space (priority 2)
> > VM: Bad swap entry 00011e00
> > VM: Bad swap entry 00058d00
> > Unused swap offset entry in swap_dup 00058d00
> > Unused swap offset entry in swap_dup 00011e00
> > VM: Bad swap entry 00011e00
> > VM: Bad swap entry 00058d00
>
> Heh, I guess do_swap_page isn't too happy when multiple threads
> of the same program take a page fault at the same address at the
> same time.
>
> I take it you were testing something like mysql, jvm or apache2 ?

No, this was make -j30 bzImage.  (nscd was running though...)

	-Mike


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: changing mm->mmap_sem  (was: Re: system call for process information?)
       [not found] <Pine.LNX.4.33.0103181407520.1426-100000@mikeg.weiden.de>
@ 2001-03-18 14:43 ` Rik van Riel
  2001-03-18 18:13   ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Rik van Riel @ 2001-03-18 14:43 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: sct, linux-kernel

On Sun, 18 Mar 2001, Mike Galbraith wrote:

> > No, this was make -j30 bzImage.  (nscd was running though...)
> 
> I rebooted, shut down nscd prior to testing and did 5 builds in a row
> without a single gripe.  Started nscd for sixth run and instantly the
> kernel griped.  Yup.. threaded apps pushing swap.

OK, I'll write some code to prevent multiple threads from
stepping all over each other when they pagefault at the
same address.

What would be the preferred method of fixing this ?

- fixing do_swap_page and all ->nopage functions
- hacking handle_mm_fault to make sure no overlapping
  pagefaults will be served at the same time

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: changing mm->mmap_sem  (was: Re: system call for process information?)
  2001-03-18 14:43 ` Rik van Riel
@ 2001-03-18 18:13   ` Linus Torvalds
  0 siblings, 0 replies; 16+ messages in thread
From: Linus Torvalds @ 2001-03-18 18:13 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.LNX.4.21.0103181122480.13050-100000@imladris.rielhome.conectiva>,
Rik van Riel  <riel@conectiva.com.br> wrote:
>
>OK, I'll write some code to prevent multiple threads from
>stepping all over each other when they pagefault at the
>same address.
>
>What would be the preferred method of fixing this ?
>
>- fixing do_swap_page and all ->nopage functions

There is no need to fix gthe "nopage" functions. They never see the page
table directly anyway. 

So the only thing that _should_ be needed is to make sure that
do_no_page(), do_swap_page() and do_anonymous_page() will re-aquire the
mm->page_table_lock and undo their work if it turns out that the page
table entry is no longer empty.. 

(do_wp_page() should already be ok in this regard - it already does this
exactly because present pagetable entries can already race with kswapd. 
What we're adding is that _nonpresent_ page table entries can race with
multiple invocations of concurrent page faults)

>- hacking handle_mm_fault to make sure no overlapping
>  pagefaults will be served at the same time

No. The whole reason the rw_semaphores were done in the first place was
to allow page faults to happen concurrently to allow threaded
applictions to scale up even when faulting.

		Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: changing mm->mmap_sem  (was: Re: system call for process information?)
       [not found] <200103181813.KAA22153@penguin.transmeta.com>
@ 2001-03-18 20:59 ` Rik van Riel
  2001-03-19  1:21   ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Rik van Riel @ 2001-03-18 20:59 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Sun, 18 Mar 2001, Linus Torvalds wrote:
> In article <Pine.LNX.4.21.0103181122480.13050-100000@imladris.rielhome.conectiva>,
> Rik van Riel  <riel@conectiva.com.br> wrote:
> >
> >OK, I'll write some code to prevent multiple threads from
> >stepping all over each other when they pagefault at the
> >same address.
> >
> >What would be the preferred method of fixing this ?
> >
> >- fixing do_swap_page and all ->nopage functions
> 
> There is no need to fix gthe "nopage" functions. They never see the
> page table directly anyway.
> 
> So the only thing that _should_ be needed is to make sure that
> do_no_page(), do_swap_page() and do_anonymous_page() will re-aquire
> the mm->page_table_lock and undo their work if it turns out that the
> page table entry is no longer empty..

... in which case concurrency is maximised, but there is a
possibility of doing double work...

> >- hacking handle_mm_fault to make sure no overlapping
> >  pagefaults will be served at the same time
> 
> No. The whole reason the rw_semaphores were done in the first place
> was to allow page faults to happen concurrently to allow threaded
> applictions to scale up even when faulting.

Indeed, having threaded apps do multiple page faults at the
same time is the main goal of this patch. However, I don't
see how it would be good for scalability to have multiple
threads fault in the same page at the same time, when they
could just wait for one of them to do the work.

Only faults for different addresses would proceed, not faults
for the same address...

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: changing mm->mmap_sem  (was: Re: system call for process information?)
  2001-03-18 20:59 ` Rik van Riel
@ 2001-03-19  1:21   ` Linus Torvalds
  2001-03-19  2:59     ` Rik van Riel
  0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2001-03-19  1:21 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel



On Sun, 18 Mar 2001, Rik van Riel wrote:
>
> Indeed, having threaded apps do multiple page faults at the
> same time is the main goal of this patch. However, I don't
> see how it would be good for scalability to have multiple
> threads fault in the same page at the same time, when they
> could just wait for one of them to do the work.

But they will.

That's what lock_page() etc are there for - there's no need for the VM to
synchronize because we already have the synchronization primitives at a
lower level.

And there isn't any other lock that could work anyway. It's either the
whole MM or a page. There's nothing in between.

		Linus


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: changing mm->mmap_sem  (was: Re: system call for process information?)
  2001-03-19  1:21   ` Linus Torvalds
@ 2001-03-19  2:59     ` Rik van Riel
  0 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2001-03-19  2:59 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Sun, 18 Mar 2001, Linus Torvalds wrote:
> On Sun, 18 Mar 2001, Rik van Riel wrote:
> >
> > Indeed, having threaded apps do multiple page faults at the
> > same time is the main goal of this patch. However, I don't
> > see how it would be good for scalability to have multiple
> > threads fault in the same page at the same time, when they
> > could just wait for one of them to do the work.
> 
> But they will.
> 
> That's what lock_page() etc are there for - there's no need for the VM
> to synchronize because we already have the synchronization primitives
> at a lower level.

Indeed. I'll go multithread the do_no_page and do_swap_page
functions tomorrow (maybe even tonight ;)).

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: changing mm->mmap_sem  (was: Re: system call for process information?)
  2001-03-18  9:34 changing mm->mmap_sem (was: Re: system call for process information?) Manfred Spraul
  2001-03-18 10:56 ` Rik van Riel
@ 2001-03-19 12:54 ` Stephen C. Tweedie
  1 sibling, 0 replies; 16+ messages in thread
From: Stephen C. Tweedie @ 2001-03-19 12:54 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Rik van Riel, linux-kernel, Stephen C. Tweedie

Hi,

On Sun, Mar 18, 2001 at 10:34:38AM +0100, Manfred Spraul wrote:

> > The problem is that mmap_sem seems to be protecting the list
> > of VMAs, so taking _only_ the page_table_lock could let a VMA
> > change under us while a page fault is underway ...
> 
> No, that can't happen.

It can.  Page faults often need to block, so they have to be able to
drop the page_table_lock.  Holding the mmap_sem is all that keeps the
vma intact until the IO is complete.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2001-03-19 12:58 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2001-03-18  9:34 changing mm->mmap_sem (was: Re: system call for process information?) Manfred Spraul
2001-03-18 10:56 ` Rik van Riel
2001-03-19 12:54 ` Stephen C. Tweedie
     [not found] <200103181813.KAA22153@penguin.transmeta.com>
2001-03-18 20:59 ` Rik van Riel
2001-03-19  1:21   ` Linus Torvalds
2001-03-19  2:59     ` Rik van Riel
     [not found] <Pine.LNX.4.33.0103181407520.1426-100000@mikeg.weiden.de>
2001-03-18 14:43 ` Rik van Riel
2001-03-18 18:13   ` Linus Torvalds
  -- strict thread matches above, loose matches on Subject: below --
2001-03-14 19:19 system call for process information? Rik van Riel
2001-03-15 12:24 ` changing mm->mmap_sem (was: Re: system call for process information?) Rik van Riel
2001-03-16  9:49   ` Stephen C. Tweedie
2001-03-16 11:50     ` Rik van Riel
2001-03-16 12:53       ` Stephen C. Tweedie
2001-03-18  7:23         ` Rik van Riel
2001-03-18  9:56           ` Mike Galbraith
2001-03-18 10:46             ` Rik van Riel
2001-03-18 12:33               ` Mike Galbraith

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox