TASK_SIZE is variable.

public inbox for linux-arch@vger.kernel.org
 help / color / mirror / Atom feed

* TASK_SIZE is variable.
@ 2005-01-25 22:26 David Woodhouse
  2005-01-25 23:52 ` David S. Miller
  2005-01-29 20:23 ` Andrew Morton
  0 siblings, 2 replies; 32+ messages in thread
From: David Woodhouse @ 2005-01-25 22:26 UTC (permalink / raw)
  To: linux-arch

Bad things can happen if a 32-bit process is the last user of a 64-bit
mm. TASK_SIZE isn't a constant, and we can end up clearing page tables
only up to the 32-bit TASK_SIZE instead of all the way. We should
probably double-check every instance of TASK_SIZE or USER_PTRS_PER_PGD
for this kind of problem.

We should also double-check that MM_VM_SIZE() and other such things are
correctly defined on all architectures. I already fixed ppc64 which let
it stay as TASK_SIZE, and hence dependent on the _current_ context
instead of the mm in the argument.

--- mm/mmap.c.orig	2005-01-25 22:23:02.030427272 +0000
+++ mm/mmap.c	2005-01-25 22:23:55.627279312 +0000
@@ -1612,8 +1612,8 @@ static void free_pgtables(struct mmu_gat
 	unsigned long last = end + PGDIR_SIZE - 1;
 	struct mm_struct *mm = tlb->mm;

-	if (last > TASK_SIZE || last < end)
-		last = TASK_SIZE;
+	if (last > MM_VM_SIZE(mm) || last < end)
+		last = MM_VM_SIZE(mm);

 	if (!prev) {
 		prev = mm->mmap;
@@ -1996,7 +1996,7 @@ void exit_mmap(struct mm_struct *mm)
 	vm_unacct_memory(nr_accounted);
 	BUG_ON(mm->map_count);	/* This is just debugging */
 	clear_page_range(tlb, FIRST_USER_PGD_NR * PGDIR_SIZE,
-			(TASK_SIZE + PGDIR_SIZE - 1) & PGDIR_MASK);
+			(MM_VM_SIZE(mm) + PGDIR_SIZE - 1) & PGDIR_MASK);

 	tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm));

-- 
dwmw2

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-25 22:26 TASK_SIZE is variable David Woodhouse
@ 2005-01-25 23:52 ` David S. Miller
  2005-01-26  6:36   ` Andi Kleen
  2005-01-26  7:54   ` David Woodhouse
  2005-01-29 20:23 ` Andrew Morton
  1 sibling, 2 replies; 32+ messages in thread
From: David S. Miller @ 2005-01-25 23:52 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-arch

On Tue, 25 Jan 2005 22:26:52 +0000
David Woodhouse <dwmw2@infradead.org> wrote:

> Bad things can happen if a 32-bit process is the last user of a 64-bit
> mm.

I guess this is OK.

I think the easiest way to fix this for good is to simply
kill off TASK_SIZE and that way each and every use will be
audited.  We can replace TASK_SIZE with something the describes
what is really wanted:

1) Max address potentially mapped in "mm".  Although I heavily
   dislike the MM_VM_SIZE(mm) scheme because address space size
   is determined by the process emul type, and thus is a thread
   not a mm property

   In fact, look at how nobody seems to even use the "mm" argument
   to this macro.  Further, look at the comment above the ppc64
   definition:

/* We can't actually tell the TASK_SIZE given just the mm, but default
 * to the 64-bit case to make sure that enough gets cleaned up. */

   Yeah, we can't tell the TASK_SIZE from the "mm", no shit.

2) Limits imposed at mmap()/munmap()/mremap() time for virtual
   address arguments.

Thinking about this some more, it may in fact be better to make
these platforms go to using a constant TASK_SIZE, kill this
bogus'ly argumented MM_VM_SIZE(mm) thing, and use something new
for mmap()/mremap()/munmap() et al. argument checking which
is based upon some thread property.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-25 23:52 ` David S. Miller
@ 2005-01-26  6:36   ` Andi Kleen
  2005-01-26  6:41     ` David S. Miller
  2005-01-26  7:54   ` David Woodhouse
  1 sibling, 1 reply; 32+ messages in thread
From: Andi Kleen @ 2005-01-26  6:36 UTC (permalink / raw)
  To: David S. Miller; +Cc: David Woodhouse, linux-arch

On Tue, Jan 25, 2005 at 03:52:39PM -0800, David S. Miller wrote:
> On Tue, 25 Jan 2005 22:26:52 +0000
> David Woodhouse <dwmw2@infradead.org> wrote:
> 
> > Bad things can happen if a 32-bit process is the last user of a 64-bit
> > mm.
> 
> I guess this is OK.

I still don't get it. When exactly can a process have memory > 32bit 
and not have the 32bit flag set that is checked by TASK_SIZE. IMHO that's 
the bug that needs addressing, because it will likely break more code.

-Andi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-26  6:36   ` Andi Kleen
@ 2005-01-26  6:41     ` David S. Miller
  2005-01-26  7:13       ` Andi Kleen
  0 siblings, 1 reply; 32+ messages in thread
From: David S. Miller @ 2005-01-26  6:41 UTC (permalink / raw)
  To: Andi Kleen; +Cc: dwmw2, linux-arch

On Wed, 26 Jan 2005 07:36:27 +0100
Andi Kleen <ak@suse.de> wrote:

> I still don't get it. When exactly can a process have memory > 32bit 
> and not have the 32bit flag set that is checked by TASK_SIZE. IMHO that's 
> the bug that needs addressing, because it will likely break more code.

I think on some platforms they move over to the 32-bit setting
in the thread struct before the address space is cleared out
for exec.

And if that is the case, I agree with Andi, that is the real bug
that needs to be fixed.

I recall that at one point Al Viro put some fix into binfmt_elf.c
that caused the thread compat type to change too early like that
and it made sparc64 explode so I knew to correct it immediately
and this happened years ago.

I just checked and ia64 sets thread.task_size in SET_PERSONALITY()
which is just fine, similarly for ppc64.  So I really wonder how
this problem can arise.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-26  6:41     ` David S. Miller
@ 2005-01-26  7:13       ` Andi Kleen
  2005-01-26  7:24         ` Andrew Morton
  0 siblings, 1 reply; 32+ messages in thread
From: Andi Kleen @ 2005-01-26  7:13 UTC (permalink / raw)
  To: David S. Miller; +Cc: Andi Kleen, dwmw2, linux-arch

> I just checked and ia64 sets thread.task_size in SET_PERSONALITY()
> which is just fine, similarly for ppc64.  So I really wonder how
> this problem can arise.

After rereading Anton's post on l-k I think the problem happens
when a proc accesses (like read on /proc/*/cmdline) increases the 
reference count of a mm, then the mm exits, and then the other
process reading /proc does the final mmput. Then the exit_mmap
executes in the context of the other process. 

Maybe it would be best to just use a semaphore to synchronize
this? 

[BTW there seem to be some other issues in this code; I'm currently
together with someone else trying to track down a exit mm race that causes 
machine checks on K8 because an lazy mm task has page tables that are already
freed and get overwritten by random data. Still haven't root caused it yet]

-Andi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-26  7:13       ` Andi Kleen
@ 2005-01-26  7:24         ` Andrew Morton
  2005-01-26  7:43           ` Andi Kleen
  0 siblings, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2005-01-26  7:24 UTC (permalink / raw)
  To: Andi Kleen; +Cc: davem, dwmw2, linux-arch

Andi Kleen <ak@suse.de> wrote:
>
> After rereading Anton's post on l-k I think the problem happens
>  when a proc accesses (like read on /proc/*/cmdline) increases the 
>  reference count of a mm, then the mm exits, and then the other
>  process reading /proc does the final mmput. Then the exit_mmap
>  executes in the context of the other process. 

yup.  This happens in quite a few places.  Everything under mmput() needs
to understand that the mm isn't necessarily current's mm.  I'm not sure
that introcuction of additional locking to prevent that would be very nice.

(Could we null out current->mm during mmput() to catch buggy code, or would
that break the lazy-tlb code?)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-26  7:24         ` Andrew Morton
@ 2005-01-26  7:43           ` Andi Kleen
  2005-01-26  8:01             ` Andrew Morton
  0 siblings, 1 reply; 32+ messages in thread
From: Andi Kleen @ 2005-01-26  7:43 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, davem, dwmw2, linux-arch

On Tue, Jan 25, 2005 at 11:24:41PM -0800, Andrew Morton wrote:
> Andi Kleen <ak@suse.de> wrote:
> >
> > After rereading Anton's post on l-k I think the problem happens
> >  when a proc accesses (like read on /proc/*/cmdline) increases the 
> >  reference count of a mm, then the mm exits, and then the other
> >  process reading /proc does the final mmput. Then the exit_mmap
> >  executes in the context of the other process. 
> 
> yup.  This happens in quite a few places.  Everything under mmput() needs
> to understand that the mm isn't necessarily current's mm.  I'm not sure
> that introcuction of additional locking to prevent that would be very nice.

After thinking about it more I agree. Just replacing TASK_SIZE with
something that depends on the mm is the best solution here.

> (Could we null out current->mm during mmput() to catch buggy code, or would
> that break the lazy-tlb code?)

It wouldn't have caught TASK_SIZE anyways, I'm not sure how useful this is.

-Andi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-25 23:52 ` David S. Miller
  2005-01-26  6:36   ` Andi Kleen
@ 2005-01-26  7:54   ` David Woodhouse
  1 sibling, 0 replies; 32+ messages in thread
From: David Woodhouse @ 2005-01-26  7:54 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-arch

On Tue, 2005-01-25 at 15:52 -0800, David S. Miller wrote:
> I think the easiest way to fix this for good is to simply
> kill off TASK_SIZE and that way each and every use will be
> audited.  We can replace TASK_SIZE with something the describes
> what is really wanted:

You're right. And we should replace MM_VM_SIZE() too -- as you say,
nobody uses the (mm) argument because it's useless.

TASK_SIZE should be renamed to current_task_size(), because that's what
it actually does. MM_VM_SIZE(mm) should become MAX_TASK_SIZE, for much
the same reason. 

Then perhaps we can also add thread_task_size(tsk) in case any of the
current users of MM_VM_SIZE() or the currently-broken users of TASK_SIZE
actually have access to the thread_info in question and really want the
real answer instead of MAX_TASK_SIZE.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-26  7:43           ` Andi Kleen
@ 2005-01-26  8:01             ` Andrew Morton
  2005-01-26  8:04               ` Andi Kleen
  2005-01-28  2:58               ` Paul Mackerras
  0 siblings, 2 replies; 32+ messages in thread
From: Andrew Morton @ 2005-01-26  8:01 UTC (permalink / raw)
  To: Andi Kleen; +Cc: davem, dwmw2, linux-arch

Andi Kleen <ak@suse.de> wrote:
>
>  After thinking about it more I agree. Just replacing TASK_SIZE with
>  something that depends on the mm is the best solution here.

OK.

Here's what I currently have:

From: Anton Blanchard <anton@samba.org>

The 4 level pagetable code changed the exit_mmap code to rely on TASK_SIZE.
 On some architectures (eg ppc64 and ia64), this is a per task property and
bad things can happen in certain circumstances when using it.

It is possible for one task to end up "owning" an mm from another - we have
seen this with the procfs code when process 1 accesses /proc/pid/cmdline of
process 2 while it is exiting.  Process 2 exits but does not tear its mm
down.  Later on process 1 finishes with the proc file and the mm gets torn
down at this point.

Now if process 1 was 32bit and process 2 was 64bit then we end up using a
bad value for TASK_SIZE in exit_mmap.  We only tear down part of the
address space and leave half initialised pagetables and entries in the MMU
etc.

MM_VM_SIZE() was created for this purpose (and is used in the next line for
tlb_finish_mmu), so use it.  I moved the PGD round up of TASK_SIZE into the
default MM_VM_SIZE.

As an aside, all architectures except one define FIRST_USER_PGD_NR as 0:

include/asm-arm26/pgtable.h:#define FIRST_USER_PGD_NR       1

It would be nice to get rid of one more magic constant and just clear from
0 ...  MM_VM_SIZE().  That would make it consistent with the tlb_flush_mmu
call below it too.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 25-akpm/include/linux/mm.h |    2 +-
 25-akpm/mm/mmap.c          |    3 +--
 2 files changed, 2 insertions(+), 3 deletions(-)

diff -puN include/linux/mm.h~use-mm_vm_size-in-exit_mmap include/linux/mm.h
--- 25/include/linux/mm.h~use-mm_vm_size-in-exit_mmap	2005-01-25 21:41:44.365536624 -0800
+++ 25-akpm/include/linux/mm.h	2005-01-25 21:41:44.371535712 -0800
@@ -38,7 +38,7 @@ extern int sysctl_legacy_va_layout;
 #include <asm/atomic.h>

 #ifndef MM_VM_SIZE
-#define MM_VM_SIZE(mm)	TASK_SIZE
+#define MM_VM_SIZE(mm)	((TASK_SIZE + PGDIR_SIZE - 1) & PGDIR_MASK)
 #endif

 #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
diff -puN mm/mmap.c~use-mm_vm_size-in-exit_mmap mm/mmap.c
--- 25/mm/mmap.c~use-mm_vm_size-in-exit_mmap	2005-01-25 21:41:44.367536320 -0800
+++ 25-akpm/mm/mmap.c	2005-01-25 21:41:44.373535408 -0800
@@ -1980,8 +1980,7 @@ void exit_mmap(struct mm_struct *mm)
 					~0UL, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
 	BUG_ON(mm->map_count);	/* This is just debugging */
-	clear_page_range(tlb, FIRST_USER_PGD_NR * PGDIR_SIZE,
-			(TASK_SIZE + PGDIR_SIZE - 1) & PGDIR_MASK);
+	clear_page_range(tlb, FIRST_USER_PGD_NR * PGDIR_SIZE, MM_VM_SIZE(mm));

 	tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm));

_

and:

From: David Woodhouse <dwmw2@infradead.org>

Bad things can happen if a 32-bit process is the last user of a 64-bit mm. 
TASK_SIZE isn't a constant, and we can end up clearing page tables only up
to the 32-bit TASK_SIZE instead of all the way.  We should probably
double-check every instance of TASK_SIZE or USER_PTRS_PER_PGD for this kind
of problem.

We should also double-check that MM_VM_SIZE() and other such things are
correctly defined on all architectures.  I already fixed ppc64 which let it
stay as TASK_SIZE, and hence dependent on the _current_ context instead of
the mm in the argument.

Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 25-akpm/mm/mmap.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff -puN mm/mmap.c~task_size-is-variable mm/mmap.c
--- 25/mm/mmap.c~task_size-is-variable	2005-01-25 22:08:40.903785456 -0800
+++ 25-akpm/mm/mmap.c	2005-01-25 22:08:40.908784696 -0800
@@ -1612,8 +1612,8 @@ static void free_pgtables(struct mmu_gat
 	unsigned long last = end + PGDIR_SIZE - 1;
 	struct mm_struct *mm = tlb->mm;

-	if (last > TASK_SIZE || last < end)
-		last = TASK_SIZE;
+	if (last > MM_VM_SIZE(mm) || last < end)
+		last = MM_VM_SIZE(mm);

 	if (!prev) {
 		prev = mm->mmap;
_

Shall I ship it?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-26  8:01             ` Andrew Morton
@ 2005-01-26  8:04               ` Andi Kleen
  2005-01-28  2:58               ` Paul Mackerras
  1 sibling, 0 replies; 32+ messages in thread
From: Andi Kleen @ 2005-01-26  8:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, davem, dwmw2, linux-arch

On Wed, Jan 26, 2005 at 12:01:17AM -0800, Andrew Morton wrote:
> Andi Kleen <ak@suse.de> wrote:
> >
> >  After thinking about it more I agree. Just replacing TASK_SIZE with
> >  something that depends on the mm is the best solution here.
> 
> OK.
> 
> Here's what I currently have:

I suspect more architectures need to be fixed, just doing  ia64/s390/ppc64
is not enough. x86-64 is safe in mainline, but some people carry
a patch to make TASK_SIZE variable for flex-mmap.

-Andi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-26  8:01             ` Andrew Morton
  2005-01-26  8:04               ` Andi Kleen
@ 2005-01-28  2:58               ` Paul Mackerras
  2005-01-28  3:11                 ` Paul Mackerras
  2005-01-28  6:39                 ` Andi Kleen
  1 sibling, 2 replies; 32+ messages in thread
From: Paul Mackerras @ 2005-01-28  2:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, davem, dwmw2, linux-arch

Here is a patch that I did recently to reduce the overhead of
clear_page_tables() when using 64k pages on ppc64.  It keeps a record
of the maximum address that has been used in each mm_struct.  With
this we can kill MM_VM_SIZE.

Andrew - could this go in -mm perhaps?

Signed-off-by: Paul Mackerras <paulus@samba.org>

diff -urN linux-2.6.10/include/asm-ia64/processor.h g5-64k/include/asm-ia64/processor.h
--- linux-2.6.10/include/asm-ia64/processor.h	2004-11-11 09:57:35.000000000 +1100
+++ g5-64k/include/asm-ia64/processor.h	2005-01-28 13:52:42.000000000 +1100
@@ -43,14 +43,6 @@
 #define TASK_SIZE		(current->thread.task_size)
 
 /*
- * MM_VM_SIZE(mm) gives the maximum address (plus 1) which may contain a mapping for
- * address-space MM.  Note that with 32-bit tasks, this is still DEFAULT_TASK_SIZE,
- * because the kernel may have installed helper-mappings above TASK_SIZE.  For example,
- * for x86 emulation, the LDT and GDT are mapped above TASK_SIZE.
- */
-#define MM_VM_SIZE(mm)		DEFAULT_TASK_SIZE
-
-/*
  * This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
  */
diff -urN linux-2.6.10/include/linux/mm.h g5-64k/include/linux/mm.h
--- linux-2.6.10/include/linux/mm.h	2004-11-25 17:42:43.000000000 +1100
+++ g5-64k/include/linux/mm.h	2005-01-28 13:52:14.848021352 +1100
@@ -37,10 +37,6 @@
 #include <asm/processor.h>
 #include <asm/atomic.h>
 
-#ifndef MM_VM_SIZE
-#define MM_VM_SIZE(mm)	TASK_SIZE
-#endif
-
 #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
 
 /*
diff -urN linux-2.6.10/include/linux/sched.h g5-64k/include/linux/sched.h
--- linux-2.6.10/include/linux/sched.h	2004-12-03 18:31:57.000000000 +1100
+++ g5-64k/include/linux/sched.h	2004-12-26 16:21:37.000000000 +1100
@@ -228,6 +228,7 @@
 	unsigned long arg_start, arg_end, env_start, env_end;
 	unsigned long rss, anon_rss, total_vm, locked_vm, shared_vm;
 	unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes;
+	unsigned long max_addr;
 
 	unsigned long saved_auxv[42]; /* for /proc/PID/auxv */
 
diff -urN linux-2.6.10/mm/mmap.c g5-64k/mm/mmap.c
--- linux-2.6.10/mm/mmap.c	2004-12-13 21:47:27.000000000 +1100
+++ g5-64k/mm/mmap.c	2005-01-28 13:47:33.000000000 +1100
@@ -317,6 +317,8 @@
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
 
+	if (vma->vm_end && vma->vm_end - 1 > mm->max_addr)
+		mm->max_addr = vma->vm_end - 1;
 	mm->map_count++;
 	validate_mm(mm);
 }
@@ -496,6 +498,8 @@
 		}
 	}
 
+	if (vma->vm_end && vma->vm_end - 1 > mm->max_addr)
+		mm->max_addr = vma->vm_end - 1;
 	validate_mm(mm);
 }
 
@@ -1832,6 +1836,7 @@
 	struct mmu_gather *tlb;
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
+	int nr_pgds;
 
 	lru_add_drain();
 
@@ -1844,8 +1849,10 @@
 					~0UL, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
 	BUG_ON(mm->map_count);	/* This is just debugging */
-	clear_page_tables(tlb, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);
-	tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm));
+	nr_pgds = pgd_index(mm->max_addr) - FIRST_USER_PGD_NR + 1;
+	if (nr_pgds > 0)
+		clear_page_tables(tlb, FIRST_USER_PGD_NR, nr_pgds);
+	tlb_finish_mmu(tlb, 0, mm->max_addr + 1);
 
 	vma = mm->mmap;
 	mm->mmap = mm->mmap_cache = NULL;

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-28  2:58               ` Paul Mackerras
@ 2005-01-28  3:11                 ` Paul Mackerras
  2005-01-28  3:17                   ` Andrew Morton
  2005-01-28  6:39                 ` Andi Kleen
  1 sibling, 1 reply; 32+ messages in thread
From: Paul Mackerras @ 2005-01-28  3:11 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen, davem, dwmw2, linux-arch

I wrote:

> Here is a patch that I did recently to reduce the overhead of
> clear_page_tables() when using 64k pages on ppc64.  It keeps a record
> of the maximum address that has been used in each mm_struct.  With
> this we can kill MM_VM_SIZE.

And I sent the 2.6.10 version of the patch, unfortunately.  Here is a
patch against current BK.

diff -urN linux-2.5/include/asm-ia64/processor.h test/include/asm-ia64/processor.h
--- linux-2.5/include/asm-ia64/processor.h	2005-01-21 10:31:58.000000000 +1100
+++ test/include/asm-ia64/processor.h	2005-01-28 13:59:37.236992600 +1100
@@ -43,14 +43,6 @@
 #define TASK_SIZE		(current->thread.task_size)
 
 /*
- * MM_VM_SIZE(mm) gives the maximum address (plus 1) which may contain a mapping for
- * address-space MM.  Note that with 32-bit tasks, this is still DEFAULT_TASK_SIZE,
- * because the kernel may have installed helper-mappings above TASK_SIZE.  For example,
- * for x86 emulation, the LDT and GDT are mapped above TASK_SIZE.
- */
-#define MM_VM_SIZE(mm)		DEFAULT_TASK_SIZE
-
-/*
  * This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
  */
diff -urN linux-2.5/include/linux/mm.h test/include/linux/mm.h
--- linux-2.5/include/linux/mm.h	2005-01-17 12:07:10.000000000 +1100
+++ test/include/linux/mm.h	2005-01-28 13:59:37.310981352 +1100
@@ -37,10 +37,6 @@
 #include <asm/processor.h>
 #include <asm/atomic.h>
 
-#ifndef MM_VM_SIZE
-#define MM_VM_SIZE(mm)	TASK_SIZE
-#endif
-
 #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
 
 /*
diff -urN linux-2.5/include/linux/sched.h test/include/linux/sched.h
--- linux-2.5/include/linux/sched.h	2005-01-21 10:31:58.000000000 +1100
+++ test/include/linux/sched.h	2005-01-28 13:59:37.333977856 +1100
@@ -231,6 +231,7 @@
 	unsigned long arg_start, arg_end, env_start, env_end;
 	unsigned long rss, anon_rss, total_vm, locked_vm, shared_vm;
 	unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes;
+	unsigned long max_addr;
 
 	unsigned long saved_auxv[42]; /* for /proc/PID/auxv */
 
diff -urN linux-2.5/mm/mmap.c test/mm/mmap.c
--- linux-2.5/mm/mmap.c	2005-01-13 11:01:00.000000000 +1100
+++ test/mm/mmap.c	2005-01-28 14:01:51.889892368 +1100
@@ -409,6 +409,8 @@
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
 
+	if (vma->vm_end && vma->vm_end - 1 > mm->max_addr)
+		mm->max_addr = vma->vm_end - 1;
 	mm->map_count++;
 	validate_mm(mm);
 }
@@ -598,6 +600,8 @@
 		}
 	}
 
+	if (vma->vm_end && vma->vm_end - 1 > mm->max_addr)
+		mm->max_addr = vma->vm_end - 1;
 	validate_mm(mm);
 }
 
@@ -1983,6 +1987,7 @@
 	struct mmu_gather *tlb;
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
+	int nr_pgds;
 
 	lru_add_drain();
 
@@ -1996,9 +2001,9 @@
 	vm_unacct_memory(nr_accounted);
 	BUG_ON(mm->map_count);	/* This is just debugging */
 	clear_page_range(tlb, FIRST_USER_PGD_NR * PGDIR_SIZE,
-			(TASK_SIZE + PGDIR_SIZE - 1) & PGDIR_MASK);
+			(mm->max_addr + PGDIR_SIZE) & PGDIR_MASK);
 	
-	tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm));
+	tlb_finish_mmu(tlb, 0, mm->max_addr + 1);
 
 	vma = mm->mmap;
 	mm->mmap = mm->mmap_cache = NULL;

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-28  3:11                 ` Paul Mackerras
@ 2005-01-28  3:17                   ` Andrew Morton
  2005-01-28  6:40                     ` Andi Kleen
  2005-01-28  8:46                     ` Russell King
  0 siblings, 2 replies; 32+ messages in thread
From: Andrew Morton @ 2005-01-28  3:17 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: ak, davem, dwmw2, linux-arch

Paul Mackerras <paulus@samba.org> wrote:
>
> > Here is a patch that I did recently to reduce the overhead of
>  > clear_page_tables() when using 64k pages on ppc64.  It keeps a record
>  > of the maximum address that has been used in each mm_struct.  With
>  > this we can kill MM_VM_SIZE.
> 
>  And I sent the 2.6.10 version of the patch, unfortunately.  Here is a
>  patch against current BK.

OK..  I'll drop Anton's patch:

From: Anton Blanchard <anton@samba.org>

The 4 level pagetable code changed the exit_mmap code to rely on TASK_SIZE.
 On some architectures (eg ppc64 and ia64), this is a per task property and
bad things can happen in certain circumstances when using it.

It is possible for one task to end up "owning" an mm from another - we have
seen this with the procfs code when process 1 accesses /proc/pid/cmdline of
process 2 while it is exiting.  Process 2 exits but does not tear its mm
down.  Later on process 1 finishes with the proc file and the mm gets torn
down at this point.

Now if process 1 was 32bit and process 2 was 64bit then we end up using a
bad value for TASK_SIZE in exit_mmap.  We only tear down part of the
address space and leave half initialised pagetables and entries in the MMU
etc.

MM_VM_SIZE() was created for this purpose (and is used in the next line for
tlb_finish_mmu), so use it.  I moved the PGD round up of TASK_SIZE into the
default MM_VM_SIZE.

As an aside, all architectures except one define FIRST_USER_PGD_NR as 0:

include/asm-arm26/pgtable.h:#define FIRST_USER_PGD_NR       1

It would be nice to get rid of one more magic constant and just clear from
0 ...  MM_VM_SIZE().  That would make it consistent with the tlb_flush_mmu
call below it too.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 25-akpm/include/linux/mm.h |    2 +-
 25-akpm/mm/mmap.c          |    3 +--
 2 files changed, 2 insertions(+), 3 deletions(-)

diff -puN include/linux/mm.h~use-mm_vm_size-in-exit_mmap include/linux/mm.h
--- 25/include/linux/mm.h~use-mm_vm_size-in-exit_mmap	2005-01-25 21:41:44.365536624 -0800
+++ 25-akpm/include/linux/mm.h	2005-01-25 21:41:44.371535712 -0800
@@ -38,7 +38,7 @@ extern int sysctl_legacy_va_layout;
 #include <asm/atomic.h>

 #ifndef MM_VM_SIZE
-#define MM_VM_SIZE(mm)	TASK_SIZE
+#define MM_VM_SIZE(mm)	((TASK_SIZE + PGDIR_SIZE - 1) & PGDIR_MASK)
 #endif

 #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
diff -puN mm/mmap.c~use-mm_vm_size-in-exit_mmap mm/mmap.c
--- 25/mm/mmap.c~use-mm_vm_size-in-exit_mmap	2005-01-25 21:41:44.367536320 -0800
+++ 25-akpm/mm/mmap.c	2005-01-25 21:41:44.373535408 -0800
@@ -1980,8 +1980,7 @@ void exit_mmap(struct mm_struct *mm)
 					~0UL, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
 	BUG_ON(mm->map_count);	/* This is just debugging */
-	clear_page_range(tlb, FIRST_USER_PGD_NR * PGDIR_SIZE,
-			(TASK_SIZE + PGDIR_SIZE - 1) & PGDIR_MASK);
+	clear_page_range(tlb, FIRST_USER_PGD_NR * PGDIR_SIZE, MM_VM_SIZE(mm));

 	tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm));

_

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-28  2:58               ` Paul Mackerras
  2005-01-28  3:11                 ` Paul Mackerras
@ 2005-01-28  6:39                 ` Andi Kleen
  2005-01-28 11:32                   ` David Woodhouse
  1 sibling, 1 reply; 32+ messages in thread
From: Andi Kleen @ 2005-01-28  6:39 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Andrew Morton, Andi Kleen, davem, dwmw2, linux-arch

On Fri, Jan 28, 2005 at 01:58:04PM +1100, Paul Mackerras wrote:
> Here is a patch that I did recently to reduce the overhead of
> clear_page_tables() when using 64k pages on ppc64.  It keeps a record
> of the maximum address that has been used in each mm_struct.  With
> this we can kill MM_VM_SIZE.
> 
> Andrew - could this go in -mm perhaps?

How does this work? The stack is always at the top of address space
and it will never change anything.

Or did you do this only to optimize 32bit processes on 64bit kernels?
If yes then the test in TASK_SIZE or MM_VM_SIZE should take care of it 
anyways.

-Andi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-28  3:17                   ` Andrew Morton
@ 2005-01-28  6:40                     ` Andi Kleen
  2005-01-29 11:23                       ` Anton Blanchard
  2005-01-28  8:46                     ` Russell King
  1 sibling, 1 reply; 32+ messages in thread
From: Andi Kleen @ 2005-01-28  6:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Paul Mackerras, ak, davem, dwmw2, linux-arch

On Thu, Jan 27, 2005 at 07:17:38PM -0800, Andrew Morton wrote:
> Paul Mackerras <paulus@samba.org> wrote:
> >
> > > Here is a patch that I did recently to reduce the overhead of
> >  > clear_page_tables() when using 64k pages on ppc64.  It keeps a record
> >  > of the maximum address that has been used in each mm_struct.  With
> >  > this we can kill MM_VM_SIZE.
> > 
> >  And I sent the 2.6.10 version of the patch, unfortunately.  Here is a
> >  patch against current BK.
> 
> OK..  I'll drop Anton's patch:

Frankly Anton's patch seems simpler and equivalent to me.

-Andi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-28  3:17                   ` Andrew Morton
  2005-01-28  6:40                     ` Andi Kleen
@ 2005-01-28  8:46                     ` Russell King
  1 sibling, 0 replies; 32+ messages in thread
From: Russell King @ 2005-01-28  8:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Paul Mackerras, ak, davem, dwmw2, linux-arch

On Thu, Jan 27, 2005 at 07:17:38PM -0800, Andrew Morton wrote:
> As an aside, all architectures except one define FIRST_USER_PGD_NR as 0:
> 
> include/asm-arm26/pgtable.h:#define FIRST_USER_PGD_NR       1

After seeing Andi's comments - if there is a chance that this patch gets
merged, please can the commentry be corrected to reflect reality instead
of being misleading?

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 PCMCIA      - http://pcmcia.arm.linux.org.uk/
                 2.6 Serial core

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-28  6:39                 ` Andi Kleen
@ 2005-01-28 11:32                   ` David Woodhouse
  0 siblings, 0 replies; 32+ messages in thread
From: David Woodhouse @ 2005-01-28 11:32 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Paul Mackerras, Andrew Morton, davem, linux-arch

On Fri, 2005-01-28 at 07:39 +0100, Andi Kleen wrote:
> How does this work? The stack is always at the top of address space
> and it will never change anything.

Is that the case? Doesn't ia64 put some extra things required for
emulation (virtual LDT etc?) in the virtual address space above 4GiB,
for 32-bit processes?

> Or did you do this only to optimize 32bit processes on 64bit kernels?
> If yes then the test in TASK_SIZE or MM_VM_SIZE should take care of it
> anyways.

The test in TASK_SIZE gets it wrong as discussed.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-28  6:40                     ` Andi Kleen
@ 2005-01-29 11:23                       ` Anton Blanchard
  0 siblings, 0 replies; 32+ messages in thread
From: Anton Blanchard @ 2005-01-29 11:23 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, Paul Mackerras, davem, dwmw2, linux-arch

 
> Frankly Anton's patch seems simpler and equivalent to me.

I like Pauls patch :) It speeds up teardown of processes since we only
walk as much of the pagetables as is mapped (instead of the entire
address space).

Anton

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-25 22:26 TASK_SIZE is variable David Woodhouse
  2005-01-25 23:52 ` David S. Miller
@ 2005-01-29 20:23 ` Andrew Morton
  2005-01-29 23:28   ` Paul Mackerras
  1 sibling, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2005-01-29 20:23 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-arch

Guys, could I get a bit of clarity on how we want to proceed with all of
this?

I have the below two patches queued.  Speak now or forever stfu!

From: David Woodhouse <dwmw2@infradead.org>

Bad things can happen if a 32-bit process is the last user of a 64-bit mm. 
TASK_SIZE isn't a constant, and we can end up clearing page tables only up
to the 32-bit TASK_SIZE instead of all the way.  We should probably
double-check every instance of TASK_SIZE or USER_PTRS_PER_PGD for this kind
of problem.

We should also double-check that MM_VM_SIZE() and other such things are
correctly defined on all architectures.  I already fixed ppc64 which let it
stay as TASK_SIZE, and hence dependent on the _current_ context instead of
the mm in the argument.

Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 25-akpm/mm/mmap.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff -puN mm/mmap.c~task_size-is-variable mm/mmap.c
--- 25/mm/mmap.c~task_size-is-variable	2005-01-25 22:08:40.903785456 -0800
+++ 25-akpm/mm/mmap.c	2005-01-25 22:08:40.908784696 -0800
@@ -1612,8 +1612,8 @@ static void free_pgtables(struct mmu_gat
 	unsigned long last = end + PGDIR_SIZE - 1;
 	struct mm_struct *mm = tlb->mm;

-	if (last > TASK_SIZE || last < end)
-		last = TASK_SIZE;
+	if (last > MM_VM_SIZE(mm) || last < end)
+		last = MM_VM_SIZE(mm);

 	if (!prev) {
 		prev = mm->mmap;
_

From: Anton Blanchard <anton@samba.org>

The 4 level pagetable code changed the exit_mmap code to rely on TASK_SIZE.
 On some architectures (eg ppc64 and ia64), this is a per task property and
bad things can happen in certain circumstances when using it.

It is possible for one task to end up "owning" an mm from another - we have
seen this with the procfs code when process 1 accesses /proc/pid/cmdline of
process 2 while it is exiting.  Process 2 exits but does not tear its mm
down.  Later on process 1 finishes with the proc file and the mm gets torn
down at this point.

Now if process 1 was 32bit and process 2 was 64bit then we end up using a
bad value for TASK_SIZE in exit_mmap.  We only tear down part of the
address space and leave half initialised pagetables and entries in the MMU
etc.

MM_VM_SIZE() was created for this purpose (and is used in the next line for
tlb_finish_mmu), so use it.  I moved the PGD round up of TASK_SIZE into the
default MM_VM_SIZE.

As an aside, all architectures except one define FIRST_USER_PGD_NR as 0:

include/asm-arm26/pgtable.h:#define FIRST_USER_PGD_NR       1

It would be nice to get rid of one more magic constant and just clear from
0 ...  MM_VM_SIZE().  That would make it consistent with the tlb_flush_mmu
call below it too.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 25-akpm/include/linux/mm.h |    2 +-
 25-akpm/mm/mmap.c          |    3 +--
 2 files changed, 2 insertions(+), 3 deletions(-)

diff -puN include/linux/mm.h~use-mm_vm_size-in-exit_mmap include/linux/mm.h
--- 25/include/linux/mm.h~use-mm_vm_size-in-exit_mmap	2005-01-25 21:41:44.365536624 -0800
+++ 25-akpm/include/linux/mm.h	2005-01-25 21:41:44.371535712 -0800
@@ -38,7 +38,7 @@ extern int sysctl_legacy_va_layout;
 #include <asm/atomic.h>

 #ifndef MM_VM_SIZE
-#define MM_VM_SIZE(mm)	TASK_SIZE
+#define MM_VM_SIZE(mm)	((TASK_SIZE + PGDIR_SIZE - 1) & PGDIR_MASK)
 #endif

 #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
diff -puN mm/mmap.c~use-mm_vm_size-in-exit_mmap mm/mmap.c
--- 25/mm/mmap.c~use-mm_vm_size-in-exit_mmap	2005-01-25 21:41:44.367536320 -0800
+++ 25-akpm/mm/mmap.c	2005-01-25 21:41:44.373535408 -0800
@@ -1980,8 +1980,7 @@ void exit_mmap(struct mm_struct *mm)
 					~0UL, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
 	BUG_ON(mm->map_count);	/* This is just debugging */
-	clear_page_range(tlb, FIRST_USER_PGD_NR * PGDIR_SIZE,
-			(TASK_SIZE + PGDIR_SIZE - 1) & PGDIR_MASK);
+	clear_page_range(tlb, FIRST_USER_PGD_NR * PGDIR_SIZE, MM_VM_SIZE(mm));

 	tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm));

_

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-29 20:23 ` Andrew Morton
@ 2005-01-29 23:28   ` Paul Mackerras
  2005-01-30 11:01     ` Andi Kleen
  2005-01-31  2:33     ` Matthew Wilcox
  0 siblings, 2 replies; 32+ messages in thread
From: Paul Mackerras @ 2005-01-29 23:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: David Woodhouse, linux-arch

Andrew Morton writes:

> Guys, could I get a bit of clarity on how we want to proceed with all of
> this?
> 
> I have the below two patches queued.  Speak now or forever stfu!

In both cases (free_pgtables and exit_mmap) we want to do something to
the whole of an address space which may not be ours.  With these
patches we use MM_VM_SIZE(mm) to tell us how big the address space is
rather than TASK_SIZE, which is a good start.  However, the default
MM_VM_SIZE is still (based on) TASK_SIZE, and each 64-bit arch that
has TASK_SIZE as a per-task property then has to override it (except
that parisc doesn't seem to).  Also, having the mm as an argument to
MM_VM_SIZE is currently useless because there is nothing in the mm
that is helpful.

Thus the two patches you included will fix the problems, at the
expense of having to scan the whole pgd at exit time, even though we
really only need to look at the first 2 entries on ppc64 for the
very common case where the mm was used by a 32-bit process.

IMHO it would be cleaner and more efficient to have a way to tell how
much of an address space has actually been used, just from the mm,
without having to know what task(s) used the mm and whether they were
32-bit or 64-bit tasks.  That is what my mm->max_addr patch was aiming
at.  Having mm->max_addr should also work correctly across all
architectures, even the ones that have mappings above the 4GB point
for 32-bit processes, as long as they create vma's for those mappings.

Thus I would like to see the mm->max_addr patch go in.  If
architectures still want to control MM_VM_SIZE, we could have the
default definition of MM_VM_SIZE(mm) be mm->max_addr + 1 (or
alternatively make mm->max_addr be the address space size rather than
the max address - I made it the max address because I was concerned
that there might be 32-bit architectures that allowed mappings up to
and including 0xffffffff).

I am happy to make a combined patch if people agree on the approach.

Paul.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-29 23:28   ` Paul Mackerras
@ 2005-01-30 11:01     ` Andi Kleen
  2005-01-30 12:10       ` Paul Mackerras
  2005-01-31  2:33     ` Matthew Wilcox
  1 sibling, 1 reply; 32+ messages in thread
From: Andi Kleen @ 2005-01-30 11:01 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Andrew Morton, David Woodhouse, linux-arch

> Thus the two patches you included will fix the problems, at the
> expense of having to scan the whole pgd at exit time, even though we
> really only need to look at the first 2 entries on ppc64 for the
> very common case where the mm was used by a 32-bit process.

I plan to fix this anyways in a more generic way - 4 level currently
has some bad performance regression because it scans much more pagetables.
DaveM had an old patch to use bitmaps for used entries in struct page.
For your 32bit processes only the first bit would be set and it would not
look at most of the pgds.  The plan was to redo Dave's old patch
for 4 level (I wanted to redo it a bit because I didn't like how
his iterators worked)

Once that it is in you will get the fast scanning for free.
Should hopefully happen soon.

-Andi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-30 11:01     ` Andi Kleen
@ 2005-01-30 12:10       ` Paul Mackerras
  2005-01-31  2:23         ` David S. Miller
  0 siblings, 1 reply; 32+ messages in thread
From: Paul Mackerras @ 2005-01-30 12:10 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, David Woodhouse, linux-arch

Andi Kleen writes:

> I plan to fix this anyways in a more generic way - 4 level currently
> has some bad performance regression because it scans much more pagetables.
> DaveM had an old patch to use bitmaps for used entries in struct page.
> For your 32bit processes only the first bit would be set and it would not
> look at most of the pgds.  The plan was to redo Dave's old patch
> for 4 level (I wanted to redo it a bit because I didn't like how
> his iterators worked)

Sounds good.  I'm looking forward to the patch.

In the meantime I think that MM_VM_SIZE(mm) should be renamed to
MAX_TASK_SIZE without the mm argument.  MAX_TASK_SIZE can default to
TASK_SIZE and be overridden on architectures that have more than one
task size.

Paul.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-30 12:10       ` Paul Mackerras
@ 2005-01-31  2:23         ` David S. Miller
  2005-01-31  9:23           ` Andi Kleen
  0 siblings, 1 reply; 32+ messages in thread
From: David S. Miller @ 2005-01-31  2:23 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: ak, akpm, dwmw2, linux-arch

On Sun, 30 Jan 2005 23:10:02 +1100
Paul Mackerras <paulus@samba.org> wrote:

> Andi Kleen writes:
> 
> > I plan to fix this anyways in a more generic way - 4 level currently
> > has some bad performance regression because it scans much more pagetables.
> > DaveM had an old patch to use bitmaps for used entries in struct page.
> > For your 32bit processes only the first bit would be set and it would not
> > look at most of the pgds.  The plan was to redo Dave's old patch
> > for 4 level (I wanted to redo it a bit because I didn't like how
> > his iterators worked)
> 
> Sounds good.  I'm looking forward to the patch.
> 
> In the meantime I think that MM_VM_SIZE(mm) should be renamed to
> MAX_TASK_SIZE without the mm argument.  MAX_TASK_SIZE can default to
> TASK_SIZE and be overridden on architectures that have more than one
> task size.

I'm looking forward very much to Andi's work as well.

However, I like the mm->max_addr idea because that could be used
for the mmap()/munmap()/mremap() sanity checks as well instead of
bogus TASK_SIZE.

There are some compat syscalls for the mmap like stuff that exists
only to validate the address ranges for compat task limits.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-29 23:28   ` Paul Mackerras
  2005-01-30 11:01     ` Andi Kleen
@ 2005-01-31  2:33     ` Matthew Wilcox
  1 sibling, 0 replies; 32+ messages in thread
From: Matthew Wilcox @ 2005-01-31  2:33 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Andrew Morton, David Woodhouse, linux-arch

On Sun, Jan 30, 2005 at 10:28:50AM +1100, Paul Mackerras wrote:
> However, the default
> MM_VM_SIZE is still (based on) TASK_SIZE, and each 64-bit arch that
> has TASK_SIZE as a per-task property then has to override it (except
> that parisc doesn't seem to).

There's no 64-bit userspace for parisc yet ... that would involve dealing
with drepper some more, and nobody wants to do that.

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-31  2:23         ` David S. Miller
@ 2005-01-31  9:23           ` Andi Kleen
  2005-01-31 19:29             ` David S. Miller
  0 siblings, 1 reply; 32+ messages in thread
From: Andi Kleen @ 2005-01-31  9:23 UTC (permalink / raw)
  To: David S. Miller; +Cc: Paul Mackerras, ak, akpm, dwmw2, linux-arch

> However, I like the mm->max_addr idea because that could be used
> for the mmap()/munmap()/mremap() sanity checks as well instead of
> bogus TASK_SIZE.

Hmm, but in process context it is not bogus is it? 

-Andi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-31  9:23           ` Andi Kleen
@ 2005-01-31 19:29             ` David S. Miller
  2005-01-31 19:38               ` Andi Kleen
  0 siblings, 1 reply; 32+ messages in thread
From: David S. Miller @ 2005-01-31 19:29 UTC (permalink / raw)
  To: Andi Kleen; +Cc: paulus, akpm, dwmw2, linux-arch

On Mon, 31 Jan 2005 10:23:54 +0100
Andi Kleen <ak@suse.de> wrote:

> > However, I like the mm->max_addr idea because that could be used
> > for the mmap()/munmap()/mremap() sanity checks as well instead of
> > bogus TASK_SIZE.
> 
> Hmm, but in process context it is not bogus is it? 

I guess you're suggesting that there could be times when
mm->max_addr and the current thread's address space disposition
are out of sync?

That is certainly something to check out.  And I am rather
certain that it is flush_old_exec() which should transition
mm->max_addr.  It is the only spot where the address space
API logically changes.  More precisely, this occurs when
exec_mmap() completes successfully, so maybe it should happen
there and in that case.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-31 19:29             ` David S. Miller
@ 2005-01-31 19:38               ` Andi Kleen
  2005-01-31 20:35                 ` David S. Miller
  0 siblings, 1 reply; 32+ messages in thread
From: Andi Kleen @ 2005-01-31 19:38 UTC (permalink / raw)
  To: David S. Miller; +Cc: Andi Kleen, paulus, akpm, dwmw2, linux-arch

On Mon, Jan 31, 2005 at 11:29:57AM -0800, David S. Miller wrote:
> On Mon, 31 Jan 2005 10:23:54 +0100
> Andi Kleen <ak@suse.de> wrote:
> 
> > > However, I like the mm->max_addr idea because that could be used
> > > for the mmap()/munmap()/mremap() sanity checks as well instead of
> > > bogus TASK_SIZE.
> > 
> > Hmm, but in process context it is not bogus is it? 
> 
> I guess you're suggesting that there could be times when
> mm->max_addr and the current thread's address space disposition
> are out of sync?

No, i just think it's overkill for the simple task Paul wants
it for. And in mmap/munmap you can just trust what TASK_SIZE
tells you because it's in the correct process context.

-Andi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-31 19:38               ` Andi Kleen
@ 2005-01-31 20:35                 ` David S. Miller
  2005-02-03  4:08                   ` Andrew Morton
  0 siblings, 1 reply; 32+ messages in thread
From: David S. Miller @ 2005-01-31 20:35 UTC (permalink / raw)
  To: Andi Kleen; +Cc: paulus, akpm, dwmw2, linux-arch

On Mon, 31 Jan 2005 20:38:28 +0100
Andi Kleen <ak@suse.de> wrote:

> No, i just think it's overkill for the simple task Paul wants
> it for. And in mmap/munmap you can just trust what TASK_SIZE
> tells you because it's in the correct process context.

Lots of 64-bit platforms use a constant TASK_SIZE.
This is why such platforms use compat syscall wrappers
to validate mmap args.

I'm saying that if we move over to useing mm->map_limit
in the syscall validation checks, such schemes would no
longer be necessary.

The arity of the address space really is an MM attribute
even though it is inherited from the thread.  The delayed
MM final release bug case via procfs is instrumental in
demonstrating this fact.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-01-31 20:35                 ` David S. Miller
@ 2005-02-03  4:08                   ` Andrew Morton
  2005-02-03  6:28                     ` David S. Miller
  0 siblings, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2005-02-03  4:08 UTC (permalink / raw)
  To: David S. Miller; +Cc: ak, paulus, dwmw2, linux-arch

Guys, I still don't have a clear sense of what you want done on this front.
I'm still sitting on the below two patches.

If 2.6.11-rc3 works correctly as-is then let's leave it alone.  If it does
not then can we please get a wiggle on?

task_size-is-variable.patch:

From: David Woodhouse <dwmw2@infradead.org>

Bad things can happen if a 32-bit process is the last user of a 64-bit mm. 
TASK_SIZE isn't a constant, and we can end up clearing page tables only up
to the 32-bit TASK_SIZE instead of all the way.  We should probably
double-check every instance of TASK_SIZE or USER_PTRS_PER_PGD for this kind
of problem.

We should also double-check that MM_VM_SIZE() and other such things are
correctly defined on all architectures.  I already fixed ppc64 which let it
stay as TASK_SIZE, and hence dependent on the _current_ context instead of
the mm in the argument.

Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 25-akpm/mm/mmap.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff -puN mm/mmap.c~task_size-is-variable mm/mmap.c
--- 25/mm/mmap.c~task_size-is-variable	2005-01-25 22:08:40.903785456 -0800
+++ 25-akpm/mm/mmap.c	2005-01-25 22:08:40.908784696 -0800
@@ -1612,8 +1612,8 @@ static void free_pgtables(struct mmu_gat
 	unsigned long last = end + PGDIR_SIZE - 1;
 	struct mm_struct *mm = tlb->mm;

-	if (last > TASK_SIZE || last < end)
-		last = TASK_SIZE;
+	if (last > MM_VM_SIZE(mm) || last < end)
+		last = MM_VM_SIZE(mm);

 	if (!prev) {
 		prev = mm->mmap;
_

use-mm_vm_size-in-exit_mmap.patch:

From: Anton Blanchard <anton@samba.org>

The 4 level pagetable code changed the exit_mmap code to rely on TASK_SIZE.
 On some architectures (eg ppc64 and ia64), this is a per task property and
bad things can happen in certain circumstances when using it.

It is possible for one task to end up "owning" an mm from another - we have
seen this with the procfs code when process 1 accesses /proc/pid/cmdline of
process 2 while it is exiting.  Process 2 exits but does not tear its mm
down.  Later on process 1 finishes with the proc file and the mm gets torn
down at this point.

Now if process 1 was 32bit and process 2 was 64bit then we end up using a
bad value for TASK_SIZE in exit_mmap.  We only tear down part of the
address space and leave half initialised pagetables and entries in the MMU
etc.

MM_VM_SIZE() was created for this purpose (and is used in the next line for
tlb_finish_mmu), so use it.  I moved the PGD round up of TASK_SIZE into the
default MM_VM_SIZE.

As an aside, all architectures except one define FIRST_USER_PGD_NR as 0:

include/asm-arm26/pgtable.h:#define FIRST_USER_PGD_NR       1

It would be nice to get rid of one more magic constant and just clear from
0 ...  MM_VM_SIZE().  That would make it consistent with the tlb_flush_mmu
call below it too.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 25-akpm/include/linux/mm.h |    2 +-
 25-akpm/mm/mmap.c          |    3 +--
 2 files changed, 2 insertions(+), 3 deletions(-)

diff -puN include/linux/mm.h~use-mm_vm_size-in-exit_mmap include/linux/mm.h
--- 25/include/linux/mm.h~use-mm_vm_size-in-exit_mmap	2005-01-25 21:41:44.365536624 -0800
+++ 25-akpm/include/linux/mm.h	2005-01-25 21:41:44.371535712 -0800
@@ -38,7 +38,7 @@ extern int sysctl_legacy_va_layout;
 #include <asm/atomic.h>

 #ifndef MM_VM_SIZE
-#define MM_VM_SIZE(mm)	TASK_SIZE
+#define MM_VM_SIZE(mm)	((TASK_SIZE + PGDIR_SIZE - 1) & PGDIR_MASK)
 #endif

 #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
diff -puN mm/mmap.c~use-mm_vm_size-in-exit_mmap mm/mmap.c
--- 25/mm/mmap.c~use-mm_vm_size-in-exit_mmap	2005-01-25 21:41:44.367536320 -0800
+++ 25-akpm/mm/mmap.c	2005-01-25 21:41:44.373535408 -0800
@@ -1980,8 +1980,7 @@ void exit_mmap(struct mm_struct *mm)
 					~0UL, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
 	BUG_ON(mm->map_count);	/* This is just debugging */
-	clear_page_range(tlb, FIRST_USER_PGD_NR * PGDIR_SIZE,
-			(TASK_SIZE + PGDIR_SIZE - 1) & PGDIR_MASK);
+	clear_page_range(tlb, FIRST_USER_PGD_NR * PGDIR_SIZE, MM_VM_SIZE(mm));

 	tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm));

_

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-02-03  4:08                   ` Andrew Morton
@ 2005-02-03  6:28                     ` David S. Miller
  2005-02-03  7:19                       ` Andi Kleen
  0 siblings, 1 reply; 32+ messages in thread
From: David S. Miller @ 2005-02-03  6:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: ak, paulus, dwmw2, linux-arch

On Wed, 2 Feb 2005 20:08:12 -0800
Andrew Morton <akpm@osdl.org> wrote:

> Guys, I still don't have a clear sense of what you want done on this front.
> I'm still sitting on the below two patches.
> 
> If 2.6.11-rc3 works correctly as-is then let's leave it alone.  If it does
> not then can we please get a wiggle on?

I'm willing to go whatever way people want to get a fix quickly
into 2.6.11-final.  But longer term I think the mm->vm_limit idea
is the better direction longer term.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-02-03  6:28                     ` David S. Miller
@ 2005-02-03  7:19                       ` Andi Kleen
  2005-02-03  9:23                         ` David Woodhouse
  0 siblings, 1 reply; 32+ messages in thread
From: Andi Kleen @ 2005-02-03  7:19 UTC (permalink / raw)
  To: David S. Miller; +Cc: Andrew Morton, ak, paulus, dwmw2, linux-arch

On Wed, Feb 02, 2005 at 10:28:41PM -0800, David S. Miller wrote:
> On Wed, 2 Feb 2005 20:08:12 -0800
> Andrew Morton <akpm@osdl.org> wrote:
> 
> > Guys, I still don't have a clear sense of what you want done on this front.
> > I'm still sitting on the below two patches.
> > 
> > If 2.6.11-rc3 works correctly as-is then let's leave it alone.  If it does
> > not then can we please get a wiggle on?
> 
> I'm willing to go whatever way people want to get a fix quickly
> into 2.6.11-final.  But longer term I think the mm->vm_limit idea
> is the better direction longer term.

For 2.6.11 MM_VM_SIZE is imho the right fix. 

-Andi (who reconsidered his earlier objection)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: TASK_SIZE is variable.
  2005-02-03  7:19                       ` Andi Kleen
@ 2005-02-03  9:23                         ` David Woodhouse
  0 siblings, 0 replies; 32+ messages in thread
From: David Woodhouse @ 2005-02-03  9:23 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David S. Miller, Andrew Morton, paulus, linux-arch

On Thu, 2005-02-03 at 08:19 +0100, Andi Kleen wrote:
> 
> > I'm willing to go whatever way people want to get a fix quickly
> > into 2.6.11-final.  But longer term I think the mm->vm_limit idea
> > is the better direction longer term.
> 
> For 2.6.11 MM_VM_SIZE is imho the right fix. 

Agreed. As Dave says, we can always implement the more cunning ideas
later. Push what you have.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2005-02-03  9:23 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-01-25 22:26 TASK_SIZE is variable David Woodhouse
2005-01-25 23:52 ` David S. Miller
2005-01-26  6:36   ` Andi Kleen
2005-01-26  6:41     ` David S. Miller
2005-01-26  7:13       ` Andi Kleen
2005-01-26  7:24         ` Andrew Morton
2005-01-26  7:43           ` Andi Kleen
2005-01-26  8:01             ` Andrew Morton
2005-01-26  8:04               ` Andi Kleen
2005-01-28  2:58               ` Paul Mackerras
2005-01-28  3:11                 ` Paul Mackerras
2005-01-28  3:17                   ` Andrew Morton
2005-01-28  6:40                     ` Andi Kleen
2005-01-29 11:23                       ` Anton Blanchard
2005-01-28  8:46                     ` Russell King
2005-01-28  6:39                 ` Andi Kleen
2005-01-28 11:32                   ` David Woodhouse
2005-01-26  7:54   ` David Woodhouse
2005-01-29 20:23 ` Andrew Morton
2005-01-29 23:28   ` Paul Mackerras
2005-01-30 11:01     ` Andi Kleen
2005-01-30 12:10       ` Paul Mackerras
2005-01-31  2:23         ` David S. Miller
2005-01-31  9:23           ` Andi Kleen
2005-01-31 19:29             ` David S. Miller
2005-01-31 19:38               ` Andi Kleen
2005-01-31 20:35                 ` David S. Miller
2005-02-03  4:08                   ` Andrew Morton
2005-02-03  6:28                     ` David S. Miller
2005-02-03  7:19                       ` Andi Kleen
2005-02-03  9:23                         ` David Woodhouse
2005-01-31  2:33     ` Matthew Wilcox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox