* TASK_SIZE is variable.
@ 2005-01-25 22:26 David Woodhouse
2005-01-25 23:52 ` David S. Miller
2005-01-29 20:23 ` Andrew Morton
0 siblings, 2 replies; 32+ messages in thread
From: David Woodhouse @ 2005-01-25 22:26 UTC (permalink / raw)
To: linux-arch
Bad things can happen if a 32-bit process is the last user of a 64-bit
mm. TASK_SIZE isn't a constant, and we can end up clearing page tables
only up to the 32-bit TASK_SIZE instead of all the way. We should
probably double-check every instance of TASK_SIZE or USER_PTRS_PER_PGD
for this kind of problem.
We should also double-check that MM_VM_SIZE() and other such things are
correctly defined on all architectures. I already fixed ppc64 which let
it stay as TASK_SIZE, and hence dependent on the _current_ context
instead of the mm in the argument.
--- mm/mmap.c.orig 2005-01-25 22:23:02.030427272 +0000
+++ mm/mmap.c 2005-01-25 22:23:55.627279312 +0000
@@ -1612,8 +1612,8 @@ static void free_pgtables(struct mmu_gat
unsigned long last = end + PGDIR_SIZE - 1;
struct mm_struct *mm = tlb->mm;
- if (last > TASK_SIZE || last < end)
- last = TASK_SIZE;
+ if (last > MM_VM_SIZE(mm) || last < end)
+ last = MM_VM_SIZE(mm);
if (!prev) {
prev = mm->mmap;
@@ -1996,7 +1996,7 @@ void exit_mmap(struct mm_struct *mm)
vm_unacct_memory(nr_accounted);
BUG_ON(mm->map_count); /* This is just debugging */
clear_page_range(tlb, FIRST_USER_PGD_NR * PGDIR_SIZE,
- (TASK_SIZE + PGDIR_SIZE - 1) & PGDIR_MASK);
+ (MM_VM_SIZE(mm) + PGDIR_SIZE - 1) & PGDIR_MASK);
tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm));
--
dwmw2
^ permalink raw reply [flat|nested] 32+ messages in thread* Re: TASK_SIZE is variable. 2005-01-25 22:26 TASK_SIZE is variable David Woodhouse @ 2005-01-25 23:52 ` David S. Miller 2005-01-26 6:36 ` Andi Kleen 2005-01-26 7:54 ` David Woodhouse 2005-01-29 20:23 ` Andrew Morton 1 sibling, 2 replies; 32+ messages in thread From: David S. Miller @ 2005-01-25 23:52 UTC (permalink / raw) To: David Woodhouse; +Cc: linux-arch On Tue, 25 Jan 2005 22:26:52 +0000 David Woodhouse <dwmw2@infradead.org> wrote: > Bad things can happen if a 32-bit process is the last user of a 64-bit > mm. I guess this is OK. I think the easiest way to fix this for good is to simply kill off TASK_SIZE and that way each and every use will be audited. We can replace TASK_SIZE with something the describes what is really wanted: 1) Max address potentially mapped in "mm". Although I heavily dislike the MM_VM_SIZE(mm) scheme because address space size is determined by the process emul type, and thus is a thread not a mm property In fact, look at how nobody seems to even use the "mm" argument to this macro. Further, look at the comment above the ppc64 definition: /* We can't actually tell the TASK_SIZE given just the mm, but default * to the 64-bit case to make sure that enough gets cleaned up. */ Yeah, we can't tell the TASK_SIZE from the "mm", no shit. 2) Limits imposed at mmap()/munmap()/mremap() time for virtual address arguments. Thinking about this some more, it may in fact be better to make these platforms go to using a constant TASK_SIZE, kill this bogus'ly argumented MM_VM_SIZE(mm) thing, and use something new for mmap()/mremap()/munmap() et al. argument checking which is based upon some thread property. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-25 23:52 ` David S. Miller @ 2005-01-26 6:36 ` Andi Kleen 2005-01-26 6:41 ` David S. Miller 2005-01-26 7:54 ` David Woodhouse 1 sibling, 1 reply; 32+ messages in thread From: Andi Kleen @ 2005-01-26 6:36 UTC (permalink / raw) To: David S. Miller; +Cc: David Woodhouse, linux-arch On Tue, Jan 25, 2005 at 03:52:39PM -0800, David S. Miller wrote: > On Tue, 25 Jan 2005 22:26:52 +0000 > David Woodhouse <dwmw2@infradead.org> wrote: > > > Bad things can happen if a 32-bit process is the last user of a 64-bit > > mm. > > I guess this is OK. I still don't get it. When exactly can a process have memory > 32bit and not have the 32bit flag set that is checked by TASK_SIZE. IMHO that's the bug that needs addressing, because it will likely break more code. -Andi ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-26 6:36 ` Andi Kleen @ 2005-01-26 6:41 ` David S. Miller 2005-01-26 7:13 ` Andi Kleen 0 siblings, 1 reply; 32+ messages in thread From: David S. Miller @ 2005-01-26 6:41 UTC (permalink / raw) To: Andi Kleen; +Cc: dwmw2, linux-arch On Wed, 26 Jan 2005 07:36:27 +0100 Andi Kleen <ak@suse.de> wrote: > I still don't get it. When exactly can a process have memory > 32bit > and not have the 32bit flag set that is checked by TASK_SIZE. IMHO that's > the bug that needs addressing, because it will likely break more code. I think on some platforms they move over to the 32-bit setting in the thread struct before the address space is cleared out for exec. And if that is the case, I agree with Andi, that is the real bug that needs to be fixed. I recall that at one point Al Viro put some fix into binfmt_elf.c that caused the thread compat type to change too early like that and it made sparc64 explode so I knew to correct it immediately and this happened years ago. I just checked and ia64 sets thread.task_size in SET_PERSONALITY() which is just fine, similarly for ppc64. So I really wonder how this problem can arise. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-26 6:41 ` David S. Miller @ 2005-01-26 7:13 ` Andi Kleen 2005-01-26 7:24 ` Andrew Morton 0 siblings, 1 reply; 32+ messages in thread From: Andi Kleen @ 2005-01-26 7:13 UTC (permalink / raw) To: David S. Miller; +Cc: Andi Kleen, dwmw2, linux-arch > I just checked and ia64 sets thread.task_size in SET_PERSONALITY() > which is just fine, similarly for ppc64. So I really wonder how > this problem can arise. After rereading Anton's post on l-k I think the problem happens when a proc accesses (like read on /proc/*/cmdline) increases the reference count of a mm, then the mm exits, and then the other process reading /proc does the final mmput. Then the exit_mmap executes in the context of the other process. Maybe it would be best to just use a semaphore to synchronize this? [BTW there seem to be some other issues in this code; I'm currently together with someone else trying to track down a exit mm race that causes machine checks on K8 because an lazy mm task has page tables that are already freed and get overwritten by random data. Still haven't root caused it yet] -Andi ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-26 7:13 ` Andi Kleen @ 2005-01-26 7:24 ` Andrew Morton 2005-01-26 7:43 ` Andi Kleen 0 siblings, 1 reply; 32+ messages in thread From: Andrew Morton @ 2005-01-26 7:24 UTC (permalink / raw) To: Andi Kleen; +Cc: davem, dwmw2, linux-arch Andi Kleen <ak@suse.de> wrote: > > After rereading Anton's post on l-k I think the problem happens > when a proc accesses (like read on /proc/*/cmdline) increases the > reference count of a mm, then the mm exits, and then the other > process reading /proc does the final mmput. Then the exit_mmap > executes in the context of the other process. yup. This happens in quite a few places. Everything under mmput() needs to understand that the mm isn't necessarily current's mm. I'm not sure that introcuction of additional locking to prevent that would be very nice. (Could we null out current->mm during mmput() to catch buggy code, or would that break the lazy-tlb code?) ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-26 7:24 ` Andrew Morton @ 2005-01-26 7:43 ` Andi Kleen 2005-01-26 8:01 ` Andrew Morton 0 siblings, 1 reply; 32+ messages in thread From: Andi Kleen @ 2005-01-26 7:43 UTC (permalink / raw) To: Andrew Morton; +Cc: Andi Kleen, davem, dwmw2, linux-arch On Tue, Jan 25, 2005 at 11:24:41PM -0800, Andrew Morton wrote: > Andi Kleen <ak@suse.de> wrote: > > > > After rereading Anton's post on l-k I think the problem happens > > when a proc accesses (like read on /proc/*/cmdline) increases the > > reference count of a mm, then the mm exits, and then the other > > process reading /proc does the final mmput. Then the exit_mmap > > executes in the context of the other process. > > yup. This happens in quite a few places. Everything under mmput() needs > to understand that the mm isn't necessarily current's mm. I'm not sure > that introcuction of additional locking to prevent that would be very nice. After thinking about it more I agree. Just replacing TASK_SIZE with something that depends on the mm is the best solution here. > (Could we null out current->mm during mmput() to catch buggy code, or would > that break the lazy-tlb code?) It wouldn't have caught TASK_SIZE anyways, I'm not sure how useful this is. -Andi ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-26 7:43 ` Andi Kleen @ 2005-01-26 8:01 ` Andrew Morton 2005-01-26 8:04 ` Andi Kleen 2005-01-28 2:58 ` Paul Mackerras 0 siblings, 2 replies; 32+ messages in thread From: Andrew Morton @ 2005-01-26 8:01 UTC (permalink / raw) To: Andi Kleen; +Cc: davem, dwmw2, linux-arch Andi Kleen <ak@suse.de> wrote: > > After thinking about it more I agree. Just replacing TASK_SIZE with > something that depends on the mm is the best solution here. OK. Here's what I currently have: From: Anton Blanchard <anton@samba.org> The 4 level pagetable code changed the exit_mmap code to rely on TASK_SIZE. On some architectures (eg ppc64 and ia64), this is a per task property and bad things can happen in certain circumstances when using it. It is possible for one task to end up "owning" an mm from another - we have seen this with the procfs code when process 1 accesses /proc/pid/cmdline of process 2 while it is exiting. Process 2 exits but does not tear its mm down. Later on process 1 finishes with the proc file and the mm gets torn down at this point. Now if process 1 was 32bit and process 2 was 64bit then we end up using a bad value for TASK_SIZE in exit_mmap. We only tear down part of the address space and leave half initialised pagetables and entries in the MMU etc. MM_VM_SIZE() was created for this purpose (and is used in the next line for tlb_finish_mmu), so use it. I moved the PGD round up of TASK_SIZE into the default MM_VM_SIZE. As an aside, all architectures except one define FIRST_USER_PGD_NR as 0: include/asm-arm26/pgtable.h:#define FIRST_USER_PGD_NR 1 It would be nice to get rid of one more magic constant and just clear from 0 ... MM_VM_SIZE(). That would make it consistent with the tlb_flush_mmu call below it too. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> --- 25-akpm/include/linux/mm.h | 2 +- 25-akpm/mm/mmap.c | 3 +-- 2 files changed, 2 insertions(+), 3 deletions(-) diff -puN include/linux/mm.h~use-mm_vm_size-in-exit_mmap include/linux/mm.h --- 25/include/linux/mm.h~use-mm_vm_size-in-exit_mmap 2005-01-25 21:41:44.365536624 -0800 +++ 25-akpm/include/linux/mm.h 2005-01-25 21:41:44.371535712 -0800 @@ -38,7 +38,7 @@ extern int sysctl_legacy_va_layout; #include <asm/atomic.h> #ifndef MM_VM_SIZE -#define MM_VM_SIZE(mm) TASK_SIZE +#define MM_VM_SIZE(mm) ((TASK_SIZE + PGDIR_SIZE - 1) & PGDIR_MASK) #endif #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) diff -puN mm/mmap.c~use-mm_vm_size-in-exit_mmap mm/mmap.c --- 25/mm/mmap.c~use-mm_vm_size-in-exit_mmap 2005-01-25 21:41:44.367536320 -0800 +++ 25-akpm/mm/mmap.c 2005-01-25 21:41:44.373535408 -0800 @@ -1980,8 +1980,7 @@ void exit_mmap(struct mm_struct *mm) ~0UL, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); BUG_ON(mm->map_count); /* This is just debugging */ - clear_page_range(tlb, FIRST_USER_PGD_NR * PGDIR_SIZE, - (TASK_SIZE + PGDIR_SIZE - 1) & PGDIR_MASK); + clear_page_range(tlb, FIRST_USER_PGD_NR * PGDIR_SIZE, MM_VM_SIZE(mm)); tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm)); _ and: From: David Woodhouse <dwmw2@infradead.org> Bad things can happen if a 32-bit process is the last user of a 64-bit mm. TASK_SIZE isn't a constant, and we can end up clearing page tables only up to the 32-bit TASK_SIZE instead of all the way. We should probably double-check every instance of TASK_SIZE or USER_PTRS_PER_PGD for this kind of problem. We should also double-check that MM_VM_SIZE() and other such things are correctly defined on all architectures. I already fixed ppc64 which let it stay as TASK_SIZE, and hence dependent on the _current_ context instead of the mm in the argument. Signed-off-by: Andrew Morton <akpm@osdl.org> --- 25-akpm/mm/mmap.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff -puN mm/mmap.c~task_size-is-variable mm/mmap.c --- 25/mm/mmap.c~task_size-is-variable 2005-01-25 22:08:40.903785456 -0800 +++ 25-akpm/mm/mmap.c 2005-01-25 22:08:40.908784696 -0800 @@ -1612,8 +1612,8 @@ static void free_pgtables(struct mmu_gat unsigned long last = end + PGDIR_SIZE - 1; struct mm_struct *mm = tlb->mm; - if (last > TASK_SIZE || last < end) - last = TASK_SIZE; + if (last > MM_VM_SIZE(mm) || last < end) + last = MM_VM_SIZE(mm); if (!prev) { prev = mm->mmap; _ Shall I ship it? ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-26 8:01 ` Andrew Morton @ 2005-01-26 8:04 ` Andi Kleen 2005-01-28 2:58 ` Paul Mackerras 1 sibling, 0 replies; 32+ messages in thread From: Andi Kleen @ 2005-01-26 8:04 UTC (permalink / raw) To: Andrew Morton; +Cc: Andi Kleen, davem, dwmw2, linux-arch On Wed, Jan 26, 2005 at 12:01:17AM -0800, Andrew Morton wrote: > Andi Kleen <ak@suse.de> wrote: > > > > After thinking about it more I agree. Just replacing TASK_SIZE with > > something that depends on the mm is the best solution here. > > OK. > > Here's what I currently have: I suspect more architectures need to be fixed, just doing ia64/s390/ppc64 is not enough. x86-64 is safe in mainline, but some people carry a patch to make TASK_SIZE variable for flex-mmap. -Andi ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-26 8:01 ` Andrew Morton 2005-01-26 8:04 ` Andi Kleen @ 2005-01-28 2:58 ` Paul Mackerras 2005-01-28 3:11 ` Paul Mackerras 2005-01-28 6:39 ` Andi Kleen 1 sibling, 2 replies; 32+ messages in thread From: Paul Mackerras @ 2005-01-28 2:58 UTC (permalink / raw) To: Andrew Morton; +Cc: Andi Kleen, davem, dwmw2, linux-arch Here is a patch that I did recently to reduce the overhead of clear_page_tables() when using 64k pages on ppc64. It keeps a record of the maximum address that has been used in each mm_struct. With this we can kill MM_VM_SIZE. Andrew - could this go in -mm perhaps? Signed-off-by: Paul Mackerras <paulus@samba.org> diff -urN linux-2.6.10/include/asm-ia64/processor.h g5-64k/include/asm-ia64/processor.h --- linux-2.6.10/include/asm-ia64/processor.h 2004-11-11 09:57:35.000000000 +1100 +++ g5-64k/include/asm-ia64/processor.h 2005-01-28 13:52:42.000000000 +1100 @@ -43,14 +43,6 @@ #define TASK_SIZE (current->thread.task_size) /* - * MM_VM_SIZE(mm) gives the maximum address (plus 1) which may contain a mapping for - * address-space MM. Note that with 32-bit tasks, this is still DEFAULT_TASK_SIZE, - * because the kernel may have installed helper-mappings above TASK_SIZE. For example, - * for x86 emulation, the LDT and GDT are mapped above TASK_SIZE. - */ -#define MM_VM_SIZE(mm) DEFAULT_TASK_SIZE - -/* * This decides where the kernel will search for a free chunk of vm * space during mmap's. */ diff -urN linux-2.6.10/include/linux/mm.h g5-64k/include/linux/mm.h --- linux-2.6.10/include/linux/mm.h 2004-11-25 17:42:43.000000000 +1100 +++ g5-64k/include/linux/mm.h 2005-01-28 13:52:14.848021352 +1100 @@ -37,10 +37,6 @@ #include <asm/processor.h> #include <asm/atomic.h> -#ifndef MM_VM_SIZE -#define MM_VM_SIZE(mm) TASK_SIZE -#endif - #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) /* diff -urN linux-2.6.10/include/linux/sched.h g5-64k/include/linux/sched.h --- linux-2.6.10/include/linux/sched.h 2004-12-03 18:31:57.000000000 +1100 +++ g5-64k/include/linux/sched.h 2004-12-26 16:21:37.000000000 +1100 @@ -228,6 +228,7 @@ unsigned long arg_start, arg_end, env_start, env_end; unsigned long rss, anon_rss, total_vm, locked_vm, shared_vm; unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes; + unsigned long max_addr; unsigned long saved_auxv[42]; /* for /proc/PID/auxv */ diff -urN linux-2.6.10/mm/mmap.c g5-64k/mm/mmap.c --- linux-2.6.10/mm/mmap.c 2004-12-13 21:47:27.000000000 +1100 +++ g5-64k/mm/mmap.c 2005-01-28 13:47:33.000000000 +1100 @@ -317,6 +317,8 @@ if (mapping) spin_unlock(&mapping->i_mmap_lock); + if (vma->vm_end && vma->vm_end - 1 > mm->max_addr) + mm->max_addr = vma->vm_end - 1; mm->map_count++; validate_mm(mm); } @@ -496,6 +498,8 @@ } } + if (vma->vm_end && vma->vm_end - 1 > mm->max_addr) + mm->max_addr = vma->vm_end - 1; validate_mm(mm); } @@ -1832,6 +1836,7 @@ struct mmu_gather *tlb; struct vm_area_struct *vma; unsigned long nr_accounted = 0; + int nr_pgds; lru_add_drain(); @@ -1844,8 +1849,10 @@ ~0UL, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); BUG_ON(mm->map_count); /* This is just debugging */ - clear_page_tables(tlb, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD); - tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm)); + nr_pgds = pgd_index(mm->max_addr) - FIRST_USER_PGD_NR + 1; + if (nr_pgds > 0) + clear_page_tables(tlb, FIRST_USER_PGD_NR, nr_pgds); + tlb_finish_mmu(tlb, 0, mm->max_addr + 1); vma = mm->mmap; mm->mmap = mm->mmap_cache = NULL; ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-28 2:58 ` Paul Mackerras @ 2005-01-28 3:11 ` Paul Mackerras 2005-01-28 3:17 ` Andrew Morton 2005-01-28 6:39 ` Andi Kleen 1 sibling, 1 reply; 32+ messages in thread From: Paul Mackerras @ 2005-01-28 3:11 UTC (permalink / raw) To: Andrew Morton, Andi Kleen, davem, dwmw2, linux-arch I wrote: > Here is a patch that I did recently to reduce the overhead of > clear_page_tables() when using 64k pages on ppc64. It keeps a record > of the maximum address that has been used in each mm_struct. With > this we can kill MM_VM_SIZE. And I sent the 2.6.10 version of the patch, unfortunately. Here is a patch against current BK. diff -urN linux-2.5/include/asm-ia64/processor.h test/include/asm-ia64/processor.h --- linux-2.5/include/asm-ia64/processor.h 2005-01-21 10:31:58.000000000 +1100 +++ test/include/asm-ia64/processor.h 2005-01-28 13:59:37.236992600 +1100 @@ -43,14 +43,6 @@ #define TASK_SIZE (current->thread.task_size) /* - * MM_VM_SIZE(mm) gives the maximum address (plus 1) which may contain a mapping for - * address-space MM. Note that with 32-bit tasks, this is still DEFAULT_TASK_SIZE, - * because the kernel may have installed helper-mappings above TASK_SIZE. For example, - * for x86 emulation, the LDT and GDT are mapped above TASK_SIZE. - */ -#define MM_VM_SIZE(mm) DEFAULT_TASK_SIZE - -/* * This decides where the kernel will search for a free chunk of vm * space during mmap's. */ diff -urN linux-2.5/include/linux/mm.h test/include/linux/mm.h --- linux-2.5/include/linux/mm.h 2005-01-17 12:07:10.000000000 +1100 +++ test/include/linux/mm.h 2005-01-28 13:59:37.310981352 +1100 @@ -37,10 +37,6 @@ #include <asm/processor.h> #include <asm/atomic.h> -#ifndef MM_VM_SIZE -#define MM_VM_SIZE(mm) TASK_SIZE -#endif - #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) /* diff -urN linux-2.5/include/linux/sched.h test/include/linux/sched.h --- linux-2.5/include/linux/sched.h 2005-01-21 10:31:58.000000000 +1100 +++ test/include/linux/sched.h 2005-01-28 13:59:37.333977856 +1100 @@ -231,6 +231,7 @@ unsigned long arg_start, arg_end, env_start, env_end; unsigned long rss, anon_rss, total_vm, locked_vm, shared_vm; unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes; + unsigned long max_addr; unsigned long saved_auxv[42]; /* for /proc/PID/auxv */ diff -urN linux-2.5/mm/mmap.c test/mm/mmap.c --- linux-2.5/mm/mmap.c 2005-01-13 11:01:00.000000000 +1100 +++ test/mm/mmap.c 2005-01-28 14:01:51.889892368 +1100 @@ -409,6 +409,8 @@ if (mapping) spin_unlock(&mapping->i_mmap_lock); + if (vma->vm_end && vma->vm_end - 1 > mm->max_addr) + mm->max_addr = vma->vm_end - 1; mm->map_count++; validate_mm(mm); } @@ -598,6 +600,8 @@ } } + if (vma->vm_end && vma->vm_end - 1 > mm->max_addr) + mm->max_addr = vma->vm_end - 1; validate_mm(mm); } @@ -1983,6 +1987,7 @@ struct mmu_gather *tlb; struct vm_area_struct *vma; unsigned long nr_accounted = 0; + int nr_pgds; lru_add_drain(); @@ -1996,9 +2001,9 @@ vm_unacct_memory(nr_accounted); BUG_ON(mm->map_count); /* This is just debugging */ clear_page_range(tlb, FIRST_USER_PGD_NR * PGDIR_SIZE, - (TASK_SIZE + PGDIR_SIZE - 1) & PGDIR_MASK); + (mm->max_addr + PGDIR_SIZE) & PGDIR_MASK); - tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm)); + tlb_finish_mmu(tlb, 0, mm->max_addr + 1); vma = mm->mmap; mm->mmap = mm->mmap_cache = NULL; ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-28 3:11 ` Paul Mackerras @ 2005-01-28 3:17 ` Andrew Morton 2005-01-28 6:40 ` Andi Kleen 2005-01-28 8:46 ` Russell King 0 siblings, 2 replies; 32+ messages in thread From: Andrew Morton @ 2005-01-28 3:17 UTC (permalink / raw) To: Paul Mackerras; +Cc: ak, davem, dwmw2, linux-arch Paul Mackerras <paulus@samba.org> wrote: > > > Here is a patch that I did recently to reduce the overhead of > > clear_page_tables() when using 64k pages on ppc64. It keeps a record > > of the maximum address that has been used in each mm_struct. With > > this we can kill MM_VM_SIZE. > > And I sent the 2.6.10 version of the patch, unfortunately. Here is a > patch against current BK. OK.. I'll drop Anton's patch: From: Anton Blanchard <anton@samba.org> The 4 level pagetable code changed the exit_mmap code to rely on TASK_SIZE. On some architectures (eg ppc64 and ia64), this is a per task property and bad things can happen in certain circumstances when using it. It is possible for one task to end up "owning" an mm from another - we have seen this with the procfs code when process 1 accesses /proc/pid/cmdline of process 2 while it is exiting. Process 2 exits but does not tear its mm down. Later on process 1 finishes with the proc file and the mm gets torn down at this point. Now if process 1 was 32bit and process 2 was 64bit then we end up using a bad value for TASK_SIZE in exit_mmap. We only tear down part of the address space and leave half initialised pagetables and entries in the MMU etc. MM_VM_SIZE() was created for this purpose (and is used in the next line for tlb_finish_mmu), so use it. I moved the PGD round up of TASK_SIZE into the default MM_VM_SIZE. As an aside, all architectures except one define FIRST_USER_PGD_NR as 0: include/asm-arm26/pgtable.h:#define FIRST_USER_PGD_NR 1 It would be nice to get rid of one more magic constant and just clear from 0 ... MM_VM_SIZE(). That would make it consistent with the tlb_flush_mmu call below it too. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> --- 25-akpm/include/linux/mm.h | 2 +- 25-akpm/mm/mmap.c | 3 +-- 2 files changed, 2 insertions(+), 3 deletions(-) diff -puN include/linux/mm.h~use-mm_vm_size-in-exit_mmap include/linux/mm.h --- 25/include/linux/mm.h~use-mm_vm_size-in-exit_mmap 2005-01-25 21:41:44.365536624 -0800 +++ 25-akpm/include/linux/mm.h 2005-01-25 21:41:44.371535712 -0800 @@ -38,7 +38,7 @@ extern int sysctl_legacy_va_layout; #include <asm/atomic.h> #ifndef MM_VM_SIZE -#define MM_VM_SIZE(mm) TASK_SIZE +#define MM_VM_SIZE(mm) ((TASK_SIZE + PGDIR_SIZE - 1) & PGDIR_MASK) #endif #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) diff -puN mm/mmap.c~use-mm_vm_size-in-exit_mmap mm/mmap.c --- 25/mm/mmap.c~use-mm_vm_size-in-exit_mmap 2005-01-25 21:41:44.367536320 -0800 +++ 25-akpm/mm/mmap.c 2005-01-25 21:41:44.373535408 -0800 @@ -1980,8 +1980,7 @@ void exit_mmap(struct mm_struct *mm) ~0UL, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); BUG_ON(mm->map_count); /* This is just debugging */ - clear_page_range(tlb, FIRST_USER_PGD_NR * PGDIR_SIZE, - (TASK_SIZE + PGDIR_SIZE - 1) & PGDIR_MASK); + clear_page_range(tlb, FIRST_USER_PGD_NR * PGDIR_SIZE, MM_VM_SIZE(mm)); tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm)); _ ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-28 3:17 ` Andrew Morton @ 2005-01-28 6:40 ` Andi Kleen 2005-01-29 11:23 ` Anton Blanchard 2005-01-28 8:46 ` Russell King 1 sibling, 1 reply; 32+ messages in thread From: Andi Kleen @ 2005-01-28 6:40 UTC (permalink / raw) To: Andrew Morton; +Cc: Paul Mackerras, ak, davem, dwmw2, linux-arch On Thu, Jan 27, 2005 at 07:17:38PM -0800, Andrew Morton wrote: > Paul Mackerras <paulus@samba.org> wrote: > > > > > Here is a patch that I did recently to reduce the overhead of > > > clear_page_tables() when using 64k pages on ppc64. It keeps a record > > > of the maximum address that has been used in each mm_struct. With > > > this we can kill MM_VM_SIZE. > > > > And I sent the 2.6.10 version of the patch, unfortunately. Here is a > > patch against current BK. > > OK.. I'll drop Anton's patch: Frankly Anton's patch seems simpler and equivalent to me. -Andi ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-28 6:40 ` Andi Kleen @ 2005-01-29 11:23 ` Anton Blanchard 0 siblings, 0 replies; 32+ messages in thread From: Anton Blanchard @ 2005-01-29 11:23 UTC (permalink / raw) To: Andi Kleen; +Cc: Andrew Morton, Paul Mackerras, davem, dwmw2, linux-arch > Frankly Anton's patch seems simpler and equivalent to me. I like Pauls patch :) It speeds up teardown of processes since we only walk as much of the pagetables as is mapped (instead of the entire address space). Anton ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-28 3:17 ` Andrew Morton 2005-01-28 6:40 ` Andi Kleen @ 2005-01-28 8:46 ` Russell King 1 sibling, 0 replies; 32+ messages in thread From: Russell King @ 2005-01-28 8:46 UTC (permalink / raw) To: Andrew Morton; +Cc: Paul Mackerras, ak, davem, dwmw2, linux-arch On Thu, Jan 27, 2005 at 07:17:38PM -0800, Andrew Morton wrote: > As an aside, all architectures except one define FIRST_USER_PGD_NR as 0: > > include/asm-arm26/pgtable.h:#define FIRST_USER_PGD_NR 1 After seeing Andi's comments - if there is a chance that this patch gets merged, please can the commentry be corrected to reflect reality instead of being misleading? -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/ 2.6 Serial core ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-28 2:58 ` Paul Mackerras 2005-01-28 3:11 ` Paul Mackerras @ 2005-01-28 6:39 ` Andi Kleen 2005-01-28 11:32 ` David Woodhouse 1 sibling, 1 reply; 32+ messages in thread From: Andi Kleen @ 2005-01-28 6:39 UTC (permalink / raw) To: Paul Mackerras; +Cc: Andrew Morton, Andi Kleen, davem, dwmw2, linux-arch On Fri, Jan 28, 2005 at 01:58:04PM +1100, Paul Mackerras wrote: > Here is a patch that I did recently to reduce the overhead of > clear_page_tables() when using 64k pages on ppc64. It keeps a record > of the maximum address that has been used in each mm_struct. With > this we can kill MM_VM_SIZE. > > Andrew - could this go in -mm perhaps? How does this work? The stack is always at the top of address space and it will never change anything. Or did you do this only to optimize 32bit processes on 64bit kernels? If yes then the test in TASK_SIZE or MM_VM_SIZE should take care of it anyways. -Andi ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-28 6:39 ` Andi Kleen @ 2005-01-28 11:32 ` David Woodhouse 0 siblings, 0 replies; 32+ messages in thread From: David Woodhouse @ 2005-01-28 11:32 UTC (permalink / raw) To: Andi Kleen; +Cc: Paul Mackerras, Andrew Morton, davem, linux-arch On Fri, 2005-01-28 at 07:39 +0100, Andi Kleen wrote: > How does this work? The stack is always at the top of address space > and it will never change anything. Is that the case? Doesn't ia64 put some extra things required for emulation (virtual LDT etc?) in the virtual address space above 4GiB, for 32-bit processes? > Or did you do this only to optimize 32bit processes on 64bit kernels? > If yes then the test in TASK_SIZE or MM_VM_SIZE should take care of it > anyways. The test in TASK_SIZE gets it wrong as discussed. -- dwmw2 ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-25 23:52 ` David S. Miller 2005-01-26 6:36 ` Andi Kleen @ 2005-01-26 7:54 ` David Woodhouse 1 sibling, 0 replies; 32+ messages in thread From: David Woodhouse @ 2005-01-26 7:54 UTC (permalink / raw) To: David S. Miller; +Cc: linux-arch On Tue, 2005-01-25 at 15:52 -0800, David S. Miller wrote: > I think the easiest way to fix this for good is to simply > kill off TASK_SIZE and that way each and every use will be > audited. We can replace TASK_SIZE with something the describes > what is really wanted: You're right. And we should replace MM_VM_SIZE() too -- as you say, nobody uses the (mm) argument because it's useless. TASK_SIZE should be renamed to current_task_size(), because that's what it actually does. MM_VM_SIZE(mm) should become MAX_TASK_SIZE, for much the same reason. Then perhaps we can also add thread_task_size(tsk) in case any of the current users of MM_VM_SIZE() or the currently-broken users of TASK_SIZE actually have access to the thread_info in question and really want the real answer instead of MAX_TASK_SIZE. -- dwmw2 ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-25 22:26 TASK_SIZE is variable David Woodhouse 2005-01-25 23:52 ` David S. Miller @ 2005-01-29 20:23 ` Andrew Morton 2005-01-29 23:28 ` Paul Mackerras 1 sibling, 1 reply; 32+ messages in thread From: Andrew Morton @ 2005-01-29 20:23 UTC (permalink / raw) To: David Woodhouse; +Cc: linux-arch Guys, could I get a bit of clarity on how we want to proceed with all of this? I have the below two patches queued. Speak now or forever stfu! From: David Woodhouse <dwmw2@infradead.org> Bad things can happen if a 32-bit process is the last user of a 64-bit mm. TASK_SIZE isn't a constant, and we can end up clearing page tables only up to the 32-bit TASK_SIZE instead of all the way. We should probably double-check every instance of TASK_SIZE or USER_PTRS_PER_PGD for this kind of problem. We should also double-check that MM_VM_SIZE() and other such things are correctly defined on all architectures. I already fixed ppc64 which let it stay as TASK_SIZE, and hence dependent on the _current_ context instead of the mm in the argument. Signed-off-by: Andrew Morton <akpm@osdl.org> --- 25-akpm/mm/mmap.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff -puN mm/mmap.c~task_size-is-variable mm/mmap.c --- 25/mm/mmap.c~task_size-is-variable 2005-01-25 22:08:40.903785456 -0800 +++ 25-akpm/mm/mmap.c 2005-01-25 22:08:40.908784696 -0800 @@ -1612,8 +1612,8 @@ static void free_pgtables(struct mmu_gat unsigned long last = end + PGDIR_SIZE - 1; struct mm_struct *mm = tlb->mm; - if (last > TASK_SIZE || last < end) - last = TASK_SIZE; + if (last > MM_VM_SIZE(mm) || last < end) + last = MM_VM_SIZE(mm); if (!prev) { prev = mm->mmap; _ From: Anton Blanchard <anton@samba.org> The 4 level pagetable code changed the exit_mmap code to rely on TASK_SIZE. On some architectures (eg ppc64 and ia64), this is a per task property and bad things can happen in certain circumstances when using it. It is possible for one task to end up "owning" an mm from another - we have seen this with the procfs code when process 1 accesses /proc/pid/cmdline of process 2 while it is exiting. Process 2 exits but does not tear its mm down. Later on process 1 finishes with the proc file and the mm gets torn down at this point. Now if process 1 was 32bit and process 2 was 64bit then we end up using a bad value for TASK_SIZE in exit_mmap. We only tear down part of the address space and leave half initialised pagetables and entries in the MMU etc. MM_VM_SIZE() was created for this purpose (and is used in the next line for tlb_finish_mmu), so use it. I moved the PGD round up of TASK_SIZE into the default MM_VM_SIZE. As an aside, all architectures except one define FIRST_USER_PGD_NR as 0: include/asm-arm26/pgtable.h:#define FIRST_USER_PGD_NR 1 It would be nice to get rid of one more magic constant and just clear from 0 ... MM_VM_SIZE(). That would make it consistent with the tlb_flush_mmu call below it too. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> --- 25-akpm/include/linux/mm.h | 2 +- 25-akpm/mm/mmap.c | 3 +-- 2 files changed, 2 insertions(+), 3 deletions(-) diff -puN include/linux/mm.h~use-mm_vm_size-in-exit_mmap include/linux/mm.h --- 25/include/linux/mm.h~use-mm_vm_size-in-exit_mmap 2005-01-25 21:41:44.365536624 -0800 +++ 25-akpm/include/linux/mm.h 2005-01-25 21:41:44.371535712 -0800 @@ -38,7 +38,7 @@ extern int sysctl_legacy_va_layout; #include <asm/atomic.h> #ifndef MM_VM_SIZE -#define MM_VM_SIZE(mm) TASK_SIZE +#define MM_VM_SIZE(mm) ((TASK_SIZE + PGDIR_SIZE - 1) & PGDIR_MASK) #endif #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) diff -puN mm/mmap.c~use-mm_vm_size-in-exit_mmap mm/mmap.c --- 25/mm/mmap.c~use-mm_vm_size-in-exit_mmap 2005-01-25 21:41:44.367536320 -0800 +++ 25-akpm/mm/mmap.c 2005-01-25 21:41:44.373535408 -0800 @@ -1980,8 +1980,7 @@ void exit_mmap(struct mm_struct *mm) ~0UL, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); BUG_ON(mm->map_count); /* This is just debugging */ - clear_page_range(tlb, FIRST_USER_PGD_NR * PGDIR_SIZE, - (TASK_SIZE + PGDIR_SIZE - 1) & PGDIR_MASK); + clear_page_range(tlb, FIRST_USER_PGD_NR * PGDIR_SIZE, MM_VM_SIZE(mm)); tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm)); _ ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-29 20:23 ` Andrew Morton @ 2005-01-29 23:28 ` Paul Mackerras 2005-01-30 11:01 ` Andi Kleen 2005-01-31 2:33 ` Matthew Wilcox 0 siblings, 2 replies; 32+ messages in thread From: Paul Mackerras @ 2005-01-29 23:28 UTC (permalink / raw) To: Andrew Morton; +Cc: David Woodhouse, linux-arch Andrew Morton writes: > Guys, could I get a bit of clarity on how we want to proceed with all of > this? > > I have the below two patches queued. Speak now or forever stfu! In both cases (free_pgtables and exit_mmap) we want to do something to the whole of an address space which may not be ours. With these patches we use MM_VM_SIZE(mm) to tell us how big the address space is rather than TASK_SIZE, which is a good start. However, the default MM_VM_SIZE is still (based on) TASK_SIZE, and each 64-bit arch that has TASK_SIZE as a per-task property then has to override it (except that parisc doesn't seem to). Also, having the mm as an argument to MM_VM_SIZE is currently useless because there is nothing in the mm that is helpful. Thus the two patches you included will fix the problems, at the expense of having to scan the whole pgd at exit time, even though we really only need to look at the first 2 entries on ppc64 for the very common case where the mm was used by a 32-bit process. IMHO it would be cleaner and more efficient to have a way to tell how much of an address space has actually been used, just from the mm, without having to know what task(s) used the mm and whether they were 32-bit or 64-bit tasks. That is what my mm->max_addr patch was aiming at. Having mm->max_addr should also work correctly across all architectures, even the ones that have mappings above the 4GB point for 32-bit processes, as long as they create vma's for those mappings. Thus I would like to see the mm->max_addr patch go in. If architectures still want to control MM_VM_SIZE, we could have the default definition of MM_VM_SIZE(mm) be mm->max_addr + 1 (or alternatively make mm->max_addr be the address space size rather than the max address - I made it the max address because I was concerned that there might be 32-bit architectures that allowed mappings up to and including 0xffffffff). I am happy to make a combined patch if people agree on the approach. Paul. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-29 23:28 ` Paul Mackerras @ 2005-01-30 11:01 ` Andi Kleen 2005-01-30 12:10 ` Paul Mackerras 2005-01-31 2:33 ` Matthew Wilcox 1 sibling, 1 reply; 32+ messages in thread From: Andi Kleen @ 2005-01-30 11:01 UTC (permalink / raw) To: Paul Mackerras; +Cc: Andrew Morton, David Woodhouse, linux-arch > Thus the two patches you included will fix the problems, at the > expense of having to scan the whole pgd at exit time, even though we > really only need to look at the first 2 entries on ppc64 for the > very common case where the mm was used by a 32-bit process. I plan to fix this anyways in a more generic way - 4 level currently has some bad performance regression because it scans much more pagetables. DaveM had an old patch to use bitmaps for used entries in struct page. For your 32bit processes only the first bit would be set and it would not look at most of the pgds. The plan was to redo Dave's old patch for 4 level (I wanted to redo it a bit because I didn't like how his iterators worked) Once that it is in you will get the fast scanning for free. Should hopefully happen soon. -Andi ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-30 11:01 ` Andi Kleen @ 2005-01-30 12:10 ` Paul Mackerras 2005-01-31 2:23 ` David S. Miller 0 siblings, 1 reply; 32+ messages in thread From: Paul Mackerras @ 2005-01-30 12:10 UTC (permalink / raw) To: Andi Kleen; +Cc: Andrew Morton, David Woodhouse, linux-arch Andi Kleen writes: > I plan to fix this anyways in a more generic way - 4 level currently > has some bad performance regression because it scans much more pagetables. > DaveM had an old patch to use bitmaps for used entries in struct page. > For your 32bit processes only the first bit would be set and it would not > look at most of the pgds. The plan was to redo Dave's old patch > for 4 level (I wanted to redo it a bit because I didn't like how > his iterators worked) Sounds good. I'm looking forward to the patch. In the meantime I think that MM_VM_SIZE(mm) should be renamed to MAX_TASK_SIZE without the mm argument. MAX_TASK_SIZE can default to TASK_SIZE and be overridden on architectures that have more than one task size. Paul. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-30 12:10 ` Paul Mackerras @ 2005-01-31 2:23 ` David S. Miller 2005-01-31 9:23 ` Andi Kleen 0 siblings, 1 reply; 32+ messages in thread From: David S. Miller @ 2005-01-31 2:23 UTC (permalink / raw) To: Paul Mackerras; +Cc: ak, akpm, dwmw2, linux-arch On Sun, 30 Jan 2005 23:10:02 +1100 Paul Mackerras <paulus@samba.org> wrote: > Andi Kleen writes: > > > I plan to fix this anyways in a more generic way - 4 level currently > > has some bad performance regression because it scans much more pagetables. > > DaveM had an old patch to use bitmaps for used entries in struct page. > > For your 32bit processes only the first bit would be set and it would not > > look at most of the pgds. The plan was to redo Dave's old patch > > for 4 level (I wanted to redo it a bit because I didn't like how > > his iterators worked) > > Sounds good. I'm looking forward to the patch. > > In the meantime I think that MM_VM_SIZE(mm) should be renamed to > MAX_TASK_SIZE without the mm argument. MAX_TASK_SIZE can default to > TASK_SIZE and be overridden on architectures that have more than one > task size. I'm looking forward very much to Andi's work as well. However, I like the mm->max_addr idea because that could be used for the mmap()/munmap()/mremap() sanity checks as well instead of bogus TASK_SIZE. There are some compat syscalls for the mmap like stuff that exists only to validate the address ranges for compat task limits. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-31 2:23 ` David S. Miller @ 2005-01-31 9:23 ` Andi Kleen 2005-01-31 19:29 ` David S. Miller 0 siblings, 1 reply; 32+ messages in thread From: Andi Kleen @ 2005-01-31 9:23 UTC (permalink / raw) To: David S. Miller; +Cc: Paul Mackerras, ak, akpm, dwmw2, linux-arch > However, I like the mm->max_addr idea because that could be used > for the mmap()/munmap()/mremap() sanity checks as well instead of > bogus TASK_SIZE. Hmm, but in process context it is not bogus is it? -Andi ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-31 9:23 ` Andi Kleen @ 2005-01-31 19:29 ` David S. Miller 2005-01-31 19:38 ` Andi Kleen 0 siblings, 1 reply; 32+ messages in thread From: David S. Miller @ 2005-01-31 19:29 UTC (permalink / raw) To: Andi Kleen; +Cc: paulus, akpm, dwmw2, linux-arch On Mon, 31 Jan 2005 10:23:54 +0100 Andi Kleen <ak@suse.de> wrote: > > However, I like the mm->max_addr idea because that could be used > > for the mmap()/munmap()/mremap() sanity checks as well instead of > > bogus TASK_SIZE. > > Hmm, but in process context it is not bogus is it? I guess you're suggesting that there could be times when mm->max_addr and the current thread's address space disposition are out of sync? That is certainly something to check out. And I am rather certain that it is flush_old_exec() which should transition mm->max_addr. It is the only spot where the address space API logically changes. More precisely, this occurs when exec_mmap() completes successfully, so maybe it should happen there and in that case. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-31 19:29 ` David S. Miller @ 2005-01-31 19:38 ` Andi Kleen 2005-01-31 20:35 ` David S. Miller 0 siblings, 1 reply; 32+ messages in thread From: Andi Kleen @ 2005-01-31 19:38 UTC (permalink / raw) To: David S. Miller; +Cc: Andi Kleen, paulus, akpm, dwmw2, linux-arch On Mon, Jan 31, 2005 at 11:29:57AM -0800, David S. Miller wrote: > On Mon, 31 Jan 2005 10:23:54 +0100 > Andi Kleen <ak@suse.de> wrote: > > > > However, I like the mm->max_addr idea because that could be used > > > for the mmap()/munmap()/mremap() sanity checks as well instead of > > > bogus TASK_SIZE. > > > > Hmm, but in process context it is not bogus is it? > > I guess you're suggesting that there could be times when > mm->max_addr and the current thread's address space disposition > are out of sync? No, i just think it's overkill for the simple task Paul wants it for. And in mmap/munmap you can just trust what TASK_SIZE tells you because it's in the correct process context. -Andi ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-31 19:38 ` Andi Kleen @ 2005-01-31 20:35 ` David S. Miller 2005-02-03 4:08 ` Andrew Morton 0 siblings, 1 reply; 32+ messages in thread From: David S. Miller @ 2005-01-31 20:35 UTC (permalink / raw) To: Andi Kleen; +Cc: paulus, akpm, dwmw2, linux-arch On Mon, 31 Jan 2005 20:38:28 +0100 Andi Kleen <ak@suse.de> wrote: > No, i just think it's overkill for the simple task Paul wants > it for. And in mmap/munmap you can just trust what TASK_SIZE > tells you because it's in the correct process context. Lots of 64-bit platforms use a constant TASK_SIZE. This is why such platforms use compat syscall wrappers to validate mmap args. I'm saying that if we move over to useing mm->map_limit in the syscall validation checks, such schemes would no longer be necessary. The arity of the address space really is an MM attribute even though it is inherited from the thread. The delayed MM final release bug case via procfs is instrumental in demonstrating this fact. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-31 20:35 ` David S. Miller @ 2005-02-03 4:08 ` Andrew Morton 2005-02-03 6:28 ` David S. Miller 0 siblings, 1 reply; 32+ messages in thread From: Andrew Morton @ 2005-02-03 4:08 UTC (permalink / raw) To: David S. Miller; +Cc: ak, paulus, dwmw2, linux-arch Guys, I still don't have a clear sense of what you want done on this front. I'm still sitting on the below two patches. If 2.6.11-rc3 works correctly as-is then let's leave it alone. If it does not then can we please get a wiggle on? task_size-is-variable.patch: From: David Woodhouse <dwmw2@infradead.org> Bad things can happen if a 32-bit process is the last user of a 64-bit mm. TASK_SIZE isn't a constant, and we can end up clearing page tables only up to the 32-bit TASK_SIZE instead of all the way. We should probably double-check every instance of TASK_SIZE or USER_PTRS_PER_PGD for this kind of problem. We should also double-check that MM_VM_SIZE() and other such things are correctly defined on all architectures. I already fixed ppc64 which let it stay as TASK_SIZE, and hence dependent on the _current_ context instead of the mm in the argument. Signed-off-by: Andrew Morton <akpm@osdl.org> --- 25-akpm/mm/mmap.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff -puN mm/mmap.c~task_size-is-variable mm/mmap.c --- 25/mm/mmap.c~task_size-is-variable 2005-01-25 22:08:40.903785456 -0800 +++ 25-akpm/mm/mmap.c 2005-01-25 22:08:40.908784696 -0800 @@ -1612,8 +1612,8 @@ static void free_pgtables(struct mmu_gat unsigned long last = end + PGDIR_SIZE - 1; struct mm_struct *mm = tlb->mm; - if (last > TASK_SIZE || last < end) - last = TASK_SIZE; + if (last > MM_VM_SIZE(mm) || last < end) + last = MM_VM_SIZE(mm); if (!prev) { prev = mm->mmap; _ use-mm_vm_size-in-exit_mmap.patch: From: Anton Blanchard <anton@samba.org> The 4 level pagetable code changed the exit_mmap code to rely on TASK_SIZE. On some architectures (eg ppc64 and ia64), this is a per task property and bad things can happen in certain circumstances when using it. It is possible for one task to end up "owning" an mm from another - we have seen this with the procfs code when process 1 accesses /proc/pid/cmdline of process 2 while it is exiting. Process 2 exits but does not tear its mm down. Later on process 1 finishes with the proc file and the mm gets torn down at this point. Now if process 1 was 32bit and process 2 was 64bit then we end up using a bad value for TASK_SIZE in exit_mmap. We only tear down part of the address space and leave half initialised pagetables and entries in the MMU etc. MM_VM_SIZE() was created for this purpose (and is used in the next line for tlb_finish_mmu), so use it. I moved the PGD round up of TASK_SIZE into the default MM_VM_SIZE. As an aside, all architectures except one define FIRST_USER_PGD_NR as 0: include/asm-arm26/pgtable.h:#define FIRST_USER_PGD_NR 1 It would be nice to get rid of one more magic constant and just clear from 0 ... MM_VM_SIZE(). That would make it consistent with the tlb_flush_mmu call below it too. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> --- 25-akpm/include/linux/mm.h | 2 +- 25-akpm/mm/mmap.c | 3 +-- 2 files changed, 2 insertions(+), 3 deletions(-) diff -puN include/linux/mm.h~use-mm_vm_size-in-exit_mmap include/linux/mm.h --- 25/include/linux/mm.h~use-mm_vm_size-in-exit_mmap 2005-01-25 21:41:44.365536624 -0800 +++ 25-akpm/include/linux/mm.h 2005-01-25 21:41:44.371535712 -0800 @@ -38,7 +38,7 @@ extern int sysctl_legacy_va_layout; #include <asm/atomic.h> #ifndef MM_VM_SIZE -#define MM_VM_SIZE(mm) TASK_SIZE +#define MM_VM_SIZE(mm) ((TASK_SIZE + PGDIR_SIZE - 1) & PGDIR_MASK) #endif #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) diff -puN mm/mmap.c~use-mm_vm_size-in-exit_mmap mm/mmap.c --- 25/mm/mmap.c~use-mm_vm_size-in-exit_mmap 2005-01-25 21:41:44.367536320 -0800 +++ 25-akpm/mm/mmap.c 2005-01-25 21:41:44.373535408 -0800 @@ -1980,8 +1980,7 @@ void exit_mmap(struct mm_struct *mm) ~0UL, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); BUG_ON(mm->map_count); /* This is just debugging */ - clear_page_range(tlb, FIRST_USER_PGD_NR * PGDIR_SIZE, - (TASK_SIZE + PGDIR_SIZE - 1) & PGDIR_MASK); + clear_page_range(tlb, FIRST_USER_PGD_NR * PGDIR_SIZE, MM_VM_SIZE(mm)); tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm)); _ ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-02-03 4:08 ` Andrew Morton @ 2005-02-03 6:28 ` David S. Miller 2005-02-03 7:19 ` Andi Kleen 0 siblings, 1 reply; 32+ messages in thread From: David S. Miller @ 2005-02-03 6:28 UTC (permalink / raw) To: Andrew Morton; +Cc: ak, paulus, dwmw2, linux-arch On Wed, 2 Feb 2005 20:08:12 -0800 Andrew Morton <akpm@osdl.org> wrote: > Guys, I still don't have a clear sense of what you want done on this front. > I'm still sitting on the below two patches. > > If 2.6.11-rc3 works correctly as-is then let's leave it alone. If it does > not then can we please get a wiggle on? I'm willing to go whatever way people want to get a fix quickly into 2.6.11-final. But longer term I think the mm->vm_limit idea is the better direction longer term. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-02-03 6:28 ` David S. Miller @ 2005-02-03 7:19 ` Andi Kleen 2005-02-03 9:23 ` David Woodhouse 0 siblings, 1 reply; 32+ messages in thread From: Andi Kleen @ 2005-02-03 7:19 UTC (permalink / raw) To: David S. Miller; +Cc: Andrew Morton, ak, paulus, dwmw2, linux-arch On Wed, Feb 02, 2005 at 10:28:41PM -0800, David S. Miller wrote: > On Wed, 2 Feb 2005 20:08:12 -0800 > Andrew Morton <akpm@osdl.org> wrote: > > > Guys, I still don't have a clear sense of what you want done on this front. > > I'm still sitting on the below two patches. > > > > If 2.6.11-rc3 works correctly as-is then let's leave it alone. If it does > > not then can we please get a wiggle on? > > I'm willing to go whatever way people want to get a fix quickly > into 2.6.11-final. But longer term I think the mm->vm_limit idea > is the better direction longer term. For 2.6.11 MM_VM_SIZE is imho the right fix. -Andi (who reconsidered his earlier objection) ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-02-03 7:19 ` Andi Kleen @ 2005-02-03 9:23 ` David Woodhouse 0 siblings, 0 replies; 32+ messages in thread From: David Woodhouse @ 2005-02-03 9:23 UTC (permalink / raw) To: Andi Kleen; +Cc: David S. Miller, Andrew Morton, paulus, linux-arch On Thu, 2005-02-03 at 08:19 +0100, Andi Kleen wrote: > > > I'm willing to go whatever way people want to get a fix quickly > > into 2.6.11-final. But longer term I think the mm->vm_limit idea > > is the better direction longer term. > > For 2.6.11 MM_VM_SIZE is imho the right fix. Agreed. As Dave says, we can always implement the more cunning ideas later. Push what you have. -- dwmw2 ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: TASK_SIZE is variable. 2005-01-29 23:28 ` Paul Mackerras 2005-01-30 11:01 ` Andi Kleen @ 2005-01-31 2:33 ` Matthew Wilcox 1 sibling, 0 replies; 32+ messages in thread From: Matthew Wilcox @ 2005-01-31 2:33 UTC (permalink / raw) To: Paul Mackerras; +Cc: Andrew Morton, David Woodhouse, linux-arch On Sun, Jan 30, 2005 at 10:28:50AM +1100, Paul Mackerras wrote: > However, the default > MM_VM_SIZE is still (based on) TASK_SIZE, and each 64-bit arch that > has TASK_SIZE as a per-task property then has to override it (except > that parisc doesn't seem to). There's no 64-bit userspace for parisc yet ... that would involve dealing with drepper some more, and nobody wants to do that. -- "Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception." -- Mark Twain ^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread, other threads:[~2005-02-03 9:23 UTC | newest] Thread overview: 32+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2005-01-25 22:26 TASK_SIZE is variable David Woodhouse 2005-01-25 23:52 ` David S. Miller 2005-01-26 6:36 ` Andi Kleen 2005-01-26 6:41 ` David S. Miller 2005-01-26 7:13 ` Andi Kleen 2005-01-26 7:24 ` Andrew Morton 2005-01-26 7:43 ` Andi Kleen 2005-01-26 8:01 ` Andrew Morton 2005-01-26 8:04 ` Andi Kleen 2005-01-28 2:58 ` Paul Mackerras 2005-01-28 3:11 ` Paul Mackerras 2005-01-28 3:17 ` Andrew Morton 2005-01-28 6:40 ` Andi Kleen 2005-01-29 11:23 ` Anton Blanchard 2005-01-28 8:46 ` Russell King 2005-01-28 6:39 ` Andi Kleen 2005-01-28 11:32 ` David Woodhouse 2005-01-26 7:54 ` David Woodhouse 2005-01-29 20:23 ` Andrew Morton 2005-01-29 23:28 ` Paul Mackerras 2005-01-30 11:01 ` Andi Kleen 2005-01-30 12:10 ` Paul Mackerras 2005-01-31 2:23 ` David S. Miller 2005-01-31 9:23 ` Andi Kleen 2005-01-31 19:29 ` David S. Miller 2005-01-31 19:38 ` Andi Kleen 2005-01-31 20:35 ` David S. Miller 2005-02-03 4:08 ` Andrew Morton 2005-02-03 6:28 ` David S. Miller 2005-02-03 7:19 ` Andi Kleen 2005-02-03 9:23 ` David Woodhouse 2005-01-31 2:33 ` Matthew Wilcox
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox