[patch] flexible-mmap-2.6.7-D5

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [patch] flexible-mmap-2.6.7-D5
@ 2004-06-18 21:38 Ingo Molnar
  2004-06-18 23:16 ` William Lee Irwin III
  2004-06-19  8:06 ` Ingo Molnar
  0 siblings, 2 replies; 7+ messages in thread
From: Ingo Molnar @ 2004-06-18 21:38 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3204 bytes --]


the flexible-mmap-2.6.7-D5 patch (attached) splits the flexible-mmap
layout stuff out of exec-shield. The patch is against 2.6.7 + my latest
NX patch. (i've attached the NX patch too, for convenience.)

the goal of the patch is to change the way virtual memory is allocated,
from:

  0x08000000 ... binary image
  0x08xxxxxx ... brk area, grows upwards
  0x40000000 ... start of mmap, new mmaps go after old ones
  0xbfxxxxxx ... stack

to a more flexible top-down mmap() method:

  0x08000000 ... binary image
  0x08xxxxxx ... brk area, grows upwards
  0xbfxxxxxx ... _end_ of all mmaps, new mmaps go below old ones
  0xbfyyyyyy ... stack

the new layout has a couple of advantages:

- primarily this layout enables both malloc(), mmap()/shmat() users to
  utilize the full address space: 3GB on stock x86, 4GB on x86 4:4 or
  x86-64 running x86 apps.

- the new layout is also in essence 'self-tuning' the mmap() and
  malloc() limits: no hacks like /proc/PID/mmap_base are needed - both
  malloc() and mmap() can grow until all the address space is full.
  With the old layout, malloc() space was limited to 900MB, mmap() space
  to ~2GB.

- The new layout also allows very large continuous mmap()s because the
  'free space' is always a continuous hole (statistically).

- there's also a ~4K pagetable saved per typical process, because we
  dont allocate at 1GB anymore and dont fragment the VM that much -
  there are only 2 main chunks, the binary image + brk(), and the mmap()
  area + stack.

here are the kernel code changes the patch introduces:

- the patch is compatible with existing architectures that either make
  use of HAVE_ARCH_UNMAPPED_AREA or use the default mmap() allocator -
  there is no change in behavior.

- the patch gives a new mechanism to define alternate per-mm mmap()
  layouts via mm->get_unmapped_area(). On x86 i used this mechanism to
  implement a top-down allocator for PT_GNU_STACK applications.

- 64-bit architectures can use the same mechanism to clean up 32-bit
  compatibility layouts: by defining HAVE_ARCH_PICK_MMAP_LAYOUT and
  providing a arch_pick_mmap_layout() function - which can then decide
  between various mmap() layout functions.

- i also introduced a new personality bit (ADDR_COMPAT_LAYOUT) to signal
  older binaries that dont have PT_GNU_STACK. x86 uses this to revert 
  back to the stock layout. I also changed x86 to not clear the
  personality bits upon exec(), like x86-64 already does.

- once every architecture that uses HAVE_ARCH_UNMAPPED_AREA has defined
  its arch_pick_mmap_layout() function, we can get rid of
  HAVE_ARCH_UNMAPPED_AREA altogether, as a final cleanup.

i tested the patch on two distributions: one was an older one without
PT_GNU_STACK, the other had applications was with PT_GNU_STACK. On the
old one we properly generate the compatible VM layout - the PT_GNU_STACK
one got the new layout.

the new layout generation function (__get_unmapped_area()) got
significant testing in Fedora Core 1/2 and RHEL3, so it's pretty robust.

Compatibility: there's one known public application that broke in
Fedora, a serial terminal program. The application bug has been fixed.

Comments, suggestions are welcome,

	Ingo

[-- Attachment #2: flexible-mmap-2.6.7-D5 --]
[-- Type: text/plain, Size: 9497 bytes --]

--- linux/arch/i386/mm/mmap.c.orig	
+++ linux/arch/i386/mm/mmap.c	
@@ -0,0 +1,63 @@
+/*
+ *  linux/arch/i386/mm/mmap.c
+ *
+ *  flexible mmap layout support
+ *
+ * Copyright 2003-2004 Red Hat Inc., Durham, North Carolina.
+ * All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ *
+ * Started by Ingo Molnar <mingo@elte.hu>
+ */
+
+#include <linux/personality.h>
+#include <linux/mm.h>
+
+/*
+ * Top of mmap area (just below the process stack).
+ *
+ * Leave an at least ~128 MB hole. Limit maximum floor to 5/6 of
+ * the total address space, in case the stack is unlimited.
+ */
+#define MIN_GAP (128*1024*1024)
+#define MAX_GAP (TASK_SIZE/6*5)
+
+/*
+ * This function, called very early during the creation of a new
+ * process VM image, sets up which VM layout function to use:
+ */
+void arch_pick_mmap_layout(struct mm_struct *mm)
+{
+	/*
+	 * If non-PT_GNU_STACK binary then fall back to the
+	 * standard layout:
+	 */
+	if (current->personality & ADDR_COMPAT_LAYOUT) {
+		mm->mmap_base = TASK_UNMAPPED_BASE;
+		mm->get_unmapped_area = arch_get_unmapped_area;
+	} else {
+		unsigned long gap = current->rlim[RLIMIT_STACK].rlim_cur;
+
+		if (gap < MIN_GAP)
+			gap = MIN_GAP;
+		else if (gap > MAX_GAP)
+			gap = MAX_GAP;
+		mm->mmap_base = TASK_SIZE - (gap & PAGE_MASK);
+		mm->get_unmapped_area = get_unmapped_area_topdown;
+	}
+}
+
--- linux/arch/i386/mm/Makefile.orig	
+++ linux/arch/i386/mm/Makefile	
@@ -2,7 +2,7 @@
 # Makefile for the linux i386-specific parts of the memory manager.
 #
 
-obj-y	:= init.o pgtable.o fault.o ioremap.o extable.o pageattr.o 
+obj-y	:= init.o pgtable.o fault.o ioremap.o extable.o pageattr.o mmap.o
 
 obj-$(CONFIG_DISCONTIGMEM)	+= discontig.o
 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
--- linux/include/linux/sched.h.orig	
+++ linux/include/linux/sched.h	
@@ -191,10 +191,32 @@ extern int sysctl_max_map_count;
 
 #include <linux/aio.h>
 
+extern unsigned long
+arch_get_unmapped_area(struct file *, unsigned long, unsigned long,
+		       unsigned long, unsigned long);
+extern unsigned long
+get_unmapped_area_topdown(struct file *filp, unsigned long addr,
+			  unsigned long len, unsigned long pgoff,
+			  unsigned long flags);
+
+#ifdef HAVE_ARCH_PICK_MMAP_LAYOUT
+extern void arch_pick_mmap_layout(struct mm_struct *mm);
+#else
+static inline void arch_pick_mmap_layout(struct mm_struct *mm)
+{
+	mm->mmap_base = TASK_UNMAPPED_BASE;
+	mm->get_unmapped_area = arch_get_unmapped_area;
+}
+#endif
+
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
 	struct vm_area_struct * mmap_cache;	/* last find_vma result */
+	unsigned long (*get_unmapped_area) (struct file *filp,
+				unsigned long addr, unsigned long len,
+				unsigned long pgoff, unsigned long flags);
+	unsigned long mmap_base;		/* base of mmap area */
 	unsigned long free_area_cache;		/* first hole */
 	pgd_t * pgd;
 	atomic_t mm_users;			/* How many users with user space? */
--- linux/include/linux/personality.h.orig	
+++ linux/include/linux/personality.h	
@@ -30,6 +30,7 @@ extern int abi_fake_utsname;
  */
 enum {
 	MMAP_PAGE_ZERO =	0x0100000,
+	ADDR_COMPAT_LAYOUT =	0x0200000,
 	ADDR_LIMIT_32BIT =	0x0800000,
 	SHORT_INODE =		0x1000000,
 	WHOLE_SECONDS =		0x2000000,
--- linux/include/asm-i386/processor.h.orig	
+++ linux/include/asm-i386/processor.h	
@@ -302,6 +302,8 @@ extern unsigned int mca_pentium_flag;
  */
 #define TASK_UNMAPPED_BASE	(PAGE_ALIGN(TASK_SIZE / 3))
 
+#define HAVE_ARCH_PICK_MMAP_LAYOUT
+
 /*
  * Size of io_bitmap, covering ports 0 to 0x3ff.
  */
--- linux/include/asm-i386/elf.h.orig	
+++ linux/include/asm-i386/elf.h	
@@ -117,7 +117,7 @@ typedef struct user_fxsr_struct elf_fpxr
 #define AT_SYSINFO_EHDR		33
 
 #ifdef __KERNEL__
-#define SET_PERSONALITY(ex, ibcs2) set_personality((ibcs2)?PER_SVR4:PER_LINUX)
+#define SET_PERSONALITY(ex, ibcs2) do { } while (0)
 
 extern int dump_task_regs (struct task_struct *, elf_gregset_t *);
 extern int dump_task_fpu (struct task_struct *, elf_fpregset_t *);
--- linux/fs/binfmt_aout.c.orig	
+++ linux/fs/binfmt_aout.c	
@@ -307,7 +307,7 @@ static int load_aout_binary(struct linux
 		(current->mm->start_data = N_DATADDR(ex));
 	current->mm->brk = ex.a_bss +
 		(current->mm->start_brk = N_BSSADDR(ex));
-	current->mm->free_area_cache = TASK_UNMAPPED_BASE;
+	current->mm->free_area_cache = current->mm->mmap_base;
 
 	current->mm->rss = 0;
 	current->mm->mmap = NULL;
--- linux/fs/binfmt_elf.c.orig	
+++ linux/fs/binfmt_elf.c	
@@ -625,8 +625,10 @@ static int load_elf_binary(struct linux_
 				executable_stack = EXSTACK_DISABLE_X;
 			break;
 		}
-	if (i == elf_ex.e_phnum)
+	if (i == elf_ex.e_phnum) {
 		def_flags |= VM_EXEC | VM_MAYEXEC;
+		current->personality |= ADDR_COMPAT_LAYOUT;
+	}
 
 	/* Some simple consistency checks for the interpreter */
 	if (elf_interpreter) {
@@ -703,7 +705,7 @@ static int load_elf_binary(struct linux_
 	/* Do this so that we can load the interpreter, if need be.  We will
 	   change some of these later */
 	current->mm->rss = 0;
-	current->mm->free_area_cache = TASK_UNMAPPED_BASE;
+	current->mm->free_area_cache = current->mm->mmap_base;
 	retval = setup_arg_pages(bprm, executable_stack);
 	if (retval < 0) {
 		send_sig(SIGKILL, current, 0);
--- linux/fs/exec.c.orig	
+++ linux/fs/exec.c	
@@ -547,6 +547,7 @@ static int exec_mmap(struct mm_struct *m
 	tsk->active_mm = mm;
 	activate_mm(active_mm, mm);
 	task_unlock(tsk);
+	arch_pick_mmap_layout(mm);
 	if (old_mm) {
 		if (active_mm != old_mm) BUG();
 		mmput(old_mm);
--- linux/kernel/fork.c.orig	
+++ linux/kernel/fork.c	
@@ -279,7 +279,7 @@ static inline int dup_mmap(struct mm_str
 	mm->locked_vm = 0;
 	mm->mmap = NULL;
 	mm->mmap_cache = NULL;
-	mm->free_area_cache = TASK_UNMAPPED_BASE;
+	mm->free_area_cache = oldmm->mmap_base;
 	mm->map_count = 0;
 	mm->rss = 0;
 	cpus_clear(mm->cpu_vm_mask);
--- linux/mm/mmap.c.orig	
+++ linux/mm/mmap.c	
@@ -1002,7 +1002,7 @@ EXPORT_SYMBOL(do_mmap_pgoff);
  * This function "knows" that -ENOMEM has the bits set.
  */
 #ifndef HAVE_ARCH_UNMAPPED_AREA
-static inline unsigned long
+unsigned long
 arch_get_unmapped_area(struct file *filp, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags)
 {
@@ -1046,12 +1046,79 @@ full_search:
 		addr = vma->vm_end;
 	}
 }
-#else
-extern unsigned long
-arch_get_unmapped_area(struct file *, unsigned long, unsigned long,
-			unsigned long, unsigned long);
 #endif	
 
+/*
+ * This mmap-allocator allocates new areas top-down from below the
+ * stack's low limit (the base):
+ */
+unsigned long
+get_unmapped_area_topdown(struct file *filp, unsigned long addr,
+			  unsigned long len, unsigned long pgoff,
+			  unsigned long flags)
+{
+	struct vm_area_struct *vma, *prev_vma;
+	struct mm_struct *mm = current->mm;
+	unsigned long base = mm->mmap_base;
+	int first_time = 1;
+
+	/* requested length too big for entire address space */
+	if (len > TASK_SIZE) 
+		return -ENOMEM;
+
+	/* dont allow allocations above current base */
+	if (mm->free_area_cache > base)
+		mm->free_area_cache = base;
+
+	/* requesting a specific address */
+	if (addr) {
+		addr = PAGE_ALIGN(addr);
+		vma = find_vma(mm, addr);
+		if (TASK_SIZE - len >= addr && 
+				(!vma || addr + len <= vma->vm_start))
+			return addr;
+	}
+
+	/* make sure it can fit in the remaining address space */
+	if (mm->free_area_cache < len)
+		return -ENOMEM;
+
+	/* either no address requested or cant fit in requested address hole */
+try_again:
+	addr = (mm->free_area_cache - len) & PAGE_MASK;
+	do {
+ 	 	if (!(vma = find_vma_prev(mm, addr, &prev_vma)))
+			return -ENOMEM;
+
+		/*
+		 * new region fits between prev_vma->vm_end and
+		 * vma->vm_start, use it:
+		 */
+		if (addr+len <= vma->vm_start &&
+				(!prev_vma || (addr >= prev_vma->vm_end)))
+			/* remember the address as a hint for next time */
+			return (mm->free_area_cache = addr);
+		else
+			/* pull free_area_cache down to the first hole */
+			if (mm->free_area_cache == vma->vm_end)
+				mm->free_area_cache = vma->vm_start;	
+
+		/* try just below the current vma->vm_start */
+		addr = vma->vm_start-len;
+	} while (len <= vma->vm_start);
+
+	/*
+	 * if hint left us with no space for the requested
+	 * mapping then try again:
+	 */
+	if (first_time) {
+		mm->free_area_cache = base;
+		first_time = 0;
+		goto try_again;
+	}
+	return -ENOMEM;
+}
+
 unsigned long
 get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 		unsigned long pgoff, unsigned long flags)
@@ -1086,7 +1153,7 @@ get_unmapped_area(struct file *file, uns
 		return file->f_op->get_unmapped_area(file, addr, len,
 						pgoff, flags);
 
-	return arch_get_unmapped_area(file, addr, len, pgoff, flags);
+	return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
 }
 
 EXPORT_SYMBOL(get_unmapped_area);

[-- Attachment #3: nx-2.6.7-A1 --]
[-- Type: text/plain, Size: 18315 bytes --]

--- linux/arch/i386/kernel/cpu/proc.c.orig	
+++ linux/arch/i386/kernel/cpu/proc.c	
@@ -27,7 +27,7 @@ static int show_cpuinfo(struct seq_file 
 		/* AMD-defined */
 		NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
 		NULL, NULL, NULL, "syscall", NULL, NULL, NULL, NULL,
-		NULL, NULL, NULL, "mp", NULL, NULL, "mmxext", NULL,
+		NULL, NULL, NULL, "mp", "nx", NULL, "mmxext", NULL,
 		NULL, NULL, NULL, NULL, NULL, "lm", "3dnowext", "3dnow",
 
 		/* Transmeta-defined */
--- linux/arch/i386/kernel/head.S.orig	
+++ linux/arch/i386/kernel/head.S	
@@ -153,6 +153,32 @@ ENTRY(startup_32_smp)
 	orl %edx,%eax
 	movl %eax,%cr4
 
+	btl $5, %eax		# check if PAE is enabled
+	jnc 6f
+
+	/* Check if extended functions are implemented */
+	movl $0x80000000, %eax
+	cpuid
+	cmpl $0x80000000, %eax
+	jbe 6f
+	mov $0x80000001, %eax
+	cpuid
+	/* Execute Disable bit supported? */
+	btl $20, %edx
+	jnc 6f
+
+	/* Setup EFER (Extended Feature Enable Register) */
+	movl $0xc0000080, %ecx
+	rdmsr
+
+	btsl $11, %eax
+	/* Make changes effective */
+	wrmsr
+
+6:
+	/* cpuid clobbered ebx, set it up again: */
+	xorl %ebx,%ebx
+	incl %ebx
 3:
 #endif /* CONFIG_SMP */
 
--- linux/arch/i386/kernel/sysenter.c.orig	
+++ linux/arch/i386/kernel/sysenter.c	
@@ -45,7 +45,7 @@ static int __init sysenter_setup(void)
 {
 	unsigned long page = get_zeroed_page(GFP_ATOMIC);
 
-	__set_fixmap(FIX_VSYSCALL, __pa(page), PAGE_READONLY);
+	__set_fixmap(FIX_VSYSCALL, __pa(page), PAGE_READONLY_EXEC);
 
 	if (!boot_cpu_has(X86_FEATURE_SEP)) {
 		memcpy((void *) page,
--- linux/arch/i386/kernel/module.c.orig	
+++ linux/arch/i386/kernel/module.c	
@@ -32,7 +32,7 @@ void *module_alloc(unsigned long size)
 {
 	if (size == 0)
 		return NULL;
-	return vmalloc(size);
+	return vmalloc_exec(size);
 }
 
 
--- linux/arch/i386/mm/init.c.orig	
+++ linux/arch/i386/mm/init.c	
@@ -123,6 +123,13 @@ static void __init page_table_range_init
 	}
 }
 
+static inline int is_kernel_text(unsigned long addr)
+{
+	if (addr >= (unsigned long)_stext && addr <= (unsigned long)__init_end)
+		return 1;
+	return 0;
+}
+
 /*
  * This maps the physical memory to kernel virtual address space, a total 
  * of max_low_pfn pages, by creating page tables starting from address 
@@ -145,18 +152,29 @@ static void __init kernel_physical_mappi
 		if (pfn >= max_low_pfn)
 			continue;
 		for (pmd_idx = 0; pmd_idx < PTRS_PER_PMD && pfn < max_low_pfn; pmd++, pmd_idx++) {
+			unsigned int address = pfn * PAGE_SIZE + PAGE_OFFSET;
+
 			/* Map with big pages if possible, otherwise create normal page tables. */
 			if (cpu_has_pse) {
-				set_pmd(pmd, pfn_pmd(pfn, PAGE_KERNEL_LARGE));
+				unsigned int address2 = (pfn + PTRS_PER_PTE - 1) * PAGE_SIZE + PAGE_OFFSET + PAGE_SIZE-1;
+
+				if (is_kernel_text(address) || is_kernel_text(address2))
+					set_pmd(pmd, pfn_pmd(pfn, PAGE_KERNEL_LARGE_EXEC));
+				else
+					set_pmd(pmd, pfn_pmd(pfn, PAGE_KERNEL_LARGE));
 				pfn += PTRS_PER_PTE;
 			} else {
 				pte = one_page_table_init(pmd);
 
-				for (pte_ofs = 0; pte_ofs < PTRS_PER_PTE && pfn < max_low_pfn; pte++, pfn++, pte_ofs++)
-					set_pte(pte, pfn_pte(pfn, PAGE_KERNEL));
+				for (pte_ofs = 0; pte_ofs < PTRS_PER_PTE && pfn < max_low_pfn; pte++, pfn++, pte_ofs++) {
+						if (is_kernel_text(address))
+							set_pte(pte, pfn_pte(pfn, PAGE_KERNEL_EXEC));
+						else
+							set_pte(pte, pfn_pte(pfn, PAGE_KERNEL));
+				}
 			}
 		}
-	}	
+	}
 }
 
 static inline int page_kills_ppro(unsigned long pagenr)
@@ -273,7 +291,8 @@ extern void set_highmem_pages_init(int);
 #define set_highmem_pages_init(bad_ppro) do { } while (0)
 #endif /* CONFIG_HIGHMEM */
 
-unsigned long __PAGE_KERNEL = _PAGE_KERNEL;
+unsigned long long __PAGE_KERNEL = _PAGE_KERNEL;
+unsigned long long __PAGE_KERNEL_EXEC = _PAGE_KERNEL_EXEC;
 
 #ifndef CONFIG_DISCONTIGMEM
 #define remap_numa_kva() do {} while (0)
@@ -302,6 +321,7 @@ static void __init pagetable_init (void)
 	if (cpu_has_pge) {
 		set_in_cr4(X86_CR4_PGE);
 		__PAGE_KERNEL |= _PAGE_GLOBAL;
+		__PAGE_KERNEL_EXEC |= _PAGE_GLOBAL;
 	}
 
 	kernel_physical_mapping_init(pgd_base);
@@ -392,6 +412,52 @@ void __init zone_sizes_init(void)
 extern void zone_sizes_init(void);
 #endif /* !CONFIG_DISCONTIGMEM */
 
+static int disable_nx __initdata = 0;
+u64 __supported_pte_mask = ~_PAGE_NX;
+
+/*
+ * noexec = on|off
+ *
+ * Control non executable mappings.
+ *
+ * on      Enable
+ * off     Disable
+ */
+static int __init noexec_setup(char *str)
+{
+	if (!strncmp(str, "on",2) && cpu_has_nx) {
+		__supported_pte_mask |= _PAGE_NX;
+		disable_nx = 0;
+	} else if (!strncmp(str,"off",3)) {
+		disable_nx = 1;
+		__supported_pte_mask &= ~_PAGE_NX;
+	}
+	return 1;
+}
+
+__setup("noexec=", noexec_setup);
+
+#ifdef CONFIG_X86_PAE
+static int use_nx = 0;
+
+static void __init set_nx(void)
+{
+	unsigned int v[4], l, h;
+
+	if (cpu_has_pae && (cpuid_eax(0x80000000) > 0x80000001)) {
+		cpuid(0x80000001, &v[0], &v[1], &v[2], &v[3]);
+		if ((v[3] & (1 << 20)) && !disable_nx) {
+			rdmsr(MSR_EFER, l, h);
+			l |= EFER_NX;
+			wrmsr(MSR_EFER, l, h);
+			use_nx = 1;
+			__supported_pte_mask |= _PAGE_NX;
+		}
+	}
+}
+
+#endif
+
 /*
  * paging_init() sets up the page tables - note that the first 8MB are
  * already mapped by head.S.
@@ -401,6 +467,12 @@ extern void zone_sizes_init(void);
  */
 void __init paging_init(void)
 {
+#ifdef CONFIG_X86_PAE
+	set_nx();
+	if (use_nx)
+		printk("NX (Execute Disable) protection: active\n");
+#endif
+
 	pagetable_init();
 
 	load_cr3(swapper_pg_dir);
--- linux/arch/i386/mm/fault.c.orig	
+++ linux/arch/i386/mm/fault.c	
@@ -406,6 +406,21 @@ no_context:
 
 	bust_spinlocks(1);
 
+#ifdef CONFIG_X86_PAE
+	{
+		pgd_t *pgd;
+		pmd_t *pmd;
+
+
+
+		pgd = init_mm.pgd + pgd_index(address);
+		if (pgd_present(*pgd)) {
+			pmd = pmd_offset(pgd, address);
+			if (pmd_val(*pmd) & _PAGE_NX)
+				printk(KERN_CRIT "kernel tried to access NX-protected page - exploit attempt? (uid: %d)\n", current->uid);
+		}
+	}
+#endif
 	if (address < PAGE_SIZE)
 		printk(KERN_ALERT "Unable to handle kernel NULL pointer dereference");
 	else
--- linux/arch/x86_64/kernel/module.c.orig	
+++ linux/arch/x86_64/kernel/module.c	
@@ -121,7 +121,7 @@ void *module_alloc(unsigned long size)
 			goto fail;
 	}
 	
-	if (map_vm_area(area, PAGE_KERNEL_EXECUTABLE, &pages))
+	if (map_vm_area(area, PAGE_KERNEL_EXEC, &pages))
 		goto fail;
 	
 	memset(addr, 0, size);
--- linux/arch/x86_64/mm/pageattr.c.orig	
+++ linux/arch/x86_64/mm/pageattr.c	
@@ -180,7 +180,7 @@ int change_page_attr(struct page *page, 
 			unsigned long addr2;
 			addr2 = __START_KERNEL_map + page_to_phys(page);
 			err = __change_page_attr(addr2, page, prot, 
-						 PAGE_KERNEL_EXECUTABLE);
+						 PAGE_KERNEL_EXEC);
 		} 
 	} 	
 	up_write(&init_mm.mmap_sem); 
--- linux/include/asm-x86_64/pgtable.h.orig	
+++ linux/include/asm-x86_64/pgtable.h	
@@ -172,7 +172,7 @@ static inline void set_pml4(pml4_t *dst,
 #define PAGE_READONLY_EXEC __pgprot(_PAGE_PRESENT | _PAGE_USER | _PAGE_ACCESSED)
 #define __PAGE_KERNEL \
 	(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_NX)
-#define __PAGE_KERNEL_EXECUTABLE \
+#define __PAGE_KERNEL_EXEC \
 	(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED)
 #define __PAGE_KERNEL_NOCACHE \
 	(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_PCD | _PAGE_ACCESSED | _PAGE_NX)
@@ -188,7 +188,7 @@ static inline void set_pml4(pml4_t *dst,
 #define MAKE_GLOBAL(x) __pgprot((x) | _PAGE_GLOBAL)
 
 #define PAGE_KERNEL MAKE_GLOBAL(__PAGE_KERNEL)
-#define PAGE_KERNEL_EXECUTABLE MAKE_GLOBAL(__PAGE_KERNEL_EXECUTABLE)
+#define PAGE_KERNEL_EXEC MAKE_GLOBAL(__PAGE_KERNEL_EXEC)
 #define PAGE_KERNEL_RO MAKE_GLOBAL(__PAGE_KERNEL_RO)
 #define PAGE_KERNEL_NOCACHE MAKE_GLOBAL(__PAGE_KERNEL_NOCACHE)
 #define PAGE_KERNEL_VSYSCALL MAKE_GLOBAL(__PAGE_KERNEL_VSYSCALL)
--- linux/include/linux/vmalloc.h.orig	
+++ linux/include/linux/vmalloc.h	
@@ -23,6 +23,7 @@ struct vm_struct {
  *	Highlevel APIs for driver use
  */
 extern void *vmalloc(unsigned long size);
+extern void *vmalloc_exec(unsigned long size);
 extern void *vmalloc_32(unsigned long size);
 extern void *__vmalloc(unsigned long size, int gfp_mask, pgprot_t prot);
 extern void vfree(void *addr);
--- linux/include/asm-i386/msr.h.orig	
+++ linux/include/asm-i386/msr.h	
@@ -217,6 +217,15 @@ static inline void wrmsrl (unsigned long
 #define MSR_K7_FID_VID_CTL		0xC0010041
 #define MSR_K7_FID_VID_STATUS		0xC0010042
 
+/* extended feature register */
+#define MSR_EFER 			0xc0000080
+
+/* EFER bits: */
+
+/* Execute Disable enable */
+#define _EFER_NX			11
+#define EFER_NX				(1<<_EFER_NX)
+
 /* Centaur-Hauls/IDT defined MSRs. */
 #define MSR_IDT_FCR1			0x107
 #define MSR_IDT_FCR2			0x108
--- linux/include/asm-i386/cpufeature.h.orig	
+++ linux/include/asm-i386/cpufeature.h	
@@ -47,6 +47,7 @@
 /* Don't duplicate feature flags which are redundant with Intel! */
 #define X86_FEATURE_SYSCALL	(1*32+11) /* SYSCALL/SYSRET */
 #define X86_FEATURE_MP		(1*32+19) /* MP Capable. */
+#define X86_FEATURE_NX		(1*32+20) /* Execute Disable */
 #define X86_FEATURE_MMXEXT	(1*32+22) /* AMD MMX extensions */
 #define X86_FEATURE_LM		(1*32+29) /* Long Mode (x86-64) */
 #define X86_FEATURE_3DNOWEXT	(1*32+30) /* AMD 3DNow! extensions */
@@ -100,6 +101,7 @@
 #define cpu_has_xmm		boot_cpu_has(X86_FEATURE_XMM)
 #define cpu_has_ht		boot_cpu_has(X86_FEATURE_HT)
 #define cpu_has_mp		boot_cpu_has(X86_FEATURE_MP)
+#define cpu_has_nx		boot_cpu_has(X86_FEATURE_NX)
 #define cpu_has_k6_mtrr		boot_cpu_has(X86_FEATURE_K6_MTRR)
 #define cpu_has_cyrix_arr	boot_cpu_has(X86_FEATURE_CYRIX_ARR)
 #define cpu_has_centaur_mcr	boot_cpu_has(X86_FEATURE_CENTAUR_MCR)
--- linux/include/asm-i386/page.h.orig	
+++ linux/include/asm-i386/page.h	
@@ -40,15 +40,18 @@
  * These are used to make use of C type-checking..
  */
 #ifdef CONFIG_X86_PAE
+extern unsigned long long __supported_pte_mask;
 typedef struct { unsigned long pte_low, pte_high; } pte_t;
 typedef struct { unsigned long long pmd; } pmd_t;
 typedef struct { unsigned long long pgd; } pgd_t;
+typedef struct { unsigned long long pgprot; } pgprot_t;
 #define pte_val(x)	((x).pte_low | ((unsigned long long)(x).pte_high << 32))
 #define HPAGE_SHIFT	21
 #else
 typedef struct { unsigned long pte_low; } pte_t;
 typedef struct { unsigned long pmd; } pmd_t;
 typedef struct { unsigned long pgd; } pgd_t;
+typedef struct { unsigned long pgprot; } pgprot_t;
 #define boot_pte_t pte_t /* or would you rather have a typedef */
 #define pte_val(x)	((x).pte_low)
 #define HPAGE_SHIFT	22
@@ -61,7 +64,6 @@ typedef struct { unsigned long pgd; } pg
 #define HUGETLB_PAGE_ORDER	(HPAGE_SHIFT - PAGE_SHIFT)
 #endif
 
-typedef struct { unsigned long pgprot; } pgprot_t;
 
 #define pmd_val(x)	((x).pmd)
 #define pgd_val(x)	((x).pgd)
@@ -136,7 +138,7 @@ static __inline__ int get_order(unsigned
 
 #define virt_addr_valid(kaddr)	pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
 
-#define VM_DATA_DEFAULT_FLAGS	(VM_READ | VM_WRITE | VM_EXEC | \
+#define VM_DATA_DEFAULT_FLAGS	(VM_READ | VM_WRITE | \
 				 VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
 
 #endif /* __KERNEL__ */
--- linux/include/asm-i386/pgtable-3level.h.orig	
+++ linux/include/asm-i386/pgtable-3level.h	
@@ -101,18 +101,24 @@ static inline unsigned long pte_pfn(pte_
 		(pte.pte_high << (32 - PAGE_SHIFT));
 }
 
+extern unsigned long long __supported_pte_mask;
+
 static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
 {
 	pte_t pte;
 
-	pte.pte_high = page_nr >> (32 - PAGE_SHIFT);
-	pte.pte_low = (page_nr << PAGE_SHIFT) | pgprot_val(pgprot);
+	pte.pte_high = (page_nr >> (32 - PAGE_SHIFT)) | \
+					(pgprot_val(pgprot) >> 32);
+	pte.pte_high &= (__supported_pte_mask >> 32);
+	pte.pte_low = ((page_nr << PAGE_SHIFT) | pgprot_val(pgprot)) & \
+							__supported_pte_mask;
 	return pte;
 }
 
 static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot)
 {
-	return __pmd(((unsigned long long)page_nr << PAGE_SHIFT) | pgprot_val(pgprot));
+	return __pmd((((unsigned long long)page_nr << PAGE_SHIFT) | \
+			pgprot_val(pgprot)) & __supported_pte_mask);
 }
 
 /*
--- linux/include/asm-i386/pgtable.h.orig	
+++ linux/include/asm-i386/pgtable.h	
@@ -110,6 +110,7 @@ void paging_init(void);
 #define _PAGE_BIT_UNUSED1	9	/* available for programmer */
 #define _PAGE_BIT_UNUSED2	10
 #define _PAGE_BIT_UNUSED3	11
+#define _PAGE_BIT_NX		63
 
 #define _PAGE_PRESENT	0x001
 #define _PAGE_RW	0x002
@@ -126,28 +127,51 @@ void paging_init(void);
 
 #define _PAGE_FILE	0x040	/* set:pagecache unset:swap */
 #define _PAGE_PROTNONE	0x080	/* If not present */
+#ifdef CONFIG_X86_PAE
+#define _PAGE_NX	(1ULL<<_PAGE_BIT_NX)
+#else
+#define _PAGE_NX	0
+#endif
 
 #define _PAGE_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | _PAGE_ACCESSED | _PAGE_DIRTY)
 #define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
 #define _PAGE_CHG_MASK	(PTE_MASK | _PAGE_ACCESSED | _PAGE_DIRTY)
 
-#define PAGE_NONE	__pgprot(_PAGE_PROTNONE | _PAGE_ACCESSED)
-#define PAGE_SHARED	__pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | _PAGE_ACCESSED)
-#define PAGE_COPY	__pgprot(_PAGE_PRESENT | _PAGE_USER | _PAGE_ACCESSED)
-#define PAGE_READONLY	__pgprot(_PAGE_PRESENT | _PAGE_USER | _PAGE_ACCESSED)
+#define PAGE_NONE \
+	__pgprot(_PAGE_PROTNONE | _PAGE_ACCESSED)
+#define PAGE_SHARED \
+	__pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | _PAGE_ACCESSED)
+
+#define PAGE_SHARED_EXEC \
+	__pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | _PAGE_ACCESSED)
+#define PAGE_COPY_NOEXEC \
+	__pgprot(_PAGE_PRESENT | _PAGE_USER | _PAGE_ACCESSED | _PAGE_NX)
+#define PAGE_COPY_EXEC \
+	__pgprot(_PAGE_PRESENT | _PAGE_USER | _PAGE_ACCESSED)
+#define PAGE_COPY \
+	PAGE_COPY_NOEXEC
+#define PAGE_READONLY \
+	__pgprot(_PAGE_PRESENT | _PAGE_USER | _PAGE_ACCESSED | _PAGE_NX)
+#define PAGE_READONLY_EXEC \
+	__pgprot(_PAGE_PRESENT | _PAGE_USER | _PAGE_ACCESSED)
 
 #define _PAGE_KERNEL \
+	(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_NX)
+#define _PAGE_KERNEL_EXEC \
 	(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED)
 
-extern unsigned long __PAGE_KERNEL;
-#define __PAGE_KERNEL_RO	(__PAGE_KERNEL & ~_PAGE_RW)
-#define __PAGE_KERNEL_NOCACHE	(__PAGE_KERNEL | _PAGE_PCD)
-#define __PAGE_KERNEL_LARGE	(__PAGE_KERNEL | _PAGE_PSE)
+extern unsigned long long __PAGE_KERNEL, __PAGE_KERNEL_EXEC;
+#define __PAGE_KERNEL_RO		(__PAGE_KERNEL & ~_PAGE_RW)
+#define __PAGE_KERNEL_NOCACHE		(__PAGE_KERNEL | _PAGE_PCD)
+#define __PAGE_KERNEL_LARGE		(__PAGE_KERNEL | _PAGE_PSE)
+#define __PAGE_KERNEL_LARGE_EXEC	(__PAGE_KERNEL_EXEC | _PAGE_PSE)
 
 #define PAGE_KERNEL		__pgprot(__PAGE_KERNEL)
 #define PAGE_KERNEL_RO		__pgprot(__PAGE_KERNEL_RO)
+#define PAGE_KERNEL_EXEC	__pgprot(__PAGE_KERNEL_EXEC)
 #define PAGE_KERNEL_NOCACHE	__pgprot(__PAGE_KERNEL_NOCACHE)
 #define PAGE_KERNEL_LARGE	__pgprot(__PAGE_KERNEL_LARGE)
+#define PAGE_KERNEL_LARGE_EXEC	__pgprot(__PAGE_KERNEL_LARGE_EXEC)
 
 /*
  * The i386 can't do page protection for execute, and considers that
@@ -158,19 +182,19 @@ extern unsigned long __PAGE_KERNEL;
 #define __P001	PAGE_READONLY
 #define __P010	PAGE_COPY
 #define __P011	PAGE_COPY
-#define __P100	PAGE_READONLY
-#define __P101	PAGE_READONLY
-#define __P110	PAGE_COPY
-#define __P111	PAGE_COPY
+#define __P100	PAGE_READONLY_EXEC
+#define __P101	PAGE_READONLY_EXEC
+#define __P110	PAGE_COPY_EXEC
+#define __P111	PAGE_COPY_EXEC
 
 #define __S000	PAGE_NONE
 #define __S001	PAGE_READONLY
 #define __S010	PAGE_SHARED
 #define __S011	PAGE_SHARED
-#define __S100	PAGE_READONLY
-#define __S101	PAGE_READONLY
-#define __S110	PAGE_SHARED
-#define __S111	PAGE_SHARED
+#define __S100	PAGE_READONLY_EXEC
+#define __S101	PAGE_READONLY_EXEC
+#define __S110	PAGE_SHARED_EXEC
+#define __S111	PAGE_SHARED_EXEC
 
 /*
  * Define this if things work differently on an i386 and an i486:
@@ -244,6 +268,15 @@ static inline pte_t pte_modify(pte_t pte
 {
 	pte.pte_low &= _PAGE_CHG_MASK;
 	pte.pte_low |= pgprot_val(newprot);
+#ifdef CONFIG_X86_PAE
+	/*
+	 * Chop off the NX bit (if present), and add the NX portion of
+	 * the newprot (if present):
+	 */
+	pte.pte_high &= -1 ^ (1 << (_PAGE_BIT_NX - 32));
+	pte.pte_high |= (pgprot_val(newprot) >> 32) & \
+					(__supported_pte_mask >> 32);
+#endif
 	return pte;
 }
 
--- linux/fs/binfmt_elf.c.orig	
+++ linux/fs/binfmt_elf.c	
@@ -491,6 +491,7 @@ static int load_elf_binary(struct linux_
 	char passed_fileno[6];
 	struct files_struct *files;
 	int executable_stack = EXSTACK_DEFAULT;
+	unsigned long def_flags = 0;
 	
 	/* Get the exec-header */
 	elf_ex = *((struct elfhdr *) bprm->buf);
@@ -622,7 +623,10 @@ static int load_elf_binary(struct linux_
 				executable_stack = EXSTACK_ENABLE_X;
 			else
 				executable_stack = EXSTACK_DISABLE_X;
+			break;
 		}
+	if (i == elf_ex.e_phnum)
+		def_flags |= VM_EXEC | VM_MAYEXEC;
 
 	/* Some simple consistency checks for the interpreter */
 	if (elf_interpreter) {
@@ -690,6 +694,7 @@ static int load_elf_binary(struct linux_
 	current->mm->end_code = 0;
 	current->mm->mmap = NULL;
 	current->flags &= ~PF_FORKNOEXEC;
+	current->mm->def_flags = def_flags;
 
 	/* Do this immediately, since STACK_TOP as used in setup_arg_pages
 	   may depend on the personality.  */
--- linux/fs/exec.c.orig	
+++ linux/fs/exec.c	
@@ -431,6 +431,7 @@ int setup_arg_pages(struct linux_binprm 
 			mpnt->vm_flags = VM_STACK_FLAGS & ~VM_EXEC;
 		else
 			mpnt->vm_flags = VM_STACK_FLAGS;
+		mpnt->vm_flags |= mm->def_flags;
 		mpnt->vm_page_prot = protection_map[mpnt->vm_flags & 0x7];
 		insert_vm_struct(mm, mpnt);
 		mm->total_vm = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT;
--- linux/mm/vmalloc.c.orig	
+++ linux/mm/vmalloc.c	
@@ -456,6 +456,28 @@ void *vmalloc(unsigned long size)
 EXPORT_SYMBOL(vmalloc);
 
 /**
+ *	vmalloc_exec  -  allocate virtually contiguous, executable memory
+ *
+ *	@size:		allocation size
+ *
+ *	Kernel-internal function to allocate enough pages to cover @size
+ *	the page level allocator and map them into contiguous and
+ *	executable kernel virtual space.
+ *
+ *	For tight cotrol over page level allocator and protection flags
+ *	use __vmalloc() instead.
+ */
+
+#ifndef PAGE_KERNEL_EXEC
+# define PAGE_KERNEL_EXEC PAGE_KERNEL
+#endif
+
+void *vmalloc_exec(unsigned long size)
+{
+	return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL_EXEC);
+}
+
+/**
  *	vmalloc_32  -  allocate virtually contiguous memory (32bit addressable)
  *
  *	@size:		allocation size

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [patch] flexible-mmap-2.6.7-D5
  2004-06-18 21:38 [patch] flexible-mmap-2.6.7-D5 Ingo Molnar
@ 2004-06-18 23:16 ` William Lee Irwin III
  2004-06-19  7:46   ` Ingo Molnar
  2004-06-19  8:06 ` Ingo Molnar
  1 sibling, 1 reply; 7+ messages in thread
From: William Lee Irwin III @ 2004-06-18 23:16 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

On Fri, Jun 18, 2004 at 11:38:14PM +0200, Ingo Molnar wrote:
> the flexible-mmap-2.6.7-D5 patch (attached) splits the flexible-mmap
> layout stuff out of exec-shield. The patch is against 2.6.7 + my latest
> NX patch. (i've attached the NX patch too, for convenience.)
> the goal of the patch is to change the way virtual memory is allocated,
> from:
>   0x08000000 ... binary image
>   0x08xxxxxx ... brk area, grows upwards
>   0x40000000 ... start of mmap, new mmaps go after old ones
>   0xbfxxxxxx ... stack
> to a more flexible top-down mmap() method:
> 
>   0x08000000 ... binary image
>   0x08xxxxxx ... brk area, grows upwards
>   0xbfxxxxxx ... _end_ of all mmaps, new mmaps go below old ones
>   0xbfyyyyyy ... stack

It may be useful to move STACK_TOP to 0x8040000 give the stack more head
room and fit small statically-linked executables into one pagetable page.
Otherwise the stack has very little headroom after the first mmap().


-- wli

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [patch] flexible-mmap-2.6.7-D5
  2004-06-18 23:16 ` William Lee Irwin III
@ 2004-06-19  7:46   ` Ingo Molnar
  2004-06-19  8:34     ` William Lee Irwin III
  0 siblings, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2004-06-19  7:46 UTC (permalink / raw)
  To: William Lee Irwin III, linux-kernel


* William Lee Irwin III <wli@holomorphy.com> wrote:

> > to a more flexible top-down mmap() method:
> > 
> >   0x08000000 ... binary image
> >   0x08xxxxxx ... brk area, grows upwards
> >   0xbfxxxxxx ... _end_ of all mmaps, new mmaps go below old ones
> >   0xbfyyyyyy ... stack
> 
> It may be useful to move STACK_TOP to 0x8040000 give the stack more
> head room and fit small statically-linked executables into one
> pagetable page. Otherwise the stack has very little headroom after the
> first mmap().

the patch already takes care of this via two mechanisms:

- the top of the mmaps is moved down by the expected maximum size of the
  stack (the stack rlimit).

- we maintain a minimum 128 MB 'gap' between stack top and mmap top. 

moving it to 0x08040000 is asking for trouble, there are people who need
1GB stacks (dont ask why). This approach has been in production use for
~2 years and it works fine for a wide range of applications and VM
needs.

	Ingo

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [patch] flexible-mmap-2.6.7-D5
  2004-06-18 21:38 [patch] flexible-mmap-2.6.7-D5 Ingo Molnar
  2004-06-18 23:16 ` William Lee Irwin III
@ 2004-06-19  8:06 ` Ingo Molnar
  1 sibling, 0 replies; 7+ messages in thread
From: Ingo Molnar @ 2004-06-19  8:06 UTC (permalink / raw)
  To: linux-kernel


* Ingo Molnar <mingo@elte.hu> wrote:

> - i also introduced a new personality bit (ADDR_COMPAT_LAYOUT) to signal
>   older binaries that dont have PT_GNU_STACK. x86 uses this to revert 
>   back to the stock layout. [...]

i've also attached a patch below against setarch that adds the -L option
which turns on the old layout. Thus if there's any old broken app that
relies on mmap()s being below 2GB then the following can be used:

	setarch -L oldapp <params>

and 'oldapp' (and all its children) will run with the old layout. But
this should be an extremely rare occurance, i've added it more for
completeness and paranoia reasons than necessity.

	Ingo

--- setarch-1.4/setarch.c.orig
+++ setarch-1.4/setarch.c
@@ -118,6 +118,15 @@ int main(int argc, char *argv[])
       if(got_arch)
 	p = argv[0];
     }
+  if(!strcmp(p, "-L"))
+    {
+      argv++;
+      argc--;
+      options |= 0x0200000 /* ADDR_COMPAT_LAYOUT */;
+      if(got_arch)
+	p = argv[0];
+    }
+
 
   if(set_arch(p, options))
     {

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [patch] flexible-mmap-2.6.7-D5
  2004-06-19  7:46   ` Ingo Molnar
@ 2004-06-19  8:34     ` William Lee Irwin III
  2004-06-19 11:38       ` Ingo Molnar
  0 siblings, 1 reply; 7+ messages in thread
From: William Lee Irwin III @ 2004-06-19  8:34 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

On Sat, Jun 19, 2004 at 09:46:12AM +0200, Ingo Molnar wrote:
> the patch already takes care of this via two mechanisms:
> - the top of the mmaps is moved down by the expected maximum size of the
>   stack (the stack rlimit).
> - we maintain a minimum 128 MB 'gap' between stack top and mmap top. 
> moving it to 0x08040000 is asking for trouble, there are people who need
> 1GB stacks (dont ask why). This approach has been in production use for
> ~2 years and it works fine for a wide range of applications and VM
> needs.

I'm not terribly convinced the stack's virtual placement helps in the
case where 1GB stacks are needed as within rlimits, it's usually lazily
expanded. I suppose the users in need of it adjust their rlimits, say,
in the shell spawning the process wanting it to precisely what they
need in this scheme, which is awkward, but doesn't require code
modification for the offending apps.

Also, I suspect some more graceful fallback would make sense
particularly for the case of RLIM_INFINITY, which would leave users
that run with, say, all rlimits at RLIM_INFINITY in the interest of
having full access to system resources with a mere 512MB of
virtualspace for the heap, which IIRC glibc is intelligent enough to
circumvent for malloc(), but not for mmap(NULL, ...). If it's been in
production that long, I find it hard to believe that's never been
tripped over. This is one of the reasons that when I thought about this
and split off an isolated patch for top-down vma allocation from
execshield, I gave up immediately on the rlimits and said "let them
malloc() their own stacks", and shoved the default down low (also, that
128MB is currently wasted); the rlimits appeared to need too precise of
tuning and be easy to mis-tune. But that was a matter of thinking about
it, not experience.

These are questions whose answers largely do not affect me personally
or anything I work on. If the answer is "it suffices as-is" I won't get
worked up about it. I didn't raise the RLIM_INFINITY point in my prior
post (b/c I missed the rlimit bit), so for the sake of completeness I
droned on and on about it here.

Last and definitely least, the declaration in sched.h of functions only
used in mmap.c set off some warning flags; it may make sense to just
declare the things in mm/mmap.c, possibly with __attribute__((weak)) or
some such (AIUI this is now allowed, the last buggy arches' toolchains
were fixed, and it's already used now with cond_syscall() etc.). There
is a bit of nastiness with sched.h being a "garbage can" header it'd be
nice to avoid aggravating.

Thanks.

-- wli

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [patch] flexible-mmap-2.6.7-D5
  2004-06-19  8:34     ` William Lee Irwin III
@ 2004-06-19 11:38       ` Ingo Molnar
  2004-06-19 21:34         ` William Lee Irwin III
  0 siblings, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2004-06-19 11:38 UTC (permalink / raw)
  To: William Lee Irwin III, linux-kernel

* William Lee Irwin III <wli@holomorphy.com> wrote:

> Also, I suspect some more graceful fallback would make sense
> particularly for the case of RLIM_INFINITY, which would leave users
> that run with, say, all rlimits at RLIM_INFINITY in the interest of
> having full access to system resources with a mere 512MB of
> virtualspace for the heap, which IIRC glibc is intelligent enough to
> circumvent for malloc(), but not for mmap(NULL, ...). [...]

well, the 5/6=stack 1/6=malloc rule in the RLIM_INFINITY can be changed. 
What would make the most sense - 1/2 for both?

> If it's been in production that long, I find it hard to believe that's
> never been tripped over. [...]

it's been tripped over and the 5/6 rule was a fix for such a bugreport. 
What happens more in practice frequently is that someone needs a big
stack and sets the stack ulimit to RLIM_INFINITY.

> [...] (also, that 128MB is currently wasted); [...]

the 128MB is 'wasted' to give some flexibility to the stack rlimits
changing runtime. But in practice it's far more important to have the
mmap()/malloc() space maximized and flexible than to give the stack
automatic flexibility.

	Ingo

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [patch] flexible-mmap-2.6.7-D5
  2004-06-19 11:38       ` Ingo Molnar
@ 2004-06-19 21:34         ` William Lee Irwin III
  0 siblings, 0 replies; 7+ messages in thread
From: William Lee Irwin III @ 2004-06-19 21:34 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

* William Lee Irwin III <wli@holomorphy.com> wrote:
>> Also, I suspect some more graceful fallback would make sense
>> particularly for the case of RLIM_INFINITY, which would leave users
>> that run with, say, all rlimits at RLIM_INFINITY in the interest of
>> having full access to system resources with a mere 512MB of
>> virtualspace for the heap, which IIRC glibc is intelligent enough to
>> circumvent for malloc(), but not for mmap(NULL, ...). [...]

On Sat, Jun 19, 2004 at 01:38:36PM +0200, Ingo Molnar wrote:
> well, the 5/6=stack 1/6=malloc rule in the RLIM_INFINITY can be changed. 
> What would make the most sense - 1/2 for both?

I had in mind fallback as opposed to a changed base, but a particular
choice of the base may cover enough cases. The bugreport below seems to
say there's no need for a change.


* William Lee Irwin III <wli@holomorphy.com> wrote:
>> If it's been in production that long, I find it hard to believe that's
>> never been tripped over. [...]

On Sat, Jun 19, 2004 at 01:38:36PM +0200, Ingo Molnar wrote:
> it's been tripped over and the 5/6 rule was a fix for such a bugreport. 
> What happens more in practice frequently is that someone needs a big
> stack and sets the stack ulimit to RLIM_INFINITY.

This sounds like nothing is needed, then.


* William Lee Irwin III <wli@holomorphy.com> wrote:
>> [...] (also, that 128MB is currently wasted); [...]

On Sat, Jun 19, 2004 at 01:38:36PM +0200, Ingo Molnar wrote:
> the 128MB is 'wasted' to give some flexibility to the stack rlimits
> changing runtime. But in practice it's far more important to have the
> mmap()/malloc() space maximized and flexible than to give the stack
> automatic flexibility.

Fishing around down there to utilize it for mmap() placement should
happen anyway if things fall back far enough in the top-down scheme.

Answers like "I've thought about it" or "I've seen this and dealt with
it" are good enough for me. There isn't much of a normative aspect to
user virtualspace layout.

Thanks.


-- wli

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2004-06-19 21:34 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-06-18 21:38 [patch] flexible-mmap-2.6.7-D5 Ingo Molnar
2004-06-18 23:16 ` William Lee Irwin III
2004-06-19  7:46   ` Ingo Molnar
2004-06-19  8:34     ` William Lee Irwin III
2004-06-19 11:38       ` Ingo Molnar
2004-06-19 21:34         ` William Lee Irwin III
2004-06-19  8:06 ` Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox