* [patch 01/14] Fix comment about remap_file_pages
2006-04-30 17:29 [patch 00/14] remap_file_pages protection support blaisorblade
@ 2006-04-30 17:29 ` blaisorblade
2006-04-30 17:29 ` [patch 02/14] remap_file_pages protection support: add needed macros blaisorblade
` (14 subsequent siblings)
15 siblings, 0 replies; 46+ messages in thread
From: blaisorblade @ 2006-04-30 17:29 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Paolo Blaisorblade Giarrusso
[-- Attachment #1: rfp/00-rfp-comment.diff --]
[-- Type: text/plain, Size: 1213 bytes --]
From: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
This comment is a bit unclear and also stale, so fix it. Thanks to Hugh Dickins
for explaining to me what it really referred to, and for correcting my first fix.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Index: linux-2.6.git/mm/fremap.c
===================================================================
--- linux-2.6.git.orig/mm/fremap.c
+++ linux-2.6.git/mm/fremap.c
@@ -208,9 +208,10 @@ asmlinkage long sys_remap_file_pages(uns
pgoff, flags & MAP_NONBLOCK);
/*
- * We can't clear VM_NONLINEAR because we'd have to do
- * it after ->populate completes, and that would prevent
- * downgrading the lock. (Locks can't be upgraded).
+ * We would like to clear VM_NONLINEAR, in the case when
+ * sys_remap_file_pages covers the whole vma, so making
+ * it linear again. But cannot do so until after a
+ * successful populate, and have no way to upgrade sem.
*/
}
if (likely(!has_write_lock))
--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
* [patch 02/14] remap_file_pages protection support: add needed macros
2006-04-30 17:29 [patch 00/14] remap_file_pages protection support blaisorblade
2006-04-30 17:29 ` [patch 01/14] Fix comment about remap_file_pages blaisorblade
@ 2006-04-30 17:29 ` blaisorblade
2006-04-30 17:29 ` [patch 03/14] remap_file_pages protection support: handle MANYPROTS VMAs blaisorblade
` (13 subsequent siblings)
15 siblings, 0 replies; 46+ messages in thread
From: blaisorblade @ 2006-04-30 17:29 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Paolo Blaisorblade Giarrusso
[-- Attachment #1: rfp/01-add-MAP_CHGPROT-wrapper-macros.diff --]
[-- Type: text/plain, Size: 13652 bytes --]
From: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Add generic versions of the pte_to_pgprot() and pgoff_prot_to_pte() macros, so
we can safely use them and keep the kernel compiling. Real definitions of the
macros are provided for some architectures.
Also, add the MAP_CHGPROT flag to all arch headers (it was MAP_NOINHERIT,
renamed at Hugh Dickins' suggestion).
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Index: linux-2.6.git/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.git.orig/include/asm-generic/pgtable.h
+++ linux-2.6.git/include/asm-generic/pgtable.h
@@ -241,4 +241,16 @@ static inline int pmd_none_or_clear_bad(
}
#endif /* !__ASSEMBLY__ */
+#ifndef __HAVE_ARCH_PTE_TO_PGPROT
+/* Wrappers for architectures which don't support yet page protections for
+ * remap_file_pages. */
+
+/* Dummy define - if the architecture has no special support, access is denied
+ * in VM_MANYPROTS vma's. */
+#define pte_to_pgprot(pte) __P000
+
+#define pgoff_prot_to_pte(off, prot) pgoff_to_pte(off)
+
+#endif
+
#endif /* _ASM_GENERIC_PGTABLE_H */
Index: linux-2.6.git/include/asm-alpha/mman.h
===================================================================
--- linux-2.6.git.orig/include/asm-alpha/mman.h
+++ linux-2.6.git/include/asm-alpha/mman.h
@@ -28,6 +28,9 @@
#define MAP_NORESERVE 0x10000 /* don't check for reservations */
#define MAP_POPULATE 0x20000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x40000 /* do not block on IO */
+#define MAP_CHGPROT 0x80000 /* don't inherit the protection bits of
+ the underlying vma, to be passed to
+ remap_file_pages() only */
#define MS_ASYNC 1 /* sync memory asynchronously */
#define MS_SYNC 2 /* synchronous memory sync */
Index: linux-2.6.git/include/asm-arm26/mman.h
===================================================================
--- linux-2.6.git.orig/include/asm-arm26/mman.h
+++ linux-2.6.git/include/asm-arm26/mman.h
@@ -10,6 +10,9 @@
#define MAP_NORESERVE 0x4000 /* don't check for reservations */
#define MAP_POPULATE 0x8000 /* populate (prefault) page tables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
+#define MAP_CHGPROT 0x20000 /* don't inherit the protection bits of
+ the underlying vma, to be passed to
+ remap_file_pages() only */
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
Index: linux-2.6.git/include/asm-arm/mman.h
===================================================================
--- linux-2.6.git.orig/include/asm-arm/mman.h
+++ linux-2.6.git/include/asm-arm/mman.h
@@ -10,6 +10,9 @@
#define MAP_NORESERVE 0x4000 /* don't check for reservations */
#define MAP_POPULATE 0x8000 /* populate (prefault) page tables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
+#define MAP_CHGPROT 0x20000 /* don't inherit the protection bits of
+ the underlying vma, to be passed to
+ remap_file_pages() only */
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
Index: linux-2.6.git/include/asm-cris/mman.h
===================================================================
--- linux-2.6.git.orig/include/asm-cris/mman.h
+++ linux-2.6.git/include/asm-cris/mman.h
@@ -12,6 +12,9 @@
#define MAP_NORESERVE 0x4000 /* don't check for reservations */
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
+#define MAP_CHGPROT 0x20000 /* don't inherit the protection bits of
+ the underlying vma, to be passed to
+ remap_file_pages() only */
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
Index: linux-2.6.git/include/asm-frv/mman.h
===================================================================
--- linux-2.6.git.orig/include/asm-frv/mman.h
+++ linux-2.6.git/include/asm-frv/mman.h
@@ -10,6 +10,9 @@
#define MAP_NORESERVE 0x4000 /* don't check for reservations */
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
+#define MAP_CHGPROT 0x20000 /* don't inherit the protection bits of
+ the underlying vma, to be passed to
+ remap_file_pages() only */
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
Index: linux-2.6.git/include/asm-h8300/mman.h
===================================================================
--- linux-2.6.git.orig/include/asm-h8300/mman.h
+++ linux-2.6.git/include/asm-h8300/mman.h
@@ -10,6 +10,9 @@
#define MAP_NORESERVE 0x4000 /* don't check for reservations */
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
+#define MAP_CHGPROT 0x20000 /* don't inherit the protection bits of
+ the underlying vma, to be passed to
+ remap_file_pages() only */
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
Index: linux-2.6.git/include/asm-i386/mman.h
===================================================================
--- linux-2.6.git.orig/include/asm-i386/mman.h
+++ linux-2.6.git/include/asm-i386/mman.h
@@ -10,6 +10,9 @@
#define MAP_NORESERVE 0x4000 /* don't check for reservations */
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
+#define MAP_CHGPROT 0x20000 /* don't inherit the protection bits of
+ the underlying vma, to be passed to
+ remap_file_pages() only */
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
Index: linux-2.6.git/include/asm-ia64/mman.h
===================================================================
--- linux-2.6.git.orig/include/asm-ia64/mman.h
+++ linux-2.6.git/include/asm-ia64/mman.h
@@ -18,6 +18,9 @@
#define MAP_NORESERVE 0x04000 /* don't check for reservations */
#define MAP_POPULATE 0x08000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
+#define MAP_CHGPROT 0x20000 /* don't inherit the protection bits of
+ the underlying vma, to be passed to
+ remap_file_pages() only */
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
Index: linux-2.6.git/include/asm-m32r/mman.h
===================================================================
--- linux-2.6.git.orig/include/asm-m32r/mman.h
+++ linux-2.6.git/include/asm-m32r/mman.h
@@ -12,6 +12,9 @@
#define MAP_NORESERVE 0x4000 /* don't check for reservations */
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
+#define MAP_CHGPROT 0x20000 /* don't inherit the protection bits of
+ the underlying vma, to be passed to
+ remap_file_pages() only */
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
Index: linux-2.6.git/include/asm-m68k/mman.h
===================================================================
--- linux-2.6.git.orig/include/asm-m68k/mman.h
+++ linux-2.6.git/include/asm-m68k/mman.h
@@ -10,6 +10,9 @@
#define MAP_NORESERVE 0x4000 /* don't check for reservations */
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
+#define MAP_CHGPROT 0x20000 /* don't inherit the protection bits of
+ the underlying vma, to be passed to
+ remap_file_pages() only */
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
Index: linux-2.6.git/include/asm-mips/mman.h
===================================================================
--- linux-2.6.git.orig/include/asm-mips/mman.h
+++ linux-2.6.git/include/asm-mips/mman.h
@@ -46,6 +46,9 @@
#define MAP_LOCKED 0x8000 /* pages are locked */
#define MAP_POPULATE 0x10000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
+#define MAP_CHGPROT 0x40000 /* don't inherit the protection bits of
+ the underlying vma, to be passed to
+ remap_file_pages() only */
/*
* Flags for msync
Index: linux-2.6.git/include/asm-parisc/mman.h
===================================================================
--- linux-2.6.git.orig/include/asm-parisc/mman.h
+++ linux-2.6.git/include/asm-parisc/mman.h
@@ -22,6 +22,9 @@
#define MAP_GROWSDOWN 0x8000 /* stack-like segment */
#define MAP_POPULATE 0x10000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
+#define MAP_CHGPROT 0x40000 /* don't inherit the protection bits of
+ the underlying vma, to be passed to
+ remap_file_pages() only */
#define MS_SYNC 1 /* synchronous memory sync */
#define MS_ASYNC 2 /* sync memory asynchronously */
Index: linux-2.6.git/include/asm-powerpc/mman.h
===================================================================
--- linux-2.6.git.orig/include/asm-powerpc/mman.h
+++ linux-2.6.git/include/asm-powerpc/mman.h
@@ -23,5 +23,8 @@
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
+#define MAP_CHGPROT 0x20000 /* don't inherit the protection bits of
+ the underlying vma, to be passed to
+ remap_file_pages() only */
#endif /* _ASM_POWERPC_MMAN_H */
Index: linux-2.6.git/include/asm-s390/mman.h
===================================================================
--- linux-2.6.git.orig/include/asm-s390/mman.h
+++ linux-2.6.git/include/asm-s390/mman.h
@@ -18,6 +18,9 @@
#define MAP_NORESERVE 0x4000 /* don't check for reservations */
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
+#define MAP_CHGPROT 0x20000 /* don't inherit the protection bits of
+ the underlying vma, to be passed to
+ remap_file_pages() only */
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
Index: linux-2.6.git/include/asm-sh/mman.h
===================================================================
--- linux-2.6.git.orig/include/asm-sh/mman.h
+++ linux-2.6.git/include/asm-sh/mman.h
@@ -10,6 +10,9 @@
#define MAP_NORESERVE 0x4000 /* don't check for reservations */
#define MAP_POPULATE 0x8000 /* populate (prefault) page tables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
+#define MAP_CHGPROT 0x20000 /* don't inherit the protection bits of
+ the underlying vma, to be passed to
+ remap_file_pages() only */
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
Index: linux-2.6.git/include/asm-sparc64/mman.h
===================================================================
--- linux-2.6.git.orig/include/asm-sparc64/mman.h
+++ linux-2.6.git/include/asm-sparc64/mman.h
@@ -21,6 +21,9 @@
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
+#define MAP_CHGPROT 0x20000 /* don't inherit the protection bits of
+ the underlying vma, to be passed to
+ remap_file_pages() only */
/* XXX Need to add flags to SunOS's mctl, mlockall, and madvise system
* XXX calls.
Index: linux-2.6.git/include/asm-sparc/mman.h
===================================================================
--- linux-2.6.git.orig/include/asm-sparc/mman.h
+++ linux-2.6.git/include/asm-sparc/mman.h
@@ -21,6 +21,9 @@
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
+#define MAP_CHGPROT 0x20000 /* don't inherit the protection bits of
+ the underlying vma, to be passed to
+ remap_file_pages() only */
/* XXX Need to add flags to SunOS's mctl, mlockall, and madvise system
* XXX calls.
Index: linux-2.6.git/include/asm-x86_64/mman.h
===================================================================
--- linux-2.6.git.orig/include/asm-x86_64/mman.h
+++ linux-2.6.git/include/asm-x86_64/mman.h
@@ -12,6 +12,9 @@
#define MAP_NORESERVE 0x4000 /* don't check for reservations */
#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
+#define MAP_CHGPROT 0x20000 /* don't inherit the protection bits of
+ the underlying vma, to be passed to
+ remap_file_pages() only */
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
Index: linux-2.6.git/include/asm-xtensa/mman.h
===================================================================
--- linux-2.6.git.orig/include/asm-xtensa/mman.h
+++ linux-2.6.git/include/asm-xtensa/mman.h
@@ -53,6 +53,9 @@
#define MAP_LOCKED 0x8000 /* pages are locked */
#define MAP_POPULATE 0x10000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
+#define MAP_CHGPROT 0x40000 /* don't inherit the protection bits of
+ the underlying vma, to be passed to
+ remap_file_pages() only */
/*
* Flags for msync
--
* [patch 03/14] remap_file_pages protection support: handle MANYPROTS VMAs
2006-04-30 17:29 [patch 00/14] remap_file_pages protection support blaisorblade
2006-04-30 17:29 ` [patch 01/14] Fix comment about remap_file_pages blaisorblade
2006-04-30 17:29 ` [patch 02/14] remap_file_pages protection support: add needed macros blaisorblade
@ 2006-04-30 17:29 ` blaisorblade
2006-04-30 17:29 ` [patch 04/14] remap_file_pages protection support: disallow mprotect() on manyprots mappings blaisorblade
` (12 subsequent siblings)
15 siblings, 0 replies; 46+ messages in thread
From: blaisorblade @ 2006-04-30 17:29 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Val Henson, Paolo Blaisorblade Giarrusso
[-- Attachment #1: rfp/03-rfp-add-VM_NONUNIF.diff --]
[-- Type: text/plain, Size: 10462 bytes --]
From: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Val Henson <val.henson@intel.com>
Handle the possible existence of VM_MANYPROTS vmas, without actually creating
them.
* Replace old uses of pgoff_to_pte with pgoff_prot_to_pte.
* Introduce the flag, use it to read permissions from the PTE rather than from
the VMA flags.
* Replace the linear_page_index() check with save_nonlinear_pte(), which
encapsulates the check.
2.6.14+ updates:
* Add VM_MANYPROTS among cases needing copying of PTE at fork time rather than
faulting.
* check for VM_MANYPROTS in do_file_pte before complaining for pte_file PTE
* check for VM_MANYPROTS in *_populate, when we skip installing pte_file PTE's
for linear areas
Below is a long explanation of why I've added VM_MANYPROTS rather than simply
overloading VM_NONLINEAR. Feel free to skip it if you have real work to do :-).
However, this patch is sufficient only if VM_MANYPROTS vmas are also marked as
nonlinear; otherwise, further changes are needed.
I've implemented both solutions, but I'm sending only full support for the easy
case; I may reintroduce the other changes afterwards. In particular, they're
needed to make this useful for general usage beyond UML.
*) remap_file_pages protection support: add VM_MANYPROTS to fix existing usage of mprotect()
Distinguish between "normal" VMA and VMA with variable protection, by
adding the VM_MANYPROTS flag. This is needed for various reasons:
* notify the arch fault handlers that they must not check VMA protection when
deciding whether to raise SIGSEGV
* fixing regression of mprotect() on !VM_MANYPROTS mappings (see below)
* (in next patches) giving a sensible behaviour to mprotect on VM_MANYPROTS
mappings
* (TODO?) avoid regression in max file offset with r_f_p() for older mappings;
we could use either the old offset encoding or the new offset-prot encoding
depending on this flag.
It's trivial to do; I just don't know whether existing apps would overflow
the new limits. They go down from 2TB to 1TB on i386 and 512GB on PPC, and
from 256GB to 128GB on s390/31-bit. Give me a call in case.
* (TODO?) on MAP_PRIVATE mappings, especially when they are readonly, we can
easily support VM_MANYPROTS. This has been explicitly requested by Ulrich
Drepper for DSO handling - creating a PROT_NONE VMA for guard pages is bad.
And that is worse when you have a binary with 100 DSOs, or a program with
very many threads - Ulrich profiled a workload where the RB-tree lookup
function is a performance bottleneck.
In fact, without this flag we'd indeed have a regression of
remap_file_pages vs. mprotect on uniform nonlinear VMAs.
mprotect alters the VMA prots and walks each present PTE, ignoring installed
ones even when pte_file() is set; their saved prots will be restored on fault,
ignoring the VMA ones and losing the mprotect() on them. So, in do_file_page(),
we must restore the VMA prots anyway when the VMA is uniform, as we used to do
before this series of patches.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Index: linux-2.6.git/include/linux/mm.h
===================================================================
--- linux-2.6.git.orig/include/linux/mm.h
+++ linux-2.6.git/include/linux/mm.h
@@ -164,7 +164,14 @@ extern unsigned int kobjsize(const void
#define VM_ACCOUNT 0x00100000 /* Is a VM accounted object */
#define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */
#define VM_NONLINEAR 0x00800000 /* Is non-linear (remap_file_pages) */
+
+#ifndef CONFIG_MMU
#define VM_MAPPED_COPY 0x01000000 /* T if mapped copy of data (nommu mmap) */
+#else
+#define VM_MANYPROTS 0x01000000 /* The VM individual pages have
+ different protections
+ (remap_file_pages)*/
+#endif
#define VM_INSERTPAGE 0x02000000 /* The vma has had "vm_insert_page()" done on it */
#ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
Index: linux-2.6.git/include/linux/pagemap.h
===================================================================
--- linux-2.6.git.orig/include/linux/pagemap.h
+++ linux-2.6.git/include/linux/pagemap.h
@@ -165,6 +165,28 @@ static inline pgoff_t linear_page_index(
return pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);
}
+/***
+ * Checks if the PTE is nonlinear, and if yes sets it.
+ * @vma: the VMA in which @addr is; we don't check if it's VM_NONLINEAR, just
+ * if this PTE is nonlinear.
+ * @addr: the addr which @pte refers to.
+ * @pte: the old PTE value (to read its protections).
+ * @ptep: the PTE pointer (for setting it).
+ * @mm: passed to set_pte_at.
+ * @page: the page which was installed (to read its ->index, i.e. the old
+ * offset inside the file).
+ */
+static inline void save_nonlinear_pte(pte_t pte, pte_t * ptep, struct
+ vm_area_struct *vma, struct mm_struct *mm, struct page* page,
+ unsigned long addr)
+{
+ pgprot_t pgprot = pte_to_pgprot(pte);
+ if (linear_page_index(vma, addr) != page->index ||
+ pgprot_val(pgprot) != pgprot_val(vma->vm_page_prot))
+ set_pte_at(mm, addr, ptep, pgoff_prot_to_pte(page->index,
+ pgprot));
+}
+
extern void FASTCALL(__lock_page(struct page *page));
extern void FASTCALL(unlock_page(struct page *page));
Index: linux-2.6.git/mm/fremap.c
===================================================================
--- linux-2.6.git.orig/mm/fremap.c
+++ linux-2.6.git/mm/fremap.c
@@ -49,7 +49,7 @@ static int zap_pte(struct mm_struct *mm,
* previously existing mapping.
*/
int install_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, struct page *page, pgprot_t prot)
+ unsigned long addr, struct page *page, pgprot_t pgprot)
{
struct inode *inode;
pgoff_t size;
@@ -79,7 +79,7 @@ int install_page(struct mm_struct *mm, s
inc_mm_counter(mm, file_rss);
flush_icache_page(vma, page);
- set_pte_at(mm, addr, pte, mk_pte(page, prot));
+ set_pte_at(mm, addr, pte, mk_pte(page, pgprot));
page_add_file_rmap(page);
pte_val = *pte;
update_mmu_cache(vma, addr, pte_val);
@@ -96,7 +96,7 @@ EXPORT_SYMBOL(install_page);
* previously existing mapping.
*/
int install_file_pte(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, unsigned long pgoff, pgprot_t prot)
+ unsigned long addr, unsigned long pgoff, pgprot_t pgprot)
{
int err = -ENOMEM;
pte_t *pte;
@@ -112,7 +112,7 @@ int install_file_pte(struct mm_struct *m
dec_mm_counter(mm, file_rss);
}
- set_pte_at(mm, addr, pte, pgoff_to_pte(pgoff));
+ set_pte_at(mm, addr, pte, pgoff_prot_to_pte(pgoff, pgprot));
pte_val = *pte;
update_mmu_cache(vma, addr, pte_val);
pte_unmap_unlock(pte, ptl);
Index: linux-2.6.git/mm/memory.c
===================================================================
--- linux-2.6.git.orig/mm/memory.c
+++ linux-2.6.git/mm/memory.c
@@ -581,7 +581,8 @@ int copy_page_range(struct mm_struct *ds
* readonly mappings. The tradeoff is that copy_page_range is more
* efficient than faulting.
*/
- if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
+ if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_MANYPROTS|
+ VM_PFNMAP|VM_INSERTPAGE))) {
if (!vma->anon_vma)
return 0;
}
@@ -650,11 +651,11 @@ static unsigned long zap_pte_range(struc
tlb_remove_tlb_entry(tlb, pte, addr);
if (unlikely(!page))
continue;
- if (unlikely(details) && details->nonlinear_vma
- && linear_page_index(details->nonlinear_vma,
- addr) != page->index)
- set_pte_at(mm, addr, pte,
- pgoff_to_pte(page->index));
+ if (unlikely(details) && details->nonlinear_vma) {
+ save_nonlinear_pte(ptent, pte,
+ details->nonlinear_vma,
+ mm, page, addr);
+ }
if (PageAnon(page))
anon_rss--;
else {
@@ -2159,12 +2160,13 @@ static int do_file_page(struct mm_struct
int write_access, pte_t orig_pte)
{
pgoff_t pgoff;
+ pgprot_t pgprot;
int err;
if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
return VM_FAULT_MINOR;
- if (unlikely(!(vma->vm_flags & VM_NONLINEAR))) {
+ if (unlikely(!(vma->vm_flags & (VM_NONLINEAR|VM_MANYPROTS)))) {
/*
* Page table corrupted: show pte and kill process.
*/
@@ -2174,8 +2176,11 @@ static int do_file_page(struct mm_struct
/* We can then assume vm->vm_ops && vma->vm_ops->populate */
pgoff = pte_to_pgoff(orig_pte);
+ pgprot = (vma->vm_flags & VM_MANYPROTS) ? pte_to_pgprot(orig_pte) :
+ vma->vm_page_prot;
+
err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE,
- vma->vm_page_prot, pgoff, 0);
+ pgprot, pgoff, 0);
if (err == -ENOMEM)
return VM_FAULT_OOM;
if (err)
Index: linux-2.6.git/mm/rmap.c
===================================================================
--- linux-2.6.git.orig/mm/rmap.c
+++ linux-2.6.git/mm/rmap.c
@@ -721,8 +721,7 @@ static void try_to_unmap_cluster(unsigne
pteval = ptep_clear_flush(vma, address, pte);
/* If nonlinear, store the file page offset in the pte. */
- if (page->index != linear_page_index(vma, address))
- set_pte_at(mm, address, pte, pgoff_to_pte(page->index));
+ save_nonlinear_pte(pteval, pte, vma, mm, page, address);
/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))
Index: linux-2.6.git/mm/filemap.c
===================================================================
--- linux-2.6.git.orig/mm/filemap.c
+++ linux-2.6.git/mm/filemap.c
@@ -1587,7 +1587,7 @@ repeat:
page_cache_release(page);
return err;
}
- } else if (vma->vm_flags & VM_NONLINEAR) {
+ } else if (vma->vm_flags & (VM_NONLINEAR|VM_MANYPROTS)) {
/* No page was found just because we can't read it in now (being
* here implies nonblock != 0), but the page may exist, so set
* the PTE to fault it in later. */
Index: linux-2.6.git/mm/shmem.c
===================================================================
--- linux-2.6.git.orig/mm/shmem.c
+++ linux-2.6.git/mm/shmem.c
@@ -1275,7 +1275,7 @@ static int shmem_populate(struct vm_area
page_cache_release(page);
return err;
}
- } else if (vma->vm_flags & VM_NONLINEAR) {
+ } else if (vma->vm_flags & (VM_NONLINEAR|VM_MANYPROTS)) {
/* No page was found just because we can't read it in
* now (being here implies nonblock != 0), but the page
* may exist, so set the PTE to fault it in later. */
--
* [patch 04/14] remap_file_pages protection support: disallow mprotect() on manyprots mappings
2006-04-30 17:29 [patch 00/14] remap_file_pages protection support blaisorblade
` (2 preceding siblings ...)
2006-04-30 17:29 ` [patch 03/14] remap_file_pages protection support: handle MANYPROTS VMAs blaisorblade
@ 2006-04-30 17:29 ` blaisorblade
2006-04-30 17:29 ` [patch 05/14] remap_file_pages protection support: cleanup syscall checks blaisorblade
` (11 subsequent siblings)
15 siblings, 0 replies; 46+ messages in thread
From: blaisorblade @ 2006-04-30 17:29 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Paolo Blaisorblade Giarrusso
[-- Attachment #1: rfp/04-rfp-stop-mprotect.diff --]
[-- Type: text/plain, Size: 1050 bytes --]
From: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
For now we (Hugh and I) have found no agreement on which behaviour to
implement here, so, at least as a stop-gap, return an error.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Index: linux-2.6.git/mm/mprotect.c
===================================================================
--- linux-2.6.git.orig/mm/mprotect.c
+++ linux-2.6.git/mm/mprotect.c
@@ -217,6 +217,13 @@ sys_mprotect(unsigned long start, size_t
error = -ENOMEM;
if (!vma)
goto out;
+
+ /* If a need is felt, an appropriate behaviour may be implemented for
+ * this case. We haven't agreed yet on which behavior is appropriate. */
+ error = -EACCES;
+ if (vma->vm_flags & VM_MANYPROTS)
+ goto out;
+
if (unlikely(grows & PROT_GROWSDOWN)) {
if (vma->vm_start >= end)
goto out;
--
* [patch 05/14] remap_file_pages protection support: cleanup syscall checks
2006-04-30 17:29 [patch 00/14] remap_file_pages protection support blaisorblade
` (3 preceding siblings ...)
2006-04-30 17:29 ` [patch 04/14] remap_file_pages protection support: disallow mprotect() on manyprots mappings blaisorblade
@ 2006-04-30 17:29 ` blaisorblade
2006-04-30 17:29 ` [patch 06/14] remap_file_pages protection support: enhance syscall interface blaisorblade
` (10 subsequent siblings)
15 siblings, 0 replies; 46+ messages in thread
From: blaisorblade @ 2006-04-30 17:29 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Paolo Blaisorblade Giarrusso
[-- Attachment #1: rfp/05-rfp-cleanup-sc-check.diff --]
[-- Type: text/plain, Size: 3658 bytes --]
From: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
This patch only reorganizes the code, with no change in behaviour. It makes
the code more readable on its own, and is needed by the next patches; I've
split it out to avoid cluttering the real patches.
*) remap_file_pages protection support: use EOVERFLOW ret code
Use -EOVERFLOW ("Value too large for defined data type") rather than -EINVAL
when we cannot store the file offset in the PTE.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Index: linux-2.6.git/mm/fremap.c
===================================================================
--- linux-2.6.git.orig/mm/fremap.c
+++ linux-2.6.git/mm/fremap.c
@@ -140,7 +140,7 @@ out:
* future.
*/
asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
- unsigned long __prot, unsigned long pgoff, unsigned long flags)
+ unsigned long prot, unsigned long pgoff, unsigned long flags)
{
struct mm_struct *mm = current->mm;
struct address_space *mapping;
@@ -148,9 +148,10 @@ asmlinkage long sys_remap_file_pages(uns
struct vm_area_struct *vma;
int err = -EINVAL;
int has_write_lock = 0;
+ pgprot_t pgprot;
- if (__prot)
- return err;
+ if (prot)
+ goto out;
/*
* Sanitize the syscall parameters:
*/
@@ -159,17 +160,19 @@ asmlinkage long sys_remap_file_pages(uns
/* Does the address range wrap, or is the span zero-sized? */
if (start + size <= start)
- return err;
+ goto out;
/* Can we represent this offset inside this architecture's pte's? */
#if PTE_FILE_MAX_BITS < BITS_PER_LONG
- if (pgoff + (size >> PAGE_SHIFT) >= (1UL << PTE_FILE_MAX_BITS))
- return err;
+ if (pgoff + (size >> PAGE_SHIFT) >= (1UL << PTE_FILE_MAX_BITS)) {
+ err = -EOVERFLOW;
+ goto out;
+ }
#endif
/* We need down_write() to change vma->vm_flags. */
down_read(&mm->mmap_sem);
- retry:
+retry:
vma = find_vma(mm, start);
/*
@@ -178,12 +181,21 @@ asmlinkage long sys_remap_file_pages(uns
* the single existing vma. vm_private_data is used as a
* swapout cursor in a VM_NONLINEAR vma.
*/
- if (vma && (vma->vm_flags & VM_SHARED) &&
- (!vma->vm_private_data || (vma->vm_flags & VM_NONLINEAR)) &&
- vma->vm_ops && vma->vm_ops->populate &&
- end > start && start >= vma->vm_start &&
- end <= vma->vm_end) {
+ if (!vma)
+ goto out_unlock;
+
+ if (!(vma->vm_flags & VM_SHARED))
+ goto out_unlock;
+
+ if (!vma->vm_ops || !vma->vm_ops->populate)
+ goto out_unlock;
+ if (end <= start || start < vma->vm_start || end > vma->vm_end)
+ goto out_unlock;
+
+ pgprot = vma->vm_page_prot;
+
+ if (!vma->vm_private_data || (vma->vm_flags & VM_NONLINEAR)) {
/* Must set VM_NONLINEAR before any pages are populated. */
if (pgoff != linear_page_index(vma, start) &&
!(vma->vm_flags & VM_NONLINEAR)) {
@@ -203,9 +215,8 @@ asmlinkage long sys_remap_file_pages(uns
spin_unlock(&mapping->i_mmap_lock);
}
- err = vma->vm_ops->populate(vma, start, size,
- vma->vm_page_prot,
- pgoff, flags & MAP_NONBLOCK);
+ err = vma->vm_ops->populate(vma, start, size, pgprot, pgoff,
+ flags & MAP_NONBLOCK);
/*
* We would like to clear VM_NONLINEAR, in the case when
@@ -214,11 +225,14 @@ asmlinkage long sys_remap_file_pages(uns
* successful populate, and have no way to upgrade sem.
*/
}
+
+out_unlock:
if (likely(!has_write_lock))
up_read(&mm->mmap_sem);
else
up_write(&mm->mmap_sem);
+out:
return err;
}
--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
* [patch 06/14] remap_file_pages protection support: enhance syscall interface
2006-04-30 17:29 [patch 00/14] remap_file_pages protection support blaisorblade
` (4 preceding siblings ...)
2006-04-30 17:29 ` [patch 05/14] remap_file_pages protection support: cleanup syscall checks blaisorblade
@ 2006-04-30 17:29 ` blaisorblade
2006-04-30 17:30 ` [patch 07/14] remap_file_pages protection support: support private vma for MAP_POPULATE blaisorblade
` (9 subsequent siblings)
15 siblings, 0 replies; 46+ messages in thread
From: blaisorblade @ 2006-04-30 17:29 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
[-- Attachment #1: rfp/06-rfp-enhance-syscall.diff --]
[-- Type: text/plain, Size: 3684 bytes --]
From: Ingo Molnar <mingo@elte.hu>, Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Enable the 'prot' parameter for shared writable mappings (the primary target of
remap_file_pages), without breaking up the vma.
This contains only the changes to the syscall code, based on Ingo's patch.
Unlike his version, I have *not* added a new syscall; instead I add a new flag
(MAP_CHGPROT) which the application must pass to get the new behavior (prot != 0
is then accepted, and prot == 0 means PROT_NONE).
Following Hugh's suggestion, the permission check on the VMA is simplified by
reusing mprotect()'s trick.
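The mprotect()-style check reused here relies on the VM_MAY* bits sitting four
bits above the corresponding VM_* bits, so that "vm_flags >> 4" lines the
maximum-allowed protections up with the requested ones. A minimal userspace
sketch of that trick follows; the constants mirror <linux/mm.h> values and the
helper name is hypothetical, not kernel code.

```c
#include <assert.h>

/* Bit values as in <linux/mm.h>: each VM_MAY* bit is the matching
 * VM_* bit shifted left by four. */
#define VM_READ     0x0001UL
#define VM_WRITE    0x0002UL
#define VM_EXEC     0x0004UL
#define VM_MAYREAD  0x0010UL
#define VM_MAYWRITE 0x0020UL
#define VM_MAYEXEC  0x0040UL

/* Hypothetical helper: return 0 if every requested VM_* bit is backed
 * by its VM_MAY* bit in vm_flags, -1 (standing in for -EPERM) otherwise.
 * "vm_flags >> 4" shifts VM_MAY* into the position of VM_*. */
static int chgprot_allowed(unsigned long vm_flags, unsigned long vm_prots)
{
	return ((vm_prots & ~(vm_flags >> 4)) & (VM_READ | VM_WRITE | VM_EXEC))
		? -1 : 0;
}
```

In the patch itself, a permitted request is then turned into a pgprot_t by
indexing the standard protection_map with the requested bits plus VM_SHARED.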
Index: linux-2.6.git/mm/fremap.c
===================================================================
--- linux-2.6.git.orig/mm/fremap.c
+++ linux-2.6.git/mm/fremap.c
@@ -4,6 +4,10 @@
* Explicit pagetable population and nonlinear (random) mappings support.
*
* started by Ingo Molnar, Copyright (C) 2002, 2003
+ *
+ * support of nonuniform remappings:
+ * Copyright (C) 2004 Ingo Molnar
+ * Copyright (C) 2005 Paolo 'Blaisorblade' Giarrusso
*/
#include <linux/mm.h>
@@ -126,18 +130,14 @@ out:
* file within an existing vma.
* @start: start of the remapped virtual memory range
* @size: size of the remapped virtual memory range
- * @prot: new protection bits of the range
+ * @prot: new protection bits of the range, must be 0 if not using MAP_CHGPROT
* @pgoff: to be mapped page of the backing store file
- * @flags: 0 or MAP_NONBLOCKED - the later will cause no IO.
+ * @flags: bits MAP_CHGPROT or MAP_NONBLOCK - the latter will cause no I/O.
*
* this syscall works purely via pagetables, so it's the most efficient
* way to map the same (large) file into a given virtual window. Unlike
* mmap()/mremap() it does not create any new vmas. The new mappings are
* also safe across swapout.
- *
- * NOTE: the 'prot' parameter right now is ignored, and the vma's default
- * protection is used. Arbitrary protections might be implemented in the
- * future.
*/
asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
unsigned long prot, unsigned long pgoff, unsigned long flags)
@@ -150,7 +150,7 @@ asmlinkage long sys_remap_file_pages(uns
int has_write_lock = 0;
pgprot_t pgprot;
- if (prot)
+ if (prot && !(flags & MAP_CHGPROT))
goto out;
/*
* Sanitize the syscall parameters:
@@ -193,7 +193,19 @@ retry:
if (end <= start || start < vma->vm_start || end > vma->vm_end)
goto out_unlock;
- pgprot = vma->vm_page_prot;
+ if (flags & MAP_CHGPROT) {
+ unsigned long vm_prots = calc_vm_prot_bits(prot);
+
+ /* vma->vm_flags >> 4 shifts VM_MAY% in place of VM_% */
+ if ((vm_prots & ~(vma->vm_flags >> 4)) &
+ (VM_READ | VM_WRITE | VM_EXEC)) {
+ err = -EPERM;
+ goto out_unlock;
+ }
+
+ pgprot = protection_map[vm_prots | VM_SHARED];
+ } else
+ pgprot = vma->vm_page_prot;
if (!vma->vm_private_data || (vma->vm_flags & VM_NONLINEAR)) {
/* Must set VM_NONLINEAR before any pages are populated. */
@@ -215,6 +227,17 @@ retry:
spin_unlock(&mapping->i_mmap_lock);
}
+ if (pgprot_val(pgprot) != pgprot_val(vma->vm_page_prot) &&
+ !(vma->vm_flags & VM_MANYPROTS)) {
+ if (!has_write_lock) {
+ up_read(&mm->mmap_sem);
+ down_write(&mm->mmap_sem);
+ has_write_lock = 1;
+ goto retry;
+ }
+ vma->vm_flags |= VM_MANYPROTS;
+ }
+
err = vma->vm_ops->populate(vma, start, size, pgprot, pgoff,
flags & MAP_NONBLOCK);
* [patch 07/14] remap_file_pages protection support: support private vma for MAP_POPULATE
2006-04-30 17:29 [patch 00/14] remap_file_pages protection support blaisorblade
` (5 preceding siblings ...)
2006-04-30 17:29 ` [patch 06/14] remap_file_pages protection support: enhance syscall interface blaisorblade
@ 2006-04-30 17:30 ` blaisorblade
2006-04-30 17:30 ` [patch 08/14] remap_file_pages protection support: use FAULT_SIGSEGV for protection checking blaisorblade
` (8 subsequent siblings)
15 siblings, 0 replies; 46+ messages in thread
From: blaisorblade @ 2006-04-30 17:30 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Paolo Blaisorblade Giarrusso
[-- Attachment #1: rfp/07-rfp-private-vma.diff --]
[-- Type: text/plain, Size: 1400 bytes --]
From: Ingo Molnar <mingo@elte.hu>
Fix mmap(MAP_POPULATE | MAP_PRIVATE). The VMA does not need to be shared if we
are not rearranging pages, and supporting this is trivial.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
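From user space, what this patch permits is prefaulting a private file mapping.
A minimal Linux-only sketch, using a throwaway tmpfile() with arbitrary
contents (the function name is hypothetical):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a small file MAP_PRIVATE | MAP_POPULATE and read from it.
 * With MAP_POPULATE the pages are faulted in at mmap() time, so the
 * first access should not take a page fault. */
static char first_byte_of_private_populated_map(void)
{
	FILE *f = tmpfile();
	char c;
	void *p;

	assert(f != NULL);
	assert(fwrite("ABC", 1, 3, f) == 3);
	assert(fflush(f) == 0);

	p = mmap(NULL, 3, PROT_READ,
		 MAP_PRIVATE | MAP_POPULATE, fileno(f), 0);
	assert(p != MAP_FAILED);
	c = *(char *)p;		/* pages were prefaulted by MAP_POPULATE */
	munmap(p, 3);
	fclose(f);
	return c;
}
```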
Index: linux-2.6.git/mm/fremap.c
===================================================================
--- linux-2.6.git.orig/mm/fremap.c
+++ linux-2.6.git/mm/fremap.c
@@ -184,9 +184,6 @@ retry:
if (!vma)
goto out_unlock;
- if (!(vma->vm_flags & VM_SHARED))
- goto out_unlock;
-
if (!vma->vm_ops || !vma->vm_ops->populate)
goto out_unlock;
@@ -211,6 +208,8 @@ retry:
/* Must set VM_NONLINEAR before any pages are populated. */
if (pgoff != linear_page_index(vma, start) &&
!(vma->vm_flags & VM_NONLINEAR)) {
+ if (!(vma->vm_flags & VM_SHARED))
+ goto out_unlock;
if (!has_write_lock) {
up_read(&mm->mmap_sem);
down_write(&mm->mmap_sem);
@@ -229,6 +228,8 @@ retry:
if (pgprot_val(pgprot) != pgprot_val(vma->vm_page_prot) &&
!(vma->vm_flags & VM_MANYPROTS)) {
+ if (!(vma->vm_flags & VM_SHARED))
+ goto out_unlock;
if (!has_write_lock) {
up_read(&mm->mmap_sem);
down_write(&mm->mmap_sem);
* [patch 08/14] remap_file_pages protection support: use FAULT_SIGSEGV for protection checking
2006-04-30 17:29 [patch 00/14] remap_file_pages protection support blaisorblade
` (6 preceding siblings ...)
2006-04-30 17:30 ` [patch 07/14] remap_file_pages protection support: support private vma for MAP_POPULATE blaisorblade
@ 2006-04-30 17:30 ` blaisorblade
2006-04-30 17:30 ` [patch 09/14] remap_file_pages protection support: fix race condition with concurrent faults on same address space blaisorblade
` (7 subsequent siblings)
15 siblings, 0 replies; 46+ messages in thread
From: blaisorblade @ 2006-04-30 17:30 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Paolo Blaisorblade Giarrusso
[-- Attachment #1: rfp/09-rfp-add-vm_fault_sigsegv.diff --]
[-- Type: text/plain, Size: 11878 bytes --]
From: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>, Ingo Molnar <mingo@elte.hu>
This is the most intrusive patch, but it could not be reduced much, not even by
limiting protection support to the bare minimum needed for UML (so I left the
interface generic).
The arch fault handler used to check protections itself; now that check must
move to the generic VM when the VMA is non-uniform, since the vma protections
are unreliable in that case once a pte_file PTE has been set or a page has been
installed.
So we change the prototype of __handle_mm_fault() to tell it the kind of
access, so that it can do the protection checking. handle_mm_fault() keeps its
API, but gains the new VM_FAULT_SIGSEGV return value.
=== Issue 1 (trivial changes needed in every arch):
This value should be handled in every arch-specific fault handler. However,
spurious BUG/OOM killings can occur only when the new functionality is used.
=== Issue 2 (solved afterwards):
* Another problem I've just discovered is that PTRACE_POKETEXT's
access_process_vm() on write-protected VM_MANYPROTS vmas won't work. This is
handled in a later patch.
=== Issue 3 (solved afterwards):
* There is also a (potential) problem on VM_MANYPROTS vmas: in
handle_pte_fault(), if the PTE is present we unconditionally return
VM_FAULT_SIGSEGV, because the PTE was already up-to-date.
This is removed in the next patch, because it is wrong for two reasons:
1) it isn't thread safe - the fault may have occurred while the PTE was not yet
installed, with the PTE then installed by a fault from another thread;
2) it has proven to be too strict, at least for UML, so it may break other
arches too (only for the new functionality) - at least peculiar ones. This
problem was due to handle_mm_fault() being called for TLB faults rather than
PTE faults. I'm leaving this note for reference, in case any other arch does
similarly strange things.
=== Implementation and tradeoff notes:
* do_file_page() installs the PTE without checking the fault type; if the
access was not allowed, the task takes another fault and dies only then. I've
left this for now to exercise the code more, and it works anyway; besides, this
keeps the fast path potentially more efficient.
* I've made sure do_no_page() faults in pages with their *exact* permissions
for non-uniform VMAs.
Actually, the code already behaves this way for shared vmas, since
vma->vm_page_prot is (supposed to be) already writable when the VMA is. I hope
this doesn't vary across architectures.
However, for possible future handling of private mappings, this may be needed
again.
* For checking, we simply reuse the standard protection_map, by creating a
pte_t value with the vma->vm_page_prot protection and testing
pte_{read,write,exec} directly on it.
I use physical frame number "0" to create the PTE; even though this probably
isn't realistic, I assume that pfn_pte() and the access macros will work
anyway.
Changes are included for the i386, x86_64 and UML handlers. This isn't enough
to make UML work, however, because UML has some peculiarities; subsequent
patches fix that.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
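The checking trick described above can be sketched in userspace: build a
throwaway PTE from the vma's page protection (the real patch uses
pfn_pte(0, vma->vm_page_prot)) and interrogate it through the pte_read/
pte_write/pte_exec accessors rather than open-coding architecture bits. The
bit values and mock types below are illustrative, not the real i386 layout.

```c
#include <assert.h>

/* Illustrative protection bits; real kernels define these per-arch
 * in <asm/pgtable.h>. */
#define P_USER 0x1UL	/* readable by user space */
#define P_RW   0x2UL	/* writable */
#define P_EXEC 0x4UL	/* executable */

#define VM_READ  0x1
#define VM_WRITE 0x2
#define VM_EXEC  0x4

typedef struct { unsigned long val; } pte_t;

/* pfn 0 is fine here: only the protection bits are ever consulted. */
static pte_t pfn_pte(unsigned long pfn, unsigned long prot)
{
	pte_t p = { (pfn << 12) | prot };
	return p;
}
static int pte_read(pte_t p)  { return !!(p.val & P_USER); }
static int pte_write(pte_t p) { return !!(p.val & P_RW); }
static int pte_exec(pte_t p)  { return !!(p.val & P_EXEC); }

/* Mirror of the patch's check_perms(); -1 stands in for -EPERM. */
static int check_perms(unsigned long page_prot, int access_mask)
{
	pte_t pte = pfn_pte(0UL, page_prot);

	if ((access_mask & VM_WRITE) && !pte_write(pte))
		return -1;
	if ((access_mask & VM_READ) && !pte_read(pte))
		return -1;
	if ((access_mask & VM_EXEC) && !pte_exec(pte))
		return -1;
	return 0;
}
```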
Index: linux-2.6.git/arch/i386/mm/fault.c
===================================================================
--- linux-2.6.git.orig/arch/i386/mm/fault.c
+++ linux-2.6.git/arch/i386/mm/fault.c
@@ -397,6 +397,14 @@ fastcall void __kprobes do_page_fault(st
good_area:
si_code = SEGV_ACCERR;
write = 0;
+
+ /* If the PTE is not present, the vma protection are not accurate if
+ * VM_MANYPROTS; present PTE's are correct for VM_MANYPROTS. */
+ if (unlikely(vma->vm_flags & VM_MANYPROTS)) {
+ write = error_code & 2;
+ goto survive;
+ }
+
switch (error_code & 3) {
default: /* 3: write, present */
#ifdef TEST_VERIFY_AREA
@@ -433,6 +441,8 @@ good_area:
goto do_sigbus;
case VM_FAULT_OOM:
goto out_of_memory;
+ case VM_FAULT_SIGSEGV:
+ goto bad_area;
default:
BUG();
}
Index: linux-2.6.git/arch/um/kernel/trap_kern.c
===================================================================
--- linux-2.6.git.orig/arch/um/kernel/trap_kern.c
+++ linux-2.6.git/arch/um/kernel/trap_kern.c
@@ -68,6 +68,11 @@ int handle_page_fault(unsigned long addr
good_area:
*code_out = SEGV_ACCERR;
+ /* If the PTE is not present, the vma protection are not accurate if
+ * VM_MANYPROTS; present PTE's are correct for VM_MANYPROTS. */
+ if (unlikely(vma->vm_flags & VM_MANYPROTS))
+ goto survive;
+
if(is_write && !(vma->vm_flags & VM_WRITE))
goto out;
@@ -77,7 +82,7 @@ good_area:
do {
survive:
- switch (handle_mm_fault(mm, vma, address, is_write)){
+ switch (handle_mm_fault(mm, vma, address, is_write)) {
case VM_FAULT_MINOR:
current->min_flt++;
break;
@@ -87,6 +92,9 @@ survive:
case VM_FAULT_SIGBUS:
err = -EACCES;
goto out;
+ case VM_FAULT_SIGSEGV:
+ err = -EFAULT;
+ goto out;
case VM_FAULT_OOM:
err = -ENOMEM;
goto out_of_memory;
Index: linux-2.6.git/include/linux/mm.h
===================================================================
--- linux-2.6.git.orig/include/linux/mm.h
+++ linux-2.6.git/include/linux/mm.h
@@ -623,10 +623,11 @@ static inline int page_mapped(struct pag
* Used to decide whether a process gets delivered SIGBUS or
* just gets major/minor fault counters bumped up.
*/
-#define VM_FAULT_OOM 0x00
-#define VM_FAULT_SIGBUS 0x01
-#define VM_FAULT_MINOR 0x02
-#define VM_FAULT_MAJOR 0x03
+#define VM_FAULT_OOM 0x00
+#define VM_FAULT_SIGBUS 0x01
+#define VM_FAULT_MINOR 0x02
+#define VM_FAULT_MAJOR 0x03
+#define VM_FAULT_SIGSEGV 0x04
/*
* Special case for get_user_pages.
@@ -732,14 +733,16 @@ extern int install_page(struct mm_struct
extern int install_file_pte(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, unsigned long pgoff, pgprot_t prot);
#ifdef CONFIG_MMU
+
+/* We reuse VM_READ, VM_WRITE and VM_EXEC for the @access_mask. */
extern int __handle_mm_fault(struct mm_struct *mm,struct vm_area_struct *vma,
- unsigned long address, int write_access);
+ unsigned long address, int access_mask);
static inline int handle_mm_fault(struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long address,
int write_access)
{
- return __handle_mm_fault(mm, vma, address, write_access) &
+ return __handle_mm_fault(mm, vma, address, write_access ? VM_WRITE : VM_READ) &
(~VM_FAULT_WRITE);
}
#else
Index: linux-2.6.git/mm/memory.c
===================================================================
--- linux-2.6.git.orig/mm/memory.c
+++ linux-2.6.git/mm/memory.c
@@ -959,6 +959,7 @@ no_page_table:
return page;
}
+/* Return number of faulted-in pages. */
int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int len, int write, int force,
struct page **pages, struct vm_area_struct **vmas)
@@ -1062,6 +1063,7 @@ int get_user_pages(struct task_struct *t
case VM_FAULT_MAJOR:
tsk->maj_flt++;
break;
+ case VM_FAULT_SIGSEGV:
case VM_FAULT_SIGBUS:
return i ? i : -EFAULT;
case VM_FAULT_OOM:
@@ -2117,6 +2119,8 @@ retry:
/* Only go through if we didn't race with anybody else... */
if (pte_none(*page_table)) {
flush_icache_page(vma, new_page);
+ /* This already sets the PTE to be rw if appropriate, except for
+ * private COW pages. */
entry = mk_pte(new_page, vma->vm_page_prot);
if (write_access)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
@@ -2146,6 +2150,25 @@ oom:
return VM_FAULT_OOM;
}
+static inline int check_perms(struct vm_area_struct * vma, int access_mask) {
+ if (unlikely(vma->vm_flags & VM_MANYPROTS)) {
+ /* we used to check protections in arch handler, but with
+ * VM_MANYPROTS the check is skipped. */
+ /* access_mask contains the type of the access, vm_flags are the
+ * declared protections, pte has the protection which will be
+ * given to the PTE's in that area. */
+ pte_t pte = pfn_pte(0UL, vma->vm_page_prot);
+ if ((access_mask & VM_WRITE) && !pte_write(pte))
+ goto err;
+ if ((access_mask & VM_READ) && !pte_read(pte))
+ goto err;
+ if ((access_mask & VM_EXEC) && !pte_exec(pte))
+ goto err;
+ }
+ return 0;
+err:
+ return -EPERM;
+}
/*
* Fault of a previously existing named mapping. Repopulate the pte
* from the encoded file_pte if possible. This enables swappable
@@ -2203,14 +2226,21 @@ static int do_file_page(struct mm_struct
*/
static inline int handle_pte_fault(struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long address,
- pte_t *pte, pmd_t *pmd, int write_access)
+ pte_t *pte, pmd_t *pmd, int access_mask)
{
pte_t entry;
pte_t old_entry;
spinlock_t *ptl;
+ int write_access = access_mask & VM_WRITE;
old_entry = entry = *pte;
if (!pte_present(entry)) {
+ /* when pte_file(), the VMA protections are useless. Otherwise,
+ * we need to check VM_MANYPROTS, because in that case the arch
+ * fault handler skips the VMA protection check. */
+ if (!pte_file(entry) && check_perms(vma, access_mask))
+ goto out_segv;
+
if (pte_none(entry)) {
if (!vma->vm_ops || !vma->vm_ops->nopage)
return do_anonymous_page(mm, vma, address,
@@ -2229,6 +2259,12 @@ static inline int handle_pte_fault(struc
spin_lock(ptl);
if (unlikely(!pte_same(*pte, entry)))
goto unlock;
+
+ /* VM_MANYPROTS vma's have PTE's always installed with the correct
+ * protection. So, generate a SIGSEGV if a fault is caught there. */
+ if (unlikely(vma->vm_flags & VM_MANYPROTS))
+ goto out_segv;
+
if (write_access) {
if (!pte_write(entry))
return do_wp_page(mm, vma, address,
@@ -2253,13 +2289,16 @@ static inline int handle_pte_fault(struc
unlock:
pte_unmap_unlock(pte, ptl);
return VM_FAULT_MINOR;
+out_segv:
+ pte_unmap_unlock(pte, ptl);
+ return VM_FAULT_SIGSEGV;
}
/*
* By the time we get here, we already hold the mm semaphore
*/
int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, int write_access)
+ unsigned long address, int access_mask)
{
pgd_t *pgd;
pud_t *pud;
@@ -2271,7 +2310,7 @@ int __handle_mm_fault(struct mm_struct *
inc_page_state(pgfault);
if (unlikely(is_vm_hugetlb_page(vma)))
- return hugetlb_fault(mm, vma, address, write_access);
+ return hugetlb_fault(mm, vma, address, access_mask & VM_WRITE);
pgd = pgd_offset(mm, address);
pud = pud_alloc(mm, pgd, address);
@@ -2284,7 +2323,7 @@ int __handle_mm_fault(struct mm_struct *
if (!pte)
return VM_FAULT_OOM;
- return handle_pte_fault(mm, vma, address, pte, pmd, write_access);
+ return handle_pte_fault(mm, vma, address, pte, pmd, access_mask);
}
EXPORT_SYMBOL_GPL(__handle_mm_fault);
Index: linux-2.6.git/arch/x86_64/mm/fault.c
===================================================================
--- linux-2.6.git.orig/arch/x86_64/mm/fault.c
+++ linux-2.6.git/arch/x86_64/mm/fault.c
@@ -423,6 +423,12 @@ asmlinkage void __kprobes do_page_fault(
good_area:
info.si_code = SEGV_ACCERR;
write = 0;
+
+ if (unlikely(vma->vm_flags & VM_MANYPROTS)) {
+ write = error_code & PF_PROT;
+ goto handle_fault;
+ }
+
switch (error_code & (PF_PROT|PF_WRITE)) {
default: /* 3: write, present */
/* fall through */
@@ -438,6 +444,7 @@ good_area:
goto bad_area;
}
+handle_fault:
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
@@ -452,8 +459,12 @@ good_area:
break;
case VM_FAULT_SIGBUS:
goto do_sigbus;
- default:
+ case VM_FAULT_OOM:
goto out_of_memory;
+ case VM_FAULT_SIGSEGV:
+ goto bad_area;
+ default:
+ BUG();
}
up_read(&mm->mmap_sem);
* [patch 09/14] remap_file_pages protection support: fix race condition with concurrent faults on same address space
2006-04-30 17:29 [patch 00/14] remap_file_pages protection support blaisorblade
` (7 preceding siblings ...)
2006-04-30 17:30 ` [patch 08/14] remap_file_pages protection support: use FAULT_SIGSEGV for protection checking blaisorblade
@ 2006-04-30 17:30 ` blaisorblade
2006-04-30 17:30 ` [patch 10/14] remap_file_pages protection support: fix get_user_pages() on VM_MANYPROTS vmas blaisorblade
` (6 subsequent siblings)
15 siblings, 0 replies; 46+ messages in thread
From: blaisorblade @ 2006-04-30 17:30 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Paolo Blaisorblade Giarrusso
[-- Attachment #1: rfp/10-rfp-fix-concurrent-faults.diff --]
[-- Type: text/plain, Size: 1673 bytes --]
From: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
This is the race noted by Hugh Dickins: a thread may fault because a PTE is
absent, and the PTE may then be installed by another thread, so we would act on
a stale pte_present() result; we must recheck the permissions ourselves.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
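The fix boils down to: when a present PTE is found under the lock on a
VM_MANYPROTS vma, derive the effective protection from the installed PTE
itself (pte_to_pgprot() in the patch) instead of assuming the fault was bogus,
and only SIGSEGV when that protection really forbids the access. A sketch of
that decision logic, with illustrative constants and a hypothetical function
name:

```c
#include <assert.h>

#define VM_READ  0x1
#define VM_WRITE 0x2
#define VM_EXEC  0x4

#define FAULT_MINOR   0	/* benign: another thread already fixed it up */
#define FAULT_SIGSEGV 1

/* Illustrative protection bits carried by the installed PTE. */
#define PTE_R 0x1UL
#define PTE_W 0x2UL
#define PTE_X 0x4UL

/* On a VM_MANYPROTS vma a present PTE may simply mean another thread
 * raced us and already serviced the fault, so check the access against
 * the protection actually installed in the PTE. */
static int manyprots_present_fault(unsigned long pte_prot, int access_mask)
{
	if ((access_mask & VM_WRITE) && !(pte_prot & PTE_W))
		return FAULT_SIGSEGV;
	if ((access_mask & VM_READ) && !(pte_prot & PTE_R))
		return FAULT_SIGSEGV;
	if ((access_mask & VM_EXEC) && !(pte_prot & PTE_X))
		return FAULT_SIGSEGV;
	return FAULT_MINOR;	/* the racing thread's fix-up was legitimate */
}
```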
Index: linux-2.6.git/mm/memory.c
===================================================================
--- linux-2.6.git.orig/mm/memory.c
+++ linux-2.6.git/mm/memory.c
@@ -2261,9 +2261,21 @@ static inline int handle_pte_fault(struc
goto unlock;
/* VM_MANYPROTS vma's have PTE's always installed with the correct
- * protection. So, generate a SIGSEGV if a fault is caught there. */
- if (unlikely(vma->vm_flags & VM_MANYPROTS))
- goto out_segv;
+ * protection, so if we got a fault on a present PTE we're in trouble.
+ * However, the pte_present() may simply be the result of a race
+ * condition with another thread having already fixed the fault. So go
+ * the slow way. */
+ if (unlikely(vma->vm_flags & VM_MANYPROTS)) {
+ pgprot_t pgprot = pte_to_pgprot(*pte);
+ pte_t test_entry = pfn_pte(0, pgprot);
+
+ if (unlikely((access_mask & VM_WRITE) && !pte_write(test_entry)))
+ goto out_segv;
+ if (unlikely((access_mask & VM_READ) && !pte_read(test_entry)))
+ goto out_segv;
+ if (unlikely((access_mask & VM_EXEC) && !pte_exec(test_entry)))
+ goto out_segv;
+ }
if (write_access) {
if (!pte_write(entry))
* [patch 10/14] remap_file_pages protection support: fix get_user_pages() on VM_MANYPROTS vmas
2006-04-30 17:29 [patch 00/14] remap_file_pages protection support blaisorblade
` (8 preceding siblings ...)
2006-04-30 17:30 ` [patch 09/14] remap_file_pages protection support: fix race condition with concurrent faults on same address space blaisorblade
@ 2006-04-30 17:30 ` blaisorblade
2006-04-30 17:30 ` [patch 11/14] remap_file_pages protection support: pte_present should not trigger on PTE_FILE PROTNONE ptes blaisorblade
` (5 subsequent siblings)
15 siblings, 0 replies; 46+ messages in thread
From: blaisorblade @ 2006-04-30 17:30 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
[-- Attachment #1: rfp/10_2-rfp-fix-get_user_pages-revamp.diff --]
[-- Type: text/plain, Size: 7606 bytes --]
From: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
*Untested patch* - I have not written a test case to verify the functionality
of ptrace on a VM_MANYPROTS area.
get_user_pages() may well call __handle_mm_fault() wanting to override
protections, so in that case __handle_mm_fault() should still skip checking
the VM access rights.
Also, get_user_pages() may cause write faults on present read-only PTEs in
VM_MANYPROTS areas (think of PTRACE_POKETEXT), so we must still do do_wp_page()
even on VM_MANYPROTS areas.
So, optionally pass VM_MAYWRITE and/or VM_MAYREAD in the access_mask, and check
VM_MANYPROTS in maybe_mkwrite_file() (a new variant of maybe_mkwrite()).
API note: many flag combinations can be constructed which make no sense, and
the code interprets them quite freely; for instance, VM_MAYREAD|VM_WRITE is
interpreted as VM_MAYWRITE|VM_WRITE.
XXX: Todo: add checking to reject all meaningless flag combinations.
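The access_mask convention can be sketched as follows: a normal fault requests
VM_WRITE or VM_READ, while a forced get_user_pages() substitutes the
corresponding VM_MAY* bit, which makes the VM_MANYPROTS permission check a
no-op while still letting do_wp_page() break COW. Constants mirror
<linux/mm.h>; the helper names are hypothetical.

```c
#include <assert.h>

#define VM_READ     0x0001
#define VM_WRITE    0x0002
#define VM_MAYREAD  0x0010
#define VM_MAYWRITE 0x0020

/* Hypothetical helper: build the access_mask handed to
 * __handle_mm_fault(). force mirrors get_user_pages(force == 1),
 * which overrides protection checks via the VM_MAY* bits. */
static int build_access_mask(int write, int force)
{
	if (force)
		return write ? VM_MAYWRITE : VM_MAYREAD;
	return write ? VM_WRITE : VM_READ;
}

/* The VM_MANYPROTS permission check only fires for non-overriding
 * faults, i.e. when no VM_MAY* bit is present in the mask. */
static int checks_manyprots_perms(int access_mask)
{
	return !(access_mask & (VM_MAYWRITE | VM_MAYREAD));
}
```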
Index: linux-2.6.git/mm/memory.c
===================================================================
--- linux-2.6.git.orig/mm/memory.c
+++ linux-2.6.git/mm/memory.c
@@ -1045,16 +1045,17 @@ int get_user_pages(struct task_struct *t
cond_resched();
while (!(page = follow_page(vma, start, foll_flags))) {
int ret;
- ret = __handle_mm_fault(mm, vma, start,
- foll_flags & FOLL_WRITE);
+ ret = __handle_mm_fault(mm, vma, start, vm_flags);
/*
* The VM_FAULT_WRITE bit tells us that do_wp_page has
* broken COW when necessary, even if maybe_mkwrite
* decided not to set pte_write. We can thus safely do
* subsequent page lookups as if they were reads.
*/
- if (ret & VM_FAULT_WRITE)
+ if (ret & VM_FAULT_WRITE) {
foll_flags &= ~FOLL_WRITE;
+ vm_flags &= ~(VM_WRITE|VM_MAYWRITE);
+ }
switch (ret & ~VM_FAULT_WRITE) {
case VM_FAULT_MINOR:
@@ -1389,7 +1390,20 @@ static inline int pte_unmap_same(struct
* servicing faults for write access. In the normal case, do always want
* pte_mkwrite. But get_user_pages can cause write faults for mappings
* that do not have writing enabled, when used by access_process_vm.
+ *
+ * Also, we must never change protections on VM_MANYPROTS pages; that's only
+ * allowed in do_no_page(), so test only VMA protections there. For other cases
+ * we *know* that VM_MANYPROTS is clear, such as anonymous/swap pages, and in
+ * that case using plain maybe_mkwrite() is an optimization.
+ * Instead, when we may be mapping a file, we must use maybe_mkwrite_file.
*/
+static inline pte_t maybe_mkwrite_file(pte_t pte, struct vm_area_struct *vma)
+{
+ if (likely((vma->vm_flags & (VM_WRITE | VM_MANYPROTS)) == VM_WRITE))
+ pte = pte_mkwrite(pte);
+ return pte;
+}
+
static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
if (likely(vma->vm_flags & VM_WRITE))
@@ -1441,6 +1455,9 @@ static inline void cow_user_page(struct
* We enter with non-exclusive mmap_sem (to exclude vma changes,
* but allow concurrent faults), with pte both mapped and locked.
* We return with mmap_sem still held, but pte unmapped and unlocked.
+ *
+ * Note that a page here can be a shared readonly page where
+ * get_user_pages() (for instance for ptrace()) wants to write to it!
*/
static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *page_table, pmd_t *pmd,
@@ -1460,7 +1477,8 @@ static int do_wp_page(struct mm_struct *
if (reuse) {
flush_cache_page(vma, address, pte_pfn(orig_pte));
entry = pte_mkyoung(orig_pte);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ /* Since it can be shared, it can be VM_MANYPROTS! */
+ entry = maybe_mkwrite_file(pte_mkdirty(entry), vma);
ptep_set_access_flags(vma, address, page_table, entry, 1);
update_mmu_cache(vma, address, entry);
lazy_mmu_prot_update(entry);
@@ -1504,7 +1522,7 @@ gotten:
inc_mm_counter(mm, anon_rss);
flush_cache_page(vma, address, pte_pfn(orig_pte));
entry = mk_pte(new_page, vma->vm_page_prot);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ entry = maybe_mkwrite_file(pte_mkdirty(entry), vma);
ptep_establish(vma, address, page_table, entry);
update_mmu_cache(vma, address, entry);
lazy_mmu_prot_update(entry);
@@ -1930,7 +1948,7 @@ again:
inc_mm_counter(mm, anon_rss);
pte = mk_pte(page, vma->vm_page_prot);
if (write_access && can_share_swap_page(page)) {
- pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+ pte = maybe_mkwrite_file(pte_mkdirty(pte), vma);
write_access = 0;
}
@@ -2231,15 +2249,15 @@ static inline int handle_pte_fault(struc
pte_t entry;
pte_t old_entry;
spinlock_t *ptl;
- int write_access = access_mask & VM_WRITE;
+ int write_access = access_mask & (VM_WRITE|VM_MAYWRITE);
old_entry = entry = *pte;
if (!pte_present(entry)) {
/* when pte_file(), the VMA protections are useless. Otherwise,
* we need to check VM_MANYPROTS, because in that case the arch
* fault handler skips the VMA protection check. */
- if (!pte_file(entry) && check_perms(vma, access_mask))
- goto out_segv;
+ if (!pte_file(entry) && unlikely(check_perms(vma, access_mask)))
+ goto segv;
if (pte_none(entry)) {
if (!vma->vm_ops || !vma->vm_ops->nopage)
@@ -2269,12 +2287,14 @@ static inline int handle_pte_fault(struc
pgprot_t pgprot = pte_to_pgprot(*pte);
pte_t test_entry = pfn_pte(0, pgprot);
- if (unlikely((access_mask & VM_WRITE) && !pte_write(test_entry)))
- goto out_segv;
- if (unlikely((access_mask & VM_READ) && !pte_read(test_entry)))
- goto out_segv;
- if (unlikely((access_mask & VM_EXEC) && !pte_exec(test_entry)))
- goto out_segv;
+ if (likely(!(access_mask & (VM_MAYWRITE|VM_MAYREAD)))) {
+ if (unlikely((access_mask & VM_WRITE) && !pte_write(test_entry)))
+ goto segv_unlock;
+ if (unlikely((access_mask & VM_READ) && !pte_read(test_entry)))
+ goto segv_unlock;
+ if (unlikely((access_mask & VM_EXEC) && !pte_exec(test_entry)))
+ goto segv_unlock;
+ }
}
if (write_access) {
@@ -2301,8 +2321,10 @@ static inline int handle_pte_fault(struc
unlock:
pte_unmap_unlock(pte, ptl);
return VM_FAULT_MINOR;
-out_segv:
+
+segv_unlock:
pte_unmap_unlock(pte, ptl);
+segv:
return VM_FAULT_SIGSEGV;
}
Index: linux-2.6.git/include/linux/mm.h
===================================================================
--- linux-2.6.git.orig/include/linux/mm.h
+++ linux-2.6.git/include/linux/mm.h
@@ -734,7 +734,22 @@ extern int install_file_pte(struct mm_st
#ifdef CONFIG_MMU
-/* We reuse VM_READ, VM_WRITE and VM_EXEC for the @access_mask. */
+/* We reuse VM_READ, VM_WRITE and (optionally) VM_EXEC for the @access_mask, to
+ * report the kind of access we request for permission checking, in case the VMA
+ * is VM_MANYPROTS.
+ *
+ * get_user_pages( force == 1 ) is a special case. It's allowed to override
+ * protection checks, even on VM_MANYPROTS vma.
+ *
+ * To express that, it must replace VM_READ / VM_WRITE with the corresponding
+ * MAY flags.
+ * This allows to force copying COW pages to break sharing even on read-only
+ * page table entries.
+ * PITFALL: you're not allowed to override only part of the checks, and in
+ * general specifying strange combinations of flags may lead to unspecified
+ * results.
+ */
+
extern int __handle_mm_fault(struct mm_struct *mm,struct vm_area_struct *vma,
unsigned long address, int access_mask);
* [patch 11/14] remap_file_pages protection support: pte_present should not trigger on PTE_FILE PROTNONE ptes
2006-04-30 17:29 [patch 00/14] remap_file_pages protection support blaisorblade
` (9 preceding siblings ...)
2006-04-30 17:30 ` [patch 10/14] remap_file_pages protection support: fix get_user_pages() on VM_MANYPROTS vmas blaisorblade
@ 2006-04-30 17:30 ` blaisorblade
2006-05-02 3:53 ` Nick Piggin
2006-04-30 17:30 ` [patch 12/14] remap_file_pages protection support: also set VM_NONLINEAR on nonuniform VMAs blaisorblade
` (4 subsequent siblings)
15 siblings, 1 reply; 46+ messages in thread
From: blaisorblade @ 2006-04-30 17:30 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Paolo Blaisorblade Giarrusso
[-- Attachment #1: rfp/pte_present-for-PROT_NONE-pte.diff --]
[-- Type: text/plain, Size: 2996 bytes --]
From: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
pte_present(pte) implies that pte_pfn(pte) is valid. Normally this holds even
for a _PAGE_PROTNONE pte, but not when such a PTE is installed by the new
install_file_pte(): previously it stored only file offsets, not protections;
with these patches it also stores protections, and can set
_PAGE_PROTNONE|_PAGE_FILE.
zap_pte_range(), when acting on such a pte, calls vm_normal_page(), gets
&mem_map[0], and does page_remove_rmap(); we are easily in trouble, because it
happens to find a page with mapcount == 0, and it BUGs on this!
I've seen this trigger easily and repeatably on UML on 2.6.16-rc3. In the past
this was likely avoided by the PageReserved test - page 0 *had* to be reserved
on i386 (I don't know about UML).
Implementations follow for UML and i386.
To avoid additional overhead, I also considered adding likely() for
_PAGE_PRESENT and unlikely() for the rest, but I'm uncertain about the
validity of possible [un]likely(pte_present()) uses.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
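The macro change can be checked in isolation: with the classic i386 bit
values, a PROT_NONE nonlinear file PTE (_PAGE_PROTNONE|_PAGE_FILE, no
_PAGE_PRESENT) satisfied the old test but must not satisfy the new one. A
userspace sketch, not kernel code:

```c
#include <assert.h>

/* Classic i386 PTE flag values (2.6-era <asm-i386/pgtable.h>). */
#define _PAGE_PRESENT  0x001UL
#define _PAGE_FILE     0x040UL	/* nonlinear file mapping, saved PTE */
#define _PAGE_PROTNONE 0x080UL	/* mapped with PROT_NONE */

/* Old test: any of PRESENT|PROTNONE counts as present, so a
 * PROTNONE file PTE is wrongly treated as a normal mapped page. */
static int pte_present_old(unsigned long pte)
{
	return !!(pte & (_PAGE_PRESENT | _PAGE_PROTNONE));
}

/* New test: PROTNONE only counts when the PTE is not a file PTE. */
static int pte_present_new(unsigned long pte)
{
	return (pte & _PAGE_PRESENT) ||
	       ((pte & (_PAGE_PROTNONE | _PAGE_FILE)) == _PAGE_PROTNONE);
}
```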
Index: linux-2.6.git/include/asm-um/pgtable.h
===================================================================
--- linux-2.6.git.orig/include/asm-um/pgtable.h
+++ linux-2.6.git/include/asm-um/pgtable.h
@@ -158,7 +158,7 @@ extern unsigned long pg0[1024];
#define mk_phys(a, r) ((a) + (((unsigned long) r) << REGION_SHIFT))
#define phys_addr(p) ((p) & ~REGION_MASK)
-#define pte_present(x) pte_get_bits(x, (_PAGE_PRESENT | _PAGE_PROTNONE))
+#define pte_present(x) (pte_get_bits(x, (_PAGE_PRESENT)) || (pte_get_bits(x, (_PAGE_PROTNONE)) && !pte_file(x)))
/*
* =================================
Index: linux-2.6.git/mm/memory.c
===================================================================
--- linux-2.6.git.orig/mm/memory.c
+++ linux-2.6.git/mm/memory.c
@@ -624,6 +624,8 @@ static unsigned long zap_pte_range(struc
(*zap_work) -= PAGE_SIZE;
+ /* XXX: This can trigger even if the PTE is only a PROTNONE
+ * PTE_FILE pte - we'll then extract page 0 and unmap it! */
if (pte_present(ptent)) {
struct page *page;
Index: linux-2.6.git/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.git.orig/include/asm-i386/pgtable.h
+++ linux-2.6.git/include/asm-i386/pgtable.h
@@ -204,6 +204,8 @@ extern unsigned long long __PAGE_KERNEL,
extern unsigned long pg0[];
#define pte_present(x) ((x).pte_low & (_PAGE_PRESENT | _PAGE_PROTNONE))
+#define pte_present(x) (((x).pte_low & _PAGE_PRESENT) || \
+ (((x).pte_low & (_PAGE_PROTNONE|_PAGE_FILE)) == _PAGE_PROTNONE))
#define pte_clear(mm,addr,xp) do { set_pte_at(mm, addr, xp, __pte(0)); } while (0)
/* To avoid harmful races, pmd_none(x) should check only the lower when PAE */
--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 11/14] remap_file_pages protection support: pte_present should not trigger on PTE_FILE PROTNONE ptes
2006-04-30 17:30 ` [patch 11/14] remap_file_pages protection support: pte_present should not trigger on PTE_FILE PROTNONE ptes blaisorblade
@ 2006-05-02 3:53 ` Nick Piggin
2006-05-03 1:29 ` Blaisorblade
0 siblings, 1 reply; 46+ messages in thread
From: Nick Piggin @ 2006-05-02 3:53 UTC (permalink / raw)
To: blaisorblade; +Cc: Andrew Morton, linux-kernel, Linux Memory Management
blaisorblade@yahoo.it wrote:
> From: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
>
> pte_present(pte) implies that pte_pfn(pte) is valid. Normally even with a
> _PAGE_PROTNONE pte this holds, but not when such a PTE is installed by
> the new install_file_pte; previously it didn't store protections, only file
> offsets, with the patches it also stores protections, and can set
> _PAGE_PROTNONE|_PAGE_FILE.
Why is this combination useful? Can't you just drop the _PAGE_FILE from
_PAGE_PROTNONE ptes?
>
> zap_pte_range, when acting on such a pte, calls vm_normal_page and gets
> &mem_map[0], does page_remove_rmap, and we're easily in trouble, because it
> happens to find a page with mapcount == 0. And it BUGs on this!
>
> I've seen this trigger easily and repeatably on UML on 2.6.16-rc3. This was
> likely avoided in the past by the PageReserved test - page 0 *had* to be
> reserved on i386 (dunno on UML).
>
> Implementation follows for UML and i386.
>
> To avoid additional overhead, I also considered adding likely() for
> _PAGE_PRESENT and unlikely() for the rest, but I'm uncertain about validity of
> possible [un]likely(pte_present()) occurrences.
Not present pages are likely to be pretty common when unmapping.
I don't like this patch much.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 11/14] remap_file_pages protection support: pte_present should not trigger on PTE_FILE PROTNONE ptes
2006-05-02 3:53 ` Nick Piggin
@ 2006-05-03 1:29 ` Blaisorblade
2006-05-06 10:03 ` Nick Piggin
0 siblings, 1 reply; 46+ messages in thread
From: Blaisorblade @ 2006-05-03 1:29 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, Linux Memory Management
On Tuesday 02 May 2006 05:53, Nick Piggin wrote:
> blaisorblade@yahoo.it wrote:
> > From: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
> >
> > pte_present(pte) implies that pte_pfn(pte) is valid. Normally even with a
> > _PAGE_PROTNONE pte this holds, but not when such a PTE is installed by
> > the new install_file_pte; previously it didn't store protections, only
> > file offsets, with the patches it also stores protections, and can set
> > _PAGE_PROTNONE|_PAGE_FILE.
What could be done is to mark a PTE as "no protection" using another bit
rather than _PAGE_PROTNONE. This wastes one more bit, but it is doable.
> Why is this combination useful? Can't you just drop the _PAGE_FILE from
> _PAGE_PROTNONE ptes?
I must think about this, but the semantics are not entirely the same in the
two cases: no page is attached to the PTE when _PAGE_FILE is set, while a page
is attached when only _PAGE_PROTNONE is set. Testing that via VM_MANYPROTS
would be just as slow as the current check (that could be changed, at the cost
of duplicating code for the linear and nonlinear cases).
The application semantics can also differ when you remap that page as
read/write - the app could have stored an offset there (this is less definite,
since currently you can't remap and keep the offset).
Also, this wouldn't solve the problem, it would make the solution harder: how
do I know that there's no page to call page_remove_rmap() on, without
_PAGE_FILE?
I thought about changing _PAGE_PROTNONE: it is used to keep a page present and
referenced but inaccessible. It seems the page could be released when
_PAGE_PROTNONE is set, but for anonymous memory that's impossible. When I
asked Hugh about this, he described the case where an application faults in a
page in a VMA and then mprotects(PROT_NONE) it; the PTE is set as PROT_NONE.
We can release the page in the VM_MAYSHARE case (VM_SHARED or MAP_SHARED was
set but the file is read-only), but not when anonymous memory is present - the
application could want it back.
> > To avoid additional overhead, I also considered adding likely() for
> > _PAGE_PRESENT and unlikely() for the rest, but I'm uncertain about
> > validity of possible [un]likely(pte_present()) occurrences.
>
> Not present pages are likely to be pretty common when unmapping.
OK, then only unlikely() on the _PAGE_PROTNONE && !_PAGE_FILE test.
--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
Chat with your friends in real time!
http://it.yahoo.com/mail_it/foot/*http://it.messenger.yahoo.com
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 11/14] remap_file_pages protection support: pte_present should not trigger on PTE_FILE PROTNONE ptes
2006-05-03 1:29 ` Blaisorblade
@ 2006-05-06 10:03 ` Nick Piggin
2006-05-07 17:50 ` Blaisorblade
0 siblings, 1 reply; 46+ messages in thread
From: Nick Piggin @ 2006-05-06 10:03 UTC (permalink / raw)
To: Blaisorblade; +Cc: Andrew Morton, linux-kernel, Linux Memory Management
Blaisorblade wrote:
> On Tuesday 02 May 2006 05:53, Nick Piggin wrote:
>
>>blaisorblade@yahoo.it wrote:
>>
>>>From: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
>>>
>>>pte_present(pte) implies that pte_pfn(pte) is valid. Normally even with a
>>>_PAGE_PROTNONE pte this holds, but not when such a PTE is installed by
>>>the new install_file_pte; previously it didn't store protections, only
>>>file offsets, with the patches it also stores protections, and can set
>>>_PAGE_PROTNONE|_PAGE_FILE.
>
>
> What could be done is to set a PTE with "no protection", use another bit
> rather than _PAGE_PROTNONE. This wastes one more bit but doable.
I see.
>
>
>>Why is this combination useful? Can't you just drop the _PAGE_FILE from
>>_PAGE_PROTNONE ptes?
>
>
> I must think on this, but the semantics are not entirely the same between the
> two cases.
And yes, this won't work. I was misunderstanding what was happening.
I guess your problem is that you're overloading the pte protection bits for
present ptes as protection bits for not-present (file) ptes. I'd rather you
just used a different encoding for file pte protections then.
"Wasting" a bit seems much preferable, for this (for most people) very
uncommon case, to bloating the pte_present check, which is called in
practically every performance-critical inner loop.
That said, if the patch is i386/uml specific then I don't have much say in
it. If Ingo/Linus and Jeff/Yourself, respectively, accept the patch, then
fine.
But I think you should drop the comment from the core code. It seems wrong.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 11/14] remap_file_pages protection support: pte_present should not trigger on PTE_FILE PROTNONE ptes
2006-05-06 10:03 ` Nick Piggin
@ 2006-05-07 17:50 ` Blaisorblade
0 siblings, 0 replies; 46+ messages in thread
From: Blaisorblade @ 2006-05-07 17:50 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, Linux Memory Management
On Saturday 06 May 2006 12:03, Nick Piggin wrote:
> Blaisorblade wrote:
> > On Tuesday 02 May 2006 05:53, Nick Piggin wrote:
> >>blaisorblade@yahoo.it wrote:
> >>>From: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
> >>>
> >>>pte_present(pte) implies that pte_pfn(pte) is valid. Normally even with
> >>> a _PAGE_PROTNONE pte this holds, but not when such a PTE is installed
> >>> by the new install_file_pte; previously it didn't store protections,
> >>> only file offsets, with the patches it also stores protections, and can
> >>> set _PAGE_PROTNONE|_PAGE_FILE.
> >
> > What could be done is to set a PTE with "no protection", use another bit
> > rather than _PAGE_PROTNONE. This wastes one more bit but doable.
> I see.
> I guess your problem is that you're overloading the pte protection bits
> for present ptes as protection bits for not present (file) ptes. I'd rather
> you just used a different encoding for file pte protections then.
Yes, this is what I said above, so we agree; and indeed this overloading was
decided when the present problem didn't trigger, so it can change now. As
detailed in the patch description, the previous PageReserved handling
prevented freeing page 0 and hid the problem.
> "Wasting" a bit seems much more preferable for this very uncommon case (for
> most people) rather than bloating pte_present check, which is called in
> practically every performance critical inner loop).
Yes, I thought about this problem; I wasn't sure how serious it was.
> That said, if the patch is i386/uml specific then I don't have much say in
> it.
It's presently arch-specific, but will probably be extended. Implementations
for some other archs were already sent and I've collected them (I'll send them
afterwards; I've avoided bloating this series).
> If Ingo/Linus and Jeff/Yourself, respectively, accept the patch, then
> fine.
> But I think you should drop the comment from the core code. It seems wrong.
Yep, I forgot that one; thanks for the reminder, I've now removed it.
--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
^ permalink raw reply [flat|nested] 46+ messages in thread
* [patch 12/14] remap_file_pages protection support: also set VM_NONLINEAR on nonuniform VMAs
2006-04-30 17:29 [patch 00/14] remap_file_pages protection support blaisorblade
` (10 preceding siblings ...)
2006-04-30 17:30 ` [patch 11/14] remap_file_pages protection support: pte_present should not trigger on PTE_FILE PROTNONE ptes blaisorblade
@ 2006-04-30 17:30 ` blaisorblade
2006-04-30 17:30 ` [patch 13/14] remap_file_pages protection support: uml, i386, x64 bits blaisorblade
` (3 subsequent siblings)
15 siblings, 0 replies; 46+ messages in thread
From: blaisorblade @ 2006-04-30 17:30 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Paolo Blaisorblade Giarrusso
[-- Attachment #1: rfp/12-rfp-nonuniform-implies-nonlinear.diff --]
[-- Type: text/plain, Size: 2524 bytes --]
From: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
To simplify the VM code, and to reflect expected application usage, we decide
to also set VM_NONLINEAR when setting VM_MANYPROTS. Otherwise, we'd possibly
have to save nonlinear PTEs even on paths which cope with linear VMAs. That's
possible, but intrusive (it's done in one of the next patches).
Obviously, this has a performance cost, since we potentially handle a linear
VMA with the nonlinear handling code. But I don't know of any application
with this usage pattern.
XXX: update: glibc wants to replace mprotect() with linear VM_MANYPROTS areas,
to handle guard pages and data mappings of shared objects.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Index: linux-2.6.git/mm/fremap.c
===================================================================
--- linux-2.6.git.orig/mm/fremap.c
+++ linux-2.6.git/mm/fremap.c
@@ -206,8 +206,9 @@ retry:
if (!vma->vm_private_data || (vma->vm_flags & VM_NONLINEAR)) {
/* Must set VM_NONLINEAR before any pages are populated. */
- if (pgoff != linear_page_index(vma, start) &&
- !(vma->vm_flags & VM_NONLINEAR)) {
+ if (!(vma->vm_flags & VM_NONLINEAR) &&
+ (pgoff != linear_page_index(vma, start) ||
+ pgprot_val(pgprot) != pgprot_val(vma->vm_page_prot))) {
if (!(vma->vm_flags & VM_SHARED))
goto out_unlock;
if (!has_write_lock) {
@@ -224,19 +225,19 @@ retry:
vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
flush_dcache_mmap_unlock(mapping);
spin_unlock(&mapping->i_mmap_lock);
- }
- if (pgprot_val(pgprot) != pgprot_val(vma->vm_page_prot) &&
- !(vma->vm_flags & VM_MANYPROTS)) {
- if (!(vma->vm_flags & VM_SHARED))
- goto out_unlock;
- if (!has_write_lock) {
- up_read(&mm->mmap_sem);
- down_write(&mm->mmap_sem);
- has_write_lock = 1;
- goto retry;
+ if (!(vma->vm_flags & VM_MANYPROTS) &&
+ pgprot_val(pgprot) != pgprot_val(vma->vm_page_prot)) {
+ if (!(vma->vm_flags & VM_SHARED))
+ goto out_unlock;
+ if (!has_write_lock) {
+ up_read(&mm->mmap_sem);
+ down_write(&mm->mmap_sem);
+ has_write_lock = 1;
+ goto retry;
+ }
+ vma->vm_flags |= VM_MANYPROTS;
}
- vma->vm_flags |= VM_MANYPROTS;
}
err = vma->vm_ops->populate(vma, start, size, pgprot, pgoff,
--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
^ permalink raw reply [flat|nested] 46+ messages in thread
* [patch 13/14] remap_file_pages protection support: uml, i386, x64 bits
2006-04-30 17:29 [patch 00/14] remap_file_pages protection support blaisorblade
` (11 preceding siblings ...)
2006-04-30 17:30 ` [patch 12/14] remap_file_pages protection support: also set VM_NONLINEAR on nonuniform VMAs blaisorblade
@ 2006-04-30 17:30 ` blaisorblade
2006-04-30 17:30 ` [patch 14/14] remap_file_pages protection support: adapt to uml peculiarities blaisorblade
` (2 subsequent siblings)
15 siblings, 0 replies; 46+ messages in thread
From: blaisorblade @ 2006-04-30 17:30 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Paolo Blaisorblade Giarrusso
[-- Attachment #1: rfp/02-rfp-arch-uml-i386.diff --]
[-- Type: text/plain, Size: 7529 bytes --]
From: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>, Ingo Molnar <mingo@elte.hu>
Various boilerplate stuff.
Update pte encoding macros for UML, i386 and x86-64.
*) remap_file_pages protection support: improvement for UML bits
Recover one bit by additionally using _PAGE_NEWPROT. Since I wasn't sure this
would work, I had split this out, but it has worked well. We rely on the fact
that pte_newprot always checks first whether the PTE is marked present. This
is merged here because, beyond making sense, it held up during the unit
testing I performed.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Index: linux-2.6.git/include/asm-i386/pgtable-2level.h
===================================================================
--- linux-2.6.git.orig/include/asm-i386/pgtable-2level.h
+++ linux-2.6.git/include/asm-i386/pgtable-2level.h
@@ -43,16 +43,21 @@ static inline int pte_exec_kernel(pte_t
}
/*
- * Bits 0, 6 and 7 are taken, split up the 29 bits of offset
+ * Bits 0, 1, 6 and 7 are taken, split up the 28 bits of offset
* into this range:
*/
-#define PTE_FILE_MAX_BITS 29
+#define PTE_FILE_MAX_BITS 28
#define pte_to_pgoff(pte) \
- ((((pte).pte_low >> 1) & 0x1f ) + (((pte).pte_low >> 8) << 5 ))
-
-#define pgoff_to_pte(off) \
- ((pte_t) { (((off) & 0x1f) << 1) + (((off) >> 5) << 8) + _PAGE_FILE })
+ ((((pte).pte_low >> 2) & 0xf ) + (((pte).pte_low >> 8) << 4 ))
+#define pte_to_pgprot(pte) \
+ __pgprot(((pte).pte_low & (_PAGE_RW | _PAGE_PROTNONE)) \
+ | (((pte).pte_low & _PAGE_PROTNONE) ? 0 : \
+ (_PAGE_USER | _PAGE_PRESENT)) | _PAGE_ACCESSED)
+
+#define pgoff_prot_to_pte(off, prot) \
+ ((pte_t) { (((off) & 0xf) << 2) + (((off) >> 4) << 8) + \
+ (pgprot_val(prot) & (_PAGE_RW | _PAGE_PROTNONE)) + _PAGE_FILE })
/* Encode and de-code a swap entry */
#define __swp_type(x) (((x).val >> 1) & 0x1f)
Index: linux-2.6.git/include/asm-i386/pgtable-3level.h
===================================================================
--- linux-2.6.git.orig/include/asm-i386/pgtable-3level.h
+++ linux-2.6.git/include/asm-i386/pgtable-3level.h
@@ -140,7 +140,16 @@ static inline pmd_t pfn_pmd(unsigned lon
* put the 32 bits of offset into the high part.
*/
#define pte_to_pgoff(pte) ((pte).pte_high)
-#define pgoff_to_pte(off) ((pte_t) { _PAGE_FILE, (off) })
+
+#define pte_to_pgprot(pte) \
+ __pgprot(((pte).pte_low & (_PAGE_RW | _PAGE_PROTNONE)) \
+ | (((pte).pte_low & _PAGE_PROTNONE) ? 0 : \
+ (_PAGE_USER | _PAGE_PRESENT)) | _PAGE_ACCESSED)
+
+#define pgoff_prot_to_pte(off, prot) \
+ ((pte_t) { _PAGE_FILE + \
+ (pgprot_val(prot) & (_PAGE_RW | _PAGE_PROTNONE)) , (off) })
+
#define PTE_FILE_MAX_BITS 32
/* Encode and de-code a swap entry */
Index: linux-2.6.git/include/asm-um/pgtable-2level.h
===================================================================
--- linux-2.6.git.orig/include/asm-um/pgtable-2level.h
+++ linux-2.6.git/include/asm-um/pgtable-2level.h
@@ -45,12 +45,19 @@ static inline void pgd_mkuptodate(pgd_t
((unsigned long) __va(pmd_val(pmd) & PAGE_MASK))
/*
- * Bits 0 through 3 are taken
+ * Bits 0, 1, 3 to 5 are taken, split up the 27 bits of offset
+ * into this range:
*/
-#define PTE_FILE_MAX_BITS 28
+#define PTE_FILE_MAX_BITS 27
-#define pte_to_pgoff(pte) (pte_val(pte) >> 4)
+#define pte_to_pgoff(pte) (((pte_val(pte) >> 6) << 1) | ((pte_val(pte) >> 2) & 0x1))
+#define pte_to_pgprot(pte) \
+ __pgprot((pte_val(pte) & (_PAGE_RW | _PAGE_PROTNONE)) \
+ | ((pte_val(pte) & _PAGE_PROTNONE) ? 0 : \
+ (_PAGE_USER | _PAGE_PRESENT)) | _PAGE_ACCESSED)
-#define pgoff_to_pte(off) ((pte_t) { ((off) << 4) + _PAGE_FILE })
+#define pgoff_prot_to_pte(off, prot) \
+ __pte((((off) >> 1) << 6) + (((off) & 0x1) << 2) + \
+ (pgprot_val(prot) & (_PAGE_RW | _PAGE_PROTNONE)) + _PAGE_FILE)
#endif
Index: linux-2.6.git/include/asm-um/pgtable-3level.h
===================================================================
--- linux-2.6.git.orig/include/asm-um/pgtable-3level.h
+++ linux-2.6.git/include/asm-um/pgtable-3level.h
@@ -101,25 +101,35 @@ static inline pmd_t pfn_pmd(pfn_t page_n
}
/*
- * Bits 0 through 3 are taken in the low part of the pte,
+ * Bits 0 through 5 are taken in the low part of the pte,
* put the 32 bits of offset into the high part.
*/
#define PTE_FILE_MAX_BITS 32
+
#ifdef CONFIG_64BIT
#define pte_to_pgoff(p) ((p).pte >> 32)
-
-#define pgoff_to_pte(off) ((pte_t) { ((off) << 32) | _PAGE_FILE })
+#define pgoff_prot_to_pte(off, prot) ((pte_t) { ((off) << 32) | _PAGE_FILE | \
+ (pgprot_val(prot) & (_PAGE_RW | _PAGE_PROTNONE)) })
+#define __pte_flags(pte) pte_val(pte)
#else
#define pte_to_pgoff(pte) ((pte).pte_high)
-
-#define pgoff_to_pte(off) ((pte_t) { _PAGE_FILE, (off) })
+#define pgoff_prot_to_pte(off, prot) ((pte_t) { \
+ (pgprot_val(prot) & (_PAGE_RW | _PAGE_PROTNONE)) | _PAGE_FILE, \
+ (off) })
+/* Don't use pte_val below, useless to join the two halves */
+#define __pte_flags(pte) ((pte).pte_low)
#endif
+#define pte_to_pgprot(pte) \
+ __pgprot((__pte_flags(pte) & (_PAGE_RW | _PAGE_PROTNONE)) \
+ | ((__pte_flags(pte) & _PAGE_PROTNONE) ? 0 : \
+ (_PAGE_USER | _PAGE_PRESENT)) | _PAGE_ACCESSED)
+
#endif
/*
Index: linux-2.6.git/include/asm-x86_64/pgtable.h
===================================================================
--- linux-2.6.git.orig/include/asm-x86_64/pgtable.h
+++ linux-2.6.git/include/asm-x86_64/pgtable.h
@@ -360,9 +360,19 @@ static inline pud_t *__pud_offset_k(pud_
#define pmd_pfn(x) ((pmd_val(x) & __PHYSICAL_MASK) >> PAGE_SHIFT)
#define pte_to_pgoff(pte) ((pte_val(pte) & PHYSICAL_PAGE_MASK) >> PAGE_SHIFT)
-#define pgoff_to_pte(off) ((pte_t) { ((off) << PAGE_SHIFT) | _PAGE_FILE })
#define PTE_FILE_MAX_BITS __PHYSICAL_MASK_SHIFT
+#define pte_to_pgprot(pte) \
+ __pgprot((pte_val(pte) & (_PAGE_RW | _PAGE_PROTNONE)) \
+ | ((pte_val(pte) & _PAGE_PROTNONE) ? 0 : \
+ (_PAGE_USER | _PAGE_PRESENT)) | _PAGE_ACCESSED)
+
+#define pgoff_prot_to_pte(off, prot) \
+ ((pte_t) { _PAGE_FILE + \
+ (pgprot_val(prot) & (_PAGE_RW | _PAGE_PROTNONE)) + \
+ ((off) << PAGE_SHIFT) })
+
+
/* PTE - Level 1 access. */
/* page, protection -> pte */
@@ -454,6 +464,7 @@ extern int kern_addr_valid(unsigned long
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define __HAVE_ARCH_PTE_SAME
+#define __HAVE_ARCH_PTE_TO_PGPROT
#include <asm-generic/pgtable.h>
#endif /* _X86_64_PGTABLE_H */
Index: linux-2.6.git/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.git.orig/include/asm-i386/pgtable.h
+++ linux-2.6.git/include/asm-i386/pgtable.h
@@ -452,6 +452,7 @@ extern void noexec_setup(const char *str
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define __HAVE_ARCH_PTE_SAME
+#define __HAVE_ARCH_PTE_TO_PGPROT
#include <asm-generic/pgtable.h>
#endif /* _I386_PGTABLE_H */
Index: linux-2.6.git/include/asm-um/pgtable.h
===================================================================
--- linux-2.6.git.orig/include/asm-um/pgtable.h
+++ linux-2.6.git/include/asm-um/pgtable.h
@@ -410,6 +410,7 @@ static inline pte_t pte_modify(pte_t pte
#define kern_addr_valid(addr) (1)
+#define __HAVE_ARCH_PTE_TO_PGPROT
#include <asm-generic/pgtable.h>
#include <asm-generic/pgtable-nopud.h>
--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
^ permalink raw reply [flat|nested] 46+ messages in thread
* [patch 14/14] remap_file_pages protection support: adapt to uml peculiarities
2006-04-30 17:29 [patch 00/14] remap_file_pages protection support blaisorblade
` (12 preceding siblings ...)
2006-04-30 17:30 ` [patch 13/14] remap_file_pages protection support: uml, i386, x64 bits blaisorblade
@ 2006-04-30 17:30 ` blaisorblade
2006-05-02 3:45 ` [patch 00/14] remap_file_pages protection support Nick Piggin
2006-05-02 10:21 ` Arjan van de Ven
15 siblings, 0 replies; 46+ messages in thread
From: blaisorblade @ 2006-04-30 17:30 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Paolo Blaisorblade Giarrusso
[-- Attachment #1: rfp/11-rfp-sigsegv-uml-handle-tlb-faults.diff --]
[-- Type: text/plain, Size: 3032 bytes --]
From: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
UML differs from other architectures (and possibly this should be fixed) in
that our arch fault handler handles both TLB faults and page faults
indistinguishably. In particular, we may get to call handle_mm_fault() when
the PTE is already correct, but simply not flushed.
And rfp-fault-sigsegv-2 breaks this, because when getting a fault on a
pte_present PTE and non-uniform VMA, it assumes the fault is due to a
protection fault, and signals the caller a SIGSEGV must be sent.
XXX: this is now wrong, since that SIGSEGV is no longer sent. I'll
subsequently verify whether this patch is still needed, but so far I haven't
had the time.
*) remap_file_pages protection support: fix unflushed TLB errors detection
From: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
We got unflushed PTEs marked up-to-date; they had actually been flushed, but
were still protected, in order to catch dirtying / accessing faults. So don't
test the PTE for being up-to-date; check the permission directly instead
(since the PTE is not protected for that).
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Index: linux-2.6.git/arch/um/kernel/trap_kern.c
===================================================================
--- linux-2.6.git.orig/arch/um/kernel/trap_kern.c
+++ linux-2.6.git/arch/um/kernel/trap_kern.c
@@ -43,7 +43,7 @@ int handle_page_fault(unsigned long addr
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
- pte_t *pte;
+ pte_t *pte, entry;
int err = -EFAULT;
*code_out = SEGV_MAPERR;
@@ -93,8 +93,37 @@ survive:
err = -EACCES;
goto out;
case VM_FAULT_SIGSEGV:
- err = -EFAULT;
- goto out;
+ WARN_ON(!(vma->vm_flags & VM_MANYPROTS));
+ /* Duplicate this code here. */
+ pgd = pgd_offset(mm, address);
+ pud = pud_offset(pgd, address);
+ pmd = pmd_offset(pud, address);
+ pte = pte_offset_kernel(pmd, address);
+ if (likely (pte_newpage(*pte) || pte_newprot(*pte)) ||
+ (is_write ? pte_write(*pte) : pte_read(*pte)) ) {
+ /* The page hadn't been flushed, or it had been
+ * flushed but without access to get a dirtying
+ * / accessing fault. */
+
+ /* __handle_mm_fault() didn't dirty / young this
+ * PTE, probably we won't get another fault for
+ * this page, so fix things now. */
+ entry = *pte;
+ entry = pte_mkyoung(*pte);
+ if(pte_write(entry))
+ entry = pte_mkdirty(entry);
+ /* Yes, this will set the page as NEWPAGE. We
+ * want this, otherwise things won't work.
+ * Indeed, the
+ * *pte = pte_mkyoung(*pte);
+ * we used to have (uselessly) didn't work at
+ * all! */
+ set_pte(pte, entry);
+ break;
+ } else {
+ err = -EFAULT;
+ goto out;
+ }
case VM_FAULT_OOM:
err = -ENOMEM;
goto out_of_memory;
--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-04-30 17:29 [patch 00/14] remap_file_pages protection support blaisorblade
` (13 preceding siblings ...)
2006-04-30 17:30 ` [patch 14/14] remap_file_pages protection support: adapt to uml peculiarities blaisorblade
@ 2006-05-02 3:45 ` Nick Piggin
2006-05-02 3:56 ` Nick Piggin
` (3 more replies)
2006-05-02 10:21 ` Arjan van de Ven
15 siblings, 4 replies; 46+ messages in thread
From: Nick Piggin @ 2006-05-02 3:45 UTC (permalink / raw)
To: blaisorblade; +Cc: Andrew Morton, linux-kernel, Linux Memory Management
blaisorblade@yahoo.it wrote:
> The first idea is to use this for UML - it must create a lot of single page
> mappings, and managing them through separate VMAs is slow.
I don't know about this. The patches add some complexity, I guess because
we now have vmas which cannot communicate the protectedness of the pages.
Still, nobody was too concerned about nonlinear mappings doing the same
for addressing. But this does seem to add more overhead to the common cases
in the VM :(
Now I didn't follow the earlier discussions on this much, but let me try
making a few silly comments to get things going again (cc'ed linux-mm).
I think I would rather have this all just folded under VM_NONLINEAR than
have this extra MANYPROTS thing, no? (You're already doing that in one
direction.)
>
> Additional note: this idea, with some further refinements (which I'll code after
> this chunk is accepted), will make it possible to reduce the number of VMAs used
> by most userspace programs - in particular, it will let us avoid creating one
> VMA per guard page (which has PROT_NONE) - forcing PROT_NONE on that page will
> be enough.
I think that's silly. Your VM_MANYPROTS|VM_NONLINEAR vmas will cause more
overhead in faulting and reclaim.
It looks like it would take an hour or two just to code up a patch which
puts a VM_GUARDPAGES flag into the vma and tells the free-area allocator
to skip vm_start-1 .. vm_end+1. What kind of trouble has prevented
something simple and easy like that from going in?
>
> This will be useful since the VMA lookup at fault time can be a bottleneck for
> some programs (I've received a report about this from Ulrich Drepper and I've
> been told that also Val Henson from Intel is interested about this). I guess
> that since we use RB-trees, the slowness is also due to the poor cache locality
> of RB-trees (since RB nodes are within VMAs but aren't accessed together with
> their content), compared for instance with radix trees where the lookup has high
> cache locality (but they have however space usage problems, possibly bigger, on
> 64-bit machines).
Let's try to get back to the good old days when people actually reported
their bugs (together with *real* numbers) to the mailing lists. That way,
everybody gets to think about and discuss the problem.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-05-02 3:45 ` [patch 00/14] remap_file_pages protection support Nick Piggin
@ 2006-05-02 3:56 ` Nick Piggin
2006-05-02 11:24 ` Ingo Molnar
2006-05-02 17:16 ` Lee Schermerhorn
` (2 subsequent siblings)
3 siblings, 1 reply; 46+ messages in thread
From: Nick Piggin @ 2006-05-02 3:56 UTC (permalink / raw)
To: blaisorblade; +Cc: Andrew Morton, linux-kernel, Linux Memory Management
Nick Piggin wrote:
> blaisorblade@yahoo.it wrote:
>
>> The first idea is to use this for UML - it must create a lot of single
>> page
>> mappings, and managing them through separate VMAs is slow.
[...]
> Let's try to get back to the good old days when people actually reported
> their bugs (together with *real* numbers) to the mailing lists. That way,
> everybody gets to think about and discuss the problem.
Speaking of which, let's see some numbers for UML -- performance
and memory. I don't doubt your claims, but I (and others) would be
interested to see.
Thanks
PS. I'll be away for the next few days.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-05-02 3:56 ` Nick Piggin
@ 2006-05-02 11:24 ` Ingo Molnar
2006-05-02 12:19 ` Nick Piggin
0 siblings, 1 reply; 46+ messages in thread
From: Ingo Molnar @ 2006-05-02 11:24 UTC (permalink / raw)
To: Nick Piggin
Cc: blaisorblade, Andrew Morton, linux-kernel,
Linux Memory Management
* Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >Let's try get back to the good old days when people actually reported
> >their bugs (togther will *real* numbers) to the mailing lists. That way,
> >everybody gets to think about and discuss the problem.
>
> Speaking of which, let's see some numbers for UML -- performance and
> memory. I don't doubt your claims, but I (and others) would be
> interested to see.
firstly, thanks for the review feedback!
originally i tested this feature with a minimal amount of RAM simulated
by UML - 128MB or so. That's just 32 thousand pages, but still
the improvement was massive: context-switch times in UML were cut in
half or more. Process-creation times improved 10-fold. With this feature
included I accidentally (for the first time ever!) confused an UML shell
prompt with a real shell prompt. (before that UML was so slow [even in
"skas mode"] that you'd immediately notice it by the shell's behavior)
the 'have 1 vma instead of 32,000 vmas' thing is a really, really big
plus. It makes UML comparable to Xen, in rough terms of basic VM design.
Now imagine a somewhat larger setup - 16 GB RAM UML instance with 4
million vmas per UML process ... Frankly, without
sys_remap_file_pages_prot() the UML design is still somewhat of a toy.
Ingo
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-05-02 11:24 ` Ingo Molnar
@ 2006-05-02 12:19 ` Nick Piggin
0 siblings, 0 replies; 46+ messages in thread
From: Nick Piggin @ 2006-05-02 12:19 UTC (permalink / raw)
To: Ingo Molnar
Cc: blaisorblade, Andrew Morton, linux-kernel,
Linux Memory Management
Ingo Molnar wrote:
> originally i tested this feature with some minimal amount of RAM
> simulated by UML - 128MB or so. That's just 32 thousand pages, but still
> the improvement was massive: context-switch times in UML were cut in
> half or more. Process-creation times improved 10-fold. With this feature
> included I accidentally (for the first time ever!) confused a UML shell
> prompt with a real shell prompt. (before that, UML was so slow [even in
> "skas mode"] that you'd immediately notice it by the shell's behavior)
Cool, thanks for the numbers.
>
> the 'have 1 vma instead of 32,000 vmas' thing is a really, really big
> plus. It makes UML comparable to Xen, in rough terms of basic VM design.
>
> Now imagine a somewhat larger setup - 16 GB RAM UML instance with 4
> million vmas per UML process ... Frankly, without
> sys_remap_file_pages_prot() the UML design is still somewhat of a toy.
Yes, I guess I imagined the common case might have been slightly better;
however, with reasonable RAM utilisation, fragmentation means I wouldn't
be surprised if it easily gets close to that theoretical worst case.
My request for numbers was more about the Intel/glibc people than Paolo:
I do realise it is a problem for UML. I just like to see nice numbers :)
I think UML's really neat, so I'd love to see this get in. I don't see
any fundamental sticking point, given a few iterations, and some more
discussion.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-05-02 3:45 ` [patch 00/14] remap_file_pages protection support Nick Piggin
2006-05-02 3:56 ` Nick Piggin
@ 2006-05-02 17:16 ` Lee Schermerhorn
2006-05-03 1:20 ` Blaisorblade
2006-05-03 0:25 ` Blaisorblade
2006-05-03 0:44 ` Blaisorblade
3 siblings, 1 reply; 46+ messages in thread
From: Lee Schermerhorn @ 2006-05-02 17:16 UTC (permalink / raw)
To: Nick Piggin
Cc: blaisorblade, Andrew Morton, linux-kernel,
Linux Memory Management
On Tue, 2006-05-02 at 13:45 +1000, Nick Piggin wrote:
> blaisorblade@yahoo.it wrote:
>
> > The first idea is to use this for UML - it must create a lot of single page
> > mappings, and managing them through separate VMAs is slow.
>
> I don't know about this. The patches add some complexity, I guess because
> we now have vmas which cannot communicate the protectedness of the pages.
> Still, nobody was too concerned about nonlinear mappings doing the same
> for addressing. But this does seem to add more overhead to the common cases
> in the VM :(
>
> Now I didn't follow the earlier discussions on this much, but let me try
> making a few silly comments to get things going again (cc'ed linux-mm).
>
> I think I would rather this all just folded under VM_NONLINEAR rather than
> having this extra MANYPROTS thing, no? (you're already doing that in one
> direction).
<snip>
One way I've seen this done on other systems is to use something like a
prio tree [e.g., see the shared policy support for shmem] for sub-vma
protection ranges. Most vmas [I'm guessing here] will have only the
original protections or will be reprotected in toto. So, one need only
allocate/populate the protection tree when sub-vma protections are
requested. Then, one can test protections via the vma, perhaps with
access/check macros to hide the existence of the protection tree. Of
course, adding a tree-like structure could introduce locking
complications/overhead in some paths where we'd rather not [just
guessing again]. Might be more overhead than just mucking with the ptes
[for UML], but would keep the ptes in sync with the vma's view of
"protectedness".
Lee
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-05-02 17:16 ` Lee Schermerhorn
@ 2006-05-03 1:20 ` Blaisorblade
2006-05-03 14:35 ` Lee Schermerhorn
0 siblings, 1 reply; 46+ messages in thread
From: Blaisorblade @ 2006-05-03 1:20 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: Nick Piggin, Andrew Morton, linux-kernel, Linux Memory Management,
Ulrich Drepper, Val Henson
On Tuesday 02 May 2006 19:16, Lee Schermerhorn wrote:
> On Tue, 2006-05-02 at 13:45 +1000, Nick Piggin wrote:
> > blaisorblade@yahoo.it wrote:
> > I think I would rather this all just folded under VM_NONLINEAR rather
> > than having this extra MANYPROTS thing, no? (you're already doing that in
> > one direction).
> One way I've seen this done on other systems
I'm curious, which ones?
> is to use something like a
> prio tree [e.g., see the shared policy support for shmem] for sub-vma
> protection ranges.
Which sub-vma ranges? The ones created with mprotect?
I'm curious what the difference is between this sub-tree and the main
tree... you have a point, but I'm missing which one :-) Actually, when doing a
lookup in the main tree the extra nodes in the subtree are not searched, so
you get an advantage.
One possible point is that a VMA maps to one mmap() call (with splits from
mremap(), mprotect(), and partial munmap()s), and then they use sub-VMAs
instead of VMA splits.
> Most vmas [I'm guessing here] will have only the
> original protections or will be reprotected in toto.
> So, one need only
> allocate/populate the protection tree when sub-vma protections are
> requested. Then, one can test protections via the vma, perhaps with
> access/check macros to hide the existence of the protection tree. Of
> course, adding a tree-like structure could introduce locking
> complications/overhead in some paths where we'd rather not [just
> guessing again]. Might be more overhead than just mucking with the ptes
> [for UML], but would keep the ptes in sync with the vma's view of
> "protectedness".
>
> Lee
OK, there are two different situations; overall, I'm unconvinced until I
understand the usefulness of a separate sub-tree.
a) UML. The answer is _no_ to all guesses, since we must implement the page
tables of a guest virtual machine via mmap() or remap_file_pages. And they're
as fragmented as they get (we currently get one-page-wide VMAs).
b) the proposed glibc usage. The original Ulrich's request (which I cut down
because of problems with objrmap) was to have one mapping per DSO, including
code,data and guard page. So you have three protections in one VMA.
However, this is doable via this remap_file_pages, by adding something for
handling private VMAs (handling movement of the anonymous memory you get on
writes); but it's slow on swapout, since it stops using objrmap. So I haven't
planned to do it.
--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
Chat with your friends in real time!
http://it.yahoo.com/mail_it/foot/*http://it.messenger.yahoo.com
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-05-03 1:20 ` Blaisorblade
@ 2006-05-03 14:35 ` Lee Schermerhorn
0 siblings, 0 replies; 46+ messages in thread
From: Lee Schermerhorn @ 2006-05-03 14:35 UTC (permalink / raw)
To: Blaisorblade
Cc: Nick Piggin, Andrew Morton, linux-kernel, Linux Memory Management,
Ulrich Drepper, Val Henson, Bob Picco
On Wed, 2006-05-03 at 03:20 +0200, Blaisorblade wrote:
> On Tuesday 02 May 2006 19:16, Lee Schermerhorn wrote:
> > On Tue, 2006-05-02 at 13:45 +1000, Nick Piggin wrote:
> > > blaisorblade@yahoo.it wrote:
>
> > > I think I would rather this all just folded under VM_NONLINEAR rather
> > > than having this extra MANYPROTS thing, no? (you're already doing that in
> > > one direction).
>
> > One way I've seen this done on other systems
>
> I'm curious, which ones?
Let's see: System V [4.2MP] back in the days of USL [Unix Systems Labs]
did this, as did [does] Digital/Compaq/HP Tru64 on alpha. I'm not sure
if the latter came from the original Mach/OSF code or from Bob Picco's
rewrite thereof back in the 90's.
>
> > is to use something like a
> > prio tree [e.g., see the shared policy support for shmem] for sub-vma
> > protection ranges.
> Which sub-vma ranges? The ones created with mprotect?
>
> I'm curious about what is the difference between this sub-tree and the main
> tree... you have some point, but I miss which one :-) Actually when doing a
> lookup in the main tree the extra nodes in the subtree are not searched, so
> you get an advantage.
True, you still have to locate the protection range in the subtree and then
still walk the page tables to install the new protections. Of course,
in the bad old days, the only time I saw thousands or 10s of thousands
of different mappings/vmas/regions/whatever was in vm stress tests.
Until squid came along, that is. Then we encountered, on large memory
alpha systems, 100s of thousands of mmaped files. Had to replace the
linear [honest] vma/mapping list at that point ;-).
>
> One possible point is that a VMA maps to one mmap() call (with splits from
> mremap(),mprotect(),partial munmap()s), and then they use sub-VMAs instead of
> VMA splits.
Yeah. That was the point--in response to Nick's comment about the
disconnect between the protections as reported by the vma and the actual
pte protections.
>
> > Most vmas [I'm guessing here] will have only the
> > original protections or will be reprotected in toto.
>
> > So, one need only
> > allocate/populate the protection tree when sub-vma protections are
> > requested. Then, one can test protections via the vma, perhaps with
> > access/check macros to hide the existence of the protection tree. Of
> > course, adding a tree-like structure could introduce locking
> > complications/overhead in some paths where we'd rather not [just
> > guessing again]. Might be more overhead than just mucking with the ptes
> > [for UML], but would keep the ptes in sync with the vma's view of
> > "protectedness".
> >
> > Lee
>
> Ok, there are two different situations, I'm globally unconvinced until I
> understand the usefulness of a different sub-tree.
>
> a) UML. The answer is _no_ to all guesses, since we must implement page tables
> of a guest virtual machine via mmap() or remap_file_pages. And they're as
> fragmented as they get (we get one-page-wide VMAs currently).
>
> b) the proposed glibc usage. The original Ulrich's request (which I cut down
> because of problems with objrmap) was to have one mapping per DSO, including
> code,data and guard page. So you have three protections in one VMA.
>
> However, this is doable via this remap_file_pages, adding something for
> handling private VMAs (handling movement of the anonymous memory you get on
> writes); but it's slow on swapout, since it stops using objrmap. So I've not
> thought to do it.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-05-02 3:45 ` [patch 00/14] remap_file_pages protection support Nick Piggin
2006-05-02 3:56 ` Nick Piggin
2006-05-02 17:16 ` Lee Schermerhorn
@ 2006-05-03 0:25 ` Blaisorblade
2006-05-06 16:05 ` Ulrich Drepper
2006-05-03 0:44 ` Blaisorblade
3 siblings, 1 reply; 46+ messages in thread
From: Blaisorblade @ 2006-05-03 0:25 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, linux-kernel, Linux Memory Management,
Ulrich Drepper, Val Henson
On Tuesday 02 May 2006 05:45, Nick Piggin wrote:
> blaisorblade@yahoo.it wrote:
> > This will be useful since the VMA lookup at fault time can be a
> > bottleneck for some programs (I've received a report about this from
> > Ulrich Drepper and I've been told that also Val Henson from Intel is
> > interested about this). I guess that since we use RB-trees, the slowness
> > is also due to the poor cache locality of RB-trees (since RB nodes are
> > within VMAs but aren't accessed together with their content), compared
> > for instance with radix trees where the lookup has high cache locality
> > (but they have however space usage problems, possibly bigger, on 64-bit
> > machines).
> Let's try get back to the good old days when people actually reported
> their bugs (togther will *real* numbers) to the mailing lists. That way,
> everybody gets to think about and discuss the problem.
Indeed, I've not seen the numbers; I've been told of a problem with a "customer
program", and Ingo connected my work with this problem. Frankly, I've always
been astonished that looking up a 10-level tree can be slow. Poor
cache locality is the only thing that I could think of.
That said, it was an add-on, not the original motivation of the work.
--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-05-03 0:25 ` Blaisorblade
@ 2006-05-06 16:05 ` Ulrich Drepper
2006-05-07 4:22 ` Nick Piggin
0 siblings, 1 reply; 46+ messages in thread
From: Ulrich Drepper @ 2006-05-06 16:05 UTC (permalink / raw)
To: Blaisorblade
Cc: Nick Piggin, Andrew Morton, linux-kernel, Linux Memory Management,
Val Henson
[-- Attachment #1: Type: text/plain, Size: 3971 bytes --]
Blaisorblade wrote:
> Indeed, I've not seen the numbers; I've been told of a problem with a "customer
> program", and Ingo connected my work with this problem. Frankly, I've always
> been astonished that looking up a 10-level tree can be slow. Poor
> cache locality is the only thing that I could think of.
It might be good if I explain a bit how much we use mmap in libc. The
numbers can really add up quickly.
- for each loaded DSO we might have up to 5 VMAs today:
1. text segment, protection r-xp, normally never written to
2. a gap, protection ---p (i.e., no access)
3. a relro data segment, protection r--p
4. a data segment, protection rw-p
5. a bss segment, anonymous memory
The first four are mapped from the file. In fact, the first segment
"allocates" the entire address space of all segments, even if it's longer
than the file.
The gap is done using mprotect(PROT_NONE). Then the area for segments 3
and 4 is mapped in one mmap() call. It's in the same file, but the
offset used in the mmap is not the same as the offset which is
naturally already established through the first mmap. I.e., if the
first mmap() were to start at offset 0 and continue for 1000 pages, the
gap might start at, say, an offset of 4 pages and continue for 500 pages.
Then the "natural" offset of the first data page would be 504 pages, but
the second mmap() call would in fact use offset 4, because the text
and data segments are contiguous in the _file_ (although not in memory).
Anyway, once relocations are done the protection of the relro segment is
changed, splitting the data segment in two.
So, for DSO loading there would be two steps of improvement:
1. if an mprotect() call didn't split the VMA, we would have 3 VMAs in
the end instead of 5. A 40% gain.
2. if I could use remap_file_pages() for the data segment mapping, and
the call allowed changing the protection and did not split the
VMAs, then we'd be down to 2 mappings. 60% down.
A second big VMA user is thread stacks. I think the application which
was briefly mentioned in this thread used literally thousands of
threads. Leaving aside the insanity of this (it's unfortunately how
many programmers work), this can create problems because we get at least
two (on IA-64, three) VMAs per thread. I.e., thread stack allocation
works like this:
1. allocate an area big enough for stack and guard (we don't use automatic
growing; this cannot work)
2. change the protection of the guard end of the stack to PROT_NONE.
So, for say 1000 threads, we'll end up with 2000 VMAs. Threads are also
important to mention here because:
- often they are short-lived and we have to recreate them often. We
usually reuse stacks, but only keep so much allocated memory around.
So, more often than we like, we actually free and later re-allocate stacks.
- these thousands of stack VMAs are really used concurrently. All
threads are woken over a period of time.
A third source of VMAs is anonymous memory allocation. mmap is used in
the malloc implementation and directly in various places. For
randomization reasons there isn't really much we can do here; we
shouldn't lump all these allocations together.
A fourth source of VMAs are the programs themselves which mmap files.
Often read-only mappings of many small files.
Put all this together and non-trivial apps as written today (I don't say
they are high-quality apps) can easily have a few thousand, maybe even
10,000 to 20,000 VMAs. Firefox on my machine uses at the moment ~560
VMAs and this is with only a handful of threads. Are these the numbers
the VM system is optimized for? I think what our people running the
experiments at the customer site saw is that it's not. The VMA
traversal showed up on the profile lists.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-05-06 16:05 ` Ulrich Drepper
@ 2006-05-07 4:22 ` Nick Piggin
2006-05-13 14:13 ` Nick Piggin
0 siblings, 1 reply; 46+ messages in thread
From: Nick Piggin @ 2006-05-07 4:22 UTC (permalink / raw)
To: Ulrich Drepper
Cc: Blaisorblade, Andrew Morton, linux-kernel,
Linux Memory Management, Val Henson
Ulrich Drepper wrote:
> Blaisorblade wrote:
>
>>Indeed, I've not seen the numbers; I've been told of a problem with a "customer
>>program", and Ingo connected my work with this problem. Frankly, I've always
>>been astonished that looking up a 10-level tree can be slow. Poor
>>cache locality is the only thing that I could think of.
>
>
> It might be good if I explain a bit how much we use mmap in libc. The
> numbers can really add up quickly.
[...]
Thanks. Very informative.
> Put all this together and non-trivial apps as written today (I don't say
> they are high-quality apps) can easily have a few thousand, maybe even
> 10,000 to 20,000 VMAs. Firefox on my machine uses at the moment ~560
> VMAs and this is with only a handful of threads. Are these the numbers
> the VM system is optimized for? I think what our people running the
> experiments at the customer site saw is that it's not. The VMA
> traversal showed up on the profile lists.
Your % improvement numbers are of course only talking about memory
usage improvements. Time complexity increases with the log of the
number of VMAs, so while a search within 100,000 vmas might have a CPU
cost of 16 arbitrary units, that is only about 300% of the cost with 40
vmas (and not the 250,000% that the raw number of vmas suggests).
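A quick way to sanity-check that arithmetic (hypothetical helper; it counts rbtree comparisons as ceil(log2(n))):

```c
/* Approximate rbtree lookup cost in comparisons: ceil(log2(n)).
 * 100,000 vmas cost ~17 comparisons where 40 vmas cost ~6 -- roughly
 * 3x, not the ~2,500x that the raw vma count would suggest. */
static unsigned int lookup_cost(unsigned long nr_vmas)
{
	unsigned int comparisons = 0;

	while (nr_vmas > 1) {
		nr_vmas = (nr_vmas + 1) / 2;	/* halve, rounding up */
		comparisons++;
	}
	return comparisons;
}
```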
Definitely reducing vmas would be good. If guard ranges around vmas
can be implemented easily and reduce vmas by even 20%, it would come
at an almost zero complexity cost to the kernel.
However, I think another consideration is the vma lookup cache. I need
to get around to looking at this again, but IMO it is inadequate for
threaded applications. Currently we have one last-lookup cached vma
for each mm. You get cacheline bouncing when updating the cache, and
the locality becomes almost useless.
I think possibly each thread should have a private vma cache, with
room for at least its stack vma(s) (and several others, eg. code,
data). Perhaps the per-mm cache could be dispensed with completely,
although it might be useful eg. for the heap. And it might be helped
with increased entries as well.
I've got patches lying around to implement this stuff -- I'd be
interested to have more detail about this problem, or distilled test
cases.
Nick
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-05-07 4:22 ` Nick Piggin
@ 2006-05-13 14:13 ` Nick Piggin
2006-05-13 18:19 ` Valerie Henson
0 siblings, 1 reply; 46+ messages in thread
From: Nick Piggin @ 2006-05-13 14:13 UTC (permalink / raw)
To: Ulrich Drepper
Cc: Blaisorblade, Andrew Morton, linux-kernel,
Linux Memory Management, Val Henson
[-- Attachment #1: Type: text/plain, Size: 1433 bytes --]
Nick Piggin wrote:
> I think possibly each thread should have a private vma cache, with
> room for at least its stack vma(s), (and several others, eg. code,
> data). Perhaps the per-mm cache could be dispensed with completely,
> although it might be useful eg. for the heap. And it might be helped
> with increased entries as well.
>
> I've got patches lying around to implement this stuff -- I'd be
> interested to have more detail about this problem, or distilled test
> cases.
OK, I got interested again, but can't get Val's ebizzy to give me
a find_vma-constrained workload yet (though the numbers back up
my assertion that the vma cache is crap for threaded apps).
Without the patch, after bootup, the vma cache gets 208 364 hits out
of 438 535 lookups (47.5%)
./ebizzy -t16: 384.29user 754.61system 5:31.87elapsed 343%CPU
And ebizzy gets 7 373 078 hits out of 82 255 863 lookups (8.9%)
With mm + 4 slot LRU per-thread cache (this patch):
After boot, 303 767 / 439 918 = 69.0%
./ebizzy -t16: 388.73user 750.29system 5:30.24elapsed 344%CPU
ebizzy hits: 53 024 083 / 82 055 195 = 64.6%
So on a non-threaded workload, the hit rate is increased by about 50%;
on a threaded workload it is increased more than sevenfold. In
rbtree-walk-constrained workloads, the total find_vma speedup should be
proportional to the hit-ratio improvement.
I don't think my ebizzy numbers can justify the patch though...
Nick
--
SUSE Labs, Novell Inc.
[-- Attachment #2: vma.patch --]
[-- Type: text/plain, Size: 6766 bytes --]
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c 2006-05-13 23:31:13.000000000 +1000
+++ linux-2.6/mm/mmap.c 2006-05-13 23:48:53.000000000 +1000
@@ -30,6 +30,99 @@
#include <asm/cacheflush.h>
#include <asm/tlb.h>
+static void vma_cache_touch(struct mm_struct *mm, struct vm_area_struct *vma)
+{
+ struct task_struct *curr = current;
+ if (mm == curr->mm) {
+ int i;
+ if (curr->vma_cache_sequence != mm->vma_sequence) {
+ curr->vma_cache_sequence = mm->vma_sequence;
+ curr->vma_cache[0] = vma;
+ for (i = 1; i < 4; i++)
+ curr->vma_cache[i] = NULL;
+ } else {
+ int update_mm;
+
+ if (curr->vma_cache[0] == vma)
+ return;
+
+ for (i = 1; i < 4; i++) {
+ if (curr->vma_cache[i] == vma)
+ break;
+ }
+ update_mm = 0;
+ if (i == 4) {
+ update_mm = 1;
+ i = 3;
+ }
+ while (i != 0) {
+ curr->vma_cache[i] = curr->vma_cache[i-1];
+ i--;
+ }
+ curr->vma_cache[0] = vma;
+
+ if (!update_mm)
+ return;
+ }
+ }
+
+ if (mm->vma_cache != vma) /* prevent cacheline bouncing */
+ mm->vma_cache = vma;
+}
+
+static void vma_cache_replace(struct mm_struct *mm, struct vm_area_struct *vma,
+ struct vm_area_struct *repl)
+{
+ mm->vma_sequence++;
+ if (unlikely(mm->vma_sequence == 0)) {
+ struct task_struct *curr = current, *t;
+ t = curr;
+ rcu_read_lock();
+ do {
+ t->vma_cache_sequence = -1;
+ t = next_thread(t);
+ } while (t != curr);
+ rcu_read_unlock();
+ }
+
+ if (mm->vma_cache == vma)
+ mm->vma_cache = repl;
+}
+
+static inline void vma_cache_invalidate(struct mm_struct *mm, struct vm_area_struct *vma)
+{
+ vma_cache_replace(mm, vma, NULL);
+}
+
+static struct vm_area_struct *vma_cache_find(struct mm_struct *mm,
+ unsigned long addr)
+{
+ struct task_struct *curr;
+ struct vm_area_struct *vma;
+
+ inc_page_state(vma_cache_query);
+
+ curr = current;
+ if (mm == curr->mm && mm->vma_sequence == curr->vma_cache_sequence) {
+ int i;
+ for (i = 0; i < 4; i++) {
+ vma = curr->vma_cache[i];
+ if (vma && vma->vm_end > addr && vma->vm_start <= addr){
+ inc_page_state(vma_cache_hit);
+ return vma;
+ }
+ }
+ }
+
+ vma = mm->vma_cache;
+ if (vma && vma->vm_end > addr && vma->vm_start <= addr) {
+ inc_page_state(vma_cache_hit);
+ return vma;
+ }
+
+ return NULL;
+}
+
static void unmap_region(struct mm_struct *mm,
struct vm_area_struct *vma, struct vm_area_struct *prev,
unsigned long start, unsigned long end);
@@ -460,8 +553,6 @@
{
prev->vm_next = vma->vm_next;
rb_erase(&vma->vm_rb, &mm->mm_rb);
- if (mm->mmap_cache == vma)
- mm->mmap_cache = prev;
}
/*
@@ -586,6 +677,7 @@
* us to remove next before dropping the locks.
*/
__vma_unlink(mm, next, vma);
+ vma_cache_replace(mm, next, vma);
if (file)
__remove_shared_vm_struct(next, file, mapping);
if (next->anon_vma)
@@ -1384,8 +1476,8 @@
if (mm) {
/* Check the cache first. */
/* (Cache hit rate is typically around 35%.) */
- vma = mm->mmap_cache;
- if (!(vma && vma->vm_end > addr && vma->vm_start <= addr)) {
+ vma = vma_cache_find(mm, addr);
+ if (!vma) {
struct rb_node * rb_node;
rb_node = mm->mm_rb.rb_node;
@@ -1405,9 +1497,9 @@
} else
rb_node = rb_node->rb_right;
}
- if (vma)
- mm->mmap_cache = vma;
}
+ if (vma)
+ vma_cache_touch(mm, vma);
}
return vma;
}
@@ -1424,6 +1516,14 @@
if (!mm)
goto out;
+ vma = vma_cache_find(mm, addr);
+ if (vma) {
+ rb_node = rb_prev(&vma->vm_rb);
+ if (rb_node)
+ prev = rb_entry(rb_node, struct vm_area_struct, vm_rb);
+ goto out;
+ }
+
/* Guard against addr being lower than the first VMA */
vma = mm->mmap;
@@ -1445,6 +1545,9 @@
}
out:
+ if (vma)
+ vma_cache_touch(mm, vma);
+
*pprev = prev;
return prev ? prev->vm_next : vma;
}
@@ -1686,6 +1789,7 @@
insertion_point = (prev ? &prev->vm_next : &mm->mmap);
do {
+ vma_cache_invalidate(mm, vma);
rb_erase(&vma->vm_rb, &mm->mm_rb);
mm->map_count--;
tail_vma = vma;
@@ -1698,7 +1802,6 @@
else
addr = vma ? vma->vm_start : mm->mmap_base;
mm->unmap_area(mm, addr);
- mm->mmap_cache = NULL; /* Kill the cache. */
}
/*
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h 2006-05-13 23:31:05.000000000 +1000
+++ linux-2.6/include/linux/page-flags.h 2006-05-13 23:31:44.000000000 +1000
@@ -164,6 +164,9 @@
unsigned long pgrotated; /* pages rotated to tail of the LRU */
unsigned long nr_bounce; /* pages for bounce buffers */
+
+ unsigned long vma_cache_hit;
+ unsigned long vma_cache_query;
};
extern void get_page_state(struct page_state *ret);
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2006-05-13 23:31:05.000000000 +1000
+++ linux-2.6/mm/page_alloc.c 2006-05-13 23:31:44.000000000 +1000
@@ -2389,6 +2389,9 @@
"pgrotated",
"nr_bounce",
+
+ "vma_cache_hit",
+ "vma_cache_query",
};
static void *vmstat_start(struct seq_file *m, loff_t *pos)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h 2006-05-13 23:31:03.000000000 +1000
+++ linux-2.6/include/linux/sched.h 2006-05-13 23:33:01.000000000 +1000
@@ -293,9 +293,11 @@
} while (0)
struct mm_struct {
- struct vm_area_struct * mmap; /* list of VMAs */
+ struct vm_area_struct *mmap; /* list of VMAs */
struct rb_root mm_rb;
- struct vm_area_struct * mmap_cache; /* last find_vma result */
+ struct vm_area_struct *vma_cache;
+ unsigned long vma_sequence;
+
unsigned long (*get_unmapped_area) (struct file *filp,
unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags);
@@ -734,6 +736,8 @@
struct list_head ptrace_list;
struct mm_struct *mm, *active_mm;
+ struct vm_area_struct *vma_cache[4];
+ unsigned long vma_cache_sequence;
/* task state */
struct linux_binfmt *binfmt;
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c 2006-05-13 23:31:03.000000000 +1000
+++ linux-2.6/kernel/fork.c 2006-05-13 23:32:59.000000000 +1000
@@ -197,7 +197,7 @@
mm->locked_vm = 0;
mm->mmap = NULL;
- mm->mmap_cache = NULL;
+ mm->vma_sequence = oldmm->vma_sequence+1;
mm->free_area_cache = oldmm->mmap_base;
mm->cached_hole_size = ~0UL;
mm->map_count = 0;
@@ -238,6 +238,10 @@
tmp->vm_next = NULL;
anon_vma_link(tmp);
file = tmp->vm_file;
+
+ if (oldmm->vma_cache == mpnt)
+ mm->vma_cache = tmp;
+
if (file) {
struct inode *inode = file->f_dentry->d_inode;
get_file(file);
^ permalink raw reply [flat|nested] 46+ messages in thread* Re: [patch 00/14] remap_file_pages protection support
2006-05-13 14:13 ` Nick Piggin
@ 2006-05-13 18:19 ` Valerie Henson
2006-05-13 22:54 ` Valerie Henson
2006-05-16 13:30 ` Nick Piggin
0 siblings, 2 replies; 46+ messages in thread
From: Valerie Henson @ 2006-05-13 18:19 UTC (permalink / raw)
To: Nick Piggin
Cc: Ulrich Drepper, Blaisorblade, Andrew Morton, linux-kernel,
Linux Memory Management, Val Henson
On Sun, May 14, 2006 at 12:13:21AM +1000, Nick Piggin wrote:
>
> OK, I got interested again, but can't get Val's ebizzy to give me
> a find_vma constrained workload yet (though the numbers back up
> my assertion that the vma cache is crap for threaded apps).
Hey Nick,
Glad to see you're using it! There are (at least) two ways to do what
you want:
1. Increase the number of threads - this gives you two vma's per
thread, one for stack, one for guard page:
$ ./ebizzy -t 100
2. Apply the patch at the end of this email and use -p "prevent
coalescing", -m "always mmap" and appropriate number of chunks,
size, and records to search - this works for me:
$ ./ebizzy -p -m -n 10000 -s 4096 -r 100000
The original program mmapped everything with the same permissions and
no alignment restrictions, so all the mmaps were coalesced into one.
This version alternates PROT_WRITE permissions on the mmap'd areas
after they are written, so you get lots of vma's:
val@goober:~/ebizzy$ ./ebizzy -p -m -n 10000 -s 4096 -r 100000
[2]+ Stopped ./ebizzy -p -m -n 10000 -s 4096 -r 100000
val@goober:~/ebizzy$ wc -l /proc/`pgrep ebizzy`/maps
10019 /proc/10917/maps
I haven't profiled to see if this brings find_vma to the top, though.
(The patch also moves around some other stuff so that options are in
alphabetical order; apparently I thought 's' came after 'r' and before
'R'...)
-VAL
--- ebizzy.c.old 2006-05-13 10:18:58.000000000 -0700
+++ ebizzy.c 2006-05-13 11:01:42.000000000 -0700
@@ -52,9 +52,10 @@
static unsigned int always_mmap;
static unsigned int never_mmap;
static unsigned int chunks;
+static unsigned int prevent_coalescing;
static unsigned int records;
-static unsigned int chunk_size;
static unsigned int random_size;
+static unsigned int chunk_size;
static unsigned int threads;
static unsigned int verbose;
static unsigned int linear;
@@ -76,9 +77,10 @@
"-m\t\t Always use mmap instead of malloc\n"
"-M\t\t Never use mmap\n"
"-n <num>\t Number of memory chunks to allocate\n"
+ "-p \t\t Prevent mmap coalescing\n"
"-r <num>\t Total number of records to search for\n"
- "-s <size>\t Size of memory chunks, in bytes\n"
"-R\t\t Randomize size of memory to copy and search\n"
+ "-s <size>\t Size of memory chunks, in bytes\n"
"-t <num>\t Number of threads\n"
"-v[v[v]]\t Be verbose (more v's for more verbose)\n"
"-z\t\t Linear search instead of binary search\n",
@@ -98,7 +100,7 @@
cmd = argv[0];
opterr = 1;
- while ((c = getopt(argc, argv, "mMn:r:s:Rt:vz")) != -1) {
+ while ((c = getopt(argc, argv, "mMn:pr:Rs:t:vz")) != -1) {
switch (c) {
case 'm':
always_mmap = 1;
@@ -111,19 +113,22 @@
if (chunks == 0)
usage();
break;
+ case 'p':
+ prevent_coalescing = 1;
+ break;
case 'r':
records = atoi(optarg);
if (records == 0)
usage();
break;
+ case 'R':
+ random_size = 1;
+ break;
case 's':
chunk_size = atoi(optarg);
if (chunk_size == 0)
usage();
break;
- case 'R':
- random_size = 1;
- break;
case 't':
threads = atoi(optarg);
if (threads == 0)
@@ -141,7 +146,7 @@
}
if (verbose)
- printf("ebizzy 0.1, Copyright 2006 Intel Corporation\n"
+ printf("ebizzy 0.2, Copyright 2006 Intel Corporation\n"
"Written by Val Henson <val_henson@linux.intel.com\n");
/*
@@ -173,10 +178,11 @@
printf("always_mmap %u\n", always_mmap);
printf("never_mmap %u\n", never_mmap);
printf("chunks %u\n", chunks);
+ printf("prevent coalescing %u\n", prevent_coalescing);
printf("records %u\n", records);
printf("records per thread %u\n", records_per_thread);
- printf("chunk_size %u\n", chunk_size);
printf("random_size %u\n", random_size);
+ printf("chunk_size %u\n", chunk_size);
printf("threads %u\n", threads);
printf("verbose %u\n", verbose);
printf("linear %u\n", linear);
@@ -251,9 +257,13 @@
{
int i, j;
- for (i = 0; i < chunks; i++)
+ for (i = 0; i < chunks; i++) {
for(j = 0; j < chunk_size / record_size; j++)
mem[i][j] = (record_t) j;
+ /* Prevent coalescing by alternating permissions */
+ if (prevent_coalescing && (i % 2) == 0)
+ mprotect(mem[i], chunk_size, PROT_READ);
+ }
if (verbose)
printf("Wrote memory\n");
}
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-05-13 18:19 ` Valerie Henson
@ 2006-05-13 22:54 ` Valerie Henson
2006-05-16 13:30 ` Nick Piggin
1 sibling, 0 replies; 46+ messages in thread
From: Valerie Henson @ 2006-05-13 22:54 UTC (permalink / raw)
To: Nick Piggin
Cc: Ulrich Drepper, Blaisorblade, Andrew Morton, linux-kernel,
Linux Memory Management, Arjan van de Ven
On Sat, May 13, 2006 at 11:19:46AM -0700, Valerie Henson wrote:
> The original program mmapped everything with the same permissions and
> no alignment restrictions, so all the mmaps were coalesced into one.
> This version alternates PROT_WRITE permissions on the mmap'd areas
> after they are written, so you get lots of vma's:
... Which is of course exactly the case that Blaisorblade's patches
should coalesce into one vma. So I wrote another option which uses
holes instead - takes more memory initially, unfortunately. Grab it
from:
http://www.nmt.edu/~val/patches/ebizzy.tar.gz
-p for preventing coalescing via protections, -P for preventing via
holes.
-VAL
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-05-13 18:19 ` Valerie Henson
2006-05-13 22:54 ` Valerie Henson
@ 2006-05-16 13:30 ` Nick Piggin
2006-05-16 13:51 ` Andreas Mohr
2006-05-16 16:33 ` Valerie Henson
1 sibling, 2 replies; 46+ messages in thread
From: Nick Piggin @ 2006-05-16 13:30 UTC (permalink / raw)
To: Valerie Henson
Cc: Ulrich Drepper, Blaisorblade, Andrew Morton, linux-kernel,
Linux Memory Management, Val Henson
Valerie Henson wrote:
> On Sun, May 14, 2006 at 12:13:21AM +1000, Nick Piggin wrote:
>
>>OK, I got interested again, but can't get Val's ebizzy to give me
>>a find_vma constrained workload yet (though the numbers back up
>>my assertion that the vma cache is crap for threaded apps).
>
>
> Hey Nick,
>
> Glad to see you're using it! There are (at least) two ways to do what
> you want:
>
> 1. Increase the number of threads - this gives you two vma's per
> thread, one for stack, one for guard page:
>
> $ ./ebizzy -t 100
>
> 2. Apply the patch at the end of this email and use -p "prevent
> coalescing", -m "always mmap" and appropriate number of chunks,
> size, and records to search - this works for me:
>
> $ ./ebizzy -p -m -n 10000 -s 4096 -r 100000
>
> The original program mmapped everything with the same permissions and
> no alignment restrictions, so all the mmaps were coalesced into one.
> This version alternates PROT_WRITE permissions on the mmap'd areas
> after they are written, so you get lots of vma's:
>
> val@goober:~/ebizzy$ ./ebizzy -p -m -n 10000 -s 4096 -r 100000
>
> [2]+ Stopped ./ebizzy -p -m -n 10000 -s 4096 -r 100000
> val@goober:~/ebizzy$ wc -l /proc/`pgrep ebizzy`/maps
> 10019 /proc/10917/maps
>
> I haven't profiled to see if this brings find_vma to the top, though.
>
Hi Val,
Thanks. I've tried with your most recent ebizzy, with 256 threads and
50,000 vmas (which gives really poor mmap_cache hits), but I'm still unable
to get find_vma above a few % of kernel time.
With 50,000 vmas, my per-thread vma cache is much less effective, I guess
because access is pretty random (hopefully more realistic patterns would
get a bigger improvement).
I also tried running kbuild under UML, and could not make find_vma take
much time either [in this case, the per-thread vma cache patch roughly
doubles the number of hits, from about 15%->30% (in the host)].
So I guess it's time to go back into my hole. If anyone does come across
a find_vma constrained workload (especially with threads), I'd be very
interested.
Thanks,
Nick
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-05-16 13:30 ` Nick Piggin
@ 2006-05-16 13:51 ` Andreas Mohr
2006-05-16 16:31 ` Valerie Henson
2006-05-16 16:33 ` Valerie Henson
1 sibling, 1 reply; 46+ messages in thread
From: Andreas Mohr @ 2006-05-16 13:51 UTC (permalink / raw)
To: Nick Piggin
Cc: Valerie Henson, Ulrich Drepper, Blaisorblade, Andrew Morton,
linux-kernel, Linux Memory Management, Val Henson
Hi,
On Tue, May 16, 2006 at 11:30:32PM +1000, Nick Piggin wrote:
> I also tried running kbuild under UML, and could not make find_vma take
> much time either [in this case, the per-thread vma cache patch roughly
> doubles the number of hits, from about 15%->30% (in the host)].
>
> So I guess it's time to go back into my hole. If anyone does come across
> a find_vma constrained workload (especially with threads), I'd be very
> interested.
I cannot offer much more than some anecdotal confirmation: in my own
oprofiling, whatever I did (often running a load test script consisting of
launching 30 big apps at the same time), find_vma basically always showed up
very prominently in the list of vmlinux-based code (always ranking within the
top 4 or 5 kernel hotspots, alongside timer interrupts, ACPI idle I/O etc.).
Call-tracing showed it originating from mmap syscalls etc., and AFAIR quite
a lot of find_vma activity came from oprofile itself.
Profiling done on 512MB UP Athlon and P3/700, 2.6.16ish, current Debian.
Sorry for the foggy report, I don't have those logs here right now.
So yes, improving that part should help in general, but I cannot quite
say that my machines are "constrained" by it.
But you probably knew that already, otherwise you wouldn't have poked
in there... ;)
Andreas Mohr
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-05-16 13:51 ` Andreas Mohr
@ 2006-05-16 16:31 ` Valerie Henson
2006-05-16 16:47 ` Andreas Mohr
0 siblings, 1 reply; 46+ messages in thread
From: Valerie Henson @ 2006-05-16 16:31 UTC (permalink / raw)
To: Andreas Mohr
Cc: Nick Piggin, Ulrich Drepper, Blaisorblade, Andrew Morton,
linux-kernel, Linux Memory Management, Val Henson
On Tue, May 16, 2006 at 03:51:35PM +0200, Andreas Mohr wrote:
>
> I cannot offer much other than some random confirmation that from my own
> oprofiling, whatever I did (often running a load test script consisting of
> launching 30 big apps at the same time), find_vma basically always showed up
> very prominently in the list of vmlinux-based code (always ranking within the
> top 4 or 5 kernel hotspots, such as timer interrupts, ACPI idle I/O etc.pp.).
> call-tracing showed it originating from mmap syscalls etc., and AFAIR quite
> some find_vma activity from oprofile itself.
This is important: Which kernel?
The cases I saw it in were in a (now old) SuSE kernel which as it
turns out had old/different vma lookup code.
Thanks,
-VAL
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-05-16 16:31 ` Valerie Henson
@ 2006-05-16 16:47 ` Andreas Mohr
2006-05-17 3:25 ` Nick Piggin
2006-05-17 6:10 ` Blaisorblade
0 siblings, 2 replies; 46+ messages in thread
From: Andreas Mohr @ 2006-05-16 16:47 UTC (permalink / raw)
To: Valerie Henson
Cc: Nick Piggin, Ulrich Drepper, Blaisorblade, Andrew Morton,
linux-kernel, Linux Memory Management, Val Henson
Hi,
On Tue, May 16, 2006 at 09:31:12AM -0700, Valerie Henson wrote:
> On Tue, May 16, 2006 at 03:51:35PM +0200, Andreas Mohr wrote:
> >
> > I cannot offer much other than some random confirmation that from my own
> > oprofiling, whatever I did (often running a load test script consisting of
> > launching 30 big apps at the same time), find_vma basically always showed up
> > very prominently in the list of vmlinux-based code (always ranking within the
> > top 4 or 5 kernel hotspots, such as timer interrupts, ACPI idle I/O etc.pp.).
> > call-tracing showed it originating from mmap syscalls etc., and AFAIR quite
> > some find_vma activity from oprofile itself.
>
> This is important: Which kernel?
I had some traces still showing find_vma prominently during a profiling run
just yesterday, with a very fresh 2.6.17-rc4-ck1 (IOW, basically 2.6.17-rc4).
I added some cache prefetching in the list traversal a while ago, and IIRC
that improved profiling times there, but cache prefetching is very often
a bandaid in search of a real solution: a better data-handling algorithm.
Andreas Mohr
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-05-16 16:47 ` Andreas Mohr
@ 2006-05-17 3:25 ` Nick Piggin
2006-05-17 6:10 ` Blaisorblade
1 sibling, 0 replies; 46+ messages in thread
From: Nick Piggin @ 2006-05-17 3:25 UTC (permalink / raw)
To: Andreas Mohr
Cc: Valerie Henson, Ulrich Drepper, Blaisorblade, Andrew Morton,
linux-kernel, Linux Memory Management, Val Henson
Andreas Mohr wrote:
>Hi,
>
>On Tue, May 16, 2006 at 09:31:12AM -0700, Valerie Henson wrote:
>
>>On Tue, May 16, 2006 at 03:51:35PM +0200, Andreas Mohr wrote:
>>
>>>I cannot offer much other than some random confirmation that from my own
>>>oprofiling, whatever I did (often running a load test script consisting of
>>>launching 30 big apps at the same time), find_vma basically always showed up
>>>very prominently in the list of vmlinux-based code (always ranking within the
>>>top 4 or 5 kernel hotspots, such as timer interrupts, ACPI idle I/O etc.pp.).
>>>call-tracing showed it originating from mmap syscalls etc., and AFAIR quite
>>>some find_vma activity from oprofile itself.
>>>
>>This is important: Which kernel?
>>
>
>I had some traces still showing find_vma prominently during a profiling run
>just yesterday, with a very fresh 2.6.17-rc4-ck1 (IOW, basically 2.6.17-rc4).
>I added some cache prefetching in the list traversal a while ago, and IIRC
>that improved profiling times there, but cache prefetching is very often
>a bandaid in search for a real solution: a better data-handling algorithm.
>
If you want to try out the patch and see what it does for you, that would be
interesting. I'll repost a slightly cleaned up version in a couple of hours.
Nick
--
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-05-16 16:47 ` Andreas Mohr
2006-05-17 3:25 ` Nick Piggin
@ 2006-05-17 6:10 ` Blaisorblade
1 sibling, 0 replies; 46+ messages in thread
From: Blaisorblade @ 2006-05-17 6:10 UTC (permalink / raw)
To: Andreas Mohr
Cc: Valerie Henson, Nick Piggin, Ulrich Drepper, Andrew Morton,
linux-kernel, Linux Memory Management, Val Henson
On Tuesday 16 May 2006 18:47, Andreas Mohr wrote:
> Hi,
>
> On Tue, May 16, 2006 at 09:31:12AM -0700, Valerie Henson wrote:
> > On Tue, May 16, 2006 at 03:51:35PM +0200, Andreas Mohr wrote:
> > > I cannot offer much other than some random confirmation that from my
> > > own oprofiling, whatever I did (often running a load test script
> > > consisting of launching 30 big apps at the same time), find_vma
> > > basically always showed up very prominently in the list of
> > > vmlinux-based code (always ranking within the top 4 or 5 kernel
> > > hotspots, such as timer interrupts, ACPI idle I/O etc.pp.).
> > > call-tracing showed it originating from mmap syscalls etc., and AFAIR
> > > quite some find_vma activity from oprofile itself.
> >
> > This is important: Which kernel?
I'd also add (for everyone): on which processors? L2 cache size probably
plays an important role, if (as I'm convinced) the problem is cache misses
during rb-tree traversal.
> I had some traces still showing find_vma prominently during a profiling run
> just yesterday, with a very fresh 2.6.17-rc4-ck1 (IOW, basically
> 2.6.17-rc4). I added some cache prefetching in the list traversal a while
> ago,
You mean the rb-tree traversal, I guess! Or was the base kernel so old?
> and IIRC that improved profiling times there, but cache prefetching is
> very often a bandaid in search for a real solution: a better data-handling
> algorithm.
Ok, finally I find the time to kick in and ask a couple of questions.
The current algorithm is good but has poor cache locality (IMHO).
First, since you can get find_vma on the profile: I've read that oprofile
can trace L2 cache misses (the article talked about userspace apps, but I
think it applies to kernelspace too).
I think such profiling, if possible, would be particularly interesting:
there's no reason whatsoever why that lookup, even on a 32-level tree (the
theoretical maximum, since we have at most 64K vmas and height_rbtree <= 2 log N),
should be so slow, unless you add cache misses into the picture. The fact
that cache prefetching helps shows this even more.
The lookup has very poor cache locality: the rb-node (3 pointers, i.e. 12
bytes on 32-bit, of which we need only 2 pointers on searches) is surrounded
by irrelevant data that we fetch anyway (we don't need the VMA itself for the
nodes we merely traverse).
For cache locality the best data structure I know of is the radix tree; but
changing the implementation is absolutely non-trivial (the find_vma_prev()
and friends API is tightly coupled to the rb-tree), the size of the tree
grows with the virtual address space (which is a problem on 64-bit archs),
and, finally, you only get locality when you do multiple searches, especially
for the root nodes, but not across different levels inside a single search.
--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-05-16 13:30 ` Nick Piggin
2006-05-16 13:51 ` Andreas Mohr
@ 2006-05-16 16:33 ` Valerie Henson
1 sibling, 0 replies; 46+ messages in thread
From: Valerie Henson @ 2006-05-16 16:33 UTC (permalink / raw)
To: Nick Piggin
Cc: Ulrich Drepper, Blaisorblade, Andrew Morton, linux-kernel,
Linux Memory Management
On Tue, May 16, 2006 at 11:30:32PM +1000, Nick Piggin wrote:
>
> Hi Val,
>
> Thanks, I've tried with your most recent ebizzy and with 256 threads and
> 50,000 vmas (which gives really poor mmap_cache hits), I'm still unable
> to get find_vma above a few % of kernel time.
How excellent! Sometimes negative results are worth publishing. :)
-VAL
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-05-02 3:45 ` [patch 00/14] remap_file_pages protection support Nick Piggin
` (2 preceding siblings ...)
2006-05-03 0:25 ` Blaisorblade
@ 2006-05-03 0:44 ` Blaisorblade
2006-05-06 9:06 ` Nick Piggin
3 siblings, 1 reply; 46+ messages in thread
From: Blaisorblade @ 2006-05-03 0:44 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, linux-kernel, Linux Memory Management,
Ulrich Drepper, Val Henson
On Tuesday 02 May 2006 05:45, Nick Piggin wrote:
> blaisorblade@yahoo.it wrote:
> > The first idea is to use this for UML - it must create a lot of single
> > page mappings, and managing them through separate VMAs is slow.
> I think I would rather this all just folded under VM_NONLINEAR rather than
> having this extra MANYPROTS thing, no? (you're already doing that in one
> direction).
That step is _temporary_ if the extra usages are accepted.
Also, I reported (changelog of patch 03/14) a definite API bug you get if you
don't distinguish VM_MANYPROTS from VM_NONLINEAR. I'm pasting it here because
that changelog is rather long:
"In fact, without this flag, we'd have indeed a regression with
remap_file_pages VS mprotect, on uniform nonlinear VMAs.
mprotect alters the VMA prots and walks each present PTE, ignoring installed
ones, even when pte_file() is on; their saved prots will be restored on
faults, ignoring VMA ones and losing the mprotect() on them. So, in
do_file_page(), we
must restore anyway VMA prots when the VMA is uniform, as we used to do before
this trail of patches."
> > Additional note: this idea, with some further refinements (which I'll
> > code after this chunk is accepted), will allow to reduce the number of
> > used VMAs for most userspace programs - in particular, it will allow to
> > avoid creating one VMA for one guard pages (which has PROT_NONE) -
> > forcing PROT_NONE on that page will be enough.
> I think that's silly. Your VM_MANYPROTS|VM_NONLINEAR vmas will cause more
> overhead in faulting and reclaim.
I know that problem. In fact for that we want VM_MANYPROTS without
VM_NONLINEAR.
> It looks like it would take an hour or two just to code up a patch which
> puts a VM_GUARDPAGES flag into the vma, and tells the free area allocator
> to skip vm_start-1 .. vm_end+1
we must refine which pages to skip (the example I saw has only one guard page,
if I'm not mistaken) but
> . What kind of troubles has prevented
> something simple and easy like that from going in?
A far better idea... It's just that the original proposal was wider,
and that we looked at the problem in the wrong way (plus, we wanted anyway to
have the present work merged, so that wasn't a problem).
Ulrich wanted to have code+data (+guard on 64-bit) in the same VMA, but I
left the code+data VMA joining out, to think more about it, since currently
it's too slow on swapout.
The other part is avoiding guard VMAs for thread stacks, and that could be
accomplished by your proposal too - iff this work is held out, however.
--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-05-03 0:44 ` Blaisorblade
@ 2006-05-06 9:06 ` Nick Piggin
2006-05-06 15:26 ` Ulrich Drepper
0 siblings, 1 reply; 46+ messages in thread
From: Nick Piggin @ 2006-05-06 9:06 UTC (permalink / raw)
To: Blaisorblade
Cc: Andrew Morton, linux-kernel, Linux Memory Management,
Ulrich Drepper, Val Henson
Blaisorblade wrote:
> On Tuesday 02 May 2006 05:45, Nick Piggin wrote:
>
>>blaisorblade@yahoo.it wrote:
>>
>>>The first idea is to use this for UML - it must create a lot of single
>>>page mappings, and managing them through separate VMAs is slow.
>
>
>>I think I would rather this all just folded under VM_NONLINEAR rather than
>>having this extra MANYPROTS thing, no? (you're already doing that in one
>>direction).
>
>
> That step is _temporary_ if the extra usages are accepted.
Can we try to get the whole design right from the start?
>
> Also, I reported (changelog of patch 03/14) a definite API bug you get if you
> don't distinguish VM_MANYPROTS from VM_NONLINEAR. I'm pasting it here because
> that changelog is rather long:
>
> "In fact, without this flag, we'd have indeed a regression with
> remap_file_pages VS mprotect, on uniform nonlinear VMAs.
>
> mprotect alters the VMA prots and walks each present PTE, ignoring installed
> ones, even when pte_file() is on; their saved prots will be restored on
> faults,
> ignoring VMA ones and losing the mprotect() on them. So, in do_file_page(), we
> must restore anyway VMA prots when the VMA is uniform, as we used to do before
> this trail of patches."
It is only a bug because you hadn't plugged the hole -- make it fix up pte_file
ones as well.
> Ulrich wanted to have code+data(+guard on 64-bit) into the same VMA, but I
> left the code+data VMA joining away, to think more with it, since currently
> it's too slow on swapout.
Yes, and it would be ridiculous to do this with nonlinear protections anyway.
If the vma code is so slow that glibc wants to merge code and data vmas together,
then we obviously need to fix the data structure (which will help everyone)
rather than hacking around it.
>
> The other part is avoiding guard VMAs for thread stacks, and that could be
> accomplished too by your proposal. Iff this work is held out, however.
I see no reason why they couldn't both go in. In fact, having an mmap flag for
adding guard regions around vmas (and perhaps eg. a system-wide / per-process
option for stack) could almost go in tomorrow.
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-05-06 9:06 ` Nick Piggin
@ 2006-05-06 15:26 ` Ulrich Drepper
0 siblings, 0 replies; 46+ messages in thread
From: Ulrich Drepper @ 2006-05-06 15:26 UTC (permalink / raw)
To: Nick Piggin
Cc: Blaisorblade, Andrew Morton, linux-kernel,
Linux Memory Management, Val Henson
[-- Attachment #1: Type: text/plain, Size: 853 bytes --]
Nick Piggin wrote:
> I see no reason why they couldn't both go in. In fact, having an mmap
> flag for
> adding guard regions around vmas (and perhaps eg. a system-wide /
> per-process
> option for stack) could almost go in tomorrow.
This would have to be flexible, though. For thread stacks, at least,
the programmer is able to specify the size of the guard area. It can be
arbitrarily large.
Also, consider IA-64. Here we have two stacks. We allocate them with
one mmap call and put the guard somewhere in the middle (the optimal
ratio of CPU and register stack size is yet to be determined) and have
the stacks grow toward each other. This currently results in three VMAs.
Anything which results in more VMAs obviously isn't good.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-04-30 17:29 [patch 00/14] remap_file_pages protection support blaisorblade
` (14 preceding siblings ...)
2006-05-02 3:45 ` [patch 00/14] remap_file_pages protection support Nick Piggin
@ 2006-05-02 10:21 ` Arjan van de Ven
2006-05-02 23:46 ` Valerie Henson
2006-05-03 0:26 ` Blaisorblade
15 siblings, 2 replies; 46+ messages in thread
From: Arjan van de Ven @ 2006-05-02 10:21 UTC (permalink / raw)
To: blaisorblade; +Cc: linux-kernel, Andrew Morton
> This will be useful since the VMA lookup at fault time can be a bottleneck for
> some programs (I've received a report about this from Ulrich Drepper and I've
> been told that also Val Henson from Intel is interested about this).
I've not seen much of this, if any at all; the various caches that are in
place for these lookups seem to function quite well. What we did see was
glibc's malloc implementation being mistuned, resulting in far more mmaps
than needed, which in turn leads to far too much page zeroing - and that
is the really expensive part. It's not the vma lookup that is expensive,
it's the page zeroing.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-05-02 10:21 ` Arjan van de Ven
@ 2006-05-02 23:46 ` Valerie Henson
2006-05-03 0:26 ` Blaisorblade
1 sibling, 0 replies; 46+ messages in thread
From: Valerie Henson @ 2006-05-02 23:46 UTC (permalink / raw)
To: Arjan van de Ven; +Cc: blaisorblade, linux-kernel, Andrew Morton
On Tue, May 02, 2006 at 12:21:06PM +0200, Arjan van de Ven wrote:
>
> > This will be useful since the VMA lookup at fault time can be a bottleneck for
> > some programs (I've received a report about this from Ulrich Drepper and I've
> > been told that also Val Henson from Intel is interested about this).
>
> I've not seen much of this if any at all, the various caches that are in
> place for these lookups seem to function quite well; what we did see was
> glibc's malloc implementation being mistuned resulting in far too many
> mmaps than needed (which in turn leads to far too much page zeroing
> which is the really expensive part. It's not the vma lookup that is
> expensive, it's the page zeroing)
VMA lookup time hasn't been noticeable on the systems we're running
ebizzy[1] on, which is what Arjan is talking about. I did see it with
a customer application last year - on a kernel without the RB tree for
looking up vma's. My vague recollection of the oprofile results was
something on the order of 2-10% of cpu time spent in find_vma() and
similar vma handling functions. This was on an application with 100's
of vmas due to malloc() being tuned to use mmap() too often. Arjan
and I whomped up a patch to glibc to fix the root cause; for details
see:
http://sourceware.org/ml/libc-alpha/2006-03/msg00033.html
A more legitimate situation resulting in 100's of vma's is the JVM
case - you end up with at least 2 vma's per thread, for the stack and
guard page. I personally have seen a JVM with over 100 threads in
active use and over 500 vma's. (One of the nice things about your
patch is that it will eliminate the separate guard page vma, reducing
the number of necessary vma's by one per thread.)
I intend to go back and look for find_vma() and friends while running
ebizzy, since I wrote the darn thing partly to expose that problem;
however I suspect it's mostly gone now that we have RB-trees for
vma's. If you want to use ebizzy to evaluate your patches, just run
it with the -m "always mmap" argument and tune the various thread and
memory allocation options until you get a suitable number of vma's.
-VAL
[1] ebizzy is an application I wrote to replicate this kind of
workload. For more info, see:
http://infohost.nmt.edu/~val/patches.html#ebizzy
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-05-02 10:21 ` Arjan van de Ven
2006-05-02 23:46 ` Valerie Henson
@ 2006-05-03 0:26 ` Blaisorblade
2006-05-03 1:44 ` Ulrich Drepper
1 sibling, 1 reply; 46+ messages in thread
From: Blaisorblade @ 2006-05-03 0:26 UTC (permalink / raw)
To: Arjan van de Ven; +Cc: linux-kernel, Andrew Morton, Ulrich Drepper, Val Henson
On Tuesday 02 May 2006 12:21, Arjan van de Ven wrote:
> > This will be useful since the VMA lookup at fault time can be a
> > bottleneck for some programs (I've received a report about this from
> > Ulrich Drepper and I've been told that also Val Henson from Intel is
> > interested about this).
> I've not seen much of this if any at all, the various caches that are in
> place for these lookups seem to function quite well; what we did see was
> glibc's malloc implementation being mistuned resulting in far too many
> mmaps than needed (which in turn leads to far too much page zeroing
> which is the really expensive part. It's not the vma lookup that is
> expensive, it's the page zeroing)
I hope Ulrich will answer this email as well.
--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [patch 00/14] remap_file_pages protection support
2006-05-03 0:26 ` Blaisorblade
@ 2006-05-03 1:44 ` Ulrich Drepper
0 siblings, 0 replies; 46+ messages in thread
From: Ulrich Drepper @ 2006-05-03 1:44 UTC (permalink / raw)
To: Blaisorblade; +Cc: Arjan van de Ven, linux-kernel, Andrew Morton, Val Henson
[-- Attachment #1: Type: text/plain, Size: 964 bytes --]
Blaisorblade wrote:
>> I've not seen much of this if any at all, the various caches that are in
>> place for these lookups seem to function quite well; what we did see was
>> glibc's malloc implementation being mistuned resulting in far too many
>> mmaps than needed (which in turn leads to far too much page zeroing
>> which is the really expensive part. It's not the vma lookup that is
>> expensive, it's the page zeroing)
> Even to this email, I hope Ulrich will answer.
All I can say is that some of our guys tuning a big application on a
customer's site reported seeing the VMA lookups show up on the profile.
This was some huge Java program. It might be that every other
page had a different protection: executable or not, read-only mmap, etc.
And data access was very non-local.
I cannot say more since this was near the end of the trials.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]
^ permalink raw reply [flat|nested] 46+ messages in thread