* [PATCH] fork support
@ 2005-07-19 16:55 Michael S. Tsirkin
2005-07-25 17:19 ` [PATCH repost] PROT_DONTCOPY: ifiniband uverbs " Michael S. Tsirkin
0 siblings, 1 reply; 17+ messages in thread
From: Michael S. Tsirkin @ 2005-07-19 16:55 UTC
To: Roland Dreier, openib-general; +Cc: linux-kernel
Here's a patch to the Linux kernel to enable fork() support for
InfiniBand uverbs (userspace I/O initiator).
Please Cc me with comments.
---
This patch adds PROT_DONTCOPY to mmap and mprotect, to set VM_DONTCOPY on the vma.
This is needed for infiniband userspace i/o, where we need to protect against
- the child process accessing the parent's hardware page
- the parent's registered address (on which the driver did get_user_pages)
  getting remapped to another page by COW
One can imagine other uses, e.g. combined with mlock for real-time or security.
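As a rough illustration of the flag translation this patch extends, the sketch below models calc_vm_prot_bits() in userspace. The K_-prefixed constants and the model_ function names are made up for this example. PROT_DONTCOPY uses the value proposed in the patch; the VM_* values are assumed from 2.6.12-era include/linux/mm.h. The helper follows the shape of the kernel's _calc_vm_trans() macro rather than copying it.

```c
#include <stdio.h>

/* Hypothetical userspace copies of the kernel constants (names invented). */
#define K_PROT_READ      0x1UL
#define K_PROT_WRITE     0x2UL
#define K_PROT_EXEC      0x4UL
#define K_PROT_DONTCOPY  0x04000000UL  /* value proposed by the patch */

#define K_VM_READ        0x00000001UL
#define K_VM_WRITE       0x00000002UL
#define K_VM_EXEC        0x00000004UL
#define K_VM_DONTCOPY    0x00020000UL  /* assumed 2.6.12 value */

/* Move the single bit bit1 of x to the position of the single bit bit2. */
static unsigned long model_calc_vm_trans(unsigned long x,
                                         unsigned long bit1,
                                         unsigned long bit2)
{
    return bit1 <= bit2 ? (x & bit1) * (bit2 / bit1)
                        : (x & bit1) / (bit1 / bit2);
}

/* Model of calc_vm_prot_bits() with the extra translation from the patch. */
static unsigned long model_calc_vm_prot_bits(unsigned long prot)
{
    return model_calc_vm_trans(prot, K_PROT_READ,     K_VM_READ)  |
           model_calc_vm_trans(prot, K_PROT_WRITE,    K_VM_WRITE) |
           model_calc_vm_trans(prot, K_PROT_EXEC,     K_VM_EXEC)  |
           model_calc_vm_trans(prot, K_PROT_DONTCOPY, K_VM_DONTCOPY);
}

/* Print one translation so the bit movement is visible. */
static void model_show(unsigned long prot)
{
    printf("prot 0x%08lx -> vm_flags 0x%08lx\n",
           prot, model_calc_vm_prot_bits(prot));
}
```

Under these assumed values, model_show(K_PROT_READ | K_PROT_WRITE | K_PROT_DONTCOPY) reports vm_flags 0x00020003: the two low access bits pass through unchanged and bit 26 of prot lands on bit 17 of vm_flags.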
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Index: linux-2.6.12.2/include/asm-ppc64/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-ppc64/mman.h
+++ linux-2.6.12.2/include/asm-ppc64/mman.h
@@ -15,6 +15,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-cris/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-cris/mman.h
+++ linux-2.6.12.2/include/asm-cris/mman.h
@@ -10,6 +10,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-arm26/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-arm26/mman.h
+++ linux-2.6.12.2/include/asm-arm26/mman.h
@@ -8,6 +8,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-alpha/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-alpha/mman.h
+++ linux-2.6.12.2/include/asm-alpha/mman.h
@@ -8,6 +8,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-m68k/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-m68k/mman.h
+++ linux-2.6.12.2/include/asm-m68k/mman.h
@@ -8,6 +8,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-mips/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-mips/mman.h
+++ linux-2.6.12.2/include/asm-mips/mman.h
@@ -22,6 +22,7 @@
#define PROT_SEM 0x10 /* page may be used for atomic ops */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
/*
* Flags for mmap
Index: linux-2.6.12.2/include/asm-sparc64/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-sparc64/mman.h
+++ linux-2.6.12.2/include/asm-sparc64/mman.h
@@ -11,6 +11,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-v850/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-v850/mman.h
+++ linux-2.6.12.2/include/asm-v850/mman.h
@@ -7,6 +7,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-s390/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-s390/mman.h
+++ linux-2.6.12.2/include/asm-s390/mman.h
@@ -16,6 +16,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-parisc/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-parisc/mman.h
+++ linux-2.6.12.2/include/asm-parisc/mman.h
@@ -8,6 +8,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-ppc/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-ppc/mman.h
+++ linux-2.6.12.2/include/asm-ppc/mman.h
@@ -8,6 +8,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-i386/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-i386/mman.h
+++ linux-2.6.12.2/include/asm-i386/mman.h
@@ -8,6 +8,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-sh/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-sh/mman.h
+++ linux-2.6.12.2/include/asm-sh/mman.h
@@ -8,6 +8,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-x86_64/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-x86_64/mman.h
+++ linux-2.6.12.2/include/asm-x86_64/mman.h
@@ -8,6 +8,7 @@
#define PROT_SEM 0x8
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-ia64/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-ia64/mman.h
+++ linux-2.6.12.2/include/asm-ia64/mman.h
@@ -15,6 +15,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-sparc/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-sparc/mman.h
+++ linux-2.6.12.2/include/asm-sparc/mman.h
@@ -11,6 +11,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-m32r/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-m32r/mman.h
+++ linux-2.6.12.2/include/asm-m32r/mman.h
@@ -10,6 +10,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-frv/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-frv/mman.h
+++ linux-2.6.12.2/include/asm-frv/mman.h
@@ -8,6 +8,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/linux/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/linux/mman.h
+++ linux-2.6.12.2/include/linux/mman.h
@@ -47,9 +47,10 @@ static inline void vm_unacct_memory(long
static inline unsigned long
calc_vm_prot_bits(unsigned long prot)
{
- return _calc_vm_trans(prot, PROT_READ, VM_READ ) |
- _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) |
- _calc_vm_trans(prot, PROT_EXEC, VM_EXEC );
+ return _calc_vm_trans(prot, PROT_READ, VM_READ ) |
+ _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) |
+ _calc_vm_trans(prot, PROT_EXEC, VM_EXEC ) |
+ _calc_vm_trans(prot, PROT_DONTCOPY, VM_DONTCOPY );
}
/*
Index: linux-2.6.12.2/include/asm-h8300/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-h8300/mman.h
+++ linux-2.6.12.2/include/asm-h8300/mman.h
@@ -8,6 +8,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-arm/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-arm/mman.h
+++ linux-2.6.12.2/include/asm-arm/mman.h
@@ -8,6 +8,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/mm/mprotect.c
===================================================================
--- linux-2.6.12.2.orig/mm/mprotect.c
+++ linux-2.6.12.2/mm/mprotect.c
@@ -196,7 +196,7 @@ sys_mprotect(unsigned long start, size_t
end = start + len;
if (end <= start)
return -ENOMEM;
- if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM))
+ if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM | PROT_DONTCOPY))
return -EINVAL;
reqprot = prot;
@@ -246,7 +246,7 @@ sys_mprotect(unsigned long start, size_t
goto out;
}
- newflags = vm_flags | (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));
+ newflags = vm_flags | (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC | VM_DONTCOPY));
if ((newflags & ~(newflags >> 4)) & 0xf) {
error = -EACCES;
Index: linux-2.6.12.2/mm/mmap.c
===================================================================
--- linux-2.6.12.2.orig/mm/mmap.c
+++ linux-2.6.12.2/mm/mmap.c
@@ -792,8 +792,8 @@ struct anon_vma *find_mergeable_anon_vma
* Neither mlock nor madvise tries to remerge at present,
* so leave their flags as obstructing a merge.
*/
- vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
- vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
+ vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC|VM_DONTCOPY);
+ vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_DONTCOPY);
if (near->anon_vma && vma->vm_end == near->vm_start &&
mpol_equal(vma_policy(vma), vma_policy(near)) &&
@@ -814,8 +814,8 @@ try_prev:
if (!near)
goto none;
- vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
- vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
+ vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC|VM_DONTCOPY);
+ vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_DONTCOPY);
if (near->anon_vma && near->vm_end == vma->vm_start &&
mpol_equal(vma_policy(near), vma_policy(vma)) &&
--
MST
* [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support
2005-07-19 16:55 [PATCH] fork support Michael S. Tsirkin
@ 2005-07-25 17:19 ` Michael S. Tsirkin
2005-07-26 12:30 ` Hugh Dickins
0 siblings, 1 reply; 17+ messages in thread
From: Michael S. Tsirkin @ 2005-07-25 17:19 UTC
To: Roland Dreier, openib-general; +Cc: linux-kernel
Hi!
I posted this before but got no comments.
Here it is again, in case the reason was OLS.
---
This patch adds PROT_DONTCOPY to mmap and mprotect, to set VM_DONTCOPY on the vma.
This is needed for infiniband userspace i/o, where we need to protect against
- the child process accessing the parent's hardware page
- the parent's registered address (on which the driver did get_user_pages)
  getting remapped to another page by COW
One can imagine other uses, e.g. combined with mlock for real-time or security.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Index: linux-2.6.12.2/include/asm-ppc64/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-ppc64/mman.h
+++ linux-2.6.12.2/include/asm-ppc64/mman.h
@@ -15,6 +15,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-cris/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-cris/mman.h
+++ linux-2.6.12.2/include/asm-cris/mman.h
@@ -10,6 +10,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-arm26/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-arm26/mman.h
+++ linux-2.6.12.2/include/asm-arm26/mman.h
@@ -8,6 +8,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-alpha/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-alpha/mman.h
+++ linux-2.6.12.2/include/asm-alpha/mman.h
@@ -8,6 +8,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-m68k/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-m68k/mman.h
+++ linux-2.6.12.2/include/asm-m68k/mman.h
@@ -8,6 +8,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-mips/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-mips/mman.h
+++ linux-2.6.12.2/include/asm-mips/mman.h
@@ -22,6 +22,7 @@
#define PROT_SEM 0x10 /* page may be used for atomic ops */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
/*
* Flags for mmap
Index: linux-2.6.12.2/include/asm-sparc64/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-sparc64/mman.h
+++ linux-2.6.12.2/include/asm-sparc64/mman.h
@@ -11,6 +11,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-v850/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-v850/mman.h
+++ linux-2.6.12.2/include/asm-v850/mman.h
@@ -7,6 +7,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-s390/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-s390/mman.h
+++ linux-2.6.12.2/include/asm-s390/mman.h
@@ -16,6 +16,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-parisc/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-parisc/mman.h
+++ linux-2.6.12.2/include/asm-parisc/mman.h
@@ -8,6 +8,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-ppc/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-ppc/mman.h
+++ linux-2.6.12.2/include/asm-ppc/mman.h
@@ -8,6 +8,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-i386/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-i386/mman.h
+++ linux-2.6.12.2/include/asm-i386/mman.h
@@ -8,6 +8,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-sh/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-sh/mman.h
+++ linux-2.6.12.2/include/asm-sh/mman.h
@@ -8,6 +8,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-x86_64/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-x86_64/mman.h
+++ linux-2.6.12.2/include/asm-x86_64/mman.h
@@ -8,6 +8,7 @@
#define PROT_SEM 0x8
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-ia64/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-ia64/mman.h
+++ linux-2.6.12.2/include/asm-ia64/mman.h
@@ -15,6 +15,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-sparc/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-sparc/mman.h
+++ linux-2.6.12.2/include/asm-sparc/mman.h
@@ -11,6 +11,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-m32r/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-m32r/mman.h
+++ linux-2.6.12.2/include/asm-m32r/mman.h
@@ -10,6 +10,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-frv/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-frv/mman.h
+++ linux-2.6.12.2/include/asm-frv/mman.h
@@ -8,6 +8,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/linux/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/linux/mman.h
+++ linux-2.6.12.2/include/linux/mman.h
@@ -47,9 +47,10 @@ static inline void vm_unacct_memory(long
static inline unsigned long
calc_vm_prot_bits(unsigned long prot)
{
- return _calc_vm_trans(prot, PROT_READ, VM_READ ) |
- _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) |
- _calc_vm_trans(prot, PROT_EXEC, VM_EXEC );
+ return _calc_vm_trans(prot, PROT_READ, VM_READ ) |
+ _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) |
+ _calc_vm_trans(prot, PROT_EXEC, VM_EXEC ) |
+ _calc_vm_trans(prot, PROT_DONTCOPY, VM_DONTCOPY );
}
/*
Index: linux-2.6.12.2/include/asm-h8300/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-h8300/mman.h
+++ linux-2.6.12.2/include/asm-h8300/mman.h
@@ -8,6 +8,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/include/asm-arm/mman.h
===================================================================
--- linux-2.6.12.2.orig/include/asm-arm/mman.h
+++ linux-2.6.12.2/include/asm-arm/mman.h
@@ -8,6 +8,7 @@
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+#define PROT_DONTCOPY 0x04000000 /* dont copy to child on fork */
#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
Index: linux-2.6.12.2/mm/mprotect.c
===================================================================
--- linux-2.6.12.2.orig/mm/mprotect.c
+++ linux-2.6.12.2/mm/mprotect.c
@@ -196,7 +196,7 @@ sys_mprotect(unsigned long start, size_t
end = start + len;
if (end <= start)
return -ENOMEM;
- if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM))
+ if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM | PROT_DONTCOPY))
return -EINVAL;
reqprot = prot;
@@ -246,7 +246,7 @@ sys_mprotect(unsigned long start, size_t
goto out;
}
- newflags = vm_flags | (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));
+ newflags = vm_flags | (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC | VM_DONTCOPY));
if ((newflags & ~(newflags >> 4)) & 0xf) {
error = -EACCES;
Index: linux-2.6.12.2/mm/mmap.c
===================================================================
--- linux-2.6.12.2.orig/mm/mmap.c
+++ linux-2.6.12.2/mm/mmap.c
@@ -792,8 +792,8 @@ struct anon_vma *find_mergeable_anon_vma
* Neither mlock nor madvise tries to remerge at present,
* so leave their flags as obstructing a merge.
*/
- vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
- vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
+ vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC|VM_DONTCOPY);
+ vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_DONTCOPY);
if (near->anon_vma && vma->vm_end == near->vm_start &&
mpol_equal(vma_policy(vma), vma_policy(near)) &&
@@ -814,8 +814,8 @@ try_prev:
if (!near)
goto none;
- vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
- vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
+ vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC|VM_DONTCOPY);
+ vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_DONTCOPY);
if (near->anon_vma && near->vm_end == vma->vm_start &&
mpol_equal(vma_policy(near), vma_policy(vma)) &&
--
MST
* Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support
2005-07-25 17:19 ` [PATCH repost] PROT_DONTCOPY: ifiniband uverbs " Michael S. Tsirkin
@ 2005-07-26 12:30 ` Hugh Dickins
2005-07-26 13:35 ` Michael S. Tsirkin
0 siblings, 1 reply; 17+ messages in thread
From: Hugh Dickins @ 2005-07-26 12:30 UTC
To: Michael S. Tsirkin; +Cc: Roland Dreier, openib-general, linux-kernel
On Mon, 25 Jul 2005, Michael S. Tsirkin wrote:
>
> This patch adds PROT_DONTCOPY to mmap and mprotect, to set VM_DONTCOPY on vma.
> This is needed for infiniband userspace i/o, where we need to protect against
> - the child process accessing the parent hardware page
> - the parent registered address (on which the driver did get_user_pages)
> getting remapped to another page by COW
> One can imagine other uses, e.g. combined with mlock for real-time or security.
I don't much like it, but it does solve a real problem in an efficient way.
Partly I don't like it because of "PROT_DONTCOPY" itself: I'm queasy
about protection flags which are not protection flags, though I find
you're not the first to go down that road.
Is the patch tested? I've not tried, but suspect the newflags shift
and mask won't work for it. And I don't look forward to your adding
VM_MAYDONTCOPY - ugh!
> @@ -246,7 +246,7 @@ sys_mprotect(unsigned long start, size_t
> goto out;
> }
>
> - newflags = vm_flags | (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));
> + newflags = vm_flags | (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC | VM_DONTCOPY));
>
> if ((newflags & ~(newflags >> 4)) & 0xf) {
> error = -EACCES;
I rather think it would all be more cleanly handled by dropping the mmap
and mprotect changes, adding an madvise instead. Though you may object
that madvise is for optional behaviours, and this should be mandatory.
The other reason I dislike the patch is that the problem it fixes is
an old one, and I'd much rather have get_user_pages fix it for itself,
than ask the developer to do some additional magic to get around it.
But I've failed to work out a simple efficient alternative, which won't
burden the vast majority of get_user_pages usages which never hit the
issue. So your way is probably appropriate, but I'd prefer madvise.
(Sorry, I won't be able to discuss further for a couple of days.)
Hugh
* Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support
2005-07-26 12:30 ` Hugh Dickins
@ 2005-07-26 13:35 ` Michael S. Tsirkin
2005-08-09 18:13 ` Hugh Dickins
0 siblings, 1 reply; 17+ messages in thread
From: Michael S. Tsirkin @ 2005-07-26 13:35 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Roland Dreier, openib-general, linux-kernel
Hi, Hugh!
Thanks for the comments.
Quoting Hugh Dickins <hugh@veritas.com>:
> Subject: Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support
>
> On Mon, 25 Jul 2005, Michael S. Tsirkin wrote:
> >
> > This patch adds PROT_DONTCOPY to mmap and mprotect, to set VM_DONTCOPY on vma.
> > This is needed for infiniband userspace i/o, where we need to protect against
> > - the child process accessing the parent hardware page
> > - the parent registered address (on which the driver did get_user_pages)
> > getting remapped to another page by COW
> > One can imagine other uses, e.g. combined with mlock for real-time or security.
>
> I don't much like it, but it does solve a real problem in an efficient way.
>
> Partly I don't like it because of "PROT_DONTCOPY" itself: I'm queasy
> about protection flags which are not protection flags, though I find
> you're not the first to go down that road.
Yes. Compare with PROT_GROWSDOWN and such.
> Is the patch tested? I've not tried, but suspect the newflags shift
> and mask won't work for it.
I tested this patch. I didn't test every conceivable combination of
flags, though - what do you mean by "newflags shift and mask"?
> And I don't look forward to your adding
> VM_MAYDONTCOPY - ugh!
We already have VM_DONTCOPY. Why would we need VM_MAYDONTCOPY and what
would it do?
>
>
> > @@ -246,7 +246,7 @@ sys_mprotect(unsigned long start, size_t
> > goto out;
> > }
> >
> > - newflags = vm_flags | (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));
> > + newflags = vm_flags | (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC | VM_DONTCOPY));
> >
> > if ((newflags & ~(newflags >> 4)) & 0xf) {
> > error = -EACCES;
>
> I rather think it would all be more cleanly handled by dropping the mmap
> and mprotect changes,
Well, mmap would be much better off if VM_DONTCOPY is set atomically, since
a process may fork after mmap is called but before madvise.
> adding an madvise instead.
I'm not opposed to this, on principle. But see below.
> Though you may object
> that madvise is for optional behaviours, and this should be mandatory.
What about a new system call?
Or a flag for mprotect that effectively turns it into a new system call?
Something like PROT_EXTENDED?
> The other reason I dislike the patch is that the problem it fixes is
> an old one, and I'd much rather have get_user_pages fix it for itself,
Please note that the problem this attempts to solve is not limited
to pages locked by get_user_pages: in an infiniband userspace initiator,
a hardware page is mapped into process memory and must not be inherited
by a child process, otherwise hardware protection breaks.
> than ask the developer to do some additional magic to get around it.
>
> But I've failed to work out a simple efficient alternative, which won't
> burden the vast majority of get_user_pages usages which never hit the
> issue.
They don't hit it if they hold the mm semaphore, or if they only lock
pages for read.
> So your way is probably appropriate, but I'd prefer madvise.
The difficulty with changing get_user_pages, as I see it, is that
you won't be able to get away with a single DONTCOPY bit - you'll need
a full reference count for each page, no less.
> (Sorry, I won't be able to discuss further for a couple of days.)
>
> Hugh
>
Well, madvise currently can't split/merge VMAs, which is required
for VM_DONTCOPY. And it seems like making madvise do this opens
a whole can of worms.
Hugh, so the patch is likely to be bigger with the madvise approach.
Considering this, and the fact that a full solution has to add
a flag to mmap, anyway, do you still think madvise is really the best way
to do it?
Regarding solving the problem automagically by get_user_pages:
What about a new VM_COPYONFORK flag, to trigger the old Unix
behaviour of copying the vma on fork, and a flag for get_user_pages that sets it?
Only users that don't hold the mm semaphore around
the get_user_pages/put_page operation would use this flag; others
would be unaffected. The flag would stay on until the VMA is destroyed.
MST
--
MST
* Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support
2005-07-26 13:35 ` Michael S. Tsirkin
@ 2005-08-09 18:13 ` Hugh Dickins
2005-08-10 8:30 ` Michael S. Tsirkin
2005-08-10 8:39 ` [openib-general] " Gleb Natapov
0 siblings, 2 replies; 17+ messages in thread
From: Hugh Dickins @ 2005-08-09 18:13 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Roland Dreier, openib-general, linux-kernel
Sorry for my delay in replying...
On Tue, 26 Jul 2005, Michael S. Tsirkin wrote:
> Quoting Hugh Dickins <hugh@veritas.com>:
> > Subject: Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support
> >
> > On Mon, 25 Jul 2005, Michael S. Tsirkin wrote:
> > >
> > > This patch adds PROT_DONTCOPY to mmap and mprotect, to set VM_DONTCOPY on vma.
> > > This is needed for infiniband userspace i/o, where we need to protect against
> > > - the child process accessing the parent hardware page
> > > - the parent registered address (on which the driver did get_user_pages)
> > > getting remapped to another page by COW
> > > One can imagine other uses, e.g. combined with mlock for real-time or security.
> >
> > I don't much like it, but it does solve a real problem in an efficient way.
> >
> > Partly I don't like it because of "PROT_DONTCOPY" itself: I'm queasy
> > about protection flags which are not protection flags, though I find
> > you're not the first to go down that road.
>
> Yes. Compare with PROT_GROWSDOWN and such.
Though if you look deeper into that, you find that PROT_GROWSDOWN and
PROT_GROWSUP are all about determining the start or end of the range
when it's the stack: nothing to do with the protection flags set.
Which inclines me the more against using mprotect to set VM_DONTCOPY.
> > Is the patch tested? I've not tried, but suspect the newflags shift
> > and mask won't work for it.
>
> I tested this patch. I didnt test all thinkable configurations of
> flags though - what do you mean by "newflags shift and mask"?
My error. See further down where the code is shown.
> > And I don't look forward to your adding
> > VM_MAYDONTCOPY - ugh!
>
> We already have VM_DONTCOPY. Why would we need VM_MAYDONTCOPY and what
> would it do?
>
> > > @@ -246,7 +246,7 @@ sys_mprotect(unsigned long start, size_t
> > > goto out;
> > > }
> > >
> > > - newflags = vm_flags | (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));
> > > + newflags = vm_flags | (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC | VM_DONTCOPY));
> > >
> > > if ((newflags & ~(newflags >> 4)) & 0xf) {
> > > error = -EACCES;
That newflags shift and mask is checking VM_READ,VM_WRITE,VM_EXEC against
VM_MAYREAD,VM_MAYWRITE,VM_MAYEXEC (the same bits shifted up/down 4).
It's checking, for example, whether the caller actually has permission
to mprotect the mapping to make writes to the file.
But I was reading it wrongly, sorry: I thought you were going to need a
VM_MAYDONTCOPY bit in order to give permission for you to mprotect the
mapping to VM_DONTCOPY. No, it's only checking the bottom four bits
(including VM_SHARED against VM_MAYSHARE, but that's never changed).
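To make the check above concrete, here is a minimal userspace sketch of it.
The bit values are written out from include/linux/mm.h of this kernel series
as I remember them, so treat them as assumptions, and mprotect_flags_allowed
is a hypothetical helper name, not a kernel function.

```c
#include <stdbool.h>

/* Assumed bit layout (2.6.12-era include/linux/mm.h); illustration only. */
#define VM_READ      0x00000001UL
#define VM_WRITE     0x00000002UL
#define VM_EXEC      0x00000004UL
#define VM_SHARED    0x00000008UL
#define VM_MAYREAD   0x00000010UL
#define VM_MAYWRITE  0x00000020UL
#define VM_MAYEXEC   0x00000040UL
#define VM_MAYSHARE  0x00000080UL
#define VM_DONTCOPY  0x00020000UL

/*
 * Hypothetical helper mirroring the sys_mprotect test: shifting right by 4
 * lines each VM_MAY* bit up with its VM_* counterpart, so any low bit that
 * survives the mask is an access right requested without permission.
 */
bool mprotect_flags_allowed(unsigned long newflags)
{
    return ((newflags & ~(newflags >> 4)) & 0xf) == 0;
}
```

Since VM_DONTCOPY lies above the low four bits, setting it can never trip
the -EACCES test, which is why no VM_MAYDONTCOPY is needed.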
> > I rather think it would all be more cleanly handled by dropping the mmap
> > and mprotect changes,
>
> Well, mmap would be much better off if VM_DONTCOPY is set atomically, since
> a process may fork after mmap is called but before madvise.
But it doesn't matter if the process does fork after mmap before madvise.
It only starts to matter when you do get_user_pages (for writing): that
will break COW on the private pages made readonly by a preceding fork;
your problem arises when a fork occurs after that and makes them readonly again.
> > adding an madvise instead.
>
> I'm not opposed to this, on principle. But see below.
>
> > Though you may object
> > that madvise is for optional behaviours, and this should be mandatory.
>
> What about a new system call?
> Or a flag for mprotect that effectively turns it into a new system call?
> Something like PROT_EXTENDED?
PROT_DONTCOPY seems quite enough to signal the extension,
if we were to go the mprotect route.
> > The other reason I dislike the patch is that the problem it fixes is
> > an old one, and I'd much rather have get_user_pages fix it for itself,
>
> Please note that the problem this attempts to solve is not limited
> to pages locked by get_user_pages: in an infiniband userspace initiator,
> a hardware page is mapped into process memory and must not be inherited
> by a child processes, otherwise hardware protection breaks.
Interesting.
But (correct me if I'm wrong, I know nothing about InfiniBand userspace
initiators) that would be done by a driver, which can set VM_DONTCOPY
on the vma, without us having to extend the mprotect or madvise API.
> > than ask the developer to do some additional magic to get around it.
> >
> > But I've failed to work out a simple efficient alternative, which won't
> > burden the vast majority of get_user_pages usages which never hit the
> > issue.
>
> They dont hit it if they keep the mm semaphore, or if they only lock
> pages for read.
I think the usual case is simply that userspace does not touch those
pages while they are pinned by get_user_pages, and/or it does not fork.
But we have occasionally got bitten by the issue.
> > So your way is probably appropriate, but I'd prefer madvise.
>
> The difficulty with changing get_user_pages, as I see it, is that
> you wont be able to get away with a single DONTCOPY bit - you'll need
> a full reference count for each page, no less.
Quite possibly: I only thought it through far enough to conclude that
your proposal has the great merit of simplicity in comparison,
despite its dubious interface.
> > (Sorry, I won't be able to discuss further for a couple of days.)
Please correct that to weeks ;)
> Well, madvise currently cant break/merge VMAs, which is required
> for VM_DONTCOPY. And it seems like making madvise do this opens
> a whole cans of worms.
madvise has been splitting vmas forever, and was enhanced to remerge
them in 2.6.13-rc.
> Hugh, so the patch is likely to be bigger in the madvise approach.
> Considering this, and the fact that a full solution has to add
> a flag to mmap, anyway, do you still think madvise is really the best way
> to do it?
Has to add a flag to mmap? I didn't buy your "atomic" argument above,
did I miss something?
I still prefer madvise to mprotect for this, but admit neither is
entirely clean, would rather let someone else decide between them.
Even more I'd prefer one of these two solutions below, which sidestep
that uncleanliness - but both of these would be in mmap only, no clean
way to change afterwards (except by munmap or mmap MAP_FIXED):
1. Use the standard mmap(NULL, len, PROT_READ|PROT_WRITE,
MAP_SHARED|MAP_ANONYMOUS, -1, 0) which gives you a memory object
shared with children, so write-protection and COW won't come into it.
or if there's good reason why that's no good,
2. Define a MAP_DONTCOPY to mmap: we have a fine tradition of MAP_flags
to achieve this or that effect, adding one more would be cleaner than
now corrupting mprotect or madvise.
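Option 1 can be tried from userspace as things stand. The sketch below is
only an illustration (shared_anon_survives_fork is a made-up name, and the
single-page buffer is an arbitrary choice): it maps a shared anonymous page,
forks, and checks that a store made by the child lands in the page the
parent still sees.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a value stored by the child is visible to the parent through
 * a MAP_SHARED|MAP_ANONYMOUS mapping, 0 if it is not, -1 on error.
 */
int shared_anon_survives_fork(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    volatile int *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    buf[0] = 1;
    pid_t pid = fork();
    if (pid < 0) {
        munmap((void *)buf, len);
        return -1;
    }
    if (pid == 0) {
        buf[0] = 42;    /* shared mapping: no write-protect, no COW copy */
        _exit(0);
    }

    int status;
    int ok = waitpid(pid, &status, 0) == pid && buf[0] == 42;
    munmap((void *)buf, len);
    return ok ? 1 : 0;
}
```

The price is that the child can now scribble on the parent's buffer, so this
only suits callers who are happy for the memory to be shared rather than
private.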
> Regarding solving the problem automagically by get_user_pages:
>
> What about a new VM_COPYONFORK flag, to trigger the old unix
> behaviour of copying the vma on fork and a flag for get_user_pages
> that sets it? Only users that dont keep the mm semaphore around
> the get_user_pages/put_page operation would use this flag, others
> would be unaffected. The flag will stay on until the VMA is destroyed.
(I don't understand why you propose a new flag for the usual behaviour,
but that's just a matter of which way round it's defined, not important.)
Splitting a vma from within get_user_pages is not straightforward,
we need down_write(&mm->mmap_sem) for a start; I think we'd all prefer
to avoid that if we can - as I said, your proposal is rather simpler.
Coincidentally, Linus has drawn my attention in the last week to some
uses of get_user_pages which are behaving in a way which I believe
is currently mishandled, and may need splitting the vma. But I don't
think you should wait around for however we decide to fix that issue.
Hugh
* Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support
2005-08-09 18:13 ` Hugh Dickins
@ 2005-08-10 8:30 ` Michael S. Tsirkin
2005-08-10 8:39 ` [openib-general] " Gleb Natapov
1 sibling, 0 replies; 17+ messages in thread
From: Michael S. Tsirkin @ 2005-08-10 8:30 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Roland Dreier, openib-general, linux-kernel
Quoting r. Hugh Dickins <hugh@veritas.com>:
> > > The other reason I dislike the patch is that the problem it fixes is
> > > an old one, and I'd much rather have get_user_pages fix it for itself,
> >
> > Please note that the problem this attempts to solve is not limited
> > to pages locked by get_user_pages: in an infiniband userspace initiator,
> > a hardware page is mapped into process memory and must not be inherited
> > by a child processes, otherwise hardware protection breaks.
>
> Interesting.
>
> But (correct me if I'm wrong, I know nothing about InfiniBand userspace
> initiators) that would be done by a driver, which can set VM_DONTCOPY
> on the vma, without us having to extend the mprotect or madvise API
Roland, Hugh here proposes setting VM_DONTCOPY on user-mapped PIO
memory from the driver on mmap, to protect against a child process
corrupting the parent's user access region.
IIRC, we used to set this bit, but it was removed later - could you please
clarify why? Do you think it's a good idea to restore this behaviour?
--
MST
* Re: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support
2005-08-09 18:13 ` Hugh Dickins
2005-08-10 8:30 ` Michael S. Tsirkin
@ 2005-08-10 8:39 ` Gleb Natapov
2005-08-10 13:22 ` Hugh Dickins
1 sibling, 1 reply; 17+ messages in thread
From: Gleb Natapov @ 2005-08-10 8:39 UTC (permalink / raw)
To: Hugh Dickins
Cc: Michael S. Tsirkin, Roland Dreier, linux-kernel, openib-general
On Tue, Aug 09, 2005 at 07:13:33PM +0100, Hugh Dickins wrote:
> Even more I'd prefer one of these two solutions below, which sidestep
> that uncleanliness - but both of these would be in mmap only, no clean
> way to change afterwards (except by munmap or mmap MAP_FIXED):
>
> 1. Use the standard mmap(NULL, len, PROT_READ|PROT_WRITE,
> MAP_SHARED|MAP_ANONYMOUS, -1, 0) which gives you a memory object
> shared with children, so write-protection and COW won't come into it.
>
> or if there's good reason why that's no good,
>
> 2. Define a MAP_DONTCOPY to mmap: we have a fine tradition of MAP_flags
> to achieve this or that effect, adding one more would be cleaner than
> now corrupting mprotect or madvise.
>
Both rely on the way the user allocates memory for RDMA. The idea behind
Michael's proposal is to let a library (MPI, for instance) tell the
kernel that the pages are used for RDMA and that it is not safe to copy them now.
The pages may be anywhere in the process address space: bss, text, stack,
whatever.
--
Gleb.
* Re: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support
2005-08-10 8:39 ` [openib-general] " Gleb Natapov
@ 2005-08-10 13:22 ` Hugh Dickins
2005-08-10 13:26 ` Gleb Natapov
0 siblings, 1 reply; 17+ messages in thread
From: Hugh Dickins @ 2005-08-10 13:22 UTC (permalink / raw)
To: Gleb Natapov
Cc: Michael S. Tsirkin, Roland Dreier, linux-kernel, openib-general
On Wed, 10 Aug 2005, Gleb Natapov wrote:
> On Tue, Aug 09, 2005 at 07:13:33PM +0100, Hugh Dickins wrote:
> > Even more I'd prefer one of these two solutions below, which sidestep
> > that uncleanliness - but both of these would be in mmap only, no clean
> > way to change afterwards (except by munmap or mmap MAP_FIXED):
> >
> > 1. Use the standard mmap(NULL, len, PROT_READ|PROT_WRITE,
> > MAP_SHARED|MAP_ANONYMOUS, -1, 0) which gives you a memory object
> > shared with children, so write-protection and COW won't come into it.
> >
> > or if there's good reason why that's no good,
> >
> > 2. Define a MAP_DONTCOPY to mmap: we have a fine tradition of MAP_flags
> > to achieve this or that effect, adding one more would be cleaner than
> > now corrupting mprotect or madvise.
>
> They are both relying on the way user allocates memory for RDMA. The idea
> behind Michael's propose it to let library (MPI for instance) to tell to the
> kernel that the pages are used for RDMA and it is not safe to copy them now.
> The pages may be anywhere in the process address space bss, text, stack
> whatever.
That's a nice aim, but I don't think it can quite be done in the face of
the fork issue - one way or another, we have to change the behaviour of a
forked RDMA area slightly, which might interfere with common assumptions.
Your stack example is a good one: if we end up setting VM_DONTCOPY on
the user stack, then I don't think fork's child will get very far without
hitting a SIGSEGV.
Hugh
* Re: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support
2005-08-10 13:22 ` Hugh Dickins
@ 2005-08-10 13:26 ` Gleb Natapov
2005-08-10 15:27 ` Hugh Dickins
0 siblings, 1 reply; 17+ messages in thread
From: Gleb Natapov @ 2005-08-10 13:26 UTC (permalink / raw)
To: Hugh Dickins
Cc: Michael S. Tsirkin, Roland Dreier, linux-kernel, openib-general
On Wed, Aug 10, 2005 at 02:22:40PM +0100, Hugh Dickins wrote:
> On Wed, 10 Aug 2005, Gleb Natapov wrote:
> > On Tue, Aug 09, 2005 at 07:13:33PM +0100, Hugh Dickins wrote:
> > > Even more I'd prefer one of these two solutions below, which sidestep
> > > that uncleanliness - but both of these would be in mmap only, no clean
> > > way to change afterwards (except by munmap or mmap MAP_FIXED):
> > >
> > > 1. Use the standard mmap(NULL, len, PROT_READ|PROT_WRITE,
> > > MAP_SHARED|MAP_ANONYMOUS, -1, 0) which gives you a memory object
> > > shared with children, so write-protection and COW won't come into it.
> > >
> > > or if there's good reason why that's no good,
> > >
> > > 2. Define a MAP_DONTCOPY to mmap: we have a fine tradition of MAP_flags
> > > to achieve this or that effect, adding one more would be cleaner than
> > > now corrupting mprotect or madvise.
> >
> > They are both relying on the way user allocates memory for RDMA. The idea
> > behind Michael's propose it to let library (MPI for instance) to tell to the
> > kernel that the pages are used for RDMA and it is not safe to copy them now.
> > The pages may be anywhere in the process address space bss, text, stack
> > whatever.
>
> That's a nice aim, but I don't think it can quite be done in the face of
> the fork issue - one way or another, we have to change the behaviour of a
> forked RDMA area slightly, which might interfere with common assumptions.
>
> Your stack example is a good one: if we end up setting VM_DONTCOPY on
> the user stack, then I don't think fork's child will get very far without
> hitting a SIGSEGV.
I know, but I prefer a SIGSEGV in the child to silent data corruption. In most
cases the child will exec immediately after fork, so there is no problem in
that case.
--
Gleb.
* Re: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support
2005-08-10 13:26 ` Gleb Natapov
@ 2005-08-10 15:27 ` Hugh Dickins
2005-08-11 8:02 ` Gleb Natapov
0 siblings, 1 reply; 17+ messages in thread
From: Hugh Dickins @ 2005-08-10 15:27 UTC (permalink / raw)
To: Gleb Natapov
Cc: Michael S. Tsirkin, Roland Dreier, linux-kernel, openib-general
On Wed, 10 Aug 2005, Gleb Natapov wrote:
> On Wed, Aug 10, 2005 at 02:22:40PM +0100, Hugh Dickins wrote:
> >
> > Your stack example is a good one: if we end up setting VM_DONTCOPY on
> > the user stack, then I don't think fork's child will get very far without
> > hitting a SIGSEGV.
>
> I know, but I prefer child SIGSEGV than silent data corruption.
Most people will share your preference, but neither is satisfactory.
> In most cases child will exec immediately after fork so no problem
> in this case.
In most(?) cases it won't even be able to exec before the SIGSEGV.
Hugh
* Re: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support
2005-08-10 15:27 ` Hugh Dickins
@ 2005-08-11 8:02 ` Gleb Natapov
2005-08-11 14:04 ` Hugh Dickins
2005-08-15 16:37 ` Bill Jordan
0 siblings, 2 replies; 17+ messages in thread
From: Gleb Natapov @ 2005-08-11 8:02 UTC (permalink / raw)
To: Hugh Dickins
Cc: Michael S. Tsirkin, Roland Dreier, linux-kernel, openib-general
On Wed, Aug 10, 2005 at 04:27:31PM +0100, Hugh Dickins wrote:
> On Wed, 10 Aug 2005, Gleb Natapov wrote:
> > On Wed, Aug 10, 2005 at 02:22:40PM +0100, Hugh Dickins wrote:
> > >
> > > Your stack example is a good one: if we end up setting VM_DONTCOPY on
> > > the user stack, then I don't think fork's child will get very far without
> > > hitting a SIGSEGV.
> >
> > I know, but I prefer child SIGSEGV than silent data corruption.
>
> Most people will share your preference, but neither is satisfactory.
>
What about the idea that was floating around about a new VM flag that would
instruct the kernel to copy pages belonging to the vma on fork instead of
marking them as COW?
> > In most cases child will exec immediately after fork so no problem
> > in this case.
>
> In most(?) cases it won't even be able to exec before the SIGSEGV.
>
If the top of the stack belongs to a page that is not copied, then yes.
--
Gleb.
* Re: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support
2005-08-11 8:02 ` Gleb Natapov
@ 2005-08-11 14:04 ` Hugh Dickins
2005-08-11 14:07 ` Gleb Natapov
2005-08-11 14:11 ` Michael S. Tsirkin
2005-08-15 16:37 ` Bill Jordan
1 sibling, 2 replies; 17+ messages in thread
From: Hugh Dickins @ 2005-08-11 14:04 UTC (permalink / raw)
To: Gleb Natapov
Cc: Michael S. Tsirkin, Roland Dreier, linux-kernel, openib-general
On Thu, 11 Aug 2005, Gleb Natapov wrote:
> What about the idea that was floating around about new VM flag that will
> instruct kernel to copy pages belonging to the vma on fork instead of mark
> them as cow?
It's a pretty good idea, and thanks for reminding us of it.
It suffers from the general difficulty with fixes within get_user_pages,
that we need down_write(&mm->mmap_sem) to split_vma, and even just to
update vm_flags, whereas get_user_pages is entered with down_read.
Really, we'd prefer not to mess with the vma itself in get_user_pages.
Could mark the ptes instead, perhaps, but that gets very architecture-
dependent. A separate array? not so nice if the vma is very large
and the get_user_pages area very small.
I had toyed with leaving the ptes in the parent as writable, made
readonly just in the child; but though that violation could be excused
while get_user_pages is active, it would have to be corrected at the
end, and that gets complicated again.
Hugh
* Re: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support
2005-08-11 14:04 ` Hugh Dickins
@ 2005-08-11 14:07 ` Gleb Natapov
2005-08-11 14:17 ` Hugh Dickins
2005-08-11 14:11 ` Michael S. Tsirkin
1 sibling, 1 reply; 17+ messages in thread
From: Gleb Natapov @ 2005-08-11 14:07 UTC (permalink / raw)
To: Hugh Dickins
Cc: Michael S. Tsirkin, Roland Dreier, linux-kernel, openib-general
On Thu, Aug 11, 2005 at 03:04:29PM +0100, Hugh Dickins wrote:
> On Thu, 11 Aug 2005, Gleb Natapov wrote:
> > What about the idea that was floating around about new VM flag that will
> > instruct kernel to copy pages belonging to the vma on fork instead of mark
> > them as cow?
>
> It's a pretty good idea, and thanks for reminding us of it.
>
> It suffers from the general difficulty with fixes within get_user_pages,
> that we need down_write(&mm->mmap_sem) to split_vma, and even just to
> update vm_flags, whereas get_user_pages is entered with down_read.
>
Why do it from get_user_pages? Let's use the madvise/mprotect interface.
A program can mprotect(VM_COPYONFORK) the address range before registering it.
--
Gleb.
* Re: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support
2005-08-11 14:07 ` Gleb Natapov
@ 2005-08-11 14:17 ` Hugh Dickins
0 siblings, 0 replies; 17+ messages in thread
From: Hugh Dickins @ 2005-08-11 14:17 UTC (permalink / raw)
To: Gleb Natapov
Cc: Michael S. Tsirkin, Roland Dreier, linux-kernel, openib-general
On Thu, 11 Aug 2005, Gleb Natapov wrote:
> On Thu, Aug 11, 2005 at 03:04:29PM +0100, Hugh Dickins wrote:
> > On Thu, 11 Aug 2005, Gleb Natapov wrote:
> > > What about the idea that was floating around about new VM flag that will
> > > instruct kernel to copy pages belonging to the vma on fork instead of mark
> > > them as cow?
> >
> > It's a pretty good idea, and thanks for reminding us of it.
> >
> > It suffers from the general difficulty with fixes within get_user_pages,
> > that we need down_write(&mm->mmap_sem) to split_vma, and even just to
> > update vm_flags, whereas get_user_pages is entered with down_read.
> >
> Why do it form get_user_pages? Lets use madvise/mprotect interface.
> Program can mrpotect(VM_COPYONFORK) address range before registering it.
Perhaps. But then it's more complicated than the VM_DONTCOPY we came from.
It's a good solution to the semantic divergence introduced by VM_DONTCOPY,
but most people seemed unworried by that aspect.
My trouble is that I'm waiting for a magic right solution to appear,
and none has struck me that way so far.
Hugh
* Re: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support
2005-08-11 14:04 ` Hugh Dickins
2005-08-11 14:07 ` Gleb Natapov
@ 2005-08-11 14:11 ` Michael S. Tsirkin
1 sibling, 0 replies; 17+ messages in thread
From: Michael S. Tsirkin @ 2005-08-11 14:11 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Gleb Natapov, Roland Dreier, linux-kernel, openib-general
Quoting r. Hugh Dickins <hugh@veritas.com>:
> Subject: Re: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support
>
> On Thu, 11 Aug 2005, Gleb Natapov wrote:
> > What about the idea that was floating around about new VM flag that will
> > instruct kernel to copy pages belonging to the vma on fork instead of mark
> > them as cow?
>
> It's a pretty good idea, and thanks for reminding us of it.
>
> It suffers from the general difficulty with fixes within get_user_pages,
> that we need down_write(&mm->mmap_sem) to split_vma, and even just to
> update vm_flags, whereas get_user_pages is entered with down_read.
No, the idea is to let the application (or a library that it loads)
change this flag by means of some system call.
Something like MADV_COPYONFORK, in addition to MADV_DONTCOPY.
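For illustration only, here is how such a call could look from the
application side. MADV_DONTCOPY as named above does not exist; the sketch
substitutes MADV_DONTFORK, which later Linux kernels provide in madvise(2)
with the same "do not inherit this range on fork" meaning. The helper name
child_faults_on_dontfork_range is made up, and the single anonymous page
stands in for a registered RDMA buffer.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the child dies with SIGSEGV when it touches a range the
 * parent marked MADV_DONTFORK, 0 if the child could read it, -1 on error.
 */
int child_faults_on_dontfork_range(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    volatile char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    buf[0] = 'x';

    /* Stand-in for the proposed flag: exclude the range from fork. */
    if (madvise((void *)buf, len, MADV_DONTFORK) != 0) {
        munmap((void *)buf, len);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap((void *)buf, len);
        return -1;
    }
    if (pid == 0) {
        char c = buf[0];    /* the range is not mapped in the child */
        _exit(c == 'x' ? 0 : 1);
    }

    int status;
    int result = -1;
    if (waitpid(pid, &status, 0) == pid)
        result = (WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV) ? 1 : 0;
    munmap((void *)buf, len);
    return result;
}
```

The parent keeps its page untouched by COW, and a child that touches the
range gets a fault rather than a stale copy, which is the trade-off being
argued over in this thread.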
--
MST
* Re: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support
2005-08-11 8:02 ` Gleb Natapov
2005-08-11 14:04 ` Hugh Dickins
@ 2005-08-15 16:37 ` Bill Jordan
2005-08-16 7:52 ` Gleb Natapov
1 sibling, 1 reply; 17+ messages in thread
From: Bill Jordan @ 2005-08-15 16:37 UTC (permalink / raw)
To: Gleb Natapov
Cc: Hugh Dickins, Michael S. Tsirkin, Roland Dreier, linux-kernel,
openib-general
On 8/11/05, Gleb Natapov <glebn@voltaire.com> wrote:
> What about the idea that was floating around about new VM flag that will
> instruct kernel to copy pages belonging to the vma on fork instead of mark
> them as cow?
>
I think the big problem with this idea is the huge memory regions that
InfiniBand applications are dealing with. If the application forks (or
uses system()), you are going to copy a huge chunk of data (most
likely swapping since the application memory footprint is probably
already tuned to consume the available physical memory). And the copy
is really for nothing since in most (or at least many) cases the child
is just going to exec anyway.
--
Bill Jordan
* Re: [openib-general] Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support
2005-08-15 16:37 ` Bill Jordan
@ 2005-08-16 7:52 ` Gleb Natapov
0 siblings, 0 replies; 17+ messages in thread
From: Gleb Natapov @ 2005-08-16 7:52 UTC (permalink / raw)
To: Bill Jordan
Cc: Hugh Dickins, Michael S. Tsirkin, Roland Dreier, linux-kernel,
openib-general
On Mon, Aug 15, 2005 at 12:37:50PM -0400, Bill Jordan wrote:
> On 8/11/05, Gleb Natapov <glebn@voltaire.com> wrote:
> > What about the idea that was floating around about new VM flag that will
> > instruct kernel to copy pages belonging to the vma on fork instead of mark
> > them as cow?
> >
>
> I think the big problem with this idea is the huge memory regions that
> InfiniBand applications are dealing with. If the application forks (or
> uses system()), you are going to copy a huge chunk of data (most
> likely swapping since the application memory footprint is probably
> already tuned to consume the available physical memory). And the copy
> is really for nothing since in most (or at least many) cases the child
> is just going to exec anyway.
If the child is going to exec, it can call vfork or clone with the CLONE_VM
flag. glibc system(3) does clone(CLONE_PARENT_SETTID | SIGCHLD); why not
CLONE_VM too? This single change would allow system() to be used from MPI
programs, eliminating a problem many users hit.
If the child isn't going to exec, it should face the music.
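A rough sketch of the vfork route, for anyone who wants to try it in place
of system(): run_via_vfork is a hypothetical helper, not a glibc function,
and it skips the signal handling that system(3) does. The child borrows the
parent's address space until it execs, so no page of the parent is
write-protected or copied.

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Runs path with argv in a vfork()ed child and returns its exit status,
 * or -1 if the child could not be created or did not exit normally.
 */
int run_via_vfork(const char *path, char *const argv[])
{
    pid_t pid = vfork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* After vfork the child may only exec or _exit. */
        execv(path, argv);
        _exit(127);     /* exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, calling it with "/bin/sh" and the argument vector
{"sh", "-c", "exit 7", NULL} should hand back the shell's exit status.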
--
Gleb.