* [patch 0/4] MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness @ 2007-06-28 18:49 Davide Libenzi 2007-06-29 2:57 ` Kyle Moffett 0 siblings, 1 reply; 23+ messages in thread From: Davide Libenzi @ 2007-06-28 18:49 UTC (permalink / raw) To: Linux Kernel Mailing List; +Cc: Rik van Riel, Andy Isaacson I was using oprofile to sample some userspace code I am working on, and I was continuosly noticing clear_page in the top three entries of the oprofile logs. Also, a simple kernel build, in my Dual Opteron with 8GB of RAM, shows clear_page as the first kernel entry, second only to the userspace the cc1 and as. Most of the userspace code uses malloc() (and anonymous mappings) in such a way that the memory returned via kernel->glibc is immediately written soon after. The POSIX malloc() definition itself also, does not require the returned memory to be zeroed (as calloc() does). So I implemented a rather quick hack that introduces a new mmap() flag MAP_NOZERO (only valid for anonymous mappings) and the vma counter-part VM_NOZERO. Also, a new sys_brk2() has been introduced to accept a new flags parameter. A brief description of the patches follows in the next emails. I first hacked Val's ebizzy to accept a new '-N' flag to make use of MAP_NOZERO: http://infohost.nmt.edu/~val/patches/ebizzy.tar.gz http://www.xmailserver.org/ebizzy-nzmmap-0.3.diff On my box, ebizzy performance jumped up from 10% to 15%. The userspace code I am working on (uses malloc() quite heavily), saw a performance jump of around 14%. In both cases, clear_page dropped way down in the oprofile logs. I then coded quick (and rather ugly) hacks for glibc and gcc to make them use the new features (MAP_NOZERO and sys_brk2()): http://www.xmailserver.org/glibc-nzmalloc-tweaks http://www.xmailserver.org/gcc-nozero-hack I then tried a 2.6.22-rc5 kernel build using the newly built glibc and gcc (with and w/out no-zero enabling options/env-vars), and when using the no-zero mode, clear_page went way down in the oprofile logs and build time dropped of about 2.5% to 3%. I did not have time (and will) to tweak as and ld also. These are some test utilities to verify the no-zero behaviour of MAP_NOZERO (and sys_brk2()): http://www.xmailserver.org/nzmmap-test.c http://www.xmailserver.org/nzmalloc-test.c http://www.xmailserver.org/smiffy.c To run nzmalloc-test you need a patched glibc (using glibc-nzmalloc-tweaks). The smiffy one, should be run under a user that has no other processes running and that owns no files on the system, and it verifies that all the pages it gets from the kernel are zeroed (otherwise "Houston, we have a problem ..."). It is running on my system w/out barfing by more than two days. How crazy is that? - Davide ChangeLog: * Version 2 o Reusing _mapcount instead of adding a new field in the page struct o Added a fix for a setuid+exec/ptrace race (Andy spotted) ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch 0/4] MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness 2007-06-28 18:49 [patch 0/4] MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness Davide Libenzi @ 2007-06-29 2:57 ` Kyle Moffett 2007-06-29 3:04 ` Rik van Riel ` (2 more replies) 0 siblings, 3 replies; 23+ messages in thread From: Kyle Moffett @ 2007-06-29 2:57 UTC (permalink / raw) To: Davide Libenzi; +Cc: Linux Kernel Mailing List, Rik van Riel, Andy Isaacson On Jun 28, 2007, at 14:49:24, Davide Libenzi wrote: > I was using oprofile to sample some userspace code I am working on, > and I was continuosly noticing clear_page in the top three > entries of the oprofile logs. > > Also, a simple kernel build, in my Dual Opteron with 8GB of RAM, > shows clear_page as the first kernel entry, second only to the > userspace the cc1 and as. Most of the userspace code uses malloc > () (and anonymous mappings) in such a way that the memory returned > via kernel->glibc is immediately written soon after. The POSIX > malloc() definition itself also, does not require the returned > memory to be zeroed (as calloc() does). > > So I implemented a rather quick hack that introduces a new mmap() > flag MAP_NOZERO (only valid for anonymous mappings) and the vma > counter-part VM_NOZERO. Also, a new sys_brk2() has been introduced > to accept a new flags parameter. A brief description of the > patches follows in the next emails. Hmm, sounds like this would also need a "MAP_NOREUSE" flag of some kind for security sensitive applications. Basically, I wouldn't want my ssh-agent pages holding private SSH keys to be reused by my web browser which then gets exploited :-D. It would also be a massive information leak under SELinux. To fix it properly according to the SELinux model you would need to tag each page with a label immediately after it's freed and then do an access-vector-check against the old page and the new process before allowing reuse. On the other hand, that would probably be at least as expensive as zeroing the page. Cheers, Kyle Moffett ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch 0/4] MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness 2007-06-29 2:57 ` Kyle Moffett @ 2007-06-29 3:04 ` Rik van Riel 2007-06-29 5:09 ` Ulrich Drepper 2007-06-29 5:20 ` Davide Libenzi 2007-06-29 19:39 ` Andy Isaacson 2 siblings, 1 reply; 23+ messages in thread From: Rik van Riel @ 2007-06-29 3:04 UTC (permalink / raw) To: Kyle Moffett; +Cc: Davide Libenzi, Linux Kernel Mailing List, Andy Isaacson Kyle Moffett wrote: > On Jun 28, 2007, at 14:49:24, Davide Libenzi wrote: >> I was using oprofile to sample some userspace code I am working on, >> and I was continuosly noticing clear_page in the top three entries >> of the oprofile logs. >> >> Also, a simple kernel build, in my Dual Opteron with 8GB of RAM, >> shows clear_page as the first kernel entry, second only to the >> userspace the cc1 and as. Most of the userspace code uses malloc() >> (and anonymous mappings) in such a way that the memory returned via >> kernel->glibc is immediately written soon after. The POSIX malloc() >> definition itself also, does not require the returned memory to be >> zeroed (as calloc() does). >> >> So I implemented a rather quick hack that introduces a new mmap() flag >> MAP_NOZERO (only valid for anonymous mappings) and the vma >> counter-part VM_NOZERO. Also, a new sys_brk2() has been introduced to >> accept a new flags parameter. A brief description of the patches >> follows in the next emails. > > Hmm, sounds like this would also need a "MAP_NOREUSE" flag of some kind > for security sensitive applications. That wants MAP_PRIVATE so that the kernel can also decide to not swap these pages out to an unencrypted swap area. -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch 0/4] MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness 2007-06-29 3:04 ` Rik van Riel @ 2007-06-29 5:09 ` Ulrich Drepper 0 siblings, 0 replies; 23+ messages in thread From: Ulrich Drepper @ 2007-06-29 5:09 UTC (permalink / raw) To: Rik van Riel Cc: Kyle Moffett, Davide Libenzi, Linux Kernel Mailing List, Andy Isaacson On 6/28/07, Rik van Riel <riel@redhat.com> wrote: > That wants MAP_PRIVATE so that the kernel can also decide to not > swap these pages out to an unencrypted swap area. That's not what MAP_PRIVATE means. MAP_PRIVATE is the opposite of MAP_SHARED. It's meaningless for anonymous memory (which is what ssh-agent etc would use) and for file-backed data it definitely allows swapping unless you use mlock(). ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch 0/4] MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness 2007-06-29 2:57 ` Kyle Moffett 2007-06-29 3:04 ` Rik van Riel @ 2007-06-29 5:20 ` Davide Libenzi 2007-06-29 19:39 ` Andy Isaacson 2 siblings, 0 replies; 23+ messages in thread From: Davide Libenzi @ 2007-06-29 5:20 UTC (permalink / raw) To: Kyle Moffett; +Cc: Linux Kernel Mailing List, Rik van Riel, Andy Isaacson On Thu, 28 Jun 2007, Kyle Moffett wrote: > On Jun 28, 2007, at 14:49:24, Davide Libenzi wrote: > > I was using oprofile to sample some userspace code I am working on, and I > > was continuosly noticing clear_page in the top three entries of the > > oprofile logs. > > > > Also, a simple kernel build, in my Dual Opteron with 8GB of RAM, shows > > clear_page as the first kernel entry, second only to the userspace the cc1 > > and as. Most of the userspace code uses malloc() (and anonymous mappings) > > in such a way that the memory returned via kernel->glibc is immediately > > written soon after. The POSIX malloc() definition itself also, does not > > require the returned memory to be zeroed (as calloc() does). > > > > So I implemented a rather quick hack that introduces a new mmap() flag > > MAP_NOZERO (only valid for anonymous mappings) and the vma counter-part > > VM_NOZERO. Also, a new sys_brk2() has been introduced to accept a new flags > > parameter. A brief description of the patches follows in the next emails. > > Hmm, sounds like this would also need a "MAP_NOREUSE" flag of some kind for > security sensitive applications. Basically, I wouldn't want my ssh-agent > pages holding private SSH keys to be reused by my web browser which then gets > exploited :-D Well, if your web browser and your ssh session are running under the same user, and your web browser gets hacked, someone is basically logged in in your system under your user. It can ptrace-attach your ssh-agent and take it from there. It can also read your .ssh directory for what it matter and more simply ;) > It would also be a massive information leak under SELinux. To > fix it properly according to the SELinux model you would need to tag each page > with a label immediately after it's freed and then do an access-vector-check > against the old page and the new process before allowing reuse. On the other > hand, that would probably be at least as expensive as zeroing the page. SeLinux could use a simple hook and disable the feature per-task and globally. Just assign an invalid UID to mm->owner_uid and pages will never be used. It could also hook mmap and clear off MAP_NOZERO. - Davide ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch 0/4] MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness 2007-06-29 2:57 ` Kyle Moffett 2007-06-29 3:04 ` Rik van Riel 2007-06-29 5:20 ` Davide Libenzi @ 2007-06-29 19:39 ` Andy Isaacson 2007-06-29 20:12 ` Davide Libenzi 2 siblings, 1 reply; 23+ messages in thread From: Andy Isaacson @ 2007-06-29 19:39 UTC (permalink / raw) To: Kyle Moffett; +Cc: Davide Libenzi, Linux Kernel Mailing List, Rik van Riel On Thu, Jun 28, 2007 at 10:57:00PM -0400, Kyle Moffett wrote: > On Jun 28, 2007, at 14:49:24, Davide Libenzi wrote: > >So I implemented a rather quick hack that introduces a new mmap() > >flag MAP_NOZERO (only valid for anonymous mappings) and the vma > >counter-part VM_NOZERO. Also, a new sys_brk2() has been introduced > >to accept a new flags parameter. A brief description of the > >patches follows in the next emails. > > Hmm, sounds like this would also need a "MAP_NOREUSE" flag of some > kind for security sensitive applications. Basically, I wouldn't want > my ssh-agent pages holding private SSH keys to be reused by my web > browser which then gets exploited :-D. PGP at least (and I think GPG still) did overwrite keys before calling free(), and attempted to use mlock(). Looks like ssh-agent doesn't use mlock -- at least it hasn't in this case: % grep Lck /proc/`pidof ssh-agent`/status VmLck: 0 kB % ulimit -a | grep lock file size (blocks) unlimited core file size (blocks) 0 locked-in-memory size (kb) 32 file locks unlimited Requiring security-sensitive apps to use a new flag to get safe behavior is dangerous. Better to be safe by default and turn on the less-safe-but-faster behavior for the cases that benefit from it. > It would also be a massive > information leak under SELinux. To fix it properly according to the > SELinux model you would need to tag each page with a label > immediately after it's freed and then do an access-vector-check > against the old page and the new process before allowing reuse. On > the other hand, that would probably be at least as expensive as > zeroing the page. I still think that using uid in mm_struct is wrong, and some kind of abstraction is required. I called this "free pool" in <20070628061911.GA16986@hexapodia.org>, but I think that name is misleading -- I am not proposing that this should be part of the management of free pages, but should be a label which abstracts "safe to share freed pages among" groups. Then different SELinux protection domains would simply have different labels. -andy ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch 0/4] MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness 2007-06-29 19:39 ` Andy Isaacson @ 2007-06-29 20:12 ` Davide Libenzi 2007-06-29 23:48 ` Kyle Moffett 0 siblings, 1 reply; 23+ messages in thread From: Davide Libenzi @ 2007-06-29 20:12 UTC (permalink / raw) To: Andy Isaacson; +Cc: Kyle Moffett, Linux Kernel Mailing List, Rik van Riel On Fri, 29 Jun 2007, Andy Isaacson wrote: > On Thu, Jun 28, 2007 at 10:57:00PM -0400, Kyle Moffett wrote: > > On Jun 28, 2007, at 14:49:24, Davide Libenzi wrote: > > >So I implemented a rather quick hack that introduces a new mmap() > > >flag MAP_NOZERO (only valid for anonymous mappings) and the vma > > >counter-part VM_NOZERO. Also, a new sys_brk2() has been introduced > > >to accept a new flags parameter. A brief description of the > > >patches follows in the next emails. > > > > Hmm, sounds like this would also need a "MAP_NOREUSE" flag of some > > kind for security sensitive applications. Basically, I wouldn't want > > my ssh-agent pages holding private SSH keys to be reused by my web > > browser which then gets exploited :-D. > > PGP at least (and I think GPG still) did overwrite keys before calling > free(), and attempted to use mlock(). Looks like ssh-agent doesn't use > mlock -- at least it hasn't in this case: > % grep Lck /proc/`pidof ssh-agent`/status > VmLck: 0 kB > % ulimit -a | grep lock > file size (blocks) unlimited > core file size (blocks) 0 > locked-in-memory size (kb) 32 > file locks unlimited > > Requiring security-sensitive apps to use a new flag to get safe behavior > is dangerous. Better to be safe by default and turn on the > less-safe-but-faster behavior for the cases that benefit from it. Can you better explain what MAP_NOZERO would alter in such case? > I still think that using uid in mm_struct is wrong, and some kind of > abstraction is required. I called this "free pool" in > <20070628061911.GA16986@hexapodia.org>, but I think that name is > misleading -- I am not proposing that this should be part of the > management of free pages, but should be a label which abstracts "safe to > share freed pages among" groups. Then different SELinux protection > domains would simply have different labels. I think I answered this one at least a couple of times, but anyawy. First, that can be whatever cookie we choose. At the moment UID is used because it makes easier a fit into _mapcount. Second, SeLinux will be able to disable the feature on a per-process base, or globally. Anything else? - Davide ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch 0/4] MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness 2007-06-29 20:12 ` Davide Libenzi @ 2007-06-29 23:48 ` Kyle Moffett 2007-06-30 19:03 ` Davide Libenzi 0 siblings, 1 reply; 23+ messages in thread From: Kyle Moffett @ 2007-06-29 23:48 UTC (permalink / raw) To: Davide Libenzi; +Cc: Andy Isaacson, Linux Kernel Mailing List, Rik van Riel On Jun 29, 2007, at 16:12:58, Davide Libenzi wrote: > On Fri, 29 Jun 2007, Andy Isaacson wrote: >> I still think that using uid in mm_struct is wrong, and some kind >> of abstraction is required. I called this "free pool" in >> <20070628061911.GA16986@hexapodia.org>, but I think that name is >> misleading -- I am not proposing that this should be part of the >> management of free pages, but should be a label which abstracts >> "safe to share freed pages among" groups. Then different SELinux >> protection domains would simply have different labels. > > I think I answered this one at least a couple of times, but anyawy. > First, that can be whatever cookie we choose. At the moment UID is > used because it makes easier a fit into _mapcount. Second, SeLinux > will be able to disable the feature on a per-process base, or > globally. > > Anything else? Well I would be very interested in actually being able to use this feature under SELinux, I think that just the underlying "can-I-use- this-page" logic needs modification. Maybe "MAP_REUSABLE"? That would both imply that we can accept reused (IE: nonzeroed) memory *AND* that the current page may be reused (IE: remapped without zeroing), although you could conceivably have one flag for each case. The userspace allocator should be able to (when prompted by MAP_REUSABLE) look in a page "pool" of sorts before falling back to a zeroed page. The pool would be created for a given "key" the first time it unmaps MAP_REUSABLE pages, possibly using the memory freed by said unmap. The real trick is how to define the "key". The default, without LSMs, should be something like the UID. SELinux, on the other hand, would probably want to use some kind of hash of the label as the "key", (and store the label in each pool, as well). That way SELinux could have a simple access-vector check for process:reusepage, as well as an access-vector check and type transition for "freereusablepage". Then a policy could allow most user processes to unconditionally reuse pages (which would end up in the access-vector- cache and therefore be fast), while security-sensitive processes like ssh-agent could neither reuse pages nor have their pages reused, even if they request it. Cheers, Kyle Moffett ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch 0/4] MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness 2007-06-29 23:48 ` Kyle Moffett @ 2007-06-30 19:03 ` Davide Libenzi 2007-06-30 23:46 ` Kyle Moffett 2007-07-02 18:38 ` Andy Isaacson 0 siblings, 2 replies; 23+ messages in thread From: Davide Libenzi @ 2007-06-30 19:03 UTC (permalink / raw) To: Kyle Moffett; +Cc: Andy Isaacson, Linux Kernel Mailing List, Rik van Riel On Fri, 29 Jun 2007, Kyle Moffett wrote: > Well I would be very interested in actually being able to use this feature > under SELinux, I think that just the underlying "can-I-use-this-page" logic > needs modification. Maybe "MAP_REUSABLE"? That would both imply that we can > accept reused (IE: nonzeroed) memory *AND* that the current page may be reused > (IE: remapped without zeroing), although you could conceivably have one flag > for each case. The userspace allocator should be able to (when prompted by > MAP_REUSABLE) look in a page "pool" of sorts before falling back to a zeroed > page. The pool would be created for a given "key" the first time it unmaps > MAP_REUSABLE pages, possibly using the memory freed by said unmap. Hmm, why would you need MAP_REUSABLE? If a page is visible at any time for a given UID, and you have a login under such UID, you can fetch the content of the page at any time (ie, ptrace_attach, gdb, ...). And if you are not under a UID login, you'll never get to see that page. ATM not even the classical "root can see everything" rule is applied. I think the focus should be to find a case where under the currently implemented policy for MAP_NOZERO, MAP_NOZERO represent a loss of security WRT no MAP_NOZERO. I have not been able to find one yet, although Andy found a potential one in the setuid+exec/ptrace race (fixed by a patch that should IMO go in in any case). The more ppl think about breaking it, the better it is. > The real trick is how to define the "key". The default, without LSMs, should > be something like the UID. SELinux, on the other hand, would probably want to > use some kind of hash of the label as the "key", (and store the label in each > pool, as well). That way SELinux could have a simple access-vector check for > process:reusepage, as well as an access-vector check and type transition for > "freereusablepage". Then a policy could allow most user processes to > unconditionally reuse pages (which would end up in the access-vector-cache and > therefore be fast), while security-sensitive processes like ssh-agent could > neither reuse pages nor have their pages reused, even if they request it. It is very easy, assuming a simple unsigned long cookie is enough for SeLinux, to fit in the current MAP_NOZERO. Well, we have to change something in the current struct page _mapcount reuse, but that doable. There is one line to change, that is the line where the cookie is assigned to the mm_struct. From there on, it's all handled the same way. If the hash is any longer than unsigned long, I don't really think is ever gonna fly, being it stored inside a struct page. Also, if you start introducing "keys" whose domain is wider than a single user, then you'll run for sure in some sort of problem. This is why the current code does not even try to go into the "group" policies. IMO all this may have some sense if 1) it is very simple 2) limits code and data structures bloat to very little or nothing 3) stays all the way to the safe side, at the cost of losing some possible valid pages to be recycled. After all, MAP_NOZERO is an hint and not a requirement. I think that the current method (either UID or KEY based) is simple, does not add extra management pools and, *so far*, is not showing security leaks. - Davide ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch 0/4] MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness 2007-06-30 19:03 ` Davide Libenzi @ 2007-06-30 23:46 ` Kyle Moffett 2007-06-30 23:57 ` Davide Libenzi 2007-07-02 18:38 ` Andy Isaacson 1 sibling, 1 reply; 23+ messages in thread From: Kyle Moffett @ 2007-06-30 23:46 UTC (permalink / raw) To: Davide Libenzi; +Cc: Andy Isaacson, Linux Kernel Mailing List, Rik van Riel On Jun 30, 2007, at 15:03:07, Davide Libenzi wrote: > Hmm, why would you need MAP_REUSABLE? If a page is visible at any > time for a given UID, and you have a login under such UID, you can > fetch the content of the page at any time (ie, ptrace_attach, > gdb, ...). Not under SELinux or other LSMs. I suppose those could live without a 15% performance improvement in some workloads, but it would be nice if we could avoid it. Essentially, UID is a really poor way to define process-security-equivalence classes in some systems. If you really want to define such classes then you need to add LSM hooks to manage the equivalence classes. > I think the focus should be to find a case where under the > currently implemented policy for MAP_NOZERO, MAP_NOZERO represent a > loss of security WRT no MAP_NOZERO. Very simple case: SELinux is turned on, an s9 (IE: TOP_SECRET) process calls free(), and an s3 (IE: UNCLASSIFIED) process calls malloc(), getting the data from the TOP_SECRET process. >> The real trick is how to define the "key". The default, without >> LSMs, should be something like the UID. SELinux, on the other >> hand, would probably want to use some kind of hash of the label as >> the "key", (and store the label in each pool, as well). That way >> SELinux could have a simple access-vector check for >> process:reusepage, as well as an access-vector check and type >> transition for "freereusablepage". Then a policy could allow most >> user processes to unconditionally reuse pages (which would end up >> in the access-vector-cache and therefore be fast), while security- >> sensitive processes like ssh-agent could neither reuse pages nor >> have their pages reused, even if they request it. > > It is very easy, assuming a simple unsigned long cookie is enough > for SeLinux, to fit in the current MAP_NOZERO. Well, we have to > change something in the current struct page _mapcount reuse, but > that doable. There is one line to change, that is the line where > the cookie is assigned to the mm_struct. I think if you create the concept of a "process equivalence class" and add an LSM hook for it, then the unsigned long could just store which equivalence class the page is in. The default without LSMs would be to use mutual-ptraceability as the equivalence class (IE: the UID with a proviso for SUID binaries). LSMs should be able to create a process_equivalence_class hook which when called returns an unsigned long identifying the "equivalence class" (IE: pool) into which the page is placed when freed (or ((unsigned long)-1) to forcibly zero the page). When a process requests a maybe-not-zeroed page, the LSM hook would be called again to determine what equivalence class should be used, (or ((unsigned long)-1) for dont- use-any-class). Cheers, Kyle Moffett ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch 0/4] MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness 2007-06-30 23:46 ` Kyle Moffett @ 2007-06-30 23:57 ` Davide Libenzi 2007-07-01 0:21 ` Kyle Moffett 0 siblings, 1 reply; 23+ messages in thread From: Davide Libenzi @ 2007-06-30 23:57 UTC (permalink / raw) To: Kyle Moffett; +Cc: Andy Isaacson, Linux Kernel Mailing List, Rik van Riel On Sat, 30 Jun 2007, Kyle Moffett wrote: > On Jun 30, 2007, at 15:03:07, Davide Libenzi wrote: > > Hmm, why would you need MAP_REUSABLE? If a page is visible at any time for a > > given UID, and you have a login under such UID, you can fetch the content of > > the page at any time (ie, ptrace_attach, gdb, ...). > > Not under SELinux or other LSMs. I suppose those could live without a 15% > performance improvement in some workloads, but it would be nice if we could > avoid it. Essentially, UID is a really poor way to define > process-security-equivalence classes in some systems. If you really want to > define such classes then you need to add LSM hooks to manage the equivalence > classes. > > > I think the focus should be to find a case where under the currently > > implemented policy for MAP_NOZERO, MAP_NOZERO represent a loss of security > > WRT no MAP_NOZERO. > > Very simple case: > SELinux is turned on, an s9 (IE: TOP_SECRET) process calls free(), and an s3 > (IE: UNCLASSIFIED) process calls malloc(), getting the data from the > TOP_SECRET process. Note that you use *s3* and *s9*. Those will be two different context cookies. SeLinux will have its own way to set the cookie in the mm_struct, to *s3* in one case, and to *s9* in the other case. This will make things so that they'll never see each other pages. > > It is very easy, assuming a simple unsigned long cookie is enough for > > SeLinux, to fit in the current MAP_NOZERO. Well, we have to change > > something in the current struct page _mapcount reuse, but that doable. There > > is one line to change, that is the line where the cookie is assigned to the > > mm_struct. > > I think if you create the concept of a "process equivalence class" and add an > LSM hook for it, then the unsigned long could just store which equivalence > class the page is in. The default without LSMs would be to use > mutual-ptraceability as the equivalence class (IE: the UID with a proviso for > SUID binaries). LSMs should be able to create a process_equivalence_class > hook which when called returns an unsigned long identifying the "equivalence > class" (IE: pool) into which the page is placed when freed (or ((unsigned > long)-1) to forcibly zero the page). When a process requests a > maybe-not-zeroed page, the LSM hook would be called again to determine what > equivalence class should be used, (or ((unsigned long)-1) for > dont-use-any-class). I'd rather prefer to not create anything, and to not add anything. SeLinux can set the cookie as it likes, and it can also disable the feature. Since SeLinux is definitely not the common case, I'd prefer to keep the simple unsigned long cookie abstraction, and let them handle however they like it. - Davide ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch 0/4] MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness 2007-06-30 23:57 ` Davide Libenzi @ 2007-07-01 0:21 ` Kyle Moffett 2007-07-01 4:25 ` Davide Libenzi 2007-07-02 19:00 ` Andy Isaacson 0 siblings, 2 replies; 23+ messages in thread From: Kyle Moffett @ 2007-07-01 0:21 UTC (permalink / raw) To: Davide Libenzi; +Cc: Andy Isaacson, Linux Kernel Mailing List, Rik van Riel On Jun 30, 2007, at 19:57:18, Davide Libenzi wrote: > On Sat, 30 Jun 2007, Kyle Moffett wrote: >> Very simple case: SELinux is turned on, an s9 (IE: TOP_SECRET) >> process calls free(), and an s3 (IE: UNCLASSIFIED) process calls >> malloc(), getting the data from the TOP_SECRET process. > > Note that you use *s3* and *s9*. Those will be two different > context cookies. SeLinux will have its own way to set the cookie > in the mm_struct, to *s3* in one case, and to *s9* in the other > case. This will make things so that they'll never see each other > pages. Except s3 and s9 aren't complete cookies. A complete label might be: "system_u:system_r:apache2_t:s3" for an unclassified apache web- server process, or "kmoffett_u:secadmin_r:usershell_t:s9" for me logged in with a top-secret label in my security-administrator role. That's why you'd need to call an LSM hook to get a unique identifier, as the LSM would actually need to allocate identifiers for equivalence classes. Secondly, processes may change labels as they run, so you couldn't just call it once and cache the result, you would need to call it for every freed page (or every re-use of a page). >>> It is very easy, assuming a simple unsigned long cookie is enough >>> for >>> SeLinux, to fit in the current MAP_NOZERO. Well, we have to change >>> something in the current struct page _mapcount reuse, but that >>> doable. There >>> is one line to change, that is the line where the cookie is >>> assigned to the >>> mm_struct. I do think a single unsigned long cookie would work, as long as you have an LSM hook: int process_equivalence_class(struct task_struct *task, unsigned long *result); If it returns 0 then you can use the result as a page cookie, otherwise you can't reuse pages for this process at all. Cheers, Kyle Moffett ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch 0/4] MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness 2007-07-01 0:21 ` Kyle Moffett @ 2007-07-01 4:25 ` Davide Libenzi 2007-07-02 19:00 ` Andy Isaacson 1 sibling, 0 replies; 23+ messages in thread From: Davide Libenzi @ 2007-07-01 4:25 UTC (permalink / raw) To: Kyle Moffett; +Cc: Andy Isaacson, Linux Kernel Mailing List, Rik van Riel On Sat, 30 Jun 2007, Kyle Moffett wrote: > On Jun 30, 2007, at 19:57:18, Davide Libenzi wrote: > > On Sat, 30 Jun 2007, Kyle Moffett wrote: > > > Very simple case: SELinux is turned on, an s9 (IE: TOP_SECRET) process > > > calls free(), and an s3 (IE: UNCLASSIFIED) process calls malloc(), getting > > > the data from the TOP_SECRET process. > > > > Note that you use *s3* and *s9*. Those will be two different context > > cookies. SeLinux will have its own way to set the cookie in the mm_struct, > > to *s3* in one case, and to *s9* in the other case. This will make things so > > that they'll never see each other pages. > > Except s3 and s9 aren't complete cookies. A complete label might be: > "system_u:system_r:apache2_t:s3" for an unclassified apache web-server > process, or "kmoffett_u:secadmin_r:usershell_t:s9" for me logged in with a > top-secret label in my security-administrator role. That's why you'd need to > call an LSM hook to get a unique identifier, as the LSM would actually need to > allocate identifiers for equivalence classes. Secondly, processes may change > labels as they run, so you couldn't just call it once and cache the result, > you would need to call it for every freed page (or every re-use of a page). It does not matter what the UID contain and how it is generated. The focus now should be to verify that different UIDs do not see other UIDs pages. Once that is verified to be true, and if SeLinux actually wants to use it, we'll find a way so that the other 99% of Linux users won't pay the price of the abstraction needed to accomplish what they need for this to work for them. Ok? - Davide ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch 0/4] MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness 2007-07-01 0:21 ` Kyle Moffett 2007-07-01 4:25 ` Davide Libenzi @ 2007-07-02 19:00 ` Andy Isaacson 2007-07-02 19:03 ` Rik van Riel 1 sibling, 1 reply; 23+ messages in thread From: Andy Isaacson @ 2007-07-02 19:00 UTC (permalink / raw) To: Kyle Moffett; +Cc: Davide Libenzi, Linux Kernel Mailing List, Rik van Riel On Sat, Jun 30, 2007 at 08:21:52PM -0400, Kyle Moffett wrote: > That's why you'd need to call an LSM hook to get a unique identifier, > as the LSM would actually need to allocate identifiers for > equivalence classes. Secondly, processes may change labels as they > run, so you couldn't just call it once and cache the result, you > would need to call it for every freed page (or every re-use of a page). Davide's patch adds a owner_uid field to mm_struct. Assuming that turns into a "mm security equivalence class identifier", the LSM can simply update it when a label-change-event occurs. No need to call out to (potentially heavyweight!) LSM code in page allocation critical paths. I'm a bit concerned that tracking the equivalence classes will get expensive. I think you can end up with quadratic explosion in the worst case (every user using every permutation of LSM bits). -andy ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch 0/4] MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness 2007-07-02 19:00 ` Andy Isaacson @ 2007-07-02 19:03 ` Rik van Riel 2007-07-02 19:06 ` Ulrich Drepper 0 siblings, 1 reply; 23+ messages in thread From: Rik van Riel @ 2007-07-02 19:03 UTC (permalink / raw) To: Andy Isaacson; +Cc: Kyle Moffett, Davide Libenzi, Linux Kernel Mailing List Andy Isaacson wrote: > On Sat, Jun 30, 2007 at 08:21:52PM -0400, Kyle Moffett wrote: >> That's why you'd need to call an LSM hook to get a unique identifier, >> as the LSM would actually need to allocate identifiers for >> equivalence classes. Secondly, processes may change labels as they >> run, so you couldn't just call it once and cache the result, you >> would need to call it for every freed page (or every re-use of a page). > > Davide's patch adds a owner_uid field to mm_struct. Assuming that turns > into a "mm security equivalence class identifier", the LSM can simply > update it when a label-change-event occurs. No need to call out to > (potentially heavyweight!) LSM code in page allocation critical paths. > > I'm a bit concerned that tracking the equivalence classes will get > expensive. I think you can end up with quadratic explosion in the worst > case (every user using every permutation of LSM bits). That should not happen. The default SELinux configuration in Fedora (and Debian?) runs a few daemons in their own restricted modes and has most of the system running in unconfined_t, including the majority of user programs. -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch 0/4] MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness 2007-07-02 19:03 ` Rik van Riel @ 2007-07-02 19:06 ` Ulrich Drepper 2007-07-02 22:46 ` Davide Libenzi 0 siblings, 1 reply; 23+ messages in thread From: Ulrich Drepper @ 2007-07-02 19:06 UTC (permalink / raw) To: Rik van Riel Cc: Andy Isaacson, Kyle Moffett, Davide Libenzi, Linux Kernel Mailing List On 7/2/07, Rik van Riel <riel@redhat.com> wrote: > That should not happen. The default SELinux configuration > in Fedora (and Debian?) runs a few daemons in their own > restricted modes and has most of the system running in > unconfined_t, including the majority of user programs. This is the state as of F7. This will change hopefully soon. Programs like firefox run by normal users must be confined, to. Any tests using security must be fast, it's not something which is done only for a few apps. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch 0/4] MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness 2007-07-02 19:06 ` Ulrich Drepper @ 2007-07-02 22:46 ` Davide Libenzi 2007-07-02 22:55 ` Rik van Riel 0 siblings, 1 reply; 23+ messages in thread From: Davide Libenzi @ 2007-07-02 22:46 UTC (permalink / raw) To: Ulrich Drepper Cc: Rik van Riel, Andy Isaacson, Kyle Moffett, Linux Kernel Mailing List On Mon, 2 Jul 2007, Ulrich Drepper wrote: > On 7/2/07, Rik van Riel <riel@redhat.com> wrote: > > That should not happen. The default SELinux configuration > > in Fedora (and Debian?) runs a few daemons in their own > > restricted modes and has most of the system running in > > unconfined_t, including the majority of user programs. > > This is the state as of F7. This will change hopefully soon. > Programs like firefox run by normal users must be confined, to. Any > tests using security must be fast, it's not something which is done > only for a few apps. The strong requirement would be that the cookie is not a bit longer than sizeof(unsigned long). For the "equality" check, this better be "==", although it could be abstracted in a function/macro that SeLinux implements in a different way. But this "equality" check is done a page-fault time, so it better be pretty quick (otherwise if they bloat that path, they probably be better of not using MAP_NOZERO at all). - Davide ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch 0/4] MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness 2007-07-02 22:46 ` Davide Libenzi @ 2007-07-02 22:55 ` Rik van Riel 2007-07-02 23:46 ` Davide Libenzi 2007-07-04 21:53 ` Andy Isaacson 0 siblings, 2 replies; 23+ messages in thread From: Rik van Riel @ 2007-07-02 22:55 UTC (permalink / raw) To: Davide Libenzi Cc: Ulrich Drepper, Andy Isaacson, Kyle Moffett, Linux Kernel Mailing List Davide Libenzi wrote: > On Mon, 2 Jul 2007, Ulrich Drepper wrote: > >> On 7/2/07, Rik van Riel <riel@redhat.com> wrote: >>> That should not happen. The default SELinux configuration >>> in Fedora (and Debian?) runs a few daemons in their own >>> restricted modes and has most of the system running in >>> unconfined_t, including the majority of user programs. >> This is the state as of F7. This will change hopefully soon. >> Programs like firefox run by normal users must be confined, to. Any >> tests using security must be fast, it's not something which is done >> only for a few apps. > > The strong requirement would be that the cookie is not a bit longer than > sizeof(unsigned long). You could easily replace the cookie with a pointer to a free page pool. -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch 0/4] MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness 2007-07-02 22:55 ` Rik van Riel @ 2007-07-02 23:46 ` Davide Libenzi 2007-07-04 21:53 ` Andy Isaacson 1 sibling, 0 replies; 23+ messages in thread From: Davide Libenzi @ 2007-07-02 23:46 UTC (permalink / raw) To: Rik van Riel Cc: Ulrich Drepper, Andy Isaacson, Kyle Moffett, Linux Kernel Mailing List On Mon, 2 Jul 2007, Rik van Riel wrote: > Davide Libenzi wrote: > > On Mon, 2 Jul 2007, Ulrich Drepper wrote: > > > > > On 7/2/07, Rik van Riel <riel@redhat.com> wrote: > > > > That should not happen. The default SELinux configuration > > > > in Fedora (and Debian?) runs a few daemons in their own > > > > restricted modes and has most of the system running in > > > > unconfined_t, including the majority of user programs. > > > This is the state as of F7. This will change hopefully soon. > > > Programs like firefox run by normal users must be confined, to. Any > > > tests using security must be fast, it's not something which is done > > > only for a few apps. > > > > The strong requirement would be that the cookie is not a bit longer than > > sizeof(unsigned long). > > You could easily replace the cookie with a pointer to a free > page pool. You talked about this pool even before, but honestly I don't see how it can help. Pools requires management (mainly flush under pressure), and it wouldn't really do any benefit. Don't we already have enough pools in the allocation path? The pool has to be UID (or cookie/whatever) based [1], so on top of the pool lookup/management code, you'll have another lock to be taken (the pool one). So using the UID (cookie) is simpler, faster and requires no extra code. A simple numeric cookie virtually creates "pools" w/out requiring extra code. Yes, having really separate pool helps the recycle efficency, but I'm not sure that's worth the extra code. [1] An mm_struct based pool would be clearly useless - Davide ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch 0/4] MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness 2007-07-02 22:55 ` Rik van Riel 2007-07-02 23:46 ` Davide Libenzi @ 2007-07-04 21:53 ` Andy Isaacson 2007-07-04 23:42 ` Davide Libenzi 1 sibling, 1 reply; 23+ messages in thread From: Andy Isaacson @ 2007-07-04 21:53 UTC (permalink / raw) To: Davide Libenzi Cc: Rik van Riel, Ulrich Drepper, Kyle Moffett, Linux Kernel Mailing List On Mon, Jul 02, 2007 at 06:55:40PM -0400, Rik van Riel wrote: > You could easily replace the cookie with a pointer to a free > page pool. It just occurred to me that something like this is *required* to get the performance benefit from MAP_NOZERO on a busy system. With Davide's current proposal, if there are N jobs with different "nozero cookies" busy allocating and deallocating pages on a single-NUMA-node system, then there's only a 1/N chance that the page returned by __alloc_pages will have the correct cookie, so (N-1)/N percent of the time MAP_NOZERO will have no positive effect -- 90% of the time for the case of N=10. (Of course on NUMA systems node affinity will probably make the situation a bit better.) So I'm afraid that it sounds like additional complexity would be required so that __alloc_pages could track and preferentially return pages with the correct "nozero cookie". In a mostly unrelated complaint, I note that having a function named "alloc_zeroed_page_vma" which returns a potentially nonzero page is, um, unintuitive. It needs a name which does not claim it returns zeroed pages. I'm no good at names, but perhaps "alloc_available_page_vma"? -andy ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch 0/4] MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness 2007-07-04 21:53 ` Andy Isaacson @ 2007-07-04 23:42 ` Davide Libenzi 0 siblings, 0 replies; 23+ messages in thread From: Davide Libenzi @ 2007-07-04 23:42 UTC (permalink / raw) To: Andy Isaacson Cc: Rik van Riel, Ulrich Drepper, Kyle Moffett, Linux Kernel Mailing List On Wed, 4 Jul 2007, Andy Isaacson wrote: > On Mon, Jul 02, 2007 at 06:55:40PM -0400, Rik van Riel wrote: > > You could easily replace the cookie with a pointer to a free > > page pool. > > It just occurred to me that something like this is *required* to get the > performance benefit from MAP_NOZERO on a busy system. With Davide's > current proposal, if there are N jobs with different "nozero cookies" > busy allocating and deallocating pages on a single-NUMA-node system, > then there's only a 1/N chance that the page returned by __alloc_pages > will have the correct cookie, so (N-1)/N percent of the time MAP_NOZERO > will have no positive effect -- 90% of the time for the case of N=10. That can indeed happen. If you have a system busy with pressure coming from many different users, the recycle efficency goes down proportionally with it. As I said before, I prefer a smaller and less intrusive code that gives good recycle efficency for the mejority of systems, WRT the idea of dedicated pools. But maybe I'm wrong, and Rik or you can show up with a patch implementing pools that does not hurt code, structure sizes and performances. > In a mostly unrelated complaint, I note that having a function named > "alloc_zeroed_page_vma" which returns a potentially nonzero page is, um, > unintuitive. It needs a name which does not claim it returns zeroed > pages. I'm no good at names, but perhaps "alloc_available_page_vma"? Yeah, the name "zeroed" is not perfect, but neither it is "available". The name should give the idea that the page is either from the same ID/cookie or it is zeroed. That naming should be applied back to alloc_zeroed_user_highpage(), __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE, ... - Davide ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch 0/4] MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness 2007-06-30 19:03 ` Davide Libenzi 2007-06-30 23:46 ` Kyle Moffett @ 2007-07-02 18:38 ` Andy Isaacson 2007-07-02 22:38 ` Davide Libenzi 1 sibling, 1 reply; 23+ messages in thread From: Andy Isaacson @ 2007-07-02 18:38 UTC (permalink / raw) To: Davide Libenzi; +Cc: Kyle Moffett, Linux Kernel Mailing List, Rik van Riel On Sat, Jun 30, 2007 at 12:03:07PM -0700, Davide Libenzi wrote: > I think the focus should be to find a case where under the currently > implemented policy for MAP_NOZERO, MAP_NOZERO represent a loss of security > WRT no MAP_NOZERO. I have not been able to find one yet, although Andy > found a potential one in the setuid+exec/ptrace race (fixed by a patch > that should IMO go in in any case). BTW, the ptrace variant of this issue is not a problem -- PTRACE_ATTACH running as newuid gets EPERM when trying to attach at /* here */ below. setuid(newuid); /* here */ exec(...); exit(1); sys_setuid sets current->mm->dumpable = suid_dumpable, so unless the admin asked for it, there is no risk WRT PTRACE_ATTACH. However, this risk vector does need to be considered when implementing MAP_NOZERO. -andy ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch 0/4] MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness 2007-07-02 18:38 ` Andy Isaacson @ 2007-07-02 22:38 ` Davide Libenzi 0 siblings, 0 replies; 23+ messages in thread From: Davide Libenzi @ 2007-07-02 22:38 UTC (permalink / raw) To: Andy Isaacson; +Cc: Kyle Moffett, Linux Kernel Mailing List, Rik van Riel On Mon, 2 Jul 2007, Andy Isaacson wrote: > On Sat, Jun 30, 2007 at 12:03:07PM -0700, Davide Libenzi wrote: > > I think the focus should be to find a case where under the currently > > implemented policy for MAP_NOZERO, MAP_NOZERO represent a loss of security > > WRT no MAP_NOZERO. I have not been able to find one yet, although Andy > > found a potential one in the setuid+exec/ptrace race (fixed by a patch > > that should IMO go in in any case). > > BTW, the ptrace variant of this issue is not a problem -- PTRACE_ATTACH > running as newuid gets EPERM when trying to attach at /* here */ below. > > setuid(newuid); > /* here */ > exec(...); > exit(1); > > sys_setuid sets current->mm->dumpable = suid_dumpable, so unless the > admin asked for it, there is no risk WRT PTRACE_ATTACH. However, this > risk vector does need to be considered when implementing MAP_NOZERO. Yes, I missed that. Ptrace is fine there. The 3 lines patch is still needed for MAP_NOZERO though. - Davide ^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2007-07-04 23:42 UTC | newest] Thread overview: 23+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2007-06-28 18:49 [patch 0/4] MAP_NOZERO v2 - VM_NOZERO/MAP_NOZERO early summer madness Davide Libenzi 2007-06-29 2:57 ` Kyle Moffett 2007-06-29 3:04 ` Rik van Riel 2007-06-29 5:09 ` Ulrich Drepper 2007-06-29 5:20 ` Davide Libenzi 2007-06-29 19:39 ` Andy Isaacson 2007-06-29 20:12 ` Davide Libenzi 2007-06-29 23:48 ` Kyle Moffett 2007-06-30 19:03 ` Davide Libenzi 2007-06-30 23:46 ` Kyle Moffett 2007-06-30 23:57 ` Davide Libenzi 2007-07-01 0:21 ` Kyle Moffett 2007-07-01 4:25 ` Davide Libenzi 2007-07-02 19:00 ` Andy Isaacson 2007-07-02 19:03 ` Rik van Riel 2007-07-02 19:06 ` Ulrich Drepper 2007-07-02 22:46 ` Davide Libenzi 2007-07-02 22:55 ` Rik van Riel 2007-07-02 23:46 ` Davide Libenzi 2007-07-04 21:53 ` Andy Isaacson 2007-07-04 23:42 ` Davide Libenzi 2007-07-02 18:38 ` Andy Isaacson 2007-07-02 22:38 ` Davide Libenzi
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox