* [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments [not found] ` <alpine.LFD.2.00.0901281316450.3123@localhost.localdomain> @ 2009-01-29 20:03 ` Lee Schermerhorn 2009-01-29 20:33 ` Linus Torvalds 0 siblings, 1 reply; 47+ messages in thread From: Lee Schermerhorn @ 2009-01-29 20:03 UTC (permalink / raw) To: linux-kernel Cc: Linus Torvalds, Maksim Yevmenkin, Nick Piggin, Andrew Morton, Greg Kroah-Hartman, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki Against: 2.6.28.2 We see a: BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 IP: [<ffffffff80336805>] __downgrade_write+0x43/0xb4 : Call Trace: [<ffffffff80284dec>] mlock_vma_pages_range+0x53/0xc3 [<ffffffff80287021>] mmap_region+0x389/0x471 [<ffffffff802876b5>] do_mmap_pgoff+0x308/0x36d [<ffffffff802100a8>] sys_mmap+0x8b/0x110 [<ffffffff8020bc8b>] system_call_fastpath+0x16/0x1b when mmap_region() merges two segments of a file mmap()ed with MAP_LOCKED [VM_LOCKED]. This patch provides a simplistic fix, suitable, I hope, for -stable. Pass the merged vma to mlock_vma_pages_range(). Tested with: http://free.linux.hp.com/~lts/Tests/mmap_lock.c I'll attempt a rework of mlock_vma_pages_range(), et al, to eliminate the 'vma' arg for 29-rc?. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> mm/mmap.c | 23 ++++++++++++++--------- 1 file changed, 14 insertions(+), 9 deletions(-) Index: linux-2.6.28.2/mm/mmap.c =================================================================== --- linux-2.6.28.2.orig/mm/mmap.c 2009-01-29 11:34:08.000000000 -0500 +++ linux-2.6.28.2/mm/mmap.c 2009-01-29 12:22:14.000000000 -0500 @@ -1094,7 +1094,7 @@ unsigned long mmap_region(struct file *f int accountable) { struct mm_struct *mm = current->mm; - struct vm_area_struct *vma, *prev; + struct vm_area_struct *vma, *prev, *merged = NULL; int correct_wcount = 0; int error; struct rb_node **rb_link, *rb_parent; @@ -1207,14 +1207,19 @@ munmap_back: if (vma_wants_writenotify(vma)) vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED); - if (file && vma_merge(mm, prev, addr, vma->vm_end, - vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) { - mpol_put(vma_policy(vma)); - kmem_cache_free(vm_area_cachep, vma); - fput(file); - if (vm_flags & VM_EXECUTABLE) - removed_exe_file_vma(mm); - } else { + if (file) { + merged = vma_merge(mm, prev, addr, vma->vm_end, vma->vm_flags, + NULL, file, pgoff, vma_policy(vma)); + if (merged) { + mpol_put(vma_policy(vma)); + kmem_cache_free(vm_area_cachep, vma); + fput(file); + if (vm_flags & VM_EXECUTABLE) + removed_exe_file_vma(mm); + vma = merged; /* for mlock_vma_pages_range() */ + } + } + if (!merged) { vma_link(mm, vma, prev, rb_link, rb_parent); file = vma->vm_file; } ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-29 20:03 ` [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments Lee Schermerhorn @ 2009-01-29 20:33 ` Linus Torvalds 2009-01-29 20:48 ` Linus Torvalds 0 siblings, 1 reply; 47+ messages in thread From: Linus Torvalds @ 2009-01-29 20:33 UTC (permalink / raw) To: Lee Schermerhorn Cc: linux-kernel, Maksim Yevmenkin, Nick Piggin, Andrew Morton, Greg Kroah-Hartman, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki On Thu, 29 Jan 2009, Lee Schermerhorn wrote: > > This patch provides a simplistic fix, suitable, I hope, for -stable. So I've looked a bit at that function, and I wonder why we can't just do the "vma_merge()" earlier - before we allocate that whole vma to begin with. We already do that for anonymous mappings _anyway_, using the exact same "vma_merge()" function. So how about just expanding on that logic, and simplifying everything at the same time? THIS PATCH IS TOTALLY UNTESTED! There may be some reason why we didn't do it this way to begin with, but this woudl actually seem to be the more logical way to fix the problem: by simplifying the whole function to the point where we never try to do that "try to merge late and then fix up the fact that we have to free the vma we allocated earlier" thing at all! Hmm? This would remove a more lines than it adds, and actually makes things more readable. If it works. Anybody see why it wouldn't? Linus --- mm/mmap.c | 30 ++++++++++-------------------- 1 files changed, 10 insertions(+), 20 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 8d95902..3f78ead 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -769,6 +769,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm, if (vm_flags & VM_SPECIAL) return NULL; + /* Anonymous shared mappings are unsharable */ + if ((vm_flags & VM_SHARED) && !file) + return NULL; + if (prev) next = prev->vm_next; else @@ -1134,16 +1138,11 @@ munmap_back: } /* - * Can we just expand an old private anonymous mapping? - * The VM_SHARED test is necessary because shmem_zero_setup - * will create the file object for a shared anonymous map below. + * Can we just expand an old mapping? */ - if (!file && !(vm_flags & VM_SHARED)) { - vma = vma_merge(mm, prev, addr, addr + len, vm_flags, - NULL, NULL, pgoff, NULL); - if (vma) - goto out; - } + vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL); + if (vma) + goto out; /* * Determine the object being mapped and call the appropriate @@ -1206,17 +1205,8 @@ munmap_back: if (vma_wants_writenotify(vma)) vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED); - if (file && vma_merge(mm, prev, addr, vma->vm_end, - vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) { - mpol_put(vma_policy(vma)); - kmem_cache_free(vm_area_cachep, vma); - fput(file); - if (vm_flags & VM_EXECUTABLE) - removed_exe_file_vma(mm); - } else { - vma_link(mm, vma, prev, rb_link, rb_parent); - file = vma->vm_file; - } + vma_link(mm, vma, prev, rb_link, rb_parent); + file = vma->vm_file; /* Once vma denies write, undo our temporary denial count */ if (correct_wcount) ^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-29 20:33 ` Linus Torvalds @ 2009-01-29 20:48 ` Linus Torvalds 2009-01-29 22:32 ` Hugh Dickins ` (2 more replies) 0 siblings, 3 replies; 47+ messages in thread From: Linus Torvalds @ 2009-01-29 20:48 UTC (permalink / raw) To: Lee Schermerhorn Cc: linux-kernel, Maksim Yevmenkin, Nick Piggin, Andrew Morton, Greg Kroah-Hartman, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki On Thu, 29 Jan 2009, Linus Torvalds wrote: > > THIS PATCH IS TOTALLY UNTESTED! Well, it boots. FWIW. I've not really tested anything interesting with it, but any potential breakage is at least not catastrophic and immediate. > diff --git a/mm/mmap.c b/mm/mmap.c > index 8d95902..3f78ead 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -769,6 +769,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm, > if (vm_flags & VM_SPECIAL) > return NULL; > > + /* Anonymous shared mappings are unsharable */ > + if ((vm_flags & VM_SHARED) && !file) > + return NULL; > + .. and I think this part of it is actually unnecessary, because what happens is that a shared anon mapping is turned into a shmem mapping when it is inserted, and that actually ends up allocating a file for it. So the vma->vm_file for anon mappings will not match a NULL file pointer _anyway_, so there's no way it would end up merging. So my patch can be further simplified, I think, to just the following. Even more total lines removed. I still want somebody else to look at and think about it, though. Linus --- mm/mmap.c | 26 ++++++-------------------- 1 files changed, 6 insertions(+), 20 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 8d95902..d3fa10a 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1134,16 +1134,11 @@ munmap_back: } /* - * Can we just expand an old private anonymous mapping? - * The VM_SHARED test is necessary because shmem_zero_setup - * will create the file object for a shared anonymous map below. + * Can we just expand an old mapping? */ - if (!file && !(vm_flags & VM_SHARED)) { - vma = vma_merge(mm, prev, addr, addr + len, vm_flags, - NULL, NULL, pgoff, NULL); - if (vma) - goto out; - } + vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL); + if (vma) + goto out; /* * Determine the object being mapped and call the appropriate @@ -1206,17 +1201,8 @@ munmap_back: if (vma_wants_writenotify(vma)) vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED); - if (file && vma_merge(mm, prev, addr, vma->vm_end, - vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) { - mpol_put(vma_policy(vma)); - kmem_cache_free(vm_area_cachep, vma); - fput(file); - if (vm_flags & VM_EXECUTABLE) - removed_exe_file_vma(mm); - } else { - vma_link(mm, vma, prev, rb_link, rb_parent); - file = vma->vm_file; - } + vma_link(mm, vma, prev, rb_link, rb_parent); + file = vma->vm_file; /* Once vma denies write, undo our temporary denial count */ if (correct_wcount) ^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-29 20:48 ` Linus Torvalds @ 2009-01-29 22:32 ` Hugh Dickins 2009-01-29 23:02 ` Linus Torvalds 2009-01-29 22:47 ` Maksim Yevmenkin 2009-01-30 8:34 ` Peter Zijlstra 2 siblings, 1 reply; 47+ messages in thread From: Hugh Dickins @ 2009-01-29 22:32 UTC (permalink / raw) To: Linus Torvalds Cc: Lee Schermerhorn, linux-kernel, Maksim Yevmenkin, Nick Piggin, Andrew Morton, Greg Kroah-Hartman, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki On Thu, 29 Jan 2009, Linus Torvalds wrote: > On Thu, 29 Jan 2009, Linus Torvalds wrote: > > > > THIS PATCH IS TOTALLY UNTESTED! Sorry, I was away yesterday, and not yet caught up with things today. I certainly agree that it would be much nicer not to allocated a vma in one place, then later throw it away on finding it can be merged: your patch looks a plausible improvement. The problem has always been that file->f_op->mmap(file, vma) on a special file can do various strange things, changing surprising fields of the struct vma passed to it, changing its mergeability. Now it may well be that every driver which does something strange there already sets one of the VM_SPECIAL flags which prevent merging, or can easily be fixed to do so, or otherwise clearly cannot pose a problem. But I need to mull it over and do some checking tomorrow: anything I say tonight, I'd probably change my mind about. Hugh ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-29 22:32 ` Hugh Dickins @ 2009-01-29 23:02 ` Linus Torvalds 2009-01-30 4:43 ` Lee Schermerhorn 0 siblings, 1 reply; 47+ messages in thread From: Linus Torvalds @ 2009-01-29 23:02 UTC (permalink / raw) To: Hugh Dickins Cc: Lee Schermerhorn, linux-kernel, Maksim Yevmenkin, Nick Piggin, Andrew Morton, Greg Kroah-Hartman, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, David S. Miller On Thu, 29 Jan 2009, Hugh Dickins wrote: > > The problem has always been that file->f_op->mmap(file, vma) on a > special file can do various strange things, changing surprising > fields of the struct vma passed to it, changing its mergeability. Ahh, you're right. That's somewhat bogus, but it does explain why we have those two separate merge_vma() calls. And as you also point out: > Now it may well be that every driver which does something strange > there already sets one of the VM_SPECIAL flags which prevent merging, > or can easily be fixed to do so, or otherwise clearly cannot pose a > problem. .. and I think this is the right answer. If a device driver really does something as odd as play games with the offset that would make merging wrong, it had better set some of the VM_SPECIAL bits (presumably VM_IO) in order to never merge at all. I added DaveM to the cc, since he's the one who is credited with the comment saying that several drivers change addr (vm_start). I only really see one, namely bf54x-lq043fb.c, and that one really depends on the whole nommu thing (ie it seems to simply require a particular virtual address, broken as that may be). Btw, changing vm_start is actually very very wrong - it cannot work in general. Why? Because we earlier checked the "overlapping mmap" case with the original vm_start, an a driver that changes its vm_start in its mmap() routine would cause odd issues there. We also pass in "prev" to vma_link, so it could end up in totally the wrong place. So changing vm_start is a sign of a driver bug, nothing less. But there are more drivers who do change vm_pgoff, which is at least valid (if a bit dodgy), and is indeed a mergeability issue. But at least a subset of those do already set VM_IO (eg fbmem.c - I only looked at one, and was happy that it already did that). As background for Davem, here's the patch once more. Linus --- mm/mmap.c | 26 ++++++-------------------- 1 files changed, 6 insertions(+), 20 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 8d95902..d3fa10a 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1134,16 +1134,11 @@ munmap_back: } /* - * Can we just expand an old private anonymous mapping? - * The VM_SHARED test is necessary because shmem_zero_setup - * will create the file object for a shared anonymous map below. + * Can we just expand an old mapping? */ - if (!file && !(vm_flags & VM_SHARED)) { - vma = vma_merge(mm, prev, addr, addr + len, vm_flags, - NULL, NULL, pgoff, NULL); - if (vma) - goto out; - } + vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL); + if (vma) + goto out; /* * Determine the object being mapped and call the appropriate @@ -1206,17 +1201,8 @@ munmap_back: if (vma_wants_writenotify(vma)) vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED); - if (file && vma_merge(mm, prev, addr, vma->vm_end, - vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) { - mpol_put(vma_policy(vma)); - kmem_cache_free(vm_area_cachep, vma); - fput(file); - if (vm_flags & VM_EXECUTABLE) - removed_exe_file_vma(mm); - } else { - vma_link(mm, vma, prev, rb_link, rb_parent); - file = vma->vm_file; - } + vma_link(mm, vma, prev, rb_link, rb_parent); + file = vma->vm_file; /* Once vma denies write, undo our temporary denial count */ if (correct_wcount) ^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-29 23:02 ` Linus Torvalds @ 2009-01-30 4:43 ` Lee Schermerhorn 2009-01-30 4:49 ` Linus Torvalds 0 siblings, 1 reply; 47+ messages in thread From: Lee Schermerhorn @ 2009-01-30 4:43 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, linux-kernel, Maksim Yevmenkin, Nick Piggin, Andrew Morton, Greg Kroah-Hartman, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, David S. Miller On Thu, 2009-01-29 at 15:02 -0800, Linus Torvalds wrote: > > On Thu, 29 Jan 2009, Hugh Dickins wrote: > > > > The problem has always been that file->f_op->mmap(file, vma) on a > > special file can do various strange things, changing surprising > > fields of the struct vma passed to it, changing its mergeability. > > Ahh, you're right. That's somewhat bogus, but it does explain why we have > those two separate merge_vma() calls. > > And as you also point out: > > > Now it may well be that every driver which does something strange > > there already sets one of the VM_SPECIAL flags which prevent merging, > > or can easily be fixed to do so, or otherwise clearly cannot pose a > > problem. > > .. and I think this is the right answer. If a device driver really does > something as odd as play games with the offset that would make merging > wrong, it had better set some of the VM_SPECIAL bits (presumably VM_IO) in > order to never merge at all. Just want to note that install_special_mappings()--used for, e.g., the vdso--sets VM_DONTEXPAND [one of the VM_SPECIAL flags] which, if we want to prevent merging, makes a lot of sense to me. It appears that get_user_pages() will balk at addresses in vmas with VM_IO [and VM_PFNMAP] which might not be what one wants. For example, you can't pre-populate the ptes via make_pages_present() with this flag. Lee ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-30 4:43 ` Lee Schermerhorn @ 2009-01-30 4:49 ` Linus Torvalds 0 siblings, 0 replies; 47+ messages in thread From: Linus Torvalds @ 2009-01-30 4:49 UTC (permalink / raw) To: Lee Schermerhorn Cc: Hugh Dickins, linux-kernel, Maksim Yevmenkin, Nick Piggin, Andrew Morton, Greg Kroah-Hartman, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, David S. Miller On Thu, 29 Jan 2009, Lee Schermerhorn wrote: > > Just want to note that install_special_mappings()--used for, e.g., the > vdso--sets VM_DONTEXPAND [one of the VM_SPECIAL flags] which, if we want > to prevent merging, makes a lot of sense to me. Yes. VM_DONTEXPAND in many ways really fits the "don't merge" thing, because the whole mremap() VM expansion thing is really equivalent to this explicit merge. > It appears that get_user_pages() will balk at addresses in vmas with > VM_IO [and VM_PFNMAP] which might not be what one wants. For example, > you can't pre-populate the ptes via make_pages_present() with this flag. Well, anybody who plays games with vm_pgoff really _should_ be VM_IO. I don't see any real reason for it except for having a very special IO mapping. Of course, you quite possibly _also_ want VM_DONTEXPAND. Linus ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-29 20:48 ` Linus Torvalds 2009-01-29 22:32 ` Hugh Dickins @ 2009-01-29 22:47 ` Maksim Yevmenkin 2009-01-29 22:48 ` Randy Dunlap 2009-01-30 2:08 ` Linus Torvalds 2009-01-30 8:34 ` Peter Zijlstra 2 siblings, 2 replies; 47+ messages in thread From: Maksim Yevmenkin @ 2009-01-29 22:47 UTC (permalink / raw) To: Linus Torvalds Cc: Lee Schermerhorn, linux-kernel, Nick Piggin, Andrew Morton, Greg Kroah-Hartman, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki On Thu, Jan 29, 2009 at 12:48 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, 29 Jan 2009, Linus Torvalds wrote: >> >> THIS PATCH IS TOTALLY UNTESTED! > > Well, it boots. FWIW. I've not really tested anything interesting with it, > but any potential breakage is at least not catastrophic and immediate. > >> diff --git a/mm/mmap.c b/mm/mmap.c >> index 8d95902..3f78ead 100644 >> --- a/mm/mmap.c >> +++ b/mm/mmap.c >> @@ -769,6 +769,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm, >> if (vm_flags & VM_SPECIAL) >> return NULL; >> >> + /* Anonymous shared mappings are unsharable */ >> + if ((vm_flags & VM_SHARED) && !file) >> + return NULL; >> + > > .. and I think this part of it is actually unnecessary, because what > happens is that a shared anon mapping is turned into a shmem mapping when > it is inserted, and that actually ends up allocating a file for it. So the > vma->vm_file for anon mappings will not match a NULL file pointer > _anyway_, so there's no way it would end up merging. > > So my patch can be further simplified, I think, to just the following. > Even more total lines removed. > > I still want somebody else to look at and think about it, though. Just to confirm. This patch also appear to fix the immediate issue for us. thanks, max ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-29 22:47 ` Maksim Yevmenkin @ 2009-01-29 22:48 ` Randy Dunlap 2009-01-29 23:31 ` Maksim Yevmenkin 2009-01-30 2:08 ` Linus Torvalds 1 sibling, 1 reply; 47+ messages in thread From: Randy Dunlap @ 2009-01-29 22:48 UTC (permalink / raw) To: Maksim Yevmenkin Cc: Linus Torvalds, Lee Schermerhorn, linux-kernel, Nick Piggin, Andrew Morton, Greg Kroah-Hartman, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki Maksim Yevmenkin wrote: > On Thu, Jan 29, 2009 at 12:48 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> On Thu, 29 Jan 2009, Linus Torvalds wrote: >>> THIS PATCH IS TOTALLY UNTESTED! >> Well, it boots. FWIW. I've not really tested anything interesting with it, >> but any potential breakage is at least not catastrophic and immediate. >> >>> diff --git a/mm/mmap.c b/mm/mmap.c >>> index 8d95902..3f78ead 100644 >>> --- a/mm/mmap.c >>> +++ b/mm/mmap.c >>> @@ -769,6 +769,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm, >>> if (vm_flags & VM_SPECIAL) >>> return NULL; >>> >>> + /* Anonymous shared mappings are unsharable */ >>> + if ((vm_flags & VM_SHARED) && !file) >>> + return NULL; >>> + >> .. and I think this part of it is actually unnecessary, because what >> happens is that a shared anon mapping is turned into a shmem mapping when >> it is inserted, and that actually ends up allocating a file for it. So the >> vma->vm_file for anon mappings will not match a NULL file pointer >> _anyway_, so there's no way it would end up merging. >> >> So my patch can be further simplified, I think, to just the following. >> Even more total lines removed. >> >> I still want somebody else to look at and think about it, though. > > Just to confirm. This patch also appear to fix the immediate issue for us. Is there a (small) test program available? Thanks, -- ~Randy ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-29 22:48 ` Randy Dunlap @ 2009-01-29 23:31 ` Maksim Yevmenkin 0 siblings, 0 replies; 47+ messages in thread From: Maksim Yevmenkin @ 2009-01-29 23:31 UTC (permalink / raw) To: Randy Dunlap Cc: Linus Torvalds, Lee Schermerhorn, linux-kernel, Nick Piggin, Andrew Morton, Greg Kroah-Hartman, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki On Thu, Jan 29, 2009 at 2:48 PM, Randy Dunlap <randy.dunlap@oracle.com> wrote: > Maksim Yevmenkin wrote: >> On Thu, Jan 29, 2009 at 12:48 PM, Linus Torvalds >> <torvalds@linux-foundation.org> wrote: >>> On Thu, 29 Jan 2009, Linus Torvalds wrote: >>>> THIS PATCH IS TOTALLY UNTESTED! >>> Well, it boots. FWIW. I've not really tested anything interesting with it, >>> but any potential breakage is at least not catastrophic and immediate. >>> >>>> diff --git a/mm/mmap.c b/mm/mmap.c >>>> index 8d95902..3f78ead 100644 >>>> --- a/mm/mmap.c >>>> +++ b/mm/mmap.c >>>> @@ -769,6 +769,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm, >>>> if (vm_flags & VM_SPECIAL) >>>> return NULL; >>>> >>>> + /* Anonymous shared mappings are unsharable */ >>>> + if ((vm_flags & VM_SHARED) && !file) >>>> + return NULL; >>>> + >>> .. and I think this part of it is actually unnecessary, because what >>> happens is that a shared anon mapping is turned into a shmem mapping when >>> it is inserted, and that actually ends up allocating a file for it. So the >>> vma->vm_file for anon mappings will not match a NULL file pointer >>> _anyway_, so there's no way it would end up merging. >>> >>> So my patch can be further simplified, I think, to just the following. >>> Even more total lines removed. >>> >>> I still want somebody else to look at and think about it, though. >> >> Just to confirm. This patch also appear to fix the immediate issue for us. > > Is there a (small) test program available? Yes, it was in the original (first) email. Here it is again /* * Program to provoke kernel NULL pointer de-reference during * mmap(...MAP_LOCKED...) in Linux 2.6.28. * * 1. Create a 32KB test file in /tmp (avoids mlock limit on all recent * Linuxes). * 2. mmap it with MAP_LOCKED from top to bottom. (Provokes the oops, * since vmas can be merged in this case.) * 3. Clean up. * * Compile: * * gcc maplock-bug.c -o maplog-bug */ #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <stdint.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <sys/mman.h> #define SIZE (32*1024) /* Will get rounded down to page size if nec. */ static char tmp[] = "./maplock-bug.XXXXXX"; static char junkbuf[SIZE]; int main(void) { int fd; int ps = getpagesize(); size_t sz = (SIZE / ps) * ps; void **addrs; off_t off; int i; if ((addrs = malloc((sz / ps) * sizeof (*addrs))) == 0) { perror("malloc"); exit(1); } if ((fd = mkstemp(tmp)) < 0) { perror("mkstemp"); exit(1); } if (write(fd, junkbuf, sz) != sz) { perror("write"); exit(1); } if (close(fd) < 0) { perror("close"); exit(1); } if ((fd = open(tmp, O_RDONLY)) < 0) { perror("open"); exit(1); } for (off = sz - ps, i = 0; off >= 0; off -= ps, i++) { if ((addrs[i] = mmap(0, ps, PROT_READ, MAP_SHARED|MAP_LOCKED, fd, off)) == MAP_FAILED) { perror("mmap"); exit(1); } printf("Mapped offset 0x%jx at %p\n", (uintmax_t)off, addrs[i]); } if (close(fd) < 0) { perror("close"); exit(1); } for (i = 0; i < sz / ps; i++) { if (munmap(addrs[i], ps) < 0) { perror("munmap"); exit(1); } printf("Unmapped %p\n", addrs[i]); } if (unlink(tmp) < 0) { perror("unlink"); exit(1); } printf("Done\n"); } Thanks, max ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-29 22:47 ` Maksim Yevmenkin 2009-01-29 22:48 ` Randy Dunlap @ 2009-01-30 2:08 ` Linus Torvalds 2009-01-30 5:56 ` Greg KH 1 sibling, 1 reply; 47+ messages in thread From: Linus Torvalds @ 2009-01-30 2:08 UTC (permalink / raw) To: Maksim Yevmenkin Cc: Lee Schermerhorn, linux-kernel, Nick Piggin, Andrew Morton, Greg Kroah-Hartman, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki On Thu, 29 Jan 2009, Maksim Yevmenkin wrote: > > Just to confirm. This patch also appear to fix the immediate issue for us. Ok, I ended up committing it. I didn't do the whole "Cc: stable@kernel.org" thing, because maybe the non-cleanup version is less controversial for stable, but I decided that there's no way I don't want the cleanup done in mainline kernel, and if we end up needing a few VM_IO's added, I think it's worth it. Even after -rc3 (since I suspend we won't actually need them). Linus ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-30 2:08 ` Linus Torvalds @ 2009-01-30 5:56 ` Greg KH 2009-01-30 16:36 ` Linus Torvalds 0 siblings, 1 reply; 47+ messages in thread From: Greg KH @ 2009-01-30 5:56 UTC (permalink / raw) To: Linus Torvalds Cc: Maksim Yevmenkin, Lee Schermerhorn, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki On Thu, Jan 29, 2009 at 06:08:30PM -0800, Linus Torvalds wrote: > > > On Thu, 29 Jan 2009, Maksim Yevmenkin wrote: > > > > Just to confirm. This patch also appear to fix the immediate issue for us. > > Ok, I ended up committing it. > > I didn't do the whole "Cc: stable@kernel.org" thing, because maybe the > non-cleanup version is less controversial for stable, but I decided that > there's no way I don't want the cleanup done in mainline kernel, and if we > end up needing a few VM_IO's added, I think it's worth it. Even after -rc3 > (since I suspend we won't actually need them). Which version was the "non-cleanup" version that should be added to the stable trees? The commit you did make, de33c8db5910cda599899dd431cc30d7c1018cbf, seemed pretty "tiny" to me. confused, greg k-h ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-30 5:56 ` Greg KH @ 2009-01-30 16:36 ` Linus Torvalds 2009-01-30 17:40 ` Hugh Dickins 0 siblings, 1 reply; 47+ messages in thread From: Linus Torvalds @ 2009-01-30 16:36 UTC (permalink / raw) To: Greg KH Cc: Maksim Yevmenkin, Lee Schermerhorn, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki On Thu, 29 Jan 2009, Greg KH wrote: > > Which version was the "non-cleanup" version that should be added to the > stable trees? There were two different versions: From: Andrew Morton <akpm@linux-foundation.org> Subject: Re: possible bug in mmap_region() in linux-2.6.28 kernel Message-Id: <20090128134350.034ac6a7.akpm@linux-foundation.org> From: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Subject: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments Message-Id: <1233259410.2315.75.camel@lts-notebook> and I'm actually not at all sure which one should go into stable (or if we should just pick the same one that went into mainline). > The commit you did make, de33c8db5910cda599899dd431cc30d7c1018cbf, > seemed pretty "tiny" to me. Oh, yes, in m any ways it's smaller than the trivial patches, since it's really just removing code (well, it adds "file" as a parameter to the first vma_merge, it used to always be NULL since we only did that for the "!file" case). The downside with that commit is simply that there is a (pretty damn small, imho) possibility that somebody will report a regression with it, and we'll have to add some appropriate vma->vm_flags |= VM_IO | VM_DONTEXPAND; to some odd special device driver, to make sure that it cannot trigger the "merge early" case because it needs its ->mmap() function called, rather than just expand the mapping quietly. The reason(s) I don't think it's very likely is that (a) hopefully drivers already explicitly mark their mappings with one of the VM_SPECIAL bits already. Most such drivers would likely want to use "vm_insert_pfn()" (to insert random IO pages into the mapping), and in order to do that they have to add the VM_PFNMAP to vm_flags - we check. That would disable any merging, because that flag won't be set in the new vma before calling ->mmap. (b) even if the drivers don't do that explicit VM_PFNMAP, a driver that does special things at mmap() time - even if it just uses regular pages - probably prepopulates its mmap area with vm_insert_page(). That in turn, will implicitly set VM_INSERTPAGE, and again: the merge logic that verifies that the old vma->vm_flags == vm_flags (in is_mergeable_vma) would disable merging. (c) finally - even if a merge really would be possible (ie vm_flags didn't really get changed by the special ->mmap() function, but the driver still depended on ->mmap() being called for some other reason), I suspect it is really really unlikely that any application would actually request two magic consecutive mmap's to the same special device, with the same permissions and with just the right pgoff values etc. IOW, even if a merge could happen in theory with a flaky driver that doesn't like it, I don't think we'd hit it in practice. So I feel I had pretty good reasons to say "f*ck it, I'm going to fix this properly in mainline with a much prettier patch". But none of the above really changes the fact that the patch I committed to mainline was really quite fundamentally more invasive than either of the "simple" patches. All three patches are small, with mine arguably the smallest of the lot, but mine actually changed semantics, while Andrew's and Lee's patch literally only fix the invalid pointer use. I'll leave it to others to decide which one goes into -stable. I personally don't really think it matters. I argue above that mine is pretty safe and thus perfectly fine even for -stable, but reality has a habit of sometimes disagreeing with me. Dang. Linus ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-30 16:36 ` Linus Torvalds @ 2009-01-30 17:40 ` Hugh Dickins 2009-01-30 18:14 ` Linus Torvalds 2009-01-30 23:44 ` Greg KH 0 siblings, 2 replies; 47+ messages in thread From: Hugh Dickins @ 2009-01-30 17:40 UTC (permalink / raw) To: Linus Torvalds Cc: Greg KH, Maksim Yevmenkin, Lee Schermerhorn, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki On Fri, 30 Jan 2009, Linus Torvalds wrote: > On Thu, 29 Jan 2009, Greg KH wrote: > > > > Which version was the "non-cleanup" version that should be added to the > > stable trees? > > There were two different versions: > > From: Andrew Morton <akpm@linux-foundation.org> > Subject: Re: possible bug in mmap_region() in linux-2.6.28 kernel > Message-Id: <20090128134350.034ac6a7.akpm@linux-foundation.org> > > From: Lee Schermerhorn <Lee.Schermerhorn@hp.com> > Subject: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments > Message-Id: <1233259410.2315.75.camel@lts-notebook> > > and I'm actually not at all sure which one should go into stable (or if we > should just pick the same one that went into mainline). > ... > > But none of the above really changes the fact that the patch I committed > to mainline was really quite fundamentally more invasive than either of > the "simple" patches. All three patches are small, with mine arguably the > smallest of the lot, but mine actually changed semantics, while Andrew's > and Lee's patch literally only fix the invalid pointer use. > > I'll leave it to others to decide which one goes into -stable. I > personally don't really think it matters. I argue above that mine is > pretty safe and thus perfectly fine even for -stable, but reality has a > habit of sometimes disagreeing with me. Dang. I'd say one of the non-cleanup versions for -stable (but I've not compared them to see which one is better). I'm still working my way through all the ->mmap methods to check their safety with regard to yesterday's change (there are about ten times as many as the last time I looked). So far there's only one driver mmap I want to go back and recheck, the vast majority are as good today as they were the day before, but ... ... what I think you have done is break the vma merging on ordinary files: because of that irritating VM_CAN_NONLINEAR flag which generic_file_mmap() and some others add in. To break the merging won't cause anyone much trouble, but is a slight regression we should fix. I'd have been very upset not to find something ;) Hugh ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-30 17:40 ` Hugh Dickins @ 2009-01-30 18:14 ` Linus Torvalds 2009-01-30 18:30 ` Hugh Dickins 2009-01-30 19:53 ` Lee Schermerhorn 2009-01-30 23:44 ` Greg KH 1 sibling, 2 replies; 47+ messages in thread From: Linus Torvalds @ 2009-01-30 18:14 UTC (permalink / raw) To: Hugh Dickins Cc: Greg KH, Maksim Yevmenkin, Lee Schermerhorn, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki On Fri, 30 Jan 2009, Hugh Dickins wrote: > > ... what I think you have done is break the vma merging on > ordinary files: because of that irritating VM_CAN_NONLINEAR > flag which generic_file_mmap() and some others add in. Ahh. Yes. VM_CAN_NONLINEAR is a "reverse flag", ie unlike the other flags it's to some degree about being extra _normal_, not about being odd. Most special flags tend to disable the VM from doing some clever thing, this one enables it. > To break the merging won't cause anyone much trouble, > but is a slight regression we should fix. Yeah. Just masking it off when comparing is probably the simplest option. Make it a separate #define just for readability. There might be other flags like this in the future. > I'd have been very upset not to find something ;) Yay for you ;) And yes, this is the kind of thing that probably does mean that we're better off with the no-semantic-changes patches in -stable. Linus --- mm/mmap.c | 5 ++++- 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index d3fa10a..c581df1 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -658,6 +658,9 @@ again: remove_next = 1 + (end > next->vm_end); validate_mm(mm); } +/* Flags that can be inherited from an existing mapping when merging */ +#define VM_MERGEABLE_FLAGS (VM_CAN_NONLINEAR) + /* * If the vma has a ->close operation then the driver probably needs to release * per-vma resources, so we don't attempt to merge those. @@ -665,7 +668,7 @@ again: remove_next = 1 + (end > next->vm_end); static inline int is_mergeable_vma(struct vm_area_struct *vma, struct file *file, unsigned long vm_flags) { - if (vma->vm_flags != vm_flags) + if ((vma->vm_flags ^ vm_flags) & ~VM_MERGEABLE_FLAGS) return 0; if (vma->vm_file != file) return 0; ^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-30 18:14 ` Linus Torvalds @ 2009-01-30 18:30 ` Hugh Dickins 2009-01-30 19:53 ` Lee Schermerhorn 1 sibling, 0 replies; 47+ messages in thread From: Hugh Dickins @ 2009-01-30 18:30 UTC (permalink / raw) To: Linus Torvalds Cc: Greg KH, Maksim Yevmenkin, Lee Schermerhorn, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki On Fri, 30 Jan 2009, Linus Torvalds wrote: > On Fri, 30 Jan 2009, Hugh Dickins wrote: > > > > ... what I think you have done is break the vma merging on > > ordinary files: because of that irritating VM_CAN_NONLINEAR > > flag which generic_file_mmap() and some others add in. > > Ahh. Yes. VM_CAN_NONLINEAR is a "reverse flag", ie unlike the other flags > it's to some degree about being extra _normal_, not about being odd. Most > special flags tend to disable the VM from doing some clever thing, this > one enables it. > > > To break the merging won't cause anyone much trouble, > > but is a slight regression we should fix. > > Yeah. Just masking it off when comparing is probably the simplest option. > Make it a separate #define just for readability. There might be other > flags like this in the future. > > > I'd have been very upset not to find something ;) > > Yay for you ;) > > And yes, this is the kind of thing that probably does mean that we're > better off with the no-semantic-changes patches in -stable. > > Linus Yes, your patch looks just right to me - or maybe add a comment that it's only mmap_region()'s call to vma_merge() which needs that mask? If you hear no more from me, that'll be all I found. Hugh > > --- > mm/mmap.c | 5 ++++- > 1 files changed, 4 insertions(+), 1 deletions(-) > > diff --git a/mm/mmap.c b/mm/mmap.c > index d3fa10a..c581df1 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -658,6 +658,9 @@ again: remove_next = 1 + (end > next->vm_end); > validate_mm(mm); > } > > +/* Flags that can be inherited from an existing mapping when merging */ > +#define VM_MERGEABLE_FLAGS (VM_CAN_NONLINEAR) > + > /* > * If the vma has a ->close operation then the driver probably needs to release > * per-vma resources, so we don't attempt to merge those. > @@ -665,7 +668,7 @@ again: remove_next = 1 + (end > next->vm_end); > static inline int is_mergeable_vma(struct vm_area_struct *vma, > struct file *file, unsigned long vm_flags) > { > - if (vma->vm_flags != vm_flags) > + if ((vma->vm_flags ^ vm_flags) & ~VM_MERGEABLE_FLAGS) > return 0; > if (vma->vm_file != file) > return 0; ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-30 18:14 ` Linus Torvalds 2009-01-30 18:30 ` Hugh Dickins @ 2009-01-30 19:53 ` Lee Schermerhorn 2009-01-30 20:31 ` Linus Torvalds ` (2 more replies) 1 sibling, 3 replies; 47+ messages in thread From: Lee Schermerhorn @ 2009-01-30 19:53 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki On Fri, 2009-01-30 at 10:14 -0800, Linus Torvalds wrote: > > On Fri, 30 Jan 2009, Hugh Dickins wrote: > > > > ... what I think you have done is break the vma merging on > > ordinary files: because of that irritating VM_CAN_NONLINEAR > > flag which generic_file_mmap() and some others add in. > > Ahh. Yes. VM_CAN_NONLINEAR is a "reverse flag", ie unlike the other flags > it's to some degree about being extra _normal_, not about being odd. Most > special flags tend to disable the VM from doing some clever thing, this > one enables it. > > > To break the merging won't cause anyone much trouble, > > but is a slight regression we should fix. > > Yeah. Just masking it off when comparing is probably the simplest option. > Make it a separate #define just for readability. There might be other > flags like this in the future. > > > I'd have been very upset not to find something ;) > > Yay for you ;) > > And yes, this is the kind of thing that probably does mean that we're > better off with the no-semantic-changes patches in -stable. > > Linus > > --- > mm/mmap.c | 5 ++++- > 1 files changed, 4 insertions(+), 1 deletions(-) > > diff --git a/mm/mmap.c b/mm/mmap.c > index d3fa10a..c581df1 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -658,6 +658,9 @@ again: remove_next = 1 + (end > next->vm_end); > validate_mm(mm); > } > > +/* Flags that can be inherited from an existing mapping when merging */ > +#define VM_MERGEABLE_FLAGS (VM_CAN_NONLINEAR) > + > /* > * If the vma has a ->close operation then the driver probably needs to release > * per-vma resources, so we don't attempt to merge those. > @@ -665,7 +668,7 @@ again: remove_next = 1 + (end > next->vm_end); > static inline int is_mergeable_vma(struct vm_area_struct *vma, > struct file *file, unsigned long vm_flags) > { > - if (vma->vm_flags != vm_flags) > + if ((vma->vm_flags ^ vm_flags) & ~VM_MERGEABLE_FLAGS) > return 0; > if (vma->vm_file != file) > return 0; I tried this patch atop 29-rc3 + your patch from yesterday with my simple test program at http://free.linux/hp.com/~lts/Tests/mmap_lock.c. The test program shows the /proc/<pid>/maps before and after the mmap and attempted merge. It's not merging: 7fd20a668000-7fd20a66a000 rw-p 7fd20a668000 00:00 0 7fd20a68a000-7fd20a68b000 r--s 00000000 68:23 6608852 /tmp/tmpfVx1SFL (deleted) 7fd20a68b000-7fd20a68c000 r--s 00001000 68:23 6608852 /tmp/tmpfVx1SFL (deleted) 7fd20a68c000-7fd20a690000 rw-p 7fd20a68c000 00:00 0 Ad hoc instrumentation shows that it's the VM_ACCOUNT flag that is different between the existing file segment and the one attempting the merge: is_mergeable_vma: !mergable: vma flags: 0x80020f9:0x1020f9 | |-VM_ACCOUNT +-----------VM_CAN_NONLINEAR So happens, I'm mapping with MAP_SHARED, so the VM_ACCOUNT flag gets cleared later in mmap_region(). Comments say that this is for checking memory availability during shmem_file_setup(). Maybe we can move the temporary setting of VM_ACCOUNT until just before the call to shmem_zero_setup()? Lee ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-30 19:53 ` Lee Schermerhorn @ 2009-01-30 20:31 ` Linus Torvalds 2009-01-30 21:12 ` Hugh Dickins 2009-01-30 20:33 ` Hugh Dickins 2009-01-30 20:53 ` Randy Dunlap 2 siblings, 1 reply; 47+ messages in thread From: Linus Torvalds @ 2009-01-30 20:31 UTC (permalink / raw) To: Lee Schermerhorn Cc: Hugh Dickins, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki On Fri, 30 Jan 2009, Lee Schermerhorn wrote: > > Ad hoc instrumentation shows that it's the VM_ACCOUNT flag that is > different between the existing file segment and the one attempting the > merge: Gaah. I looked at VM_ACCOUNT, since it looked like a prime example of the same VM_CAN_NONLINEAR thing and was thinking that I should just add it to the "ok to merge" flags, but decided after a quick look that we do all the VM_ACCOUNT handling before we call vma_merge, so I just left it at that. But apparently we _do_ do some VM_ACCOUNT stuff afterwards. > is_mergeable_vma: !mergable: vma flags: 0x80020f9:0x1020f9 > | |-VM_ACCOUNT > +-----------VM_CAN_NONLINEAR > > So happens, I'm mapping with MAP_SHARED, so the VM_ACCOUNT flag gets > cleared later in mmap_region(). Comments say that this is for checking > memory availability during shmem_file_setup(). Maybe we can move the > temporary setting of VM_ACCOUNT until just before the call to > shmem_zero_setup()? Yeah, that would probably fix it, and looks like the right thing to do. It all looks pretty confused wrong to set the whole VM_ACCOUNT flag for a file-backed file AT ALL in the first place, but the code knows that it won't matter for a shared file, and will be cleared again later. So it plays these temporary games with vm_flags, and it didn't matter because of how we used to call "vma_merge()" either early only for the anonymous memory case (that had VM_ACCOUNT stable and didn't have that temporary case at all) or much later (after having undone the temporary flag setting) for files. Why do we pass in that "accountable" flag, btw? It's only ever set to 0 by a MAP_PRIVATE mapping that hits is_file_hugepages() (see do_mmap_pgoff), and we could just do that decision all inside mmap_region(). So the flag doesn't really seem to have any real meaning, and is just passed around for some odd historical reason? Linus ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-30 20:31 ` Linus Torvalds @ 2009-01-30 21:12 ` Hugh Dickins 2009-01-30 21:25 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 47+ messages in thread From: Hugh Dickins @ 2009-01-30 21:12 UTC (permalink / raw) To: Linus Torvalds Cc: Lee Schermerhorn, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Mikos Szeredi On Fri, 30 Jan 2009, Linus Torvalds wrote: > On Fri, 30 Jan 2009, Lee Schermerhorn wrote: > > > > So happens, I'm mapping with MAP_SHARED, so the VM_ACCOUNT flag gets > > cleared later in mmap_region(). Comments say that this is for checking > > memory availability during shmem_file_setup(). Maybe we can move the > > temporary setting of VM_ACCOUNT until just before the call to > > shmem_zero_setup()? > > Yeah, that would probably fix it, and looks like the right thing to do. I do need to refresh my memory on that in a moment... > > It all looks pretty confused wrong to set the whole VM_ACCOUNT flag for a > file-backed file AT ALL in the first place, but the code knows that it > won't matter for a shared file, and will be cleared again later. > > So it plays these temporary games with vm_flags, and it didn't matter > because of how we used to call "vma_merge()" either early only for the > anonymous memory case (that had VM_ACCOUNT stable and didn't have that > temporary case at all) or much later (after having undone the temporary > flag setting) for files. I'm to blame for those games, and now they've given trouble, the right thing may be to put an end to them. > > Why do we pass in that "accountable" flag, btw? It's only ever set to 0 by > a MAP_PRIVATE mapping that hits is_file_hugepages() (see do_mmap_pgoff), > and we could just do that decision all inside mmap_region(). So the flag > doesn't really seem to have any real meaning, and is just passed around > for some odd historical reason? It looks like the "accountable" flag dates from before Miklos separated mmap_region() out from do_mmap_pgoff(): so he just passed it on down to mmap_region() as an additional argument, preferring to leave the more complex MAP_PRIVATE/is_file_hugepages test behind in do_mmap_pgoff(). It seemed rather a random refactoring to me. Looking at it again, I wonder if we should be getting do_brk() to use mmap_region() too; but my appetite for cleanups is low at present, bugs we have enough. By the way, there's an argument to say that you should add VM_MIXEDMAP to VM_CAN_NONLINEAR in VM_MERGEABLE_FLAGS: I don't really care whether we merge the odd filemap_xip vma or not, but it used to do so and now won't. By the same (used to merge, now won't) argument, one could say VM_INSERTPAGE should be there too; but whereas VM_MIXEDMAP is used in one place only, quite a lot of drivers use vm_insert_page(), so I feel more comfortable with the idea that it's stopping merges - though in that case, shouldn't we add it to VM_SPECIAL? But I'm caring more about that VM_ACCOUNT... Hugh ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-30 21:12 ` Hugh Dickins @ 2009-01-30 21:25 ` Linus Torvalds 2009-01-30 21:36 ` Lee Schermerhorn 2009-01-30 21:37 ` Linus Torvalds 2 siblings, 0 replies; 47+ messages in thread From: Linus Torvalds @ 2009-01-30 21:25 UTC (permalink / raw) To: Hugh Dickins Cc: Lee Schermerhorn, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Mikos Szeredi On Fri, 30 Jan 2009, Hugh Dickins wrote: > > But I'm caring more about that VM_ACCOUNT... The bright side here is that I think the VM_ACCOUNT bit differences only really happen for MAP_SHARED file mappings. For MAP_SHARED anon mappings we already don't share (for other reasons), and for private mappings VM_ACCOUNT looks stable (ie it's reliably either set or not, and stays that way). So the impact is reasonably low. But it would definitely be good to clean up that whole VM_ACCOUNT logic, and incidentally then fix that merging issue at the same time. Linus ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-30 21:12 ` Hugh Dickins 2009-01-30 21:25 ` Linus Torvalds @ 2009-01-30 21:36 ` Lee Schermerhorn 2009-01-30 22:27 ` Linus Torvalds 2009-01-30 21:37 ` Linus Torvalds 2 siblings, 1 reply; 47+ messages in thread From: Lee Schermerhorn @ 2009-01-30 21:36 UTC (permalink / raw) To: Hugh Dickins Cc: Linus Torvalds, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Mikos Szeredi On Fri, 2009-01-30 at 21:12 +0000, Hugh Dickins wrote: > On Fri, 30 Jan 2009, Linus Torvalds wrote: > > On Fri, 30 Jan 2009, Lee Schermerhorn wrote: > > > > > > So happens, I'm mapping with MAP_SHARED, so the VM_ACCOUNT flag gets > > > cleared later in mmap_region(). Comments say that this is for checking > > > memory availability during shmem_file_setup(). Maybe we can move the > > > temporary setting of VM_ACCOUNT until just before the call to > > > shmem_zero_setup()? > > > > Yeah, that would probably fix it, and looks like the right thing to do. > > I do need to refresh my memory on that in a moment... > > > > > It all looks pretty confused wrong to set the whole VM_ACCOUNT flag for a > > file-backed file AT ALL in the first place, but the code knows that it > > won't matter for a shared file, and will be cleared again later. > > > > So it plays these temporary games with vm_flags, and it didn't matter > > because of how we used to call "vma_merge()" either early only for the > > anonymous memory case (that had VM_ACCOUNT stable and didn't have that > > temporary case at all) or much later (after having undone the temporary > > flag setting) for files. > > I'm to blame for those games, and now they've given trouble, > the right thing may be to put an end to them. > > > > > Why do we pass in that "accountable" flag, btw? It's only ever set to 0 by > > a MAP_PRIVATE mapping that hits is_file_hugepages() (see do_mmap_pgoff), > > and we could just do that decision all inside mmap_region(). So the flag > > doesn't really seem to have any real meaning, and is just passed around > > for some odd historical reason? > > It looks like the "accountable" flag dates from before Miklos separated > mmap_region() out from do_mmap_pgoff(): so he just passed it on down to > mmap_region() as an additional argument, preferring to leave the more > complex MAP_PRIVATE/is_file_hugepages test behind in do_mmap_pgoff(). > > It seemed rather a random refactoring to me. Looking at it again, > I wonder if we should be getting do_brk() to use mmap_region() too; > but my appetite for cleanups is low at present, bugs we have enough. > > By the way, there's an argument to say that you should add > VM_MIXEDMAP to VM_CAN_NONLINEAR in VM_MERGEABLE_FLAGS: I don't > really care whether we merge the odd filemap_xip vma or not, > but it used to do so and now won't. > > By the same (used to merge, now won't) argument, one could say > VM_INSERTPAGE should be there too; but whereas VM_MIXEDMAP is used > in one place only, quite a lot of drivers use vm_insert_page(), so > I feel more comfortable with the idea that it's stopping merges - > though in that case, shouldn't we add it to VM_SPECIAL? > > But I'm caring more about that VM_ACCOUNT... I just verified that adding VM_ACCOUNT to VM_MERGEABLE does allow the merge to happen with the test program. And the system didn't come crashing down around me. But, I wouldn't trust that simple test as the last word. A short run of a stress load I use held up/still running, but I can't tell whether it's merging as expected there. I am running a slightly modified version of Maksim's test program under the harness. I modified it to mmap the entire region to reserve space, then MAP_FIXED at each page address in the range returned by the first mmap. I saw that it was leaving holes between some of the pages w/o this. I'm going to automate the check for merging [read map and verify a single segment at expected range] and leave that running with the load. Lee ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-30 21:36 ` Lee Schermerhorn @ 2009-01-30 22:27 ` Linus Torvalds 2009-01-31 12:35 ` Hugh Dickins 0 siblings, 1 reply; 47+ messages in thread From: Linus Torvalds @ 2009-01-30 22:27 UTC (permalink / raw) To: Lee Schermerhorn Cc: Hugh Dickins, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Mikos Szeredi On Fri, 30 Jan 2009, Lee Schermerhorn wrote: > > I just verified that adding VM_ACCOUNT to VM_MERGEABLE does allow the > merge to happen with the test program. And the system didn't come > crashing down around me. But, I wouldn't trust that simple test as the > last word. A short run of a stress load I use held up/still running, > but I can't tell whether it's merging as expected there. Just ignoring VM_ACCOUNT for merging is not the right thing to do. It probably works in practice for just about everything, but at least in theory it does mean that mmap() can stop honoring MAP_NORESERVE. Admittedly the circumstances where that happens are unlikely, and nobody probably even really uses MAP_NORESERVE in the first place, so I doubt you can ever see it as a real issue, but it's still technically wrong to merge vma's that can differ in VM_ACCOUNT. Now, the particular test you have, VM_ACCOUNT differs only during that temporary window, and the vma's really do end up with the same VM_ACCOUNT state in the end, so merging them is correct, but if you were to privately map the same file (or private anonymous map) with the same permissions next to each other so that they -could- merge, but use MAP_NORESERVE on one and not on the other, then they shouldn't merge. So VM_ACCOUNT does matter - just barely - for merging, and we just happen to currently hit it too much due to a very odd internal reason. Linus ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-30 22:27 ` Linus Torvalds @ 2009-01-31 12:35 ` Hugh Dickins 2009-01-31 18:34 ` Linus Torvalds 2009-02-03 16:13 ` Lee Schermerhorn 0 siblings, 2 replies; 47+ messages in thread From: Hugh Dickins @ 2009-01-31 12:35 UTC (permalink / raw) To: Linus Torvalds Cc: Lee Schermerhorn, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Mikos Szeredi On Fri, 30 Jan 2009, Linus Torvalds wrote: > On Fri, 30 Jan 2009, Lee Schermerhorn wrote: > > > > I just verified that adding VM_ACCOUNT to VM_MERGEABLE does allow the > > merge to happen with the test program. And the system didn't come > > crashing down around me. But, I wouldn't trust that simple test as the > > last word. A short run of a stress load I use held up/still running, > > but I can't tell whether it's merging as expected there. > > Just ignoring VM_ACCOUNT for merging is not the right thing to do. It > probably works in practice for just about everything, but at least in > theory it does mean that mmap() can stop honoring MAP_NORESERVE. > > Admittedly the circumstances where that happens are unlikely, and nobody > probably even really uses MAP_NORESERVE in the first place, so I doubt you > can ever see it as a real issue, but it's still technically wrong to merge > vma's that can differ in VM_ACCOUNT. > > Now, the particular test you have, VM_ACCOUNT differs only during that > temporary window, and the vma's really do end up with the same VM_ACCOUNT > state in the end, so merging them is correct, but if you were to privately > map the same file (or private anonymous map) with the same permissions > next to each other so that they -could- merge, but use MAP_NORESERVE on > one and not on the other, then they shouldn't merge. > > So VM_ACCOUNT does matter - just barely - for merging, and we just happen > to currently hit it too much due to a very odd internal reason. It matters more than just barely - if you care about non-overcommit, or if you care about non-wrapping Committed_AS in your /proc/meminfo. Ignoring VM_ACCOUNT when merging is very much the wrong thing to do, because it lets an unaccounted area be treated thereafter as accounted, or vice versa - even forgetting the MAP_NORESERVE special case. I have by now recalled why I chose to play those VM_ACCOUNT games: /* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform * shmem_zero_setup (perhaps called through /dev/zero's ->mmap) * that memory reservation must be checked; but that reservation * belongs to shared memory object, not to vma: so now clear it. We need a way to communicate not-MAP_NORESERVE to shmem.c, and we don't just need it in the explicit shmem_zero_setup() case, we also need it for the (probably rare nowadays) case when mmap() is working on file /dev/zero (drivers/char/mem.c mmap_zero()), rather than using MAP_ANON. Still haven't decided what's best to do about it (plenty of diversions): perhaps we just say my error was to overload VM_ACCOUNT, and define a new flag for the purpose, which can go into VM_MERGEABLE_FLAGS; but I'd prefer a neater solution if it crosses my mind. Hugh ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-31 12:35 ` Hugh Dickins @ 2009-01-31 18:34 ` Linus Torvalds 2009-02-02 11:59 ` KOSAKI Motohiro 2009-02-03 16:13 ` Lee Schermerhorn 1 sibling, 1 reply; 47+ messages in thread From: Linus Torvalds @ 2009-01-31 18:34 UTC (permalink / raw) To: Hugh Dickins Cc: Lee Schermerhorn, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Mikos Szeredi On Sat, 31 Jan 2009, Hugh Dickins wrote: > > We need a way to communicate not-MAP_NORESERVE to shmem.c, and we don't > just need it in the explicit shmem_zero_setup() case, we also need it > for the (probably rare nowadays) case when mmap() is working on file > /dev/zero (drivers/char/mem.c mmap_zero()), rather than using MAP_ANON. Heh. We already have that. Maybe you didn't realize. Look at VM_NORESERVE. So shmem.c can just look at "!(vma->vm_flags & VM_NORESERVE)" if it wants to. The only piece of information you don't have is that "accountable" flag, but that was why I was pointing out how totally _useless_ that flag actually is. We shouldn't pass it along, because it's always 1, except for one special case that we can always calculate (private hugepages). But in fact, even leaving that untouched, we can trivially just change what 'VM_NORESERVE' means: we can just make it the end result of all that 'accountable' logic, instead of having it just mirror MAP_NORESERVE blindly. > Still haven't decided what's best to do about it (plenty of diversions): > perhaps we just say my error was to overload VM_ACCOUNT, and define a > new flag for the purpose, which can go into VM_MERGEABLE_FLAGS; but > I'd prefer a neater solution if it crosses my mind. How about this pretty trivial patch? TOTALLY UNTESTED. As usual. But the concept is pretty simple, and it actually removes a fair chunk of hacky code. The only reason the diffstat output says that it adds more lines than it deletes is that I added more comments and made that helper inline function rather than make a complex conditional. Whaddaya think? Linus --- mm/mmap.c | 48 +++++++++++++++++++++++++----------------------- mm/shmem.c | 2 +- 2 files changed, 26 insertions(+), 24 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index c581df1..5fcaec3 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1090,6 +1090,15 @@ int vma_wants_writenotify(struct vm_area_struct *vma) mapping_cap_account_dirty(vma->vm_file->f_mapping); } +/* + * We account for memory if it's a private writeable mapping, + * and VM_NORESERVE wasn't set. + */ +static inline int private_accountable_mapping(unsigned int vm_flags) +{ + return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE; +} + unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long len, unsigned long flags, unsigned int vm_flags, unsigned long pgoff, @@ -1117,23 +1126,24 @@ munmap_back: if (!may_expand_vm(mm, len >> PAGE_SHIFT)) return -ENOMEM; - if (flags & MAP_NORESERVE) + /* + * Set 'VM_NORESERVE' if we should not account for the + * memory use of this mapping. We only honor MAP_NORESERVE + * if we're allowed to overcommit memory. + */ + if ((flags & MAP_NORESERVE) && sysctl_overcommit_memory != OVERCOMMIT_NEVER) + vm_flags |= VM_NORESERVE; + if (!accountable) vm_flags |= VM_NORESERVE; - if (accountable && (!(flags & MAP_NORESERVE) || - sysctl_overcommit_memory == OVERCOMMIT_NEVER)) { - if (vm_flags & VM_SHARED) { - /* Check memory availability in shmem_file_setup? */ - vm_flags |= VM_ACCOUNT; - } else if (vm_flags & VM_WRITE) { - /* - * Private writable mapping: check memory availability - */ - charged = len >> PAGE_SHIFT; - if (security_vm_enough_memory(charged)) - return -ENOMEM; - vm_flags |= VM_ACCOUNT; - } + /* + * Private writable mapping: check memory availability + */ + if (private_accountable_mapping(vm_flags)) { + charged = len >> PAGE_SHIFT; + if (security_vm_enough_memory(charged)) + return -ENOMEM; + vm_flags |= VM_ACCOUNT; } /* @@ -1184,14 +1194,6 @@ munmap_back: goto free_vma; } - /* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform - * shmem_zero_setup (perhaps called through /dev/zero's ->mmap) - * that memory reservation must be checked; but that reservation - * belongs to shared memory object, not to vma: so now clear it. - */ - if ((vm_flags & (VM_SHARED|VM_ACCOUNT)) == (VM_SHARED|VM_ACCOUNT)) - vma->vm_flags &= ~VM_ACCOUNT; - /* Can addr have changed?? * * Answer: Yes, several device drivers can do it in their diff --git a/mm/shmem.c b/mm/shmem.c index 5d0de96..19d566c 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2628,7 +2628,7 @@ struct file *shmem_file_setup(char *name, loff_t size, unsigned long flags) goto close_file; #ifdef CONFIG_SHMEM - SHMEM_I(inode)->flags = flags & VM_ACCOUNT; + SHMEM_I(inode)->flags = (flags & VM_NORESERVE) ? 0 : VM_ACCOUNT; #endif d_instantiate(dentry, inode); inode->i_size = size; ^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-31 18:34 ` Linus Torvalds @ 2009-02-02 11:59 ` KOSAKI Motohiro 2009-02-02 12:54 ` Hugh Dickins 0 siblings, 1 reply; 47+ messages in thread From: KOSAKI Motohiro @ 2009-02-02 11:59 UTC (permalink / raw) To: Linus Torvalds Cc: kosaki.motohiro, Hugh Dickins, Lee Schermerhorn, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KAMEZAWA Hiroyuki, Mikos Szeredi Hi I went to trip for last three days and I returned today. I suprised this loooong thread now :) I have one comment. > TOTALLY UNTESTED. As usual. But the concept is pretty simple, and it > actually removes a fair chunk of hacky code. The only reason the diffstat > output says that it adds more lines than it deletes is that I added more > comments and made that helper inline function rather than make a complex > conditional. > > Whaddaya think? > > Linus > > --- > mm/mmap.c | 48 +++++++++++++++++++++++++----------------------- > mm/shmem.c | 2 +- > 2 files changed, 26 insertions(+), 24 deletions(-) > > diff --git a/mm/mmap.c b/mm/mmap.c > index c581df1..5fcaec3 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1090,6 +1090,15 @@ int vma_wants_writenotify(struct vm_area_struct *vma) > mapping_cap_account_dirty(vma->vm_file->f_mapping); > } > > +/* > + * We account for memory if it's a private writeable mapping, > + * and VM_NORESERVE wasn't set. > + */ > +static inline int private_accountable_mapping(unsigned int vm_flags) > +{ > + return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE; > +} > + > unsigned long mmap_region(struct file *file, unsigned long addr, > unsigned long len, unsigned long flags, > unsigned int vm_flags, unsigned long pgoff, > @@ -1117,23 +1126,24 @@ munmap_back: > if (!may_expand_vm(mm, len >> PAGE_SHIFT)) > return -ENOMEM; > > - if (flags & MAP_NORESERVE) > + /* > + * Set 'VM_NORESERVE' if we should not account for the > + * memory use of this mapping. We only honor MAP_NORESERVE > + * if we're allowed to overcommit memory. > + */ > + if ((flags & MAP_NORESERVE) && sysctl_overcommit_memory != OVERCOMMIT_NEVER) I afraid this line a bit. if following scenario happend, we can lost VM_NORESERVE? 1. admin set overcommit_memory to "never" 2. mmap 3. admin set overcommit_memory to "guess" thanks. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-02-02 11:59 ` KOSAKI Motohiro @ 2009-02-02 12:54 ` Hugh Dickins 2009-02-02 14:10 ` KOSAKI Motohiro 2009-02-02 18:33 ` Mel Gorman 0 siblings, 2 replies; 47+ messages in thread From: Hugh Dickins @ 2009-02-02 12:54 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Linus Torvalds, Lee Schermerhorn, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KAMEZAWA Hiroyuki, Mikos Szeredi, Andy Whitcroft On Mon, 2 Feb 2009, KOSAKI Motohiro wrote: > > > > - if (flags & MAP_NORESERVE) > > + /* > > + * Set 'VM_NORESERVE' if we should not account for the > > + * memory use of this mapping. We only honor MAP_NORESERVE > > + * if we're allowed to overcommit memory. > > + */ > > + if ((flags & MAP_NORESERVE) && sysctl_overcommit_memory != OVERCOMMIT_NEVER) > > I afraid this line a bit. > if following scenario happend, we can lost VM_NORESERVE? > > 1. admin set overcommit_memory to "never" > 2. mmap > 3. admin set overcommit_memory to "guess" I still haven't reviewed it fully myself (and note that what Linus put in his tree is not identical to this posted patch), but I do believe this is okay. When admin changes overcommit_memory, we don't make a pass across every vma of every mm in the system, to adjust all the accounting of VM_NORESERVE areas; so I think it's quite reasonable to take VM_NORESERVE as reflecting the policy in force when that vma was created. And nothing is displaying the VM_NORESERVE flag. Ah, you're actually thinking of 4. mprotect with the original flags (!VM_WRITE) such that no VM_ACCOUNT was done, and now VM_WRITE is added and the accounting is done despite it having been mapped MAP_NORESERVE originally. Whereas before Linus's change, VM_NORESERVE would have still exempted it. Well... I don't think I care! But I wonder what the hugetlb situation is: that if (!accountable) vm_flags |= VM_NORESERVE; looks suspicious to me, they look as if they're exempting all the hugetlb pages from its accounting, whereas !accountable was only supposed to exempt them from mmap_region()'s own accounting. Perhaps. I'm still looking at other things, not given this the time it needs yet. Hugh ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-02-02 12:54 ` Hugh Dickins @ 2009-02-02 14:10 ` KOSAKI Motohiro 2009-02-02 18:58 ` Mel Gorman 2009-02-02 18:33 ` Mel Gorman 1 sibling, 1 reply; 47+ messages in thread From: KOSAKI Motohiro @ 2009-02-02 14:10 UTC (permalink / raw) To: Hugh Dickins Cc: kosaki.motohiro, Linus Torvalds, Lee Schermerhorn, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KAMEZAWA Hiroyuki, Mikos Szeredi, Andy Whitcroft, Mel Gorman (cc to mel) > On Mon, 2 Feb 2009, KOSAKI Motohiro wrote: > > > > > > - if (flags & MAP_NORESERVE) > > > + /* > > > + * Set 'VM_NORESERVE' if we should not account for the > > > + * memory use of this mapping. We only honor MAP_NORESERVE > > > + * if we're allowed to overcommit memory. > > > + */ > > > + if ((flags & MAP_NORESERVE) && sysctl_overcommit_memory != OVERCOMMIT_NEVER) > > > > I afraid this line a bit. > > if following scenario happend, we can lost VM_NORESERVE? > > > > 1. admin set overcommit_memory to "never" > > 2. mmap > > 3. admin set overcommit_memory to "guess" > > I still haven't reviewed it fully myself (and note that what > Linus put in his tree is not identical to this posted patch), > but I do believe this is okay. > > When admin changes overcommit_memory, we don't make a pass across > every vma of every mm in the system, to adjust all the accounting > of VM_NORESERVE areas; so I think it's quite reasonable to take > VM_NORESERVE as reflecting the policy in force when that vma was > created. And nothing is displaying the VM_NORESERVE flag. hmhm, I see. > Ah, you're actually thinking of > 4. mprotect > with the original flags (!VM_WRITE) such that no VM_ACCOUNT was done, > and now VM_WRITE is added and the accounting is done despite it having > been mapped MAP_NORESERVE originally. Whereas before Linus's change, > VM_NORESERVE would have still exempted it. > > Well... I don't think I care! Yeah. FWIW, we don't need VM_NORESERVE checking now because VM_NORESERVE and VM_ACCOUNT are exclusive condition now :) > But I wonder what the hugetlb situation is: that > if (!accountable) > vm_flags |= VM_NORESERVE; > looks suspicious to me, they look as if they're exempting all > the hugetlb pages from its accounting, whereas !accountable was > only supposed to exempt them from mmap_region()'s own accounting. HAHAHA, Indeed. when hugepage shared read-only mapping -> hugepage shared writable maping, following code seems to cause calling vm_enough_memory() although hugepage. ======================================================== mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev, unsigned long start, unsigned long end, unsigned long newflags) { if (newflags & VM_WRITE) { if (!(oldflags & (VM_ACCOUNT|VM_WRITE| VM_SHARED|VM_NORESERVE))) { charged = nrpages; if (security_vm_enough_memory(charged)) return -ENOMEM; newflags |= VM_ACCOUNT; } } ========================================================== mel, what do you think this? > > Perhaps. I'm still looking at other things, > not given this the time it needs yet. > > Hugh ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-02-02 14:10 ` KOSAKI Motohiro @ 2009-02-02 18:58 ` Mel Gorman 2009-02-02 19:23 ` Linus Torvalds 0 siblings, 1 reply; 47+ messages in thread From: Mel Gorman @ 2009-02-02 18:58 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Hugh Dickins, Linus Torvalds, Lee Schermerhorn, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KAMEZAWA Hiroyuki, Mikos Szeredi, Andy Whitcroft On Mon, Feb 02, 2009 at 11:10:42PM +0900, KOSAKI Motohiro wrote: > (cc to mel) > > > On Mon, 2 Feb 2009, KOSAKI Motohiro wrote: > > > > > > > > - if (flags & MAP_NORESERVE) > > > > + /* > > > > + * Set 'VM_NORESERVE' if we should not account for the > > > > + * memory use of this mapping. We only honor MAP_NORESERVE > > > > + * if we're allowed to overcommit memory. > > > > + */ > > > > + if ((flags & MAP_NORESERVE) && sysctl_overcommit_memory != OVERCOMMIT_NEVER) > > > > > > I afraid this line a bit. > > > if following scenario happend, we can lost VM_NORESERVE? > > > > > > 1. admin set overcommit_memory to "never" > > > 2. mmap > > > 3. admin set overcommit_memory to "guess" > > > > I still haven't reviewed it fully myself (and note that what > > Linus put in his tree is not identical to this posted patch), > > but I do believe this is okay. > > > > When admin changes overcommit_memory, we don't make a pass across > > every vma of every mm in the system, to adjust all the accounting > > of VM_NORESERVE areas; so I think it's quite reasonable to take > > VM_NORESERVE as reflecting the policy in force when that vma was > > created. And nothing is displaying the VM_NORESERVE flag. > > hmhm, I see. > > > > Ah, you're actually thinking of > > 4. mprotect > > with the original flags (!VM_WRITE) such that no VM_ACCOUNT was done, > > and now VM_WRITE is added and the accounting is done despite it having > > been mapped MAP_NORESERVE originally. Whereas before Linus's change, > > VM_NORESERVE would have still exempted it. > > > > Well... I don't think I care! > > Yeah. > > FWIW, we don't need VM_NORESERVE checking now because VM_NORESERVE and VM_ACCOUNT > are exclusive condition now :) > > > > > But I wonder what the hugetlb situation is: that > > if (!accountable) > > vm_flags |= VM_NORESERVE; > > looks suspicious to me, they look as if they're exempting all > > the hugetlb pages from its accounting, whereas !accountable was > > only supposed to exempt them from mmap_region()'s own accounting. > > HAHAHA, Indeed. > Candidate patch for clearing that up as been posted. Thanks for cc'ing me on this as I would have missed it. > when hugepage shared read-only mapping -> hugepage shared writable maping, > following code seems to cause calling vm_enough_memory() although hugepage. > > ======================================================== > mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev, > unsigned long start, unsigned long end, unsigned long newflags) > { > if (newflags & VM_WRITE) { > if (!(oldflags & (VM_ACCOUNT|VM_WRITE| > VM_SHARED|VM_NORESERVE))) { > charged = nrpages; > if (security_vm_enough_memory(charged)) > return -ENOMEM; > newflags |= VM_ACCOUNT; > } > } > ========================================================== > > mel, what do you think this? > I think there is a problem there all right. VM_ACCOUNT will not be set with the other patch applied but VM_NORESERVE might be. If so, we potentially set VM_ACCOUNT on a hugetlbfs mapping and probably make a mess out of Committed_AS later. Maybe something like the following? ================ Do not account for address space usage when making hugetlbfs mappings RW hugetlbfs accounts for its address space usage separate from the VM core. VM_ACCOUNT should not be set for its mappings but it is possible it gets set if a user creates a RO hugetlbfs mapping MAP_NORESERVE and then calls mprotect(). This patch stops VM_ACCOUNT being set for hugetlbfs mappings during mprotect(). Credit goes to Kosaki Motohiro for spotting this. Signed-off-by: Mel Gorman <mel@csn.ul.ie> diff --git a/mm/mprotect.c b/mm/mprotect.c index abe2694..31ddc6a 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -153,7 +153,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev, * but (without finer accounting) cannot reduce our commit if we * make it unwritable again. */ - if (newflags & VM_WRITE) { + if (newflags & VM_WRITE && !(vma->vm_flags & VM_HUGETLB)) { if (!(oldflags & (VM_ACCOUNT|VM_WRITE| VM_SHARED|VM_NORESERVE))) { charged = nrpages; ^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-02-02 18:58 ` Mel Gorman @ 2009-02-02 19:23 ` Linus Torvalds 2009-02-02 21:50 ` Mel Gorman 0 siblings, 1 reply; 47+ messages in thread From: Linus Torvalds @ 2009-02-02 19:23 UTC (permalink / raw) To: Mel Gorman Cc: KOSAKI Motohiro, Hugh Dickins, Lee Schermerhorn, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KAMEZAWA Hiroyuki, Mikos Szeredi, Andy Whitcroft On Mon, 2 Feb 2009, Mel Gorman wrote: > > Do not account for address space usage when making hugetlbfs mappings RW > > hugetlbfs accounts for its address space usage separate from the VM > core. VM_ACCOUNT should not be set for its mappings but it is possible it gets > set if a user creates a RO hugetlbfs mapping MAP_NORESERVE and then calls > mprotect(). This patch stops VM_ACCOUNT being set for hugetlbfs mappings > during mprotect(). > > Credit goes to Kosaki Motohiro for spotting this. > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > > diff --git a/mm/mprotect.c b/mm/mprotect.c > index abe2694..31ddc6a 100644 > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -153,7 +153,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev, > * but (without finer accounting) cannot reduce our commit if we > * make it unwritable again. > */ > - if (newflags & VM_WRITE) { > + if (newflags & VM_WRITE && !(vma->vm_flags & VM_HUGETLB)) { Wouldn't it be _much_ nicer to just depend on that whole VM_NORESERVE thing? Those hugetlb mappings _should_ have VM_NORESERVE on them, so the following test: > if (!(oldflags & (VM_ACCOUNT|VM_WRITE| > VM_SHARED|VM_NORESERVE))) { > charged = nrpages; should do it all correctly. Why make up some ad-hoc testing, when we already have a flag for _exactly_ this issue. That's what VM_NORESERVE means: don't apply VM_ACCOUNT. IOW, I don't see the point of this patch at all. And if there is some hugetlb path that doesn't set VM_NORESERVE, then fix _that_ instead. Linus ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-02-02 19:23 ` Linus Torvalds @ 2009-02-02 21:50 ` Mel Gorman 2009-02-02 22:12 ` Linus Torvalds 0 siblings, 1 reply; 47+ messages in thread From: Mel Gorman @ 2009-02-02 21:50 UTC (permalink / raw) To: Linus Torvalds Cc: KOSAKI Motohiro, Hugh Dickins, Lee Schermerhorn, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KAMEZAWA Hiroyuki, Mikos Szeredi, Andy Whitcroft On Mon, Feb 02, 2009 at 11:23:33AM -0800, Linus Torvalds wrote: > > > On Mon, 2 Feb 2009, Mel Gorman wrote: > > > > Do not account for address space usage when making hugetlbfs mappings RW > > > > hugetlbfs accounts for its address space usage separate from the VM > > core. VM_ACCOUNT should not be set for its mappings but it is possible it gets > > set if a user creates a RO hugetlbfs mapping MAP_NORESERVE and then calls > > mprotect(). This patch stops VM_ACCOUNT being set for hugetlbfs mappings > > during mprotect(). > > > > Credit goes to Kosaki Motohiro for spotting this. > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > > > > diff --git a/mm/mprotect.c b/mm/mprotect.c > > index abe2694..31ddc6a 100644 > > --- a/mm/mprotect.c > > +++ b/mm/mprotect.c > > @@ -153,7 +153,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev, > > * but (without finer accounting) cannot reduce our commit if we > > * make it unwritable again. > > */ > > - if (newflags & VM_WRITE) { > > + if (newflags & VM_WRITE && !(vma->vm_flags & VM_HUGETLB)) { > > Wouldn't it be _much_ nicer to just depend on that whole VM_NORESERVE > thing? > Yeah, it would but it's not a trivial change. mm/hugetlb.c depends on VM_NORESERVE for its own accounting but also depends on VM_ACCOUNT not being set because counters would get mucked up when the VMAs get unmapped. The ideal answer would be to handle VM_ACCOUNT properly but it's not clear-cut. If it's counted towards reserves, then we are double reserving - the huge pages already allocated and base pages that will never be used. Then again, maybe the right thing to do is update nr_accounted when VM_HUGETLB is not set converting things like if (vma->vm_flags & VM_ACCOUNT) *nr_accounted += (end - start) >> PAGE_SHIFT; to if (vma->vm_flags & (VM_ACCOUNT | VM_HUGETLB) == VM_ACCOUNT) *nr_accounted += (end - start) >> PAGE_SHIFT; ? > Those hugetlb mappings _should_ have VM_NORESERVE on them, so the > following test: > > > if (!(oldflags & (VM_ACCOUNT|VM_WRITE| > > VM_SHARED|VM_NORESERVE))) { > > charged = nrpages; > > should do it all correctly. > Lets say someone does the following 1. mmap(PROT_READ, MAP_PRIVATE) on a hugetlbfs file VM_ACCOUNT is not set for hugetlbfs VM_NORESERVE is not set because MAP_NORESERVE was not there 2. mprotect(PROT_WRITE) VM_ACCOUNT|VM_WRITE|VM_SHARE|VM_NORESERVE == 0 That check is true newflags |= VM_ACCOUNT and hugetlbfs now has VM_ACCOUNT 3. unmap the vmas nr_accounted gets decremented, maybe wraps negative and unhappiness ensues > Why make up some ad-hoc testing, when we already have a flag for _exactly_ > this issue. That's what VM_NORESERVE means: don't apply VM_ACCOUNT. > > IOW, I don't see the point of this patch at all. > > And if there is some hugetlb path that doesn't set VM_NORESERVE, then fix > _that_ instead. > It gets set all right, the problem is with VM_ACCOUNT getting set. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-02-02 21:50 ` Mel Gorman @ 2009-02-02 22:12 ` Linus Torvalds 2009-02-02 22:35 ` Mel Gorman 0 siblings, 1 reply; 47+ messages in thread From: Linus Torvalds @ 2009-02-02 22:12 UTC (permalink / raw) To: Mel Gorman Cc: KOSAKI Motohiro, Hugh Dickins, Lee Schermerhorn, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KAMEZAWA Hiroyuki, Mikos Szeredi, Andy Whitcroft On Mon, 2 Feb 2009, Mel Gorman wrote: > > Lets say someone does the following > > 1. mmap(PROT_READ, MAP_PRIVATE) on a hugetlbfs file > VM_ACCOUNT is not set for hugetlbfs > VM_NORESERVE is not set because MAP_NORESERVE was not there But isn't this exactly the thing that we have that odd "accountable" flag for, and we do the whole if (!accountable) vm_flags |= VM_NORESERVE; in mmap_region() for? So VM_NORESERVE _will_ be set. Linus ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-02-02 22:12 ` Linus Torvalds @ 2009-02-02 22:35 ` Mel Gorman 0 siblings, 0 replies; 47+ messages in thread From: Mel Gorman @ 2009-02-02 22:35 UTC (permalink / raw) To: Linus Torvalds Cc: KOSAKI Motohiro, Hugh Dickins, Lee Schermerhorn, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KAMEZAWA Hiroyuki, Mikos Szeredi, Andy Whitcroft On Mon, Feb 02, 2009 at 02:12:30PM -0800, Linus Torvalds wrote: > > > On Mon, 2 Feb 2009, Mel Gorman wrote: > > > > Lets say someone does the following > > > > 1. mmap(PROT_READ, MAP_PRIVATE) on a hugetlbfs file > > VM_ACCOUNT is not set for hugetlbfs > > VM_NORESERVE is not set because MAP_NORESERVE was not there > > But isn't this exactly the thing that we have that odd "accountable" flag > for, and we do the whole > > if (!accountable) > vm_flags |= VM_NORESERVE; > > in mmap_region() for? > > So VM_NORESERVE _will_ be set. > Then it's getting unconditionally set which breaks the hugetlb accounting for reserving hugepages. See mm/hugetlb.c#decrement_hugepage_resv_vma() and mm/hugetlb.c#hugetlb_reserve_pages() which depend on VM_NORESERVE being set or not set depending on MAP_NORESERVE, not whether the core VM is accounting it or not. This is why I tried replacing if (!accountable) vm_flags |= VM_NORESERVE; with if ((flags & MAP_NORESERVE) && should_overcommit(file)) vm_flags |= VM_NORESERVE; and static inline int should_overcommit(struct file *file) { /* Check if the sysctl allows overcommit */ if (sysctl_overcommit_memory != OVERCOMMIT_NEVER) return 1; /* hugetlbfs does its own accounting */ if (file && is_file_hugepages(file)) return 1; return 0; } in the first patch I mailed out. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-02-02 12:54 ` Hugh Dickins 2009-02-02 14:10 ` KOSAKI Motohiro @ 2009-02-02 18:33 ` Mel Gorman 1 sibling, 0 replies; 47+ messages in thread From: Mel Gorman @ 2009-02-02 18:33 UTC (permalink / raw) To: Hugh Dickins Cc: KOSAKI Motohiro, Linus Torvalds, Lee Schermerhorn, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KAMEZAWA Hiroyuki, Mikos Szeredi, Andy Whitcroft On Mon, Feb 02, 2009 at 12:54:15PM +0000, Hugh Dickins wrote: > On Mon, 2 Feb 2009, KOSAKI Motohiro wrote: > > > > > > - if (flags & MAP_NORESERVE) > > > + /* > > > + * Set 'VM_NORESERVE' if we should not account for the > > > + * memory use of this mapping. We only honor MAP_NORESERVE > > > + * if we're allowed to overcommit memory. > > > + */ > > > + if ((flags & MAP_NORESERVE) && sysctl_overcommit_memory != OVERCOMMIT_NEVER) > > > > I afraid this line a bit. > > if following scenario happend, we can lost VM_NORESERVE? > > > > 1. admin set overcommit_memory to "never" > > 2. mmap > > 3. admin set overcommit_memory to "guess" > > I still haven't reviewed it fully myself (and note that what > Linus put in his tree is not identical to this posted patch), > but I do believe this is okay. > > When admin changes overcommit_memory, we don't make a pass across > every vma of every mm in the system, to adjust all the accounting > of VM_NORESERVE areas; so I think it's quite reasonable to take > VM_NORESERVE as reflecting the policy in force when that vma was > created. And nothing is displaying the VM_NORESERVE flag. > > Ah, you're actually thinking of > 4. mprotect > with the original flags (!VM_WRITE) such that no VM_ACCOUNT was done, > and now VM_WRITE is added and the accounting is done despite it having > been mapped MAP_NORESERVE originally. Whereas before Linus's change, > VM_NORESERVE would have still exempted it. > > Well... I don't think I care! > > But I wonder what the hugetlb situation is: that > if (!accountable) > vm_flags |= VM_NORESERVE; > looks suspicious to me, they look as if they're exempting all > the hugetlb pages from its accounting, whereas !accountable was > only supposed to exempt them from mmap_region()'s own accounting. > Am playing a bit of catch-up here and just reading the patch for the first time. Whatever about the rest of it, I can comment on the hugetlb aspects at least. Glancing through commit fc8744adc870a8d4366908221508bb113d8b72ee looked like it would break hugetlb accounting and sure enough, a test set off an OOM storm killing almost everything in sight. Any opinions on the patch below? As a bonus, it gets rid of this "accountable" parameter altogether and figures out everything needed from the file and flags. ================ Do not account for the address space used by hugetlbfs hugetlb always applies strict overcommit semantics to shared and private mappings regardless of the value in /proc/sys/vm/overcommit_memory. mm/hugetlb.c has reservation counters that are checked and updated during mmap() to ensure that hugepages exist in the future when faults occurs. Otherwise, it is too easy to applications to be SIGKILLed. It uses VM_NORESERVE to track whether MAP_NORESERVE was set at mmap() time. If an application application fails with MAP_NORESERVE, they get to keep both pieces. The core VM does not handle VM_ACCOUNT correctly for hugetlbfs-backed mappings so it should not be set. With commit fc8744adc870a8d4366908221508bb113d8b72ee, VM_NORESERVE and VM_ACCOUNT are getting set for hugetlbfs-backed mappings resulting in a lot of oddness and processes getting killed. Special-case hugetlbfs as it handles its own accounting. Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- include/linux/mm.h | 3 +-- mm/fremap.c | 2 +- mm/mmap.c | 30 ++++++++++++++++++++---------- 3 files changed, 22 insertions(+), 13 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index e8ddc98..3235615 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1129,8 +1129,7 @@ extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, unsigned long flag, unsigned long pgoff); extern unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long len, unsigned long flags, - unsigned int vm_flags, unsigned long pgoff, - int accountable); + unsigned int vm_flags, unsigned long pgoff); static inline unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, diff --git a/mm/fremap.c b/mm/fremap.c index 736ba7f..b6ec85a 100644 --- a/mm/fremap.c +++ b/mm/fremap.c @@ -198,7 +198,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size, flags &= MAP_NONBLOCK; get_file(file); addr = mmap_region(file, start, size, - flags, vma->vm_flags, pgoff, 1); + flags, vma->vm_flags, pgoff); fput(file); if (IS_ERR_VALUE(addr)) { err = addr; diff --git a/mm/mmap.c b/mm/mmap.c index 214b6a2..c4f3321 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -918,7 +918,6 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, struct inode *inode; unsigned int vm_flags; int error; - int accountable = 1; unsigned long reqprot = prot; /* @@ -1019,8 +1018,6 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, return -EPERM; vm_flags &= ~VM_MAYEXEC; } - if (is_file_hugepages(file)) - accountable = 0; if (!file->f_op || !file->f_op->mmap) return -ENODEV; @@ -1053,8 +1050,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, if (error) return error; - return mmap_region(file, addr, len, flags, vm_flags, pgoff, - accountable); + return mmap_region(file, addr, len, flags, vm_flags, pgoff); } EXPORT_SYMBOL(do_mmap_pgoff); @@ -1096,13 +1092,29 @@ int vma_wants_writenotify(struct vm_area_struct *vma) */ static inline int accountable_mapping(unsigned int vm_flags) { + /* hugetlbfs does its own accounting */ + if (vm_flags & VM_HUGETLB) + return 0; + return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE; } +static inline int should_overcommit(struct file *file) +{ + /* Check if the sysctl allows overcommit */ + if (sysctl_overcommit_memory != OVERCOMMIT_NEVER) + return 1; + + /* hugetlbfs does its own accounting */ + if (file && is_file_hugepages(file)) + return 1; + + return 0; +} + unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long len, unsigned long flags, - unsigned int vm_flags, unsigned long pgoff, - int accountable) + unsigned int vm_flags, unsigned long pgoff) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma, *prev; @@ -1131,9 +1143,7 @@ munmap_back: * memory use of this mapping. We only honor MAP_NORESERVE * if we're allowed to overcommit memory. */ - if ((flags & MAP_NORESERVE) && sysctl_overcommit_memory != OVERCOMMIT_NEVER) - vm_flags |= VM_NORESERVE; - if (!accountable) + if ((flags & MAP_NORESERVE) && should_overcommit(file)) vm_flags |= VM_NORESERVE; /* ^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-31 12:35 ` Hugh Dickins 2009-01-31 18:34 ` Linus Torvalds @ 2009-02-03 16:13 ` Lee Schermerhorn 2009-02-03 16:40 ` Linus Torvalds 2009-02-03 17:10 ` Hugh Dickins 1 sibling, 2 replies; 47+ messages in thread From: Lee Schermerhorn @ 2009-02-03 16:13 UTC (permalink / raw) To: Hugh Dickins Cc: Linus Torvalds, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Mikos Szeredi On Sat, 2009-01-31 at 12:35 +0000, Hugh Dickins wrote: <snip> > I have by now recalled why I chose to play those VM_ACCOUNT games: > /* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform > * shmem_zero_setup (perhaps called through /dev/zero's ->mmap) > * that memory reservation must be checked; but that reservation > * belongs to shared memory object, not to vma: so now clear it. > We need a way to communicate not-MAP_NORESERVE to shmem.c, and we don't > just need it in the explicit shmem_zero_setup() case, we also need it > for the (probably rare nowadays) case when mmap() is working on file ^^^^^^^^^^^^^^^^^^^^^^^^ > /dev/zero (drivers/char/mem.c mmap_zero()), rather than using MAP_ANON. This reminded me of something I'd seen recently looking at /proc/<pid>/[numa]_maps for <a large commercial database> on Linux/x86_64: 2adadf247000-2adadf2b2000 rwxp 2adadf247000 00:00 0 2adadf2b2000-2adadf2b3000 rwxs 00000000 68:31 55362966 <some file != /dev/zero> 2adadf2b9000-2adadf2c0000 rwxp 00000000 00:0e 4072 /dev/zero 2adadf2c0000-2adadf2d0000 rwxp 00000000 00:0e 4072 /dev/zero 2adadf2d0000-2adadf2e0000 rwxp 00000000 00:0e 4072 /dev/zero 2adadf2e0000-2adadf2f0000 rwxp 00000000 00:0e 4072 /dev/zero 2adadf2f0000-2adadf300000 rwxp 00000000 00:0e 4072 /dev/zero 2adadf300000-2adadf310000 rwxp 00000000 00:0e 4072 /dev/zero 2adadf310000-2adadf320000 rwxp 00000000 00:0e 4072 /dev/zero 2adadf320000-2adadf330000 rwxp 00000000 00:0e 4072 /dev/zero 2adadf330000-2adadf339000 rwxp 00077000 00:0e 4072 /dev/zero 2adadf353000-2adadf35a000 r-xp 00000000 69:02 1228822 /lib64/libnss_compat-2.4.so 2adadf35a000-2adadf459000 ---p 00007000 69:02 1228822 /lib64/libnss_compat-2.4.so 2adadf459000-2adadf45b000 rwxp 00006000 69:02 1228822 /lib64/libnss_compat-2.4.so 2adadf45b000-2adadf464000 r-xp 00000000 69:02 1228830 /lib64/libnss_nis-2.4.so 2adadf464000-2adadf564000 ---p 00009000 69:02 1228830 /lib64/libnss_nis-2.4.so 2adadf564000-2adadf566000 rwxp 00009000 69:02 1228830 /lib64/libnss_nis-2.4.so 2adadf566000-2adadf570000 r-xp 00000000 69:02 1228826 /lib64/libnss_files-2.4.so 2adadf570000-2adadf66f000 ---p 0000a000 69:02 1228826 /lib64/libnss_files-2.4.so 2adadf66f000-2adadf671000 rwxp 00009000 69:02 1228826 /lib64/libnss_files-2.4.so 2adadf671000-2adadf681000 rwxp 00000000 00:0e 4072 /dev/zero 2adadf681000-2adadf6a1000 rwxp 00000000 00:0e 4072 /dev/zero 2adadf6a1000-2adadf6b1000 rwxp 00000000 00:0e 4072 /dev/zero 2adadf6b1000-2adadf6c1000 rwxp 00000000 00:0e 4072 /dev/zero 2adadf6c1000-2adadf6d1000 rwxp 00000000 00:0e 4072 /dev/zero 2adadf6d1000-2adadf6e1000 rwxp 00000000 00:0e 4072 /dev/zero 2adadf6e1000-2adadf6f1000 rwxp 00000000 00:0e 4072 /dev/zero 2adadf6f1000-2adadf701000 rwxp 00000000 00:0e 4072 /dev/zero 2adadf701000-2adadf711000 rwxp 00000000 00:0e 4072 /dev/zero 2adadf711000-2adadf721000 rwxp 00000000 00:0e 4072 /dev/zero 2adadf721000-2adadf731000 rwxp 00000000 00:0e 4072 /dev/zero 2adadf731000-2adadf741000 rwxp 00000000 00:0e 4072 /dev/zero <and so on, for another 90 lines until> 7fffcdd36000-7fffcdd4e000 rwxp 7fffcdd36000 00:00 0 [stack] ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0 [vdso] For portability between Linux and various Unix-like systems that don't support MAP_ANON*, perhaps? Anyway, from the addresses and permissions, these all look potentially mergeable. The offset is preventing merging, right? I guess that's one of the downsides of mapping /dev/zero rather than using MAP_ANONYMOUS? Makes one wonder whether it would be worthwhile [not to mention possible] to rework mmap_zero() to mimic MAP_ANONYMOUS... Lee ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-02-03 16:13 ` Lee Schermerhorn @ 2009-02-03 16:40 ` Linus Torvalds 2009-02-03 17:10 ` Hugh Dickins 1 sibling, 0 replies; 47+ messages in thread From: Linus Torvalds @ 2009-02-03 16:40 UTC (permalink / raw) To: Lee Schermerhorn Cc: Hugh Dickins, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Mikos Szeredi On Tue, 3 Feb 2009, Lee Schermerhorn wrote: > > This reminded me of something I'd seen recently looking > at /proc/<pid>/[numa]_maps for <a large commercial database> on > Linux/x86_64: > > 2adadf2b9000-2adadf2c0000 rwxp 00000000 00:0e 4072 /dev/zero > > For portability between Linux and various Unix-like systems that don't > support MAP_ANON*, perhaps? Odd. At first I thought that it is just that Linux will turn a MAP_SHARED | MAP_ANON into that /dev/zero thing, so you won't be able to tell by lookup at /proc/maps. So it would be very possible that the application did not actually open /dev/zero at all, and used MAP_ANON instead (see the whole shmem_zero_setup() and shmem_file_setup() thing). But those mappings have that 'p' for private there, so it's not MAP_SHARED. And yes, that means that your large commercial database really did open /dev/zero and mapped it privately. They must be living in the past. > Anyway, from the addresses and permissions, these all look potentially > mergeable. The offset is preventing merging, right? I guess that's one > of the downsides of mapping /dev/zero rather than using MAP_ANONYMOUS? Yeah. The MAP_ANON code has a total hack: case MAP_PRIVATE: /* * Set pgoff according to addr for anon_vma. */ pgoff = addr >> PAGE_SHIFT; break; where the whole point is to allow sharing: since pgoff doesn't matter, we can make it be something that will merge _if_ you don't play games (of course, if you then start usign mremap to move things around, that all breaks, and you lose the merging ;) That said, if it's just a hundred segments, nobody really cares. It's going to make vma lookup fractionally slower, but not so anybody would likely ever notice even in benchmarks. And if it's just this one db, it's certainly not going to use any noticeable amount of memory either. Merging is important, but it's important to avoid the _really_ common cases, and to make /proc/maps more readable etc. It's not like it matters for the occasional crazy setup. But you could still try to teach the DB people to use MAP_ANON. Linus ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-02-03 16:13 ` Lee Schermerhorn 2009-02-03 16:40 ` Linus Torvalds @ 2009-02-03 17:10 ` Hugh Dickins 2009-02-03 21:50 ` Lee Schermerhorn 1 sibling, 1 reply; 47+ messages in thread From: Hugh Dickins @ 2009-02-03 17:10 UTC (permalink / raw) To: Lee Schermerhorn Cc: Linus Torvalds, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Mikos Szeredi On Tue, 3 Feb 2009, Lee Schermerhorn wrote: > On Sat, 2009-01-31 at 12:35 +0000, Hugh Dickins wrote: > > We need a way to communicate not-MAP_NORESERVE to shmem.c, and we don't > > just need it in the explicit shmem_zero_setup() case, we also need it > > for the (probably rare nowadays) case when mmap() is working on file > ^^^^^^^^^^^^^^^^^^^^^^^^ > > /dev/zero (drivers/char/mem.c mmap_zero()), rather than using MAP_ANON. > > > This reminded me of something I'd seen recently looking > at /proc/<pid>/[numa]_maps for <a large commercial database> on > Linux/x86_64: >... > 2adadf711000-2adadf721000 rwxp 00000000 00:0e 4072 /dev/zero > 2adadf721000-2adadf731000 rwxp 00000000 00:0e 4072 /dev/zero > 2adadf731000-2adadf741000 rwxp 00000000 00:0e 4072 /dev/zero > > <and so on, for another 90 lines until> > > 7fffcdd36000-7fffcdd4e000 rwxp 7fffcdd36000 00:00 0 [stack] > ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0 [vdso] > > For portability between Linux and various Unix-like systems that don't > support MAP_ANON*, perhaps? > > Anyway, from the addresses and permissions, these all look potentially > mergeable. The offset is preventing merging, right? I guess that's one > of the downsides of mapping /dev/zero rather than using MAP_ANONYMOUS? > > Makes one wonder whether it would be worthwhile [not to mention > possible] to rework mmap_zero() to mimic MAP_ANONYMOUS... That's certainly an interesting observation, and thank you for sharing it with us (hmm, I sound like a self-help group leader or something). I don't really have anything to add to what Linus said (and hadn't got around to realizing the significance of the "p" there before I saw his reply). Mmm, it's interesting, but I fear to add more hacks in there just for this - I guess we could, but I'd rather not, unless it becomes a serious issue. Let's just tuck away the knowledge of this case for now. Hugh ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-02-03 17:10 ` Hugh Dickins @ 2009-02-03 21:50 ` Lee Schermerhorn 0 siblings, 0 replies; 47+ messages in thread From: Lee Schermerhorn @ 2009-02-03 21:50 UTC (permalink / raw) To: Hugh Dickins Cc: Linus Torvalds, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Mikos Szeredi On Tue, 2009-02-03 at 17:10 +0000, Hugh Dickins wrote: > On Tue, 3 Feb 2009, Lee Schermerhorn wrote: > > On Sat, 2009-01-31 at 12:35 +0000, Hugh Dickins wrote: > > > We need a way to communicate not-MAP_NORESERVE to shmem.c, and we don't > > > just need it in the explicit shmem_zero_setup() case, we also need it > > > for the (probably rare nowadays) case when mmap() is working on file > > ^^^^^^^^^^^^^^^^^^^^^^^^ > > > /dev/zero (drivers/char/mem.c mmap_zero()), rather than using MAP_ANON. > > > > > > This reminded me of something I'd seen recently looking > > at /proc/<pid>/[numa]_maps for <a large commercial database> on > > Linux/x86_64: > >... > > 2adadf711000-2adadf721000 rwxp 00000000 00:0e 4072 /dev/zero > > 2adadf721000-2adadf731000 rwxp 00000000 00:0e 4072 /dev/zero > > 2adadf731000-2adadf741000 rwxp 00000000 00:0e 4072 /dev/zero > > > > <and so on, for another 90 lines until> > > > > 7fffcdd36000-7fffcdd4e000 rwxp 7fffcdd36000 00:00 0 [stack] > > ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0 [vdso] > > > > For portability between Linux and various Unix-like systems that don't > > support MAP_ANON*, perhaps? > > > > Anyway, from the addresses and permissions, these all look potentially > > mergeable. The offset is preventing merging, right? I guess that's one > > of the downsides of mapping /dev/zero rather than using MAP_ANONYMOUS? > > > > Makes one wonder whether it would be worthwhile [not to mention > > possible] to rework mmap_zero() to mimic MAP_ANONYMOUS... > > That's certainly an interesting observation, and thank you for sharing > it with us (hmm, I sound like a self-help group leader or something). > > I don't really have anything to add to what Linus said (and hadn't > got around to realizing the significance of the "p" there before I > saw his reply). > > Mmm, it's interesting, but I fear to add more hacks in there just > for this - I guess we could, but I'd rather not, unless it becomes > a serious issue. > > Let's just tuck away the knowledge of this case for now. Right. And a bit more info to tuck away... I routinely grab the proc maps and numa_maps from our largish servers running various "industry standard benchmarks". Prompted by Linus' comment that "if it's just a hundred segments, nobody really cares", I went back and looked a bit further at the maps for a recent run. Below are some segment counts for the run. The benchmark involved 32 "instances" of the application--a technique used to reduce contention on application internal resources as the user count increases--along with it's data base task[s]. Each instance spawns a few processes [5-6 average, up to ~14, for this run] that share a few instance-specific SYSV segments between them. In each instance, one of those shmem segments exhibits a similar pattern to the /dev/zero segments from the prior mail. Many, altho' not all, of the individual vmas are adjacent with the same permissions: 'r--s'. E.g., a small snippet: 2ac0e3cf0000-2ac0e40f5000 r--s 00d26000 00:08 15695938 /SYSV0000277a (deleted) 2ac0e40f5000-2ac0e4101000 r--s 0112b000 00:08 15695938 /SYSV0000277a (deleted) 2ac0e4101000-2ac0e4102000 r--s 01137000 00:08 15695938 /SYSV0000277a (deleted) 2ac0e4102000-2ac0e4113000 r--s 01138000 00:08 15695938 /SYSV0000277a (deleted) 2ac0e4113000-2ac0e4114000 r--s 01149000 00:08 15695938 /SYSV0000277a (deleted) 2ac0e4114000-2ac0e4115000 r--s 0114a000 00:08 15695938 /SYSV0000277a (deleted) 2ac0e4115000-2ac0e4116000 r--s 0114b000 00:08 15695938 /SYSV0000277a (deleted) 2ac0e4116000-2ac0e4117000 r--s 0114c000 00:08 15695938 /SYSV0000277a (deleted) I counted 2000-3600+ of these for a couple of tasks. How they got like this--one vma per page?--I'm not sure. Perhaps a sequence of mprotect() calls or such after attaching the segment. [I'll try to get an strace sometime.] Then I counted the occurrences of the patterns: '^2.*r--s.*/SYSV' in each of the instances as, again, each instance uses a different shmem segment among its tasks. For good measure, I counted the '/dev/zero' segments as well: SYSV shm /dev/zero instance 00 5771 217 instance 01 6025 183 instance 02 5738 176 instance 03 5798 177 instance 04 5709 182 instance 05 5423 915 instance 06 5513 929 instance 07 5915 180 instance 08 5802 182 instance 09 5690 177 instance 10 5643 177 instance 11 5647 180 instance 12 5656 182 instance 13 5672 181 instance 14 5522 180 instance 15 5497 180 instance 16 5594 179 instance 17 4922 906 instance 18 6956 935 instance 19 5769 181 instance 20 5771 180 instance 21 5712 180 instance 22 5711 184 instance 23 5631 179 instance 24 5586 180 instance 25 5640 180 instance 26 5614 176 instance 27 5523 176 instance 28 5600 179 instance 29 5473 177 instance 30 5581 180 instance 31 5470 180 A total of ~ 180K shmem segments, not counting the /dev/zero mappings. Good thing we have a lot of memory :). A couple of those segments per instance are different shmem segments--just 2 or 3 out of 5k-6k in the cases that I looked at. The benchmark seems to run fairly well, so I'm not saying we have a problem here--with the Linux kernel, anyway. Just some raw data from a pseudo-real-world application load. ['pseudo' because I'm told no real user would ever set up the app quite this way :)] Also, this is on a vintage 2.6.16+ kernel [not my choice]. Soon I'll have data from a much more recent release. Lee ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-30 21:12 ` Hugh Dickins 2009-01-30 21:25 ` Linus Torvalds 2009-01-30 21:36 ` Lee Schermerhorn @ 2009-01-30 21:37 ` Linus Torvalds 2009-01-31 12:16 ` Hugh Dickins 2 siblings, 1 reply; 47+ messages in thread From: Linus Torvalds @ 2009-01-30 21:37 UTC (permalink / raw) To: Hugh Dickins Cc: Lee Schermerhorn, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Mikos Szeredi On Fri, 30 Jan 2009, Hugh Dickins wrote: > > By the way, there's an argument to say that you should add > VM_MIXEDMAP to VM_CAN_NONLINEAR in VM_MERGEABLE_FLAGS: I don't > really care whether we merge the odd filemap_xip vma or not, > but it used to do so and now won't. > > By the same (used to merge, now won't) argument, one could say > VM_INSERTPAGE should be there too; but whereas VM_MIXEDMAP is used > in one place only, quite a lot of drivers use vm_insert_page(), so > I feel more comfortable with the idea that it's stopping merges - > though in that case, shouldn't we add it to VM_SPECIAL? VM_MIXEDMAP, yes. It's a special case. But VM_INSERTPAGE - no. Why? Because VM_INSERTPAGE implies that mmap() itself quite possibly actually _does_ something (it may have inserted _all_ the pages, and may not even have any ->fault handler at all). So we can't just expand a current mapping and forget about it, we need to do the mmap(). So the "only do merges early" fundamentally cannot merge such a vma, even if the old code did. Of course, we could look at whether it has a ->fault handler as an indication of whether it's possible to merge or not. We already do that for ->close, and in many ways ->fault would likely be a better indicator of whether something is mergeable or not. But there's really no point. VM_INSERTPAGE implies a very special mapping: it's used by things like the SCSI target user space ring map, or the magic network packet mmap thing. If you use those things, you really don't care about merging adjacent VM's anyway, and at least the two I looked at really do the whole "pre-populate at mmap time" thing. And at least the packet one really does have a ->close function, and lacks a ->fault function, so it wouldn't have merged before either (and looking at ->fault again seems to be as valid as looking at ->close). Linus ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-30 21:37 ` Linus Torvalds @ 2009-01-31 12:16 ` Hugh Dickins 0 siblings, 0 replies; 47+ messages in thread From: Hugh Dickins @ 2009-01-31 12:16 UTC (permalink / raw) To: Linus Torvalds Cc: Lee Schermerhorn, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Mikos Szeredi On Fri, 30 Jan 2009, Linus Torvalds wrote: > On Fri, 30 Jan 2009, Hugh Dickins wrote: > > > > By the same (used to merge, now won't) argument, one could say > > VM_INSERTPAGE should be there too; but whereas VM_MIXEDMAP is used > > in one place only, quite a lot of drivers use vm_insert_page(), so > > I feel more comfortable with the idea that it's stopping merges - > > though in that case, shouldn't we add it to VM_SPECIAL? > > But VM_INSERTPAGE - no. > > Why? Because VM_INSERTPAGE implies that mmap() itself quite possibly > actually _does_ something (it may have inserted _all_ the pages, and may > not even have any ->fault handler at all). > > So we can't just expand a current mapping and forget about it, we need to > do the mmap(). So the "only do merges early" fundamentally cannot merge > such a vma, even if the old code did. Good point indeed. Same as why we test for it in copy_page_range(). Should it be added to VM_SPECIAL? Not for this particular (need to call ->mmap) reason, but we probably ought to add it anyway. Hugh > > Of course, we could look at whether it has a ->fault handler as an > indication of whether it's possible to merge or not. We already do that > for ->close, and in many ways ->fault would likely be a better indicator > of whether something is mergeable or not. > > But there's really no point. VM_INSERTPAGE implies a very special mapping: > it's used by things like the SCSI target user space ring map, or the magic > network packet mmap thing. If you use those things, you really don't care > about merging adjacent VM's anyway, and at least the two I looked at > really do the whole "pre-populate at mmap time" thing. > > And at least the packet one really does have a ->close function, and lacks > a ->fault function, so it wouldn't have merged before either (and looking > at ->fault again seems to be as valid as looking at ->close). > > Linus ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-30 19:53 ` Lee Schermerhorn 2009-01-30 20:31 ` Linus Torvalds @ 2009-01-30 20:33 ` Hugh Dickins 2009-01-30 20:53 ` Randy Dunlap 2 siblings, 0 replies; 47+ messages in thread From: Hugh Dickins @ 2009-01-30 20:33 UTC (permalink / raw) To: Lee Schermerhorn Cc: Linus Torvalds, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki On Fri, 30 Jan 2009, Lee Schermerhorn wrote: > > I tried this patch atop 29-rc3 + your patch from yesterday with my > simple test program at http://free.linux/hp.com/~lts/Tests/mmap_lock.c. > > The test program shows the /proc/<pid>/maps before and after the mmap > and attempted merge. It's not merging: > > 7fd20a668000-7fd20a66a000 rw-p 7fd20a668000 00:00 0 > 7fd20a68a000-7fd20a68b000 r--s 00000000 68:23 6608852 /tmp/tmpfVx1SFL (deleted) > 7fd20a68b000-7fd20a68c000 r--s 00001000 68:23 6608852 /tmp/tmpfVx1SFL (deleted) > 7fd20a68c000-7fd20a690000 rw-p 7fd20a68c000 00:00 0 Some testing - now that's a great idea! Good spotting. > > Ad hoc instrumentation shows that it's the VM_ACCOUNT flag that is > different between the existing file segment and the one attempting the > merge: > > is_mergeable_vma: !mergable: vma flags: 0x80020f9:0x1020f9 > | |-VM_ACCOUNT > +-----------VM_CAN_NONLINEAR Sorry, I should have noticed that: VM_ACCOUNT was certainly on my mind as a flag that gives trouble when merging vmas, but I'd misconvinced myself that it wasn't a problem in this case. > > So happens, I'm mapping with MAP_SHARED, so the VM_ACCOUNT flag gets > cleared later in mmap_region(). Comments say that this is for checking > memory availability during shmem_file_setup(). Maybe we can move the > temporary setting of VM_ACCOUNT until just before the call to > shmem_zero_setup()? Please let me think on that - we can live with the status quo for the moment. I need to remind myself why I used that peculiar way of conveying info from mmap.c to shmem.c: the right answer may be just to pass another arg to shmem_zero_setup and shmem_file_setup. Hugh ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-30 19:53 ` Lee Schermerhorn 2009-01-30 20:31 ` Linus Torvalds 2009-01-30 20:33 ` Hugh Dickins @ 2009-01-30 20:53 ` Randy Dunlap 2009-01-30 20:59 ` Lee Schermerhorn 2 siblings, 1 reply; 47+ messages in thread From: Randy Dunlap @ 2009-01-30 20:53 UTC (permalink / raw) To: Lee Schermerhorn Cc: Linus Torvalds, Hugh Dickins, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki Lee Schermerhorn wrote: > I tried this patch atop 29-rc3 + your patch from yesterday with my > simple test program at http://free.linux/hp.com/~lts/Tests/mmap_lock.c. Is that URL correct? I can't find/access it. > The test program shows the /proc/<pid>/maps before and after the mmap > and attempted merge. It's not merging: Thanks, -- ~Randy ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-30 20:53 ` Randy Dunlap @ 2009-01-30 20:59 ` Lee Schermerhorn 2009-01-30 21:11 ` Will Crowder 0 siblings, 1 reply; 47+ messages in thread From: Lee Schermerhorn @ 2009-01-30 20:59 UTC (permalink / raw) To: Randy Dunlap Cc: Linus Torvalds, Hugh Dickins, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki On Fri, 2009-01-30 at 12:53 -0800, Randy Dunlap wrote: > Lee Schermerhorn wrote: > > > I tried this patch atop 29-rc3 + your patch from yesterday with my > > simple test program at http://free.linux/hp.com/~lts/Tests/mmap_lock.c. D'oh! that's ".hp.com" > Is that URL correct? I can't find/access it. Sorry, should have cut and pasted, like here: http://free.linux.hp.com/~lts/Tests/mmap_lock.c Lee ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-30 20:59 ` Lee Schermerhorn @ 2009-01-30 21:11 ` Will Crowder 0 siblings, 0 replies; 47+ messages in thread From: Will Crowder @ 2009-01-30 21:11 UTC (permalink / raw) To: Lee Schermerhorn Cc: Randy Dunlap, Linus Torvalds, Hugh Dickins, Greg KH, Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki What, no ".linux" TLD? :-) Oh well...in time, time... Will Crowder On Fri, 2009-01-30 at 15:59 -0500, Lee Schermerhorn wrote: > On Fri, 2009-01-30 at 12:53 -0800, Randy Dunlap wrote: > > Lee Schermerhorn wrote: > > > > > I tried this patch atop 29-rc3 + your patch from yesterday with my > > > simple test program at http://free.linux/hp.com/~lts/Tests/mmap_lock.c. > > D'oh! that's ".hp.com" > > > Is that URL correct? I can't find/access it. > > Sorry, should have cut and pasted, like here: > > http://free.linux.hp.com/~lts/Tests/mmap_lock.c > > > Lee > ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-30 17:40 ` Hugh Dickins 2009-01-30 18:14 ` Linus Torvalds @ 2009-01-30 23:44 ` Greg KH 1 sibling, 0 replies; 47+ messages in thread From: Greg KH @ 2009-01-30 23:44 UTC (permalink / raw) To: Hugh Dickins Cc: Linus Torvalds, Maksim Yevmenkin, Lee Schermerhorn, linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki On Fri, Jan 30, 2009 at 05:40:24PM +0000, Hugh Dickins wrote: > On Fri, 30 Jan 2009, Linus Torvalds wrote: > > On Thu, 29 Jan 2009, Greg KH wrote: > > > > > > Which version was the "non-cleanup" version that should be added to the > > > stable trees? > > > > There were two different versions: > > > > From: Andrew Morton <akpm@linux-foundation.org> > > Subject: Re: possible bug in mmap_region() in linux-2.6.28 kernel > > Message-Id: <20090128134350.034ac6a7.akpm@linux-foundation.org> > > > > From: Lee Schermerhorn <Lee.Schermerhorn@hp.com> > > Subject: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments > > Message-Id: <1233259410.2315.75.camel@lts-notebook> > > > > and I'm actually not at all sure which one should go into stable (or if we > > should just pick the same one that went into mainline). > > > ... > > > > But none of the above really changes the fact that the patch I committed > > to mainline was really quite fundamentally more invasive than either of > > the "simple" patches. All three patches are small, with mine arguably the > > smallest of the lot, but mine actually changed semantics, while Andrew's > > and Lee's patch literally only fix the invalid pointer use. > > > > I'll leave it to others to decide which one goes into -stable. I > > personally don't really think it matters. I argue above that mine is > > pretty safe and thus perfectly fine even for -stable, but reality has a > > habit of sometimes disagreeing with me. Dang. > > I'd say one of the non-cleanup versions for -stable > (but I've not compared them to see which one is better). Ok, based on both of your comments about this, and the fact that the in-tree one did break something, I'll go look at Andrew and Lee's versions and pick one of them for -stable. thanks, greg k-h ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-29 20:48 ` Linus Torvalds 2009-01-29 22:32 ` Hugh Dickins 2009-01-29 22:47 ` Maksim Yevmenkin @ 2009-01-30 8:34 ` Peter Zijlstra 2009-01-30 16:45 ` Linus Torvalds 2 siblings, 1 reply; 47+ messages in thread From: Peter Zijlstra @ 2009-01-30 8:34 UTC (permalink / raw) To: Linus Torvalds Cc: Lee Schermerhorn, linux-kernel, Maksim Yevmenkin, Nick Piggin, Andrew Morton, Greg Kroah-Hartman, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki On Thu, 2009-01-29 at 12:48 -0800, Linus Torvalds wrote: > I still want somebody else to look at and think about it, though. > > Linus > > --- > mm/mmap.c | 26 ++++++-------------------- > 1 files changed, 6 insertions(+), 20 deletions(-) > > diff --git a/mm/mmap.c b/mm/mmap.c > index 8d95902..d3fa10a 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1134,16 +1134,11 @@ munmap_back: > } > > /* > - * Can we just expand an old private anonymous mapping? > - * The VM_SHARED test is necessary because shmem_zero_setup > - * will create the file object for a shared anonymous map below. > + * Can we just expand an old mapping? > */ > - if (!file && !(vm_flags & VM_SHARED)) { > - vma = vma_merge(mm, prev, addr, addr + len, vm_flags, > - NULL, NULL, pgoff, NULL); > - if (vma) > - goto out; > - } > + vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL); > + if (vma) > + goto out; You've made checkpatch unhappy ;-) So we don't bother with anonymous only, always attempt the merge. > @@ -1206,17 +1201,8 @@ munmap_back: > if (vma_wants_writenotify(vma)) > vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED); > > - if (file && vma_merge(mm, prev, addr, vma->vm_end, > - vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) { > - mpol_put(vma_policy(vma)); > - kmem_cache_free(vm_area_cachep, vma); > - fput(file); > - if (vm_flags & VM_EXECUTABLE) > - removed_exe_file_vma(mm); > - } else { > - vma_link(mm, vma, prev, rb_link, rb_parent); > - file = vma->vm_file; > - } > + vma_link(mm, vma, prev, rb_link, rb_parent); > + file = vma->vm_file; And here we don't bother merging because that would have been done before. Assuming ->mmap() doesn't go wild, in which case it ought to have set a VM_SPECIAL bit anyway to discourage merging. [ And even if it didn't, failing to merge shouldn't be a problem, as minimizing the vmas is an optimization, not a strict requirement afaik. ] The obvious glaring difference is the vma_policy() cruft. But staring at the code a bit I can't see how the new vma can have acquired a vm_policy here, so it ought to not matter. Looks ok to my eyes, so I guess: Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-30 8:34 ` Peter Zijlstra @ 2009-01-30 16:45 ` Linus Torvalds 2009-01-30 16:49 ` Randy Dunlap 0 siblings, 1 reply; 47+ messages in thread From: Linus Torvalds @ 2009-01-30 16:45 UTC (permalink / raw) To: Peter Zijlstra Cc: Lee Schermerhorn, linux-kernel, Maksim Yevmenkin, Nick Piggin, Andrew Morton, Greg Kroah-Hartman, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki On Fri, 30 Jan 2009, Peter Zijlstra wrote: > > > + vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL); > > + if (vma) > > + goto out; > > You've made checkpatch unhappy ;-) Yeah, we need to fix that piece of crud. checkpatch, that is. I think "git grep" is more important than the "ooh, I have a tiny d*ck, everybody else should have one too" argument against larger terminals. Linus ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments 2009-01-30 16:45 ` Linus Torvalds @ 2009-01-30 16:49 ` Randy Dunlap 0 siblings, 0 replies; 47+ messages in thread From: Randy Dunlap @ 2009-01-30 16:49 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Lee Schermerhorn, linux-kernel, Maksim Yevmenkin, Nick Piggin, Andrew Morton, Greg Kroah-Hartman, will, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki Linus Torvalds wrote: > > On Fri, 30 Jan 2009, Peter Zijlstra wrote: >>> + vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL); >>> + if (vma) >>> + goto out; >> You've made checkpatch unhappy ;-) > > Yeah, we need to fix that piece of crud. > > checkpatch, that is. It's just advisory, fwiw. > I think "git grep" is more important than the "ooh, I have a tiny d*ck, > everybody else should have one too" argument against larger terminals. and fix the Documentation/CodingStyle recommendations? -- ~Randy ^ permalink raw reply [flat|nested] 47+ messages in thread
end of thread, other threads:[~2009-02-03 21:50 UTC | newest]
Thread overview: 47+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <bb4a86c70901281151w4300605r3882461cd6e9774a@mail.gmail.com>
[not found] ` <alpine.LFD.2.00.0901281316450.3123@localhost.localdomain>
2009-01-29 20:03 ` [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments Lee Schermerhorn
2009-01-29 20:33 ` Linus Torvalds
2009-01-29 20:48 ` Linus Torvalds
2009-01-29 22:32 ` Hugh Dickins
2009-01-29 23:02 ` Linus Torvalds
2009-01-30 4:43 ` Lee Schermerhorn
2009-01-30 4:49 ` Linus Torvalds
2009-01-29 22:47 ` Maksim Yevmenkin
2009-01-29 22:48 ` Randy Dunlap
2009-01-29 23:31 ` Maksim Yevmenkin
2009-01-30 2:08 ` Linus Torvalds
2009-01-30 5:56 ` Greg KH
2009-01-30 16:36 ` Linus Torvalds
2009-01-30 17:40 ` Hugh Dickins
2009-01-30 18:14 ` Linus Torvalds
2009-01-30 18:30 ` Hugh Dickins
2009-01-30 19:53 ` Lee Schermerhorn
2009-01-30 20:31 ` Linus Torvalds
2009-01-30 21:12 ` Hugh Dickins
2009-01-30 21:25 ` Linus Torvalds
2009-01-30 21:36 ` Lee Schermerhorn
2009-01-30 22:27 ` Linus Torvalds
2009-01-31 12:35 ` Hugh Dickins
2009-01-31 18:34 ` Linus Torvalds
2009-02-02 11:59 ` KOSAKI Motohiro
2009-02-02 12:54 ` Hugh Dickins
2009-02-02 14:10 ` KOSAKI Motohiro
2009-02-02 18:58 ` Mel Gorman
2009-02-02 19:23 ` Linus Torvalds
2009-02-02 21:50 ` Mel Gorman
2009-02-02 22:12 ` Linus Torvalds
2009-02-02 22:35 ` Mel Gorman
2009-02-02 18:33 ` Mel Gorman
2009-02-03 16:13 ` Lee Schermerhorn
2009-02-03 16:40 ` Linus Torvalds
2009-02-03 17:10 ` Hugh Dickins
2009-02-03 21:50 ` Lee Schermerhorn
2009-01-30 21:37 ` Linus Torvalds
2009-01-31 12:16 ` Hugh Dickins
2009-01-30 20:33 ` Hugh Dickins
2009-01-30 20:53 ` Randy Dunlap
2009-01-30 20:59 ` Lee Schermerhorn
2009-01-30 21:11 ` Will Crowder
2009-01-30 23:44 ` Greg KH
2009-01-30 8:34 ` Peter Zijlstra
2009-01-30 16:45 ` Linus Torvalds
2009-01-30 16:49 ` Randy Dunlap
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox