public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
       [not found] ` <alpine.LFD.2.00.0901281316450.3123@localhost.localdomain>
@ 2009-01-29 20:03   ` Lee Schermerhorn
  2009-01-29 20:33     ` Linus Torvalds
  0 siblings, 1 reply; 47+ messages in thread
From: Lee Schermerhorn @ 2009-01-29 20:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Maksim Yevmenkin, Nick Piggin, Andrew Morton,
	Greg Kroah-Hartman, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki

Against:  2.6.28.2

We see a:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: [<ffffffff80336805>] __downgrade_write+0x43/0xb4
:
Call Trace:
 [<ffffffff80284dec>] mlock_vma_pages_range+0x53/0xc3
 [<ffffffff80287021>] mmap_region+0x389/0x471
 [<ffffffff802876b5>] do_mmap_pgoff+0x308/0x36d
 [<ffffffff802100a8>] sys_mmap+0x8b/0x110
 [<ffffffff8020bc8b>] system_call_fastpath+0x16/0x1b

when mmap_region() merges two segments of a file mmap()ed with
MAP_LOCKED [VM_LOCKED].  

This patch provides a simplistic fix, suitable, I hope, for -stable.

Pass the merged vma to mlock_vma_pages_range().

Tested with:  http://free.linux.hp.com/~lts/Tests/mmap_lock.c

I'll attempt a rework of mlock_vma_pages_range(), et al, to
eliminate the 'vma' arg for 29-rc?.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 mm/mmap.c |   23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

Index: linux-2.6.28.2/mm/mmap.c
===================================================================
--- linux-2.6.28.2.orig/mm/mmap.c	2009-01-29 11:34:08.000000000 -0500
+++ linux-2.6.28.2/mm/mmap.c	2009-01-29 12:22:14.000000000 -0500
@@ -1094,7 +1094,7 @@ unsigned long mmap_region(struct file *f
 			  int accountable)
 {
 	struct mm_struct *mm = current->mm;
-	struct vm_area_struct *vma, *prev;
+	struct vm_area_struct *vma, *prev, *merged = NULL;
 	int correct_wcount = 0;
 	int error;
 	struct rb_node **rb_link, *rb_parent;
@@ -1207,14 +1207,19 @@ munmap_back:
 	if (vma_wants_writenotify(vma))
 		vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);
 
-	if (file && vma_merge(mm, prev, addr, vma->vm_end,
-			vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) {
-		mpol_put(vma_policy(vma));
-		kmem_cache_free(vm_area_cachep, vma);
-		fput(file);
-		if (vm_flags & VM_EXECUTABLE)
-			removed_exe_file_vma(mm);
-	} else {
+	if (file) {
+		merged = vma_merge(mm, prev, addr, vma->vm_end, vma->vm_flags,
+					NULL, file, pgoff, vma_policy(vma));
+		if (merged) {
+			mpol_put(vma_policy(vma));
+			kmem_cache_free(vm_area_cachep, vma);
+			fput(file);
+			if (vm_flags & VM_EXECUTABLE)
+				removed_exe_file_vma(mm);
+			vma = merged;	/* for mlock_vma_pages_range() */
+		}
+	}
+	if (!merged) {
 		vma_link(mm, vma, prev, rb_link, rb_parent);
 		file = vma->vm_file;
 	}



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-29 20:03   ` [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments Lee Schermerhorn
@ 2009-01-29 20:33     ` Linus Torvalds
  2009-01-29 20:48       ` Linus Torvalds
  0 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2009-01-29 20:33 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-kernel, Maksim Yevmenkin, Nick Piggin, Andrew Morton,
	Greg Kroah-Hartman, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki



On Thu, 29 Jan 2009, Lee Schermerhorn wrote:
> 
> This patch provides a simplistic fix, suitable, I hope, for -stable.

So I've looked a bit at that function, and I wonder why we can't just do 
the "vma_merge()" earlier - before we allocate that whole vma to begin 
with.

We already do that for anonymous mappings _anyway_, using the exact same 
"vma_merge()" function. So how about just expanding on that logic, and 
simplifying everything at the same time?

THIS PATCH IS TOTALLY UNTESTED!

There may be some reason why we didn't do it this way to begin with, but 
this woudl actually seem to be the more logical way to fix the problem: by 
simplifying the whole function to the point where we never try to do that 
"try to merge late and then fix up the fact that we have to free the vma 
we allocated earlier" thing at all!

Hmm? This would remove a more lines than it adds, and actually makes 
things more readable. If it works. Anybody see why it wouldn't?

		Linus

---
 mm/mmap.c |   30 ++++++++++--------------------
 1 files changed, 10 insertions(+), 20 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 8d95902..3f78ead 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -769,6 +769,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 	if (vm_flags & VM_SPECIAL)
 		return NULL;
 
+	/* Anonymous shared mappings are unsharable */
+	if ((vm_flags & VM_SHARED) && !file)
+		return NULL;
+
 	if (prev)
 		next = prev->vm_next;
 	else
@@ -1134,16 +1138,11 @@ munmap_back:
 	}
 
 	/*
-	 * Can we just expand an old private anonymous mapping?
-	 * The VM_SHARED test is necessary because shmem_zero_setup
-	 * will create the file object for a shared anonymous map below.
+	 * Can we just expand an old mapping?
 	 */
-	if (!file && !(vm_flags & VM_SHARED)) {
-		vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
-					NULL, NULL, pgoff, NULL);
-		if (vma)
-			goto out;
-	}
+	vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
+	if (vma)
+		goto out;
 
 	/*
 	 * Determine the object being mapped and call the appropriate
@@ -1206,17 +1205,8 @@ munmap_back:
 	if (vma_wants_writenotify(vma))
 		vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);
 
-	if (file && vma_merge(mm, prev, addr, vma->vm_end,
-			vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) {
-		mpol_put(vma_policy(vma));
-		kmem_cache_free(vm_area_cachep, vma);
-		fput(file);
-		if (vm_flags & VM_EXECUTABLE)
-			removed_exe_file_vma(mm);
-	} else {
-		vma_link(mm, vma, prev, rb_link, rb_parent);
-		file = vma->vm_file;
-	}
+	vma_link(mm, vma, prev, rb_link, rb_parent);
+	file = vma->vm_file;
 
 	/* Once vma denies write, undo our temporary denial count */
 	if (correct_wcount)

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-29 20:33     ` Linus Torvalds
@ 2009-01-29 20:48       ` Linus Torvalds
  2009-01-29 22:32         ` Hugh Dickins
                           ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Linus Torvalds @ 2009-01-29 20:48 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: linux-kernel, Maksim Yevmenkin, Nick Piggin, Andrew Morton,
	Greg Kroah-Hartman, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki



On Thu, 29 Jan 2009, Linus Torvalds wrote:
> 
> THIS PATCH IS TOTALLY UNTESTED!

Well, it boots. FWIW. I've not really tested anything interesting with it, 
but any potential breakage is at least not catastrophic and immediate.

> diff --git a/mm/mmap.c b/mm/mmap.c
> index 8d95902..3f78ead 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -769,6 +769,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>  	if (vm_flags & VM_SPECIAL)
>  		return NULL;
>  
> +	/* Anonymous shared mappings are unsharable */
> +	if ((vm_flags & VM_SHARED) && !file)
> +		return NULL;
> +

.. and I think this part of it is actually unnecessary, because what 
happens is that a shared anon mapping is turned into a shmem mapping when 
it is inserted, and that actually ends up allocating a file for it. So the 
vma->vm_file for anon mappings will not match a NULL file pointer 
_anyway_, so there's no way it would end up merging.

So my patch can be further simplified, I think, to just the following. 
Even more total lines removed.

I still want somebody else to look at and think about it, though.

		Linus

---
 mm/mmap.c |   26 ++++++--------------------
 1 files changed, 6 insertions(+), 20 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 8d95902..d3fa10a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1134,16 +1134,11 @@ munmap_back:
 	}
 
 	/*
-	 * Can we just expand an old private anonymous mapping?
-	 * The VM_SHARED test is necessary because shmem_zero_setup
-	 * will create the file object for a shared anonymous map below.
+	 * Can we just expand an old mapping?
 	 */
-	if (!file && !(vm_flags & VM_SHARED)) {
-		vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
-					NULL, NULL, pgoff, NULL);
-		if (vma)
-			goto out;
-	}
+	vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
+	if (vma)
+		goto out;
 
 	/*
 	 * Determine the object being mapped and call the appropriate
@@ -1206,17 +1201,8 @@ munmap_back:
 	if (vma_wants_writenotify(vma))
 		vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);
 
-	if (file && vma_merge(mm, prev, addr, vma->vm_end,
-			vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) {
-		mpol_put(vma_policy(vma));
-		kmem_cache_free(vm_area_cachep, vma);
-		fput(file);
-		if (vm_flags & VM_EXECUTABLE)
-			removed_exe_file_vma(mm);
-	} else {
-		vma_link(mm, vma, prev, rb_link, rb_parent);
-		file = vma->vm_file;
-	}
+	vma_link(mm, vma, prev, rb_link, rb_parent);
+	file = vma->vm_file;
 
 	/* Once vma denies write, undo our temporary denial count */
 	if (correct_wcount)

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-29 20:48       ` Linus Torvalds
@ 2009-01-29 22:32         ` Hugh Dickins
  2009-01-29 23:02           ` Linus Torvalds
  2009-01-29 22:47         ` Maksim Yevmenkin
  2009-01-30  8:34         ` Peter Zijlstra
  2 siblings, 1 reply; 47+ messages in thread
From: Hugh Dickins @ 2009-01-29 22:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Lee Schermerhorn, linux-kernel, Maksim Yevmenkin, Nick Piggin,
	Andrew Morton, Greg Kroah-Hartman, will, Rik van Riel,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Thu, 29 Jan 2009, Linus Torvalds wrote:
> On Thu, 29 Jan 2009, Linus Torvalds wrote:
> > 
> > THIS PATCH IS TOTALLY UNTESTED!

Sorry, I was away yesterday, and not yet caught up with things today.

I certainly agree that it would be much nicer not to allocated a vma
in one place, then later throw it away on finding it can be merged:
your patch looks a plausible improvement.

The problem has always been that file->f_op->mmap(file, vma) on a
special file can do various strange things, changing surprising
fields of the struct vma passed to it, changing its mergeability.

Now it may well be that every driver which does something strange
there already sets one of the VM_SPECIAL flags which prevent merging,
or can easily be fixed to do so, or otherwise clearly cannot pose a
problem.

But I need to mull it over and do some checking tomorrow:
anything I say tonight, I'd probably change my mind about.

Hugh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED  file segments
  2009-01-29 20:48       ` Linus Torvalds
  2009-01-29 22:32         ` Hugh Dickins
@ 2009-01-29 22:47         ` Maksim Yevmenkin
  2009-01-29 22:48           ` Randy Dunlap
  2009-01-30  2:08           ` Linus Torvalds
  2009-01-30  8:34         ` Peter Zijlstra
  2 siblings, 2 replies; 47+ messages in thread
From: Maksim Yevmenkin @ 2009-01-29 22:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Lee Schermerhorn, linux-kernel, Nick Piggin, Andrew Morton,
	Greg Kroah-Hartman, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki

On Thu, Jan 29, 2009 at 12:48 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 29 Jan 2009, Linus Torvalds wrote:
>>
>> THIS PATCH IS TOTALLY UNTESTED!
>
> Well, it boots. FWIW. I've not really tested anything interesting with it,
> but any potential breakage is at least not catastrophic and immediate.
>
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 8d95902..3f78ead 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -769,6 +769,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>>       if (vm_flags & VM_SPECIAL)
>>               return NULL;
>>
>> +     /* Anonymous shared mappings are unsharable */
>> +     if ((vm_flags & VM_SHARED) && !file)
>> +             return NULL;
>> +
>
> .. and I think this part of it is actually unnecessary, because what
> happens is that a shared anon mapping is turned into a shmem mapping when
> it is inserted, and that actually ends up allocating a file for it. So the
> vma->vm_file for anon mappings will not match a NULL file pointer
> _anyway_, so there's no way it would end up merging.
>
> So my patch can be further simplified, I think, to just the following.
> Even more total lines removed.
>
> I still want somebody else to look at and think about it, though.

Just to confirm. This patch also appear to fix the immediate issue for us.

thanks,
max

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-29 22:47         ` Maksim Yevmenkin
@ 2009-01-29 22:48           ` Randy Dunlap
  2009-01-29 23:31             ` Maksim Yevmenkin
  2009-01-30  2:08           ` Linus Torvalds
  1 sibling, 1 reply; 47+ messages in thread
From: Randy Dunlap @ 2009-01-29 22:48 UTC (permalink / raw)
  To: Maksim Yevmenkin
  Cc: Linus Torvalds, Lee Schermerhorn, linux-kernel, Nick Piggin,
	Andrew Morton, Greg Kroah-Hartman, will, Rik van Riel,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki

Maksim Yevmenkin wrote:
> On Thu, Jan 29, 2009 at 12:48 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Thu, 29 Jan 2009, Linus Torvalds wrote:
>>> THIS PATCH IS TOTALLY UNTESTED!
>> Well, it boots. FWIW. I've not really tested anything interesting with it,
>> but any potential breakage is at least not catastrophic and immediate.
>>
>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>> index 8d95902..3f78ead 100644
>>> --- a/mm/mmap.c
>>> +++ b/mm/mmap.c
>>> @@ -769,6 +769,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>>>       if (vm_flags & VM_SPECIAL)
>>>               return NULL;
>>>
>>> +     /* Anonymous shared mappings are unsharable */
>>> +     if ((vm_flags & VM_SHARED) && !file)
>>> +             return NULL;
>>> +
>> .. and I think this part of it is actually unnecessary, because what
>> happens is that a shared anon mapping is turned into a shmem mapping when
>> it is inserted, and that actually ends up allocating a file for it. So the
>> vma->vm_file for anon mappings will not match a NULL file pointer
>> _anyway_, so there's no way it would end up merging.
>>
>> So my patch can be further simplified, I think, to just the following.
>> Even more total lines removed.
>>
>> I still want somebody else to look at and think about it, though.
> 
> Just to confirm. This patch also appear to fix the immediate issue for us.

Is there a (small) test program available?

Thanks,
-- 
~Randy

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-29 22:32         ` Hugh Dickins
@ 2009-01-29 23:02           ` Linus Torvalds
  2009-01-30  4:43             ` Lee Schermerhorn
  0 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2009-01-29 23:02 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Lee Schermerhorn, linux-kernel, Maksim Yevmenkin, Nick Piggin,
	Andrew Morton, Greg Kroah-Hartman, will, Rik van Riel,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki, David S. Miller



On Thu, 29 Jan 2009, Hugh Dickins wrote:
> 
> The problem has always been that file->f_op->mmap(file, vma) on a
> special file can do various strange things, changing surprising
> fields of the struct vma passed to it, changing its mergeability.

Ahh, you're right. That's somewhat bogus, but it does explain why we have 
those two separate merge_vma() calls.

And as you also point out:

> Now it may well be that every driver which does something strange
> there already sets one of the VM_SPECIAL flags which prevent merging,
> or can easily be fixed to do so, or otherwise clearly cannot pose a
> problem.

.. and I think this is the right answer. If a device driver really does 
something as odd as play games with the offset that would make merging 
wrong, it had better set some of the VM_SPECIAL bits (presumably VM_IO) in 
order to never merge at all.

I added DaveM to the cc, since he's the one who is credited with the 
comment saying that several drivers change addr (vm_start). I only really 
see one, namely bf54x-lq043fb.c, and that one really depends on the whole 
nommu thing (ie it seems to simply require a particular virtual address, 
broken as that may be).

Btw, changing vm_start is actually very very wrong - it cannot work in 
general. Why? Because we earlier checked the "overlapping mmap" case with 
the original vm_start, an a driver that changes its vm_start in its mmap() 
routine would cause odd issues there. We also pass in "prev" to vma_link, 
so it could end up in totally the wrong place.

So changing vm_start is a sign of a driver bug, nothing less.

But there are more drivers who do change vm_pgoff, which is at least valid 
(if a bit dodgy), and is indeed a mergeability issue.  But at least a 
subset of those do already set VM_IO (eg fbmem.c - I only looked at one, 
and was happy that it already did that).

As background for Davem, here's the patch once more.

			Linus

---
 mm/mmap.c |   26 ++++++--------------------
 1 files changed, 6 insertions(+), 20 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 8d95902..d3fa10a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1134,16 +1134,11 @@ munmap_back:
 	}
 
 	/*
-	 * Can we just expand an old private anonymous mapping?
-	 * The VM_SHARED test is necessary because shmem_zero_setup
-	 * will create the file object for a shared anonymous map below.
+	 * Can we just expand an old mapping?
 	 */
-	if (!file && !(vm_flags & VM_SHARED)) {
-		vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
-					NULL, NULL, pgoff, NULL);
-		if (vma)
-			goto out;
-	}
+	vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
+	if (vma)
+		goto out;
 
 	/*
 	 * Determine the object being mapped and call the appropriate
@@ -1206,17 +1201,8 @@ munmap_back:
 	if (vma_wants_writenotify(vma))
 		vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);
 
-	if (file && vma_merge(mm, prev, addr, vma->vm_end,
-			vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) {
-		mpol_put(vma_policy(vma));
-		kmem_cache_free(vm_area_cachep, vma);
-		fput(file);
-		if (vm_flags & VM_EXECUTABLE)
-			removed_exe_file_vma(mm);
-	} else {
-		vma_link(mm, vma, prev, rb_link, rb_parent);
-		file = vma->vm_file;
-	}
+	vma_link(mm, vma, prev, rb_link, rb_parent);
+	file = vma->vm_file;
 
 	/* Once vma denies write, undo our temporary denial count */
 	if (correct_wcount)

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED  file segments
  2009-01-29 22:48           ` Randy Dunlap
@ 2009-01-29 23:31             ` Maksim Yevmenkin
  0 siblings, 0 replies; 47+ messages in thread
From: Maksim Yevmenkin @ 2009-01-29 23:31 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Linus Torvalds, Lee Schermerhorn, linux-kernel, Nick Piggin,
	Andrew Morton, Greg Kroah-Hartman, will, Rik van Riel,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Thu, Jan 29, 2009 at 2:48 PM, Randy Dunlap <randy.dunlap@oracle.com> wrote:
> Maksim Yevmenkin wrote:
>> On Thu, Jan 29, 2009 at 12:48 PM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>> On Thu, 29 Jan 2009, Linus Torvalds wrote:
>>>> THIS PATCH IS TOTALLY UNTESTED!
>>> Well, it boots. FWIW. I've not really tested anything interesting with it,
>>> but any potential breakage is at least not catastrophic and immediate.
>>>
>>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>>> index 8d95902..3f78ead 100644
>>>> --- a/mm/mmap.c
>>>> +++ b/mm/mmap.c
>>>> @@ -769,6 +769,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>>>>       if (vm_flags & VM_SPECIAL)
>>>>               return NULL;
>>>>
>>>> +     /* Anonymous shared mappings are unsharable */
>>>> +     if ((vm_flags & VM_SHARED) && !file)
>>>> +             return NULL;
>>>> +
>>> .. and I think this part of it is actually unnecessary, because what
>>> happens is that a shared anon mapping is turned into a shmem mapping when
>>> it is inserted, and that actually ends up allocating a file for it. So the
>>> vma->vm_file for anon mappings will not match a NULL file pointer
>>> _anyway_, so there's no way it would end up merging.
>>>
>>> So my patch can be further simplified, I think, to just the following.
>>> Even more total lines removed.
>>>
>>> I still want somebody else to look at and think about it, though.
>>
>> Just to confirm. This patch also appear to fix the immediate issue for us.
>
> Is there a (small) test program available?

Yes, it was in the original (first) email. Here it is again

/*
 * Program to provoke kernel NULL pointer de-reference during
 * mmap(...MAP_LOCKED...) in Linux 2.6.28.
 *
 * 1. Create a 32KB test file in /tmp (avoids mlock limit on all recent
 *    Linuxes).
 * 2. mmap it with MAP_LOCKED from top to bottom.  (Provokes the oops,
 *    since vmas can be merged in this case.)
 * 3. Clean up.
 *
 * Compile:
 *
 *	gcc maplock-bug.c -o maplog-bug
 */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>

#define	SIZE	(32*1024)	/* Will get rounded down to page size if nec. */

static char tmp[] = "./maplock-bug.XXXXXX";
static char junkbuf[SIZE];

int
main(void)
{
    int fd;
    int ps = getpagesize();
    size_t sz = (SIZE / ps) * ps;
    void **addrs;
    off_t off;
    int i;

    if ((addrs = malloc((sz / ps) * sizeof (*addrs))) == 0) {
	perror("malloc");
	exit(1);
    }

    if ((fd = mkstemp(tmp)) < 0) {
	perror("mkstemp");
	exit(1);
    }

    if (write(fd, junkbuf, sz) != sz) {
	perror("write");
	exit(1);
    }

    if (close(fd) < 0) {
	perror("close");
	exit(1);
    }

    if ((fd = open(tmp, O_RDONLY)) < 0) {
	perror("open");
	exit(1);
    }

    for (off = sz - ps, i = 0; off >= 0; off -= ps, i++) {
	if ((addrs[i] =
	     mmap(0, ps, PROT_READ, MAP_SHARED|MAP_LOCKED,
		  fd, off)) == MAP_FAILED) {
	    perror("mmap");
	    exit(1);
	}

	printf("Mapped offset 0x%jx at %p\n",
	       (uintmax_t)off, addrs[i]);
    }

    if (close(fd) < 0) {
	perror("close");
	exit(1);
    }

    for (i = 0; i < sz / ps; i++) {
	if (munmap(addrs[i], ps) < 0) {
	    perror("munmap");
	    exit(1);
	}
	printf("Unmapped %p\n", addrs[i]);
    }

    if (unlink(tmp) < 0) {
	perror("unlink");
	exit(1);
    }

    printf("Done\n");
}

Thanks,
max

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-29 22:47         ` Maksim Yevmenkin
  2009-01-29 22:48           ` Randy Dunlap
@ 2009-01-30  2:08           ` Linus Torvalds
  2009-01-30  5:56             ` Greg KH
  1 sibling, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2009-01-30  2:08 UTC (permalink / raw)
  To: Maksim Yevmenkin
  Cc: Lee Schermerhorn, linux-kernel, Nick Piggin, Andrew Morton,
	Greg Kroah-Hartman, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki



On Thu, 29 Jan 2009, Maksim Yevmenkin wrote:
> 
> Just to confirm. This patch also appear to fix the immediate issue for us.

Ok, I ended up committing it.

I didn't do the whole "Cc: stable@kernel.org" thing, because maybe the 
non-cleanup version is less controversial for stable, but I decided that 
there's no way I don't want the cleanup done in mainline kernel, and if we 
end up needing a few VM_IO's added, I think it's worth it. Even after -rc3 
(since I suspend we won't actually need them).

		Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-29 23:02           ` Linus Torvalds
@ 2009-01-30  4:43             ` Lee Schermerhorn
  2009-01-30  4:49               ` Linus Torvalds
  0 siblings, 1 reply; 47+ messages in thread
From: Lee Schermerhorn @ 2009-01-30  4:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, linux-kernel, Maksim Yevmenkin, Nick Piggin,
	Andrew Morton, Greg Kroah-Hartman, will, Rik van Riel,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki, David S. Miller

On Thu, 2009-01-29 at 15:02 -0800, Linus Torvalds wrote:
> 
> On Thu, 29 Jan 2009, Hugh Dickins wrote:
> > 
> > The problem has always been that file->f_op->mmap(file, vma) on a
> > special file can do various strange things, changing surprising
> > fields of the struct vma passed to it, changing its mergeability.
> 
> Ahh, you're right. That's somewhat bogus, but it does explain why we have 
> those two separate merge_vma() calls.
> 
> And as you also point out:
> 
> > Now it may well be that every driver which does something strange
> > there already sets one of the VM_SPECIAL flags which prevent merging,
> > or can easily be fixed to do so, or otherwise clearly cannot pose a
> > problem.
> 
> .. and I think this is the right answer. If a device driver really does 
> something as odd as play games with the offset that would make merging 
> wrong, it had better set some of the VM_SPECIAL bits (presumably VM_IO) in 
> order to never merge at all.

Just want to note that install_special_mappings()--used for, e.g., the
vdso--sets VM_DONTEXPAND [one of the VM_SPECIAL flags] which, if we want
to prevent merging, makes a lot of sense to me.  It appears that
get_user_pages() will balk at addresses in vmas with VM_IO [and
VM_PFNMAP] which might not be what one wants.  For example, you can't
pre-populate the ptes via make_pages_present() with this flag.

Lee


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-30  4:43             ` Lee Schermerhorn
@ 2009-01-30  4:49               ` Linus Torvalds
  0 siblings, 0 replies; 47+ messages in thread
From: Linus Torvalds @ 2009-01-30  4:49 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Hugh Dickins, linux-kernel, Maksim Yevmenkin, Nick Piggin,
	Andrew Morton, Greg Kroah-Hartman, will, Rik van Riel,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki, David S. Miller



On Thu, 29 Jan 2009, Lee Schermerhorn wrote:
> 
> Just want to note that install_special_mappings()--used for, e.g., the
> vdso--sets VM_DONTEXPAND [one of the VM_SPECIAL flags] which, if we want
> to prevent merging, makes a lot of sense to me.

Yes. VM_DONTEXPAND in many ways really fits the "don't merge" thing, 
because the whole mremap() VM expansion thing is really equivalent to this 
explicit merge.

> It appears that get_user_pages() will balk at addresses in vmas with 
> VM_IO [and VM_PFNMAP] which might not be what one wants.  For example, 
> you can't pre-populate the ptes via make_pages_present() with this flag.

Well, anybody who plays games with vm_pgoff really _should_ be VM_IO. I 
don't see any real reason for it except for having a very special IO 
mapping. Of course, you quite possibly _also_ want VM_DONTEXPAND.

		Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-30  2:08           ` Linus Torvalds
@ 2009-01-30  5:56             ` Greg KH
  2009-01-30 16:36               ` Linus Torvalds
  0 siblings, 1 reply; 47+ messages in thread
From: Greg KH @ 2009-01-30  5:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Maksim Yevmenkin, Lee Schermerhorn, linux-kernel, Nick Piggin,
	Andrew Morton, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki

On Thu, Jan 29, 2009 at 06:08:30PM -0800, Linus Torvalds wrote:
> 
> 
> On Thu, 29 Jan 2009, Maksim Yevmenkin wrote:
> > 
> > Just to confirm. This patch also appear to fix the immediate issue for us.
> 
> Ok, I ended up committing it.
> 
> I didn't do the whole "Cc: stable@kernel.org" thing, because maybe the 
> non-cleanup version is less controversial for stable, but I decided that 
> there's no way I don't want the cleanup done in mainline kernel, and if we 
> end up needing a few VM_IO's added, I think it's worth it. Even after -rc3 
> (since I suspend we won't actually need them).

Which version was the "non-cleanup" version that should be added to the
stable trees?

The commit you did make, de33c8db5910cda599899dd431cc30d7c1018cbf,
seemed pretty "tiny" to me.

confused,

greg k-h

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-29 20:48       ` Linus Torvalds
  2009-01-29 22:32         ` Hugh Dickins
  2009-01-29 22:47         ` Maksim Yevmenkin
@ 2009-01-30  8:34         ` Peter Zijlstra
  2009-01-30 16:45           ` Linus Torvalds
  2 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2009-01-30  8:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Lee Schermerhorn, linux-kernel, Maksim Yevmenkin, Nick Piggin,
	Andrew Morton, Greg Kroah-Hartman, will, Rik van Riel,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Thu, 2009-01-29 at 12:48 -0800, Linus Torvalds wrote:
> I still want somebody else to look at and think about it, though.
> 
>                 Linus
> 
> ---
>  mm/mmap.c |   26 ++++++--------------------
>  1 files changed, 6 insertions(+), 20 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 8d95902..d3fa10a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1134,16 +1134,11 @@ munmap_back:
>         }
>  
>         /*
> -        * Can we just expand an old private anonymous mapping?
> -        * The VM_SHARED test is necessary because shmem_zero_setup
> -        * will create the file object for a shared anonymous map below.
> +        * Can we just expand an old mapping?
>          */
> -       if (!file && !(vm_flags & VM_SHARED)) {
> -               vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
> -                                       NULL, NULL, pgoff, NULL);
> -               if (vma)
> -                       goto out;
> -       }
> +       vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
> +       if (vma)
> +               goto out;

You've made checkpatch unhappy ;-)

So we don't bother with anonymous only, always attempt the merge.

> @@ -1206,17 +1201,8 @@ munmap_back:
>         if (vma_wants_writenotify(vma))
>                 vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);
>  
> -       if (file && vma_merge(mm, prev, addr, vma->vm_end,
> -                       vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) {
> -               mpol_put(vma_policy(vma));
> -               kmem_cache_free(vm_area_cachep, vma);
> -               fput(file);
> -               if (vm_flags & VM_EXECUTABLE)
> -                       removed_exe_file_vma(mm);
> -       } else {
> -               vma_link(mm, vma, prev, rb_link, rb_parent);
> -               file = vma->vm_file;
> -       }
> +       vma_link(mm, vma, prev, rb_link, rb_parent);
> +       file = vma->vm_file;

And here we don't bother merging because that would have been done
before. Assuming ->mmap() doesn't go wild, in which case it ought to
have set a VM_SPECIAL bit anyway to discourage merging.

[ And even if it didn't, failing to merge shouldn't be a problem, as
minimizing the vmas is an optimization, not a strict requirement
afaik. ]

The obvious glaring difference is the vma_policy() cruft. But staring at
the code a bit I can't see how the new vma can have acquired a vm_policy
here, so it ought to not matter.

Looks ok to my eyes, so I guess:

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-30  5:56             ` Greg KH
@ 2009-01-30 16:36               ` Linus Torvalds
  2009-01-30 17:40                 ` Hugh Dickins
  0 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2009-01-30 16:36 UTC (permalink / raw)
  To: Greg KH
  Cc: Maksim Yevmenkin, Lee Schermerhorn, linux-kernel, Nick Piggin,
	Andrew Morton, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki



On Thu, 29 Jan 2009, Greg KH wrote:
> 
> Which version was the "non-cleanup" version that should be added to the
> stable trees?

There were two different versions:

	From: Andrew Morton <akpm@linux-foundation.org>
	Subject: Re: possible bug in mmap_region() in linux-2.6.28 kernel
	Message-Id: <20090128134350.034ac6a7.akpm@linux-foundation.org>

	From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
	Subject: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
	Message-Id: <1233259410.2315.75.camel@lts-notebook>

and I'm actually not at all sure which one should go into stable (or if we 
should just pick the same one that went into mainline). 

> The commit you did make, de33c8db5910cda599899dd431cc30d7c1018cbf,
> seemed pretty "tiny" to me.

Oh, yes, in m any ways it's smaller than the trivial patches, since it's 
really just removing code (well, it adds "file" as a parameter to the 
first vma_merge, it used to always be NULL since we only did that for the 
"!file" case).

The downside with that commit is simply that there is a (pretty damn 
small, imho) possibility that somebody will report a regression with it, 
and we'll have to add some appropriate

	vma->vm_flags |= VM_IO | VM_DONTEXPAND;

to some odd special device driver, to make sure that it cannot trigger the 
"merge early" case because it needs its ->mmap() function called, rather 
than just expand the mapping quietly.

The reason(s) I don't think it's very likely is that

 (a) hopefully drivers already explicitly mark their mappings with one of 
     the VM_SPECIAL bits already. 

     Most such drivers would likely want to use "vm_insert_pfn()" (to 
     insert random IO pages into the mapping), and in order to do that 
     they have to add the VM_PFNMAP to vm_flags - we check. That would 
     disable any merging, because that flag won't be set in the new vma 
     before calling ->mmap.

 (b) even if the drivers don't do that explicit VM_PFNMAP, a driver that 
     does special things at mmap() time - even if it just uses regular 
     pages - probably prepopulates its mmap area with vm_insert_page().

     That in turn, will implicitly set VM_INSERTPAGE, and again: the 
     merge logic that verifies that the old vma->vm_flags == vm_flags (in 
     is_mergeable_vma) would disable merging.

 (c) finally - even if a merge really would be possible (ie vm_flags 
     didn't really get changed by the special ->mmap() function, but the 
     driver still depended on ->mmap() being called for some other 
     reason), I suspect it is really really unlikely that any application 
     would actually request two magic consecutive mmap's to the same 
     special device, with the same permissions and with just the right 
     pgoff values etc. IOW, even if a merge could happen in theory with a 
     flaky driver that doesn't like it, I don't think we'd hit it in 
     practice.

So I feel I had pretty good reasons to say "f*ck it, I'm going to fix this 
properly in mainline with a much prettier patch". 

But none of the above really changes the fact that the patch I committed 
to mainline was really quite fundamentally more invasive than either of 
the "simple" patches. All three patches are small, with mine arguably the 
smallest of the lot, but mine actually changed semantics, while Andrew's 
and Lee's patch literally only fix the invalid pointer use.

I'll leave it to others to decide which one goes into -stable. I 
personally don't really think it matters. I argue above that mine is 
pretty safe and thus perfectly fine even for -stable, but reality has a 
habit of sometimes disagreeing with me. Dang.

				Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-30  8:34         ` Peter Zijlstra
@ 2009-01-30 16:45           ` Linus Torvalds
  2009-01-30 16:49             ` Randy Dunlap
  0 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2009-01-30 16:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lee Schermerhorn, linux-kernel, Maksim Yevmenkin, Nick Piggin,
	Andrew Morton, Greg Kroah-Hartman, will, Rik van Riel,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki



On Fri, 30 Jan 2009, Peter Zijlstra wrote:
>
> > +       vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
> > +       if (vma)
> > +               goto out;
> 
> You've made checkpatch unhappy ;-)

Yeah, we need to fix that piece of crud.

checkpatch, that is.

I think "git grep" is more important than the "ooh, I have a tiny d*ck, 
everybody else should have one too" argument against larger terminals.

		Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-30 16:45           ` Linus Torvalds
@ 2009-01-30 16:49             ` Randy Dunlap
  0 siblings, 0 replies; 47+ messages in thread
From: Randy Dunlap @ 2009-01-30 16:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Lee Schermerhorn, linux-kernel, Maksim Yevmenkin,
	Nick Piggin, Andrew Morton, Greg Kroah-Hartman, will,
	Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki

Linus Torvalds wrote:
> 
> On Fri, 30 Jan 2009, Peter Zijlstra wrote:
>>> +       vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
>>> +       if (vma)
>>> +               goto out;
>> You've made checkpatch unhappy ;-)
> 
> Yeah, we need to fix that piece of crud.
> 
> checkpatch, that is.

It's just advisory, fwiw.

> I think "git grep" is more important than the "ooh, I have a tiny d*ck, 
> everybody else should have one too" argument against larger terminals.

and fix the Documentation/CodingStyle recommendations?

-- 
~Randy

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-30 16:36               ` Linus Torvalds
@ 2009-01-30 17:40                 ` Hugh Dickins
  2009-01-30 18:14                   ` Linus Torvalds
  2009-01-30 23:44                   ` Greg KH
  0 siblings, 2 replies; 47+ messages in thread
From: Hugh Dickins @ 2009-01-30 17:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Greg KH, Maksim Yevmenkin, Lee Schermerhorn, linux-kernel,
	Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki

On Fri, 30 Jan 2009, Linus Torvalds wrote:
> On Thu, 29 Jan 2009, Greg KH wrote:
> > 
> > Which version was the "non-cleanup" version that should be added to the
> > stable trees?
> 
> There were two different versions:
> 
> 	From: Andrew Morton <akpm@linux-foundation.org>
> 	Subject: Re: possible bug in mmap_region() in linux-2.6.28 kernel
> 	Message-Id: <20090128134350.034ac6a7.akpm@linux-foundation.org>
> 
> 	From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> 	Subject: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
> 	Message-Id: <1233259410.2315.75.camel@lts-notebook>
> 
> and I'm actually not at all sure which one should go into stable (or if we 
> should just pick the same one that went into mainline). 
> 
...
> 
> But none of the above really changes the fact that the patch I committed 
> to mainline was really quite fundamentally more invasive than either of 
> the "simple" patches. All three patches are small, with mine arguably the 
> smallest of the lot, but mine actually changed semantics, while Andrew's 
> and Lee's patch literally only fix the invalid pointer use.
> 
> I'll leave it to others to decide which one goes into -stable. I 
> personally don't really think it matters. I argue above that mine is 
> pretty safe and thus perfectly fine even for -stable, but reality has a 
> habit of sometimes disagreeing with me. Dang.

I'd say one of the non-cleanup versions for -stable
(but I've not compared them to see which one is better).

I'm still working my way through all the ->mmap methods to check
their safety with regard to yesterday's change (there are about
ten times as many as the last time I looked).  So far there's only
one driver mmap I want to go back and recheck, the vast majority
are as good today as they were the day before, but ...

... what I think you have done is break the vma merging on
ordinary files: because of that irritating VM_CAN_NONLINEAR
flag which generic_file_mmap() and some others add in.

To break the merging won't cause anyone much trouble,
but is a slight regression we should fix.

I'd have been very upset not to find something ;)

Hugh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-30 17:40                 ` Hugh Dickins
@ 2009-01-30 18:14                   ` Linus Torvalds
  2009-01-30 18:30                     ` Hugh Dickins
  2009-01-30 19:53                     ` Lee Schermerhorn
  2009-01-30 23:44                   ` Greg KH
  1 sibling, 2 replies; 47+ messages in thread
From: Linus Torvalds @ 2009-01-30 18:14 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Greg KH, Maksim Yevmenkin, Lee Schermerhorn, linux-kernel,
	Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki



On Fri, 30 Jan 2009, Hugh Dickins wrote:
> 
> ... what I think you have done is break the vma merging on
> ordinary files: because of that irritating VM_CAN_NONLINEAR
> flag which generic_file_mmap() and some others add in.

Ahh. Yes. VM_CAN_NONLINEAR is a "reverse flag", ie unlike the other flags 
it's to some degree about being extra _normal_, not about being odd. Most 
special flags tend to disable the VM from doing some clever thing, this 
one enables it.

> To break the merging won't cause anyone much trouble,
> but is a slight regression we should fix.

Yeah. Just masking it off when comparing is probably the simplest option. 
Make it a separate #define just for readability. There might be other 
flags like this in the future.

> I'd have been very upset not to find something ;)

Yay for you ;)

And yes, this is the kind of thing that probably does mean that we're 
better off with the no-semantic-changes patches in -stable.

		Linus

---
 mm/mmap.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index d3fa10a..c581df1 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -658,6 +658,9 @@ again:			remove_next = 1 + (end > next->vm_end);
 	validate_mm(mm);
 }
 
+/* Flags that can be inherited from an existing mapping when merging */
+#define VM_MERGEABLE_FLAGS (VM_CAN_NONLINEAR)
+
 /*
  * If the vma has a ->close operation then the driver probably needs to release
  * per-vma resources, so we don't attempt to merge those.
@@ -665,7 +668,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 static inline int is_mergeable_vma(struct vm_area_struct *vma,
 			struct file *file, unsigned long vm_flags)
 {
-	if (vma->vm_flags != vm_flags)
+	if ((vma->vm_flags ^ vm_flags) & ~VM_MERGEABLE_FLAGS)
 		return 0;
 	if (vma->vm_file != file)
 		return 0;

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-30 18:14                   ` Linus Torvalds
@ 2009-01-30 18:30                     ` Hugh Dickins
  2009-01-30 19:53                     ` Lee Schermerhorn
  1 sibling, 0 replies; 47+ messages in thread
From: Hugh Dickins @ 2009-01-30 18:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Greg KH, Maksim Yevmenkin, Lee Schermerhorn, linux-kernel,
	Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki

On Fri, 30 Jan 2009, Linus Torvalds wrote:
> On Fri, 30 Jan 2009, Hugh Dickins wrote:
> > 
> > ... what I think you have done is break the vma merging on
> > ordinary files: because of that irritating VM_CAN_NONLINEAR
> > flag which generic_file_mmap() and some others add in.
> 
> Ahh. Yes. VM_CAN_NONLINEAR is a "reverse flag", ie unlike the other flags 
> it's to some degree about being extra _normal_, not about being odd. Most 
> special flags tend to disable the VM from doing some clever thing, this 
> one enables it.
> 
> > To break the merging won't cause anyone much trouble,
> > but is a slight regression we should fix.
> 
> Yeah. Just masking it off when comparing is probably the simplest option. 
> Make it a separate #define just for readability. There might be other 
> flags like this in the future.
> 
> > I'd have been very upset not to find something ;)
> 
> Yay for you ;)
> 
> And yes, this is the kind of thing that probably does mean that we're 
> better off with the no-semantic-changes patches in -stable.
> 
> 		Linus

Yes, your patch looks just right to me - or maybe add a comment that
it's only mmap_region()'s call to vma_merge() which needs that mask?

If you hear no more from me, that'll be all I found.

Hugh

> 
> ---
>  mm/mmap.c |    5 ++++-
>  1 files changed, 4 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d3fa10a..c581df1 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -658,6 +658,9 @@ again:			remove_next = 1 + (end > next->vm_end);
>  	validate_mm(mm);
>  }
>  
> +/* Flags that can be inherited from an existing mapping when merging */
> +#define VM_MERGEABLE_FLAGS (VM_CAN_NONLINEAR)
> +
>  /*
>   * If the vma has a ->close operation then the driver probably needs to release
>   * per-vma resources, so we don't attempt to merge those.
> @@ -665,7 +668,7 @@ again:			remove_next = 1 + (end > next->vm_end);
>  static inline int is_mergeable_vma(struct vm_area_struct *vma,
>  			struct file *file, unsigned long vm_flags)
>  {
> -	if (vma->vm_flags != vm_flags)
> +	if ((vma->vm_flags ^ vm_flags) & ~VM_MERGEABLE_FLAGS)
>  		return 0;
>  	if (vma->vm_file != file)
>  		return 0;

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-30 18:14                   ` Linus Torvalds
  2009-01-30 18:30                     ` Hugh Dickins
@ 2009-01-30 19:53                     ` Lee Schermerhorn
  2009-01-30 20:31                       ` Linus Torvalds
                                         ` (2 more replies)
  1 sibling, 3 replies; 47+ messages in thread
From: Lee Schermerhorn @ 2009-01-30 19:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Greg KH, Maksim Yevmenkin, linux-kernel,
	Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki

On Fri, 2009-01-30 at 10:14 -0800, Linus Torvalds wrote:
> 
> On Fri, 30 Jan 2009, Hugh Dickins wrote:
> > 
> > ... what I think you have done is break the vma merging on
> > ordinary files: because of that irritating VM_CAN_NONLINEAR
> > flag which generic_file_mmap() and some others add in.
> 
> Ahh. Yes. VM_CAN_NONLINEAR is a "reverse flag", ie unlike the other flags 
> it's to some degree about being extra _normal_, not about being odd. Most 
> special flags tend to disable the VM from doing some clever thing, this 
> one enables it.
> 
> > To break the merging won't cause anyone much trouble,
> > but is a slight regression we should fix.
> 
> Yeah. Just masking it off when comparing is probably the simplest option. 
> Make it a separate #define just for readability. There might be other 
> flags like this in the future.
> 
> > I'd have been very upset not to find something ;)
> 
> Yay for you ;)
> 
> And yes, this is the kind of thing that probably does mean that we're 
> better off with the no-semantic-changes patches in -stable.
> 
> 		Linus
> 
> ---
>  mm/mmap.c |    5 ++++-
>  1 files changed, 4 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d3fa10a..c581df1 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -658,6 +658,9 @@ again:			remove_next = 1 + (end > next->vm_end);
>  	validate_mm(mm);
>  }
>  
> +/* Flags that can be inherited from an existing mapping when merging */
> +#define VM_MERGEABLE_FLAGS (VM_CAN_NONLINEAR)
> +
>  /*
>   * If the vma has a ->close operation then the driver probably needs to release
>   * per-vma resources, so we don't attempt to merge those.
> @@ -665,7 +668,7 @@ again:			remove_next = 1 + (end > next->vm_end);
>  static inline int is_mergeable_vma(struct vm_area_struct *vma,
>  			struct file *file, unsigned long vm_flags)
>  {
> -	if (vma->vm_flags != vm_flags)
> +	if ((vma->vm_flags ^ vm_flags) & ~VM_MERGEABLE_FLAGS)
>  		return 0;
>  	if (vma->vm_file != file)
>  		return 0;

I tried this patch atop 29-rc3 + your patch from yesterday with my
simple test program at http://free.linux/hp.com/~lts/Tests/mmap_lock.c. 

The test program shows the /proc/<pid>/maps before and after the mmap
and attempted merge.  It's not merging:

7fd20a668000-7fd20a66a000 rw-p 7fd20a668000 00:00 0 
7fd20a68a000-7fd20a68b000 r--s 00000000 68:23 6608852                    /tmp/tmpfVx1SFL (deleted)
7fd20a68b000-7fd20a68c000 r--s 00001000 68:23 6608852                    /tmp/tmpfVx1SFL (deleted)
7fd20a68c000-7fd20a690000 rw-p 7fd20a68c000 00:00 0 

Ad hoc instrumentation shows that it's the VM_ACCOUNT flag that is
different between the existing file segment and the one attempting the
merge:

is_mergeable_vma: !mergable: vma flags:  0x80020f9:0x1020f9
                                           |         |-VM_ACCOUNT
                                           +-----------VM_CAN_NONLINEAR

So happens, I'm mapping with MAP_SHARED, so the VM_ACCOUNT flag gets
cleared later in mmap_region().  Comments say that this is for checking
memory availability during shmem_file_setup().  Maybe we can move the
temporary setting of VM_ACCOUNT until just before the call to
shmem_zero_setup()?

Lee


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-30 19:53                     ` Lee Schermerhorn
@ 2009-01-30 20:31                       ` Linus Torvalds
  2009-01-30 21:12                         ` Hugh Dickins
  2009-01-30 20:33                       ` Hugh Dickins
  2009-01-30 20:53                       ` Randy Dunlap
  2 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2009-01-30 20:31 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Hugh Dickins, Greg KH, Maksim Yevmenkin, linux-kernel,
	Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki



On Fri, 30 Jan 2009, Lee Schermerhorn wrote:
>
> Ad hoc instrumentation shows that it's the VM_ACCOUNT flag that is
> different between the existing file segment and the one attempting the
> merge:

Gaah. I looked at VM_ACCOUNT, since it looked like a prime example of the 
same VM_CAN_NONLINEAR thing and was thinking that I should just add it to 
the "ok to merge" flags, but decided after a quick look that we do all the 
VM_ACCOUNT handling before we call vma_merge, so I just left it at that. 

But apparently we _do_ do some VM_ACCOUNT stuff afterwards.

> is_mergeable_vma: !mergable: vma flags:  0x80020f9:0x1020f9
>                                            |         |-VM_ACCOUNT
>                                            +-----------VM_CAN_NONLINEAR
> 
> So happens, I'm mapping with MAP_SHARED, so the VM_ACCOUNT flag gets
> cleared later in mmap_region().  Comments say that this is for checking
> memory availability during shmem_file_setup().  Maybe we can move the
> temporary setting of VM_ACCOUNT until just before the call to
> shmem_zero_setup()?

Yeah, that would probably fix it, and looks like the right thing to do. 

It all looks pretty confused wrong to set the whole VM_ACCOUNT flag for a 
file-backed file AT ALL in the first place, but the code knows that it 
won't matter for a shared file, and will be cleared again later.

So it plays these temporary games with vm_flags, and it didn't matter 
because of how we used to call "vma_merge()" either early only for the 
anonymous memory case (that had VM_ACCOUNT stable and didn't have that 
temporary case at all) or much later (after having undone the temporary 
flag setting) for files.

Why do we pass in that "accountable" flag, btw? It's only ever set to 0 by 
a MAP_PRIVATE mapping that hits is_file_hugepages() (see do_mmap_pgoff), 
and we could just do that decision all inside mmap_region(). So the flag 
doesn't really seem to have any real meaning, and is just passed around 
for some odd historical reason?

			Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-30 19:53                     ` Lee Schermerhorn
  2009-01-30 20:31                       ` Linus Torvalds
@ 2009-01-30 20:33                       ` Hugh Dickins
  2009-01-30 20:53                       ` Randy Dunlap
  2 siblings, 0 replies; 47+ messages in thread
From: Hugh Dickins @ 2009-01-30 20:33 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Linus Torvalds, Greg KH, Maksim Yevmenkin, linux-kernel,
	Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki

On Fri, 30 Jan 2009, Lee Schermerhorn wrote:
> 
> I tried this patch atop 29-rc3 + your patch from yesterday with my
> simple test program at http://free.linux/hp.com/~lts/Tests/mmap_lock.c. 
> 
> The test program shows the /proc/<pid>/maps before and after the mmap
> and attempted merge.  It's not merging:
> 
> 7fd20a668000-7fd20a66a000 rw-p 7fd20a668000 00:00 0 
> 7fd20a68a000-7fd20a68b000 r--s 00000000 68:23 6608852                    /tmp/tmpfVx1SFL (deleted)
> 7fd20a68b000-7fd20a68c000 r--s 00001000 68:23 6608852                    /tmp/tmpfVx1SFL (deleted)
> 7fd20a68c000-7fd20a690000 rw-p 7fd20a68c000 00:00 0 

Some testing - now that's a great idea!  Good spotting.

> 
> Ad hoc instrumentation shows that it's the VM_ACCOUNT flag that is
> different between the existing file segment and the one attempting the
> merge:
> 
> is_mergeable_vma: !mergable: vma flags:  0x80020f9:0x1020f9
>                                            |         |-VM_ACCOUNT
>                                            +-----------VM_CAN_NONLINEAR

Sorry, I should have noticed that: VM_ACCOUNT was certainly on my
mind as a flag that gives trouble when merging vmas, but I'd
misconvinced myself that it wasn't a problem in this case.

> 
> So happens, I'm mapping with MAP_SHARED, so the VM_ACCOUNT flag gets
> cleared later in mmap_region().  Comments say that this is for checking
> memory availability during shmem_file_setup().  Maybe we can move the
> temporary setting of VM_ACCOUNT until just before the call to
> shmem_zero_setup()?

Please let me think on that - we can live with the status quo for
the moment.  I need to remind myself why I used that peculiar way
of conveying info from mmap.c to shmem.c: the right answer may be
just to pass another arg to shmem_zero_setup and shmem_file_setup.

Hugh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-30 19:53                     ` Lee Schermerhorn
  2009-01-30 20:31                       ` Linus Torvalds
  2009-01-30 20:33                       ` Hugh Dickins
@ 2009-01-30 20:53                       ` Randy Dunlap
  2009-01-30 20:59                         ` Lee Schermerhorn
  2 siblings, 1 reply; 47+ messages in thread
From: Randy Dunlap @ 2009-01-30 20:53 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Linus Torvalds, Hugh Dickins, Greg KH, Maksim Yevmenkin,
	linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki

Lee Schermerhorn wrote:

> I tried this patch atop 29-rc3 + your patch from yesterday with my
> simple test program at http://free.linux/hp.com/~lts/Tests/mmap_lock.c. 

Is that URL correct?  I can't find/access it.

> The test program shows the /proc/<pid>/maps before and after the mmap
> and attempted merge.  It's not merging:

Thanks,
-- 
~Randy

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-30 20:53                       ` Randy Dunlap
@ 2009-01-30 20:59                         ` Lee Schermerhorn
  2009-01-30 21:11                           ` Will Crowder
  0 siblings, 1 reply; 47+ messages in thread
From: Lee Schermerhorn @ 2009-01-30 20:59 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Linus Torvalds, Hugh Dickins, Greg KH, Maksim Yevmenkin,
	linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki

On Fri, 2009-01-30 at 12:53 -0800, Randy Dunlap wrote:
> Lee Schermerhorn wrote:
> 
> > I tried this patch atop 29-rc3 + your patch from yesterday with my
> > simple test program at http://free.linux/hp.com/~lts/Tests/mmap_lock.c. 

D'oh!  that's ".hp.com"  

> Is that URL correct?  I can't find/access it.

Sorry, should have cut and pasted, like here:

	http://free.linux.hp.com/~lts/Tests/mmap_lock.c


Lee



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-30 20:59                         ` Lee Schermerhorn
@ 2009-01-30 21:11                           ` Will Crowder
  0 siblings, 0 replies; 47+ messages in thread
From: Will Crowder @ 2009-01-30 21:11 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Randy Dunlap, Linus Torvalds, Hugh Dickins, Greg KH,
	Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton,
	Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki

What, no ".linux" TLD?  :-)  Oh well...in time, time...

Will Crowder

On Fri, 2009-01-30 at 15:59 -0500, Lee Schermerhorn wrote:
> On Fri, 2009-01-30 at 12:53 -0800, Randy Dunlap wrote:
> > Lee Schermerhorn wrote:
> > 
> > > I tried this patch atop 29-rc3 + your patch from yesterday with my
> > > simple test program at http://free.linux/hp.com/~lts/Tests/mmap_lock.c. 
> 
> D'oh!  that's ".hp.com"  
> 
> > Is that URL correct?  I can't find/access it.
> 
> Sorry, should have cut and pasted, like here:
> 
> 	http://free.linux.hp.com/~lts/Tests/mmap_lock.c
> 
> 
> Lee
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-30 20:31                       ` Linus Torvalds
@ 2009-01-30 21:12                         ` Hugh Dickins
  2009-01-30 21:25                           ` Linus Torvalds
                                             ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Hugh Dickins @ 2009-01-30 21:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Lee Schermerhorn, Greg KH, Maksim Yevmenkin, linux-kernel,
	Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Mikos Szeredi

On Fri, 30 Jan 2009, Linus Torvalds wrote:
> On Fri, 30 Jan 2009, Lee Schermerhorn wrote:
> > 
> > So happens, I'm mapping with MAP_SHARED, so the VM_ACCOUNT flag gets
> > cleared later in mmap_region().  Comments say that this is for checking
> > memory availability during shmem_file_setup().  Maybe we can move the
> > temporary setting of VM_ACCOUNT until just before the call to
> > shmem_zero_setup()?
> 
> Yeah, that would probably fix it, and looks like the right thing to do. 

I do need to refresh my memory on that in a moment...

> 
> It all looks pretty confused wrong to set the whole VM_ACCOUNT flag for a 
> file-backed file AT ALL in the first place, but the code knows that it 
> won't matter for a shared file, and will be cleared again later.
> 
> So it plays these temporary games with vm_flags, and it didn't matter 
> because of how we used to call "vma_merge()" either early only for the 
> anonymous memory case (that had VM_ACCOUNT stable and didn't have that 
> temporary case at all) or much later (after having undone the temporary 
> flag setting) for files.

I'm to blame for those games, and now they've given trouble,
the right thing may be to put an end to them.

> 
> Why do we pass in that "accountable" flag, btw? It's only ever set to 0 by 
> a MAP_PRIVATE mapping that hits is_file_hugepages() (see do_mmap_pgoff), 
> and we could just do that decision all inside mmap_region(). So the flag 
> doesn't really seem to have any real meaning, and is just passed around 
> for some odd historical reason?

It looks like the "accountable" flag dates from before Miklos separated
mmap_region() out from do_mmap_pgoff(): so he just passed it on down to
mmap_region() as an additional argument, preferring to leave the more
complex MAP_PRIVATE/is_file_hugepages test behind in do_mmap_pgoff().

It seemed rather a random refactoring to me.  Looking at it again,
I wonder if we should be getting do_brk() to use mmap_region() too;
but my appetite for cleanups is low at present, bugs we have enough.

By the way, there's an argument to say that you should add
VM_MIXEDMAP to VM_CAN_NONLINEAR in VM_MERGEABLE_FLAGS: I don't
really care whether we merge the odd filemap_xip vma or not,
but it used to do so and now won't.

By the same (used to merge, now won't) argument, one could say
VM_INSERTPAGE should be there too; but whereas VM_MIXEDMAP is used
in one place only, quite a lot of drivers use vm_insert_page(), so
I feel more comfortable with the idea that it's stopping merges -
though in that case, shouldn't we add it to VM_SPECIAL?

But I'm caring more about that VM_ACCOUNT...

Hugh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-30 21:12                         ` Hugh Dickins
@ 2009-01-30 21:25                           ` Linus Torvalds
  2009-01-30 21:36                           ` Lee Schermerhorn
  2009-01-30 21:37                           ` Linus Torvalds
  2 siblings, 0 replies; 47+ messages in thread
From: Linus Torvalds @ 2009-01-30 21:25 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Lee Schermerhorn, Greg KH, Maksim Yevmenkin, linux-kernel,
	Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Mikos Szeredi



On Fri, 30 Jan 2009, Hugh Dickins wrote:
> 
> But I'm caring more about that VM_ACCOUNT...

The bright side here is that I think the VM_ACCOUNT bit differences only 
really happen for MAP_SHARED file mappings. For MAP_SHARED anon mappings 
we already don't share (for other reasons), and for private mappings 
VM_ACCOUNT looks stable (ie it's reliably either set or not, and stays 
that way).

So the impact is reasonably low. But it would definitely be good to clean 
up that whole VM_ACCOUNT logic, and incidentally then fix that merging 
issue at the same time.

			Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-30 21:12                         ` Hugh Dickins
  2009-01-30 21:25                           ` Linus Torvalds
@ 2009-01-30 21:36                           ` Lee Schermerhorn
  2009-01-30 22:27                             ` Linus Torvalds
  2009-01-30 21:37                           ` Linus Torvalds
  2 siblings, 1 reply; 47+ messages in thread
From: Lee Schermerhorn @ 2009-01-30 21:36 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Greg KH, Maksim Yevmenkin, linux-kernel,
	Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Mikos Szeredi

On Fri, 2009-01-30 at 21:12 +0000, Hugh Dickins wrote:
> On Fri, 30 Jan 2009, Linus Torvalds wrote:
> > On Fri, 30 Jan 2009, Lee Schermerhorn wrote:
> > > 
> > > So happens, I'm mapping with MAP_SHARED, so the VM_ACCOUNT flag gets
> > > cleared later in mmap_region().  Comments say that this is for checking
> > > memory availability during shmem_file_setup().  Maybe we can move the
> > > temporary setting of VM_ACCOUNT until just before the call to
> > > shmem_zero_setup()?
> > 
> > Yeah, that would probably fix it, and looks like the right thing to do. 
> 
> I do need to refresh my memory on that in a moment...
> 
> > 
> > It all looks pretty confused wrong to set the whole VM_ACCOUNT flag for a 
> > file-backed file AT ALL in the first place, but the code knows that it 
> > won't matter for a shared file, and will be cleared again later.
> > 
> > So it plays these temporary games with vm_flags, and it didn't matter 
> > because of how we used to call "vma_merge()" either early only for the 
> > anonymous memory case (that had VM_ACCOUNT stable and didn't have that 
> > temporary case at all) or much later (after having undone the temporary 
> > flag setting) for files.
> 
> I'm to blame for those games, and now they've given trouble,
> the right thing may be to put an end to them.
> 
> > 
> > Why do we pass in that "accountable" flag, btw? It's only ever set to 0 by 
> > a MAP_PRIVATE mapping that hits is_file_hugepages() (see do_mmap_pgoff), 
> > and we could just do that decision all inside mmap_region(). So the flag 
> > doesn't really seem to have any real meaning, and is just passed around 
> > for some odd historical reason?
> 
> It looks like the "accountable" flag dates from before Miklos separated
> mmap_region() out from do_mmap_pgoff(): so he just passed it on down to
> mmap_region() as an additional argument, preferring to leave the more
> complex MAP_PRIVATE/is_file_hugepages test behind in do_mmap_pgoff().
> 
> It seemed rather a random refactoring to me.  Looking at it again,
> I wonder if we should be getting do_brk() to use mmap_region() too;
> but my appetite for cleanups is low at present, bugs we have enough.
> 
> By the way, there's an argument to say that you should add
> VM_MIXEDMAP to VM_CAN_NONLINEAR in VM_MERGEABLE_FLAGS: I don't
> really care whether we merge the odd filemap_xip vma or not,
> but it used to do so and now won't.
> 
> By the same (used to merge, now won't) argument, one could say
> VM_INSERTPAGE should be there too; but whereas VM_MIXEDMAP is used
> in one place only, quite a lot of drivers use vm_insert_page(), so
> I feel more comfortable with the idea that it's stopping merges -
> though in that case, shouldn't we add it to VM_SPECIAL?
> 
> But I'm caring more about that VM_ACCOUNT...

I just verified that adding VM_ACCOUNT to VM_MERGEABLE does allow the
merge to happen with the test program.  And the system didn't come
crashing down around me.  But, I wouldn't trust that simple test as the
last word.  A short run of a stress load I use held up/still running,
but I can't tell whether it's merging as expected there.   

I am running a slightly modified version of Maksim's test program under
the harness.  I modified it to mmap the entire region to reserve space,
then MAP_FIXED at each page address in the range returned by the first
mmap.  I saw that it was leaving holes between some of the pages w/o
this.  I'm going to automate the check for merging [read map and verify
a single segment at expected range] and leave that running with the
load.

Lee


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-30 21:12                         ` Hugh Dickins
  2009-01-30 21:25                           ` Linus Torvalds
  2009-01-30 21:36                           ` Lee Schermerhorn
@ 2009-01-30 21:37                           ` Linus Torvalds
  2009-01-31 12:16                             ` Hugh Dickins
  2 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2009-01-30 21:37 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Lee Schermerhorn, Greg KH, Maksim Yevmenkin, linux-kernel,
	Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Mikos Szeredi



On Fri, 30 Jan 2009, Hugh Dickins wrote:
> 
> By the way, there's an argument to say that you should add
> VM_MIXEDMAP to VM_CAN_NONLINEAR in VM_MERGEABLE_FLAGS: I don't
> really care whether we merge the odd filemap_xip vma or not,
> but it used to do so and now won't.
> 
> By the same (used to merge, now won't) argument, one could say
> VM_INSERTPAGE should be there too; but whereas VM_MIXEDMAP is used
> in one place only, quite a lot of drivers use vm_insert_page(), so
> I feel more comfortable with the idea that it's stopping merges -
> though in that case, shouldn't we add it to VM_SPECIAL?

VM_MIXEDMAP, yes. It's a special case.

But VM_INSERTPAGE - no.

Why? Because VM_INSERTPAGE implies that mmap() itself quite possibly 
actually _does_ something (it may have inserted _all_ the pages, and may 
not even have any ->fault handler at all).

So we can't just expand a current mapping and forget about it, we need to 
do the mmap(). So the "only do merges early" fundamentally cannot merge 
such a vma, even if the old code did.

Of course, we could look at whether it has a ->fault handler as an 
indication of whether it's possible to merge or not. We already do that 
for ->close, and in many ways ->fault would likely be a better indicator 
of whether something is mergeable or not.

But there's really no point. VM_INSERTPAGE implies a very special mapping: 
it's used by things like the SCSI target user space ring map, or the magic 
network packet mmap thing. If you use those things, you really don't care 
about merging adjacent VM's anyway, and at least the two I looked at 
really do the whole "pre-populate at mmap time" thing.

And at least the packet one really does have a ->close function, and lacks 
a ->fault function, so it wouldn't have merged before either (and looking 
at ->fault again seems to be as valid as looking at ->close).

			Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-30 21:36                           ` Lee Schermerhorn
@ 2009-01-30 22:27                             ` Linus Torvalds
  2009-01-31 12:35                               ` Hugh Dickins
  0 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2009-01-30 22:27 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Hugh Dickins, Greg KH, Maksim Yevmenkin, linux-kernel,
	Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Mikos Szeredi


On Fri, 30 Jan 2009, Lee Schermerhorn wrote:
> 
> I just verified that adding VM_ACCOUNT to VM_MERGEABLE does allow the
> merge to happen with the test program.  And the system didn't come
> crashing down around me.  But, I wouldn't trust that simple test as the
> last word.  A short run of a stress load I use held up/still running,
> but I can't tell whether it's merging as expected there.   

Just ignoring VM_ACCOUNT for merging is not the right thing to do. It 
probably works in practice for just about everything, but at least in 
theory it does mean that mmap() can stop honoring MAP_NORESERVE.

Admittedly the circumstances where that happens are unlikely, and nobody 
probably even really uses MAP_NORESERVE in the first place, so I doubt you 
can ever see it as a real issue, but it's still technically wrong to merge 
vma's that can differ in VM_ACCOUNT.

Now, the particular test you have, VM_ACCOUNT differs only during that 
temporary window, and the vma's really do end up with the same VM_ACCOUNT 
state in the end, so merging them is correct, but if you were to privately 
map the same file (or private anonymous map) with the same permissions 
next to each other so that they -could- merge, but use MAP_NORESERVE on 
one and not on the other, then they shouldn't merge.

So VM_ACCOUNT does matter - just barely - for merging, and we just happen 
to currently hit it too much due to a very odd internal reason.

		Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-30 17:40                 ` Hugh Dickins
  2009-01-30 18:14                   ` Linus Torvalds
@ 2009-01-30 23:44                   ` Greg KH
  1 sibling, 0 replies; 47+ messages in thread
From: Greg KH @ 2009-01-30 23:44 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Maksim Yevmenkin, Lee Schermerhorn, linux-kernel,
	Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki

On Fri, Jan 30, 2009 at 05:40:24PM +0000, Hugh Dickins wrote:
> On Fri, 30 Jan 2009, Linus Torvalds wrote:
> > On Thu, 29 Jan 2009, Greg KH wrote:
> > > 
> > > Which version was the "non-cleanup" version that should be added to the
> > > stable trees?
> > 
> > There were two different versions:
> > 
> > 	From: Andrew Morton <akpm@linux-foundation.org>
> > 	Subject: Re: possible bug in mmap_region() in linux-2.6.28 kernel
> > 	Message-Id: <20090128134350.034ac6a7.akpm@linux-foundation.org>
> > 
> > 	From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> > 	Subject: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
> > 	Message-Id: <1233259410.2315.75.camel@lts-notebook>
> > 
> > and I'm actually not at all sure which one should go into stable (or if we 
> > should just pick the same one that went into mainline). 
> > 
> ...
> > 
> > But none of the above really changes the fact that the patch I committed 
> > to mainline was really quite fundamentally more invasive than either of 
> > the "simple" patches. All three patches are small, with mine arguably the 
> > smallest of the lot, but mine actually changed semantics, while Andrew's 
> > and Lee's patch literally only fix the invalid pointer use.
> > 
> > I'll leave it to others to decide which one goes into -stable. I 
> > personally don't really think it matters. I argue above that mine is 
> > pretty safe and thus perfectly fine even for -stable, but reality has a 
> > habit of sometimes disagreeing with me. Dang.
> 
> I'd say one of the non-cleanup versions for -stable
> (but I've not compared them to see which one is better).

Ok, based on both of your comments about this, and the fact that the
in-tree one did break something, I'll go look at Andrew and Lee's
versions and pick one of them for -stable.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-30 21:37                           ` Linus Torvalds
@ 2009-01-31 12:16                             ` Hugh Dickins
  0 siblings, 0 replies; 47+ messages in thread
From: Hugh Dickins @ 2009-01-31 12:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Lee Schermerhorn, Greg KH, Maksim Yevmenkin, linux-kernel,
	Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Mikos Szeredi

On Fri, 30 Jan 2009, Linus Torvalds wrote:
> On Fri, 30 Jan 2009, Hugh Dickins wrote:
> > 
> > By the same (used to merge, now won't) argument, one could say
> > VM_INSERTPAGE should be there too; but whereas VM_MIXEDMAP is used
> > in one place only, quite a lot of drivers use vm_insert_page(), so
> > I feel more comfortable with the idea that it's stopping merges -
> > though in that case, shouldn't we add it to VM_SPECIAL?
> 
> But VM_INSERTPAGE - no.
> 
> Why? Because VM_INSERTPAGE implies that mmap() itself quite possibly 
> actually _does_ something (it may have inserted _all_ the pages, and may 
> not even have any ->fault handler at all).
> 
> So we can't just expand a current mapping and forget about it, we need to 
> do the mmap(). So the "only do merges early" fundamentally cannot merge 
> such a vma, even if the old code did.

Good point indeed.  Same as why we test for it in copy_page_range().
Should it be added to VM_SPECIAL?  Not for this particular (need to
call ->mmap) reason, but we probably ought to add it anyway.

Hugh

> 
> Of course, we could look at whether it has a ->fault handler as an 
> indication of whether it's possible to merge or not. We already do that 
> for ->close, and in many ways ->fault would likely be a better indicator 
> of whether something is mergeable or not.
> 
> But there's really no point. VM_INSERTPAGE implies a very special mapping: 
> it's used by things like the SCSI target user space ring map, or the magic 
> network packet mmap thing. If you use those things, you really don't care 
> about merging adjacent VM's anyway, and at least the two I looked at 
> really do the whole "pre-populate at mmap time" thing.
> 
> And at least the packet one really does have a ->close function, and lacks 
> a ->fault function, so it wouldn't have merged before either (and looking 
> at ->fault again seems to be as valid as looking at ->close).
> 
> 			Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-30 22:27                             ` Linus Torvalds
@ 2009-01-31 12:35                               ` Hugh Dickins
  2009-01-31 18:34                                 ` Linus Torvalds
  2009-02-03 16:13                                 ` Lee Schermerhorn
  0 siblings, 2 replies; 47+ messages in thread
From: Hugh Dickins @ 2009-01-31 12:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Lee Schermerhorn, Greg KH, Maksim Yevmenkin, linux-kernel,
	Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Mikos Szeredi

On Fri, 30 Jan 2009, Linus Torvalds wrote:
> On Fri, 30 Jan 2009, Lee Schermerhorn wrote:
> > 
> > I just verified that adding VM_ACCOUNT to VM_MERGEABLE does allow the
> > merge to happen with the test program.  And the system didn't come
> > crashing down around me.  But, I wouldn't trust that simple test as the
> > last word.  A short run of a stress load I use held up/still running,
> > but I can't tell whether it's merging as expected there.   
> 
> Just ignoring VM_ACCOUNT for merging is not the right thing to do. It 
> probably works in practice for just about everything, but at least in 
> theory it does mean that mmap() can stop honoring MAP_NORESERVE.
> 
> Admittedly the circumstances where that happens are unlikely, and nobody 
> probably even really uses MAP_NORESERVE in the first place, so I doubt you 
> can ever see it as a real issue, but it's still technically wrong to merge 
> vma's that can differ in VM_ACCOUNT.
> 
> Now, the particular test you have, VM_ACCOUNT differs only during that 
> temporary window, and the vma's really do end up with the same VM_ACCOUNT 
> state in the end, so merging them is correct, but if you were to privately 
> map the same file (or private anonymous map) with the same permissions 
> next to each other so that they -could- merge, but use MAP_NORESERVE on 
> one and not on the other, then they shouldn't merge.
> 
> So VM_ACCOUNT does matter - just barely - for merging, and we just happen 
> to currently hit it too much due to a very odd internal reason.

It matters more than just barely - if you care about non-overcommit,
or if you care about non-wrapping Committed_AS in your /proc/meminfo.

Ignoring VM_ACCOUNT when merging is very much the wrong thing to do,
because it lets an unaccounted area be treated thereafter as accounted,
or vice versa - even forgetting the MAP_NORESERVE special case.

I have by now recalled why I chose to play those VM_ACCOUNT games:
	/* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform
	 * shmem_zero_setup (perhaps called through /dev/zero's ->mmap)
	 * that memory reservation must be checked; but that reservation
	 * belongs to shared memory object, not to vma: so now clear it.
We need a way to communicate not-MAP_NORESERVE to shmem.c, and we don't
just need it in the explicit shmem_zero_setup() case, we also need it
for the (probably rare nowadays) case when mmap() is working on file
/dev/zero (drivers/char/mem.c mmap_zero()), rather than using MAP_ANON.

Still haven't decided what's best to do about it (plenty of diversions):
perhaps we just say my error was to overload VM_ACCOUNT, and define a
new flag for the purpose, which can go into VM_MERGEABLE_FLAGS; but
I'd prefer a neater solution if it crosses my mind.

Hugh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-31 12:35                               ` Hugh Dickins
@ 2009-01-31 18:34                                 ` Linus Torvalds
  2009-02-02 11:59                                   ` KOSAKI Motohiro
  2009-02-03 16:13                                 ` Lee Schermerhorn
  1 sibling, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2009-01-31 18:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Lee Schermerhorn, Greg KH, Maksim Yevmenkin, linux-kernel,
	Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Mikos Szeredi



On Sat, 31 Jan 2009, Hugh Dickins wrote:
>
> We need a way to communicate not-MAP_NORESERVE to shmem.c, and we don't
> just need it in the explicit shmem_zero_setup() case, we also need it
> for the (probably rare nowadays) case when mmap() is working on file
> /dev/zero (drivers/char/mem.c mmap_zero()), rather than using MAP_ANON.

Heh. We already have that. Maybe you didn't realize. Look at VM_NORESERVE.

So shmem.c can just look at "!(vma->vm_flags & VM_NORESERVE)" if it wants 
to.

The only piece of information you don't have is that "accountable" flag, 
but that was why I was pointing out how totally _useless_ that flag 
actually is. We shouldn't pass it along, because it's always 1, except for 
one special case that we can always calculate (private hugepages).

But in fact, even leaving that untouched, we can trivially just change 
what 'VM_NORESERVE' means: we can just make it the end result of all that 
'accountable' logic, instead of having it just mirror MAP_NORESERVE 
blindly.

> Still haven't decided what's best to do about it (plenty of diversions):
> perhaps we just say my error was to overload VM_ACCOUNT, and define a
> new flag for the purpose, which can go into VM_MERGEABLE_FLAGS; but
> I'd prefer a neater solution if it crosses my mind.

How about this pretty trivial patch?

TOTALLY UNTESTED. As usual. But the concept is pretty simple, and it 
actually removes a fair chunk of hacky code.  The only reason the diffstat 
output says that it adds more lines than it deletes is that I added more 
comments and made that helper inline function rather than make a complex 
conditional.

Whaddaya think?

		Linus

---
 mm/mmap.c  |   48 +++++++++++++++++++++++++-----------------------
 mm/shmem.c |    2 +-
 2 files changed, 26 insertions(+), 24 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index c581df1..5fcaec3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1090,6 +1090,15 @@ int vma_wants_writenotify(struct vm_area_struct *vma)
 		mapping_cap_account_dirty(vma->vm_file->f_mapping);
 }
 
+/*
+ * We account for memory if it's a private writeable mapping,
+ * and VM_NORESERVE wasn't set.
+ */
+static inline int private_accountable_mapping(unsigned int vm_flags)
+{
+	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
+}
+
 unsigned long mmap_region(struct file *file, unsigned long addr,
 			  unsigned long len, unsigned long flags,
 			  unsigned int vm_flags, unsigned long pgoff,
@@ -1117,23 +1126,24 @@ munmap_back:
 	if (!may_expand_vm(mm, len >> PAGE_SHIFT))
 		return -ENOMEM;
 
-	if (flags & MAP_NORESERVE)
+	/*
+	 * Set 'VM_NORESERVE' if we should not account for the
+	 * memory use of this mapping. We only honor MAP_NORESERVE
+	 * if we're allowed to overcommit memory.
+	 */
+	if ((flags & MAP_NORESERVE) && sysctl_overcommit_memory != OVERCOMMIT_NEVER)
+		vm_flags |= VM_NORESERVE;
+	if (!accountable)
 		vm_flags |= VM_NORESERVE;
 
-	if (accountable && (!(flags & MAP_NORESERVE) ||
-			    sysctl_overcommit_memory == OVERCOMMIT_NEVER)) {
-		if (vm_flags & VM_SHARED) {
-			/* Check memory availability in shmem_file_setup? */
-			vm_flags |= VM_ACCOUNT;
-		} else if (vm_flags & VM_WRITE) {
-			/*
-			 * Private writable mapping: check memory availability
-			 */
-			charged = len >> PAGE_SHIFT;
-			if (security_vm_enough_memory(charged))
-				return -ENOMEM;
-			vm_flags |= VM_ACCOUNT;
-		}
+	/*
+	 * Private writable mapping: check memory availability
+	 */
+	if (private_accountable_mapping(vm_flags)) {
+		charged = len >> PAGE_SHIFT;
+		if (security_vm_enough_memory(charged))
+			return -ENOMEM;
+		vm_flags |= VM_ACCOUNT;
 	}
 
 	/*
@@ -1184,14 +1194,6 @@ munmap_back:
 			goto free_vma;
 	}
 
-	/* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform
-	 * shmem_zero_setup (perhaps called through /dev/zero's ->mmap)
-	 * that memory reservation must be checked; but that reservation
-	 * belongs to shared memory object, not to vma: so now clear it.
-	 */
-	if ((vm_flags & (VM_SHARED|VM_ACCOUNT)) == (VM_SHARED|VM_ACCOUNT))
-		vma->vm_flags &= ~VM_ACCOUNT;
-
 	/* Can addr have changed??
 	 *
 	 * Answer: Yes, several device drivers can do it in their
diff --git a/mm/shmem.c b/mm/shmem.c
index 5d0de96..19d566c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2628,7 +2628,7 @@ struct file *shmem_file_setup(char *name, loff_t size, unsigned long flags)
 		goto close_file;
 
 #ifdef CONFIG_SHMEM
-	SHMEM_I(inode)->flags = flags & VM_ACCOUNT;
+	SHMEM_I(inode)->flags = (flags & VM_NORESERVE) ? 0 : VM_ACCOUNT;
 #endif
 	d_instantiate(dentry, inode);
 	inode->i_size = size;

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-31 18:34                                 ` Linus Torvalds
@ 2009-02-02 11:59                                   ` KOSAKI Motohiro
  2009-02-02 12:54                                     ` Hugh Dickins
  0 siblings, 1 reply; 47+ messages in thread
From: KOSAKI Motohiro @ 2009-02-02 11:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kosaki.motohiro, Hugh Dickins, Lee Schermerhorn, Greg KH,
	Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will,
	Rik van Riel, KAMEZAWA Hiroyuki, Mikos Szeredi

Hi

I went to trip for last three days and I returned today.
I suprised this loooong thread now :)

I have one comment.

> TOTALLY UNTESTED. As usual. But the concept is pretty simple, and it 
> actually removes a fair chunk of hacky code.  The only reason the diffstat 
> output says that it adds more lines than it deletes is that I added more 
> comments and made that helper inline function rather than make a complex 
> conditional.
> 
> Whaddaya think?
> 
> 		Linus
> 
> ---
>  mm/mmap.c  |   48 +++++++++++++++++++++++++-----------------------
>  mm/shmem.c |    2 +-
>  2 files changed, 26 insertions(+), 24 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index c581df1..5fcaec3 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1090,6 +1090,15 @@ int vma_wants_writenotify(struct vm_area_struct *vma)
>  		mapping_cap_account_dirty(vma->vm_file->f_mapping);
>  }
>  
> +/*
> + * We account for memory if it's a private writeable mapping,
> + * and VM_NORESERVE wasn't set.
> + */
> +static inline int private_accountable_mapping(unsigned int vm_flags)
> +{
> +	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
> +}
> +
>  unsigned long mmap_region(struct file *file, unsigned long addr,
>  			  unsigned long len, unsigned long flags,
>  			  unsigned int vm_flags, unsigned long pgoff,
> @@ -1117,23 +1126,24 @@ munmap_back:
>  	if (!may_expand_vm(mm, len >> PAGE_SHIFT))
>  		return -ENOMEM;
>  
> -	if (flags & MAP_NORESERVE)
> +	/*
> +	 * Set 'VM_NORESERVE' if we should not account for the
> +	 * memory use of this mapping. We only honor MAP_NORESERVE
> +	 * if we're allowed to overcommit memory.
> +	 */
> +	if ((flags & MAP_NORESERVE) && sysctl_overcommit_memory != OVERCOMMIT_NEVER)

I afraid this line a bit.
if following scenario happend, we can lost VM_NORESERVE?

1. admin set overcommit_memory to "never"
2. mmap
3. admin set overcommit_memory to "guess"


thanks.





^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-02-02 11:59                                   ` KOSAKI Motohiro
@ 2009-02-02 12:54                                     ` Hugh Dickins
  2009-02-02 14:10                                       ` KOSAKI Motohiro
  2009-02-02 18:33                                       ` Mel Gorman
  0 siblings, 2 replies; 47+ messages in thread
From: Hugh Dickins @ 2009-02-02 12:54 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linus Torvalds, Lee Schermerhorn, Greg KH, Maksim Yevmenkin,
	linux-kernel, Nick Piggin, Andrew Morton, will, Rik van Riel,
	KAMEZAWA Hiroyuki, Mikos Szeredi, Andy Whitcroft

On Mon, 2 Feb 2009, KOSAKI Motohiro wrote:
> >  
> > -	if (flags & MAP_NORESERVE)
> > +	/*
> > +	 * Set 'VM_NORESERVE' if we should not account for the
> > +	 * memory use of this mapping. We only honor MAP_NORESERVE
> > +	 * if we're allowed to overcommit memory.
> > +	 */
> > +	if ((flags & MAP_NORESERVE) && sysctl_overcommit_memory != OVERCOMMIT_NEVER)
> 
> I afraid this line a bit.
> if following scenario happend, we can lost VM_NORESERVE?
> 
> 1. admin set overcommit_memory to "never"
> 2. mmap
> 3. admin set overcommit_memory to "guess"

I still haven't reviewed it fully myself (and note that what
Linus put in his tree is not identical to this posted patch),
but I do believe this is okay.

When admin changes overcommit_memory, we don't make a pass across
every vma of every mm in the system, to adjust all the accounting
of VM_NORESERVE areas; so I think it's quite reasonable to take
VM_NORESERVE as reflecting the policy in force when that vma was
created.  And nothing is displaying the VM_NORESERVE flag.

Ah, you're actually thinking of
4. mprotect
with the original flags (!VM_WRITE) such that no VM_ACCOUNT was done,
and now VM_WRITE is added and the accounting is done despite it having
been mapped MAP_NORESERVE originally.  Whereas before Linus's change,
VM_NORESERVE would have still exempted it.

Well... I don't think I care!

But I wonder what the hugetlb situation is: that
	if (!accountable)
		vm_flags |= VM_NORESERVE;
looks suspicious to me, they look as if they're exempting all
the hugetlb pages from its accounting, whereas !accountable was
only supposed to exempt them from mmap_region()'s own accounting.

Perhaps.  I'm still looking at other things,
not given this the time it needs yet.

Hugh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-02-02 12:54                                     ` Hugh Dickins
@ 2009-02-02 14:10                                       ` KOSAKI Motohiro
  2009-02-02 18:58                                         ` Mel Gorman
  2009-02-02 18:33                                       ` Mel Gorman
  1 sibling, 1 reply; 47+ messages in thread
From: KOSAKI Motohiro @ 2009-02-02 14:10 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: kosaki.motohiro, Linus Torvalds, Lee Schermerhorn, Greg KH,
	Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will,
	Rik van Riel, KAMEZAWA Hiroyuki, Mikos Szeredi, Andy Whitcroft,
	Mel Gorman

(cc to mel)

> On Mon, 2 Feb 2009, KOSAKI Motohiro wrote:
> > >  
> > > -	if (flags & MAP_NORESERVE)
> > > +	/*
> > > +	 * Set 'VM_NORESERVE' if we should not account for the
> > > +	 * memory use of this mapping. We only honor MAP_NORESERVE
> > > +	 * if we're allowed to overcommit memory.
> > > +	 */
> > > +	if ((flags & MAP_NORESERVE) && sysctl_overcommit_memory != OVERCOMMIT_NEVER)
> > 
> > I afraid this line a bit.
> > if following scenario happend, we can lost VM_NORESERVE?
> > 
> > 1. admin set overcommit_memory to "never"
> > 2. mmap
> > 3. admin set overcommit_memory to "guess"
> 
> I still haven't reviewed it fully myself (and note that what
> Linus put in his tree is not identical to this posted patch),
> but I do believe this is okay.
> 
> When admin changes overcommit_memory, we don't make a pass across
> every vma of every mm in the system, to adjust all the accounting
> of VM_NORESERVE areas; so I think it's quite reasonable to take
> VM_NORESERVE as reflecting the policy in force when that vma was
> created.  And nothing is displaying the VM_NORESERVE flag.

hmhm, I see.


> Ah, you're actually thinking of
> 4. mprotect
> with the original flags (!VM_WRITE) such that no VM_ACCOUNT was done,
> and now VM_WRITE is added and the accounting is done despite it having
> been mapped MAP_NORESERVE originally.  Whereas before Linus's change,
> VM_NORESERVE would have still exempted it.
> 
> Well... I don't think I care!

Yeah.

FWIW, we don't need VM_NORESERVE checking now because VM_NORESERVE and VM_ACCOUNT
are exclusive condition now :)



> But I wonder what the hugetlb situation is: that
> 	if (!accountable)
> 		vm_flags |= VM_NORESERVE;
> looks suspicious to me, they look as if they're exempting all
> the hugetlb pages from its accounting, whereas !accountable was
> only supposed to exempt them from mmap_region()'s own accounting.

HAHAHA, Indeed.

when hugepage shared read-only mapping  -> hugepage shared writable maping,
following code seems to cause calling vm_enough_memory() although hugepage.

========================================================
mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
        unsigned long start, unsigned long end, unsigned long newflags)
{
        if (newflags & VM_WRITE) {
                if (!(oldflags & (VM_ACCOUNT|VM_WRITE|
                                                VM_SHARED|VM_NORESERVE))) {
                        charged = nrpages;
                        if (security_vm_enough_memory(charged))
                                return -ENOMEM;
                        newflags |= VM_ACCOUNT;
                }
        }
==========================================================

mel, what do you think this?

> 
> Perhaps.  I'm still looking at other things,
> not given this the time it needs yet.
> 
> Hugh




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-02-02 12:54                                     ` Hugh Dickins
  2009-02-02 14:10                                       ` KOSAKI Motohiro
@ 2009-02-02 18:33                                       ` Mel Gorman
  1 sibling, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2009-02-02 18:33 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: KOSAKI Motohiro, Linus Torvalds, Lee Schermerhorn, Greg KH,
	Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will,
	Rik van Riel, KAMEZAWA Hiroyuki, Mikos Szeredi, Andy Whitcroft

On Mon, Feb 02, 2009 at 12:54:15PM +0000, Hugh Dickins wrote:
> On Mon, 2 Feb 2009, KOSAKI Motohiro wrote:
> > >  
> > > -	if (flags & MAP_NORESERVE)
> > > +	/*
> > > +	 * Set 'VM_NORESERVE' if we should not account for the
> > > +	 * memory use of this mapping. We only honor MAP_NORESERVE
> > > +	 * if we're allowed to overcommit memory.
> > > +	 */
> > > +	if ((flags & MAP_NORESERVE) && sysctl_overcommit_memory != OVERCOMMIT_NEVER)
> > 
> > I afraid this line a bit.
> > if following scenario happend, we can lost VM_NORESERVE?
> > 
> > 1. admin set overcommit_memory to "never"
> > 2. mmap
> > 3. admin set overcommit_memory to "guess"
> 
> I still haven't reviewed it fully myself (and note that what
> Linus put in his tree is not identical to this posted patch),
> but I do believe this is okay.
> 
> When admin changes overcommit_memory, we don't make a pass across
> every vma of every mm in the system, to adjust all the accounting
> of VM_NORESERVE areas; so I think it's quite reasonable to take
> VM_NORESERVE as reflecting the policy in force when that vma was
> created.  And nothing is displaying the VM_NORESERVE flag.
> 
> Ah, you're actually thinking of
> 4. mprotect
> with the original flags (!VM_WRITE) such that no VM_ACCOUNT was done,
> and now VM_WRITE is added and the accounting is done despite it having
> been mapped MAP_NORESERVE originally.  Whereas before Linus's change,
> VM_NORESERVE would have still exempted it.
> 
> Well... I don't think I care!
> 
> But I wonder what the hugetlb situation is: that
> 	if (!accountable)
> 		vm_flags |= VM_NORESERVE;
> looks suspicious to me, they look as if they're exempting all
> the hugetlb pages from its accounting, whereas !accountable was
> only supposed to exempt them from mmap_region()'s own accounting.
> 

Am playing a bit of catch-up here and just reading the patch for the first
time. Whatever about the rest of it, I can comment on the hugetlb aspects
at least.  Glancing through commit fc8744adc870a8d4366908221508bb113d8b72ee
looked like it would break hugetlb accounting and sure enough, a test set
off an OOM storm killing almost everything in sight.

Any opinions on the patch below? As a bonus, it gets rid of this "accountable"
parameter altogether and figures out everything needed from the file and
flags.

================
Do not account for the address space used by hugetlbfs

hugetlb always applies strict overcommit semantics to shared and private
mappings regardless of the value in /proc/sys/vm/overcommit_memory.
mm/hugetlb.c has reservation counters that are checked and updated
during mmap() to ensure that hugepages exist in the future when faults
occurs. Otherwise, it is too easy to applications to be SIGKILLed. It
uses VM_NORESERVE to track whether MAP_NORESERVE was set at mmap() time.
If an application application fails with MAP_NORESERVE, they get to keep both
pieces. The core VM does not handle VM_ACCOUNT correctly for hugetlbfs-backed
mappings so it should not be set.

With commit fc8744adc870a8d4366908221508bb113d8b72ee, VM_NORESERVE and
VM_ACCOUNT are getting set for hugetlbfs-backed mappings resulting in a
lot of oddness and processes getting killed. Special-case hugetlbfs as
it handles its own accounting.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
--- 
 include/linux/mm.h |    3 +--
 mm/fremap.c        |    2 +-
 mm/mmap.c          |   30 ++++++++++++++++++++----------
 3 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e8ddc98..3235615 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1129,8 +1129,7 @@ extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
 	unsigned long flag, unsigned long pgoff);
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long flags,
-	unsigned int vm_flags, unsigned long pgoff,
-	int accountable);
+	unsigned int vm_flags, unsigned long pgoff);
 
 static inline unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot,
diff --git a/mm/fremap.c b/mm/fremap.c
index 736ba7f..b6ec85a 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -198,7 +198,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 			flags &= MAP_NONBLOCK;
 			get_file(file);
 			addr = mmap_region(file, start, size,
-					flags, vma->vm_flags, pgoff, 1);
+					flags, vma->vm_flags, pgoff);
 			fput(file);
 			if (IS_ERR_VALUE(addr)) {
 				err = addr;
diff --git a/mm/mmap.c b/mm/mmap.c
index 214b6a2..c4f3321 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -918,7 +918,6 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
 	struct inode *inode;
 	unsigned int vm_flags;
 	int error;
-	int accountable = 1;
 	unsigned long reqprot = prot;
 
 	/*
@@ -1019,8 +1018,6 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
 					return -EPERM;
 				vm_flags &= ~VM_MAYEXEC;
 			}
-			if (is_file_hugepages(file))
-				accountable = 0;
 
 			if (!file->f_op || !file->f_op->mmap)
 				return -ENODEV;
@@ -1053,8 +1050,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
 	if (error)
 		return error;
 
-	return mmap_region(file, addr, len, flags, vm_flags, pgoff,
-			   accountable);
+	return mmap_region(file, addr, len, flags, vm_flags, pgoff);
 }
 EXPORT_SYMBOL(do_mmap_pgoff);
 
@@ -1096,13 +1092,29 @@ int vma_wants_writenotify(struct vm_area_struct *vma)
  */
 static inline int accountable_mapping(unsigned int vm_flags)
 {
+	/* hugetlbfs does its own accounting */
+	if (vm_flags & VM_HUGETLB)
+		return 0;
+
 	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
 }
 
+static inline int should_overcommit(struct file *file)
+{
+	/* Check if the sysctl allows overcommit */
+	if (sysctl_overcommit_memory != OVERCOMMIT_NEVER)
+		return 1;
+
+	/* hugetlbfs does its own accounting */
+	if (file && is_file_hugepages(file))
+		return 1;
+
+	return 0;
+}
+
 unsigned long mmap_region(struct file *file, unsigned long addr,
 			  unsigned long len, unsigned long flags,
-			  unsigned int vm_flags, unsigned long pgoff,
-			  int accountable)
+			  unsigned int vm_flags, unsigned long pgoff)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma, *prev;
@@ -1131,9 +1143,7 @@ munmap_back:
 	 * memory use of this mapping. We only honor MAP_NORESERVE
 	 * if we're allowed to overcommit memory.
 	 */
-	if ((flags & MAP_NORESERVE) && sysctl_overcommit_memory != OVERCOMMIT_NEVER)
-		vm_flags |= VM_NORESERVE;
-	if (!accountable)
+	if ((flags & MAP_NORESERVE) && should_overcommit(file))
 		vm_flags |= VM_NORESERVE;
 
 	/*

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-02-02 14:10                                       ` KOSAKI Motohiro
@ 2009-02-02 18:58                                         ` Mel Gorman
  2009-02-02 19:23                                           ` Linus Torvalds
  0 siblings, 1 reply; 47+ messages in thread
From: Mel Gorman @ 2009-02-02 18:58 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Hugh Dickins, Linus Torvalds, Lee Schermerhorn, Greg KH,
	Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will,
	Rik van Riel, KAMEZAWA Hiroyuki, Mikos Szeredi, Andy Whitcroft

On Mon, Feb 02, 2009 at 11:10:42PM +0900, KOSAKI Motohiro wrote:
> (cc to mel)
> 
> > On Mon, 2 Feb 2009, KOSAKI Motohiro wrote:
> > > >  
> > > > -	if (flags & MAP_NORESERVE)
> > > > +	/*
> > > > +	 * Set 'VM_NORESERVE' if we should not account for the
> > > > +	 * memory use of this mapping. We only honor MAP_NORESERVE
> > > > +	 * if we're allowed to overcommit memory.
> > > > +	 */
> > > > +	if ((flags & MAP_NORESERVE) && sysctl_overcommit_memory != OVERCOMMIT_NEVER)
> > > 
> > > I afraid this line a bit.
> > > if following scenario happend, we can lost VM_NORESERVE?
> > > 
> > > 1. admin set overcommit_memory to "never"
> > > 2. mmap
> > > 3. admin set overcommit_memory to "guess"
> > 
> > I still haven't reviewed it fully myself (and note that what
> > Linus put in his tree is not identical to this posted patch),
> > but I do believe this is okay.
> > 
> > When admin changes overcommit_memory, we don't make a pass across
> > every vma of every mm in the system, to adjust all the accounting
> > of VM_NORESERVE areas; so I think it's quite reasonable to take
> > VM_NORESERVE as reflecting the policy in force when that vma was
> > created.  And nothing is displaying the VM_NORESERVE flag.
> 
> hmhm, I see.
> 
> 
> > Ah, you're actually thinking of
> > 4. mprotect
> > with the original flags (!VM_WRITE) such that no VM_ACCOUNT was done,
> > and now VM_WRITE is added and the accounting is done despite it having
> > been mapped MAP_NORESERVE originally.  Whereas before Linus's change,
> > VM_NORESERVE would have still exempted it.
> > 
> > Well... I don't think I care!
> 
> Yeah.
> 
> FWIW, we don't need VM_NORESERVE checking now because VM_NORESERVE and VM_ACCOUNT
> are exclusive condition now :)
> 
> 
> 
> > But I wonder what the hugetlb situation is: that
> > 	if (!accountable)
> > 		vm_flags |= VM_NORESERVE;
> > looks suspicious to me, they look as if they're exempting all
> > the hugetlb pages from its accounting, whereas !accountable was
> > only supposed to exempt them from mmap_region()'s own accounting.
> 
> HAHAHA, Indeed.
> 

Candidate patch for clearing that up as been posted. Thanks for cc'ing me
on this as I would have missed it.

> when hugepage shared read-only mapping  -> hugepage shared writable maping,
> following code seems to cause calling vm_enough_memory() although hugepage.
> 
> ========================================================
> mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
>         unsigned long start, unsigned long end, unsigned long newflags)
> {
>         if (newflags & VM_WRITE) {
>                 if (!(oldflags & (VM_ACCOUNT|VM_WRITE|
>                                                 VM_SHARED|VM_NORESERVE))) {
>                         charged = nrpages;
>                         if (security_vm_enough_memory(charged))
>                                 return -ENOMEM;
>                         newflags |= VM_ACCOUNT;
>                 }
>         }
> ==========================================================
> 
> mel, what do you think this?
> 

I think there is a problem there all right. VM_ACCOUNT will not be set
with the other patch applied but VM_NORESERVE might be. If so, we
potentially set VM_ACCOUNT on a hugetlbfs mapping and probably make a
mess out of Committed_AS later. Maybe something like the following?

================
Do not account for address space usage when making hugetlbfs mappings RW

hugetlbfs accounts for its address space usage separate from the VM
core. VM_ACCOUNT should not be set for its mappings but it is possible it gets
set if a user creates a RO hugetlbfs mapping MAP_NORESERVE and then calls
mprotect(). This patch stops VM_ACCOUNT being set for hugetlbfs mappings
during mprotect().

Credit goes to Kosaki Motohiro for spotting this.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>

diff --git a/mm/mprotect.c b/mm/mprotect.c
index abe2694..31ddc6a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -153,7 +153,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	 * but (without finer accounting) cannot reduce our commit if we
 	 * make it unwritable again.
 	 */
-	if (newflags & VM_WRITE) {
+	if (newflags & VM_WRITE && !(vma->vm_flags & VM_HUGETLB)) {
 		if (!(oldflags & (VM_ACCOUNT|VM_WRITE|
 						VM_SHARED|VM_NORESERVE))) {
 			charged = nrpages;


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-02-02 18:58                                         ` Mel Gorman
@ 2009-02-02 19:23                                           ` Linus Torvalds
  2009-02-02 21:50                                             ` Mel Gorman
  0 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2009-02-02 19:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Hugh Dickins, Lee Schermerhorn, Greg KH,
	Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will,
	Rik van Riel, KAMEZAWA Hiroyuki, Mikos Szeredi, Andy Whitcroft



On Mon, 2 Feb 2009, Mel Gorman wrote:
>
> Do not account for address space usage when making hugetlbfs mappings RW
> 
> hugetlbfs accounts for its address space usage separate from the VM
> core. VM_ACCOUNT should not be set for its mappings but it is possible it gets
> set if a user creates a RO hugetlbfs mapping MAP_NORESERVE and then calls
> mprotect(). This patch stops VM_ACCOUNT being set for hugetlbfs mappings
> during mprotect().
> 
> Credit goes to Kosaki Motohiro for spotting this.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index abe2694..31ddc6a 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -153,7 +153,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
>  	 * but (without finer accounting) cannot reduce our commit if we
>  	 * make it unwritable again.
>  	 */
> -	if (newflags & VM_WRITE) {
> +	if (newflags & VM_WRITE && !(vma->vm_flags & VM_HUGETLB)) {

Wouldn't it be _much_ nicer to just depend on that whole VM_NORESERVE 
thing?

Those hugetlb mappings _should_ have VM_NORESERVE on them, so the 
following test:

>  		if (!(oldflags & (VM_ACCOUNT|VM_WRITE|
>  						VM_SHARED|VM_NORESERVE))) {
>  			charged = nrpages;

should do it all correctly.

Why make up some ad-hoc testing, when we already have a flag for _exactly_ 
this issue. That's what VM_NORESERVE means: don't apply VM_ACCOUNT.

IOW, I don't see the point of this patch at all.

And if there is some hugetlb path that doesn't set VM_NORESERVE, then fix 
_that_ instead.

			Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-02-02 19:23                                           ` Linus Torvalds
@ 2009-02-02 21:50                                             ` Mel Gorman
  2009-02-02 22:12                                               ` Linus Torvalds
  0 siblings, 1 reply; 47+ messages in thread
From: Mel Gorman @ 2009-02-02 21:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Hugh Dickins, Lee Schermerhorn, Greg KH,
	Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will,
	Rik van Riel, KAMEZAWA Hiroyuki, Mikos Szeredi, Andy Whitcroft

On Mon, Feb 02, 2009 at 11:23:33AM -0800, Linus Torvalds wrote:
> 
> 
> On Mon, 2 Feb 2009, Mel Gorman wrote:
> >
> > Do not account for address space usage when making hugetlbfs mappings RW
> > 
> > hugetlbfs accounts for its address space usage separate from the VM
> > core. VM_ACCOUNT should not be set for its mappings but it is possible it gets
> > set if a user creates a RO hugetlbfs mapping MAP_NORESERVE and then calls
> > mprotect(). This patch stops VM_ACCOUNT being set for hugetlbfs mappings
> > during mprotect().
> > 
> > Credit goes to Kosaki Motohiro for spotting this.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > 
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index abe2694..31ddc6a 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -153,7 +153,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
> >  	 * but (without finer accounting) cannot reduce our commit if we
> >  	 * make it unwritable again.
> >  	 */
> > -	if (newflags & VM_WRITE) {
> > +	if (newflags & VM_WRITE && !(vma->vm_flags & VM_HUGETLB)) {
> 
> Wouldn't it be _much_ nicer to just depend on that whole VM_NORESERVE 
> thing?
> 

Yeah, it would but it's not a trivial change. mm/hugetlb.c depends on
VM_NORESERVE for its own accounting but also depends on VM_ACCOUNT not being
set because counters would get mucked up when the VMAs get unmapped.

The ideal answer would be to handle VM_ACCOUNT properly but it's not
clear-cut. If it's counted towards reserves, then we are double reserving -
the huge pages already allocated and base pages that will never be used. Then
again, maybe the right thing to do is update nr_accounted when VM_HUGETLB
is not set converting things like

               if (vma->vm_flags & VM_ACCOUNT)
                        *nr_accounted += (end - start) >> PAGE_SHIFT;

to

               if (vma->vm_flags & (VM_ACCOUNT | VM_HUGETLB) == VM_ACCOUNT)
                        *nr_accounted += (end - start) >> PAGE_SHIFT;

?

> Those hugetlb mappings _should_ have VM_NORESERVE on them, so the 
> following test:
> 
> >  		if (!(oldflags & (VM_ACCOUNT|VM_WRITE|
> >  						VM_SHARED|VM_NORESERVE))) {
> >  			charged = nrpages;
> 
> should do it all correctly.
> 

Lets say someone does the following

1. mmap(PROT_READ, MAP_PRIVATE) on a hugetlbfs file
	VM_ACCOUNT is not set for hugetlbfs
	VM_NORESERVE is not set because MAP_NORESERVE was not there
2. mprotect(PROT_WRITE)
	VM_ACCOUNT|VM_WRITE|VM_SHARE|VM_NORESERVE == 0
	That check is true
	newflags |= VM_ACCOUNT and hugetlbfs now has VM_ACCOUNT
3. unmap the vmas
	nr_accounted gets decremented, maybe wraps negative and
	unhappiness ensues

> Why make up some ad-hoc testing, when we already have a flag for _exactly_ 
> this issue. That's what VM_NORESERVE means: don't apply VM_ACCOUNT.
> 
> IOW, I don't see the point of this patch at all.
> 
> And if there is some hugetlb path that doesn't set VM_NORESERVE, then fix 
> _that_ instead.
> 

It gets set all right, the problem is with VM_ACCOUNT getting set.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-02-02 21:50                                             ` Mel Gorman
@ 2009-02-02 22:12                                               ` Linus Torvalds
  2009-02-02 22:35                                                 ` Mel Gorman
  0 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2009-02-02 22:12 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Hugh Dickins, Lee Schermerhorn, Greg KH,
	Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will,
	Rik van Riel, KAMEZAWA Hiroyuki, Mikos Szeredi, Andy Whitcroft



On Mon, 2 Feb 2009, Mel Gorman wrote:
> 
> Lets say someone does the following
> 
> 1. mmap(PROT_READ, MAP_PRIVATE) on a hugetlbfs file
> 	VM_ACCOUNT is not set for hugetlbfs
> 	VM_NORESERVE is not set because MAP_NORESERVE was not there

But isn't this exactly the thing that we have that odd "accountable" flag 
for, and we do the whole

	if (!accountable)
		vm_flags |= VM_NORESERVE;

in mmap_region() for?

So VM_NORESERVE _will_ be set.

				Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-02-02 22:12                                               ` Linus Torvalds
@ 2009-02-02 22:35                                                 ` Mel Gorman
  0 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2009-02-02 22:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Hugh Dickins, Lee Schermerhorn, Greg KH,
	Maksim Yevmenkin, linux-kernel, Nick Piggin, Andrew Morton, will,
	Rik van Riel, KAMEZAWA Hiroyuki, Mikos Szeredi, Andy Whitcroft

On Mon, Feb 02, 2009 at 02:12:30PM -0800, Linus Torvalds wrote:
> 
> 
> On Mon, 2 Feb 2009, Mel Gorman wrote:
> > 
> > Lets say someone does the following
> > 
> > 1. mmap(PROT_READ, MAP_PRIVATE) on a hugetlbfs file
> > 	VM_ACCOUNT is not set for hugetlbfs
> > 	VM_NORESERVE is not set because MAP_NORESERVE was not there
> 
> But isn't this exactly the thing that we have that odd "accountable" flag 
> for, and we do the whole
> 
> 	if (!accountable)
> 		vm_flags |= VM_NORESERVE;
> 
> in mmap_region() for?
> 
> So VM_NORESERVE _will_ be set.
> 

Then it's getting unconditionally set which breaks the hugetlb accounting
for reserving hugepages. See mm/hugetlb.c#decrement_hugepage_resv_vma() and
mm/hugetlb.c#hugetlb_reserve_pages() which depend on VM_NORESERVE being set
or not set depending on MAP_NORESERVE, not whether the core VM is accounting
it or not.

This is why I tried replacing 

 	if (!accountable)
 		vm_flags |= VM_NORESERVE;

with

        if ((flags & MAP_NORESERVE) && should_overcommit(file))
                vm_flags |= VM_NORESERVE;

and

static inline int should_overcommit(struct file *file)
{
        /* Check if the sysctl allows overcommit */
        if (sysctl_overcommit_memory != OVERCOMMIT_NEVER)
                return 1;

        /* hugetlbfs does its own accounting */
        if (file && is_file_hugepages(file))
                return 1;

        return 0;
}

in the first patch I mailed out.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-01-31 12:35                               ` Hugh Dickins
  2009-01-31 18:34                                 ` Linus Torvalds
@ 2009-02-03 16:13                                 ` Lee Schermerhorn
  2009-02-03 16:40                                   ` Linus Torvalds
  2009-02-03 17:10                                   ` Hugh Dickins
  1 sibling, 2 replies; 47+ messages in thread
From: Lee Schermerhorn @ 2009-02-03 16:13 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Greg KH, Maksim Yevmenkin, linux-kernel,
	Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Mikos Szeredi

On Sat, 2009-01-31 at 12:35 +0000, Hugh Dickins wrote:
<snip>
> I have by now recalled why I chose to play those VM_ACCOUNT games:
> 	/* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform
> 	 * shmem_zero_setup (perhaps called through /dev/zero's ->mmap)
> 	 * that memory reservation must be checked; but that reservation
> 	 * belongs to shared memory object, not to vma: so now clear it.
> We need a way to communicate not-MAP_NORESERVE to shmem.c, and we don't
> just need it in the explicit shmem_zero_setup() case, we also need it
> for the (probably rare nowadays) case when mmap() is working on file
          ^^^^^^^^^^^^^^^^^^^^^^^^
> /dev/zero (drivers/char/mem.c mmap_zero()), rather than using MAP_ANON.


This reminded me of something I'd seen recently looking
at /proc/<pid>/[numa]_maps for <a large commercial database> on
Linux/x86_64: 

2adadf247000-2adadf2b2000 rwxp 2adadf247000 00:00 0
2adadf2b2000-2adadf2b3000 rwxs 00000000 68:31 55362966                   <some file != /dev/zero>
2adadf2b9000-2adadf2c0000 rwxp 00000000 00:0e 4072                       /dev/zero
2adadf2c0000-2adadf2d0000 rwxp 00000000 00:0e 4072                       /dev/zero
2adadf2d0000-2adadf2e0000 rwxp 00000000 00:0e 4072                       /dev/zero
2adadf2e0000-2adadf2f0000 rwxp 00000000 00:0e 4072                       /dev/zero
2adadf2f0000-2adadf300000 rwxp 00000000 00:0e 4072                       /dev/zero
2adadf300000-2adadf310000 rwxp 00000000 00:0e 4072                       /dev/zero
2adadf310000-2adadf320000 rwxp 00000000 00:0e 4072                       /dev/zero
2adadf320000-2adadf330000 rwxp 00000000 00:0e 4072                       /dev/zero
2adadf330000-2adadf339000 rwxp 00077000 00:0e 4072                       /dev/zero
2adadf353000-2adadf35a000 r-xp 00000000 69:02 1228822                    /lib64/libnss_compat-2.4.so
2adadf35a000-2adadf459000 ---p 00007000 69:02 1228822                    /lib64/libnss_compat-2.4.so
2adadf459000-2adadf45b000 rwxp 00006000 69:02 1228822                    /lib64/libnss_compat-2.4.so
2adadf45b000-2adadf464000 r-xp 00000000 69:02 1228830                    /lib64/libnss_nis-2.4.so
2adadf464000-2adadf564000 ---p 00009000 69:02 1228830                    /lib64/libnss_nis-2.4.so
2adadf564000-2adadf566000 rwxp 00009000 69:02 1228830                    /lib64/libnss_nis-2.4.so
2adadf566000-2adadf570000 r-xp 00000000 69:02 1228826                    /lib64/libnss_files-2.4.so
2adadf570000-2adadf66f000 ---p 0000a000 69:02 1228826                    /lib64/libnss_files-2.4.so
2adadf66f000-2adadf671000 rwxp 00009000 69:02 1228826                    /lib64/libnss_files-2.4.so
2adadf671000-2adadf681000 rwxp 00000000 00:0e 4072                       /dev/zero
2adadf681000-2adadf6a1000 rwxp 00000000 00:0e 4072                       /dev/zero
2adadf6a1000-2adadf6b1000 rwxp 00000000 00:0e 4072                       /dev/zero
2adadf6b1000-2adadf6c1000 rwxp 00000000 00:0e 4072                       /dev/zero
2adadf6c1000-2adadf6d1000 rwxp 00000000 00:0e 4072                       /dev/zero
2adadf6d1000-2adadf6e1000 rwxp 00000000 00:0e 4072                       /dev/zero
2adadf6e1000-2adadf6f1000 rwxp 00000000 00:0e 4072                       /dev/zero
2adadf6f1000-2adadf701000 rwxp 00000000 00:0e 4072                       /dev/zero
2adadf701000-2adadf711000 rwxp 00000000 00:0e 4072                       /dev/zero
2adadf711000-2adadf721000 rwxp 00000000 00:0e 4072                       /dev/zero
2adadf721000-2adadf731000 rwxp 00000000 00:0e 4072                       /dev/zero
2adadf731000-2adadf741000 rwxp 00000000 00:0e 4072                       /dev/zero

<and so on, for another 90 lines until>

7fffcdd36000-7fffcdd4e000 rwxp 7fffcdd36000 00:00 0                      [stack]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0                  [vdso]

For portability between Linux and various Unix-like systems that don't
support MAP_ANON*, perhaps?

Anyway, from the addresses and permissions, these all look potentially
mergeable.  The offset is preventing merging, right?  I guess that's one
of the downsides of mapping /dev/zero rather than using MAP_ANONYMOUS?

Makes one wonder whether it would be worthwhile [not to mention
possible] to rework mmap_zero() to mimic MAP_ANONYMOUS...

Lee


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-02-03 16:13                                 ` Lee Schermerhorn
@ 2009-02-03 16:40                                   ` Linus Torvalds
  2009-02-03 17:10                                   ` Hugh Dickins
  1 sibling, 0 replies; 47+ messages in thread
From: Linus Torvalds @ 2009-02-03 16:40 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Hugh Dickins, Greg KH, Maksim Yevmenkin, linux-kernel,
	Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Mikos Szeredi



On Tue, 3 Feb 2009, Lee Schermerhorn wrote:
> 
> This reminded me of something I'd seen recently looking
> at /proc/<pid>/[numa]_maps for <a large commercial database> on
> Linux/x86_64: 
> 
> 2adadf2b9000-2adadf2c0000 rwxp 00000000 00:0e 4072                       /dev/zero
> 
> For portability between Linux and various Unix-like systems that don't
> support MAP_ANON*, perhaps?

Odd. 

At first I thought that it is just that Linux will turn a MAP_SHARED | 
MAP_ANON into that /dev/zero thing, so you won't be able to tell by lookup 
at /proc/maps. So it would be very possible that the application did not 
actually open /dev/zero at all, and used MAP_ANON instead (see the whole 
shmem_zero_setup() and shmem_file_setup() thing).

But those mappings have that 'p' for private there, so it's not 
MAP_SHARED. And yes, that means that your large commercial database really 
did open /dev/zero and mapped it privately. They must be living in the 
past.

> Anyway, from the addresses and permissions, these all look potentially
> mergeable.  The offset is preventing merging, right?  I guess that's one
> of the downsides of mapping /dev/zero rather than using MAP_ANONYMOUS?

Yeah. The MAP_ANON code has a total hack:

                case MAP_PRIVATE:
                        /*
                         * Set pgoff according to addr for anon_vma.
                         */
                        pgoff = addr >> PAGE_SHIFT;
                        break;

where the whole point is to allow sharing: since pgoff doesn't matter, we 
can make it be something that will merge _if_ you don't play games (of 
course, if you then start usign mremap to move things around, that all 
breaks, and you lose the merging ;)

That said, if it's just a hundred segments, nobody really cares. It's 
going to make vma lookup fractionally slower, but not so anybody would 
likely ever notice even in benchmarks. And if it's just this one db, it's 
certainly not going to use any noticeable amount of memory either.

Merging is important, but it's important to avoid the _really_ common 
cases, and to make /proc/maps more readable etc. It's not like it matters 
for the occasional crazy setup.

But you could still try to teach the DB people to use MAP_ANON.

			Linus

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-02-03 16:13                                 ` Lee Schermerhorn
  2009-02-03 16:40                                   ` Linus Torvalds
@ 2009-02-03 17:10                                   ` Hugh Dickins
  2009-02-03 21:50                                     ` Lee Schermerhorn
  1 sibling, 1 reply; 47+ messages in thread
From: Hugh Dickins @ 2009-02-03 17:10 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Linus Torvalds, Greg KH, Maksim Yevmenkin, linux-kernel,
	Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Mikos Szeredi

On Tue, 3 Feb 2009, Lee Schermerhorn wrote:
> On Sat, 2009-01-31 at 12:35 +0000, Hugh Dickins wrote:
> > We need a way to communicate not-MAP_NORESERVE to shmem.c, and we don't
> > just need it in the explicit shmem_zero_setup() case, we also need it
> > for the (probably rare nowadays) case when mmap() is working on file
>           ^^^^^^^^^^^^^^^^^^^^^^^^
> > /dev/zero (drivers/char/mem.c mmap_zero()), rather than using MAP_ANON.
> 
> 
> This reminded me of something I'd seen recently looking
> at /proc/<pid>/[numa]_maps for <a large commercial database> on
> Linux/x86_64: 
>...
> 2adadf711000-2adadf721000 rwxp 00000000 00:0e 4072                       /dev/zero
> 2adadf721000-2adadf731000 rwxp 00000000 00:0e 4072                       /dev/zero
> 2adadf731000-2adadf741000 rwxp 00000000 00:0e 4072                       /dev/zero
> 
> <and so on, for another 90 lines until>
> 
> 7fffcdd36000-7fffcdd4e000 rwxp 7fffcdd36000 00:00 0                      [stack]
> ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0                  [vdso]
> 
> For portability between Linux and various Unix-like systems that don't
> support MAP_ANON*, perhaps?
> 
> Anyway, from the addresses and permissions, these all look potentially
> mergeable.  The offset is preventing merging, right?  I guess that's one
> of the downsides of mapping /dev/zero rather than using MAP_ANONYMOUS?
> 
> Makes one wonder whether it would be worthwhile [not to mention
> possible] to rework mmap_zero() to mimic MAP_ANONYMOUS...

That's certainly an interesting observation, and thank you for sharing
it with us (hmm, I sound like a self-help group leader or something).

I don't really have anything to add to what Linus said (and hadn't
got around to realizing the significance of the "p" there before I
saw his reply).

Mmm, it's interesting, but I fear to add more hacks in there just
for this - I guess we could, but I'd rather not, unless it becomes
a serious issue.

Let's just tuck away the knowledge of this case for now.

Hugh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
  2009-02-03 17:10                                   ` Hugh Dickins
@ 2009-02-03 21:50                                     ` Lee Schermerhorn
  0 siblings, 0 replies; 47+ messages in thread
From: Lee Schermerhorn @ 2009-02-03 21:50 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Greg KH, Maksim Yevmenkin, linux-kernel,
	Nick Piggin, Andrew Morton, will, Rik van Riel, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Mikos Szeredi

On Tue, 2009-02-03 at 17:10 +0000, Hugh Dickins wrote:
> On Tue, 3 Feb 2009, Lee Schermerhorn wrote:
> > On Sat, 2009-01-31 at 12:35 +0000, Hugh Dickins wrote:
> > > We need a way to communicate not-MAP_NORESERVE to shmem.c, and we don't
> > > just need it in the explicit shmem_zero_setup() case, we also need it
> > > for the (probably rare nowadays) case when mmap() is working on file
> >           ^^^^^^^^^^^^^^^^^^^^^^^^
> > > /dev/zero (drivers/char/mem.c mmap_zero()), rather than using MAP_ANON.
> > 
> > 
> > This reminded me of something I'd seen recently looking
> > at /proc/<pid>/[numa]_maps for <a large commercial database> on
> > Linux/x86_64: 
> >...
> > 2adadf711000-2adadf721000 rwxp 00000000 00:0e 4072                       /dev/zero
> > 2adadf721000-2adadf731000 rwxp 00000000 00:0e 4072                       /dev/zero
> > 2adadf731000-2adadf741000 rwxp 00000000 00:0e 4072                       /dev/zero
> > 
> > <and so on, for another 90 lines until>
> > 
> > 7fffcdd36000-7fffcdd4e000 rwxp 7fffcdd36000 00:00 0                      [stack]
> > ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0                  [vdso]
> > 
> > For portability between Linux and various Unix-like systems that don't
> > support MAP_ANON*, perhaps?
> > 
> > Anyway, from the addresses and permissions, these all look potentially
> > mergeable.  The offset is preventing merging, right?  I guess that's one
> > of the downsides of mapping /dev/zero rather than using MAP_ANONYMOUS?
> > 
> > Makes one wonder whether it would be worthwhile [not to mention
> > possible] to rework mmap_zero() to mimic MAP_ANONYMOUS...
> 
> That's certainly an interesting observation, and thank you for sharing
> it with us (hmm, I sound like a self-help group leader or something).
> 
> I don't really have anything to add to what Linus said (and hadn't
> got around to realizing the significance of the "p" there before I
> saw his reply).
> 
> Mmm, it's interesting, but I fear to add more hacks in there just
> for this - I guess we could, but I'd rather not, unless it becomes
> a serious issue.
> 
> Let's just tuck away the knowledge of this case for now.

Right.   And a bit more info to tuck away...

I routinely grab the proc maps and numa_maps from our largish servers
running various "industry standard benchmarks".  Prompted by Linus'
comment that "if it's just a hundred segments, nobody really cares", I
went back and looked a bit further at the maps for a recent run.  

Below are some segment counts for the run.  The benchmark involved 32
"instances" of the application--a technique used to reduce contention on
application internal resources as the user count increases--along with
it's data base task[s].  Each instance spawns a few processes [5-6
average, up to ~14, for this run] that share a few instance-specific
SYSV segments between them.

In each instance, one of those shmem segments exhibits a similar pattern
to the /dev/zero segments from the prior mail.  Many, altho' not all, of
the individual vmas are adjacent with the same permissions:  'r--s'.
E.g., a small snippet:

2ac0e3cf0000-2ac0e40f5000 r--s 00d26000 00:08 15695938                   /SYSV0000277a (deleted)
2ac0e40f5000-2ac0e4101000 r--s 0112b000 00:08 15695938                   /SYSV0000277a (deleted)
2ac0e4101000-2ac0e4102000 r--s 01137000 00:08 15695938                   /SYSV0000277a (deleted)
2ac0e4102000-2ac0e4113000 r--s 01138000 00:08 15695938                   /SYSV0000277a (deleted)
2ac0e4113000-2ac0e4114000 r--s 01149000 00:08 15695938                   /SYSV0000277a (deleted)
2ac0e4114000-2ac0e4115000 r--s 0114a000 00:08 15695938                   /SYSV0000277a (deleted)
2ac0e4115000-2ac0e4116000 r--s 0114b000 00:08 15695938                   /SYSV0000277a (deleted)
2ac0e4116000-2ac0e4117000 r--s 0114c000 00:08 15695938                   /SYSV0000277a (deleted)

I counted 2000-3600+ of these for a couple of tasks.  How they got like
this--one vma per page?--I'm not sure.  Perhaps a sequence of mprotect()
calls or such after attaching the segment.  [I'll try to get an strace
sometime.] Then I counted the occurrences of the patterns:
'^2.*r--s.*/SYSV' in each of the instances as, again, each instance uses
a different shmem segment among its tasks.  For good measure, I counted
the '/dev/zero' segments as well:

              SYSV shm    /dev/zero
instance 00        5771    217
instance 01        6025    183
instance 02        5738    176
instance 03        5798    177
instance 04        5709    182
instance 05        5423    915
instance 06        5513    929
instance 07        5915    180
instance 08        5802    182
instance 09        5690    177
instance 10        5643    177
instance 11        5647    180
instance 12        5656    182
instance 13        5672    181
instance 14        5522    180
instance 15        5497    180
instance 16        5594    179
instance 17        4922    906
instance 18        6956    935
instance 19        5769    181
instance 20        5771    180
instance 21        5712    180
instance 22        5711    184
instance 23        5631    179
instance 24        5586    180
instance 25        5640    180
instance 26        5614    176
instance 27        5523    176
instance 28        5600    179
instance 29        5473    177
instance 30        5581    180
instance 31        5470    180

A total of ~ 180K shmem segments, not counting the /dev/zero mappings.
Good thing we have a lot of memory :).  A couple of those segments per
instance are different shmem segments--just 2 or 3 out of 5k-6k in the
cases that I looked at.

The benchmark seems to run fairly well, so I'm not saying we have a
problem here--with the Linux kernel, anyway.  Just some raw data from a
pseudo-real-world application load.  ['pseudo' because I'm told no real
user would ever set up the app quite this way :)]

Also, this is on a vintage 2.6.16+ kernel [not my choice].  Soon I'll
have data from a much more recent release.

Lee





^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2009-02-03 21:50 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <bb4a86c70901281151w4300605r3882461cd6e9774a@mail.gmail.com>
     [not found] ` <alpine.LFD.2.00.0901281316450.3123@localhost.localdomain>
2009-01-29 20:03   ` [PATCH] Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments Lee Schermerhorn
2009-01-29 20:33     ` Linus Torvalds
2009-01-29 20:48       ` Linus Torvalds
2009-01-29 22:32         ` Hugh Dickins
2009-01-29 23:02           ` Linus Torvalds
2009-01-30  4:43             ` Lee Schermerhorn
2009-01-30  4:49               ` Linus Torvalds
2009-01-29 22:47         ` Maksim Yevmenkin
2009-01-29 22:48           ` Randy Dunlap
2009-01-29 23:31             ` Maksim Yevmenkin
2009-01-30  2:08           ` Linus Torvalds
2009-01-30  5:56             ` Greg KH
2009-01-30 16:36               ` Linus Torvalds
2009-01-30 17:40                 ` Hugh Dickins
2009-01-30 18:14                   ` Linus Torvalds
2009-01-30 18:30                     ` Hugh Dickins
2009-01-30 19:53                     ` Lee Schermerhorn
2009-01-30 20:31                       ` Linus Torvalds
2009-01-30 21:12                         ` Hugh Dickins
2009-01-30 21:25                           ` Linus Torvalds
2009-01-30 21:36                           ` Lee Schermerhorn
2009-01-30 22:27                             ` Linus Torvalds
2009-01-31 12:35                               ` Hugh Dickins
2009-01-31 18:34                                 ` Linus Torvalds
2009-02-02 11:59                                   ` KOSAKI Motohiro
2009-02-02 12:54                                     ` Hugh Dickins
2009-02-02 14:10                                       ` KOSAKI Motohiro
2009-02-02 18:58                                         ` Mel Gorman
2009-02-02 19:23                                           ` Linus Torvalds
2009-02-02 21:50                                             ` Mel Gorman
2009-02-02 22:12                                               ` Linus Torvalds
2009-02-02 22:35                                                 ` Mel Gorman
2009-02-02 18:33                                       ` Mel Gorman
2009-02-03 16:13                                 ` Lee Schermerhorn
2009-02-03 16:40                                   ` Linus Torvalds
2009-02-03 17:10                                   ` Hugh Dickins
2009-02-03 21:50                                     ` Lee Schermerhorn
2009-01-30 21:37                           ` Linus Torvalds
2009-01-31 12:16                             ` Hugh Dickins
2009-01-30 20:33                       ` Hugh Dickins
2009-01-30 20:53                       ` Randy Dunlap
2009-01-30 20:59                         ` Lee Schermerhorn
2009-01-30 21:11                           ` Will Crowder
2009-01-30 23:44                   ` Greg KH
2009-01-30  8:34         ` Peter Zijlstra
2009-01-30 16:45           ` Linus Torvalds
2009-01-30 16:49             ` Randy Dunlap

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox