linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] x86: fix PAE pmd_bad bootup warning
       [not found]     ` <alpine.LFD.1.10.0805061138580.32269@woody.linux-foundation.org>
@ 2008-05-06 19:49       ` Hugh Dickins
  2008-05-06 20:06         ` Linus Torvalds
                           ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Hugh Dickins @ 2008-05-06 19:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Jeff Chua, Thomas Gleixner, H. Peter Anvin,
	Gabriel C, Hans Rosenfeld, Arjan van de Ven, Nishanth Aravamudan,
	linux-kernel, linux-mm

Fix warning from pmd_bad() at bootup on a HIGHMEM64G HIGHPTE x86_32.

That came from 9fc34113f6880b215cbea4e7017fc818700384c2 x86: debug pmd_bad();
but we understand now that the typecasting was wrong for PAE in the previous
version: pagetable pages above 4GB looked bad and stopped Arjan from booting.

And revert that cded932b75ab0a5f9181ee3da34a0a488d1a14fd x86: fix pmd_bad
and pud_bad to support huge pages.  It was the wrong way round: we shouldn't
weaken every pmd_bad and pud_bad check to let huge pages slip through - in
part they check that we _don't_ have a huge page where it's not expected.

Put the x86 pmd_bad() and pud_bad() definitions back to what they have long
been: they can be improved (x86_32 should use PTE_MASK, to stop PAE thinking
junk in the upper word is good; and x86_64 should follow x86_32's stricter
comparison, to stop thinking any subset of required bits is good); but that
should be a later patch.

Fix Hans' good observation that follow_page() will never find pmd_huge()
because that would have already failed the pmd_bad test: test pmd_huge in
between the pmd_none and pmd_bad tests.  Tighten x86's pmd_huge() check?
No, once it's a hugepage entry, it can get quite far from a good pmd: for
example, PROT_NONE leaves it with only ACCESSED of the KERN_PGTABLE bits.

However... though follow_page() contains this and another test for huge
pages, so it's nice to keep it working on them, where does it actually get
called on a huge page?  get_user_pages() checks is_vm_hugetlb_page(vma) to
to call alternative hugetlb processing, as does unmap_vmas() and others.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
So Hans' original hugepage leak remains unexplained and unfixed.
Hans, did you find that hugepage leak with a standard kernel, or were
you perhaps trying out some hugepage-using patch of your own, without
marking the vma VM_HUGETLB?  Or were you expecting the hugetlbfs file
to truncate itself once all mmappers had gone?  If the standard kernel
leaks hugepages, I'm surprised the hugetlb guys don't know about it.

 arch/x86/mm/pgtable_32.c     |    7 -------
 include/asm-x86/pgtable_32.h |    9 +--------
 include/asm-x86/pgtable_64.h |    6 ++----
 mm/memory.c                  |    5 ++++-
 4 files changed, 7 insertions(+), 20 deletions(-)

--- 2.6.26-rc1/arch/x86/mm/pgtable_32.c	2008-05-03 21:54:41.000000000 +0100
+++ linux/arch/x86/mm/pgtable_32.c	2008-05-06 14:13:24.000000000 +0100
@@ -172,10 +172,3 @@ void reserve_top_address(unsigned long r
 	__FIXADDR_TOP = -reserve - PAGE_SIZE;
 	__VMALLOC_RESERVE += reserve;
 }
-
-int pmd_bad(pmd_t pmd)
-{
-	WARN_ON_ONCE(pmd_bad_v1(pmd) != pmd_bad_v2(pmd));
-
-	return pmd_bad_v1(pmd);
-}
--- 2.6.26-rc1/include/asm-x86/pgtable_32.h	2008-05-03 21:55:10.000000000 +0100
+++ linux/include/asm-x86/pgtable_32.h	2008-05-06 14:13:24.000000000 +0100
@@ -88,14 +88,7 @@ extern unsigned long pg0[];
 /* To avoid harmful races, pmd_none(x) should check only the lower when PAE */
 #define pmd_none(x)	(!(unsigned long)pmd_val((x)))
 #define pmd_present(x)	(pmd_val((x)) & _PAGE_PRESENT)
-
-extern int pmd_bad(pmd_t pmd);
-
-#define pmd_bad_v1(x)							\
-	(_KERNPG_TABLE != (pmd_val((x)) & ~(PAGE_MASK | _PAGE_USER)))
-#define	pmd_bad_v2(x)							\
-	(_KERNPG_TABLE != (pmd_val((x)) & ~(PAGE_MASK | _PAGE_USER |	\
-					    _PAGE_PSE | _PAGE_NX)))
+#define pmd_bad(x) ((pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER)) != _KERNPG_TABLE)
 
 #define pages_to_mb(x) ((x) >> (20-PAGE_SHIFT))
 
--- 2.6.26-rc1/include/asm-x86/pgtable_64.h	2008-05-03 21:55:10.000000000 +0100
+++ linux/include/asm-x86/pgtable_64.h	2008-05-06 14:13:24.000000000 +0100
@@ -158,14 +158,12 @@ static inline unsigned long pgd_bad(pgd_
 
 static inline unsigned long pud_bad(pud_t pud)
 {
-	return pud_val(pud) &
-		~(PTE_MASK | _KERNPG_TABLE | _PAGE_USER | _PAGE_PSE | _PAGE_NX);
+	return pud_val(pud) & ~(PTE_MASK | _KERNPG_TABLE | _PAGE_USER);
 }
 
 static inline unsigned long pmd_bad(pmd_t pmd)
 {
-	return pmd_val(pmd) &
-		~(PTE_MASK | _KERNPG_TABLE | _PAGE_USER | _PAGE_PSE | _PAGE_NX);
+	return pmd_val(pmd) & ~(PTE_MASK | _KERNPG_TABLE | _PAGE_USER);
 }
 
 #define pte_none(x)	(!pte_val((x)))
--- 2.6.26-rc1/mm/memory.c	2008-05-03 21:55:12.000000000 +0100
+++ linux/mm/memory.c	2008-05-06 14:13:24.000000000 +0100
@@ -969,7 +969,7 @@ struct page *follow_page(struct vm_area_
 		goto no_page_table;
 	
 	pmd = pmd_offset(pud, address);
-	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+	if (pmd_none(*pmd))
 		goto no_page_table;
 
 	if (pmd_huge(*pmd)) {
@@ -978,6 +978,9 @@ struct page *follow_page(struct vm_area_
 		goto out;
 	}
 
+	if (unlikely(pmd_bad(*pmd)))
+		goto no_page_table;
+
 	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (!ptep)
 		goto out;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-06 19:49       ` [PATCH] x86: fix PAE pmd_bad bootup warning Hugh Dickins
@ 2008-05-06 20:06         ` Linus Torvalds
  2008-05-06 20:30           ` Hugh Dickins
  2008-05-06 20:22         ` Hans Rosenfeld
  2008-05-07  4:40         ` Jeff Chua
  2 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2008-05-06 20:06 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Ingo Molnar, Jeff Chua, Thomas Gleixner, H. Peter Anvin,
	Gabriel C, Hans Rosenfeld, Arjan van de Ven, Nishanth Aravamudan,
	linux-kernel, linux-mm


On Tue, 6 May 2008, Hugh Dickins wrote:
>
> Fix Hans' good observation that follow_page() will never find pmd_huge()
> because that would have already failed the pmd_bad test: test pmd_huge in
> between the pmd_none and pmd_bad tests.  Tighten x86's pmd_huge() check?
> No, once it's a hugepage entry, it can get quite far from a good pmd: for
> example, PROT_NONE leaves it with only ACCESSED of the KERN_PGTABLE bits.

I'd much rather have pdm_bad() etc fixed up instead, so that they do a 
more proper test (not thinking that a PSE page is bad, since it clearly 
isn't). And then, make them dependent on DEBUG_VM, because doing the 
proper test will be more expensive.

Hmm?

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-06 19:49       ` [PATCH] x86: fix PAE pmd_bad bootup warning Hugh Dickins
  2008-05-06 20:06         ` Linus Torvalds
@ 2008-05-06 20:22         ` Hans Rosenfeld
  2008-05-06 20:36           ` Hugh Dickins
  2008-05-06 20:42           ` Dave Hansen
  2008-05-07  4:40         ` Jeff Chua
  2 siblings, 2 replies; 32+ messages in thread
From: Hans Rosenfeld @ 2008-05-06 20:22 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Ingo Molnar, Jeff Chua, Thomas Gleixner, H. Peter Anvin,
	Gabriel C, Arjan van de Ven, Nishanth Aravamudan, linux-kernel,
	linux-mm

On Tue, May 06, 2008 at 08:49:23PM +0100, Hugh Dickins wrote:
> So Hans' original hugepage leak remains unexplained and unfixed.
> Hans, did you find that hugepage leak with a standard kernel, or were
> you perhaps trying out some hugepage-using patch of your own, without
> marking the vma VM_HUGETLB?  Or were you expecting the hugetlbfs file
> to truncate itself once all mmappers had gone?  If the standard kernel
> leaks hugepages, I'm surprised the hugetlb guys don't know about it.

I used a standard kernel (well, not quite, I had made some changes to
the /proc/pid/pagemap code, but nothing that would affect the hugepage
stuff) and some simple test program that would just mmap a hugepage.

I expected that any hugepage that a process had mmapped would
automatically be returned to the system when the process exits. That was
not the case, the process exited and the hugepage was lost (unless I
changed the program to explicitly munmap the hugepage before exiting).
Removing the hugetlbfs file containing the hugepage also didn't free the
page.


-- 
%SYSTEM-F-ANARCHISM, The operating system has been overthrown

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-06 20:06         ` Linus Torvalds
@ 2008-05-06 20:30           ` Hugh Dickins
  2008-05-08 16:07             ` Nishanth Aravamudan
  0 siblings, 1 reply; 32+ messages in thread
From: Hugh Dickins @ 2008-05-06 20:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Jeff Chua, Thomas Gleixner, H. Peter Anvin,
	Gabriel C, Hans Rosenfeld, Arjan van de Ven, Nishanth Aravamudan,
	Jeremy Fitzhardinge, linux-kernel, linux-mm

On Tue, 6 May 2008, Linus Torvalds wrote:
> On Tue, 6 May 2008, Hugh Dickins wrote:
> >
> > Fix Hans' good observation that follow_page() will never find pmd_huge()
> > because that would have already failed the pmd_bad test: test pmd_huge in
> > between the pmd_none and pmd_bad tests.  Tighten x86's pmd_huge() check?
> > No, once it's a hugepage entry, it can get quite far from a good pmd: for
> > example, PROT_NONE leaves it with only ACCESSED of the KERN_PGTABLE bits.
> 
> I'd much rather have pdm_bad() etc fixed up instead, so that they do a 
> more proper test (not thinking that a PSE page is bad, since it clearly 
> isn't). And then, make them dependent on DEBUG_VM, because doing the 
> proper test will be more expensive.

But everywhere we use pmd_bad() etc (most are hidden inside
pmd_none_or_clear_bad() etc) we are expecting never to encounter
a pmd_huge, unless there's corruption.  follow_page() is the one
exception, and even in its case I can't find a current user that
actually could meet a hugepage.  I'd rather tighten up pmd_bad
(in the PAE and x86_64 cases), than weaken it so far as to let
hugepages slip through.  Not that pmd_bad often catches anything:
just coincidentally that 90909090 one today.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-06 20:22         ` Hans Rosenfeld
@ 2008-05-06 20:36           ` Hugh Dickins
  2008-05-07 23:39             ` Nishanth Aravamudan
  2008-05-06 20:42           ` Dave Hansen
  1 sibling, 1 reply; 32+ messages in thread
From: Hugh Dickins @ 2008-05-06 20:36 UTC (permalink / raw)
  To: Hans Rosenfeld
  Cc: Ingo Molnar, Jeff Chua, Thomas Gleixner, H. Peter Anvin,
	Gabriel C, Arjan van de Ven, Nishanth Aravamudan, linux-kernel,
	linux-mm

On Tue, 6 May 2008, Hans Rosenfeld wrote:
> On Tue, May 06, 2008 at 08:49:23PM +0100, Hugh Dickins wrote:
> > So Hans' original hugepage leak remains unexplained and unfixed.
> > Hans, did you find that hugepage leak with a standard kernel, or were
> > you perhaps trying out some hugepage-using patch of your own, without
> > marking the vma VM_HUGETLB?  Or were you expecting the hugetlbfs file
> > to truncate itself once all mmappers had gone?  If the standard kernel
> > leaks hugepages, I'm surprised the hugetlb guys don't know about it.
> 
> I used a standard kernel (well, not quite, I had made some changes to
> the /proc/pid/pagemap code, but nothing that would affect the hugepage
> stuff) and some simple test program that would just mmap a hugepage.
> 
> I expected that any hugepage that a process had mmapped would
> automatically be returned to the system when the process exits. That was
> not the case, the process exited and the hugepage was lost (unless I
> changed the program to explicitly munmap the hugepage before exiting).
> Removing the hugetlbfs file containing the hugepage also didn't free the
> page.

Hmm.  That doesn't fit with my experience: I've not found an explicit
munmap makes any difference (I wouldn't expect it to), but removing
the file once all openers gone does free everything up.  I guess I'm
overlooking something more experienced hugepagers will soon light upon.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-06 20:22         ` Hans Rosenfeld
  2008-05-06 20:36           ` Hugh Dickins
@ 2008-05-06 20:42           ` Dave Hansen
  2008-05-08 14:34             ` Hans Rosenfeld
  1 sibling, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2008-05-06 20:42 UTC (permalink / raw)
  To: Hans Rosenfeld
  Cc: Hugh Dickins, Ingo Molnar, Jeff Chua, Thomas Gleixner,
	H. Peter Anvin, Gabriel C, Arjan van de Ven, Nishanth Aravamudan,
	linux-kernel, linux-mm

On Tue, 2008-05-06 at 22:22 +0200, Hans Rosenfeld wrote:
> I expected that any hugepage that a process had mmapped would
> automatically be returned to the system when the process exits. That was
> not the case, the process exited and the hugepage was lost (unless I
> changed the program to explicitly munmap the hugepage before exiting).
> Removing the hugetlbfs file containing the hugepage also didn't free the
> page.

Could you post the code you're using to do this?  I have to wonder if
you're leaving a fd open somewhere.  Even if you rm the hugepage file,
it'll stay allocated if you have a fd open, or if *someone* is still
mapping it. 

Can you umount your hugetlbfs?

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-06 19:49       ` [PATCH] x86: fix PAE pmd_bad bootup warning Hugh Dickins
  2008-05-06 20:06         ` Linus Torvalds
  2008-05-06 20:22         ` Hans Rosenfeld
@ 2008-05-07  4:40         ` Jeff Chua
  2008-05-07  5:30           ` Hugh Dickins
  2 siblings, 1 reply; 32+ messages in thread
From: Jeff Chua @ 2008-05-07  4:40 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Hans Rosenfeld, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
	Gabriel C, Arjan van de Ven, Nishanth Aravamudan, linux-kernel,
	linux-mm

On Wed, May 7, 2008 at 3:49 AM, Hugh Dickins <hugh@veritas.com> wrote:

>  Signed-off-by: Hugh Dickins <hugh@veritas.com>

Hugh,

Thanks for the patch. It's in Linus's git now and I just booted 2 Dell
2950 and both are no showing the errors.


Thanks,
Jeff.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-07  4:40         ` Jeff Chua
@ 2008-05-07  5:30           ` Hugh Dickins
  0 siblings, 0 replies; 32+ messages in thread
From: Hugh Dickins @ 2008-05-07  5:30 UTC (permalink / raw)
  To: Jeff Chua
  Cc: Hans Rosenfeld, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
	Gabriel C, Arjan van de Ven, Nishanth Aravamudan, linux-kernel,
	linux-mm

On Wed, 7 May 2008, Jeff Chua wrote:
> 
> Thanks for the patch. It's in Linus's git now and I just booted 2 Dell

Thanks for reporting and reporting back: from your contentment ...

> 2950 and both are no showing the errors.
                      ^ 
... I'll assume the missing letter is "t" rather than "w" ;)

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-06 20:36           ` Hugh Dickins
@ 2008-05-07 23:39             ` Nishanth Aravamudan
  0 siblings, 0 replies; 32+ messages in thread
From: Nishanth Aravamudan @ 2008-05-07 23:39 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Hans Rosenfeld, Ingo Molnar, Jeff Chua, Thomas Gleixner,
	H. Peter Anvin, Gabriel C, Arjan van de Ven, linux-kernel,
	linux-mm

On 06.05.2008 [21:36:59 +0100], Hugh Dickins wrote:
> On Tue, 6 May 2008, Hans Rosenfeld wrote:
> > On Tue, May 06, 2008 at 08:49:23PM +0100, Hugh Dickins wrote:
> > > So Hans' original hugepage leak remains unexplained and unfixed.
> > > Hans, did you find that hugepage leak with a standard kernel, or were
> > > you perhaps trying out some hugepage-using patch of your own, without
> > > marking the vma VM_HUGETLB?  Or were you expecting the hugetlbfs file
> > > to truncate itself once all mmappers had gone?  If the standard kernel
> > > leaks hugepages, I'm surprised the hugetlb guys don't know about it.
> > 
> > I used a standard kernel (well, not quite, I had made some changes to
> > the /proc/pid/pagemap code, but nothing that would affect the hugepage
> > stuff) and some simple test program that would just mmap a hugepage.
> > 
> > I expected that any hugepage that a process had mmapped would
> > automatically be returned to the system when the process exits. That was
> > not the case, the process exited and the hugepage was lost (unless I
> > changed the program to explicitly munmap the hugepage before exiting).
> > Removing the hugetlbfs file containing the hugepage also didn't free the
> > page.
> 
> Hmm.  That doesn't fit with my experience: I've not found an explicit
> munmap makes any difference (I wouldn't expect it to), but removing
> the file once all openers gone does free everything up.  I guess I'm
> overlooking something more experienced hugepagers will soon light
> upon.

Nothing strikes me immediately. What you described is what I expected.
As Dave pointed out separately, are you able to unmount hugetlbfs at
this point?

Hans, can you send out a sample application's source? What kernel were
you testing on?

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-06 20:42           ` Dave Hansen
@ 2008-05-08 14:34             ` Hans Rosenfeld
  2008-05-08 14:39               ` Hans Rosenfeld
  2008-05-08 14:52               ` Dave Hansen
  0 siblings, 2 replies; 32+ messages in thread
From: Hans Rosenfeld @ 2008-05-08 14:34 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Hugh Dickins, Ingo Molnar, Jeff Chua, Thomas Gleixner,
	H. Peter Anvin, Gabriel C, Arjan van de Ven, Nishanth Aravamudan,
	linux-kernel, linux-mm

On Tue, May 06, 2008 at 01:42:59PM -0700, Dave Hansen wrote:
> Could you post the code you're using to do this?  I have to wonder if
> you're leaving a fd open somewhere.  Even if you rm the hugepage file,
> it'll stay allocated if you have a fd open, or if *someone* is still
> mapping it. 

A stripped-down program exposing the leak is attached. It doesn't fork
or do any other fancy stuff, so I don't see a way it could leave any fd
open.

While trying to reproduce this, I noticed that the huge page wouldn't
leak when I just mmapped it and exited without explicitly unmapping, as
I described before. The huge page is leaked only when the
/proc/self/pagemap entry for the huge page is read. The kernel I tested
with was 2.6.26-rc1-00179-g3de2403.

> Can you umount your hugetlbfs?

Yes, but the pages remain lost.

Hans


-- 
%SYSTEM-F-ANARCHISM, The operating system has been overthrown

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 14:34             ` Hans Rosenfeld
@ 2008-05-08 14:39               ` Hans Rosenfeld
  2008-05-08 14:52               ` Dave Hansen
  1 sibling, 0 replies; 32+ messages in thread
From: Hans Rosenfeld @ 2008-05-08 14:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Hugh Dickins, Ingo Molnar, Jeff Chua, Thomas Gleixner,
	H. Peter Anvin, Gabriel C, Arjan van de Ven, Nishanth Aravamudan,
	linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 202 bytes --]

On Thu, May 08, 2008 at 04:34:53PM +0200, Hans Rosenfeld wrote:
> A stripped-down program exposing the leak is attached.

It is now :)

-- 
%SYSTEM-F-ANARCHISM, The operating system has been overthrown

[-- Attachment #2: hugetest.c --]
[-- Type: text/plain, Size: 1429 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <inttypes.h>
#include <signal.h>
#include <string.h>
#include <strings.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#define HUGEPAGES "/proc/sys/vm/nr_hugepages"
#define HUGE_FILE "/mnt/huge/foobar"

void *huge;

int main(int argc, char **argv)
{
	uint64_t ppte;
        int fd, maps;

        if ((fd = open(HUGEPAGES, O_WRONLY, 0)) == -1) {
                perror(HUGEPAGES);
                return 1;
        }
        write(fd, "20\n", 3);
        close(fd);

        if ((fd = open(HUGE_FILE, O_RDWR | O_CREAT, 0)) == -1) {
                perror(HUGE_FILE);
                return 1;
        }

        huge = mmap((void *) 0x1000000, 0x400000,
                    PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_FIXED,
                    fd, 0);

        if (huge == MAP_FAILED) {
                perror(HUGE_FILE);
                return 1;
        }

        fprintf(stderr, "huge: 0x%0.*" PRIxPTR "\n", sizeof(huge) * 2, huge);
        memset(huge, 1, 12345);

	if ((maps = open("/proc/self/pagemap", O_RDONLY, 0)) < 0) {
		perror("/proc/self/pagemap");
		return 1;
	}

	if (pread(maps, &ppte, sizeof(ppte), ((uintptr_t) huge) >> 9) < 0) {
		perror("pread");
		return 1;
	}

	fprintf(stderr, "ppte: 0x%0.16" PRIx64 "\n", ppte);

        return 0;
}

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 14:34             ` Hans Rosenfeld
  2008-05-08 14:39               ` Hans Rosenfeld
@ 2008-05-08 14:52               ` Dave Hansen
  2008-05-08 15:11                 ` Hans Rosenfeld
  2008-05-08 15:44                 ` Nishanth Aravamudan
  1 sibling, 2 replies; 32+ messages in thread
From: Dave Hansen @ 2008-05-08 14:52 UTC (permalink / raw)
  To: Hans Rosenfeld
  Cc: Hugh Dickins, Ingo Molnar, Jeff Chua, Thomas Gleixner,
	H. Peter Anvin, Gabriel C, Arjan van de Ven, Nishanth Aravamudan,
	linux-kernel, linux-mm

On Thu, 2008-05-08 at 16:34 +0200, Hans Rosenfeld wrote:
> While trying to reproduce this, I noticed that the huge page wouldn't
> leak when I just mmapped it and exited without explicitly unmapping, as
> I described before. The huge page is leaked only when the
> /proc/self/pagemap entry for the huge page is read.

Well, that's an interesting data point! :)

Are you running any of your /proc/<pid>/pagemap patches?

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 14:52               ` Dave Hansen
@ 2008-05-08 15:11                 ` Hans Rosenfeld
  2008-05-08 15:51                   ` Dave Hansen
  2008-05-08 15:44                 ` Nishanth Aravamudan
  1 sibling, 1 reply; 32+ messages in thread
From: Hans Rosenfeld @ 2008-05-08 15:11 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Hugh Dickins, Ingo Molnar, Jeff Chua, Thomas Gleixner,
	H. Peter Anvin, Gabriel C, Arjan van de Ven, Nishanth Aravamudan,
	linux-kernel, linux-mm

On Thu, May 08, 2008 at 07:52:30AM -0700, Dave Hansen wrote:
> On Thu, 2008-05-08 at 16:34 +0200, Hans Rosenfeld wrote:
> > The huge page is leaked only when the
> > /proc/self/pagemap entry for the huge page is read.
> 
> Well, that's an interesting data point! :)
> 
> Are you running any of your /proc/<pid>/pagemap patches?

No additional patches. The problem already existed before we agreed on
the change to the pagemap code to just include the page size in the
values returned, and not doing any special huge page handling. I suspect
the page walking code used by /proc/pid/pagemap is doing something nasty
when it sees a huge page as it doesn't know how to handle it.


-- 
%SYSTEM-F-ANARCHISM, The operating system has been overthrown

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 14:52               ` Dave Hansen
  2008-05-08 15:11                 ` Hans Rosenfeld
@ 2008-05-08 15:44                 ` Nishanth Aravamudan
  1 sibling, 0 replies; 32+ messages in thread
From: Nishanth Aravamudan @ 2008-05-08 15:44 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Hans Rosenfeld, Hugh Dickins, Ingo Molnar, Jeff Chua,
	Thomas Gleixner, H. Peter Anvin, Gabriel C, Arjan van de Ven,
	linux-kernel, linux-mm

On 08.05.2008 [07:52:30 -0700], Dave Hansen wrote:
> On Thu, 2008-05-08 at 16:34 +0200, Hans Rosenfeld wrote:
> > While trying to reproduce this, I noticed that the huge page wouldn't
> > leak when I just mmapped it and exited without explicitly unmapping, as
> > I described before. The huge page is leaked only when the
> > /proc/self/pagemap entry for the huge page is read.
> 
> Well, that's an interesting data point! :)

Indeed it is! Would explain why we've not seen any leaks until now. I
will try to investigate, as well.

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 15:11                 ` Hans Rosenfeld
@ 2008-05-08 15:51                   ` Dave Hansen
  2008-05-08 16:19                     ` Hans Rosenfeld
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2008-05-08 15:51 UTC (permalink / raw)
  To: Hans Rosenfeld
  Cc: Hugh Dickins, Ingo Molnar, Jeff Chua, Thomas Gleixner,
	H. Peter Anvin, Gabriel C, Arjan van de Ven, Nishanth Aravamudan,
	linux-kernel, linux-mm

On Thu, 2008-05-08 at 17:11 +0200, Hans Rosenfeld wrote:
> 
> On Thu, May 08, 2008 at 07:52:30AM -0700, Dave Hansen wrote:
> > On Thu, 2008-05-08 at 16:34 +0200, Hans Rosenfeld wrote:
> > > The huge page is leaked only when the
> > > /proc/self/pagemap entry for the huge page is read.
> > 
> > Well, that's an interesting data point! :)
> > 
> > Are you running any of your /proc/<pid>/pagemap patches?
> 
> No additional patches. The problem already existed before we agreed on
> the change to the pagemap code to just include the page size in the
> values returned, and not doing any special huge page handling. I suspect
> the page walking code used by /proc/pid/pagemap is doing something nasty
> when it sees a huge page as it doesn't know how to handle it.

Is there anything in your dmesg?

static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
                          const struct mm_walk *walk, void *private)
{
        pmd_t *pmd;
        unsigned long next;
        int err = 0;

        pmd = pmd_offset(pud, addr);
        do {
                next = pmd_addr_end(addr, end);
                if (pmd_none_or_clear_bad(pmd)) {
                        if (walk->pte_hole)
                                err = walk->pte_hole(addr, next, private);
                        if (err)
                                break;
                        continue;


There was a discussion on LKML in the last couple of days about
pmd_bad() triggering on huge pages.  Perhaps we're clearing the mapping
with the pmd_none_or_clear_bad(), and *THAT* is leaking the page.

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-06 20:30           ` Hugh Dickins
@ 2008-05-08 16:07             ` Nishanth Aravamudan
  0 siblings, 0 replies; 32+ messages in thread
From: Nishanth Aravamudan @ 2008-05-08 16:07 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Ingo Molnar, Jeff Chua, Thomas Gleixner,
	H. Peter Anvin, Gabriel C, Hans Rosenfeld, Arjan van de Ven,
	Jeremy Fitzhardinge, linux-kernel, linux-mm

On 06.05.2008 [21:30:44 +0100], Hugh Dickins wrote:
> On Tue, 6 May 2008, Linus Torvalds wrote:
> > On Tue, 6 May 2008, Hugh Dickins wrote:
> > >
> > > Fix Hans' good observation that follow_page() will never find
> > > pmd_huge() because that would have already failed the pmd_bad
> > > test: test pmd_huge in between the pmd_none and pmd_bad tests.
> > > Tighten x86's pmd_huge() check?  No, once it's a hugepage entry,
> > > it can get quite far from a good pmd: for example, PROT_NONE
> > > leaves it with only ACCESSED of the KERN_PGTABLE bits.
> > 
> > I'd much rather have pdm_bad() etc fixed up instead, so that they do
> > a more proper test (not thinking that a PSE page is bad, since it
> > clearly isn't). And then, make them dependent on DEBUG_VM, because
> > doing the proper test will be more expensive.
> 
> But everywhere we use pmd_bad() etc (most are hidden inside
> pmd_none_or_clear_bad() etc) we are expecting never to encounter
> a pmd_huge, unless there's corruption.  follow_page() is the one
> exception, and even in its case I can't find a current user that
> actually could meet a hugepage.  I'd rather tighten up pmd_bad
> (in the PAE and x86_64 cases), than weaken it so far as to let
> hugepages slip through.  Not that pmd_bad often catches anything:
> just coincidentally that 90909090 one today.

There is one case that seems to the source of Hans' problem, as Dave has
figured out: /proc/pid/pagemap, where we fairly straight-forwardly walk
the pagetables.

In there, we unconditionally call pmd_none_or_clear_bad(pmd). And any
userspace process that maps hugepages and then reads in
/proc/pid/pagemap will invoke that path, I think (at least with 2M
pages).

So I agree, you're fixing a potential issue in follow_page() [might
deserve a comment, so someone doesn't go and combine them back later?],
but Hans' issue is most likely related to the pagemap code?

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 15:51                   ` Dave Hansen
@ 2008-05-08 16:19                     ` Hans Rosenfeld
  2008-05-08 16:33                       ` Nishanth Aravamudan
  2008-05-08 16:42                       ` Dave Hansen
  0 siblings, 2 replies; 32+ messages in thread
From: Hans Rosenfeld @ 2008-05-08 16:19 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Hugh Dickins, Ingo Molnar, Jeff Chua, Thomas Gleixner,
	H. Peter Anvin, Gabriel C, Arjan van de Ven, Nishanth Aravamudan,
	linux-kernel, linux-mm

On Thu, May 08, 2008 at 08:51:22AM -0700, Dave Hansen wrote:
> Is there anything in your dmesg?

mm/memory.c:127: bad pmd ffff810076801040(80000000720000e7).

> There was a discussion on LKML in the last couple of days about
> pmd_bad() triggering on huge pages.  Perhaps we're clearing the mapping
> with the pmd_none_or_clear_bad(), and *THAT* is leaking the page.

That makes sense. I remember that explicitly munmapping the huge page
would still work, but it doesn't. I don't quite remember what I did back
then to test this, but I probably made some mistake there that led me to
some false conclusions.


-- 
%SYSTEM-F-ANARCHISM, The operating system has been overthrown

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 16:19                     ` Hans Rosenfeld
@ 2008-05-08 16:33                       ` Nishanth Aravamudan
  2008-05-08 16:51                         ` Hans Rosenfeld
  2008-05-08 16:42                       ` Dave Hansen
  1 sibling, 1 reply; 32+ messages in thread
From: Nishanth Aravamudan @ 2008-05-08 16:33 UTC (permalink / raw)
  To: Hans Rosenfeld
  Cc: Dave Hansen, Hugh Dickins, Ingo Molnar, Jeff Chua,
	Thomas Gleixner, H. Peter Anvin, Gabriel C, Arjan van de Ven,
	linux-kernel, linux-mm

On 08.05.2008 [18:19:25 +0200], Hans Rosenfeld wrote:
> On Thu, May 08, 2008 at 08:51:22AM -0700, Dave Hansen wrote:
> > Is there anything in your dmesg?
> 
> mm/memory.c:127: bad pmd ffff810076801040(80000000720000e7).
> 
> > There was a discussion on LKML in the last couple of days about
> > pmd_bad() triggering on huge pages.  Perhaps we're clearing the mapping
> > with the pmd_none_or_clear_bad(), and *THAT* is leaking the page.
> 
> That makes sense. I remember that explicitly munmapping the huge page
> would still work, but it doesn't. I don't quite remember what I did back
> then to test this, but I probably made some mistake there that led me to
> some false conclusions.

So this seems to lend credence to Dave's hypothesis. Without, as you
were trying before, teaching pagemap all about hugepages, what are our
options?

Can we just skip over the current iteration of the PMD loop (would we
need something similar for the PTE loop for power?) if pmd_huge(pmd)?

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 16:19                     ` Hans Rosenfeld
  2008-05-08 16:33                       ` Nishanth Aravamudan
@ 2008-05-08 16:42                       ` Dave Hansen
  1 sibling, 0 replies; 32+ messages in thread
From: Dave Hansen @ 2008-05-08 16:42 UTC (permalink / raw)
  To: Hans Rosenfeld
  Cc: Hugh Dickins, Ingo Molnar, Jeff Chua, Thomas Gleixner,
	H. Peter Anvin, Gabriel C, Arjan van de Ven, Nishanth Aravamudan,
	linux-kernel, linux-mm

On Thu, 2008-05-08 at 18:19 +0200, Hans Rosenfeld wrote:
> On Thu, May 08, 2008 at 08:51:22AM -0700, Dave Hansen wrote:
> > Is there anything in your dmesg?
> 
> mm/memory.c:127: bad pmd ffff810076801040(80000000720000e7).
> 
> > There was a discussion on LKML in the last couple of days about
> > pmd_bad() triggering on huge pages.  Perhaps we're clearing the mapping
> > with the pmd_none_or_clear_bad(), and *THAT* is leaking the page.
> 
> That makes sense. I remember that explicitly munmapping the huge page
> would still work, but it doesn't. I don't quite remember what I did back
> then to test this, but I probably made some mistake there that led me to
> some false conclusions.

I can't see how it would possibly work with the code that we have today,
so I guess it was just a false assumption.

static inline int pmd_none_or_clear_bad(pmd_t *pmd)
{
        if (pmd_none(*pmd))
                return 1;
        if (unlikely(pmd_bad(*pmd))) {
                pmd_clear_bad(pmd);
                return 1;
        }
        return 0;
}

void pmd_clear_bad(pmd_t *pmd)
{
        pmd_ERROR(*pmd);
        pmd_clear(pmd);
}

That pmd_clear() will simply zero out the pmd and leak the page.

Sounds like Linus had the right idea:

> I'd much rather have pdm_bad() etc fixed up instead, so that they do a 
> more proper test (not thinking that a PSE page is bad, since it clearly 
> isn't). And then, make them dependent on DEBUG_VM, because doing the 
> proper test will be more expensive.

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 16:33                       ` Nishanth Aravamudan
@ 2008-05-08 16:51                         ` Hans Rosenfeld
  2008-05-08 17:16                           ` Nishanth Aravamudan
  0 siblings, 1 reply; 32+ messages in thread
From: Hans Rosenfeld @ 2008-05-08 16:51 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Dave Hansen, Hugh Dickins, Ingo Molnar, Jeff Chua,
	Thomas Gleixner, H. Peter Anvin, Gabriel C, Arjan van de Ven,
	linux-kernel, linux-mm

On Thu, May 08, 2008 at 09:33:52AM -0700, Nishanth Aravamudan wrote:
> So this seems to lend credence to Dave's hypothesis. Without, as you
> were trying before, teaching pagemap all about hugepages, what are our
> options?
> 
> Can we just skip over the current iteration of the PMD loop (would we
> need something similar for the PTE loop for power?) if pmd_huge(pmd)?

Allowing huge pages in the page walker would affect both walk_pmd_range
and walk_pud_range. Then either the users of the page walker need to
know how to handle huge pages themselves (in the pmd_entry and pud_entry
callback functions), or the page walker treats huge pages as any other
pages (calling the pte_entry callback function).


-- 
%SYSTEM-F-ANARCHISM, The operating system has been overthrown

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 16:51                         ` Hans Rosenfeld
@ 2008-05-08 17:16                           ` Nishanth Aravamudan
  2008-05-08 18:42                             ` Dave Hansen
  2008-05-08 18:48                             ` Hugh Dickins
  0 siblings, 2 replies; 32+ messages in thread
From: Nishanth Aravamudan @ 2008-05-08 17:16 UTC (permalink / raw)
  To: Hans Rosenfeld
  Cc: Dave Hansen, Hugh Dickins, Ingo Molnar, Jeff Chua,
	Thomas Gleixner, H. Peter Anvin, Gabriel C, Arjan van de Ven,
	linux-kernel, linux-mm

On 08.05.2008 [18:51:11 +0200], Hans Rosenfeld wrote:
> On Thu, May 08, 2008 at 09:33:52AM -0700, Nishanth Aravamudan wrote:
> > So this seems to lend credence to Dave's hypothesis. Without, as you
> > were trying before, teaching pagemap all about hugepages, what are our
> > options?
> > 
> > Can we just skip over the current iteration of the PMD loop (would we
> > need something similar for the PTE loop for power?) if pmd_huge(pmd)?
> 
> Allowing huge pages in the page walker would affect both walk_pmd_range
> and walk_pud_range. Then either the users of the page walker need to
> know how to handle huge pages themselves (in the pmd_entry and pud_entry
> callback functions), or the page walker treats huge pages as any other
> pages (calling the pte_entry callback function).

Right, I agree *if* we allow huge pages in the walker. But AIUI, things
are broken now with hugepages in the process' address space. This is a
bug upstream and leads to hugepages leaking out of the kernel when
/proc/pid/pagemap is read. Why not, instead (as a short-term fix), skip
hugepage mappings altogether in the page-walker code?

Hrm, upon further investigation, this seems to be a pretty clear
limitation of walk_page_range(). One that is avoided in the other two
callers, i.e.

static int show_smap(struct seq_file *m, void *v)
{
...
	if (vma->vm_mm && !is_vm_hugetlb_page(vma))
		walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
				&smaps_walk, &mss);
...
}

static ssize_t clear_refs_write(struct file *file, const char __user *buf,
				size_t count, loff_t *ppos)
{
...
		for (vma = mm->mmap; vma; vma = vma->vm_next)
			if (!is_vm_hugetlb_page(vma))
				walk_page_range(mm, vma->vm_start, vma->vm_end,
						&clear_refs_walk, vma);
...
}

No such protection exists for

static ssize_t pagemap_read(struct file *file, char __user *buf,
			    size_t count, loff_t *ppos);

So, is there any way to either add a is_vm_hugetlb_page(vma) check into
pagemap_read()? Or can we modify walk_page_range to take the a vma and
skip the walking if is_vm_hugetlb_page(vma) is set [to avoid
complications down the road until hugepage walking is fixed]. I guess
the latter isn't possible for pagemap_read(), since we are just looking
at arbitrary addresses in the process space?

Dunno, seems quite clear that the bug is in pagemap_read(), not any
hugepage code, and that the simplest fix is to make pagemap_read() do
what the other walker-callers do, and skip hugepage regions.

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 17:16                           ` Nishanth Aravamudan
@ 2008-05-08 18:42                             ` Dave Hansen
  2008-05-08 18:58                               ` Hugh Dickins
  2008-05-08 18:48                             ` Hugh Dickins
  1 sibling, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2008-05-08 18:42 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Hans Rosenfeld, Hugh Dickins, Ingo Molnar, Jeff Chua,
	Thomas Gleixner, H. Peter Anvin, Gabriel C, Arjan van de Ven,
	linux-kernel, linux-mm

On Thu, 2008-05-08 at 10:16 -0700, Nishanth Aravamudan wrote:
> 
> Dunno, seems quite clear that the bug is in pagemap_read(), not any
> hugepage code, and that the simplest fix is to make pagemap_read() do
> what the other walker-callers do, and skip hugepage regions.

Agreed, this certainly isn't a huge page bug.

But, I do think it is absolutely insane to have pmd_clear_bad() going
after perfectly good hugetlb pmds.  The way it is set up now, people are
bound to miss the hugetlb pages because just about every single
pagetable walk has to be specially coded to handle or avoid them.  We
obviously missed it, here, and we had two good examples in the same
file! :)

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 17:16                           ` Nishanth Aravamudan
  2008-05-08 18:42                             ` Dave Hansen
@ 2008-05-08 18:48                             ` Hugh Dickins
  2008-05-08 19:49                               ` Matt Mackall
  2008-05-08 20:02                               ` Hans Rosenfeld
  1 sibling, 2 replies; 32+ messages in thread
From: Hugh Dickins @ 2008-05-08 18:48 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Hans Rosenfeld, Dave Hansen, Ingo Molnar, Jeff Chua,
	Thomas Gleixner, H. Peter Anvin, Gabriel C, Arjan van de Ven,
	Matt Mackall, linux-kernel, linux-mm

On Thu, 8 May 2008, Nishanth Aravamudan wrote:
> 
> So, is there any way to either add a is_vm_hugetlb_page(vma) check into
> pagemap_read()? Or can we modify walk_page_range to take the a vma and
> skip the walking if is_vm_hugetlb_page(vma) is set [to avoid
> complications down the road until hugepage walking is fixed]. I guess
> the latter isn't possible for pagemap_read(), since we are just looking
> at arbitrary addresses in the process space?
> 
> Dunno, seems quite clear that the bug is in pagemap_read(), not any
> hugepage code, and that the simplest fix is to make pagemap_read() do
> what the other walker-callers do, and skip hugepage regions.

Yes, I'm afraid it needs an is_vm_hugetlb_page(vma) in there somehow:
as you observe, that's what everything else uses to avoid huge issues.

A pmd_huge(*pmd) test is tempting, but it only ever says "yes" on x86:
we've carefully left it undefined what happens to the pgd/pud/pmd/pte
hierarchy in the general arch case, once you're amongst hugepages.

Might follow_huge_addr() be helpful, to avoid the need for a vma?
Perhaps, but my reading is that actually we've never really been
testing that path's success case (because get_user_pages already
skipped is_vm_hugetlb_page), so it might hold further surprises
on one architecture or another.

Many thanks to Hans for persisting, and pointing us to pagemap
to explain this hugepage leak: yes, the pmd_none_or_clear_bad
will be losing it - and corrupting target user address space.

Cc'ed Matt: he may have a view on what he wants his pagewalker
to do with hugepages: I fear it would differ from one usage to
another.  Skip over them has to be safest, though not ideal.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 18:42                             ` Dave Hansen
@ 2008-05-08 18:58                               ` Hugh Dickins
  2008-05-08 19:06                                 ` Dave Hansen
  0 siblings, 1 reply; 32+ messages in thread
From: Hugh Dickins @ 2008-05-08 18:58 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Nishanth Aravamudan, Hans Rosenfeld, Ingo Molnar, Jeff Chua,
	Thomas Gleixner, H. Peter Anvin, Gabriel C, Arjan van de Ven,
	linux-kernel, linux-mm

On Thu, 8 May 2008, Dave Hansen wrote:
> 
> But, I do think it is absolutely insane to have pmd_clear_bad() going
> after perfectly good hugetlb pmds.  The way it is set up now, people are
> bound to miss the hugetlb pages because just about every single
> pagetable walk has to be specially coded to handle or avoid them.  We
> obviously missed it, here, and we had two good examples in the same
> file! :)

Like it or not, the pgd/pud/pmd/pte hierarchy cannot be assumed once
you're amongst hugepages.  What happens varies from architecture to
architecture.  Perhaps the hugepage specialists could look at what
in fact the different architectures we know today are doing, and
come up with a better abstraction to encompass them all.  But it's
simply wrong for a "generic" pagewalker to be going blindly in there.

Two good examples in the same file??

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 18:58                               ` Hugh Dickins
@ 2008-05-08 19:06                                 ` Dave Hansen
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Hansen @ 2008-05-08 19:06 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nishanth Aravamudan, Hans Rosenfeld, Ingo Molnar, Jeff Chua,
	Thomas Gleixner, H. Peter Anvin, Gabriel C, Arjan van de Ven,
	linux-kernel, linux-mm

On Thu, 2008-05-08 at 19:58 +0100, Hugh Dickins wrote:
> But it's
> simply wrong for a "generic" pagewalker to be going blindly in there.
> 
> Two good examples in the same file??

I was just noting that the other two pagewalking users did the right
vm_huge..() check, and we missed it in the third pagewalker user.

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 18:48                             ` Hugh Dickins
@ 2008-05-08 19:49                               ` Matt Mackall
  2008-05-08 20:08                                 ` Dave Hansen
  2008-05-08 20:02                               ` Hans Rosenfeld
  1 sibling, 1 reply; 32+ messages in thread
From: Matt Mackall @ 2008-05-08 19:49 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nishanth Aravamudan, Hans Rosenfeld, Dave Hansen, Ingo Molnar,
	Jeff Chua, Thomas Gleixner, H. Peter Anvin, Gabriel C,
	Arjan van de Ven, linux-kernel, linux-mm

On Thu, 2008-05-08 at 19:48 +0100, Hugh Dickins wrote:
> On Thu, 8 May 2008, Nishanth Aravamudan wrote:
> > 
> > So, is there any way to either add a is_vm_hugetlb_page(vma) check into
> > pagemap_read()? Or can we modify walk_page_range to take the a vma and
> > skip the walking if is_vm_hugetlb_page(vma) is set [to avoid
> > complications down the road until hugepage walking is fixed]. I guess
> > the latter isn't possible for pagemap_read(), since we are just looking
> > at arbitrary addresses in the process space?
> > 
> > Dunno, seems quite clear that the bug is in pagemap_read(), not any
> > hugepage code, and that the simplest fix is to make pagemap_read() do
> > what the other walker-callers do, and skip hugepage regions.
> 
> Yes, I'm afraid it needs an is_vm_hugetlb_page(vma) in there somehow:
> as you observe, that's what everything else uses to avoid huge issues.
>
> A pmd_huge(*pmd) test is tempting, but it only ever says "yes" on x86:
> we've carefully left it undefined what happens to the pgd/pud/pmd/pte
> hierarchy in the general arch case, once you're amongst hugepages.
> 
> Might follow_huge_addr() be helpful, to avoid the need for a vma?
> Perhaps, but my reading is that actually we've never really been
> testing that path's success case (because get_user_pages already
> skipped is_vm_hugetlb_page), so it might hold further surprises
> on one architecture or another.
> 
> Many thanks to Hans for persisting, and pointing us to pagemap
> to explain this hugepage leak: yes, the pmd_none_or_clear_bad
> will be losing it - and corrupting target user address space.

Ouch.

> Cc'ed Matt: he may have a view on what he wants his pagewalker
> to do with hugepages: I fear it would differ from one usage to
> another.  Skip over them has to be safest, though not ideal.

There are folks who want to be able to get information about hugepages
from pagemap. Thanks to Hans, there's room in the output format for it.

I'd gone to some lengths to pull VMAs out of the picture as it's quite
ugly to have to simultaneously walk VMAs and pagetables. But I may have
to concede that living with hugepages requires it.

-- 
Mathematics is the supreme nostalgia of our time.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 18:48                             ` Hugh Dickins
  2008-05-08 19:49                               ` Matt Mackall
@ 2008-05-08 20:02                               ` Hans Rosenfeld
  2008-05-08 20:16                                 ` Dave Hansen
                                                   ` (2 more replies)
  1 sibling, 3 replies; 32+ messages in thread
From: Hans Rosenfeld @ 2008-05-08 20:02 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nishanth Aravamudan, Dave Hansen, Ingo Molnar, Jeff Chua,
	Thomas Gleixner, H. Peter Anvin, Gabriel C, Arjan van de Ven,
	Matt Mackall, linux-kernel, linux-mm

On Thu, May 08, 2008 at 07:48:51PM +0100, Hugh Dickins wrote:
> > Dunno, seems quite clear that the bug is in pagemap_read(), not any
> > hugepage code, and that the simplest fix is to make pagemap_read() do
> > what the other walker-callers do, and skip hugepage regions.
> 
> Yes, I'm afraid it needs an is_vm_hugetlb_page(vma) in there somehow:
> as you observe, that's what everything else uses to avoid huge issues.
> 
> A pmd_huge(*pmd) test is tempting, but it only ever says "yes" on x86:
> we've carefully left it undefined what happens to the pgd/pud/pmd/pte
> hierarchy in the general arch case, once you're amongst hugepages.

AFAIK the reason for this is that pmd_huge() and pud_huge() are
completely x86-specific. When I looked at the huge page support for
other archs in Linux the last time, all of them marked hugepages with
some page size bits in the PTE, using several PTEs for a single huge
page. So for anything but x86, the pgd/pud/pmd/pte hierarchy should work
for hugepages, too.

> Cc'ed Matt: he may have a view on what he wants his pagewalker
> to do with hugepages: I fear it would differ from one usage to
> another.  Skip over them has to be safest, though not ideal.

Proper hugepage support in /proc/pid/pagemap would be very helpful for
certain kinds of performance analysis, like TLB usage, where you need
page size information when you have a TLB for each supported page size.
That was the reason why I was working on supporting huge pages in
/proc/pid/pagemap in the first place.


-- 
%SYSTEM-F-ANARCHISM, The operating system has been overthrown

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 19:49                               ` Matt Mackall
@ 2008-05-08 20:08                                 ` Dave Hansen
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Hansen @ 2008-05-08 20:08 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Hugh Dickins, Nishanth Aravamudan, Hans Rosenfeld, Ingo Molnar,
	Jeff Chua, Thomas Gleixner, H. Peter Anvin, Gabriel C,
	Arjan van de Ven, linux-kernel, linux-mm

On Thu, 2008-05-08 at 14:49 -0500, Matt Mackall wrote:
> I'd gone to some lengths to pull VMAs out of the picture as it's quite
> ugly to have to simultaneously walk VMAs and pagetables. But I may have
> to concede that living with hugepages requires it.

Yeah, it will definitely change the way that we have to do the pagetable
walk.

Should we just pass the mm around and make anyone that really wants to
get the VMAs do the lookup themselves?  Or, should we just provide the
VMA?

I'll start with just the mm and see where it goes...

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 20:02                               ` Hans Rosenfeld
@ 2008-05-08 20:16                                 ` Dave Hansen
  2008-05-08 23:15                                 ` Dave Hansen
  2008-05-09  9:03                                 ` Paul Mundt
  2 siblings, 0 replies; 32+ messages in thread
From: Dave Hansen @ 2008-05-08 20:16 UTC (permalink / raw)
  To: Hans Rosenfeld
  Cc: Hugh Dickins, Nishanth Aravamudan, Ingo Molnar, Jeff Chua,
	Thomas Gleixner, H. Peter Anvin, Gabriel C, Arjan van de Ven,
	Matt Mackall, linux-kernel, linux-mm

On Thu, 2008-05-08 at 22:02 +0200, Hans Rosenfeld wrote:
> > A pmd_huge(*pmd) test is tempting, but it only ever says "yes" on x86:
> > we've carefully left it undefined what happens to the pgd/pud/pmd/pte
> > hierarchy in the general arch case, once you're amongst hugepages.
> 
> AFAIK the reason for this is that pmd_huge() and pud_huge() are
> completely x86-specific. When I looked at the huge page support for
> other archs in Linux the last time, all of them marked hugepages with
> some page size bits in the PTE, using several PTEs for a single huge
> page. So for anything but x86, the pgd/pud/pmd/pte hierarchy should work
> for hugepages, too.

powerpc kinda puts them in pmds, although Adam calls them ptes in his
diagram.  See Adam's very nice pictures here:

	http://linux-mm.org/PageTableStructure

In the arch code, they have a concept of "slices" for each mm that you
can look up the page size for.  That's what they use when the mm/vmas
aren't around.  Their pmd_ts really are just pointers.  I don't think
they have any flags in them at all like _PAGE_PSE.

They just do a special pagetable walk instead of looking *at* the
pagetables.

-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 20:02                               ` Hans Rosenfeld
  2008-05-08 20:16                                 ` Dave Hansen
@ 2008-05-08 23:15                                 ` Dave Hansen
  2008-05-14 19:01                                   ` Matt Mackall
  2008-05-09  9:03                                 ` Paul Mundt
  2 siblings, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2008-05-08 23:15 UTC (permalink / raw)
  To: Hans Rosenfeld
  Cc: Hugh Dickins, Nishanth Aravamudan, Ingo Molnar, Jeff Chua,
	Thomas Gleixner, H. Peter Anvin, Gabriel C, Arjan van de Ven,
	Matt Mackall, linux-kernel, linux-mm

Here's one quick stab at a solution.  I figured that we already pass
that 'private' variable around.  This patch just sticks that variable
*in* the mm_walk and also makes the caller fill in an 'mm' as well.
Then, we just pass the actual mm_walk around.

Maybe we should just stick the VMA in the mm_walk as well, and have the
common code keep it up to date with the addresses currently being
walked.

Sadly, I didn't quite get enough time to flesh this idea out very far
today, and I'll be offline for a couple of days now.  But, if someone
wants to go this route, I thought this might be useful.  

---

 linux-2.6.git-dave/fs/proc/task_mmu.c |   45 +++++++++++++++++++---------------
 linux-2.6.git-dave/include/linux/mm.h |   17 ++++++------
 linux-2.6.git-dave/mm/pagewalk.c      |   41 +++++++++++++++---------------
 3 files changed, 56 insertions(+), 47 deletions(-)

diff -puN mm/pagewalk.c~pass-mm-into-pagewalkers mm/pagewalk.c
--- linux-2.6.git/mm/pagewalk.c~pass-mm-into-pagewalkers	2008-05-08 15:49:47.000000000 -0700
+++ linux-2.6.git-dave/mm/pagewalk.c	2008-05-08 15:49:54.000000000 -0700
@@ -3,14 +3,14 @@
 #include <linux/sched.h>
 
 static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
-			  const struct mm_walk *walk, void *private)
+			  struct mm_walk *walk)
 {
 	pte_t *pte;
 	int err = 0;
 
 	pte = pte_offset_map(pmd, addr);
 	for (;;) {
-		err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, private);
+		err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
 		if (err)
 		       break;
 		addr += PAGE_SIZE;
@@ -24,7 +24,7 @@ static int walk_pte_range(pmd_t *pmd, un
 }
 
 static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
-			  const struct mm_walk *walk, void *private)
+			  struct mm_walk *walk)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -35,15 +35,15 @@ static int walk_pmd_range(pud_t *pud, un
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_clear_bad(pmd)) {
 			if (walk->pte_hole)
-				err = walk->pte_hole(addr, next, private);
+				err = walk->pte_hole(addr, next, walk);
 			if (err)
 				break;
 			continue;
 		}
 		if (walk->pmd_entry)
-			err = walk->pmd_entry(pmd, addr, next, private);
+			err = walk->pmd_entry(pmd, addr, next, walk);
 		if (!err && walk->pte_entry)
-			err = walk_pte_range(pmd, addr, next, walk, private);
+			err = walk_pte_range(pmd, addr, next, walk);
 		if (err)
 			break;
 	} while (pmd++, addr = next, addr != end);
@@ -52,7 +52,7 @@ static int walk_pmd_range(pud_t *pud, un
 }
 
 static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
-			  const struct mm_walk *walk, void *private)
+			  struct mm_walk *walk)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -63,15 +63,15 @@ static int walk_pud_range(pgd_t *pgd, un
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud)) {
 			if (walk->pte_hole)
-				err = walk->pte_hole(addr, next, private);
+				err = walk->pte_hole(addr, next, walk);
 			if (err)
 				break;
 			continue;
 		}
 		if (walk->pud_entry)
-			err = walk->pud_entry(pud, addr, next, private);
+			err = walk->pud_entry(pud, addr, next, walk);
 		if (!err && (walk->pmd_entry || walk->pte_entry))
-			err = walk_pmd_range(pud, addr, next, walk, private);
+			err = walk_pmd_range(pud, addr, next, walk);
 		if (err)
 			break;
 	} while (pud++, addr = next, addr != end);
@@ -85,15 +85,15 @@ static int walk_pud_range(pgd_t *pgd, un
  * @addr: starting address
  * @end: ending address
  * @walk: set of callbacks to invoke for each level of the tree
- * @private: private data passed to the callback function
  *
  * Recursively walk the page table for the memory area in a VMA,
  * calling supplied callbacks. Callbacks are called in-order (first
  * PGD, first PUD, first PMD, first PTE, second PTE... second PMD,
  * etc.). If lower-level callbacks are omitted, walking depth is reduced.
  *
- * Each callback receives an entry pointer, the start and end of the
- * associated range, and a caller-supplied private data pointer.
+ * Each callback receives an entry pointer and the start and end of the
+ * associated range, and a copy of the original mm_walk for access to
+ * the ->private or ->mm fields.
  *
  * No locks are taken, but the bottom level iterator will map PTE
  * directories from highmem if necessary.
@@ -101,9 +101,8 @@ static int walk_pud_range(pgd_t *pgd, un
  * If any callback returns a non-zero value, the walk is aborted and
  * the return value is propagated back to the caller. Otherwise 0 is returned.
  */
-int walk_page_range(const struct mm_struct *mm,
-		    unsigned long addr, unsigned long end,
-		    const struct mm_walk *walk, void *private)
+int walk_page_range(unsigned long addr, unsigned long end,
+		    struct mm_walk *walk)
 {
 	pgd_t *pgd;
 	unsigned long next;
@@ -112,21 +111,23 @@ int walk_page_range(const struct mm_stru
 	if (addr >= end)
 		return err;
 
-	pgd = pgd_offset(mm, addr);
+	if (!walk->mm)
+		return -EINVAL;
+	pgd = pgd_offset(walk->mm, addr);
 	do {
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd)) {
 			if (walk->pte_hole)
-				err = walk->pte_hole(addr, next, private);
+				err = walk->pte_hole(addr, next, walk);
 			if (err)
 				break;
 			continue;
 		}
 		if (walk->pgd_entry)
-			err = walk->pgd_entry(pgd, addr, next, private);
+			err = walk->pgd_entry(pgd, addr, next, walk);
 		if (!err &&
 		    (walk->pud_entry || walk->pmd_entry || walk->pte_entry))
-			err = walk_pud_range(pgd, addr, next, walk, private);
+			err = walk_pud_range(pgd, addr, next, walk);
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
diff -puN include/linux/mm.h~pass-mm-into-pagewalkers include/linux/mm.h
--- linux-2.6.git/include/linux/mm.h~pass-mm-into-pagewalkers	2008-05-08 15:49:47.000000000 -0700
+++ linux-2.6.git-dave/include/linux/mm.h	2008-05-08 15:49:54.000000000 -0700
@@ -760,16 +760,17 @@ unsigned long unmap_vmas(struct mmu_gath
  * (see walk_page_range for more details)
  */
 struct mm_walk {
-	int (*pgd_entry)(pgd_t *, unsigned long, unsigned long, void *);
-	int (*pud_entry)(pud_t *, unsigned long, unsigned long, void *);
-	int (*pmd_entry)(pmd_t *, unsigned long, unsigned long, void *);
-	int (*pte_entry)(pte_t *, unsigned long, unsigned long, void *);
-	int (*pte_hole)(unsigned long, unsigned long, void *);
+	int (*pgd_entry)(pgd_t *, unsigned long, unsigned long, struct mm_walk *);
+	int (*pud_entry)(pud_t *, unsigned long, unsigned long, struct mm_walk *);
+	int (*pmd_entry)(pmd_t *, unsigned long, unsigned long, struct mm_walk *);
+	int (*pte_entry)(pte_t *, unsigned long, unsigned long, struct mm_walk *);
+	int (*pte_hole)(unsigned long, unsigned long, struct mm_walk *);
+	struct mm_struct *mm;
+	void *private;
 };
 
-int walk_page_range(const struct mm_struct *, unsigned long addr,
-		    unsigned long end, const struct mm_walk *walk,
-		    void *private);
+int walk_page_range(unsigned long addr, unsigned long end,
+		struct mm_walk *walk);
 void free_pgd_range(struct mmu_gather **tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
 void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
diff -puN fs/proc/task_mmu.c~pass-mm-into-pagewalkers fs/proc/task_mmu.c
--- linux-2.6.git/fs/proc/task_mmu.c~pass-mm-into-pagewalkers	2008-05-08 15:49:47.000000000 -0700
+++ linux-2.6.git-dave/fs/proc/task_mmu.c	2008-05-08 15:49:59.000000000 -0700
@@ -317,9 +317,9 @@ struct mem_size_stats {
 };
 
 static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
-			   void *private)
+			   struct mm_walk *walk)
 {
-	struct mem_size_stats *mss = private;
+	struct mem_size_stats *mss = walk->private;
 	struct vm_area_struct *vma = mss->vma;
 	pte_t *pte, ptent;
 	spinlock_t *ptl;
@@ -367,19 +367,22 @@ static int smaps_pte_range(pmd_t *pmd, u
 	return 0;
 }
 
-static struct mm_walk smaps_walk = { .pmd_entry = smaps_pte_range };
-
 static int show_smap(struct seq_file *m, void *v)
 {
 	struct vm_area_struct *vma = v;
 	struct mem_size_stats mss;
 	int ret;
+	struct mm_walk smaps_walk = {
+		.pmd_entry = smaps_pte_range,
+		.mm = vma->vm_mm,
+		.private = &mss,
+	};
+
 
 	memset(&mss, 0, sizeof mss);
 	mss.vma = vma;
 	if (vma->vm_mm && !is_vm_hugetlb_page(vma))
-		walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
-				&smaps_walk, &mss);
+		walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);
 
 	ret = show_map(m, v);
 	if (ret)
@@ -428,9 +431,9 @@ const struct file_operations proc_smaps_
 };
 
 static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
-				unsigned long end, void *private)
+				unsigned long end, struct mm_walk *walk)
 {
-	struct vm_area_struct *vma = private;
+	struct vm_area_struct *vma = walk->private;
 	pte_t *pte, ptent;
 	spinlock_t *ptl;
 	struct page *page;
@@ -454,8 +457,6 @@ static int clear_refs_pte_range(pmd_t *p
 	return 0;
 }
 
-static struct mm_walk clear_refs_walk = { .pmd_entry = clear_refs_pte_range };
-
 static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 				size_t count, loff_t *ppos)
 {
@@ -478,11 +479,17 @@ static ssize_t clear_refs_write(struct f
 		return -ESRCH;
 	mm = get_task_mm(task);
 	if (mm) {
+		static struct mm_walk clear_refs_walk;
+		memset(&clear_refs_walk, 0, sizeof(clear_refs_walk));
+		clear_refs_walk.pmd_entry = clear_refs_pte_range;
+		clear_refs_walk.mm = mm;
 		down_read(&mm->mmap_sem);
-		for (vma = mm->mmap; vma; vma = vma->vm_next)
+		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			clear_refs_walk.private = vma;
 			if (!is_vm_hugetlb_page(vma))
-				walk_page_range(mm, vma->vm_start, vma->vm_end,
-						&clear_refs_walk, vma);
+				walk_page_range(vma->vm_start, vma->vm_end,
+						&clear_refs_walk);
+		}
 		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
@@ -540,9 +547,9 @@ static int add_to_pagemap(unsigned long 
 }
 
 static int pagemap_pte_hole(unsigned long start, unsigned long end,
-				void *private)
+				struct mm_walk *walk)
 {
-	struct pagemapread *pm = private;
+	struct pagemapread *pm = walk->private;
 	unsigned long addr;
 	int err = 0;
 	for (addr = start; addr < end; addr += PAGE_SIZE) {
@@ -560,9 +567,9 @@ static u64 swap_pte_to_pagemap_entry(pte
 }
 
 static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
-			     void *private)
+			     struct mm_walk *walk)
 {
-	struct pagemapread *pm = private;
+	struct pagemapread *pm = walk->private;
 	pte_t *pte;
 	int err = 0;
 
@@ -687,8 +694,8 @@ static ssize_t pagemap_read(struct file 
 		 * user buffer is tracked in "pm", and the walk
 		 * will stop when we hit the end of the buffer.
 		 */
-		ret = walk_page_range(mm, start_vaddr, end_vaddr,
-					&pagemap_walk, &pm);
+		ret = walk_page_range(start_vaddr, end_vaddr,
+					&pagemap_walk);
 		if (ret == PM_END_OF_BUFFER)
 			ret = 0;
 		/* don't need mmap_sem for these, but this looks cleaner */
_


-- Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 20:02                               ` Hans Rosenfeld
  2008-05-08 20:16                                 ` Dave Hansen
  2008-05-08 23:15                                 ` Dave Hansen
@ 2008-05-09  9:03                                 ` Paul Mundt
  2 siblings, 0 replies; 32+ messages in thread
From: Paul Mundt @ 2008-05-09  9:03 UTC (permalink / raw)
  To: Hans Rosenfeld
  Cc: Hugh Dickins, Nishanth Aravamudan, Dave Hansen, Ingo Molnar,
	Jeff Chua, Thomas Gleixner, H. Peter Anvin, Gabriel C,
	Arjan van de Ven, Matt Mackall, linux-kernel, linux-mm

On Thu, May 08, 2008 at 10:02:39PM +0200, Hans Rosenfeld wrote:
> On Thu, May 08, 2008 at 07:48:51PM +0100, Hugh Dickins wrote:
> > > Dunno, seems quite clear that the bug is in pagemap_read(), not any
> > > hugepage code, and that the simplest fix is to make pagemap_read() do
> > > what the other walker-callers do, and skip hugepage regions.
> > 
> > Yes, I'm afraid it needs an is_vm_hugetlb_page(vma) in there somehow:
> > as you observe, that's what everything else uses to avoid huge issues.
> > 
> > A pmd_huge(*pmd) test is tempting, but it only ever says "yes" on x86:
> > we've carefully left it undefined what happens to the pgd/pud/pmd/pte
> > hierarchy in the general arch case, once you're amongst hugepages.
> 
> AFAIK the reason for this is that pmd_huge() and pud_huge() are
> completely x86-specific. When I looked at the huge page support for
> other archs in Linux the last time, all of them marked hugepages with
> some page size bits in the PTE, using several PTEs for a single huge
> page. So for anything but x86, the pgd/pud/pmd/pte hierarchy should work
> for hugepages, too.
> 
s390 also does hugepages at the pmd level, so it's not only x86. And
while it's not an issue today, it's worth noting that ARM also has the
same characteristics for larger sizes. Should someone feel compelled to
implement hugepages there, this will almost certainly come up again -- at
least in so far as pmd_huge() is concerned.

At a quick glance, sparc64 also looks like it might need some special
handling in the pagemap case, too..

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 23:15                                 ` Dave Hansen
@ 2008-05-14 19:01                                   ` Matt Mackall
  0 siblings, 0 replies; 32+ messages in thread
From: Matt Mackall @ 2008-05-14 19:01 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Hans Rosenfeld, Hugh Dickins, Nishanth Aravamudan, Ingo Molnar,
	Jeff Chua, Thomas Gleixner, H. Peter Anvin, Gabriel C,
	Arjan van de Ven, linux-kernel, linux-mm

On Thu, 2008-05-08 at 16:15 -0700, Dave Hansen wrote:
> Here's one quick stab at a solution.  I figured that we already pass
> that 'private' variable around.  This patch just sticks that variable
> *in* the mm_walk and also makes the caller fill in an 'mm' as well.
> Then, we just pass the actual mm_walk around.
> 
> Maybe we should just stick the VMA in the mm_walk as well, and have the
> common code keep it up to date with the addresses currently being
> walked.
> 
> Sadly, I didn't quite get enough time to flesh this idea out very far
> today, and I'll be offline for a couple of days now.  But, if someone
> wants to go this route, I thought this might be useful.  

This much looks reasonable. But the real test of course is to actually
teach it about detecting huge pages efficiently. And I suspect that
means tracking the current VMA all the time in the walk. Am I wrong?

-- 
Mathematics is the supreme nostalgia of our time.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2008-05-14 19:01 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <b6a2187b0805051806v25fa1272xb08e0b70b9c3408@mail.gmail.com>
     [not found] ` <20080506124946.GA2146@elte.hu>
     [not found]   ` <Pine.LNX.4.64.0805061435510.32567@blonde.site>
     [not found]     ` <alpine.LFD.1.10.0805061138580.32269@woody.linux-foundation.org>
2008-05-06 19:49       ` [PATCH] x86: fix PAE pmd_bad bootup warning Hugh Dickins
2008-05-06 20:06         ` Linus Torvalds
2008-05-06 20:30           ` Hugh Dickins
2008-05-08 16:07             ` Nishanth Aravamudan
2008-05-06 20:22         ` Hans Rosenfeld
2008-05-06 20:36           ` Hugh Dickins
2008-05-07 23:39             ` Nishanth Aravamudan
2008-05-06 20:42           ` Dave Hansen
2008-05-08 14:34             ` Hans Rosenfeld
2008-05-08 14:39               ` Hans Rosenfeld
2008-05-08 14:52               ` Dave Hansen
2008-05-08 15:11                 ` Hans Rosenfeld
2008-05-08 15:51                   ` Dave Hansen
2008-05-08 16:19                     ` Hans Rosenfeld
2008-05-08 16:33                       ` Nishanth Aravamudan
2008-05-08 16:51                         ` Hans Rosenfeld
2008-05-08 17:16                           ` Nishanth Aravamudan
2008-05-08 18:42                             ` Dave Hansen
2008-05-08 18:58                               ` Hugh Dickins
2008-05-08 19:06                                 ` Dave Hansen
2008-05-08 18:48                             ` Hugh Dickins
2008-05-08 19:49                               ` Matt Mackall
2008-05-08 20:08                                 ` Dave Hansen
2008-05-08 20:02                               ` Hans Rosenfeld
2008-05-08 20:16                                 ` Dave Hansen
2008-05-08 23:15                                 ` Dave Hansen
2008-05-14 19:01                                   ` Matt Mackall
2008-05-09  9:03                                 ` Paul Mundt
2008-05-08 16:42                       ` Dave Hansen
2008-05-08 15:44                 ` Nishanth Aravamudan
2008-05-07  4:40         ` Jeff Chua
2008-05-07  5:30           ` Hugh Dickins

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).