* [PATCH] x86: fix PAE pmd_bad bootup warning

From: Hugh Dickins @ 2008-05-06 19:49 UTC
To: Linus Torvalds
Cc: Ingo Molnar, Jeff Chua, Thomas Gleixner, H. Peter Anvin, Gabriel C,
    Hans Rosenfeld, Arjan van de Ven, Nishanth Aravamudan, linux-kernel, linux-mm

Fix warning from pmd_bad() at bootup on a HIGHMEM64G HIGHPTE x86_32.

That came from 9fc34113f6880b215cbea4e7017fc818700384c2 x86: debug pmd_bad();
but we understand now that the typecasting was wrong for PAE in the previous
version: pagetable pages above 4GB looked bad and stopped Arjan from booting.

And revert that cded932b75ab0a5f9181ee3da34a0a488d1a14fd x86: fix pmd_bad
and pud_bad to support huge pages.  It was the wrong way round: we shouldn't
weaken every pmd_bad and pud_bad check to let huge pages slip through - in
part they check that we _don't_ have a huge page where it's not expected.

Put the x86 pmd_bad() and pud_bad() definitions back to what they have long
been: they can be improved (x86_32 should use PTE_MASK, to stop PAE thinking
junk in the upper word is good; and x86_64 should follow x86_32's stricter
comparison, to stop thinking any subset of required bits is good); but that
should be a later patch.

Fix Hans' good observation that follow_page() will never find pmd_huge()
because that would have already failed the pmd_bad test: test pmd_huge in
between the pmd_none and pmd_bad tests.  Tighten x86's pmd_huge() check?
No, once it's a hugepage entry, it can get quite far from a good pmd: for
example, PROT_NONE leaves it with only ACCESSED of the KERN_PGTABLE bits.

However... though follow_page() contains this and another test for huge
pages, so it's nice to keep it working on them, where does it actually get
called on a huge page?  get_user_pages() checks is_vm_hugetlb_page(vma) to
call alternative hugetlb processing, as does unmap_vmas() and others.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
So Hans' original hugepage leak remains unexplained and unfixed.
Hans, did you find that hugepage leak with a standard kernel, or were
you perhaps trying out some hugepage-using patch of your own, without
marking the vma VM_HUGETLB?  Or were you expecting the hugetlbfs file
to truncate itself once all mmappers had gone?  If the standard kernel
leaks hugepages, I'm surprised the hugetlb guys don't know about it.
 arch/x86/mm/pgtable_32.c     |    7 -------
 include/asm-x86/pgtable_32.h |    9 +--------
 include/asm-x86/pgtable_64.h |    6 ++----
 mm/memory.c                  |    5 ++++-
 4 files changed, 7 insertions(+), 20 deletions(-)

--- 2.6.26-rc1/arch/x86/mm/pgtable_32.c	2008-05-03 21:54:41.000000000 +0100
+++ linux/arch/x86/mm/pgtable_32.c	2008-05-06 14:13:24.000000000 +0100
@@ -172,10 +172,3 @@ void reserve_top_address(unsigned long r
 	__FIXADDR_TOP = -reserve - PAGE_SIZE;
 	__VMALLOC_RESERVE += reserve;
 }
-
-int pmd_bad(pmd_t pmd)
-{
-	WARN_ON_ONCE(pmd_bad_v1(pmd) != pmd_bad_v2(pmd));
-
-	return pmd_bad_v1(pmd);
-}
--- 2.6.26-rc1/include/asm-x86/pgtable_32.h	2008-05-03 21:55:10.000000000 +0100
+++ linux/include/asm-x86/pgtable_32.h	2008-05-06 14:13:24.000000000 +0100
@@ -88,14 +88,7 @@ extern unsigned long pg0[];
 /* To avoid harmful races, pmd_none(x) should check only the lower when PAE */
 #define pmd_none(x)	(!(unsigned long)pmd_val((x)))
 #define pmd_present(x)	(pmd_val((x)) & _PAGE_PRESENT)
-
-extern int pmd_bad(pmd_t pmd);
-
-#define pmd_bad_v1(x) \
-	(_KERNPG_TABLE != (pmd_val((x)) & ~(PAGE_MASK | _PAGE_USER)))
-#define pmd_bad_v2(x) \
-	(_KERNPG_TABLE != (pmd_val((x)) & ~(PAGE_MASK | _PAGE_USER | \
-					    _PAGE_PSE | _PAGE_NX)))
+#define pmd_bad(x)	((pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER)) != _KERNPG_TABLE)

 #define pages_to_mb(x) ((x) >> (20-PAGE_SHIFT))
--- 2.6.26-rc1/include/asm-x86/pgtable_64.h	2008-05-03 21:55:10.000000000 +0100
+++ linux/include/asm-x86/pgtable_64.h	2008-05-06 14:13:24.000000000 +0100
@@ -158,14 +158,12 @@ static inline unsigned long pgd_bad(pgd_

 static inline unsigned long pud_bad(pud_t pud)
 {
-	return pud_val(pud) &
-		~(PTE_MASK | _KERNPG_TABLE | _PAGE_USER | _PAGE_PSE | _PAGE_NX);
+	return pud_val(pud) & ~(PTE_MASK | _KERNPG_TABLE | _PAGE_USER);
 }

 static inline unsigned long pmd_bad(pmd_t pmd)
 {
-	return pmd_val(pmd) &
-		~(PTE_MASK | _KERNPG_TABLE | _PAGE_USER | _PAGE_PSE | _PAGE_NX);
+	return pmd_val(pmd) & ~(PTE_MASK | _KERNPG_TABLE | _PAGE_USER);
 }

 #define pte_none(x)	(!pte_val((x)))
--- 2.6.26-rc1/mm/memory.c	2008-05-03 21:55:12.000000000 +0100
+++ linux/mm/memory.c	2008-05-06 14:13:24.000000000 +0100
@@ -969,7 +969,7 @@ struct page *follow_page(struct vm_area_
 		goto no_page_table;

 	pmd = pmd_offset(pud, address);
-	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+	if (pmd_none(*pmd))
 		goto no_page_table;

 	if (pmd_huge(*pmd)) {
@@ -978,6 +978,9 @@ struct page *follow_page(struct vm_area_
 		goto out;
 	}

+	if (unlikely(pmd_bad(*pmd)))
+		goto no_page_table;
+
 	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (!ptep)
 		goto out;
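[Editor's note: for readers wondering what the "later patch" tightening Hugh
mentions above might look like, here is a rough sketch only, not part of the
posted patch: an x86_32 PAE pmd_bad() masked with PTE_MASK so junk in the
upper word is also rejected.  Whether PTE_MASK is the right macro on a given
tree is an assumption, not something stated in this thread.]

	/*
	 * Hypothetical sketch of the stricter check Hugh defers to a
	 * later patch: compare against _KERNPG_TABLE exactly, but mask
	 * with PTE_MASK so the PAE upper word is inspected as well.
	 * Not the code in the patch above.
	 */
	#define pmd_bad(x) \
		((pmd_val(x) & ~(PTE_MASK | _PAGE_USER)) != _KERNPG_TABLE)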
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning

From: Linus Torvalds @ 2008-05-06 20:06 UTC
To: Hugh Dickins
Cc: Ingo Molnar, Jeff Chua, Thomas Gleixner, H. Peter Anvin, Gabriel C,
    Hans Rosenfeld, Arjan van de Ven, Nishanth Aravamudan, linux-kernel, linux-mm

On Tue, 6 May 2008, Hugh Dickins wrote:
>
> Fix Hans' good observation that follow_page() will never find pmd_huge()
> because that would have already failed the pmd_bad test: test pmd_huge in
> between the pmd_none and pmd_bad tests.  Tighten x86's pmd_huge() check?
> No, once it's a hugepage entry, it can get quite far from a good pmd: for
> example, PROT_NONE leaves it with only ACCESSED of the KERN_PGTABLE bits.

I'd much rather have pmd_bad() etc fixed up instead, so that they do a
more proper test (not thinking that a PSE page is bad, since it clearly
isn't). And then, make them dependent on DEBUG_VM, because doing the
proper test will be more expensive.

Hmm?

		Linus
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning

From: Hugh Dickins @ 2008-05-06 20:30 UTC
To: Linus Torvalds
Cc: Ingo Molnar, Jeff Chua, Thomas Gleixner, H. Peter Anvin, Gabriel C,
    Hans Rosenfeld, Arjan van de Ven, Nishanth Aravamudan, Jeremy Fitzhardinge,
    linux-kernel, linux-mm

On Tue, 6 May 2008, Linus Torvalds wrote:
> On Tue, 6 May 2008, Hugh Dickins wrote:
> >
> > Fix Hans' good observation that follow_page() will never find pmd_huge()
> > because that would have already failed the pmd_bad test: test pmd_huge in
> > between the pmd_none and pmd_bad tests.  Tighten x86's pmd_huge() check?
> > No, once it's a hugepage entry, it can get quite far from a good pmd: for
> > example, PROT_NONE leaves it with only ACCESSED of the KERN_PGTABLE bits.
>
> I'd much rather have pmd_bad() etc fixed up instead, so that they do a
> more proper test (not thinking that a PSE page is bad, since it clearly
> isn't). And then, make them dependent on DEBUG_VM, because doing the
> proper test will be more expensive.

But everywhere we use pmd_bad() etc (most are hidden inside
pmd_none_or_clear_bad() etc) we are expecting never to encounter
a pmd_huge, unless there's corruption.  follow_page() is the one
exception, and even in its case I can't find a current user that
actually could meet a hugepage.  I'd rather tighten up pmd_bad
(in the PAE and x86_64 cases), than weaken it so far as to let
hugepages slip through.  Not that pmd_bad often catches anything:
just coincidentally that 90909090 one today.

Hugh
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning

From: Nishanth Aravamudan @ 2008-05-08 16:07 UTC
To: Hugh Dickins
Cc: Linus Torvalds, Ingo Molnar, Jeff Chua, Thomas Gleixner, H. Peter Anvin,
    Gabriel C, Hans Rosenfeld, Arjan van de Ven, Jeremy Fitzhardinge,
    linux-kernel, linux-mm

On 06.05.2008 [21:30:44 +0100], Hugh Dickins wrote:
> On Tue, 6 May 2008, Linus Torvalds wrote:
> > On Tue, 6 May 2008, Hugh Dickins wrote:
> > >
> > > Fix Hans' good observation that follow_page() will never find
> > > pmd_huge() because that would have already failed the pmd_bad
> > > test: test pmd_huge in between the pmd_none and pmd_bad tests.
> > > Tighten x86's pmd_huge() check?  No, once it's a hugepage entry,
> > > it can get quite far from a good pmd: for example, PROT_NONE
> > > leaves it with only ACCESSED of the KERN_PGTABLE bits.
> >
> > I'd much rather have pmd_bad() etc fixed up instead, so that they do
> > a more proper test (not thinking that a PSE page is bad, since it
> > clearly isn't). And then, make them dependent on DEBUG_VM, because
> > doing the proper test will be more expensive.
>
> But everywhere we use pmd_bad() etc (most are hidden inside
> pmd_none_or_clear_bad() etc) we are expecting never to encounter
> a pmd_huge, unless there's corruption.  follow_page() is the one
> exception, and even in its case I can't find a current user that
> actually could meet a hugepage.  I'd rather tighten up pmd_bad
> (in the PAE and x86_64 cases), than weaken it so far as to let
> hugepages slip through.  Not that pmd_bad often catches anything:
> just coincidentally that 90909090 one today.

There is one case that seems to be the source of Hans' problem, as Dave
has figured out: /proc/pid/pagemap, where we fairly straight-forwardly
walk the pagetables.  In there, we unconditionally call
pmd_none_or_clear_bad(pmd).  And any userspace process that maps
hugepages and then reads in /proc/pid/pagemap will invoke that path, I
think (at least with 2M pages).

So I agree, you're fixing a potential issue in follow_page() [might
deserve a comment, so someone doesn't go and combine them back later?],
but Hans' issue is most likely related to the pagemap code?

Thanks,
Nish

--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning

From: Hans Rosenfeld @ 2008-05-06 20:22 UTC
To: Hugh Dickins
Cc: Ingo Molnar, Jeff Chua, Thomas Gleixner, H. Peter Anvin, Gabriel C,
    Arjan van de Ven, Nishanth Aravamudan, linux-kernel, linux-mm

On Tue, May 06, 2008 at 08:49:23PM +0100, Hugh Dickins wrote:
> So Hans' original hugepage leak remains unexplained and unfixed.
> Hans, did you find that hugepage leak with a standard kernel, or were
> you perhaps trying out some hugepage-using patch of your own, without
> marking the vma VM_HUGETLB?  Or were you expecting the hugetlbfs file
> to truncate itself once all mmappers had gone?  If the standard kernel
> leaks hugepages, I'm surprised the hugetlb guys don't know about it.

I used a standard kernel (well, not quite, I had made some changes to
the /proc/pid/pagemap code, but nothing that would affect the hugepage
stuff) and some simple test program that would just mmap a hugepage.

I expected that any hugepage that a process had mmapped would
automatically be returned to the system when the process exits. That was
not the case, the process exited and the hugepage was lost (unless I
changed the program to explicitly munmap the hugepage before exiting).
Removing the hugetlbfs file containing the hugepage also didn't free the
page.

--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning

From: Hugh Dickins @ 2008-05-06 20:36 UTC
To: Hans Rosenfeld
Cc: Ingo Molnar, Jeff Chua, Thomas Gleixner, H. Peter Anvin, Gabriel C,
    Arjan van de Ven, Nishanth Aravamudan, linux-kernel, linux-mm

On Tue, 6 May 2008, Hans Rosenfeld wrote:
> On Tue, May 06, 2008 at 08:49:23PM +0100, Hugh Dickins wrote:
> > So Hans' original hugepage leak remains unexplained and unfixed.
> > Hans, did you find that hugepage leak with a standard kernel, or were
> > you perhaps trying out some hugepage-using patch of your own, without
> > marking the vma VM_HUGETLB?  Or were you expecting the hugetlbfs file
> > to truncate itself once all mmappers had gone?  If the standard kernel
> > leaks hugepages, I'm surprised the hugetlb guys don't know about it.
>
> I used a standard kernel (well, not quite, I had made some changes to
> the /proc/pid/pagemap code, but nothing that would affect the hugepage
> stuff) and some simple test program that would just mmap a hugepage.
>
> I expected that any hugepage that a process had mmapped would
> automatically be returned to the system when the process exits. That was
> not the case, the process exited and the hugepage was lost (unless I
> changed the program to explicitly munmap the hugepage before exiting).
> Removing the hugetlbfs file containing the hugepage also didn't free the
> page.

Hmm.  That doesn't fit with my experience: I've not found an explicit
munmap makes any difference (I wouldn't expect it to), but removing
the file once all openers gone does free everything up.  I guess I'm
overlooking something more experienced hugepagers will soon light upon.

Hugh
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning

From: Nishanth Aravamudan @ 2008-05-07 23:39 UTC
To: Hugh Dickins
Cc: Hans Rosenfeld, Ingo Molnar, Jeff Chua, Thomas Gleixner, H. Peter Anvin,
    Gabriel C, Arjan van de Ven, linux-kernel, linux-mm

On 06.05.2008 [21:36:59 +0100], Hugh Dickins wrote:
> On Tue, 6 May 2008, Hans Rosenfeld wrote:
> > On Tue, May 06, 2008 at 08:49:23PM +0100, Hugh Dickins wrote:
> > > So Hans' original hugepage leak remains unexplained and unfixed.
> > > Hans, did you find that hugepage leak with a standard kernel, or were
> > > you perhaps trying out some hugepage-using patch of your own, without
> > > marking the vma VM_HUGETLB?  Or were you expecting the hugetlbfs file
> > > to truncate itself once all mmappers had gone?  If the standard kernel
> > > leaks hugepages, I'm surprised the hugetlb guys don't know about it.
> >
> > I used a standard kernel (well, not quite, I had made some changes to
> > the /proc/pid/pagemap code, but nothing that would affect the hugepage
> > stuff) and some simple test program that would just mmap a hugepage.
> >
> > I expected that any hugepage that a process had mmapped would
> > automatically be returned to the system when the process exits. That was
> > not the case, the process exited and the hugepage was lost (unless I
> > changed the program to explicitly munmap the hugepage before exiting).
> > Removing the hugetlbfs file containing the hugepage also didn't free the
> > page.
>
> Hmm.  That doesn't fit with my experience: I've not found an explicit
> munmap makes any difference (I wouldn't expect it to), but removing
> the file once all openers gone does free everything up.  I guess I'm
> overlooking something more experienced hugepagers will soon light
> upon.

Nothing strikes me immediately. What you described is what I expected.
As Dave pointed out separately, are you able to unmount hugetlbfs at
this point?

Hans, can you send out a sample application's source? What kernel were
you testing on?

Thanks,
Nish

--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning

From: Dave Hansen @ 2008-05-06 20:42 UTC
To: Hans Rosenfeld
Cc: Hugh Dickins, Ingo Molnar, Jeff Chua, Thomas Gleixner, H. Peter Anvin,
    Gabriel C, Arjan van de Ven, Nishanth Aravamudan, linux-kernel, linux-mm

On Tue, 2008-05-06 at 22:22 +0200, Hans Rosenfeld wrote:
> I expected that any hugepage that a process had mmapped would
> automatically be returned to the system when the process exits. That was
> not the case, the process exited and the hugepage was lost (unless I
> changed the program to explicitly munmap the hugepage before exiting).
> Removing the hugetlbfs file containing the hugepage also didn't free the
> page.

Could you post the code you're using to do this?  I have to wonder if
you're leaving a fd open somewhere.  Even if you rm the hugepage file,
it'll stay allocated if you have a fd open, or if *someone* is still
mapping it.

Can you umount your hugetlbfs?

-- Dave
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning

From: Hans Rosenfeld @ 2008-05-08 14:34 UTC
To: Dave Hansen
Cc: Hugh Dickins, Ingo Molnar, Jeff Chua, Thomas Gleixner, H. Peter Anvin,
    Gabriel C, Arjan van de Ven, Nishanth Aravamudan, linux-kernel, linux-mm

On Tue, May 06, 2008 at 01:42:59PM -0700, Dave Hansen wrote:
> Could you post the code you're using to do this?  I have to wonder if
> you're leaving a fd open somewhere.  Even if you rm the hugepage file,
> it'll stay allocated if you have a fd open, or if *someone* is still
> mapping it.

A stripped-down program exposing the leak is attached. It doesn't fork
or do any other fancy stuff, so I don't see a way it could leave any fd
open.

While trying to reproduce this, I noticed that the huge page wouldn't
leak when I just mmapped it and exited without explicitly unmapping, as
I described before. The huge page is leaked only when the
/proc/self/pagemap entry for the huge page is read.

The kernel I tested with was 2.6.26-rc1-00179-g3de2403.

> Can you umount your hugetlbfs?

Yes, but the pages remain lost.

Hans

--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning

From: Hans Rosenfeld @ 2008-05-08 14:39 UTC
To: Dave Hansen
Cc: Hugh Dickins, Ingo Molnar, Jeff Chua, Thomas Gleixner, H. Peter Anvin,
    Gabriel C, Arjan van de Ven, Nishanth Aravamudan, linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 202 bytes --]

On Thu, May 08, 2008 at 04:34:53PM +0200, Hans Rosenfeld wrote:
> A stripped-down program exposing the leak is attached.

It is now :)

--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown

[-- Attachment #2: hugetest.c --]
[-- Type: text/plain, Size: 1429 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <inttypes.h>
#include <signal.h>
#include <string.h>
#include <strings.h>

#include <sys/mman.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#define HUGEPAGES "/proc/sys/vm/nr_hugepages"
#define HUGE_FILE "/mnt/huge/foobar"

void *huge;

int main(int argc, char **argv)
{
	uint64_t ppte;
	int fd, maps;

	if ((fd = open(HUGEPAGES, O_WRONLY, 0)) == -1) {
		perror(HUGEPAGES);
		return 1;
	}
	write(fd, "20\n", 3);
	close(fd);

	if ((fd = open(HUGE_FILE, O_RDWR | O_CREAT, 0)) == -1) {
		perror(HUGE_FILE);
		return 1;
	}

	huge = mmap((void *) 0x1000000, 0x400000, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_FIXED, fd, 0);
	if (huge == MAP_FAILED) {
		perror(HUGE_FILE);
		return 1;
	}

	fprintf(stderr, "huge: 0x%0.*" PRIxPTR "\n", sizeof(huge) * 2, huge);

	memset(huge, 1, 12345);

	if ((maps = open("/proc/self/pagemap", O_RDONLY, 0)) < 0) {
		perror("/proc/self/pagemap");
		return 1;
	}

	if (pread(maps, &ppte, sizeof(ppte), ((uintptr_t) huge) >> 9) < 0) {
		perror("pread");
		return 1;
	}

	fprintf(stderr, "ppte: 0x%0.16" PRIx64 "\n", ppte);

	return 0;
}
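[Editor's note: the details below are inferred from the program itself, not
stated by Hans.  The test assumes a hugetlbfs mount already exists at
/mnt/huge and that the system can satisfy the 20 hugepages it requests via
/proc/sys/vm/nr_hugepages; a simple way to observe the leak should be to
compare HugePages_Free in /proc/meminfo before the run and after the process
has exited and /mnt/huge/foobar has been removed.]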
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning

From: Dave Hansen @ 2008-05-08 14:52 UTC
To: Hans Rosenfeld
Cc: Hugh Dickins, Ingo Molnar, Jeff Chua, Thomas Gleixner, H. Peter Anvin,
    Gabriel C, Arjan van de Ven, Nishanth Aravamudan, linux-kernel, linux-mm

On Thu, 2008-05-08 at 16:34 +0200, Hans Rosenfeld wrote:
> While trying to reproduce this, I noticed that the huge page wouldn't
> leak when I just mmapped it and exited without explicitly unmapping, as
> I described before. The huge page is leaked only when the
> /proc/self/pagemap entry for the huge page is read.

Well, that's an interesting data point! :)

Are you running any of your /proc/<pid>/pagemap patches?

-- Dave
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning

From: Hans Rosenfeld @ 2008-05-08 15:11 UTC
To: Dave Hansen
Cc: Hugh Dickins, Ingo Molnar, Jeff Chua, Thomas Gleixner, H. Peter Anvin,
    Gabriel C, Arjan van de Ven, Nishanth Aravamudan, linux-kernel, linux-mm

On Thu, May 08, 2008 at 07:52:30AM -0700, Dave Hansen wrote:
> On Thu, 2008-05-08 at 16:34 +0200, Hans Rosenfeld wrote:
> > The huge page is leaked only when the
> > /proc/self/pagemap entry for the huge page is read.
>
> Well, that's an interesting data point! :)
>
> Are you running any of your /proc/<pid>/pagemap patches?

No additional patches. The problem already existed before we agreed on
the change to the pagemap code to just include the page size in the
values returned, and not doing any special huge page handling. I suspect
the page walking code used by /proc/pid/pagemap is doing something nasty
when it sees a huge page as it doesn't know how to handle it.

--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning

From: Dave Hansen @ 2008-05-08 15:51 UTC
To: Hans Rosenfeld
Cc: Hugh Dickins, Ingo Molnar, Jeff Chua, Thomas Gleixner, H. Peter Anvin,
    Gabriel C, Arjan van de Ven, Nishanth Aravamudan, linux-kernel, linux-mm

On Thu, 2008-05-08 at 17:11 +0200, Hans Rosenfeld wrote:
> On Thu, May 08, 2008 at 07:52:30AM -0700, Dave Hansen wrote:
> > On Thu, 2008-05-08 at 16:34 +0200, Hans Rosenfeld wrote:
> > > The huge page is leaked only when the
> > > /proc/self/pagemap entry for the huge page is read.
> >
> > Well, that's an interesting data point! :)
> >
> > Are you running any of your /proc/<pid>/pagemap patches?
>
> No additional patches. The problem already existed before we agreed on
> the change to the pagemap code to just include the page size in the
> values returned, and not doing any special huge page handling. I suspect
> the page walking code used by /proc/pid/pagemap is doing something nasty
> when it sees a huge page as it doesn't know how to handle it.

Is there anything in your dmesg?

static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
			  const struct mm_walk *walk, void *private)
{
	pmd_t *pmd;
	unsigned long next;
	int err = 0;

	pmd = pmd_offset(pud, addr);
	do {
		next = pmd_addr_end(addr, end);
		if (pmd_none_or_clear_bad(pmd)) {
			if (walk->pte_hole)
				err = walk->pte_hole(addr, next, private);
			if (err)
				break;
			continue;

There was a discussion on LKML in the last couple of days about
pmd_bad() triggering on huge pages.  Perhaps we're clearing the mapping
with the pmd_none_or_clear_bad(), and *THAT* is leaking the page.

-- Dave
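[Editor's note: for context on Dave's hypothesis, pmd_none_or_clear_bad() is
the generic helper, roughly as below in 2.6.26-era include/asm-generic/pgtable.h
and mm/memory.c.  On x86 a hugepage pmd fails the then-strict pmd_bad() test,
so the walker both prints the "bad pmd" line and wipes the entry.]

	static inline int pmd_none_or_clear_bad(pmd_t *pmd)
	{
		if (pmd_none(*pmd))
			return 1;
		if (unlikely(pmd_bad(*pmd))) {
			pmd_clear_bad(pmd);
			return 1;
		}
		return 0;
	}

	/* mm/memory.c */
	void pmd_clear_bad(pmd_t *pmd)
	{
		pmd_ERROR(*pmd);	/* prints the "bad pmd" line Hans sees below */
		pmd_clear(pmd);		/* the hugepage mapping is gone -- hence the leak */
	}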
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning

From: Hans Rosenfeld @ 2008-05-08 16:19 UTC
To: Dave Hansen
Cc: Hugh Dickins, Ingo Molnar, Jeff Chua, Thomas Gleixner, H. Peter Anvin,
    Gabriel C, Arjan van de Ven, Nishanth Aravamudan, linux-kernel, linux-mm

On Thu, May 08, 2008 at 08:51:22AM -0700, Dave Hansen wrote:
> Is there anything in your dmesg?

mm/memory.c:127: bad pmd ffff810076801040(80000000720000e7).

> There was a discussion on LKML in the last couple of days about
> pmd_bad() triggering on huge pages.  Perhaps we're clearing the mapping
> with the pmd_none_or_clear_bad(), and *THAT* is leaking the page.

That makes sense. I remember that explicitly munmapping the huge page
would still work, but it doesn't. I don't quite remember what I did back
then to test this, but I probably made some mistake there that led me to
some false conclusions.

--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning

From: Nishanth Aravamudan @ 2008-05-08 16:33 UTC
To: Hans Rosenfeld
Cc: Dave Hansen, Hugh Dickins, Ingo Molnar, Jeff Chua, Thomas Gleixner,
    H. Peter Anvin, Gabriel C, Arjan van de Ven, linux-kernel, linux-mm

On 08.05.2008 [18:19:25 +0200], Hans Rosenfeld wrote:
> On Thu, May 08, 2008 at 08:51:22AM -0700, Dave Hansen wrote:
> > Is there anything in your dmesg?
>
> mm/memory.c:127: bad pmd ffff810076801040(80000000720000e7).
>
> > There was a discussion on LKML in the last couple of days about
> > pmd_bad() triggering on huge pages.  Perhaps we're clearing the mapping
> > with the pmd_none_or_clear_bad(), and *THAT* is leaking the page.
>
> That makes sense. I remember that explicitly munmapping the huge page
> would still work, but it doesn't. I don't quite remember what I did back
> then to test this, but I probably made some mistake there that led me to
> some false conclusions.

So this seems to lend credence to Dave's hypothesis. Without, as you
were trying before, teaching pagemap all about hugepages, what are our
options?

Can we just skip over the current iteration of the PMD loop (would we
need something similar for the PTE loop for power?) if pmd_huge(pmd)?

Thanks,
Nish

--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
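[Editor's note: to make Nish's question concrete, the skip he is asking about
would sit inside walk_pmd_range() along the lines of the sketch below.  This
is illustrative only, not something posted in the thread, and, as Hugh points
out further down, pmd_huge() only ever returns true on x86, so it would not
be a portable fix.]

		next = pmd_addr_end(addr, end);
		if (pmd_huge(*pmd)) {
			/* Sketch: report the huge mapping as a hole for now,
			 * rather than letting pmd_none_or_clear_bad() wipe it. */
			if (walk->pte_hole)
				err = walk->pte_hole(addr, next, private);
			if (err)
				break;
			continue;
		}
		if (pmd_none_or_clear_bad(pmd)) {
			...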
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning

From: Hans Rosenfeld @ 2008-05-08 16:51 UTC
To: Nishanth Aravamudan
Cc: Dave Hansen, Hugh Dickins, Ingo Molnar, Jeff Chua, Thomas Gleixner,
    H. Peter Anvin, Gabriel C, Arjan van de Ven, linux-kernel, linux-mm

On Thu, May 08, 2008 at 09:33:52AM -0700, Nishanth Aravamudan wrote:
> So this seems to lend credence to Dave's hypothesis. Without, as you
> were trying before, teaching pagemap all about hugepages, what are our
> options?
>
> Can we just skip over the current iteration of the PMD loop (would we
> need something similar for the PTE loop for power?) if pmd_huge(pmd)?

Allowing huge pages in the page walker would affect both walk_pmd_range
and walk_pud_range. Then either the users of the page walker need to
know how to handle huge pages themselves (in the pmd_entry and pud_entry
callback functions), or the page walker treats huge pages as any other
pages (calling the pte_entry callback function).

--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning

From: Nishanth Aravamudan @ 2008-05-08 17:16 UTC
To: Hans Rosenfeld
Cc: Dave Hansen, Hugh Dickins, Ingo Molnar, Jeff Chua, Thomas Gleixner,
    H. Peter Anvin, Gabriel C, Arjan van de Ven, linux-kernel, linux-mm

On 08.05.2008 [18:51:11 +0200], Hans Rosenfeld wrote:
> On Thu, May 08, 2008 at 09:33:52AM -0700, Nishanth Aravamudan wrote:
> > So this seems to lend credence to Dave's hypothesis. Without, as you
> > were trying before, teaching pagemap all about hugepages, what are our
> > options?
> >
> > Can we just skip over the current iteration of the PMD loop (would we
> > need something similar for the PTE loop for power?) if pmd_huge(pmd)?
>
> Allowing huge pages in the page walker would affect both walk_pmd_range
> and walk_pud_range. Then either the users of the page walker need to
> know how to handle huge pages themselves (in the pmd_entry and pud_entry
> callback functions), or the page walker treats huge pages as any other
> pages (calling the pte_entry callback function).

Right, I agree *if* we allow huge pages in the walker. But AIUI, things
are broken now with hugepages in the process' address space. This is a
bug upstream and leads to hugepages leaking out of the kernel when
/proc/pid/pagemap is read.

Why not, instead (as a short-term fix), skip hugepage mappings
altogether in the page-walker code?

Hrm, upon further investigation, this seems to be a pretty clear
limitation of walk_page_range(). One that is avoided in the other two
callers, i.e.

static int show_smap(struct seq_file *m, void *v)
{
...
	if (vma->vm_mm && !is_vm_hugetlb_page(vma))
		walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
				&smaps_walk, &mss);
...
}

static ssize_t clear_refs_write(struct file *file, const char __user *buf,
				size_t count, loff_t *ppos)
{
...
	for (vma = mm->mmap; vma; vma = vma->vm_next)
		if (!is_vm_hugetlb_page(vma))
			walk_page_range(mm, vma->vm_start, vma->vm_end,
					&clear_refs_walk, vma);
...
}

No such protection exists for

static ssize_t pagemap_read(struct file *file, char __user *buf,
			    size_t count, loff_t *ppos);

So, is there any way to either add an is_vm_hugetlb_page(vma) check into
pagemap_read()? Or can we modify walk_page_range to take a vma and skip
the walking if is_vm_hugetlb_page(vma) is set [to avoid complications
down the road until hugepage walking is fixed]. I guess the latter isn't
possible for pagemap_read(), since we are just looking at arbitrary
addresses in the process space?

Dunno, seems quite clear that the bug is in pagemap_read(), not any
hugepage code, and that the simplest fix is to make pagemap_read() do
what the other walker-callers do, and skip hugepage regions.

Thanks,
Nish

--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning

From: Dave Hansen @ 2008-05-08 18:42 UTC
To: Nishanth Aravamudan
Cc: Hans Rosenfeld, Hugh Dickins, Ingo Molnar, Jeff Chua, Thomas Gleixner,
    H. Peter Anvin, Gabriel C, Arjan van de Ven, linux-kernel, linux-mm

On Thu, 2008-05-08 at 10:16 -0700, Nishanth Aravamudan wrote:
>
> Dunno, seems quite clear that the bug is in pagemap_read(), not any
> hugepage code, and that the simplest fix is to make pagemap_read() do
> what the other walker-callers do, and skip hugepage regions.

Agreed, this certainly isn't a huge page bug.

But, I do think it is absolutely insane to have pmd_clear_bad() going
after perfectly good hugetlb pmds.  The way it is set up now, people are
bound to miss the hugetlb pages because just about every single
pagetable walk has to be specially coded to handle or avoid them.  We
obviously missed it, here, and we had two good examples in the same
file! :)

-- Dave
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning

From: Hugh Dickins @ 2008-05-08 18:58 UTC
To: Dave Hansen
Cc: Nishanth Aravamudan, Hans Rosenfeld, Ingo Molnar, Jeff Chua,
    Thomas Gleixner, H. Peter Anvin, Gabriel C, Arjan van de Ven,
    linux-kernel, linux-mm

On Thu, 8 May 2008, Dave Hansen wrote:
>
> But, I do think it is absolutely insane to have pmd_clear_bad() going
> after perfectly good hugetlb pmds.  The way it is set up now, people are
> bound to miss the hugetlb pages because just about every single
> pagetable walk has to be specially coded to handle or avoid them.  We
> obviously missed it, here, and we had two good examples in the same
> file! :)

Like it or not, the pgd/pud/pmd/pte hierarchy cannot be assumed once
you're amongst hugepages.  What happens varies from architecture to
architecture.  Perhaps the hugepage specialists could look at what in
fact the different architectures we know today are doing, and come up
with a better abstraction to encompass them all.  But it's simply wrong
for a "generic" pagewalker to be going blindly in there.

Two good examples in the same file??

Hugh
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning

From: Dave Hansen @ 2008-05-08 19:06 UTC
To: Hugh Dickins
Cc: Nishanth Aravamudan, Hans Rosenfeld, Ingo Molnar, Jeff Chua,
    Thomas Gleixner, H. Peter Anvin, Gabriel C, Arjan van de Ven,
    linux-kernel, linux-mm

On Thu, 2008-05-08 at 19:58 +0100, Hugh Dickins wrote:
> But it's
> simply wrong for a "generic" pagewalker to be going blindly in there.
>
> Two good examples in the same file??

I was just noting that the other two pagewalking users did the right
vm_huge..() check, and we missed it in the third pagewalker user.

-- Dave
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning

From: Hugh Dickins @ 2008-05-08 18:48 UTC
To: Nishanth Aravamudan
Cc: Hans Rosenfeld, Dave Hansen, Ingo Molnar, Jeff Chua, Thomas Gleixner,
    H. Peter Anvin, Gabriel C, Arjan van de Ven, Matt Mackall,
    linux-kernel, linux-mm

On Thu, 8 May 2008, Nishanth Aravamudan wrote:
>
> So, is there any way to either add an is_vm_hugetlb_page(vma) check into
> pagemap_read()? Or can we modify walk_page_range to take a vma and
> skip the walking if is_vm_hugetlb_page(vma) is set [to avoid
> complications down the road until hugepage walking is fixed]. I guess
> the latter isn't possible for pagemap_read(), since we are just looking
> at arbitrary addresses in the process space?
>
> Dunno, seems quite clear that the bug is in pagemap_read(), not any
> hugepage code, and that the simplest fix is to make pagemap_read() do
> what the other walker-callers do, and skip hugepage regions.

Yes, I'm afraid it needs an is_vm_hugetlb_page(vma) in there somehow:
as you observe, that's what everything else uses to avoid huge issues.

A pmd_huge(*pmd) test is tempting, but it only ever says "yes" on x86:
we've carefully left it undefined what happens to the pgd/pud/pmd/pte
hierarchy in the general arch case, once you're amongst hugepages.

Might follow_huge_addr() be helpful, to avoid the need for a vma?
Perhaps, but my reading is that actually we've never really been
testing that path's success case (because get_user_pages already
skipped is_vm_hugetlb_page), so it might hold further surprises
on one architecture or another.

Many thanks to Hans for persisting, and pointing us to pagemap
to explain this hugepage leak: yes, the pmd_none_or_clear_bad
will be losing it - and corrupting target user address space.

Cc'ed Matt: he may have a view on what he wants his pagewalker
to do with hugepages: I fear it would differ from one usage to
another.  Skip over them has to be safest, though not ideal.

Hugh
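[Editor's note: one purely illustrative sketch of the kind of guard Hugh is
describing for pagemap_read() -- not something posted in this thread -- would
be to consult the VMA before walking each chunk and report hugetlb ranges as
holes rather than walking them.  It uses the pre-patch walk_page_range()
signature shown earlier and glosses over range-boundary handling and locking
(it assumes mmap_sem is already held at this point).]

	/* Sketch only: inside pagemap_read()'s per-chunk loop. */
	struct vm_area_struct *vma = find_vma(mm, start_vaddr);

	if (vma && is_vm_hugetlb_page(vma) && vma->vm_start <= start_vaddr) {
		/* treat the hugetlb range as a hole instead of walking it */
		ret = pagemap_pte_hole(start_vaddr,
				       min(vma->vm_end, end_vaddr), &pm);
	} else {
		ret = walk_page_range(mm, start_vaddr, end_vaddr,
				      &pagemap_walk, &pm);
	}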
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning

From: Matt Mackall @ 2008-05-08 19:49 UTC
To: Hugh Dickins
Cc: Nishanth Aravamudan, Hans Rosenfeld, Dave Hansen, Ingo Molnar, Jeff Chua,
    Thomas Gleixner, H. Peter Anvin, Gabriel C, Arjan van de Ven,
    linux-kernel, linux-mm

On Thu, 2008-05-08 at 19:48 +0100, Hugh Dickins wrote:
> On Thu, 8 May 2008, Nishanth Aravamudan wrote:
> >
> > So, is there any way to either add an is_vm_hugetlb_page(vma) check into
> > pagemap_read()? Or can we modify walk_page_range to take a vma and
> > skip the walking if is_vm_hugetlb_page(vma) is set [to avoid
> > complications down the road until hugepage walking is fixed]. I guess
> > the latter isn't possible for pagemap_read(), since we are just looking
> > at arbitrary addresses in the process space?
> >
> > Dunno, seems quite clear that the bug is in pagemap_read(), not any
> > hugepage code, and that the simplest fix is to make pagemap_read() do
> > what the other walker-callers do, and skip hugepage regions.
>
> Yes, I'm afraid it needs an is_vm_hugetlb_page(vma) in there somehow:
> as you observe, that's what everything else uses to avoid huge issues.
>
> A pmd_huge(*pmd) test is tempting, but it only ever says "yes" on x86:
> we've carefully left it undefined what happens to the pgd/pud/pmd/pte
> hierarchy in the general arch case, once you're amongst hugepages.
>
> Might follow_huge_addr() be helpful, to avoid the need for a vma?
> Perhaps, but my reading is that actually we've never really been
> testing that path's success case (because get_user_pages already
> skipped is_vm_hugetlb_page), so it might hold further surprises
> on one architecture or another.
>
> Many thanks to Hans for persisting, and pointing us to pagemap
> to explain this hugepage leak: yes, the pmd_none_or_clear_bad
> will be losing it - and corrupting target user address space.

Ouch.

> Cc'ed Matt: he may have a view on what he wants his pagewalker
> to do with hugepages: I fear it would differ from one usage to
> another.  Skip over them has to be safest, though not ideal.

There are folks who want to be able to get information about hugepages
from pagemap. Thanks to Hans, there's room in the output format for it.

I'd gone to some lengths to pull VMAs out of the picture as it's quite
ugly to have to simultaneously walk VMAs and pagetables. But I may have
to concede that living with hugepages requires it.

--
Mathematics is the supreme nostalgia of our time.
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning

From: Dave Hansen @ 2008-05-08 20:08 UTC
To: Matt Mackall
Cc: Hugh Dickins, Nishanth Aravamudan, Hans Rosenfeld, Ingo Molnar, Jeff Chua,
    Thomas Gleixner, H. Peter Anvin, Gabriel C, Arjan van de Ven,
    linux-kernel, linux-mm

On Thu, 2008-05-08 at 14:49 -0500, Matt Mackall wrote:
> I'd gone to some lengths to pull VMAs out of the picture as it's quite
> ugly to have to simultaneously walk VMAs and pagetables. But I may have
> to concede that living with hugepages requires it.

Yeah, it will definitely change the way that we have to do the
pagetable walk.  Should we just pass the mm around and make anyone that
really wants to get the VMAs do the lookup themselves?  Or, should we
just provide the VMA?

I'll start with just the mm and see where it goes...

-- Dave
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning

From: Hans Rosenfeld @ 2008-05-08 20:02 UTC
To: Hugh Dickins
Cc: Nishanth Aravamudan, Dave Hansen, Ingo Molnar, Jeff Chua, Thomas Gleixner,
    H. Peter Anvin, Gabriel C, Arjan van de Ven, Matt Mackall,
    linux-kernel, linux-mm

On Thu, May 08, 2008 at 07:48:51PM +0100, Hugh Dickins wrote:
> > Dunno, seems quite clear that the bug is in pagemap_read(), not any
> > hugepage code, and that the simplest fix is to make pagemap_read() do
> > what the other walker-callers do, and skip hugepage regions.
>
> Yes, I'm afraid it needs an is_vm_hugetlb_page(vma) in there somehow:
> as you observe, that's what everything else uses to avoid huge issues.
>
> A pmd_huge(*pmd) test is tempting, but it only ever says "yes" on x86:
> we've carefully left it undefined what happens to the pgd/pud/pmd/pte
> hierarchy in the general arch case, once you're amongst hugepages.

AFAIK the reason for this is that pmd_huge() and pud_huge() are
completely x86-specific. When I looked at the huge page support for
other archs in Linux the last time, all of them marked hugepages with
some page size bits in the PTE, using several PTEs for a single huge
page. So for anything but x86, the pgd/pud/pmd/pte hierarchy should work
for hugepages, too.

> Cc'ed Matt: he may have a view on what he wants his pagewalker
> to do with hugepages: I fear it would differ from one usage to
> another.  Skip over them has to be safest, though not ideal.

Proper hugepage support in /proc/pid/pagemap would be very helpful for
certain kinds of performance analysis, like TLB usage, where you need
page size information when you have a TLB for each supported page size.
That was the reason why I was working on supporting huge pages in
/proc/pid/pagemap in the first place.

--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning

From: Dave Hansen @ 2008-05-08 20:16 UTC
To: Hans Rosenfeld
Cc: Hugh Dickins, Nishanth Aravamudan, Ingo Molnar, Jeff Chua, Thomas Gleixner,
    H. Peter Anvin, Gabriel C, Arjan van de Ven, Matt Mackall,
    linux-kernel, linux-mm

On Thu, 2008-05-08 at 22:02 +0200, Hans Rosenfeld wrote:
> > A pmd_huge(*pmd) test is tempting, but it only ever says "yes" on x86:
> > we've carefully left it undefined what happens to the pgd/pud/pmd/pte
> > hierarchy in the general arch case, once you're amongst hugepages.
>
> AFAIK the reason for this is that pmd_huge() and pud_huge() are
> completely x86-specific. When I looked at the huge page support for
> other archs in Linux the last time, all of them marked hugepages with
> some page size bits in the PTE, using several PTEs for a single huge
> page. So for anything but x86, the pgd/pud/pmd/pte hierarchy should work
> for hugepages, too.

powerpc kinda puts them in pmds, although Adam calls them ptes in his
diagram.  See Adam's very nice pictures here:

	http://linux-mm.org/PageTableStructure

In the arch code, they have a concept of "slices" for each mm that you
can look up the page size for.  That's what they use when the mm/vmas
aren't around.  Their pmd_ts really are just pointers.  I don't think
they have any flags in them at all like _PAGE_PSE.  They just do a
special pagetable walk instead of looking *at* the pagetables.

-- Dave
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 20:02             ` Hans Rosenfeld
  2008-05-08 20:16               ` Dave Hansen
@ 2008-05-08 23:15               ` Dave Hansen
  2008-05-14 19:01                 ` Matt Mackall
  2008-05-09  9:03               ` Paul Mundt
  2 siblings, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2008-05-08 23:15 UTC (permalink / raw)
  To: Hans Rosenfeld
  Cc: Hugh Dickins, Nishanth Aravamudan, Ingo Molnar, Jeff Chua,
      Thomas Gleixner, H. Peter Anvin, Gabriel C, Arjan van de Ven,
      Matt Mackall, linux-kernel, linux-mm

Here's one quick stab at a solution.  I figured that we already pass
that 'private' variable around.  This patch just sticks that variable
*in* the mm_walk and also makes the caller fill in an 'mm' as well.
Then, we just pass the actual mm_walk around.

Maybe we should just stick the VMA in the mm_walk as well, and have the
common code keep it up to date with the addresses currently being
walked.

Sadly, I didn't quite get enough time to flesh this idea out very far
today, and I'll be offline for a couple of days now.  But, if someone
wants to go this route, I thought this might be useful.

---

 linux-2.6.git-dave/fs/proc/task_mmu.c |   45 +++++++++++++++++++---------------
 linux-2.6.git-dave/include/linux/mm.h |   17 ++++++------
 linux-2.6.git-dave/mm/pagewalk.c      |   41 +++++++++++++++---------------
 3 files changed, 56 insertions(+), 47 deletions(-)

diff -puN mm/pagewalk.c~pass-mm-into-pagewalkers mm/pagewalk.c
--- linux-2.6.git/mm/pagewalk.c~pass-mm-into-pagewalkers	2008-05-08 15:49:47.000000000 -0700
+++ linux-2.6.git-dave/mm/pagewalk.c	2008-05-08 15:49:54.000000000 -0700
@@ -3,14 +3,14 @@
 #include <linux/sched.h>
 
 static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
-			  const struct mm_walk *walk, void *private)
+			  struct mm_walk *walk)
 {
 	pte_t *pte;
 	int err = 0;
 
 	pte = pte_offset_map(pmd, addr);
 	for (;;) {
-		err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, private);
+		err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
 		if (err)
 			break;
 		addr += PAGE_SIZE;
@@ -24,7 +24,7 @@ static int walk_pte_range(pmd_t *pmd, un
 }
 
 static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
-			  const struct mm_walk *walk, void *private)
+			  struct mm_walk *walk)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -35,15 +35,15 @@ static int walk_pmd_range(pud_t *pud, un
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_clear_bad(pmd)) {
 			if (walk->pte_hole)
-				err = walk->pte_hole(addr, next, private);
+				err = walk->pte_hole(addr, next, walk);
 			if (err)
 				break;
 			continue;
 		}
 		if (walk->pmd_entry)
-			err = walk->pmd_entry(pmd, addr, next, private);
+			err = walk->pmd_entry(pmd, addr, next, walk);
 		if (!err && walk->pte_entry)
-			err = walk_pte_range(pmd, addr, next, walk, private);
+			err = walk_pte_range(pmd, addr, next, walk);
 		if (err)
 			break;
 	} while (pmd++, addr = next, addr != end);
@@ -52,7 +52,7 @@ static int walk_pmd_range(pud_t *pud, un
 }
 
 static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
-			  const struct mm_walk *walk, void *private)
+			  struct mm_walk *walk)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -63,15 +63,15 @@ static int walk_pud_range(pgd_t *pgd, un
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud)) {
 			if (walk->pte_hole)
-				err = walk->pte_hole(addr, next, private);
+				err = walk->pte_hole(addr, next, walk);
 			if (err)
 				break;
 			continue;
 		}
 		if (walk->pud_entry)
-			err = walk->pud_entry(pud, addr, next, private);
+			err = walk->pud_entry(pud, addr, next, walk);
 		if (!err && (walk->pmd_entry || walk->pte_entry))
-			err = walk_pmd_range(pud, addr, next, walk, private);
+			err = walk_pmd_range(pud, addr, next, walk);
 		if (err)
 			break;
 	} while (pud++, addr = next, addr != end);
@@ -85,15 +85,15 @@ static int walk_pud_range(pgd_t *pgd, un
  * @addr: starting address
  * @end: ending address
  * @walk: set of callbacks to invoke for each level of the tree
- * @private: private data passed to the callback function
  *
  * Recursively walk the page table for the memory area in a VMA,
  * calling supplied callbacks. Callbacks are called in-order (first
  * PGD, first PUD, first PMD, first PTE, second PTE... second PMD,
  * etc.). If lower-level callbacks are omitted, walking depth is reduced.
  *
- * Each callback receives an entry pointer, the start and end of the
- * associated range, and a caller-supplied private data pointer.
+ * Each callback receives an entry pointer and the start and end of the
+ * associated range, and a copy of the original mm_walk for access to
+ * the ->private or ->mm fields.
  *
  * No locks are taken, but the bottom level iterator will map PTE
  * directories from highmem if necessary.
@@ -101,9 +101,8 @@ static int walk_pud_range(pgd_t *pgd, un
  * If any callback returns a non-zero value, the walk is aborted and
  * the return value is propagated back to the caller. Otherwise 0 is returned.
  */
-int walk_page_range(const struct mm_struct *mm,
-		    unsigned long addr, unsigned long end,
-		    const struct mm_walk *walk, void *private)
+int walk_page_range(unsigned long addr, unsigned long end,
+		    struct mm_walk *walk)
 {
 	pgd_t *pgd;
 	unsigned long next;
@@ -112,21 +111,23 @@ int walk_page_range(const struct mm_stru
 	if (addr >= end)
 		return err;
 
-	pgd = pgd_offset(mm, addr);
+	if (!walk->mm)
+		return -EINVAL;
+	pgd = pgd_offset(walk->mm, addr);
 	do {
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd)) {
 			if (walk->pte_hole)
-				err = walk->pte_hole(addr, next, private);
+				err = walk->pte_hole(addr, next, walk);
 			if (err)
 				break;
 			continue;
 		}
 		if (walk->pgd_entry)
-			err = walk->pgd_entry(pgd, addr, next, private);
+			err = walk->pgd_entry(pgd, addr, next, walk);
 		if (!err &&
 		    (walk->pud_entry || walk->pmd_entry || walk->pte_entry))
-			err = walk_pud_range(pgd, addr, next, walk, private);
+			err = walk_pud_range(pgd, addr, next, walk);
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
diff -puN include/linux/mm.h~pass-mm-into-pagewalkers include/linux/mm.h
--- linux-2.6.git/include/linux/mm.h~pass-mm-into-pagewalkers	2008-05-08 15:49:47.000000000 -0700
+++ linux-2.6.git-dave/include/linux/mm.h	2008-05-08 15:49:54.000000000 -0700
@@ -760,16 +760,17 @@ unsigned long unmap_vmas(struct mmu_gath
  * (see walk_page_range for more details)
  */
 struct mm_walk {
-	int (*pgd_entry)(pgd_t *, unsigned long, unsigned long, void *);
-	int (*pud_entry)(pud_t *, unsigned long, unsigned long, void *);
-	int (*pmd_entry)(pmd_t *, unsigned long, unsigned long, void *);
-	int (*pte_entry)(pte_t *, unsigned long, unsigned long, void *);
-	int (*pte_hole)(unsigned long, unsigned long, void *);
+	int (*pgd_entry)(pgd_t *, unsigned long, unsigned long, struct mm_walk *);
+	int (*pud_entry)(pud_t *, unsigned long, unsigned long, struct mm_walk *);
+	int (*pmd_entry)(pmd_t *, unsigned long, unsigned long, struct mm_walk *);
+	int (*pte_entry)(pte_t *, unsigned long, unsigned long, struct mm_walk *);
+	int (*pte_hole)(unsigned long, unsigned long, struct mm_walk *);
+	struct mm_struct *mm;
+	void *private;
 };
 
-int walk_page_range(const struct mm_struct *, unsigned long addr,
-		    unsigned long end, const struct mm_walk *walk,
-		    void *private);
+int walk_page_range(unsigned long addr, unsigned long end,
+		    struct mm_walk *walk);
 void free_pgd_range(struct mmu_gather **tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
 void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
diff -puN fs/proc/task_mmu.c~pass-mm-into-pagewalkers fs/proc/task_mmu.c
--- linux-2.6.git/fs/proc/task_mmu.c~pass-mm-into-pagewalkers	2008-05-08 15:49:47.000000000 -0700
+++ linux-2.6.git-dave/fs/proc/task_mmu.c	2008-05-08 15:49:59.000000000 -0700
@@ -317,9 +317,9 @@ struct mem_size_stats {
 };
 
 static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
-			   void *private)
+			   struct mm_walk *walk)
 {
-	struct mem_size_stats *mss = private;
+	struct mem_size_stats *mss = walk->private;
 	struct vm_area_struct *vma = mss->vma;
 	pte_t *pte, ptent;
 	spinlock_t *ptl;
@@ -367,19 +367,22 @@ static int smaps_pte_range(pmd_t *pmd, u
 	return 0;
 }
 
-static struct mm_walk smaps_walk = { .pmd_entry = smaps_pte_range };
-
 static int show_smap(struct seq_file *m, void *v)
 {
 	struct vm_area_struct *vma = v;
 	struct mem_size_stats mss;
 	int ret;
+	struct mm_walk smaps_walk = {
+		.pmd_entry = smaps_pte_range,
+		.mm = vma->vm_mm,
+		.private = &mss,
+	};
+
 	memset(&mss, 0, sizeof mss);
 	mss.vma = vma;
 	if (vma->vm_mm && !is_vm_hugetlb_page(vma))
-		walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
-				&smaps_walk, &mss);
+		walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);
 
 	ret = show_map(m, v);
 	if (ret)
@@ -428,9 +431,9 @@ const struct file_operations proc_smaps_
 };
 
 static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
-				unsigned long end, void *private)
+				unsigned long end, struct mm_walk *walk)
 {
-	struct vm_area_struct *vma = private;
+	struct vm_area_struct *vma = walk->private;
 	pte_t *pte, ptent;
 	spinlock_t *ptl;
 	struct page *page;
@@ -454,8 +457,6 @@ static int clear_refs_pte_range(pmd_t *p
 	return 0;
 }
 
-static struct mm_walk clear_refs_walk = { .pmd_entry = clear_refs_pte_range };
-
 static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 				size_t count, loff_t *ppos)
 {
@@ -478,11 +479,17 @@ static ssize_t clear_refs_write(struct f
 		return -ESRCH;
 	mm = get_task_mm(task);
 	if (mm) {
+		static struct mm_walk clear_refs_walk;
+		memset(&clear_refs_walk, 0, sizeof(clear_refs_walk));
+		clear_refs_walk.pmd_entry = clear_refs_pte_range;
+		clear_refs_walk.mm = mm;
 		down_read(&mm->mmap_sem);
-		for (vma = mm->mmap; vma; vma = vma->vm_next)
+		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			clear_refs_walk.private = vma;
 			if (!is_vm_hugetlb_page(vma))
-				walk_page_range(mm, vma->vm_start, vma->vm_end,
-						&clear_refs_walk, vma);
+				walk_page_range(vma->vm_start, vma->vm_end,
+						&clear_refs_walk);
+		}
 		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
@@ -540,9 +547,9 @@ static int add_to_pagemap(unsigned long
 }
 
 static int pagemap_pte_hole(unsigned long start, unsigned long end,
-			    void *private)
+			    struct mm_walk *walk)
 {
-	struct pagemapread *pm = private;
+	struct pagemapread *pm = walk->private;
 	unsigned long addr;
 	int err = 0;
 	for (addr = start; addr < end; addr += PAGE_SIZE) {
@@ -560,9 +567,9 @@ static u64 swap_pte_to_pagemap_entry(pte
 }
 
 static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
-			     void *private)
+			     struct mm_walk *walk)
 {
-	struct pagemapread *pm = private;
+	struct pagemapread *pm = walk->private;
 	pte_t *pte;
 	int err = 0;
 
@@ -687,8 +694,8 @@ static ssize_t pagemap_read(struct file
 	 * user buffer is tracked in "pm", and the walk
 	 * will stop when we hit the end of the buffer.
 	 */
-	ret = walk_page_range(mm, start_vaddr, end_vaddr,
-			      &pagemap_walk, &pm);
+	ret = walk_page_range(start_vaddr, end_vaddr,
+			      &pagemap_walk);
 	if (ret == PM_END_OF_BUFFER)
 		ret = 0;
 	/* don't need mmap_sem for these, but this looks cleaner */
_

-- Dave

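To make the proposed calling convention concrete, here is a minimal sketch of a walker written against Dave's interface: the callback recovers its state from walk->private instead of a bare void *, and the caller fills in ->mm and ->private before starting the walk. The count_state structure and the two functions below are hypothetical, invented only to illustrate the convention; only struct mm_walk and walk_page_range() correspond to the patch above.

/*
 * Hypothetical walker against the proposed interface: count the
 * present PTEs in one VMA.  Illustration only, not part of the patch.
 */
struct count_state {
	unsigned long present;
};

static int count_pte(pte_t *pte, unsigned long addr, unsigned long end,
		     struct mm_walk *walk)
{
	struct count_state *cs = walk->private;	/* was a bare void * before */

	if (pte_present(*pte))
		cs->present++;
	return 0;				/* non-zero would abort the walk */
}

static unsigned long count_present_ptes(struct vm_area_struct *vma)
{
	struct count_state cs = { .present = 0 };
	struct mm_walk walk = {
		.pte_entry	= count_pte,
		.mm		= vma->vm_mm,	/* caller now supplies the mm ... */
		.private	= &cs,		/* ... and the private data */
	};

	if (!is_vm_hugetlb_page(vma))	/* same guard the other callers use */
		walk_page_range(vma->vm_start, vma->vm_end, &walk);
	return cs.present;
}

The same shape shows up in the show_smap() and clear_refs_write() hunks above: the mm_walk is now set up per call, instead of living in a file-scope structure shared by every invocation.
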
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 23:15               ` Dave Hansen
@ 2008-05-14 19:01                 ` Matt Mackall
  0 siblings, 0 replies; 32+ messages in thread
From: Matt Mackall @ 2008-05-14 19:01 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Hans Rosenfeld, Hugh Dickins, Nishanth Aravamudan, Ingo Molnar,
      Jeff Chua, Thomas Gleixner, H. Peter Anvin, Gabriel C,
      Arjan van de Ven, linux-kernel, linux-mm

On Thu, 2008-05-08 at 16:15 -0700, Dave Hansen wrote:
> Here's one quick stab at a solution.  I figured that we already pass
> that 'private' variable around.  This patch just sticks that variable
> *in* the mm_walk and also makes the caller fill in an 'mm' as well.
> Then, we just pass the actual mm_walk around.
>
> Maybe we should just stick the VMA in the mm_walk as well, and have the
> common code keep it up to date with the addresses currently being
> walked.
>
> Sadly, I didn't quite get enough time to flesh this idea out very far
> today, and I'll be offline for a couple of days now.  But, if someone
> wants to go this route, I thought this might be useful.

This much looks reasonable. But the real test of course is to actually
teach it about detecting huge pages efficiently. And I suspect that
means tracking the current VMA all the time in the walk. Am I wrong?

--
Mathematics is the supreme nostalgia of our time.

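In outline, the direction Matt suggests would look something like the sketch below: mm_walk grows a vma field that the core walker keeps pointed at the VMA covering the addresses currently being walked, so hugetlb ranges can be recognised in one place rather than by every caller. The vma field and the helper are assumptions sketched from this discussion, not code from any posted patch.

/*
 * Sketch only: an mm_walk carrying the current VMA, as suggested above.
 * The vma field and walk_in_hugetlb_range() are hypothetical.
 */
struct mm_walk {
	int (*pmd_entry)(pmd_t *, unsigned long, unsigned long,
			 struct mm_walk *);
	int (*pte_entry)(pte_t *, unsigned long, unsigned long,
			 struct mm_walk *);
	int (*pte_hole)(unsigned long, unsigned long, struct mm_walk *);
	struct mm_struct *mm;
	struct vm_area_struct *vma;	/* kept current by the core walker */
	void *private;
};

/*
 * The core walker would consult this before descending below the pmd
 * level, so a hugepage pmd is never handed to pmd_none_or_clear_bad().
 */
static inline int walk_in_hugetlb_range(struct mm_walk *walk)
{
	return walk->vma && is_vm_hugetlb_page(walk->vma);
}
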
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 20:02             ` Hans Rosenfeld
  2008-05-08 20:16               ` Dave Hansen
  2008-05-08 23:15               ` Dave Hansen
@ 2008-05-09  9:03               ` Paul Mundt
  2 siblings, 0 replies; 32+ messages in thread
From: Paul Mundt @ 2008-05-09 9:03 UTC (permalink / raw)
  To: Hans Rosenfeld
  Cc: Hugh Dickins, Nishanth Aravamudan, Dave Hansen, Ingo Molnar,
      Jeff Chua, Thomas Gleixner, H. Peter Anvin, Gabriel C,
      Arjan van de Ven, Matt Mackall, linux-kernel, linux-mm

On Thu, May 08, 2008 at 10:02:39PM +0200, Hans Rosenfeld wrote:
> On Thu, May 08, 2008 at 07:48:51PM +0100, Hugh Dickins wrote:
> > > Dunno, seems quite clear that the bug is in pagemap_read(), not any
> > > hugepage code, and that the simplest fix is to make pagemap_read() do
> > > what the other walker-callers do, and skip hugepage regions.
> >
> > Yes, I'm afraid it needs an is_vm_hugetlb_page(vma) in there somehow:
> > as you observe, that's what everything else uses to avoid huge issues.
> >
> > A pmd_huge(*pmd) test is tempting, but it only ever says "yes" on x86:
> > we've carefully left it undefined what happens to the pgd/pud/pmd/pte
> > hierarchy in the general arch case, once you're amongst hugepages.
>
> AFAIK the reason for this is that pmd_huge() and pud_huge() are
> completely x86-specific. When I looked at the huge page support for
> other archs in Linux the last time, all of them marked hugepages with
> some page size bits in the PTE, using several PTEs for a single huge
> page. So for anything but x86, the pgd/pud/pmd/pte hierarchy should work
> for hugepages, too.
>
s390 also does hugepages at the pmd level, so it's not only x86. And
while it's not an issue today, it's worth noting that ARM also has the
same characteristics for larger sizes. Should someone feel compelled to
implement hugepages there, this will almost certainly come up again --
at least in so far as pmd_huge() is concerned.

At a quick glance, sparc64 also looks like it might need some special
handling in the pagemap case, too..

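For reference, the caller-level guard Hugh refers to in the quoted text is the one the other walk_page_range() users already apply before starting a walk. A minimal sketch of that pattern follows, written against the walk_page_range() signature from Dave's patch earlier in the thread; the function name and loop shape are hypothetical, not the actual pagemap_read() structure.

/*
 * Illustration only: skip hugetlb VMAs before walking, as the smaps
 * and clear_refs callers do; hugetlb areas get their own handling.
 */
static void walk_mm_skipping_hugetlb(struct mm_struct *mm,
				     struct mm_walk *walk)
{
	struct vm_area_struct *vma;

	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (is_vm_hugetlb_page(vma))
			continue;	/* keep the generic walker away from these */
		walk_page_range(vma->vm_start, vma->vm_end, walk);
	}
}
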
* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 16:19           ` Hans Rosenfeld
  2008-05-08 16:33             ` Nishanth Aravamudan
@ 2008-05-08 16:42             ` Dave Hansen
  1 sibling, 0 replies; 32+ messages in thread
From: Dave Hansen @ 2008-05-08 16:42 UTC (permalink / raw)
  To: Hans Rosenfeld
  Cc: Hugh Dickins, Ingo Molnar, Jeff Chua, Thomas Gleixner,
      H. Peter Anvin, Gabriel C, Arjan van de Ven, Nishanth Aravamudan,
      linux-kernel, linux-mm

On Thu, 2008-05-08 at 18:19 +0200, Hans Rosenfeld wrote:
> On Thu, May 08, 2008 at 08:51:22AM -0700, Dave Hansen wrote:
> > Is there anything in your dmesg?
>
> mm/memory.c:127: bad pmd ffff810076801040(80000000720000e7).
> > There was a discussion on LKML in the last couple of days about
> > pmd_bad() triggering on huge pages.  Perhaps we're clearing the mapping
> > with the pmd_none_or_clear_bad(), and *THAT* is leaking the page.
>
> That makes sense. I remember that explicitly munmapping the huge page
> would still work, but it doesn't. I don't quite remember what I did back
> then to test this, but I probably made some mistake there that led me to
> some false conclusions.

I can't see how it would possibly work with the code that we have today,
so I guess it was just a false assumption.

static inline int pmd_none_or_clear_bad(pmd_t *pmd)
{
	if (pmd_none(*pmd))
		return 1;
	if (unlikely(pmd_bad(*pmd))) {
		pmd_clear_bad(pmd);
		return 1;
	}
	return 0;
}

void pmd_clear_bad(pmd_t *pmd)
{
	pmd_ERROR(*pmd);
	pmd_clear(pmd);
}

That pmd_clear() will simply zero out the pmd and leak the page.

Sounds like Linus had the right idea:

> I'd much rather have pmd_bad() etc fixed up instead, so that they do a
> more proper test (not thinking that a PSE page is bad, since it clearly
> isn't).  And then, make them dependent on DEBUG_VM, because doing the
> proper test will be more expensive.

-- Dave

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-08 14:52         ` Dave Hansen
  2008-05-08 15:11           ` Hans Rosenfeld
@ 2008-05-08 15:44           ` Nishanth Aravamudan
  1 sibling, 0 replies; 32+ messages in thread
From: Nishanth Aravamudan @ 2008-05-08 15:44 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Hans Rosenfeld, Hugh Dickins, Ingo Molnar, Jeff Chua,
      Thomas Gleixner, H. Peter Anvin, Gabriel C, Arjan van de Ven,
      linux-kernel, linux-mm

On 08.05.2008 [07:52:30 -0700], Dave Hansen wrote:
> On Thu, 2008-05-08 at 16:34 +0200, Hans Rosenfeld wrote:
> > While trying to reproduce this, I noticed that the huge page wouldn't
> > leak when I just mmapped it and exited without explicitly unmapping, as
> > I described before. The huge page is leaked only when the
> > /proc/self/pagemap entry for the huge page is read.
>
> Well, that's an interesting data point! :)

Indeed it is! Would explain why we've not seen any leaks until now.

I will try to investigate, as well.

Thanks,
Nish

--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-06 19:49 ` [PATCH] x86: fix PAE pmd_bad bootup warning Hugh Dickins
  2008-05-06 20:06   ` Linus Torvalds
  2008-05-06 20:22   ` Hans Rosenfeld
@ 2008-05-07  4:40   ` Jeff Chua
  2008-05-07  5:30     ` Hugh Dickins
  2 siblings, 1 reply; 32+ messages in thread
From: Jeff Chua @ 2008-05-07 4:40 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Hans Rosenfeld, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
      Gabriel C, Arjan van de Ven, Nishanth Aravamudan, linux-kernel,
      linux-mm

On Wed, May 7, 2008 at 3:49 AM, Hugh Dickins <hugh@veritas.com> wrote:
> Signed-off-by: Hugh Dickins <hugh@veritas.com>

Hugh,

Thanks for the patch. It's in Linus's git now and I just booted 2 Dell
2950 and both are no showing the errors.

Thanks,
Jeff.

* Re: [PATCH] x86: fix PAE pmd_bad bootup warning
  2008-05-07  4:40   ` Jeff Chua
@ 2008-05-07  5:30     ` Hugh Dickins
  0 siblings, 0 replies; 32+ messages in thread
From: Hugh Dickins @ 2008-05-07 5:30 UTC (permalink / raw)
  To: Jeff Chua
  Cc: Hans Rosenfeld, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
      Gabriel C, Arjan van de Ven, Nishanth Aravamudan, linux-kernel,
      linux-mm

On Wed, 7 May 2008, Jeff Chua wrote:
>
> Thanks for the patch. It's in Linus's git now and I just booted 2 Dell

Thanks for reporting and reporting back: from your contentment ...

> 2950 and both are no showing the errors.
                      ^
... I'll assume the missing letter is "t" rather than "w" ;)

Hugh

end of thread, other threads:[~2008-05-14 19:01 UTC | newest]
Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <b6a2187b0805051806v25fa1272xb08e0b70b9c3408@mail.gmail.com>
[not found] ` <20080506124946.GA2146@elte.hu>
[not found] ` <Pine.LNX.4.64.0805061435510.32567@blonde.site>
[not found] ` <alpine.LFD.1.10.0805061138580.32269@woody.linux-foundation.org>
2008-05-06 19:49 ` [PATCH] x86: fix PAE pmd_bad bootup warning Hugh Dickins
2008-05-06 20:06 ` Linus Torvalds
2008-05-06 20:30 ` Hugh Dickins
2008-05-08 16:07 ` Nishanth Aravamudan
2008-05-06 20:22 ` Hans Rosenfeld
2008-05-06 20:36 ` Hugh Dickins
2008-05-07 23:39 ` Nishanth Aravamudan
2008-05-06 20:42 ` Dave Hansen
2008-05-08 14:34 ` Hans Rosenfeld
2008-05-08 14:39 ` Hans Rosenfeld
2008-05-08 14:52 ` Dave Hansen
2008-05-08 15:11 ` Hans Rosenfeld
2008-05-08 15:51 ` Dave Hansen
2008-05-08 16:19 ` Hans Rosenfeld
2008-05-08 16:33 ` Nishanth Aravamudan
2008-05-08 16:51 ` Hans Rosenfeld
2008-05-08 17:16 ` Nishanth Aravamudan
2008-05-08 18:42 ` Dave Hansen
2008-05-08 18:58 ` Hugh Dickins
2008-05-08 19:06 ` Dave Hansen
2008-05-08 18:48 ` Hugh Dickins
2008-05-08 19:49 ` Matt Mackall
2008-05-08 20:08 ` Dave Hansen
2008-05-08 20:02 ` Hans Rosenfeld
2008-05-08 20:16 ` Dave Hansen
2008-05-08 23:15 ` Dave Hansen
2008-05-14 19:01 ` Matt Mackall
2008-05-09 9:03 ` Paul Mundt
2008-05-08 16:42 ` Dave Hansen
2008-05-08 15:44 ` Nishanth Aravamudan
2008-05-07 4:40 ` Jeff Chua
2008-05-07 5:30 ` Hugh Dickins