* [PATCH] mm: fix possible cause of a page_mapped BUG @ 2011-02-24 5:39 Hugh Dickins 2011-02-28 23:35 ` Robert Święcki 0 siblings, 1 reply; 37+ messages in thread From: Hugh Dickins @ 2011-02-24 5:39 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Robert Swiecki, Miklos Szeredi, linux-kernel, linux-mm Robert Swiecki reported a BUG_ON(page_mapped) from a fuzzer, punching a hole with madvise(,, MADV_REMOVE). That path is under mutex, and cannot be explained by lack of serialization in unmap_mapping_range(). Reviewing the code, I found one place where vm_truncate_count handling should have been updated, when I switched at the last minute from one way of managing the restart_addr to another: mremap move changes the virtual addresses, so it ought to adjust the restart_addr. But rather than exporting the notion of restart_addr from memory.c, or converting to restart_pgoff throughout, simply reset vm_truncate_count to 0 to force a rescan if mremap move races with preempted truncation. We have no confirmation that this fixes Robert's BUG, but it is a fix that's worth making anyway. Signed-off-by: Hugh Dickins <hughd@google.com> --- mm/mremap.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) --- 2.6.38-rc6/mm/mremap.c 2011-01-18 22:04:56.000000000 -0800 +++ linux/mm/mremap.c 2011-02-23 15:29:52.000000000 -0800 @@ -94,9 +94,7 @@ static void move_ptes(struct vm_area_str */ mapping = vma->vm_file->f_mapping; spin_lock(&mapping->i_mmap_lock); - if (new_vma->vm_truncate_count && - new_vma->vm_truncate_count != vma->vm_truncate_count) - new_vma->vm_truncate_count = 0; + new_vma->vm_truncate_count = 0; } /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-02-24 5:39 [PATCH] mm: fix possible cause of a page_mapped BUG Hugh Dickins @ 2011-02-28 23:35 ` Robert Święcki 2011-03-17 15:40 ` Robert Święcki 0 siblings, 1 reply; 37+ messages in thread From: Robert Święcki @ 2011-02-28 23:35 UTC (permalink / raw) To: Hugh Dickins Cc: Andrew Morton, Linus Torvalds, Miklos Szeredi, linux-kernel, linux-mm > But rather than exporting the notion of restart_addr from memory.c, or > converting to restart_pgoff throughout, simply reset vm_truncate_count > to 0 to force a rescan if mremap move races with preempted truncation. > > We have no confirmation that this fixes Robert's BUG, > but it is a fix that's worth making anyway. Hi, I don't have currently access to my rs232/console testing machine (lame excuse but it helps a lot;), cause I'm working currently OOtO, so I'll try to test it asap - which is probably Mar 15th or so. Btw, the fuzzer is here: http://code.google.com/p/iknowthis/ I think i was trying it with this revision: http://code.google.com/p/iknowthis/source/detail?r=11 (i386 mode, newest 'iknowthis' supports x86-64 natively), so feel free to try it. It used to crash the machine (it's BUG_ON but the system became unusable) in matter of hours. Btw, when I was testing it for the last time it Ooopsed much more frequently in proc_readdir (I sent report in one of earliet e-mails). > Signed-off-by: Hugh Dickins <hughd@google.com> > --- > > mm/mremap.c | 4 +--- > 1 file changed, 1 insertion(+), 3 deletions(-) > > --- 2.6.38-rc6/mm/mremap.c 2011-01-18 22:04:56.000000000 -0800 > +++ linux/mm/mremap.c 2011-02-23 15:29:52.000000000 -0800 > @@ -94,9 +94,7 @@ static void move_ptes(struct vm_area_str > */ > mapping = vma->vm_file->f_mapping; > spin_lock(&mapping->i_mmap_lock); > - if (new_vma->vm_truncate_count && > - new_vma->vm_truncate_count != vma->vm_truncate_count) > - new_vma->vm_truncate_count = 0; > + new_vma->vm_truncate_count = 0; > } > > /* > -- Robert Święcki -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-02-28 23:35 ` Robert Święcki @ 2011-03-17 15:40 ` Robert Święcki 2011-03-19 5:34 ` Hugh Dickins 0 siblings, 1 reply; 37+ messages in thread From: Robert Święcki @ 2011-03-17 15:40 UTC (permalink / raw) To: Hugh Dickins Cc: Andrew Morton, Linus Torvalds, Miklos Szeredi, linux-kernel, linux-mm Hi, On Tue, Mar 1, 2011 at 12:35 AM, Robert Święcki <robert@swiecki.net> wrote: >> But rather than exporting the notion of restart_addr from memory.c, or >> converting to restart_pgoff throughout, simply reset vm_truncate_count >> to 0 to force a rescan if mremap move races with preempted truncation. >> >> We have no confirmation that this fixes Robert's BUG, >> but it is a fix that's worth making anyway. > > Hi, I don't have currently access to my rs232/console testing machine > (lame excuse but it helps a lot;), cause I'm working currently OOtO, > so I'll try to test it asap - which is probably Mar 15th or so. So, I compiled 2.6.38 and started fuzzing it. I'm bumping into other problems, and never seen anything about mremap in 2.6.38 (yet), as it had been happening in 2.6.37-rc2. The output goes to http://alt.swiecki.net/linux_kernel/ - I'm still trying. > Btw, the fuzzer is here: http://code.google.com/p/iknowthis/ > > I think i was trying it with this revision: > http://code.google.com/p/iknowthis/source/detail?r=11 (i386 mode, > newest 'iknowthis' supports x86-64 natively), so feel free to try it. > > It used to crash the machine (it's BUG_ON but the system became > unusable) in matter of hours. Btw, when I was testing it for the last > time it Ooopsed much more frequently in proc_readdir (I sent report in > one of earliet e-mails). > >> Signed-off-by: Hugh Dickins <hughd@google.com> >> --- >> >> mm/mremap.c | 4 +--- >> 1 file changed, 1 insertion(+), 3 deletions(-) >> >> --- 2.6.38-rc6/mm/mremap.c 2011-01-18 22:04:56.000000000 -0800 >> +++ linux/mm/mremap.c 2011-02-23 15:29:52.000000000 -0800 >> @@ -94,9 +94,7 @@ static void move_ptes(struct vm_area_str >> */ >> mapping = vma->vm_file->f_mapping; >> spin_lock(&mapping->i_mmap_lock); >> - if (new_vma->vm_truncate_count && >> - new_vma->vm_truncate_count != vma->vm_truncate_count) >> - new_vma->vm_truncate_count = 0; >> + new_vma->vm_truncate_count = 0; >> } >> >> /* >> > > > > -- > Robert Święcki > -- Robert Święcki -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-03-17 15:40 ` Robert Święcki @ 2011-03-19 5:34 ` Hugh Dickins 2011-04-01 14:34 ` Robert Święcki 0 siblings, 1 reply; 37+ messages in thread From: Hugh Dickins @ 2011-03-19 5:34 UTC (permalink / raw) To: Robert Swiecki Cc: Andrew Morton, Linus Torvalds, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm On Thu, 17 Mar 2011, Robert Swiecki wrote: > On Tue, Mar 1, 2011 at 12:35 AM, Robert Swiecki <robert@swiecki.net> wrote: > > So, I compiled 2.6.38 and started fuzzing it. I'm bumping into other > problems, and never seen anything about mremap in 2.6.38 (yet), Thanks a lot for getting back to this, Robert, and thanks for the update. I won't be celebrating, but this sounds like good news for my mremap patch. > as it had been happening in 2.6.37-rc2. The output goes to > http://alt.swiecki.net/linux_kernel/ - I'm still trying. A problem in sys_mlock: I've Cc'ed Michel who is the current expert. A problem in sys_munlock: Michel again, except vma_prio_tree_add is implicated, and I used to be involved with that. I've appended below a debug patch which I wrote years ago, and have largely forgotten, but Andrew keeps it around in mmotm: we might learn more if you add that into your kernel build. A problem in next_pidmap from find_ge_pid from ... proc_pid_readdir. I did spend a while looking into that when you first reported it. I'm pretty sure, from the register values, that it's a result of a pid number (in some places signed int, in some places unsigned) getting unexpectedly sign-extended to negative, so indexing before the beginning of an array; but I never tracked down the root of the problem, and failed to reproduce it with odd lseeks on the directory. Ah, the one you report now comes from compat_sys_getdents, whereas the original one came from compat_sys_old_readdir: okay, I had been wondering whether it was peculiar to the old_readdir case, but no, it's reproduced with getdents too. Might be peculiar to compat. Anyway, I've Cc'ed Eric who will be the best for that one. And a couple of watchdog problems: I haven't even glanced at those, hope someone else can suggest a good way forward on them. Hugh > > > Btw, the fuzzer is here: http://code.google.com/p/iknowthis/ > > > > I think i was trying it with this revision: > > http://code.google.com/p/iknowthis/source/detail?r=11 (i386 mode, > > newest 'iknowthis' supports x86-64 natively), so feel free to try it. > > > > It used to crash the machine (it's BUG_ON but the system became > > unusable) in matter of hours. Btw, when I was testing it for the last > > time it Ooopsed much more frequently in proc_readdir (I sent report in > > one of earliet e-mails). From: Hugh Dickins <hughd@google.com> Jayson Santos has sighted mm/prio_tree.c:78,79 BUGs (kernel bugzilla 8446), and one was sighted a couple of years ago. No reason yet to suppose they're prio_tree bugs, but we can't tell much about them without seeing the vmas. So dump vma and the one it's supposed to resemble: I had expected to use print_hex_dump(), but that's designed for u8 dumps, whereas almost every field of vm_area_struct is either a pointer or an unsigned long - which look nonsense dumped as u8s. Replace the two BUG_ONs by a single WARN_ON; and if it fires, just keep this vma out of the tree (truncation and swapout won't be able to find it). How safe this is depends on what the error really is; but we hold a file's i_mmap_lock here, so it may be impossible to recover from BUG_ON. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Jayson Santos <jaysonsantos2003@yahoo.com.br> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- mm/prio_tree.c | 33 ++++++++++++++++++++++++++++----- 1 file changed, 28 insertions(+), 5 deletions(-) diff -puN mm/prio_tree.c~prio_tree-debugging-patch mm/prio_tree.c --- a/mm/prio_tree.c~prio_tree-debugging-patch +++ a/mm/prio_tree.c @@ -67,6 +67,20 @@ * vma->shared.vm_set.head == NULL ==> a list node */ +static void dump_vma(struct vm_area_struct *vma) +{ + void **ptr = (void **) vma; + int i; + + printk("vm_area_struct at %p:", ptr); + for (i = 0; i < sizeof(*vma)/sizeof(*ptr); i++, ptr++) { + if (!(i & 3)) + printk("\n"); + printk(" %p", *ptr); + } + printk("\n"); +} + /* * Add a new vma known to map the same set of pages as the old vma: * useful for fork's dup_mmap as well as vma_prio_tree_insert below. @@ -74,14 +88,23 @@ */ void vma_prio_tree_add(struct vm_area_struct *vma, struct vm_area_struct *old) { - /* Leave these BUG_ONs till prio_tree patch stabilizes */ - BUG_ON(RADIX_INDEX(vma) != RADIX_INDEX(old)); - BUG_ON(HEAP_INDEX(vma) != HEAP_INDEX(old)); - vma->shared.vm_set.head = NULL; vma->shared.vm_set.parent = NULL; - if (!old->shared.vm_set.parent) + if (WARN_ON(RADIX_INDEX(vma) != RADIX_INDEX(old) || + HEAP_INDEX(vma) != HEAP_INDEX(old))) { + /* + * This should never happen, yet it has been seen a few times: + * we cannot say much about it without seeing the vma contents. + */ + dump_vma(vma); + dump_vma(old); + /* + * Don't try to link this (corrupt?) vma into the (corrupt?) + * prio_tree, but arrange for its removal to succeed later. + */ + INIT_LIST_HEAD(&vma->shared.vm_set.list); + } else if (!old->shared.vm_set.parent) list_add(&vma->shared.vm_set.list, &old->shared.vm_set.list); else if (old->shared.vm_set.head) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-03-19 5:34 ` Hugh Dickins @ 2011-04-01 14:34 ` Robert Święcki 2011-04-01 15:44 ` Linus Torvalds 0 siblings, 1 reply; 37+ messages in thread From: Robert Święcki @ 2011-04-01 14:34 UTC (permalink / raw) To: Hugh Dickins Cc: Andrew Morton, Linus Torvalds, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm On Sat, Mar 19, 2011 at 6:34 AM, Hugh Dickins <hughd@google.com> wrote: > On Thu, 17 Mar 2011, Robert Swiecki wrote: >> On Tue, Mar 1, 2011 at 12:35 AM, Robert Swiecki <robert@swiecki.net> wrote: >> >> So, I compiled 2.6.38 and started fuzzing it. I'm bumping into other >> problems, and never seen anything about mremap in 2.6.38 (yet), > > Thanks a lot for getting back to this, Robert, and thanks for the update. > I won't be celebrating, but this sounds like good news for my mremap patch. > >> as it had been happening in 2.6.37-rc2. The output goes to >> http://alt.swiecki.net/linux_kernel/ - I'm still trying. > > A problem in sys_mlock: I've Cc'ed Michel who is the current expert. > > A problem in sys_munlock: Michel again, except vma_prio_tree_add is > implicated, and I used to be involved with that. I've appended below > a debug patch which I wrote years ago, and have largely forgotten, but > Andrew keeps it around in mmotm: we might learn more if you add that > into your kernel build. Hey, I'll apply your patch and check it out. In the meantime I triggered another Oops (NULL-ptr deref via sys_mprotect). The oops is here: http://alt.swiecki.net/linux_kernel/sys_mprotect-2.6.38.txt > A problem in next_pidmap from find_ge_pid from ... proc_pid_readdir. > I did spend a while looking into that when you first reported it. > I'm pretty sure, from the register values, that it's a result of > a pid number (in some places signed int, in some places unsigned) > getting unexpectedly sign-extended to negative, so indexing before > the beginning of an array; but I never tracked down the root of the > problem, and failed to reproduce it with odd lseeks on the directory. > > Ah, the one you report now comes from compat_sys_getdents, > whereas the original one came from compat_sys_old_readdir: okay, > I had been wondering whether it was peculiar to the old_readdir case, > but no, it's reproduced with getdents too. Might be peculiar to compat. > > Anyway, I've Cc'ed Eric who will be the best for that one. > > And a couple of watchdog problems: I haven't even glanced at > those, hope someone else can suggest a good way forward on them. > > Hugh > >> >> > Btw, the fuzzer is here: http://code.google.com/p/iknowthis/ >> > >> > I think i was trying it with this revision: >> > http://code.google.com/p/iknowthis/source/detail?r=11 (i386 mode, >> > newest 'iknowthis' supports x86-64 natively), so feel free to try it. >> > >> > It used to crash the machine (it's BUG_ON but the system became >> > unusable) in matter of hours. Btw, when I was testing it for the last >> > time it Ooopsed much more frequently in proc_readdir (I sent report in >> > one of earliet e-mails). > > From: Hugh Dickins <hughd@google.com> > > Jayson Santos has sighted mm/prio_tree.c:78,79 BUGs (kernel bugzilla 8446), > and one was sighted a couple of years ago. No reason yet to suppose > they're prio_tree bugs, but we can't tell much about them without seeing > the vmas. > > So dump vma and the one it's supposed to resemble: I had expected to use > print_hex_dump(), but that's designed for u8 dumps, whereas almost every > field of vm_area_struct is either a pointer or an unsigned long - which > look nonsense dumped as u8s. > > Replace the two BUG_ONs by a single WARN_ON; and if it fires, just keep > this vma out of the tree (truncation and swapout won't be able to find it). > How safe this is depends on what the error really is; but we hold a file's > i_mmap_lock here, so it may be impossible to recover from BUG_ON. > > Signed-off-by: Hugh Dickins <hughd@google.com> > Cc: Jayson Santos <jaysonsantos2003@yahoo.com.br> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > --- > > mm/prio_tree.c | 33 ++++++++++++++++++++++++++++----- > 1 file changed, 28 insertions(+), 5 deletions(-) > > diff -puN mm/prio_tree.c~prio_tree-debugging-patch mm/prio_tree.c > --- a/mm/prio_tree.c~prio_tree-debugging-patch > +++ a/mm/prio_tree.c > @@ -67,6 +67,20 @@ > * vma->shared.vm_set.head == NULL ==> a list node > */ > > +static void dump_vma(struct vm_area_struct *vma) > +{ > + void **ptr = (void **) vma; > + int i; > + > + printk("vm_area_struct at %p:", ptr); > + for (i = 0; i < sizeof(*vma)/sizeof(*ptr); i++, ptr++) { > + if (!(i & 3)) > + printk("\n"); > + printk(" %p", *ptr); > + } > + printk("\n"); > +} > + > /* > * Add a new vma known to map the same set of pages as the old vma: > * useful for fork's dup_mmap as well as vma_prio_tree_insert below. > @@ -74,14 +88,23 @@ > */ > void vma_prio_tree_add(struct vm_area_struct *vma, struct vm_area_struct *old) > { > - /* Leave these BUG_ONs till prio_tree patch stabilizes */ > - BUG_ON(RADIX_INDEX(vma) != RADIX_INDEX(old)); > - BUG_ON(HEAP_INDEX(vma) != HEAP_INDEX(old)); > - > vma->shared.vm_set.head = NULL; > vma->shared.vm_set.parent = NULL; > > - if (!old->shared.vm_set.parent) > + if (WARN_ON(RADIX_INDEX(vma) != RADIX_INDEX(old) || > + HEAP_INDEX(vma) != HEAP_INDEX(old))) { > + /* > + * This should never happen, yet it has been seen a few times: > + * we cannot say much about it without seeing the vma contents. > + */ > + dump_vma(vma); > + dump_vma(old); > + /* > + * Don't try to link this (corrupt?) vma into the (corrupt?) > + * prio_tree, but arrange for its removal to succeed later. > + */ > + INIT_LIST_HEAD(&vma->shared.vm_set.list); > + } else if (!old->shared.vm_set.parent) > list_add(&vma->shared.vm_set.list, > &old->shared.vm_set.list); > else if (old->shared.vm_set.head) > -- Robert Święcki -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-04-01 14:34 ` Robert Święcki @ 2011-04-01 15:44 ` Linus Torvalds 2011-04-01 16:21 ` Robert Święcki 2011-04-02 1:46 ` Hugh Dickins 0 siblings, 2 replies; 37+ messages in thread From: Linus Torvalds @ 2011-04-01 15:44 UTC (permalink / raw) To: Robert Święcki Cc: Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel On Fri, Apr 1, 2011 at 7:34 AM, Robert Święcki <robert@swiecki.net> wrote: > > Hey, I'll apply your patch and check it out. In the meantime I > triggered another Oops (NULL-ptr deref via sys_mprotect). > > The oops is here: > > http://alt.swiecki.net/linux_kernel/sys_mprotect-2.6.38.txt That's not a NULL pointer dereference. That's a BUG_ON(). And for some reason you've turned off the BUG_ON() messages, saving some tiny amount of memory. Anyway, it looks like the first BUG_ON() in vma_prio_tree_add(), so it would be this one: BUG_ON(RADIX_INDEX(vma) != RADIX_INDEX(old)); but it is possible that gcc has shuffled things around (so it _might_ be the HEAP_INDEX() one). If you had CONFIG_DEBUG_BUGVERBOSE=y, you'd get a filename and line number. One reason I hate -O2 in cases like this is that the basic block movement makes it way harder to actually debug things. I would suggest using -Os too (CONFIG_OPTIMIZE_FOR_SIZE or whatever it's called). Anyway, I do find it worrying. The vma code shouldn't be this fragile. Hugh? I do wonder what triggers this. Is it a huge-page vma? We seem to be lacking the check to see that mprotect() is on a hugepage boundary - and that seems bogus. Or am I missing some check? The new transparent hugepage support splits the page, but what if it's a _static_ hugepage thing? But why would that affect the radix_index thing? I have no idea. I'd like to blame the anon_vma rewrites last year, but I can't see why that should matter either. Again, hugepages had some special rules, I think (and that would explain why nobody normal sees this). Guys, please give this one a look. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-04-01 15:44 ` Linus Torvalds @ 2011-04-01 16:21 ` Robert Święcki 2011-04-01 16:35 ` Linus Torvalds 2011-04-02 1:46 ` Hugh Dickins 1 sibling, 1 reply; 37+ messages in thread From: Robert Święcki @ 2011-04-01 16:21 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel On Fri, Apr 1, 2011 at 5:44 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, Apr 1, 2011 at 7:34 AM, Robert Święcki <robert@swiecki.net> wrote: >> >> Hey, I'll apply your patch and check it out. In the meantime I >> triggered another Oops (NULL-ptr deref via sys_mprotect). >> >> The oops is here: >> >> http://alt.swiecki.net/linux_kernel/sys_mprotect-2.6.38.txt > > That's not a NULL pointer dereference. That's a BUG_ON(). > > And for some reason you've turned off the BUG_ON() messages, saving > some tiny amount of memory. Is it possible to turn it off via config flags? Looking into arch/x86/include/asm/bug.h it seems it's unconditional (as in "it always manifests itself somehow") and I have CONFIG_DEBUG_BUGVERBOSE=y. This BUG/Oopps was triggered before I applied Hugh's patch on a vanilla kernel. Anything that could help you debugging this? Uploading kernel image (unfortunately I've overwritten this one), dumping more kgdb data? I must admit I'm not up-to-date with current linux kernel debugging techniques. The kernel config is here: http://alt.swiecki.net/linux_kernel/ise-test-2.6.38-kernel-config.txt For now I'll compile with -O0 -fno-inline (are you sure you'd like -Os?) > Anyway, it looks like the first BUG_ON() in vma_prio_tree_add(), so it > would be this one: > > BUG_ON(RADIX_INDEX(vma) != RADIX_INDEX(old)); > > but it is possible that gcc has shuffled things around (so it _might_ > be the HEAP_INDEX() one). If you had CONFIG_DEBUG_BUGVERBOSE=y, you'd > get a filename and line number. One reason I hate -O2 in cases like > this is that the basic block movement makes it way harder to actually > debug things. I would suggest using -Os too (CONFIG_OPTIMIZE_FOR_SIZE > or whatever it's called). > > Anyway, I do find it worrying. The vma code shouldn't be this fragile. Hugh? > > I do wonder what triggers this. Is it a huge-page vma? We seem to be > lacking the check to see that mprotect() is on a hugepage boundary - > and that seems bogus. Or am I missing some check? The new transparent > hugepage support splits the page, but what if it's a _static_ hugepage > thing? > > But why would that affect the radix_index thing? I have no idea. I'd > like to blame the anon_vma rewrites last year, but I can't see why > that should matter either. Again, hugepages had some special rules, I > think (and that would explain why nobody normal sees this). > > Guys, please give this one a look. > > Linus > -- Robert Święcki -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-04-01 16:21 ` Robert Święcki @ 2011-04-01 16:35 ` Linus Torvalds 2011-04-02 4:01 ` Hui Zhu 0 siblings, 1 reply; 37+ messages in thread From: Linus Torvalds @ 2011-04-01 16:35 UTC (permalink / raw) To: Robert Święcki Cc: Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel On Fri, Apr 1, 2011 at 9:21 AM, Robert Święcki <robert@swiecki.net> wrote: > > Is it possible to turn it off via config flags? Looking into > arch/x86/include/asm/bug.h it seems it's unconditional (as in "it > always manifests itself somehow") and I have > CONFIG_DEBUG_BUGVERBOSE=y. Ok, if you have CONFIG_DEBUG_BUGVERBOSE then, you do have the bug-table. Maybe it's just kdb that is broken, and doesn't print it. I wouldn't be surprised. It's not the first time I've seen debugging features that just make debugging a mess. > Anything that could help you debugging this? Uploading kernel image > (unfortunately I've overwritten this one), dumping more kgdb data? So in this case kgdb just dropped the most important data on the floor. But if you have kdb active next time, print out the vma/old contents in that function that has the BUG() in it. > I must admit I'm not up-to-date with current linux kernel debugging > techniques. The kernel config is here: > http://alt.swiecki.net/linux_kernel/ise-test-2.6.38-kernel-config.txt > > For now I'll compile with -O0 -fno-inline (are you sure you'd like -Os?) Oh, don't do that. -O0 makes the code totally unreadable (the compiler just does _stupid_ things, making the asm code look so horrible that you can't match it up against anything sane), and -fno-inline isn't worth the pain either. -Os is much better than those. But in this case, just getting the filename and line number would have made the thing moot anyway - without kdb it _should_ have said something clear like kernel BUG at %s:%u! where %s:%u is the filename and line number. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-04-01 16:35 ` Linus Torvalds @ 2011-04-02 4:01 ` Hui Zhu 2011-04-04 13:02 ` Robert Święcki 0 siblings, 1 reply; 37+ messages in thread From: Hui Zhu @ 2011-04-02 4:01 UTC (permalink / raw) To: Robert Święcki Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel On Sat, Apr 2, 2011 at 00:35, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, Apr 1, 2011 at 9:21 AM, Robert Święcki <robert@swiecki.net> wrote: >> >> Is it possible to turn it off via config flags? Looking into >> arch/x86/include/asm/bug.h it seems it's unconditional (as in "it >> always manifests itself somehow") and I have >> CONFIG_DEBUG_BUGVERBOSE=y. > > Ok, if you have CONFIG_DEBUG_BUGVERBOSE then, you do have the bug-table. > > Maybe it's just kdb that is broken, and doesn't print it. I wouldn't > be surprised. It's not the first time I've seen debugging features > that just make debugging a mess. > >> Anything that could help you debugging this? Uploading kernel image >> (unfortunately I've overwritten this one), dumping more kgdb data? > > So in this case kgdb just dropped the most important data on the floor. > > But if you have kdb active next time, print out the vma/old contents > in that function that has the BUG() in it. > >> I must admit I'm not up-to-date with current linux kernel debugging >> techniques. The kernel config is here: >> http://alt.swiecki.net/linux_kernel/ise-test-2.6.38-kernel-config.txt >> >> For now I'll compile with -O0 -fno-inline (are you sure you'd like -Os?) Hi Robert, I am not sure you can success with build trunk with -O0 -fno-inline. I suggest you try the patch in http://code.google.com/p/kgtp/downloads/detail?name=co.patch. It add a option in "Kernel hacking" called "Compile with almost no optimization". It will make kernel be built without -O2. It support x86_32, x86_64 and arm. PS, maybe you can try kgtp (https://code.google.com/p/kgtp/) debug your kernel. Thanks, Hui > > Oh, don't do that. -O0 makes the code totally unreadable (the compiler > just does _stupid_ things, making the asm code look so horrible that > you can't match it up against anything sane), and -fno-inline isn't > worth the pain either. > > -Os is much better than those. > > But in this case, just getting the filename and line number would have > made the thing moot anyway - without kdb it _should_ have said > something clear like > > kernel BUG at %s:%u! > > where %s:%u is the filename and line number. > > Linus > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-04-02 4:01 ` Hui Zhu @ 2011-04-04 13:02 ` Robert Święcki 0 siblings, 0 replies; 37+ messages in thread From: Robert Święcki @ 2011-04-04 13:02 UTC (permalink / raw) To: Hui Zhu Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel On Sat, Apr 2, 2011 at 6:01 AM, Hui Zhu <teawater@gmail.com> wrote: > On Sat, Apr 2, 2011 at 00:35, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> On Fri, Apr 1, 2011 at 9:21 AM, Robert Święcki <robert@swiecki.net> wrote: >>> >>> Is it possible to turn it off via config flags? Looking into >>> arch/x86/include/asm/bug.h it seems it's unconditional (as in "it >>> always manifests itself somehow") and I have >>> CONFIG_DEBUG_BUGVERBOSE=y. >> >> Ok, if you have CONFIG_DEBUG_BUGVERBOSE then, you do have the bug-table. >> >> Maybe it's just kdb that is broken, and doesn't print it. I wouldn't >> be surprised. It's not the first time I've seen debugging features >> that just make debugging a mess. >> >>> Anything that could help you debugging this? Uploading kernel image >>> (unfortunately I've overwritten this one), dumping more kgdb data? >> >> So in this case kgdb just dropped the most important data on the floor. >> >> But if you have kdb active next time, print out the vma/old contents >> in that function that has the BUG() in it. >> >>> I must admit I'm not up-to-date with current linux kernel debugging >>> techniques. The kernel config is here: >>> http://alt.swiecki.net/linux_kernel/ise-test-2.6.38-kernel-config.txt >>> >>> For now I'll compile with -O0 -fno-inline (are you sure you'd like -Os?) > > Hi Robert, > > I am not sure you can success with build trunk with -O0 -fno-inline. > I suggest you try the patch in > http://code.google.com/p/kgtp/downloads/detail?name=co.patch. > It add a option in "Kernel hacking" called "Compile with almost no > optimization". It will make kernel be built without -O2. It support > x86_32, x86_64 and arm. HI, Yeah.. -O0 doesn't build smoothly, it seems that building with -O0 is not required right now, but I'll keep your patch in mind in case it becomes necessary. Thanks for the tip. > PS, maybe you can try kgtp (https://code.google.com/p/kgtp/) debug your kernel. > >> >> Oh, don't do that. -O0 makes the code totally unreadable (the compiler >> just does _stupid_ things, making the asm code look so horrible that >> you can't match it up against anything sane), and -fno-inline isn't >> worth the pain either. >> >> -Os is much better than those. >> >> But in this case, just getting the filename and line number would have >> made the thing moot anyway - without kdb it _should_ have said >> something clear like >> >> kernel BUG at %s:%u! >> >> where %s:%u is the filename and line number. >> >> Linus >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> Please read the FAQ at http://www.tux.org/lkml/ >> > -- Robert Święcki -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-04-01 15:44 ` Linus Torvalds 2011-04-01 16:21 ` Robert Święcki @ 2011-04-02 1:46 ` Hugh Dickins 2011-04-04 12:46 ` Robert Święcki 1 sibling, 1 reply; 37+ messages in thread From: Hugh Dickins @ 2011-04-02 1:46 UTC (permalink / raw) To: Linus Torvalds Cc: Robert Święcki, Andrew Morton, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel On Fri, Apr 1, 2011 at 8:44 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, Apr 1, 2011 at 7:34 AM, Robert Święcki <robert@swiecki.net> wrote: >> >> Hey, I'll apply your patch and check it out. In the meantime I >> triggered another Oops (NULL-ptr deref via sys_mprotect). >> >> The oops is here: >> >> http://alt.swiecki.net/linux_kernel/sys_mprotect-2.6.38.txt > > That's not a NULL pointer dereference. That's a BUG_ON(). > > And for some reason you've turned off the BUG_ON() messages, saving > some tiny amount of memory. > > Anyway, it looks like the first BUG_ON() in vma_prio_tree_add(), so it > would be this one: > > BUG_ON(RADIX_INDEX(vma) != RADIX_INDEX(old)); > > but it is possible that gcc has shuffled things around (so it _might_ > be the HEAP_INDEX() one). If you had CONFIG_DEBUG_BUGVERBOSE=y, you'd > get a filename and line number. One reason I hate -O2 in cases like > this is that the basic block movement makes it way harder to actually > debug things. I would suggest using -Os too (CONFIG_OPTIMIZE_FOR_SIZE > or whatever it's called). > > Anyway, I do find it worrying. The vma code shouldn't be this fragile. Hugh? > > I do wonder what triggers this. Is it a huge-page vma? We seem to be > lacking the check to see that mprotect() is on a hugepage boundary - > and that seems bogus. Or am I missing some check? The new transparent > hugepage support splits the page, but what if it's a _static_ hugepage > thing? > > But why would that affect the radix_index thing? I have no idea. I'd > like to blame the anon_vma rewrites last year, but I can't see why > that should matter either. Again, hugepages had some special rules, I > think (and that would explain why nobody normal sees this). > > Guys, please give this one a look. I do intend to look, but I think it makes more sense to wait until Robert has reproduced it (or something like it) with my debugging patch in. He's fuzzing, so no reason to get anxious about recent changes, it may have been lurking there for years. He did already report a vma_prio_tree_add() crash, which led me to send him the patch: so the issue seems to be reproducible, and the patch dumps out the vma_area_structs involved, which has some hope of telling us more. Down the years we've had about three earlier reports of crashes there: so rare we've tended to put them down to bad memory or slab corruption, but never had anything like a reproducible case to study until now. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-04-02 1:46 ` Hugh Dickins @ 2011-04-04 12:46 ` Robert Święcki 2011-04-04 18:30 ` Hugh Dickins 0 siblings, 1 reply; 37+ messages in thread From: Robert Święcki @ 2011-04-04 12:46 UTC (permalink / raw) To: Hugh Dickins Cc: Linus Torvalds, Andrew Morton, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel On Sat, Apr 2, 2011 at 3:46 AM, Hugh Dickins <hughd@google.com> wrote: > On Fri, Apr 1, 2011 at 8:44 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> On Fri, Apr 1, 2011 at 7:34 AM, Robert Święcki <robert@swiecki.net> wrote: >>> >>> Hey, I'll apply your patch and check it out. In the meantime I >>> triggered another Oops (NULL-ptr deref via sys_mprotect). >>> >>> The oops is here: >>> >>> http://alt.swiecki.net/linux_kernel/sys_mprotect-2.6.38.txt >> >> That's not a NULL pointer dereference. That's a BUG_ON(). >> >> And for some reason you've turned off the BUG_ON() messages, saving >> some tiny amount of memory. >> >> Anyway, it looks like the first BUG_ON() in vma_prio_tree_add(), so it >> would be this one: >> >> BUG_ON(RADIX_INDEX(vma) != RADIX_INDEX(old)); >> >> but it is possible that gcc has shuffled things around (so it _might_ >> be the HEAP_INDEX() one). If you had CONFIG_DEBUG_BUGVERBOSE=y, you'd >> get a filename and line number. One reason I hate -O2 in cases like >> this is that the basic block movement makes it way harder to actually >> debug things. I would suggest using -Os too (CONFIG_OPTIMIZE_FOR_SIZE >> or whatever it's called). >> >> Anyway, I do find it worrying. The vma code shouldn't be this fragile. Hugh? >> >> I do wonder what triggers this. Is it a huge-page vma? We seem to be >> lacking the check to see that mprotect() is on a hugepage boundary - >> and that seems bogus. Or am I missing some check? The new transparent >> hugepage support splits the page, but what if it's a _static_ hugepage >> thing? >> >> But why would that affect the radix_index thing? I have no idea. I'd >> like to blame the anon_vma rewrites last year, but I can't see why >> that should matter either. Again, hugepages had some special rules, I >> think (and that would explain why nobody normal sees this). >> >> Guys, please give this one a look. > > I do intend to look, but I think it makes more sense to wait until > Robert has reproduced it (or something like it) with my debugging > patch in. Hi Hugh, I did two things, included your patch, and compiled with CONFIG_CC_OPTIMIZE_FOR_SIZE=y; the kernel didn't BUG() or Oopssed for ~2 days under fuzzing (with getdents and readdir syscalls disabled in the fuzzer). I don't think -Os has any bigger influence on how mm internally works therefore I must attribute the change to your patch (was it patch which fixes something or merely dumps vma structures in case of any problem?). ============== BTW, another problem arose in the meantime, not sure if anyhow related to things we're discussing here, although 'btc 0' in kdb shows that processor 0 hangs in sys_mlock - I did it in two different moments, to exclude any coincidences. After those 2 days of fuzzing, 'ps wuax' stopped working, i.e. it prints some output, then stops, cannot be killed with -SIGKILL etc. I'll let it run for the time being, I can dump more data in this PID 17750 if anybody wants: strace: # strace -f ps wwuax .... open("/proc/17750/status", O_RDONLY) = 6 read(6, "Name:\tiknowthis\nState:\tD (disk s"..., 1023) = 777 close(6) = 0 open("/proc/17750/cmdline", O_RDONLY) = 6 read(6, Process 17750 also cannot be killed. Attaching more data: # cat /proc/17750/status Name: iknowthis State: D (disk sleep) Tgid: 17750 Pid: 17750 PPid: 1 TracerPid: 0 Uid: 1001 1001 1001 1001 Gid: 1001 1001 1001 1001 FDSize: 64 Groups: 1001 VmPeak: 7752 kB VmSize: 5760 kB VmLck: 32 kB VmHWM: 4892 kB VmRSS: 3068 kB VmData: 2472 kB VmStk: 408 kB VmExe: 160 kB VmLib: 2684 kB VmPTE: 44 kB VmSwap: 0 kB Threads: 1 SigQ: 218/16382 SigPnd: 0000000000000b00 ShdPnd: 0000400000000503 SigBlk: 0000000000000000 SigIgn: 0000000001001000 SigCgt: 0000000000000000 CapInh: 0000000000000000 CapPrm: 0000000000000000 CapEff: 0000000000000000 CapBnd: ffffffffffffffff Cpus_allowed: 01 Cpus_allowed_list: 0 Mems_allowed: 00000000,00000001 Mems_allowed_list: 0 voluntary_ctxt_switches: 43330 nonvoluntary_ctxt_switches: 4436 # cat /proc/17750/wchan call_rwsem_down_write_failed # cat /proc/17750/maps (hangs) (from kdb) [0]kdb> btp 17750 Stack traceback for pid 17750 0xffff88011e772dc0 17750 1 0 0 D 0xffff88011e773240 iknowthis <c> ffff88011cbcfb88<c> 0000000000000086<c> ffff88011cbcfb08<c> ffff88011cbcffd8<c> <c> 0000000000013f00<c> ffff88011e772dc0<c> ffff88011e773180<c> ffff88011e773178<c> <c> ffff88011cbce000<c> ffff88011cbcffd8<c> 0000000000013f00<c> 0000000000013f00<c> Call Trace: [<ffffffff81e286ad>] rwsem_down_failed_common+0xdb/0x10d [<ffffffff81e286f2>] rwsem_down_write_failed+0x13/0x15 [<ffffffff81416953>] call_rwsem_down_write_failed+0x13/0x20 [<ffffffff81e27da0>] ? down_write+0x25/0x27 [<ffffffff8115f041>] do_coredump+0x14f/0x9a5 [<ffffffff8114d4e2>] ? T.1006+0x17/0x32 [<ffffffff810a45f0>] ? __dequeue_signal+0xfa/0x12f [<ffffffff8108a79c>] ? get_parent_ip+0x11/0x42 [<ffffffff810a6406>] get_signal_to_deliver+0x3be/0x3e6 [<ffffffff8103e0c1>] do_signal+0x72/0x67d [<ffffffff81096807>] ? child_wait_callback+0x0/0x58 [<ffffffff81e28c28>] ? _raw_spin_unlock_irq+0x36/0x41 [<ffffffff8108bb06>] ? finish_task_switch+0x4b/0xb9 [<ffffffff8108c3ed>] ? schedule_tail+0x38/0x68 [<ffffffff8103eb43>] ? ret_from_fork+0x13/0x80 [<ffffffff8103e6f8>] do_notify_resume+0x2c/0x6e [0]kdb> btc 0 Stack traceback for pid 10350 0xffff88011b6badc0 10350 1 1 0 R 0xffff88011b6bb240 *iknowthis2 <c> ffff8800cfc03db8<c> 0000000000000000<c> Call Trace: <#DB> <<EOE>> <IRQ> [<ffffffff81518b03>] ? __handle_sysrq+0xbf/0x15c [<ffffffff81518d7d>] ? handle_sysrq+0x2c/0x2e [<ffffffff8152bd90>] ? serial8250_handle_port+0x157/0x2b2 [<ffffffff810a1be8>] ? run_timer_softirq+0x2b3/0x2c2 [<ffffffff8152bf4c>] ? serial8250_interrupt+0x61/0x111 [<ffffffff810e6e52>] ? handle_IRQ_event+0x78/0x150 [<ffffffff810ea044>] ? move_native_irq+0x19/0x6d [<ffffffff810e8d90>] ? handle_edge_irq+0xe3/0x12f [<ffffffff8104198f>] ? handle_irq+0x88/0x91 [<ffffffff81e297a5>] ? do_IRQ+0x4d/0xb3 [<ffffffff81e29193>] ? ret_from_intr+0x0/0x15 <EOI> [<ffffffff811336aa>] ? __mlock_vma_pages_range+0x49/0xad [<ffffffff8113370a>] ? __mlock_vma_pages_range+0xa9/0xad [<ffffffff811337c0>] ? do_mlock_pages+0xb2/0x118 [<ffffffff81134002>] ? sys_mlock+0xe8/0xf6 [<ffffffff8107d7e3>] ? ia32_sysret+0x0/0x5 [0]kdb> btc 1 Stack traceback for pid 9409 0xffff88011c9816e0 9409 1 1 1 R 0xffff88011c981b60 iknowthis2 <c> ffff88011dc21ec8 -- Robert Święcki -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-04-04 12:46 ` Robert Święcki @ 2011-04-04 18:30 ` Hugh Dickins 2011-04-05 12:21 ` Robert Święcki 0 siblings, 1 reply; 37+ messages in thread From: Hugh Dickins @ 2011-04-04 18:30 UTC (permalink / raw) To: Robert Święcki Cc: Linus Torvalds, Andrew Morton, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel On Mon, Apr 4, 2011 at 5:46 AM, Robert Święcki <robert@swiecki.net> wrote: > On Sat, Apr 2, 2011 at 3:46 AM, Hugh Dickins <hughd@google.com> wrote: >> On Fri, Apr 1, 2011 at 8:44 AM, Linus Torvalds >> <torvalds@linux-foundation.org> wrote: >>> On Fri, Apr 1, 2011 at 7:34 AM, Robert Święcki <robert@swiecki.net> wrote: >>>> >>>> Hey, I'll apply your patch and check it out. In the meantime I >>>> triggered another Oops (NULL-ptr deref via sys_mprotect). >>>> >>>> The oops is here: >>>> >>>> http://alt.swiecki.net/linux_kernel/sys_mprotect-2.6.38.txt >>> >>> That's not a NULL pointer dereference. That's a BUG_ON(). >>> >>> And for some reason you've turned off the BUG_ON() messages, saving >>> some tiny amount of memory. >>> >>> Anyway, it looks like the first BUG_ON() in vma_prio_tree_add(), so it >>> would be this one: >>> >>> BUG_ON(RADIX_INDEX(vma) != RADIX_INDEX(old)); >>> >>> but it is possible that gcc has shuffled things around (so it _might_ >>> be the HEAP_INDEX() one). If you had CONFIG_DEBUG_BUGVERBOSE=y, you'd >>> get a filename and line number. One reason I hate -O2 in cases like >>> this is that the basic block movement makes it way harder to actually >>> debug things. I would suggest using -Os too (CONFIG_OPTIMIZE_FOR_SIZE >>> or whatever it's called). >>> >>> Anyway, I do find it worrying. The vma code shouldn't be this fragile. Hugh? >>> >>> I do wonder what triggers this. Is it a huge-page vma? We seem to be >>> lacking the check to see that mprotect() is on a hugepage boundary - >>> and that seems bogus. Or am I missing some check? The new transparent >>> hugepage support splits the page, but what if it's a _static_ hugepage >>> thing? >>> >>> But why would that affect the radix_index thing? I have no idea. I'd >>> like to blame the anon_vma rewrites last year, but I can't see why >>> that should matter either. Again, hugepages had some special rules, I >>> think (and that would explain why nobody normal sees this). >>> >>> Guys, please give this one a look. >> >> I do intend to look, but I think it makes more sense to wait until >> Robert has reproduced it (or something like it) with my debugging >> patch in. > > Hi Hugh, > > I did two things, included your patch, and compiled with > CONFIG_CC_OPTIMIZE_FOR_SIZE=y; the kernel didn't BUG() or Oopssed for > ~2 days under fuzzing (with getdents and readdir syscalls disabled in > the fuzzer). I don't think -Os has any bigger influence on how mm > internally works therefore I must attribute the change to your patch > (was it patch which fixes something or merely dumps vma structures in > case of any problem?). I'm sorry, I should have explained the patch a little more. Along with dumping out the vma structs, it does change the BUG or BUGs there to WARN_ONs, allowing the system to continue if it's not too badly corrupted, though leaking some structure memory (if the structs have been reused, it's probably not safe to assume we still have ownership of them). So if the problem has occurred again, it should be leaving WARNING messages and vma struct dumps in your /var/log/messages - please look for them and send them in if found. Perhaps we should simply include the patch in mainline kernel: it doesn't do much good just lingering in mmotm, but seems to be helping your system to limp along longer. > > > ============== > BTW, another problem arose in the meantime, not sure if anyhow related > to things we're discussing here, although 'btc 0' in kdb shows that > processor 0 hangs in sys_mlock - I did it in two different moments, to > exclude any coincidences. After those 2 days of fuzzing, 'ps wuax' > stopped working, i.e. it prints some output, then stops, cannot be > killed with -SIGKILL etc. I'll let it run for the time being, I can > dump more data in this PID 17750 if anybody wants: > > strace: > > # strace -f ps wwuax > .... > open("/proc/17750/status", O_RDONLY) = 6 > read(6, "Name:\tiknowthis\nState:\tD (disk s"..., 1023) = 777 > close(6) = 0 > open("/proc/17750/cmdline", O_RDONLY) = 6 > read(6, > > Process 17750 also cannot be killed. Attaching more data: > > # cat /proc/17750/status > Name: iknowthis > State: D (disk sleep) > Tgid: 17750 > Pid: 17750 > PPid: 1 > TracerPid: 0 > Uid: 1001 1001 1001 1001 > Gid: 1001 1001 1001 1001 > FDSize: 64 > Groups: 1001 > VmPeak: 7752 kB > VmSize: 5760 kB > VmLck: 32 kB > VmHWM: 4892 kB > VmRSS: 3068 kB > VmData: 2472 kB > VmStk: 408 kB > VmExe: 160 kB > VmLib: 2684 kB > VmPTE: 44 kB > VmSwap: 0 kB > Threads: 1 > SigQ: 218/16382 > SigPnd: 0000000000000b00 > ShdPnd: 0000400000000503 > SigBlk: 0000000000000000 > SigIgn: 0000000001001000 > SigCgt: 0000000000000000 > CapInh: 0000000000000000 > CapPrm: 0000000000000000 > CapEff: 0000000000000000 > CapBnd: ffffffffffffffff > Cpus_allowed: 01 > Cpus_allowed_list: 0 > Mems_allowed: 00000000,00000001 > Mems_allowed_list: 0 > voluntary_ctxt_switches: 43330 > nonvoluntary_ctxt_switches: 4436 > > # cat /proc/17750/wchan > call_rwsem_down_write_failed > > # cat /proc/17750/maps (hangs) > > (from kdb) > > [0]kdb> btp 17750 > Stack traceback for pid 17750 > 0xffff88011e772dc0 17750 1 0 0 D 0xffff88011e773240 iknowthis > <c> ffff88011cbcfb88<c> 0000000000000086<c> ffff88011cbcfb08<c> > ffff88011cbcffd8<c> > <c> 0000000000013f00<c> ffff88011e772dc0<c> ffff88011e773180<c> > ffff88011e773178<c> > <c> ffff88011cbce000<c> ffff88011cbcffd8<c> 0000000000013f00<c> > 0000000000013f00<c> > Call Trace: > [<ffffffff81e286ad>] rwsem_down_failed_common+0xdb/0x10d > [<ffffffff81e286f2>] rwsem_down_write_failed+0x13/0x15 > [<ffffffff81416953>] call_rwsem_down_write_failed+0x13/0x20 > [<ffffffff81e27da0>] ? down_write+0x25/0x27 > [<ffffffff8115f041>] do_coredump+0x14f/0x9a5 > [<ffffffff8114d4e2>] ? T.1006+0x17/0x32 > [<ffffffff810a45f0>] ? __dequeue_signal+0xfa/0x12f > [<ffffffff8108a79c>] ? get_parent_ip+0x11/0x42 > [<ffffffff810a6406>] get_signal_to_deliver+0x3be/0x3e6 > [<ffffffff8103e0c1>] do_signal+0x72/0x67d > [<ffffffff81096807>] ? child_wait_callback+0x0/0x58 > [<ffffffff81e28c28>] ? _raw_spin_unlock_irq+0x36/0x41 > [<ffffffff8108bb06>] ? finish_task_switch+0x4b/0xb9 > [<ffffffff8108c3ed>] ? schedule_tail+0x38/0x68 > [<ffffffff8103eb43>] ? ret_from_fork+0x13/0x80 > [<ffffffff8103e6f8>] do_notify_resume+0x2c/0x6e > > [0]kdb> btc 0 > Stack traceback for pid 10350 > 0xffff88011b6badc0 10350 1 1 0 R 0xffff88011b6bb240 *iknowthis2 > <c> ffff8800cfc03db8<c> 0000000000000000<c> > Call Trace: > <#DB> <<EOE>> <IRQ> [<ffffffff81518b03>] ? __handle_sysrq+0xbf/0x15c > [<ffffffff81518d7d>] ? handle_sysrq+0x2c/0x2e > [<ffffffff8152bd90>] ? serial8250_handle_port+0x157/0x2b2 > [<ffffffff810a1be8>] ? run_timer_softirq+0x2b3/0x2c2 > [<ffffffff8152bf4c>] ? serial8250_interrupt+0x61/0x111 > [<ffffffff810e6e52>] ? handle_IRQ_event+0x78/0x150 > [<ffffffff810ea044>] ? move_native_irq+0x19/0x6d > [<ffffffff810e8d90>] ? handle_edge_irq+0xe3/0x12f > [<ffffffff8104198f>] ? handle_irq+0x88/0x91 > [<ffffffff81e297a5>] ? do_IRQ+0x4d/0xb3 > [<ffffffff81e29193>] ? ret_from_intr+0x0/0x15 > <EOI> [<ffffffff811336aa>] ? __mlock_vma_pages_range+0x49/0xad > [<ffffffff8113370a>] ? __mlock_vma_pages_range+0xa9/0xad > [<ffffffff811337c0>] ? do_mlock_pages+0xb2/0x118 > [<ffffffff81134002>] ? sys_mlock+0xe8/0xf6 > [<ffffffff8107d7e3>] ? ia32_sysret+0x0/0x5 > > [0]kdb> btc 1 > Stack traceback for pid 9409 > 0xffff88011c9816e0 9409 1 1 1 R 0xffff88011c981b60 iknowthis2 > <c> ffff88011dc21ec8 Sorry, I've no time to think about this one at the moment (at LSF). Does this look similar to what you previously reported on mlock? Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-04-04 18:30 ` Hugh Dickins @ 2011-04-05 12:21 ` Robert Święcki 2011-04-05 15:37 ` Linus Torvalds 0 siblings, 1 reply; 37+ messages in thread From: Robert Święcki @ 2011-04-05 12:21 UTC (permalink / raw) To: Hugh Dickins Cc: Linus Torvalds, Andrew Morton, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel >> Hi Hugh, >> >> I did two things, included your patch, and compiled with >> CONFIG_CC_OPTIMIZE_FOR_SIZE=y; the kernel didn't BUG() or Oopssed for >> ~2 days under fuzzing (with getdents and readdir syscalls disabled in >> the fuzzer). I don't think -Os has any bigger influence on how mm >> internally works therefore I must attribute the change to your patch >> (was it patch which fixes something or merely dumps vma structures in >> case of any problem?). > > I'm sorry, I should have explained the patch a little more. Along > with dumping out the vma structs, it does change the BUG or BUGs there > to WARN_ONs, allowing the system to continue if it's not too badly > corrupted, though leaking some structure memory (if the structs have > been reused, it's probably not safe to assume we still have ownership > of them). So if the problem has occurred again, it should be leaving > WARNING messages and vma struct dumps in your /var/log/messages - > please look for them and send them in if found. Here it is, I'll leave it in this state (kdb) in case you need some remote debugging <4>[ 1523.877666] WARNING: at mm/prio_tree.c:95 vma_prio_tree_add+0x43/0x110() <4>[ 1523.884381] Hardware name: Precision WorkStation 390 <4>[ 1523.889703] Pid: 13801, comm: iknowthis2 Not tainted 2.6.38 #2 <4>[ 1523.895544] Call Trace: <4>[ 1523.898000] [<ffffffff810b5d2a>] ? warn_slowpath_common+0x7a/0xb0 <4>[ 1523.904195] [<ffffffff810b5d7a>] ? warn_slowpath_null+0x1a/0x20 <4>[ 1523.910210] [<ffffffff8116b4b3>] ? vma_prio_tree_add+0x43/0x110 <4>[ 1523.916226] [<ffffffff8116b5c1>] ? vma_prio_tree_insert+0x41/0x60 <4>[ 1523.922416] [<ffffffff8117b69c>] ? __vma_link_file+0x4c/0x90 <4>[ 1523.928171] [<ffffffff8117c078>] ? vma_adjust+0xe8/0x570 <4>[ 1523.933579] [<ffffffff8117c641>] ? __split_vma+0x141/0x280 <4>[ 1523.939157] [<ffffffff8117c7a5>] ? split_vma+0x25/0x30 <4>[ 1523.944391] [<ffffffff811733f6>] ? sys_madvise+0x6a6/0x720 <4>[ 1523.949969] [<ffffffff810a4e09>] ? sub_preempt_count+0xa9/0xe0 <4>[ 1523.955900] [<ffffffff8224f809>] ? trace_hardirqs_on_thunk+0x3a/0x3c <4>[ 1523.962347] [<ffffffff8109a653>] ? ia32_sysret+0x0/0x5 <4>[ 1523.967580] [<ffffffff8224f809>] ? trace_hardirqs_on_thunk+0x3a/0x3c <4>[ 1523.974026] ---[ end trace c13483b7eb481afd ]--- <4>[ 1523.978650] vm_area_struct at ffff880120bda508: <4>[ 1523.983199] ffff88011eb5aa00 00000000f72f3000 00000000f73f0000 ffff88011b8eaa10 <4>[ 1523.990674] ffff88011b8ea228 0000000000000027 00000000000101ff ffff88011b8ea6b1 <4>[ 1523.998151] ffff88011e390820 ffff88011b8ea260 ffff880120796780 ffff880120bdad40 <4>[ 1524.005624] (null) (null) ffff88011ed5b910 ffff88011ed5b1f0 <4>[ 1524.013103] ffff88011f72b168 ffffffff82427480 ffffffffffffff03 ffff8800793ff0c0 <4>[ 1524.020581] (null) (null) (null) <4>[ 1524.026556] vm_area_struct at ffff880120bdacf0: <4>[ 1524.031110] ffff88011eb5a300 00000000f72f3000 00000000f7400000 ffff88011f6c6f18 <4>[ 1524.038584] ffff88011b5c9da8 0000000000000027 00000000000101ff ffff8801206f0c71 <4>[ 1524.046062] ffff88011f6c6f50 ffff88011b5c9de0 ffff880120bdad40 ffff880120bdad40 <4>[ 1524.053536] ffff880120bda558 (null) ffff88011f758ee0 ffff88011f7583a0 <4>[ 1524.061016] ffff88011f556690 ffffffff82427480 ffffffffffffff03 ffff8800793ff0c0 <4>[ 1524.068491] (null) (null) (null) [1]kdb> pid KDB current process is iknowthis2(pid=13801) [1]kdb> btp 13801 Stack traceback for pid 13801 0xffff88011ec35cc0 13801 4516 1 1 R 0xffff88011ec36140 *iknowthis2 <c> ffff88011c5d1d68<c> 0000000000000000<c> ffff88011f7a3eb8<c> ffff88011c5d1d88<c> <c> ffffffff8116b3b9<c> ffff88011b8ea730<c> ffff88011b8eaa10<c> ffff88011c5d1e28<c> <c> ffffffff8117c0cb<c> 00000000f73f0000<c> ffff88011eb5aa00<c> ffff88011f72b168<c> Call Trace: [<ffffffff8116b3b9>] ? vma_prio_tree_remove+0xc9/0x110 [<ffffffff8117c0cb>] ? vma_adjust+0x13b/0x570 [<ffffffff8117c641>] ? __split_vma+0x141/0x280 [<ffffffff8117c7a5>] ? split_vma+0x25/0x30 [<ffffffff811733f6>] ? sys_madvise+0x6a6/0x720 [<ffffffff810a4e09>] ? sub_preempt_count+0xa9/0xe0 [<ffffffff8224f809>] ? trace_hardirqs_on_thunk+0x3a/0x3c [<ffffffff8109a653>] ? ia32_sysret+0x0/0x5 [<ffffffff8224f809>] ? trace_hardirqs_on_thunk+0x3a/0x3c [1]kdb> rd ax: ffff88011b8ea780 bx: ffff880120796678 cx: ffff8801207966c8 dx: ffff8801207966c8 si: ffff8801207966c8 di: ffff880027d3bec8 bp: ffff88011c5d1d68 sp: ffff88011c5d1d68 r8: 0000000000000000 r9: ffff88011c5d1946 r10: ffff88011c5d1945 r11: 0000000000000000 r12: ffff88011b8eaa10 r13: ffff880120bda508 r14: ffff880027d3bea8 r15: ffff88011eb5aa00 ip: ffffffff8158d935 flags: 00010297 cs: 00000010 -- Robert Święcki -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-04-05 12:21 ` Robert Święcki @ 2011-04-05 15:37 ` Linus Torvalds 2011-04-06 14:47 ` Hugh Dickins 0 siblings, 1 reply; 37+ messages in thread From: Linus Torvalds @ 2011-04-05 15:37 UTC (permalink / raw) To: Robert Święcki Cc: Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel On Tue, Apr 5, 2011 at 5:21 AM, Robert Święcki <robert@swiecki.net> wrote: > > Here it is, I'll leave it in this state (kdb) in case you need some > remote debugging > > <4>[ 1523.877666] WARNING: at mm/prio_tree.c:95 vma_prio_tree_add+0x43/0x110() > <4>[ 1523.978650] vm_area_struct at ffff880120bda508: > <4>[ 1523.983199] ffff88011eb5aa00 00000000f72f3000 00000000f73f0000 ffff88011b8eaa10 > <4>[ 1523.990674] ffff88011b8ea228 0000000000000027 00000000000101ff ffff88011b8ea6b1 > <4>[ 1523.998151] ffff88011e390820 ffff88011b8ea260 ffff880120796780 ffff880120bdad40 > <4>[ 1524.005624] (null) (null) ffff88011ed5b910 ffff88011ed5b1f0 > <4>[ 1524.013103] ffff88011f72b168 ffffffff82427480 ffffffffffffff03 ffff8800793ff0c0 > <4>[ 1524.020581] (null) (null) (null) vma->vm_start/end is 0xf72f3000-0xf73f0000 > <4>[ 1524.026556] vm_area_struct at ffff880120bdacf0: > <4>[ 1524.031110] ffff88011eb5a300 00000000f72f3000 00000000f7400000 ffff88011f6c6f18 > <4>[ 1524.038584] ffff88011b5c9da8 0000000000000027 00000000000101ff ffff8801206f0c71 > <4>[ 1524.046062] ffff88011f6c6f50 ffff88011b5c9de0 ffff880120bdad40 ffff880120bdad40 > <4>[ 1524.053536] ffff880120bda558 (null) ffff88011f758ee0 ffff88011f7583a0 > <4>[ 1524.061016] ffff88011f556690 ffffffff82427480 ffffffffffffff03 ffff8800793ff0c0 > <4>[ 1524.068491] (null) (null) (null) vma->vm_start/end is 0xf72f3000-0xf7400000. If I read those right, then the vm_pgoff (RADIX_INDEX for the prio-tree) is ffffffffffffff03 for both cases. That doesn't look good. How do we get a negative pg_off for a file mapping? Also, since they have a different size, they should have a different HEAP_INDEX. That's why we BUG_ON() - with a different HEAP_INDEX, shouldn't that mean that the prio_tree_insert() logic should create a new node for it? I dunno. But that odd negative pg_off thing makes me think there is some overflow issue (ie HEAP_INDEX being pg_off + size ends up fluctuating between really big and really small). So I'd suspect THAT as the main reason. But maybe I'm mis-reading the dump, and the ffffffffffffff03 isn't vm_pgoff at all. Hugh? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-04-05 15:37 ` Linus Torvalds @ 2011-04-06 14:47 ` Hugh Dickins 2011-04-06 15:32 ` Linus Torvalds 0 siblings, 1 reply; 37+ messages in thread From: Hugh Dickins @ 2011-04-06 14:47 UTC (permalink / raw) To: Linus Torvalds Cc: Robert Święcki, Andrew Morton, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel On Tue, Apr 5, 2011 at 8:37 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, Apr 5, 2011 at 5:21 AM, Robert Święcki <robert@swiecki.net> wrote: >> >> Here it is, I'll leave it in this state (kdb) in case you need some >> remote debugging >> >> <4>[ 1523.877666] WARNING: at mm/prio_tree.c:95 vma_prio_tree_add+0x43/0x110() >> <4>[ 1523.978650] vm_area_struct at ffff880120bda508: >> <4>[ 1523.983199] ffff88011eb5aa00 00000000f72f3000 00000000f73f0000 ffff88011b8eaa10 >> <4>[ 1523.990674] ffff88011b8ea228 0000000000000027 00000000000101ff ffff88011b8ea6b1 >> <4>[ 1523.998151] ffff88011e390820 ffff88011b8ea260 ffff880120796780 ffff880120bdad40 >> <4>[ 1524.005624] (null) (null) ffff88011ed5b910 ffff88011ed5b1f0 >> <4>[ 1524.013103] ffff88011f72b168 ffffffff82427480 ffffffffffffff03 ffff8800793ff0c0 >> <4>[ 1524.020581] (null) (null) (null) > > vma->vm_start/end is 0xf72f3000-0xf73f0000 > >> <4>[ 1524.026556] vm_area_struct at ffff880120bdacf0: >> <4>[ 1524.031110] ffff88011eb5a300 00000000f72f3000 00000000f7400000 ffff88011f6c6f18 >> <4>[ 1524.038584] ffff88011b5c9da8 0000000000000027 00000000000101ff ffff8801206f0c71 >> <4>[ 1524.046062] ffff88011f6c6f50 ffff88011b5c9de0 ffff880120bdad40 ffff880120bdad40 >> <4>[ 1524.053536] ffff880120bda558 (null) ffff88011f758ee0 ffff88011f7583a0 >> <4>[ 1524.061016] ffff88011f556690 ffffffff82427480 ffffffffffffff03 ffff8800793ff0c0 >> <4>[ 1524.068491] (null) (null) (null) > > vma->vm_start/end is 0xf72f3000-0xf7400000. > > If I read those right, then the vm_pgoff (RADIX_INDEX for the > prio-tree) is ffffffffffffff03 for both cases. That doesn't look good. > How do we get a negative pg_off for a file mapping? Yes, I think that's probably at the root of it. Robert is using a fuzzer, and it's a 32-bit executable running on a 64-bit kernel: I suspect there's somewhere on our compat path where we've not validated incoming mmap offset properly. Hmm, but I don't see anything wrong there. > > Also, since they have a different size, they should have a different > HEAP_INDEX. That's why we BUG_ON() - with a different HEAP_INDEX, > shouldn't that mean that the prio_tree_insert() logic should create a > new node for it? Yes. > > I dunno. But that odd negative pg_off thing makes me think there is > some overflow issue (ie HEAP_INDEX being pg_off + size ends up > fluctuating between really big and really small). So I'd suspect THAT > as the main reason. Yes, one of the vmas is such that the end offset (pgoff of next page after) would be 0, and for the other it would be 16. There's sure to be places, inside the prio_tree code and outside it, where we rely upon pgoff not wrapping around - wrap should be prevented by original validation of arguments. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-04-06 14:47 ` Hugh Dickins @ 2011-04-06 15:32 ` Linus Torvalds 2011-04-06 15:43 ` Hugh Dickins 0 siblings, 1 reply; 37+ messages in thread From: Linus Torvalds @ 2011-04-06 15:32 UTC (permalink / raw) To: Hugh Dickins Cc: Robert Święcki, Andrew Morton, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel [-- Attachment #1: Type: text/plain, Size: 1381 bytes --] On Wed, Apr 6, 2011 at 7:47 AM, Hugh Dickins <hughd@google.com> wrote: >> >> I dunno. But that odd negative pg_off thing makes me think there is >> some overflow issue (ie HEAP_INDEX being pg_off + size ends up >> fluctuating between really big and really small). So I'd suspect THAT >> as the main reason. > > Yes, one of the vmas is such that the end offset (pgoff of next page > after) would be 0, and for the other it would be 16. There's sure to > be places, inside the prio_tree code and outside it, where we rely > upon pgoff not wrapping around - wrap should be prevented by original > validation of arguments. Well, we _do_ validate them in do_mmap_pgoff(), which is the main routine for all the mmap() system calls, and the main way to get a new mapping. There are other ways, like do_brk(), but afaik that always sets vm_pgoff to the virtual address (shifted), so again the new mapping should be fine. So when a new mapping is created, it should all be ok. But I think mremap() may end up expanding it without doing the same overflow check. Do you see any other way to get this situation? Does the vma dump give you any hint about where it came from? Robert - here's a (UNTESTED!) patch to make mremap() be a bit more careful about vm_pgoff when growing a mapping. Does it make any difference? Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 771 bytes --] mm/mremap.c | 11 +++++++++-- 1 files changed, 9 insertions(+), 2 deletions(-) diff --git a/mm/mremap.c b/mm/mremap.c index 1de98d492ddc..a7c1f9f9b941 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -277,9 +277,16 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr, if (old_len > vma->vm_end - addr) goto Efault; - if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)) { - if (new_len > old_len) + /* Need to be careful about a growing mapping */ + if (new_len > old_len) { + unsigned long pgoff; + + if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)) goto Efault; + pgoff = (addr - vma->vm_start) >> PAGE_SHIFT; + pgoff += vma->vm_pgoff; + if (pgoff + (new_len >> PAGE_SHIFT) < pgoff) + goto Einval; } if (vma->vm_flags & VM_LOCKED) { ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-04-06 15:32 ` Linus Torvalds @ 2011-04-06 15:43 ` Hugh Dickins 2011-04-06 15:59 ` Linus Torvalds 0 siblings, 1 reply; 37+ messages in thread From: Hugh Dickins @ 2011-04-06 15:43 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, Robert Święcki, Andrew Morton, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel [-- Attachment #1: Type: TEXT/PLAIN, Size: 2555 bytes --] On Wed, 6 Apr 2011, Linus Torvalds wrote: > On Wed, Apr 6, 2011 at 7:47 AM, Hugh Dickins <hughd@google.com> wrote: > >> > >> I dunno. But that odd negative pg_off thing makes me think there is > >> some overflow issue (ie HEAP_INDEX being pg_off + size ends up > >> fluctuating between really big and really small). So I'd suspect THAT > >> as the main reason. > > > > Yes, one of the vmas is such that the end offset (pgoff of next page > > after) would be 0, and for the other it would be 16. There's sure to > > be places, inside the prio_tree code and outside it, where we rely > > upon pgoff not wrapping around - wrap should be prevented by original > > validation of arguments. > > Well, we _do_ validate them in do_mmap_pgoff(), which is the main > routine for all the mmap() system calls, and the main way to get a new > mapping. > > There are other ways, like do_brk(), but afaik that always sets > vm_pgoff to the virtual address (shifted), so again the new mapping > should be fine. > > So when a new mapping is created, it should all be ok. > > But I think mremap() may end up expanding it without doing the same > overflow check. > > Do you see any other way to get this situation? Does the vma dump give > you any hint about where it came from? > > Robert - here's a (UNTESTED!) patch to make mremap() be a bit more > careful about vm_pgoff when growing a mapping. Does it make any > difference? I'd come to the same conclusion: the original page_mapped BUG has itself suggested that mremap() is getting used. I was about to send you my own UNTESTED patch: let me append it anyway, I think it is more correct than yours (it's the offset of vm_end we need to worry about, and there's the funny old_len,new_len stuff). See what you think - sorry, I'm going out now. Hugh --- 2.6.38/mm/mremap.c 2011-03-14 18:20:32.000000000 -0700 +++ linux/mm/mremap.c 2011-04-06 08:31:46.000000000 -0700 @@ -282,6 +282,12 @@ static struct vm_area_struct *vma_to_res goto Efault; } + if (vma->vm_file && new_len > old_len) { + pgoff_t endoff = linear_page_index(vma, vma->vm_end); + if (endoff + ((new_len - old_len) >> PAGE_SHIFT) < endoff) + goto Eoverflow; + } + if (vma->vm_flags & VM_LOCKED) { unsigned long locked, lock_limit; locked = mm->locked_vm << PAGE_SHIFT; @@ -311,6 +317,8 @@ Enomem: return ERR_PTR(-ENOMEM); Eagain: return ERR_PTR(-EAGAIN); +Eoverflow: + return ERR_PTR(-EOVERFLOW); } static unsigned long mremap_to(unsigned long addr, ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-04-06 15:43 ` Hugh Dickins @ 2011-04-06 15:59 ` Linus Torvalds 2011-04-06 17:54 ` Robert Święcki 2011-04-07 14:17 ` Hugh Dickins 0 siblings, 2 replies; 37+ messages in thread From: Linus Torvalds @ 2011-04-06 15:59 UTC (permalink / raw) To: Hugh Dickins Cc: Robert Święcki, Andrew Morton, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel On Wed, Apr 6, 2011 at 8:43 AM, Hugh Dickins <hughd@google.com> wrote: > > I was about to send you my own UNTESTED patch: let me append it anyway, > I think it is more correct than yours (it's the offset of vm_end we need > to worry about, and there's the funny old_len,new_len stuff). Umm. That's what my patch did too. The pgoff = (addr - vma->vm_start) >> PAGE_SHIFT; is the "offset of the pgoff" from the original mapping, then we do pgoff += vma->vm_pgoff; to get the pgoff of the new mapping, and then we do if (pgoff + (new_len >> PAGE_SHIFT) < pgoff) to check that the new mapping is ok. I think yours is equivalent, just a different (and odd - that linear_page_index() thing will do lots of unnecessary shifts and hugepage crap) way of writing it. > See what you think - sorry, I'm going out now. I think _yours_ is conceptually buggy, because I think that test for "vma->vm_file" is wrong. Yes, new anonymous mappings set vm_pgoff to the virtual address, but that's not true for mremap() moving them around, afaik. Admittedly it's really hard to get to the overflow case, because the address is shifted down, so even if you start out with an anonymous mmap at a high address (to get a big vm_off), and then move it down and expand it (to get a big size), I doubt you can possibly overflow. But I still don't think that the test for vm_file is semantically sensible, even if it might not _matter_. But whatever. I suspect both our patches are practically doing the same thing, and it would be interesting to hear if it actually fixes the issue. Maybe there is some other way to mess up vm_pgoff that I can't think of right now. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-04-06 15:59 ` Linus Torvalds @ 2011-04-06 17:54 ` Robert Święcki 2011-04-07 12:41 ` Robert Święcki 2011-04-07 14:17 ` Hugh Dickins 1 sibling, 1 reply; 37+ messages in thread From: Robert Święcki @ 2011-04-06 17:54 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel On Wed, Apr 6, 2011 at 5:59 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Wed, Apr 6, 2011 at 8:43 AM, Hugh Dickins <hughd@google.com> wrote: >> >> I was about to send you my own UNTESTED patch: let me append it anyway, >> I think it is more correct than yours (it's the offset of vm_end we need >> to worry about, and there's the funny old_len,new_len stuff). > > Umm. That's what my patch did too. The > > pgoff = (addr - vma->vm_start) >> PAGE_SHIFT; > > is the "offset of the pgoff" from the original mapping, then we do > > pgoff += vma->vm_pgoff; > > to get the pgoff of the new mapping, and then we do > > if (pgoff + (new_len >> PAGE_SHIFT) < pgoff) > > to check that the new mapping is ok. > > I think yours is equivalent, just a different (and odd - that > linear_page_index() thing will do lots of unnecessary shifts and > hugepage crap) way of writing it. > >> See what you think - sorry, I'm going out now. > > I think _yours_ is conceptually buggy, because I think that test for > "vma->vm_file" is wrong. > > Yes, new anonymous mappings set vm_pgoff to the virtual address, but > that's not true for mremap() moving them around, afaik. > > Admittedly it's really hard to get to the overflow case, because the > address is shifted down, so even if you start out with an anonymous > mmap at a high address (to get a big vm_off), and then move it down > and expand it (to get a big size), I doubt you can possibly overflow. > But I still don't think that the test for vm_file is semantically > sensible, even if it might not _matter_. > > But whatever. I suspect both our patches are practically doing the > same thing, and it would be interesting to hear if it actually fixes > the issue. Maybe there is some other way to mess up vm_pgoff that I > can't think of right now. Testing with Linus' patch. Will let you know in a few hours. -- Robert Święcki -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-04-06 17:54 ` Robert Święcki @ 2011-04-07 12:41 ` Robert Święcki 2011-04-07 14:24 ` Hugh Dickins 0 siblings, 1 reply; 37+ messages in thread From: Robert Święcki @ 2011-04-07 12:41 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel >>> I was about to send you my own UNTESTED patch: let me append it anyway, >>> I think it is more correct than yours (it's the offset of vm_end we need >>> to worry about, and there's the funny old_len,new_len stuff). >> >> Umm. That's what my patch did too. The >> >> pgoff = (addr - vma->vm_start) >> PAGE_SHIFT; >> >> is the "offset of the pgoff" from the original mapping, then we do >> >> pgoff += vma->vm_pgoff; >> >> to get the pgoff of the new mapping, and then we do >> >> if (pgoff + (new_len >> PAGE_SHIFT) < pgoff) >> >> to check that the new mapping is ok. >> >> I think yours is equivalent, just a different (and odd - that >> linear_page_index() thing will do lots of unnecessary shifts and >> hugepage crap) way of writing it. >> >>> See what you think - sorry, I'm going out now. >> >> I think _yours_ is conceptually buggy, because I think that test for >> "vma->vm_file" is wrong. >> >> Yes, new anonymous mappings set vm_pgoff to the virtual address, but >> that's not true for mremap() moving them around, afaik. >> >> Admittedly it's really hard to get to the overflow case, because the >> address is shifted down, so even if you start out with an anonymous >> mmap at a high address (to get a big vm_off), and then move it down >> and expand it (to get a big size), I doubt you can possibly overflow. >> But I still don't think that the test for vm_file is semantically >> sensible, even if it might not _matter_. >> >> But whatever. I suspect both our patches are practically doing the >> same thing, and it would be interesting to hear if it actually fixes >> the issue. Maybe there is some other way to mess up vm_pgoff that I >> can't think of right now. > > Testing with Linus' patch. Will let you know in a few hours. Ok, nothing happened after ~20h. The bug, usually, was triggered within 5-10h. I can add some printk in this condition, and let it run for a few days (I will not have access to my testing machine throughout that time), if you think this will confirm your hypothesis. -- Robert Święcki -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-04-07 12:41 ` Robert Święcki @ 2011-04-07 14:24 ` Hugh Dickins 2011-04-12 9:58 ` Robert Święcki 0 siblings, 1 reply; 37+ messages in thread From: Hugh Dickins @ 2011-04-07 14:24 UTC (permalink / raw) To: Robert Swiecki Cc: Linus Torvalds, Andrew Morton, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel On Thu, 7 Apr 2011, Robert Swiecki wrote: > > > > Testing with Linus' patch. Will let you know in a few hours. > > Ok, nothing happened after ~20h. The bug, usually, was triggered within 5-10h. > > I can add some printk in this condition, and let it run for a few days > (I will not have access to my testing machine throughout that time), > if you think this will confirm your hypothesis. That's great, thanks Robert. If the machine has nothing better to do, then it would be nice to let it run a little longer (a few days if that's what suits you), but it does look good so far. Though I'm afraid you'll now discover something else entirely ;) Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-04-07 14:24 ` Hugh Dickins @ 2011-04-12 9:58 ` Robert Święcki 2011-04-12 14:21 ` Linus Torvalds 0 siblings, 1 reply; 37+ messages in thread From: Robert Święcki @ 2011-04-12 9:58 UTC (permalink / raw) To: Hugh Dickins Cc: Linus Torvalds, Andrew Morton, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel On Thu, Apr 7, 2011 at 4:24 PM, Hugh Dickins <hughd@google.com> wrote: > On Thu, 7 Apr 2011, Robert Swiecki wrote: >> > >> > Testing with Linus' patch. Will let you know in a few hours. >> >> Ok, nothing happened after ~20h. The bug, usually, was triggered within 5-10h. >> >> I can add some printk in this condition, and let it run for a few days >> (I will not have access to my testing machine throughout that time), >> if you think this will confirm your hypothesis. > > That's great, thanks Robert. If the machine has nothing better to do, > then it would be nice to let it run a little longer (a few days if that's > what suits you), but it does look good so far. Though I'm afraid you'll > now discover something else entirely ;) Ok, I added printk here: if (new_len > old_len) { unsigned long pgoff; if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)) goto Efault; pgoff = (addr - vma->vm_start) >> PAGE_SHIFT; pgoff += vma->vm_pgoff; if (pgoff + (new_len >> PAGE_SHIFT) < pgoff) { printk("VMA_TO_RESIZE: ADDR:%lx OLD_LEN:%lx NEW_LEN:%lx PGOFF: %lx VMA->VM_START:%lx VMA->VM_FLAGS:%lx", addr, old_len, new_len, pgoff, vma->vm_start, vma->vm_flags); goto Einval; } } and after a few mins of fuzzing I get: [ 584.224028] VMA_TO_RESIZE: ADDR:f751f000 OLD_LEN:6000 NEW_LEN:c000 PGOFF: fffffffffffffffa VMA->VM_START:f751f000 VMA->VM_FLAGS:2321fa [ 639.777561] VMA_TO_RESIZE: ADDR:f751f000 OLD_LEN:6000 NEW_LEN:b000 PGOFF: fffffffffffffffa VMA->VM_START:f751f000 VMA->VM_FLAGS:2301f8 So, if this case is not caught later on in the code, I guess it solves the problem. During the fuzzing I didn't experience any panic's, but some other problems arose, i.e. cannot read /proc/<pid>/maps for some processes (sys_read hangs, and such process cannot be killed or stopped with any signal, still it's running (R state) and using CPU - I'll submit another report for that). -- Robert Święcki -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-04-12 9:58 ` Robert Święcki @ 2011-04-12 14:21 ` Linus Torvalds [not found] ` <BANLkTik6U21r91DYiUsz9A0P--=5QcsBrA@mail.gmail.com> 0 siblings, 1 reply; 37+ messages in thread From: Linus Torvalds @ 2011-04-12 14:21 UTC (permalink / raw) To: Robert Święcki Cc: Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel On Tue, Apr 12, 2011 at 2:58 AM, Robert Święcki <robert@swiecki.net> wrote: > > So, if this case is not caught later on in the code, I guess it solves > the problem. During the fuzzing I didn't experience any panic's, but > some other problems arose, i.e. cannot read /proc/<pid>/maps for some > processes (sys_read hangs, and such process cannot be killed or > stopped with any signal, still it's running (R state) and using CPU - > I'll submit another report for that). Hmm. Sounds like an endless loop in kernel mode. Use "perf record -ag" as root, it should show up very clearly in the report. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
[parent not found: <BANLkTik6U21r91DYiUsz9A0P--=5QcsBrA@mail.gmail.com>]
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG [not found] ` <BANLkTik6U21r91DYiUsz9A0P--=5QcsBrA@mail.gmail.com> @ 2011-04-12 16:17 ` Robert Święcki 2011-04-12 17:19 ` Linus Torvalds 1 sibling, 0 replies; 37+ messages in thread From: Robert Święcki @ 2011-04-12 16:17 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel >>> So, if this case is not caught later on in the code, I guess it solves >>> the problem. During the fuzzing I didn't experience any panic's, but >>> some other problems arose, i.e. cannot read /proc/<pid>/maps for some >>> processes (sys_read hangs, and such process cannot be killed or >>> stopped with any signal, still it's running (R state) and using CPU - >>> I'll submit another report for that). >> >> Hmm. Sounds like an endless loop in kernel mode. >> >> Use "perf record -ag" as root, it should show up very clearly in the report. >> >> Linus > > I've put some data here - > http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/4345dcc4f7750ce2 > - I think it's somewhat connected (sys_mlock appears on both cases). > > Attaching perf data (for 2.6.38) + kdb dumpall + procdump for process 14158 > > Those 3 processes cannot be stopped/killed > > 14158 66.2 0.0 8380 3012 ? RL /tmp/iknowthis > 17100 63.6 0.1 18248 4004 ? RL /tmp/iknowthis > 19772 63.8 0.0 4000 1888 ? RL /tmp/iknowthis Also, the system doesn't look usable after such fuzzing (executing a few times some pretty deterministic program) root@ise-test:~# gcc -m32 mlock.c -o mlock root@ise-test:~# ./mlock ./mlock: relocation error: ./mlock: symbol perror, version GLIBC_2.0 not defined in file libc.so.6 with link time reference root@ise-test:~# ./mlock mmap: Success RET: 0xf751f000 mremap: Invalid argument RET: 0xffffffff root@ise-test:~# ./mlock Segmentation fault root@ise-test:~# dmesg | tail -n 1 [ 5164.961568] mlock[7097]: segfault at 0 ip (null) sp 00000000ff8a00d4 error 14 in mlock[8048000+1000] -- Robert Święcki -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG [not found] ` <BANLkTik6U21r91DYiUsz9A0P--=5QcsBrA@mail.gmail.com> 2011-04-12 16:17 ` Robert Święcki @ 2011-04-12 17:19 ` Linus Torvalds 2011-04-12 18:59 ` Linus Torvalds 1 sibling, 1 reply; 37+ messages in thread From: Linus Torvalds @ 2011-04-12 17:19 UTC (permalink / raw) To: Robert Święcki Cc: Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel [-- Attachment #1: Type: text/plain, Size: 1701 bytes --] On Tue, Apr 12, 2011 at 8:48 AM, Robert Święcki <robert@swiecki.net> wrote: >> >> Hmm. Sounds like an endless loop in kernel mode. >> >> Use "perf record -ag" as root, it should show up very clearly in the report. > > I've put some data here - > http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/4345dcc4f7750ce2 > - I think it's somewhat connected (sys_mlock appears on both cases). Ok, so it's definitely sys_mlock. And I suspect it's due to commit 53a7706d5ed8 somehow looping forever. One possible cause would be how that commit made things care deeply about the return value of __get_user_pages(), and in particular what happens when that return value is zero. It ends up looping forever in do_mlock_pages() for that case, because it does nend = nstart + ret * PAGE_SIZE; so now the next round we'll set "nstart = nend" and start all over. I see at least one way __get_user_pages() will return zero, and it's if it is passed a npages of 0 to begin with. Which can easily happen if you try to mlock() the first page of a stack segment: the code will jump over that stack segment page, and then have nothing to do, and return zero. So then do_mlock_pages() will just keep on trying again. THIS IS A HACKY AND UNTESTED PATCH! It's ugly as hell, because the real problem is do_mlock_pages() caring too damn much about the return value, and us hiding the whole stack page thing in that function. I wouldn't want to commit it as-is, but if you can easily reproduce the problem, it's a good patch to test out the theory. Assuming I didn't screw something up. Again, TOTALLY UNTESTED! Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 1066 bytes --] mm/mlock.c | 10 +++++++++- 1 files changed, 9 insertions(+), 1 deletions(-) diff --git a/mm/mlock.c b/mm/mlock.c index 2689a08c79af..080c219973ea 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -162,6 +162,7 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma, unsigned long addr = start; int nr_pages = (end - start) / PAGE_SIZE; int gup_flags; + long retval, offset; VM_BUG_ON(start & ~PAGE_MASK); VM_BUG_ON(end & ~PAGE_MASK); @@ -189,13 +190,20 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma, gup_flags |= FOLL_MLOCK; /* We don't try to access the guard page of a stack vma */ + offset = 0; if (stack_guard_page(vma, start)) { addr += PAGE_SIZE; nr_pages--; + offset = 1; } - return __get_user_pages(current, mm, addr, nr_pages, gup_flags, + retval = __get_user_pages(current, mm, addr, nr_pages, gup_flags, NULL, NULL, nonblocking); + + /* Get the return value correct even in the face of the guard page */ + if (retval < 0) + return offset ? : retval; + return retval + offset; } /* ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-04-12 17:19 ` Linus Torvalds @ 2011-04-12 18:59 ` Linus Torvalds 2011-04-12 19:02 ` Robert Święcki 0 siblings, 1 reply; 37+ messages in thread From: Linus Torvalds @ 2011-04-12 18:59 UTC (permalink / raw) To: Robert Święcki Cc: Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel [-- Attachment #1: Type: text/plain, Size: 437 bytes --] On Tue, Apr 12, 2011 at 10:19 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > THIS IS A HACKY AND UNTESTED PATCH! .. and here is a rather less hacky, but still equally untested patch. It moves the stack guard page handling into __get_user_pages() itself, and thus avoids the whole problem. This one I could easily see myself committing. Assuming I get some ack's and testing.. Comments? Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 3078 bytes --] mm/memory.c | 26 ++++++++++++++++++-------- mm/mlock.c | 13 ------------- 2 files changed, 18 insertions(+), 21 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 9da8cab1b1b0..b623a249918c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1410,6 +1410,13 @@ no_page_table: return page; } +static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr) +{ + return (vma->vm_flags & VM_GROWSDOWN) && + (vma->vm_start == addr) && + !vma_stack_continue(vma->vm_prev, addr); +} + /** * __get_user_pages() - pin user pages in memory * @tsk: task_struct of target task @@ -1488,7 +1495,6 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, vma = find_extend_vma(mm, start); if (!vma && in_gate_area(mm, start)) { unsigned long pg = start & PAGE_MASK; - struct vm_area_struct *gate_vma = get_gate_vma(mm); pgd_t *pgd; pud_t *pud; pmd_t *pmd; @@ -1513,10 +1519,11 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, pte_unmap(pte); return i ? : -EFAULT; } + vma = get_gate_vma(mm); if (pages) { struct page *page; - page = vm_normal_page(gate_vma, start, *pte); + page = vm_normal_page(vma, start, *pte); if (!page) { if (!(gup_flags & FOLL_DUMP) && is_zero_pfn(pte_pfn(*pte))) @@ -1530,12 +1537,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, get_page(page); } pte_unmap(pte); - if (vmas) - vmas[i] = gate_vma; - i++; - start += PAGE_SIZE; - nr_pages--; - continue; + goto next_page; } if (!vma || @@ -1549,6 +1551,13 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, continue; } + /* + * If we don't actually want the page itself, + * and it's the stack guard page, just skip it. + */ + if (!pages && stack_guard_page(vma, start)) + goto next_page; + do { struct page *page; unsigned int foll_flags = gup_flags; @@ -1631,6 +1640,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, flush_anon_page(vma, page, start); flush_dcache_page(page); } +next_page: if (vmas) vmas[i] = vma; i++; diff --git a/mm/mlock.c b/mm/mlock.c index 2689a08c79af..6b55e3efe0df 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -135,13 +135,6 @@ void munlock_vma_page(struct page *page) } } -static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr) -{ - return (vma->vm_flags & VM_GROWSDOWN) && - (vma->vm_start == addr) && - !vma_stack_continue(vma->vm_prev, addr); -} - /** * __mlock_vma_pages_range() - mlock a range of pages in the vma. * @vma: target vma @@ -188,12 +181,6 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma, if (vma->vm_flags & VM_LOCKED) gup_flags |= FOLL_MLOCK; - /* We don't try to access the guard page of a stack vma */ - if (stack_guard_page(vma, start)) { - addr += PAGE_SIZE; - nr_pages--; - } - return __get_user_pages(current, mm, addr, nr_pages, gup_flags, NULL, NULL, nonblocking); } ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-04-12 18:59 ` Linus Torvalds @ 2011-04-12 19:02 ` Robert Święcki 2011-04-12 19:38 ` Linus Torvalds 0 siblings, 1 reply; 37+ messages in thread From: Robert Święcki @ 2011-04-12 19:02 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel On Tue, Apr 12, 2011 at 8:59 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, Apr 12, 2011 at 10:19 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> THIS IS A HACKY AND UNTESTED PATCH! > > .. and here is a rather less hacky, but still equally untested patch. > It moves the stack guard page handling into __get_user_pages() itself, > and thus avoids the whole problem. > > This one I could easily see myself committing. Assuming I get some > ack's and testing.. I'm testing currently with the old one, w/o any symptoms of problems by now, but it's not a meaningful period of time. I can try with the new one, leave it over(European)night, and let you know tomorrow. -- Robert Święcki -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-04-12 19:02 ` Robert Święcki @ 2011-04-12 19:38 ` Linus Torvalds 2011-04-18 21:15 ` Michel Lespinasse 0 siblings, 1 reply; 37+ messages in thread From: Linus Torvalds @ 2011-04-12 19:38 UTC (permalink / raw) To: Robert Święcki Cc: Hugh Dickins, Andrew Morton, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel On Tue, Apr 12, 2011 at 12:02 PM, Robert Święcki <robert@swiecki.net> wrote: > > I'm testing currently with the old one, w/o any symptoms of problems > by now, but it's not a meaningful period of time. I can try with the > new one, leave it over(European)night, and let you know tomorrow. You might as well keep testing the old one, if that gives it better coverage. No need to disrupt anything you already have running. The more important input is "was that actually the root cause", rather than deciding between the ugly or clean way of fixing it. So if the first patch fixes it, then I'm pretty sure the second one will too - just in a cleaner manner. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-04-12 19:38 ` Linus Torvalds @ 2011-04-18 21:15 ` Michel Lespinasse 2011-05-05 0:09 ` Michel Lespinasse 0 siblings, 1 reply; 37+ messages in thread From: Michel Lespinasse @ 2011-04-18 21:15 UTC (permalink / raw) To: Linus Torvalds Cc: Robert Święcki, Hugh Dickins, Andrew Morton, Miklos Szeredi, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel On Tue, Apr 12, 2011 at 12:38 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, Apr 12, 2011 at 12:02 PM, Robert Święcki <robert@swiecki.net> wrote: >> >> I'm testing currently with the old one, w/o any symptoms of problems >> by now, but it's not a meaningful period of time. I can try with the >> new one, leave it over(European)night, and let you know tomorrow. > > You might as well keep testing the old one, if that gives it better > coverage. No need to disrupt anything you already have running. > > The more important input is "was that actually the root cause", rather > than deciding between the ugly or clean way of fixing it. > > So if the first patch fixes it, then I'm pretty sure the second one > will too - just in a cleaner manner. Sorry for the delayed response - I have been traveling abroad in the last two weeks and until the end of the month. This second patch looks more attractive than the first, but is also harder to prove correct. Hugh looked at all gup call sites and convinced himself that the change was safe, except for the fault_in_user_writeable() site in futex.c which he asked me to look at. I am worried that we would have an issue there, as places like futex_wake_op() or fixup_pi_state_owner() operate on user memory with page faults disabled, and expect fault_in_user_writeable() to set up the user page so that they can retry if the initial access failed. With this proposal, fault_in_user_writeable() would become inoperative when the address is within the guard page; this could cause some malicious futex operation to create an infinite loop. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-04-18 21:15 ` Michel Lespinasse @ 2011-05-05 0:09 ` Michel Lespinasse 2011-05-05 0:38 ` Linus Torvalds 0 siblings, 1 reply; 37+ messages in thread From: Michel Lespinasse @ 2011-05-05 0:09 UTC (permalink / raw) To: Linus Torvalds Cc: Robert Święcki, Hugh Dickins, Andrew Morton, Miklos Szeredi, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel FYI, the attached code causes an infinite loop in kernels that have the 95042f9eb7 commit: #include <stdio.h> #include <string.h> #include <unistd.h> #include <sys/syscall.h> #include <linux/futex.h> int *get_stack_guard(void) { FILE *map; char buf[1000]; map = fopen("/proc/self/maps", "r"); if (!map) return NULL; while(fgets(buf, 1000, map)) { long a, b; char c[1000], d[1000], e[1000], f[1000], g[1000]; if (sscanf(buf, "%lx-%lx %s %s %s %s %s", &a, &b, c, d, e, f, g) == 7 && !strcmp(g, "[stack]")) { fclose(map); return (int *)(a - 4096); } } fclose(map); return NULL; } int main(void) { int *uaddr = get_stack_guard(); syscall(SYS_futex, uaddr, FUTEX_LOCK_PI_PRIVATE, 0, NULL, NULL, 0); return 0; } Linus, I am not sure as to what would be the preferred way to fix this. One option could be to modify fault_in_user_writeable so that it passes a non-NULL page pointer, and just does a put_page on it afterwards. While this would work, this is kinda ugly and would slow down futex operations somewhat. A more conservative alternative could be to enable the guard page special case under an new GUP flag, but this loses much of the elegance of your original proposal... On Mon, Apr 18, 2011 at 2:15 PM, Michel Lespinasse <walken@google.com> wrote: > This second patch looks more attractive than the first, but is also > harder to prove correct. Hugh looked at all gup call sites and > convinced himself that the change was safe, except for the > fault_in_user_writeable() site in futex.c which he asked me to look > at. I am worried that we would have an issue there, as places like > futex_wake_op() or fixup_pi_state_owner() operate on user memory with > page faults disabled, and expect fault_in_user_writeable() to set up > the user page so that they can retry if the initial access failed. > With this proposal, fault_in_user_writeable() would become inoperative > when the address is within the guard page; this could cause some > malicious futex operation to create an infinite loop. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-05-05 0:09 ` Michel Lespinasse @ 2011-05-05 0:38 ` Linus Torvalds 2011-05-05 1:18 ` Michel Lespinasse 0 siblings, 1 reply; 37+ messages in thread From: Linus Torvalds @ 2011-05-05 0:38 UTC (permalink / raw) To: Michel Lespinasse Cc: Robert Święcki, Hugh Dickins, Andrew Morton, Miklos Szeredi, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel [-- Attachment #1: Type: text/plain, Size: 1654 bytes --] On Wed, May 4, 2011 at 5:09 PM, Michel Lespinasse <walken@google.com> wrote: > > FYI, the attached code causes an infinite loop in kernels that have > the 95042f9eb7 commit: Mmm. Yes. The atomic fault will never work, and the get_user_pages() thing won't either, so things will just loop forever. > Linus, I am not sure as to what would be the preferred way to fix > this. One option could be to modify fault_in_user_writeable so that it > passes a non-NULL page pointer, and just does a put_page on it > afterwards. While this would work, this is kinda ugly and would slow > down futex operations somewhat. No, that's just ugly as hell. > A more conservative alternative could > be to enable the guard page special case under an new GUP flag, but > this loses much of the elegance of your original proposal... How about only doing that only for FOLL_MLOCK? Also, looking at mm/mlock.c, why _do_ we call get_user_pages() even if the vma isn't mlocked? That looks bogus. Since we have dropped the mm_semaphore, an unlock may have happened, and afaik we should *not* try to bring those pages back in at all. There's this whole comment about that in the caller ("__mlock_vma_pages_range() double checks the vma flags, so that it won't mlock pages if the vma was already munlocked."), but despite that it would actually call __get_user_pages() even if the VM_LOCKED bit had been cleared (it just wouldn't call it with the FOLL_MLOCK flag). So maybe something like the attached? UNTESTED! And maybe there was some really subtle reason to still call __get_user_pages() without that FOLL_MLOCK thing that I'm missing. Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 1442 bytes --] mm/memory.c | 2 +- mm/mlock.c | 8 ++++---- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 607098d47e74..f7a487c908a5 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1555,7 +1555,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, * If we don't actually want the page itself, * and it's the stack guard page, just skip it. */ - if (!pages && stack_guard_page(vma, start)) + if (!pages && (gup_flags & FOLL_MLOCK) && stack_guard_page(vma, start)) goto next_page; do { diff --git a/mm/mlock.c b/mm/mlock.c index 6b55e3efe0df..8ed7fd09f81c 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -162,7 +162,10 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma, VM_BUG_ON(end > vma->vm_end); VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem)); - gup_flags = FOLL_TOUCH; + if (!(vma->vm_flags & VM_LOCKED)) + return nr_pages; + + gup_flags = FOLL_TOUCH | FOLL_MLOCK; /* * We want to touch writable mappings with a write fault in order * to break COW, except for shared mappings because these don't COW @@ -178,9 +181,6 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma, if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) gup_flags |= FOLL_FORCE; - if (vma->vm_flags & VM_LOCKED) - gup_flags |= FOLL_MLOCK; - return __get_user_pages(current, mm, addr, nr_pages, gup_flags, NULL, NULL, nonblocking); } ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-05-05 0:38 ` Linus Torvalds @ 2011-05-05 1:18 ` Michel Lespinasse 2011-05-05 1:40 ` Linus Torvalds 0 siblings, 1 reply; 37+ messages in thread From: Michel Lespinasse @ 2011-05-05 1:18 UTC (permalink / raw) To: Linus Torvalds Cc: Robert Święcki, Hugh Dickins, Andrew Morton, Miklos Szeredi, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel On Wed, May 4, 2011 at 5:38 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: >> A more conservative alternative could >> be to enable the guard page special case under an new GUP flag, but >> this loses much of the elegance of your original proposal... > > How about only doing that only for FOLL_MLOCK? Sounds reasonable. > Also, looking at mm/mlock.c, why _do_ we call get_user_pages() even if > the vma isn't mlocked? That looks bogus. Since we have dropped the > mm_semaphore, an unlock may have happened, and afaik we should *not* > try to bring those pages back in at all. There's this whole comment > about that in the caller ("__mlock_vma_pages_range() double checks the > vma flags, so that it won't mlock pages if the vma was already > munlocked."), but despite that it would actually call > __get_user_pages() even if the VM_LOCKED bit had been cleared (it just > wouldn't call it with the FOLL_MLOCK flag). There are two reasons VM_LOCKED might be cleared in __mlock_vma_pages_range(). It could be that one of the VM_SPECIAL flags were set on the VMA, in which case mlock() won't set VM_LOCKED but it still must make the pages present. Or, there is an munlock() executing concurrently with mlock() - in that case, the conservative thing to do is to give the same results as if the mlock() had completed before the munlock(). That is, the mlock() would have broken COW / made the pages present and the munlock() would have cleared the VM_LOCKED and PageMlocked flags. > UNTESTED! And maybe there was some really subtle reason to still call > __get_user_pages() without that FOLL_MLOCK thing that I'm missing. I think we want the mm/memory.c part of this proposal without the mm/mlock.c part. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-05-05 1:18 ` Michel Lespinasse @ 2011-05-05 1:40 ` Linus Torvalds 2011-05-05 3:37 ` Linus Torvalds 0 siblings, 1 reply; 37+ messages in thread From: Linus Torvalds @ 2011-05-05 1:40 UTC (permalink / raw) To: Michel Lespinasse Cc: Robert Święcki, Hugh Dickins, Andrew Morton, Miklos Szeredi, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel On Wed, May 4, 2011 at 6:18 PM, Michel Lespinasse <walken@google.com> wrote: > > I think we want the mm/memory.c part of this proposal without the > mm/mlock.c part. .. but what about mlock not setting the FOLL_MLOCK bit, then? In that case, you'd get the "mlock extends the stack" problem. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-05-05 1:40 ` Linus Torvalds @ 2011-05-05 3:37 ` Linus Torvalds 2011-05-05 4:26 ` Michel Lespinasse 0 siblings, 1 reply; 37+ messages in thread From: Linus Torvalds @ 2011-05-05 3:37 UTC (permalink / raw) To: Michel Lespinasse Cc: Robert Święcki, Hugh Dickins, Andrew Morton, Miklos Szeredi, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel [-- Attachment #1: Type: text/plain, Size: 1541 bytes --] Ok, so here's a slightly different approach. It still makes the FOLL_MLOCK be unconditional in the mlock path, but it really just pushes down the - gup_flags = FOLL_TOUCH; + gup_flags = FOLL_TOUCH | FOLL_MLOCK; ... - if (vma->vm_flags & VM_LOCKED) - gup_flags |= FOLL_MLOCK; from __mlock_vma_pages_range(), and moves the conditional into 'follow_page()' (which is the only _user_ of that flag) instead: - if (flags & FOLL_MLOCK) { + if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) { so semantically this changes nothing at all. But now, because __get_user_pages() can look at FOLL_MLOCK to see that it's a mlock access, we can do that whole "skip stack guard page" based on that flag: - if (!pages && stack_guard_page(vma, start)) + if ((gup_flags & FOLL_MLOCK) && stack_guard_page(vma, start)) which means that other uses will try to page in the stack guard page. I seriously considered making that "skip stack guard page" and the "mlock lookup" be two separate bits, because conceptually they are really pretty independent, but right now the only _users_ seem to be tied together, so I kept it as one single bit (FOLL_MLOCK). But as far as I can tell, the attached patch is 100% equivalent to what we do now, except for that "skip stack guard pages only for mlock" change. Comments? I like this patch because it seems to make the logic more straightforward. But somebody else should double-check my logic. Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 1728 bytes --] mm/memory.c | 7 +++---- mm/mlock.c | 5 +---- 2 files changed, 4 insertions(+), 8 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 607098d47e74..27f425378112 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1359,7 +1359,7 @@ split_fallthrough: */ mark_page_accessed(page); } - if (flags & FOLL_MLOCK) { + if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) { /* * The preliminary mapping check is mainly to avoid the * pointless overhead of lock_page on the ZERO_PAGE @@ -1552,10 +1552,9 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, } /* - * If we don't actually want the page itself, - * and it's the stack guard page, just skip it. + * For mlock, just skip the stack guard page. */ - if (!pages && stack_guard_page(vma, start)) + if ((gup_flags & FOLL_MLOCK) && stack_guard_page(vma, start)) goto next_page; do { diff --git a/mm/mlock.c b/mm/mlock.c index 6b55e3efe0df..516b2c2ddd5a 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -162,7 +162,7 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma, VM_BUG_ON(end > vma->vm_end); VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem)); - gup_flags = FOLL_TOUCH; + gup_flags = FOLL_TOUCH | FOLL_MLOCK; /* * We want to touch writable mappings with a write fault in order * to break COW, except for shared mappings because these don't COW @@ -178,9 +178,6 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma, if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) gup_flags |= FOLL_FORCE; - if (vma->vm_flags & VM_LOCKED) - gup_flags |= FOLL_MLOCK; - return __get_user_pages(current, mm, addr, nr_pages, gup_flags, NULL, NULL, nonblocking); } ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-05-05 3:37 ` Linus Torvalds @ 2011-05-05 4:26 ` Michel Lespinasse 0 siblings, 0 replies; 37+ messages in thread From: Michel Lespinasse @ 2011-05-05 4:26 UTC (permalink / raw) To: Linus Torvalds Cc: Robert Święcki, Hugh Dickins, Andrew Morton, Miklos Szeredi, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel On Wed, May 4, 2011 at 8:37 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > I seriously considered making that "skip stack guard page" and the > "mlock lookup" be two separate bits, because conceptually they are > really pretty independent, but right now the only _users_ seem to be > tied together, so I kept it as one single bit (FOLL_MLOCK). Yes, this seems best for now. > But as far as I can tell, the attached patch is 100% equivalent to > what we do now, except for that "skip stack guard pages only for > mlock" change. > > Comments? I like this patch because it seems to make the logic more > straightforward. > > But somebody else should double-check my logic. It makes perfect sense. Reviewed-by: Michel Lespinasse <walken@google.com> (I would argue for it to go to stable trees as well) -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH] mm: fix possible cause of a page_mapped BUG 2011-04-06 15:59 ` Linus Torvalds 2011-04-06 17:54 ` Robert Święcki @ 2011-04-07 14:17 ` Hugh Dickins 1 sibling, 0 replies; 37+ messages in thread From: Hugh Dickins @ 2011-04-07 14:17 UTC (permalink / raw) To: Linus Torvalds Cc: Robert Swiecki, Andrew Morton, Miklos Szeredi, Michel Lespinasse, Eric W. Biederman, linux-kernel, linux-mm, Peter Zijlstra, Rik van Riel [-- Attachment #1: Type: TEXT/PLAIN, Size: 4160 bytes --] On Wed, 6 Apr 2011, Linus Torvalds wrote: > On Wed, Apr 6, 2011 at 8:43 AM, Hugh Dickins <hughd@google.com> wrote: > > > > I was about to send you my own UNTESTED patch: let me append it anyway, > > I think it is more correct than yours (it's the offset of vm_end we need > > to worry about, and there's the funny old_len,new_len stuff). > > Umm. That's what my patch did too. The > > pgoff = (addr - vma->vm_start) >> PAGE_SHIFT; > > is the "offset of the pgoff" from the original mapping, then we do > > pgoff += vma->vm_pgoff; > > to get the pgoff of the new mapping, and then we do > > if (pgoff + (new_len >> PAGE_SHIFT) < pgoff) > > to check that the new mapping is ok. Right, I was forgetting the semantics for mremap when addr + old_len < vma->vm_end. It has to move out the old section and extend it elsewhere, it does not affect the page just before vma->vm_end at all. So mine was indeed a more complicated way of doing yours. > > I think yours is equivalent, just a different (and odd - that > linear_page_index() thing will do lots of unnecessary shifts and > hugepage crap) way of writing it. I was trying to use the common function provided: but it's actually wrong, that's a function for getting the value found in page->index (in units of PAGE_CACHE_SIZE), whereas here we want the value found in vm_pgoff (in units of PAGE_SIZE). Of course PAGE_CACHE_SIZE has equalled PAGE_SIZE everywhere but in some patches by Christoph Lameter a few years back, so there isn't an effective difference; but I was wrong to use that function. > > > See what you think - sorry, I'm going out now. > > I think _yours_ is conceptually buggy, because I think that test for > "vma->vm_file" is wrong. Just being cautious: we cannot hit the BUG in prio_tree.c when we're dealing with an anonymous mapping, and I didn't want to think about anonymous at the time. > > Yes, new anonymous mappings set vm_pgoff to the virtual address, but > that's not true for mremap() moving them around, afaik. > > Admittedly it's really hard to get to the overflow case, because the > address is shifted down, so even if you start out with an anonymous > mmap at a high address (to get a big vm_off), and then move it down > and expand it (to get a big size), I doubt you can possibly overflow. > But I still don't think that the test for vm_file is semantically > sensible, even if it might not _matter_. The strangest case is when a 64-bit kernel execs a 32-bit executable, preparing the stack with a very high virtual address which goes into vm_pgoff (shifted by PAGE_SHIFT), then moves that stack down into the 32-bit address space but leaving it with the original high vm_pgoff. I think you are now excluding some wild anonymous cases which were allowed before, and gave no trouble - vma_address() looks like a wrap won't upset it. But they're not cases which anyone is likely to do, and safer to keep the anon rules in synch with the file rules. > > But whatever. I suspect both our patches are practically doing the > same thing, and it would be interesting to hear if it actually fixes > the issue. Maybe there is some other way to mess up vm_pgoff that I > can't think of right now. Here's yours inline below: Acked-by: Hugh Dickins <hughd@google.com> --- mm/mremap.c | 11 +++++++++-- 1 files changed, 9 insertions(+), 2 deletions(-) diff --git a/mm/mremap.c b/mm/mremap.c index 1de98d492ddc..a7c1f9f9b941 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -277,9 +277,16 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr, if (old_len > vma->vm_end - addr) goto Efault; - if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)) { - if (new_len > old_len) + /* Need to be careful about a growing mapping */ + if (new_len > old_len) { + unsigned long pgoff; + + if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)) goto Efault; + pgoff = (addr - vma->vm_start) >> PAGE_SHIFT; + pgoff += vma->vm_pgoff; + if (pgoff + (new_len >> PAGE_SHIFT) < pgoff) + goto Einval; } if (vma->vm_flags & VM_LOCKED) { ^ permalink raw reply related [flat|nested] 37+ messages in thread
end of thread, other threads:[~2011-05-05 4:26 UTC | newest] Thread overview: 37+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-02-24 5:39 [PATCH] mm: fix possible cause of a page_mapped BUG Hugh Dickins 2011-02-28 23:35 ` Robert Święcki 2011-03-17 15:40 ` Robert Święcki 2011-03-19 5:34 ` Hugh Dickins 2011-04-01 14:34 ` Robert Święcki 2011-04-01 15:44 ` Linus Torvalds 2011-04-01 16:21 ` Robert Święcki 2011-04-01 16:35 ` Linus Torvalds 2011-04-02 4:01 ` Hui Zhu 2011-04-04 13:02 ` Robert Święcki 2011-04-02 1:46 ` Hugh Dickins 2011-04-04 12:46 ` Robert Święcki 2011-04-04 18:30 ` Hugh Dickins 2011-04-05 12:21 ` Robert Święcki 2011-04-05 15:37 ` Linus Torvalds 2011-04-06 14:47 ` Hugh Dickins 2011-04-06 15:32 ` Linus Torvalds 2011-04-06 15:43 ` Hugh Dickins 2011-04-06 15:59 ` Linus Torvalds 2011-04-06 17:54 ` Robert Święcki 2011-04-07 12:41 ` Robert Święcki 2011-04-07 14:24 ` Hugh Dickins 2011-04-12 9:58 ` Robert Święcki 2011-04-12 14:21 ` Linus Torvalds [not found] ` <BANLkTik6U21r91DYiUsz9A0P--=5QcsBrA@mail.gmail.com> 2011-04-12 16:17 ` Robert Święcki 2011-04-12 17:19 ` Linus Torvalds 2011-04-12 18:59 ` Linus Torvalds 2011-04-12 19:02 ` Robert Święcki 2011-04-12 19:38 ` Linus Torvalds 2011-04-18 21:15 ` Michel Lespinasse 2011-05-05 0:09 ` Michel Lespinasse 2011-05-05 0:38 ` Linus Torvalds 2011-05-05 1:18 ` Michel Lespinasse 2011-05-05 1:40 ` Linus Torvalds 2011-05-05 3:37 ` Linus Torvalds 2011-05-05 4:26 ` Michel Lespinasse 2011-04-07 14:17 ` Hugh Dickins
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).