* mm: fix BUG in __split_huge_page_pmd
@ 2013-10-15 11:08 Hugh Dickins
2013-10-15 11:32 ` Kirill A. Shutemov
2013-10-15 14:34 ` Andrea Arcangeli
0 siblings, 2 replies; 12+ messages in thread
From: Hugh Dickins @ 2013-10-15 11:08 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrea Arcangeli, David Rientjes, Kirill A. Shutemov,
Naoya Horiguchi, linux-kernel, linux-mm
Occasionally we hit the BUG_ON(pmd_trans_huge(*pmd)) at the end of
__split_huge_page_pmd(): seen when doing madvise(,,MADV_DONTNEED).
It's invalid: we don't always have down_write of mmap_sem there:
a racing do_huge_pmd_wp_page() might have copied-on-write to another
huge page before our split_huge_page() got the anon_vma lock.
Forget the BUG_ON, just go back and try again if this happens.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org
---
mm/huge_memory.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
--- 3.12-rc5/mm/huge_memory.c 2013-09-16 17:37:56.811072270 -0700
+++ linux/mm/huge_memory.c 2013-10-15 03:40:02.044138488 -0700
@@ -2697,6 +2697,7 @@ void __split_huge_page_pmd(struct vm_are
mmun_start = haddr;
mmun_end = haddr + HPAGE_PMD_SIZE;
+again:
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_trans_huge(*pmd))) {
@@ -2719,7 +2720,14 @@ void __split_huge_page_pmd(struct vm_are
split_huge_page(page);
put_page(page);
- BUG_ON(pmd_trans_huge(*pmd));
+
+ /*
+ * We don't always have down_write of mmap_sem here: a racing
+ * do_huge_pmd_wp_page() might have copied-on-write to another
+ * huge page before our split_huge_page() got the anon_vma lock.
+ */
+ if (unlikely(pmd_trans_huge(*pmd)))
+ goto again;
}
void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 12+ messages in thread* RE: mm: fix BUG in __split_huge_page_pmd 2013-10-15 11:08 mm: fix BUG in __split_huge_page_pmd Hugh Dickins @ 2013-10-15 11:32 ` Kirill A. Shutemov 2013-10-15 14:41 ` Andrea Arcangeli 2013-10-15 14:34 ` Andrea Arcangeli 1 sibling, 1 reply; 12+ messages in thread From: Kirill A. Shutemov @ 2013-10-15 11:32 UTC (permalink / raw) To: Hugh Dickins Cc: Andrew Morton, Andrea Arcangeli, David Rientjes, Kirill A. Shutemov, Naoya Horiguchi, linux-kernel, linux-mm Hugh Dickins wrote: > Occasionally we hit the BUG_ON(pmd_trans_huge(*pmd)) at the end of > __split_huge_page_pmd(): seen when doing madvise(,,MADV_DONTNEED). > > It's invalid: we don't always have down_write of mmap_sem there: > a racing do_huge_pmd_wp_page() might have copied-on-write to another > huge page before our split_huge_page() got the anon_vma lock. > > Forget the BUG_ON, just go back and try again if this happens. > > Signed-off-by: Hugh Dickins <hughd@google.com> > Cc: stable@vger.kernel.org Looks reasonable to me. Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> madvise(MADV_DONTNEED) was aproblematic with THP before. Is a big win having mmap_sem taken on read rather than on write for it? -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: mm: fix BUG in __split_huge_page_pmd 2013-10-15 11:32 ` Kirill A. Shutemov @ 2013-10-15 14:41 ` Andrea Arcangeli 0 siblings, 0 replies; 12+ messages in thread From: Andrea Arcangeli @ 2013-10-15 14:41 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Hugh Dickins, Andrew Morton, David Rientjes, Naoya Horiguchi, linux-kernel, linux-mm On Tue, Oct 15, 2013 at 02:32:54PM +0300, Kirill A. Shutemov wrote: > Hugh Dickins wrote: > > Occasionally we hit the BUG_ON(pmd_trans_huge(*pmd)) at the end of > > __split_huge_page_pmd(): seen when doing madvise(,,MADV_DONTNEED). > > > > It's invalid: we don't always have down_write of mmap_sem there: > > a racing do_huge_pmd_wp_page() might have copied-on-write to another > > huge page before our split_huge_page() got the anon_vma lock. > > > > Forget the BUG_ON, just go back and try again if this happens. > > > > Signed-off-by: Hugh Dickins <hughd@google.com> > > Cc: stable@vger.kernel.org > > Looks reasonable to me. > > Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > > madvise(MADV_DONTNEED) was aproblematic with THP before. Is a big win having > mmap_sem taken on read rather than on write for it? Yeah it caused all those pmd_trans_unstable and pmd_none_or_trans_huge_or_clear_bad and pmd_read_atomic in common code. But I didn't want to regress the scalability of MADV_DONTNEED... I think various apps use MADV_DONTNEED to free memory (including very KVM in the balloon driver and probably JVM and other JIT). none or huge pmds are unstable without mmap_sem for writing and without page_table_lock (or in general pmd_trans_huge_lock). It's identical to the pte being unstable if mmap_sem is held for reading and we don't hold the PT lock, except the pte can only have two states and they're both unstable. hugepmds have three states, and the only stable state of the tree is when it points to a regular pte (the third state that 4k ptes cannot have). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: mm: fix BUG in __split_huge_page_pmd 2013-10-15 11:08 mm: fix BUG in __split_huge_page_pmd Hugh Dickins 2013-10-15 11:32 ` Kirill A. Shutemov @ 2013-10-15 14:34 ` Andrea Arcangeli 2013-10-15 14:48 ` Kirill A. Shutemov 1 sibling, 1 reply; 12+ messages in thread From: Andrea Arcangeli @ 2013-10-15 14:34 UTC (permalink / raw) To: Hugh Dickins Cc: Andrew Morton, David Rientjes, Kirill A. Shutemov, Naoya Horiguchi, linux-kernel, linux-mm Hi Hugh, On Tue, Oct 15, 2013 at 04:08:28AM -0700, Hugh Dickins wrote: > Occasionally we hit the BUG_ON(pmd_trans_huge(*pmd)) at the end of > __split_huge_page_pmd(): seen when doing madvise(,,MADV_DONTNEED). > > It's invalid: we don't always have down_write of mmap_sem there: > a racing do_huge_pmd_wp_page() might have copied-on-write to another > huge page before our split_huge_page() got the anon_vma lock. > I don't get exactly the scenario with do_huge_pmd_wp_page(), could you elaborate? My scenario is that in the below line another madvise(MADV_DONTNEED) runs: spin_unlock(&mm->page_table_lock); mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); ------ madvise(MADV_DONTNEED) from another thread zapping the entire 2m and clearing the pmd split_huge_page(page); then split_huge_page does nothing because the page has been freed. And then just before the BUG_ON (after split_huge_page returns) a new page fault fills in an anonymous page just before the BUG_ON. And the crashing thread would always be a partial MADV_DONTNEED with a misaligned end address (so requiring a split_huge_page to zap 4k subpages). So the testcase required would be: 2 concurrent MADV_DONTNEED, where the first has a misaligned "end" address (the one that triggers the BUG_ON), the second MADV_DONTNEED has a end address that covers the whole hugepmd, and a trans huge page fault happening just before the false positive triggers. Maybe your scenario with do_huge_pmd_wp_page() is simpler? > Forget the BUG_ON, just go back and try again if this happens. > > Signed-off-by: Hugh Dickins <hughd@google.com> > Cc: stable@vger.kernel.org > --- > > mm/huge_memory.c | 10 +++++++++- > 1 file changed, 9 insertions(+), 1 deletion(-) > > --- 3.12-rc5/mm/huge_memory.c 2013-09-16 17:37:56.811072270 -0700 > +++ linux/mm/huge_memory.c 2013-10-15 03:40:02.044138488 -0700 > @@ -2697,6 +2697,7 @@ void __split_huge_page_pmd(struct vm_are > > mmun_start = haddr; > mmun_end = haddr + HPAGE_PMD_SIZE; > +again: > mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); > spin_lock(&mm->page_table_lock); > if (unlikely(!pmd_trans_huge(*pmd))) { > @@ -2719,7 +2720,14 @@ void __split_huge_page_pmd(struct vm_are > split_huge_page(page); > > put_page(page); > - BUG_ON(pmd_trans_huge(*pmd)); > + > + /* > + * We don't always have down_write of mmap_sem here: a racing > + * do_huge_pmd_wp_page() might have copied-on-write to another > + * huge page before our split_huge_page() got the anon_vma lock. > + */ > + if (unlikely(pmd_trans_huge(*pmd))) > + goto again; > } > > void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address, While it is correct, the looping is misleading. We should document simply that any caller of split_huge_page_pmd with the mmap_sem for reading, should use pmd_none_or_trans_huge_or_clear_bad or pmd_trans_unstable before making any assumption on the pmd being "stable" after split_huge_page returns. This is what zap_pmd_range of course does already so it's perfectly safe if split_huge_page_pmd returns with pmd_trans_huge(*pmd) == true. Even with the loop, if the concurrent page faults that maps a trans_huge_pmd (replacing the pmd_none) triggers just after the 'goto again', but before the 'ret' instruction, the function still returns with a trans huge page mapped in the *pmd. In short I think either we try the more strict but more tricky approach from Hillf https://patchwork.kernel.org/patch/2178311/ and we also take into account the mapcount > 0 in the __split_huge_page loop to make Hillf patch safe (I think currently it isn't), or we just nuke the BUG_ON completely. And we should document in the source what happens with mmap_sem hold for reading in MADV_DONTNEED. Yet another approach would be to add something like down_write_trylock in front of the check after converting it to a VM_BUG_ON. The patch in patchwork, despite trying to be more strict, it doesn't look safe to me because I tend to think we should also change __split_huge_page to return the "mapcount", so that split_huge_page will fail also if the mapcount was 0 in __split_huge_page. I believe split_huge_page_to_list may obtain the anon_vma lock (so the page was mapped) but then zap_huge_pmd obtains the page_table_lock before __split_huge_page runs __split_huge_page_splitting and freezes any attempt zap_huge_pmd or any other attempt of MADV_DONTNEED with truncation of the pmd (waiting in another split_huge_page). So I believe it is possible (and safe) for it to run with mapcount 0 (doing nothing). But it doesn't return failure in that case, but if mapcount is 0, the pmd may not have been converted to stable state. This is why that change would be needed to use the patchwork patch. The only place where we depend on split_huge_page retval so far is add_to_swap. But regular anon pages can also be zapped there. So it should be able to cope and the swapcache will free itself when all refcounts are dropped. So we don't need to worry about the above I think in add_to_swap. Could you also post the stack trace so we compare to the one in patchwork? Ideally this happened for you also in the context of a MADV_DONTNEED, the other places walking pagetables with only the mmap_sem for reading usually don't mangle pagetables. Also note the workload I described with two MADV_DONTNEED running concurrently and a page fault too, on the very same transhugepage looks a non deterministic workload, but still we need to avoid false positives in those non deterministic cases. The two stack traces I have for this problem (patchwork above and below bugzilla) all confirms it's happening only inside MADV_DONTNEED confirming my theory above. https://bugzilla.redhat.com/show_bug.cgi?id=949735 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: mm: fix BUG in __split_huge_page_pmd 2013-10-15 14:34 ` Andrea Arcangeli @ 2013-10-15 14:48 ` Kirill A. Shutemov 2013-10-15 15:58 ` Andrea Arcangeli 2013-10-15 17:53 ` Hugh Dickins 0 siblings, 2 replies; 12+ messages in thread From: Kirill A. Shutemov @ 2013-10-15 14:48 UTC (permalink / raw) To: Andrea Arcangeli Cc: Hugh Dickins, Andrew Morton, David Rientjes, Kirill A. Shutemov, Naoya Horiguchi, linux-kernel, linux-mm Andrea Arcangeli wrote: > Hi Hugh, > > On Tue, Oct 15, 2013 at 04:08:28AM -0700, Hugh Dickins wrote: > > Occasionally we hit the BUG_ON(pmd_trans_huge(*pmd)) at the end of > > __split_huge_page_pmd(): seen when doing madvise(,,MADV_DONTNEED). > > > > It's invalid: we don't always have down_write of mmap_sem there: > > a racing do_huge_pmd_wp_page() might have copied-on-write to another > > huge page before our split_huge_page() got the anon_vma lock. > > > > I don't get exactly the scenario with do_huge_pmd_wp_page(), could you > elaborate? I think the scenario is follow: CPU0: CPU1 __split_huge_page_pmd() page = pmd_page(*pmd); do_huge_pmd_wp_page() copy the page and changes pmd (the same as on CPU0) to point to newly copied page. split_huge_page(page) where page is original page, not allocated on COW. pmd still points on huge page. Hugh, have I got it correctly? -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: mm: fix BUG in __split_huge_page_pmd 2013-10-15 14:48 ` Kirill A. Shutemov @ 2013-10-15 15:58 ` Andrea Arcangeli 2013-10-15 17:53 ` Hugh Dickins 1 sibling, 0 replies; 12+ messages in thread From: Andrea Arcangeli @ 2013-10-15 15:58 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Hugh Dickins, Andrew Morton, David Rientjes, Naoya Horiguchi, linux-kernel, linux-mm On Tue, Oct 15, 2013 at 05:48:27PM +0300, Kirill A. Shutemov wrote: > Andrea Arcangeli wrote: > > Hi Hugh, > > > > On Tue, Oct 15, 2013 at 04:08:28AM -0700, Hugh Dickins wrote: > > > Occasionally we hit the BUG_ON(pmd_trans_huge(*pmd)) at the end of > > > __split_huge_page_pmd(): seen when doing madvise(,,MADV_DONTNEED). > > > > > > It's invalid: we don't always have down_write of mmap_sem there: > > > a racing do_huge_pmd_wp_page() might have copied-on-write to another > > > huge page before our split_huge_page() got the anon_vma lock. > > > > > > > I don't get exactly the scenario with do_huge_pmd_wp_page(), could you > > elaborate? > > I think the scenario is follow: > > CPU0: CPU1 > > __split_huge_page_pmd() > page = pmd_page(*pmd); > do_huge_pmd_wp_page() copy the > page and changes pmd (the same as on CPU0) > to point to newly copied page. > split_huge_page(page) > where page is original page, > not allocated on COW. > pmd still points on huge page. > > > Hugh, have I got it correctly? So MADV_DONTNEED runs with with a "end" not 2m aligned (requiring 4k subpage zapping) on a wrprotected trans-huge page that is hitting a COW. So this scenario would be deterministic (the thread may write beyond the "end" of the MADV_DONTNEED) and it only requires two specific events. With my other scenario with two concurrent MADV_DONTNEED plus a page fault, you could still lead to split_huge_page_pmd returning with pmd_trans_huge(*pmd) == true, despite of the loop introduced. But for the above case, the loop makes a meaningful difference. So I see the good reason for looping now. It wouldn't be ok to miss a partial MADV_DONTNEED zapping because of a concurrent COW, while it would be ok in my other scenario (and the loop in fact cannot do anything to prevent split_huge_page_pmd return with the pmd still huge). My other scenario with two concurrent MADV_DONTNEED and a page fault is non deterministic so looping was meaningless. In both scenario, the kernel wouldn't run into stability issues, even if we only removed the BUG_ON. But the COW scenario, without the loop, we'd silently miss a partial MADV_DONTNEED on the 4k subpages before the "end" (or after the "start"). And we still need pmd_none_or_trans_huge_or_clear_bad in zap_pmd_range, to deal with the non deterministic cases that the loop won't help (the two MADV_DONTNEED + page fault), in addition to the loop to deal with the deterministic COW scenario above. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: mm: fix BUG in __split_huge_page_pmd 2013-10-15 14:48 ` Kirill A. Shutemov 2013-10-15 15:58 ` Andrea Arcangeli @ 2013-10-15 17:53 ` Hugh Dickins 2013-10-15 18:55 ` Andrea Arcangeli 1 sibling, 1 reply; 12+ messages in thread From: Hugh Dickins @ 2013-10-15 17:53 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Andrea Arcangeli, Andrew Morton, David Rientjes, Naoya Horiguchi, linux-kernel, linux-mm On Tue, 15 Oct 2013, Kirill A. Shutemov wrote: > Andrea Arcangeli wrote: > > Hi Hugh, > > > > On Tue, Oct 15, 2013 at 04:08:28AM -0700, Hugh Dickins wrote: > > > Occasionally we hit the BUG_ON(pmd_trans_huge(*pmd)) at the end of > > > __split_huge_page_pmd(): seen when doing madvise(,,MADV_DONTNEED). > > > > > > It's invalid: we don't always have down_write of mmap_sem there: > > > a racing do_huge_pmd_wp_page() might have copied-on-write to another > > > huge page before our split_huge_page() got the anon_vma lock. > > > > > > > I don't get exactly the scenario with do_huge_pmd_wp_page(), could you > > elaborate? > > I think the scenario is follow: > > CPU0: CPU1 > > __split_huge_page_pmd() > page = pmd_page(*pmd); > do_huge_pmd_wp_page() copy the > page and changes pmd (the same as on CPU0) > to point to newly copied page. > split_huge_page(page) > where page is original page, > not allocated on COW. > pmd still points on huge page. > > > Hugh, have I got it correctly? Yes, that's correct, that's what I've been assuming the race is. With CPU0 split_huge_page_pmd() being called from zap_pmd_range() in the service of madvise(,,MADV_DONTNEED). I don't have the stacktrace to hand: could perfectly well dig it out in an hour or two, but honestly, it adds nothing more to the picture. I have no trace of the CPU1 side of things, and have merely surmised that it was doing a COW. As to whether the MADV_DONTNEED down_read optimization is important: I don't recall, might be able to discover justification in old mail, 0a27a14a6292 doesn't actually say; but in general we're much better off using down_read than down_write where it's safe to do so. But more importantly, MADV_DONTNEED down_read across zap_page_range is building on the fact that file invalidation/truncation can already call zap_page_range without touching mmap_sem at all: not a problem for traditional anon-only THP, but something you'll have had to worry about for THPageCache. I'm afraid Andrea's mail about concurrent madvises gives me far more to think about than I have time for: seems to get into problems he knows a lot about but I'm unfamiliar with. If this patch looks good for now on its own, let's put it in; but no problem if you guys prefer to wait for a fuller solution of more problems, we can ride with this one internally for the moment. And I should admit that the crash has occurred too rarely for us yet to be able to judge whether this patch actually fixes it in practice. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: mm: fix BUG in __split_huge_page_pmd 2013-10-15 17:53 ` Hugh Dickins @ 2013-10-15 18:55 ` Andrea Arcangeli 2013-10-15 19:28 ` Naoya Horiguchi 0 siblings, 1 reply; 12+ messages in thread From: Andrea Arcangeli @ 2013-10-15 18:55 UTC (permalink / raw) To: Hugh Dickins Cc: Kirill A. Shutemov, Andrew Morton, David Rientjes, Naoya Horiguchi, linux-kernel, linux-mm On Tue, Oct 15, 2013 at 10:53:10AM -0700, Hugh Dickins wrote: > I'm afraid Andrea's mail about concurrent madvises gives me far more > to think about than I have time for: seems to get into problems he > knows a lot about but I'm unfamiliar with. If this patch looks good > for now on its own, let's put it in; but no problem if you guys prefer > to wait for a fuller solution of more problems, we can ride with this > one internally for the moment. I'm very happy with the patch and I think it's a correct fix for the COW scenario which is deterministic so the looping makes a meaningful difference for it. If we wouldn't loop, part of the copied page wouldn't be zapped after the COW. The patch also solves the false positive for the other non deterministic scenario of two MADV_DONTNEED (one partial, one whole) plus a concurrent page fault. > And I should admit that the crash has occurred too rarely for us yet > to be able to judge whether this patch actually fixes it in practice. It is very rare indeed, and thanks to the BUG_ON it cannot lead to any user or kernel memory corruption, but it's a nuisance we need to fix. I only have the two stack traces in the two links I posted in the previous email and I also don't have the traces of the other CPU. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: mm: fix BUG in __split_huge_page_pmd 2013-10-15 18:55 ` Andrea Arcangeli @ 2013-10-15 19:28 ` Naoya Horiguchi 2013-10-15 19:44 ` Andrea Arcangeli 0 siblings, 1 reply; 12+ messages in thread From: Naoya Horiguchi @ 2013-10-15 19:28 UTC (permalink / raw) To: Andrea Arcangeli Cc: Hugh Dickins, Kirill A. Shutemov, Andrew Morton, David Rientjes, linux-kernel, linux-mm On Tue, Oct 15, 2013 at 08:55:10PM +0200, Andrea Arcangeli wrote: > On Tue, Oct 15, 2013 at 10:53:10AM -0700, Hugh Dickins wrote: > > I'm afraid Andrea's mail about concurrent madvises gives me far more > > to think about than I have time for: seems to get into problems he > > knows a lot about but I'm unfamiliar with. If this patch looks good > > for now on its own, let's put it in; but no problem if you guys prefer > > to wait for a fuller solution of more problems, we can ride with this > > one internally for the moment. > > I'm very happy with the patch and I think it's a correct fix for the > COW scenario which is deterministic so the looping makes a meaningful > difference for it. If we wouldn't loop, part of the copied page > wouldn't be zapped after the COW. I like this patch, too. If we have the loop in __split_huge_page_pmd as suggested in this patch, can we assume that the pmd is stable after __split_huge_page_pmd returns? If it's true, we can remove pmd_none_or_trans_huge_or_clear_bad check in the callers side (zap_pmd_range and some other page table walking code.) Naoya Horiguchi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: mm: fix BUG in __split_huge_page_pmd 2013-10-15 19:28 ` Naoya Horiguchi @ 2013-10-15 19:44 ` Andrea Arcangeli 2013-10-15 20:16 ` Naoya Horiguchi 0 siblings, 1 reply; 12+ messages in thread From: Andrea Arcangeli @ 2013-10-15 19:44 UTC (permalink / raw) To: Naoya Horiguchi Cc: Hugh Dickins, Kirill A. Shutemov, Andrew Morton, David Rientjes, linux-kernel, linux-mm On Tue, Oct 15, 2013 at 03:28:50PM -0400, Naoya Horiguchi wrote: > On Tue, Oct 15, 2013 at 08:55:10PM +0200, Andrea Arcangeli wrote: > > On Tue, Oct 15, 2013 at 10:53:10AM -0700, Hugh Dickins wrote: > > > I'm afraid Andrea's mail about concurrent madvises gives me far more > > > to think about than I have time for: seems to get into problems he > > > knows a lot about but I'm unfamiliar with. If this patch looks good > > > for now on its own, let's put it in; but no problem if you guys prefer > > > to wait for a fuller solution of more problems, we can ride with this > > > one internally for the moment. > > > > I'm very happy with the patch and I think it's a correct fix for the > > COW scenario which is deterministic so the looping makes a meaningful > > difference for it. If we wouldn't loop, part of the copied page > > wouldn't be zapped after the COW. > > I like this patch, too. > > If we have the loop in __split_huge_page_pmd as suggested in this patch, > can we assume that the pmd is stable after __split_huge_page_pmd returns? > If it's true, we can remove pmd_none_or_trans_huge_or_clear_bad check > in the callers side (zap_pmd_range and some other page table walking code.) We can assume it stable for the deterministic cases where the looping is useful for and split_huge_page creates non-huge pmd that points to a regular pte. But we cannot remove pmd_none_or_trans_huge_or_clear_bad after if for the other non deterministic cases that I described in previous email. Looping still provides no guarantee that when the function returns, the pmd in not huge. So for safety we still need to handle the non deterministic case and just discard it through pmd_none_or_trans_huge_or_clear_bad. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: mm: fix BUG in __split_huge_page_pmd 2013-10-15 19:44 ` Andrea Arcangeli @ 2013-10-15 20:16 ` Naoya Horiguchi 2013-10-15 20:30 ` Andrea Arcangeli 0 siblings, 1 reply; 12+ messages in thread From: Naoya Horiguchi @ 2013-10-15 20:16 UTC (permalink / raw) To: Andrea Arcangeli Cc: Hugh Dickins, Kirill A. Shutemov, Andrew Morton, David Rientjes, linux-kernel, linux-mm On Tue, Oct 15, 2013 at 09:44:28PM +0200, Andrea Arcangeli wrote: > On Tue, Oct 15, 2013 at 03:28:50PM -0400, Naoya Horiguchi wrote: > > On Tue, Oct 15, 2013 at 08:55:10PM +0200, Andrea Arcangeli wrote: > > > On Tue, Oct 15, 2013 at 10:53:10AM -0700, Hugh Dickins wrote: > > > > I'm afraid Andrea's mail about concurrent madvises gives me far more > > > > to think about than I have time for: seems to get into problems he > > > > knows a lot about but I'm unfamiliar with. If this patch looks good > > > > for now on its own, let's put it in; but no problem if you guys prefer > > > > to wait for a fuller solution of more problems, we can ride with this > > > > one internally for the moment. > > > > > > I'm very happy with the patch and I think it's a correct fix for the > > > COW scenario which is deterministic so the looping makes a meaningful > > > difference for it. If we wouldn't loop, part of the copied page > > > wouldn't be zapped after the COW. > > > > I like this patch, too. > > > > If we have the loop in __split_huge_page_pmd as suggested in this patch, > > can we assume that the pmd is stable after __split_huge_page_pmd returns? > > If it's true, we can remove pmd_none_or_trans_huge_or_clear_bad check > > in the callers side (zap_pmd_range and some other page table walking code.) > > We can assume it stable for the deterministic cases where the > looping is useful for and split_huge_page creates non-huge pmd that points to > a regular pte. > > But we cannot remove pmd_none_or_trans_huge_or_clear_bad after if for > the other non deterministic cases that I described in previous > email. Looping still provides no guarantee that when the function > returns, the pmd in not huge. So for safety we still need to handle > the non deterministic case and just discard it through > pmd_none_or_trans_huge_or_clear_bad. OK, this check is necessary. But pmd_none_or_trans_huge_or_clear_bad doesn't clear the pmd when pmd_trans_huge is true. So zap_pmd_range seems to do nothing on such irregular pmd_trans_huge. So it looks to me better that zap_pmd_range retries the loop on the same address instead of 'goto next'. The reason why I had this kind of question is that I recently study on page table walker and some related code do retry in the similar situation. Thanks, Naoya Horiguchi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: mm: fix BUG in __split_huge_page_pmd 2013-10-15 20:16 ` Naoya Horiguchi @ 2013-10-15 20:30 ` Andrea Arcangeli 0 siblings, 0 replies; 12+ messages in thread From: Andrea Arcangeli @ 2013-10-15 20:30 UTC (permalink / raw) To: Naoya Horiguchi Cc: Hugh Dickins, Kirill A. Shutemov, Andrew Morton, David Rientjes, linux-kernel, linux-mm On Tue, Oct 15, 2013 at 04:16:23PM -0400, Naoya Horiguchi wrote: > On Tue, Oct 15, 2013 at 09:44:28PM +0200, Andrea Arcangeli wrote: > > On Tue, Oct 15, 2013 at 03:28:50PM -0400, Naoya Horiguchi wrote: > > > On Tue, Oct 15, 2013 at 08:55:10PM +0200, Andrea Arcangeli wrote: > > > > On Tue, Oct 15, 2013 at 10:53:10AM -0700, Hugh Dickins wrote: > > > > > I'm afraid Andrea's mail about concurrent madvises gives me far more > > > > > to think about than I have time for: seems to get into problems he > > > > > knows a lot about but I'm unfamiliar with. If this patch looks good > > > > > for now on its own, let's put it in; but no problem if you guys prefer > > > > > to wait for a fuller solution of more problems, we can ride with this > > > > > one internally for the moment. > > > > > > > > I'm very happy with the patch and I think it's a correct fix for the > > > > COW scenario which is deterministic so the looping makes a meaningful > > > > difference for it. If we wouldn't loop, part of the copied page > > > > wouldn't be zapped after the COW. > > > > > > I like this patch, too. > > > > > > If we have the loop in __split_huge_page_pmd as suggested in this patch, > > > can we assume that the pmd is stable after __split_huge_page_pmd returns? > > > If it's true, we can remove pmd_none_or_trans_huge_or_clear_bad check > > > in the callers side (zap_pmd_range and some other page table walking code.) > > > > We can assume it stable for the deterministic cases where the > > looping is useful for and split_huge_page creates non-huge pmd that points to > > a regular pte. > > > > But we cannot remove pmd_none_or_trans_huge_or_clear_bad after if for > > the other non deterministic cases that I described in previous > > email. Looping still provides no guarantee that when the function > > returns, the pmd in not huge. So for safety we still need to handle > > the non deterministic case and just discard it through > > pmd_none_or_trans_huge_or_clear_bad. > > OK, this check is necessary. But pmd_none_or_trans_huge_or_clear_bad > doesn't clear the pmd when pmd_trans_huge is true. So zap_pmd_range > seems to do nothing on such irregular pmd_trans_huge. So it looks to > me better that zap_pmd_range retries the loop on the same address > instead of 'goto next'. It may look like a bug to return with a huge pmd established, when we could notice it after pmd_trans_huge_pmd returns. However try to imagine to add a check there, and to keep adding checks and loops. If you're just a bit more unlucky next time, the page fault that converts the "unstable" pmd from "none" to "huge", may fire in another thread running in another CPU during the iret, so while MADV_DONTNEED completes and returns to userland. There are not enough checks you can add even after zap_pmd_range returns, to prevent the pmd to become established by the time MADV_DONTNEED returns to userland. The point is that if MADV_DONTNEED was zapping the entire pmd (not part) userland is accessing the same memory at the same time from another thread, the result is undefined. > The reason why I had this kind of question is that I recently study on > page table walker and some related code do retry in the similar situation. There surely are other cases that give an undefined result in the kernel. Another one in the pagetable walking code that would give undefined result is still MADV_DONTNEED vs a 4k page fault. You don't know who runs first, the 4k page may end up zapped or not. But in that case there's no split_huge_page_pmd as variable of the equation. The header definition documents it too: /* * This function is meant to be used by sites walking pagetables with * the mmap_sem hold in read mode to protect against MADV_DONTNEED and * transhuge page faults. MADV_DONTNEED can convert a transhuge pmd * into a null pmd and the transhuge page fault can convert a null pmd * into an hugepmd or into a regular pmd (if the hugepage allocation * fails). While holding the mmap_sem in read mode the pmd becomes * stable and stops changing under us only if it's not null and not a * transhuge pmd. When those races occurs and this function makes a * difference vs the standard pmd_none_or_clear_bad, the result is * undefined so behaving like if the pmd was none is safe (because it * can return none anyway). The compiler level barrier() is critically * important to compute the two checks atomically on the same pmdval. * * For 32bit kernels with a 64bit large pmd_t this automatically takes * care of reading the pmd atomically to avoid SMP race conditions * against pmd_populate() when the mmap_sem is hold for reading by the * caller (a special atomic read not done by "gcc" as in the generic * version above, is also needed when THP is disabled because the page * fault can populate the pmd from under us). */ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd) The COW case cannot be threated as undefined though, that has as well defined result that for userland must be identical to the one that would happen on 4k pages. This is why looping there is required. But we still need to deal with the other scenario with undefined result too. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2013-10-15 20:30 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-10-15 11:08 mm: fix BUG in __split_huge_page_pmd Hugh Dickins 2013-10-15 11:32 ` Kirill A. Shutemov 2013-10-15 14:41 ` Andrea Arcangeli 2013-10-15 14:34 ` Andrea Arcangeli 2013-10-15 14:48 ` Kirill A. Shutemov 2013-10-15 15:58 ` Andrea Arcangeli 2013-10-15 17:53 ` Hugh Dickins 2013-10-15 18:55 ` Andrea Arcangeli 2013-10-15 19:28 ` Naoya Horiguchi 2013-10-15 19:44 ` Andrea Arcangeli 2013-10-15 20:16 ` Naoya Horiguchi 2013-10-15 20:30 ` Andrea Arcangeli
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).