* Re: oops in copy_page_rep() [not found] ` <CA+55aFyYAf6ztDLsxWFD+6jb++y0YNjso-9j+83Mm+3uQ=8PdA@mail.gmail.com> @ 2013-01-08 13:04 ` Hillf Danton 2013-01-08 15:37 ` Linus Torvalds 0 siblings, 1 reply; 17+ messages in thread From: Hillf Danton @ 2013-01-08 13:04 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, Dave Jones, Linux Kernel, Andrea Arcangeli, Andrew Morton, Mel Gorman, Linux-MM On Tue, Jan 8, 2013 at 1:34 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Mon, Jan 7, 2013 at 4:24 AM, Hillf Danton <dhillf@gmail.com> wrote: >> >> I take another try with waiting added, take a look please. > > Hmm. Is there some reason we never need to worry about it for the > "pmd_numa()" case just above? > > A comment about this all might be a really good idea. > Yes Sir, added. --- From: Hillf Danton <dhillf@gmail.com> Subject: [PATCH] mm: restore huge pmd splitting check Hugh said, it's clear that 3.7 had an important pmd_trans_splitting(orig_pmd) check there, which went AWOL in d10e63f29488 "mm: numa: Create basic numa page hinting infrastructure". It is restored for handling stable page fault, with wait_split_huge_page() added, as suggested also by Hugh, to avoid reapted faults until the split has completed. This work is inspired by the oops reported by Dave Jones at https://lkml.org/lkml/2013/1/5/115 Signed-off-by: Hillf Danton <dhillf@gmail.com> --- --- a/mm/memory.c Sun Jan 6 19:49:50 2013 +++ b/mm/memory.c Tue Jan 8 20:28:04 2013 @@ -3710,6 +3710,14 @@ retry: return do_huge_pmd_numa_page(mm, vma, address, orig_pmd, pmd); + /* + * Check if pmd is stable + * (numa pmd is stable, see change_huge_pmd()) + */ + if (pmd_trans_splitting(orig_pmd)) { + wait_split_huge_page(vma->anon_vma, pmd); + goto retry; + } if (dirty && !pmd_write(orig_pmd)) { ret = do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd); -- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 13:04 ` oops in copy_page_rep() Hillf Danton @ 2013-01-08 15:37 ` Linus Torvalds 2013-01-08 16:31 ` Kirill A. Shutemov 0 siblings, 1 reply; 17+ messages in thread From: Linus Torvalds @ 2013-01-08 15:37 UTC (permalink / raw) To: Hillf Danton Cc: Hugh Dickins, Dave Jones, Linux Kernel, Andrea Arcangeli, Andrew Morton, Mel Gorman, Linux-MM On Tue, Jan 8, 2013 at 5:04 AM, Hillf Danton <dhillf@gmail.com> wrote: > On Tue, Jan 8, 2013 at 1:34 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> Hmm. Is there some reason we never need to worry about it for the >> "pmd_numa()" case just above? >> >> A comment about this all might be a really good idea. >> > Yes Sir, added. Heh. I was more thinking about why do_huge_pmd_wp_page() needs it, but do_huge_pmd_numa_page() does not. Also, do we actually need it for huge_pmd_set_accessed()? The *placement* of that thing confuses me. And because it confuses me, I'd like to understand it. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 15:37 ` Linus Torvalds @ 2013-01-08 16:31 ` Kirill A. Shutemov 2013-01-08 16:52 ` Linus Torvalds 0 siblings, 1 reply; 17+ messages in thread From: Kirill A. Shutemov @ 2013-01-08 16:31 UTC (permalink / raw) To: Linus Torvalds Cc: Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrea Arcangeli, Andrew Morton, Mel Gorman, Linux-MM On Tue, Jan 08, 2013 at 07:37:06AM -0800, Linus Torvalds wrote: > On Tue, Jan 8, 2013 at 5:04 AM, Hillf Danton <dhillf@gmail.com> wrote: > > On Tue, Jan 8, 2013 at 1:34 AM, Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > >> > >> Hmm. Is there some reason we never need to worry about it for the > >> "pmd_numa()" case just above? > >> > >> A comment about this all might be a really good idea. > >> > > Yes Sir, added. > > Heh. I was more thinking about why do_huge_pmd_wp_page() needs it, but > do_huge_pmd_numa_page() does not. It does. The check should be moved up. > Also, do we actually need it for huge_pmd_set_accessed()? The > *placement* of that thing confuses me. And because it confuses me, I'd > like to understand it. We need it for huge_pmd_set_accessed() too. Looks like a mis-merge. The original patch for huge_pmd_set_accessed() was correct: http://lkml.org/lkml/2012/10/25/402 -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 16:31 ` Kirill A. Shutemov @ 2013-01-08 16:52 ` Linus Torvalds 2013-01-08 17:30 ` Kirill A. Shutemov ` (2 more replies) 0 siblings, 3 replies; 17+ messages in thread From: Linus Torvalds @ 2013-01-08 16:52 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrea Arcangeli, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel On Tue, Jan 8, 2013 at 8:31 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote: >> >> Heh. I was more thinking about why do_huge_pmd_wp_page() needs it, but >> do_huge_pmd_numa_page() does not. > > It does. The check should be moved up. > >> Also, do we actually need it for huge_pmd_set_accessed()? The >> *placement* of that thing confuses me. And because it confuses me, I'd >> like to understand it. > > We need it for huge_pmd_set_accessed() too. > > Looks like a mis-merge. The original patch for huge_pmd_set_accessed() was > correct: http://lkml.org/lkml/2012/10/25/402 Not a merge error: the pmd_trans_splitting() check was removed by commit d10e63f29488 ("mm: numa: Create basic numa page hinting infrastructure"). Now, *why* it was removed, I can't tell. And it's not clear why the original code just had it in a conditional, while the suggested patch has that "goto repeat" thing. I suspect re-trying the fault (which I assume the original code did) is actually better, because that way you go through all the "should I reschedule as I return through the exception" stuff. I dunno. Mel, that original patch came from you , although it was based on previous work by Peter/Ingo/Andrea. Can you walk us through the history and thinking about the loss of pmd_trans_splitting(). Was it purely a mistake? It looks intentional. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 16:52 ` Linus Torvalds @ 2013-01-08 17:30 ` Kirill A. Shutemov 2013-01-08 17:38 ` Linus Torvalds 2013-01-08 17:49 ` Andrea Arcangeli 2013-01-08 17:37 ` Andrea Arcangeli 2013-01-09 11:44 ` Mel Gorman 2 siblings, 2 replies; 17+ messages in thread From: Kirill A. Shutemov @ 2013-01-08 17:30 UTC (permalink / raw) To: Linus Torvalds Cc: Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrea Arcangeli, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel On Tue, Jan 08, 2013 at 08:52:14AM -0800, Linus Torvalds wrote: > On Tue, Jan 8, 2013 at 8:31 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote: > >> > >> Heh. I was more thinking about why do_huge_pmd_wp_page() needs it, but > >> do_huge_pmd_numa_page() does not. > > > > It does. The check should be moved up. > > > >> Also, do we actually need it for huge_pmd_set_accessed()? The > >> *placement* of that thing confuses me. And because it confuses me, I'd > >> like to understand it. > > > > We need it for huge_pmd_set_accessed() too. > > > > Looks like a mis-merge. The original patch for huge_pmd_set_accessed() was > > correct: http://lkml.org/lkml/2012/10/25/402 > > Not a merge error: the pmd_trans_splitting() check was removed by > commit d10e63f29488 ("mm: numa: Create basic numa page hinting > infrastructure"). Check difference between patch above and merged one -- a1dd450. Merged patch is obviously broken: huge_pmd_set_accessed() can be called only if the pmd is under splitting. -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 17:30 ` Kirill A. Shutemov @ 2013-01-08 17:38 ` Linus Torvalds 2013-01-08 17:49 ` Andrea Arcangeli 1 sibling, 0 replies; 17+ messages in thread From: Linus Torvalds @ 2013-01-08 17:38 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrea Arcangeli, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel On Tue, Jan 8, 2013 at 9:30 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote: > > Check difference between patch above and merged one -- a1dd450. > Merged patch is obviously broken: huge_pmd_set_accessed() can be called > only if the pmd is under splitting. Ok, that's a totally different issue, and seems to be due to different versions (Andrew - any idea why http://lkml.org/lkml/2012/10/25/402 and commit a1dd450bcb1a ("mm: thp: set the accessed flag for old pages on access fault") are different? That said, I actually think that commit a1dd450bcb1a is correct: huge_pmd_set_accessed() can not *possibly* need to check the splitting issue, since it takes the page table lock and re-verifies that the pmd entry is identical, before just setting the access flags. So that whole thing is irrelevant. huge_pmd_set_accessed() almost certainly simply doesn't care about splitting. But look at commit d10e63f29488. That's the one that removes pmd_trans_splitting() entirely, and does it for the case that *does* seem to care, namely do_huge_pmd_wp_page(). Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 17:30 ` Kirill A. Shutemov 2013-01-08 17:38 ` Linus Torvalds @ 2013-01-08 17:49 ` Andrea Arcangeli 2013-01-08 18:03 ` Kirill A. Shutemov 2013-01-11 7:50 ` Simon Jeons 1 sibling, 2 replies; 17+ messages in thread From: Andrea Arcangeli @ 2013-01-08 17:49 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Linus Torvalds, Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel Hi Kirill, On Tue, Jan 08, 2013 at 07:30:58PM +0200, Kirill A. Shutemov wrote: > Merged patch is obviously broken: huge_pmd_set_accessed() can be called > only if the pmd is under splitting. Of course I assume you meant "only if the pmd is not under splitting". But no, setting a bitflag like the young bit or clearing or setting the numa bit won't screw with split_huge_page and it's safe even if the pmd is under splitting. Those bits are only checked here at the last stage of split_huge_page_map after taking the PT lock: spin_lock(&mm->page_table_lock); pmd = page_check_address_pmd(page, mm, address, PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG); if (pmd) { pgtable = pgtable_trans_huge_withdraw(mm); pmd_populate(mm, &_pmd, pgtable); haddr = address; for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) { pte_t *pte, entry; BUG_ON(PageCompound(page+i)); entry = mk_pte(page + i, vma->vm_page_prot); entry = maybe_mkwrite(pte_mkdirty(entry), vma); if (!pmd_write(*pmd)) entry = pte_wrprotect(entry); else BUG_ON(page_mapcount(page) != 1); if (!pmd_young(*pmd)) entry = pte_mkold(entry); if (pmd_numa(*pmd)) entry = pte_mknuma(entry); pte = pte_offset_map(&_pmd, haddr); BUG_ON(!pte_none(*pte)); set_pte_at(mm, haddr, pte, entry); pte_unmap(pte); } If "young" or "numa" bitflags changed on the original *pmd for the previous part of split_huge_page, nothing will go wrong by the time we get to split_huge_page_map (the same is not true if the pfn changes!). If you think this is too tricky, we could also decide to forbid huge_pmd_set_accessed if the pmd is in splitting state, but I don't think that flipping young/numa bits while in splitting state, can cause any problem (if done correctly with PT lock + pmd_same). Thanks! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 17:49 ` Andrea Arcangeli @ 2013-01-08 18:03 ` Kirill A. Shutemov 2013-01-11 7:50 ` Simon Jeons 1 sibling, 0 replies; 17+ messages in thread From: Kirill A. Shutemov @ 2013-01-08 18:03 UTC (permalink / raw) To: Andrea Arcangeli Cc: Linus Torvalds, Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel On Tue, Jan 08, 2013 at 06:49:51PM +0100, Andrea Arcangeli wrote: > Hi Kirill, > > On Tue, Jan 08, 2013 at 07:30:58PM +0200, Kirill A. Shutemov wrote: > > Merged patch is obviously broken: huge_pmd_set_accessed() can be called > > only if the pmd is under splitting. > > Of course I assume you meant "only if the pmd is not under splitting". The broken merged patch has this: + if (dirty && !pmd_write(orig_pmd) && !pmd_trans_splitting(orig_pmd)) { [...] + } else { + huge_pmd_set_accessed(mm, vma, address, pmd, + orig_pmd, dirty); } > But no, setting a bitflag like the young bit or clearing or setting > the numa bit won't screw with split_huge_page and it's safe even if > the pmd is under splitting. Okay. Thanks for clarification for me. -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 17:49 ` Andrea Arcangeli 2013-01-08 18:03 ` Kirill A. Shutemov @ 2013-01-11 7:50 ` Simon Jeons 2013-01-11 14:01 ` Andrea Arcangeli 1 sibling, 1 reply; 17+ messages in thread From: Simon Jeons @ 2013-01-11 7:50 UTC (permalink / raw) To: Andrea Arcangeli Cc: Kirill A. Shutemov, Linus Torvalds, Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel On Tue, 2013-01-08 at 18:49 +0100, Andrea Arcangeli wrote: > Hi Kirill, > > On Tue, Jan 08, 2013 at 07:30:58PM +0200, Kirill A. Shutemov wrote: > > Merged patch is obviously broken: huge_pmd_set_accessed() can be called > > only if the pmd is under splitting. > > Of course I assume you meant "only if the pmd is not under splitting". > > But no, setting a bitflag like the young bit or clearing or setting > the numa bit won't screw with split_huge_page and it's safe even if > the pmd is under splitting. > > Those bits are only checked here at the last stage of > split_huge_page_map after taking the PT lock: > > spin_lock(&mm->page_table_lock); > pmd = page_check_address_pmd(page, mm, address, > PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG); > if (pmd) { > pgtable = pgtable_trans_huge_withdraw(mm); > pmd_populate(mm, &_pmd, pgtable); > > haddr = address; > for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) { > pte_t *pte, entry; > BUG_ON(PageCompound(page+i)); > entry = mk_pte(page + i, vma->vm_page_prot); > entry = maybe_mkwrite(pte_mkdirty(entry), vma); > if (!pmd_write(*pmd)) > entry = pte_wrprotect(entry); > else > BUG_ON(page_mapcount(page) != 1); > if (!pmd_young(*pmd)) > entry = pte_mkold(entry); > if (pmd_numa(*pmd)) > entry = pte_mknuma(entry); > pte = pte_offset_map(&_pmd, haddr); > BUG_ON(!pte_none(*pte)); > set_pte_at(mm, haddr, pte, entry); > pte_unmap(pte); > } > > If "young" or "numa" bitflags changed on the original *pmd for the > previous part of split_huge_page, nothing will go wrong by the time we > get to split_huge_page_map (the same is not true if the pfn changes!). > But this time BUG_ON(mapcount != mapcount2) in function __split_huge_page will be trigged. > If you think this is too tricky, we could also decide to forbid > huge_pmd_set_accessed if the pmd is in splitting state, but I don't > think that flipping young/numa bits while in splitting state, can > cause any problem (if done correctly with PT lock + pmd_same). > > Thanks! > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep() 2013-01-11 7:50 ` Simon Jeons @ 2013-01-11 14:01 ` Andrea Arcangeli 0 siblings, 0 replies; 17+ messages in thread From: Andrea Arcangeli @ 2013-01-11 14:01 UTC (permalink / raw) To: Simon Jeons Cc: Kirill A. Shutemov, Linus Torvalds, Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel On Fri, Jan 11, 2013 at 01:50:44AM -0600, Simon Jeons wrote: > On Tue, 2013-01-08 at 18:49 +0100, Andrea Arcangeli wrote: > > Hi Kirill, > > > > On Tue, Jan 08, 2013 at 07:30:58PM +0200, Kirill A. Shutemov wrote: > > > Merged patch is obviously broken: huge_pmd_set_accessed() can be called > > > only if the pmd is under splitting. > > > > Of course I assume you meant "only if the pmd is not under splitting". > > > > But no, setting a bitflag like the young bit or clearing or setting > > the numa bit won't screw with split_huge_page and it's safe even if > > the pmd is under splitting. > > > > Those bits are only checked here at the last stage of > > split_huge_page_map after taking the PT lock: > > > > spin_lock(&mm->page_table_lock); > > pmd = page_check_address_pmd(page, mm, address, > > PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG); > > if (pmd) { > > pgtable = pgtable_trans_huge_withdraw(mm); > > pmd_populate(mm, &_pmd, pgtable); > > > > haddr = address; > > for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) { > > pte_t *pte, entry; > > BUG_ON(PageCompound(page+i)); > > entry = mk_pte(page + i, vma->vm_page_prot); > > entry = maybe_mkwrite(pte_mkdirty(entry), vma); > > if (!pmd_write(*pmd)) > > entry = pte_wrprotect(entry); > > else > > BUG_ON(page_mapcount(page) != 1); > > if (!pmd_young(*pmd)) > > entry = pte_mkold(entry); > > if (pmd_numa(*pmd)) > > entry = pte_mknuma(entry); > > pte = pte_offset_map(&_pmd, haddr); > > BUG_ON(!pte_none(*pte)); > > set_pte_at(mm, haddr, pte, entry); > > pte_unmap(pte); > > } > > > > If "young" or "numa" bitflags changed on the original *pmd for the > > previous part of split_huge_page, nothing will go wrong by the time we > > get to split_huge_page_map (the same is not true if the pfn changes!). > > > > But this time BUG_ON(mapcount != mapcount2) in function > __split_huge_page will be trigged. "young" or "numa" bitflags in the pmd don't alter rmap/mapcount/pagecount/pfn or anything that could affect such BUG_ON, so I'm not sure why you think so. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 16:52 ` Linus Torvalds 2013-01-08 17:30 ` Kirill A. Shutemov @ 2013-01-08 17:37 ` Andrea Arcangeli 2013-01-08 17:51 ` Linus Torvalds 2013-01-09 4:23 ` Hugh Dickins 2013-01-09 11:44 ` Mel Gorman 2 siblings, 2 replies; 17+ messages in thread From: Andrea Arcangeli @ 2013-01-08 17:37 UTC (permalink / raw) To: Linus Torvalds Cc: Kirill A. Shutemov, Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel Hi, On Tue, Jan 08, 2013 at 08:52:14AM -0800, Linus Torvalds wrote: > On Tue, Jan 8, 2013 at 8:31 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote: > >> > >> Heh. I was more thinking about why do_huge_pmd_wp_page() needs it, but > >> do_huge_pmd_numa_page() does not. > > > > It does. The check should be moved up. > > > >> Also, do we actually need it for huge_pmd_set_accessed()? The > >> *placement* of that thing confuses me. And because it confuses me, I'd > >> like to understand it. > > > > We need it for huge_pmd_set_accessed() too. > > > > Looks like a mis-merge. The original patch for huge_pmd_set_accessed() was > > correct: http://lkml.org/lkml/2012/10/25/402 > > Not a merge error: the pmd_trans_splitting() check was removed by > commit d10e63f29488 ("mm: numa: Create basic numa page hinting > infrastructure"). > > Now, *why* it was removed, I can't tell. And it's not clear why the > original code just had it in a conditional, while the suggested patch > has that "goto repeat" thing. I suspect re-trying the fault (which I > assume the original code did) is actually better, because that way you > go through all the "should I reschedule as I return through the > exception" stuff. I dunno. The reason it returned to userland and retried the fault is that this should be infrequent enough not to worry about it and this was marginally simpler but it could be changed. If we don't want to return to userland we should wait on the splitting bit and then take the pte walking routines like if the pmd wasn't huge. This is not related to the below though. > Mel, that original patch came from you , although it was based on > previous work by Peter/Ingo/Andrea. Can you walk us through the > history and thinking about the loss of pmd_trans_splitting(). Was it > purely a mistake? It looks intentional. d10e63f29488 is wrong in removing the pmd_splitting check, I assume it's an accidental leftover. My code did this: @@ -3530,6 +3534,9 @@ retry: */ orig_pmd = ACCESS_ONCE(*pmd); if (pmd_trans_huge(orig_pmd)) { + if (pmd_numa(*pmd)) + return huge_pmd_numa_fixup(mm, address, + orig_pmd, pmd); if (flags & FAULT_FLAG_WRITE && !pmd_write(orig_pmd) && !pmd_trans_splitting(orig_pmd)) { The function I was calling was this: +#ifdef CONFIG_AUTONUMA +/* NUMA hinting page fault entry point for trans huge pmds */ +int huge_pmd_numa_fixup(struct mm_struct *mm, unsigned long addr, + pmd_t pmd, pmd_t *pmdp) +{ + struct page *page; + bool migrated; + + spin_lock(&mm->page_table_lock); + if (unlikely(!pmd_same(pmd, *pmdp))) + goto out_unlock; + + page = pmd_page(pmd); + pmd = pmd_mknonnuma(pmd); + set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmdp, pmd); + VM_BUG_ON(pmd_numa(*pmdp)); + if (unlikely(page_mapcount(page) != 1)) + goto out_unlock; + get_page(page); + spin_unlock(&mm->page_table_lock); + + migrated = numa_hinting_fault(page, HPAGE_PMD_NR); + if (!migrated) + put_page(page); + +out: + return 0; + +out_unlock: + spin_unlock(&mm->page_table_lock); + goto out; +} +#endif Taking the PT lock, checking pmd_same and clearing the numa bitflag was perfectly ok even if the pmd was in splitting state the whole time. However do_huge_pmd_numa_page is slightly more complex than the above: the problem is that migrate_misplaced_transhuge_page gets pmdp and pmd as parameters (unlike the above numa_hinting_fault() function) and it relies internally to pmd_same too. /* Recheck the target PMD */ spin_lock(&mm->page_table_lock); if (unlikely(!pmd_same(*pmd, entry))) { spin_unlock(&mm->page_table_lock); [..] entry = mk_pmd(new_page, vma->vm_page_prot); entry = pmd_mknonnuma(entry); entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); entry = pmd_mkhuge(entry); page_add_new_anon_rmap(new_page, vma, haddr); set_pmd_at(mm, haddr, pmd, entry); And this kind of mangling isn't ok if the pmd was in splitting state because split_huge_page won't expect the pmd to change (if the numa bit changes is ok but the pfn cannot change or split_huge_page_map will go wrong). So you're right that if migrate_misplaced_transhuge_page will continue to do the pmd_same check, we should add the pmd_splitting bit in memory.c for the pmd_numa() path too. Looking at this, one thing that isn't clear is where the page_count is checked in migrate_misplaced_transhuge_page. Ok that it's unable to migrate anon transhuge COW shared pages so it doesn't need to mess with rmap (the mapcount check makes it safe), but it shouldn't be allowed to migrate memory that has gup direct-IO in flight (and that can only be detected with a page_count vs mapcount check). Real migrate does page_freeze_refs to be safe. Mel comments? Thanks, Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 17:37 ` Andrea Arcangeli @ 2013-01-08 17:51 ` Linus Torvalds 2013-01-08 18:03 ` Andrea Arcangeli 2013-01-09 4:23 ` Hugh Dickins 1 sibling, 1 reply; 17+ messages in thread From: Linus Torvalds @ 2013-01-08 17:51 UTC (permalink / raw) To: Andrea Arcangeli Cc: Kirill A. Shutemov, Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel [-- Attachment #1: Type: text/plain, Size: 916 bytes --] On Tue, Jan 8, 2013 at 9:37 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > The reason it returned to userland and retried the fault is that this > should be infrequent enough not to worry about it and this was > marginally simpler but it could be changed. Yeah, that was my suspicion. And as mentioned, returning to user land might actually help with scheduling and/or signal handling latencies etc, so it might be the right thing to do. Especially if the alternative is to just busy-loop. > If we don't want to return to userland we should wait on the splitting > bit and then take the pte walking routines like if the pmd wasn't > huge. This is not related to the below though. How does this patch sound to people? It does the splitting check before the access bit set (even though I don't think it matters), and at least talks about the alternatives and the issues a bit. Hmm? Linus [-- Attachment #2: mm.patch --] [-- Type: application/octet-stream, Size: 750 bytes --] mm/memory.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/mm/memory.c b/mm/memory.c index 49fb1cf08611..f5ec3ae03f44 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3715,6 +3715,18 @@ retry: return do_huge_pmd_numa_page(mm, vma, address, orig_pmd, pmd); + /* + * If the pmd is splitting, return and retry the + * the fault. We *could* set just the accessed flag, + * but it's better to just avoid the races with + * splitting entirely. + * + * Alternative: wait until the split is done, and + * goto retry. + */ + if (pmd_trans_splitting(orig_pmd)) + return 0; + if (dirty && !pmd_write(orig_pmd)) { ret = do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd); ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 17:51 ` Linus Torvalds @ 2013-01-08 18:03 ` Andrea Arcangeli 2013-01-08 18:21 ` Linus Torvalds 0 siblings, 1 reply; 17+ messages in thread From: Andrea Arcangeli @ 2013-01-08 18:03 UTC (permalink / raw) To: Linus Torvalds Cc: Kirill A. Shutemov, Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel On Tue, Jan 08, 2013 at 09:51:47AM -0800, Linus Torvalds wrote: > On Tue, Jan 8, 2013 at 9:37 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > > > The reason it returned to userland and retried the fault is that this > > should be infrequent enough not to worry about it and this was > > marginally simpler but it could be changed. > > Yeah, that was my suspicion. And as mentioned, returning to user land > might actually help with scheduling and/or signal handling latencies > etc, so it might be the right thing to do. Especially if the > alternative is to just busy-loop. > > > If we don't want to return to userland we should wait on the splitting > > bit and then take the pte walking routines like if the pmd wasn't > > huge. This is not related to the below though. > > How does this patch sound to people? It does the splitting check > before the access bit set (even though I don't think it matters), and > at least talks about the alternatives and the issues a bit. > > Hmm? It looks very fine to me, but I suggest to move it above the pmd_numa() check because of the newly introduced migrate_misplaced_transhuge_page method relying on pmd_same too. Thanks! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 18:03 ` Andrea Arcangeli @ 2013-01-08 18:21 ` Linus Torvalds 2013-01-09 11:38 ` Hillf Danton 0 siblings, 1 reply; 17+ messages in thread From: Linus Torvalds @ 2013-01-08 18:21 UTC (permalink / raw) To: Andrea Arcangeli Cc: Kirill A. Shutemov, Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel On Tue, Jan 8, 2013 at 10:03 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > It looks very fine to me, but I suggest to move it above the > pmd_numa() check because of the newly introduced > migrate_misplaced_transhuge_page method relying on pmd_same too. Hmm. If we need it there, then we need to fix the *later* case of pmd_numa() too: if (pmd_numa(*pmd)) return do_pmd_numa_page(mm, vma, address, pmd); Also, and more fundamentally, since do_pmd_numa_page() doesn't take the orig_pmd thing as an argument (and re-check it under the page-table lock), testing pmd_trans_splitting() on it is pointless, since it can change later. So no, moving the check up does *not* make sense, at least not without other changes. Because if I read things right, pmd_trans_splitting() really has to be done with the page-table lock protection (where "with page-table lock protection" does *not* mean that it has to be done under the page table lock, but if it is done outside, then the pmd entry has to be re-verified after getting the lock - which both do_huge_pmd_wp_page() and huge_pmd_set_accessed() correctly do). Comments? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 18:21 ` Linus Torvalds @ 2013-01-09 11:38 ` Hillf Danton 0 siblings, 0 replies; 17+ messages in thread From: Hillf Danton @ 2013-01-09 11:38 UTC (permalink / raw) To: Linus Torvalds Cc: Andrea Arcangeli, Kirill A. Shutemov, Hugh Dickins, Dave Jones, Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel On Wed, Jan 9, 2013 at 2:21 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, Jan 8, 2013 at 10:03 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: >> >> It looks very fine to me, but I suggest to move it above the >> pmd_numa() check because of the newly introduced >> migrate_misplaced_transhuge_page method relying on pmd_same too. > > Hmm. If we need it there, then we need to fix the *later* case of > pmd_numa() too: > > if (pmd_numa(*pmd)) > return do_pmd_numa_page(mm, vma, address, pmd); > > Also, and more fundamentally, since do_pmd_numa_page() doesn't take > the orig_pmd thing as an argument (and re-check it under the > page-table lock), testing pmd_trans_splitting() on it is pointless, > since it can change later. > A splitting pmd has to be huge first, and we do handle huge pmd already in the up dozen of lines. That said, we can igore splitting check in this case, Sir. Hillf > So no, moving the check up does *not* make sense, at least not without > other changes. Because if I read things right, pmd_trans_splitting() > really has to be done with the page-table lock protection (where "with > page-table lock protection" does *not* mean that it has to be done > under the page table lock, but if it is done outside, then the pmd > entry has to be re-verified after getting the lock - which both > do_huge_pmd_wp_page() and huge_pmd_set_accessed() correctly do). > > Comments? > > Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 17:37 ` Andrea Arcangeli 2013-01-08 17:51 ` Linus Torvalds @ 2013-01-09 4:23 ` Hugh Dickins 1 sibling, 0 replies; 17+ messages in thread From: Hugh Dickins @ 2013-01-09 4:23 UTC (permalink / raw) To: Andrea Arcangeli Cc: Linus Torvalds, Kirill A. Shutemov, Hillf Danton, Dave Jones, Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel On Tue, 8 Jan 2013, Andrea Arcangeli wrote: > > Looking at this, one thing that isn't clear is where the page_count is > checked in migrate_misplaced_transhuge_page. Ok that it's unable to > migrate anon transhuge COW shared pages so it doesn't need to mess > with rmap (the mapcount check makes it safe), but it shouldn't be > allowed to migrate memory that has gup direct-IO in flight (and that > can only be detected with a page_count vs mapcount check). Real > migrate does page_freeze_refs to be safe. Mel comments? Yes, I protested to Mel about that before the holidays, and he quickly provided a patch, now in akpm's tree; but checking it again today, I believe it's still not quite right yet - see other mail. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 16:52 ` Linus Torvalds 2013-01-08 17:30 ` Kirill A. Shutemov 2013-01-08 17:37 ` Andrea Arcangeli @ 2013-01-09 11:44 ` Mel Gorman 2 siblings, 0 replies; 17+ messages in thread From: Mel Gorman @ 2013-01-09 11:44 UTC (permalink / raw) To: Linus Torvalds Cc: Kirill A. Shutemov, Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrea Arcangeli, Andrew Morton, Linux-MM, Rik van Riel On Tue, Jan 08, 2013 at 08:52:14AM -0800, Linus Torvalds wrote: > On Tue, Jan 8, 2013 at 8:31 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote: > >> > >> Heh. I was more thinking about why do_huge_pmd_wp_page() needs it, but > >> do_huge_pmd_numa_page() does not. > > > > It does. The check should be moved up. > > > >> Also, do we actually need it for huge_pmd_set_accessed()? The > >> *placement* of that thing confuses me. And because it confuses me, I'd > >> like to understand it. > > > > We need it for huge_pmd_set_accessed() too. > > > > Looks like a mis-merge. The original patch for huge_pmd_set_accessed() was > > correct: http://lkml.org/lkml/2012/10/25/402 > > Not a merge error: the pmd_trans_splitting() check was removed by > commit d10e63f29488 ("mm: numa: Create basic numa page hinting > infrastructure"). > > Now, *why* it was removed, I can't tell. And it's not clear why the > original code just had it in a conditional, while the suggested patch > has that "goto repeat" thing. It was a mistake by me to remove it and as I screwed up in October I no longer remember how I managed it. The retry versus "goto repeat" is a detail. By retrying the full fault there is a possibility the split will still be in progress on fault retry or that a new THP is collapsed underneath and a new split started while the mmap_sem is released but both are unlikely. On the other side, taking the anon_vma rwsem for write in wait_split_huge_page() could cause delays elsewhere that would be almost impossible to detect so it is not necessarily better. Retrying the fault as your patch does is reasonable. > I suspect re-trying the fault (which I > assume the original code did) is actually better, because that way you > go through all the "should I reschedule as I return through the > exception" stuff. I dunno. > > Mel, that original patch came from you , although it was based on > previous work by Peter/Ingo/Andrea. Can you walk us through the > history and thinking about the loss of pmd_trans_splitting(). Was it > purely a mistake? It looks intentional. > Mistake. Andrea, Peter and Ingo did not make similar mistakes. Looking at your patch, I also think that the check needs to be made before the call to do_huge_pmd_numa_page() so it can reply on a pmd_same() check to make sure a split did not start before the page table lock was taken. In response you said to Andrea Also, and more fundamentally, since do_pmd_numa_page() doesn't take the orig_pmd thing as an argument (and re-check it under the page-table lock), testing pmd_trans_splitting() on it is pointless, since it can change later. do_pmd_numa_page() is called for a normal PMD that is marked pmd_numa(), not a THP PMD. As the mmap_sem is held it cannot collapse to a THP underneath us after the pmd_trans_huge() check so it should be unnecessary to check pmd_trans_splitting() there. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2013-01-11 14:02 UTC | newest] Thread overview: 17+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- [not found] <20130105152208.GA3386@redhat.com> [not found] ` <CAJd=RBCb0oheRnVCM4okVKFvKGzuLp9GpZJCkVY3RR-J=XEoBA@mail.gmail.com> [not found] ` <alpine.LNX.2.00.1301061037140.28950@eggly.anvils> [not found] ` <CAJd=RBAps4Qk9WLYbQhLkJd8d12NLV0CbjPYC6uqH_-L+Vu0VQ@mail.gmail.com> [not found] ` <CA+55aFyYAf6ztDLsxWFD+6jb++y0YNjso-9j+83Mm+3uQ=8PdA@mail.gmail.com> 2013-01-08 13:04 ` oops in copy_page_rep() Hillf Danton 2013-01-08 15:37 ` Linus Torvalds 2013-01-08 16:31 ` Kirill A. Shutemov 2013-01-08 16:52 ` Linus Torvalds 2013-01-08 17:30 ` Kirill A. Shutemov 2013-01-08 17:38 ` Linus Torvalds 2013-01-08 17:49 ` Andrea Arcangeli 2013-01-08 18:03 ` Kirill A. Shutemov 2013-01-11 7:50 ` Simon Jeons 2013-01-11 14:01 ` Andrea Arcangeli 2013-01-08 17:37 ` Andrea Arcangeli 2013-01-08 17:51 ` Linus Torvalds 2013-01-08 18:03 ` Andrea Arcangeli 2013-01-08 18:21 ` Linus Torvalds 2013-01-09 11:38 ` Hillf Danton 2013-01-09 4:23 ` Hugh Dickins 2013-01-09 11:44 ` Mel Gorman
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).