* Re: oops in copy_page_rep()
[not found] ` <CA+55aFyYAf6ztDLsxWFD+6jb++y0YNjso-9j+83Mm+3uQ=8PdA@mail.gmail.com>
@ 2013-01-08 13:04 ` Hillf Danton
2013-01-08 15:37 ` Linus Torvalds
0 siblings, 1 reply; 17+ messages in thread
From: Hillf Danton @ 2013-01-08 13:04 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, Dave Jones, Linux Kernel, Andrea Arcangeli,
Andrew Morton, Mel Gorman, Linux-MM
On Tue, Jan 8, 2013 at 1:34 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Jan 7, 2013 at 4:24 AM, Hillf Danton <dhillf@gmail.com> wrote:
>>
>> I take another try with waiting added, take a look please.
>
> Hmm. Is there some reason we never need to worry about it for the
> "pmd_numa()" case just above?
>
> A comment about this all might be a really good idea.
>
Yes Sir, added.
---
From: Hillf Danton <dhillf@gmail.com>
Subject: [PATCH] mm: restore huge pmd splitting check
Hugh said, it's clear that 3.7 had an important pmd_trans_splitting(orig_pmd)
check there, which went AWOL in
d10e63f29488 "mm: numa: Create basic numa page hinting infrastructure".
It is restored for handling stable page fault, with wait_split_huge_page()
added, as suggested also by Hugh, to avoid reapted faults until the split
has completed.
This work is inspired by the oops reported by Dave Jones at
https://lkml.org/lkml/2013/1/5/115
Signed-off-by: Hillf Danton <dhillf@gmail.com>
---
--- a/mm/memory.c Sun Jan 6 19:49:50 2013
+++ b/mm/memory.c Tue Jan 8 20:28:04 2013
@@ -3710,6 +3710,14 @@ retry:
return do_huge_pmd_numa_page(mm, vma, address,
orig_pmd, pmd);
+ /*
+ * Check if pmd is stable
+ * (numa pmd is stable, see change_huge_pmd())
+ */
+ if (pmd_trans_splitting(orig_pmd)) {
+ wait_split_huge_page(vma->anon_vma, pmd);
+ goto retry;
+ }
if (dirty && !pmd_write(orig_pmd)) {
ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
orig_pmd);
--
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep()
2013-01-08 13:04 ` oops in copy_page_rep() Hillf Danton
@ 2013-01-08 15:37 ` Linus Torvalds
2013-01-08 16:31 ` Kirill A. Shutemov
0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2013-01-08 15:37 UTC (permalink / raw)
To: Hillf Danton
Cc: Hugh Dickins, Dave Jones, Linux Kernel, Andrea Arcangeli,
Andrew Morton, Mel Gorman, Linux-MM
On Tue, Jan 8, 2013 at 5:04 AM, Hillf Danton <dhillf@gmail.com> wrote:
> On Tue, Jan 8, 2013 at 1:34 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Hmm. Is there some reason we never need to worry about it for the
>> "pmd_numa()" case just above?
>>
>> A comment about this all might be a really good idea.
>>
> Yes Sir, added.
Heh. I was more thinking about why do_huge_pmd_wp_page() needs it, but
do_huge_pmd_numa_page() does not.
Also, do we actually need it for huge_pmd_set_accessed()? The
*placement* of that thing confuses me. And because it confuses me, I'd
like to understand it.
Linus
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep()
2013-01-08 15:37 ` Linus Torvalds
@ 2013-01-08 16:31 ` Kirill A. Shutemov
2013-01-08 16:52 ` Linus Torvalds
0 siblings, 1 reply; 17+ messages in thread
From: Kirill A. Shutemov @ 2013-01-08 16:31 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel,
Andrea Arcangeli, Andrew Morton, Mel Gorman, Linux-MM
On Tue, Jan 08, 2013 at 07:37:06AM -0800, Linus Torvalds wrote:
> On Tue, Jan 8, 2013 at 5:04 AM, Hillf Danton <dhillf@gmail.com> wrote:
> > On Tue, Jan 8, 2013 at 1:34 AM, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >>
> >> Hmm. Is there some reason we never need to worry about it for the
> >> "pmd_numa()" case just above?
> >>
> >> A comment about this all might be a really good idea.
> >>
> > Yes Sir, added.
>
> Heh. I was more thinking about why do_huge_pmd_wp_page() needs it, but
> do_huge_pmd_numa_page() does not.
It does. The check should be moved up.
> Also, do we actually need it for huge_pmd_set_accessed()? The
> *placement* of that thing confuses me. And because it confuses me, I'd
> like to understand it.
We need it for huge_pmd_set_accessed() too.
Looks like a mis-merge. The original patch for huge_pmd_set_accessed() was
correct: http://lkml.org/lkml/2012/10/25/402
--
Kirill A. Shutemov
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep()
2013-01-08 16:31 ` Kirill A. Shutemov
@ 2013-01-08 16:52 ` Linus Torvalds
2013-01-08 17:30 ` Kirill A. Shutemov
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Linus Torvalds @ 2013-01-08 16:52 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel,
Andrea Arcangeli, Andrew Morton, Mel Gorman, Linux-MM,
Rik van Riel
On Tue, Jan 8, 2013 at 8:31 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>>
>> Heh. I was more thinking about why do_huge_pmd_wp_page() needs it, but
>> do_huge_pmd_numa_page() does not.
>
> It does. The check should be moved up.
>
>> Also, do we actually need it for huge_pmd_set_accessed()? The
>> *placement* of that thing confuses me. And because it confuses me, I'd
>> like to understand it.
>
> We need it for huge_pmd_set_accessed() too.
>
> Looks like a mis-merge. The original patch for huge_pmd_set_accessed() was
> correct: http://lkml.org/lkml/2012/10/25/402
Not a merge error: the pmd_trans_splitting() check was removed by
commit d10e63f29488 ("mm: numa: Create basic numa page hinting
infrastructure").
Now, *why* it was removed, I can't tell. And it's not clear why the
original code just had it in a conditional, while the suggested patch
has that "goto repeat" thing. I suspect re-trying the fault (which I
assume the original code did) is actually better, because that way you
go through all the "should I reschedule as I return through the
exception" stuff. I dunno.
Mel, that original patch came from you , although it was based on
previous work by Peter/Ingo/Andrea. Can you walk us through the
history and thinking about the loss of pmd_trans_splitting(). Was it
purely a mistake? It looks intentional.
Linus
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep()
2013-01-08 16:52 ` Linus Torvalds
@ 2013-01-08 17:30 ` Kirill A. Shutemov
2013-01-08 17:38 ` Linus Torvalds
2013-01-08 17:49 ` Andrea Arcangeli
2013-01-08 17:37 ` Andrea Arcangeli
2013-01-09 11:44 ` Mel Gorman
2 siblings, 2 replies; 17+ messages in thread
From: Kirill A. Shutemov @ 2013-01-08 17:30 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel,
Andrea Arcangeli, Andrew Morton, Mel Gorman, Linux-MM,
Rik van Riel
On Tue, Jan 08, 2013 at 08:52:14AM -0800, Linus Torvalds wrote:
> On Tue, Jan 8, 2013 at 8:31 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >>
> >> Heh. I was more thinking about why do_huge_pmd_wp_page() needs it, but
> >> do_huge_pmd_numa_page() does not.
> >
> > It does. The check should be moved up.
> >
> >> Also, do we actually need it for huge_pmd_set_accessed()? The
> >> *placement* of that thing confuses me. And because it confuses me, I'd
> >> like to understand it.
> >
> > We need it for huge_pmd_set_accessed() too.
> >
> > Looks like a mis-merge. The original patch for huge_pmd_set_accessed() was
> > correct: http://lkml.org/lkml/2012/10/25/402
>
> Not a merge error: the pmd_trans_splitting() check was removed by
> commit d10e63f29488 ("mm: numa: Create basic numa page hinting
> infrastructure").
Check difference between patch above and merged one -- a1dd450.
Merged patch is obviously broken: huge_pmd_set_accessed() can be called
only if the pmd is under splitting.
--
Kirill A. Shutemov
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep()
2013-01-08 16:52 ` Linus Torvalds
2013-01-08 17:30 ` Kirill A. Shutemov
@ 2013-01-08 17:37 ` Andrea Arcangeli
2013-01-08 17:51 ` Linus Torvalds
2013-01-09 4:23 ` Hugh Dickins
2013-01-09 11:44 ` Mel Gorman
2 siblings, 2 replies; 17+ messages in thread
From: Andrea Arcangeli @ 2013-01-08 17:37 UTC (permalink / raw)
To: Linus Torvalds
Cc: Kirill A. Shutemov, Hillf Danton, Hugh Dickins, Dave Jones,
Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel
Hi,
On Tue, Jan 08, 2013 at 08:52:14AM -0800, Linus Torvalds wrote:
> On Tue, Jan 8, 2013 at 8:31 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >>
> >> Heh. I was more thinking about why do_huge_pmd_wp_page() needs it, but
> >> do_huge_pmd_numa_page() does not.
> >
> > It does. The check should be moved up.
> >
> >> Also, do we actually need it for huge_pmd_set_accessed()? The
> >> *placement* of that thing confuses me. And because it confuses me, I'd
> >> like to understand it.
> >
> > We need it for huge_pmd_set_accessed() too.
> >
> > Looks like a mis-merge. The original patch for huge_pmd_set_accessed() was
> > correct: http://lkml.org/lkml/2012/10/25/402
>
> Not a merge error: the pmd_trans_splitting() check was removed by
> commit d10e63f29488 ("mm: numa: Create basic numa page hinting
> infrastructure").
>
> Now, *why* it was removed, I can't tell. And it's not clear why the
> original code just had it in a conditional, while the suggested patch
> has that "goto repeat" thing. I suspect re-trying the fault (which I
> assume the original code did) is actually better, because that way you
> go through all the "should I reschedule as I return through the
> exception" stuff. I dunno.
The reason it returned to userland and retried the fault is that this
should be infrequent enough not to worry about it and this was
marginally simpler but it could be changed.
If we don't want to return to userland we should wait on the splitting
bit and then take the pte walking routines like if the pmd wasn't
huge. This is not related to the below though.
> Mel, that original patch came from you , although it was based on
> previous work by Peter/Ingo/Andrea. Can you walk us through the
> history and thinking about the loss of pmd_trans_splitting(). Was it
> purely a mistake? It looks intentional.
d10e63f29488 is wrong in removing the pmd_splitting check, I assume
it's an accidental leftover.
My code did this:
@@ -3530,6 +3534,9 @@ retry:
*/
orig_pmd = ACCESS_ONCE(*pmd);
if (pmd_trans_huge(orig_pmd)) {
+ if (pmd_numa(*pmd))
+ return huge_pmd_numa_fixup(mm, address,
+ orig_pmd, pmd);
if (flags & FAULT_FLAG_WRITE &&
!pmd_write(orig_pmd) &&
!pmd_trans_splitting(orig_pmd)) {
The function I was calling was this:
+#ifdef CONFIG_AUTONUMA
+/* NUMA hinting page fault entry point for trans huge pmds */
+int huge_pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,
+ pmd_t pmd, pmd_t *pmdp)
+{
+ struct page *page;
+ bool migrated;
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(pmd, *pmdp)))
+ goto out_unlock;
+
+ page = pmd_page(pmd);
+ pmd = pmd_mknonnuma(pmd);
+ set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmdp, pmd);
+ VM_BUG_ON(pmd_numa(*pmdp));
+ if (unlikely(page_mapcount(page) != 1))
+ goto out_unlock;
+ get_page(page);
+ spin_unlock(&mm->page_table_lock);
+
+ migrated = numa_hinting_fault(page, HPAGE_PMD_NR);
+ if (!migrated)
+ put_page(page);
+
+out:
+ return 0;
+
+out_unlock:
+ spin_unlock(&mm->page_table_lock);
+ goto out;
+}
+#endif
Taking the PT lock, checking pmd_same and clearing the numa bitflag
was perfectly ok even if the pmd was in splitting state the whole
time.
However do_huge_pmd_numa_page is slightly more complex than the above:
the problem is that migrate_misplaced_transhuge_page gets pmdp and pmd
as parameters (unlike the above numa_hinting_fault() function) and it
relies internally to pmd_same too.
/* Recheck the target PMD */
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(*pmd, entry))) {
spin_unlock(&mm->page_table_lock);
[..]
entry = mk_pmd(new_page, vma->vm_page_prot);
entry = pmd_mknonnuma(entry);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
entry = pmd_mkhuge(entry);
page_add_new_anon_rmap(new_page, vma, haddr);
set_pmd_at(mm, haddr, pmd, entry);
And this kind of mangling isn't ok if the pmd was in splitting state
because split_huge_page won't expect the pmd to change (if the numa
bit changes is ok but the pfn cannot change or split_huge_page_map
will go wrong).
So you're right that if migrate_misplaced_transhuge_page will continue
to do the pmd_same check, we should add the pmd_splitting bit in
memory.c for the pmd_numa() path too.
Looking at this, one thing that isn't clear is where the page_count is
checked in migrate_misplaced_transhuge_page. Ok that it's unable to
migrate anon transhuge COW shared pages so it doesn't need to mess
with rmap (the mapcount check makes it safe), but it shouldn't be
allowed to migrate memory that has gup direct-IO in flight (and that
can only be detected with a page_count vs mapcount check). Real
migrate does page_freeze_refs to be safe. Mel comments?
Thanks,
Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep()
2013-01-08 17:30 ` Kirill A. Shutemov
@ 2013-01-08 17:38 ` Linus Torvalds
2013-01-08 17:49 ` Andrea Arcangeli
1 sibling, 0 replies; 17+ messages in thread
From: Linus Torvalds @ 2013-01-08 17:38 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel,
Andrea Arcangeli, Andrew Morton, Mel Gorman, Linux-MM,
Rik van Riel
On Tue, Jan 8, 2013 at 9:30 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> Check difference between patch above and merged one -- a1dd450.
> Merged patch is obviously broken: huge_pmd_set_accessed() can be called
> only if the pmd is under splitting.
Ok, that's a totally different issue, and seems to be due to different
versions (Andrew - any idea why
http://lkml.org/lkml/2012/10/25/402
and commit a1dd450bcb1a ("mm: thp: set the accessed flag for old pages
on access fault") are different?
That said, I actually think that commit a1dd450bcb1a is correct:
huge_pmd_set_accessed() can not *possibly* need to check the splitting
issue, since it takes the page table lock and re-verifies that the pmd
entry is identical, before just setting the access flags.
So that whole thing is irrelevant. huge_pmd_set_accessed() almost
certainly simply doesn't care about splitting.
But look at commit d10e63f29488. That's the one that removes
pmd_trans_splitting() entirely, and does it for the case that *does*
seem to care, namely do_huge_pmd_wp_page().
Linus
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep()
2013-01-08 17:30 ` Kirill A. Shutemov
2013-01-08 17:38 ` Linus Torvalds
@ 2013-01-08 17:49 ` Andrea Arcangeli
2013-01-08 18:03 ` Kirill A. Shutemov
2013-01-11 7:50 ` Simon Jeons
1 sibling, 2 replies; 17+ messages in thread
From: Andrea Arcangeli @ 2013-01-08 17:49 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Linus Torvalds, Hillf Danton, Hugh Dickins, Dave Jones,
Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel
Hi Kirill,
On Tue, Jan 08, 2013 at 07:30:58PM +0200, Kirill A. Shutemov wrote:
> Merged patch is obviously broken: huge_pmd_set_accessed() can be called
> only if the pmd is under splitting.
Of course I assume you meant "only if the pmd is not under splitting".
But no, setting a bitflag like the young bit or clearing or setting
the numa bit won't screw with split_huge_page and it's safe even if
the pmd is under splitting.
Those bits are only checked here at the last stage of
split_huge_page_map after taking the PT lock:
spin_lock(&mm->page_table_lock);
pmd = page_check_address_pmd(page, mm, address,
PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
if (pmd) {
pgtable = pgtable_trans_huge_withdraw(mm);
pmd_populate(mm, &_pmd, pgtable);
haddr = address;
for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
pte_t *pte, entry;
BUG_ON(PageCompound(page+i));
entry = mk_pte(page + i, vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
if (!pmd_write(*pmd))
entry = pte_wrprotect(entry);
else
BUG_ON(page_mapcount(page) != 1);
if (!pmd_young(*pmd))
entry = pte_mkold(entry);
if (pmd_numa(*pmd))
entry = pte_mknuma(entry);
pte = pte_offset_map(&_pmd, haddr);
BUG_ON(!pte_none(*pte));
set_pte_at(mm, haddr, pte, entry);
pte_unmap(pte);
}
If "young" or "numa" bitflags changed on the original *pmd for the
previous part of split_huge_page, nothing will go wrong by the time we
get to split_huge_page_map (the same is not true if the pfn changes!).
If you think this is too tricky, we could also decide to forbid
huge_pmd_set_accessed if the pmd is in splitting state, but I don't
think that flipping young/numa bits while in splitting state, can
cause any problem (if done correctly with PT lock + pmd_same).
Thanks!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep()
2013-01-08 17:37 ` Andrea Arcangeli
@ 2013-01-08 17:51 ` Linus Torvalds
2013-01-08 18:03 ` Andrea Arcangeli
2013-01-09 4:23 ` Hugh Dickins
1 sibling, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2013-01-08 17:51 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Kirill A. Shutemov, Hillf Danton, Hugh Dickins, Dave Jones,
Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel
[-- Attachment #1: Type: text/plain, Size: 916 bytes --]
On Tue, Jan 8, 2013 at 9:37 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> The reason it returned to userland and retried the fault is that this
> should be infrequent enough not to worry about it and this was
> marginally simpler but it could be changed.
Yeah, that was my suspicion. And as mentioned, returning to user land
might actually help with scheduling and/or signal handling latencies
etc, so it might be the right thing to do. Especially if the
alternative is to just busy-loop.
> If we don't want to return to userland we should wait on the splitting
> bit and then take the pte walking routines like if the pmd wasn't
> huge. This is not related to the below though.
How does this patch sound to people? It does the splitting check
before the access bit set (even though I don't think it matters), and
at least talks about the alternatives and the issues a bit.
Hmm?
Linus
[-- Attachment #2: mm.patch --]
[-- Type: application/octet-stream, Size: 750 bytes --]
mm/memory.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index 49fb1cf08611..f5ec3ae03f44 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3715,6 +3715,18 @@ retry:
return do_huge_pmd_numa_page(mm, vma, address,
orig_pmd, pmd);
+ /*
+ * If the pmd is splitting, return and retry the
+ * the fault. We *could* set just the accessed flag,
+ * but it's better to just avoid the races with
+ * splitting entirely.
+ *
+ * Alternative: wait until the split is done, and
+ * goto retry.
+ */
+ if (pmd_trans_splitting(orig_pmd))
+ return 0;
+
if (dirty && !pmd_write(orig_pmd)) {
ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
orig_pmd);
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep()
2013-01-08 17:49 ` Andrea Arcangeli
@ 2013-01-08 18:03 ` Kirill A. Shutemov
2013-01-11 7:50 ` Simon Jeons
1 sibling, 0 replies; 17+ messages in thread
From: Kirill A. Shutemov @ 2013-01-08 18:03 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Linus Torvalds, Hillf Danton, Hugh Dickins, Dave Jones,
Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel
On Tue, Jan 08, 2013 at 06:49:51PM +0100, Andrea Arcangeli wrote:
> Hi Kirill,
>
> On Tue, Jan 08, 2013 at 07:30:58PM +0200, Kirill A. Shutemov wrote:
> > Merged patch is obviously broken: huge_pmd_set_accessed() can be called
> > only if the pmd is under splitting.
>
> Of course I assume you meant "only if the pmd is not under splitting".
The broken merged patch has this:
+ if (dirty && !pmd_write(orig_pmd) &&
!pmd_trans_splitting(orig_pmd)) {
[...]
+ } else {
+ huge_pmd_set_accessed(mm, vma, address, pmd,
+ orig_pmd, dirty);
}
> But no, setting a bitflag like the young bit or clearing or setting
> the numa bit won't screw with split_huge_page and it's safe even if
> the pmd is under splitting.
Okay. Thanks for clarification for me.
--
Kirill A. Shutemov
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep()
2013-01-08 17:51 ` Linus Torvalds
@ 2013-01-08 18:03 ` Andrea Arcangeli
2013-01-08 18:21 ` Linus Torvalds
0 siblings, 1 reply; 17+ messages in thread
From: Andrea Arcangeli @ 2013-01-08 18:03 UTC (permalink / raw)
To: Linus Torvalds
Cc: Kirill A. Shutemov, Hillf Danton, Hugh Dickins, Dave Jones,
Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel
On Tue, Jan 08, 2013 at 09:51:47AM -0800, Linus Torvalds wrote:
> On Tue, Jan 8, 2013 at 9:37 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > The reason it returned to userland and retried the fault is that this
> > should be infrequent enough not to worry about it and this was
> > marginally simpler but it could be changed.
>
> Yeah, that was my suspicion. And as mentioned, returning to user land
> might actually help with scheduling and/or signal handling latencies
> etc, so it might be the right thing to do. Especially if the
> alternative is to just busy-loop.
>
> > If we don't want to return to userland we should wait on the splitting
> > bit and then take the pte walking routines like if the pmd wasn't
> > huge. This is not related to the below though.
>
> How does this patch sound to people? It does the splitting check
> before the access bit set (even though I don't think it matters), and
> at least talks about the alternatives and the issues a bit.
>
> Hmm?
It looks very fine to me, but I suggest to move it above the
pmd_numa() check because of the newly introduced
migrate_misplaced_transhuge_page method relying on pmd_same too.
Thanks!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep()
2013-01-08 18:03 ` Andrea Arcangeli
@ 2013-01-08 18:21 ` Linus Torvalds
2013-01-09 11:38 ` Hillf Danton
0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2013-01-08 18:21 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Kirill A. Shutemov, Hillf Danton, Hugh Dickins, Dave Jones,
Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel
On Tue, Jan 8, 2013 at 10:03 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> It looks very fine to me, but I suggest to move it above the
> pmd_numa() check because of the newly introduced
> migrate_misplaced_transhuge_page method relying on pmd_same too.
Hmm. If we need it there, then we need to fix the *later* case of
pmd_numa() too:
if (pmd_numa(*pmd))
return do_pmd_numa_page(mm, vma, address, pmd);
Also, and more fundamentally, since do_pmd_numa_page() doesn't take
the orig_pmd thing as an argument (and re-check it under the
page-table lock), testing pmd_trans_splitting() on it is pointless,
since it can change later.
So no, moving the check up does *not* make sense, at least not without
other changes. Because if I read things right, pmd_trans_splitting()
really has to be done with the page-table lock protection (where "with
page-table lock protection" does *not* mean that it has to be done
under the page table lock, but if it is done outside, then the pmd
entry has to be re-verified after getting the lock - which both
do_huge_pmd_wp_page() and huge_pmd_set_accessed() correctly do).
Comments?
Linus
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep()
2013-01-08 17:37 ` Andrea Arcangeli
2013-01-08 17:51 ` Linus Torvalds
@ 2013-01-09 4:23 ` Hugh Dickins
1 sibling, 0 replies; 17+ messages in thread
From: Hugh Dickins @ 2013-01-09 4:23 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Linus Torvalds, Kirill A. Shutemov, Hillf Danton, Dave Jones,
Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel
On Tue, 8 Jan 2013, Andrea Arcangeli wrote:
>
> Looking at this, one thing that isn't clear is where the page_count is
> checked in migrate_misplaced_transhuge_page. Ok that it's unable to
> migrate anon transhuge COW shared pages so it doesn't need to mess
> with rmap (the mapcount check makes it safe), but it shouldn't be
> allowed to migrate memory that has gup direct-IO in flight (and that
> can only be detected with a page_count vs mapcount check). Real
> migrate does page_freeze_refs to be safe. Mel comments?
Yes, I protested to Mel about that before the holidays, and he
quickly provided a patch, now in akpm's tree; but checking it again
today, I believe it's still not quite right yet - see other mail.
Hugh
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep()
2013-01-08 18:21 ` Linus Torvalds
@ 2013-01-09 11:38 ` Hillf Danton
0 siblings, 0 replies; 17+ messages in thread
From: Hillf Danton @ 2013-01-09 11:38 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrea Arcangeli, Kirill A. Shutemov, Hugh Dickins, Dave Jones,
Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel
On Wed, Jan 9, 2013 at 2:21 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Jan 8, 2013 at 10:03 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>>
>> It looks very fine to me, but I suggest to move it above the
>> pmd_numa() check because of the newly introduced
>> migrate_misplaced_transhuge_page method relying on pmd_same too.
>
> Hmm. If we need it there, then we need to fix the *later* case of
> pmd_numa() too:
>
> if (pmd_numa(*pmd))
> return do_pmd_numa_page(mm, vma, address, pmd);
>
> Also, and more fundamentally, since do_pmd_numa_page() doesn't take
> the orig_pmd thing as an argument (and re-check it under the
> page-table lock), testing pmd_trans_splitting() on it is pointless,
> since it can change later.
>
A splitting pmd has to be huge first, and we do handle huge pmd already in
the up dozen of lines.
That said, we can igore splitting check in this case, Sir.
Hillf
> So no, moving the check up does *not* make sense, at least not without
> other changes. Because if I read things right, pmd_trans_splitting()
> really has to be done with the page-table lock protection (where "with
> page-table lock protection" does *not* mean that it has to be done
> under the page table lock, but if it is done outside, then the pmd
> entry has to be re-verified after getting the lock - which both
> do_huge_pmd_wp_page() and huge_pmd_set_accessed() correctly do).
>
> Comments?
>
> Linus
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep()
2013-01-08 16:52 ` Linus Torvalds
2013-01-08 17:30 ` Kirill A. Shutemov
2013-01-08 17:37 ` Andrea Arcangeli
@ 2013-01-09 11:44 ` Mel Gorman
2 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2013-01-09 11:44 UTC (permalink / raw)
To: Linus Torvalds
Cc: Kirill A. Shutemov, Hillf Danton, Hugh Dickins, Dave Jones,
Linux Kernel, Andrea Arcangeli, Andrew Morton, Linux-MM,
Rik van Riel
On Tue, Jan 08, 2013 at 08:52:14AM -0800, Linus Torvalds wrote:
> On Tue, Jan 8, 2013 at 8:31 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >>
> >> Heh. I was more thinking about why do_huge_pmd_wp_page() needs it, but
> >> do_huge_pmd_numa_page() does not.
> >
> > It does. The check should be moved up.
> >
> >> Also, do we actually need it for huge_pmd_set_accessed()? The
> >> *placement* of that thing confuses me. And because it confuses me, I'd
> >> like to understand it.
> >
> > We need it for huge_pmd_set_accessed() too.
> >
> > Looks like a mis-merge. The original patch for huge_pmd_set_accessed() was
> > correct: http://lkml.org/lkml/2012/10/25/402
>
> Not a merge error: the pmd_trans_splitting() check was removed by
> commit d10e63f29488 ("mm: numa: Create basic numa page hinting
> infrastructure").
>
> Now, *why* it was removed, I can't tell. And it's not clear why the
> original code just had it in a conditional, while the suggested patch
> has that "goto repeat" thing.
It was a mistake by me to remove it and as I screwed up in October I no
longer remember how I managed it.
The retry versus "goto repeat" is a detail. By retrying the full fault
there is a possibility the split will still be in progress on fault
retry or that a new THP is collapsed underneath and a new split started
while the mmap_sem is released but both are unlikely. On the other side,
taking the anon_vma rwsem for write in wait_split_huge_page() could cause
delays elsewhere that would be almost impossible to detect so it is not
necessarily better. Retrying the fault as your patch does is reasonable.
> I suspect re-trying the fault (which I
> assume the original code did) is actually better, because that way you
> go through all the "should I reschedule as I return through the
> exception" stuff. I dunno.
>
> Mel, that original patch came from you , although it was based on
> previous work by Peter/Ingo/Andrea. Can you walk us through the
> history and thinking about the loss of pmd_trans_splitting(). Was it
> purely a mistake? It looks intentional.
>
Mistake. Andrea, Peter and Ingo did not make similar mistakes.
Looking at your patch, I also think that the check needs to be made before
the call to do_huge_pmd_numa_page() so it can reply on a pmd_same() check
to make sure a split did not start before the page table lock was taken.
In response you said to Andrea
Also, and more fundamentally, since do_pmd_numa_page() doesn't
take the orig_pmd thing as an argument (and re-check it under the
page-table lock), testing pmd_trans_splitting() on it is pointless,
since it can change later.
do_pmd_numa_page() is called for a normal PMD that is marked pmd_numa(), not
a THP PMD. As the mmap_sem is held it cannot collapse to a THP underneath us
after the pmd_trans_huge() check so it should be unnecessary to check
pmd_trans_splitting() there.
--
Mel Gorman
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep()
2013-01-08 17:49 ` Andrea Arcangeli
2013-01-08 18:03 ` Kirill A. Shutemov
@ 2013-01-11 7:50 ` Simon Jeons
2013-01-11 14:01 ` Andrea Arcangeli
1 sibling, 1 reply; 17+ messages in thread
From: Simon Jeons @ 2013-01-11 7:50 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Kirill A. Shutemov, Linus Torvalds, Hillf Danton, Hugh Dickins,
Dave Jones, Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM,
Rik van Riel
On Tue, 2013-01-08 at 18:49 +0100, Andrea Arcangeli wrote:
> Hi Kirill,
>
> On Tue, Jan 08, 2013 at 07:30:58PM +0200, Kirill A. Shutemov wrote:
> > Merged patch is obviously broken: huge_pmd_set_accessed() can be called
> > only if the pmd is under splitting.
>
> Of course I assume you meant "only if the pmd is not under splitting".
>
> But no, setting a bitflag like the young bit or clearing or setting
> the numa bit won't screw with split_huge_page and it's safe even if
> the pmd is under splitting.
>
> Those bits are only checked here at the last stage of
> split_huge_page_map after taking the PT lock:
>
> spin_lock(&mm->page_table_lock);
> pmd = page_check_address_pmd(page, mm, address,
> PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
> if (pmd) {
> pgtable = pgtable_trans_huge_withdraw(mm);
> pmd_populate(mm, &_pmd, pgtable);
>
> haddr = address;
> for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> pte_t *pte, entry;
> BUG_ON(PageCompound(page+i));
> entry = mk_pte(page + i, vma->vm_page_prot);
> entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> if (!pmd_write(*pmd))
> entry = pte_wrprotect(entry);
> else
> BUG_ON(page_mapcount(page) != 1);
> if (!pmd_young(*pmd))
> entry = pte_mkold(entry);
> if (pmd_numa(*pmd))
> entry = pte_mknuma(entry);
> pte = pte_offset_map(&_pmd, haddr);
> BUG_ON(!pte_none(*pte));
> set_pte_at(mm, haddr, pte, entry);
> pte_unmap(pte);
> }
>
> If "young" or "numa" bitflags changed on the original *pmd for the
> previous part of split_huge_page, nothing will go wrong by the time we
> get to split_huge_page_map (the same is not true if the pfn changes!).
>
But this time BUG_ON(mapcount != mapcount2) in function
__split_huge_page will be trigged.
> If you think this is too tricky, we could also decide to forbid
> huge_pmd_set_accessed if the pmd is in splitting state, but I don't
> think that flipping young/numa bits while in splitting state, can
> cause any problem (if done correctly with PT lock + pmd_same).
>
> Thanks!
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: oops in copy_page_rep()
2013-01-11 7:50 ` Simon Jeons
@ 2013-01-11 14:01 ` Andrea Arcangeli
0 siblings, 0 replies; 17+ messages in thread
From: Andrea Arcangeli @ 2013-01-11 14:01 UTC (permalink / raw)
To: Simon Jeons
Cc: Kirill A. Shutemov, Linus Torvalds, Hillf Danton, Hugh Dickins,
Dave Jones, Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM,
Rik van Riel
On Fri, Jan 11, 2013 at 01:50:44AM -0600, Simon Jeons wrote:
> On Tue, 2013-01-08 at 18:49 +0100, Andrea Arcangeli wrote:
> > Hi Kirill,
> >
> > On Tue, Jan 08, 2013 at 07:30:58PM +0200, Kirill A. Shutemov wrote:
> > > Merged patch is obviously broken: huge_pmd_set_accessed() can be called
> > > only if the pmd is under splitting.
> >
> > Of course I assume you meant "only if the pmd is not under splitting".
> >
> > But no, setting a bitflag like the young bit or clearing or setting
> > the numa bit won't screw with split_huge_page and it's safe even if
> > the pmd is under splitting.
> >
> > Those bits are only checked here at the last stage of
> > split_huge_page_map after taking the PT lock:
> >
> > spin_lock(&mm->page_table_lock);
> > pmd = page_check_address_pmd(page, mm, address,
> > PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
> > if (pmd) {
> > pgtable = pgtable_trans_huge_withdraw(mm);
> > pmd_populate(mm, &_pmd, pgtable);
> >
> > haddr = address;
> > for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> > pte_t *pte, entry;
> > BUG_ON(PageCompound(page+i));
> > entry = mk_pte(page + i, vma->vm_page_prot);
> > entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> > if (!pmd_write(*pmd))
> > entry = pte_wrprotect(entry);
> > else
> > BUG_ON(page_mapcount(page) != 1);
> > if (!pmd_young(*pmd))
> > entry = pte_mkold(entry);
> > if (pmd_numa(*pmd))
> > entry = pte_mknuma(entry);
> > pte = pte_offset_map(&_pmd, haddr);
> > BUG_ON(!pte_none(*pte));
> > set_pte_at(mm, haddr, pte, entry);
> > pte_unmap(pte);
> > }
> >
> > If "young" or "numa" bitflags changed on the original *pmd for the
> > previous part of split_huge_page, nothing will go wrong by the time we
> > get to split_huge_page_map (the same is not true if the pfn changes!).
> >
>
> But this time BUG_ON(mapcount != mapcount2) in function
> __split_huge_page will be trigged.
"young" or "numa" bitflags in the pmd don't alter
rmap/mapcount/pagecount/pfn or anything that could affect such BUG_ON,
so I'm not sure why you think so.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2013-01-11 14:02 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20130105152208.GA3386@redhat.com>
[not found] ` <CAJd=RBCb0oheRnVCM4okVKFvKGzuLp9GpZJCkVY3RR-J=XEoBA@mail.gmail.com>
[not found] ` <alpine.LNX.2.00.1301061037140.28950@eggly.anvils>
[not found] ` <CAJd=RBAps4Qk9WLYbQhLkJd8d12NLV0CbjPYC6uqH_-L+Vu0VQ@mail.gmail.com>
[not found] ` <CA+55aFyYAf6ztDLsxWFD+6jb++y0YNjso-9j+83Mm+3uQ=8PdA@mail.gmail.com>
2013-01-08 13:04 ` oops in copy_page_rep() Hillf Danton
2013-01-08 15:37 ` Linus Torvalds
2013-01-08 16:31 ` Kirill A. Shutemov
2013-01-08 16:52 ` Linus Torvalds
2013-01-08 17:30 ` Kirill A. Shutemov
2013-01-08 17:38 ` Linus Torvalds
2013-01-08 17:49 ` Andrea Arcangeli
2013-01-08 18:03 ` Kirill A. Shutemov
2013-01-11 7:50 ` Simon Jeons
2013-01-11 14:01 ` Andrea Arcangeli
2013-01-08 17:37 ` Andrea Arcangeli
2013-01-08 17:51 ` Linus Torvalds
2013-01-08 18:03 ` Andrea Arcangeli
2013-01-08 18:21 ` Linus Torvalds
2013-01-09 11:38 ` Hillf Danton
2013-01-09 4:23 ` Hugh Dickins
2013-01-09 11:44 ` Mel Gorman
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).