From: Andrea Arcangeli <aarcange@redhat.com>
To: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
David Rientjes <rientjes@google.com>,
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: mm: fix BUG in __split_huge_page_pmd
Date: Tue, 15 Oct 2013 17:58:15 +0200 [thread overview]
Message-ID: <20131015155815.GG3479@redhat.com> (raw)
In-Reply-To: <20131015144827.C45DDE0090@blue.fi.intel.com>
On Tue, Oct 15, 2013 at 05:48:27PM +0300, Kirill A. Shutemov wrote:
> Andrea Arcangeli wrote:
> > Hi Hugh,
> >
> > On Tue, Oct 15, 2013 at 04:08:28AM -0700, Hugh Dickins wrote:
> > > Occasionally we hit the BUG_ON(pmd_trans_huge(*pmd)) at the end of
> > > __split_huge_page_pmd(): seen when doing madvise(,,MADV_DONTNEED).
> > >
> > > It's invalid: we don't always have down_write of mmap_sem there:
> > > a racing do_huge_pmd_wp_page() might have copied-on-write to another
> > > huge page before our split_huge_page() got the anon_vma lock.
> > >
> >
> > I don't get exactly the scenario with do_huge_pmd_wp_page(), could you
> > elaborate?
>
> I think the scenario is follow:
>
> CPU0: CPU1
>
> __split_huge_page_pmd()
> page = pmd_page(*pmd);
> do_huge_pmd_wp_page() copy the
> page and changes pmd (the same as on CPU0)
> to point to newly copied page.
> split_huge_page(page)
> where page is original page,
> not allocated on COW.
> pmd still points on huge page.
>
>
> Hugh, have I got it correctly?
So MADV_DONTNEED runs with with a "end" not 2m aligned (requiring 4k
subpage zapping) on a wrprotected trans-huge page that is hitting a
COW. So this scenario would be deterministic (the thread may write
beyond the "end" of the MADV_DONTNEED) and it only requires two
specific events.
With my other scenario with two concurrent MADV_DONTNEED plus a page
fault, you could still lead to split_huge_page_pmd returning with
pmd_trans_huge(*pmd) == true, despite of the loop introduced.
But for the above case, the loop makes a meaningful difference. So I
see the good reason for looping now.
It wouldn't be ok to miss a partial MADV_DONTNEED zapping because of a
concurrent COW, while it would be ok in my other scenario (and the
loop in fact cannot do anything to prevent split_huge_page_pmd return
with the pmd still huge).
My other scenario with two concurrent MADV_DONTNEED and a page fault
is non deterministic so looping was meaningless.
In both scenario, the kernel wouldn't run into stability issues, even
if we only removed the BUG_ON. But the COW scenario, without the loop,
we'd silently miss a partial MADV_DONTNEED on the 4k subpages before
the "end" (or after the "start").
And we still need pmd_none_or_trans_huge_or_clear_bad in
zap_pmd_range, to deal with the non deterministic cases that the loop
won't help (the two MADV_DONTNEED + page fault), in addition to the
loop to deal with the deterministic COW scenario above.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
WARNING: multiple messages have this Message-ID (diff)
From: Andrea Arcangeli <aarcange@redhat.com>
To: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
David Rientjes <rientjes@google.com>,
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: mm: fix BUG in __split_huge_page_pmd
Date: Tue, 15 Oct 2013 17:58:15 +0200 [thread overview]
Message-ID: <20131015155815.GG3479@redhat.com> (raw)
In-Reply-To: <20131015144827.C45DDE0090@blue.fi.intel.com>
On Tue, Oct 15, 2013 at 05:48:27PM +0300, Kirill A. Shutemov wrote:
> Andrea Arcangeli wrote:
> > Hi Hugh,
> >
> > On Tue, Oct 15, 2013 at 04:08:28AM -0700, Hugh Dickins wrote:
> > > Occasionally we hit the BUG_ON(pmd_trans_huge(*pmd)) at the end of
> > > __split_huge_page_pmd(): seen when doing madvise(,,MADV_DONTNEED).
> > >
> > > It's invalid: we don't always have down_write of mmap_sem there:
> > > a racing do_huge_pmd_wp_page() might have copied-on-write to another
> > > huge page before our split_huge_page() got the anon_vma lock.
> > >
> >
> > I don't get exactly the scenario with do_huge_pmd_wp_page(), could you
> > elaborate?
>
> I think the scenario is follow:
>
> CPU0: CPU1
>
> __split_huge_page_pmd()
> page = pmd_page(*pmd);
> do_huge_pmd_wp_page() copy the
> page and changes pmd (the same as on CPU0)
> to point to newly copied page.
> split_huge_page(page)
> where page is original page,
> not allocated on COW.
> pmd still points on huge page.
>
>
> Hugh, have I got it correctly?
So MADV_DONTNEED runs with with a "end" not 2m aligned (requiring 4k
subpage zapping) on a wrprotected trans-huge page that is hitting a
COW. So this scenario would be deterministic (the thread may write
beyond the "end" of the MADV_DONTNEED) and it only requires two
specific events.
With my other scenario with two concurrent MADV_DONTNEED plus a page
fault, you could still lead to split_huge_page_pmd returning with
pmd_trans_huge(*pmd) == true, despite of the loop introduced.
But for the above case, the loop makes a meaningful difference. So I
see the good reason for looping now.
It wouldn't be ok to miss a partial MADV_DONTNEED zapping because of a
concurrent COW, while it would be ok in my other scenario (and the
loop in fact cannot do anything to prevent split_huge_page_pmd return
with the pmd still huge).
My other scenario with two concurrent MADV_DONTNEED and a page fault
is non deterministic so looping was meaningless.
In both scenario, the kernel wouldn't run into stability issues, even
if we only removed the BUG_ON. But the COW scenario, without the loop,
we'd silently miss a partial MADV_DONTNEED on the 4k subpages before
the "end" (or after the "start").
And we still need pmd_none_or_trans_huge_or_clear_bad in
zap_pmd_range, to deal with the non deterministic cases that the loop
won't help (the two MADV_DONTNEED + page fault), in addition to the
loop to deal with the deterministic COW scenario above.
next prev parent reply other threads:[~2013-10-15 15:58 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-10-15 11:08 mm: fix BUG in __split_huge_page_pmd Hugh Dickins
2013-10-15 11:08 ` Hugh Dickins
2013-10-15 11:32 ` Kirill A. Shutemov
2013-10-15 11:32 ` Kirill A. Shutemov
2013-10-15 14:41 ` Andrea Arcangeli
2013-10-15 14:41 ` Andrea Arcangeli
2013-10-15 14:34 ` Andrea Arcangeli
2013-10-15 14:34 ` Andrea Arcangeli
2013-10-15 14:48 ` Kirill A. Shutemov
2013-10-15 14:48 ` Kirill A. Shutemov
2013-10-15 15:58 ` Andrea Arcangeli [this message]
2013-10-15 15:58 ` Andrea Arcangeli
2013-10-15 17:53 ` Hugh Dickins
2013-10-15 17:53 ` Hugh Dickins
2013-10-15 18:55 ` Andrea Arcangeli
2013-10-15 18:55 ` Andrea Arcangeli
2013-10-15 19:28 ` Naoya Horiguchi
2013-10-15 19:28 ` Naoya Horiguchi
2013-10-15 19:44 ` Andrea Arcangeli
2013-10-15 19:44 ` Andrea Arcangeli
2013-10-15 20:16 ` Naoya Horiguchi
2013-10-15 20:16 ` Naoya Horiguchi
2013-10-15 20:30 ` Andrea Arcangeli
2013-10-15 20:30 ` Andrea Arcangeli
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20131015155815.GG3479@redhat.com \
--to=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=hughd@google.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=n-horiguchi@ah.jp.nec.com \
--cc=rientjes@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.