From: Mel Gorman <mgorman@suse.de>
To: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>, Linux-MM <linux-mm@kvack.org>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 14/15] mm: numa: Flush TLB if NUMA hinting faults race with PTE scan update
Date: Wed, 4 Dec 2013 16:07:41 +0000 [thread overview]
Message-ID: <20131204160741.GC11295@suse.de> (raw)
In-Reply-To: <529F3D51.1090203@redhat.com>
On Wed, Dec 04, 2013 at 09:33:53AM -0500, Rik van Riel wrote:
> On 12/03/2013 06:46 PM, Mel Gorman wrote:
> > On Tue, Dec 03, 2013 at 06:07:06PM -0500, Rik van Riel wrote:
> >> On 12/03/2013 03:52 AM, Mel Gorman wrote:
> >>> NUMA PTE updates and NUMA PTE hinting faults can race against each other. The
> >>> setting of the NUMA bit defers the TLB flush to reduce overhead. NUMA
> >>> hinting faults do not flush the TLB as X86 at least does not cache TLB
> >>> entries for !present PTEs. However, in the event that the two race a NUMA
> >>> hinting fault may return with the TLB in an inconsistent state between
> >>> different processors. This patch detects potential for races between the
> >>> NUMA PTE scanner and fault handler and will flush the TLB for the affected
> >>> range if there is a race.
> >>>
> >>> Signed-off-by: Mel Gorman <mgorman@suse.de>
> >>
> >>> diff --git a/mm/migrate.c b/mm/migrate.c
> >>> index 5dfd552..ccc814b 100644
> >>> --- a/mm/migrate.c
> >>> +++ b/mm/migrate.c
> >>> @@ -1662,6 +1662,39 @@ void wait_migrate_huge_page(struct anon_vma *anon_vma, pmd_t *pmd)
> >>> smp_rmb();
> >>> }
> >>>
> >>> +unsigned long numa_fault_prepare(struct mm_struct *mm)
> >>> +{
> >>> + /* Paired with task_numa_work */
> >>> + smp_rmb();
> >>> + return mm->numa_next_reset;
> >>> +}
> >>
> >> The patch that introduces mm->numa_next_reset, and the
> >> patch that increments it, seem to be missing from your
> >> series...
> >>
> >
> > Damn. s/numa_next_reset/numa_next_scan/ in that patch
>
> How does that protect against the race?
>
It's the local processors TLB I was primarily thinking about and the case
in particular is where the fault has cleared the pmd_numa and the scanner
sets it again before the fault completes and without any flush.
> Would it not be possible for task_numa_work to have a longer
> runtime than the numa fault?
>
Yes.
> In other words, task_numa_work can increment numa_next_scan
> before the numa fault starts, and still be doing its thing
> when numa_fault_commit is run...
>
a) the PTE was previously pte_numa, scanner ignores it, fault traps and
clears it with no flush or TLB consistency due to the page being
inaccessible before
b) the PTE was previously !pte_numa, scanner will set it
o Reference is first? No trap
o Reference is after the scanner goes by. If there is a fault trap,
it means the local TLB has seen the protection change and is
consistent. numa_next_scan will not appear to change and a further
flush should be unnecessary as the page was previously inaccessible
c) PTE was previous pte_numa, fault starts, clears pmd, but scanner
resets it before the fault returns. In this case, a change in
numa_next_scan will be observed and the fault will flush the TLB before
returning. It does mean that that particular page gets flushed twice
but TLB of the scanner and faulting processor will be consistent on
return from fault. The faulting CPU will probably fault again due to
the pte being marked numa.
It was the third situation I was concerned with -- a NUMA fault returning
with pmd_numa still set and the TLBs of different processors having different
views. Due to a potential migration copy, the data may be in the TLB but
now inconsistent with the scanner. What's less clear is how the CPU reacts
in this case or if it's even defined. The architectural manual is vague
on what happens if there is access to a PTE just after a protection change
but before a TLB flush. If it was a race against mprotect and the process
segfaulted, it would be considered a buggy application.
> At that point, numa_fault_commit will not be seeing an
> increment in numa_next_scan, and we are relying completely
> on the batched tlb flush by the change_prot_numa.
>
> Is that scenario a problem, or is it ok?
>
I think the TLB is always in an consistent state after the patch even
though additional faults are possible in the event of races.
> And, why? :)
>
Because I found it impossible to segfault processes under any level of
scanning and numa hinting fault stress after it was applied
--
Mel Gorman
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2013-12-04 16:07 UTC|newest]
Thread overview: 38+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-12-03 8:51 [PATCH 00/14] NUMA balancing segmentation faults candidate fix on large machines Mel Gorman
2013-12-03 8:51 ` [PATCH 01/15] mm: numa: Do not batch handle PMD pages Mel Gorman
2013-12-03 8:51 ` [PATCH 02/15] mm: hugetlbfs: fix hugetlbfs optimization Mel Gorman
2013-12-03 8:51 ` [PATCH 03/15] mm: thp: give transparent hugepage code a separate copy_page Mel Gorman
2013-12-04 16:59 ` Alex Thorlton
2013-12-05 13:35 ` Mel Gorman
2013-12-03 8:51 ` [PATCH 04/15] mm: numa: Serialise parallel get_user_page against THP migration Mel Gorman
2013-12-03 23:07 ` Rik van Riel
2013-12-03 23:54 ` Mel Gorman
2013-12-03 8:51 ` [PATCH 05/15] mm: numa: Call MMU notifiers on " Mel Gorman
2013-12-03 8:51 ` [PATCH 06/15] mm: Clear pmd_numa before invalidating Mel Gorman
2013-12-03 8:51 ` [PATCH 07/15] mm: numa: Do not clear PMD during PTE update scan Mel Gorman
2013-12-03 8:51 ` [PATCH 08/15] mm: numa: Do not clear PTE for pte_numa update Mel Gorman
2013-12-03 8:51 ` [PATCH 09/15] mm: numa: Ensure anon_vma is locked to prevent parallel THP splits Mel Gorman
2013-12-03 8:51 ` [PATCH 10/15] mm: numa: Avoid unnecessary work on the failure path Mel Gorman
2013-12-03 8:51 ` [PATCH 11/15] sched: numa: Skip inaccessible VMAs Mel Gorman
2013-12-03 8:51 ` [PATCH 12/15] Clear numa on mprotect Mel Gorman
2013-12-03 8:52 ` [PATCH 13/15] mm: numa: Avoid unnecessary disruption of NUMA hinting during migration Mel Gorman
2013-12-03 8:52 ` [PATCH 14/15] mm: numa: Flush TLB if NUMA hinting faults race with PTE scan update Mel Gorman
2013-12-03 23:07 ` Rik van Riel
2013-12-03 23:46 ` Mel Gorman
2013-12-04 14:33 ` Rik van Riel
2013-12-04 16:07 ` Mel Gorman [this message]
2013-12-05 15:40 ` Rik van Riel
2013-12-05 19:54 ` Mel Gorman
2013-12-05 20:05 ` Rik van Riel
2013-12-06 9:24 ` Mel Gorman
2013-12-06 17:38 ` Alex Thorlton
2013-12-06 18:32 ` Mel Gorman
2013-12-06 19:13 ` [PATCH 14/15] mm: fix TLB flush race between migration, and change_protection_range Rik van Riel
2013-12-06 20:32 ` Christoph Lameter
2013-12-06 21:21 ` Rik van Riel
2013-12-07 0:25 ` Christoph Lameter
2013-12-07 3:14 ` Rik van Riel
2013-12-09 16:00 ` Christoph Lameter
2013-12-09 16:27 ` Mel Gorman
2013-12-09 16:59 ` Christoph Lameter
2013-12-09 21:01 ` Rik van Riel
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20131204160741.GC11295@suse.de \
--to=mgorman@suse.de \
--cc=athorlton@sgi.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=riel@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).