From: Minchan Kim <minchan@kernel.org>
To: Mel Gorman <mgorman@suse.de>
Cc: Nadav Amit <nadav.amit@gmail.com>,
Andy Lutomirski <luto@kernel.org>,
"open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>
Subject: Re: Potential race in TLB flush batching?
Date: Tue, 25 Jul 2017 16:37:48 +0900 [thread overview]
Message-ID: <20170725073748.GB22652@bbox> (raw)
In-Reply-To: <20170724095832.vgvku6vlxkv75r3k@suse.de>
Hi Mel,
On Mon, Jul 24, 2017 at 10:58:32AM +0100, Mel Gorman wrote:
> On Fri, Jul 21, 2017 at 06:19:22PM -0700, Nadav Amit wrote:
> > > At the time of the unlock_page on the reclaim side, any unmapping that
> > > will happen before the flush has taken place. If KSM starts between the
> > > unlock_page and the tlb flush then it'll skip any of the PTEs that were
> > > previously unmapped with stale entries so there is no relevant stale TLB
> > > entry to work with.
> >
> > I don???t see where this skipping happens, but let???s put this scenario aside
> > for a second. Here is a similar scenario that causes memory corruption. I
> > actually created and tested it (although I needed to hack the kernel to add
> > some artificial latency before the actual flushes and before the actual
> > dedupliaction of KSM).
> >
> > We are going to cause KSM to deduplicate a page, and after page comparison
> > but before the page is actually replaced, to use a stale PTE entry to
> > overwrite the page. As a result KSM will lose a write, causing memory
> > corruption.
> >
> > For this race we need 4 CPUs:
> >
> > CPU0: Caches a writable and dirty PTE entry, and uses the stale value for
> > write later.
> >
> > CPU1: Runs madvise_free on the range that includes the PTE. It would clear
> > the dirty-bit. It batches TLB flushes.
> >
> > CPU2: Writes 4 to /proc/PID/clear_refs , clearing the PTEs soft-dirty. We
> > care about the fact that it clears the PTE write-bit, and of course, batches
> > TLB flushes.
> >
> > CPU3: Runs KSM. Our purpose is to pass the following test in
> > write_protect_page():
> >
> > if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
> > (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)))
> >
> > Since it will avoid TLB flush. And we want to do it while the PTE is stale.
> > Later, and before replacing the page, we would be able to change the page.
> >
> > Note that all the operations the CPU1-3 perform canhappen in parallel since
> > they only acquire mmap_sem for read.
> >
> > We start with two identical pages. Everything below regards the same
> > page/PTE.
> >
> > CPU0 CPU1 CPU2 CPU3
> > ---- ---- ---- ----
> > Write the same
> > value on page
> >
> > [cache PTE as
> > dirty in TLB]
> >
> > MADV_FREE
> > pte_mkclean()
> >
> > 4 > clear_refs
> > pte_wrprotect()
> >
> > write_protect_page()
> > [ success, no flush ]
> >
> > pages_indentical()
> > [ ok ]
> >
> > Write to page
> > different value
> >
> > [Ok, using stale
> > PTE]
> >
> > replace_page()
> >
> >
> > Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late. CPU0
> > already wrote on the page, but KSM ignored this write, and it got lost.
> >
>
> Ok, as you say you have reproduced this with corruption, I would suggest
> one path for dealing with it although you'll need to pass it by the
> original authors.
>
> When unmapping ranges, there is a check for dirty PTEs in
> zap_pte_range() that forces a flush for dirty PTEs which aims to avoid
> writable stale PTEs from CPU0 in a scenario like you laid out above.
>
> madvise_free misses a similar class of check so I'm adding Minchan Kim
> to the cc as the original author of much of that code. Minchan Kim will
> need to confirm but it appears that two modifications would be required.
> The first should pass in the mmu_gather structure to
> madvise_free_pte_range (at minimum) and force flush the TLB under the
> PTL if a dirty PTE is encountered. The second is that it should consider
OTL: I couldn't read this lengthy discussion so I miss miss something.
About MADV_FREE, I do not understand why it should flush TLB in MADV_FREE
context. MADV_FREE's semantic allows "write(ie, dirty)" so if other thread
in parallel which has stale pte does "store" to make the pte dirty,
it's okay since try_to_unmap_one in shrink_page_list catches the dirty.
In above example, I think KSM should flush the TLB, not MADV_FREE and
soft dirty page hander.
Maybe, I miss something clear, Could you explain it in detail?
> flushing the full affected range as madvise_free holds mmap_sem for
> read-only to avoid problems with two parallel madv_free operations. The
> second is optional because there are other ways it could also be handled
> that may have lower overhead.
Ditto. I cannot understand. Why does two parallel MADV_FREE have a problem?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2017-07-25 7:37 UTC|newest]
Thread overview: 70+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-07-11 0:52 Potential race in TLB flush batching? Nadav Amit
2017-07-11 6:41 ` Mel Gorman
2017-07-11 7:30 ` Nadav Amit
2017-07-11 9:29 ` Mel Gorman
2017-07-11 10:40 ` Nadav Amit
2017-07-11 13:20 ` Mel Gorman
2017-07-11 14:58 ` Andy Lutomirski
2017-07-11 15:53 ` Mel Gorman
2017-07-11 17:23 ` Andy Lutomirski
2017-07-11 19:18 ` Mel Gorman
2017-07-11 20:06 ` Nadav Amit
2017-07-11 21:09 ` Mel Gorman
2017-07-11 20:09 ` Mel Gorman
2017-07-11 21:52 ` Mel Gorman
2017-07-11 22:27 ` Nadav Amit
2017-07-11 22:34 ` Nadav Amit
2017-07-12 8:27 ` Mel Gorman
2017-07-12 23:27 ` Nadav Amit
2017-07-12 23:36 ` Andy Lutomirski
2017-07-12 23:42 ` Nadav Amit
2017-07-13 5:38 ` Andy Lutomirski
2017-07-13 16:05 ` Nadav Amit
2017-07-13 16:06 ` Andy Lutomirski
2017-07-13 6:07 ` Mel Gorman
2017-07-13 16:08 ` Andy Lutomirski
2017-07-13 17:07 ` Mel Gorman
2017-07-13 17:15 ` Andy Lutomirski
2017-07-13 18:23 ` Mel Gorman
2017-07-14 23:16 ` Nadav Amit
2017-07-15 15:55 ` Mel Gorman
2017-07-15 16:41 ` Andy Lutomirski
2017-07-17 7:49 ` Mel Gorman
2017-07-18 21:28 ` Nadav Amit
2017-07-19 7:41 ` Mel Gorman
2017-07-19 19:41 ` Nadav Amit
2017-07-19 19:58 ` Mel Gorman
2017-07-19 20:20 ` Nadav Amit
2017-07-19 21:47 ` Mel Gorman
2017-07-19 22:19 ` Nadav Amit
2017-07-19 22:59 ` Mel Gorman
2017-07-19 23:39 ` Nadav Amit
2017-07-20 7:43 ` Mel Gorman
2017-07-22 1:19 ` Nadav Amit
2017-07-24 9:58 ` Mel Gorman
2017-07-24 19:46 ` Nadav Amit
2017-07-25 7:37 ` Minchan Kim [this message]
2017-07-25 8:51 ` Mel Gorman
2017-07-25 9:11 ` Minchan Kim
2017-07-25 10:10 ` Mel Gorman
2017-07-26 5:43 ` Minchan Kim
2017-07-26 9:22 ` Mel Gorman
2017-07-26 19:18 ` Nadav Amit
2017-07-26 23:40 ` Minchan Kim
2017-07-27 0:09 ` Nadav Amit
2017-07-27 0:34 ` Minchan Kim
2017-07-27 0:48 ` Nadav Amit
2017-07-27 1:13 ` Nadav Amit
2017-07-27 7:04 ` Minchan Kim
2017-07-27 7:21 ` Mel Gorman
2017-07-27 16:04 ` Nadav Amit
2017-07-27 17:36 ` Mel Gorman
2017-07-26 23:44 ` Minchan Kim
2017-07-11 22:07 ` Andy Lutomirski
2017-07-11 22:33 ` Mel Gorman
2017-07-14 7:00 ` Benjamin Herrenschmidt
2017-07-14 8:31 ` Mel Gorman
2017-07-14 9:02 ` Benjamin Herrenschmidt
2017-07-14 9:27 ` Mel Gorman
2017-07-14 22:21 ` Andy Lutomirski
2017-07-11 16:22 ` Nadav Amit
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170725073748.GB22652@bbox \
--to=minchan@kernel.org \
--cc=linux-mm@kvack.org \
--cc=luto@kernel.org \
--cc=mgorman@suse.de \
--cc=nadav.amit@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).