All of lore.kernel.org
 help / color / mirror / Atom feed
From: Minchan Kim <minchan@kernel.org>
To: Mel Gorman <mgorman@suse.de>
Cc: Nadav Amit <nadav.amit@gmail.com>,
	Andy Lutomirski <luto@kernel.org>,
	"open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>
Subject: Re: Potential race in TLB flush batching?
Date: Tue, 25 Jul 2017 16:37:48 +0900	[thread overview]
Message-ID: <20170725073748.GB22652@bbox> (raw)
In-Reply-To: <20170724095832.vgvku6vlxkv75r3k@suse.de>

Hi Mel,

On Mon, Jul 24, 2017 at 10:58:32AM +0100, Mel Gorman wrote:
> On Fri, Jul 21, 2017 at 06:19:22PM -0700, Nadav Amit wrote:
> > > At the time of the unlock_page on the reclaim side, any unmapping that
> > > will happen before the flush has taken place. If KSM starts between the
> > > unlock_page and the tlb flush then it'll skip any of the PTEs that were
> > > previously unmapped with stale entries so there is no relevant stale TLB
> > > entry to work with.
> > 
> > I don???t see where this skipping happens, but let???s put this scenario aside
> > for a second. Here is a similar scenario that causes memory corruption. I
> > actually created and tested it (although I needed to hack the kernel to add
> > some artificial latency before the actual flushes and before the actual
> > dedupliaction of KSM).
> > 
> > We are going to cause KSM to deduplicate a page, and after page comparison
> > but before the page is actually replaced, to use a stale PTE entry to 
> > overwrite the page. As a result KSM will lose a write, causing memory
> > corruption.
> > 
> > For this race we need 4 CPUs:
> > 
> > CPU0: Caches a writable and dirty PTE entry, and uses the stale value for
> > write later.
> > 
> > CPU1: Runs madvise_free on the range that includes the PTE. It would clear
> > the dirty-bit. It batches TLB flushes.
> > 
> > CPU2: Writes 4 to /proc/PID/clear_refs , clearing the PTEs soft-dirty. We
> > care about the fact that it clears the PTE write-bit, and of course, batches
> > TLB flushes.
> > 
> > CPU3: Runs KSM. Our purpose is to pass the following test in
> > write_protect_page():
> > 
> > 	if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
> > 	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)))
> > 
> > Since it will avoid TLB flush. And we want to do it while the PTE is stale.
> > Later, and before replacing the page, we would be able to change the page.
> > 
> > Note that all the operations the CPU1-3 perform canhappen in parallel since
> > they only acquire mmap_sem for read.
> > 
> > We start with two identical pages. Everything below regards the same
> > page/PTE.
> > 
> > CPU0		CPU1		CPU2		CPU3
> > ----		----		----		----
> > Write the same
> > value on page
> > 
> > [cache PTE as
> >  dirty in TLB]
> > 
> > 		MADV_FREE
> > 		pte_mkclean()
> > 							
> > 				4 > clear_refs
> > 				pte_wrprotect()
> > 
> > 						write_protect_page()
> > 						[ success, no flush ]
> > 
> > 						pages_indentical()
> > 						[ ok ]
> > 
> > Write to page
> > different value
> > 
> > [Ok, using stale
> >  PTE]
> > 
> > 						replace_page()
> > 
> > 
> > Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late. CPU0
> > already wrote on the page, but KSM ignored this write, and it got lost.
> > 
> 
> Ok, as you say you have reproduced this with corruption, I would suggest
> one path for dealing with it although you'll need to pass it by the
> original authors.
> 
> When unmapping ranges, there is a check for dirty PTEs in
> zap_pte_range() that forces a flush for dirty PTEs which aims to avoid
> writable stale PTEs from CPU0 in a scenario like you laid out above.
> 
> madvise_free misses a similar class of check so I'm adding Minchan Kim
> to the cc as the original author of much of that code. Minchan Kim will
> need to confirm but it appears that two modifications would be required.
> The first should pass in the mmu_gather structure to
> madvise_free_pte_range (at minimum) and force flush the TLB under the
> PTL if a dirty PTE is encountered. The second is that it should consider

OTL: I couldn't read this lengthy discussion so I miss miss something.

About MADV_FREE, I do not understand why it should flush TLB in MADV_FREE
context. MADV_FREE's semantic allows "write(ie, dirty)" so if other thread
in parallel which has stale pte does "store" to make the pte dirty,
it's okay since try_to_unmap_one in shrink_page_list catches the dirty.

In above example, I think KSM should flush the TLB, not MADV_FREE and
soft dirty page hander.

Maybe, I miss something clear, Could you explain it in detail?

> flushing the full affected range as madvise_free holds mmap_sem for
> read-only to avoid problems with two parallel madv_free operations. The
> second is optional because there are other ways it could also be handled
> that may have lower overhead.

Ditto. I cannot understand. Why does two parallel MADV_FREE have a problem?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  parent reply	other threads:[~2017-07-25  7:37 UTC|newest]

Thread overview: 70+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-07-11  0:52 Potential race in TLB flush batching? Nadav Amit
2017-07-11  6:41 ` Mel Gorman
2017-07-11  7:30   ` Nadav Amit
2017-07-11  9:29     ` Mel Gorman
2017-07-11 10:40       ` Nadav Amit
2017-07-11 13:20         ` Mel Gorman
2017-07-11 14:58           ` Andy Lutomirski
2017-07-11 15:53             ` Mel Gorman
2017-07-11 17:23               ` Andy Lutomirski
2017-07-11 19:18                 ` Mel Gorman
2017-07-11 20:06                   ` Nadav Amit
2017-07-11 21:09                     ` Mel Gorman
2017-07-11 20:09                   ` Mel Gorman
2017-07-11 21:52                     ` Mel Gorman
2017-07-11 22:27                       ` Nadav Amit
2017-07-11 22:34                         ` Nadav Amit
2017-07-12  8:27                         ` Mel Gorman
2017-07-12 23:27                           ` Nadav Amit
2017-07-12 23:36                             ` Andy Lutomirski
2017-07-12 23:42                               ` Nadav Amit
2017-07-13  5:38                                 ` Andy Lutomirski
2017-07-13 16:05                                   ` Nadav Amit
2017-07-13 16:06                                     ` Andy Lutomirski
2017-07-13  6:07                             ` Mel Gorman
2017-07-13 16:08                               ` Andy Lutomirski
2017-07-13 17:07                                 ` Mel Gorman
2017-07-13 17:15                                   ` Andy Lutomirski
2017-07-13 18:23                                     ` Mel Gorman
2017-07-14 23:16                               ` Nadav Amit
2017-07-15 15:55                                 ` Mel Gorman
2017-07-15 16:41                                   ` Andy Lutomirski
2017-07-17  7:49                                     ` Mel Gorman
2017-07-18 21:28                                   ` Nadav Amit
2017-07-19  7:41                                     ` Mel Gorman
2017-07-19 19:41                                       ` Nadav Amit
2017-07-19 19:58                                         ` Mel Gorman
2017-07-19 20:20                                           ` Nadav Amit
2017-07-19 21:47                                             ` Mel Gorman
2017-07-19 22:19                                               ` Nadav Amit
2017-07-19 22:59                                                 ` Mel Gorman
2017-07-19 23:39                                                   ` Nadav Amit
2017-07-20  7:43                                                     ` Mel Gorman
2017-07-22  1:19                                                       ` Nadav Amit
2017-07-24  9:58                                                         ` Mel Gorman
2017-07-24 19:46                                                           ` Nadav Amit
2017-07-25  7:37                                                           ` Minchan Kim [this message]
2017-07-25  8:51                                                             ` Mel Gorman
2017-07-25  9:11                                                               ` Minchan Kim
2017-07-25 10:10                                                                 ` Mel Gorman
2017-07-26  5:43                                                                   ` Minchan Kim
2017-07-26  9:22                                                                     ` Mel Gorman
2017-07-26 19:18                                                                       ` Nadav Amit
2017-07-26 23:40                                                                         ` Minchan Kim
2017-07-27  0:09                                                                           ` Nadav Amit
2017-07-27  0:34                                                                             ` Minchan Kim
2017-07-27  0:48                                                                               ` Nadav Amit
2017-07-27  1:13                                                                                 ` Nadav Amit
2017-07-27  7:04                                                                                   ` Minchan Kim
2017-07-27  7:21                                                                                     ` Mel Gorman
2017-07-27 16:04                                                                                       ` Nadav Amit
2017-07-27 17:36                                                                                         ` Mel Gorman
2017-07-26 23:44                                                                       ` Minchan Kim
2017-07-11 22:07                   ` Andy Lutomirski
2017-07-11 22:33                     ` Mel Gorman
2017-07-14  7:00                     ` Benjamin Herrenschmidt
2017-07-14  8:31                       ` Mel Gorman
2017-07-14  9:02                         ` Benjamin Herrenschmidt
2017-07-14  9:27                           ` Mel Gorman
2017-07-14 22:21                             ` Andy Lutomirski
2017-07-11 16:22           ` Nadav Amit

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170725073748.GB22652@bbox \
    --to=minchan@kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=mgorman@suse.de \
    --cc=nadav.amit@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.