linuxppc-dev.lists.ozlabs.org archive mirror
From: Jerome Glisse <jglisse@redhat.com>
To: Laurent Dufour <ldufour@linux.ibm.com>
Cc: jack@suse.cz, sergey.senozhatsky.work@gmail.com,
	peterz@infradead.org, Will Deacon <will.deacon@arm.com>,
	mhocko@kernel.org, linux-mm@kvack.org, paulus@samba.org,
	Punit Agrawal <punitagrawal@gmail.com>,
	hpa@zytor.com, Michel Lespinasse <walken@google.com>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	ak@linux.intel.com, Minchan Kim <minchan@kernel.org>,
	aneesh.kumar@linux.ibm.com, x86@kernel.org,
	Matthew Wilcox <willy@infradead.org>,
	Daniel Jordan <daniel.m.jordan@oracle.com>,
	Ingo Molnar <mingo@redhat.com>,
	David Rientjes <rientjes@google.com>,
	paulmck@linux.vnet.ibm.com, Haiyan Song <haiyanx.song@intel.com>,
	npiggin@gmail.com, sj38.park@gmail.com, dave@stgolabs.net,
	kemi.wang@intel.com, kirill@shutemov.name,
	Thomas Gleixner <tglx@linutronix.de>,
	zhong jiang <zhongjiang@huawei.com>,
	Ganesh Mahendran <opensource.ganesh@gmail.com>,
	Yang Shi <yang.shi@linux.alibaba.com>,
	Mike Rapoport <rppt@linux.ibm.com>,
	linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org,
	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>,
	Vinayak Menon <vinmenon@codeaurora.org>,
	vinayak menon <vinayakm.list@gmail.com>,
	akpm@linux-foundation.org, Tim Chen <tim.c.chen@linux.intel.com>,
	haren@linux.vnet.ibm.com
Subject: Re: [PATCH v12 18/31] mm: protect against PTE changes done by dup_mmap()
Date: Mon, 22 Apr 2019 16:32:18 -0400
Message-ID: <20190422203217.GI14666@redhat.com>
In-Reply-To: <20190416134522.17540-19-ldufour@linux.ibm.com>

On Tue, Apr 16, 2019 at 03:45:09PM +0200, Laurent Dufour wrote:
> Vinayak Menon and Ganesh Mahendran reported that the following scenario may
> lead to thread being blocked due to data corruption:
> 
>     CPU 1                   CPU 2                    CPU 3
>     Process 1,              Process 1,               Process 1,
>     Thread A                Thread B                 Thread C
> 
>     while (1) {             while (1) {              while(1) {
>     pthread_mutex_lock(l)   pthread_mutex_lock(l)    fork
>     pthread_mutex_unlock(l) pthread_mutex_unlock(l)  }
>     }                       }
> 
> In the details this happens because :
> 
>     CPU 1                CPU 2                       CPU 3
>     fork()
>     copy_pte_range()
>       set PTE rdonly
>     got to next VMA...
>      .                   PTE is seen rdonly          PTE still writable
>      .                   thread is writing to page
>      .                   -> page fault
>      .                     copy the page             Thread writes to page
>      .                      .                        -> no page fault
>      .                     update the PTE
>      .                     flush TLB for that PTE
>    flush TLB                                        PTE are now rdonly

Should the fork be on CPU 3 here, to be consistent with the first diagram?
Threads can move from one CPU to another, so this is only to make it
easier to read and to go from one diagram to the other.

> 
> So the write done by the CPU 3 is interfering with the page copy operation
> done by CPU 2, leading to the data corruption.
> 
> To avoid this we mark all the VMA involved in the COW mechanism as changing
> by calling vm_write_begin(). This ensures that the speculative page fault
> handler will not try to handle a fault on these pages.
> The marker is set until the TLB is flushed, ensuring that all the CPUs will
> now see the PTE as not writable.
> Once the TLB is flush, the marker is removed by calling vm_write_end().
> 
> The variable last is used to keep tracked of the latest VMA marked to
> handle the error path where part of the VMA may have been marked.
> 
> Since multiple VMA from the same mm may have the sequence count increased
> during this process, the use of the vm_raw_write_begin/end() is required to
> avoid lockdep false warning messages.
> 
> Reported-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
> Reported-by: Vinayak Menon <vinmenon@codeaurora.org>
> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>

A minor comment (see below)

Reviewed-by: Jérome Glisse <jglisse@redhat.com>

> ---
>  kernel/fork.c | 30 ++++++++++++++++++++++++++++--
>  1 file changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index f8dae021c2e5..2992d2c95256 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -462,7 +462,7 @@ EXPORT_SYMBOL(free_task);
>  static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  					struct mm_struct *oldmm)
>  {
> -	struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
> +	struct vm_area_struct *mpnt, *tmp, *prev, **pprev, *last = NULL;
>  	struct rb_node **rb_link, *rb_parent;
>  	int retval;
>  	unsigned long charge;
> @@ -581,8 +581,18 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  		rb_parent = &tmp->vm_rb;
>  
>  		mm->map_count++;
> -		if (!(tmp->vm_flags & VM_WIPEONFORK))
> +		if (!(tmp->vm_flags & VM_WIPEONFORK)) {
> +			if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {
> +				/*
> +				 * Mark this VMA as changing to prevent the
> +				 * speculative page fault hanlder to process
> +				 * it until the TLB are flushed below.
> +				 */
> +				last = mpnt;
> +				vm_raw_write_begin(mpnt);
> +			}
>  			retval = copy_page_range(mm, oldmm, mpnt);
> +		}
>  
>  		if (tmp->vm_ops && tmp->vm_ops->open)
>  			tmp->vm_ops->open(tmp);
> @@ -595,6 +605,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  out:
>  	up_write(&mm->mmap_sem);
>  	flush_tlb_mm(oldmm);
> +
> +	if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {

You do not need to check for CONFIG_SPECULATIVE_PAGE_FAULT here: last
will always be NULL if it is not enabled, so the for() loop below does
nothing in that case. That said, the compiler might miss the
optimization opportunity if you keep only the for() loop below.
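
To illustrate the point, here is a minimal user-space sketch, not the
kernel code. It assumes the marker behaves like a sequence count that is
odd while a VMA is marked. struct toy_vma, toy_write_begin(),
toy_write_end() and toy_unmark_from() are hypothetical stand-ins for
struct vm_area_struct, vm_raw_write_begin(), vm_raw_write_end() and the
walk-back loop.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct vm_area_struct: only what this sketch needs. */
struct toy_vma {
	struct toy_vma *prev;	/* models vm_prev */
	unsigned int seq;	/* assumed marker: odd while "changing" */
	int wipe_on_fork;	/* models VM_WIPEONFORK */
	int dont_copy;		/* models VM_DONTCOPY */
};

/* Models vm_raw_write_begin(): mark the VMA as changing. */
static void toy_write_begin(struct toy_vma *vma)
{
	vma->seq++;
}

/* Models vm_raw_write_end(): only valid on a marked VMA. */
static void toy_write_end(struct toy_vma *vma)
{
	assert(vma->seq & 1);
	vma->seq++;
}

/*
 * Walk back from the last marked VMA and unmark every VMA that was
 * marked on the way in. Returns how many VMAs were unmarked.
 * With last == NULL the loop body never runs, so no extra guard is
 * needed for correctness.
 */
static int toy_unmark_from(struct toy_vma *last)
{
	int unmarked = 0;

	for (; last; last = last->prev) {
		if (last->dont_copy)
			continue;
		if (!last->wipe_on_fork) {
			toy_write_end(last);
			unmarked++;
		}
	}
	return unmarked;
}
```

The IS_ENABLED() check then only matters as a hint that lets the
compiler drop the loop entirely when the option is off.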

> +		/*
> +		 * Since the TLB has been flush, we can safely unmark the
> +		 * copied VMAs and allows the speculative page fault handler to
> +		 * process them again.
> +		 * Walk back the VMA list from the last marked VMA.
> +		 */
> +		for (; last; last = last->vm_prev) {
> +			if (last->vm_flags & VM_DONTCOPY)
> +				continue;
> +			if (!(last->vm_flags & VM_WIPEONFORK))
> +				vm_raw_write_end(last);
> +		}
> +	}
> +
>  	up_write(&oldmm->mmap_sem);
>  	dup_userfaultfd_complete(&uf);
>  fail_uprobe_end:
> -- 
> 2.21.0
> 
