Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Avi Kivity <avi@redhat.com>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>, KVM <kvm@vger.kernel.org>,
	Linux Memory Management List <linux-mm@kvack.org>
Subject: Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host
Date: Wed, 22 Aug 2012 11:51:17 +0800	[thread overview]
Message-ID: <50345735.2000807@linux.vnet.ibm.com> (raw)
In-Reply-To: <20120821150618.GJ27696@redhat.com>

On 08/21/2012 11:06 PM, Andrea Arcangeli wrote:
> On Tue, Aug 21, 2012 at 05:46:39PM +0800, Xiao Guangrong wrote:
>> There has a bug in set_pte_at_notify which always set the pte to the
>> new page before release the old page in secondary MMU, at this time,
>> the process will access on the new page, but the secondary MMU still
>> access on the old page, the memory is inconsistent between them
>>
>> Below scenario shows the bug more clearly:
>>
>> at the beginning: *p = 0, and p is write-protected by KSM or shared with
>> parent process
>>
>> CPU 0                                       CPU 1
>> write 1 to p to trigger COW,
>> set_pte_at_notify will be called:
>>   *pte = new_page + W; /* The W bit of pte is set */
>>
>>                                      *p = 1; /* pte is valid, so no #PF */
>>
>>                                      return back to secondary MMU, then
>>                                      the secondary MMU read p, but get:
>>                                      *p == 0;
>>
>>                          /*
>>                           * !!!!!!
>>                           * the host has already set p to 1, but the secondary
>>                           * MMU still get the old value 0
>>                           */
>>
>>   call mmu_notifier_change_pte to release
>>   old page in secondary MMU
> 
> The KSM usage of it looks safe because it will only establish readonly
> ptes with it.

Hmm, in KSM code, i found this code in replace_page:

set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));

It is possible to establish a writable pte, no?

> 
> It seems a problem only for do_wp_page. It wasn't safe to setup
> writable ptes with it. I guess we first introduced it for KSM and then
> we added it to do_wp_page too by mistake.
> 
> The race window is really tiny, it's unlikely it has ever triggered,
> however this one seem to be possible so it's slightly more serious
> than the other race you recently found (the previous one in the exit
> path I think it was impossible to trigger with KVM).

Unfortunately, all these bugs are triggered by test cases.

> 
>> We can fix it by release old page first, then set the pte to the new
>> page.
>>
>> Note, the new page will be firstly used in secondary MMU before it is
>> mapped into the page table of the process, but this is safe because it
>> is protected by the page table lock, there is no race to change the pte
>>
>> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
>> ---
>>  include/linux/mmu_notifier.h |    2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> index 1d1b1e1..8c7435a 100644
>> --- a/include/linux/mmu_notifier.h
>> +++ b/include/linux/mmu_notifier.h
>> @@ -317,8 +317,8 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
>>  	unsigned long ___address = __address;				\
>>  	pte_t ___pte = __pte;						\
>>  									\
>> -	set_pte_at(___mm, ___address, __ptep, ___pte);			\
>>  	mmu_notifier_change_pte(___mm, ___address, ___pte);		\
>> +	set_pte_at(___mm, ___address, __ptep, ___pte);			\
>>  })
> 
> If we establish the spte on the new page, what will happen is the same
> race in reverse. The fundamental problem is that the first guy that
> writes to the "newpage" (guest or host) won't fault again and so it
> will fail to serialize against the PT lock.
> 
> CPU0  		    	    	CPU1
> 				oldpage[1] == 0 (both guest & host)
> oldpage[0] = 1
> trigger do_wp_page
> mmu_notifier_change_pte
> spte = newpage + writable
> 				guest does newpage[1] = 1
> 				vmexit
> 				host read oldpage[1] == 0
> pte = newpage + writable (too late)
> 
> I think the fix is to use ptep_clear_flush_notify whenever
> set_pte_at_notify will establish a writable pte/spte. If the pte/spte
> established by set_pte_at_notify/change_pte is readonly we don't need
> to do the ptep_clear_flush_notify instead because when the host will
> write to the page that will fault and serialize against the
> PT lock (set_pte_at_notify must always run under the PT lock of course).
> 
> How about this:
> 
> =====
>>From 160a0b1b2be9bf96c45b30d9423f8196ecebe351 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Tue, 21 Aug 2012 16:48:11 +0200
> Subject: [PATCH] mmu_notifier: fix race in set_pte_at_notify usage
> 
> Whenever we establish a writable spte with set_pte_at_notify the
> ptep_clear_flush before it must be a _notify one that clears the spte
> too.
> 
> The fundamental problem is that if the primary MMU that writes to the
> "newpage" won't fault again if the pte established by
> set_pte_at_notify is writable. And so it will fail to serialize
> against the PT lock to wait the set_pte_at_notify to finish
> updating all secondary MMUs before the write hits the newpage.
> 
> CPU0  		    	    	CPU1
> 				oldpage[1] == 0 (all MMUs)
> oldpage[0] = 1
> trigger do_wp_page
> take PT lock
> ptep_clear_flush (secondary MMUs
> still have read access to oldpage)
> mmu_notifier_change_pte
> pte = newpage + writable (primary MMU can write to
> newpage)
> 				host write newpage[1] == 1 (no fault,
> 				failed to serialize against PT lock)
> 				vmenter
> 				guest read oldpage[1] == 0


Why? Why guest can read the old page?

Before you set the pte to be writable, mmu_notifier_change_pte is called
that all old pages have been released.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Avi Kivity <avi@redhat.com>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>, KVM <kvm@vger.kernel.org>,
	Linux Memory Management List <linux-mm@kvack.org>
Subject: Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host
Date: Wed, 22 Aug 2012 11:51:17 +0800	[thread overview]
Message-ID: <50345735.2000807@linux.vnet.ibm.com> (raw)
In-Reply-To: <20120821150618.GJ27696@redhat.com>

On 08/21/2012 11:06 PM, Andrea Arcangeli wrote:
> On Tue, Aug 21, 2012 at 05:46:39PM +0800, Xiao Guangrong wrote:
>> There has a bug in set_pte_at_notify which always set the pte to the
>> new page before release the old page in secondary MMU, at this time,
>> the process will access on the new page, but the secondary MMU still
>> access on the old page, the memory is inconsistent between them
>>
>> Below scenario shows the bug more clearly:
>>
>> at the beginning: *p = 0, and p is write-protected by KSM or shared with
>> parent process
>>
>> CPU 0                                       CPU 1
>> write 1 to p to trigger COW,
>> set_pte_at_notify will be called:
>>   *pte = new_page + W; /* The W bit of pte is set */
>>
>>                                      *p = 1; /* pte is valid, so no #PF */
>>
>>                                      return back to secondary MMU, then
>>                                      the secondary MMU read p, but get:
>>                                      *p == 0;
>>
>>                          /*
>>                           * !!!!!!
>>                           * the host has already set p to 1, but the secondary
>>                           * MMU still get the old value 0
>>                           */
>>
>>   call mmu_notifier_change_pte to release
>>   old page in secondary MMU
> 
> The KSM usage of it looks safe because it will only establish readonly
> ptes with it.

Hmm, in KSM code, i found this code in replace_page:

set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));

It is possible to establish a writable pte, no?

> 
> It seems a problem only for do_wp_page. It wasn't safe to setup
> writable ptes with it. I guess we first introduced it for KSM and then
> we added it to do_wp_page too by mistake.
> 
> The race window is really tiny, it's unlikely it has ever triggered,
> however this one seem to be possible so it's slightly more serious
> than the other race you recently found (the previous one in the exit
> path I think it was impossible to trigger with KVM).

Unfortunately, all these bugs are triggered by test cases.

> 
>> We can fix it by release old page first, then set the pte to the new
>> page.
>>
>> Note, the new page will be firstly used in secondary MMU before it is
>> mapped into the page table of the process, but this is safe because it
>> is protected by the page table lock, there is no race to change the pte
>>
>> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
>> ---
>>  include/linux/mmu_notifier.h |    2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> index 1d1b1e1..8c7435a 100644
>> --- a/include/linux/mmu_notifier.h
>> +++ b/include/linux/mmu_notifier.h
>> @@ -317,8 +317,8 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
>>  	unsigned long ___address = __address;				\
>>  	pte_t ___pte = __pte;						\
>>  									\
>> -	set_pte_at(___mm, ___address, __ptep, ___pte);			\
>>  	mmu_notifier_change_pte(___mm, ___address, ___pte);		\
>> +	set_pte_at(___mm, ___address, __ptep, ___pte);			\
>>  })
> 
> If we establish the spte on the new page, what will happen is the same
> race in reverse. The fundamental problem is that the first guy that
> writes to the "newpage" (guest or host) won't fault again and so it
> will fail to serialize against the PT lock.
> 
> CPU0  		    	    	CPU1
> 				oldpage[1] == 0 (both guest & host)
> oldpage[0] = 1
> trigger do_wp_page
> mmu_notifier_change_pte
> spte = newpage + writable
> 				guest does newpage[1] = 1
> 				vmexit
> 				host read oldpage[1] == 0
> pte = newpage + writable (too late)
> 
> I think the fix is to use ptep_clear_flush_notify whenever
> set_pte_at_notify will establish a writable pte/spte. If the pte/spte
> established by set_pte_at_notify/change_pte is readonly we don't need
> to do the ptep_clear_flush_notify instead because when the host will
> write to the page that will fault and serialize against the
> PT lock (set_pte_at_notify must always run under the PT lock of course).
> 
> How about this:
> 
> =====
>>From 160a0b1b2be9bf96c45b30d9423f8196ecebe351 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Tue, 21 Aug 2012 16:48:11 +0200
> Subject: [PATCH] mmu_notifier: fix race in set_pte_at_notify usage
> 
> Whenever we establish a writable spte with set_pte_at_notify the
> ptep_clear_flush before it must be a _notify one that clears the spte
> too.
> 
> The fundamental problem is that if the primary MMU that writes to the
> "newpage" won't fault again if the pte established by
> set_pte_at_notify is writable. And so it will fail to serialize
> against the PT lock to wait the set_pte_at_notify to finish
> updating all secondary MMUs before the write hits the newpage.
> 
> CPU0  		    	    	CPU1
> 				oldpage[1] == 0 (all MMUs)
> oldpage[0] = 1
> trigger do_wp_page
> take PT lock
> ptep_clear_flush (secondary MMUs
> still have read access to oldpage)
> mmu_notifier_change_pte
> pte = newpage + writable (primary MMU can write to
> newpage)
> 				host write newpage[1] == 1 (no fault,
> 				failed to serialize against PT lock)
> 				vmenter
> 				guest read oldpage[1] == 0


Why? Why guest can read the old page?

Before you set the pte to be writable, mmu_notifier_change_pte is called
that all old pages have been released.

next prev parent reply	other threads:[~2012-08-22  3:51 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-08-21  9:46 [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host Xiao Guangrong
2012-08-21  9:46 ` Xiao Guangrong
2012-08-21 15:06 ` Andrea Arcangeli
2012-08-21 15:06   ` Andrea Arcangeli
2012-08-21 15:06   ` Andrea Arcangeli
2012-08-22  3:51   ` Xiao Guangrong [this message]
2012-08-22  3:51     ` Xiao Guangrong
2012-08-22  4:12     ` Hugh Dickins
2012-08-22  4:12       ` Hugh Dickins
2012-08-22  5:37       ` Xiao Guangrong
2012-08-22  5:37         ` Xiao Guangrong
2012-08-22 16:37     ` Andrea Arcangeli
2012-08-22 16:37       ` Andrea Arcangeli
2012-08-23  7:25       ` Xiao Guangrong
2012-08-23  7:25         ` Xiao Guangrong
2012-08-22  6:03   ` Xiao Guangrong
2012-08-22  6:03     ` Xiao Guangrong
2012-08-22 16:29     ` Andrea Arcangeli
2012-08-22 16:29       ` Andrea Arcangeli
2012-08-22 19:15       ` Andrew Morton
2012-08-22 19:15         ` Andrew Morton
2012-08-22 19:50         ` Andrea Arcangeli
2012-08-22 19:50           ` Andrea Arcangeli
2012-08-22 19:58           ` Andrew Morton
2012-08-22 19:58             ` Andrew Morton
2012-08-22 20:14             ` Andrea Arcangeli
2012-08-22 20:14               ` Andrea Arcangeli
2012-08-23  7:34           ` Xiao Guangrong
2012-08-23  7:34             ` Xiao Guangrong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=50345735.2000807@linux.vnet.ibm.com \
    --to=xiaoguangrong@linux.vnet.ibm.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=avi@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mtosatti@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.