Re: set_pte_at_notify regression

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Haggai Eran <haggaie@mellanox.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>, <linux-mm@kvack.org>,
	<kvm@vger.kernel.org>,
	Alex Fishman <alex.fishman@ravellosystems.com>,
	Mike Rapoport <mike.rapoport@ravellosystems.com>,
	Or Gerlitz <ogerlitz@mellanox.com>,
	Sagi Grimberg <sagig@mellanox.com>,
	Shachar Raindel <raindel@mellanox.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: set_pte_at_notify regression
Date: Thu, 16 Jan 2014 09:52:22 +0200	[thread overview]
Message-ID: <52D78FB6.4020102@mellanox.com> (raw)
In-Reply-To: <20140112175031.GH1141@redhat.com>

On 12/01/2014 19:50, Andrea Arcangeli wrote:
> On Sun, Jan 12, 2014 at 01:56:12PM +0200, Haggai Eran wrote:
>> Hi,
>>
>> On 10/01/2014 18:57, Andrea Arcangeli wrote:
>>> Hi!
>>>
>>> On Fri, Jan 10, 2014 at 06:38:06PM +0200, Izik Eidus wrote:
>>>> It look like commit 6bdb913f0a70a4dfb7f066fb15e2d6f960701d00 break the 
>>>> semantic of set_pte_at_notify.
>>>> The change of calling first to mmu_notifier_invalidate_range_start, then 
>>>> to set_pte_at_notify, and then to mmu_notifier_invalidate_range_end
>>>> not only increase the amount of locks kvm have to take and release by 
>>>> factor of 3, but in addition mmu_notifier_invalidate_range_start is zapping
>>>> the pte entry from kvm, so when set_pte_at_notify get called, it doesn`t 
>>>> have any spte to set and it acctuly get called for nothing, the result is
>>>> increasing of vmexits for kvm from both do_wp_page and replace_page, and 
>>>> broken semantic of set_pte_at_notify.
>>>
>>> Agreed.
>>>
>>> I would suggest to change set_pte_at_notify to return if change_pte
>>> was missing in some mmu notifier attached to this mm, so we can do
>>> something like:
>>>
>>>    ptep = page_check_address(page, mm, addr, &ptl, 0);
>>>    [..]
>>>    notify_missing = false;
>>>    if (... ) {
>>>       	entry = ptep_clear_flush(...);
>>>         [..]
>>> 	notify_missing = set_pte_at_notify(mm, addr, ptep, entry);
>>>    }
>>>    pte_unmap_unlock(ptep, ptl);
>>>    if (notify_missing)
>>>    	mmu_notifier_invalidate_page_if_missing_change_pte(mm, addr);
>>>
>>> and drop the range calls. This will provide sleepability and at the
>>> same time it won't screw the ability of change_pte to update sptes (by
>>> leaving those established by the time change_pte runs).
>>
>> I think it would be better for notifiers that do not support change_pte
>> to keep getting both range_start and range_end notifiers. Otherwise, the
>> invalidate_page notifier might end up marking the old page as dirty
>> after it was already replaced in the primary page table.
> 
> Ok but why would that be a problem? If the secondary pagetable mapping
> is found dirty, the old page shall be marked dirty as it means it was
> modified through the secondary mmu and is on-disk version may need to
> be updated before discarding the in-ram copy. What the difference
> would be to mark the page dirty in the range_start while the primary
> page table is still established, or after?
> 
> ...
> 
> But in places like ksm merging and do_wp_page we hold a page reference
> before we start the primary pagetable updating, until after the mmu
> notifier invalidate.

Right. I missed that page locking.

Another possible issue is with reads from the secondary page table.
Given a read-only page, suppose one host thread writes to that page, and
performs COW, but before it calls the
mmu_notifier_invalidate_page_if_missing_change_pte function another host
thread writes to the same page (this time without a page fault). Then we
have a valid entry in the secondary page table to a stale page, and
someone may read stale data from there.

Do you agree?

Thanks,
Haggai

WARNING: multiple messages have this Message-ID (diff)

From: Haggai Eran <haggaie@mellanox.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>,
	linux-mm@kvack.org, kvm@vger.kernel.org,
	Alex Fishman <alex.fishman@ravellosystems.com>,
	Mike Rapoport <mike.rapoport@ravellosystems.com>,
	Or Gerlitz <ogerlitz@mellanox.com>,
	Sagi Grimberg <sagig@mellanox.com>,
	Shachar Raindel <raindel@mellanox.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: set_pte_at_notify regression
Date: Thu, 16 Jan 2014 09:52:22 +0200	[thread overview]
Message-ID: <52D78FB6.4020102@mellanox.com> (raw)
In-Reply-To: <20140112175031.GH1141@redhat.com>

On 12/01/2014 19:50, Andrea Arcangeli wrote:
> On Sun, Jan 12, 2014 at 01:56:12PM +0200, Haggai Eran wrote:
>> Hi,
>>
>> On 10/01/2014 18:57, Andrea Arcangeli wrote:
>>> Hi!
>>>
>>> On Fri, Jan 10, 2014 at 06:38:06PM +0200, Izik Eidus wrote:
>>>> It look like commit 6bdb913f0a70a4dfb7f066fb15e2d6f960701d00 break the 
>>>> semantic of set_pte_at_notify.
>>>> The change of calling first to mmu_notifier_invalidate_range_start, then 
>>>> to set_pte_at_notify, and then to mmu_notifier_invalidate_range_end
>>>> not only increase the amount of locks kvm have to take and release by 
>>>> factor of 3, but in addition mmu_notifier_invalidate_range_start is zapping
>>>> the pte entry from kvm, so when set_pte_at_notify get called, it doesn`t 
>>>> have any spte to set and it acctuly get called for nothing, the result is
>>>> increasing of vmexits for kvm from both do_wp_page and replace_page, and 
>>>> broken semantic of set_pte_at_notify.
>>>
>>> Agreed.
>>>
>>> I would suggest to change set_pte_at_notify to return if change_pte
>>> was missing in some mmu notifier attached to this mm, so we can do
>>> something like:
>>>
>>>    ptep = page_check_address(page, mm, addr, &ptl, 0);
>>>    [..]
>>>    notify_missing = false;
>>>    if (... ) {
>>>       	entry = ptep_clear_flush(...);
>>>         [..]
>>> 	notify_missing = set_pte_at_notify(mm, addr, ptep, entry);
>>>    }
>>>    pte_unmap_unlock(ptep, ptl);
>>>    if (notify_missing)
>>>    	mmu_notifier_invalidate_page_if_missing_change_pte(mm, addr);
>>>
>>> and drop the range calls. This will provide sleepability and at the
>>> same time it won't screw the ability of change_pte to update sptes (by
>>> leaving those established by the time change_pte runs).
>>
>> I think it would be better for notifiers that do not support change_pte
>> to keep getting both range_start and range_end notifiers. Otherwise, the
>> invalidate_page notifier might end up marking the old page as dirty
>> after it was already replaced in the primary page table.
> 
> Ok but why would that be a problem? If the secondary pagetable mapping
> is found dirty, the old page shall be marked dirty as it means it was
> modified through the secondary mmu and is on-disk version may need to
> be updated before discarding the in-ram copy. What the difference
> would be to mark the page dirty in the range_start while the primary
> page table is still established, or after?
> 
> ...
> 
> But in places like ksm merging and do_wp_page we hold a page reference
> before we start the primary pagetable updating, until after the mmu
> notifier invalidate.

Right. I missed that page locking.

Another possible issue is with reads from the secondary page table.
Given a read-only page, suppose one host thread writes to that page, and
performs COW, but before it calls the
mmu_notifier_invalidate_page_if_missing_change_pte function another host
thread writes to the same page (this time without a page fault). Then we
have a valid entry in the secondary page table to a stale page, and
someone may read stale data from there.

Do you agree?

Thanks,
Haggai

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2014-01-16  7:52 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-01-10 16:38 set_pte_at_notify regression Izik Eidus
2014-01-10 16:57 ` Andrea Arcangeli
2014-01-12 11:56   ` Haggai Eran
2014-01-12 11:56     ` Haggai Eran
2014-01-12 17:50     ` Andrea Arcangeli
2014-01-16  7:52       ` Haggai Eran [this message]
2014-01-16  7:52         ` Haggai Eran

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=52D78FB6.4020102@mellanox.com \
    --to=haggaie@mellanox.com \
    --cc=a.p.zijlstra@chello.nl \
    --cc=aarcange@redhat.com \
    --cc=alex.fishman@ravellosystems.com \
    --cc=izik.eidus@ravellosystems.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mike.rapoport@ravellosystems.com \
    --cc=ogerlitz@mellanox.com \
    --cc=raindel@mellanox.com \
    --cc=sagig@mellanox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.