From: Andrew Cooper
Subject: Re: [PATCH] x86/PV: fix unintended dependency of m2p-strict mode on migration-v2
Date: Wed, 13 Jan 2016 15:25:22 +0000
Message-ID: <56966C62.5010201@citrix.com>
In-Reply-To: <5695279002000078000C5F80@prv-mh.provo.novell.com>
To: Jan Beulich
Cc: xen-devel <xen-devel@lists.xenproject.org>, Keir Fraser

On 12/01/16 15:19, Jan Beulich wrote:
>>>> On 12.01.16 at 12:55, wrote:
>> On 12/01/16 10:08, Jan Beulich wrote:
>>> This went unnoticed until a backport of this to an older Xen got used,
>>> causing migration of guests enabling this VM assist to fail, because
>>> page table pinning there precedes vCPU context loading, and hence L4
>>> tables get initialized for the wrong mode. Fix this by post-processing
>>> L4 tables when setting the intended VM assist flags for the guest.
>>>
>>> Note that this leaves in place a dependency on vCPU 0 getting its guest
>>> context restored first, but afaict the logic here is not the only thing
>>> depending on that.
>>>
>>> Signed-off-by: Jan Beulich
>>>
>>> --- a/xen/arch/x86/domain.c
>>> +++ b/xen/arch/x86/domain.c
>>> @@ -1067,8 +1067,48 @@ int arch_set_info_guest(
>>>          goto out;
>>>
>>>      if ( v->vcpu_id == 0 )
>>> +    {
>>>          d->vm_assist = c(vm_assist);
>>>
>>> +        /*
>>> +         * In the restore case we need to deal with L4 pages which got
>>> +         * initialized with m2p_strict still clear (and which hence lack the
>>> +         * correct initial RO_MPT_VIRT_{START,END} L4 entry).
>>> +         */
>>> +        if ( d != current->domain && VM_ASSIST(d, m2p_strict) &&
>>> +             is_pv_domain(d) && !is_pv_32bit_domain(d) &&
>>> +             atomic_read(&d->arch.pv_domain.nr_l4_pages) )
>>> +        {
>>> +            bool_t done = 0;
>>> +
>>> +            spin_lock_recursive(&d->page_alloc_lock);
>>> +
>>> +            for ( i = 0; ; )
>>> +            {
>>> +                struct page_info *page = page_list_remove_head(&d->page_list);
>>> +
>>> +                if ( page_lock(page) )
>>> +                {
>>> +                    if ( (page->u.inuse.type_info & PGT_type_mask) ==
>>> +                         PGT_l4_page_table )
>>> +                        done = !fill_ro_mpt(page_to_mfn(page));
>>> +
>>> +                    page_unlock(page);
>>> +                }
>>> +
>>> +                page_list_add_tail(page, &d->page_list);
>>> +
>>> +                if ( done || (!(++i & 0xff) && hypercall_preempt_check()) )
>>> +                    break;
>>> +            }
>>> +
>>> +            spin_unlock_recursive(&d->page_alloc_lock);
>>> +
>>> +            if ( !done )
>>> +                return -ERESTART;
>> This is a long loop. It is preemptible, but will incur a time delay
>> proportional to the size of the domain during the VM downtime.
>>
>> Could you defer the loop until after %cr3 has been set up, and only
>> enter the loop if the kernel L4 table is missing the RO mappings? That
>> way, domains migrated with migration v2 will skip the loop entirely.
> Well, first of all this would be the result only as long as you or
> someone else don't re-think and possibly move pinning ahead of
> context load again.

A second set_context() will unconditionally hit the loop though.
> Deferring until after CR3 got set up is - afaict - not an option, as
> it would defeat the purpose of m2p-strict mode as much as doing
> the fixup e.g. in the #PF handler. This mode, when enabled, needs to
> strictly mean "L4s start with the slot filled, and user-mode uses
> clear it", as documented.

I just meant a little further down this function, i.e. gate the loop on
whether v->arch.guest_table has inappropriate m2p strictness; something
like the sketch below.
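Completely untested and from memory, so names/API details may not match
staging exactly, but the shape I have in mind is:

    if ( is_pv_domain(d) && !is_pv_32bit_domain(d) &&
         VM_ASSIST(d, m2p_strict) )
    {
        /*
         * Sketch only: look at the RO_MPT_VIRT_START slot of the
         * kernel L4 (which is what fill_ro_mpt() establishes), and
         * only enter the fixup loop if it is missing.
         */
        unsigned long mfn = pagetable_get_pfn(v->arch.guest_table);
        l4_pgentry_t *l4t = map_domain_page(_mfn(mfn));
        bool_t strict_ok =
            !!(l4e_get_flags(l4t[l4_table_offset(RO_MPT_VIRT_START)]) &
               _PAGE_PRESENT);

        unmap_domain_page(l4t);

        if ( !strict_ok )
        {
            /* ... run the preemptible loop from your patch ... */
        }
    }

That way a second set_context(), or a domain whose L4s were already set
up in strict mode (e.g. via a migration-v2 stream), skips the walk
entirely.

~Andrew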