From mboxrd@z Thu Jan  1 00:00:00 1970
From: Ben Catterall <Ben.Catterall@citrix.com>
Subject: Re: [RFC 3/4] HVM x86 deprivileged mode: Code for
 switching into/out of deprivileged mode
Date: Fri, 7 Aug 2015 13:51:02 +0100
Message-ID: <55C4A9B6.1030303@citrix.com>
References: <1438879519-564-1-git-send-email-Ben.Catterall@citrix.com>
	<1438879519-564-4-git-send-email-Ben.Catterall@citrix.com>
	<55C3C9C7.8030808@citrix.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <55C3C9C7.8030808@citrix.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Andrew Cooper <andrew.cooper3@citrix.com>, xen-devel@lists.xensource.com
Cc: george.dunlap@eu.citrix.com, tim@xen.org, keir@xen.org, ian.campbell@citrix.com, jbeulich@suse.com
List-Id: xen-devel@lists.xenproject.org


On 06/08/15 21:55, Andrew Cooper wrote:
> On 06/08/15 17:45, Ben Catterall wrote:
>> The process to switch into and out of deprivileged mode can be likened to
>> setjmp/longjmp.
>>
>> To enter deprivileged mode, we take a copy of the stack from the guest's
>> registers up to the current stack pointer. This allows us to restore the stack
>> when we have finished the deprivileged mode operation, meaning we can continue
>> execution from that point. This is similar to if a context switch had happened.
>>
>> To exit deprivileged mode, we copy the stack back, replacing the current stack.
>> We can then continue execution from where we left off, which will unwind the
>> stack and free up resources. This method means that we do not need to
>> change any other code paths and its invocation will be transparent to callers.
>> This should allow the feature to be more easily deployed to different parts
>> of Xen.
>>
>> Note that this copy of the stack is per-vcpu, but it will contain per-pcpu data.
>> Extra work is needed to properly migrate vcpus between pcpus.
>
> Under what circumstances do you see there being persistent state in the
> depriv area between calls, given that the calls are synchronous from VM
> actions?

I don't think we can make these calls strictly synchronous: we need a
way to interrupt the vcpu if it spins in depriv mode for a long time,
otherwise an attacker could just spin there and cause a DoS. Once
depriv mode is interruptible, the scheduler may decide to migrate the
vcpu whilst it's in depriv mode. The per-pcpu data held in the stack
copy would then move to another pcpu with it, where it is no longer
valid.
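
To make the concern concrete, here is a toy C model of it. This is not
Xen code: struct pcpu_info, struct frame, struct toy_vcpu and the
helper functions are all hypothetical stand-ins, with pcpu_info loosely
playing the role of the per-pcpu data at the top of the Xen stack. It
only shows that a saved copy which caches per-pcpu data goes stale once
the vcpu is moved.

```c
#include <assert.h>
#include <string.h>

#define NR_PCPUS 2

/* Hypothetical per-pcpu data (a stand-in, not Xen's struct cpu_info). */
struct pcpu_info {
    unsigned int processor_id;
};

static struct pcpu_info pcpu[NR_PCPUS] = { { 0 }, { 1 } };

/* Toy "stack contents": a pointer to the pcpu the stack was built on. */
struct frame {
    const struct pcpu_info *cached_cpu;
};

/* Toy per-vcpu state: where it runs now, plus the saved stack copy. */
struct toy_vcpu {
    unsigned int running_on;
    struct frame saved;
};

/* Entering depriv mode: copy the live stack into the per-vcpu buffer. */
static void enter_depriv(struct toy_vcpu *v, const struct frame *live)
{
    memcpy(&v->saved, live, sizeof(*live));
}

/* The scheduler moves the vcpu; the saved copy is not fixed up. */
static void migrate(struct toy_vcpu *v, unsigned int dest)
{
    assert(dest < NR_PCPUS);
    v->running_on = dest;
}

/* Nonzero if the saved copy still refers to a pcpu we no longer run on. */
static int saved_frame_is_stale(const struct toy_vcpu *v)
{
    return v->saved.cached_cpu->processor_id != v->running_on;
}
```

Saving on pcpu 0 and then calling migrate(&v, 1) makes
saved_frame_is_stale() return nonzero, which is the case the extra
migration work has to handle (or prevent).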

>
>>
>> The switch to and from deprivileged mode is performed using sysret and syscall
>> respectively.
>
> I suspect we need to borrow the SS attribute workaround from Linux to
> make this function reliably on AMD systems.
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=61f01dd941ba9e06d2bf05994450ecc3d61b6b8b
>
>
Ah! ok, I'll look into this. Thanks!
>>
>> The return paths in entry.S have been edited so that, when we receive an
>> interrupt whilst in deprivileged mode, we return into that mode correctly.
>>
>> A hook on the syscall handler in entry.S has also been added which handles
>> returning from user mode and will support deprivileged mode system calls when
>> these are needed.
>>
>> Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
>> ---
>>   xen/arch/x86/domain.c               |  12 +++
>>   xen/arch/x86/hvm/Makefile           |   1 +
>>   xen/arch/x86/hvm/deprivileged.c     | 103 ++++++++++++++++++
>>   xen/arch/x86/hvm/deprivileged_asm.S | 205 ++++++++++++++++++++++++++++++++++++
>>   xen/arch/x86/hvm/vmx/vmx.c          |   7 ++
>>   xen/arch/x86/x86_64/asm-offsets.c   |   5 +
>>   xen/arch/x86/x86_64/entry.S         |  35 ++++++
>>   xen/include/asm-x86/hvm/vmx/vmx.h   |   2 +
>>   xen/include/xen/hvm/deprivileged.h  |  38 +++++++
>>   xen/include/xen/sched.h             |  18 +++-
>>   10 files changed, 425 insertions(+), 1 deletion(-)
>>   create mode 100644 xen/arch/x86/hvm/deprivileged_asm.S
>>
>> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
>> index 045f6ff..a0e5e70 100644
>> --- a/xen/arch/x86/domain.c
>> +++ b/xen/arch/x86/domain.c
>> @@ -62,6 +62,7 @@
>>   #include <xen/iommu.h>
>>   #include <compat/vcpu.h>
>>   #include <asm/psr.h>
>> +#include <xen/hvm/deprivileged.h>
>>
>>   DEFINE_PER_CPU(struct vcpu *, curr_vcpu);
>>   DEFINE_PER_CPU(unsigned long, cr4);
>> @@ -446,6 +447,12 @@ int vcpu_initialise(struct vcpu *v)
>>       if ( has_hvm_container_domain(d) )
>>       {
>>           rc = hvm_vcpu_initialise(v);
>> +
>> +        /* Initialise HVM deprivileged mode */
>> +        printk("HVM initialising deprivileged mode ...");
>
> All printk()s should have a XENLOG_$severity prefix.
>
will do.
>> +        hvm_deprivileged_prepare_vcpu(v);
>> +        printk("Done.\n");
>> +
>>           goto done;
>>       }
>>
>> @@ -523,7 +530,12 @@ void vcpu_destroy(struct vcpu *v)
>>       vcpu_destroy_fpu(v);
>>
>>       if ( has_hvm_container_vcpu(v) )
>> +    {
>> +        /* Destroy the deprivileged mode on this vcpu */
>> +        hvm_deprivileged_destroy_vcpu(v);
>> +
>>           hvm_vcpu_destroy(v);
>> +    }
>>       else
>>           xfree(v->arch.pv_vcpu.trap_ctxt);
>>   }
>> diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
>> index bd83ba3..6819886 100644
>> --- a/xen/arch/x86/hvm/Makefile
>> +++ b/xen/arch/x86/hvm/Makefile
>> @@ -17,6 +17,7 @@ obj-y += quirks.o
>>   obj-y += rtc.o
>>   obj-y += save.o
>>   obj-y += deprivileged.o
>> +obj-y += deprivileged_asm.o
>>   obj-y += stdvga.o
>>   obj-y += vioapic.o
>>   obj-y += viridian.o
>> diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
>> index 071d900..979fc69 100644
>> --- a/xen/arch/x86/hvm/deprivileged.c
>> +++ b/xen/arch/x86/hvm/deprivileged.c
>> @@ -439,3 +439,106 @@ int hvm_deprivileged_copy_l1(struct domain *d,
>>       }
>>       return 0;
>>   }
>> +
>> +/* Used to prepare each vcpu's data for user mode. Call for each HVM vcpu.
>> + */
>> +int hvm_deprivileged_prepare_vcpu(struct vcpu *vcpu)
>> +{
>> +    struct page_info *pg;
>> +
>> +    /* TODO: clarify if this MEMF is correct */
>> +    /* Allocate 2^STACK_ORDER contiguous pages */
>> +    pg = alloc_domheap_pages(NULL, STACK_ORDER, MEMF_no_owner);
>> +    if( pg == NULL )
>> +    {
>> +        panic("HVM: Out of memory on per-vcpu deprivileged mode init.\n");
>> +        return -ENOMEM;
>> +    }
>> +
>> +    vcpu->stack = page_to_virt(pg);
>
> Xen has two heaps, the xenheap and the domheap.
>
> You may only construct pointers like this into the xenheap.  The domheap
> is not guaranteed to have safe virtual mappings to.  (This code only
> works because your test box isn't bigger than 5TB.  Also there is a bug
> with xenheap allocations at the same point, but I need to fix that bug).
>
> All access to domheap pages must strictly be within a
> map_domain_page()/unmap() region, which construct safe temporary mappings.
>
OK, I'll wrap the accesses in map_domain_page()/unmap_domain_page().
>> +    vcpu->rsp = 0;
>> +    vcpu->user_mode = 0;
>> +
>> +    return 0;
>> +}
>> +
>> +/* Called on destroying each vcpu */
>> +void hvm_deprivileged_destroy_vcpu(struct vcpu *vcpu)
>> +{
>> +    free_domheap_pages(virt_to_page(vcpu->stack), STACK_ORDER);
>> +}
>> +
>> +/* Called to perform a user mode operation.
>> + * Execution context is saved and then we move into user mode.
>> + * This method is then jumped into to restore execution context after
>> + * exiting user mode.
>> + */
>> +void hvm_deprivileged_user_mode(void)
>> +{
>> +    struct vcpu *vcpu = get_current();
>> +    unsigned long int efer = read_efer();
>> +    register unsigned long sp asm("rsp");
>> +
>> +    ASSERT( vcpu->user_mode == 0 );
>> +    ASSERT( vcpu->stack != 0 );
>> +    ASSERT( vcpu->rsp == 0 );
>> +
>> +    /* Flip the SCE bit to allow sysret/call */
>> +    write_efer(efer | EFER_SCE);
>> +
>> +    /* Save the msr lstar and star. Xen does lazy loading of these
>> +     * so we need to put the host values in and then restore the
>> +     * guest ones once we're done.
>> +     */
>> +    rdmsrl(MSR_LSTAR, vcpu->msr_lstar);
>> +    rdmsrl(MSR_STAR, vcpu->msr_star);
>> +    wrmsrl(MSR_LSTAR,get_host_msr_state()->msrs[VMX_INDEX_MSR_LSTAR]);
>> +    wrmsrl(MSR_STAR, get_host_msr_state()->msrs[VMX_INDEX_MSR_STAR]);
>
> A partial context switch like this should be implemented as two new
> hvm_ops such as hvm_op.depriv_ctxt_switch_{to,from}()
>
> This allows you to keep the common code clean of vendor specific code.
>
>> +
>> +    /* The assembly routine to handle moving into/out of deprivileged mode */
>> +    hvm_deprivileged_user_mode_asm();
>> +
>> +    /* If our copy failed */
>> +    if( unlikely(vcpu->rsp == 0) )
>> +    {
>> +        gdprintk(XENLOG_ERR, "HVM: Stack too large in %s\n", __FUNCTION__);
>
> __func__ please.  It conforms to C99 whereas __FUNCTION__ is a gnuism.
>
got it.
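
For the record, a minimal standalone check of the C99 spelling. The
function below is a hypothetical stand-in for the gdprintk() call in
the patch; it formats into a buffer instead of logging so the result
can be inspected.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for the gdprintk() call: builds the same message
 * using the C99 predefined identifier __func__, which expands to the name
 * of the enclosing function.
 */
static int report_stack_too_large(char *buf, size_t len)
{
    return snprintf(buf, len, "HVM: Stack too large in %s\n", __func__);
}
```

Calling report_stack_too_large() with a large enough buffer yields the
message with the function's own name substituted, without relying on
the GNU-specific __FUNCTION__.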
>> +        domain_crash_synchronous();
>> +    }
>> +
>> +    /* Debug info */
>> +    vcpu->old_msr_lstar = get_host_msr_state()->msrs[VMX_INDEX_MSR_LSTAR];
>> +    vcpu->old_msr_star = get_host_msr_state()->msrs[VMX_INDEX_MSR_STAR];
>> +    vcpu->old_rsp = sp;
>> +    vcpu->old_processor = smp_processor_id();
>> +
>> +    /* Restore the efer and saved msr registers */
>> +    write_efer(efer);
>> +    wrmsrl(MSR_LSTAR, vcpu->msr_lstar);
>> +    wrmsrl(MSR_STAR, vcpu->msr_star);
>> +    vcpu->user_mode = 0;
>> +    vcpu->rsp = 0;
>> +}
>> +
>> +/* Called when the user mode operation has completed
>> + * Perform C-level processing on return path
>> + */
>> +void hvm_deprivileged_finish_user_mode(void)
>> +{
>> +    /* If we are not returning from user mode: bail */
>> +    ASSERT(get_current()->user_mode == 1);
>> +
>> +    hvm_deprivileged_finish_user_mode_asm();
>> +}
>> +
>> +void hvm_deprivileged_check_trap(const char* func_name)
>> +{
>> +    if( current->user_mode == 1 )
>> +    {
>> +        printk("HVM Deprivileged Mode: Trap whilst in user mode, %s\n",
>> +               func_name);
>> +        domain_crash_synchronous();
>> +    }
>> +}
>> +
>> +
>> +
>> diff --git a/xen/arch/x86/hvm/deprivileged_asm.S b/xen/arch/x86/hvm/deprivileged_asm.S
>> new file mode 100644
>> index 0000000..00a9e9c
>> --- /dev/null
>> +++ b/xen/arch/x86/hvm/deprivileged_asm.S
>> @@ -0,0 +1,205 @@
>> +/*
>> + * HVM security enhancements assembly code
>> + */
>> +#include <xen/config.h>
>> +#include <xen/errno.h>
>> +#include <xen/softirq.h>
>> +#include <asm/asm_defns.h>
>> +#include <asm/apicdef.h>
>> +#include <asm/page.h>
>> +#include <public/xen.h>
>> +#include <irq_vectors.h>
>> +#include <xen/hvm/deprivileged.h>
>> +
>> +/* Handles entry into the deprivileged mode and returning from this
>> + * mode. This requires copying the current Xen privileged stack across
>> + * to a per-vcpu buffer as we need to be able to handle interrupts and
>> + * exceptions whilst in this mode. Xen is non-preemptable so our
>> + * privileged mode stack would  be clobbered if we did not save it.
>> + *
>> + * If we are entering deprivileged mode, then we use a sysret to get there.
>> + * If we are returning from deprivileged mode, then we need to unwind the stack
>> + * so we copy it back over the current stack so that we can return from the
>> + * call path where we came in from.
>> + *
>> + * We're doing sort-of a long jump/set jump with copying to a stack to
>> + * preserve it and allow returning code to continue executing from
>> + * within this method.
>> + */
>> +ENTRY(hvm_deprivileged_user_mode_asm)
>> +        /* Save our registers */
>> +        push   %rax
>> +        push   %rbx
>> +        push   %rcx
>> +        push   %rdx
>> +        push   %rsi
>> +        push   %rdi
>> +        push   %rbp
>> +        push   %r8
>> +        push   %r9
>> +        push   %r10
>> +        push   %r11
>> +        push   %r12
>> +        push   %r13
>> +        push   %r14
>> +        push   %r15
>> +        pushfq
>> +
>> +        /* Perform a near call to push rip onto the stack */
>> +        call   1f
>> +
>> +        /* Magic: Add to the stored rip the size of the code between
>> +         * label 1 and label 2. This allows  us to restart execution at label 2.
>> +         */
>> +1:      addq   $2f-1b, (%rsp)
>> +
>> +        GET_CURRENT(%r8)
>> +        xor    %rsi, %rsi
>> +
>> +        /* The following is equivalent to
>> +         * (get_cpu_info() + sizeof(struct cpu_info))
>> +         * This gets us to the top of the stack.
>> +         */
>> +        GET_STACK_BASE(%rcx)
>> +        addq   $STACK_SIZE, %rcx
>> +
>> +        movq   VCPU_stack(%r8), %rdi
>> +
>> +        /* We need to copy the current stack across to our buffer
>> +         * Calculate the number of bytes to copy:
>> +         * (top of stack - current stack pointer)
>> +         * NOTE: We must not push any more data onto our stack after this point
>> +         * as it won't be saved.
>> +         */
>> +        sub    %rsp, %rcx
>> +
>> +        /* If the stack is too big, we don't do the copy: handled by caller. */
>> +        cmpq   $STACK_SIZE, %rcx
>> +        ja     3f
>> +
>> +        mov    %rsp, %rsi
>> +/* USER MODE ENTRY POINT */
>> +2:
>> +        /* More magic: If we came here from preparing to go into user mode,
>
> There is a very fine line between magic and gross hack ;)
>
> I haven't quite decided which this is yet, but it certainly is neat, if
> rather opaque.
>
>> +         * then we copy our current stack to the buffer (the lines above
>> +         * have setup rsi, rdi and rcx to do this).
>> +         *
>> +         * If we came here from user mode, then we movsb to copy from
>> +         * our buffer into our current stack so that we can continue
>> +         * execution from the current code point, and return back to the guest
>> +         * via the path we came in. rsi, rdi and rcx have been setup by the
>> +         * de-privileged return path for this.
>> +         */
>> +        rep    movsb
>> +        mov    %rsp, %rsi
>> +
>> +        GET_CURRENT(%r8)
>> +        movq   VCPU_user_mode(%r8), %rdx
>> +        movq   VCPU_rsp(%r8), %rax
>> +
>> +        /* If !user_mode  */
>> +        cmpq   $0, %rdx
>> +        jne    3f
>> +        cli
>> +
>> +        movabs $HVM_DEPRIVILEGED_TEXT_ADDR, %rcx /* RIP in user mode */
>> +
>> +        movq   $0x10200, %r11          /* RFLAGS user mode enable interrupts */
>
> Please use $(X86_FLAGS_IF | X86_FLAGS_MBS) to be more clear which flags
> are being set.
>
will do.
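
Before swapping the literal for named flags I decoded it, since the
rename should not change which bits are set. The sketch below is
standalone C, not Xen code: the FLAG_* names are local to the sketch
and use the architectural RFLAGS bit positions (bit 1 must-be-set,
bit 9 IF, bit 16 RF). It suggests 0x10200 is IF | RF rather than
IF | MBS, so I'll check which bits are actually wanted when I make the
change.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural RFLAGS bits; the names are local to this sketch. */
#define FLAG_MBS (UINT64_C(1) << 1)   /* reserved bit, always reads as 1 */
#define FLAG_IF  (UINT64_C(1) << 9)   /* interrupt enable */
#define FLAG_RF  (UINT64_C(1) << 16)  /* resume flag */

/* The literal currently loaded into %r11 before sysretq. */
#define PATCH_RFLAGS UINT64_C(0x10200)

/* Returns nonzero if every bit in 'mask' is set in 'rflags'. */
static int has_flag(uint64_t rflags, uint64_t mask)
{
    return (rflags & mask) == mask;
}
```

With these definitions, has_flag(PATCH_RFLAGS, FLAG_IF) and
has_flag(PATCH_RFLAGS, FLAG_RF) are nonzero, has_flag(PATCH_RFLAGS,
FLAG_MBS) is zero, and FLAG_IF | FLAG_MBS evaluates to 0x202.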
> Also, by enabling interrupts, you need some hook to short-circuit the
> scheduling softirq.  As it currently stands, a timer interrupt
> interrupting depriv mode is liable to swap all your state out from under
> you.
>
We need interrupts enabled in depriv mode so that the scheduler can
deschedule a vcpu that spins there; otherwise depriv code could cause a
DoS. That is also why some of the return path changes were needed.
>> +        movq   $1, VCPU_user_mode(%r8) /* Now in user mode */
>> +        movq   %rsi, VCPU_rsp(%r8)     /* The rsp to restore to */
>> +
>> +        /* Stack ptr is set by user mode to avoid race conditions.
>
> What race condition are you referring to?
>
>> +         * See Intel manual 2 on the sysret instruction.
>
> As a general rule, read both the Intel and the AMD manual for bits like
> this.  sysret is one of the areas where implementations differ.
>
>> +         */
>> +        movq   $HVM_STACK_PTR, %rbx
>> +        sysretq                         /* Enter deprivileged mode */
>> +
>> +3:      GET_CURRENT(%r8)
>> +        movq   %rsi, VCPU_rsp(%r8)
>> +        pop    %rax    /* Pop off rip: used in a jump so still on stack */
>> +
>> +        /* Restore registers */
>> +        popfq
>> +        pop    %r15
>> +        pop    %r14
>> +        pop    %r13
>> +        pop    %r12
>> +        pop    %r11
>> +        pop    %r10
>> +        pop    %r9
>> +        pop    %r8
>> +        pop    %rbp
>> +        pop    %rdi
>> +        pop    %rsi
>> +        pop    %rdx
>> +        pop    %rcx
>> +        pop    %rbx
>> +        pop    %rax
>> +        ret
>> +
>> +/* Finished in user mode so return */
>> +ENTRY(hvm_deprivileged_finish_user_mode_asm)
>> +        /* The source is the copied stack in our buffer.
>> +         * The destination is our current stack.
>> +         *
>> +         * We need to:
>> +         * - Move the stack pointer to where it was before we entered
>> +         *   deprivileged mode.
>> +         * - Setup rsi, rdi and rcx as needed to perform the copy
>> +         * - Jump to the address held at the top of the stack which
>> +         *   is the user mode return address
>> +         */
>> +        cli
>> +        GET_CURRENT(%rbx)
>> +        movq   VCPU_stack(%rbx), %rsi
>> +        movq   VCPU_rsp(%rbx), %rdi
>> +
>> +        /* The return address that the near call pushed onto the
>> +         * buffer is pointed to by stack, so use that for rip.
>> +         */
>> +        movq   %rdi, %rsp
>> +
>> +        /* The following is equivalent to
>> +         * (get_cpu_info() + sizeof(struct cpu_info) - vcpu->rsp)
>> +         * This works out how many bytes we need to copy:
>> +         * (top of stack - bottom of stack)
>> +         */
>> +        GET_STACK_BASE(%rcx)
>> +        addq   $STACK_SIZE, %rcx
>> +        subq   %rdi, %rcx
>> +
>> +        /* Go to user mode return code */
>> +        jmp    *(%rsi)
>> +
>> +/* Entry point from the assembly syscall handlers */
>> +ENTRY(hvm_deprivileged_handle_user_mode)
>> +
>> +        /* Handle a user mode hypercall here */
>> +
>> +
>> +        /* We are finished in user mode */
>> +        call hvm_deprivileged_finish_user_mode
>> +
>> +        ret
>> +
>> +.section .hvm_deprivileged_enhancement.text,"ax"
>> +/* HVM deprivileged code */
>> +ENTRY(hvm_deprivileged_ring3)
>> +        /* sysret has loaded eip from rcx and rflags from r11.
>> +         * CS and SS have been loaded from the MSR for ring 3.
>> +         * We now need to  switch to the user mode stack
>> +         */
>> +        /* Setup usermode stack */
>> +        movabs $HVM_STACK_PTR, %rsp
>> +
>> +        /* Perform user mode processing */
>> +
>> +        mov $0xf, %rcx
>> +1: dec  %rcx
>> +        cmp $0, %rcx
>> +        jne 1b
>> +
>> +        /* Return to ring 0 */
>> +        syscall
>> +
>> +.previous
>> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
>> index c32d863..595b0f2 100644
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -59,6 +59,8 @@
>>   #include <asm/event.h>
>>   #include <asm/monitor.h>
>>   #include <public/arch-x86/cpuid.h>
>> +#include <xen/hvm/deprivileged.h>
>> +
>>
>>   static bool_t __initdata opt_force_ept;
>>   boolean_param("force-ept", opt_force_ept);
>> @@ -194,6 +196,10 @@ void vmx_save_host_msrs(void)
>>           set_bit(VMX_INDEX_MSR_ ## address, &host_msr_state->flags);     \
>>       } while ( 0 )
>>
>> +struct vmx_msr_state *get_host_msr_state(void) {
>> +    return &this_cpu(host_msr_state);
>> +}
>> +
>>   static enum handler_return
>>   long_mode_do_msr_read(unsigned int msr, uint64_t *msr_content)
>>   {
>> @@ -272,6 +278,7 @@ long_mode_do_msr_write(unsigned int msr, uint64_t msr_content)
>>       case MSR_LSTAR:
>>           if ( !is_canonical_address(msr_content) )
>>               goto uncanonical_address;
>> +
>
> Please avoid spurious changes like this.
>
apologies.
>>           WRITE_MSR(LSTAR);
>>           break;
>>
>> diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
>> index 447c650..fd5de44 100644
>> --- a/xen/arch/x86/x86_64/asm-offsets.c
>> +++ b/xen/arch/x86/x86_64/asm-offsets.c
>> @@ -115,6 +115,11 @@ void __dummy__(void)
>>       OFFSET(VCPU_nsvm_hap_enabled, struct vcpu, arch.hvm_vcpu.nvcpu.u.nsvm.ns_hap_enabled);
>>       BLANK();
>>
>> +    OFFSET(VCPU_stack, struct vcpu, stack);
>> +    OFFSET(VCPU_rsp, struct vcpu, rsp);
>> +    OFFSET(VCPU_user_mode, struct vcpu, user_mode);
>> +    BLANK();
>> +
>>       OFFSET(DOMAIN_is_32bit_pv, struct domain, arch.is_32bit_pv);
>>       BLANK();
>>
>> diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
>> index 74677a2..fa9155c 100644
>> --- a/xen/arch/x86/x86_64/entry.S
>> +++ b/xen/arch/x86/x86_64/entry.S
>> @@ -102,6 +102,15 @@ restore_all_xen:
>>           RESTORE_ALL adj=8
>>           iretq
>>
>> +/* Returning from user mode */
>> +handle_hvm_user_mode:
>> +
>> +        call hvm_deprivileged_handle_user_mode
>> +
>> +        /* Go back into user mode */
>> +        cli
>> +        jmp  restore_all_guest
>> +
>>   /*
>>    * When entering SYSCALL from kernel mode:
>>    *  %rax                            = hypercall vector
>> @@ -131,6 +140,14 @@ ENTRY(lstar_enter)
>>           testb $TF_kernel_mode,VCPU_thread_flags(%rbx)
>>           jz    switch_to_kernel
>>
>> +        /* Were we in Xen's ring 3?  */
>> +        push %rbx
>> +        GET_CURRENT(%rbx)
>> +        movq VCPU_user_mode(%rbx), %rbx
>> +        cmp  $1, %rbx
>> +        je   handle_hvm_user_mode
>> +        pop  %rbx
>
> No need for the movq or rbx clobber.  This entire block can be:
>
> cmpb $1, VCPU_user_mode(%rbx)
> je handle_hvm_user_mode
>
> Similar to the $TF_kernel_mode check in context above.
>
got it. Thanks!
>
>
>> +
>>   /*hypercall:*/
>>           movq  %r10,%rcx
>>           cmpq  $NR_hypercalls,%rax
>> @@ -487,6 +504,13 @@ ENTRY(common_interrupt)
>>   /* No special register assumptions. */
>>   ENTRY(ret_from_intr)
>>           GET_CURRENT(%rbx)
>> +
>> +        /* If we are in Xen's user mode, return into it */
>> +        cmpq $1,VCPU_user_mode(%rbx)
>> +        cli
>> +        je    restore_all_guest
>> +        sti
>> +
>
> None of this should be necessary - the exception frame created by
> lstar_enter should cause ret_from_intr to do the correct thing.
>

I think this is needed: interrupts are enabled in depriv mode, so we
can arrive here from paths other than lstar_enter. The check stops Xen
from treating depriv mode as a PV guest. Without it I saw seemingly
random page faults, general protection faults and similar.
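
A rough C model of the decision I mean, as I read the hunk above. It is
only a sketch: the enum and function names are made up for
illustration, and PV_GUEST_PATH stands for whatever the existing code
does after the CS test fails to pick restore_all_xen.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names for the three outcomes; not identifiers from Xen. */
enum return_path {
    RESTORE_ALL_XEN,    /* interrupted ring 0: return into Xen */
    RESTORE_ALL_GUEST,  /* return straight into ring 3 */
    PV_GUEST_PATH       /* existing handling for an interrupted PV guest */
};

/* Existing logic: only the RPL bits of the saved CS are consulted. */
static enum return_path ret_from_intr_unpatched(uint16_t saved_cs)
{
    if ((saved_cs & 3) == 0)
        return RESTORE_ALL_XEN;
    return PV_GUEST_PATH;
}

/* Patched logic: the per-vcpu user_mode flag is checked first. */
static enum return_path ret_from_intr_patched(unsigned long user_mode,
                                              uint16_t saved_cs)
{
    if (user_mode == 1)
        return RESTORE_ALL_GUEST;
    return ret_from_intr_unpatched(saved_cs);
}
```

An interrupt taken in depriv mode has ring-3 RPL bits in the saved CS,
so the unpatched logic sends it down the PV guest path, while the
patched logic returns into depriv mode and leaves the other cases
unchanged.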

>>           testb $3,UREGS_cs(%rsp)
>>           jz    restore_all_xen
>>           movq  VCPU_domain(%rbx),%rax
>> @@ -509,6 +533,14 @@ handle_exception_saved:
>>           GET_CURRENT(%rbx)
>>           PERFC_INCR(exceptions, %rax, %rbx)
>>           callq *(%rdx,%rax,8)
>> +
>> +        /* If we are in Xen's user mode, return into it */
>> +        /* TODO: Test this path */
>> +        cmpq  $1,VCPU_user_mode(%rbx)
>> +        cli
>> +        je    restore_all_guest
>> +        sti
>> +
>>           testb $3,UREGS_cs(%rsp)
>>           jz    restore_all_xen
>>           leaq  VCPU_trap_bounce(%rbx),%rdx
>> @@ -664,6 +696,9 @@ handle_ist_exception:
>>           movl  $EVENT_CHECK_VECTOR,%edi
>>           call  send_IPI_self
>>   1:      movq  VCPU_domain(%rbx),%rax
>> +        /* This also handles Xen ring3 return for us.
>> +         * So, there is no need to explicitly do a user mode check.
>> +         */
>>           cmpb  $0,DOMAIN_is_32bit_pv(%rax)
>>           je    restore_all_guest
>>           jmp   compat_restore_all_guest
>> diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
>> index 3fbfa44..98e269e 100644
>> --- a/xen/include/asm-x86/hvm/vmx/vmx.h
>> +++ b/xen/include/asm-x86/hvm/vmx/vmx.h
>> @@ -565,4 +565,6 @@ typedef struct {
>>       u16 eptp_index;
>>   } ve_info_t;
>>
>> +struct vmx_msr_state *get_host_msr_state(void);
>> +
>>   #endif /* __ASM_X86_HVM_VMX_VMX_H__ */
>> diff --git a/xen/include/xen/hvm/deprivileged.h b/xen/include/xen/hvm/deprivileged.h
>> index 6cc803e..e42f39a 100644
>> --- a/xen/include/xen/hvm/deprivileged.h
>> +++ b/xen/include/xen/hvm/deprivileged.h
>> @@ -68,6 +68,37 @@ int hvm_deprivileged_copy_l1(struct domain *d,
>>                                unsigned int l1_flags);
>>
>>
>> +/* Used to prepare each vcpu's data for user mode. Call for each HVM vcpu. */
>> +int hvm_deprivileged_prepare_vcpu(struct vcpu *vcpu);
>> +
>> +/* Destroy each vcpu's data for Xen user mode. Again, call for each vcpu. */
>> +void hvm_deprivileged_destroy_vcpu(struct vcpu *vcpu);
>> +
>> +/* Called to perform a user mode operation. */
>> +void hvm_deprivileged_user_mode(void);
>> +
>> +/* Called when the user mode operation has completed */
>> +void hvm_deprivileged_finish_user_mode(void);
>> +
>> +/* Called to move into and then out of user mode. Needed for accessing
>> + * assembly features.
>> + */
>> +void hvm_deprivileged_user_mode_asm(void);
>> +
>> +/* Called on the return path to return to the correct execution point */
>> +void hvm_deprivileged_finish_user_mode_asm(void);
>> +
>> +/* Handle any syscalls that the user mode makes */
>> +void hvm_deprivileged_handle_user_mode(void);
>> +
>> +/* The ring 3 code */
>> +void hvm_deprivileged_ring3(void);
>> +
>> +/* Call when inside a trap that should cause a domain crash if in user mode
>> + * e.g. an invalid_op is trapped whilst in user mode.
>> + */
>> +void hvm_deprivileged_check_trap(const char* func_name);
>> +
>>   /* The segments where the user mode .text and .data are stored */
>>   extern unsigned long int __hvm_deprivileged_text_start;
>>   extern unsigned long int __hvm_deprivileged_text_end;
>> @@ -91,4 +122,11 @@ extern unsigned long int __hvm_deprivileged_data_end;
>>
>>   #define HVM_ERR_PG_ALLOC -1
>>
>> +/* The user mode stack pointer.
>> + * The stack grows down so set this to top of the stack region. Then,
>> + * as this is 0-indexed, move into the stack, not just after it.
>> + * Subtract 16 bytes for correct stack alignment.
>> ++ */
>> +#define HVM_STACK_PTR (HVM_DEPRIVILEGED_STACK_ADDR + STACK_SIZE - 16)
>> +
>>   #endif
>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>> index 73d3bc8..180643e 100644
>> --- a/xen/include/xen/sched.h
>> +++ b/xen/include/xen/sched.h
>> @@ -137,7 +137,7 @@ void evtchn_destroy_final(struct domain *d); /* from complete_domain_destroy */
>>
>>   struct waitqueue_vcpu;
>>
>> -struct vcpu
>> +struct vcpu
>
> Trailing whitespace is nasty, but we avoid inflating the patch by
> dropping whitespace on lines not touched by semantic changes.
>
>>   {
>>       int              vcpu_id;
>>
>> @@ -158,6 +158,22 @@ struct vcpu
>>
>>       void            *sched_priv;    /* scheduler-specific data */
>>
>> +    /* HVM deprivileged mode state */
>> +    void *stack;             /* Location of stack to save data onto */
>> +    unsigned long rsp;       /* rsp of our stack to restore our data to */
>> +    unsigned long user_mode; /* Are we (possibly moving into) in user mode? */
>> +
>> +    /* The lstar and star of the processor that we are currently executing on.
>> +     * We need to save these because Xen does lazy saving of these.
>> +     */
>> +    unsigned long int msr_lstar; /* lstar */
>> +    unsigned long int msr_star;
>
> There should be no need to store this like this.  Follow what the
> current context switching code does.
>
ok, I'll take a look.
> ~Andrew
>
>> +
>> +    /* Debug info */
>> +    unsigned long int old_rsp;
>> +    unsigned long int old_processor;
>> +    unsigned long int old_msr_lstar;
>> +    unsigned long int old_msr_star;
>>       struct vcpu_runstate_info runstate;
>>   #ifndef CONFIG_COMPAT
>>   # define runstate_guest(v) ((v)->runstate_guest)
>