From: Ben Catterall
Subject: Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
Date: Fri, 7 Aug 2015 13:51:02 +0100
Message-ID: <55C4A9B6.1030303@citrix.com>
References: <1438879519-564-1-git-send-email-Ben.Catterall@citrix.com>
 <1438879519-564-4-git-send-email-Ben.Catterall@citrix.com>
 <55C3C9C7.8030808@citrix.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
In-Reply-To: <55C3C9C7.8030808@citrix.com>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Andrew Cooper, xen-devel@lists.xensource.com
Cc: george.dunlap@eu.citrix.com, tim@xen.org, keir@xen.org,
 ian.campbell@citrix.com, jbeulich@suse.com
List-Id: xen-devel@lists.xenproject.org

On 06/08/15 21:55, Andrew Cooper wrote:
> On 06/08/15 17:45, Ben Catterall wrote:
>> The process to switch into and out of deprivileged mode can be likened to
>> setjmp/longjmp.
>>
>> To enter deprivileged mode, we take a copy of the stack from the guest's
>> registers up to the current stack pointer. This allows us to restore the stack
>> when we have finished the deprivileged mode operation, meaning we can continue
>> execution from that point. This is similar to what happens on a context switch.
>>
>> To exit deprivileged mode, we copy the stack back, replacing the current stack.
>> We can then continue execution from where we left off, which will unwind the
>> stack and free up resources. This method means that we do not need to
>> change any other code paths and its invocation will be transparent to callers.
>> This should allow the feature to be more easily deployed to different parts
>> of Xen.
>>
>> Note that this copy of the stack is per-vcpu, but it will contain per-pcpu data.
>> Extra work is needed to properly migrate vcpus between pcpus.
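The stack copy described above can be modelled in hosted C. This is only a sketch. model_vcpu, model_save_stack() and model_restore_stack() are hypothetical names, a byte array stands in for Xen's per-pcpu stack, and an index into it stands in for %rsp. The sketch mirrors the "top of stack - current stack pointer" length calculation and the "stack too large, let the caller handle it" bail-out that the patch does in deprivileged_asm.S.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sizes for a hosted model; Xen's real STACK_SIZE is larger. */
#define MODEL_STACK_SIZE 64   /* simulated privileged stack */
#define MODEL_BUF_SIZE   32   /* per-vcpu save buffer (cf. vcpu->stack) */

struct model_vcpu {
    unsigned char buf[MODEL_BUF_SIZE]; /* copy of the live stack */
    size_t saved_len;                  /* bytes saved; 0 = nothing saved */
};

/* "setjmp" half: copy the live region [sp, top) of the stack into the
 * per-vcpu buffer. Returns 0 on success or -1 if it does not fit,
 * leaving the failure for the caller to handle. */
static int model_save_stack(struct model_vcpu *v,
                            const unsigned char *stack, size_t sp)
{
    size_t len;

    if ( sp > MODEL_STACK_SIZE )
        return -1;
    len = MODEL_STACK_SIZE - sp;          /* top of stack - stack pointer */
    if ( len > MODEL_BUF_SIZE )
        return -1;
    memcpy(v->buf, stack + sp, len);
    v->saved_len = len;
    return 0;
}

/* "longjmp" half: copy the saved bytes back so that they again end at
 * the top of the stack, and return the stack pointer to resume from. */
static size_t model_restore_stack(struct model_vcpu *v, unsigned char *stack)
{
    size_t sp = MODEL_STACK_SIZE - v->saved_len;

    memcpy(stack + sp, v->buf, v->saved_len);
    v->saved_len = 0;
    return sp;
}
```

In the real code, the clobbering that the restore undoes comes from interrupts and exceptions taken while in deprivileged mode, which reuse the same privileged stack.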
>
> Under what circumstances do you see there being persistent state in the
> depriv area between calls, given that the calls are synchronous from VM
> actions?

I don't know if we can make these synchronous, as we need a way to interrupt the vcpu if it spins for a long time; otherwise an attacker could simply spin in depriv mode and cause a DoS. With that in mind, the scheduler may decide to migrate the vcpu whilst it is in depriv mode. The per-pcpu data held in the stack copy would then be carried over to another pcpu, where it is no longer valid.
>
>>
>> The switch to and from deprivileged mode is performed using sysret and syscall
>> respectively.
>
> I suspect we need to borrow the SS attribute workaround from Linux to
> make this function reliably on AMD systems.
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=61f01dd941ba9e06d2bf05994450ecc3d61b6b8b
>

Ah! ok, I'll look into this. Thanks!
>>
>> The return paths in entry.S have been edited so that, when we receive an
>> interrupt whilst in deprivileged mode, we return into that mode correctly.
>>
>> A hook on the syscall handler in entry.S has also been added which handles
>> returning from user mode and will support deprivileged mode system calls when
>> these are needed.
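Ben's concern above about per-pcpu data in a migrated stack copy can be shown with a minimal sketch. It assumes the simplest possible policy: remember which pcpu took the stack copy and refuse to copy it back anywhere else. All names are hypothetical. The patch itself only records old_processor as debug information, and the thread leaves open what should happen on a mismatch, for example pinning the vcpu or fixing up the per-pcpu data.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bookkeeping for one vcpu's saved stack copy. */
struct model_depriv_state {
    bool saved;              /* is a stack copy outstanding? */
    unsigned int saved_cpu;  /* pcpu the copy was taken on */
};

/* Record which pcpu the stack copy was taken on (cf. old_processor). */
static void model_note_save(struct model_depriv_state *s, unsigned int cpu)
{
    s->saved = true;
    s->saved_cpu = cpu;
}

/* Under this model, copying the stack back is only safe on the pcpu that
 * saved it, because the copy may hold per-pcpu data such as addresses
 * inside that pcpu's own stack. */
static bool model_restore_ok(const struct model_depriv_state *s,
                             unsigned int cpu)
{
    return s->saved && s->saved_cpu == cpu;
}
```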
>>
>> Signed-off-by: Ben Catterall
>> ---
>>  xen/arch/x86/domain.c               |  12 +++
>>  xen/arch/x86/hvm/Makefile           |   1 +
>>  xen/arch/x86/hvm/deprivileged.c     | 103 ++++++++++++++++++
>>  xen/arch/x86/hvm/deprivileged_asm.S | 205 ++++++++++++++++++++++++++++++++++++
>>  xen/arch/x86/hvm/vmx/vmx.c          |   7 ++
>>  xen/arch/x86/x86_64/asm-offsets.c   |   5 +
>>  xen/arch/x86/x86_64/entry.S         |  35 ++++++
>>  xen/include/asm-x86/hvm/vmx/vmx.h   |   2 +
>>  xen/include/xen/hvm/deprivileged.h  |  38 +++++++
>>  xen/include/xen/sched.h             |  18 +++-
>>  10 files changed, 425 insertions(+), 1 deletion(-)
>>  create mode 100644 xen/arch/x86/hvm/deprivileged_asm.S
>>
>> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
>> index 045f6ff..a0e5e70 100644
>> --- a/xen/arch/x86/domain.c
>> +++ b/xen/arch/x86/domain.c
>> @@ -62,6 +62,7 @@
>>  #include
>>  #include
>>  #include
>> +#include
>>
>>  DEFINE_PER_CPU(struct vcpu *, curr_vcpu);
>>  DEFINE_PER_CPU(unsigned long, cr4);
>> @@ -446,6 +447,12 @@ int vcpu_initialise(struct vcpu *v)
>>      if ( has_hvm_container_domain(d) )
>>      {
>>          rc = hvm_vcpu_initialise(v);
>> +
>> +        /* Initialise HVM deprivileged mode */
>> +        printk("HVM initialising deprivileged mode ...");
>
> All printk()s should have a XENLOG_$severity prefix.
>
will do.
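Here is a hosted sketch of the severity-prefix convention Andrew asks for. The prefix is an ordinary string literal that the compiler joins to the format string through adjacent-literal concatenation. The XENLOG_INFO value below is a stand-in assumed for illustration, since the real definition lives in Xen's headers, and model_printk() is a hypothetical stand-in for printk(). The sketch also prints one complete message after the work succeeds, rather than the "..." / "Done." pair, and uses the C99 __func__ identifier that Andrew asks for later in the thread.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Stand-in severity prefix, assumed for illustration only. */
#define XENLOG_INFO "<2>"

static char model_log[128];   /* captures the last message */

/* Hypothetical hosted stand-in for Xen's printk(). */
static void model_printk(const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    vsnprintf(model_log, sizeof(model_log), fmt, ap);
    va_end(ap);
}

/* One complete message; XENLOG_INFO and the format string are adjacent
 * literals, so they become a single format string at compile time. */
static void model_report_ready(int domid, int vcpuid)
{
    model_printk(XENLOG_INFO "%s: d%dv%d deprivileged mode ready\n",
                 __func__, domid, vcpuid);
}
```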
>> +        hvm_deprivileged_prepare_vcpu(v);
>> +        printk("Done.\n");
>> +
>>          goto done;
>>      }
>>
>> @@ -523,7 +530,12 @@ void vcpu_destroy(struct vcpu *v)
>>      vcpu_destroy_fpu(v);
>>
>>      if ( has_hvm_container_vcpu(v) )
>> +    {
>> +        /* Destroy the deprivileged mode on this vcpu */
>> +        hvm_deprivileged_destroy_vcpu(v);
>> +
>>          hvm_vcpu_destroy(v);
>> +    }
>>      else
>>          xfree(v->arch.pv_vcpu.trap_ctxt);
>>  }
>> diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
>> index bd83ba3..6819886 100644
>> --- a/xen/arch/x86/hvm/Makefile
>> +++ b/xen/arch/x86/hvm/Makefile
>> @@ -17,6 +17,7 @@ obj-y += quirks.o
>>  obj-y += rtc.o
>>  obj-y += save.o
>>  obj-y += deprivileged.o
>> +obj-y += deprivileged_asm.o
>>  obj-y += stdvga.o
>>  obj-y += vioapic.o
>>  obj-y += viridian.o
>> diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
>> index 071d900..979fc69 100644
>> --- a/xen/arch/x86/hvm/deprivileged.c
>> +++ b/xen/arch/x86/hvm/deprivileged.c
>> @@ -439,3 +439,106 @@ int hvm_deprivileged_copy_l1(struct domain *d,
>>      }
>>      return 0;
>>  }
>> +
>> +/* Used to prepare each vcpu's data for user mode. Call for each HVM vcpu.
>> + */
>> +int hvm_deprivileged_prepare_vcpu(struct vcpu *vcpu)
>> +{
>> +    struct page_info *pg;
>> +
>> +    /* TODO: clarify if this MEMF is correct */
>> +    /* Allocate 2^STACK_ORDER contiguous pages */
>> +    pg = alloc_domheap_pages(NULL, STACK_ORDER, MEMF_no_owner);
>> +    if( pg == NULL )
>> +    {
>> +        panic("HVM: Out of memory on per-vcpu deprivileged mode init.\n");
>> +        return -ENOMEM;
>> +    }
>> +
>> +    vcpu->stack = page_to_virt(pg);
>
> Xen has two heaps, the xenheap and the domheap.
>
> You may only construct pointers like this into the xenheap.  The domheap
> is not guaranteed to have safe virtual mappings to.  (This code only
> works because your test box isn't bigger than 5TB.  Also there is a bug
> with xenheap allocations at the same point, but I need to fix that bug).
>
> All access to domheap pages must strictly be within a
> map_domain_page()/unmap() region, which construct safe temporary mappings.
>
ok, I'll add these.
>> +    vcpu->rsp = 0;
>> +    vcpu->user_mode = 0;
>> +
>> +    return 0;
>> +}
>> +
>> +/* Called on destroying each vcpu */
>> +void hvm_deprivileged_destroy_vcpu(struct vcpu *vcpu)
>> +{
>> +    free_domheap_pages(virt_to_page(vcpu->stack), STACK_ORDER);
>> +}
>> +
>> +/* Called to perform a user mode operation.
>> + * Execution context is saved and then we move into user mode.
>> + * This method is then jumped into to restore execution context after
>> + * exiting user mode.
>> + */
>> +void hvm_deprivileged_user_mode(void)
>> +{
>> +    struct vcpu *vcpu = get_current();
>> +    unsigned long int efer = read_efer();
>> +    register unsigned long sp asm("rsp");
>> +
>> +    ASSERT( vcpu->user_mode == 0 );
>> +    ASSERT( vcpu->stack != 0 );
>> +    ASSERT( vcpu->rsp == 0 );
>> +
>> +    /* Flip the SCE bit to allow sysret/call */
>> +    write_efer(efer | EFER_SCE);
>> +
>> +    /* Save the msr lstar and star. Xen does lazy loading of these
>> +     * so we need to put the host values in and then restore the
>> +     * guest ones once we're done.
>> +     */
>> +    rdmsrl(MSR_LSTAR, vcpu->msr_lstar);
>> +    rdmsrl(MSR_STAR, vcpu->msr_star);
>> +    wrmsrl(MSR_LSTAR,get_host_msr_state()->msrs[VMX_INDEX_MSR_LSTAR]);
>> +    wrmsrl(MSR_STAR, get_host_msr_state()->msrs[VMX_INDEX_MSR_STAR]);
>
> A partial context switch like this should be implemented as two new
> hvm_ops such as hvm_op.depriv_ctxt_switch_{to,from}()
>
> This allows you to keep the common code clean of vendor specific code.
>
>> +
>> +    /* The assembly routine to handle moving into/out of deprivileged mode */
>> +    hvm_deprivileged_user_mode_asm();
>> +
>> +    /* If our copy failed */
>> +    if( unlikely(vcpu->rsp == 0) )
>> +    {
>> +        gdprintk(XENLOG_ERR, "HVM: Stack too large in %s\n", __FUNCTION__);
>
> __func__ please.  It conforms to C99 whereas __FUNCTION__ is a gnuism.
>
got it.
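Andrew's hvm_ops suggestion amounts to a vendor hook table that common code calls through, so the MSR_LSTAR/MSR_STAR juggling stays in vmx.c. Xen already selects vendor code through its per-vendor struct hvm_function_table. The sketch below is a hosted model with hypothetical names. The VMX hooks only flip a flag here, where the real hooks would load and restore the MSRs.

```c
#include <assert.h>

struct model_hvm_vcpu {
    int in_depriv_ctxt;   /* set while the depriv MSR state is loaded */
};

/* Hypothetical vendor hook table modelled on the suggested
 * depriv_ctxt_switch_{to,from}() hooks; names are illustrative. */
struct model_hvm_function_table {
    void (*depriv_ctxt_switch_to)(struct model_hvm_vcpu *v);
    void (*depriv_ctxt_switch_from)(struct model_hvm_vcpu *v);
};

/* VMX halves: real versions would swap MSR_LSTAR/MSR_STAR. */
static void model_vmx_depriv_to(struct model_hvm_vcpu *v)   { v->in_depriv_ctxt = 1; }
static void model_vmx_depriv_from(struct model_hvm_vcpu *v) { v->in_depriv_ctxt = 0; }

static const struct model_hvm_function_table model_vmx_funcs = {
    .depriv_ctxt_switch_to   = model_vmx_depriv_to,
    .depriv_ctxt_switch_from = model_vmx_depriv_from,
};

/* Chosen once at start of day (VMX here; an SVM table would sit alongside). */
static const struct model_hvm_function_table *model_hvm_funcs = &model_vmx_funcs;

/* Example depriv operation: succeeds only inside the switched context. */
static int model_sample_op(struct model_hvm_vcpu *v)
{
    return v->in_depriv_ctxt ? 0 : -1;
}

/* Common code: no vendor-specific MSR handling, only the hooks. */
static int model_depriv_run(struct model_hvm_vcpu *v,
                            int (*op)(struct model_hvm_vcpu *v))
{
    int rc;

    model_hvm_funcs->depriv_ctxt_switch_to(v);
    rc = op(v);
    model_hvm_funcs->depriv_ctxt_switch_from(v);
    return rc;
}
```

With this shape, the debug copies of the host MSR values in hvm_deprivileged_user_mode() would also move behind the VMX hooks.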
>> +        domain_crash_synchronous();
>> +    }
>> +
>> +    /* Debug info */
>> +    vcpu->old_msr_lstar = get_host_msr_state()->msrs[VMX_INDEX_MSR_LSTAR];
>> +    vcpu->old_msr_star = get_host_msr_state()->msrs[VMX_INDEX_MSR_STAR];
>> +    vcpu->old_rsp = sp;
>> +    vcpu->old_processor = smp_processor_id();
>> +
>> +    /* Restore the efer and saved msr registers */
>> +    write_efer(efer);
>> +    wrmsrl(MSR_LSTAR, vcpu->msr_lstar);
>> +    wrmsrl(MSR_STAR, vcpu->msr_star);
>> +    vcpu->user_mode = 0;
>> +    vcpu->rsp = 0;
>> +}
>> +
>> +/* Called when the user mode operation has completed
>> + * Perform C-level processing on return path
>> + */
>> +void hvm_deprivileged_finish_user_mode(void)
>> +{
>> +    /* If we are not returning from user mode: bail */
>> +    ASSERT(get_current()->user_mode == 1);
>> +
>> +    hvm_deprivileged_finish_user_mode_asm();
>> +}
>> +
>> +void hvm_deprivileged_check_trap(const char* func_name)
>> +{
>> +    if( current->user_mode == 1 )
>> +    {
>> +        printk("HVM Deprivileged Mode: Trap whilst in user mode, %s\n",
>> +               func_name);
>> +        domain_crash_synchronous();
>> +    }
>> +}
>> +
>> +
>> +
>> diff --git a/xen/arch/x86/hvm/deprivileged_asm.S b/xen/arch/x86/hvm/deprivileged_asm.S
>> new file mode 100644
>> index 0000000..00a9e9c
>> --- /dev/null
>> +++ b/xen/arch/x86/hvm/deprivileged_asm.S
>> @@ -0,0 +1,205 @@
>> +/*
>> + * HVM security enhancements assembly code
>> + */
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +
>> +/* Handles entry into the deprivileged mode and returning from this
>> + * mode. This requires copying the current Xen privileged stack across
>> + * to a per-vcpu buffer as we need to be able to handle interrupts and
>> + * exceptions whilst in this mode. Xen is non-preemptable so our
>> + * privileged mode stack would be clobbered if we did not save it.
>> + *
>> + * If we are entering deprivileged mode, then we use a sysret to get there.
>> + * If we are returning from deprivileged mode, then we need to unwind the stack
>> + * so we copy it back over the current stack so that we can return from the
>> + * call path where we came in from.
>> + *
>> + * We're doing sort-of a long jump/set jump with copying to a stack to
>> + * preserve it and allow returning code to continue executing from
>> + * within this method.
>> + */
>> +ENTRY(hvm_deprivileged_user_mode_asm)
>> +        /* Save our registers */
>> +        push %rax
>> +        push %rbx
>> +        push %rcx
>> +        push %rdx
>> +        push %rsi
>> +        push %rdi
>> +        push %rbp
>> +        push %r8
>> +        push %r9
>> +        push %r10
>> +        push %r11
>> +        push %r12
>> +        push %r13
>> +        push %r14
>> +        push %r15
>> +        pushfq
>> +
>> +        /* Perform a near call to push rip onto the stack */
>> +        call 1f
>> +
>> +        /* Magic: Add to the stored rip the size of the code between
>> +         * label 1 and label 2. This allows us to restart execution at label 2.
>> +         */
>> +1:      addq $2f-1b, (%rsp)
>> +
>> +        GET_CURRENT(%r8)
>> +        xor %rsi, %rsi
>> +
>> +        /* The following is equivalent to
>> +         * (get_cpu_info() + sizeof(struct cpu_info))
>> +         * This gets us to the top of the stack.
>> +         */
>> +        GET_STACK_BASE(%rcx)
>> +        addq $STACK_SIZE, %rcx
>> +
>> +        movq VCPU_stack(%r8), %rdi
>> +
>> +        /* We need to copy the current stack across to our buffer
>> +         * Calculate the number of bytes to copy:
>> +         * (top of stack - current stack pointer)
>> +         * NOTE: We must not push any more data onto our stack after this point
>> +         * as it won't be saved.
>> +         */
>> +        sub %rsp, %rcx
>> +
>> +        /* If the stack is too big, we don't do the copy: handled by caller. */
>> +        cmpq $STACK_SIZE, %rcx
>> +        ja 3f
>> +
>> +        mov %rsp, %rsi
>> +/* USER MODE ENTRY POINT */
>> +2:
>> +        /* More magic: If we came here from preparing to go into user mode,
>
> There is a very fine line between magic and gross hack ;)
>
> I haven't quite decided which this is yet, but it certainly is neat, if
> rather opaque.
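Andrew's "magic or gross hack" question is easier to weigh next to the plain C idiom that the commit message compares the trick to. In this hosted sketch (all names hypothetical), setjmp() marks a resume point that is reached twice, and its return value decides which path runs. In the patch, label 2 is likewise reached both on the way in and on the way back, and VCPU_user_mode picks the branch. One difference matters: longjmp() only works while the setjmp() caller's frame is still intact. The real code must survive interrupts reusing the same stack while in ring 3, which is why it copies the stack out and back.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf model_resume;   /* plays the role of the saved rip/stack */
static int model_trace[4];     /* records the order in which steps ran */
static int model_ntrace;

/* Stands in for the ring 3 code: do the work, then "syscall" back. */
static void model_ring3(void)
{
    model_trace[model_ntrace++] = 2;
    longjmp(model_resume, 1);
}

static void model_user_mode(void)
{
    model_ntrace = 0;
    if ( setjmp(model_resume) == 0 )
    {
        /* First arrival, like label 2 with user_mode == 0: enter ring 3. */
        model_trace[model_ntrace++] = 1;
        model_ring3();
    }
    /* Second arrival, like label 2 with user_mode == 1: unwind normally. */
    model_trace[model_ntrace++] = 3;
}
```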
>
>> +         * then we copy our current stack to the buffer (the lines above
>> +         * have setup rsi, rdi and rcx to do this).
>> +         *
>> +         * If we came here from user mode, then we movsb to copy from
>> +         * our buffer into our current stack so that we can continue
>> +         * execution from the current code point, and return back to the guest
>> +         * via the path we came in. rsi, rdi and rcx have been setup by the
>> +         * de-privileged return path for this.
>> +         */
>> +        rep movsb
>> +        mov %rsp, %rsi
>> +
>> +        GET_CURRENT(%r8)
>> +        movq VCPU_user_mode(%r8), %rdx
>> +        movq VCPU_rsp(%r8), %rax
>> +
>> +        /* If !user_mode */
>> +        cmpq $0, %rdx
>> +        jne 3f
>> +        cli
>> +
>> +        movabs $HVM_DEPRIVILEGED_TEXT_ADDR, %rcx /* RIP in user mode */
>> +
>> +        movq $0x10200, %r11 /* RFLAGS user mode enable interrupts */
>
> Please use $(X86_FLAGS_IF | X86_FLAGS_MBS) to be more clear which flags
> are being set.
>
will do.
> Also, by enabling interrupts, you need some hook to short-circuit the
> scheduling softirq.  As it currently stands, a timer interrupt
> interrupting depriv mode is liable to swap all your state out from under
> you.
>
We need interrupts to be enabled so that the scheduler can deschedule a vcpu that is spinning in depriv mode; otherwise depriv mode could be used to mount a DoS. That's also why we needed some of the return path changes.
>> +        movq $1, VCPU_user_mode(%r8) /* Now in user mode */
>> +        movq %rsi, VCPU_rsp(%r8) /* The rsp to restore to */
>> +
>> +        /* Stack ptr is set by user mode to avoid race conditions.
>
> What race condition are you referring to?
>
>> +         * See Intel manual 2 on the sysret instruction.
>
> As a general rule, read both the Intel and the AMD manual for bits like
> this.  sysret is one of the areas where implementations differ.
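Decoding the RFLAGS literal quoted above shows why spelling the flags out helps. 0x10200 is RF (bit 16) | IF (bit 9), not IF | MBS (bit 1). The sketch uses local MODEL_EFLAGS_* constants as stand-ins for the flag names Andrew refers to. It also relies on an assumption that should be checked against the current manuals: the Intel SDM's 64-bit SYSRET pseudocode, RFLAGS := (R11 & 3C7FD7H) | 2. Under that rule both values end up as 0x202 on Intel. AMD's SYSRET description needs checking separately, as Andrew says.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural RFLAGS bits; MODEL_ names are local stand-ins. */
#define MODEL_EFLAGS_MBS 0x00000002u  /* bit 1: reserved, must be set */
#define MODEL_EFLAGS_IF  0x00000200u  /* bit 9: interrupt enable */
#define MODEL_EFLAGS_RF  0x00010000u  /* bit 16: resume flag */

/* RFLAGS loaded by a 64-bit SYSRET, per the Intel SDM pseudocode:
 * RF, VM and reserved bits are cleared and bit 1 is forced on. */
static uint64_t model_intel_sysret_rflags(uint64_t r11)
{
    return (r11 & 0x3C7FD7u) | MODEL_EFLAGS_MBS;
}
```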
>
>> +         */
>> +        movq $HVM_STACK_PTR, %rbx
>> +        sysretq /* Enter deprivileged mode */
>> +
>> +3:      GET_CURRENT(%r8)
>> +        movq %rsi, VCPU_rsp(%r8)
>> +        pop %rax /* Pop off rip: used in a jump so still on stack */
>> +
>> +        /* Restore registers */
>> +        popfq
>> +        pop %r15
>> +        pop %r14
>> +        pop %r13
>> +        pop %r12
>> +        pop %r11
>> +        pop %r10
>> +        pop %r9
>> +        pop %r8
>> +        pop %rbp
>> +        pop %rdi
>> +        pop %rsi
>> +        pop %rdx
>> +        pop %rcx
>> +        pop %rbx
>> +        pop %rax
>> +        ret
>> +
>> +/* Finished in user mode so return */
>> +ENTRY(hvm_deprivileged_finish_user_mode_asm)
>> +        /* The source is the copied stack in our buffer.
>> +         * The destination is our current stack.
>> +         *
>> +         * We need to:
>> +         * - Move the stack pointer to where it was before we entered
>> +         *   deprivileged mode.
>> +         * - Setup rsi, rdi and rcx as needed to perform the copy
>> +         * - Jump to the address held at the top of the stack which
>> +         *   is the user mode return address
>> +         */
>> +        cli
>> +        GET_CURRENT(%rbx)
>> +        movq VCPU_stack(%rbx), %rsi
>> +        movq VCPU_rsp(%rbx), %rdi
>> +
>> +        /* The return address that the near call pushed onto the
>> +         * buffer is pointed to by stack, so use that for rip.
>> +         */
>> +        movq %rdi, %rsp
>> +
>> +        /* The following is equivalent to
>> +         * (get_cpu_info() + sizeof(struct cpu_info) - vcpu->rsp)
>> +         * This works out how many bytes we need to copy:
>> +         * (top of stack - bottom of stack)
>> +         */
>> +        GET_STACK_BASE(%rcx)
>> +        addq $STACK_SIZE, %rcx
>> +        subq %rdi, %rcx
>> +
>> +        /* Go to user mode return code */
>> +        jmp *(%rsi)
>> +
>> +/* Entry point from the assembly syscall handlers */
>> +ENTRY(hvm_deprivileged_handle_user_mode)
>> +
>> +        /* Handle a user mode hypercall here */
>> +
>> +
>> +        /* We are finished in user mode */
>> +        call hvm_deprivileged_finish_user_mode
>> +
>> +        ret
>> +
>> +.section .hvm_deprivileged_enhancement.text,"ax"
>> +/* HVM deprivileged code */
>> +ENTRY(hvm_deprivileged_ring3)
>> +        /* sysret has loaded eip from rcx and rflags from r11.
>> +         * CS and SS have been loaded from the MSR for ring 3.
>> +         * We now need to switch to the user mode stack
>> +         */
>> +        /* Setup usermode stack */
>> +        movabs $HVM_STACK_PTR, %rsp
>> +
>> +        /* Perform user mode processing */
>> +
>> +        mov $0xf, %rcx
>> +1:      dec %rcx
>> +        cmp $0, %rcx
>> +        jne 1b
>> +
>> +        /* Return to ring 0 */
>> +        syscall
>> +
>> +.previous
>> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
>> index c32d863..595b0f2 100644
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -59,6 +59,8 @@
>>  #include
>>  #include
>>  #include
>> +#include
>> +
>>
>>  static bool_t __initdata opt_force_ept;
>>  boolean_param("force-ept", opt_force_ept);
>> @@ -194,6 +196,10 @@ void vmx_save_host_msrs(void)
>>          set_bit(VMX_INDEX_MSR_ ## address, &host_msr_state->flags); \
>>      } while ( 0 )
>>
>> +struct vmx_msr_state *get_host_msr_state(void) {
>> +    return &this_cpu(host_msr_state);
>> +}
>> +
>>  static enum handler_return
>>  long_mode_do_msr_read(unsigned int msr, uint64_t *msr_content)
>>  {
>> @@ -272,6 +278,7 @@ long_mode_do_msr_write(unsigned int msr, uint64_t msr_content)
>>      case MSR_LSTAR:
>>          if ( !is_canonical_address(msr_content) )
>>              goto uncanonical_address;
>> +
>
> Please avoid spurious changes like this.
>
apologies.
>>          WRITE_MSR(LSTAR);
>>          break;
>>
>> diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
>> index 447c650..fd5de44 100644
>> --- a/xen/arch/x86/x86_64/asm-offsets.c
>> +++ b/xen/arch/x86/x86_64/asm-offsets.c
>> @@ -115,6 +115,11 @@ void __dummy__(void)
>>      OFFSET(VCPU_nsvm_hap_enabled, struct vcpu, arch.hvm_vcpu.nvcpu.u.nsvm.ns_hap_enabled);
>>      BLANK();
>>
>> +    OFFSET(VCPU_stack, struct vcpu, stack);
>> +    OFFSET(VCPU_rsp, struct vcpu, rsp);
>> +    OFFSET(VCPU_user_mode, struct vcpu, user_mode);
>> +    BLANK();
>> +
>>      OFFSET(DOMAIN_is_32bit_pv, struct domain, arch.is_32bit_pv);
>>      BLANK();
>>
>> diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
>> index 74677a2..fa9155c 100644
>> --- a/xen/arch/x86/x86_64/entry.S
>> +++ b/xen/arch/x86/x86_64/entry.S
>> @@ -102,6 +102,15 @@ restore_all_xen:
>>          RESTORE_ALL adj=8
>>          iretq
>>
>> +/* Returning from user mode */
>> +handle_hvm_user_mode:
>> +
>> +        call hvm_deprivileged_handle_user_mode
>> +
>> +        /* Go back into user mode */
>> +        cli
>> +        jmp restore_all_guest
>> +
>>  /*
>>   * When entering SYSCALL from kernel mode:
>>   * %rax = hypercall vector
>> @@ -131,6 +140,14 @@ ENTRY(lstar_enter)
>>          testb $TF_kernel_mode,VCPU_thread_flags(%rbx)
>>          jz switch_to_kernel
>>
>> +        /* Were we in Xen's ring 3? */
>> +        push %rbx
>> +        GET_CURRENT(%rbx)
>> +        movq VCPU_user_mode(%rbx), %rbx
>> +        cmp $1, %rbx
>> +        je handle_hvm_user_mode
>> +        pop %rbx
>
> No need for the movq or rbx clobber.  This entire block can be:
>
>     cmpb $1, VCPU_user_mode(%rbx)
>     je handle_hvm_user_mode
>
> Similar to the $TF_kernel_mode check in context above.
>
got it. Thanks!
>
>
>> +
>>          /*hypercall:*/
>>          movq %r10,%rcx
>>          cmpq $NR_hypercalls,%rax
>> @@ -487,6 +504,13 @@ ENTRY(common_interrupt)
>>  /* No special register assumptions. */
>>  ENTRY(ret_from_intr)
>>          GET_CURRENT(%rbx)
>> +
>> +        /* If we are in Xen's user mode, return into it */
>> +        cmpq $1,VCPU_user_mode(%rbx)
>> +        cli
>> +        je restore_all_guest
>> +        sti
>> +
>
> None of this should be necessary - the exception frame created by
> lstar_enter should cause ret_from_intr to do the correct thing.
>
I think this is needed: interrupts are enabled in depriv mode, so we can take interrupts on paths other than lstar_enter. This check ensures that Xen doesn't treat depriv mode as a PV guest, which otherwise led to random page faults, general protection faults, etc.
>>          testb $3,UREGS_cs(%rsp)
>>          jz restore_all_xen
>>          movq VCPU_domain(%rbx),%rax
>> @@ -509,6 +533,14 @@ handle_exception_saved:
>>          GET_CURRENT(%rbx)
>>          PERFC_INCR(exceptions, %rax, %rbx)
>>          callq *(%rdx,%rax,8)
>> +
>> +        /* If we are in Xen's user mode, return into it */
>> +        /* TODO: Test this path */
>> +        cmpq $1,VCPU_user_mode(%rbx)
>> +        cli
>> +        je restore_all_guest
>> +        sti
>> +
>>          testb $3,UREGS_cs(%rsp)
>>          jz restore_all_xen
>>          leaq VCPU_trap_bounce(%rbx),%rdx
>> @@ -664,6 +696,9 @@ handle_ist_exception:
>>          movl $EVENT_CHECK_VECTOR,%edi
>>          call send_IPI_self
>>  1:      movq VCPU_domain(%rbx),%rax
>> +        /* This also handles Xen ring3 return for us.
>> +         * So, there is no need to explicitly do a user mode check.
>> +         */
>>          cmpb $0,DOMAIN_is_32bit_pv(%rax)
>>          je restore_all_guest
>>          jmp compat_restore_all_guest
>> diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
>> index 3fbfa44..98e269e 100644
>> --- a/xen/include/asm-x86/hvm/vmx/vmx.h
>> +++ b/xen/include/asm-x86/hvm/vmx/vmx.h
>> @@ -565,4 +565,6 @@ typedef struct {
>>      u16 eptp_index;
>>  } ve_info_t;
>>
>> +struct vmx_msr_state *get_host_msr_state(void);
>> +
>>  #endif /* __ASM_X86_HVM_VMX_VMX_H__ */
>> diff --git a/xen/include/xen/hvm/deprivileged.h b/xen/include/xen/hvm/deprivileged.h
>> index 6cc803e..e42f39a 100644
>> --- a/xen/include/xen/hvm/deprivileged.h
>> +++ b/xen/include/xen/hvm/deprivileged.h
>> @@ -68,6 +68,37 @@ int hvm_deprivileged_copy_l1(struct domain *d,
>>                               unsigned int l1_flags);
>>
>>
>> +/* Used to prepare each vcpu's data for user mode. Call for each HVM vcpu. */
>> +int hvm_deprivileged_prepare_vcpu(struct vcpu *vcpu);
>> +
>> +/* Destroy each vcpu's data for Xen user mode. Again, call for each vcpu. */
>> +void hvm_deprivileged_destroy_vcpu(struct vcpu *vcpu);
>> +
>> +/* Called to perform a user mode operation. */
>> +void hvm_deprivileged_user_mode(void);
>> +
>> +/* Called when the user mode operation has completed */
>> +void hvm_deprivileged_finish_user_mode(void);
>> +
>> +/* Called to move into and then out of user mode. Needed for accessing
>> + * assembly features.
>> + */
>> +void hvm_deprivileged_user_mode_asm(void);
>> +
>> +/* Called on the return path to return to the correct execution point */
>> +void hvm_deprivileged_finish_user_mode_asm(void);
>> +
>> +/* Handle any syscalls that the user mode makes */
>> +void hvm_deprivileged_handle_user_mode(void);
>> +
>> +/* The ring 3 code */
>> +void hvm_deprivileged_ring3(void);
>> +
>> +/* Call when inside a trap that should cause a domain crash if in user mode
>> + * e.g. an invalid_op is trapped whilst in user mode.
>> + */
>> +void hvm_deprivileged_check_trap(const char* func_name);
>> +
>>  /* The segments where the user mode .text and .data are stored */
>>  extern unsigned long int __hvm_deprivileged_text_start;
>>  extern unsigned long int __hvm_deprivileged_text_end;
>> @@ -91,4 +122,11 @@ extern unsigned long int __hvm_deprivileged_data_end;
>>
>>  #define HVM_ERR_PG_ALLOC -1
>>
>> +/* The user mode stack pointer.
>> ++ * The stack grows down so set this to top of the stack region. Then,
>> ++ * as this is 0-indexed, move into the stack, not just after it.
>> ++ * Subtract 16 bytes for correct stack alignment.
>> ++ */
>> +#define HVM_STACK_PTR (HVM_DEPRIVILEGED_STACK_ADDR + STACK_SIZE - 16)
>> +
>>  #endif
>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>> index 73d3bc8..180643e 100644
>> --- a/xen/include/xen/sched.h
>> +++ b/xen/include/xen/sched.h
>> @@ -137,7 +137,7 @@ void evtchn_destroy_final(struct domain *d); /* from complete_domain_destroy */
>>
>>  struct waitqueue_vcpu;
>>
>> -struct vcpu 
>> +struct vcpu
>
> Trailing whitespace is nasty, but we avoid inflating the patch by
> dropping whitespace on lines not touched by semantic changes.
>
>>  {
>>      int vcpu_id;
>>
>> @@ -158,6 +158,22 @@ struct vcpu
>>
>>      void *sched_priv; /* scheduler-specific data */
>>
>> +    /* HVM deprivileged mode state */
>> +    void *stack; /* Location of stack to save data onto */
>> +    unsigned long rsp; /* rsp of our stack to restore our data to */
>> +    unsigned long user_mode; /* Are we (possibly moving into) in user mode? */
>> +
>> +    /* The mstar of the processor that we are currently executing on.
>> +     * we need to save this because Xen does lazy saving of these.
>> +     */
>> +    unsigned long int msr_lstar; /* lstar */
>> +    unsigned long int msr_star;
>
> There should be no need to store this like this.  Follow what the
> current context switching code does.
>
ok, I'll take a look.
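Here is a worked check of the HVM_STACK_PTR arithmetic in deprivileged.h, using made-up addresses. base + size is one past the end of the region, which is why the comment says to "move into the stack, not just after it". Subtracting 16 from a page-aligned top keeps %rsp both inside the region and 16-byte aligned. MODEL_DEPRIV_STACK_ADDR and the 8-page size are hypothetical values. The real HVM_DEPRIVILEGED_STACK_ADDR is defined elsewhere in the series, and STACK_SIZE follows Xen's STACK_ORDER.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical values, for illustration only. */
#define MODEL_DEPRIV_STACK_ADDR 0x0000100000000000ull
#define MODEL_DEPRIV_STACK_SIZE (4096ull << 3)   /* 8 pages */

/* Same shape as HVM_STACK_PTR: top of the region minus 16 bytes. */
#define MODEL_DEPRIV_STACK_PTR \
    (MODEL_DEPRIV_STACK_ADDR + MODEL_DEPRIV_STACK_SIZE - 16)

/* True if sp lies inside [base, base + size) and is 16-byte aligned. */
static int model_stack_ptr_ok(uint64_t base, uint64_t size, uint64_t sp)
{
    return sp >= base && sp < base + size && (sp & 15) == 0;
}
```

The alignment only holds because base + size is itself a multiple of 16. That is true for page-aligned regions, but it is worth asserting if the base address is ever computed.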
> ~Andrew
>
>> +
>> +    /* Debug info */
>> +    unsigned long int old_rsp;
>> +    unsigned long int old_processor;
>> +    unsigned long int old_msr_lstar;
>> +    unsigned long int old_msr_star;
>>      struct vcpu_runstate_info runstate;
>>  #ifndef CONFIG_COMPAT
>>  # define runstate_guest(v) ((v)->runstate_guest)
>