* [PATCH RFC v2 0/4] HVM x86 deprivileged mode operations
@ 2015-09-03 16:01 Ben Catterall
From: Ben Catterall @ 2015-09-03 16:01 UTC (permalink / raw)
To: xen-devel
Cc: keir, suravee.suthikulpanit, george.dunlap, andrew.cooper3, tim,
Aravind.Gopalakrishnan, jbeulich, Ben.Catterall, boris.ostrovsky,
ian.campbell
Hi all,
I have made the requested changes and reworked the patch series based on the
comments received. Thank you to all of the contributors to those discussions!
The next step will be to provide an example use of this system, which will
follow in a later patch.
The main changes from v1 are:
- No longer copying the privileged Xen stack; instead, the interrupt/exception,
syscall and sysenter stack pointers are moved to be below the current
execution point.
- AMD SVM support
- Stop IST copying onto the privileged stack
- Watchdog timer to kill a long running deprivileged domain
- Support for crashing a domain whilst performing a deprivileged operation
- .text section is now aliased
- Assembly updates
- Updated deprivileged context switching code to fix bugs
- Moved processor-specific code to processor-specific files
- Reduction of user stack sizes
- Updates to interfaces and an is_hvm_deprivileged_mode() style test
- Small bug fixes
- Revised performance tests
Many thanks in advance,
Ben
The aim of this work is to create a proof of concept to establish whether it is
feasible to move certain Xen operations into a deprivileged context, to mitigate
the impact of a bug or compromise in such areas. Examples would be x86_emulate
or the virtual device emulation which is not done in QEMU for performance reasons.
Performance testing
-------------------
Performance testing indicates that the overhead of this deprivileged mode
depends heavily upon the processor. This overhead is the cost of moving into
deprivileged mode and then fully back out of it. The conclusion is that the
overhead is not negligible, so operations using this mechanism would need to be
long-running or high-risk components to justify the cost; this will have to be
evaluated on a case-by-case basis.
I performed 100,000 writes to a single I/O port on an Intel 2.2GHz Xeon
E5-2407 0 processor and on an AMD Opteron 2376. This was done from a Python
script within the HVM guest using time.time(), running Debian Jessie. Each write
was trapped to cause a vmexit and the time per write was calculated. The port
operation itself is bypassed so that no port I/O is actually performed; thus,
the differences in the measurements below can be taken as the pure overhead.
These experiments were repeated. Note that only the host and this HVM guest
(both Debian Jessie) were running during the experiments.
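For reference, a minimal C equivalent of the guest-side measurement loop is
sketched below. The actual tests used a Python script with time.time(); the port
number (0x1000) and the eax == 1 trigger value here are those used by the debug
hook added in patch 2, and the iteration count is illustrative.

    /* Guest-side timing sketch: repeated trapped writes to a single I/O port. */
    #include <stdio.h>
    #include <sys/io.h>     /* iopl(), outl() -- x86 Linux guest, run as root */
    #include <time.h>

    #define TEST_PORT  0x1000   /* trapped port; matches the debug hook in patch 2 */
    #define ITERATIONS 100000L

    int main(void)
    {
        struct timespec start, end;
        long i;

        if (iopl(3))            /* grant ring-3 access to all I/O ports */
            return 1;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (i = 0; i < ITERATIONS; i++)
            outl(1, TEST_PORT); /* eax == 1 selects the deprivileged path */
        clock_gettime(CLOCK_MONOTONIC, &end);

        printf("avg per write: %e s\n",
               ((end.tv_sec - start.tv_sec) +
                (end.tv_nsec - start.tv_nsec) / 1e9) / ITERATIONS);
        return 0;
    }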
Intel 2.2GHz Xeon E5-2407 0 processor:
--------------------------------------
1.55e-06 seconds was the average time for performing the write without the
deprivileged code running.
5.75e-06 seconds was the average time for performing the write with the
deprivileged code running.
So approximately 371% of the baseline time (roughly 270% overhead).
AMD Opteron 2376:
-----------------
1.74e-06 seconds was the average time for performing the write without the
deprivileged code running.
3.10e-06 seconds was the average time for performing the write with an entry and
exit from deprivileged mode.
So approximately 178% of the baseline time (roughly 80% overhead).
Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
* [PATCH RFC v2 1/4] HVM x86 deprivileged mode: Create deprivileged page tables
From: Ben Catterall @ 2015-09-03 16:01 UTC (permalink / raw)
To: xen-devel
Cc: keir, suravee.suthikulpanit, george.dunlap, andrew.cooper3, tim,
Aravind.Gopalakrishnan, jbeulich, Ben.Catterall, boris.ostrovsky,
ian.campbell
The paging structure mappings for deprivileged mode are added to the monitor
page table for HVM guests, for both HAP and shadow paging. The entries are
generated by walking the page tables and mapping in new pages; access bits are
flipped as needed.
Page entries are generated for the deprivileged .text, .data and a stack. The
.text section is allocated only once, at HVM domain initialisation, and is
aliased from then onwards. The data section is copied from the sections laid
out by the linker. The mappings are set up in an unused portion of the Xen
virtual address space. The pages are mapped in as user-mode accessible, with
the NX bit set for the data and stack regions; the code region is executable
and read-only.
The needed pages are allocated from the paging heap and are deallocated when
those heap pages are deallocated (on domain destruction).
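For quick orientation, the three mappings set up by this patch boil down to the
following calls (abbreviated from hvm_deprivileged_init() below; l4t, text_size,
data_size and stack_buf are shorthand for the corresponding local variables):

    /* .text: aliased, user, read-only and executable, shared across vcpus */
    hvm_deprivileged_map_l4(d, l4t, (unsigned long)__hvm_deprivileged_text_start,
                            HVM_DEPRIVILEGED_TEXT_ADDR, text_size,
                            0 /* No write */, HVM_DEPRIV_ALIAS);

    /* .data: copied from the linker-placed section, user, writable, NX */
    hvm_deprivileged_map_l4(d, l4t, (unsigned long)__hvm_deprivileged_data_start,
                            HVM_DEPRIVILEGED_DATA_ADDR, data_size,
                            _PAGE_NX | _PAGE_RW, HVM_DEPRIV_COPY);

    /* stack: freshly allocated pages, user, writable, NX */
    hvm_deprivileged_map_l4(d, l4t, (unsigned long)stack_buf,
                            HVM_DEPRIVILEGED_STACK_ADDR, HVM_DEPRIV_STACK_SIZE,
                            _PAGE_NX | _PAGE_RW, HVM_DEPRIV_COPY);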
Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
Changes since v1
----------------
* .text section is now aliased when needed
* Reduced user stack size to two pages
* Changed allocator used for pages
* Changed types to using __hvm_$foo[] for linker variables
* Moved some #define's to page.h
* Small bug fix: Testing global bit on L3 not relevant
---
xen/arch/x86/hvm/Makefile | 1 +
xen/arch/x86/hvm/deprivileged.c | 514 +++++++++++++++++++++++++++++++++++++
xen/arch/x86/mm/hap/hap.c | 8 +
xen/arch/x86/mm/shadow/multi.c | 8 +
xen/arch/x86/xen.lds.S | 19 ++
xen/include/asm-x86/config.h | 29 ++-
xen/include/asm-x86/x86_64/page.h | 15 ++
xen/include/xen/hvm/deprivileged.h | 90 +++++++
xen/include/xen/sched.h | 4 +
9 files changed, 681 insertions(+), 7 deletions(-)
create mode 100644 xen/arch/x86/hvm/deprivileged.c
create mode 100644 xen/include/xen/hvm/deprivileged.h
diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
index 794e793..df5ebb8 100644
--- a/xen/arch/x86/hvm/Makefile
+++ b/xen/arch/x86/hvm/Makefile
@@ -2,6 +2,7 @@ subdir-y += svm
subdir-y += vmx
obj-y += asid.o
+obj-y += deprivileged.o
obj-y += emulate.o
obj-y += event.o
obj-y += hpet.o
diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
new file mode 100644
index 0000000..f34ed67
--- /dev/null
+++ b/xen/arch/x86/hvm/deprivileged.c
@@ -0,0 +1,514 @@
+/*
+ * HVM deprivileged mode to provide support for running operations in
+ * user mode from Xen
+ */
+#include <xen/lib.h>
+#include <xen/mm.h>
+#include <xen/domain_page.h>
+#include <xen/config.h>
+#include <xen/types.h>
+#include <xen/sched.h>
+#include <asm/paging.h>
+#include <xen/compiler.h>
+#include <asm/hap.h>
+#include <asm/paging.h>
+#include <asm-x86/page.h>
+#include <public/domctl.h>
+#include <xen/domain_page.h>
+#include <asm/hvm/vmx/vmx.h>
+#include <xen/hvm/deprivileged.h>
+
+void hvm_deprivileged_init(struct domain *d, l4_pgentry_t *l4t_base)
+{
+ void *p;
+ unsigned long size;
+ unsigned int l4t_idx_code = l4_table_offset(HVM_DEPRIVILEGED_TEXT_ADDR);
+ int ret;
+
+ /* If there is already an entry here */
+ ASSERT(!l4e_get_intpte(l4t_base[l4t_idx_code]));
+
+ /*
+ * We alias the .text segment for deprivileged mode to save memory.
+ * Additionally, to save allocating page tables for each vcpu's deprivileged
+ * mode .text segment, we reuse them.
+ *
+ * If we have not already created a mapping (valid_l4e_code is false) then
+ * we create one and generate the page tables. To save doing this for each
+ * vcpu, if we already have a set of valid page tables then we reuse them.
+ * So, if we have the page tables and there is no entry at the desired PML4
+ * slot, then we can just reuse those page tables.
+ *
+ * The mappings are per-domain as we use the domain's page pool memory
+ * allocator for the new page structure and page frame pages.
+ */
+ if( !d->hvm_depriv_valid_l4e_code )
+ {
+ /*
+ * Build the alias mappings for the .text segment for deprivileged code
+ *
+ * NOTE: If there are other pages here, then this method will map around
+ * them, which means that any future alias will use this mapping. If the
+ * HVM depriv section no longer has a unique PML4 entry in the Xen
+ * memory map, this will need to be accounted for.
+ */
+ size = (unsigned long)__hvm_deprivileged_text_end -
+ (unsigned long)__hvm_deprivileged_text_start;
+
+ ret = hvm_deprivileged_map_l4(d, l4t_base,
+ (unsigned long)__hvm_deprivileged_text_start,
+ (unsigned long)HVM_DEPRIVILEGED_TEXT_ADDR,
+ size, 0 /* No write */, HVM_DEPRIV_ALIAS);
+
+ if( ret )
+ {
+ printk(XENLOG_ERR "HVM: Error when initialising depriv .text. Code: %d",
+ ret);
+
+ domain_crash(d);
+ return;
+ }
+
+ d->hvm_depriv_l4e_code = l4t_base[l4t_idx_code];
+ d->hvm_depriv_valid_l4e_code = 1;
+ }
+ else
+ {
+ /* Just copy the PML4 entry across */
+ l4t_base[l4t_idx_code] = d->hvm_depriv_l4e_code;
+ }
+
+ /* Copy the .data segment for ring3 code */
+ size = (unsigned long)__hvm_deprivileged_data_end -
+ (unsigned long)__hvm_deprivileged_data_start;
+
+ ret = hvm_deprivileged_map_l4(d, l4t_base,
+ (unsigned long)__hvm_deprivileged_data_start,
+ (unsigned long)HVM_DEPRIVILEGED_DATA_ADDR,
+ size, _PAGE_NX | _PAGE_RW, HVM_DEPRIV_COPY);
+
+ if( ret )
+ {
+ printk(XENLOG_ERR "HVM: Error when initialising depriv .data. Code: %d",
+ ret);
+ domain_crash(d);
+ return;
+ }
+
+ /*
+ * THIS IS A BIT OF A HACK...
+ * Setup the deprivileged mode stack mappings. By allocating a blank area
+ * we can reuse hvm_deprivileged_map_l4.
+ */
+ size = HVM_DEPRIV_STACK_SIZE;
+
+ p = alloc_xenheap_pages(HVM_DEPRIV_STACK_ORDER, 0);
+ if( p == NULL )
+ {
+ printk(XENLOG_ERR "HVM: Out of memory on deprivileged mode stack init.\n");
+ domain_crash(d);
+ return;
+ }
+
+ ret = hvm_deprivileged_map_l4(d, l4t_base,
+ (unsigned long)p,
+ (unsigned long)HVM_DEPRIVILEGED_STACK_ADDR,
+ size, _PAGE_NX | _PAGE_RW, HVM_DEPRIV_COPY);
+
+ free_xenheap_pages(p, HVM_DEPRIV_STACK_ORDER);
+
+ if( ret )
+ {
+ printk(XENLOG_ERR "HVM: Error when initialising depriv stack. Code: %d",
+ ret);
+ domain_crash(d);
+ return;
+ }
+}
+
+void hvm_deprivileged_destroy(struct domain *d)
+{
+
+}
+
+/*
+ * Create a copy or alias of the data at the specified virtual address. The
+ * page table hierarchy is walked and new levels are created if needed.
+ *
+ * If we find a leaf entry in a page table (one which holds the
+ * mfn of a 4KB, 2MB, etc. page frame) which has already been
+ * mapped in, then we bail as we have a collision and this likely
+ * means a bug or the memory configuration has been changed.
+ *
+ * Pages have PAGE_USER, PAGE_GLOBAL (if supported) and PAGE_PRESENT set by
+ * default. The extra l1_flags are used for extra control e.g. PAGE_RW.
+ * The PAGE_RW flag will be enabled for all page structure entries
+ * above the leaf page if that leaf page has PAGE_RW set. This is needed to
+ * permit the writes on the leaf pages. See the Intel manual 3A section 4.6.
+ *
+ * TODO: We proceed down to L1 4KB pages and then map these in. We should
+ * stop the recursion on L3/L2 for a 1GB or 2MB page which would mean faster
+ * page access. When we stop would depend on size (e.g. use 2MB pages for a
+ * few MBs). We'll need to be careful about aliasing such large pages, though,
+ * as the source would then need to be aligned to these larger sizes;
+ * otherwise we'd share extra data via the alias.
+ */
+int hvm_deprivileged_map_l4(struct domain *d,
+ l4_pgentry_t *l4t_base,
+ unsigned long src_start,
+ unsigned long dst_start,
+ unsigned long size,
+ unsigned int l1_flags,
+ unsigned int op)
+{
+ struct page_info *l3t_pg; /* the destination page */
+ l3_pgentry_t *l3t_base;
+ unsigned long l4t_idx_dst_start;
+ unsigned long l4t_idx_dst_end;
+ unsigned long flags = _PAGE_USER | _PAGE_PRESENT;
+ unsigned int i;
+
+ /* Leaf page needs RW? */
+ if( l1_flags & _PAGE_RW )
+ flags |= _PAGE_RW;
+
+ /*
+ * Calculate where in the destination we need pages.
+ * A PML4 has 512 slots, each covering 512GB of the 48-bit virtual
+ * address space, so work out which L4 slots the destination range
+ * lies in.
+ */
+ l4t_idx_dst_start = l4_table_offset(dst_start);
+ l4t_idx_dst_end = l4_table_offset(dst_start + size) +
+ !!( ((dst_start + size) & ((1ul << L4_PAGETABLE_SHIFT) - 1)) %
+ L4_PAGE_RANGE );
+
+ for( i = l4t_idx_dst_start; i < l4t_idx_dst_end; i++ )
+ {
+ ASSERT( size >= 0 );
+
+ /* Is this an empty entry? */
+ if( !(l4e_get_intpte(l4t_base[i])) )
+ {
+ /* Allocate a new L3 table */
+ if( (l3t_pg = hvm_deprivileged_alloc_page(d)) == NULL )
+ return HVM_ERR_PG_ALLOC;
+
+ l3t_base = map_domain_page(_mfn(page_to_mfn(l3t_pg)));
+
+ /* Add the page into the L4 table */
+ l4t_base[i] = l4e_from_page(l3t_pg, flags);
+
+ hvm_deprivileged_map_l3(d, l3t_base, src_start, dst_start,
+ (size > L4_PAGE_RANGE) ? L4_PAGE_RANGE : size,
+ l1_flags, op);
+
+ unmap_domain_page(l3t_base);
+ }
+ else
+ {
+ /*
+ * If there is already page information then the page has been
+ * prepared by something else
+ *
+ * We can try recursing on this and see if where we want to put our
+ * new pages is empty.
+ *
+ * We do need to flip this to be a user mode page though so that
+ * the usermode children can be accessed. This is fine as long as
+ * we preserve the access bits of any supervisor entries that are
+ * used in the leaf case.
+ */
+
+ l3t_base = map_l3t_from_l4e(l4t_base[i]);
+
+ hvm_deprivileged_map_l3(d, l3t_base, src_start, dst_start,
+ (size > L4_PAGE_RANGE) ? L4_PAGE_RANGE : size,
+ l1_flags, op);
+
+ l4t_base[i] = l4e_from_intpte(l4e_get_intpte(l4t_base[i]) | flags);
+
+ unmap_domain_page(l3t_base);
+ }
+
+ size -= L4_PAGE_RANGE;
+ src_start += L4_PAGE_RANGE;
+ dst_start += L4_PAGE_RANGE;
+ }
+
+ return 0;
+}
+
+int hvm_deprivileged_map_l3(struct domain *d,
+ l3_pgentry_t *l3t_base,
+ unsigned long src_start,
+ unsigned long dst_start,
+ unsigned long size,
+ unsigned int l1_flags,
+ unsigned int op)
+{
+ struct page_info *l2t_pg; /* the destination page */
+ l2_pgentry_t *l2t_base;
+ unsigned long l3t_idx_dst_start;
+ unsigned long l3t_idx_dst_end;
+ unsigned long flags = _PAGE_USER | _PAGE_PRESENT;
+ unsigned int i;
+
+ /* Leaf page needs RW? */
+ if( l1_flags & _PAGE_RW )
+ flags |= _PAGE_RW;
+
+ /* Calculate where in the destination we need pages */
+ l3t_idx_dst_start = l3_table_offset(dst_start);
+ l3t_idx_dst_end = l3_table_offset(dst_start + size) +
+ !!( ((dst_start + size) & ((1ul << L3_PAGETABLE_SHIFT) - 1)) %
+ L3_PAGE_RANGE );
+
+ for( i = l3t_idx_dst_start; i < l3t_idx_dst_end; i++ )
+ {
+ ASSERT( size >= 0 );
+
+ /* Is this an empty entry? */
+ if( !(l3e_get_intpte(l3t_base[i])) )
+ {
+ /* Allocate a new L2 table */
+ if( (l2t_pg = hvm_deprivileged_alloc_page(d)) == NULL )
+ return HVM_ERR_PG_ALLOC;
+
+ l2t_base = map_domain_page(_mfn(page_to_mfn(l2t_pg)));
+
+ /* Add the page into the L3 table */
+ l3t_base[i] = l3e_from_page(l2t_pg, flags);
+
+ hvm_deprivileged_map_l2(d, l2t_base, src_start, dst_start,
+ (size > L3_PAGE_RANGE) ? L3_PAGE_RANGE : size,
+ l1_flags, op);
+
+ unmap_domain_page(l2t_base);
+ }
+ else
+ {
+ /*
+ * If there is already page information then the page has been
+ * prepared by something else
+ *
+ * If the PSE bit is set, then we can't recurse as this is
+ * a leaf page so we fail.
+ */
+ if( (l3e_get_flags(l3t_base[i]) & _PAGE_PSE) )
+ {
+ panic("HVM: L3 leaf page is already mapped\n");
+ }
+
+ /*
+ * We can try recursing on this and see if where we want to put our
+ * new pages is empty.
+ *
+ * We do need to flip this to be a user mode page though so that
+ * the usermode children can be accessed. This is fine as long as
+ * we preserve the access bits of any supervisor entries that are
+ * used in the leaf case.
+ */
+
+ l2t_base = map_l2t_from_l3e(l3t_base[i]);
+
+ hvm_deprivileged_map_l2(d, l2t_base, src_start, dst_start,
+ (size > L3_PAGE_RANGE) ? L3_PAGE_RANGE : size,
+ l1_flags, op);
+
+ l3t_base[i] = l3e_from_intpte(l3e_get_intpte(l3t_base[i]) | flags);
+
+ unmap_domain_page(l2t_base);
+ }
+
+ size -= L3_PAGE_RANGE;
+ src_start += L3_PAGE_RANGE;
+ dst_start += L3_PAGE_RANGE;
+ }
+
+ return 0;
+}
+
+int hvm_deprivileged_map_l2(struct domain *d,
+ l2_pgentry_t *l2t_base,
+ unsigned long src_start,
+ unsigned long dst_start,
+ unsigned long size,
+ unsigned int l1_flags,
+ unsigned int op)
+{
+ struct page_info *l1t_pg; /* the destination page */
+ l1_pgentry_t *l1t_base;
+ unsigned long l2t_idx_dst_start;
+ unsigned long l2t_idx_dst_end;
+ unsigned long flags = _PAGE_USER | _PAGE_PRESENT;
+ unsigned int i;
+
+ /* Leaf page needs RW? */
+ if( l1_flags & _PAGE_RW )
+ flags |= _PAGE_RW;
+
+ /* Calculate where in the destination we need pages */
+ l2t_idx_dst_start = l2_table_offset(dst_start);
+ l2t_idx_dst_end = l2_table_offset(dst_start + size) +
+ !!( ((dst_start + size) & ((1ul << L2_PAGETABLE_SHIFT) - 1)) %
+ L2_PAGE_RANGE );
+
+ for( i = l2t_idx_dst_start; i < l2t_idx_dst_end; i++ )
+ {
+ ASSERT( size >= 0 );
+
+ /* Is this an empty entry? */
+ if( !(l2e_get_intpte(l2t_base[i])) )
+ {
+ /* Allocate a new L1 table */
+ if( (l1t_pg = hvm_deprivileged_alloc_page(d)) == NULL )
+ return HVM_ERR_PG_ALLOC;
+
+ l1t_base = map_domain_page(_mfn(page_to_mfn(l1t_pg)));
+
+ /* Add the page into the L2 table */
+ l2t_base[i] = l2e_from_page(l1t_pg, flags);
+
+ hvm_deprivileged_map_l1(d, l1t_base, src_start, dst_start,
+ (size > L2_PAGE_RANGE) ? L2_PAGE_RANGE : size,
+ l1_flags, op);
+
+ unmap_domain_page(l1t_base);
+ }
+ else
+ {
+ /*
+ * If there is already page information then the page has been
+ * prepared by something else
+ *
+ * If the PSE bit is set, then we can't recurse as this is
+ * a leaf page so we fail.
+ */
+ if( (l2e_get_flags(l2t_base[i]) & _PAGE_PSE) )
+ {
+ panic("HVM: L2 Leaf page is already mapped\n");
+ }
+
+ /*
+ * We can try recursing on this and see if where we want to put our
+ * new pages is empty.
+ *
+ * We do need to flip this to be a user mode page though so that
+ * the usermode children can be accessed. This is fine as long as
+ * we preserve the access bits of any supervisor entries that are
+ * used in the leaf case.
+ */
+
+ l1t_base = map_l1t_from_l2e(l2t_base[i]);
+
+ hvm_deprivileged_map_l1(d, l1t_base, src_start, dst_start,
+ (size > L2_PAGE_RANGE) ? L2_PAGE_RANGE : size,
+ l1_flags, op);
+
+ l2t_base[i] = l2e_from_intpte(l2e_get_intpte(l2t_base[i]) | flags);
+
+ unmap_domain_page(l1t_base);
+ }
+
+ size -= L2_PAGE_RANGE;
+ src_start += L2_PAGE_RANGE;
+ dst_start += L2_PAGE_RANGE;
+ }
+ return 0;
+}
+
+int hvm_deprivileged_map_l1(struct domain *d,
+ l1_pgentry_t *l1t_base,
+ unsigned long src_start,
+ unsigned long dst_start,
+ unsigned long size,
+ unsigned int l1_flags,
+ unsigned int op)
+{
+ struct page_info *dst_pg; /* the destination page */
+ char *src_data;
+ char *dst_data; /* Pointer for writing into the page */
+ unsigned long l1t_idx_dst_start;
+ unsigned long l1t_idx_dst_end;
+ unsigned long flags = _PAGE_USER | _PAGE_GLOBAL | _PAGE_PRESENT;
+ unsigned int i;
+
+ /* Calculate where in the destination we need pages */
+ l1t_idx_dst_start = l1_table_offset(dst_start);
+ l1t_idx_dst_end = l1_table_offset(dst_start + size) +
+ !!( ((dst_start + size) & ((1ul << L1_PAGETABLE_SHIFT) - 1)) %
+ L1_PAGE_RANGE );
+
+ for( i = l1t_idx_dst_start; i < l1t_idx_dst_end; i++ )
+ {
+ ASSERT( size >= 0 );
+
+ /* Is this an empty entry? */
+ if( !(l1e_get_intpte(l1t_base[i])) )
+ {
+ if( op == HVM_DEPRIV_ALIAS )
+ {
+ /*
+ * To alias a page, put the mfn of the page into our page table
+ * The source should be page aligned to prevent us mapping in
+ * more data than we should.
+ */
+ l1t_base[i] = l1e_from_pfn(virt_to_mfn(src_start),
+ flags | l1_flags);
+ }
+ else
+ {
+ /* Create a new 4KB page */
+ if( (dst_pg = hvm_deprivileged_alloc_page(d)) == NULL )
+ return HVM_ERR_PG_ALLOC;
+
+ /*
+ * Map in src and dst, perform the copy then add it to the
+ * L1 table
+ */
+ dst_data = map_domain_page(_mfn(page_to_mfn(dst_pg)));
+ src_data = map_domain_page(_mfn(virt_to_mfn(src_start)));
+ ASSERT( dst_data != NULL && src_data != NULL );
+
+ memcpy(dst_data, src_data,
+ (size > PAGESIZE_4KB) ? PAGESIZE_4KB : size);
+
+ unmap_domain_page(src_data);
+ unmap_domain_page(dst_data);
+
+ l1t_base[i] = l1e_from_page(dst_pg, flags | l1_flags);
+ }
+
+ size -= PAGESIZE_4KB;
+ src_start += PAGESIZE_4KB;
+ dst_start += PAGESIZE_4KB;
+ }
+ else
+ {
+ /*
+ * If there is already page information then the page has been
+ * prepared by something else, and we can't overwrite it
+ * as this is the leaf case.
+ */
+ panic("HVM: L1 Region already mapped: %lx\nat(%lx)\n",
+ l1e_get_intpte(l1t_base[i]), dst_start);
+ }
+ }
+ return 0;
+}
+
+/* Allocates a page from the domain's paging pool */
+struct page_info *hvm_deprivileged_alloc_page(struct domain *d)
+{
+ struct page_info *pg;
+
+ if( (pg = d->arch.paging.alloc_page(d)) == NULL )
+ {
+ printk(XENLOG_ERR "HVM: Out of memory allocating HVM page\n");
+ domain_crash(d);
+ return NULL;
+ }
+
+ return pg;
+}
diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index e9c0080..4048929 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -42,6 +42,7 @@
#include <asm/hvm/nestedhvm.h>
#include "private.h"
+#include <xen/hvm/deprivileged.h>
/* Override macros from asm/page.h to make them work with mfn_t */
#undef mfn_to_page
@@ -401,6 +402,9 @@ static void hap_install_xen_entries_in_l4(struct vcpu *v, mfn_t l4mfn)
&idle_pg_table[ROOT_PAGETABLE_FIRST_XEN_SLOT],
ROOT_PAGETABLE_XEN_SLOTS * sizeof(l4_pgentry_t));
+ /* Initialise the HVM deprivileged mode feature */
+ hvm_deprivileged_init(d, l4e);
+
/* Install the per-domain mappings for this domain */
l4e[l4_table_offset(PERDOMAIN_VIRT_START)] =
l4e_from_pfn(mfn_x(page_to_mfn(d->arch.perdomain_l3_pg)),
@@ -439,6 +443,10 @@ static void hap_destroy_monitor_table(struct vcpu* v, mfn_t mmfn)
/* Put the memory back in the pool */
hap_free(d, mmfn);
+
+ /* Destroy the HVM tables */
+ ASSERT(paging_locked_by_me(d));
+ hvm_deprivileged_destroy(d);
}
/************************************************/
diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index 22081a1..deed4fd 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -38,6 +38,7 @@
#include <asm/mtrr.h>
#include <asm/guest_pt.h>
#include <public/sched.h>
+#include <xen/hvm/deprivileged.h>
#include "private.h"
#include "types.h"
@@ -1429,6 +1430,13 @@ void sh_install_xen_entries_in_l4(struct domain *d, mfn_t gl4mfn, mfn_t sl4mfn)
&idle_pg_table[ROOT_PAGETABLE_FIRST_XEN_SLOT],
slots * sizeof(l4_pgentry_t));
+ /*
+ * Initialise the HVM deprivileged mode feature.
+ * The shadow_l4e_t is a typedef for l4_pgentry_t as are all of the
+ * paging structure so this method will work for the shadow table as well.
+ */
+ hvm_deprivileged_init(d, (l4_pgentry_t *)sl4e);
+
/* Install the per-domain mappings for this domain */
sl4e[shadow_l4_table_offset(PERDOMAIN_VIRT_START)] =
shadow_l4e_from_mfn(page_to_mfn(d->arch.perdomain_l3_pg),
diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
index 6553cff..0bfe0cf 100644
--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -50,6 +50,25 @@ SECTIONS
_etext = .; /* End of text section */
} :text = 0x9090
+ /* HVM deprivileged mode segments
+ * Used to map the ring3 static data and .text
+ */
+
+ . = ALIGN(PAGE_SIZE);
+ .hvm_deprivileged_text : {
+ __hvm_deprivileged_text_start = . ;
+ *(.hvm_deprivileged_enhancement.text)
+ __hvm_deprivileged_text_end = . ;
+ } : text
+
+ . = ALIGN(PAGE_SIZE);
+ .hvm_deprivileged_data : {
+ __hvm_deprivileged_data_start = . ;
+ *(.hvm_deprivileged_enhancement.data)
+ __hvm_deprivileged_data_end = . ;
+ } : text
+
+ . = ALIGN(PAGE_SIZE);
.rodata : {
/* Bug frames table */
. = ALIGN(4);
diff --git a/xen/include/asm-x86/config.h b/xen/include/asm-x86/config.h
index 3e9be83..b5f4e14 100644
--- a/xen/include/asm-x86/config.h
+++ b/xen/include/asm-x86/config.h
@@ -183,10 +183,12 @@ extern unsigned char boot_edid_info[128];
#endif
* 0xffff880000000000 - 0xffffffffffffffff [120TB, PML4:272-511]
* PV: Guest-defined use.
- * 0xffff880000000000 - 0xffffff7fffffffff [119.5TB, PML4:272-510]
+ * 0xffff880000000000 - 0xfffffeffffffffff [119TB, PML4:272-509]
* HVM/idle: continuation of 1:1 mapping
+ * 0xffffff0000000000 - 0xffffff7fffffffff [512GB, 2^39 bytes PML4:510]
+ * HVM: HVM deprivileged mode .text segment
* 0xffffff8000000000 - 0xffffffffffffffff [512GB, 2^39 bytes PML4:511]
- * HVM/idle: unused
+ * HVM: HVM deprivileged mode data and stack segments
*
* Compatibility guest area layout:
* 0x0000000000000000 - 0x00000000f57fffff [3928MB, PML4:0]
@@ -201,7 +203,6 @@ extern unsigned char boot_edid_info[128];
* Reserved for future use.
*/
-
#define ROOT_PAGETABLE_FIRST_XEN_SLOT 256
#define ROOT_PAGETABLE_LAST_XEN_SLOT 271
#define ROOT_PAGETABLE_XEN_SLOTS \
@@ -270,16 +271,30 @@ extern unsigned char boot_edid_info[128];
#define FRAMETABLE_VIRT_START (FRAMETABLE_VIRT_END - FRAMETABLE_SIZE)
#ifndef CONFIG_BIGMEM
-/* Slot 262-271/510: A direct 1:1 mapping of all of physical memory. */
+/* Slot 262-271/509: A direct 1:1 mapping of all of physical memory. */
#define DIRECTMAP_VIRT_START (PML4_ADDR(262))
-#define DIRECTMAP_SIZE (PML4_ENTRY_BYTES * (511 - 262))
+#define DIRECTMAP_SIZE (PML4_ENTRY_BYTES * (510 - 262))
#else
-/* Slot 265-271/510: A direct 1:1 mapping of all of physical memory. */
+/* Slot 265-271/509: A direct 1:1 mapping of all of physical memory. */
#define DIRECTMAP_VIRT_START (PML4_ADDR(265))
-#define DIRECTMAP_SIZE (PML4_ENTRY_BYTES * (511 - 265))
+#define DIRECTMAP_SIZE (PML4_ENTRY_BYTES * (510 - 265))
#endif
#define DIRECTMAP_VIRT_END (DIRECTMAP_VIRT_START + DIRECTMAP_SIZE)
+/*
+ * Slots 510-511: HVM deprivileged mode
+ * The virtual addresses where the .text, .data and stack should be
+ * placed.
+ * We put the .text section in slot 510 by itself so that we can easily create
+ * an alias of it. Aliasing reuses the same mfns in the page table entries, so
+ * if the data and stack shared a PML4 slot with the .text alias, they would
+ * also be shared across the aliased address spaces, which would be incorrect.
+ */
+#define HVM_DEPRIVILEGED_TEXT_ADDR (PML4_ADDR(510))
+#define HVM_DEPRIVILEGED_DATA_ADDR (PML4_ADDR(511) + 0xa000000)
+#define HVM_DEPRIVILEGED_STACK_ADDR (PML4_ADDR(511) + 0xc000000)
+
#ifndef __ASSEMBLY__
/* This is not a fixed value, just a lower limit. */
diff --git a/xen/include/asm-x86/x86_64/page.h b/xen/include/asm-x86/x86_64/page.h
index 19ab4d0..8ecb877 100644
--- a/xen/include/asm-x86/x86_64/page.h
+++ b/xen/include/asm-x86/x86_64/page.h
@@ -22,6 +22,21 @@
#define __PAGE_OFFSET DIRECTMAP_VIRT_START
#define __XEN_VIRT_START XEN_VIRT_START
+/* The sizes of the pages */
+#define PAGESIZE_1GB (1ul << L3_PAGETABLE_SHIFT)
+#define PAGESIZE_2MB (1ul << L2_PAGETABLE_SHIFT)
+#define PAGESIZE_4KB (1ul << L1_PAGETABLE_SHIFT)
+
+/*
+ * The size in bytes that a single L(1,2,3,4} entry covers.
+ * There are 512 (left shift by 9) entries in each page-structure.
+ */
+#define L4_PAGE_RANGE (PAGESIZE_1GB << 9)
+#define L3_PAGE_RANGE (PAGESIZE_2MB << 9)
+#define L2_PAGE_RANGE (PAGESIZE_4KB << 9)
+#define L1_PAGE_RANGE (PAGESIZE_4KB )
+
+
/* These are architectural limits. Current CPUs support only 40-bit phys. */
#define PADDR_BITS 52
#define VADDR_BITS 48
diff --git a/xen/include/xen/hvm/deprivileged.h b/xen/include/xen/hvm/deprivileged.h
new file mode 100644
index 0000000..bcc8c50
--- /dev/null
+++ b/xen/include/xen/hvm/deprivileged.h
@@ -0,0 +1,90 @@
+#ifndef __X86_HVM_DEPRIVILEGED
+
+#define __X86_HVM_DEPRIVILEGED
+
+#include <asm/page.h>
+#include <xen/lib.h>
+#include <xen/mm.h>
+#include <xen/domain_page.h>
+#include <xen/config.h>
+#include <xen/types.h>
+#include <xen/sched.h>
+#include <asm/paging.h>
+#include <asm/hap.h>
+#include <asm/paging.h>
+#include <asm-x86/page.h>
+#include <public/domctl.h>
+#include <xen/domain_page.h>
+
+/*
+ * Initialise the HVM deprivileged mode. This just sets up the general
+ * page mappings for .text and .data. It does not prepare each HVM vcpu's data
+ * or stack which needs to be done separately using
+ * hvm_deprivileged_prepare_vcpu.
+ */
+void hvm_deprivileged_init(struct domain *d, l4_pgentry_t *l4t_base);
+
+/*
+ * Free up the data used by the HVM deprivileged enhancements.
+ * This frees general page mappings. It does not destroy the per-vcpu
+ * data so hvm_deprivileged_destroy_vcpu also needs to be called for each vcpu.
+ * This method should be called after those per-vcpu destruction routines.
+ */
+void hvm_deprivileged_destroy(struct domain *d);
+
+/*
+ * Used to create a mapping (copy or alias) of the data at the specified
+ * virtual address. It walks the page table hierarchy, creating new levels
+ * as needed, and then either copies or aliases the data.
+ */
+int hvm_deprivileged_map_l4(struct domain *d,
+ l4_pgentry_t *l4e_base,
+ unsigned long src_start,
+ unsigned long dst_start,
+ unsigned long size,
+ unsigned int l1_flags,
+ unsigned int op);
+
+int hvm_deprivileged_map_l3(struct domain *d,
+ l3_pgentry_t *l1t_base,
+ unsigned long src_start,
+ unsigned long dst_start,
+ unsigned long size,
+ unsigned int l1_flags,
+ unsigned int op);
+
+int hvm_deprivileged_map_l2(struct domain *d,
+ l2_pgentry_t *l1t_base,
+ unsigned long src_start,
+ unsigned long dst_start,
+ unsigned long size,
+ unsigned int l1_flags,
+ unsigned int op);
+/*
+ * The leaf case of the map. Will allocate the pages and actually copy or alias
+ * the data.
+ */
+int hvm_deprivileged_map_l1(struct domain *d,
+ l1_pgentry_t *l1t_base,
+ unsigned long src_start,
+ unsigned long dst_start,
+ unsigned long size,
+ unsigned int l1_flags,
+ unsigned int op);
+
+/* Used to allocate a page for the deprivileged mode */
+struct page_info *hvm_deprivileged_alloc_page(struct domain *d);
+
+/* The segments where the user mode .text and .data are stored */
+extern unsigned long __hvm_deprivileged_text_start[];
+extern unsigned long __hvm_deprivileged_text_end[];
+extern unsigned long __hvm_deprivileged_data_start[];
+extern unsigned long __hvm_deprivileged_data_end[];
+#define HVM_DEPRIV_STACK_SIZE (PAGE_SIZE << 1)
+#define HVM_DEPRIV_STACK_ORDER 1
+#define HVM_DEPRIV_MODE 1
+#define HVM_ERR_PG_ALLOC -1
+#define HVM_DEPRIV_ALIAS 1
+#define HVM_DEPRIV_COPY 0
+
+#endif
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 73d3bc8..66f4f5e 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -462,6 +462,10 @@ struct domain
/* vNUMA topology accesses are protected by rwlock. */
rwlock_t vnuma_rwlock;
struct vnuma_info *vnuma;
+
+ /* HVM deprivileged mode data */
+ int hvm_depriv_valid_l4e_code;
+ l4_pgentry_t hvm_depriv_l4e_code;
};
struct domain_setup_info
--
2.1.4
* [PATCH RFC v2 2/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
From: Ben Catterall @ 2015-09-03 16:01 UTC (permalink / raw)
To: xen-devel
Cc: keir, suravee.suthikulpanit, george.dunlap, andrew.cooper3, tim,
Aravind.Gopalakrishnan, jbeulich, Ben.Catterall, boris.ostrovsky,
ian.campbell
The process to switch into and out of deprivileged mode can be likened to
setjmp/longjmp.
Xen is non-preemptible, and taking an interrupt/exception, SYSCALL, SYSENTER,
NMI or any IST event would currently clobber the Xen privileged stack. We need
this stack to be preserved so that, after executing deprivileged mode, we can
return to our previous privileged execution point. This allows us to unwind the
stack, cleaning up memory allocations.
To enter deprivileged mode, we move the interrupt/exception rsp,
SYSENTER rsp and SYSCALL rsp to point lower down Xen's privileged stack
so that they cannot clobber the in-use portion. The NMI and double-fault IST
handlers used to copy themselves onto the privileged stack; they now remain
on their dedicated IST stacks.
This means that we can continue execution from that point. This is similar
behaviour to a context switch.
To exit deprivileged mode, we restore the original interrupt/exception rsp,
SYSENTER rsp and SYSCALL rsp. We can then continue execution from where we left
off, which will unwind the stack and free up resources. This method means that
we do not need to change any other code paths and its invocation will be
transparent to callers. This should allow the feature to be more easily
deployed to different parts of Xen.
The switch to and from deprivileged mode is performed using sysret and syscall
respectively.
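For orientation, a simplified sketch of the enter/exit flow implemented by this
patch (function names are those added by the series; register saving, interrupt
masking and error paths are elided):

    hvm_deprivileged_user_mode()                /* C entry point */
      depriv_ctxt_switch_to()                   /* save guest state, set EFER.SCE */
      hvm_deprivileged_user_mode_asm()
        /* push GPRs and rflags onto the privileged stack */
        hvm_deprivileged_setup_stacks(rsp)      /* TSS.rsp0 and SYSCALL/SYSENTER
                                                   stacks moved below current rsp */
        sysretq  ------------------------------>  hvm_deprivileged_ring3()
                                                    /* ring-3 work on its own stack */
        lstar_enter  <--------------------------  syscall
          handle_hvm_user_mode
            hvm_deprivileged_finish_user_mode_asm()  /* restore the saved rsp, ret */
        hvm_deprivileged_restore_stacks()       /* put the original stacks back */
        /* pop GPRs, return to the caller of hvm_deprivileged_user_mode_asm() */
      depriv_ctxt_switch_from()                 /* restore guest state and EFER */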
Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
Changed since v1
----------------
* Added support for AMD SVM
* Moved to the new stack approach
* IST handlers no longer copy themselves
* Updated context switching code to perform a full context-switch.
This means that depriv mode will execute with host register state rather than
(partial) guest register state. This allows the domain to be crashed (later
patch) whilst in depriv mode, alleviates potential security vulnerabilities
and is necessary to work around the AMD TR issue.
* Moved processor-specific code to processor-specific files.
* Changed call/jmp pair in deprivileged_asm.S to call/ret pair to not confuse
processor branch predictors.
---
xen/arch/x86/domain.c | 12 +++
xen/arch/x86/hvm/Makefile | 1 +
xen/arch/x86/hvm/deprivileged.c | 103 ++++++++++++++++++++++
xen/arch/x86/hvm/deprivileged_asm.S | 167 ++++++++++++++++++++++++++++++++++++
xen/arch/x86/hvm/svm/svm.c | 130 +++++++++++++++++++++++++++-
xen/arch/x86/hvm/vmx/vmx.c | 118 +++++++++++++++++++++++++
xen/arch/x86/mm/hap/hap.c | 2 +-
xen/arch/x86/x86_64/asm-offsets.c | 5 ++
xen/arch/x86/x86_64/entry.S | 38 ++++++--
xen/arch/x86/x86_64/traps.c | 13 ++-
xen/include/asm-x86/current.h | 2 +
xen/include/asm-x86/hvm/svm/svm.h | 13 +++
xen/include/asm-x86/hvm/vcpu.h | 15 ++++
xen/include/asm-x86/hvm/vmx/vmx.h | 2 +
xen/include/asm-x86/processor.h | 2 +
xen/include/asm-x86/system.h | 3 +
xen/include/xen/hvm/deprivileged.h | 45 ++++++++++
xen/include/xen/sched.h | 18 +++-
18 files changed, 674 insertions(+), 15 deletions(-)
create mode 100644 xen/arch/x86/hvm/deprivileged_asm.S
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 045f6ff..a0e5e70 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -62,6 +62,7 @@
#include <xen/iommu.h>
#include <compat/vcpu.h>
#include <asm/psr.h>
+#include <xen/hvm/deprivileged.h>
DEFINE_PER_CPU(struct vcpu *, curr_vcpu);
DEFINE_PER_CPU(unsigned long, cr4);
@@ -446,6 +447,12 @@ int vcpu_initialise(struct vcpu *v)
if ( has_hvm_container_domain(d) )
{
rc = hvm_vcpu_initialise(v);
+
+ /* Initialise HVM deprivileged mode */
+ printk("HVM initialising deprivileged mode ...");
+ hvm_deprivileged_prepare_vcpu(v);
+ printk("Done.\n");
+
goto done;
}
@@ -523,7 +530,12 @@ void vcpu_destroy(struct vcpu *v)
vcpu_destroy_fpu(v);
if ( has_hvm_container_vcpu(v) )
+ {
+ /* Destroy the deprivileged mode on this vcpu */
+ hvm_deprivileged_destroy_vcpu(v);
+
hvm_vcpu_destroy(v);
+ }
else
xfree(v->arch.pv_vcpu.trap_ctxt);
}
diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
index df5ebb8..e16960a 100644
--- a/xen/arch/x86/hvm/Makefile
+++ b/xen/arch/x86/hvm/Makefile
@@ -3,6 +3,7 @@ subdir-y += vmx
obj-y += asid.o
obj-y += deprivileged.o
+obj-y += deprivileged_asm.o
obj-y += emulate.o
obj-y += event.o
obj-y += hpet.o
diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
index f34ed67..994c19e 100644
--- a/xen/arch/x86/hvm/deprivileged.c
+++ b/xen/arch/x86/hvm/deprivileged.c
@@ -512,3 +512,106 @@ struct page_info *hvm_deprivileged_alloc_page(struct domain *d)
return pg;
}
+
+/* Used to prepare each vcpus data for user mode. Call for each HVM vcpu. */
+int hvm_deprivileged_prepare_vcpu(struct vcpu *vcpu)
+{
+ vcpu->arch.hvm_vcpu.depriv_rsp = 0;
+ vcpu->arch.hvm_vcpu.depriv_user_mode = 0;
+ vcpu->arch.hvm_vcpu.depriv_destroy = 0;
+ vcpu->arch.hvm_vcpu.depriv_watchdog_count = 0;
+
+ return 0;
+}
+
+/* Called on destroying each vcpu */
+void hvm_deprivileged_destroy_vcpu(struct vcpu *vcpu)
+{
+
+}
+
+/*
+ * Called to perform a user mode operation.
+ * Execution context is preserved and then we move into user mode.
+ * This method is then jumped into to restore execution context after
+ * exiting user mode.
+ */
+void hvm_deprivileged_user_mode(void)
+{
+ struct vcpu *vcpu = get_current();
+
+ ASSERT( vcpu->arch.hvm_vcpu.depriv_user_mode == 0 );
+ ASSERT( vcpu->arch.hvm_vcpu.depriv_rsp == 0 );
+
+ vcpu->arch.hvm_vcpu.depriv_ctxt_switch_to(vcpu);
+
+ /* The assembly routine to handle moving into/out of deprivileged mode */
+ hvm_deprivileged_user_mode_asm();
+
+ vcpu->arch.hvm_vcpu.depriv_ctxt_switch_from(vcpu);
+
+ vcpu->arch.hvm_vcpu.depriv_user_mode = 0;
+ vcpu->arch.hvm_vcpu.depriv_rsp = 0;
+}
+
+/*
+ * We need to be able to handle interrupts and exceptions whilst in deprivileged
+ * mode. Xen is non-preemptible so our privileged mode stack would be clobbered
+ * if we took an exception/interrupt, syscall or sysenter whilst in deprivileged
+ * mode.
+ *
+ * To handle this, we setup another set of stacks for interrupts/exceptions,
+ * syscall and sysenter. This is done by
+ * - changing TSS.rsp0 so that interrupts and exceptions are taken on a part of
+ * the Xen stack past our current rsp.
+ * - moving the syscall and sysenter stacks so these are also moved past our
+ * execution point.
+ *
+ * This function is called at the point where this rsp is as deep as it will
+ * be on the return path so we can safely clobber after it. It has also been
+ * aligned as needed for a stack pointer.
+ * We do not need to change the IST stack pointers as these are already taken on
+ * different stacks so won't clobber our current Xen stack.
+ *
+ * New Stack Layout
+ * ----------------
+ *
+ * Xen's cpu stacks are 8 pages (8-page aligned), arranged as:
+ *
+ * 7 - Primary stack (with a struct cpu_info at the top)
+ * 6 - Primary stack
+ * - Somewhere in 6 and 7 (depending upon where rsp is when we enter
+ * deprivileged mode), we set the syscall/sysenter and exception pointer
+ * so that it is below the current rsp.
+ * 5 - Optionally not present (MEMORY_GUARD)
+ * 4 - unused
+ * 3 - Syscall trampolines
+ * 2 - MCE IST stack
+ * 1 - NMI IST stack
+ * 0 - Double Fault IST stack
+ */
+void hvm_deprivileged_setup_stacks(unsigned long stack_ptr)
+{
+ get_current()->arch.hvm_vcpu.depriv_setup_stacks(stack_ptr);
+}
+
+/*
+ * Restore the old TSS.rsp0 for the interrupt/exception stack and the
+ * syscall/sysenter stacks.
+ */
+void hvm_deprivileged_restore_stacks(void)
+{
+ get_current()->arch.hvm_vcpu.depriv_restore_stacks();
+}
+
+/*
+ * Called when the user mode operation has completed.
+ * Perform C-level processing on the return path.
+ */
+void hvm_deprivileged_finish_user_mode(void)
+{
+ /* If we are not returning from user mode: bail */
+ ASSERT(get_current()->arch.hvm_vcpu.depriv_user_mode == 1);
+
+ hvm_deprivileged_finish_user_mode_asm();
+}
diff --git a/xen/arch/x86/hvm/deprivileged_asm.S b/xen/arch/x86/hvm/deprivileged_asm.S
new file mode 100644
index 0000000..07d4216
--- /dev/null
+++ b/xen/arch/x86/hvm/deprivileged_asm.S
@@ -0,0 +1,167 @@
+/*
+ * HVM deprivileged mode assembly code
+ */
+
+#include <xen/config.h>
+#include <xen/errno.h>
+#include <xen/softirq.h>
+#include <asm/asm_defns.h>
+#include <asm/apicdef.h>
+#include <asm/page.h>
+#include <public/xen.h>
+#include <irq_vectors.h>
+#include <xen/hvm/deprivileged.h>
+
+/*
+ * Handles entry into the deprivileged mode and returning from this
+ * mode.
+ *
+ * If we are entering deprivileged mode, then we use a sysret to get there.
+ * If we are returning from deprivileged mode, then we need to unwind the stack,
+ * so we push a return address onto the current stack; this lets us come back
+ * into this function and then return from it, unwinding the stack.
+ *
+ * We're doing a sort of setjmp/longjmp, saving state onto the stack so that
+ * the returning code can continue executing from within this function.
+ */
+ENTRY(hvm_deprivileged_user_mode_asm)
+ /* Save our registers */
+ push %rax
+ push %rbx
+ push %rcx
+ push %rdx
+ push %rsi
+ push %rdi
+ push %rbp
+ push %r8
+ push %r9
+ push %r10
+ push %r11
+ push %r12
+ push %r13
+ push %r14
+ push %r15
+ pushfq
+
+ /* Perform a near call to push rip onto the stack */
+ call 1f
+
+ /*
+ * MAGIC: Add to the stored rip the size of the code between
+ * label 1 and label 2. This allows us to restart execution at label 2.
+ */
+1: addq $2f-1b, (%rsp)
+
+ /*
+ * Setup the stack pointers for exceptions, syscall and sysenter to be
+ * just after our current rsp, adjusted for 16 byte alignment.
+ */
+ mov %rsp, %rdi
+ and $-16, %rdi
+ call hvm_deprivileged_setup_stacks
+ /*
+ * DO NOT push any more data onto the stack from here unless returning
+ * from user mode. It will be clobbered by exceptions/interrupts,
+ * syscall and sysenter.
+ */
+
+/* USER MODE ENTRY POINT */
+2:
+ GET_CURRENT(%r8)
+ movq VCPU_depriv_user_mode(%r8), %rdx
+
+ /* If !user_mode */
+ cmpq $0, %rdx
+ jne 3f
+ cli
+
+ movq %rsp, VCPU_depriv_rsp(%r8) /* The rsp to restore to */
+ movabs $HVM_DEPRIVILEGED_TEXT_ADDR, %rcx /* RIP in user mode */
+
+ /* RFLAGS user mode */
+ movq $(X86_EFLAGS_IF | X86_EFLAGS_VIP), %r11
+ movq $1, VCPU_depriv_user_mode(%r8) /* Now in user mode */
+
+ /*
+ * Stack ptr is set by user mode. If we set rsp to the user mode stack
+ * pointer here and subsequently took an interrupt or exception between
+ * setting it and executing sysret, then the interrupt would use the
+ * user mode stack pointer. This is because the current stack rsp is
+ * used if the exception descriptor's privilege level = CPL.
+ * See Intel manual volume 3A section 6.12.1 and AMD manual volume 2,
+ * section 8.9.3. Also see Intel manual volume 2 and AMD manual 3 on
+ * the sysret instruction.
+ */
+ movq $HVM_STACK_PTR, %rbx
+ sysretq /* Enter deprivileged mode */
+
+3: call hvm_deprivileged_restore_stacks
+
+ /*
+ * Restore registers
+ * The return rip has been popped by the ret on the return path
+ */
+ popfq
+ pop %r15
+ pop %r14
+ pop %r13
+ pop %r12
+ pop %r11
+ pop %r10
+ pop %r9
+ pop %r8
+ pop %rbp
+ pop %rdi
+ pop %rsi
+ pop %rdx
+ pop %rcx
+ pop %rbx
+ pop %rax
+ ret
+
+/* Finished in user mode so return */
+ENTRY(hvm_deprivileged_finish_user_mode_asm)
+ /* Reset rsp to the old rsp */
+ cli
+ GET_CURRENT(%rbx)
+ movq VCPU_depriv_rsp(%rbx), %rsp
+
+ /*
+ * The return address that the near call pushed onto the
+ * buffer is pointed to by rsp, so use that for rip.
+ */
+ /* Go to user mode return code */
+ ret
+
+/* Entry point from the assembly syscall handlers */
+ENTRY(hvm_deprivileged_handle_user_mode)
+
+ /* Handle a user mode hypercall here */
+
+
+ /* We are finished in user mode */
+ call hvm_deprivileged_finish_user_mode
+
+ ret
+
+.section .hvm_deprivileged_enhancement.text,"ax"
+/* HVM deprivileged code */
+ENTRY(hvm_deprivileged_ring3)
+ /*
+ * sysret has loaded eip from rcx and rflags from r11.
+ * CS and SS have been loaded from the MSR for ring 3.
+ * We now need to switch to the user mode stack
+ */
+ movabs $HVM_STACK_PTR, %rsp
+
+ /* Perform user mode processing */
+ movabs $0xff, %rcx
+1: dec %rcx
+ cmp $0, %rcx
+ jne 1b
+
+ /* Return to ring 0 */
+ syscall
+
+.previous
diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index 8de41fa..3393fb5 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -61,6 +61,11 @@
#include <asm/apic.h>
#include <asm/debugger.h>
#include <asm/xstate.h>
+#include <xen/hvm/deprivileged.h>
+
+/* HVM svm MSR_{L}STAR cache */
+DEFINE_PER_CPU(u64, svm_depriv_msr_lstar);
+DEFINE_PER_CPU(u64, svm_depriv_msr_star);
void svm_asm_do_resume(void);
@@ -962,12 +967,30 @@ static inline void svm_tsc_ratio_save(struct vcpu *v)
wrmsrl(MSR_AMD64_TSC_RATIO, DEFAULT_TSC_RATIO);
}
+unsigned long svm_depriv_read_msr_star(void)
+{
+ return this_cpu(svm_depriv_msr_star);
+}
+
+void svm_depriv_write_msr_star(unsigned long star)
+{
+ this_cpu(svm_depriv_msr_star) = star;
+}
+unsigned long svm_depriv_read_msr_lstar(void)
+{
+ return this_cpu(svm_depriv_msr_lstar);
+}
+
+void svm_depriv_write_msr_lstar(unsigned long lstar)
+{
+ this_cpu(svm_depriv_msr_lstar) = lstar;
+}
+
static inline void svm_tsc_ratio_load(struct vcpu *v)
{
if ( cpu_has_tsc_ratio && !v->domain->arch.vtsc )
wrmsrl(MSR_AMD64_TSC_RATIO, vcpu_tsc_ratio(v));
}
-
static void svm_ctxt_switch_from(struct vcpu *v)
{
int cpu = smp_processor_id();
@@ -1030,6 +1053,93 @@ static void svm_ctxt_switch_to(struct vcpu *v)
wrmsrl(MSR_TSC_AUX, hvm_msr_tsc_aux(v));
}
+static void svm_depriv_ctxt_switch_from(struct vcpu *v)
+{
+
+ svm_ctxt_switch_to(v);
+ vcpu_restore_fpu_eager(v);
+
+ /* Restore the efer and saved msr registers */
+ write_efer(v->arch.hvm_vcpu.depriv_efer);
+}
+
+/* Setup our stack pointers for interrupts/exceptions, and SYSCALL. */
+static void svm_depriv_setup_stacks(unsigned long stack_ptr)
+{
+ struct vcpu *vcpu = get_current();
+ struct tss_struct *tss = &this_cpu(init_tss);
+ unsigned char *stub_page;
+ unsigned long stub_va = this_cpu(stubs.addr);
+ unsigned int offset;
+
+ /* Save the current rsp0 */
+ vcpu->arch.hvm_vcpu.depriv_tss_rsp0 = tss->rsp0;
+
+ /* Setup the stack for interrupts/exceptions */
+ tss->rsp0 = stack_ptr;
+
+ /* Stacks for syscall and sysenter */
+ stub_page = map_domain_page(_mfn(this_cpu(stubs.mfn)));
+
+ offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
+ stub_va, stack_ptr,
+ (unsigned long)lstar_enter);
+
+ stub_va += offset;
+
+ offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
+ stub_va, stack_ptr,
+ (unsigned long)cstar_enter);
+
+ /* Don't consume more than half of the stub space here. */
+ ASSERT(offset <= STUB_BUF_SIZE / 2);
+
+ unmap_domain_page(stub_page);
+}
+
+static void svm_depriv_restore_stacks(void)
+{
+ struct vcpu* vcpu = get_current();
+ struct tss_struct *tss = &this_cpu(init_tss);
+ unsigned char *stub_page;
+ unsigned long stack_bottom = get_stack_bottom();
+ unsigned long stub_va = this_cpu(stubs.addr);
+ unsigned int offset;
+
+ stub_page = map_domain_page(_mfn(this_cpu(stubs.mfn)));
+
+ /* Restore the old rsp0 */
+ tss->rsp0 = vcpu->arch.hvm_vcpu.depriv_tss_rsp0;
+
+ /* Restore the old syscall/sysenter stacks */
+ offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
+ stub_va, stack_bottom,
+ (unsigned long)lstar_enter);
+ stub_va += offset;
+
+ /* Trampoline for SYSCALL entry from compatibility mode. */
+ offset += write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
+ stub_va, stack_bottom,
+ (unsigned long)cstar_enter);
+
+ /* Don't consume more than half of the stub space here. */
+ ASSERT(offset <= STUB_BUF_SIZE / 2);
+
+ unmap_domain_page(stub_page);
+}
+
+static void svm_depriv_ctxt_switch_to(struct vcpu *v)
+{
+ vcpu_save_fpu(v);
+ svm_ctxt_switch_from(v);
+
+ v->arch.hvm_vcpu.depriv_efer = read_efer();
+
+ /* Flip the SCE bit to allow sysret/call */
+ write_efer(v->arch.hvm_vcpu.depriv_efer | EFER_SCE);
+}
+
+
static void noreturn svm_do_resume(struct vcpu *v)
{
struct vmcb_struct *vmcb = v->arch.hvm_svm.vmcb;
@@ -1156,6 +1266,12 @@ static int svm_vcpu_initialise(struct vcpu *v)
v->arch.hvm_svm.launch_core = -1;
+ /* HVM deprivileged mode operations */
+ v->arch.hvm_vcpu.depriv_ctxt_switch_to = svm_depriv_ctxt_switch_to;
+ v->arch.hvm_vcpu.depriv_ctxt_switch_from = svm_depriv_ctxt_switch_from;
+ v->arch.hvm_vcpu.depriv_setup_stacks = svm_depriv_setup_stacks;
+ v->arch.hvm_vcpu.depriv_restore_stacks = svm_depriv_restore_stacks;
+
if ( (rc = svm_create_vmcb(v)) != 0 )
{
dprintk(XENLOG_WARNING,
@@ -2547,7 +2663,19 @@ void svm_vmexit_handler(struct cpu_user_regs *regs)
{
uint16_t port = (vmcb->exitinfo1 >> 16) & 0xFFFF;
int bytes = ((vmcb->exitinfo1 >> 4) & 0x07);
+
int dir = (vmcb->exitinfo1 & 1) ? IOREQ_READ : IOREQ_WRITE;
+ /* DEBUG: Run only for a specific port */
+ if(port == 0x1000)
+ {
+ if( guest_cpu_user_regs()->eax == 0x1)
+ {
+ hvm_deprivileged_user_mode();
+ }
+ __update_guest_eip(regs, vmcb->exitinfo2 - vmcb->rip);
+ break;
+ }
+
if ( handle_pio(port, bytes, dir) )
__update_guest_eip(regs, vmcb->exitinfo2 - vmcb->rip);
}
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 2582cdd..1ec23f9 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -59,6 +59,8 @@
#include <asm/event.h>
#include <asm/monitor.h>
#include <public/arch-x86/cpuid.h>
+#include <xen/hvm/deprivileged.h>
+
static bool_t __initdata opt_force_ept;
boolean_param("force-ept", opt_force_ept);
@@ -68,6 +70,11 @@ enum handler_return { HNDL_done, HNDL_unhandled, HNDL_exception_raised };
static void vmx_ctxt_switch_from(struct vcpu *v);
static void vmx_ctxt_switch_to(struct vcpu *v);
+static void vmx_depriv_ctxt_switch_from(struct vcpu *v);
+static void vmx_depriv_ctxt_switch_to(struct vcpu *v);
+static void vmx_depriv_setup_stacks(unsigned long stack_ptr);
+static void vmx_depriv_restore_stacks(void);
+
static int vmx_alloc_vlapic_mapping(struct domain *d);
static void vmx_free_vlapic_mapping(struct domain *d);
static void vmx_install_vlapic_mapping(struct vcpu *v);
@@ -110,6 +117,12 @@ static int vmx_vcpu_initialise(struct vcpu *v)
v->arch.ctxt_switch_from = vmx_ctxt_switch_from;
v->arch.ctxt_switch_to = vmx_ctxt_switch_to;
+ /* HVM deprivileged mode operations */
+ v->arch.hvm_vcpu.depriv_ctxt_switch_to = vmx_depriv_ctxt_switch_to;
+ v->arch.hvm_vcpu.depriv_ctxt_switch_from = vmx_depriv_ctxt_switch_from;
+ v->arch.hvm_vcpu.depriv_setup_stacks = vmx_depriv_setup_stacks;
+ v->arch.hvm_vcpu.depriv_restore_stacks = vmx_depriv_restore_stacks;
+
if ( (rc = vmx_create_vmcs(v)) != 0 )
{
dprintk(XENLOG_WARNING,
@@ -272,6 +285,7 @@ long_mode_do_msr_write(unsigned int msr, uint64_t msr_content)
case MSR_LSTAR:
if ( !is_canonical_address(msr_content) )
goto uncanonical_address;
+
WRITE_MSR(LSTAR);
break;
@@ -707,6 +721,98 @@ static void vmx_fpu_leave(struct vcpu *v)
}
}
+static void vmx_depriv_setup_stacks(unsigned long stack_ptr)
+{
+ struct vcpu *vcpu = get_current();
+ struct tss_struct *tss = &this_cpu(init_tss);
+ unsigned char *stub_page;
+ unsigned long stub_va = this_cpu(stubs.addr);
+ unsigned int offset;
+
+ /* Save the current rsp0 */
+ vcpu->arch.hvm_vcpu.depriv_tss_rsp0 = tss->rsp0;
+
+ /* Setup the stack for interrupts/exceptions */
+ tss->rsp0 = stack_ptr;
+
+ /* Stacks for syscall and sysenter */
+ stub_page = map_domain_page(_mfn(this_cpu(stubs.mfn)));
+
+ offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
+ stub_va, stack_ptr,
+ (unsigned long)lstar_enter);
+
+ stub_va += offset;
+
+ if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL ||
+ boot_cpu_data.x86_vendor == X86_VENDOR_CENTAUR )
+ {
+ wrmsrl(MSR_IA32_SYSENTER_ESP, stack_ptr);
+ }
+
+ offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
+ stub_va, stack_ptr,
+ (unsigned long)cstar_enter);
+
+ /* Don't consume more than half of the stub space here. */
+ ASSERT(offset <= STUB_BUF_SIZE / 2);
+
+ unmap_domain_page(stub_page);
+}
+
+static void vmx_depriv_restore_stacks(void)
+{
+ struct vcpu* vcpu = get_current();
+ struct tss_struct *tss = &this_cpu(init_tss);
+ unsigned char *stub_page;
+ unsigned long stack_bottom = get_stack_bottom();
+ unsigned long stub_va = this_cpu(stubs.addr);
+ unsigned int offset;
+
+ stub_page = map_domain_page(_mfn(this_cpu(stubs.mfn)));
+
+ /* Restore the old rsp0 */
+ tss->rsp0 = vcpu->arch.hvm_vcpu.depriv_tss_rsp0;
+
+ /* Restore the old syscall/sysenter stacks */
+ offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
+ stub_va, stack_bottom,
+ (unsigned long)lstar_enter);
+ stub_va += offset;
+
+ wrmsrl(MSR_IA32_SYSENTER_ESP, stack_bottom);
+
+ /* Trampoline for SYSCALL entry from compatibility mode. */
+ offset += write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
+ stub_va, stack_bottom,
+ (unsigned long)cstar_enter);
+
+ /* Don't consume more than half of the stub space here. */
+ ASSERT(offset <= STUB_BUF_SIZE / 2);
+
+ unmap_domain_page(stub_page);
+}
+
+static void vmx_depriv_ctxt_switch_from(struct vcpu *v)
+{
+ vmx_ctxt_switch_to(v);
+ vcpu_save_fpu(v);
+
+ /* Restore the efer and saved msr registers */
+ write_efer(v->arch.hvm_vcpu.depriv_efer);
+}
+
+static void vmx_depriv_ctxt_switch_to(struct vcpu *v)
+{
+ vcpu_save_fpu(v);
+ vmx_ctxt_switch_from(v);
+
+ v->arch.hvm_vcpu.depriv_efer = read_efer();
+
+ /* Flip the SCE bit to allow sysret/call */
+ write_efer(v->arch.hvm_vcpu.depriv_efer | EFER_SCE);
+}
+
static void vmx_ctxt_switch_from(struct vcpu *v)
{
/*
@@ -3341,6 +3447,18 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
uint16_t port = (exit_qualification >> 16) & 0xFFFF;
int bytes = (exit_qualification & 0x07) + 1;
int dir = (exit_qualification & 0x08) ? IOREQ_READ : IOREQ_WRITE;
+
+ /* DEBUG: Run only for a specific port */
+ if(port == 0x1000)
+ {
+ if( guest_cpu_user_regs()->eax == 0x1)
+ {
+ hvm_deprivileged_user_mode();
+ }
+ update_guest_eip(); /* Safe: IN, OUT */
+ break;
+ }
+
if ( handle_pio(port, bytes, dir) )
update_guest_eip(); /* Safe: IN, OUT */
}
diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index 4048929..5633e82 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -40,7 +40,7 @@
#include <asm/domain.h>
#include <xen/numa.h>
#include <asm/hvm/nestedhvm.h>
-
+#include <asm/hvm/vmx/vmx.h>
#include "private.h"
#include <xen/hvm/deprivileged.h>
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index 447c650..7af824a 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -115,6 +115,11 @@ void __dummy__(void)
OFFSET(VCPU_nsvm_hap_enabled, struct vcpu, arch.hvm_vcpu.nvcpu.u.nsvm.ns_hap_enabled);
BLANK();
+ OFFSET(VCPU_depriv_rsp, struct vcpu, arch.hvm_vcpu.depriv_rsp);
+ OFFSET(VCPU_depriv_user_mode, struct vcpu, arch.hvm_vcpu.depriv_user_mode);
+ OFFSET(VCPU_depriv_destroy, struct vcpu, arch.hvm_vcpu.depriv_destroy);
+ BLANK();
+
OFFSET(DOMAIN_is_32bit_pv, struct domain, arch.is_32bit_pv);
BLANK();
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 74677a2..9590065 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -102,6 +102,18 @@ restore_all_xen:
RESTORE_ALL adj=8
iretq
+
+/* Returning from user mode */
+ENTRY(handle_hvm_user_mode)
+
+ call hvm_deprivileged_handle_user_mode
+
+ /* fallthrough */
+hvm_depriv_mode:
+
+ /* Go back into user mode */
+ jmp restore_all_guest
+
/*
* When entering SYSCALL from kernel mode:
* %rax = hypercall vector
@@ -128,6 +140,11 @@ ENTRY(lstar_enter)
pushq $0
SAVE_VOLATILE TRAP_syscall
GET_CURRENT(%rbx)
+
+ /* Were we in Xen's ring 3? */
+ cmpq $1, VCPU_depriv_user_mode(%rbx)
+ je handle_hvm_user_mode
+
testb $TF_kernel_mode,VCPU_thread_flags(%rbx)
jz switch_to_kernel
@@ -487,6 +504,10 @@ ENTRY(common_interrupt)
/* No special register assumptions. */
ENTRY(ret_from_intr)
GET_CURRENT(%rbx)
+
+ /* If we are in Xen's user mode */
+ cmpq $1,VCPU_depriv_user_mode(%rbx)
+ je hvm_depriv_mode
testb $3,UREGS_cs(%rsp)
jz restore_all_xen
movq VCPU_domain(%rbx),%rax
@@ -509,6 +530,10 @@ handle_exception_saved:
GET_CURRENT(%rbx)
PERFC_INCR(exceptions, %rax, %rbx)
callq *(%rdx,%rax,8)
+
+ /* If we are in Xen's user mode */
+ cmpq $1, VCPU_depriv_user_mode(%rbx)
+ je hvm_depriv_mode
testb $3,UREGS_cs(%rsp)
jz restore_all_xen
leaq VCPU_trap_bounce(%rbx),%rdx
@@ -636,15 +661,7 @@ ENTRY(nmi)
movl $TRAP_nmi,4(%rsp)
handle_ist_exception:
SAVE_ALL CLAC
- testb $3,UREGS_cs(%rsp)
- jz 1f
- /* Interrupted guest context. Copy the context to stack bottom. */
- GET_CPUINFO_FIELD(guest_cpu_user_regs,%rdi)
- movq %rsp,%rsi
- movl $UREGS_kernel_sizeof/8,%ecx
- movq %rdi,%rsp
- rep movsq
-1: movq %rsp,%rdi
+ movq %rsp,%rdi
movzbl UREGS_entry_vector(%rsp),%eax
leaq exception_table(%rip),%rdx
callq *(%rdx,%rax,8)
@@ -664,6 +681,9 @@ handle_ist_exception:
movl $EVENT_CHECK_VECTOR,%edi
call send_IPI_self
1: movq VCPU_domain(%rbx),%rax
+ /* This also handles Xen ring3 return for us.
+ * So, there is no need to explicitly do a user mode check.
+ */
cmpb $0,DOMAIN_is_32bit_pv(%rax)
je restore_all_guest
jmp compat_restore_all_guest
diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c
index 0846a19..c7e6077 100644
--- a/xen/arch/x86/x86_64/traps.c
+++ b/xen/arch/x86/x86_64/traps.c
@@ -24,6 +24,7 @@
#include <asm/hvm/hvm.h>
#include <asm/hvm/support.h>
#include <public/callback.h>
+#include <asm/hvm/svm/svm.h>
static void print_xen_info(void)
@@ -337,7 +338,7 @@ unsigned long do_iret(void)
return 0;
}
-static unsigned int write_stub_trampoline(
+unsigned int write_stub_trampoline(
unsigned char *stub, unsigned long stub_va,
unsigned long stack_bottom, unsigned long target_va)
{
@@ -368,8 +369,6 @@ static unsigned int write_stub_trampoline(
}
DEFINE_PER_CPU(struct stubs, stubs);
-void lstar_enter(void);
-void cstar_enter(void);
void __devinit subarch_percpu_traps_init(void)
{
@@ -385,6 +384,14 @@ void __devinit subarch_percpu_traps_init(void)
/* Trampoline for SYSCALL entry from 64-bit mode. */
wrmsrl(MSR_LSTAR, stub_va);
+
+ /*
+ * HVM deprivileged mode on AMD. Writes to MSR_STAR/MSR_LSTAR are not
+ * trapped, so we need to keep a copy of the host's MSRs.
+ */
+ svm_depriv_write_msr_star((unsigned long)((FLAT_RING3_CS32<<16) | __HYPERVISOR_CS) << 32);
+ svm_depriv_write_msr_lstar(stub_va);
+
offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
stub_va, stack_bottom,
(unsigned long)lstar_enter);
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index f011d2d..c1dae3a 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -23,6 +23,8 @@
* 2 - MCE IST stack
* 1 - NMI IST stack
* 0 - Double Fault IST stack
+ *
+ * NOTE: This layout changes slightly in HVM deprivileged mode.
*/
/*
diff --git a/xen/include/asm-x86/hvm/svm/svm.h b/xen/include/asm-x86/hvm/svm/svm.h
index d60ec23..45dd125 100644
--- a/xen/include/asm-x86/hvm/svm/svm.h
+++ b/xen/include/asm-x86/hvm/svm/svm.h
@@ -110,4 +110,17 @@ extern void svm_host_osvw_init(void);
#define _NPT_PFEC_in_gpt 33
#define NPT_PFEC_in_gpt (1UL<<_NPT_PFEC_in_gpt)
+/*
+ * HVM deprivileged mode: per-CPU SVM cache of the host's MSR_STAR/MSR_LSTAR.
+ * SVM does not trap guest writes to these MSRs, so we need to preserve the
+ * host values.
+ */
+DECLARE_PER_CPU(u64, svm_depriv_msr_lstar);
+DECLARE_PER_CPU(u64, svm_depriv_msr_star);
+
+unsigned long svm_depriv_read_msr_star(void);
+void svm_depriv_write_msr_star(unsigned long star);
+unsigned long svm_depriv_read_msr_lstar(void);
+void svm_depriv_write_msr_lstar(unsigned long lstar);
+
#endif /* __ASM_X86_HVM_SVM_H__ */
diff --git a/xen/include/asm-x86/hvm/vcpu.h b/xen/include/asm-x86/hvm/vcpu.h
index f553814..f7df9d4 100644
--- a/xen/include/asm-x86/hvm/vcpu.h
+++ b/xen/include/asm-x86/hvm/vcpu.h
@@ -202,6 +202,21 @@ struct hvm_vcpu {
void (*fpu_exception_callback)(void *, struct cpu_user_regs *);
void *fpu_exception_callback_arg;
+ /* Context switching for HVM deprivileged mode */
+ void (*depriv_ctxt_switch_to)(struct vcpu *v);
+ void (*depriv_ctxt_switch_from)(struct vcpu *v);
+ void (*depriv_setup_stacks)(unsigned long stack_ptr);
+ void (*depriv_restore_stacks)(void);
+
+ /* HVM deprivileged mode state */
+ struct segment_register depriv_tr;
+ unsigned long depriv_rsp; /* rsp of our stack to restore our data to */
+ unsigned long depriv_user_mode; /* Are we in user mode */
+ unsigned long depriv_efer;
+ unsigned long depriv_tss_rsp0;
+ unsigned long depriv_destroy;
+ unsigned long depriv_watchdog_count;
+
/* Pending hw/sw interrupt (.vector = -1 means nothing pending). */
struct hvm_trap inject_trap;
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 3fbfa44..98e269e 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -565,4 +565,6 @@ typedef struct {
u16 eptp_index;
} ve_info_t;
+struct vmx_msr_state *get_host_msr_state(void);
+
#endif /* __ASM_X86_HVM_VMX_VMX_H__ */
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index f507f5e..0fde516 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -547,6 +547,8 @@ void sysenter_entry(void);
void sysenter_eflags_saved(void);
void compat_hypercall(void);
void int80_direct_trap(void);
+void lstar_enter(void);
+void cstar_enter(void);
#define STUBS_PER_PAGE (PAGE_SIZE / STUB_BUF_SIZE)
diff --git a/xen/include/asm-x86/system.h b/xen/include/asm-x86/system.h
index 25a6a2a..e092f36 100644
--- a/xen/include/asm-x86/system.h
+++ b/xen/include/asm-x86/system.h
@@ -240,5 +240,8 @@ void init_idt_traps(void);
void load_system_tables(void);
void percpu_traps_init(void);
void subarch_percpu_traps_init(void);
+unsigned int write_stub_trampoline(
+ unsigned char *stub, unsigned long stub_va,
+ unsigned long stack_bottom, unsigned long target_va);
#endif
diff --git a/xen/include/xen/hvm/deprivileged.h b/xen/include/xen/hvm/deprivileged.h
index bcc8c50..2571108 100644
--- a/xen/include/xen/hvm/deprivileged.h
+++ b/xen/include/xen/hvm/deprivileged.h
@@ -1,5 +1,7 @@
#ifndef __X86_HVM_DEPRIVILEGED
+/* This is also included in the HVM deprivileged mode .S file */
+#ifndef __ASSEMBLY__
#define __X86_HVM_DEPRIVILEGED
#include <asm/page.h>
@@ -75,11 +77,46 @@ int hvm_deprivileged_map_l1(struct domain *d,
/* Used to allocate a page for the deprivileged mode */
struct page_info *hvm_deprivileged_alloc_page(struct domain *d);
+/* Used to prepare each vcpu's data for user mode. Call for each HVM vcpu. */
+int hvm_deprivileged_prepare_vcpu(struct vcpu *vcpu);
+
+/* Destroy each vcpu's data for Xen user mode. Again, call for each vcpu. */
+void hvm_deprivileged_destroy_vcpu(struct vcpu *vcpu);
+
+/* Called to perform a user mode operation. */
+void hvm_deprivileged_user_mode(void);
+
+/* Called when the user mode operation has completed */
+void hvm_deprivileged_finish_user_mode(void);
+
+/* Called to move into and then out of user mode. Needed for accessing
+ * assembly features.
+ */
+void hvm_deprivileged_user_mode_asm(void);
+
+/* Called on the return path to return to the correct execution point */
+void hvm_deprivileged_finish_user_mode_asm(void);
+
+/* Handle any syscalls that the user mode makes */
+void hvm_deprivileged_handle_user_mode(void);
+
+/* Used to set up the stacks for deprivileged mode */
+void hvm_deprivileged_setup_stacks(unsigned long stack_ptr);
+
+/* Used to restore the stacks for deprivileged mode */
+void hvm_deprivileged_restore_stacks(void);
+
+/* The ring 3 code */
+void hvm_deprivileged_ring3(void);
+
/* The segments where the user mode .text and .data are stored */
extern unsigned long __hvm_deprivileged_text_start[];
extern unsigned long __hvm_deprivileged_text_end[];
extern unsigned long __hvm_deprivileged_data_start[];
extern unsigned long __hvm_deprivileged_data_end[];
+
+#endif
+
#define HVM_DEPRIV_STACK_SIZE (PAGE_SIZE << 1)
#define HVM_DEPRIV_STACK_ORDER 1
#define HVM_DEPRIV_MODE 1
@@ -87,4 +124,12 @@ extern unsigned long __hvm_deprivileged_data_end[];
#define HVM_DEPRIV_ALIAS 1
#define HVM_DEPRIV_COPY 0
+/*
+ * The user mode stack pointer.
+ * The stack grows down, so start at the top of the stack region. The
+ * address one past the end of the region is not part of the stack, so
+ * back off by 16 bytes, which also keeps the stack correctly aligned.
+ */
+#define HVM_STACK_PTR (HVM_DEPRIVILEGED_STACK_ADDR + STACK_SIZE - 16)
+
#endif
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 66f4f5e..6c05969 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -137,7 +137,7 @@ void evtchn_destroy_final(struct domain *d); /* from complete_domain_destroy */
struct waitqueue_vcpu;
-struct vcpu
+struct vcpu
{
int vcpu_id;
@@ -158,6 +158,22 @@ struct vcpu
void *sched_priv; /* scheduler-specific data */
+ /* HVM deprivileged mode state */
+ void *stack; /* Location of stack to save data onto */
+ unsigned long rsp; /* rsp of our stack to restore our data to */
+ unsigned long user_mode; /* Are we in (or moving into) user mode? */
+
+ /* The MSR_LSTAR/MSR_STAR of the processor that we are currently
+ * executing on. We need to save these because Xen saves them lazily.
+ */
+ unsigned long int msr_lstar; /* lstar */
+ unsigned long int msr_star;
+
+ /* Debug info */
+ unsigned long int old_rsp;
+ unsigned long int old_processor;
+ unsigned long int old_msr_lstar;
+ unsigned long int old_msr_star;
struct vcpu_runstate_info runstate;
#ifndef CONFIG_COMPAT
# define runstate_guest(v) ((v)->runstate_guest)
--
2.1.4
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH RFC v2 3/4] HVM x86 deprivileged mode: Trap handlers for deprivileged mode
2015-09-03 16:01 [PATCH RFC v2 0/4] HVM x86 deprivileged mode operations Ben Catterall
2015-09-03 16:01 ` [PATCH RFC v2 1/4] HVM x86 deprivileged mode: Create deprivileged page tables Ben Catterall
2015-09-03 16:01 ` [PATCH RFC v2 2/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode Ben Catterall
@ 2015-09-03 16:01 ` Ben Catterall
2015-09-03 16:01 ` [PATCH RFC v2 4/4] HVM x86 deprivileged mode: Watchdog for DoS prevention Ben Catterall
` (3 subsequent siblings)
6 siblings, 0 replies; 12+ messages in thread
From: Ben Catterall @ 2015-09-03 16:01 UTC (permalink / raw)
To: xen-devel
Cc: keir, suravee.suthikulpanit, george.dunlap, andrew.cooper3, tim,
Aravind.Gopalakrishnan, jbeulich, Ben.Catterall, boris.ostrovsky,
ian.campbell
Added trap handlers to catch exceptions such as a page fault, general
protection fault, etc. These handlers will crash the domain as such exceptions
would indicate that either there is a bug in deprivileged mode or it has been
compromised by an attacker.
On calling a domain_crash() whilst in deprivileged mode, we need to restore
the host's context so that we do not have guest-defined registers and values
in use after this point due to lazy loading of these values in the SVM and VMX
implementations.
Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
Changed since v1
----------------
* Changed to domain_crash(), domain_crash_synchronous was used previously.
* Updated to perform a HVM context switch on crashing a domain
* Updated hvm_deprivileged_check_trap() to return a testable error
code and return based on this.
---
xen/arch/x86/hvm/deprivileged.c | 54 ++++++++++++++++++++++++++++++++++++++
xen/arch/x86/traps.c | 48 +++++++++++++++++++++++++++++++++
xen/arch/x86/x86_64/traps.c | 1 -
xen/include/xen/hvm/deprivileged.h | 23 ++++++++++++++++
4 files changed, 125 insertions(+), 1 deletion(-)
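To make the pattern in the hunks below easier to follow, here is a minimal,
self-contained sketch of the guard that each handler adopts. It is
illustrative only and not part of the patch: the demo_* names, the struct and
the printf are invented stand-ins, and the real checks are
is_hvm_deprivileged_vcpu() and hvm_deprivileged_check_trap() in
deprivileged.c.

/*
 * Illustrative sketch only: crash the domain if a fault is taken whilst
 * running deprivileged code, otherwise fall through to normal handling.
 */
#include <stdbool.h>
#include <stdio.h>

struct demo_vcpu {
    bool is_hvm;
    bool depriv_user_mode;  /* currently running deprivileged code? */
    bool depriv_destroy;    /* crash the domain on the return path */
};

static struct demo_vcpu demo_current = {
    .is_hvm = true,
    .depriv_user_mode = true,
};

/* Roughly what is_hvm_deprivileged_vcpu() tests. */
static bool demo_in_depriv_mode(void)
{
    return demo_current.is_hvm && demo_current.depriv_user_mode;
}

/* Returns nonzero if the caller should bail out because the fault was
 * raised from deprivileged mode. */
static int demo_check_trap(const char *handler)
{
    if ( !demo_in_depriv_mode() )
        return 0;

    printf("crashing domain: unexpected fault in %s\n", handler);
    demo_current.depriv_destroy = true;

    return 1;
}

static void demo_page_fault_handler(void)
{
    if ( demo_check_trap(__func__) )
        return;              /* deprivileged mode: domain is being crashed */

    /* ... normal page fault handling would continue here ... */
}

int main(void)
{
    demo_page_fault_handler();
    return 0;
}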
diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
index 994c19e..01efbe1 100644
--- a/xen/arch/x86/hvm/deprivileged.c
+++ b/xen/arch/x86/hvm/deprivileged.c
@@ -615,3 +615,57 @@ void hvm_deprivileged_finish_user_mode(void)
hvm_deprivileged_finish_user_mode_asm();
}
+
+/* Check if we are in deprivileged mode */
+int is_hvm_deprivileged_vcpu(void)
+{
+ struct vcpu *v = get_current();
+
+ if( is_hvm_vcpu(v) && (v->arch.hvm_vcpu.depriv_user_mode) )
+ return 1;
+
+ return 0;
+}
+
+/*
+ * Crash the domain. This should not be called if there are any memory
+ * allocations which will be freed by code following its invocation in the
+ * current execution context (current stack). This is because it causes a
+ * permanent 'context switch' and the current stack will be clobbered, so
+ * any allocations made which are not freed by other paths will leak.
+ * This function should only be used after deprivileged mode has been
+ * successfully switched into; otherwise, the normal domain_crash function
+ * should be used.
+ *
+ * The domain which is crashed is that of the current vcpu.
+ *
+ * To crash the domain, we need to return to our privileged stack as we may have
+ * memory allocations which need to be cleaned up. Then, after we have returned
+ * to this stack, we can then crash the domain. We set a flag which we check
+ * when returning.
+ */
+void hvm_deprivileged_crash_domain(const char *reason)
+{
+ struct vcpu *vcpu = get_current();
+
+ vcpu->arch.hvm_vcpu.depriv_destroy = 1;
+
+ printk(XENLOG_ERR "HVM Deprivileged Mode: Crashing domain. Reason: %s\n",
+ reason);
+
+ /*
+ * Restore the processor's state. We need to go back up the privileged
+ * return path to undo any allocations that got us to this state.
+ */
+ hvm_deprivileged_finish_user_mode();
+ /* DOES NOT RETURN */
+}
+
+/* Handle a trap event */
+int hvm_deprivileged_check_trap(const char* func_name)
+{
+ if( is_hvm_deprivileged_vcpu() )
+ hvm_deprivileged_crash_domain(func_name);
+
+ return 0;
+}
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 9f5a6c6..df89aa9 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -74,6 +74,7 @@
#include <asm/vpmu.h>
#include <public/arch-x86/cpuid.h>
#include <xsm/xsm.h>
+#include <xen/hvm/deprivileged.h>
/*
* opt_nmi: one of 'ignore', 'dom0', or 'fatal'.
@@ -500,6 +501,12 @@ static void do_guest_trap(
struct trap_bounce *tb;
const struct trap_info *ti;
+ /* If we take the trap whilst in HVM deprivileged mode
+ * then we should crash the domain.
+ */
+ if( hvm_deprivileged_check_trap(__FUNCTION__) )
+ return;
+
trace_pv_trap(trapnr, regs->eip, use_error_code, regs->error_code);
tb = &v->arch.pv_vcpu.trap_bounce;
@@ -616,6 +623,11 @@ static void do_trap(struct cpu_user_regs *regs, int use_error_code)
unsigned long fixup;
DEBUGGER_trap_entry(trapnr, regs);
+ /* If we take the trap whilst in HVM deprivileged mode
+ * then we should crash the domain.
+ */
+ if( hvm_deprivileged_check_trap(__FUNCTION__) )
+ return;
if ( guest_mode(regs) )
{
@@ -1070,6 +1082,13 @@ void do_invalid_op(struct cpu_user_regs *regs)
DEBUGGER_trap_entry(TRAP_invalid_op, regs);
+ /*
+ * If we take the trap whilst in HVM deprivileged mode
+ * then we should crash the domain.
+ */
+ if( hvm_deprivileged_check_trap(__func__) )
+ return;
+
if ( likely(guest_mode(regs)) )
{
if ( !emulate_invalid_rdtscp(regs) &&
@@ -1159,6 +1178,12 @@ void do_int3(struct cpu_user_regs *regs)
{
DEBUGGER_trap_entry(TRAP_int3, regs);
+ /* If we take the trap whilst in HVM deprivileged mode
+ * then we should crash the domain.
+ */
+ if( hvm_deprivileged_check_trap(__func__) )
+ return;
+
if ( !guest_mode(regs) )
{
debugger_trap_fatal(TRAP_int3, regs);
@@ -1495,9 +1520,14 @@ void do_page_fault(struct cpu_user_regs *regs)
perfc_incr(page_faults);
+ /* If we get a page fault whilst in HVM deprivileged mode */
+ if( hvm_deprivileged_check_trap(__func__) )
+ return;
+
if ( unlikely(fixup_page_fault(addr, regs) != 0) )
return;
+
if ( unlikely(!guest_mode(regs)) )
{
pf_type = spurious_page_fault(addr, regs);
@@ -3225,6 +3255,12 @@ void do_general_protection(struct cpu_user_regs *regs)
DEBUGGER_trap_entry(TRAP_gp_fault, regs);
+ /* If we take the trap whilst in HVM deprivileged mode
+ * then we should crash the domain.
+ */
+ if( hvm_deprivileged_check_trap(__func__) )
+ return;
+
if ( regs->error_code & 1 )
goto hardware_gp;
@@ -3490,6 +3526,12 @@ void do_device_not_available(struct cpu_user_regs *regs)
BUG_ON(!guest_mode(regs));
+ /* If we take the trap whilst in HVM deprivileged mode
+ * then we should crash the domain.
+ */
+ if( hvm_deprivileged_check_trap(__func__) )
+ return;
+
vcpu_restore_fpu_lazy(curr);
if ( curr->arch.pv_vcpu.ctrlreg[0] & X86_CR0_TS )
@@ -3531,6 +3573,12 @@ void do_debug(struct cpu_user_regs *regs)
DEBUGGER_trap_entry(TRAP_debug, regs);
+ /* If we take the trap whilst in HVM deprivileged mode
+ * then we should crash the domain.
+ */
+ if( hvm_deprivileged_check_trap(__func__) )
+ return;
+
if ( !guest_mode(regs) )
{
if ( regs->eflags & X86_EFLAGS_TF )
diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c
index c7e6077..3bbfc9c 100644
--- a/xen/arch/x86/x86_64/traps.c
+++ b/xen/arch/x86/x86_64/traps.c
@@ -26,7 +26,6 @@
#include <public/callback.h>
#include <asm/hvm/svm/svm.h>
-
static void print_xen_info(void)
{
char taint_str[TAINT_STRING_MAX_LEN];
diff --git a/xen/include/xen/hvm/deprivileged.h b/xen/include/xen/hvm/deprivileged.h
index 2571108..9c08adf 100644
--- a/xen/include/xen/hvm/deprivileged.h
+++ b/xen/include/xen/hvm/deprivileged.h
@@ -106,9 +106,32 @@ void hvm_deprivileged_setup_stacks(unsigned long stack_ptr);
/* Used to restore the stacks for deprivileged mode */
void hvm_deprivileged_restore_stacks(void);
+/* Check if we are in deprivileged mode */
+int is_hvm_deprivileged_vcpu(void);
+
/* The ring 3 code */
void hvm_deprivileged_ring3(void);
+/*
+ * Crash the domain. This should not be called if there are any memory
+ * allocations which will be freed by code following its invocation in the
+ * current execution context (current stack). This is because it causes a
+ * permanent 'context switch' and the current stack will be clobbered, so
+ * any allocations made which are not freed by other paths will leak.
+ * This function should only be used after deprivileged mode has been
+ * successfully switched into; otherwise, the normal domain_crash function
+ * should be used.
+ *
+ * The domain which is crashed is that of the current vcpu.
+ */
+void hvm_deprivileged_crash_domain(const char *reason);
+
+/*
+ * Call when inside a trap that should cause a domain crash if in user mode
+ * e.g. an invalid_op is trapped whilst in user mode.
+ */
+int hvm_deprivileged_check_trap(const char* func_name);
+
/* The segments where the user mode .text and .data are stored */
extern unsigned long __hvm_deprivileged_text_start[];
extern unsigned long __hvm_deprivileged_text_end[];
--
2.1.4
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH RFC v2 4/4] HVM x86 deprivileged mode: Watchdog for DoS prevention
2015-09-03 16:01 [PATCH RFC v2 0/4] HVM x86 deprivileged mode operations Ben Catterall
` (2 preceding siblings ...)
2015-09-03 16:01 ` [PATCH RFC v2 3/4] HVM x86 deprivileged mode: Trap handlers for " Ben Catterall
@ 2015-09-03 16:01 ` Ben Catterall
2015-09-03 16:15 ` [PATCH RFC v2 0/4] HVM x86 deprivileged mode operations David Vrabel
` (2 subsequent siblings)
6 siblings, 0 replies; 12+ messages in thread
From: Ben Catterall @ 2015-09-03 16:01 UTC (permalink / raw)
To: xen-devel
Cc: keir, suravee.suthikulpanit, george.dunlap, andrew.cooper3, tim,
Aravind.Gopalakrishnan, jbeulich, Ben.Catterall, boris.ostrovsky,
ian.campbell
A watchdog timer is used to prevent deprivileged mode from running for too long,
aimed at handling a bug or an attempted DoS. If the watchdog has fired more than
once whilst we have been in the same deprivileged mode context, then we crash
the domain. This limit can be raised in future for longer-running operations.
Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
---
xen/arch/x86/hvm/deprivileged.c | 17 +++++++++++++----
xen/arch/x86/nmi.c | 17 +++++++++++++++++
xen/include/xen/hvm/deprivileged.h | 1 +
3 files changed, 31 insertions(+), 4 deletions(-)
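For reference, here is a compact, self-contained sketch of the two-tick
policy described above. It is illustrative only and not part of the patch:
the demo_* names are invented stand-ins, and the real check is the hunk added
to nmi_watchdog_tick() below.

#include <stdbool.h>

struct demo_depriv_state {
    bool in_depriv_mode;           /* running deprivileged code right now? */
    unsigned long watchdog_count;  /* NMI ticks seen in this depriv context */
};

/*
 * Called on every watchdog NMI tick. Returns true when the domain should be
 * crashed: a second tick seen in the same deprivileged context guarantees
 * that at least one full tick period has elapsed, without having to record
 * entry times or do any time arithmetic.
 */
static bool demo_depriv_watchdog_tick(struct demo_depriv_state *s)
{
    if ( !s->in_depriv_mode )
        return false;              /* not our concern */

    if ( s->watchdog_count )
        return true;               /* second tick: crash the domain */

    s->watchdog_count = 1;         /* first tick: just note it */
    return false;
}

int main(void)
{
    struct demo_depriv_state s = { .in_depriv_mode = true };
    bool first  = demo_depriv_watchdog_tick(&s);  /* false */
    bool second = demo_depriv_watchdog_tick(&s);  /* true  */

    return (!first && second) ? 0 : 1;
}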
diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
index 01efbe1..85701c0 100644
--- a/xen/arch/x86/hvm/deprivileged.c
+++ b/xen/arch/x86/hvm/deprivileged.c
@@ -4,10 +4,11 @@
*/
#include <xen/lib.h>
#include <xen/mm.h>
+#include <xen/sched.h>
#include <xen/domain_page.h>
#include <xen/config.h>
#include <xen/types.h>
-#include <xen/sched.h>
+#include <xen/watchdog.h>
#include <asm/paging.h>
#include <xen/compiler.h>
#include <asm/hap.h>
@@ -17,6 +18,7 @@
#include <xen/domain_page.h>
#include <asm/hvm/vmx/vmx.h>
#include <xen/hvm/deprivileged.h>
+#include <xen/shutdown.h>
void hvm_deprivileged_init(struct domain *d, l4_pgentry_t *l4t_base)
{
@@ -219,7 +221,6 @@ int hvm_deprivileged_map_l4(struct domain *d,
* we preserve the access bits of any supervisor entries that are
* used in the leaf case.
*/
-
l3t_base = map_l3t_from_l4e(l4t_base[i]);
hvm_deprivileged_map_l3(d, l3t_base, src_start, dst_start,
@@ -309,7 +310,6 @@ int hvm_deprivileged_map_l3(struct domain *d,
* we preserve the access bits of any supervisor entries that are
* used in the leaf case.
*/
-
l2t_base = map_l2t_from_l3e(l3t_base[i]);
hvm_deprivileged_map_l2(d, l2t_base, src_start, dst_start,
@@ -389,7 +389,6 @@ int hvm_deprivileged_map_l2(struct domain *d,
{
panic("HVM: L2 Leaf page is already mapped\n");
}
-
/*
* We can try recursing on this and see if where we want to put our
* new pages is empty.
@@ -552,6 +551,16 @@ void hvm_deprivileged_user_mode(void)
vcpu->arch.hvm_vcpu.depriv_user_mode = 0;
vcpu->arch.hvm_vcpu.depriv_rsp = 0;
+ vcpu->arch.hvm_vcpu.depriv_watchdog_count = 0;
+
+ /*
+ * If we need to crash the domain at this point, we return up the call
+ * stack, undoing any allocations, and then the event testers in the exit
+ * assembly stubs will see the SOFTIRQ_TIMER event generated by
+ * domain_crash() and will crash the domain for us.
+ */
+ if( vcpu->arch.hvm_vcpu.depriv_destroy )
+ domain_crash(vcpu->domain);
}
/*
diff --git a/xen/arch/x86/nmi.c b/xen/arch/x86/nmi.c
index 2ab97a0..25817d2 100644
--- a/xen/arch/x86/nmi.c
+++ b/xen/arch/x86/nmi.c
@@ -26,6 +26,7 @@
#include <xen/smp.h>
#include <xen/keyhandler.h>
#include <xen/cpu.h>
+#include <xen/hvm/deprivileged.h>
#include <asm/current.h>
#include <asm/mc146818rtc.h>
#include <asm/msr.h>
@@ -463,9 +464,25 @@ int __init watchdog_setup(void)
/* Returns false if this was not a watchdog NMI, true otherwise */
bool_t nmi_watchdog_tick(const struct cpu_user_regs *regs)
{
+ struct vcpu *vcpu = current;
bool_t watchdog_tick = 1;
unsigned int sum = this_cpu(nmi_timer_ticks);
+ /*
+ * If the domain has been running in deprivileged mode for two watchdog
+ * ticks, then we kill it to prevent a DoS. We use two ticks as a coarse
+ * measure as this ensures that at least a full watchdog tick duration has
+ * occurred. This means that we do not need to track entry time and do
+ * time calculations.
+ */
+ if( is_hvm_deprivileged_vcpu() )
+ {
+ if( vcpu->arch.hvm_vcpu.depriv_watchdog_count )
+ hvm_deprivileged_crash_domain("HVM Deprivileged domain: Domain exceeded running time.");
+ else
+ vcpu->arch.hvm_vcpu.depriv_watchdog_count = 1;
+ }
+
if ( (this_cpu(last_irq_sums) == sum) && watchdog_enabled() )
{
/*
diff --git a/xen/include/xen/hvm/deprivileged.h b/xen/include/xen/hvm/deprivileged.h
index 9c08adf..9a7f109 100644
--- a/xen/include/xen/hvm/deprivileged.h
+++ b/xen/include/xen/hvm/deprivileged.h
@@ -74,6 +74,7 @@ int hvm_deprivileged_map_l1(struct domain *d,
unsigned int l1_flags,
unsigned int op);
+
/* Used to allocate a page for the deprivileged mode */
struct page_info *hvm_deprivileged_alloc_page(struct domain *d);
--
2.1.4
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH RFC v2 0/4] HVM x86 deprivileged mode operations
2015-09-03 16:01 [PATCH RFC v2 0/4] HVM x86 deprivileged mode operations Ben Catterall
` (3 preceding siblings ...)
2015-09-03 16:01 ` [PATCH RFC v2 4/4] HVM x86 deprivileged mode: Watchdog for DoS prevention Ben Catterall
@ 2015-09-03 16:15 ` David Vrabel
2015-09-07 10:50 ` Ben Catterall
2015-09-04 8:33 ` Jan Beulich
2015-09-04 10:46 ` Fabio Fantoni
6 siblings, 1 reply; 12+ messages in thread
From: David Vrabel @ 2015-09-03 16:15 UTC (permalink / raw)
To: Ben Catterall, xen-devel
Cc: keir, jbeulich, george.dunlap, andrew.cooper3, tim,
Aravind.Gopalakrishnan, suravee.suthikulpanit, boris.ostrovsky,
ian.campbell
On 03/09/15 17:01, Ben Catterall wrote:
>
> Intel Intel 2.2GHz Xeon E5-2407 0 processor:
> --------------------------------------------
> 1.55e-06 seconds was the average time for performing the write without the
> deprivileged code running.
>
> 5.75e-06 seconds was the average time for performing the write with the
> deprivileged code running.
>
> So approximately 351% overhead
>
> AMD Opteron 2376:
> -----------------
> 1.74e-06 seconds was the average time for performing the write without the
> deprivileged code running.
> 3.10e-06 seconds was the average time for performing the write with an entry and
> exit from deprvileged mode.
>
> So approximately 178% overhead.
How does this compare to the overhead of passing the I/O through to qemu?
David
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH RFC v2 0/4] HVM x86 deprivileged mode operations
2015-09-03 16:01 [PATCH RFC v2 0/4] HVM x86 deprivileged mode operations Ben Catterall
` (4 preceding siblings ...)
2015-09-03 16:15 ` [PATCH RFC v2 0/4] HVM x86 deprivileged mode operations David Vrabel
@ 2015-09-04 8:33 ` Jan Beulich
2015-09-04 9:16 ` Ian Campbell
2015-09-04 10:46 ` Fabio Fantoni
6 siblings, 1 reply; 12+ messages in thread
From: Jan Beulich @ 2015-09-04 8:33 UTC (permalink / raw)
To: Ben.Catterall
Cc: keir, ian.campbell, george.dunlap, andrew.cooper3, tim,
Aravind.Gopalakrishnan, suravee.suthikulpanit, xen-devel,
boris.ostrovsky
>>> On 03.09.15 at 18:01, <Ben.Catterall@citrix.com> wrote:
> I performed 100000 writes to a single I/O port on an Intel 2.2GHz Xeon
> E5-2407 0 processor and an AMD Opteron 2376. This was done from a python
> script
> within the HVM guest using time.time() and running Debian Jessie. Each write
> was
> trapped to cause a vmexit and the time for each write was calculated. The
> port
> operation is bypassed so that no portio is actually performed. Thus, the
> differences in the measurements below can be taken as the pure overhead.
> These
> experiments were repeated. Note that only the host and this HVM guest were
> running (both Debian Jessie) during the experiments.
>
> Intel Intel 2.2GHz Xeon E5-2407 0 processor:
> --------------------------------------------
> 1.55e-06 seconds was the average time for performing the write without the
> deprivileged code running.
>
> 5.75e-06 seconds was the average time for performing the write with the
> deprivileged code running.
>
> So approximately 351% overhead
>
> AMD Opteron 2376:
> -----------------
> 1.74e-06 seconds was the average time for performing the write without the
> deprivileged code running.
> 3.10e-06 seconds was the average time for performing the write with an entry
> and
> exit from deprvileged mode.
>
> So approximately 178% overhead.
Just like said for v1: Determining a percentage of overhead is
pretty meaningless when the actual operation (the I/O port
access) can take significantly varying amount of time depending
on which I/O port is being accessed. In particular, considering
the built in devices emulation of which you want to move out,
the majority shouldn't actually be doing any accesses to ports
or MMIO, but just act on RAM. Which hence may take quite a
bit less than the roughly 1.5us you use as the base line, in turn
likely resulting in quite a bit higher relative overhead.
That said - even the 350% you determined above look
prohibitive to me.
Jan
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH RFC v2 0/4] HVM x86 deprivileged mode operations
2015-09-04 8:33 ` Jan Beulich
@ 2015-09-04 9:16 ` Ian Campbell
2015-09-04 9:31 ` Jan Beulich
0 siblings, 1 reply; 12+ messages in thread
From: Ian Campbell @ 2015-09-04 9:16 UTC (permalink / raw)
To: Jan Beulich, Ben.Catterall
Cc: keir, george.dunlap, andrew.cooper3, tim, Aravind.Gopalakrishnan,
suravee.suthikulpanit, xen-devel, boris.ostrovsky
On Fri, 2015-09-04 at 02:33 -0600, Jan Beulich wrote:
> >
> > > > On 03.09.15 at 18:01, <Ben.Catterall@citrix.com> wrote:
> > I performed 100000 writes to a single I/O port on an Intel 2.2GHz Xeon
> > E5-2407 0 processor and an AMD Opteron 2376. This was done from a
> > python
> > script
> > within the HVM guest using time.time() and running Debian Jessie. Each
> > write
> > was
> > trapped to cause a vmexit and the time for each write was calculated.
> > The
> > port
> > operation is bypassed so that no portio is actually performed. Thus,
> > the
> > differences in the measurements below can be taken as the pure
> > overhead.
> > These
> > experiments were repeated. Note that only the host and this HVM guest
> > were
> > running (both Debian Jessie) during the experiments.
> >
> > Intel Intel 2.2GHz Xeon E5-2407 0 processor:
> > --------------------------------------------
> > 1.55e-06 seconds was the average time for performing the write without
> > the
> > deprivileged code running.
> >
> > 5.75e-06 seconds was the average time for performing the write with the
> > deprivileged code running.
> >
> > So approximately 351% overhead
> >
> > AMD Opteron 2376:
> > -----------------
> > 1.74e-06 seconds was the average time for performing the write without
> > the
> > deprivileged code running.
> > 3.10e-06 seconds was the average time for performing the write with an
> > entry
> > and
> > exit from deprvileged mode.
> >
> > So approximately 178% overhead.
>
> Just like said for v1: Determining a percentage of overhead is
> pretty meaningless when the actual operation (the I/O port
> access) can take significantly varying amount of time depending
> on which I/O port is being accessed. In particular, considering
> the built in devices emulation of which you want to move out,
> the majority shouldn't actually be doing any accesses to ports
> or MMIO, but just act on RAM. Which hence may take quite a
> bit less than the roughly 1.5us you use as the base line, in turn
> likely resulting in quite a bit higher relative overhead.
Ben says "no port io is actually performed", so I think the 1.5us is purely
the overhead of emulating an I/O access as a NOP.
>
> That said - even the 350% you determined above look
> prohibitive to me.
>
> Jan
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH RFC v2 0/4] HVM x86 deprivileged mode operations
2015-09-04 9:16 ` Ian Campbell
@ 2015-09-04 9:31 ` Jan Beulich
0 siblings, 0 replies; 12+ messages in thread
From: Jan Beulich @ 2015-09-04 9:31 UTC (permalink / raw)
To: Ben.Catterall, Ian Campbell
Cc: keir, george.dunlap, andrew.cooper3, tim, Aravind.Gopalakrishnan,
suravee.suthikulpanit, xen-devel, boris.ostrovsky
>>> On 04.09.15 at 11:16, <ian.campbell@citrix.com> wrote:
> On Fri, 2015-09-04 at 02:33 -0600, Jan Beulich wrote:
>> >
>> > > > On 03.09.15 at 18:01, <Ben.Catterall@citrix.com> wrote:
>> > I performed 100000 writes to a single I/O port on an Intel 2.2GHz Xeon
>> > E5-2407 0 processor and an AMD Opteron 2376. This was done from a
>> > python
>> > script
>> > within the HVM guest using time.time() and running Debian Jessie. Each
>> > write
>> > was
>> > trapped to cause a vmexit and the time for each write was calculated.
>> > The
>> > port
>> > operation is bypassed so that no portio is actually performed. Thus,
>> > the
>> > differences in the measurements below can be taken as the pure
>> > overhead.
>> > These
>> > experiments were repeated. Note that only the host and this HVM guest
>> > were
>> > running (both Debian Jessie) during the experiments.
>> >
>> > Intel Intel 2.2GHz Xeon E5-2407 0 processor:
>> > --------------------------------------------
>> > 1.55e-06 seconds was the average time for performing the write without
>> > the
>> > deprivileged code running.
>> >
>> > 5.75e-06 seconds was the average time for performing the write with the
>> > deprivileged code running.
>> >
>> > So approximately 351% overhead
>> >
>> > AMD Opteron 2376:
>> > -----------------
>> > 1.74e-06 seconds was the average time for performing the write without
>> > the
>> > deprivileged code running.
>> > 3.10e-06 seconds was the average time for performing the write with an
>> > entry
>> > and
>> > exit from deprvileged mode.
>> >
>> > So approximately 178% overhead.
>>
>> Just like said for v1: Determining a percentage of overhead is
>> pretty meaningless when the actual operation (the I/O port
>> access) can take significantly varying amount of time depending
>> on which I/O port is being accessed. In particular, considering
>> the built in devices emulation of which you want to move out,
>> the majority shouldn't actually be doing any accesses to ports
>> or MMIO, but just act on RAM. Which hence may take quite a
>> bit less than the roughly 1.5us you use as the base line, in turn
>> likely resulting in quite a bit higher relative overhead.
>
> Ben says "no port io is actually performed", so I think the 1.5us is purely
> the overhead of emulating an I/O access as a NOP.
Oh, I see - I didn't pay close enough attention and judged only
from "I performed 100000 writes to a single I/O port ...". I'm
sorry.
Otoh - 1.5us seems quite a long time if there's no actual port
access...
Jan
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH RFC v2 0/4] HVM x86 deprivileged mode operations
2015-09-03 16:01 [PATCH RFC v2 0/4] HVM x86 deprivileged mode operations Ben Catterall
` (5 preceding siblings ...)
2015-09-04 8:33 ` Jan Beulich
@ 2015-09-04 10:46 ` Fabio Fantoni
2015-09-08 10:58 ` Ben Catterall
6 siblings, 1 reply; 12+ messages in thread
From: Fabio Fantoni @ 2015-09-04 10:46 UTC (permalink / raw)
To: Ben Catterall, xen-devel
Cc: keir, jbeulich, george.dunlap, andrew.cooper3, tim,
Aravind.Gopalakrishnan, suravee.suthikulpanit, boris.ostrovsky,
ian.campbell
Il 03/09/2015 18:01, Ben Catterall ha scritto:
> Hi all,
>
> I have made requested changes and reworked the patch series based on the
> comments recieved. Thank you to all of the contributors to those discussions!
> The next step will be to provide an example of usage of this system which
> will follow in another patch.
>
> The main changes from v1 are:
> - No longer copying the privileged Xen stack but instead change the
> interrupt/exception, syscall and sysenter pointers to be below the current
> execution point.
> - AMD SVM support
> - Stop IST copying onto the privileged stack
> - Watchdog timer to kill a long running deprivileged domain
> - Support for crashing a domain whilst performing a deprivileged operation
> - .text section is now aliased
> - Assembly updates
> - Updated deprivileged context switching code to fix bugs
> - Moved processor-specific code to processor-specific files
> - Reduction of user stack sizes
> - Updates to interfaces and an is_hvm_deprivileged_mode() style test
> - Small bug fixes
> - Revised performance tests
>
> Many thanks in advance,
> Ben
>
> The aim of this work is to create a proof-of-concept to establish if it is
> feasible to move certain Xen operations into a deprivileged context to mitigate
> the impact of a bug or compromise in such areas. An example would be x86_emulate
> or virtual device emulation which is not done in QEMU for performance reasons.
Sorry for my stupid questions:
Is there a benchmark using qemu instead, to see how the two compare? Qemu
seems to emulate some instruction cases that the xen hypervisor doesn't for
now, or am I wrong?
Is there any hardware technology or set of instructions that could improve
these deprivileged operations or the transition, or is Xen obliged to
control even the memory access mappings?
Is there any possible future hardware technology or set of instructions
that could take the needed information from the hypervisor and execute all
the needed checks directly, with any exceptions/protections, or must this
be delegated to xen for each instruction, with a tremendous impact on
efficiency that cannot be improved?
If I have said only absurd things because my knowledge is too low, sorry
for having wasted your time.
Thanks for any reply and sorry for my bad english.
>
> Performance testing
> -------------------
> Performance testing indicates that the overhead for this deprivileged mode
> depend heavily upon the processor. This overhead is the cost of moving into
> deprivileged mode and then fully back out of deprivileged mode. The conclusions
> are that the overheads are not negligible and that operations using this
> mechanism would benefit from being long running or be high risk components. It
> will need to be evaluated on a case-by-case basis.
>
> I performed 100000 writes to a single I/O port on an Intel 2.2GHz Xeon
> E5-2407 0 processor and an AMD Opteron 2376. This was done from a python script
> within the HVM guest using time.time() and running Debian Jessie. Each write was
> trapped to cause a vmexit and the time for each write was calculated. The port
> operation is bypassed so that no portio is actually performed. Thus, the
> differences in the measurements below can be taken as the pure overhead. These
> experiments were repeated. Note that only the host and this HVM guest were
> running (both Debian Jessie) during the experiments.
>
> Intel Intel 2.2GHz Xeon E5-2407 0 processor:
> --------------------------------------------
> 1.55e-06 seconds was the average time for performing the write without the
> deprivileged code running.
>
> 5.75e-06 seconds was the average time for performing the write with the
> deprivileged code running.
>
> So approximately 351% overhead
>
> AMD Opteron 2376:
> -----------------
> 1.74e-06 seconds was the average time for performing the write without the
> deprivileged code running.
> 3.10e-06 seconds was the average time for performing the write with an entry and
> exit from deprvileged mode.
>
> So approximately 178% overhead.
>
> Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH RFC v2 0/4] HVM x86 deprivileged mode operations
2015-09-03 16:15 ` [PATCH RFC v2 0/4] HVM x86 deprivileged mode operations David Vrabel
@ 2015-09-07 10:50 ` Ben Catterall
0 siblings, 0 replies; 12+ messages in thread
From: Ben Catterall @ 2015-09-07 10:50 UTC (permalink / raw)
To: David Vrabel, xen-devel
Cc: keir, jbeulich, george.dunlap, andrew.cooper3, tim,
Aravind.Gopalakrishnan, suravee.suthikulpanit, boris.ostrovsky,
ian.campbell
On 03/09/15 17:15, David Vrabel wrote:
> On 03/09/15 17:01, Ben Catterall wrote:
>>
>> Intel Intel 2.2GHz Xeon E5-2407 0 processor:
>> --------------------------------------------
>> 1.55e-06 seconds was the average time for performing the write without the
>> deprivileged code running.
>>
>> 5.75e-06 seconds was the average time for performing the write with the
>> deprivileged code running.
>>
>> So approximately 351% overhead
>>
>> AMD Opteron 2376:
>> -----------------
>> 1.74e-06 seconds was the average time for performing the write without the
>> deprivileged code running.
>> 3.10e-06 seconds was the average time for performing the write with an entry and
>> exit from deprvileged mode.
>>
>> So approximately 178% overhead.
>
> How does this compare to the overhead of passing the I/O through to qemu?
>
So, passing this portio op through to QEMU takes roughly 20e-6 seconds.
However, I don't know if the emulator would have gone and prodded a
physical port as part of that, which could skew the result. I'll look
into this and get back if I find anything to clarify this.
Ben
> David
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH RFC v2 0/4] HVM x86 deprivileged mode operations
2015-09-04 10:46 ` Fabio Fantoni
@ 2015-09-08 10:58 ` Ben Catterall
0 siblings, 0 replies; 12+ messages in thread
From: Ben Catterall @ 2015-09-08 10:58 UTC (permalink / raw)
To: Fabio Fantoni, xen-devel
Cc: keir, jbeulich, george.dunlap, andrew.cooper3, tim,
Aravind.Gopalakrishnan, suravee.suthikulpanit, boris.ostrovsky,
ian.campbell
Hi Fabio,
On 04/09/15 11:46, Fabio Fantoni wrote:
[snip]
>
> Sorry for my stupid questions:
> Is there a benchmark using qemu instead, to see how the two compare? Qemu
> seems to emulate some instruction cases that the xen hypervisor doesn't for
> now, or am I wrong?
>
So, QEMU emulates devices for HVM guests. Now, letting the portio
operation through to QEMU to emulate takes about 20e-6 seconds. But,
that includes the time QEMU takes to actually emulate the port operation
so is not the 'pure' overhead. I need to do more detailed analysis to
get that figure.
> Is there any hardware technology or set of instructions that could improve
> these deprivileged operations or the transition, or is Xen obliged to
> control even the memory access mappings?
We're already using syscall and sysret, the fast system call instructions,
to do the transition. I don't have actual benchmark values
for their execution time though. We map the depriv code, stack and data
sections into the monitor table when initialising the HVM guest (user
mode mapping) so Xen doesn't need to worry about those mappings whilst
executing a depriv operation.
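To make that flow concrete, here is a rough, self-contained sketch of the
lifecycle just described. Everything in it is an invented stand-in (the
demo_* helpers and the printfs); in the series the real entry points are
hvm_deprivileged_user_mode() and hvm_deprivileged_finish_user_mode(), and
the actual ring switch is done with syscall/sysret in the assembly stubs.

#include <stdio.h>

/* Done once per HVM guest at initialisation: alias the depriv .text and map
 * the depriv data and stack at their user-mode addresses in the monitor
 * table, so nothing needs remapping per operation. */
static void demo_map_depriv_sections(void)
{
    printf("depriv .text/.data/stack mapped into the monitor table\n");
}

/* Stand-in for the entry path: flag the vcpu as being in user mode, switch
 * stacks and sysret to ring 3 at the deprivileged entry point. */
static void demo_enter_depriv_mode(void)
{
    printf("sysret to ring 3\n");
}

/* Stand-in for the return path: syscall back to ring 0, clear the user-mode
 * flag and restore the privileged stack pointer. */
static void demo_exit_depriv_mode(void)
{
    printf("syscall back to ring 0\n");
}

static void demo_run_depriv_operation(void)
{
    demo_enter_depriv_mode();
    /* ... the deprivileged work itself runs here, in ring 3 ... */
    demo_exit_depriv_mode();
}

int main(void)
{
    demo_map_depriv_sections();
    demo_run_depriv_operation();
    return 0;
}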
> Is there any possible future hardware technology or set of instructions
> that could take the needed information from the hypervisor and execute all
> the needed checks directly, with any exceptions/protections, or must this
> be delegated to xen for each instruction, with a tremendous impact on
> efficiency that cannot be improved?
I'm not quite sure what you're asking here, sorry! Are you asking if we
can take an HVM guest instruction, analyse it to determine if it's safe
to execute and then execute it rather than emulating it? If so:
QEMU handles device emulation and this is deliberately not done in Xen
to reduce the attack surface of the hypervisor and keep it minimal. We
do need to analyse instructions at some points (x86 emulate) but this is
error prone (there's a paper or two on exploits of this feature). This
is one of the reasons for considering a depriv mode in the first place:
by moving such code into a deprivileged area, we can prevent a bug in
this code from leading to hypervisor compromise. I'm not aware of any
future hardware or set of instructions but that doesn't mean there
aren't/won't be!
> If I have said only absurd things because my knowledge is too low, sorry
> for having wasted your time.
>
> Thanks for any reply and sorry for my bad english.
np, I hope I've understood correctly!
>
>>
>> Performance testing
>> -------------------
>> Performance testing indicates that the overhead for this deprivileged
>> mode
>> depend heavily upon the processor. This overhead is the cost of moving
>> into
>> deprivileged mode and then fully back out of deprivileged mode. The
>> conclusions
>> are that the overheads are not negligible and that operations using this
>> mechanism would benefit from being long running or be high risk
>> components. It
>> will need to be evaluated on a case-by-case basis.
>>
>> I performed 100000 writes to a single I/O port on an Intel 2.2GHz Xeon
>> E5-2407 0 processor and an AMD Opteron 2376. This was done from a
>> python script
>> within the HVM guest using time.time() and running Debian Jessie. Each
>> write was
>> trapped to cause a vmexit and the time for each write was calculated.
>> The port
>> operation is bypassed so that no portio is actually performed. Thus, the
>> differences in the measurements below can be taken as the pure
>> overhead. These
>> experiments were repeated. Note that only the host and this HVM guest
>> were
>> running (both Debian Jessie) during the experiments.
>>
>> Intel Intel 2.2GHz Xeon E5-2407 0 processor:
>> --------------------------------------------
>> 1.55e-06 seconds was the average time for performing the write without
>> the
>> deprivileged code running.
>>
>> 5.75e-06 seconds was the average time for performing the write with the
>> deprivileged code running.
>>
>> So approximately 351% overhead
>>
>> AMD Opteron 2376:
>> -----------------
>> 1.74e-06 seconds was the average time for performing the write without
>> the
>> deprivileged code running.
>> 3.10e-06 seconds was the average time for performing the write with an
>> entry and
>> exit from deprvileged mode.
>>
>> So approximately 178% overhead.
>>
>> Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
>>
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xen.org
>> http://lists.xen.org/xen-devel
>
^ permalink raw reply [flat|nested] 12+ messages in thread
Thread overview: 12+ messages
2015-09-03 16:01 [PATCH RFC v2 0/4] HVM x86 deprivileged mode operations Ben Catterall
2015-09-03 16:01 ` [PATCH RFC v2 1/4] HVM x86 deprivileged mode: Create deprivileged page tables Ben Catterall
2015-09-03 16:01 ` [PATCH RFC v2 2/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode Ben Catterall
2015-09-03 16:01 ` [PATCH RFC v2 3/4] HVM x86 deprivileged mode: Trap handlers for " Ben Catterall
2015-09-03 16:01 ` [PATCH RFC v2 4/4] HVM x86 deprivileged mode: Watchdog for DoS prevention Ben Catterall
2015-09-03 16:15 ` [PATCH RFC v2 0/4] HVM x86 deprivileged mode operations David Vrabel
2015-09-07 10:50 ` Ben Catterall
2015-09-04 8:33 ` Jan Beulich
2015-09-04 9:16 ` Ian Campbell
2015-09-04 9:31 ` Jan Beulich
2015-09-04 10:46 ` Fabio Fantoni
2015-09-08 10:58 ` Ben Catterall