* [PATCH RFC v3 0/6] HVM x86 deprivileged mode summary
@ 2015-09-11 16:08 Ben Catterall
2015-09-11 16:08 ` [PATCH RFC v3 1/6] HVM x86 deprivileged mode: Create deprivileged page tables Ben Catterall
From: Ben Catterall @ 2015-09-11 16:08 UTC (permalink / raw)
To: xen-devel
Cc: keir, ian.campbell, george.dunlap, andrew.cooper3, tim, jbeulich,
Aravind.Gopalakrishnan, suravee.suthikulpanit, boris.ostrovsky
Hi all,
I have now finished my internship at Citrix and am posting this final version of
my RFC series. I would like to express my thanks to all of those who have taken
the time to review, comment and discuss this series, as well as to my colleagues
who have provided excellent guidance and help. I have learned a great deal and
have greatly enjoyed working with all of you. Thank you.
Hopefully the series will be beneficial. I believe it has shown that a
deprivileged mode in Xen is a viable option, provided the performance impact
versus the security gain is weighed carefully on a case-by-case basis. The end
of this series contains an example of moving part of the vpic into deprivileged
mode, which has allowed me to test and verify that the feature works. Some
enhancements and clean-up are still needed but, after that, the feature could
be applied to the HVM device models currently found in Xen, such as the vpic.
Patches one to four are (hopefully) now fairly stable. Patch 5, new in this
series, adds the system call and the deprivileged dispatch mechanism. Patch 6
is also new: it demonstrates using the feature for the vpic and has mainly been
used to test and exercise it.
As this series is an RFC, there are some debug printks which should be removed
when/if it leaves RFC. They are useful for tracking down the known issue below,
so I have left them in until that is resolved.
There are some efficiency savings that can be made, and an instance of a
general issue (detailed below) that will need to be addressed.
Many thanks once again,
Ben
TODOs
-----
There is a set of TODOs in this patch series: some issues in the later patches
which need addressing, and some other considerations, summarised here.
Patch 1:
- Consider an efficiency saving in hvm_deprivileged_map_* by mapping in larger
pages. See the TODO at the top of the L4 version of this method.
Patch 2:
- The deprivileged mode context switch is now much more heavyweight, as
testing on AMD SVM showed this to be necessary. However, the FPU is currently
also saved, which may not be needed. Some consideration is required to work
out whether this can be cut down further.
Patch 4:
- The watchdog timer is currently hooked to kill deprivileged mode operations
that run for too long; the limit is hardcoded to at least one watchdog tick
and at most two. This may want refining.
Patch 5:
- Alias data for deprivileged mode. There is a large comment at the top of
deprivileged_syscall.c which outlines considerations.
- Check if we need to map_domain_page the pages when we do the copy in
hvm_deprivileged_copy_data{to/from}
- Check for unsigned integer wrapping on addition in
hvm_deprivileged_copy_data_{to/from}
- Move hvm_deprivileged_syscall into the syscall macro. It's a stub and
unless extra code is needed there it can be folded into the macro.
- Check maintainers' thoughts on the deprivileged mode function checks in
hvm_deprivileged_user_mode. See the TODO comment.
Patches 5 & 6:
- Fix/work around the GCC switch statement issue.
KNOWN ISSUES
------------
- Page fault in vpic_ioport_write, caused by GCC placing the jump table for a
switch statement in .rodata, which lies in the privileged mode area.
This has been traced to the first of the switch statements in the function,
though other switches in that function may also be affected.
Compiled using GCC 4.9.2-10.
You can get the offset into this function with:
(RIP - (depriv_vpic_ioport_write - __hvm_deprivileged_text_start))
It appears to be a built-in default of GCC to put switch jump tables in
.rodata or .text, and there does not appear to be a way to change this
(except by patching the compiler, though hopefully there _is_ another
option I just haven't been able to find...). Note that GCC will not
necessarily generate a jump table for every switch statement; it appears to
depend on a number of factors such as the optimiser, the number of cases,
the type of the cases, the compiler version, etc.
Thus, when we relocate a deprivileged function containing a switch statement
for which GCC has created a jump table, we take a page fault. This is because
we have not mapped in the .rodata section, deliberately so (depriv should not
have access to it).
A workaround would be to patch the generated assembly so that the table is
moved into a .hvm_deprivileged.rodata section, by adding
.section .hvm_deprivileged.rodata directives around the generated table. We
can then relocate this along with the other depriv sections; a minimal sketch
is shown below.
Note that GCC uses RIP-relative addressing for the table, so the offset from
the depriv .rodata to the depriv .text segment will need to be the same when
it is mapped in.
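As an illustration only, here is a hand-written sketch of the kind of patched
assembly this workaround would produce (the function, labels and cases are made
up for the example; only the section directives and the RIP-relative table
lookup reflect the actual approach, and the linker script/relocation code would
also need to collect a .hvm_deprivileged.rodata input section):

    /* depriv_switch_sketch.S - hypothetical example, not part of the series */
        .section .hvm_deprivileged_enhancement.text, "ax"
        .globl  depriv_switch_demo
    depriv_switch_demo:                   /* %rdi = case index, 0..3 */
        cmpq    $3, %rdi
        ja      .Ldefault
        leaq    .Ljump_table(%rip), %rax  /* RIP-relative: the table must keep */
        movslq  (%rax,%rdi,4), %rdx       /* its offset from .text when mapped */
        addq    %rax, %rdx
        jmpq    *%rdx
    .Lcase0: movq $0, %rax
        ret
    .Lcase1: movq $1, %rax
        ret
    .Lcase2: movq $2, %rax
        ret
    .Lcase3: movq $3, %rax
        ret
    .Ldefault:
        movq    $-1, %rax
        ret
        .previous

        /* The actual patch: the jump table goes into a depriv rodata section */
        /* rather than the privileged .rodata that GCC picks by default.      */
        .section .hvm_deprivileged.rodata, "a"
    .Ljump_table:
        .long   .Lcase0 - .Ljump_table
        .long   .Lcase1 - .Ljump_table
        .long   .Lcase2 - .Ljump_table
        .long   .Lcase3 - .Ljump_table
        .previous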
* [PATCH RFC v3 1/6] HVM x86 deprivileged mode: Create deprivileged page tables
2015-09-11 16:08 [PATCH RFC v3 0/6] HVM x86 deprivileged mode summary Ben Catterall
@ 2015-09-11 16:08 ` Ben Catterall
2015-09-11 16:08 ` [PATCH RFC v3 2/6] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode Ben Catterall
From: Ben Catterall @ 2015-09-11 16:08 UTC (permalink / raw)
To: xen-devel
Cc: keir, ian.campbell, george.dunlap, andrew.cooper3, tim, jbeulich,
Aravind.Gopalakrishnan, suravee.suthikulpanit, Ben Catterall,
boris.ostrovsky
The paging structure mappings for deprivileged mode are added to the monitor
page table of HVM guests, for both HAP and shadow paging. The entries are
generated by walking the page tables and mapping in new pages, flipping access
bits as needed.
Page entries are generated for the deprivileged .text and .data sections and a
stack. The .text section is allocated only once, at HVM domain initialisation,
and aliased from then onwards. The .data section is copied from the sections
laid out by the linker. The mappings are set up in an unused portion of the Xen
virtual address space. The pages are mapped in as user-mode accessible, with
the NX bit set for the data and stack regions, while the code region is
executable and read-only; the regions are summarised in the sketch below.
The needed pages are allocated from the paging heap and are deallocated when
those heap pages are freed (on domain destruction).
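For orientation, here is a small summary sketch (not additional code) of the
three regions and the flags they end up mapped with; the constants are the ones
introduced by this patch, and the struct is purely illustrative:

    /* Illustrative only: mirrors what hvm_deprivileged_init() below does. */
    struct depriv_region_sketch {
        unsigned long dst;        /* deprivileged virtual address            */
        unsigned int  l1_flags;   /* extra leaf flags on top of USER|PRESENT */
        unsigned int  op;         /* alias (.text) or copy (.data, stack)    */
    };

    static const struct depriv_region_sketch depriv_regions[] = {
        { HVM_DEPRIVILEGED_TEXT_ADDR,  0 /* read-only, exec */, HVM_DEPRIV_ALIAS },
        { HVM_DEPRIVILEGED_DATA_ADDR,  _PAGE_NX | _PAGE_RW,     HVM_DEPRIV_COPY  },
        { HVM_DEPRIVILEGED_STACK_ADDR, _PAGE_NX | _PAGE_RW,     HVM_DEPRIV_COPY  },
    };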
Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
Changes since v1
----------------
* .text section is now aliased when needed
* Reduced user stack size to two pages
* Changed allocator used for pages
* Changed types to using __hvm_$foo[] for linker variables
* Moved some #define's to page.h
* Small bug fix: Testing global bit on L3 not relevant
Changes since v2:
-----------------
* Bug fix: Pass return value back through page table generation code
* Coding style: Added space before if, for, etc.
---
xen/arch/x86/hvm/Makefile | 1 +
xen/arch/x86/hvm/deprivileged.c | 538 +++++++++++++++++++++++++++++++++++++
xen/arch/x86/mm/hap/hap.c | 8 +
xen/arch/x86/mm/shadow/multi.c | 8 +
xen/arch/x86/xen.lds.S | 19 ++
xen/include/asm-x86/config.h | 29 +-
xen/include/asm-x86/x86_64/page.h | 15 ++
xen/include/xen/hvm/deprivileged.h | 95 +++++++
xen/include/xen/sched.h | 4 +
9 files changed, 710 insertions(+), 7 deletions(-)
create mode 100644 xen/arch/x86/hvm/deprivileged.c
create mode 100644 xen/include/xen/hvm/deprivileged.h
diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
index 794e793..df5ebb8 100644
--- a/xen/arch/x86/hvm/Makefile
+++ b/xen/arch/x86/hvm/Makefile
@@ -2,6 +2,7 @@ subdir-y += svm
subdir-y += vmx
obj-y += asid.o
+obj-y += deprivileged.o
obj-y += emulate.o
obj-y += event.o
obj-y += hpet.o
diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
new file mode 100644
index 0000000..0075523
--- /dev/null
+++ b/xen/arch/x86/hvm/deprivileged.c
@@ -0,0 +1,538 @@
+/*
+ * HVM deprivileged mode to provide support for running operations in
+ * user mode from Xen
+ */
+#include <xen/lib.h>
+#include <xen/mm.h>
+#include <xen/domain_page.h>
+#include <xen/config.h>
+#include <xen/types.h>
+#include <xen/sched.h>
+#include <asm/paging.h>
+#include <xen/compiler.h>
+#include <asm/hap.h>
+#include <asm/paging.h>
+#include <asm-x86/page.h>
+#include <public/domctl.h>
+#include <xen/domain_page.h>
+#include <asm/hvm/vmx/vmx.h>
+#include <xen/hvm/deprivileged.h>
+
+void hvm_deprivileged_init(struct domain *d, l4_pgentry_t *l4t_base)
+{
+ void *p;
+ unsigned long size;
+ unsigned int l4t_idx_code = l4_table_offset(HVM_DEPRIVILEGED_TEXT_ADDR);
+ int ret;
+
+ /* If there is already an entry here */
+ ASSERT(!l4e_get_intpte(l4t_base[l4t_idx_code]));
+
+ /*
+ * We alias the .text segment for deprivileged mode to save memory.
+ * Additionally, to save allocating page tables for each vcpu's deprivileged
+ * mode .text segment, we reuse them.
+ *
+ * If we have not already created a mapping (valid_l4e_code is false) then
+ * we create one and generate the page tables. To save doing this for each
+ * vcpu, if we already have a set of valid page tables then we reuse them.
+ * So, if we have the page tables and there is no entry at the desired PML4
+ * slot, then we can just reuse those page tables.
+ *
+ * The mappings are per-domain as we use the domain's page pool memory
+ * allocator for the new page structure and page frame pages.
+ */
+ if ( !d->hvm_depriv_valid_l4e_code )
+ {
+ /*
+ * Build the alias mappings for the .text segment for deprivileged code
+ *
+ * NOTE: If there are other pages here, then this method will map around
+ * them. Which means that any future alias will use this mapping. If the
+ * HVM depriv section no longer has a unique PML4 entry in the Xen
+ * memory map, this will need to be accounted for.
+ */
+ size = (unsigned long)__hvm_deprivileged_text_end -
+ (unsigned long)__hvm_deprivileged_text_start;
+
+ ret = hvm_deprivileged_map_l4(d, l4t_base,
+ (unsigned long)__hvm_deprivileged_text_start,
+ (unsigned long)HVM_DEPRIVILEGED_TEXT_ADDR,
+ size, 0 /* No write */, HVM_DEPRIV_ALIAS);
+
+ if ( ret )
+ {
+ printk(XENLOG_ERR "HVM: Error when initialising depriv .text. Code: %d",
+ ret);
+
+ domain_crash(d);
+ return;
+ }
+
+ d->hvm_depriv_l4e_code = l4t_base[l4t_idx_code];
+ d->hvm_depriv_valid_l4e_code = 1;
+ }
+ else
+ {
+ /* Just copy the PML4 entry across */
+ l4t_base[l4t_idx_code] = d->hvm_depriv_l4e_code;
+ }
+
+ /*
+ * Copy the .data segment for deprivileged mode code. Add in some extra
+ * space to use for passing data between depriv and privileged modes
+ */
+ size = HVM_DEPRIV_DATA_SECTION_SIZE;
+
+ ret = hvm_deprivileged_map_l4(d, l4t_base,
+ (unsigned long)__hvm_deprivileged_data_start,
+ (unsigned long)HVM_DEPRIVILEGED_DATA_ADDR,
+ size, _PAGE_NX | _PAGE_RW, HVM_DEPRIV_COPY);
+
+ if ( ret )
+ {
+ printk(XENLOG_ERR "HVM: Error when initialising depriv .data. Code: %d",
+ ret);
+ domain_crash(d);
+ return;
+ }
+
+ /*
+ * THIS IS A BIT OF A HACK...
+ * Setup the deprivileged mode stack mappings. By allocating a blank area
+ * we can reuse hvm_deprivileged_map_l4.
+ */
+ size = HVM_DEPRIV_STACK_SIZE;
+
+ p = alloc_xenheap_pages(HVM_DEPRIV_STACK_ORDER, 0);
+ if ( p == NULL )
+ {
+ printk(XENLOG_ERR "HVM: Out of memory on deprivileged mode stack init.\n");
+ domain_crash(d);
+ return;
+ }
+
+ ret = hvm_deprivileged_map_l4(d, l4t_base,
+ (unsigned long)p,
+ (unsigned long)HVM_DEPRIVILEGED_STACK_ADDR,
+ size, _PAGE_NX | _PAGE_RW, HVM_DEPRIV_COPY);
+
+ free_xenheap_pages(p, HVM_DEPRIV_STACK_ORDER);
+
+ if ( ret )
+ {
+ printk(XENLOG_ERR "HVM: Error when initialising depriv stack. Code: %d",
+ ret);
+ domain_crash(d);
+ return;
+ }
+}
+
+void hvm_deprivileged_destroy(struct domain *d)
+{
+
+}
+
+/*
+ * Create a copy or alias of the data at the specified virtual address. The
+ * page table hierarchy is walked and new levels are created if needed.
+ *
+ * If we find a leaf entry in a page table (one which holds the
+ * mfn of a 4KB, 2MB, etc. page frame) which has already been
+ * mapped in, then we bail as we have a collision and this likely
+ * means a bug or the memory configuration has been changed.
+ *
+ * Pages have PAGE_USER, PAGE_GLOBAL (if supported) and PAGE_PRESENT set by
+ * default. The extra l1_flags are used for extra control e.g. PAGE_RW.
+ * The PAGE_RW flag will be enabled for all page structure entries
+ * above the leaf page if that leaf page has PAGE_RW set. This is needed to
+ * permit the writes on the leaf pages. See the Intel manual 3A section 4.6.
+ *
+ * TODO: We proceed down to L1 4KB pages and then map these in. We should
+ * stop the recursion on L3/L2 for a 1GB or 2MB page which would mean faster
+ * page access. When we stop would depend on size (e.g. use 2MB pages for a
+ * few MBs). We'll need to be careful though about aliasing such large pages,
+ * as those pages would then need to be aligned to these larger sizes,
+ * otherwise we'd share extra data via the alias.
+ */
+int hvm_deprivileged_map_l4(struct domain *d,
+ l4_pgentry_t *l4t_base,
+ unsigned long src_start,
+ unsigned long dst_start,
+ unsigned long size,
+ unsigned int l1_flags,
+ unsigned int op)
+{
+ struct page_info *l3t_pg; /* the destination page */
+ l3_pgentry_t *l3t_base;
+ unsigned long l4t_idx_dst_start;
+ unsigned long l4t_idx_dst_end;
+ unsigned long flags = _PAGE_USER | _PAGE_PRESENT;
+ unsigned int i;
+ unsigned int ret = 0;
+
+ /* Leaf page needs RW? */
+ if ( l1_flags & _PAGE_RW )
+ flags |= _PAGE_RW;
+
+ /*
+ * Calculate where in the destination we need pages
+ * The PML4 page table doesn't map all of virtual memory: for a
+ * 48-bit implementation it's just 512 L4 slots. We also need to
+ * know which L4 slots our entries lie in.
+ */
+ l4t_idx_dst_start = l4_table_offset(dst_start);
+ l4t_idx_dst_end = l4_table_offset(dst_start + size) +
+ !!( ((dst_start + size) & ((1ul << L4_PAGETABLE_SHIFT) - 1)) %
+ L4_PAGE_RANGE );
+
+ for ( i = l4t_idx_dst_start; i < l4t_idx_dst_end; i++ )
+ {
+ ASSERT( size >= 0 );
+
+ /* Is this an empty entry? */
+ if ( !(l4e_get_intpte(l4t_base[i])) )
+ {
+ /* Allocate a new L3 table */
+ if ( (l3t_pg = hvm_deprivileged_alloc_page(d)) == NULL )
+ return HVM_ERR_PG_ALLOC;
+
+ l3t_base = map_domain_page(_mfn(page_to_mfn(l3t_pg)));
+
+ /* Add the page into the L4 table */
+ l4t_base[i] = l4e_from_page(l3t_pg, flags);
+
+ ret = hvm_deprivileged_map_l3(d, l3t_base, src_start, dst_start,
+ (size > L4_PAGE_RANGE) ? L4_PAGE_RANGE : size,
+ l1_flags, op);
+
+ unmap_domain_page(l3t_base);
+
+ if ( ret )
+ break;
+
+ }
+ else
+ {
+ /*
+ * If there is already page information then the page has been
+ * prepared by something else
+ *
+ * We can try recursing on this and see if where we want to put our
+ * new pages is empty.
+ *
+ * We do need to flip this to be a user mode page though so that
+ * the usermode children can be accessed. This is fine as long as
+ * we preserve the access bits of any supervisor entries that are
+ * used in the leaf case.
+ */
+
+ l3t_base = map_l3t_from_l4e(l4t_base[i]);
+
+ ret = hvm_deprivileged_map_l3(d, l3t_base, src_start, dst_start,
+ (size > L4_PAGE_RANGE) ? L4_PAGE_RANGE : size,
+ l1_flags, op);
+
+ l4t_base[i] = l4e_from_intpte(l4e_get_intpte(l4t_base[i]) | flags);
+
+ unmap_domain_page(l3t_base);
+
+ if ( ret )
+ break;
+ }
+
+ size -= L4_PAGE_RANGE;
+ src_start += L4_PAGE_RANGE;
+ dst_start += L4_PAGE_RANGE;
+ }
+
+ return ret;
+}
+
+int hvm_deprivileged_map_l3(struct domain *d,
+ l3_pgentry_t *l3t_base,
+ unsigned long src_start,
+ unsigned long dst_start,
+ unsigned long size,
+ unsigned int l1_flags,
+ unsigned int op)
+{
+ struct page_info *l2t_pg; /* the destination page */
+ l2_pgentry_t *l2t_base;
+ unsigned long l3t_idx_dst_start;
+ unsigned long l3t_idx_dst_end;
+ unsigned long flags = _PAGE_USER | _PAGE_PRESENT;
+ unsigned int i;
+ unsigned int ret = 0;
+
+ /* Leaf page needs RW? */
+ if ( l1_flags & _PAGE_RW )
+ flags |= _PAGE_RW;
+
+ /* Calculate where in the destination we need pages */
+ l3t_idx_dst_start = l3_table_offset(dst_start);
+ l3t_idx_dst_end = l3_table_offset(dst_start + size) +
+ !!( ((dst_start + size) & ((1ul << L3_PAGETABLE_SHIFT) - 1)) %
+ L3_PAGE_RANGE );
+
+ for ( i = l3t_idx_dst_start; i < l3t_idx_dst_end; i++ )
+ {
+ ASSERT( size >= 0 );
+
+ /* Is this an empty entry? */
+ if ( !(l3e_get_intpte(l3t_base[i])) )
+ {
+ /* Allocate a new L2 table */
+ if ( (l2t_pg = hvm_deprivileged_alloc_page(d)) == NULL )
+ return HVM_ERR_PG_ALLOC;
+
+ l2t_base = map_domain_page(_mfn(page_to_mfn(l2t_pg)));
+
+ /* Add the page into the L3 table */
+ l3t_base[i] = l3e_from_page(l2t_pg, flags);
+
+ ret = hvm_deprivileged_map_l2(d, l2t_base, src_start, dst_start,
+ (size > L3_PAGE_RANGE) ? L3_PAGE_RANGE : size,
+ l1_flags, op);
+
+ unmap_domain_page(l2t_base);
+
+ if ( ret )
+ break;
+ }
+ else
+ {
+ /*
+ * If there is already page information then the page has been
+ * prepared by something else
+ *
+ * If the PSE bit is set, then we can't recurse as this is
+ * a leaf page so we fail.
+ */
+ if ( (l3e_get_flags(l3t_base[i]) & _PAGE_PSE) )
+ {
+ panic("HVM: L3 leaf page is already mapped\n");
+ }
+
+ /*
+ * We can try recursing on this and see if where we want to put our
+ * new pages is empty.
+ *
+ * We do need to flip this to be a user mode page though so that
+ * the usermode children can be accessed. This is fine as long as
+ * we preserve the access bits of any supervisor entries that are
+ * used in the leaf case.
+ */
+
+ l2t_base = map_l2t_from_l3e(l3t_base[i]);
+
+ ret = hvm_deprivileged_map_l2(d, l2t_base, src_start, dst_start,
+ (size > L3_PAGE_RANGE) ? L3_PAGE_RANGE : size,
+ l1_flags, op);
+
+ l3t_base[i] = l3e_from_intpte(l3e_get_intpte(l3t_base[i]) | flags);
+
+ unmap_domain_page(l2t_base);
+
+ if ( ret )
+ break;
+ }
+
+ size -= L3_PAGE_RANGE;
+ src_start += L3_PAGE_RANGE;
+ dst_start += L3_PAGE_RANGE;
+ }
+
+ return ret;
+}
+
+int hvm_deprivileged_map_l2(struct domain *d,
+ l2_pgentry_t *l2t_base,
+ unsigned long src_start,
+ unsigned long dst_start,
+ unsigned long size,
+ unsigned int l1_flags,
+ unsigned int op)
+{
+ struct page_info *l1t_pg; /* the destination page */
+ l1_pgentry_t *l1t_base;
+ unsigned long l2t_idx_dst_start;
+ unsigned long l2t_idx_dst_end;
+ unsigned long flags = _PAGE_USER | _PAGE_PRESENT;
+ unsigned int i;
+ unsigned int ret = 0;
+
+ /* Leaf page needs RW? */
+ if ( l1_flags & _PAGE_RW )
+ flags |= _PAGE_RW;
+
+ /* Calculate where in the destination we need pages */
+ l2t_idx_dst_start = l2_table_offset(dst_start);
+ l2t_idx_dst_end = l2_table_offset(dst_start + size) +
+ !!( ((dst_start + size) & ((1ul << L2_PAGETABLE_SHIFT) - 1)) %
+ L2_PAGE_RANGE );
+
+ for ( i = l2t_idx_dst_start; i < l2t_idx_dst_end; i++ )
+ {
+ ASSERT( size >= 0 );
+
+ /* Is this an empty entry? */
+ if ( !(l2e_get_intpte(l2t_base[i])) )
+ {
+ /* Allocate a new L1 table */
+ if ( (l1t_pg = hvm_deprivileged_alloc_page(d)) == NULL )
+ return HVM_ERR_PG_ALLOC;
+
+ l1t_base = map_domain_page(_mfn(page_to_mfn(l1t_pg)));
+
+ /* Add the page into the L2 table */
+ l2t_base[i] = l2e_from_page(l1t_pg, flags);
+
+ ret = hvm_deprivileged_map_l1(d, l1t_base, src_start, dst_start,
+ (size > L2_PAGE_RANGE) ? L2_PAGE_RANGE : size,
+ l1_flags, op);
+
+ unmap_domain_page(l1t_base);
+
+ if ( ret )
+ break;
+ }
+ else
+ {
+ /*
+ * If there is already page information then the page has been
+ * prepared by something else
+ *
+ * If the PSE bit is set, then we can't recurse as this is
+ * a leaf page so we fail.
+ */
+ if ( (l2e_get_flags(l2t_base[i]) & _PAGE_PSE) )
+ {
+ panic("HVM: L2 Leaf page is already mapped\n");
+ }
+
+ /*
+ * We can try recursing on this and see if where we want to put our
+ * new pages is empty.
+ *
+ * We do need to flip this to be a user mode page though so that
+ * the usermode children can be accessed. This is fine as long as
+ * we preserve the access bits of any supervisor entries that are
+ * used in the leaf case.
+ */
+
+ l1t_base = map_l1t_from_l2e(l2t_base[i]);
+
+ ret = hvm_deprivileged_map_l1(d, l1t_base, src_start, dst_start,
+ (size > L2_PAGE_RANGE) ? L2_PAGE_RANGE : size,
+ l1_flags, op);
+
+ l2t_base[i] = l2e_from_intpte(l2e_get_intpte(l2t_base[i]) | flags);
+
+ unmap_domain_page(l1t_base);
+
+ if ( ret )
+ break;
+ }
+
+ size -= L2_PAGE_RANGE;
+ src_start += L2_PAGE_RANGE;
+ dst_start += L2_PAGE_RANGE;
+ }
+ return ret;
+}
+
+int hvm_deprivileged_map_l1(struct domain *d,
+ l1_pgentry_t *l1t_base,
+ unsigned long src_start,
+ unsigned long dst_start,
+ unsigned long size,
+ unsigned int l1_flags,
+ unsigned int op)
+{
+ struct page_info *dst_pg; /* the destination page */
+ char *src_data;
+ char *dst_data; /* Pointer for writing into the page */
+ unsigned long l1t_idx_dst_start;
+ unsigned long l1t_idx_dst_end;
+ unsigned long flags = _PAGE_USER | _PAGE_GLOBAL | _PAGE_PRESENT;
+ unsigned int i;
+
+ /* Calculate where in the destination we need pages */
+ l1t_idx_dst_start = l1_table_offset(dst_start);
+ l1t_idx_dst_end = l1_table_offset(dst_start + size) +
+ !!( ((dst_start + size) & ((1ul << L1_PAGETABLE_SHIFT) - 1)) %
+ L1_PAGE_RANGE );
+
+ for ( i = l1t_idx_dst_start; i < l1t_idx_dst_end; i++ )
+ {
+ ASSERT( size >= 0 );
+
+ /* Is this an empty entry? */
+ if ( !(l1e_get_intpte(l1t_base[i])) )
+ {
+ if ( op == HVM_DEPRIV_ALIAS )
+ {
+ /*
+ * To alias a page, put the mfn of the page into our page table
+ * The source should be page aligned to prevent us mapping in
+ * more data than we should.
+ */
+ l1t_base[i] = l1e_from_pfn(virt_to_mfn(src_start),
+ flags | l1_flags);
+ }
+ else
+ {
+ /* Create a new 4KB page */
+ if ( (dst_pg = hvm_deprivileged_alloc_page(d)) == NULL )
+ return HVM_ERR_PG_ALLOC;
+
+ /*
+ * Map in src and dst, perform the copy then add it to the
+ * L1 table
+ */
+ dst_data = map_domain_page(_mfn(page_to_mfn(dst_pg)));
+ src_data = map_domain_page(_mfn(virt_to_mfn(src_start)));
+ ASSERT( dst_data != NULL && src_data != NULL );
+
+ memcpy(dst_data, src_data,
+ (size > PAGESIZE_4KB) ? PAGESIZE_4KB : size);
+
+ unmap_domain_page(src_data);
+ unmap_domain_page(dst_data);
+
+ l1t_base[i] = l1e_from_page(dst_pg, flags | l1_flags);
+ }
+
+ size -= PAGESIZE_4KB;
+ src_start += PAGESIZE_4KB;
+ dst_start += PAGESIZE_4KB;
+ }
+ else
+ {
+ /*
+ * If there is already page information then the page has been
+ * prepared by something else, and we can't overwrite it
+ * as this is the leaf case.
+ */
+ panic("HVM: L1 Region already mapped: %lx\nat(%lx)\n",
+ l1e_get_intpte(l1t_base[i]), dst_start);
+ }
+ }
+ return 0;
+}
+
+/* Allocates a page from the domain's paging pool */
+struct page_info *hvm_deprivileged_alloc_page(struct domain *d)
+{
+ struct page_info *pg;
+
+ if ( (pg = d->arch.paging.alloc_page(d)) == NULL )
+ {
+ printk(XENLOG_ERR "HVM: Out of memory allocating HVM page\n");
+ domain_crash(d);
+ return NULL;
+ }
+
+ return pg;
+}
diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index e9c0080..4048929 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -42,6 +42,7 @@
#include <asm/hvm/nestedhvm.h>
#include "private.h"
+#include <xen/hvm/deprivileged.h>
/* Override macros from asm/page.h to make them work with mfn_t */
#undef mfn_to_page
@@ -401,6 +402,9 @@ static void hap_install_xen_entries_in_l4(struct vcpu *v, mfn_t l4mfn)
&idle_pg_table[ROOT_PAGETABLE_FIRST_XEN_SLOT],
ROOT_PAGETABLE_XEN_SLOTS * sizeof(l4_pgentry_t));
+ /* Initialise the HVM deprivileged mode feature */
+ hvm_deprivileged_init(d, l4e);
+
/* Install the per-domain mappings for this domain */
l4e[l4_table_offset(PERDOMAIN_VIRT_START)] =
l4e_from_pfn(mfn_x(page_to_mfn(d->arch.perdomain_l3_pg)),
@@ -439,6 +443,10 @@ static void hap_destroy_monitor_table(struct vcpu* v, mfn_t mmfn)
/* Put the memory back in the pool */
hap_free(d, mmfn);
+
+ /* Destroy the HVM tables */
+ ASSERT(paging_locked_by_me(d));
+ hvm_deprivileged_destroy(d);
}
/************************************************/
diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index 22081a1..deed4fd 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -38,6 +38,7 @@
#include <asm/mtrr.h>
#include <asm/guest_pt.h>
#include <public/sched.h>
+#include <xen/hvm/deprivileged.h>
#include "private.h"
#include "types.h"
@@ -1429,6 +1430,13 @@ void sh_install_xen_entries_in_l4(struct domain *d, mfn_t gl4mfn, mfn_t sl4mfn)
&idle_pg_table[ROOT_PAGETABLE_FIRST_XEN_SLOT],
slots * sizeof(l4_pgentry_t));
+ /*
+ * Initialise the HVM deprivileged mode feature.
+ * The shadow_l4e_t is a typedef for l4_pgentry_t as are all of the
+ * paging structure so this method will work for the shadow table as well.
+ */
+ hvm_deprivileged_init(d, (l4_pgentry_t *)sl4e);
+
/* Install the per-domain mappings for this domain */
sl4e[shadow_l4_table_offset(PERDOMAIN_VIRT_START)] =
shadow_l4e_from_mfn(page_to_mfn(d->arch.perdomain_l3_pg),
diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
index 6553cff..0bfe0cf 100644
--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -50,6 +50,25 @@ SECTIONS
_etext = .; /* End of text section */
} :text = 0x9090
+ /* HVM deprivileged mode segments
+ * Used to map the ring3 static data and .text
+ */
+
+ . = ALIGN(PAGE_SIZE);
+ .hvm_deprivileged_text : {
+ __hvm_deprivileged_text_start = . ;
+ *(.hvm_deprivileged_enhancement.text)
+ __hvm_deprivileged_text_end = . ;
+ } : text
+
+ . = ALIGN(PAGE_SIZE);
+ .hvm_deprivileged_data : {
+ __hvm_deprivileged_data_start = . ;
+ *(.hvm_deprivileged_enhancement.data)
+ __hvm_deprivileged_data_end = . ;
+ } : text
+
+ . = ALIGN(PAGE_SIZE);
.rodata : {
/* Bug frames table */
. = ALIGN(4);
diff --git a/xen/include/asm-x86/config.h b/xen/include/asm-x86/config.h
index 3e9be83..b5f4e14 100644
--- a/xen/include/asm-x86/config.h
+++ b/xen/include/asm-x86/config.h
@@ -183,10 +183,12 @@ extern unsigned char boot_edid_info[128];
#endif
* 0xffff880000000000 - 0xffffffffffffffff [120TB, PML4:272-511]
* PV: Guest-defined use.
- * 0xffff880000000000 - 0xffffff7fffffffff [119.5TB, PML4:272-510]
+ * 0xffff880000000000 - 0xffffff0000000000 [119TB, PML4:272-509]
* HVM/idle: continuation of 1:1 mapping
+ * 0xffffff0000000000 - 0xfffff7ffffffffff [512GB, 2^39 bytes PML4:510]
+ * HVM: HVM deprivileged mode .text segment
* 0xffffff8000000000 - 0xffffffffffffffff [512GB, 2^39 bytes PML4:511]
- * HVM/idle: unused
+ * HVM: HVM deprivileged mode data and stack segments
*
* Compatibility guest area layout:
* 0x0000000000000000 - 0x00000000f57fffff [3928MB, PML4:0]
@@ -201,7 +203,6 @@ extern unsigned char boot_edid_info[128];
* Reserved for future use.
*/
-
#define ROOT_PAGETABLE_FIRST_XEN_SLOT 256
#define ROOT_PAGETABLE_LAST_XEN_SLOT 271
#define ROOT_PAGETABLE_XEN_SLOTS \
@@ -270,16 +271,30 @@ extern unsigned char boot_edid_info[128];
#define FRAMETABLE_VIRT_START (FRAMETABLE_VIRT_END - FRAMETABLE_SIZE)
#ifndef CONFIG_BIGMEM
-/* Slot 262-271/510: A direct 1:1 mapping of all of physical memory. */
+/* Slot 262-271/509: A direct 1:1 mapping of all of physical memory. */
#define DIRECTMAP_VIRT_START (PML4_ADDR(262))
-#define DIRECTMAP_SIZE (PML4_ENTRY_BYTES * (511 - 262))
+#define DIRECTMAP_SIZE (PML4_ENTRY_BYTES * (510 - 262))
#else
-/* Slot 265-271/510: A direct 1:1 mapping of all of physical memory. */
+/* Slot 265-271/509: A direct 1:1 mapping of all of physical memory. */
#define DIRECTMAP_VIRT_START (PML4_ADDR(265))
-#define DIRECTMAP_SIZE (PML4_ENTRY_BYTES * (511 - 265))
+#define DIRECTMAP_SIZE (PML4_ENTRY_BYTES * (510 - 265))
#endif
#define DIRECTMAP_VIRT_END (DIRECTMAP_VIRT_START + DIRECTMAP_SIZE)
+/*
+ * Slots 510-511: HVM deprivileged mode
+ * The virtual addresses where the .text, .data and stack should be
+ * placed.
+ * We put the .text section in slot 510 by itself so that we can easily create
+ * an alias of it. This is because we use the same mfn for a page entry when
+ * aliasing it and so, if we put the data and stack in with the .text at PML4
+ * level, they would conflict with the other address spaces, which is not
+ * correct.
+ */
+#define HVM_DEPRIVILEGED_TEXT_ADDR (PML4_ADDR(510))
+#define HVM_DEPRIVILEGED_DATA_ADDR (PML4_ADDR(511) + 0xa000000)
+#define HVM_DEPRIVILEGED_STACK_ADDR (PML4_ADDR(511) + 0xc000000)
+
#ifndef __ASSEMBLY__
/* This is not a fixed value, just a lower limit. */
diff --git a/xen/include/asm-x86/x86_64/page.h b/xen/include/asm-x86/x86_64/page.h
index 19ab4d0..8ecb877 100644
--- a/xen/include/asm-x86/x86_64/page.h
+++ b/xen/include/asm-x86/x86_64/page.h
@@ -22,6 +22,21 @@
#define __PAGE_OFFSET DIRECTMAP_VIRT_START
#define __XEN_VIRT_START XEN_VIRT_START
+/* The sizes of the pages */
+#define PAGESIZE_1GB (1ul << L3_PAGETABLE_SHIFT)
+#define PAGESIZE_2MB (1ul << L2_PAGETABLE_SHIFT)
+#define PAGESIZE_4KB (1ul << L1_PAGETABLE_SHIFT)
+
+/*
+ * The size in bytes that a single L{1,2,3,4} entry covers.
+ * There are 512 (left shift by 9) entries in each page-structure.
+ */
+#define L4_PAGE_RANGE (PAGESIZE_1GB << 9)
+#define L3_PAGE_RANGE (PAGESIZE_2MB << 9)
+#define L2_PAGE_RANGE (PAGESIZE_4KB << 9)
+#define L1_PAGE_RANGE (PAGESIZE_4KB )
+
+
/* These are architectural limits. Current CPUs support only 40-bit phys. */
#define PADDR_BITS 52
#define VADDR_BITS 48
diff --git a/xen/include/xen/hvm/deprivileged.h b/xen/include/xen/hvm/deprivileged.h
new file mode 100644
index 0000000..defc89d
--- /dev/null
+++ b/xen/include/xen/hvm/deprivileged.h
@@ -0,0 +1,95 @@
+#ifndef __X86_HVM_DEPRIVILEGED
+
+#define __X86_HVM_DEPRIVILEGED
+
+#include <asm/page.h>
+#include <xen/lib.h>
+#include <xen/mm.h>
+#include <xen/domain_page.h>
+#include <xen/config.h>
+#include <xen/types.h>
+#include <xen/sched.h>
+#include <asm/paging.h>
+#include <asm/hap.h>
+#include <asm/paging.h>
+#include <asm-x86/page.h>
+#include <public/domctl.h>
+#include <xen/domain_page.h>
+
+/*
+ * Initialise the HVM deprivileged mode. This just sets up the general
+ * page mappings for .text and .data. It does not prepare each HVM vcpu's data
+ * or stack which needs to be done separately using
+ * hvm_deprivileged_prepare_vcpu.
+ */
+void hvm_deprivileged_init(struct domain *d, l4_pgentry_t *l4t_base);
+
+/*
+ * Free up the data used by the HVM deprivileged enhancements.
+ * This frees general page mappings. It does not destroy the per-vcpu
+ * data so hvm_deprivileged_destroy_vcpu also needs to be called for each vcpu.
+ * This method should be called after those per-vcpu destruction routines.
+ */
+void hvm_deprivileged_destroy(struct domain *d);
+
+/*
+ * Create a copy or alias of the data at the specified virtual address. This
+ * walks the page table hierarchy, creating new levels as needed, and then
+ * either copies or aliases the data.
+ */
+int hvm_deprivileged_map_l4(struct domain *d,
+ l4_pgentry_t *l4e_base,
+ unsigned long src_start,
+ unsigned long dst_start,
+ unsigned long size,
+ unsigned int l1_flags,
+ unsigned int op);
+
+int hvm_deprivileged_map_l3(struct domain *d,
+ l3_pgentry_t *l1t_base,
+ unsigned long src_start,
+ unsigned long dst_start,
+ unsigned long size,
+ unsigned int l1_flags,
+ unsigned int op);
+
+int hvm_deprivileged_map_l2(struct domain *d,
+ l2_pgentry_t *l1t_base,
+ unsigned long src_start,
+ unsigned long dst_start,
+ unsigned long size,
+ unsigned int l1_flags,
+ unsigned int op);
+/*
+ * The leaf case of the map. Will allocate the pages and actually copy or alias
+ * the data.
+ */
+int hvm_deprivileged_map_l1(struct domain *d,
+ l1_pgentry_t *l1t_base,
+ unsigned long src_start,
+ unsigned long dst_start,
+ unsigned long size,
+ unsigned int l1_flags,
+ unsigned int op);
+
+/* Used to allocate a page for the deprivileged mode */
+struct page_info *hvm_deprivileged_alloc_page(struct domain *d);
+
+/* The segments where the user mode .text and .data are stored */
+extern unsigned long __hvm_deprivileged_text_start[];
+extern unsigned long __hvm_deprivileged_text_end[];
+extern unsigned long __hvm_deprivileged_data_start[];
+extern unsigned long __hvm_deprivileged_data_end[];
+#define HVM_DEPRIV_STACK_SIZE (PAGE_SIZE << 1)
+#define HVM_DEPRIV_STACK_ORDER 1
+#define HVM_DEPRIV_DATA_SECTION_SIZE \
+ ((unsigned long)__hvm_deprivileged_data_end - \
+ (unsigned long)__hvm_deprivileged_data_start + \
+ (PAGE_SIZE << 1))
+
+#define HVM_DEPRIV_MODE 1
+#define HVM_ERR_PG_ALLOC -1
+#define HVM_DEPRIV_ALIAS 1
+#define HVM_DEPRIV_COPY 0
+
+#endif
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 73d3bc8..66f4f5e 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -462,6 +462,10 @@ struct domain
/* vNUMA topology accesses are protected by rwlock. */
rwlock_t vnuma_rwlock;
struct vnuma_info *vnuma;
+
+ /* HVM deprivileged mode data */
+ int hvm_depriv_valid_l4e_code;
+ l4_pgentry_t hvm_depriv_l4e_code;
};
struct domain_setup_info
--
2.1.4
* [PATCH RFC v3 2/6] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
2015-09-11 16:08 [PATCH RFC v3 0/6] HVM x86 deprivileged mode summary Ben Catterall
2015-09-11 16:08 ` [PATCH RFC v3 1/6] HVM x86 deprivileged mode: Create deprivileged page tables Ben Catterall
@ 2015-09-11 16:08 ` Ben Catterall
2015-09-11 16:08 ` [PATCH RFC v3 3/6] HVM x86 deprivileged mode: Trap handlers for " Ben Catterall
From: Ben Catterall @ 2015-09-11 16:08 UTC (permalink / raw)
To: xen-devel
Cc: keir, ian.campbell, george.dunlap, andrew.cooper3, tim, jbeulich,
Aravind.Gopalakrishnan, suravee.suthikulpanit, Ben Catterall,
boris.ostrovsky
The process of switching into and out of deprivileged mode can be likened to
setjmp/longjmp.
Xen is non-preemptive, and taking an interrupt/exception, SYSCALL, SYSENTER,
NMI or any IST event would currently clobber the Xen privileged stack. We need
this stack to be preserved so that, after executing deprivileged mode, we can
return to our previous privileged execution point. This allows us to unwind the
stack, cleaning up memory allocations.
To enter deprivileged mode, we move the interrupt/exception rsp, SYSENTER rsp
and SYSCALL rsp to point lower down Xen's privileged stack, to prevent them
from clobbering it. The IST NMI and double fault handlers used to copy
themselves onto the privileged stack; this is no longer the case, they now stay
on their predefined stacks.
This means that we can continue execution from that point, which is similar
behaviour to a context switch.
To exit deprivileged mode, we restore the original interrupt/exception rsp,
SYSENTER rsp and SYSCALL rsp. We can then continue execution from where we left
off, which will unwind the stack and free up resources. This approach means
that we do not need to change any other code paths and the invocation is
transparent to callers, which should allow the feature to be deployed more
easily to different parts of Xen.
The switch into and out of deprivileged mode is performed using sysret and
syscall respectively; the overall sequence is sketched below.
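As a rough, hedged summary of that sequence (the real transitions are the
assembly and per-vendor hooks in this patch; current_aligned_rsp() and
run_ring3_payload() are hypothetical placeholders standing in for the asm):

    #include <xen/sched.h>
    #include <xen/hvm/deprivileged.h>

    /* Pseudo-C sketch of hvm_deprivileged_user_mode()'s flow, for
     * illustration only. */
    static void sketch_enter_depriv(struct vcpu *v)
    {
        /* Save guest state and load host state (FPU, EFER.SCE, ...). */
        v->arch.hvm_vcpu.depriv_ctxt_switch_to(v);

        /*
         * Point TSS.rsp0 and the SYSCALL/SYSENTER stacks just below the
         * current rsp, so traps taken while deprivileged do not clobber the
         * part of the privileged stack needed for the return path.
         */
        hvm_deprivileged_setup_stacks(current_aligned_rsp()); /* hypothetical */

        run_ring3_payload();   /* hypothetical: sysret in, syscall back out */

        /* Put TSS.rsp0 and the SYSCALL/SYSENTER stacks back ... */
        hvm_deprivileged_restore_stacks();

        /* ... restore guest state, and carry on from where we left off. */
        v->arch.hvm_vcpu.depriv_ctxt_switch_from(v);
    }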
Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
Changed since v1
----------------
* Added support for AMD SVM
* Moved to the new stack approach
* IST handlers no longer copy themselves
* Updated context switching code to perform a full context-switch.
This means that depriv mode will execute with the host register state, not
(partial) guest register state. This allows the domain to be crashed (later
patch) whilst in depriv mode, alleviates potential security vulnerabilities
and is necessary to work around the AMD TR issue.
* Moved processor-specific code to processor-specific files.
* Changed call/jmp pair in deprivileged_asm.S to call/ret pair to not confuse
processor branch predictors.
Changed since v2:
-----------------
* Coding style: Add space after if, for, etc.
---
xen/arch/x86/domain.c | 12 +++
xen/arch/x86/hvm/Makefile | 1 +
xen/arch/x86/hvm/deprivileged.c | 103 ++++++++++++++++++++++
xen/arch/x86/hvm/deprivileged_asm.S | 167 ++++++++++++++++++++++++++++++++++++
xen/arch/x86/hvm/svm/svm.c | 130 +++++++++++++++++++++++++++-
xen/arch/x86/hvm/vmx/vmx.c | 118 +++++++++++++++++++++++++
xen/arch/x86/mm/hap/hap.c | 2 +-
xen/arch/x86/x86_64/asm-offsets.c | 5 ++
xen/arch/x86/x86_64/entry.S | 38 ++++++--
xen/arch/x86/x86_64/traps.c | 13 ++-
xen/include/asm-x86/current.h | 2 +
xen/include/asm-x86/hvm/svm/svm.h | 13 +++
xen/include/asm-x86/hvm/vcpu.h | 15 ++++
xen/include/asm-x86/hvm/vmx/vmx.h | 2 +
xen/include/asm-x86/processor.h | 2 +
xen/include/asm-x86/system.h | 3 +
xen/include/xen/hvm/deprivileged.h | 45 ++++++++++
xen/include/xen/sched.h | 18 +++-
18 files changed, 674 insertions(+), 15 deletions(-)
create mode 100644 xen/arch/x86/hvm/deprivileged_asm.S
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 045f6ff..a0e5e70 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -62,6 +62,7 @@
#include <xen/iommu.h>
#include <compat/vcpu.h>
#include <asm/psr.h>
+#include <xen/hvm/deprivileged.h>
DEFINE_PER_CPU(struct vcpu *, curr_vcpu);
DEFINE_PER_CPU(unsigned long, cr4);
@@ -446,6 +447,12 @@ int vcpu_initialise(struct vcpu *v)
if ( has_hvm_container_domain(d) )
{
rc = hvm_vcpu_initialise(v);
+
+ /* Initialise HVM deprivileged mode */
+ printk("HVM initialising deprivileged mode ...");
+ hvm_deprivileged_prepare_vcpu(v);
+ printk("Done.\n");
+
goto done;
}
@@ -523,7 +530,12 @@ void vcpu_destroy(struct vcpu *v)
vcpu_destroy_fpu(v);
if ( has_hvm_container_vcpu(v) )
+ {
+ /* Destroy the deprivileged mode on this vcpu */
+ hvm_deprivileged_destroy_vcpu(v);
+
hvm_vcpu_destroy(v);
+ }
else
xfree(v->arch.pv_vcpu.trap_ctxt);
}
diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
index df5ebb8..e16960a 100644
--- a/xen/arch/x86/hvm/Makefile
+++ b/xen/arch/x86/hvm/Makefile
@@ -3,6 +3,7 @@ subdir-y += vmx
obj-y += asid.o
obj-y += deprivileged.o
+obj-y += deprivileged_asm.o
obj-y += emulate.o
obj-y += event.o
obj-y += hpet.o
diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
index 0075523..5574c50 100644
--- a/xen/arch/x86/hvm/deprivileged.c
+++ b/xen/arch/x86/hvm/deprivileged.c
@@ -536,3 +536,106 @@ struct page_info *hvm_deprivileged_alloc_page(struct domain *d)
return pg;
}
+
+/* Used to prepare each vcpu's data for user mode. Call for each HVM vcpu. */
+int hvm_deprivileged_prepare_vcpu(struct vcpu *vcpu)
+{
+ vcpu->arch.hvm_vcpu.depriv_rsp = 0;
+ vcpu->arch.hvm_vcpu.depriv_user_mode = 0;
+ vcpu->arch.hvm_vcpu.depriv_destroy = 0;
+ vcpu->arch.hvm_vcpu.depriv_watchdog_count = 0;
+
+ return 0;
+}
+
+/* Called on destroying each vcpu */
+void hvm_deprivileged_destroy_vcpu(struct vcpu *vcpu)
+{
+
+}
+
+/*
+ * Called to perform a user mode operation.
+ * Execution context is preserved and then we move into user mode.
+ * This method is then jumped into to restore execution context after
+ * exiting user mode.
+ */
+void hvm_deprivileged_user_mode(void)
+{
+ struct vcpu *vcpu = get_current();
+
+ ASSERT( vcpu->arch.hvm_vcpu.depriv_user_mode == 0 );
+ ASSERT( vcpu->arch.hvm_vcpu.depriv_rsp == 0 );
+
+ vcpu->arch.hvm_vcpu.depriv_ctxt_switch_to(vcpu);
+
+ /* The assembly routine to handle moving into/out of deprivileged mode */
+ hvm_deprivileged_user_mode_asm();
+
+ vcpu->arch.hvm_vcpu.depriv_ctxt_switch_from(vcpu);
+
+ vcpu->arch.hvm_vcpu.depriv_user_mode = 0;
+ vcpu->arch.hvm_vcpu.depriv_rsp = 0;
+}
+
+/*
+ * We need to be able to handle interrupts and exceptions whilst in deprivileged
+ * mode. Xen is non-preemptable so our privileged mode stack would be clobbered
+ * if we took an exception/interrupt, syscall or sysenter whilst in deprivileged
+ * mode.
+ *
+ * To handle this, we setup another set of stacks for interrupts/exceptions,
+ * syscall and sysenter. This is done by
+ * - changing TSS.rsp0 so that interrupts and exceptions are taken on a part of
+ * the Xen stack past our current rsp.
+ * - moving the syscall and sysenter stacks so these are also moved past our
+ * execution point.
+ *
+ * This function is called at the point where the rsp is as deep as it will
+ * be on the return path, so we can safely clobber beyond it. It has also been
+ * aligned as needed for a stack pointer.
+ * We do not need to change the IST stack pointers as these are already taken on
+ * different stacks so won't clobber our current Xen stack.
+ *
+ * New Stack Layout
+ * ----------------
+ *
+ * Xen's cpu stacks are 8 pages (8-page aligned), arranged as:
+ *
+ * 7 - Primary stack (with a struct cpu_info at the top)
+ * 6 - Primary stack
+ * - Somewhere in 6 and 7 (depending upon where rsp is when we enter
+ * deprivileged mode), we set the syscall/sysenter and exception pointer
+ * so that it is below the current rsp.
+ * 5 - Optionally not present (MEMORY_GUARD)
+ * 4 - unused
+ * 3 - Syscall trampolines
+ * 2 - MCE IST stack
+ * 1 - NMI IST stack
+ * 0 - Double Fault IST stack
+ */
+void hvm_deprivileged_setup_stacks(unsigned long stack_ptr)
+{
+ get_current()->arch.hvm_vcpu.depriv_setup_stacks(stack_ptr);
+}
+
+/*
+ * Restore the old TSS.rsp0 for the interrupt/exception stack and the
+ * syscall/sysenter stacks.
+ */
+void hvm_deprivileged_restore_stacks(void)
+{
+ get_current()->arch.hvm_vcpu.depriv_restore_stacks();
+}
+
+/*
+ * Called when the user mode operation has completed
+ * Perform C-level processing on the return path
+ */
+void hvm_deprivileged_finish_user_mode(void)
+{
+ /* If we are not returning from user mode: bail */
+ ASSERT(get_current()->arch.hvm_vcpu.depriv_user_mode == 1);
+
+ hvm_deprivileged_finish_user_mode_asm();
+}
diff --git a/xen/arch/x86/hvm/deprivileged_asm.S b/xen/arch/x86/hvm/deprivileged_asm.S
new file mode 100644
index 0000000..07d4216
--- /dev/null
+++ b/xen/arch/x86/hvm/deprivileged_asm.S
@@ -0,0 +1,167 @@
+/*
+ * HVM deprivileged mode assembly code
+ */
+
+#include <xen/config.h>
+#include <xen/errno.h>
+#include <xen/softirq.h>
+#include <asm/asm_defns.h>
+#include <asm/apicdef.h>
+#include <asm/page.h>
+#include <public/xen.h>
+#include <irq_vectors.h>
+#include <xen/hvm/deprivileged.h>
+
+/*
+ * Handles entry into the deprivileged mode and returning from this
+ * mode.
+ *
+ * If we are entering deprivileged mode, then we use a sysret to get there.
+ * If we are returning from deprivileged mode, then we need to unwind the
+ * stack, so we push the return address onto the current stack so that we can
+ * return into this function and then return from it, unwinding the stack.
+ *
+ * We're doing a sort of setjmp/longjmp, saving state onto the stack to
+ * preserve it and allow the returning code to continue executing from
+ * within this method.
+ */
+ENTRY(hvm_deprivileged_user_mode_asm)
+ /* Save our registers */
+ push %rax
+ push %rbx
+ push %rcx
+ push %rdx
+ push %rsi
+ push %rdi
+ push %rbp
+ push %r8
+ push %r9
+ push %r10
+ push %r11
+ push %r12
+ push %r13
+ push %r14
+ push %r15
+ pushfq
+
+ /* Perform a near call to push rip onto the stack */
+ call 1f
+
+ /*
+ * MAGIC: Add to the stored rip the size of the code between
+ * label 1 and label 2. This allows us to restart execution at label 2.
+ */
+1: addq $2f-1b, (%rsp)
+
+ /*
+ * Setup the stack pointers for exceptions, syscall and sysenter to be
+ * just after our current rsp, adjusted for 16 byte alignment.
+ */
+ mov %rsp, %rdi
+ and $-16, %rdi
+ call hvm_deprivileged_setup_stacks
+ /*
+ * DO NOT push any more data onto the stack from here unless returning
+ * from user mode. It will be clobbered by exceptions/interrupts,
+ * syscall and sysenter.
+ */
+
+/* USER MODE ENTRY POINT */
+2:
+ GET_CURRENT(%r8)
+ movq VCPU_depriv_user_mode(%r8), %rdx
+
+ /* If !user_mode */
+ cmpq $0, %rdx
+ jne 3f
+ cli
+
+ movq %rsp, VCPU_depriv_rsp(%r8) /* The rsp to restore to */
+ movabs $HVM_DEPRIVILEGED_TEXT_ADDR, %rcx /* RIP in user mode */
+
+ /* RFLAGS user mode */
+ movq $(X86_EFLAGS_IF | X86_EFLAGS_VIP), %r11
+ movq $1, VCPU_depriv_user_mode(%r8) /* Now in user mode */
+
+ /*
+ * Stack ptr is set by user mode. If we set rsp to the user mode stack
+ * pointer here and subsequently took an interrupt or exception between
+ * setting it and executing sysret, then the interrupt would use the
+ * user mode stack pointer. This is because the current stack rsp is
+ * used if the exception descriptor's privilege level = CPL.
+ * See Intel manual volume 3A section 6.12.1 and AMD manual volume 2,
+ * section 8.9.3. Also see Intel manual volume 2 and AMD manual 3 on
+ * the sysret instruction.
+ */
+ movq $HVM_STACK_PTR, %rbx
+ sysretq /* Enter deprivileged mode */
+
+3: call hvm_deprivileged_restore_stacks
+
+ /*
+ * Restore registers
+ * The return rip has been popped by the ret on the return path
+ */
+ popfq
+ pop %r15
+ pop %r14
+ pop %r13
+ pop %r12
+ pop %r11
+ pop %r10
+ pop %r9
+ pop %r8
+ pop %rbp
+ pop %rdi
+ pop %rsi
+ pop %rdx
+ pop %rcx
+ pop %rbx
+ pop %rax
+ ret
+
+/* Finished in user mode so return */
+ENTRY(hvm_deprivileged_finish_user_mode_asm)
+ /* Reset rsp to the old rsp */
+ cli
+ GET_CURRENT(%rbx)
+ movq VCPU_depriv_rsp(%rbx), %rsp
+
+ /*
+ * The return address that the near call pushed onto the
+ * buffer is pointed to by rsp, so use that for rip.
+ */
+ /* Go to user mode return code */
+ ret
+
+/* Entry point from the assembly syscall handlers */
+ENTRY(hvm_deprivileged_handle_user_mode)
+
+ /* Handle a user mode hypercall here */
+
+
+ /* We are finished in user mode */
+ call hvm_deprivileged_finish_user_mode
+
+ ret
+
+.section .hvm_deprivileged_enhancement.text,"ax"
+/* HVM deprivileged code */
+ENTRY(hvm_deprivileged_ring3)
+ /*
+ * sysret has loaded rip from rcx and rflags from r11.
+ * CS and SS have been loaded from the MSR for ring 3.
+ * We now need to switch to the user mode stack
+ */
+ movabs $HVM_STACK_PTR, %rsp
+
+ /* Perform user mode processing */
+ movabs $0xff, %rcx
+1: dec %rcx
+ cmp $0, %rcx
+ jne 1b
+
+ /* Return to ring 0 */
+ syscall
+
+.previous
diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index 8de41fa..3393fb5 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -61,6 +61,11 @@
#include <asm/apic.h>
#include <asm/debugger.h>
#include <asm/xstate.h>
+#include <xen/hvm/deprivileged.h>
+
+/* HVM svm MSR_{L}STAR cache */
+DEFINE_PER_CPU(u64, svm_depriv_msr_lstar);
+DEFINE_PER_CPU(u64, svm_depriv_msr_star);
void svm_asm_do_resume(void);
@@ -962,12 +967,30 @@ static inline void svm_tsc_ratio_save(struct vcpu *v)
wrmsrl(MSR_AMD64_TSC_RATIO, DEFAULT_TSC_RATIO);
}
+unsigned long svm_depriv_read_msr_star(void)
+{
+ return this_cpu(svm_depriv_msr_star);
+}
+
+void svm_depriv_write_msr_star(unsigned long star)
+{
+ this_cpu(svm_depriv_msr_star) = star;
+}
+unsigned long svm_depriv_read_msr_lstar(void)
+{
+ return this_cpu(svm_depriv_msr_lstar);
+}
+
+void svm_depriv_write_msr_lstar(unsigned long lstar)
+{
+ this_cpu(svm_depriv_msr_lstar) = lstar;
+}
+
static inline void svm_tsc_ratio_load(struct vcpu *v)
{
if ( cpu_has_tsc_ratio && !v->domain->arch.vtsc )
wrmsrl(MSR_AMD64_TSC_RATIO, vcpu_tsc_ratio(v));
}
-
static void svm_ctxt_switch_from(struct vcpu *v)
{
int cpu = smp_processor_id();
@@ -1030,6 +1053,93 @@ static void svm_ctxt_switch_to(struct vcpu *v)
wrmsrl(MSR_TSC_AUX, hvm_msr_tsc_aux(v));
}
+static void svm_depriv_ctxt_switch_from(struct vcpu *v)
+{
+
+ svm_ctxt_switch_to(v);
+ vcpu_restore_fpu_eager(v);
+
+ /* Restore the efer and saved msr registers */
+ write_efer(v->arch.hvm_vcpu.depriv_efer);
+}
+
+/* Setup our stack pointers for interrupts/exceptions, and SYSCALL. */
+static void svm_depriv_setup_stacks(unsigned long stack_ptr)
+{
+ struct vcpu *vcpu = get_current();
+ struct tss_struct *tss = &this_cpu(init_tss);
+ unsigned char *stub_page;
+ unsigned long stub_va = this_cpu(stubs.addr);
+ unsigned int offset;
+
+ /* Save the current rsp0 */
+ vcpu->arch.hvm_vcpu.depriv_tss_rsp0 = tss->rsp0;
+
+ /* Setup the stack for interrupts/exceptions */
+ tss->rsp0 = stack_ptr;
+
+ /* Stacks for syscall and sysenter */
+ stub_page = map_domain_page(_mfn(this_cpu(stubs.mfn)));
+
+ offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
+ stub_va, stack_ptr,
+ (unsigned long)lstar_enter);
+
+ stub_va += offset;
+
+ offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
+ stub_va, stack_ptr,
+ (unsigned long)cstar_enter);
+
+ /* Don't consume more than half of the stub space here. */
+ ASSERT(offset <= STUB_BUF_SIZE / 2);
+
+ unmap_domain_page(stub_page);
+}
+
+static void svm_depriv_restore_stacks(void)
+{
+ struct vcpu* vcpu = get_current();
+ struct tss_struct *tss = &this_cpu(init_tss);
+ unsigned char *stub_page;
+ unsigned long stack_bottom = get_stack_bottom();
+ unsigned long stub_va = this_cpu(stubs.addr);
+ unsigned int offset;
+
+ stub_page = map_domain_page(_mfn(this_cpu(stubs.mfn)));
+
+ /* Restore the old rsp0 */
+ tss->rsp0 = vcpu->arch.hvm_vcpu.depriv_tss_rsp0;
+
+ /* Restore the old syscall/sysenter stacks */
+ offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
+ stub_va, stack_bottom,
+ (unsigned long)lstar_enter);
+ stub_va += offset;
+
+ /* Trampoline for SYSCALL entry from compatibility mode. */
+ offset += write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
+ stub_va, stack_bottom,
+ (unsigned long)cstar_enter);
+
+ /* Don't consume more than half of the stub space here. */
+ ASSERT(offset <= STUB_BUF_SIZE / 2);
+
+ unmap_domain_page(stub_page);
+}
+
+static void svm_depriv_ctxt_switch_to(struct vcpu *v)
+{
+ vcpu_save_fpu(v);
+ svm_ctxt_switch_from(v);
+
+ v->arch.hvm_vcpu.depriv_efer = read_efer();
+
+ /* Flip the SCE bit to allow sysret/call */
+ write_efer(v->arch.hvm_vcpu.depriv_efer | EFER_SCE);
+}
+
+
static void noreturn svm_do_resume(struct vcpu *v)
{
struct vmcb_struct *vmcb = v->arch.hvm_svm.vmcb;
@@ -1156,6 +1266,12 @@ static int svm_vcpu_initialise(struct vcpu *v)
v->arch.hvm_svm.launch_core = -1;
+ /* HVM deprivileged mode operations */
+ v->arch.hvm_vcpu.depriv_ctxt_switch_to = svm_depriv_ctxt_switch_to;
+ v->arch.hvm_vcpu.depriv_ctxt_switch_from = svm_depriv_ctxt_switch_from;
+ v->arch.hvm_vcpu.depriv_setup_stacks = svm_depriv_setup_stacks;
+ v->arch.hvm_vcpu.depriv_restore_stacks = svm_depriv_restore_stacks;
+
if ( (rc = svm_create_vmcb(v)) != 0 )
{
dprintk(XENLOG_WARNING,
@@ -2547,7 +2663,19 @@ void svm_vmexit_handler(struct cpu_user_regs *regs)
{
uint16_t port = (vmcb->exitinfo1 >> 16) & 0xFFFF;
int bytes = ((vmcb->exitinfo1 >> 4) & 0x07);
+
int dir = (vmcb->exitinfo1 & 1) ? IOREQ_READ : IOREQ_WRITE;
+ /* DEBUG: Run only for a specific port */
+ if(port == 0x1000)
+ {
+ if( guest_cpu_user_regs()->eax == 0x1)
+ {
+ hvm_deprivileged_user_mode();
+ }
+ __update_guest_eip(regs, vmcb->exitinfo2 - vmcb->rip);
+ break;
+ }
+
if ( handle_pio(port, bytes, dir) )
__update_guest_eip(regs, vmcb->exitinfo2 - vmcb->rip);
}
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 2582cdd..1ec23f9 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -59,6 +59,8 @@
#include <asm/event.h>
#include <asm/monitor.h>
#include <public/arch-x86/cpuid.h>
+#include <xen/hvm/deprivileged.h>
+
static bool_t __initdata opt_force_ept;
boolean_param("force-ept", opt_force_ept);
@@ -68,6 +70,11 @@ enum handler_return { HNDL_done, HNDL_unhandled, HNDL_exception_raised };
static void vmx_ctxt_switch_from(struct vcpu *v);
static void vmx_ctxt_switch_to(struct vcpu *v);
+static void vmx_depriv_ctxt_switch_from(struct vcpu *v);
+static void vmx_depriv_ctxt_switch_to(struct vcpu *v);
+static void vmx_depriv_setup_stacks(unsigned long stack_ptr);
+static void vmx_depriv_restore_stacks(void);
+
static int vmx_alloc_vlapic_mapping(struct domain *d);
static void vmx_free_vlapic_mapping(struct domain *d);
static void vmx_install_vlapic_mapping(struct vcpu *v);
@@ -110,6 +117,12 @@ static int vmx_vcpu_initialise(struct vcpu *v)
v->arch.ctxt_switch_from = vmx_ctxt_switch_from;
v->arch.ctxt_switch_to = vmx_ctxt_switch_to;
+ /* HVM deprivileged mode operations */
+ v->arch.hvm_vcpu.depriv_ctxt_switch_to = vmx_depriv_ctxt_switch_to;
+ v->arch.hvm_vcpu.depriv_ctxt_switch_from = vmx_depriv_ctxt_switch_from;
+ v->arch.hvm_vcpu.depriv_setup_stacks = vmx_depriv_setup_stacks;
+ v->arch.hvm_vcpu.depriv_restore_stacks = vmx_depriv_restore_stacks;
+
if ( (rc = vmx_create_vmcs(v)) != 0 )
{
dprintk(XENLOG_WARNING,
@@ -272,6 +285,7 @@ long_mode_do_msr_write(unsigned int msr, uint64_t msr_content)
case MSR_LSTAR:
if ( !is_canonical_address(msr_content) )
goto uncanonical_address;
+
WRITE_MSR(LSTAR);
break;
@@ -707,6 +721,98 @@ static void vmx_fpu_leave(struct vcpu *v)
}
}
+static void vmx_depriv_setup_stacks(unsigned long stack_ptr)
+{
+ struct vcpu *vcpu = get_current();
+ struct tss_struct *tss = &this_cpu(init_tss);
+ unsigned char *stub_page;
+ unsigned long stub_va = this_cpu(stubs.addr);
+ unsigned int offset;
+
+ /* Save the current rsp0 */
+ vcpu->arch.hvm_vcpu.depriv_tss_rsp0 = tss->rsp0;
+
+ /* Setup the stack for interrupts/exceptions */
+ tss->rsp0 = stack_ptr;
+
+ /* Stacks for syscall and sysenter */
+ stub_page = map_domain_page(_mfn(this_cpu(stubs.mfn)));
+
+ offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
+ stub_va, stack_ptr,
+ (unsigned long)lstar_enter);
+
+ stub_va += offset;
+
+ if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL ||
+ boot_cpu_data.x86_vendor == X86_VENDOR_CENTAUR )
+ {
+ wrmsrl(MSR_IA32_SYSENTER_ESP, stack_ptr);
+ }
+
+ offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
+ stub_va, stack_ptr,
+ (unsigned long)cstar_enter);
+
+ /* Don't consume more than half of the stub space here. */
+ ASSERT(offset <= STUB_BUF_SIZE / 2);
+
+ unmap_domain_page(stub_page);
+}
+
+static void vmx_depriv_restore_stacks(void)
+{
+ struct vcpu* vcpu = get_current();
+ struct tss_struct *tss = &this_cpu(init_tss);
+ unsigned char *stub_page;
+ unsigned long stack_bottom = get_stack_bottom();
+ unsigned long stub_va = this_cpu(stubs.addr);
+ unsigned int offset;
+
+ stub_page = map_domain_page(_mfn(this_cpu(stubs.mfn)));
+
+ /* Restore the old rsp0 */
+ tss->rsp0 = vcpu->arch.hvm_vcpu.depriv_tss_rsp0;
+
+ /* Restore the old syscall/sysenter stacks */
+ offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
+ stub_va, stack_bottom,
+ (unsigned long)lstar_enter);
+ stub_va += offset;
+
+ wrmsrl(MSR_IA32_SYSENTER_ESP, stack_bottom);
+
+ /* Trampoline for SYSCALL entry from compatibility mode. */
+ offset += write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
+ stub_va, stack_bottom,
+ (unsigned long)cstar_enter);
+
+ /* Don't consume more than half of the stub space here. */
+ ASSERT(offset <= STUB_BUF_SIZE / 2);
+
+ unmap_domain_page(stub_page);
+}
+
+static void vmx_depriv_ctxt_switch_from(struct vcpu *v)
+{
+ vmx_ctxt_switch_to(v);
+ vcpu_save_fpu(v);
+
+ /* Restore the efer and saved msr registers */
+ write_efer(v->arch.hvm_vcpu.depriv_efer);
+}
+
+static void vmx_depriv_ctxt_switch_to(struct vcpu *v)
+{
+ vcpu_save_fpu(v);
+ vmx_ctxt_switch_from(v);
+
+ v->arch.hvm_vcpu.depriv_efer = read_efer();
+
+ /* Flip the SCE bit to allow sysret/call */
+ write_efer(v->arch.hvm_vcpu.depriv_efer | EFER_SCE);
+}
+
static void vmx_ctxt_switch_from(struct vcpu *v)
{
/*
@@ -3341,6 +3447,18 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
uint16_t port = (exit_qualification >> 16) & 0xFFFF;
int bytes = (exit_qualification & 0x07) + 1;
int dir = (exit_qualification & 0x08) ? IOREQ_READ : IOREQ_WRITE;
+
+ /* DEBUG: Run only for a specific port */
+ if(port == 0x1000)
+ {
+ if( guest_cpu_user_regs()->eax == 0x1)
+ {
+ hvm_deprivileged_user_mode();
+ }
+ update_guest_eip(); /* Safe: IN, OUT */
+ break;
+ }
+
if ( handle_pio(port, bytes, dir) )
update_guest_eip(); /* Safe: IN, OUT */
}
diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index 4048929..5633e82 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -40,7 +40,7 @@
#include <asm/domain.h>
#include <xen/numa.h>
#include <asm/hvm/nestedhvm.h>
-
+#include <asm/hvm/vmx/vmx.h>
#include "private.h"
#include <xen/hvm/deprivileged.h>
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index 447c650..7af824a 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -115,6 +115,11 @@ void __dummy__(void)
OFFSET(VCPU_nsvm_hap_enabled, struct vcpu, arch.hvm_vcpu.nvcpu.u.nsvm.ns_hap_enabled);
BLANK();
+ OFFSET(VCPU_depriv_rsp, struct vcpu, arch.hvm_vcpu.depriv_rsp);
+ OFFSET(VCPU_depriv_user_mode, struct vcpu, arch.hvm_vcpu.depriv_user_mode);
+ OFFSET(VCPU_depriv_destroy, struct vcpu, arch.hvm_vcpu.depriv_destroy);
+ BLANK();
+
OFFSET(DOMAIN_is_32bit_pv, struct domain, arch.is_32bit_pv);
BLANK();
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 74677a2..9590065 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -102,6 +102,18 @@ restore_all_xen:
RESTORE_ALL adj=8
iretq
+
+/* Returning from user mode */
+ENTRY(handle_hvm_user_mode)
+
+ call hvm_deprivileged_handle_user_mode
+
+ /* fallthrough */
+hvm_depriv_mode:
+
+ /* Go back into user mode */
+ jmp restore_all_guest
+
/*
* When entering SYSCALL from kernel mode:
* %rax = hypercall vector
@@ -128,6 +140,11 @@ ENTRY(lstar_enter)
pushq $0
SAVE_VOLATILE TRAP_syscall
GET_CURRENT(%rbx)
+
+ /* Were we in Xen's ring 3? */
+ cmpq $1, VCPU_depriv_user_mode(%rbx)
+ je handle_hvm_user_mode
+
testb $TF_kernel_mode,VCPU_thread_flags(%rbx)
jz switch_to_kernel
@@ -487,6 +504,10 @@ ENTRY(common_interrupt)
/* No special register assumptions. */
ENTRY(ret_from_intr)
GET_CURRENT(%rbx)
+
+ /* If we are in Xen's user mode */
+ cmpq $1,VCPU_depriv_user_mode(%rbx)
+ je hvm_depriv_mode
testb $3,UREGS_cs(%rsp)
jz restore_all_xen
movq VCPU_domain(%rbx),%rax
@@ -509,6 +530,10 @@ handle_exception_saved:
GET_CURRENT(%rbx)
PERFC_INCR(exceptions, %rax, %rbx)
callq *(%rdx,%rax,8)
+
+ /* If we are in Xen's user mode */
+ cmpq $1, VCPU_depriv_user_mode(%rbx)
+ je hvm_depriv_mode
testb $3,UREGS_cs(%rsp)
jz restore_all_xen
leaq VCPU_trap_bounce(%rbx),%rdx
@@ -636,15 +661,7 @@ ENTRY(nmi)
movl $TRAP_nmi,4(%rsp)
handle_ist_exception:
SAVE_ALL CLAC
- testb $3,UREGS_cs(%rsp)
- jz 1f
- /* Interrupted guest context. Copy the context to stack bottom. */
- GET_CPUINFO_FIELD(guest_cpu_user_regs,%rdi)
- movq %rsp,%rsi
- movl $UREGS_kernel_sizeof/8,%ecx
- movq %rdi,%rsp
- rep movsq
-1: movq %rsp,%rdi
+ movq %rsp,%rdi
movzbl UREGS_entry_vector(%rsp),%eax
leaq exception_table(%rip),%rdx
callq *(%rdx,%rax,8)
@@ -664,6 +681,9 @@ handle_ist_exception:
movl $EVENT_CHECK_VECTOR,%edi
call send_IPI_self
1: movq VCPU_domain(%rbx),%rax
+ /* This also handles Xen ring3 return for us.
+ * So, there is no need to explicitly do a user mode check.
+ */
cmpb $0,DOMAIN_is_32bit_pv(%rax)
je restore_all_guest
jmp compat_restore_all_guest
diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c
index 0846a19..c7e6077 100644
--- a/xen/arch/x86/x86_64/traps.c
+++ b/xen/arch/x86/x86_64/traps.c
@@ -24,6 +24,7 @@
#include <asm/hvm/hvm.h>
#include <asm/hvm/support.h>
#include <public/callback.h>
+#include <asm/hvm/svm/svm.h>
static void print_xen_info(void)
@@ -337,7 +338,7 @@ unsigned long do_iret(void)
return 0;
}
-static unsigned int write_stub_trampoline(
+unsigned int write_stub_trampoline(
unsigned char *stub, unsigned long stub_va,
unsigned long stack_bottom, unsigned long target_va)
{
@@ -368,8 +369,6 @@ static unsigned int write_stub_trampoline(
}
DEFINE_PER_CPU(struct stubs, stubs);
-void lstar_enter(void);
-void cstar_enter(void);
void __devinit subarch_percpu_traps_init(void)
{
@@ -385,6 +384,14 @@ void __devinit subarch_percpu_traps_init(void)
/* Trampoline for SYSCALL entry from 64-bit mode. */
wrmsrl(MSR_LSTAR, stub_va);
+
+ /*
+ * HVM deprivileged mode on AMD. The writes to MSR_{L}STAR
+ * are not trapped so we need to keep a copy of the host's MSRs.
+ */
+ svm_depriv_write_msr_star((unsigned long)((FLAT_RING3_CS32<<16) | __HYPERVISOR_CS) << 32);
+ svm_depriv_write_msr_lstar(stub_va);
+
offset = write_stub_trampoline(stub_page + (stub_va & ~PAGE_MASK),
stub_va, stack_bottom,
(unsigned long)lstar_enter);
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index f011d2d..c1dae3a 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -23,6 +23,8 @@
* 2 - MCE IST stack
* 1 - NMI IST stack
* 0 - Double Fault IST stack
+ *
+ * NOTE: This layout changes slightly in HVM deprivileged mode.
*/
/*
diff --git a/xen/include/asm-x86/hvm/svm/svm.h b/xen/include/asm-x86/hvm/svm/svm.h
index d60ec23..45dd125 100644
--- a/xen/include/asm-x86/hvm/svm/svm.h
+++ b/xen/include/asm-x86/hvm/svm/svm.h
@@ -110,4 +110,17 @@ extern void svm_host_osvw_init(void);
#define _NPT_PFEC_in_gpt 33
#define NPT_PFEC_in_gpt (1UL<<_NPT_PFEC_in_gpt)
+/*
+ * HVM deprivileged mode SVM cache of the host MSR_{L}STARs.
+ * SVM does not trap guest writes to these so we
+ * need to preserve them.
+ */
+DECLARE_PER_CPU(u64, svm_depriv_msr_lstar);
+DECLARE_PER_CPU(u64, svm_depriv_msr_star);
+
+unsigned long svm_depriv_read_msr_star(void);
+void svm_depriv_write_msr_star(unsigned long star);
+unsigned long svm_depriv_read_msr_lstar(void);
+void svm_depriv_write_msr_lstar(unsigned long lstar);
+
#endif /* __ASM_X86_HVM_SVM_H__ */
diff --git a/xen/include/asm-x86/hvm/vcpu.h b/xen/include/asm-x86/hvm/vcpu.h
index f553814..f7df9d4 100644
--- a/xen/include/asm-x86/hvm/vcpu.h
+++ b/xen/include/asm-x86/hvm/vcpu.h
@@ -202,6 +202,21 @@ struct hvm_vcpu {
void (*fpu_exception_callback)(void *, struct cpu_user_regs *);
void *fpu_exception_callback_arg;
+ /* Context switching for HVM deprivileged mode */
+ void (*depriv_ctxt_switch_to)(struct vcpu *v);
+ void (*depriv_ctxt_switch_from)(struct vcpu *v);
+ void (*depriv_setup_stacks)(unsigned long stack_ptr);
+ void (*depriv_restore_stacks)(void);
+
+ /* HVM deprivileged mode state */
+ struct segment_register depriv_tr;
+ unsigned long depriv_rsp; /* rsp of our stack to restore our data to */
+ unsigned long depriv_user_mode; /* Are we in user mode */
+ unsigned long depriv_efer;
+ unsigned long depriv_tss_rsp0;
+ unsigned long depriv_destroy;
+ unsigned long depriv_watchdog_count;
+
/* Pending hw/sw interrupt (.vector = -1 means nothing pending). */
struct hvm_trap inject_trap;
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 3fbfa44..98e269e 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -565,4 +565,6 @@ typedef struct {
u16 eptp_index;
} ve_info_t;
+struct vmx_msr_state *get_host_msr_state(void);
+
#endif /* __ASM_X86_HVM_VMX_VMX_H__ */
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index f507f5e..0fde516 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -547,6 +547,8 @@ void sysenter_entry(void);
void sysenter_eflags_saved(void);
void compat_hypercall(void);
void int80_direct_trap(void);
+void lstar_enter(void);
+void cstar_enter(void);
#define STUBS_PER_PAGE (PAGE_SIZE / STUB_BUF_SIZE)
diff --git a/xen/include/asm-x86/system.h b/xen/include/asm-x86/system.h
index 25a6a2a..e092f36 100644
--- a/xen/include/asm-x86/system.h
+++ b/xen/include/asm-x86/system.h
@@ -240,5 +240,8 @@ void init_idt_traps(void);
void load_system_tables(void);
void percpu_traps_init(void);
void subarch_percpu_traps_init(void);
+unsigned int write_stub_trampoline(
+ unsigned char *stub, unsigned long stub_va,
+ unsigned long stack_bottom, unsigned long target_va);
#endif
diff --git a/xen/include/xen/hvm/deprivileged.h b/xen/include/xen/hvm/deprivileged.h
index defc89d..5915224 100644
--- a/xen/include/xen/hvm/deprivileged.h
+++ b/xen/include/xen/hvm/deprivileged.h
@@ -1,5 +1,7 @@
#ifndef __X86_HVM_DEPRIVILEGED
+/* This is also included in the HVM deprivileged mode .S file */
+#ifndef __ASSEMBLY__
#define __X86_HVM_DEPRIVILEGED
#include <asm/page.h>
@@ -75,11 +77,46 @@ int hvm_deprivileged_map_l1(struct domain *d,
/* Used to allocate a page for the deprivileged mode */
struct page_info *hvm_deprivileged_alloc_page(struct domain *d);
+/* Used to prepare each vcpu's data for user mode. Call for each HVM vcpu. */
+int hvm_deprivileged_prepare_vcpu(struct vcpu *vcpu);
+
+/* Destroy each vcpu's data for Xen user mode. Again, call for each vcpu. */
+void hvm_deprivileged_destroy_vcpu(struct vcpu *vcpu);
+
+/* Called to perform a user mode operation. */
+void hvm_deprivileged_user_mode(void);
+
+/* Called when the user mode operation has completed */
+void hvm_deprivileged_finish_user_mode(void);
+
+/* Called to move into and then out of user mode. Needed for accessing
+ * assembly features.
+ */
+void hvm_deprivileged_user_mode_asm(void);
+
+/* Called on the return path to return to the correct execution point */
+void hvm_deprivileged_finish_user_mode_asm(void);
+
+/* Handle any syscalls that the user mode makes */
+void hvm_deprivileged_handle_user_mode(void);
+
+/* Use to setup the stacks for deprivileged mode */
+void hvm_deprivileged_setup_stacks(unsigned long stack_ptr);
+
+/* Use to restore the stacks for deprivileged mode */
+void hvm_deprivileged_restore_stacks(void);
+
+/* The ring 3 code */
+void hvm_deprivileged_ring3(void);
+
/* The segments where the user mode .text and .data are stored */
extern unsigned long __hvm_deprivileged_text_start[];
extern unsigned long __hvm_deprivileged_text_end[];
extern unsigned long __hvm_deprivileged_data_start[];
extern unsigned long __hvm_deprivileged_data_end[];
+
+#endif
+
#define HVM_DEPRIV_STACK_SIZE (PAGE_SIZE << 1)
#define HVM_DEPRIV_STACK_ORDER 1
#define HVM_DEPRIV_DATA_SECTION_SIZE \
@@ -92,4 +129,12 @@ extern unsigned long __hvm_deprivileged_data_end[];
#define HVM_DEPRIV_ALIAS 1
#define HVM_DEPRIV_COPY 0
+/*
+ * The user mode stack pointer.
+ * The stack grows down so set this to top of the stack region. Then,
+ * as this is 0-indexed, move into the stack, not just after it.
+ * Subtract 16 bytes for correct stack alignment.
+ */
+#define HVM_STACK_PTR (HVM_DEPRIVILEGED_STACK_ADDR + HVM_DEPRIV_STACK_SIZE - 16)
+
#endif
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 66f4f5e..6c05969 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -137,7 +137,7 @@ void evtchn_destroy_final(struct domain *d); /* from complete_domain_destroy */
struct waitqueue_vcpu;
-struct vcpu
+struct vcpu
{
int vcpu_id;
@@ -158,6 +158,22 @@ struct vcpu
void *sched_priv; /* scheduler-specific data */
+ /* HVM deprivileged mode state */
+ void *stack; /* Location of stack to save data onto */
+ unsigned long rsp; /* rsp of our stack to restore our data to */
+ unsigned long user_mode; /* Are we in (or moving into) user mode? */
+
+ /* The MSR_LSTAR/MSR_STAR of the processor we are currently executing on.
+ * We need to save these because Xen does lazy saving of them.
+ */
+ unsigned long int msr_lstar; /* lstar */
+ unsigned long int msr_star;
+
+ /* Debug info */
+ unsigned long int old_rsp;
+ unsigned long int old_processor;
+ unsigned long int old_msr_lstar;
+ unsigned long int old_msr_star;
struct vcpu_runstate_info runstate;
#ifndef CONFIG_COMPAT
# define runstate_guest(v) ((v)->runstate_guest)
--
2.1.4
* [PATCH RFC v3 3/6] HVM x86 deprivileged mode: Trap handlers for deprivileged mode
2015-09-11 16:08 [PATCH RFC v3 0/6] HVM x86 deprivileged mode summary Ben Catterall
2015-09-11 16:08 ` [PATCH RFC v3 1/6] HVM x86 deprivileged mode: Create deprivileged page tables Ben Catterall
2015-09-11 16:08 ` [PATCH RFC v3 2/6] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode Ben Catterall
@ 2015-09-11 16:08 ` Ben Catterall
2015-09-11 16:08 ` [PATCH RFC v3 4/6] HVM x86 deprivileged mode: Watchdog for DoS prevention Ben Catterall
` (4 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Ben Catterall @ 2015-09-11 16:08 UTC (permalink / raw)
To: xen-devel
Cc: keir, ian.campbell, george.dunlap, andrew.cooper3, tim, jbeulich,
Aravind.Gopalakrishnan, suravee.suthikulpanit, Ben Catterall,
boris.ostrovsky
Added trap handlers to catch exceptions such as a page fault, general
protection fault, etc. These handlers will crash the domain as such exceptions
would indicate that either there is a bug in deprivileged mode or it has been
compromised by an attacker.
On calling a domain_crash() whilst in deprivileged mode, we need to restore
the host's context so that we do not have guest-defined registers and values
in use after this point due to lazy loading of these values in the SVM and VMX
implementations.
Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
Changed since v1
----------------
* Changed to domain_crash(), domain_crash_synchronous was used previously.
* Updated to perform a HVM context switch on crashing a domain
* Updated hvm_deprivileged_check_trap() to return a testable error
code and return based on this.
Changed since v2
----------------
* Coding style: Added space after if, for, etc.
* hvm_deprivileged_user_mode() now returns a value to indicate success or
failure.
---
xen/arch/x86/hvm/deprivileged.c | 70 +++++++++++++++++++++++++++++++++++++-
xen/arch/x86/traps.c | 55 ++++++++++++++++++++++++++++++
xen/include/xen/hvm/deprivileged.h | 25 +++++++++++++-
3 files changed, 148 insertions(+), 2 deletions(-)
diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
index 5574c50..68c40ad 100644
--- a/xen/arch/x86/hvm/deprivileged.c
+++ b/xen/arch/x86/hvm/deprivileged.c
@@ -560,7 +560,7 @@ void hvm_deprivileged_destroy_vcpu(struct vcpu *vcpu)
* This method is then jumped into to restore execution context after
* exiting user mode.
*/
-void hvm_deprivileged_user_mode(void)
+int hvm_deprivileged_user_mode(void)
{
struct vcpu *vcpu = get_current();
@@ -576,6 +576,20 @@ void hvm_deprivileged_user_mode(void)
vcpu->arch.hvm_vcpu.depriv_user_mode = 0;
vcpu->arch.hvm_vcpu.depriv_rsp = 0;
+
+ /*
+ * If we need to crash the domain at this point, we will return up the call
+ * stack, undoing any allocations; the event testers in the exit
+ * assembly stubs will then test for the SOFTIRQ_TIMER event generated by
+ * domain_crash and will crash the domain for us.
+ */
+ if ( vcpu->arch.hvm_vcpu.depriv_destroy )
+ {
+ domain_crash(vcpu->domain);
+ return 1;
+ }
+
+ return 0;
}
/*
@@ -639,3 +653,57 @@ void hvm_deprivileged_finish_user_mode(void)
hvm_deprivileged_finish_user_mode_asm();
}
+
+/* Check if we are in deprivileged mode */
+int is_hvm_deprivileged_vcpu(void)
+{
+ struct vcpu *v = get_current();
+
+ if ( is_hvm_vcpu(v) && (v->arch.hvm_vcpu.depriv_user_mode) )
+ return 1;
+
+ return 0;
+}
+
+/*
+ * Crash the domain. This should not be called if there are any memory
+ * allocations which will be freed by code following its invocation in the
+ * current execution context (current stack). This is because it causes a
+ * permanent 'context switch' and the current stack will be clobbered so
+ * any allocations made which are not freed by other paths will leak.
+ * This function should only be used after deprivileged mode has been
+ * successfully switched into, otherwise, the normal domain_crash function
+ * should be used.
+ *
+ * The domain which is crashed is that of the current vcpu.
+ *
+ * To crash the domain, we need to return to our privileged stack as we may have
+ * memory allocations which need to be cleaned up. Then, after we have returned
+ * to this stack, we can then crash the domain. We set a flag which we check
+ * when returning.
+ */
+void hvm_deprivileged_crash_domain(const char *reason)
+{
+ struct vcpu *vcpu = get_current();
+
+ vcpu->arch.hvm_vcpu.depriv_destroy = 1;
+
+ printk(XENLOG_ERR "HVM Deprivileged Mode: Crashing domain. Reason: %s\n",
+ reason);
+
+ /*
+ * Restore the processor's state. We need to do the privileged return
+ * path to undo any allocations that got us to this state
+ */
+ hvm_deprivileged_finish_user_mode();
+ /* DOES NOT RETURN */
+}
+
+/* Handle a trap event */
+int hvm_deprivileged_check_trap(const char* func_name)
+{
+ if ( is_hvm_deprivileged_vcpu() )
+ hvm_deprivileged_crash_domain(func_name);
+
+ return 0;
+}
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 9f5a6c6..f14a845 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -74,6 +74,7 @@
#include <asm/vpmu.h>
#include <public/arch-x86/cpuid.h>
#include <xsm/xsm.h>
+#include <xen/hvm/deprivileged.h>
/*
* opt_nmi: one of 'ignore', 'dom0', or 'fatal'.
@@ -500,6 +501,13 @@ static void do_guest_trap(
struct trap_bounce *tb;
const struct trap_info *ti;
+ /*
+ * If we take the trap whilst in HVM deprivileged mode
+ * then we should crash the domain.
+ */
+ if ( hvm_deprivileged_check_trap(__func__) )
+ return;
+
trace_pv_trap(trapnr, regs->eip, use_error_code, regs->error_code);
tb = &v->arch.pv_vcpu.trap_bounce;
@@ -617,6 +625,13 @@ static void do_trap(struct cpu_user_regs *regs, int use_error_code)
DEBUGGER_trap_entry(trapnr, regs);
+ /*
+ * If we take the trap whilst in HVM deprivileged mode
+ * then we should crash the domain.
+ */
+ if ( hvm_deprivileged_check_trap(__func__) )
+ return;
+
if ( guest_mode(regs) )
{
do_guest_trap(trapnr, regs, use_error_code);
@@ -1070,6 +1085,13 @@ void do_invalid_op(struct cpu_user_regs *regs)
DEBUGGER_trap_entry(TRAP_invalid_op, regs);
+ /*
+ * If we take the trap whilst in HVM deprivileged mode
+ * then we should crash the domain.
+ */
+ if ( hvm_deprivileged_check_trap(__func__) )
+ return;
+
if ( likely(guest_mode(regs)) )
{
if ( !emulate_invalid_rdtscp(regs) &&
@@ -1159,6 +1181,13 @@ void do_int3(struct cpu_user_regs *regs)
{
DEBUGGER_trap_entry(TRAP_int3, regs);
+ /*
+ * If we take the trap whilst in HVM deprivileged mode
+ * then we should crash the domain.
+ */
+ if ( hvm_deprivileged_check_trap(__func__) )
+ return;
+
if ( !guest_mode(regs) )
{
debugger_trap_fatal(TRAP_int3, regs);
@@ -1495,9 +1524,14 @@ void do_page_fault(struct cpu_user_regs *regs)
perfc_incr(page_faults);
+ /* If we get a page fault whilst in HVM deprivileged mode */
+ if( hvm_deprivileged_check_trap(__func__) )
+ return;
+
if ( unlikely(fixup_page_fault(addr, regs) != 0) )
return;
+
if ( unlikely(!guest_mode(regs)) )
{
pf_type = spurious_page_fault(addr, regs);
@@ -3225,6 +3259,13 @@ void do_general_protection(struct cpu_user_regs *regs)
DEBUGGER_trap_entry(TRAP_gp_fault, regs);
+ /*
+ * If we take the trap whilst in HVM deprivileged mode
+ * then we should crash the domain.
+ */
+ if ( hvm_deprivileged_check_trap(__func__) )
+ return;
+
if ( regs->error_code & 1 )
goto hardware_gp;
@@ -3490,6 +3531,13 @@ void do_device_not_available(struct cpu_user_regs *regs)
BUG_ON(!guest_mode(regs));
+ /*
+ * If we take the trap whilst in HVM deprivileged mode
+ * then we should crash the domain.
+ */
+ if ( hvm_deprivileged_check_trap(__func__) )
+ return;
+
vcpu_restore_fpu_lazy(curr);
if ( curr->arch.pv_vcpu.ctrlreg[0] & X86_CR0_TS )
@@ -3531,6 +3579,13 @@ void do_debug(struct cpu_user_regs *regs)
DEBUGGER_trap_entry(TRAP_debug, regs);
+ /*
+ * If we take the trap whilst in HVM deprivileged mode
+ * then we should crash the domain.
+ */
+ if ( hvm_deprivileged_check_trap(__func__) )
+ return;
+
if ( !guest_mode(regs) )
{
if ( regs->eflags & X86_EFLAGS_TF )
diff --git a/xen/include/xen/hvm/deprivileged.h b/xen/include/xen/hvm/deprivileged.h
index 5915224..b6e575d 100644
--- a/xen/include/xen/hvm/deprivileged.h
+++ b/xen/include/xen/hvm/deprivileged.h
@@ -84,7 +84,7 @@ int hvm_deprivileged_prepare_vcpu(struct vcpu *vcpu);
void hvm_deprivileged_destroy_vcpu(struct vcpu *vcpu);
/* Called to perform a user mode operation. */
-void hvm_deprivileged_user_mode(void);
+int hvm_deprivileged_user_mode(void);
/* Called when the user mode operation has completed */
void hvm_deprivileged_finish_user_mode(void);
@@ -106,9 +106,32 @@ void hvm_deprivileged_setup_stacks(unsigned long stack_ptr);
/* Use to restore the stacks for deprivileged mode */
void hvm_deprivileged_restore_stacks(void);
+/* Check if we are in deprivileged mode */
+int is_hvm_deprivileged_vcpu(void);
+
/* The ring 3 code */
void hvm_deprivileged_ring3(void);
+/*
+ * Crash the domain. This should not be called if there are any memory
+ * allocations which will be freed by code following its invocation in the
+ * current execution context (current stack). This is because it causes a
+ * permanent 'context switch' and the current stack will be clobbered so
+ * any allocations made which are not freed by other paths will leak.
+ * This function should only be used after deprivileged mode has been
+ * successfully switched into, otherwise, the normal domain_crash function
+ * should be used.
+ *
+ * The domain which is crashed is that of the current vcpu.
+ */
+void hvm_deprivileged_crash_domain(const char *reason);
+
+/*
+ * Call when inside a trap that should cause a domain crash if in user mode
+ * e.g. an invalid_op is trapped whilst in user mode.
+ */
+int hvm_deprivileged_check_trap(const char* func_name);
+
/* The segments where the user mode .text and .data are stored */
extern unsigned long __hvm_deprivileged_text_start[];
extern unsigned long __hvm_deprivileged_text_end[];
--
2.1.4
* [PATCH RFC v3 4/6] HVM x86 deprivileged mode: Watchdog for DoS prevention
2015-09-11 16:08 [PATCH RFC v3 0/6] HVM x86 deprivileged mode summary Ben Catterall
` (2 preceding siblings ...)
2015-09-11 16:08 ` [PATCH RFC v3 3/6] HVM x86 deprivileged mode: Trap handlers for " Ben Catterall
@ 2015-09-11 16:08 ` Ben Catterall
2015-09-11 16:08 ` [PATCH RFC v3 5/6] HVM x86 deprivileged mode: Syscall and deprivileged operation dispatcher Ben Catterall
` (3 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Ben Catterall @ 2015-09-11 16:08 UTC (permalink / raw)
To: xen-devel
Cc: keir, ian.campbell, george.dunlap, andrew.cooper3, tim, jbeulich,
Aravind.Gopalakrishnan, suravee.suthikulpanit, Ben Catterall,
boris.ostrovsky
A watchdog timer is used to prevent the deprivileged mode running for too long,
aimed at handling a bug or attempted DoS. If the watchdog has occurred more than
once whilst we have been in the same deprivileged mode context, then we crash
the domain. This can be adjusted for longer running times in future.
Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
Changed since v2:
* Coding style: Added space after if
---
xen/arch/x86/hvm/deprivileged.c | 4 ++++
xen/arch/x86/nmi.c | 17 +++++++++++++++++
2 files changed, 21 insertions(+)
diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
index 68c40ad..0b02065 100644
--- a/xen/arch/x86/hvm/deprivileged.c
+++ b/xen/arch/x86/hvm/deprivileged.c
@@ -8,6 +8,7 @@
#include <xen/config.h>
#include <xen/types.h>
#include <xen/sched.h>
+#include <xen/watchdog.h>
#include <asm/paging.h>
#include <xen/compiler.h>
#include <asm/hap.h>
@@ -17,6 +18,7 @@
#include <xen/domain_page.h>
#include <asm/hvm/vmx/vmx.h>
#include <xen/hvm/deprivileged.h>
+#include <xen/hvm/deprivileged_syscall.h>
void hvm_deprivileged_init(struct domain *d, l4_pgentry_t *l4t_base)
{
@@ -577,6 +579,8 @@ int hvm_deprivileged_user_mode(void)
vcpu->arch.hvm_vcpu.depriv_user_mode = 0;
vcpu->arch.hvm_vcpu.depriv_rsp = 0;
+ vcpu->arch.hvm_vcpu.depriv_watchdog_count = 0;
+
/*
* If we need to crash the domain at this point, we will return up the call
* stack, undoing any allocations; the event testers in the exit
diff --git a/xen/arch/x86/nmi.c b/xen/arch/x86/nmi.c
index 2ab97a0..e5598a2 100644
--- a/xen/arch/x86/nmi.c
+++ b/xen/arch/x86/nmi.c
@@ -26,6 +26,7 @@
#include <xen/smp.h>
#include <xen/keyhandler.h>
#include <xen/cpu.h>
+#include <xen/hvm/deprivileged.h>
#include <asm/current.h>
#include <asm/mc146818rtc.h>
#include <asm/msr.h>
@@ -463,9 +464,25 @@ int __init watchdog_setup(void)
/* Returns false if this was not a watchdog NMI, true otherwise */
bool_t nmi_watchdog_tick(const struct cpu_user_regs *regs)
{
+ struct vcpu *vcpu = current;
bool_t watchdog_tick = 1;
unsigned int sum = this_cpu(nmi_timer_ticks);
+ /*
+ * If the domain has been running in deprivileged mode for two watchdog
+ * ticks, then we kill it to prevent a DoS. We use two ticks as a coarse
+ * measure as this ensures that at least a full watchdog tick duration has
+ * occurred. This means that we do not need to track entry time and do
+ * time calculations.
+ */
+ if ( is_hvm_deprivileged_vcpu() )
+ {
+ if ( vcpu->arch.hvm_vcpu.depriv_watchdog_count )
+ hvm_deprivileged_crash_domain("HVM Deprivileged domain: Domain exceeded running time.");
+ else
+ vcpu->arch.hvm_vcpu.depriv_watchdog_count = 1;
+ }
+
if ( (this_cpu(last_irq_sums) == sum) && watchdog_enabled() )
{
/*
--
2.1.4
* [PATCH RFC v3 5/6] HVM x86 deprivileged mode: Syscall and deprivileged operation dispatcher
2015-09-11 16:08 [PATCH RFC v3 0/6] HVM x86 deprivileged mode summary Ben Catterall
` (3 preceding siblings ...)
2015-09-11 16:08 ` [PATCH RFC v3 4/6] HVM x86 deprivileged mode: Watchdog for DoS prevention Ben Catterall
@ 2015-09-11 16:08 ` Ben Catterall
2015-09-11 16:08 ` [PATCH RFC v3 6/6] HVM x86 deprivileged mode: Move VPIC to deprivileged mode Ben Catterall
` (2 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Ben Catterall @ 2015-09-11 16:08 UTC (permalink / raw)
To: xen-devel
Cc: keir, ian.campbell, george.dunlap, andrew.cooper3, tim, jbeulich,
Aravind.Gopalakrishnan, suravee.suthikulpanit, Ben Catterall,
boris.ostrovsky
We have two operations:
1) dispatching a deprivileged mode operation
2) deprivileged mode executing a system call
For (1):
We have a table of methods which can be dispatched. All deprivileged mode
methods which can be dispatched need to be in this array. This aims to
prevent dispatching functions which are not designed for deprivileged mode
and means that we do not dispatch on an arbitrary pointer. We then dispatch
to the function pointer stored in this array. This goes via an assembly stub
in deprivileged mode which calls the function and then issues a syscall to
return to privileged mode when the operation completes. This allows the
deprivileged function to return normally.
For (2):
We again have a table of methods which are the syscall handlers. All
system calls which we handle need to be in this table. Deprivileged mode
passes an integer to select which operation to call. The system call is
wrapped to marshall the parameters as necessary and then jumps to a stub to
issue the syscall.
Data transfer for (1):
To pass data to deprivileged mode, we can pass up to five integer or
pointer parameters in registers. This is thanks to the 64-bit Linux calling
convention which puts these into 64-bit registers. The dispatch code takes
these parameters and arranges them so that when the deprivileged mode
operation executes, they are in the registers specified by the calling
convention. This means that it is transparent to the operation that it was
not invoked by a function call.
To pass the data which pointers correspond to, we use the deprivileged data
section. We copy this data to the section and change the pointer so that it
points into this section. Any extra parameters are also copied into the
data section.
To return data from deprivileged mode, the operation can supply a return
value which we pass through back to the caller. If extra data is needed,
which may be needed to make logical decisions after invocation of the
operation, then this is placed at the end of the data section. The caller
of the operation can then access this data. We copy back the data which we
initially copied in so that the caller sees any changes made by the callee.
NOTE: You need to handle the case where these structures can be updated
whilst in deprivileged mode.
It is necessary to clear out the data page between deprivileged mode
operations to prevent data leakage between operations which _may_ be
useful to an attacker.
Data transfer for (2):
We need to transfer data to the syscall handler and then back to the
deprivileged mode operation. To pass data, we use the same method as in (1)
for the first five parameters. For extra data, this will be placed at the
end of the data section and will be fetched by the handler. We also use
the same method as in (1) for passing data back to the operation.
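As a rough sketch of the copy-based data transfer described above (struct
frob_state, frob_state() and DEPRIV_OPERATION_frob_state are invented purely
for illustration; only the copy helpers and depriv1() come from this series),
a privileged-side wrapper might look like:

    /* Hypothetical wrapper; needs xen/sched.h and
     * xen/hvm/deprivileged_syscall.h for the helpers used below.
     */
    int frob_state(struct frob_state *s)
    {
        struct vcpu *v = get_current();
        struct frob_state *alias;
        int ret;

        /* Copy the caller's structure into the deprivileged data section
         * and rewrite the pointer to point at that copy.
         */
        alias = hvm_deprivileged_copy_data_to(v, s, sizeof(*s));

        /* Dispatch the deprivileged version of the operation. */
        ret = depriv1(DEPRIV_OPERATION_frob_state, (register_t)alias);

        /* Copy back any changes made in deprivileged mode; the helper also
         * clears the copied region to prevent leakage between operations.
         */
        hvm_deprivileged_copy_data_from(v, s, alias, sizeof(*s));

        /* Use the return code to decide on any further processing. */
        return ret;
    }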
The general process to create a deprivileged mode operation is as follows (see the sketch below):
- Keep the old method prototype the same so that callers do not need to be
modified. This helps to reduce the impact of this feature on the rest of the
code base
- Move the old code into a new depriv_F version of the function.
- Marshall and unmarshall arguments as needed in the old function
- Call the depriv version using depriv#n(F, params) function which is a wrapper
around hvm_deprivileged_user_mode(F, params) in case we want to change this
interface later or need better/extra argument marshalling.
- Use the return code to work out what further processing is needed then return
- Add an entry into the depriv_operation_table and add an operation number
With this done, there are no edits which need to be made to callers. If aliasing
of data is added to the feature, then this may no longer be the case.
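A minimal sketch of that process (widget_op, depriv_widget_op and
DEPRIV_OPERATION_widget_op are invented names; MAKE_DEPRIV2, depriv2 and
DEPRIV_OPERATION are the macros and wrappers added by this patch):

    /* Declare the original prototype plus a depriv_ version that is placed
     * in the deprivileged .text section via DEPRIV_TEXT_SEGMENT.
     */
    MAKE_DEPRIV2(int, widget_op, int, a, int, b)

    /* Callers keep calling widget_op() unchanged; it only marshalls the
     * arguments and dispatches the deprivileged version.
     */
    int widget_op(int a, int b)
    {
        return depriv2(DEPRIV_OPERATION_widget_op, a, b);
    }

    /* The old function body moves here and now runs in ring 3. */
    int depriv_widget_op(int a, int b)
    {
        return a + b;
    }

    /* Finally, add an entry to depriv_operation_table in deprivileged.c:
     *     DEPRIV_OPERATION(widget_op, 2),
     * plus a matching DEPRIV_OPERATION_widget_op number in
     * deprivileged_syscall.h.
     */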
The process to create a syscall is as follows (again, see the sketch below):
- Create a syscall with a name do_depriv_* using the depriv_syscall_t type
- Write the syscall body
- Return a result to depriv mode
- Add an entry to the depriv_syscall_table and create a syscall number
Syscalls are made using DEPRIV_SYSCALL_CALL(op, ret, params) which
takes the operation number, the return variable and the parameters for the
system call, executes the system call using the Linux 64-bit calling convention
and then sets ret to the return value.
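A matching sketch for the syscall side (do_depriv_log_event,
DEPRIV_SYSCALL_log_event and depriv_notify_op are invented for illustration;
DEPRIV_SYSCALL and DEPRIV_SYSCALL_CALL2 come from this patch):

    /* The privileged handler; it must be named do_depriv_<name> so that
     * DEPRIV_SYSCALL() can build the table entry.
     */
    register_t do_depriv_log_event(register_t event, register_t data)
    {
        printk(XENLOG_DEBUG "depriv: event %lu data %lu\n", event, data);
        return 0;
    }

    /* Add it to depriv_syscall_table in deprivileged_syscall.c:
     *     DEPRIV_SYSCALL(log_event, 2),
     * with a DEPRIV_SYSCALL_log_event number in deprivileged_syscall.h.
     */

    /* Then, from within a deprivileged mode operation (declared via
     * MAKE_DEPRIV2 as in the previous sketch):
     */
    int depriv_notify_op(int a, int b)
    {
        register_t ret;

        DEPRIV_SYSCALL_CALL2(DEPRIV_SYSCALL_log_event, ret, a, b);

        return (int)ret;
    }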
TODO:
-----
- Alias data for deprivileged mode. There is a large comment at the top of
deprivileged_syscall.c which outlines considerations.
- Check if we need to map_domain_page the pages when we do the copy in
hvm_deprivileged_copy_data{to/from}
- Check for unsigned integer wrapping on addition in
hvm_deprivileged_copy_data_{to/from}
- Move hvm_deprivileged_syscall into the syscall macro. It's a stub and
unless extra code is needed there it can be folded into the macro.
- Check maintainers' thoughts on the deprivileged mode function checks in
hvm_deprivileged_user_mode. See the TODO comment.
We copy the data for ease of implementation and for small enough
structures, this is acceptable. For larger structures, or ones
which can be updated whilst deprivileged mode uses them, it would be
better to alias them. This would require the caller to provide the data on
separate pages (so that only the required data is passed in, we don't want
deprivileged mode to be able to access any other Xen data). We can then
alias these through a page table mapping. It would make sense to
preallocate a set of pages in the monitor table to do this, so that, when
aliasing, we just need to switch the mfn on the L1 page table, rather than
allocating and mapping in a whole new paging hierarchy. Then, we only
need to invalidate those L1 page table TLB entries when we exit the mode.
Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
---
xen/arch/x86/hvm/Makefile | 1 +
xen/arch/x86/hvm/deprivileged.c | 56 +++++-
xen/arch/x86/hvm/deprivileged_asm.S | 170 +++++++++++++-----
xen/arch/x86/hvm/deprivileged_syscall.c | 277 +++++++++++++++++++++++++++++
xen/arch/x86/hvm/svm/svm.c | 2 +-
xen/arch/x86/hvm/vmx/vmx.c | 1 -
xen/arch/x86/x86_64/asm-offsets.c | 1 +
xen/arch/x86/x86_64/entry.S | 8 +-
xen/include/asm-x86/hvm/vcpu.h | 5 +-
xen/include/xen/hvm/deprivileged.h | 17 +-
xen/include/xen/hvm/deprivileged_syscall.h | 200 +++++++++++++++++++++
11 files changed, 686 insertions(+), 52 deletions(-)
create mode 100644 xen/arch/x86/hvm/deprivileged_syscall.c
create mode 100644 xen/include/xen/hvm/deprivileged_syscall.h
diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
index e16960a..cf93e3e 100644
--- a/xen/arch/x86/hvm/Makefile
+++ b/xen/arch/x86/hvm/Makefile
@@ -4,6 +4,7 @@ subdir-y += vmx
obj-y += asid.o
obj-y += deprivileged.o
obj-y += deprivileged_asm.o
+obj-y += deprivileged_syscall.o
obj-y += emulate.o
obj-y += event.o
obj-y += hpet.o
diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
index 0b02065..5606f9a 100644
--- a/xen/arch/x86/hvm/deprivileged.c
+++ b/xen/arch/x86/hvm/deprivileged.c
@@ -20,6 +20,9 @@
#include <xen/hvm/deprivileged.h>
#include <xen/hvm/deprivileged_syscall.h>
+static depriv_syscall_t depriv_operation_table[] = {
+};
+
void hvm_deprivileged_init(struct domain *d, l4_pgentry_t *l4t_base)
{
void *p;
@@ -562,17 +565,66 @@ void hvm_deprivileged_destroy_vcpu(struct vcpu *vcpu)
* This method is then jumped into to restore execution context after
* exiting user mode.
*/
-int hvm_deprivileged_user_mode(void)
+int hvm_deprivileged_user_mode(unsigned long operation, register_t a,
+ register_t b, register_t c, register_t d,
+ register_t e)
{
struct vcpu *vcpu = get_current();
+ depriv_syscall_fn_t depriv_f;
+ depriv_syscall_fn_t f;
+ unsigned long offset;
ASSERT( vcpu->arch.hvm_vcpu.depriv_user_mode == 0 );
ASSERT( vcpu->arch.hvm_vcpu.depriv_rsp == 0 );
+ printk("OP: %lx size: %lx\n", operation, ARRAY_SIZE(depriv_operation_table));
+ ASSERT( operation < ARRAY_SIZE(depriv_operation_table) );
+
+ vcpu->arch.hvm_vcpu.depriv_return_code = 0;
+
+ /* Invalid operation? */
+ if ( operation >= ARRAY_SIZE(depriv_operation_table) ) {
+ domain_crash(vcpu->domain);
+ return HVM_DISPATCH_ERR;
+ }
+
+ f = depriv_operation_table[operation].fn;
+
+ /*
+ * f needs to be a depriv mode function so needs to be in the deprivileged
+ * text segment. This also checks for unsigned integer wrapping on
+ * subtraction. This is probably not necessary as we've indexed via the
+ * array to get the function address which shouldn't be user controllable
+ * so shouldn't represent a security concern. Especially as we only call
+ * the function in deprivileged mode so it cannot access privileged mode
+ * code but, better safe than sorry...
+ * TODO: See what maintainers think
+ */
+ if ( (unsigned long)f < (unsigned long)__hvm_deprivileged_text_start ||
+ (unsigned long)f >= (unsigned long)__hvm_deprivileged_text_end )
+ {
+ domain_crash(vcpu->domain);
+ return HVM_DISPATCH_ERR;
+ }
+
+ /*
+ * Calculate the offset and then test for unsigned integer wrapping on
+ * addition.
+ */
+ offset = (unsigned long)f - (unsigned long)__hvm_deprivileged_text_start;
+
+ if ( ULONG_MAX - offset < (unsigned long)HVM_DEPRIVILEGED_TEXT_ADDR )
+ {
+ domain_crash(vcpu->domain);
+ return HVM_DISPATCH_ERR;
+ }
+ depriv_f = (depriv_syscall_fn_t)(offset +
+ (unsigned long)HVM_DEPRIVILEGED_TEXT_ADDR);
+ printk("FUNCTION: %lx\n", (unsigned long)depriv_f);
vcpu->arch.hvm_vcpu.depriv_ctxt_switch_to(vcpu);
/* The assembly routine to handle moving into/out of deprivileged mode */
- hvm_deprivileged_user_mode_asm();
+ hvm_deprivileged_user_mode_asm(depriv_f, a, b, c, d, e);
vcpu->arch.hvm_vcpu.depriv_ctxt_switch_from(vcpu);
diff --git a/xen/arch/x86/hvm/deprivileged_asm.S b/xen/arch/x86/hvm/deprivileged_asm.S
index 07d4216..7d3e632 100644
--- a/xen/arch/x86/hvm/deprivileged_asm.S
+++ b/xen/arch/x86/hvm/deprivileged_asm.S
@@ -11,7 +11,7 @@
#include <public/xen.h>
#include <irq_vectors.h>
#include <xen/hvm/deprivileged.h>
-
+#include <xen/hvm/deprivileged_syscall.h>
/*
* Handles entry into the deprivileged mode and returning from this
* mode.
@@ -24,25 +24,43 @@
* We're doing a sort-of long jump/set jump with copying to a stack to
* preserve it and allow returning code to continue executing from
* within this method.
+ *
+ * Params:
+ * f - rdi
+ * a - rsi
+ * b - rdx
+ * c - rcx
+ * d - r8
+ * e - r9
+ *
+ * NOTE: This function relies on the 64-bit linux calling convention. See the
+ * header comment in deprivileged_syscall.c for a full description.
+ *
+ * Stack Layout after entering user mode:
+ * The stack grows down in this diagram
+ *
+ * caller
+ * --------------
+ * saved registers
+ * --------------
+ * return eip
+ * --------------
*/
ENTRY(hvm_deprivileged_user_mode_asm)
/* Save our registers */
- push %rax
- push %rbx
- push %rcx
- push %rdx
- push %rsi
- push %rdi
- push %rbp
- push %r8
- push %r9
- push %r10
- push %r11
- push %r12
- push %r13
- push %r14
- push %r15
+ /* The ordering here is deliberate: we want to save all but our
+ * parameters.
+ */
pushfq
+ push %r15
+ push %r14
+ push %r13
+ push %r12
+ push %r11
+ push %r10
+ push %rbp
+ push %rbx
+ push %rax
/* Perform a near call to push rip onto the stack */
call 1f
@@ -54,6 +72,17 @@ ENTRY(hvm_deprivileged_user_mode_asm)
1: addq $2f-1b, (%rsp)
/*
+ * We need to also save our parameters as they are caller-saved and
+ * we call other functions in this one.
+ */
+ push %r9
+ push %r8
+ push %rcx
+ push %rdx
+ push %rsi
+ push %rdi
+
+ /*
* Setup the stack pointers for exceptions, syscall and sysenter to be
* just after our current rsp, adjusted for 16 byte alignment.
*/
@@ -76,11 +105,11 @@ ENTRY(hvm_deprivileged_user_mode_asm)
jne 3f
cli
- movq %rsp, VCPU_depriv_rsp(%r8) /* The rsp to restore to */
- movabs $HVM_DEPRIVILEGED_TEXT_ADDR, %rcx /* RIP in user mode */
/* RFLAGS user mode */
movq $(X86_EFLAGS_IF | X86_EFLAGS_VIP), %r11
+
+ movabs $(hvm_deprivileged_ring3 - .hvm_deprivileged_enhancement.text + HVM_DEPRIVILEGED_TEXT_ADDR), %rcx /* RIP in user mode */
movq $1, VCPU_depriv_user_mode(%r8) /* Now in user mode */
/*
@@ -94,6 +123,21 @@ ENTRY(hvm_deprivileged_user_mode_asm)
* the sysret instruction.
*/
movq $HVM_STACK_PTR, %rbx
+
+ /*
+ * Pop our parameters so that the registers now hold the needed values
+ * for the deprivileged mode operation.
+ */
+ movq %r8, %r15
+ pop %rdi
+ pop %rsi
+ pop %rdx
+ pop %r10 /* rcx is needed by sysret so use r10 instead */
+ pop %r8
+ pop %r9
+
+ movq %rsp, VCPU_depriv_rsp(%r15) /* The rsp to restore to */
+
sysretq /* Enter deprivileged mode */
3: call hvm_deprivileged_restore_stacks
@@ -102,22 +146,16 @@ ENTRY(hvm_deprivileged_user_mode_asm)
* Restore registers
* The return rip has been popped by the ret on the return path
*/
- popfq
- pop %r15
- pop %r14
- pop %r13
- pop %r12
- pop %r11
- pop %r10
- pop %r9
- pop %r8
- pop %rbp
- pop %rdi
- pop %rsi
- pop %rdx
- pop %rcx
- pop %rbx
pop %rax
+ pop %rbx
+ pop %rbp
+ pop %r10
+ pop %r11
+ pop %r12
+ pop %r13
+ pop %r14
+ pop %r15
+ popfq
ret
/* Finished in user mode so return */
@@ -134,11 +172,14 @@ ENTRY(hvm_deprivileged_finish_user_mode_asm)
/* Go to user mode return code */
ret
-/* Entry point from the assembly syscall handlers */
+/* Entry point from the assembly hypercall handlers */
ENTRY(hvm_deprivileged_handle_user_mode)
+ /* Return code is in %rdi */
+ GET_CURRENT(%rbx)
+ movq %rdi, VCPU_depriv_return_code(%rbx)
- /* Handle a user mode hypercall here */
-
+ /* Handle a user mode syscall here */
+ call do_deprivileged_syscall
/* We are finished in user mode */
call hvm_deprivileged_finish_user_mode
@@ -146,7 +187,25 @@ ENTRY(hvm_deprivileged_handle_user_mode)
ret
.section .hvm_deprivileged_enhancement.text,"ax"
-/* HVM deprivileged code */
+/* HVM deprivileged code general entry and exit point
+ *
+ * All entries and exits to deprivileged mode operations enter
+ * and exit here. This means that the depriv functions do not need
+ * to be written to setup the needed state and can return normally.
+ * We then handle the return to the hypervisor here
+ *
+ * In rdi, we have the address of the function to jump to and its
+ * parameters are in the necessary registers for the 64-bit linux
+ * calling convention
+ *
+ * Params:
+ * f - rdi
+ * a - rsi
+ * b - rdx
+ * c - r10 - rcx is needed by sysret so we can't use it: use r10 instead
+ * d - r8
+ * e - r9
+ */
ENTRY(hvm_deprivileged_ring3)
/*
* sysret has loaded eip from rcx and rflags from r11.
@@ -155,13 +214,42 @@ ENTRY(hvm_deprivileged_ring3)
*/
movabs $HVM_STACK_PTR, %rsp
+ /*
+ * Shuffle params across so that the callee has its first argument in
+ * rdi as defined in the calling convention. We have put f in rdi and
+ * effectively moved the other five arguments 'down' one slot. This
+ * makes the depriv invocation transparent to the callee.
+ */
+ movq %rdi, %r15
+ movq %rsi, %rdi
+ movq %rdx, %rsi
+ movq %r10, %rdx /* r10 holds param rcx */
+ movq %r8, %rcx
+ movq %r9, %r8
+
/* Perform user mode processing */
- movabs $0xff, %rcx
-1: dec %rcx
- cmp $0, %rcx
- jne 1b
+ callq *%r15
+
+ /* Result is in rax */
+ mov %rax, %rdi
/* Return to ring 0 */
syscall
+/*
+ * Dispatch a syscall from within deprivileged mode
+ *
+ * Params:
+ * - syscall number is in rdi
+ * - syscall arguments are in rsi, rdx, rcx, r8 and r9 (in that order)
+ *
+ * TODO: this is currently a stub, it can be folded into DEPRIV_SYSCALL_CALL
+ * if no extra code is needed.
+ */
+ENTRY(hvm_deprivileged_syscall)
+
+ syscall
+
+ /* Returned from this mode: Get result into rax */
+ ret
.previous
diff --git a/xen/arch/x86/hvm/deprivileged_syscall.c b/xen/arch/x86/hvm/deprivileged_syscall.c
new file mode 100644
index 0000000..34dfee9
--- /dev/null
+++ b/xen/arch/x86/hvm/deprivileged_syscall.c
@@ -0,0 +1,277 @@
+/*
+ * A description of deprivileged mode operation dispatch and system call handling
+ * follows.
+ *
+ * We have two operations:
+ * 1) dispatching a deprivileged mode operation
+ * 2) deprivileged mode executing a system call
+ *
+ * For (1):
+ * We have a table of methods which can be dispatched. All deprivileged mode
+ * methods which can be dispatched need to be in this array. This aims to
+ * prevent dispatching functions which are not designed for deprivileged mode
+ * and means that we do not dispatch on an arbitrary pointer.
+ *
+ * For (2):
+ * We again have a table of methods which are the syscall handlers. All
+ * system calls which we handle need to be in this table. Deprivileged mode
+ * passes an integer to select which operation to call.
+ *
+ * Data transfer for (1):
+ * To pass data to deprivileged mode, we can pass up to five integer or
+ * pointer parameters in registers. This is thanks to the 64-bit Linux calling
+ * convention which puts these into 64-bit registers. The dispatch code takes
+ * these parameters and arranges them so that when the deprivileged mode
+ * operation executes, they are in the registers specified by the calling
+ * convention. This means that it is transparent to the operation that it was
+ * not invoked by a function call.
+ *
+ * To pass the data which pointers correspond to, we use the deprivileged data
+ * section. We copy this data to the section and change the pointer so that it
+ * points into this section. Any extra parameters are also copied into the
+ * data section.
+ *
+ * To return data from deprivileged mode, the operation can supply a return
+ * value which we pass through back to the caller. If extra data is needed,
+ * which may be needed to make logical decisions after invocation of the
+ * operation, then this is placed at the end of the data section. The caller
+ * of the operation can then access this data. We copy back the data which we
+ * initially copied in so that the caller sees any changes made by the callee.
+ * NOTE: You need to handle the case where these structures can be updated
+ * whilst in deprivileged mode.
+ *
+ * It is necessary to clear out the data page between deprivileged mode
+ * operations to prevent data leakage between operations which _may_ be
+ * useful to an attacker.
+ *
+ * Data transfer for (2):
+ * We need to transfer data to the syscall handler and then back to the
+ * deprivileged mode operation. To pass data, we use the same method as in (1)
+ * for the first five parameters. For extra data, this will be placed at the
+ * end of the data section and will be fetched by the handler. We also use
+ * the same method as in (1) for passing data back to the operation.
+ *
+ *
+ * TODO: We copy the data for ease of implementation and for small enough
+ * structures, this is acceptable. For larger structures, or ones
+ * which can be updated whilst deprivileged mode uses them, it would be
+ * better to alias them. This would require the caller to provide the data on
+ * separate pages (so that only the required data is passed in, we don't want
+ * deprivileged mode to be able to access any other Xen data). We can then
+ * alias these through a page table mapping. It would make sense to
+ * preallocate a set of pages in the monitor table to do this, so that, when
+ * aliasing, we just need to switch the mfn on the L1 page table, rather than
+ * allocating and mapping in a whole new paging hierarchy and then we only
+ * need to invalidate that one TLB entry when we exit the mode.
+ */
+
+#include <xen/hvm/deprivileged.h>
+#include <xen/hvm/deprivileged_syscall.h>
+
+/*
+ * Similar to Xen's arch/arm/traps.c
+ * Used for handling a syscall from deprivileged mode or dispatching a
+ * deprivileged mode operation.
+ */
+
+
+/* This table holds the functions which can be called from deprivileged mode. */
+static depriv_syscall_t depriv_syscall_table[] = {
+
+};
+
+/* Handle a syscall from deprivileged mode */
+void do_deprivileged_syscall(struct cpu_user_regs *regs)
+{
+ depriv_syscall_fn_t fn = NULL;
+ unsigned long nr = regs->rdi;
+
+ /* Invalid syscall? */
+ if ( nr >= ARRAY_SIZE(depriv_syscall_table) )
+ hvm_deprivileged_crash_domain("Invalid syscall number");
+
+ fn = depriv_syscall_table[nr].fn;
+
+ /* No syscall? */
+ if ( fn == NULL )
+ hvm_deprivileged_crash_domain("No syscall");
+
+ /*
+ * We can use the 64-bit linux calling convention here. The first 6 integer
+ * and pointer arguments are passed in registers. Now, as long as all of our
+ * system calls use fewer than this, we can just call all of our functions
+ * with five arguments. This is fine as these registers should be preserved
+ * by the caller if they use them so will not impact functions with fewer
+ * parameters.
+ */
+ ASSERT(depriv_syscall_table[nr].nr_args <= 5);
+
+ DEPRIV_SYSCALL_RESULT(regs) = fn(DEPRIV_SYSCALL_ARGS(regs));
+
+ /* Return results */
+}
+
+/*
+ * Copy data from privileged context to deprivileged context for
+ * use by deprivileged context functions.
+ *
+ * TODO: In future, it might be better to alias such data, we can put
+ * the source data in a page aligned region and then alias it so that
+ * deprivileged mode can access it. This would avoid the overheads of
+ * the copy. See the header of this file.
+ */
+void *hvm_deprivileged_copy_data_to(struct vcpu *vcpu, void *src,
+ unsigned long size)
+{
+ unsigned long data_offset = vcpu->arch.hvm_vcpu.depriv_data_offset;
+
+ /*
+ * TODO: Check for unsigned integer wrapping on addition
+ */
+ printk("off: %lx, size: %lx\n section: %lx\n", data_offset, size, HVM_DEPRIV_DATA_SECTION_SIZE);
+ ASSERT(data_offset + size < HVM_DEPRIV_DATA_SECTION_SIZE);
+
+ /*
+ * TODO: We _may_ need to map_domain_page these in first???
+ */
+ memcpy((void *)((unsigned long)HVM_DEPRIVILEGED_DATA_ADDR + data_offset),
+ src, size);
+
+ vcpu->arch.hvm_vcpu.depriv_data_offset += size;
+
+ /* The destination */
+ return (void *)((unsigned long)HVM_DEPRIVILEGED_DATA_ADDR + data_offset);
+}
+
+/* Copy data from deprivileged context to privileged context. */
+void *hvm_deprivileged_copy_data_from(struct vcpu *vcpu, void *dest, void *src,
+ unsigned long size)
+{
+ unsigned long data_offset = vcpu->arch.hvm_vcpu.depriv_data_offset;
+
+ /*
+ * TODO: Check for unsigned integer wrapping on addition
+ */
+ ASSERT(data_offset + size < HVM_DEPRIV_DATA_SECTION_SIZE);
+
+ memcpy(dest, src, size);
+
+ /*
+ * Prevent information leakage between separate deprivileged mode operations
+ * by clearing out this region
+ */
+ memset((void *)((unsigned long)HVM_DEPRIVILEGED_DATA_ADDR + data_offset),
+ 0, size);
+
+ return dest;
+}
+
+/*******************************************************************************
+ *
+ * Deprivileged mode dispatcher wrappers
+ *
+ * These are used to wrap calling a deprivileged mode operation with up to five
+ * parameters in case we change the interface.
+ *
+ ******************************************************************************/
+int depriv0(unsigned long f)
+{
+ return hvm_deprivileged_user_mode(f, 0, 0, 0, 0, 0);
+}
+
+int depriv1(unsigned long f, register_t a)
+{
+ return hvm_deprivileged_user_mode(f, a, 0, 0, 0, 0);
+}
+
+int depriv2(unsigned long f, register_t a, register_t b)
+{
+ return hvm_deprivileged_user_mode(f, a, b, 0, 0, 0);
+}
+
+int depriv3(unsigned long f, register_t a, register_t b, register_t c)
+{
+ return hvm_deprivileged_user_mode(f, a, b, c, 0, 0);
+}
+
+int depriv4(unsigned long f, register_t a, register_t b, register_t c,
+ register_t d)
+{
+ return hvm_deprivileged_user_mode(f, a, b, c, d, 0);
+}
+
+int depriv5(unsigned long f, register_t a, register_t b, register_t c,
+ register_t d, register_t e)
+{
+ return hvm_deprivileged_user_mode(f, a, b, c, d, e);
+}
+
+/*******************************************************************************
+ *
+ * Test dispatchers, used to dispatch a deprivileged mode operation
+ *
+ ******************************************************************************/
+int test_op0(void)
+{
+ return depriv0(DEPRIV_OPERATION_test_op0);
+}
+
+int test_op1(int a)
+{
+ return depriv1(DEPRIV_OPERATION_test_op1, a);
+}
+
+int test_op2(int a, int b)
+{
+ return depriv2(DEPRIV_OPERATION_test_op2, a, b);
+}
+
+int test_op3(int a, int b, int c)
+{
+ return depriv3(DEPRIV_OPERATION_test_op3, a, b, c);
+}
+
+int test_op4(int a, int b, int c, int d)
+{
+ return depriv4(DEPRIV_OPERATION_test_op4, a, b, c, d);
+}
+
+int test_op5(int a, int b, int c, int d, int e)
+{
+ return depriv5(DEPRIV_OPERATION_test_op5, a, b, c, d, e);
+}
+
+/*******************************************************************************
+ *
+ * Test HVM Deprivileged mode functions
+ *
+ ******************************************************************************/
+int depriv_test_op0(void)
+{
+ return 0xDEADBEEF;
+}
+
+int depriv_test_op1(int a)
+{
+ return a;
+}
+
+int depriv_test_op2(int a, int b)
+{
+ return b;
+}
+
+int depriv_test_op3(int a, int b, int c)
+{
+ return c;
+}
+
+int depriv_test_op4(int a, int b, int c, int d)
+{
+ return d;
+}
+
+int depriv_test_op5(int a, int b, int c, int d, int e)
+{
+ return e;
+}
diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index 3393fb5..4ca6d53 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -2670,7 +2670,7 @@ void svm_vmexit_handler(struct cpu_user_regs *regs)
{
if( guest_cpu_user_regs()->eax == 0x1)
{
- hvm_deprivileged_user_mode();
+// hvm_deprivileged_user_mode();
}
__update_guest_eip(regs, vmcb->exitinfo2 - vmcb->rip);
break;
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 1ec23f9..b93b0b6 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -3453,7 +3453,6 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
{
if( guest_cpu_user_regs()->eax == 0x1)
{
- hvm_deprivileged_user_mode();
}
update_guest_eip(); /* Safe: IN, OUT */
break;
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index 7af824a..4e2a96c 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -118,6 +118,7 @@ void __dummy__(void)
OFFSET(VCPU_depriv_rsp, struct vcpu, arch.hvm_vcpu.depriv_rsp);
OFFSET(VCPU_depriv_user_mode, struct vcpu, arch.hvm_vcpu.depriv_user_mode);
OFFSET(VCPU_depriv_destroy, struct vcpu, arch.hvm_vcpu.depriv_destroy);
+ OFFSET(VCPU_depriv_return_code, struct vcpu, arch.hvm_vcpu.depriv_return_code);
BLANK();
OFFSET(DOMAIN_is_32bit_pv, struct domain, arch.is_32bit_pv);
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 9590065..df434f2 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -106,6 +106,7 @@ restore_all_xen:
/* Returning from user mode */
ENTRY(handle_hvm_user_mode)
+ movq %rsp, %rdi
call hvm_deprivileged_handle_user_mode
/* fallthrough */
@@ -141,7 +142,12 @@ ENTRY(lstar_enter)
SAVE_VOLATILE TRAP_syscall
GET_CURRENT(%rbx)
- /* Were we in Xen's ring 3? */
+ /*
+ * Were we in Xen's ring 3?
+ * From lstar_enter up to saving all registers, we need to preserve rdi,
+ * rsi, rdx, rcx, r8 and r9 so that syscalls into deprivileged mode can
+ * function as expected
+ */
cmpq $1, VCPU_depriv_user_mode(%rbx)
je handle_hvm_user_mode
diff --git a/xen/include/asm-x86/hvm/vcpu.h b/xen/include/asm-x86/hvm/vcpu.h
index f7df9d4..dcdecf1 100644
--- a/xen/include/asm-x86/hvm/vcpu.h
+++ b/xen/include/asm-x86/hvm/vcpu.h
@@ -216,7 +216,10 @@ struct hvm_vcpu {
unsigned long depriv_tss_rsp0;
unsigned long depriv_destroy;
unsigned long depriv_watchdog_count;
-
+ unsigned long depriv_return_code;
+ /* Offset into our data page where we can put data for depriv operations */
+ unsigned long depriv_data_offset;
+
/* Pending hw/sw interrupt (.vector = -1 means nothing pending). */
struct hvm_trap inject_trap;
diff --git a/xen/include/xen/hvm/deprivileged.h b/xen/include/xen/hvm/deprivileged.h
index b6e575d..d7228be 100644
--- a/xen/include/xen/hvm/deprivileged.h
+++ b/xen/include/xen/hvm/deprivileged.h
@@ -17,6 +17,7 @@
#include <asm-x86/page.h>
#include <public/domctl.h>
#include <xen/domain_page.h>
+#include <xen/hvm/deprivileged_syscall.h>
/*
* Initialise the HVM deprivileged mode. This just sets up the general
@@ -83,16 +84,21 @@ int hvm_deprivileged_prepare_vcpu(struct vcpu *vcpu);
/* Destroy each vcpu's data for Xen user mode. Again, call for each vcpu. */
void hvm_deprivileged_destroy_vcpu(struct vcpu *vcpu);
-/* Called to perform a user mode operation. */
-int hvm_deprivileged_user_mode(void);
-
/* Called when the user mode operation has completed */
void hvm_deprivileged_finish_user_mode(void);
-/* Called to move into and then out of user mode. Needed for accessing
+/* Dispatch a deprivileged user mode operation */
+int hvm_deprivileged_user_mode(unsigned long operation, register_t a,
+ register_t b, register_t c, register_t d,
+ register_t e);
+
+/*
+ * Called to move into and then out of user mode. Needed for accessing
* assembly features.
*/
-void hvm_deprivileged_user_mode_asm(void);
+void hvm_deprivileged_user_mode_asm(depriv_syscall_fn_t f, register_t a,
+ register_t b, register_t c, register_t d,
+ register_t e);
/* Called on the return path to return to the correct execution point */
void hvm_deprivileged_finish_user_mode_asm(void);
@@ -151,6 +157,7 @@ extern unsigned long __hvm_deprivileged_data_end[];
#define HVM_ERR_PG_ALLOC -1
#define HVM_DEPRIV_ALIAS 1
#define HVM_DEPRIV_COPY 0
+#define HVM_DISPATCH_ERR -1
/*
* The user mode stack pointer.
diff --git a/xen/include/xen/hvm/deprivileged_syscall.h b/xen/include/xen/hvm/deprivileged_syscall.h
new file mode 100644
index 0000000..3af29ae
--- /dev/null
+++ b/xen/include/xen/hvm/deprivileged_syscall.h
@@ -0,0 +1,200 @@
+#ifndef __X86_HVM_DEPRIVILEGED_SYSCALL
+
+/* Table of HVM deprivileged mode syscall array offsets */
+#define DEPRIV_SYSCALL_vpic_get_priority 0
+
+/* Table of HVM deprivileged mode operation array offsets */
+#define DEPRIV_OPERATION_vpic_ioport_write 0
+#define DEPRIV_OPERATION_test_op0 1
+#define DEPRIV_OPERATION_test_op1 2
+#define DEPRIV_OPERATION_test_op2 3
+#define DEPRIV_OPERATION_test_op3 4
+#define DEPRIV_OPERATION_test_op4 5
+#define DEPRIV_OPERATION_test_op5 6
+
+/* This is also included in the HVM deprivileged mode .S file */
+#ifndef __ASSEMBLY__
+#define __X86_HVM_DEPRIVILEGED_SYSCALL
+#include <xen/hvm/deprivileged.h>
+
+/* Handle a syscall from deprivileged mode */
+void do_deprivileged_syscall(struct cpu_user_regs *regs);
+
+/* Dispatch a syscall from within deprivileged mode */
+void hvm_deprivileged_syscall(void);
+
+/*
+ * Copy data from privileged context to deprivileged context for
+ * use by deprivileged context functions.
+ */
+void *hvm_deprivileged_copy_data_to(struct vcpu *vcpu, void *src,
+ unsigned long size);
+
+/* Copy data from deprivileged context to privileged context. */
+void *hvm_deprivileged_copy_data_from(struct vcpu *vcpu, void *dest, void *src,
+ unsigned long size);
+
+/*
+ * Typing to allow us to store and lookup system calls with different
+ * prototypes by a syscall number
+ */
+typedef unsigned long register_t;
+
+typedef register_t (*depriv_syscall_fn_t)(
+ register_t, register_t, register_t, register_t, register_t);
+
+typedef struct {
+ depriv_syscall_fn_t fn;
+ int nr_args;
+} depriv_syscall_t;
+
+/* Create an entry in the syscall table */
+#define DEPRIV_SYSCALL(_name, _nr_args) \
+ [ DEPRIV_SYSCALL_ ## _name ] = { \
+ .fn = (depriv_syscall_fn_t) &do_depriv_ ## _name, \
+ .nr_args = _nr_args \
+ }
+
+/* Use to extract the arguments from the cpu_user_regs struct */
+#define DEPRIV_SYSCALL_ARGS(r) (r)->rsi, (r)->rdx, (r)->rcx, (r)->r8, (r)->r9
+
+/* Use to set the rax register in the cpu_user_regs struct */
+#define DEPRIV_SYSCALL_RESULT(r) (r)->rax
+
+/*
+ * Use this to call a system call from deprivileged mode.
+ * We take the syscall number and then up to five parameters which we pass in
+ * registers, following the 64-bit Linux calling convention.
+ *
+ * We need to calculate the actual address of the syscall dispatcher as we
+ * relocated it to the deprivileged code area so it's compiled address is not
+ * its acutal address. This also means that we can't just do
+ * hvm_deprivileged_syscall(op, a, ...) as that will be for the compiled address
+ * not the actual relocated address.
+ */
+#define DEPRIV_SYSCALL_CALL(_op, _ret, _a, _b, _c, _d, _e) \
+ __asm__ volatile("movq %1, %%rdi;" /* The syscall number */ \
+ /* Parameters */ \
+ "movq %2, %%rsi;" \
+ "movq %3, %%rdx;" \
+ "movq %4, %%rcx;" \
+ "movq %5, %%r8;" \
+ "movq %6, %%r9;" \
+ /* Dispatch it */ \
+ "callq %7;" \
+ /* return is in rax */ \
+ : "=a"(_ret) \
+ : "rm"((unsigned long)_op), "rm"((unsigned long)_a), \
+ "rm"((unsigned long)_b), "rm"((unsigned long)_c), \
+ "rm"((unsigned long)_d), "rm"((unsigned long)_e), \
+ "rm"((unsigned long)hvm_deprivileged_syscall - \
+ (unsigned long)__hvm_deprivileged_text_start + \
+ (unsigned long)HVM_DEPRIVILEGED_TEXT_ADDR) \
+ : "rdi", "rsi", "rdx", "rcx", "r8", "r9")
+
+#define DEPRIV_SYSCALL_CALL0(_op, _ret) \
+ DEPRIV_SYSCALL_CALL(_op, _ret, 0, 0, 0, 0, 0)
+
+#define DEPRIV_SYSCALL_CALL1(_op, _ret, _a) \
+ DEPRIV_SYSCALL_CALL(_op, _ret, _a, 0, 0, 0, 0)
+
+#define DEPRIV_SYSCALL_CALL2(_op, _ret, _a, _b) \
+ DEPRIV_SYSCALL_CALL(_op, _ret, _a, _b, 0, 0, 0)
+
+#define DEPRIV_SYSCALL_CALL3(_op, _ret, _a, _b, _c) \
+ DEPRIV_SYSCALL_CALL(_op, _ret, _a, _b, _c, 0, 0)
+
+#define DEPRIV_SYSCALL_CALL4(_op, _ret, _a, _b, _c, _d) \
+ DEPRIV_SYSCALL_CALL(_op, _ret, _a, _b, _c, _d, 0)
+
+#define DEPRIV_SYSCALL_CALL5(_op, _ret, _a, _b, _c, _d, _e) \
+ DEPRIV_SYSCALL_CALL(_op, _ret, _a, _b, _c, _d, _e)
+
+/* Deprivileged mode operation. This can be dispatched. */
+#define DEPRIV_OPERATION(_name, _nr_args) \
+ [ DEPRIV_OPERATION_ ## _name ] = { \
+ .fn = (depriv_syscall_fn_t) &depriv_ ## _name, \
+ .nr_args = _nr_args \
+ }
+
+#define DEPRIV_OPERATION_ARGS(r) (r)->rdi, (r)->rsi, (r)->rdx, (r)->rcx, (r)->r8
+
+#define DEPRIV_OPERATION_RESULT(r) (r)->rax
+
+/*
+ * Use this attribute on the prototype of any function which is to be executed
+ * in deprivileged mode.
+ */
+#define DEPRIV_TEXT_SEGMENT \
+ __attribute__((section(".hvm_deprivileged_enhancement.text")))
+
+/*
+ * Wrappers to pass up to five parameters on a deprivileged dispatch in a
+ * uniform manner
+ */
+int depriv0(unsigned long f);
+
+int depriv1(unsigned long f, register_t a);
+
+int depriv2(unsigned long f, register_t a, register_t b);
+
+int depriv3(unsigned long f, register_t a, register_t b, register_t c);
+
+int depriv4(unsigned long f, register_t a, register_t b, register_t c,
+ register_t d);
+
+int depriv5(unsigned long f, register_t a, register_t b, register_t c,
+ register_t d, register_t e);
+
+/*
+ * We may want both the caller and the callee to have the same types for the
+ * parameters, so use these macros to ensure this is the case. GCC will
+ * complain if they do not match when these are used.
+ *
+ * We have a version of the function with identifier F, which can be the old
+ * function identifier and prototype. The aim is to minimise the intrusiveness
+ * of adding this feature, so the original call points remain unchanged;
+ * instead, we move the contents of the old function into a deprivileged
+ * version and marshal arguments as needed. We then call the deprivileged
+ * version and handle the return result (this may require additional
+ * logic).
+ */
+#define MAKE_DEPRIV0(retn, F) \
+ retn F(void); \
+ retn depriv_ ##F(void) DEPRIV_TEXT_SEGMENT;
+
+#define MAKE_DEPRIV1(retn, F, type1, arg1) \
+ retn F(type1 arg1); \
+ retn depriv_ ##F(type1 arg1) DEPRIV_TEXT_SEGMENT;
+
+#define MAKE_DEPRIV2(retn, F, type1, arg1, type2, arg2) \
+ retn F(type1 arg1, type2 arg2); \
+ retn depriv_ ##F(type1 arg1, type2 arg2) DEPRIV_TEXT_SEGMENT;
+
+#define MAKE_DEPRIV3(retn, F, type1, arg1, type2, arg2, type3, arg3) \
+ retn F(type1 arg1, type2 arg2, type3 arg3); \
+ retn depriv_ ##F(type1 arg1, type2 arg2, type3 arg3) DEPRIV_TEXT_SEGMENT; \
+
+#define MAKE_DEPRIV4(retn, F, type1, arg1, type2, arg2, type3, arg3, \
+ type4, arg4) \
+ retn F(type1 arg1, type2 arg2, type3 arg3, type4 arg4); \
+ retn depriv_ ##F(type1 arg1, type2 arg2, type3 arg3, \
+ type4 arg4) DEPRIV_TEXT_SEGMENT;
+
+#define MAKE_DEPRIV5(retn, F, type1, arg1, type2, arg2, type3, arg3, \
+ type4, arg4, type5, arg5) \
+ retn F(type1 arg1, type2 arg2, type3 arg3, type4 arg4, type5 arg5); \
+ retn depriv_ ##F(type1 arg1, type2 arg2, type3 arg3, type4 arg4, \
+ type5 arg5) DEPRIV_TEXT_SEGMENT;
+
+/* Test functions for all arguments for deprivileged dispatch */
+MAKE_DEPRIV0(int, test_op0)
+MAKE_DEPRIV1(int, test_op1, int, a)
+MAKE_DEPRIV2(int, test_op2, int, a, int, b)
+MAKE_DEPRIV3(int, test_op3, int, a, int, b, int, c)
+MAKE_DEPRIV4(int, test_op4, int, a, int, b, int, c, int, d)
+MAKE_DEPRIV5(int, test_op5, int, a, int, b, int, c, int, d, int, e)
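+
+/*
+ * Illustration only (hypothetical function, not part of this patch): to
+ * deprivilege an existing helper such as
+ *
+ *     int foo_update(struct foo *f, uint32_t val);
+ *
+ * one would declare both variants with
+ *
+ *     MAKE_DEPRIV2(int, foo_update, struct foo *, f, uint32_t, val)
+ *
+ * move the old body into depriv_foo_update() (placed in the deprivileged
+ * text segment by DEPRIV_TEXT_SEGMENT), and leave foo_update() at the old
+ * call sites to marshal its arguments and dispatch the operation.
+ */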
+
+#endif /* !__ASSEMBLY__ */
+
+#endif
--
2.1.4
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH RFC v3 6/6] HVM x86 deprivileged mode: Move VPIC to deprivileged mode
2015-09-11 16:08 [PATCH RFC v3 0/6] HVM x86 deprivileged mode summary Ben Catterall
` (4 preceding siblings ...)
2015-09-11 16:08 ` [PATCH RFC v3 5/6] HVM x86 deprivileged mode: Syscall and deprivileged operation dispatcher Ben Catterall
@ 2015-09-11 16:08 ` Ben Catterall
2015-09-11 16:13 ` [PATCH RFC v3 0/6] HVM x86 deprivileged mode summary Ben Catterall
2015-09-23 19:20 ` Konrad Rzeszutek Wilk
7 siblings, 0 replies; 9+ messages in thread
From: Ben Catterall @ 2015-09-11 16:08 UTC (permalink / raw)
To: xen-devel
Cc: keir, ian.campbell, george.dunlap, andrew.cooper3, tim, jbeulich,
Aravind.Gopalakrishnan, suravee.suthikulpanit, Ben Catterall,
boris.ostrovsky
First steps of moving the VPIC into deprivileged mode.
For the VPIC, some of its functions are called from both privileged code and
deprivileged code. Some of these are also called from non-hvm domains. This
means that we cannot simply convert the entire function to a depriv-only one, but
need to handle this case. vpic_get_priority() shows one way of doing this but
there may be other ways. The main aim will be to minimise code duplication and
logic needed to determine where the call is coming from. This will not be a
unique problem so some thought will be needed as to the best way to resolve this
in general.
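
To make the pattern concrete, here is a minimal sketch (the names below are
illustrative only and are not part of this patch; the real vpic_get_priority()
split is in the diff further down). The shared core goes into an inline helper
which must not touch anything deprivileged mode does not map, and each context
then gets a thin wrapper:

    struct foo {
        uint8_t bits;
        /* NB: the lock lives in foo's parent structure, which is not mapped
         * into deprivileged mode. */
    };

    bool foo_is_locked(const struct foo *f); /* hypothetical helper */

    /* Core logic, callable from either context. */
    static inline int foo_core(const struct foo *f, uint8_t mask)
    {
        return f->bits & mask;
    }

    /* Privileged entry point: keeps the original name, asserts the lock. */
    static int foo(const struct foo *f, uint8_t mask)
    {
        ASSERT(foo_is_locked(f));
        return foo_core(f, mask);
    }

    /* Deprivileged syscall handler: callers already hold the lock, and even
     * testing it here would fault as it lives outside the depriv mappings. */
    int do_depriv_foo(const struct foo *f, uint8_t mask)
    {
        return foo_core(f, mask);
    }
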
A clean-up method handles a deprivileged-mode domain crashing whilst holding
resources. For example, if we hold a lock or have allocated memory which Xen
will not clean up for us when we crash, we need to release these. Otherwise we
fail on ASSERT_NOT_IN_ATOMIC in vmx_asm_vmexit_handler due to an unreleased lock
and then panic. We could also leak memory if we allocate from a pool which Xen
does not clean up for us on crashing the domain.
TODO
----
Patches 5 & 6:
- Fix the GCC switch statement issue which causes a page fault
Patch 6:
- Fix vpic lock release on domain crash.
- Finish moving parts of the VPIC into deprivileged mode
KNOWN ISSUES
------------
- Page fault for vpic_ioport_write due to GCC switch statements placing the
jump table in .rodata which is in the privileged mode area.
This has been traced to the first of the switch statements in the function,
though other switches in that function may also be affected.
Compiled using GCC 4.9.2-10.
You can get the offset into this function by doing:
(RIP - (depriv_vpic_ioport_write - __hvm_deprivileged_text_start))
It appears to be a built-in default of GCC to put switch jump tables in
.rodata or .text and there does not appear to be a way to change this
(except to patch the compiler). Note that GCC will not necessarily allocate
jump tables for each switch statement; it depends on the optimiser.
Thus, when we relocate a deprivileged method containing code using a switch
statement which GCC has created a jump table for, this leads to a page
fault. This is because we have not mapped in the rodata section
as we should not (depriv should not have access to it).
A workaround would be to patch the generated assembly so that this table is
moved into hvm_deprivileged.rodata. This can be done by adding
.section .hvm_deprivileged.rodata around the generated table. We can then
relocate this.
Note that GCC is using RIP-relative addressing for this, so the offset
of depriv .rodata to the depriv .text segment will need to be the same
when it is mapped in.
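
An alternative, purely source-level workaround (a sketch only, not what this
series proposes, and with hypothetical helper names) is to rewrite the
offending switch as a compare-and-branch ladder, so that GCC 4.9 has no jump
table to emit and the dispatch stays entirely within the deprivileged .text
section:

    /* Hypothetical rewrite of one deprivileged switch; the cmd values and the
     * handle_* helpers are illustrative only. */
    if ( cmd == 0 )
        ret = handle_cmd0(vpic, val);
    else if ( cmd == 1 )
        ret = handle_cmd1(vpic, val);
    else
        ret = handle_default(vpic, val);

This trades the jump table for explicit conditional branches, at some cost in
code size and readability, and relies on the compiler not re-introducing a
table for the if/else chain.
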
Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
---
xen/arch/x86/hvm/deprivileged.c | 49 +++++++++++
xen/arch/x86/hvm/deprivileged_syscall.c | 4 +-
xen/arch/x86/hvm/vpic.c | 151 ++++++++++++++++++++++++++++----
xen/arch/x86/traps.c | 5 +-
xen/include/asm-x86/hvm/vcpu.h | 2 +
xen/include/xen/hvm/deprivileged.h | 3 +
6 files changed, 192 insertions(+), 22 deletions(-)
diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
index 5606f9a..9561054 100644
--- a/xen/arch/x86/hvm/deprivileged.c
+++ b/xen/arch/x86/hvm/deprivileged.c
@@ -20,7 +20,14 @@
#include <xen/hvm/deprivileged.h>
#include <xen/hvm/deprivileged_syscall.h>
+/* TODO: move to a better place than here */
+int depriv_vpic_ioport_write(unsigned long *ret_data_ptr,
+ struct hvm_hw_vpic *vpic, int32_t addr,
+ uint32_t val) DEPRIV_TEXT_SEGMENT;
+
+
static depriv_syscall_t depriv_operation_table[] = {
+ DEPRIV_OPERATION(vpic_ioport_write, 4)
};
void hvm_deprivileged_init(struct domain *d, l4_pgentry_t *l4t_base)
@@ -641,6 +648,11 @@ int hvm_deprivileged_user_mode(unsigned long operation, register_t a,
*/
if ( vcpu->arch.hvm_vcpu.depriv_destroy )
{
+ /*
+ * Release any resources (e.g. locks) held by the operation we were
+ * performing, as we are about to crash the domain.
+ */
+ hvm_deprivileged_clean_up(vcpu, operation);
domain_crash(vcpu->domain);
return 1;
}
@@ -763,3 +775,40 @@ int hvm_deprivileged_check_trap(const char* func_name)
return 0;
}
+
+/*
+ * Clean up when destroying the domain
+ * When we destroy the domain whilst performing a deprivilged mode operation,
+ * we need to make sure that we do not hold any locks or have any memory which
+ * we have allocated related to the deprivileged mode operation which will
+ * not be cleared up by Xen automatically as part of domain destruction.
+ *
+ * An example is when we crash whilst holding a lock, we need to release this
+ * lock.
+ */
+void hvm_deprivileged_clean_up(struct vcpu *vcpu, unsigned long op)
+{
+ struct hvm_hw_vpic *vpic;
+
+ /*
+ * The vpic lock is not released if we crash the domain. This means that the
+ * preempt count is not decremented, so we fail on an ASSERT_NOT_IN_ATOMIC in
+ * vmx_asm_vmexit_handler on our way back to the guest. After this, we would
+ * test for a SOFTIRQ to deschedule and then destroy the guest, but this
+ * failure results in a panic instead. The solution is to release the lock
+ * when we crash the domain and we have determined that we are performing a
+ * vpic operation.
+ */
+ switch ( op )
+ {
+ case DEPRIV_OPERATION_vpic_ioport_write:
+ /* We have cached the current vpic so that we can clean up */
+ vpic = (struct hvm_hw_vpic *)vcpu->arch.hvm_vcpu.depriv_cleanup_data;
+ printk("Cleaning up... %lx\n", (unsigned long)vpic);
+ spin_unlock(&container_of((vpic), struct hvm_domain,
+ vpic[!(vpic)->is_master])->irq_lock);
+ break;
+ default:
+ /* No clean up needed for this operation if not covered by the above */
+ break;
+ }
+}
diff --git a/xen/arch/x86/hvm/deprivileged_syscall.c b/xen/arch/x86/hvm/deprivileged_syscall.c
index 34dfee9..c98ff96 100644
--- a/xen/arch/x86/hvm/deprivileged_syscall.c
+++ b/xen/arch/x86/hvm/deprivileged_syscall.c
@@ -73,11 +73,11 @@
* Used for handling a syscall from deprivileged mode or dispatching a
* deprivileged mode operation.
*/
-
+int do_depriv_vpic_get_priority(struct hvm_hw_vpic *vpic, uint8_t mask);
/* This table holds the functions which can be called from deprivileged mode. */
static depriv_syscall_t depriv_syscall_table[] = {
-
+ DEPRIV_SYSCALL(vpic_get_priority, 2),
};
/* Handle a syscall from deprivileged mode */
diff --git a/xen/arch/x86/hvm/vpic.c b/xen/arch/x86/hvm/vpic.c
index 7c2edc8..8939924 100644
--- a/xen/arch/x86/hvm/vpic.c
+++ b/xen/arch/x86/hvm/vpic.c
@@ -34,6 +34,8 @@
#include <asm/hvm/hvm.h>
#include <asm/hvm/io.h>
#include <asm/hvm/support.h>
+#include <xen/hvm/deprivileged.h>
+#include <xen/hvm/deprivileged_syscall.h>
#define vpic_domain(v) (container_of((v), struct domain, \
arch.hvm_domain.vpic[!vpic->is_master]))
@@ -44,23 +46,63 @@
#define vpic_is_locked(v) spin_is_locked(__vpic_lock(v))
#define vpic_elcr_mask(v) (vpic->is_master ? (uint8_t)0xf8 : (uint8_t)0xde);
+/* DEPRIV operations */
+
+
/* Return the highest priority found in mask. Return 8 if none. */
#define VPIC_PRIO_NONE 8
-static int vpic_get_priority(struct hvm_hw_vpic *vpic, uint8_t mask)
+
+/*
+ * We need this as a separate stub because it is called from both deprivileged
+ * and privileged code. Now, when calling it from deprivileged code, all those
+ * code paths already have the lock so we don't need to get it, whereas
+ * privileged code paths may not already have the lock.
+ *
+ * NOTE: This is a general problem, BOTH deprivileged and privileged mode
+ * sometimes access the same functions and we need to be aware of and then
+ * handle this.
+ *
+ * As the lock is held in the parent structure of the vpic, and we have not
+ * mapped this into deprivileged memory (as it should not be), when we
+ * try to access it using the depriv vpic pointer, we will page fault.
+ * Thus, as we know we already have the lock, we can avoid testing this and
+ * so avoid the page fault.
+ *
+ * TODO:
+ * To be fair, this is so small that it shouldn't be a syscall (it was added
+ * as a test of the syscall mechanism); it should instead be mapped into
+ * depriv mode. However, it serves to demonstrate the considerations which
+ * need to be taken into account, so I'll leave it here as an example for
+ * future implementors. Hopefully it's helpful...
+ */
+static inline int vpic_get_priority_inline(struct hvm_hw_vpic *vpic,
+ uint8_t mask)
{
int prio;
- ASSERT(vpic_is_locked(vpic));
-
if ( mask == 0 )
return VPIC_PRIO_NONE;
/* prio = ffs(mask ROR vpic->priority_add); */
asm ( "ror %%cl,%b1 ; rep; bsf %1,%0"
: "=r" (prio) : "q" ((uint32_t)mask), "c" (vpic->priority_add) );
+
return prio;
}
+static int vpic_get_priority(struct hvm_hw_vpic *vpic, uint8_t mask)
+{
+ /* Privileged mode access needs to test the lock */
+ ASSERT(vpic_is_locked(vpic));
+ return vpic_get_priority_inline(vpic, mask);
+}
+
+/* syscall handler for the vpic */
+int do_depriv_vpic_get_priority(struct hvm_hw_vpic *vpic, uint8_t mask)
+{
+ /* deprivileged mode already has the lock: no need to assert it */
+ return vpic_get_priority_inline(vpic, mask);
+}
+
/* Return the PIC's highest priority pending interrupt. Return -1 if none. */
static int vpic_get_highest_priority_irq(struct hvm_hw_vpic *vpic)
{
@@ -181,14 +223,84 @@ static int vpic_intack(struct hvm_hw_vpic *vpic)
return irq;
}
+int depriv_vpic_ioport_write(unsigned long *ret_data_ptr,
+ struct hvm_hw_vpic *vpic, int32_t addr,
+ uint32_t val) DEPRIV_TEXT_SEGMENT;
+
static void vpic_ioport_write(
struct hvm_hw_vpic *vpic, uint32_t addr, uint32_t val)
{
- int priority, cmd, irq;
- uint8_t mask, unmasked = 0;
+ struct vcpu *vcpu = get_current();
+ void *p;
+ unsigned long *ret_data_ptr;
+ int ret, irq, unmasked;
+
+ /* Cache the current vpic so we can clean up */
+ vcpu->arch.hvm_vcpu.depriv_cleanup_data = vpic;
vpic_lock(vpic);
+ p = hvm_deprivileged_copy_data_to(vcpu, vpic, sizeof(struct hvm_hw_vpic));
+ ret_data_ptr = (unsigned long*)((unsigned long)p +
+ sizeof(struct hvm_hw_vpic));
+ printk("Entering..%lx\n", (unsigned long)ret_data_ptr);
+ ret = depriv4(DEPRIV_OPERATION_vpic_ioport_write,
+ (unsigned long)ret_data_ptr, (unsigned long)p, addr, val);
+
+ printk("BACK\n");
+ if ( ret )
+ return;
+
+ /*
+ * TODO:
+ * In general: we may have to deal with structures being updated after we
+ * do the copy, this is why we should move to aliasing where possible.
+ * As we hold a lock for the vpic here, it's not a problem in this case.
+ * Though, this lock may be for the other vpic so this _may_ actually be a
+ * problem, see vpic_lock internals...
+ */
+ hvm_deprivileged_copy_data_from(vcpu, vpic, p, sizeof(struct hvm_hw_vpic));
+ vcpu->arch.hvm_vcpu.depriv_data_offset = 0;
+
+ /*
+ * Additional return values are placed after our copied data. We can read
+ * them out from here and then we need to zero out the memory in the data
+ * page to prevent information leakage between deprivileged operations
+ */
+ irq = *ret_data_ptr;
+ *ret_data_ptr = 0;
+
+ unmasked = *(++ret_data_ptr);
+ *ret_data_ptr = 0;
+
+ /* Which follow-up path did the deprivileged operation request? */
+ if ( vcpu->arch.hvm_vcpu.depriv_return_code == 0 )
+ {
+ vpic_update_int_output(vpic);
+
+ vpic_unlock(vpic);
+
+ if ( unmasked )
+ pt_may_unmask_irq(vpic_domain(vpic), NULL);
+ }
+ else
+ {
+ /* Release lock and EOI the physical interrupt (if any). */
+ vpic_update_int_output(vpic);
+ vpic_unlock(vpic);
+ hvm_dpci_eoi(current->domain,
+ hvm_isa_irq_to_gsi((addr >> 7) ? (irq|8) : irq),
+ NULL);
+ }
+}
+
+int depriv_vpic_ioport_write(unsigned long *ret_data_ptr,
+ struct hvm_hw_vpic *vpic, int32_t addr,
+ uint32_t val)
+{
+ int priority, cmd, irq;
+ uint8_t mask, unmasked = 0;
+
if ( (addr & 1) == 0 )
{
if ( val & 0x10 )
@@ -243,7 +355,12 @@ static void vpic_ioport_write(
mask = vpic->isr;
if ( vpic->special_mask_mode )
mask &= ~vpic->imr; /* SMM: ignore masked IRs. */
- priority = vpic_get_priority(vpic, mask);
+
+ /* Deprivileged mode system call */
+ DEPRIV_SYSCALL_CALL2(DEPRIV_SYSCALL_vpic_get_priority,
+ priority, /* Return value */
+ vpic, mask);
+
if ( priority == VPIC_PRIO_NONE )
break;
irq = (priority + vpic->priority_add) & 7;
@@ -257,13 +374,12 @@ static void vpic_ioport_write(
vpic->isr &= ~(1 << irq);
if ( cmd == 7 )
vpic->priority_add = (irq + 1) & 7;
- /* Release lock and EOI the physical interrupt (if any). */
- vpic_update_int_output(vpic);
- vpic_unlock(vpic);
- hvm_dpci_eoi(current->domain,
- hvm_isa_irq_to_gsi((addr >> 7) ? (irq|8) : irq),
- NULL);
- return; /* bail immediately */
+
+ /* Pass data back and indicate which operation we need */
+ *ret_data_ptr = irq;
+ *(++ret_data_ptr) = unmasked;
+
+ return 1; /* bail immediately */
case 6: /* Set Priority */
vpic->priority_add = (val + 1) & 7;
break;
@@ -301,12 +417,9 @@ static void vpic_ioport_write(
}
}
- vpic_update_int_output(vpic);
-
- vpic_unlock(vpic);
-
- if ( unmasked )
- pt_may_unmask_irq(vpic_domain(vpic), NULL);
+ /* Pass data back and indicate which operation we need */
+ *(++ret_data_ptr) = unmasked;
+ return 0;
}
static uint32_t vpic_ioport_read(struct hvm_hw_vpic *vpic, uint32_t addr)
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index f14a845..1a449a5 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1525,7 +1525,10 @@ void do_page_fault(struct cpu_user_regs *regs)
perfc_incr(page_faults);
/* If we get a page fault whilst in HVM deprivileged mode */
- if( hvm_deprivileged_check_trap(__func__) )
+ if ( is_hvm_deprivileged_vcpu() )
+ printk("Addr: %lx code: %d, rip: %lx\n", addr, error_code, regs->rip);
+
+ if ( hvm_deprivileged_check_trap(__func__) )
return;
if ( unlikely(fixup_page_fault(addr, regs) != 0) )
diff --git a/xen/include/asm-x86/hvm/vcpu.h b/xen/include/asm-x86/hvm/vcpu.h
index dcdecf1..482b6ae 100644
--- a/xen/include/asm-x86/hvm/vcpu.h
+++ b/xen/include/asm-x86/hvm/vcpu.h
@@ -217,6 +217,8 @@ struct hvm_vcpu {
unsigned long depriv_destroy;
unsigned long depriv_watchdog_count;
unsigned long depriv_return_code;
+ /* Pointer to data to help with clean up if we have to crash the domain */
+ void *depriv_cleanup_data;
/* Offset into our data page where we can put data for depriv operations */
unsigned long depriv_data_offset;
diff --git a/xen/include/xen/hvm/deprivileged.h b/xen/include/xen/hvm/deprivileged.h
index d7228be..c640788 100644
--- a/xen/include/xen/hvm/deprivileged.h
+++ b/xen/include/xen/hvm/deprivileged.h
@@ -115,6 +115,9 @@ void hvm_deprivileged_restore_stacks(void);
/* Check if we are in deprivileged mode */
int is_hvm_deprivileged_vcpu(void);
+/* Clean up when destroying the domain */
+void hvm_deprivileged_clean_up(struct vcpu *vcpu, unsigned long op);
+
/* The ring 3 code */
void hvm_deprivileged_ring3(void);
--
2.1.4
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH RFC v3 0/6] HVM x86 deprivileged mode summary
2015-09-11 16:08 [PATCH RFC v3 0/6] HVM x86 deprivileged mode summary Ben Catterall
` (5 preceding siblings ...)
2015-09-11 16:08 ` [PATCH RFC v3 6/6] HVM x86 deprivileged mode: Move VPIC to deprivileged mode Ben Catterall
@ 2015-09-11 16:13 ` Ben Catterall
2015-09-23 19:20 ` Konrad Rzeszutek Wilk
7 siblings, 0 replies; 9+ messages in thread
From: Ben Catterall @ 2015-09-11 16:13 UTC (permalink / raw)
To: xen-devel
Cc: keir, ian.campbell, george.dunlap, andrew.cooper3, tim,
Aravind.Gopalakrishnan, jbeulich, boris.ostrovsky,
suravee.suthikulpanit
[-- Attachment #1: Type: text/plain, Size: 5346 bytes --]
Hi all,
Here are two Python scripts which I have used to collect performance
benchmarks for this series. I am putting them here in case they are useful.
Ben
On 11/09/15 17:08, Ben Catterall wrote:
> Hi all,
>
> I have now finished my internship at Citrix and am posting this final version of
> my RFC series. I would like to express my thanks to all of those who have taken
> the time to review, comment and discuss this series, as well as to my colleagues
> who have provided excellent guidance and help. I have learned a great deal and
> have greatly enjoyed working with all of you. Thank you.
>
> Hopefully the series will be beneficial. I believe that it has shown that a
> deprivileged mode in Xen is a possible and viable option, as long as performance
> impact vs security is carefully considered on a case-by-case basis. The end of
> this series contains an example of moving some of the vpic into deprivileged
> mode which has allowed me to test and verify that the feature works. There are
> enhancements and some clean up which is needed but, after that, the feature
> could be deployed to HVM devices currently found in Xen such as the VPIC.
>
> Patches one to four are (hopefully) now fairly stable. Patch 5 is the new
> system call and deprivileged dispatch mode which is new to this series. Patch 6
> is also new and is a demonstration of using this for the vpic and hass mainly
> been used to test and exercise this feature.
>
> As this patch series is in RFC, there are some debug printks which should be
> removed when/if it leaves RFC but, they are useful in fixing the known issue so
> I have left them in until that can be resolved.
>
> There are some efficiency savings that can be made and an instance of a general
> issue (detailed later) which will need to be addressed.
>
> Many thanks once again,
> Ben
>
> TODOs
> -----
> There is a set of TODOs in this patch series, some issues in the later patches
> which need addressing and some other considerations which I've summarised here.
>
> Patch 1:
> - Consider hvm_deprivileged_map_* and an efficiency saving by mapping in larger
> pages. See the TODO at the top of the L4 version of this method.
>
> Patch 2:
> - We have a much more heavyweight version of the deprivileged mode context
> switch after testing for AMD SVM found that this was necessary. However,
> the FPU is currently also saved and this may not be necessary. Consideration
> is needed to work out if we can cut this down even more.
>
> Patch 4:
> - The watchdog timer is hooked currently to kill deprivileged mode operations
> that run for too long and is hardcoded to be at least one watchdog tick and
> at most two. This may want to be refined.
>
> Patch 5:
> - Alias data for deprivileged mode. There is a large comment at the top of
> deprivileged_syscall.c which outlines considerations.
> - Check if we need to map_domain_page the pages when we do the copy in
> hvm_deprivileged_copy_data{to/from}
> - Check for unsigned integer wrapping on addition in
> hvm_deprivileged_copy_data_{to/from}
> - Move hvm_deprivileged_syscall into the syscall macro. It's a stub and
> unless extra code is needed there it can be folded into the macro.
> - Check maintainers' thoughts on the deprivileged mode function checks in
> hvm_deprivileged_user_mode. See the TODO comment.
>
> Patches 5 & 6:
> - Fix/work around the GCC switch statement issue.
>
>
> KNOWN ISSUES
> ------------
> - Page fault for vpic_ioport_write due to GCC switch statements placing the
> jump table in .rodata which is in the privileged mode area.
>
> This has been traced to the first of the switch statements in the function.
> Though other switches in that function may also be affected.
> Compiled using GCC 4.9.2-10.
>
> You can get the offset into this function by doing:
> (RIP - (depriv_vpic_ioport_write - __hvm_deprivileged_text_start))
>
> It appears to be a built-in default of GCC to put switch jump tables in
> .rodata or .text and there does not appear to be a way to change this
> (except to patch the compiler, though hopefully there _is_ another
> option I just haven't been able to find...). Note that GCC will not
> necessarily allocate jump tables for each switch statment, it appears to
> depends on a number of factors such as the optimiser, the number of cases,
> the type of the case, compiler version etc.
>
> Thus, when we relocate a deprivileged method containing code using a switch
> statement which GCC has created a jump table for, this leads to a page
> fault. This is because we have not mapped in the rodata section
> as we should not (depriv should not have access to it).
>
> A workaround would be to patch the generated assembly so that this table is
> moved into hvm_deprivileged.rodata. This can be done by adding,
> .section .hvm_deprivileged.rodata, around the generated table. We can then
> relocate this.
>
> Note that GCC is using RIP-relative addressing for this, so the offset
> of depriv .rodata to the depriv .text segment will need to be the same
> when it is mapped in.
>
>
>
>
>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>
[-- Attachment #2: generator.py --]
[-- Type: text/x-python, Size: 2023 bytes --]
#!/usr/bin/python3
import numpy as np
import matplotlib.pyplot as plt
#Use for running rsync
import subprocess
#Use for command line args
import sys, argparse
import array
import pickle
from matplotlib.pyplot import savefig
'''
Plot the data and output to a file
'''
def plot(xList, filename):
    i = 0
    colours = ["red", "blue"]
    labels = ["Enabled", "Disabled"]
    bins = array.array('f')

    #Get max and min
    max = np.amax(xList[0])
    if np.amax(xList[1]) > max:
        max = np.amax(xList[1])

    min = np.amin(xList[0])
    if np.amin(xList[1]) < min:
        min = np.amin(xList[1])

    intervalBins = np.linspace(min, max, num=100)

    for data in xList:
        n, bins, patches = plt.hist(data, intervalBins, facecolor=colours[i],
                                    alpha=0.75, label=labels[i])
        print(str(n) + " " + str(bins) + " " + str(patches) + "\n")
        i = i + 1
        print("Mean: " + str(np.mean(data)) + " seconds\n")
        print("Std: " + str(np.std(data, dtype=np.float64)) + " seconds\n")
        print("Max: " + str(np.amax(data)) + " Min: " + str(np.amin(data)) + "\n")

    plt.legend(loc='upper right')
    plt.ylabel("Frequency")
    plt.xlabel("Time/s")
    plt.title("Graph of Xen deprivileged user performance")
    plt.ticklabel_format(axis='x', style='sci', scilimits=(-2, 2))
    fg = plt.gcf()
    fg.savefig(filename)
    return

def main():
    #Setup parser
    parser = argparse.ArgumentParser(description="Test HVM depriv performance.")
    parser.add_argument("-p", "--plot", nargs=2,
                        metavar=("[DATA FILE NAME]", "[GRAPH FILE NAME]"),
                        help="Plot the data and output a graph at file name")
    args = parser.parse_args()

    #Process args
    if args.plot:
        print("Plotting...\n")
        #Use Pickle to load in the array
        timeData = pickle.load(open(args.plot[0], "rb"))
        #Parse it and plot
        plot(timeData, args.plot[1])

    print("DONE")

if __name__ == '__main__':
    main()
[-- Attachment #3: hvm_test_depriv.py --]
[-- Type: text/x-python, Size: 3369 bytes --]
#!/usr/bin/python3
'''
To make use of this script, you will need to run it as root and have installed
the Python portio package.
You need to have patched the port I/O handling in vmx.c and svm.c (found at
xen/arch/x86/hvm/{vmx,svm}/) as shown below, so that Xen enters deprivileged
mode when the guest performs a particular port operation.
My series has the basic hook in there in patch 2, but it is commented
out/removed in later patches as there was a functional change (you can no
longer just enter the mode; you need to supply an operation to jump to, so
this will need to be updated to work with that).
The hook goes in vmx_vmexit_handler and svm_vmexit_handler.
Here is the svm part of this copied here for completeness:
    uint16_t port = (vmcb->exitinfo1 >> 16) & 0xFFFF;
    int bytes = ((vmcb->exitinfo1 >> 4) & 0x07);
    int dir = (vmcb->exitinfo1 & 1) ? IOREQ_READ : IOREQ_WRITE;

    /* DEBUG: Run only for a specific port */
    if ( port == 0x1000 )
    {
        if ( guest_cpu_user_regs()->eax == 0x1 )
        {
            hvm_deprivileged_user_mode();
        }
        __update_guest_eip(regs, vmcb->exitinfo2 - vmcb->rip);
        break;
    }

    if ( handle_pio(port, bytes, dir) )
        __update_guest_eip(regs, vmcb->exitinfo2 - vmcb->rip);
}
'''
import array
#Use for command line args
import sys
import pickle
import time
from portio import ioperm, iopl, outb
def prepare_portio(port):
    #Get access to all ports (needed on Linux pre 2.6.? kernel)
    if( iopl(3) != 0 ):
        print("ERROR: Elevating access to ports " + str(port) + "\n")
        return False
    #get access to port, only one port, and enable it
    if( ioperm(port, 1, 1) != 0 ):
        print("ERROR: Preparing port " + str(port) + "\n")
        return False
    return True

"""
Clean up after portio operations
"""
def release_portio(port):
    if( ioperm(port, 1, 0) != 0 ):
        print("ERROR: Releasing port " + str(port) + " \n")
        return False
    if( iopl(0) != 0 ):
        print("ERROR: Clearing access to ports " + str(port) + "\n")
        return False
    return True

'''
Needs root to run
'''
def send_data(port, data):
    outb(data, port)

def profile(outputTimes, numIterations, port, data):
    i = 0
    while (i < numIterations):
        before = time.time()
        send_data(port, data)
        after = time.time()
        timeDelta = after - before
        outputTimes.append(timeDelta)
        i = i + 1

def main():
    outputTimesWith = array.array("f")
    outputTimesWithout = array.array("f")
    portWith = 0x1000
    portWithout = 0x1000
    dataWith = 0x1
    dataWithout = 0x2
    numIterations = 100000

    prepare_portio(portWith)

    #Profile the system
    #With the deprivileged op
    profile(outputTimesWith, numIterations, portWith, dataWith)
    release_portio(portWith)

    #Without the deprivileged op
    prepare_portio(portWithout)
    profile(outputTimesWithout, numIterations, portWithout, dataWithout)
    release_portio(portWithout)

    data = [outputTimesWith, outputTimesWithout]
    outputFile = open("data", "wb")
    pickle.dump(data, outputFile)
    outputFile.close()
    return

if __name__ == '__main__':
    main()
[-- Attachment #4: Type: text/plain, Size: 126 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC v3 0/6] HVM x86 deprivileged mode summary
2015-09-11 16:08 [PATCH RFC v3 0/6] HVM x86 deprivileged mode summary Ben Catterall
` (6 preceding siblings ...)
2015-09-11 16:13 ` [PATCH RFC v3 0/6] HVM x86 deprivileged mode summary Ben Catterall
@ 2015-09-23 19:20 ` Konrad Rzeszutek Wilk
7 siblings, 0 replies; 9+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-09-23 19:20 UTC (permalink / raw)
To: Ben Catterall
Cc: xen-devel, keir, ian.campbell, george.dunlap, andrew.cooper3, tim,
Aravind.Gopalakrishnan, jbeulich, boris.ostrovsky,
suravee.suthikulpanit
On Fri, Sep 11, 2015 at 05:08:31PM +0100, Ben Catterall wrote:
> Hi all,
>
> I have now finished my internship at Citrix and am posting this final version of
> my RFC series. I would like to express my thanks to all of those who have taken
> the time to review, comment and discuss this series, as well as to my colleagues
> who have provided excellent guidance and help. I have learned a great deal and
> have greatly enjoyed working with all of you. Thank you.
Thank you for posting them!
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2015-09-23 19:20 UTC | newest]
Thread overview: 9+ messages
2015-09-11 16:08 [PATCH RFC v3 0/6] HVM x86 deprivileged mode summary Ben Catterall
2015-09-11 16:08 ` [PATCH RFC v3 1/6] HVM x86 deprivileged mode: Create deprivileged page tables Ben Catterall
2015-09-11 16:08 ` [PATCH RFC v3 2/6] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode Ben Catterall
2015-09-11 16:08 ` [PATCH RFC v3 3/6] HVM x86 deprivileged mode: Trap handlers for " Ben Catterall
2015-09-11 16:08 ` [PATCH RFC v3 4/6] HVM x86 deprivileged mode: Watchdog for DoS prevention Ben Catterall
2015-09-11 16:08 ` [PATCH RFC v3 5/6] HVM x86 deprivileged mode: Syscall and deprivileged operation dispatcher Ben Catterall
2015-09-11 16:08 ` [PATCH RFC v3 6/6] HVM x86 deprivileged mode: Move VPIC to deprivileged mode Ben Catterall
2015-09-11 16:13 ` [PATCH RFC v3 0/6] HVM x86 deprivileged mode summary Ben Catterall
2015-09-23 19:20 ` Konrad Rzeszutek Wilk