* [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
This work was developed as an SP3 mitigation, but shelved when it became clear
that it wasn't viable to get done in the timeframe.
To protect against SP3 attacks, most mappings need to be flushed while in
user context. However, to protect against all cross-VM attacks, it is
necessary to ensure that the Xen stacks are not mapped in any other CPU's
address space, or an attacker can still recover at least the GPR state of
separate VMs.
To have isolated stacks, Xen needs a per-pcpu isolated region, which requires
that two pCPUs never share the same %cr3. This is trivial for 32bit PV guests
and HVM guests due to the existing per-vcpu Monitor Tables, but is problematic
for 64bit PV guests, which will run on the same %cr3 when scheduling different
threads from the same process.
To avoid breaking the PV ABI, Xen needs to shadow the guest L4 pagetables if
it wants to maintain the unique %cr3 property it needs.
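To illustrate the shadowing idea only, here is a minimal user-space model with
plain arrays standing in for pagetable frames. The names below are invented
for the sketch and are not the interface pt-shadow.c introduces; the real
logic additionally has to cope with IPIs, locking and map_domain_page()
limitations.

/*
 * Illustration only: each pCPU runs on a private copy of the guest's L4,
 * and guest L4e writes have to be propagated into every live copy.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NR_CPUS    4
#define L4_ENTRIES 512

typedef uint64_t l4e_t;

/* One private top-level table per pCPU; %cr3 would only ever point at these. */
static l4e_t shadow_l4[NR_CPUS][L4_ENTRIES];

/* Context switch path: take a full copy of the guest's intended L4. */
static l4e_t *shadow_guest_l4(unsigned int cpu, const l4e_t *guest_l4)
{
    memcpy(shadow_l4[cpu], guest_l4, sizeof(shadow_l4[cpu]));
    return shadow_l4[cpu]; /* this copy, not the guest frame, goes into %cr3 */
}

/* Guest L4e write path: bring a pCPU's copy up to date (an IPI to the
 * domain dirty mask in the real series). */
static void sync_l4e(unsigned int cpu, const l4e_t *guest_l4, unsigned int slot)
{
    shadow_l4[cpu][slot] = guest_l4[slot];
}

int main(void)
{
    static l4e_t guest_l4[L4_ENTRIES];  /* the frame two vcpus share */
    l4e_t *cr3_cpu0, *cr3_cpu1;

    guest_l4[0] = 0x1027;
    cr3_cpu0 = shadow_guest_l4(0, guest_l4);  /* thread A scheduled on pCPU0 */
    cr3_cpu1 = shadow_guest_l4(1, guest_l4);  /* thread B scheduled on pCPU1 */

    guest_l4[1] = 0x2027;      /* guest updates its L4 ... */
    sync_l4e(0, guest_l4, 1);  /* ... and every shadow must follow */
    sync_l4e(1, guest_l4, 1);

    printf("pCPU0/pCPU1 top-level tables are distinct: %p vs %p\n",
           (void *)cr3_cpu0, (void *)cr3_cpu1);
    return 0;
}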
tl;dr The shadowing algorithm in pt-shadow.c carries too much performance
overhead to be viable, and is very high risk to productise in an embargo window.
If we want to continue down this route, we either need someone to produce a
cleverer alternative to the shadowing algorithm I came up with, or we need to
change the PV ABI to require that VMs not share L4 pagetables.
Either way, these patches are presented to start a discussion of the issues.
The series as a whole is not in a suitable state for committing.
~Andrew
Andrew Cooper (44):
passthrough/vtd: Don't DMA to the stack in queue_invalidate_wait()
x86/idt: Factor out enabling and disabling of ISTs
x86/pv: Rename invalidate_shadow_ldt() to pv_destroy_ldt()
x86/boot: Introduce cpu_smpboot_bsp() to dynamically allocate BSP state
x86/boot: Move arch_init_memory() earlier in the boot sequence
x86/boot: Allocate percpu pagetables for the idle vcpus
x86/boot: Use percpu pagetables for the idle vcpus
x86/pv: Avoid an opencoded mov to %cr3 in toggle_guest_mode()
x86/mm: Track the current %cr3 in a per_cpu variable
x86/pt-shadow: Initial infrastructure for L4 PV pagetable shadowing
x86/pt-shadow: Always set _PAGE_ACCESSED on L4e updates
x86/fixmap: Temporarily add a percpu fixmap range
x86/pt-shadow: Shadow L4 tables from 64bit PV guests
x86/mm: Added safety checks that pagetables aren't shared
x86: Rearrange the virtual layout to introduce a PERCPU linear slot
xen/ipi: Introduce arch_ipi_param_ok() to check IPI parameters
x86/smp: Infrastructure for allocating and freeing percpu pagetables
x86/mm: Maintain the correct percpu mappings on context switch
x86/boot: Defer TSS/IST setup until later during boot on the BSP
x86/smp: Allocate a percpu linear range for the IDT
x86/smp: Switch to using the percpu IDT mappings
x86/mm: Track whether the current cr3 has a short or extended directmap
x86/smp: Allocate percpu resources for map_domain_page() to use
x86/mapcache: Reimplement map_domain_page() from scratch
x86/fixmap: Drop percpu fixmap range
x86/pt-shadow: Maintain a small cache of shadowed frames
x86/smp: Allocate a percpu linear range for the compat translation area.
x86/xlat: Use the percpu compat translation area
x86/smp: Allocate percpu resources for the GDT and LDT
x86/pv: Break handle_ldt_mapping_fault() out of handle_gdt_ldt_mapping_fault()
x86/pv: Drop support for paging out the LDT
x86: Always reload the LDT on vcpu context switch
x86/smp: Use the percpu GDT/LDT mappings
x86: Drop the PERDOMAIN mappings
x86/smp: Allocate the stack in the percpu range
x86/monitor: Capture Xen's intent to use monitor at boot time
x86/misc: Move some IPI parameters off the stack
x86/mca: Move __HYPERVISOR_mca IPI parameters off the stack
x86/smp: Introduce get_smp_ipi_buf() and take more IPI parameters off the stack
x86/boot: Switch the APs to the percpu pagetables before entering C
x86/smp: Switch to using the percpu stacks
x86/smp: Allocate a percpu linear range for the TSS
x86/smp: Use the percpu TSS mapping
misc debugging
xen/arch/x86/acpi/cpu_idle.c | 30 +--
xen/arch/x86/acpi/cpufreq/cpufreq.c | 57 +++--
xen/arch/x86/acpi/cpufreq/powernow.c | 26 +--
xen/arch/x86/acpi/lib.c | 16 +-
xen/arch/x86/boot/x86_64.S | 24 +-
xen/arch/x86/cpu/common.c | 90 +-------
xen/arch/x86/cpu/mcheck/mce.c | 143 +++++++-----
xen/arch/x86/cpu/mtrr/main.c | 27 ++-
xen/arch/x86/domain.c | 94 ++++----
xen/arch/x86/domain_page.c | 353 +++++++++--------------------
xen/arch/x86/domctl.c | 13 +-
xen/arch/x86/efi/efi-boot.h | 8 +-
xen/arch/x86/hvm/hvm.c | 14 --
xen/arch/x86/hvm/save.c | 4 -
xen/arch/x86/hvm/svm/svm.c | 8 +-
xen/arch/x86/hvm/vmx/vmcs.c | 51 ++---
xen/arch/x86/mm.c | 380 ++++++-------------------------
xen/arch/x86/mm/p2m-ept.c | 5 +-
xen/arch/x86/mm/shadow/multi.c | 4 +
xen/arch/x86/platform_hypercall.c | 40 ++--
xen/arch/x86/psr.c | 9 +-
xen/arch/x86/pv/Makefile | 1 +
xen/arch/x86/pv/descriptor-tables.c | 62 ++++-
xen/arch/x86/pv/dom0_build.c | 5 -
xen/arch/x86/pv/domain.c | 55 +----
xen/arch/x86/pv/emulate.h | 4 +-
xen/arch/x86/pv/mm.c | 6 +-
xen/arch/x86/pv/mm.h | 35 ++-
xen/arch/x86/pv/pt-shadow.c | 428 +++++++++++++++++++++++++++++++++++
xen/arch/x86/setup.c | 130 +++++++++--
xen/arch/x86/shutdown.c | 8 +-
xen/arch/x86/smp.c | 2 +
xen/arch/x86/smpboot.c | 399 +++++++++++++++++++++++++++++---
xen/arch/x86/sysctl.c | 10 +-
xen/arch/x86/tboot.c | 29 +--
xen/arch/x86/time.c | 7 +-
xen/arch/x86/traps.c | 328 +++++++++++++++++++++------
xen/arch/x86/x86_64/mm.c | 34 +--
xen/arch/x86/xen.lds.S | 2 +
xen/common/efi/runtime.c | 23 +-
xen/common/smp.c | 1 +
xen/drivers/passthrough/vtd/qinval.c | 8 +-
xen/include/asm-arm/mm.h | 1 -
xen/include/asm-arm/smp.h | 3 +
xen/include/asm-x86/config.h | 77 +++----
xen/include/asm-x86/cpufeature.h | 5 +-
xen/include/asm-x86/cpufeatures.h | 1 +
xen/include/asm-x86/domain.h | 67 +-----
xen/include/asm-x86/hvm/vmx/vmcs.h | 1 -
xen/include/asm-x86/ldt.h | 19 +-
xen/include/asm-x86/mm.h | 32 +--
xen/include/asm-x86/mwait.h | 3 +
xen/include/asm-x86/page.h | 1 +
xen/include/asm-x86/processor.h | 22 +-
xen/include/asm-x86/pv/mm.h | 3 +
xen/include/asm-x86/pv/pt-shadow.h | 100 ++++++++
xen/include/asm-x86/smp.h | 39 ++++
xen/include/asm-x86/system.h | 1 +
xen/include/asm-x86/x86_64/uaccess.h | 6 +-
xen/include/xen/smp.h | 2 -
60 files changed, 2027 insertions(+), 1329 deletions(-)
create mode 100644 xen/arch/x86/pv/pt-shadow.c
create mode 100644 xen/include/asm-x86/pv/pt-shadow.h
--
2.1.4
* [PATCH RFC 01/44] passthrough/vtd: Don't DMA to the stack in queue_invalidate_wait()
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper, Kevin Tian, Julien Grall, Jan Beulich
DMA-ing to the stack is generally considered bad practice. In this case, if a
timeout occurs because of a sluggish device which is processing the request,
the completion notification will corrupt the stack of a subsequent deeper call
tree.
Place the poll_slot in a percpu area and DMA to that instead.
Note: This change does not address other issues with the current
implementation, such as the fact that, once a timeout has been suffered,
subsequent completions can't be correlated with their requests.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Kevin Tian <kevin.tian@intel.com>
CC: Julien Grall <julien.grall@arm.com>
Julien: This wants backporting to all releases, and therefore should be
considered for 4.10 at this point.
v3:
* Add note that there are still outstanding issues.
v2:
* Retain volatile declaration for poll_slot.
* Initialise poll_slot to QINVAL_STAT_INIT on each call.
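As a standalone illustration of the hazard class (deliberately buggy, and not
Xen code), the user-space model below hands a "device" thread the address of
an on-stack slot, times out, and returns; the late completion then scribbles
over memory the caller no longer owns. Whether it visibly corrupts the later
frame depends on the compiler's stack layout, but the write is wild either
way. Build with cc -pthread.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static volatile unsigned int *completion_addr;  /* handed to the "device" */

static void *device_thread(void *arg)
{
    sleep(1);                   /* sluggish device: completes after the timeout */
    *completion_addr = 0xdead;  /* lands on whatever owns that stack slot now */
    return NULL;
}

static void issue_and_wait(pthread_t *t)
{
    volatile unsigned int poll_slot = 0;    /* cf. the on-stack poll_slot */

    completion_addr = &poll_slot;           /* cf. virt_to_maddr(&poll_slot) */
    pthread_create(t, NULL, device_thread, NULL);

    /* Pretend the timeout expired: give up with the request still live. */
}

static void deeper_call_tree(void)
{
    volatile unsigned int important = 1;    /* may reuse poll_slot's old slot */

    sleep(2);                               /* the late completion arrives here */
    printf("important = %#x (expected 0x1)\n", important);
}

int main(void)
{
    pthread_t t;

    issue_and_wait(&t);
    deeper_call_tree();
    pthread_join(t, NULL);
    return 0;
}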
---
xen/drivers/passthrough/vtd/qinval.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/xen/drivers/passthrough/vtd/qinval.c b/xen/drivers/passthrough/vtd/qinval.c
index e95dc54..51aef37 100644
--- a/xen/drivers/passthrough/vtd/qinval.c
+++ b/xen/drivers/passthrough/vtd/qinval.c
@@ -147,13 +147,15 @@ static int __must_check queue_invalidate_wait(struct iommu *iommu,
u8 iflag, u8 sw, u8 fn,
bool_t flush_dev_iotlb)
{
- volatile u32 poll_slot = QINVAL_STAT_INIT;
+ static DEFINE_PER_CPU(volatile u32, poll_slot);
unsigned int index;
unsigned long flags;
u64 entry_base;
struct qinval_entry *qinval_entry, *qinval_entries;
+ volatile u32 *this_poll_slot = &this_cpu(poll_slot);
spin_lock_irqsave(&iommu->register_lock, flags);
+ *this_poll_slot = QINVAL_STAT_INIT;
index = qinval_next_index(iommu);
entry_base = iommu_qi_ctrl(iommu)->qinval_maddr +
((index >> QINVAL_ENTRY_ORDER) << PAGE_SHIFT);
@@ -167,7 +169,7 @@ static int __must_check queue_invalidate_wait(struct iommu *iommu,
qinval_entry->q.inv_wait_dsc.lo.res_1 = 0;
qinval_entry->q.inv_wait_dsc.lo.sdata = QINVAL_STAT_DONE;
qinval_entry->q.inv_wait_dsc.hi.res_1 = 0;
- qinval_entry->q.inv_wait_dsc.hi.saddr = virt_to_maddr(&poll_slot) >> 2;
+ qinval_entry->q.inv_wait_dsc.hi.saddr = virt_to_maddr(this_poll_slot) >> 2;
unmap_vtd_domain_page(qinval_entries);
qinval_update_qtail(iommu, index);
@@ -182,7 +184,7 @@ static int __must_check queue_invalidate_wait(struct iommu *iommu,
timeout = NOW() + MILLISECS(flush_dev_iotlb ?
iommu_dev_iotlb_timeout : VTD_QI_TIMEOUT);
- while ( poll_slot != QINVAL_STAT_DONE )
+ while ( *this_poll_slot != QINVAL_STAT_DONE )
{
if ( NOW() > timeout )
{
--
2.1.4
* [PATCH RFC 02/44] x86/idt: Factor out enabling and disabling of ISTs
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
All alterations of IST settings (other than on the crash path) happen as an
identical triple. Introduce helpers to keep the triple in sync, and to reduce
the risk of opencoded mistakes.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/arch/x86/cpu/common.c | 4 +---
xen/arch/x86/hvm/svm/svm.c | 8 ++------
xen/arch/x86/smpboot.c | 4 +---
xen/arch/x86/traps.c | 4 +---
xen/include/asm-x86/processor.h | 14 ++++++++++++++
5 files changed, 19 insertions(+), 15 deletions(-)
diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index e9588b3..b18e0f4 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -703,9 +703,7 @@ void load_system_tables(void)
ltr(TSS_ENTRY << 3);
lldt(0);
- set_ist(&idt_tables[cpu][TRAP_double_fault], IST_DF);
- set_ist(&idt_tables[cpu][TRAP_nmi], IST_NMI);
- set_ist(&idt_tables[cpu][TRAP_machine_check], IST_MCE);
+ enable_each_ist(idt_tables[cpu]);
/*
* Bottom-of-stack must be 16-byte aligned!
diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index 2e62b9b..7bfb0ba 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -1038,9 +1038,7 @@ static void svm_ctxt_switch_from(struct vcpu *v)
svm_vmload_pa(per_cpu(host_vmcb, cpu));
/* Resume use of ISTs now that the host TR is reinstated. */
- set_ist(&idt_tables[cpu][TRAP_double_fault], IST_DF);
- set_ist(&idt_tables[cpu][TRAP_nmi], IST_NMI);
- set_ist(&idt_tables[cpu][TRAP_machine_check], IST_MCE);
+ enable_each_ist(idt_tables[cpu]);
}
static void svm_ctxt_switch_to(struct vcpu *v)
@@ -1059,9 +1057,7 @@ static void svm_ctxt_switch_to(struct vcpu *v)
* Cannot use ISTs for NMI/#MC/#DF while we are running with the guest TR.
* But this doesn't matter: the IST is only req'd to handle SYSCALL/SYSRET.
*/
- set_ist(&idt_tables[cpu][TRAP_double_fault], IST_NONE);
- set_ist(&idt_tables[cpu][TRAP_nmi], IST_NONE);
- set_ist(&idt_tables[cpu][TRAP_machine_check], IST_NONE);
+ disable_each_ist(idt_tables[cpu]);
svm_restore_dr(v);
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 7b97ff8..e7fa159 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -723,9 +723,7 @@ static int cpu_smpboot_alloc(unsigned int cpu)
if ( idt_tables[cpu] == NULL )
goto out;
memcpy(idt_tables[cpu], idt_table, IDT_ENTRIES * sizeof(idt_entry_t));
- set_ist(&idt_tables[cpu][TRAP_double_fault], IST_NONE);
- set_ist(&idt_tables[cpu][TRAP_nmi], IST_NONE);
- set_ist(&idt_tables[cpu][TRAP_machine_check], IST_NONE);
+ disable_each_ist(idt_tables[cpu]);
for ( stub_page = 0, i = cpu & ~(STUBS_PER_PAGE - 1);
i < nr_cpu_ids && i <= (cpu | (STUBS_PER_PAGE - 1)); ++i )
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index db16a44..d06ad69 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1884,9 +1884,7 @@ void __init init_idt_traps(void)
set_intr_gate(TRAP_simd_error,&simd_coprocessor_error);
/* Specify dedicated interrupt stacks for NMI, #DF, and #MC. */
- set_ist(&idt_table[TRAP_double_fault], IST_DF);
- set_ist(&idt_table[TRAP_nmi], IST_NMI);
- set_ist(&idt_table[TRAP_machine_check], IST_MCE);
+ enable_each_ist(idt_table);
/* CPU0 uses the master IDT. */
idt_tables[0] = idt_table;
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index 41a8d8c..a0c524b 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -457,6 +457,20 @@ static always_inline void set_ist(idt_entry_t *idt, unsigned long ist)
_write_gate_lower(idt, &new);
}
+static inline void enable_each_ist(idt_entry_t *idt)
+{
+ set_ist(&idt[TRAP_double_fault], IST_DF);
+ set_ist(&idt[TRAP_nmi], IST_NMI);
+ set_ist(&idt[TRAP_machine_check], IST_MCE);
+}
+
+static inline void disable_each_ist(idt_entry_t *idt)
+{
+ set_ist(&idt[TRAP_double_fault], IST_NONE);
+ set_ist(&idt[TRAP_nmi], IST_NONE);
+ set_ist(&idt[TRAP_machine_check], IST_NONE);
+}
+
#define IDT_ENTRIES 256
extern idt_entry_t idt_table[];
extern idt_entry_t *idt_tables[];
--
2.1.4
* [PATCH RFC 03/44] x86/pv: Rename invalidate_shadow_ldt() to pv_destroy_ldt()
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
and move it into pv/descriptor-tables.c beside its GDT counterpart. Reduce
the !in_irq() check from a BUG_ON() to an ASSERT().
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
v2:
* New
---
xen/arch/x86/mm.c | 51 ++++---------------------------------
xen/arch/x86/pv/descriptor-tables.c | 42 ++++++++++++++++++++++++++++--
xen/include/asm-x86/pv/mm.h | 3 +++
3 files changed, 48 insertions(+), 48 deletions(-)
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index a56f875..14cfa93 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -125,6 +125,7 @@
#include <asm/hvm/grant_table.h>
#include <asm/pv/grant_table.h>
+#include <asm/pv/mm.h>
#include "pv/mm.h"
@@ -544,48 +545,6 @@ static inline void set_tlbflush_timestamp(struct page_info *page)
const char __section(".bss.page_aligned.const") __aligned(PAGE_SIZE)
zero_page[PAGE_SIZE];
-/*
- * Flush the LDT, dropping any typerefs. Returns a boolean indicating whether
- * mappings have been removed (i.e. a TLB flush is needed).
- */
-static bool invalidate_shadow_ldt(struct vcpu *v)
-{
- l1_pgentry_t *pl1e;
- unsigned int i, mappings_dropped = 0;
- struct page_info *page;
-
- BUG_ON(unlikely(in_irq()));
-
- spin_lock(&v->arch.pv_vcpu.shadow_ldt_lock);
-
- if ( v->arch.pv_vcpu.shadow_ldt_mapcnt == 0 )
- goto out;
-
- pl1e = pv_ldt_ptes(v);
-
- for ( i = 0; i < 16; i++ )
- {
- if ( !(l1e_get_flags(pl1e[i]) & _PAGE_PRESENT) )
- continue;
-
- page = l1e_get_page(pl1e[i]);
- l1e_write(&pl1e[i], l1e_empty());
- mappings_dropped++;
-
- ASSERT_PAGE_IS_TYPE(page, PGT_seg_desc_page);
- ASSERT_PAGE_IS_DOMAIN(page, v->domain);
- put_page_and_type(page);
- }
-
- ASSERT(v->arch.pv_vcpu.shadow_ldt_mapcnt == mappings_dropped);
- v->arch.pv_vcpu.shadow_ldt_mapcnt = 0;
-
- out:
- spin_unlock(&v->arch.pv_vcpu.shadow_ldt_lock);
-
- return mappings_dropped;
-}
-
static int alloc_segdesc_page(struct page_info *page)
{
@@ -1242,7 +1201,7 @@ void put_page_from_l1e(l1_pgentry_t l1e, struct domain *l1e_owner)
{
for_each_vcpu ( pg_owner, v )
{
- if ( invalidate_shadow_ldt(v) )
+ if ( pv_destroy_ldt(v) )
flush_tlb_mask(v->vcpu_dirty_cpumask);
}
}
@@ -2825,7 +2784,7 @@ int new_guest_cr3(mfn_t mfn)
return rc;
}
- invalidate_shadow_ldt(curr); /* Unconditional TLB flush later. */
+ pv_destroy_ldt(curr); /* Unconditional TLB flush later. */
write_ptbase(curr);
return 0;
@@ -2861,7 +2820,7 @@ int new_guest_cr3(mfn_t mfn)
return rc;
}
- invalidate_shadow_ldt(curr); /* Unconditional TLB flush later. */
+ pv_destroy_ldt(curr); /* Unconditional TLB flush later. */
if ( !VM_ASSIST(d, m2p_strict) && !paging_mode_refcounts(d) )
fill_ro_mpt(mfn);
@@ -3368,7 +3327,7 @@ long do_mmuext_op(
else if ( (curr->arch.pv_vcpu.ldt_ents != ents) ||
(curr->arch.pv_vcpu.ldt_base != ptr) )
{
- if ( invalidate_shadow_ldt(curr) )
+ if ( pv_destroy_ldt(curr) )
flush_tlb_local();
curr->arch.pv_vcpu.ldt_base = ptr;
diff --git a/xen/arch/x86/pv/descriptor-tables.c b/xen/arch/x86/pv/descriptor-tables.c
index d1c4296..b418bbb 100644
--- a/xen/arch/x86/pv/descriptor-tables.c
+++ b/xen/arch/x86/pv/descriptor-tables.c
@@ -31,9 +31,47 @@
#undef page_to_mfn
#define page_to_mfn(pg) _mfn(__page_to_mfn(pg))
-/*******************
- * Descriptor Tables
+/*
+ * Flush the LDT, dropping any typerefs. Returns a boolean indicating whether
+ * mappings have been removed (i.e. a TLB flush is needed).
*/
+bool pv_destroy_ldt(struct vcpu *v)
+{
+ l1_pgentry_t *pl1e;
+ unsigned int i, mappings_dropped = 0;
+ struct page_info *page;
+
+ ASSERT(!in_irq());
+
+ spin_lock(&v->arch.pv_vcpu.shadow_ldt_lock);
+
+ if ( v->arch.pv_vcpu.shadow_ldt_mapcnt == 0 )
+ goto out;
+
+ pl1e = pv_ldt_ptes(v);
+
+ for ( i = 0; i < 16; i++ )
+ {
+ if ( !(l1e_get_flags(pl1e[i]) & _PAGE_PRESENT) )
+ continue;
+
+ page = l1e_get_page(pl1e[i]);
+ l1e_write(&pl1e[i], l1e_empty());
+ mappings_dropped++;
+
+ ASSERT_PAGE_IS_TYPE(page, PGT_seg_desc_page);
+ ASSERT_PAGE_IS_DOMAIN(page, v->domain);
+ put_page_and_type(page);
+ }
+
+ ASSERT(v->arch.pv_vcpu.shadow_ldt_mapcnt == mappings_dropped);
+ v->arch.pv_vcpu.shadow_ldt_mapcnt = 0;
+
+ out:
+ spin_unlock(&v->arch.pv_vcpu.shadow_ldt_lock);
+
+ return mappings_dropped;
+}
void pv_destroy_gdt(struct vcpu *v)
{
diff --git a/xen/include/asm-x86/pv/mm.h b/xen/include/asm-x86/pv/mm.h
index 5d2fe4c..246b990 100644
--- a/xen/include/asm-x86/pv/mm.h
+++ b/xen/include/asm-x86/pv/mm.h
@@ -29,6 +29,7 @@ long pv_set_gdt(struct vcpu *v, unsigned long *frames, unsigned int entries);
void pv_destroy_gdt(struct vcpu *v);
bool pv_map_ldt_shadow_page(unsigned int off);
+bool pv_destroy_ldt(struct vcpu *v);
#else
@@ -48,6 +49,8 @@ static inline long pv_set_gdt(struct vcpu *v, unsigned long *frames,
static inline void pv_destroy_gdt(struct vcpu *v) { ASSERT_UNREACHABLE(); }
static inline bool pv_map_ldt_shadow_page(unsigned int off) { return false; }
+static inline bool pv_destroy_ldt(struct vcpu *v)
+{ ASSERT_UNREACHABLE(); return false; }
#endif
--
2.1.4
* [PATCH RFC 04/44] x86/boot: Introduce cpu_smpboot_bsp() to dynamically allocate BSP state
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
Move the existing stub allocation into the new function, and call it before
initialising the idle domain; eventually it will allocate the pagetables for
the idle vcpu to use.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
xen/arch/x86/setup.c | 6 ++----
xen/arch/x86/smpboot.c | 15 +++++++++++++++
xen/include/asm-x86/smp.h | 1 +
3 files changed, 18 insertions(+), 4 deletions(-)
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 2e10c6b..64286f7 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1502,11 +1502,9 @@ void __init noreturn __start_xen(unsigned long mbi_p)
if ( cpu_has_fsgsbase )
set_in_cr4(X86_CR4_FSGSBASE);
- init_idle_domain();
+ cpu_smpboot_bsp();
- this_cpu(stubs.addr) = alloc_stub_page(smp_processor_id(),
- &this_cpu(stubs).mfn);
- BUG_ON(!this_cpu(stubs.addr));
+ init_idle_domain();
trap_init();
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index e7fa159..36b87dd 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -756,6 +756,21 @@ static int cpu_smpboot_alloc(unsigned int cpu)
return rc;
}
+void __init cpu_smpboot_bsp(void)
+{
+ unsigned int cpu = smp_processor_id();
+ int rc = -ENOMEM;
+
+ if ( (per_cpu(stubs.addr, cpu) =
+ alloc_stub_page(cpu, &per_cpu(stubs, cpu).mfn)) == 0 )
+ goto err;
+
+ return;
+
+ err:
+ panic("Error preparing BSP smpboot data: %d", rc);
+}
+
static int cpu_smpboot_callback(
struct notifier_block *nfb, unsigned long action, void *hcpu)
{
diff --git a/xen/include/asm-x86/smp.h b/xen/include/asm-x86/smp.h
index 4e5f673..409f3af 100644
--- a/xen/include/asm-x86/smp.h
+++ b/xen/include/asm-x86/smp.h
@@ -53,6 +53,7 @@ int cpu_add(uint32_t apic_id, uint32_t acpi_id, uint32_t pxm);
void __stop_this_cpu(void);
+void cpu_smpboot_bsp(void);
long cpu_up_helper(void *data);
long cpu_down_helper(void *data);
--
2.1.4
* [PATCH RFC 05/44] x86/boot: Move arch_init_memory() earlier in the boot sequence
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
idle_pg_table[] needs all slots populated before it is copied to create the
vcpu idle pagetables. One missing slot is for MMCFG, which is now allocated
early.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/arch/x86/setup.c | 4 ++--
xen/arch/x86/x86_64/mm.c | 15 +++++++++++++++
2 files changed, 17 insertions(+), 2 deletions(-)
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 64286f7..4aff5bd 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1502,6 +1502,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
if ( cpu_has_fsgsbase )
set_in_cr4(X86_CR4_FSGSBASE);
+ arch_init_memory();
+
cpu_smpboot_bsp();
init_idle_domain();
@@ -1512,8 +1514,6 @@ void __init noreturn __start_xen(unsigned long mbi_p)
early_time_init();
- arch_init_memory();
-
alternative_instructions();
local_irq_enable();
diff --git a/xen/arch/x86/x86_64/mm.c b/xen/arch/x86/x86_64/mm.c
index 9b37da6..68eee30 100644
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -831,6 +831,7 @@ static int extend_frame_table(struct mem_hotadd_info *info)
void __init subarch_init_memory(void)
{
unsigned long i, n, v, m2p_start_mfn;
+ l4_pgentry_t *pl4e;
l3_pgentry_t l3e;
l2_pgentry_t l2e;
@@ -886,6 +887,20 @@ void __init subarch_init_memory(void)
}
}
+ /* Create an L3 table for the MMCFG region, or remap it NX. */
+ pl4e = &idle_pg_table[l4_table_offset(PCI_MCFG_VIRT_START)];
+ if ( !(l4e_get_flags(*pl4e) & _PAGE_PRESENT) )
+ {
+ l3_pgentry_t *l3t = alloc_xen_pagetable();
+
+ BUG_ON(!l3t);
+
+ clear_page(l3t);
+ *pl4e = l4e_from_paddr(virt_to_maddr(l3t), __PAGE_HYPERVISOR_RW);
+ }
+ else
+ l4e_add_flags(*pl4e, _PAGE_NX_BIT);
+
/* Mark all of direct map NX if hardware supports it. */
if ( !cpu_has_nx )
return;
--
2.1.4
* [PATCH RFC 06/44] x86/boot: Allocate percpu pagetables for the idle vcpus
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
Introduce cpu_smpboot_alloc_common() for state shared between
cpu_smpboot_alloc() and cpu_smpboot_bsp().
It is now a requirement that the cpu_smpboot_nfb notifier runs between
allocating the percpu areas and calling into the scheduler logic.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/arch/x86/smpboot.c | 51 ++++++++++++++++++++++++++++++++++++++++++++---
xen/include/asm-x86/smp.h | 2 ++
2 files changed, 50 insertions(+), 3 deletions(-)
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 36b87dd..221d9c7 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -56,6 +56,8 @@
unsigned long __read_mostly trampoline_phys;
+DEFINE_PER_CPU_READ_MOSTLY(paddr_t, percpu_idle_pt);
+
/* representing HT siblings of each logical CPU */
DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_mask);
/* representing HT and core siblings of each logical CPU */
@@ -633,6 +635,36 @@ void cpu_exit_clear(unsigned int cpu)
set_cpu_state(CPU_STATE_DEAD);
}
+/* Allocate data common between the BSP and APs. */
+static int cpu_smpboot_alloc_common(unsigned int cpu)
+{
+ unsigned int memflags = 0;
+ nodeid_t node = cpu_to_node(cpu);
+ l4_pgentry_t *l4t = NULL;
+ struct page_info *pg;
+ int rc = -ENOMEM;
+
+ if ( node != NUMA_NO_NODE )
+ memflags = MEMF_node(node);
+
+ /* Percpu L4 table, used by the idle cpus. */
+ pg = alloc_domheap_page(NULL, memflags);
+ if ( !pg )
+ goto out;
+ per_cpu(percpu_idle_pt, cpu) = page_to_maddr(pg);
+ l4t = __map_domain_page(pg);
+ clear_page(l4t);
+ init_xen_l4_slots(l4t, page_to_mfn(pg), NULL, INVALID_MFN, false);
+
+ rc = 0; /* Success */
+
+ out:
+ if ( l4t )
+ unmap_domain_page(l4t);
+
+ return rc;
+}
+
static void cpu_smpboot_free(unsigned int cpu)
{
unsigned int order, socket = cpu_to_socket(cpu);
@@ -686,6 +718,12 @@ static void cpu_smpboot_free(unsigned int cpu)
free_xenheap_pages(stack_base[cpu], STACK_ORDER);
stack_base[cpu] = NULL;
}
+
+ if ( per_cpu(percpu_idle_pt, cpu) )
+ {
+ free_domheap_page(maddr_to_page(per_cpu(percpu_idle_pt, cpu)));
+ per_cpu(percpu_idle_pt, cpu) = 0;
+ }
}
static int cpu_smpboot_alloc(unsigned int cpu)
@@ -747,7 +785,7 @@ static int cpu_smpboot_alloc(unsigned int cpu)
alloc_cpumask_var(&per_cpu(scratch_cpumask, cpu))) )
goto out;
- rc = 0;
+ rc = cpu_smpboot_alloc_common(cpu);
out:
if ( rc )
@@ -759,11 +797,17 @@ static int cpu_smpboot_alloc(unsigned int cpu)
void __init cpu_smpboot_bsp(void)
{
unsigned int cpu = smp_processor_id();
- int rc = -ENOMEM;
+ int rc;
+
+ if ( (rc = cpu_smpboot_alloc_common(cpu)) != 0 )
+ goto err;
if ( (per_cpu(stubs.addr, cpu) =
alloc_stub_page(cpu, &per_cpu(stubs, cpu).mfn)) == 0 )
+ {
+ rc = -ENOMEM;
goto err;
+ }
return;
@@ -794,7 +838,8 @@ static int cpu_smpboot_callback(
}
static struct notifier_block cpu_smpboot_nfb = {
- .notifier_call = cpu_smpboot_callback
+ .notifier_call = cpu_smpboot_callback,
+ .priority = 99, /* Must be after percpu area, before idle vcpu. */
};
void __init smp_prepare_cpus(unsigned int max_cpus)
diff --git a/xen/include/asm-x86/smp.h b/xen/include/asm-x86/smp.h
index 409f3af..7fcc946 100644
--- a/xen/include/asm-x86/smp.h
+++ b/xen/include/asm-x86/smp.h
@@ -19,6 +19,8 @@
#define INVALID_CUID (~0U) /* AMD Compute Unit ID */
#ifndef __ASSEMBLY__
+DECLARE_PER_CPU(paddr_t, percpu_idle_pt);
+
/*
* Private routines/data
*/
--
2.1.4
* [PATCH RFC 07/44] x86/boot: Use percpu pagetables for the idle vcpus
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
Introduce early_switch_to_idle() to replace the opencoded switching to idle
context in the BSP and AP boot paths, and extend it to switch away from
idle_pg_table[] as well.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
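For reference, the inline asm added below toggles CR4.PGE around the %cr3
write: clearing PGE invalidates global TLB entries, which a plain %cr3 write
would otherwise leave behind from the previous pagetables. A rough C-level
restatement of the sequence (illustrative only; the patch keeps all three
moves in a single asm block):

static void switch_cr3_flushing_globals(unsigned long cr3)
{
    unsigned long cr4 = read_cr4();

    write_cr4(cr4 & ~X86_CR4_PGE);                            /* flush global entries */
    asm volatile ( "mov %0, %%cr3" :: "r" (cr3) : "memory" ); /* flush the rest */
    write_cr4(cr4);                                           /* re-enable global pages */
}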
xen/arch/x86/domain.c | 4 +++-
xen/arch/x86/domain_page.c | 2 +-
xen/arch/x86/setup.c | 22 ++++++++++++++++++++--
xen/arch/x86/smpboot.c | 6 ++++--
xen/include/asm-x86/system.h | 1 +
5 files changed, 29 insertions(+), 6 deletions(-)
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 0ae715d..93e81c0 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -349,7 +349,9 @@ int vcpu_initialise(struct vcpu *v)
else
{
/* Idle domain */
- v->arch.cr3 = __pa(idle_pg_table);
+ v->arch.cr3 = per_cpu(percpu_idle_pt, v->vcpu_id);
+ BUG_ON(!v->arch.cr3); /* Had better be initialised... */
+
rc = 0;
v->arch.msr = ZERO_BLOCK_PTR; /* Catch stray misuses */
}
diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index 3432a85..8f2bcd4 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -51,7 +51,7 @@ static inline struct vcpu *mapcache_current_vcpu(void)
if ( (v = idle_vcpu[smp_processor_id()]) == current )
sync_local_execstate();
/* We must now be running on the idle page table. */
- ASSERT(read_cr3() == __pa(idle_pg_table));
+ ASSERT(read_cr3() == this_cpu(percpu_idle_pt));
}
return v;
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 4aff5bd..b8e52cf 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -237,11 +237,29 @@ void __init discard_initial_images(void)
extern char __init_begin[], __init_end[], __bss_start[], __bss_end[];
+void early_switch_to_idle(void)
+{
+ unsigned int cpu = smp_processor_id();
+ struct vcpu *v = idle_vcpu[cpu];
+ unsigned long cr4 = read_cr4();
+
+ set_current(v);
+ per_cpu(curr_vcpu, cpu) = v;
+
+ asm volatile ( "mov %[npge], %%cr4;"
+ "mov %[cr3], %%cr3;"
+ "mov %[pge], %%cr4;"
+ ::
+ [npge] "r" (cr4 & ~X86_CR4_PGE),
+ [cr3] "r" (v->arch.cr3),
+ [pge] "r" (cr4)
+ : "memory" );
+}
+
static void __init init_idle_domain(void)
{
scheduler_init();
- set_current(idle_vcpu[0]);
- this_cpu(curr_vcpu) = current;
+ early_switch_to_idle();
}
void srat_detect_node(int cpu)
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 221d9c7..ae39b48 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -307,8 +307,10 @@ void start_secondary(void *unused)
/* Critical region without IDT or TSS. Any fault is deadly! */
set_processor_id(cpu);
- set_current(idle_vcpu[cpu]);
- this_cpu(curr_vcpu) = idle_vcpu[cpu];
+ get_cpu_info()->cr4 = XEN_MINIMAL_CR4;
+
+ early_switch_to_idle();
+
rdmsrl(MSR_EFER, this_cpu(efer));
/*
diff --git a/xen/include/asm-x86/system.h b/xen/include/asm-x86/system.h
index 8ac1703..ee57631 100644
--- a/xen/include/asm-x86/system.h
+++ b/xen/include/asm-x86/system.h
@@ -230,6 +230,7 @@ static inline int local_irq_is_enabled(void)
void trap_init(void);
void init_idt_traps(void);
+void early_switch_to_idle(void);
void load_system_tables(void);
void percpu_traps_init(void);
void subarch_percpu_traps_init(void);
--
2.1.4
* [PATCH RFC 08/44] x86/pv: Avoid an opencoded mov to %cr3 in toggle_guest_mode()
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
Xen will need to track which %cr3 it is running on. Propagate a
tlb_maintenance parameter down into write_ptbase(), so toggle_guest_mode() can
retain its optimisation of not flushing global mappings and not ticking the
TLB clock.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/arch/x86/mm.c | 9 +++++++--
xen/arch/x86/pv/domain.c | 2 +-
xen/include/asm-x86/processor.h | 6 +++++-
3 files changed, 13 insertions(+), 4 deletions(-)
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 14cfa93..25f9588 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -497,9 +497,14 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
}
-void write_ptbase(struct vcpu *v)
+void do_write_ptbase(struct vcpu *v, bool tlb_maintenance)
{
- write_cr3(v->arch.cr3);
+ unsigned long new_cr3 = v->arch.cr3;
+
+ if ( tlb_maintenance )
+ write_cr3(new_cr3);
+ else
+ asm volatile ( "mov %0, %%cr3" :: "r" (new_cr3) : "memory" );
}
/*
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 2234128..7e4566d 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -246,7 +246,7 @@ void toggle_guest_pt(struct vcpu *v)
v->arch.flags ^= TF_kernel_mode;
update_cr3(v);
/* Don't flush user global mappings from the TLB. Don't tick TLB clock. */
- asm volatile ( "mov %0, %%cr3" : : "r" (v->arch.cr3) : "memory" );
+ do_write_ptbase(v, false);
if ( !(v->arch.flags & TF_kernel_mode) )
return;
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index a0c524b..c206080 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -479,7 +479,11 @@ DECLARE_PER_CPU(struct tss_struct, init_tss);
extern void init_int80_direct_trap(struct vcpu *v);
-extern void write_ptbase(struct vcpu *v);
+extern void do_write_ptbase(struct vcpu *v, bool tlb_maintenance);
+static inline void write_ptbase(struct vcpu *v)
+{
+ do_write_ptbase(v, true);
+}
/* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */
static always_inline void rep_nop(void)
--
2.1.4
* [PATCH RFC 09/44] x86/mm: Track the current %cr3 in a per_cpu variable
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
... and assert that it isn't changing under our feet. early_switch_to_idle()
is adjusted to set the tracked value initially, when switching off
idle_pg_table[]. EFI Runtime Service handling happens synchronously and under
lock, so it doesn't interact with this path.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/arch/x86/mm.c | 9 +++++++++
xen/arch/x86/setup.c | 2 ++
xen/include/asm-x86/mm.h | 2 ++
3 files changed, 13 insertions(+)
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 25f9588..f85ef6c 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -497,14 +497,23 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
}
+DEFINE_PER_CPU(unsigned long, curr_ptbase);
+
void do_write_ptbase(struct vcpu *v, bool tlb_maintenance)
{
unsigned long new_cr3 = v->arch.cr3;
+ unsigned int cpu = smp_processor_id();
+ unsigned long *this_curr_ptbase = &per_cpu(curr_ptbase, cpu);
+
+ /* Check that %cr3 isn't being shuffled under our feet. */
+ ASSERT(*this_curr_ptbase == read_cr3());
if ( tlb_maintenance )
write_cr3(new_cr3);
else
asm volatile ( "mov %0, %%cr3" :: "r" (new_cr3) : "memory" );
+
+ *this_curr_ptbase = new_cr3;
}
/*
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index b8e52cf..7a05a7c 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -254,6 +254,8 @@ void early_switch_to_idle(void)
[cr3] "r" (v->arch.cr3),
[pge] "r" (cr4)
: "memory" );
+
+ per_cpu(curr_ptbase, cpu) = v->arch.cr3;
}
static void __init init_idle_domain(void)
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 4af6b23..ceb7dd4 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -561,6 +561,8 @@ void audit_domains(void);
#endif
+DECLARE_PER_CPU(unsigned long, curr_ptbase);
+
void make_cr3(struct vcpu *v, mfn_t mfn);
void update_cr3(struct vcpu *v);
int vcpu_destroy_pagetables(struct vcpu *);
--
2.1.4
* [PATCH RFC 10/44] x86/pt-shadow: Initial infrastructure for L4 PV pagetable shadowing
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
v3:
* Switch to using a single structure per cpu, rather than multiple fields.
---
xen/arch/x86/pv/Makefile | 1 +
xen/arch/x86/pv/pt-shadow.c | 86 ++++++++++++++++++++++++++++++++++++++
xen/arch/x86/smpboot.c | 7 ++++
xen/include/asm-x86/pv/pt-shadow.h | 50 ++++++++++++++++++++++
4 files changed, 144 insertions(+)
create mode 100644 xen/arch/x86/pv/pt-shadow.c
create mode 100644 xen/include/asm-x86/pv/pt-shadow.h
diff --git a/xen/arch/x86/pv/Makefile b/xen/arch/x86/pv/Makefile
index bac2792..acff2bc 100644
--- a/xen/arch/x86/pv/Makefile
+++ b/xen/arch/x86/pv/Makefile
@@ -10,6 +10,7 @@ obj-y += hypercall.o
obj-y += iret.o
obj-y += misc-hypercalls.o
obj-y += mm.o
+obj-y += pt-shadow.o
obj-y += ro-page-fault.o
obj-y += traps.o
diff --git a/xen/arch/x86/pv/pt-shadow.c b/xen/arch/x86/pv/pt-shadow.c
new file mode 100644
index 0000000..7db8efb
--- /dev/null
+++ b/xen/arch/x86/pv/pt-shadow.c
@@ -0,0 +1,86 @@
+/*
+ * arch/x86/pv/pt-shadow.c
+ *
+ * PV Pagetable shadowing logic to allow Xen to run with per-pcpu pagetables.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; If not, see <http://www.gnu.org/licenses/>.
+ *
+ * Copyright (c) 2017 Citrix Systems Ltd.
+ */
+#include <xen/domain_page.h>
+#include <xen/mm.h>
+#include <xen/numa.h>
+
+#include <asm/pv/pt-shadow.h>
+
+struct pt_shadow {
+ /*
+ * A frame used to shadow a vcpus intended pagetable. When shadowing,
+ * this frame is the one actually referenced by %cr3.
+ */
+ paddr_t shadow_l4;
+ l4_pgentry_t *shadow_l4_va;
+};
+
+static DEFINE_PER_CPU(struct pt_shadow, ptsh);
+
+int pt_shadow_alloc(unsigned int cpu)
+{
+ struct pt_shadow *ptsh = &per_cpu(ptsh, cpu);
+ unsigned int memflags = 0;
+ nodeid_t node = cpu_to_node(cpu);
+ struct page_info *pg;
+
+ if ( node != NUMA_NO_NODE )
+ memflags = MEMF_node(node);
+
+ pg = alloc_domheap_page(NULL, memflags);
+ if ( !pg )
+ return -ENOMEM;
+
+ ptsh->shadow_l4 = page_to_maddr(pg);
+
+ ptsh->shadow_l4_va = __map_domain_page_global(pg);
+ if ( !ptsh->shadow_l4_va )
+ return -ENOMEM;
+
+ return 0;
+}
+
+void pt_shadow_free(unsigned int cpu)
+{
+ struct pt_shadow *ptsh = &per_cpu(ptsh, cpu);
+
+ if ( ptsh->shadow_l4_va )
+ {
+ unmap_domain_page_global(ptsh->shadow_l4_va);
+ ptsh->shadow_l4_va = NULL;
+ }
+
+ if ( ptsh->shadow_l4 )
+ {
+ free_domheap_page(maddr_to_page(ptsh->shadow_l4));
+ ptsh->shadow_l4 = 0;
+ }
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index ae39b48..a855301 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -40,6 +40,7 @@
#include <asm/flushtlb.h>
#include <asm/msr.h>
#include <asm/mtrr.h>
+#include <asm/pv/pt-shadow.h>
#include <asm/time.h>
#include <asm/tboot.h>
#include <mach_apic.h>
@@ -658,6 +659,10 @@ static int cpu_smpboot_alloc_common(unsigned int cpu)
clear_page(l4t);
init_xen_l4_slots(l4t, page_to_mfn(pg), NULL, INVALID_MFN, false);
+ rc = pt_shadow_alloc(cpu);
+ if ( rc )
+ goto out;
+
rc = 0; /* Success */
out:
@@ -726,6 +731,8 @@ static void cpu_smpboot_free(unsigned int cpu)
free_domheap_page(maddr_to_page(per_cpu(percpu_idle_pt, cpu)));
per_cpu(percpu_idle_pt, cpu) = 0;
}
+
+ pt_shadow_free(cpu);
}
static int cpu_smpboot_alloc(unsigned int cpu)
diff --git a/xen/include/asm-x86/pv/pt-shadow.h b/xen/include/asm-x86/pv/pt-shadow.h
new file mode 100644
index 0000000..ff99c85
--- /dev/null
+++ b/xen/include/asm-x86/pv/pt-shadow.h
@@ -0,0 +1,50 @@
+/*
+ * include/asm-x86/pv/pt-shadow.h
+ *
+ * PV Pagetable shadowing logic to allow Xen to run with per-pcpu pagetables.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; If not, see <http://www.gnu.org/licenses/>.
+ *
+ * Copyright (c) 2017 Citrix Systems Ltd.
+ */
+#ifndef __X86_PV_PT_SHADOW_H__
+#define __X86_PV_PT_SHADOW_H__
+
+#ifdef CONFIG_PV
+
+/*
+ * Allocate and free per-pcpu resources for pagetable shadowing. If alloc()
+ * returns nonzero, it is the caller's responsibility to call free().
+ */
+int pt_shadow_alloc(unsigned int cpu);
+void pt_shadow_free(unsigned int cpu);
+
+#else /* !CONFIG_PV */
+
+static inline int pt_shadow_alloc(unsigned int cpu) { return 0; }
+static inline void pt_shadow_free(unsigned int cpu) { }
+
+#endif /* CONFIG_PV */
+
+#endif /* __X86_PV_PT_SHADOW_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
--
2.1.4
* [PATCH RFC 11/44] x86/pt-shadow: Always set _PAGE_ACCESSED on L4e updates
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
xen/arch/x86/pv/mm.h | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
diff --git a/xen/arch/x86/pv/mm.h b/xen/arch/x86/pv/mm.h
index 7502d53..a10b09a 100644
--- a/xen/arch/x86/pv/mm.h
+++ b/xen/arch/x86/pv/mm.h
@@ -144,9 +144,22 @@ static inline l3_pgentry_t unadjust_guest_l3e(l3_pgentry_t l3e,
static inline l4_pgentry_t adjust_guest_l4e(l4_pgentry_t l4e,
const struct domain *d)
{
- if ( likely(l4e_get_flags(l4e) & _PAGE_PRESENT) &&
- likely(!is_pv_32bit_domain(d)) )
- l4e_add_flags(l4e, _PAGE_USER);
+ /*
+ * When shadowing an L4 for per-pcpu purposes, we cannot efficiently sync
+ * access bit updates from hardware (on the shadow tables) back into the
+ * guest view. We therefore always set _PAGE_ACCESSED even in the guests
+ * view.
+ *
+ * This will appear to the guest as a CPU which proactively pulls all
+ * valid L4e's into its TLB, which is compatible with the x86 ABI.
+ *
+ * Furthermore, at the time of writing, all PV guests I can locate choose
+ * to set the access bit anyway, so this is no actual change in their
+ * behaviour.
+ */
+ if ( likely(l4e_get_flags(l4e) & _PAGE_PRESENT) )
+ l4e_add_flags(l4e, (_PAGE_ACCESSED |
+ (is_pv_32bit_domain(d) ? 0 : _PAGE_USER)));
return l4e;
}
--
2.1.4
* [PATCH RFC 12/44] x86/fixmap: Temporarily add a percpu fixmap range
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
This is required to implement an opencoded version of map_domain_page() during
context switch. The range must fit within l1_fixmap[], which imposes an upper
limit on NR_CPUS.
The limit is currently 509, but will be lifted by later changes.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/include/asm-x86/fixmap.h | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/xen/include/asm-x86/fixmap.h b/xen/include/asm-x86/fixmap.h
index 89bf6cb..d46939a 100644
--- a/xen/include/asm-x86/fixmap.h
+++ b/xen/include/asm-x86/fixmap.h
@@ -27,6 +27,8 @@
#include <asm/msi.h>
#include <acpi/apei.h>
+#define NR_PERCPU_SLOTS 1
+
/*
* Here we define all the compile-time 'special' virtual
* addresses. The point is to have a constant address at
@@ -45,6 +47,8 @@ enum fixed_addresses {
FIX_COM_BEGIN,
FIX_COM_END,
FIX_EHCI_DBGP,
+ FIX_PERCPU_BEGIN,
+ FIX_PERCPU_END = FIX_PERCPU_BEGIN + (NR_CPUS - 1) * NR_PERCPU_SLOTS,
/* Everything else should go further down. */
FIX_APIC_BASE,
FIX_IO_APIC_BASE_0,
@@ -87,6 +91,32 @@ static inline unsigned long virt_to_fix(const unsigned long vaddr)
return __virt_to_fix(vaddr);
}
+static inline void *percpu_fix_to_virt(unsigned int cpu, unsigned int slot)
+{
+ return (void *)fix_to_virt(FIX_PERCPU_BEGIN + (slot * NR_CPUS) + cpu);
+}
+
+static inline l1_pgentry_t *percpu_fixmap_l1e(unsigned int cpu, unsigned int slot)
+{
+ BUILD_BUG_ON(FIX_PERCPU_END >= L1_PAGETABLE_ENTRIES);
+
+ return &l1_fixmap[l1_table_offset((unsigned long)percpu_fix_to_virt(cpu, slot))];
+}
+
+static inline void set_percpu_fixmap(unsigned int cpu, unsigned int slot, l1_pgentry_t l1e)
+{
+ l1_pgentry_t *pl1e = percpu_fixmap_l1e(cpu, slot);
+
+ if ( l1e_get_intpte(*pl1e) != l1e_get_intpte(l1e) )
+ {
+ *pl1e = l1e;
+
+ __asm__ __volatile__ ( "invlpg %0"
+ :: "m" (*(char *)percpu_fix_to_virt(cpu, slot))
+ : "memory" );
+ }
+}
+
#endif /* __ASSEMBLY__ */
#endif
--
2.1.4
* [PATCH RFC 13/44] x86/pt-shadow: Shadow L4 tables from 64bit PV guests
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
See the code comments for reasoning and the algorithm description.
This is a very simplistic algorithm, which comes with a substantial
performance overhead. The algorithm will be improved in a later patch, once
more infrastructure is in place.
Some of the code (particularly in pt_maybe_shadow()) is structured oddly.
This is deliberate, to simplify the later algorithm-improvement patch and to
avoid unnecessary code motion getting in the way of the logical change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
v3:
* Rebase over change to using ptsh
* Rework, in terms of being as close to the eventual algorithm as possible,
before we get map_domain_page() which is usable in context switch context.
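For orientation, a paraphrased sketch of the write-propagation half of the
algorithm. The names (struct ptsh_ipi_info, _pt_shadow_ipi(),
pt_shadow_l4_write()) are the ones the patch uses, but the bodies below are
simplified and are not the patch's code, which additionally deals with
locking, the percpu fixmap and the contexts it can be invoked from.

/* Sketch only -- see pt-shadow.c below for the real implementation. */
static void _pt_shadow_ipi(void *arg)
{
    struct pt_shadow *ptsh = &this_cpu(ptsh);
    const struct ptsh_ipi_info *info = arg;
    const l4_pgentry_t *guest_l4;

    /* Not shadowing this domain, or shadowing a different L4?  Nothing to do. */
    if ( ptsh->domain != info->d ||
         ptsh->shadowing != page_to_maddr(info->pg) )
        return;

    /* Pull the single updated slot into this pcpu's shadow. */
    guest_l4 = map_domain_page(page_to_mfn(info->pg));
    ptsh->shadow_l4_va[info->slot] = guest_l4[info->slot];
    unmap_domain_page(guest_l4);
}

void pt_shadow_l4_write(const struct domain *d, const struct page_info *pg,
                        unsigned int slot)
{
    struct ptsh_ipi_info info = {
        .d = d, .pg = pg, .op = PTSH_IPI_WRITE, .slot = slot,
    };

    /* Only pcpus which have recently run this domain can hold a shadow. */
    on_selected_cpus(d->domain_dirty_cpumask, _pt_shadow_ipi, &info, 1);
}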
---
xen/arch/x86/mm.c | 5 +-
xen/arch/x86/mm/shadow/multi.c | 2 +
xen/arch/x86/pv/mm.h | 16 +++-
xen/arch/x86/pv/pt-shadow.c | 164 +++++++++++++++++++++++++++++++++++++
xen/include/asm-x86/fixmap.h | 1 +
xen/include/asm-x86/pv/pt-shadow.h | 24 ++++++
6 files changed, 209 insertions(+), 3 deletions(-)
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index f85ef6c..375565f 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -126,6 +126,7 @@
#include <asm/hvm/grant_table.h>
#include <asm/pv/grant_table.h>
#include <asm/pv/mm.h>
+#include <asm/pv/pt-shadow.h>
#include "pv/mm.h"
@@ -501,13 +502,15 @@ DEFINE_PER_CPU(unsigned long, curr_ptbase);
void do_write_ptbase(struct vcpu *v, bool tlb_maintenance)
{
- unsigned long new_cr3 = v->arch.cr3;
+ unsigned long new_cr3;
unsigned int cpu = smp_processor_id();
unsigned long *this_curr_ptbase = &per_cpu(curr_ptbase, cpu);
/* Check that %cr3 isn't being shuffled under our feet. */
ASSERT(*this_curr_ptbase == read_cr3());
+ new_cr3 = pt_maybe_shadow(v);
+
if ( tlb_maintenance )
write_cr3(new_cr3);
else
diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index c4e954e..9c929ed 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -39,6 +39,7 @@ asm(".file \"" __OBJECT_FILE__ "\"");
#include <asm/hvm/cacheattr.h>
#include <asm/mtrr.h>
#include <asm/guest_pt.h>
+#include <asm/pv/pt-shadow.h>
#include <public/sched.h>
#include "private.h"
#include "types.h"
@@ -952,6 +953,7 @@ static int shadow_set_l4e(struct domain *d,
/* Write the new entry */
shadow_write_entries(sl4e, &new_sl4e, 1, sl4mfn);
+ pt_shadow_l4_write(d, mfn_to_page(sl4mfn), pgentry_ptr_to_slot(sl4e));
flags |= SHADOW_SET_CHANGED;
if ( shadow_l4e_get_flags(old_sl4e) & _PAGE_PRESENT )
diff --git a/xen/arch/x86/pv/mm.h b/xen/arch/x86/pv/mm.h
index a10b09a..7c66ca7 100644
--- a/xen/arch/x86/pv/mm.h
+++ b/xen/arch/x86/pv/mm.h
@@ -1,6 +1,8 @@
#ifndef __PV_MM_H__
#define __PV_MM_H__
+#include <asm/pv/pt-shadow.h>
+
l1_pgentry_t *map_guest_l1e(unsigned long linear, mfn_t *gl1mfn);
int new_guest_cr3(mfn_t mfn);
@@ -38,7 +40,7 @@ static inline l1_pgentry_t guest_get_eff_l1e(unsigned long linear)
*/
static inline bool update_intpte(intpte_t *p, intpte_t old, intpte_t new,
unsigned long mfn, struct vcpu *v,
- bool preserve_ad)
+ bool preserve_ad, unsigned int level)
{
bool rv = true;
@@ -77,6 +79,11 @@ static inline bool update_intpte(intpte_t *p, intpte_t old, intpte_t new,
old = t;
}
}
+
+ if ( level == 4 )
+ pt_shadow_l4_write(v->domain, mfn_to_page(mfn),
+ pgentry_ptr_to_slot(p));
+
return rv;
}
@@ -87,7 +94,12 @@ static inline bool update_intpte(intpte_t *p, intpte_t old, intpte_t new,
#define UPDATE_ENTRY(_t,_p,_o,_n,_m,_v,_ad) \
update_intpte(&_t ## e_get_intpte(*(_p)), \
_t ## e_get_intpte(_o), _t ## e_get_intpte(_n), \
- (_m), (_v), (_ad))
+ (_m), (_v), (_ad), _t ## _LEVEL)
+
+#define l1_LEVEL 1
+#define l2_LEVEL 2
+#define l3_LEVEL 3
+#define l4_LEVEL 4
static inline l1_pgentry_t adjust_guest_l1e(l1_pgentry_t l1e,
const struct domain *d)
diff --git a/xen/arch/x86/pv/pt-shadow.c b/xen/arch/x86/pv/pt-shadow.c
index 7db8efb..46a0251 100644
--- a/xen/arch/x86/pv/pt-shadow.c
+++ b/xen/arch/x86/pv/pt-shadow.c
@@ -22,8 +22,32 @@
#include <xen/mm.h>
#include <xen/numa.h>
+#include <asm/fixmap.h>
#include <asm/pv/pt-shadow.h>
+/*
+ * To use percpu linear ranges, we require that no two pcpus have %cr3
+ * pointing at the same L4 pagetable at the same time.
+ *
+ * Guests however might choose to use the same L4 pagetable on multiple vcpus
+ * at once, e.g. concurrently scheduling two threads from the same process.
+ * In practice, all HVM guests, and 32bit PV guests run on Xen-provided
+ * per-vcpu monitor tables, so it is only 64bit PV guests which are an issue.
+ *
+ * To resolve the issue, we shadow L4 pagetables from 64bit PV guests when
+ * they are in context.
+ *
+ * The algorithm is fairly simple.
+ *
+ * - When a pcpu is switching to a new vcpu cr3 and shadowing is necessary,
+ * perform a full 4K copy of the guest's frame into a percpu frame, and run
+ * on that.
+ * - When a write to a guest's L4 pagetable occurs, the update must be
+ * propagated to all existing shadows. An IPI is sent to the domain's
+ * dirty mask indicating which frame/slot was updated, and each pcpu
+ * checks whether it needs to sync the update into its shadow.
+ */
+
struct pt_shadow {
/*
* A frame used to shadow a vcpu's intended pagetable. When shadowing,
@@ -31,6 +55,17 @@ struct pt_shadow {
*/
paddr_t shadow_l4;
l4_pgentry_t *shadow_l4_va;
+
+ /*
+ * Domain to which the shadowed state belongs, or NULL if no state is
+ * being cached. IPIs for updates to cached information are based on the
+ * domain dirty mask, which can race with the target of the IPI switching
+ * to a different context.
+ */
+ const struct domain *domain;
+
+ /* If nonzero, a guest's pagetable which we are shadowing. */
+ paddr_t shadowing;
};
static DEFINE_PER_CPU(struct pt_shadow, ptsh);
@@ -76,6 +111,135 @@ void pt_shadow_free(unsigned int cpu)
}
/*
+ * We only need to shadow 4-level PV guests. All other guests have per-vcpu
+ * monitor tables which are never scheduled on concurrent pcpus. Care needs
+ * to be taken not to shadow d0v0 during construction, as it writes its L4
+ * directly.
+ */
+static bool pt_need_shadow(const struct domain *d)
+{
+ return (system_state >= SYS_STATE_active && is_pv_domain(d) &&
+ !is_idle_domain(d) && !is_pv_32bit_domain(d) && d->max_vcpus > 1);
+}
+
+unsigned long pt_maybe_shadow(struct vcpu *v)
+{
+ unsigned int cpu = smp_processor_id();
+ struct pt_shadow *ptsh = &per_cpu(ptsh, cpu);
+ unsigned long flags, new_cr3 = v->arch.cr3;
+
+ /*
+ * IPIs for updates are based on the domain dirty mask. If we ever switch
+ * out of the currently shadowed context (even to idle), the cache will
+ * become stale.
+ */
+ if ( ptsh->domain &&
+ ptsh->domain != v->domain )
+ {
+ ptsh->domain = NULL;
+ ptsh->shadowing = 0;
+ }
+
+ /* No shadowing necessary? Run on the intended pagetable. */
+ if ( !pt_need_shadow(v->domain) )
+ return new_cr3;
+
+ ptsh->domain = v->domain;
+
+ /* Fastpath, if we are already shadowing the intended pagetable. */
+ if ( ptsh->shadowing == new_cr3 )
+ return ptsh->shadow_l4;
+
+ /*
+ * We may be called with interrupts disabled (e.g. context switch), or
+ * interrupts enabled (e.g. new_guest_cr3()).
+ *
+ * Reads and modifications of ptsh-> are only made on the local cpu, but
+ * must be mutually exclusive with the accesses in _pt_shadow_ipi().
+ */
+ local_irq_save(flags);
+
+ {
+ l4_pgentry_t *l4t, *vcpu_l4t;
+
+ set_percpu_fixmap(cpu, PERCPU_FIXSLOT_SHADOW,
+ l1e_from_paddr(new_cr3, __PAGE_HYPERVISOR_RO));
+ ptsh->shadowing = new_cr3;
+ local_irq_restore(flags);
+
+ l4t = ptsh->shadow_l4_va;
+ vcpu_l4t = percpu_fix_to_virt(cpu, PERCPU_FIXSLOT_SHADOW);
+
+ copy_page(l4t, vcpu_l4t);
+ }
+
+ return ptsh->shadow_l4;
+}
+
+struct ptsh_ipi_info
+{
+ const struct domain *d;
+ const struct page_info *pg;
+ enum {
+ PTSH_IPI_WRITE,
+ } op;
+ unsigned int slot;
+};
+
+static void _pt_shadow_ipi(void *arg)
+{
+ unsigned int cpu = smp_processor_id();
+ struct pt_shadow *ptsh = &per_cpu(ptsh, cpu);
+ const struct ptsh_ipi_info *info = arg;
+ unsigned long maddr = page_to_maddr(info->pg);
+
+ /* No longer shadowing state from this domain? Nothing to do. */
+ if ( info->d != ptsh->domain )
+ return;
+
+ /* Not shadowing this frame? Nothing to do. */
+ if ( ptsh->shadowing != maddr )
+ return;
+
+ switch ( info->op )
+ {
+ l4_pgentry_t *l4t, *vcpu_l4t;
+
+ case PTSH_IPI_WRITE:
+ l4t = ptsh->shadow_l4_va;
+
+ /* Reuse the mapping established in pt_maybe_shadow(). */
+ ASSERT(l1e_get_paddr(*percpu_fixmap_l1e(cpu, PERCPU_FIXSLOT_SHADOW)) ==
+ maddr);
+ vcpu_l4t = percpu_fix_to_virt(cpu, PERCPU_FIXSLOT_SHADOW);
+
+ l4t[info->slot] = vcpu_l4t[info->slot];
+ break;
+
+ default:
+ ASSERT_UNREACHABLE();
+ }
+}
+
+void pt_shadow_l4_write(const struct domain *d, const struct page_info *pg,
+ unsigned int slot)
+{
+ struct ptsh_ipi_info info;
+
+ if ( !pt_need_shadow(d) )
+ return;
+
+ info = (struct ptsh_ipi_info){
+ .d = d,
+ .pg = pg,
+ .op = PTSH_IPI_WRITE,
+ .slot = slot,
+ };
+
+ on_selected_cpus(d->domain_dirty_cpumask, _pt_shadow_ipi, &info, 1);
+}
+
+/*
* Local variables:
* mode: C
* c-file-style: "BSD"
diff --git a/xen/include/asm-x86/fixmap.h b/xen/include/asm-x86/fixmap.h
index d46939a..748219f 100644
--- a/xen/include/asm-x86/fixmap.h
+++ b/xen/include/asm-x86/fixmap.h
@@ -28,6 +28,7 @@
#include <acpi/apei.h>
#define NR_PERCPU_SLOTS 1
+#define PERCPU_FIXSLOT_SHADOW 0
/*
* Here we define all the compile-time 'special' virtual
diff --git a/xen/include/asm-x86/pv/pt-shadow.h b/xen/include/asm-x86/pv/pt-shadow.h
index ff99c85..6e71e99 100644
--- a/xen/include/asm-x86/pv/pt-shadow.h
+++ b/xen/include/asm-x86/pv/pt-shadow.h
@@ -21,6 +21,8 @@
#ifndef __X86_PV_PT_SHADOW_H__
#define __X86_PV_PT_SHADOW_H__
+#include <xen/sched.h>
+
#ifdef CONFIG_PV
/*
@@ -30,11 +32,33 @@
int pt_shadow_alloc(unsigned int cpu);
void pt_shadow_free(unsigned int cpu);
+/*
+ * Called for context switches, and when a vcpu explicitly changes cr3. The
+ * PT shadow logic returns the cr3 hardware should run on, which is either
+ * v->arch.cr3 (no shadowing necessary), or a local frame (which is a suitable
+ * shadow of v->arch.cr3).
+ */
+unsigned long pt_maybe_shadow(struct vcpu *v);
+
+/*
+ * Called when a write occurs to an L4 pagetable. The PT shadow logic brings
+ * any shadows of this page up-to-date.
+ */
+void pt_shadow_l4_write(
+ const struct domain *d, const struct page_info *pg, unsigned int slot);
+
#else /* !CONFIG_PV */
static inline int pt_shadow_alloc(unsigned int cpu) { return 0; }
static inline void pt_shadow_free(unsigned int cpu) { }
+static inline unsigned long pt_maybe_shadow(struct vcpu *v)
+{
+ return v->arch.cr3;
+}
+static inline void pt_shadow_l4_write(
+ const struct domain *d, const struct page_info *pg, unsigned int slot) { }
+
#endif /* CONFIG_PV */
#endif /* __X86_PV_PT_SHADOW_H__ */
--
2.1.4
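The write-propagation path added by the patch above can be summarised as the
following call flow, a sketch using only the names introduced in the patch:

    /*
     * Sketch of the L4 write propagation added above:
     *
     * UPDATE_ENTRY(l4, ...) / shadow_set_l4e()
     *   -> pt_shadow_l4_write(d, pg, slot)
     *        -> on_selected_cpus(d->domain_dirty_cpumask, _pt_shadow_ipi, ...)
     *             -> _pt_shadow_ipi(): if this pcpu is still shadowing 'pg',
     *                copy the single updated slot from the guest L4 into
     *                the local shadow.
     */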
* [PATCH RFC 14/44] x86/mm: Added safety checks that pagetables aren't shared
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (12 preceding siblings ...)
2018-01-04 20:21 ` [PATCH RFC 13/44] x86/pt-shadow: Shadow L4 tables from 64bit PV guests Andrew Cooper
@ 2018-01-04 20:21 ` Andrew Cooper
2018-01-04 20:21 ` [PATCH RFC 15/44] x86: Rearrange the virtual layout to introduce a PERCPU linear slot Andrew Cooper
` (30 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
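No description accompanies this patch; the mechanism is only visible in the
diff below. As an illustrative sketch (the helper name is hypothetical, the
primitives are the ones the patch uses), the invariant being enforced is that
at most one pcpu's %cr3 points at a given root pagetable at any time:

    /* Sketch only: claim/release protocol around a %cr3 switch. */
    static void switch_root_pagetable(struct page_info *new_pg,
                                      struct page_info *old_pg)
    {
        /* Claiming a root already owned by another pcpu is a fatal bug. */
        BUG_ON(test_and_set_bit(_PGC_inuse_pgtable, &new_pg->count_info));

        /* ... write %cr3 here ... */

        /* Release the claim on the outgoing root. */
        BUG_ON(!test_and_clear_bit(_PGC_inuse_pgtable, &old_pg->count_info));
    }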
---
xen/arch/x86/mm.c | 19 ++++++++++++++++++-
xen/arch/x86/setup.c | 1 +
xen/include/asm-x86/mm.h | 6 +++++-
3 files changed, 24 insertions(+), 2 deletions(-)
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 375565f..d6f88ca 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -505,18 +505,35 @@ void do_write_ptbase(struct vcpu *v, bool tlb_maintenance)
unsigned long new_cr3;
unsigned int cpu = smp_processor_id();
unsigned long *this_curr_ptbase = &per_cpu(curr_ptbase, cpu);
+ struct page_info *new_pg;
/* Check that %cr3 isn't being shuffled under our feet. */
ASSERT(*this_curr_ptbase == read_cr3());
new_cr3 = pt_maybe_shadow(v);
+ new_pg = maddr_to_page(new_cr3);
+
+ /* Check that new_cr3 isn't in use by a different pcpu. */
+ if ( new_cr3 != *this_curr_ptbase )
+ BUG_ON(test_and_set_bit(_PGC_inuse_pgtable, &new_pg->count_info));
+ else
+ /* Same cr3. Check that it is still marked as in use. */
+ ASSERT(test_bit(_PGC_inuse_pgtable, &new_pg->count_info));
if ( tlb_maintenance )
write_cr3(new_cr3);
else
asm volatile ( "mov %0, %%cr3" :: "r" (new_cr3) : "memory" );
- *this_curr_ptbase = new_cr3;
+ /* Mark the old cr3 as no longer in use. */
+ if ( new_cr3 != *this_curr_ptbase )
+ {
+ struct page_info *old_pg = maddr_to_page(*this_curr_ptbase);
+
+ BUG_ON(!test_and_clear_bit(_PGC_inuse_pgtable, &old_pg->count_info));
+
+ *this_curr_ptbase = new_cr3;
+ }
}
/*
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 7a05a7c..ffa7ea4 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -246,6 +246,7 @@ void early_switch_to_idle(void)
set_current(v);
per_cpu(curr_vcpu, cpu) = v;
+ __set_bit(_PGC_inuse_pgtable, &maddr_to_page(v->arch.cr3)->count_info);
asm volatile ( "mov %[npge], %%cr4;"
"mov %[cr3], %%cr3;"
"mov %[pge], %%cr4;"
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index ceb7dd4..64044c6 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -258,8 +258,12 @@ struct page_info
#define PGC_state_free PG_mask(3, 9)
#define page_state_is(pg, st) (((pg)->count_info&PGC_state) == PGC_state_##st)
+/* Page is a root pagetable, with a pcpu's %cr3 pointing at it. */
+#define _PGC_inuse_pgtable PG_shift(10)
+#define PGC_inuse_pgtable PG_mask(1, 10)
+
/* Count of references to this frame. */
-#define PGC_count_width PG_shift(9)
+#define PGC_count_width PG_shift(10)
#define PGC_count_mask ((1UL<<PGC_count_width)-1)
/*
--
2.1.4
* [PATCH RFC 15/44] x86: Rearrange the virtual layout to introduce a PERCPU linear slot
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (13 preceding siblings ...)
2018-01-04 20:21 ` [PATCH RFC 14/44] x86/mm: Added safety checks that pagetables aren't shared Andrew Cooper
@ 2018-01-04 20:21 ` Andrew Cooper
2018-01-04 20:21 ` [PATCH RFC 16/44] xen/ipi: Introduce arch_ipi_param_ok() to check IPI parameters Andrew Cooper
` (29 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
The PERCPU linear range lives in slot 257, and all later slots slide along to
make room. The size of the directmap is reduced by one slot temporarily.
Later changes will remove the PERDOMAIN slot, at which point the latter slots
will slide back to fill the hole, and end up where they are now.
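For readers tracking the renumbering, the net effect of the hunks below works
out as follows (old use -> new use of each PML4 slot):

    /*
     * 257: PCI MMCFG            -> Per-CPU mappings
     * 258: guest linear PT      -> PCI MMCFG
     * 259: shadow linear PT     -> guest linear PT
     * 260: per-domain mappings  -> shadow linear PT
     * 261: RW M2P etc.          -> per-domain mappings
     * 262+ (265+ for BIGMEM): the remaining slots, including the directmap,
     *                         each slide along by one.
     */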
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/arch/x86/mm.c | 13 +++++++++----
xen/include/asm-x86/config.h | 45 +++++++++++++++++++++++++++-----------------
2 files changed, 37 insertions(+), 21 deletions(-)
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index d6f88ca..deff4eb 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1576,25 +1576,30 @@ void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn,
ro_mpt ? idle_pg_table[l4_table_offset(RO_MPT_VIRT_START)]
: l4e_empty();
- /* Slot 257: PCI MMCFG. */
+ /* Slot 257: Per-CPU mappings (filled on context switch). */
+ l4t[l4_table_offset(PERCPU_LINEAR_START)] = l4e_empty();
+
+ /* Slot 258: PCI MMCFG. */
l4t[l4_table_offset(PCI_MCFG_VIRT_START)] =
idle_pg_table[l4_table_offset(PCI_MCFG_VIRT_START)];
- /* Slot 258: Self linear mappings. */
+ /* Slot 259: Self linear mappings. */
ASSERT(!mfn_eq(l4mfn, INVALID_MFN));
l4t[l4_table_offset(LINEAR_PT_VIRT_START)] =
l4e_from_mfn(l4mfn, __PAGE_HYPERVISOR_RW);
- /* Slot 259: Shadow linear mappings (if applicable) .*/
+ /* Slot 260: Shadow linear mappings (if applicable). */
l4t[l4_table_offset(SH_LINEAR_PT_VIRT_START)] =
mfn_eq(sl4mfn, INVALID_MFN) ? l4e_empty() :
l4e_from_mfn(sl4mfn, __PAGE_HYPERVISOR_RW);
- /* Slot 260: Per-domain mappings (if applicable). */
+ /* Slot 261: Per-domain mappings (if applicable). */
l4t[l4_table_offset(PERDOMAIN_VIRT_START)] =
d ? l4e_from_page(d->arch.perdomain_l3_pg, __PAGE_HYPERVISOR_RW)
: l4e_empty();
+ /* !!! WARNING - TEMPORARILY STALE BELOW !!! */
+
/* Slot 261-: text/data/bss, RW M2P, vmap, frametable, directmap. */
#ifndef NDEBUG
if ( short_directmap &&
diff --git a/xen/include/asm-x86/config.h b/xen/include/asm-x86/config.h
index 9ef9d03..baf973a 100644
--- a/xen/include/asm-x86/config.h
+++ b/xen/include/asm-x86/config.h
@@ -124,13 +124,16 @@ extern unsigned char boot_edid_info[128];
* 0xffff804000000000 - 0xffff807fffffffff [256GB, 2^38 bytes, PML4:256]
* Reserved for future shared info with the guest OS (GUEST ACCESSIBLE).
* 0xffff808000000000 - 0xffff80ffffffffff [512GB, 2^39 bytes, PML4:257]
- * ioremap for PCI mmconfig space
+ * Per-CPU mappings (Xen's GDT/IDT/Stack etc.)
* 0xffff810000000000 - 0xffff817fffffffff [512GB, 2^39 bytes, PML4:258]
- * Guest linear page table.
+ * ioremap for PCI mmconfig space
* 0xffff818000000000 - 0xffff81ffffffffff [512GB, 2^39 bytes, PML4:259]
- * Shadow linear page table.
+ * Guest linear page table.
* 0xffff820000000000 - 0xffff827fffffffff [512GB, 2^39 bytes, PML4:260]
- * Per-domain mappings (e.g., GDT, LDT).
+ * Shadow linear page table.
+ *
+ * !!! WARNING - TEMPORARILY STALE BELOW !!!
+ *
* 0xffff828000000000 - 0xffff82bfffffffff [256GB, 2^38 bytes, PML4:261]
* Machine-to-phys translation table.
* 0xffff82c000000000 - 0xffff82cfffffffff [64GB, 2^36 bytes, PML4:261]
@@ -170,6 +173,8 @@ extern unsigned char boot_edid_info[128];
* Read-only machine-to-phys translation table (GUEST ACCESSIBLE).
* 0x0000000100000000 - 0x00007fffffffffff [128TB-4GB, PML4:0-255]
* Unused / Reserved for future use.
+ *
+ * !!! WARNING - TEMPORARILY STALE !!!
*/
@@ -187,26 +192,32 @@ extern unsigned char boot_edid_info[128];
#define RO_MPT_VIRT_START (PML4_ADDR(256))
#define MPT_VIRT_SIZE (PML4_ENTRY_BYTES / 2)
#define RO_MPT_VIRT_END (RO_MPT_VIRT_START + MPT_VIRT_SIZE)
-/* Slot 257: ioremap for PCI mmconfig space for 2048 segments (512GB)
+/* Slot 257: Per-CPU mappings. */
+#define PERCPU_LINEAR_START (PML4_ADDR(257))
+#define PERCPU_LINEAR_END (PERCPU_LINEAR_START + PML4_ENTRY_BYTES)
+/* Slot 258: ioremap for PCI mmconfig space for 2048 segments (512GB)
* - full 16-bit segment support needs 44 bits
* - since PML4 slot has 39 bits, we limit segments to 2048 (11-bits)
*/
-#define PCI_MCFG_VIRT_START (PML4_ADDR(257))
+#define PCI_MCFG_VIRT_START (PML4_ADDR(258))
#define PCI_MCFG_VIRT_END (PCI_MCFG_VIRT_START + PML4_ENTRY_BYTES)
-/* Slot 258: linear page table (guest table). */
-#define LINEAR_PT_VIRT_START (PML4_ADDR(258))
+/* Slot 259: linear page table (guest table). */
+#define LINEAR_PT_VIRT_START (PML4_ADDR(259))
#define LINEAR_PT_VIRT_END (LINEAR_PT_VIRT_START + PML4_ENTRY_BYTES)
-/* Slot 259: linear page table (shadow table). */
-#define SH_LINEAR_PT_VIRT_START (PML4_ADDR(259))
+/* Slot 260: linear page table (shadow table). */
+#define SH_LINEAR_PT_VIRT_START (PML4_ADDR(260))
#define SH_LINEAR_PT_VIRT_END (SH_LINEAR_PT_VIRT_START + PML4_ENTRY_BYTES)
-/* Slot 260: per-domain mappings (including map cache). */
-#define PERDOMAIN_VIRT_START (PML4_ADDR(260))
+/* Slot 261: per-domain mappings (including map cache). */
+#define PERDOMAIN_VIRT_START (PML4_ADDR(261))
#define PERDOMAIN_SLOT_MBYTES (PML4_ENTRY_BYTES >> (20 + PAGETABLE_ORDER))
#define PERDOMAIN_SLOTS 3
#define PERDOMAIN_VIRT_SLOT(s) (PERDOMAIN_VIRT_START + (s) * \
(PERDOMAIN_SLOT_MBYTES << 20))
+/*
+ * !!! WARNING - TEMPORARILY STALE BELOW !!!
+ */
/* Slot 261: machine-to-phys conversion table (256GB). */
-#define RDWR_MPT_VIRT_START (PML4_ADDR(261))
+#define RDWR_MPT_VIRT_START (PML4_ADDR(262))
#define RDWR_MPT_VIRT_END (RDWR_MPT_VIRT_START + MPT_VIRT_SIZE)
/* Slot 261: vmap()/ioremap()/fixmap area (64GB). */
#define VMAP_VIRT_START RDWR_MPT_VIRT_END
@@ -234,12 +245,12 @@ extern unsigned char boot_edid_info[128];
#ifndef CONFIG_BIGMEM
/* Slot 262-271/510: A direct 1:1 mapping of all of physical memory. */
-#define DIRECTMAP_VIRT_START (PML4_ADDR(262))
-#define DIRECTMAP_SIZE (PML4_ENTRY_BYTES * (511 - 262))
+#define DIRECTMAP_VIRT_START (PML4_ADDR(263))
+#define DIRECTMAP_SIZE (PML4_ENTRY_BYTES * (511 - 263))
#else
/* Slot 265-271/510: A direct 1:1 mapping of all of physical memory. */
-#define DIRECTMAP_VIRT_START (PML4_ADDR(265))
-#define DIRECTMAP_SIZE (PML4_ENTRY_BYTES * (511 - 265))
+#define DIRECTMAP_VIRT_START (PML4_ADDR(266))
+#define DIRECTMAP_SIZE (PML4_ENTRY_BYTES * (511 - 266))
#endif
#define DIRECTMAP_VIRT_END (DIRECTMAP_VIRT_START + DIRECTMAP_SIZE)
--
2.1.4
* [PATCH RFC 16/44] xen/ipi: Introduce arch_ipi_param_ok() to check IPI parameters
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (14 preceding siblings ...)
2018-01-04 20:21 ` [PATCH RFC 15/44] x86: Rearrange the virtual layout to introduce a PERCPU linear slot Andrew Cooper
@ 2018-01-04 20:21 ` Andrew Cooper
2018-01-04 20:21 ` [PATCH RFC 17/44] x86/smp: Infrastructure for allocating and freeing percpu pagetables Andrew Cooper
` (28 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
There are some addresses which are not safe to pass as IPI parameters, as they
are not mapped on other cpus (or worse, mapped to something else). Introduce
an arch-specific audit hook which is used to check the parameter.
ARM has this stubbed to true, whereas x86 now excludes pointers in the PERCPU
range.
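As a concrete illustration of what the x86 implementation below accepts and
rejects (cpu_online_map is used only as a stand-in for a typical, globally
mapped parameter; PERCPU_LINEAR_START is the constant this series defines):

    /* Sketch only: expected results of the parameter audit. */
    ASSERT(arch_ipi_param_ok(NULL));                     /* no parameter */
    ASSERT(arch_ipi_param_ok(&cpu_online_map));          /* globally mapped */
    ASSERT(!arch_ipi_param_ok((void *)PERCPU_LINEAR_START)); /* percpu range */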
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/common/smp.c | 1 +
xen/include/asm-arm/smp.h | 3 +++
xen/include/asm-x86/smp.h | 15 +++++++++++++++
3 files changed, 19 insertions(+)
diff --git a/xen/common/smp.c b/xen/common/smp.c
index 79f4ebd..1ffc21c 100644
--- a/xen/common/smp.c
+++ b/xen/common/smp.c
@@ -54,6 +54,7 @@ void on_selected_cpus(
ASSERT(local_irq_is_enabled());
ASSERT(cpumask_subset(selected, &cpu_online_map));
+ ASSERT(arch_ipi_param_ok(info));
spin_lock(&call_lock);
diff --git a/xen/include/asm-arm/smp.h b/xen/include/asm-arm/smp.h
index 3c12268..2f12e5c 100644
--- a/xen/include/asm-arm/smp.h
+++ b/xen/include/asm-arm/smp.h
@@ -28,6 +28,9 @@ extern void init_secondary(void);
extern void smp_init_cpus(void);
extern void smp_clear_cpu_maps (void);
extern int smp_get_max_cpus (void);
+
+static inline bool arch_ipi_param_ok(const void *param) { return true; }
+
#endif
/*
diff --git a/xen/include/asm-x86/smp.h b/xen/include/asm-x86/smp.h
index 7fcc946..5fea27d 100644
--- a/xen/include/asm-x86/smp.h
+++ b/xen/include/asm-x86/smp.h
@@ -73,6 +73,21 @@ void set_nr_sockets(void);
/* Representing HT and core siblings in each socket. */
extern cpumask_t **socket_cpumask;
+static inline bool arch_ipi_param_ok(const void *_param)
+{
+ unsigned long param = (unsigned long)_param;
+
+ /*
+ * It is not safe to pass pointers in the PERCPU linear range to other
+ * cpus in an IPI.
+ *
+ * Not all parameters passed are actually pointers, so only reject
+ * parameters which are a canonical address in the PERCPU range.
+ */
+ return (!is_canonical_address(param) ||
+ l4_table_offset(param) != l4_table_offset(PERCPU_LINEAR_START));
+}
+
#endif /* !__ASSEMBLY__ */
#endif
--
2.1.4
* [PATCH RFC 17/44] x86/smp: Infrastructure for allocating and freeing percpu pagetables
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (15 preceding siblings ...)
2018-01-04 20:21 ` [PATCH RFC 16/44] xen/ipi: Introduce arch_ipi_param_ok() to check IPI parameters Andrew Cooper
@ 2018-01-04 20:21 ` Andrew Cooper
2018-01-04 20:21 ` [PATCH RFC 18/44] x86/mm: Maintain the correct percpu mappings on context switch Andrew Cooper
` (27 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
Pagetables are allocated and freed along with the other SMP data structures,
and the root of the pagetables is stored in the percpu_mappings variable.
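The MAP_PERCPU_AUTOFREE flag introduced below is what ties a mapped data
frame's lifetime to the percpu pagetables. A sketch of the expected usage
(the surrounding l1t/pg/linear variables are hypothetical; the flag and the
l1e helper are from the patch):

    /* Sketch only: opt a frame into being freed with the percpu pagetables. */
    l1t[l1_table_offset(linear)] =
        l1e_from_page(pg, __PAGE_HYPERVISOR_RW | MAP_PERCPU_AUTOFREE);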
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/arch/x86/smpboot.c | 91 ++++++++++++++++++++++++++++++++++++++++++++++
xen/include/asm-x86/page.h | 1 +
xen/include/asm-x86/smp.h | 1 +
3 files changed, 93 insertions(+)
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index a855301..1f92831 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -58,6 +58,7 @@
unsigned long __read_mostly trampoline_phys;
DEFINE_PER_CPU_READ_MOSTLY(paddr_t, percpu_idle_pt);
+DEFINE_PER_CPU_READ_MOSTLY(l4_pgentry_t, percpu_mappings);
/* representing HT siblings of each logical CPU */
DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_mask);
@@ -644,6 +645,7 @@ static int cpu_smpboot_alloc_common(unsigned int cpu)
unsigned int memflags = 0;
nodeid_t node = cpu_to_node(cpu);
l4_pgentry_t *l4t = NULL;
+ l3_pgentry_t *l3t = NULL;
struct page_info *pg;
int rc = -ENOMEM;
@@ -663,15 +665,103 @@ static int cpu_smpboot_alloc_common(unsigned int cpu)
if ( rc )
goto out;
+ rc = -ENOMEM;
+
+ /* Percpu L3 table, containing the percpu mappings. */
+ pg = alloc_domheap_page(NULL, memflags);
+ if ( !pg )
+ goto out;
+ l3t = __map_domain_page(pg);
+ clear_page(l3t);
+ per_cpu(percpu_mappings, cpu) = l4t[l4_table_offset(PERCPU_LINEAR_START)] =
+ l4e_from_page(pg, __PAGE_HYPERVISOR);
+
rc = 0; /* Success */
out:
+ if ( l3t )
+ unmap_domain_page(l3t);
if ( l4t )
unmap_domain_page(l4t);
return rc;
}
+/*
+ * Dismantles the pagetable structure under per_cpu(percpu_mappings, cpu),
+ * freeing all pagetable frames, and any RAM frames which are mapped with
+ * MAP_PERCPU_AUTOFREE.
+ */
+static void free_percpu_pagetables(unsigned int cpu)
+{
+ l4_pgentry_t *percpu_mappings = &per_cpu(percpu_mappings, cpu);
+ unsigned int l3i;
+ l3_pgentry_t *l3t = NULL;
+
+ if ( !l4e_get_intpte(*percpu_mappings) )
+ return;
+
+ l3t = map_domain_page(l4e_get_mfn(*percpu_mappings));
+
+ for ( l3i = 0; l3i < L3_PAGETABLE_ENTRIES; ++l3i )
+ {
+ l3_pgentry_t l3e = l3t[l3i];
+
+ if ( !(l3e_get_flags(l3e) & _PAGE_PRESENT) )
+ continue;
+
+ if ( !(l3e_get_flags(l3e) & _PAGE_PSE) )
+ {
+ unsigned int l2i;
+ l2_pgentry_t *l2t = __map_domain_page(l3e_get_page(l3e));
+
+ for ( l2i = 0; l2i < L2_PAGETABLE_ENTRIES; ++l2i )
+ {
+ l2_pgentry_t l2e = l2t[l2i];
+
+ if ( !(l2e_get_flags(l2e) & _PAGE_PRESENT) )
+ continue;
+
+ if ( !(l2e_get_flags(l2e) & _PAGE_PSE) )
+ {
+ unsigned int l1i;
+ l1_pgentry_t *l1t = __map_domain_page(l2e_get_page(l2e));
+
+ for ( l1i = 0; l1i < L1_PAGETABLE_ENTRIES; ++l1i )
+ {
+ l1_pgentry_t l1e = l1t[l1i];
+
+ if ( !(l1e_get_flags(l1e) & _PAGE_PRESENT) )
+ continue;
+
+ if ( l1e_get_flags(l1e) & MAP_PERCPU_AUTOFREE )
+ {
+ struct page_info *pg = l1e_get_page(l1e);
+
+ if ( is_xen_heap_page(pg) )
+ free_xenheap_page(page_to_virt(pg));
+ else
+ free_domheap_page(pg);
+ }
+ }
+
+ unmap_domain_page(l1t);
+ }
+
+ free_domheap_page(l2e_get_page(l2e));
+ }
+
+ unmap_domain_page(l2t);
+ }
+
+ free_domheap_page(l3e_get_page(l3e));
+ }
+
+ unmap_domain_page(l3t);
+ free_domheap_page(l4e_get_page(*percpu_mappings));
+ *percpu_mappings = l4e_empty();
+}
+
static void cpu_smpboot_free(unsigned int cpu)
{
unsigned int order, socket = cpu_to_socket(cpu);
@@ -733,6 +823,7 @@ static void cpu_smpboot_free(unsigned int cpu)
}
pt_shadow_free(cpu);
+ free_percpu_pagetables(cpu);
}
static int cpu_smpboot_alloc(unsigned int cpu)
diff --git a/xen/include/asm-x86/page.h b/xen/include/asm-x86/page.h
index 45ca742..f330c75 100644
--- a/xen/include/asm-x86/page.h
+++ b/xen/include/asm-x86/page.h
@@ -344,6 +344,7 @@ void efi_update_l4_pgtable(unsigned int l4idx, l4_pgentry_t);
#define __PAGE_HYPERVISOR_UC (__PAGE_HYPERVISOR | _PAGE_PCD | _PAGE_PWT)
#define MAP_SMALL_PAGES _PAGE_AVAIL0 /* don't use superpages mappings */
+#define MAP_PERCPU_AUTOFREE _PAGE_AVAIL1
#ifndef __ASSEMBLY__
diff --git a/xen/include/asm-x86/smp.h b/xen/include/asm-x86/smp.h
index 5fea27d..46bbf0d 100644
--- a/xen/include/asm-x86/smp.h
+++ b/xen/include/asm-x86/smp.h
@@ -20,6 +20,7 @@
#ifndef __ASSEMBLY__
DECLARE_PER_CPU(paddr_t, percpu_idle_pt);
+DECLARE_PER_CPU(l4_pgentry_t, percpu_mappings);
/*
* Private routines/data
--
2.1.4
* [PATCH RFC 18/44] x86/mm: Maintain the correct percpu mappings on context switch
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (16 preceding siblings ...)
2018-01-04 20:21 ` [PATCH RFC 17/44] x86/smp: Infrastructure for allocating and freeing percpu pagetables Andrew Cooper
@ 2018-01-04 20:21 ` Andrew Cooper
2018-01-04 20:21 ` [PATCH RFC 19/44] x86/boot: Defer TSS/IST setup until later during boot on the BSP Andrew Cooper
` (26 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
Ensure the pagetables we are switching to have the correct percpu mappings in
them. The _PGC_inuse_pgtable check ensures that the pagetables we edit aren't
in use elsewhere.
One complication however is context switching between two vcpus which both
require shadowing. See the code comment for details.
Another complication is requiring a second percpu fixmap slot. It limits
NR_CPUS to 254, but will be removed later.
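The complication mentioned above resolves to copying every L4 slot except the
PERCPU one, so that the (possibly live) destination keeps its existing percpu
mapping. A standalone sketch of the same idea (hypothetical helper; the real
logic lives in pt_maybe_shadow() below):

    /* Sketch only: copy an L4, preserving dst's existing 'skip' slot. */
    static void copy_l4_except(l4_pgentry_t *dst, const l4_pgentry_t *src,
                               unsigned int skip)
    {
        memcpy(dst, src, sizeof(*dst) * skip);
        memcpy(&dst[skip + 1], &src[skip + 1],
               sizeof(*dst) * (L4_PAGETABLE_ENTRIES - (skip + 1)));
    }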
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/arch/x86/mm.c | 9 +++++++++
xen/arch/x86/pv/pt-shadow.c | 14 +++++++++++++-
xen/include/asm-x86/fixmap.h | 3 ++-
3 files changed, 24 insertions(+), 2 deletions(-)
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index deff4eb..57b3e25 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -505,6 +505,8 @@ void do_write_ptbase(struct vcpu *v, bool tlb_maintenance)
unsigned long new_cr3;
unsigned int cpu = smp_processor_id();
unsigned long *this_curr_ptbase = &per_cpu(curr_ptbase, cpu);
+ l4_pgentry_t percpu_mappings = per_cpu(percpu_mappings, cpu);
+ l4_pgentry_t *new_l4t;
struct page_info *new_pg;
/* Check that %cr3 isn't being shuffled under our feet. */
@@ -520,6 +522,13 @@ void do_write_ptbase(struct vcpu *v, bool tlb_maintenance)
/* Same cr3. Check that it is still marked as in use. */
ASSERT(test_bit(_PGC_inuse_pgtable, &new_pg->count_info));
+ /* Insert percpu mappings into the new pagetables. */
+ set_percpu_fixmap(cpu, PERCPU_FIXSLOT_LINEAR,
+ l1e_from_paddr(new_cr3, __PAGE_HYPERVISOR_RW));
+ new_l4t = percpu_fix_to_virt(cpu, PERCPU_FIXSLOT_LINEAR);
+ new_l4t[l4_table_offset(PERCPU_LINEAR_START)] = percpu_mappings;
+ barrier();
+
if ( tlb_maintenance )
write_cr3(new_cr3);
else
diff --git a/xen/arch/x86/pv/pt-shadow.c b/xen/arch/x86/pv/pt-shadow.c
index 46a0251..46c7b86 100644
--- a/xen/arch/x86/pv/pt-shadow.c
+++ b/xen/arch/x86/pv/pt-shadow.c
@@ -160,6 +160,7 @@ unsigned long pt_maybe_shadow(struct vcpu *v)
local_irq_save(flags);
{
+ unsigned int slot = l4_table_offset(PERCPU_LINEAR_START);
l4_pgentry_t *l4t, *vcpu_l4t;
set_percpu_fixmap(cpu, PERCPU_FIXSLOT_SHADOW,
@@ -170,7 +171,18 @@ unsigned long pt_maybe_shadow(struct vcpu *v)
l4t = ptsh->shadow_l4_va;
vcpu_l4t = percpu_fix_to_virt(cpu, PERCPU_FIXSLOT_SHADOW);
- copy_page(l4t, vcpu_l4t);
+ /*
+ * Careful! When context switching between two vcpus, both of which
+ * require shadowing, l4t[] may be the live pagetables.
+ *
+ * We mustn't clobber the PERCPU slot (with a zero, as vcpu_l4t[] will
+ * never have had a percpu mapping inserted into it). The context
+ * switch logic will unconditionally insert the correct value anyway.
+ */
+ memcpy(l4t, vcpu_l4t,
+ sizeof(*l4t) * slot);
+ memcpy(&l4t[slot + 1], &vcpu_l4t[slot + 1],
+ sizeof(*l4t) * (L4_PAGETABLE_ENTRIES - (slot + 1)));
}
return ptsh->shadow_l4;
diff --git a/xen/include/asm-x86/fixmap.h b/xen/include/asm-x86/fixmap.h
index 748219f..c1b3bda 100644
--- a/xen/include/asm-x86/fixmap.h
+++ b/xen/include/asm-x86/fixmap.h
@@ -27,8 +27,9 @@
#include <asm/msi.h>
#include <acpi/apei.h>
-#define NR_PERCPU_SLOTS 1
+#define NR_PERCPU_SLOTS 2
#define PERCPU_FIXSLOT_SHADOW 0
+#define PERCPU_FIXSLOT_LINEAR 1
/*
* Here we define all the compile-time 'special' virtual
--
2.1.4
* [PATCH RFC 19/44] x86/boot: Defer TSS/IST setup until later during boot on the BSP
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (17 preceding siblings ...)
2018-01-04 20:21 ` [PATCH RFC 18/44] x86/mm: Maintain the correct percpu mappings on context switch Andrew Cooper
@ 2018-01-04 20:21 ` Andrew Cooper
2018-01-04 20:21 ` [PATCH RFC 20/44] x86/smp: Allocate a percpu linear range for the IDT Andrew Cooper
` (25 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
TSS and IST settings are only required for safety when running userspace code.
Until we start executing dom0, the boot path is perfectly capable of handling
exceptions and interrupts without a loaded TSS.
Deferring the TSS setup is necessary to facilitate moving the BSP onto a
percpu stack, which in turn requires that during boot, there are no IST
references in the IDT.
Correct TSS and IST settings are set up again in reinit_bsp_stack(), just before
we complete initialisation.
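The sanity check added to reinit_bsp_stack() below relies on the IST index
living in bits 34:32 of the low quadword of an IDT entry. The same check in
isolation (hypothetical helper; the MASK_EXTR expression is the one the patch
uses):

    /* Sketch only: true if an IDT entry has no IST index assigned. */
    static bool ist_is_unset(const idt_entry_t *ent)
    {
        return MASK_EXTR(ent->a, 7UL << 32) == 0;
    }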
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/arch/x86/setup.c | 17 ++++++++++++++++-
xen/arch/x86/traps.c | 3 ---
2 files changed, 16 insertions(+), 4 deletions(-)
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index ffa7ea4..5fa70bd 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -625,6 +625,9 @@ static void __init noreturn reinit_bsp_stack(void)
{
unsigned long *stack = (void*)(get_stack_bottom() & ~(STACK_SIZE - 1));
+ /* Sanity check that IST settings weren't set up before this point. */
+ ASSERT(MASK_EXTR(idt_tables[0][TRAP_nmi].a, 7UL << 32) == 0);
+
/* Update TSS and ISTs */
load_system_tables();
@@ -692,7 +695,19 @@ void __init noreturn __start_xen(unsigned long mbi_p)
percpu_init_areas();
init_idt_traps();
- load_system_tables();
+ {
+ const struct desc_ptr gdtr = {
+ .base = (unsigned long)this_cpu(gdt_table) - FIRST_RESERVED_GDT_BYTE,
+ .limit = LAST_RESERVED_GDT_BYTE,
+ };
+ const struct desc_ptr idtr = {
+ .base = (unsigned long)idt_table,
+ .limit = (IDT_ENTRIES * sizeof(idt_entry_t)) - 1,
+ };
+
+ lgdt(&gdtr);
+ lidt(&idtr);
+ }
smp_prepare_boot_cpu();
sort_exception_tables();
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index d06ad69..3eab6d3 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1883,9 +1883,6 @@ void __init init_idt_traps(void)
set_intr_gate(TRAP_machine_check,&machine_check);
set_intr_gate(TRAP_simd_error,&simd_coprocessor_error);
- /* Specify dedicated interrupt stacks for NMI, #DF, and #MC. */
- enable_each_ist(idt_table);
-
/* CPU0 uses the master IDT. */
idt_tables[0] = idt_table;
--
2.1.4
* [PATCH RFC 20/44] x86/smp: Allocate a percpu linear range for the IDT
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (18 preceding siblings ...)
2018-01-04 20:21 ` [PATCH RFC 19/44] x86/boot: Defer TSS/IST setup until later during boot on the BSP Andrew Cooper
@ 2018-01-04 20:21 ` Andrew Cooper
2018-01-04 20:21 ` [PATCH RFC 21/44] x86/smp: Switch to using the percpu IDT mappings Andrew Cooper
` (24 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
This change also introduces _alter_percpu_mappings(), a helper for creating
and modifying percpu mappings. The code will be extended with extra actions
in later patches.
The existing IDT heap allocation and idt_tables[] array are kept, although the
allocation logic is simplified, as an IDT is exactly one page in size.
idt_table[], used by CPU0, now needs to be strictly page aligned, so is moved
into .bss.page_aligned.
Nothing writes to the IDT via its percpu mappings, so the opportunity is taken
to make the mapping read-only. This provides extra defence-in-depth, as an
attacker can't use the pointer obtained from SIDT to modify the active IDT as
part of a privilege escalation attempt.
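The defence-in-depth point can be made concrete: SIDT is not a privileged
instruction, but once a later patch in the series switches IDTR to this
mapping, the base it reveals is the read-only percpu alias rather than a
writeable xenheap address. A sketch of reading it back from within Xen:

    /* Sketch only: the base visible via SIDT after IDTR is switched over. */
    struct desc_ptr idtr;

    asm volatile ( "sidt %0" : "=m" (idtr) );
    ASSERT(idtr.base == PERCPU_IDT_MAPPING);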
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/arch/x86/smpboot.c | 103 ++++++++++++++++++++++++++++++++++++++++---
xen/arch/x86/traps.c | 3 +-
xen/include/asm-x86/config.h | 3 ++
3 files changed, 103 insertions(+), 6 deletions(-)
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 1f92831..4df7775 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -639,6 +639,94 @@ void cpu_exit_clear(unsigned int cpu)
set_cpu_state(CPU_STATE_DEAD);
}
+/*
+ * Make an alteration to a CPU's percpu linear mappings. The action parameter
+ * determines how **page and flags get used.
+ */
+enum percpu_alter_action {
+ PERCPU_MAP, /* Map existing frame: page and flags are input parameters. */
+};
+static int _alter_percpu_mappings(
+ unsigned int cpu, unsigned long linear,
+ enum percpu_alter_action action,
+ struct page_info **page, unsigned int flags)
+{
+ unsigned int memflags = 0;
+ nodeid_t node = cpu_to_node(cpu);
+ l4_pgentry_t mappings = per_cpu(percpu_mappings, cpu);
+ l3_pgentry_t *l3t = NULL;
+ l2_pgentry_t *l2t = NULL;
+ l1_pgentry_t *l1t = NULL;
+ struct page_info *pg;
+ int rc = -ENOMEM;
+
+ ASSERT(l4e_get_flags(mappings) & _PAGE_PRESENT);
+ ASSERT(linear >= PERCPU_LINEAR_START && linear < PERCPU_LINEAR_END);
+
+ if ( node != NUMA_NO_NODE )
+ memflags = MEMF_node(node);
+
+ l3t = map_l3t_from_l4e(mappings);
+
+ /* Allocate or map the l2 table for linear. */
+ if ( !(l3e_get_flags(l3t[l3_table_offset(linear)]) & _PAGE_PRESENT) )
+ {
+ pg = alloc_domheap_page(NULL, memflags);
+ if ( !pg )
+ goto out;
+ l2t = __map_domain_page(pg);
+ clear_page(l2t);
+
+ l3t[l3_table_offset(linear)] = l3e_from_page(pg, __PAGE_HYPERVISOR);
+ }
+ else
+ l2t = map_l2t_from_l3e(l3t[l3_table_offset(linear)]);
+
+ /* Allocate or map the l1 table for linear. */
+ if ( !(l2e_get_flags(l2t[l2_table_offset(linear)]) & _PAGE_PRESENT) )
+ {
+ pg = alloc_domheap_page(NULL, memflags);
+ if ( !pg )
+ goto out;
+ l1t = __map_domain_page(pg);
+ clear_page(l1t);
+
+ l2t[l2_table_offset(linear)] = l2e_from_page(pg, __PAGE_HYPERVISOR);
+ }
+ else
+ l1t = map_l1t_from_l2e(l2t[l2_table_offset(linear)]);
+
+ switch ( action )
+ {
+ case PERCPU_MAP:
+ ASSERT(*page);
+ l1t[l1_table_offset(linear)] = l1e_from_page(*page, flags);
+ break;
+
+ default:
+ ASSERT_UNREACHABLE();
+ rc = -EINVAL;
+ goto out;
+ }
+
+ rc = 0; /* Success */
+
+ out:
+ if ( l1t )
+ unmap_domain_page(l1t);
+ if ( l2t )
+ unmap_domain_page(l2t);
+ unmap_domain_page(l3t);
+
+ return rc;
+}
+
+static int percpu_map_frame(unsigned int cpu, unsigned long linear,
+ struct page_info *page, unsigned int flags)
+{
+ return _alter_percpu_mappings(cpu, linear, PERCPU_MAP, &page, flags);
+}
+
/* Allocate data common between the BSP and APs. */
static int cpu_smpboot_alloc_common(unsigned int cpu)
{
@@ -676,6 +764,12 @@ static int cpu_smpboot_alloc_common(unsigned int cpu)
per_cpu(percpu_mappings, cpu) = l4t[l4_table_offset(PERCPU_LINEAR_START)] =
l4e_from_page(pg, __PAGE_HYPERVISOR);
+ /* Map the IDT. */
+ rc = percpu_map_frame(cpu, PERCPU_IDT_MAPPING,
+ virt_to_page(idt_tables[cpu]), PAGE_HYPERVISOR_RO);
+ if ( rc )
+ goto out;
+
rc = 0; /* Success */
out:
@@ -805,8 +899,7 @@ static void cpu_smpboot_free(unsigned int cpu)
free_xenheap_pages(per_cpu(compat_gdt_table, cpu), order);
- order = get_order_from_bytes(IDT_ENTRIES * sizeof(idt_entry_t));
- free_xenheap_pages(idt_tables[cpu], order);
+ free_xenheap_page(idt_tables[cpu]);
idt_tables[cpu] = NULL;
if ( stack_base[cpu] != NULL )
@@ -856,11 +949,11 @@ static int cpu_smpboot_alloc(unsigned int cpu)
memcpy(gdt, boot_cpu_compat_gdt_table, NR_RESERVED_GDT_PAGES * PAGE_SIZE);
gdt[PER_CPU_GDT_ENTRY - FIRST_RESERVED_GDT_ENTRY].a = cpu;
- order = get_order_from_bytes(IDT_ENTRIES * sizeof(idt_entry_t));
- idt_tables[cpu] = alloc_xenheap_pages(order, memflags);
+ BUILD_BUG_ON(IDT_ENTRIES * sizeof(idt_entry_t) != PAGE_SIZE);
+ idt_tables[cpu] = alloc_xenheap_pages(0, memflags);
if ( idt_tables[cpu] == NULL )
goto out;
- memcpy(idt_tables[cpu], idt_table, IDT_ENTRIES * sizeof(idt_entry_t));
+ memcpy(idt_tables[cpu], idt_table, PAGE_SIZE);
disable_each_ist(idt_tables[cpu]);
for ( stub_page = 0, i = cpu & ~(STUBS_PER_PAGE - 1);
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 3eab6d3..ef9464b 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -102,7 +102,8 @@ DEFINE_PER_CPU_READ_MOSTLY(struct desc_struct *, gdt_table);
DEFINE_PER_CPU_READ_MOSTLY(struct desc_struct *, compat_gdt_table);
/* Master table, used by CPU0. */
-idt_entry_t idt_table[IDT_ENTRIES];
+idt_entry_t idt_table[IDT_ENTRIES]
+__section(".bss.page_aligned") __aligned(PAGE_SIZE);
/* Pointer to the IDT of every CPU. */
idt_entry_t *idt_tables[NR_CPUS] __read_mostly;
diff --git a/xen/include/asm-x86/config.h b/xen/include/asm-x86/config.h
index baf973a..cddfc4e 100644
--- a/xen/include/asm-x86/config.h
+++ b/xen/include/asm-x86/config.h
@@ -293,6 +293,9 @@ extern unsigned char boot_edid_info[128];
extern unsigned long xen_phys_start;
#endif
+/* Mappings in the percpu area: */
+#define PERCPU_IDT_MAPPING (PERCPU_LINEAR_START + KB(4))
+
/* GDT/LDT shadow mapping area. The first per-domain-mapping sub-area. */
#define GDT_LDT_VCPU_SHIFT 5
#define GDT_LDT_VCPU_VA_SHIFT (GDT_LDT_VCPU_SHIFT + PAGE_SHIFT)
--
2.1.4
* [PATCH RFC 21/44] x86/smp: Switch to using the percpu IDT mappings
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (19 preceding siblings ...)
2018-01-04 20:21 ` [PATCH RFC 20/44] x86/smp: Allocate a percpu linear range for the IDT Andrew Cooper
@ 2018-01-04 20:21 ` Andrew Cooper
2018-01-04 20:21 ` [PATCH RFC 22/44] x86/mm: Track whether the current cr3 has a short or extended directmap Andrew Cooper
` (23 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
The loading of IDTR is moved out of load_system_tables() and into
early_switch_to_idle().
One complication for the BSP is that IST references still need to remain
uninitialised until reinit_bsp_stack(). Therefore, early_switch_to_idle() is
extended to take a bsp boolean.
For VT-x guests, HOST_IDTR_BASE needs altering, so a #VMExit doesn't change
the mappings we use. As this is now a compile-time constant, it moves from
vmx_set_host_env() to construct_vmcs() to avoid rewriting it on every context
switch.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/arch/x86/cpu/common.c | 8 --------
xen/arch/x86/hvm/vmx/vmcs.c | 4 +++-
xen/arch/x86/setup.c | 24 ++++++++++++++++++++++--
xen/arch/x86/smpboot.c | 2 +-
xen/include/asm-x86/system.h | 2 +-
5 files changed, 27 insertions(+), 13 deletions(-)
diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index b18e0f4..14743b6 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -644,7 +644,6 @@ void __init early_cpu_init(void)
*/
void load_system_tables(void)
{
- unsigned int cpu = smp_processor_id();
unsigned long stack_bottom = get_stack_bottom(),
stack_top = stack_bottom & ~(STACK_SIZE - 1);
@@ -658,10 +657,6 @@ void load_system_tables(void)
.base = (unsigned long)gdt,
.limit = LAST_RESERVED_GDT_BYTE,
};
- const struct desc_ptr idtr = {
- .base = (unsigned long)idt_tables[cpu],
- .limit = (IDT_ENTRIES * sizeof(idt_entry_t)) - 1,
- };
*tss = (struct tss_struct){
/* Main stack for interrupts/exceptions. */
@@ -699,12 +694,9 @@ void load_system_tables(void)
SYS_DESC_tss_busy);
lgdt(&gdtr);
- lidt(&idtr);
ltr(TSS_ENTRY << 3);
lldt(0);
- enable_each_ist(idt_tables[cpu]);
-
/*
* Bottom-of-stack must be 16-byte aligned!
*
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index e7818ca..f99f1bb 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -804,7 +804,6 @@ static void vmx_set_host_env(struct vcpu *v)
__vmwrite(HOST_GDTR_BASE,
(unsigned long)(this_cpu(gdt_table) - FIRST_RESERVED_GDT_ENTRY));
- __vmwrite(HOST_IDTR_BASE, (unsigned long)idt_tables[cpu]);
__vmwrite(HOST_TR_BASE, (unsigned long)&per_cpu(init_tss, cpu));
@@ -1133,6 +1132,9 @@ static int construct_vmcs(struct vcpu *v)
/* Disable PML anyway here as it will only be enabled in log dirty mode */
v->arch.hvm_vmx.secondary_exec_control &= ~SECONDARY_EXEC_ENABLE_PML;
+ /* Host system tables. */
+ __vmwrite(HOST_IDTR_BASE, PERCPU_IDT_MAPPING);
+
/* Host data selectors. */
__vmwrite(HOST_SS_SELECTOR, __HYPERVISOR_DS);
__vmwrite(HOST_DS_SELECTOR, __HYPERVISOR_DS);
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 5fa70bd..662c383 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -237,12 +237,25 @@ void __init discard_initial_images(void)
extern char __init_begin[], __init_end[], __bss_start[], __bss_end[];
-void early_switch_to_idle(void)
+void early_switch_to_idle(bool bsp)
{
unsigned int cpu = smp_processor_id();
struct vcpu *v = idle_vcpu[cpu];
unsigned long cr4 = read_cr4();
+ /*
+ * VT-x hardwires the IDT limit at 0xffff on VMExit.
+ *
+ * We don't wish to reload on vcpu context switch, so we have arranged for
+ * nothing else to live within 64k of the base. Unilaterally setting the
+ * limit to 0xffff avoids leaking whether HVM vcpus are running to PV
+ * guests via SIDT.
+ */
+ const struct desc_ptr idtr = {
+ .base = PERCPU_IDT_MAPPING,
+ .limit = 0xffff,
+ };
+
set_current(v);
per_cpu(curr_vcpu, cpu) = v;
@@ -257,12 +270,17 @@ void early_switch_to_idle(void)
: "memory" );
per_cpu(curr_ptbase, cpu) = v->arch.cr3;
+
+ lidt(&idtr);
+
+ if ( likely(!bsp) ) /* BSP IST setup deferred. */
+ enable_each_ist(idt_tables[cpu]);
}
static void __init init_idle_domain(void)
{
scheduler_init();
- early_switch_to_idle();
+ early_switch_to_idle(true);
}
void srat_detect_node(int cpu)
@@ -631,6 +649,8 @@ static void __init noreturn reinit_bsp_stack(void)
/* Update TSS and ISTs */
load_system_tables();
+ enable_each_ist(idt_tables[0]);
+
/* Update SYSCALL trampolines */
percpu_traps_init();
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 4df7775..7f02dd8 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -311,7 +311,7 @@ void start_secondary(void *unused)
set_processor_id(cpu);
get_cpu_info()->cr4 = XEN_MINIMAL_CR4;
- early_switch_to_idle();
+ early_switch_to_idle(false);
rdmsrl(MSR_EFER, this_cpu(efer));
diff --git a/xen/include/asm-x86/system.h b/xen/include/asm-x86/system.h
index ee57631..5cf8827 100644
--- a/xen/include/asm-x86/system.h
+++ b/xen/include/asm-x86/system.h
@@ -230,7 +230,7 @@ static inline int local_irq_is_enabled(void)
void trap_init(void);
void init_idt_traps(void);
-void early_switch_to_idle(void);
+void early_switch_to_idle(bool bsp);
void load_system_tables(void);
void percpu_traps_init(void);
void subarch_percpu_traps_init(void);
--
2.1.4
* [PATCH RFC 22/44] x86/mm: Track whether the current cr3 has a short or extended directmap
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (20 preceding siblings ...)
2018-01-04 20:21 ` [PATCH RFC 21/44] x86/smp: Switch to using the percpu IDT mappings Andrew Cooper
@ 2018-01-04 20:21 ` Andrew Cooper
2018-01-04 20:21 ` [PATCH RFC 23/44] x86/smp: Allocate percpu resources for map_domain_page() to use Andrew Cooper
` (22 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
This will be used to remove the mapcache override/current vcpu mechanism when
reworking map_domain_page() to be safe in the middle of context switches.
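The key subtlety, spelled out in the mm.h comment below, is the order in which
the flag is updated relative to the %cr3 write. A condensed sketch of the rule
that the hunk in do_write_ptbase() follows (variable names from the patch):

    /* Sketch only: it is always safe to under-claim the directmap, so
     * shrink the claim before switching and extend it only afterwards. */
    if ( !new_extd_directmap )
        *this_extd_directmap = false;

    write_cr3(new_cr3);

    if ( new_extd_directmap )
        *this_extd_directmap = true;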
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/arch/x86/mm.c | 11 +++++++++++
xen/arch/x86/setup.c | 2 ++
xen/common/efi/runtime.c | 3 +++
xen/include/asm-x86/mm.h | 11 +++++++++++
4 files changed, 27 insertions(+)
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 57b3e25..7c08807 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -499,6 +499,7 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
}
DEFINE_PER_CPU(unsigned long, curr_ptbase);
+DEFINE_PER_CPU(bool, curr_extended_directmap);
void do_write_ptbase(struct vcpu *v, bool tlb_maintenance)
{
@@ -506,6 +507,8 @@ void do_write_ptbase(struct vcpu *v, bool tlb_maintenance)
unsigned int cpu = smp_processor_id();
unsigned long *this_curr_ptbase = &per_cpu(curr_ptbase, cpu);
l4_pgentry_t percpu_mappings = per_cpu(percpu_mappings, cpu);
+ bool *this_extd_directmap = &per_cpu(curr_extended_directmap, cpu);
+ bool new_extd_directmap = paging_mode_external(v->domain);
l4_pgentry_t *new_l4t;
struct page_info *new_pg;
@@ -529,11 +532,19 @@ void do_write_ptbase(struct vcpu *v, bool tlb_maintenance)
new_l4t[l4_table_offset(PERCPU_LINEAR_START)] = percpu_mappings;
barrier();
+ /* If the new cr3 has a short directmap, report so before switching... */
+ if ( !new_extd_directmap )
+ *this_extd_directmap = new_extd_directmap;
+
if ( tlb_maintenance )
write_cr3(new_cr3);
else
asm volatile ( "mov %0, %%cr3" :: "r" (new_cr3) : "memory" );
+ /* ... else report afterwards. */
+ if ( new_extd_directmap )
+ *this_extd_directmap = new_extd_directmap;
+
/* Mark the old cr3 as no longer in use. */
if ( new_cr3 != *this_curr_ptbase )
{
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 662c383..80efef0 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -270,6 +270,7 @@ void early_switch_to_idle(bool bsp)
: "memory" );
per_cpu(curr_ptbase, cpu) = v->arch.cr3;
+ per_cpu(curr_extended_directmap, cpu) = true;
lidt(&idtr);
@@ -713,6 +714,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
idle_vcpu[0] = current;
percpu_init_areas();
+ per_cpu(curr_extended_directmap, 0) = true;
init_idt_traps();
{
diff --git a/xen/common/efi/runtime.c b/xen/common/efi/runtime.c
index 3dbc2e8..fe6d3af 100644
--- a/xen/common/efi/runtime.c
+++ b/xen/common/efi/runtime.c
@@ -112,6 +112,7 @@ struct efi_rs_state efi_rs_enter(void)
}
write_cr3(virt_to_maddr(efi_l4_pgtable));
+ this_cpu(curr_extended_directmap) = true;
return state;
}
@@ -120,6 +121,8 @@ void efi_rs_leave(struct efi_rs_state *state)
{
if ( !state->cr3 )
return;
+
+ this_cpu(curr_extended_directmap) = paging_mode_external(current->domain);
write_cr3(state->cr3);
if ( is_pv_vcpu(current) && !is_idle_vcpu(current) )
{
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 64044c6..54b7499 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -567,6 +567,17 @@ void audit_domains(void);
DECLARE_PER_CPU(unsigned long, curr_ptbase);
+/*
+ * Indicates whether the current %cr3 contains a short or extended directmap.
+ * Care needs to be taken when updating, as map_domain_page() is used in
+ * interrupt/exception context. It is safe to indicate that the current %cr3
+ * is short when it is actually extended (in which case, map_domain_page()
+ * will use a mapping slot rather than referring to the directmap), but it is
+ * not safe to indicate the opposite (in which case, map_domain_page() will
+ * return a pointer into 64bit PV kernel address space).
+ */
+DECLARE_PER_CPU(bool, curr_extended_directmap);
+
void make_cr3(struct vcpu *v, mfn_t mfn);
void update_cr3(struct vcpu *v);
int vcpu_destroy_pagetables(struct vcpu *);
--
2.1.4
* [PATCH RFC 23/44] x86/smp: Allocate percpu resources for map_domain_page() to use
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (21 preceding siblings ...)
2018-01-04 20:21 ` [PATCH RFC 22/44] x86/mm: Track whether the current cr3 has a short or extended directmap Andrew Cooper
@ 2018-01-04 20:21 ` Andrew Cooper
2018-01-04 20:21 ` [PATCH RFC 24/44] x86/mapcache: Reimplement map_domain_page() from scratch Andrew Cooper
` (21 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
The mapcache infrastructure needs some linear address space with which to make
temporary mappings.
_alter_percpu_mappings() is updated to support allocating an L1t, and
cpu_smpboot_alloc_common() is updated to allocate an L1t for mapcache
purposes, and to map that L1t into the linear address space so it can be
modified easily.
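The point of mapping the L1t at PERCPU_MAPCACHE_L1ES is that the following
patch can edit mapcache PTEs with a plain array access rather than a pagetable
walk. A sketch of the access pattern this enables (mfn is a placeholder; the
constants and helpers are the ones the series uses):

    /* Sketch only: editing a mapcache PTE through its self-mapped L1t. */
    l1_pgentry_t *l1t = (l1_pgentry_t *)PERCPU_MAPCACHE_L1ES;

    l1t[l1_table_offset(PERCPU_MAPCACHE_START)] =
        l1e_from_mfn(mfn, __PAGE_HYPERVISOR_RW);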
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/arch/x86/smpboot.c | 27 ++++++++++++++++++++++++++-
xen/include/asm-x86/config.h | 4 ++++
2 files changed, 30 insertions(+), 1 deletion(-)
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 7f02dd8..6a5f18a 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -645,6 +645,7 @@ void cpu_exit_clear(unsigned int cpu)
*/
enum percpu_alter_action {
PERCPU_MAP, /* Map existing frame: page and flags are input parameters. */
+ PERCPU_ALLOC_L1T, /* Allocate an L1 table, optionally returned via *page. */
};
static int _alter_percpu_mappings(
unsigned int cpu, unsigned long linear,
@@ -694,7 +695,10 @@ static int _alter_percpu_mappings(
l2t[l2_table_offset(linear)] = l2e_from_page(pg, __PAGE_HYPERVISOR);
}
else
- l1t = map_l1t_from_l2e(l2t[l2_table_offset(linear)]);
+ {
+ pg = l2e_get_page(l2t[l2_table_offset(linear)]);
+ l1t = __map_domain_page(pg);
+ }
switch ( action )
{
@@ -703,6 +707,11 @@ static int _alter_percpu_mappings(
l1t[l1_table_offset(linear)] = l1e_from_page(*page, flags);
break;
+ case PERCPU_ALLOC_L1T:
+ if ( page )
+ *page = pg;
+ break;
+
default:
ASSERT_UNREACHABLE();
rc = -EINVAL;
@@ -727,6 +736,12 @@ static int percpu_map_frame(unsigned int cpu, unsigned long linear,
return _alter_percpu_mappings(cpu, linear, PERCPU_MAP, &page, flags);
}
+static int percpu_alloc_l1t(unsigned int cpu, unsigned long linear,
+ struct page_info **page)
+{
+ return _alter_percpu_mappings(cpu, linear, PERCPU_ALLOC_L1T, page, 0);
+}
+
/* Allocate data common between the BSP and APs. */
static int cpu_smpboot_alloc_common(unsigned int cpu)
{
@@ -770,6 +785,16 @@ static int cpu_smpboot_alloc_common(unsigned int cpu)
if ( rc )
goto out;
+ /* Allocate space for the mapcache L1e's... */
+ rc = percpu_alloc_l1t(cpu, PERCPU_MAPCACHE_START, &pg);
+ if ( rc )
+ goto out;
+
+ /* ... and map the L1t so it can be used. */
+ rc = percpu_map_frame(cpu, PERCPU_MAPCACHE_L1ES, pg, PAGE_HYPERVISOR_RW);
+ if ( rc )
+ goto out;
+
rc = 0; /* Success */
out:
diff --git a/xen/include/asm-x86/config.h b/xen/include/asm-x86/config.h
index cddfc4e..a95f8c8 100644
--- a/xen/include/asm-x86/config.h
+++ b/xen/include/asm-x86/config.h
@@ -296,6 +296,10 @@ extern unsigned long xen_phys_start;
/* Mappings in the percpu area: */
#define PERCPU_IDT_MAPPING (PERCPU_LINEAR_START + KB(4))
+#define PERCPU_MAPCACHE_L1ES (PERCPU_LINEAR_START + MB(2) + KB(12))
+#define PERCPU_MAPCACHE_START (PERCPU_LINEAR_START + MB(4))
+#define PERCPU_MAPCACHE_END (PERCPU_MAPCACHE_START + MB(2))
+
/* GDT/LDT shadow mapping area. The first per-domain-mapping sub-area. */
#define GDT_LDT_VCPU_SHIFT 5
#define GDT_LDT_VCPU_VA_SHIFT (GDT_LDT_VCPU_SHIFT + PAGE_SHIFT)
--
2.1.4
* [PATCH RFC 24/44] x86/mapcache: Reimplement map_domain_page() from scratch
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (22 preceding siblings ...)
2018-01-04 20:21 ` [PATCH RFC 23/44] x86/smp: Allocate percpu resources for map_domain_page() to use Andrew Cooper
@ 2018-01-04 20:21 ` Andrew Cooper
2018-01-04 20:21 ` [PATCH RFC 25/44] x86/fixmap: Drop percpu fixmap range Andrew Cooper
` (20 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
There are two reasons:
1) To stop using the per-domain range for the mapcache
2) To make map_domain_page() safe to use during context switches
The new implementation is entirely percpu and rather simpler. See the
comment at the top of domain_page.c for a description of the algorithm.
A side effect of the new implementation is that we can get rid of struct
mapcache_{vcpu,domain} entirely, and mapcache_override_current().
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
TODO: Consider whether to try and lazily unmap, utilising other TLB flush
scenarios rather than forcing an invlpg on each unmap.
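A worked example of the slot layout described in the comment added to
domain_page.c: each slot is separated from its neighbours by an unmapped guard
page, so index and linear address convert as follows (matching
mapcache_idx_to_linear() and mapcache_linear_to_idx() below):

    /* Sketch only: slot 0 -> START + 4K, slot 1 -> START + 12K, ... */
    linear = PERCPU_MAPCACHE_START + pfn_to_paddr(idx * 2 + 1);
    idx    = paddr_to_pfn(linear - PERCPU_MAPCACHE_START) / 2;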
---
xen/arch/x86/domain.c | 6 -
xen/arch/x86/domain_page.c | 353 +++++++++++++------------------------------
xen/arch/x86/pv/dom0_build.c | 3 -
xen/include/asm-x86/config.h | 7 -
xen/include/asm-x86/domain.h | 42 -----
5 files changed, 106 insertions(+), 305 deletions(-)
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 93e81c0..3d9e7fb 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -324,10 +324,6 @@ int vcpu_initialise(struct vcpu *v)
v->arch.flags = TF_kernel_mode;
- rc = mapcache_vcpu_init(v);
- if ( rc )
- return rc;
-
if ( !is_idle_domain(d) )
{
paging_vcpu_init(v);
@@ -478,8 +474,6 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags,
d->arch.emulation_flags = emflags;
}
- mapcache_domain_init(d);
-
HYPERVISOR_COMPAT_VIRT_START(d) =
is_pv_domain(d) ? __HYPERVISOR_COMPAT_VIRT_START : ~0u;
diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index 8f2bcd4..c17ff66 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -18,291 +18,131 @@
#include <asm/hardirq.h>
#include <asm/setup.h>
-static DEFINE_PER_CPU(struct vcpu *, override);
+/*
+ * Global mapcache entries are implemented using the vmap() infrastructure.
+ *
+ * Local mapcache entries are implemented with a percpu linear range, starting
+ * at PERCPU_MAPCACHE_START. The maximum number of concurrent mappings we
+ * expect to use (NR_MAPCACHE_SLOTS) is that of a nested pagewalk. As this is
+ * a small number, allocations are tracked with a simple bitmap (inuse).
+ *
+ * There is plenty of linear address space to use, so addresses are handed out
+ * by index into the inuse bitmap, with unmapped guard pages in between, to
+ * help catch bounds errors in the code using the mappings.
+ *
+ * It is *not* safe to pass local mapcache mappings to other CPUs to use.
+ */
-static inline struct vcpu *mapcache_current_vcpu(void)
-{
- /* In the common case we use the mapcache of the running VCPU. */
- struct vcpu *v = this_cpu(override) ?: current;
+struct mapcache_info {
+#define NR_MAPCACHE_SLOTS (CONFIG_PAGING_LEVELS * CONFIG_PAGING_LEVELS)
+ unsigned long inuse;
+};
+static DEFINE_PER_CPU(struct mapcache_info, mapcache_info);
- /*
- * When current isn't properly set up yet, this is equivalent to
- * running in an idle vCPU (callers must check for NULL).
- */
- if ( v == INVALID_VCPU )
- return NULL;
+static unsigned long mapcache_idx_to_linear(unsigned int idx)
+{
+ return PERCPU_MAPCACHE_START + pfn_to_paddr(idx * 2 + 1);
+}
- /*
- * When using efi runtime page tables, we have the equivalent of the idle
- * domain's page tables but current may point at another domain's VCPU.
- * Return NULL as though current is not properly set up yet.
- */
- if ( efi_rs_using_pgtables() )
- return NULL;
+static unsigned int mapcache_linear_to_idx(unsigned long linear)
+{
+ return paddr_to_pfn(linear - PERCPU_MAPCACHE_START) / 2;
+}
- /*
- * If guest_table is NULL, and we are running a paravirtualised guest,
- * then it means we are running on the idle domain's page table and must
- * therefore use its mapcache.
- */
- if ( unlikely(pagetable_is_null(v->arch.guest_table)) && is_pv_vcpu(v) )
- {
- /* If we really are idling, perform lazy context switch now. */
- if ( (v = idle_vcpu[smp_processor_id()]) == current )
- sync_local_execstate();
- /* We must now be running on the idle page table. */
- ASSERT(read_cr3() == this_cpu(percpu_idle_pt));
- }
+static l1_pgentry_t *mapcache_l1e(unsigned long linear)
+{
+ l1_pgentry_t *l1t = (l1_pgentry_t *)PERCPU_MAPCACHE_L1ES;
- return v;
+ return &l1t[l1_table_offset(linear)];
}
-void __init mapcache_override_current(struct vcpu *v)
+/*
+ * Look up a mapcache entry, based on a linear address, ASSERT()ing that it is
+ * bounded sensibly and in use.
+ */
+static l1_pgentry_t *lookup_inuse_mapcache_entry(
+ unsigned long linear, unsigned int *p_idx)
{
- this_cpu(override) = v;
-}
+ unsigned int idx;
+ l1_pgentry_t *pl1e;
-#define mapcache_l2_entry(e) ((e) >> PAGETABLE_ORDER)
-#define MAPCACHE_L2_ENTRIES (mapcache_l2_entry(MAPCACHE_ENTRIES - 1) + 1)
-#define MAPCACHE_L1ENT(idx) \
- __linear_l1_table[l1_linear_offset(MAPCACHE_VIRT_START + pfn_to_paddr(idx))]
+ ASSERT(linear >= PERCPU_MAPCACHE_START && linear < PERCPU_MAPCACHE_END);
+
+ idx = mapcache_linear_to_idx(linear);
+ ASSERT(idx < NR_MAPCACHE_SLOTS);
+ ASSERT(test_bit(idx, &this_cpu(mapcache_info).inuse));
+
+ if ( p_idx )
+ *p_idx = idx;
+
+ pl1e = mapcache_l1e(linear);
+ ASSERT(l1e_get_flags(*pl1e) & _PAGE_PRESENT);
+
+ return pl1e;
+}
void *map_domain_page(mfn_t mfn)
{
- unsigned long flags;
- unsigned int idx, i;
- struct vcpu *v;
- struct mapcache_domain *dcache;
- struct mapcache_vcpu *vcache;
- struct vcpu_maphash_entry *hashent;
+ unsigned long flags, linear;
+ unsigned int idx;
+ struct mapcache_info *mci = &this_cpu(mapcache_info);
+ l1_pgentry_t *pl1e;
#ifdef NDEBUG
if ( mfn_x(mfn) <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
return mfn_to_virt(mfn_x(mfn));
#endif
- v = mapcache_current_vcpu();
- if ( !v || !is_pv_vcpu(v) )
+ if ( this_cpu(curr_extended_directmap) )
return mfn_to_virt(mfn_x(mfn));
- dcache = &v->domain->arch.pv_domain.mapcache;
- vcache = &v->arch.pv_vcpu.mapcache;
- if ( !dcache->inuse )
- return mfn_to_virt(mfn_x(mfn));
-
- perfc_incr(map_domain_page_count);
-
+ /*
+ * map_domain_page() is used from many contexts, including fault handlers.
+ * Disable interrupts to keep the inuse bitmap consistent with the l1t.
+ *
+ * Be aware! Any #PF inside this region will most likely recurse with the
+ * spurious pagefault handler until the BUG_ON() is hit.
+ */
local_irq_save(flags);
- hashent = &vcache->hash[MAPHASH_HASHFN(mfn_x(mfn))];
- if ( hashent->mfn == mfn_x(mfn) )
- {
- idx = hashent->idx;
- ASSERT(idx < dcache->entries);
- hashent->refcnt++;
- ASSERT(hashent->refcnt);
- ASSERT(l1e_get_pfn(MAPCACHE_L1ENT(idx)) == mfn_x(mfn));
- goto out;
- }
-
- spin_lock(&dcache->lock);
-
- /* Has some other CPU caused a wrap? We must flush if so. */
- if ( unlikely(dcache->epoch != vcache->shadow_epoch) )
- {
- vcache->shadow_epoch = dcache->epoch;
- if ( NEED_FLUSH(this_cpu(tlbflush_time), dcache->tlbflush_timestamp) )
- {
- perfc_incr(domain_page_tlb_flush);
- flush_tlb_local();
- }
- }
+ idx = find_first_zero_bit(&mci->inuse, NR_MAPCACHE_SLOTS);
+ BUG_ON(idx == NR_MAPCACHE_SLOTS);
- idx = find_next_zero_bit(dcache->inuse, dcache->entries, dcache->cursor);
- if ( unlikely(idx >= dcache->entries) )
- {
- unsigned long accum = 0, prev = 0;
-
- /* /First/, clean the garbage map and update the inuse list. */
- for ( i = 0; i < BITS_TO_LONGS(dcache->entries); i++ )
- {
- accum |= prev;
- dcache->inuse[i] &= ~xchg(&dcache->garbage[i], 0);
- prev = ~dcache->inuse[i];
- }
-
- if ( accum | (prev & BITMAP_LAST_WORD_MASK(dcache->entries)) )
- idx = find_first_zero_bit(dcache->inuse, dcache->entries);
- else
- {
- /* Replace a hash entry instead. */
- i = MAPHASH_HASHFN(mfn_x(mfn));
- do {
- hashent = &vcache->hash[i];
- if ( hashent->idx != MAPHASHENT_NOTINUSE && !hashent->refcnt )
- {
- idx = hashent->idx;
- ASSERT(l1e_get_pfn(MAPCACHE_L1ENT(idx)) == hashent->mfn);
- l1e_write(&MAPCACHE_L1ENT(idx), l1e_empty());
- hashent->idx = MAPHASHENT_NOTINUSE;
- hashent->mfn = ~0UL;
- break;
- }
- if ( ++i == MAPHASH_ENTRIES )
- i = 0;
- } while ( i != MAPHASH_HASHFN(mfn_x(mfn)) );
- }
- BUG_ON(idx >= dcache->entries);
-
- /* /Second/, flush TLBs. */
- perfc_incr(domain_page_tlb_flush);
- flush_tlb_local();
- vcache->shadow_epoch = ++dcache->epoch;
- dcache->tlbflush_timestamp = tlbflush_current_time();
- }
+ __set_bit(idx, &mci->inuse);
- set_bit(idx, dcache->inuse);
- dcache->cursor = idx + 1;
+ linear = mapcache_idx_to_linear(idx);
+ pl1e = mapcache_l1e(linear);
- spin_unlock(&dcache->lock);
+ ASSERT(!(l1e_get_flags(*pl1e) & _PAGE_PRESENT));
+ *pl1e = l1e_from_mfn(mfn, __PAGE_HYPERVISOR_RW);
+ barrier(); /* Ensure the pagetable is updated before enabling interrupts. */
- l1e_write(&MAPCACHE_L1ENT(idx), l1e_from_mfn(mfn, __PAGE_HYPERVISOR_RW));
-
- out:
local_irq_restore(flags);
- return (void *)MAPCACHE_VIRT_START + pfn_to_paddr(idx);
+
+ return (void *)linear;
}
void unmap_domain_page(const void *ptr)
{
+ struct mapcache_info *mci = &this_cpu(mapcache_info);
+ unsigned long flags, linear = (unsigned long)ptr;
unsigned int idx;
- struct vcpu *v;
- struct mapcache_domain *dcache;
- unsigned long va = (unsigned long)ptr, mfn, flags;
- struct vcpu_maphash_entry *hashent;
+ l1_pgentry_t *pl1e;
- if ( va >= DIRECTMAP_VIRT_START )
+ if ( linear >= DIRECTMAP_VIRT_START )
return;
- ASSERT(va >= MAPCACHE_VIRT_START && va < MAPCACHE_VIRT_END);
-
- v = mapcache_current_vcpu();
- ASSERT(v && is_pv_vcpu(v));
-
- dcache = &v->domain->arch.pv_domain.mapcache;
- ASSERT(dcache->inuse);
-
- idx = PFN_DOWN(va - MAPCACHE_VIRT_START);
- mfn = l1e_get_pfn(MAPCACHE_L1ENT(idx));
- hashent = &v->arch.pv_vcpu.mapcache.hash[MAPHASH_HASHFN(mfn)];
+ pl1e = lookup_inuse_mapcache_entry(linear, &idx);
local_irq_save(flags);
- if ( hashent->idx == idx )
- {
- ASSERT(hashent->mfn == mfn);
- ASSERT(hashent->refcnt);
- hashent->refcnt--;
- }
- else if ( !hashent->refcnt )
- {
- if ( hashent->idx != MAPHASHENT_NOTINUSE )
- {
- /* /First/, zap the PTE. */
- ASSERT(l1e_get_pfn(MAPCACHE_L1ENT(hashent->idx)) ==
- hashent->mfn);
- l1e_write(&MAPCACHE_L1ENT(hashent->idx), l1e_empty());
- /* /Second/, mark as garbage. */
- set_bit(hashent->idx, dcache->garbage);
- }
-
- /* Add newly-freed mapping to the maphash. */
- hashent->mfn = mfn;
- hashent->idx = idx;
- }
- else
- {
- /* /First/, zap the PTE. */
- l1e_write(&MAPCACHE_L1ENT(idx), l1e_empty());
- /* /Second/, mark as garbage. */
- set_bit(idx, dcache->garbage);
- }
+ *pl1e = l1e_empty();
+ asm volatile ( "invlpg %0" :: "m" (*(char *)ptr) : "memory" );
+ __clear_bit(idx, &mci->inuse);
local_irq_restore(flags);
}
-int mapcache_domain_init(struct domain *d)
-{
- struct mapcache_domain *dcache = &d->arch.pv_domain.mapcache;
- unsigned int bitmap_pages;
-
- if ( !is_pv_domain(d) || is_idle_domain(d) )
- return 0;
-
-#ifdef NDEBUG
- if ( !mem_hotplug && max_page <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
- return 0;
-#endif
-
- BUILD_BUG_ON(MAPCACHE_VIRT_END + PAGE_SIZE * (3 +
- 2 * PFN_UP(BITS_TO_LONGS(MAPCACHE_ENTRIES) * sizeof(long))) >
- MAPCACHE_VIRT_START + (PERDOMAIN_SLOT_MBYTES << 20));
- bitmap_pages = PFN_UP(BITS_TO_LONGS(MAPCACHE_ENTRIES) * sizeof(long));
- dcache->inuse = (void *)MAPCACHE_VIRT_END + PAGE_SIZE;
- dcache->garbage = dcache->inuse +
- (bitmap_pages + 1) * PAGE_SIZE / sizeof(long);
-
- spin_lock_init(&dcache->lock);
-
- return create_perdomain_mapping(d, (unsigned long)dcache->inuse,
- 2 * bitmap_pages + 1,
- NIL(l1_pgentry_t *), NULL);
-}
-
-int mapcache_vcpu_init(struct vcpu *v)
-{
- struct domain *d = v->domain;
- struct mapcache_domain *dcache = &d->arch.pv_domain.mapcache;
- unsigned long i;
- unsigned int ents = d->max_vcpus * MAPCACHE_VCPU_ENTRIES;
- unsigned int nr = PFN_UP(BITS_TO_LONGS(ents) * sizeof(long));
-
- if ( !is_pv_vcpu(v) || !dcache->inuse )
- return 0;
-
- if ( ents > dcache->entries )
- {
- /* Populate page tables. */
- int rc = create_perdomain_mapping(d, MAPCACHE_VIRT_START, ents,
- NIL(l1_pgentry_t *), NULL);
-
- /* Populate bit maps. */
- if ( !rc )
- rc = create_perdomain_mapping(d, (unsigned long)dcache->inuse,
- nr, NULL, NIL(struct page_info *));
- if ( !rc )
- rc = create_perdomain_mapping(d, (unsigned long)dcache->garbage,
- nr, NULL, NIL(struct page_info *));
-
- if ( rc )
- return rc;
-
- dcache->entries = ents;
- }
-
- /* Mark all maphash entries as not in use. */
- BUILD_BUG_ON(MAPHASHENT_NOTINUSE < MAPCACHE_ENTRIES);
- for ( i = 0; i < MAPHASH_ENTRIES; i++ )
- {
- struct vcpu_maphash_entry *hashent = &v->arch.pv_vcpu.mapcache.hash[i];
-
- hashent->mfn = ~0UL; /* never valid to map */
- hashent->idx = MAPHASHENT_NOTINUSE;
- }
-
- return 0;
-}
-
void *map_domain_page_global(mfn_t mfn)
{
ASSERT(!in_irq() &&
@@ -345,10 +185,29 @@ unsigned long domain_page_map_to_mfn(const void *ptr)
BUG_ON(!pl1e);
}
else
- {
- ASSERT(va >= MAPCACHE_VIRT_START && va < MAPCACHE_VIRT_END);
- pl1e = &__linear_l1_table[l1_linear_offset(va)];
- }
+ pl1e = lookup_inuse_mapcache_entry(va, NULL);
return l1e_get_pfn(*pl1e);
}
+
+static __init __maybe_unused void build_assertions(void)
+{
+ struct mapcache_info info;
+
+ /* NR_MAPCACHE_SLOTS within the bounds of the inuse bitmap? */
+ BUILD_BUG_ON(NR_MAPCACHE_SLOTS > (sizeof(info.inuse) * 8));
+
+ /* Enough linear address space, including guard pages? */
+ BUILD_BUG_ON(pfn_to_paddr(NR_MAPCACHE_SLOTS * 2 + 1) >
+ (PERCPU_MAPCACHE_END - PERCPU_MAPCACHE_START));
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 09c765a..3baf37b 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -698,7 +698,6 @@ int __init dom0_construct_pv(struct domain *d,
/* We run on dom0's page tables for the final part of the build process. */
write_ptbase(v);
- mapcache_override_current(v);
/* Copy the OS image and free temporary buffer. */
elf.dest_base = (void*)vkern_start;
@@ -717,7 +716,6 @@ int __init dom0_construct_pv(struct domain *d,
if ( (parms.virt_hypercall < v_start) ||
(parms.virt_hypercall >= v_end) )
{
- mapcache_override_current(NULL);
write_ptbase(current);
printk("Invalid HYPERCALL_PAGE field in ELF notes.\n");
rc = -1;
@@ -838,7 +836,6 @@ int __init dom0_construct_pv(struct domain *d,
xlat_start_info(si, XLAT_start_info_console_dom0);
/* Return to idle domain's page tables. */
- mapcache_override_current(NULL);
write_ptbase(current);
update_domain_wallclock_time(d);
diff --git a/xen/include/asm-x86/config.h b/xen/include/asm-x86/config.h
index a95f8c8..f78cbde 100644
--- a/xen/include/asm-x86/config.h
+++ b/xen/include/asm-x86/config.h
@@ -314,13 +314,6 @@ extern unsigned long xen_phys_start;
#define LDT_VIRT_START(v) \
(GDT_VIRT_START(v) + (64*1024))
-/* map_domain_page() map cache. The second per-domain-mapping sub-area. */
-#define MAPCACHE_VCPU_ENTRIES (CONFIG_PAGING_LEVELS * CONFIG_PAGING_LEVELS)
-#define MAPCACHE_ENTRIES (MAX_VIRT_CPUS * MAPCACHE_VCPU_ENTRIES)
-#define MAPCACHE_VIRT_START PERDOMAIN_VIRT_SLOT(1)
-#define MAPCACHE_VIRT_END (MAPCACHE_VIRT_START + \
- MAPCACHE_ENTRIES * PAGE_SIZE)
-
/* Argument translation area. The third per-domain-mapping sub-area. */
#define ARG_XLAT_VIRT_START PERDOMAIN_VIRT_SLOT(2)
/* Allow for at least one guard page (COMPAT_ARG_XLAT_SIZE being 2 pages): */
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index f699119..fa57c93 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -38,42 +38,6 @@ struct trap_bounce {
unsigned long eip;
};
-#define MAPHASH_ENTRIES 8
-#define MAPHASH_HASHFN(pfn) ((pfn) & (MAPHASH_ENTRIES-1))
-#define MAPHASHENT_NOTINUSE ((u32)~0U)
-struct mapcache_vcpu {
- /* Shadow of mapcache_domain.epoch. */
- unsigned int shadow_epoch;
-
- /* Lock-free per-VCPU hash of recently-used mappings. */
- struct vcpu_maphash_entry {
- unsigned long mfn;
- uint32_t idx;
- uint32_t refcnt;
- } hash[MAPHASH_ENTRIES];
-};
-
-struct mapcache_domain {
- /* The number of array entries, and a cursor into the array. */
- unsigned int entries;
- unsigned int cursor;
-
- /* Protects map_domain_page(). */
- spinlock_t lock;
-
- /* Garbage mappings are flushed from TLBs in batches called 'epochs'. */
- unsigned int epoch;
- u32 tlbflush_timestamp;
-
- /* Which mappings are in use, and which are garbage to reap next epoch? */
- unsigned long *inuse;
- unsigned long *garbage;
-};
-
-int mapcache_domain_init(struct domain *);
-int mapcache_vcpu_init(struct vcpu *);
-void mapcache_override_current(struct vcpu *);
-
/* x86/64: toggle guest between kernel and user modes. */
void toggle_guest_mode(struct vcpu *);
/* x86/64: toggle guest page tables between kernel and user modes. */
@@ -251,9 +215,6 @@ struct pv_domain
atomic_t nr_l4_pages;
- /* map_domain_page() mapping cache. */
- struct mapcache_domain mapcache;
-
struct cpuidmasks *cpuidmasks;
};
@@ -444,9 +405,6 @@ struct arch_domain
struct pv_vcpu
{
- /* map_domain_page() mapping cache. */
- struct mapcache_vcpu mapcache;
-
struct trap_info *trap_ctxt;
unsigned long gdt_frames[FIRST_RESERVED_GDT_PAGE];
--
2.1.4
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
^ permalink raw reply related [flat|nested] 61+ messages in thread
* [PATCH RFC 25/44] x86/fixmap: Drop percpu fixmap range
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (23 preceding siblings ...)
2018-01-04 20:21 ` [PATCH RFC 24/44] x86/mapcache: Reimplement map_domain_page() from scratch Andrew Cooper
@ 2018-01-04 20:21 ` Andrew Cooper
2018-01-04 20:21 ` [PATCH RFC 26/44] x86/pt-shadow: Maintain a small cache of shadowed frames Andrew Cooper
` (19 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
The percpu fixmap range was introduced to allow opencoding of
map_domain_page() in the middle of a context switch.
The new implementation of map_domain_page() is safe to use in a context
switch, so drop the percpu fixmap infrastructure.
This removes the temporary build-time restriction on NR_CPUS.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/arch/x86/mm.c | 6 ++----
xen/arch/x86/pv/pt-shadow.c | 15 ++++++---------
xen/include/asm-x86/fixmap.h | 32 --------------------------------
3 files changed, 8 insertions(+), 45 deletions(-)
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 7c08807..d5c69c0 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -526,11 +526,9 @@ void do_write_ptbase(struct vcpu *v, bool tlb_maintenance)
ASSERT(test_bit(_PGC_inuse_pgtable, &new_pg->count_info));
/* Insert percpu mappings into the new pagetables. */
- set_percpu_fixmap(cpu, PERCPU_FIXSLOT_LINEAR,
- l1e_from_paddr(new_cr3, __PAGE_HYPERVISOR_RW));
- new_l4t = percpu_fix_to_virt(cpu, PERCPU_FIXSLOT_LINEAR);
+ new_l4t = map_domain_page(maddr_to_mfn(new_cr3));
new_l4t[l4_table_offset(PERCPU_LINEAR_START)] = percpu_mappings;
- barrier();
+ unmap_domain_page(new_l4t);
/* If the new cr3 has a short directmap, report so before switching... */
if ( !new_extd_directmap )
diff --git a/xen/arch/x86/pv/pt-shadow.c b/xen/arch/x86/pv/pt-shadow.c
index 46c7b86..33cb303 100644
--- a/xen/arch/x86/pv/pt-shadow.c
+++ b/xen/arch/x86/pv/pt-shadow.c
@@ -22,7 +22,6 @@
#include <xen/mm.h>
#include <xen/numa.h>
-#include <asm/fixmap.h>
#include <asm/pv/pt-shadow.h>
/*
@@ -163,13 +162,11 @@ unsigned long pt_maybe_shadow(struct vcpu *v)
unsigned int slot = l4_table_offset(PERCPU_LINEAR_START);
l4_pgentry_t *l4t, *vcpu_l4t;
- set_percpu_fixmap(cpu, PERCPU_FIXSLOT_SHADOW,
- l1e_from_paddr(new_cr3, __PAGE_HYPERVISOR_RO));
ptsh->shadowing = new_cr3;
local_irq_restore(flags);
l4t = ptsh->shadow_l4_va;
- vcpu_l4t = percpu_fix_to_virt(cpu, PERCPU_FIXSLOT_SHADOW);
+ vcpu_l4t = map_domain_page(maddr_to_mfn(new_cr3));
/*
* Careful! When context switching between two vcpus, both of which
@@ -183,6 +180,8 @@ unsigned long pt_maybe_shadow(struct vcpu *v)
sizeof(*l4t) * slot);
memcpy(&l4t[slot + 1], &vcpu_l4t[slot + 1],
sizeof(*l4t) * (L4_PAGETABLE_ENTRIES - (slot + 1)));
+
+ unmap_domain_page(vcpu_l4t);
}
return ptsh->shadow_l4;
@@ -219,13 +218,11 @@ static void _pt_shadow_ipi(void *arg)
case PTSH_IPI_WRITE:
l4t = ptsh->shadow_l4_va;
-
- /* Reuse the mapping established in pt_maybe_shadow(). */
- ASSERT(l1e_get_paddr(*percpu_fixmap_l1e(cpu, PERCPU_FIXSLOT_SHADOW)) ==
- maddr);
- vcpu_l4t = percpu_fix_to_virt(cpu, PERCPU_FIXSLOT_SHADOW);
+ vcpu_l4t = map_domain_page(maddr_to_mfn(maddr));
l4t[info->slot] = vcpu_l4t[info->slot];
+
+ unmap_domain_page(vcpu_l4t);
break;
default:
diff --git a/xen/include/asm-x86/fixmap.h b/xen/include/asm-x86/fixmap.h
index c1b3bda..89bf6cb 100644
--- a/xen/include/asm-x86/fixmap.h
+++ b/xen/include/asm-x86/fixmap.h
@@ -27,10 +27,6 @@
#include <asm/msi.h>
#include <acpi/apei.h>
-#define NR_PERCPU_SLOTS 2
-#define PERCPU_FIXSLOT_SHADOW 0
-#define PERCPU_FIXSLOT_LINEAR 1
-
/*
* Here we define all the compile-time 'special' virtual
* addresses. The point is to have a constant address at
@@ -49,8 +45,6 @@ enum fixed_addresses {
FIX_COM_BEGIN,
FIX_COM_END,
FIX_EHCI_DBGP,
- FIX_PERCPU_BEGIN,
- FIX_PERCPU_END = FIX_PERCPU_BEGIN + (NR_CPUS - 1) * NR_PERCPU_SLOTS,
/* Everything else should go further down. */
FIX_APIC_BASE,
FIX_IO_APIC_BASE_0,
@@ -93,32 +87,6 @@ static inline unsigned long virt_to_fix(const unsigned long vaddr)
return __virt_to_fix(vaddr);
}
-static inline void *percpu_fix_to_virt(unsigned int cpu, unsigned int slot)
-{
- return (void *)fix_to_virt(FIX_PERCPU_BEGIN + (slot * NR_CPUS) + cpu);
-}
-
-static inline l1_pgentry_t *percpu_fixmap_l1e(unsigned int cpu, unsigned int slot)
-{
- BUILD_BUG_ON(FIX_PERCPU_END >= L1_PAGETABLE_ENTRIES);
-
- return &l1_fixmap[l1_table_offset((unsigned long)percpu_fix_to_virt(cpu, slot))];
-}
-
-static inline void set_percpu_fixmap(unsigned int cpu, unsigned int slot, l1_pgentry_t l1e)
-{
- l1_pgentry_t *pl1e = percpu_fixmap_l1e(cpu, slot);
-
- if ( l1e_get_intpte(*pl1e) != l1e_get_intpte(l1e) )
- {
- *pl1e = l1e;
-
- __asm__ __volatile__ ( "invlpg %0"
- :: "m" (*(char *)percpu_fix_to_virt(cpu, slot))
- : "memory" );
- }
-}
-
#endif /* __ASSEMBLY__ */
#endif
--
2.1.4
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
^ permalink raw reply related [flat|nested] 61+ messages in thread
* [PATCH RFC 26/44] x86/pt-shadow: Maintain a small cache of shadowed frames
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (24 preceding siblings ...)
2018-01-04 20:21 ` [PATCH RFC 25/44] x86/fixmap: Drop percpu fixmap range Andrew Cooper
@ 2018-01-04 20:21 ` Andrew Cooper
2018-01-04 20:21 ` [PATCH RFC 27/44] x86/smp: Allocate a percpu linear range for the compat translation area Andrew Cooper
` (18 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
This improves the shadowing performance substantially. In particular, system
calls for 64bit PV guests (which switch between the user and kernel
pagetables) no longer suffer a 4K copy hit in both directions.
See the code comments for reasoning and the algorithm description.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
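As an aside (sketch only, not part of the patch): the unrolled switch statements maintaining most-recently-used order below are equivalent to a simple rotation over the cache[] array, hard-coded for NR_L4_SHADOWS == 4.
/* Sketch: promote cache[hit] to most-recently-used by rotating the
 * entries in front of it down by one place. */
static void cache_promote(pt_cache_entry_t *cache, unsigned int hit)
{
    pt_cache_entry_t tmp = cache[hit];

    while ( hit-- )
        cache[hit + 1] = cache[hit];
    cache[0] = tmp;
}
The demotion on PTSH_IPI_INVLPG is the same rotation in the opposite direction, with the freed shadow index reinserted at the least-recently-used end.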
---
xen/arch/x86/mm.c | 2 +
xen/arch/x86/mm/shadow/multi.c | 2 +
xen/arch/x86/pv/pt-shadow.c | 196 ++++++++++++++++++++++++++++++++-----
xen/include/asm-x86/pv/pt-shadow.h | 9 ++
4 files changed, 186 insertions(+), 23 deletions(-)
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index d5c69c0..f8f15e9 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -2413,6 +2413,8 @@ int free_page_type(struct page_info *page, unsigned long type,
case PGT_l4_page_table:
ASSERT(preemptible);
rc = free_l4_table(page);
+ if ( !rc )
+ pt_shadow_l4_invlpg(owner, page);
break;
default:
gdprintk(XENLOG_WARNING, "type %" PRtype_info " mfn %" PRI_mfn "\n",
diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index 9c929ed..f9ec5aa 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -1895,6 +1895,8 @@ void sh_destroy_l4_shadow(struct domain *d, mfn_t smfn)
}
});
+ pt_shadow_l4_invlpg(d, sp);
+
/* Put the memory back in the pool */
shadow_free(d, smfn);
}
diff --git a/xen/arch/x86/pv/pt-shadow.c b/xen/arch/x86/pv/pt-shadow.c
index 33cb303..b4f2b86 100644
--- a/xen/arch/x86/pv/pt-shadow.c
+++ b/xen/arch/x86/pv/pt-shadow.c
@@ -24,6 +24,10 @@
#include <asm/pv/pt-shadow.h>
+/* Override macros from asm/mm.h to make them work with mfn_t */
+#undef page_to_mfn
+#define page_to_mfn(pg) _mfn(__page_to_mfn(pg))
+
/*
* To use percpu linear ranges, we require that no two pcpus have %cr3
* pointing at the same L4 pagetable at the same time.
@@ -38,19 +42,44 @@
*
* The algorithm is fairly simple.
*
+ * - A small cache of shadowed L4s from the same guest is maintained.
* - When a pcpu is switching to a new vcpu cr3 and shadowing is necessary,
- * perform a full 4K copy of the guests frame into a percpu frame, and run
- * on that.
+ * the cache is searched.
+ * - If the new cr3 is already cached, use our existing shadow.
+ * - If not, drop an entry and shadow the new frame with a full 4K copy.
* - When a write to a guests L4 pagetable occurs, the update must be
* propagated to all existing shadows. An IPI is sent to the domains
* dirty mask indicating which frame/slot was updated, and each pcpu
* checks to see whether it needs to sync the update into its shadow.
+ * - When a guest L4 pagetable is freed, it must be dropped from any caches,
+ * as Xen will allow it to become writeable to the guest again, and its
+ * contents will go stale. It uses the same IPI mechanism as for writes.
+ */
+
+#define L4_SHADOW_ORDER 2
+#define NR_L4_SHADOWS (1ul << L4_SHADOW_ORDER)
+
+/*
+ * An individual cache entry. Contains a %cr3 which has been cached, and the
+ * index of this entry into the shadow frames.
+ *
+ * The layout relies on %cr3 being page aligned, with the index stored in the
+ * lower bits. idx could be a smaller bitfield, but there is no other
+ * information to store, and having it as an 8bit field results in better
+ * compiled code.
*/
+typedef union pt_cache_entry {
+ unsigned long raw;
+ struct {
+ uint8_t idx;
+ unsigned long :4, cr3_mfn:52;
+ };
+} pt_cache_entry_t;
struct pt_shadow {
/*
- * A frame used to shadow a vcpus intended pagetable. When shadowing,
- * this frame is the one actually referenced by %cr3.
+ * A cache of frames used to shadow a vcpus intended pagetables. When
+ * shadowing, one of these frames is the one actually referenced by %cr3.
*/
paddr_t shadow_l4;
l4_pgentry_t *shadow_l4_va;
@@ -63,29 +92,60 @@ struct pt_shadow {
*/
const struct domain *domain;
- /* If nonzero, a guests pagetable which we are shadowing. */
- paddr_t shadowing;
+ /*
+ * A collection of %cr3's, belonging to @p domain, which are shadowed
+ * locally.
+ *
+ * A cache entry is used if cr3_mfn != 0, free otherwise. The cache is
+ * maintained in most-recently-used order. As a result, cache[0].cr3_mfn
+ * should always match v->arch.cr3.
+ *
+ * The cache[].idx fields will always be unique, and between 0 and
+ * NR_L4_SHADOWS. Their order however will vary as most-recently-used
+ * order is maintained.
+ */
+ pt_cache_entry_t cache[NR_L4_SHADOWS];
};
static DEFINE_PER_CPU(struct pt_shadow, ptsh);
+static l4_pgentry_t *shadow_l4_va(struct pt_shadow *ptsh, unsigned int idx)
+{
+ return _p(ptsh->shadow_l4_va) + idx * PAGE_SIZE;
+}
+
+static paddr_t shadow_l4(struct pt_shadow *ptsh, unsigned int idx)
+{
+ return ptsh->shadow_l4 + idx * PAGE_SIZE;
+}
+
int pt_shadow_alloc(unsigned int cpu)
{
struct pt_shadow *ptsh = &per_cpu(ptsh, cpu);
- unsigned int memflags = 0;
+ unsigned int memflags = 0, i;
nodeid_t node = cpu_to_node(cpu);
struct page_info *pg;
+ mfn_t mfns[NR_L4_SHADOWS];
if ( node != NUMA_NO_NODE )
memflags = MEMF_node(node);
- pg = alloc_domheap_page(NULL, memflags);
+ pg = alloc_domheap_pages(NULL, L4_SHADOW_ORDER, memflags);
if ( !pg )
return -ENOMEM;
ptsh->shadow_l4 = page_to_maddr(pg);
- ptsh->shadow_l4_va = __map_domain_page_global(pg);
+ for ( i = 0; i < ARRAY_SIZE(mfns); ++i )
+ {
+ /* Initialise the cache (ascending idx fields). */
+ ptsh->cache[i] = (pt_cache_entry_t){ i };
+
+ /* Collect MFNs to vmap(). */
+ mfns[i] = mfn_add(maddr_to_mfn(ptsh->shadow_l4), i);
+ }
+
+ ptsh->shadow_l4_va = vmap(mfns, ARRAY_SIZE(mfns));
if ( !ptsh->shadow_l4_va )
return -ENOMEM;
@@ -98,17 +158,35 @@ void pt_shadow_free(unsigned int cpu)
if ( ptsh->shadow_l4_va )
{
- unmap_domain_page_global(ptsh->shadow_l4_va);
+ vunmap(ptsh->shadow_l4_va);
ptsh->shadow_l4_va = NULL;
}
if ( ptsh->shadow_l4 )
{
- free_domheap_page(maddr_to_page(ptsh->shadow_l4));
+ free_domheap_pages(maddr_to_page(ptsh->shadow_l4), L4_SHADOW_ORDER);
ptsh->shadow_l4 = 0;
}
}
+static pt_cache_entry_t *pt_cache_lookup(
+ struct pt_shadow *ptsh, unsigned long maddr)
+{
+ unsigned int i;
+
+ ASSERT(!local_irq_is_enabled());
+
+ for ( i = 0; i < ARRAY_SIZE(ptsh->cache); ++i )
+ {
+ pt_cache_entry_t *ent = &ptsh->cache[i];
+
+ if ( ent->cr3_mfn == (maddr >> PAGE_SHIFT) )
+ return ent;
+ }
+
+ return NULL;
+}
+
/*
* We only need to shadow 4-level PV guests. All other guests have per-vcpu
* monitor tables which are never scheduled on concurrent pcpus. Care needs
@@ -126,6 +204,7 @@ unsigned long pt_maybe_shadow(struct vcpu *v)
unsigned int cpu = smp_processor_id();
struct pt_shadow *ptsh = &per_cpu(ptsh, cpu);
unsigned long flags, new_cr3 = v->arch.cr3;
+ pt_cache_entry_t *ent;
/*
* IPIs for updates are based on the domain dirty mask. If we ever switch
@@ -135,8 +214,12 @@ unsigned long pt_maybe_shadow(struct vcpu *v)
if ( ptsh->domain &&
ptsh->domain != v->domain )
{
+ unsigned int i;
+
ptsh->domain = NULL;
- ptsh->shadowing = 0;
+
+ for ( i = 0; i < ARRAY_SIZE(ptsh->cache); ++i )
+ ptsh->cache[i].cr3_mfn = 0;
}
/* No shadowing necessary? Run on the intended pagetable. */
@@ -145,10 +228,6 @@ unsigned long pt_maybe_shadow(struct vcpu *v)
ptsh->domain = v->domain;
- /* Fastpath, if we are already shadowing the intended pagetable. */
- if ( ptsh->shadowing == new_cr3 )
- return ptsh->shadow_l4;
-
/*
* We may be called with interrupts disabled (e.g. context switch), or
* interrupts enabled (e.g. new_guest_cr3()).
@@ -158,14 +237,46 @@ unsigned long pt_maybe_shadow(struct vcpu *v)
*/
local_irq_save(flags);
+ ent = pt_cache_lookup(ptsh, new_cr3);
+ if ( ent )
+ {
+ /*
+ * Cache hit. Promote this entry to being most recently used (if it
+ * isn't already).
+ */
+ unsigned int cache_idx = ent - ptsh->cache;
+
+ if ( cache_idx )
+ {
+ pt_cache_entry_t tmp = *ent;
+
+ switch ( cache_idx )
+ {
+ case 3: ptsh->cache[3] = ptsh->cache[2];
+ case 2: ptsh->cache[2] = ptsh->cache[1];
+ case 1: ptsh->cache[1] = ptsh->cache[0];
+ ptsh->cache[0] = tmp;
+ }
+ }
+ local_irq_restore(flags);
+ }
+ else
{
+ /*
+ * Cache miss. Recycle whatever was in the last slot, promote it to
+ * being most recently used, and copy the entire pagetable.
+ */
unsigned int slot = l4_table_offset(PERCPU_LINEAR_START);
+ unsigned int idx = ptsh->cache[3].idx;
l4_pgentry_t *l4t, *vcpu_l4t;
- ptsh->shadowing = new_cr3;
+ ptsh->cache[3] = ptsh->cache[2];
+ ptsh->cache[2] = ptsh->cache[1];
+ ptsh->cache[1] = ptsh->cache[0];
+ ptsh->cache[0] = (pt_cache_entry_t){ new_cr3 | idx };
local_irq_restore(flags);
- l4t = ptsh->shadow_l4_va;
+ l4t = shadow_l4_va(ptsh, idx);
vcpu_l4t = map_domain_page(maddr_to_mfn(new_cr3));
/*
@@ -184,7 +295,9 @@ unsigned long pt_maybe_shadow(struct vcpu *v)
unmap_domain_page(vcpu_l4t);
}
- return ptsh->shadow_l4;
+ ASSERT(ptsh->cache[0].cr3_mfn == (new_cr3 >> PAGE_SHIFT));
+
+ return shadow_l4(ptsh, ptsh->cache[0].idx);
}
struct ptsh_ipi_info
@@ -193,6 +306,7 @@ struct ptsh_ipi_info
const struct page_info *pg;
enum {
PTSH_IPI_WRITE,
+ PTSH_IPI_INVLPG,
} op;
unsigned int slot;
};
@@ -202,29 +316,49 @@ static void _pt_shadow_ipi(void *arg)
unsigned int cpu = smp_processor_id();
struct pt_shadow *ptsh = &per_cpu(ptsh, cpu);
const struct ptsh_ipi_info *info = arg;
- unsigned long maddr = page_to_maddr(info->pg);
+ pt_cache_entry_t *ent;
/* No longer shadowing state from this domain? Nothing to do. */
if ( info->d != ptsh->domain )
return;
+ ent = pt_cache_lookup(ptsh, page_to_maddr(info->pg));
+
/* Not shadowing this frame? Nothing to do. */
- if ( ptsh->shadowing != maddr )
+ if ( ent == NULL )
return;
switch ( info->op )
{
l4_pgentry_t *l4t, *vcpu_l4t;
+ unsigned int cache_idx, shadow_idx;
case PTSH_IPI_WRITE:
- l4t = ptsh->shadow_l4_va;
- vcpu_l4t = map_domain_page(maddr_to_mfn(maddr));
+ l4t = shadow_l4_va(ptsh, ent->idx);
+ vcpu_l4t = map_domain_page(page_to_mfn(info->pg));
l4t[info->slot] = vcpu_l4t[info->slot];
unmap_domain_page(vcpu_l4t);
break;
+ case PTSH_IPI_INVLPG:
+ cache_idx = ent - ptsh->cache;
+ shadow_idx = ent->idx;
+
+ /*
+ * Demote the dropped entry to least-recently-used, so it is the next
+ * entry to be reused.
+ */
+ switch ( cache_idx )
+ {
+ case 0: BUG(); /* ??? Freeing the L4 which current is running on! */
+ case 1: ptsh->cache[1] = ptsh->cache[2];
+ case 2: ptsh->cache[2] = ptsh->cache[3];
+ case 3: ptsh->cache[3] = (pt_cache_entry_t){ shadow_idx };
+ }
+ break;
+
default:
ASSERT_UNREACHABLE();
}
@@ -248,6 +382,22 @@ void pt_shadow_l4_write(const struct domain *d, const struct page_info *pg,
on_selected_cpus(d->domain_dirty_cpumask, _pt_shadow_ipi, &info, 1);
}
+void pt_shadow_l4_invlpg(const struct domain *d, const struct page_info *pg)
+{
+ struct ptsh_ipi_info info;
+
+ if ( !pt_need_shadow(d) )
+ return;
+
+ info = (struct ptsh_ipi_info){
+ .d = d,
+ .pg = pg,
+ .op = PTSH_IPI_INVLPG,
+ };
+
+ on_selected_cpus(d->domain_dirty_cpumask, _pt_shadow_ipi, &info, 1);
+}
+
/*
* Local variables:
* mode: C
diff --git a/xen/include/asm-x86/pv/pt-shadow.h b/xen/include/asm-x86/pv/pt-shadow.h
index 6e71e99..d5576f4 100644
--- a/xen/include/asm-x86/pv/pt-shadow.h
+++ b/xen/include/asm-x86/pv/pt-shadow.h
@@ -47,6 +47,13 @@ unsigned long pt_maybe_shadow(struct vcpu *v);
void pt_shadow_l4_write(
const struct domain *d, const struct page_info *pg, unsigned int slot);
+/*
+ * Called when an L4 pagetable is freed. The PT shadow logic ensures that it
+ * is purged from any caches.
+ */
+void pt_shadow_l4_invlpg(
+ const struct domain *d, const struct page_info *pg);
+
#else /* !CONFIG_PV */
static inline int pt_shadow_alloc(unsigned int cpu) { return 0; }
@@ -58,6 +65,8 @@ static inline unsigned long pt_maybe_shadow(struct vcpu *v)
}
static inline void pt_shadow_l4_write(
const struct domain *d, const struct page_info *pg, unsigned int slot) { }
+static inline void pt_shadow_l4_invlpg(
+ const struct domain *d, const struct page_info *pg) { }
#endif /* CONFIG_PV */
--
2.1.4
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
^ permalink raw reply related [flat|nested] 61+ messages in thread
* [PATCH RFC 27/44] x86/smp: Allocate a percpu linear range for the compat translation area.
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (25 preceding siblings ...)
2018-01-04 20:21 ` [PATCH RFC 26/44] x86/pt-shadow: Maintain a small cache of shadowed frames Andrew Cooper
@ 2018-01-04 20:21 ` Andrew Cooper
2018-01-04 20:21 ` [PATCH RFC 28/44] x86/xlat: Use the percpu " Andrew Cooper
` (17 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
The existing translation area claims to be 2 frames and a guard page, but is
actually 4 frames with no guard page at all.
Allocate 2 frames in the percpu area, which actually has unmapped frames on
either side.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
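Purely as an illustrative sanity check (not part of the patch), the assumption that two frames cover the whole translation area could be encoded at build time, given COMPAT_ARG_XLAT_SIZE being two pages:
/* Sketch: the percpu XLAT window is backed by exactly two frames. */
static void __init __maybe_unused xlat_build_assertions(void)
{
    BUILD_BUG_ON(COMPAT_ARG_XLAT_SIZE > 2 * PAGE_SIZE);
}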
---
xen/arch/x86/smpboot.c | 27 +++++++++++++++++++++++++++
xen/include/asm-x86/config.h | 2 ++
2 files changed, 29 insertions(+)
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 6a5f18a..a5d3f7a 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -646,6 +646,8 @@ void cpu_exit_clear(unsigned int cpu)
enum percpu_alter_action {
PERCPU_MAP, /* Map existing frame: page and flags are input parameters. */
PERCPU_ALLOC_L1T, /* Allocate an L1 table. optionally returned via *page. */
+ PERCPU_ALLOC_FRAME, /* Allocate a frame. flags is an input, and *page is
+ * optionally an output. */
};
static int _alter_percpu_mappings(
unsigned int cpu, unsigned long linear,
@@ -707,6 +709,15 @@ static int _alter_percpu_mappings(
l1t[l1_table_offset(linear)] = l1e_from_page(*page, flags);
break;
+ case PERCPU_ALLOC_FRAME:
+ pg = alloc_domheap_page(NULL, memflags);
+ if ( !pg )
+ goto out;
+
+ clear_domain_page(page_to_mfn(pg));
+ l1t[l1_table_offset(linear)] = l1e_from_page(pg, flags);
+
+ /* Fallthrough */
case PERCPU_ALLOC_L1T:
if ( page )
*page = pg;
@@ -742,6 +753,12 @@ static int percpu_alloc_l1t(unsigned int cpu, unsigned long linear,
return _alter_percpu_mappings(cpu, linear, PERCPU_ALLOC_L1T, page, 0);
}
+static int percpu_alloc_frame(unsigned int cpu, unsigned long linear,
+ struct page_info **page, unsigned int flags)
+{
+ return _alter_percpu_mappings(cpu, linear, PERCPU_ALLOC_FRAME, page, flags);
+}
+
/* Allocate data common between the BSP and APs. */
static int cpu_smpboot_alloc_common(unsigned int cpu)
{
@@ -795,6 +812,16 @@ static int cpu_smpboot_alloc_common(unsigned int cpu)
if ( rc )
goto out;
+ /* Allocate two pages for the XLAT area. */
+ rc = percpu_alloc_frame(cpu, PERCPU_XLAT_START, NULL,
+ PAGE_HYPERVISOR_RW | MAP_PERCPU_AUTOFREE);
+ if ( rc )
+ goto out;
+ rc = percpu_alloc_frame(cpu, PERCPU_XLAT_START + PAGE_SIZE, NULL,
+ PAGE_HYPERVISOR_RW | MAP_PERCPU_AUTOFREE);
+ if ( rc )
+ goto out;
+
rc = 0; /* Success */
out:
diff --git a/xen/include/asm-x86/config.h b/xen/include/asm-x86/config.h
index f78cbde..3d64047 100644
--- a/xen/include/asm-x86/config.h
+++ b/xen/include/asm-x86/config.h
@@ -300,6 +300,8 @@ extern unsigned long xen_phys_start;
#define PERCPU_MAPCACHE_START (PERCPU_LINEAR_START + MB(4))
#define PERCPU_MAPCACHE_END (PERCPU_MAPCACHE_START + MB(2))
+#define PERCPU_XLAT_START (PERCPU_LINEAR_START + MB(6) + KB(8))
+
/* GDT/LDT shadow mapping area. The first per-domain-mapping sub-area. */
#define GDT_LDT_VCPU_SHIFT 5
#define GDT_LDT_VCPU_VA_SHIFT (GDT_LDT_VCPU_SHIFT + PAGE_SHIFT)
--
2.1.4
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
^ permalink raw reply related [flat|nested] 61+ messages in thread
* [PATCH RFC 28/44] x86/xlat: Use the percpu compat translation area
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (26 preceding siblings ...)
2018-01-04 20:21 ` [PATCH RFC 27/44] x86/smp: Allocate a percpu linear range for the compat translation area Andrew Cooper
@ 2018-01-04 20:21 ` Andrew Cooper
2018-01-04 20:21 ` [PATCH RFC 29/44] x86/smp: Allocate percpu resources for the GDT and LDT Andrew Cooper
` (16 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
This allows {setup,free}_compat_arg_xlat() to be dropped.
Changing COMPAT_ARG_XLAT_VIRT_BASE to avoid referencing current has a fairly
large impact on code size, as it is hidden underneath the
copy_{to,from}_guest() logic.
The net bloat-o-meter report for this change is:
add/remove: 0/2 grow/shrink: 4/35 up/down: 570/-1285 (-715)
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
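To illustrate where the code-size delta comes from (sketch only): with the base now a plain constant, a bounds check against the translation window reduces to arithmetic on PERCPU_XLAT_START, with no dereference of current at all.
/* Sketch: does [addr, addr + size) fall within the percpu XLAT window? */
static bool in_xlat_window(const void *addr, unsigned long size)
{
    unsigned long off = (unsigned long)addr - PERCPU_XLAT_START;

    return (off < COMPAT_ARG_XLAT_SIZE) &&
           (size <= COMPAT_ARG_XLAT_SIZE - off);
}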
---
xen/arch/x86/hvm/hvm.c | 8 --------
xen/arch/x86/pv/dom0_build.c | 2 --
xen/arch/x86/pv/domain.c | 12 +-----------
xen/arch/x86/x86_64/mm.c | 13 -------------
xen/include/asm-x86/config.h | 7 -------
xen/include/asm-x86/x86_64/uaccess.h | 6 ++----
6 files changed, 3 insertions(+), 45 deletions(-)
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 71fddfd..5836269 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1533,10 +1533,6 @@ int hvm_vcpu_initialise(struct vcpu *v)
v->arch.hvm_vcpu.inject_event.vector = HVM_EVENT_VECTOR_UNSET;
- rc = setup_compat_arg_xlat(v); /* teardown: free_compat_arg_xlat() */
- if ( rc != 0 )
- goto fail4;
-
if ( nestedhvm_enabled(d)
&& (rc = nestedhvm_vcpu_initialise(v)) < 0 ) /* teardown: nestedhvm_vcpu_destroy */
goto fail5;
@@ -1562,8 +1558,6 @@ int hvm_vcpu_initialise(struct vcpu *v)
fail6:
nestedhvm_vcpu_destroy(v);
fail5:
- free_compat_arg_xlat(v);
- fail4:
hvm_funcs.vcpu_destroy(v);
fail3:
vlapic_destroy(v);
@@ -1584,8 +1578,6 @@ void hvm_vcpu_destroy(struct vcpu *v)
nestedhvm_vcpu_destroy(v);
- free_compat_arg_xlat(v);
-
tasklet_kill(&v->arch.hvm_vcpu.assert_evtchn_irq_tasklet);
hvm_funcs.vcpu_destroy(v);
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 3baf37b..3f5e3bc 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -385,8 +385,6 @@ int __init dom0_construct_pv(struct domain *d,
{
d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 1;
v->vcpu_info = (void *)&d->shared_info->compat.vcpu_info[0];
- if ( setup_compat_arg_xlat(v) != 0 )
- BUG();
}
nr_pages = dom0_compute_nr_pages(d, &parms, initrd_len);
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 7e4566d..4e88bfd 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -72,8 +72,7 @@ int switch_compat(struct domain *d)
for_each_vcpu( d, v )
{
- if ( (rc = setup_compat_arg_xlat(v)) ||
- (rc = setup_compat_l4(v)) )
+ if ( (rc = setup_compat_l4(v)) )
goto undo_and_fail;
}
@@ -87,10 +86,7 @@ int switch_compat(struct domain *d)
undo_and_fail:
d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 0;
for_each_vcpu( d, v )
- {
- free_compat_arg_xlat(v);
release_compat_l4(v);
- }
return rc;
}
@@ -112,10 +108,7 @@ static void pv_destroy_gdt_ldt_l1tab(struct vcpu *v)
void pv_vcpu_destroy(struct vcpu *v)
{
if ( is_pv_32bit_vcpu(v) )
- {
- free_compat_arg_xlat(v);
release_compat_l4(v);
- }
pv_destroy_gdt_ldt_l1tab(v);
xfree(v->arch.pv_vcpu.trap_ctxt);
@@ -152,9 +145,6 @@ int pv_vcpu_initialise(struct vcpu *v)
if ( is_pv_32bit_domain(d) )
{
- if ( (rc = setup_compat_arg_xlat(v)) )
- goto done;
-
if ( (rc = setup_compat_l4(v)) )
goto done;
}
diff --git a/xen/arch/x86/x86_64/mm.c b/xen/arch/x86/x86_64/mm.c
index 68eee30..aae721b 100644
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -697,19 +697,6 @@ void __init zap_low_mappings(void)
__PAGE_HYPERVISOR);
}
-int setup_compat_arg_xlat(struct vcpu *v)
-{
- return create_perdomain_mapping(v->domain, ARG_XLAT_START(v),
- PFN_UP(COMPAT_ARG_XLAT_SIZE),
- NULL, NIL(struct page_info *));
-}
-
-void free_compat_arg_xlat(struct vcpu *v)
-{
- destroy_perdomain_mapping(v->domain, ARG_XLAT_START(v),
- PFN_UP(COMPAT_ARG_XLAT_SIZE));
-}
-
static void cleanup_frame_table(struct mem_hotadd_info *info)
{
unsigned long sva, eva;
diff --git a/xen/include/asm-x86/config.h b/xen/include/asm-x86/config.h
index 3d64047..c7503ad 100644
--- a/xen/include/asm-x86/config.h
+++ b/xen/include/asm-x86/config.h
@@ -316,13 +316,6 @@ extern unsigned long xen_phys_start;
#define LDT_VIRT_START(v) \
(GDT_VIRT_START(v) + (64*1024))
-/* Argument translation area. The third per-domain-mapping sub-area. */
-#define ARG_XLAT_VIRT_START PERDOMAIN_VIRT_SLOT(2)
-/* Allow for at least one guard page (COMPAT_ARG_XLAT_SIZE being 2 pages): */
-#define ARG_XLAT_VA_SHIFT (2 + PAGE_SHIFT)
-#define ARG_XLAT_START(v) \
- (ARG_XLAT_VIRT_START + ((v)->vcpu_id << ARG_XLAT_VA_SHIFT))
-
#define NATIVE_VM_ASSIST_VALID ((1UL << VMASST_TYPE_4gb_segments) | \
(1UL << VMASST_TYPE_4gb_segments_notify) | \
(1UL << VMASST_TYPE_writable_pagetables) | \
diff --git a/xen/include/asm-x86/x86_64/uaccess.h b/xen/include/asm-x86/x86_64/uaccess.h
index d7dad4f..ce88dce 100644
--- a/xen/include/asm-x86/x86_64/uaccess.h
+++ b/xen/include/asm-x86/x86_64/uaccess.h
@@ -1,11 +1,9 @@
#ifndef __X86_64_UACCESS_H
#define __X86_64_UACCESS_H
-#define COMPAT_ARG_XLAT_VIRT_BASE ((void *)ARG_XLAT_START(current))
+#define COMPAT_ARG_XLAT_VIRT_BASE ((void *)PERCPU_XLAT_START)
#define COMPAT_ARG_XLAT_SIZE (2*PAGE_SIZE)
-struct vcpu;
-int setup_compat_arg_xlat(struct vcpu *v);
-void free_compat_arg_xlat(struct vcpu *v);
+
#define is_compat_arg_xlat_range(addr, size) ({ \
unsigned long __off; \
__off = (unsigned long)(addr) - (unsigned long)COMPAT_ARG_XLAT_VIRT_BASE; \
--
2.1.4
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
^ permalink raw reply related [flat|nested] 61+ messages in thread
* [PATCH RFC 29/44] x86/smp: Allocate percpu resources for the GDT and LDT
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (27 preceding siblings ...)
2018-01-04 20:21 ` [PATCH RFC 28/44] x86/xlat: Use the percpu " Andrew Cooper
@ 2018-01-04 20:21 ` Andrew Cooper
2018-01-04 20:21 ` [PATCH RFC 30/44] x86/pv: Break handle_ldt_mapping_fault() out of handle_gdt_ldt_mapping_fault() Andrew Cooper
` (15 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
Like the mapcache region, we need an L1e which is modifiable in the context
switch code.
The Xen-reserved GDT frames are proactively mapped for the benefit of future
changes to the AP boot path.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
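As a sketch of how the new mapping is intended to be used (the real context switch changes come later in the series; the helper below is illustrative only):
/* Sketch: install 'pg' as frame 'i' of this CPU's GDT mapping, via the
 * L1 table mapped at PERCPU_GDT_LDT_L1ES. */
static void set_gdt_frame(unsigned int i, struct page_info *pg)
{
    l1_pgentry_t *l1t = (l1_pgentry_t *)PERCPU_GDT_LDT_L1ES;
    unsigned long linear = PERCPU_GDT_MAPPING + i * PAGE_SIZE;

    l1t[l1_table_offset(linear)] = l1e_from_page(pg, __PAGE_HYPERVISOR_RW);
    asm volatile ( "invlpg %0" :: "m" (*(char *)linear) : "memory" );
}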
---
xen/arch/x86/smpboot.c | 21 +++++++++++++++++++++
xen/include/asm-x86/config.h | 4 ++++
2 files changed, 25 insertions(+)
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index a5d3f7a..cc80f24 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -822,6 +822,27 @@ static int cpu_smpboot_alloc_common(unsigned int cpu)
if ( rc )
goto out;
+ /* Allocate space for the GDT/LDT L1e's... */
+ rc = percpu_alloc_l1t(cpu, PERCPU_GDT_MAPPING, &pg);
+ if ( rc )
+ goto out;
+
+ /* ... and map the L1t so it can be used... */
+ rc = percpu_map_frame(cpu, PERCPU_GDT_LDT_L1ES, pg, PAGE_HYPERVISOR_RW);
+ if ( rc )
+ goto out;
+
+ /* ... and map Xen-reserved GDT frames. */
+ rc = percpu_map_frame(cpu, PERCPU_GDT_MAPPING + FIRST_RESERVED_GDT_BYTE,
+ virt_to_page(per_cpu(gdt_table, cpu)),
+ PAGE_HYPERVISOR_RW);
+ if ( rc )
+ goto out;
+ rc = percpu_map_frame(cpu, PERCPU_GDT_MAPPING + FIRST_RESERVED_GDT_BYTE + PAGE_SIZE,
+ virt_to_page(zero_page), __PAGE_HYPERVISOR_RO);
+ if ( rc )
+ goto out;
+
rc = 0; /* Success */
out:
diff --git a/xen/include/asm-x86/config.h b/xen/include/asm-x86/config.h
index c7503ad..dfe1f03 100644
--- a/xen/include/asm-x86/config.h
+++ b/xen/include/asm-x86/config.h
@@ -302,6 +302,10 @@ extern unsigned long xen_phys_start;
#define PERCPU_XLAT_START (PERCPU_LINEAR_START + MB(6) + KB(8))
+#define PERCPU_GDT_LDT_L1ES (PERCPU_LINEAR_START + MB(8) + KB(12))
+#define PERCPU_GDT_MAPPING (PERCPU_LINEAR_START + MB(10))
+#define PERCPU_LDT_MAPPING (PERCPU_LINEAR_START + MB(11))
+
/* GDT/LDT shadow mapping area. The first per-domain-mapping sub-area. */
#define GDT_LDT_VCPU_SHIFT 5
#define GDT_LDT_VCPU_VA_SHIFT (GDT_LDT_VCPU_SHIFT + PAGE_SHIFT)
--
2.1.4
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
^ permalink raw reply related [flat|nested] 61+ messages in thread
* [PATCH RFC 30/44] x86/pv: Break handle_ldt_mapping_fault() out of handle_gdt_ldt_mapping_fault()
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (28 preceding siblings ...)
2018-01-04 20:21 ` [PATCH RFC 29/44] x86/smp: Allocate percpu resources for the GDT and LDT Andrew Cooper
@ 2018-01-04 20:21 ` Andrew Cooper
2018-01-04 20:21 ` [PATCH RFC 31/44] x86/pv: Drop support for paging out the LDT Andrew Cooper
` (14 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
Future changes will alter the conditions under which we expect to take faults.
One adjustment however is to exclude the use of this fixup path for non-PV
guests. Well-formed code shouldn't reference the LDT while in HVM vcpu
context, but currently on a context switch from PV to HVM context, there may
be a stale LDT selector loaded, over an unmapped region.
By explicitly excluding HVM context at this point, we avoid falling into
pv_map_ldt_shadow_page(), where erroneous hypervisor execution would escalate
into a cascade failure.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
v2:
* Correct the sense of the HVM context check
* Reduce offset to unsigned int. It will be at maximum 0xfffc
---
xen/arch/x86/traps.c | 79 ++++++++++++++++++++++++++++++----------------------
1 file changed, 46 insertions(+), 33 deletions(-)
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index ef9464b..2f1540e 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1094,6 +1094,48 @@ static void reserved_bit_page_fault(unsigned long addr,
show_execution_state(regs);
}
+static int handle_ldt_mapping_fault(unsigned int offset,
+ struct cpu_user_regs *regs)
+{
+ struct vcpu *curr = current;
+
+ /*
+ * Not in PV context? Something is very broken. Leave it to the #PF
+ * handler, which will probably result in a panic().
+ */
+ if ( !is_pv_vcpu(curr) )
+ return 0;
+
+ /* Try to copy a mapping from the guest's LDT, if it is valid. */
+ if ( likely(pv_map_ldt_shadow_page(offset)) )
+ {
+ if ( guest_mode(regs) )
+ trace_trap_two_addr(TRC_PV_GDT_LDT_MAPPING_FAULT,
+ regs->rip, offset);
+ }
+ else
+ {
+ /* In hypervisor mode? Leave it to the #PF handler to fix up. */
+ if ( !guest_mode(regs) )
+ return 0;
+
+ /* Access would have become non-canonical? Pass #GP[sel] back. */
+ if ( unlikely(!is_canonical_address(
+ curr->arch.pv_vcpu.ldt_base + offset)) )
+ {
+ uint16_t ec = (offset & ~(X86_XEC_EXT | X86_XEC_IDT)) | X86_XEC_TI;
+
+ pv_inject_hw_exception(TRAP_gp_fault, ec);
+ }
+ else
+ /* else pass the #PF back, with adjusted %cr2. */
+ pv_inject_page_fault(regs->error_code,
+ curr->arch.pv_vcpu.ldt_base + offset);
+ }
+
+ return EXCRET_fault_fixed;
+}
+
static int handle_gdt_ldt_mapping_fault(unsigned long offset,
struct cpu_user_regs *regs)
{
@@ -1115,40 +1157,11 @@ static int handle_gdt_ldt_mapping_fault(unsigned long offset,
offset &= (1UL << (GDT_LDT_VCPU_VA_SHIFT-1)) - 1UL;
if ( likely(is_ldt_area) )
- {
- /* LDT fault: Copy a mapping from the guest's LDT, if it is valid. */
- if ( likely(pv_map_ldt_shadow_page(offset)) )
- {
- if ( guest_mode(regs) )
- trace_trap_two_addr(TRC_PV_GDT_LDT_MAPPING_FAULT,
- regs->rip, offset);
- }
- else
- {
- /* In hypervisor mode? Leave it to the #PF handler to fix up. */
- if ( !guest_mode(regs) )
- return 0;
+ return handle_ldt_mapping_fault(offset, regs);
- /* Access would have become non-canonical? Pass #GP[sel] back. */
- if ( unlikely(!is_canonical_address(
- curr->arch.pv_vcpu.ldt_base + offset)) )
- {
- uint16_t ec = (offset & ~(X86_XEC_EXT | X86_XEC_IDT)) | X86_XEC_TI;
-
- pv_inject_hw_exception(TRAP_gp_fault, ec);
- }
- else
- /* else pass the #PF back, with adjusted %cr2. */
- pv_inject_page_fault(regs->error_code,
- curr->arch.pv_vcpu.ldt_base + offset);
- }
- }
- else
- {
- /* GDT fault: handle the fault as #GP(selector). */
- regs->error_code = offset & ~(X86_XEC_EXT | X86_XEC_IDT | X86_XEC_TI);
- (void)do_general_protection(regs);
- }
+ /* GDT fault: handle the fault as #GP(selector). */
+ regs->error_code = offset & ~(X86_XEC_EXT | X86_XEC_IDT | X86_XEC_TI);
+ (void)do_general_protection(regs);
return EXCRET_fault_fixed;
}
--
2.1.4
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
^ permalink raw reply related [flat|nested] 61+ messages in thread
* [PATCH RFC 31/44] x86/pv: Drop support for paging out the LDT
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (29 preceding siblings ...)
2018-01-04 20:21 ` [PATCH RFC 30/44] x86/pv: Break handle_ldt_mapping_fault() out of handle_gdt_ldt_mapping_fault() Andrew Cooper
@ 2018-01-04 20:21 ` Andrew Cooper
2018-01-24 11:04 ` Jan Beulich
2018-01-04 20:21 ` [PATCH RFC 32/44] x86: Always reload the LDT on vcpu context switch Andrew Cooper
` (13 subsequent siblings)
44 siblings, 1 reply; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
Windows is the only OS which pages out kernel data structures, so chances are
good that this is a vestigial remnant of the PV Windows XP experiment.
Furthermore the implementation is incomplete; it only functions for a present
=> not-present transition, rather than a present => read/write transition.
The for_each_vcpu() is one scalability limitation for PV guests, and can't
reasonably be altered to be continuable. Most importantly, however, this is
the only codepath which plays with the descriptor frames of a remote vcpu.
A side effect of dropping support for paging the LDT out is that the LDT no
longer automatically cleans itself up on domain destruction. Cover this by
explicitly releasing the LDT frames at the same time as the GDT frames.
Finally, leave some asserts around to confirm the expected behaviour of all
the functions playing with PGT_seg_desc_page references.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
v2:
* Split moving and renaming out to an earlier patch
* Drop shadow_ldt_{lock,mapcnt}.
---
xen/arch/x86/domain.c | 7 ++-----
xen/arch/x86/mm.c | 17 -----------------
xen/arch/x86/pv/descriptor-tables.c | 20 ++++++--------------
xen/arch/x86/pv/domain.c | 2 --
xen/arch/x86/pv/mm.c | 3 ---
xen/include/asm-x86/domain.h | 4 ----
6 files changed, 8 insertions(+), 45 deletions(-)
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 3d9e7fb..ce5337b 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1936,11 +1936,8 @@ int domain_relinquish_resources(struct domain *d)
{
for_each_vcpu ( d, v )
{
- /*
- * Relinquish GDT mappings. No need for explicit unmapping of
- * the LDT as it automatically gets squashed with the guest
- * mappings.
- */
+ /* Relinquish GDT/LDT mappings. */
+ pv_destroy_ldt(v);
pv_destroy_gdt(v);
}
}
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index f8f15e9..8b925b3 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1204,7 +1204,6 @@ void put_page_from_l1e(l1_pgentry_t l1e, struct domain *l1e_owner)
unsigned long pfn = l1e_get_pfn(l1e);
struct page_info *page;
struct domain *pg_owner;
- struct vcpu *v;
if ( !(l1e_get_flags(l1e) & _PAGE_PRESENT) || is_iomem_page(_mfn(pfn)) )
return;
@@ -1240,25 +1239,9 @@ void put_page_from_l1e(l1_pgentry_t l1e, struct domain *l1e_owner)
*/
if ( (l1e_get_flags(l1e) & _PAGE_RW) &&
((l1e_owner == pg_owner) || !paging_mode_external(pg_owner)) )
- {
put_page_and_type(page);
- }
else
- {
- /* We expect this is rare so we blow the entire shadow LDT. */
- if ( unlikely(((page->u.inuse.type_info & PGT_type_mask) ==
- PGT_seg_desc_page)) &&
- unlikely(((page->u.inuse.type_info & PGT_count_mask) != 0)) &&
- (l1e_owner == pg_owner) )
- {
- for_each_vcpu ( pg_owner, v )
- {
- if ( pv_destroy_ldt(v) )
- flush_tlb_mask(v->vcpu_dirty_cpumask);
- }
- }
put_page(page);
- }
}
diff --git a/xen/arch/x86/pv/descriptor-tables.c b/xen/arch/x86/pv/descriptor-tables.c
index b418bbb..77f9851 100644
--- a/xen/arch/x86/pv/descriptor-tables.c
+++ b/xen/arch/x86/pv/descriptor-tables.c
@@ -37,18 +37,12 @@
*/
bool pv_destroy_ldt(struct vcpu *v)
{
- l1_pgentry_t *pl1e;
+ l1_pgentry_t *pl1e = pv_ldt_ptes(v);
unsigned int i, mappings_dropped = 0;
struct page_info *page;
ASSERT(!in_irq());
-
- spin_lock(&v->arch.pv_vcpu.shadow_ldt_lock);
-
- if ( v->arch.pv_vcpu.shadow_ldt_mapcnt == 0 )
- goto out;
-
- pl1e = pv_ldt_ptes(v);
+ ASSERT(v == current || cpumask_empty(v->vcpu_dirty_cpumask));
for ( i = 0; i < 16; i++ )
{
@@ -64,12 +58,6 @@ bool pv_destroy_ldt(struct vcpu *v)
put_page_and_type(page);
}
- ASSERT(v->arch.pv_vcpu.shadow_ldt_mapcnt == mappings_dropped);
- v->arch.pv_vcpu.shadow_ldt_mapcnt = 0;
-
- out:
- spin_unlock(&v->arch.pv_vcpu.shadow_ldt_lock);
-
return mappings_dropped;
}
@@ -80,6 +68,8 @@ void pv_destroy_gdt(struct vcpu *v)
l1_pgentry_t zero_l1e = l1e_from_mfn(zero_mfn, __PAGE_HYPERVISOR_RO);
unsigned int i;
+ ASSERT(v == current || cpumask_empty(v->vcpu_dirty_cpumask));
+
v->arch.pv_vcpu.gdt_ents = 0;
for ( i = 0; i < FIRST_RESERVED_GDT_PAGE; i++ )
{
@@ -100,6 +90,8 @@ long pv_set_gdt(struct vcpu *v, unsigned long *frames, unsigned int entries)
l1_pgentry_t *pl1e;
unsigned int i, nr_frames = DIV_ROUND_UP(entries, 512);
+ ASSERT(v == current || cpumask_empty(v->vcpu_dirty_cpumask));
+
if ( entries > FIRST_RESERVED_GDT_ENTRY )
return -EINVAL;
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 4e88bfd..60a88bd 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -122,8 +122,6 @@ int pv_vcpu_initialise(struct vcpu *v)
ASSERT(!is_idle_domain(d));
- spin_lock_init(&v->arch.pv_vcpu.shadow_ldt_lock);
-
rc = pv_create_gdt_ldt_l1tab(v);
if ( rc )
return rc;
diff --git a/xen/arch/x86/pv/mm.c b/xen/arch/x86/pv/mm.c
index 8d7a4fd..d293724 100644
--- a/xen/arch/x86/pv/mm.c
+++ b/xen/arch/x86/pv/mm.c
@@ -125,10 +125,7 @@ bool pv_map_ldt_shadow_page(unsigned int offset)
pl1e = &pv_ldt_ptes(curr)[offset >> PAGE_SHIFT];
l1e_add_flags(gl1e, _PAGE_RW);
- spin_lock(&curr->arch.pv_vcpu.shadow_ldt_lock);
l1e_write(pl1e, gl1e);
- curr->arch.pv_vcpu.shadow_ldt_mapcnt++;
- spin_unlock(&curr->arch.pv_vcpu.shadow_ldt_lock);
return true;
}
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index fa57c93..be0f61c 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -447,10 +447,6 @@ struct pv_vcpu
unsigned int iopl; /* Current IOPL for this VCPU, shifted left by
* 12 to match the eflags register. */
- /* Current LDT details. */
- unsigned long shadow_ldt_mapcnt;
- spinlock_t shadow_ldt_lock;
-
/* data breakpoint extension MSRs */
uint32_t dr_mask[4];
--
2.1.4
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
^ permalink raw reply related [flat|nested] 61+ messages in thread
* [PATCH RFC 32/44] x86: Always reload the LDT on vcpu context switch
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (30 preceding siblings ...)
2018-01-04 20:21 ` [PATCH RFC 31/44] x86/pv: Drop support for paging out the LDT Andrew Cooper
@ 2018-01-04 20:21 ` Andrew Cooper
2018-01-04 20:21 ` [PATCH RFC 33/44] x86/smp: Use the percpu GDT/LDT mappings Andrew Cooper
` (12 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
... and always zero the LDT for HVM contexts. This causes erroneous execution
which manages to reference the LDT to fail with a straight #GP fault, rather
than possibly finding a stale loaded LDT and wandering into the #PF handler.
Future changes will cause the loading of LDT to be lazy, at which point
load_LDT() will be a nop for all cases other than context switching to/from a
PV vcpu with an LDT loaded.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
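A sketch of the lazy form alluded to above (the per-cpu tracking variable and helper are hypothetical, not introduced by this patch):
/* Sketch: skip lldt entirely when neither the outgoing nor the incoming
 * context uses an LDT - the common case. */
static DEFINE_PER_CPU(unsigned int, curr_ldt_ents);

static void lazy_load_LDT(struct vcpu *v)
{
    unsigned int ents = is_pv_vcpu(v) ? v->arch.pv_vcpu.ldt_ents : 0;

    if ( ents == 0 && this_cpu(curr_ldt_ents) == 0 )
        return;

    load_LDT(v);
    this_cpu(curr_ldt_ents) = ents;
}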
---
xen/arch/x86/domain.c | 5 ++---
xen/include/asm-x86/ldt.h | 4 ++--
2 files changed, 4 insertions(+), 5 deletions(-)
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index ce5337b..4671c9b 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1660,6 +1660,8 @@ static void __context_switch(void)
lgdt(&gdt_desc);
}
+ load_LDT(n);
+
if ( pd != nd )
cpumask_clear_cpu(cpu, pd->domain_dirty_cpumask);
cpumask_clear_cpu(cpu, p->vcpu_dirty_cpumask);
@@ -1723,10 +1725,7 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
local_irq_enable();
if ( is_pv_domain(nextd) )
- {
- load_LDT(next);
load_segments(next);
- }
ctxt_switch_levelling(next);
}
diff --git a/xen/include/asm-x86/ldt.h b/xen/include/asm-x86/ldt.h
index 589daf8..6fbce93 100644
--- a/xen/include/asm-x86/ldt.h
+++ b/xen/include/asm-x86/ldt.h
@@ -7,9 +7,9 @@
static inline void load_LDT(struct vcpu *v)
{
struct desc_struct *desc;
- unsigned long ents;
+ unsigned int ents = is_pv_vcpu(v) ? v->arch.pv_vcpu.ldt_ents : 0;
- if ( (ents = v->arch.pv_vcpu.ldt_ents) == 0 )
+ if ( ents == 0 )
lldt(0);
else
{
--
2.1.4
* [PATCH RFC 33/44] x86/smp: Use the percpu GDT/LDT mappings
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (31 preceding siblings ...)
2018-01-04 20:21 ` [PATCH RFC 32/44] x86: Always reload the LDT on vcpu context switch Andrew Cooper
@ 2018-01-04 20:21 ` Andrew Cooper
2018-01-04 20:21 ` [PATCH RFC 34/44] x86: Drop the PERDOMAIN mappings Andrew Cooper
` (11 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
This is unfortunately quite invasive, because of the impact on the context
switch path.
PV vcpus gain arrays of LDT and GDT ptes (replacing gdt_frames[]), which map
the frames loaded by HYPERVISOR_set_gdt, or faulted in for the LDT. Each
present PTE here which isn't a read-only mapping of zero_page holds a type
reference.
When context switching to a vcpu which needs a full GDT or LDT, the ptes are
copied in from the above arrays, while when context switching away from a vcpu
which needs a full GDT or LDT, the ptes are reset back to their default values.
As a side effect, the GDT/LDT create/destroy functions need to adjust the
percpu mappings if the vcpu they operate on is currently in context.
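As a simplified sketch of that flow (illustration only; the real hunks below
also refresh the reserved GDT page in both branches):

  unsigned int i;

  if ( need_full_gdt(nd) )          /* Incoming vcpu brings its own GDT. */
      memcpy(pv_gdt_ptes, n->arch.pv_vcpu.gdt_l1es,
             sizeof(n->arch.pv_vcpu.gdt_l1es));
  else if ( need_full_gdt(pd) )     /* Outgoing vcpu had one: reset to zero_page. */
      for ( i = 0; i < FIRST_RESERVED_GDT_PAGE; ++i )
          pv_gdt_ptes[i] = l1e_from_paddr(__pa(zero_page), __PAGE_HYPERVISOR_RO);

  write_ptbase(n);                  /* The TLB flush makes the new ptes live. */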
Overall, the GDT and LDT base addresses are now always the same, which depends
on the context switch logic happening before write_ptbase(), so that the TLB
flush removes any stale mappings. While altering load_LDT()'s behaviour to
cope, introduce lazy loading to avoid executing lldt in the common case.
The arch_{get,set}_info_guest() functions need adjusting to cope with the fact
that they will now find references to zero_page in the ptes, which need
skipping.
Loading of GDTR now happens once at boot, in early_switch_to_idle(). As the
base address is now constant and always mapped, we can remove lgdt() calls
from the context switch path and EFI Runtime Services path. Finally,
HOST_GDTR_BASE in the VMCS needs to be adjusted, and moves into
construct_vmcs().
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
v2:
* Fix pv_destroy_ldt() to release the correct references when there were
outstanding LDT frames at domain destruction.
---
xen/arch/x86/cpu/common.c | 7 ----
xen/arch/x86/domain.c | 72 +++++++++++++++++++++----------------
xen/arch/x86/domctl.c | 13 +++++--
xen/arch/x86/hvm/vmx/vmcs.c | 4 +--
xen/arch/x86/pv/descriptor-tables.c | 30 ++++++++++------
xen/arch/x86/pv/emulate.h | 4 +--
xen/arch/x86/pv/mm.c | 3 +-
xen/arch/x86/setup.c | 10 ++++--
xen/arch/x86/traps.c | 36 ++-----------------
xen/common/efi/runtime.c | 20 -----------
xen/include/asm-x86/config.h | 2 ++
xen/include/asm-x86/domain.h | 17 +++++----
xen/include/asm-x86/ldt.h | 15 +++++---
13 files changed, 110 insertions(+), 123 deletions(-)
diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index 14743b6..decdcd5 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -653,11 +653,6 @@ void load_system_tables(void)
struct desc_struct *compat_gdt =
this_cpu(compat_gdt_table) - FIRST_RESERVED_GDT_ENTRY;
- const struct desc_ptr gdtr = {
- .base = (unsigned long)gdt,
- .limit = LAST_RESERVED_GDT_BYTE,
- };
-
*tss = (struct tss_struct){
/* Main stack for interrupts/exceptions. */
.rsp0 = stack_bottom,
@@ -693,9 +688,7 @@ void load_system_tables(void)
offsetof(struct tss_struct, __cacheline_filler) - 1,
SYS_DESC_tss_busy);
- lgdt(&gdtr);
ltr(TSS_ENTRY << 3);
- lldt(0);
/*
* Bottom-of-stack must be 16-byte aligned!
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 4671c9b..2d665c6 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -67,6 +67,7 @@
#include <asm/pv/mm.h>
DEFINE_PER_CPU(struct vcpu *, curr_vcpu);
+DEFINE_PER_CPU(unsigned int, ldt_ents);
static void default_idle(void);
void (*pm_idle) (void) __read_mostly = default_idle;
@@ -917,8 +918,13 @@ int arch_set_info_guest(
fail = compat_pfn_to_cr3(pfn) != c.cmp->ctrlreg[3];
}
- for ( i = 0; i < ARRAY_SIZE(v->arch.pv_vcpu.gdt_frames); ++i )
- fail |= v->arch.pv_vcpu.gdt_frames[i] != c(gdt_frames[i]);
+ for ( i = 0; i < MAX_PV_GDT_FRAMES; ++i )
+ {
+ paddr_t addr = pfn_to_paddr(c(gdt_frames[i])) ?: __pa(zero_page);
+
+ fail |= l1e_get_paddr(v->arch.pv_vcpu.gdt_l1es[i]) != addr;
+ }
+
fail |= v->arch.pv_vcpu.gdt_ents != c(gdt_ents);
fail |= v->arch.pv_vcpu.ldt_base != c(ldt_base);
@@ -1015,10 +1021,10 @@ int arch_set_info_guest(
rc = (int)pv_set_gdt(v, c.nat->gdt_frames, c.nat->gdt_ents);
else
{
- unsigned long gdt_frames[ARRAY_SIZE(v->arch.pv_vcpu.gdt_frames)];
+ unsigned long gdt_frames[MAX_PV_GDT_FRAMES];
unsigned int nr_frames = DIV_ROUND_UP(c.cmp->gdt_ents, 512);
- if ( nr_frames > ARRAY_SIZE(v->arch.pv_vcpu.gdt_frames) )
+ if ( nr_frames > MAX_PV_GDT_FRAMES )
return -EINVAL;
for ( i = 0; i < nr_frames; ++i )
@@ -1579,15 +1585,18 @@ static inline bool need_full_gdt(const struct domain *d)
return is_pv_domain(d) && !is_idle_domain(d);
}
+static bool needs_ldt(const struct vcpu *v)
+{
+ return is_pv_vcpu(v) && v->arch.pv_vcpu.ldt_ents;
+}
+
static void __context_switch(void)
{
struct cpu_user_regs *stack_regs = guest_cpu_user_regs();
- unsigned int cpu = smp_processor_id();
+ unsigned int cpu = smp_processor_id(), i;
struct vcpu *p = per_cpu(curr_vcpu, cpu);
struct vcpu *n = current;
struct domain *pd = p->domain, *nd = n->domain;
- struct desc_struct *gdt;
- struct desc_ptr gdt_desc;
ASSERT(p != n);
ASSERT(cpumask_empty(n->vcpu_dirty_cpumask));
@@ -1627,38 +1636,41 @@ static void __context_switch(void)
psr_ctxt_switch_to(nd);
- gdt = !is_pv_32bit_domain(nd) ? per_cpu(gdt_table, cpu) :
- per_cpu(compat_gdt_table, cpu);
+ /* Load a full new GDT if the new vcpu needs one. */
if ( need_full_gdt(nd) )
{
- unsigned long mfn = virt_to_mfn(gdt);
- l1_pgentry_t *pl1e = pv_gdt_ptes(n);
- unsigned int i;
+ memcpy(pv_gdt_ptes, n->arch.pv_vcpu.gdt_l1es,
+ sizeof(n->arch.pv_vcpu.gdt_l1es));
- for ( i = 0; i < NR_RESERVED_GDT_PAGES; i++ )
- l1e_write(pl1e + FIRST_RESERVED_GDT_PAGE + i,
- l1e_from_pfn(mfn + i, __PAGE_HYPERVISOR_RW));
+ l1e_write(&pv_gdt_ptes[FIRST_RESERVED_GDT_PAGE],
+ l1e_from_pfn(virt_to_mfn(!is_pv_32bit_domain(nd)
+ ? per_cpu(gdt_table, cpu)
+ : per_cpu(compat_gdt_table, cpu)),
+ __PAGE_HYPERVISOR_RW));
}
-
- if ( need_full_gdt(pd) &&
- ((p->vcpu_id != n->vcpu_id) || !need_full_gdt(nd)) )
+ /* or clobber a previous full GDT. */
+ else if ( need_full_gdt(pd) )
{
- gdt_desc.limit = LAST_RESERVED_GDT_BYTE;
- gdt_desc.base = (unsigned long)(gdt - FIRST_RESERVED_GDT_ENTRY);
+ l1_pgentry_t zero_l1e = l1e_from_paddr(__pa(zero_page),
+ __PAGE_HYPERVISOR_RO);
+
+ for ( i = 0; i < FIRST_RESERVED_GDT_PAGE; ++i )
+ pv_gdt_ptes[i] = zero_l1e;
- lgdt(&gdt_desc);
+ l1e_write(&pv_gdt_ptes[FIRST_RESERVED_GDT_PAGE],
+ l1e_from_pfn(virt_to_mfn(per_cpu(gdt_table, cpu)),
+ __PAGE_HYPERVISOR_RW));
}
- write_ptbase(n);
+ /* Load the LDT frames if needed. */
+ if ( needs_ldt(n) )
+ memcpy(pv_ldt_ptes, n->arch.pv_vcpu.ldt_l1es,
+ sizeof(n->arch.pv_vcpu.ldt_l1es));
+ /* or clobber the previous LDT. */
+ else if ( needs_ldt(p) )
+ memset(pv_ldt_ptes, 0, sizeof(n->arch.pv_vcpu.ldt_l1es));
- if ( need_full_gdt(nd) &&
- ((p->vcpu_id != n->vcpu_id) || !need_full_gdt(pd)) )
- {
- gdt_desc.limit = LAST_RESERVED_GDT_BYTE;
- gdt_desc.base = GDT_VIRT_START(n);
-
- lgdt(&gdt_desc);
- }
+ write_ptbase(n);
load_LDT(n);
diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 36ab235..28c7b04 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1642,8 +1642,17 @@ void arch_get_info_guest(struct vcpu *v, vcpu_guest_context_u c)
{
c(ldt_base = v->arch.pv_vcpu.ldt_base);
c(ldt_ents = v->arch.pv_vcpu.ldt_ents);
- for ( i = 0; i < ARRAY_SIZE(v->arch.pv_vcpu.gdt_frames); ++i )
- c(gdt_frames[i] = v->arch.pv_vcpu.gdt_frames[i]);
+
+ for ( i = 0; i < MAX_PV_GDT_FRAMES; ++i )
+ {
+ paddr_t addr = l1e_get_paddr(v->arch.pv_vcpu.gdt_l1es[i]);
+
+ if ( addr == __pa(zero_page) )
+ break;
+
+ c(gdt_frames[i] = paddr_to_pfn(addr));
+ }
+
BUILD_BUG_ON(ARRAY_SIZE(c.nat->gdt_frames) !=
ARRAY_SIZE(c.cmp->gdt_frames));
for ( ; i < ARRAY_SIZE(c.nat->gdt_frames); ++i )
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index f99f1bb..795210f 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -802,9 +802,6 @@ static void vmx_set_host_env(struct vcpu *v)
{
unsigned int cpu = smp_processor_id();
- __vmwrite(HOST_GDTR_BASE,
- (unsigned long)(this_cpu(gdt_table) - FIRST_RESERVED_GDT_ENTRY));
-
__vmwrite(HOST_TR_BASE, (unsigned long)&per_cpu(init_tss, cpu));
__vmwrite(HOST_SYSENTER_ESP, get_stack_bottom());
@@ -1134,6 +1131,7 @@ static int construct_vmcs(struct vcpu *v)
/* Host system tables. */
__vmwrite(HOST_IDTR_BASE, PERCPU_IDT_MAPPING);
+ __vmwrite(HOST_GDTR_BASE, PERCPU_GDT_MAPPING);
/* Host data selectors. */
__vmwrite(HOST_SS_SELECTOR, __HYPERVISOR_DS);
diff --git a/xen/arch/x86/pv/descriptor-tables.c b/xen/arch/x86/pv/descriptor-tables.c
index 77f9851..6b0bfbf 100644
--- a/xen/arch/x86/pv/descriptor-tables.c
+++ b/xen/arch/x86/pv/descriptor-tables.c
@@ -37,7 +37,7 @@
*/
bool pv_destroy_ldt(struct vcpu *v)
{
- l1_pgentry_t *pl1e = pv_ldt_ptes(v);
+ l1_pgentry_t *pl1e = v->arch.pv_vcpu.ldt_l1es;
unsigned int i, mappings_dropped = 0;
struct page_info *page;
@@ -58,12 +58,22 @@ bool pv_destroy_ldt(struct vcpu *v)
put_page_and_type(page);
}
+ /* Clobber the live LDT. */
+ if ( v == current )
+ {
+ if ( mappings_dropped )
+ memset(pv_ldt_ptes, 0, sizeof(v->arch.pv_vcpu.ldt_l1es));
+ else
+ ASSERT(memcmp(pv_ldt_ptes, v->arch.pv_vcpu.ldt_l1es,
+ sizeof(v->arch.pv_vcpu.ldt_l1es)) == 0);
+ }
+
return mappings_dropped;
}
void pv_destroy_gdt(struct vcpu *v)
{
- l1_pgentry_t *pl1e = pv_gdt_ptes(v);
+ l1_pgentry_t *pl1e = v->arch.pv_vcpu.gdt_l1es;
mfn_t zero_mfn = _mfn(virt_to_mfn(zero_page));
l1_pgentry_t zero_l1e = l1e_from_mfn(zero_mfn, __PAGE_HYPERVISOR_RO);
unsigned int i;
@@ -79,15 +89,13 @@ void pv_destroy_gdt(struct vcpu *v)
!mfn_eq(mfn, zero_mfn) )
put_page_and_type(mfn_to_page(mfn));
- l1e_write(&pl1e[i], zero_l1e);
- v->arch.pv_vcpu.gdt_frames[i] = 0;
+ pl1e[i] = zero_l1e;
}
}
long pv_set_gdt(struct vcpu *v, unsigned long *frames, unsigned int entries)
{
struct domain *d = v->domain;
- l1_pgentry_t *pl1e;
unsigned int i, nr_frames = DIV_ROUND_UP(entries, 512);
ASSERT(v == current || cpumask_empty(v->vcpu_dirty_cpumask));
@@ -116,12 +124,14 @@ long pv_set_gdt(struct vcpu *v, unsigned long *frames, unsigned int entries)
/* Install the new GDT. */
v->arch.pv_vcpu.gdt_ents = entries;
- pl1e = pv_gdt_ptes(v);
for ( i = 0; i < nr_frames; i++ )
- {
- v->arch.pv_vcpu.gdt_frames[i] = frames[i];
- l1e_write(&pl1e[i], l1e_from_pfn(frames[i], __PAGE_HYPERVISOR_RW));
- }
+ v->arch.pv_vcpu.gdt_l1es[i] =
+ l1e_from_pfn(frames[i], __PAGE_HYPERVISOR_RW);
+
+ /* Update the live GDT if in context. */
+ if ( v == current )
+ memcpy(pv_gdt_ptes, v->arch.pv_vcpu.gdt_l1es,
+ sizeof(v->arch.pv_vcpu.gdt_l1es));
return 0;
diff --git a/xen/arch/x86/pv/emulate.h b/xen/arch/x86/pv/emulate.h
index 9d58794..80530e7 100644
--- a/xen/arch/x86/pv/emulate.h
+++ b/xen/arch/x86/pv/emulate.h
@@ -20,9 +20,7 @@ static inline int pv_emul_is_mem_write(const struct x86_emulate_state *state,
/* Return a pointer to the GDT/LDT descriptor referenced by sel. */
static inline const struct desc_struct *gdt_ldt_desc_ptr(unsigned int sel)
{
- const struct vcpu *curr = current;
- const struct desc_struct *tbl = (void *)
- ((sel & X86_XEC_TI) ? LDT_VIRT_START(curr) : GDT_VIRT_START(curr));
+ const struct desc_struct *tbl = (sel & X86_XEC_TI) ? pv_ldt : pv_gdt;
return &tbl[sel >> 3];
}
diff --git a/xen/arch/x86/pv/mm.c b/xen/arch/x86/pv/mm.c
index d293724..6da9990 100644
--- a/xen/arch/x86/pv/mm.c
+++ b/xen/arch/x86/pv/mm.c
@@ -122,10 +122,11 @@ bool pv_map_ldt_shadow_page(unsigned int offset)
return false;
}
- pl1e = &pv_ldt_ptes(curr)[offset >> PAGE_SHIFT];
+ pl1e = &pv_ldt_ptes[offset >> PAGE_SHIFT];
l1e_add_flags(gl1e, _PAGE_RW);
l1e_write(pl1e, gl1e);
+ curr->arch.pv_vcpu.ldt_l1es[offset >> PAGE_SHIFT] = gl1e;
return true;
}
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 80efef0..39d1592 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -244,13 +244,17 @@ void early_switch_to_idle(bool bsp)
unsigned long cr4 = read_cr4();
/*
- * VT-x hardwires the IDT limit at 0xffff on VMExit.
+ * VT-x hardwires the GDT and IDT limit at 0xffff on VMExit.
*
* We don't wish to reload on vcpu context switch, so have arranged for
* nothing else to live within 64k of the base. Unilaterally setting the
* limit to 0xffff avoids leaking whether HVM vcpus are running to PV
- * guests via SIDT.
+ * guests via SGDT/SIDT.
*/
+ const struct desc_ptr gdtr = {
+ .base = PERCPU_GDT_MAPPING,
+ .limit = 0xffff,
+ };
const struct desc_ptr idtr = {
.base = PERCPU_IDT_MAPPING,
.limit = 0xffff,
@@ -272,7 +276,9 @@ void early_switch_to_idle(bool bsp)
per_cpu(curr_ptbase, cpu) = v->arch.cr3;
per_cpu(curr_extended_directmap, cpu) = true;
+ lgdt(&gdtr);
lidt(&idtr);
+ lldt(0);
if ( likely(!bsp) ) /* BSP IST setup deferred. */
enable_each_ist(idt_tables[cpu]);
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 2f1540e..eeabb4a 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1136,36 +1136,6 @@ static int handle_ldt_mapping_fault(unsigned int offset,
return EXCRET_fault_fixed;
}
-static int handle_gdt_ldt_mapping_fault(unsigned long offset,
- struct cpu_user_regs *regs)
-{
- struct vcpu *curr = current;
- /* Which vcpu's area did we fault in, and is it in the ldt sub-area? */
- unsigned int is_ldt_area = (offset >> (GDT_LDT_VCPU_VA_SHIFT-1)) & 1;
- unsigned int vcpu_area = (offset >> GDT_LDT_VCPU_VA_SHIFT);
-
- /*
- * If the fault is in another vcpu's area, it cannot be due to
- * a GDT/LDT descriptor load. Thus we can reasonably exit immediately, and
- * indeed we have to since pv_map_ldt_shadow_page() works correctly only on
- * accesses to a vcpu's own area.
- */
- if ( vcpu_area != curr->vcpu_id )
- return 0;
-
- /* Byte offset within the gdt/ldt sub-area. */
- offset &= (1UL << (GDT_LDT_VCPU_VA_SHIFT-1)) - 1UL;
-
- if ( likely(is_ldt_area) )
- return handle_ldt_mapping_fault(offset, regs);
-
- /* GDT fault: handle the fault as #GP(selector). */
- regs->error_code = offset & ~(X86_XEC_EXT | X86_XEC_IDT | X86_XEC_TI);
- (void)do_general_protection(regs);
-
- return EXCRET_fault_fixed;
-}
-
#define IN_HYPERVISOR_RANGE(va) \
(((va) >= HYPERVISOR_VIRT_START) && ((va) < HYPERVISOR_VIRT_END))
@@ -1316,9 +1286,9 @@ static int fixup_page_fault(unsigned long addr, struct cpu_user_regs *regs)
if ( unlikely(IN_HYPERVISOR_RANGE(addr)) )
{
if ( !(regs->error_code & (PFEC_user_mode | PFEC_reserved_bit)) &&
- (addr >= GDT_LDT_VIRT_START) && (addr < GDT_LDT_VIRT_END) )
- return handle_gdt_ldt_mapping_fault(
- addr - GDT_LDT_VIRT_START, regs);
+ (addr >= PERCPU_LDT_MAPPING) && (addr < PERCPU_LDT_MAPPING_END) )
+ return handle_ldt_mapping_fault(addr - PERCPU_LDT_MAPPING, regs);
+
return 0;
}
diff --git a/xen/common/efi/runtime.c b/xen/common/efi/runtime.c
index fe6d3af..3e46ac6 100644
--- a/xen/common/efi/runtime.c
+++ b/xen/common/efi/runtime.c
@@ -100,17 +100,6 @@ struct efi_rs_state efi_rs_enter(void)
/* prevent fixup_page_fault() from doing anything */
irq_enter();
- if ( is_pv_vcpu(current) && !is_idle_vcpu(current) )
- {
- struct desc_ptr gdt_desc = {
- .limit = LAST_RESERVED_GDT_BYTE,
- .base = (unsigned long)(per_cpu(gdt_table, smp_processor_id()) -
- FIRST_RESERVED_GDT_ENTRY)
- };
-
- lgdt(&gdt_desc);
- }
-
write_cr3(virt_to_maddr(efi_l4_pgtable));
this_cpu(curr_extended_directmap) = true;
@@ -124,15 +113,6 @@ void efi_rs_leave(struct efi_rs_state *state)
this_cpu(curr_extended_directmap) = paging_mode_external(current->domain);
write_cr3(state->cr3);
- if ( is_pv_vcpu(current) && !is_idle_vcpu(current) )
- {
- struct desc_ptr gdt_desc = {
- .limit = LAST_RESERVED_GDT_BYTE,
- .base = GDT_VIRT_START(current)
- };
-
- lgdt(&gdt_desc);
- }
irq_exit();
efi_rs_on_cpu = NR_CPUS;
spin_unlock(&efi_rs_lock);
diff --git a/xen/include/asm-x86/config.h b/xen/include/asm-x86/config.h
index dfe1f03..62549a8 100644
--- a/xen/include/asm-x86/config.h
+++ b/xen/include/asm-x86/config.h
@@ -304,7 +304,9 @@ extern unsigned long xen_phys_start;
#define PERCPU_GDT_LDT_L1ES (PERCPU_LINEAR_START + MB(8) + KB(12))
#define PERCPU_GDT_MAPPING (PERCPU_LINEAR_START + MB(10))
+#define PERCPU_GDT_MAPPING_END (PERCPU_GDT_MAPPING + 0x10000)
#define PERCPU_LDT_MAPPING (PERCPU_LINEAR_START + MB(11))
+#define PERCPU_LDT_MAPPING_END (PERCPU_LDT_MAPPING + 0x10000)
/* GDT/LDT shadow mapping area. The first per-domain-mapping sub-area. */
#define GDT_LDT_VCPU_SHIFT 5
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index be0f61c..108b3a4 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -396,18 +396,21 @@ struct arch_domain
#define has_arch_pdevs(d) (!list_empty(&(d)->arch.pdev_list))
-#define gdt_ldt_pt_idx(v) \
- ((v)->vcpu_id >> (PAGETABLE_ORDER - GDT_LDT_VCPU_SHIFT))
-#define pv_gdt_ptes(v) \
- ((v)->domain->arch.pv_domain.gdt_ldt_l1tab[gdt_ldt_pt_idx(v)] + \
- (((v)->vcpu_id << GDT_LDT_VCPU_SHIFT) & (L1_PAGETABLE_ENTRIES - 1)))
-#define pv_ldt_ptes(v) (pv_gdt_ptes(v) + 16)
+#define pv_gdt ((struct desc_struct *)PERCPU_GDT_MAPPING)
+#define pv_ldt ((struct desc_struct *)PERCPU_LDT_MAPPING)
+
+#define pv_gdt_ptes \
+ ((l1_pgentry_t *)PERCPU_GDT_LDT_L1ES + l1_table_offset(PERCPU_GDT_MAPPING))
+#define pv_ldt_ptes \
+ ((l1_pgentry_t *)PERCPU_GDT_LDT_L1ES + l1_table_offset(PERCPU_LDT_MAPPING))
struct pv_vcpu
{
struct trap_info *trap_ctxt;
- unsigned long gdt_frames[FIRST_RESERVED_GDT_PAGE];
+#define MAX_PV_GDT_FRAMES FIRST_RESERVED_GDT_PAGE
+ l1_pgentry_t gdt_l1es[MAX_PV_GDT_FRAMES];
+ l1_pgentry_t ldt_l1es[16];
unsigned long ldt_base;
unsigned int gdt_ents, ldt_ents;
diff --git a/xen/include/asm-x86/ldt.h b/xen/include/asm-x86/ldt.h
index 6fbce93..f28a895 100644
--- a/xen/include/asm-x86/ldt.h
+++ b/xen/include/asm-x86/ldt.h
@@ -4,21 +4,26 @@
#ifndef __ASSEMBLY__
+DECLARE_PER_CPU(unsigned int, ldt_ents);
+
static inline void load_LDT(struct vcpu *v)
{
- struct desc_struct *desc;
unsigned int ents = is_pv_vcpu(v) ? v->arch.pv_vcpu.ldt_ents : 0;
+ unsigned int *this_ldt_ents = &this_cpu(ldt_ents);
+
+ if ( likely(ents == *this_ldt_ents) )
+ return;
if ( ents == 0 )
lldt(0);
else
{
- desc = (!is_pv_32bit_vcpu(v)
- ? this_cpu(gdt_table) : this_cpu(compat_gdt_table))
- + LDT_ENTRY - FIRST_RESERVED_GDT_ENTRY;
- _set_tssldt_desc(desc, LDT_VIRT_START(v), ents*8-1, SYS_DESC_ldt);
+ _set_tssldt_desc(&pv_gdt[LDT_ENTRY], PERCPU_LDT_MAPPING,
+ ents * 8 - 1, SYS_DESC_ldt);
lldt(LDT_ENTRY << 3);
}
+
+ *this_ldt_ents = ents;
}
#endif /* !__ASSEMBLY__ */
--
2.1.4
* [PATCH RFC 34/44] x86: Drop the PERDOMAIN mappings
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (32 preceding siblings ...)
2018-01-04 20:21 ` [PATCH RFC 33/44] x86/smp: Use the percpu GDT/LDT mappings Andrew Cooper
@ 2018-01-04 20:21 ` Andrew Cooper
2018-01-04 20:22 ` [PATCH RFC 35/44] x86/smp: Allocate the stack in the percpu range Andrew Cooper
` (10 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:21 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
With the mapcache, xlat and GDT/LDT moved over to the PERCPU mappings, there
are no remaining users of the PERDOMAIN mappings. Drop the whole PERDOMAIN
infrastructure, and remove the PERDOMAIN slot in the virtual address layout.
Slide each of the subsequent slots back by one, and extend the directmap back
to its original size.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/arch/x86/domain.c | 2 -
xen/arch/x86/hvm/hvm.c | 6 --
xen/arch/x86/mm.c | 234 -------------------------------------------
xen/arch/x86/pv/domain.c | 39 +-------
xen/include/asm-x86/config.h | 36 ++-----
xen/include/asm-x86/domain.h | 4 -
xen/include/asm-x86/mm.h | 10 --
7 files changed, 8 insertions(+), 323 deletions(-)
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 2d665c6..eeca01d 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -568,7 +568,6 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags,
xfree(d->arch.msr);
if ( paging_initialised )
paging_final_teardown(d);
- free_perdomain_mappings(d);
return rc;
}
@@ -590,7 +589,6 @@ void arch_domain_destroy(struct domain *d)
if ( is_pv_domain(d) )
pv_domain_destroy(d);
- free_perdomain_mappings(d);
free_xenheap_page(d->shared_info);
cleanup_domain_irq_mapping(d);
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 5836269..85447dd 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -583,10 +583,6 @@ int hvm_domain_initialise(struct domain *d, unsigned long domcr_flags,
INIT_LIST_HEAD(&d->arch.hvm_domain.write_map.list);
INIT_LIST_HEAD(&d->arch.hvm_domain.g2m_ioport_list);
- rc = create_perdomain_mapping(d, PERDOMAIN_VIRT_START, 0, NULL, NULL);
- if ( rc )
- goto fail;
-
hvm_init_cacheattr_region_list(d);
rc = paging_enable(d, PG_refcounts|PG_translate|PG_external);
@@ -670,8 +666,6 @@ int hvm_domain_initialise(struct domain *d, unsigned long domcr_flags,
xfree(d->arch.hvm_domain.irq);
fail0:
hvm_destroy_cacheattr_region_list(d);
- destroy_perdomain_mapping(d, PERDOMAIN_VIRT_START, 0);
- fail:
return rc;
}
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 8b925b3..933bd67 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1594,13 +1594,6 @@ void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn,
mfn_eq(sl4mfn, INVALID_MFN) ? l4e_empty() :
l4e_from_mfn(sl4mfn, __PAGE_HYPERVISOR_RW);
- /* Slot 261: Per-domain mappings (if applicable). */
- l4t[l4_table_offset(PERDOMAIN_VIRT_START)] =
- d ? l4e_from_page(d->arch.perdomain_l3_pg, __PAGE_HYPERVISOR_RW)
- : l4e_empty();
-
- /* !!! WARNING - TEMPORARILY STALE BELOW !!! */
-
/* Slot 261-: text/data/bss, RW M2P, vmap, frametable, directmap. */
#ifndef NDEBUG
if ( short_directmap &&
@@ -5257,233 +5250,6 @@ void __iomem *ioremap(paddr_t pa, size_t len)
return (void __force __iomem *)va;
}
-int create_perdomain_mapping(struct domain *d, unsigned long va,
- unsigned int nr, l1_pgentry_t **pl1tab,
- struct page_info **ppg)
-{
- struct page_info *pg;
- l3_pgentry_t *l3tab;
- l2_pgentry_t *l2tab;
- l1_pgentry_t *l1tab;
- int rc = 0;
-
- ASSERT(va >= PERDOMAIN_VIRT_START &&
- va < PERDOMAIN_VIRT_SLOT(PERDOMAIN_SLOTS));
-
- if ( !d->arch.perdomain_l3_pg )
- {
- pg = alloc_domheap_page(d, MEMF_no_owner);
- if ( !pg )
- return -ENOMEM;
- l3tab = __map_domain_page(pg);
- clear_page(l3tab);
- d->arch.perdomain_l3_pg = pg;
- if ( !nr )
- {
- unmap_domain_page(l3tab);
- return 0;
- }
- }
- else if ( !nr )
- return 0;
- else
- l3tab = __map_domain_page(d->arch.perdomain_l3_pg);
-
- ASSERT(!l3_table_offset(va ^ (va + nr * PAGE_SIZE - 1)));
-
- if ( !(l3e_get_flags(l3tab[l3_table_offset(va)]) & _PAGE_PRESENT) )
- {
- pg = alloc_domheap_page(d, MEMF_no_owner);
- if ( !pg )
- {
- unmap_domain_page(l3tab);
- return -ENOMEM;
- }
- l2tab = __map_domain_page(pg);
- clear_page(l2tab);
- l3tab[l3_table_offset(va)] = l3e_from_page(pg, __PAGE_HYPERVISOR_RW);
- }
- else
- l2tab = map_l2t_from_l3e(l3tab[l3_table_offset(va)]);
-
- unmap_domain_page(l3tab);
-
- if ( !pl1tab && !ppg )
- {
- unmap_domain_page(l2tab);
- return 0;
- }
-
- for ( l1tab = NULL; !rc && nr--; )
- {
- l2_pgentry_t *pl2e = l2tab + l2_table_offset(va);
-
- if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) )
- {
- if ( pl1tab && !IS_NIL(pl1tab) )
- {
- l1tab = alloc_xenheap_pages(0, MEMF_node(domain_to_node(d)));
- if ( !l1tab )
- {
- rc = -ENOMEM;
- break;
- }
- ASSERT(!pl1tab[l2_table_offset(va)]);
- pl1tab[l2_table_offset(va)] = l1tab;
- pg = virt_to_page(l1tab);
- }
- else
- {
- pg = alloc_domheap_page(d, MEMF_no_owner);
- if ( !pg )
- {
- rc = -ENOMEM;
- break;
- }
- l1tab = __map_domain_page(pg);
- }
- clear_page(l1tab);
- *pl2e = l2e_from_page(pg, __PAGE_HYPERVISOR_RW);
- }
- else if ( !l1tab )
- l1tab = map_l1t_from_l2e(*pl2e);
-
- if ( ppg &&
- !(l1e_get_flags(l1tab[l1_table_offset(va)]) & _PAGE_PRESENT) )
- {
- pg = alloc_domheap_page(d, MEMF_no_owner);
- if ( pg )
- {
- clear_domain_page(page_to_mfn(pg));
- if ( !IS_NIL(ppg) )
- *ppg++ = pg;
- l1tab[l1_table_offset(va)] =
- l1e_from_page(pg, __PAGE_HYPERVISOR_RW | _PAGE_AVAIL0);
- l2e_add_flags(*pl2e, _PAGE_AVAIL0);
- }
- else
- rc = -ENOMEM;
- }
-
- va += PAGE_SIZE;
- if ( rc || !nr || !l1_table_offset(va) )
- {
- /* Note that this is a no-op for the alloc_xenheap_page() case. */
- unmap_domain_page(l1tab);
- l1tab = NULL;
- }
- }
-
- ASSERT(!l1tab);
- unmap_domain_page(l2tab);
-
- return rc;
-}
-
-void destroy_perdomain_mapping(struct domain *d, unsigned long va,
- unsigned int nr)
-{
- const l3_pgentry_t *l3tab, *pl3e;
-
- ASSERT(va >= PERDOMAIN_VIRT_START &&
- va < PERDOMAIN_VIRT_SLOT(PERDOMAIN_SLOTS));
- ASSERT(!l3_table_offset(va ^ (va + nr * PAGE_SIZE - 1)));
-
- if ( !d->arch.perdomain_l3_pg )
- return;
-
- l3tab = __map_domain_page(d->arch.perdomain_l3_pg);
- pl3e = l3tab + l3_table_offset(va);
-
- if ( l3e_get_flags(*pl3e) & _PAGE_PRESENT )
- {
- const l2_pgentry_t *l2tab = map_l2t_from_l3e(*pl3e);
- const l2_pgentry_t *pl2e = l2tab + l2_table_offset(va);
- unsigned int i = l1_table_offset(va);
-
- while ( nr )
- {
- if ( l2e_get_flags(*pl2e) & _PAGE_PRESENT )
- {
- l1_pgentry_t *l1tab = map_l1t_from_l2e(*pl2e);
-
- for ( ; nr && i < L1_PAGETABLE_ENTRIES; --nr, ++i )
- {
- if ( (l1e_get_flags(l1tab[i]) &
- (_PAGE_PRESENT | _PAGE_AVAIL0)) ==
- (_PAGE_PRESENT | _PAGE_AVAIL0) )
- free_domheap_page(l1e_get_page(l1tab[i]));
- l1tab[i] = l1e_empty();
- }
-
- unmap_domain_page(l1tab);
- }
- else if ( nr + i < L1_PAGETABLE_ENTRIES )
- break;
- else
- nr -= L1_PAGETABLE_ENTRIES - i;
-
- ++pl2e;
- i = 0;
- }
-
- unmap_domain_page(l2tab);
- }
-
- unmap_domain_page(l3tab);
-}
-
-void free_perdomain_mappings(struct domain *d)
-{
- l3_pgentry_t *l3tab;
- unsigned int i;
-
- if ( !d->arch.perdomain_l3_pg )
- return;
-
- l3tab = __map_domain_page(d->arch.perdomain_l3_pg);
-
- for ( i = 0; i < PERDOMAIN_SLOTS; ++i)
- if ( l3e_get_flags(l3tab[i]) & _PAGE_PRESENT )
- {
- struct page_info *l2pg = l3e_get_page(l3tab[i]);
- l2_pgentry_t *l2tab = __map_domain_page(l2pg);
- unsigned int j;
-
- for ( j = 0; j < L2_PAGETABLE_ENTRIES; ++j )
- if ( l2e_get_flags(l2tab[j]) & _PAGE_PRESENT )
- {
- struct page_info *l1pg = l2e_get_page(l2tab[j]);
-
- if ( l2e_get_flags(l2tab[j]) & _PAGE_AVAIL0 )
- {
- l1_pgentry_t *l1tab = __map_domain_page(l1pg);
- unsigned int k;
-
- for ( k = 0; k < L1_PAGETABLE_ENTRIES; ++k )
- if ( (l1e_get_flags(l1tab[k]) &
- (_PAGE_PRESENT | _PAGE_AVAIL0)) ==
- (_PAGE_PRESENT | _PAGE_AVAIL0) )
- free_domheap_page(l1e_get_page(l1tab[k]));
-
- unmap_domain_page(l1tab);
- }
-
- if ( is_xen_heap_page(l1pg) )
- free_xenheap_page(page_to_virt(l1pg));
- else
- free_domheap_page(l1pg);
- }
-
- unmap_domain_page(l2tab);
- free_domheap_page(l2pg);
- }
-
- unmap_domain_page(l3tab);
- free_domheap_page(d->arch.perdomain_l3_pg);
- d->arch.perdomain_l3_pg = NULL;
-}
-
#ifdef MEMORY_GUARD
static void __memguard_change_range(void *p, unsigned long l, int guard)
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 60a88bd..cce7541 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -91,26 +91,11 @@ int switch_compat(struct domain *d)
return rc;
}
-static int pv_create_gdt_ldt_l1tab(struct vcpu *v)
-{
- return create_perdomain_mapping(v->domain, GDT_VIRT_START(v),
- 1U << GDT_LDT_VCPU_SHIFT,
- v->domain->arch.pv_domain.gdt_ldt_l1tab,
- NULL);
-}
-
-static void pv_destroy_gdt_ldt_l1tab(struct vcpu *v)
-{
- destroy_perdomain_mapping(v->domain, GDT_VIRT_START(v),
- 1U << GDT_LDT_VCPU_SHIFT);
-}
-
void pv_vcpu_destroy(struct vcpu *v)
{
if ( is_pv_32bit_vcpu(v) )
release_compat_l4(v);
- pv_destroy_gdt_ldt_l1tab(v);
xfree(v->arch.pv_vcpu.trap_ctxt);
v->arch.pv_vcpu.trap_ctxt = NULL;
}
@@ -122,10 +107,6 @@ int pv_vcpu_initialise(struct vcpu *v)
ASSERT(!is_idle_domain(d));
- rc = pv_create_gdt_ldt_l1tab(v);
- if ( rc )
- return rc;
-
BUILD_BUG_ON(NR_VECTORS * sizeof(*v->arch.pv_vcpu.trap_ctxt) >
PAGE_SIZE);
v->arch.pv_vcpu.trap_ctxt = xzalloc_array(struct trap_info,
@@ -147,6 +128,8 @@ int pv_vcpu_initialise(struct vcpu *v)
goto done;
}
+ rc = 0; /* Success */
+
done:
if ( rc )
pv_vcpu_destroy(v);
@@ -155,14 +138,8 @@ int pv_vcpu_initialise(struct vcpu *v)
void pv_domain_destroy(struct domain *d)
{
- destroy_perdomain_mapping(d, GDT_LDT_VIRT_START,
- GDT_LDT_MBYTES << (20 - PAGE_SHIFT));
-
xfree(d->arch.pv_domain.cpuidmasks);
d->arch.pv_domain.cpuidmasks = NULL;
-
- free_xenheap_page(d->arch.pv_domain.gdt_ldt_l1tab);
- d->arch.pv_domain.gdt_ldt_l1tab = NULL;
}
@@ -176,12 +153,6 @@ int pv_domain_initialise(struct domain *d, unsigned int domcr_flags,
};
int rc = -ENOMEM;
- d->arch.pv_domain.gdt_ldt_l1tab =
- alloc_xenheap_pages(0, MEMF_node(domain_to_node(d)));
- if ( !d->arch.pv_domain.gdt_ldt_l1tab )
- goto fail;
- clear_page(d->arch.pv_domain.gdt_ldt_l1tab);
-
if ( levelling_caps & ~LCAP_faulting )
{
d->arch.pv_domain.cpuidmasks = xmalloc(struct cpuidmasks);
@@ -190,12 +161,6 @@ int pv_domain_initialise(struct domain *d, unsigned int domcr_flags,
*d->arch.pv_domain.cpuidmasks = cpuidmask_defaults;
}
- rc = create_perdomain_mapping(d, GDT_LDT_VIRT_START,
- GDT_LDT_MBYTES << (20 - PAGE_SHIFT),
- NULL, NULL);
- if ( rc )
- goto fail;
-
d->arch.ctxt_switch = &pv_csw;
/* 64-bit PV guest by default. */
diff --git a/xen/include/asm-x86/config.h b/xen/include/asm-x86/config.h
index 62549a8..cf6f1be 100644
--- a/xen/include/asm-x86/config.h
+++ b/xen/include/asm-x86/config.h
@@ -131,9 +131,6 @@ extern unsigned char boot_edid_info[128];
* Guest linear page table.
* 0xffff820000000000 - 0xffff827fffffffff [512GB, 2^39 bytes, PML4:260]
* Shadow linear page table.
- *
- * !!! WARNING - TEMPORARILY STALE BELOW !!!
- *
* 0xffff828000000000 - 0xffff82bfffffffff [256GB, 2^38 bytes, PML4:261]
* Machine-to-phys translation table.
* 0xffff82c000000000 - 0xffff82cfffffffff [64GB, 2^36 bytes, PML4:261]
@@ -207,17 +204,8 @@ extern unsigned char boot_edid_info[128];
/* Slot 260: linear page table (shadow table). */
#define SH_LINEAR_PT_VIRT_START (PML4_ADDR(260))
#define SH_LINEAR_PT_VIRT_END (SH_LINEAR_PT_VIRT_START + PML4_ENTRY_BYTES)
-/* Slot 261: per-domain mappings (including map cache). */
-#define PERDOMAIN_VIRT_START (PML4_ADDR(261))
-#define PERDOMAIN_SLOT_MBYTES (PML4_ENTRY_BYTES >> (20 + PAGETABLE_ORDER))
-#define PERDOMAIN_SLOTS 3
-#define PERDOMAIN_VIRT_SLOT(s) (PERDOMAIN_VIRT_START + (s) * \
- (PERDOMAIN_SLOT_MBYTES << 20))
-/*
- * !!! WARNING - TEMPORARILY STALE BELOW !!!
- */
/* Slot 261: machine-to-phys conversion table (256GB). */
-#define RDWR_MPT_VIRT_START (PML4_ADDR(262))
+#define RDWR_MPT_VIRT_START (PML4_ADDR(261))
#define RDWR_MPT_VIRT_END (RDWR_MPT_VIRT_START + MPT_VIRT_SIZE)
/* Slot 261: vmap()/ioremap()/fixmap area (64GB). */
#define VMAP_VIRT_START RDWR_MPT_VIRT_END
@@ -245,12 +233,12 @@ extern unsigned char boot_edid_info[128];
#ifndef CONFIG_BIGMEM
/* Slot 262-271/510: A direct 1:1 mapping of all of physical memory. */
-#define DIRECTMAP_VIRT_START (PML4_ADDR(263))
-#define DIRECTMAP_SIZE (PML4_ENTRY_BYTES * (511 - 263))
+#define DIRECTMAP_VIRT_START (PML4_ADDR(262))
+#define DIRECTMAP_SIZE (PML4_ENTRY_BYTES * (511 - 262))
#else
/* Slot 265-271/510: A direct 1:1 mapping of all of physical memory. */
-#define DIRECTMAP_VIRT_START (PML4_ADDR(266))
-#define DIRECTMAP_SIZE (PML4_ENTRY_BYTES * (511 - 266))
+#define DIRECTMAP_VIRT_START (PML4_ADDR(265))
+#define DIRECTMAP_SIZE (PML4_ENTRY_BYTES * (511 - 265))
#endif
#define DIRECTMAP_VIRT_END (DIRECTMAP_VIRT_START + DIRECTMAP_SIZE)
@@ -308,19 +296,7 @@ extern unsigned long xen_phys_start;
#define PERCPU_LDT_MAPPING (PERCPU_LINEAR_START + MB(11))
#define PERCPU_LDT_MAPPING_END (PERCPU_LDT_MAPPING + 0x10000)
-/* GDT/LDT shadow mapping area. The first per-domain-mapping sub-area. */
-#define GDT_LDT_VCPU_SHIFT 5
-#define GDT_LDT_VCPU_VA_SHIFT (GDT_LDT_VCPU_SHIFT + PAGE_SHIFT)
-#define GDT_LDT_MBYTES PERDOMAIN_SLOT_MBYTES
-#define MAX_VIRT_CPUS (GDT_LDT_MBYTES << (20-GDT_LDT_VCPU_VA_SHIFT))
-#define GDT_LDT_VIRT_START PERDOMAIN_VIRT_SLOT(0)
-#define GDT_LDT_VIRT_END (GDT_LDT_VIRT_START + (GDT_LDT_MBYTES << 20))
-
-/* The address of a particular VCPU's GDT or LDT. */
-#define GDT_VIRT_START(v) \
- (PERDOMAIN_VIRT_START + ((v)->vcpu_id << GDT_LDT_VCPU_VA_SHIFT))
-#define LDT_VIRT_START(v) \
- (GDT_VIRT_START(v) + (64*1024))
+#define MAX_VIRT_CPUS 8192
#define NATIVE_VM_ASSIST_VALID ((1UL << VMASST_TYPE_4gb_segments) | \
(1UL << VMASST_TYPE_4gb_segments_notify) | \
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 108b3a4..ac75248 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -211,8 +211,6 @@ struct time_scale {
struct pv_domain
{
- l1_pgentry_t **gdt_ldt_l1tab;
-
atomic_t nr_l4_pages;
struct cpuidmasks *cpuidmasks;
@@ -235,8 +233,6 @@ struct monitor_write_data {
struct arch_domain
{
- struct page_info *perdomain_l3_pg;
-
unsigned int hv_compat_vstart;
/* Maximum physical-address bitwidth supported by this guest. */
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 54b7499..22c2809 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -591,16 +591,6 @@ long subarch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg);
int compat_arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void));
int compat_subarch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void));
-#define NIL(type) ((type *)-sizeof(type))
-#define IS_NIL(ptr) (!((uintptr_t)(ptr) + sizeof(*(ptr))))
-
-int create_perdomain_mapping(struct domain *, unsigned long va,
- unsigned int nr, l1_pgentry_t **,
- struct page_info **);
-void destroy_perdomain_mapping(struct domain *, unsigned long va,
- unsigned int nr);
-void free_perdomain_mappings(struct domain *);
-
extern int memory_add(unsigned long spfn, unsigned long epfn, unsigned int pxm);
void domain_set_alloc_bitsize(struct domain *d);
--
2.1.4
* [PATCH RFC 35/44] x86/smp: Allocate the stack in the percpu range
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (33 preceding siblings ...)
2018-01-04 20:21 ` [PATCH RFC 34/44] x86: Drop the PERDOMAIN mappings Andrew Cooper
@ 2018-01-04 20:22 ` Andrew Cooper
2018-01-04 20:22 ` [PATCH RFC 36/44] x86/monitor: Capture Xen's intent to use monitor at boot time Andrew Cooper
` (9 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:22 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
This involves allocating a total of 5 frames, which no longer need to come
from a single order-3 allocation, and which unconditionally have guard pages
in place to catch a primary stack overflow.
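For reference (not part of the patch, just restating the layout encoded in the
hunk below), each cpu's stack window looks like:

  /*
   * PERCPU_STACK_MAPPING + pages 0-2: #DF, #NMI, #MCE IST stacks (mapped)
   *                        pages 3-5: guard pages          (left unmapped)
   *                        pages 6-7: primary stack              (mapped)
   *
   * A primary stack overflow therefore hits an unmapped guard page and
   * faults cleanly, rather than silently corrupting the IST stacks.
   */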
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/arch/x86/smpboot.c | 27 ++++++++++++++++++++++++++-
xen/include/asm-x86/config.h | 2 ++
2 files changed, 28 insertions(+), 1 deletion(-)
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index cc80f24..1bf6dc1 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -762,7 +762,7 @@ static int percpu_alloc_frame(unsigned int cpu, unsigned long linear,
/* Allocate data common between the BSP and APs. */
static int cpu_smpboot_alloc_common(unsigned int cpu)
{
- unsigned int memflags = 0;
+ unsigned int memflags = 0, i;
nodeid_t node = cpu_to_node(cpu);
l4_pgentry_t *l4t = NULL;
l3_pgentry_t *l3t = NULL;
@@ -843,6 +843,31 @@ static int cpu_smpboot_alloc_common(unsigned int cpu)
if ( rc )
goto out;
+ /* Allocate the stack. */
+ for ( i = 0; i < 8; ++i )
+ {
+ BUILD_BUG_ON((1u << STACK_ORDER) != 8);
+ BUILD_BUG_ON(!IS_ALIGNED(PERCPU_STACK_MAPPING, STACK_SIZE));
+ BUILD_BUG_ON((sizeof(struct cpu_info) -
+ offsetof(struct cpu_info, guest_cpu_user_regs.es)) & 0xf);
+
+ /*
+ * Pages 0-2: #DF, #NMI, #MCE IST stacks
+ * Pages 3-5: Guard pages - UNMAPPED
+ * Pages 6-7: Main stack
+ */
+ if ( i == 3 )
+ {
+ i = 5;
+ continue;
+ }
+
+ rc = percpu_alloc_frame(cpu, PERCPU_STACK_MAPPING + i * PAGE_SIZE, NULL,
+ PAGE_HYPERVISOR_RW | MAP_PERCPU_AUTOFREE);
+ if ( rc )
+ goto out;
+ }
+
rc = 0; /* Success */
out:
diff --git a/xen/include/asm-x86/config.h b/xen/include/asm-x86/config.h
index cf6f1be..3974748 100644
--- a/xen/include/asm-x86/config.h
+++ b/xen/include/asm-x86/config.h
@@ -296,6 +296,8 @@ extern unsigned long xen_phys_start;
#define PERCPU_LDT_MAPPING (PERCPU_LINEAR_START + MB(11))
#define PERCPU_LDT_MAPPING_END (PERCPU_LDT_MAPPING + 0x10000)
+#define PERCPU_STACK_MAPPING (PERCPU_LINEAR_START + MB(12))
+
#define MAX_VIRT_CPUS 8192
#define NATIVE_VM_ASSIST_VALID ((1UL << VMASST_TYPE_4gb_segments) | \
--
2.1.4
* [PATCH RFC 36/44] x86/monitor: Capture Xen's intent to use monitor at boot time
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (34 preceding siblings ...)
2018-01-04 20:22 ` [PATCH RFC 35/44] x86/smp: Allocate the stack in the percpu range Andrew Cooper
@ 2018-01-04 20:22 ` Andrew Cooper
2018-01-04 20:22 ` [PATCH RFC 37/44] x86/misc: Move some IPI parameters off the stack Andrew Cooper
` (8 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:22 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
The ACPI idle driver uses an IPI to retrieve cpuid_ecx(5). This is
problematic because it passes a stack pointer as the IPI parameter, and is
also wasteful at runtime.
Introduce X86_FEATURE_XEN_MONITOR as a synthetic feature bit meaning MONITOR
&& EXTENSIONS && INTERRUPT_BREAK, and calculate it when a cpu comes up rather
than repeatedly at runtime.
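Illustratively (a condensed view of the hunks below, not new code), the bit is
computed once at CPU identification time and runtime users reduce to a flag
test:

  if ( cpu_has(c, X86_FEATURE_MONITOR) &&
       ((cpuid_ecx(CPUID_MWAIT_LEAF) & CPUID_MWAIT_MIN_FEATURES) ==
        CPUID_MWAIT_MIN_FEATURES) )
      set_bit(X86_FEATURE_XEN_MONITOR, c->x86_capability);

  /* ... later, with no IPI needed: */
  if ( !cpu_has_xen_monitor )
      pdc[2] &= ~(ACPI_PDC_C_C1_FFH | ACPI_PDC_C_C2C3_FFH);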
Drop the duplicate defines for MWAIT cpuid information, and use the
definitions from mwait.h
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/arch/x86/acpi/lib.c | 16 +---------------
xen/arch/x86/cpu/common.c | 7 +++++++
xen/include/asm-x86/cpufeature.h | 5 +----
xen/include/asm-x86/cpufeatures.h | 1 +
xen/include/asm-x86/mwait.h | 3 +++
5 files changed, 13 insertions(+), 19 deletions(-)
diff --git a/xen/arch/x86/acpi/lib.c b/xen/arch/x86/acpi/lib.c
index 7d7c718..1d64e74 100644
--- a/xen/arch/x86/acpi/lib.c
+++ b/xen/arch/x86/acpi/lib.c
@@ -80,16 +80,10 @@ unsigned int acpi_get_processor_id(unsigned int cpu)
return INVALID_ACPIID;
}
-static void get_mwait_ecx(void *info)
-{
- *(u32 *)info = cpuid_ecx(CPUID_MWAIT_LEAF);
-}
-
int arch_acpi_set_pdc_bits(u32 acpi_id, u32 *pdc, u32 mask)
{
unsigned int cpu = get_cpu_id(acpi_id);
struct cpuinfo_x86 *c;
- u32 ecx;
if (!(acpi_id + 1))
c = &boot_cpu_data;
@@ -110,15 +104,7 @@ int arch_acpi_set_pdc_bits(u32 acpi_id, u32 *pdc, u32 mask)
* If mwait/monitor or its break-on-interrupt extension are
* unsupported, Cx_FFH will be disabled.
*/
- if (!cpu_has(c, X86_FEATURE_MONITOR) ||
- c->cpuid_level < CPUID_MWAIT_LEAF)
- ecx = 0;
- else if (c == &boot_cpu_data || cpu == smp_processor_id())
- ecx = cpuid_ecx(CPUID_MWAIT_LEAF);
- else
- on_selected_cpus(cpumask_of(cpu), get_mwait_ecx, &ecx, 1);
- if (!(ecx & CPUID5_ECX_EXTENSIONS_SUPPORTED) ||
- !(ecx & CPUID5_ECX_INTERRUPT_BREAK))
+ if ( !cpu_has_xen_monitor )
pdc[2] &= ~(ACPI_PDC_C_C1_FFH | ACPI_PDC_C_C2C3_FFH);
return 0;
diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index decdcd5..262eccc 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -12,6 +12,7 @@
#include <mach_apic.h>
#include <asm/setup.h>
#include <public/sysctl.h> /* for XEN_INVALID_{SOCKET,CORE}_ID */
+#include <asm/mwait.h>
#include "cpu.h"
@@ -312,6 +313,12 @@ static void generic_identify(struct cpuinfo_x86 *c)
if ( cpu_has(c, X86_FEATURE_CLFLUSH) )
c->x86_clflush_size = ((ebx >> 8) & 0xff) * 8;
+ /* Xen only uses MONITOR if INTERRUPT_BREAK is available. */
+ if ( cpu_has(c, X86_FEATURE_MONITOR) &&
+ ((cpuid_ecx(CPUID_MWAIT_LEAF) & CPUID_MWAIT_MIN_FEATURES) ==
+ CPUID_MWAIT_MIN_FEATURES) )
+ set_bit(X86_FEATURE_XEN_MONITOR, c->x86_capability);
+
if ( (c->cpuid_level >= CPUID_PM_LEAF) &&
(cpuid_ecx(CPUID_PM_LEAF) & CPUID6_ECX_APERFMPERF_CAPABILITY) )
set_bit(X86_FEATURE_APERFMPERF, c->x86_capability);
diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h
index 84cc51d..8b24e0e 100644
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -22,10 +22,6 @@
#define cpu_has(c, bit) test_bit(bit, (c)->x86_capability)
#define boot_cpu_has(bit) test_bit(bit, boot_cpu_data.x86_capability)
-#define CPUID_MWAIT_LEAF 5
-#define CPUID5_ECX_EXTENSIONS_SUPPORTED 0x1
-#define CPUID5_ECX_INTERRUPT_BREAK 0x2
-
#define CPUID_PM_LEAF 6
#define CPUID6_ECX_APERFMPERF_CAPABILITY 0x1
@@ -104,6 +100,7 @@
#define cpu_has_arch_perfmon boot_cpu_has(X86_FEATURE_ARCH_PERFMON)
#define cpu_has_cpuid_faulting boot_cpu_has(X86_FEATURE_CPUID_FAULTING)
#define cpu_has_aperfmperf boot_cpu_has(X86_FEATURE_APERFMPERF)
+#define cpu_has_xen_monitor boot_cpu_has(X86_FEATURE_XEN_MONITOR)
enum _cache_type {
CACHE_TYPE_NULL = 0,
diff --git a/xen/include/asm-x86/cpufeatures.h b/xen/include/asm-x86/cpufeatures.h
index bc98227..98637d0 100644
--- a/xen/include/asm-x86/cpufeatures.h
+++ b/xen/include/asm-x86/cpufeatures.h
@@ -22,3 +22,4 @@ XEN_CPUFEATURE(APERFMPERF, (FSCAPINTS+0)*32+ 8) /* APERFMPERF */
XEN_CPUFEATURE(MFENCE_RDTSC, (FSCAPINTS+0)*32+ 9) /* MFENCE synchronizes RDTSC */
XEN_CPUFEATURE(XEN_SMEP, (FSCAPINTS+0)*32+10) /* SMEP gets used by Xen itself */
XEN_CPUFEATURE(XEN_SMAP, (FSCAPINTS+0)*32+11) /* SMAP gets used by Xen itself */
+XEN_CPUFEATURE(XEN_MONITOR, (FSCAPINTS+0)*32+12) /* Xen uses MONITOR */
diff --git a/xen/include/asm-x86/mwait.h b/xen/include/asm-x86/mwait.h
index ba9c0ea..a1bfeb1 100644
--- a/xen/include/asm-x86/mwait.h
+++ b/xen/include/asm-x86/mwait.h
@@ -9,6 +9,9 @@
#define CPUID5_ECX_EXTENSIONS_SUPPORTED 0x1
#define CPUID5_ECX_INTERRUPT_BREAK 0x2
+#define CPUID_MWAIT_MIN_FEATURES \
+ (CPUID5_ECX_EXTENSIONS_SUPPORTED | CPUID5_ECX_INTERRUPT_BREAK)
+
#define MWAIT_ECX_INTERRUPT_BREAK 0x1
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx);
--
2.1.4
* [PATCH RFC 37/44] x86/misc: Move some IPI parameters off the stack
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (35 preceding siblings ...)
2018-01-04 20:22 ` [PATCH RFC 36/44] x86/monitor: Capture Xen's intent to use monitor at boot time Andrew Cooper
@ 2018-01-04 20:22 ` Andrew Cooper
2018-01-04 20:22 ` [PATCH RFC 38/44] x86/mca: Move __HYPERVISOR_mca " Andrew Cooper
` (7 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:22 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
With percpu stacks, it will no longer be safe to pass stack pointers as IPI
parameters. The logic in machine_restart(), time_calibration() and set_mtrr()
is effectively singleton, so switch to using static variables.
The static set_mtrr_data is protected by the mtrr_mutex, which requires
mtrr_ap_init() and mtrr_aps_sync_end() to hold the mutex around their calls to
set_mtrr().
time_calibration() runs exclusively out of a timer on cpu0 so is safe, while
machine_restart() doesn't have any concurrency to be worried about.
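The resulting pattern (sketched here for set_mtrr(); simplified, with the
cpumask handling and count/gate initialisation elided, and arguments as in the
hunk below) is a static parameter block whose lifetime is covered by whatever
already serialises the caller:

  static struct set_mtrr_data data;          /* Not on the stack. */

  ASSERT(spin_is_locked(&mtrr_mutex));       /* Callers serialise us. */

  data = (struct set_mtrr_data){
      .smp_reg = reg, .smp_base = base, .smp_size = size, .smp_type = type,
  };

  on_selected_cpus(&allbutself, ipi_handler, &data, 0);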
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/arch/x86/cpu/mtrr/main.c | 27 +++++++++++++++++----------
xen/arch/x86/shutdown.c | 8 ++++++--
xen/arch/x86/time.c | 7 +++++--
3 files changed, 28 insertions(+), 14 deletions(-)
diff --git a/xen/arch/x86/cpu/mtrr/main.c b/xen/arch/x86/cpu/mtrr/main.c
index 56f71a6..d8ae9bf 100644
--- a/xen/arch/x86/cpu/mtrr/main.c
+++ b/xen/arch/x86/cpu/mtrr/main.c
@@ -59,9 +59,6 @@ u64 __read_mostly size_and_mask;
const struct mtrr_ops *__read_mostly mtrr_if = NULL;
-static void set_mtrr(unsigned int reg, unsigned long base,
- unsigned long size, mtrr_type type);
-
static const char *const mtrr_strings[MTRR_NUM_TYPES] =
{
"uncachable", /* 0 */
@@ -211,21 +208,27 @@ static inline int types_compatible(mtrr_type type1, mtrr_type type2) {
static void set_mtrr(unsigned int reg, unsigned long base,
unsigned long size, mtrr_type type)
{
+ /* Can't pass a stack pointer to an IPI. */
+ static struct set_mtrr_data data;
+
cpumask_t allbutself;
unsigned int nr_cpus;
- struct set_mtrr_data data;
unsigned long flags;
+ ASSERT(spin_is_locked(&mtrr_mutex));
+
cpumask_andnot(&allbutself, &cpu_online_map,
cpumask_of(smp_processor_id()));
nr_cpus = cpumask_weight(&allbutself);
- data.smp_reg = reg;
- data.smp_base = base;
- data.smp_size = size;
- data.smp_type = type;
- atomic_set(&data.count, nr_cpus);
- atomic_set(&data.gate,0);
+ data = (struct set_mtrr_data){
+ .smp_reg = reg,
+ .smp_base = base,
+ .smp_size = size,
+ .smp_type = type,
+ .count = ATOMIC_INIT(nr_cpus),
+ .gate = ATOMIC_INIT(0),
+ };
/* Start the ball rolling on other CPUs */
on_selected_cpus(&allbutself, ipi_handler, &data, 0);
@@ -593,7 +596,9 @@ void mtrr_ap_init(void)
* 2.cpu hotadd time. We let mtrr_add/del_page hold cpuhotplug lock to
* prevent mtrr entry changes
*/
+ mutex_lock(&mtrr_mutex);
set_mtrr(~0U, 0, 0, 0);
+ mutex_unlock(&mtrr_mutex);
}
/**
@@ -621,7 +626,9 @@ void mtrr_aps_sync_end(void)
{
if (!use_intel())
return;
+ mutex_lock(&mtrr_mutex);
set_mtrr(~0U, 0, 0, 0);
+ mutex_unlock(&mtrr_mutex);
hold_mtrr_updates_on_aps = 0;
}
diff --git a/xen/arch/x86/shutdown.c b/xen/arch/x86/shutdown.c
index a87aa60..29317d9 100644
--- a/xen/arch/x86/shutdown.c
+++ b/xen/arch/x86/shutdown.c
@@ -536,9 +536,13 @@ void machine_restart(unsigned int delay_millisecs)
/* Ensure we are the boot CPU. */
if ( get_apic_id() != boot_cpu_physical_apicid )
{
+ /* Can't pass a stack pointer to an IPI. */
+ static unsigned int delay;
+
+ delay = delay_millisecs;
+
/* Send IPI to the boot CPU (logical cpu 0). */
- on_selected_cpus(cpumask_of(0), __machine_restart,
- &delay_millisecs, 0);
+ on_selected_cpus(cpumask_of(0), __machine_restart, &delay, 0);
for ( ; ; )
halt();
}
diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index 2a87950..390cf0c 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -1446,8 +1446,11 @@ static void (*time_calibration_rendezvous_fn)(void *) =
static void time_calibration(void *unused)
{
- struct calibration_rendezvous r = {
- .semaphore = ATOMIC_INIT(0)
+ /* Can't pass a stack pointer to an IPI. */
+ static struct calibration_rendezvous r;
+
+ r = (struct calibration_rendezvous){
+ .semaphore = ATOMIC_INIT(0),
};
if ( clocksource_is_tsc() )
--
2.1.4
* [PATCH RFC 38/44] x86/mca: Move __HYPERVISOR_mca IPI parameters off the stack
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (36 preceding siblings ...)
2018-01-04 20:22 ` [PATCH RFC 37/44] x86/misc: Move some IPI parameters off the stack Andrew Cooper
@ 2018-01-04 20:22 ` Andrew Cooper
2018-01-04 20:22 ` [PATCH RFC 39/44] x86/smp: Introduce get_smp_ipi_buf() and take more " Andrew Cooper
` (6 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:22 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
do_mca() makes several IPIs with huge parameter blocks. All operations are
control-plane, and exist for debugging/development purposes, so restrict them
to being serialised. This allows the hypercall parameter block to safely be
static.
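Concretely (a trimmed sketch of the pattern in the hunk below), the
serialisation is a trylock with a hypercall continuation, so a contending
caller doesn't spin inside the hypervisor:

  static spinlock_t mca_lock = SPIN_LOCK_UNLOCKED;
  static struct xen_mc curop;                /* Safe: one caller at a time. */

  while ( !spin_trylock(&mca_lock) )
  {
      if ( hypercall_preempt_check() )
          return hypercall_create_continuation(__HYPERVISOR_mca, "h", u_xen_mc);
  }

  /* ... operate on curop ... */

  spin_unlock(&mca_lock);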
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/arch/x86/cpu/mcheck/mce.c | 143 +++++++++++++++++++++++++-----------------
1 file changed, 87 insertions(+), 56 deletions(-)
diff --git a/xen/arch/x86/cpu/mcheck/mce.c b/xen/arch/x86/cpu/mcheck/mce.c
index a8c287d..6e88c64 100644
--- a/xen/arch/x86/cpu/mcheck/mce.c
+++ b/xen/arch/x86/cpu/mcheck/mce.c
@@ -49,18 +49,6 @@ struct mca_banks *mca_allbanks;
#define SEG_PL(segsel) ((segsel) & 0x3)
#define _MC_MSRINJ_F_REQ_HWCR_WREN (1 << 16)
-#if 0
-#define x86_mcerr(fmt, err, args...) \
- ({ \
- int _err = (err); \
- gdprintk(XENLOG_WARNING, "x86_mcerr: " fmt ", returning %d\n", \
- ## args, _err); \
- _err; \
- })
-#else
-#define x86_mcerr(fmt, err, args...) (err)
-#endif
-
int mce_verbosity;
static int __init mce_set_verbosity(const char *str)
{
@@ -1306,8 +1294,11 @@ CHECK_mcinfo_recovery;
/* Machine Check Architecture Hypercall */
long do_mca(XEN_GUEST_HANDLE_PARAM(xen_mc_t) u_xen_mc)
{
+ static spinlock_t mca_lock = SPIN_LOCK_UNLOCKED;
+ static struct xen_mc curop;
+
long ret = 0;
- struct xen_mc curop, *op = &curop;
+ struct xen_mc *op = &curop;
struct vcpu *v = current;
union {
struct xen_mc_fetch *nat;
@@ -1328,13 +1319,26 @@ long do_mca(XEN_GUEST_HANDLE_PARAM(xen_mc_t) u_xen_mc)
ret = xsm_do_mca(XSM_PRIV);
if ( ret )
- return x86_mcerr("", ret);
+ return ret;
+
+ while ( !spin_trylock(&mca_lock) )
+ {
+ if ( hypercall_preempt_check() )
+ return hypercall_create_continuation(__HYPERVISOR_mca,
+ "h", u_xen_mc);
+ }
if ( copy_from_guest(op, u_xen_mc, 1) )
- return x86_mcerr("do_mca: failed copyin of xen_mc_t", -EFAULT);
+ {
+ ret = -EFAULT;
+ goto out;
+ }
if ( op->interface_version != XEN_MCA_INTERFACE_VERSION )
- return x86_mcerr("do_mca: interface version mismatch", -EACCES);
+ {
+ ret = -EACCES;
+ goto out;
+ }
switch ( op->cmd )
{
@@ -1353,7 +1357,8 @@ long do_mca(XEN_GUEST_HANDLE_PARAM(xen_mc_t) u_xen_mc)
break;
default:
- return x86_mcerr("do_mca fetch: bad cmdflags", -EINVAL);
+ ret = -EINVAL;
+ goto out;
}
flags = XEN_MC_OK;
@@ -1368,8 +1373,10 @@ long do_mca(XEN_GUEST_HANDLE_PARAM(xen_mc_t) u_xen_mc)
if ( !is_pv_32bit_vcpu(v)
? guest_handle_is_null(mc_fetch.nat->data)
: compat_handle_is_null(mc_fetch.cmp->data) )
- return x86_mcerr("do_mca fetch: guest buffer "
- "invalid", -EINVAL);
+ {
+ ret = -EINVAL;
+ goto out;
+ }
mctc = mctelem_consume_oldest_begin(which);
if ( mctc )
@@ -1402,7 +1409,10 @@ long do_mca(XEN_GUEST_HANDLE_PARAM(xen_mc_t) u_xen_mc)
break;
case XEN_MC_notifydomain:
- return x86_mcerr("do_mca notify unsupported", -EINVAL);
+ {
+ ret = -EINVAL;
+ goto out;
+ }
case XEN_MC_physcpuinfo:
mc_physcpuinfo.nat = &op->u.mc_physcpuinfo;
@@ -1413,12 +1423,17 @@ long do_mca(XEN_GUEST_HANDLE_PARAM(xen_mc_t) u_xen_mc)
: !compat_handle_is_null(mc_physcpuinfo.cmp->info) )
{
if ( mc_physcpuinfo.nat->ncpus <= 0 )
- return x86_mcerr("do_mca cpuinfo: ncpus <= 0",
- -EINVAL);
+ {
+ ret = -EINVAL;
+ goto out;
+ }
nlcpu = min(nlcpu, (int)mc_physcpuinfo.nat->ncpus);
log_cpus = xmalloc_array(xen_mc_logical_cpu_t, nlcpu);
if ( log_cpus == NULL )
- return x86_mcerr("do_mca cpuinfo", -ENOMEM);
+ {
+ ret = -ENOMEM;
+ goto out;
+ }
on_each_cpu(do_mc_get_cpu_info, log_cpus, 1);
if ( !is_pv_32bit_vcpu(v)
? copy_to_guest(mc_physcpuinfo.nat->info, log_cpus, nlcpu)
@@ -1430,26 +1445,27 @@ long do_mca(XEN_GUEST_HANDLE_PARAM(xen_mc_t) u_xen_mc)
mc_physcpuinfo.nat->ncpus = nlcpu;
if ( copy_to_guest(u_xen_mc, op, 1) )
- return x86_mcerr("do_mca cpuinfo", -EFAULT);
-
+ ret = -EFAULT;
break;
case XEN_MC_msrinject:
if ( nr_mce_banks == 0 )
- return x86_mcerr("do_mca inject", -ENODEV);
+ {
+ ret = -ENODEV;
+ goto out;
+ }
mc_msrinject = &op->u.mc_msrinject;
target = mc_msrinject->mcinj_cpunr;
- if ( target >= nr_cpu_ids )
- return x86_mcerr("do_mca inject: bad target", -EINVAL);
-
- if ( !cpu_online(target) )
- return x86_mcerr("do_mca inject: target offline",
- -EINVAL);
+ if ( target >= nr_cpu_ids || !cpu_online(target) )
+ {
+ ret = -EINVAL;
+ goto out;
+ }
if ( mc_msrinject->mcinj_count == 0 )
- return 0;
+ goto out;
if ( mc_msrinject->mcinj_flags & MC_MSRINJ_F_GPADDR )
{
@@ -1464,14 +1480,17 @@ long do_mca(XEN_GUEST_HANDLE_PARAM(xen_mc_t) u_xen_mc)
domid = (mc_msrinject->mcinj_domid == DOMID_SELF) ?
current->domain->domain_id : mc_msrinject->mcinj_domid;
if ( domid >= DOMID_FIRST_RESERVED )
- return x86_mcerr("do_mca inject: incompatible flag "
- "MC_MSRINJ_F_GPADDR with domain %d",
- -EINVAL, domid);
+ {
+ ret = -EINVAL;
+ goto out;
+ }
d = get_domain_by_id(domid);
if ( d == NULL )
- return x86_mcerr("do_mca inject: bad domain id %d",
- -EINVAL, domid);
+ {
+ ret = -EINVAL;
+ goto out;
+ }
for ( i = 0, msr = &mc_msrinject->mcinj_msr[0];
i < mc_msrinject->mcinj_count;
@@ -1485,8 +1504,8 @@ long do_mca(XEN_GUEST_HANDLE_PARAM(xen_mc_t) u_xen_mc)
{
put_gfn(d, gfn);
put_domain(d);
- return x86_mcerr("do_mca inject: bad gfn %#lx of domain %d",
- -EINVAL, gfn, domid);
+ ret = -EINVAL;
+ goto out;
}
msr->value = pfn_to_paddr(mfn) | (gaddr & (PAGE_SIZE - 1));
@@ -1498,7 +1517,10 @@ long do_mca(XEN_GUEST_HANDLE_PARAM(xen_mc_t) u_xen_mc)
}
if ( !x86_mc_msrinject_verify(mc_msrinject) )
- return x86_mcerr("do_mca inject: illegal MSR", -EINVAL);
+ {
+ ret = -EINVAL;
+ goto out;
+ }
add_taint(TAINT_ERROR_INJECT);
@@ -1509,16 +1531,19 @@ long do_mca(XEN_GUEST_HANDLE_PARAM(xen_mc_t) u_xen_mc)
case XEN_MC_mceinject:
if ( nr_mce_banks == 0 )
- return x86_mcerr("do_mca #MC", -ENODEV);
+ {
+ ret = -ENODEV;
+ goto out;
+ }
mc_mceinject = &op->u.mc_mceinject;
target = mc_mceinject->mceinj_cpunr;
- if ( target >= nr_cpu_ids )
- return x86_mcerr("do_mca #MC: bad target", -EINVAL);
-
- if ( !cpu_online(target) )
- return x86_mcerr("do_mca #MC: target offline", -EINVAL);
+ if ( target >= nr_cpu_ids || !cpu_online(target) )
+ {
+ ret = -EINVAL;
+ goto out;
+ }
add_taint(TAINT_ERROR_INJECT);
@@ -1536,7 +1561,10 @@ long do_mca(XEN_GUEST_HANDLE_PARAM(xen_mc_t) u_xen_mc)
bool broadcast = op->u.mc_inject_v2.flags & XEN_MC_INJECT_CPU_BROADCAST;
if ( nr_mce_banks == 0 )
- return x86_mcerr("do_mca #MC", -ENODEV);
+ {
+ ret = -ENODEV;
+ goto out;
+ }
if ( broadcast )
cpumap = &cpu_online_map;
@@ -1549,7 +1577,7 @@ long do_mca(XEN_GUEST_HANDLE_PARAM(xen_mc_t) u_xen_mc)
if ( !cpumask_intersects(cpumap, &cpu_online_map) )
{
free_cpumask_var(cmv);
- ret = x86_mcerr("No online CPU passed\n", -EINVAL);
+ ret = -EINVAL;
break;
}
if ( !cpumask_subset(cpumap, &cpu_online_map) )
@@ -1568,7 +1596,7 @@ long do_mca(XEN_GUEST_HANDLE_PARAM(xen_mc_t) u_xen_mc)
case XEN_MC_INJECT_TYPE_CMCI:
if ( !cmci_apic_vector )
- ret = x86_mcerr("No CMCI supported in platform\n", -EINVAL);
+ ret = -EINVAL;
else
{
if ( cpumask_test_cpu(smp_processor_id(), cpumap) )
@@ -1580,26 +1608,25 @@ long do_mca(XEN_GUEST_HANDLE_PARAM(xen_mc_t) u_xen_mc)
case XEN_MC_INJECT_TYPE_LMCE:
if ( !lmce_support )
{
- ret = x86_mcerr("No LMCE support", -EINVAL);
+ ret = -EINVAL;
break;
}
if ( broadcast )
{
- ret = x86_mcerr("Broadcast cannot be used with LMCE", -EINVAL);
+ ret = -EINVAL;
break;
}
/* Ensure at most one CPU is specified. */
if ( nr_cpu_ids > cpumask_next(cpumask_first(cpumap), cpumap) )
{
- ret = x86_mcerr("More than one CPU specified for LMCE",
- -EINVAL);
+ ret = -EINVAL;
break;
}
on_selected_cpus(cpumap, x86_mc_mceinject, NULL, 1);
break;
default:
- ret = x86_mcerr("Wrong mca type\n", -EINVAL);
+ ret = -EINVAL;
break;
}
@@ -1610,9 +1637,13 @@ long do_mca(XEN_GUEST_HANDLE_PARAM(xen_mc_t) u_xen_mc)
}
default:
- return x86_mcerr("do_mca: bad command", -EINVAL);
+ ret = -EINVAL;
+ break;
}
+ out:
+ spin_unlock(&mca_lock);
+
return ret;
}
--
2.1.4
* [PATCH RFC 39/44] x86/smp: Introduce get_smp_ipi_buf() and take more IPI parameters off the stack
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (37 preceding siblings ...)
2018-01-04 20:22 ` [PATCH RFC 38/44] x86/mca: Move __HYPERVISOR_mca " Andrew Cooper
@ 2018-01-04 20:22 ` Andrew Cooper
2018-01-04 20:22 ` [PATCH RFC 40/44] x86/boot: Switch the APs to the percpu pagetables before entering C Andrew Cooper
` (5 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:22 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
A number of hypercalls and softirq tasks pass small stack buffers via IPI.
These all operate sequentially on a single CPU, so introduce a shared per-cpu
bounce buffer for them to use. Access to the buffer is via get_smp_ipi_buf(),
which checks at compile time that the object fits.
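As an illustration of the conversion pattern (a sketch only; demo_info,
collect_info and query_cpu are illustrative names, not code in this patch), a
caller which previously passed a stack variable to on_selected_cpus() now
borrows the per-cpu bounce buffer instead:

/* Sketch: assumes the usual xen/lib.h, xen/smp.h and asm/smp.h includes. */
struct demo_info {
    uint32_t val;
    int rc;
};

/* IPI handler: runs on the target CPU and fills in the caller's buffer. */
static void collect_info(void *arg)
{
    struct demo_info *info = arg;

    info->rc = 0;
    info->val = 42;              /* placeholder for real work */
}

static int query_cpu(unsigned int cpu)
{
    /* Was "struct demo_info info;" - unsafe once stacks become percpu. */
    struct demo_info *info = get_smp_ipi_buf(struct demo_info);

    if ( cpu == smp_processor_id() )
        collect_info(info);
    else
        on_selected_cpus(cpumask_of(cpu), collect_info, info, 1);

    return info->rc;
}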
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/arch/x86/acpi/cpu_idle.c | 30 +++++++++----------
xen/arch/x86/acpi/cpufreq/cpufreq.c | 57 ++++++++++++++++++------------------
xen/arch/x86/acpi/cpufreq/powernow.c | 26 ++++++++--------
xen/arch/x86/platform_hypercall.c | 40 ++++++++++++-------------
xen/arch/x86/psr.c | 9 +++---
xen/arch/x86/pv/pt-shadow.c | 12 ++++----
xen/arch/x86/smp.c | 2 ++
xen/arch/x86/sysctl.c | 10 +++----
xen/include/asm-x86/smp.h | 20 +++++++++++++
9 files changed, 114 insertions(+), 92 deletions(-)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index cb1c5da..0479826 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -246,23 +246,23 @@ static void get_hw_residencies(uint32_t cpu, struct hw_residencies *hw_res)
static void print_hw_residencies(uint32_t cpu)
{
- struct hw_residencies hw_res;
+ struct hw_residencies *hw_res = get_smp_ipi_buf(struct hw_residencies);
- get_hw_residencies(cpu, &hw_res);
+ get_hw_residencies(cpu, hw_res);
- if ( hw_res.mc0 | hw_res.mc6 )
+ if ( hw_res->mc0 | hw_res->mc6 )
printk("MC0[%"PRIu64"] MC6[%"PRIu64"]\n",
- hw_res.mc0, hw_res.mc6);
+ hw_res->mc0, hw_res->mc6);
printk("PC2[%"PRIu64"] PC%d[%"PRIu64"] PC6[%"PRIu64"] PC7[%"PRIu64"]\n",
- hw_res.pc2,
- hw_res.pc4 ? 4 : 3, hw_res.pc4 ?: hw_res.pc3,
- hw_res.pc6, hw_res.pc7);
- if ( hw_res.pc8 | hw_res.pc9 | hw_res.pc10 )
+ hw_res->pc2,
+ hw_res->pc4 ? 4 : 3, hw_res->pc4 ?: hw_res->pc3,
+ hw_res->pc6, hw_res->pc7);
+ if ( hw_res->pc8 | hw_res->pc9 | hw_res->pc10 )
printk("PC8[%"PRIu64"] PC9[%"PRIu64"] PC10[%"PRIu64"]\n",
- hw_res.pc8, hw_res.pc9, hw_res.pc10);
+ hw_res->pc8, hw_res->pc9, hw_res->pc10);
printk("CC%d[%"PRIu64"] CC6[%"PRIu64"] CC7[%"PRIu64"]\n",
- hw_res.cc1 ? 1 : 3, hw_res.cc1 ?: hw_res.cc3,
- hw_res.cc6, hw_res.cc7);
+ hw_res->cc1 ? 1 : 3, hw_res->cc1 ?: hw_res->cc3,
+ hw_res->cc6, hw_res->cc7);
}
static char* acpi_cstate_method_name[] =
@@ -1251,7 +1251,7 @@ int pmstat_get_cx_stat(uint32_t cpuid, struct pm_cx_stat *stat)
}
else
{
- struct hw_residencies hw_res;
+ struct hw_residencies *hw_res = get_smp_ipi_buf(struct hw_residencies);
signed int last_state_idx;
stat->nr = power->count;
@@ -1285,13 +1285,13 @@ int pmstat_get_cx_stat(uint32_t cpuid, struct pm_cx_stat *stat)
idle_res += res[i];
}
- get_hw_residencies(cpuid, &hw_res);
+ get_hw_residencies(cpuid, hw_res);
#define PUT_xC(what, n) do { \
if ( stat->nr_##what >= n && \
- copy_to_guest_offset(stat->what, n - 1, &hw_res.what##n, 1) ) \
+ copy_to_guest_offset(stat->what, n - 1, &hw_res->what##n, 1) ) \
return -EFAULT; \
- if ( hw_res.what##n ) \
+ if ( hw_res->what##n ) \
nr_##what = n; \
} while ( 0 )
#define PUT_PC(n) PUT_xC(pc, n)
diff --git a/xen/arch/x86/acpi/cpufreq/cpufreq.c b/xen/arch/x86/acpi/cpufreq/cpufreq.c
index 1f8d02a..f295e1e 100644
--- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
+++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
@@ -198,7 +198,7 @@ static u32 get_cur_val(const cpumask_t *mask)
{
struct cpufreq_policy *policy;
struct processor_performance *perf;
- struct drv_cmd cmd;
+ struct drv_cmd *cmd = get_smp_ipi_buf(struct drv_cmd);
unsigned int cpu = smp_processor_id();
if (unlikely(cpumask_empty(mask)))
@@ -215,23 +215,23 @@ static u32 get_cur_val(const cpumask_t *mask)
switch (cpufreq_drv_data[policy->cpu]->arch_cpu_flags) {
case SYSTEM_INTEL_MSR_CAPABLE:
- cmd.type = SYSTEM_INTEL_MSR_CAPABLE;
- cmd.addr.msr.reg = MSR_IA32_PERF_STATUS;
+ cmd->type = SYSTEM_INTEL_MSR_CAPABLE;
+ cmd->addr.msr.reg = MSR_IA32_PERF_STATUS;
break;
case SYSTEM_IO_CAPABLE:
- cmd.type = SYSTEM_IO_CAPABLE;
+ cmd->type = SYSTEM_IO_CAPABLE;
perf = cpufreq_drv_data[policy->cpu]->acpi_data;
- cmd.addr.io.port = perf->control_register.address;
- cmd.addr.io.bit_width = perf->control_register.bit_width;
+ cmd->addr.io.port = perf->control_register.address;
+ cmd->addr.io.bit_width = perf->control_register.bit_width;
break;
default:
return 0;
}
- cmd.mask = cpumask_of(cpu);
+ cmd->mask = cpumask_of(cpu);
- drv_read(&cmd);
- return cmd.val;
+ drv_read(cmd);
+ return cmd->val;
}
struct perf_pair {
@@ -270,7 +270,7 @@ static void read_measured_perf_ctrs(void *_readin)
unsigned int get_measured_perf(unsigned int cpu, unsigned int flag)
{
struct cpufreq_policy *policy;
- struct perf_pair readin, cur, *saved;
+ struct perf_pair *readin = get_smp_ipi_buf(struct perf_pair), cur, *saved;
unsigned int perf_percent;
unsigned int retval;
@@ -298,16 +298,15 @@ unsigned int get_measured_perf(unsigned int cpu, unsigned int flag)
}
if (cpu == smp_processor_id()) {
- read_measured_perf_ctrs((void *)&readin);
+ read_measured_perf_ctrs(readin);
} else {
- on_selected_cpus(cpumask_of(cpu), read_measured_perf_ctrs,
- &readin, 1);
+ on_selected_cpus(cpumask_of(cpu), read_measured_perf_ctrs, readin, 1);
}
- cur.aperf.whole = readin.aperf.whole - saved->aperf.whole;
- cur.mperf.whole = readin.mperf.whole - saved->mperf.whole;
- saved->aperf.whole = readin.aperf.whole;
- saved->mperf.whole = readin.mperf.whole;
+ cur.aperf.whole = readin->aperf.whole - saved->aperf.whole;
+ cur.mperf.whole = readin->mperf.whole - saved->mperf.whole;
+ saved->aperf.whole = readin->aperf.whole;
+ saved->mperf.whole = readin->mperf.whole;
if (unlikely(((unsigned long)(-1) / 100) < cur.aperf.whole)) {
int shift_count = 7;
@@ -389,7 +388,7 @@ static int acpi_cpufreq_target(struct cpufreq_policy *policy,
struct processor_performance *perf;
struct cpufreq_freqs freqs;
cpumask_t online_policy_cpus;
- struct drv_cmd cmd;
+ struct drv_cmd *cmd = get_smp_ipi_buf(struct drv_cmd);
unsigned int next_state = 0; /* Index into freq_table */
unsigned int next_perf_state = 0; /* Index into perf table */
unsigned int j;
@@ -424,31 +423,31 @@ static int acpi_cpufreq_target(struct cpufreq_policy *policy,
switch (data->arch_cpu_flags) {
case SYSTEM_INTEL_MSR_CAPABLE:
- cmd.type = SYSTEM_INTEL_MSR_CAPABLE;
- cmd.addr.msr.reg = MSR_IA32_PERF_CTL;
- cmd.val = (u32) perf->states[next_perf_state].control;
+ cmd->type = SYSTEM_INTEL_MSR_CAPABLE;
+ cmd->addr.msr.reg = MSR_IA32_PERF_CTL;
+ cmd->val = (u32) perf->states[next_perf_state].control;
break;
case SYSTEM_IO_CAPABLE:
- cmd.type = SYSTEM_IO_CAPABLE;
- cmd.addr.io.port = perf->control_register.address;
- cmd.addr.io.bit_width = perf->control_register.bit_width;
- cmd.val = (u32) perf->states[next_perf_state].control;
+ cmd->type = SYSTEM_IO_CAPABLE;
+ cmd->addr.io.port = perf->control_register.address;
+ cmd->addr.io.bit_width = perf->control_register.bit_width;
+ cmd->val = (u32) perf->states[next_perf_state].control;
break;
default:
return -ENODEV;
}
if (policy->shared_type != CPUFREQ_SHARED_TYPE_ANY)
- cmd.mask = &online_policy_cpus;
+ cmd->mask = &online_policy_cpus;
else
- cmd.mask = cpumask_of(policy->cpu);
+ cmd->mask = cpumask_of(policy->cpu);
freqs.old = perf->states[perf->state].core_frequency * 1000;
freqs.new = data->freq_table[next_state].frequency;
- drv_write(&cmd);
+ drv_write(cmd);
- if (acpi_pstate_strict && !check_freqs(cmd.mask, freqs.new, data)) {
+ if (acpi_pstate_strict && !check_freqs(cmd->mask, freqs.new, data)) {
printk(KERN_WARNING "Fail transfer to new freq %d\n", freqs.new);
return -EAGAIN;
}
diff --git a/xen/arch/x86/acpi/cpufreq/powernow.c b/xen/arch/x86/acpi/cpufreq/powernow.c
index 8f1ac74..72d95b7 100644
--- a/xen/arch/x86/acpi/cpufreq/powernow.c
+++ b/xen/arch/x86/acpi/cpufreq/powernow.c
@@ -94,7 +94,7 @@ static int powernow_cpufreq_target(struct cpufreq_policy *policy,
struct acpi_cpufreq_data *data = cpufreq_drv_data[policy->cpu];
struct processor_performance *perf;
unsigned int next_state; /* Index into freq_table */
- unsigned int next_perf_state; /* Index into perf table */
+ unsigned int *next_perf_state = get_smp_ipi_buf(unsigned int);
int result;
if (unlikely(data == NULL ||
@@ -110,8 +110,8 @@ static int powernow_cpufreq_target(struct cpufreq_policy *policy,
if (unlikely(result))
return result;
- next_perf_state = data->freq_table[next_state].index;
- if (perf->state == next_perf_state) {
+ *next_perf_state = data->freq_table[next_state].index;
+ if (perf->state == *next_perf_state) {
if (unlikely(data->arch_cpu_flags & ARCH_CPU_FLAG_RESUME))
data->arch_cpu_flags &= ~ARCH_CPU_FLAG_RESUME;
else
@@ -120,8 +120,8 @@ static int powernow_cpufreq_target(struct cpufreq_policy *policy,
if (policy->shared_type == CPUFREQ_SHARED_TYPE_HW &&
likely(policy->cpu == smp_processor_id())) {
- transition_pstate(&next_perf_state);
- cpufreq_statistic_update(policy->cpu, perf->state, next_perf_state);
+ transition_pstate(next_perf_state);
+ cpufreq_statistic_update(policy->cpu, perf->state, *next_perf_state);
} else {
cpumask_t online_policy_cpus;
unsigned int cpu;
@@ -131,15 +131,15 @@ static int powernow_cpufreq_target(struct cpufreq_policy *policy,
if (policy->shared_type == CPUFREQ_SHARED_TYPE_ALL ||
unlikely(policy->cpu != smp_processor_id()))
on_selected_cpus(&online_policy_cpus, transition_pstate,
- &next_perf_state, 1);
+ next_perf_state, 1);
else
- transition_pstate(&next_perf_state);
+ transition_pstate(next_perf_state);
for_each_cpu(cpu, &online_policy_cpus)
- cpufreq_statistic_update(cpu, perf->state, next_perf_state);
+ cpufreq_statistic_update(cpu, perf->state, *next_perf_state);
}
- perf->state = next_perf_state;
+ perf->state = *next_perf_state;
policy->cur = data->freq_table[next_state].frequency;
return 0;
@@ -236,7 +236,7 @@ static int powernow_cpufreq_cpu_init(struct cpufreq_policy *policy)
struct acpi_cpufreq_data *data;
unsigned int result = 0;
struct processor_performance *perf;
- struct amd_cpu_data info;
+ struct amd_cpu_data *info = get_smp_ipi_buf(struct amd_cpu_data);
struct cpuinfo_x86 *c = &cpu_data[policy->cpu];
data = xzalloc(struct acpi_cpufreq_data);
@@ -247,7 +247,7 @@ static int powernow_cpufreq_cpu_init(struct cpufreq_policy *policy)
data->acpi_data = &processor_pminfo[cpu]->perf;
- info.perf = perf = data->acpi_data;
+ info->perf = perf = data->acpi_data;
policy->shared_type = perf->shared_type;
if (policy->shared_type == CPUFREQ_SHARED_TYPE_ALL ||
@@ -293,10 +293,10 @@ static int powernow_cpufreq_cpu_init(struct cpufreq_policy *policy)
policy->governor = cpufreq_opt_governor ? : CPUFREQ_DEFAULT_GOVERNOR;
- on_selected_cpus(cpumask_of(cpu), get_cpu_data, &info, 1);
+ on_selected_cpus(cpumask_of(cpu), get_cpu_data, info, 1);
/* table init */
- for (i = 0; i < perf->state_count && i <= info.max_hw_pstate; i++) {
+ for (i = 0; i < perf->state_count && i <= info->max_hw_pstate; i++) {
if (i > 0 && perf->states[i].core_frequency >=
data->freq_table[valid_states-1].frequency / 1000)
continue;
diff --git a/xen/arch/x86/platform_hypercall.c b/xen/arch/x86/platform_hypercall.c
index ebc2f39..4439bf9 100644
--- a/xen/arch/x86/platform_hypercall.c
+++ b/xen/arch/x86/platform_hypercall.c
@@ -728,21 +728,21 @@ ret_t do_platform_op(XEN_GUEST_HANDLE_PARAM(xen_platform_op_t) u_xenpf_op)
case XENPF_resource_op:
{
- struct resource_access ra;
+ struct resource_access *ra = get_smp_ipi_buf(struct resource_access);
unsigned int cpu;
XEN_GUEST_HANDLE(xenpf_resource_entry_t) guest_entries;
- ra.nr_entries = op->u.resource_op.nr_entries;
- if ( ra.nr_entries == 0 )
+ ra->nr_entries = op->u.resource_op.nr_entries;
+ if ( ra->nr_entries == 0 )
break;
- if ( ra.nr_entries > RESOURCE_ACCESS_MAX_ENTRIES )
+ if ( ra->nr_entries > RESOURCE_ACCESS_MAX_ENTRIES )
{
ret = -EINVAL;
break;
}
- ra.entries = xmalloc_array(xenpf_resource_entry_t, ra.nr_entries);
- if ( !ra.entries )
+ ra->entries = xmalloc_array(xenpf_resource_entry_t, ra->nr_entries);
+ if ( !ra->entries )
{
ret = -ENOMEM;
break;
@@ -750,46 +750,46 @@ ret_t do_platform_op(XEN_GUEST_HANDLE_PARAM(xen_platform_op_t) u_xenpf_op)
guest_from_compat_handle(guest_entries, op->u.resource_op.entries);
- if ( copy_from_guest(ra.entries, guest_entries, ra.nr_entries) )
+ if ( copy_from_guest(ra->entries, guest_entries, ra->nr_entries) )
{
- xfree(ra.entries);
+ xfree(ra->entries);
ret = -EFAULT;
break;
}
/* Do sanity check earlier to omit the potential IPI overhead. */
- check_resource_access(&ra);
- if ( ra.nr_done == 0 )
+ check_resource_access(ra);
+ if ( ra->nr_done == 0 )
{
/* Copy the return value for entry 0 if it failed. */
- if ( __copy_to_guest(guest_entries, ra.entries, 1) )
+ if ( __copy_to_guest(guest_entries, ra->entries, 1) )
ret = -EFAULT;
- xfree(ra.entries);
+ xfree(ra->entries);
break;
}
cpu = op->u.resource_op.cpu;
if ( (cpu >= nr_cpu_ids) || !cpu_online(cpu) )
{
- xfree(ra.entries);
+ xfree(ra->entries);
ret = -ENODEV;
break;
}
if ( cpu == smp_processor_id() )
- resource_access(&ra);
+ resource_access(ra);
else
- on_selected_cpus(cpumask_of(cpu), resource_access, &ra, 1);
+ on_selected_cpus(cpumask_of(cpu), resource_access, ra, 1);
/* Copy all if succeeded or up to the failed entry. */
- if ( __copy_to_guest(guest_entries, ra.entries,
- ra.nr_done < ra.nr_entries ? ra.nr_done + 1
- : ra.nr_entries) )
+ if ( __copy_to_guest(guest_entries, ra->entries,
+ ra->nr_done < ra->nr_entries ? ra->nr_done + 1
+ : ra->nr_entries) )
ret = -EFAULT;
else
- ret = ra.nr_done;
+ ret = ra->nr_done;
- xfree(ra.entries);
+ xfree(ra->entries);
}
break;
diff --git a/xen/arch/x86/psr.c b/xen/arch/x86/psr.c
index 0ba8ef8..a6f6fb3 100644
--- a/xen/arch/x86/psr.c
+++ b/xen/arch/x86/psr.c
@@ -1285,8 +1285,9 @@ static int write_psr_msrs(unsigned int socket, unsigned int cos,
enum psr_feat_type feat_type)
{
struct psr_socket_info *info = get_socket_info(socket);
- struct cos_write_info data =
- {
+ struct cos_write_info *data = get_smp_ipi_buf(struct cos_write_info);
+
+ *data = (struct cos_write_info){
.cos = cos,
.val = val,
.array_len = array_len,
@@ -1296,14 +1297,14 @@ static int write_psr_msrs(unsigned int socket, unsigned int cos,
return -EINVAL;
if ( socket == cpu_to_socket(smp_processor_id()) )
- do_write_psr_msrs(&data);
+ do_write_psr_msrs(data);
else
{
unsigned int cpu = get_socket_cpu(socket);
if ( cpu >= nr_cpu_ids )
return -ENOTSOCK;
- on_selected_cpus(cpumask_of(cpu), do_write_psr_msrs, &data, 1);
+ on_selected_cpus(cpumask_of(cpu), do_write_psr_msrs, data, 1);
}
return 0;
diff --git a/xen/arch/x86/pv/pt-shadow.c b/xen/arch/x86/pv/pt-shadow.c
index b4f2b86..d550ae1 100644
--- a/xen/arch/x86/pv/pt-shadow.c
+++ b/xen/arch/x86/pv/pt-shadow.c
@@ -367,35 +367,35 @@ static void _pt_shadow_ipi(void *arg)
void pt_shadow_l4_write(const struct domain *d, const struct page_info *pg,
unsigned int slot)
{
- struct ptsh_ipi_info info;
+ struct ptsh_ipi_info *info = get_smp_ipi_buf(struct ptsh_ipi_info);
if ( !pt_need_shadow(d) )
return;
- info = (struct ptsh_ipi_info){
+ *info = (struct ptsh_ipi_info){
.d = d,
.pg = pg,
.op = PTSH_IPI_WRITE,
.slot = slot,
};
- on_selected_cpus(d->domain_dirty_cpumask, _pt_shadow_ipi, &info, 1);
+ on_selected_cpus(d->domain_dirty_cpumask, _pt_shadow_ipi, info, 1);
}
void pt_shadow_l4_invlpg(const struct domain *d, const struct page_info *pg)
{
- struct ptsh_ipi_info info;
+ struct ptsh_ipi_info *info = get_smp_ipi_buf(struct ptsh_ipi_info);
if ( !pt_need_shadow(d) )
return;
- info = (struct ptsh_ipi_info){
+ *info = (struct ptsh_ipi_info){
.d = d,
.pg = pg,
.op = PTSH_IPI_INVLPG,
};
- on_selected_cpus(d->domain_dirty_cpumask, _pt_shadow_ipi, &info, 1);
+ on_selected_cpus(d->domain_dirty_cpumask, _pt_shadow_ipi, info, 1);
}
/*
diff --git a/xen/arch/x86/smp.c b/xen/arch/x86/smp.c
index fd6d254..68d3af0 100644
--- a/xen/arch/x86/smp.c
+++ b/xen/arch/x86/smp.c
@@ -22,6 +22,8 @@
#include <asm/hvm/support.h>
#include <mach_apic.h>
+DEFINE_PER_CPU(struct smp_ipi_buf, smp_ipi_buf);
+
/*
* send_IPI_mask(cpumask, vector): sends @vector IPI to CPUs in @cpumask,
* excluding the local CPU. @cpumask may be empty.
diff --git a/xen/arch/x86/sysctl.c b/xen/arch/x86/sysctl.c
index 4d372db..7ecf8df 100644
--- a/xen/arch/x86/sysctl.c
+++ b/xen/arch/x86/sysctl.c
@@ -139,7 +139,7 @@ long arch_do_sysctl(
break;
case XEN_SYSCTL_PSR_CMT_get_l3_cache_size:
{
- struct l3_cache_info info;
+ struct l3_cache_info *info = get_smp_ipi_buf(struct l3_cache_info);
unsigned int cpu = sysctl->u.psr_cmt_op.u.l3_cache.cpu;
if ( (cpu >= nr_cpu_ids) || !cpu_online(cpu) )
@@ -149,12 +149,12 @@ long arch_do_sysctl(
break;
}
if ( cpu == smp_processor_id() )
- l3_cache_get(&info);
+ l3_cache_get(info);
else
- on_selected_cpus(cpumask_of(cpu), l3_cache_get, &info, 1);
+ on_selected_cpus(cpumask_of(cpu), l3_cache_get, info, 1);
- ret = info.ret;
- sysctl->u.psr_cmt_op.u.data = (ret ? 0 : info.size);
+ ret = info->ret;
+ sysctl->u.psr_cmt_op.u.data = (ret ? 0 : info->size);
break;
}
case XEN_SYSCTL_PSR_CMT_get_l3_event_mask:
diff --git a/xen/include/asm-x86/smp.h b/xen/include/asm-x86/smp.h
index 46bbf0d..d915c1e 100644
--- a/xen/include/asm-x86/smp.h
+++ b/xen/include/asm-x86/smp.h
@@ -13,6 +13,7 @@
#ifndef __ASSEMBLY__
#include <xen/bitops.h>
#include <asm/mpspec.h>
+#include <asm/hardirq.h>
#endif
#define BAD_APICID (-1U)
@@ -89,6 +90,25 @@ static inline bool arch_ipi_param_ok(const void *_param)
l4_table_offset(param) != l4_table_offset(PERCPU_LINEAR_START));
}
+struct smp_ipi_buf {
+#define SMP_IPI_BUF_SZ 0x70
+ char OPAQUE[SMP_IPI_BUF_SZ];
+};
+DECLARE_PER_CPU(struct smp_ipi_buf, smp_ipi_buf);
+
+/*
+ * Wrapper to obtain an IPI bounce buffer, checking that the object fits.
+ * The buffer is shared by the sequential IPI users on a CPU, so must not be
+ * used from IRQ context (hence the ASSERT).  The choice of SMP_IPI_BUF_SZ is
+ * arbitrary, and should be the size of
+ * the largest object passed into an IPI.
+ */
+#define get_smp_ipi_buf(obj) \
+ ({ \
+ typeof(obj) *_o = (void *)this_cpu(smp_ipi_buf).OPAQUE; \
+ BUILD_BUG_ON(sizeof(obj) > SMP_IPI_BUF_SZ); \
+ ASSERT(!in_irq()); \
+ _o; \
+ })
+
#endif /* !__ASSEMBLY__ */
#endif
--
2.1.4
* [PATCH RFC 40/44] x86/boot: Switch the APs to the percpu pagetables before entering C
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (38 preceding siblings ...)
2018-01-04 20:22 ` [PATCH RFC 39/44] x86/smp: Introduce get_smp_ipi_buf() and take more " Andrew Cooper
@ 2018-01-04 20:22 ` Andrew Cooper
2018-01-04 20:22 ` [PATCH RFC 41/44] x86/smp: Switch to using the percpu stacks Andrew Cooper
` (4 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:22 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
This is in preparation for the APs to switch to their percpu stack before
entering C.
This requires splitting the BSP and AP paths in __high_start(), and for
do_boot_cpu() to pass the appropriate pagetables. The result is that
early_switch_to_idle() no longer needs to switch pagetables, but the switch
does need to be retained for the BSP.
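In C terms, the handoff amounts to the following (a sketch only;
publish_ap_cr3() and consume_ap_cr3() are illustrative names, and the AP side
is really the new __high_start() assembly below):

/*
 * BSP side, from do_boot_cpu().  APs are woken strictly one at a time, so a
 * single global handoff variable is sufficient.
 */
static void publish_ap_cr3(unsigned int cpu)
{
    ap_cr3 = per_cpu(percpu_idle_pt, cpu);
}

/* AP side - conceptually what the new __high_start() assembly does. */
static void consume_ap_cr3(void)
{
    asm volatile ( "mov %0, %%cr3" :: "r" (ap_cr3) : "memory" );
}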
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/arch/x86/boot/x86_64.S | 13 +++++++++++++
xen/arch/x86/setup.c | 21 ++++++++++++---------
xen/arch/x86/smpboot.c | 4 +++-
3 files changed, 28 insertions(+), 10 deletions(-)
diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
index 925fd4b..b1f0457 100644
--- a/xen/arch/x86/boot/x86_64.S
+++ b/xen/arch/x86/boot/x86_64.S
@@ -15,6 +15,19 @@ ENTRY(__high_start)
mov $XEN_MINIMAL_CR4,%rcx
mov %rcx,%cr4
+ /* Set up %cr3 (differs between BSP and APs). */
+ test %ebx, %ebx
+ jz .Lbsp_setup
+
+ /* APs switch onto percpu_idle_pt[], as provided by do_boot_cpu(). */
+ mov ap_cr3(%rip), %rax
+ mov %rax, %cr3
+ jmp .Ldone
+
+.Lbsp_setup:
+ /* The BSP stays on the idle_pg_table[] during early boot. */
+.Ldone:
+
mov stack_start(%rip),%rsp
or $(STACK_SIZE-CPUINFO_sizeof),%rsp
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 39d1592..d624b95 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -241,7 +241,6 @@ void early_switch_to_idle(bool bsp)
{
unsigned int cpu = smp_processor_id();
struct vcpu *v = idle_vcpu[cpu];
- unsigned long cr4 = read_cr4();
/*
* VT-x hardwires the GDT and IDT limit at 0xffff on VMExit.
@@ -264,14 +263,6 @@ void early_switch_to_idle(bool bsp)
per_cpu(curr_vcpu, cpu) = v;
__set_bit(_PGC_inuse_pgtable, &maddr_to_page(v->arch.cr3)->count_info);
- asm volatile ( "mov %[npge], %%cr4;"
- "mov %[cr3], %%cr3;"
- "mov %[pge], %%cr4;"
- ::
- [npge] "r" (cr4 & ~X86_CR4_PGE),
- [cr3] "r" (v->arch.cr3),
- [pge] "r" (cr4)
- : "memory" );
per_cpu(curr_ptbase, cpu) = v->arch.cr3;
per_cpu(curr_extended_directmap, cpu) = true;
@@ -286,7 +277,19 @@ void early_switch_to_idle(bool bsp)
static void __init init_idle_domain(void)
{
+ unsigned long cr4 = read_cr4();
+
scheduler_init();
+
+ asm volatile ( "mov %[npge], %%cr4;"
+ "mov %[cr3], %%cr3;"
+ "mov %[pge], %%cr4;"
+ ::
+ [npge] "r" (cr4 & ~X86_CR4_PGE),
+ [cr3] "r" (idle_vcpu[0]->arch.cr3),
+ [pge] "r" (cr4)
+ : "memory" );
+
early_switch_to_idle(true);
}
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 1bf6dc1..f785d5f 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -309,7 +309,6 @@ void start_secondary(void *unused)
/* Critical region without IDT or TSS. Any fault is deadly! */
set_processor_id(cpu);
- get_cpu_info()->cr4 = XEN_MINIMAL_CR4;
early_switch_to_idle(false);
@@ -385,6 +384,8 @@ void start_secondary(void *unused)
startup_cpu_idle_loop();
}
+/* Used to pass percpu_idle_pt to the booting AP. */
+paddr_t ap_cr3;
extern void *stack_start;
static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
@@ -527,6 +528,7 @@ static int do_boot_cpu(int apicid, int cpu)
printk("Booting processor %d/%d eip %lx\n",
cpu, apicid, start_eip);
+ ap_cr3 = per_cpu(percpu_idle_pt, cpu);
stack_start = stack_base[cpu];
/* This grunge runs the startup process for the targeted processor. */
--
2.1.4
* [PATCH RFC 41/44] x86/smp: Switch to using the percpu stacks
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (39 preceding siblings ...)
2018-01-04 20:22 ` [PATCH RFC 40/44] x86/boot: Switch the APs to the percpu pagetables before entering C Andrew Cooper
@ 2018-01-04 20:22 ` Andrew Cooper
2018-01-04 20:22 ` [PATCH RFC 42/44] x86/smp: Allocate a percpu linear range for the TSS Andrew Cooper
` (3 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:22 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
This is very easy for the APs. __high_start() is modified to switch stacks
before entering C. The BSP however is more complicated, and needs to stay on
cpu0_stack[] until setup is complete.
The end of __start_xen() is modified to copy the top-of-stack data to the
percpu stack immediately before jumping there. The VMCS Host and SYSENTER
stack pointers are adjusted accordingly, and become constants set at VMCS
construction time.
The stack_start variable and stack_base[] array are removed completely, along
with the memguard_guard_stack() infrastructure. The STACK_ORDER xenheap
allocations are no longer needed, and stacks for higher-numbered CPUs on large
machines are finally NUMA-local.
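The address arithmetic behind the new constants is summarised below (a sketch;
PERCPU_CPU_INFO, PERCPU_VMX_HOST_RSP and PERCPU_SYSENTER_ESP are illustrative
names - the patch open-codes the expressions):

/*
 * Every CPU maps its own stack at the fixed virtual address
 * PERCPU_STACK_MAPPING, so the struct cpu_info at the top of the stack
 * allocation sits at the same virtual address on every CPU.
 */
#define PERCPU_CPU_INFO \
    (PERCPU_STACK_MAPPING + STACK_SIZE - sizeof(struct cpu_info))

/* VMExit doesn't push an exception frame, so enter at error_code. */
#define PERCPU_VMX_HOST_RSP \
    (PERCPU_CPU_INFO + \
     offsetof(struct cpu_info, guest_cpu_user_regs.error_code))

/* SYSENTER enters at the top of the GPR frame (the 'es' field). */
#define PERCPU_SYSENTER_ESP \
    (PERCPU_CPU_INFO + offsetof(struct cpu_info, guest_cpu_user_regs.es))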
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/arch/x86/boot/x86_64.S | 15 ++++++++-------
xen/arch/x86/efi/efi-boot.h | 8 ++++----
xen/arch/x86/hvm/vmx/vmcs.c | 21 ++++++++++-----------
xen/arch/x86/mm.c | 15 ---------------
xen/arch/x86/setup.c | 29 +++++++++++++++++++----------
xen/arch/x86/smpboot.c | 18 ------------------
xen/arch/x86/tboot.c | 29 +----------------------------
xen/arch/x86/traps.c | 10 ++--------
xen/include/asm-arm/mm.h | 1 -
xen/include/asm-x86/mm.h | 3 ---
xen/include/xen/smp.h | 2 --
11 files changed, 44 insertions(+), 107 deletions(-)
diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
index b1f0457..ed4c805 100644
--- a/xen/arch/x86/boot/x86_64.S
+++ b/xen/arch/x86/boot/x86_64.S
@@ -15,21 +15,25 @@ ENTRY(__high_start)
mov $XEN_MINIMAL_CR4,%rcx
mov %rcx,%cr4
- /* Set up %cr3 (differs between BSP and APs). */
+ /* Set up %cr3 and %rsp (differs between BSP and APs). */
test %ebx, %ebx
jz .Lbsp_setup
/* APs switch onto percpu_idle_pt[], as provided by do_boot_cpu(). */
mov ap_cr3(%rip), %rax
mov %rax, %cr3
+
+ /* APs move straight onto the PERCPU stack. */
+ movabs $STACK_SIZE - CPUINFO_sizeof + PERCPU_STACK_MAPPING, %rsp
+
jmp .Ldone
.Lbsp_setup:
/* The BSP stays on the idle_pg_table[] during early boot. */
-.Ldone:
- mov stack_start(%rip),%rsp
- or $(STACK_SIZE-CPUINFO_sizeof),%rsp
+ /* The BSP starts on cpu0_stack. */
+ lea STACK_SIZE - CPUINFO_sizeof + cpu0_stack(%rip), %rsp
+.Ldone:
/* Reset EFLAGS (subsumes CLI and CLD). */
pushq $0
@@ -61,9 +65,6 @@ GLOBAL(gdt_descr)
.word LAST_RESERVED_GDT_BYTE
.quad boot_cpu_gdt_table - FIRST_RESERVED_GDT_BYTE
-GLOBAL(stack_start)
- .quad cpu0_stack
-
.section .data.page_aligned, "aw", @progbits
.align PAGE_SIZE, 0
GLOBAL(boot_cpu_gdt_table)
diff --git a/xen/arch/x86/efi/efi-boot.h b/xen/arch/x86/efi/efi-boot.h
index d30f688..8af661b 100644
--- a/xen/arch/x86/efi/efi-boot.h
+++ b/xen/arch/x86/efi/efi-boot.h
@@ -251,15 +251,15 @@ static void __init noreturn efi_arch_post_exit_boot(void)
#endif
"movabs $__start_xen, %[rip]\n\t"
"lgdt gdt_descr(%%rip)\n\t"
- "mov stack_start(%%rip), %%rsp\n\t"
+ "lea %c[stkoff] + cpu0_stack(%%rip), %%rsp\n\t"
"mov %[ds], %%ss\n\t"
"mov %[ds], %%ds\n\t"
"mov %[ds], %%es\n\t"
"mov %[ds], %%fs\n\t"
"mov %[ds], %%gs\n\t"
- "movl %[cs], 8(%%rsp)\n\t"
- "mov %[rip], (%%rsp)\n\t"
- "lretq %[stkoff]-16"
+ "push %[cs]\n\t"
+ "push %[rip]\n\t"
+ "lretq"
: [rip] "=&r" (efer/* any dead 64-bit variable */),
[cr4] "+&r" (cr4)
: [cr3] "r" (idle_pg_table),
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 795210f..483f72d 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -804,15 +804,6 @@ static void vmx_set_host_env(struct vcpu *v)
__vmwrite(HOST_TR_BASE, (unsigned long)&per_cpu(init_tss, cpu));
- __vmwrite(HOST_SYSENTER_ESP, get_stack_bottom());
-
- /*
- * Skip end of cpu_user_regs when entering the hypervisor because the
- * CPU does not save context onto the stack. SS,RSP,CS,RIP,RFLAGS,etc
- * all get saved into the VMCS instead.
- */
- __vmwrite(HOST_RSP,
- (unsigned long)&get_cpu_info()->guest_cpu_user_regs.error_code);
}
void vmx_clear_msr_intercept(struct vcpu *v, unsigned int msr,
@@ -1148,13 +1139,21 @@ static int construct_vmcs(struct vcpu *v)
__vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);
__vmwrite(HOST_CR4, mmu_cr4_features);
- /* Host CS:RIP. */
+ /* Host code/stack. */
__vmwrite(HOST_CS_SELECTOR, __HYPERVISOR_CS);
__vmwrite(HOST_RIP, (unsigned long)vmx_asm_vmexit_handler);
+ __vmwrite(HOST_RSP, /* VMExit doesn't push an exception frame. */
+ (PERCPU_STACK_MAPPING + STACK_SIZE -
+ sizeof(struct cpu_info) +
+ offsetof(struct cpu_info, guest_cpu_user_regs.error_code)));
- /* Host SYSENTER CS:RIP. */
+ /* Host SYSENTER code/stack. */
__vmwrite(HOST_SYSENTER_CS, __HYPERVISOR_CS);
__vmwrite(HOST_SYSENTER_EIP, (unsigned long)sysenter_entry);
+ __vmwrite(HOST_SYSENTER_ESP,
+ (PERCPU_STACK_MAPPING + STACK_SIZE -
+ sizeof(struct cpu_info) +
+ offsetof(struct cpu_info, guest_cpu_user_regs.es)));
/* MSR intercepts. */
__vmwrite(VM_EXIT_MSR_LOAD_COUNT, 0);
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 933bd67..cb54921 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5281,21 +5281,6 @@ void memguard_unguard_range(void *p, unsigned long l)
#endif
-void memguard_guard_stack(void *p)
-{
- BUILD_BUG_ON((PRIMARY_STACK_SIZE + PAGE_SIZE) > STACK_SIZE);
- p = (void *)((unsigned long)p + STACK_SIZE -
- PRIMARY_STACK_SIZE - PAGE_SIZE);
- memguard_guard_range(p, PAGE_SIZE);
-}
-
-void memguard_unguard_stack(void *p)
-{
- p = (void *)((unsigned long)p + STACK_SIZE -
- PRIMARY_STACK_SIZE - PAGE_SIZE);
- memguard_unguard_range(p, PAGE_SIZE);
-}
-
void arch_dump_shared_mem_info(void)
{
printk("Shared frames %u -- Saved frames %u\n",
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index d624b95..c0f7289 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -651,8 +651,6 @@ static void noinline init_done(void)
/* Reinitalise all state referring to the old virtual address of the stack. */
static void __init noreturn reinit_bsp_stack(void)
{
- unsigned long *stack = (void*)(get_stack_bottom() & ~(STACK_SIZE - 1));
-
/* Sanity check that IST settings weren't set up before this point. */
ASSERT(MASK_EXTR(idt_tables[0][TRAP_nmi].a, 7UL << 32) == 0);
@@ -664,9 +662,6 @@ static void __init noreturn reinit_bsp_stack(void)
/* Update SYSCALL trampolines */
percpu_traps_init();
- stack_base[0] = stack;
- memguard_guard_stack(stack);
-
reset_stack_and_jump(init_done);
}
@@ -1744,11 +1739,25 @@ void __init noreturn __start_xen(unsigned long mbi_p)
setup_io_bitmap(dom0);
- /* Jump to the 1:1 virtual mappings of cpu0_stack. */
- asm volatile ("mov %[stk], %%rsp; jmp %c[fn]" ::
- [stk] "g" (__va(__pa(get_stack_bottom()))),
- [fn] "i" (reinit_bsp_stack) : "memory");
- unreachable();
+ /*
+ * Switch from cpu0_stack to the percpu stack, copying the non-GPR
+ * cpu_info data into place before hand.
+ */
+ {
+ const struct cpu_info *src = get_cpu_info();
+ struct cpu_info *dst = _p(PERCPU_STACK_MAPPING + STACK_SIZE -
+ sizeof(*dst));
+
+ dst->processor_id = src->processor_id;
+ dst->current_vcpu = src->current_vcpu;
+ dst->per_cpu_offset = src->per_cpu_offset;
+ dst->cr4 = src->cr4;
+
+ asm volatile ("mov %[stk], %%rsp; jmp %c[fn]" ::
+ [stk] "g" (&dst->guest_cpu_user_regs.es),
+ [fn] "i" (reinit_bsp_stack) : "memory");
+ unreachable();
+ }
}
void arch_get_xen_caps(xen_capabilities_info_t *info)
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index f785d5f..77ee883 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -91,8 +91,6 @@ static enum cpu_state {
} cpu_state;
#define set_cpu_state(state) do { smp_mb(); cpu_state = (state); } while (0)
-void *stack_base[NR_CPUS];
-
void initialize_cpu_data(unsigned int cpu)
{
cpu_data[cpu] = boot_cpu_data;
@@ -386,7 +384,6 @@ void start_secondary(void *unused)
/* Used to pass percpu_idle_pt to the booting AP. */
paddr_t ap_cr3;
-extern void *stack_start;
static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
{
@@ -529,7 +526,6 @@ static int do_boot_cpu(int apicid, int cpu)
cpu, apicid, start_eip);
ap_cr3 = per_cpu(percpu_idle_pt, cpu);
- stack_start = stack_base[cpu];
/* This grunge runs the startup process for the targeted processor. */
@@ -1002,13 +998,6 @@ static void cpu_smpboot_free(unsigned int cpu)
free_xenheap_page(idt_tables[cpu]);
idt_tables[cpu] = NULL;
- if ( stack_base[cpu] != NULL )
- {
- memguard_unguard_stack(stack_base[cpu]);
- free_xenheap_pages(stack_base[cpu], STACK_ORDER);
- stack_base[cpu] = NULL;
- }
-
if ( per_cpu(percpu_idle_pt, cpu) )
{
free_domheap_page(maddr_to_page(per_cpu(percpu_idle_pt, cpu)));
@@ -1030,11 +1019,6 @@ static int cpu_smpboot_alloc(unsigned int cpu)
if ( node != NUMA_NO_NODE )
memflags = MEMF_node(node);
- stack_base[cpu] = alloc_xenheap_pages(STACK_ORDER, memflags);
- if ( stack_base[cpu] == NULL )
- goto out;
- memguard_guard_stack(stack_base[cpu]);
-
order = get_order_from_pages(NR_RESERVED_GDT_PAGES);
per_cpu(gdt_table, cpu) = gdt = alloc_xenheap_pages(order, memflags);
if ( gdt == NULL )
@@ -1148,8 +1132,6 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
boot_cpu_physical_apicid = get_apic_id();
x86_cpu_to_apicid[0] = boot_cpu_physical_apicid;
- stack_base[0] = stack_start;
-
set_nr_sockets();
socket_cpumask = xzalloc_array(cpumask_t *, nr_sockets);
diff --git a/xen/arch/x86/tboot.c b/xen/arch/x86/tboot.c
index 59d7c47..c283b91 100644
--- a/xen/arch/x86/tboot.c
+++ b/xen/arch/x86/tboot.c
@@ -243,29 +243,6 @@ static void tboot_gen_domain_integrity(const uint8_t key[TB_KEY_SIZE],
memset(&ctx, 0, sizeof(ctx));
}
-/*
- * For stack overflow detection in debug build, a guard page is set up.
- * This fn is used to detect whether a page is in the guarded pages for
- * the above reason.
- */
-static int mfn_in_guarded_stack(unsigned long mfn)
-{
- void *p;
- int i;
-
- for ( i = 0; i < nr_cpu_ids; i++ )
- {
- if ( !stack_base[i] )
- continue;
- p = (void *)((unsigned long)stack_base[i] + STACK_SIZE -
- PRIMARY_STACK_SIZE - PAGE_SIZE);
- if ( mfn == virt_to_mfn(p) )
- return -1;
- }
-
- return 0;
-}
-
static void tboot_gen_xenheap_integrity(const uint8_t key[TB_KEY_SIZE],
vmac_t *mac)
{
@@ -290,12 +267,8 @@ static void tboot_gen_xenheap_integrity(const uint8_t key[TB_KEY_SIZE],
if ( is_page_in_use(page) && is_xen_heap_page(page) )
{
- void *pg;
-
- if ( mfn_in_guarded_stack(mfn) )
- continue; /* skip guard stack, see memguard_guard_stack() in mm.c */
+ void *pg = mfn_to_virt(mfn);
- pg = mfn_to_virt(mfn);
vmac_update((uint8_t *)pg, PAGE_SIZE, &ctx);
}
}
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index eeabb4a..493f8f3 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -356,9 +356,6 @@ unsigned long get_stack_trace_bottom(unsigned long sp)
return ROUNDUP(sp, PAGE_SIZE) -
offsetof(struct cpu_user_regs, es) - sizeof(unsigned long);
-#ifndef MEMORY_GUARD
- case 3 ... 5:
-#endif
case 6 ... 7:
return ROUNDUP(sp, STACK_SIZE) -
sizeof(struct cpu_info) - sizeof(unsigned long);
@@ -375,9 +372,6 @@ unsigned long get_stack_dump_bottom(unsigned long sp)
case 0 ... 2:
return ROUNDUP(sp, PAGE_SIZE) - sizeof(unsigned long);
-#ifndef MEMORY_GUARD
- case 3 ... 5:
-#endif
case 6 ... 7:
return ROUNDUP(sp, STACK_SIZE) - sizeof(unsigned long);
@@ -518,9 +512,9 @@ void show_stack_overflow(unsigned int cpu, const struct cpu_user_regs *regs)
unsigned long esp_top, esp_bottom;
#endif
- if ( _p(curr_stack_base) != stack_base[cpu] )
+ if ( curr_stack_base != PERCPU_STACK_MAPPING )
printk("Current stack base %p differs from expected %p\n",
- _p(curr_stack_base), stack_base[cpu]);
+ _p(curr_stack_base), _p(PERCPU_STACK_MAPPING));
#ifdef MEMORY_GUARD
esp_bottom = (esp | (STACK_SIZE - 1)) + 1;
diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h
index 4d5563b..86b8fcb 100644
--- a/xen/include/asm-arm/mm.h
+++ b/xen/include/asm-arm/mm.h
@@ -362,7 +362,6 @@ unsigned long domain_get_maximum_gpfn(struct domain *d);
extern struct domain *dom_xen, *dom_io, *dom_cow;
-#define memguard_guard_stack(_p) ((void)0)
#define memguard_guard_range(_p,_l) ((void)0)
#define memguard_unguard_range(_p,_l) ((void)0)
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 22c2809..2c1ed1d 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -521,9 +521,6 @@ void memguard_unguard_range(void *p, unsigned long l);
#define memguard_unguard_range(_p,_l) ((void)0)
#endif
-void memguard_guard_stack(void *p);
-void memguard_unguard_stack(void *p);
-
struct mmio_ro_emulate_ctxt {
unsigned long cr2;
unsigned int seg, bdf;
diff --git a/xen/include/xen/smp.h b/xen/include/xen/smp.h
index c55f57f..d30f369 100644
--- a/xen/include/xen/smp.h
+++ b/xen/include/xen/smp.h
@@ -69,8 +69,6 @@ void smp_send_call_function_mask(const cpumask_t *mask);
int alloc_cpu_id(void);
-extern void *stack_base[NR_CPUS];
-
void initialize_cpu_data(unsigned int cpu);
#endif /* __XEN_SMP_H__ */
--
2.1.4
* [PATCH RFC 42/44] x86/smp: Allocate a percpu linear range for the TSS
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (40 preceding siblings ...)
2018-01-04 20:22 ` [PATCH RFC 41/44] x86/smp: Switch to using the percpu stacks Andrew Cooper
@ 2018-01-04 20:22 ` Andrew Cooper
2018-01-04 20:22 ` [PATCH RFC 43/44] x86/smp: Use the percpu TSS mapping Andrew Cooper
` (2 subsequent siblings)
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:22 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
With all CPUs using the same virtual stack mapping, the TSS rsp0/ist[0..2]
values are compile-time constant, so a single read-only TSS can be shared by
the whole system.  (In long mode the CPU never writes to the TSS itself; ltr's
busy bit lands in the GDT descriptor, so a read-only mapping is fine.)
To facilitate this, a new .rodata.page_aligned section is introduced.
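One layout invariant the constant rsp0 relies on is that the start of the
exception frame stays 16-byte aligned; the check already made in
load_system_tables() (and removed later in the series) is reproduced here for
reference, and would live in an init function:

/* Bottom-of-stack must be 16-byte aligned. */
BUILD_BUG_ON((sizeof(struct cpu_info) -
              offsetof(struct cpu_info, guest_cpu_user_regs.es)) & 0xf);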
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/arch/x86/smpboot.c | 6 ++++++
xen/arch/x86/traps.c | 29 +++++++++++++++++++++++++++++
xen/arch/x86/xen.lds.S | 2 ++
xen/include/asm-x86/config.h | 1 +
xen/include/asm-x86/processor.h | 2 ++
5 files changed, 40 insertions(+)
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 77ee883..fa99e4d 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -800,6 +800,12 @@ static int cpu_smpboot_alloc_common(unsigned int cpu)
if ( rc )
goto out;
+ /* Map the TSS. */
+ rc = percpu_map_frame(cpu, PERCPU_TSS_MAPPING,
+ virt_to_page(&global_tss), PAGE_HYPERVISOR_RO);
+ if ( rc )
+ goto out;
+
/* Allocate space for the mapcache L1e's... */
rc = percpu_alloc_l1t(cpu, PERCPU_MAPCACHE_START, &pg);
if ( rc )
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 493f8f3..0ab10ba 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -108,6 +108,35 @@ __section(".bss.page_aligned") __aligned(PAGE_SIZE);
/* Pointer to the IDT of every CPU. */
idt_entry_t *idt_tables[NR_CPUS] __read_mostly;
+/* Global TSS. All stack entry points are identical on each CPU. */
+const struct tss_struct global_tss
+__section(".rodata.page_aligned") __aligned(PAGE_SIZE) =
+{
+ /* Main stack for interrupts/exceptions. */
+ .rsp0 = (PERCPU_STACK_MAPPING + STACK_SIZE -
+ sizeof(struct cpu_info) +
+ offsetof(struct cpu_info, guest_cpu_user_regs.es)),
+
+ /* Ring 1 and 2 stacks poisoned. */
+ .rsp1 = 0x8600111111111111ul,
+ .rsp2 = 0x8600111111111111ul,
+
+ /*
+ * MCE, NMI and Double Fault handlers get their own stacks.
+ * All others poisoned.
+ */
+ .ist = {
+ [IST_MCE - 1] = PERCPU_STACK_MAPPING + IST_MCE * PAGE_SIZE,
+ [IST_DF - 1] = PERCPU_STACK_MAPPING + IST_DF * PAGE_SIZE,
+ [IST_NMI - 1] = PERCPU_STACK_MAPPING + IST_NMI * PAGE_SIZE,
+
+ [IST_MAX ... ARRAY_SIZE(global_tss.ist) - 1] =
+ 0x8600111111111111ul,
+ },
+
+ .bitmap = IOBMP_INVALID_OFFSET,
+};
+
void (*ioemul_handle_quirk)(
u8 opcode, char *io_emul_stub, struct cpu_user_regs *regs);
diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
index d5e8821..3456b4c 100644
--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -77,6 +77,8 @@ SECTIONS
__2M_rodata_start = .; /* Start of 2M superpages, mapped RO. */
.rodata : {
_srodata = .;
+ *(.rodata.page_aligned)
+ . = ALIGN(PAGE_SIZE);
/* Bug frames table */
__start_bug_frames = .;
*(.bug_frames.0)
diff --git a/xen/include/asm-x86/config.h b/xen/include/asm-x86/config.h
index 3974748..caff09f 100644
--- a/xen/include/asm-x86/config.h
+++ b/xen/include/asm-x86/config.h
@@ -283,6 +283,7 @@ extern unsigned long xen_phys_start;
/* Mappings in the percpu area: */
#define PERCPU_IDT_MAPPING (PERCPU_LINEAR_START + KB(4))
+#define PERCPU_TSS_MAPPING (PERCPU_LINEAR_START + KB(128))
#define PERCPU_MAPCACHE_L1ES (PERCPU_LINEAR_START + MB(2) + KB(12))
#define PERCPU_MAPCACHE_START (PERCPU_LINEAR_START + MB(4))
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index c206080..22882a6 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -475,6 +475,8 @@ static inline void disable_each_ist(idt_entry_t *idt)
extern idt_entry_t idt_table[];
extern idt_entry_t *idt_tables[];
+extern const struct tss_struct global_tss;
+
DECLARE_PER_CPU(struct tss_struct, init_tss);
extern void init_int80_direct_trap(struct vcpu *v);
--
2.1.4
* [PATCH RFC 43/44] x86/smp: Use the percpu TSS mapping
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (41 preceding siblings ...)
2018-01-04 20:22 ` [PATCH RFC 42/44] x86/smp: Allocate a percpu linear range for the TSS Andrew Cooper
@ 2018-01-04 20:22 ` Andrew Cooper
2018-01-04 20:22 ` [PATCH RFC 44/44] misc debugging Andrew Cooper
2018-01-05 7:48 ` [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Juergen Gross
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:22 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
Construction of the TSS is the final action remaining in load_system_tables(),
and is lifted to early_switch_to_idle(). As a single global TSS is in use,
the per_cpu init_tss variable is dropped.
The value written to HOST_TR_BASE is now constant, so the write moves to
construct_vmcs().
This means that vmx_set_host_env() and arch_vmx_struct.hostenv_migrated can be
dropped as well.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
xen/arch/x86/cpu/common.c | 66 --------------------------------------
xen/arch/x86/hvm/vmx/vmcs.c | 22 +------------
xen/arch/x86/setup.c | 22 ++++++++++---
xen/arch/x86/smpboot.c | 6 ++--
xen/arch/x86/traps.c | 7 ++--
xen/include/asm-x86/hvm/vmx/vmcs.h | 1 -
xen/include/asm-x86/processor.h | 2 --
7 files changed, 23 insertions(+), 103 deletions(-)
diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index 262eccc..579d149 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -642,72 +642,6 @@ void __init early_cpu_init(void)
}
/*
- * Sets up system tables and descriptors.
- *
- * - Sets up TSS with stack pointers, including ISTs
- * - Inserts TSS selector into regular and compat GDTs
- * - Loads GDT, IDT, TR then null LDT
- * - Sets up IST references in the IDT
- */
-void load_system_tables(void)
-{
- unsigned long stack_bottom = get_stack_bottom(),
- stack_top = stack_bottom & ~(STACK_SIZE - 1);
-
- struct tss_struct *tss = &this_cpu(init_tss);
- struct desc_struct *gdt =
- this_cpu(gdt_table) - FIRST_RESERVED_GDT_ENTRY;
- struct desc_struct *compat_gdt =
- this_cpu(compat_gdt_table) - FIRST_RESERVED_GDT_ENTRY;
-
- *tss = (struct tss_struct){
- /* Main stack for interrupts/exceptions. */
- .rsp0 = stack_bottom,
-
- /* Ring 1 and 2 stacks poisoned. */
- .rsp1 = 0x8600111111111111ul,
- .rsp2 = 0x8600111111111111ul,
-
- /*
- * MCE, NMI and Double Fault handlers get their own stacks.
- * All others poisoned.
- */
- .ist = {
- [IST_MCE - 1] = stack_top + IST_MCE * PAGE_SIZE,
- [IST_DF - 1] = stack_top + IST_DF * PAGE_SIZE,
- [IST_NMI - 1] = stack_top + IST_NMI * PAGE_SIZE,
-
- [IST_MAX ... ARRAY_SIZE(tss->ist) - 1] =
- 0x8600111111111111ul,
- },
-
- .bitmap = IOBMP_INVALID_OFFSET,
- };
-
- _set_tssldt_desc(
- gdt + TSS_ENTRY,
- (unsigned long)tss,
- offsetof(struct tss_struct, __cacheline_filler) - 1,
- SYS_DESC_tss_avail);
- _set_tssldt_desc(
- compat_gdt + TSS_ENTRY,
- (unsigned long)tss,
- offsetof(struct tss_struct, __cacheline_filler) - 1,
- SYS_DESC_tss_busy);
-
- ltr(TSS_ENTRY << 3);
-
- /*
- * Bottom-of-stack must be 16-byte aligned!
- *
- * Defer checks until exception support is sufficiently set up.
- */
- BUILD_BUG_ON((sizeof(struct cpu_info) -
- offsetof(struct cpu_info, guest_cpu_user_regs.es)) & 0xf);
- BUG_ON(system_state != SYS_STATE_early_boot && (stack_bottom & 0xf));
-}
-
-/*
* cpu_init() initializes state that is per-CPU. Some data is already
* initialized (naturally) in the bootstrap process, such as the GDT
* and IDT. We reload them nevertheless, this function acts as a
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 483f72d..93d979e 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -798,14 +798,6 @@ void vmx_vmcs_exit(struct vcpu *v)
}
}
-static void vmx_set_host_env(struct vcpu *v)
-{
- unsigned int cpu = smp_processor_id();
-
- __vmwrite(HOST_TR_BASE, (unsigned long)&per_cpu(init_tss, cpu));
-
-}
-
void vmx_clear_msr_intercept(struct vcpu *v, unsigned int msr,
enum vmx_msr_intercept_type type)
{
@@ -898,12 +890,6 @@ void vmx_vmcs_switch(paddr_t from, paddr_t to)
vmx->launched = 0;
this_cpu(current_vmcs) = to;
- if ( vmx->hostenv_migrated )
- {
- vmx->hostenv_migrated = 0;
- vmx_set_host_env(current);
- }
-
spin_unlock(&vmx->vmcs_lock);
}
@@ -1123,6 +1109,7 @@ static int construct_vmcs(struct vcpu *v)
/* Host system tables. */
__vmwrite(HOST_IDTR_BASE, PERCPU_IDT_MAPPING);
__vmwrite(HOST_GDTR_BASE, PERCPU_GDT_MAPPING);
+ __vmwrite(HOST_TR_BASE, PERCPU_TSS_MAPPING);
/* Host data selectors. */
__vmwrite(HOST_SS_SELECTOR, __HYPERVISOR_DS);
@@ -1701,13 +1688,6 @@ void vmx_do_resume(struct vcpu *v)
vmx_load_vmcs(v);
hvm_migrate_timers(v);
hvm_migrate_pirqs(v);
- vmx_set_host_env(v);
- /*
- * Both n1 VMCS and n2 VMCS need to update the host environment after
- * VCPU migration. The environment of current VMCS is updated in place,
- * but the action of another VMCS is deferred till it is switched in.
- */
- v->arch.hvm_vmx.hostenv_migrated = 1;
hvm_asid_flush_vcpu(v);
}
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index c0f7289..3458ea6 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -90,8 +90,6 @@ unsigned long __read_mostly xen_phys_start;
unsigned long __read_mostly xen_virt_end;
-DEFINE_PER_CPU(struct tss_struct, init_tss);
-
char __section(".bss.stack_aligned") __aligned(STACK_SIZE)
cpu0_stack[STACK_SIZE];
@@ -258,6 +256,10 @@ void early_switch_to_idle(bool bsp)
.base = PERCPU_IDT_MAPPING,
.limit = 0xffff,
};
+ struct desc_struct *gdt =
+ this_cpu(gdt_table) - FIRST_RESERVED_GDT_ENTRY;
+ struct desc_struct *compat_gdt =
+ this_cpu(compat_gdt_table) - FIRST_RESERVED_GDT_ENTRY;
set_current(v);
per_cpu(curr_vcpu, cpu) = v;
@@ -267,8 +269,20 @@ void early_switch_to_idle(bool bsp)
per_cpu(curr_ptbase, cpu) = v->arch.cr3;
per_cpu(curr_extended_directmap, cpu) = true;
+ _set_tssldt_desc(
+ gdt + TSS_ENTRY,
+ (unsigned long)&global_tss,
+ offsetof(struct tss_struct, __cacheline_filler) - 1,
+ SYS_DESC_tss_avail);
+ _set_tssldt_desc(
+ compat_gdt + TSS_ENTRY,
+ (unsigned long)&global_tss,
+ offsetof(struct tss_struct, __cacheline_filler) - 1,
+ SYS_DESC_tss_busy);
+
lgdt(&gdtr);
lidt(&idtr);
+ ltr(TSS_ENTRY << 3);
lldt(0);
if ( likely(!bsp) ) /* BSP IST setup deferred. */
@@ -654,9 +668,7 @@ static void __init noreturn reinit_bsp_stack(void)
/* Sanity check that IST settings weren't set up before this point. */
ASSERT(MASK_EXTR(idt_tables[0][TRAP_nmi].a, 7UL << 32) == 0);
- /* Update TSS and ISTs */
- load_system_tables();
-
+ /* Enable BSP ISTs now we've switched stack. */
enable_each_ist(idt_tables[0]);
/* Update SYSCALL trampolines */
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index fa99e4d..69767e2 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -310,6 +310,8 @@ void start_secondary(void *unused)
early_switch_to_idle(false);
+ /* Full exception support from here on in. */
+
rdmsrl(MSR_EFER, this_cpu(efer));
/*
@@ -330,10 +332,6 @@ void start_secondary(void *unused)
*/
spin_debug_disable();
- load_system_tables();
-
- /* Full exception support from here on in. */
-
/* Safe to enable feature such as CR4.MCE with the IDT set up now. */
write_cr4(mmu_cr4_features);
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 0ab10ba..6b02a5f 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -551,7 +551,7 @@ void show_stack_overflow(unsigned int cpu, const struct cpu_user_regs *regs)
printk("Valid stack range: %p-%p, sp=%p, tss.rsp0=%p\n",
(void *)esp_top, (void *)esp_bottom, (void *)esp,
- (void *)per_cpu(init_tss, cpu).rsp0);
+ (void *)global_tss.rsp0);
/*
* Trigger overflow trace if %esp is anywhere within the guard page, or
@@ -1804,7 +1804,6 @@ static void __init set_intr_gate(unsigned int n, void *addr)
void load_TR(void)
{
- struct tss_struct *tss = &this_cpu(init_tss);
struct desc_ptr old_gdt, tss_gdt = {
.base = (long)(this_cpu(gdt_table) - FIRST_RESERVED_GDT_ENTRY),
.limit = LAST_RESERVED_GDT_BYTE
@@ -1812,12 +1811,12 @@ void load_TR(void)
_set_tssldt_desc(
this_cpu(gdt_table) + TSS_ENTRY - FIRST_RESERVED_GDT_ENTRY,
- (unsigned long)tss,
+ (unsigned long)&global_tss,
offsetof(struct tss_struct, __cacheline_filler) - 1,
SYS_DESC_tss_avail);
_set_tssldt_desc(
this_cpu(compat_gdt_table) + TSS_ENTRY - FIRST_RESERVED_GDT_ENTRY,
- (unsigned long)tss,
+ (unsigned long)&global_tss,
offsetof(struct tss_struct, __cacheline_filler) - 1,
SYS_DESC_tss_busy);
diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index 8fb9e3c..c1bd468 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -157,7 +157,6 @@ struct arch_vmx_struct {
struct segment_register vm86_saved_seg[x86_seg_tr + 1];
/* Remember EFLAGS while in virtual 8086 mode */
uint32_t vm86_saved_eflags;
- int hostenv_migrated;
/* Bitmap to control vmexit policy for Non-root VMREAD/VMWRITE */
struct page_info *vmread_bitmap;
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index 22882a6..2990afd 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -477,8 +477,6 @@ extern idt_entry_t *idt_tables[];
extern const struct tss_struct global_tss;
-DECLARE_PER_CPU(struct tss_struct, init_tss);
-
extern void init_int80_direct_trap(struct vcpu *v);
extern void do_write_ptbase(struct vcpu *v, bool tlb_maintenance);
--
2.1.4
* [PATCH RFC 44/44] misc debugging
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (42 preceding siblings ...)
2018-01-04 20:22 ` [PATCH RFC 43/44] x86/smp: Use the percpu TSS mapping Andrew Cooper
@ 2018-01-04 20:22 ` Andrew Cooper
2018-01-05 7:48 ` [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Juergen Gross
44 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-04 20:22 UTC (permalink / raw)
To: Xen-devel; +Cc: Andrew Cooper
Keyhandlers for the following:
'1' - Walk idle_pg_table[]
'2' - Walk each CPU's percpu_mappings
'3' - Dump PT shadow stats
---
xen/arch/x86/hvm/save.c | 4 -
xen/arch/x86/mm/p2m-ept.c | 5 +-
xen/arch/x86/pv/pt-shadow.c | 19 ++++
xen/arch/x86/traps.c | 199 +++++++++++++++++++++++++++++++++++++
xen/arch/x86/x86_64/mm.c | 6 ++
xen/include/asm-x86/pv/pt-shadow.h | 17 ++++
6 files changed, 244 insertions(+), 6 deletions(-)
diff --git a/xen/arch/x86/hvm/save.c b/xen/arch/x86/hvm/save.c
index 8984a23..fbdae05 100644
--- a/xen/arch/x86/hvm/save.c
+++ b/xen/arch/x86/hvm/save.c
@@ -223,8 +223,6 @@ int hvm_save(struct domain *d, hvm_domain_context_t *h)
handler = hvm_sr_handlers[i].save;
if ( handler != NULL )
{
- printk(XENLOG_G_INFO "HVM%d save: %s\n",
- d->domain_id, hvm_sr_handlers[i].name);
if ( handler(d, h) != 0 )
{
printk(XENLOG_G_ERR
@@ -297,8 +295,6 @@ int hvm_load(struct domain *d, hvm_domain_context_t *h)
}
/* Load the entry */
- printk(XENLOG_G_INFO "HVM%d restore: %s %"PRIu16"\n", d->domain_id,
- hvm_sr_handlers[desc->typecode].name, desc->instance);
if ( handler(d, h) != 0 )
{
printk(XENLOG_G_ERR "HVM%d restore: failed to load entry %u/%u\n",
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index b4996ce..4167d29 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -1285,7 +1285,7 @@ void ept_p2m_uninit(struct p2m_domain *p2m)
free_cpumask_var(ept->invalidate);
}
-static const char *memory_type_to_str(unsigned int x)
+const char *memory_type_to_str(unsigned int x)
{
static const char memory_types[8][3] = {
[MTRR_TYPE_UNCACHABLE] = "UC",
@@ -1293,7 +1293,8 @@ static const char *memory_type_to_str(unsigned int x)
[MTRR_TYPE_WRTHROUGH] = "WT",
[MTRR_TYPE_WRPROT] = "WP",
[MTRR_TYPE_WRBACK] = "WB",
- [MTRR_NUM_TYPES] = "??"
+ [PAT_TYPE_UC_MINUS] = "U-",
+ /* [MTRR_NUM_TYPES] = "??", */
};
ASSERT(x < ARRAY_SIZE(memory_types));
diff --git a/xen/arch/x86/pv/pt-shadow.c b/xen/arch/x86/pv/pt-shadow.c
index d550ae1..f4c522f 100644
--- a/xen/arch/x86/pv/pt-shadow.c
+++ b/xen/arch/x86/pv/pt-shadow.c
@@ -28,6 +28,8 @@
#undef page_to_mfn
#define page_to_mfn(pg) _mfn(__page_to_mfn(pg))
+struct ptstats ptstats;
+
/*
* To use percpu linear ranges, we require that no two pcpus have %cr3
* pointing at the same L4 pagetable at the same time.
@@ -224,7 +226,10 @@ unsigned long pt_maybe_shadow(struct vcpu *v)
/* No shadowing necessary? Run on the intended pagetable. */
if ( !pt_need_shadow(v->domain) )
+ {
+ ptstat(&ptstats.sync_none);
return new_cr3;
+ }
ptsh->domain = v->domain;
@@ -259,6 +264,11 @@ unsigned long pt_maybe_shadow(struct vcpu *v)
}
}
local_irq_restore(flags);
+
+ if ( cache_idx )
+ ptstat(&ptstats.sync_shuffle);
+ else
+ ptstat(&ptstats.sync_noshuffle);
}
else
{
@@ -293,6 +303,7 @@ unsigned long pt_maybe_shadow(struct vcpu *v)
sizeof(*l4t) * (L4_PAGETABLE_ENTRIES - (slot + 1)));
unmap_domain_page(vcpu_l4t);
+ ptstat(&ptstats.sync_full);
}
ASSERT(ptsh->cache[0].cr3_mfn == (new_cr3 >> PAGE_SHIFT));
@@ -320,13 +331,19 @@ static void _pt_shadow_ipi(void *arg)
/* No longer shadowing state from this domain? Nothing to do. */
if ( info->d != ptsh->domain )
+ {
+ ptstat(&ptstats.ipi_dom_miss);
return;
+ }
ent = pt_cache_lookup(ptsh, page_to_maddr(info->pg));
/* Not shadowing this frame? Nothing to do. */
if ( ent == NULL )
+ {
+ ptstat(&ptstats.ipi_cache_miss);
return;
+ }
switch ( info->op )
{
@@ -340,6 +357,7 @@ static void _pt_shadow_ipi(void *arg)
l4t[info->slot] = vcpu_l4t[info->slot];
unmap_domain_page(vcpu_l4t);
+ ptstat(&ptstats.ipi_write);
break;
case PTSH_IPI_INVLPG:
@@ -357,6 +375,7 @@ static void _pt_shadow_ipi(void *arg)
case 2: ptsh->cache[2] = ptsh->cache[3];
case 3: ptsh->cache[3] = (pt_cache_entry_t){ shadow_idx };
}
+ ptstat(&ptstats.ipi_invlpg);
break;
default:
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 6b02a5f..095bf97 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -955,6 +955,7 @@ void cpuid_hypervisor_leaves(const struct vcpu *v, uint32_t leaf,
break;
res->b = flsl(get_upper_mfn_bound()) + PAGE_SHIFT;
+ res->c = v->vcpu_id;
break;
default:
@@ -2074,6 +2075,204 @@ void asm_domain_crash_synchronous(unsigned long addr)
__domain_crash_synchronous();
}
+#include <xen/keyhandler.h>
+#include <asm/pv/pt-shadow.h>
+
+const char *memory_type_to_str(unsigned int x);
+static void decode_intpte(unsigned int level, unsigned int slot, intpte_t pte)
+{
+ unsigned int pat_idx = ((pte >> 3) & 3) |
+ ((pte >> ((level > 1 && (pte & _PAGE_PSE)) ? 10 : 5)) & 4);
+
+ unsigned int mem_type = (host_pat >> (pat_idx << 3)) & 0xff;
+
+ printk("%*sL%u[%03u] %"PRIpte" %*s%s %s%s%s%s%s%s\n",
+ (4 - level) * 2, "",
+ level, slot, pte,
+ (level - 1) * 2, "",
+
+ memory_type_to_str(mem_type),
+
+ pte & 0x8000000000000000ULL ? " Nx" : "",
+ pte & _PAGE_GLOBAL ? " G" : "",
+ (level > 1 && pte & _PAGE_PSE) ? " +" : "",
+ pte & _PAGE_USER ? " U" : "",
+ pte & _PAGE_RW ? " W" : "",
+ pte & _PAGE_PRESENT ? " P" : "");
+}
+
+static bool is_poison(intpte_t pte)
+{
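+ /*
+  * The poison L4es planted in subarch_init_memory() encode the slot number
+  * in bits 16-27, which the mask below ignores.
+  */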
+ return (pte & ~0xfff0000) == 0x800f868600000063;
+}
+
+static void dump_l3t(l3_pgentry_t *l3t, bool decend)
+{
+ unsigned int l3i, l2i, l1i;
+ l2_pgentry_t *l2;
+ l1_pgentry_t *l1;
+
+ for ( l3i = 0; l3i < 512; ++l3i )
+ {
+ if ( !(l3t[l3i].l3 & _PAGE_PRESENT) )
+ continue;
+
+ decode_intpte(3, l3i, l3t[l3i].l3);
+
+ if ( is_poison(l3t[l3i].l3) )
+ continue;
+
+ if ( l3t[l3i].l3 & _PAGE_PSE )
+ continue;
+
+ if ( !decend )
+ continue;
+
+ l2 = l3e_to_l2e(l3t[l3i]);
+ for ( l2i = 0; l2i < 512; ++l2i )
+ {
+ if ( !(l2[l2i].l2 & _PAGE_PRESENT) )
+ continue;
+
+ decode_intpte(2, l2i, l2[l2i].l2);
+
+ if ( is_poison(l2[l2i].l2) )
+ continue;
+
+ if ( l2[l2i].l2 & _PAGE_PSE )
+ continue;
+
+ l1 = l2e_to_l1e(l2[l2i]);
+ for ( l1i = 0; l1i < 512; ++l1i )
+ {
+ if ( !(l1[l1i].l1 & _PAGE_PRESENT) )
+ continue;
+
+ decode_intpte(1, l1i, l1[l1i].l1);
+ }
+
+ process_pending_softirqs();
+ }
+
+ process_pending_softirqs();
+ }
+
+}
+
+static void dump_l4t(l4_pgentry_t *l4t, bool decend)
+{
+ unsigned int l4i;
+
+ for ( l4i = 0; l4i < 512; ++l4i )
+ {
+ if ( !(l4t[l4i].l4 & _PAGE_PRESENT) )
+ continue;
+
+ decode_intpte(4, l4i, l4t[l4i].l4);
+
+ if ( is_poison(l4t[l4i].l4) )
+ continue;
+
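+ /* Skip the (shadow) linear pagetable slots: they reference the
+ * pagetables themselves, so descending would just dump the pagetable
+ * hierarchy recursively. */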
+ if ( decend &&
+ l4i != l4_table_offset(LINEAR_PT_VIRT_START) &&
+ l4i != l4_table_offset(SH_LINEAR_PT_VIRT_START) )
+ dump_l3t(l4e_to_l3e(l4t[l4i]), true);
+ }
+}
+
+static void do_extreme_debug(unsigned char key)
+{
+ unsigned int cpu;
+
+ printk("'%c' pressed -> Extreme debugging in progress...\n", key);
+
+ watchdog_disable();
+ console_start_log_everything();
+
+ switch ( key )
+ {
+ case '1':
+ dump_l4t(idle_pg_table, true);
+ break;
+
+ case '2':
+ printk("idle_pg_table[]\n");
+ dump_l4t(idle_pg_table, false);
+
+ for_each_online_cpu ( cpu )
+ {
+ paddr_t l4 = per_cpu(percpu_idle_pt, cpu);
+ l4_pgentry_t mappings = per_cpu(percpu_mappings, cpu);
+ l4_pgentry_t *l4t;
+ l3_pgentry_t *l3t;
+
+ printk("CPU #%u per-pcpu l4 %"PRIpaddr", mappings %"PRIpte"\n",
+ cpu, l4, mappings.l4);
+
+ if ( !l4 )
+ {
+ printk(" BAD l4\n");
+ continue;
+ }
+ if ( !mappings.l4 )
+ {
+ printk(" Bad mappings\n");
+ continue;
+ }
+
+ printk("Dumping L4:\n");
+ l4t = map_domain_page(_mfn(paddr_to_pfn(l4)));
+ dump_l4t(l4t, false);
+ unmap_domain_page(l4t);
+
+ printk("Dumping L3:\n");
+ l3t = map_domain_page(l4e_get_mfn(mappings));
+ dump_l3t(l3t, true);
+ unmap_domain_page(l3t);
+ }
+ break;
+
+ case '3':
+ printk("pt_shadow() stats:\n"
+ " sync_none: %20lu\n"
+ " sync_noshuffle: %20lu\n"
+ " sync_shuffle: %20lu\n"
+ " sync_full: %20lu\n"
+ " ipi_dom_miss: %20lu\n"
+ " ipi_cache_miss: %20lu\n"
+ " ipi_ipi_write: %20lu\n"
+ " ipi_ipi_invlpg: %20lu\n",
+ ptstats.sync_none, ptstats.sync_noshuffle,
+ ptstats.sync_shuffle, ptstats.sync_full,
+ ptstats.ipi_dom_miss, ptstats.ipi_cache_miss,
+ ptstats.ipi_write, ptstats.ipi_invlpg);
+ break;
+ }
+
+ console_end_log_everything();
+ watchdog_enable();
+}
+
+static struct timer stats;
+static void stats_fn(void *unused)
+{
+ do_extreme_debug('3');
+ set_timer(&stats, NOW() + SECONDS(10));
+}
+
+static int __init extreme_debug_keyhandler_init(void)
+{
+ register_keyhandler('1', &do_extreme_debug, "Extreme debugging 1", 0);
+ register_keyhandler('2', &do_extreme_debug, "Extreme debugging 2", 0);
+ register_keyhandler('3', &do_extreme_debug, "Extreme debugging 3", 0);
+
+ init_timer(&stats, stats_fn, NULL, 0);
+ /* set_timer(&stats, NOW() + SECONDS(10)); */
+
+ return 0;
+}
+__initcall(extreme_debug_keyhandler_init);
+
/*
* Local variables:
* mode: C
diff --git a/xen/arch/x86/x86_64/mm.c b/xen/arch/x86/x86_64/mm.c
index aae721b..a3e81ac 100644
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -874,6 +874,12 @@ void __init subarch_init_memory(void)
}
}
+ /* Poison specific entries. */
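+ /* Each value embeds its decimal slot number in the hex digits (e.g.
+ * ...0271... for slot 271), so a stray use of a poisoned slot is easy
+ * to recognise; see is_poison() in traps.c. */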
+ idle_pg_table[271].l4 = 0x800f868602710063;
+ idle_pg_table[272].l4 = 0x800f868602720063;
+ idle_pg_table[510].l4 = 0x800f868605100063;
+ idle_pg_table[511].l4 = 0x800f868605110063;
+
/* Create an L3 table for the MMCFG region, or remap it NX. */
pl4e = &idle_pg_table[l4_table_offset(PCI_MCFG_VIRT_START)];
if ( !(l4e_get_flags(*pl4e) & _PAGE_PRESENT) )
diff --git a/xen/include/asm-x86/pv/pt-shadow.h b/xen/include/asm-x86/pv/pt-shadow.h
index d5576f4..399ebeb 100644
--- a/xen/include/asm-x86/pv/pt-shadow.h
+++ b/xen/include/asm-x86/pv/pt-shadow.h
@@ -23,6 +23,23 @@
#include <xen/sched.h>
+extern struct ptstats {
+ unsigned long sync_none;
+ unsigned long sync_noshuffle;
+ unsigned long sync_shuffle;
+ unsigned long sync_full;
+
+ unsigned long ipi_dom_miss;
+ unsigned long ipi_cache_miss;
+ unsigned long ipi_write;
+ unsigned long ipi_invlpg;
+} ptstats;
+
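+/*
+ * Counters are bumped from the context switch path and from IPI handlers
+ * on multiple CPUs concurrently, hence the locked add rather than a plain
+ * increment.
+ */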
+static inline void ptstat(unsigned long *stat)
+{
+ asm volatile ("lock; addq $1, %0" : "+m" (*stat));
+}
+
#ifdef CONFIG_PV
/*
--
2.1.4
* Re: [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution
2018-01-04 20:21 [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Andrew Cooper
` (43 preceding siblings ...)
2018-01-04 20:22 ` [PATCH RFC 44/44] misc debugging Andrew Cooper
@ 2018-01-05 7:48 ` Juergen Gross
2018-01-05 9:26 ` Andrew Cooper
2018-01-09 23:14 ` Stefano Stabellini
44 siblings, 2 replies; 61+ messages in thread
From: Juergen Gross @ 2018-01-05 7:48 UTC (permalink / raw)
To: Andrew Cooper, Xen-devel
On 04/01/18 21:21, Andrew Cooper wrote:
> This work was developed as an SP3 mitigation, but shelved when it became clear
> that it wasn't viable to get done in the timeframe.
>
> To protect against SP3 attacks, most mappings needs to be flushed while in
> user context. However, to protect against all cross-VM attacks, it is
> necessary to ensure that the Xen stacks are not mapped in any other cpus
> address space, or an attacker can still recover at least the GPR state of
> separate VMs.
Above statement is too strict: it would be sufficient if no stacks of
other domains are mapped.
I'm just working on a proof of concept using dedicated per-vcpu stacks
for 64 bit pv domains. Those stacks would be mapped in the per-domain
region of the address space. I hope to have an RFC version of the patches
ready next week.
This would allow removing the per-physical-cpu mappings in the
guest-visible address space when doing page table isolation.
In order to avoid SP3 attacks on other vcpus' stacks of the same guest,
we could extend the PV ABI to mark a guest's user L4 page table as
"single use", i.e. not allowed to be active on multiple vcpus at the
same time (introducing that ABI modification in the Linux kernel would
be simple, as the Linux kernel currently lacks protection against
cross-cpu stack exploits, and when that protection is added via per-cpu
user L4 page tables we could just chime in). An L4 page table marked as
"single use" would map the local vcpu stacks only.
> To have isolated stacks, Xen needs a per-pcpu isolated region, which requires
> that two pCPUs never share the same %cr3. This is trivial for 32bit PV guests
> and HVM guests due to the existing per-vcpu Monitor Tables, but is problematic
> for 64bit PV guests, which will run on the same %cr3 when scheduling different
> threads from the same process.
>
> To avoid breaking the PV ABI, Xen needs to shadow the guest L4 pagetables if
> it wants to maintain the unique %cr3 property it needs.
>
> tl;dr The shadowing algorithm in pt-shadow.c is too much of a performance
> overhead to be viable, and very high risk to productise in an embargo window.
> If we want to continue down this route, we either need someone to have a
> clever alternative to the shadowing algorithm I came up with, or change the PV
> ABI to require VMs not to share L4 pagetables.
>
> Either way, these patches are presented to start a discussion of the issues.
> The series as a whole is not in a suitable state for committing.
I think patch 1 should be excluded from that statement, as it is not
directly related to the series.
Juergen
* Re: [PATCH RFC 01/44] passthrough/vtd: Don't DMA to the stack in queue_invalidate_wait()
2018-01-04 20:21 ` [PATCH RFC 01/44] passthrough/vtd: Don't DMA to the stack in queue_invalidate_wait() Andrew Cooper
@ 2018-01-05 9:21 ` Jan Beulich
2018-01-05 9:33 ` Andrew Cooper
2018-01-16 6:41 ` Tian, Kevin
1 sibling, 1 reply; 61+ messages in thread
From: Jan Beulich @ 2018-01-05 9:21 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Julien Grall, Kevin Tian, Xen-devel
>>> On 04.01.18 at 21:21, <andrew.cooper3@citrix.com> wrote:
> DMA-ing to the stack is generally considered bad practice. In this case, if a
> timeout occurs because of a sluggish device which is processing the request,
> the completion notification will corrupt the stack of a subsequent deeper call
> tree.
>
> Place the poll_slot in a percpu area and DMA to that instead.
>
> Note: This change does not address other issues with the current
> implementation, such as once a timeout has been suffered, subsequent
> completions can't be correlated with their requests.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
> Julien: This wants backporting to all releases, and therefore should be
> considered for 4.10 at this point.
Interesting remark at this point in time ;-)
Jan
* Re: [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution
2018-01-05 7:48 ` [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Juergen Gross
@ 2018-01-05 9:26 ` Andrew Cooper
2018-01-05 9:39 ` Juergen Gross
2018-01-09 23:14 ` Stefano Stabellini
1 sibling, 1 reply; 61+ messages in thread
From: Andrew Cooper @ 2018-01-05 9:26 UTC (permalink / raw)
To: Juergen Gross, Xen-devel
On 05/01/2018 07:48, Juergen Gross wrote:
> On 04/01/18 21:21, Andrew Cooper wrote:
>> This work was developed as an SP3 mitigation, but shelved when it became clear
>> that it wasn't viable to get done in the timeframe.
>>
>> To protect against SP3 attacks, most mappings needs to be flushed while in
>> user context. However, to protect against all cross-VM attacks, it is
>> necessary to ensure that the Xen stacks are not mapped in any other cpus
>> address space, or an attacker can still recover at least the GPR state of
>> separate VMs.
> Above statement is too strict: it would be sufficient if no stacks of
> other domains are mapped.
Sadly not. Having stacks shared by domain means one vcpu can still
steal at least GPR state from other vcpus belonging to the same domain.
Whether or not a specific kernel cares, some definitely will.
> I'm just working on a proof of concept using dedicated per-vcpu stacks
> for 64 bit pv domains. Those stacks would be mapped in the per-domain
> region of the address space. I hope to have a RFC version of the patches
> ready next week.
>
> This would allow to remove the per physical cpu mappings in the guest
> visible address space when doing page table isolation.
>
> In order to avoid SP3 attacks to other vcpu's stacks of the same guest
> we could extend the pv ABI to mark a guest's user L4 page table as
> "single use", i.e. not allowed to be active on multiple vcpus at the
> same time (introducing that ABI modification in the Linux kernel would
> be simple, as the Linux kernel currently lacks support for cross-cpu
> stack exploits and when that support is being added by per-cpu L4 user
> page tables we could just chime in). A L4 page table marked as "single
> use" would map the local vcpu stacks only.
For PV guests, it is the Xen stacks which matter, not the vcpu guest
kernel's ones.
64bit PV guest kernels are already mitigated better than KPTI can ever
manage, because there are no entry stacks or entry stubs required to be
mapped into guest userspace at all.
>> To have isolated stacks, Xen needs a per-pcpu isolated region, which requires
>> that two pCPUs never share the same %cr3. This is trivial for 32bit PV guests
>> and HVM guests due to the existing per-vcpu Monitor Tables, but is problematic
>> for 64bit PV guests, which will run on the same %cr3 when scheduling different
>> threads from the same process.
>>
>> To avoid breaking the PV ABI, Xen needs to shadow the guest L4 pagetables if
>> it wants to maintain the unique %cr3 property it needs.
>>
>> tl;dr The shadowing algorithm in pt-shadow.c is too much of a performance
>> overhead to be viable, and very high risk to productise in an embargo window.
>> If we want to continue down this route, we either need someone to have a
>> clever alternative to the shadowing algorithm I came up with, or change the PV
>> ABI to require VMs not to share L4 pagetables.
>>
>> Either way, these patches are presented to start a discussion of the issues.
>> The series as a whole is not in a suitable state for committing.
> I think patch 1 should be excluded from that statement, as it is not
> directly related to the series.
There are bits of the series I do intend to take in, largely in this
form. Another is "x86/pv: Drop support for paging out the LDT", because
it's long since time for that to disappear.
I should also say that the net changes to context switch and
critical-structure handling across this series are a performance and
security benefit, irrespective of the KAISER/KPTI side of things.
They'd qualify for inclusion on their own merits alone (if it weren't
for the dependent L4 shadowing issues).
If you're interested, I stumbled onto patch one after introducing the
per-pcpu stack mapping, as virt_to_maddr() came out spectacularly
wrong. Very observant readers might also notice the bit of misc
debugging which caused me to blindly stumble into XSA-243, which was an
interesting diversion from Xen crashing because of my own pagetable
mistakes.
~Andrew
* Re: [PATCH RFC 01/44] passthrough/vtd: Don't DMA to the stack in queue_invalidate_wait()
2018-01-05 9:21 ` Jan Beulich
@ 2018-01-05 9:33 ` Andrew Cooper
0 siblings, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-05 9:33 UTC (permalink / raw)
To: Jan Beulich; +Cc: Julien Grall, Kevin Tian, Xen-devel
On 05/01/2018 09:21, Jan Beulich wrote:
>>>> On 04.01.18 at 21:21, <andrew.cooper3@citrix.com> wrote:
>> DMA-ing to the stack is generally considered bad practice. In this case, if a
>> timeout occurs because of a sluggish device which is processing the request,
>> the completion notification will corrupt the stack of a subsequent deeper call
>> tree.
>>
>> Place the poll_slot in a percpu area and DMA to that instead.
>>
>> Note: This change does not address other issues with the current
>> implementation, such as once a timeout has been suffered, subsequent
>> completions can't be correlated with their requests.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>
>> Julien: This wants backporting to all releases, and therefore should be
>> considered for 4.10 at this point.
> Interesting remark at this point in time ;-)
Oops, yes. This might leak the point at which I shelved the plan.
With this all out in the open now, observant people might notice how
many of my 4.10 patches are relevant to the issues at hand.
~Andrew
* Re: [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution
2018-01-05 9:26 ` Andrew Cooper
@ 2018-01-05 9:39 ` Juergen Gross
2018-01-05 9:56 ` Andrew Cooper
2018-01-05 14:11 ` George Dunlap
0 siblings, 2 replies; 61+ messages in thread
From: Juergen Gross @ 2018-01-05 9:39 UTC (permalink / raw)
To: Andrew Cooper, Xen-devel
On 05/01/18 10:26, Andrew Cooper wrote:
> On 05/01/2018 07:48, Juergen Gross wrote:
>> On 04/01/18 21:21, Andrew Cooper wrote:
>>> This work was developed as an SP3 mitigation, but shelved when it became clear
>>> that it wasn't viable to get done in the timeframe.
>>>
>>> To protect against SP3 attacks, most mappings needs to be flushed while in
>>> user context. However, to protect against all cross-VM attacks, it is
>>> necessary to ensure that the Xen stacks are not mapped in any other cpus
>>> address space, or an attacker can still recover at least the GPR state of
>>> separate VMs.
>> Above statement is too strict: it would be sufficient if no stacks of
>> other domains are mapped.
>
> Sadly not. Having stacks shared by domain means one vcpu can still
> steal at least GPR state from other vcpus belonging to the same domain.
>
> Whether or not a specific kernel cares, some definitely will.
>
>> I'm just working on a proof of concept using dedicated per-vcpu stacks
>> for 64 bit pv domains. Those stacks would be mapped in the per-domain
>> region of the address space. I hope to have a RFC version of the patches
>> ready next week.
>>
>> This would allow to remove the per physical cpu mappings in the guest
>> visible address space when doing page table isolation.
>>
>> In order to avoid SP3 attacks to other vcpu's stacks of the same guest
>> we could extend the pv ABI to mark a guest's user L4 page table as
>> "single use", i.e. not allowed to be active on multiple vcpus at the
>> same time (introducing that ABI modification in the Linux kernel would
>> be simple, as the Linux kernel currently lacks support for cross-cpu
>> stack exploits and when that support is being added by per-cpu L4 user
>> page tables we could just chime in). A L4 page table marked as "single
>> use" would map the local vcpu stacks only.
>
> For PV guests, it is the Xen stacks which matter, not the vcpu guest
> kernel's ones.
Indeed. That's the reason I want to have per-vcpu Xen stacks.
> 64bit PV guest kernels are already mitigated better than KPTI can ever
> manage, because there are no entry stacks or entry stubs required to be
> mapped into guest userspace at all.
But without Xen being secured via a mechanism similar to KPTI this
is moot, as user mode can exploit the whole host, including its own
kernel's memory.
Juergen
* Re: [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution
2018-01-05 9:39 ` Juergen Gross
@ 2018-01-05 9:56 ` Andrew Cooper
2018-01-05 14:11 ` George Dunlap
1 sibling, 0 replies; 61+ messages in thread
From: Andrew Cooper @ 2018-01-05 9:56 UTC (permalink / raw)
To: Juergen Gross, Xen-devel
On 05/01/2018 09:39, Juergen Gross wrote:
> On 05/01/18 10:26, Andrew Cooper wrote:
>> On 05/01/2018 07:48, Juergen Gross wrote:
>>> On 04/01/18 21:21, Andrew Cooper wrote:
>>>> This work was developed as an SP3 mitigation, but shelved when it became clear
>>>> that it wasn't viable to get done in the timeframe.
>>>>
>>>> To protect against SP3 attacks, most mappings needs to be flushed while in
>>>> user context. However, to protect against all cross-VM attacks, it is
>>>> necessary to ensure that the Xen stacks are not mapped in any other cpus
>>>> address space, or an attacker can still recover at least the GPR state of
>>>> separate VMs.
>>> Above statement is too strict: it would be sufficient if no stacks of
>>> other domains are mapped.
>> Sadly not. Having stacks shared by domain means one vcpu can still
>> steal at least GPR state from other vcpus belonging to the same domain.
>>
>> Whether or not a specific kernel cares, some definitely will.
>>
>>> I'm just working on a proof of concept using dedicated per-vcpu stacks
>>> for 64 bit pv domains. Those stacks would be mapped in the per-domain
>>> region of the address space. I hope to have a RFC version of the patches
>>> ready next week.
>>>
>>> This would allow to remove the per physical cpu mappings in the guest
>>> visible address space when doing page table isolation.
>>>
>>> In order to avoid SP3 attacks to other vcpu's stacks of the same guest
>>> we could extend the pv ABI to mark a guest's user L4 page table as
>>> "single use", i.e. not allowed to be active on multiple vcpus at the
>>> same time (introducing that ABI modification in the Linux kernel would
>>> be simple, as the Linux kernel currently lacks support for cross-cpu
>>> stack exploits and when that support is being added by per-cpu L4 user
>>> page tables we could just chime in). A L4 page table marked as "single
>>> use" would map the local vcpu stacks only.
>> For PV guests, it is the Xen stacks which matter, not the vcpu guest
>> kernel's ones.
> Indeed. That's the reason I want to have per-vcpu Xen stacks.
We will have to be extra careful going along those lines (and to
forewarn you, I don't have a good gut feeling about it).
For one, livepatching safety currently depends on the per-pcpu stacks.
Also, you will have to entirely rework how the IST stacks work, as they
will have to move to being per-vcpu as well, which means modifying the
TSS and rewriting the syscall stubs on context switch.
At the moment, Xen's per-pcpu stacks have shielded us from some of the
SP2/RSB issues, because of reset_stack_and_jump() used during
scheduling. The waitqueue infrastructure is the one place where this is
violated at the moment, and is only used in practice during
introspection. However, for other reasons, I'm looking to delete that
code and pretend that it never existed.
~Andrew
* Re: [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution
2018-01-05 9:39 ` Juergen Gross
2018-01-05 9:56 ` Andrew Cooper
@ 2018-01-05 14:11 ` George Dunlap
2018-01-05 14:17 ` Juergen Gross
2018-01-05 14:27 ` Jan Beulich
1 sibling, 2 replies; 61+ messages in thread
From: George Dunlap @ 2018-01-05 14:11 UTC (permalink / raw)
To: Juergen Gross; +Cc: Andrew Cooper, Xen-devel
On Fri, Jan 5, 2018 at 9:39 AM, Juergen Gross <jgross@suse.com> wrote:
> On 05/01/18 10:26, Andrew Cooper wrote:
>> On 05/01/2018 07:48, Juergen Gross wrote:
>>> On 04/01/18 21:21, Andrew Cooper wrote:
>>>> This work was developed as an SP3 mitigation, but shelved when it became clear
>>>> that it wasn't viable to get done in the timeframe.
>>>>
>>>> To protect against SP3 attacks, most mappings needs to be flushed while in
>>>> user context. However, to protect against all cross-VM attacks, it is
>>>> necessary to ensure that the Xen stacks are not mapped in any other cpus
>>>> address space, or an attacker can still recover at least the GPR state of
>>>> separate VMs.
>>> Above statement is too strict: it would be sufficient if no stacks of
>>> other domains are mapped.
>>
>> Sadly not. Having stacks shared by domain means one vcpu can still
>> steal at least GPR state from other vcpus belonging to the same domain.
>>
>> Whether or not a specific kernel cares, some definitely will.
>>
>>> I'm just working on a proof of concept using dedicated per-vcpu stacks
>>> for 64 bit pv domains. Those stacks would be mapped in the per-domain
>>> region of the address space. I hope to have a RFC version of the patches
>>> ready next week.
>>>
>>> This would allow to remove the per physical cpu mappings in the guest
>>> visible address space when doing page table isolation.
>>>
>>> In order to avoid SP3 attacks to other vcpu's stacks of the same guest
>>> we could extend the pv ABI to mark a guest's user L4 page table as
>>> "single use", i.e. not allowed to be active on multiple vcpus at the
>>> same time (introducing that ABI modification in the Linux kernel would
>>> be simple, as the Linux kernel currently lacks support for cross-cpu
>>> stack exploits and when that support is being added by per-cpu L4 user
>>> page tables we could just chime in). A L4 page table marked as "single
>>> use" would map the local vcpu stacks only.
>>
>> For PV guests, it is the Xen stacks which matter, not the vcpu guest
>> kernel's ones.
>
> Indeed. That's the reason I want to have per-vcpu Xen stacks.
>
>> 64bit PV guest kernels are already mitigated better than KPTI can ever
>> manage, because there are no entry stacks or entry stubs required to be
>> mapped into guest userspace at all.
>
> But without Xen being secured via a mechanism similar to KPTI this
> is moot, as user mode can exploit the whole host including the own
> kernel's memory.
Here's a question: What if we didn't try to prevent the guest from
reading hypervisor memory at all, but instead just tried to make sure
that there was nothing of interest there?
If sensitive information pertaining to a given vcpu were only mapped on
the processor currently running that vcpu, then it would mitigate not
only SP3, but also SP2 and SP1.
-George
* Re: [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution
2018-01-05 14:11 ` George Dunlap
@ 2018-01-05 14:17 ` Juergen Gross
2018-01-05 14:21 ` George Dunlap
2018-01-05 14:27 ` Jan Beulich
1 sibling, 1 reply; 61+ messages in thread
From: Juergen Gross @ 2018-01-05 14:17 UTC (permalink / raw)
To: George Dunlap; +Cc: Andrew Cooper, Xen-devel
On 05/01/18 15:11, George Dunlap wrote:
> On Fri, Jan 5, 2018 at 9:39 AM, Juergen Gross <jgross@suse.com> wrote:
>> On 05/01/18 10:26, Andrew Cooper wrote:
>>> On 05/01/2018 07:48, Juergen Gross wrote:
>>>> On 04/01/18 21:21, Andrew Cooper wrote:
>>>>> This work was developed as an SP3 mitigation, but shelved when it became clear
>>>>> that it wasn't viable to get done in the timeframe.
>>>>>
>>>>> To protect against SP3 attacks, most mappings needs to be flushed while in
>>>>> user context. However, to protect against all cross-VM attacks, it is
>>>>> necessary to ensure that the Xen stacks are not mapped in any other cpus
>>>>> address space, or an attacker can still recover at least the GPR state of
>>>>> separate VMs.
>>>> Above statement is too strict: it would be sufficient if no stacks of
>>>> other domains are mapped.
>>>
>>> Sadly not. Having stacks shared by domain means one vcpu can still
>>> steal at least GPR state from other vcpus belonging to the same domain.
>>>
>>> Whether or not a specific kernel cares, some definitely will.
>>>
>>>> I'm just working on a proof of concept using dedicated per-vcpu stacks
>>>> for 64 bit pv domains. Those stacks would be mapped in the per-domain
>>>> region of the address space. I hope to have a RFC version of the patches
>>>> ready next week.
>>>>
>>>> This would allow to remove the per physical cpu mappings in the guest
>>>> visible address space when doing page table isolation.
>>>>
>>>> In order to avoid SP3 attacks to other vcpu's stacks of the same guest
>>>> we could extend the pv ABI to mark a guest's user L4 page table as
>>>> "single use", i.e. not allowed to be active on multiple vcpus at the
>>>> same time (introducing that ABI modification in the Linux kernel would
>>>> be simple, as the Linux kernel currently lacks support for cross-cpu
>>>> stack exploits and when that support is being added by per-cpu L4 user
>>>> page tables we could just chime in). A L4 page table marked as "single
>>>> use" would map the local vcpu stacks only.
>>>
>>> For PV guests, it is the Xen stacks which matter, not the vcpu guest
>>> kernel's ones.
>>
>> Indeed. That's the reason I want to have per-vcpu Xen stacks.
>>
>>> 64bit PV guest kernels are already mitigated better than KPTI can ever
>>> manage, because there are no entry stacks or entry stubs required to be
>>> mapped into guest userspace at all.
>>
>> But without Xen being secured via a mechanism similar to KPTI this
>> is moot, as user mode can exploit the whole host including the own
>> kernel's memory.
>
> Here's a question: What if we didn't try to prevent the guest from
> reading hypervisor memory at all, but instead just tried to make sure
> that there was nothing of interest there?
>
> If sensitive information pertaining to a given vcpu were only maped on
> the processor currently running that vcpu, then it would mitigate not
> only SP3, but also SP2 and SP1.
You are aware this includes the mappings used when running in the
hypervisor? I.e. the mapping of the host's physical memory...
Juergen
* Re: [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution
2018-01-05 14:17 ` Juergen Gross
@ 2018-01-05 14:21 ` George Dunlap
2018-01-05 14:28 ` Jan Beulich
0 siblings, 1 reply; 61+ messages in thread
From: George Dunlap @ 2018-01-05 14:21 UTC (permalink / raw)
To: Juergen Gross; +Cc: Andrew Cooper, Xen-devel
On Fri, Jan 5, 2018 at 2:17 PM, Juergen Gross <jgross@suse.com> wrote:
> On 05/01/18 15:11, George Dunlap wrote:
>> On Fri, Jan 5, 2018 at 9:39 AM, Juergen Gross <jgross@suse.com> wrote:
>>> On 05/01/18 10:26, Andrew Cooper wrote:
>>>> On 05/01/2018 07:48, Juergen Gross wrote:
>>>>> On 04/01/18 21:21, Andrew Cooper wrote:
>>>>>> This work was developed as an SP3 mitigation, but shelved when it became clear
>>>>>> that it wasn't viable to get done in the timeframe.
>>>>>>
>>>>>> To protect against SP3 attacks, most mappings needs to be flushed while in
>>>>>> user context. However, to protect against all cross-VM attacks, it is
>>>>>> necessary to ensure that the Xen stacks are not mapped in any other cpus
>>>>>> address space, or an attacker can still recover at least the GPR state of
>>>>>> separate VMs.
>>>>> Above statement is too strict: it would be sufficient if no stacks of
>>>>> other domains are mapped.
>>>>
>>>> Sadly not. Having stacks shared by domain means one vcpu can still
>>>> steal at least GPR state from other vcpus belonging to the same domain.
>>>>
>>>> Whether or not a specific kernel cares, some definitely will.
>>>>
>>>>> I'm just working on a proof of concept using dedicated per-vcpu stacks
>>>>> for 64 bit pv domains. Those stacks would be mapped in the per-domain
>>>>> region of the address space. I hope to have a RFC version of the patches
>>>>> ready next week.
>>>>>
>>>>> This would allow to remove the per physical cpu mappings in the guest
>>>>> visible address space when doing page table isolation.
>>>>>
>>>>> In order to avoid SP3 attacks to other vcpu's stacks of the same guest
>>>>> we could extend the pv ABI to mark a guest's user L4 page table as
>>>>> "single use", i.e. not allowed to be active on multiple vcpus at the
>>>>> same time (introducing that ABI modification in the Linux kernel would
>>>>> be simple, as the Linux kernel currently lacks support for cross-cpu
>>>>> stack exploits and when that support is being added by per-cpu L4 user
>>>>> page tables we could just chime in). A L4 page table marked as "single
>>>>> use" would map the local vcpu stacks only.
>>>>
>>>> For PV guests, it is the Xen stacks which matter, not the vcpu guest
>>>> kernel's ones.
>>>
>>> Indeed. That's the reason I want to have per-vcpu Xen stacks.
>>>
>>>> 64bit PV guest kernels are already mitigated better than KPTI can ever
>>>> manage, because there are no entry stacks or entry stubs required to be
>>>> mapped into guest userspace at all.
>>>
>>> But without Xen being secured via a mechanism similar to KPTI this
>>> is moot, as user mode can exploit the whole host including the own
>>> kernel's memory.
>>
>> Here's a question: What if we didn't try to prevent the guest from
>> reading hypervisor memory at all, but instead just tried to make sure
>> that there was nothing of interest there?
>>
>> If sensitive information pertaining to a given vcpu were only maped on
>> the processor currently running that vcpu, then it would mitigate not
>> only SP3, but also SP2 and SP1.
>
> You are aware this includes the mappings when running in the hypervisor?
> So i.e. the mapping of physical memory of the host...
Yes, of course. You'd have to map domain memory on-demand, and make
sure it was unmapped before switching to a different domain. (And in
the case of 64-bit PV guests, before switching back to guest space.)
And you'd have to try to identify as much 'sensitive' information as
possible and move it out of the xen-wide domain heap, into per-domain
structures.
We already have map_domain_page(), as a result of 32-bit mode and
>5TiB mode, so getting the domain pages out of the HV should be pretty
easy.
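As a minimal sketch of the map/unmap pattern (illustrative only; the
helper name is made up, and len is assumed to be at most PAGE_SIZE):

  static void copy_from_guest_frame(void *dst, mfn_t mfn, size_t len)
  {
      void *p = map_domain_page(mfn);   /* transient, per-cpu mapping */

      memcpy(dst, p, len);              /* brief, bounded access */
      unmap_domain_page(p);             /* nothing stays mapped afterwards */
  }

Anything not mapped through such a transient window simply isn't present
in the address space an attacker can probe.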
-George
* Re: [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution
2018-01-05 14:11 ` George Dunlap
2018-01-05 14:17 ` Juergen Gross
@ 2018-01-05 14:27 ` Jan Beulich
2018-01-05 14:35 ` Andrew Cooper
1 sibling, 1 reply; 61+ messages in thread
From: Jan Beulich @ 2018-01-05 14:27 UTC (permalink / raw)
To: George Dunlap; +Cc: Juergen Gross, Andrew Cooper, Xen-devel
>>> On 05.01.18 at 15:11, <dunlapg@umich.edu> wrote:
> Here's a question: What if we didn't try to prevent the guest from
> reading hypervisor memory at all, but instead just tried to make sure
> that there was nothing of interest there?
>
> If sensitive information pertaining to a given vcpu were only maped on
> the processor currently running that vcpu, then it would mitigate not
> only SP3, but also SP2 and SP1.
Unless there were hypervisor secrets pertaining to this guest.
Also, while the idea behind your question is certainly nice, fully
separating memories related to individual guests would come
at quite significant a price: No direct access to a random
domain's control structures would be possible anymore, which
I'd foresee to be a problem in particular when wanting to
forward interrupts / event channel operations to the right
destination. But as I've said elsewhere recently: With all the
workarounds now being put in place, perhaps we don't care
about performance all that much anymore anyway...
Jan
* Re: [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution
2018-01-05 14:21 ` George Dunlap
@ 2018-01-05 14:28 ` Jan Beulich
0 siblings, 0 replies; 61+ messages in thread
From: Jan Beulich @ 2018-01-05 14:28 UTC (permalink / raw)
To: Juergen Gross, George Dunlap; +Cc: Andrew Cooper, Xen-devel
>>> On 05.01.18 at 15:21, <dunlapg@umich.edu> wrote:
> We already have map_domain_page(), as a result of 32-bit mode and
>>5TiB mode, so getting the domain pages out of the HV should be pretty
> easy.
E.g. by doing away with the directmap altogether.
Jan
* Re: [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution
2018-01-05 14:27 ` Jan Beulich
@ 2018-01-05 14:35 ` Andrew Cooper
2018-01-08 11:41 ` George Dunlap
0 siblings, 1 reply; 61+ messages in thread
From: Andrew Cooper @ 2018-01-05 14:35 UTC (permalink / raw)
To: Jan Beulich, George Dunlap; +Cc: Juergen Gross, Xen-devel
On 05/01/18 14:27, Jan Beulich wrote:
>>>> On 05.01.18 at 15:11, <dunlapg@umich.edu> wrote:
>> Here's a question: What if we didn't try to prevent the guest from
>> reading hypervisor memory at all, but instead just tried to make sure
>> that there was nothing of interest there?
>>
>> If sensitive information pertaining to a given vcpu were only maped on
>> the processor currently running that vcpu, then it would mitigate not
>> only SP3, but also SP2 and SP1.
> Unless there were hypervisor secrets pertaining to this guest.
> Also, while the idea behind your question is certainly nice, fully
> separating memories related to individual guests would come
> at quite significant a price: No direct access to a random
> domain's control structures would be possible anymore, which
> I'd foresee to be a problem in particular when wanting to
> forward interrupts / event channel operations to the right
> destination. But as I've said elsewhere recently: With all the
> workarounds now being put in place, perhaps we don't care
> about performance all that much anymore anyway...
Even if we did manage to isolate the mappings to only domain-pertinent
information (which is hard, because interrupts still need to work), you
still don't protect against a piece of userspace using SP2 to attack a
co-scheduled piece of userspace in the domain.
~Andrew
* Re: [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution
2018-01-05 14:35 ` Andrew Cooper
@ 2018-01-08 11:41 ` George Dunlap
0 siblings, 0 replies; 61+ messages in thread
From: George Dunlap @ 2018-01-08 11:41 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Juergen Gross, Jan Beulich, Xen-devel
On Fri, Jan 5, 2018 at 2:35 PM, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 05/01/18 14:27, Jan Beulich wrote:
>>>>> On 05.01.18 at 15:11, <dunlapg@umich.edu> wrote:
>>> Here's a question: What if we didn't try to prevent the guest from
>>> reading hypervisor memory at all, but instead just tried to make sure
>>> that there was nothing of interest there?
>>>
>>> If sensitive information pertaining to a given vcpu were only maped on
>>> the processor currently running that vcpu, then it would mitigate not
>>> only SP3, but also SP2 and SP1.
>> Unless there were hypervisor secrets pertaining to this guest.
>> Also, while the idea behind your question is certainly nice, fully
>> separating memories related to individual guests would come
>> at quite significant a price: No direct access to a random
>> domain's control structures would be possible anymore, which
>> I'd foresee to be a problem in particular when wanting to
>> forward interrupts / event channel operations to the right
>> destination. But as I've said elsewhere recently: With all the
>> workarounds now being put in place, perhaps we don't care
>> about performance all that much anymore anyway...
>
> Even if we did manage to isolate the mappings to only domian-pertinant
> information (which is hard, because interrupts need to still work),
I didn't say "only map domain-pertinent information"; I said "only map
*sensitive* information". We don't need to eliminate all information,
nor even, up front, eliminate all side-channels, to be able to
mitigate a lot of these issues significantly.
"Read the entirety of another VM's memory" and "Can infer that a VM is
running on pcpu N and has received M interrupts to vector X" are
*very* different.
> you
> still don't protect against a piece of userspace using SP2 to attack a
> co-scheduled piece of userspace in the domain.
Yes, if we wanted to mitigate against userspace using SPX to access
guest kernel RAM (or that of another process), we'd need to make sure
to map only exactly what was needed and then unmap it when done. But
starting with protecting other guests is still worthwhile I think.
-George
* Re: [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution
2018-01-05 7:48 ` [PATCH FAIRLY-RFC 00/44] x86: Prerequisite work for a Xen KAISER solution Juergen Gross
2018-01-05 9:26 ` Andrew Cooper
@ 2018-01-09 23:14 ` Stefano Stabellini
1 sibling, 0 replies; 61+ messages in thread
From: Stefano Stabellini @ 2018-01-09 23:14 UTC (permalink / raw)
To: Juergen Gross; +Cc: Andrew Cooper, Xen-devel
On Fri, 5 Jan 2018, Juergen Gross wrote:
> On 04/01/18 21:21, Andrew Cooper wrote:
> > This work was developed as an SP3 mitigation, but shelved when it became clear
> > that it wasn't viable to get done in the timeframe.
> >
> > To protect against SP3 attacks, most mappings needs to be flushed while in
> > user context. However, to protect against all cross-VM attacks, it is
> > necessary to ensure that the Xen stacks are not mapped in any other cpus
> > address space, or an attacker can still recover at least the GPR state of
> > separate VMs.
>
> Above statement is too strict: it would be sufficient if no stacks of
> other domains are mapped.
>
> I'm just working on a proof of concept using dedicated per-vcpu stacks
> for 64 bit pv domains. Those stacks would be mapped in the per-domain
> region of the address space. I hope to have a RFC version of the patches
> ready next week.
>
> This would allow to remove the per physical cpu mappings in the guest
> visible address space when doing page table isolation.
>
> In order to avoid SP3 attacks to other vcpu's stacks of the same guest
> we could extend the pv ABI to mark a guest's user L4 page table as
> "single use", i.e. not allowed to be active on multiple vcpus at the
> same time (introducing that ABI modification in the Linux kernel would
> be simple, as the Linux kernel currently lacks support for cross-cpu
> stack exploits and when that support is being added by per-cpu L4 user
> page tables we could just chime in). A L4 page table marked as "single
> use" would map the local vcpu stacks only.
Regardless of what we do as a stop-gap now (vixen for example), I think
we need to continue pursuing this solution because it is the only one
that can mitigate SP3 when VT-x is not available.
I have several users exactly in this condition, and this is the only
hope for them.
I think this series should be a blocker for 4.11.
* Re: [PATCH RFC 01/44] passthrough/vtd: Don't DMA to the stack in queue_invalidate_wait()
2018-01-04 20:21 ` [PATCH RFC 01/44] passthrough/vtd: Don't DMA to the stack in queue_invalidate_wait() Andrew Cooper
2018-01-05 9:21 ` Jan Beulich
@ 2018-01-16 6:41 ` Tian, Kevin
1 sibling, 0 replies; 61+ messages in thread
From: Tian, Kevin @ 2018-01-16 6:41 UTC (permalink / raw)
To: Andrew Cooper, Xen-devel; +Cc: Julien Grall, Jan Beulich
> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: Friday, January 5, 2018 4:21 AM
>
> DMA-ing to the stack is generally considered bad practice. In this case, if a
> timeout occurs because of a sluggish device which is processing the request,
> the completion notification will corrupt the stack of a subsequent
> deeper call tree.
>
> Place the poll_slot in a percpu area and DMA to that instead.
>
> Note: This change does not address other issues with the current
> implementation, such as once a timeout has been suffered, subsequent
> completions can't be correlated with their requests.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
* Re: [PATCH RFC 31/44] x86/pv: Drop support for paging out the LDT
2018-01-04 20:21 ` [PATCH RFC 31/44] x86/pv: Drop support for paging out the LDT Andrew Cooper
@ 2018-01-24 11:04 ` Jan Beulich
0 siblings, 0 replies; 61+ messages in thread
From: Jan Beulich @ 2018-01-24 11:04 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Xen-devel
>>> On 04.01.18 at 21:21, <andrew.cooper3@citrix.com> wrote:
> Windows is the only OS which pages out kernel datastructures, so chances are
> good that this is a vestigial remnant of the PV Windows XP experiment.
> Furthermore the implementation is incomplete; it only functions for a present
> => not-present transition, rather than a present => read/write transition.
I've thought about this some more. For one, I don't understand the
"read/write" part above: descriptor table pages aren't permitted to
be writable in the first place. Did you mean "present -> present"? If
so, put_page_from_l1e() also gets called when an L1E is being
replaced, so it would seem to me that this (sub)case is handled
properly.
What instead isn't handled is the case where a higher level page
table entry mapping the LDT is being replaced.
Jan