* [PATCH V2 0/3] xen: remove some memory limits from pv-domains
@ 2014-09-17 4:12 Juergen Gross
From: Juergen Gross @ 2014-09-17 4:12 UTC (permalink / raw)
To: linux-kernel, xen-devel, konrad.wilk, boris.ostrovsky,
david.vrabel, jbeulich
Cc: Juergen Gross
When a Xen pv-domain is booted, the initial memory map contains multiple
objects in the top 2 GB, including the initrd and the p2m list. This
limits the supported maximum size of the initrd, and the maximum
memory size the p2m list can span is about 500 GB.
Xen, however, supports loading the initrd without mapping it, and the
initial p2m list can be mapped by Xen to an arbitrarily selected virtual
address. The following patches activate those options and thus remove
the limitations.
It should be noted that the p2m list limitation doesn't only affect
the amount of memory a pv domain can use; it also prevents Dom0 from
being started on physical systems with larger memory without reducing its
memory via a Xen boot parameter. By mapping the initial p2m list to
an area not in the top 2 GB it is now possible to boot Dom0 on such
systems.
It would be desirable to be able to use more than 512 GB in a pv
domain, but this would require a reorganization of the p2m tree built
by the kernel at boot time. As this reorganization would affect the
Xen tools and kexec, too, it is not included in this patch set. This
topic can be addressed later.
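For reference, the 512 GB figure follows directly from the tree layout:
each p2m page holds 512 8-byte entries, the tree has three levels, and
each leaf entry maps one 4 kB page. A standalone sketch of that
arithmetic (illustrative only, not part of the series):

#include <stdio.h>

int main(void)
{
	unsigned long entries = 4096 / 8;	/* 512 MFN entries per p2m page */
	unsigned long long span = (unsigned long long)entries *
				  entries * entries * 4096;

	printf("3-level p2m span: %llu GB\n", span >> 30);	/* prints 512 */
	return 0;
}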
Juergen Gross (3):
xen: sync some headers with xen tree
xen: eliminate scalability issues from initrd handling
xen: eliminate scalability issues from initial mapping setup
arch/x86/xen/enlighten.c | 11 +-
arch/x86/xen/mmu.c | 115 +++++++++++++++--
arch/x86/xen/setup.c | 65 +++++-----
arch/x86/xen/xen-head.S | 3 +
include/xen/interface/elfnote.h | 48 ++++++-
include/xen/interface/xen.h | 272 ++++++++++++++++++++++++++++++++++++----
6 files changed, 451 insertions(+), 63 deletions(-)
--
1.8.4.5
* [PATCH V2 1/3] xen: sync some headers with xen tree
From: Juergen Gross @ 2014-09-17 4:12 UTC (permalink / raw)
To: linux-kernel, xen-devel, konrad.wilk, boris.ostrovsky,
david.vrabel, jbeulich
Cc: Juergen Gross
To be able to use an initially unmapped initrd with xen, the following
header files must be synced to a newer version from the xen tree:
include/xen/interface/elfnote.h
include/xen/interface/xen.h
As the KEXEC- and DUMPCORE-related ELFNOTES are not relevant for the
kernel, they are omitted from elfnote.h.
Signed-off-by: Juergen Gross <jgross@suse.com>
---
include/xen/interface/elfnote.h | 48 ++++++-
include/xen/interface/xen.h | 272 ++++++++++++++++++++++++++++++++++++----
2 files changed, 294 insertions(+), 26 deletions(-)
diff --git a/include/xen/interface/elfnote.h b/include/xen/interface/elfnote.h
index 6f4eae3..f90b034 100644
--- a/include/xen/interface/elfnote.h
+++ b/include/xen/interface/elfnote.h
@@ -3,6 +3,24 @@
*
* Definitions used for the Xen ELF notes.
*
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
* Copyright (c) 2006, Ian Campbell, XenSource Ltd.
*/
@@ -18,12 +36,13 @@
*
* LEGACY indicated the fields in the legacy __xen_guest string which
* this a note type replaces.
+ *
+ * String values (for non-legacy) are NULL terminated ASCII, also known
+ * as ASCIZ type.
*/
/*
* NAME=VALUE pair (string).
- *
- * LEGACY: FEATURES and PAE
*/
#define XEN_ELFNOTE_INFO 0
@@ -137,10 +156,30 @@
/*
* Whether or not the guest supports cooperative suspend cancellation.
+ * This is a numeric value.
+ *
+ * Default is 0
*/
#define XEN_ELFNOTE_SUSPEND_CANCEL 14
/*
+ * The (non-default) location the initial phys-to-machine map should be
+ * placed at by the hypervisor (Dom0) or the tools (DomU).
+ * The kernel must be prepared for this mapping to be established using
+ * large pages, despite such otherwise not being available to guests.
+ * The kernel must also be able to handle the page table pages used for
+ * this mapping not being accessible through the initial mapping.
+ * (Only x86-64 supports this at present.)
+ */
+#define XEN_ELFNOTE_INIT_P2M 15
+
+/*
+ * Whether or not the guest can deal with being passed an initrd not
+ * mapped through its initial page tables.
+ */
+#define XEN_ELFNOTE_MOD_START_PFN 16
+
+/*
* The features supported by this kernel (numeric).
*
* Other than XEN_ELFNOTE_FEATURES on pre-4.2 Xen, this note allows a
@@ -153,6 +192,11 @@
*/
#define XEN_ELFNOTE_SUPPORTED_FEATURES 17
+/*
+ * The number of the highest elfnote defined.
+ */
+#define XEN_ELFNOTE_MAX XEN_ELFNOTE_SUPPORTED_FEATURES
+
#endif /* __XEN_PUBLIC_ELFNOTE_H__ */
/*
diff --git a/include/xen/interface/xen.h b/include/xen/interface/xen.h
index de08213..f68719f 100644
--- a/include/xen/interface/xen.h
+++ b/include/xen/interface/xen.h
@@ -3,6 +3,24 @@
*
* Guest OS interface to Xen.
*
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
* Copyright (c) 2004, K A Fraser
*/
@@ -73,13 +91,23 @@
* VIRTUAL INTERRUPTS
*
* Virtual interrupts that a guest OS may receive from Xen.
+ * In the side comments, 'V.' denotes a per-VCPU VIRQ while 'G.' denotes a
+ * global VIRQ. The former can be bound once per VCPU and cannot be re-bound.
+ * The latter can be allocated only once per guest: they must initially be
+ * allocated to VCPU0 but can subsequently be re-bound.
*/
-#define VIRQ_TIMER 0 /* Timebase update, and/or requested timeout. */
-#define VIRQ_DEBUG 1 /* Request guest to dump debug info. */
-#define VIRQ_CONSOLE 2 /* (DOM0) Bytes received on emergency console. */
-#define VIRQ_DOM_EXC 3 /* (DOM0) Exceptional event for some domain. */
-#define VIRQ_DEBUGGER 6 /* (DOM0) A domain has paused for debugging. */
-#define VIRQ_PCPU_STATE 9 /* (DOM0) PCPU state changed */
+#define VIRQ_TIMER 0 /* V. Timebase update, and/or requested timeout. */
+#define VIRQ_DEBUG 1 /* V. Request guest to dump debug info. */
+#define VIRQ_CONSOLE 2 /* G. (DOM0) Bytes received on emergency console. */
+#define VIRQ_DOM_EXC 3 /* G. (DOM0) Exceptional event for some domain. */
+#define VIRQ_TBUF 4 /* G. (DOM0) Trace buffer has records available. */
+#define VIRQ_DEBUGGER 6 /* G. (DOM0) A domain has paused for debugging. */
+#define VIRQ_XENOPROF 7 /* V. XenOprofile interrupt: new sample available */
+#define VIRQ_CON_RING 8 /* G. (DOM0) Bytes received on console */
+#define VIRQ_PCPU_STATE 9 /* G. (DOM0) PCPU state changed */
+#define VIRQ_MEM_EVENT 10 /* G. (DOM0) A memory event has occurred */
+#define VIRQ_XC_RESERVED 11 /* G. Reserved for XenClient */
+#define VIRQ_ENOMEM 12 /* G. (DOM0) Low on heap memory */
/* Architecture-specific VIRQ definitions. */
#define VIRQ_ARCH_0 16
@@ -92,24 +120,68 @@
#define VIRQ_ARCH_7 23
#define NR_VIRQS 24
+
/*
- * MMU-UPDATE REQUESTS
- *
- * HYPERVISOR_mmu_update() accepts a list of (ptr, val) pairs.
- * A foreigndom (FD) can be specified (or DOMID_SELF for none).
- * Where the FD has some effect, it is described below.
- * ptr[1:0] specifies the appropriate MMU_* command.
+ * enum neg_errnoval HYPERVISOR_mmu_update(const struct mmu_update reqs[],
+ * unsigned count, unsigned *done_out,
+ * unsigned foreigndom)
+ * @reqs is an array of mmu_update_t structures ((ptr, val) pairs).
+ * @count is the length of the above array.
+ * @pdone is an output parameter indicating number of completed operations
+ * @foreigndom[15:0]: FD, the expected owner of data pages referenced in this
+ * hypercall invocation. Can be DOMID_SELF.
+ * @foreigndom[31:16]: PFD, the expected owner of pagetable pages referenced
+ * in this hypercall invocation. The value of this field
+ * (x) encodes the PFD as follows:
+ * x == 0 => PFD == DOMID_SELF
+ * x != 0 => PFD == x - 1
*
+ * Sub-commands: ptr[1:0] specifies the appropriate MMU_* command.
+ * -------------
* ptr[1:0] == MMU_NORMAL_PT_UPDATE:
- * Updates an entry in a page table. If updating an L1 table, and the new
- * table entry is valid/present, the mapped frame must belong to the FD, if
- * an FD has been specified. If attempting to map an I/O page then the
- * caller assumes the privilege of the FD.
+ * Updates an entry in a page table belonging to PFD. If updating an L1 table,
+ * and the new table entry is valid/present, the mapped frame must belong to
+ * FD. If attempting to map an I/O page then the caller assumes the privilege
+ * of the FD.
* FD == DOMID_IO: Permit /only/ I/O mappings, at the priv level of the caller.
* FD == DOMID_XEN: Map restricted areas of Xen's heap space.
* ptr[:2] -- Machine address of the page-table entry to modify.
* val -- Value to write.
*
+ * There are also certain implicit requirements when using this hypercall. The
+ * pages that make up a pagetable must be mapped read-only in the guest.
+ * This prevents uncontrolled guest updates to the pagetable. Xen strictly
+ * enforces this, and will disallow any pagetable update which would end up
+ * mapping a pagetable page RW, and will disallow using any writable page as a
+ * pagetable. In practice this means that when constructing a page table for a
+ * process, thread, etc, we MUST be very diligent in following these rules:
+ * 1). Start with top-level page (PGD or in Xen language: L4). Fill out
+ * the entries.
+ * 2). Keep on going, filling out the upper (PUD or L3), and middle (PMD
+ * or L2).
+ * 3). Start filling out the PTE table (L1) with the PTE entries. Once
+ * done, make sure to set each of those entries to RO (so writeable bit
+ * is unset). Once that has been completed, set the PMD (L2) for this
+ * PTE table as RO.
+ * 4). When completed with all of the PMD (L2) entries, and all of them have
+ * been set to RO, make sure to set RO the PUD (L3). Do the same
+ * operation on PGD (L4) pagetable entries that have a PUD (L3) entry.
+ * 5). Now before you can use those pages (so setting the cr3), you MUST also
+ * pin them so that the hypervisor can verify the entries. This is done
+ * via the HYPERVISOR_mmuext_op(MMUEXT_PIN_L4_TABLE, guest physical frame
+ * number of the PGD (L4)). At this point the HYPERVISOR_mmuext_op(
+ * MMUEXT_NEW_BASEPTR, guest physical frame number of the PGD (L4)) can be
+ * issued.
+ * For 32-bit guests, the L4 is not used (as there are fewer pagetable
+ * levels), so instead use the L3.
+ * At this point the pagetables can be modified using the MMU_NORMAL_PT_UPDATE
+ * hypercall. Also if so desired the OS can also try to write to the PTE
+ * and be trapped by the hypervisor (as the PTE entry is RO).
+ *
+ * To deallocate the pages, the operations are the reverse of the steps
+ * mentioned above. The argument is MMUEXT_UNPIN_TABLE for all levels and the
+ * pagetable MUST not be in use (meaning that the cr3 is not set to it).
+ *
* ptr[1:0] == MMU_MACHPHYS_UPDATE:
* Updates an entry in the machine->pseudo-physical mapping table.
* ptr[:2] -- Machine address within the frame whose mapping to modify.
@@ -119,6 +191,72 @@
* ptr[1:0] == MMU_PT_UPDATE_PRESERVE_AD:
* As MMU_NORMAL_PT_UPDATE above, but A/D bits currently in the PTE are ORed
* with those in @val.
+ *
+ * @val is usually the machine frame number along with some attributes.
+ * The attributes by default follow the architecture-defined bits, meaning
+ * that if this is an x86-64 machine using the four-level page table layout,
+ * the layout of val is:
+ * - 63 if set means No execute (NX)
+ * - 46-13 the machine frame number
+ * - 12 available for guest
+ * - 11 available for guest
+ * - 10 available for guest
+ * - 9 available for guest
+ * - 8 global
+ * - 7 PAT (PSE is disabled, must use hypercall to make 4MB or 2MB pages)
+ * - 6 dirty
+ * - 5 accessed
+ * - 4 page cached disabled
+ * - 3 page write through
+ * - 2 userspace accessible
+ * - 1 writeable
+ * - 0 present
+ *
+ * The one bit that does not fit with the default layout is PAGE_PSE
+ * (also called PAGE_PAT). The MMUEXT_[UN]MARK_SUPER arguments to the
+ * HYPERVISOR_mmuext_op serve as a mechanism to set a page to be 4MB
+ * (or 2MB) instead of using the PAGE_PSE bit.
+ *
+ * The reason that PAGE_PSE (bit 7) is not being utilized is that Xen
+ * uses it as the Page Attribute Table (PAT) bit - for details on it please
+ * refer to Intel SDM 10.12. The PAT allows setting the caching attributes of
+ * pages instead of using MTRRs.
+ *
+ * The PAT MSR is as follows (it is a 64-bit value, each entry is 8 bits):
+ * PAT4 PAT0
+ * +-----+-----+----+----+----+-----+----+----+
+ * | UC | UC- | WC | WB | UC | UC- | WC | WB | <= Linux
+ * +-----+-----+----+----+----+-----+----+----+
+ * | UC | UC- | WT | WB | UC | UC- | WT | WB | <= BIOS (default when machine boots)
+ * +-----+-----+----+----+----+-----+----+----+
+ * | rsv | rsv | WP | WC | UC | UC- | WT | WB | <= Xen
+ * +-----+-----+----+----+----+-----+----+----+
+ *
+ * The lookup of this index table translates to looking up
+ * Bit 7, Bit 4, and Bit 3 of val entry:
+ *
+ * PAT/PSE (bit 7) ... PCD (bit 4) .. PWT (bit 3).
+ *
+ * If all bits are off, then we are using PAT0. If bit 3 is turned on,
+ * then we are using PAT1; if bit 3 and bit 4 are on, then PAT2.
+ *
+ * As you can see, the Linux PAT1 translates to PAT4 under Xen, which means
+ * that a guest that follows Linux's PAT setup and would like to set Write
+ * Combined on pages MUST use the PAT4 entry, i.e. have Bit 7 (PAGE_PAT)
+ * set. For example, Linux only uses PAT0, PAT1, and PAT2 for the
+ * caching as:
+ *
+ * WB = none (so PAT0)
+ * WC = PWT (bit 3 on)
+ * UC = PWT | PCD (bit 3 and 4 are on).
+ *
+ * To make it work with Xen, it needs to translate the WC bit like so:
+ *
+ * PWT (so bit 3 on) --> PAT (so bit 7 is on) and clear bit 3
+ *
+ * And to translate back it would:
+ *
+ * PAT (bit 7 on) --> PWT (bit 3 on) and clear bit 7.
*/
#define MMU_NORMAL_PT_UPDATE 0 /* checked '*ptr = val'. ptr is MA. */
#define MMU_MACHPHYS_UPDATE 1 /* ptr = MA of frame to modify entry for */
@@ -127,7 +265,12 @@
/*
* MMU EXTENDED OPERATIONS
*
- * HYPERVISOR_mmuext_op() accepts a list of mmuext_op structures.
+ * enum neg_errnoval HYPERVISOR_mmuext_op(mmuext_op_t uops[],
+ * unsigned int count,
+ * unsigned int *pdone,
+ * unsigned int foreigndom)
+ */
+/* HYPERVISOR_mmuext_op() accepts a list of mmuext_op structures.
* A foreigndom (FD) can be specified (or DOMID_SELF for none).
* Where the FD has some effect, it is described below.
*
@@ -164,9 +307,23 @@
* cmd: MMUEXT_FLUSH_CACHE
* No additional arguments. Writes back and flushes cache contents.
*
+ * cmd: MMUEXT_FLUSH_CACHE_GLOBAL
+ * No additional arguments. Writes back and flushes cache contents
+ * on all CPUs in the system.
+ *
* cmd: MMUEXT_SET_LDT
* linear_addr: Linear address of LDT base (NB. must be page-aligned).
* nr_ents: Number of entries in LDT.
+ *
+ * cmd: MMUEXT_CLEAR_PAGE
+ * mfn: Machine frame number to be cleared.
+ *
+ * cmd: MMUEXT_COPY_PAGE
+ * mfn: Machine frame number of the destination page.
+ * src_mfn: Machine frame number of the source page.
+ *
+ * cmd: MMUEXT_[UN]MARK_SUPER
+ * mfn: Machine frame number of head of superpage to be [un]marked.
*/
#define MMUEXT_PIN_L1_TABLE 0
#define MMUEXT_PIN_L2_TABLE 1
@@ -183,12 +340,18 @@
#define MMUEXT_FLUSH_CACHE 12
#define MMUEXT_SET_LDT 13
#define MMUEXT_NEW_USER_BASEPTR 15
+#define MMUEXT_CLEAR_PAGE 16
+#define MMUEXT_COPY_PAGE 17
+#define MMUEXT_FLUSH_CACHE_GLOBAL 18
+#define MMUEXT_MARK_SUPER 19
+#define MMUEXT_UNMARK_SUPER 20
#ifndef __ASSEMBLY__
struct mmuext_op {
unsigned int cmd;
union {
- /* [UN]PIN_TABLE, NEW_BASEPTR, NEW_USER_BASEPTR */
+ /* [UN]PIN_TABLE, NEW_BASEPTR, NEW_USER_BASEPTR
+ * CLEAR_PAGE, COPY_PAGE, [UN]MARK_SUPER */
xen_pfn_t mfn;
/* INVLPG_LOCAL, INVLPG_ALL, SET_LDT */
unsigned long linear_addr;
@@ -198,6 +361,8 @@ struct mmuext_op {
unsigned int nr_ents;
/* TLB_FLUSH_MULTI, INVLPG_MULTI */
void *vcpumask;
+ /* COPY_PAGE */
+ xen_pfn_t src_mfn;
} arg2;
};
DEFINE_GUEST_HANDLE_STRUCT(mmuext_op);
@@ -225,10 +390,23 @@ DEFINE_GUEST_HANDLE_STRUCT(mmuext_op);
*/
#define VMASST_CMD_enable 0
#define VMASST_CMD_disable 1
+
+/* x86/32 guests: simulate full 4GB segment limits. */
#define VMASST_TYPE_4gb_segments 0
+
+/* x86/32 guests: trap (vector 15) whenever above vmassist is used. */
#define VMASST_TYPE_4gb_segments_notify 1
+
+/*
+ * x86 guests: support writes to bottom-level PTEs.
+ * NB1. Page-directory entries cannot be written.
+ * NB2. Guest must continue to remove all writable mappings of PTEs.
+ */
#define VMASST_TYPE_writable_pagetables 2
+
+/* x86/PAE guests: support PDPTs above 4GB. */
#define VMASST_TYPE_pae_extended_cr3 3
+
#define MAX_VMASST_TYPE 3
#ifndef __ASSEMBLY__
@@ -260,6 +438,15 @@ typedef uint16_t domid_t;
*/
#define DOMID_XEN (0x7FF2U)
+/* DOMID_COW is used as the owner of sharable pages */
+#define DOMID_COW (0x7FF3U)
+
+/* DOMID_INVALID is used to identify pages with unknown owner. */
+#define DOMID_INVALID (0x7FF4U)
+
+/* Idle domain. */
+#define DOMID_IDLE (0x7FFFU)
+
/*
* Send an array of these to HYPERVISOR_mmu_update().
* NB. The fields are natural pointer/address size for this architecture.
@@ -272,7 +459,9 @@ DEFINE_GUEST_HANDLE_STRUCT(mmu_update);
/*
* Send an array of these to HYPERVISOR_multicall().
- * NB. The fields are natural register size for this architecture.
+ * NB. The fields are logically the natural register size for this
+ * architecture. In cases where xen_ulong_t is larger than this then
+ * any unused bits in the upper portion must be zero.
*/
struct multicall_entry {
xen_ulong_t op;
@@ -442,8 +631,48 @@ struct start_info {
unsigned long mod_start; /* VIRTUAL address of pre-loaded module. */
unsigned long mod_len; /* Size (bytes) of pre-loaded module. */
int8_t cmd_line[MAX_GUEST_CMDLINE];
+ /* The pfn range here covers both page table and p->m table frames. */
+ unsigned long first_p2m_pfn;/* 1st pfn forming initial P->M table. */
+ unsigned long nr_p2m_frames;/* # of pfns forming initial P->M table. */
};
+/* These flags are passed in the 'flags' field of start_info_t. */
+#define SIF_PRIVILEGED (1<<0) /* Is the domain privileged? */
+#define SIF_INITDOMAIN (1<<1) /* Is this the initial control domain? */
+#define SIF_MULTIBOOT_MOD (1<<2) /* Is mod_start a multiboot module? */
+#define SIF_MOD_START_PFN (1<<3) /* Is mod_start a PFN? */
+#define SIF_PM_MASK (0xFF<<8) /* reserve 1 byte for xen-pm options */
+
+/*
+ * A multiboot module is a package containing modules very similar to a
+ * multiboot module array. The only differences are:
+ * - the array of module descriptors is by convention simply at the beginning
+ * of the multiboot module,
+ * - addresses in the module descriptors are based on the beginning of the
+ * multiboot module,
+ * - the number of modules is determined by a termination descriptor that has
+ * mod_start == 0.
+ *
+ * This permits both building it statically and referencing it in a
+ * configuration file, and lets the PV guest easily rebase the addresses to
+ * virtual addresses and at the same time count the number of modules.
+ */
+struct xen_multiboot_mod_list {
+ /* Address of first byte of the module */
+ uint32_t mod_start;
+ /* Address of last byte of the module (inclusive) */
+ uint32_t mod_end;
+ /* Address of zero-terminated command line */
+ uint32_t cmdline;
+ /* Unused, must be zero */
+ uint32_t pad;
+};
+/*
+ * The console structure in start_info.console.dom0
+ *
+ * This structure includes a variety of information required to
+ * have a working VGA/VESA console.
+ */
struct dom0_vga_console_info {
uint8_t video_type;
#define XEN_VGATYPE_TEXT_MODE_3 0x03
@@ -484,11 +713,6 @@ struct dom0_vga_console_info {
} u;
};
-/* These flags are passed in the 'flags' field of start_info_t. */
-#define SIF_PRIVILEGED (1<<0) /* Is the domain privileged? */
-#define SIF_INITDOMAIN (1<<1) /* Is this the initial control domain? */
-#define SIF_PM_MASK (0xFF<<8) /* reserve 1 byte for xen-pm options */
-
typedef uint64_t cpumap_t;
typedef uint8_t xen_domain_handle_t[16];
--
1.8.4.5
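As an aside on the pagetable construction rules documented in the new
xen.h comment above: the final pin-and-load step (rule 5) boils down to
two mmuext operations. A minimal sketch, assuming a fully constructed,
read-only L4 whose machine frame number is l4_mfn and assuming the usual
hypercall wrappers are in scope; error handling is reduced to a BUG_ON
and the helper name is illustrative, not taken from the patch:

/* Pin a completed L4 so Xen verifies the whole tree, then switch to it. */
static void load_new_l4(xen_pfn_t l4_mfn)
{
	struct mmuext_op op[2];

	op[0].cmd = MMUEXT_PIN_L4_TABLE;	/* Xen validates all entries */
	op[0].arg1.mfn = l4_mfn;
	op[1].cmd = MMUEXT_NEW_BASEPTR;		/* then point cr3 at the L4 */
	op[1].arg1.mfn = l4_mfn;

	BUG_ON(HYPERVISOR_mmuext_op(op, 2, NULL, DOMID_SELF) < 0);
}

Unpinning uses MMUEXT_UNPIN_TABLE in the same way, with the restriction
noted above that the table must no longer be referenced by cr3.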
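Likewise, the WC translation between Linux's and Xen's PAT layouts
described above is a plain swap between the PWT and PAT bits. A minimal
sketch, assuming the standard x86 PTE bit positions (PWT = bit 3,
PCD = bit 4, PAT = bit 7); the macro and helper names are illustrative:

#define PTE_PWT	(1UL << 3)	/* page write-through */
#define PTE_PCD	(1UL << 4)	/* page cache disabled */
#define PTE_PAT	(1UL << 7)	/* PAT index bit (PSE position) */

/* Linux encodes WC as PWT alone; Xen's PAT table has WC at index 4. */
static unsigned long wc_linux_to_xen(unsigned long pte)
{
	if ((pte & (PTE_PWT | PTE_PCD)) == PTE_PWT)
		pte = (pte & ~PTE_PWT) | PTE_PAT;	/* WC: PAT4 under Xen */
	return pte;
}

/* And back: PAT set (index 4, WC under Xen) becomes PWT (Linux WC). */
static unsigned long wc_xen_to_linux(unsigned long pte)
{
	if (pte & PTE_PAT)
		pte = (pte & ~PTE_PAT) | PTE_PWT;
	return pte;
}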
* [PATCH V2 2/3] xen: eliminate scalability issues from initrd handling
From: Juergen Gross @ 2014-09-17 4:12 UTC (permalink / raw)
To: linux-kernel, xen-devel, konrad.wilk, boris.ostrovsky,
david.vrabel, jbeulich
Cc: Juergen Gross
The initrd getting mapped into the initial mapping resulted in size
restrictions that native kernels wouldn't have. The kernel doesn't really
need the initrd to be mapped, so use infrastructure available in Xen to
avoid the mapping and hence the restriction.
Signed-off-by: Juergen Gross <jgross@suse.com>
---
arch/x86/xen/enlighten.c | 11 +++++++++--
arch/x86/xen/xen-head.S | 1 +
2 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index c0cb11f..8fd075f 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1519,6 +1519,7 @@ static void __init xen_pvh_early_guest_init(void)
asmlinkage __visible void __init xen_start_kernel(void)
{
struct physdev_set_iopl set_iopl;
+ unsigned long initrd_start = 0;
int rc;
if (!xen_start_info)
@@ -1667,10 +1668,16 @@ asmlinkage __visible void __init xen_start_kernel(void)
new_cpu_data.x86_capability[0] = cpuid_edx(1);
#endif
+ if (xen_start_info->mod_start)
+ initrd_start = __pa(xen_start_info->mod_start);
+#ifdef CONFIG_BLK_DEV_INITRD
+ if (xen_start_info->flags & SIF_MOD_START_PFN)
+ initrd_start = PFN_PHYS(xen_start_info->mod_start);
+#endif
+
/* Poke various useful things into boot_params */
boot_params.hdr.type_of_loader = (9 << 4) | 0;
- boot_params.hdr.ramdisk_image = xen_start_info->mod_start
- ? __pa(xen_start_info->mod_start) : 0;
+ boot_params.hdr.ramdisk_image = initrd_start;
boot_params.hdr.ramdisk_size = xen_start_info->mod_len;
boot_params.hdr.cmd_line_ptr = __pa(xen_start_info->cmd_line);
diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
index 485b695..46408e5 100644
--- a/arch/x86/xen/xen-head.S
+++ b/arch/x86/xen/xen-head.S
@@ -124,6 +124,7 @@ NEXT_HYPERCALL(arch_6)
ELFNOTE(Xen, XEN_ELFNOTE_L1_MFN_VALID,
.quad _PAGE_PRESENT; .quad _PAGE_PRESENT)
ELFNOTE(Xen, XEN_ELFNOTE_SUSPEND_CANCEL, .long 1)
+ ELFNOTE(Xen, XEN_ELFNOTE_MOD_START_PFN, .long 1)
ELFNOTE(Xen, XEN_ELFNOTE_HV_START_LOW, _ASM_PTR __HYPERVISOR_VIRT_START)
ELFNOTE(Xen, XEN_ELFNOTE_PADDR_OFFSET, _ASM_PTR 0)
--
1.8.4.5
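In short, mod_start now has two possible encodings: without the new ELF
note it remains a virtual address inside the initial mapping, while with
SIF_MOD_START_PFN set it is a bare page frame number of an unmapped
initrd. A sketch of the decode step, mirroring the hunk above (the
helper is illustrative only, not part of the patch):

/* Derive the initrd's physical start address from start_info. */
static unsigned long __init xen_initrd_start(const struct start_info *si)
{
	if (!si->mod_start)
		return 0;			/* no initrd supplied */
	if (si->flags & SIF_MOD_START_PFN)
		return PFN_PHYS(si->mod_start);	/* mod_start is a PFN */
	return __pa(si->mod_start);		/* mod_start is a VA */
}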
* [PATCH V2 3/3] xen: eliminate scalability issues from initial mapping setup
From: Juergen Gross @ 2014-09-17 4:12 UTC (permalink / raw)
To: linux-kernel, xen-devel, konrad.wilk, boris.ostrovsky,
david.vrabel, jbeulich
Cc: Juergen Gross
Direct Xen to place the initial P->M table outside of the initial
mapping, as otherwise the 1G (implementation) / 2G (theoretical)
restriction on the size of the initial mapping limits the amount
of memory a domain can be handed initially.
As the initial P->M table is copied rather early during boot to
domain private memory and its initial virtual mapping is dropped,
the easiest way to avoid virtual address conflicts with other
addresses in the kernel is to use a user address area for the
virtual address of the initial P->M table. This allows us to just
throw away the page tables of the initial mapping after the copy
without having to care about address invalidation.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
arch/x86/xen/mmu.c | 115 +++++++++++++++++++++++++++++++++++++++++++++---
arch/x86/xen/setup.c | 65 +++++++++++++++------------
arch/x86/xen/xen-head.S | 2 +
3 files changed, 147 insertions(+), 35 deletions(-)
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 16fb009..0880330 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1198,6 +1198,76 @@ static void __init xen_cleanhighmap(unsigned long vaddr,
* instead of somewhere later and be confusing. */
xen_mc_flush();
}
+
+/*
+ * Make a page range writeable and free it.
+ */
+static void __init xen_free_ro_pages(unsigned long paddr, unsigned long size)
+{
+ void *vaddr = __va(paddr);
+ void *vaddr_end = vaddr + size;
+
+ for (; vaddr < vaddr_end; vaddr += PAGE_SIZE)
+ make_lowmem_page_readwrite(vaddr);
+
+ memblock_free(paddr, size);
+}
+
+/*
+ * Since it is well isolated we can (and since it is perhaps large we should)
+ * also free the page tables mapping the initial P->M table.
+ */
+static void __init xen_cleanmfnmap(unsigned long vaddr)
+{
+ unsigned long va = vaddr & PMD_MASK;
+ unsigned long pa;
+ pgd_t *pgd = pgd_offset_k(va);
+ pud_t *pud_page = pud_offset(pgd, 0);
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+ unsigned int i;
+
+ set_pgd(pgd, __pgd(0));
+ do {
+ pud = pud_page + pud_index(va);
+ if (pud_none(*pud)) {
+ va += PUD_SIZE;
+ } else if (pud_large(*pud)) {
+ pa = pud_val(*pud) & PHYSICAL_PAGE_MASK;
+ xen_free_ro_pages(pa, PUD_SIZE);
+ va += PUD_SIZE;
+ } else {
+ pmd = pmd_offset(pud, va);
+ if (pmd_large(*pmd)) {
+ pa = pmd_val(*pmd) & PHYSICAL_PAGE_MASK;
+ xen_free_ro_pages(pa, PMD_SIZE);
+ } else if (!pmd_none(*pmd)) {
+ pte = pte_offset_kernel(pmd, va);
+ for (i = 0; i < PTRS_PER_PTE; ++i) {
+ if (pte_none(pte[i]))
+ break;
+ pa = pte_pfn(pte[i]) << PAGE_SHIFT;
+ xen_free_ro_pages(pa, PAGE_SIZE);
+ }
+ pa = __pa(pte) & PHYSICAL_PAGE_MASK;
+ ClearPagePinned(virt_to_page(__va(pa)));
+ xen_free_ro_pages(pa, PAGE_SIZE);
+ }
+ va += PMD_SIZE;
+ if (pmd_index(va))
+ continue;
+ pa = __pa(pmd) & PHYSICAL_PAGE_MASK;
+ ClearPagePinned(virt_to_page(__va(pa)));
+ xen_free_ro_pages(pa, PAGE_SIZE);
+ }
+
+ } while (pud_index(va) || pmd_index(va));
+ pa = __pa(pud_page) & PHYSICAL_PAGE_MASK;
+ ClearPagePinned(virt_to_page(__va(pa)));
+ xen_free_ro_pages(pa, PAGE_SIZE);
+}
+
static void __init xen_pagetable_p2m_copy(void)
{
unsigned long size;
@@ -1217,18 +1287,23 @@ static void __init xen_pagetable_p2m_copy(void)
/* using __ka address and sticking INVALID_P2M_ENTRY! */
memset((void *)xen_start_info->mfn_list, 0xff, size);
- /* We should be in __ka space. */
- BUG_ON(xen_start_info->mfn_list < __START_KERNEL_map);
addr = xen_start_info->mfn_list;
- /* We roundup to the PMD, which means that if anybody at this stage is
+ /* We could be in __ka space.
+ * We round up to the PMD, which means that if anybody at this stage is
* using the __ka address of xen_start_info or xen_start_info->shared_info
* they are going to crash. Fortunately we have already revectored
* in xen_setup_kernel_pagetable and in xen_setup_shared_info. */
size = roundup(size, PMD_SIZE);
- xen_cleanhighmap(addr, addr + size);
- size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
- memblock_free(__pa(xen_start_info->mfn_list), size);
+ if (addr >= __START_KERNEL_map) {
+ xen_cleanhighmap(addr, addr + size);
+ size = PAGE_ALIGN(xen_start_info->nr_pages *
+ sizeof(unsigned long));
+ memblock_free(__pa(addr), size);
+ } else {
+ xen_cleanmfnmap(addr);
+ }
+
/* And revector! Bye bye old array */
xen_start_info->mfn_list = new_mfn_list;
@@ -1529,6 +1604,22 @@ static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
#else /* CONFIG_X86_64 */
static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
{
+ unsigned long pfn;
+
+ if (xen_feature(XENFEAT_writable_page_tables) ||
+ xen_feature(XENFEAT_auto_translated_physmap) ||
+ xen_start_info->mfn_list >= __START_KERNEL_map)
+ return pte;
+
+ /*
+ * Pages belonging to the initial p2m list mapped outside the default
+ * address range must be mapped read-only.
+ */
+ pfn = pte_pfn(pte);
+ if (pfn >= xen_start_info->first_p2m_pfn &&
+ pfn < xen_start_info->first_p2m_pfn + xen_start_info->nr_p2m_frames)
+ pte = __pte_ma(pte_val_ma(pte) & ~_PAGE_RW);
+
return pte;
}
#endif /* CONFIG_X86_64 */
@@ -1884,7 +1975,10 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
* mappings. Considering that on Xen after the kernel mappings we
* have the mappings of some pages that don't exist in pfn space, we
* set max_pfn_mapped to the last real pfn mapped. */
- max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->mfn_list));
+ if (xen_start_info->mfn_list < __START_KERNEL_map)
+ max_pfn_mapped = xen_start_info->first_p2m_pfn;
+ else
+ max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->mfn_list));
pt_base = PFN_DOWN(__pa(xen_start_info->pt_base));
pt_end = pt_base + xen_start_info->nr_pt_frames;
@@ -1924,6 +2018,11 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
/* Graft it onto L4[511][510] */
copy_page(level2_kernel_pgt, l2);
+ /* Copy the initial P->M table mappings if necessary. */
+ i = pgd_index(xen_start_info->mfn_list);
+ if (i && i < pgd_index(__START_KERNEL_map))
+ init_level4_pgt[i] = ((pgd_t *)xen_start_info->pt_base)[i];
+
if (!xen_feature(XENFEAT_auto_translated_physmap)) {
/* Make pagetable pieces RO */
set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
@@ -1964,6 +2063,8 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
/* Our (by three pages) smaller Xen pagetable that we are using */
memblock_reserve(PFN_PHYS(pt_base), (pt_end - pt_base) * PAGE_SIZE);
+ /* protect xen_start_info */
+ memblock_reserve(__pa(xen_start_info), PAGE_SIZE);
/* Revector the xen_start_info */
xen_start_info = (struct start_info *)__va(__pa(xen_start_info));
}
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 2e555163..6412367 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -333,6 +333,41 @@ void xen_ignore_unusable(struct e820entry *list, size_t map_size)
}
}
+/*
+ * Reserve Xen mfn_list.
+ * See comment above "struct start_info" in <xen/interface/xen.h>
+ * We tried to make the the memblock_reserve more selective so
+ * that it would be clear what region is reserved. Sadly we ran
+ * in the problem wherein on a 64-bit hypervisor with a 32-bit
+ * initial domain, the pt_base has the cr3 value which is not
+ * neccessarily where the pagetable starts! As Jan put it: "
+ * Actually, the adjustment turns out to be correct: The page
+ * tables for a 32-on-64 dom0 get allocated in the order "first L1",
+ * "first L2", "first L3", so the offset to the page table base is
+ * indeed 2. When reading xen/include/public/xen.h's comment
+ * very strictly, this is not a violation (since there nothing is said
+ * that the first thing in the page table space is pointed to by
+ * pt_base; I admit that this seems to be implied though, namely
+ * do I think that it is implied that the page table space is the
+ * range [pt_base, pt_base + nt_pt_frames), whereas that
+ * range here indeed is [pt_base - 2, pt_base - 2 + nt_pt_frames),
+ * which - without a priori knowledge - the kernel would have
+ * difficulty to figure out)." - so lets just fall back to the
+ * easy way and reserve the whole region.
+ */
+static void __init xen_reserve_xen_mfnlist(void)
+{
+ if (xen_start_info->mfn_list >= __START_KERNEL_map) {
+ memblock_reserve(__pa(xen_start_info->mfn_list),
+ xen_start_info->pt_base -
+ xen_start_info->mfn_list);
+ return;
+ }
+
+ memblock_reserve(PFN_PHYS(xen_start_info->first_p2m_pfn),
+ PFN_PHYS(xen_start_info->nr_p2m_frames));
+}
+
/**
* machine_specific_memory_setup - Hook for machine specific memory setup.
**/
@@ -467,32 +502,7 @@ char * __init xen_memory_setup(void)
e820_add_region(ISA_START_ADDRESS, ISA_END_ADDRESS - ISA_START_ADDRESS,
E820_RESERVED);
- /*
- * Reserve Xen bits:
- * - mfn_list
- * - xen_start_info
- * See comment above "struct start_info" in <xen/interface/xen.h>
- * We tried to make the the memblock_reserve more selective so
- * that it would be clear what region is reserved. Sadly we ran
- * in the problem wherein on a 64-bit hypervisor with a 32-bit
- * initial domain, the pt_base has the cr3 value which is not
- * neccessarily where the pagetable starts! As Jan put it: "
- * Actually, the adjustment turns out to be correct: The page
- * tables for a 32-on-64 dom0 get allocated in the order "first L1",
- * "first L2", "first L3", so the offset to the page table base is
- * indeed 2. When reading xen/include/public/xen.h's comment
- * very strictly, this is not a violation (since there nothing is said
- * that the first thing in the page table space is pointed to by
- * pt_base; I admit that this seems to be implied though, namely
- * do I think that it is implied that the page table space is the
- * range [pt_base, pt_base + nt_pt_frames), whereas that
- * range here indeed is [pt_base - 2, pt_base - 2 + nt_pt_frames),
- * which - without a priori knowledge - the kernel would have
- * difficulty to figure out)." - so lets just fall back to the
- * easy way and reserve the whole region.
- */
- memblock_reserve(__pa(xen_start_info->mfn_list),
- xen_start_info->pt_base - xen_start_info->mfn_list);
+ xen_reserve_xen_mfnlist();
sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &e820.nr_map);
@@ -522,8 +532,7 @@ char * __init xen_auto_xlated_memory_setup(void)
for (i = 0; i < memmap.nr_entries; i++)
e820_add_region(map[i].addr, map[i].size, map[i].type);
- memblock_reserve(__pa(xen_start_info->mfn_list),
- xen_start_info->pt_base - xen_start_info->mfn_list);
+ xen_reserve_xen_mfnlist();
return "Xen";
}
diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
index 46408e5..e7bd668 100644
--- a/arch/x86/xen/xen-head.S
+++ b/arch/x86/xen/xen-head.S
@@ -112,6 +112,8 @@ NEXT_HYPERCALL(arch_6)
ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE, _ASM_PTR __PAGE_OFFSET)
#else
ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE, _ASM_PTR __START_KERNEL_map)
+ /* Map the p2m table to a 512GB-aligned user address. */
+ ELFNOTE(Xen, XEN_ELFNOTE_INIT_P2M, .quad PGDIR_SIZE)
#endif
ELFNOTE(Xen, XEN_ELFNOTE_ENTRY, _ASM_PTR startup_xen)
ELFNOTE(Xen, XEN_ELFNOTE_HYPERCALL_PAGE, _ASM_PTR hypercall_page)
--
1.8.4.5
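For reference on the XEN_ELFNOTE_INIT_P2M value above: on x86-64 each
L4 (PGD) slot covers PGDIR_SIZE = 2^39 bytes = 512 GB of virtual address
space, so requesting PGDIR_SIZE places the initial p2m list in the
second, otherwise unused user-space slot. A quick standalone sanity
sketch of the arithmetic (illustrative only):

#include <stdio.h>

#define PGDIR_SHIFT	39			/* 4-level x86-64 paging */
#define PGDIR_SIZE	(1UL << PGDIR_SHIFT)	/* 512 GB per L4 entry */

int main(void)
{
	unsigned long va = PGDIR_SIZE;		/* requested p2m virtual base */

	/* prints: p2m VA 0x8000000000, L4 slot 1, span 512 GB */
	printf("p2m VA 0x%lx, L4 slot %lu, span %lu GB\n",
	       va, va >> PGDIR_SHIFT, PGDIR_SIZE >> 30);
	return 0;
}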
* Re: [Xen-devel] [PATCH V2 2/3] xen: eliminate scalability issues from initrd handling
From: David Vrabel @ 2014-09-17 13:45 UTC (permalink / raw)
To: Juergen Gross, linux-kernel, xen-devel, konrad.wilk,
boris.ostrovsky, david.vrabel, jbeulich
On 17/09/14 05:12, Juergen Gross wrote:
> The initrd getting mapped into the initial mapping resulted in size
> restrictions that native kernels wouldn't have. The kernel doesn't really
> need the initrd to be mapped, so use infrastructure available in Xen to
> avoid the mapping and hence the restriction.
>
> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
> arch/x86/xen/enlighten.c | 11 +++++++++--
> arch/x86/xen/xen-head.S | 1 +
> 2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> index c0cb11f..8fd075f 100644
> --- a/arch/x86/xen/enlighten.c
> +++ b/arch/x86/xen/enlighten.c
> @@ -1519,6 +1519,7 @@ static void __init xen_pvh_early_guest_init(void)
> asmlinkage __visible void __init xen_start_kernel(void)
> {
> struct physdev_set_iopl set_iopl;
> + unsigned long initrd_start = 0;
> int rc;
>
> if (!xen_start_info)
> @@ -1667,10 +1668,16 @@ asmlinkage __visible void __init xen_start_kernel(void)
> new_cpu_data.x86_capability[0] = cpuid_edx(1);
> #endif
>
> + if (xen_start_info->mod_start)
> + initrd_start = __pa(xen_start_info->mod_start);
> +#ifdef CONFIG_BLK_DEV_INITRD
> + if (xen_start_info->flags & SIF_MOD_START_PFN)
> + initrd_start = PFN_PHYS(xen_start_info->mod_start);
> +#endif
Why the #ifdef?
I'll fix this up to be:
if (xen_start_info->mod_start) {
if (xen_start_info->flags & SIF_MOD_START_PFN)
initrd_start = PFN_PHYS(xen_start_info->mod_start);
else
initrd_start = __pa(xen_start_info->mod_start);
}
Unless you object.
David
* Re: [Xen-devel] [PATCH V2 2/3] xen: eliminate scalability issues from initrd handling
From: Juergen Gross @ 2014-09-17 14:01 UTC (permalink / raw)
To: David Vrabel, linux-kernel, xen-devel, konrad.wilk,
boris.ostrovsky, jbeulich
On 09/17/2014 03:45 PM, David Vrabel wrote:
> On 17/09/14 05:12, Juergen Gross wrote:
>> The initrd getting mapped into the initial mapping resulted in size
>> restrictions that native kernels wouldn't have. The kernel doesn't really
>> need the initrd to be mapped, so use infrastructure available in Xen to
>> avoid the mapping and hence the restriction.
>>
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>> ---
>> arch/x86/xen/enlighten.c | 11 +++++++++--
>> arch/x86/xen/xen-head.S | 1 +
>> 2 files changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>> index c0cb11f..8fd075f 100644
>> --- a/arch/x86/xen/enlighten.c
>> +++ b/arch/x86/xen/enlighten.c
>> @@ -1519,6 +1519,7 @@ static void __init xen_pvh_early_guest_init(void)
>> asmlinkage __visible void __init xen_start_kernel(void)
>> {
>> struct physdev_set_iopl set_iopl;
>> + unsigned long initrd_start = 0;
>> int rc;
>>
>> if (!xen_start_info)
>> @@ -1667,10 +1668,16 @@ asmlinkage __visible void __init xen_start_kernel(void)
>> new_cpu_data.x86_capability[0] = cpuid_edx(1);
>> #endif
>>
>> + if (xen_start_info->mod_start)
>> + initrd_start = __pa(xen_start_info->mod_start);
>> +#ifdef CONFIG_BLK_DEV_INITRD
>> + if (xen_start_info->flags & SIF_MOD_START_PFN)
>> + initrd_start = PFN_PHYS(xen_start_info->mod_start);
>> +#endif
>
> Why the #ifdef?
>
> I'll fix this up to be:
>
> if (xen_start_info->mod_start) {
> if (xen_start_info->flags & SIF_MOD_START_PFN)
> initrd_start = PFN_PHYS(xen_start_info->mod_start);
> else
> initrd_start = __pa(xen_start_info->mod_start);
> }
>
> Unless you object.
Yeah, you are right. This seems to be okay.
Juergen
* Re: [Xen-devel] [PATCH V2 3/3] xen: eliminate scalability issues from initial mapping setup
From: David Vrabel @ 2014-09-17 14:07 UTC (permalink / raw)
To: Juergen Gross, linux-kernel, xen-devel, konrad.wilk,
boris.ostrovsky, david.vrabel, jbeulich
On 17/09/14 05:12, Juergen Gross wrote:
> Direct Xen to place the initial P->M table outside of the initial
> mapping, as otherwise the 1G (implementation) / 2G (theoretical)
> restriction on the size of the initial mapping limits the amount
> of memory a domain can be handed initially.
>
> As the initial P->M table is copied rather early during boot to
> domain private memory and its initial virtual mapping is dropped,
> the easiest way to avoid virtual address conflicts with other
> addresses in the kernel is to use a user address area for the
> virtual address of the initial P->M table. This allows us to just
> throw away the page tables of the initial mapping after the copy
> without having to care about address invalidation.
This needs an additional paragraph like:
"This does not increase the amount of memory the guest can use. This
is still limited to 512 GiB by the 3-level p2m."
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -1198,6 +1198,76 @@ static void __init xen_cleanhighmap(unsigned long vaddr,
[...]
> +/*
> + * Since it is well isolated we can (and since it is perhaps large we should)
> + * also free the page tables mapping the initial P->M table.
> + */
> +static void __init xen_cleanmfnmap(unsigned long vaddr)
> +{
> + unsigned long va = vaddr & PMD_MASK;
> + unsigned long pa;
> + pgd_t *pgd = pgd_offset_k(va);
> + pud_t *pud_page = pud_offset(pgd, 0);
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *pte;
> + unsigned int i;
> +
> + set_pgd(pgd, __pgd(0));
> + do {
> + pud = pud_page + pud_index(va);
> + if (pud_none(*pud)) {
> + va += PUD_SIZE;
> + } else if (pud_large(*pud)) {
> + pa = pud_val(*pud) & PHYSICAL_PAGE_MASK;
> + xen_free_ro_pages(pa, PUD_SIZE);
> + va += PUD_SIZE;
Are you missing a ClearPagePinned(..) here?
> + } else {
> + pmd = pmd_offset(pud, va);
> + if (pmd_large(*pmd)) {
> + pa = pmd_val(*pmd) & PHYSICAL_PAGE_MASK;
> + xen_free_ro_pages(pa, PMD_SIZE);
> + } else if (!pmd_none(*pmd)) {
> + pte = pte_offset_kernel(pmd, va);
> + for (i = 0; i < PTRS_PER_PTE; ++i) {
> + if (pte_none(pte[i]))
> + break;
> + pa = pte_pfn(pte[i]) << PAGE_SHIFT;
> + xen_free_ro_pages(pa, PAGE_SIZE);
> + }
> + pa = __pa(pte) & PHYSICAL_PAGE_MASK;
> + ClearPagePinned(virt_to_page(__va(pa)));
> + xen_free_ro_pages(pa, PAGE_SIZE);
Put this into a helper function? It's used here...
> + }
> + va += PMD_SIZE;
> + if (pmd_index(va))
> + continue;
> + pa = __pa(pmd) & PHYSICAL_PAGE_MASK;
> + ClearPagePinned(virt_to_page(__va(pa)));
> + xen_free_ro_pages(pa, PAGE_SIZE);
...and here...
> + }
> +
> + } while (pud_index(va) || pmd_index(va));
> + pa = __pa(pud_page) & PHYSICAL_PAGE_MASK;
> + ClearPagePinned(virt_to_page(__va(pa)));
> + xen_free_ro_pages(pa, PAGE_SIZE);
... and here.
> @@ -1529,6 +1604,22 @@ static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
> #else /* CONFIG_X86_64 */
> static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
> {
> + unsigned long pfn;
> +
> + if (xen_feature(XENFEAT_writable_page_tables) ||
> + xen_feature(XENFEAT_auto_translated_physmap) ||
> + xen_start_info->mfn_list >= __START_KERNEL_map)
> + return pte;
> +
> + /*
> + * Pages belonging to the initial p2m list mapped outside the default
> + * address range must be mapped read-only.
Why? I didn't think there was anything special about these MFNs.
David
* Re: [PATCH V2 3/3] xen: eliminate scalability issues from initial mapping setup
From: Jan Beulich @ 2014-09-17 14:17 UTC (permalink / raw)
To: David Vrabel; +Cc: Juergen Gross, xen-devel, boris.ostrovsky, linux-kernel
>>> On 17.09.14 at 16:07, <david.vrabel@citrix.com> wrote:
> On 17/09/14 05:12, Juergen Gross wrote:
>> +static void __init xen_cleanmfnmap(unsigned long vaddr)
>> +{
>> + unsigned long va = vaddr & PMD_MASK;
>> + unsigned long pa;
>> + pgd_t *pgd = pgd_offset_k(va);
>> + pud_t *pud_page = pud_offset(pgd, 0);
>> + pud_t *pud;
>> + pmd_t *pmd;
>> + pte_t *pte;
>> + unsigned int i;
>> +
>> + set_pgd(pgd, __pgd(0));
>> + do {
>> + pud = pud_page + pud_index(va);
>> + if (pud_none(*pud)) {
>> + va += PUD_SIZE;
>> + } else if (pud_large(*pud)) {
>> + pa = pud_val(*pud) & PHYSICAL_PAGE_MASK;
>> + xen_free_ro_pages(pa, PUD_SIZE);
>> + va += PUD_SIZE;
>
> Are you missing a ClearPagePinned(..) here?
No, this is a 1Gb data page, not a page table one.
Jan
* Re: [Xen-devel] [PATCH V2 3/3] xen: eliminate scalability issues from initial mapping setup
From: Juergen Gross @ 2014-09-17 14:20 UTC (permalink / raw)
To: David Vrabel, linux-kernel, xen-devel, konrad.wilk,
boris.ostrovsky, jbeulich
On 09/17/2014 04:07 PM, David Vrabel wrote:
> On 17/09/14 05:12, Juergen Gross wrote:
>> Direct Xen to place the initial P->M table outside of the initial
>> mapping, as otherwise the 1G (implementation) / 2G (theoretical)
>> restriction on the size of the initial mapping limits the amount
>> of memory a domain can be handed initially.
>>
>> As the initial P->M table is copied rather early during boot to
>> domain private memory and its initial virtual mapping is dropped,
>> the easiest way to avoid virtual address conflicts with other
>> addresses in the kernel is to use a user address area for the
>> virtual address of the initial P->M table. This allows us to just
>> throw away the page tables of the initial mapping after the copy
>> without having to care about address invalidation.
>
> This needs an additional paragraph like:
>
> "This does not increase the amount of memory the guest can use. This
> is still limited to 512 GiB by the 3-level p2m."
>
>> --- a/arch/x86/xen/mmu.c
>> +++ b/arch/x86/xen/mmu.c
>> @@ -1198,6 +1198,76 @@ static void __init xen_cleanhighmap(unsigned long vaddr,
> [...]
>> +/*
>> + * Since it is well isolated we can (and since it is perhaps large we should)
>> + * also free the page tables mapping the initial P->M table.
>> + */
>> +static void __init xen_cleanmfnmap(unsigned long vaddr)
>> +{
>> + unsigned long va = vaddr & PMD_MASK;
>> + unsigned long pa;
>> + pgd_t *pgd = pgd_offset_k(va);
>> + pud_t *pud_page = pud_offset(pgd, 0);
>> + pud_t *pud;
>> + pmd_t *pmd;
>> + pte_t *pte;
>> + unsigned int i;
>> +
>> + set_pgd(pgd, __pgd(0));
>> + do {
>> + pud = pud_page + pud_index(va);
>> + if (pud_none(*pud)) {
>> + va += PUD_SIZE;
>> + } else if (pud_large(*pud)) {
>> + pa = pud_val(*pud) & PHYSICAL_PAGE_MASK;
>> + xen_free_ro_pages(pa, PUD_SIZE);
>> + va += PUD_SIZE;
>
> Are you missing a ClearPagePinned(..) here?
Probably, yes.
>
>> + } else {
>> + pmd = pmd_offset(pud, va);
>> + if (pmd_large(*pmd)) {
>> + pa = pmd_val(*pmd) & PHYSICAL_PAGE_MASK;
>> + xen_free_ro_pages(pa, PMD_SIZE);
>> + } else if (!pmd_none(*pmd)) {
>> + pte = pte_offset_kernel(pmd, va);
>> + for (i = 0; i < PTRS_PER_PTE; ++i) {
>> + if (pte_none(pte[i]))
>> + break;
>> + pa = pte_pfn(pte[i]) << PAGE_SHIFT;
>> + xen_free_ro_pages(pa, PAGE_SIZE);
>> + }
>
>> + pa = __pa(pte) & PHYSICAL_PAGE_MASK;
>> + ClearPagePinned(virt_to_page(__va(pa)));
>> + xen_free_ro_pages(pa, PAGE_SIZE);
>
> Put this into a helper function? It's used here...
Good idea.
>
>> + }
>> + va += PMD_SIZE;
>> + if (pmd_index(va))
>> + continue;
>> + pa = __pa(pmd) & PHYSICAL_PAGE_MASK;
>> + ClearPagePinned(virt_to_page(__va(pa)));
>> + xen_free_ro_pages(pa, PAGE_SIZE);
>
> ...and here...
>
>> + }
>> +
>> + } while (pud_index(va) || pmd_index(va));
>> + pa = __pa(pud_page) & PHYSICAL_PAGE_MASK;
>> + ClearPagePinned(virt_to_page(__va(pa)));
>> + xen_free_ro_pages(pa, PAGE_SIZE);
>
> ... and here.
>
>> @@ -1529,6 +1604,22 @@ static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
>> #else /* CONFIG_X86_64 */
>> static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
>> {
>> + unsigned long pfn;
>> +
>> + if (xen_feature(XENFEAT_writable_page_tables) ||
>> + xen_feature(XENFEAT_auto_translated_physmap) ||
>> + xen_start_info->mfn_list >= __START_KERNEL_map)
>> + return pte;
>> +
>> + /*
>> + * Pages belonging to the initial p2m list mapped outside the default
>> + * address range must be mapped read-only.
>
> Why? I didn't think there was anything special about these MFNs.
The hypervisor complained when I did otherwise. I think the main reason
is that the hypervisor will set up some more page tables to be able to
map the mfn_list outside the "normal" address range. They are located
in the range starting at xen_start_info->first_p2m_pfn (otherwise the
info in first_p2m_pfn and nr_p2m_frames wouldn't be needed).
And page tables must be mapped read-only.
Juergen
* Re: [Xen-devel] [PATCH V2 3/3] xen: eliminate scalability issues from initial mapping setup
From: David Vrabel @ 2014-09-17 14:42 UTC (permalink / raw)
To: Juergen Gross, linux-kernel, xen-devel, konrad.wilk,
boris.ostrovsky, jbeulich
On 17/09/14 15:20, Juergen Gross wrote:
> On 09/17/2014 04:07 PM, David Vrabel wrote:
>>
>>
>> Are you missing a ClearPagePinned(..) here?
>
> Probably, yes.
Jan pointed out that this is not needed.
>>> @@ -1529,6 +1604,22 @@ static pte_t __init mask_rw_pte(pte_t *ptep,
>>> pte_t pte)
>>> #else /* CONFIG_X86_64 */
>>> static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
>>> {
>>> + unsigned long pfn;
>>> +
>>> + if (xen_feature(XENFEAT_writable_page_tables) ||
>>> + xen_feature(XENFEAT_auto_translated_physmap) ||
>>> + xen_start_info->mfn_list >= __START_KERNEL_map)
>>> + return pte;
>>> +
>>> + /*
>>> + * Pages belonging to the initial p2m list mapped outside the
>>> default
>>> + * address range must be mapped read-only.
>>
>> Why? I didn't think there was anything special about these MFNs.
>
> The hypervisor complained when I did otherwise. I think the main reason
> is that the hypervisor will set up some more page tables to be able to
> map the mfn_list outside the "normal" address range. They are located
> in the range starting at xen_start_info->first_p2m_pfn (otherwise the
> info in first_p2m_pfn and nr_p2m_frames wouldn't be needed).
Ok. Can you expand the comment to say this?
David
* Re: [PATCH V2 0/3] xen: remove some memory limits from pv-domains
From: David Vrabel @ 2014-09-17 14:43 UTC (permalink / raw)
To: Juergen Gross, linux-kernel, xen-devel, konrad.wilk,
boris.ostrovsky, jbeulich
On 17/09/14 05:12, Juergen Gross wrote:
> When a Xen pv-domain is booted, the initial memory map contains multiple
> objects in the top 2 GB, including the initrd and the p2m list. This
> limits the supported maximum size of the initrd, and the maximum
> memory size the p2m list can span is about 500 GB.
>
> Xen, however, supports loading the initrd without mapping it, and the
> initial p2m list can be mapped by Xen to an arbitrarily selected virtual
> address. The following patches activate those options and thus remove
> the limitations.
>
> It should be noted that the p2m list limitation doesn't only affect
> the amount of memory a pv domain can use; it also prevents Dom0 from
> being started on physical systems with larger memory without reducing its
> memory via a Xen boot parameter. By mapping the initial p2m list to
> an area not in the top 2 GB it is now possible to boot Dom0 on such
> systems.
>
> It would be desirable to be able to use more than 512 GB in a pv
> domain, but this would require a reorganization of the p2m tree built
> by the kernel at boot time. As this reorganization would affect the
> Xen tools and kexec, too, it is not included in this patch set. This
> topic can be addressed later.
>
> Juergen Gross (3):
> xen: sync some headers with xen tree
> xen: eliminate scalability issues from initrd handling
I've applied these two to devel/for-linus-3.18.
Thanks.
David
* Re: [Xen-devel] [PATCH V2 3/3] xen: eliminate scalability issues from initial mapping setup
From: Juergen Gross @ 2014-09-17 14:47 UTC (permalink / raw)
To: David Vrabel, linux-kernel, xen-devel, konrad.wilk,
boris.ostrovsky, jbeulich
On 09/17/2014 04:42 PM, David Vrabel wrote:
> On 17/09/14 15:20, Juergen Gross wrote:
>> On 09/17/2014 04:07 PM, David Vrabel wrote:
>>>
>>>
>>> Are you missing a ClearPagePinned(..) here?
>>
>> Probably, yes.
>
> Jan pointed out that this is not needed.
>
>>>> @@ -1529,6 +1604,22 @@ static pte_t __init mask_rw_pte(pte_t *ptep,
>>>> pte_t pte)
>>>> #else /* CONFIG_X86_64 */
>>>> static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
>>>> {
>>>> + unsigned long pfn;
>>>> +
>>>> + if (xen_feature(XENFEAT_writable_page_tables) ||
>>>> + xen_feature(XENFEAT_auto_translated_physmap) ||
>>>> + xen_start_info->mfn_list >= __START_KERNEL_map)
>>>> + return pte;
>>>> +
>>>> + /*
>>>> + * Pages belonging to the initial p2m list mapped outside the
>>>> default
>>>> + * address range must be mapped read-only.
>>>
>>> Why? I didn't think there was anything special about these MFNs.
>>
>> The hypervisor complained when I did otherwise. I think the main reason
>> is that the hypervisor will set up some more page tables to be able to
>> map the mfn_list outside the "normal" address range. They are located
>> in the range starting at xen_start_info->first_p2m_pfn (otherwise the
>> info in first_p2m_pfn and nr_p2m_frames wouldn't be needed).
>
> Ok. Can you expand the comment to say this?
Already did. :-)
Juergen
* Re: [PATCH V2 0/3] xen: remove some memory limits from pv-domains
From: Juergen Gross @ 2014-09-17 14:48 UTC (permalink / raw)
To: David Vrabel, linux-kernel, xen-devel, konrad.wilk,
boris.ostrovsky, jbeulich
On 09/17/2014 04:43 PM, David Vrabel wrote:
> On 17/09/14 05:12, Juergen Gross wrote:
>> When a Xen pv-domain is booted, the initial memory map contains multiple
>> objects in the top 2 GB, including the initrd and the p2m list. This
>> limits the supported maximum size of the initrd, and the maximum
>> memory size the p2m list can span is about 500 GB.
>>
>> Xen, however, supports loading the initrd without mapping it, and the
>> initial p2m list can be mapped by Xen to an arbitrarily selected virtual
>> address. The following patches activate those options and thus remove
>> the limitations.
>>
>> It should be noted that the p2m list limitation doesn't only affect
>> the amount of memory a pv domain can use; it also prevents Dom0 from
>> being started on physical systems with larger memory without reducing its
>> memory via a Xen boot parameter. By mapping the initial p2m list to
>> an area not in the top 2 GB it is now possible to boot Dom0 on such
>> systems.
>>
>> It would be desirable to be able to use more than 512 GB in a pv
>> domain, but this would require a reorganization of the p2m tree built
>> by the kernel at boot time. As this reorganization would affect the
>> Xen tools and kexec, too, it is not included in this patch set. This
>> topic can be addressed later.
>>
>> Juergen Gross (3):
>> xen: sync some headers with xen tree
>> xen: eliminate scalability issues from initrd handling
>
> I've applied these two to devel/for-linus-3.18.
>
> Thanks.
Thanks, too. :-)
Juergen