From: Andrew Cooper <andrew.cooper3@citrix.com>
To: Roger Pau Monne <roger.pau@citrix.com>, xen-devel@lists.xenproject.org
Cc: Jan Beulich <jbeulich@suse.com>
Subject: Re: [PATCH RFC 06/12] xen/x86: populate PVHv2 Dom0 physical memory map
Date: Fri, 29 Jul 2016 20:04:12 +0100
Message-ID: <806a668b-d732-09a9-93b7-bf27fa5efee8@citrix.com>
In-Reply-To: <1469809747-11176-7-git-send-email-roger.pau@citrix.com>

On 29/07/16 17:29, Roger Pau Monne wrote:
> Craft the Dom0 e820 memory map and populate it.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>  xen/arch/x86/domain_build.c | 199 ++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 193 insertions(+), 6 deletions(-)
>
> diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
> index c0ef40f..cb8ecbd 100644
> --- a/xen/arch/x86/domain_build.c
> +++ b/xen/arch/x86/domain_build.c
> @@ -43,6 +43,11 @@ static long __initdata dom0_nrpages;
>  static long __initdata dom0_min_nrpages;
>  static long __initdata dom0_max_nrpages = LONG_MAX;
>  
> +/* GFN of the identity map for EPT. */
> +#define HVM_IDENT_PT_GFN  0xfeffeu

The IDENT_PT is only needed on Gen1 VT-x hardware which doesn't support
unrestricted-guest mode.  OTOH, it needs to be matched with VM86_TSS,
which is also needed if hardware doesn't support unrestricted-guest. 
Also, the IDENT_PT can live anywhere in GFN space.  0xfeffe is just
convention for HVM guests as nothing else interesting lives there, but a
PVH dom0 will want to ensure it doesn't alias anything interesting it
might wish to map.

Now that I look at it, there is a substantial WTF.  The domain builder
makes the IDENT_PT, but HVMLoader makes the VM86_TSS.  I presume this is
a historical side effect of the IDENT_PT being a prerequisite to even
executing hvmloader, while the VM86_TSS was only a prerequisite to
executing the rombios payload.  Either way, eww :(

I think the VM86_TSS setup needs to move so that it unconditionally sits
beside the IDENT_PT setup in the domain builder, and is mirrored here for
dom0_hvm() creation.  I don't think it is reasonable to expect an HVMLite
payload to set up its own VM86_TSS if it didn't set up the IDENT_PT.
(OTOH, the chances of an HVMLite guest ever needing to return to real
mode are slim, but in the name of flexibility, it would be nice not to
rule it out.)

Longterm, it would be nice to disentangle the unrestricted-guest support
from the main vmx code, and make it possible to compile it out.  There is
lots of it, and it is all but dead on modern hardware.
> @@ -591,6 +610,8 @@ static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
>          {
>              cur_pages += pages;
>          }
> +        ASSERT((entry_guest->addr & ~PAGE_MASK) == 0 &&
> +               (entry_guest->size & ~PAGE_MASK) == 0);

This would be clearer as ASSERT(IS_ALIGNED(entry_guest->addr, PAGE_SIZE));
with a matching IS_ALIGNED() check for entry_guest->size.

(although I suspect the IS_ALIGNED() predicate is newer than your first
iteration of this code.)

>   next:
>          d->arch.nr_e820++;
>          entry_guest++;
> @@ -1631,7 +1652,7 @@ static int __init construct_dom0_pv(
>          dom0_update_physmap(d, pfn, mfn, 0);
>  
>          pvh_map_all_iomem(d, nr_pages);
> -        pvh_setup_e820(d, nr_pages);
> +        hvm_setup_e820(d, nr_pages);
>      }
>  
>      if ( d->domain_id == hardware_domid )
> @@ -1647,15 +1668,181 @@ out:
>      return rc;
>  }
>  
> +/* Helper to convert from bytes into human-readable form. */
> +static void __init pretty_print_bytes(uint64_t size)
> +{
> +    const char* units[] = {"B", "KB", "MB", "GB", "TB"};

static const char *units[] __initconst

However, this particular array would be far smaller as

static const char units[][3] __initconst = { "B", ... };

as it avoids embedding five 8-byte pointers just to reach 14 useful bytes
of string data.
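
For the avoidance of doubt, a minimal sketch of the two layouts side by
side (illustrative only; the names are mine, not from the patch):

/* Pointer array: five 8-byte pointers on x86_64, plus the string data. */
static const char *units_ptrs[] __initconst = { "B", "KB", "MB", "GB", "TB" };

/* Two-dimensional char array: 5 * 3 = 15 bytes total, no pointers at all. */
static const char units[][3] __initconst = { "B", "KB", "MB", "GB", "TB" };

Both forms are indexed identically as units[i], so the caller doesn't need
to change.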

> +    int i = 0;
> +
> +    while ( ++i < sizeof(units) && size >= 1024 )
> +        size >>= 10; /* size /= 1024 */
> +
> +    printk("%4" PRIu64 "%2s", size, units[i-1]);

Wouldn't it be better to introduce a new %p format identifier to do
this?  (Linux doesn't appear to have an existing format identifier which
we can 'borrow')

It looks potentially useful elsewhere, and avoids the awkward

printk("Order %2u: ", MAX_ORDER - i);
pretty_print_bytes(((uint64_t)1 << (MAX_ORDER - i + PAGE_SHIFT)) *
                   hvm_mem_stats[MAX_ORDER - i]);
printk("\n");

constructs you have below.
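
Purely for illustration (the "%pB" specifier below is hypothetical and
does not exist in Xen's vsnprintf), the call site could then collapse to:

/* Hypothetical "%pB" byte-count formatter, shown only to illustrate the idea. */
uint64_t bytes = ((uint64_t)1 << (MAX_ORDER - i + PAGE_SHIFT)) *
                 hvm_mem_stats[MAX_ORDER - i];

printk("Order %2u: %pB\n", MAX_ORDER - i, &bytes);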


> +}
> +
> +/* Calculate the biggest usable order given a size in bytes. */
> +static inline unsigned int get_order(uint64_t size)
> +{
> +    unsigned int order;
> +    uint64_t pg;
> +
> +    ASSERT((size & ~PAGE_MASK) == 0);
> +    pg = PFN_DOWN(size);
> +    for ( order = 0; pg >= (1 << (order + 1)); order++ );
> +
> +    return order;
> +}

We already have get_order_from_bytes() and get_order_from_pages(), the
latter of which looks like it will suit your use case.

As a TODO item, I really would like to borrow some of the Linux
infrastructure to calculate orders of constants at compile time, because
a large number of callers of the existing get_order_*() functions are
performing unnecessary runtime calculation.  For those which need
runtime calculation, some careful use of ffs() would be preferable to a
loop.
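
As a minimal sketch of the loop-free variant (using flsl() rather than
ffs(), since it is the position of the highest set bit which matters
here; this is illustrative, not a drop-in replacement):

/* Largest order such that (1UL << order) <= PFN_DOWN(size). */
static inline unsigned int get_order_sketch(uint64_t size)
{
    unsigned long pg = PFN_DOWN(size);

    ASSERT(IS_ALIGNED(size, PAGE_SIZE) && pg != 0);

    return flsl(pg) - 1; /* flsl() returns the 1-based position of the MSB. */
}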

> +
> +/* Populate an HVM memory range using the biggest possible order. */
> +static void __init hvm_populate_memory_range(struct domain *d, uint64_t start,
> +                                             uint64_t size)
> +{
> +    static unsigned int __initdata memflags = MEMF_no_dma|MEMF_exact_node;
> +    unsigned int order;
> +    struct page_info *page;
> +    int rc;
> +
> +    ASSERT((size & ~PAGE_MASK) == 0 && (start & ~PAGE_MASK) == 0);
> +
> +    order = MAX_ORDER;
> +    while ( size != 0 )
> +    {
> +        order = min(get_order(size), order);
> +        page = alloc_domheap_pages(d, order, memflags);
> +        if ( page == NULL )
> +        {
> +            if ( order == 0 && memflags )
> +            {
> +                /* Try again without any memflags. */
> +                memflags = 0;
> +                order = MAX_ORDER;
> +                continue;
> +            }
> +            if ( order == 0 )
> +                panic("Unable to allocate memory with order 0!\n");
> +            order--;
> +            continue;
> +        }

It would be far more efficient to try to allocate only 1G and 2M blocks.
Most of the memory is free at this point, and doing so would allow the
use of HAP superpage mappings, which are a massive performance boost so
long as they aren't shattered.
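
As a rough sketch of how the order selection might look when restricted
to 1G/2M/4K chunks (PAGE_ORDER_1G and friends come from asm-x86/page.h;
the structure below is illustrative, not a finished implementation):

/* Prefer orders which HAP can map with superpages. */
static unsigned int __init hap_friendly_order(uint64_t start, uint64_t size)
{
    if ( IS_ALIGNED(start, 1UL << (PAGE_ORDER_1G + PAGE_SHIFT)) &&
         size >= (1UL << (PAGE_ORDER_1G + PAGE_SHIFT)) )
        return PAGE_ORDER_1G;

    if ( IS_ALIGNED(start, 1UL << (PAGE_ORDER_2M + PAGE_SHIFT)) &&
         size >= (1UL << (PAGE_ORDER_2M + PAGE_SHIFT)) )
        return PAGE_ORDER_2M;

    return PAGE_ORDER_4K;
}

On allocation failure, dropping straight to the next superpage size down
(rather than decrementing the order by one) also keeps the remaining
mappings superpage-friendly.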

> +
> +        hvm_mem_stats[order]++;
> +        rc = guest_physmap_add_page(d, _gfn(PFN_DOWN(start)),
> +                                    _mfn(page_to_mfn(page)), order);
> +        if ( rc != 0 )
> +            panic("Failed to populate memory: [%" PRIx64 " - %" PRIx64 "] %d\n",
> +                  start, start + (((uint64_t)1) << (order + PAGE_SHIFT)), rc);
> +        start += ((uint64_t)1) << (order + PAGE_SHIFT);
> +        size -= ((uint64_t)1) << (order + PAGE_SHIFT);
> +        if ( (size & 0xffffffff) == 0 )
> +            process_pending_softirqs();
> +    }
> +
> +}
> +
> +static int __init hvm_setup_p2m(struct domain *d)
> +{
> +    struct vcpu *v = d->vcpu[0];
> +    unsigned long nr_pages;
> +    int i;
> +
> +    printk("** Preparing memory map **\n");
> +
> +    /*
> +     * Subtract one page for the EPT identity page table and two pages
> +     * for the MADT replacement.
> +     */
> +    nr_pages = compute_dom0_nr_pages(d, NULL, 0) - 3;
> +
> +    hvm_setup_e820(d, nr_pages);
> +    paging_set_allocation(d, dom0_paging_pages(d, nr_pages));
> +
> +    printk("Dom0 memory map:\n");
> +    print_e820_memory_map(d->arch.e820, d->arch.nr_e820);
> +
> +    printk("** Populating memory map **\n");
> +    /* Populate memory map. */
> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> +    {
> +        if ( d->arch.e820[i].type != E820_RAM )
> +            continue;
> +
> +        hvm_populate_memory_range(d, d->arch.e820[i].addr,
> +                                  d->arch.e820[i].size);
> +    }
> +
> +    printk("Memory allocation stats:\n");
> +    for ( i = 0; i <= MAX_ORDER; i++ )
> +    {
> +        if ( hvm_mem_stats[MAX_ORDER - i] != 0 )
> +        {
> +            printk("Order %2u: ", MAX_ORDER - i);
> +            pretty_print_bytes(((uint64_t)1 << (MAX_ORDER - i + PAGE_SHIFT)) *
> +                               hvm_mem_stats[MAX_ORDER - i]);
> +            printk("\n");
> +        }
> +    }
> +
> +    if ( cpu_has_vmx && paging_mode_hap(d) && !vmx_unrestricted_guest(v) )
> +    {
> +        struct vcpu *saved_current;
> +        struct page_info *page;
> +        uint32_t *ident_pt;
> +
> +        /*
> +         * Identity-map page table is required for running with CR0.PG=0
> +         * when using Intel EPT. Create a 32-bit non-PAE page directory of
> +         * superpages.
> +         */
> +        page = alloc_domheap_pages(d, 0, 0);
> +        if ( unlikely(!page) )
> +        {
> +            printk("Unable to allocate page for identity map\n");
> +            return -ENOMEM;
> +        }
> +
> +        saved_current = current;
> +        set_current(v);

What is this swizzle of current for?  I presume you hit an assertion
somewhere?  Given how late we are in the boot process, it might be worth
formally context switching to d0v0 earlier.

> +
> +        ident_pt = __map_domain_page(page);
> +        for ( i = 0; i < PAGE_SIZE / sizeof(*ident_pt); i++ )
> +            ident_pt[i] = ((i << 22) | _PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
> +                           _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
> +        unmap_domain_page(ident_pt);
> +
> +        guest_physmap_add_page(d, _gfn(HVM_IDENT_PT_GFN),
> +                               _mfn(page_to_mfn(page)), 0);

Please pick an existing gfn and use map_domain_gfn() instead.  This will
avoid unnecessarily shattering HAP superpages.

~Andrew
