Re: [PATCH v4 6/9] tools/libxc: x86 pv save implementation

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Andrew Cooper <andrew.cooper3@citrix.com>
To: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Frediano Ziglio <frediano.ziglio@citrix.com>,
	David Vrabel <david.vrabel@citrix.com>,
	Xen-devel <xen-devel@lists.xen.org>
Subject: Re: [PATCH v4 6/9] tools/libxc: x86 pv save implementation
Date: Wed, 7 May 2014 16:20:06 +0100	[thread overview]
Message-ID: <536A4F26.7040609@citrix.com> (raw)
In-Reply-To: <1399471089.13430.98.camel@kazak.uk.xensource.com>

On 07/05/14 14:58, Ian Campbell wrote:
> On Wed, 2014-04-30 at 19:36 +0100, Andrew Cooper wrote:
>
> The previous "common" patch seemed to have a fair bit of x86 pv save
> code in it, and there seems to be a fair bit of common code here. I
> suppose the split was pretty artificial to make the patches manageable?
>
>> @@ -121,6 +123,9 @@ struct context
>>      };
>>  };
>>  
>> +/* Saves an x86 PV domain. */
> I should hope so!
>
>> +int save_x86_pv(struct context *ctx);
>> +
>>  /*
>>   * Write the image and domain headers to the stream.
>>   * (to eventually make static in save.c)
>> @@ -137,6 +142,21 @@ struct record
>>      void *data;
>>  };
>>  
>> +/* Gets a field from an *_any union */
> Unless the #undefs which were above before I trimmed them relate to
> patch #2/9 they should go right before the redefine IMHO.

At the moment the series is specifically designed to allow the legacy
and v2 code to live together for testing purposes.

When this series leaves RFC, the old macros will be killed with fire. 
(See also gross abortions with scoped state)

>
>> +#define GET_FIELD(_c, _p, _f)                   \
>> +    ({ (_c)->x86_pv.width == 8 ?                \
>> +            (_p)->x64._f:                       \
>> +            (_p)->x32._f;                       \
>> +    })                                          \
>> +
>> +/* Gets a field from an *_any union */
>> +#define SET_FIELD(_c, _p, _f, _v)               \
>> +    ({ if ( (_c)->x86_pv.width == 8 )           \
>> +            (_p)->x64._f = (_v);                \
>> +        else                                    \
>> +            (_p)->x32._f = (_v);                \
>> +    })
>> +
>>  /*
>>   * Writes a split record to the stream, applying correct padding where
>>   * appropriate.  It is common when sending records containing blobs from Xen
>> +    /* MFNs of the batch pfns */
>> +    mfns = malloc(nr_pfns * sizeof *mfns);
>> +    /* Types of the batch pfns */
>> +    types = malloc(nr_pfns * sizeof *types);
>> +    /* Errors from attempting to map the mfns */
>> +    errors = malloc(nr_pfns * sizeof *errors);
>> +    /* Pointers to page data to send.  Might be from mapped mfns or local allocations */
>> +    guest_data = calloc(nr_pfns, sizeof *guest_data);
>> +    /* Pointers to locally allocated pages.  Probably not all used, but need freeing */
>> +    local_pages = calloc(nr_pfns, sizeof *local_pages);
> Wouldn't it be better to preallocate (most of) these for the entire
> save/restore than to do it for each batch?

Valgrind vs (suspected)negligible overhead, particularly in this
function which is probably the single most complex of the series.

>
>> +    for ( i = 0; i < nr_pfns; ++i )
>> +    {
>> +        types[i] = mfns[i] = ctx->ops.pfn_to_gfn(ctx, ctx->batch_pfns[i]);
>> +
>> +        /* Likely a ballooned page */
>> +        if ( mfns[i] == INVALID_MFN )
>> +            set_bit(ctx->batch_pfns[i], ctx->deferred_pages);
> deferred_pages doesn't need to grow dynamically like that bit map in a
> previous patch, does it?

On second thoughts, it probably does.

>
>> +    hdr.count = nr_pfns;
>> +    s = nr_pfns * sizeof *rec_pfns;
> Brackets around sizeof would really help here.
>
>> +
>> +
>> +    rec_pfns = malloc(s);
> Given the calculation of s is this a candidate for calloc?

s was actually some debugging which managed to stay in.  I shall clean
it up a little.

>
>> +    if ( !rec_pfns )
>> +    {
>> +        ERROR("Unable to allocate memory for page data pfn list");
>> +        goto err;
>> +    }
>> +
>> +    for ( i = 0; i < nr_pfns; ++i )
>> +        rec_pfns[i] = ((uint64_t)(types[i]) << 32) | ctx->batch_pfns[i];
> Could this not have a more structured type? (I noticed this in the
> decode bit of the previous patch too).

Not without resorting to bitfields, which I would prefer to avoid for
data ending up being written into a stream.

>
>> +
>> +    /*        header +          pfns data           +        page data */
>> +    rec_size = 4 + 4 + (s) + (nr_pages * PAGE_SIZE);
> Can these 4's not be sizeof something? Also the comment looks like it
> wants to align with something, but is failing.

Again, leaked debugging code.

>
>> +    rec_count = nr_pfns;
>> +
>> +    if ( write_exact(ctx->fd, &rec_type, sizeof(uint32_t)) ||
>> +         write_exact(ctx->fd, &rec_size, sizeof(uint32_t)) ||
> Are these not a struct rhdr?

Yes,

I intended to implement "write_record_raw()" or so which trusted the
length calculation given, although I appear to have lost the tuits.  I
should find them again.

>
>> +         write_exact(ctx->fd, &rec_count, sizeof(uint32_t)) ||
>> +         write_exact(ctx->fd, &rec_res, sizeof(uint32_t)) )
>> +    {
>> +        PERROR("Failed to write page_type header to stream");
>> +        goto err;
>> +    }
>> +
>> +    if ( write_exact(ctx->fd, rec_pfns, s) )
>> +    {
>> +        PERROR("Failed to write page_type header to stream");
>> +        goto err;
>> +    }
>> +
>> +
>> +    for ( i = 0; i < nr_pfns; ++i )
> Extra braces are useful for clarity in these siturations.
>
>
>> +static int flush_batch(struct context *ctx)
>> +{
>> +    int rc = 0;
>> +
>> +    if ( ctx->nr_batch_pfns == 0 )
>> +        return rc;
>> +
>> +    rc = write_batch(ctx);
>> +
>> +    if ( !rc )
>> +    {
>> +        /* Valgrind sanity check */
> What now?

This results in the array being marked as unused again, which was useful
when I found two pieces of code playing with the index and write_batch()
finding some plausible-but-wrong pfns to send.

>
>> +        free(ctx->batch_pfns);
>> +        ctx->batch_pfns = malloc(MAX_BATCH_SIZE * sizeof *ctx->batch_pfns);
>> +        rc = !ctx->batch_pfns;
>> +    }
>> +
>> +
>> +    for ( x = 0, pages_written = 0; x < max_iter ; ++x )
>> +    {
>> +        DPRINTF("Iteration %u", x);
> Something in this series should be using xc_report_progress_* (I'm not
> sure that place is here, but I didn't see it elsewhere yet).

Ah yes - I think other areas want to do the same.

>
>>  int xc_domain_save2(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
>>                      uint32_t max_factor, uint32_t flags,
>>                      struct save_callbacks* callbacks, int hvm,
>>                      unsigned long vm_generationid_addr)
>>  {
>> +    struct context ctx =
>> +        {
> Coding style.
>
>> +            .xch = xch,
>> +            .fd = io_fd,
>> +        };
>> +
>> +    /* Older GCC cant initialise anonymous unions */
> How old?

4.7 in debian is fine.  4.4 in CentOS 6 isn't

>
>
>> +static int map_p2m(struct context *ctx)
>> +{
>> [...]
>> +    fll_mfn = GET_FIELD(ctx, ctx->x86_pv.shinfo, arch.pfn_to_mfn_frame_list_list);
>> +    if ( !fll_mfn )
>> +        IPRINTF("Waiting for domain to set up its p2m frame list list");
>> +
>> +    while ( tries-- && !fll_mfn )
>> +    {
>> +        usleep(10000);
>> +        fll_mfn = GET_FIELD(ctx, ctx->x86_pv.shinfo,
>> +                            arch.pfn_to_mfn_frame_list_list);
>> +    }
>> +
>> +    if ( !fll_mfn )
>> +    {
>> +        ERROR("Timed out waiting for p2m frame list list to be updated");
>> +        goto err;
>> +    }
> This (preexisting) ugliness is there to "support" running migrate
> immediately after starting a guest, when it hasn't got its act together
> yet?

Yes - as the guest is in control of its p2m, it has to set itself up
sufficiently first before the toolstack can migrate it.

>
> I wonder if we can get away with refusing to migrate under those
> circumstances.

Probably, although I thought I tried removing this originally then found
a case where I still needed it.  It would be nice to remove it if possible.

>
>> +static int write_one_vcpu_basic(struct context *ctx, uint32_t id)
>> +{
>> +    xc_interface *xch = ctx->xch;
>> +    xen_pfn_t mfn, pfn;
>> +    unsigned i;
>> +    int rc = -1;
>> +    vcpu_guest_context_any_t vcpu;
>> +    struct rec_x86_pv_vcpu vhdr = { .vcpu_id = id };
>> +    struct record rec =
>> +    {
>> +        .type = REC_TYPE_x86_pv_vcpu_basic,
>> +        .length = sizeof vhdr,
>> +        .data = &vhdr,
>> +    };
>> +
>> +    if ( xc_vcpu_getcontext(xch, ctx->domid, id, &vcpu) )
>> +    {
>> +        PERROR("Failed to get vcpu%u context", id);
>> +        goto err;
>> +    }
>> +
>> +    /* Vcpu 0 is special: Convert the suspend record to a PFN */
> The "suspend record" is the guest's start_info_t I think? Would be
> clearer to say that I think (and perhaps allude to it being the magic
> 3rd parameter to the hypercall.)

It is inconsistently referred to in the toolstack and Xen public
headers.  I think I inherited the name "suspend record" from the legacy
code.

>
>> +    /* 64bit guests: Convert CR1 (guest pagetables) to PFN */
>> +    if ( ctx->x86_pv.levels == 4 && vcpu.x64.ctrlreg[1] )
>> +    {
>> +        mfn = vcpu.x64.ctrlreg[1] >> PAGE_SHIFT;
>> +        if ( !mfn_in_pseudophysmap(ctx, mfn) )
>> +        {
>> +            ERROR("Bad MFN for vcpu%u's cr1", id);
>> +            pseudophysmap_walk(ctx, mfn);
>> +            errno = ERANGE;
>> +            goto err;
>> +        }
>> +
>> +        pfn = mfn_to_pfn(ctx, mfn);
>> +        vcpu.x64.ctrlreg[1] = 1 | ((uint64_t)pfn << PAGE_SHIFT);
>> +    }
>> +
>> +    if ( ctx->x86_pv.width == 8 )
> The use of x86_pv.levels vs x86_pv.width to determine the type of guest
> is rather inconsistent. .width would seem to be the right one to use
> almost all the time I think.

I (believe I) have consistently used levels when dealing with pagetable
stuff, and width when dealing with bitness issues.

I suppose the difference was more pronounced in the legacy code with
32bit 2 and 3 level guests, but I still think this split is the better
way of distinguishing.

>
>> +        rc = write_split_record(ctx, &rec, &vcpu, sizeof vcpu.x64);
>> +    else
>> +        rc = write_split_record(ctx, &rec, &vcpu, sizeof vcpu.x32);
>> +
>> +    if ( rc )
>> +        goto err;
>> +
>> +    DPRINTF("Writing vcpu%u basic context", id);
>> +    rc = 0;
>> + err:
>> +
>> +    return rc;
>> +}
>> +
>> +static int write_one_vcpu_extended(struct context *ctx, uint32_t id)
>> +{
>> +    xc_interface *xch = ctx->xch;
>> +    int rc;
>> +    struct rec_x86_pv_vcpu vhdr = { .vcpu_id = id };
>> +    struct record rec =
>> +    {
>> +        .type = REC_TYPE_x86_pv_vcpu_extended,
>> +        .length = sizeof vhdr,
>> +        .data = &vhdr,
>> +    };
> Does this pattern repeat a lot? DECLARE_REC(x86_pv_vcpu_extended, vhdr)
> perhaps?

It repeats a little but not too often.  However I frankly dislike macros
like that which do opaque initialisation of trivial structures.

>
>> +    struct xen_domctl domctl =
>> +    {
>> +        .cmd = XEN_DOMCTL_get_ext_vcpucontext,
>> +        .domain = ctx->domid,
>> +        .u.ext_vcpucontext.vcpu = id,
>> +    };
> Wants to use DECLARE_DOMCTL I think (at least everywhere else in libxc
> seesms too).

Which has no difference now I dropped -DVALGRIND from libxc, but forgoes
the clarity and optimisation available from structure initialisation.

As previously indicated, I have an issue with macros which use or create
unidentified symbols in scope, because they are unconditionally less
clear than the plain alternative.

>
>> +static int write_all_vcpu_information(struct context *ctx)
>> +{
>> +    xc_interface *xch = ctx->xch;
>> +    xc_vcpuinfo_t vinfo;
>> +    unsigned int i;
>> +    int rc;
>> +
>> +    for ( i = 0; i <= ctx->dominfo.max_vcpu_id; ++i )
>> +    {
>> +        rc = xc_vcpu_getinfo(xch, ctx->domid, i, &vinfo);
>> +        if ( rc )
>> +        {
>> +            PERROR("Failed to get vcpu%u information", i);
>> +            return rc;
>> +        }
>> +
>> +        if ( !vinfo.online )
>> +        {
>> +            DPRINTF("vcpu%u offline - skipping", i);
>> +            continue;
>> +        }
>> +
>> +        rc = write_one_vcpu_basic(ctx, i) ?:
>> +            write_one_vcpu_extended(ctx, i) ?:
>> +            write_one_vcpu_xsave(ctx, i);
> I'd prefer the more verbose one at a time version.
>
>> +    /* Map various structures */
>> +    rc = x86_pv_map_m2p(ctx) ?: map_shinfo(ctx) ?: map_p2m(ctx);
> One at a time please.

Ok.

~Andrew

next prev parent reply	other threads:[~2014-05-07 15:20 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-04-30 18:36 [PATCH v4 0/9] Migration Stream v2 Andrew Cooper
2014-04-30 18:36 ` [PATCH v4 1/9] libxc: add DECLARE_HYPERCALL_BUFFER_SHADOW() Andrew Cooper
2014-05-07 11:45   ` Ian Campbell
2014-05-07 12:00     ` David Vrabel
2014-05-07 12:06       ` Ian Campbell
2014-04-30 18:36 ` [PATCH v4 2/9] [HACK] tools/libxc: save/restore v2 framework Andrew Cooper
2014-04-30 18:36 ` [PATCH v4 3/9] tools/libxc: Stream specification and some common code Andrew Cooper
2014-05-07 11:57   ` Ian Campbell
2014-05-07 12:06     ` David Vrabel
2014-05-07 12:14     ` Andrew Cooper
2014-05-07 13:07       ` Ian Campbell
2014-05-07 13:20         ` Andrew Cooper
2014-04-30 18:36 ` [PATCH v4 4/9] tools/libxc: Scripts for inspection/valdiation of legacy and new streams Andrew Cooper
2014-05-07 12:03   ` Ian Campbell
2014-05-07 12:08     ` David Vrabel
2014-05-07 12:27     ` Andrew Cooper
2014-04-30 18:36 ` [PATCH v4 5/9] tools/libxc: common code Andrew Cooper
2014-05-07 13:03   ` Ian Campbell
2014-05-07 14:38     ` Andrew Cooper
2014-05-07 16:08       ` Ian Campbell
2014-05-07 16:30         ` Andrew Cooper
2014-05-07 16:35           ` Ian Campbell
2014-05-08  8:29       ` David Vrabel
2014-04-30 18:36 ` [PATCH v4 6/9] tools/libxc: x86 pv save implementation Andrew Cooper
2014-05-07 13:58   ` Ian Campbell
2014-05-07 15:20     ` Andrew Cooper [this message]
2014-05-07 16:15       ` Ian Campbell
2014-04-30 18:36 ` [PATCH v4 7/9] tools/libxc: x86 pv restore implementation Andrew Cooper
2014-05-07 14:10   ` Ian Campbell
2014-05-07 15:37     ` Andrew Cooper
2014-05-07 16:17       ` Ian Campbell
2014-04-30 18:36 ` [PATCH v4 8/9] tools/libxc: x86 hvm save implementation Andrew Cooper
2014-05-07 14:14   ` Ian Campbell
2014-05-07 15:39     ` Andrew Cooper
2014-04-30 18:36 ` [PATCH v4 9/9] tools/libxc: x86 hvm restore implementation Andrew Cooper
2014-05-07 14:15   ` Ian Campbell
2014-05-01 16:41 ` [PATCH v4 0/9] Migration Stream v2 Konrad Rzeszutek Wilk
2014-05-01 17:32   ` Andrew Cooper
2014-05-07 14:23 ` Ian Campbell

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=536A4F26.7040609@citrix.com \
    --to=andrew.cooper3@citrix.com \
    --cc=Ian.Campbell@citrix.com \
    --cc=david.vrabel@citrix.com \
    --cc=frediano.ziglio@citrix.com \
    --cc=xen-devel@lists.xen.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.