Re: slow live magration / xc_restore on xen4 pvops

xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed

From: Brendan Cully <brendan@cs.ubc.ca>
To: Ian Jackson <Ian.Jackson@eu.citrix.com>
Cc: "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>,
	Andreas Olsowski <andreas.olsowski@uni.leuphana.de>
Subject: Re: slow live magration / xc_restore on xen4 pvops
Date: Thu, 3 Jun 2010 08:03:05 -0700	[thread overview]
Message-ID: <20100603150305.GA53591@zanzibar.domain.invalid> (raw)
In-Reply-To: <19463.32147.268104.94905@mariner.uk.xensource.com>

On Thursday, 03 June 2010 at 11:01, Ian Jackson wrote:
> Brendan Cully writes ("Re: [Xen-devel] slow live magration / xc_restore on xen4 pvops"):
> > 2. in normal migration, the sender should close the fd after sending
> > all data, immediately triggering an IO error on the receiver and
> > completing the restore.
> 
> This is not true.  In normal migration, the fd is used by the
> machinery which surrounds xc_domain_restore (in xc_save and also in xl
> or xend).  In any case it would be quite wrong for a library function
> like xc_domain_restore to eat the fd.

The sender closes the fd, as it always has. xc_domain_restore has
always consumed the entire contents of the fd, because the qemu tail
has no length header under normal migration. There's no behavioral
difference here that I can see.

> It's not necessary for xc_domain_restore to behave this way in all
> cases; all that's needed is parameters to tell it how to behave.

I have no objection to a more explicit interface. The current form is
simply Remus trying to be as invisible as possible to the rest of the
tool stack.

> > I did try to avoid disturbing regular live migration as much as
> > possible when I wrote the code. I suspect some other regression has
> > crept in, and I'll investigate.
> 
> The short timeout is another regression.  A normal live migration or
> restore should not fall over just because no data is available for
> 100ms.

(the timeout is 1s, by the way).

For some reason you clipped the bit of my previous message where I say
this doesn't happen:

1. reads are only supposed to be able to time out after the entire                                                 
first checkpoint has been received (IOW this wouldn't kick in until                                                
normal migration had already completed)    

Let's take a look at read_exact_timed in xc_domain_restore:

if ( completed ) {
    /* expect a heartbeat every HEARBEAT_MS ms maximum */
    tv.tv_sec = HEARTBEAT_MS / 1000;
    tv.tv_usec = (HEARTBEAT_MS % 1000) * 1000;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    len = select(fd + 1, &rfds, NULL, NULL, &tv);
    if ( !FD_ISSET(fd, &rfds) ) {
        fprintf(stderr, "read_exact_timed failed (select returned %zd)\n", len);
        return -1;
    }
}

'completed' is not set until the first entire checkpoint (i.e., the
entirety of non-Remus migration) has completed. So, no issue.

I see no evidence that Remus has anything to do with the live
migration performance regression discussed in this thread, and I
haven't seen any other reported issues either. I think the mlock issue
is a much more likely candidate.

next prev parent reply	other threads:[~2010-06-03 15:03 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-06-01 17:49 XCP AkshayKumar Mehta
2010-06-01 19:06 ` XCP Jonathan Ludlam
2010-06-01 19:15   ` XCP AkshayKumar Mehta
2010-06-03  3:03   ` XCP AkshayKumar Mehta
2010-06-03 10:24     ` XCP Jonathan Ludlam
2010-06-03 17:20       ` XCP AkshayKumar Mehta
2010-08-31  1:33       ` XCP - iisues with XCP .5 AkshayKumar Mehta
2010-06-01 21:17 ` slow live magration / xc_restore on xen4 pvops Andreas Olsowski
2010-06-02  7:11   ` Keir Fraser
2010-06-02 15:46     ` Andreas Olsowski
2010-06-02 15:55       ` Keir Fraser
2010-06-02 16:18   ` Ian Jackson
2010-06-02 16:20     ` Ian Jackson
2010-06-02 16:24     ` Keir Fraser
2010-06-03  1:04       ` Brendan Cully
2010-06-03  4:31         ` Brendan Cully
2010-06-03  5:47         ` Keir Fraser
2010-06-03  6:45           ` Brendan Cully
2010-06-03  6:53             ` Jeremy Fitzhardinge
2010-06-03  6:55             ` Brendan Cully
2010-06-03  7:12               ` Keir Fraser
2010-06-03  8:58             ` Zhai, Edwin
2010-06-09 13:32               ` Keir Fraser
2010-06-02 16:27     ` Brendan Cully
2010-06-03 10:01       ` Ian Jackson
2010-06-03 15:03         ` Brendan Cully [this message]
2010-06-03 15:18           ` Keir Fraser
2010-06-03 17:15           ` Ian Jackson
2010-06-03 17:29             ` Brendan Cully
2010-06-03 18:02               ` Ian Jackson
2010-06-02 22:59   ` Andreas Olsowski
2010-06-10  9:27     ` Keir Fraser

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100603150305.GA53591@zanzibar.domain.invalid \
    --to=brendan@cs.ubc.ca \
    --cc=Ian.Jackson@eu.citrix.com \
    --cc=andreas.olsowski@uni.leuphana.de \
    --cc=xen-devel@lists.xensource.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).