From: Shriram Rajagopalan
Reply-To: rshriram@cs.ubc.ca
To: Ian Jackson
Cc: Andrew Cooper, Stefano Stabellini, Ian Campbell, xen-devel@lists.xen.org
Subject: Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages]
Date: Mon, 4 Nov 2013 10:47:07 -0600
In-Reply-To: <21111.50707.178276.553159@mariner.uk.xensource.com>
References: <21107.62159.140005.466786@mariner.uk.xensource.com>
 <21111.36679.369553.735409@mariner.uk.xensource.com>
 <1383579138.8826.95.camel@kazak.uk.xensource.com>
 <21111.50707.178276.553159@mariner.uk.xensource.com>
List-Id: xen-devel@lists.xenproject.org

On Mon, Nov 4, 2013 at 10:06 AM, Ian Jackson wrote:
> Ian Campbell writes ("Re: [PATCH 4 of 5 V3] tools/libxl: Control network
> buffering in remus callbacks [and 1 more messages]"):
> > Regardless of the answer here, would it make sense to do some/all of
> > the checkpoint processing in the helper subprocess anyway and only
> > signal the eventual failover up to the libxl process?
>
> It might do. Mind you, the code in libxc is tangled enough as it is and
> is due for a rewrite. Perhaps this could be done in the helper
> executable, although there isn't currently any easy way to intersperse
> code in there.
>
> > This async op is potentially quite long-running, I think, compared to
> > a normal one, i.e. if the guest doesn't die the ao is expected to live
> > "forever". Since the associated gc's persist until the ao ends, this
> > might end up accumulating lots of allocations. Ian had a similar
> > concern about Roger's hotplug daemon series and suggested creating a
> > per-iteration gc or something.
>
> Yes, this is indeed a problem. Well spotted.
>
> Which of the xc_domain_save (and _restore) callbacks are called each
> Remus iteration?

Almost all of them on the xc_domain_save side (suspend, resume, save qemu
state, checkpoint). xc_domain_restore doesn't have any callbacks AFAIK, and
Remus currently has no component on the restore side; it piggybacks on live
migration's restore framework.

> I think introducing a per-iteration gc here is going to involve taking
> some care, since we need to be sure not to put per-iteration-gc-allocated
> objects into data structures which are used by subsequent iterations.

FWIW, the Remus-related code that executes per iteration does not allocate
anything. All allocations happen only during setup, and I was under the
impression that no other allocations take place each time xc_domain_save
calls back into libxl.

However, it may be that other parts of the AO machinery (and there are a
lot of them) are allocating per iteration. If that is the case, it could
easily lead to OOMs, since Remus effectively runs as long as the domain
lives.
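(As an aside, a minimal sketch of the per-iteration gc idea under
discussion might look like the code below. None of these names, iter_gc,
iter_alloc, iter_gc_free or on_checkpoint, exist in libxl, and the real gc
machinery is rather more involved; the point is only that anything
allocated while servicing one checkpoint callback is freed before the next
iteration, so a long-lived ao does not accumulate allocations.)

#include <stdlib.h>
#include <string.h>

/* A tiny per-iteration allocation arena (illustrative only). */
struct iter_gc {
    void **ptrs;        /* allocations owned by this iteration */
    size_t n, cap;
};

static void *iter_alloc(struct iter_gc *gc, size_t sz)
{
    if (gc->n == gc->cap) {
        /* error handling elided for brevity */
        gc->cap = gc->cap ? gc->cap * 2 : 16;
        gc->ptrs = realloc(gc->ptrs, gc->cap * sizeof(*gc->ptrs));
    }
    void *p = calloc(1, sz);
    gc->ptrs[gc->n++] = p;
    return p;
}

static void iter_gc_free(struct iter_gc *gc)
{
    for (size_t i = 0; i < gc->n; i++)
        free(gc->ptrs[i]);
    free(gc->ptrs);
    memset(gc, 0, sizeof(*gc));
}

/* Each checkpoint callback allocates from a fresh iter_gc and frees it
 * before returning, taking care never to stash these pointers in state
 * that outlives the iteration. */
static void on_checkpoint(void)
{
    struct iter_gc gc = { 0 };
    char *scratch = iter_alloc(&gc, 128);   /* per-iteration scratch buffer */
    (void)scratch;                          /* ... do the checkpoint work ... */
    iter_gc_free(&gc);
}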
> Shriram writes:
> > Fair enough. My question is: what is the overhead of setting up,
> > firing and tearing down a timeout event using the event gen framework,
> > if I wish to checkpoint the VM, say, every 20ms?
>
> The ultimate cost of going back into the event loop to wait for a
> timeout will depend on what else the process is doing. If the process
> is doing nothing else, it's about two calls to gettimeofday and one to
> poll, plus a bit of in-process computation, but that's going to be
> swamped by system call overhead.
>
> Having said that, libxl is not performance-optimised. Indeed the
> callback mechanism involves context switching, and IPC, between the
> save/restore helper and libxl proper. Probably not too much to be doing
> every 20ms for a single domain, but if you have a lot of these it's
> going to end up taking a lot of dom0 cpu etc.

Yes, and that is a problem. Xend+Remus avoided it by linking against the
libcheckpoint library, which interfaced with both the Python and libxc
code.

> I assume you're not doing this for HVM domains, which involve saving
> the qemu state each time too.

It includes HVM domains too, although in that case xenstore-based suspend
takes about 5ms, so the checkpoint interval is typically 50ms or so. If
there is a latency-sensitive task running inside the VM, a lower
checkpoint interval leads to better performance.
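(To put the numbers above in perspective, here is a rough, hypothetical
sketch, not libxl code, of what an idle per-checkpoint timeout boils down
to: arm a deadline, drop into poll() until it expires, then run the
checkpoint. That is roughly the two gettimeofday calls and one poll per
interval described above; everything else is dwarfed by the checkpoint
work itself. All names below, including checkpoint_once and
CHECKPOINT_INTERVAL_MS, are illustrative only.)

#include <poll.h>
#include <sys/time.h>

#define CHECKPOINT_INTERVAL_MS 20    /* ~50ms is more typical for HVM guests */

static long long now_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000LL + tv.tv_usec / 1000;
}

/* Placeholder for the real work: suspend the domain, send dirty pages,
 * resume, and buffer outbound network traffic until the checkpoint commits. */
static void checkpoint_once(void) { }

int main(void)
{
    for (;;) {
        long long deadline = now_ms() + CHECKPOINT_INTERVAL_MS; /* gettimeofday #1 */
        long long remaining = deadline - now_ms();               /* gettimeofday #2 */
        if (remaining > 0)
            poll(NULL, 0, (int)remaining);                       /* one poll() per timeout */
        checkpoint_once();
    }
}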