From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: "Michael R. Hines" <mrhines@linux.vnet.ibm.com>
Cc: SADEKJ@il.ibm.com, pbonzini@redhat.com, quintela@redhat.com,
	BIRAN@il.ibm.com, qemu-devel@nongnu.org, EREZH@il.ibm.com,
	owasserm@redhat.com, onom@us.ibm.com, hinesmr@cn.ibm.com,
	isaku.yamahata@gmail.com, gokul@us.ibm.com, dbulkow@gmail.com,
	junqing.wang@cs2c.com.cn, abali@us.ibm.com,
	lig.fnst@cn.fujitsu.com, "Michael R. Hines" <mrhines@us.ibm.com>
Subject: Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
Date: Fri, 21 Feb 2014 09:44:34 +0000
Message-ID: <20140221094433.GA2483@work-vm>
In-Reply-To: <5306DBF8.3060204@linux.vnet.ibm.com>

* Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
> On 02/21/2014 12:32 AM, Dr. David Alan Gilbert wrote:
> >
> >I'm happy to use more memory to get FT; all I'm trying to do is see
> >if it's possible to put a lower bound than 2x on it while still
> >maintaining full FT, at the expense of performance in the case where
> >it uses a lot of memory.
> >
> >>The bottom line is: if you put a *hard* constraint on memory usage,
> >>what will happen to the guest when that garbage collection you mentioned
> >>shows up later and runs for several minutes? How about an hour?
> >>Are we just going to block the guest from being allowed to start a
> >>checkpoint until the memory usage goes down just for the sake of avoiding
> >>the 2x memory usage?
> >Yes, or move to the next checkpoint sooner than the N milliseconds when
> >we see the buffer is getting full.
> 
> OK, I see there is definitely some common ground there. So to be
> more specific, what we really need is two things (I've learned that
> the reviewers are very cautious about adding too much policy into
> QEMU itself, but let's iron this out anyway):
> 
> 1. First, we need to throttle down the guest (QEMU can already do this
>     using the recently introduced "auto-converge" feature). This means
>     that the guest is still making forward progress, albeit slow progress.
> 
> 2. Then we would need some kind of policy, or better yet, a trigger
>     that does something to the effect of "we're about to use a whole
>     lot of checkpoint memory soon - can we afford this much memory
>     usage?". Such a trigger would be conditional on the current
>     policy of the administrator or management software: we would have
>     a QMP command with a boolean flag that says "Yes" or "No", it's
>     tolerable or not to use that much memory in the next checkpoint.
> 
>     If the answer is "Yes", then nothing changes.
>     If the answer is "No", then we should either:
>        a) throttle down the guest,
>        b) adjust the checkpoint frequency, or
>        c) pause it altogether while we migrate some other VMs off the
>           host such that we can complete the next checkpoint in its
>           entirety.

Yes, I think so, although what I was thinking of was mainly (b),
possibly to the point of not starting the next checkpoint at all.

> It's not clear to me how much (if any) of this control loop should
> be in QEMU or in the management software, but I would definitely agree
> that, at a minimum, the ability to detect the situation and remedy
> it should be in QEMU. I'm not entirely convinced that the
> ability to *decide* to remedy the situation should be in QEMU, though.

The management software access is low frequency, high latency; it should
be setting general parameters (max memory allowed, desired checkpoint
frequency, etc.) but I don't see that we can use it to do anything on
a faster than a few-second basis. So yes, it can monitor things and
tweak the knobs if it sees the host as a whole is getting tight on RAM
etc. - but we can't rely on it to apply the brakes if this guest
suddenly decides to dirty bucketloads of RAM; something has to react
quickly in relation to previously set limits.
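
To make that concrete, here's the sort of fast-path check I have in
mind - a rough sketch only, where every name is invented for
illustration and none of it is code from this series:

#include <stdint.h>
#include <stdbool.h>

/* Sketch only: all of these names are invented for illustration. */
typedef struct MCLimits {
    uint64_t dirty_bytes;      /* bytes dirtied so far this epoch    */
    uint64_t max_epoch_bytes;  /* limit set (rarely) by management   */
    bool     checkpoint_now;   /* cut the current epoch short        */
    bool     throttle_guest;   /* auto-converge style slowdown       */
} MCLimits;

/*
 * Run by QEMU itself at high frequency (e.g. whenever the dirty
 * bitmap is sampled).  The management layer only adjusts
 * max_epoch_bytes occasionally; QEMU must react within the epoch.
 */
static void mc_check_limits(MCLimits *s)
{
    if (s->dirty_bytes < s->max_epoch_bytes) {
        return;                 /* still have headroom, carry on */
    }
    /* About to exceed the preset limit: checkpoint early (option b)
     * and throttle the guest (option a) rather than waiting seconds
     * for the management layer to notice. */
    s->checkpoint_now = true;
    s->throttle_guest = true;
}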

> >>If you block the guest from being checkpointed,
> >>then what happens if there is a failure during that extended period?
> >>We will have saved memory at the expense of availability.
> >If the active machine fails during this time then the secondary carries
> >on from its last good snapshot, in the knowledge that the active
> >never finished the new snapshot and so never uncorked its previous
> >packets.
> >
> >If the secondary machine fails during this time then the active drops
> >its nascent snapshot and carries on.
> 
> Yes, that makes sense. Where would that policy go, though,
> continuing the above concern?

I think there has to be some input from the management layer for failover,
because (as per my split-brain concerns) something has to make the decision
about which of the source/destination is to take over, and I don't
believe individual instances have that information.

> >However, what you have made me realise is that I don't have an answer
> >for the memory usage on the secondary; while the primary can pause
> >its guest until the secondary acks the checkpoint, the secondary has
> >to rely on the primary not to send it huge checkpoints.
> 
> Good question: there are a lot of ideas out there in the academic
> community to compress the secondary, or push the secondary to
> a flash-based device, or de-duplicate the secondary. I'm sure any
> of them would put a dent in the problem, but I'm not seeing a smoking
> gun solution that would eliminate all of that memory use completely.

Ah, I was thinking that flash would be a good solution for the
secondary; it would make a nice demo.

> (Personally, I don't believe in swap. I wouldn't even consider swap
> or any kind of traditional disk-based remedy to be a viable solution).

Well it certainly exists - I've seen it!
Swap works well in limited circumstances; but as soon as you've got
multiple VMs fighting over something with 10s of ms latency you're doomed.

> >>The customer that is expecting 100% fault tolerance and the provider
> >>who is supporting it need to have an understanding that fault tolerance
> >>is not free and that constraining memory usage will adversely affect
> >>the VM's ability to be protected.
> >>
> >>Do I understand your expectations correctly? Is fault tolerance
> >>something you're willing to sacrifice?
> >As above, no: I'm willing to sacrifice performance but not fault
> >tolerance.
> >(It is entirely possible that others would want the other trade-off,
> >i.e. below some minimum performance the system is worse than useless,
> >so if we can't maintain that performance then dropping FT leaves us
> >in a more-working position).
> >
> 
> Agreed - I think a "proactive" failover in this case would solve the
> problem.
> If we observed that availability/fault tolerance was going to be at
> risk soon (which is relatively easy to detect), we could just *force*
> a failover to the secondary host and restart the protection from
> scratch.
> 
> 
> >>
> >>Well, that's simple: If there is a failure of the source, the destination
> >>will simply revert to the previous checkpoint using the same mode
> >>of operation. The lost ACKs that you're curious about only
> >>apply to the checkpoint that is in progress. Just because a
> >>checkpoint is in progress does not mean that the previous checkpoint
> >>is thrown away - it is already loaded into the destination's memory
> >>and ready to be activated.
> >I still don't see why, if the link between them fails, the destination
> >doesn't fall back to its previous checkpoint AND the source carries
> >on running - I don't see how they can differentiate which of them has failed.
> 
> I think you're forgetting that the source I/O is buffered - it doesn't
> matter that the source VM is still running. As long as its output is
> buffered, it cannot have any non-fault-tolerant effect on the outside
> world.
> 
> In the future, if a technician accesses the machine or the network
> is restored, the management software can terminate the stale
> source virtual machine.

Going along with my comment above: I'm working on the basis that it's
just as likely for the destination to fail as it is for the source,
and a destination failure shouldn't kill the source; in the case
of a destination failure the source is going to have to let its
buffered I/Os start flowing again.
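
To spell out the symmetry I'm assuming, a sketch (invented names
again, not the patch code):

/* Hypothetical helpers - stand-ins for whatever the series provides. */
static void mc_discard_partial_checkpoint(void) { }
static void mc_activate_committed_checkpoint(void) { }
static void mc_drop_nascent_checkpoint(void) { }
static void mc_release_buffered_io(void) { }

typedef enum { MC_ROLE_SOURCE, MC_ROLE_DEST } MCRole;

/*
 * What each side does when the checkpoint channel breaks.  Neither
 * side can tell by itself whether the peer died or only the link did
 * - that's the split-brain problem, and it's why something external
 * (the management layer) has to arbitrate who ultimately takes over.
 */
static void mc_channel_broken(MCRole role)
{
    if (role == MC_ROLE_DEST) {
        /* Drop the partially received checkpoint N+1 and resume from
         * the last complete checkpoint N; this is safe because the
         * source never uncorked the I/O it buffered after N. */
        mc_discard_partial_checkpoint();
        mc_activate_committed_checkpoint();
    } else {
        /* Source: abandon the nascent checkpoint, release the corked
         * buffered I/O and carry on running. */
        mc_drop_nascent_checkpoint();
        mc_release_buffered_io();
    }
}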

> >>We have a script architecture (not on github) which runs MC in a tight
> >>loop hundreds of times, kills the source QEMU, and timestamps how
> >>quickly the destination QEMU loses the TCP socket connection and
> >>receives an error code from the kernel - every single time, the
> >>destination resumes nearly instantaneously.
> >>I've not empirically seen a case where the socket just hangs or doesn't
> >>change state.
> >>
> >>I'm not very familiar with the internal Linux TCP/IP stack
> >>implementation itself, but I have not had a problem with Linux
> >>dependably shutting down the socket as soon as possible.
> >OK, that only covers a very small range of normal failures.
> >When you kill the destination QEMU the host OS knows that QEMU is dead
> >and sends a packet back closing the socket, hence the source knows
> >the destination is dead very quickly.
> >If:
> >    a) the destination machine were to lose power or hang
> >    b) or a network link were to fail (other than the one attached
> >       to the source, possibly)
> >
> >the source would have to wait out a full TCP timeout.
> >
> >To test (a) and (b) I'd use an iptables rule somewhere to cause the
> >packets to be dropped (not rejected).  Stopping the qemu in gdb might
> >be good enough.
> 
> Very good idea - I'll add that to the "todo" list of things to do
> in my test infrastructure. It may indeed turn out to be necessary
> to add a formal keepalive between the source and destination.
> 
> - Michael
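
For what it's worth, even before adding an application-level
keepalive, the kernel's TCP keepalive can be turned right down on the
checkpoint socket. Something like this (real Linux-specific socket
options; the timing values and the function name are illustrative):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Bound dead-peer detection on the checkpoint socket by turning the
 * kernel's TCP keepalive way down from its default (~2 hours of idle
 * time before the first probe).
 */
static int mc_set_keepalive(int fd)
{
    int on = 1;
    int idle = 2;   /* seconds of idle before first probe */
    int intvl = 1;  /* seconds between probes             */
    int cnt = 3;    /* lost probes before declaring death */

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0) {
        return -1;
    }
    /* With these values a hung or powered-off peer is noticed in
     * roughly idle + cnt * intvl = 5 seconds, rather than after a
     * full TCP timeout of many minutes. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) < 0) {
        return -1;
    }
    return 0;
}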

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
