Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing

From: "Michael R. Hines" <mrhines@linux.vnet.ibm.com>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: SADEKJ@il.ibm.com, quintela@redhat.com, hinesmr@cn.ibm.com,
	qemu-devel@nongnu.org, EREZH@il.ibm.com, owasserm@redhat.com,
	junqing.wang@cs2c.com.cn, onom@us.ibm.com, abali@us.ibm.com,
	lig.fnst@cn.fujitsu.com, gokul@us.ibm.com, dbulkow@gmail.com,
	pbonzini@redhat.com, BIRAN@il.ibm.com, isaku.yamahata@gmail.com,
	"Michael R. Hines" <mrhines@us.ibm.com>
Subject: Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
Date: Fri, 21 Feb 2014 12:54:16 +0800
Message-ID: <5306DBF8.3060204@linux.vnet.ibm.com>
In-Reply-To: <20140220163255.GC2437@work-vm>

On 02/21/2014 12:32 AM, Dr. David Alan Gilbert wrote:
>
> I'm happy to use more memory to get FT, all I'm trying to do is see
> if it's possible to put a lower bound than 2x on it while still maintaining
> full FT, at the expense of performance in the case where it uses
> a lot of memory.
>
>> The bottom line is: if you put a *hard* constraint on memory usage,
>> what will happen to the guest when that garbage collection you mentioned
>> shows up later and runs for several minutes? How about an hour?
>> Are we just going to block the guest from being allowed to start a
>> checkpoint until the memory usage goes down just for the sake of avoiding
>> the 2x memory usage?
> Yes, or move to the next checkpoint sooner than the N milliseconds when
> we see the buffer is getting full.

OK, I see there is definitely some common ground there. To be
more specific, what we really need is two things. (I've learned that
the reviewers are very cautious about adding too much policy into
QEMU itself, but let's iron this out anyway.)

1. First, we need to throttle down the guest (QEMU can already do this
     using the recently introduced "auto-converge" feature). This means
     that the guest is still making forward progress, albeit slow progress.

2. Then we would need some kind of policy, or better yet, a trigger that
     does something to the effect of "we're about to use a whole lot of
     checkpoint memory soon - can we afford this much memory usage".
     Such a trigger would be conditional on the current policy of the
     administrator or management software: We would have a QMP
     command with a boolean flag that says "Yes" or "No", i.e. whether
     it is tolerable to use that much memory in the next checkpoint.

     If the answer is "Yes", then nothing changes.
     If the answer is "No", then we should either:
        a) Throttle down the guest
        b) Adjust the checkpoint frequency
        c) Or pause it altogether while we migrate some other VMs off the
            host such that we can complete the next checkpoint in its
            entirety.
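
To make the trigger above a bit more concrete, here is a minimal sketch
in Python. Everything in it is hypothetical: the names (Action,
next_action), the parameters, and especially the order in which the
remedies are tried are my assumptions for illustration, not anything in
the patch set or in QMP today.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"                    # answer was "Yes": nothing changes
    THROTTLE = "throttle-guest"              # remedy a)
    CHECKPOINT_SOONER = "adjust-frequency"   # remedy b)
    PAUSE = "pause-guest"                    # remedy c)


def next_action(projected_bytes, budget_bytes, admin_allows_overrun,
                throttle_available=True, min_interval_reached=False):
    """Pick a remedy for the upcoming checkpoint (hypothetical policy).

    projected_bytes      -- estimated checkpoint memory for the next epoch
    budget_bytes         -- memory the host can currently afford
    admin_allows_overrun -- the "Yes"/"No" flag set by the administrator
    """
    # Within budget, or the administrator said "Yes": nothing changes.
    if projected_bytes <= budget_bytes or admin_allows_overrun:
        return Action.CONTINUE
    # Assumed ordering: try the least disruptive remedy first.
    if throttle_available:
        return Action.THROTTLE
    if not min_interval_reached:
        return Action.CHECKPOINT_SOONER
    return Action.PAUSE
```

For example, next_action(300, 200, False) would pick the throttle, while
the same call with admin_allows_overrun=True changes nothing.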

It's not clear to me how much (if any) of this control loop should
be in QEMU or in the management software, but I would definitely agree
that, at a minimum, the ability to detect the situation and remedy
it should be in QEMU. I'm not entirely convinced that the
ability to *decide* to remedy the situation should be in QEMU, though.


>
>> If you block the guest from being checkpointed,
>> then what happens if there is a failure during that extended period?
>> We will have saved memory at the expense of availability.
> If the active machine fails during this time then the secondary carries
> on from its last good snapshot in the knowledge that the active
> never finished the new snapshot and so never uncorked its previous packets.
>
> If the secondary machine fails during this time then the active drops
> its nascent snapshot and carries on.

Yes, that makes sense. Where would that policy go, though,
continuing the above concern?

> However, what you have made me realise is that I don't have an answer
> for the memory usage on the secondary; while the primary can pause
> its guest until the secondary acks the checkpoint, the secondary has
> to rely on the primary not to send it huge checkpoints.

Good question: There are a lot of ideas out there in the academic
community to compress the secondary, or push the secondary to
a flash-based device, or de-duplicate the secondary. I'm sure any
of them would put a dent in the problem, but I'm not seeing a smoking
gun solution that would absolutely save all that memory completely.

(Personally, I don't believe in swap. I wouldn't even consider swap
or any kind of traditional disk-based remedy to be a viable solution).

>> The customer that is expecting 100% fault tolerance and the provider
>> who is supporting it need to have an understanding that fault tolerance
>> is not free and that constraining memory usage will adversely affect
>> the VM's ability to be protected.
>>
>> Do I understand your expectations correctly? Is fault tolerance
>> something you're willing to sacrifice?
> As above, no I'm willing to sacrifice performance but not fault tolerance.
> (It is entirely possible that others would want the other trade off, i.e.
> some minimum performance is worse than useless, so if we can't maintain
> that performance then dropping FT leaves us in a more-working position).
>

Agreed - I think a "proactive" failover in this case would solve the
problem. If we observed that availability/fault tolerance was going
to be at risk soon (which is relatively easy to detect) - we could
just *force* a failover to the secondary host and restart the
protection from scratch.


>>
>> Well, that's simple: If there is a failure of the source, the destination
>> will simply revert to the previous checkpoint using the same mode
>> of operation. The lost ACKs that you're curious about only
>> apply to the checkpoint that is in progress. Just because a
>> checkpoint is in progress does not mean that the previous checkpoint
>> is thrown away - it is already loaded into the destination's memory
>> and ready to be activated.
> I still don't see why, if the link between them fails, the destination
> doesn't fall back to its previous checkpoint, AND the source carries
> on running - I don't see how they can differentiate which of them has failed.

I think you're forgetting that the source I/O is buffered - it doesn't
matter that the source VM is still running. As long as its output is
buffered - it cannot have any non-fault-tolerant effect on the outside
world.

Later, if a technician accesses the machine or the network
is restored, the management software can terminate the stale
source virtual machine.
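
To illustrate why the still-running source is harmless, here is a toy
model of the output buffering in Python. It is only a sketch under my
own assumptions: the class and method names (OutputBuffer, send, ack,
discard_all) are hypothetical and do not correspond to the actual MC
code.

```python
class OutputBuffer:
    """Toy model: guest output is held until its checkpoint is acked."""

    def __init__(self):
        self._epochs = {}    # checkpoint epoch -> packets held back
        self.released = []   # packets that reached the outside world

    def send(self, epoch, packet):
        # The guest "sends" a packet, but it only lands in the buffer.
        self._epochs.setdefault(epoch, []).append(packet)

    def ack(self, epoch):
        # The destination acknowledged checkpoint `epoch`: only now may
        # the packets generated during that epoch leave the host.
        packets = self._epochs.pop(epoch, [])
        self.released.extend(packets)
        return packets

    def discard_all(self):
        # The stale source is terminated: unacknowledged output is
        # dropped and never becomes visible externally.
        dropped = sum(len(p) for p in self._epochs.values())
        self._epochs.clear()
        return dropped
```

If the link fails before the ack for an epoch arrives, ack() is never
called for it, so whatever the source VM produced in the meantime stays
in the buffer until discard_all() throws it away.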

>> We have a script architecture (not on github) which runs MC in a tight
>> loop hundreds of times and kills the source QEMU and timestamps how
>> quickly the destination QEMU loses the TCP socket connection and
>> receives an error code from the kernel - every single time, the
>> destination resumes nearly instantaneously.
>> I've not empirically seen a case where the socket just hangs or doesn't
>> change state.
>>
>> I'm not very familiar with the internal linux TCP/IP stack
>> implementation itself, but I have not had a problem with the
>> dependability of the linux socket not being able to shutdown the
>> socket as soon as possible.
> OK, that only covers a very small range of normal failures.
> When you kill the destination QEMU the host OS knows that QEMU is dead
> and sends a packet back closing the socket, hence the source knows
> the destination is dead very quickly.
> If:
>     a) The destination machine was to lose power or hang
>     b) Or a network link fail  (other than the one attached to the source
>        possibly)
>
> the source would have to do a full TCP timeout.
>
> To test a,b I'd use an iptables rule somewhere to cause the packets to
> be dropped (not rejected).  Stopping the qemu in gdb might be good enough.

Very good idea - I'll add that to the "todo" list for my test
infrastructure. It may indeed turn out to be necessary
to add a formal keepalive between the source and destination.
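
For the sake of discussion, an application-level keepalive could be as
simple as the Python sketch below. This is an assumption about what
such a mechanism might look like, not a design: KeepaliveMonitor, its
methods and the timeout value are all hypothetical, and the clock is
injectable only so the behaviour can be tested without sleeping.

```python
import time


class KeepaliveMonitor:
    """Declare the peer dead if no heartbeat arrives within timeout_s."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_seen = clock()

    def heartbeat(self):
        # Call whenever any message (or explicit ping) arrives from the peer.
        self._last_seen = self._clock()

    def peer_alive(self):
        # True while the last heartbeat is more recent than the timeout.
        return (self._clock() - self._last_seen) < self._timeout_s
```

The point is that the surviving side then bounds detection time by
timeout_s instead of waiting for a full TCP timeout when packets are
silently dropped.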

- Michael
