qemu-devel.nongnu.org archive mirror
From: "Michael R. Hines" <mrhines@linux.vnet.ibm.com>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: GILR@il.ibm.com, SADEKJ@il.ibm.com, pbonzini@redhat.com,
	quintela@redhat.com, abali@us.ibm.com, qemu-devel@nongnu.org,
	EREZH@il.ibm.com, owasserm@redhat.com, onom@us.ibm.com,
	hinesmr@cn.ibm.com, isaku.yamahata@gmail.com, gokul@us.ibm.com,
	dbulkow@gmail.com, junqing.wang@cs2c.com.cn, BIRAN@il.ibm.com,
	lig.fnst@cn.fujitsu.com, "Michael R. Hines" <mrhines@us.ibm.com>
Subject: Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
Date: Thu, 20 Feb 2014 09:17:24 +0800
Message-ID: <530557A4.9040508@linux.vnet.ibm.com>
In-Reply-To: <20140219112715.GE2916@work-vm>

On 02/19/2014 07:27 PM, Dr. David Alan Gilbert wrote:
>
> I was just wondering if a separate 'max buffer size' knob would allow
> you to more reasonably bound memory without setting policy; I don't think
> people like having potentially x2 memory.

Note: Checkpoint memory is not monotonic in this patchset (which
is unique to this implementation). Only if the guest actually dirties
100% of its memory between one checkpoint and the next will
the host experience 2x memory usage for a short period of time.

The patch has a 'slab' mechanism built into it which implements
a water-mark style policy that throws away unused portions of
the 2x checkpoint memory if later checkpoints are much smaller
(which is likely to be the case if the writable working set size changes).
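To make the water-mark idea concrete, here is a minimal sketch in Python. It is not the code from the patchset: the class name, the slab size and the 'shrink_after' threshold are all hypothetical, chosen only to show a pool that grows on demand and frees unused slabs after several smaller checkpoints in a row.

```python
SLAB_SIZE = 4  # pages per slab (hypothetical; kept tiny for illustration)


class SlabPool:
    """Holds checkpoint slabs and trims those that recent checkpoints left unused."""

    def __init__(self, shrink_after=3):
        self.slabs = 0                    # slabs currently allocated
        self.shrink_after = shrink_after  # hypothetical water-mark threshold
        self.small_streak = 0             # consecutive checkpoints below pool size
        self.peak_in_streak = 0           # largest need seen during that streak

    def checkpoint(self, dirty_pages):
        """Account for one checkpoint and return the pool size afterwards."""
        needed = -(-dirty_pages // SLAB_SIZE)  # ceiling division
        if needed >= self.slabs:
            # Grow on demand; a fully used pool resets the shrink counter.
            self.slabs = needed
            self.small_streak = 0
            self.peak_in_streak = 0
        else:
            self.small_streak += 1
            self.peak_in_streak = max(self.peak_in_streak, needed)
            if self.small_streak >= self.shrink_after:
                # Water mark reached: free everything above the recent peak.
                self.slabs = self.peak_in_streak
                self.small_streak = 0
                self.peak_in_streak = 0
        return self.slabs
```

With shrink_after=3, one large checkpoint followed by a run of small ones keeps the large pool for two more checkpoints and drops to the recent peak on the third.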

However, to answer your question: Such a knob could be added, but
the same effect could be had simply by tuning the checkpoint frequency
itself. Memory usage would thus be a function of the checkpoint frequency.

If the guest application was maniacal, banging away at all the memory,
there's very little that can be done in the first place, but if the
guest application was mildly busy, you don't want to throw away your
ability to be fault tolerant - you would just need more frequent
checkpoints to keep up with the dirty rate.
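As a rough worked example of memory usage being a function of checkpoint frequency (a back-of-the-envelope sketch, not something the patchset computes): assume the guest dirties memory at a constant rate and never re-dirties the same page within an interval. Both function names are hypothetical.

```python
def checkpoint_bytes(dirty_rate_bps, interval_ms, guest_ram_bytes):
    """Upper bound on bytes buffered per checkpoint, capped at guest RAM."""
    return min(dirty_rate_bps * interval_ms // 1000, guest_ram_bytes)


def interval_for_budget_ms(dirty_rate_bps, budget_bytes):
    """Longest checkpoint interval (ms) that keeps the buffer within budget."""
    return budget_bytes * 1000 // dirty_rate_bps


MiB = 1 << 20
GiB = 1 << 30

# A 4 GiB guest dirtying 200 MiB/s, checkpointed every 100 ms:
per_checkpoint = checkpoint_bytes(200 * MiB, 100, 4 * GiB)   # 20 MiB

# To stay under a 64 MiB buffer at the same dirty rate:
max_interval = interval_for_budget_ms(200 * MiB, 64 * MiB)   # 320 ms
```

The cap at guest RAM is the "maniacal" case: past that point a shorter interval is the only thing that reduces the buffer.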

Once the application died down - the water-mark policy would kick in
and start freeing checkpoint memory. (Note: this policy happens on
both sides in the patchset because the patch has to be fully compatible
with RDMA memory pinning).

What is *not* exposed, however, are the watermark knobs themselves.
I definitely think those need to be exposed - that would also get you
a similar control to 'max buffer size' - you could place a time limit
on the slab list in the patch or something like that.
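One way such a time limit could look, purely as an illustration: the class and the 'max_idle_s' knob below are hypothetical and do not exist in the patchset. Each slab remembers when it was last filled, and anything idle for longer than the limit is dropped.

```python
class TimedSlabList:
    """Tracks when each slab was last used and expires the idle ones."""

    def __init__(self, max_idle_s):
        self.max_idle_s = max_idle_s  # hypothetical user-visible knob
        self.last_used = {}           # slab id -> time it was last filled

    def touch(self, slab_id, now):
        """Record that a checkpoint wrote into this slab at time 'now'."""
        self.last_used[slab_id] = now

    def expire(self, now):
        """Drop slabs idle for longer than the limit; return their ids."""
        stale = [s for s, t in self.last_used.items()
                 if now - t > self.max_idle_s]
        for s in stale:
            del self.last_used[s]
        return stale
```

Timestamps are passed in explicitly so the policy can be exercised without a real clock.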


>>
>> Good question in general - I'll add it to the FAQ. The patch implements
>> a basic 'transaction' mechanism in coordination with an outbound I/O
>> buffer (documented further down). With these two things in
>> place, split-brain is not possible because the destination is not running.
>> We don't allow the destination to resume execution until a committed
>> transaction has been acknowledged by the destination and only
>> then do we allow any outbound network traffic to be released to the
>> outside world.
> Yeh I see the IO buffer, what I've not figured out is how:
>    1) MC over TCP/IP gets an acknowledge on the source to know when
>       it can unplug its buffer.

Only partially correct (See the steps on the wiki). There are two I/O
buffers at any given time which protect against a split-brain scenario:
One buffer for the current checkpoint that is being generated (running VM)
and one buffer for the checkpoint that is being committed in a transaction.

>    2) Let's say the MC connection fails, so that ack never arrives,
>       the source must assume the destination has failed and release its
>       packets and carry on.

Only the packets for Buffer A are released for the current committed
checkpoint after a completed transaction. The packets for Buffer B
(the current running VM) are still being held up until the next
transaction starts. Later, once the transaction completes and A is
released, B becomes the new A and a new buffer is installed to become
the new Buffer B for the current running VM.
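The rotation described above can be sketched roughly as follows. This is an illustration only, with hypothetical names, and it takes the description literally by swapping the buffers at the moment a transaction completes; the patchset's actual data structures are not shown in this message.

```python
class OutboundBuffers:
    """Two outbound I/O buffers: A (being committed) and B (running VM)."""

    def __init__(self):
        self.committing = []  # Buffer A: packets of the checkpoint in flight
        self.running = []     # Buffer B: packets of the currently running VM
        self.released = []    # packets already let out to the outside world

    def queue(self, packet):
        """The running VM sends a packet: hold it in Buffer B."""
        self.running.append(packet)

    def transaction_complete(self):
        """A is released, B becomes the new A, and a fresh B is installed."""
        self.released.extend(self.committing)
        self.committing = self.running
        self.running = []
```

Note that a packet queued by the running VM is only released after two completed transactions: one that moves it from B to A, and one that releases A.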


>       The destination must assume the source has failed and take over.

The destination must also receive an ACK. The ack goes both ways.

Only once the source and destination both acknowledge a completed
transaction does the source VM resume execution - and even then
its packets are still being buffered until the next transaction starts.
(That's why it's important to checkpoint as frequently as possible).


>    3) If we're relying on TCP/IP timeout that's quite long.
>

Actually, my experience has been that TCP seems to have more than
one kind of timeout - if the receiver is not responding *at all* - it
seems that TCP has a dedicated timer for that. The socket API immediately
sends back an error code and the patchset closes the connection
on the destination and recovers.
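This message does not say which TCP timer is involved, so the following is only a sketch of one common way to bound how long a dead peer goes unnoticed: enabling keepalive probes and, where the platform offers it, TCP_USER_TIMEOUT. The function name and the default values are hypothetical, and this is not taken from the patchset. Options a platform lacks are skipped.

```python
import socket


def enable_fast_failure_detection(sock, idle_s=5, interval_s=1, probes=3):
    """Ask the kernel to report an unresponsive peer within a bounded time."""
    # Send keepalive probes on an otherwise idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Fine-grained keepalive tuning is platform specific, so look each
    # option up by name and skip the ones this platform does not define.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)

    # Linux only: cap how long unacknowledged data may stay outstanding
    # (in milliseconds) before the connection is dropped with an error.
    user_timeout = getattr(socket, "TCP_USER_TIMEOUT", None)
    if user_timeout is not None:
        timeout_ms = (idle_s + interval_s * probes) * 1000
        sock.setsockopt(socket.IPPROTO_TCP, user_timeout, timeout_ms)
```

Once one of these timers fires, a blocked send or receive on the socket fails with an error, at which point the connection can be closed and recovery started.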

> No, I wasn't thinking of vmsplice; I just have vague memories of suggestions
> of the use of Intel's I/OAT, graphics cards, etc for doing things like page
> zeroing and DMAing data around; I can see there is a dmaengine API in the
> kernel, I haven't found where if anywhere that is available to userspace.
>
>> 2) Using COW: Actually, I think that's an excellent idea. I've bounced that
>>       around with my colleagues, but we simply didn't have the manpower
>>       to implement it and benchmark it. There was also some concern about
>>       performance: Would the writable working set of the guest be so
>>       active/busy that COW would not get you much benefit? I think it's
>>       worth a try.
>>       Patches welcome =)
> It's possible that might be doable with some of the same tricks I'm
> looking at for post-copy, I'll see what I can do.

That's great news - I'm very interested to see how this applies
to post-copy and any kind of patches.

- Michael
