From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:58121)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <lig.fnst@cn.fujitsu.com>) id 1WGRbS-0008C4-H0
	for qemu-devel@nongnu.org; Thu, 20 Feb 2014 06:15:48 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <lig.fnst@cn.fujitsu.com>) id 1WGRbM-0004Gb-DL
	for qemu-devel@nongnu.org; Thu, 20 Feb 2014 06:15:42 -0500
Received: from [222.73.24.84] (port=8641 helo=song.cn.fujitsu.com)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <lig.fnst@cn.fujitsu.com>) id 1WGRbK-0004C7-Rc
	for qemu-devel@nongnu.org; Thu, 20 Feb 2014 06:15:36 -0500
Message-ID: <5305E39F.5030601@cn.fujitsu.com>
Date: Thu, 20 Feb 2014 19:14:39 +0800
From: Li Guang <lig.fnst@cn.fujitsu.com>
MIME-Version: 1.0
References: <1392713429-18201-1-git-send-email-mrhines@linux.vnet.ibm.com>	<1392713429-18201-2-git-send-email-mrhines@linux.vnet.ibm.com>	<20140218124550.GF2662@work-vm>	<53040B77.3030008@linux.vnet.ibm.com>	<20140219112715.GE2916@work-vm>	<530557A4.9040508@linux.vnet.ibm.com>
	<20140220100927.GA2437@work-vm>
In-Reply-To: <20140220100927.GA2437@work-vm>
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Subject: Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for
	micro-checkpointing
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: SADEKJ@il.ibm.com, quintela@redhat.com, hinesmr@cn.ibm.com, qemu-devel@nongnu.org, "Michael R. Hines" <mrhines@linux.vnet.ibm.com>, owasserm@redhat.com, junqing.wang@cs2c.com.cn, onom@us.ibm.com, abali@us.ibm.com, EREZH@il.ibm.com, gokul@us.ibm.com, dbulkow@gmail.com, pbonzini@redhat.com, BIRAN@il.ibm.com, isaku.yamahata@gmail.com, "Michael R. Hines" <mrhines@us.ibm.com>

Dr. David Alan Gilbert wrote:
> * Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
>    
>> On 02/19/2014 07:27 PM, Dr. David Alan Gilbert wrote:
>>      
>>> I was just wondering if a separate 'max buffer size' knob would allow
>>> you to more reasonably bound memory without setting policy; I don't think
>>> people like having potentially x2 memory.
>>>        
>> Note: Checkpoint memory is not monotonic in this patchset (which
>> is unique to this implementation). Only if the guest actually dirties
>> 100% of its memory between one checkpoint and the next will
>> the host experience 2x memory usage for a short period of time.
>>      
> Right, but that doesn't really help - if someone comes along and says
> 'How much memory do I need to be able to run an mc system?' the only
> safe answer is 2x, otherwise we're adding a reason why the previously
> stable guest might OOM.
>
>    

So we may have to fall back on some disk operations
to handle memory exhaustion.
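
A rough sketch of what I mean, in Python for brevity. Everything here
(the class name, max_ram_bytes, the temp-file layout) is hypothetical
and not taken from the patchset; it only shows checkpoint pages being
kept in RAM up to a fixed cap and spilled to a file beyond it:

```python
import tempfile


class SpillingCheckpointBuffer:
    """Hypothetical buffer: pages stay in RAM up to a cap, then go to disk."""

    def __init__(self, max_ram_bytes):
        self.max_ram_bytes = max_ram_bytes
        self.ram_pages = {}      # page number -> bytes held in memory
        self.ram_bytes = 0
        self.spill_file = tempfile.TemporaryFile()
        self.spill_index = {}    # page number -> (offset, length) in spill_file

    def put(self, page_no, data):
        # Drop any older copy of this page first.
        old = self.ram_pages.pop(page_no, None)
        if old is not None:
            self.ram_bytes -= len(old)
        self.spill_index.pop(page_no, None)

        if self.ram_bytes + len(data) <= self.max_ram_bytes:
            self.ram_pages[page_no] = data
            self.ram_bytes += len(data)
        else:
            # RAM cap reached: append the page to the spill file instead.
            self.spill_file.seek(0, 2)
            offset = self.spill_file.tell()
            self.spill_file.write(data)
            self.spill_index[page_no] = (offset, len(data))

    def get(self, page_no):
        if page_no in self.ram_pages:
            return self.ram_pages[page_no]
        offset, length = self.spill_index[page_no]
        self.spill_file.seek(offset)
        return self.spill_file.read(length)

    def reset(self):
        # Called once a checkpoint has been committed.
        self.ram_pages.clear()
        self.ram_bytes = 0
        self.spill_index.clear()
        self.spill_file.seek(0)
        self.spill_file.truncate()
```

The cost is obvious: a spilled page is much slower to read back, so
this trades checkpoint latency for a hard bound on host memory.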

Thanks!

>> The patch has a 'slab' mechanism built in to it which implements
>> a water-mark style policy that throws away unused portions of
>> the 2x checkpoint memory if later checkpoints are much smaller
>> (which is likely to be the case if the writable working set size changes).
>>
>> However, to answer your question: Such a knob could be achieved, but
>> the same could be achieved simply by tuning the checkpoint frequency
>> itself. Memory usage would thus be a function of the checkpoint frequency.
>>      
>    
>> If the guest application was maniacal, banging away at all the memory,
>> there's very little that can be done in the first place, but if the
>> guest application was mildly busy, you don't want to throw away your
>> ability to be fault tolerant - you would just need more frequent
>> checkpoints to keep up with the dirty rate.
>>      
> I'm not convinced; I can tune my checkpoint frequency until normal operation
> makes a reasonable trade off between mc frequency and RAM usage,
> but that doesn't prevent it running away when a garbage collect or some
> other thing suddenly dirties a load of ram in one particular checkpoint.
> Some management tool that watches ram usage etc can also help tune
> it, but in the end it can't stop it taking loads of RAM.
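
To put rough numbers on the point above, here is a back-of-envelope
model in Python. It assumes a constant dirty rate within one
checkpoint interval and that each dirtied byte is buffered once; the
function name and the figures are made up for illustration and are not
measurements from the patchset:

```python
def checkpoint_buffer_bytes(dirty_bytes_per_sec, interval_ms, guest_ram_bytes):
    """Estimate the buffer needed for one checkpoint (hypothetical model).

    The guest cannot dirty more than all of its RAM, so the estimate is
    capped at guest_ram_bytes (the '2x memory' worst case).
    """
    dirtied = dirty_bytes_per_sec * interval_ms // 1000
    return min(dirtied, guest_ram_bytes)


MiB = 1024 ** 2
GiB = 1024 ** 3
guest_ram = 4 * GiB

# Steady state: 200 MiB/s dirtied, checkpoint every 100 ms.
steady = checkpoint_buffer_bytes(200 * MiB, 100, guest_ram)

# Burst (e.g. a garbage collection) touching RAM at 64 GiB/s for one interval.
burst = checkpoint_buffer_bytes(64 * GiB, 100, guest_ram)
```

In this model the steady case needs 20 MiB per checkpoint, while the
burst hits the cap of the full 4 GiB of guest RAM within the same
100 ms interval, so tuning the interval for the steady case does not
bound the worst case.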
>
>    
>> Once the application died down - the water-mark policy would kick in
>> and start freeing checkpoint memory. (Note: this policy happens on
>> both sides in the patchset because the patch has to be fully compatible
>> with RDMA memory pinning).
>>
>> What is *not* exposed, however, are the watermark knobs themselves.
>> I definitely think they need to be exposed - that would also get
>> you a similar control to 'max buffer size' - you could place a time
>> limit on the slab list in the patch or something like that.
>>
>>
>>      
>>>> Good question in general - I'll add it to the FAQ. The patch implements
>>>> a basic 'transaction' mechanism in coordination with an outbound I/O
>>>> buffer (documented further down). With these two things in
>>>> place, split-brain is not possible because the destination is not running.
>>>> We don't allow the destination to resume execution until a committed
>>>> transaction has been acknowledged by the destination, and only
>>>> then do we allow any outbound network traffic to be released to the
>>>> outside world.
>>>>          
>>> Yeh I see the IO buffer, what I've not figured out is how:
>>>    1) MC over TCP/IP gets an acknowledge on the source to know when
>>>       it can unplug its buffer.
>>>        
>> Only partially correct (See the steps on the wiki). There are two I/O
>> buffers at any given time which protect against a split-brain scenario:
>> One buffer for the current checkpoint that is being generated (running VM)
>> and one buffer for the checkpoint that is being committed in a transaction.
>>
>>      
>>>    2) Let's say the MC connection fails, so that ack never arrives,
>>>       the source must assume the destination has failed and release its
>>>       packets and carry on.
>>>        
>> Only the packets for Buffer A are released for the current committed
>> checkpoint after a completed transaction. The packets for Buffer B
>> (the current running VM) are still being held up until the next
>> transaction starts.
>> Later once the transaction completes and A is released, B becomes the
>> new A and a new buffer is installed to become the new Buffer B for
>> the current running VM.
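
The rotation described above can be sketched as follows, in Python for
brevity. This follows the wording of the quoted paragraph literally;
the class and method names are hypothetical and do not come from the
patchset, and `released` is just a stand-in for the real network:

```python
from collections import deque


class OutboundBuffers:
    """Hypothetical model of the two outbound I/O buffers (A and B)."""

    def __init__(self):
        self.committing = deque()  # Buffer A: checkpoint being committed
        self.running = deque()     # Buffer B: packets from the running VM
        self.released = []         # packets let out to the outside world

    def send(self, packet):
        # The running VM never transmits directly; packets are held in B.
        self.running.append(packet)

    def on_transaction_complete(self):
        # The transaction was acknowledged: release everything in A ...
        while self.committing:
            self.released.append(self.committing.popleft())
        # ... then B becomes the new A and a fresh B is installed.
        self.committing = self.running
        self.running = deque()
```

A packet therefore waits for two completed transactions before it
leaves the host: one to move from B to A, and one to be released.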
>>
>>
>>      
>>>       The destination must assume the source has failed and take over.
>>>        
>> The destination must also receive an ACK. The ack goes both ways.
>>
>> Only once the source and destination both acknowledge a completed
>> transaction does the source VM resume execution - and even then
>> its packets are still being buffered until the next transaction starts.
>> (That's why it's important to checkpoint as frequently as possible).
>>      
> I think I understand normal operation - my question here is about failure;
> what happens when neither side gets any ACKs.
>
>    
>>>    3) If we're relying on TCP/IP timeout that's quite long.
>>>
>>>        
>> Actually, my experience has been that TCP seems to have more than
>> one kind of timeout - if the receiver is not responding *at all* - it
>> seems that TCP has a dedicated timer for that. The socket API immediately
>> sends back an error code and the patchset closes the connection
>> on the destination and recovers.
>>      
> How did you test that?
> My experience is that if a host knows that it has no route to the destination
> (e.g. it has no route to try, that matches the destination, because someone
> took the network interface away) you immediately get a 'no route to host',
> however if an intermediate link disappears then it takes a while to time out.
>
>    
>>> No, I wasn't thinking of vmsplice; I just have vague memories of suggestions
>>> of the use of Intel's I/OAT, graphics cards, etc for doing things like page
>>> zeroing and DMAing data around; I can see there is a dmaengine API in the
>>> kernel, I haven't found where if anywhere that is available to userspace.
>>>
>>>        
>>>> 2) Using COW: Actually, I think that's an excellent idea. I've bounced that
>>>>       around with my colleagues, but we simply didn't have the manpower
>>>>       to implement it and benchmark it. There was also some concern about
>>>>       performance: Would the writable working set of the guest be so
>>>>       active/busy that COW would not get you much benefit? I think
>>>>       it's worth a try.
>>>>       Patches welcome =)
>>>>          
>>> It's possible that might be doable with some of the same tricks I'm
>>> looking at for post-copy, I'll see what I can do.
>>>        
>> That's great news - I'm very interested to see how this applies
>> to post-copy and any kind patches.
>>
>> - Michael
>>
>>      
> Dave
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
>
>