From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:58121)
	by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1WGRbS-0008C4-H0 for qemu-devel@nongnu.org;
	Thu, 20 Feb 2014 06:15:48 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from ) id 1WGRbM-0004Gb-DL for qemu-devel@nongnu.org;
	Thu, 20 Feb 2014 06:15:42 -0500
Received: from [222.73.24.84] (port=8641 helo=song.cn.fujitsu.com)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1WGRbK-0004C7-Rc for qemu-devel@nongnu.org;
	Thu, 20 Feb 2014 06:15:36 -0500
Message-ID: <5305E39F.5030601@cn.fujitsu.com>
Date: Thu, 20 Feb 2014 19:14:39 +0800
From: Li Guang
MIME-Version: 1.0
References: <1392713429-18201-1-git-send-email-mrhines@linux.vnet.ibm.com>
	<1392713429-18201-2-git-send-email-mrhines@linux.vnet.ibm.com>
	<20140218124550.GF2662@work-vm>
	<53040B77.3030008@linux.vnet.ibm.com>
	<20140219112715.GE2916@work-vm>
	<530557A4.9040508@linux.vnet.ibm.com>
	<20140220100927.GA2437@work-vm>
In-Reply-To: <20140220100927.GA2437@work-vm>
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Subject: Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
To: "Dr. David Alan Gilbert"
Cc: SADEKJ@il.ibm.com, quintela@redhat.com, hinesmr@cn.ibm.com,
	qemu-devel@nongnu.org, "Michael R. Hines", owasserm@redhat.com,
	junqing.wang@cs2c.com.cn, onom@us.ibm.com, abali@us.ibm.com,
	EREZH@il.ibm.com, gokul@us.ibm.com, dbulkow@gmail.com,
	pbonzini@redhat.com, BIRAN@il.ibm.com, isaku.yamahata@gmail.com,
	"Michael R. Hines"

Dr. David Alan Gilbert wrote:
> * Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
>
>> On 02/19/2014 07:27 PM, Dr.
>> David Alan Gilbert wrote:
>>
>>> I was just wondering if a separate 'max buffer size' knob would allow
>>> you to more reasonably bound memory without setting policy; I don't think
>>> people like having potentially x2 memory.
>>>
>> Note: Checkpoint memory is not monotonic in this patchset (which
>> is unique to this implementation). Only if the guest actually dirties
>> 100% of its memory from one checkpoint to the next will
>> the host experience 2x memory usage for a short period of time.
>>
> Right, but that doesn't really help - if someone comes along and says
> 'How much memory do I need to be able to run an mc system?' the only
> safe answer is 2x, otherwise we're adding a reason why the previously
> stable guest might OOM.
>
>

So we may have to involve some disk operations to handle memory exhaustion.

Thanks!

>> The patch has a 'slab' mechanism built in to it which implements
>> a water-mark style policy that throws away unused portions of
>> the 2x checkpoint memory if later checkpoints are much smaller
>> (which is likely to be the case if the writable working set size changes).
>>
>> However, to answer your question: Such a knob could be achieved, but
>> the same could be achieved simply by tuning the checkpoint frequency
>> itself. Memory usage would thus be a function of the checkpoint frequency.
>>
>
>> If the guest application was maniacal, banging away at all the memory,
>> there's very little that can be done in the first place, but if the
>> guest application was mildly busy, you don't want to throw away your
>> ability to be fault tolerant - you would just need more frequent
>> checkpoints to keep up with the dirty rate.
>>
> I'm not convinced; I can tune my checkpoint frequency until normal operation
> makes a reasonable trade off between mc frequency and RAM usage,
> but that doesn't prevent it running away when a garbage collect or some
> other thing suddenly dirties a load of ram in one particular checkpoint.
> Some management tool that watches ram usage etc can also help tune
> it, but in the end it can't stop it taking loads of RAM.
>
>
>> Once the application died down - the water-mark policy would kick in
>> and start freeing checkpoint memory. (Note: this policy happens on
>> both sides in the patchset because the patch has to be fully compatible
>> with RDMA memory pinning).
>>
>> What is *not* exposed, however, is the watermark knobs themselves,
>> I definitely think that needs to be exposed - that would also get
>> you a similar control to 'max buffer size' - you could place a time
>> limit on the slab list in the patch or something like that.......
>>
>>
>>>> Good question in general - I'll add it to the FAQ. The patch implements
>>>> a basic 'transaction' mechanism in coordination with an outbound I/O
>>>> buffer (documented further down). With these two things in
>>>> place, split-brain is not possible because the destination is not running.
>>>> We don't allow the destination to resume execution until a committed
>>>> transaction has been acknowledged by the destination and only
>>>> then do we allow any outbound network traffic to be released to the
>>>> outside world.
>>>>
>>> Yeh I see the IO buffer, what I've not figured out is how:
>>> 1) MC over TCP/IP gets an acknowledge on the source to know when
>>> it can unplug its buffer.
>>>
>> Only partially correct (See the steps on the wiki). There are two I/O
>> buffers at any given time which protect against a split-brain scenario:
>> One buffer for the current checkpoint that is being generated (running VM)
>> and one buffer for the checkpoint that is being committed in a transaction.
>>
>>
>>> 2) Let's say the MC connection fails, so that ack never arrives,
>>> the source must assume the destination has failed and release its
>>> packets and carry on.
>>>
>> Only the packets for Buffer A are released for the current committed
>> checkpoint after a completed transaction.
>> The packets for Buffer B (the current running VM) are still being
>> held up until the next transaction starts.
>> Later once the transaction completes and A is released, B becomes the
>> new A and a new buffer is installed to become the new Buffer B for
>> the current running VM.
>>
>>
>>> The destination must assume the source has failed and take over.
>>>
>> The destination must also receive an ACK. The ack goes both ways.
>>
>> Only once the source and destination both acknowledge a completed
>> transaction does the source VM resume execution - and even then
>> its packets are still being buffered until the next transaction starts.
>> (That's why it's important to checkpoint as frequently as possible).
>>
> I think I understand normal operation - my question here is about failure;
> what happens when neither side gets any ACKs.
>
>
>>> 3) If we're relying on TCP/IP timeout that's quite long.
>>>
>>>
>> Actually, my experience has been that TCP seems to have more than
>> one kind of timeout - if receiver is not responding *at all* - it seems that
>> TCP has a dedicated timer for that. The socket API immediately
>> sends back an error code and the patchset closes the connection
>> on the destination and recovers.
>>
> How did you test that?
> My experience is that if a host knows that it has no route to the destination
> (e.g. it has no route to try, that matches the destination, because someone
> took the network interface away) you immediately get a 'no route to host',
> however if an intermediate link disappears then it takes a while to time out.
>
>
>>> No, I wasn't thinking of vmsplice; I just have vague memories of suggestions
>>> of the use of Intel's I/OAT, graphics cards, etc for doing things like page
>>> zeroing and DMAing data around; I can see there is a dmaengine API in the
>>> kernel, I haven't found where if anywhere that is available to userspace.
>>>
>>>
>>>> 2) Using COW: Actually, I think that's an excellent idea.
>>>> I've bounced that around with my colleagues, but we simply didn't
>>>> have the manpower to implement it and benchmark it. There was also
>>>> some concern about performance: Would the writable working set of
>>>> the guest be so active/busy that COW would not get you much benefit?
>>>> I think it's worth a try.
>>>> Patches welcome =)
>>>>
>>> It's possible that might be doable with some of the same tricks I'm
>>> looking at for post-copy, I'll see what I can do.
>>>
>> That's great news - I'm very interested to see how this applies
>> to post-copy and any kind patches.
>>
>> - Michael
>>
>>
> Dave
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
>
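
For readers following the Buffer A / Buffer B discussion quoted above, here is a minimal sketch of how the two outbound I/O buffers could rotate. It is not taken from the patchset: the class name, method names, and the choice to rotate the buffers at the moment a transaction starts are all assumptions made purely for illustration.

```python
from collections import deque


class OutputCommitBuffers:
    """Hypothetical model of the two outbound I/O buffers (not patchset code)."""

    def __init__(self):
        self.buffer_a = deque()  # packets of the checkpoint being committed
        self.buffer_b = deque()  # packets emitted by the currently running VM
        self.released = []       # packets that have reached the outside world

    def send(self, packet):
        # Traffic from the running VM is always held back in Buffer B.
        self.buffer_b.append(packet)

    def start_transaction(self):
        # Assumption: rotation happens when a transaction starts.
        # B becomes the new A, and a fresh, empty B is installed.
        if self.buffer_a:
            raise RuntimeError("previous transaction has not been released")
        self.buffer_a, self.buffer_b = self.buffer_b, deque()

    def commit(self, source_acked, destination_acked):
        # The ack goes both ways: release Buffer A only if both sides
        # acknowledged the transaction. Buffer B stays held regardless.
        if not (source_acked and destination_acked):
            return False
        self.released.extend(self.buffer_a)
        self.buffer_a.clear()
        return True


if __name__ == "__main__":
    bufs = OutputCommitBuffers()
    bufs.send("pkt-1")
    bufs.start_transaction()   # pkt-1 now belongs to the checkpoint in flight
    bufs.send("pkt-2")         # produced while the transaction is pending
    bufs.commit(source_acked=True, destination_acked=True)
    print(bufs.released, list(bufs.buffer_b))
```

The sketch deliberately leaves out the failure case raised in the thread (neither side receiving an ACK); what should happen to Buffer A then is exactly the open question above.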