[Qemu-devel] [RFC] COLO HA Project proposal

* [Qemu-devel] [RFC] COLO HA Project proposal
@ 2014-06-24  2:08 Hongyang Yang
  2014-07-01 12:12 ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 11+ messages in thread
From: Hongyang Yang @ 2014-06-24  2:08 UTC (permalink / raw)
  To: qemu-devel; +Cc: FNST-Gui Jianfeng, Dong Eddie, kvm

[-- Attachment #1: Type: text/plain, Size: 1648 bytes --]

Background:
   The COLO HA project is a high availability solution. Both the primary
VM (PVM) and the secondary VM (SVM) run in parallel. They receive the
same requests from the client and generate responses in parallel too.
If the response packets from the PVM and SVM are identical, they are
released immediately. Otherwise, a VM checkpoint (on demand) is
conducted. The idea was presented at Xen Summit 2012 and 2013, and
in an academic paper at SOCC 2013. It was also presented at KVM Forum
2013:
http://www.linux-kvm.org/wiki/images/1/1d/Kvm-forum-2013-COLO.pdf
Please refer to the above document for detailed information.
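
As a rough illustration of the compare-and-release rule described above,
here is a minimal Python sketch. The class and method names are
hypothetical and are not taken from the COLO code; releasing the primary's
packet after a checkpoint is also an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ColoComparator:
    """Toy model of the compare-and-release rule (hypothetical names)."""

    released: List[bytes] = field(default_factory=list)
    checkpoints: int = 0

    def handle(self, pvm_packet: bytes, svm_packet: bytes) -> str:
        if pvm_packet == svm_packet:
            # Identical responses: release to the client immediately.
            self.released.append(pvm_packet)
            return "released"
        # Divergent responses: an on-demand checkpoint is needed first.
        # Assumption: the primary's packet is released afterwards.
        self.checkpoints += 1
        self.released.append(pvm_packet)
        return "checkpoint"
```

Only the output packets are compared here; nothing in the sketch looks at
the internal state of either VM.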

Attached is the architecture of kvm-COLO that we propose.
   - COLO Manager: Requires modifications to QEMU
     - COLO Controller
         The COLO Controller includes modifications to the save/restore
       flow just like MC (microcheckpointing), a memory cache on the
       secondary VM which caches the dirty pages of the primary VM,
       and a failover module which provides APIs to communicate
       with an external heartbeat module.
     - COLO Disk Manager
         When the PVM writes data into its image, the COLO Disk Manager
       captures this data and sends it to the COLO Disk Manager on the
       secondary side, which makes sure the content of the SVM's image
       is consistent with the content of the PVM's image.

   - COLO Agent ("Proxy module" in the arch picture)
       We need an agent to compare the packets returned by the
     Primary VM and Secondary VM, and decide whether to start a
     checkpoint according to some rules. It is a Linux kernel
     module for the host.

   - Other minor modifications
       We may need other modifications for better performance.

-- 
Thanks,
Yang.

[-- Attachment #2: kvm-colo-arch.jpg --]
[-- Type: image/jpeg, Size: 78793 bytes --]

* Re: [Qemu-devel] [RFC] COLO HA Project proposal
  2014-06-24  2:08 [Qemu-devel] [RFC] COLO HA Project proposal Hongyang Yang
@ 2014-07-01 12:12 ` Dr. David Alan Gilbert
  2014-07-03  3:42   ` Hongyang Yang
  2014-07-04 11:22   ` Andreas Färber
  0 siblings, 2 replies; 11+ messages in thread
From: Dr. David Alan Gilbert @ 2014-07-01 12:12 UTC (permalink / raw)
  To: Hongyang Yang; +Cc: FNST-Gui Jianfeng, Dong Eddie, qemu-devel, kvm

* Hongyang Yang (yanghy@cn.fujitsu.com) wrote:

Hi Yang,

> Background:
>   COLO HA project is a high availability solution. Both primary
> VM (PVM) and secondary VM (SVM) run in parallel. They receive the
> same request from client, and generate response in parallel too.
> If the response packets from PVM and SVM are identical, they are
> released immediately. Otherwise, a VM checkpoint (on demand) is
> conducted. The idea is presented in Xen summit 2012, and 2013,
> and academia paper in SOCC 2013. It's also presented in KVM forum
> 2013:
> http://www.linux-kvm.org/wiki/images/1/1d/Kvm-forum-2013-COLO.pdf
> Please refer to above document for detailed information.

Yes, I remember that talk - very interesting.

I didn't quite understand a couple of things though, perhaps you
can explain:
  1) If we ignore the TCP sequence number problem, in an SMP machine
don't we get other randomnesses - e.g. which core completes something
first, or who wins a lock contention, so the output stream might not
be identical - so do those normal bits of randomness cause the machines
to flag as out-of-sync?

  2) If the PVM has decided that the SVM is out of sync (due to 1) and
the PVM fails at about the same point - can we switch over to the SVM?

I'm worried that due to (1) there are periods where the system
is out-of-sync and a failure of the PVM is not protected.  Does that happen?
If so how often?

> The attached was the architecture of kvm-COLO we proposed.
>   - COLO Manager: Requires modifications of qemu
>     - COLO Controller
>         COLO Controller includes modifications of save/restore
>       flow just like MC(microcheckpoint), a memory cache on
>       secondary VM which cache the dirty pages of primary VM
>       and a failover module which provides APIs to communicate
>       with external heartbeat module.
>     - COLO Disk Manager
>         When pvm writes data into image, the colo disk manager
>       captures this data and send it to the colo disk manager
>       which makes sure the content of svm's image is consistent
>       with the content of pvm's image.

I wonder if there is any way to coordinate this between COLO, Michael
Hines's microcheckpointing and the two separate reverse-execution
projects that also need to do some similar things.
Are there any standard APIs for the heartbeat thing we can already
tie into?

>   - COLO Agent("Proxy module" in the arch picture)
>       We need an agent to compare the packets returned by
>     Primary VM and Secondary VM, and decide whether to start a
>     checkpoint according to some rules. It is a linux kernel
>     module for host.

Why is that a kernel module, and how does it communicate the state
to the QEMU instance?

>   - Other minor modifications
>       We may need other modifications for better performance.

Dave
P.S. I'm starting to look at fault-tolerance stuff, but haven't
got very far yet, so starting to try and understand the details
of COLO, microcheckpointing, etc

> -- 
> Thanks,
> Yang.

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC] COLO HA Project proposal
  2014-07-01 12:12 ` Dr. David Alan Gilbert
@ 2014-07-03  3:42   ` Hongyang Yang
  2014-07-04  8:31     ` Dong, Eddie
  2014-07-08  6:06     ` Michael R. Hines
  2014-07-04 11:22   ` Andreas Färber
  1 sibling, 2 replies; 11+ messages in thread
From: Hongyang Yang @ 2014-07-03  3:42 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: FNST-Gui Jianfeng, Dong Eddie, qemu-devel, kvm

Hi David,

On 07/01/2014 08:12 PM, Dr. David Alan Gilbert wrote:
> * Hongyang Yang (yanghy@cn.fujitsu.com) wrote:
>
> Hi Yang,
>
>> Background:
>>    COLO HA project is a high availability solution. Both primary
>> VM (PVM) and secondary VM (SVM) run in parallel. They receive the
>> same request from client, and generate response in parallel too.
>> If the response packets from PVM and SVM are identical, they are
>> released immediately. Otherwise, a VM checkpoint (on demand) is
>> conducted. The idea is presented in Xen summit 2012, and 2013,
>> and academia paper in SOCC 2013. It's also presented in KVM forum
>> 2013:
>> http://www.linux-kvm.org/wiki/images/1/1d/Kvm-forum-2013-COLO.pdf
>> Please refer to above document for detailed information.
>
> Yes, I remember that talk - very interesting.
>
> I didn't quite understand a couple of things though, perhaps you
> can explain:
>    1) If we ignore the TCP sequence number problem, in an SMP machine
> don't we get other randomnesses - e.g. which core completes something
> first, or who wins a lock contention, so the output stream might not
> be identical - so do those normal bits of randomness cause the machines
> to flag as out-of-sync?

This is about the COLO agent. CCing Congyang, who can give a detailed
explanation.

>
>    2) If the PVM has decided that the SVM is out of sync (due to 1) and
> the PVM fails at about the same point - can we switch over to the SVM?

Yes, we can switch over. We have some mechanisms to ensure the SVM's state
is consistent:
- Memory cache
   The memory cache is initially the same as the PVM's memory. At a
checkpoint, we cache the PVM's dirty memory while it is being transferred,
and write the cached memory to the SVM only once we have received all of
the PVM's memory (we only need to write memory that was dirtied on the PVM
and on the SVM since the last checkpoint). This solves problem 2) that you
mentioned above: if the PVM fails while checkpointing, the SVM will discard
the cached memory and continue to run and provide service just as it is.

- COLO Disk Manager
   Like the memory cache, the COLO Disk Manager caches the disk
modifications of the PVM and writes them to the SVM's disk when
checkpointing. If the PVM fails while checkpointing, the SVM will discard
the cached disk modifications.
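
The two caches above follow the same stage-then-commit pattern, which can
be sketched roughly as follows. This is only a toy Python model with
hypothetical names; it ignores dirty-page tracking and is not how the
QEMU side is actually implemented.

```python
from typing import Dict


class CheckpointCache:
    """Stage PVM updates; apply them to the SVM only on a complete checkpoint."""

    def __init__(self, svm_state: Dict[int, bytes]) -> None:
        self.svm_state = svm_state          # page/sector number -> contents
        self.staged: Dict[int, bytes] = {}  # updates received so far

    def receive(self, index: int, data: bytes) -> None:
        # Buffer the update; the SVM's live state is left untouched.
        self.staged[index] = data

    def commit(self) -> None:
        # The whole checkpoint has arrived: apply it in one go.
        self.svm_state.update(self.staged)
        self.staged.clear()

    def discard(self) -> None:
        # The PVM failed mid-checkpoint: drop the partial data.
        self.staged.clear()
```

Because nothing reaches `svm_state` before `commit()`, a PVM failure in the
middle of a checkpoint leaves the SVM in the state it was already running
with.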

>
> I'm worried that due to (1) there are periods where the system
> is out-of-sync and a failure of the PVM is not protected.  Does that happen?
> If so how often?
>
>> The attached was the architecture of kvm-COLO we proposed.
>>    - COLO Manager: Requires modifications of qemu
>>      - COLO Controller
>>          COLO Controller includes modifications of save/restore
>>        flow just like MC(microcheckpoint), a memory cache on
>>        secondary VM which cache the dirty pages of primary VM
>>        and a failover module which provides APIs to communicate
>>        with external heartbeat module.
>>      - COLO Disk Manager
>>          When pvm writes data into image, the colo disk manager
>>        captures this data and send it to the colo disk manager
>>        which makes sure the content of svm's image is consistent
>>        with the content of pvm's image.
>
> I wonder if there is any way to coordinate this between COLO, Michael
> Hines's microcheckpointing and the two separate reverse-execution
> projects that also need to do some similar things.
> Are there any standard APIs for the heartbeat thing we can already
> tie into?

Sadly, we have checked MC, and it does not have heartbeat support for now.

>
>>    - COLO Agent("Proxy module" in the arch picture)
>>        We need an agent to compare the packets returned by
>>      Primary VM and Secondary VM, and decide whether to start a
>>      checkpoint according to some rules. It is a linux kernel
>>      module for host.
>
> Why is that a kernel module, and how does it communicate the state
> to the QEMU instance?

We made this a kernel module to get better performance: packets are
easy to hook in a kernel module.
The QEMU instance uses ioctl() to communicate with the COLO Agent.

>
>>    - Other minor modifications
>>        We may need other modifications for better performance.
>
> Dave
> P.S. I'm starting to look at fault-tolerance stuff, but haven't
> got very far yet, so starting to try and understand the details
> of COLO, microcheckpointing, etc
>
>> --
>> Thanks,
>> Yang.
>
>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> .
>

-- 
Thanks,
Yang.

* Re: [Qemu-devel] [RFC] COLO HA Project proposal
  2014-07-03  3:42   ` Hongyang Yang
@ 2014-07-04  8:31     ` Dong, Eddie
  2014-07-04  8:35       ` Dr. David Alan Gilbert
  2014-07-08  6:06     ` Michael R. Hines
  1 sibling, 1 reply; 11+ messages in thread
From: Dong, Eddie @ 2014-07-04  8:31 UTC (permalink / raw)
  To: Hongyang Yang, Dr. David Alan Gilbert
  Cc: FNST-Gui Jianfeng, Dong, Eddie, qemu-devel@nongnu.org,
	kvm@vger.kernel.org

> >
> > I didn't quite understand a couple of things though, perhaps you can
> > explain:
> >    1) If we ignore the TCP sequence number problem, in an SMP machine
> > don't we get other randomnesses - e.g. which core completes something
> > first, or who wins a lock contention, so the output stream might not
> > be identical - so do those normal bits of randomness cause the
> > machines to flag as out-of-sync?
> 
> It's about COLO agent, CCing Congyang, he can give the detailed
> explanation.
> 

Let me clarify this issue. COLO doesn't ignore the TCP sequence number; it uses a
new implementation that makes the sequence number identical, on a best-effort basis,
between the primary VM (PVM) and the secondary VM (SVM). Likely, the VMM has to synchronize
the emulation of the random number generation mechanism between the
PVM and SVM, as a lock-stepping mechanism does.

Furthermore, for long TCP connections, we can rely on the (on-demand) VM checkpoint to get
identical sequence numbers in both the PVM and the SVM.

Thanks, Eddie

* Re: [Qemu-devel] [RFC] COLO HA Project proposal
  2014-07-04  8:31     ` Dong, Eddie
@ 2014-07-04  8:35       ` Dr. David Alan Gilbert
  2014-07-04  8:54         ` Dong, Eddie
  0 siblings, 1 reply; 11+ messages in thread
From: Dr. David Alan Gilbert @ 2014-07-04  8:35 UTC (permalink / raw)
  To: Dong, Eddie
  Cc: FNST-Gui Jianfeng, Hongyang Yang, qemu-devel@nongnu.org,
	kvm@vger.kernel.org

* Dong, Eddie (eddie.dong@intel.com) wrote:
> > >
> > > I didn't quite understand a couple of things though, perhaps you can
> > > explain:
> > >    1) If we ignore the TCP sequence number problem, in an SMP machine
> > > don't we get other randomnesses - e.g. which core completes something
> > > first, or who wins a lock contention, so the output stream might not
> > > be identical - so do those normal bits of randomness cause the
> > > machines to flag as out-of-sync?
> > 
> > It's about COLO agent, CCing Congyang, he can give the detailed
> > explanation.
> > 
> 
> Let me clarify on this issue. COLO didn't ignore the TCP sequence number, but uses a 
> new implementation to make the sequence number to be best effort identical 
> between the primary VM (PVM) and secondary VM (SVM). Likely, VMM has to synchronize 
> the emulation of randomization number generation mechanism between the 
> PVM and SVM, like the lock-stepping mechanism does. 
> 
> Furthermore, for long TCP connection, we can rely on the (on-demand) VM checkpoint to get the 
> identical Sequence number both in PVM and SVM. 

That wasn't really my question; I was worrying about other forms of randomness,
such as winners of lock contention, and other SMP non-determinisms,
and I'm also worried by what proportion of time the system can't recover
from a failure due to being unable to distinguish an SVM failure from
a randomness issue.

Dave

> 
> 
> Thanks, Eddie
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC] COLO HA Project proposal
  2014-07-04  8:35       ` Dr. David Alan Gilbert
@ 2014-07-04  8:54         ` Dong, Eddie
  2014-07-04 12:22           ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 11+ messages in thread
From: Dong, Eddie @ 2014-07-04  8:54 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: kvm@vger.kernel.org, FNST-Gui Jianfeng, Dong, Eddie,
	qemu-devel@nongnu.org, Hongyang Yang

> >
> > Let me clarify on this issue. COLO didn't ignore the TCP sequence
> > number, but uses a new implementation to make the sequence number to
> > be best effort identical between the primary VM (PVM) and secondary VM
> > (SVM). Likely, VMM has to synchronize the emulation of randomization
> > number generation mechanism between the PVM and SVM, like the
> lock-stepping mechanism does.
> >
> > Furthermore, for long TCP connection, we can rely on the (on-demand)
> > VM checkpoint to get the identical Sequence number both in PVM and
> SVM.
> 
> That wasn't really my question; I was worrying about other forms of
> randomness, such as winners of lock contention, and other SMP
> non-determinisms, and I'm also worried by what proportion of time the
> system can't recover from a failure due to being unable to distinguish an
> SVM failure from a randomness issue.
> 
Thanks Dave:
	Whatever randomness in values/branches/code paths the PVM and SVM may have,
it is only a performance issue. COLO never assumes that the PVM and SVM have the same
internal machine state. From a correctness p.o.v., as long as the PVM and SVM generate
identical responses, we can view the SVM as a valid replica of the PVM, and the SVM can
take over when the PVM suffers a hardware failure. We can view the client as talking to
the SVM all along, without the notion of a PVM. Of course, if the SVM dies, we can
regenerate a copy of the PVM with a new checkpoint too.
	The SOCC paper has the detailed recovery model :)

Thanks, Eddie

* Re: [Qemu-devel] [RFC] COLO HA Project proposal
  2014-07-01 12:12 ` Dr. David Alan Gilbert
  2014-07-03  3:42   ` Hongyang Yang
@ 2014-07-04 11:22   ` Andreas Färber
  1 sibling, 0 replies; 11+ messages in thread
From: Andreas Färber @ 2014-07-04 11:22 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Hongyang Yang
  Cc: FNST-Gui Jianfeng, Dong Eddie, qemu-devel, kvm

Am 01.07.2014 14:12, schrieb Dr. David Alan Gilbert:
> Are there any standard APIs for the heartbeat thing we can already
> tie into?

Maybe the http://www.linux-ha.org/wiki/Heartbeat daemon?

Andreas

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg

* Re: [Qemu-devel] [RFC] COLO HA Project proposal
  2014-07-04  8:54         ` Dong, Eddie
@ 2014-07-04 12:22           ` Dr. David Alan Gilbert
  2014-07-04 15:55             ` Dong, Eddie
  0 siblings, 1 reply; 11+ messages in thread
From: Dr. David Alan Gilbert @ 2014-07-04 12:22 UTC (permalink / raw)
  To: Dong, Eddie
  Cc: FNST-Gui Jianfeng, Hongyang Yang, qemu-devel@nongnu.org,
	kvm@vger.kernel.org

* Dong, Eddie (eddie.dong@intel.com) wrote:
> > >
> > > Let me clarify on this issue. COLO didn't ignore the TCP sequence
> > > number, but uses a new implementation to make the sequence number to
> > > be best effort identical between the primary VM (PVM) and secondary VM
> > > (SVM). Likely, VMM has to synchronize the emulation of randomization
> > > number generation mechanism between the PVM and SVM, like the
> > lock-stepping mechanism does.
> > >
> > > Furthermore, for long TCP connection, we can rely on the (on-demand)
> > > VM checkpoint to get the identical Sequence number both in PVM and
> > SVM.
> > 
> > That wasn't really my question; I was worrying about other forms of
> > randomness, such as winners of lock contention, and other SMP
> > non-determinisms, and I'm also worried by what proportion of time the
> > system can't recover from a failure due to being unable to distinguish an
> > SVM failure from a randomness issue.
> > 
> Thanks Dave:
> 	Whether the randomness value/branch/code path the PVM and SVM may have,
> It is only a performance issue. COLO never assumes the PVM and SVM has same internal
> Machine state.  From correctness p.o.v, as if the PVM and SVM generate
> Identical response, we can view the SVM is a valid replica of PVM, and the SVM can take over
> When the PVM suffers from hardware failure. We can view the client is all the way talking with 
> the SVM, without the notion of PVM.  Of course, if the SVM dies, we can regenerate a copy
> of PVM with a new checkpoint too.
> 	The SOCC paper has the detail recovery model :)

I've had a read; I think the bit I was asking about was what you labelled 'D' in that
paper's fig. 4 - so I think that does explain it for me.
But I also have some more questions:

  1) 5.3.3 Web server
    a) In fig 11 it shows Remus's performance dropping off with the number of threads - why is that? Is it
       just an increase in the amount of memory changes in each snapshot?
    b) Is fig 11/12 measured with all of the TCP optimisations shown in fig 13 on?

  2) Did you manage to overcome the issue shown in 5.6, the degradation with newer guest kernels - could you just fall
     back to micro checkpointing if the guests diverge too quickly?

  3) Was the link between the two servers for synchronisation a low-latency dedicated connection?

  4) Did you try an ftp PUT benchmark using external storage - i.e. that wouldn't have the local disc
     synchronisation overhead?

Dave

> 
> Thanks, Eddie
> 
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC] COLO HA Project proposal
  2014-07-04 12:22           ` Dr. David Alan Gilbert
@ 2014-07-04 15:55             ` Dong, Eddie
  0 siblings, 0 replies; 11+ messages in thread
From: Dong, Eddie @ 2014-07-04 15:55 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: kvm@vger.kernel.org, FNST-Gui Jianfeng, Dong, Eddie,
	qemu-devel@nongnu.org, Hongyang Yang

> > Thanks Dave:
> > 	Whether the randomness value/branch/code path the PVM and SVM
> may
> > have, It is only a performance issue. COLO never assumes the PVM and
> > SVM has same internal Machine state.  From correctness p.o.v, as if
> > the PVM and SVM generate Identical response, we can view the SVM is a
> > valid replica of PVM, and the SVM can take over When the PVM suffers
> > from hardware failure. We can view the client is all the way talking
> > with the SVM, without the notion of PVM.  Of course, if the SVM dies, we
> can regenerate a copy of PVM with a new checkpoint too.
> > 	The SOCC paper has the detail recovery model :)
> 
> I've had a read; I think the bit I was asking about was what you labelled 'D' in
> that paper's fig. 4 - so I think that does explain it for me.

Very good :)

> But I also have some more questions:
> 
>   1) 5.3.3 Web server
>     a) In fig 11 it shows Remus's performance dropping off with the number
> of threads - why is that? Is it
>        just an increase in the amount of memory changes in each
> snapshot?

I didn't dig into the details; we just documented the throughput we observed.
It seemed a bit strange to me too. The dirty page set may be larger than in the
small-connection case, but I am not sure; that is the data we saw :(

>     b) Is fig 11/12 measured with all of the TCP optimisations shown in fig
> 13 on?

Yes.

> 
>   2) Did you manage to overcome the issue shown in 5.6 with newer guest
> kernels degradation - could you just fall
>      back to micro checkpointing if the guests diverge too quickly?

In general, I would say the COLO performance for these 2 workloads is pretty good, and
I actually didn't include subsection 5.6 initially. It was the conference shepherd who asked me to
add this paragraph to make the paper balanced :)

In summary, COLO can have very good MP-guest performance compared with Remus, at
the cost of potential optimization/modification effort in the guest TCP/IP stack. One solution may
not work for all workloads, but it leaves a lot of room for OSVs to provide customized solutions
for a specific usage, which I think is very good for the open source business model: make money through
consulting. Huawei Technologies Co., Ltd. has announced support for COLO in their cloud OS,
probably for a specific usage too.

> 
>   3) Was the link between the two servers for synchronisation a low-latency
> dedicated connection?

We used a 10 Gbps NIC in the paper, and yes, it was a dedicated link, but the solution itself doesn't
require a dedicated link.

> 
>   4) Did you try an ftp PUT benchmark using external storage - i.e. that
> wouldn't have the local disc
>      synchronisation overhead?

Not yet.
External network shared storage works, but today the performance may not be that good,
because our optimization so far is still very limited. It is just an initial effort to make the 2
common workloads happy. We believe there is a lot of room ahead to make the response of the
TCP/IP stack more predictable. Once the basic COLO stuff is ready for production and accepted
by the industry, we may be able to influence the TCP community to keep this kind of predictability
in mind for future protocol development, which would greatly help performance.

Thx Eddie

* Re: [Qemu-devel] [RFC] COLO HA Project proposal
  2014-07-03  3:42   ` Hongyang Yang
  2014-07-04  8:31     ` Dong, Eddie
@ 2014-07-08  6:06     ` Michael R. Hines
  2014-07-08  6:26       ` Hongyang Yang
  1 sibling, 1 reply; 11+ messages in thread
From: Michael R. Hines @ 2014-07-08  6:06 UTC (permalink / raw)
  To: Hongyang Yang, Dr. David Alan Gilbert
  Cc: FNST-Gui Jianfeng, Dong Eddie, qemu-devel, kvm

On 07/03/2014 11:42 AM, Hongyang Yang wrote:
>
>> I wonder if there is any way to coordinate this between COLO, Michael
>> Hines's microcheckpointing and the two separate reverse-execution
>> projects that also need to do some similar things.
>> Are there any standard APIs for the heartbeat thing we can already
>> tie into?
>
> Sadly we have checked MC, it does not have heartbeat support for now.
>

Right, MC by itself does not need heartbeats out of the box.

Probably the best thing we can coordinate from MC is the part of the
data transmission protocol and memory management - because we need
to make sure you guys are staying compatible with the QEMUFile
abstraction the same way that the TCP and RDMA protocols are staying
compatible with that abstraction.

COLO should be able to run over any protocol supported by QEMUFile.
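
To illustrate what staying transport-agnostic means, here is a small
Python sketch in which the checkpoint code only talks to a generic byte
stream. It is not QEMU's QEMUFile API (which is C), and the wire format
and function names are made up for this example.

```python
import io
import struct
from typing import BinaryIO, Dict


def send_checkpoint(stream: BinaryIO, pages: Dict[int, bytes]) -> None:
    """Write pages to any writable byte stream (socket file, pipe, buffer)."""
    # Hypothetical format: page count, then (index, length, data) records.
    stream.write(struct.pack("!I", len(pages)))
    for index, data in sorted(pages.items()):
        stream.write(struct.pack("!QI", index, len(data)))
        stream.write(data)


def recv_checkpoint(stream: BinaryIO) -> Dict[int, bytes]:
    """Read back what send_checkpoint() wrote, from any readable stream."""
    (count,) = struct.unpack("!I", stream.read(4))
    pages: Dict[int, bytes] = {}
    for _ in range(count):
        index, length = struct.unpack("!QI", stream.read(12))
        pages[index] = stream.read(length)
    return pages
```

Neither function knows what carries the bytes, so an in-memory
`io.BytesIO` buffer and a file object from `socket.makefile()` would both
work as the stream.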

I can help with some of that...

- Michael

* Re: [Qemu-devel] [RFC] COLO HA Project proposal
  2014-07-08  6:06     ` Michael R. Hines
@ 2014-07-08  6:26       ` Hongyang Yang
  0 siblings, 0 replies; 11+ messages in thread
From: Hongyang Yang @ 2014-07-08  6:26 UTC (permalink / raw)
  To: Michael R. Hines, Dr. David Alan Gilbert
  Cc: FNST-Gui Jianfeng, Dong Eddie, qemu-devel, kvm

Hi Michael,

   Thank you for paying attention to this.

On 07/08/2014 02:06 PM, Michael R. Hines wrote:
> On 07/03/2014 11:42 AM, Hongyang Yang wrote:
>>
>>> I wonder if there is any way to coordinate this between COLO, Michael
>>> Hines's microcheckpointing and the two separate reverse-execution
>>> projects that also need to do some similar things.
>>> Are there any standard APIs for the heartbeat thing we can already
>>> tie into?
>>
>> Sadly we have checked MC, it does not have heartbeat support for now.
>>
>
> Right, MC by itself does not need heartbeats out of the box.
>
> Probably the best thing we can coordinate from MC is the part of the
> data transmission protocol and memory management - because we need

Yes, that's what we have planned.

> to make sure you guys are staying compatible with the QEMUFile
> abstraction the same way that the TCP and RDMA protocols are staying
> compatible with that abstraction.
>
> COLO should be able to run over any protocol supported by QEMUFile.

Indeed.

>
> I can help with some of that...
>
> - Michael
>
> .
>

-- 
Thanks,
Yang.
