Message-ID: <4B000528.7070802@redhat.com>
Date: Sun, 15 Nov 2009 15:42:00 +0200
From: Dor Laor
Reply-To: dlaor@redhat.com
Subject: [Qemu-devel] Re: [RFC] KVM Fault Tolerance: Kemari for KVM
References: <4AF79242.20406@oss.ntt.co.jp> <4AFC837D.2060307@redhat.com>
 <4AFD47A0.6040202@lab.ntt.co.jp>
In-Reply-To: <4AFD47A0.6040202@lab.ntt.co.jp>
To: Yoshiaki Tamura
Cc: Andrea Arcangeli, Chris Wright, "大村圭(oomura kei)",
 kvm@vger.kernel.org, Fernando Luis Vázquez Cao, qemu-devel@nongnu.org,
 Takuya Yoshikawa, avi@redhat.com

On 11/13/2009 01:48 PM, Yoshiaki Tamura wrote:
> Hi,
>
> Thanks for your comments!
>
> Dor Laor wrote:
>> On 11/09/2009 05:53 AM, Fernando Luis Vázquez Cao wrote:
>>> Hi all,
>>>
>>> It has been a while coming, but we have finally started work on
>>> Kemari's port to KVM. For those not familiar with it, Kemari provides
>>> the basic building block to create a virtualization-based fault
>>> tolerant machine: a virtual machine synchronization mechanism.
>>>
>>> Traditional high availability solutions can be classified into two
>>> groups: fault tolerant servers and software clustering.
>>>
>>> Broadly speaking, fault tolerant servers protect us against hardware
>>> failures; they generally rely on redundant hardware (often
>>> proprietary) and on hardware failure detection to trigger fail-over.
>>>
>>> Software clustering, on the other hand, as its name indicates, takes
>>> care of software failures and usually requires a standby server whose
>>> software configuration, for the part we are trying to make fault
>>> tolerant, must be identical to that of the active server.
>>>
>>> Both solutions may be applied to virtualized environments. Indeed,
>>> the current incarnation of Kemari (Xen-based) brings fault tolerant
>>> server-like capabilities to virtual machines, and integration with
>>> existing HA stacks (Heartbeat, RHCS, etc.) is under consideration.
>>>
>>> After some time on the drawing board we completed the basic design of
>>> Kemari for KVM, so we are sending an RFC at this point to get early
>>> feedback and, hopefully, get things right from the start. Those
>>> already familiar with Kemari and/or fault tolerance may want to skip
>>> the "Background" section and go directly to the design and
>>> implementation bits.
>>>
>>> This is a pretty long write-up, but please bear with me.
>>>
>>> == Background ==
>>>
>>> We started to play around with continuous virtual synchronization
>>> technology about 3 years ago.
>>> As development progressed and, most importantly, we got the first
>>> Xen-based working prototypes, it became clear that we needed a proper
>>> name for our toy: Kemari.
>>>
>>> The goal of Kemari is to provide a fault tolerant platform for
>>> virtualization environments, so that in the event of a hardware
>>> failure the virtual machine fails over from the compromised hardware
>>> to a properly operating physical machine in a way that is completely
>>> transparent to the guest operating system.
>>>
>>> Although hardware-based fault tolerant servers and HA servers
>>> (software clustering) have been around for a (long) while, they
>>> typically require specifically designed hardware and/or modifications
>>> to applications. In contrast, by abstracting the hardware through
>>> virtualization, Kemari can be used on off-the-shelf hardware and no
>>> application modifications are needed.
>>>
>>> After a period of in-house development, the first version of Kemari
>>> for Xen was released in Nov 2008 as open source. However, by then it
>>> was already pretty clear that a KVM port would have several
>>> advantages. First, KVM is integrated into the Linux kernel, which
>>> means one gets support for a wide variety of hardware for
>>> free. Second, and in the same vein, KVM can also benefit from Linux's
>>> low latency networking capabilities, including RDMA, which is of
>>> paramount importance for extremely latency-sensitive functionality
>>> like Kemari. Last, but not least, KVM and its community are growing
>>> rapidly, and there is increasing demand for Kemari-like functionality
>>> for KVM.
>>>
>>> Although the basic design principles will remain the same, our plan
>>> is to write Kemari for KVM from scratch, since there does not seem to
>>> be much opportunity for code sharing between Xen and KVM.
>>>
>>> == Design outline ==
>>>
>>> The basic premise of fault tolerant servers is that when things go
>>> awry with the hardware the running system should transparently
>>> continue execution on an alternate physical host. For this to be
>>> possible the state of the fallback host has to be identical to that
>>> of the primary.
>>>
>>> Kemari runs paired virtual machines in an active-passive
>>> configuration and achieves whole-system replication by continuously
>>> copying the state of the system (dirty pages and the state of the
>>> virtual devices) from the active node to the passive node. An
>>> interesting implication of this is that during normal operation only
>>> the active node is actually executing code.
>>>
>>> Another possible approach is to run a pair of systems in lock-step
>>> (à la VMware FT). Since both the primary and fallback virtual
>>> machines are active, keeping them synchronized is a complex task,
>>> which usually involves carefully injecting external events into both
>>> virtual machines so that they result in identical states.
>>>
>>> The latter approach is extremely architecture specific and not SMP
>>> friendly. This spurred us to try the design that became Kemari, which
>>> we believe lends itself to further optimizations.
>>>
>>> == Implementation ==
>>>
>>> The first step is to encapsulate the machine to be protected within a
>>> virtual machine. Then the live migration functionality is leveraged
>>> to keep the virtual machines synchronized.
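
To make this concrete, here is a minimal C sketch of what one such
synchronization step could look like; all helpers are hypothetical
stand-ins, not actual QEMU/KVM/Kemari code, and the conditions under
which such a step runs are elaborated in the following paragraphs:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical stand-ins for the real QEMU/KVM primitives. */
    static void vm_pause(void)          { /* stop all VCPUs */ }
    static void vm_resume(void)         { /* resume guest execution */ }
    static bool send_dirty_pages(void)  { return true; /* savevm-like */ }
    static bool send_device_state(void) { return true; }
    static bool wait_for_ack(void)      { return true; /* fallback ACK */ }

    /* One synchronization point: freeze the guest, push everything
     * that changed since the last checkpoint to the fallback node,
     * and resume only after the fallback has acknowledged receipt. */
    static int kemari_checkpoint(void)
    {
        vm_pause();
        if (!send_dirty_pages() || !send_device_state() ||
            !wait_for_ack()) {
            fprintf(stderr, "checkpoint failed, triggering failover\n");
            return -1;
        }
        vm_resume();
        return 0;
    }

Note that the guest stays frozen for the whole round trip, which is why
the choice of synchronization points matters so much for latency.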
>>>
>>> Whereas during live migration dirty pages can be sent asynchronously
>>> from the primary to the fallback server until the ratio of dirty
>>> pages is low enough to guarantee very short downtimes, in a fault
>>> tolerance solution the changes to the virtual machine since the
>>> previous synchronization point have to be sent synchronously
>>> whenever a new synchronization point is reached.
>>>
>>> Since the virtual machine has to be stopped until the data reaches
>>> and is acknowledged by the fallback server, the synchronization
>>> model is of critical importance for performance (both in terms of
>>> raw throughput and latencies). The model chosen for Kemari, along
>>> with other implementation details, is described below.
>>>
>>> * Synchronization model
>>>
>>> The synchronization points were carefully chosen to minimize the
>>> amount of traffic that goes over the wire while still keeping the
>>> FT pair consistent at all times. To be precise, Kemari uses events
>>> that modify externally visible state as synchronization points. This
>>> means that all outgoing I/O needs to be trapped and sent to the
>>> fallback host before the primary is resumed, so that it can be
>>> replayed in the face of hardware failure.
>>>
>>> The basic assumption here is that outgoing I/O operations are
>>> idempotent, which is usually true for disk I/O and reliable network
>>> protocols such as TCP (Kemari may trigger hidden bugs in applications
>>> that use UDP or other unreliable protocols, so those may need minor
>>> changes to ensure they work properly after failover).
>>>
>>> The synchronization process can be broken down as follows:
>>>
>>> - Event tapping: On KVM all I/O generates a VMEXIT that is
>>> synchronously handled by the Linux kernel monitor, i.e. KVM (it is
>>> worth noting that this applies to virtio devices too, because they
>>> use MMIO and PIO just like a regular PCI device).
>>>
>>> - VCPU/Guest freezing: This is automatic in the UP case. In SMP
>>> environments we may need to send an IPI to stop the other VCPUs.
>>>
>>> - Notification to qemu: Taking a page from live migration's
>>> playbook, the synchronization process is user-space driven, which
>>> means that qemu needs to be woken up at each synchronization
>>> point. That is already the case for qemu-emulated devices, but we
>>> also have in-kernel emulators. To compound the problem, even for
>>> user-space emulated devices, accesses to coalesced MMIO areas cannot
>>> be detected. As a consequence we need a mechanism to communicate
>>> KVM-handled events to qemu.
>>>
>>> The channel for KVM-qemu communication can easily be built upon
>>> the existing infrastructure. We just need to add a new page to
>>> the kvm_run shared memory area that can be mmapped from user space
>>> and set the exit reason appropriately.
>>>
>>> Regarding in-kernel device emulators, we only need to care about
>>> writes. Specifically, making kvm_io_bus_write() fail when Kemari
>>> is activated and invoking the emulator again after re-entrance
>>> from user space should suffice (this is somewhat similar to what
>>> we do in kvm_arch_vcpu_ioctl_run() for MMIO reads; see the sketch
>>> below).
>>>
>>> To avoid missing synchronization points one should be careful with
>>> coalesced MMIO-like optimizations. In the particular case of
>>> coalesced MMIO, the I/O operation that caused the exit to user
>>> space should act as a write barrier when it was due to an access
>>> to a non-coalesced MMIO area. This means that before proceeding to
>>> handle the exit in kvm_run() we have to make sure that all the
>>> coalesced MMIO has reached the fallback host.
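
The in-kernel write gating mentioned above could look roughly like the
following self-contained C model; names, types, and the retry flag are
hypothetical illustrations, not the actual kvm_io_bus_write() code:

    #include <errno.h>
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t gpa_t;  /* guest physical address (hypothetical) */

    static int emulate_in_kernel_write(gpa_t addr, int len,
                                       const void *val)
    {
        (void)addr; (void)len; (void)val;
        return 0;            /* pretend the device accepted the write */
    }

    /* While Kemari is active, the first attempt at an in-kernel device
     * write is refused, forcing the VCPU out to user space; qemu then
     * reaches the synchronization point and the write is retried. */
    static int io_bus_write(gpa_t addr, int len, const void *val,
                            bool kemari_active, bool retry_after_sync)
    {
        if (kemari_active && !retry_after_sync)
            return -EOPNOTSUPP;  /* bounce to user space first */
        return emulate_in_kernel_write(addr, len, val);
    }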
>>>
>>> - Virtual machine synchronization: All the dirty pages since the
>>> last synchronization point and the state of the virtual devices are
>>> sent to the fallback node from the user-space qemu process. For this
>>> the existing savevm infrastructure and KVM's dirty page tracking
>>
>> I failed to understand whether you take the lock-step approach and
>> sync on every vmexit + make sure the shadow host will inject the irq
>> on the original guest's instruction boundary, or alternatively use
>> continuous live snapshots.
>
> We'll take the live snapshots approach for now.
>
>> If you use live snapshots, why do you need to track mmio, etc.? Is it
>> in order to save the device sync stage in live migration? In order to
>> do that you would have to fully lock-step qemu execution (or send the
>> entire vmstate to the slave). Isn't the device part << the dirty
>> pages part?
>
> We're thinking to capture mmio operations that affect the state of
> devices as synchronization points. The purpose is to lock step qemu
> execution as you mentioned.

The hardest thing in this case will be injecting the virtual irqs into
the guest on the slave host at the exact instruction boundary at which
the original virq was injected on the master. You need to count guest
instructions, use performance monitors and, in the final stages, use
guest breakpoints.

>
> Thanks,
>
> Yoshi
>
>>
>> Thanks,
>> Dor
>>
>>> capabilities can be reused. Regarding in-kernel devices, with the
>>> likely advent of in-kernel virtio backends we need a generic way
>>> to access their state from user space, for which, again, the kvm_run
>>> shared memory area could be used.
>>>
>>> - Virtual machine run: Execution of the virtual machine is resumed
>>> as soon as synchronization finishes.
>>>
>>> * Clock
>>>
>>> Even though we do not need to worry about the clock that provides the
>>> tick (the counter resides in memory, which we keep synchronized), the
>>> same does not apply to counters such as the TSC (we certainly want to
>>> avoid a situation where counters jump back in time right after
>>> fail-over, breaking guarantees such as monotonicity).
>>>
>>> To avoid big hiccups after migration the value of the TSC should be
>>> sent to the fallback node frequently. An access from the guest
>>> (through RDTSC, RDTSCP, RDMSR, or WRMSR) seems like the right moment
>>> to do this. Fortunately, both VMX and SVM provide controls to
>>> intercept accesses to the TSC, so it is just a matter of setting
>>> those appropriately (the "RDTSC exiting" VM-execution control, and
>>> the RDTSC, RDTSCP, RDMSR, and WRMSR instruction intercepts,
>>> respectively). However, since synchronizing the virtual machines
>>> every time the TSC is accessed would be prohibitive, the transmission
>>> of the TSC will be done lazily, which means delaying it until a
>>> non-TSC synchronization point arrives.
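
A rough sketch of that lazy TSC forwarding, again in plain C with
hypothetical helpers rather than actual KVM code:

    #include <stdbool.h>
    #include <stdint.h>

    static uint64_t pending_tsc;
    static bool tsc_dirty;

    /* Called from the RDTSC/RDTSCP/RDMSR/WRMSR intercept path: record
     * the value but do not force a synchronization point of its own. */
    void kemari_note_tsc(uint64_t guest_tsc)
    {
        pending_tsc = guest_tsc;
        tsc_dirty = true;
    }

    /* Called when a regular (non-TSC) synchronization point arrives:
     * piggyback the latest TSC value on that checkpoint. */
    void kemari_flush_tsc(void (*send_to_fallback)(uint64_t))
    {
        if (tsc_dirty) {
            send_to_fallback(pending_tsc);
            tsc_dirty = false;
        }
    }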
>>>
>>> * Failover
>>>
>>> The failover process kicks in whenever a failure in the primary node
>>> is detected. At the time of writing we just ping the virtual machine
>>> periodically to determine whether it is still alive, but in the long
>>> term we have plans to integrate Kemari with the major HA stacks
>>> (Heartbeat, RHCS, etc.).
>>>
>>> Ideally, we would like to leverage the hardware failure detection
>>> capabilities of newish x86 hardware to trigger failover, the idea
>>> being that transferring control to the fallback node proactively
>>> when a problem is detected is much faster than relying on the polling
>>> mechanisms used by most HA software.
>>>
>>> Finally, to restore the virtual machine on the fallback host, the
>>> loadvm infrastructure used for live migration is leveraged.
>>>
>>> * Further information
>>>
>>> Please visit the link below for additional information, including
>>> documentation and, most importantly, source code (for Xen only at the
>>> moment).
>>>
>>> http://www.osrg.net/kemari
>>> ==
>>>
>>>
>>> Any comments and suggestions would be greatly appreciated.
>>>
>>> If this is the right forum and people on the KVM mailing list do not
>>> mind, we would like to use the CC'ed mailing lists for Kemari
>>> development. Having more expert eyes looking at one's code always
>>> helps.
>>>
>>> Thanks,
>>>
>>> Fernando