Message-ID: <4B000528.7070802@redhat.com>
Date: Sun, 15 Nov 2009 15:42:00 +0200
From: Dor Laor
Reply-To: dlaor@redhat.com
Subject: [Qemu-devel] Re: [RFC] KVM Fault Tolerance: Kemari for KVM
References: <4AF79242.20406@oss.ntt.co.jp> <4AFC837D.2060307@redhat.com>
 <4AFD47A0.6040202@lab.ntt.co.jp>
In-Reply-To: <4AFD47A0.6040202@lab.ntt.co.jp>
To: Yoshiaki Tamura
Cc: Andrea Arcangeli, Chris Wright, "大村圭(oomura kei)",
 kvm@vger.kernel.org, Fernando Luis Vázquez Cao, qemu-devel@nongnu.org,
 Takuya Yoshikawa, avi@redhat.com

On 11/13/2009 01:48 PM, Yoshiaki Tamura wrote:
> Hi,
>
> Thanks for your comments!
>
> Dor Laor wrote:
>> On 11/09/2009 05:53 AM, Fernando Luis Vázquez Cao wrote:
>>> Hi all,
>>>
>>> It has been a while coming, but we have finally started work on
>>> Kemari's port to KVM. For those not familiar with it, Kemari provides
>>> the basic building block to create a virtualization-based fault
>>> tolerant machine: a virtual machine synchronization mechanism.
>>>
>>> Traditional high availability solutions can be classified into two
>>> groups: fault tolerant servers and software clustering.
>>>
>>> Broadly speaking, fault tolerant servers protect us against hardware
>>> failures; they generally rely on redundant hardware (often
>>> proprietary) and on hardware failure detection to trigger fail-over.
>>>
>>> Software clustering, on the other hand, as its name indicates, takes
>>> care of software failures and usually requires a standby server whose
>>> software configuration, for the part we are trying to make fault
>>> tolerant, must be identical to that of the active server.
>>>
>>> Both solutions may be applied to virtualized environments. Indeed,
>>> the current incarnation of Kemari (Xen-based) brings fault tolerant
>>> server-like capabilities to virtual machines, and integration with
>>> existing HA stacks (Heartbeat, RHCS, etc.) is under consideration.
>>>
>>> After some time on the drawing board we completed the basic design of
>>> Kemari for KVM, so we are sending an RFC at this point to get early
>>> feedback and, hopefully, get things right from the start. Those
>>> already familiar with Kemari and/or fault tolerance may want to skip
>>> the "Background" section and go directly to the design and
>>> implementation bits.
>>>
>>> This is a pretty long write-up, but please bear with me.
>>>
>>> == Background ==
>>>
>>> We started to play around with continuous virtual synchronization
>>> technology about 3 years ago.
>>> As development progressed and, most importantly, we got the first
>>> Xen-based working prototypes, it became clear that we needed a proper
>>> name for our toy: Kemari.
>>>
>>> The goal of Kemari is to provide a fault tolerant platform for
>>> virtualization environments, so that in the event of a hardware
>>> failure the virtual machine fails over from the compromised hardware
>>> to a properly operating physical machine in a way that is completely
>>> transparent to the guest operating system.
>>>
>>> Although hardware-based fault tolerant servers and HA servers
>>> (software clustering) have been around for a (long) while, they
>>> typically require specifically designed hardware and/or modifications
>>> to applications. In contrast, by abstracting the hardware through
>>> virtualization, Kemari can be used on off-the-shelf hardware and no
>>> application modifications are needed.
>>>
>>> After a period of in-house development, the first version of Kemari
>>> for Xen was released in Nov 2008 as open source. However, by then it
>>> was already pretty clear that a KVM port would have several
>>> advantages. First, KVM is integrated into the Linux kernel, which
>>> means one gets support for a wide variety of hardware for
>>> free. Second, and in the same vein, KVM can also benefit from Linux's
>>> low latency networking capabilities, including RDMA, which is of
>>> paramount importance for extremely latency-sensitive functionality
>>> like Kemari. Last, but not least, KVM and its community are growing
>>> rapidly, and there is increasing demand for Kemari-like functionality
>>> for KVM.
>>>
>>> Although the basic design principles will remain the same, our plan
>>> is to write Kemari for KVM from scratch, since there does not seem to
>>> be much opportunity for code sharing between Xen and KVM.
>>>
>>> == Design outline ==
>>>
>>> The basic premise of fault tolerant servers is that when things go
>>> awry with the hardware the running system should transparently
>>> continue execution on an alternate physical host. For this to be
>>> possible the state of the fallback host has to be identical to that
>>> of the primary.
>>>
>>> Kemari runs paired virtual machines in an active-passive
>>> configuration and achieves whole-system replication by continuously
>>> copying the state of the system (dirty pages and the state of the
>>> virtual devices) from the active node to the passive node. An
>>> interesting implication of this is that during normal operation only
>>> the active node is actually executing code.
>>>
>>> Another possible approach is to run a pair of systems in lock-step
>>> (à la VMware FT). Since both the primary and fallback virtual
>>> machines are active, keeping them synchronized is a complex task,
>>> which usually involves carefully injecting external events into both
>>> virtual machines so that they result in identical states.
>>>
>>> The latter approach is extremely architecture specific and not SMP
>>> friendly. This spurred us to try the design that became Kemari, which
>>> we believe lends itself to further optimizations.
>>>
>>> == Implementation ==
>>>
>>> The first step is to encapsulate the machine to be protected within a
>>> virtual machine. Then the live migration functionality is leveraged
>>> to keep the virtual machines synchronized.
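
To make this concrete, here is a minimal C sketch of what one such
synchronization step could look like; all helpers are hypothetical
stand-ins, not actual QEMU/KVM/Kemari code, and the conditions under
which such a step runs are elaborated in the following paragraphs:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical stand-ins for the real QEMU/KVM primitives. */
    static void vm_pause(void)          { /* stop all VCPUs */ }
    static void vm_resume(void)         { /* resume guest execution */ }
    static bool send_dirty_pages(void)  { return true; /* savevm-like */ }
    static bool send_device_state(void) { return true; }
    static bool wait_for_ack(void)      { return true; /* fallback ACK */ }

    /* One synchronization point: freeze the guest, push everything
     * that changed since the last checkpoint to the fallback node,
     * and resume only after the fallback has acknowledged receipt. */
    static int kemari_checkpoint(void)
    {
        vm_pause();
        if (!send_dirty_pages() || !send_device_state() ||
            !wait_for_ack()) {
            fprintf(stderr, "checkpoint failed, triggering failover\n");
            return -1;
        }
        vm_resume();
        return 0;
    }

Note that the guest stays frozen for the whole round trip, which is why
the choice of synchronization points matters so much for latency.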
>>>
>>> Whereas during live migration dirty pages can be sent asynchronously
>>> from the primary to the fallback server until the ratio of dirty
>>> pages is low enough to guarantee very short downtimes, in a fault
>>> tolerance solution the changes to the virtual machine since the
>>> previous synchronization point have to be sent synchronously
>>> whenever a new synchronization point is reached.
>>>
>>> Since the virtual machine has to be stopped until the data reaches
>>> and is acknowledged by the fallback server, the synchronization
>>> model is of critical importance for performance (both in terms of
>>> raw throughput and latencies). The model chosen for Kemari, along
>>> with other implementation details, is described below.
>>>
>>> * Synchronization model
>>>
>>> The synchronization points were carefully chosen to minimize the
>>> amount of traffic that goes over the wire while still keeping the
>>> FT pair consistent at all times. To be precise, Kemari uses events
>>> that modify externally visible state as synchronization points. This
>>> means that all outgoing I/O needs to be trapped and sent to the
>>> fallback host before the primary is resumed, so that it can be
>>> replayed in the face of hardware failure.
>>>
>>> The basic assumption here is that outgoing I/O operations are
>>> idempotent, which is usually true for disk I/O and reliable network
>>> protocols such as TCP (Kemari may trigger hidden bugs in applications
>>> that use UDP or other unreliable protocols, so those may need minor
>>> changes to ensure they work properly after failover).
>>>
>>> The synchronization process can be broken down as follows:
>>>
>>> - Event tapping: On KVM all I/O generates a VMEXIT that is
>>> synchronously handled by the Linux kernel monitor, i.e. KVM (it is
>>> worth noting that this applies to virtio devices too, because they
>>> use MMIO and PIO just like a regular PCI device).
>>>
>>> - VCPU/Guest freezing: This is automatic in the UP case. In SMP
>>> environments we may need to send an IPI to stop the other VCPUs.
>>>
>>> - Notification to qemu: Taking a page from live migration's
>>> playbook, the synchronization process is user-space driven, which
>>> means that qemu needs to be woken up at each synchronization
>>> point. That is already the case for qemu-emulated devices, but we
>>> also have in-kernel emulators. To compound the problem, even for
>>> user-space emulated devices, accesses to coalesced MMIO areas cannot
>>> be detected. As a consequence we need a mechanism to communicate
>>> KVM-handled events to qemu.
>>>
>>> The channel for KVM-qemu communication can easily be built upon
>>> the existing infrastructure. We just need to add a new page to
>>> the kvm_run shared memory area that can be mmapped from user space
>>> and set the exit reason appropriately.
>>>
>>> Regarding in-kernel device emulators, we only need to care about
>>> writes. Specifically, making kvm_io_bus_write() fail when Kemari
>>> is activated and invoking the emulator again after re-entrance
>>> from user space should suffice (this is somewhat similar to what
>>> we do in kvm_arch_vcpu_ioctl_run() for MMIO reads; see the sketch
>>> below).
>>>
>>> To avoid missing synchronization points one should be careful with
>>> coalesced MMIO-like optimizations. In the particular case of
>>> coalesced MMIO, the I/O operation that caused the exit to user
>>> space should act as a write barrier when it was due to an access
>>> to a non-coalesced MMIO area. This means that before proceeding to
>>> handle the exit in kvm_run() we have to make sure that all the
>>> coalesced MMIO has reached the fallback host.
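
The in-kernel write gating mentioned above could look roughly like the
following self-contained C model; names, types, and the retry flag are
hypothetical illustrations, not the actual kvm_io_bus_write() code:

    #include <errno.h>
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t gpa_t;  /* guest physical address (hypothetical) */

    static int emulate_in_kernel_write(gpa_t addr, int len,
                                       const void *val)
    {
        (void)addr; (void)len; (void)val;
        return 0;            /* pretend the device accepted the write */
    }

    /* While Kemari is active, the first attempt at an in-kernel device
     * write is refused, forcing the VCPU out to user space; qemu then
     * reaches the synchronization point and the write is retried. */
    static int io_bus_write(gpa_t addr, int len, const void *val,
                            bool kemari_active, bool retry_after_sync)
    {
        if (kemari_active && !retry_after_sync)
            return -EOPNOTSUPP;  /* bounce to user space first */
        return emulate_in_kernel_write(addr, len, val);
    }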
>>>
>>> - Virtual machine synchronization: All the dirty pages since the
>>> last synchronization point and the state of the virtual devices are
>>> sent to the fallback node from the user-space qemu process. For this
>>> the existing savevm infrastructure and KVM's dirty page tracking
>>
>> I failed to understand whether you take the lock-step approach and
>> sync on every vmexit + make sure the shadow host will inject the irq
>> on the original guest's instruction boundary, or alternatively use
>> continuous live snapshots.
>
> We'll take the live snapshots approach for now.
>
>> If you use live snapshots, why do you need to track mmio, etc.? Is it
>> in order to save the device sync stage in live migration? In order to
>> do that you would have to fully lock-step qemu execution (or send the
>> entire vmstate to the slave). Isn't the device part << the dirty
>> pages part?
>
> We're thinking to capture mmio operations that affect the state of
> devices as synchronization points. The purpose is to lock step qemu
> execution as you mentioned.

The hardest thing in this case will be injecting the virtual irqs into
the guest on the slave host at the exact instruction boundary at which
the original virq was injected on the master. You need to count guest
instructions, use performance monitors and, in the final stages, use
guest breakpoints.

>
> Thanks,
>
> Yoshi
>
>>
>> Thanks,
>> Dor
>>
>>> capabilities can be reused. Regarding in-kernel devices, with the
>>> likely advent of in-kernel virtio backends we need a generic way
>>> to access their state from user space, for which, again, the kvm_run
>>> shared memory area could be used.
>>>
>>> - Virtual machine run: Execution of the virtual machine is resumed
>>> as soon as synchronization finishes.
>>>
>>> * Clock
>>>
>>> Even though we do not need to worry about the clock that provides the
>>> tick (the counter resides in memory, which we keep synchronized), the
>>> same does not apply to counters such as the TSC (we certainly want to
>>> avoid a situation where counters jump back in time right after
>>> fail-over, breaking guarantees such as monotonicity).
>>>
>>> To avoid big hiccups after migration the value of the TSC should be
>>> sent to the fallback node frequently. An access from the guest
>>> (through RDTSC, RDTSCP, RDMSR, or WRMSR) seems like the right moment
>>> to do this. Fortunately, both VMX and SVM provide controls to
>>> intercept accesses to the TSC, so it is just a matter of setting
>>> those appropriately (the "RDTSC exiting" VM-execution control, and
>>> the RDTSC, RDTSCP, RDMSR, and WRMSR instruction intercepts,
>>> respectively). However, since synchronizing the virtual machines
>>> every time the TSC is accessed would be prohibitive, the transmission
>>> of the TSC will be done lazily, which means delaying it until a
>>> non-TSC synchronization point arrives.
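
A rough sketch of that lazy TSC forwarding, again in plain C with
hypothetical helpers rather than actual KVM code:

    #include <stdbool.h>
    #include <stdint.h>

    static uint64_t pending_tsc;
    static bool tsc_dirty;

    /* Called from the RDTSC/RDTSCP/RDMSR/WRMSR intercept path: record
     * the value but do not force a synchronization point of its own. */
    void kemari_note_tsc(uint64_t guest_tsc)
    {
        pending_tsc = guest_tsc;
        tsc_dirty = true;
    }

    /* Called when a regular (non-TSC) synchronization point arrives:
     * piggyback the latest TSC value on that checkpoint. */
    void kemari_flush_tsc(void (*send_to_fallback)(uint64_t))
    {
        if (tsc_dirty) {
            send_to_fallback(pending_tsc);
            tsc_dirty = false;
        }
    }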
>>>
>>> * Failover
>>>
>>> The failover process kicks in whenever a failure in the primary node
>>> is detected. At the time of writing we just ping the virtual machine
>>> periodically to determine whether it is still alive, but in the long
>>> term we have plans to integrate Kemari with the major HA stacks
>>> (Heartbeat, RHCS, etc.).
>>>
>>> Ideally, we would like to leverage the hardware failure detection
>>> capabilities of newish x86 hardware to trigger failover, the idea
>>> being that transferring control to the fallback node proactively
>>> when a problem is detected is much faster than relying on the polling
>>> mechanisms used by most HA software.
>>>
>>> Finally, to restore the virtual machine on the fallback host, the
>>> loadvm infrastructure used for live migration is leveraged.
>>>
>>> * Further information
>>>
>>> Please visit the link below for additional information, including
>>> documentation and, most importantly, source code (for Xen only at the
>>> moment).
>>>
>>> http://www.osrg.net/kemari
>>> ==
>>>
>>>
>>> Any comments and suggestions would be greatly appreciated.
>>>
>>> If this is the right forum and people on the KVM mailing list do not
>>> mind, we would like to use the CC'ed mailing lists for Kemari
>>> development. Having more expert eyes looking at one's code always
>>> helps.
>>>
>>> Thanks,
>>>
>>> Fernando