From mboxrd@z Thu Jan 1 00:00:00 1970
From: Wen Congyang
Subject: [RFC Patch 00/25] COarse-grain LOck-stepping Virtual Machines for Non-stop Service
Date: Fri, 18 Jul 2014 19:38:45 +0800
Message-ID: <1405683551-12579-1-git-send-email-wency@cn.fujitsu.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: xen devel
Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong, Dong Eddie, Yang Hongyang, Lai Jiangshan
List-Id: xen-devel@lists.xenproject.org

Virtual machine (VM) replication is a well-known technique for providing
application-agnostic, software-implemented hardware fault tolerance
("non-stop service"). Currently, remus provides this function, but it
buffers all output packets, and the resulting latency is unacceptable.

At Xen Summit 2012, we introduced a new VM replication solution: colo
(COarse-grain LOck-stepping virtual machine). The presentation is
available at the following URL:
http://www.slideshare.net/xen_com_mgr/colo-coarsegrain-lockstepping-virtual-machines-for-nonstop-service

Here is a summary of the solution:
From the client's point of view, as long as the client observes identical
responses from the primary and secondary VMs, according to the service
semantics, then the secondary VM is a valid replica of the primary VM,
and can successfully take over when a hardware failure of the primary VM
is detected.

This patchset is an RFC and implements the framework of colo:
1. Both the primary VM and the secondary VM are running
2. do checkpoint

This patchset is based on remus-v15 and uses migration v1. Only HVM
guests are supported for now.

TODO list:
1. rebase to remus-v17 or newer
2. support migration v2
3. nic/disk replication
4. support pvm

Patch 1-3: bugfix
Patch 4-6: temporarily update remus to reuse remus device codes
Patch 7-14: update some APIs which will be used by colo
Patch 15-22: colo related codes
Patch 23: Hack patch, just for test
Patch 24-25: bugfix. We found this bug before rebasing colo to the newest
             xen, but we cannot trigger it now.
Patch 26: A patch for qemu-xen

Hong Tao (1):
  copy the correct page to memory

Wen Congyang (24):
  csum the correct page
  don't zero out ioreq page
  don't touch remus in remus_device
  rename remus device to checkpoint device
  adjust the indentation
  Refactor domain_suspend_callback_common()
  Update libxl__domain_resume() for colo
  Update libxl__domain_suspend_common_switch_qemu_logdirty() for colo
  Introduce a new internal API libxl__domain_unpause()
  Update libxl__domain_unpause() to support qemu-xen
  support to resume uncooperative HVM guests
  update datecopier to support sending data only
  introduce a new API to aync read data from fd
  Update libxl_save_msgs_gen.pl to support return data from xl to xc
  Allow slave sends data to master
  secondary vm suspend/resume/checkpoint code
  primary vm suspend/get_dirty_pfn/resume/checkpoint code
  xc_domain_save: flush cache before calling callbacks->postcopy() in colo mode
  COLO: xc related codes
  send store mfn and console mfn to xl before resuming secondary vm
  implement the cmdline for COLO
  HACK: do checkpoint per 20ms
  fix vm entry fail
  sync mmu before resuming secondary vm

 docs/man/xl.pod.1                                  |   9 +-
 tools/libxc/xc_domain.c                            |   9 +
 tools/libxc/xc_domain_restore.c                    |  74 +-
 tools/libxc/xc_domain_save.c                       |  66 +-
 tools/libxc/xc_resume.c                            |  20 +-
 tools/libxc/xenctrl.h                              |   2 +
 tools/libxc/xenguest.h                             |  40 +
 tools/libxl/Makefile                               |   3 +-
 tools/libxl/libxl.c                                | 102 ++-
 tools/libxl/libxl.h                                |   3 +-
 tools/libxl/libxl_aoutils.c                        |  81 +-
 ...xl_remus_device.c => libxl_checkpoint_device.c} | 266 ++++---
 tools/libxl/libxl_colo.h                           |  48 ++
 tools/libxl/libxl_colo_restore.c                   | 882 +++++++++++++++++++++
 tools/libxl/libxl_colo_save.c                      | 602 ++++++++++++++
 tools/libxl/libxl_create.c                         | 131 ++-
 tools/libxl/libxl_dom.c                            | 424 ++++++----
 tools/libxl/libxl_internal.h                       | 262 ++++--
 tools/libxl/libxl_netbuffer.c                      |  85 +-
 tools/libxl/libxl_nonetbuffer.c                    |  14 +-
 tools/libxl/libxl_qmp.c                            |  10 +
 tools/libxl/libxl_remus_disk_drbd.c                |  54 +-
 tools/libxl/libxl_save_callout.c                   |  37 +-
 tools/libxl/libxl_save_helper.c                    |  17 +
 tools/libxl/libxl_save_msgs_gen.pl                 |  74 +-
 tools/libxl/libxl_types.idl                        |  12 +-
 tools/libxl/xl_cmdimpl.c                           |  54 +-
 tools/libxl/xl_cmdtable.c                          |   3 +-
 xen/arch/x86/domctl.c                              |  15 +
 xen/arch/x86/hvm/save.c                            |   6 +
 xen/arch/x86/hvm/vmx/vmcs.c                        |   8 +
 xen/arch/x86/hvm/vmx/vmx.c                         |   8 +
 xen/include/asm-x86/hvm/hvm.h                      |   1 +
 xen/include/asm-x86/hvm/vmx/vmcs.h                 |   1 +
 xen/include/public/domctl.h                        |   1 +
 xen/include/xen/hvm/save.h                         |   2 +
 36 files changed, 2895 insertions(+), 531 deletions(-)
 rename tools/libxl/{libxl_remus_device.c => libxl_checkpoint_device.c} (47%)
 create mode 100644 tools/libxl/libxl_colo.h
 create mode 100644 tools/libxl/libxl_colo_restore.c
 create mode 100644 tools/libxl/libxl_colo_save.c

-- 
1.9.3