* [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
@ 2010-04-21  5:57 Yoshiaki Tamura
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 01/20] Modify DIRTY_FLAG value and introduce DIRTY_IDX to use as indexes of bit-based phys_ram_dirty Yoshiaki Tamura
                   ` (22 more replies)
  0 siblings, 23 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-21  5:57 UTC (permalink / raw)
  To: kvm, qemu-devel
  Cc: aliguori, ohmura.kei, mtosatti, Yoshiaki Tamura, yoshikawa.takuya,
	avi
Hi all,
We have been implementing the prototype of Kemari for KVM, and we're sending
this message to share what we have now and TODO lists.  Hopefully, we would like
to get early feedback to keep us in the right direction.  Although advanced
approaches in the TODO lists are fascinating, we would like to run this project
step by step while absorbing comments from the community.  The current code is
based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
For those who are new to Kemari for KVM, please take a look at the
following RFC which we posted last year.
http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
The transmission/transaction protocol, and most of the control logic is
implemented in QEMU.  However, we needed a hack in KVM to prevent rip from
proceeding before synchronizing VMs.  It may also need some plumbing in the
kernel side to guarantee replayability of certain events and instructions,
integrate the RAS capabilities of newer x86 hardware with the HA stack, as well
as for optimization purposes, for example. 
Before going into details, we would like to show how Kemari looks.  We prepared
a demonstration video at the following location.  For those who are not
interested in the code, please take a look.  
The demonstration scenario is,
1. Play with a guest VM that has virtio-blk and virtio-net.
# The guest image should be a NFS/SAN.
2. Start Kemari to synchronize the VM by running the following command in QEMU.
Just add "-k" option to usual migrate command.
migrate -d -k tcp:192.168.0.20:4444
3. Check the status by calling info migrate.
4. Go back to the VM to play chess animation.
5. Kill the the VM. (VNC client also disappears)
6. Press "c" to continue the VM on the other host.
7. Bring up the VNC client (Sorry, it pops outside of video capture.)
8. Confirm that the chess animation ends, browser works fine, then shutdown.
http://www.osrg.net/kemari/download/kemari-kvm-fc11.mov
The repository contains all patches we're sending with this message.  For those
who want to try, pull the following repository.  At running configure, please
put --enable-ft-mode.  Also you need to apply a patch attached at the end of
this message to your KVM.
git://kemari.git.sourceforge.net/gitroot/kemari/kemari
In addition to usual migrate environment and command, add "-k" to run.
The patch set consists of following components.
- bit-based dirty bitmap. (I have posted v4 for upstream QEMU on April 2o)
- writev() support to QEMUFile and FdMigrationState.
- FT transaction sender/receiver
- event tap that triggers FT transaction.
- virtio-blk, virtio-net support for event tap.
 Makefile.objs    |    1 +
 buffered_file.c  |    2 +-
 configure        |    8 +
 cpu-all.h        |  134 ++++++++++++++++-
 cutils.c         |   12 ++
 exec.c           |  127 +++++++++++++----
 ft_transaction.c |  423 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 ft_transaction.h |   57 ++++++++
 hw/hw.h          |   25 ++++
 hw/virtio-blk.c  |    2 +
 hw/virtio-net.c  |    2 +
 migration-exec.c |    2 +-
 migration-fd.c   |    2 +-
 migration-tcp.c  |   58 +++++++-
 migration-unix.c |    2 +-
 migration.c      |  146 ++++++++++++++++++-
 migration.h      |    8 +
 osdep.c          |   13 ++
 qemu-char.c      |   25 +++-
 qemu-common.h    |   21 +++
 qemu-kvm.c       |   26 ++--
 qemu-monitor.hx  |    7 +-
 qemu_socket.h    |    4 +
 savevm.c         |  264 ++++++++++++++++++++++++++++++----
 sysemu.h         |    3 +-
 vl.c             |  221 +++++++++++++++++++++++++---
 26 files changed, 1474 insertions(+), 121 deletions(-)
 create mode 100644 ft_transaction.c
 create mode 100644 ft_transaction.h
The rest of this message describes TODO lists grouped by each topic.
=== event tapping === 
Event tapping is the core component of Kemari, and it decides on which event the
primary should synchronize with the secondary.  The basic assumption here is
that outgoing I/O operations are idempotent, which is usually true for disk I/O
and reliable network protocols such as TCP.
As discussed in the following thread, we may need to reconsider how and when to
start VM synchronization.
http://www.mail-archive.com/kvm@vger.kernel.org/msg31908.html
We would like get as much feedbacks on current implementation before
thinking/going into the next approach.
TODO:
 - virtio polling
 - support for asynchronous I/O methods (eventfd)
=== sender / receiver ===
To synchronize virtual machines, all the dirty pages since the last
synchronization point and the state of the VCPU the virtual devices is sent to
the fallback node from the user-space QEMU process.
TODO:
 - Asynchronous VM transfer / pipelining (needed for SMP)
 - Zero copy VM transfer
 - VM transfer w/ RDMA
=== storage ===
Although Kemari needs some kind of shared storage, many users don't like it and
they expect to use Kemari in conjunction with software storage replication.
TODO:
 - Integration with other non-shared disk cluster storage solutions
   such as DRBD (might need changes to guarantee storage data
   consistency at Kemari synchronization points).
 - Integration with QEMU's block live migration functionality for
   non-share disk configurations.
=== integration with HA stack (Pacemaker/Corosync) ===
Failover process kicks in whenever a failure in the primary node is detected.
For Kemari for Xen, we already have finished RA for Heartbeat, and planning to
integrate Kemari for KVM with the new HA stacks (Pacemaker, RHCS, etc).
Ideally, we would like to leverage the hardware failure detection
capabilities of newish x86 hardware to trigger failover, the idea
being that transferring control to the fallback node proactively
when a problem is detected is much faster than relying on the polling
mechanisms used by most HA software.
TODO:
 - RA for Pacemaker.
 - Consider both HW failure and SW failure scenarios (failover
   between Kemari clusters).
 - Make the necessary changes to Pacemaker/Corosync to support
   event(HW failure, etc)-driven failover.
 - Take advantage of the RAS capabilities of newer CPUs/motherboards
   such as MCE to trigger failover.
 - Detect failures in I/O devices (block I/O errors, etc).
=== clock ===
Since synchronizing the virtual machines every time the TSC is accessed would be
prohibitive, the transmission of the TSC will be done lazily, which means
delaying it until there is a non-TSC synchronization point arrives.
TODO:
 - Synchronization of clock sources (need to intercept TSC reads, etc).
=== usability ===
These are items that defines how users interact with Kemari.
TODO:
 - Kemarid daemon that takes care of the cluster management/monitoring
   side of things.
 - Some device emulators might need minor modifications to work well
   with Kemari.  Use white(black)-listing to take the burden of
   choosing the right device model off the users.
=== optimizations ===
Although the big picture can be realized by completing the TODO list above, we
need some optimizations/enhancements to make Kemari useful in real world, and
these are items what needs to be done for that.
TODO:
 - SMP (for the sake of performance might need to implement a
   synchronization protocol that can maintain two or more
   synchronization points active at any given moment)
 - VGA (leverage VNC's subtilting mechanism to identify fb pages that
   are really dirty).
 
Any comments/suggestions would be greatly appreciated.
Thanks,
Yoshi
--
Kemari starts synchronizing VMs when QEMU handles I/O requests.
Without this patch VCPU state is already proceeded before
synchronization, and after failover to the VM on the receiver, it
hangs because of this.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
 arch/x86/include/asm/kvm_host.h |    1 +
 arch/x86/kvm/svm.c              |   11 ++++++++---
 arch/x86/kvm/vmx.c              |   11 ++++++++---
 arch/x86/kvm/x86.c              |    4 ++++
 4 files changed, 21 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 26c629a..7b8f514 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -227,6 +227,7 @@ struct kvm_pio_request {
 	int in;
 	int port;
 	int size;
+	bool lazy_skip;
 };
 
 /*
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index d04c7ad..e373245 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1495,7 +1495,7 @@ static int io_interception(struct vcpu_svm *svm)
 {
 	struct kvm_vcpu *vcpu = &svm->vcpu;
 	u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
-	int size, in, string;
+	int size, in, string, ret;
 	unsigned port;
 
 	++svm->vcpu.stat.io_exits;
@@ -1507,9 +1507,14 @@ static int io_interception(struct vcpu_svm *svm)
 	port = io_info >> 16;
 	size = (io_info & SVM_IOIO_SIZE_MASK) >> SVM_IOIO_SIZE_SHIFT;
 	svm->next_rip = svm->vmcb->control.exit_info_2;
-	skip_emulated_instruction(&svm->vcpu);
 
-	return kvm_fast_pio_out(vcpu, size, port);
+	ret = kvm_fast_pio_out(vcpu, size, port);
+	if (ret)
+		skip_emulated_instruction(&svm->vcpu);
+	else
+		vcpu->arch.pio.lazy_skip = true;
+
+	return ret;
 }
 
 static int nmi_interception(struct vcpu_svm *svm)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 41e63bb..09052d6 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2975,7 +2975,7 @@ static int handle_triple_fault(struct kvm_vcpu *vcpu)
 static int handle_io(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification;
-	int size, in, string;
+	int size, in, string, ret;
 	unsigned port;
 
 	exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
@@ -2989,9 +2989,14 @@ static int handle_io(struct kvm_vcpu *vcpu)
 
 	port = exit_qualification >> 16;
 	size = (exit_qualification & 7) + 1;
-	skip_emulated_instruction(vcpu);
 
-	return kvm_fast_pio_out(vcpu, size, port);
+	ret = kvm_fast_pio_out(vcpu, size, port);
+	if (ret)
+		skip_emulated_instruction(vcpu);
+	else
+		vcpu->arch.pio.lazy_skip = true;
+
+	return ret;
 }
 
 static void
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fd5c3d3..cc308d2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4544,6 +4544,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	if (!irqchip_in_kernel(vcpu->kvm))
 		kvm_set_cr8(vcpu, kvm_run->cr8);
 
+	if (vcpu->arch.pio.lazy_skip)
+		kvm_x86_ops->skip_emulated_instruction(vcpu);
+	vcpu->arch.pio.lazy_skip = false;
+
 	if (vcpu->arch.pio.count || vcpu->mmio_needed ||
 	    vcpu->arch.emulate_ctxt.restart) {
 		if (vcpu->mmio_needed) {
-- 
1.7.0.31.g1df487
^ permalink raw reply related	[flat|nested] 74+ messages in thread
* [Qemu-devel] [RFC PATCH 01/20] Modify DIRTY_FLAG value and introduce DIRTY_IDX to use as indexes of bit-based phys_ram_dirty.
  2010-04-21  5:57 [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Yoshiaki Tamura
@ 2010-04-21  5:57 ` Yoshiaki Tamura
  2010-04-22 19:26   ` [Qemu-devel] " Anthony Liguori
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 02/20] Introduce cpu_physical_memory_get_dirty_range() Yoshiaki Tamura
                   ` (21 subsequent siblings)
  22 siblings, 1 reply; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-21  5:57 UTC (permalink / raw)
  To: kvm, qemu-devel
  Cc: aliguori, ohmura.kei, mtosatti, Yoshiaki Tamura, yoshikawa.takuya,
	avi
Replaces byte-based phys_ram_dirty bitmap with four (MASTER, VGA,
CODE, MIGRATION) bit-based phys_ram_dirty bitmap.  On allocation, it
sets all bits in the bitmap.  It uses ffs() to convert DIRTY_FLAG to
DIRTY_IDX.
Modifies wrapper functions for byte-based phys_ram_dirty bitmap to
bit-based phys_ram_dirty bitmap.  MASTER works as a buffer, and upon
get_diry() or get_dirty_flags(), it calls
cpu_physical_memory_sync_master() to update VGA and MIGRATION.
Replaces direct phys_ram_dirty access with wrapper functions to
prevent direct access to the phys_ram_dirty bitmap.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
---
 cpu-all.h |  130 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 exec.c    |   60 ++++++++++++++--------------
 2 files changed, 152 insertions(+), 38 deletions(-)
diff --git a/cpu-all.h b/cpu-all.h
index 51effc0..3f8762d 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -37,6 +37,9 @@
 
 #include "softfloat.h"
 
+/* to use ffs in flag_to_idx() */
+#include <strings.h>
+
 #if defined(HOST_WORDS_BIGENDIAN) != defined(TARGET_WORDS_BIGENDIAN)
 #define BSWAP_NEEDED
 #endif
@@ -846,7 +849,6 @@ int cpu_str_to_log_mask(const char *str);
 /* memory API */
 
 extern int phys_ram_fd;
-extern uint8_t *phys_ram_dirty;
 extern ram_addr_t ram_size;
 extern ram_addr_t last_ram_offset;
 extern uint8_t *bios_mem;
@@ -869,28 +871,140 @@ extern uint8_t *bios_mem;
 /* Set if TLB entry is an IO callback.  */
 #define TLB_MMIO        (1 << 5)
 
+/* Use DIRTY_IDX as indexes of bit-based phys_ram_dirty. */
+#define MASTER_DIRTY_IDX    0
+#define VGA_DIRTY_IDX       1
+#define CODE_DIRTY_IDX      2
+#define MIGRATION_DIRTY_IDX 3
+#define NUM_DIRTY_IDX       4
+
+#define MASTER_DIRTY_FLAG    (1 << MASTER_DIRTY_IDX)
+#define VGA_DIRTY_FLAG       (1 << VGA_DIRTY_IDX)
+#define CODE_DIRTY_FLAG      (1 << CODE_DIRTY_IDX)
+#define MIGRATION_DIRTY_FLAG (1 << MIGRATION_DIRTY_IDX)
+
+extern unsigned long *phys_ram_dirty[NUM_DIRTY_IDX];
+
+static inline int dirty_flag_to_idx(int flag)
+{
+    return ffs(flag) - 1;
+}
+
+static inline int dirty_idx_to_flag(int idx)
+{
+    return 1 << idx;
+}
+
 int cpu_memory_rw_debug(CPUState *env, target_ulong addr,
                         uint8_t *buf, int len, int is_write);
 
-#define VGA_DIRTY_FLAG       0x01
-#define CODE_DIRTY_FLAG      0x02
-#define MIGRATION_DIRTY_FLAG 0x08
-
 /* read dirty bit (return 0 or 1) */
 static inline int cpu_physical_memory_is_dirty(ram_addr_t addr)
 {
-    return phys_ram_dirty[addr >> TARGET_PAGE_BITS] == 0xff;
+    unsigned long mask;
+    ram_addr_t index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+    int offset = (addr >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+ 
+    mask = 1UL << offset;
+    return (phys_ram_dirty[MASTER_DIRTY_IDX][index] & mask) == mask;
+}
+
+static inline void cpu_physical_memory_sync_master(ram_addr_t index)
+{
+    if (phys_ram_dirty[MASTER_DIRTY_IDX][index]) {
+        phys_ram_dirty[VGA_DIRTY_IDX][index]
+            |=  phys_ram_dirty[MASTER_DIRTY_IDX][index];
+        phys_ram_dirty[MIGRATION_DIRTY_IDX][index]
+            |=  phys_ram_dirty[MASTER_DIRTY_IDX][index];
+        phys_ram_dirty[MASTER_DIRTY_IDX][index] = 0UL;
+    }
+}
+
+static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t addr)
+{
+    unsigned long mask;
+    ram_addr_t index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+    int offset = (addr >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+    int ret = 0, i;
+ 
+    mask = 1UL << offset;
+    cpu_physical_memory_sync_master(index);
+
+    for (i = VGA_DIRTY_IDX; i <= MIGRATION_DIRTY_IDX; i++) {
+        if (phys_ram_dirty[i][index] & mask) {
+            ret |= dirty_idx_to_flag(i);
+        }
+    }
+ 
+    return ret;
+}
+
+static inline int cpu_physical_memory_get_dirty_idx(ram_addr_t addr,
+                                                    int dirty_idx)
+{
+    unsigned long mask;
+    ram_addr_t index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+    int offset = (addr >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+
+    mask = 1UL << offset;
+    cpu_physical_memory_sync_master(index);
+    return (phys_ram_dirty[dirty_idx][index] & mask) == mask;
 }
 
 static inline int cpu_physical_memory_get_dirty(ram_addr_t addr,
                                                 int dirty_flags)
 {
-    return phys_ram_dirty[addr >> TARGET_PAGE_BITS] & dirty_flags;
+    return cpu_physical_memory_get_dirty_idx(addr,
+                                             dirty_flag_to_idx(dirty_flags));
 }
 
 static inline void cpu_physical_memory_set_dirty(ram_addr_t addr)
 {
-    phys_ram_dirty[addr >> TARGET_PAGE_BITS] = 0xff;
+    unsigned long mask;
+    ram_addr_t index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+    int offset = (addr >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+
+    mask = 1UL << offset;
+    phys_ram_dirty[MASTER_DIRTY_IDX][index] |= mask;
+}
+
+static inline void cpu_physical_memory_set_dirty_range(ram_addr_t addr,
+                                                       unsigned long mask)
+{
+    ram_addr_t index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+
+    phys_ram_dirty[MASTER_DIRTY_IDX][index] |= mask;
+}
+
+static inline void cpu_physical_memory_set_dirty_flags(ram_addr_t addr,
+                                                       int dirty_flags)
+{
+    unsigned long mask;
+    ram_addr_t index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+    int offset = (addr >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+
+    mask = 1UL << offset;
+    phys_ram_dirty[MASTER_DIRTY_IDX][index] |= mask;
+
+    if (dirty_flags & CODE_DIRTY_FLAG) {
+        phys_ram_dirty[CODE_DIRTY_IDX][index] |= mask;
+    }
+}
+
+static inline void cpu_physical_memory_mask_dirty_range(ram_addr_t start,
+                                                        unsigned long length,
+                                                        int dirty_flags)
+{
+    ram_addr_t addr = start, index;
+    unsigned long mask;
+    int offset, i;
+
+    for (i = 0;  i < length; i += TARGET_PAGE_SIZE) {
+        index = ((addr + i) >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+        offset = ((addr + i) >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+        mask = ~(1UL << offset);
+        phys_ram_dirty[dirty_flag_to_idx(dirty_flags)][index] &= mask;
+    }
 }
 
 void cpu_physical_memory_reset_dirty(ram_addr_t start, ram_addr_t end,
diff --git a/exec.c b/exec.c
index b647512..bf8d703 100644
--- a/exec.c
+++ b/exec.c
@@ -119,7 +119,7 @@ uint8_t *code_gen_ptr;
 
 #if !defined(CONFIG_USER_ONLY)
 int phys_ram_fd;
-uint8_t *phys_ram_dirty;
+unsigned long *phys_ram_dirty[NUM_DIRTY_IDX];
 uint8_t *bios_mem;
 static int in_migration;
 
@@ -1947,7 +1947,7 @@ static void tlb_protect_code(ram_addr_t ram_addr)
 static void tlb_unprotect_code_phys(CPUState *env, ram_addr_t ram_addr,
                                     target_ulong vaddr)
 {
-    phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS] |= CODE_DIRTY_FLAG;
+    cpu_physical_memory_set_dirty_flags(ram_addr, CODE_DIRTY_FLAG);
 }
 
 static inline void tlb_reset_dirty_range(CPUTLBEntry *tlb_entry,
@@ -1968,8 +1968,7 @@ void cpu_physical_memory_reset_dirty(ram_addr_t start, ram_addr_t end,
 {
     CPUState *env;
     unsigned long length, start1;
-    int i, mask, len;
-    uint8_t *p;
+    int i;
 
     start &= TARGET_PAGE_MASK;
     end = TARGET_PAGE_ALIGN(end);
@@ -1977,11 +1976,7 @@ void cpu_physical_memory_reset_dirty(ram_addr_t start, ram_addr_t end,
     length = end - start;
     if (length == 0)
         return;
-    len = length >> TARGET_PAGE_BITS;
-    mask = ~dirty_flags;
-    p = phys_ram_dirty + (start >> TARGET_PAGE_BITS);
-    for(i = 0; i < len; i++)
-        p[i] &= mask;
+    cpu_physical_memory_mask_dirty_range(start, length, dirty_flags);    
 
     /* we modify the TLB cache so that the dirty bit will be set again
        when accessing the range */
@@ -2643,6 +2638,7 @@ extern const char *mem_path;
 ram_addr_t qemu_ram_alloc(ram_addr_t size)
 {
     RAMBlock *new_block;
+    int i;
 
     size = TARGET_PAGE_ALIGN(size);
     new_block = qemu_malloc(sizeof(*new_block));
@@ -2667,10 +2663,14 @@ ram_addr_t qemu_ram_alloc(ram_addr_t size)
     new_block->next = ram_blocks;
     ram_blocks = new_block;
 
-    phys_ram_dirty = qemu_realloc(phys_ram_dirty,
-        (last_ram_offset + size) >> TARGET_PAGE_BITS);
-    memset(phys_ram_dirty + (last_ram_offset >> TARGET_PAGE_BITS),
-           0xff, size >> TARGET_PAGE_BITS);
+    for (i = MASTER_DIRTY_IDX; i < NUM_DIRTY_IDX; i++) {
+        phys_ram_dirty[i] 
+            = qemu_realloc(phys_ram_dirty[i],
+                           BITMAP_SIZE(last_ram_offset + size));
+        memset((uint8_t *)phys_ram_dirty[i] + BITMAP_SIZE(last_ram_offset),
+               0xff, BITMAP_SIZE(last_ram_offset + size)
+               - BITMAP_SIZE(last_ram_offset));
+    }
 
     last_ram_offset += size;
 
@@ -2833,16 +2833,16 @@ static void notdirty_mem_writeb(void *opaque, target_phys_addr_t ram_addr,
                                 uint32_t val)
 {
     int dirty_flags;
-    dirty_flags = phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS];
+    dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
     if (!(dirty_flags & CODE_DIRTY_FLAG)) {
 #if !defined(CONFIG_USER_ONLY)
         tb_invalidate_phys_page_fast(ram_addr, 1);
-        dirty_flags = phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS];
+        dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
 #endif
     }
     stb_p(qemu_get_ram_ptr(ram_addr), val);
     dirty_flags |= (0xff & ~CODE_DIRTY_FLAG);
-    phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS] = dirty_flags;
+    cpu_physical_memory_set_dirty_flags(ram_addr, dirty_flags);
     /* we remove the notdirty callback only if the code has been
        flushed */
     if (dirty_flags == 0xff)
@@ -2853,16 +2853,16 @@ static void notdirty_mem_writew(void *opaque, target_phys_addr_t ram_addr,
                                 uint32_t val)
 {
     int dirty_flags;
-    dirty_flags = phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS];
+    dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
     if (!(dirty_flags & CODE_DIRTY_FLAG)) {
 #if !defined(CONFIG_USER_ONLY)
         tb_invalidate_phys_page_fast(ram_addr, 2);
-        dirty_flags = phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS];
+        dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
 #endif
     }
     stw_p(qemu_get_ram_ptr(ram_addr), val);
     dirty_flags |= (0xff & ~CODE_DIRTY_FLAG);
-    phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS] = dirty_flags;
+    cpu_physical_memory_set_dirty_flags(ram_addr, dirty_flags);
     /* we remove the notdirty callback only if the code has been
        flushed */
     if (dirty_flags == 0xff)
@@ -2873,16 +2873,16 @@ static void notdirty_mem_writel(void *opaque, target_phys_addr_t ram_addr,
                                 uint32_t val)
 {
     int dirty_flags;
-    dirty_flags = phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS];
+    dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
     if (!(dirty_flags & CODE_DIRTY_FLAG)) {
 #if !defined(CONFIG_USER_ONLY)
         tb_invalidate_phys_page_fast(ram_addr, 4);
-        dirty_flags = phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS];
+        dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
 #endif
     }
     stl_p(qemu_get_ram_ptr(ram_addr), val);
     dirty_flags |= (0xff & ~CODE_DIRTY_FLAG);
-    phys_ram_dirty[ram_addr >> TARGET_PAGE_BITS] = dirty_flags;
+    cpu_physical_memory_set_dirty_flags(ram_addr, dirty_flags);
     /* we remove the notdirty callback only if the code has been
        flushed */
     if (dirty_flags == 0xff)
@@ -3334,8 +3334,8 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
                     /* invalidate code */
                     tb_invalidate_phys_page_range(addr1, addr1 + l, 0);
                     /* set dirty bit */
-                    phys_ram_dirty[addr1 >> TARGET_PAGE_BITS] |=
-                        (0xff & ~CODE_DIRTY_FLAG);
+                    cpu_physical_memory_set_dirty_flags(
+                        addr1, (0xff & ~CODE_DIRTY_FLAG));
                 }
 		/* qemu doesn't execute guest code directly, but kvm does
 		   therefore flush instruction caches */
@@ -3548,8 +3548,8 @@ void cpu_physical_memory_unmap(void *buffer, target_phys_addr_t len,
                     /* invalidate code */
                     tb_invalidate_phys_page_range(addr1, addr1 + l, 0);
                     /* set dirty bit */
-                    phys_ram_dirty[addr1 >> TARGET_PAGE_BITS] |=
-                        (0xff & ~CODE_DIRTY_FLAG);
+                    cpu_physical_memory_set_dirty_flags(
+                        addr1, (0xff & ~CODE_DIRTY_FLAG));
                 }
                 addr1 += l;
                 access_len -= l;
@@ -3685,8 +3685,8 @@ void stl_phys_notdirty(target_phys_addr_t addr, uint32_t val)
                 /* invalidate code */
                 tb_invalidate_phys_page_range(addr1, addr1 + 4, 0);
                 /* set dirty bit */
-                phys_ram_dirty[addr1 >> TARGET_PAGE_BITS] |=
-                    (0xff & ~CODE_DIRTY_FLAG);
+                cpu_physical_memory_set_dirty_flags(
+                    addr1, (0xff & ~CODE_DIRTY_FLAG));
             }
         }
     }
@@ -3754,8 +3754,8 @@ void stl_phys(target_phys_addr_t addr, uint32_t val)
             /* invalidate code */
             tb_invalidate_phys_page_range(addr1, addr1 + 4, 0);
             /* set dirty bit */
-            phys_ram_dirty[addr1 >> TARGET_PAGE_BITS] |=
-                (0xff & ~CODE_DIRTY_FLAG);
+            cpu_physical_memory_set_dirty_flags(addr1,
+                (0xff & ~CODE_DIRTY_FLAG));
         }
     }
 }
-- 
1.7.0.31.g1df487
^ permalink raw reply related	[flat|nested] 74+ messages in thread
* [Qemu-devel] [RFC PATCH 02/20] Introduce cpu_physical_memory_get_dirty_range().
  2010-04-21  5:57 [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Yoshiaki Tamura
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 01/20] Modify DIRTY_FLAG value and introduce DIRTY_IDX to use as indexes of bit-based phys_ram_dirty Yoshiaki Tamura
@ 2010-04-21  5:57 ` Yoshiaki Tamura
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 03/20] Use cpu_physical_memory_set_dirty_range() to update phys_ram_dirty Yoshiaki Tamura
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-21  5:57 UTC (permalink / raw)
  To: kvm, qemu-devel
  Cc: aliguori, ohmura.kei, mtosatti, Yoshiaki Tamura, yoshikawa.takuya,
	avi
It checks the first row and puts dirty addr in the array.  If the
first row is empty, it skips to the first non-dirty row or the end
addr, and put the length in the first entry of the array.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
---
 cpu-all.h |    4 +++
 exec.c    |   67 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 71 insertions(+), 0 deletions(-)
diff --git a/cpu-all.h b/cpu-all.h
index 3f8762d..27187d4 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -1007,6 +1007,10 @@ static inline void cpu_physical_memory_mask_dirty_range(ram_addr_t start,
     }
 }
 
+int cpu_physical_memory_get_dirty_range(ram_addr_t start, ram_addr_t end, 
+                                        ram_addr_t *dirty_rams, int length,
+                                        int dirty_flags);
+
 void cpu_physical_memory_reset_dirty(ram_addr_t start, ram_addr_t end,
                                      int dirty_flags);
 void cpu_tlb_update_dirty(CPUState *env);
diff --git a/exec.c b/exec.c
index bf8d703..d5c2a05 100644
--- a/exec.c
+++ b/exec.c
@@ -1962,6 +1962,73 @@ static inline void tlb_reset_dirty_range(CPUTLBEntry *tlb_entry,
     }
 }
 
+/* It checks the first row and puts dirty addrs in the array.
+   If the first row is empty, it skips to the first non-dirty row
+   or the end addr, and put the length in the first entry of the array. */
+int cpu_physical_memory_get_dirty_range(ram_addr_t start, ram_addr_t end, 
+                                        ram_addr_t *dirty_rams, int length,
+                                        int dirty_flag)
+{
+    unsigned long p = 0, page_number;
+    ram_addr_t addr;
+    ram_addr_t s_idx = (start >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+    ram_addr_t e_idx = (end >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+    int i, j, offset, dirty_idx = dirty_flag_to_idx(dirty_flag);
+
+    /* mask bits before the start addr */
+    offset = (start >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+    cpu_physical_memory_sync_master(s_idx);
+    p |= phys_ram_dirty[dirty_idx][s_idx] & ~((1UL << offset) - 1);
+
+    if (s_idx == e_idx) {
+        /* mask bits after the end addr */
+        offset = (end >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+        p &= (1UL << offset) - 1;
+    }
+
+    if (p == 0) {
+        /* when the row is empty */
+        ram_addr_t skip;
+        if (s_idx == e_idx) {
+            skip = end;
+	} else {
+            /* skip empty rows */
+            while (s_idx < e_idx) {
+                s_idx++;
+                cpu_physical_memory_sync_master(s_idx);
+
+                if (phys_ram_dirty[dirty_idx][s_idx] != 0) {
+                    break;
+                }
+            }
+            skip = (s_idx * HOST_LONG_BITS * TARGET_PAGE_SIZE);
+        }
+        dirty_rams[0] = skip - start;
+        i = 0;
+
+    } else if (p == ~0UL) {
+        /* when the row is fully dirtied */
+        addr = start;
+        for (i = 0; i < length; i++) {
+            dirty_rams[i] = addr;
+            addr += TARGET_PAGE_SIZE;
+        }
+    } else {
+        /* when the row is partially dirtied */
+        i = 0;
+        do {
+            j = ffsl(p) - 1;
+            p &= ~(1UL << j);
+            page_number = s_idx * HOST_LONG_BITS + j;
+            addr = page_number * TARGET_PAGE_SIZE;
+            dirty_rams[i] = addr;
+            i++;
+        } while (p != 0 && i < length);
+    }
+
+    return i;
+}
+
 /* Note: start and end must be within the same ram block.  */
 void cpu_physical_memory_reset_dirty(ram_addr_t start, ram_addr_t end,
                                      int dirty_flags)
-- 
1.7.0.31.g1df487
^ permalink raw reply related	[flat|nested] 74+ messages in thread
* [Qemu-devel] [RFC PATCH 03/20] Use cpu_physical_memory_set_dirty_range() to update phys_ram_dirty.
  2010-04-21  5:57 [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Yoshiaki Tamura
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 01/20] Modify DIRTY_FLAG value and introduce DIRTY_IDX to use as indexes of bit-based phys_ram_dirty Yoshiaki Tamura
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 02/20] Introduce cpu_physical_memory_get_dirty_range() Yoshiaki Tamura
@ 2010-04-21  5:57 ` Yoshiaki Tamura
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 04/20] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer() Yoshiaki Tamura
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-21  5:57 UTC (permalink / raw)
  To: kvm, qemu-devel
  Cc: aliguori, ohmura.kei, mtosatti, Yoshiaki Tamura, yoshikawa.takuya,
	avi
Modifies kvm_get_dirty_pages_log_range to use
cpu_physical_memory_set_dirty_range() to update the row of the
bit-based phys_ram_dirty bitmap at once.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
---
 qemu-kvm.c |   19 +++++++------------
 1 files changed, 7 insertions(+), 12 deletions(-)
diff --git a/qemu-kvm.c b/qemu-kvm.c
index 29365a9..1414f49 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -2323,8 +2323,8 @@ static int kvm_get_dirty_pages_log_range(unsigned long start_addr,
                                          unsigned long offset,
                                          unsigned long mem_size)
 {
-    unsigned int i, j;
-    unsigned long page_number, addr, addr1, c;
+    unsigned int i;
+    unsigned long page_number, addr, addr1;
     ram_addr_t ram_addr;
     unsigned int len = ((mem_size / TARGET_PAGE_SIZE) + HOST_LONG_BITS - 1) /
         HOST_LONG_BITS;
@@ -2335,16 +2335,11 @@ static int kvm_get_dirty_pages_log_range(unsigned long start_addr,
      */
     for (i = 0; i < len; i++) {
         if (bitmap[i] != 0) {
-            c = leul_to_cpu(bitmap[i]);
-            do {
-                j = ffsl(c) - 1;
-                c &= ~(1ul << j);
-                page_number = i * HOST_LONG_BITS + j;
-                addr1 = page_number * TARGET_PAGE_SIZE;
-                addr = offset + addr1;
-                ram_addr = cpu_get_physical_page_desc(addr);
-                cpu_physical_memory_set_dirty(ram_addr);
-            } while (c != 0);
+            page_number = i * HOST_LONG_BITS;
+            addr1 = page_number * TARGET_PAGE_SIZE;
+            addr = offset + addr1;
+            ram_addr = cpu_get_physical_page_desc(addr);
+            cpu_physical_memory_set_dirty_range(ram_addr, leul_to_cpu(bitmap[i]));
         }
     }
     return 0;
-- 
1.7.0.31.g1df487
^ permalink raw reply related	[flat|nested] 74+ messages in thread
* [Qemu-devel] [RFC PATCH 04/20] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().
  2010-04-21  5:57 [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Yoshiaki Tamura
                   ` (2 preceding siblings ...)
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 03/20] Use cpu_physical_memory_set_dirty_range() to update phys_ram_dirty Yoshiaki Tamura
@ 2010-04-21  5:57 ` Yoshiaki Tamura
  2010-04-21  8:03   ` [Qemu-devel] " Stefan Hajnoczi
  2010-04-23  9:53   ` Avi Kivity
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops() Yoshiaki Tamura
                   ` (18 subsequent siblings)
  22 siblings, 2 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-21  5:57 UTC (permalink / raw)
  To: kvm, qemu-devel
  Cc: aliguori, ohmura.kei, mtosatti, Yoshiaki Tamura, yoshikawa.takuya,
	avi
Currently buf size is fixed at 32KB.  It would be useful if it could
be flexible.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
 hw/hw.h  |    2 ++
 savevm.c |   26 +++++++++++++++++++++++++-
 2 files changed, 27 insertions(+), 1 deletions(-)
diff --git a/hw/hw.h b/hw/hw.h
index 05131a0..fc9ed29 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -61,6 +61,8 @@ void qemu_fflush(QEMUFile *f);
 int qemu_fclose(QEMUFile *f);
 void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
 void qemu_put_byte(QEMUFile *f, int v);
+void *qemu_realloc_buffer(QEMUFile *f, int size);
+void qemu_clear_buffer(QEMUFile *f);
 
 static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v)
 {
diff --git a/savevm.c b/savevm.c
index 2fd3de6..490ab70 100644
--- a/savevm.c
+++ b/savevm.c
@@ -174,7 +174,8 @@ struct QEMUFile {
                            when reading */
     int buf_index;
     int buf_size; /* 0 when writing */
-    uint8_t buf[IO_BUF_SIZE];
+    int buf_max_size;
+    uint8_t *buf;
 
     int has_error;
 };
@@ -424,6 +425,9 @@ QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
     f->get_rate_limit = get_rate_limit;
     f->is_write = 0;
 
+    f->buf_max_size = IO_BUF_SIZE;
+    f->buf = qemu_mallocz(sizeof(uint8_t) * f->buf_max_size);
+
     return f;
 }
 
@@ -454,6 +458,25 @@ void qemu_fflush(QEMUFile *f)
     }
 }
 
+void *qemu_realloc_buffer(QEMUFile *f, int size)
+{
+    f->buf_max_size = size;
+
+    f->buf = qemu_realloc(f->buf, f->buf_max_size);
+    if (f->buf == NULL) {
+        fprintf(stderr, "qemu file buffer realloc failed\n");
+        exit(1);
+    }
+
+    return f->buf;
+}
+
+void qemu_clear_buffer(QEMUFile *f)
+{
+    f->buf_size = f->buf_index = f->buf_offset = 0;
+    memset(f->buf, 0, f->buf_max_size);
+}
+
 static void qemu_fill_buffer(QEMUFile *f)
 {
     int len;
@@ -479,6 +502,7 @@ int qemu_fclose(QEMUFile *f)
     qemu_fflush(f);
     if (f->close)
         ret = f->close(f->opaque);
+    qemu_free(f->buf);
     qemu_free(f);
     return ret;
 }
-- 
1.7.0.31.g1df487
^ permalink raw reply related	[flat|nested] 74+ messages in thread
* [Qemu-devel] [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops().
  2010-04-21  5:57 [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Yoshiaki Tamura
                   ` (3 preceding siblings ...)
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 04/20] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer() Yoshiaki Tamura
@ 2010-04-21  5:57 ` Yoshiaki Tamura
  2010-04-22 19:28   ` [Qemu-devel] " Anthony Liguori
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 06/20] Introduce iovec util functions, qemu_iovec_to_vector() and qemu_iovec_to_size() Yoshiaki Tamura
                   ` (17 subsequent siblings)
  22 siblings, 1 reply; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-21  5:57 UTC (permalink / raw)
  To: kvm, qemu-devel
  Cc: aliguori, ohmura.kei, mtosatti, Yoshiaki Tamura, yoshikawa.takuya,
	avi
QEMUFile currently doesn't support writev().  For sending multiple
data, such as pages, using writev() should be more efficient.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
 buffered_file.c |    2 +-
 hw/hw.h         |   16 ++++++++++++++++
 savevm.c        |   43 +++++++++++++++++++++++++------------------
 3 files changed, 42 insertions(+), 19 deletions(-)
diff --git a/buffered_file.c b/buffered_file.c
index 54dc6c2..187d1d4 100644
--- a/buffered_file.c
+++ b/buffered_file.c
@@ -256,7 +256,7 @@ QEMUFile *qemu_fopen_ops_buffered(void *opaque,
     s->wait_for_unfreeze = wait_for_unfreeze;
     s->close = close;
 
-    s->file = qemu_fopen_ops(s, buffered_put_buffer, NULL,
+    s->file = qemu_fopen_ops(s, buffered_put_buffer, NULL, NULL, NULL,
                              buffered_close, buffered_rate_limit,
                              buffered_set_rate_limit,
 			     buffered_get_rate_limit);
diff --git a/hw/hw.h b/hw/hw.h
index fc9ed29..921cf90 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -23,6 +23,13 @@
 typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf,
                                     int64_t pos, int size);
 
+/* This function writes a chunk of vector to a file at the given position.
+ * The pos argument can be ignored if the file is only being used for
+ * streaming.
+ */
+typedef int (QEMUFilePutVectorFunc)(void *opaque, struct iovec *iov,
+                                    int64_t pos, int iovcnt);
+
 /* Read a chunk of data from a file at the given position.  The pos argument
  * can be ignored if the file is only be used for streaming.  The number of
  * bytes actually read should be returned.
@@ -30,6 +37,13 @@ typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf,
 typedef int (QEMUFileGetBufferFunc)(void *opaque, uint8_t *buf,
                                     int64_t pos, int size);
 
+/* Read a chunk of vector from a file at the given position.  The pos argument
+ * can be ignored if the file is only be used for streaming.  The number of
+ * bytes actually read should be returned.
+ */
+typedef int (QEMUFileGetVectorFunc)(void *opaque, struct iovec *iov,
+                                    int64_t pos, int iovcnt);
+
 /* Close a file and return an error code */
 typedef int (QEMUFileCloseFunc)(void *opaque);
 
@@ -46,7 +60,9 @@ typedef size_t (QEMUFileSetRateLimit)(void *opaque, size_t new_rate);
 typedef size_t (QEMUFileGetRateLimit)(void *opaque);
 
 QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
+                         QEMUFilePutVectorFunc *put_vector,
                          QEMUFileGetBufferFunc *get_buffer,
+                         QEMUFileGetVectorFunc *get_vector,
                          QEMUFileCloseFunc *close,
                          QEMUFileRateLimit *rate_limit,
                          QEMUFileSetRateLimit *set_rate_limit,
diff --git a/savevm.c b/savevm.c
index 490ab70..944e788 100644
--- a/savevm.c
+++ b/savevm.c
@@ -162,7 +162,9 @@ void qemu_announce_self(void)
 
 struct QEMUFile {
     QEMUFilePutBufferFunc *put_buffer;
+    QEMUFilePutVectorFunc *put_vector;
     QEMUFileGetBufferFunc *get_buffer;
+    QEMUFileGetVectorFunc *get_vector;
     QEMUFileCloseFunc *close;
     QEMUFileRateLimit *rate_limit;
     QEMUFileSetRateLimit *set_rate_limit;
@@ -263,11 +265,11 @@ QEMUFile *qemu_popen(FILE *stdio_file, const char *mode)
     s->stdio_file = stdio_file;
 
     if(mode[0] == 'r') {
-        s->file = qemu_fopen_ops(s, NULL, stdio_get_buffer, stdio_pclose, 
-				 NULL, NULL, NULL);
+        s->file = qemu_fopen_ops(s, NULL, NULL, stdio_get_buffer,
+                 NULL, stdio_pclose, NULL, NULL, NULL);
     } else {
-        s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, stdio_pclose, 
-				 NULL, NULL, NULL);
+        s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, NULL, NULL, 
+                 stdio_pclose, NULL, NULL, NULL);
     }
     return s->file;
 }
@@ -312,11 +314,11 @@ QEMUFile *qemu_fdopen(int fd, const char *mode)
         goto fail;
 
     if(mode[0] == 'r') {
-        s->file = qemu_fopen_ops(s, NULL, stdio_get_buffer, stdio_fclose, 
-				 NULL, NULL, NULL);
+        s->file = qemu_fopen_ops(s, NULL, NULL, stdio_get_buffer, NULL,
+                 stdio_fclose, NULL, NULL, NULL);
     } else {
-        s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, stdio_fclose, 
-				 NULL, NULL, NULL);
+        s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, NULL, NULL,
+                 stdio_fclose, NULL, NULL, NULL);
     }
     return s->file;
 
@@ -330,8 +332,8 @@ QEMUFile *qemu_fopen_socket(int fd)
     QEMUFileSocket *s = qemu_mallocz(sizeof(QEMUFileSocket));
 
     s->fd = fd;
-    s->file = qemu_fopen_ops(s, NULL, socket_get_buffer, socket_close, 
-			     NULL, NULL, NULL);
+    s->file = qemu_fopen_ops(s, NULL, NULL, socket_get_buffer, NULL,
+                             socket_close, NULL, NULL, NULL);
     return s->file;
 }
 
@@ -368,11 +370,11 @@ QEMUFile *qemu_fopen(const char *filename, const char *mode)
         goto fail;
     
     if(mode[0] == 'w') {
-        s->file = qemu_fopen_ops(s, file_put_buffer, NULL, stdio_fclose, 
-				 NULL, NULL, NULL);
+        s->file = qemu_fopen_ops(s, file_put_buffer, NULL, NULL, NULL,
+                  stdio_fclose, NULL, NULL, NULL);
     } else {
-        s->file = qemu_fopen_ops(s, NULL, file_get_buffer, stdio_fclose, 
-			       NULL, NULL, NULL);
+        s->file = qemu_fopen_ops(s, NULL, NULL, file_get_buffer, NULL,
+                  stdio_fclose, NULL, NULL, NULL);
     }
     return s->file;
 fail:
@@ -400,13 +402,16 @@ static int bdrv_fclose(void *opaque)
 static QEMUFile *qemu_fopen_bdrv(BlockDriverState *bs, int is_writable)
 {
     if (is_writable)
-        return qemu_fopen_ops(bs, block_put_buffer, NULL, bdrv_fclose, 
-			      NULL, NULL, NULL);
-    return qemu_fopen_ops(bs, NULL, block_get_buffer, bdrv_fclose, NULL, NULL, NULL);
+        return qemu_fopen_ops(bs, block_put_buffer, NULL, NULL, NULL,
+                  bdrv_fclose, NULL, NULL, NULL);
+    return qemu_fopen_ops(bs, NULL, NULL, block_get_buffer, NULL, bdrv_fclose, NULL, NULL, NULL);
 }
 
-QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
+QEMUFile *qemu_fopen_ops(void *opaque,
+                         QEMUFilePutBufferFunc *put_buffer,
+                         QEMUFilePutVectorFunc *put_vector,
                          QEMUFileGetBufferFunc *get_buffer,
+                         QEMUFileGetVectorFunc *get_vector,
                          QEMUFileCloseFunc *close,
                          QEMUFileRateLimit *rate_limit,
                          QEMUFileSetRateLimit *set_rate_limit,
@@ -418,7 +423,9 @@ QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
 
     f->opaque = opaque;
     f->put_buffer = put_buffer;
+    f->put_vector = put_vector;
     f->get_buffer = get_buffer;
+    f->get_vector = get_vector;
     f->close = close;
     f->rate_limit = rate_limit;
     f->set_rate_limit = set_rate_limit;
-- 
1.7.0.31.g1df487
^ permalink raw reply related	[flat|nested] 74+ messages in thread
* [Qemu-devel] [RFC PATCH 06/20] Introduce iovec util functions, qemu_iovec_to_vector() and qemu_iovec_to_size().
  2010-04-21  5:57 [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Yoshiaki Tamura
                   ` (4 preceding siblings ...)
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops() Yoshiaki Tamura
@ 2010-04-21  5:57 ` Yoshiaki Tamura
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 07/20] Introduce qemu_put_vector() and qemu_put_vector_prepare() to use put_vector() in QEMUFile Yoshiaki Tamura
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-21  5:57 UTC (permalink / raw)
  To: kvm, qemu-devel
  Cc: aliguori, ohmura.kei, mtosatti, Yoshiaki Tamura, yoshikawa.takuya,
	avi
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
 cutils.c      |   12 ++++++++++++
 qemu-common.h |    2 ++
 2 files changed, 14 insertions(+), 0 deletions(-)
diff --git a/cutils.c b/cutils.c
index be99b21..1d35590 100644
--- a/cutils.c
+++ b/cutils.c
@@ -238,3 +238,15 @@ void qemu_iovec_from_buffer(QEMUIOVector *qiov, const void *buf, size_t count)
         count -= copy;
     }
 }
+
+void qemu_iovec_to_vector(QEMUIOVector *qiov, struct iovec **iov, int *count)
+{
+    *iov = qiov->iov;
+    *count = qiov->niov;
+}
+
+void qemu_iovec_to_size(QEMUIOVector *qiov, size_t *size)
+{
+    *size = qiov->size;
+}
+
diff --git a/qemu-common.h b/qemu-common.h
index 2e5f3a7..0af30d2 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -273,6 +273,8 @@ void qemu_iovec_destroy(QEMUIOVector *qiov);
 void qemu_iovec_reset(QEMUIOVector *qiov);
 void qemu_iovec_to_buffer(QEMUIOVector *qiov, void *buf);
 void qemu_iovec_from_buffer(QEMUIOVector *qiov, const void *buf, size_t count);
+void qemu_iovec_to_vector(QEMUIOVector *qiov, struct iovec **iov, int *niov);
+void qemu_iovec_to_size(QEMUIOVector *qiov, size_t *size);
 
 struct Monitor;
 typedef struct Monitor Monitor;
-- 
1.7.0.31.g1df487
^ permalink raw reply related	[flat|nested] 74+ messages in thread
* [Qemu-devel] [RFC PATCH 07/20] Introduce qemu_put_vector() and qemu_put_vector_prepare() to use put_vector() in QEMUFile.
  2010-04-21  5:57 [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Yoshiaki Tamura
                   ` (5 preceding siblings ...)
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 06/20] Introduce iovec util functions, qemu_iovec_to_vector() and qemu_iovec_to_size() Yoshiaki Tamura
@ 2010-04-21  5:57 ` Yoshiaki Tamura
  2010-04-22 19:29   ` [Qemu-devel] " Anthony Liguori
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 08/20] Introduce RAMSaveIO and use cpu_physical_memory_get_dirty_range() to check multiple dirty pages Yoshiaki Tamura
                   ` (15 subsequent siblings)
  22 siblings, 1 reply; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-21  5:57 UTC (permalink / raw)
  To: kvm, qemu-devel
  Cc: aliguori, ohmura.kei, mtosatti, Yoshiaki Tamura, yoshikawa.takuya,
	avi
For fool proof purpose, qemu_put_vector_parepare should be called
before qemu_put_vector.  Then, if qemu_put_* functions except this is
called after qemu_put_vector_prepare, program will abort().
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
 hw/hw.h  |    2 ++
 savevm.c |   53 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+), 0 deletions(-)
diff --git a/hw/hw.h b/hw/hw.h
index 921cf90..10e6dda 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -77,6 +77,8 @@ void qemu_fflush(QEMUFile *f);
 int qemu_fclose(QEMUFile *f);
 void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
 void qemu_put_byte(QEMUFile *f, int v);
+void qemu_put_vector(QEMUFile *f, QEMUIOVector *qiov);
+void qemu_put_vector_prepare(QEMUFile *f);
 void *qemu_realloc_buffer(QEMUFile *f, int size);
 void qemu_clear_buffer(QEMUFile *f);
 
diff --git a/savevm.c b/savevm.c
index 944e788..22d928c 100644
--- a/savevm.c
+++ b/savevm.c
@@ -180,6 +180,7 @@ struct QEMUFile {
     uint8_t *buf;
 
     int has_error;
+    int prepares_vector;
 };
 
 typedef struct QEMUFileStdio
@@ -557,6 +558,58 @@ void qemu_put_byte(QEMUFile *f, int v)
         qemu_fflush(f);
 }
 
+void qemu_put_vector(QEMUFile *f, QEMUIOVector *v)
+{
+    struct iovec *iov;
+    int cnt;
+    size_t bufsize;
+    uint8_t *buf;
+
+    if (qemu_file_get_rate_limit(f) != 0) {
+        fprintf(stderr,
+                "Attempted to write vector while bandwidth limit is not zero.\n");
+        abort();
+    }
+
+    /* checks prepares vector.
+     * For fool proof purpose, qemu_put_vector_parepare should be called
+     * before qemu_put_vector.  Then, if qemu_put_* functions except this
+     * is called after qemu_put_vector_prepare, program will abort().
+     */
+    if (!f->prepares_vector) {
+        fprintf(stderr,
+            "You should prepare with qemu_put_vector_prepare.\n");
+        abort();
+    } else if (f->prepares_vector && f->buf_index != 0) {
+        fprintf(stderr, "Wrote data after qemu_put_vector_prepare.\n");
+        abort();
+    }
+    f->prepares_vector = 0;
+
+    if (f->put_vector) {
+        qemu_iovec_to_vector(v, &iov, &cnt);
+        f->put_vector(f->opaque, iov, 0, cnt);
+    } else {
+        qemu_iovec_to_size(v, &bufsize);
+        buf = qemu_malloc(bufsize + 1 /* for '\0' */);
+        qemu_iovec_to_buffer(v, buf);
+        qemu_put_buffer(f, buf, bufsize);
+        qemu_free(buf);
+    }
+
+}
+
+void qemu_put_vector_prepare(QEMUFile *f)
+{
+    if (f->prepares_vector) {
+        /* prepare vector */
+        fprintf(stderr, "Attempted to prepare vector twice\n");
+        abort();
+    }
+    f->prepares_vector = 1;
+    qemu_fflush(f);
+}
+
 int qemu_get_buffer(QEMUFile *f, uint8_t *buf, int size1)
 {
     int size, l;
-- 
1.7.0.31.g1df487
^ permalink raw reply related	[flat|nested] 74+ messages in thread
* [Qemu-devel] [RFC PATCH 08/20] Introduce RAMSaveIO and use cpu_physical_memory_get_dirty_range() to check multiple dirty pages.
  2010-04-21  5:57 [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Yoshiaki Tamura
                   ` (6 preceding siblings ...)
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 07/20] Introduce qemu_put_vector() and qemu_put_vector_prepare() to use put_vector() in QEMUFile Yoshiaki Tamura
@ 2010-04-21  5:57 ` Yoshiaki Tamura
  2010-04-22 19:31   ` [Qemu-devel] " Anthony Liguori
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 09/20] Introduce writev and read to FdMigrationState Yoshiaki Tamura
                   ` (14 subsequent siblings)
  22 siblings, 1 reply; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-21  5:57 UTC (permalink / raw)
  To: kvm, qemu-devel
  Cc: aliguori, ohmura.kei, mtosatti, Yoshiaki Tamura, yoshikawa.takuya,
	avi
Introduce RAMSaveIO to use writev for saving ram blocks, and modifies
ram_save_block() and ram_save_remaining() to use
cpu_physical_memory_get_dirty_range() to check multiple dirty and
non-dirty pages at once.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
---
 vl.c |  221 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 197 insertions(+), 24 deletions(-)
diff --git a/vl.c b/vl.c
index 729c955..9c3dc4c 100644
--- a/vl.c
+++ b/vl.c
@@ -2774,12 +2774,167 @@ static int is_dup_page(uint8_t *page, uint8_t ch)
     return 1;
 }
 
-static int ram_save_block(QEMUFile *f)
+typedef struct RAMSaveIO RAMSaveIO;
+
+struct RAMSaveIO {
+    QEMUFile *f;
+    QEMUIOVector *qiov;
+
+    uint8_t *ram_store;
+    size_t nalloc, nused;
+    uint8_t io_mode;
+
+    void (*put_buffer)(RAMSaveIO *s, uint8_t *buf, size_t len);
+    void (*put_byte)(RAMSaveIO *s, int v);
+    void (*put_be64)(RAMSaveIO *s, uint64_t v);
+
+};
+
+static inline void ram_saveio_flush(RAMSaveIO *s, int prepare)
+{
+    qemu_put_vector(s->f, s->qiov);
+    if (prepare)
+        qemu_put_vector_prepare(s->f);
+
+    /* reset stored data */
+    qemu_iovec_reset(s->qiov);
+    s->nused = 0;
+}
+
+static inline void ram_saveio_put_buffer(RAMSaveIO *s, uint8_t *buf, size_t len)
+{
+    s->put_buffer(s, buf, len);
+}
+
+static inline void ram_saveio_put_byte(RAMSaveIO *s, int v)
+{
+    s->put_byte(s, v);
+}
+
+static inline void ram_saveio_put_be64(RAMSaveIO *s, uint64_t v)
+{
+    s->put_be64(s, v);
+}
+
+static inline void ram_saveio_set_error(RAMSaveIO *s)
+{
+    qemu_file_set_error(s->f);
+}
+
+static void ram_saveio_put_buffer_vector(RAMSaveIO *s, uint8_t *buf, size_t len)
+{
+    qemu_iovec_add(s->qiov, buf, len);
+}
+
+static void ram_saveio_put_buffer_direct(RAMSaveIO *s, uint8_t *buf, size_t len)
+{
+    qemu_put_buffer(s->f, buf, len);
+}
+
+static void ram_saveio_put_byte_vector(RAMSaveIO *s, int v)
+{
+    uint8_t *to_save;
+
+    if (s->nalloc - s->nused < sizeof(int))
+        ram_saveio_flush(s, 1);
+
+    to_save = &s->ram_store[s->nused];
+    to_save[0] = v & 0xff;
+    s->nused++;
+
+    qemu_iovec_add(s->qiov, to_save, 1);
+}
+
+static void ram_saveio_put_byte_direct(RAMSaveIO *s, int v)
+{
+    qemu_put_byte(s->f, v);
+}
+
+static void ram_saveio_put_be64_vector(RAMSaveIO *s, uint64_t v)
+{
+    uint8_t *to_save;
+
+    if (s->nalloc - s->nused < sizeof(uint64_t))
+        ram_saveio_flush(s, 1);
+
+    to_save = &s->ram_store[s->nused];
+    to_save[0] = (v >> 56) & 0xff;
+    to_save[1] = (v >> 48) & 0xff;
+    to_save[2] = (v >> 40) & 0xff;
+    to_save[3] = (v >> 32) & 0xff;
+    to_save[4] = (v >> 24) & 0xff;
+    to_save[5] = (v >> 16) & 0xff;
+    to_save[6] = (v >>  8) & 0xff;
+    to_save[7] = (v >>  0) & 0xff;
+    s->nused += sizeof(uint64_t);
+
+    qemu_iovec_add(s->qiov, to_save, sizeof(uint64_t));
+}
+
+static void ram_saveio_put_be64_direct(RAMSaveIO *s, uint64_t v)
+{
+
+    qemu_put_be64(s->f, v);
+}
+
+static RAMSaveIO *ram_saveio_new(QEMUFile *f, size_t max_store)
+{
+    RAMSaveIO *s;
+
+    s = qemu_mallocz(sizeof(*s));
+
+    if (qemu_file_get_rate_limit(f) == 0) {/* non buffer mode */
+        /* When QEMUFile don't have get_rate limit,
+         * qemu_file_get_rate_limit will return 0.
+         * However, we believe that all kinds of QEMUFile
+         * except non-block mode has rate limit function.
+         */
+        s->io_mode = 1;
+        s->ram_store = qemu_mallocz(max_store);
+        s->nalloc = max_store;
+        s->nused = 0;
+
+        s->qiov = qemu_mallocz(sizeof(*s->qiov));
+        qemu_iovec_init(s->qiov, max_store);
+
+        s->put_buffer = ram_saveio_put_buffer_vector;
+        s->put_byte = ram_saveio_put_byte_vector;
+        s->put_be64 = ram_saveio_put_be64_vector;
+
+        qemu_put_vector_prepare(f);
+    } else {
+        s->io_mode = 0;
+        s->put_buffer = ram_saveio_put_buffer_direct;
+        s->put_byte = ram_saveio_put_byte_direct;
+        s->put_be64 = ram_saveio_put_be64_direct;
+    }
+
+    s->f = f;
+    
+    return s;
+}
+
+static void ram_saveio_destroy(RAMSaveIO *s)
+{
+    if (s->qiov != NULL) { /* means using put_vector */
+        ram_saveio_flush(s, 0);
+        qemu_iovec_destroy(s->qiov);
+        qemu_free(s->qiov);
+        qemu_free(s->ram_store);
+    }
+    qemu_free(s);
+}
+
+/*
+ * RAMSaveIO will manage I/O.
+ */
+static int ram_save_block(RAMSaveIO *s)
 {
     static ram_addr_t current_addr = 0;
     ram_addr_t saved_addr = current_addr;
     ram_addr_t addr = 0;
-    int found = 0;
+    ram_addr_t dirty_rams[HOST_LONG_BITS];
+    int i, found = 0;
 
     while (addr < last_ram_offset) {
         if (kvm_enabled() && current_addr == 0) {
@@ -2787,32 +2942,38 @@ static int ram_save_block(QEMUFile *f)
             r = kvm_update_dirty_pages_log();
             if (r) {
                 fprintf(stderr, "%s: update dirty pages log failed %d\n", __FUNCTION__, r);
-                qemu_file_set_error(f);
+                ram_saveio_set_error(s);
                 return 0;
             }
         }
-        if (cpu_physical_memory_get_dirty(current_addr, MIGRATION_DIRTY_FLAG)) {
+        if ((found = cpu_physical_memory_get_dirty_range(
+                 current_addr, last_ram_offset, dirty_rams, HOST_LONG_BITS,
+                 MIGRATION_DIRTY_FLAG))) {
             uint8_t *p;
 
-            cpu_physical_memory_reset_dirty(current_addr,
-                                            current_addr + TARGET_PAGE_SIZE,
-                                            MIGRATION_DIRTY_FLAG);
+            for (i = 0; i < found; i++) {
+                ram_addr_t page_addr = dirty_rams[i];
+                cpu_physical_memory_reset_dirty(page_addr,
+                                                page_addr + TARGET_PAGE_SIZE,
+                                                MIGRATION_DIRTY_FLAG);
 
-            p = qemu_get_ram_ptr(current_addr);
+                p = qemu_get_ram_ptr(page_addr);
 
-            if (is_dup_page(p, *p)) {
-                qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_COMPRESS);
-                qemu_put_byte(f, *p);
-            } else {
-                qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_PAGE);
-                qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
+                if (is_dup_page(p, *p)) {
+                    ram_saveio_put_be64(s, 
+                                        (page_addr) | RAM_SAVE_FLAG_COMPRESS);
+                    ram_saveio_put_byte(s, *p);
+                } else {
+                    ram_saveio_put_be64(s, (page_addr) | RAM_SAVE_FLAG_PAGE);
+                    ram_saveio_put_buffer(s, p, TARGET_PAGE_SIZE);
+                }
             }
 
-            found = 1;
             break;
+        } else {
+            addr += dirty_rams[0];
+            current_addr = (saved_addr + addr) % last_ram_offset;
         }
-        addr += TARGET_PAGE_SIZE;
-        current_addr = (saved_addr + addr) % last_ram_offset;
     }
 
     return found;
@@ -2822,12 +2983,19 @@ static uint64_t bytes_transferred;
 
 static ram_addr_t ram_save_remaining(void)
 {
-    ram_addr_t addr;
+    ram_addr_t addr = 0;
     ram_addr_t count = 0;
+    ram_addr_t dirty_rams[HOST_LONG_BITS];
+    int found = 0;
 
-    for (addr = 0; addr < last_ram_offset; addr += TARGET_PAGE_SIZE) {
-        if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG))
-            count++;
+    while (addr < last_ram_offset) {
+        if ((found = cpu_physical_memory_get_dirty_range(addr, last_ram_offset,
+            dirty_rams, HOST_LONG_BITS, MIGRATION_DIRTY_FLAG))) {
+            count += found;
+            addr = dirty_rams[found - 1] + TARGET_PAGE_SIZE;
+        } else {
+            addr += dirty_rams[0];
+        }
     }
 
     return count;
@@ -2854,6 +3022,7 @@ static int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
     uint64_t bytes_transferred_last;
     double bwidth = 0;
     uint64_t expected_time = 0;
+    RAMSaveIO *s;
 
     if (stage < 0) {
         cpu_physical_memory_set_dirty_tracking(0);
@@ -2883,10 +3052,12 @@ static int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
     bytes_transferred_last = bytes_transferred;
     bwidth = qemu_get_clock_ns(rt_clock);
 
-    while (!qemu_file_rate_limit(f)) {
+    s = ram_saveio_new(f, IOV_MAX);
+
+     while (!qemu_file_rate_limit(f)) {
         int ret;
 
-        ret = ram_save_block(f);
+        ret = ram_save_block(s);
         bytes_transferred += ret * TARGET_PAGE_SIZE;
         if (ret == 0) /* no more blocks */
             break;
@@ -2903,12 +3074,14 @@ static int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
     /* try transferring iterative blocks of memory */
     if (stage == 3) {
         /* flush all remaining blocks regardless of rate limiting */
-        while (ram_save_block(f) != 0) {
+        while (ram_save_block(s) != 0) {
             bytes_transferred += TARGET_PAGE_SIZE;
         }
         cpu_physical_memory_set_dirty_tracking(0);
     }
 
+    ram_saveio_destroy(s);
+
     qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
 
     expected_time = ram_save_remaining() * TARGET_PAGE_SIZE / bwidth;
-- 
1.7.0.31.g1df487
^ permalink raw reply related	[flat|nested] 74+ messages in thread
* [Qemu-devel] [RFC PATCH 09/20] Introduce writev and read to FdMigrationState.
  2010-04-21  5:57 [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Yoshiaki Tamura
                   ` (7 preceding siblings ...)
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 08/20] Introduce RAMSaveIO and use cpu_physical_memory_get_dirty_range() to check multiple dirty pages Yoshiaki Tamura
@ 2010-04-21  5:57 ` Yoshiaki Tamura
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 10/20] Introduce skip_header parameter to qemu_loadvm_state() so that it can be called iteratively without reading the header Yoshiaki Tamura
                   ` (13 subsequent siblings)
  22 siblings, 0 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-21  5:57 UTC (permalink / raw)
  To: kvm, qemu-devel
  Cc: aliguori, ohmura.kei, mtosatti, Yoshiaki Tamura, yoshikawa.takuya,
	avi
Currently FdMigrationState doesn't support writev() and read().
writev() is introduced to send data efficiently, and read() is
necessary to get response from the other side.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
 migration-tcp.c |   20 ++++++++++++++++++++
 migration.c     |   27 +++++++++++++++++++++++++++
 migration.h     |    6 ++++++
 3 files changed, 53 insertions(+), 0 deletions(-)
diff --git a/migration-tcp.c b/migration-tcp.c
index e7f307c..56e1a3b 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -39,6 +39,24 @@ static int socket_write(FdMigrationState *s, const void * buf, size_t size)
     return send(s->fd, buf, size, 0);
 }
 
+static int socket_read(FdMigrationState *s, const void * buf, size_t size)
+{
+    ssize_t len;
+
+    do { 
+        len = recv(s->fd, (void *)buf, size, 0);
+    } while (len == -1 && socket_error() == EINTR);
+    if (len == -1)
+        len = -socket_error();
+
+    return len;
+}
+
+static int socket_writev(FdMigrationState *s, const struct iovec *v, int count)
+{
+    return writev(s->fd, v, count);
+}
+
 static int tcp_close(FdMigrationState *s)
 {
     DPRINTF("tcp_close\n");
@@ -94,6 +112,8 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon,
 
     s->get_error = socket_errno;
     s->write = socket_write;
+    s->writev = socket_writev;
+    s->read = socket_read;
     s->close = tcp_close;
     s->mig_state.cancel = migrate_fd_cancel;
     s->mig_state.get_status = migrate_fd_get_status;
diff --git a/migration.c b/migration.c
index 05f6cc5..5d238f5 100644
--- a/migration.c
+++ b/migration.c
@@ -337,6 +337,33 @@ ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size)
     return ret;
 }
 
+int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, int size)
+{
+    FdMigrationState *s = opaque;
+    ssize_t ret;
+    ret = s->read(s, data, size);
+    
+    if (ret == -1)
+        ret = -(s->get_error(s));
+    
+    return ret;
+}
+
+ssize_t migrate_fd_put_vector(void *opaque, const struct iovec *vector, int count)
+{
+    FdMigrationState *s = opaque;
+    ssize_t ret;
+
+    do {
+        ret = s->writev(s, vector, count);
+    } while (ret == -1 && ((s->get_error(s)) == EINTR));
+
+    if (ret == -1)
+        ret = -(s->get_error(s));
+
+    return ret;
+}
+
 void migrate_fd_connect(FdMigrationState *s)
 {
     int ret;
diff --git a/migration.h b/migration.h
index 385423f..ddc1d42 100644
--- a/migration.h
+++ b/migration.h
@@ -47,6 +47,8 @@ struct FdMigrationState
     int (*get_error)(struct FdMigrationState*);
     int (*close)(struct FdMigrationState*);
     int (*write)(struct FdMigrationState*, const void *, size_t);
+    int (*writev)(struct FdMigrationState*, const struct iovec *, int);
+    int (*read)(struct FdMigrationState *, const void *, size_t);
     void *opaque;
 };
 
@@ -113,6 +115,10 @@ void migrate_fd_put_notify(void *opaque);
 
 ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size);
 
+ssize_t migrate_fd_put_vector(void *opaque, const struct iovec *iov, int count);
+
+int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, int size);
+
 void migrate_fd_connect(FdMigrationState *s);
 
 void migrate_fd_put_ready(void *opaque);
-- 
1.7.0.31.g1df487
^ permalink raw reply related	[flat|nested] 74+ messages in thread
* [Qemu-devel] [RFC PATCH 10/20] Introduce skip_header parameter to qemu_loadvm_state() so that it can be called iteratively without reading the header.
  2010-04-21  5:57 [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Yoshiaki Tamura
                   ` (8 preceding siblings ...)
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 09/20] Introduce writev and read to FdMigrationState Yoshiaki Tamura
@ 2010-04-21  5:57 ` Yoshiaki Tamura
  2010-04-22 19:34   ` [Qemu-devel] " Anthony Liguori
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 11/20] Introduce some socket util functions Yoshiaki Tamura
                   ` (12 subsequent siblings)
  22 siblings, 1 reply; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-21  5:57 UTC (permalink / raw)
  To: kvm, qemu-devel
  Cc: aliguori, ohmura.kei, mtosatti, Yoshiaki Tamura, yoshikawa.takuya,
	avi
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
 migration-exec.c |    2 +-
 migration-fd.c   |    2 +-
 migration-tcp.c  |    2 +-
 migration-unix.c |    2 +-
 savevm.c         |   25 ++++++++++++++-----------
 sysemu.h         |    2 +-
 6 files changed, 19 insertions(+), 16 deletions(-)
diff --git a/migration-exec.c b/migration-exec.c
index 3edc026..5839a6d 100644
--- a/migration-exec.c
+++ b/migration-exec.c
@@ -113,7 +113,7 @@ static void exec_accept_incoming_migration(void *opaque)
     QEMUFile *f = opaque;
     int ret;
 
-    ret = qemu_loadvm_state(f);
+    ret = qemu_loadvm_state(f, 0);
     if (ret < 0) {
         fprintf(stderr, "load of migration failed\n");
         goto err;
diff --git a/migration-fd.c b/migration-fd.c
index 0cc74ad..0e97ed0 100644
--- a/migration-fd.c
+++ b/migration-fd.c
@@ -106,7 +106,7 @@ static void fd_accept_incoming_migration(void *opaque)
     QEMUFile *f = opaque;
     int ret;
 
-    ret = qemu_loadvm_state(f);
+    ret = qemu_loadvm_state(f, 0);
     if (ret < 0) {
         fprintf(stderr, "load of migration failed\n");
         goto err;
diff --git a/migration-tcp.c b/migration-tcp.c
index 56e1a3b..94a1a03 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -182,7 +182,7 @@ static void tcp_accept_incoming_migration(void *opaque)
         goto out;
     }
 
-    ret = qemu_loadvm_state(f);
+    ret = qemu_loadvm_state(f, 0);
     if (ret < 0) {
         fprintf(stderr, "load of migration failed\n");
         goto out_fopen;
diff --git a/migration-unix.c b/migration-unix.c
index b7aab38..dd99a73 100644
--- a/migration-unix.c
+++ b/migration-unix.c
@@ -168,7 +168,7 @@ static void unix_accept_incoming_migration(void *opaque)
         goto out;
     }
 
-    ret = qemu_loadvm_state(f);
+    ret = qemu_loadvm_state(f, 0);
     if (ret < 0) {
         fprintf(stderr, "load of migration failed\n");
         goto out_fopen;
diff --git a/savevm.c b/savevm.c
index 22d928c..a401b27 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1554,7 +1554,7 @@ typedef struct LoadStateEntry {
     int version_id;
 } LoadStateEntry;
 
-int qemu_loadvm_state(QEMUFile *f)
+int qemu_loadvm_state(QEMUFile *f, int skip_header)
 {
     QLIST_HEAD(, LoadStateEntry) loadvm_handlers =
         QLIST_HEAD_INITIALIZER(loadvm_handlers);
@@ -1563,17 +1563,20 @@ int qemu_loadvm_state(QEMUFile *f)
     unsigned int v;
     int ret;
 
-    v = qemu_get_be32(f);
-    if (v != QEMU_VM_FILE_MAGIC)
-        return -EINVAL;
+    if (!skip_header) {
+        v = qemu_get_be32(f);
+        if (v != QEMU_VM_FILE_MAGIC)
+            return -EINVAL;
+
+        v = qemu_get_be32(f);
+        if (v == QEMU_VM_FILE_VERSION_COMPAT) {
+            fprintf(stderr, "SaveVM v3 format is obsolete and don't work anymore\n");
+            return -ENOTSUP;
+        }
+        if (v != QEMU_VM_FILE_VERSION)
+            return -ENOTSUP;
 
-    v = qemu_get_be32(f);
-    if (v == QEMU_VM_FILE_VERSION_COMPAT) {
-        fprintf(stderr, "SaveVM v2 format is obsolete and don't work anymore\n");
-        return -ENOTSUP;
     }
-    if (v != QEMU_VM_FILE_VERSION)
-        return -ENOTSUP;
 
     while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
         uint32_t instance_id, version_id, section_id;
@@ -1898,7 +1901,7 @@ int load_vmstate(Monitor *mon, const char *name)
         monitor_printf(mon, "Could not open VM state file\n");
         return -EINVAL;
     }
-    ret = qemu_loadvm_state(f);
+    ret = qemu_loadvm_state(f, 0);
     qemu_fclose(f);
     if (ret < 0) {
         monitor_printf(mon, "Error %d while loading VM state\n", ret);
diff --git a/sysemu.h b/sysemu.h
index 647a468..6c1441f 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -68,7 +68,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
 int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
 int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
 void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
-int qemu_loadvm_state(QEMUFile *f);
+int qemu_loadvm_state(QEMUFile *f, int skip_header);
 
 void qemu_errors_to_file(FILE *fp);
 void qemu_errors_to_mon(Monitor *mon);
-- 
1.7.0.31.g1df487
^ permalink raw reply related	[flat|nested] 74+ messages in thread
* [Qemu-devel] [RFC PATCH 11/20] Introduce some socket util functions.
  2010-04-21  5:57 [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Yoshiaki Tamura
                   ` (9 preceding siblings ...)
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 10/20] Introduce skip_header parameter to qemu_loadvm_state() so that it can be called iteratively without reading the header Yoshiaki Tamura
@ 2010-04-21  5:57 ` Yoshiaki Tamura
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 12/20] Introduce fault tolerant VM transaction QEMUFile and ft_mode Yoshiaki Tamura
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-21  5:57 UTC (permalink / raw)
  To: kvm, qemu-devel
  Cc: aliguori, ohmura.kei, mtosatti, Yoshiaki Tamura, yoshikawa.takuya,
	avi
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
 osdep.c       |   13 +++++++++++++
 qemu-char.c   |   25 ++++++++++++++++++++++++-
 qemu_socket.h |    4 ++++
 3 files changed, 41 insertions(+), 1 deletions(-)
diff --git a/osdep.c b/osdep.c
index 3bab79a..63444e7 100644
--- a/osdep.c
+++ b/osdep.c
@@ -201,6 +201,12 @@ void socket_set_nonblock(int fd)
     ioctlsocket(fd, FIONBIO, &opt);
 }
 
+void socket_set_block(int fd)
+{
+    unsigned long opt = 0;
+    ioctlsocket(fd, FIONBIO, &opt);
+}
+
 int inet_aton(const char *cp, struct in_addr *ia)
 {
     uint32_t addr = inet_addr(cp);
@@ -223,6 +229,13 @@ void socket_set_nonblock(int fd)
     fcntl(fd, F_SETFL, f | O_NONBLOCK);
 }
 
+void socket_set_block(int fd)
+{
+    int f;
+    f = fcntl(fd, F_GETFL);
+    fcntl(fd, F_SETFL, f & ~O_NONBLOCK);
+}
+
 void qemu_set_cloexec(int fd)
 {
     int f;
diff --git a/qemu-char.c b/qemu-char.c
index 4169492..ccdf394 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -2092,12 +2092,35 @@ static void tcp_chr_telnet_init(int fd)
     send(fd, (char *)buf, 3, 0);
 }
 
-static void socket_set_nodelay(int fd)
+void socket_set_delay(int fd)
+{
+    int val = 0;
+    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, (char *)&val, sizeof(val));
+}
+
+void socket_set_nodelay(int fd)
 {
     int val = 1;
     setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, (char *)&val, sizeof(val));
 }
 
+void socket_set_timeout(int fd, int s)
+{
+    struct timeval tv = {
+        .tv_sec = s,
+        .tv_usec = 0
+    };
+    /* Set socket_timeout */
+    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO,
+                   &tv, sizeof(tv)) < 0) {
+        fprintf(stderr, "failed to set SO_RCVTIMEO\n");
+    }
+    if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO,
+                   &tv, sizeof(tv)) < 0) {
+        fprintf(stderr, "fialed to set SO_SNDTIMEO\n");
+    }
+}
+
 static void tcp_chr_accept(void *opaque)
 {
     CharDriverState *chr = opaque;
diff --git a/qemu_socket.h b/qemu_socket.h
index 7ee46ac..8eae465 100644
--- a/qemu_socket.h
+++ b/qemu_socket.h
@@ -35,6 +35,10 @@ int inet_aton(const char *cp, struct in_addr *ia);
 int qemu_socket(int domain, int type, int protocol);
 int qemu_accept(int s, struct sockaddr *addr, socklen_t *addrlen);
 void socket_set_nonblock(int fd);
+void socket_set_block(int fd);
+void socket_set_nodelay(int fd);
+void socket_set_delay(int fd);
+void socket_set_timeout(int fd, int s);
 int send_all(int fd, const void *buf, int len1);
 
 /* New, ipv6-ready socket helper functions, see qemu-sockets.c */
-- 
1.7.0.31.g1df487
^ permalink raw reply related	[flat|nested] 74+ messages in thread
* [Qemu-devel] [RFC PATCH 12/20] Introduce fault tolerant VM transaction QEMUFile and ft_mode.
  2010-04-21  5:57 [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Yoshiaki Tamura
                   ` (10 preceding siblings ...)
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 11/20] Introduce some socket util functions Yoshiaki Tamura
@ 2010-04-21  5:57 ` Yoshiaki Tamura
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 13/20] Introduce util functions to control ft_transaction from savevm layer Yoshiaki Tamura
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-21  5:57 UTC (permalink / raw)
  To: kvm, qemu-devel
  Cc: aliguori, ohmura.kei, mtosatti, Yoshiaki Tamura, yoshikawa.takuya,
	avi
This code implements VM transaction protocol.  Like buffered_file, it
sits between savevm and migration layer.  With this architecture, VM
transaction protocol is implemented mostly independent from other
existing code.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
---
 Makefile.objs    |    1 +
 ft_transaction.c |  423 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 ft_transaction.h |   57 ++++++++
 migration.c      |    3 +
 4 files changed, 484 insertions(+), 0 deletions(-)
 create mode 100644 ft_transaction.c
 create mode 100644 ft_transaction.h
diff --git a/Makefile.objs b/Makefile.objs
index b73e2cb..4388fb3 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -78,6 +78,7 @@ common-obj-y += qemu-char.o savevm.o #aio.o
 common-obj-y += msmouse.o ps2.o
 common-obj-y += qdev.o qdev-properties.o
 common-obj-y += qemu-config.o block-migration.o
+common-obj-y += ft_transaction.o
 
 common-obj-$(CONFIG_BRLAPI) += baum.o
 common-obj-$(CONFIG_POSIX) += migration-exec.o migration-unix.o migration-fd.o
diff --git a/ft_transaction.c b/ft_transaction.c
new file mode 100644
index 0000000..d0cbc99
--- /dev/null
+++ b/ft_transaction.c
@@ -0,0 +1,423 @@
+/*
+ * Fault tolerant VM transaction QEMUFile
+ *
+ * Copyright (c) 2010 Nippon Telegraph and Telephone Corporation. 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * This source code is based on buffered_file.c.
+ * Copyright IBM, Corp. 2008
+ * Authors:
+ *  Anthony Liguori        <aliguori@us.ibm.com>
+ */
+
+#include "qemu-common.h"
+#include "hw/hw.h"
+#include "qemu-timer.h"
+#include "sysemu.h"
+#include "qemu-char.h"
+#include "ft_transaction.h"
+
+// #define DEBUG_FT_TRANSACTION
+
+typedef struct QEMUFileFtTranx
+{
+    FtTranxPutBufferFunc *put_buffer;
+    FtTranxPutVectorFunc *put_vector;
+    FtTranxGetBufferFunc *get_buffer;
+    FtTranxGetVectorFunc *get_vector;
+    FtTranxCloseFunc *close;
+    void *opaque;
+    QEMUFile *file;
+    int has_error;
+    int is_sender;
+    int buf_max_size;
+    enum QEMU_VM_TRANSACTION_STATE tranx_state;
+    uint16_t tranx_id;
+    uint32_t seq;
+} QEMUFileFtTranx;
+
+#define IO_BUF_SIZE 32768
+
+#ifdef DEBUG_FT_TRANSACTION
+#define dprintf(fmt, ...) \
+    do { printf("ft_transaction: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define dprintf(fmt, ...) \
+    do { } while (0)
+#endif
+
+static ssize_t ft_tranx_flush_buffer(void *opaque, void *buf, int size)
+{
+    QEMUFileFtTranx *s = opaque;
+    size_t offset = 0;
+    ssize_t len;
+
+    while (offset < size) {
+        len = s->put_buffer(s->opaque, (uint8_t *)buf + offset, size - offset);
+
+        if (len <= 0) {
+            fprintf(stderr, "ft transaction flush buffer failed \n");
+            s->has_error = 1;
+            offset = -EINVAL;
+            break;
+        }
+
+        offset += len;
+    }
+
+    return offset;
+}
+
+static int ft_tranx_send_header(QEMUFileFtTranx *s)
+{
+    int ret = -1;
+
+    dprintf("send header %d\n", s->tranx_state);
+
+    ret = ft_tranx_flush_buffer(s, &s->tranx_state, sizeof(uint16_t));
+    if (ret < 0) {
+        goto out;
+    }
+    ret = ft_tranx_flush_buffer(s, &s->tranx_id, sizeof(uint16_t));    
+
+out:
+    return ret;
+}
+
+static int ft_tranx_put_buffer(void *opaque, const uint8_t *buf, int64_t pos, int size)
+{
+    QEMUFileFtTranx *s = opaque;
+    ssize_t ret = -1;
+
+    if (s->has_error) {
+        fprintf(stderr, "flush when error, bailing\n");
+        return -EINVAL;
+    }
+
+    ret = ft_tranx_send_header(s);
+    if (ret < 0) {
+        goto out;
+    }
+
+    ret = ft_tranx_flush_buffer(s, &s->seq, sizeof(s->seq));
+    if (ret < 0) {
+        goto out;
+    }
+    s->seq++;
+
+    ret = ft_tranx_flush_buffer(s, &size, sizeof(uint32_t));
+    if (ret < 0) {
+        goto out;
+    }
+
+    ret = ft_tranx_flush_buffer(s, (uint8_t *)buf, size);
+
+out:
+    return ret;
+}
+
+static int ft_tranx_put_vector(void *opaque, struct iovec *vector, int64_t pos, int count)
+{
+    QEMUFileFtTranx *s = opaque;
+    ssize_t ret = -1;
+    int i;
+    uint32_t size = 0;
+
+    dprintf("putting %d vectors at %" PRId64 "\n", count, pos);
+
+    if (s->has_error) {
+        dprintf("put vector when error, bailing\n");
+        return -EINVAL;
+    }
+
+    ret = ft_tranx_send_header(s);
+    if (ret < 0) {
+        return ret;
+    }
+
+    ret = ft_tranx_flush_buffer(s, &s->seq, sizeof(s->seq));
+    if (ret < 0) {
+        return ret;
+    }
+    s->seq++;
+
+    for (i = 0; i < count; i++)
+        size += vector[i].iov_len;
+
+    ret = ft_tranx_flush_buffer(s, &size, sizeof(uint32_t));
+    if (ret < 0) {
+        return ret;
+    }
+
+    while (count > 0) {
+        /* 
+         * It will continue calling put_vector even if count > IOV_MAX.
+         */
+        ret = s->put_vector(s->opaque, vector,
+            ((count>IOV_MAX)?IOV_MAX:count));
+
+        if (ret <= 0) {
+            fprintf(stderr, "ft transaction putting vector\n");
+            s->has_error = 1;
+            return ret;
+        }
+
+        for (i = 0; i < count; i++) {
+            /* ret represents -(length of remaining data). */
+            ret -= vector[i].iov_len;
+            if (ret < 0) {
+                vector[i].iov_base += (ret + vector[i].iov_len);
+                vector[i].iov_len = -ret;
+                vector = &vector[i];
+                break;
+            }
+        }
+        count -= i;
+    }
+
+    return 0;
+}
+
+static inline int ft_tranx_fill_buffer(void *opaque, void *buf, int size)
+{
+    QEMUFileFtTranx *s = opaque;
+    size_t offset = 0;
+    ssize_t len;
+
+    while (ft_mode != FT_ERROR && offset < size) {
+        len = s->get_buffer(s->opaque, (uint8_t *)buf + offset,
+                            0, size - offset);
+        if (len <= 0) {
+            fprintf(stderr, "ft_tranx fill buffer failed\n");
+            s->has_error = 1;
+            return -EINVAL;
+        }
+        offset += len;
+    }
+    return 0;
+}
+
+/* return QEMU_VM_TRANSACTION type */
+static int ft_tranx_get_next(QEMUFileFtTranx *s)
+{
+    uint16_t header;
+    uint16_t tranx_id;
+
+    if ((ft_tranx_fill_buffer(s, &header, sizeof(header)) < 0) ||
+        (ft_tranx_fill_buffer(s, &tranx_id, sizeof(tranx_id)) < 0)) {
+        return QEMU_VM_TRANSACTION_CANCEL;
+    }
+
+    s->tranx_id = tranx_id;
+
+    return header;
+}
+
+static int ft_tranx_get_buffer(void *opaque, uint8_t *buf,
+                               int64_t pos, int size)
+{
+    QEMUFileFtTranx *s = opaque;
+    QEMUFile *f = s->file;
+    uint32_t payload_len;
+    int ret = -1, offset;
+
+    /* get transaction header*/
+    ret = ft_tranx_get_next(s);
+    switch (ret) {
+    case QEMU_VM_TRANSACTION_BEGIN:
+        for (offset = 0;;) {
+            ret = ft_tranx_get_next(s);
+            /* CONTINUE or COMMIT must come afer BEGIN */
+            if ((ret != QEMU_VM_TRANSACTION_CONTINUE) &&
+                (ret != QEMU_VM_TRANSACTION_COMMIT)) {
+                goto error_out;
+            }
+
+            if (ft_tranx_fill_buffer(s, &s->seq, sizeof(s->seq)) < 0) {
+                goto error_out;
+            }
+
+            if (ret == QEMU_VM_TRANSACTION_COMMIT) {
+                ret = offset;
+                dprintf("QEMU_VM_TRANSACTION_COMMIT %d\n", offset);
+                break;
+            }
+
+            if (ft_tranx_fill_buffer(s, &payload_len,
+                                     sizeof(payload_len)) < 0) {
+                goto error_out;
+            }
+  
+            /* Extend QEMUFile buf if there weren't enough space. */
+            if (payload_len > (s->buf_max_size - offset)) {
+                s->buf_max_size += (payload_len - (s->buf_max_size - offset));
+                buf = qemu_realloc_buffer(f, s->buf_max_size);
+            }
+
+            if (ft_tranx_fill_buffer(s, buf + offset, payload_len) < 0) {
+                goto error_out;
+            }
+            offset += payload_len;
+        }
+
+        s->tranx_state = QEMU_VM_TRANSACTION_ACK;
+        if (ft_tranx_send_header(s) < 0) {
+            goto error_out;
+        }
+        goto out;
+
+    case QEMU_VM_TRANSACTION_ATOMIC:
+        /* not implemented yet */
+        fprintf(stderr, "QEMU_VM_TRANSACTION_ATOMIC not implemented. %d\n",
+                ret);
+        goto error_out;
+
+    case QEMU_VM_TRANSACTION_CANCEL:
+        dprintf("ft transaction canceled %d\n", ret);
+        ret = -1;
+        ft_mode = FT_OFF;
+        goto out;
+
+    default:
+        fprintf(stderr, "unknown QEMU_VM_TRANSACTION_STATE %d\n", ret);
+    }
+
+error_out:
+    ret = -1;
+    ft_mode = FT_ERROR;
+out:
+    return ret;
+}
+
+static int ft_tranx_close(void *opaque)
+{
+    QEMUFileFtTranx *s = opaque;
+    int ret = -1;
+
+    dprintf("closing\n");
+    ret = s->close(s->opaque);
+    qemu_free(s);
+
+    return ret;
+}
+
+int qemu_ft_tranx_begin(void *opaque)
+{
+    QEMUFileFtTranx *s = opaque;
+    int ret = -1;
+    s->seq = 0;
+
+    if (!s->is_sender && s->tranx_state == QEMU_VM_TRANSACTION_INIT) {
+        /* receiver sends QEMU_VM_TRANSACTION_ACK to start transaction.  */
+        s->tranx_state = QEMU_VM_TRANSACTION_ACK;
+        ret = ft_tranx_send_header(s);
+        goto out;
+    } 
+
+    if (s->is_sender) {
+        if (s->tranx_state == QEMU_VM_TRANSACTION_INIT) {
+            ret = ft_tranx_get_next(s);
+            if (ret != QEMU_VM_TRANSACTION_ACK) {
+                fprintf(stderr, "ft_transaction receiving ack failed\n");
+                ret = -1;
+                goto out;
+            }
+        }
+
+        s->tranx_state = QEMU_VM_TRANSACTION_BEGIN;
+        if ((ret = ft_tranx_send_header(s)) < 0) {
+            goto out;
+        }
+
+        s->tranx_state = QEMU_VM_TRANSACTION_CONTINUE;
+        ret = 0;
+    }
+
+out:
+    return ret;
+}
+
+int qemu_ft_tranx_commit(void *opaque)
+{
+    QEMUFileFtTranx *s = opaque;
+    int ret = -1;
+
+    if (!s->is_sender) {
+        s->tranx_state = QEMU_VM_TRANSACTION_ACK;
+        ret = ft_tranx_send_header(s);
+    } else {
+        /* flush buf before sending COMMIT */
+        qemu_fflush(s->file);
+
+        s->tranx_state = QEMU_VM_TRANSACTION_COMMIT;
+        ret = ft_tranx_send_header(s);
+        if (ret < 0) {
+            return ret;
+        }
+
+        ret = ft_tranx_flush_buffer(s, &s->seq, sizeof(s->seq));
+        if (ret < 0) {
+            return ret;
+        }
+
+        /* FIX ME: can we remove this if statement? */
+        if (ret >= 0) {
+            ret = ft_tranx_get_next(s);
+            if (ret != QEMU_VM_TRANSACTION_ACK) {
+                fprintf(stderr, "ft_transaction receiving ack failed\n");
+                return -1;
+            }
+        }
+
+        s->tranx_id++;
+    }
+
+    return ret;
+}
+
+int qemu_ft_tranx_cancel(void *opaque)
+{
+    QEMUFileFtTranx *s = opaque;
+    int ret = -1;
+
+    if (s->is_sender) {
+        s->tranx_state = QEMU_VM_TRANSACTION_CANCEL;
+        if ((ret = ft_tranx_send_header(s)) < 0) {
+            fprintf(stderr, "ft cancel failed\n");
+        }
+    }
+    
+    return ret;
+}
+
+QEMUFile *qemu_fopen_ops_ft_tranx(void *opaque,
+                                  FtTranxPutBufferFunc *put_buffer,
+                                  FtTranxPutVectorFunc *put_vector,
+                                  FtTranxGetBufferFunc *get_buffer,
+                                  FtTranxGetVectorFunc *get_vector,
+                                  FtTranxCloseFunc *close,
+                                  int is_sender)
+{
+    QEMUFileFtTranx *s;
+
+    s = qemu_mallocz(sizeof(*s));
+
+    s->opaque = opaque;
+    s->put_buffer = put_buffer;
+    s->put_vector = put_vector;
+    s->get_buffer = get_buffer;
+    s->get_vector = get_vector;
+    s->close = close;
+    s->buf_max_size = IO_BUF_SIZE;
+    s->is_sender = is_sender;
+    s->tranx_id = 0;
+    s->seq = 0;
+
+    s->file = qemu_fopen_ops(s, ft_tranx_put_buffer, ft_tranx_put_vector,
+                             ft_tranx_get_buffer, NULL, ft_tranx_close,
+                             NULL, NULL, NULL);
+
+    return s->file;
+}
diff --git a/ft_transaction.h b/ft_transaction.h
new file mode 100644
index 0000000..3f7cbd2
--- /dev/null
+++ b/ft_transaction.h
@@ -0,0 +1,57 @@
+/*
+ * Fault tolerant VM transaction QEMUFile
+ *
+ * Copyright (c) 2010 Nippon Telegraph and Telephone Corporation. 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * This source code is based on buffered_file.h.
+ * Copyright IBM, Corp. 2008
+ * Authors:
+ *  Anthony Liguori        <aliguori@us.ibm.com>
+ */
+
+#ifndef QEMU_FT_TRANSACTION_FILE_H
+#define QEMU_FT_TRANSACTION_FILE_H
+
+#include "hw/hw.h"
+
+enum QEMU_VM_TRANSACTION_STATE {
+    QEMU_VM_TRANSACTION_INIT,
+    QEMU_VM_TRANSACTION_BEGIN,
+    QEMU_VM_TRANSACTION_CONTINUE,
+    QEMU_VM_TRANSACTION_COMMIT,
+    QEMU_VM_TRANSACTION_CANCEL,
+    QEMU_VM_TRANSACTION_ATOMIC,
+    QEMU_VM_TRANSACTION_ACK,
+    QEMU_VM_TRANSACTION_NACK,
+};
+
+enum FT_MODE {
+    FT_OFF,
+    FT_INIT,
+    FT_TRANSACTION,
+    FT_ERROR,
+};
+extern enum FT_MODE ft_mode;
+
+typedef ssize_t (FtTranxPutBufferFunc)(void *opaque, const void *data, size_t size);
+typedef ssize_t (FtTranxPutVectorFunc)(void *opaque, const struct iovec *iov, int iovcnt);
+typedef QEMUFileGetBufferFunc FtTranxGetBufferFunc;
+typedef QEMUFileGetVectorFunc FtTranxGetVectorFunc;
+typedef int (FtTranxCloseFunc)(void *opaque);
+
+int qemu_ft_tranx_begin(void *opaque);
+int qemu_ft_tranx_commit(void *opaque);
+int qemu_ft_tranx_cancel(void *opaque);
+
+QEMUFile *qemu_fopen_ops_ft_tranx(void *opaque, 
+                                  FtTranxPutBufferFunc *put_buffer,
+                                  FtTranxPutVectorFunc *put_vector,
+                                  FtTranxGetBufferFunc *get_buffer,
+                                  FtTranxGetVectorFunc *get_vector,
+                                  FtTranxCloseFunc *close,
+                                  int is_sender);
+
+#endif
diff --git a/migration.c b/migration.c
index 5d238f5..4eed0b7 100644
--- a/migration.c
+++ b/migration.c
@@ -15,6 +15,7 @@
 #include "migration.h"
 #include "monitor.h"
 #include "buffered_file.h"
+#include "ft_transaction.h"
 #include "sysemu.h"
 #include "block.h"
 #include "qemu_socket.h"
@@ -31,6 +32,8 @@
     do { } while (0)
 #endif
 
+enum FT_MODE ft_mode = FT_OFF;
+
 /* Migration speed throttling */
 static uint32_t max_throttle = (32 << 20);
 
-- 
1.7.0.31.g1df487
^ permalink raw reply related	[flat|nested] 74+ messages in thread
* [Qemu-devel] [RFC PATCH 13/20] Introduce util functions to control ft_transaction from savevm layer.
  2010-04-21  5:57 [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Yoshiaki Tamura
                   ` (11 preceding siblings ...)
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 12/20] Introduce fault tolerant VM transaction QEMUFile and ft_mode Yoshiaki Tamura
@ 2010-04-21  5:57 ` Yoshiaki Tamura
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 14/20] Upgrade QEMU_FILE_VERSION from 3 to 4, and introduce qemu_savevm_state_all() Yoshiaki Tamura
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-21  5:57 UTC (permalink / raw)
  To: kvm, qemu-devel
  Cc: aliguori, ohmura.kei, mtosatti, Yoshiaki Tamura, yoshikawa.takuya,
	avi
To utilize ft_transaction function, savevm needs interfaces to be
exported.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
 hw/hw.h  |    5 +++++
 savevm.c |   41 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+), 0 deletions(-)
diff --git a/hw/hw.h b/hw/hw.h
index 10e6dda..fcee660 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -70,6 +70,8 @@ QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
 QEMUFile *qemu_fopen(const char *filename, const char *mode);
 QEMUFile *qemu_fdopen(int fd, const char *mode);
 QEMUFile *qemu_fopen_socket(int fd);
+QEMUFile *qemu_fopen_transaction(int fd);
+QEMUFile *qemu_fopen_tranx_sender(void *opaque);
 QEMUFile *qemu_popen(FILE *popen_file, const char *mode);
 QEMUFile *qemu_popen_cmd(const char *command, const char *mode);
 int qemu_stdio_fd(QEMUFile *f);
@@ -81,6 +83,9 @@ void qemu_put_vector(QEMUFile *f, QEMUIOVector *qiov);
 void qemu_put_vector_prepare(QEMUFile *f);
 void *qemu_realloc_buffer(QEMUFile *f, int size);
 void qemu_clear_buffer(QEMUFile *f);
+int qemu_transaction_begin(QEMUFile *f);
+int qemu_transaction_commit(QEMUFile *f);
+int qemu_transaction_cancel(QEMUFile *f);
 
 static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v)
 {
diff --git a/savevm.c b/savevm.c
index a401b27..292ae32 100644
--- a/savevm.c
+++ b/savevm.c
@@ -82,6 +82,7 @@
 #include "migration.h"
 #include "qemu_socket.h"
 #include "qemu-queue.h"
+#include "ft_transaction.h"
 
 /* point to the block driver where the snapshots are managed */
 static BlockDriverState *bs_snapshots;
@@ -210,6 +211,21 @@ static int socket_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
     return len;
 }
 
+static ssize_t socket_put_buffer(void *opaque, const void *buf, size_t size)
+{
+    QEMUFileSocket *s = opaque;
+    ssize_t len;
+
+    do {
+        len = send(s->fd, (void *)buf, size, 0);
+    } while (len == -1 && socket_error() == EINTR);
+
+    if (len == -1)
+        len = -socket_error();
+
+    return len;
+}
+
 static int socket_close(void *opaque)
 {
     QEMUFileSocket *s = opaque;
@@ -338,6 +354,16 @@ QEMUFile *qemu_fopen_socket(int fd)
     return s->file;
 }
 
+QEMUFile *qemu_fopen_transaction(int fd)
+{
+    QEMUFileSocket *s = qemu_mallocz(sizeof(QEMUFileSocket));
+
+    s->fd = fd;
+    s->file = qemu_fopen_ops_ft_tranx(s, socket_put_buffer, NULL,
+                                      socket_get_buffer, NULL, socket_close, 0);
+    return s->file;
+}
+
 static int file_put_buffer(void *opaque, const uint8_t *buf,
                             int64_t pos, int size)
 {
@@ -485,6 +511,21 @@ void qemu_clear_buffer(QEMUFile *f)
     memset(f->buf, 0, f->buf_max_size);
 }
 
+int qemu_transaction_begin(QEMUFile *f)
+{
+    return qemu_ft_tranx_begin(f->opaque);
+}
+
+int qemu_transaction_commit(QEMUFile *f)
+{
+    return qemu_ft_tranx_commit(f->opaque);
+}
+
+int qemu_transaction_cancel(QEMUFile *f)
+{
+    return qemu_ft_tranx_cancel(f->opaque);
+}
+
 static void qemu_fill_buffer(QEMUFile *f)
 {
     int len;
-- 
1.7.0.31.g1df487
^ permalink raw reply related	[flat|nested] 74+ messages in thread
* [Qemu-devel] [RFC PATCH 14/20] Upgrade QEMU_FILE_VERSION from 3 to 4, and introduce qemu_savevm_state_all().
  2010-04-21  5:57 [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Yoshiaki Tamura
                   ` (12 preceding siblings ...)
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 13/20] Introduce util functions to control ft_transaction from savevm layer Yoshiaki Tamura
@ 2010-04-21  5:57 ` Yoshiaki Tamura
  2010-04-22 19:37   ` [Qemu-devel] " Anthony Liguori
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 15/20] Introduce FT mode support to configure Yoshiaki Tamura
                   ` (8 subsequent siblings)
  22 siblings, 1 reply; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-21  5:57 UTC (permalink / raw)
  To: kvm, qemu-devel
  Cc: aliguori, ohmura.kei, mtosatti, Yoshiaki Tamura, yoshikawa.takuya,
	avi
Make a 32bit entry after QEMU_VM_FILE_VERSION to recognize whether the
transfered data is QEMU_VM_FT_MODE or QEMU_VM_LIVE_MIGRATION_MODE.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
 savevm.c |   76 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 sysemu.h |    1 +
 2 files changed, 75 insertions(+), 2 deletions(-)
diff --git a/savevm.c b/savevm.c
index 292ae32..19b3efb 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1402,8 +1402,10 @@ static void vmstate_save(QEMUFile *f, SaveStateEntry *se)
 }
 
 #define QEMU_VM_FILE_MAGIC           0x5145564d
-#define QEMU_VM_FILE_VERSION_COMPAT  0x00000002
-#define QEMU_VM_FILE_VERSION         0x00000003
+#define QEMU_VM_FILE_VERSION_COMPAT  0x00000003
+#define QEMU_VM_FILE_VERSION         0x00000004 
+#define QEMU_VM_LIVE_MIGRATION_MODE  0x00000005
+#define QEMU_VM_FT_MODE              0x00000006
 
 #define QEMU_VM_EOF                  0x00
 #define QEMU_VM_SECTION_START        0x01
@@ -1425,6 +1427,12 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
     
     qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
     qemu_put_be32(f, QEMU_VM_FILE_VERSION);
+    
+    if (ft_mode) {
+        qemu_put_be32(f, QEMU_VM_FT_MODE);
+    } else {
+        qemu_put_be32(f, QEMU_VM_LIVE_MIGRATION_MODE);
+    }
 
     QTAILQ_FOREACH(se, &savevm_handlers, entry) {
         int len;
@@ -1533,6 +1541,66 @@ int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f)
     return 0;
 }
 
+int qemu_savevm_state_all(Monitor *mon, QEMUFile *f)
+{
+    SaveStateEntry *se;
+    
+    QTAILQ_FOREACH(se, &savevm_handlers, entry) {
+        int len;
+
+        if (se->save_live_state == NULL)
+            continue;
+
+        /* Section type */
+        qemu_put_byte(f, QEMU_VM_SECTION_START);
+        qemu_put_be32(f, se->section_id);
+
+        /* ID string */
+        len = strlen(se->idstr);
+        qemu_put_byte(f, len);
+        qemu_put_buffer(f, (uint8_t *)se->idstr, len);
+
+        qemu_put_be32(f, se->instance_id);
+        qemu_put_be32(f, se->version_id);
+        if (ft_mode == FT_INIT) {
+            /* This is workaround. */
+            se->save_live_state(mon, f, QEMU_VM_SECTION_START, se->opaque);
+        } else {
+            se->save_live_state(mon, f, QEMU_VM_SECTION_PART, se->opaque);
+        }
+    }
+
+    ft_mode = FT_TRANSACTION;
+    QTAILQ_FOREACH(se, &savevm_handlers, entry) {
+        int len;
+
+	if (se->save_state == NULL && se->vmsd == NULL)
+	    continue;
+
+        /* Section type */
+        qemu_put_byte(f, QEMU_VM_SECTION_FULL);
+        qemu_put_be32(f, se->section_id);
+
+        /* ID string */
+        len = strlen(se->idstr);
+        qemu_put_byte(f, len);
+        qemu_put_buffer(f, (uint8_t *)se->idstr, len);
+
+        qemu_put_be32(f, se->instance_id);
+        qemu_put_be32(f, se->version_id);
+
+        vmstate_save(f, se);
+    }
+
+    qemu_put_byte(f, QEMU_VM_EOF);
+
+    if (qemu_file_has_error(f))
+        return -EIO;
+
+    return 0;
+}
+
+
 void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f)
 {
     SaveStateEntry *se;
@@ -1617,6 +1685,10 @@ int qemu_loadvm_state(QEMUFile *f, int skip_header)
         if (v != QEMU_VM_FILE_VERSION)
             return -ENOTSUP;
 
+        v = qemu_get_be32(f);
+        if (v == QEMU_VM_FT_MODE) {
+            ft_mode = FT_INIT;
+        }
     }
 
     while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
diff --git a/sysemu.h b/sysemu.h
index 6c1441f..df314bb 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -67,6 +67,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
                             int shared);
 int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
 int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
+int qemu_savevm_state_all(Monitor *mon, QEMUFile *f);
 void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
 int qemu_loadvm_state(QEMUFile *f, int skip_header);
 
-- 
1.7.0.31.g1df487
^ permalink raw reply related	[flat|nested] 74+ messages in thread
* [Qemu-devel] [RFC PATCH 15/20] Introduce FT mode support to configure.
  2010-04-21  5:57 [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Yoshiaki Tamura
                   ` (13 preceding siblings ...)
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 14/20] Upgrade QEMU_FILE_VERSION from 3 to 4, and introduce qemu_savevm_state_all() Yoshiaki Tamura
@ 2010-04-21  5:57 ` Yoshiaki Tamura
  2010-04-22 19:38   ` [Qemu-devel] " Anthony Liguori
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 16/20] Introduce event_tap fucntions and ft_tranx_ready() Yoshiaki Tamura
                   ` (7 subsequent siblings)
  22 siblings, 1 reply; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-21  5:57 UTC (permalink / raw)
  To: kvm, qemu-devel
  Cc: aliguori, ohmura.kei, mtosatti, Yoshiaki Tamura, yoshikawa.takuya,
	avi
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
 configure |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)
diff --git a/configure b/configure
index 046c591..f0682d4 100755
--- a/configure
+++ b/configure
@@ -298,6 +298,7 @@ bsd_user="no"
 guest_base=""
 uname_release=""
 io_thread="no"
+ft_mode="no"
 mixemu="no"
 kvm_trace="no"
 kvm_cap_pit=""
@@ -671,6 +672,8 @@ for opt do
   ;;
   --enable-io-thread) io_thread="yes"
   ;;
+  --enable-ft-mode) ft_mode="yes"
+  ;;
   --disable-blobs) blobs="no"
   ;;
   --kerneldir=*) kerneldir="$optarg"
@@ -840,6 +843,7 @@ echo "  --enable-vde             enable support for vde network"
 echo "  --disable-linux-aio      disable Linux AIO support"
 echo "  --enable-linux-aio       enable Linux AIO support"
 echo "  --enable-io-thread       enable IO thread"
+echo "  --enable-ft-mode         enable FT mode support"
 echo "  --disable-blobs          disable installing provided firmware blobs"
 echo "  --kerneldir=PATH         look for kernel includes in PATH"
 echo "  --with-kvm-trace         enable building the KVM module with the kvm trace option"
@@ -2117,6 +2121,7 @@ echo "GUEST_BASE        $guest_base"
 echo "PIE user targets  $user_pie"
 echo "vde support       $vde"
 echo "IO thread         $io_thread"
+echo "FT mode support   $ft_mode"
 echo "Linux AIO support $linux_aio"
 echo "Install blobs     $blobs"
 echo "KVM support       $kvm"
@@ -2318,6 +2323,9 @@ fi
 if test "$io_thread" = "yes" ; then
   echo "CONFIG_IOTHREAD=y" >> $config_host_mak
 fi
+if test "$ft_mode" = "yes" ; then
+  echo "CONFIG_FT_MODE=y" >> $config_host_mak
+fi
 if test "$linux_aio" = "yes" ; then
   echo "CONFIG_LINUX_AIO=y" >> $config_host_mak
 fi
-- 
1.7.0.31.g1df487
^ permalink raw reply related	[flat|nested] 74+ messages in thread
* [Qemu-devel] [RFC PATCH 16/20] Introduce event_tap fucntions and ft_tranx_ready().
  2010-04-21  5:57 [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Yoshiaki Tamura
                   ` (14 preceding siblings ...)
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 15/20] Introduce FT mode support to configure Yoshiaki Tamura
@ 2010-04-21  5:57 ` Yoshiaki Tamura
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 17/20] Modify migrate_fd_put_ready() when ft_mode is on Yoshiaki Tamura
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-21  5:57 UTC (permalink / raw)
  To: kvm, qemu-devel
  Cc: aliguori, ohmura.kei, mtosatti, Yoshiaki Tamura, yoshikawa.takuya,
	avi
event_tap controls when to start ft transaction.  do_event_tap()
should be instered to the device emulators.
ft_tranx_ready() kicks the transaction.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
 migration.c   |   77 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 migration.h   |    2 +
 qemu-common.h |   19 ++++++++++++++
 3 files changed, 98 insertions(+), 0 deletions(-)
diff --git a/migration.c b/migration.c
index 4eed0b7..3cc47fc 100644
--- a/migration.c
+++ b/migration.c
@@ -39,6 +39,45 @@ static uint32_t max_throttle = (32 << 20);
 
 static MigrationState *current_migration;
 
+#ifdef CONFIG_FT_MODE
+static enum EVENT_TAP_STATE event_tap_state = EVENT_TAP_OFF;
+
+void event_tap_on(void)
+{
+    event_tap_state = EVENT_TAP_ON;
+}
+
+void event_tap_off(void)
+{
+    event_tap_state = EVENT_TAP_OFF;
+}
+
+void event_tap_suspend(void)
+{
+    if (event_tap_state == EVENT_TAP_ON)
+        event_tap_state = EVENT_TAP_SUSPEND;
+}
+
+void event_tap_resume(void)
+{
+    if (event_tap_state == EVENT_TAP_SUSPEND)
+        event_tap_state = EVENT_TAP_ON;
+}
+
+void do_event_tap(void)
+{
+    if (event_tap_state != EVENT_TAP_ON)
+        return;
+
+    if (ft_mode == FT_TRANSACTION || ft_mode == FT_INIT) {
+        if (ft_tranx_ready(current_migration) < 0) {
+            event_tap_off();
+            vm_start();
+        }
+    }
+}
+#endif
+
 void qemu_start_incoming_migration(const char *uri)
 {
     const char *p;
@@ -390,6 +429,44 @@ void migrate_fd_connect(FdMigrationState *s)
     migrate_fd_put_ready(s);
 }
 
+int ft_tranx_ready(void *opaque)
+{
+    FdMigrationState *s = migrate_to_fms(opaque);
+    int ret = -1;
+    
+    if (qemu_transaction_begin(s->file) < 0) {
+        fprintf(stderr, "tranx_begin failed\n");
+        goto error_out;
+    }
+
+    /* make the VM state consistent by flushing outstanding requests. */
+    vm_stop(0);
+    qemu_aio_flush();
+    bdrv_flush_all();
+
+    if (qemu_savevm_state_all(s->mon, s->file) < 0) {
+        fprintf(stderr, "savevm_state_all failed\n");
+        goto error_out;
+    }
+
+    if (qemu_transaction_commit(s->file) < 0) {
+        fprintf(stderr, "tranx_commit failed\n");
+        goto error_out;
+    }
+
+    ret = 0;
+    vm_start();
+
+    return ret;
+
+error_out:
+    ft_mode = FT_OFF;
+    qemu_savevm_state_cancel(s->mon, s->file);
+    migrate_fd_cleanup(s);
+
+    return ret;
+}
+
 void migrate_fd_put_ready(void *opaque)
 {
     FdMigrationState *s = opaque;
diff --git a/migration.h b/migration.h
index ddc1d42..41ee3fe 100644
--- a/migration.h
+++ b/migration.h
@@ -133,6 +133,8 @@ void migrate_fd_wait_for_unfreeze(void *opaque);
 
 int migrate_fd_close(void *opaque);
 
+int ft_tranx_ready(void *opaque);
+
 static inline FdMigrationState *migrate_to_fms(MigrationState *mig_state)
 {
     return container_of(mig_state, FdMigrationState, mig_state);
diff --git a/qemu-common.h b/qemu-common.h
index 0af30d2..5753af2 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -294,4 +294,23 @@ static inline uint8_t from_bcd(uint8_t val)
 
 #endif /* dyngen-exec.h hack */
 
+#ifdef CONFIG_FT_MODE
+enum EVENT_TAP_STATE {
+    EVENT_TAP_OFF,
+    EVENT_TAP_ON,
+    EVENT_TAP_SUSPEND,
+};
+void event_tap_on(void);
+void event_tap_off(void);
+void event_tap_suspend(void);
+void event_tap_resume(void);
+void do_event_tap(void);
+#else
+#define event_tap_on() do { } while (0)
+#define event_tap_off() do { } while (0)
+#define event_tap_suspend() do { } while (0)
+#define event_tap_resume() do { } while (0)
+#define do_event_tap() do { } while (0)
+#endif
+
 #endif
-- 
1.7.0.31.g1df487
^ permalink raw reply related	[flat|nested] 74+ messages in thread
* [Qemu-devel] [RFC PATCH 17/20] Modify migrate_fd_put_ready() when ft_mode is on.
  2010-04-21  5:57 [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Yoshiaki Tamura
                   ` (15 preceding siblings ...)
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 16/20] Introduce event_tap fucntions and ft_tranx_ready() Yoshiaki Tamura
@ 2010-04-21  5:57 ` Yoshiaki Tamura
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 18/20] Modify tcp_accept_incoming_migration() to handle ft_mode, and add a hack not to close fd when ft_mode is enabled Yoshiaki Tamura
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-21  5:57 UTC (permalink / raw)
  To: kvm, qemu-devel
  Cc: aliguori, ohmura.kei, mtosatti, Yoshiaki Tamura, yoshikawa.takuya,
	avi
When ft_mode is on, migrate_fd_put_ready() would open ft_transaction
file and turn on event_tap.  To end or cancel ft_transaction, ft_mode
and event_tap is turned off.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
 migration.c |   36 +++++++++++++++++++++++++++++++++---
 1 files changed, 33 insertions(+), 3 deletions(-)
diff --git a/migration.c b/migration.c
index 3cc47fc..c81fdb4 100644
--- a/migration.c
+++ b/migration.c
@@ -494,8 +494,32 @@ void migrate_fd_put_ready(void *opaque)
         } else {
             state = MIG_STATE_COMPLETED;
         }
-        migrate_fd_cleanup(s);
-        s->state = state;
+
+        if (ft_mode && state == MIG_STATE_COMPLETED) {
+            /* close buffered_file and open ft_transaction.
+             * Note: file discriptor won't get closed,
+             * but reused by ft_transaction. */
+            socket_set_block(s->fd);
+            socket_set_nodelay(s->fd);
+            qemu_fclose(s->file);
+            s->file = qemu_fopen_ops_ft_tranx(s,
+                                              migrate_fd_put_buffer,
+                                              migrate_fd_put_vector,
+                                              migrate_fd_get_buffer,
+                                              NULL,
+                                              migrate_fd_close,
+                                              1);
+
+            /* events are tapped from now. */
+            event_tap_on();
+
+            if (old_vm_running) {
+                vm_start();
+            }
+        } else {
+            migrate_fd_cleanup(s);
+            s->state = state;
+        }
     }
 }
 
@@ -515,8 +539,14 @@ void migrate_fd_cancel(MigrationState *mig_state)
     DPRINTF("cancelling migration\n");
 
     s->state = MIG_STATE_CANCELLED;
-    qemu_savevm_state_cancel(s->mon, s->file);
 
+    if (ft_mode == FT_TRANSACTION) {
+        qemu_transaction_cancel(s->file);
+        ft_mode = FT_OFF;
+        event_tap_off();
+    }
+
+    qemu_savevm_state_cancel(s->mon, s->file);
     migrate_fd_cleanup(s);
 }
 
-- 
1.7.0.31.g1df487
^ permalink raw reply related	[flat|nested] 74+ messages in thread
* [Qemu-devel] [RFC PATCH 18/20] Modify tcp_accept_incoming_migration() to handle ft_mode, and add a hack not to close fd when ft_mode is enabled.
  2010-04-21  5:57 [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Yoshiaki Tamura
                   ` (16 preceding siblings ...)
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 17/20] Modify migrate_fd_put_ready() when ft_mode is on Yoshiaki Tamura
@ 2010-04-21  5:57 ` Yoshiaki Tamura
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 19/20] Insert do_event_tap() to virtio-{blk, net}, comment out assert() on cpu_single_env temporally Yoshiaki Tamura
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-21  5:57 UTC (permalink / raw)
  To: kvm, qemu-devel
  Cc: aliguori, ohmura.kei, mtosatti, Yoshiaki Tamura, yoshikawa.takuya,
	avi
When ft_mode is set in the header, tcp_accept_incoming_migration()
receives ft_transaction iteratively.  We also need a hack no to close
fd before moving to ft_transaction mode, so that we can reuse the fd
for it.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
 migration-tcp.c |   36 +++++++++++++++++++++++++++++++++++-
 1 files changed, 35 insertions(+), 1 deletions(-)
diff --git a/migration-tcp.c b/migration-tcp.c
index 94a1a03..e018ed1 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -18,6 +18,7 @@
 #include "sysemu.h"
 #include "buffered_file.h"
 #include "block.h"
+#include "ft_transaction.h"
 
 //#define DEBUG_MIGRATION_TCP
 
@@ -60,7 +61,8 @@ static int socket_writev(FdMigrationState *s, const struct iovec *v, int count)
 static int tcp_close(FdMigrationState *s)
 {
     DPRINTF("tcp_close\n");
-    if (s->fd != -1) {
+    /* FIX ME: accessing ft_mode here isn't clean */
+    if (s->fd != -1 && ft_mode != FT_INIT) {
         close(s->fd);
         s->fd = -1;
     }
@@ -187,6 +189,38 @@ static void tcp_accept_incoming_migration(void *opaque)
         fprintf(stderr, "load of migration failed\n");
         goto out_fopen;
     }
+
+    /* ft_mode is set by qemu_loadvm_state(). */
+    if (ft_mode == FT_INIT) {
+        /* close normal QEMUFile first before reusing connection. */
+        qemu_fclose(f);
+        socket_set_nodelay(c);
+        socket_set_timeout(c, 5);
+        /* don't autostart to avoid split brain. */
+        autostart = 0;
+
+        f = qemu_fopen_transaction(c);
+        if (f == NULL) {
+            fprintf(stderr, "could not qemu_fopen transaction\n");
+            goto out;
+        }
+
+        /* need to wait sender to setup. */
+        if (qemu_transaction_begin(f) < 0) {
+            goto out_fopen;
+        }
+
+        /* loop until transaction breaks */
+        while ((ft_mode != FT_OFF) && (ret == 0)) {
+            ret = qemu_loadvm_state(f, 1);
+        }
+
+        /* if migrate_cancel was called at the sender  */
+        if (ft_mode == FT_OFF) {
+            goto out_fopen;
+        }
+    }
+
     qemu_announce_self();
     DPRINTF("successfully loaded vm state\n");
 
-- 
1.7.0.31.g1df487
^ permalink raw reply related	[flat|nested] 74+ messages in thread
* [Qemu-devel] [RFC PATCH 19/20] Insert do_event_tap() to virtio-{blk, net}, comment out assert() on cpu_single_env temporally.
  2010-04-21  5:57 [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Yoshiaki Tamura
                   ` (17 preceding siblings ...)
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 18/20] Modify tcp_accept_incoming_migration() to handle ft_mode, and add a hack not to close fd when ft_mode is enabled Yoshiaki Tamura
@ 2010-04-21  5:57 ` Yoshiaki Tamura
  2010-04-22 19:39   ` [Qemu-devel] " Anthony Liguori
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 20/20] Introduce -k option to enable FT migration mode (Kemari) Yoshiaki Tamura
                   ` (3 subsequent siblings)
  22 siblings, 1 reply; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-21  5:57 UTC (permalink / raw)
  To: kvm, qemu-devel
  Cc: aliguori, ohmura.kei, mtosatti, Yoshiaki Tamura, yoshikawa.takuya,
	avi
do_event_tap() is inserted to functions which actually fire outputs.
By synchronizing VMs before outputs are fired, we can failover to the
receiver upon failure.  To save VM continuously, comment out assert()
on cpu_single_env temporally.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
 hw/virtio-blk.c |    2 ++
 hw/virtio-net.c |    2 ++
 qemu-kvm.c      |    7 ++++++-
 3 files changed, 10 insertions(+), 1 deletions(-)
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index b80402d..1dd1c31 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -327,6 +327,8 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
         .old_bs = NULL,
     };
 
+    do_event_tap();
+
     while ((req = virtio_blk_get_request(s))) {
         virtio_blk_handle_request(req, &mrb);
     }
diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 5c0093e..1a32bf3 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -667,6 +667,8 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIONet *n = to_virtio_net(vdev);
 
+    do_event_tap();
+
     if (n->tx_timer_active) {
         virtio_queue_set_notification(vq, 1);
         qemu_del_timer(n->tx_timer);
diff --git a/qemu-kvm.c b/qemu-kvm.c
index 1414f49..769bc95 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -935,8 +935,12 @@ int kvm_run(CPUState *env)
 
     post_kvm_run(kvm, env);
 
+    /* TODO: we need to prevent tapping events that derived from the
+     * same VMEXIT. This needs more info from the kernel. */
 #if defined(KVM_CAP_COALESCED_MMIO)
     if (kvm_state->coalesced_mmio) {
+        /* prevent from tapping events while handling coalesced_mmio */
+        event_tap_suspend();
         struct kvm_coalesced_mmio_ring *ring =
             (void *) run + kvm_state->coalesced_mmio * PAGE_SIZE;
         while (ring->first != ring->last) {
@@ -946,6 +950,7 @@ int kvm_run(CPUState *env)
             smp_wmb();
             ring->first = (ring->first + 1) % KVM_COALESCED_MMIO_MAX;
         }
+        event_tap_resume();
     }
 #endif
 
@@ -1770,7 +1775,7 @@ static void resume_all_threads(void)
 {
     CPUState *penv = first_cpu;
 
-    assert(!cpu_single_env);
+    /* assert(!cpu_single_env); */
 
     while (penv) {
         penv->stop = 0;
-- 
1.7.0.31.g1df487
^ permalink raw reply related	[flat|nested] 74+ messages in thread
* [Qemu-devel] [RFC PATCH 20/20] Introduce -k option to enable FT migration mode (Kemari).
  2010-04-21  5:57 [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Yoshiaki Tamura
                   ` (18 preceding siblings ...)
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 19/20] Insert do_event_tap() to virtio-{blk, net}, comment out assert() on cpu_single_env temporally Yoshiaki Tamura
@ 2010-04-21  5:57 ` Yoshiaki Tamura
  2010-04-22  8:58 ` [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Dor Laor
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-21  5:57 UTC (permalink / raw)
  To: kvm, qemu-devel
  Cc: aliguori, ohmura.kei, mtosatti, Yoshiaki Tamura, yoshikawa.takuya,
	avi
When -k option is set to migrate command, it will turn on ft_mode to
start FT migration mode (Kemari).
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
 migration.c     |    3 +++
 qemu-monitor.hx |    7 ++++---
 2 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/migration.c b/migration.c
index c81fdb4..b288e82 100644
--- a/migration.c
+++ b/migration.c
@@ -109,6 +109,9 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data)
         return -1;
     }
 
+    if (qdict_get_int(qdict, "ft"))
+        ft_mode = FT_INIT;
+        
     if (strstart(uri, "tcp:", &p)) {
         s = tcp_start_outgoing_migration(mon, p, max_throttle, detach,
                                          (int)qdict_get_int(qdict, "blk"), 
diff --git a/qemu-monitor.hx b/qemu-monitor.hx
index 16c45b7..22b72d9 100644
--- a/qemu-monitor.hx
+++ b/qemu-monitor.hx
@@ -765,13 +765,14 @@ ETEXI
 
     {
         .name       = "migrate",
-        .args_type  = "detach:-d,blk:-b,inc:-i,uri:s",
-        .params     = "[-d] [-b] [-i] uri",
+        .args_type  = "detach:-d,blk:-b,inc:-i,ft:-k,uri:s",
+        .params     = "[-d] [-b] [-i] [-k] uri",
         .help       = "migrate to URI (using -d to not wait for completion)"
 		      "\n\t\t\t -b for migration without shared storage with"
 		      " full copy of disk\n\t\t\t -i for migration without "
 		      "shared storage with incremental copy of disk "
-		      "(base image shared between src and destination)",
+		      "(base image shared between src and destination)"
+		      "\n\t\t\t -k for FT migration mode (Kemari)",
         .user_print = monitor_user_noop,	
 	.mhandler.cmd_new = do_migrate,
     },
-- 
1.7.0.31.g1df487
^ permalink raw reply related	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 04/20] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 04/20] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer() Yoshiaki Tamura
@ 2010-04-21  8:03   ` Stefan Hajnoczi
  2010-04-21  8:27     ` Yoshiaki Tamura
  2010-04-23  9:53   ` Avi Kivity
  1 sibling, 1 reply; 74+ messages in thread
From: Stefan Hajnoczi @ 2010-04-21  8:03 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: ohmura.kei, kvm, mtosatti, aliguori, qemu-devel, yoshikawa.takuya,
	avi
On Wed, Apr 21, 2010 at 6:57 AM, Yoshiaki Tamura
<tamura.yoshiaki@lab.ntt.co.jp> wrote:
> @@ -454,6 +458,25 @@ void qemu_fflush(QEMUFile *f)
>     }
>  }
>
> +void *qemu_realloc_buffer(QEMUFile *f, int size)
> +{
> +    f->buf_max_size = size;
> +
> +    f->buf = qemu_realloc(f->buf, f->buf_max_size);
> +    if (f->buf == NULL) {
> +        fprintf(stderr, "qemu file buffer realloc failed\n");
> +        exit(1);
> +    }
> +
> +    return f->buf;
> +}
> +
qemu_realloc() will abort() if there was not enough memory to realloc.
 Just like qemu_malloc(), you don't need to check for NULL.
Stefan
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 04/20] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().
  2010-04-21  8:03   ` [Qemu-devel] " Stefan Hajnoczi
@ 2010-04-21  8:27     ` Yoshiaki Tamura
  0 siblings, 0 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-21  8:27 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: ohmura.kei, kvm, mtosatti, aliguori, qemu-devel, yoshikawa.takuya,
	avi
2010/4/21 Stefan Hajnoczi <stefanha@gmail.com>:
> On Wed, Apr 21, 2010 at 6:57 AM, Yoshiaki Tamura
> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>> @@ -454,6 +458,25 @@ void qemu_fflush(QEMUFile *f)
>>     }
>>  }
>>
>> +void *qemu_realloc_buffer(QEMUFile *f, int size)
>> +{
>> +    f->buf_max_size = size;
>> +
>> +    f->buf = qemu_realloc(f->buf, f->buf_max_size);
>> +    if (f->buf == NULL) {
>> +        fprintf(stderr, "qemu file buffer realloc failed\n");
>> +        exit(1);
>> +    }
>> +
>> +    return f->buf;
>> +}
>> +
>
> qemu_realloc() will abort() if there was not enough memory to realloc.
>  Just like qemu_malloc(), you don't need to check for NULL.
Thanks for your comment.  I'll remove it.
If there is no objection, I would like to take out this patch from the series,
and post it by itself.
Yoshi
>
> Stefan
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
  2010-04-21  5:57 [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Yoshiaki Tamura
                   ` (19 preceding siblings ...)
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 20/20] Introduce -k option to enable FT migration mode (Kemari) Yoshiaki Tamura
@ 2010-04-22  8:58 ` Dor Laor
  2010-04-22 10:35   ` Yoshiaki Tamura
  2010-04-22 19:42 ` [Qemu-devel] " Anthony Liguori
  2010-04-23 13:24 ` Avi Kivity
  22 siblings, 1 reply; 74+ messages in thread
From: Dor Laor @ 2010-04-22  8:58 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: ohmura.kei, kvm, mtosatti, aliguori, qemu-devel, yoshikawa.takuya,
	avi
On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
> Hi all,
>
> We have been implementing the prototype of Kemari for KVM, and we're sending
> this message to share what we have now and TODO lists.  Hopefully, we would like
> to get early feedback to keep us in the right direction.  Although advanced
> approaches in the TODO lists are fascinating, we would like to run this project
> step by step while absorbing comments from the community.  The current code is
> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>
> For those who are new to Kemari for KVM, please take a look at the
> following RFC which we posted last year.
>
> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>
> The transmission/transaction protocol, and most of the control logic is
> implemented in QEMU.  However, we needed a hack in KVM to prevent rip from
> proceeding before synchronizing VMs.  It may also need some plumbing in the
> kernel side to guarantee replayability of certain events and instructions,
> integrate the RAS capabilities of newer x86 hardware with the HA stack, as well
> as for optimization purposes, for example.
[ snap]
>
> The rest of this message describes TODO lists grouped by each topic.
>
> === event tapping ===
>
> Event tapping is the core component of Kemari, and it decides on which event the
> primary should synchronize with the secondary.  The basic assumption here is
> that outgoing I/O operations are idempotent, which is usually true for disk I/O
> and reliable network protocols such as TCP.
IMO any type of network even should be stalled too. What if the VM runs 
non tcp protocol and the packet that the master node sent reached some 
remote client and before the sync to the slave the master failed?
[snap]
> === clock ===
>
> Since synchronizing the virtual machines every time the TSC is accessed would be
> prohibitive, the transmission of the TSC will be done lazily, which means
> delaying it until there is a non-TSC synchronization point arrives.
Why do you specifically care about the tsc sync? When you sync all the 
IO model on snapshot it also synchronizes the tsc.
In general, can you please explain the 'algorithm' for continuous 
snapshots (is that what you like to do?):
A trivial one would we to :
  - do X online snapshots/sec
  - Stall all IO (disk/block) from the guest to the outside world
    until the previous snapshot reaches the slave.
  - Snapshots are made of
    - diff of dirty pages from last snapshot
    - Qemu device model (+kvm's) diff from last.
You can do 'light' snapshots in between to send dirty pages to reduce 
snapshot time.
I wrote the above to serve a reference for your comments so it will map 
into my mind. Thanks, dor
>
> TODO:
>   - Synchronization of clock sources (need to intercept TSC reads, etc).
>
> === usability ===
>
> These are items that defines how users interact with Kemari.
>
> TODO:
>   - Kemarid daemon that takes care of the cluster management/monitoring
>     side of things.
>   - Some device emulators might need minor modifications to work well
>     with Kemari.  Use white(black)-listing to take the burden of
>     choosing the right device model off the users.
>
> === optimizations ===
>
> Although the big picture can be realized by completing the TODO list above, we
> need some optimizations/enhancements to make Kemari useful in real world, and
> these are items what needs to be done for that.
>
> TODO:
>   - SMP (for the sake of performance might need to implement a
>     synchronization protocol that can maintain two or more
>     synchronization points active at any given moment)
>   - VGA (leverage VNC's subtilting mechanism to identify fb pages that
>     are really dirty).
>
>
> Any comments/suggestions would be greatly appreciated.
>
> Thanks,
>
> Yoshi
>
> --
>
> Kemari starts synchronizing VMs when QEMU handles I/O requests.
> Without this patch VCPU state is already proceeded before
> synchronization, and after failover to the VM on the receiver, it
> hangs because of this.
>
> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
> ---
>   arch/x86/include/asm/kvm_host.h |    1 +
>   arch/x86/kvm/svm.c              |   11 ++++++++---
>   arch/x86/kvm/vmx.c              |   11 ++++++++---
>   arch/x86/kvm/x86.c              |    4 ++++
>   4 files changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 26c629a..7b8f514 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -227,6 +227,7 @@ struct kvm_pio_request {
>   	int in;
>   	int port;
>   	int size;
> +	bool lazy_skip;
>   };
>
>   /*
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index d04c7ad..e373245 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -1495,7 +1495,7 @@ static int io_interception(struct vcpu_svm *svm)
>   {
>   	struct kvm_vcpu *vcpu =&svm->vcpu;
>   	u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
> -	int size, in, string;
> +	int size, in, string, ret;
>   	unsigned port;
>
>   	++svm->vcpu.stat.io_exits;
> @@ -1507,9 +1507,14 @@ static int io_interception(struct vcpu_svm *svm)
>   	port = io_info>>  16;
>   	size = (io_info&  SVM_IOIO_SIZE_MASK)>>  SVM_IOIO_SIZE_SHIFT;
>   	svm->next_rip = svm->vmcb->control.exit_info_2;
> -	skip_emulated_instruction(&svm->vcpu);
>
> -	return kvm_fast_pio_out(vcpu, size, port);
> +	ret = kvm_fast_pio_out(vcpu, size, port);
> +	if (ret)
> +		skip_emulated_instruction(&svm->vcpu);
> +	else
> +		vcpu->arch.pio.lazy_skip = true;
> +
> +	return ret;
>   }
>
>   static int nmi_interception(struct vcpu_svm *svm)
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 41e63bb..09052d6 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2975,7 +2975,7 @@ static int handle_triple_fault(struct kvm_vcpu *vcpu)
>   static int handle_io(struct kvm_vcpu *vcpu)
>   {
>   	unsigned long exit_qualification;
> -	int size, in, string;
> +	int size, in, string, ret;
>   	unsigned port;
>
>   	exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> @@ -2989,9 +2989,14 @@ static int handle_io(struct kvm_vcpu *vcpu)
>
>   	port = exit_qualification>>  16;
>   	size = (exit_qualification&  7) + 1;
> -	skip_emulated_instruction(vcpu);
>
> -	return kvm_fast_pio_out(vcpu, size, port);
> +	ret = kvm_fast_pio_out(vcpu, size, port);
> +	if (ret)
> +		skip_emulated_instruction(vcpu);
> +	else
> +		vcpu->arch.pio.lazy_skip = true;
> +
> +	return ret;
>   }
>
>   static void
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index fd5c3d3..cc308d2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4544,6 +4544,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
>   	if (!irqchip_in_kernel(vcpu->kvm))
>   		kvm_set_cr8(vcpu, kvm_run->cr8);
>
> +	if (vcpu->arch.pio.lazy_skip)
> +		kvm_x86_ops->skip_emulated_instruction(vcpu);
> +	vcpu->arch.pio.lazy_skip = false;
> +
>   	if (vcpu->arch.pio.count || vcpu->mmio_needed ||
>   	    vcpu->arch.emulate_ctxt.restart) {
>   		if (vcpu->mmio_needed) {
^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
  2010-04-22  8:58 ` [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Dor Laor
@ 2010-04-22 10:35   ` Yoshiaki Tamura
  2010-04-22 11:36     ` Takuya Yoshikawa
                       ` (2 more replies)
  0 siblings, 3 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-22 10:35 UTC (permalink / raw)
  To: dlaor
  Cc: ohmura.kei, kvm, mtosatti, aliguori, qemu-devel, yoshikawa.takuya,
	avi
Dor Laor wrote:
> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>> Hi all,
>>
>> We have been implementing the prototype of Kemari for KVM, and we're
>> sending
>> this message to share what we have now and TODO lists. Hopefully, we
>> would like
>> to get early feedback to keep us in the right direction. Although
>> advanced
>> approaches in the TODO lists are fascinating, we would like to run
>> this project
>> step by step while absorbing comments from the community. The current
>> code is
>> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>>
>> For those who are new to Kemari for KVM, please take a look at the
>> following RFC which we posted last year.
>>
>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>
>> The transmission/transaction protocol, and most of the control logic is
>> implemented in QEMU. However, we needed a hack in KVM to prevent rip from
>> proceeding before synchronizing VMs. It may also need some plumbing in
>> the
>> kernel side to guarantee replayability of certain events and
>> instructions,
>> integrate the RAS capabilities of newer x86 hardware with the HA
>> stack, as well
>> as for optimization purposes, for example.
>
> [ snap]
>
>>
>> The rest of this message describes TODO lists grouped by each topic.
>>
>> === event tapping ===
>>
>> Event tapping is the core component of Kemari, and it decides on which
>> event the
>> primary should synchronize with the secondary. The basic assumption
>> here is
>> that outgoing I/O operations are idempotent, which is usually true for
>> disk I/O
>> and reliable network protocols such as TCP.
>
> IMO any type of network even should be stalled too. What if the VM runs
> non tcp protocol and the packet that the master node sent reached some
> remote client and before the sync to the slave the master failed?
In current implementation, it is actually stalling any type of network that goes 
through virtio-net.
However, if the application was using unreliable protocols, it should have its 
own recovering mechanism, or it should be completely stateless.
> [snap]
>
>
>> === clock ===
>>
>> Since synchronizing the virtual machines every time the TSC is
>> accessed would be
>> prohibitive, the transmission of the TSC will be done lazily, which means
>> delaying it until there is a non-TSC synchronization point arrives.
>
> Why do you specifically care about the tsc sync? When you sync all the
> IO model on snapshot it also synchronizes the tsc.
>
> In general, can you please explain the 'algorithm' for continuous
> snapshots (is that what you like to do?):
Yes, of course.
Sorry for being less informative.
> A trivial one would we to :
> - do X online snapshots/sec
I currently don't have good numbers that I can share right now.
Snapshots/sec depends on what kind of workload is running, and if the guest was 
almost idle, there will be no snapshots in 5sec.  On the other hand, if the 
guest was running I/O intensive workloads (netperf, iozone for example), there 
will be about 50 snapshots/sec.
> - Stall all IO (disk/block) from the guest to the outside world
> until the previous snapshot reaches the slave.
Yes, it does.
> - Snapshots are made of
Full device model + diff of dirty pages from the last snapshot.
> - diff of dirty pages from last snapshot
This also depends on the workload.
In case of I/O intensive workloads, dirty pages are usually less than 100.
> - Qemu device model (+kvm's) diff from last.
We're currently sending full copy because we're completely reusing this part of 
existing live migration framework.
Last time we measured, it was about 13KB.
But it varies by which QEMU version is used.
> You can do 'light' snapshots in between to send dirty pages to reduce
> snapshot time.
I agree.  That's one of the advanced topic we would like to try too.
> I wrote the above to serve a reference for your comments so it will map
> into my mind. Thanks, dor
Thank your for the guidance.
I hope this answers to your question.
At the same time, I would also be happy it we could discuss how to implement 
too.  In fact, we needed a hack to prevent rip from proceeding in KVM, which 
turned out that it was not the best workaround.
Thanks,
Yoshi
>
>>
>> TODO:
>> - Synchronization of clock sources (need to intercept TSC reads, etc).
>>
>> === usability ===
>>
>> These are items that defines how users interact with Kemari.
>>
>> TODO:
>> - Kemarid daemon that takes care of the cluster management/monitoring
>> side of things.
>> - Some device emulators might need minor modifications to work well
>> with Kemari. Use white(black)-listing to take the burden of
>> choosing the right device model off the users.
>>
>> === optimizations ===
>>
>> Although the big picture can be realized by completing the TODO list
>> above, we
>> need some optimizations/enhancements to make Kemari useful in real
>> world, and
>> these are items what needs to be done for that.
>>
>> TODO:
>> - SMP (for the sake of performance might need to implement a
>> synchronization protocol that can maintain two or more
>> synchronization points active at any given moment)
>> - VGA (leverage VNC's subtilting mechanism to identify fb pages that
>> are really dirty).
>>
>>
>> Any comments/suggestions would be greatly appreciated.
>>
>> Thanks,
>>
>> Yoshi
>>
>> --
>>
>> Kemari starts synchronizing VMs when QEMU handles I/O requests.
>> Without this patch VCPU state is already proceeded before
>> synchronization, and after failover to the VM on the receiver, it
>> hangs because of this.
>>
>> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>> ---
>> arch/x86/include/asm/kvm_host.h | 1 +
>> arch/x86/kvm/svm.c | 11 ++++++++---
>> arch/x86/kvm/vmx.c | 11 ++++++++---
>> arch/x86/kvm/x86.c | 4 ++++
>> 4 files changed, 21 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h
>> b/arch/x86/include/asm/kvm_host.h
>> index 26c629a..7b8f514 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -227,6 +227,7 @@ struct kvm_pio_request {
>> int in;
>> int port;
>> int size;
>> + bool lazy_skip;
>> };
>>
>> /*
>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> index d04c7ad..e373245 100644
>> --- a/arch/x86/kvm/svm.c
>> +++ b/arch/x86/kvm/svm.c
>> @@ -1495,7 +1495,7 @@ static int io_interception(struct vcpu_svm *svm)
>> {
>> struct kvm_vcpu *vcpu =&svm->vcpu;
>> u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
>> - int size, in, string;
>> + int size, in, string, ret;
>> unsigned port;
>>
>> ++svm->vcpu.stat.io_exits;
>> @@ -1507,9 +1507,14 @@ static int io_interception(struct vcpu_svm *svm)
>> port = io_info>> 16;
>> size = (io_info& SVM_IOIO_SIZE_MASK)>> SVM_IOIO_SIZE_SHIFT;
>> svm->next_rip = svm->vmcb->control.exit_info_2;
>> - skip_emulated_instruction(&svm->vcpu);
>>
>> - return kvm_fast_pio_out(vcpu, size, port);
>> + ret = kvm_fast_pio_out(vcpu, size, port);
>> + if (ret)
>> + skip_emulated_instruction(&svm->vcpu);
>> + else
>> + vcpu->arch.pio.lazy_skip = true;
>> +
>> + return ret;
>> }
>>
>> static int nmi_interception(struct vcpu_svm *svm)
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 41e63bb..09052d6 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -2975,7 +2975,7 @@ static int handle_triple_fault(struct kvm_vcpu
>> *vcpu)
>> static int handle_io(struct kvm_vcpu *vcpu)
>> {
>> unsigned long exit_qualification;
>> - int size, in, string;
>> + int size, in, string, ret;
>> unsigned port;
>>
>> exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
>> @@ -2989,9 +2989,14 @@ static int handle_io(struct kvm_vcpu *vcpu)
>>
>> port = exit_qualification>> 16;
>> size = (exit_qualification& 7) + 1;
>> - skip_emulated_instruction(vcpu);
>>
>> - return kvm_fast_pio_out(vcpu, size, port);
>> + ret = kvm_fast_pio_out(vcpu, size, port);
>> + if (ret)
>> + skip_emulated_instruction(vcpu);
>> + else
>> + vcpu->arch.pio.lazy_skip = true;
>> +
>> + return ret;
>> }
>>
>> static void
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index fd5c3d3..cc308d2 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -4544,6 +4544,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu
>> *vcpu, struct kvm_run *kvm_run)
>> if (!irqchip_in_kernel(vcpu->kvm))
>> kvm_set_cr8(vcpu, kvm_run->cr8);
>>
>> + if (vcpu->arch.pio.lazy_skip)
>> + kvm_x86_ops->skip_emulated_instruction(vcpu);
>> + vcpu->arch.pio.lazy_skip = false;
>> +
>> if (vcpu->arch.pio.count || vcpu->mmio_needed ||
>> vcpu->arch.emulate_ctxt.restart) {
>> if (vcpu->mmio_needed) {
>
>
>
>
^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
  2010-04-22 10:35   ` Yoshiaki Tamura
@ 2010-04-22 11:36     ` Takuya Yoshikawa
  2010-04-22 12:35       ` Yoshiaki Tamura
  2010-04-22 12:19     ` Dor Laor
  2010-04-22 16:15     ` Jamie Lokier
  2 siblings, 1 reply; 74+ messages in thread
From: Takuya Yoshikawa @ 2010-04-22 11:36 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: aliguori, dlaor, kvm, ohmura.kei, mtosatti, qemu-devel, avi
(2010/04/22 19:35), Yoshiaki Tamura wrote:
>
>> A trivial one would we to :
>> - do X online snapshots/sec
>
> I currently don't have good numbers that I can share right now.
> Snapshots/sec depends on what kind of workload is running, and if the
> guest was almost idle, there will be no snapshots in 5sec. On the other
> hand, if the guest was running I/O intensive workloads (netperf, iozone
> for example), there will be about 50 snapshots/sec.
>
50 is too small: this depends on the synchronization speed and does not
show how many snapshots we need, right?
^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
  2010-04-22 10:35   ` Yoshiaki Tamura
  2010-04-22 11:36     ` Takuya Yoshikawa
@ 2010-04-22 12:19     ` Dor Laor
  2010-04-22 13:16       ` Yoshiaki Tamura
  2010-04-22 16:15     ` Jamie Lokier
  2 siblings, 1 reply; 74+ messages in thread
From: Dor Laor @ 2010-04-22 12:19 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: ohmura.kei, kvm, mtosatti, aliguori, qemu-devel, yoshikawa.takuya,
	avi
On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
> Dor Laor wrote:
>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>> Hi all,
>>>
>>> We have been implementing the prototype of Kemari for KVM, and we're
>>> sending
>>> this message to share what we have now and TODO lists. Hopefully, we
>>> would like
>>> to get early feedback to keep us in the right direction. Although
>>> advanced
>>> approaches in the TODO lists are fascinating, we would like to run
>>> this project
>>> step by step while absorbing comments from the community. The current
>>> code is
>>> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>>>
>>> For those who are new to Kemari for KVM, please take a look at the
>>> following RFC which we posted last year.
>>>
>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>>
>>> The transmission/transaction protocol, and most of the control logic is
>>> implemented in QEMU. However, we needed a hack in KVM to prevent rip
>>> from
>>> proceeding before synchronizing VMs. It may also need some plumbing in
>>> the
>>> kernel side to guarantee replayability of certain events and
>>> instructions,
>>> integrate the RAS capabilities of newer x86 hardware with the HA
>>> stack, as well
>>> as for optimization purposes, for example.
>>
>> [ snap]
>>
>>>
>>> The rest of this message describes TODO lists grouped by each topic.
>>>
>>> === event tapping ===
>>>
>>> Event tapping is the core component of Kemari, and it decides on which
>>> event the
>>> primary should synchronize with the secondary. The basic assumption
>>> here is
>>> that outgoing I/O operations are idempotent, which is usually true for
>>> disk I/O
>>> and reliable network protocols such as TCP.
>>
>> IMO any type of network even should be stalled too. What if the VM runs
>> non tcp protocol and the packet that the master node sent reached some
>> remote client and before the sync to the slave the master failed?
>
> In current implementation, it is actually stalling any type of network
> that goes through virtio-net.
>
> However, if the application was using unreliable protocols, it should
> have its own recovering mechanism, or it should be completely stateless.
Why do you treat tcp differently? You can damage the entire VM this way 
- think of dhcp request that was dropped on the moment you switched 
between the master and the slave?
>
>> [snap]
>>
>>
>>> === clock ===
>>>
>>> Since synchronizing the virtual machines every time the TSC is
>>> accessed would be
>>> prohibitive, the transmission of the TSC will be done lazily, which
>>> means
>>> delaying it until there is a non-TSC synchronization point arrives.
>>
>> Why do you specifically care about the tsc sync? When you sync all the
>> IO model on snapshot it also synchronizes the tsc.
So, do you agree that an extra clock synchronization is not needed since 
it is done anyway as part of the live migration state sync?
>>
>> In general, can you please explain the 'algorithm' for continuous
>> snapshots (is that what you like to do?):
>
> Yes, of course.
> Sorry for being less informative.
>
>> A trivial one would we to :
>> - do X online snapshots/sec
>
> I currently don't have good numbers that I can share right now.
> Snapshots/sec depends on what kind of workload is running, and if the
> guest was almost idle, there will be no snapshots in 5sec. On the other
> hand, if the guest was running I/O intensive workloads (netperf, iozone
> for example), there will be about 50 snapshots/sec.
>
>> - Stall all IO (disk/block) from the guest to the outside world
>> until the previous snapshot reaches the slave.
>
> Yes, it does.
>
>> - Snapshots are made of
>
> Full device model + diff of dirty pages from the last snapshot.
>
>> - diff of dirty pages from last snapshot
>
> This also depends on the workload.
> In case of I/O intensive workloads, dirty pages are usually less than 100.
The hardest would be memory intensive loads.
So 100 snap/sec means latency of 10msec right?
(not that it's not ok, with faster hw and IB you'll be able to get much 
more)
>
>> - Qemu device model (+kvm's) diff from last.
>
> We're currently sending full copy because we're completely reusing this
> part of existing live migration framework.
>
> Last time we measured, it was about 13KB.
> But it varies by which QEMU version is used.
>
>> You can do 'light' snapshots in between to send dirty pages to reduce
>> snapshot time.
>
> I agree. That's one of the advanced topic we would like to try too.
>
>> I wrote the above to serve a reference for your comments so it will map
>> into my mind. Thanks, dor
>
> Thank your for the guidance.
> I hope this answers to your question.
>
> At the same time, I would also be happy it we could discuss how to
> implement too. In fact, we needed a hack to prevent rip from proceeding
> in KVM, which turned out that it was not the best workaround.
There are brute force solutions like
- stop the guest until you send all of the snapshot to the remote (like
   standard live migration)
- Stop + fork + cont the father
Or mark the recent dirty pages that were not sent to the remote as write 
protected and copy them if touched.
>
> Thanks,
>
> Yoshi
>
>>
>>>
>>> TODO:
>>> - Synchronization of clock sources (need to intercept TSC reads, etc).
>>>
>>> === usability ===
>>>
>>> These are items that defines how users interact with Kemari.
>>>
>>> TODO:
>>> - Kemarid daemon that takes care of the cluster management/monitoring
>>> side of things.
>>> - Some device emulators might need minor modifications to work well
>>> with Kemari. Use white(black)-listing to take the burden of
>>> choosing the right device model off the users.
>>>
>>> === optimizations ===
>>>
>>> Although the big picture can be realized by completing the TODO list
>>> above, we
>>> need some optimizations/enhancements to make Kemari useful in real
>>> world, and
>>> these are items what needs to be done for that.
>>>
>>> TODO:
>>> - SMP (for the sake of performance might need to implement a
>>> synchronization protocol that can maintain two or more
>>> synchronization points active at any given moment)
>>> - VGA (leverage VNC's subtilting mechanism to identify fb pages that
>>> are really dirty).
>>>
>>>
>>> Any comments/suggestions would be greatly appreciated.
>>>
>>> Thanks,
>>>
>>> Yoshi
>>>
>>> --
>>>
>>> Kemari starts synchronizing VMs when QEMU handles I/O requests.
>>> Without this patch VCPU state is already proceeded before
>>> synchronization, and after failover to the VM on the receiver, it
>>> hangs because of this.
>>>
>>> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>>> ---
>>> arch/x86/include/asm/kvm_host.h | 1 +
>>> arch/x86/kvm/svm.c | 11 ++++++++---
>>> arch/x86/kvm/vmx.c | 11 ++++++++---
>>> arch/x86/kvm/x86.c | 4 ++++
>>> 4 files changed, 21 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/kvm_host.h
>>> b/arch/x86/include/asm/kvm_host.h
>>> index 26c629a..7b8f514 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>> @@ -227,6 +227,7 @@ struct kvm_pio_request {
>>> int in;
>>> int port;
>>> int size;
>>> + bool lazy_skip;
>>> };
>>>
>>> /*
>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>> index d04c7ad..e373245 100644
>>> --- a/arch/x86/kvm/svm.c
>>> +++ b/arch/x86/kvm/svm.c
>>> @@ -1495,7 +1495,7 @@ static int io_interception(struct vcpu_svm *svm)
>>> {
>>> struct kvm_vcpu *vcpu =&svm->vcpu;
>>> u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
>>> - int size, in, string;
>>> + int size, in, string, ret;
>>> unsigned port;
>>>
>>> ++svm->vcpu.stat.io_exits;
>>> @@ -1507,9 +1507,14 @@ static int io_interception(struct vcpu_svm *svm)
>>> port = io_info>> 16;
>>> size = (io_info& SVM_IOIO_SIZE_MASK)>> SVM_IOIO_SIZE_SHIFT;
>>> svm->next_rip = svm->vmcb->control.exit_info_2;
>>> - skip_emulated_instruction(&svm->vcpu);
>>>
>>> - return kvm_fast_pio_out(vcpu, size, port);
>>> + ret = kvm_fast_pio_out(vcpu, size, port);
>>> + if (ret)
>>> + skip_emulated_instruction(&svm->vcpu);
>>> + else
>>> + vcpu->arch.pio.lazy_skip = true;
>>> +
>>> + return ret;
>>> }
>>>
>>> static int nmi_interception(struct vcpu_svm *svm)
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index 41e63bb..09052d6 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -2975,7 +2975,7 @@ static int handle_triple_fault(struct kvm_vcpu
>>> *vcpu)
>>> static int handle_io(struct kvm_vcpu *vcpu)
>>> {
>>> unsigned long exit_qualification;
>>> - int size, in, string;
>>> + int size, in, string, ret;
>>> unsigned port;
>>>
>>> exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
>>> @@ -2989,9 +2989,14 @@ static int handle_io(struct kvm_vcpu *vcpu)
>>>
>>> port = exit_qualification>> 16;
>>> size = (exit_qualification& 7) + 1;
>>> - skip_emulated_instruction(vcpu);
>>>
>>> - return kvm_fast_pio_out(vcpu, size, port);
>>> + ret = kvm_fast_pio_out(vcpu, size, port);
>>> + if (ret)
>>> + skip_emulated_instruction(vcpu);
>>> + else
>>> + vcpu->arch.pio.lazy_skip = true;
>>> +
>>> + return ret;
>>> }
>>>
>>> static void
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index fd5c3d3..cc308d2 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -4544,6 +4544,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu
>>> *vcpu, struct kvm_run *kvm_run)
>>> if (!irqchip_in_kernel(vcpu->kvm))
>>> kvm_set_cr8(vcpu, kvm_run->cr8);
>>>
>>> + if (vcpu->arch.pio.lazy_skip)
>>> + kvm_x86_ops->skip_emulated_instruction(vcpu);
>>> + vcpu->arch.pio.lazy_skip = false;
>>> +
>>> if (vcpu->arch.pio.count || vcpu->mmio_needed ||
>>> vcpu->arch.emulate_ctxt.restart) {
>>> if (vcpu->mmio_needed) {
>>
>>
>>
>>
>
>
>
^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
  2010-04-22 11:36     ` Takuya Yoshikawa
@ 2010-04-22 12:35       ` Yoshiaki Tamura
  0 siblings, 0 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-22 12:35 UTC (permalink / raw)
  To: Takuya Yoshikawa
  Cc: aliguori, dlaor, kvm, ohmura.kei, mtosatti, qemu-devel, avi
2010/4/22 Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>:
> (2010/04/22 19:35), Yoshiaki Tamura wrote:
>
>>
>>> A trivial one would we to :
>>> - do X online snapshots/sec
>>
>> I currently don't have good numbers that I can share right now.
>> Snapshots/sec depends on what kind of workload is running, and if the
>> guest was almost idle, there will be no snapshots in 5sec. On the other
>> hand, if the guest was running I/O intensive workloads (netperf, iozone
>> for example), there will be about 50 snapshots/sec.
>>
>
> 50 is too small: this depends on the synchronization speed and does not
> show how many snapshots we need, right?
No it doesn't.
It's an example data which I measured before.
^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
  2010-04-22 12:19     ` Dor Laor
@ 2010-04-22 13:16       ` Yoshiaki Tamura
  2010-04-22 20:33         ` Anthony Liguori
  2010-04-22 20:38         ` Dor Laor
  0 siblings, 2 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-22 13:16 UTC (permalink / raw)
  To: dlaor
  Cc: ohmura.kei, kvm, mtosatti, aliguori, qemu-devel, yoshikawa.takuya,
	avi
2010/4/22 Dor Laor <dlaor@redhat.com>:
> On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
>>
>> Dor Laor wrote:
>>>
>>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>>>
>>>> Hi all,
>>>>
>>>> We have been implementing the prototype of Kemari for KVM, and we're
>>>> sending
>>>> this message to share what we have now and TODO lists. Hopefully, we
>>>> would like
>>>> to get early feedback to keep us in the right direction. Although
>>>> advanced
>>>> approaches in the TODO lists are fascinating, we would like to run
>>>> this project
>>>> step by step while absorbing comments from the community. The current
>>>> code is
>>>> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>>>>
>>>> For those who are new to Kemari for KVM, please take a look at the
>>>> following RFC which we posted last year.
>>>>
>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>>>
>>>> The transmission/transaction protocol, and most of the control logic is
>>>> implemented in QEMU. However, we needed a hack in KVM to prevent rip
>>>> from
>>>> proceeding before synchronizing VMs. It may also need some plumbing in
>>>> the
>>>> kernel side to guarantee replayability of certain events and
>>>> instructions,
>>>> integrate the RAS capabilities of newer x86 hardware with the HA
>>>> stack, as well
>>>> as for optimization purposes, for example.
>>>
>>> [ snap]
>>>
>>>>
>>>> The rest of this message describes TODO lists grouped by each topic.
>>>>
>>>> === event tapping ===
>>>>
>>>> Event tapping is the core component of Kemari, and it decides on which
>>>> event the
>>>> primary should synchronize with the secondary. The basic assumption
>>>> here is
>>>> that outgoing I/O operations are idempotent, which is usually true for
>>>> disk I/O
>>>> and reliable network protocols such as TCP.
>>>
>>> IMO any type of network even should be stalled too. What if the VM runs
>>> non tcp protocol and the packet that the master node sent reached some
>>> remote client and before the sync to the slave the master failed?
>>
>> In current implementation, it is actually stalling any type of network
>> that goes through virtio-net.
>>
>> However, if the application was using unreliable protocols, it should
>> have its own recovering mechanism, or it should be completely stateless.
>
> Why do you treat tcp differently? You can damage the entire VM this way -
> think of dhcp request that was dropped on the moment you switched between
> the master and the slave?
I'm not trying to say that we should treat tcp differently, but just
it's severe.
In case of dhcp request, the client would have a chance to retry after
failover, correct?
BTW, in current implementation, it's synchronizing before dhcp ack is sent.
But in case of tcp, once you send ack to the client before sync, there
is no way to recover.
>>> [snap]
>>>
>>>
>>>> === clock ===
>>>>
>>>> Since synchronizing the virtual machines every time the TSC is
>>>> accessed would be
>>>> prohibitive, the transmission of the TSC will be done lazily, which
>>>> means
>>>> delaying it until there is a non-TSC synchronization point arrives.
>>>
>>> Why do you specifically care about the tsc sync? When you sync all the
>>> IO model on snapshot it also synchronizes the tsc.
>
> So, do you agree that an extra clock synchronization is not needed since it
> is done anyway as part of the live migration state sync?
I agree that its sent as part of the live migration.
What I wanted to say here is that this is not something for real time
applications.
I usually get questions like can this guarantee fault tolerance for
real time applications.
>>> In general, can you please explain the 'algorithm' for continuous
>>> snapshots (is that what you like to do?):
>>
>> Yes, of course.
>> Sorry for being less informative.
>>
>>> A trivial one would we to :
>>> - do X online snapshots/sec
>>
>> I currently don't have good numbers that I can share right now.
>> Snapshots/sec depends on what kind of workload is running, and if the
>> guest was almost idle, there will be no snapshots in 5sec. On the other
>> hand, if the guest was running I/O intensive workloads (netperf, iozone
>> for example), there will be about 50 snapshots/sec.
>>
>>> - Stall all IO (disk/block) from the guest to the outside world
>>> until the previous snapshot reaches the slave.
>>
>> Yes, it does.
>>
>>> - Snapshots are made of
>>
>> Full device model + diff of dirty pages from the last snapshot.
>>
>>> - diff of dirty pages from last snapshot
>>
>> This also depends on the workload.
>> In case of I/O intensive workloads, dirty pages are usually less than 100.
>
> The hardest would be memory intensive loads.
> So 100 snap/sec means latency of 10msec right?
> (not that it's not ok, with faster hw and IB you'll be able to get much
> more)
Doesn't 100 snap/sec mean the interval of snap is 10msec?
IIUC, to get the latency, you need to get, Time to transfer VM + Time
to get response from the receiver.
It's hard to say which load is the hardest.
Memory intensive load, who don't generate I/O often, will suffer from
long sync time for that moment, but would have chances to continue its
process until sync.
I/O intensive load, who don't dirty much pages, will suffer from
getting VPU stopped often, but its sync time is relatively shorter.
>>> - Qemu device model (+kvm's) diff from last.
>>
>> We're currently sending full copy because we're completely reusing this
>> part of existing live migration framework.
>>
>> Last time we measured, it was about 13KB.
>> But it varies by which QEMU version is used.
>>
>>> You can do 'light' snapshots in between to send dirty pages to reduce
>>> snapshot time.
>>
>> I agree. That's one of the advanced topic we would like to try too.
>>
>>> I wrote the above to serve a reference for your comments so it will map
>>> into my mind. Thanks, dor
>>
>> Thank your for the guidance.
>> I hope this answers to your question.
>>
>> At the same time, I would also be happy it we could discuss how to
>> implement too. In fact, we needed a hack to prevent rip from proceeding
>> in KVM, which turned out that it was not the best workaround.
>
> There are brute force solutions like
> - stop the guest until you send all of the snapshot to the remote (like
>  standard live migration)
We've implemented this way so far.
> - Stop + fork + cont the father
>
> Or mark the recent dirty pages that were not sent to the remote as write
> protected and copy them if touched.
I think I had that suggestion from Avi before.
And yes, it's very fascinating.
Meanwhile, if you look at the diffstat, it needed to touch many parts of QEMU.
Before going into further implementation, I wanted to check that I'm
in the right track for doing this project.
>> Thanks,
>>
>> Yoshi
>>
>>>
>>>>
>>>> TODO:
>>>> - Synchronization of clock sources (need to intercept TSC reads, etc).
>>>>
>>>> === usability ===
>>>>
>>>> These are items that defines how users interact with Kemari.
>>>>
>>>> TODO:
>>>> - Kemarid daemon that takes care of the cluster management/monitoring
>>>> side of things.
>>>> - Some device emulators might need minor modifications to work well
>>>> with Kemari. Use white(black)-listing to take the burden of
>>>> choosing the right device model off the users.
>>>>
>>>> === optimizations ===
>>>>
>>>> Although the big picture can be realized by completing the TODO list
>>>> above, we
>>>> need some optimizations/enhancements to make Kemari useful in real
>>>> world, and
>>>> these are items what needs to be done for that.
>>>>
>>>> TODO:
>>>> - SMP (for the sake of performance might need to implement a
>>>> synchronization protocol that can maintain two or more
>>>> synchronization points active at any given moment)
>>>> - VGA (leverage VNC's subtilting mechanism to identify fb pages that
>>>> are really dirty).
>>>>
>>>>
>>>> Any comments/suggestions would be greatly appreciated.
>>>>
>>>> Thanks,
>>>>
>>>> Yoshi
>>>>
>>>> --
>>>>
>>>> Kemari starts synchronizing VMs when QEMU handles I/O requests.
>>>> Without this patch VCPU state is already proceeded before
>>>> synchronization, and after failover to the VM on the receiver, it
>>>> hangs because of this.
>>>>
>>>> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>>>> ---
>>>> arch/x86/include/asm/kvm_host.h | 1 +
>>>> arch/x86/kvm/svm.c | 11 ++++++++---
>>>> arch/x86/kvm/vmx.c | 11 ++++++++---
>>>> arch/x86/kvm/x86.c | 4 ++++
>>>> 4 files changed, 21 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/arch/x86/include/asm/kvm_host.h
>>>> b/arch/x86/include/asm/kvm_host.h
>>>> index 26c629a..7b8f514 100644
>>>> --- a/arch/x86/include/asm/kvm_host.h
>>>> +++ b/arch/x86/include/asm/kvm_host.h
>>>> @@ -227,6 +227,7 @@ struct kvm_pio_request {
>>>> int in;
>>>> int port;
>>>> int size;
>>>> + bool lazy_skip;
>>>> };
>>>>
>>>> /*
>>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>>> index d04c7ad..e373245 100644
>>>> --- a/arch/x86/kvm/svm.c
>>>> +++ b/arch/x86/kvm/svm.c
>>>> @@ -1495,7 +1495,7 @@ static int io_interception(struct vcpu_svm *svm)
>>>> {
>>>> struct kvm_vcpu *vcpu =&svm->vcpu;
>>>> u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
>>>> - int size, in, string;
>>>> + int size, in, string, ret;
>>>> unsigned port;
>>>>
>>>> ++svm->vcpu.stat.io_exits;
>>>> @@ -1507,9 +1507,14 @@ static int io_interception(struct vcpu_svm *svm)
>>>> port = io_info>> 16;
>>>> size = (io_info& SVM_IOIO_SIZE_MASK)>> SVM_IOIO_SIZE_SHIFT;
>>>> svm->next_rip = svm->vmcb->control.exit_info_2;
>>>> - skip_emulated_instruction(&svm->vcpu);
>>>>
>>>> - return kvm_fast_pio_out(vcpu, size, port);
>>>> + ret = kvm_fast_pio_out(vcpu, size, port);
>>>> + if (ret)
>>>> + skip_emulated_instruction(&svm->vcpu);
>>>> + else
>>>> + vcpu->arch.pio.lazy_skip = true;
>>>> +
>>>> + return ret;
>>>> }
>>>>
>>>> static int nmi_interception(struct vcpu_svm *svm)
>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>>> index 41e63bb..09052d6 100644
>>>> --- a/arch/x86/kvm/vmx.c
>>>> +++ b/arch/x86/kvm/vmx.c
>>>> @@ -2975,7 +2975,7 @@ static int handle_triple_fault(struct kvm_vcpu
>>>> *vcpu)
>>>> static int handle_io(struct kvm_vcpu *vcpu)
>>>> {
>>>> unsigned long exit_qualification;
>>>> - int size, in, string;
>>>> + int size, in, string, ret;
>>>> unsigned port;
>>>>
>>>> exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
>>>> @@ -2989,9 +2989,14 @@ static int handle_io(struct kvm_vcpu *vcpu)
>>>>
>>>> port = exit_qualification>> 16;
>>>> size = (exit_qualification& 7) + 1;
>>>> - skip_emulated_instruction(vcpu);
>>>>
>>>> - return kvm_fast_pio_out(vcpu, size, port);
>>>> + ret = kvm_fast_pio_out(vcpu, size, port);
>>>> + if (ret)
>>>> + skip_emulated_instruction(vcpu);
>>>> + else
>>>> + vcpu->arch.pio.lazy_skip = true;
>>>> +
>>>> + return ret;
>>>> }
>>>>
>>>> static void
>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>> index fd5c3d3..cc308d2 100644
>>>> --- a/arch/x86/kvm/x86.c
>>>> +++ b/arch/x86/kvm/x86.c
>>>> @@ -4544,6 +4544,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu
>>>> *vcpu, struct kvm_run *kvm_run)
>>>> if (!irqchip_in_kernel(vcpu->kvm))
>>>> kvm_set_cr8(vcpu, kvm_run->cr8);
>>>>
>>>> + if (vcpu->arch.pio.lazy_skip)
>>>> + kvm_x86_ops->skip_emulated_instruction(vcpu);
>>>> + vcpu->arch.pio.lazy_skip = false;
>>>> +
>>>> if (vcpu->arch.pio.count || vcpu->mmio_needed ||
>>>> vcpu->arch.emulate_ctxt.restart) {
>>>> if (vcpu->mmio_needed) {
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>
>
^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
  2010-04-22 10:35   ` Yoshiaki Tamura
  2010-04-22 11:36     ` Takuya Yoshikawa
  2010-04-22 12:19     ` Dor Laor
@ 2010-04-22 16:15     ` Jamie Lokier
  2010-04-23  0:20       ` Yoshiaki Tamura
  2 siblings, 1 reply; 74+ messages in thread
From: Jamie Lokier @ 2010-04-22 16:15 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: ohmura.kei, mtosatti, kvm, dlaor, qemu-devel, yoshikawa.takuya,
	aliguori, avi
Yoshiaki Tamura wrote:
> Dor Laor wrote:
> >On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
> >>Event tapping is the core component of Kemari, and it decides on which
> >>event the
> >>primary should synchronize with the secondary. The basic assumption
> >>here is
> >>that outgoing I/O operations are idempotent, which is usually true for
> >>disk I/O
> >>and reliable network protocols such as TCP.
> >
> >IMO any type of network even should be stalled too. What if the VM runs
> >non tcp protocol and the packet that the master node sent reached some
> >remote client and before the sync to the slave the master failed?
> 
> In current implementation, it is actually stalling any type of network 
> that goes through virtio-net.
> 
> However, if the application was using unreliable protocols, it should have 
> its own recovering mechanism, or it should be completely stateless.
Even with unreliable protocols, if slave takeover causes the receiver
to have received a packet that the sender _does not think it has ever
sent_, expect some protocols to break.
If the slave replaying master's behaviour since the last sync means it
will definitely get into the same state of having sent the packet,
that works out.
But you still have to be careful that the other end's responses to
that packet are not seen by the slave too early during that replay.
Otherwise, for example, the slave may observe a TCP ACK to a packet
that it hasn't yet sent, which is an error.
About IP idempotency:
In general, IP packets are allowed to be lost or duplicated in the
network.  All IP protocols should be prepared for that; it is a basic
property.
However there is one respect in which they're not idempotent:
The TTL field should be decreased if packets are delayed.  Packets
should not appear to live in the network for longer than TTL seconds.
If they do, some protocols (like TCP) can react to the delayed ones
differently, such as sending a RST packet and breaking a connection.
It is acceptable to reduce TTL faster than the minimum.  After all, it
is reduced by 1 on every forwarding hop, in addition to time delays.
> I currently don't have good numbers that I can share right now.
> Snapshots/sec depends on what kind of workload is running, and if the 
> guest was almost idle, there will be no snapshots in 5sec.  On the other 
> hand, if the guest was running I/O intensive workloads (netperf, iozone 
> for example), there will be about 50 snapshots/sec.
That is a really satisfying number, thank you :-)
Without this work I wouldn't have imagined that synchronised machines
could work with such a low transaction rate.
-- Jamie
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 01/20] Modify DIRTY_FLAG value and introduce DIRTY_IDX to use as indexes of bit-based phys_ram_dirty.
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 01/20] Modify DIRTY_FLAG value and introduce DIRTY_IDX to use as indexes of bit-based phys_ram_dirty Yoshiaki Tamura
@ 2010-04-22 19:26   ` Anthony Liguori
  2010-04-23  2:09     ` Yoshiaki Tamura
  0 siblings, 1 reply; 74+ messages in thread
From: Anthony Liguori @ 2010-04-22 19:26 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: ohmura.kei, kvm, mtosatti, Anthony Liguori, qemu-devel,
	yoshikawa.takuya, avi
Hi,
On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:
> Replaces byte-based phys_ram_dirty bitmap with four (MASTER, VGA,
> CODE, MIGRATION) bit-based phys_ram_dirty bitmap.  On allocation, it
> sets all bits in the bitmap.  It uses ffs() to convert DIRTY_FLAG to
> DIRTY_IDX.
>
> Modifies wrapper functions for byte-based phys_ram_dirty bitmap to
> bit-based phys_ram_dirty bitmap.  MASTER works as a buffer, and upon
> get_diry() or get_dirty_flags(), it calls
> cpu_physical_memory_sync_master() to update VGA and MIGRATION.
>    
Why use an additional bitmap for MASTER instead of just updating the 
VGA, CODE, and MIGRATION bitmaps together?
Regards,
Anthony Liguori
> Replaces direct phys_ram_dirty access with wrapper functions to
> prevent direct access to the phys_ram_dirty bitmap.
>
> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
> Signed-off-by: OHMURA Kei<ohmura.kei@lab.ntt.co.jp>
> ---
>   cpu-all.h |  130 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
>   exec.c    |   60 ++++++++++++++--------------
>   2 files changed, 152 insertions(+), 38 deletions(-)
>
> diff --git a/cpu-all.h b/cpu-all.h
> index 51effc0..3f8762d 100644
> --- a/cpu-all.h
> +++ b/cpu-all.h
> @@ -37,6 +37,9 @@
>
>   #include "softfloat.h"
>
> +/* to use ffs in flag_to_idx() */
> +#include<strings.h>
> +
>   #if defined(HOST_WORDS_BIGENDIAN) != defined(TARGET_WORDS_BIGENDIAN)
>   #define BSWAP_NEEDED
>   #endif
> @@ -846,7 +849,6 @@ int cpu_str_to_log_mask(const char *str);
>   /* memory API */
>
>   extern int phys_ram_fd;
> -extern uint8_t *phys_ram_dirty;
>   extern ram_addr_t ram_size;
>   extern ram_addr_t last_ram_offset;
>   extern uint8_t *bios_mem;
> @@ -869,28 +871,140 @@ extern uint8_t *bios_mem;
>   /* Set if TLB entry is an IO callback.  */
>   #define TLB_MMIO        (1<<  5)
>
> +/* Use DIRTY_IDX as indexes of bit-based phys_ram_dirty. */
> +#define MASTER_DIRTY_IDX    0
> +#define VGA_DIRTY_IDX       1
> +#define CODE_DIRTY_IDX      2
> +#define MIGRATION_DIRTY_IDX 3
> +#define NUM_DIRTY_IDX       4
> +
> +#define MASTER_DIRTY_FLAG    (1<<  MASTER_DIRTY_IDX)
> +#define VGA_DIRTY_FLAG       (1<<  VGA_DIRTY_IDX)
> +#define CODE_DIRTY_FLAG      (1<<  CODE_DIRTY_IDX)
> +#define MIGRATION_DIRTY_FLAG (1<<  MIGRATION_DIRTY_IDX)
> +
> +extern unsigned long *phys_ram_dirty[NUM_DIRTY_IDX];
> +
> +static inline int dirty_flag_to_idx(int flag)
> +{
> +    return ffs(flag) - 1;
> +}
> +
> +static inline int dirty_idx_to_flag(int idx)
> +{
> +    return 1<<  idx;
> +}
> +
>   int cpu_memory_rw_debug(CPUState *env, target_ulong addr,
>                           uint8_t *buf, int len, int is_write);
>
> -#define VGA_DIRTY_FLAG       0x01
> -#define CODE_DIRTY_FLAG      0x02
> -#define MIGRATION_DIRTY_FLAG 0x08
> -
>   /* read dirty bit (return 0 or 1) */
>   static inline int cpu_physical_memory_is_dirty(ram_addr_t addr)
>   {
> -    return phys_ram_dirty[addr>>  TARGET_PAGE_BITS] == 0xff;
> +    unsigned long mask;
> +    ram_addr_t index = (addr>>  TARGET_PAGE_BITS) / HOST_LONG_BITS;
> +    int offset = (addr>>  TARGET_PAGE_BITS)&  (HOST_LONG_BITS - 1);
> +
> +    mask = 1UL<<  offset;
> +    return (phys_ram_dirty[MASTER_DIRTY_IDX][index]&  mask) == mask;
> +}
> +
> +static inline void cpu_physical_memory_sync_master(ram_addr_t index)
> +{
> +    if (phys_ram_dirty[MASTER_DIRTY_IDX][index]) {
> +        phys_ram_dirty[VGA_DIRTY_IDX][index]
> +            |=  phys_ram_dirty[MASTER_DIRTY_IDX][index];
> +        phys_ram_dirty[MIGRATION_DIRTY_IDX][index]
> +            |=  phys_ram_dirty[MASTER_DIRTY_IDX][index];
> +        phys_ram_dirty[MASTER_DIRTY_IDX][index] = 0UL;
> +    }
> +}
> +
> +static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t addr)
> +{
> +    unsigned long mask;
> +    ram_addr_t index = (addr>>  TARGET_PAGE_BITS) / HOST_LONG_BITS;
> +    int offset = (addr>>  TARGET_PAGE_BITS)&  (HOST_LONG_BITS - 1);
> +    int ret = 0, i;
> +
> +    mask = 1UL<<  offset;
> +    cpu_physical_memory_sync_master(index);
> +
> +    for (i = VGA_DIRTY_IDX; i<= MIGRATION_DIRTY_IDX; i++) {
> +        if (phys_ram_dirty[i][index]&  mask) {
> +            ret |= dirty_idx_to_flag(i);
> +        }
> +    }
> +
> +    return ret;
> +}
> +
> +static inline int cpu_physical_memory_get_dirty_idx(ram_addr_t addr,
> +                                                    int dirty_idx)
> +{
> +    unsigned long mask;
> +    ram_addr_t index = (addr>>  TARGET_PAGE_BITS) / HOST_LONG_BITS;
> +    int offset = (addr>>  TARGET_PAGE_BITS)&  (HOST_LONG_BITS - 1);
> +
> +    mask = 1UL<<  offset;
> +    cpu_physical_memory_sync_master(index);
> +    return (phys_ram_dirty[dirty_idx][index]&  mask) == mask;
>   }
>
>   static inline int cpu_physical_memory_get_dirty(ram_addr_t addr,
>                                                   int dirty_flags)
>   {
> -    return phys_ram_dirty[addr>>  TARGET_PAGE_BITS]&  dirty_flags;
> +    return cpu_physical_memory_get_dirty_idx(addr,
> +                                             dirty_flag_to_idx(dirty_flags));
>   }
>
>   static inline void cpu_physical_memory_set_dirty(ram_addr_t addr)
>   {
> -    phys_ram_dirty[addr>>  TARGET_PAGE_BITS] = 0xff;
> +    unsigned long mask;
> +    ram_addr_t index = (addr>>  TARGET_PAGE_BITS) / HOST_LONG_BITS;
> +    int offset = (addr>>  TARGET_PAGE_BITS)&  (HOST_LONG_BITS - 1);
> +
> +    mask = 1UL<<  offset;
> +    phys_ram_dirty[MASTER_DIRTY_IDX][index] |= mask;
> +}
> +
> +static inline void cpu_physical_memory_set_dirty_range(ram_addr_t addr,
> +                                                       unsigned long mask)
> +{
> +    ram_addr_t index = (addr>>  TARGET_PAGE_BITS) / HOST_LONG_BITS;
> +
> +    phys_ram_dirty[MASTER_DIRTY_IDX][index] |= mask;
> +}
> +
> +static inline void cpu_physical_memory_set_dirty_flags(ram_addr_t addr,
> +                                                       int dirty_flags)
> +{
> +    unsigned long mask;
> +    ram_addr_t index = (addr>>  TARGET_PAGE_BITS) / HOST_LONG_BITS;
> +    int offset = (addr>>  TARGET_PAGE_BITS)&  (HOST_LONG_BITS - 1);
> +
> +    mask = 1UL<<  offset;
> +    phys_ram_dirty[MASTER_DIRTY_IDX][index] |= mask;
> +
> +    if (dirty_flags&  CODE_DIRTY_FLAG) {
> +        phys_ram_dirty[CODE_DIRTY_IDX][index] |= mask;
> +    }
> +}
> +
> +static inline void cpu_physical_memory_mask_dirty_range(ram_addr_t start,
> +                                                        unsigned long length,
> +                                                        int dirty_flags)
> +{
> +    ram_addr_t addr = start, index;
> +    unsigned long mask;
> +    int offset, i;
> +
> +    for (i = 0;  i<  length; i += TARGET_PAGE_SIZE) {
> +        index = ((addr + i)>>  TARGET_PAGE_BITS) / HOST_LONG_BITS;
> +        offset = ((addr + i)>>  TARGET_PAGE_BITS)&  (HOST_LONG_BITS - 1);
> +        mask = ~(1UL<<  offset);
> +        phys_ram_dirty[dirty_flag_to_idx(dirty_flags)][index]&= mask;
> +    }
>   }
>
>   void cpu_physical_memory_reset_dirty(ram_addr_t start, ram_addr_t end,
> diff --git a/exec.c b/exec.c
> index b647512..bf8d703 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -119,7 +119,7 @@ uint8_t *code_gen_ptr;
>
>   #if !defined(CONFIG_USER_ONLY)
>   int phys_ram_fd;
> -uint8_t *phys_ram_dirty;
> +unsigned long *phys_ram_dirty[NUM_DIRTY_IDX];
>   uint8_t *bios_mem;
>   static int in_migration;
>
> @@ -1947,7 +1947,7 @@ static void tlb_protect_code(ram_addr_t ram_addr)
>   static void tlb_unprotect_code_phys(CPUState *env, ram_addr_t ram_addr,
>                                       target_ulong vaddr)
>   {
> -    phys_ram_dirty[ram_addr>>  TARGET_PAGE_BITS] |= CODE_DIRTY_FLAG;
> +    cpu_physical_memory_set_dirty_flags(ram_addr, CODE_DIRTY_FLAG);
>   }
>
>   static inline void tlb_reset_dirty_range(CPUTLBEntry *tlb_entry,
> @@ -1968,8 +1968,7 @@ void cpu_physical_memory_reset_dirty(ram_addr_t start, ram_addr_t end,
>   {
>       CPUState *env;
>       unsigned long length, start1;
> -    int i, mask, len;
> -    uint8_t *p;
> +    int i;
>
>       start&= TARGET_PAGE_MASK;
>       end = TARGET_PAGE_ALIGN(end);
> @@ -1977,11 +1976,7 @@ void cpu_physical_memory_reset_dirty(ram_addr_t start, ram_addr_t end,
>       length = end - start;
>       if (length == 0)
>           return;
> -    len = length>>  TARGET_PAGE_BITS;
> -    mask = ~dirty_flags;
> -    p = phys_ram_dirty + (start>>  TARGET_PAGE_BITS);
> -    for(i = 0; i<  len; i++)
> -        p[i]&= mask;
> +    cpu_physical_memory_mask_dirty_range(start, length, dirty_flags);
>
>       /* we modify the TLB cache so that the dirty bit will be set again
>          when accessing the range */
> @@ -2643,6 +2638,7 @@ extern const char *mem_path;
>   ram_addr_t qemu_ram_alloc(ram_addr_t size)
>   {
>       RAMBlock *new_block;
> +    int i;
>
>       size = TARGET_PAGE_ALIGN(size);
>       new_block = qemu_malloc(sizeof(*new_block));
> @@ -2667,10 +2663,14 @@ ram_addr_t qemu_ram_alloc(ram_addr_t size)
>       new_block->next = ram_blocks;
>       ram_blocks = new_block;
>
> -    phys_ram_dirty = qemu_realloc(phys_ram_dirty,
> -        (last_ram_offset + size)>>  TARGET_PAGE_BITS);
> -    memset(phys_ram_dirty + (last_ram_offset>>  TARGET_PAGE_BITS),
> -           0xff, size>>  TARGET_PAGE_BITS);
> +    for (i = MASTER_DIRTY_IDX; i<  NUM_DIRTY_IDX; i++) {
> +        phys_ram_dirty[i]
> +            = qemu_realloc(phys_ram_dirty[i],
> +                           BITMAP_SIZE(last_ram_offset + size));
> +        memset((uint8_t *)phys_ram_dirty[i] + BITMAP_SIZE(last_ram_offset),
> +               0xff, BITMAP_SIZE(last_ram_offset + size)
> +               - BITMAP_SIZE(last_ram_offset));
> +    }
>
>       last_ram_offset += size;
>
> @@ -2833,16 +2833,16 @@ static void notdirty_mem_writeb(void *opaque, target_phys_addr_t ram_addr,
>                                   uint32_t val)
>   {
>       int dirty_flags;
> -    dirty_flags = phys_ram_dirty[ram_addr>>  TARGET_PAGE_BITS];
> +    dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
>       if (!(dirty_flags&  CODE_DIRTY_FLAG)) {
>   #if !defined(CONFIG_USER_ONLY)
>           tb_invalidate_phys_page_fast(ram_addr, 1);
> -        dirty_flags = phys_ram_dirty[ram_addr>>  TARGET_PAGE_BITS];
> +        dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
>   #endif
>       }
>       stb_p(qemu_get_ram_ptr(ram_addr), val);
>       dirty_flags |= (0xff&  ~CODE_DIRTY_FLAG);
> -    phys_ram_dirty[ram_addr>>  TARGET_PAGE_BITS] = dirty_flags;
> +    cpu_physical_memory_set_dirty_flags(ram_addr, dirty_flags);
>       /* we remove the notdirty callback only if the code has been
>          flushed */
>       if (dirty_flags == 0xff)
> @@ -2853,16 +2853,16 @@ static void notdirty_mem_writew(void *opaque, target_phys_addr_t ram_addr,
>                                   uint32_t val)
>   {
>       int dirty_flags;
> -    dirty_flags = phys_ram_dirty[ram_addr>>  TARGET_PAGE_BITS];
> +    dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
>       if (!(dirty_flags&  CODE_DIRTY_FLAG)) {
>   #if !defined(CONFIG_USER_ONLY)
>           tb_invalidate_phys_page_fast(ram_addr, 2);
> -        dirty_flags = phys_ram_dirty[ram_addr>>  TARGET_PAGE_BITS];
> +        dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
>   #endif
>       }
>       stw_p(qemu_get_ram_ptr(ram_addr), val);
>       dirty_flags |= (0xff&  ~CODE_DIRTY_FLAG);
> -    phys_ram_dirty[ram_addr>>  TARGET_PAGE_BITS] = dirty_flags;
> +    cpu_physical_memory_set_dirty_flags(ram_addr, dirty_flags);
>       /* we remove the notdirty callback only if the code has been
>          flushed */
>       if (dirty_flags == 0xff)
> @@ -2873,16 +2873,16 @@ static void notdirty_mem_writel(void *opaque, target_phys_addr_t ram_addr,
>                                   uint32_t val)
>   {
>       int dirty_flags;
> -    dirty_flags = phys_ram_dirty[ram_addr>>  TARGET_PAGE_BITS];
> +    dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
>       if (!(dirty_flags&  CODE_DIRTY_FLAG)) {
>   #if !defined(CONFIG_USER_ONLY)
>           tb_invalidate_phys_page_fast(ram_addr, 4);
> -        dirty_flags = phys_ram_dirty[ram_addr>>  TARGET_PAGE_BITS];
> +        dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
>   #endif
>       }
>       stl_p(qemu_get_ram_ptr(ram_addr), val);
>       dirty_flags |= (0xff&  ~CODE_DIRTY_FLAG);
> -    phys_ram_dirty[ram_addr>>  TARGET_PAGE_BITS] = dirty_flags;
> +    cpu_physical_memory_set_dirty_flags(ram_addr, dirty_flags);
>       /* we remove the notdirty callback only if the code has been
>          flushed */
>       if (dirty_flags == 0xff)
> @@ -3334,8 +3334,8 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
>                       /* invalidate code */
>                       tb_invalidate_phys_page_range(addr1, addr1 + l, 0);
>                       /* set dirty bit */
> -                    phys_ram_dirty[addr1>>  TARGET_PAGE_BITS] |=
> -                        (0xff&  ~CODE_DIRTY_FLAG);
> +                    cpu_physical_memory_set_dirty_flags(
> +                        addr1, (0xff&  ~CODE_DIRTY_FLAG));
>                   }
>   		/* qemu doesn't execute guest code directly, but kvm does
>   		   therefore flush instruction caches */
> @@ -3548,8 +3548,8 @@ void cpu_physical_memory_unmap(void *buffer, target_phys_addr_t len,
>                       /* invalidate code */
>                       tb_invalidate_phys_page_range(addr1, addr1 + l, 0);
>                       /* set dirty bit */
> -                    phys_ram_dirty[addr1>>  TARGET_PAGE_BITS] |=
> -                        (0xff&  ~CODE_DIRTY_FLAG);
> +                    cpu_physical_memory_set_dirty_flags(
> +                        addr1, (0xff&  ~CODE_DIRTY_FLAG));
>                   }
>                   addr1 += l;
>                   access_len -= l;
> @@ -3685,8 +3685,8 @@ void stl_phys_notdirty(target_phys_addr_t addr, uint32_t val)
>                   /* invalidate code */
>                   tb_invalidate_phys_page_range(addr1, addr1 + 4, 0);
>                   /* set dirty bit */
> -                phys_ram_dirty[addr1>>  TARGET_PAGE_BITS] |=
> -                    (0xff&  ~CODE_DIRTY_FLAG);
> +                cpu_physical_memory_set_dirty_flags(
> +                    addr1, (0xff&  ~CODE_DIRTY_FLAG));
>               }
>           }
>       }
> @@ -3754,8 +3754,8 @@ void stl_phys(target_phys_addr_t addr, uint32_t val)
>               /* invalidate code */
>               tb_invalidate_phys_page_range(addr1, addr1 + 4, 0);
>               /* set dirty bit */
> -            phys_ram_dirty[addr1>>  TARGET_PAGE_BITS] |=
> -                (0xff&  ~CODE_DIRTY_FLAG);
> +            cpu_physical_memory_set_dirty_flags(addr1,
> +                (0xff&  ~CODE_DIRTY_FLAG));
>           }
>       }
>   }
>    
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops().
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops() Yoshiaki Tamura
@ 2010-04-22 19:28   ` Anthony Liguori
  2010-04-23  3:37     ` Yoshiaki Tamura
  0 siblings, 1 reply; 74+ messages in thread
From: Anthony Liguori @ 2010-04-22 19:28 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: ohmura.kei, kvm, mtosatti, Anthony Liguori, qemu-devel,
	yoshikawa.takuya, avi
On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:
> QEMUFile currently doesn't support writev().  For sending multiple
> data, such as pages, using writev() should be more efficient.
>
> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>    
Is there performance data that backs this up?  Since QEMUFile uses a 
linear buffer for most operations that's limited to 16k, I suspect you 
wouldn't be able to observe a difference in practice.
Regards,
Anthony Liguori
> ---
>   buffered_file.c |    2 +-
>   hw/hw.h         |   16 ++++++++++++++++
>   savevm.c        |   43 +++++++++++++++++++++++++------------------
>   3 files changed, 42 insertions(+), 19 deletions(-)
>
> diff --git a/buffered_file.c b/buffered_file.c
> index 54dc6c2..187d1d4 100644
> --- a/buffered_file.c
> +++ b/buffered_file.c
> @@ -256,7 +256,7 @@ QEMUFile *qemu_fopen_ops_buffered(void *opaque,
>       s->wait_for_unfreeze = wait_for_unfreeze;
>       s->close = close;
>
> -    s->file = qemu_fopen_ops(s, buffered_put_buffer, NULL,
> +    s->file = qemu_fopen_ops(s, buffered_put_buffer, NULL, NULL, NULL,
>                                buffered_close, buffered_rate_limit,
>                                buffered_set_rate_limit,
>   			     buffered_get_rate_limit);
> diff --git a/hw/hw.h b/hw/hw.h
> index fc9ed29..921cf90 100644
> --- a/hw/hw.h
> +++ b/hw/hw.h
> @@ -23,6 +23,13 @@
>   typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf,
>                                       int64_t pos, int size);
>
> +/* This function writes a chunk of vector to a file at the given position.
> + * The pos argument can be ignored if the file is only being used for
> + * streaming.
> + */
> +typedef int (QEMUFilePutVectorFunc)(void *opaque, struct iovec *iov,
> +                                    int64_t pos, int iovcnt);
> +
>   /* Read a chunk of data from a file at the given position.  The pos argument
>    * can be ignored if the file is only be used for streaming.  The number of
>    * bytes actually read should be returned.
> @@ -30,6 +37,13 @@ typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf,
>   typedef int (QEMUFileGetBufferFunc)(void *opaque, uint8_t *buf,
>                                       int64_t pos, int size);
>
> +/* Read a chunk of vector from a file at the given position.  The pos argument
> + * can be ignored if the file is only be used for streaming.  The number of
> + * bytes actually read should be returned.
> + */
> +typedef int (QEMUFileGetVectorFunc)(void *opaque, struct iovec *iov,
> +                                    int64_t pos, int iovcnt);
> +
>   /* Close a file and return an error code */
>   typedef int (QEMUFileCloseFunc)(void *opaque);
>
> @@ -46,7 +60,9 @@ typedef size_t (QEMUFileSetRateLimit)(void *opaque, size_t new_rate);
>   typedef size_t (QEMUFileGetRateLimit)(void *opaque);
>
>   QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
> +                         QEMUFilePutVectorFunc *put_vector,
>                            QEMUFileGetBufferFunc *get_buffer,
> +                         QEMUFileGetVectorFunc *get_vector,
>                            QEMUFileCloseFunc *close,
>                            QEMUFileRateLimit *rate_limit,
>                            QEMUFileSetRateLimit *set_rate_limit,
> diff --git a/savevm.c b/savevm.c
> index 490ab70..944e788 100644
> --- a/savevm.c
> +++ b/savevm.c
> @@ -162,7 +162,9 @@ void qemu_announce_self(void)
>
>   struct QEMUFile {
>       QEMUFilePutBufferFunc *put_buffer;
> +    QEMUFilePutVectorFunc *put_vector;
>       QEMUFileGetBufferFunc *get_buffer;
> +    QEMUFileGetVectorFunc *get_vector;
>       QEMUFileCloseFunc *close;
>       QEMUFileRateLimit *rate_limit;
>       QEMUFileSetRateLimit *set_rate_limit;
> @@ -263,11 +265,11 @@ QEMUFile *qemu_popen(FILE *stdio_file, const char *mode)
>       s->stdio_file = stdio_file;
>
>       if(mode[0] == 'r') {
> -        s->file = qemu_fopen_ops(s, NULL, stdio_get_buffer, stdio_pclose,
> -				 NULL, NULL, NULL);
> +        s->file = qemu_fopen_ops(s, NULL, NULL, stdio_get_buffer,
> +                 NULL, stdio_pclose, NULL, NULL, NULL);
>       } else {
> -        s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, stdio_pclose,
> -				 NULL, NULL, NULL);
> +        s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, NULL, NULL,
> +                 stdio_pclose, NULL, NULL, NULL);
>       }
>       return s->file;
>   }
> @@ -312,11 +314,11 @@ QEMUFile *qemu_fdopen(int fd, const char *mode)
>           goto fail;
>
>       if(mode[0] == 'r') {
> -        s->file = qemu_fopen_ops(s, NULL, stdio_get_buffer, stdio_fclose,
> -				 NULL, NULL, NULL);
> +        s->file = qemu_fopen_ops(s, NULL, NULL, stdio_get_buffer, NULL,
> +                 stdio_fclose, NULL, NULL, NULL);
>       } else {
> -        s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, stdio_fclose,
> -				 NULL, NULL, NULL);
> +        s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, NULL, NULL,
> +                 stdio_fclose, NULL, NULL, NULL);
>       }
>       return s->file;
>
> @@ -330,8 +332,8 @@ QEMUFile *qemu_fopen_socket(int fd)
>       QEMUFileSocket *s = qemu_mallocz(sizeof(QEMUFileSocket));
>
>       s->fd = fd;
> -    s->file = qemu_fopen_ops(s, NULL, socket_get_buffer, socket_close,
> -			     NULL, NULL, NULL);
> +    s->file = qemu_fopen_ops(s, NULL, NULL, socket_get_buffer, NULL,
> +                             socket_close, NULL, NULL, NULL);
>       return s->file;
>   }
>
> @@ -368,11 +370,11 @@ QEMUFile *qemu_fopen(const char *filename, const char *mode)
>           goto fail;
>
>       if(mode[0] == 'w') {
> -        s->file = qemu_fopen_ops(s, file_put_buffer, NULL, stdio_fclose,
> -				 NULL, NULL, NULL);
> +        s->file = qemu_fopen_ops(s, file_put_buffer, NULL, NULL, NULL,
> +                  stdio_fclose, NULL, NULL, NULL);
>       } else {
> -        s->file = qemu_fopen_ops(s, NULL, file_get_buffer, stdio_fclose,
> -			       NULL, NULL, NULL);
> +        s->file = qemu_fopen_ops(s, NULL, NULL, file_get_buffer, NULL,
> +                  stdio_fclose, NULL, NULL, NULL);
>       }
>       return s->file;
>   fail:
> @@ -400,13 +402,16 @@ static int bdrv_fclose(void *opaque)
>   static QEMUFile *qemu_fopen_bdrv(BlockDriverState *bs, int is_writable)
>   {
>       if (is_writable)
> -        return qemu_fopen_ops(bs, block_put_buffer, NULL, bdrv_fclose,
> -			      NULL, NULL, NULL);
> -    return qemu_fopen_ops(bs, NULL, block_get_buffer, bdrv_fclose, NULL, NULL, NULL);
> +        return qemu_fopen_ops(bs, block_put_buffer, NULL, NULL, NULL,
> +                  bdrv_fclose, NULL, NULL, NULL);
> +    return qemu_fopen_ops(bs, NULL, NULL, block_get_buffer, NULL, bdrv_fclose, NULL, NULL, NULL);
>   }
>
> -QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
> +QEMUFile *qemu_fopen_ops(void *opaque,
> +                         QEMUFilePutBufferFunc *put_buffer,
> +                         QEMUFilePutVectorFunc *put_vector,
>                            QEMUFileGetBufferFunc *get_buffer,
> +                         QEMUFileGetVectorFunc *get_vector,
>                            QEMUFileCloseFunc *close,
>                            QEMUFileRateLimit *rate_limit,
>                            QEMUFileSetRateLimit *set_rate_limit,
> @@ -418,7 +423,9 @@ QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
>
>       f->opaque = opaque;
>       f->put_buffer = put_buffer;
> +    f->put_vector = put_vector;
>       f->get_buffer = get_buffer;
> +    f->get_vector = get_vector;
>       f->close = close;
>       f->rate_limit = rate_limit;
>       f->set_rate_limit = set_rate_limit;
>    
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 07/20] Introduce qemu_put_vector() and qemu_put_vector_prepare() to use put_vector() in QEMUFile.
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 07/20] Introduce qemu_put_vector() and qemu_put_vector_prepare() to use put_vector() in QEMUFile Yoshiaki Tamura
@ 2010-04-22 19:29   ` Anthony Liguori
  2010-04-23  4:02     ` Yoshiaki Tamura
  0 siblings, 1 reply; 74+ messages in thread
From: Anthony Liguori @ 2010-04-22 19:29 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: ohmura.kei, kvm, mtosatti, Anthony Liguori, qemu-devel,
	yoshikawa.takuya, avi
On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:
> For fool proof purpose, qemu_put_vector_parepare should be called
> before qemu_put_vector.  Then, if qemu_put_* functions except this is
> called after qemu_put_vector_prepare, program will abort().
>
> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>    
I don't get it.  What's this protecting against?
Regards,
Anthony Liguori
> ---
>   hw/hw.h  |    2 ++
>   savevm.c |   53 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 55 insertions(+), 0 deletions(-)
>
> diff --git a/hw/hw.h b/hw/hw.h
> index 921cf90..10e6dda 100644
> --- a/hw/hw.h
> +++ b/hw/hw.h
> @@ -77,6 +77,8 @@ void qemu_fflush(QEMUFile *f);
>   int qemu_fclose(QEMUFile *f);
>   void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
>   void qemu_put_byte(QEMUFile *f, int v);
> +void qemu_put_vector(QEMUFile *f, QEMUIOVector *qiov);
> +void qemu_put_vector_prepare(QEMUFile *f);
>   void *qemu_realloc_buffer(QEMUFile *f, int size);
>   void qemu_clear_buffer(QEMUFile *f);
>
> diff --git a/savevm.c b/savevm.c
> index 944e788..22d928c 100644
> --- a/savevm.c
> +++ b/savevm.c
> @@ -180,6 +180,7 @@ struct QEMUFile {
>       uint8_t *buf;
>
>       int has_error;
> +    int prepares_vector;
>   };
>
>   typedef struct QEMUFileStdio
> @@ -557,6 +558,58 @@ void qemu_put_byte(QEMUFile *f, int v)
>           qemu_fflush(f);
>   }
>
> +void qemu_put_vector(QEMUFile *f, QEMUIOVector *v)
> +{
> +    struct iovec *iov;
> +    int cnt;
> +    size_t bufsize;
> +    uint8_t *buf;
> +
> +    if (qemu_file_get_rate_limit(f) != 0) {
> +        fprintf(stderr,
> +                "Attempted to write vector while bandwidth limit is not zero.\n");
> +        abort();
> +    }
> +
> +    /* checks prepares vector.
> +     * For fool proof purpose, qemu_put_vector_parepare should be called
> +     * before qemu_put_vector.  Then, if qemu_put_* functions except this
> +     * is called after qemu_put_vector_prepare, program will abort().
> +     */
> +    if (!f->prepares_vector) {
> +        fprintf(stderr,
> +            "You should prepare with qemu_put_vector_prepare.\n");
> +        abort();
> +    } else if (f->prepares_vector&&  f->buf_index != 0) {
> +        fprintf(stderr, "Wrote data after qemu_put_vector_prepare.\n");
> +        abort();
> +    }
> +    f->prepares_vector = 0;
> +
> +    if (f->put_vector) {
> +        qemu_iovec_to_vector(v,&iov,&cnt);
> +        f->put_vector(f->opaque, iov, 0, cnt);
> +    } else {
> +        qemu_iovec_to_size(v,&bufsize);
> +        buf = qemu_malloc(bufsize + 1 /* for '\0' */);
> +        qemu_iovec_to_buffer(v, buf);
> +        qemu_put_buffer(f, buf, bufsize);
> +        qemu_free(buf);
> +    }
> +
> +}
> +
> +void qemu_put_vector_prepare(QEMUFile *f)
> +{
> +    if (f->prepares_vector) {
> +        /* prepare vector */
> +        fprintf(stderr, "Attempted to prepare vector twice\n");
> +        abort();
> +    }
> +    f->prepares_vector = 1;
> +    qemu_fflush(f);
> +}
> +
>   int qemu_get_buffer(QEMUFile *f, uint8_t *buf, int size1)
>   {
>       int size, l;
>    
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 08/20] Introduce RAMSaveIO and use cpu_physical_memory_get_dirty_range() to check multiple dirty pages.
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 08/20] Introduce RAMSaveIO and use cpu_physical_memory_get_dirty_range() to check multiple dirty pages Yoshiaki Tamura
@ 2010-04-22 19:31   ` Anthony Liguori
  0 siblings, 0 replies; 74+ messages in thread
From: Anthony Liguori @ 2010-04-22 19:31 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: ohmura.kei, kvm, mtosatti, Anthony Liguori, qemu-devel,
	yoshikawa.takuya, avi
On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:
> Introduce RAMSaveIO to use writev for saving ram blocks, and modifies
> ram_save_block() and ram_save_remaining() to use
> cpu_physical_memory_get_dirty_range() to check multiple dirty and
> non-dirty pages at once.
>
> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
> Signed-off-by: OHMURA Kei<ohmura.kei@lab.ntt.co.jp>
>    
Perf data?
Regards,
Anthony Liguori
> ---
>   vl.c |  221 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
>   1 files changed, 197 insertions(+), 24 deletions(-)
>
> diff --git a/vl.c b/vl.c
> index 729c955..9c3dc4c 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -2774,12 +2774,167 @@ static int is_dup_page(uint8_t *page, uint8_t ch)
>       return 1;
>   }
>
> -static int ram_save_block(QEMUFile *f)
> +typedef struct RAMSaveIO RAMSaveIO;
> +
> +struct RAMSaveIO {
> +    QEMUFile *f;
> +    QEMUIOVector *qiov;
> +
> +    uint8_t *ram_store;
> +    size_t nalloc, nused;
> +    uint8_t io_mode;
> +
> +    void (*put_buffer)(RAMSaveIO *s, uint8_t *buf, size_t len);
> +    void (*put_byte)(RAMSaveIO *s, int v);
> +    void (*put_be64)(RAMSaveIO *s, uint64_t v);
> +
> +};
> +
> +static inline void ram_saveio_flush(RAMSaveIO *s, int prepare)
> +{
> +    qemu_put_vector(s->f, s->qiov);
> +    if (prepare)
> +        qemu_put_vector_prepare(s->f);
> +
> +    /* reset stored data */
> +    qemu_iovec_reset(s->qiov);
> +    s->nused = 0;
> +}
> +
> +static inline void ram_saveio_put_buffer(RAMSaveIO *s, uint8_t *buf, size_t len)
> +{
> +    s->put_buffer(s, buf, len);
> +}
> +
> +static inline void ram_saveio_put_byte(RAMSaveIO *s, int v)
> +{
> +    s->put_byte(s, v);
> +}
> +
> +static inline void ram_saveio_put_be64(RAMSaveIO *s, uint64_t v)
> +{
> +    s->put_be64(s, v);
> +}
> +
> +static inline void ram_saveio_set_error(RAMSaveIO *s)
> +{
> +    qemu_file_set_error(s->f);
> +}
> +
> +static void ram_saveio_put_buffer_vector(RAMSaveIO *s, uint8_t *buf, size_t len)
> +{
> +    qemu_iovec_add(s->qiov, buf, len);
> +}
> +
> +static void ram_saveio_put_buffer_direct(RAMSaveIO *s, uint8_t *buf, size_t len)
> +{
> +    qemu_put_buffer(s->f, buf, len);
> +}
> +
> +static void ram_saveio_put_byte_vector(RAMSaveIO *s, int v)
> +{
> +    uint8_t *to_save;
> +
> +    if (s->nalloc - s->nused<  sizeof(int))
> +        ram_saveio_flush(s, 1);
> +
> +    to_save =&s->ram_store[s->nused];
> +    to_save[0] = v&  0xff;
> +    s->nused++;
> +
> +    qemu_iovec_add(s->qiov, to_save, 1);
> +}
> +
> +static void ram_saveio_put_byte_direct(RAMSaveIO *s, int v)
> +{
> +    qemu_put_byte(s->f, v);
> +}
> +
> +static void ram_saveio_put_be64_vector(RAMSaveIO *s, uint64_t v)
> +{
> +    uint8_t *to_save;
> +
> +    if (s->nalloc - s->nused<  sizeof(uint64_t))
> +        ram_saveio_flush(s, 1);
> +
> +    to_save =&s->ram_store[s->nused];
> +    to_save[0] = (v>>  56)&  0xff;
> +    to_save[1] = (v>>  48)&  0xff;
> +    to_save[2] = (v>>  40)&  0xff;
> +    to_save[3] = (v>>  32)&  0xff;
> +    to_save[4] = (v>>  24)&  0xff;
> +    to_save[5] = (v>>  16)&  0xff;
> +    to_save[6] = (v>>   8)&  0xff;
> +    to_save[7] = (v>>   0)&  0xff;
> +    s->nused += sizeof(uint64_t);
> +
> +    qemu_iovec_add(s->qiov, to_save, sizeof(uint64_t));
> +}
> +
> +static void ram_saveio_put_be64_direct(RAMSaveIO *s, uint64_t v)
> +{
> +
> +    qemu_put_be64(s->f, v);
> +}
> +
> +static RAMSaveIO *ram_saveio_new(QEMUFile *f, size_t max_store)
> +{
> +    RAMSaveIO *s;
> +
> +    s = qemu_mallocz(sizeof(*s));
> +
> +    if (qemu_file_get_rate_limit(f) == 0) {/* non buffer mode */
> +        /* When QEMUFile don't have get_rate limit,
> +         * qemu_file_get_rate_limit will return 0.
> +         * However, we believe that all kinds of QEMUFile
> +         * except non-block mode has rate limit function.
> +         */
> +        s->io_mode = 1;
> +        s->ram_store = qemu_mallocz(max_store);
> +        s->nalloc = max_store;
> +        s->nused = 0;
> +
> +        s->qiov = qemu_mallocz(sizeof(*s->qiov));
> +        qemu_iovec_init(s->qiov, max_store);
> +
> +        s->put_buffer = ram_saveio_put_buffer_vector;
> +        s->put_byte = ram_saveio_put_byte_vector;
> +        s->put_be64 = ram_saveio_put_be64_vector;
> +
> +        qemu_put_vector_prepare(f);
> +    } else {
> +        s->io_mode = 0;
> +        s->put_buffer = ram_saveio_put_buffer_direct;
> +        s->put_byte = ram_saveio_put_byte_direct;
> +        s->put_be64 = ram_saveio_put_be64_direct;
> +    }
> +
> +    s->f = f;
> +
> +    return s;
> +}
> +
> +static void ram_saveio_destroy(RAMSaveIO *s)
> +{
> +    if (s->qiov != NULL) { /* means using put_vector */
> +        ram_saveio_flush(s, 0);
> +        qemu_iovec_destroy(s->qiov);
> +        qemu_free(s->qiov);
> +        qemu_free(s->ram_store);
> +    }
> +    qemu_free(s);
> +}
> +
> +/*
> + * RAMSaveIO will manage I/O.
> + */
> +static int ram_save_block(RAMSaveIO *s)
>   {
>       static ram_addr_t current_addr = 0;
>       ram_addr_t saved_addr = current_addr;
>       ram_addr_t addr = 0;
> -    int found = 0;
> +    ram_addr_t dirty_rams[HOST_LONG_BITS];
> +    int i, found = 0;
>
>       while (addr<  last_ram_offset) {
>           if (kvm_enabled()&&  current_addr == 0) {
> @@ -2787,32 +2942,38 @@ static int ram_save_block(QEMUFile *f)
>               r = kvm_update_dirty_pages_log();
>               if (r) {
>                   fprintf(stderr, "%s: update dirty pages log failed %d\n", __FUNCTION__, r);
> -                qemu_file_set_error(f);
> +                ram_saveio_set_error(s);
>                   return 0;
>               }
>           }
> -        if (cpu_physical_memory_get_dirty(current_addr, MIGRATION_DIRTY_FLAG)) {
> +        if ((found = cpu_physical_memory_get_dirty_range(
> +                 current_addr, last_ram_offset, dirty_rams, HOST_LONG_BITS,
> +                 MIGRATION_DIRTY_FLAG))) {
>               uint8_t *p;
>
> -            cpu_physical_memory_reset_dirty(current_addr,
> -                                            current_addr + TARGET_PAGE_SIZE,
> -                                            MIGRATION_DIRTY_FLAG);
> +            for (i = 0; i<  found; i++) {
> +                ram_addr_t page_addr = dirty_rams[i];
> +                cpu_physical_memory_reset_dirty(page_addr,
> +                                                page_addr + TARGET_PAGE_SIZE,
> +                                                MIGRATION_DIRTY_FLAG);
>
> -            p = qemu_get_ram_ptr(current_addr);
> +                p = qemu_get_ram_ptr(page_addr);
>
> -            if (is_dup_page(p, *p)) {
> -                qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_COMPRESS);
> -                qemu_put_byte(f, *p);
> -            } else {
> -                qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_PAGE);
> -                qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
> +                if (is_dup_page(p, *p)) {
> +                    ram_saveio_put_be64(s,
> +                                        (page_addr) | RAM_SAVE_FLAG_COMPRESS);
> +                    ram_saveio_put_byte(s, *p);
> +                } else {
> +                    ram_saveio_put_be64(s, (page_addr) | RAM_SAVE_FLAG_PAGE);
> +                    ram_saveio_put_buffer(s, p, TARGET_PAGE_SIZE);
> +                }
>               }
>
> -            found = 1;
>               break;
> +        } else {
> +            addr += dirty_rams[0];
> +            current_addr = (saved_addr + addr) % last_ram_offset;
>           }
> -        addr += TARGET_PAGE_SIZE;
> -        current_addr = (saved_addr + addr) % last_ram_offset;
>       }
>
>       return found;
> @@ -2822,12 +2983,19 @@ static uint64_t bytes_transferred;
>
>   static ram_addr_t ram_save_remaining(void)
>   {
> -    ram_addr_t addr;
> +    ram_addr_t addr = 0;
>       ram_addr_t count = 0;
> +    ram_addr_t dirty_rams[HOST_LONG_BITS];
> +    int found = 0;
>
> -    for (addr = 0; addr<  last_ram_offset; addr += TARGET_PAGE_SIZE) {
> -        if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG))
> -            count++;
> +    while (addr<  last_ram_offset) {
> +        if ((found = cpu_physical_memory_get_dirty_range(addr, last_ram_offset,
> +            dirty_rams, HOST_LONG_BITS, MIGRATION_DIRTY_FLAG))) {
> +            count += found;
> +            addr = dirty_rams[found - 1] + TARGET_PAGE_SIZE;
> +        } else {
> +            addr += dirty_rams[0];
> +        }
>       }
>
>       return count;
> @@ -2854,6 +3022,7 @@ static int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>       uint64_t bytes_transferred_last;
>       double bwidth = 0;
>       uint64_t expected_time = 0;
> +    RAMSaveIO *s;
>
>       if (stage<  0) {
>           cpu_physical_memory_set_dirty_tracking(0);
> @@ -2883,10 +3052,12 @@ static int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>       bytes_transferred_last = bytes_transferred;
>       bwidth = qemu_get_clock_ns(rt_clock);
>
> -    while (!qemu_file_rate_limit(f)) {
> +    s = ram_saveio_new(f, IOV_MAX);
> +
> +     while (!qemu_file_rate_limit(f)) {
>           int ret;
>
> -        ret = ram_save_block(f);
> +        ret = ram_save_block(s);
>           bytes_transferred += ret * TARGET_PAGE_SIZE;
>           if (ret == 0) /* no more blocks */
>               break;
> @@ -2903,12 +3074,14 @@ static int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
>       /* try transferring iterative blocks of memory */
>       if (stage == 3) {
>           /* flush all remaining blocks regardless of rate limiting */
> -        while (ram_save_block(f) != 0) {
> +        while (ram_save_block(s) != 0) {
>               bytes_transferred += TARGET_PAGE_SIZE;
>           }
>           cpu_physical_memory_set_dirty_tracking(0);
>       }
>
> +    ram_saveio_destroy(s);
> +
>       qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
>
>       expected_time = ram_save_remaining() * TARGET_PAGE_SIZE / bwidth;
>    
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 10/20] Introduce skip_header parameter to qemu_loadvm_state() so that it can be called iteratively without reading the header.
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 10/20] Introduce skip_header parameter to qemu_loadvm_state() so that it can be called iteratively without reading the header Yoshiaki Tamura
@ 2010-04-22 19:34   ` Anthony Liguori
  2010-04-23  4:25     ` Yoshiaki Tamura
  0 siblings, 1 reply; 74+ messages in thread
From: Anthony Liguori @ 2010-04-22 19:34 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: ohmura.kei, kvm, mtosatti, Anthony Liguori, qemu-devel,
	yoshikawa.takuya, avi
On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:
> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>    
I think the more appropriate thing to do is have 
qemu_savevm_state_complete() not write QEMU_VM_EOF when doing a 
continuous live migration.
You would then want qemu_loadvm_state() to detect real EOF and treat 
that the same as QEMU_VM_EOF (provided it occurred at a section boundary).
Of course, this should be a flag to qemu_loadvm_state() as it's not 
always the right behavior.
Regards,
Anthony Liguori
> ---
>   migration-exec.c |    2 +-
>   migration-fd.c   |    2 +-
>   migration-tcp.c  |    2 +-
>   migration-unix.c |    2 +-
>   savevm.c         |   25 ++++++++++++++-----------
>   sysemu.h         |    2 +-
>   6 files changed, 19 insertions(+), 16 deletions(-)
>
> diff --git a/migration-exec.c b/migration-exec.c
> index 3edc026..5839a6d 100644
> --- a/migration-exec.c
> +++ b/migration-exec.c
> @@ -113,7 +113,7 @@ static void exec_accept_incoming_migration(void *opaque)
>       QEMUFile *f = opaque;
>       int ret;
>
> -    ret = qemu_loadvm_state(f);
> +    ret = qemu_loadvm_state(f, 0);
>       if (ret<  0) {
>           fprintf(stderr, "load of migration failed\n");
>           goto err;
> diff --git a/migration-fd.c b/migration-fd.c
> index 0cc74ad..0e97ed0 100644
> --- a/migration-fd.c
> +++ b/migration-fd.c
> @@ -106,7 +106,7 @@ static void fd_accept_incoming_migration(void *opaque)
>       QEMUFile *f = opaque;
>       int ret;
>
> -    ret = qemu_loadvm_state(f);
> +    ret = qemu_loadvm_state(f, 0);
>       if (ret<  0) {
>           fprintf(stderr, "load of migration failed\n");
>           goto err;
> diff --git a/migration-tcp.c b/migration-tcp.c
> index 56e1a3b..94a1a03 100644
> --- a/migration-tcp.c
> +++ b/migration-tcp.c
> @@ -182,7 +182,7 @@ static void tcp_accept_incoming_migration(void *opaque)
>           goto out;
>       }
>
> -    ret = qemu_loadvm_state(f);
> +    ret = qemu_loadvm_state(f, 0);
>       if (ret<  0) {
>           fprintf(stderr, "load of migration failed\n");
>           goto out_fopen;
> diff --git a/migration-unix.c b/migration-unix.c
> index b7aab38..dd99a73 100644
> --- a/migration-unix.c
> +++ b/migration-unix.c
> @@ -168,7 +168,7 @@ static void unix_accept_incoming_migration(void *opaque)
>           goto out;
>       }
>
> -    ret = qemu_loadvm_state(f);
> +    ret = qemu_loadvm_state(f, 0);
>       if (ret<  0) {
>           fprintf(stderr, "load of migration failed\n");
>           goto out_fopen;
> diff --git a/savevm.c b/savevm.c
> index 22d928c..a401b27 100644
> --- a/savevm.c
> +++ b/savevm.c
> @@ -1554,7 +1554,7 @@ typedef struct LoadStateEntry {
>       int version_id;
>   } LoadStateEntry;
>
> -int qemu_loadvm_state(QEMUFile *f)
> +int qemu_loadvm_state(QEMUFile *f, int skip_header)
>   {
>       QLIST_HEAD(, LoadStateEntry) loadvm_handlers =
>           QLIST_HEAD_INITIALIZER(loadvm_handlers);
> @@ -1563,17 +1563,20 @@ int qemu_loadvm_state(QEMUFile *f)
>       unsigned int v;
>       int ret;
>
> -    v = qemu_get_be32(f);
> -    if (v != QEMU_VM_FILE_MAGIC)
> -        return -EINVAL;
> +    if (!skip_header) {
> +        v = qemu_get_be32(f);
> +        if (v != QEMU_VM_FILE_MAGIC)
> +            return -EINVAL;
> +
> +        v = qemu_get_be32(f);
> +        if (v == QEMU_VM_FILE_VERSION_COMPAT) {
> +            fprintf(stderr, "SaveVM v3 format is obsolete and don't work anymore\n");
> +            return -ENOTSUP;
> +        }
> +        if (v != QEMU_VM_FILE_VERSION)
> +            return -ENOTSUP;
>
> -    v = qemu_get_be32(f);
> -    if (v == QEMU_VM_FILE_VERSION_COMPAT) {
> -        fprintf(stderr, "SaveVM v2 format is obsolete and don't work anymore\n");
> -        return -ENOTSUP;
>       }
> -    if (v != QEMU_VM_FILE_VERSION)
> -        return -ENOTSUP;
>
>       while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
>           uint32_t instance_id, version_id, section_id;
> @@ -1898,7 +1901,7 @@ int load_vmstate(Monitor *mon, const char *name)
>           monitor_printf(mon, "Could not open VM state file\n");
>           return -EINVAL;
>       }
> -    ret = qemu_loadvm_state(f);
> +    ret = qemu_loadvm_state(f, 0);
>       qemu_fclose(f);
>       if (ret<  0) {
>           monitor_printf(mon, "Error %d while loading VM state\n", ret);
> diff --git a/sysemu.h b/sysemu.h
> index 647a468..6c1441f 100644
> --- a/sysemu.h
> +++ b/sysemu.h
> @@ -68,7 +68,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
>   int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
>   int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
>   void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
> -int qemu_loadvm_state(QEMUFile *f);
> +int qemu_loadvm_state(QEMUFile *f, int skip_header);
>
>   void qemu_errors_to_file(FILE *fp);
>   void qemu_errors_to_mon(Monitor *mon);
>    
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 14/20] Upgrade QEMU_FILE_VERSION from 3 to 4, and introduce qemu_savevm_state_all().
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 14/20] Upgrade QEMU_FILE_VERSION from 3 to 4, and introduce qemu_savevm_state_all() Yoshiaki Tamura
@ 2010-04-22 19:37   ` Anthony Liguori
  2010-04-23  3:29     ` Yoshiaki Tamura
  0 siblings, 1 reply; 74+ messages in thread
From: Anthony Liguori @ 2010-04-22 19:37 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: ohmura.kei, kvm, mtosatti, Anthony Liguori, qemu-devel,
	yoshikawa.takuya, avi
On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:
> Make a 32bit entry after QEMU_VM_FILE_VERSION to recognize whether the
> transfered data is QEMU_VM_FT_MODE or QEMU_VM_LIVE_MIGRATION_MODE.
>    
I'd rather you encapsulate the current protocol as opposed to extending 
it with a new version.
You could also do this by just having a new -incoming option, right?  
All you really need is to indicate that this is for high frequency 
checkpointing, right?
Regards,
Anthony Liguori
> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
> ---
>   savevm.c |   76 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>   sysemu.h |    1 +
>   2 files changed, 75 insertions(+), 2 deletions(-)
>
> diff --git a/savevm.c b/savevm.c
> index 292ae32..19b3efb 100644
> --- a/savevm.c
> +++ b/savevm.c
> @@ -1402,8 +1402,10 @@ static void vmstate_save(QEMUFile *f, SaveStateEntry *se)
>   }
>
>   #define QEMU_VM_FILE_MAGIC           0x5145564d
> -#define QEMU_VM_FILE_VERSION_COMPAT  0x00000002
> -#define QEMU_VM_FILE_VERSION         0x00000003
> +#define QEMU_VM_FILE_VERSION_COMPAT  0x00000003
> +#define QEMU_VM_FILE_VERSION         0x00000004
> +#define QEMU_VM_LIVE_MIGRATION_MODE  0x00000005
> +#define QEMU_VM_FT_MODE              0x00000006
>
>   #define QEMU_VM_EOF                  0x00
>   #define QEMU_VM_SECTION_START        0x01
> @@ -1425,6 +1427,12 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
>
>       qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
>       qemu_put_be32(f, QEMU_VM_FILE_VERSION);
> +
> +    if (ft_mode) {
> +        qemu_put_be32(f, QEMU_VM_FT_MODE);
> +    } else {
> +        qemu_put_be32(f, QEMU_VM_LIVE_MIGRATION_MODE);
> +    }
>
>       QTAILQ_FOREACH(se,&savevm_handlers, entry) {
>           int len;
> @@ -1533,6 +1541,66 @@ int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f)
>       return 0;
>   }
>
> +int qemu_savevm_state_all(Monitor *mon, QEMUFile *f)
> +{
> +    SaveStateEntry *se;
> +
> +    QTAILQ_FOREACH(se,&savevm_handlers, entry) {
> +        int len;
> +
> +        if (se->save_live_state == NULL)
> +            continue;
> +
> +        /* Section type */
> +        qemu_put_byte(f, QEMU_VM_SECTION_START);
> +        qemu_put_be32(f, se->section_id);
> +
> +        /* ID string */
> +        len = strlen(se->idstr);
> +        qemu_put_byte(f, len);
> +        qemu_put_buffer(f, (uint8_t *)se->idstr, len);
> +
> +        qemu_put_be32(f, se->instance_id);
> +        qemu_put_be32(f, se->version_id);
> +        if (ft_mode == FT_INIT) {
> +            /* This is workaround. */
> +            se->save_live_state(mon, f, QEMU_VM_SECTION_START, se->opaque);
> +        } else {
> +            se->save_live_state(mon, f, QEMU_VM_SECTION_PART, se->opaque);
> +        }
> +    }
> +
> +    ft_mode = FT_TRANSACTION;
> +    QTAILQ_FOREACH(se,&savevm_handlers, entry) {
> +        int len;
> +
> +	if (se->save_state == NULL&&  se->vmsd == NULL)
> +	    continue;
> +
> +        /* Section type */
> +        qemu_put_byte(f, QEMU_VM_SECTION_FULL);
> +        qemu_put_be32(f, se->section_id);
> +
> +        /* ID string */
> +        len = strlen(se->idstr);
> +        qemu_put_byte(f, len);
> +        qemu_put_buffer(f, (uint8_t *)se->idstr, len);
> +
> +        qemu_put_be32(f, se->instance_id);
> +        qemu_put_be32(f, se->version_id);
> +
> +        vmstate_save(f, se);
> +    }
> +
> +    qemu_put_byte(f, QEMU_VM_EOF);
> +
> +    if (qemu_file_has_error(f))
> +        return -EIO;
> +
> +    return 0;
> +}
> +
> +
>   void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f)
>   {
>       SaveStateEntry *se;
> @@ -1617,6 +1685,10 @@ int qemu_loadvm_state(QEMUFile *f, int skip_header)
>           if (v != QEMU_VM_FILE_VERSION)
>               return -ENOTSUP;
>
> +        v = qemu_get_be32(f);
> +        if (v == QEMU_VM_FT_MODE) {
> +            ft_mode = FT_INIT;
> +        }
>       }
>
>       while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
> diff --git a/sysemu.h b/sysemu.h
> index 6c1441f..df314bb 100644
> --- a/sysemu.h
> +++ b/sysemu.h
> @@ -67,6 +67,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
>                               int shared);
>   int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
>   int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
> +int qemu_savevm_state_all(Monitor *mon, QEMUFile *f);
>   void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
>   int qemu_loadvm_state(QEMUFile *f, int skip_header);
>
>    
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 15/20] Introduce FT mode support to configure.
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 15/20] Introduce FT mode support to configure Yoshiaki Tamura
@ 2010-04-22 19:38   ` Anthony Liguori
  2010-04-23  3:09     ` Yoshiaki Tamura
  0 siblings, 1 reply; 74+ messages in thread
From: Anthony Liguori @ 2010-04-22 19:38 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: ohmura.kei, kvm, mtosatti, Anthony Liguori, qemu-devel,
	yoshikawa.takuya, avi
On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:
> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>    
No need for this.
Regards,
Anthony Liguori
> ---
>   configure |    8 ++++++++
>   1 files changed, 8 insertions(+), 0 deletions(-)
>
> diff --git a/configure b/configure
> index 046c591..f0682d4 100755
> --- a/configure
> +++ b/configure
> @@ -298,6 +298,7 @@ bsd_user="no"
>   guest_base=""
>   uname_release=""
>   io_thread="no"
> +ft_mode="no"
>   mixemu="no"
>   kvm_trace="no"
>   kvm_cap_pit=""
> @@ -671,6 +672,8 @@ for opt do
>     ;;
>     --enable-io-thread) io_thread="yes"
>     ;;
> +  --enable-ft-mode) ft_mode="yes"
> +  ;;
>     --disable-blobs) blobs="no"
>     ;;
>     --kerneldir=*) kerneldir="$optarg"
> @@ -840,6 +843,7 @@ echo "  --enable-vde             enable support for vde network"
>   echo "  --disable-linux-aio      disable Linux AIO support"
>   echo "  --enable-linux-aio       enable Linux AIO support"
>   echo "  --enable-io-thread       enable IO thread"
> +echo "  --enable-ft-mode         enable FT mode support"
>   echo "  --disable-blobs          disable installing provided firmware blobs"
>   echo "  --kerneldir=PATH         look for kernel includes in PATH"
>   echo "  --with-kvm-trace         enable building the KVM module with the kvm trace option"
> @@ -2117,6 +2121,7 @@ echo "GUEST_BASE        $guest_base"
>   echo "PIE user targets  $user_pie"
>   echo "vde support       $vde"
>   echo "IO thread         $io_thread"
> +echo "FT mode support   $ft_mode"
>   echo "Linux AIO support $linux_aio"
>   echo "Install blobs     $blobs"
>   echo "KVM support       $kvm"
> @@ -2318,6 +2323,9 @@ fi
>   if test "$io_thread" = "yes" ; then
>     echo "CONFIG_IOTHREAD=y">>  $config_host_mak
>   fi
> +if test "$ft_mode" = "yes" ; then
> +  echo "CONFIG_FT_MODE=y">>  $config_host_mak
> +fi
>   if test "$linux_aio" = "yes" ; then
>     echo "CONFIG_LINUX_AIO=y">>  $config_host_mak
>   fi
>    
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 19/20] Insert do_event_tap() to virtio-{blk, net}, comment out assert() on cpu_single_env temporally.
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 19/20] Insert do_event_tap() to virtio-{blk, net}, comment out assert() on cpu_single_env temporally Yoshiaki Tamura
@ 2010-04-22 19:39   ` Anthony Liguori
  2010-04-23  4:51     ` Yoshiaki Tamura
  0 siblings, 1 reply; 74+ messages in thread
From: Anthony Liguori @ 2010-04-22 19:39 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: ohmura.kei, kvm, mtosatti, Anthony Liguori, qemu-devel,
	yoshikawa.takuya, avi
On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:
> do_event_tap() is inserted to functions which actually fire outputs.
> By synchronizing VMs before outputs are fired, we can failover to the
> receiver upon failure.  To save VM continuously, comment out assert()
> on cpu_single_env temporally.
>
> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
> ---
>   hw/virtio-blk.c |    2 ++
>   hw/virtio-net.c |    2 ++
>   qemu-kvm.c      |    7 ++++++-
>   3 files changed, 10 insertions(+), 1 deletions(-)
>    
This would be better done in the generic layers (the block and net 
layers respectively).  Then it would work with virtio and emulated devices.
Regards,
Anthony Liguori
> diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
> index b80402d..1dd1c31 100644
> --- a/hw/virtio-blk.c
> +++ b/hw/virtio-blk.c
> @@ -327,6 +327,8 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>           .old_bs = NULL,
>       };
>
> +    do_event_tap();
> +
>       while ((req = virtio_blk_get_request(s))) {
>           virtio_blk_handle_request(req,&mrb);
>       }
> diff --git a/hw/virtio-net.c b/hw/virtio-net.c
> index 5c0093e..1a32bf3 100644
> --- a/hw/virtio-net.c
> +++ b/hw/virtio-net.c
> @@ -667,6 +667,8 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq)
>   {
>       VirtIONet *n = to_virtio_net(vdev);
>
> +    do_event_tap();
> +
>       if (n->tx_timer_active) {
>           virtio_queue_set_notification(vq, 1);
>           qemu_del_timer(n->tx_timer);
> diff --git a/qemu-kvm.c b/qemu-kvm.c
> index 1414f49..769bc95 100644
> --- a/qemu-kvm.c
> +++ b/qemu-kvm.c
> @@ -935,8 +935,12 @@ int kvm_run(CPUState *env)
>
>       post_kvm_run(kvm, env);
>
> +    /* TODO: we need to prevent tapping events that derived from the
> +     * same VMEXIT. This needs more info from the kernel. */
>   #if defined(KVM_CAP_COALESCED_MMIO)
>       if (kvm_state->coalesced_mmio) {
> +        /* prevent from tapping events while handling coalesced_mmio */
> +        event_tap_suspend();
>           struct kvm_coalesced_mmio_ring *ring =
>               (void *) run + kvm_state->coalesced_mmio * PAGE_SIZE;
>           while (ring->first != ring->last) {
> @@ -946,6 +950,7 @@ int kvm_run(CPUState *env)
>               smp_wmb();
>               ring->first = (ring->first + 1) % KVM_COALESCED_MMIO_MAX;
>           }
> +        event_tap_resume();
>       }
>   #endif
>
> @@ -1770,7 +1775,7 @@ static void resume_all_threads(void)
>   {
>       CPUState *penv = first_cpu;
>
> -    assert(!cpu_single_env);
> +    /* assert(!cpu_single_env); */
>
>       while (penv) {
>           penv->stop = 0;
>    
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 00/20] Kemari for KVM v0.1
  2010-04-21  5:57 [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Yoshiaki Tamura
                   ` (20 preceding siblings ...)
  2010-04-22  8:58 ` [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Dor Laor
@ 2010-04-22 19:42 ` Anthony Liguori
  2010-04-23  0:45   ` Yoshiaki Tamura
  2010-04-23 13:24 ` Avi Kivity
  22 siblings, 1 reply; 74+ messages in thread
From: Anthony Liguori @ 2010-04-22 19:42 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: ohmura.kei, kvm, mtosatti, Anthony Liguori, qemu-devel,
	yoshikawa.takuya, avi
On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:
> Hi all,
>
> We have been implementing the prototype of Kemari for KVM, and we're sending
> this message to share what we have now and TODO lists.  Hopefully, we would like
> to get early feedback to keep us in the right direction.  Although advanced
> approaches in the TODO lists are fascinating, we would like to run this project
> step by step while absorbing comments from the community.  The current code is
> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>
> For those who are new to Kemari for KVM, please take a look at the
> following RFC which we posted last year.
>
> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>
> The transmission/transaction protocol, and most of the control logic is
> implemented in QEMU.  However, we needed a hack in KVM to prevent rip from
> proceeding before synchronizing VMs.  It may also need some plumbing in the
> kernel side to guarantee replayability of certain events and instructions,
> integrate the RAS capabilities of newer x86 hardware with the HA stack, as well
> as for optimization purposes, for example.
>
> Before going into details, we would like to show how Kemari looks.  We prepared
> a demonstration video at the following location.  For those who are not
> interested in the code, please take a look.
> The demonstration scenario is,
>
> 1. Play with a guest VM that has virtio-blk and virtio-net.
> # The guest image should be a NFS/SAN.
> 2. Start Kemari to synchronize the VM by running the following command in QEMU.
> Just add "-k" option to usual migrate command.
> migrate -d -k tcp:192.168.0.20:4444
> 3. Check the status by calling info migrate.
> 4. Go back to the VM to play chess animation.
> 5. Kill the the VM. (VNC client also disappears)
> 6. Press "c" to continue the VM on the other host.
> 7. Bring up the VNC client (Sorry, it pops outside of video capture.)
> 8. Confirm that the chess animation ends, browser works fine, then shutdown.
>
> http://www.osrg.net/kemari/download/kemari-kvm-fc11.mov
>
> The repository contains all patches we're sending with this message.  For those
> who want to try, pull the following repository.  At running configure, please
> put --enable-ft-mode.  Also you need to apply a patch attached at the end of
> this message to your KVM.
>
> git://kemari.git.sourceforge.net/gitroot/kemari/kemari
>
> In addition to usual migrate environment and command, add "-k" to run.
>
> The patch set consists of following components.
>
> - bit-based dirty bitmap. (I have posted v4 for upstream QEMU on April 2o)
> - writev() support to QEMUFile and FdMigrationState.
> - FT transaction sender/receiver
> - event tap that triggers FT transaction.
> - virtio-blk, virtio-net support for event tap.
>    
This series looks quite nice!
I think it would make sense to separate out the things that are actually 
optimizations (like the dirty bitmap changes and the writev/readv 
changes) and to attempt to justify them with actual performance data.
I'd prefer not to modify the live migration protocol ABI and it doesn't 
seem to be necessary if we're willing to add options to the -incoming 
flag.  We also want to be a bit more generic with respect to IO.  
Otherwise, the series looks very close to being mergable.
Regards,
Anthony Liguori
^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
  2010-04-22 13:16       ` Yoshiaki Tamura
@ 2010-04-22 20:33         ` Anthony Liguori
  2010-04-23  1:53           ` Yoshiaki Tamura
  2010-04-22 20:38         ` Dor Laor
  1 sibling, 1 reply; 74+ messages in thread
From: Anthony Liguori @ 2010-04-22 20:33 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: ohmura.kei, mtosatti, kvm, dlaor, qemu-devel, yoshikawa.takuya,
	aliguori, avi
On 04/22/2010 08:16 AM, Yoshiaki Tamura wrote:
> 2010/4/22 Dor Laor<dlaor@redhat.com>:
>    
>> On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
>>      
>>> Dor Laor wrote:
>>>        
>>>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>>>          
>>>>> Hi all,
>>>>>
>>>>> We have been implementing the prototype of Kemari for KVM, and we're
>>>>> sending
>>>>> this message to share what we have now and TODO lists. Hopefully, we
>>>>> would like
>>>>> to get early feedback to keep us in the right direction. Although
>>>>> advanced
>>>>> approaches in the TODO lists are fascinating, we would like to run
>>>>> this project
>>>>> step by step while absorbing comments from the community. The current
>>>>> code is
>>>>> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>>>>>
>>>>> For those who are new to Kemari for KVM, please take a look at the
>>>>> following RFC which we posted last year.
>>>>>
>>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>>>>
>>>>> The transmission/transaction protocol, and most of the control logic is
>>>>> implemented in QEMU. However, we needed a hack in KVM to prevent rip
>>>>> from
>>>>> proceeding before synchronizing VMs. It may also need some plumbing in
>>>>> the
>>>>> kernel side to guarantee replayability of certain events and
>>>>> instructions,
>>>>> integrate the RAS capabilities of newer x86 hardware with the HA
>>>>> stack, as well
>>>>> as for optimization purposes, for example.
>>>>>            
>>>> [ snap]
>>>>
>>>>          
>>>>> The rest of this message describes TODO lists grouped by each topic.
>>>>>
>>>>> === event tapping ===
>>>>>
>>>>> Event tapping is the core component of Kemari, and it decides on which
>>>>> event the
>>>>> primary should synchronize with the secondary. The basic assumption
>>>>> here is
>>>>> that outgoing I/O operations are idempotent, which is usually true for
>>>>> disk I/O
>>>>> and reliable network protocols such as TCP.
>>>>>            
>>>> IMO any type of network even should be stalled too. What if the VM runs
>>>> non tcp protocol and the packet that the master node sent reached some
>>>> remote client and before the sync to the slave the master failed?
>>>>          
>>> In current implementation, it is actually stalling any type of network
>>> that goes through virtio-net.
>>>
>>> However, if the application was using unreliable protocols, it should
>>> have its own recovering mechanism, or it should be completely stateless.
>>>        
>> Why do you treat tcp differently? You can damage the entire VM this way -
>> think of dhcp request that was dropped on the moment you switched between
>> the master and the slave?
>>      
> I'm not trying to say that we should treat tcp differently, but just
> it's severe.
> In case of dhcp request, the client would have a chance to retry after
> failover, correct?
> BTW, in current implementation,
>    
I'm slightly confused about the current implementation vs. my 
recollection of the original paper with Xen.  I had thought that all 
disk and network I/O was buffered in such a way that at each checkpoint, 
the I/O operations would be released in a burst.  Otherwise, you would 
have to synchronize after every I/O operation which is what it seems the 
current implementation does.  I'm not sure how that is accomplished 
atomically though since you could have a completed I/O operation 
duplicated on the slave node provided it didn't notify completion prior 
to failure.
Is there another kemari component that somehow handles buffering I/O 
that is not obvious from these patches?
Regards,
Anthony Liguori
^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
  2010-04-22 13:16       ` Yoshiaki Tamura
  2010-04-22 20:33         ` Anthony Liguori
@ 2010-04-22 20:38         ` Dor Laor
  2010-04-23  5:17           ` Yoshiaki Tamura
  1 sibling, 1 reply; 74+ messages in thread
From: Dor Laor @ 2010-04-22 20:38 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: ohmura.kei, kvm, mtosatti, aliguori, qemu-devel, yoshikawa.takuya,
	avi
On 04/22/2010 04:16 PM, Yoshiaki Tamura wrote:
> 2010/4/22 Dor Laor<dlaor@redhat.com>:
>> On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
>>>
>>> Dor Laor wrote:
>>>>
>>>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> We have been implementing the prototype of Kemari for KVM, and we're
>>>>> sending
>>>>> this message to share what we have now and TODO lists. Hopefully, we
>>>>> would like
>>>>> to get early feedback to keep us in the right direction. Although
>>>>> advanced
>>>>> approaches in the TODO lists are fascinating, we would like to run
>>>>> this project
>>>>> step by step while absorbing comments from the community. The current
>>>>> code is
>>>>> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>>>>>
>>>>> For those who are new to Kemari for KVM, please take a look at the
>>>>> following RFC which we posted last year.
>>>>>
>>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>>>>
>>>>> The transmission/transaction protocol, and most of the control logic is
>>>>> implemented in QEMU. However, we needed a hack in KVM to prevent rip
>>>>> from
>>>>> proceeding before synchronizing VMs. It may also need some plumbing in
>>>>> the
>>>>> kernel side to guarantee replayability of certain events and
>>>>> instructions,
>>>>> integrate the RAS capabilities of newer x86 hardware with the HA
>>>>> stack, as well
>>>>> as for optimization purposes, for example.
>>>>
>>>> [ snap]
>>>>
>>>>>
>>>>> The rest of this message describes TODO lists grouped by each topic.
>>>>>
>>>>> === event tapping ===
>>>>>
>>>>> Event tapping is the core component of Kemari, and it decides on which
>>>>> event the
>>>>> primary should synchronize with the secondary. The basic assumption
>>>>> here is
>>>>> that outgoing I/O operations are idempotent, which is usually true for
>>>>> disk I/O
>>>>> and reliable network protocols such as TCP.
>>>>
>>>> IMO any type of network even should be stalled too. What if the VM runs
>>>> non tcp protocol and the packet that the master node sent reached some
>>>> remote client and before the sync to the slave the master failed?
>>>
>>> In current implementation, it is actually stalling any type of network
>>> that goes through virtio-net.
>>>
>>> However, if the application was using unreliable protocols, it should
>>> have its own recovering mechanism, or it should be completely stateless.
>>
>> Why do you treat tcp differently? You can damage the entire VM this way -
>> think of dhcp request that was dropped on the moment you switched between
>> the master and the slave?
>
> I'm not trying to say that we should treat tcp differently, but just
> it's severe.
> In case of dhcp request, the client would have a chance to retry after
> failover, correct?
But until it timeouts it won't have networking.
> BTW, in current implementation, it's synchronizing before dhcp ack is sent.
> But in case of tcp, once you send ack to the client before sync, there
> is no way to recover.
What if the guest is running dhcp server? It we provide an IP to a 
client and then fail to the secondary that will run without knowing the 
master allocated this IP
>
>>>> [snap]
>>>>
>>>>
>>>>> === clock ===
>>>>>
>>>>> Since synchronizing the virtual machines every time the TSC is
>>>>> accessed would be
>>>>> prohibitive, the transmission of the TSC will be done lazily, which
>>>>> means
>>>>> delaying it until there is a non-TSC synchronization point arrives.
>>>>
>>>> Why do you specifically care about the tsc sync? When you sync all the
>>>> IO model on snapshot it also synchronizes the tsc.
>>
>> So, do you agree that an extra clock synchronization is not needed since it
>> is done anyway as part of the live migration state sync?
>
> I agree that its sent as part of the live migration.
> What I wanted to say here is that this is not something for real time
> applications.
> I usually get questions like can this guarantee fault tolerance for
> real time applications.
First the huge cost of snapshots won't match to any real time app.
Second, even if it wasn't the case, the tsc delta and kvmclock are 
synchronized as part of the VM state so there is no use of trapping it 
in the middle.
>
>>>> In general, can you please explain the 'algorithm' for continuous
>>>> snapshots (is that what you like to do?):
>>>
>>> Yes, of course.
>>> Sorry for being less informative.
>>>
>>>> A trivial one would we to :
>>>> - do X online snapshots/sec
>>>
>>> I currently don't have good numbers that I can share right now.
>>> Snapshots/sec depends on what kind of workload is running, and if the
>>> guest was almost idle, there will be no snapshots in 5sec. On the other
>>> hand, if the guest was running I/O intensive workloads (netperf, iozone
>>> for example), there will be about 50 snapshots/sec.
>>>
>>>> - Stall all IO (disk/block) from the guest to the outside world
>>>> until the previous snapshot reaches the slave.
>>>
>>> Yes, it does.
>>>
>>>> - Snapshots are made of
>>>
>>> Full device model + diff of dirty pages from the last snapshot.
>>>
>>>> - diff of dirty pages from last snapshot
>>>
>>> This also depends on the workload.
>>> In case of I/O intensive workloads, dirty pages are usually less than 100.
>>
>> The hardest would be memory intensive loads.
>> So 100 snap/sec means latency of 10msec right?
>> (not that it's not ok, with faster hw and IB you'll be able to get much
>> more)
>
> Doesn't 100 snap/sec mean the interval of snap is 10msec?
> IIUC, to get the latency, you need to get, Time to transfer VM + Time
> to get response from the receiver.
>
> It's hard to say which load is the hardest.
> Memory intensive load, who don't generate I/O often, will suffer from
> long sync time for that moment, but would have chances to continue its
> process until sync.
> I/O intensive load, who don't dirty much pages, will suffer from
> getting VPU stopped often, but its sync time is relatively shorter.
>
>>>> - Qemu device model (+kvm's) diff from last.
>>>
>>> We're currently sending full copy because we're completely reusing this
>>> part of existing live migration framework.
>>>
>>> Last time we measured, it was about 13KB.
>>> But it varies by which QEMU version is used.
>>>
>>>> You can do 'light' snapshots in between to send dirty pages to reduce
>>>> snapshot time.
>>>
>>> I agree. That's one of the advanced topic we would like to try too.
>>>
>>>> I wrote the above to serve a reference for your comments so it will map
>>>> into my mind. Thanks, dor
>>>
>>> Thank your for the guidance.
>>> I hope this answers to your question.
>>>
>>> At the same time, I would also be happy it we could discuss how to
>>> implement too. In fact, we needed a hack to prevent rip from proceeding
>>> in KVM, which turned out that it was not the best workaround.
>>
>> There are brute force solutions like
>> - stop the guest until you send all of the snapshot to the remote (like
>>   standard live migration)
>
> We've implemented this way so far.
>
>> - Stop + fork + cont the father
>>
>> Or mark the recent dirty pages that were not sent to the remote as write
>> protected and copy them if touched.
>
> I think I had that suggestion from Avi before.
> And yes, it's very fascinating.
>
> Meanwhile, if you look at the diffstat, it needed to touch many parts of QEMU.
> Before going into further implementation, I wanted to check that I'm
> in the right track for doing this project.
>
>
>>> Thanks,
>>>
>>> Yoshi
>>>
>>>>
>>>>>
>>>>> TODO:
>>>>> - Synchronization of clock sources (need to intercept TSC reads, etc).
>>>>>
>>>>> === usability ===
>>>>>
>>>>> These are items that defines how users interact with Kemari.
>>>>>
>>>>> TODO:
>>>>> - Kemarid daemon that takes care of the cluster management/monitoring
>>>>> side of things.
>>>>> - Some device emulators might need minor modifications to work well
>>>>> with Kemari. Use white(black)-listing to take the burden of
>>>>> choosing the right device model off the users.
>>>>>
>>>>> === optimizations ===
>>>>>
>>>>> Although the big picture can be realized by completing the TODO list
>>>>> above, we
>>>>> need some optimizations/enhancements to make Kemari useful in real
>>>>> world, and
>>>>> these are items what needs to be done for that.
>>>>>
>>>>> TODO:
>>>>> - SMP (for the sake of performance might need to implement a
>>>>> synchronization protocol that can maintain two or more
>>>>> synchronization points active at any given moment)
>>>>> - VGA (leverage VNC's subtilting mechanism to identify fb pages that
>>>>> are really dirty).
>>>>>
>>>>>
>>>>> Any comments/suggestions would be greatly appreciated.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Yoshi
>>>>>
>>>>> --
>>>>>
>>>>> Kemari starts synchronizing VMs when QEMU handles I/O requests.
>>>>> Without this patch VCPU state is already proceeded before
>>>>> synchronization, and after failover to the VM on the receiver, it
>>>>> hangs because of this.
>>>>>
>>>>> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>>>>> ---
>>>>> arch/x86/include/asm/kvm_host.h | 1 +
>>>>> arch/x86/kvm/svm.c | 11 ++++++++---
>>>>> arch/x86/kvm/vmx.c | 11 ++++++++---
>>>>> arch/x86/kvm/x86.c | 4 ++++
>>>>> 4 files changed, 21 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/arch/x86/include/asm/kvm_host.h
>>>>> b/arch/x86/include/asm/kvm_host.h
>>>>> index 26c629a..7b8f514 100644
>>>>> --- a/arch/x86/include/asm/kvm_host.h
>>>>> +++ b/arch/x86/include/asm/kvm_host.h
>>>>> @@ -227,6 +227,7 @@ struct kvm_pio_request {
>>>>> int in;
>>>>> int port;
>>>>> int size;
>>>>> + bool lazy_skip;
>>>>> };
>>>>>
>>>>> /*
>>>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>>>> index d04c7ad..e373245 100644
>>>>> --- a/arch/x86/kvm/svm.c
>>>>> +++ b/arch/x86/kvm/svm.c
>>>>> @@ -1495,7 +1495,7 @@ static int io_interception(struct vcpu_svm *svm)
>>>>> {
>>>>> struct kvm_vcpu *vcpu =&svm->vcpu;
>>>>> u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
>>>>> - int size, in, string;
>>>>> + int size, in, string, ret;
>>>>> unsigned port;
>>>>>
>>>>> ++svm->vcpu.stat.io_exits;
>>>>> @@ -1507,9 +1507,14 @@ static int io_interception(struct vcpu_svm *svm)
>>>>> port = io_info>>  16;
>>>>> size = (io_info&  SVM_IOIO_SIZE_MASK)>>  SVM_IOIO_SIZE_SHIFT;
>>>>> svm->next_rip = svm->vmcb->control.exit_info_2;
>>>>> - skip_emulated_instruction(&svm->vcpu);
>>>>>
>>>>> - return kvm_fast_pio_out(vcpu, size, port);
>>>>> + ret = kvm_fast_pio_out(vcpu, size, port);
>>>>> + if (ret)
>>>>> + skip_emulated_instruction(&svm->vcpu);
>>>>> + else
>>>>> + vcpu->arch.pio.lazy_skip = true;
>>>>> +
>>>>> + return ret;
>>>>> }
>>>>>
>>>>> static int nmi_interception(struct vcpu_svm *svm)
>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>>>> index 41e63bb..09052d6 100644
>>>>> --- a/arch/x86/kvm/vmx.c
>>>>> +++ b/arch/x86/kvm/vmx.c
>>>>> @@ -2975,7 +2975,7 @@ static int handle_triple_fault(struct kvm_vcpu
>>>>> *vcpu)
>>>>> static int handle_io(struct kvm_vcpu *vcpu)
>>>>> {
>>>>> unsigned long exit_qualification;
>>>>> - int size, in, string;
>>>>> + int size, in, string, ret;
>>>>> unsigned port;
>>>>>
>>>>> exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
>>>>> @@ -2989,9 +2989,14 @@ static int handle_io(struct kvm_vcpu *vcpu)
>>>>>
>>>>> port = exit_qualification>>  16;
>>>>> size = (exit_qualification&  7) + 1;
>>>>> - skip_emulated_instruction(vcpu);
>>>>>
>>>>> - return kvm_fast_pio_out(vcpu, size, port);
>>>>> + ret = kvm_fast_pio_out(vcpu, size, port);
>>>>> + if (ret)
>>>>> + skip_emulated_instruction(vcpu);
>>>>> + else
>>>>> + vcpu->arch.pio.lazy_skip = true;
>>>>> +
>>>>> + return ret;
>>>>> }
>>>>>
>>>>> static void
>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>>> index fd5c3d3..cc308d2 100644
>>>>> --- a/arch/x86/kvm/x86.c
>>>>> +++ b/arch/x86/kvm/x86.c
>>>>> @@ -4544,6 +4544,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu
>>>>> *vcpu, struct kvm_run *kvm_run)
>>>>> if (!irqchip_in_kernel(vcpu->kvm))
>>>>> kvm_set_cr8(vcpu, kvm_run->cr8);
>>>>>
>>>>> + if (vcpu->arch.pio.lazy_skip)
>>>>> + kvm_x86_ops->skip_emulated_instruction(vcpu);
>>>>> + vcpu->arch.pio.lazy_skip = false;
>>>>> +
>>>>> if (vcpu->arch.pio.count || vcpu->mmio_needed ||
>>>>> vcpu->arch.emulate_ctxt.restart) {
>>>>> if (vcpu->mmio_needed) {
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
  2010-04-22 16:15     ` Jamie Lokier
@ 2010-04-23  0:20       ` Yoshiaki Tamura
  2010-04-23 15:07         ` Jamie Lokier
  0 siblings, 1 reply; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-23  0:20 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: ohmura.kei, mtosatti, kvm, dlaor, qemu-devel, yoshikawa.takuya,
	aliguori, avi
Jamie Lokier wrote:
> Yoshiaki Tamura wrote:
>> Dor Laor wrote:
>>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>>> Event tapping is the core component of Kemari, and it decides on which
>>>> event the
>>>> primary should synchronize with the secondary. The basic assumption
>>>> here is
>>>> that outgoing I/O operations are idempotent, which is usually true for
>>>> disk I/O
>>>> and reliable network protocols such as TCP.
>>>
>>> IMO any type of network even should be stalled too. What if the VM runs
>>> non tcp protocol and the packet that the master node sent reached some
>>> remote client and before the sync to the slave the master failed?
>>
>> In current implementation, it is actually stalling any type of network
>> that goes through virtio-net.
>>
>> However, if the application was using unreliable protocols, it should have
>> its own recovering mechanism, or it should be completely stateless.
>
> Even with unreliable protocols, if slave takeover causes the receiver
> to have received a packet that the sender _does not think it has ever
> sent_, expect some protocols to break.
>
> If the slave replaying master's behaviour since the last sync means it
> will definitely get into the same state of having sent the packet,
> that works out.
That's something we're expecting now.
> But you still have to be careful that the other end's responses to
> that packet are not seen by the slave too early during that replay.
> Otherwise, for example, the slave may observe a TCP ACK to a packet
> that it hasn't yet sent, which is an error.
Even current implementation syncs just before network output, what you pointed 
out could happen.  In this case, would the connection going to be lost, or would 
client/server recover from it?  If latter, it would be fine, otherwise I wonder 
how people doing similar things are handling this situation.
> About IP idempotency:
>
> In general, IP packets are allowed to be lost or duplicated in the
> network.  All IP protocols should be prepared for that; it is a basic
> property.
>
> However there is one respect in which they're not idempotent:
>
> The TTL field should be decreased if packets are delayed.  Packets
> should not appear to live in the network for longer than TTL seconds.
> If they do, some protocols (like TCP) can react to the delayed ones
> differently, such as sending a RST packet and breaking a connection.
>
> It is acceptable to reduce TTL faster than the minimum.  After all, it
> is reduced by 1 on every forwarding hop, in addition to time delays.
So the problem is, when the slave takes over, it sends a packet with same TTL 
which client may have received.
>> I currently don't have good numbers that I can share right now.
>> Snapshots/sec depends on what kind of workload is running, and if the
>> guest was almost idle, there will be no snapshots in 5sec.  On the other
>> hand, if the guest was running I/O intensive workloads (netperf, iozone
>> for example), there will be about 50 snapshots/sec.
>
> That is a really satisfying number, thank you :-)
>
> Without this work I wouldn't have imagined that synchronised machines
> could work with such a low transaction rate.
Thank you for your comments.
Although I haven't prepared good data yet, I personally prefer to have 
discussion with actual implementation and experimental data.
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 00/20] Kemari for KVM v0.1
  2010-04-22 19:42 ` [Qemu-devel] " Anthony Liguori
@ 2010-04-23  0:45   ` Yoshiaki Tamura
  2010-04-23 13:10     ` Anthony Liguori
  0 siblings, 1 reply; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-23  0:45 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: ohmura.kei, kvm, mtosatti, Anthony Liguori, qemu-devel,
	yoshikawa.takuya, avi
Anthony Liguori wrote:
> On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:
>> Hi all,
>>
>> We have been implementing the prototype of Kemari for KVM, and we're
>> sending
>> this message to share what we have now and TODO lists. Hopefully, we
>> would like
>> to get early feedback to keep us in the right direction. Although
>> advanced
>> approaches in the TODO lists are fascinating, we would like to run
>> this project
>> step by step while absorbing comments from the community. The current
>> code is
>> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>>
>> For those who are new to Kemari for KVM, please take a look at the
>> following RFC which we posted last year.
>>
>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>
>> The transmission/transaction protocol, and most of the control logic is
>> implemented in QEMU. However, we needed a hack in KVM to prevent rip from
>> proceeding before synchronizing VMs. It may also need some plumbing in
>> the
>> kernel side to guarantee replayability of certain events and
>> instructions,
>> integrate the RAS capabilities of newer x86 hardware with the HA
>> stack, as well
>> as for optimization purposes, for example.
>>
>> Before going into details, we would like to show how Kemari looks. We
>> prepared
>> a demonstration video at the following location. For those who are not
>> interested in the code, please take a look.
>> The demonstration scenario is,
>>
>> 1. Play with a guest VM that has virtio-blk and virtio-net.
>> # The guest image should be a NFS/SAN.
>> 2. Start Kemari to synchronize the VM by running the following command
>> in QEMU.
>> Just add "-k" option to usual migrate command.
>> migrate -d -k tcp:192.168.0.20:4444
>> 3. Check the status by calling info migrate.
>> 4. Go back to the VM to play chess animation.
>> 5. Kill the the VM. (VNC client also disappears)
>> 6. Press "c" to continue the VM on the other host.
>> 7. Bring up the VNC client (Sorry, it pops outside of video capture.)
>> 8. Confirm that the chess animation ends, browser works fine, then
>> shutdown.
>>
>> http://www.osrg.net/kemari/download/kemari-kvm-fc11.mov
>>
>> The repository contains all patches we're sending with this message.
>> For those
>> who want to try, pull the following repository. At running configure,
>> please
>> put --enable-ft-mode. Also you need to apply a patch attached at the
>> end of
>> this message to your KVM.
>>
>> git://kemari.git.sourceforge.net/gitroot/kemari/kemari
>>
>> In addition to usual migrate environment and command, add "-k" to run.
>>
>> The patch set consists of following components.
>>
>> - bit-based dirty bitmap. (I have posted v4 for upstream QEMU on April
>> 2o)
>> - writev() support to QEMUFile and FdMigrationState.
>> - FT transaction sender/receiver
>> - event tap that triggers FT transaction.
>> - virtio-blk, virtio-net support for event tap.
>
> This series looks quite nice!
Thanks for your kind words!
> I think it would make sense to separate out the things that are actually
> optimizations (like the dirty bitmap changes and the writev/readv
> changes) and to attempt to justify them with actual performance data.
I agree with the separation plan.
For dirty bitmap change, Avi and I discussed on patchset for upsream QEMU while 
you were offline (Sorry, if I was wrong).  Could you also take a look?
http://lists.gnu.org/archive/html/qemu-devel/2010-04/msg01396.html
Regarding writev, I agree that it should be backed with actual data, otherwise 
it should be removed.  We attemped to do everything that may reduce the overhead 
of the transaction.
> I'd prefer not to modify the live migration protocol ABI and it doesn't
> seem to be necessary if we're willing to add options to the -incoming
> flag. We also want to be a bit more generic with respect to IO.
I totally agree with your approach not to change the protocol ABI.  Can we add 
an option to -incoming?  Like, -incoming ft_mode, for example
Regarding the IO, let me reply to the next message.
> Otherwise, the series looks very close to being mergable.
Thank you for your comment on each patch.
To be honest, I wasn't that confident because I'm a newbie to KVM/QEMU and 
struggled for how to implement in an acceptable way.
Thanks,
Yoshi
>
> Regards,
>
> Anthony Liguori
>
>
>
^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
  2010-04-22 20:33         ` Anthony Liguori
@ 2010-04-23  1:53           ` Yoshiaki Tamura
  2010-04-23 13:20             ` Anthony Liguori
  0 siblings, 1 reply; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-23  1:53 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: ohmura.kei, mtosatti, kvm, dlaor, qemu-devel, yoshikawa.takuya,
	aliguori, avi
Anthony Liguori wrote:
> On 04/22/2010 08:16 AM, Yoshiaki Tamura wrote:
>> 2010/4/22 Dor Laor<dlaor@redhat.com>:
>>> On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
>>>> Dor Laor wrote:
>>>>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> We have been implementing the prototype of Kemari for KVM, and we're
>>>>>> sending
>>>>>> this message to share what we have now and TODO lists. Hopefully, we
>>>>>> would like
>>>>>> to get early feedback to keep us in the right direction. Although
>>>>>> advanced
>>>>>> approaches in the TODO lists are fascinating, we would like to run
>>>>>> this project
>>>>>> step by step while absorbing comments from the community. The current
>>>>>> code is
>>>>>> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>>>>>>
>>>>>> For those who are new to Kemari for KVM, please take a look at the
>>>>>> following RFC which we posted last year.
>>>>>>
>>>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>>>>>
>>>>>> The transmission/transaction protocol, and most of the control
>>>>>> logic is
>>>>>> implemented in QEMU. However, we needed a hack in KVM to prevent rip
>>>>>> from
>>>>>> proceeding before synchronizing VMs. It may also need some
>>>>>> plumbing in
>>>>>> the
>>>>>> kernel side to guarantee replayability of certain events and
>>>>>> instructions,
>>>>>> integrate the RAS capabilities of newer x86 hardware with the HA
>>>>>> stack, as well
>>>>>> as for optimization purposes, for example.
>>>>> [ snap]
>>>>>
>>>>>> The rest of this message describes TODO lists grouped by each topic.
>>>>>>
>>>>>> === event tapping ===
>>>>>>
>>>>>> Event tapping is the core component of Kemari, and it decides on
>>>>>> which
>>>>>> event the
>>>>>> primary should synchronize with the secondary. The basic assumption
>>>>>> here is
>>>>>> that outgoing I/O operations are idempotent, which is usually true
>>>>>> for
>>>>>> disk I/O
>>>>>> and reliable network protocols such as TCP.
>>>>> IMO any type of network even should be stalled too. What if the VM
>>>>> runs
>>>>> non tcp protocol and the packet that the master node sent reached some
>>>>> remote client and before the sync to the slave the master failed?
>>>> In current implementation, it is actually stalling any type of network
>>>> that goes through virtio-net.
>>>>
>>>> However, if the application was using unreliable protocols, it should
>>>> have its own recovering mechanism, or it should be completely
>>>> stateless.
>>> Why do you treat tcp differently? You can damage the entire VM this
>>> way -
>>> think of dhcp request that was dropped on the moment you switched
>>> between
>>> the master and the slave?
>> I'm not trying to say that we should treat tcp differently, but just
>> it's severe.
>> In case of dhcp request, the client would have a chance to retry after
>> failover, correct?
>> BTW, in current implementation,
>
> I'm slightly confused about the current implementation vs. my
> recollection of the original paper with Xen. I had thought that all disk
> and network I/O was buffered in such a way that at each checkpoint, the
> I/O operations would be released in a burst. Otherwise, you would have
> to synchronize after every I/O operation which is what it seems the
> current implementation does.
Yes, you're almost right.
It's synchronizing before QEMU starts emulating I/O at each device model.
It was originally designed that way to avoid complexity of introducing buffering 
mechanism and additional I/O latency by buffering.
> I'm not sure how that is accomplished
> atomically though since you could have a completed I/O operation
> duplicated on the slave node provided it didn't notify completion prior
> to failure.
That's exactly the point I wanted to discuss.
Currently, we're calling vm_stop(0), qemu_aio_flush() and bdrv_flush_all() 
before qemu_save_state_all() in ft_tranx_ready(), to ensure outstanding I/O is 
complete.  I mimicked what existing live migration is doing.
It's not enough?
> Is there another kemari component that somehow handles buffering I/O
> that is not obvious from these patches?
No, I'm not hiding anything, and I would share any information regarding Kemari 
to develop it in this community :-)
Thanks,
Yoshi
>
> Regards,
>
> Anthony Liguori
>
>
>
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 01/20] Modify DIRTY_FLAG value and introduce DIRTY_IDX to use as indexes of bit-based phys_ram_dirty.
  2010-04-22 19:26   ` [Qemu-devel] " Anthony Liguori
@ 2010-04-23  2:09     ` Yoshiaki Tamura
  0 siblings, 0 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-23  2:09 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: ohmura.kei, kvm, mtosatti, Anthony Liguori, qemu-devel,
	yoshikawa.takuya, avi
Anthony Liguori wrote:
> Hi,
>
> On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:
>> Replaces byte-based phys_ram_dirty bitmap with four (MASTER, VGA,
>> CODE, MIGRATION) bit-based phys_ram_dirty bitmap. On allocation, it
>> sets all bits in the bitmap. It uses ffs() to convert DIRTY_FLAG to
>> DIRTY_IDX.
>>
>> Modifies wrapper functions for byte-based phys_ram_dirty bitmap to
>> bit-based phys_ram_dirty bitmap. MASTER works as a buffer, and upon
>> get_diry() or get_dirty_flags(), it calls
>> cpu_physical_memory_sync_master() to update VGA and MIGRATION.
>
> Why use an additional bitmap for MASTER instead of just updating the
> VGA, CODE, and MIGRATION bitmaps together?
This way we don't have to think whether we should update VGA or MIGRATION. 
IIRC, we had this discussion on upstream before with Avi?
http://www.mail-archive.com/kvm@vger.kernel.org/msg30728.html
BTW, I also have the following TODO list regarding dirty bitmap.
1. Allocate vga and migration dirty bitmap dynamically.
2. Separate CODE and the other dirty bitmaps functions.
>
> Regards,
>
> Anthony Liguori
>
>> Replaces direct phys_ram_dirty access with wrapper functions to
>> prevent direct access to the phys_ram_dirty bitmap.
>>
>> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>> Signed-off-by: OHMURA Kei<ohmura.kei@lab.ntt.co.jp>
>> ---
>> cpu-all.h | 130
>> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
>> exec.c | 60 ++++++++++++++--------------
>> 2 files changed, 152 insertions(+), 38 deletions(-)
>>
>> diff --git a/cpu-all.h b/cpu-all.h
>> index 51effc0..3f8762d 100644
>> --- a/cpu-all.h
>> +++ b/cpu-all.h
>> @@ -37,6 +37,9 @@
>>
>> #include "softfloat.h"
>>
>> +/* to use ffs in flag_to_idx() */
>> +#include<strings.h>
>> +
>> #if defined(HOST_WORDS_BIGENDIAN) != defined(TARGET_WORDS_BIGENDIAN)
>> #define BSWAP_NEEDED
>> #endif
>> @@ -846,7 +849,6 @@ int cpu_str_to_log_mask(const char *str);
>> /* memory API */
>>
>> extern int phys_ram_fd;
>> -extern uint8_t *phys_ram_dirty;
>> extern ram_addr_t ram_size;
>> extern ram_addr_t last_ram_offset;
>> extern uint8_t *bios_mem;
>> @@ -869,28 +871,140 @@ extern uint8_t *bios_mem;
>> /* Set if TLB entry is an IO callback. */
>> #define TLB_MMIO (1<< 5)
>>
>> +/* Use DIRTY_IDX as indexes of bit-based phys_ram_dirty. */
>> +#define MASTER_DIRTY_IDX 0
>> +#define VGA_DIRTY_IDX 1
>> +#define CODE_DIRTY_IDX 2
>> +#define MIGRATION_DIRTY_IDX 3
>> +#define NUM_DIRTY_IDX 4
>> +
>> +#define MASTER_DIRTY_FLAG (1<< MASTER_DIRTY_IDX)
>> +#define VGA_DIRTY_FLAG (1<< VGA_DIRTY_IDX)
>> +#define CODE_DIRTY_FLAG (1<< CODE_DIRTY_IDX)
>> +#define MIGRATION_DIRTY_FLAG (1<< MIGRATION_DIRTY_IDX)
>> +
>> +extern unsigned long *phys_ram_dirty[NUM_DIRTY_IDX];
>> +
>> +static inline int dirty_flag_to_idx(int flag)
>> +{
>> + return ffs(flag) - 1;
>> +}
>> +
>> +static inline int dirty_idx_to_flag(int idx)
>> +{
>> + return 1<< idx;
>> +}
>> +
>> int cpu_memory_rw_debug(CPUState *env, target_ulong addr,
>> uint8_t *buf, int len, int is_write);
>>
>> -#define VGA_DIRTY_FLAG 0x01
>> -#define CODE_DIRTY_FLAG 0x02
>> -#define MIGRATION_DIRTY_FLAG 0x08
>> -
>> /* read dirty bit (return 0 or 1) */
>> static inline int cpu_physical_memory_is_dirty(ram_addr_t addr)
>> {
>> - return phys_ram_dirty[addr>> TARGET_PAGE_BITS] == 0xff;
>> + unsigned long mask;
>> + ram_addr_t index = (addr>> TARGET_PAGE_BITS) / HOST_LONG_BITS;
>> + int offset = (addr>> TARGET_PAGE_BITS)& (HOST_LONG_BITS - 1);
>> +
>> + mask = 1UL<< offset;
>> + return (phys_ram_dirty[MASTER_DIRTY_IDX][index]& mask) == mask;
>> +}
>> +
>> +static inline void cpu_physical_memory_sync_master(ram_addr_t index)
>> +{
>> + if (phys_ram_dirty[MASTER_DIRTY_IDX][index]) {
>> + phys_ram_dirty[VGA_DIRTY_IDX][index]
>> + |= phys_ram_dirty[MASTER_DIRTY_IDX][index];
>> + phys_ram_dirty[MIGRATION_DIRTY_IDX][index]
>> + |= phys_ram_dirty[MASTER_DIRTY_IDX][index];
>> + phys_ram_dirty[MASTER_DIRTY_IDX][index] = 0UL;
>> + }
>> +}
>> +
>> +static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t addr)
>> +{
>> + unsigned long mask;
>> + ram_addr_t index = (addr>> TARGET_PAGE_BITS) / HOST_LONG_BITS;
>> + int offset = (addr>> TARGET_PAGE_BITS)& (HOST_LONG_BITS - 1);
>> + int ret = 0, i;
>> +
>> + mask = 1UL<< offset;
>> + cpu_physical_memory_sync_master(index);
>> +
>> + for (i = VGA_DIRTY_IDX; i<= MIGRATION_DIRTY_IDX; i++) {
>> + if (phys_ram_dirty[i][index]& mask) {
>> + ret |= dirty_idx_to_flag(i);
>> + }
>> + }
>> +
>> + return ret;
>> +}
>> +
>> +static inline int cpu_physical_memory_get_dirty_idx(ram_addr_t addr,
>> + int dirty_idx)
>> +{
>> + unsigned long mask;
>> + ram_addr_t index = (addr>> TARGET_PAGE_BITS) / HOST_LONG_BITS;
>> + int offset = (addr>> TARGET_PAGE_BITS)& (HOST_LONG_BITS - 1);
>> +
>> + mask = 1UL<< offset;
>> + cpu_physical_memory_sync_master(index);
>> + return (phys_ram_dirty[dirty_idx][index]& mask) == mask;
>> }
>>
>> static inline int cpu_physical_memory_get_dirty(ram_addr_t addr,
>> int dirty_flags)
>> {
>> - return phys_ram_dirty[addr>> TARGET_PAGE_BITS]& dirty_flags;
>> + return cpu_physical_memory_get_dirty_idx(addr,
>> + dirty_flag_to_idx(dirty_flags));
>> }
>>
>> static inline void cpu_physical_memory_set_dirty(ram_addr_t addr)
>> {
>> - phys_ram_dirty[addr>> TARGET_PAGE_BITS] = 0xff;
>> + unsigned long mask;
>> + ram_addr_t index = (addr>> TARGET_PAGE_BITS) / HOST_LONG_BITS;
>> + int offset = (addr>> TARGET_PAGE_BITS)& (HOST_LONG_BITS - 1);
>> +
>> + mask = 1UL<< offset;
>> + phys_ram_dirty[MASTER_DIRTY_IDX][index] |= mask;
>> +}
>> +
>> +static inline void cpu_physical_memory_set_dirty_range(ram_addr_t addr,
>> + unsigned long mask)
>> +{
>> + ram_addr_t index = (addr>> TARGET_PAGE_BITS) / HOST_LONG_BITS;
>> +
>> + phys_ram_dirty[MASTER_DIRTY_IDX][index] |= mask;
>> +}
>> +
>> +static inline void cpu_physical_memory_set_dirty_flags(ram_addr_t addr,
>> + int dirty_flags)
>> +{
>> + unsigned long mask;
>> + ram_addr_t index = (addr>> TARGET_PAGE_BITS) / HOST_LONG_BITS;
>> + int offset = (addr>> TARGET_PAGE_BITS)& (HOST_LONG_BITS - 1);
>> +
>> + mask = 1UL<< offset;
>> + phys_ram_dirty[MASTER_DIRTY_IDX][index] |= mask;
>> +
>> + if (dirty_flags& CODE_DIRTY_FLAG) {
>> + phys_ram_dirty[CODE_DIRTY_IDX][index] |= mask;
>> + }
>> +}
>> +
>> +static inline void cpu_physical_memory_mask_dirty_range(ram_addr_t
>> start,
>> + unsigned long length,
>> + int dirty_flags)
>> +{
>> + ram_addr_t addr = start, index;
>> + unsigned long mask;
>> + int offset, i;
>> +
>> + for (i = 0; i< length; i += TARGET_PAGE_SIZE) {
>> + index = ((addr + i)>> TARGET_PAGE_BITS) / HOST_LONG_BITS;
>> + offset = ((addr + i)>> TARGET_PAGE_BITS)& (HOST_LONG_BITS - 1);
>> + mask = ~(1UL<< offset);
>> + phys_ram_dirty[dirty_flag_to_idx(dirty_flags)][index]&= mask;
>> + }
>> }
>>
>> void cpu_physical_memory_reset_dirty(ram_addr_t start, ram_addr_t end,
>> diff --git a/exec.c b/exec.c
>> index b647512..bf8d703 100644
>> --- a/exec.c
>> +++ b/exec.c
>> @@ -119,7 +119,7 @@ uint8_t *code_gen_ptr;
>>
>> #if !defined(CONFIG_USER_ONLY)
>> int phys_ram_fd;
>> -uint8_t *phys_ram_dirty;
>> +unsigned long *phys_ram_dirty[NUM_DIRTY_IDX];
>> uint8_t *bios_mem;
>> static int in_migration;
>>
>> @@ -1947,7 +1947,7 @@ static void tlb_protect_code(ram_addr_t ram_addr)
>> static void tlb_unprotect_code_phys(CPUState *env, ram_addr_t ram_addr,
>> target_ulong vaddr)
>> {
>> - phys_ram_dirty[ram_addr>> TARGET_PAGE_BITS] |= CODE_DIRTY_FLAG;
>> + cpu_physical_memory_set_dirty_flags(ram_addr, CODE_DIRTY_FLAG);
>> }
>>
>> static inline void tlb_reset_dirty_range(CPUTLBEntry *tlb_entry,
>> @@ -1968,8 +1968,7 @@ void cpu_physical_memory_reset_dirty(ram_addr_t
>> start, ram_addr_t end,
>> {
>> CPUState *env;
>> unsigned long length, start1;
>> - int i, mask, len;
>> - uint8_t *p;
>> + int i;
>>
>> start&= TARGET_PAGE_MASK;
>> end = TARGET_PAGE_ALIGN(end);
>> @@ -1977,11 +1976,7 @@ void cpu_physical_memory_reset_dirty(ram_addr_t
>> start, ram_addr_t end,
>> length = end - start;
>> if (length == 0)
>> return;
>> - len = length>> TARGET_PAGE_BITS;
>> - mask = ~dirty_flags;
>> - p = phys_ram_dirty + (start>> TARGET_PAGE_BITS);
>> - for(i = 0; i< len; i++)
>> - p[i]&= mask;
>> + cpu_physical_memory_mask_dirty_range(start, length, dirty_flags);
>>
>> /* we modify the TLB cache so that the dirty bit will be set again
>> when accessing the range */
>> @@ -2643,6 +2638,7 @@ extern const char *mem_path;
>> ram_addr_t qemu_ram_alloc(ram_addr_t size)
>> {
>> RAMBlock *new_block;
>> + int i;
>>
>> size = TARGET_PAGE_ALIGN(size);
>> new_block = qemu_malloc(sizeof(*new_block));
>> @@ -2667,10 +2663,14 @@ ram_addr_t qemu_ram_alloc(ram_addr_t size)
>> new_block->next = ram_blocks;
>> ram_blocks = new_block;
>>
>> - phys_ram_dirty = qemu_realloc(phys_ram_dirty,
>> - (last_ram_offset + size)>> TARGET_PAGE_BITS);
>> - memset(phys_ram_dirty + (last_ram_offset>> TARGET_PAGE_BITS),
>> - 0xff, size>> TARGET_PAGE_BITS);
>> + for (i = MASTER_DIRTY_IDX; i< NUM_DIRTY_IDX; i++) {
>> + phys_ram_dirty[i]
>> + = qemu_realloc(phys_ram_dirty[i],
>> + BITMAP_SIZE(last_ram_offset + size));
>> + memset((uint8_t *)phys_ram_dirty[i] + BITMAP_SIZE(last_ram_offset),
>> + 0xff, BITMAP_SIZE(last_ram_offset + size)
>> + - BITMAP_SIZE(last_ram_offset));
>> + }
>>
>> last_ram_offset += size;
>>
>> @@ -2833,16 +2833,16 @@ static void notdirty_mem_writeb(void *opaque,
>> target_phys_addr_t ram_addr,
>> uint32_t val)
>> {
>> int dirty_flags;
>> - dirty_flags = phys_ram_dirty[ram_addr>> TARGET_PAGE_BITS];
>> + dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
>> if (!(dirty_flags& CODE_DIRTY_FLAG)) {
>> #if !defined(CONFIG_USER_ONLY)
>> tb_invalidate_phys_page_fast(ram_addr, 1);
>> - dirty_flags = phys_ram_dirty[ram_addr>> TARGET_PAGE_BITS];
>> + dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
>> #endif
>> }
>> stb_p(qemu_get_ram_ptr(ram_addr), val);
>> dirty_flags |= (0xff& ~CODE_DIRTY_FLAG);
>> - phys_ram_dirty[ram_addr>> TARGET_PAGE_BITS] = dirty_flags;
>> + cpu_physical_memory_set_dirty_flags(ram_addr, dirty_flags);
>> /* we remove the notdirty callback only if the code has been
>> flushed */
>> if (dirty_flags == 0xff)
>> @@ -2853,16 +2853,16 @@ static void notdirty_mem_writew(void *opaque,
>> target_phys_addr_t ram_addr,
>> uint32_t val)
>> {
>> int dirty_flags;
>> - dirty_flags = phys_ram_dirty[ram_addr>> TARGET_PAGE_BITS];
>> + dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
>> if (!(dirty_flags& CODE_DIRTY_FLAG)) {
>> #if !defined(CONFIG_USER_ONLY)
>> tb_invalidate_phys_page_fast(ram_addr, 2);
>> - dirty_flags = phys_ram_dirty[ram_addr>> TARGET_PAGE_BITS];
>> + dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
>> #endif
>> }
>> stw_p(qemu_get_ram_ptr(ram_addr), val);
>> dirty_flags |= (0xff& ~CODE_DIRTY_FLAG);
>> - phys_ram_dirty[ram_addr>> TARGET_PAGE_BITS] = dirty_flags;
>> + cpu_physical_memory_set_dirty_flags(ram_addr, dirty_flags);
>> /* we remove the notdirty callback only if the code has been
>> flushed */
>> if (dirty_flags == 0xff)
>> @@ -2873,16 +2873,16 @@ static void notdirty_mem_writel(void *opaque,
>> target_phys_addr_t ram_addr,
>> uint32_t val)
>> {
>> int dirty_flags;
>> - dirty_flags = phys_ram_dirty[ram_addr>> TARGET_PAGE_BITS];
>> + dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
>> if (!(dirty_flags& CODE_DIRTY_FLAG)) {
>> #if !defined(CONFIG_USER_ONLY)
>> tb_invalidate_phys_page_fast(ram_addr, 4);
>> - dirty_flags = phys_ram_dirty[ram_addr>> TARGET_PAGE_BITS];
>> + dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
>> #endif
>> }
>> stl_p(qemu_get_ram_ptr(ram_addr), val);
>> dirty_flags |= (0xff& ~CODE_DIRTY_FLAG);
>> - phys_ram_dirty[ram_addr>> TARGET_PAGE_BITS] = dirty_flags;
>> + cpu_physical_memory_set_dirty_flags(ram_addr, dirty_flags);
>> /* we remove the notdirty callback only if the code has been
>> flushed */
>> if (dirty_flags == 0xff)
>> @@ -3334,8 +3334,8 @@ void cpu_physical_memory_rw(target_phys_addr_t
>> addr, uint8_t *buf,
>> /* invalidate code */
>> tb_invalidate_phys_page_range(addr1, addr1 + l, 0);
>> /* set dirty bit */
>> - phys_ram_dirty[addr1>> TARGET_PAGE_BITS] |=
>> - (0xff& ~CODE_DIRTY_FLAG);
>> + cpu_physical_memory_set_dirty_flags(
>> + addr1, (0xff& ~CODE_DIRTY_FLAG));
>> }
>> /* qemu doesn't execute guest code directly, but kvm does
>> therefore flush instruction caches */
>> @@ -3548,8 +3548,8 @@ void cpu_physical_memory_unmap(void *buffer,
>> target_phys_addr_t len,
>> /* invalidate code */
>> tb_invalidate_phys_page_range(addr1, addr1 + l, 0);
>> /* set dirty bit */
>> - phys_ram_dirty[addr1>> TARGET_PAGE_BITS] |=
>> - (0xff& ~CODE_DIRTY_FLAG);
>> + cpu_physical_memory_set_dirty_flags(
>> + addr1, (0xff& ~CODE_DIRTY_FLAG));
>> }
>> addr1 += l;
>> access_len -= l;
>> @@ -3685,8 +3685,8 @@ void stl_phys_notdirty(target_phys_addr_t addr,
>> uint32_t val)
>> /* invalidate code */
>> tb_invalidate_phys_page_range(addr1, addr1 + 4, 0);
>> /* set dirty bit */
>> - phys_ram_dirty[addr1>> TARGET_PAGE_BITS] |=
>> - (0xff& ~CODE_DIRTY_FLAG);
>> + cpu_physical_memory_set_dirty_flags(
>> + addr1, (0xff& ~CODE_DIRTY_FLAG));
>> }
>> }
>> }
>> @@ -3754,8 +3754,8 @@ void stl_phys(target_phys_addr_t addr, uint32_t
>> val)
>> /* invalidate code */
>> tb_invalidate_phys_page_range(addr1, addr1 + 4, 0);
>> /* set dirty bit */
>> - phys_ram_dirty[addr1>> TARGET_PAGE_BITS] |=
>> - (0xff& ~CODE_DIRTY_FLAG);
>> + cpu_physical_memory_set_dirty_flags(addr1,
>> + (0xff& ~CODE_DIRTY_FLAG));
>> }
>> }
>> }
>
>
>
>
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 15/20] Introduce FT mode support to configure.
  2010-04-22 19:38   ` [Qemu-devel] " Anthony Liguori
@ 2010-04-23  3:09     ` Yoshiaki Tamura
  0 siblings, 0 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-23  3:09 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: ohmura.kei, kvm, mtosatti, Anthony Liguori, qemu-devel,
	yoshikawa.takuya, avi
Anthony Liguori wrote:
> On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:
>> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>
> No need for this.
OK.
>
> Regards,
>
> Anthony Liguori
>
>> ---
>> configure | 8 ++++++++
>> 1 files changed, 8 insertions(+), 0 deletions(-)
>>
>> diff --git a/configure b/configure
>> index 046c591..f0682d4 100755
>> --- a/configure
>> +++ b/configure
>> @@ -298,6 +298,7 @@ bsd_user="no"
>> guest_base=""
>> uname_release=""
>> io_thread="no"
>> +ft_mode="no"
>> mixemu="no"
>> kvm_trace="no"
>> kvm_cap_pit=""
>> @@ -671,6 +672,8 @@ for opt do
>> ;;
>> --enable-io-thread) io_thread="yes"
>> ;;
>> + --enable-ft-mode) ft_mode="yes"
>> + ;;
>> --disable-blobs) blobs="no"
>> ;;
>> --kerneldir=*) kerneldir="$optarg"
>> @@ -840,6 +843,7 @@ echo " --enable-vde enable support for vde network"
>> echo " --disable-linux-aio disable Linux AIO support"
>> echo " --enable-linux-aio enable Linux AIO support"
>> echo " --enable-io-thread enable IO thread"
>> +echo " --enable-ft-mode enable FT mode support"
>> echo " --disable-blobs disable installing provided firmware blobs"
>> echo " --kerneldir=PATH look for kernel includes in PATH"
>> echo " --with-kvm-trace enable building the KVM module with the kvm
>> trace option"
>> @@ -2117,6 +2121,7 @@ echo "GUEST_BASE $guest_base"
>> echo "PIE user targets $user_pie"
>> echo "vde support $vde"
>> echo "IO thread $io_thread"
>> +echo "FT mode support $ft_mode"
>> echo "Linux AIO support $linux_aio"
>> echo "Install blobs $blobs"
>> echo "KVM support $kvm"
>> @@ -2318,6 +2323,9 @@ fi
>> if test "$io_thread" = "yes" ; then
>> echo "CONFIG_IOTHREAD=y">> $config_host_mak
>> fi
>> +if test "$ft_mode" = "yes" ; then
>> + echo "CONFIG_FT_MODE=y">> $config_host_mak
>> +fi
>> if test "$linux_aio" = "yes" ; then
>> echo "CONFIG_LINUX_AIO=y">> $config_host_mak
>> fi
>
>
>
>
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 14/20] Upgrade QEMU_FILE_VERSION from 3 to 4, and introduce qemu_savevm_state_all().
  2010-04-22 19:37   ` [Qemu-devel] " Anthony Liguori
@ 2010-04-23  3:29     ` Yoshiaki Tamura
  0 siblings, 0 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-23  3:29 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: ohmura.kei, kvm, mtosatti, Anthony Liguori, qemu-devel,
	yoshikawa.takuya, avi
Anthony Liguori wrote:
> On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:
>> Make a 32bit entry after QEMU_VM_FILE_VERSION to recognize whether the
>> transfered data is QEMU_VM_FT_MODE or QEMU_VM_LIVE_MIGRATION_MODE.
>
> I'd rather you encapsulate the current protocol as opposed to extending
> it with a new version.
>
> You could also do this by just having a new -incoming option, right? All
> you really need is to indicate that this is for high frequency
> checkpointing, right?
Exactly.
I would use -incoming option.
>
> Regards,
>
> Anthony Liguori
>
>> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>> ---
>> savevm.c | 76
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>> sysemu.h | 1 +
>> 2 files changed, 75 insertions(+), 2 deletions(-)
>>
>> diff --git a/savevm.c b/savevm.c
>> index 292ae32..19b3efb 100644
>> --- a/savevm.c
>> +++ b/savevm.c
>> @@ -1402,8 +1402,10 @@ static void vmstate_save(QEMUFile *f,
>> SaveStateEntry *se)
>> }
>>
>> #define QEMU_VM_FILE_MAGIC 0x5145564d
>> -#define QEMU_VM_FILE_VERSION_COMPAT 0x00000002
>> -#define QEMU_VM_FILE_VERSION 0x00000003
>> +#define QEMU_VM_FILE_VERSION_COMPAT 0x00000003
>> +#define QEMU_VM_FILE_VERSION 0x00000004
>> +#define QEMU_VM_LIVE_MIGRATION_MODE 0x00000005
>> +#define QEMU_VM_FT_MODE 0x00000006
>>
>> #define QEMU_VM_EOF 0x00
>> #define QEMU_VM_SECTION_START 0x01
>> @@ -1425,6 +1427,12 @@ int qemu_savevm_state_begin(Monitor *mon,
>> QEMUFile *f, int blk_enable,
>>
>> qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
>> qemu_put_be32(f, QEMU_VM_FILE_VERSION);
>> +
>> + if (ft_mode) {
>> + qemu_put_be32(f, QEMU_VM_FT_MODE);
>> + } else {
>> + qemu_put_be32(f, QEMU_VM_LIVE_MIGRATION_MODE);
>> + }
>>
>> QTAILQ_FOREACH(se,&savevm_handlers, entry) {
>> int len;
>> @@ -1533,6 +1541,66 @@ int qemu_savevm_state_complete(Monitor *mon,
>> QEMUFile *f)
>> return 0;
>> }
>>
>> +int qemu_savevm_state_all(Monitor *mon, QEMUFile *f)
>> +{
>> + SaveStateEntry *se;
>> +
>> + QTAILQ_FOREACH(se,&savevm_handlers, entry) {
>> + int len;
>> +
>> + if (se->save_live_state == NULL)
>> + continue;
>> +
>> + /* Section type */
>> + qemu_put_byte(f, QEMU_VM_SECTION_START);
>> + qemu_put_be32(f, se->section_id);
>> +
>> + /* ID string */
>> + len = strlen(se->idstr);
>> + qemu_put_byte(f, len);
>> + qemu_put_buffer(f, (uint8_t *)se->idstr, len);
>> +
>> + qemu_put_be32(f, se->instance_id);
>> + qemu_put_be32(f, se->version_id);
>> + if (ft_mode == FT_INIT) {
>> + /* This is workaround. */
>> + se->save_live_state(mon, f, QEMU_VM_SECTION_START, se->opaque);
>> + } else {
>> + se->save_live_state(mon, f, QEMU_VM_SECTION_PART, se->opaque);
>> + }
>> + }
>> +
>> + ft_mode = FT_TRANSACTION;
>> + QTAILQ_FOREACH(se,&savevm_handlers, entry) {
>> + int len;
>> +
>> + if (se->save_state == NULL&& se->vmsd == NULL)
>> + continue;
>> +
>> + /* Section type */
>> + qemu_put_byte(f, QEMU_VM_SECTION_FULL);
>> + qemu_put_be32(f, se->section_id);
>> +
>> + /* ID string */
>> + len = strlen(se->idstr);
>> + qemu_put_byte(f, len);
>> + qemu_put_buffer(f, (uint8_t *)se->idstr, len);
>> +
>> + qemu_put_be32(f, se->instance_id);
>> + qemu_put_be32(f, se->version_id);
>> +
>> + vmstate_save(f, se);
>> + }
>> +
>> + qemu_put_byte(f, QEMU_VM_EOF);
>> +
>> + if (qemu_file_has_error(f))
>> + return -EIO;
>> +
>> + return 0;
>> +}
>> +
>> +
>> void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f)
>> {
>> SaveStateEntry *se;
>> @@ -1617,6 +1685,10 @@ int qemu_loadvm_state(QEMUFile *f, int
>> skip_header)
>> if (v != QEMU_VM_FILE_VERSION)
>> return -ENOTSUP;
>>
>> + v = qemu_get_be32(f);
>> + if (v == QEMU_VM_FT_MODE) {
>> + ft_mode = FT_INIT;
>> + }
>> }
>>
>> while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
>> diff --git a/sysemu.h b/sysemu.h
>> index 6c1441f..df314bb 100644
>> --- a/sysemu.h
>> +++ b/sysemu.h
>> @@ -67,6 +67,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile
>> *f, int blk_enable,
>> int shared);
>> int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
>> int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
>> +int qemu_savevm_state_all(Monitor *mon, QEMUFile *f);
>> void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
>> int qemu_loadvm_state(QEMUFile *f, int skip_header);
>>
>
>
>
>
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops().
  2010-04-22 19:28   ` [Qemu-devel] " Anthony Liguori
@ 2010-04-23  3:37     ` Yoshiaki Tamura
  2010-04-23 13:22       ` Anthony Liguori
  0 siblings, 1 reply; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-23  3:37 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: ohmura.kei, kvm, mtosatti, Anthony Liguori, qemu-devel,
	yoshikawa.takuya, avi
Anthony Liguori wrote:
> On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:
>> QEMUFile currently doesn't support writev(). For sending multiple
>> data, such as pages, using writev() should be more efficient.
>>
>> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>
> Is there performance data that backs this up? Since QEMUFile uses a
> linear buffer for most operations that's limited to 16k, I suspect you
> wouldn't be able to observe a difference in practice.
I currently don't have data, but I'll prepare it.
There were two things I wanted to avoid.
1. Pages to be copied to QEMUFile buf through qemu_put_buffer.
2. Calling write() everytime even when we want to send multiple pages at once.
I think 2 may be neglectable.
But 1 seems to be problematic if we want make to the latency as small as 
possible, no?
>
> Regards,
>
> Anthony Liguori
>
>> ---
>> buffered_file.c | 2 +-
>> hw/hw.h | 16 ++++++++++++++++
>> savevm.c | 43 +++++++++++++++++++++++++------------------
>> 3 files changed, 42 insertions(+), 19 deletions(-)
>>
>> diff --git a/buffered_file.c b/buffered_file.c
>> index 54dc6c2..187d1d4 100644
>> --- a/buffered_file.c
>> +++ b/buffered_file.c
>> @@ -256,7 +256,7 @@ QEMUFile *qemu_fopen_ops_buffered(void *opaque,
>> s->wait_for_unfreeze = wait_for_unfreeze;
>> s->close = close;
>>
>> - s->file = qemu_fopen_ops(s, buffered_put_buffer, NULL,
>> + s->file = qemu_fopen_ops(s, buffered_put_buffer, NULL, NULL, NULL,
>> buffered_close, buffered_rate_limit,
>> buffered_set_rate_limit,
>> buffered_get_rate_limit);
>> diff --git a/hw/hw.h b/hw/hw.h
>> index fc9ed29..921cf90 100644
>> --- a/hw/hw.h
>> +++ b/hw/hw.h
>> @@ -23,6 +23,13 @@
>> typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf,
>> int64_t pos, int size);
>>
>> +/* This function writes a chunk of vector to a file at the given
>> position.
>> + * The pos argument can be ignored if the file is only being used for
>> + * streaming.
>> + */
>> +typedef int (QEMUFilePutVectorFunc)(void *opaque, struct iovec *iov,
>> + int64_t pos, int iovcnt);
>> +
>> /* Read a chunk of data from a file at the given position. The pos
>> argument
>> * can be ignored if the file is only be used for streaming. The number of
>> * bytes actually read should be returned.
>> @@ -30,6 +37,13 @@ typedef int (QEMUFilePutBufferFunc)(void *opaque,
>> const uint8_t *buf,
>> typedef int (QEMUFileGetBufferFunc)(void *opaque, uint8_t *buf,
>> int64_t pos, int size);
>>
>> +/* Read a chunk of vector from a file at the given position. The pos
>> argument
>> + * can be ignored if the file is only be used for streaming. The
>> number of
>> + * bytes actually read should be returned.
>> + */
>> +typedef int (QEMUFileGetVectorFunc)(void *opaque, struct iovec *iov,
>> + int64_t pos, int iovcnt);
>> +
>> /* Close a file and return an error code */
>> typedef int (QEMUFileCloseFunc)(void *opaque);
>>
>> @@ -46,7 +60,9 @@ typedef size_t (QEMUFileSetRateLimit)(void *opaque,
>> size_t new_rate);
>> typedef size_t (QEMUFileGetRateLimit)(void *opaque);
>>
>> QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
>> + QEMUFilePutVectorFunc *put_vector,
>> QEMUFileGetBufferFunc *get_buffer,
>> + QEMUFileGetVectorFunc *get_vector,
>> QEMUFileCloseFunc *close,
>> QEMUFileRateLimit *rate_limit,
>> QEMUFileSetRateLimit *set_rate_limit,
>> diff --git a/savevm.c b/savevm.c
>> index 490ab70..944e788 100644
>> --- a/savevm.c
>> +++ b/savevm.c
>> @@ -162,7 +162,9 @@ void qemu_announce_self(void)
>>
>> struct QEMUFile {
>> QEMUFilePutBufferFunc *put_buffer;
>> + QEMUFilePutVectorFunc *put_vector;
>> QEMUFileGetBufferFunc *get_buffer;
>> + QEMUFileGetVectorFunc *get_vector;
>> QEMUFileCloseFunc *close;
>> QEMUFileRateLimit *rate_limit;
>> QEMUFileSetRateLimit *set_rate_limit;
>> @@ -263,11 +265,11 @@ QEMUFile *qemu_popen(FILE *stdio_file, const
>> char *mode)
>> s->stdio_file = stdio_file;
>>
>> if(mode[0] == 'r') {
>> - s->file = qemu_fopen_ops(s, NULL, stdio_get_buffer, stdio_pclose,
>> - NULL, NULL, NULL);
>> + s->file = qemu_fopen_ops(s, NULL, NULL, stdio_get_buffer,
>> + NULL, stdio_pclose, NULL, NULL, NULL);
>> } else {
>> - s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, stdio_pclose,
>> - NULL, NULL, NULL);
>> + s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, NULL, NULL,
>> + stdio_pclose, NULL, NULL, NULL);
>> }
>> return s->file;
>> }
>> @@ -312,11 +314,11 @@ QEMUFile *qemu_fdopen(int fd, const char *mode)
>> goto fail;
>>
>> if(mode[0] == 'r') {
>> - s->file = qemu_fopen_ops(s, NULL, stdio_get_buffer, stdio_fclose,
>> - NULL, NULL, NULL);
>> + s->file = qemu_fopen_ops(s, NULL, NULL, stdio_get_buffer, NULL,
>> + stdio_fclose, NULL, NULL, NULL);
>> } else {
>> - s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, stdio_fclose,
>> - NULL, NULL, NULL);
>> + s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, NULL, NULL,
>> + stdio_fclose, NULL, NULL, NULL);
>> }
>> return s->file;
>>
>> @@ -330,8 +332,8 @@ QEMUFile *qemu_fopen_socket(int fd)
>> QEMUFileSocket *s = qemu_mallocz(sizeof(QEMUFileSocket));
>>
>> s->fd = fd;
>> - s->file = qemu_fopen_ops(s, NULL, socket_get_buffer, socket_close,
>> - NULL, NULL, NULL);
>> + s->file = qemu_fopen_ops(s, NULL, NULL, socket_get_buffer, NULL,
>> + socket_close, NULL, NULL, NULL);
>> return s->file;
>> }
>>
>> @@ -368,11 +370,11 @@ QEMUFile *qemu_fopen(const char *filename, const
>> char *mode)
>> goto fail;
>>
>> if(mode[0] == 'w') {
>> - s->file = qemu_fopen_ops(s, file_put_buffer, NULL, stdio_fclose,
>> - NULL, NULL, NULL);
>> + s->file = qemu_fopen_ops(s, file_put_buffer, NULL, NULL, NULL,
>> + stdio_fclose, NULL, NULL, NULL);
>> } else {
>> - s->file = qemu_fopen_ops(s, NULL, file_get_buffer, stdio_fclose,
>> - NULL, NULL, NULL);
>> + s->file = qemu_fopen_ops(s, NULL, NULL, file_get_buffer, NULL,
>> + stdio_fclose, NULL, NULL, NULL);
>> }
>> return s->file;
>> fail:
>> @@ -400,13 +402,16 @@ static int bdrv_fclose(void *opaque)
>> static QEMUFile *qemu_fopen_bdrv(BlockDriverState *bs, int is_writable)
>> {
>> if (is_writable)
>> - return qemu_fopen_ops(bs, block_put_buffer, NULL, bdrv_fclose,
>> - NULL, NULL, NULL);
>> - return qemu_fopen_ops(bs, NULL, block_get_buffer, bdrv_fclose, NULL,
>> NULL, NULL);
>> + return qemu_fopen_ops(bs, block_put_buffer, NULL, NULL, NULL,
>> + bdrv_fclose, NULL, NULL, NULL);
>> + return qemu_fopen_ops(bs, NULL, NULL, block_get_buffer, NULL,
>> bdrv_fclose, NULL, NULL, NULL);
>> }
>>
>> -QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc
>> *put_buffer,
>> +QEMUFile *qemu_fopen_ops(void *opaque,
>> + QEMUFilePutBufferFunc *put_buffer,
>> + QEMUFilePutVectorFunc *put_vector,
>> QEMUFileGetBufferFunc *get_buffer,
>> + QEMUFileGetVectorFunc *get_vector,
>> QEMUFileCloseFunc *close,
>> QEMUFileRateLimit *rate_limit,
>> QEMUFileSetRateLimit *set_rate_limit,
>> @@ -418,7 +423,9 @@ QEMUFile *qemu_fopen_ops(void *opaque,
>> QEMUFilePutBufferFunc *put_buffer,
>>
>> f->opaque = opaque;
>> f->put_buffer = put_buffer;
>> + f->put_vector = put_vector;
>> f->get_buffer = get_buffer;
>> + f->get_vector = get_vector;
>> f->close = close;
>> f->rate_limit = rate_limit;
>> f->set_rate_limit = set_rate_limit;
>
>
>
>
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 07/20] Introduce qemu_put_vector() and qemu_put_vector_prepare() to use put_vector() in QEMUFile.
  2010-04-22 19:29   ` [Qemu-devel] " Anthony Liguori
@ 2010-04-23  4:02     ` Yoshiaki Tamura
  2010-04-23 13:23       ` Anthony Liguori
  0 siblings, 1 reply; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-23  4:02 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: ohmura.kei, kvm, mtosatti, Anthony Liguori, qemu-devel,
	yoshikawa.takuya, avi
Anthony Liguori wrote:
> On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:
>> For fool proof purpose, qemu_put_vector_parepare should be called
>> before qemu_put_vector. Then, if qemu_put_* functions except this is
>> called after qemu_put_vector_prepare, program will abort().
>>
>> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>
> I don't get it. What's this protecting against?
This was introduced to prevent mixing the order of normal write and vector 
write, and flush QEMUFile buffer before handling vectors.
While qemu_put_buffer copies data to QEMUFile buffer, qemu_put_vector() will 
bypass that buffer.
It's just fool proof purpose for what we encountered at beginning, and if the 
user of qemu_put_vector() is careful enough, we can remove 
qemu_put_vectore_prepare().  While writing this message, I started to think that 
just calling qemu_fflush() in qemu_put_vector() would be enough...
>
> Regards,
>
> Anthony Liguori
>
>> ---
>> hw/hw.h | 2 ++
>> savevm.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 55 insertions(+), 0 deletions(-)
>>
>> diff --git a/hw/hw.h b/hw/hw.h
>> index 921cf90..10e6dda 100644
>> --- a/hw/hw.h
>> +++ b/hw/hw.h
>> @@ -77,6 +77,8 @@ void qemu_fflush(QEMUFile *f);
>> int qemu_fclose(QEMUFile *f);
>> void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
>> void qemu_put_byte(QEMUFile *f, int v);
>> +void qemu_put_vector(QEMUFile *f, QEMUIOVector *qiov);
>> +void qemu_put_vector_prepare(QEMUFile *f);
>> void *qemu_realloc_buffer(QEMUFile *f, int size);
>> void qemu_clear_buffer(QEMUFile *f);
>>
>> diff --git a/savevm.c b/savevm.c
>> index 944e788..22d928c 100644
>> --- a/savevm.c
>> +++ b/savevm.c
>> @@ -180,6 +180,7 @@ struct QEMUFile {
>> uint8_t *buf;
>>
>> int has_error;
>> + int prepares_vector;
>> };
>>
>> typedef struct QEMUFileStdio
>> @@ -557,6 +558,58 @@ void qemu_put_byte(QEMUFile *f, int v)
>> qemu_fflush(f);
>> }
>>
>> +void qemu_put_vector(QEMUFile *f, QEMUIOVector *v)
>> +{
>> + struct iovec *iov;
>> + int cnt;
>> + size_t bufsize;
>> + uint8_t *buf;
>> +
>> + if (qemu_file_get_rate_limit(f) != 0) {
>> + fprintf(stderr,
>> + "Attempted to write vector while bandwidth limit is not zero.\n");
>> + abort();
>> + }
>> +
>> + /* checks prepares vector.
>> + * For fool proof purpose, qemu_put_vector_parepare should be called
>> + * before qemu_put_vector. Then, if qemu_put_* functions except this
>> + * is called after qemu_put_vector_prepare, program will abort().
>> + */
>> + if (!f->prepares_vector) {
>> + fprintf(stderr,
>> + "You should prepare with qemu_put_vector_prepare.\n");
>> + abort();
>> + } else if (f->prepares_vector&& f->buf_index != 0) {
>> + fprintf(stderr, "Wrote data after qemu_put_vector_prepare.\n");
>> + abort();
>> + }
>> + f->prepares_vector = 0;
>> +
>> + if (f->put_vector) {
>> + qemu_iovec_to_vector(v,&iov,&cnt);
>> + f->put_vector(f->opaque, iov, 0, cnt);
>> + } else {
>> + qemu_iovec_to_size(v,&bufsize);
>> + buf = qemu_malloc(bufsize + 1 /* for '\0' */);
>> + qemu_iovec_to_buffer(v, buf);
>> + qemu_put_buffer(f, buf, bufsize);
>> + qemu_free(buf);
>> + }
>> +
>> +}
>> +
>> +void qemu_put_vector_prepare(QEMUFile *f)
>> +{
>> + if (f->prepares_vector) {
>> + /* prepare vector */
>> + fprintf(stderr, "Attempted to prepare vector twice\n");
>> + abort();
>> + }
>> + f->prepares_vector = 1;
>> + qemu_fflush(f);
>> +}
>> +
>> int qemu_get_buffer(QEMUFile *f, uint8_t *buf, int size1)
>> {
>> int size, l;
>
>
>
>
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 10/20] Introduce skip_header parameter to qemu_loadvm_state() so that it can be called iteratively without reading the header.
  2010-04-22 19:34   ` [Qemu-devel] " Anthony Liguori
@ 2010-04-23  4:25     ` Yoshiaki Tamura
  0 siblings, 0 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-23  4:25 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: ohmura.kei, kvm, mtosatti, Anthony Liguori, qemu-devel,
	yoshikawa.takuya, avi
Anthony Liguori wrote:
> On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:
>> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>
> I think the more appropriate thing to do is have
> qemu_savevm_state_complete() not write QEMU_VM_EOF when doing a
> continuous live migration.
>
> You would then want qemu_loadvm_state() to detect real EOF and treat
> that the same as QEMU_VM_EOF (provided it occurred at a section boundary).
>
> Of course, this should be a flag to qemu_loadvm_state() as it's not
> always the right behavior.
Sorry.  I couldn't get your intention.
I would appreciate if you could explain the good points of it.
On the receiver side, we need to switch to ft_transaction mode. If the 
qemu_savevm_state_complete() didn't send QEMU_VM_EOF, qemu_loadvm_state() won't 
get out of the loop, and therefore it can switch to ft_transaction mode.
Please let me know if I were misunderstanding.
>
> Regards,
>
> Anthony Liguori
>
>> ---
>> migration-exec.c | 2 +-
>> migration-fd.c | 2 +-
>> migration-tcp.c | 2 +-
>> migration-unix.c | 2 +-
>> savevm.c | 25 ++++++++++++++-----------
>> sysemu.h | 2 +-
>> 6 files changed, 19 insertions(+), 16 deletions(-)
>>
>> diff --git a/migration-exec.c b/migration-exec.c
>> index 3edc026..5839a6d 100644
>> --- a/migration-exec.c
>> +++ b/migration-exec.c
>> @@ -113,7 +113,7 @@ static void exec_accept_incoming_migration(void
>> *opaque)
>> QEMUFile *f = opaque;
>> int ret;
>>
>> - ret = qemu_loadvm_state(f);
>> + ret = qemu_loadvm_state(f, 0);
>> if (ret< 0) {
>> fprintf(stderr, "load of migration failed\n");
>> goto err;
>> diff --git a/migration-fd.c b/migration-fd.c
>> index 0cc74ad..0e97ed0 100644
>> --- a/migration-fd.c
>> +++ b/migration-fd.c
>> @@ -106,7 +106,7 @@ static void fd_accept_incoming_migration(void
>> *opaque)
>> QEMUFile *f = opaque;
>> int ret;
>>
>> - ret = qemu_loadvm_state(f);
>> + ret = qemu_loadvm_state(f, 0);
>> if (ret< 0) {
>> fprintf(stderr, "load of migration failed\n");
>> goto err;
>> diff --git a/migration-tcp.c b/migration-tcp.c
>> index 56e1a3b..94a1a03 100644
>> --- a/migration-tcp.c
>> +++ b/migration-tcp.c
>> @@ -182,7 +182,7 @@ static void tcp_accept_incoming_migration(void
>> *opaque)
>> goto out;
>> }
>>
>> - ret = qemu_loadvm_state(f);
>> + ret = qemu_loadvm_state(f, 0);
>> if (ret< 0) {
>> fprintf(stderr, "load of migration failed\n");
>> goto out_fopen;
>> diff --git a/migration-unix.c b/migration-unix.c
>> index b7aab38..dd99a73 100644
>> --- a/migration-unix.c
>> +++ b/migration-unix.c
>> @@ -168,7 +168,7 @@ static void unix_accept_incoming_migration(void
>> *opaque)
>> goto out;
>> }
>>
>> - ret = qemu_loadvm_state(f);
>> + ret = qemu_loadvm_state(f, 0);
>> if (ret< 0) {
>> fprintf(stderr, "load of migration failed\n");
>> goto out_fopen;
>> diff --git a/savevm.c b/savevm.c
>> index 22d928c..a401b27 100644
>> --- a/savevm.c
>> +++ b/savevm.c
>> @@ -1554,7 +1554,7 @@ typedef struct LoadStateEntry {
>> int version_id;
>> } LoadStateEntry;
>>
>> -int qemu_loadvm_state(QEMUFile *f)
>> +int qemu_loadvm_state(QEMUFile *f, int skip_header)
>> {
>> QLIST_HEAD(, LoadStateEntry) loadvm_handlers =
>> QLIST_HEAD_INITIALIZER(loadvm_handlers);
>> @@ -1563,17 +1563,20 @@ int qemu_loadvm_state(QEMUFile *f)
>> unsigned int v;
>> int ret;
>>
>> - v = qemu_get_be32(f);
>> - if (v != QEMU_VM_FILE_MAGIC)
>> - return -EINVAL;
>> + if (!skip_header) {
>> + v = qemu_get_be32(f);
>> + if (v != QEMU_VM_FILE_MAGIC)
>> + return -EINVAL;
>> +
>> + v = qemu_get_be32(f);
>> + if (v == QEMU_VM_FILE_VERSION_COMPAT) {
>> + fprintf(stderr, "SaveVM v3 format is obsolete and don't work
>> anymore\n");
>> + return -ENOTSUP;
>> + }
>> + if (v != QEMU_VM_FILE_VERSION)
>> + return -ENOTSUP;
>>
>> - v = qemu_get_be32(f);
>> - if (v == QEMU_VM_FILE_VERSION_COMPAT) {
>> - fprintf(stderr, "SaveVM v2 format is obsolete and don't work
>> anymore\n");
>> - return -ENOTSUP;
>> }
>> - if (v != QEMU_VM_FILE_VERSION)
>> - return -ENOTSUP;
>>
>> while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
>> uint32_t instance_id, version_id, section_id;
>> @@ -1898,7 +1901,7 @@ int load_vmstate(Monitor *mon, const char *name)
>> monitor_printf(mon, "Could not open VM state file\n");
>> return -EINVAL;
>> }
>> - ret = qemu_loadvm_state(f);
>> + ret = qemu_loadvm_state(f, 0);
>> qemu_fclose(f);
>> if (ret< 0) {
>> monitor_printf(mon, "Error %d while loading VM state\n", ret);
>> diff --git a/sysemu.h b/sysemu.h
>> index 647a468..6c1441f 100644
>> --- a/sysemu.h
>> +++ b/sysemu.h
>> @@ -68,7 +68,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile
>> *f, int blk_enable,
>> int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
>> int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
>> void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
>> -int qemu_loadvm_state(QEMUFile *f);
>> +int qemu_loadvm_state(QEMUFile *f, int skip_header);
>>
>> void qemu_errors_to_file(FILE *fp);
>> void qemu_errors_to_mon(Monitor *mon);
>
>
>
>
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 19/20] Insert do_event_tap() to virtio-{blk, net}, comment out assert() on cpu_single_env temporally.
  2010-04-22 19:39   ` [Qemu-devel] " Anthony Liguori
@ 2010-04-23  4:51     ` Yoshiaki Tamura
  0 siblings, 0 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-23  4:51 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: ohmura.kei, kvm, mtosatti, Anthony Liguori, qemu-devel,
	yoshikawa.takuya, avi
Anthony Liguori wrote:
> On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:
>> do_event_tap() is inserted to functions which actually fire outputs.
>> By synchronizing VMs before outputs are fired, we can failover to the
>> receiver upon failure. To save VM continuously, comment out assert()
>> on cpu_single_env temporally.
>>
>> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>> ---
>> hw/virtio-blk.c | 2 ++
>> hw/virtio-net.c | 2 ++
>> qemu-kvm.c | 7 ++++++-
>> 3 files changed, 10 insertions(+), 1 deletions(-)
>
> This would be better done in the generic layers (the block and net
> layers respectively). Then it would work with virtio and emulated devices.
I agree with your opinion that it's better if we can handle any emulated devices 
at once.  However, I have a question here that, if we put do_event_tap() to the 
generic layers, emulated devices state would have already been proceeded, and it 
won't be possible to reproduce those I/O after failover?  If I were wrong, I 
would be happy to move it, but if I were right, there are would be two 
approaches to overcome this:
1. Sync I/O requests to the receiver, and upon failover, release those requests 
before running the guest VM.
2. Copy the emulated devices state before starting emulating, and once it comes 
to the generic layer, start the synchronizing using the copied state.
We should also consider the guest's VCPU state.  I previously had similar 
discussion with Avi.  I would like to reconfirm his idea too.
>
> Regards,
>
> Anthony Liguori
>
>> diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
>> index b80402d..1dd1c31 100644
>> --- a/hw/virtio-blk.c
>> +++ b/hw/virtio-blk.c
>> @@ -327,6 +327,8 @@ static void virtio_blk_handle_output(VirtIODevice
>> *vdev, VirtQueue *vq)
>> .old_bs = NULL,
>> };
>>
>> + do_event_tap();
>> +
>> while ((req = virtio_blk_get_request(s))) {
>> virtio_blk_handle_request(req,&mrb);
>> }
>> diff --git a/hw/virtio-net.c b/hw/virtio-net.c
>> index 5c0093e..1a32bf3 100644
>> --- a/hw/virtio-net.c
>> +++ b/hw/virtio-net.c
>> @@ -667,6 +667,8 @@ static void virtio_net_handle_tx(VirtIODevice
>> *vdev, VirtQueue *vq)
>> {
>> VirtIONet *n = to_virtio_net(vdev);
>>
>> + do_event_tap();
>> +
>> if (n->tx_timer_active) {
>> virtio_queue_set_notification(vq, 1);
>> qemu_del_timer(n->tx_timer);
>> diff --git a/qemu-kvm.c b/qemu-kvm.c
>> index 1414f49..769bc95 100644
>> --- a/qemu-kvm.c
>> +++ b/qemu-kvm.c
>> @@ -935,8 +935,12 @@ int kvm_run(CPUState *env)
>>
>> post_kvm_run(kvm, env);
>>
>> + /* TODO: we need to prevent tapping events that derived from the
>> + * same VMEXIT. This needs more info from the kernel. */
>> #if defined(KVM_CAP_COALESCED_MMIO)
>> if (kvm_state->coalesced_mmio) {
>> + /* prevent from tapping events while handling coalesced_mmio */
>> + event_tap_suspend();
>> struct kvm_coalesced_mmio_ring *ring =
>> (void *) run + kvm_state->coalesced_mmio * PAGE_SIZE;
>> while (ring->first != ring->last) {
>> @@ -946,6 +950,7 @@ int kvm_run(CPUState *env)
>> smp_wmb();
>> ring->first = (ring->first + 1) % KVM_COALESCED_MMIO_MAX;
>> }
>> + event_tap_resume();
>> }
>> #endif
>>
>> @@ -1770,7 +1775,7 @@ static void resume_all_threads(void)
>> {
>> CPUState *penv = first_cpu;
>>
>> - assert(!cpu_single_env);
>> + /* assert(!cpu_single_env); */
>>
>> while (penv) {
>> penv->stop = 0;
>
>
>
>
^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
  2010-04-22 20:38         ` Dor Laor
@ 2010-04-23  5:17           ` Yoshiaki Tamura
  2010-04-23  7:36             ` Fernando Luis Vázquez Cao
  0 siblings, 1 reply; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-23  5:17 UTC (permalink / raw)
  To: dlaor
  Cc: ohmura.kei, kvm, mtosatti, aliguori, qemu-devel, yoshikawa.takuya,
	avi
Dor Laor wrote:
> On 04/22/2010 04:16 PM, Yoshiaki Tamura wrote:
>> 2010/4/22 Dor Laor<dlaor@redhat.com>:
>>> On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
>>>>
>>>> Dor Laor wrote:
>>>>>
>>>>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We have been implementing the prototype of Kemari for KVM, and we're
>>>>>> sending
>>>>>> this message to share what we have now and TODO lists. Hopefully, we
>>>>>> would like
>>>>>> to get early feedback to keep us in the right direction. Although
>>>>>> advanced
>>>>>> approaches in the TODO lists are fascinating, we would like to run
>>>>>> this project
>>>>>> step by step while absorbing comments from the community. The current
>>>>>> code is
>>>>>> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>>>>>>
>>>>>> For those who are new to Kemari for KVM, please take a look at the
>>>>>> following RFC which we posted last year.
>>>>>>
>>>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>>>>>
>>>>>> The transmission/transaction protocol, and most of the control
>>>>>> logic is
>>>>>> implemented in QEMU. However, we needed a hack in KVM to prevent rip
>>>>>> from
>>>>>> proceeding before synchronizing VMs. It may also need some
>>>>>> plumbing in
>>>>>> the
>>>>>> kernel side to guarantee replayability of certain events and
>>>>>> instructions,
>>>>>> integrate the RAS capabilities of newer x86 hardware with the HA
>>>>>> stack, as well
>>>>>> as for optimization purposes, for example.
>>>>>
>>>>> [ snap]
>>>>>
>>>>>>
>>>>>> The rest of this message describes TODO lists grouped by each topic.
>>>>>>
>>>>>> === event tapping ===
>>>>>>
>>>>>> Event tapping is the core component of Kemari, and it decides on
>>>>>> which
>>>>>> event the
>>>>>> primary should synchronize with the secondary. The basic assumption
>>>>>> here is
>>>>>> that outgoing I/O operations are idempotent, which is usually true
>>>>>> for
>>>>>> disk I/O
>>>>>> and reliable network protocols such as TCP.
>>>>>
>>>>> IMO any type of network even should be stalled too. What if the VM
>>>>> runs
>>>>> non tcp protocol and the packet that the master node sent reached some
>>>>> remote client and before the sync to the slave the master failed?
>>>>
>>>> In current implementation, it is actually stalling any type of network
>>>> that goes through virtio-net.
>>>>
>>>> However, if the application was using unreliable protocols, it should
>>>> have its own recovering mechanism, or it should be completely
>>>> stateless.
>>>
>>> Why do you treat tcp differently? You can damage the entire VM this
>>> way -
>>> think of dhcp request that was dropped on the moment you switched
>>> between
>>> the master and the slave?
>>
>> I'm not trying to say that we should treat tcp differently, but just
>> it's severe.
>> In case of dhcp request, the client would have a chance to retry after
>> failover, correct?
>
> But until it timeouts it won't have networking.
>
>> BTW, in current implementation, it's synchronizing before dhcp ack is
>> sent.
>> But in case of tcp, once you send ack to the client before sync, there
>> is no way to recover.
>
> What if the guest is running dhcp server? It we provide an IP to a
> client and then fail to the secondary that will run without knowing the
> master allocated this IP
That's problematic.  So it needs to sync when dhcp ack is sent.
I should apologize for my misunderstanding and explanation.  I agree that we 
should stall every type of network output.
>
>>
>>>>> [snap]
>>>>>
>>>>>
>>>>>> === clock ===
>>>>>>
>>>>>> Since synchronizing the virtual machines every time the TSC is
>>>>>> accessed would be
>>>>>> prohibitive, the transmission of the TSC will be done lazily, which
>>>>>> means
>>>>>> delaying it until there is a non-TSC synchronization point arrives.
>>>>>
>>>>> Why do you specifically care about the tsc sync? When you sync all the
>>>>> IO model on snapshot it also synchronizes the tsc.
>>>
>>> So, do you agree that an extra clock synchronization is not needed
>>> since it
>>> is done anyway as part of the live migration state sync?
>>
>> I agree that its sent as part of the live migration.
>> What I wanted to say here is that this is not something for real time
>> applications.
>> I usually get questions like can this guarantee fault tolerance for
>> real time applications.
>
> First the huge cost of snapshots won't match to any real time app.
I see.
> Second, even if it wasn't the case, the tsc delta and kvmclock are
> synchronized as part of the VM state so there is no use of trapping it
> in the middle.
I should study the clock in KVM, but won't tsc get updated by the HW after 
migration?
I was wondering the following case for example:
1. The application on the guest calls rdtsc on host A.
2. The application uses rdtsc value for something.
3. Failover to host B.
4. The application on the guest replays the rdtsc call on host B.
5. If the rdtsc value is different between A and B, the application may get into 
trouble because of it.
If I were wrong, my apologies.
>
>>
>>>>> In general, can you please explain the 'algorithm' for continuous
>>>>> snapshots (is that what you like to do?):
>>>>
>>>> Yes, of course.
>>>> Sorry for being less informative.
>>>>
>>>>> A trivial one would we to :
>>>>> - do X online snapshots/sec
>>>>
>>>> I currently don't have good numbers that I can share right now.
>>>> Snapshots/sec depends on what kind of workload is running, and if the
>>>> guest was almost idle, there will be no snapshots in 5sec. On the other
>>>> hand, if the guest was running I/O intensive workloads (netperf, iozone
>>>> for example), there will be about 50 snapshots/sec.
>>>>
>>>>> - Stall all IO (disk/block) from the guest to the outside world
>>>>> until the previous snapshot reaches the slave.
>>>>
>>>> Yes, it does.
>>>>
>>>>> - Snapshots are made of
>>>>
>>>> Full device model + diff of dirty pages from the last snapshot.
>>>>
>>>>> - diff of dirty pages from last snapshot
>>>>
>>>> This also depends on the workload.
>>>> In case of I/O intensive workloads, dirty pages are usually less
>>>> than 100.
>>>
>>> The hardest would be memory intensive loads.
>>> So 100 snap/sec means latency of 10msec right?
>>> (not that it's not ok, with faster hw and IB you'll be able to get much
>>> more)
>>
>> Doesn't 100 snap/sec mean the interval of snap is 10msec?
>> IIUC, to get the latency, you need to get, Time to transfer VM + Time
>> to get response from the receiver.
>>
>> It's hard to say which load is the hardest.
>> Memory intensive load, who don't generate I/O often, will suffer from
>> long sync time for that moment, but would have chances to continue its
>> process until sync.
>> I/O intensive load, who don't dirty much pages, will suffer from
>> getting VPU stopped often, but its sync time is relatively shorter.
>>
>>>>> - Qemu device model (+kvm's) diff from last.
>>>>
>>>> We're currently sending full copy because we're completely reusing this
>>>> part of existing live migration framework.
>>>>
>>>> Last time we measured, it was about 13KB.
>>>> But it varies by which QEMU version is used.
>>>>
>>>>> You can do 'light' snapshots in between to send dirty pages to reduce
>>>>> snapshot time.
>>>>
>>>> I agree. That's one of the advanced topic we would like to try too.
>>>>
>>>>> I wrote the above to serve a reference for your comments so it will
>>>>> map
>>>>> into my mind. Thanks, dor
>>>>
>>>> Thank your for the guidance.
>>>> I hope this answers to your question.
>>>>
>>>> At the same time, I would also be happy it we could discuss how to
>>>> implement too. In fact, we needed a hack to prevent rip from proceeding
>>>> in KVM, which turned out that it was not the best workaround.
>>>
>>> There are brute force solutions like
>>> - stop the guest until you send all of the snapshot to the remote (like
>>> standard live migration)
>>
>> We've implemented this way so far.
>>
>>> - Stop + fork + cont the father
>>>
>>> Or mark the recent dirty pages that were not sent to the remote as write
>>> protected and copy them if touched.
>>
>> I think I had that suggestion from Avi before.
>> And yes, it's very fascinating.
>>
>> Meanwhile, if you look at the diffstat, it needed to touch many parts
>> of QEMU.
>> Before going into further implementation, I wanted to check that I'm
>> in the right track for doing this project.
>>
>>
>>>> Thanks,
>>>>
>>>> Yoshi
>>>>
>>>>>
>>>>>>
>>>>>> TODO:
>>>>>> - Synchronization of clock sources (need to intercept TSC reads,
>>>>>> etc).
>>>>>>
>>>>>> === usability ===
>>>>>>
>>>>>> These are items that defines how users interact with Kemari.
>>>>>>
>>>>>> TODO:
>>>>>> - Kemarid daemon that takes care of the cluster management/monitoring
>>>>>> side of things.
>>>>>> - Some device emulators might need minor modifications to work well
>>>>>> with Kemari. Use white(black)-listing to take the burden of
>>>>>> choosing the right device model off the users.
>>>>>>
>>>>>> === optimizations ===
>>>>>>
>>>>>> Although the big picture can be realized by completing the TODO list
>>>>>> above, we
>>>>>> need some optimizations/enhancements to make Kemari useful in real
>>>>>> world, and
>>>>>> these are items what needs to be done for that.
>>>>>>
>>>>>> TODO:
>>>>>> - SMP (for the sake of performance might need to implement a
>>>>>> synchronization protocol that can maintain two or more
>>>>>> synchronization points active at any given moment)
>>>>>> - VGA (leverage VNC's subtilting mechanism to identify fb pages that
>>>>>> are really dirty).
>>>>>>
>>>>>>
>>>>>> Any comments/suggestions would be greatly appreciated.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Yoshi
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Kemari starts synchronizing VMs when QEMU handles I/O requests.
>>>>>> Without this patch VCPU state is already proceeded before
>>>>>> synchronization, and after failover to the VM on the receiver, it
>>>>>> hangs because of this.
>>>>>>
>>>>>> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>>>>>> ---
>>>>>> arch/x86/include/asm/kvm_host.h | 1 +
>>>>>> arch/x86/kvm/svm.c | 11 ++++++++---
>>>>>> arch/x86/kvm/vmx.c | 11 ++++++++---
>>>>>> arch/x86/kvm/x86.c | 4 ++++
>>>>>> 4 files changed, 21 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/x86/include/asm/kvm_host.h
>>>>>> b/arch/x86/include/asm/kvm_host.h
>>>>>> index 26c629a..7b8f514 100644
>>>>>> --- a/arch/x86/include/asm/kvm_host.h
>>>>>> +++ b/arch/x86/include/asm/kvm_host.h
>>>>>> @@ -227,6 +227,7 @@ struct kvm_pio_request {
>>>>>> int in;
>>>>>> int port;
>>>>>> int size;
>>>>>> + bool lazy_skip;
>>>>>> };
>>>>>>
>>>>>> /*
>>>>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>>>>> index d04c7ad..e373245 100644
>>>>>> --- a/arch/x86/kvm/svm.c
>>>>>> +++ b/arch/x86/kvm/svm.c
>>>>>> @@ -1495,7 +1495,7 @@ static int io_interception(struct vcpu_svm
>>>>>> *svm)
>>>>>> {
>>>>>> struct kvm_vcpu *vcpu =&svm->vcpu;
>>>>>> u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
>>>>>> - int size, in, string;
>>>>>> + int size, in, string, ret;
>>>>>> unsigned port;
>>>>>>
>>>>>> ++svm->vcpu.stat.io_exits;
>>>>>> @@ -1507,9 +1507,14 @@ static int io_interception(struct vcpu_svm
>>>>>> *svm)
>>>>>> port = io_info>> 16;
>>>>>> size = (io_info& SVM_IOIO_SIZE_MASK)>> SVM_IOIO_SIZE_SHIFT;
>>>>>> svm->next_rip = svm->vmcb->control.exit_info_2;
>>>>>> - skip_emulated_instruction(&svm->vcpu);
>>>>>>
>>>>>> - return kvm_fast_pio_out(vcpu, size, port);
>>>>>> + ret = kvm_fast_pio_out(vcpu, size, port);
>>>>>> + if (ret)
>>>>>> + skip_emulated_instruction(&svm->vcpu);
>>>>>> + else
>>>>>> + vcpu->arch.pio.lazy_skip = true;
>>>>>> +
>>>>>> + return ret;
>>>>>> }
>>>>>>
>>>>>> static int nmi_interception(struct vcpu_svm *svm)
>>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>>>>> index 41e63bb..09052d6 100644
>>>>>> --- a/arch/x86/kvm/vmx.c
>>>>>> +++ b/arch/x86/kvm/vmx.c
>>>>>> @@ -2975,7 +2975,7 @@ static int handle_triple_fault(struct kvm_vcpu
>>>>>> *vcpu)
>>>>>> static int handle_io(struct kvm_vcpu *vcpu)
>>>>>> {
>>>>>> unsigned long exit_qualification;
>>>>>> - int size, in, string;
>>>>>> + int size, in, string, ret;
>>>>>> unsigned port;
>>>>>>
>>>>>> exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
>>>>>> @@ -2989,9 +2989,14 @@ static int handle_io(struct kvm_vcpu *vcpu)
>>>>>>
>>>>>> port = exit_qualification>> 16;
>>>>>> size = (exit_qualification& 7) + 1;
>>>>>> - skip_emulated_instruction(vcpu);
>>>>>>
>>>>>> - return kvm_fast_pio_out(vcpu, size, port);
>>>>>> + ret = kvm_fast_pio_out(vcpu, size, port);
>>>>>> + if (ret)
>>>>>> + skip_emulated_instruction(vcpu);
>>>>>> + else
>>>>>> + vcpu->arch.pio.lazy_skip = true;
>>>>>> +
>>>>>> + return ret;
>>>>>> }
>>>>>>
>>>>>> static void
>>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>>>> index fd5c3d3..cc308d2 100644
>>>>>> --- a/arch/x86/kvm/x86.c
>>>>>> +++ b/arch/x86/kvm/x86.c
>>>>>> @@ -4544,6 +4544,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu
>>>>>> *vcpu, struct kvm_run *kvm_run)
>>>>>> if (!irqchip_in_kernel(vcpu->kvm))
>>>>>> kvm_set_cr8(vcpu, kvm_run->cr8);
>>>>>>
>>>>>> + if (vcpu->arch.pio.lazy_skip)
>>>>>> + kvm_x86_ops->skip_emulated_instruction(vcpu);
>>>>>> + vcpu->arch.pio.lazy_skip = false;
>>>>>> +
>>>>>> if (vcpu->arch.pio.count || vcpu->mmio_needed ||
>>>>>> vcpu->arch.emulate_ctxt.restart) {
>>>>>> if (vcpu->mmio_needed) {
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>
>
>
>
^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
  2010-04-23  5:17           ` Yoshiaki Tamura
@ 2010-04-23  7:36             ` Fernando Luis Vázquez Cao
  2010-04-25 21:52               ` Dor Laor
  0 siblings, 1 reply; 74+ messages in thread
From: Fernando Luis Vázquez Cao @ 2010-04-23  7:36 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: ohmura.kei, mtosatti, kvm, dlaor, qemu-devel, yoshikawa.takuya,
	aliguori, avi
On 04/23/2010 02:17 PM, Yoshiaki Tamura wrote:
> Dor Laor wrote:
[...]
>> Second, even if it wasn't the case, the tsc delta and kvmclock are
>> synchronized as part of the VM state so there is no use of trapping it
>> in the middle.
> 
> I should study the clock in KVM, but won't tsc get updated by the HW
> after migration?
> I was wondering the following case for example:
> 
> 1. The application on the guest calls rdtsc on host A.
> 2. The application uses rdtsc value for something.
> 3. Failover to host B.
> 4. The application on the guest replays the rdtsc call on host B.
> 5. If the rdtsc value is different between A and B, the application may
> get into trouble because of it.
Regarding the TSC, we need to guarantee that the guest sees a monotonic
TSC after migration, which can be achieved by adjusting the TSC offset properly.
Besides, we also need a trapping TSC, so that we can tackle the case where the
primary node and the standby node have different TSC frequencies.
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 04/20] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().
  2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 04/20] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer() Yoshiaki Tamura
  2010-04-21  8:03   ` [Qemu-devel] " Stefan Hajnoczi
@ 2010-04-23  9:53   ` Avi Kivity
  2010-04-23  9:59     ` Yoshiaki Tamura
  2010-04-23 13:26     ` Anthony Liguori
  1 sibling, 2 replies; 74+ messages in thread
From: Avi Kivity @ 2010-04-23  9:53 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: aliguori, kvm, ohmura.kei, mtosatti, qemu-devel, yoshikawa.takuya
On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
> Currently buf size is fixed at 32KB.  It would be useful if it could
> be flexible.
>
>    
Why is this needed?  The real buffering is in the kernel anyways; this 
is only used to reduce the number of write() syscalls.
-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 04/20] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().
  2010-04-23  9:53   ` Avi Kivity
@ 2010-04-23  9:59     ` Yoshiaki Tamura
  2010-04-23 13:14       ` Avi Kivity
  2010-04-23 13:26     ` Anthony Liguori
  1 sibling, 1 reply; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-23  9:59 UTC (permalink / raw)
  To: Avi Kivity
  Cc: aliguori, kvm, ohmura.kei, mtosatti, qemu-devel, yoshikawa.takuya
Avi Kivity wrote:
> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>> Currently buf size is fixed at 32KB. It would be useful if it could
>> be flexible.
>>
>
> Why is this needed? The real buffering is in the kernel anyways; this is
> only used to reduce the number of write() syscalls.
This was introduced to buffer the transfered guests image transaction ally on 
the receiver side.  The sender doesn't use it.
In case of intermediate state, we just discard this buffer.
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 00/20] Kemari for KVM v0.1
  2010-04-23  0:45   ` Yoshiaki Tamura
@ 2010-04-23 13:10     ` Anthony Liguori
  0 siblings, 0 replies; 74+ messages in thread
From: Anthony Liguori @ 2010-04-23 13:10 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: ohmura.kei, kvm, mtosatti, Anthony Liguori, qemu-devel,
	yoshikawa.takuya, avi
On 04/22/2010 07:45 PM, Yoshiaki Tamura wrote:
> Anthony Liguori wrote:
>
>> I think it would make sense to separate out the things that are actually
>> optimizations (like the dirty bitmap changes and the writev/readv
>> changes) and to attempt to justify them with actual performance data.
>
> I agree with the separation plan.
>
> For dirty bitmap change, Avi and I discussed on patchset for upsream 
> QEMU while you were offline (Sorry, if I was wrong).  Could you also 
> take a look?
Yes, I've seen it and I don't disagree.  That said, there ought to be 
perf data in the commit log so that down the road, the justification is 
understood.
> http://lists.gnu.org/archive/html/qemu-devel/2010-04/msg01396.html
>
> Regarding writev, I agree that it should be backed with actual data, 
> otherwise it should be removed.  We attemped to do everything that may 
> reduce the overhead of the transaction.
>
>> I'd prefer not to modify the live migration protocol ABI and it doesn't
>> seem to be necessary if we're willing to add options to the -incoming
>> flag. We also want to be a bit more generic with respect to IO.
>
> I totally agree with your approach not to change the protocol ABI.  
> Can we add an option to -incoming?  Like, -incoming ft_mode, for example
> Regarding the IO, let me reply to the next message.
>
>> Otherwise, the series looks very close to being mergable.
>
> Thank you for your comment on each patch.
>
> To be honest, I wasn't that confident because I'm a newbie to KVM/QEMU 
> and struggled for how to implement in an acceptable way.
The series looks very good.  I'm eager to see this functionality merged.
Regards,
Anthony Liguori
> Thanks,
>
> Yoshi
>
>>
>> Regards,
>>
>> Anthony Liguori
>>
>>
>>
>
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 04/20] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().
  2010-04-23  9:59     ` Yoshiaki Tamura
@ 2010-04-23 13:14       ` Avi Kivity
  2010-04-26 10:43         ` Yoshiaki Tamura
  0 siblings, 1 reply; 74+ messages in thread
From: Avi Kivity @ 2010-04-23 13:14 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: aliguori, kvm, ohmura.kei, mtosatti, qemu-devel, yoshikawa.takuya
On 04/23/2010 12:59 PM, Yoshiaki Tamura wrote:
> Avi Kivity wrote:
>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>> Currently buf size is fixed at 32KB. It would be useful if it could
>>> be flexible.
>>>
>>
>> Why is this needed? The real buffering is in the kernel anyways; this is
>> only used to reduce the number of write() syscalls.
>
> This was introduced to buffer the transfered guests image transaction 
> ally on the receiver side.  The sender doesn't use it.
> In case of intermediate state, we just discard this buffer.
How large can it grow?
What's wrong with applying it (perhaps partially) to the guest state?  
The next state transfer will overwrite it completely, no?
-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
  2010-04-23  1:53           ` Yoshiaki Tamura
@ 2010-04-23 13:20             ` Anthony Liguori
  2010-04-26 10:44               ` Yoshiaki Tamura
  0 siblings, 1 reply; 74+ messages in thread
From: Anthony Liguori @ 2010-04-23 13:20 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: ohmura.kei, mtosatti, kvm, dlaor, qemu-devel, yoshikawa.takuya,
	avi
On 04/22/2010 08:53 PM, Yoshiaki Tamura wrote:
> Anthony Liguori wrote:
>> On 04/22/2010 08:16 AM, Yoshiaki Tamura wrote:
>>> 2010/4/22 Dor Laor<dlaor@redhat.com>:
>>>> On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
>>>>> Dor Laor wrote:
>>>>>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> We have been implementing the prototype of Kemari for KVM, and 
>>>>>>> we're
>>>>>>> sending
>>>>>>> this message to share what we have now and TODO lists. 
>>>>>>> Hopefully, we
>>>>>>> would like
>>>>>>> to get early feedback to keep us in the right direction. Although
>>>>>>> advanced
>>>>>>> approaches in the TODO lists are fascinating, we would like to run
>>>>>>> this project
>>>>>>> step by step while absorbing comments from the community. The 
>>>>>>> current
>>>>>>> code is
>>>>>>> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>>>>>>>
>>>>>>> For those who are new to Kemari for KVM, please take a look at the
>>>>>>> following RFC which we posted last year.
>>>>>>>
>>>>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>>>>>>
>>>>>>> The transmission/transaction protocol, and most of the control
>>>>>>> logic is
>>>>>>> implemented in QEMU. However, we needed a hack in KVM to prevent 
>>>>>>> rip
>>>>>>> from
>>>>>>> proceeding before synchronizing VMs. It may also need some
>>>>>>> plumbing in
>>>>>>> the
>>>>>>> kernel side to guarantee replayability of certain events and
>>>>>>> instructions,
>>>>>>> integrate the RAS capabilities of newer x86 hardware with the HA
>>>>>>> stack, as well
>>>>>>> as for optimization purposes, for example.
>>>>>> [ snap]
>>>>>>
>>>>>>> The rest of this message describes TODO lists grouped by each 
>>>>>>> topic.
>>>>>>>
>>>>>>> === event tapping ===
>>>>>>>
>>>>>>> Event tapping is the core component of Kemari, and it decides on
>>>>>>> which
>>>>>>> event the
>>>>>>> primary should synchronize with the secondary. The basic assumption
>>>>>>> here is
>>>>>>> that outgoing I/O operations are idempotent, which is usually true
>>>>>>> for
>>>>>>> disk I/O
>>>>>>> and reliable network protocols such as TCP.
>>>>>> IMO any type of network even should be stalled too. What if the VM
>>>>>> runs
>>>>>> non tcp protocol and the packet that the master node sent reached 
>>>>>> some
>>>>>> remote client and before the sync to the slave the master failed?
>>>>> In current implementation, it is actually stalling any type of 
>>>>> network
>>>>> that goes through virtio-net.
>>>>>
>>>>> However, if the application was using unreliable protocols, it should
>>>>> have its own recovering mechanism, or it should be completely
>>>>> stateless.
>>>> Why do you treat tcp differently? You can damage the entire VM this
>>>> way -
>>>> think of dhcp request that was dropped on the moment you switched
>>>> between
>>>> the master and the slave?
>>> I'm not trying to say that we should treat tcp differently, but just
>>> it's severe.
>>> In case of dhcp request, the client would have a chance to retry after
>>> failover, correct?
>>> BTW, in current implementation,
>>
>> I'm slightly confused about the current implementation vs. my
>> recollection of the original paper with Xen. I had thought that all disk
>> and network I/O was buffered in such a way that at each checkpoint, the
>> I/O operations would be released in a burst. Otherwise, you would have
>> to synchronize after every I/O operation which is what it seems the
>> current implementation does.
>
> Yes, you're almost right.
> It's synchronizing before QEMU starts emulating I/O at each device model.
If NodeA is the master and NodeB is the slave, if NodeA sends a network 
packet, you'll checkpoint before the packet is actually sent, and then 
if a failure occurs before the next checkpoint, won't that result in 
both NodeA and NodeB sending out a duplicate version of the packet?
Regards,
Anthony Liguori
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops().
  2010-04-23  3:37     ` Yoshiaki Tamura
@ 2010-04-23 13:22       ` Anthony Liguori
  2010-04-23 13:48         ` Avi Kivity
  2010-04-26 10:43         ` Yoshiaki Tamura
  0 siblings, 2 replies; 74+ messages in thread
From: Anthony Liguori @ 2010-04-23 13:22 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: ohmura.kei, kvm, mtosatti, Anthony Liguori, qemu-devel,
	yoshikawa.takuya, avi
On 04/22/2010 10:37 PM, Yoshiaki Tamura wrote:
> Anthony Liguori wrote:
>> On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:
>>> QEMUFile currently doesn't support writev(). For sending multiple
>>> data, such as pages, using writev() should be more efficient.
>>>
>>> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>>
>> Is there performance data that backs this up? Since QEMUFile uses a
>> linear buffer for most operations that's limited to 16k, I suspect you
>> wouldn't be able to observe a difference in practice.
>
> I currently don't have data, but I'll prepare it.
> There were two things I wanted to avoid.
>
> 1. Pages to be copied to QEMUFile buf through qemu_put_buffer.
> 2. Calling write() everytime even when we want to send multiple pages 
> at once.
>
> I think 2 may be neglectable.
> But 1 seems to be problematic if we want make to the latency as small 
> as possible, no?
Copying often has strange CPU characteristics depending on whether the 
data is already in cache.  It's better to drive these sort of 
optimizations through performance measurement because changes are not 
always obvious.
Regards,
Anthony Liguori
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 07/20] Introduce qemu_put_vector() and qemu_put_vector_prepare() to use put_vector() in QEMUFile.
  2010-04-23  4:02     ` Yoshiaki Tamura
@ 2010-04-23 13:23       ` Anthony Liguori
  2010-04-26 10:43         ` Yoshiaki Tamura
  0 siblings, 1 reply; 74+ messages in thread
From: Anthony Liguori @ 2010-04-23 13:23 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: ohmura.kei, kvm, mtosatti, Anthony Liguori, qemu-devel,
	yoshikawa.takuya, avi
On 04/22/2010 11:02 PM, Yoshiaki Tamura wrote:
> Anthony Liguori wrote:
>> On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:
>>> For fool proof purpose, qemu_put_vector_parepare should be called
>>> before qemu_put_vector. Then, if qemu_put_* functions except this is
>>> called after qemu_put_vector_prepare, program will abort().
>>>
>>> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>>
>> I don't get it. What's this protecting against?
>
> This was introduced to prevent mixing the order of normal write and 
> vector write, and flush QEMUFile buffer before handling vectors.
> While qemu_put_buffer copies data to QEMUFile buffer, 
> qemu_put_vector() will bypass that buffer.
>
> It's just fool proof purpose for what we encountered at beginning, and 
> if the user of qemu_put_vector() is careful enough, we can remove 
> qemu_put_vectore_prepare().  While writing this message, I started to 
> think that just calling qemu_fflush() in qemu_put_vector() would be 
> enough...
I definitely think removing the vector stuff in the first version would 
simplify the process of getting everything merged.  I'd prefer not to 
have two apis so if vector operations were important from a performance 
perspective, I'd want to see everything converted to a vector API.
Regards,
Anthony Liguori
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 00/20] Kemari for KVM v0.1
  2010-04-21  5:57 [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Yoshiaki Tamura
                   ` (21 preceding siblings ...)
  2010-04-22 19:42 ` [Qemu-devel] " Anthony Liguori
@ 2010-04-23 13:24 ` Avi Kivity
  2010-04-26 10:44   ` Yoshiaki Tamura
  22 siblings, 1 reply; 74+ messages in thread
From: Avi Kivity @ 2010-04-23 13:24 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: aliguori, kvm, ohmura.kei, mtosatti, qemu-devel, yoshikawa.takuya
On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
> Kemari starts synchronizing VMs when QEMU handles I/O requests.
> Without this patch VCPU state is already proceeded before
> synchronization, and after failover to the VM on the receiver, it
> hangs because of this.
>    
We discussed moving the barrier to the actual output device, instead of 
the I/O port.  This allows you to complete the I/O transaction before 
starting synchronization.
Does it not work for some reason?
-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 04/20] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().
  2010-04-23  9:53   ` Avi Kivity
  2010-04-23  9:59     ` Yoshiaki Tamura
@ 2010-04-23 13:26     ` Anthony Liguori
  1 sibling, 0 replies; 74+ messages in thread
From: Anthony Liguori @ 2010-04-23 13:26 UTC (permalink / raw)
  To: Avi Kivity
  Cc: ohmura.kei, kvm, qemu-devel, mtosatti, Anthony Liguori,
	Yoshiaki Tamura, yoshikawa.takuya
On 04/23/2010 04:53 AM, Avi Kivity wrote:
> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>> Currently buf size is fixed at 32KB.  It would be useful if it could
>> be flexible.
>>
>
> Why is this needed?  The real buffering is in the kernel anyways; this 
> is only used to reduce the number of write() syscalls.
With vmstate, we really shouldn't need to do this magic anymore as an 
optimization.
Regards,
Anthony Liguori
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops().
  2010-04-23 13:22       ` Anthony Liguori
@ 2010-04-23 13:48         ` Avi Kivity
  2010-05-03  9:32           ` Yoshiaki Tamura
  2010-04-26 10:43         ` Yoshiaki Tamura
  1 sibling, 1 reply; 74+ messages in thread
From: Avi Kivity @ 2010-04-23 13:48 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: ohmura.kei, kvm, qemu-devel, mtosatti, Anthony Liguori,
	Yoshiaki Tamura, yoshikawa.takuya
On 04/23/2010 04:22 PM, Anthony Liguori wrote:
>> I currently don't have data, but I'll prepare it.
>> There were two things I wanted to avoid.
>>
>> 1. Pages to be copied to QEMUFile buf through qemu_put_buffer.
>> 2. Calling write() everytime even when we want to send multiple pages 
>> at once.
>>
>> I think 2 may be neglectable.
>> But 1 seems to be problematic if we want make to the latency as small 
>> as possible, no?
>
>
> Copying often has strange CPU characteristics depending on whether the 
> data is already in cache.  It's better to drive these sort of 
> optimizations through performance measurement because changes are not 
> always obvious.
Copying always introduces more cache pollution, so even if the data is 
in the cache, it is worthwhile (not disagreeing with the need to measure).
-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
  2010-04-23  0:20       ` Yoshiaki Tamura
@ 2010-04-23 15:07         ` Jamie Lokier
  0 siblings, 0 replies; 74+ messages in thread
From: Jamie Lokier @ 2010-04-23 15:07 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: ohmura.kei, mtosatti, kvm, dlaor, qemu-devel, yoshikawa.takuya,
	aliguori, avi
Yoshiaki Tamura wrote:
> Jamie Lokier wrote:
> >Yoshiaki Tamura wrote:
> >>Dor Laor wrote:
> >>>On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
> >>>>Event tapping is the core component of Kemari, and it decides on which
> >>>>event the
> >>>>primary should synchronize with the secondary. The basic assumption
> >>>>here is
> >>>>that outgoing I/O operations are idempotent, which is usually true for
> >>>>disk I/O
> >>>>and reliable network protocols such as TCP.
> >>>
> >>>IMO any type of network even should be stalled too. What if the VM runs
> >>>non tcp protocol and the packet that the master node sent reached some
> >>>remote client and before the sync to the slave the master failed?
> >>
> >>In current implementation, it is actually stalling any type of network
> >>that goes through virtio-net.
> >>
> >>However, if the application was using unreliable protocols, it should have
> >>its own recovering mechanism, or it should be completely stateless.
> >
> >Even with unreliable protocols, if slave takeover causes the receiver
> >to have received a packet that the sender _does not think it has ever
> >sent_, expect some protocols to break.
> >
> >If the slave replaying master's behaviour since the last sync means it
> >will definitely get into the same state of having sent the packet,
> >that works out.
> 
> That's something we're expecting now.
> 
> >But you still have to be careful that the other end's responses to
> >that packet are not seen by the slave too early during that replay.
> >Otherwise, for example, the slave may observe a TCP ACK to a packet
> >that it hasn't yet sent, which is an error.
> 
> Even current implementation syncs just before network output, what you 
> pointed out could happen.  In this case, would the connection going to be 
> lost, or would client/server recover from it?  If latter, it would be fine, 
> otherwise I wonder how people doing similar things are handling this 
> situation.
In the case of TCP in a "synchronised state", I think it will recover
according to the rules in RFC793.  In an "unsynchronised state"
(during connection), I'm not sure if it recovers or if it looks like a
"Connection reset" error.  I suspect it does recover but I'm not certain.
But that's TCP.  Other protocols, such as over UDP, may behave
differently, because this is not an anticipated behaviour of a
network.
> >However there is one respect in which they're not idempotent:
> >
> >The TTL field should be decreased if packets are delayed.  Packets
> >should not appear to live in the network for longer than TTL seconds.
> >If they do, some protocols (like TCP) can react to the delayed ones
> >differently, such as sending a RST packet and breaking a connection.
> >
> >It is acceptable to reduce TTL faster than the minimum.  After all, it
> >is reduced by 1 on every forwarding hop, in addition to time delays.
> 
> So the problem is, when the slave takes over, it sends a packet with same 
> TTL which client may have received.
Yes.  I guess this is a general problem with time-based protocols and
virtual machines getting stopped for 1 minute (say), without knowing
that real time has moved on for the other nodes.
Some application transaction, caching and locking protocols will give
wrong results when their time assumptions are discontinuous to such a
large degree.  It's a bit nasty to impose that on them after they
worked so hard on their reliability :-)
However, I think such implementations _could_ be made safe if those
programs can arrange to definitely be interrupted with a signal when
the discontinuity happens.  Of course, only if they're aware they may
be running on a Kemari system...
I have an intuitive idea that there is a solution to that, but each
time I try to write the next paragraph explaining it, some little
complication crops up and it needs more thought.  Something about
concurrent, asynchronous transactions to keep the master running while
recording the minimum states that replay needs to be safe, while
slewing the replaying slave's virtual clock back to real time quickly
during recovery mode.
-- Jamie
^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
  2010-04-23  7:36             ` Fernando Luis Vázquez Cao
@ 2010-04-25 21:52               ` Dor Laor
  0 siblings, 0 replies; 74+ messages in thread
From: Dor Laor @ 2010-04-25 21:52 UTC (permalink / raw)
  To: Fernando Luis Vázquez Cao
  Cc: ohmura.kei, kvm, mtosatti, Yoshiaki Tamura, qemu-devel,
	yoshikawa.takuya, aliguori, avi
On 04/23/2010 10:36 AM, Fernando Luis Vázquez Cao wrote:
> On 04/23/2010 02:17 PM, Yoshiaki Tamura wrote:
>> Dor Laor wrote:
> [...]
>>> Second, even if it wasn't the case, the tsc delta and kvmclock are
>>> synchronized as part of the VM state so there is no use of trapping it
>>> in the middle.
>>
>> I should study the clock in KVM, but won't tsc get updated by the HW
>> after migration?
>> I was wondering the following case for example:
>>
>> 1. The application on the guest calls rdtsc on host A.
>> 2. The application uses rdtsc value for something.
>> 3. Failover to host B.
>> 4. The application on the guest replays the rdtsc call on host B.
>> 5. If the rdtsc value is different between A and B, the application may
>> get into trouble because of it.
>
> Regarding the TSC, we need to guarantee that the guest sees a monotonic
> TSC after migration, which can be achieved by adjusting the TSC offset properly.
> Besides, we also need a trapping TSC, so that we can tackle the case where the
> primary node and the standby node have different TSC frequencies.
You're right but this is already taken care of by normal save/restore 
process. Check void kvm_load_tsc(CPUState *env) function.
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 04/20] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().
  2010-04-23 13:14       ` Avi Kivity
@ 2010-04-26 10:43         ` Yoshiaki Tamura
  0 siblings, 0 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-26 10:43 UTC (permalink / raw)
  To: Avi Kivity
  Cc: aliguori, kvm, ohmura.kei, mtosatti, qemu-devel, yoshikawa.takuya
Avi Kivity wrote:
> On 04/23/2010 12:59 PM, Yoshiaki Tamura wrote:
>> Avi Kivity wrote:
>>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>>> Currently buf size is fixed at 32KB. It would be useful if it could
>>>> be flexible.
>>>>
>>>
>>> Why is this needed? The real buffering is in the kernel anyways; this is
>>> only used to reduce the number of write() syscalls.
>>
>> This was introduced to buffer the transfered guests image transaction
>> ally on the receiver side. The sender doesn't use it.
>> In case of intermediate state, we just discard this buffer.
>
> How large can it grow?
It really depends on what workload is running on the guest, but it should be as 
large as the guest ram size in the worst case.
> What's wrong with applying it (perhaps partially) to the guest state?
> The next state transfer will overwrite it completely, no?
AFAIK, the answer is no.
qemu_loadvm_state() calls load handlers of each device emulator, and they will 
update its state directly, which means even if the transaction was not complete, 
it's impossible to recover the previous state if we don't make a buffer.
I guess your concern is about consuming large size of ram, and I think having an 
option for writing the transaction to a temporal disk image should be effective.
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 07/20] Introduce qemu_put_vector() and qemu_put_vector_prepare() to use put_vector() in QEMUFile.
  2010-04-23 13:23       ` Anthony Liguori
@ 2010-04-26 10:43         ` Yoshiaki Tamura
  0 siblings, 0 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-26 10:43 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: ohmura.kei, kvm, mtosatti, Anthony Liguori, qemu-devel,
	yoshikawa.takuya, avi
Anthony Liguori wrote:
> On 04/22/2010 11:02 PM, Yoshiaki Tamura wrote:
>> Anthony Liguori wrote:
>>> On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:
>>>> For fool proof purpose, qemu_put_vector_parepare should be called
>>>> before qemu_put_vector. Then, if qemu_put_* functions except this is
>>>> called after qemu_put_vector_prepare, program will abort().
>>>>
>>>> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>>>
>>> I don't get it. What's this protecting against?
>>
>> This was introduced to prevent mixing the order of normal write and
>> vector write, and flush QEMUFile buffer before handling vectors.
>> While qemu_put_buffer copies data to QEMUFile buffer,
>> qemu_put_vector() will bypass that buffer.
>>
>> It's just fool proof purpose for what we encountered at beginning, and
>> if the user of qemu_put_vector() is careful enough, we can remove
>> qemu_put_vectore_prepare(). While writing this message, I started to
>> think that just calling qemu_fflush() in qemu_put_vector() would be
>> enough...
>
> I definitely think removing the vector stuff in the first version would
> simplify the process of getting everything merged. I'd prefer not to
> have two apis so if vector operations were important from a performance
> perspective, I'd want to see everything converted to a vector API.
I agree with your opinion.
I will measure the effect of introducing vector stuff, and post the data later.
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops().
  2010-04-23 13:22       ` Anthony Liguori
  2010-04-23 13:48         ` Avi Kivity
@ 2010-04-26 10:43         ` Yoshiaki Tamura
  1 sibling, 0 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-26 10:43 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: ohmura.kei, kvm, mtosatti, Anthony Liguori, qemu-devel,
	yoshikawa.takuya, avi
Anthony Liguori wrote:
> On 04/22/2010 10:37 PM, Yoshiaki Tamura wrote:
>> Anthony Liguori wrote:
>>> On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:
>>>> QEMUFile currently doesn't support writev(). For sending multiple
>>>> data, such as pages, using writev() should be more efficient.
>>>>
>>>> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>>>
>>> Is there performance data that backs this up? Since QEMUFile uses a
>>> linear buffer for most operations that's limited to 16k, I suspect you
>>> wouldn't be able to observe a difference in practice.
>>
>> I currently don't have data, but I'll prepare it.
>> There were two things I wanted to avoid.
>>
>> 1. Pages to be copied to QEMUFile buf through qemu_put_buffer.
>> 2. Calling write() everytime even when we want to send multiple pages
>> at once.
>>
>> I think 2 may be neglectable.
>> But 1 seems to be problematic if we want make to the latency as small
>> as possible, no?
>
> Copying often has strange CPU characteristics depending on whether the
> data is already in cache. It's better to drive these sort of
> optimizations through performance measurement because changes are not
> always obvious.
I agree.
^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
  2010-04-23 13:20             ` Anthony Liguori
@ 2010-04-26 10:44               ` Yoshiaki Tamura
  0 siblings, 0 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-26 10:44 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: ohmura.kei, mtosatti, kvm, dlaor, qemu-devel, yoshikawa.takuya,
	avi
Anthony Liguori wrote:
> On 04/22/2010 08:53 PM, Yoshiaki Tamura wrote:
>> Anthony Liguori wrote:
>>> On 04/22/2010 08:16 AM, Yoshiaki Tamura wrote:
>>>> 2010/4/22 Dor Laor<dlaor@redhat.com>:
>>>>> On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
>>>>>> Dor Laor wrote:
>>>>>>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> We have been implementing the prototype of Kemari for KVM, and
>>>>>>>> we're
>>>>>>>> sending
>>>>>>>> this message to share what we have now and TODO lists.
>>>>>>>> Hopefully, we
>>>>>>>> would like
>>>>>>>> to get early feedback to keep us in the right direction. Although
>>>>>>>> advanced
>>>>>>>> approaches in the TODO lists are fascinating, we would like to run
>>>>>>>> this project
>>>>>>>> step by step while absorbing comments from the community. The
>>>>>>>> current
>>>>>>>> code is
>>>>>>>> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>>>>>>>>
>>>>>>>> For those who are new to Kemari for KVM, please take a look at the
>>>>>>>> following RFC which we posted last year.
>>>>>>>>
>>>>>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>>>>>>>
>>>>>>>> The transmission/transaction protocol, and most of the control
>>>>>>>> logic is
>>>>>>>> implemented in QEMU. However, we needed a hack in KVM to prevent
>>>>>>>> rip
>>>>>>>> from
>>>>>>>> proceeding before synchronizing VMs. It may also need some
>>>>>>>> plumbing in
>>>>>>>> the
>>>>>>>> kernel side to guarantee replayability of certain events and
>>>>>>>> instructions,
>>>>>>>> integrate the RAS capabilities of newer x86 hardware with the HA
>>>>>>>> stack, as well
>>>>>>>> as for optimization purposes, for example.
>>>>>>> [ snap]
>>>>>>>
>>>>>>>> The rest of this message describes TODO lists grouped by each
>>>>>>>> topic.
>>>>>>>>
>>>>>>>> === event tapping ===
>>>>>>>>
>>>>>>>> Event tapping is the core component of Kemari, and it decides on
>>>>>>>> which
>>>>>>>> event the
>>>>>>>> primary should synchronize with the secondary. The basic assumption
>>>>>>>> here is
>>>>>>>> that outgoing I/O operations are idempotent, which is usually true
>>>>>>>> for
>>>>>>>> disk I/O
>>>>>>>> and reliable network protocols such as TCP.
>>>>>>> IMO any type of network even should be stalled too. What if the VM
>>>>>>> runs
>>>>>>> non tcp protocol and the packet that the master node sent reached
>>>>>>> some
>>>>>>> remote client and before the sync to the slave the master failed?
>>>>>> In current implementation, it is actually stalling any type of
>>>>>> network
>>>>>> that goes through virtio-net.
>>>>>>
>>>>>> However, if the application was using unreliable protocols, it should
>>>>>> have its own recovering mechanism, or it should be completely
>>>>>> stateless.
>>>>> Why do you treat tcp differently? You can damage the entire VM this
>>>>> way -
>>>>> think of dhcp request that was dropped on the moment you switched
>>>>> between
>>>>> the master and the slave?
>>>> I'm not trying to say that we should treat tcp differently, but just
>>>> it's severe.
>>>> In case of dhcp request, the client would have a chance to retry after
>>>> failover, correct?
>>>> BTW, in current implementation,
>>>
>>> I'm slightly confused about the current implementation vs. my
>>> recollection of the original paper with Xen. I had thought that all disk
>>> and network I/O was buffered in such a way that at each checkpoint, the
>>> I/O operations would be released in a burst. Otherwise, you would have
>>> to synchronize after every I/O operation which is what it seems the
>>> current implementation does.
>>
>> Yes, you're almost right.
>> It's synchronizing before QEMU starts emulating I/O at each device model.
>
> If NodeA is the master and NodeB is the slave, if NodeA sends a network
> packet, you'll checkpoint before the packet is actually sent, and then
> if a failure occurs before the next checkpoint, won't that result in
> both NodeA and NodeB sending out a duplicate version of the packet?
Yes.  But I think it's better than taking checkpoint after.
If we checkpoint after sending packet, let's say it sent TCP ACK to the client, 
and if a hardware failure occurred to NodeA during the transaction *but the 
client received the TCP ACK*, NodeB will resume from the previous state, and it 
may need to receive some data from the client. However, because the client has 
already receiver TCP ACK, it won't resend the data to NodeB.  It looks this 
data is going to be dropped.
Anyway, I've just started planning to move the sync point to network/block 
layer, and I would post the result for discussion again.
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 00/20] Kemari for KVM v0.1
  2010-04-23 13:24 ` Avi Kivity
@ 2010-04-26 10:44   ` Yoshiaki Tamura
  0 siblings, 0 replies; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-04-26 10:44 UTC (permalink / raw)
  To: Avi Kivity
  Cc: aliguori, kvm, ohmura.kei, mtosatti, qemu-devel, yoshikawa.takuya
Avi Kivity wrote:
> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>> Kemari starts synchronizing VMs when QEMU handles I/O requests.
>> Without this patch VCPU state is already proceeded before
>> synchronization, and after failover to the VM on the receiver, it
>> hangs because of this.
>
> We discussed moving the barrier to the actual output device, instead of
> the I/O port. This allows you to complete the I/O transaction before
> starting synchronization.
>
> Does it not work for some reason?
Sorry, I've just started working on that.
I've posted this series to share what I have done so far.
Thanks for looking.
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops().
  2010-04-23 13:48         ` Avi Kivity
@ 2010-05-03  9:32           ` Yoshiaki Tamura
  2010-05-03 12:05             ` Anthony Liguori
  0 siblings, 1 reply; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-05-03  9:32 UTC (permalink / raw)
  To: Avi Kivity, Anthony Liguori
  Cc: Anthony Liguori, kvm, ohmura.kei, mtosatti, qemu-devel,
	yoshikawa.takuya
2010/4/23 Avi Kivity <avi@redhat.com>:
> On 04/23/2010 04:22 PM, Anthony Liguori wrote:
>>>
>>> I currently don't have data, but I'll prepare it.
>>> There were two things I wanted to avoid.
>>>
>>> 1. Pages to be copied to QEMUFile buf through qemu_put_buffer.
>>> 2. Calling write() everytime even when we want to send multiple pages at
>>> once.
>>>
>>> I think 2 may be neglectable.
>>> But 1 seems to be problematic if we want make to the latency as small as
>>> possible, no?
>>
>>
>> Copying often has strange CPU characteristics depending on whether the
>> data is already in cache.  It's better to drive these sort of optimizations
>> through performance measurement because changes are not always obvious.
>
> Copying always introduces more cache pollution, so even if the data is in
> the cache, it is worthwhile (not disagreeing with the need to measure).
Anthony,
I measure how long it takes to send all guest pages during migration, and I
would like to share the information in this message.  For convenience,
I modified
the code to do migration not "live migration" which means buffered file is not
used here.
In summary, the performance improvement using writev instead of write/send when
we used GbE seems to be neglectable, however, when the underlying network was
fast (InfiniBand with IPoIB in this case), writev performed 17% faster than
write/send, and therefore, it may be worthwhile to introduce vectors.
Since QEMU compresses pages, I copied a junk file to tmpfs to dirty pages to let
QEMU to transfer fine number of pages.  After setting up the guest, I used
cpu_get_real_ticks() to measure the time during the while loop calling
ram_save_block() in ram_save_live().  I removed the qemu_file_rate_limit() to
disable the function of buffered file, and all of the pages would be transfered
at the first round.
I measure 10 times for each, and took average and standard deviation.
Considering the results, I think the trial number was enough.  In addition to
time duration, number of writev/write and number of pages which were compressed
(dup)/not compressed (nodup) are demonstrated.
Test Environment:
CPU: 2x Intel Xeon Dual Core 3GHz
Mem size: 6GB
Network: GbE, InfiniBand (IPoIB)
Host OS: Fedora 11 (kernel 2.6.34-rc1)
Guest OS: Fedora 11 (kernel 2.6.33)
Guest Mem size: 512MB
* GbE writev
time (sec): 35.732 (std 0.002)
write count: 4 (std 0)
writev count: 8269 (std 1)
dup count: 36157 (std 124)
nodup count: 1016808 (std 147)
* GbE write
time (sec): 35.780 (std 0.164)
write count: 127367 (21)
writev count: 0 (std 0)
dup count: 36134 (std 108)
nodup count: 1016853 (std 165)
* IPoIB writev
time (sec): 13.889 (std 0.155)
write count: 4 (std 0)
writev count: 8267 (std 1)
dup count: 36147 (std 105)
nodup count: 1016838 (std 111)
* IPoIB write
time (sec): 16.777 (std 0.239)
write count: 127364 (24)
writev count: 0 (std 0)
dup count: 36173 (std 169)
nodup count: 1016840 (std 190)
Although the improvement wasn't obvious when the network wan GbE, introducing
writev may be worthwhile when we focus on faster networks like InfiniBand/10GE.
I agree that separating this optimization from the main logic of Kemari since
this modification must be done widely and carefully at the same time.
Thanks,
Yoshi
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops().
  2010-05-03  9:32           ` Yoshiaki Tamura
@ 2010-05-03 12:05             ` Anthony Liguori
  2010-05-03 15:36               ` Yoshiaki Tamura
  0 siblings, 1 reply; 74+ messages in thread
From: Anthony Liguori @ 2010-05-03 12:05 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: ohmura.kei, kvm, mtosatti, Anthony Liguori, qemu-devel,
	yoshikawa.takuya, Avi Kivity
On 05/03/2010 04:32 AM, Yoshiaki Tamura wrote:
> 2010/4/23 Avi Kivity<avi@redhat.com>:
>    
>> On 04/23/2010 04:22 PM, Anthony Liguori wrote:
>>      
>>>> I currently don't have data, but I'll prepare it.
>>>> There were two things I wanted to avoid.
>>>>
>>>> 1. Pages to be copied to QEMUFile buf through qemu_put_buffer.
>>>> 2. Calling write() everytime even when we want to send multiple pages at
>>>> once.
>>>>
>>>> I think 2 may be neglectable.
>>>> But 1 seems to be problematic if we want make to the latency as small as
>>>> possible, no?
>>>>          
>>>
>>> Copying often has strange CPU characteristics depending on whether the
>>> data is already in cache.  It's better to drive these sort of optimizations
>>> through performance measurement because changes are not always obvious.
>>>        
>> Copying always introduces more cache pollution, so even if the data is in
>> the cache, it is worthwhile (not disagreeing with the need to measure).
>>      
> Anthony,
>
> I measure how long it takes to send all guest pages during migration, and I
> would like to share the information in this message.  For convenience,
> I modified
> the code to do migration not "live migration" which means buffered file is not
> used here.
>
> In summary, the performance improvement using writev instead of write/send when
> we used GbE seems to be neglectable, however, when the underlying network was
> fast (InfiniBand with IPoIB in this case), writev performed 17% faster than
> write/send, and therefore, it may be worthwhile to introduce vectors.
>
> Since QEMU compresses pages, I copied a junk file to tmpfs to dirty pages to let
> QEMU to transfer fine number of pages.  After setting up the guest, I used
> cpu_get_real_ticks() to measure the time during the while loop calling
> ram_save_block() in ram_save_live().  I removed the qemu_file_rate_limit() to
> disable the function of buffered file, and all of the pages would be transfered
> at the first round.
>
> I measure 10 times for each, and took average and standard deviation.
> Considering the results, I think the trial number was enough.  In addition to
> time duration, number of writev/write and number of pages which were compressed
> (dup)/not compressed (nodup) are demonstrated.
>
> Test Environment:
> CPU: 2x Intel Xeon Dual Core 3GHz
> Mem size: 6GB
> Network: GbE, InfiniBand (IPoIB)
>
> Host OS: Fedora 11 (kernel 2.6.34-rc1)
> Guest OS: Fedora 11 (kernel 2.6.33)
> Guest Mem size: 512MB
>
> * GbE writev
> time (sec): 35.732 (std 0.002)
> write count: 4 (std 0)
> writev count: 8269 (std 1)
> dup count: 36157 (std 124)
> nodup count: 1016808 (std 147)
>
> * GbE write
> time (sec): 35.780 (std 0.164)
> write count: 127367 (21)
> writev count: 0 (std 0)
> dup count: 36134 (std 108)
> nodup count: 1016853 (std 165)
>
> * IPoIB writev
> time (sec): 13.889 (std 0.155)
> write count: 4 (std 0)
> writev count: 8267 (std 1)
> dup count: 36147 (std 105)
> nodup count: 1016838 (std 111)
>
> * IPoIB write
> time (sec): 16.777 (std 0.239)
> write count: 127364 (24)
> writev count: 0 (std 0)
> dup count: 36173 (std 169)
> nodup count: 1016840 (std 190)
>
> Although the improvement wasn't obvious when the network wan GbE, introducing
> writev may be worthwhile when we focus on faster networks like InfiniBand/10GE.
>
> I agree that separating this optimization from the main logic of Kemari since
> this modification must be done widely and carefully at the same time.
>    
Okay.  It looks like it's clear that it's a win so let's split it out of 
the main series and we'll treat it separately.  I imagine we'll see even 
more positive results on 10 gbit and particularly if we move migration 
out into a separate thread.
Regards,
Anthony Liguori
> Thanks,
>
> Yoshi
>    
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops().
  2010-05-03 12:05             ` Anthony Liguori
@ 2010-05-03 15:36               ` Yoshiaki Tamura
  2010-05-03 16:07                 ` Anthony Liguori
  0 siblings, 1 reply; 74+ messages in thread
From: Yoshiaki Tamura @ 2010-05-03 15:36 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: ohmura.kei, kvm, mtosatti, Anthony Liguori, qemu-devel,
	yoshikawa.takuya, Avi Kivity
2010/5/3 Anthony Liguori <aliguori@linux.vnet.ibm.com>:
> On 05/03/2010 04:32 AM, Yoshiaki Tamura wrote:
>>
>> 2010/4/23 Avi Kivity<avi@redhat.com>:
>>
>>>
>>> On 04/23/2010 04:22 PM, Anthony Liguori wrote:
>>>
>>>>>
>>>>> I currently don't have data, but I'll prepare it.
>>>>> There were two things I wanted to avoid.
>>>>>
>>>>> 1. Pages to be copied to QEMUFile buf through qemu_put_buffer.
>>>>> 2. Calling write() everytime even when we want to send multiple pages
>>>>> at
>>>>> once.
>>>>>
>>>>> I think 2 may be neglectable.
>>>>> But 1 seems to be problematic if we want make to the latency as small
>>>>> as
>>>>> possible, no?
>>>>>
>>>>
>>>> Copying often has strange CPU characteristics depending on whether the
>>>> data is already in cache.  It's better to drive these sort of
>>>> optimizations
>>>> through performance measurement because changes are not always obvious.
>>>>
>>>
>>> Copying always introduces more cache pollution, so even if the data is in
>>> the cache, it is worthwhile (not disagreeing with the need to measure).
>>>
>>
>> Anthony,
>>
>> I measure how long it takes to send all guest pages during migration, and
>> I
>> would like to share the information in this message.  For convenience,
>> I modified
>> the code to do migration not "live migration" which means buffered file is
>> not
>> used here.
>>
>> In summary, the performance improvement using writev instead of write/send
>> when
>> we used GbE seems to be neglectable, however, when the underlying network
>> was
>> fast (InfiniBand with IPoIB in this case), writev performed 17% faster
>> than
>> write/send, and therefore, it may be worthwhile to introduce vectors.
>>
>> Since QEMU compresses pages, I copied a junk file to tmpfs to dirty pages
>> to let
>> QEMU to transfer fine number of pages.  After setting up the guest, I used
>> cpu_get_real_ticks() to measure the time during the while loop calling
>> ram_save_block() in ram_save_live().  I removed the qemu_file_rate_limit()
>> to
>> disable the function of buffered file, and all of the pages would be
>> transfered
>> at the first round.
>>
>> I measure 10 times for each, and took average and standard deviation.
>> Considering the results, I think the trial number was enough.  In addition
>> to
>> time duration, number of writev/write and number of pages which were
>> compressed
>> (dup)/not compressed (nodup) are demonstrated.
>>
>> Test Environment:
>> CPU: 2x Intel Xeon Dual Core 3GHz
>> Mem size: 6GB
>> Network: GbE, InfiniBand (IPoIB)
>>
>> Host OS: Fedora 11 (kernel 2.6.34-rc1)
>> Guest OS: Fedora 11 (kernel 2.6.33)
>> Guest Mem size: 512MB
>>
>> * GbE writev
>> time (sec): 35.732 (std 0.002)
>> write count: 4 (std 0)
>> writev count: 8269 (std 1)
>> dup count: 36157 (std 124)
>> nodup count: 1016808 (std 147)
>>
>> * GbE write
>> time (sec): 35.780 (std 0.164)
>> write count: 127367 (21)
>> writev count: 0 (std 0)
>> dup count: 36134 (std 108)
>> nodup count: 1016853 (std 165)
>>
>> * IPoIB writev
>> time (sec): 13.889 (std 0.155)
>> write count: 4 (std 0)
>> writev count: 8267 (std 1)
>> dup count: 36147 (std 105)
>> nodup count: 1016838 (std 111)
>>
>> * IPoIB write
>> time (sec): 16.777 (std 0.239)
>> write count: 127364 (24)
>> writev count: 0 (std 0)
>> dup count: 36173 (std 169)
>> nodup count: 1016840 (std 190)
>>
>> Although the improvement wasn't obvious when the network wan GbE,
>> introducing
>> writev may be worthwhile when we focus on faster networks like
>> InfiniBand/10GE.
>>
>> I agree that separating this optimization from the main logic of Kemari
>> since
>> this modification must be done widely and carefully at the same time.
>>
>
> Okay.  It looks like it's clear that it's a win so let's split it out of the
> main series and we'll treat it separately.  I imagine we'll see even more
> positive results on 10 gbit and particularly if we move migration out into a
> separate thread.
Great!
I also wanted to test with 10GE but I'm physically away from my office
now, and can't set up the test environment.  I'll measure the numbers
w/ 10GE next week.
BTW, I was thinking to write a patch to separate threads for both
sender and receiver of migration.  Kemari especially needs a separate
thread receiver, so that monitor can accepts commands from other HA
tools.  Is someone already working on this?  If not, I would add it to
my task list :-)
Thanks,
Yoshi
>
> Regards,
>
> Anthony Liguori
>
>> Thanks,
>>
>> Yoshi
^ permalink raw reply	[flat|nested] 74+ messages in thread
* [Qemu-devel] Re: [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops().
  2010-05-03 15:36               ` Yoshiaki Tamura
@ 2010-05-03 16:07                 ` Anthony Liguori
  0 siblings, 0 replies; 74+ messages in thread
From: Anthony Liguori @ 2010-05-03 16:07 UTC (permalink / raw)
  To: Yoshiaki Tamura
  Cc: ohmura.kei, kvm, mtosatti, Anthony Liguori, qemu-devel,
	yoshikawa.takuya, Avi Kivity
On 05/03/2010 10:36 AM, Yoshiaki Tamura wrote:
>
> Great!
> I also wanted to test with 10GE but I'm physically away from my office
> now, and can't set up the test environment.  I'll measure the numbers
> w/ 10GE next week.
>
> BTW, I was thinking to write a patch to separate threads for both
> sender and receiver of migration.  Kemari especially needs a separate
> thread receiver, so that monitor can accepts commands from other HA
> tools.  Is someone already working on this?  If not, I would add it to
> my task list :-)
>    
So far, no one (to my knowledge at least), is working on this.
Regards,
Anthony Liguori
> Thanks,
>
> Yoshi
>
>    
>> Regards,
>>
>> Anthony Liguori
>>
>>      
>>> Thanks,
>>>
>>> Yoshi
>>>        
^ permalink raw reply	[flat|nested] 74+ messages in thread
end of thread, other threads:[~2010-05-03 16:07 UTC | newest]
Thread overview: 74+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-04-21  5:57 [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Yoshiaki Tamura
2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 01/20] Modify DIRTY_FLAG value and introduce DIRTY_IDX to use as indexes of bit-based phys_ram_dirty Yoshiaki Tamura
2010-04-22 19:26   ` [Qemu-devel] " Anthony Liguori
2010-04-23  2:09     ` Yoshiaki Tamura
2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 02/20] Introduce cpu_physical_memory_get_dirty_range() Yoshiaki Tamura
2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 03/20] Use cpu_physical_memory_set_dirty_range() to update phys_ram_dirty Yoshiaki Tamura
2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 04/20] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer() Yoshiaki Tamura
2010-04-21  8:03   ` [Qemu-devel] " Stefan Hajnoczi
2010-04-21  8:27     ` Yoshiaki Tamura
2010-04-23  9:53   ` Avi Kivity
2010-04-23  9:59     ` Yoshiaki Tamura
2010-04-23 13:14       ` Avi Kivity
2010-04-26 10:43         ` Yoshiaki Tamura
2010-04-23 13:26     ` Anthony Liguori
2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops() Yoshiaki Tamura
2010-04-22 19:28   ` [Qemu-devel] " Anthony Liguori
2010-04-23  3:37     ` Yoshiaki Tamura
2010-04-23 13:22       ` Anthony Liguori
2010-04-23 13:48         ` Avi Kivity
2010-05-03  9:32           ` Yoshiaki Tamura
2010-05-03 12:05             ` Anthony Liguori
2010-05-03 15:36               ` Yoshiaki Tamura
2010-05-03 16:07                 ` Anthony Liguori
2010-04-26 10:43         ` Yoshiaki Tamura
2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 06/20] Introduce iovec util functions, qemu_iovec_to_vector() and qemu_iovec_to_size() Yoshiaki Tamura
2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 07/20] Introduce qemu_put_vector() and qemu_put_vector_prepare() to use put_vector() in QEMUFile Yoshiaki Tamura
2010-04-22 19:29   ` [Qemu-devel] " Anthony Liguori
2010-04-23  4:02     ` Yoshiaki Tamura
2010-04-23 13:23       ` Anthony Liguori
2010-04-26 10:43         ` Yoshiaki Tamura
2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 08/20] Introduce RAMSaveIO and use cpu_physical_memory_get_dirty_range() to check multiple dirty pages Yoshiaki Tamura
2010-04-22 19:31   ` [Qemu-devel] " Anthony Liguori
2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 09/20] Introduce writev and read to FdMigrationState Yoshiaki Tamura
2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 10/20] Introduce skip_header parameter to qemu_loadvm_state() so that it can be called iteratively without reading the header Yoshiaki Tamura
2010-04-22 19:34   ` [Qemu-devel] " Anthony Liguori
2010-04-23  4:25     ` Yoshiaki Tamura
2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 11/20] Introduce some socket util functions Yoshiaki Tamura
2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 12/20] Introduce fault tolerant VM transaction QEMUFile and ft_mode Yoshiaki Tamura
2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 13/20] Introduce util functions to control ft_transaction from savevm layer Yoshiaki Tamura
2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 14/20] Upgrade QEMU_FILE_VERSION from 3 to 4, and introduce qemu_savevm_state_all() Yoshiaki Tamura
2010-04-22 19:37   ` [Qemu-devel] " Anthony Liguori
2010-04-23  3:29     ` Yoshiaki Tamura
2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 15/20] Introduce FT mode support to configure Yoshiaki Tamura
2010-04-22 19:38   ` [Qemu-devel] " Anthony Liguori
2010-04-23  3:09     ` Yoshiaki Tamura
2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 16/20] Introduce event_tap fucntions and ft_tranx_ready() Yoshiaki Tamura
2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 17/20] Modify migrate_fd_put_ready() when ft_mode is on Yoshiaki Tamura
2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 18/20] Modify tcp_accept_incoming_migration() to handle ft_mode, and add a hack not to close fd when ft_mode is enabled Yoshiaki Tamura
2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 19/20] Insert do_event_tap() to virtio-{blk, net}, comment out assert() on cpu_single_env temporally Yoshiaki Tamura
2010-04-22 19:39   ` [Qemu-devel] " Anthony Liguori
2010-04-23  4:51     ` Yoshiaki Tamura
2010-04-21  5:57 ` [Qemu-devel] [RFC PATCH 20/20] Introduce -k option to enable FT migration mode (Kemari) Yoshiaki Tamura
2010-04-22  8:58 ` [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1 Dor Laor
2010-04-22 10:35   ` Yoshiaki Tamura
2010-04-22 11:36     ` Takuya Yoshikawa
2010-04-22 12:35       ` Yoshiaki Tamura
2010-04-22 12:19     ` Dor Laor
2010-04-22 13:16       ` Yoshiaki Tamura
2010-04-22 20:33         ` Anthony Liguori
2010-04-23  1:53           ` Yoshiaki Tamura
2010-04-23 13:20             ` Anthony Liguori
2010-04-26 10:44               ` Yoshiaki Tamura
2010-04-22 20:38         ` Dor Laor
2010-04-23  5:17           ` Yoshiaki Tamura
2010-04-23  7:36             ` Fernando Luis Vázquez Cao
2010-04-25 21:52               ` Dor Laor
2010-04-22 16:15     ` Jamie Lokier
2010-04-23  0:20       ` Yoshiaki Tamura
2010-04-23 15:07         ` Jamie Lokier
2010-04-22 19:42 ` [Qemu-devel] " Anthony Liguori
2010-04-23  0:45   ` Yoshiaki Tamura
2010-04-23 13:10     ` Anthony Liguori
2010-04-23 13:24 ` Avi Kivity
2010-04-26 10:44   ` Yoshiaki Tamura
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).