[Qemu-devel] [PATCHv5 00/10] buffer_is_zero / migration optimizations

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [Qemu-devel] [PATCHv5 00/10] buffer_is_zero / migration optimizations
@ 2013-03-26  9:58 Peter Lieven
  2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 01/10] move vector definitions to qemu-common.h Peter Lieven
                   ` (10 more replies)
  0 siblings, 11 replies; 31+ messages in thread
From: Peter Lieven @ 2013-03-26  9:58 UTC (permalink / raw)
  To: qemu-devel
  Cc: quintela, Stefan Hajnoczi, Peter Lieven, Orit Wasserman,
	Paolo Bonzini

this is v5 of my patch series with various optimizations in
zero buffer checking and migration tweaks.

thanks especially to Eric Blake, Orit Wassermann and Paolo Bonzini
for reviewing.

v5:
- move zero splat vector to a different patch
- fix indentation of can_user_buffer_find_nonzero_offset()
- do not unroll the first loop in buffer_find_nonzero_offset()
  to optimize it for zero page checking
- use an older unrolled version of find_next_bit() without
  SIMD instruction as there is no evidence that the vectorized
  version is better if not even worse and the code is easier
  to understand.
- added a word in the commit message of patch 8 
  about the skipped pages field in QMP MigrationStats.
- fixed the order of key-value pairs of MigrationStats in
  qapi-schema.json
- updated info about the performance benefit of is_zero_page()
  to the latest benchmark results in the commit message.

v4:
- do not inline buffer_find_nonzero_offset()
- inline can_usebuffer_find_nonzero_offset() correctly
- readd asserts in buffer_find_nonzero_offset() as profiling
  shows they do not hurt.
- change last occurences of scalar 8 by 
  BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
- avoid deferencing p already in patch 5 where we
  know that the page (p) is zero
- explicitly set bytes_sent = 0 if we skip a zero page.
  bytes_sent was 0 before, but it was not obvious.
- add accounting information for skipped zero pages
- fix errors reported by checkpatch.pl

v3:
- remove asserts, inline functions and add a check
  function if buffer_find_nonzero_offset() can be used.
- use above check function in buffer_is_zero() and
  find_next_bit().
- use buffer_is_nonzero_offset() directly to find
  zero pages. we know that all requirements are met
  for memory pages.
- fix C89 violation in buffer_is_zero().
- avoid derefencing p in ram_save_block() if we already
  know the page is zero.
- fix initialization of last_offset in reset_ram_globals().
- avoid skipping pages with offset == 0 in bulk stage in
  migration_bitmap_find_and_reset_dirty().
- compared to v1 check for zero pages also after bulk
  ram migration as there are guests (e.g. Windows) which
  zero out large amount of memory while running.

v2:
- fix description, add trivial zero check and add asserts 
  to buffer_find_nonzero_offset.
- add a constant for the unroll factor of buffer_find_nonzero_offset
- replace is_dup_page() by buffer_is_zero()
- added test results to xbzrle patch
- optimize descriptions

Peter Lieven (10):
  move vector definitions to qemu-common.h
  add a zero splat vector to qemu-common.h
  cutils: add a function to find non-zero content in a buffer
  buffer_is_zero: use vector optimizations if possible
  bitops: unroll while loop in find_next_bit()
  migration: search for zero instead of dup pages
  migration: add an indicator for bulk state of ram migration
  migration: do not sent zero pages in bulk stage
  migration: do not search dirty pages in bulk stage
  migration: use XBZRLE only after bulk stage

 arch_init.c                   |   74 +++++++++++++++++++----------------------
 hmp.c                         |    2 ++
 include/migration/migration.h |    2 ++
 include/qemu-common.h         |   37 +++++++++++++++++++++
 migration.c                   |    3 +-
 qapi-schema.json              |    8 +++--
 qmp-commands.hx               |    3 +-
 util/bitops.c                 |   18 +++++++++-
 util/cutils.c                 |   60 +++++++++++++++++++++++++++++++++
 9 files changed, 162 insertions(+), 45 deletions(-)

-- 
1.7.9.5

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Qemu-devel] [PATCHv5 01/10] move vector definitions to qemu-common.h
  2013-03-26  9:58 [Qemu-devel] [PATCHv5 00/10] buffer_is_zero / migration optimizations Peter Lieven
@ 2013-03-26  9:58 ` Peter Lieven
  2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 02/10] add a zero splat vector " Peter Lieven
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 31+ messages in thread
From: Peter Lieven @ 2013-03-26  9:58 UTC (permalink / raw)
  To: qemu-devel
  Cc: quintela, Stefan Hajnoczi, Peter Lieven, Orit Wasserman,
	Paolo Bonzini

vector optimizations will now be used at various places
not just in is_dup_page() in arch_init.c

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 arch_init.c           |   20 --------------------
 include/qemu-common.h |   21 +++++++++++++++++++++
 2 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index e8ade9e..35974c2 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -116,26 +116,6 @@ const uint32_t arch_type = QEMU_ARCH;
 #define RAM_SAVE_FLAG_CONTINUE 0x20
 #define RAM_SAVE_FLAG_XBZRLE   0x40
 
-#ifdef __ALTIVEC__
-#include <altivec.h>
-#define VECTYPE        vector unsigned char
-#define SPLAT(p)       vec_splat(vec_ld(0, p), 0)
-#define ALL_EQ(v1, v2) vec_all_eq(v1, v2)
-/* altivec.h may redefine the bool macro as vector type.
- * Reset it to POSIX semantics. */
-#undef bool
-#define bool _Bool
-#elif defined __SSE2__
-#include <emmintrin.h>
-#define VECTYPE        __m128i
-#define SPLAT(p)       _mm_set1_epi8(*(p))
-#define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
-#else
-#define VECTYPE        unsigned long
-#define SPLAT(p)       (*(p) * (~0UL / 255))
-#define ALL_EQ(v1, v2) ((v1) == (v2))
-#endif
-
 
 static struct defconfig_file {
     const char *filename;
diff --git a/include/qemu-common.h b/include/qemu-common.h
index 2371132..d7ad3a7 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -448,4 +448,25 @@ int uleb128_decode_small(const uint8_t *in, uint32_t *n);
 
 void hexdump(const char *buf, FILE *fp, const char *prefix, size_t size);
 
+/* vector definitions */
+#ifdef __ALTIVEC__
+#include <altivec.h>
+#define VECTYPE        vector unsigned char
+#define SPLAT(p)       vec_splat(vec_ld(0, p), 0)
+#define ALL_EQ(v1, v2) vec_all_eq(v1, v2)
+/* altivec.h may redefine the bool macro as vector type.
+ * Reset it to POSIX semantics. */
+#undef bool
+#define bool _Bool
+#elif defined __SSE2__
+#include <emmintrin.h>
+#define VECTYPE        __m128i
+#define SPLAT(p)       _mm_set1_epi8(*(p))
+#define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
+#else
+#define VECTYPE        unsigned long
+#define SPLAT(p)       (*(p) * (~0UL / 255))
+#define ALL_EQ(v1, v2) ((v1) == (v2))
+#endif
+
 #endif
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [Qemu-devel] [PATCHv5 02/10] add a zero splat vector to qemu-common.h
  2013-03-26  9:58 [Qemu-devel] [PATCHv5 00/10] buffer_is_zero / migration optimizations Peter Lieven
  2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 01/10] move vector definitions to qemu-common.h Peter Lieven
@ 2013-03-26  9:58 ` Peter Lieven
  2013-03-26 10:14   ` Paolo Bonzini
  2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 03/10] cutils: add a function to find non-zero content in a buffer Peter Lieven
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Peter Lieven @ 2013-03-26  9:58 UTC (permalink / raw)
  To: qemu-devel
  Cc: quintela, Stefan Hajnoczi, Peter Lieven, Orit Wasserman,
	Paolo Bonzini

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 include/qemu-common.h |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/qemu-common.h b/include/qemu-common.h
index d7ad3a7..9022646 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -453,6 +453,7 @@ void hexdump(const char *buf, FILE *fp, const char *prefix, size_t size);
 #include <altivec.h>
 #define VECTYPE        vector unsigned char
 #define SPLAT(p)       vec_splat(vec_ld(0, p), 0)
+#define ZERO_SPLAT     vec_splat(vec_ld(0, 0), 0)
 #define ALL_EQ(v1, v2) vec_all_eq(v1, v2)
 /* altivec.h may redefine the bool macro as vector type.
  * Reset it to POSIX semantics. */
@@ -462,10 +463,12 @@ void hexdump(const char *buf, FILE *fp, const char *prefix, size_t size);
 #include <emmintrin.h>
 #define VECTYPE        __m128i
 #define SPLAT(p)       _mm_set1_epi8(*(p))
+#define ZERO_SPLAT     _mm_setzero_si128()
 #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
 #else
 #define VECTYPE        unsigned long
 #define SPLAT(p)       (*(p) * (~0UL / 255))
+#define ZERO_SPLAT     0x0UL
 #define ALL_EQ(v1, v2) ((v1) == (v2))
 #endif
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [Qemu-devel] [PATCHv5 03/10] cutils: add a function to find non-zero content in a buffer
  2013-03-26  9:58 [Qemu-devel] [PATCHv5 00/10] buffer_is_zero / migration optimizations Peter Lieven
  2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 01/10] move vector definitions to qemu-common.h Peter Lieven
  2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 02/10] add a zero splat vector " Peter Lieven
@ 2013-03-26  9:58 ` Peter Lieven
  2013-03-26 10:38   ` Juan Quintela
  2013-03-26 10:41   ` Juan Quintela
  2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 04/10] buffer_is_zero: use vector optimizations if possible Peter Lieven
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 31+ messages in thread
From: Peter Lieven @ 2013-03-26  9:58 UTC (permalink / raw)
  To: qemu-devel
  Cc: quintela, Stefan Hajnoczi, Peter Lieven, Orit Wasserman,
	Paolo Bonzini

this adds buffer_find_nonzero_offset() which is a SSE2/Altivec
optimized function that searches for non-zero content in a
buffer.

the function starts full unrolling only after the first few chunks have
been checked one by one. analyzing real memory page data has revealed
that non-zero pages are non-zero within the first 256-512 bits in
most cases. as this function is also heavily used to check for zero memory
pages this tweak has been made to avoid the high setup costs of the fully
unrolled check for non-zero pages.

due to the optimizations used in the function there are restrictions
on buffer address and search length. the function
can_use_buffer_find_nonzero_content() can be used to check if
the function can be used safely.

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 include/qemu-common.h |   13 ++++++++++++
 util/cutils.c         |   55 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 68 insertions(+)

diff --git a/include/qemu-common.h b/include/qemu-common.h
index 9022646..7c7c244 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -472,4 +472,17 @@ void hexdump(const char *buf, FILE *fp, const char *prefix, size_t size);
 #define ALL_EQ(v1, v2) ((v1) == (v2))
 #endif

+#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
+static inline bool
+can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
+{
+    if (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
+               * sizeof(VECTYPE)) == 0
+        && ((uintptr_t) buf) % sizeof(VECTYPE) == 0) {
+        return true;
+    }
+    return false;
+}
+size_t buffer_find_nonzero_offset(const void *buf, size_t len);
+
 #endif
diff --git a/util/cutils.c b/util/cutils.c
index 1439da4..0314a18 100644
--- a/util/cutils.c
+++ b/util/cutils.c
@@ -143,6 +143,61 @@ int qemu_fdatasync(int fd)
 }

 /*
+ * Searches for an area with non-zero content in a buffer
+ *
+ * Attention! The len must be a multiple of
+ * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE)
+ * and addr must be a multiple of sizeof(VECTYPE) due to
+ * restriction of optimizations in this function.
+ *
+ * can_use_buffer_find_nonzero_offset() can be used to check
+ * these requirements.
+ *
+ * The return value is the offset of the non-zero area rounded
+ * down to a multiple of sizeof(VECTYPE) for the first
+ * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR chunks and down to
+ * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE)
+ * afterwards.
+ *
+ * If the buffer is all zero the return value is equal to len.
+ */
+
+size_t buffer_find_nonzero_offset(const void *buf, size_t len)
+{
+    VECTYPE *p = (VECTYPE *)buf;
+    VECTYPE zero = ZERO_SPLAT;
+    size_t i;
+
+    assert(can_use_buffer_find_nonzero_offset(buf, len));
+
+    if (!len) {
+        return 0;
+    }
+
+    for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
+        if (!ALL_EQ(p[i], zero)) {
+            return i * sizeof(VECTYPE);
+        }
+    }
+
+    for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
+         i < len / sizeof(VECTYPE);
+         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
+        VECTYPE tmp0 = p[i + 0] | p[i + 1];
+        VECTYPE tmp1 = p[i + 2] | p[i + 3];
+        VECTYPE tmp2 = p[i + 4] | p[i + 5];
+        VECTYPE tmp3 = p[i + 6] | p[i + 7];
+        VECTYPE tmp01 = tmp0 | tmp1;
+        VECTYPE tmp23 = tmp2 | tmp3;
+        if (!ALL_EQ(tmp01 | tmp23, zero)) {
+            break;
+        }
+    }
+
+    return i * sizeof(VECTYPE);
+}
+
+/*
  * Checks if a buffer is all zeroes
  *
  * Attention! The len must be a multiple of 4 * sizeof(long) due to
--
1.7.9.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [Qemu-devel] [PATCHv5 04/10] buffer_is_zero: use vector optimizations if possible
  2013-03-26  9:58 [Qemu-devel] [PATCHv5 00/10] buffer_is_zero / migration optimizations Peter Lieven
                   ` (2 preceding siblings ...)
  2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 03/10] cutils: add a function to find non-zero content in a buffer Peter Lieven
@ 2013-03-26  9:58 ` Peter Lieven
  2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 05/10] bitops: unroll while loop in find_next_bit() Peter Lieven
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 31+ messages in thread
From: Peter Lieven @ 2013-03-26  9:58 UTC (permalink / raw)
  To: qemu-devel
  Cc: quintela, Stefan Hajnoczi, Peter Lieven, Orit Wasserman,
	Paolo Bonzini

performance gain on SSE2 is approx. 20-25%. altivec
is not tested. performance for unsigned long arithmetic
is unchanged.

Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
---
 util/cutils.c |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/util/cutils.c b/util/cutils.c
index 0314a18..daf032a 100644
--- a/util/cutils.c
+++ b/util/cutils.c
@@ -215,6 +215,11 @@ bool buffer_is_zero(const void *buf, size_t len)
     long d0, d1, d2, d3;
     const long * const data = buf;
 
+    /* use vector optimized zero check if possible */
+    if (can_use_buffer_find_nonzero_offset(buf, len)) {
+        return buffer_find_nonzero_offset(buf, len) == len;
+    }
+
     assert(len % (4 * sizeof(long)) == 0);
     len /= sizeof(long);
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [Qemu-devel] [PATCHv5 05/10] bitops: unroll while loop in find_next_bit()
  2013-03-26  9:58 [Qemu-devel] [PATCHv5 00/10] buffer_is_zero / migration optimizations Peter Lieven
                   ` (3 preceding siblings ...)
  2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 04/10] buffer_is_zero: use vector optimizations if possible Peter Lieven
@ 2013-03-26  9:58 ` Peter Lieven
  2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 06/10] migration: search for zero instead of dup pages Peter Lieven
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 31+ messages in thread
From: Peter Lieven @ 2013-03-26  9:58 UTC (permalink / raw)
  To: qemu-devel
  Cc: quintela, Stefan Hajnoczi, Peter Lieven, Orit Wasserman,
	Paolo Bonzini

this patch adopts the loop unrolling idea of bitmap_is_zero() to
speed up the skipping of large areas with zeros in find_next_bit().

this routine is extensively used to find dirty pages in
live migration.

testing only the find_next_bit performance on a zeroed bitfield
the loop onrolling decreased executing time by approx. 50% on x86_64.

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 util/bitops.c |   18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/util/bitops.c b/util/bitops.c
index e72237a..227c38b 100644
--- a/util/bitops.c
+++ b/util/bitops.c
@@ -42,7 +42,23 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
         size -= BITS_PER_LONG;
         result += BITS_PER_LONG;
     }
-    while (size & ~(BITS_PER_LONG-1)) {
+    while (size >= 4*BITS_PER_LONG) {
+        unsigned long d1, d2, d3;
+        tmp = *p;
+        d1 = *(p+1);
+        d2 = *(p+2);
+        d3 = *(p+3);
+        if (tmp) {
+            goto found_middle;
+        }
+        if (d1 | d2 | d3) {
+            break;
+        }
+        p += 4;
+        result += 4*BITS_PER_LONG;
+        size -= 4*BITS_PER_LONG;
+    }
+    while (size >= BITS_PER_LONG) {
         if ((tmp = *(p++))) {
             goto found_middle;
         }
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [Qemu-devel] [PATCHv5 06/10] migration: search for zero instead of dup pages
  2013-03-26  9:58 [Qemu-devel] [PATCHv5 00/10] buffer_is_zero / migration optimizations Peter Lieven
                   ` (4 preceding siblings ...)
  2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 05/10] bitops: unroll while loop in find_next_bit() Peter Lieven
@ 2013-03-26  9:58 ` Peter Lieven
  2013-04-05 19:23   ` Kevin Wolf
  2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 07/10] migration: add an indicator for bulk state of ram migration Peter Lieven
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Peter Lieven @ 2013-03-26  9:58 UTC (permalink / raw)
  To: qemu-devel
  Cc: quintela, Stefan Hajnoczi, Peter Lieven, Orit Wasserman,
	Paolo Bonzini

virtually all dup pages are zero pages. remove
the special is_dup_page() function and use the
optimized buffer_find_nonzero_offset() function
instead.

here buffer_find_nonzero_offset() is used directly
to avoid the unnecssary additional checks in
buffer_is_zero().

raw performace gain checking 1 GByte zeroed memory
over is_dup_page() is approx. 10-12% with SSE2
and 8-10% with unsigned long arithmedtic.

Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 arch_init.c |   21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 35974c2..dd5deff 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -146,19 +146,10 @@ int qemu_read_default_config_files(bool userconfig)
     return 0;
 }
 
-static int is_dup_page(uint8_t *page)
+static inline bool is_zero_page(uint8_t *p)
 {
-    VECTYPE *p = (VECTYPE *)page;
-    VECTYPE val = SPLAT(page);
-    int i;
-
-    for (i = 0; i < TARGET_PAGE_SIZE / sizeof(VECTYPE); i++) {
-        if (!ALL_EQ(val, p[i])) {
-            return 0;
-        }
-    }
-
-    return 1;
+    return buffer_find_nonzero_offset(p, TARGET_PAGE_SIZE) ==
+        TARGET_PAGE_SIZE;
 }
 
 /* struct contains XBZRLE cache and a static page
@@ -445,12 +436,12 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
 
             /* In doubt sent page as normal */
             bytes_sent = -1;
-            if (is_dup_page(p)) {
+            if (is_zero_page(p)) {
                 acct_info.dup_pages++;
                 bytes_sent = save_block_hdr(f, block, offset, cont,
                                             RAM_SAVE_FLAG_COMPRESS);
-                qemu_put_byte(f, *p);
-                bytes_sent += 1;
+                qemu_put_byte(f, 0);
+                bytes_sent++;
             } else if (migrate_use_xbzrle()) {
                 current_addr = block->offset + offset;
                 bytes_sent = save_xbzrle_page(f, p, current_addr, block,
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [Qemu-devel] [PATCHv5 07/10] migration: add an indicator for bulk state of ram migration
  2013-03-26  9:58 [Qemu-devel] [PATCHv5 00/10] buffer_is_zero / migration optimizations Peter Lieven
                   ` (5 preceding siblings ...)
  2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 06/10] migration: search for zero instead of dup pages Peter Lieven
@ 2013-03-26  9:58 ` Peter Lieven
  2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 08/10] migration: do not sent zero pages in bulk stage Peter Lieven
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 31+ messages in thread
From: Peter Lieven @ 2013-03-26  9:58 UTC (permalink / raw)
  To: qemu-devel
  Cc: quintela, Stefan Hajnoczi, Peter Lieven, Orit Wasserman,
	Paolo Bonzini

the first round of ram transfer is special since all pages
are dirty and thus all memory pages are transferred to
the target. this patch adds a boolean variable to track
this stage.

Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
---
 arch_init.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch_init.c b/arch_init.c
index dd5deff..1291bd2 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -319,6 +319,7 @@ static ram_addr_t last_offset;
 static unsigned long *migration_bitmap;
 static uint64_t migration_dirty_pages;
 static uint32_t last_version;
+static bool ram_bulk_stage;
 
 static inline
 ram_addr_t migration_bitmap_find_and_reset_dirty(MemoryRegion *mr,
@@ -426,6 +427,7 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
             if (!block) {
                 block = QTAILQ_FIRST(&ram_list.blocks);
                 complete_round = true;
+                ram_bulk_stage = false;
             }
         } else {
             uint8_t *p;
@@ -529,6 +531,7 @@ static void reset_ram_globals(void)
     last_sent_block = NULL;
     last_offset = 0;
     last_version = ram_list.version;
+    ram_bulk_stage = true;
 }
 
 #define MAX_WAIT 50 /* ms, half buffered_file limit */
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [Qemu-devel] [PATCHv5 08/10] migration: do not sent zero pages in bulk stage
  2013-03-26  9:58 [Qemu-devel] [PATCHv5 00/10] buffer_is_zero / migration optimizations Peter Lieven
                   ` (6 preceding siblings ...)
  2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 07/10] migration: add an indicator for bulk state of ram migration Peter Lieven
@ 2013-03-26  9:58 ` Peter Lieven
  2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 09/10] migration: do not search dirty " Peter Lieven
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 31+ messages in thread
From: Peter Lieven @ 2013-03-26  9:58 UTC (permalink / raw)
  To: qemu-devel
  Cc: quintela, Stefan Hajnoczi, Peter Lieven, Orit Wasserman,
	Paolo Bonzini

during bulk stage of ram migration if a page is a
zero page do not send it at all.
the memory at the destination reads as zero anyway.

even if there is an madvise with QEMU_MADV_DONTNEED
at the target upon receipt of a zero page I have observed
that the target starts swapping if the memory is overcommitted.
it seems that the pages are dropped asynchronously.

this patch also updates QMP to return the number of
skipped pages in MigrationStats.

Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 arch_init.c                   |   24 ++++++++++++++++++++----
 hmp.c                         |    2 ++
 include/migration/migration.h |    2 ++
 migration.c                   |    3 ++-
 qapi-schema.json              |    8 +++++---
 qmp-commands.hx               |    3 ++-
 6 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 1291bd2..3a0d02e 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -183,6 +183,7 @@ int64_t xbzrle_cache_resize(int64_t new_size)
 /* accounting for migration statistics */
 typedef struct AccountingInfo {
     uint64_t dup_pages;
+    uint64_t skipped_pages;
     uint64_t norm_pages;
     uint64_t iterations;
     uint64_t xbzrle_bytes;
@@ -208,6 +209,16 @@ uint64_t dup_mig_pages_transferred(void)
     return acct_info.dup_pages;
 }
 
+uint64_t skipped_mig_bytes_transferred(void)
+{
+    return acct_info.skipped_pages * TARGET_PAGE_SIZE;
+}
+
+uint64_t skipped_mig_pages_transferred(void)
+{
+    return acct_info.skipped_pages;
+}
+
 uint64_t norm_mig_bytes_transferred(void)
 {
     return acct_info.norm_pages * TARGET_PAGE_SIZE;
@@ -440,10 +451,15 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
             bytes_sent = -1;
             if (is_zero_page(p)) {
                 acct_info.dup_pages++;
-                bytes_sent = save_block_hdr(f, block, offset, cont,
-                                            RAM_SAVE_FLAG_COMPRESS);
-                qemu_put_byte(f, 0);
-                bytes_sent++;
+                if (!ram_bulk_stage) {
+                    bytes_sent = save_block_hdr(f, block, offset, cont,
+                                                RAM_SAVE_FLAG_COMPRESS);
+                    qemu_put_byte(f, 0);
+                    bytes_sent++;
+                } else {
+                    acct_info.skipped_pages++;
+                    bytes_sent = 0;
+                }
             } else if (migrate_use_xbzrle()) {
                 current_addr = block->offset + offset;
                 bytes_sent = save_xbzrle_page(f, p, current_addr, block,
diff --git a/hmp.c b/hmp.c
index b0a861c..e3e833e 100644
--- a/hmp.c
+++ b/hmp.c
@@ -173,6 +173,8 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
                        info->ram->total >> 10);
         monitor_printf(mon, "duplicate: %" PRIu64 " pages\n",
                        info->ram->duplicate);
+        monitor_printf(mon, "skipped: %" PRIu64 " pages\n",
+                       info->ram->skipped);
         monitor_printf(mon, "normal: %" PRIu64 " pages\n",
                        info->ram->normal);
         monitor_printf(mon, "normal bytes: %" PRIu64 " kbytes\n",
diff --git a/include/migration/migration.h b/include/migration/migration.h
index bb617fd..e2acec6 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -96,6 +96,8 @@ extern SaveVMHandlers savevm_ram_handlers;
 
 uint64_t dup_mig_bytes_transferred(void);
 uint64_t dup_mig_pages_transferred(void);
+uint64_t skipped_mig_bytes_transferred(void);
+uint64_t skipped_mig_pages_transferred(void);
 uint64_t norm_mig_bytes_transferred(void);
 uint64_t norm_mig_pages_transferred(void);
 uint64_t xbzrle_mig_bytes_transferred(void);
diff --git a/migration.c b/migration.c
index 185d112..7fb2147 100644
--- a/migration.c
+++ b/migration.c
@@ -197,11 +197,11 @@ MigrationInfo *qmp_query_migrate(Error **errp)
         info->ram->remaining = ram_bytes_remaining();
         info->ram->total = ram_bytes_total();
         info->ram->duplicate = dup_mig_pages_transferred();
+        info->ram->skipped = skipped_mig_pages_transferred();
         info->ram->normal = norm_mig_pages_transferred();
         info->ram->normal_bytes = norm_mig_bytes_transferred();
         info->ram->dirty_pages_rate = s->dirty_pages_rate;
 
-
         if (blk_mig_active()) {
             info->has_disk = true;
             info->disk = g_malloc0(sizeof(*info->disk));
@@ -227,6 +227,7 @@ MigrationInfo *qmp_query_migrate(Error **errp)
         info->ram->remaining = 0;
         info->ram->total = ram_bytes_total();
         info->ram->duplicate = dup_mig_pages_transferred();
+        info->ram->skipped = skipped_mig_pages_transferred();
         info->ram->normal = norm_mig_pages_transferred();
         info->ram->normal_bytes = norm_mig_bytes_transferred();
         break;
diff --git a/qapi-schema.json b/qapi-schema.json
index 088f4e1..d6a8812 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -496,7 +496,9 @@
 #
 # @total: total amount of bytes involved in the migration process
 #
-# @duplicate: number of duplicate pages (since 1.2)
+# @duplicate: number of duplicate (zero) pages (since 1.2)
+#
+# @skipped: number of skipped zero pages (since 1.5)
 #
 # @normal : number of normal pages (since 1.2)
 #
@@ -509,8 +511,8 @@
 ##
 { 'type': 'MigrationStats',
   'data': {'transferred': 'int', 'remaining': 'int', 'total': 'int' ,
-           'duplicate': 'int', 'normal': 'int', 'normal-bytes': 'int',
-           'dirty-pages-rate' : 'int' } }
+           'duplicate': 'int', 'skipped': 'int', 'normal': 'int',
+           'normal-bytes': 'int', 'dirty-pages-rate' : 'int' } }
 
 ##
 # @XBZRLECacheStats
diff --git a/qmp-commands.hx b/qmp-commands.hx
index b370060..fed74c6 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -2442,7 +2442,8 @@ The main json-object contains the following:
          - "transferred": amount transferred (json-int)
          - "remaining": amount remaining (json-int)
          - "total": total (json-int)
-         - "duplicate": number of duplicated pages (json-int)
+         - "duplicate": number of duplicated (zero) pages (json-int)
+         - "skipped": number of skipped zero pages (json-int)
          - "normal" : number of normal pages transferred (json-int)
          - "normal-bytes" : number of normal bytes transferred (json-int)
 - "disk": only present if "status" is "active" and it is a block migration,
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [Qemu-devel] [PATCHv5 09/10] migration: do not search dirty pages in bulk stage
  2013-03-26  9:58 [Qemu-devel] [PATCHv5 00/10] buffer_is_zero / migration optimizations Peter Lieven
                   ` (7 preceding siblings ...)
  2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 08/10] migration: do not sent zero pages in bulk stage Peter Lieven
@ 2013-03-26  9:58 ` Peter Lieven
  2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 10/10] migration: use XBZRLE only after " Peter Lieven
  2013-03-26 10:46 ` [Qemu-devel] [PATCHv5 00/10] buffer_is_zero / migration optimizations Juan Quintela
  10 siblings, 0 replies; 31+ messages in thread
From: Peter Lieven @ 2013-03-26  9:58 UTC (permalink / raw)
  To: qemu-devel
  Cc: quintela, Stefan Hajnoczi, Peter Lieven, Orit Wasserman,
	Paolo Bonzini

avoid searching for dirty pages just increment the
page offset. all pages are dirty anyway.

Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
---
 arch_init.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch_init.c b/arch_init.c
index 3a0d02e..a522735 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -340,7 +340,13 @@ ram_addr_t migration_bitmap_find_and_reset_dirty(MemoryRegion *mr,
     unsigned long nr = base + (start >> TARGET_PAGE_BITS);
     unsigned long size = base + (int128_get64(mr->size) >> TARGET_PAGE_BITS);
 
-    unsigned long next = find_next_bit(migration_bitmap, size, nr);
+    unsigned long next;
+
+    if (ram_bulk_stage && nr > base) {
+        next = nr + 1;
+    } else {
+        next = find_next_bit(migration_bitmap, size, nr);
+    }
 
     if (next < size) {
         clear_bit(next, migration_bitmap);
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [Qemu-devel] [PATCHv5 10/10] migration: use XBZRLE only after bulk stage
  2013-03-26  9:58 [Qemu-devel] [PATCHv5 00/10] buffer_is_zero / migration optimizations Peter Lieven
                   ` (8 preceding siblings ...)
  2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 09/10] migration: do not search dirty " Peter Lieven
@ 2013-03-26  9:58 ` Peter Lieven
  2013-03-26 10:46 ` [Qemu-devel] [PATCHv5 00/10] buffer_is_zero / migration optimizations Juan Quintela
  10 siblings, 0 replies; 31+ messages in thread
From: Peter Lieven @ 2013-03-26  9:58 UTC (permalink / raw)
  To: qemu-devel
  Cc: quintela, Stefan Hajnoczi, Peter Lieven, Orit Wasserman,
	Paolo Bonzini

at the beginning of migration all pages are marked dirty and
in the first round a bulk migration of all pages is performed.

currently all these pages are copied to the page cache regardless
of whether they are frequently updated or not. this doesn't make sense
since most of these pages are never transferred again.

this patch changes the XBZRLE transfer to only be used after
the bulk stage has been completed. that means a page is added
to the page cache the second time it is transferred and XBZRLE
can benefit from the third time of transfer.

since the page cache is likely smaller than the number of pages
it's also likely that in the second round the page is missing in the
cache due to collisions in the bulk phase.

on the other hand a lot of unnecessary mallocs, memdups and frees
are saved.

the following results have been taken earlier while executing
the test program from docs/xbzrle.txt. (+) with the patch and (-)
without. (thanks to Eric Blake for reformatting and comments)

+ total time: 22185 milliseconds
- total time: 22410 milliseconds

Shaved 0.3 seconds, better than 1%!

+ downtime: 29 milliseconds
- downtime: 21 milliseconds

Not sure why downtime seemed worse, but probably not the end of the world.

+ transferred ram: 706034 kbytes
- transferred ram: 721318 kbytes

Fewer bytes sent - good.

+ remaining ram: 0 kbytes
- remaining ram: 0 kbytes
+ total ram: 1057216 kbytes
- total ram: 1057216 kbytes
+ duplicate: 108556 pages
- duplicate: 105553 pages
+ normal: 175146 pages
- normal: 179589 pages
+ normal bytes: 700584 kbytes
- normal bytes: 718356 kbytes

Fewer normal bytes...

+ cache size: 67108864 bytes
- cache size: 67108864 bytes
+ xbzrle transferred: 3127 kbytes
- xbzrle transferred: 630 kbytes

...and more compressed pages sent - good.

+ xbzrle pages: 117811 pages
- xbzrle pages: 21527 pages
+ xbzrle cache miss: 18750
- xbzrle cache miss: 179589

And very good improvement on the cache miss rate.

+ xbzrle overflow : 0
- xbzrle overflow : 0

Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
---
 arch_init.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch_init.c b/arch_init.c
index a522735..e1af898 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -466,7 +466,7 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
                     acct_info.skipped_pages++;
                     bytes_sent = 0;
                 }
-            } else if (migrate_use_xbzrle()) {
+            } else if (!ram_bulk_stage && migrate_use_xbzrle()) {
                 current_addr = block->offset + offset;
                 bytes_sent = save_xbzrle_page(f, p, current_addr, block,
                                               offset, cont, last_stage);
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCHv5 02/10] add a zero splat vector to qemu-common.h
  2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 02/10] add a zero splat vector " Peter Lieven
@ 2013-03-26 10:14   ` Paolo Bonzini
  2013-03-26 10:17     ` Peter Lieven
  0 siblings, 1 reply; 31+ messages in thread
From: Paolo Bonzini @ 2013-03-26 10:14 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Orit Wasserman, Stefan Hajnoczi, qemu-devel, quintela

Il 26/03/2013 10:58, Peter Lieven ha scritto:
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  include/qemu-common.h |    3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/include/qemu-common.h b/include/qemu-common.h
> index d7ad3a7..9022646 100644
> --- a/include/qemu-common.h
> +++ b/include/qemu-common.h
> @@ -453,6 +453,7 @@ void hexdump(const char *buf, FILE *fp, const char *prefix, size_t size);
>  #include <altivec.h>
>  #define VECTYPE        vector unsigned char
>  #define SPLAT(p)       vec_splat(vec_ld(0, p), 0)
> +#define ZERO_SPLAT     vec_splat(vec_ld(0, 0), 0)
>  #define ALL_EQ(v1, v2) vec_all_eq(v1, v2)
>  /* altivec.h may redefine the bool macro as vector type.
>   * Reset it to POSIX semantics. */
> @@ -462,10 +463,12 @@ void hexdump(const char *buf, FILE *fp, const char *prefix, size_t size);
>  #include <emmintrin.h>
>  #define VECTYPE        __m128i
>  #define SPLAT(p)       _mm_set1_epi8(*(p))
> +#define ZERO_SPLAT     _mm_setzero_si128()
>  #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
>  #else
>  #define VECTYPE        unsigned long
>  #define SPLAT(p)       (*(p) * (~0UL / 255))
> +#define ZERO_SPLAT     0x0UL
>  #define ALL_EQ(v1, v2) ((v1) == (v2))
>  #endif

C trivia of the day: this can be written simply as

   (VECTYPE) {0}

Yes, it works even with unsigned long. :)  It is a C99 compound literal.

Paolo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCHv5 02/10] add a zero splat vector to qemu-common.h
  2013-03-26 10:14   ` Paolo Bonzini
@ 2013-03-26 10:17     ` Peter Lieven
  2013-03-26 10:17       ` Paolo Bonzini
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Lieven @ 2013-03-26 10:17 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Orit Wasserman, Stefan Hajnoczi, qemu-devel, quintela


Am 26.03.2013 um 11:14 schrieb Paolo Bonzini <pbonzini@redhat.com>:

> Il 26/03/2013 10:58, Peter Lieven ha scritto:
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>> include/qemu-common.h |    3 +++
>> 1 file changed, 3 insertions(+)
>> 
>> diff --git a/include/qemu-common.h b/include/qemu-common.h
>> index d7ad3a7..9022646 100644
>> --- a/include/qemu-common.h
>> +++ b/include/qemu-common.h
>> @@ -453,6 +453,7 @@ void hexdump(const char *buf, FILE *fp, const char *prefix, size_t size);
>> #include <altivec.h>
>> #define VECTYPE        vector unsigned char
>> #define SPLAT(p)       vec_splat(vec_ld(0, p), 0)
>> +#define ZERO_SPLAT     vec_splat(vec_ld(0, 0), 0)
>> #define ALL_EQ(v1, v2) vec_all_eq(v1, v2)
>> /* altivec.h may redefine the bool macro as vector type.
>>  * Reset it to POSIX semantics. */
>> @@ -462,10 +463,12 @@ void hexdump(const char *buf, FILE *fp, const char *prefix, size_t size);
>> #include <emmintrin.h>
>> #define VECTYPE        __m128i
>> #define SPLAT(p)       _mm_set1_epi8(*(p))
>> +#define ZERO_SPLAT     _mm_setzero_si128()
>> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
>> #else
>> #define VECTYPE        unsigned long
>> #define SPLAT(p)       (*(p) * (~0UL / 255))
>> +#define ZERO_SPLAT     0x0UL
>> #define ALL_EQ(v1, v2) ((v1) == (v2))
>> #endif
> 
> C trivia of the day: this can be written simply as
> 
>   (VECTYPE) {0}

you are the first who mentions this ;-)

respin?

Peter

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCHv5 02/10] add a zero splat vector to qemu-common.h
  2013-03-26 10:17     ` Peter Lieven
@ 2013-03-26 10:17       ` Paolo Bonzini
  2013-03-26 10:18         ` Peter Lieven
  0 siblings, 1 reply; 31+ messages in thread
From: Paolo Bonzini @ 2013-03-26 10:17 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Orit Wasserman, Stefan Hajnoczi, qemu-devel, quintela

Il 26/03/2013 11:17, Peter Lieven ha scritto:
>> > 
>> > C trivia of the day: this can be written simply as
>> > 
>> >   (VECTYPE) {0}
> you are the first who mentions this ;-)
> 
> respin?

Yes, please.

Paolo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCHv5 02/10] add a zero splat vector to qemu-common.h
  2013-03-26 10:17       ` Paolo Bonzini
@ 2013-03-26 10:18         ` Peter Lieven
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Lieven @ 2013-03-26 10:18 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Orit Wasserman, Stefan Hajnoczi, qemu-devel, quintela


Am 26.03.2013 um 11:17 schrieb Paolo Bonzini <pbonzini@redhat.com>:

> Il 26/03/2013 11:17, Peter Lieven ha scritto:
>>>> 
>>>> C trivia of the day: this can be written simply as
>>>> 
>>>>  (VECTYPE) {0}
>> you are the first who mentions this ;-)
>> 
>> respin?
> 
> Yes, please.

Ok, I wait if someone finds something else.

Peter

> 
> Paolo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCHv5 03/10] cutils: add a function to find non-zero content in a buffer
  2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 03/10] cutils: add a function to find non-zero content in a buffer Peter Lieven
@ 2013-03-26 10:38   ` Juan Quintela
  2013-03-26 10:42     ` Peter Lieven
  2013-03-26 10:41   ` Juan Quintela
  1 sibling, 1 reply; 31+ messages in thread
From: Juan Quintela @ 2013-03-26 10:38 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Orit Wasserman, Paolo Bonzini, qemu-devel, Stefan Hajnoczi

Peter Lieven <pl@kamp.de> wrote:
> this adds buffer_find_nonzero_offset() which is a SSE2/Altivec
> optimized function that searches for non-zero content in a
> buffer.
>
> the function starts full unrolling only after the first few chunks have
> been checked one by one. analyzing real memory page data has revealed
> that non-zero pages are non-zero within the first 256-512 bits in
> most cases. as this function is also heavily used to check for zero memory
> pages this tweak has been made to avoid the high setup costs of the fully
> unrolled check for non-zero pages.
>
> due to the optimizations used in the function there are restrictions
> on buffer address and search length. the function
> can_use_buffer_find_nonzero_content() can be used to check if
> the function can be used safely.
>
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  include/qemu-common.h |   13 ++++++++++++
>  util/cutils.c         |   55 +++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 68 insertions(+)
>
> diff --git a/include/qemu-common.h b/include/qemu-common.h
> index 9022646..7c7c244 100644
> --- a/include/qemu-common.h
> +++ b/include/qemu-common.h
> @@ -472,4 +472,17 @@ void hexdump(const char *buf, FILE *fp, const char *prefix, size_t size);
>  #define ALL_EQ(v1, v2) ((v1) == (v2))
>  #endif
>
> +#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
> +static inline bool
> +can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
> +{
> +    if (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
> +               * sizeof(VECTYPE)) == 0
> +        && ((uintptr_t) buf) % sizeof(VECTYPE) == 0) {
> +        return true;
> +    }
> +    return false;
> +}
> +size_t buffer_find_nonzero_offset(const void *buf, size_t len);
> +
>  #endif
> diff --git a/util/cutils.c b/util/cutils.c
> index 1439da4..0314a18 100644
> --- a/util/cutils.c
> +++ b/util/cutils.c
> @@ -143,6 +143,61 @@ int qemu_fdatasync(int fd)
>  }
>
>  /*
> + * Searches for an area with non-zero content in a buffer
> + *
> + * Attention! The len must be a multiple of
> + * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE)
> + * and addr must be a multiple of sizeof(VECTYPE) due to
> + * restriction of optimizations in this function.
> + *
> + * can_use_buffer_find_nonzero_offset() can be used to check
> + * these requirements.
> + *
> + * The return value is the offset of the non-zero area rounded
> + * down to a multiple of sizeof(VECTYPE) for the first
> + * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR chunks and down to
> + * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE)
> + * afterwards.
> + *
> + * If the buffer is all zero the return value is equal to len.
> + */
> +
> +size_t buffer_find_nonzero_offset(const void *buf, size_t len)
> +{
> +    VECTYPE *p = (VECTYPE *)buf;
> +    VECTYPE zero = ZERO_SPLAT;

If you have to resplit it anyways, what about changing this to:

-    VECTYPE *p = (VECTYPE *)buf;
-    VECTYPE zero = ZERO_SPLAT;
+    const VECTYPE *p = buf;
+    const VECTYPE zero = ZERO_SPLAT;
     size_t i;

>From the "I hate casts" film?

Thanks, Juan.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCHv5 03/10] cutils: add a function to find non-zero content in a buffer
  2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 03/10] cutils: add a function to find non-zero content in a buffer Peter Lieven
  2013-03-26 10:38   ` Juan Quintela
@ 2013-03-26 10:41   ` Juan Quintela
  1 sibling, 0 replies; 31+ messages in thread
From: Juan Quintela @ 2013-03-26 10:41 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Orit Wasserman, Paolo Bonzini, qemu-devel, Stefan Hajnoczi

Peter Lieven <pl@kamp.de> wrote:
> this adds buffer_find_nonzero_offset() which is a SSE2/Altivec
> optimized function that searches for non-zero content in a
> buffer.
>
> the function starts full unrolling only after the first few chunks have
> been checked one by one. analyzing real memory page data has revealed
> that non-zero pages are non-zero within the first 256-512 bits in
> most cases. as this function is also heavily used to check for zero memory
> pages this tweak has been made to avoid the high setup costs of the fully
> unrolled check for non-zero pages.
>
> due to the optimizations used in the function there are restrictions
> on buffer address and search length. the function
> can_use_buffer_find_nonzero_content() can be used to check if
> the function can be used safely.
>
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  include/qemu-common.h |   13 ++++++++++++
>  util/cutils.c         |   55 +++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 68 insertions(+)
>
> diff --git a/include/qemu-common.h b/include/qemu-common.h
> index 9022646..7c7c244 100644
> --- a/include/qemu-common.h
> +++ b/include/qemu-common.h
> @@ -472,4 +472,17 @@ void hexdump(const char *buf, FILE *fp, const char *prefix, size_t size);
>  #define ALL_EQ(v1, v2) ((v1) == (v2))
>  #endif
>
> +#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
> +static inline bool
> +can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
> +{
> +    if (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
> +               * sizeof(VECTYPE)) == 0
> +        && ((uintptr_t) buf) % sizeof(VECTYPE) == 0) {
> +        return true;
> +    }
> +    return false;
> +}

This can be spelled as:

return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
               * sizeof(VECTYPE)) == 0
       && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);;

But I don't care too much.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCHv5 03/10] cutils: add a function to find non-zero content in a buffer
  2013-03-26 10:38   ` Juan Quintela
@ 2013-03-26 10:42     ` Peter Lieven
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Lieven @ 2013-03-26 10:42 UTC (permalink / raw)
  To: quintela; +Cc: Orit Wasserman, Paolo Bonzini, qemu-devel, Stefan Hajnoczi


Am 26.03.2013 um 11:38 schrieb Juan Quintela <quintela@redhat.com>:

> Peter Lieven <pl@kamp.de> wrote:
>> this adds buffer_find_nonzero_offset() which is a SSE2/Altivec
>> optimized function that searches for non-zero content in a
>> buffer.
>> 
>> the function starts full unrolling only after the first few chunks have
>> been checked one by one. analyzing real memory page data has revealed
>> that non-zero pages are non-zero within the first 256-512 bits in
>> most cases. as this function is also heavily used to check for zero memory
>> pages this tweak has been made to avoid the high setup costs of the fully
>> unrolled check for non-zero pages.
>> 
>> due to the optimizations used in the function there are restrictions
>> on buffer address and search length. the function
>> can_use_buffer_find_nonzero_content() can be used to check if
>> the function can be used safely.
>> 
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>> include/qemu-common.h |   13 ++++++++++++
>> util/cutils.c         |   55 +++++++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 68 insertions(+)
>> 
>> diff --git a/include/qemu-common.h b/include/qemu-common.h
>> index 9022646..7c7c244 100644
>> --- a/include/qemu-common.h
>> +++ b/include/qemu-common.h
>> @@ -472,4 +472,17 @@ void hexdump(const char *buf, FILE *fp, const char *prefix, size_t size);
>> #define ALL_EQ(v1, v2) ((v1) == (v2))
>> #endif
>> 
>> +#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
>> +static inline bool
>> +can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
>> +{
>> +    if (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>> +               * sizeof(VECTYPE)) == 0
>> +        && ((uintptr_t) buf) % sizeof(VECTYPE) == 0) {
>> +        return true;
>> +    }
>> +    return false;
>> +}
>> +size_t buffer_find_nonzero_offset(const void *buf, size_t len);
>> +
>> #endif
>> diff --git a/util/cutils.c b/util/cutils.c
>> index 1439da4..0314a18 100644
>> --- a/util/cutils.c
>> +++ b/util/cutils.c
>> @@ -143,6 +143,61 @@ int qemu_fdatasync(int fd)
>> }
>> 
>> /*
>> + * Searches for an area with non-zero content in a buffer
>> + *
>> + * Attention! The len must be a multiple of
>> + * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE)
>> + * and addr must be a multiple of sizeof(VECTYPE) due to
>> + * restriction of optimizations in this function.
>> + *
>> + * can_use_buffer_find_nonzero_offset() can be used to check
>> + * these requirements.
>> + *
>> + * The return value is the offset of the non-zero area rounded
>> + * down to a multiple of sizeof(VECTYPE) for the first
>> + * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR chunks and down to
>> + * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE)
>> + * afterwards.
>> + *
>> + * If the buffer is all zero the return value is equal to len.
>> + */
>> +
>> +size_t buffer_find_nonzero_offset(const void *buf, size_t len)
>> +{
>> +    VECTYPE *p = (VECTYPE *)buf;
>> +    VECTYPE zero = ZERO_SPLAT;
> 
> If you have to resplit it anyways, what about changing this to:
> 
> -    VECTYPE *p = (VECTYPE *)buf;
> -    VECTYPE zero = ZERO_SPLAT;
> +    const VECTYPE *p = buf;
> +    const VECTYPE zero = ZERO_SPLAT;

i would change it to 

const VECTYPE zero = (VECTYPE) {0};

and drop patch 2 again.

Peter


>     size_t i;
> 
> From the "I hate casts" film?
> 
> Thanks, Juan.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCHv5 00/10] buffer_is_zero / migration optimizations
  2013-03-26  9:58 [Qemu-devel] [PATCHv5 00/10] buffer_is_zero / migration optimizations Peter Lieven
                   ` (9 preceding siblings ...)
  2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 10/10] migration: use XBZRLE only after " Peter Lieven
@ 2013-03-26 10:46 ` Juan Quintela
  2013-03-26 11:02   ` Peter Lieven
  10 siblings, 1 reply; 31+ messages in thread
From: Juan Quintela @ 2013-03-26 10:46 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Orit Wasserman, Paolo Bonzini, qemu-devel, Stefan Hajnoczi

Peter Lieven <pl@kamp.de> wrote:
> this is v5 of my patch series with various optimizations in
> zero buffer checking and migration tweaks.
>
> thanks especially to Eric Blake, Orit Wassermann and Paolo Bonzini
> for reviewing.
>
> v5:
> - move zero splat vector to a different patch
> - fix indentation of can_user_buffer_find_nonzero_offset()
> - do not unroll the first loop in buffer_find_nonzero_offset()
>   to optimize it for zero page checking
> - use an older unrolled version of find_next_bit() without
>   SIMD instruction as there is no evidence that the vectorized
>   version is better if not even worse and the code is easier
>   to understand.
> - added a word in the commit message of patch 8 
>   about the skipped pages field in QMP MigrationStats.
> - fixed the order of key-value pairs of MigrationStats in
>   qapi-schema.json
> - updated info about the performance benefit of is_zero_page()
>   to the latest benchmark results in the commit message.
>
> v4:
> - do not inline buffer_find_nonzero_offset()
> - inline can_usebuffer_find_nonzero_offset() correctly
> - readd asserts in buffer_find_nonzero_offset() as profiling
>   shows they do not hurt.
> - change last occurences of scalar 8 by 
>   BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
> - avoid deferencing p already in patch 5 where we
>   know that the page (p) is zero
> - explicitly set bytes_sent = 0 if we skip a zero page.
>   bytes_sent was 0 before, but it was not obvious.
> - add accounting information for skipped zero pages
> - fix errors reported by checkpatch.pl
>
> v3:
> - remove asserts, inline functions and add a check
>   function if buffer_find_nonzero_offset() can be used.
> - use above check function in buffer_is_zero() and
>   find_next_bit().
> - use buffer_is_nonzero_offset() directly to find
>   zero pages. we know that all requirements are met
>   for memory pages.
> - fix C89 violation in buffer_is_zero().
> - avoid derefencing p in ram_save_block() if we already
>   know the page is zero.
> - fix initialization of last_offset in reset_ram_globals().
> - avoid skipping pages with offset == 0 in bulk stage in
>   migration_bitmap_find_and_reset_dirty().
> - compared to v1 check for zero pages also after bulk
>   ram migration as there are guests (e.g. Windows) which
>   zero out large amount of memory while running.
>
> v2:
> - fix description, add trivial zero check and add asserts 
>   to buffer_find_nonzero_offset.
> - add a constant for the unroll factor of buffer_find_nonzero_offset
> - replace is_dup_page() by buffer_is_zero()
> - added test results to xbzrle patch
> - optimize descriptions
>
> Peter Lieven (10):
>   move vector definitions to qemu-common.h
>   add a zero splat vector to qemu-common.h
>   cutils: add a function to find non-zero content in a buffer
>   buffer_is_zero: use vector optimizations if possible
>   bitops: unroll while loop in find_next_bit()
>   migration: search for zero instead of dup pages
>   migration: add an indicator for bulk state of ram migration
>   migration: do not sent zero pages in bulk stage
>   migration: do not search dirty pages in bulk stage
>   migration: use XBZRLE only after bulk stage

Really nice series.
I did some minor review changes, but they are more nits that anynthing
else.

Thanks, Juan.

Reviewed-by: Juan Quintela <quintela@redhat.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCHv5 00/10] buffer_is_zero / migration optimizations
  2013-03-26 10:46 ` [Qemu-devel] [PATCHv5 00/10] buffer_is_zero / migration optimizations Juan Quintela
@ 2013-03-26 11:02   ` Peter Lieven
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Lieven @ 2013-03-26 11:02 UTC (permalink / raw)
  To: quintela; +Cc: Orit Wasserman, Paolo Bonzini, qemu-devel, Stefan Hajnoczi


Am 26.03.2013 um 11:46 schrieb Juan Quintela <quintela@redhat.com>:

> Peter Lieven <pl@kamp.de> wrote:
>> this is v5 of my patch series with various optimizations in
>> zero buffer checking and migration tweaks.
>> 
>> thanks especially to Eric Blake, Orit Wassermann and Paolo Bonzini
>> for reviewing.
>> 
>> v5:
>> - move zero splat vector to a different patch
>> - fix indentation of can_user_buffer_find_nonzero_offset()
>> - do not unroll the first loop in buffer_find_nonzero_offset()
>>  to optimize it for zero page checking
>> - use an older unrolled version of find_next_bit() without
>>  SIMD instruction as there is no evidence that the vectorized
>>  version is better if not even worse and the code is easier
>>  to understand.
>> - added a word in the commit message of patch 8 
>>  about the skipped pages field in QMP MigrationStats.
>> - fixed the order of key-value pairs of MigrationStats in
>>  qapi-schema.json
>> - updated info about the performance benefit of is_zero_page()
>>  to the latest benchmark results in the commit message.
>> 
>> v4:
>> - do not inline buffer_find_nonzero_offset()
>> - inline can_usebuffer_find_nonzero_offset() correctly
>> - readd asserts in buffer_find_nonzero_offset() as profiling
>>  shows they do not hurt.
>> - change last occurences of scalar 8 by 
>>  BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>> - avoid deferencing p already in patch 5 where we
>>  know that the page (p) is zero
>> - explicitly set bytes_sent = 0 if we skip a zero page.
>>  bytes_sent was 0 before, but it was not obvious.
>> - add accounting information for skipped zero pages
>> - fix errors reported by checkpatch.pl
>> 
>> v3:
>> - remove asserts, inline functions and add a check
>>  function if buffer_find_nonzero_offset() can be used.
>> - use above check function in buffer_is_zero() and
>>  find_next_bit().
>> - use buffer_is_nonzero_offset() directly to find
>>  zero pages. we know that all requirements are met
>>  for memory pages.
>> - fix C89 violation in buffer_is_zero().
>> - avoid derefencing p in ram_save_block() if we already
>>  know the page is zero.
>> - fix initialization of last_offset in reset_ram_globals().
>> - avoid skipping pages with offset == 0 in bulk stage in
>>  migration_bitmap_find_and_reset_dirty().
>> - compared to v1 check for zero pages also after bulk
>>  ram migration as there are guests (e.g. Windows) which
>>  zero out large amount of memory while running.
>> 
>> v2:
>> - fix description, add trivial zero check and add asserts 
>>  to buffer_find_nonzero_offset.
>> - add a constant for the unroll factor of buffer_find_nonzero_offset
>> - replace is_dup_page() by buffer_is_zero()
>> - added test results to xbzrle patch
>> - optimize descriptions
>> 
>> Peter Lieven (10):
>>  move vector definitions to qemu-common.h
>>  add a zero splat vector to qemu-common.h
>>  cutils: add a function to find non-zero content in a buffer
>>  buffer_is_zero: use vector optimizations if possible
>>  bitops: unroll while loop in find_next_bit()
>>  migration: search for zero instead of dup pages
>>  migration: add an indicator for bulk state of ram migration
>>  migration: do not sent zero pages in bulk stage
>>  migration: do not search dirty pages in bulk stage
>>  migration: use XBZRLE only after bulk stage
> 
> Really nice series.

Thank you.

> I did some minor review changes, but they are more nits that anynthing
> else.

Ok, where is your repository?

Peter


> 
> Thanks, Juan.
> 
> Reviewed-by: Juan Quintela <quintela@redhat.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCHv5 06/10] migration: search for zero instead of dup pages
  2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 06/10] migration: search for zero instead of dup pages Peter Lieven
@ 2013-04-05 19:23   ` Kevin Wolf
  2013-04-05 20:00     ` Paolo Bonzini
  2013-04-08  8:33     ` Peter Lieven
  0 siblings, 2 replies; 31+ messages in thread
From: Kevin Wolf @ 2013-04-05 19:23 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Anthony Liguori, quintela, Stefan Hajnoczi, qemu-devel,
	Orit Wasserman, Paolo Bonzini

Am 26.03.2013 um 10:58 hat Peter Lieven geschrieben:
> virtually all dup pages are zero pages. remove
> the special is_dup_page() function and use the
> optimized buffer_find_nonzero_offset() function
> instead.
> 
> here buffer_find_nonzero_offset() is used directly
> to avoid the unnecssary additional checks in
> buffer_is_zero().
> 
> raw performace gain checking 1 GByte zeroed memory
> over is_dup_page() is approx. 10-12% with SSE2
> and 8-10% with unsigned long arithmedtic.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> Reviewed-by: Orit Wasserman <owasserm@redhat.com>
> Reviewed-by: Eric Blake <eblake@redhat.com>

Okay, so I bisected again and this is the second patch that is involved
in the slowness of qemu-iotests case 007.

The problem seems to be that the RAM of a guest is in fact _not_ zeroed
during initialisation. It hits my test case reliably because I'm running
with MALLOC_PERTURB_. Now I'm wondering if in practice this happens only
under such test conditions, or if real guests could be affected as well
and we should make sure to get zeroed memory for RAM.

Kevin

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCHv5 06/10] migration: search for zero instead of dup pages
  2013-04-05 19:23   ` Kevin Wolf
@ 2013-04-05 20:00     ` Paolo Bonzini
  2013-04-05 21:44       ` Peter Lieven
  2013-04-05 22:06       ` Peter Lieven
  2013-04-08  8:33     ` Peter Lieven
  1 sibling, 2 replies; 31+ messages in thread
From: Paolo Bonzini @ 2013-04-05 20:00 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Anthony Liguori, quintela, Stefan Hajnoczi, Peter Lieven,
	qemu-devel, Orit Wasserman

Il 05/04/2013 21:23, Kevin Wolf ha scritto:
>> > virtually all dup pages are zero pages. remove
>> > the special is_dup_page() function and use the
>> > optimized buffer_find_nonzero_offset() function
>> > instead.
>> > 
>> > here buffer_find_nonzero_offset() is used directly
>> > to avoid the unnecssary additional checks in
>> > buffer_is_zero().
>> > 
>> > raw performace gain checking 1 GByte zeroed memory
>> > over is_dup_page() is approx. 10-12% with SSE2
>> > and 8-10% with unsigned long arithmedtic.
>> > 
>> > Signed-off-by: Peter Lieven <pl@kamp.de>
>> > Reviewed-by: Orit Wasserman <owasserm@redhat.com>
>> > Reviewed-by: Eric Blake <eblake@redhat.com>
> Okay, so I bisected again and this is the second patch that is involved
> in the slowness of qemu-iotests case 007.
> 
> The problem seems to be that the RAM of a guest is in fact _not_ zeroed
> during initialisation. It hits my test case reliably because I'm running
> with MALLOC_PERTURB_. Now I'm wondering if in practice this happens only
> under such test conditions, or if real guests could be affected as well
> and we should make sure to get zeroed memory for RAM.

I think we should MADV_DONTNEED it.

Paolo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCHv5 06/10] migration: search for zero instead of dup pages
  2013-04-05 20:00     ` Paolo Bonzini
@ 2013-04-05 21:44       ` Peter Lieven
  2013-04-05 22:06       ` Peter Lieven
  1 sibling, 0 replies; 31+ messages in thread
From: Peter Lieven @ 2013-04-05 21:44 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Anthony Liguori, quintela, Stefan Hajnoczi,
	qemu-devel, Orit Wasserman

Am 05.04.2013 um 22:00 schrieb Paolo Bonzini <pbonzini@redhat.com>:

> Il 05/04/2013 21:23, Kevin Wolf ha scritto:
>>>> virtually all dup pages are zero pages. remove
>>>> the special is_dup_page() function and use the
>>>> optimized buffer_find_nonzero_offset() function
>>>> instead.
>>>> 
>>>> here buffer_find_nonzero_offset() is used directly
>>>> to avoid the unnecssary additional checks in
>>>> buffer_is_zero().
>>>> 
>>>> raw performace gain checking 1 GByte zeroed memory
>>>> over is_dup_page() is approx. 10-12% with SSE2
>>>> and 8-10% with unsigned long arithmedtic.
>>>> 
>>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>>> Reviewed-by: Orit Wasserman <owasserm@redhat.com>
>>>> Reviewed-by: Eric Blake <eblake@redhat.com>
>> Okay, so I bisected again and this is the second patch that is involved
>> in the slowness of qemu-iotests case 007.
>> 
>> The problem seems to be that the RAM of a guest is in fact _not_ zeroed
>> during initialisation. It hits my test case reliably because I'm running
>> with MALLOC_PERTURB_. Now I'm wondering if in practice this happens only
>> under such test conditions, or if real guests could be affected as well
>> and we should make sure to get zeroed memory for RAM.
> 
> I think we should MADV_DONTNEED it.

This does not guarantee that the memory is unmapped afaik. Sadly, I think we have to revert

migration: do not sent zero pages in bulk stage

The memory assigned by posix_memalign is most likely zero as all GFP_USER pages
are zeroed out by the kernel on alloc (at least under Linux), but if the page is reused in the same process
it is not necessarily zero anymore.

What I was trying to achieve with this patch is that the memset when receiving a zero_page
at the target was allocating memory and the MADV_DONTNEED was not immediately
dropping the page. This lead to memory pressure and swapping etc. on overcommitted systems.

What I would propose as a solution for this is after reverting the "do not sent zero pages"
patch is sth like this when receiving a compressed page:

if (ch != 0 || !is_zero_page(host)) {
    memset(host, ch, TARGET_PAGE_SIZE);
}

Regarding Kevins observation of the speed regression in iotest 007 this is simply if
MALLOC_PERTURB_ is used there are simply no zero pages, but only dup pages in memory. On
a real system the observation is that pages are either zero or not dup at all.

Peter

> 
> Paolo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCHv5 06/10] migration: search for zero instead of dup pages
  2013-04-05 20:00     ` Paolo Bonzini
  2013-04-05 21:44       ` Peter Lieven
@ 2013-04-05 22:06       ` Peter Lieven
  2013-04-08  8:38         ` Paolo Bonzini
  1 sibling, 1 reply; 31+ messages in thread
From: Peter Lieven @ 2013-04-05 22:06 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Anthony Liguori, quintela, Stefan Hajnoczi,
	qemu-devel, Orit Wasserman


Am 05.04.2013 um 22:00 schrieb Paolo Bonzini <pbonzini@redhat.com>:

> Il 05/04/2013 21:23, Kevin Wolf ha scritto:
>>>> virtually all dup pages are zero pages. remove
>>>> the special is_dup_page() function and use the
>>>> optimized buffer_find_nonzero_offset() function
>>>> instead.
>>>> 
>>>> here buffer_find_nonzero_offset() is used directly
>>>> to avoid the unnecssary additional checks in
>>>> buffer_is_zero().
>>>> 
>>>> raw performace gain checking 1 GByte zeroed memory
>>>> over is_dup_page() is approx. 10-12% with SSE2
>>>> and 8-10% with unsigned long arithmedtic.
>>>> 
>>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>>> Reviewed-by: Orit Wasserman <owasserm@redhat.com>
>>>> Reviewed-by: Eric Blake <eblake@redhat.com>
>> Okay, so I bisected again and this is the second patch that is involved
>> in the slowness of qemu-iotests case 007.
>> 
>> The problem seems to be that the RAM of a guest is in fact _not_ zeroed
>> during initialisation. It hits my test case reliably because I'm running
>> with MALLOC_PERTURB_. Now I'm wondering if in practice this happens only
>> under such test conditions, or if real guests could be affected as well
>> and we should make sure to get zeroed memory for RAM.
> 
> I think we should MADV_DONTNEED it.

i have to correct myself, on Linux it seems that MADV_DONTNEED guarantees that
subsequent reads will return a zero filled page.

If I am right with that we could MADV_DONTNEED the memory on startup and keep
not sending zero pages in bulk stage if we are running under Linux. Am I right with that?

Peter

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCHv5 06/10] migration: search for zero instead of dup pages
  2013-04-05 19:23   ` Kevin Wolf
  2013-04-05 20:00     ` Paolo Bonzini
@ 2013-04-08  8:33     ` Peter Lieven
  2013-04-08  8:39       ` Peter Lieven
  2013-04-08  8:49       ` Kevin Wolf
  1 sibling, 2 replies; 31+ messages in thread
From: Peter Lieven @ 2013-04-08  8:33 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Anthony Liguori, quintela, Stefan Hajnoczi, qemu-devel,
	Orit Wasserman, Paolo Bonzini


Am 05.04.2013 um 21:23 schrieb Kevin Wolf <kwolf@redhat.com>:

> Am 26.03.2013 um 10:58 hat Peter Lieven geschrieben:
>> virtually all dup pages are zero pages. remove
>> the special is_dup_page() function and use the
>> optimized buffer_find_nonzero_offset() function
>> instead.
>> 
>> here buffer_find_nonzero_offset() is used directly
>> to avoid the unnecssary additional checks in
>> buffer_is_zero().
>> 
>> raw performace gain checking 1 GByte zeroed memory
>> over is_dup_page() is approx. 10-12% with SSE2
>> and 8-10% with unsigned long arithmedtic.
>> 
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> Reviewed-by: Orit Wasserman <owasserm@redhat.com>
>> Reviewed-by: Eric Blake <eblake@redhat.com>
> 
> Okay, so I bisected again and this is the second patch that is involved
> in the slowness of qemu-iotests case 007.
> 

Can you try if the following solves your issue:

diff --git a/exec.c b/exec.c
index 786987a..54baa4a 100644
--- a/exec.c
+++ b/exec.c
@@ -1071,6 +1071,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
             memory_try_enable_merging(new_block->host, size);
         }
     }
+    qemu_madvise(new_block->host, size, QEMU_MADV_DONTNEED);
     new_block->length = size;
 
     /* Keep the list sorted from biggest to smallest block.  */


BR,
Peter

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCHv5 06/10] migration: search for zero instead of dup pages
  2013-04-05 22:06       ` Peter Lieven
@ 2013-04-08  8:38         ` Paolo Bonzini
  2013-04-08  9:25           ` Peter Lieven
  0 siblings, 1 reply; 31+ messages in thread
From: Paolo Bonzini @ 2013-04-08  8:38 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Kevin Wolf, Anthony Liguori, quintela, Stefan Hajnoczi,
	qemu-devel, Orit Wasserman

Il 06/04/2013 00:06, Peter Lieven ha scritto:
>> > I think we should MADV_DONTNEED it.
> i have to correct myself, on Linux it seems that MADV_DONTNEED guarantees that
> subsequent reads will return a zero filled page.
> 
> If I am right with that we could MADV_DONTNEED the memory on startup and keep
> not sending zero pages in bulk stage if we are running under Linux. Am I right with that?

Actually we can use mmap+munmap to allocate a well-aligned, always zero
block for the RAM.

Paolo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCHv5 06/10] migration: search for zero instead of dup pages
  2013-04-08  8:33     ` Peter Lieven
@ 2013-04-08  8:39       ` Peter Lieven
  2013-04-08  8:49       ` Kevin Wolf
  1 sibling, 0 replies; 31+ messages in thread
From: Peter Lieven @ 2013-04-08  8:39 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Anthony Liguori, quintela, Stefan Hajnoczi, qemu-devel,
	Orit Wasserman, Paolo Bonzini

Am 08.04.2013 um 10:33 schrieb Peter Lieven <pl@kamp.de>:

> 
> Am 05.04.2013 um 21:23 schrieb Kevin Wolf <kwolf@redhat.com>:
> 
>> Am 26.03.2013 um 10:58 hat Peter Lieven geschrieben:
>>> virtually all dup pages are zero pages. remove
>>> the special is_dup_page() function and use the
>>> optimized buffer_find_nonzero_offset() function
>>> instead.
>>> 
>>> here buffer_find_nonzero_offset() is used directly
>>> to avoid the unnecssary additional checks in
>>> buffer_is_zero().
>>> 
>>> raw performace gain checking 1 GByte zeroed memory
>>> over is_dup_page() is approx. 10-12% with SSE2
>>> and 8-10% with unsigned long arithmedtic.
>>> 
>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>> Reviewed-by: Orit Wasserman <owasserm@redhat.com>
>>> Reviewed-by: Eric Blake <eblake@redhat.com>
>> 
>> Okay, so I bisected again and this is the second patch that is involved
>> in the slowness of qemu-iotests case 007.
>> 
> 

actually this might be better?!:

diff --git a/exec.c b/exec.c
index 786987a..334a46e 100644
--- a/exec.c
+++ b/exec.c
@@ -1071,6 +1071,11 @@ ram_addr_t qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
             memory_try_enable_merging(new_block->host, size);
         }
     }
+    if (mem_prealloc) {
+        memset(new_block->host, size, 0x00);
+    } else {
+        qemu_madvise(new_block->host, size, QEMU_MADV_DONTNEED);
+    }
     new_block->length = size;
 
     /* Keep the list sorted from biggest to smallest block.  */

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCHv5 06/10] migration: search for zero instead of dup pages
  2013-04-08  8:33     ` Peter Lieven
  2013-04-08  8:39       ` Peter Lieven
@ 2013-04-08  8:49       ` Kevin Wolf
  2013-04-08  8:50         ` Peter Lieven
  1 sibling, 1 reply; 31+ messages in thread
From: Kevin Wolf @ 2013-04-08  8:49 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Anthony Liguori, quintela, Stefan Hajnoczi, qemu-devel,
	Orit Wasserman, Paolo Bonzini

Am 08.04.2013 um 10:33 hat Peter Lieven geschrieben:
> 
> Am 05.04.2013 um 21:23 schrieb Kevin Wolf <kwolf@redhat.com>:
> 
> > Am 26.03.2013 um 10:58 hat Peter Lieven geschrieben:
> >> virtually all dup pages are zero pages. remove
> >> the special is_dup_page() function and use the
> >> optimized buffer_find_nonzero_offset() function
> >> instead.
> >> 
> >> here buffer_find_nonzero_offset() is used directly
> >> to avoid the unnecssary additional checks in
> >> buffer_is_zero().
> >> 
> >> raw performace gain checking 1 GByte zeroed memory
> >> over is_dup_page() is approx. 10-12% with SSE2
> >> and 8-10% with unsigned long arithmedtic.
> >> 
> >> Signed-off-by: Peter Lieven <pl@kamp.de>
> >> Reviewed-by: Orit Wasserman <owasserm@redhat.com>
> >> Reviewed-by: Eric Blake <eblake@redhat.com>
> > 
> > Okay, so I bisected again and this is the second patch that is involved
> > in the slowness of qemu-iotests case 007.
> > 
> 
> Can you try if the following solves your issue:
> 
> diff --git a/exec.c b/exec.c
> index 786987a..54baa4a 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1071,6 +1071,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
>              memory_try_enable_merging(new_block->host, size);
>          }
>      }
> +    qemu_madvise(new_block->host, size, QEMU_MADV_DONTNEED);
>      new_block->length = size;
>  
>      /* Keep the list sorted from biggest to smallest block.  */

It does. But perhaps Paolo's suggestion of using mmap() to allocate the
memory would be better. I'm not sure how MADV_DONTNEED behaves on
non-Linux.

Kevin

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCHv5 06/10] migration: search for zero instead of dup pages
  2013-04-08  8:49       ` Kevin Wolf
@ 2013-04-08  8:50         ` Peter Lieven
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Lieven @ 2013-04-08  8:50 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Anthony Liguori, quintela, Stefan Hajnoczi, qemu-devel,
	Orit Wasserman, Paolo Bonzini


Am 08.04.2013 um 10:49 schrieb Kevin Wolf <kwolf@redhat.com>:

> Am 08.04.2013 um 10:33 hat Peter Lieven geschrieben:
>> 
>> Am 05.04.2013 um 21:23 schrieb Kevin Wolf <kwolf@redhat.com>:
>> 
>>> Am 26.03.2013 um 10:58 hat Peter Lieven geschrieben:
>>>> virtually all dup pages are zero pages. remove
>>>> the special is_dup_page() function and use the
>>>> optimized buffer_find_nonzero_offset() function
>>>> instead.
>>>> 
>>>> here buffer_find_nonzero_offset() is used directly
>>>> to avoid the unnecssary additional checks in
>>>> buffer_is_zero().
>>>> 
>>>> raw performace gain checking 1 GByte zeroed memory
>>>> over is_dup_page() is approx. 10-12% with SSE2
>>>> and 8-10% with unsigned long arithmedtic.
>>>> 
>>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>>> Reviewed-by: Orit Wasserman <owasserm@redhat.com>
>>>> Reviewed-by: Eric Blake <eblake@redhat.com>
>>> 
>>> Okay, so I bisected again and this is the second patch that is involved
>>> in the slowness of qemu-iotests case 007.
>>> 
>> 
>> Can you try if the following solves your issue:
>> 
>> diff --git a/exec.c b/exec.c
>> index 786987a..54baa4a 100644
>> --- a/exec.c
>> +++ b/exec.c
>> @@ -1071,6 +1071,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
>>             memory_try_enable_merging(new_block->host, size);
>>         }
>>     }
>> +    qemu_madvise(new_block->host, size, QEMU_MADV_DONTNEED);
>>     new_block->length = size;
>> 
>>     /* Keep the list sorted from biggest to smallest block.  */
> 
> It does. But perhaps Paolo's suggestion of using mmap() to allocate the
> memory would be better. I'm not sure how MADV_DONTNEED behaves on
> non-Linux.

its not guaranteed to zero memory.

Peter

> 
> Kevin

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCHv5 06/10] migration: search for zero instead of dup pages
  2013-04-08  8:38         ` Paolo Bonzini
@ 2013-04-08  9:25           ` Peter Lieven
  2013-04-08 10:33             ` Paolo Bonzini
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Lieven @ 2013-04-08  9:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, Anthony Liguori, quintela, Stefan Hajnoczi,
	qemu-devel, Orit Wasserman


Am 08.04.2013 um 10:38 schrieb Paolo Bonzini <pbonzini@redhat.com>:

> Il 06/04/2013 00:06, Peter Lieven ha scritto:
>>>> I think we should MADV_DONTNEED it.
>> i have to correct myself, on Linux it seems that MADV_DONTNEED guarantees that
>> subsequent reads will return a zero filled page.
>> 
>> If I am right with that we could MADV_DONTNEED the memory on startup and keep
>> not sending zero pages in bulk stage if we are running under Linux. Am I right with that?
> 
> Actually we can use mmap+munmap to allocate a well-aligned, always zero
> block for the RAM.

like this?

diff --git a/exec.c b/exec.c
index 786987a..dfd1ad3 100644
--- a/exec.c
+++ b/exec.c
@@ -1048,27 +1048,31 @@ ram_addr_t qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
         new_block->host = host;
         new_block->flags |= RAM_PREALLOC_MASK;
     } else {
-        if (mem_path) {
+        if (xen_enabled()) {
+            xen_ram_alloc(new_block->offset, size, mr);
+            memory_try_enable_merging(new_block->host, size);
+        } else {
+            if (mem_path) {
 #if defined (__linux__) && !defined(TARGET_S390X)
-            new_block->host = file_ram_alloc(new_block, size, mem_path);
-            if (!new_block->host) {
-                new_block->host = qemu_vmalloc(size);
-                memory_try_enable_merging(new_block->host, size);
-            }
+                new_block->host = file_ram_alloc(new_block, size, mem_path);
 #else
-            fprintf(stderr, "-mem-path option unsupported\n");
-            exit(1);
+                fprintf(stderr, "-mem-path option unsupported\n");
+                exit(1);
 #endif
-        } else {
-            if (xen_enabled()) {
-                xen_ram_alloc(new_block->offset, size, mr);
-            } else if (kvm_enabled()) {
+            } 
+            if (!new_block->host && kvm_enabled()) {
                 /* some s390/kvm configurations have special constraints */
                 new_block->host = kvm_vmalloc(size);
-            } else {
-                new_block->host = qemu_vmalloc(size);
             }
-            memory_try_enable_merging(new_block->host, size);
+            if (!new_block->host) {
+                new_block->host = mmap(0, size, PROT_READ | PROT_WRITE, 
+                                       MAP_ANONYMOUS |  MAP_PRIVATE, -1, 0);
+                if (new_block->host == MAP_FAILED) {
+                    fprintf(stderr,"qemu_ram_alloc: can't mmap RAM pages %d\n",errno);
+                    exit(1);
+                }
+                memory_try_enable_merging(new_block->host, size);
+            }
         }
     }
     new_block->length = size;
@@ -1142,25 +1146,19 @@ void qemu_ram_free(ram_addr_t addr)
                 ;
             } else if (mem_path) {
 #if defined (__linux__) && !defined(TARGET_S390X)
+                munmap(block->host, block->length);
                 if (block->fd) {
-                    munmap(block->host, block->length);
                     close(block->fd);
-                } else {
-                    qemu_vfree(block->host);
                 }
 #else
                 abort();
 #endif
             } else {
-#if defined(TARGET_S390X) && defined(CONFIG_KVM)
-                munmap(block->host, block->length);
-#else
                 if (xen_enabled()) {
                     xen_invalidate_map_cache_entry(block->host);
                 } else {
-                    qemu_vfree(block->host);
+                    munmap(block->host, block->length);
                 }
-#endif
             }
             g_free(block);
             break;
diff --git a/kvm-all.c b/kvm-all.c
index 9b433d3..90be8e3 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1742,7 +1742,7 @@ void *kvm_vmalloc(ram_addr_t size)
         return mem;
     }
 #endif
-    return qemu_vmalloc(size);
+    return NULL;
 }
 
 void kvm_setup_guest_memory(void *start, size_t size)

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCHv5 06/10] migration: search for zero instead of dup pages
  2013-04-08  9:25           ` Peter Lieven
@ 2013-04-08 10:33             ` Paolo Bonzini
  0 siblings, 0 replies; 31+ messages in thread
From: Paolo Bonzini @ 2013-04-08 10:33 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Kevin Wolf, Anthony Liguori, quintela, Stefan Hajnoczi,
	qemu-devel, Orit Wasserman

Il 08/04/2013 11:25, Peter Lieven ha scritto:
>> > Actually we can use mmap+munmap to allocate a well-aligned, always zero
>> > block for the RAM.
> like this?

This doesn't align it.  I'll send a patch shortly.

Paolo

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2013-04-08 10:33 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-03-26  9:58 [Qemu-devel] [PATCHv5 00/10] buffer_is_zero / migration optimizations Peter Lieven
2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 01/10] move vector definitions to qemu-common.h Peter Lieven
2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 02/10] add a zero splat vector " Peter Lieven
2013-03-26 10:14   ` Paolo Bonzini
2013-03-26 10:17     ` Peter Lieven
2013-03-26 10:17       ` Paolo Bonzini
2013-03-26 10:18         ` Peter Lieven
2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 03/10] cutils: add a function to find non-zero content in a buffer Peter Lieven
2013-03-26 10:38   ` Juan Quintela
2013-03-26 10:42     ` Peter Lieven
2013-03-26 10:41   ` Juan Quintela
2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 04/10] buffer_is_zero: use vector optimizations if possible Peter Lieven
2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 05/10] bitops: unroll while loop in find_next_bit() Peter Lieven
2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 06/10] migration: search for zero instead of dup pages Peter Lieven
2013-04-05 19:23   ` Kevin Wolf
2013-04-05 20:00     ` Paolo Bonzini
2013-04-05 21:44       ` Peter Lieven
2013-04-05 22:06       ` Peter Lieven
2013-04-08  8:38         ` Paolo Bonzini
2013-04-08  9:25           ` Peter Lieven
2013-04-08 10:33             ` Paolo Bonzini
2013-04-08  8:33     ` Peter Lieven
2013-04-08  8:39       ` Peter Lieven
2013-04-08  8:49       ` Kevin Wolf
2013-04-08  8:50         ` Peter Lieven
2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 07/10] migration: add an indicator for bulk state of ram migration Peter Lieven
2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 08/10] migration: do not sent zero pages in bulk stage Peter Lieven
2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 09/10] migration: do not search dirty " Peter Lieven
2013-03-26  9:58 ` [Qemu-devel] [PATCHv5 10/10] migration: use XBZRLE only after " Peter Lieven
2013-03-26 10:46 ` [Qemu-devel] [PATCHv5 00/10] buffer_is_zero / migration optimizations Juan Quintela
2013-03-26 11:02   ` Peter Lieven

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).