[Qemu-devel] [RFC] optimize is_dup_page for zero pages


* [Qemu-devel] [RFC] optimize is_dup_page for zero pages
@ 2013-03-12 10:51 Peter Lieven
  2013-03-12 11:11 ` Paolo Bonzini
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Lieven @ 2013-03-12 10:51 UTC
  To: qemu-devel@nongnu.org
  Cc: peter.maydell, Paolo Bonzini, Kevin Wolf, Stefan Hajnoczi

Hi,

a second patch to optimize live migration. I have generated an artificial load
to test zero pages. Ordinary dup or non-dup pages are not affected.

savings for zero pages (test case):
  non SSE2:    30s -> 26s
  SSE2:        27s -> 21s

Optionally, I would suggest optimizing buffer_is_zero to use SSE2 if addr
is 16-byte aligned and the length is a multiple of 128 bytes.
In that case the bdrv functions could also benefit from it.
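
A minimal sketch of that precondition, for illustration only. The helper name
can_use_sse2_path is hypothetical and not part of QEMU; a caller would fall
back to the existing scalar loop whenever it returns false.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not QEMU code: reports whether a buffer meets the
 * constraints suggested above for an SSE2-based zero check. */
static bool can_use_sse2_path(const void *buf, size_t len)
{
    /* The address must sit on a 16-byte boundary (one 128-bit vector). */
    bool addr_aligned = ((uintptr_t)buf % 16) == 0;
    /* The length must be a multiple of 128 bytes (8 vectors per iteration). */
    bool len_aligned = (len % 128) == 0;
    return addr_aligned && len_aligned;
}
```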

Peter

diff --git a/arch_init.c b/arch_init.c
index 98e2bc6..e1051e6 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -164,9 +164,37 @@ int qemu_read_default_config_files(bool userconfig)
      return 0;
  }

-static int is_dup_page(uint8_t *page)
+#if __SSE2__
+static int is_zero_page_sse2(u_int8_t *page)
  {
      VECTYPE *p = (VECTYPE *)page;
+    VECTYPE zero = _mm_setzero_si128();
+    int i;
+    for (i = 0; i < (TARGET_PAGE_SIZE / sizeof(VECTYPE)); i+=8) {
+               VECTYPE tmp0 = _mm_or_si128(p[i+0],p[i+1]);
+               VECTYPE tmp1 = _mm_or_si128(p[i+2],p[i+3]);
+               VECTYPE tmp2 = _mm_or_si128(p[i+4],p[i+5]);
+               VECTYPE tmp3 = _mm_or_si128(p[i+6],p[i+7]);
+               VECTYPE tmp01 = _mm_or_si128(tmp0,tmp1);
+               VECTYPE tmp23 = _mm_or_si128(tmp2,tmp3);
+               if (!ALL_EQ(_mm_or_si128(tmp01,tmp23), zero)) {
+                   return 0;
+               }
+    }
+    return 1;
+}
+#endif
+
+static int is_dup_page(u_int8_t *page) {
+    if (!page[0]) {
+#if __SSE2__
+        return is_zero_page_sse2(page);
+#else
+        return buffer_is_zero(page, TARGET_PAGE_SIZE);
+#endif
+    }
+
+    VECTYPE *p = (VECTYPE *)page;
      VECTYPE val = SPLAT(page);
      int i;


* Re: [Qemu-devel] [RFC] optimize is_dup_page for zero pages
  2013-03-12 10:51 [Qemu-devel] [RFC] optimize is_dup_page for zero pages Peter Lieven
@ 2013-03-12 11:11 ` Paolo Bonzini
  2013-03-12 11:20   ` Peter Lieven
  0 siblings, 1 reply; 8+ messages in thread
From: Paolo Bonzini @ 2013-03-12 11:11 UTC
  To: Peter Lieven
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, peter.maydell

On 12/03/2013 11:51, Peter Lieven wrote:
> Hi,
> 
> a second patch to optimize live migration. I have generated some
> artificial load
> testing for zero pages. Ordinary dup or non dup pages are not affected.
> 
> savings for zero pages (test case):
>  non SSE2:    30s -> 26s
>  SSE2:        27s -> 21s
> 
> optionally I would suggest optimizing buffer_is_zero to use SSE2 if addr
> is 16 byte aligned and length is 128 byte aligned.
> in this case bdrv functions could also benefit from it.
> 
> Peter
> 
> diff --git a/arch_init.c b/arch_init.c
> index 98e2bc6..e1051e6 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -164,9 +164,37 @@ int qemu_read_default_config_files(bool userconfig)
>      return 0;
>  }
> 
> -static int is_dup_page(uint8_t *page)
> +#if __SSE2__
> +static int is_zero_page_sse2(u_int8_t *page)
>  {
>      VECTYPE *p = (VECTYPE *)page;
> +    VECTYPE zero = _mm_setzero_si128();
> +    int i;
> +    for (i = 0; i < (TARGET_PAGE_SIZE / sizeof(VECTYPE)); i+=8) {
> +               VECTYPE tmp0 = _mm_or_si128(p[i+0],p[i+1]);
> +               VECTYPE tmp1 = _mm_or_si128(p[i+2],p[i+3]);
> +               VECTYPE tmp2 = _mm_or_si128(p[i+4],p[i+5]);
> +               VECTYPE tmp3 = _mm_or_si128(p[i+6],p[i+7]);
> +               VECTYPE tmp01 = _mm_or_si128(tmp0,tmp1);
> +               VECTYPE tmp23 = _mm_or_si128(tmp2,tmp3);

You can use the normal "|" C operator, then the result will be portable
to Altivec or !SSE2 as well.
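
The idea of using the plain "|" operator can be sketched without any
intrinsics. The version below is an illustration under stated assumptions: it
works on 64-bit words rather than VECTYPE, SKETCH_PAGE_SIZE stands in for
TARGET_PAGE_SIZE, and the function name is made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096  /* assumed; QEMU uses TARGET_PAGE_SIZE */

/* Illustrative zero-page check: ORs eight 64-bit words per iteration with
 * the plain C "|" operator and stops at the first non-zero group. */
static bool page_is_zero_portable(const uint8_t *page)
{
    for (size_t off = 0; off < SKETCH_PAGE_SIZE; off += 8 * sizeof(uint64_t)) {
        uint64_t w[8];
        memcpy(w, page + off, sizeof(w));  /* alignment-safe load of 64 bytes */
        if ((w[0] | w[1] | w[2] | w[3] | w[4] | w[5] | w[6] | w[7]) != 0) {
            return false;
        }
    }
    return true;
}
```

A page with any non-zero byte in its first 64 bytes leaves the loop after a
single iteration.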

The problem is that find_zero_bit has a known case when there are a lot
of zero bytes---namely, the final passes of migration.  For is_dup_page,
it is reasonable to assume that:

* zero pages remain zero, and thus are only processed once

* non-zero pages are modified often, and thus are processed multiple times.

Your patch adds overhead in the case where a page is non-zero, which
will be the common case in any non-artificial benchmark.  It _is_
possible that the net result is positive because you warm the cache with
the first 128 bytes of the page.  But without more benchmarking, it is
reasonable to optimize is_dup_page for the case where the for loop rolls
very few times.

Paolo


* Re: [Qemu-devel] [RFC] optimize is_dup_page for zero pages
  2013-03-12 11:11 ` Paolo Bonzini
@ 2013-03-12 11:20   ` Peter Lieven
  2013-03-12 11:46     ` Paolo Bonzini
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Lieven @ 2013-03-12 11:20 UTC
  To: Paolo Bonzini
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, peter.maydell


On 12.03.2013 at 12:11, Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 12/03/2013 11:51, Peter Lieven wrote:
>> Hi,
>> 
>> a second patch to optimize live migration. I have generated some
>> artificial load
>> testing for zero pages. Ordinary dup or non dup pages are not affected.
>> 
>> savings for zero pages (test case):
>> non SSE2:    30s -> 26s
>> SSE2:        27s -> 21s
>> 
>> optionally I would suggest optimizing buffer_is_zero to use SSE2 if addr
>> is 16 byte aligned and length is 128 byte aligned.
>> in this case bdrv functions could also benefit from it.
>> 
>> Peter
>> 
>> diff --git a/arch_init.c b/arch_init.c
>> index 98e2bc6..e1051e6 100644
>> --- a/arch_init.c
>> +++ b/arch_init.c
>> @@ -164,9 +164,37 @@ int qemu_read_default_config_files(bool userconfig)
>>     return 0;
>> }
>> 
>> -static int is_dup_page(uint8_t *page)
>> +#if __SSE2__
>> +static int is_zero_page_sse2(u_int8_t *page)
>> {
>>     VECTYPE *p = (VECTYPE *)page;
>> +    VECTYPE zero = _mm_setzero_si128();
>> +    int i;
>> +    for (i = 0; i < (TARGET_PAGE_SIZE / sizeof(VECTYPE)); i+=8) {
>> +               VECTYPE tmp0 = _mm_or_si128(p[i+0],p[i+1]);
>> +               VECTYPE tmp1 = _mm_or_si128(p[i+2],p[i+3]);
>> +               VECTYPE tmp2 = _mm_or_si128(p[i+4],p[i+5]);
>> +               VECTYPE tmp3 = _mm_or_si128(p[i+6],p[i+7]);
>> +               VECTYPE tmp01 = _mm_or_si128(tmp0,tmp1);
>> +               VECTYPE tmp23 = _mm_or_si128(tmp2,tmp3);
> 
> You can use the normal "|" C operator, then the result will be portable
> to Altivec or !SSE2 as well.
> 
> The problem is that find_zero_bit has a known case when there are a lot
> of zero bytes---namely, the final passes of migration.  For is_dup_page,
> it is reasonable to assume that:

For find_zero_bit it would also be possible to switch to an optimized
version, but the code would get more and more complicated.

> 
> * zero pages remain zero, and thus are only processed once

You are right, this will be the case.

> 
> * non-zero pages are modified often, and thus are processed multiple times.
> 
> Your patch adds overhead in the case where a page is non-zero, which
> will be the common case in any non-artificial benchmark.  It _is_
> possible that the net result is positive because you warm the cache with
> the first 128 bytes of the page.  But without more benchmarking, it is
> reasonable to optimize is_dup_page for the case where the for loop rolls
> very few times.

OK, good point. However, it will only enter the zero check if the first byte
is zero (this could perhaps be changed to the first 32 or 64 bits).

What about using this patch for buffer_is_zero optimization?

Peter


> 
> Paolo


* Re: [Qemu-devel] [RFC] optimize is_dup_page for zero pages
  2013-03-12 11:20   ` Peter Lieven
@ 2013-03-12 11:46     ` Paolo Bonzini
  2013-03-12 11:51       ` Peter Lieven
  0 siblings, 1 reply; 8+ messages in thread
From: Paolo Bonzini @ 2013-03-12 11:46 UTC
  To: Peter Lieven
  Cc: Kevin Wolf, Stefan Hajnoczi, Orit Wasserman,
	qemu-devel@nongnu.org, peter.maydell

On 12/03/2013 12:20, Peter Lieven wrote:
>> * zero pages remain zero, and thus are only processed once
> 
> you are right this will be the case.
> 
>>
>> * non-zero pages are modified often, and thus are processed multiple times.
>>
>> Your patch adds overhead in the case where a page is non-zero, which
>> will be the common case in any non-artificial benchmark.  It _is_
>> possible that the net result is positive because you warm the cache with
>> the first 128 bytes of the page.  But without more benchmarking, it is
>> reasonable to optimize is_dup_page for the case where the for loop rolls
>> very few times.
> 
> Ok, good point. However, it will only enter the zero check if the first byte (or maybe could change
> this to first 32 or 64 bit) is zero.

On big-endian architectures, I expect that the first byte will be zero
very often.  (32- or 64-bit, much less indeed).
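
A small illustration of that point, with hypothetical helper names: a page
that starts with a small 32-bit big-endian integer has a zero first byte, so
a one-byte pre-check lets it through to the full scan, while a 64-bit
pre-check rejects it immediately.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical pre-checks deciding whether a full zero scan is worth
 * starting. Neither function exists in QEMU under these names. */
static bool first_byte_is_zero(const uint8_t *page)
{
    return page[0] == 0;
}

static bool first_word_is_zero(const uint8_t *page)
{
    uint64_t w;
    memcpy(&w, page, sizeof(w));  /* alignment-safe 64-bit load */
    return w == 0;
}
```

For a buffer beginning with the bytes 00 00 00 2a (the value 42 stored
big-endian), first_byte_is_zero() is true but first_word_is_zero() is false.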

> What about using this patch for buffer_is_zero optimization?

buffer_is_zero is used in somewhat special cases (block
streaming/copy-on-read) where throughput doesn't really matter, unlike
is_dup_page/find_zero_bit which are used in migration.  But you can use
similar code for is_dup_page and buffer_is_zero.

BTW, I would like to change is_dup_page to is_zero_page.  Non-zero pages
with a repeated value are virtually non-existent, and perhaps we can
improve the migration format by packing multiple pages (up to 64) in a
single "chunk" (i.e. a small header followed by up to 256K bytes of
data).  I would like to see Orit's patches to optimize RAM migration
first, since this only makes sense after you remove all userspace
copies.  Otherwise, the cost of copying the 4k of data to a buffer will
dominate almost every optimization you can make.

Paolo


* Re: [Qemu-devel] [RFC] optimize is_dup_page for zero pages
  2013-03-12 11:46     ` Paolo Bonzini
@ 2013-03-12 11:51       ` Peter Lieven
  2013-03-12 12:02         ` Paolo Bonzini
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Lieven @ 2013-03-12 11:51 UTC
  To: Paolo Bonzini
  Cc: Kevin Wolf, Stefan Hajnoczi, Orit Wasserman,
	qemu-devel@nongnu.org, peter.maydell


On 12.03.2013 at 12:46, Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 12/03/2013 12:20, Peter Lieven wrote:
>>> * zero pages remain zero, and thus are only processed once
>> 
>> you are right this will be the case.
>> 
>>> 
>>> * non-zero pages are modified often, and thus are processed multiple times.
>>> 
>>> Your patch adds overhead in the case where a page is non-zero, which
>>> will be the common case in any non-artificial benchmark.  It _is_
>>> possible that the net result is positive because you warm the cache with
>>> the first 128 bytes of the page.  But without more benchmarking, it is
>>> reasonable to optimize is_dup_page for the case where the for loop rolls
>>> very few times.
>> 
>> Ok, good point. However, it will only enter the zero check if the first byte (or maybe could change
>> this to first 32 or 64 bit) is zero.
> 
> On big-endian architectures, I expect that the first byte will be zero
> very often.  (32- or 64-bit, much less indeed).
> 
>> What about using this patch for buffer_is_zero optimization?
> 
> buffer_is_zero is used in somewhat special cases (block
> streaming/copy-on-read) where throughput doesn't really matter, unlike
> is_dup_page/find_zero_bit which are used in migration.  But you can use
> similar code for is_dup_page and buffer_is_zero.

OK, I will prepare a patch series for review, for now without touching
is_dup_page(). You can decide later whether to use buffer_is_zero to check
for zero pages (maybe only if the first x bits are zero).

Two comments on changing is_dup_page() to is_zero_page():
a) Would it make sense to only check for zero pages in the first (bulk) round?
b) Would it make sense to not transfer zero pages at all in the first round?
The memory at the target should read as zero (not allocated) anyway.

Peter


* Re: [Qemu-devel] [RFC] optimize is_dup_page for zero pages
  2013-03-12 11:51       ` Peter Lieven
@ 2013-03-12 12:02         ` Paolo Bonzini
  2013-03-12 12:15           ` Peter Lieven
  2013-03-12 20:10           ` Peter Lieven
  0 siblings, 2 replies; 8+ messages in thread
From: Paolo Bonzini @ 2013-03-12 12:02 UTC
  To: Peter Lieven
  Cc: Kevin Wolf, Stefan Hajnoczi, Orit Wasserman,
	qemu-devel@nongnu.org, peter.maydell

On 12/03/2013 12:51, Peter Lieven wrote:
>> > buffer_is_zero is used in somewhat special cases (block
>> > streaming/copy-on-read) where throughput doesn't really matter, unlike
>> > is_dup_page/find_zero_bit which are used in migration.  But you can use
>> > similar code for is_dup_page and buffer_is_zero.
> ok, i will prepare a patch series for review. at the moment without touching
> is_dup_page(). you can decide later if you use buffer_is_zero to check
> for zero pages later (maybe if the first x-bit are zero).
> 
> Two comments on changing is_dup_page() to is_zero_page():
> a) Would it make sense to only check for zero pages in the first (bulk) round?

Interesting idea.  Benchmark it. :)

> b) Would it make sense to not transfer zero pages at all in the first round?

Perhaps yes, but I'm not sure how to efficiently implement it.  There
really isn't a well-specified first round in the RAM migration code.  Of
course you could have another bitmap for known-zero pages.

But zero pages should be rare in real-world testcases, except for
ballooned pages.  The OS should try to use free memory for caches.

> The memory at the target should read as zero (not allocated) anyway.

Paolo


* Re: [Qemu-devel] [RFC] optimize is_dup_page for zero pages
  2013-03-12 12:02         ` Paolo Bonzini
@ 2013-03-12 12:15           ` Peter Lieven
  2013-03-12 20:10           ` Peter Lieven
  1 sibling, 0 replies; 8+ messages in thread
From: Peter Lieven @ 2013-03-12 12:15 UTC
  To: Paolo Bonzini
  Cc: Kevin Wolf, Stefan Hajnoczi, Orit Wasserman,
	qemu-devel@nongnu.org, peter.maydell


On 12.03.2013 at 13:02, Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 12/03/2013 12:51, Peter Lieven wrote:
>>>> buffer_is_zero is used in somewhat special cases (block
>>>> streaming/copy-on-read) where throughput doesn't really matter, unlike
>>>> is_dup_page/find_zero_bit which are used in migration.  But you can use
>>>> similar code for is_dup_page and buffer_is_zero.
>> ok, i will prepare a patch series for review. at the moment without touching
>> is_dup_page(). you can decide later if you use buffer_is_zero to check
>> for zero pages later (maybe if the first x-bit are zero).
>> 
>> Two comments on changing is_dup_page() to is_zero_page():
>> a) Would it make sense to only check for zero pages in the first (bulk) round?
> 
> Interesting idea.  Benchmark it. :)

What approach would you use to test it? It again depends on the load.
If no software running on the VM is zeroing out large areas of memory,
I would bet there is no need to look for dup pages.

> 
>> b) Would it make sense to not transfer zero pages at all in the first round?
> 
> Perhaps yes, but I'm not sure how to efficiently implement it.  There
> really isn't a well-specified first round in the RAM migration code.  Of
> course you could have another bitmap for known-zero pages.

What about this? I used it to limit XBZRLE to the non-bulk stage:

diff --git a/arch_init.c b/arch_init.c
index 1b71912..d48b914 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -326,6 +326,7 @@ static ram_addr_t last_offset;
 static unsigned long *migration_bitmap;
 static uint64_t migration_dirty_pages;
 static uint32_t last_version;
+static bool ram_bulk_stage;
 
 static inline
 ram_addr_t migration_bitmap_find_and_reset_dirty(MemoryRegion *mr,
@@ -433,6 +434,7 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
             if (!block) {
                 block = QTAILQ_FIRST(&ram_list.blocks);
                 complete_round = true;
+                ram_bulk_stage = false;
             }
         } else {
             uint8_t *p;
@@ -536,6 +538,7 @@ static void reset_ram_globals(void)
     last_sent_block = NULL;
     last_offset = 0;
     last_version = ram_list.version;
+    ram_bulk_stage = true;
 }
 
 #define MAX_WAIT 50 /* ms, half buffered_file limit */
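
The effect of the flag can be modelled in isolation. The sketch below is a
simplified stand-in, not the QEMU code: struct ram_scan and its functions are
hypothetical, and the scan runs over a flat page count instead of the RAM
block list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a RAM scan that tracks whether the first full
 * pass (the bulk stage) is still in progress. */
struct ram_scan {
    size_t next_page;
    size_t num_pages;
    bool bulk_stage;
};

/* Mirrors the reset path: a new migration starts in the bulk stage. */
static void ram_scan_reset(struct ram_scan *s, size_t num_pages)
{
    assert(num_pages > 0);
    s->next_page = 0;
    s->num_pages = num_pages;
    s->bulk_stage = true;
}

/* Returns the next page index; wrapping around to the first page ends
 * the bulk stage, as the "complete_round" branch does in the patch. */
static size_t ram_scan_next(struct ram_scan *s)
{
    size_t page = s->next_page++;
    if (s->next_page == s->num_pages) {
        s->next_page = 0;
        s->bulk_stage = false;
    }
    return page;
}
```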

Peter

> 
> But zero pages should be rare in real-world testcases, except for
> ballooned pages.  The OS should try to use free memory for caches.
> 
>> The memory at the target should read as zero (not allocated) anyway.
> 
> Paolo


* Re: [Qemu-devel] [RFC] optimize is_dup_page for zero pages
  2013-03-12 12:02         ` Paolo Bonzini
  2013-03-12 12:15           ` Peter Lieven
@ 2013-03-12 20:10           ` Peter Lieven
  1 sibling, 0 replies; 8+ messages in thread
From: Peter Lieven @ 2013-03-12 20:10 UTC
  To: Paolo Bonzini
  Cc: Kevin Wolf, Stefan Hajnoczi, Orit Wasserman,
	qemu-devel@nongnu.org, peter.maydell

On 12.03.2013 at 13:02, Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 12/03/2013 12:51, Peter Lieven wrote:
>>>> buffer_is_zero is used in somewhat special cases (block
>>>> streaming/copy-on-read) where throughput doesn't really matter, unlike
>>>> is_dup_page/find_zero_bit which are used in migration.  But you can use
>>>> similar code for is_dup_page and buffer_is_zero.
>> ok, i will prepare a patch series for review. at the moment without touching
>> is_dup_page(). you can decide later if you use buffer_is_zero to check
>> for zero pages later (maybe if the first x-bit are zero).
>> 
>> Two comments on changing is_dup_page() to is_zero_page():
>> a) Would it make sense to only check for zero pages in the first (bulk) round?
> 
> Interesting idea.  Benchmark it. :)

After thinking about Windows VMs, where all freed memory is zeroed out, I would
suggest the following:

a) Drop is_dup_page() and use buffer_is_zero(), with a small optimization:
buffer_is_zero() checks whether the first long is zero before unrolling up to
128 bytes (with the latest patches I have sent).
b) Always check for zero pages, but do not send them in the bulk stage. Even
with an madvise with QEMU_MADV_DONTNEED, I have observed that the target starts
swapping if the memory is overcommitted. It seems that the pages are dropped
asynchronously. If they are not sent at all, there is no issue. You can test
this simply by creating a VM with more memory than you physically have and
migrating it. While the source VM will not use a large resident size, the
target VM will use the full size, at least temporarily.
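
The policy in (b) can be sketched as a small decision function. This is an
illustration built on assumptions: the enum, the function name, and the notion
of a compact zero marker for later rounds are stand-ins rather than QEMU's
actual migration code, and the sketch presumes that the destination's RAM
initially reads as zero.

```c
#include <stdbool.h>

/* Hypothetical outcomes for a page visited by the migration loop. */
enum page_action {
    PAGE_SKIP,              /* send nothing; destination already reads zero */
    PAGE_SEND_ZERO_MARKER,  /* tell the destination to zero the page */
    PAGE_SEND_DATA          /* send the full page contents */
};

/* Chooses what to do with a page given its content and the scan stage. */
static enum page_action choose_page_action(bool page_is_zero, bool bulk_stage)
{
    if (!page_is_zero) {
        return PAGE_SEND_DATA;
    }
    return bulk_stage ? PAGE_SKIP : PAGE_SEND_ZERO_MARKER;
}
```

After the bulk stage a zero page still has to be signalled, because the
destination may hold older non-zero contents for it from an earlier pass.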

> 
>> b) Would it make sense to not transfer zero pages at all in the first round?
> 
> Perhaps yes, but I'm not sure how to efficiently implement it.  There
> really isn't a well-specified first round in the RAM migration code.  Of
> course you could have another bitmap for known-zero pages.

Please have a look at

[PATCH 3/9] migration: add an indicator for bulk state of ram migration

Peter


Thread overview: 8+ messages
2013-03-12 10:51 [Qemu-devel] [RFC] optimize is_dup_page for zero pages Peter Lieven
2013-03-12 11:11 ` Paolo Bonzini
2013-03-12 11:20   ` Peter Lieven
2013-03-12 11:46     ` Paolo Bonzini
2013-03-12 11:51       ` Peter Lieven
2013-03-12 12:02         ` Paolo Bonzini
2013-03-12 12:15           ` Peter Lieven
2013-03-12 20:10           ` Peter Lieven
