* [Qemu-devel] [PATCH] migration: vectorize is_dup_page
@ 2011-12-06 17:25 Paolo Bonzini
2011-12-20 14:13 ` Anthony Liguori
2011-12-20 15:24 ` Avi Kivity
0 siblings, 2 replies; 4+ messages in thread
From: Paolo Bonzini @ 2011-12-06 17:25 UTC
To: qemu-devel

is_dup_page already proceeds in 32-bit chunks. Changing it to 16-byte
chunks using Altivec or SSE is easy, and provides a noticeable improvement.
Pierre Riteau measured 30->25 seconds on a 16GB guest, I measured 4.6->3.9
seconds on a 6GB guest (best of three times for me; dunno for Pierre).
Both of them are approximately a 15% improvement.

I tried playing with non-temporal prefetches, but I did not get any
improvement (though I did get fewer cache misses, so the patch was doing
its job).

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
arch_init.c | 28 ++++++++++++++++++++++------
1 files changed, 22 insertions(+), 6 deletions(-)
diff --git a/arch_init.c b/arch_init.c
index cdad805..473df2d 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -94,14 +94,30 @@ const uint32_t arch_type = QEMU_ARCH;
 #define RAM_SAVE_FLAG_EOS 0x10
 #define RAM_SAVE_FLAG_CONTINUE 0x20
 
-static int is_dup_page(uint8_t *page, uint8_t ch)
+#if __ALTIVEC__
+#include <altivec.h>
+#define VECTYPE vector unsigned char
+#define SPLAT(p) vec_splat(vec_ld(0, p), 0)
+#define ALL_EQ(v1, v2) vec_all_eq(v1, v2)
+#elif __SSE2__
+#include <emmintrin.h>
+#define VECTYPE __m128i
+#define SPLAT(p) _mm_set1_epi8(*(p))
+#define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
+#else
+#define VECTYPE unsigned long
+#define SPLAT(p) (*(p) * (~0UL / 255))
+#define ALL_EQ(v1, v2) ((v1) == (v2))
+#endif
+
+static int is_dup_page(uint8_t *page)
 {
-    uint32_t val = ch << 24 | ch << 16 | ch << 8 | ch;
-    uint32_t *array = (uint32_t *)page;
+    VECTYPE *p = (VECTYPE *)page;
+    VECTYPE val = SPLAT(p);
     int i;
 
-    for (i = 0; i < (TARGET_PAGE_SIZE / 4); i++) {
-        if (array[i] != val) {
+    for (i = 0; i < TARGET_PAGE_SIZE / sizeof(VECTYPE); i++) {
+        if (!ALL_EQ(val, p[i])) {
             return 0;
         }
     }
@@ -136,7 +152,7 @@ static int ram_save_block(QEMUFile *f)
 
             p = block->host + offset;
 
-            if (is_dup_page(p, *p)) {
+            if (is_dup_page(p)) {
                 qemu_put_be64(f, offset | cont | RAM_SAVE_FLAG_COMPRESS);
                 if (!cont) {
                     qemu_put_byte(f, strlen(block->idstr));
--
1.7.7.1
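
The portable fallback in the patch builds its comparison value by multiplying a byte by `~0UL / 255`. A minimal sketch of that trick in isolation follows; the helper name `splat_byte` is hypothetical and does not appear in the patch. The constant `~0UL / 255` is 0x01 repeated in every byte of the word, so the product copies the byte into each byte position.

```c
#include <stdint.h>

/*
 * Replicate one byte into every byte of an unsigned long.
 * ~0UL / 255 evaluates to 0x0101...01 (one 0x01 per byte), so multiplying
 * it by a value in 0..255 copies that value into each byte position
 * without any carry between positions.
 * The name splat_byte is illustrative only; it is not part of the patch.
 */
static unsigned long splat_byte(uint8_t b)
{
    return b * (~0UL / 255);
}
```

The trick only works when the operand is a single byte. Applied to a full word, as happens if the macro is handed a dereferenced `unsigned long *`, the multiplication no longer produces a replicated pattern.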
* Re: [Qemu-devel] [PATCH] migration: vectorize is_dup_page
From: Anthony Liguori @ 2011-12-20 14:13 UTC
To: Paolo Bonzini; +Cc: qemu-devel
On 12/06/2011 11:25 AM, Paolo Bonzini wrote:
> is_dup_page is already proceeding in 32-bit chunks. Changing it to 16
> bytes using Altivec or SSE is easy, and provides a noticeable improvement.
> Pierre Riteau measured 30->25 seconds on a 16GB guest, I measured 4.6->3.9
> seconds on a 6GB guest (best of three times for me; dunno for Pierre).
> Both of them are approximately a 15% improvement.
>
> I tried playing with non-temporal prefetches, but I did not get any
> improvement (though I did get less cache misses, so the patch was doing
> its job).
>
> Signed-off-by: Paolo Bonzini<pbonzini@redhat.com>
> ---
> arch_init.c | 28 ++++++++++++++++++++++------
> 1 files changed, 22 insertions(+), 6 deletions(-)
>
> diff --git a/arch_init.c b/arch_init.c
> index cdad805..473df2d 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -94,14 +94,30 @@ const uint32_t arch_type = QEMU_ARCH;
> #define RAM_SAVE_FLAG_EOS 0x10
> #define RAM_SAVE_FLAG_CONTINUE 0x20
>
> -static int is_dup_page(uint8_t *page, uint8_t ch)
> +#if __ALTIVEC__

I think you want #ifdefs here and possibly below:

CC x86_64-softmmu/arch_init.o
cc1: warnings being treated as errors
/home/anthony/git/qemu/arch_init.c:97:5: error: "__ALTIVEC__" is not defined
/home/anthony/git/qemu/arch_init.c: In function ‘is_dup_page’:
/home/anthony/git/qemu/arch_init.c:116:5: error: incompatible type for argument 1 of ‘_mm_set1_epi8’
/usr/lib/x86_64-linux-gnu/gcc/x86_64-linux-gnu/4.5.2/include/emmintrin.h:636:1: note: expected ‘char’ but argument is of type ‘__m128i’

Regards,

Anthony Liguori

> +#include<altivec.h>
> +#define VECTYPE vector unsigned char
> +#define SPLAT(p) vec_splat(vec_ld(0, p), 0)
> +#define ALL_EQ(v1, v2) vec_all_eq(v1, v2)
> +#elif __SSE2__
> +#include<emmintrin.h>
> +#define VECTYPE __m128i
> +#define SPLAT(p) _mm_set1_epi8(*(p))
> +#define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
> +#else
> +#define VECTYPE unsigned long
> +#define SPLAT(p) (*(p) * (~0UL / 255))
> +#define ALL_EQ(v1, v2) ((v1) == (v2))
> +#endif
> +
> +static int is_dup_page(uint8_t *page)
> {
> - uint32_t val = ch<< 24 | ch<< 16 | ch<< 8 | ch;
> - uint32_t *array = (uint32_t *)page;
> + VECTYPE *p = (VECTYPE *)page;
> + VECTYPE val = SPLAT(p);
> int i;
>
> - for (i = 0; i< (TARGET_PAGE_SIZE / 4); i++) {
> - if (array[i] != val) {
> + for (i = 0; i< TARGET_PAGE_SIZE / sizeof(VECTYPE); i++) {
> + if (!ALL_EQ(val, p[i])) {
> return 0;
> }
> }
> @@ -136,7 +152,7 @@ static int ram_save_block(QEMUFile *f)
>
> p = block->host + offset;
>
> - if (is_dup_page(p, *p)) {
> + if (is_dup_page(p)) {
> qemu_put_be64(f, offset | cont | RAM_SAVE_FLAG_COMPRESS);
> if (!cont) {
> qemu_put_byte(f, strlen(block->idstr));
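
One way the two compiler errors reported above could be addressed is sketched here; it is not the revised patch. It tests the feature macro with `defined()` so that an undefined macro is not an error under `-Wundef`, and it splats the first *byte* of the page instead of dereferencing a `VECTYPE` pointer. Assumptions: `SKETCH_PAGE_SIZE` stands in for `TARGET_PAGE_SIZE`, the `LOAD` macro and `load_word` helper are hypothetical additions, unaligned loads are used so the sketch works on any buffer (QEMU's pages are aligned, so aligned loads would also do), and the AltiVec branch is omitted.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096  /* assumption: stands in for TARGET_PAGE_SIZE */

#if defined(__SSE2__)
#include <emmintrin.h>
#define VECTYPE        __m128i
/* Splat takes a byte pointer, so _mm_set1_epi8 receives a char. */
#define SPLAT(bytep)   _mm_set1_epi8((char)*(bytep))
#define LOAD(p)        _mm_loadu_si128((const __m128i *)(p))
#define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
#else
#define VECTYPE        unsigned long
#define SPLAT(bytep)   (*(bytep) * (~0UL / 255))
/* Hypothetical helper: read one word without alignment assumptions. */
static unsigned long load_word(const uint8_t *p)
{
    unsigned long v;
    memcpy(&v, p, sizeof v);
    return v;
}
#define LOAD(p)        load_word(p)
#define ALL_EQ(v1, v2) ((v1) == (v2))
#endif

/* Return 1 if every byte of the page equals its first byte, else 0. */
static int is_dup_page(const uint8_t *page)
{
    VECTYPE val = SPLAT(page);
    size_t i;

    for (i = 0; i < SKETCH_PAGE_SIZE; i += sizeof(VECTYPE)) {
        if (!ALL_EQ(val, LOAD(page + i))) {
            return 0;
        }
    }
    return 1;
}
```

The same byte-pointer change matters for the `unsigned long` fallback, since multiplying a whole word by `~0UL / 255` does not replicate its first byte.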
* Re: [Qemu-devel] [PATCH] migration: vectorize is_dup_page
From: Avi Kivity @ 2011-12-20 15:24 UTC
To: Paolo Bonzini; +Cc: qemu-devel
On 12/06/2011 07:25 PM, Paolo Bonzini wrote:
> is_dup_page is already proceeding in 32-bit chunks. Changing it to 16
> bytes using Altivec or SSE is easy, and provides a noticeable improvement.
> Pierre Riteau measured 30->25 seconds on a 16GB guest, I measured 4.6->3.9
> seconds on a 6GB guest (best of three times for me; dunno for Pierre).
> Both of them are approximately a 15% improvement.
>
> I tried playing with non-temporal prefetches, but I did not get any
> improvement (though I did get less cache misses, so the patch was doing
> its job).

It's worthwhile anyway IMO.

>
> +static int is_dup_page(uint8_t *page)
> {
> - uint32_t val = ch << 24 | ch << 16 | ch << 8 | ch;
> - uint32_t *array = (uint32_t *)page;
> + VECTYPE *p = (VECTYPE *)page;
> + VECTYPE val = SPLAT(p);
>

I think you can drop the SPLAT and just compare against zero. Full page
repeats of anything but zero are unlikely, so we can simplify the code a
bit here. If we do go with non-temporal loads, it saves an additional miss.

--
error compiling committee.c: too many arguments to function
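
A minimal sketch of the zero-only comparison suggested above, under stated assumptions: the name `is_zero_page` and the constant `SKETCH_PAGE_SIZE` (standing in for `TARGET_PAGE_SIZE`) are hypothetical, only SSE2 and a plain-word fallback are shown, and unaligned loads are used so the function accepts any buffer. Because the comparison value is a constant, the first byte of the page does not need to be read up front to build it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define SKETCH_PAGE_SIZE 4096  /* assumption: stands in for TARGET_PAGE_SIZE */

/* Return 1 if the whole page is zero, else 0. Hypothetical name. */
static int is_zero_page(const uint8_t *page)
{
    size_t i;

#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();

    for (i = 0; i < SKETCH_PAGE_SIZE; i += sizeof(__m128i)) {
        __m128i v = _mm_loadu_si128((const __m128i *)(page + i));
        /* movemask is 0xFFFF only when all 16 bytes compared equal to zero */
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero)) != 0xFFFF) {
            return 0;
        }
    }
#else
    for (i = 0; i < SKETCH_PAGE_SIZE; i += sizeof(unsigned long)) {
        unsigned long v;
        memcpy(&v, page + i, sizeof v);
        if (v != 0) {
            return 0;
        }
    }
#endif
    return 1;
}
```

The trade-off is that a page filled with a single non-zero byte would no longer be detected as a duplicate, which the message above argues is an unlikely case.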
* Re: [Qemu-devel] [PATCH] migration: vectorize is_dup_page
From: Paolo Bonzini @ 2011-12-20 15:45 UTC
To: Avi Kivity; +Cc: qemu-devel
On 12/20/2011 04:24 PM, Avi Kivity wrote:
> On 12/06/2011 07:25 PM, Paolo Bonzini wrote:
>> is_dup_page is already proceeding in 32-bit chunks. Changing it to 16
>> bytes using Altivec or SSE is easy, and provides a noticeable improvement.
>> Pierre Riteau measured 30->25 seconds on a 16GB guest, I measured 4.6->3.9
>> seconds on a 6GB guest (best of three times for me; dunno for Pierre).
>> Both of them are approximately a 15% improvement.
>>
>> I tried playing with non-temporal prefetches, but I did not get any
>> improvement (though I did get less cache misses, so the patch was doing
>> its job).
>
> It's worthwhile anyway IMO.

The problem is that if the page is not a dup (the common case), you'll get
all the cache misses anyway when you send it over the socket. So what I
did was add a 4k buffer (the same one for all pages), and make is_dup_page
copy the page to it. Because the prefetches are non-temporal, you only
use 4k of cache. But the code is more complex and less reusable, it
incurs an extra copy, and it cannot leave is_dup_page early.

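
The bounce-buffer experiment described above was not posted, so the following is only a rough reconstruction of the idea, not the code that was measured. Assumptions: the names `bounce`, `is_dup_page_copy`, `SKETCH_PAGE_SIZE` and `SKETCH_CHUNK` are hypothetical, a 64-byte cache line is assumed, and GCC's `__builtin_prefetch` with locality 0 is used as the non-temporal hint (whether it actually limits cache pollution depends on the CPU).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096  /* assumption: stands in for TARGET_PAGE_SIZE */
#define SKETCH_CHUNK     64    /* assumption: cache-line size */

/* One buffer shared by all pages, so only 4k of cache stays hot. */
static uint8_t bounce[SKETCH_PAGE_SIZE];

/*
 * Copy the page into the bounce buffer and report whether every byte
 * equals the first one. The whole page is always copied, so there is
 * no early exit on the first mismatch.
 */
static int is_dup_page_copy(const uint8_t *page)
{
    const uint8_t first = page[0];
    unsigned diff = 0;
    size_t off, j;

    for (off = 0; off < SKETCH_PAGE_SIZE; off += SKETCH_CHUNK) {
#if defined(__GNUC__)
        /* Hint the next line as a read with no temporal locality. */
        if (off + SKETCH_CHUNK < SKETCH_PAGE_SIZE) {
            __builtin_prefetch(page + off + SKETCH_CHUNK, 0, 0);
        }
#endif
        memcpy(bounce + off, page + off, SKETCH_CHUNK);
        for (j = 0; j < SKETCH_CHUNK; j++) {
            diff |= (unsigned)(bounce[off + j] ^ first);
        }
    }
    return diff == 0;
}
```

A caller would then send the contents of `bounce` rather than the guest page, which is where the extra copy and the loss of the early return come from.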
>> +static int is_dup_page(uint8_t *page)
>> {
>> - uint32_t val = ch<< 24 | ch<< 16 | ch<< 8 | ch;
>> - uint32_t *array = (uint32_t *)page;
>> + VECTYPE *p = (VECTYPE *)page;
>> + VECTYPE val = SPLAT(p);
>>
>
> I think you can drop the SPLAT and just compare against zero. Full page
> repeats of anything but zero are unlikely, so we can simplify the code a
> bit here. If we do go with non-temporal loads, it saves an additional miss.

Yeah, with non-temporal loads that would make sense.

Paolo