qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH v3 0/1]  ARM64: Live migration optimization
@ 2016-06-29  8:47 vijayak
  2016-06-29  8:47 ` [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking vijayak
  0 siblings, 1 reply; 8+ messages in thread
From: vijayak @ 2016-06-29  8:47 UTC (permalink / raw)
  To: qemu-arm, peter.maydell, pbonzini
  Cc: qemu-devel, Prasun.Kapoor, vijay.kilari, Vijaya Kumar K

From: Vijaya Kumar K <vijayak@caviumnetworks.com>

To optimize Live migration time on ARM64 machine,
Neon instructions are used for Zero page checking.

With these changes, total migration time comes down
from 3.5 seconds to 2.9 seconds.

These patches are tested on top of (GICv3 live migration support)
https://lists.gnu.org/archive/html/qemu-devel/2015-10/msg05284.html
However there is no direct dependency on these patches.

v2 -> v3 changes:
  - Dropped Thunderx specific patches(2) from this series. Will
    be added on kernel exposing midr register to userspace.
  - Used generic zero page checking function. Only macros
    are updated.

v1 -> v2 changes:
----------------
  - Dropped 'target-arm: Update page size for aarch64' patch.
  - Each loop in zero buffer check function is reduced to
    16 from 32.
  - Replaced vorrq_u64 with '|' in Neon macros
  - Renamed local variable to reflect 128 bit.
  - Introduced new file cpuinfo.c to parse /proc/cpuinfo
  - Added Thunderx specific patches to add prefetch in
    zero buffer check function.

Vijay (1):
  target-arm: Use Neon for zero checking

 util/cutils.c |    7 +++++++
 1 file changed, 7 insertions(+)

-- 
1.7.9.5

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking
  2016-06-29  8:47 [Qemu-devel] [PATCH v3 0/1] ARM64: Live migration optimization vijayak
@ 2016-06-29  8:47 ` vijayak
  2016-06-29 12:53   ` Paolo Bonzini
  2016-06-30 13:45   ` Peter Maydell
  0 siblings, 2 replies; 8+ messages in thread
From: vijayak @ 2016-06-29  8:47 UTC (permalink / raw)
  To: qemu-arm, peter.maydell, pbonzini
  Cc: qemu-devel, Prasun.Kapoor, vijay.kilari, Vijay, Suresh

From: Vijay <vijayak@cavium.com>

Use Neon instructions to perform zero checking of
buffer. This is helps in reducing total migration time.

Use case: Idle VM live migration with 4 VCPUS and 8GB ram
running CentOS 7.

Without Neon, the Total migration time is 3.5 Sec

Migration status: completed
total time: 3560 milliseconds
downtime: 33 milliseconds
setup: 5 milliseconds
transferred ram: 297907 kbytes
throughput: 685.76 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2062760 pages
skipped: 0 pages
normal: 69808 pages
normal bytes: 279232 kbytes
dirty sync count: 3

With Neon, the total migration time is 2.9 Sec

Migration status: completed
total time: 2960 milliseconds
downtime: 65 milliseconds
setup: 4 milliseconds
transferred ram: 299869 kbytes
throughput: 830.19 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2064313 pages
skipped: 0 pages
normal: 70294 pages
normal bytes: 281176 kbytes
dirty sync count: 3

Signed-off-by: Vijaya Kumar K <vijayak@cavium.com>
Signed-off-by: Suresh <ksuresh@cavium.com>
---
 util/cutils.c |    7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/util/cutils.c b/util/cutils.c
index 5830a68..4779403 100644
--- a/util/cutils.c
+++ b/util/cutils.c
@@ -184,6 +184,13 @@ int qemu_fdatasync(int fd)
 #define SPLAT(p)       _mm_set1_epi8(*(p))
 #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
 #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
+#elif __aarch64__
+#include "arm_neon.h"
+#define VECTYPE        uint64x2_t
+#define ALL_EQ(v1, v2) \
+        ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
+         (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
+#define VEC_OR(v1, v2) ((v1) | (v2))
 #else
 #define VECTYPE        unsigned long
 #define SPLAT(p)       (*(p) * (~0UL / 255))
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking
  2016-06-29  8:47 ` [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking vijayak
@ 2016-06-29 12:53   ` Paolo Bonzini
  2016-06-30 13:45   ` Peter Maydell
  1 sibling, 0 replies; 8+ messages in thread
From: Paolo Bonzini @ 2016-06-29 12:53 UTC (permalink / raw)
  To: vijayak, qemu-arm, peter.maydell
  Cc: qemu-devel, Prasun.Kapoor, vijay.kilari, Suresh



On 29/06/2016 10:47, vijayak@cavium.com wrote:
> From: Vijay <vijayak@cavium.com>
> 
> Use Neon instructions to perform zero checking of
> buffer. This is helps in reducing total migration time.
> 
> Use case: Idle VM live migration with 4 VCPUS and 8GB ram
> running CentOS 7.
> 
> Without Neon, the Total migration time is 3.5 Sec
> 
> Migration status: completed
> total time: 3560 milliseconds
> downtime: 33 milliseconds
> setup: 5 milliseconds
> transferred ram: 297907 kbytes
> throughput: 685.76 mbps
> remaining ram: 0 kbytes
> total ram: 8519872 kbytes
> duplicate: 2062760 pages
> skipped: 0 pages
> normal: 69808 pages
> normal bytes: 279232 kbytes
> dirty sync count: 3
> 
> With Neon, the total migration time is 2.9 Sec
> 
> Migration status: completed
> total time: 2960 milliseconds
> downtime: 65 milliseconds
> setup: 4 milliseconds
> transferred ram: 299869 kbytes
> throughput: 830.19 mbps
> remaining ram: 0 kbytes
> total ram: 8519872 kbytes
> duplicate: 2064313 pages
> skipped: 0 pages
> normal: 70294 pages
> normal bytes: 281176 kbytes
> dirty sync count: 3
> 
> Signed-off-by: Vijaya Kumar K <vijayak@cavium.com>
> Signed-off-by: Suresh <ksuresh@cavium.com>
> ---
>  util/cutils.c |    7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/util/cutils.c b/util/cutils.c
> index 5830a68..4779403 100644
> --- a/util/cutils.c
> +++ b/util/cutils.c
> @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd)
>  #define SPLAT(p)       _mm_set1_epi8(*(p))
>  #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
>  #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
> +#elif __aarch64__
> +#include "arm_neon.h"
> +#define VECTYPE        uint64x2_t
> +#define ALL_EQ(v1, v2) \
> +        ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
> +         (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
> +#define VEC_OR(v1, v2) ((v1) | (v2))
>  #else
>  #define VECTYPE        unsigned long
>  #define SPLAT(p)       (*(p) * (~0UL / 255))
> 

Acked-by: Paolo Bonzini <pbonzini@redhat.com>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking
  2016-06-29  8:47 ` [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking vijayak
  2016-06-29 12:53   ` Paolo Bonzini
@ 2016-06-30 13:45   ` Peter Maydell
  2016-07-01 22:07     ` Richard Henderson
  1 sibling, 1 reply; 8+ messages in thread
From: Peter Maydell @ 2016-06-30 13:45 UTC (permalink / raw)
  To: Vijay
  Cc: qemu-arm, Paolo Bonzini, QEMU Developers, Prasun.Kapoor,
	Vijay Kilari, Suresh

On 29 June 2016 at 09:47,  <vijayak@cavium.com> wrote:
> From: Vijay <vijayak@cavium.com>
>
> Use Neon instructions to perform zero checking of
> buffer. This is helps in reducing total migration time.

> diff --git a/util/cutils.c b/util/cutils.c
> index 5830a68..4779403 100644
> --- a/util/cutils.c
> +++ b/util/cutils.c
> @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd)
>  #define SPLAT(p)       _mm_set1_epi8(*(p))
>  #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
>  #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
> +#elif __aarch64__
> +#include "arm_neon.h"
> +#define VECTYPE        uint64x2_t
> +#define ALL_EQ(v1, v2) \
> +        ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
> +         (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
> +#define VEC_OR(v1, v2) ((v1) | (v2))

Should be '#elif defined(__aarch64__)'. I have made this
tweak and put this patch in target-arm.next.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking
  2016-06-30 13:45   ` Peter Maydell
@ 2016-07-01 22:07     ` Richard Henderson
  2016-07-02  9:42       ` Peter Maydell
  2016-07-05 12:24       ` Vijay Kilari
  0 siblings, 2 replies; 8+ messages in thread
From: Richard Henderson @ 2016-07-01 22:07 UTC (permalink / raw)
  To: Peter Maydell, Vijay
  Cc: Prasun.Kapoor, Vijay Kilari, Suresh, QEMU Developers, qemu-arm,
	Paolo Bonzini

On 06/30/2016 06:45 AM, Peter Maydell wrote:
> On 29 June 2016 at 09:47,  <vijayak@cavium.com> wrote:
>> From: Vijay <vijayak@cavium.com>
>>
>> Use Neon instructions to perform zero checking of
>> buffer. This is helps in reducing total migration time.
>
>> diff --git a/util/cutils.c b/util/cutils.c
>> index 5830a68..4779403 100644
>> --- a/util/cutils.c
>> +++ b/util/cutils.c
>> @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd)
>>  #define SPLAT(p)       _mm_set1_epi8(*(p))
>>  #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
>>  #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
>> +#elif __aarch64__
>> +#include "arm_neon.h"
>> +#define VECTYPE        uint64x2_t
>> +#define ALL_EQ(v1, v2) \
>> +        ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
>> +         (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
>> +#define VEC_OR(v1, v2) ((v1) | (v2))
>
> Should be '#elif defined(__aarch64__)'. I have made this
> tweak and put this patch in target-arm.next.

Consider

#define VECTYPE        uint32x4_t
#define ALL_EQ(v1, v2) (vmaxvq_u32((v1) ^ (v2)) == 0)


which compiles down to

   1c:	6e211c00 	eor	v0.16b, v0.16b, v1.16b
   20:	6eb0a800 	umaxv	s0, v0.4s
   24:	1e260000 	fmov	w0, s0
   28:	6b1f001f 	cmp	w0, wzr
   2c:	1a9f17e0 	cset	w0, eq
   30:	d65f03c0 	ret

vs

   34:	4e083c20 	mov	x0, v1.d[0]
   38:	4e083c01 	mov	x1, v0.d[0]
   3c:	eb00003f 	cmp	x1, x0
   40:	52800000 	mov	w0, #0
   44:	54000040 	b.eq	4c <f0+0x18>
   48:	d65f03c0 	ret
   4c:	4e183c20 	mov	x0, v1.d[1]
   50:	4e183c01 	mov	x1, v0.d[1]
   54:	eb00003f 	cmp	x1, x0
   58:	1a9f17e0 	cset	w0, eq
   5c:	d65f03c0 	ret


r~

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking
  2016-07-01 22:07     ` Richard Henderson
@ 2016-07-02  9:42       ` Peter Maydell
  2016-07-05 12:24       ` Vijay Kilari
  1 sibling, 0 replies; 8+ messages in thread
From: Peter Maydell @ 2016-07-02  9:42 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Vijay, Prasun.Kapoor, Vijay Kilari, Suresh, QEMU Developers,
	qemu-arm, Paolo Bonzini

On 1 July 2016 at 23:07, Richard Henderson <rth@twiddle.net> wrote:
> On 06/30/2016 06:45 AM, Peter Maydell wrote:
>>
>> On 29 June 2016 at 09:47,  <vijayak@cavium.com> wrote:
>>>
>>> From: Vijay <vijayak@cavium.com>
>>>
>>> Use Neon instructions to perform zero checking of
>>> buffer. This is helps in reducing total migration time.
>>
>>
>>> diff --git a/util/cutils.c b/util/cutils.c
>>> index 5830a68..4779403 100644
>>> --- a/util/cutils.c
>>> +++ b/util/cutils.c
>>> @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd)
>>>  #define SPLAT(p)       _mm_set1_epi8(*(p))
>>>  #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) ==
>>> 0xFFFF)
>>>  #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
>>> +#elif __aarch64__
>>> +#include "arm_neon.h"
>>> +#define VECTYPE        uint64x2_t
>>> +#define ALL_EQ(v1, v2) \
>>> +        ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
>>> +         (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
>>> +#define VEC_OR(v1, v2) ((v1) | (v2))
>>
>>
>> Should be '#elif defined(__aarch64__)'. I have made this
>> tweak and put this patch in target-arm.next.
>
>
> Consider
>
> #define VECTYPE        uint32x4_t
> #define ALL_EQ(v1, v2) (vmaxvq_u32((v1) ^ (v2)) == 0)

Sounds good. Vijay, could you benchmark that variant, please?

thanks
-- PMM

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking
  2016-07-01 22:07     ` Richard Henderson
  2016-07-02  9:42       ` Peter Maydell
@ 2016-07-05 12:24       ` Vijay Kilari
  2016-07-11 17:55         ` Peter Maydell
  1 sibling, 1 reply; 8+ messages in thread
From: Vijay Kilari @ 2016-07-05 12:24 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Peter Maydell, Vijay, prasun.kapoor, Suresh, QEMU Developers,
	qemu-arm, Paolo Bonzini

On Sat, Jul 2, 2016 at 3:37 AM, Richard Henderson <rth@twiddle.net> wrote:
> On 06/30/2016 06:45 AM, Peter Maydell wrote:
>>
>> On 29 June 2016 at 09:47,  <vijayak@cavium.com> wrote:
>>>
>>> From: Vijay <vijayak@cavium.com>
>>>
>>> Use Neon instructions to perform zero checking of
>>> buffer. This is helps in reducing total migration time.
>>
>>
>>> diff --git a/util/cutils.c b/util/cutils.c
>>> index 5830a68..4779403 100644
>>> --- a/util/cutils.c
>>> +++ b/util/cutils.c
>>> @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd)
>>>  #define SPLAT(p)       _mm_set1_epi8(*(p))
>>>  #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) ==
>>> 0xFFFF)
>>>  #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
>>> +#elif __aarch64__
>>> +#include "arm_neon.h"
>>> +#define VECTYPE        uint64x2_t
>>> +#define ALL_EQ(v1, v2) \
>>> +        ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
>>> +         (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
>>> +#define VEC_OR(v1, v2) ((v1) | (v2))
>>
>>
>> Should be '#elif defined(__aarch64__)'. I have made this
>> tweak and put this patch in target-arm.next.
>
>
> Consider
>
> #define VECTYPE        uint32x4_t
> #define ALL_EQ(v1, v2) (vmaxvq_u32((v1) ^ (v2)) == 0)
>
>
> which compiles down to
>
>   1c:   6e211c00        eor     v0.16b, v0.16b, v1.16b
>   20:   6eb0a800        umaxv   s0, v0.4s
>   24:   1e260000        fmov    w0, s0
>   28:   6b1f001f        cmp     w0, wzr
>   2c:   1a9f17e0        cset    w0, eq
>   30:   d65f03c0        ret

For me this code compiles as below and migration time is ~100ms more.

See below 3 trails of migration time

  7039cc:       6eb0a800        umaxv   s0, v0.4s
  7039d0:       0e043c02        mov     w2, v0.s[0]
  7039d4:       350000c2        cbnz    w2, 7039ec <f0+0xf4>
  7039d8:       91002084        add     x4, x4, #0x8
  7039dc:       91020063        add     x3, x3, #0x80
  7039e0:       eb01009f        cmp     x4, x1

(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 3070 milliseconds
downtime: 55 milliseconds
setup: 4 milliseconds
transferred ram: 300637 kbytes
throughput: 802.49 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2062834 pages
skipped: 0 pages
normal: 70489 pages
normal bytes: 281956 kbytes
dirty sync count: 3

(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 3067 milliseconds
downtime: 47 milliseconds
setup: 5 milliseconds
transferred ram: 290277 kbytes
throughput: 775.61 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2064185 pages
skipped: 0 pages
normal: 67901 pages
normal bytes: 271604 kbytes
dirty sync count: 3
(qemu)

(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 3067 milliseconds
downtime: 34 milliseconds
setup: 5 milliseconds
transferred ram: 294614 kbytes
throughput: 787.19 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2063365 pages
skipped: 0 pages
normal: 68985 pages
normal bytes: 275940 kbytes
dirty sync count: 3

>
> vs
>
>   34:   4e083c20        mov     x0, v1.d[0]
>   38:   4e083c01        mov     x1, v0.d[0]
>   3c:   eb00003f        cmp     x1, x0
>   40:   52800000        mov     w0, #0
>   44:   54000040        b.eq    4c <f0+0x18>
>   48:   d65f03c0        ret
>   4c:   4e183c20        mov     x0, v1.d[1]
>   50:   4e183c01        mov     x1, v0.d[1]
>   54:   eb00003f        cmp     x1, x0
>   58:   1a9f17e0        cset    w0, eq
>   5c:   d65f03c0        ret
>

My patch compiles to below code and takes ~100ms less time

#define VECTYPE        uint64x2_t
#define ALL_EQ(v1, v2) \
        ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
         (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))

  7039d0:       4e083c02        mov     x2, v0.d[0]
  7039d4:       b5000102        cbnz    x2, 7039f4 <f0+0xfc>
  7039d8:       4e183c02        mov     x2, v0.d[1]
  7039dc:       b50000c2        cbnz    x2, 7039f4 <f0+0xfc>
  7039e0:       91002084        add     x4, x4, #0x8
  7039e4:       91020063        add     x3, x3, #0x80
  7039e8:       eb04003f        cmp     x1, x4

capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 2973 milliseconds
downtime: 67 milliseconds
setup: 5 milliseconds
transferred ram: 293659 kbytes
throughput: 809.45 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2062791 pages
skipped: 0 pages
normal: 68748 pages
normal bytes: 274992 kbytes
dirty sync count: 3
(qemu)

capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 2972 milliseconds
downtime: 47 milliseconds
setup: 5 milliseconds
transferred ram: 295972 kbytes
throughput: 816.10 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2062861 pages
skipped: 0 pages
normal: 69325 pages
normal bytes: 277300 kbytes
dirty sync count: 3
(qemu)

capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 2982 milliseconds
downtime: 40 milliseconds
setup: 5 milliseconds
transferred ram: 293386 kbytes
throughput: 806.26 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2063199 pages
skipped: 0 pages
normal: 68679 pages
normal bytes: 274716 kbytes
dirty sync count: 4
(qemu)

Regards
Vijay

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking
  2016-07-05 12:24       ` Vijay Kilari
@ 2016-07-11 17:55         ` Peter Maydell
  0 siblings, 0 replies; 8+ messages in thread
From: Peter Maydell @ 2016-07-11 17:55 UTC (permalink / raw)
  To: Vijay Kilari
  Cc: Richard Henderson, Vijay, prasun.kapoor, Suresh, QEMU Developers,
	qemu-arm, Paolo Bonzini

On 5 July 2016 at 13:24, Vijay Kilari <vijay.kilari@gmail.com> wrote:
> On Sat, Jul 2, 2016 at 3:37 AM, Richard Henderson <rth@twiddle.net> wrote:
>> Consider
>>
>> #define VECTYPE        uint32x4_t
>> #define ALL_EQ(v1, v2) (vmaxvq_u32((v1) ^ (v2)) == 0)
>>
>>
>> which compiles down to
>>
>>   1c:   6e211c00        eor     v0.16b, v0.16b, v1.16b
>>   20:   6eb0a800        umaxv   s0, v0.4s
>>   24:   1e260000        fmov    w0, s0
>>   28:   6b1f001f        cmp     w0, wzr
>>   2c:   1a9f17e0        cset    w0, eq
>>   30:   d65f03c0        ret
>
> For me this code compiles as below and migration time is ~100ms more.

Thanks for benchmarking this. I'll take your original patch into
target-arm.next.

-- PMM

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2016-07-11 17:56 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-06-29  8:47 [Qemu-devel] [PATCH v3 0/1] ARM64: Live migration optimization vijayak
2016-06-29  8:47 ` [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking vijayak
2016-06-29 12:53   ` Paolo Bonzini
2016-06-30 13:45   ` Peter Maydell
2016-07-01 22:07     ` Richard Henderson
2016-07-02  9:42       ` Peter Maydell
2016-07-05 12:24       ` Vijay Kilari
2016-07-11 17:55         ` Peter Maydell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).