* [PATCH v4 00/10] Optimize buffer_is_zero
@ 2024-02-15  8:14 Richard Henderson
  2024-02-15  8:14 ` [PATCH v4 01/10] util/bufferiszero: Remove SSE4.1 variant Richard Henderson
                   ` (10 more replies)
  0 siblings, 11 replies; 27+ messages in thread
From: Richard Henderson @ 2024-02-15  8:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: amonakov, mmromanov

v3: https://patchew.org/QEMU/20240206204809.9859-1-amonakov@ispras.ru/

Changes for v4:
  - Keep separate >= 256 entry point, but only keep constant length
    check inline.  This allows the indirect function call to be hidden
    and optimized away when the pointer is constant.
  - Split out a >= 256 integer routine.
  - Simplify acceleration selection for testing.
  - Add function pointer typedef.
  - Implement new aarch64 accelerations.


r~


Alexander Monakov (5):
  util/bufferiszero: Remove SSE4.1 variant
  util/bufferiszero: Remove AVX512 variant
  util/bufferiszero: Reorganize for early test for acceleration
  util/bufferiszero: Remove useless prefetches
  util/bufferiszero: Optimize SSE2 and AVX2 variants

Richard Henderson (5):
  util/bufferiszero: Improve scalar variant
  util/bufferiszero: Introduce biz_accel_fn typedef
  util/bufferiszero: Simplify test_buffer_is_zero_next_accel
  util/bufferiszero: Add simd acceleration for aarch64
  util/bufferiszero: Add sve acceleration for aarch64

 host/include/aarch64/host/cpuinfo.h |   1 +
 include/qemu/cutils.h               |  15 +-
 util/bufferiszero.c                 | 500 ++++++++++++++++------------
 util/cpuinfo-aarch64.c              |   1 +
 meson.build                         |  13 +
 5 files changed, 323 insertions(+), 207 deletions(-)

-- 
2.34.1




* [PATCH v4 01/10] util/bufferiszero: Remove SSE4.1 variant
  2024-02-15  8:14 [PATCH v4 00/10] Optimize buffer_is_zero Richard Henderson
@ 2024-02-15  8:14 ` Richard Henderson
  2024-02-15  8:14 ` [PATCH v4 02/10] util/bufferiszero: Remove AVX512 variant Richard Henderson
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Richard Henderson @ 2024-02-15  8:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: amonakov, mmromanov

From: Alexander Monakov <amonakov@ispras.ru>

The SSE4.1 variant is virtually identical to the SSE2 variant, except
for using 'PTEST+JNZ' in place of 'PCMPEQB+PMOVMSKB+CMP+JNE' for testing
if an SSE register is all zeroes. The PTEST instruction decodes to two
uops, so it can be handled only by the complex decoder, and since
CMP+JNE are macro-fused, both sequences decode to three uops. The uops
comprising the PTEST instruction dispatch to p0 and p5 on Intel CPUs, so
PCMPEQB+PMOVMSKB is comparatively more flexible from a dispatch
standpoint.

Hence, the use of PTEST brings no benefit from a throughput standpoint.
Its latency is not important, since it feeds only a conditional jump,
which terminates the dependency chain.

I never observed PTEST variants to be faster on real hardware.
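
For illustration, the two test sequences correspond roughly to the
following C intrinsics (a sketch for comparison, not code from this
patch):

  #include <emmintrin.h>   /* SSE2 */
  #include <smmintrin.h>   /* SSE4.1, for _mm_testz_si128 */

  /* SSE4.1: PTEST + JNZ.  PTEST alone is two uops; with the branch the
     sequence is three uops, and both PTEST uops need ports 0/5.  */
  static inline int is_zero_ptest(__m128i t)
  {
      return _mm_testz_si128(t, t);
  }

  /* SSE2: PCMPEQB + PMOVMSKB + CMP/JNE.  CMP and JNE macro-fuse, so
     this is also three uops, but with more scheduling freedom.  */
  static inline int is_zero_cmpeq(__m128i t)
  {
      return _mm_movemask_epi8(_mm_cmpeq_epi8(t, _mm_setzero_si128()))
             == 0xFFFF;
  }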

Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20240206204809.9859-2-amonakov@ispras.ru>
---
 util/bufferiszero.c | 29 -----------------------------
 1 file changed, 29 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 3e6a5dfd63..f5a3634f9a 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -100,34 +100,6 @@ buffer_zero_sse2(const void *buf, size_t len)
 }
 
 #ifdef CONFIG_AVX2_OPT
-static bool __attribute__((target("sse4")))
-buffer_zero_sse4(const void *buf, size_t len)
-{
-    __m128i t = _mm_loadu_si128(buf);
-    __m128i *p = (__m128i *)(((uintptr_t)buf + 5 * 16) & -16);
-    __m128i *e = (__m128i *)(((uintptr_t)buf + len) & -16);
-
-    /* Loop over 16-byte aligned blocks of 64.  */
-    while (likely(p <= e)) {
-        __builtin_prefetch(p);
-        if (unlikely(!_mm_testz_si128(t, t))) {
-            return false;
-        }
-        t = p[-4] | p[-3] | p[-2] | p[-1];
-        p += 4;
-    }
-
-    /* Finish the aligned tail.  */
-    t |= e[-3];
-    t |= e[-2];
-    t |= e[-1];
-
-    /* Finish the unaligned tail.  */
-    t |= _mm_loadu_si128(buf + len - 16);
-
-    return _mm_testz_si128(t, t);
-}
-
 static bool __attribute__((target("avx2")))
 buffer_zero_avx2(const void *buf, size_t len)
 {
@@ -221,7 +193,6 @@ select_accel_cpuinfo(unsigned info)
 #endif
 #ifdef CONFIG_AVX2_OPT
         { CPUINFO_AVX2,    128, buffer_zero_avx2 },
-        { CPUINFO_SSE4,     64, buffer_zero_sse4 },
 #endif
         { CPUINFO_SSE2,     64, buffer_zero_sse2 },
         { CPUINFO_ALWAYS,    0, buffer_zero_int },
-- 
2.34.1




* [PATCH v4 02/10] util/bufferiszero: Remove AVX512 variant
  2024-02-15  8:14 [PATCH v4 00/10] Optimize buffer_is_zero Richard Henderson
  2024-02-15  8:14 ` [PATCH v4 01/10] util/bufferiszero: Remove SSE4.1 variant Richard Henderson
@ 2024-02-15  8:14 ` Richard Henderson
  2024-02-15  8:14 ` [PATCH v4 03/10] util/bufferiszero: Reorganize for early test for acceleration Richard Henderson
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Richard Henderson @ 2024-02-15  8:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: amonakov, mmromanov

From: Alexander Monakov <amonakov@ispras.ru>

Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
routines are invoked much more rarely in normal use when most buffers
are non-zero. This makes use of AVX512 unprofitable, as it incurs extra
frequency and voltage transition periods during which the CPU operates
at reduced performance, as described in
https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html

Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20240206204809.9859-4-amonakov@ispras.ru>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 util/bufferiszero.c | 38 +++-----------------------------------
 1 file changed, 3 insertions(+), 35 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index f5a3634f9a..641d5f9b9e 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -64,7 +64,7 @@ buffer_zero_int(const void *buf, size_t len)
     }
 }
 
-#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
+#if defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
 #include <immintrin.h>
 
 /* Note that each of these vectorized functions require len >= 64.  */
@@ -128,41 +128,12 @@ buffer_zero_avx2(const void *buf, size_t len)
 }
 #endif /* CONFIG_AVX2_OPT */
 
-#ifdef CONFIG_AVX512F_OPT
-static bool __attribute__((target("avx512f")))
-buffer_zero_avx512(const void *buf, size_t len)
-{
-    /* Begin with an unaligned head of 64 bytes.  */
-    __m512i t = _mm512_loadu_si512(buf);
-    __m512i *p = (__m512i *)(((uintptr_t)buf + 5 * 64) & -64);
-    __m512i *e = (__m512i *)(((uintptr_t)buf + len) & -64);
-
-    /* Loop over 64-byte aligned blocks of 256.  */
-    while (p <= e) {
-        __builtin_prefetch(p);
-        if (unlikely(_mm512_test_epi64_mask(t, t))) {
-            return false;
-        }
-        t = p[-4] | p[-3] | p[-2] | p[-1];
-        p += 4;
-    }
-
-    t |= _mm512_loadu_si512(buf + len - 4 * 64);
-    t |= _mm512_loadu_si512(buf + len - 3 * 64);
-    t |= _mm512_loadu_si512(buf + len - 2 * 64);
-    t |= _mm512_loadu_si512(buf + len - 1 * 64);
-
-    return !_mm512_test_epi64_mask(t, t);
-
-}
-#endif /* CONFIG_AVX512F_OPT */
-
 /*
  * Make sure that these variables are appropriately initialized when
  * SSE2 is enabled on the compiler command-line, but the compiler is
  * too old to support CONFIG_AVX2_OPT.
  */
-#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT)
+#if defined(CONFIG_AVX2_OPT)
 # define INIT_USED     0
 # define INIT_LENGTH   0
 # define INIT_ACCEL    buffer_zero_int
@@ -188,9 +159,6 @@ select_accel_cpuinfo(unsigned info)
         unsigned len;
         bool (*fn)(const void *, size_t);
     } all[] = {
-#ifdef CONFIG_AVX512F_OPT
-        { CPUINFO_AVX512F, 256, buffer_zero_avx512 },
-#endif
 #ifdef CONFIG_AVX2_OPT
         { CPUINFO_AVX2,    128, buffer_zero_avx2 },
 #endif
@@ -208,7 +176,7 @@ select_accel_cpuinfo(unsigned info)
     return 0;
 }
 
-#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT)
+#if defined(CONFIG_AVX2_OPT)
 static void __attribute__((constructor)) init_accel(void)
 {
     used_accel = select_accel_cpuinfo(cpuinfo_init());
-- 
2.34.1




* [PATCH v4 03/10] util/bufferiszero: Reorganize for early test for acceleration
  2024-02-15  8:14 [PATCH v4 00/10] Optimize buffer_is_zero Richard Henderson
  2024-02-15  8:14 ` [PATCH v4 01/10] util/bufferiszero: Remove SSE4.1 variant Richard Henderson
  2024-02-15  8:14 ` [PATCH v4 02/10] util/bufferiszero: Remove AVX512 variant Richard Henderson
@ 2024-02-15  8:14 ` Richard Henderson
  2024-02-15  8:14 ` [PATCH v4 04/10] util/bufferiszero: Remove useless prefetches Richard Henderson
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Richard Henderson @ 2024-02-15  8:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: amonakov, mmromanov

From: Alexander Monakov <amonakov@ispras.ru>

Test for length >= 256 inline, where it is often a constant.
Before calling into the accelerated routine, sample three bytes
from the buffer, which handles most non-zero buffers.
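
As an illustration of the inline dispatch, a hypothetical call site
with a constant length resolves the check at compile time (a sketch,
not part of this patch):

  #include <stdbool.h>
  #include "qemu/cutils.h"    /* buffer_is_zero() wrapper added below */

  static bool page_is_blank(const void *page)
  {
      /* With optimization enabled, 4096 is a compile-time constant
         >= 256, so the macro expands directly to
         buffer_is_zero_ge256(page, 4096) and the generic length
         checks in buffer_is_zero_ool() are skipped entirely.  */
      return buffer_is_zero(page, 4096);
  }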

Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
Message-Id: <20240206204809.9859-3-amonakov@ispras.ru>
[rth: Use __builtin_constant_p and perform the sample out-of-line.]
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 include/qemu/cutils.h | 15 +++++++-
 util/bufferiszero.c   | 89 ++++++++++++++++++-------------------------
 2 files changed, 51 insertions(+), 53 deletions(-)

diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h
index 92c927a6a3..36f8cfa0e9 100644
--- a/include/qemu/cutils.h
+++ b/include/qemu/cutils.h
@@ -187,9 +187,22 @@ char *freq_to_str(uint64_t freq_hz);
 /* used to print char* safely */
 #define STR_OR_NULL(str) ((str) ? (str) : "null")
 
-bool buffer_is_zero(const void *buf, size_t len);
+/*
+ * Check if a buffer is all zeroes.
+ */
+
+bool buffer_is_zero_ool(const void *vbuf, size_t len);
+bool buffer_is_zero_ge256(const void *vbuf, size_t len);
 bool test_buffer_is_zero_next_accel(void);
 
+#ifdef __OPTIMIZE__
+#define buffer_is_zero(B, L) \
+    (__builtin_constant_p(L) && (size_t)(L) >= 256 \
+     ? buffer_is_zero_ge256(B, L) : buffer_is_zero_ool(B, L))
+#else
+#define buffer_is_zero  buffer_is_zero_ool
+#endif
+
 /*
  * Implementation of ULEB128 (http://en.wikipedia.org/wiki/LEB128)
  * Input is limited to 14-bit numbers
diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 641d5f9b9e..38527f2467 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -26,8 +26,9 @@
 #include "qemu/bswap.h"
 #include "host/cpuinfo.h"
 
-static bool
-buffer_zero_int(const void *buf, size_t len)
+static bool (*buffer_is_zero_accel)(const void *, size_t);
+
+static bool buffer_is_zero_integer(const void *buf, size_t len)
 {
     if (unlikely(len < 8)) {
         /* For a very small buffer, simply accumulate all the bytes.  */
@@ -128,60 +129,38 @@ buffer_zero_avx2(const void *buf, size_t len)
 }
 #endif /* CONFIG_AVX2_OPT */
 
-/*
- * Make sure that these variables are appropriately initialized when
- * SSE2 is enabled on the compiler command-line, but the compiler is
- * too old to support CONFIG_AVX2_OPT.
- */
-#if defined(CONFIG_AVX2_OPT)
-# define INIT_USED     0
-# define INIT_LENGTH   0
-# define INIT_ACCEL    buffer_zero_int
-#else
-# ifndef __SSE2__
-#  error "ISA selection confusion"
-# endif
-# define INIT_USED     CPUINFO_SSE2
-# define INIT_LENGTH   64
-# define INIT_ACCEL    buffer_zero_sse2
-#endif
-
-static unsigned used_accel = INIT_USED;
-static unsigned length_to_accel = INIT_LENGTH;
-static bool (*buffer_accel)(const void *, size_t) = INIT_ACCEL;
-
 static unsigned __attribute__((noinline))
 select_accel_cpuinfo(unsigned info)
 {
     /* Array is sorted in order of algorithm preference. */
     static const struct {
         unsigned bit;
-        unsigned len;
         bool (*fn)(const void *, size_t);
     } all[] = {
 #ifdef CONFIG_AVX2_OPT
-        { CPUINFO_AVX2,    128, buffer_zero_avx2 },
+        { CPUINFO_AVX2,    buffer_zero_avx2 },
 #endif
-        { CPUINFO_SSE2,     64, buffer_zero_sse2 },
-        { CPUINFO_ALWAYS,    0, buffer_zero_int },
+        { CPUINFO_SSE2,    buffer_zero_sse2 },
+        { CPUINFO_ALWAYS,  buffer_is_zero_integer },
     };
 
     for (unsigned i = 0; i < ARRAY_SIZE(all); ++i) {
         if (info & all[i].bit) {
-            length_to_accel = all[i].len;
-            buffer_accel = all[i].fn;
+            buffer_is_zero_accel = all[i].fn;
             return all[i].bit;
         }
     }
     return 0;
 }
 
-#if defined(CONFIG_AVX2_OPT)
+static unsigned used_accel;
+
 static void __attribute__((constructor)) init_accel(void)
 {
     used_accel = select_accel_cpuinfo(cpuinfo_init());
 }
-#endif /* CONFIG_AVX2_OPT */
+
+#define INIT_ACCEL NULL
 
 bool test_buffer_is_zero_next_accel(void)
 {
@@ -194,36 +173,42 @@ bool test_buffer_is_zero_next_accel(void)
     used_accel |= used;
     return used;
 }
-
-static bool select_accel_fn(const void *buf, size_t len)
-{
-    if (likely(len >= length_to_accel)) {
-        return buffer_accel(buf, len);
-    }
-    return buffer_zero_int(buf, len);
-}
-
 #else
-#define select_accel_fn  buffer_zero_int
 bool test_buffer_is_zero_next_accel(void)
 {
     return false;
 }
+
+#define INIT_ACCEL buffer_is_zero_integer
 #endif
 
-/*
- * Checks if a buffer is all zeroes
- */
-bool buffer_is_zero(const void *buf, size_t len)
+static bool (*buffer_is_zero_accel)(const void *, size_t) = INIT_ACCEL;
+
+static inline bool buffer_is_zero_sample3(const char *buf, size_t len)
+{
+    return (buf[0] | buf[len - 1] | buf[len / 2]) == 0;
+}
+
+bool buffer_is_zero_ool(const void *buf, size_t len)
 {
     if (unlikely(len == 0)) {
         return true;
     }
+    if (!buffer_is_zero_sample3(buf, len)) {
+        return false;
+    }
+    /* All bytes are covered for any len <= 3.  */
+    if (unlikely(len <= 3)) {
+        return true;
+    }
 
-    /* Fetch the beginning of the buffer while we select the accelerator.  */
-    __builtin_prefetch(buf);
-
-    /* Use an optimized zero check if possible.  Note that this also
-       includes a check for an unrolled loop over 64-bit integers.  */
-    return select_accel_fn(buf, len);
+    if (likely(len >= 256)) {
+        return buffer_is_zero_accel(buf, len);
+    }
+    return buffer_is_zero_integer(buf, len);
+}
+
+bool buffer_is_zero_ge256(const void *buf, size_t len)
+{
+    return buffer_is_zero_sample3(buf, len) && buffer_is_zero_accel(buf, len);
 }
-- 
2.34.1




* [PATCH v4 04/10] util/bufferiszero: Remove useless prefetches
  2024-02-15  8:14 [PATCH v4 00/10] Optimize buffer_is_zero Richard Henderson
                   ` (2 preceding siblings ...)
  2024-02-15  8:14 ` [PATCH v4 03/10] util/bufferiszero: Reorganize for early test for acceleration Richard Henderson
@ 2024-02-15  8:14 ` Richard Henderson
  2024-02-15  8:14 ` [PATCH v4 05/10] util/bufferiszero: Optimize SSE2 and AVX2 variants Richard Henderson
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Richard Henderson @ 2024-02-15  8:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: amonakov, mmromanov

From: Alexander Monakov <amonakov@ispras.ru>

Use of prefetching in bufferiszero.c is quite questionable:

- prefetches are issued just a few CPU cycles before the corresponding
  line would be hit by demand loads;

- they are done for simple access patterns, i.e. where hardware
  prefetchers can perform better;

- they compete for load ports in loops that should be limited by load
  port throughput rather than ALU throughput.

Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20240206204809.9859-5-amonakov@ispras.ru>
---
 util/bufferiszero.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 38527f2467..6ef5f8ec79 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -50,7 +50,6 @@ static bool buffer_is_zero_integer(const void *buf, size_t len)
         const uint64_t *e = (uint64_t *)(((uintptr_t)buf + len) & -8);
 
         for (; p + 8 <= e; p += 8) {
-            __builtin_prefetch(p + 8);
             if (t) {
                 return false;
             }
@@ -80,7 +79,6 @@ buffer_zero_sse2(const void *buf, size_t len)
 
     /* Loop over 16-byte aligned blocks of 64.  */
     while (likely(p <= e)) {
-        __builtin_prefetch(p);
         t = _mm_cmpeq_epi8(t, zero);
         if (unlikely(_mm_movemask_epi8(t) != 0xFFFF)) {
             return false;
@@ -111,7 +109,6 @@ buffer_zero_avx2(const void *buf, size_t len)
 
     /* Loop over 32-byte aligned blocks of 128.  */
     while (p <= e) {
-        __builtin_prefetch(p);
         if (unlikely(!_mm256_testz_si256(t, t))) {
             return false;
         }
-- 
2.34.1




* [PATCH v4 05/10] util/bufferiszero: Optimize SSE2 and AVX2 variants
  2024-02-15  8:14 [PATCH v4 00/10] Optimize buffer_is_zero Richard Henderson
                   ` (3 preceding siblings ...)
  2024-02-15  8:14 ` [PATCH v4 04/10] util/bufferiszero: Remove useless prefetches Richard Henderson
@ 2024-02-15  8:14 ` Richard Henderson
  2024-02-15  8:14 ` [PATCH v4 06/10] util/bufferiszero: Improve scalar variant Richard Henderson
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Richard Henderson @ 2024-02-15  8:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: amonakov, mmromanov

From: Alexander Monakov <amonakov@ispras.ru>

Increase unroll factor in SIMD loops from 4x to 8x in order to move
their bottlenecks from ALU port contention to load issue rate (two loads
per cycle on popular x86 implementations).

Avoid using out-of-bounds pointers in loop boundary conditions.

Follow SSE2 implementation strategy in the AVX2 variant. Avoid use of
PTEST, which is not profitable there (like in the removed SSE4 variant).
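
The reassociation barrier introduced below is just an empty asm
statement; a minimal sketch of how it keeps two accumulator chains
independent (the macro is from the patch, the surrounding function is
illustrative only):

  #include <emmintrin.h>

  /* No instruction is emitted, but the compiler must assume vec0 and
     vec1 are read and rewritten here, so it cannot reassociate the OR
     chains feeding the two accumulators into one serial chain.  */
  #define SSE_REASSOC_BARRIER(vec0, vec1) \
      asm("" : "+x"(vec0), "+x"(vec1))

  static __m128i or_four(const __m128i *p)
  {
      __m128i v = p[0], w = p[1];
      SSE_REASSOC_BARRIER(v, w);
      v |= p[2];               /* two independent dependency chains */
      w |= p[3];
      return v | w;            /* merged only at the end */
  }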

Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20240206204809.9859-6-amonakov@ispras.ru>
---
 util/bufferiszero.c | 111 +++++++++++++++++++++++++++++---------------
 1 file changed, 73 insertions(+), 38 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 6ef5f8ec79..2822155c27 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -67,62 +67,97 @@ static bool buffer_is_zero_integer(const void *buf, size_t len)
 #if defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
 #include <immintrin.h>
 
-/* Note that each of these vectorized functions require len >= 64.  */
+/* Helper for preventing the compiler from reassociating
+   chains of binary vector operations.  */
+#define SSE_REASSOC_BARRIER(vec0, vec1) asm("" : "+x"(vec0), "+x"(vec1))
+
+/* Note that these vectorized functions may assume len >= 256.  */
 
 static bool __attribute__((target("sse2")))
 buffer_zero_sse2(const void *buf, size_t len)
 {
-    __m128i t = _mm_loadu_si128(buf);
-    __m128i *p = (__m128i *)(((uintptr_t)buf + 5 * 16) & -16);
-    __m128i *e = (__m128i *)(((uintptr_t)buf + len) & -16);
-    __m128i zero = _mm_setzero_si128();
+    /* Unaligned loads at head/tail.  */
+    __m128i v = *(__m128i_u *)(buf);
+    __m128i w = *(__m128i_u *)(buf + len - 16);
+    /* Align head/tail to 16-byte boundaries.  */
+    const __m128i *p = QEMU_ALIGN_PTR_DOWN(buf + 16, 16);
+    const __m128i *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 16);
+    __m128i zero = { 0 };
 
-    /* Loop over 16-byte aligned blocks of 64.  */
-    while (likely(p <= e)) {
-        t = _mm_cmpeq_epi8(t, zero);
-        if (unlikely(_mm_movemask_epi8(t) != 0xFFFF)) {
+    /* Collect a partial block at tail end.  */
+    v |= e[-1]; w |= e[-2];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-3]; w |= e[-4];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-5]; w |= e[-6];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-7]; v |= w;
+
+    /*
+     * Loop over complete 128-byte blocks.
+     * With the head and tail removed, e - p >= 14, so the loop
+     * must iterate at least once.
+     */
+    do {
+        v = _mm_cmpeq_epi8(v, zero);
+        if (unlikely(_mm_movemask_epi8(v) != 0xFFFF)) {
             return false;
         }
-        t = p[-4] | p[-3] | p[-2] | p[-1];
-        p += 4;
-    }
+        v = p[0]; w = p[1];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[2]; w |= p[3];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[4]; w |= p[5];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[6]; w |= p[7];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= w;
+        p += 8;
+    } while (p < e - 7);
 
-    /* Finish the aligned tail.  */
-    t |= e[-3];
-    t |= e[-2];
-    t |= e[-1];
-
-    /* Finish the unaligned tail.  */
-    t |= _mm_loadu_si128(buf + len - 16);
-
-    return _mm_movemask_epi8(_mm_cmpeq_epi8(t, zero)) == 0xFFFF;
+    return _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero)) == 0xFFFF;
 }
 
 #ifdef CONFIG_AVX2_OPT
 static bool __attribute__((target("avx2")))
 buffer_zero_avx2(const void *buf, size_t len)
 {
-    /* Begin with an unaligned head of 32 bytes.  */
-    __m256i t = _mm256_loadu_si256(buf);
-    __m256i *p = (__m256i *)(((uintptr_t)buf + 5 * 32) & -32);
-    __m256i *e = (__m256i *)(((uintptr_t)buf + len) & -32);
+    /* Unaligned loads at head/tail.  */
+    __m256i v = *(__m256i_u *)(buf);
+    __m256i w = *(__m256i_u *)(buf + len - 32);
+    /* Align head/tail to 32-byte boundaries.  */
+    const __m256i *p = QEMU_ALIGN_PTR_DOWN(buf + 32, 32);
+    const __m256i *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 32);
+    __m256i zero = { 0 };
 
-    /* Loop over 32-byte aligned blocks of 128.  */
-    while (p <= e) {
-        if (unlikely(!_mm256_testz_si256(t, t))) {
+    /* Collect a partial block at tail end.  */
+    v |= e[-1]; w |= e[-2];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-3]; w |= e[-4];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-5]; w |= e[-6];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-7]; v |= w;
+
+    /* Loop over complete 256-byte blocks.  */
+    for (; p < e - 7; p += 8) {
+        /* PTEST is not profitable here.  */
+        v = _mm256_cmpeq_epi8(v, zero);
+        if (unlikely(_mm256_movemask_epi8(v) != 0xFFFFFFFF)) {
             return false;
         }
-        t = p[-4] | p[-3] | p[-2] | p[-1];
-        p += 4;
-    } ;
+        v = p[0]; w = p[1];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[2]; w |= p[3];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[4]; w |= p[5];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[6]; w |= p[7];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= w;
+    }
 
-    /* Finish the last block of 128 unaligned.  */
-    t |= _mm256_loadu_si256(buf + len - 4 * 32);
-    t |= _mm256_loadu_si256(buf + len - 3 * 32);
-    t |= _mm256_loadu_si256(buf + len - 2 * 32);
-    t |= _mm256_loadu_si256(buf + len - 1 * 32);
-
-    return _mm256_testz_si256(t, t);
+    return _mm256_movemask_epi8(_mm256_cmpeq_epi8(v, zero)) == 0xFFFFFFFF;
 }
 #endif /* CONFIG_AVX2_OPT */
 
-- 
2.34.1




* [PATCH v4 06/10] util/bufferiszero: Improve scalar variant
  2024-02-15  8:14 [PATCH v4 00/10] Optimize buffer_is_zero Richard Henderson
                   ` (4 preceding siblings ...)
  2024-02-15  8:14 ` [PATCH v4 05/10] util/bufferiszero: Optimize SSE2 and AVX2 variants Richard Henderson
@ 2024-02-15  8:14 ` Richard Henderson
  2024-02-15  8:14 ` [PATCH v4 07/10] util/bufferiszero: Introduce biz_accel_fn typedef Richard Henderson
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Richard Henderson @ 2024-02-15  8:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: amonakov, mmromanov

Split the less-than-256 and greater-or-equal-to-256 cases.
Use unaligned accesses for head and tail.
Avoid using out-of-bounds pointers in loop boundary conditions.
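
The head/tail handling relies on overlapping unaligned loads; a small
stand-alone sketch of the idea for the 4..8 byte case (illustrative
only, not code from the patch):

  #include <stdbool.h>
  #include <stdint.h>
  #include <string.h>

  /* Two 4-byte loads cover every byte of a 4..8 byte buffer, possibly
     overlapping in the middle: for len == 6 the first load reads bytes
     0..3 and the second bytes 2..5, so no byte loop is needed.  */
  static bool is_zero_4_to_8(const unsigned char *buf, size_t len)
  {
      uint32_t head, tail;

      memcpy(&head, buf, 4);
      memcpy(&tail, buf + len - 4, 4);
      return (head | tail) == 0;
  }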

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 util/bufferiszero.c | 86 +++++++++++++++++++++++++++------------------
 1 file changed, 52 insertions(+), 34 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 2822155c27..ce04642c67 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -28,40 +28,58 @@
 
 static bool (*buffer_is_zero_accel)(const void *, size_t);
 
-static bool buffer_is_zero_integer(const void *buf, size_t len)
+static bool buffer_is_zero_int_lt256(const void *buf, size_t len)
 {
-    if (unlikely(len < 8)) {
-        /* For a very small buffer, simply accumulate all the bytes.  */
-        const unsigned char *p = buf;
-        const unsigned char *e = buf + len;
-        unsigned char t = 0;
+    uint64_t t;
+    const uint64_t *p, *e;
 
-        do {
-            t |= *p++;
-        } while (p < e);
-
-        return t == 0;
-    } else {
-        /* Otherwise, use the unaligned memory access functions to
-           handle the beginning and end of the buffer, with a couple
-           of loops handling the middle aligned section.  */
-        uint64_t t = ldq_he_p(buf);
-        const uint64_t *p = (uint64_t *)(((uintptr_t)buf + 8) & -8);
-        const uint64_t *e = (uint64_t *)(((uintptr_t)buf + len) & -8);
-
-        for (; p + 8 <= e; p += 8) {
-            if (t) {
-                return false;
-            }
-            t = p[0] | p[1] | p[2] | p[3] | p[4] | p[5] | p[6] | p[7];
-        }
-        while (p < e) {
-            t |= *p++;
-        }
-        t |= ldq_he_p(buf + len - 8);
-
-        return t == 0;
+    /*
+     * Use unaligned memory access functions to handle
+     * the beginning and end of the buffer, with a couple
+     * of loops handling the middle aligned section.
+     */
+    if (unlikely(len <= 8)) {
+        return (ldl_he_p(buf) | ldl_he_p(buf + len - 4)) == 0;
     }
+
+    t = ldq_he_p(buf) | ldq_he_p(buf + len - 8);
+    p = QEMU_ALIGN_PTR_DOWN(buf + 8, 8);
+    e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 8);
+
+    while (p < e) {
+        t |= *p++;
+    }
+    return t == 0;
+}
+
+static bool buffer_is_zero_int_ge256(const void *buf, size_t len)
+{
+    /*
+     * Use unaligned memory access functions to handle
+     * the beginning and end of the buffer, with a couple
+     * of loops handling the middle aligned section.
+     */
+    uint64_t t = ldq_he_p(buf) | ldq_he_p(buf + len - 8);
+    const uint64_t *p = QEMU_ALIGN_PTR_DOWN(buf + 8, 8);
+    const uint64_t *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 8);
+
+    /* Collect a partial block at the tail end. */
+    t |= e[-7] | e[-6] | e[-5] | e[-4] | e[-3] | e[-2] | e[-1];
+
+    /*
+     * Loop over 64 byte blocks.
+     * With the head and tail removed, e - p >= 30,
+     * so the loop must iterate at least 3 times.
+     */
+    do {
+        if (t) {
+            return false;
+        }
+        t = p[0] | p[1] | p[2] | p[3] | p[4] | p[5] | p[6] | p[7];
+        p += 8;
+    } while (p < e - 7);
+
+    return t == 0;
 }
 
 #if defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
@@ -173,7 +191,7 @@ select_accel_cpuinfo(unsigned info)
         { CPUINFO_AVX2,    buffer_zero_avx2 },
 #endif
         { CPUINFO_SSE2,    buffer_zero_sse2 },
-        { CPUINFO_ALWAYS,  buffer_is_zero_integer },
+        { CPUINFO_ALWAYS,  buffer_is_zero_int_ge256 },
     };
 
     for (unsigned i = 0; i < ARRAY_SIZE(all); ++i) {
@@ -211,7 +229,7 @@ bool test_buffer_is_zero_next_accel(void)
     return false;
 }
 
-#define INIT_ACCEL buffer_is_zero_integer
+#define INIT_ACCEL buffer_is_zero_int_ge256
 #endif
 
 static bool (*buffer_is_zero_accel)(const void *, size_t) = INIT_ACCEL;
@@ -237,7 +255,7 @@ bool buffer_is_zero_ool(const void *buf, size_t len)
     if (likely(len >= 256)) {
         return buffer_is_zero_accel(buf, len);
     }
-    return buffer_is_zero_integer(buf, len);
+    return buffer_is_zero_int_lt256(buf, len);
 }
 
 bool buffer_is_zero_ge256(const void *buf, size_t len)
-- 
2.34.1




* [PATCH v4 07/10] util/bufferiszero: Introduce biz_accel_fn typedef
  2024-02-15  8:14 [PATCH v4 00/10] Optimize buffer_is_zero Richard Henderson
                   ` (5 preceding siblings ...)
  2024-02-15  8:14 ` [PATCH v4 06/10] util/bufferiszero: Improve scalar variant Richard Henderson
@ 2024-02-15  8:14 ` Richard Henderson
  2024-02-15  8:34   ` Philippe Mathieu-Daudé
  2024-02-15  8:14 ` [PATCH v4 08/10] util/bufferiszero: Simplify test_buffer_is_zero_next_accel Richard Henderson
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2024-02-15  8:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: amonakov, mmromanov

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 util/bufferiszero.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index ce04642c67..ce80713071 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -26,7 +26,8 @@
 #include "qemu/bswap.h"
 #include "host/cpuinfo.h"
 
-static bool (*buffer_is_zero_accel)(const void *, size_t);
+typedef bool (*biz_accel_fn)(const void *, size_t);
+static biz_accel_fn buffer_is_zero_accel;
 
 static bool buffer_is_zero_int_lt256(const void *buf, size_t len)
 {
@@ -179,13 +180,15 @@ buffer_zero_avx2(const void *buf, size_t len)
 }
 #endif /* CONFIG_AVX2_OPT */
 
+
+
 static unsigned __attribute__((noinline))
 select_accel_cpuinfo(unsigned info)
 {
     /* Array is sorted in order of algorithm preference. */
     static const struct {
         unsigned bit;
-        bool (*fn)(const void *, size_t);
+        biz_accel_fn fn;
     } all[] = {
 #ifdef CONFIG_AVX2_OPT
         { CPUINFO_AVX2,    buffer_zero_avx2 },
@@ -232,7 +235,7 @@ bool test_buffer_is_zero_next_accel(void)
 #define INIT_ACCEL buffer_is_zero_int_ge256
 #endif
 
-static bool (*buffer_is_zero_accel)(const void *, size_t) = INIT_ACCEL;
+static biz_accel_fn buffer_is_zero_accel = INIT_ACCEL;
 
 static inline bool buffer_is_zero_sample3(const char *buf, size_t len)
 {
-- 
2.34.1




* [PATCH v4 08/10] util/bufferiszero: Simplify test_buffer_is_zero_next_accel
  2024-02-15  8:14 [PATCH v4 00/10] Optimize buffer_is_zero Richard Henderson
                   ` (6 preceding siblings ...)
  2024-02-15  8:14 ` [PATCH v4 07/10] util/bufferiszero: Introduce biz_accel_fn typedef Richard Henderson
@ 2024-02-15  8:14 ` Richard Henderson
  2024-02-15  8:40   ` Philippe Mathieu-Daudé
  2024-02-15  8:14 ` [PATCH v4 09/10] util/bufferiszero: Add simd acceleration for aarch64 Richard Henderson
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2024-02-15  8:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: amonakov, mmromanov

Because the three alternatives are monotonic, we don't
need to keep a couple of bitmasks; we can just identify
the strongest alternative at startup.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 util/bufferiszero.c | 56 ++++++++++++++++++---------------------------
 1 file changed, 22 insertions(+), 34 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index ce80713071..4eef6d47bc 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -180,51 +180,39 @@ buffer_zero_avx2(const void *buf, size_t len)
 }
 #endif /* CONFIG_AVX2_OPT */
 
-
-
-static unsigned __attribute__((noinline))
-select_accel_cpuinfo(unsigned info)
-{
-    /* Array is sorted in order of algorithm preference. */
-    static const struct {
-        unsigned bit;
-        biz_accel_fn fn;
-    } all[] = {
+static biz_accel_fn const accel_table[] = {
+    buffer_is_zero_int_ge256,
+    buffer_zero_sse2,
 #ifdef CONFIG_AVX2_OPT
-        { CPUINFO_AVX2,    buffer_zero_avx2 },
+    buffer_zero_avx2,
 #endif
-        { CPUINFO_SSE2,    buffer_zero_sse2 },
-        { CPUINFO_ALWAYS,  buffer_is_zero_int_ge256 },
-    };
-
-    for (unsigned i = 0; i < ARRAY_SIZE(all); ++i) {
-        if (info & all[i].bit) {
-            buffer_is_zero_accel = all[i].fn;
-            return all[i].bit;
-        }
-    }
-    return 0;
-}
-
-static unsigned used_accel;
+};
+static unsigned accel_index;
 
 static void __attribute__((constructor)) init_accel(void)
 {
-    used_accel = select_accel_cpuinfo(cpuinfo_init());
+    unsigned info = cpuinfo_init();
+    unsigned index = (info & CPUINFO_SSE2 ? 1 : 0);
+
+#ifdef CONFIG_AVX2_OPT
+    if (info & CPUINFO_AVX2) {
+        index = 2;
+    }
+#endif
+
+    accel_index = index;
+    buffer_is_zero_accel = accel_table[index];
 }
 
 #define INIT_ACCEL NULL
 
 bool test_buffer_is_zero_next_accel(void)
 {
-    /*
-     * Accumulate the accelerators that we've already tested, and
-     * remove them from the set to test this round.  We'll get back
-     * a zero from select_accel_cpuinfo when there are no more.
-     */
-    unsigned used = select_accel_cpuinfo(cpuinfo & ~used_accel);
-    used_accel |= used;
-    return used;
+    if (accel_index != 0) {
+        buffer_is_zero_accel = accel_table[--accel_index];
+        return true;
+    }
+    return false;
 }
 #else
 bool test_buffer_is_zero_next_accel(void)
-- 
2.34.1




* [PATCH v4 09/10] util/bufferiszero: Add simd acceleration for aarch64
  2024-02-15  8:14 [PATCH v4 00/10] Optimize buffer_is_zero Richard Henderson
                   ` (7 preceding siblings ...)
  2024-02-15  8:14 ` [PATCH v4 08/10] util/bufferiszero: Simplify test_buffer_is_zero_next_accel Richard Henderson
@ 2024-02-15  8:14 ` Richard Henderson
  2024-02-15  8:47   ` Alexander Monakov
  2024-02-15  8:14 ` [RFC PATCH v4 10/10] util/bufferiszero: Add sve " Richard Henderson
  2024-02-15  8:57 ` [PATCH v4 00/10] Optimize buffer_is_zero Alexander Monakov
  10 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2024-02-15  8:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: amonakov, mmromanov

Because non-embedded aarch64 is expected to have AdvSIMD enabled, merely
double-check with the compiler flags for __ARM_NEON and don't bother with
a runtime check.  Otherwise, model the loop after the x86 SSE2 function,
and use VADDV to reduce the four vector comparisons.
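
The VADDV reduction works because VCEQZ produces all-ones per zero
lane; a minimal sketch of the per-vector test (illustration only, not
code from the patch):

  #include <arm_neon.h>
  #include <stdbool.h>

  /* Each lane of vceqzq_u32() is 0xffffffff if the input lane is zero,
     else 0.  Summing the four lanes with vaddvq_u32() therefore yields
     (uint32_t)-4 exactly when the whole vector is zero.  */
  static bool vec128_is_zero(uint32x4_t v)
  {
      return vaddvq_u32(vceqzq_u32(v)) == (uint32_t)-4;
  }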

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 util/bufferiszero.c | 74 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 74 insertions(+)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 4eef6d47bc..2809b09225 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -214,7 +214,81 @@ bool test_buffer_is_zero_next_accel(void)
     }
     return false;
 }
+
+#elif defined(__aarch64__) && defined(__ARM_NEON)
+#include <arm_neon.h>
+
+#define REASSOC_BARRIER(vec0, vec1) asm("" : "+w"(vec0), "+w"(vec1))
+
+static bool buffer_is_zero_simd(const void *buf, size_t len)
+{
+    uint32x4_t t0, t1, t2, t3;
+
+    /* Align head/tail to 16-byte boundaries.  */
+    const uint32x4_t *p = QEMU_ALIGN_PTR_DOWN(buf + 16, 16);
+    const uint32x4_t *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 16);
+
+    /* Unaligned loads at head/tail.  */
+    t0 = vld1q_u32(buf) | vld1q_u32(buf + len - 16);
+
+    /* Collect a partial block at tail end.  */
+    t1 = e[-7] | e[-6];
+    t2 = e[-5] | e[-4];
+    t3 = e[-3] | e[-2];
+    t0 |= e[-1];
+    REASSOC_BARRIER(t0, t1);
+    REASSOC_BARRIER(t2, t3);
+    t0 |= t1;
+    t2 |= t3;
+    REASSOC_BARRIER(t0, t2);
+    t0 |= t2;
+
+    /*
+     * Loop over complete 128-byte blocks.
+     * With the head and tail removed, e - p >= 14, so the loop
+     * must iterate at least once.
+     */
+    do {
+        /* Each comparison is [-1,0], so reduction is in [-4..0]. */
+        if (unlikely(vaddvq_u32(vceqzq_u32(t0)) != -4)) {
+            return false;
+        }
+
+        t0 = p[0] | p[1];
+        t1 = p[2] | p[3];
+        t2 = p[4] | p[5];
+        t3 = p[6] | p[7];
+        REASSOC_BARRIER(t0, t1);
+        REASSOC_BARRIER(t2, t3);
+        t0 |= t1;
+        t2 |= t3;
+        REASSOC_BARRIER(t0, t2);
+        t0 |= t2;
+        p += 8;
+    } while (p < e - 7);
+
+    return vaddvq_u32(vceqzq_u32(t0)) == -4;
+}
+
+static biz_accel_fn const accel_table[] = {
+    buffer_is_zero_int_ge256,
+    buffer_is_zero_simd,
+};
+
+static unsigned accel_index = 1;
+#define INIT_ACCEL buffer_is_zero_simd
+
+bool test_buffer_is_zero_next_accel(void)
+{
+    if (accel_index != 0) {
+        buffer_is_zero_accel = accel_table[--accel_index];
+        return true;
+    }
+    return false;
+}
+
 #else
+
 bool test_buffer_is_zero_next_accel(void)
 {
     return false;
-- 
2.34.1




* [RFC PATCH v4 10/10] util/bufferiszero: Add sve acceleration for aarch64
  2024-02-15  8:14 [PATCH v4 00/10] Optimize buffer_is_zero Richard Henderson
                   ` (8 preceding siblings ...)
  2024-02-15  8:14 ` [PATCH v4 09/10] util/bufferiszero: Add simd acceleration for aarch64 Richard Henderson
@ 2024-02-15  8:14 ` Richard Henderson
  2024-02-16  9:33   ` Alex Bennée
  2024-02-16 11:05   ` Alex Bennée
  2024-02-15  8:57 ` [PATCH v4 00/10] Optimize buffer_is_zero Alexander Monakov
  10 siblings, 2 replies; 27+ messages in thread
From: Richard Henderson @ 2024-02-15  8:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: amonakov, mmromanov

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---

RFC because I've not benchmarked this on real hw, only run it
through qemu for validation.

---
 host/include/aarch64/host/cpuinfo.h |  1 +
 util/bufferiszero.c                 | 49 +++++++++++++++++++++++++++++
 util/cpuinfo-aarch64.c              |  1 +
 meson.build                         | 13 ++++++++
 4 files changed, 64 insertions(+)

diff --git a/host/include/aarch64/host/cpuinfo.h b/host/include/aarch64/host/cpuinfo.h
index fe671534e4..b4b816cd07 100644
--- a/host/include/aarch64/host/cpuinfo.h
+++ b/host/include/aarch64/host/cpuinfo.h
@@ -12,6 +12,7 @@
 #define CPUINFO_AES             (1u << 3)
 #define CPUINFO_PMULL           (1u << 4)
 #define CPUINFO_BTI             (1u << 5)
+#define CPUINFO_SVE             (1u << 6)
 
 /* Initialized with a constructor. */
 extern unsigned cpuinfo;
diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 2809b09225..af64c9c224 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -270,13 +270,62 @@ static bool buffer_is_zero_simd(const void *buf, size_t len)
     return vaddvq_u32(vceqzq_u32(t0)) == -4;
 }
 
+#ifdef CONFIG_SVE_OPT
+#include <arm_sve.h>
+
+#ifndef __ARM_FEATURE_SVE
+__attribute__((target("+sve")))
+#endif
+static bool buffer_is_zero_sve(const void *buf, size_t len)
+{
+    svbool_t p, t = svptrue_b8();
+    size_t i, n;
+
+    /*
+     * For the first vector, align to 16 -- reading 1 to 256 bytes.
+     * Note this routine is only called with len >= 256, which is the
+     * architectural maximum vector length: the first vector always fits.
+     */
+    i = 0;
+    n = QEMU_ALIGN_PTR_DOWN(buf + svcntb(), 16) - buf;
+    p = svwhilelt_b8(i, n);
+
+    do {
+        svuint8_t d = svld1_u8(p, buf + i);
+
+        p = svcmpne_n_u8(t, d, 0);
+        if (unlikely(svptest_any(t, p))) {
+            return false;
+        }
+        i += n;
+        n = svcntb();
+        p = svwhilelt_b8(i, len);
+    } while (svptest_any(t, p));
+
+    return true;
+}
+#endif /* CONFIG_SVE_OPT */
+
 static biz_accel_fn const accel_table[] = {
     buffer_is_zero_int_ge256,
     buffer_is_zero_simd,
+#ifdef CONFIG_SVE_OPT
+    buffer_is_zero_sve,
+#endif
 };
 
+#ifdef CONFIG_SVE_OPT
+static unsigned accel_index;
+static void __attribute__((constructor)) init_accel(void)
+{
+    accel_index = (cpuinfo & CPUINFO_SVE ? 2 : 1);
+    buffer_is_zero_accel = accel_table[accel_index];
+}
+#define INIT_ACCEL NULL
+#else
 static unsigned accel_index = 1;
 #define INIT_ACCEL buffer_is_zero_simd
+#endif /* CONFIG_SVE_OPT */
 
 bool test_buffer_is_zero_next_accel(void)
 {
diff --git a/util/cpuinfo-aarch64.c b/util/cpuinfo-aarch64.c
index 4c8a005715..a1e22ea66e 100644
--- a/util/cpuinfo-aarch64.c
+++ b/util/cpuinfo-aarch64.c
@@ -61,6 +61,7 @@ unsigned __attribute__((constructor)) cpuinfo_init(void)
     info |= (hwcap & HWCAP_USCAT ? CPUINFO_LSE2 : 0);
     info |= (hwcap & HWCAP_AES ? CPUINFO_AES : 0);
     info |= (hwcap & HWCAP_PMULL ? CPUINFO_PMULL : 0);
+    info |= (hwcap & HWCAP_SVE ? CPUINFO_SVE : 0);
 
     unsigned long hwcap2 = qemu_getauxval(AT_HWCAP2);
     info |= (hwcap2 & HWCAP2_BTI ? CPUINFO_BTI : 0);
diff --git a/meson.build b/meson.build
index c1dc83e4c0..89a8241bc0 100644
--- a/meson.build
+++ b/meson.build
@@ -2822,6 +2822,18 @@ config_host_data.set('CONFIG_ARM_AES_BUILTIN', cc.compiles('''
     void foo(uint8x16_t *p) { *p = vaesmcq_u8(*p); }
   '''))
 
+config_host_data.set('CONFIG_SVE_OPT', cc.compiles('''
+    #include <arm_sve.h>
+    #ifndef __ARM_FEATURE_SVE
+    __attribute__((target("+sve")))
+    #endif
+    void foo(void *p) {
+        svbool_t t = svptrue_b8();
+        svuint8_t d = svld1_u8(t, p);
+        svptest_any(t, svcmpne_n_u8(t, d, 0));
+    }
+  '''))
+
 have_pvrdma = get_option('pvrdma') \
   .require(rdma.found(), error_message: 'PVRDMA requires OpenFabrics libraries') \
   .require(cc.compiles(gnu_source_prefix + '''
@@ -4232,6 +4244,7 @@ summary_info += {'memory allocator':  get_option('malloc')}
 summary_info += {'avx2 optimization': config_host_data.get('CONFIG_AVX2_OPT')}
 summary_info += {'avx512bw optimization': config_host_data.get('CONFIG_AVX512BW_OPT')}
 summary_info += {'avx512f optimization': config_host_data.get('CONFIG_AVX512F_OPT')}
+summary_info += {'sve optimization': config_host_data.get('CONFIG_SVE_OPT')}
 summary_info += {'gcov':              get_option('b_coverage')}
 summary_info += {'thread sanitizer':  get_option('tsan')}
 summary_info += {'CFI support':       get_option('cfi')}
-- 
2.34.1




* Re: [PATCH v4 07/10] util/bufferiszero: Introduce biz_accel_fn typedef
  2024-02-15  8:14 ` [PATCH v4 07/10] util/bufferiszero: Introduce biz_accel_fn typedef Richard Henderson
@ 2024-02-15  8:34   ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 27+ messages in thread
From: Philippe Mathieu-Daudé @ 2024-02-15  8:34 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: amonakov, mmromanov

On 15/2/24 09:14, Richard Henderson wrote:
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>   util/bufferiszero.c | 9 ++++++---
>   1 file changed, 6 insertions(+), 3 deletions(-)

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>




* Re: [PATCH v4 08/10] util/bufferiszero: Simplify test_buffer_is_zero_next_accel
  2024-02-15  8:14 ` [PATCH v4 08/10] util/bufferiszero: Simplify test_buffer_is_zero_next_accel Richard Henderson
@ 2024-02-15  8:40   ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 27+ messages in thread
From: Philippe Mathieu-Daudé @ 2024-02-15  8:40 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: amonakov, mmromanov

On 15/2/24 09:14, Richard Henderson wrote:
> Because the three alternatives are monotonic, we don't
> need to keep a couple of bitmasks, just identify the
> strongest alternative at startup.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>   util/bufferiszero.c | 56 ++++++++++++++++++---------------------------
>   1 file changed, 22 insertions(+), 34 deletions(-)

enum {
   ACCEL_DEFAULT,
   ACCEL_SSE2,
   ACCEL_AVX2,
};

> +static biz_accel_fn const accel_table[] = {

   [ACCEL_DEFAULT] =

> +    buffer_is_zero_int_ge256,

   [ACCEL_SSE2] =

> +    buffer_zero_sse2,
>   #ifdef CONFIG_AVX2_OPT

   [ACCEL_AVX2] =

> +    buffer_zero_avx2,
>   #endif


> +static unsigned accel_index;
>   
>   static void __attribute__((constructor)) init_accel(void)
>   {
> -    used_accel = select_accel_cpuinfo(cpuinfo_init());
> +    unsigned info = cpuinfo_init();
> +    unsigned index = (info & CPUINFO_SSE2 ? 1 : 0);

   ... ? ACCEL_SSE2 : ACCEL_DEFAULT;

> +
> +#ifdef CONFIG_AVX2_OPT
> +    if (info & CPUINFO_AVX2) {
> +        index = 2;

   ... = ACCEL_AVX2

> +    }
> +#endif
> +
> +    accel_index = index;
> +    buffer_is_zero_accel = accel_table[index];
>   }
Preferably introducing accel enum:
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>



* Re: [PATCH v4 09/10] util/bufferiszero: Add simd acceleration for aarch64
  2024-02-15  8:14 ` [PATCH v4 09/10] util/bufferiszero: Add simd acceleration for aarch64 Richard Henderson
@ 2024-02-15  8:47   ` Alexander Monakov
  2024-02-15 17:47     ` Richard Henderson
  0 siblings, 1 reply; 27+ messages in thread
From: Alexander Monakov @ 2024-02-15  8:47 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, mmromanov


On Wed, 14 Feb 2024, Richard Henderson wrote:

> Because non-embedded aarch64 is expected to have AdvSIMD enabled, merely
> double-check with the compiler flags for __ARM_NEON and don't bother with
> a runtime check.  Otherwise, model the loop after the x86 SSE2 function,
> and use VADDV to reduce the four vector comparisons.

I am not very familiar with Neon but I wonder if this couldn't use SHRN
for the final 128b->64b reduction similar to 2022 Glibc optimizations:
https://inbox.sourceware.org/libc-alpha/20220620174628.2820531-1-danilak@google.com/

In git history I see the previous Neon buffer_is_zero was removed because
it was not faster. Is it because integer LDP was as good as vector loads
at saturating load bandwidth on older cores, and things are different now?

Alexander

> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  util/bufferiszero.c | 74 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 74 insertions(+)
> 
> diff --git a/util/bufferiszero.c b/util/bufferiszero.c
> index 4eef6d47bc..2809b09225 100644
> --- a/util/bufferiszero.c
> +++ b/util/bufferiszero.c
> @@ -214,7 +214,81 @@ bool test_buffer_is_zero_next_accel(void)
>      }
>      return false;
>  }
> +
> +#elif defined(__aarch64__) && defined(__ARM_NEON)
> +#include <arm_neon.h>
> +
> +#define REASSOC_BARRIER(vec0, vec1) asm("" : "+w"(vec0), "+w"(vec1))
> +
> +static bool buffer_is_zero_simd(const void *buf, size_t len)
> +{
> +    uint32x4_t t0, t1, t2, t3;
> +
> +    /* Align head/tail to 16-byte boundaries.  */
> +    const uint32x4_t *p = QEMU_ALIGN_PTR_DOWN(buf + 16, 16);
> +    const uint32x4_t *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 16);
> +
> +    /* Unaligned loads at head/tail.  */
> +    t0 = vld1q_u32(buf) | vld1q_u32(buf + len - 16);
> +
> +    /* Collect a partial block at tail end.  */
> +    t1 = e[-7] | e[-6];
> +    t2 = e[-5] | e[-4];
> +    t3 = e[-3] | e[-2];
> +    t0 |= e[-1];
> +    REASSOC_BARRIER(t0, t1);
> +    REASSOC_BARRIER(t2, t3);
> +    t0 |= t1;
> +    t2 |= t3;
> +    REASSOC_BARRIER(t0, t2);
> +    t0 |= t2;
> +
> +    /*
> +     * Loop over complete 128-byte blocks.
> +     * With the head and tail removed, e - p >= 14, so the loop
> +     * must iterate at least once.
> +     */
> +    do {
> +        /* Each comparison is [-1,0], so reduction is in [-4..0]. */
> +        if (unlikely(vaddvq_u32(vceqzq_u32(t0)) != -4)) {
> +            return false;
> +        }
> +
> +        t0 = p[0] | p[1];
> +        t1 = p[2] | p[3];
> +        t2 = p[4] | p[5];
> +        t3 = p[6] | p[7];
> +        REASSOC_BARRIER(t0, t1);
> +        REASSOC_BARRIER(t2, t3);
> +        t0 |= t1;
> +        t2 |= t3;
> +        REASSOC_BARRIER(t0, t2);
> +        t0 |= t2;
> +        p += 8;
> +    } while (p < e - 7);
> +
> +    return vaddvq_u32(vceqzq_u32(t0)) == -4;
> +}
> +
> +static biz_accel_fn const accel_table[] = {
> +    buffer_is_zero_int_ge256,
> +    buffer_is_zero_simd,
> +};
> +
> +static unsigned accel_index = 1;
> +#define INIT_ACCEL buffer_is_zero_simd
> +
> +bool test_buffer_is_zero_next_accel(void)
> +{
> +    if (accel_index != 0) {
> +        buffer_is_zero_accel = accel_table[--accel_index];
> +        return true;
> +    }
> +    return false;
> +}
> +
>  #else
> +
>  bool test_buffer_is_zero_next_accel(void)
>  {
>      return false;
> 



* Re: [PATCH v4 00/10] Optimize buffer_is_zero
  2024-02-15  8:14 [PATCH v4 00/10] Optimize buffer_is_zero Richard Henderson
                   ` (9 preceding siblings ...)
  2024-02-15  8:14 ` [RFC PATCH v4 10/10] util/bufferiszero: Add sve " Richard Henderson
@ 2024-02-15  8:57 ` Alexander Monakov
  2024-02-15 21:16   ` Richard Henderson
  10 siblings, 1 reply; 27+ messages in thread
From: Alexander Monakov @ 2024-02-15  8:57 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, mmromanov


On Wed, 14 Feb 2024, Richard Henderson wrote:

> v3: https://patchew.org/QEMU/20240206204809.9859-1-amonakov@ispras.ru/
> 
> Changes for v4:
>   - Keep separate >= 256 entry point, but only keep constant length
>     check inline.  This allows the indirect function call to be hidden
>     and optimized away when the pointer is constant.

Sorry, I don't understand this. Most of the improvement (at least in our
testing) comes from inlining the byte checks, which often fail and eliminate
call overhead entirely. Moving them out-of-line seems to lose most of the
speedup the patchset was bringing, doesn't it? Is there some concern I am
not seeing?

>   - Split out a >= 256 integer routine.
>   - Simplify acceleration selection for testing.
>   - Add function pointer typedef.
>   - Implement new aarch64 accelerations.
> 
> 
> r~
> 
> 
> Alexander Monakov (5):
>   util/bufferiszero: Remove SSE4.1 variant
>   util/bufferiszero: Remove AVX512 variant
>   util/bufferiszero: Reorganize for early test for acceleration
>   util/bufferiszero: Remove useless prefetches
>   util/bufferiszero: Optimize SSE2 and AVX2 variants
> 
> Richard Henderson (5):
>   util/bufferiszero: Improve scalar variant
>   util/bufferiszero: Introduce biz_accel_fn typedef
>   util/bufferiszero: Simplify test_buffer_is_zero_next_accel
>   util/bufferiszero: Add simd acceleration for aarch64
>   util/bufferiszero: Add sve acceleration for aarch64
> 
>  host/include/aarch64/host/cpuinfo.h |   1 +
>  include/qemu/cutils.h               |  15 +-
>  util/bufferiszero.c                 | 500 ++++++++++++++++------------
>  util/cpuinfo-aarch64.c              |   1 +
>  meson.build                         |  13 +
>  5 files changed, 323 insertions(+), 207 deletions(-)
> 
> 



* Re: [PATCH v4 09/10] util/bufferiszero: Add simd acceleration for aarch64
  2024-02-15  8:47   ` Alexander Monakov
@ 2024-02-15 17:47     ` Richard Henderson
  2024-02-15 18:46       ` Alexander Monakov
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2024-02-15 17:47 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: qemu-devel, mmromanov

On 2/14/24 22:47, Alexander Monakov wrote:
> 
> On Wed, 14 Feb 2024, Richard Henderson wrote:
> 
>> Because non-embedded aarch64 is expected to have AdvSIMD enabled, merely
>> double-check with the compiler flags for __ARM_NEON and don't bother with
>> a runtime check.  Otherwise, model the loop after the x86 SSE2 function,
>> and use VADDV to reduce the four vector comparisons.
> 
> I am not very familiar with Neon but I wonder if this couldn't use SHRN
> for the final 128b->64b reduction similar to 2022 Glibc optimizations:
> https://inbox.sourceware.org/libc-alpha/20220620174628.2820531-1-danilak@google.com/

The reason they use SHRN for memchr is that they have also applied a mask
to the comparison so that they can identify which byte contained the match.
That is not required here, so any reduction will do.


> In git history I see the previous Neon buffer_is_zero was removed because
> it was not faster. Is it because integer LDP was as good as vector loads
> at saturating load bandwidth on older cores, and things are different now?

The old reduction was a bit silly,

-#define DO_NONZERO(X)  (vgetq_lane_u64((X), 0) | vgetq_lane_u64((X), 1))

performing two cross-register-set fetches.  It's also possible that we were saturating the 
load bandwidth on the old mustang.  This time I'm testing on a neoverse-n1, which is quite 
a few years newer.

The loop kernel compiles to this:

  19c:   ad401c20        ldp     q0, q7, [x1]
  1a0:   ad411823        ldp     q3, q6, [x1, #32]
  1a4:   ad421421        ldp     q1, q5, [x1, #64]
  1a8:   ad431022        ldp     q2, q4, [x1, #96]
  1ac:   91020021        add     x1, x1, #0x80
  1b0:   4ea71c00        orr     v0.16b, v0.16b, v7.16b
  1b4:   4ea61c63        orr     v3.16b, v3.16b, v6.16b
  1b8:   4ea51c21        orr     v1.16b, v1.16b, v5.16b
  1bc:   4ea41c42        orr     v2.16b, v2.16b, v4.16b
  1c0:   4ea31c00        orr     v0.16b, v0.16b, v3.16b
  1c4:   4ea21c21        orr     v1.16b, v1.16b, v2.16b
  1c8:   4ea11c00        orr     v0.16b, v0.16b, v1.16b
  1cc:   eb03003f        cmp     x1, x3
  1d0:   54000162        b.cs    1fc <buffer_is_zero_simd+0xb8>  // b.hs, b.nlast
  1d4:   4ea09800        cmeq    v0.4s, v0.4s, #0
  1d8:   4eb1b800        addv    s0, v0.4s
  1dc:   1e260000        fmov    w0, s0
  1e0:   3100101f        cmn     w0, #0x4
  1e4:   54fffdc0        b.eq    19c <buffer_is_zero_simd+0x58>  // b.none


r~

> 
> Alexander
> 
>>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>> ---
>>   util/bufferiszero.c | 74 +++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 74 insertions(+)
>>
>> diff --git a/util/bufferiszero.c b/util/bufferiszero.c
>> index 4eef6d47bc..2809b09225 100644
>> --- a/util/bufferiszero.c
>> +++ b/util/bufferiszero.c
>> @@ -214,7 +214,81 @@ bool test_buffer_is_zero_next_accel(void)
>>       }
>>       return false;
>>   }
>> +
>> +#elif defined(__aarch64__) && defined(__ARM_NEON)
>> +#include <arm_neon.h>
>> +
>> +#define REASSOC_BARRIER(vec0, vec1) asm("" : "+w"(vec0), "+w"(vec1))
>> +
>> +static bool buffer_is_zero_simd(const void *buf, size_t len)
>> +{
>> +    uint32x4_t t0, t1, t2, t3;
>> +
>> +    /* Align head/tail to 16-byte boundaries.  */
>> +    const uint32x4_t *p = QEMU_ALIGN_PTR_DOWN(buf + 16, 16);
>> +    const uint32x4_t *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 16);
>> +
>> +    /* Unaligned loads at head/tail.  */
>> +    t0 = vld1q_u32(buf) | vld1q_u32(buf + len - 16);
>> +
>> +    /* Collect a partial block at tail end.  */
>> +    t1 = e[-7] | e[-6];
>> +    t2 = e[-5] | e[-4];
>> +    t3 = e[-3] | e[-2];
>> +    t0 |= e[-1];
>> +    REASSOC_BARRIER(t0, t1);
>> +    REASSOC_BARRIER(t2, t3);
>> +    t0 |= t1;
>> +    t2 |= t3;
>> +    REASSOC_BARRIER(t0, t2);
>> +    t0 |= t2;
>> +
>> +    /*
>> +     * Loop over complete 128-byte blocks.
>> +     * With the head and tail removed, e - p >= 14, so the loop
>> +     * must iterate at least once.
>> +     */
>> +    do {
>> +        /* Each comparison is [-1,0], so reduction is in [-4..0]. */
>> +        if (unlikely(vaddvq_u32(vceqzq_u32(t0)) != -4)) {
>> +            return false;
>> +        }
>> +
>> +        t0 = p[0] | p[1];
>> +        t1 = p[2] | p[3];
>> +        t2 = p[4] | p[5];
>> +        t3 = p[6] | p[7];
>> +        REASSOC_BARRIER(t0, t1);
>> +        REASSOC_BARRIER(t2, t3);
>> +        t0 |= t1;
>> +        t2 |= t3;
>> +        REASSOC_BARRIER(t0, t2);
>> +        t0 |= t2;
>> +        p += 8;
>> +    } while (p < e - 7);
>> +
>> +    return vaddvq_u32(vceqzq_u32(t0)) == -4;
>> +}
>> +
>> +static biz_accel_fn const accel_table[] = {
>> +    buffer_is_zero_int_ge256,
>> +    buffer_is_zero_simd,
>> +};
>> +
>> +static unsigned accel_index = 1;
>> +#define INIT_ACCEL buffer_is_zero_simd
>> +
>> +bool test_buffer_is_zero_next_accel(void)
>> +{
>> +    if (accel_index != 0) {
>> +        buffer_is_zero_accel = accel_table[--accel_index];
>> +        return true;
>> +    }
>> +    return false;
>> +}
>> +
>>   #else
>> +
>>   bool test_buffer_is_zero_next_accel(void)
>>   {
>>       return false;
>>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 09/10] util/bufferiszero: Add simd acceleration for aarch64
  2024-02-15 17:47     ` Richard Henderson
@ 2024-02-15 18:46       ` Alexander Monakov
  2024-02-15 21:10         ` Richard Henderson
  0 siblings, 1 reply; 27+ messages in thread
From: Alexander Monakov @ 2024-02-15 18:46 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, mmromanov


On Thu, 15 Feb 2024, Richard Henderson wrote:

> On 2/14/24 22:47, Alexander Monakov wrote:
> > 
> > On Wed, 14 Feb 2024, Richard Henderson wrote:
> > 
> >> Because non-embedded aarch64 is expected to have AdvSIMD enabled, merely
> >> double-check with the compiler flags for __ARM_NEON and don't bother with
> >> a runtime check.  Otherwise, model the loop after the x86 SSE2 function,
> >> and use VADDV to reduce the four vector comparisons.
> > 
> > I am not very familiar with Neon but I wonder if this couldn't use SHRN
> > for the final 128b->64b reduction similar to 2022 Glibc optimizations:
> > https://inbox.sourceware.org/libc-alpha/20220620174628.2820531-1-danilak@google.com/
> 
> The reason they use SHRN for memchr is that they have also applied a mask
> to the comparison so that they can identify which byte contained the match.
> That is not required here, so any reduction will do.

Right, so we can pick the cheapest reduction method, and if I'm reading
Neoverse-N1 SOG right, SHRN is marginally cheaper than ADDV (latency 2
instead of 3), and it should be generally preferable on other cores, no?

For that matter, cannot UQXTN (unsigned saturating extract narrow) be
used in place of CMEQ+ADDV here?

Alexander


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 09/10] util/bufferiszero: Add simd acceleration for aarch64
  2024-02-15 18:46       ` Alexander Monakov
@ 2024-02-15 21:10         ` Richard Henderson
  0 siblings, 0 replies; 27+ messages in thread
From: Richard Henderson @ 2024-02-15 21:10 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: qemu-devel, mmromanov

On 2/15/24 08:46, Alexander Monakov wrote:
> Right, so we can pick the cheapest reduction method, and if I'm reading
> Neoverse-N1 SOG right, SHRN is marginally cheaper than ADDV (latency 2
> instead of 3), and it should be generally preferable on other cores, no?

Fair.

> For that matter, cannot UQXTN (unsigned saturating extract narrow) be
> used in place of CMEQ+ADDV here?

Interesting.  I hadn't thought about using saturation to preserve non-zeroness like that.

Using 1 4-cycle insn instead of 2 2-cycle insns is interesting as well.  I suppose, since 
it's at the end of the dependency chain, the fact that it is restricted to the V1 pipe 
matters not at all.
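
Something like this, I suppose, as a sketch of the tail check (helper name
invented, untested; UQXTN is the vqmovn_* intrinsic):

    /*
     * Narrow the two 64-bit lanes with saturation: any nonzero input
     * bit leaves the 64-bit result nonzero, so a single scalar compare
     * replaces CMEQ+ADDV.
     */
    static inline bool chunk_is_zero(uint32x4_t t0)
    {
        uint32x2_t n = vqmovn_u64(vreinterpretq_u64_u32(t0));
        return vget_lane_u64(vreinterpret_u64_u32(n), 0) == 0;
    }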


r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 00/10] Optimize buffer_is_zero
  2024-02-15  8:57 ` [PATCH v4 00/10] Optimize buffer_is_zero Alexander Monakov
@ 2024-02-15 21:16   ` Richard Henderson
  2024-02-15 21:36     ` Alexander Monakov
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2024-02-15 21:16 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: qemu-devel, mmromanov

On 2/14/24 22:57, Alexander Monakov wrote:
> 
> On Wed, 14 Feb 2024, Richard Henderson wrote:
> 
>> v3: https://patchew.org/QEMU/20240206204809.9859-1-amonakov@ispras.ru/
>>
>> Changes for v4:
>>    - Keep separate >= 256 entry point, but only keep constant length
>>      check inline.  This allows the indirect function call to be hidden
>>      and optimized away when the pointer is constant.
> 
> Sorry, I don't understand this. Most of the improvement (at least in our
> testing) comes from inlining the byte checks, which often fail and eliminate
> call overhead entirely. Moving them out-of-line seems to lose most of the
> speedup the patchset was bringing, doesn't it? Is there some concern I am
> not seeing?

What is your benchmarking method?

It was my guess that most of the improvement came from performing those early byte checks 
*at all*, and that the overhead of a function call to a small out of line wrapper would be 
negligible.

By not exposing the function pointer outside the bufferiszero translation unit, the 
compiler can see when the pointer is never modified for a given host, and then transform 
the indirect branch to a direct branch.
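
Roughly this shape, as a sketch (helper names illustrative, not necessarily
the exact spelling in v4):

    /*
     * Header wrapper: only the constant-length check is inlined; the
     * accelerator function pointer never leaves bufferiszero.c, so the
     * compiler there can turn the indirect call into a direct one.
     */
    static inline bool buffer_is_zero(const void *buf, size_t len)
    {
        if (__builtin_constant_p(len) && len >= 256) {
            return buffer_is_zero_ge256(buf, len);  /* out of line */
        }
        return buffer_is_zero_ool(buf, len);        /* out of line */
    }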


r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 00/10] Optimize buffer_is_zero
  2024-02-15 21:16   ` Richard Henderson
@ 2024-02-15 21:36     ` Alexander Monakov
  2024-02-15 22:27       ` Richard Henderson
  0 siblings, 1 reply; 27+ messages in thread
From: Alexander Monakov @ 2024-02-15 21:36 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, mmromanov


On Thu, 15 Feb 2024, Richard Henderson wrote:

> On 2/14/24 22:57, Alexander Monakov wrote:
> > 
> > On Wed, 14 Feb 2024, Richard Henderson wrote:
> > 
> >> v3: https://patchew.org/QEMU/20240206204809.9859-1-amonakov@ispras.ru/
> >>
> >> Changes for v4:
> >>    - Keep separate >= 256 entry point, but only keep constant length
> >>      check inline.  This allows the indirect function call to be hidden
> >>      and optimized away when the pointer is constant.
> > 
> > Sorry, I don't understand this. Most of the improvement (at least in our
> > testing) comes from inlining the byte checks, which often fail and eliminate
> > call overhead entirely. Moving them out-of-line seems to lose most of the
> > speedup the patchset was bringing, doesn't it? Is there some concern I am
> > not seeing?
> 
> What is your benchmarking method?

Converting a 4.4 GiB Windows 10 image to qcow2. It was mentioned in v1 and v2,
are you saying they did not reach your inbox?
https://lore.kernel.org/qemu-devel/20231013155856.21475-1-mmromanov@ispras.ru/
https://lore.kernel.org/qemu-devel/20231027143704.7060-1-mmromanov@ispras.ru/

> It was my guess that most of the improvement came from performing those early
> byte checks *at all*, and that the overhead of a function call to a small out
> of line wrapper would be negligible.

qemu-img invokes buffer_is_zero in a fairly tight loop. Let us know if you
need numbers for how much the out-of-line version loses.

> By not exposing the function pointer outside the bufferiszero translation
> unit, the compiler can see when the pointer is never modified for a given
> host, and then transform the indirect branch to a direct branch.

Okay, but that does not make it necessary to move byte checks out of line.
I was preparing a rebase that does not expose the function pointer to the
inline wrapper. I was completely unaware that you're taking over the patchset.

Alexander


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 00/10] Optimize buffer_is_zero
  2024-02-15 21:36     ` Alexander Monakov
@ 2024-02-15 22:27       ` Richard Henderson
  2024-02-15 23:37         ` Alexander Monakov
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2024-02-15 22:27 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: qemu-devel, mmromanov

On 2/15/24 11:36, Alexander Monakov wrote:
> 
> On Thu, 15 Feb 2024, Richard Henderson wrote:
> 
>> On 2/14/24 22:57, Alexander Monakov wrote:
>>>
>>> On Wed, 14 Feb 2024, Richard Henderson wrote:
>>>
>>>> v3: https://patchew.org/QEMU/20240206204809.9859-1-amonakov@ispras.ru/
>>>>
>>>> Changes for v4:
>>>>     - Keep separate >= 256 entry point, but only keep constant length
>>>>       check inline.  This allows the indirect function call to be hidden
>>>>       and optimized away when the pointer is constant.
>>>
>>> Sorry, I don't understand this. Most of the improvement (at least in our
>>> testing) comes from inlining the byte checks, which often fail and eliminate
>>> call overhead entirely. Moving them out-of-line seems to lose most of the
>>> speedup the patchset was bringing, doesn't it? Is there some concern I am
>>> not seeing?
>>
>> What is your benchmarking method?
> 
> Converting a 4.4 GiB Windows 10 image to qcow2. It was mentioned in v1 and v2,
> are you saying they did not reach your inbox?
> https://lore.kernel.org/qemu-devel/20231013155856.21475-1-mmromanov@ispras.ru/
> https://lore.kernel.org/qemu-devel/20231027143704.7060-1-mmromanov@ispras.ru/

I'm saying that this is not a reproducible description of methodology.

With master, so with neither of our changes:

I tried converting an 80G win7 image that I happened to have lying about, I see 
buffer_zero_avx2 with only 3.03% perf overhead.  Then I tried truncating the image to 16G 
to see if having the entire image in ram would help -- not yet, still only 3.4% perf 
overhead.  Finally, I truncated the image to 4G and saw 2.9% overhead.

So... help me out here.  I would like to be able to see results that are at least vaguely 
similar.


r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 00/10] Optimize buffer_is_zero
  2024-02-15 22:27       ` Richard Henderson
@ 2024-02-15 23:37         ` Alexander Monakov
  2024-02-16  8:11           ` Richard Henderson
  0 siblings, 1 reply; 27+ messages in thread
From: Alexander Monakov @ 2024-02-15 23:37 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, mmromanov


On Thu, 15 Feb 2024, Richard Henderson wrote:

> > Converting a 4.4 GiB Windows 10 image to qcow2. It was mentioned in v1 and
> > v2,
> > are you saying they did not reach your inbox?
> > https://lore.kernel.org/qemu-devel/20231013155856.21475-1-mmromanov@ispras.ru/
> > https://lore.kernel.org/qemu-devel/20231027143704.7060-1-mmromanov@ispras.ru/
> 
> I'm saying that this is not a reproducible description of methodology.
> 
> With master, so with neither of our changes:
> 
> I tried converting an 80G win7 image that I happened to have lying about, I
> see buffer_zero_avx2 with only 3.03% perf overhead.  Then I tried truncating
> the image to 16G to see if having the entire image in ram would help -- not
> yet, still only 3.4% perf overhead.  Finally, I truncated the image to 4G and
> saw 2.9% overhead.
> 
> So... help me out here.  I would like to be able to see results that are at
> least vaguely similar.

Ah, I guess you might be running at low perf_event_paranoid setting that
allows unprivileged sampling of kernel events? In our submissions the
percentage was for perf_event_paranoid=2, i.e. relative to Qemu only,
excluding kernel time under syscalls.

Retrieve IE11.Win7.VirtualBox.zip from
https://archive.org/details/ie11.win7.virtualbox
and use

  unzip -p IE11.Win7.VirtualBox.zip | tar xv

to extract 'IE11 - Win7-disk001.vmdk'.

(Mikhail used a different image when preparing the patch)

On this image, I get 70% in buffer_zero_sse2 on a Sandy Bridge running

  qemu-img convert 'IE11 - Win7-disk001.vmdk' -O qcow2 /tmp/t.qcow2

user:kernel time is about 0.15:2.3, so 70% relative to user time does
roughly correspond to a single-digit percentage relative to (user+kernel) time.

(which does tell us that qemu-img is doing I/O inefficiently, it shouldn't
need two seconds to read a fully cached 5 Gigabyte file)

Alexander


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 00/10] Optimize buffer_is_zero
  2024-02-15 23:37         ` Alexander Monakov
@ 2024-02-16  8:11           ` Richard Henderson
  2024-02-16 20:20             ` Alexander Monakov
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2024-02-16  8:11 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: qemu-devel, mmromanov

On 2/15/24 13:37, Alexander Monakov wrote:
> Ah, I guess you might be running at low perf_event_paranoid setting that
> allows unprivileged sampling of kernel events? In our submissions the
> percentage was for perf_event_paranoid=2, i.e. relative to Qemu only,
> excluding kernel time under syscalls.

Ok.  Eliminating kernel samples makes things easier to see.
But I still do not see a 40% reduction in runtime.

Just so we're on the same page:

> Retrieve IE11.Win7.VirtualBox.zip from
> https://archive.org/details/ie11.win7.virtualbox
> and use
> 
>   unzip -p IE11.Win7.VirtualBox.zip | tar xv
> 
> to extract 'IE11 - Win7-disk001.vmdk'.
> 
> (Mikhail used a different image when preparing the patch)
> 
> On this image, I get 70% in buffer_zero_sse2 on a Sandy Bridge running
> 
>   qemu-img convert 'IE11 - Win7-disk001.vmdk' -O qcow2 /tmp/t.qcow2

With this, I see virtually all of the runtime in libz.so.
Therefore I converted this to raw first, to focus on the issue.

For avoidance of doubt:

$ ls -lsh test.raw && sha256sum test.raw
  12G -rw-r--r--  1 rth  rth   40G Feb 15 21:14 test.raw
3b056d839952538fed42fa898c6063646f4fda1bf7ea0180fbb5f29d21fe8e80  test.raw

Host: 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz
Compiler: gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)

master:
   57.48%  qemu-img-m  [.] buffer_zero_avx2
    3.60%  qemu-img-m  [.] is_allocated_sectors.part.0
    2.61%  qemu-img-m  [.] buffer_is_zero
   63.69%  -- total

v3:
   48.86%  qemu-img-v3  [.] is_allocated_sectors.part.0
    3.79%  qemu-img-v3  [.] buffer_zero_avx2
   52.65%  -- total
     -17%  -- reduction from master

v4:
   54.60%  qemu-img-v4  [.] buffer_is_zero_ge256
    3.30%  qemu-img-v4  [.] buffer_zero_avx2
    3.17%  qemu-img-v4  [.] is_allocated_sectors.part.0
   61.07%  -- total
      -4%  -- reduction from master

v4+:
   46.65%  qemu-img  [.] is_allocated_sectors.part.0
    3.49%  qemu-img  [.] buffer_zero_avx2
    0.05%  qemu-img  [.] buffer_is_zero_ge256
   50.19%  -- total
     -21%  -- reduction from master

The v4+ puts the 3 byte test back inline, like in your v3.

Importantly, it must be done as 3 short-circuiting tests, where my v4 "simplified" this to
(s | m | e) != 0, on the assumption that the reduced number of branches would help.

Diving into perf, it becomes clear why:

  57.36 │       cmpb   $0x0,(%rbx)
   4.02 │     ↓ jne    89
  21.84 │       cmpb   $0x0,0x1ff(%rbx)
   0.64 │     ↓ jne    89
   8.45 │       cmpb   $0x0,0x100(%rbx)
   0.26 │     ↓ jne    89
   0.06 │       mov    $0x200,%esi
        │       mov    %rbx,%rdi
   0.07 │     → call   buffer_is_zero_ge256

The three bytes are on 3 different cachelines.  Judging by the relative percentages, it 
would seem that the first byte alone eliminates slightly more than half of all blocks; the 
last byte eliminates more than half again; the middle byte eliminates a fair fraction of 
the rest.  With the short-circuit, the extra cachelines are not touched.

This is so important that it should be spelled out in a comment.
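
Concretely, something along these lines for the inline test, comment and all
(sketch; the helper names here are illustrative):

    static inline bool buffer_is_zero_sample3(const void *vbuf, size_t len)
    {
        const char *buf = vbuf;

        /*
         * Check the first, last and middle byte with short-circuit
         * evaluation: the first byte alone rejects slightly more than
         * half of all blocks without touching the other two cache lines.
         */
        if (buf[0] || buf[len - 1] || buf[len / 2]) {
            return false;
        }
        /* All three sampled bytes are zero; scan the rest out of line. */
        return buffer_is_zero_ge256(buf, len);
    }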

With that settled, I guess we need to talk about how much the out-of-line implementation 
matters at all.  I'm thinking about writing a test/bench/bufferiszero, with all-zero 
buffers of various sizes and alignments.  With that it would be easier to talk about 
whether any given implementation is an improvement for that final 4% not eliminated by
the three bytes.
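
Something along these lines, probably (sketch only; GLib timing, sizes and
iteration count picked arbitrarily):

    /*
     * Report throughput of the current accelerator over an all-zero
     * buffer; the caller rotates accelerators with
     * test_buffer_is_zero_next_accel().
     */
    static void bench_one(size_t size)
    {
        enum { ITER = 100000 };
        void *buf = g_malloc0(size);
        int64_t t0, t1;

        t0 = g_get_monotonic_time();
        for (int i = 0; i < ITER; i++) {
            g_assert(buffer_is_zero(buf, size));
        }
        t1 = g_get_monotonic_time();

        /* g_get_monotonic_time() is in microseconds -> decimal GB/s */
        printf("%8zu bytes: %.1f GB/s\n", size,
               (double)size * ITER / (t1 - t0) / 1000.0);
        g_free(buf);
    }

    /* ... */
    do {
        bench_one(4096);
        bench_one(65536);
    } while (test_buffer_is_zero_next_accel());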

> (which does tell us that qemu-img is doing I/O inefficiently, it shouldn't
> need two seconds to read a fully cached 5 Gigabyte file)

Indeed!


r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v4 10/10] util/bufferiszero: Add sve acceleration for aarch64
  2024-02-15  8:14 ` [RFC PATCH v4 10/10] util/bufferiszero: Add sve " Richard Henderson
@ 2024-02-16  9:33   ` Alex Bennée
  2024-02-16 11:05   ` Alex Bennée
  1 sibling, 0 replies; 27+ messages in thread
From: Alex Bennée @ 2024-02-16  9:33 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, amonakov, mmromanov

Richard Henderson <richard.henderson@linaro.org> writes:

> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>
> RFC because I've not benchmarked this on real hw, only run it
> through qemu for validation.

I think we have an a64fx in the TCWG lab you could probably run the
tests on if you want. Otherwise I might be able to spin up a Graviton
on AWS to run a measurement. Do we have a benchmark test to run?

>
> ---
>  host/include/aarch64/host/cpuinfo.h |  1 +
>  util/bufferiszero.c                 | 49 +++++++++++++++++++++++++++++
>  util/cpuinfo-aarch64.c              |  1 +
>  meson.build                         | 13 ++++++++
>  4 files changed, 64 insertions(+)
>
> diff --git a/host/include/aarch64/host/cpuinfo.h b/host/include/aarch64/host/cpuinfo.h
> index fe671534e4..b4b816cd07 100644
> --- a/host/include/aarch64/host/cpuinfo.h
> +++ b/host/include/aarch64/host/cpuinfo.h
> @@ -12,6 +12,7 @@
>  #define CPUINFO_AES             (1u << 3)
>  #define CPUINFO_PMULL           (1u << 4)
>  #define CPUINFO_BTI             (1u << 5)
> +#define CPUINFO_SVE             (1u << 6)
>  
>  /* Initialized with a constructor. */
>  extern unsigned cpuinfo;
> diff --git a/util/bufferiszero.c b/util/bufferiszero.c
> index 2809b09225..af64c9c224 100644
> --- a/util/bufferiszero.c
> +++ b/util/bufferiszero.c
> @@ -270,13 +270,62 @@ static bool buffer_is_zero_simd(const void *buf, size_t len)
>      return vaddvq_u32(vceqzq_u32(t0)) == -4;
>  }
>  
> +#ifdef CONFIG_SVE_OPT
> +#include <arm_sve.h>
> +
> +#ifndef __ARM_FEATURE_SVE
> +__attribute__((target("+sve")))
> +#endif
> +static bool buffer_is_zero_sve(const void *buf, size_t len)
> +{
> +    svbool_t p, t = svptrue_b8();
> +    size_t i, n;
> +
> +    /*
> +     * For the first vector, align to 16 -- reading 1 to 256 bytes.
> +     * Note this routine is only called with len >= 256, which is the
> +     * architectural maximum vector length: the first vector always fits.
> +     */
> +    i = 0;
> +    n = QEMU_ALIGN_PTR_DOWN(buf + svcntb(), 16) - buf;
> +    p = svwhilelt_b8(i, n);
> +
> +    do {
> +        svuint8_t d = svld1_u8(p, buf + i);
> +
> +        p = svcmpne_n_u8(t, d, 0);
> +        if (unlikely(svptest_any(t, p))) {
> +            return false;
> +        }
> +        i += n;
> +        n = svcntb();
> +        p = svwhilelt_b8(i, len);
> +    } while (svptest_any(t, p));
> +
> +    return true;
> +}
> +#endif /* CONFIG_SVE_OPT */
> +
>  static biz_accel_fn const accel_table[] = {
>      buffer_is_zero_int_ge256,
>      buffer_is_zero_simd,
> +#ifdef CONFIG_SVE_OPT
> +    buffer_is_zero_sve,
> +#endif
>  };
>  
> +#ifdef CONFIG_SVE_OPT
> +static unsigned accel_index;
> +static void __attribute__((constructor)) init_accel(void)
> +{
> +    accel_index = (cpuinfo & CPUINFO_SVE ? 2 : 1);
> +    buffer_is_zero_accel = accel_table[accel_index];
> +}
> +#define INIT_ACCEL NULL
> +#else
>  static unsigned accel_index = 1;
>  #define INIT_ACCEL buffer_is_zero_simd
> +#endif /* CONFIG_SVE_OPT */
>  
>  bool test_buffer_is_zero_next_accel(void)
>  {
> diff --git a/util/cpuinfo-aarch64.c b/util/cpuinfo-aarch64.c
> index 4c8a005715..a1e22ea66e 100644
> --- a/util/cpuinfo-aarch64.c
> +++ b/util/cpuinfo-aarch64.c
> @@ -61,6 +61,7 @@ unsigned __attribute__((constructor)) cpuinfo_init(void)
>      info |= (hwcap & HWCAP_USCAT ? CPUINFO_LSE2 : 0);
>      info |= (hwcap & HWCAP_AES ? CPUINFO_AES : 0);
>      info |= (hwcap & HWCAP_PMULL ? CPUINFO_PMULL : 0);
> +    info |= (hwcap & HWCAP_SVE ? CPUINFO_SVE : 0);
>  
>      unsigned long hwcap2 = qemu_getauxval(AT_HWCAP2);
>      info |= (hwcap2 & HWCAP2_BTI ? CPUINFO_BTI : 0);
> diff --git a/meson.build b/meson.build
> index c1dc83e4c0..89a8241bc0 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -2822,6 +2822,18 @@ config_host_data.set('CONFIG_ARM_AES_BUILTIN', cc.compiles('''
>      void foo(uint8x16_t *p) { *p = vaesmcq_u8(*p); }
>    '''))
>  
> +config_host_data.set('CONFIG_SVE_OPT', cc.compiles('''
> +    #include <arm_sve.h>
> +    #ifndef __ARM_FEATURE_SVE
> +    __attribute__((target("+sve")))
> +    #endif
> +    void foo(void *p) {
> +        svbool_t t = svptrue_b8();
> +        svuint8_t d = svld1_u8(t, p);
> +        svptest_any(t, svcmpne_n_u8(t, d, 0));
> +    }
> +  '''))
> +
>  have_pvrdma = get_option('pvrdma') \
>    .require(rdma.found(), error_message: 'PVRDMA requires OpenFabrics libraries') \
>    .require(cc.compiles(gnu_source_prefix + '''
> @@ -4232,6 +4244,7 @@ summary_info += {'memory allocator':  get_option('malloc')}
>  summary_info += {'avx2 optimization': config_host_data.get('CONFIG_AVX2_OPT')}
>  summary_info += {'avx512bw optimization': config_host_data.get('CONFIG_AVX512BW_OPT')}
>  summary_info += {'avx512f optimization': config_host_data.get('CONFIG_AVX512F_OPT')}
> +summary_info += {'sve optimization': config_host_data.get('CONFIG_SVE_OPT')}
>  summary_info += {'gcov':              get_option('b_coverage')}
>  summary_info += {'thread sanitizer':  get_option('tsan')}
>  summary_info += {'CFI support':       get_option('cfi')}

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v4 10/10] util/bufferiszero: Add sve acceleration for aarch64
  2024-02-15  8:14 ` [RFC PATCH v4 10/10] util/bufferiszero: Add sve " Richard Henderson
  2024-02-16  9:33   ` Alex Bennée
@ 2024-02-16 11:05   ` Alex Bennée
  1 sibling, 0 replies; 27+ messages in thread
From: Alex Bennée @ 2024-02-16 11:05 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, amonakov, mmromanov

Richard Henderson <richard.henderson@linaro.org> writes:

> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>
> RFC because I've not benchmarked this on real hw, only run it
> through qemu for validation.
>
<snip>
>  
> +#ifdef CONFIG_SVE_OPT
> +static unsigned accel_index;
> +static void __attribute__((constructor)) init_accel(void)
> +{
> +    accel_index = (cpuinfo & CPUINFO_SVE ? 2 : 1);
> +    buffer_is_zero_accel = accel_table[accel_index];
> +}

This really needs to be:

  -    accel_index = (cpuinfo & CPUINFO_SVE ? 2 : 1);
  +    unsigned info = cpuinfo_init();
  +    accel_index = (info & CPUINFO_SVE ? 2 : 1);

because otherwise you are relying on constructor initialisation order,
and on the Graviton 3 I built on, it wasn't detecting SVE. With that change I
get this from "perf record ./tests/unit/test-bufferiszero -m thorough"

  51.17%  test-bufferisze  test-bufferiszero      [.] buffer_is_zero_sve
  18.92%  test-bufferisze  test-bufferiszero      [.] buffer_is_zero_simd
  18.02%  test-bufferisze  test-bufferiszero      [.] buffer_is_zero_int_ge256
   7.67%  test-bufferisze  test-bufferiszero      [.] buffer_is_zero_ool
   4.09%  test-bufferisze  test-bufferiszero      [.] test_1

but as I mentioned before, it would be nice to have a proper benchmark
for the buffer utils, as I'm sure the unit test would be prone to noise.

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 00/10] Optimize buffer_is_zero
  2024-02-16  8:11           ` Richard Henderson
@ 2024-02-16 20:20             ` Alexander Monakov
  2024-02-16 22:28               ` Richard Henderson
  0 siblings, 1 reply; 27+ messages in thread
From: Alexander Monakov @ 2024-02-16 20:20 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, mmromanov


On Thu, 15 Feb 2024, Richard Henderson wrote:

> On 2/15/24 13:37, Alexander Monakov wrote:
> > Ah, I guess you might be running at low perf_event_paranoid setting that
> > allows unprivileged sampling of kernel events? In our submissions the
> > percentage was for perf_event_paranoid=2, i.e. relative to Qemu only,
> > excluding kernel time under syscalls.
> 
> Ok.  Eliminating kernel samples makes things easier to see.
> But I still do not see a 40% reduction in runtime.

I suspect Mikhail's image was less sparse, so the impact from inlining
was greater.

> With this, I see virtually all of the runtime in libz.so.
> Therefore I converted this to raw first, to focus on the issue.

Ah, apologies for that. I built with --disable-default-features and
did not notice my qemu-img lacked support for vmdk and treated it
as a raw image instead. I was assuming it was similar to what Mikhail
used, but obviously it's not, due to the compression.

> For avoidance of doubt:
> 
> $ ls -lsh test.raw && sha256sum test.raw
>  12G -rw-r--r--  1 rth  rth   40G Feb 15 21:14 test.raw
> 3b056d839952538fed42fa898c6063646f4fda1bf7ea0180fbb5f29d21fe8e80  test.raw
> 
> Host: 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz
> Compiler: gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
> 
> master:
>   57.48%  qemu-img-m  [.] buffer_zero_avx2
>    3.60%  qemu-img-m  [.] is_allocated_sectors.part.0
>    2.61%  qemu-img-m  [.] buffer_is_zero
>   63.69%  -- total
> 
> v3:
>   48.86%  qemu-img-v3  [.] is_allocated_sectors.part.0
>   3.79%  qemu-img-v3  [.] buffer_zero_avx2
>   52.65%  -- total
>     -17%  -- reduction from master
> 
> v4:
>   54.60%  qemu-img-v4  [.] buffer_is_zero_ge256
>    3.30%  qemu-img-v4  [.] buffer_zero_avx2
>    3.17%  qemu-img-v4  [.] is_allocated_sectors.part.0
>   61.07%  -- total
>      -4%  -- reduction from master
> 
> v4+:
>   46.65%  qemu-img  [.] is_allocated_sectors.part.0
>    3.49%  qemu-img  [.] buffer_zero_avx2
>    0.05%  qemu-img  [.] buffer_is_zero_ge256
>   50.19%  -- total
>     -21%  -- reduction from master

Any ideas where the -21% vs v3's -17% difference comes from?

FWIW, in situations like these I always recommend running perf with a fixed
sampling rate, e.g. 'perf record -e cycles:P -c 100000' or 'perf record -e
cycles/period=100000/P' to make sample counts between runs of different
duration directly comparable (displayed with 'perf report -n').
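
For the qemu-img runs above, that would be e.g. (paths illustrative):

  perf record -e cycles:P -c 100000 -- qemu-img convert test.raw -O qcow2 /tmp/t.qcow2
  perf report -n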

> The v4+ puts the 3 byte test back inline, like in your v3.
> 
> Importantly, it must be done as 3 short-circuiting tests, where my v4 "simplified"
> this to (s | m | e) != 0, on the assumption that the reduced number of
> branches would help.

Yes, we also noticed that when preparing our patch. We also tried mixed
variants like (s | e) != 0 || m != 0, but they did not turn out faster.

> With that settled, I guess we need to talk about how much the out-of-line
> implementation matters at all.  I'm thinking about writing a
> test/bench/bufferiszero, with all-zero buffers of various sizes and
> alignments.  With that it would be easier to talk about whether any given
> implementation is an improvement for that final 4% not eliminated by the
> three bytes.

Yeah, initially I suggested this task to Mikhail as a practice exercise
outside of Qemu, and we had a benchmark that measures buffer_is_zero via
perf_event_open. This lets us see exactly how close the implementation
runs to the performance ceiling given by the max L1 fetch rate (two loads
per cycle on x86).
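
For reference, the core of such a harness is small; roughly this (sketch,
not our actual benchmark, error handling omitted):

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Count user-space cycles spent in a single call of fn(buf, len). */
    static uint64_t cycles_for(bool (*fn)(const void *, size_t),
                               const void *buf, size_t len)
    {
        struct perf_event_attr attr;
        uint64_t cycles = 0;
        int fd;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;

        fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        fn(buf, len);
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        read(fd, &cycles, sizeof(cycles));
        close(fd);
        return cycles;
    }

Dividing len by the returned count gives bytes per cycle, which can be
compared directly against the load-port limit of the core.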

Alexander


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 00/10] Optimize buffer_is_zero
  2024-02-16 20:20             ` Alexander Monakov
@ 2024-02-16 22:28               ` Richard Henderson
  0 siblings, 0 replies; 27+ messages in thread
From: Richard Henderson @ 2024-02-16 22:28 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: qemu-devel, mmromanov

On 2/16/24 10:20, Alexander Monakov wrote:
> FWIW, in situations like these I always recommend running perf with a fixed
> sampling rate, e.g. 'perf record -e cycles:P -c 100000' or 'perf record -e
> cycles/period=100000/P' to make sample counts between runs of different
> duration directly comparable (displayed with 'perf report -n').

I've re-done the numbers with a fixed period, as suggested, and the difference between v3
and v4+ is in the sampling noise, differing by about 0.3%.


r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2024-02-16 22:29 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-15  8:14 [PATCH v4 00/10] Optimize buffer_is_zero Richard Henderson
2024-02-15  8:14 ` [PATCH v4 01/10] util/bufferiszero: Remove SSE4.1 variant Richard Henderson
2024-02-15  8:14 ` [PATCH v4 02/10] util/bufferiszero: Remove AVX512 variant Richard Henderson
2024-02-15  8:14 ` [PATCH v4 03/10] util/bufferiszero: Reorganize for early test for acceleration Richard Henderson
2024-02-15  8:14 ` [PATCH v4 04/10] util/bufferiszero: Remove useless prefetches Richard Henderson
2024-02-15  8:14 ` [PATCH v4 05/10] util/bufferiszero: Optimize SSE2 and AVX2 variants Richard Henderson
2024-02-15  8:14 ` [PATCH v4 06/10] util/bufferiszero: Improve scalar variant Richard Henderson
2024-02-15  8:14 ` [PATCH v4 07/10] util/bufferiszero: Introduce biz_accel_fn typedef Richard Henderson
2024-02-15  8:34   ` Philippe Mathieu-Daudé
2024-02-15  8:14 ` [PATCH v4 08/10] util/bufferiszero: Simplify test_buffer_is_zero_next_accel Richard Henderson
2024-02-15  8:40   ` Philippe Mathieu-Daudé
2024-02-15  8:14 ` [PATCH v4 09/10] util/bufferiszero: Add simd acceleration for aarch64 Richard Henderson
2024-02-15  8:47   ` Alexander Monakov
2024-02-15 17:47     ` Richard Henderson
2024-02-15 18:46       ` Alexander Monakov
2024-02-15 21:10         ` Richard Henderson
2024-02-15  8:14 ` [RFC PATCH v4 10/10] util/bufferiszero: Add sve " Richard Henderson
2024-02-16  9:33   ` Alex Bennée
2024-02-16 11:05   ` Alex Bennée
2024-02-15  8:57 ` [PATCH v4 00/10] Optimize buffer_is_zero Alexander Monakov
2024-02-15 21:16   ` Richard Henderson
2024-02-15 21:36     ` Alexander Monakov
2024-02-15 22:27       ` Richard Henderson
2024-02-15 23:37         ` Alexander Monakov
2024-02-16  8:11           ` Richard Henderson
2024-02-16 20:20             ` Alexander Monakov
2024-02-16 22:28               ` Richard Henderson
