qemu-devel.nongnu.org archive mirror
* [PATCH v6 00/10] Optimize buffer_is_zero
@ 2024-04-24 22:56 Richard Henderson
  2024-04-24 22:56 ` [PATCH v6 01/10] util/bufferiszero: Remove SSE4.1 variant Richard Henderson
                   ` (9 more replies)
  0 siblings, 10 replies; 21+ messages in thread
From: Richard Henderson @ 2024-04-24 22:56 UTC (permalink / raw)
  To: qemu-devel

v3: https://patchew.org/QEMU/20240206204809.9859-1-amonakov@ispras.ru/
v4: https://patchew.org/QEMU/20240215081449.848220-1-richard.henderson@linaro.org/
v5: https://patchew.org/QEMU/20240217003918.52229-1-richard.henderson@linaro.org/

Changes for v6:
  - Fixes for bufferiszero-bench


r~


Alexander Monakov (5):
  util/bufferiszero: Remove SSE4.1 variant
  util/bufferiszero: Remove AVX512 variant
  util/bufferiszero: Reorganize for early test for acceleration
  util/bufferiszero: Remove useless prefetches
  util/bufferiszero: Optimize SSE2 and AVX2 variants

Richard Henderson (5):
  util/bufferiszero: Improve scalar variant
  util/bufferiszero: Introduce biz_accel_fn typedef
  util/bufferiszero: Simplify test_buffer_is_zero_next_accel
  util/bufferiszero: Add simd acceleration for aarch64
  tests/bench: Add bufferiszero-bench

 include/qemu/cutils.h            |  32 ++-
 tests/bench/bufferiszero-bench.c |  47 ++++
 util/bufferiszero.c              | 448 +++++++++++++++++--------------
 tests/bench/meson.build          |   1 +
 4 files changed, 321 insertions(+), 207 deletions(-)
 create mode 100644 tests/bench/bufferiszero-bench.c

-- 
2.34.1




* [PATCH v6 01/10] util/bufferiszero: Remove SSE4.1 variant
  2024-04-24 22:56 [PATCH v6 00/10] Optimize buffer_is_zero Richard Henderson
@ 2024-04-24 22:56 ` Richard Henderson
  2024-04-24 22:56 ` [PATCH v6 02/10] util/bufferiszero: Remove AVX512 variant Richard Henderson
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Richard Henderson @ 2024-04-24 22:56 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexander Monakov, Mikhail Romanov

From: Alexander Monakov <amonakov@ispras.ru>

The SSE4.1 variant is virtually identical to the SSE2 variant, except
for using 'PTEST+JNZ' in place of 'PCMPEQB+PMOVMSKB+CMP+JNE' for testing
if an SSE register is all zeroes. The PTEST instruction decodes to two
uops, so it can be handled only by the complex decoder, and since
CMP+JNE are macro-fused, both sequences decode to three uops. The uops
comprising the PTEST instruction dispatch to p0 and p5 on Intel CPUs, so
PCMPEQB+PMOVMSKB is comparatively more flexible from a dispatch
standpoint.

Hence, the use of PTEST brings no benefit from a throughput standpoint.
Its latency is not important, since it feeds only a conditional jump,
which terminates the dependency chain.

I never observed PTEST variants to be faster on real hardware.
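
For reference, a minimal sketch of the two zero-test idioms being compared
(illustrative only; the helper names and standalone framing are not part of
the patch):

    #include <immintrin.h>
    #include <stdbool.h>

    /* SSE4.1 idiom: PTEST decodes to two uops (complex decoder only);
       the compiler branches on the result with JNZ, three uops total. */
    static bool __attribute__((target("sse4.1")))
    vec_is_zero_ptest(__m128i t)
    {
        return _mm_testz_si128(t, t);
    }

    /* SSE2 idiom: PCMPEQB + PMOVMSKB + macro-fused CMP/JNE is also three
       uops, but dispatches more flexibly across execution ports. */
    static bool __attribute__((target("sse2")))
    vec_is_zero_pcmpeq(__m128i t)
    {
        __m128i zero = _mm_setzero_si128();
        return _mm_movemask_epi8(_mm_cmpeq_epi8(t, zero)) == 0xFFFF;
    }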

Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20240206204809.9859-2-amonakov@ispras.ru>
---
 util/bufferiszero.c | 29 -----------------------------
 1 file changed, 29 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 3e6a5dfd63..f5a3634f9a 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -100,34 +100,6 @@ buffer_zero_sse2(const void *buf, size_t len)
 }
 
 #ifdef CONFIG_AVX2_OPT
-static bool __attribute__((target("sse4")))
-buffer_zero_sse4(const void *buf, size_t len)
-{
-    __m128i t = _mm_loadu_si128(buf);
-    __m128i *p = (__m128i *)(((uintptr_t)buf + 5 * 16) & -16);
-    __m128i *e = (__m128i *)(((uintptr_t)buf + len) & -16);
-
-    /* Loop over 16-byte aligned blocks of 64.  */
-    while (likely(p <= e)) {
-        __builtin_prefetch(p);
-        if (unlikely(!_mm_testz_si128(t, t))) {
-            return false;
-        }
-        t = p[-4] | p[-3] | p[-2] | p[-1];
-        p += 4;
-    }
-
-    /* Finish the aligned tail.  */
-    t |= e[-3];
-    t |= e[-2];
-    t |= e[-1];
-
-    /* Finish the unaligned tail.  */
-    t |= _mm_loadu_si128(buf + len - 16);
-
-    return _mm_testz_si128(t, t);
-}
-
 static bool __attribute__((target("avx2")))
 buffer_zero_avx2(const void *buf, size_t len)
 {
@@ -221,7 +193,6 @@ select_accel_cpuinfo(unsigned info)
 #endif
 #ifdef CONFIG_AVX2_OPT
         { CPUINFO_AVX2,    128, buffer_zero_avx2 },
-        { CPUINFO_SSE4,     64, buffer_zero_sse4 },
 #endif
         { CPUINFO_SSE2,     64, buffer_zero_sse2 },
         { CPUINFO_ALWAYS,    0, buffer_zero_int },
-- 
2.34.1




* [PATCH v6 02/10] util/bufferiszero: Remove AVX512 variant
  2024-04-24 22:56 [PATCH v6 00/10] Optimize buffer_is_zero Richard Henderson
  2024-04-24 22:56 ` [PATCH v6 01/10] util/bufferiszero: Remove SSE4.1 variant Richard Henderson
@ 2024-04-24 22:56 ` Richard Henderson
  2024-04-29 11:16   ` Daniel P. Berrangé
  2024-04-24 22:56 ` [PATCH v6 03/10] util/bufferiszero: Reorganize for early test for acceleration Richard Henderson
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 21+ messages in thread
From: Richard Henderson @ 2024-04-24 22:56 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexander Monakov, Mikhail Romanov

From: Alexander Monakov <amonakov@ispras.ru>

Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
routines are invoked much more rarely in normal use when most buffers
are non-zero. This makes use of AVX512 unprofitable, as it incurs extra
frequency and voltage transition periods during which the CPU operates
at reduced performance, as described in
https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html

Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20240206204809.9859-4-amonakov@ispras.ru>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 util/bufferiszero.c | 38 +++-----------------------------------
 1 file changed, 3 insertions(+), 35 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index f5a3634f9a..641d5f9b9e 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -64,7 +64,7 @@ buffer_zero_int(const void *buf, size_t len)
     }
 }
 
-#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
+#if defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
 #include <immintrin.h>
 
 /* Note that each of these vectorized functions require len >= 64.  */
@@ -128,41 +128,12 @@ buffer_zero_avx2(const void *buf, size_t len)
 }
 #endif /* CONFIG_AVX2_OPT */
 
-#ifdef CONFIG_AVX512F_OPT
-static bool __attribute__((target("avx512f")))
-buffer_zero_avx512(const void *buf, size_t len)
-{
-    /* Begin with an unaligned head of 64 bytes.  */
-    __m512i t = _mm512_loadu_si512(buf);
-    __m512i *p = (__m512i *)(((uintptr_t)buf + 5 * 64) & -64);
-    __m512i *e = (__m512i *)(((uintptr_t)buf + len) & -64);
-
-    /* Loop over 64-byte aligned blocks of 256.  */
-    while (p <= e) {
-        __builtin_prefetch(p);
-        if (unlikely(_mm512_test_epi64_mask(t, t))) {
-            return false;
-        }
-        t = p[-4] | p[-3] | p[-2] | p[-1];
-        p += 4;
-    }
-
-    t |= _mm512_loadu_si512(buf + len - 4 * 64);
-    t |= _mm512_loadu_si512(buf + len - 3 * 64);
-    t |= _mm512_loadu_si512(buf + len - 2 * 64);
-    t |= _mm512_loadu_si512(buf + len - 1 * 64);
-
-    return !_mm512_test_epi64_mask(t, t);
-
-}
-#endif /* CONFIG_AVX512F_OPT */
-
 /*
  * Make sure that these variables are appropriately initialized when
  * SSE2 is enabled on the compiler command-line, but the compiler is
  * too old to support CONFIG_AVX2_OPT.
  */
-#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT)
+#if defined(CONFIG_AVX2_OPT)
 # define INIT_USED     0
 # define INIT_LENGTH   0
 # define INIT_ACCEL    buffer_zero_int
@@ -188,9 +159,6 @@ select_accel_cpuinfo(unsigned info)
         unsigned len;
         bool (*fn)(const void *, size_t);
     } all[] = {
-#ifdef CONFIG_AVX512F_OPT
-        { CPUINFO_AVX512F, 256, buffer_zero_avx512 },
-#endif
 #ifdef CONFIG_AVX2_OPT
         { CPUINFO_AVX2,    128, buffer_zero_avx2 },
 #endif
@@ -208,7 +176,7 @@ select_accel_cpuinfo(unsigned info)
     return 0;
 }
 
-#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT)
+#if defined(CONFIG_AVX2_OPT)
 static void __attribute__((constructor)) init_accel(void)
 {
     used_accel = select_accel_cpuinfo(cpuinfo_init());
-- 
2.34.1




* [PATCH v6 03/10] util/bufferiszero: Reorganize for early test for acceleration
  2024-04-24 22:56 [PATCH v6 00/10] Optimize buffer_is_zero Richard Henderson
  2024-04-24 22:56 ` [PATCH v6 01/10] util/bufferiszero: Remove SSE4.1 variant Richard Henderson
  2024-04-24 22:56 ` [PATCH v6 02/10] util/bufferiszero: Remove AVX512 variant Richard Henderson
@ 2024-04-24 22:56 ` Richard Henderson
  2024-04-24 22:56 ` [PATCH v6 04/10] util/bufferiszero: Remove useless prefetches Richard Henderson
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Richard Henderson @ 2024-04-24 22:56 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexander Monakov, Mikhail Romanov

From: Alexander Monakov <amonakov@ispras.ru>

Test for length >= 256 inline, where it is often a constant.
Before calling into the accelerated routine, sample three bytes
from the buffer, which handles most non-zero buffers.

Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
Message-Id: <20240206204809.9859-3-amonakov@ispras.ru>
[rth: Use __builtin_constant_p; move the indirect call out of line.]
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 include/qemu/cutils.h | 32 ++++++++++++++++-
 util/bufferiszero.c   | 84 +++++++++++++++++--------------------------
 2 files changed, 63 insertions(+), 53 deletions(-)

diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h
index 92c927a6a3..741dade7cf 100644
--- a/include/qemu/cutils.h
+++ b/include/qemu/cutils.h
@@ -187,9 +187,39 @@ char *freq_to_str(uint64_t freq_hz);
 /* used to print char* safely */
 #define STR_OR_NULL(str) ((str) ? (str) : "null")
 
-bool buffer_is_zero(const void *buf, size_t len);
+/*
+ * Check if a buffer is all zeroes.
+ */
+
+bool buffer_is_zero_ool(const void *vbuf, size_t len);
+bool buffer_is_zero_ge256(const void *vbuf, size_t len);
 bool test_buffer_is_zero_next_accel(void);
 
+static inline bool buffer_is_zero_sample3(const char *buf, size_t len)
+{
+    /*
+     * For any reasonably sized buffer, these three samples come from
+     * three different cachelines.  In qemu-img usage, we find that
+     * each byte eliminates more than half of all buffer testing.
+     * It is therefore critical to performance that the byte tests
+     * short-circuit, so that we do not pull in additional cache lines.
+     * Do not "optimize" this to !(a | b | c).
+     */
+    return !buf[0] && !buf[len - 1] && !buf[len / 2];
+}
+
+#ifdef __OPTIMIZE__
+static inline bool buffer_is_zero(const void *buf, size_t len)
+{
+    return (__builtin_constant_p(len) && len >= 256
+            ? buffer_is_zero_sample3(buf, len) &&
+              buffer_is_zero_ge256(buf, len)
+            : buffer_is_zero_ool(buf, len));
+}
+#else
+#define buffer_is_zero  buffer_is_zero_ool
+#endif
+
 /*
  * Implementation of ULEB128 (http://en.wikipedia.org/wiki/LEB128)
  * Input is limited to 14-bit numbers
diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 641d5f9b9e..972f394cbd 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -26,8 +26,9 @@
 #include "qemu/bswap.h"
 #include "host/cpuinfo.h"
 
-static bool
-buffer_zero_int(const void *buf, size_t len)
+static bool (*buffer_is_zero_accel)(const void *, size_t);
+
+static bool buffer_is_zero_integer(const void *buf, size_t len)
 {
     if (unlikely(len < 8)) {
         /* For a very small buffer, simply accumulate all the bytes.  */
@@ -128,60 +129,38 @@ buffer_zero_avx2(const void *buf, size_t len)
 }
 #endif /* CONFIG_AVX2_OPT */
 
-/*
- * Make sure that these variables are appropriately initialized when
- * SSE2 is enabled on the compiler command-line, but the compiler is
- * too old to support CONFIG_AVX2_OPT.
- */
-#if defined(CONFIG_AVX2_OPT)
-# define INIT_USED     0
-# define INIT_LENGTH   0
-# define INIT_ACCEL    buffer_zero_int
-#else
-# ifndef __SSE2__
-#  error "ISA selection confusion"
-# endif
-# define INIT_USED     CPUINFO_SSE2
-# define INIT_LENGTH   64
-# define INIT_ACCEL    buffer_zero_sse2
-#endif
-
-static unsigned used_accel = INIT_USED;
-static unsigned length_to_accel = INIT_LENGTH;
-static bool (*buffer_accel)(const void *, size_t) = INIT_ACCEL;
-
 static unsigned __attribute__((noinline))
 select_accel_cpuinfo(unsigned info)
 {
     /* Array is sorted in order of algorithm preference. */
     static const struct {
         unsigned bit;
-        unsigned len;
         bool (*fn)(const void *, size_t);
     } all[] = {
 #ifdef CONFIG_AVX2_OPT
-        { CPUINFO_AVX2,    128, buffer_zero_avx2 },
+        { CPUINFO_AVX2,    buffer_zero_avx2 },
 #endif
-        { CPUINFO_SSE2,     64, buffer_zero_sse2 },
-        { CPUINFO_ALWAYS,    0, buffer_zero_int },
+        { CPUINFO_SSE2,    buffer_zero_sse2 },
+        { CPUINFO_ALWAYS,  buffer_is_zero_integer },
     };
 
     for (unsigned i = 0; i < ARRAY_SIZE(all); ++i) {
         if (info & all[i].bit) {
-            length_to_accel = all[i].len;
-            buffer_accel = all[i].fn;
+            buffer_is_zero_accel = all[i].fn;
             return all[i].bit;
         }
     }
     return 0;
 }
 
-#if defined(CONFIG_AVX2_OPT)
+static unsigned used_accel;
+
 static void __attribute__((constructor)) init_accel(void)
 {
     used_accel = select_accel_cpuinfo(cpuinfo_init());
 }
-#endif /* CONFIG_AVX2_OPT */
+
+#define INIT_ACCEL NULL
 
 bool test_buffer_is_zero_next_accel(void)
 {
@@ -194,36 +173,37 @@ bool test_buffer_is_zero_next_accel(void)
     used_accel |= used;
     return used;
 }
-
-static bool select_accel_fn(const void *buf, size_t len)
-{
-    if (likely(len >= length_to_accel)) {
-        return buffer_accel(buf, len);
-    }
-    return buffer_zero_int(buf, len);
-}
-
 #else
-#define select_accel_fn  buffer_zero_int
 bool test_buffer_is_zero_next_accel(void)
 {
     return false;
 }
+
+#define INIT_ACCEL buffer_is_zero_integer
 #endif
 
-/*
- * Checks if a buffer is all zeroes
- */
-bool buffer_is_zero(const void *buf, size_t len)
+static bool (*buffer_is_zero_accel)(const void *, size_t) = INIT_ACCEL;
+
+bool buffer_is_zero_ool(const void *buf, size_t len)
 {
     if (unlikely(len == 0)) {
         return true;
     }
+    if (!buffer_is_zero_sample3(buf, len)) {
+        return false;
+    }
+    /* All bytes are covered for any len <= 3.  */
+    if (unlikely(len <= 3)) {
+        return true;
+    }
 
-    /* Fetch the beginning of the buffer while we select the accelerator.  */
-    __builtin_prefetch(buf);
-
-    /* Use an optimized zero check if possible.  Note that this also
-       includes a check for an unrolled loop over 64-bit integers.  */
-    return select_accel_fn(buf, len);
+    if (likely(len >= 256)) {
+        return buffer_is_zero_accel(buf, len);
+    }
+    return buffer_is_zero_integer(buf, len);
+}
+
+bool buffer_is_zero_ge256(const void *buf, size_t len)
+{
+    return buffer_is_zero_accel(buf, len);
 }
-- 
2.34.1




* [PATCH v6 04/10] util/bufferiszero: Remove useless prefetches
  2024-04-24 22:56 [PATCH v6 00/10] Optimize buffer_is_zero Richard Henderson
                   ` (2 preceding siblings ...)
  2024-04-24 22:56 ` [PATCH v6 03/10] util/bufferiszero: Reorganize for early test for acceleration Richard Henderson
@ 2024-04-24 22:56 ` Richard Henderson
  2024-04-24 22:57 ` [PATCH v6 05/10] util/bufferiszero: Optimize SSE2 and AVX2 variants Richard Henderson
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Richard Henderson @ 2024-04-24 22:56 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexander Monakov, Mikhail Romanov

From: Alexander Monakov <amonakov@ispras.ru>

Use of prefetching in bufferiszero.c is quite questionable:

- prefetches are issued just a few CPU cycles before the corresponding
  line would be hit by demand loads;

- they are done for simple access patterns, i.e. where hardware
  prefetchers can perform better;

- they compete for load ports in loops that should be limited by load
  port throughput rather than ALU throughput.

Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20240206204809.9859-5-amonakov@ispras.ru>
---
 util/bufferiszero.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 972f394cbd..00118d649e 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -50,7 +50,6 @@ static bool buffer_is_zero_integer(const void *buf, size_t len)
         const uint64_t *e = (uint64_t *)(((uintptr_t)buf + len) & -8);
 
         for (; p + 8 <= e; p += 8) {
-            __builtin_prefetch(p + 8);
             if (t) {
                 return false;
             }
@@ -80,7 +79,6 @@ buffer_zero_sse2(const void *buf, size_t len)
 
     /* Loop over 16-byte aligned blocks of 64.  */
     while (likely(p <= e)) {
-        __builtin_prefetch(p);
         t = _mm_cmpeq_epi8(t, zero);
         if (unlikely(_mm_movemask_epi8(t) != 0xFFFF)) {
             return false;
@@ -111,7 +109,6 @@ buffer_zero_avx2(const void *buf, size_t len)
 
     /* Loop over 32-byte aligned blocks of 128.  */
     while (p <= e) {
-        __builtin_prefetch(p);
         if (unlikely(!_mm256_testz_si256(t, t))) {
             return false;
         }
-- 
2.34.1




* [PATCH v6 05/10] util/bufferiszero: Optimize SSE2 and AVX2 variants
  2024-04-24 22:56 [PATCH v6 00/10] Optimize buffer_is_zero Richard Henderson
                   ` (3 preceding siblings ...)
  2024-04-24 22:56 ` [PATCH v6 04/10] util/bufferiszero: Remove useless prefetches Richard Henderson
@ 2024-04-24 22:57 ` Richard Henderson
  2024-04-24 22:57 ` [PATCH v6 06/10] util/bufferiszero: Improve scalar variant Richard Henderson
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Richard Henderson @ 2024-04-24 22:57 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexander Monakov, Mikhail Romanov

From: Alexander Monakov <amonakov@ispras.ru>

Increase unroll factor in SIMD loops from 4x to 8x in order to move
their bottlenecks from ALU port contention to load issue rate (two loads
per cycle on popular x86 implementations).

Avoid using out-of-bounds pointers in loop boundary conditions.

Follow SSE2 implementation strategy in the AVX2 variant. Avoid use of
PTEST, which is not profitable there (like in the removed SSE4 variant).

Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20240206204809.9859-6-amonakov@ispras.ru>
---
 util/bufferiszero.c | 111 +++++++++++++++++++++++++++++---------------
 1 file changed, 73 insertions(+), 38 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 00118d649e..02df82b4ff 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -67,62 +67,97 @@ static bool buffer_is_zero_integer(const void *buf, size_t len)
 #if defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
 #include <immintrin.h>
 
-/* Note that each of these vectorized functions require len >= 64.  */
+/* Helper for preventing the compiler from reassociating
+   chains of binary vector operations.  */
+#define SSE_REASSOC_BARRIER(vec0, vec1) asm("" : "+x"(vec0), "+x"(vec1))
+
+/* Note that these vectorized functions may assume len >= 256.  */
 
 static bool __attribute__((target("sse2")))
 buffer_zero_sse2(const void *buf, size_t len)
 {
-    __m128i t = _mm_loadu_si128(buf);
-    __m128i *p = (__m128i *)(((uintptr_t)buf + 5 * 16) & -16);
-    __m128i *e = (__m128i *)(((uintptr_t)buf + len) & -16);
-    __m128i zero = _mm_setzero_si128();
+    /* Unaligned loads at head/tail.  */
+    __m128i v = *(__m128i_u *)(buf);
+    __m128i w = *(__m128i_u *)(buf + len - 16);
+    /* Align head/tail to 16-byte boundaries.  */
+    const __m128i *p = QEMU_ALIGN_PTR_DOWN(buf + 16, 16);
+    const __m128i *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 16);
+    __m128i zero = { 0 };
 
-    /* Loop over 16-byte aligned blocks of 64.  */
-    while (likely(p <= e)) {
-        t = _mm_cmpeq_epi8(t, zero);
-        if (unlikely(_mm_movemask_epi8(t) != 0xFFFF)) {
+    /* Collect a partial block at tail end.  */
+    v |= e[-1]; w |= e[-2];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-3]; w |= e[-4];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-5]; w |= e[-6];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-7]; v |= w;
+
+    /*
+     * Loop over complete 128-byte blocks.
+     * With the head and tail removed, e - p >= 14, so the loop
+     * must iterate at least once.
+     */
+    do {
+        v = _mm_cmpeq_epi8(v, zero);
+        if (unlikely(_mm_movemask_epi8(v) != 0xFFFF)) {
             return false;
         }
-        t = p[-4] | p[-3] | p[-2] | p[-1];
-        p += 4;
-    }
+        v = p[0]; w = p[1];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[2]; w |= p[3];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[4]; w |= p[5];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[6]; w |= p[7];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= w;
+        p += 8;
+    } while (p < e - 7);
 
-    /* Finish the aligned tail.  */
-    t |= e[-3];
-    t |= e[-2];
-    t |= e[-1];
-
-    /* Finish the unaligned tail.  */
-    t |= _mm_loadu_si128(buf + len - 16);
-
-    return _mm_movemask_epi8(_mm_cmpeq_epi8(t, zero)) == 0xFFFF;
+    return _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero)) == 0xFFFF;
 }
 
 #ifdef CONFIG_AVX2_OPT
 static bool __attribute__((target("avx2")))
 buffer_zero_avx2(const void *buf, size_t len)
 {
-    /* Begin with an unaligned head of 32 bytes.  */
-    __m256i t = _mm256_loadu_si256(buf);
-    __m256i *p = (__m256i *)(((uintptr_t)buf + 5 * 32) & -32);
-    __m256i *e = (__m256i *)(((uintptr_t)buf + len) & -32);
+    /* Unaligned loads at head/tail.  */
+    __m256i v = *(__m256i_u *)(buf);
+    __m256i w = *(__m256i_u *)(buf + len - 32);
+    /* Align head/tail to 32-byte boundaries.  */
+    const __m256i *p = QEMU_ALIGN_PTR_DOWN(buf + 32, 32);
+    const __m256i *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 32);
+    __m256i zero = { 0 };
 
-    /* Loop over 32-byte aligned blocks of 128.  */
-    while (p <= e) {
-        if (unlikely(!_mm256_testz_si256(t, t))) {
+    /* Collect a partial block at tail end.  */
+    v |= e[-1]; w |= e[-2];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-3]; w |= e[-4];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-5]; w |= e[-6];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-7]; v |= w;
+
+    /* Loop over complete 256-byte blocks.  */
+    for (; p < e - 7; p += 8) {
+        /* PTEST is not profitable here.  */
+        v = _mm256_cmpeq_epi8(v, zero);
+        if (unlikely(_mm256_movemask_epi8(v) != 0xFFFFFFFF)) {
             return false;
         }
-        t = p[-4] | p[-3] | p[-2] | p[-1];
-        p += 4;
-    } ;
+        v = p[0]; w = p[1];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[2]; w |= p[3];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[4]; w |= p[5];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[6]; w |= p[7];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= w;
+    }
 
-    /* Finish the last block of 128 unaligned.  */
-    t |= _mm256_loadu_si256(buf + len - 4 * 32);
-    t |= _mm256_loadu_si256(buf + len - 3 * 32);
-    t |= _mm256_loadu_si256(buf + len - 2 * 32);
-    t |= _mm256_loadu_si256(buf + len - 1 * 32);
-
-    return _mm256_testz_si256(t, t);
+    return _mm256_movemask_epi8(_mm256_cmpeq_epi8(v, zero)) == 0xFFFFFFFF;
 }
 #endif /* CONFIG_AVX2_OPT */
 
-- 
2.34.1




* [PATCH v6 06/10] util/bufferiszero: Improve scalar variant
  2024-04-24 22:56 [PATCH v6 00/10] Optimize buffer_is_zero Richard Henderson
                   ` (4 preceding siblings ...)
  2024-04-24 22:57 ` [PATCH v6 05/10] util/bufferiszero: Optimize SSE2 and AVX2 variants Richard Henderson
@ 2024-04-24 22:57 ` Richard Henderson
  2024-04-29 12:18   ` Philippe Mathieu-Daudé
  2024-04-24 22:57 ` [PATCH v6 07/10] util/bufferiszero: Introduce biz_accel_fn typedef Richard Henderson
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 21+ messages in thread
From: Richard Henderson @ 2024-04-24 22:57 UTC (permalink / raw)
  To: qemu-devel

Split less-than and greater-than 256 cases.
Use unaligned accesses for head and tail.
Avoid using out-of-bounds pointers in loop boundary conditions.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 util/bufferiszero.c | 85 +++++++++++++++++++++++++++------------------
 1 file changed, 51 insertions(+), 34 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 02df82b4ff..c9a7ded016 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -28,40 +28,57 @@
 
 static bool (*buffer_is_zero_accel)(const void *, size_t);
 
-static bool buffer_is_zero_integer(const void *buf, size_t len)
+static bool buffer_is_zero_int_lt256(const void *buf, size_t len)
 {
-    if (unlikely(len < 8)) {
-        /* For a very small buffer, simply accumulate all the bytes.  */
-        const unsigned char *p = buf;
-        const unsigned char *e = buf + len;
-        unsigned char t = 0;
+    uint64_t t;
+    const uint64_t *p, *e;
 
-        do {
-            t |= *p++;
-        } while (p < e);
-
-        return t == 0;
-    } else {
-        /* Otherwise, use the unaligned memory access functions to
-           handle the beginning and end of the buffer, with a couple
-           of loops handling the middle aligned section.  */
-        uint64_t t = ldq_he_p(buf);
-        const uint64_t *p = (uint64_t *)(((uintptr_t)buf + 8) & -8);
-        const uint64_t *e = (uint64_t *)(((uintptr_t)buf + len) & -8);
-
-        for (; p + 8 <= e; p += 8) {
-            if (t) {
-                return false;
-            }
-            t = p[0] | p[1] | p[2] | p[3] | p[4] | p[5] | p[6] | p[7];
-        }
-        while (p < e) {
-            t |= *p++;
-        }
-        t |= ldq_he_p(buf + len - 8);
-
-        return t == 0;
+    /*
+     * Use unaligned memory access functions to handle
+     * the beginning and end of the buffer.
+     */
+    if (unlikely(len <= 8)) {
+        return (ldl_he_p(buf) | ldl_he_p(buf + len - 4)) == 0;
     }
+
+    t = ldq_he_p(buf) | ldq_he_p(buf + len - 8);
+    p = QEMU_ALIGN_PTR_DOWN(buf + 8, 8);
+    e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 8);
+
+    /* Read 0 to 31 aligned words from the middle. */
+    while (p < e) {
+        t |= *p++;
+    }
+    return t == 0;
+}
+
+static bool buffer_is_zero_int_ge256(const void *buf, size_t len)
+{
+    /*
+     * Use unaligned memory access functions to handle
+     * the beginning and end of the buffer.
+     */
+    uint64_t t = ldq_he_p(buf) | ldq_he_p(buf + len - 8);
+    const uint64_t *p = QEMU_ALIGN_PTR_DOWN(buf + 8, 8);
+    const uint64_t *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 8);
+
+    /* Collect a partial block at the tail end. */
+    t |= e[-7] | e[-6] | e[-5] | e[-4] | e[-3] | e[-2] | e[-1];
+
+    /*
+     * Loop over 64 byte blocks.
+     * With the head and tail removed, e - p >= 30,
+     * so the loop must iterate at least 3 times.
+     */
+    do {
+        if (t) {
+            return false;
+        }
+        t = p[0] | p[1] | p[2] | p[3] | p[4] | p[5] | p[6] | p[7];
+        p += 8;
+    } while (p < e - 7);
+
+    return t == 0;
 }
 
 #if defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
@@ -173,7 +190,7 @@ select_accel_cpuinfo(unsigned info)
         { CPUINFO_AVX2,    buffer_zero_avx2 },
 #endif
         { CPUINFO_SSE2,    buffer_zero_sse2 },
-        { CPUINFO_ALWAYS,  buffer_is_zero_integer },
+        { CPUINFO_ALWAYS,  buffer_is_zero_int_ge256 },
     };
 
     for (unsigned i = 0; i < ARRAY_SIZE(all); ++i) {
@@ -211,7 +228,7 @@ bool test_buffer_is_zero_next_accel(void)
     return false;
 }
 
-#define INIT_ACCEL buffer_is_zero_integer
+#define INIT_ACCEL buffer_is_zero_int_ge256
 #endif
 
 static bool (*buffer_is_zero_accel)(const void *, size_t) = INIT_ACCEL;
@@ -232,7 +249,7 @@ bool buffer_is_zero_ool(const void *buf, size_t len)
     if (likely(len >= 256)) {
         return buffer_is_zero_accel(buf, len);
     }
-    return buffer_is_zero_integer(buf, len);
+    return buffer_is_zero_int_lt256(buf, len);
 }
 
 bool buffer_is_zero_ge256(const void *buf, size_t len)
-- 
2.34.1




* [PATCH v6 07/10] util/bufferiszero: Introduce biz_accel_fn typedef
  2024-04-24 22:56 [PATCH v6 00/10] Optimize buffer_is_zero Richard Henderson
                   ` (5 preceding siblings ...)
  2024-04-24 22:57 ` [PATCH v6 06/10] util/bufferiszero: Improve scalar variant Richard Henderson
@ 2024-04-24 22:57 ` Richard Henderson
  2024-04-29 11:14   ` Philippe Mathieu-Daudé
  2024-04-24 22:57 ` [PATCH v6 08/10] util/bufferiszero: Simplify test_buffer_is_zero_next_accel Richard Henderson
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 21+ messages in thread
From: Richard Henderson @ 2024-04-24 22:57 UTC (permalink / raw)
  To: qemu-devel

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 util/bufferiszero.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index c9a7ded016..eb8030a3f0 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -26,7 +26,8 @@
 #include "qemu/bswap.h"
 #include "host/cpuinfo.h"
 
-static bool (*buffer_is_zero_accel)(const void *, size_t);
+typedef bool (*biz_accel_fn)(const void *, size_t);
+static biz_accel_fn buffer_is_zero_accel;
 
 static bool buffer_is_zero_int_lt256(const void *buf, size_t len)
 {
@@ -178,13 +179,15 @@ buffer_zero_avx2(const void *buf, size_t len)
 }
 #endif /* CONFIG_AVX2_OPT */
 
+
+
 static unsigned __attribute__((noinline))
 select_accel_cpuinfo(unsigned info)
 {
     /* Array is sorted in order of algorithm preference. */
     static const struct {
         unsigned bit;
-        bool (*fn)(const void *, size_t);
+        biz_accel_fn fn;
     } all[] = {
 #ifdef CONFIG_AVX2_OPT
         { CPUINFO_AVX2,    buffer_zero_avx2 },
@@ -231,7 +234,7 @@ bool test_buffer_is_zero_next_accel(void)
 #define INIT_ACCEL buffer_is_zero_int_ge256
 #endif
 
-static bool (*buffer_is_zero_accel)(const void *, size_t) = INIT_ACCEL;
+static biz_accel_fn buffer_is_zero_accel = INIT_ACCEL;
 
 bool buffer_is_zero_ool(const void *buf, size_t len)
 {
-- 
2.34.1




* [PATCH v6 08/10] util/bufferiszero: Simplify test_buffer_is_zero_next_accel
  2024-04-24 22:56 [PATCH v6 00/10] Optimize buffer_is_zero Richard Henderson
                   ` (6 preceding siblings ...)
  2024-04-24 22:57 ` [PATCH v6 07/10] util/bufferiszero: Introduce biz_accel_fn typedef Richard Henderson
@ 2024-04-24 22:57 ` Richard Henderson
  2024-04-29 11:05   ` Philippe Mathieu-Daudé
  2024-04-24 22:57 ` [PATCH v6 09/10] util/bufferiszero: Add simd acceleration for aarch64 Richard Henderson
  2024-04-24 22:57 ` [PATCH v6 10/10] tests/bench: Add bufferiszero-bench Richard Henderson
  9 siblings, 1 reply; 21+ messages in thread
From: Richard Henderson @ 2024-04-24 22:57 UTC (permalink / raw)
  To: qemu-devel

Because the three alternatives are monotonic, we don't
need to keep a couple of bitmasks, just identify the
strongest alternative at startup.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 util/bufferiszero.c | 56 ++++++++++++++++++---------------------------
 1 file changed, 22 insertions(+), 34 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index eb8030a3f0..ff003dc40e 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -179,51 +179,39 @@ buffer_zero_avx2(const void *buf, size_t len)
 }
 #endif /* CONFIG_AVX2_OPT */
 
-
-
-static unsigned __attribute__((noinline))
-select_accel_cpuinfo(unsigned info)
-{
-    /* Array is sorted in order of algorithm preference. */
-    static const struct {
-        unsigned bit;
-        biz_accel_fn fn;
-    } all[] = {
+static biz_accel_fn const accel_table[] = {
+    buffer_is_zero_int_ge256,
+    buffer_zero_sse2,
 #ifdef CONFIG_AVX2_OPT
-        { CPUINFO_AVX2,    buffer_zero_avx2 },
+    buffer_zero_avx2,
 #endif
-        { CPUINFO_SSE2,    buffer_zero_sse2 },
-        { CPUINFO_ALWAYS,  buffer_is_zero_int_ge256 },
-    };
-
-    for (unsigned i = 0; i < ARRAY_SIZE(all); ++i) {
-        if (info & all[i].bit) {
-            buffer_is_zero_accel = all[i].fn;
-            return all[i].bit;
-        }
-    }
-    return 0;
-}
-
-static unsigned used_accel;
+};
+static unsigned accel_index;
 
 static void __attribute__((constructor)) init_accel(void)
 {
-    used_accel = select_accel_cpuinfo(cpuinfo_init());
+    unsigned info = cpuinfo_init();
+    unsigned index = (info & CPUINFO_SSE2 ? 1 : 0);
+
+#ifdef CONFIG_AVX2_OPT
+    if (info & CPUINFO_AVX2) {
+        index = 2;
+    }
+#endif
+
+    accel_index = index;
+    buffer_is_zero_accel = accel_table[index];
 }
 
 #define INIT_ACCEL NULL
 
 bool test_buffer_is_zero_next_accel(void)
 {
-    /*
-     * Accumulate the accelerators that we've already tested, and
-     * remove them from the set to test this round.  We'll get back
-     * a zero from select_accel_cpuinfo when there are no more.
-     */
-    unsigned used = select_accel_cpuinfo(cpuinfo & ~used_accel);
-    used_accel |= used;
-    return used;
+    if (accel_index != 0) {
+        buffer_is_zero_accel = accel_table[--accel_index];
+        return true;
+    }
+    return false;
 }
 #else
 bool test_buffer_is_zero_next_accel(void)
-- 
2.34.1




* [PATCH v6 09/10] util/bufferiszero: Add simd acceleration for aarch64
  2024-04-24 22:56 [PATCH v6 00/10] Optimize buffer_is_zero Richard Henderson
                   ` (7 preceding siblings ...)
  2024-04-24 22:57 ` [PATCH v6 08/10] util/bufferiszero: Simplify test_buffer_is_zero_next_accel Richard Henderson
@ 2024-04-24 22:57 ` Richard Henderson
  2024-04-29 11:06   ` Philippe Mathieu-Daudé
  2024-04-29 12:45   ` Philippe Mathieu-Daudé
  2024-04-24 22:57 ` [PATCH v6 10/10] tests/bench: Add bufferiszero-bench Richard Henderson
  9 siblings, 2 replies; 21+ messages in thread
From: Richard Henderson @ 2024-04-24 22:57 UTC (permalink / raw)
  To: qemu-devel

Because non-embedded aarch64 is expected to have AdvSIMD enabled, merely
double-check with the compiler flags for __ARM_NEON and don't bother with
a runtime check.  Otherwise, model the loop after the x86 SSE2 function.

Use UMAXV for the vector reduction.  This is 3 cycles on cortex-a76 and
2 cycles on neoverse-n1.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 util/bufferiszero.c | 77 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index ff003dc40e..38477a3eac 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -213,7 +213,84 @@ bool test_buffer_is_zero_next_accel(void)
     }
     return false;
 }
+
+#elif defined(__aarch64__) && defined(__ARM_NEON)
+#include <arm_neon.h>
+
+#define REASSOC_BARRIER(vec0, vec1) asm("" : "+w"(vec0), "+w"(vec1))
+
+static bool buffer_is_zero_simd(const void *buf, size_t len)
+{
+    uint32x4_t t0, t1, t2, t3;
+
+    /* Align head/tail to 16-byte boundaries.  */
+    const uint32x4_t *p = QEMU_ALIGN_PTR_DOWN(buf + 16, 16);
+    const uint32x4_t *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 16);
+
+    /* Unaligned loads at head/tail.  */
+    t0 = vld1q_u32(buf) | vld1q_u32(buf + len - 16);
+
+    /* Collect a partial block at tail end.  */
+    t1 = e[-7] | e[-6];
+    t2 = e[-5] | e[-4];
+    t3 = e[-3] | e[-2];
+    t0 |= e[-1];
+    REASSOC_BARRIER(t0, t1);
+    REASSOC_BARRIER(t2, t3);
+    t0 |= t1;
+    t2 |= t3;
+    REASSOC_BARRIER(t0, t2);
+    t0 |= t2;
+
+    /*
+     * Loop over complete 128-byte blocks.
+     * With the head and tail removed, e - p >= 14, so the loop
+     * must iterate at least once.
+     */
+    do {
+        /*
+         * Reduce via UMAXV.  Whatever the actual result,
+         * it will only be zero if all input bytes are zero.
+         */
+        if (unlikely(vmaxvq_u32(t0) != 0)) {
+            return false;
+        }
+
+        t0 = p[0] | p[1];
+        t1 = p[2] | p[3];
+        t2 = p[4] | p[5];
+        t3 = p[6] | p[7];
+        REASSOC_BARRIER(t0, t1);
+        REASSOC_BARRIER(t2, t3);
+        t0 |= t1;
+        t2 |= t3;
+        REASSOC_BARRIER(t0, t2);
+        t0 |= t2;
+        p += 8;
+    } while (p < e - 7);
+
+    return vmaxvq_u32(t0) == 0;
+}
+
+static biz_accel_fn const accel_table[] = {
+    buffer_is_zero_int_ge256,
+    buffer_is_zero_simd,
+};
+
+static unsigned accel_index = 1;
+#define INIT_ACCEL buffer_is_zero_simd
+
+bool test_buffer_is_zero_next_accel(void)
+{
+    if (accel_index != 0) {
+        buffer_is_zero_accel = accel_table[--accel_index];
+        return true;
+    }
+    return false;
+}
+
 #else
+
 bool test_buffer_is_zero_next_accel(void)
 {
     return false;
-- 
2.34.1




* [PATCH v6 10/10] tests/bench: Add bufferiszero-bench
  2024-04-24 22:56 [PATCH v6 00/10] Optimize buffer_is_zero Richard Henderson
                   ` (8 preceding siblings ...)
  2024-04-24 22:57 ` [PATCH v6 09/10] util/bufferiszero: Add simd acceleration for aarch64 Richard Henderson
@ 2024-04-24 22:57 ` Richard Henderson
  2024-04-29 11:12   ` Philippe Mathieu-Daudé
  9 siblings, 1 reply; 21+ messages in thread
From: Richard Henderson @ 2024-04-24 22:57 UTC (permalink / raw)
  To: qemu-devel

Benchmark each acceleration function vs an aligned buffer of zeros.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tests/bench/bufferiszero-bench.c | 47 ++++++++++++++++++++++++++++++++
 tests/bench/meson.build          |  1 +
 2 files changed, 48 insertions(+)
 create mode 100644 tests/bench/bufferiszero-bench.c

diff --git a/tests/bench/bufferiszero-bench.c b/tests/bench/bufferiszero-bench.c
new file mode 100644
index 0000000000..222695c1fa
--- /dev/null
+++ b/tests/bench/bufferiszero-bench.c
@@ -0,0 +1,47 @@
+/*
+ * QEMU buffer_is_zero speed benchmark
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * (at your option) any later version.  See the COPYING file in the
+ * top-level directory.
+ */
+#include "qemu/osdep.h"
+#include "qemu/cutils.h"
+#include "qemu/units.h"
+
+static void test(const void *opaque)
+{
+    size_t max = 64 * KiB;
+    void *buf = g_malloc0(max);
+    int accel_index = 0;
+
+    do {
+        if (accel_index != 0) {
+            g_test_message("%s", "");  /* gnu_printf Werror for simple "" */
+        }
+        for (size_t len = 1 * KiB; len <= max; len *= 4) {
+            double total = 0.0;
+
+            g_test_timer_start();
+            do {
+                buffer_is_zero_ge256(buf, len);
+                total += len;
+            } while (g_test_timer_elapsed() < 0.5);
+
+            total /= MiB;
+            g_test_message("buffer_is_zero #%d: %2zuKB %8.0f MB/sec",
+                           accel_index, len / (size_t)KiB,
+                           total / g_test_timer_last());
+        }
+        accel_index++;
+    } while (test_buffer_is_zero_next_accel());
+
+    g_free(buf);
+}
+
+int main(int argc, char **argv)
+{
+    g_test_init(&argc, &argv, NULL);
+    g_test_add_data_func("/cutils/bufferiszero/speed", NULL, test);
+    return g_test_run();
+}
diff --git a/tests/bench/meson.build b/tests/bench/meson.build
index 7e76338a52..4cd7a2f6b5 100644
--- a/tests/bench/meson.build
+++ b/tests/bench/meson.build
@@ -21,6 +21,7 @@ benchs = {}
 
 if have_block
   benchs += {
+     'bufferiszero-bench': [],
      'benchmark-crypto-hash': [crypto],
      'benchmark-crypto-hmac': [crypto],
      'benchmark-crypto-cipher': [crypto],
-- 
2.34.1




* Re: [PATCH v6 08/10] util/bufferiszero: Simplify test_buffer_is_zero_next_accel
  2024-04-24 22:57 ` [PATCH v6 08/10] util/bufferiszero: Simplify test_buffer_is_zero_next_accel Richard Henderson
@ 2024-04-29 11:05   ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 21+ messages in thread
From: Philippe Mathieu-Daudé @ 2024-04-29 11:05 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel

On 25/4/24 00:57, Richard Henderson wrote:
> Because the three alternatives are monotonic, we don't
> need to keep a couple of bitmasks, just identify the
> strongest alternative at startup.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>   util/bufferiszero.c | 56 ++++++++++++++++++---------------------------
>   1 file changed, 22 insertions(+), 34 deletions(-)

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>




* Re: [PATCH v6 09/10] util/bufferiszero: Add simd acceleration for aarch64
  2024-04-24 22:57 ` [PATCH v6 09/10] util/bufferiszero: Add simd acceleration for aarch64 Richard Henderson
@ 2024-04-29 11:06   ` Philippe Mathieu-Daudé
  2024-04-29 12:45   ` Philippe Mathieu-Daudé
  1 sibling, 0 replies; 21+ messages in thread
From: Philippe Mathieu-Daudé @ 2024-04-29 11:06 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: Ard Biesheuvel

Cc'ing Ard :)

On 25/4/24 00:57, Richard Henderson wrote:
> Because non-embedded aarch64 is expected to have AdvSIMD enabled, merely
> double-check with the compiler flags for __ARM_NEON and don't bother with
> a runtime check.  Otherwise, model the loop after the x86 SSE2 function.
> 
> Use UMAXV for the vector reduction.  This is 3 cycles on cortex-a76 and
> 2 cycles on neoverse-n1.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>   util/bufferiszero.c | 77 +++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 77 insertions(+)
> 
> diff --git a/util/bufferiszero.c b/util/bufferiszero.c
> index ff003dc40e..38477a3eac 100644
> --- a/util/bufferiszero.c
> +++ b/util/bufferiszero.c
> @@ -213,7 +213,84 @@ bool test_buffer_is_zero_next_accel(void)
>       }
>       return false;
>   }
> +
> +#elif defined(__aarch64__) && defined(__ARM_NEON)
> +#include <arm_neon.h>
> +
> +#define REASSOC_BARRIER(vec0, vec1) asm("" : "+w"(vec0), "+w"(vec1))
> +
> +static bool buffer_is_zero_simd(const void *buf, size_t len)
> +{
> +    uint32x4_t t0, t1, t2, t3;
> +
> +    /* Align head/tail to 16-byte boundaries.  */
> +    const uint32x4_t *p = QEMU_ALIGN_PTR_DOWN(buf + 16, 16);
> +    const uint32x4_t *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 16);
> +
> +    /* Unaligned loads at head/tail.  */
> +    t0 = vld1q_u32(buf) | vld1q_u32(buf + len - 16);
> +
> +    /* Collect a partial block at tail end.  */
> +    t1 = e[-7] | e[-6];
> +    t2 = e[-5] | e[-4];
> +    t3 = e[-3] | e[-2];
> +    t0 |= e[-1];
> +    REASSOC_BARRIER(t0, t1);
> +    REASSOC_BARRIER(t2, t3);
> +    t0 |= t1;
> +    t2 |= t3;
> +    REASSOC_BARRIER(t0, t2);
> +    t0 |= t2;
> +
> +    /*
> +     * Loop over complete 128-byte blocks.
> +     * With the head and tail removed, e - p >= 14, so the loop
> +     * must iterate at least once.
> +     */
> +    do {
> +        /*
> +         * Reduce via UMAXV.  Whatever the actual result,
> +         * it will only be zero if all input bytes are zero.
> +         */
> +        if (unlikely(vmaxvq_u32(t0) != 0)) {
> +            return false;
> +        }
> +
> +        t0 = p[0] | p[1];
> +        t1 = p[2] | p[3];
> +        t2 = p[4] | p[5];
> +        t3 = p[6] | p[7];
> +        REASSOC_BARRIER(t0, t1);
> +        REASSOC_BARRIER(t2, t3);
> +        t0 |= t1;
> +        t2 |= t3;
> +        REASSOC_BARRIER(t0, t2);
> +        t0 |= t2;
> +        p += 8;
> +    } while (p < e - 7);
> +
> +    return vmaxvq_u32(t0) == 0;
> +}
> +
> +static biz_accel_fn const accel_table[] = {
> +    buffer_is_zero_int_ge256,
> +    buffer_is_zero_simd,
> +};
> +
> +static unsigned accel_index = 1;
> +#define INIT_ACCEL buffer_is_zero_simd
> +
> +bool test_buffer_is_zero_next_accel(void)
> +{
> +    if (accel_index != 0) {
> +        buffer_is_zero_accel = accel_table[--accel_index];
> +        return true;
> +    }
> +    return false;
> +}
> +
>   #else
> +
>   bool test_buffer_is_zero_next_accel(void)
>   {
>       return false;




* Re: [PATCH v6 10/10] tests/bench: Add bufferiszero-bench
  2024-04-24 22:57 ` [PATCH v6 10/10] tests/bench: Add bufferiszero-bench Richard Henderson
@ 2024-04-29 11:12   ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 21+ messages in thread
From: Philippe Mathieu-Daudé @ 2024-04-29 11:12 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel

On 25/4/24 00:57, Richard Henderson wrote:
> Benchmark each acceleration function vs an aligned buffer of zeros.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>   tests/bench/bufferiszero-bench.c | 47 ++++++++++++++++++++++++++++++++
>   tests/bench/meson.build          |  1 +
>   2 files changed, 48 insertions(+)
>   create mode 100644 tests/bench/bufferiszero-bench.c


> +static void test(const void *opaque)
> +{
> +    size_t max = 64 * KiB;
> +    void *buf = g_malloc0(max);
> +    int accel_index = 0;

Nitpicking, accel_index could be unsigned like in util/bufferiszero.c.

> +
> +    do {
> +        if (accel_index != 0) {
> +            g_test_message("%s", "");  /* gnu_printf Werror for simple "" */
> +        }
> +        for (size_t len = 1 * KiB; len <= max; len *= 4) {
> +            double total = 0.0;
> +
> +            g_test_timer_start();
> +            do {
> +                buffer_is_zero_ge256(buf, len);
> +                total += len;
> +            } while (g_test_timer_elapsed() < 0.5);
> +
> +            total /= MiB;
> +            g_test_message("buffer_is_zero #%d: %2zuKB %8.0f MB/sec",
> +                           accel_index, len / (size_t)KiB,

Thus "buffer_is_zero #%u:..."

Regardless,
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>

> +                           total / g_test_timer_last());
> +        }
> +        accel_index++;
> +    } while (test_buffer_is_zero_next_accel());
> +
> +    g_free(buf);
> +}




* Re: [PATCH v6 07/10] util/bufferiszero: Introduce biz_accel_fn typedef
  2024-04-24 22:57 ` [PATCH v6 07/10] util/bufferiszero: Introduce biz_accel_fn typedef Richard Henderson
@ 2024-04-29 11:14   ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 21+ messages in thread
From: Philippe Mathieu-Daudé @ 2024-04-29 11:14 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel

On 25/4/24 00:57, Richard Henderson wrote:
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>   util/bufferiszero.c | 9 ++++++---
>   1 file changed, 6 insertions(+), 3 deletions(-)


> @@ -178,13 +179,15 @@ buffer_zero_avx2(const void *buf, size_t len)
>   }
>   #endif /* CONFIG_AVX2_OPT */
>   
> +
> +

Spurious new lines :)

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>

>   static unsigned __attribute__((noinline))




* Re: [PATCH v6 02/10] util/bufferiszero: Remove AVX512 variant
  2024-04-24 22:56 ` [PATCH v6 02/10] util/bufferiszero: Remove AVX512 variant Richard Henderson
@ 2024-04-29 11:16   ` Daniel P. Berrangé
  2024-04-29 11:29     ` Alexander Monakov
  0 siblings, 1 reply; 21+ messages in thread
From: Daniel P. Berrangé @ 2024-04-29 11:16 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, Alexander Monakov, Mikhail Romanov

On Wed, Apr 24, 2024 at 03:56:57PM -0700, Richard Henderson wrote:
> From: Alexander Monakov <amonakov@ispras.ru>
> 
> Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
> routines are invoked much more rarely in normal use when most buffers
> are non-zero. This makes use of AVX512 unprofitable, as it incurs extra
> frequency and voltage transition periods during which the CPU operates
> at reduced performance, as described in
> https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html

This is describing limitations of Intel's AVX512 implementation.

AMD's AVX512 implementation is said to not have the kind of
power / frequency limitations that Intel's does:

  https://www.mersenneforum.org/showthread.php?p=614191

  "Overall, AMD's AVX512 implementation beat my expectations.
   I was expecting something similar to Zen1's "double-pumping"
   of AVX with half the register file and cross-lane instructions
   being super slow. But this is not the case on Zen4. The lack
   of power or thermal issues combined with stellar shuffle support
   makes it completely worthwhile to use from a developer standpoint.
   If your code can vectorize without excessive wasted computation,
   then go all the way to 512-bit. AMD not only made this worthwhile,
   but *incentivizes* it with the power savings. And if in the future
   AMD decides to widen things up, you may get a 2x speedup for free."

IOW, it sounds like we could be sacrificing performance on modern
AMD Genoa generation CPUs by removing the AVX512 impl

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH v6 02/10] util/bufferiszero: Remove AVX512 variant
  2024-04-29 11:16   ` Daniel P. Berrangé
@ 2024-04-29 11:29     ` Alexander Monakov
  0 siblings, 0 replies; 21+ messages in thread
From: Alexander Monakov @ 2024-04-29 11:29 UTC (permalink / raw)
  To: Daniel P. Berrangé; +Cc: Richard Henderson, qemu-devel, Mikhail Romanov



On Mon, 29 Apr 2024, Daniel P. Berrangé wrote:

> On Wed, Apr 24, 2024 at 03:56:57PM -0700, Richard Henderson wrote:
> > From: Alexander Monakov <amonakov@ispras.ru>
> > 
> > Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
> > routines are invoked much more rarely in normal use when most buffers
> > are non-zero. This makes use of AVX512 unprofitable, as it incurs extra
> > frequency and voltage transition periods during which the CPU operates
> > at reduced performance, as described in
> > https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html
> 
> This is describing limitations of Intel's AVX512 implementation.
> 
> AMD's AVX512 implementation is said to not have the kind of
> power / frequency limitations that Intel's does:
> 
>   https://www.mersenneforum.org/showthread.php?p=614191
> 
>   "Overall, AMD's AVX512 implementation beat my expectations.
>    I was expecting something similar to Zen1's "double-pumping"
>    of AVX with half the register file and cross-lane instructions
>    being super slow. But this is not the case on Zen4. The lack
>    of power or thermal issues combined with stellar shuffle support
>    makes it completely worthwhile to use from a developer standpoint.
>    If your code can vectorize without excessive wasted computation,
>    then go all the way to 512-bit. AMD not only made this worthwhile,
>    but *incentivizes* it with the power savings. And if in the future
>    AMD decides to widen things up, you may get a 2x speedup for free."
> 
> IOW, it sounds like we could be sacrificing performance on modern
> AMD Genoa generation CPUs by removing the AVX512 impl

No, the new implementation saturates load ports, and Genoa runs 512-bit
AVX instructions at half throughput compared to their 256-bit counterparts
(so one 512-bit load or two 256-bit loads per cycle), so there's no
obvious reason why this patch would sacrifice performance there.
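
(As a rough back-of-the-envelope check of the above, assuming two load ports
that can each serve one 256-bit load per cycle, with a 512-bit load occupying
the equivalent of two 256-bit slots:

    AVX2 loop:    2 loads/cycle x 32 bytes = 64 bytes/cycle
    AVX-512 loop: 1 load/cycle  x 64 bytes = 64 bytes/cycle

i.e. the same peak load bandwidth either way.)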

Maybe it could, indirectly, by lowering the turbo clock limit due to
higher front-end activity, but I don't have access to a Zen 4 machine
to check, and even so it would be a few percent, not 2x.

Alexander


* Re: [PATCH v6 06/10] util/bufferiszero: Improve scalar variant
  2024-04-24 22:57 ` [PATCH v6 06/10] util/bufferiszero: Improve scalar variant Richard Henderson
@ 2024-04-29 12:18   ` Philippe Mathieu-Daudé
  2024-04-29 12:31     ` Richard Henderson
  0 siblings, 1 reply; 21+ messages in thread
From: Philippe Mathieu-Daudé @ 2024-04-29 12:18 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel

On 25/4/24 00:57, Richard Henderson wrote:
> Split less-than and greater-than 256 cases.
> Use unaligned accesses for head and tail.
> Avoid using out-of-bounds pointers in loop boundary conditions.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>   util/bufferiszero.c | 85 +++++++++++++++++++++++++++------------------
>   1 file changed, 51 insertions(+), 34 deletions(-)
> 
> diff --git a/util/bufferiszero.c b/util/bufferiszero.c
> index 02df82b4ff..c9a7ded016 100644
> --- a/util/bufferiszero.c
> +++ b/util/bufferiszero.c
> @@ -28,40 +28,57 @@
>   
>   static bool (*buffer_is_zero_accel)(const void *, size_t);
>   
> -static bool buffer_is_zero_integer(const void *buf, size_t len)
> +static bool buffer_is_zero_int_lt256(const void *buf, size_t len)
>   {
> -    if (unlikely(len < 8)) {
> -        /* For a very small buffer, simply accumulate all the bytes.  */
> -        const unsigned char *p = buf;
> -        const unsigned char *e = buf + len;
> -        unsigned char t = 0;
> +    uint64_t t;
> +    const uint64_t *p, *e;
>   
> -        do {
> -            t |= *p++;
> -        } while (p < e);
> -
> -        return t == 0;
> -    } else {
> -        /* Otherwise, use the unaligned memory access functions to
> -           handle the beginning and end of the buffer, with a couple
> -           of loops handling the middle aligned section.  */
> -        uint64_t t = ldq_he_p(buf);
> -        const uint64_t *p = (uint64_t *)(((uintptr_t)buf + 8) & -8);
> -        const uint64_t *e = (uint64_t *)(((uintptr_t)buf + len) & -8);
> -
> -        for (; p + 8 <= e; p += 8) {
> -            if (t) {
> -                return false;
> -            }
> -            t = p[0] | p[1] | p[2] | p[3] | p[4] | p[5] | p[6] | p[7];
> -        }
> -        while (p < e) {
> -            t |= *p++;
> -        }
> -        t |= ldq_he_p(buf + len - 8);
> -
> -        return t == 0;
> +    /*
> +     * Use unaligned memory access functions to handle
> +     * the beginning and end of the buffer.
> +     */
> +    if (unlikely(len <= 8)) {
> +        return (ldl_he_p(buf) | ldl_he_p(buf + len - 4)) == 0;
>       }
> +
> +    t = ldq_he_p(buf) | ldq_he_p(buf + len - 8);

Here we read #0 and #31, ...

> +    p = QEMU_ALIGN_PTR_DOWN(buf + 8, 8);
> +    e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 8);
> +
> +    /* Read 0 to 31 aligned words from the middle. */

... so here is #1 to #30?

> +    while (p < e) {
> +        t |= *p++;
> +    }
> +    return t == 0;
> +}

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>




* Re: [PATCH v6 06/10] util/bufferiszero: Improve scalar variant
  2024-04-29 12:18   ` Philippe Mathieu-Daudé
@ 2024-04-29 12:31     ` Richard Henderson
  2024-04-29 13:21       ` Philippe Mathieu-Daudé
  0 siblings, 1 reply; 21+ messages in thread
From: Richard Henderson @ 2024-04-29 12:31 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, qemu-devel

On 4/29/24 05:18, Philippe Mathieu-Daudé wrote:
>> +
>> +    t = ldq_he_p(buf) | ldq_he_p(buf + len - 8);
> 
> Here we read #0 and #31, ...
> 
>> +    p = QEMU_ALIGN_PTR_DOWN(buf + 8, 8);
>> +    e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 8);
>> +
>> +    /* Read 0 to 31 aligned words from the middle. */
> 
> ... so here is #1 to #30?

Not indexes, but count.  There may be zero words remaining in the middle, etc.
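
(As a concrete illustration, under the same alignment rules as the patch:
with buf 8-byte aligned and len == 9, p == e and zero middle words are read;
with buf starting 7 bytes past an 8-byte boundary and len == 255,
e - p == 31 words, the maximum.)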


r~


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 09/10] util/bufferiszero: Add simd acceleration for aarch64
  2024-04-24 22:57 ` [PATCH v6 09/10] util/bufferiszero: Add simd acceleration for aarch64 Richard Henderson
  2024-04-29 11:06   ` Philippe Mathieu-Daudé
@ 2024-04-29 12:45   ` Philippe Mathieu-Daudé
  1 sibling, 0 replies; 21+ messages in thread
From: Philippe Mathieu-Daudé @ 2024-04-29 12:45 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel

On 25/4/24 00:57, Richard Henderson wrote:
> Because non-embedded aarch64 is expected to have AdvSIMD enabled, merely
> double-check with the compiler flags for __ARM_NEON and don't bother with
> a runtime check.  Otherwise, model the loop after the x86 SSE2 function.
> 
> Use UMAXV for the vector reduction.  This is 3 cycles on cortex-a76 and
> 2 cycles on neoverse-n1.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>   util/bufferiszero.c | 77 +++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 77 insertions(+)
> 
> diff --git a/util/bufferiszero.c b/util/bufferiszero.c
> index ff003dc40e..38477a3eac 100644
> --- a/util/bufferiszero.c
> +++ b/util/bufferiszero.c
> @@ -213,7 +213,84 @@ bool test_buffer_is_zero_next_accel(void)
>       }
>       return false;
>   }
> +
> +#elif defined(__aarch64__) && defined(__ARM_NEON)
> +#include <arm_neon.h>
> +

Maybe use the same comment here as for the SSE variant:

/*
 * Helper for preventing the compiler from reassociating
 * chains of binary vector operations.
 */

> +#define REASSOC_BARRIER(vec0, vec1) asm("" : "+w"(vec0), "+w"(vec1))
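
As an aside, a minimal sketch (not from the patch) of what the barrier
protects, assuming <arm_neon.h> and the GCC/Clang vector extensions
already used in this file:

    static inline uint32x4_t or_tree(uint32x4_t t0, uint32x4_t t1,
                                     uint32x4_t t2, uint32x4_t t3)
    {
        /*
         * Without the empty asm the compiler may reassociate the three
         * ORs into ((t0 | t1) | t2) | t3, turning the 2-deep balanced
         * tree into a 3-deep dependency chain.
         */
        REASSOC_BARRIER(t0, t1);
        REASSOC_BARRIER(t2, t3);
        t0 |= t1;
        t2 |= t3;
        REASSOC_BARRIER(t0, t2);
        return t0 | t2;
    }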


> +static unsigned accel_index = 1;
> +#define INIT_ACCEL buffer_is_zero_simd
> +
> +bool test_buffer_is_zero_next_accel(void)
> +{
> +    if (accel_index != 0) {
> +        buffer_is_zero_accel = accel_table[--accel_index];
> +        return true;
> +    }
> +    return false;
> +}

Alternatively we could initialize accel_index in
__attribute__((constructor)) init_accel(void) and keep
a single test_buffer_is_zero_next_accel(), squashing:

-- >8 --
diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 38477a3eac..afb3822251 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -82,6 +82,17 @@ static bool buffer_is_zero_int_ge256(const void *buf, size_t len)
      return t == 0;
  }

+static unsigned accel_index;
+
+bool test_buffer_is_zero_next_accel(void)
+{
+    if (accel_index != 0) {
+        buffer_is_zero_accel = accel_table[--accel_index];
+        return true;
+    }
+    return false;
+}
+
  #if defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
  #include <immintrin.h>

@@ -186,7 +197,6 @@ static biz_accel_fn const accel_table[] = {
      buffer_zero_avx2,
  #endif
  };
-static unsigned accel_index;

  static void __attribute__((constructor)) init_accel(void)
  {
@@ -205,15 +215,6 @@ static void __attribute__((constructor)) init_accel(void)

  #define INIT_ACCEL NULL

-bool test_buffer_is_zero_next_accel(void)
-{
-    if (accel_index != 0) {
-        buffer_is_zero_accel = accel_table[--accel_index];
-        return true;
-    }
-    return false;
-}
-
  #elif defined(__aarch64__) && defined(__ARM_NEON)
  #include <arm_neon.h>

@@ -277,25 +278,15 @@ static biz_accel_fn const accel_table[] = {
      buffer_is_zero_simd,
  };

-static unsigned accel_index = 1;
  #define INIT_ACCEL buffer_is_zero_simd

-bool test_buffer_is_zero_next_accel(void)
+static void __attribute__((constructor)) init_accel(void)
  {
-    if (accel_index != 0) {
-        buffer_is_zero_accel = accel_table[--accel_index];
-        return true;
-    }
-    return false;
+    accel_index = 1;
  }

  #else

-bool test_buffer_is_zero_next_accel(void)
-{
-    return false;
-}
-
  #define INIT_ACCEL buffer_is_zero_int_ge256
  #endif

---
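
Either way the public contract is unchanged; a hypothetical caller
(roughly how the unit test and the new benchmark drive the accelerator
list) still just does:

    do {
        g_assert(buffer_is_zero(buf, len) == expected);
    } while (test_buffer_is_zero_next_accel());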

Or clearer in 2 patches, unifying test_buffer_is_zero_next_accel()
first:

-- >8 --
diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index ff003dc40e..b4da9d5297 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -82,6 +82,17 @@ static bool buffer_is_zero_int_ge256(const void *buf, size_t len)
      return t == 0;
  }

+static unsigned accel_index;
+
+bool test_buffer_is_zero_next_accel(void)
+{
+    if (accel_index != 0) {
+        buffer_is_zero_accel = accel_table[--accel_index];
+        return true;
+    }
+    return false;
+}
+
  #if defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
  #include <immintrin.h>

@@ -186,7 +197,6 @@ static biz_accel_fn const accel_table[] = {
      buffer_zero_avx2,
  #endif
  };
-static unsigned accel_index;

  static void __attribute__((constructor)) init_accel(void)
  {
@@ -205,19 +215,7 @@ static void __attribute__((constructor)) init_accel(void)

  #define INIT_ACCEL NULL

-bool test_buffer_is_zero_next_accel(void)
-{
-    if (accel_index != 0) {
-        buffer_is_zero_accel = accel_table[--accel_index];
-        return true;
-    }
-    return false;
-}
  #else
-bool test_buffer_is_zero_next_accel(void)
-{
-    return false;
-}

  #define INIT_ACCEL buffer_is_zero_int_ge256
  #endif

---

Then this patch becomes:

-- >8 --
diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index b4da9d5297..afb3822251 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -215,6 +215,76 @@ static void __attribute__((constructor)) init_accel(void)

  #define INIT_ACCEL NULL

+#elif defined(__aarch64__) && defined(__ARM_NEON)
+#include <arm_neon.h>
+
+#define REASSOC_BARRIER(vec0, vec1) asm("" : "+w"(vec0), "+w"(vec1))
+
+static bool buffer_is_zero_simd(const void *buf, size_t len)
+{
+    uint32x4_t t0, t1, t2, t3;
+
+    /* Align head/tail to 16-byte boundaries.  */
+    const uint32x4_t *p = QEMU_ALIGN_PTR_DOWN(buf + 16, 16);
+    const uint32x4_t *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 16);
+
+    /* Unaligned loads at head/tail.  */
+    t0 = vld1q_u32(buf) | vld1q_u32(buf + len - 16);
+
+    /* Collect a partial block at tail end.  */
+    t1 = e[-7] | e[-6];
+    t2 = e[-5] | e[-4];
+    t3 = e[-3] | e[-2];
+    t0 |= e[-1];
+    REASSOC_BARRIER(t0, t1);
+    REASSOC_BARRIER(t2, t3);
+    t0 |= t1;
+    t2 |= t3;
+    REASSOC_BARRIER(t0, t2);
+    t0 |= t2;
+
+    /*
+     * Loop over complete 128-byte blocks.
+     * With the head and tail removed, e - p >= 14, so the loop
+     * must iterate at least once.
+     */
+    do {
+        /*
+         * Reduce via UMAXV.  Whatever the actual result,
+         * it will only be zero if all input bytes are zero.
+         */
+        if (unlikely(vmaxvq_u32(t0) != 0)) {
+            return false;
+        }
+
+        t0 = p[0] | p[1];
+        t1 = p[2] | p[3];
+        t2 = p[4] | p[5];
+        t3 = p[6] | p[7];
+        REASSOC_BARRIER(t0, t1);
+        REASSOC_BARRIER(t2, t3);
+        t0 |= t1;
+        t2 |= t3;
+        REASSOC_BARRIER(t0, t2);
+        t0 |= t2;
+        p += 8;
+    } while (p < e - 7);
+
+    return vmaxvq_u32(t0) == 0;
+}
+
+static biz_accel_fn const accel_table[] = {
+    buffer_is_zero_int_ge256,
+    buffer_is_zero_simd,
+};
+
+#define INIT_ACCEL buffer_is_zero_simd
+
+static void __attribute__((constructor)) init_accel(void)
+{
+    accel_index = 1;
+}
+
  #else

  #define INIT_ACCEL buffer_is_zero_int_ge256

---
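
For readers less familiar with NEON intrinsics, an illustrative
stand-alone form of the UMAXV reduction used above (hypothetical helper,
not part of the patch):

    static inline bool vec128_is_zero(uint32x4_t v)
    {
        /*
         * UMAXV folds the four 32-bit lanes to their maximum, which is
         * zero only if every lane (and hence every byte) is zero.
         */
        return vmaxvq_u32(v) == 0;
    }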

Anyhow,
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v6 06/10] util/bufferiszero: Improve scalar variant
  2024-04-29 12:31     ` Richard Henderson
@ 2024-04-29 13:21       ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 21+ messages in thread
From: Philippe Mathieu-Daudé @ 2024-04-29 13:21 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel

On 29/4/24 14:31, Richard Henderson wrote:
> On 4/29/24 05:18, Philippe Mathieu-Daudé wrote:
>>> +
>>> +    t = ldq_he_p(buf) | ldq_he_p(buf + len - 8);
>>
>> Here we read #0 and #31, ...
>>
>>> +    p = QEMU_ALIGN_PTR_DOWN(buf + 8, 8);
>>> +    e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 8);
>>> +
>>> +    /* Read 0 to 31 aligned words from the middle. */
>>
>> ... so here is #1 to #30?
> 
> Not indexes, but count.  There may be zero words remaining in the 
> middle, etc.

Oh, got it, thanks!



^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2024-04-29 13:22 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-24 22:56 [PATCH v6 00/10] Optimize buffer_is_zero Richard Henderson
2024-04-24 22:56 ` [PATCH v6 01/10] util/bufferiszero: Remove SSE4.1 variant Richard Henderson
2024-04-24 22:56 ` [PATCH v6 02/10] util/bufferiszero: Remove AVX512 variant Richard Henderson
2024-04-29 11:16   ` Daniel P. Berrangé
2024-04-29 11:29     ` Alexander Monakov
2024-04-24 22:56 ` [PATCH v6 03/10] util/bufferiszero: Reorganize for early test for acceleration Richard Henderson
2024-04-24 22:56 ` [PATCH v6 04/10] util/bufferiszero: Remove useless prefetches Richard Henderson
2024-04-24 22:57 ` [PATCH v6 05/10] util/bufferiszero: Optimize SSE2 and AVX2 variants Richard Henderson
2024-04-24 22:57 ` [PATCH v6 06/10] util/bufferiszero: Improve scalar variant Richard Henderson
2024-04-29 12:18   ` Philippe Mathieu-Daudé
2024-04-29 12:31     ` Richard Henderson
2024-04-29 13:21       ` Philippe Mathieu-Daudé
2024-04-24 22:57 ` [PATCH v6 07/10] util/bufferiszero: Introduce biz_accel_fn typedef Richard Henderson
2024-04-29 11:14   ` Philippe Mathieu-Daudé
2024-04-24 22:57 ` [PATCH v6 08/10] util/bufferiszero: Simplify test_buffer_is_zero_next_accel Richard Henderson
2024-04-29 11:05   ` Philippe Mathieu-Daudé
2024-04-24 22:57 ` [PATCH v6 09/10] util/bufferiszero: Add simd acceleration for aarch64 Richard Henderson
2024-04-29 11:06   ` Philippe Mathieu-Daudé
2024-04-29 12:45   ` Philippe Mathieu-Daudé
2024-04-24 22:57 ` [PATCH v6 10/10] tests/bench: Add bufferiszero-bench Richard Henderson
2024-04-29 11:12   ` Philippe Mathieu-Daudé
