* [PATCH v3 0/6] Optimize buffer_is_zero
From: Alexander Monakov @ 2024-02-06 20:48 UTC (permalink / raw)
To: qemu-devel
Cc: Mikhail Romanov, Richard Henderson, Paolo Bonzini,
Alexander Monakov
I am posting a new revision of buffer_is_zero improvements (v2 can be found at
https://patchew.org/QEMU/20231027143704.7060-1-mmromanov@ispras.ru/ ).
In our experiments buffer_is_zero took about 40%-50% of overall qemu-img run
time, even though Glib I/O is not very efficient. Hence, it remains an important
routine to optimize.
We substantially improve its performance in typical cases, mostly by introducing
an inline wrapper that samples three bytes from the head, middle, and tail of the
buffer, avoiding the call overhead when any of those bytes is non-zero. We also
improve the SIMD and portable scalar variants.
Changed for v3:
- separate into 6 patches
- fix an oversight which would break the build on non-x86 hosts
- properly avoid out-of-bounds pointers in the scalar variant
Alexander Monakov (6):
util/bufferiszero: remove SSE4.1 variant
util/bufferiszero: introduce an inline wrapper
util/bufferiszero: remove AVX512 variant
util/bufferiszero: remove useless prefetches
util/bufferiszero: optimize SSE2 and AVX2 variants
util/bufferiszero: improve scalar variant
include/qemu/cutils.h | 28 ++++-
util/bufferiszero.c | 280 +++++++++++++++---------------------------
2 files changed, 128 insertions(+), 180 deletions(-)
--
2.32.0
* [PATCH v3 1/6] util/bufferiszero: remove SSE4.1 variant
From: Alexander Monakov @ 2024-02-06 20:48 UTC (permalink / raw)
To: qemu-devel
Cc: Mikhail Romanov, Richard Henderson, Paolo Bonzini,
Alexander Monakov
The SSE4.1 variant is virtually identical to the SSE2 variant, except
for using 'PTEST+JNZ' in place of 'PCMPEQB+PMOVMSKB+CMP+JNE' for testing
if an SSE register is all zeroes. The PTEST instruction decodes to two
uops, so it can be handled only by the complex decoder, and since
CMP+JNE are macro-fused, both sequences decode to three uops. The uops
comprising the PTEST instruction dispatch to p0 and p5 on Intel CPUs, so
the PCMPEQB+PMOVMSKB sequence is more flexible from a dispatch standpoint.
Hence, the use of PTEST brings no benefit from a throughput standpoint.
Its latency is not important, since it feeds only a conditional jump,
which terminates the dependency chain.
I never observed PTEST variants to be faster on real hardware.
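For illustration, the two test idioms being compared correspond roughly to the
following intrinsics (a sketch for context, not part of the patch):

    /* SSE4.1 idiom being removed: PTEST + JNZ. */
    if (!_mm_testz_si128(t, t)) {
        return false;
    }

    /* SSE2 idiom that remains: PCMPEQB + PMOVMSKB + CMP + JNE. */
    if (_mm_movemask_epi8(_mm_cmpeq_epi8(t, _mm_setzero_si128())) != 0xFFFF) {
        return false;
    }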
Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
---
util/bufferiszero.c | 29 -----------------------------
1 file changed, 29 deletions(-)
diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 3e6a5dfd63..f5a3634f9a 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -100,34 +100,6 @@ buffer_zero_sse2(const void *buf, size_t len)
}
#ifdef CONFIG_AVX2_OPT
-static bool __attribute__((target("sse4")))
-buffer_zero_sse4(const void *buf, size_t len)
-{
- __m128i t = _mm_loadu_si128(buf);
- __m128i *p = (__m128i *)(((uintptr_t)buf + 5 * 16) & -16);
- __m128i *e = (__m128i *)(((uintptr_t)buf + len) & -16);
-
- /* Loop over 16-byte aligned blocks of 64. */
- while (likely(p <= e)) {
- __builtin_prefetch(p);
- if (unlikely(!_mm_testz_si128(t, t))) {
- return false;
- }
- t = p[-4] | p[-3] | p[-2] | p[-1];
- p += 4;
- }
-
- /* Finish the aligned tail. */
- t |= e[-3];
- t |= e[-2];
- t |= e[-1];
-
- /* Finish the unaligned tail. */
- t |= _mm_loadu_si128(buf + len - 16);
-
- return _mm_testz_si128(t, t);
-}
-
static bool __attribute__((target("avx2")))
buffer_zero_avx2(const void *buf, size_t len)
{
@@ -221,7 +193,6 @@ select_accel_cpuinfo(unsigned info)
#endif
#ifdef CONFIG_AVX2_OPT
{ CPUINFO_AVX2, 128, buffer_zero_avx2 },
- { CPUINFO_SSE4, 64, buffer_zero_sse4 },
#endif
{ CPUINFO_SSE2, 64, buffer_zero_sse2 },
{ CPUINFO_ALWAYS, 0, buffer_zero_int },
--
2.32.0
* [PATCH v3 2/6] util/bufferiszero: introduce an inline wrapper
From: Alexander Monakov @ 2024-02-06 20:48 UTC (permalink / raw)
To: qemu-devel
Cc: Mikhail Romanov, Richard Henderson, Paolo Bonzini,
Alexander Monakov
Make buffer_is_zero a 'static inline' function that tests up to three
bytes from the buffer before handing off to an unrolled loop. This
eliminates call overhead for most non-zero buffers, and allows the
compiler to optimize out the length checks when the length is known at
compile time (which is often the case in QEMU).
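As a hypothetical usage sketch (page_is_blank and the 4096-byte size are
illustrative, not taken from the tree), a caller passing a compile-time
constant length lets the compiler fold the checks in the wrapper:

    #include "qemu/cutils.h"

    /* With len == 4096 known at compile time, the len == 0 and len <= 3
       branches are optimized out, leaving the three byte probes and the
       call to the len >= 256 variant. */
    static bool page_is_blank(const void *page)
    {
        return buffer_is_zero(page, 4096);
    }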
Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
---
include/qemu/cutils.h | 28 +++++++++++++++-
util/bufferiszero.c | 76 ++++++++++++-------------------------------
2 files changed, 47 insertions(+), 57 deletions(-)
diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h
index 92c927a6a3..62b153e603 100644
--- a/include/qemu/cutils.h
+++ b/include/qemu/cutils.h
@@ -187,9 +187,35 @@ char *freq_to_str(uint64_t freq_hz);
/* used to print char* safely */
#define STR_OR_NULL(str) ((str) ? (str) : "null")
-bool buffer_is_zero(const void *buf, size_t len);
+bool buffer_is_zero_len_4_plus(const void *, size_t);
+extern bool (*buffer_is_zero_len_256_plus)(const void *, size_t);
bool test_buffer_is_zero_next_accel(void);
+/*
+ * Check if a buffer is all zeroes.
+ */
+static inline bool buffer_is_zero(const void *vbuf, size_t len)
+{
+ const char *buf = vbuf;
+
+ if (len == 0) {
+ return true;
+ }
+ if (buf[0] || buf[len - 1] || buf[len / 2]) {
+ return false;
+ }
+ /* All bytes are covered for any len <= 3. */
+ if (len <= 3) {
+ return true;
+ }
+
+ if (len >= 256) {
+ return buffer_is_zero_len_256_plus(vbuf, len);
+ } else {
+ return buffer_is_zero_len_4_plus(vbuf, len);
+ }
+}
+
/*
* Implementation of ULEB128 (http://en.wikipedia.org/wiki/LEB128)
* Input is limited to 14-bit numbers
diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index f5a3634f9a..01050694a6 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -26,8 +26,8 @@
#include "qemu/bswap.h"
#include "host/cpuinfo.h"
-static bool
-buffer_zero_int(const void *buf, size_t len)
+bool
+buffer_is_zero_len_4_plus(const void *buf, size_t len)
{
if (unlikely(len < 8)) {
/* For a very small buffer, simply accumulate all the bytes. */
@@ -157,57 +157,40 @@ buffer_zero_avx512(const void *buf, size_t len)
}
#endif /* CONFIG_AVX512F_OPT */
-/*
- * Make sure that these variables are appropriately initialized when
- * SSE2 is enabled on the compiler command-line, but the compiler is
- * too old to support CONFIG_AVX2_OPT.
- */
-#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT)
-# define INIT_USED 0
-# define INIT_LENGTH 0
-# define INIT_ACCEL buffer_zero_int
-#else
-# ifndef __SSE2__
-# error "ISA selection confusion"
-# endif
-# define INIT_USED CPUINFO_SSE2
-# define INIT_LENGTH 64
-# define INIT_ACCEL buffer_zero_sse2
-#endif
-
-static unsigned used_accel = INIT_USED;
-static unsigned length_to_accel = INIT_LENGTH;
-static bool (*buffer_accel)(const void *, size_t) = INIT_ACCEL;
-
static unsigned __attribute__((noinline))
select_accel_cpuinfo(unsigned info)
{
/* Array is sorted in order of algorithm preference. */
static const struct {
unsigned bit;
- unsigned len;
bool (*fn)(const void *, size_t);
} all[] = {
#ifdef CONFIG_AVX512F_OPT
- { CPUINFO_AVX512F, 256, buffer_zero_avx512 },
+ { CPUINFO_AVX512F, buffer_zero_avx512 },
#endif
#ifdef CONFIG_AVX2_OPT
- { CPUINFO_AVX2, 128, buffer_zero_avx2 },
+ { CPUINFO_AVX2, buffer_zero_avx2 },
#endif
- { CPUINFO_SSE2, 64, buffer_zero_sse2 },
- { CPUINFO_ALWAYS, 0, buffer_zero_int },
+ { CPUINFO_SSE2, buffer_zero_sse2 },
+ { CPUINFO_ALWAYS, buffer_is_zero_len_4_plus },
};
for (unsigned i = 0; i < ARRAY_SIZE(all); ++i) {
if (info & all[i].bit) {
- length_to_accel = all[i].len;
- buffer_accel = all[i].fn;
+ buffer_is_zero_len_256_plus = all[i].fn;
return all[i].bit;
}
}
return 0;
}
+static unsigned used_accel
+#if defined(__SSE2__)
+ = CPUINFO_SSE2;
+#else
+ = 0;
+#endif
+
#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT)
static void __attribute__((constructor)) init_accel(void)
{
@@ -227,35 +210,16 @@ bool test_buffer_is_zero_next_accel(void)
return used;
}
-static bool select_accel_fn(const void *buf, size_t len)
-{
- if (likely(len >= length_to_accel)) {
- return buffer_accel(buf, len);
- }
- return buffer_zero_int(buf, len);
-}
-
#else
-#define select_accel_fn buffer_zero_int
bool test_buffer_is_zero_next_accel(void)
{
return false;
}
#endif
-/*
- * Checks if a buffer is all zeroes
- */
-bool buffer_is_zero(const void *buf, size_t len)
-{
- if (unlikely(len == 0)) {
- return true;
- }
-
- /* Fetch the beginning of the buffer while we select the accelerator. */
- __builtin_prefetch(buf);
-
- /* Use an optimized zero check if possible. Note that this also
- includes a check for an unrolled loop over 64-bit integers. */
- return select_accel_fn(buf, len);
-}
+bool (*buffer_is_zero_len_256_plus)(const void *, size_t)
+#if defined(__SSE2__)
+ = buffer_zero_sse2;
+#else
+ = buffer_is_zero_len_4_plus;
+#endif
--
2.32.0
* [PATCH v3 3/6] util/bufferiszero: remove AVX512 variant
From: Alexander Monakov @ 2024-02-06 20:48 UTC (permalink / raw)
To: qemu-devel
Cc: Mikhail Romanov, Richard Henderson, Paolo Bonzini,
Alexander Monakov
Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
routines are invoked much more rarely in normal use, where most buffers
are non-zero. This makes the use of AVX512 unprofitable, as it incurs extra
frequency and voltage transition periods during which the CPU operates
at reduced performance, as described in
https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html
Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
---
util/bufferiszero.c | 36 ++----------------------------------
1 file changed, 2 insertions(+), 34 deletions(-)
diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 01050694a6..c037d11d04 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -64,7 +64,7 @@ buffer_is_zero_len_4_plus(const void *buf, size_t len)
}
}
-#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
+#if defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
#include <immintrin.h>
/* Note that each of these vectorized functions require len >= 64. */
@@ -128,35 +128,6 @@ buffer_zero_avx2(const void *buf, size_t len)
}
#endif /* CONFIG_AVX2_OPT */
-#ifdef CONFIG_AVX512F_OPT
-static bool __attribute__((target("avx512f")))
-buffer_zero_avx512(const void *buf, size_t len)
-{
- /* Begin with an unaligned head of 64 bytes. */
- __m512i t = _mm512_loadu_si512(buf);
- __m512i *p = (__m512i *)(((uintptr_t)buf + 5 * 64) & -64);
- __m512i *e = (__m512i *)(((uintptr_t)buf + len) & -64);
-
- /* Loop over 64-byte aligned blocks of 256. */
- while (p <= e) {
- __builtin_prefetch(p);
- if (unlikely(_mm512_test_epi64_mask(t, t))) {
- return false;
- }
- t = p[-4] | p[-3] | p[-2] | p[-1];
- p += 4;
- }
-
- t |= _mm512_loadu_si512(buf + len - 4 * 64);
- t |= _mm512_loadu_si512(buf + len - 3 * 64);
- t |= _mm512_loadu_si512(buf + len - 2 * 64);
- t |= _mm512_loadu_si512(buf + len - 1 * 64);
-
- return !_mm512_test_epi64_mask(t, t);
-
-}
-#endif /* CONFIG_AVX512F_OPT */
-
static unsigned __attribute__((noinline))
select_accel_cpuinfo(unsigned info)
{
@@ -165,9 +136,6 @@ select_accel_cpuinfo(unsigned info)
unsigned bit;
bool (*fn)(const void *, size_t);
} all[] = {
-#ifdef CONFIG_AVX512F_OPT
- { CPUINFO_AVX512F, buffer_zero_avx512 },
-#endif
#ifdef CONFIG_AVX2_OPT
{ CPUINFO_AVX2, buffer_zero_avx2 },
#endif
@@ -191,7 +159,7 @@ static unsigned used_accel
= 0;
#endif
-#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT)
+#if defined(CONFIG_AVX2_OPT)
static void __attribute__((constructor)) init_accel(void)
{
used_accel = select_accel_cpuinfo(cpuinfo_init());
--
2.32.0
* [PATCH v3 4/6] util/bufferiszero: remove useless prefetches
From: Alexander Monakov @ 2024-02-06 20:48 UTC (permalink / raw)
To: qemu-devel
Cc: Mikhail Romanov, Richard Henderson, Paolo Bonzini,
Alexander Monakov
Use of prefetching in bufferiszero.c is quite questionable:
- prefetches are issued just a few CPU cycles before the corresponding
line would be hit by demand loads;
- they are done for simple access patterns, i.e. where hardware
prefetchers can perform better;
- they compete for load ports in loops that should be limited by load
port throughput rather than ALU throughput.
Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
---
util/bufferiszero.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index c037d11d04..cb3eb2543f 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -49,7 +49,6 @@ buffer_is_zero_len_4_plus(const void *buf, size_t len)
const uint64_t *e = (uint64_t *)(((uintptr_t)buf + len) & -8);
for (; p + 8 <= e; p += 8) {
- __builtin_prefetch(p + 8);
if (t) {
return false;
}
@@ -79,7 +78,6 @@ buffer_zero_sse2(const void *buf, size_t len)
/* Loop over 16-byte aligned blocks of 64. */
while (likely(p <= e)) {
- __builtin_prefetch(p);
t = _mm_cmpeq_epi8(t, zero);
if (unlikely(_mm_movemask_epi8(t) != 0xFFFF)) {
return false;
@@ -110,7 +108,6 @@ buffer_zero_avx2(const void *buf, size_t len)
/* Loop over 32-byte aligned blocks of 128. */
while (p <= e) {
- __builtin_prefetch(p);
if (unlikely(!_mm256_testz_si256(t, t))) {
return false;
}
--
2.32.0
* [PATCH v3 5/6] util/bufferiszero: optimize SSE2 and AVX2 variants
From: Alexander Monakov @ 2024-02-06 20:48 UTC (permalink / raw)
To: qemu-devel
Cc: Mikhail Romanov, Richard Henderson, Paolo Bonzini,
Alexander Monakov
Increase unroll factor in SIMD loops from 4x to 8x in order to move
their bottlenecks from ALU port contention to load issue rate (two loads
per cycle on popular x86 implementations).
Avoid using out-of-bounds pointers in loop boundary conditions.
Follow SSE2 implementation strategy in the AVX2 variant. Avoid use of
PTEST, which is not profitable there (like in the removed SSE4 variant).
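As a sketch of the two-accumulator pattern used in the new loops (the or8
helper name is illustrative; the barrier macro is the one added by this patch,
and the code relies on GCC/Clang vector extensions just as the patch does):

    #include <immintrin.h>

    /* Empty asm with "+x" operands: the compiler must assume v and w are
       modified, so it cannot reassociate the ORs into one serial chain. */
    #define SSE_REASSOC_BARRIER(vec0, vec1) asm("" : "+x"(vec0), "+x"(vec1))

    /* OR eight 16-byte blocks (p must be 16-byte aligned) into two
       independent accumulators so loads and ORs can issue in parallel. */
    static inline __m128i or8(const __m128i *p)
    {
        __m128i v = p[0], w = p[1];
        SSE_REASSOC_BARRIER(v, w);
        v |= p[2]; w |= p[3];
        SSE_REASSOC_BARRIER(v, w);
        v |= p[4]; w |= p[5];
        SSE_REASSOC_BARRIER(v, w);
        v |= p[6]; w |= p[7];
        return v | w;
    }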
Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
---
util/bufferiszero.c | 108 ++++++++++++++++++++++++++++----------------
1 file changed, 69 insertions(+), 39 deletions(-)
diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index cb3eb2543f..d752edd8cc 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -66,62 +66,92 @@ buffer_is_zero_len_4_plus(const void *buf, size_t len)
#if defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
#include <immintrin.h>
-/* Note that each of these vectorized functions require len >= 64. */
+/* Helper for preventing the compiler from reassociating
+ chains of binary vector operations. */
+#define SSE_REASSOC_BARRIER(vec0, vec1) asm("" : "+x"(vec0), "+x"(vec1))
+
+/* Note that these vectorized functions may assume len >= 256. */
static bool __attribute__((target("sse2")))
buffer_zero_sse2(const void *buf, size_t len)
{
- __m128i t = _mm_loadu_si128(buf);
- __m128i *p = (__m128i *)(((uintptr_t)buf + 5 * 16) & -16);
- __m128i *e = (__m128i *)(((uintptr_t)buf + len) & -16);
- __m128i zero = _mm_setzero_si128();
-
- /* Loop over 16-byte aligned blocks of 64. */
- while (likely(p <= e)) {
- t = _mm_cmpeq_epi8(t, zero);
- if (unlikely(_mm_movemask_epi8(t) != 0xFFFF)) {
+ /* Unaligned loads at head/tail. */
+ __m128i v = *(__m128i_u *)(buf);
+ __m128i w = *(__m128i_u *)(buf + len - 16);
+ /* Align head/tail to 16-byte boundaries. */
+ __m128i *p = (void *)(((uintptr_t)buf + 16) & -16);
+ __m128i *e = (void *)(((uintptr_t)buf + len - 1) & -16);
+ __m128i zero = { 0 };
+
+ /* Collect a partial block at tail end. */
+ v |= e[-1]; w |= e[-2];
+ SSE_REASSOC_BARRIER(v, w);
+ v |= e[-3]; w |= e[-4];
+ SSE_REASSOC_BARRIER(v, w);
+ v |= e[-5]; w |= e[-6];
+ SSE_REASSOC_BARRIER(v, w);
+ v |= e[-7]; v |= w;
+
+ /* Loop over complete 128-byte blocks. */
+ for (; p < e - 7; p += 8) {
+ v = _mm_cmpeq_epi8(v, zero);
+ if (unlikely(_mm_movemask_epi8(v) != 0xFFFF)) {
return false;
}
- t = p[-4] | p[-3] | p[-2] | p[-1];
- p += 4;
+ v = p[0]; w = p[1];
+ SSE_REASSOC_BARRIER(v, w);
+ v |= p[2]; w |= p[3];
+ SSE_REASSOC_BARRIER(v, w);
+ v |= p[4]; w |= p[5];
+ SSE_REASSOC_BARRIER(v, w);
+ v |= p[6]; w |= p[7];
+ SSE_REASSOC_BARRIER(v, w);
+ v |= w;
}
- /* Finish the aligned tail. */
- t |= e[-3];
- t |= e[-2];
- t |= e[-1];
-
- /* Finish the unaligned tail. */
- t |= _mm_loadu_si128(buf + len - 16);
-
- return _mm_movemask_epi8(_mm_cmpeq_epi8(t, zero)) == 0xFFFF;
+ return _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero)) == 0xFFFF;
}
#ifdef CONFIG_AVX2_OPT
static bool __attribute__((target("avx2")))
buffer_zero_avx2(const void *buf, size_t len)
{
- /* Begin with an unaligned head of 32 bytes. */
- __m256i t = _mm256_loadu_si256(buf);
- __m256i *p = (__m256i *)(((uintptr_t)buf + 5 * 32) & -32);
- __m256i *e = (__m256i *)(((uintptr_t)buf + len) & -32);
-
- /* Loop over 32-byte aligned blocks of 128. */
- while (p <= e) {
- if (unlikely(!_mm256_testz_si256(t, t))) {
+ /* Unaligned loads at head/tail. */
+ __m256i v = *(__m256i_u *)(buf);
+ __m256i w = *(__m256i_u *)(buf + len - 32);
+ /* Align head/tail to 32-byte boundaries. */
+ __m256i *p = (void *)(((uintptr_t)buf + 32) & -32);
+ __m256i *e = (void *)(((uintptr_t)buf + len - 1) & -32);
+ __m256i zero = { 0 };
+
+ /* Collect a partial block at tail end. */
+ v |= e[-1]; w |= e[-2];
+ SSE_REASSOC_BARRIER(v, w);
+ v |= e[-3]; w |= e[-4];
+ SSE_REASSOC_BARRIER(v, w);
+ v |= e[-5]; w |= e[-6];
+ SSE_REASSOC_BARRIER(v, w);
+ v |= e[-7]; v |= w;
+
+ /* Loop over complete 256-byte blocks. */
+ for (; p < e - 7; p += 8) {
+ /* PTEST is not profitable here. */
+ v = _mm256_cmpeq_epi8(v, zero);
+ if (unlikely(_mm256_movemask_epi8(v) != 0xFFFFFFFF)) {
return false;
}
- t = p[-4] | p[-3] | p[-2] | p[-1];
- p += 4;
- } ;
-
- /* Finish the last block of 128 unaligned. */
- t |= _mm256_loadu_si256(buf + len - 4 * 32);
- t |= _mm256_loadu_si256(buf + len - 3 * 32);
- t |= _mm256_loadu_si256(buf + len - 2 * 32);
- t |= _mm256_loadu_si256(buf + len - 1 * 32);
+ v = p[0]; w = p[1];
+ SSE_REASSOC_BARRIER(v, w);
+ v |= p[2]; w |= p[3];
+ SSE_REASSOC_BARRIER(v, w);
+ v |= p[4]; w |= p[5];
+ SSE_REASSOC_BARRIER(v, w);
+ v |= p[6]; w |= p[7];
+ SSE_REASSOC_BARRIER(v, w);
+ v |= w;
+ }
- return _mm256_testz_si256(t, t);
+ return _mm256_movemask_epi8(_mm256_cmpeq_epi8(v, zero)) == 0xFFFFFFFF;
}
#endif /* CONFIG_AVX2_OPT */
--
2.32.0
* [PATCH v3 6/6] util/bufferiszero: improve scalar variant
From: Alexander Monakov @ 2024-02-06 20:48 UTC (permalink / raw)
To: qemu-devel
Cc: Mikhail Romanov, Richard Henderson, Paolo Bonzini,
Alexander Monakov
Take into account that the inline wrapper ensures len >= 4.
Use __attribute__((may_alias)) for accesses via non-char pointers.
Avoid using out-of-bounds pointers in loop boundary conditions by
reformulating the 'for' loop as 'if (...) do { ... } while (...)'.
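A portable sketch of the overlapping-load trick for 4 <= len <= 8 (memcpy
stands in for QEMU's ldl_he_p helper; tiny_is_zero is an illustrative name):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    static bool tiny_is_zero(const void *buf, size_t len) /* 4 <= len <= 8 */
    {
        uint32_t head, tail;
        memcpy(&head, buf, 4);                          /* bytes [0, 4)       */
        memcpy(&tail, (const char *)buf + len - 4, 4);  /* bytes [len-4, len) */
        /* The two 4-byte windows overlap whenever len < 8, so every byte is
           covered by the two unaligned loads and no loop is needed. */
        return (head | tail) == 0;
    }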
Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
---
util/bufferiszero.c | 30 +++++++++++-------------------
1 file changed, 11 insertions(+), 19 deletions(-)
diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index d752edd8cc..1f4cbfaea4 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -29,35 +29,27 @@
bool
buffer_is_zero_len_4_plus(const void *buf, size_t len)
{
- if (unlikely(len < 8)) {
- /* For a very small buffer, simply accumulate all the bytes. */
- const unsigned char *p = buf;
- const unsigned char *e = buf + len;
- unsigned char t = 0;
-
- do {
- t |= *p++;
- } while (p < e);
-
- return t == 0;
+ if (unlikely(len <= 8)) {
+ /* Our caller ensures len >= 4. */
+ return (ldl_he_p(buf) | ldl_he_p(buf + len - 4)) == 0;
} else {
- /* Otherwise, use the unaligned memory access functions to
- handle the beginning and end of the buffer, with a couple
+ /* Use unaligned memory access functions to handle
+ the beginning and end of the buffer, with a couple
of loops handling the middle aligned section. */
- uint64_t t = ldq_he_p(buf);
- const uint64_t *p = (uint64_t *)(((uintptr_t)buf + 8) & -8);
- const uint64_t *e = (uint64_t *)(((uintptr_t)buf + len) & -8);
+ uint64_t t = ldq_he_p(buf) | ldq_he_p(buf + len - 8);
+ typedef uint64_t uint64_a __attribute__((may_alias));
+ const uint64_a *p = (void *)(((uintptr_t)buf + 8) & -8);
+ const uint64_a *e = (void *)(((uintptr_t)buf + len - 1) & -8);
- for (; p + 8 <= e; p += 8) {
+ if (e - p >= 8) do {
if (t) {
return false;
}
t = p[0] | p[1] | p[2] | p[3] | p[4] | p[5] | p[6] | p[7];
- }
+ } while ((p += 8) <= e - 8);
while (p < e) {
t |= *p++;
}
- t |= ldq_he_p(buf + len - 8);
return t == 0;
}
--
2.32.0
* Re: [PATCH v3 1/6] util/bufferiszero: remove SSE4.1 variant
From: Richard Henderson @ 2024-02-06 22:24 UTC (permalink / raw)
To: Alexander Monakov, qemu-devel; +Cc: Mikhail Romanov, Paolo Bonzini
On 2/7/24 06:48, Alexander Monakov wrote:
> The SSE4.1 variant is virtually identical to the SSE2 variant, except
> for using 'PTEST+JNZ' in place of 'PCMPEQB+PMOVMSKB+CMP+JNE' for testing
> if an SSE register is all zeroes. The PTEST instruction decodes to two
> uops, so it can be handled only by the complex decoder, and since
> CMP+JNE are macro-fused, both sequences decode to three uops. The uops
> comprising the PTEST instruction dispatch to p0 and p5 on Intel CPUs, so
> PCMPEQB+PMOVMSKB is comparatively more flexible from dispatch
> standpoint.
>
> Hence, the use of PTEST brings no benefit from throughput standpoint.
> Its latency is not important, since it feeds only a conditional jump,
> which terminates the dependency chain.
>
> I never observed PTEST variants to be faster on real hardware.
>
> Signed-off-by: Alexander Monakov<amonakov@ispras.ru>
> Signed-off-by: Mikhail Romanov<mmromanov@ispras.ru>
> ---
> util/bufferiszero.c | 29 -----------------------------
> 1 file changed, 29 deletions(-)
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
r~
* Re: [PATCH v3 3/6] util/bufferiszero: remove AVX512 variant
From: Richard Henderson @ 2024-02-06 22:28 UTC (permalink / raw)
To: Alexander Monakov, qemu-devel; +Cc: Mikhail Romanov, Paolo Bonzini
On 2/7/24 06:48, Alexander Monakov wrote:
> Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
> routines are invoked much more rarely in normal use when most buffers
> are non-zero. This makes use of AVX512 unprofitable, as it incurs extra
> frequency and voltage transition periods during which the CPU operates
> at reduced performance, as described in
> https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html
>
> Signed-off-by: Mikhail Romanov<mmromanov@ispras.ru>
> Signed-off-by: Alexander Monakov<amonakov@ispras.ru>
> ---
> util/bufferiszero.c | 36 ++----------------------------------
> 1 file changed, 2 insertions(+), 34 deletions(-)
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Although I think this patch should be ordered second.
r~
* Re: [PATCH v3 4/6] util/bufferiszero: remove useless prefetches
From: Richard Henderson @ 2024-02-06 22:29 UTC (permalink / raw)
To: Alexander Monakov, qemu-devel; +Cc: Mikhail Romanov, Paolo Bonzini
On 2/7/24 06:48, Alexander Monakov wrote:
> Use of prefetching in bufferiszero.c is quite questionable:
>
> - prefetches are issued just a few CPU cycles before the corresponding
> line would be hit by demand loads;
>
> - they are done for simple access patterns, i.e. where hardware
> prefetchers can perform better;
>
> - they compete for load ports in loops that should be limited by load
> port throughput rather than ALU throughput.
>
> Signed-off-by: Alexander Monakov<amonakov@ispras.ru>
> Signed-off-by: Mikhail Romanov<mmromanov@ispras.ru>
> ---
> util/bufferiszero.c | 3 ---
> 1 file changed, 3 deletions(-)
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
r~
* Re: [PATCH v3 6/6] util/bufferiszero: improve scalar variant
From: Richard Henderson @ 2024-02-06 22:34 UTC (permalink / raw)
To: Alexander Monakov, qemu-devel; +Cc: Mikhail Romanov, Paolo Bonzini
On 2/7/24 06:48, Alexander Monakov wrote:
> - /* Otherwise, use the unaligned memory access functions to
> - handle the beginning and end of the buffer, with a couple
> + /* Use unaligned memory access functions to handle
> + the beginning and end of the buffer, with a couple
> of loops handling the middle aligned section. */
> - uint64_t t = ldq_he_p(buf);
> - const uint64_t *p = (uint64_t *)(((uintptr_t)buf + 8) & -8);
> - const uint64_t *e = (uint64_t *)(((uintptr_t)buf + len) & -8);
> + uint64_t t = ldq_he_p(buf) | ldq_he_p(buf + len - 8);
> + typedef uint64_t uint64_a __attribute__((may_alias));
> + const uint64_a *p = (void *)(((uintptr_t)buf + 8) & -8);
> + const uint64_a *e = (void *)(((uintptr_t)buf + len - 1) & -8);
You appear to be optimizing this routine for x86, which is not the primary consumer.
This is going to perform very poorly on hosts that do not support unaligned accesses (e.g.
Sparc and some RISC-V).
r~
* Re: [PATCH v3 2/6] util/bufferiszero: introduce an inline wrapper
From: Richard Henderson @ 2024-02-06 22:44 UTC (permalink / raw)
To: Alexander Monakov, qemu-devel; +Cc: Mikhail Romanov, Paolo Bonzini
On 2/7/24 06:48, Alexander Monakov wrote:
> Make buffer_is_zero a 'static inline' function that tests up to three
> bytes from the buffer before handing off to an unrolled loop. This
> eliminates call overhead for most non-zero buffers, and allows to
> optimize out length checks when it is known at compile time (which is
> often the case in Qemu).
>
> Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
> Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
> ---
> include/qemu/cutils.h | 28 +++++++++++++++-
> util/bufferiszero.c | 76 ++++++++++++-------------------------------
> 2 files changed, 47 insertions(+), 57 deletions(-)
>
> diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h
> index 92c927a6a3..62b153e603 100644
> --- a/include/qemu/cutils.h
> +++ b/include/qemu/cutils.h
> @@ -187,9 +187,35 @@ char *freq_to_str(uint64_t freq_hz);
> /* used to print char* safely */
> #define STR_OR_NULL(str) ((str) ? (str) : "null")
>
> -bool buffer_is_zero(const void *buf, size_t len);
> +bool buffer_is_zero_len_4_plus(const void *, size_t);
> +extern bool (*buffer_is_zero_len_256_plus)(const void *, size_t);
Why 256, when the avx2 routine can handle size 128, and you're about to remove avx512?
You appear to have missed that select_accel_fn() resolves directly to buffer_zero_int, aka
buffer_is_zero_len_4_plus for non-x86, without an indirect function call.
I think you should not attempt to expose the 4 vs larger implementation detail here in the
inline function. Presumably the bulk of the benefit in avoiding the function call is
already realized via the three byte spot checks.
r~
* Re: [PATCH v3 6/6] util/bufferiszero: improve scalar variant
From: Richard Henderson @ 2024-02-06 22:46 UTC (permalink / raw)
To: Alexander Monakov, qemu-devel; +Cc: Mikhail Romanov, Paolo Bonzini
On 2/7/24 08:34, Richard Henderson wrote:
> On 2/7/24 06:48, Alexander Monakov wrote:
>> - /* Otherwise, use the unaligned memory access functions to
>> - handle the beginning and end of the buffer, with a couple
>> + /* Use unaligned memory access functions to handle
>> + the beginning and end of the buffer, with a couple
>> of loops handling the middle aligned section. */
>> - uint64_t t = ldq_he_p(buf);
>> - const uint64_t *p = (uint64_t *)(((uintptr_t)buf + 8) & -8);
>> - const uint64_t *e = (uint64_t *)(((uintptr_t)buf + len) & -8);
>> + uint64_t t = ldq_he_p(buf) | ldq_he_p(buf + len - 8);
>> + typedef uint64_t uint64_a __attribute__((may_alias));
>> + const uint64_a *p = (void *)(((uintptr_t)buf + 8) & -8);
>> + const uint64_a *e = (void *)(((uintptr_t)buf + len - 1) & -8);
>
> You appear to be optimizing this routine for x86, which is not the primary consumer.
>
> This is going to perform very poorly on hosts that do not support unaligned accesses (e.g.
> Sparc and some RISC-V).
I beg your pardon, I misread this. You're only replacing the byte loops, which will be
more or less identical, modulo unrolling, when unaligned access is not supported, but
they will be much improved if some unaligned access support is available (e.g. MIPS LWL+LWR).
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
r~
* Re: [PATCH v3 5/6] util/bufferiszero: optimize SSE2 and AVX2 variants
From: Richard Henderson @ 2024-02-06 23:10 UTC (permalink / raw)
To: Alexander Monakov, qemu-devel; +Cc: Mikhail Romanov, Paolo Bonzini
On 2/7/24 06:48, Alexander Monakov wrote:
> Increase unroll factor in SIMD loops from 4x to 8x in order to move
> their bottlenecks from ALU port contention to load issue rate (two loads
> per cycle on popular x86 implementations).
Ah, that answers my question re 128 vs 256 byte minimum.
So as far as this patch goes,
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
r~
* Re: [PATCH v3 3/6] util/bufferiszero: remove AVX512 variant
From: Elena Ufimtseva @ 2024-02-06 23:56 UTC (permalink / raw)
To: Alexander Monakov
Cc: qemu-devel, Mikhail Romanov, Richard Henderson, Paolo Bonzini
Hello Alexander
On Tue, Feb 6, 2024 at 12:50 PM Alexander Monakov <amonakov@ispras.ru>
wrote:
> Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
> routines are invoked much more rarely in normal use when most buffers
> are non-zero. This makes use of AVX512 unprofitable, as it incurs extra
> frequency and voltage transition periods during which the CPU operates
> at reduced performance, as described in
> https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html
I would like to point out that the frequency scaling is not currently an
issue on AMD Zen 4 Genoa CPUs, for example.
The microarchitecture is described in this white paper:
https://www.amd.com/system/files/documents/4th-gen-epyc-processor-architecture-white-paper.pdf
Although the CPU frequency downscaling mentioned in the above document is
discussed only in relation to floating-point operations, from other online
discussions I gather that the data path for the integer registers in Zen 4
is also 256 bits, which allows it to avoid frequency downscaling for FP and
heavy instructions.
Also, looking at the optimizations for AVX2 in your other patch, would
unrolling the loop for AVX512 ops benefit from the same speedup, given that
the data path has the same width?
If the frequency downscaling is not observed on some of the CPUs, can
AVX512 be maintained and used selectively for those CPUs?
Thank you!
>
>
> Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
> Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
> ---
> util/bufferiszero.c | 36 ++----------------------------------
> 1 file changed, 2 insertions(+), 34 deletions(-)
>
> diff --git a/util/bufferiszero.c b/util/bufferiszero.c
> index 01050694a6..c037d11d04 100644
> --- a/util/bufferiszero.c
> +++ b/util/bufferiszero.c
> @@ -64,7 +64,7 @@ buffer_is_zero_len_4_plus(const void *buf, size_t len)
> }
> }
>
> -#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT) ||
> defined(__SSE2__)
> +#if defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
> #include <immintrin.h>
>
> /* Note that each of these vectorized functions require len >= 64. */
> @@ -128,35 +128,6 @@ buffer_zero_avx2(const void *buf, size_t len)
> }
> #endif /* CONFIG_AVX2_OPT */
>
> -#ifdef CONFIG_AVX512F_OPT
> -static bool __attribute__((target("avx512f")))
> -buffer_zero_avx512(const void *buf, size_t len)
> -{
> - /* Begin with an unaligned head of 64 bytes. */
> - __m512i t = _mm512_loadu_si512(buf);
> - __m512i *p = (__m512i *)(((uintptr_t)buf + 5 * 64) & -64);
> - __m512i *e = (__m512i *)(((uintptr_t)buf + len) & -64);
> -
> - /* Loop over 64-byte aligned blocks of 256. */
> - while (p <= e) {
> - __builtin_prefetch(p);
> - if (unlikely(_mm512_test_epi64_mask(t, t))) {
> - return false;
> - }
> - t = p[-4] | p[-3] | p[-2] | p[-1];
> - p += 4;
> - }
> -
> - t |= _mm512_loadu_si512(buf + len - 4 * 64);
> - t |= _mm512_loadu_si512(buf + len - 3 * 64);
> - t |= _mm512_loadu_si512(buf + len - 2 * 64);
> - t |= _mm512_loadu_si512(buf + len - 1 * 64);
> -
> - return !_mm512_test_epi64_mask(t, t);
> -
> -}
> -#endif /* CONFIG_AVX512F_OPT */
> -
> static unsigned __attribute__((noinline))
> select_accel_cpuinfo(unsigned info)
> {
> @@ -165,9 +136,6 @@ select_accel_cpuinfo(unsigned info)
> unsigned bit;
> bool (*fn)(const void *, size_t);
> } all[] = {
> -#ifdef CONFIG_AVX512F_OPT
> - { CPUINFO_AVX512F, buffer_zero_avx512 },
> -#endif
> #ifdef CONFIG_AVX2_OPT
> { CPUINFO_AVX2, buffer_zero_avx2 },
> #endif
> @@ -191,7 +159,7 @@ static unsigned used_accel
> = 0;
> #endif
>
> -#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT)
> +#if defined(CONFIG_AVX2_OPT)
> static void __attribute__((constructor)) init_accel(void)
> {
> used_accel = select_accel_cpuinfo(cpuinfo_init());
> --
> 2.32.0
>
>
>
--
Elena
* Re: [PATCH v3 3/6] util/bufferiszero: remove AVX512 variant
From: Alexander Monakov @ 2024-02-07 6:29 UTC (permalink / raw)
To: Elena Ufimtseva
Cc: qemu-devel, Mikhail Romanov, Richard Henderson, Paolo Bonzini
On Tue, 6 Feb 2024, Elena Ufimtseva wrote:
> Hello Alexander
>
> On Tue, Feb 6, 2024 at 12:50 PM Alexander Monakov <amonakov@ispras.ru>
> wrote:
>
> > Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
> > routines are invoked much more rarely in normal use when most buffers
> > are non-zero. This makes use of AVX512 unprofitable, as it incurs extra
> > frequency and voltage transition periods during which the CPU operates
> > at reduced performance, as described in
> > https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html
>
>
> I would like to point out that the frequency scaling is not currently an
> issue on AMD Zen4 Genoa CPUs, for example.
> And microcode architecture description here:
> https://www.amd.com/system/files/documents/4th-gen-epyc-processor-architecture-white-paper.pdf
> Although, the cpu frequency downscaling mentioned in the above document is
> only in relation to floating point operations.
> But from other online discussions I gather that the data path for the
> integer registers in Zen4 is also 256 bits and it allows to avoid
> frequency downscaling for FP and heavy instructions.
Yes, that's correct: in particular, on Zen 4 512-bit vector loads occupy load
ports for two consecutive cycles, so from a load-throughput perspective there's
no difference between 256-bit vectors and 512-bit vectors. Generally AVX-512
still has benefits on Zen 4 since it's a richer instruction set (it also reduces
pressure in the CPU front-end and is more power-efficient), but as the new AVX2
buffer_is_zero is saturating load ports I would expect that AVX512 can exceed
its performance only by a small margin if at all, not anywhere close to 2x.
> And looking at the optimizations for AVX2 in your other patch, would
> unrolling the loop for AVX512 ops benefit from the speedup taken that the
> data path has the same width?
No, 256-bit datapath on Zen 4 means that it's easier to saturate it with
512-bit loads than with 256-bit loads, so an AVX512 loop is roughly comparable
to a similar AVX-256 loop unrolled twice.
Aside: AVX512 variant needs a little more thought to use VPTERNLOG properly.
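(For context, a minimal sketch of that idea, assuming AVX512F and a compiler
flag such as -mavx512f; the or3 helper is illustrative, not a proposed patch.
The imm8 value 0xFE encodes the three-way OR truth table:)

    #include <immintrin.h>

    /* VPTERNLOGQ folds a three-way OR into one instruction, halving the
       number of OR operations in an unrolled accumulation loop. */
    static inline __m512i or3(__m512i a, __m512i b, __m512i c)
    {
        return _mm512_ternarylogic_epi64(a, b, c, 0xFE);
    }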
> If the frequency downscaling is not observed on some of the CPUs, can
> AVX512 be maintained and used selectively for some
> of the CPUs?
Please note that a properly optimized buffer_is_zero is limited by load
throughput, not ALUs. On Zen 4 AVX2 is sufficient to saturate L1 cache load
bandwidth in buffer_is_zero. For data outside of L1 cache, the benefits
of AVX-512 diminish more and more.
I don't have Zen 4 based machines at hand to see if AVX-512 is beneficial
there for buffer_is_zero for reasons like reaching higher turbo clocks or
higher memory parallelism.
Finally, let's consider a somewhat broader perspective. Suppose
buffer_is_zero takes 50% of overall application runtime, and 9 out of
10 buffers are found to be non-zero by the inline wrapper that samples
three bytes. Then the vectorized routine takes about 5% of application
time, and speeding it up even by 20% only shaves off 1% of overall
execution time.
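Spelling out that arithmetic:

    time spent in the SIMD routine ~ 0.50 * (1/10) = 0.05  (about 5% of total)
    saving from a 20% speedup      ~ 0.20 * 0.05   = 0.01  (about 1% of total)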
Alexander
* Re: [PATCH v3 2/6] util/bufferiszero: introduce an inline wrapper
From: Alexander Monakov @ 2024-02-07 7:13 UTC (permalink / raw)
To: Richard Henderson; +Cc: qemu-devel, Mikhail Romanov, Paolo Bonzini
On Wed, 7 Feb 2024, Richard Henderson wrote:
> On 2/7/24 06:48, Alexander Monakov wrote:
> > Make buffer_is_zero a 'static inline' function that tests up to three
> > bytes from the buffer before handing off to an unrolled loop. This
> > eliminates call overhead for most non-zero buffers, and allows to
> > optimize out length checks when it is known at compile time (which is
> > often the case in Qemu).
> >
> > Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
> > Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
> > ---
> > include/qemu/cutils.h | 28 +++++++++++++++-
> > util/bufferiszero.c | 76 ++++++++++++-------------------------------
> > 2 files changed, 47 insertions(+), 57 deletions(-)
> >
> > diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h
> > index 92c927a6a3..62b153e603 100644
> > --- a/include/qemu/cutils.h
> > +++ b/include/qemu/cutils.h
> > @@ -187,9 +187,35 @@ char *freq_to_str(uint64_t freq_hz);
> > /* used to print char* safely */
> > #define STR_OR_NULL(str) ((str) ? (str) : "null")
> >
> > -bool buffer_is_zero(const void *buf, size_t len);
> > +bool buffer_is_zero_len_4_plus(const void *, size_t);
> > +extern bool (*buffer_is_zero_len_256_plus)(const void *, size_t);
>
> Why 256, when the avx2 routine can handle size 128, and you're about to remove
> avx512?
(yes, avx2 is bumped to 256-byte chunks in a later patch)
> You appear to have missed that select_accel_fn() resolves directly to
> buffer_zero_int, aka buffer_is_zero_len_4_plus for non-x86, without an
> indirect function call.
>
> I think you should not attempt to expose the 4 vs larger implementation detail
> here in the inline function. Presumably the bulk of the benefit in avoiding
> the function call is already realized via the three byte spot checks.
Thank you. I agree we shouldn't penalize non-x86 hosts here, but to be honest
I'd really like to keep this optimization because so many places in QEMU invoke
buffer_is_zero with a constant length, allowing the compiler to optimize out
the length test. Would you be open to testing availability of optimized variants
in the inline wrapper like this:
diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h
index 62b153e603..7a2145ffef 100644
--- a/include/qemu/cutils.h
+++ b/include/qemu/cutils.h
@@ -209,11 +209,12 @@ static inline bool buffer_is_zero(const void *vbuf, size_t len)
return true;
}
+#if defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
if (len >= 256) {
return buffer_is_zero_len_256_plus(vbuf, len);
- } else {
- return buffer_is_zero_len_4_plus(vbuf, len);
}
+#endif
+ return buffer_is_zero_len_4_plus(vbuf, len);
}
/*
Alexander
* Re: [PATCH v3 3/6] util/bufferiszero: remove AVX512 variant
From: Joao Martins @ 2024-02-07 10:38 UTC (permalink / raw)
To: Alexander Monakov, Elena Ufimtseva
Cc: qemu-devel, Mikhail Romanov, Richard Henderson, Paolo Bonzini
On 07/02/2024 06:29, Alexander Monakov wrote:
> On Tue, 6 Feb 2024, Elena Ufimtseva wrote:
>> Hello Alexander
>>
>> On Tue, Feb 6, 2024 at 12:50 PM Alexander Monakov <amonakov@ispras.ru>
>> wrote:
>>
>>> Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
>>> routines are invoked much more rarely in normal use when most buffers
>>> are non-zero. This makes use of AVX512 unprofitable, as it incurs extra
>>> frequency and voltage transition periods during which the CPU operates
>>> at reduced performance, as described in
>>> https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html
>>
>> I would like to point out that the frequency scaling is not currently an
>> issue on AMD Zen4 Genoa CPUs, for example.
>> And microcode architecture description here:
>> https://www.amd.com/system/files/documents/4th-gen-epyc-processor-architecture-white-paper.pdf
>> Although, the cpu frequency downscaling mentioned in the above document is
>> only in relation to floating point operations.
>> But from other online discussions I gather that the data path for the
>> integer registers in Zen4 is also 256 bits and it allows to avoid
>> frequency downscaling for FP and heavy instructions.
>
> Yes, that's correct: in particular, on Zen 4 512-bit vector loads occupy load
> ports for two consecutive cycles, so from load throughput perspective there's
> no difference between 256-bit vectors and 512-bit vectors. Generally AVX-512
> still has benefits on Zen 4 since it's a richer instruction set (it also reduces
> pressure in the CPU front-end and is more power-efficient), but as the new AVX2
> buffer_is_zero is saturating load ports I would expect that AVX512 can exceed
> its performance only by a small margin if at all, not anywhere close to 2x.
>
>> And looking at the optimizations for AVX2 in your other patch, would
>> unrolling the loop for AVX512 ops benefit from the speedup taken that the
>> data path has the same width?
>
> No, 256-bit datapath on Zen 4 means that it's easier to saturate it with
> 512-bit loads than with 256-bit loads, so an AVX512 loop is roughly comparable
> to a similar AVX-256 loop unrolled twice.
>
> Aside: AVX512 variant needs a little more thought to use VPTERNLOG properly.
>
>> If the frequency downscaling is not observed on some of the CPUs, can
>> AVX512 be maintained and used selectively for some
>> of the CPUs?
>
> Please note that a properly optimized buffer_is_zero is limited by load
> throughput, not ALUs. On Zen 4 AVX2 is sufficient to saturate L1 cache load
> bandwidth in buffer_is_zero. For data outside of L1 cache, the benefits
> of AVX-512 diminish more and more.
>
> I don't have Zen 4 based machines at hand to see if AVX-512 is beneficial
> there for buffer_is_zero for reasons like reaching higher turbo clocks or
> higher memory parallelism.
>
FWIW, the frequency downscaling problem, which was most prominent on Skylake, is
/supposedly/ no longer observed on Intel Sapphire Rapids either:
https://www.phoronix.com/review/intel-sapphirerapids-avx512/8
* Re: [PATCH v3 2/6] util/bufferiszero: introduce an inline wrapper
From: Richard Henderson @ 2024-02-08 20:07 UTC (permalink / raw)
To: Alexander Monakov; +Cc: qemu-devel, Mikhail Romanov, Paolo Bonzini
On 2/6/24 21:13, Alexander Monakov wrote:
> Thank you. I agree we shouldn't penalize non-x86 hosts here, but to be honest
> I'd really like to keep this optimization because so many places in Qemu invoke
> buffer_is_zero with a constant length, allowing the compiler to optimize out
> the length test.
Hmm. True, both migration and image copy use large blocks frequently.
> Would you be open to testing availability of optimized variants
> in the inline wrapper like this:
>
> diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h
> index 62b153e603..7a2145ffef 100644
> --- a/include/qemu/cutils.h
> +++ b/include/qemu/cutils.h
> @@ -209,11 +209,12 @@ static inline bool buffer_is_zero(const void *vbuf, size_t len)
> return true;
> }
>
> +#if defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
> if (len >= 256) {
> return buffer_is_zero_len_256_plus(vbuf, len);
> - } else {
> - return buffer_is_zero_len_4_plus(vbuf, len);
> }
> +#endif
> + return buffer_is_zero_len_4_plus(vbuf, len);
Plausible.
Also, now that we're down to two variants instead of four, perhaps a statically predicted
direct branch or two might be better than an indirect branch? E.g.
bool buffer_is_zero_len_256_plus(buf, len)
{
#ifdef CONFIG_AVX2_OPT
    if (select_accel & CPUINFO_AVX2) {
        return buffer_zero_avx2(buf, len);
    }
#endif
#ifdef __SSE2__
    if (select_accel & CPUINFO_SSE2) {
        return buffer_zero_sse2(buf, len);
    }
#endif
    return buffer_is_zero_len_4_plus(buf, len);
}
where select_accel would be set by test_buffer_is_zero_next_accel() etc.
r~
Thread overview: 19 messages
2024-02-06 20:48 [PATCH v3 0/6] Optimize buffer_is_zero Alexander Monakov
2024-02-06 20:48 ` [PATCH v3 1/6] util/bufferiszero: remove SSE4.1 variant Alexander Monakov
2024-02-06 22:24 ` Richard Henderson
2024-02-06 20:48 ` [PATCH v3 2/6] util/bufferiszero: introduce an inline wrapper Alexander Monakov
2024-02-06 22:44 ` Richard Henderson
2024-02-07 7:13 ` Alexander Monakov
2024-02-08 20:07 ` Richard Henderson
2024-02-06 20:48 ` [PATCH v3 3/6] util/bufferiszero: remove AVX512 variant Alexander Monakov
2024-02-06 22:28 ` Richard Henderson
2024-02-06 23:56 ` Elena Ufimtseva
2024-02-07 6:29 ` Alexander Monakov
2024-02-07 10:38 ` Joao Martins
2024-02-06 20:48 ` [PATCH v3 4/6] util/bufferiszero: remove useless prefetches Alexander Monakov
2024-02-06 22:29 ` Richard Henderson
2024-02-06 20:48 ` [PATCH v3 5/6] util/bufferiszero: optimize SSE2 and AVX2 variants Alexander Monakov
2024-02-06 23:10 ` Richard Henderson
2024-02-06 20:48 ` [PATCH v3 6/6] util/bufferiszero: improve scalar variant Alexander Monakov
2024-02-06 22:34 ` Richard Henderson
2024-02-06 22:46 ` Richard Henderson