[PATCH 0/9] xen/bitops: hweight() cleanup/improvements

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/9] xen/bitops: hweight() cleanup/improvements
@ 2024-08-22 23:06 Andrew Cooper
  2024-08-22 23:06 ` [PATCH 1/9] xen/bitops: Reinstate the please tidy message Andrew Cooper
                   ` (8 more replies)
  0 siblings, 9 replies; 40+ messages in thread
From: Andrew Cooper @ 2024-08-22 23:06 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Wei Liu,
	Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Bertrand Marquis, Michal Orzel, Oleksii Kurochko, Shawn Anastasio

The next phase of bitops cleanup.  This series:

 1) Untangles the mess around hweight()
 2) Removes some unwise uses of hweight()
 3) Makes it work transparently for RISC-V
 4) Use the POPCNT instruction on x86 when available

https://gitlab.com/xen-project/people/andyhhp/xen/-/pipelines/1424000649
https://cirrus-ci.com/build/4947148859506688

Andrew Cooper (9):
  xen/bitops: Reinstate the please tidy message
  xen/bitops: Introduce a multiple_bits_set() helper
  xen/bitops: Convert 'hweight(x) > 1' to new multiple_bits_set()
  xen/bitops: Drop the remnants of hweight{8,16}()
  xen/bitops: Introduce generic_hweightl() and hweightl()
  xen/bitops: Drop hweight_long() and use hweightl()
  xen/bitops: Implement hweight64() in terms of hweightl()
  xen/bitops: Implement hweight32() in terms of hweightl()
  x86/bitops: Use the POPCNT instruction when available

 xen/arch/arm/include/asm/bitops.h |  11 ----
 xen/arch/ppc/include/asm/bitops.h |  11 +---
 xen/arch/x86/cpu/vpmu.c           |   2 +-
 xen/arch/x86/hvm/vlapic.c         |  10 +--
 xen/arch/x86/include/asm/bitops.h |  30 ++++++---
 xen/common/bitmap.c               |   4 +-
 xen/common/bitops.c               |  41 ++++++++++++
 xen/common/numa.c                 |   2 +-
 xen/include/xen/bitops.h          | 102 +++++++++++++-----------------
 xen/include/xen/self-tests.h      |  10 ++-
 xen/lib/Makefile                  |   2 +
 xen/lib/arch-generic-hweightl.S   |  46 ++++++++++++++
 xen/lib/generic-hweightl.c        |  61 ++++++++++++++++++
 13 files changed, 232 insertions(+), 100 deletions(-)
 create mode 100644 xen/lib/arch-generic-hweightl.S
 create mode 100644 xen/lib/generic-hweightl.c

-- 
2.39.2



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 1/9] xen/bitops: Reinstate the please tidy message
  2024-08-22 23:06 [PATCH 0/9] xen/bitops: hweight() cleanup/improvements Andrew Cooper
@ 2024-08-22 23:06 ` Andrew Cooper
  2024-08-22 23:06 ` [PATCH 2/9] xen/bitops: Introduce a multiple_bits_set() helper Andrew Cooper
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2024-08-22 23:06 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Wei Liu,
	Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Bertrand Marquis, Michal Orzel, Oleksii Kurochko, Shawn Anastasio

Recent additions have undone prior tidying at the top of the file.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien@xen.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Bertrand Marquis <bertrand.marquis@arm.com>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Oleksii Kurochko <oleksii.kurochko@gmail.com>
CC: Shawn Anastasio <sanastasio@raptorengineering.com>
---
 xen/include/xen/bitops.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/xen/include/xen/bitops.h b/xen/include/xen/bitops.h
index 3d88d2f3f1d6..b8bb2ffcd337 100644
--- a/xen/include/xen/bitops.h
+++ b/xen/include/xen/bitops.h
@@ -204,6 +204,8 @@ static always_inline bool test_bit(int nr, const volatile void *addr)
     test_bit(nr, addr);                                 \
 })
 
+/* --------------------- Please tidy above here --------------------- */
+
 static always_inline __pure unsigned int ffs(unsigned int x)
 {
     if ( __builtin_constant_p(x) )
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 2/9] xen/bitops: Introduce a multiple_bits_set() helper
  2024-08-22 23:06 [PATCH 0/9] xen/bitops: hweight() cleanup/improvements Andrew Cooper
  2024-08-22 23:06 ` [PATCH 1/9] xen/bitops: Reinstate the please tidy message Andrew Cooper
@ 2024-08-22 23:06 ` Andrew Cooper
  2024-08-26 10:30   ` Jan Beulich
  2024-08-22 23:06 ` [PATCH 3/9] xen/bitops: Convert 'hweight(x) > 1' to new multiple_bits_set() Andrew Cooper
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2024-08-22 23:06 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Wei Liu,
	Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Bertrand Marquis, Michal Orzel, Oleksii Kurochko, Shawn Anastasio

This will be used to simplify real logic in the following patch.  Add compile
and boot time testing as with other bitops.

Because the expression is so simple, implement it as a function-like macro
which is generic on the type of it's argument, rather than having multiple
variants.

Testing function-like macros needs a minor adjustments to the infrastructure
in xen/self-tests.h to avoid bracketing the fn parameter.  The utility of this
outweighs the associated risks.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien@xen.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Bertrand Marquis <bertrand.marquis@arm.com>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Oleksii Kurochko <oleksii.kurochko@gmail.com>
CC: Shawn Anastasio <sanastasio@raptorengineering.com>

Name inevitably subject to nitpicking.  I'd prefer it to be shorter but I
can't think of anything suitable.
---
 xen/common/bitops.c          | 24 ++++++++++++++++++++++++
 xen/include/xen/bitops.h     | 10 ++++++++++
 xen/include/xen/self-tests.h | 10 ++++++++--
 3 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/xen/common/bitops.c b/xen/common/bitops.c
index 94a8983af99c..4545682aa8e0 100644
--- a/xen/common/bitops.c
+++ b/xen/common/bitops.c
@@ -84,8 +84,32 @@ static void __init test_fls(void)
     CHECK(fls64, 0x8000000000000001ULL, 64);
 }
 
+static void __init test_multiple_bits_set(void)
+{
+    /*
+     * multiple_bits_set() is generic on the type of it's parameter, as the
+     * internal expression is so simple.
+     */
+
+    CHECK(multiple_bits_set, 0, false);
+    CHECK(multiple_bits_set, 1, false);
+    CHECK(multiple_bits_set, 2, false);
+    CHECK(multiple_bits_set, 3, true);
+
+    CHECK(multiple_bits_set, 1 | (1UL << (BITS_PER_LONG - 1)), true);
+#if BITS_PER_LONG > 32
+    CHECK(multiple_bits_set, 1 | (1UL << 32), true);
+    CHECK(multiple_bits_set, 1 | (1UL << 63), true);
+#endif
+
+    CHECK(multiple_bits_set, 0x8000000000000001ULL, true);
+    CHECK(multiple_bits_set, 0xc000000000000000ULL, true);
+}
+
 static void __init __constructor test_bitops(void)
 {
     test_ffs();
     test_fls();
+
+    test_multiple_bits_set();
 }
diff --git a/xen/include/xen/bitops.h b/xen/include/xen/bitops.h
index b8bb2ffcd337..74c0d04e6647 100644
--- a/xen/include/xen/bitops.h
+++ b/xen/include/xen/bitops.h
@@ -274,6 +274,16 @@ static always_inline __pure unsigned int fls64(uint64_t x)
     }
 }
 
+/*
+ * Calculate if a value has two or more bits set.  Always use this in
+ * preference to an expression of the form 'hweight(x) > 1'.
+ */
+#define multiple_bits_set(x)                    \
+    ({                                          \
+        typeof(x) _v = (x);                     \
+        (_v & (_v - 1)) != 0;                   \
+    })
+
 /* --------------------- Please tidy below here --------------------- */
 
 #ifndef find_next_bit
diff --git a/xen/include/xen/self-tests.h b/xen/include/xen/self-tests.h
index 58484fe5a8ae..e5a6ba748aba 100644
--- a/xen/include/xen/self-tests.h
+++ b/xen/include/xen/self-tests.h
@@ -15,11 +15,14 @@
  *
  * Clang < 8 can't fold constants through static inlines, causing this to
  * fail.  Simply skip it for incredibly old compilers.
+ *
+ * N.B. fn is intentionally not bracketed to allow us to test function-like
+ * macros too.
  */
 #if !defined(CONFIG_CC_IS_CLANG) || CONFIG_CLANG_VERSION >= 80000
 #define COMPILE_CHECK(fn, val, res)                                     \
     do {                                                                \
-        typeof((fn)(val)) real = (fn)(val);                             \
+        typeof(fn(val)) real = fn(val);                                 \
                                                                         \
         if ( !__builtin_constant_p(real) )                              \
             asm ( ".error \"'" STR(fn(val)) "' not compile-time constant\"" ); \
@@ -34,10 +37,13 @@
  * Check that Xen's runtime logic for fn(val) gives the expected answer.  This
  * requires using HIDE() to prevent the optimiser from collapsing the logic
  * into a constant.
+ *
+ * N.B. fn is intentionally not bracketed to allow us to test function-like
+ * macros too.
  */
 #define RUNTIME_CHECK(fn, val, res)                     \
     do {                                                \
-        typeof((fn)(val)) real = (fn)(HIDE(val));       \
+        typeof(fn(val)) real = fn(HIDE(val));           \
                                                         \
         if ( real != (res) )                            \
             panic("%s: %s(%s) expected %u, got %u\n",   \
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 3/9] xen/bitops: Convert 'hweight(x) > 1' to new multiple_bits_set()
  2024-08-22 23:06 [PATCH 0/9] xen/bitops: hweight() cleanup/improvements Andrew Cooper
  2024-08-22 23:06 ` [PATCH 1/9] xen/bitops: Reinstate the please tidy message Andrew Cooper
  2024-08-22 23:06 ` [PATCH 2/9] xen/bitops: Introduce a multiple_bits_set() helper Andrew Cooper
@ 2024-08-22 23:06 ` Andrew Cooper
  2024-08-26 10:37   ` Jan Beulich
  2024-08-22 23:06 ` [PATCH 4/9] xen/bitops: Drop the remnants of hweight{8,16}() Andrew Cooper
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2024-08-22 23:06 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Wei Liu,
	Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Bertrand Marquis, Michal Orzel, Oleksii Kurochko, Shawn Anastasio

Using hweight() is an especially expensive way of determining simply if
multiple bits are set in a value.  Worse, 4 of the 10 hweight() calls in Xen
are of this form.

Switch to the new multiple_bits_set() helper.  This is far more efficient than
the longhand hweight() algorithm and, owing to its simplicity, likely more
efficient than even a dedicated instruction on a superscalar processor.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien@xen.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Bertrand Marquis <bertrand.marquis@arm.com>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Oleksii Kurochko <oleksii.kurochko@gmail.com>
CC: Shawn Anastasio <sanastasio@raptorengineering.com>

On x86, the code reduction speaks for itself:

  add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-240 (-240)
  Function                                     old     new   delta
  vlapic_ipi                                   722     650     -72
  numa_emulation                               577     497     -80
  do_xenpmu_op                                1665    1577     -88

That's an aweful lot of wasted calculation for the same answer.

I can't find a way of enabling CONFIG_NUMA on ARM (yet?) so right now there's
no change in any other architecture.  However, a synthetic helper shows the
following on ARM32:

  add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-128 (-128)
  Function                                     old     new   delta
  test_mbs                                     176      48    -128

and on ARM64:

  add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-44 (-44)
  Function                                     old     new   delta
  test_mbs                                      60      16     -44

PPC64 has POPCNTD in a default build of Xen and by chance both algorithms
compile to the same number of instructions.

RISC-V doesn't have hweight() wired up yet at all.
---
 xen/arch/x86/cpu/vpmu.c   |  2 +-
 xen/arch/x86/hvm/vlapic.c | 10 ++++++----
 xen/common/numa.c         |  2 +-
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/xen/arch/x86/cpu/vpmu.c b/xen/arch/x86/cpu/vpmu.c
index b2ba9994129b..a5bb1689c7d5 100644
--- a/xen/arch/x86/cpu/vpmu.c
+++ b/xen/arch/x86/cpu/vpmu.c
@@ -673,7 +673,7 @@ long do_xenpmu_op(
     {
         if ( (pmu_params.val &
               ~(XENPMU_MODE_SELF | XENPMU_MODE_HV | XENPMU_MODE_ALL)) ||
-             (hweight64(pmu_params.val) > 1) )
+             multiple_bits_set(pmu_params.val) )
             return -EINVAL;
 
         /* 32-bit dom0 can only sample itself. */
diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index 2ec95942713e..4a3e21a65f9d 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -467,12 +467,14 @@ static bool is_multicast_dest(struct vlapic *vlapic, unsigned int short_hand,
         return short_hand != APIC_DEST_SELF;
 
     if ( vlapic_x2apic_mode(vlapic) )
-        return dest_mode ? hweight16(dest) > 1 : dest == 0xffffffffU;
+        return dest_mode ? multiple_bits_set((uint16_t)dest)
+                         : dest == 0xffffffffU;
 
     if ( dest_mode )
-        return hweight8(dest &
-                        GET_xAPIC_DEST_FIELD(vlapic_get_reg(vlapic,
-                                                            APIC_DFR))) > 1;
+    {
+        dest &= GET_xAPIC_DEST_FIELD(vlapic_get_reg(vlapic, APIC_DFR));
+        return multiple_bits_set((uint8_t)dest);
+    }
 
     return dest == 0xff;
 }
diff --git a/xen/common/numa.c b/xen/common/numa.c
index 28a09766fabc..ce3991929ce5 100644
--- a/xen/common/numa.c
+++ b/xen/common/numa.c
@@ -546,7 +546,7 @@ static int __init numa_emulation(unsigned long start_pfn,
     uint64_t sz = pfn_to_paddr(end_pfn - start_pfn) / numa_fake;
 
     /* Kludge needed for the hash function */
-    if ( hweight64(sz) > 1 )
+    if ( multiple_bits_set(sz) )
     {
         uint64_t x = 1;
 
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 4/9] xen/bitops: Drop the remnants of hweight{8,16}()
  2024-08-22 23:06 [PATCH 0/9] xen/bitops: hweight() cleanup/improvements Andrew Cooper
                   ` (2 preceding siblings ...)
  2024-08-22 23:06 ` [PATCH 3/9] xen/bitops: Convert 'hweight(x) > 1' to new multiple_bits_set() Andrew Cooper
@ 2024-08-22 23:06 ` Andrew Cooper
  2024-08-26 10:39   ` Jan Beulich
  2024-08-22 23:06 ` [PATCH 5/9] xen/bitops: Introduce generic_hweightl() and hweightl() Andrew Cooper
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2024-08-22 23:06 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Wei Liu,
	Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Bertrand Marquis, Michal Orzel, Oleksii Kurochko, Shawn Anastasio

They are no more, and won't be returning in this form.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien@xen.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Bertrand Marquis <bertrand.marquis@arm.com>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Oleksii Kurochko <oleksii.kurochko@gmail.com>
CC: Shawn Anastasio <sanastasio@raptorengineering.com>
---
 xen/arch/arm/include/asm/bitops.h |  2 --
 xen/arch/ppc/include/asm/bitops.h |  2 --
 xen/arch/x86/include/asm/bitops.h |  2 --
 xen/include/xen/bitops.h          | 17 -----------------
 4 files changed, 23 deletions(-)

diff --git a/xen/arch/arm/include/asm/bitops.h b/xen/arch/arm/include/asm/bitops.h
index 3c023103f734..91cd167b6bbb 100644
--- a/xen/arch/arm/include/asm/bitops.h
+++ b/xen/arch/arm/include/asm/bitops.h
@@ -86,8 +86,6 @@ bool clear_mask16_timeout(uint16_t mask, volatile void *p,
  */
 #define hweight64(x) generic_hweight64(x)
 #define hweight32(x) generic_hweight32(x)
-#define hweight16(x) generic_hweight16(x)
-#define hweight8(x) generic_hweight8(x)
 
 #endif /* _ARM_BITOPS_H */
 /*
diff --git a/xen/arch/ppc/include/asm/bitops.h b/xen/arch/ppc/include/asm/bitops.h
index eb3355812ea3..a62c4f99c3bb 100644
--- a/xen/arch/ppc/include/asm/bitops.h
+++ b/xen/arch/ppc/include/asm/bitops.h
@@ -132,7 +132,5 @@ static inline int test_and_set_bit(unsigned int nr, volatile void *addr)
  */
 #define hweight64(x) __builtin_popcountll(x)
 #define hweight32(x) __builtin_popcount(x)
-#define hweight16(x) __builtin_popcount((uint16_t)(x))
-#define hweight8(x)  __builtin_popcount((uint8_t)(x))
 
 #endif /* _ASM_PPC_BITOPS_H */
diff --git a/xen/arch/x86/include/asm/bitops.h b/xen/arch/x86/include/asm/bitops.h
index 8c0403405aa2..4c5b21907a64 100644
--- a/xen/arch/x86/include/asm/bitops.h
+++ b/xen/arch/x86/include/asm/bitops.h
@@ -483,7 +483,5 @@ static always_inline unsigned int arch_flsl(unsigned long x)
  */
 #define hweight64(x) generic_hweight64(x)
 #define hweight32(x) generic_hweight32(x)
-#define hweight16(x) generic_hweight16(x)
-#define hweight8(x) generic_hweight8(x)
 
 #endif /* _X86_BITOPS_H */
diff --git a/xen/include/xen/bitops.h b/xen/include/xen/bitops.h
index 74c0d04e6647..64d70a7a1cb5 100644
--- a/xen/include/xen/bitops.h
+++ b/xen/include/xen/bitops.h
@@ -371,23 +371,6 @@ static inline unsigned int generic_hweight32(unsigned int w)
     return (w + (w >> 16)) & 0xff;
 }
 
-static inline unsigned int generic_hweight16(unsigned int w)
-{
-    w -= ((w >> 1) & 0x5555);
-    w =  (w & 0x3333) + ((w >> 2) & 0x3333);
-    w =  (w + (w >> 4)) & 0x0f0f;
-
-    return (w + (w >> 8)) & 0xff;
-}
-
-static inline unsigned int generic_hweight8(unsigned int w)
-{
-    w -= ((w >> 1) & 0x55);
-    w =  (w & 0x33) + ((w >> 2) & 0x33);
-
-    return (w + (w >> 4)) & 0x0f;
-}
-
 static inline unsigned int generic_hweight64(uint64_t w)
 {
     if ( BITS_PER_LONG < 64 )
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 5/9] xen/bitops: Introduce generic_hweightl() and hweightl()
  2024-08-22 23:06 [PATCH 0/9] xen/bitops: hweight() cleanup/improvements Andrew Cooper
                   ` (3 preceding siblings ...)
  2024-08-22 23:06 ` [PATCH 4/9] xen/bitops: Drop the remnants of hweight{8,16}() Andrew Cooper
@ 2024-08-22 23:06 ` Andrew Cooper
  2024-08-26 11:40   ` Jan Beulich
  2024-08-22 23:06 ` [PATCH 6/9] xen/bitops: Drop hweight_long() and use hweightl() Andrew Cooper
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2024-08-22 23:06 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Wei Liu,
	Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Bertrand Marquis, Michal Orzel, Oleksii Kurochko, Shawn Anastasio

There are 6 remaining callers in Xen:

  * The two hweight32() calls, _domain_struct_bits() and efi_find_gop_mode(),
    are __init only.
  * The two hweight_long() calls are both in bitmap_weight().
  * The two hweight64() calls are hv_vpset_nr_banks() and x86_emulate().

Only bitmap_weight() and possibly hv_vpset_nr_banks() can be considered
fast(ish) paths, and they're all of GPR-width form.

Furthermore, the differences between a generic int and generic long form is
only an ADD and SHIFT, and only in !CONFIG_HAS_FAST_MULTIPLY builds.

Therefore, it is definitely not worth having both generic implemenations.

Implement generic_hweightl() based on the current generic_hweight64(),
adjusted to be compatible with ARM32, along with standard SELF_TESTS.

Implement hweightl() with usual constant-folding and arch opt-in support.  PPC
is the only architecture that devates from generic, and it simply uses the
builtin.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien@xen.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Bertrand Marquis <bertrand.marquis@arm.com>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Oleksii Kurochko <oleksii.kurochko@gmail.com>
CC: Shawn Anastasio <sanastasio@raptorengineering.com>
---
 xen/arch/ppc/include/asm/bitops.h |  2 ++
 xen/common/bitops.c               | 14 ++++++++++
 xen/include/xen/bitops.h          | 18 ++++++++++++
 xen/lib/Makefile                  |  1 +
 xen/lib/generic-hweightl.c        | 46 +++++++++++++++++++++++++++++++
 5 files changed, 81 insertions(+)
 create mode 100644 xen/lib/generic-hweightl.c

diff --git a/xen/arch/ppc/include/asm/bitops.h b/xen/arch/ppc/include/asm/bitops.h
index a62c4f99c3bb..64512e949530 100644
--- a/xen/arch/ppc/include/asm/bitops.h
+++ b/xen/arch/ppc/include/asm/bitops.h
@@ -124,6 +124,8 @@ static inline int test_and_set_bit(unsigned int nr, volatile void *addr)
 #define arch_fls(x)  ((x) ? 32 - __builtin_clz(x) : 0)
 #define arch_flsl(x) ((x) ? BITS_PER_LONG - __builtin_clzl(x) : 0)
 
+#define arch_hweightl(x) __builtin_popcountl(x)
+
 /**
  * hweightN - returns the hamming weight of a N-bit word
  * @x: the word to weigh
diff --git a/xen/common/bitops.c b/xen/common/bitops.c
index 4545682aa8e0..d0c268b4994a 100644
--- a/xen/common/bitops.c
+++ b/xen/common/bitops.c
@@ -106,10 +106,24 @@ static void __init test_multiple_bits_set(void)
     CHECK(multiple_bits_set, 0xc000000000000000ULL, true);
 }
 
+static void __init test_hweight(void)
+{
+    /* unsigned int hweightl(unsigned long) */
+    CHECK(hweightl, 0, 0);
+    CHECK(hweightl, 1, 1);
+    CHECK(hweightl, 3, 2);
+    CHECK(hweightl, 7, 3);
+    CHECK(hweightl, 0xff, 8);
+
+    CHECK(hweightl, 1 | (1UL << (BITS_PER_LONG - 1)), 2);
+    CHECK(hweightl, -1UL, BITS_PER_LONG);
+}
+
 static void __init __constructor test_bitops(void)
 {
     test_ffs();
     test_fls();
 
     test_multiple_bits_set();
+    test_hweight();
 }
diff --git a/xen/include/xen/bitops.h b/xen/include/xen/bitops.h
index 64d70a7a1cb5..3aac10b7f532 100644
--- a/xen/include/xen/bitops.h
+++ b/xen/include/xen/bitops.h
@@ -35,6 +35,12 @@ extern void __bitop_bad_size(void);
 unsigned int __pure generic_ffsl(unsigned long x);
 unsigned int __pure generic_flsl(unsigned long x);
 
+/*
+ * Hamming Weight, also called Population Count.  Returns the number of set
+ * bits in @x.
+ */
+unsigned int __pure generic_hweightl(unsigned long x);
+
 /**
  * generic__test_and_set_bit - Set a bit and return its old value
  * @nr: Bit to set
@@ -284,6 +290,18 @@ static always_inline __pure unsigned int fls64(uint64_t x)
         (_v & (_v - 1)) != 0;                   \
     })
 
+static always_inline __pure unsigned int hweightl(unsigned long x)
+{
+    if ( __builtin_constant_p(x) )
+        return __builtin_popcountl(x);
+
+#ifdef arch_hweightl
+    return arch_hweightl(x);
+#else
+    return generic_hweightl(x);
+#endif
+}
+
 /* --------------------- Please tidy below here --------------------- */
 
 #ifndef find_next_bit
diff --git a/xen/lib/Makefile b/xen/lib/Makefile
index a48541596470..b6558e108bd9 100644
--- a/xen/lib/Makefile
+++ b/xen/lib/Makefile
@@ -6,6 +6,7 @@ lib-y += ctype.o
 lib-y += find-next-bit.o
 lib-y += generic-ffsl.o
 lib-y += generic-flsl.o
+lib-y += generic-hweightl.o
 lib-y += list-sort.o
 lib-y += memchr.o
 lib-y += memchr_inv.o
diff --git a/xen/lib/generic-hweightl.c b/xen/lib/generic-hweightl.c
new file mode 100644
index 000000000000..fa4bbec273ab
--- /dev/null
+++ b/xen/lib/generic-hweightl.c
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#include <xen/bitops.h>
+#include <xen/init.h>
+#include <xen/self-tests.h>
+
+/* Mask value @b broadcast to every byte in a long */
+#if BITS_PER_LONG == 32
+# define MASK(b) ((b) * 0x01010101UL)
+#elif BITS_PER_LONG == 64
+# define MASK(b) ((b) * 0x0101010101010101UL)
+#else
+# error Extend me please
+#endif
+
+unsigned int generic_hweightl(unsigned long x)
+{
+    x -= (x >> 1) & MASK(0x55);
+    x =  (x & MASK(0x33)) + ((x >> 2) & MASK(0x33));
+    x =  (x + (x >> 4)) & MASK(0x0f);
+
+    if ( IS_ENABLED(CONFIG_HAS_FAST_MULTIPLY) )
+        return (x * MASK(0x01)) >> (BITS_PER_LONG - 8);
+
+    x += x >> 8;
+    x += x >> 16;
+#if BITS_PER_LONG > 32
+    x += x >> 32;
+#endif
+
+    return x & 0xff;
+}
+
+#ifdef CONFIG_SELF_TESTS
+static void __init __constructor test_generic_hweightl(void)
+{
+    RUNTIME_CHECK(generic_hweightl, 0, 0);
+    RUNTIME_CHECK(generic_hweightl, 1, 1);
+    RUNTIME_CHECK(generic_hweightl, 3, 2);
+    RUNTIME_CHECK(generic_hweightl, 7, 3);
+    RUNTIME_CHECK(generic_hweightl, 0xff, 8);
+
+    RUNTIME_CHECK(generic_hweightl, 1 | (1UL << (BITS_PER_LONG - 1)), 2);
+    RUNTIME_CHECK(generic_hweightl, -1UL, BITS_PER_LONG);
+}
+#endif /* CONFIG_SELF_TESTS */
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 6/9] xen/bitops: Drop hweight_long() and use hweightl()
  2024-08-22 23:06 [PATCH 0/9] xen/bitops: hweight() cleanup/improvements Andrew Cooper
                   ` (4 preceding siblings ...)
  2024-08-22 23:06 ` [PATCH 5/9] xen/bitops: Introduce generic_hweightl() and hweightl() Andrew Cooper
@ 2024-08-22 23:06 ` Andrew Cooper
  2024-08-26 11:51   ` Jan Beulich
  2024-08-22 23:06 ` [PATCH 7/9] xen/bitops: Implement hweight64() in terms of hweightl() Andrew Cooper
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2024-08-22 23:06 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Wei Liu,
	Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Bertrand Marquis, Michal Orzel, Oleksii Kurochko, Shawn Anastasio

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien@xen.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Bertrand Marquis <bertrand.marquis@arm.com>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Oleksii Kurochko <oleksii.kurochko@gmail.com>
CC: Shawn Anastasio <sanastasio@raptorengineering.com>
---
 xen/common/bitmap.c      | 4 ++--
 xen/include/xen/bitops.h | 5 -----
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/xen/common/bitmap.c b/xen/common/bitmap.c
index 9d9ff09f3dde..3da63a32a680 100644
--- a/xen/common/bitmap.c
+++ b/xen/common/bitmap.c
@@ -191,10 +191,10 @@ unsigned int __bitmap_weight(const unsigned long *bitmap, unsigned int bits)
 	unsigned int k, w = 0, lim = bits / BITS_PER_LONG;
 
 	for (k = 0; k < lim; k++)
-		w += hweight_long(bitmap[k]);
+		w += hweightl(bitmap[k]);
 
 	if (bits % BITS_PER_LONG)
-		w += hweight_long(bitmap[k] & BITMAP_LAST_WORD_MASK(bits));
+		w += hweightl(bitmap[k] & BITMAP_LAST_WORD_MASK(bits));
 
 	return w;
 }
diff --git a/xen/include/xen/bitops.h b/xen/include/xen/bitops.h
index 3aac10b7f532..11a1c9130722 100644
--- a/xen/include/xen/bitops.h
+++ b/xen/include/xen/bitops.h
@@ -407,11 +407,6 @@ static inline unsigned int generic_hweight64(uint64_t w)
     return (w + (w >> 32)) & 0xFF;
 }
 
-static inline unsigned int hweight_long(unsigned long w)
-{
-    return sizeof(w) == 4 ? generic_hweight32(w) : generic_hweight64(w);
-}
-
 /*
  * rol32 - rotate a 32-bit value left
  *
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 7/9] xen/bitops: Implement hweight64() in terms of hweightl()
  2024-08-22 23:06 [PATCH 0/9] xen/bitops: hweight() cleanup/improvements Andrew Cooper
                   ` (5 preceding siblings ...)
  2024-08-22 23:06 ` [PATCH 6/9] xen/bitops: Drop hweight_long() and use hweightl() Andrew Cooper
@ 2024-08-22 23:06 ` Andrew Cooper
  2024-08-26 11:55   ` Jan Beulich
  2024-08-22 23:06 ` [PATCH 8/9] xen/bitops: Implement hweight32() " Andrew Cooper
  2024-08-22 23:06 ` [PATCH 9/9] x86/bitops: Use the POPCNT instruction when available Andrew Cooper
  8 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2024-08-22 23:06 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Wei Liu,
	Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Bertrand Marquis, Michal Orzel, Oleksii Kurochko, Shawn Anastasio

... and drop generic_hweight64().

This is identical on all architectures except ARM32.  Add one extra SELF_TEST
to check that hweight64() works when the input is split in half.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien@xen.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Bertrand Marquis <bertrand.marquis@arm.com>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Oleksii Kurochko <oleksii.kurochko@gmail.com>
CC: Shawn Anastasio <sanastasio@raptorengineering.com>
---
 xen/arch/arm/include/asm/bitops.h |  1 -
 xen/arch/ppc/include/asm/bitops.h |  1 -
 xen/arch/x86/include/asm/bitops.h |  1 -
 xen/common/bitops.c               |  3 +++
 xen/include/xen/bitops.h          | 26 ++++++++------------------
 5 files changed, 11 insertions(+), 21 deletions(-)

diff --git a/xen/arch/arm/include/asm/bitops.h b/xen/arch/arm/include/asm/bitops.h
index 91cd167b6bbb..bed6b3b98e08 100644
--- a/xen/arch/arm/include/asm/bitops.h
+++ b/xen/arch/arm/include/asm/bitops.h
@@ -84,7 +84,6 @@ bool clear_mask16_timeout(uint16_t mask, volatile void *p,
  *
  * The Hamming Weight of a number is the total number of bits set in it.
  */
-#define hweight64(x) generic_hweight64(x)
 #define hweight32(x) generic_hweight32(x)
 
 #endif /* _ARM_BITOPS_H */
diff --git a/xen/arch/ppc/include/asm/bitops.h b/xen/arch/ppc/include/asm/bitops.h
index 64512e949530..24dc35ef644d 100644
--- a/xen/arch/ppc/include/asm/bitops.h
+++ b/xen/arch/ppc/include/asm/bitops.h
@@ -132,7 +132,6 @@ static inline int test_and_set_bit(unsigned int nr, volatile void *addr)
  *
  * The Hamming Weight of a number is the total number of bits set in it.
  */
-#define hweight64(x) __builtin_popcountll(x)
 #define hweight32(x) __builtin_popcount(x)
 
 #endif /* _ASM_PPC_BITOPS_H */
diff --git a/xen/arch/x86/include/asm/bitops.h b/xen/arch/x86/include/asm/bitops.h
index 4c5b21907a64..9d3a2448036e 100644
--- a/xen/arch/x86/include/asm/bitops.h
+++ b/xen/arch/x86/include/asm/bitops.h
@@ -481,7 +481,6 @@ static always_inline unsigned int arch_flsl(unsigned long x)
  *
  * The Hamming Weight of a number is the total number of bits set in it.
  */
-#define hweight64(x) generic_hweight64(x)
 #define hweight32(x) generic_hweight32(x)
 
 #endif /* _X86_BITOPS_H */
diff --git a/xen/common/bitops.c b/xen/common/bitops.c
index d0c268b4994a..f6a3eb5c9daf 100644
--- a/xen/common/bitops.c
+++ b/xen/common/bitops.c
@@ -117,6 +117,9 @@ static void __init test_hweight(void)
 
     CHECK(hweightl, 1 | (1UL << (BITS_PER_LONG - 1)), 2);
     CHECK(hweightl, -1UL, BITS_PER_LONG);
+
+    /* unsigned int hweight64(uint64_t) */
+    CHECK(hweight64, -1ULL, 64);
 }
 
 static void __init __constructor test_bitops(void)
diff --git a/xen/include/xen/bitops.h b/xen/include/xen/bitops.h
index 11a1c9130722..e97516552a2e 100644
--- a/xen/include/xen/bitops.h
+++ b/xen/include/xen/bitops.h
@@ -302,6 +302,14 @@ static always_inline __pure unsigned int hweightl(unsigned long x)
 #endif
 }
 
+static always_inline __pure unsigned int hweight64(uint64_t x)
+{
+    if ( BITS_PER_LONG == 64 )
+        return hweightl(x);
+    else
+        return hweightl(x >> 32) + hweightl(x);
+}
+
 /* --------------------- Please tidy below here --------------------- */
 
 #ifndef find_next_bit
@@ -389,24 +397,6 @@ static inline unsigned int generic_hweight32(unsigned int w)
     return (w + (w >> 16)) & 0xff;
 }
 
-static inline unsigned int generic_hweight64(uint64_t w)
-{
-    if ( BITS_PER_LONG < 64 )
-        return generic_hweight32(w >> 32) + generic_hweight32(w);
-
-    w -= (w >> 1) & 0x5555555555555555UL;
-    w =  (w & 0x3333333333333333UL) + ((w >> 2) & 0x3333333333333333UL);
-    w =  (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0fUL;
-
-    if ( IS_ENABLED(CONFIG_HAS_FAST_MULTIPLY) )
-        return (w * 0x0101010101010101UL) >> 56;
-
-    w += w >> 8;
-    w += w >> 16;
-
-    return (w + (w >> 32)) & 0xFF;
-}
-
 /*
  * rol32 - rotate a 32-bit value left
  *
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 8/9] xen/bitops: Implement hweight32() in terms of hweightl()
  2024-08-22 23:06 [PATCH 0/9] xen/bitops: hweight() cleanup/improvements Andrew Cooper
                   ` (6 preceding siblings ...)
  2024-08-22 23:06 ` [PATCH 7/9] xen/bitops: Implement hweight64() in terms of hweightl() Andrew Cooper
@ 2024-08-22 23:06 ` Andrew Cooper
  2024-08-26 11:59   ` Jan Beulich
  2024-08-22 23:06 ` [PATCH 9/9] x86/bitops: Use the POPCNT instruction when available Andrew Cooper
  8 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2024-08-22 23:06 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Wei Liu,
	Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Bertrand Marquis, Michal Orzel, Oleksii Kurochko, Shawn Anastasio

... and drop generic_hweight32().

As noted previously, the only two users of hweight32() and they're both
singleton callers in __init paths, so it's not interesting to have a sub-GPR
optimised generic.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien@xen.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Bertrand Marquis <bertrand.marquis@arm.com>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Oleksii Kurochko <oleksii.kurochko@gmail.com>
CC: Shawn Anastasio <sanastasio@raptorengineering.com>
---
 xen/arch/arm/include/asm/bitops.h |  8 --------
 xen/arch/ppc/include/asm/bitops.h |  8 --------
 xen/arch/x86/include/asm/bitops.h |  8 --------
 xen/include/xen/bitops.h          | 24 +++++-------------------
 4 files changed, 5 insertions(+), 43 deletions(-)

diff --git a/xen/arch/arm/include/asm/bitops.h b/xen/arch/arm/include/asm/bitops.h
index bed6b3b98e08..f163d9bb4578 100644
--- a/xen/arch/arm/include/asm/bitops.h
+++ b/xen/arch/arm/include/asm/bitops.h
@@ -78,14 +78,6 @@ bool clear_mask16_timeout(uint16_t mask, volatile void *p,
 #define arch_fls(x)  ((x) ? 32 - __builtin_clz(x) : 0)
 #define arch_flsl(x) ((x) ? BITS_PER_LONG - __builtin_clzl(x) : 0)
 
-/**
- * hweightN - returns the hamming weight of a N-bit word
- * @x: the word to weigh
- *
- * The Hamming Weight of a number is the total number of bits set in it.
- */
-#define hweight32(x) generic_hweight32(x)
-
 #endif /* _ARM_BITOPS_H */
 /*
  * Local variables:
diff --git a/xen/arch/ppc/include/asm/bitops.h b/xen/arch/ppc/include/asm/bitops.h
index 24dc35ef644d..c942e9432e20 100644
--- a/xen/arch/ppc/include/asm/bitops.h
+++ b/xen/arch/ppc/include/asm/bitops.h
@@ -126,12 +126,4 @@ static inline int test_and_set_bit(unsigned int nr, volatile void *addr)
 
 #define arch_hweightl(x) __builtin_popcountl(x)
 
-/**
- * hweightN - returns the hamming weight of a N-bit word
- * @x: the word to weigh
- *
- * The Hamming Weight of a number is the total number of bits set in it.
- */
-#define hweight32(x) __builtin_popcount(x)
-
 #endif /* _ASM_PPC_BITOPS_H */
diff --git a/xen/arch/x86/include/asm/bitops.h b/xen/arch/x86/include/asm/bitops.h
index 9d3a2448036e..642d8e58b288 100644
--- a/xen/arch/x86/include/asm/bitops.h
+++ b/xen/arch/x86/include/asm/bitops.h
@@ -475,12 +475,4 @@ static always_inline unsigned int arch_flsl(unsigned long x)
 }
 #define arch_flsl arch_flsl
 
-/**
- * hweightN - returns the hamming weight of a N-bit word
- * @x: the word to weigh
- *
- * The Hamming Weight of a number is the total number of bits set in it.
- */
-#define hweight32(x) generic_hweight32(x)
-
 #endif /* _X86_BITOPS_H */
diff --git a/xen/include/xen/bitops.h b/xen/include/xen/bitops.h
index e97516552a2e..bad2601b0fe6 100644
--- a/xen/include/xen/bitops.h
+++ b/xen/include/xen/bitops.h
@@ -302,6 +302,11 @@ static always_inline __pure unsigned int hweightl(unsigned long x)
 #endif
 }
 
+static always_inline __pure unsigned int hweight32(uint32_t x)
+{
+    return hweightl(x);
+}
+
 static always_inline __pure unsigned int hweight64(uint64_t x)
 {
     if ( BITS_PER_LONG == 64 )
@@ -378,25 +383,6 @@ static inline int get_count_order(unsigned int count)
     return order;
 }
 
-/*
- * hweightN: returns the hamming weight (i.e. the number
- * of bits set) of a N-bit word
- */
-
-static inline unsigned int generic_hweight32(unsigned int w)
-{
-    w -= (w >> 1) & 0x55555555;
-    w =  (w & 0x33333333) + ((w >> 2) & 0x33333333);
-    w =  (w + (w >> 4)) & 0x0f0f0f0f;
-
-    if ( IS_ENABLED(CONFIG_HAS_FAST_MULTIPLY) )
-        return (w * 0x01010101) >> 24;
-
-    w += w >> 8;
-
-    return (w + (w >> 16)) & 0xff;
-}
-
 /*
  * rol32 - rotate a 32-bit value left
  *
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 9/9] x86/bitops: Use the POPCNT instruction when available
  2024-08-22 23:06 [PATCH 0/9] xen/bitops: hweight() cleanup/improvements Andrew Cooper
                   ` (7 preceding siblings ...)
  2024-08-22 23:06 ` [PATCH 8/9] xen/bitops: Implement hweight32() " Andrew Cooper
@ 2024-08-22 23:06 ` Andrew Cooper
  2024-08-23 15:35   ` Nicola Vetrini
  2024-08-26 13:07   ` Jan Beulich
  8 siblings, 2 replies; 40+ messages in thread
From: Andrew Cooper @ 2024-08-22 23:06 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné

It has existed in x86 CPUs since 2008, so we're only 16 years late adding
support.  With all the other scafolding in place, implement arch_hweightl()
for x86.

The only complication is that the call to arch_generic_hweightl() is behind
the compilers back.  Address this by writing it in ASM and ensure that it
preserves all registers.

Copy the code generation from generic_hweightl().  It's not a complicated
algorithm, and is easy to regenerate if needs be, but cover it with the same
unit tests as test_generic_hweightl() just for piece of mind.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>

A few RFC points.

 * I throught we had an x86 general lib-y but I can't find one, hence why it's
   still in xen/lib/ for now.

 * When we up the minimum toolchain to GCC 7 / Clang 5, we can use a
   __attribute__((no_caller_saved_registers)) and can forgo writing this in asm.

   GCC seems to need extra help, and wants -mgeneral-regs-only too.  It has a
   habit of complaining about incompatible instructions even when it's not
   emitting them.
---
 xen/arch/x86/include/asm/bitops.h | 21 ++++++++++++++
 xen/lib/Makefile                  |  1 +
 xen/lib/arch-generic-hweightl.S   | 46 +++++++++++++++++++++++++++++++
 xen/lib/generic-hweightl.c        | 15 ++++++++++
 4 files changed, 83 insertions(+)
 create mode 100644 xen/lib/arch-generic-hweightl.S

diff --git a/xen/arch/x86/include/asm/bitops.h b/xen/arch/x86/include/asm/bitops.h
index 642d8e58b288..0db698ed3f4c 100644
--- a/xen/arch/x86/include/asm/bitops.h
+++ b/xen/arch/x86/include/asm/bitops.h
@@ -6,6 +6,7 @@
  */
 
 #include <asm/alternative.h>
+#include <asm/asm_defns.h>
 #include <asm/cpufeatureset.h>
 
 /*
@@ -475,4 +476,24 @@ static always_inline unsigned int arch_flsl(unsigned long x)
 }
 #define arch_flsl arch_flsl
 
+static always_inline unsigned int arch_hweightl(unsigned long x)
+{
+    unsigned int r;
+
+    /*
+     * arch_generic_hweightl() is written in ASM in order to preserve all
+     * registers, as the compiler can't see the call.
+     *
+     * This limits the POPCNT instruction to using the same ABI as a function
+     * call (input in %rdi, output in %eax) but that's fine.
+     */
+    alternative_io("call arch_generic_hweightl",
+                   "popcnt %[val], %q[res]", X86_FEATURE_POPCNT,
+                   ASM_OUTPUT2([res] "=a" (r) ASM_CALL_CONSTRAINT),
+                   [val] "D" (x));
+
+    return r;
+}
+#define arch_hweightl arch_hweightl
+
 #endif /* _X86_BITOPS_H */
diff --git a/xen/lib/Makefile b/xen/lib/Makefile
index b6558e108bd9..84d731dc0ac8 100644
--- a/xen/lib/Makefile
+++ b/xen/lib/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_X86) += x86/
 
+lib-$(CONFIG_X86) += arch-generic-hweightl.o
 lib-y += bsearch.o
 lib-y += ctors.o
 lib-y += ctype.o
diff --git a/xen/lib/arch-generic-hweightl.S b/xen/lib/arch-generic-hweightl.S
new file mode 100644
index 000000000000..15c6e3394845
--- /dev/null
+++ b/xen/lib/arch-generic-hweightl.S
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+        .file __FILE__
+
+#include <xen/linkage.h>
+
+/*
+ * An implementation of generic_hweightl() used on hardware without the POPCNT
+ * instruction.
+ *
+ * This function is called from within an ALTERNATIVE in arch_hweightl().
+ * i.e. behind the back of the compiler.  Therefore all registers are callee
+ * preserved.
+ *
+ * The ASM is what GCC-12 emits for generic_hweightl() in a release build of
+ * Xen, with spilling of %rdi/%rdx to preserve the callers registers.
+ */
+FUNC(arch_generic_hweightl)
+        push   %rdi
+        push   %rdx
+
+        movabs $0x5555555555555555, %rdx
+        mov    %rdi, %rax
+        shr    $1, %rax
+        and    %rdx, %rax
+        sub    %rax, %rdi
+        movabs $0x3333333333333333, %rax
+        mov    %rdi, %rdx
+        shr    $0x2, %rdi
+        and    %rax, %rdx
+        and    %rax, %rdi
+        add    %rdi, %rdx
+        mov    %rdx, %rax
+        shr    $0x4, %rax
+        add    %rdx, %rax
+        movabs $0xf0f0f0f0f0f0f0f, %rdx
+        and    %rdx, %rax
+        movabs $0x101010101010101, %rdx
+        imul   %rdx, %rax
+        shr    $0x38, %rax
+
+        pop    %rdx
+        pop    %rdi
+
+        ret
+END(arch_generic_hweightl)
diff --git a/xen/lib/generic-hweightl.c b/xen/lib/generic-hweightl.c
index fa4bbec273ab..4b39dd84de5e 100644
--- a/xen/lib/generic-hweightl.c
+++ b/xen/lib/generic-hweightl.c
@@ -43,4 +43,19 @@ static void __init __constructor test_generic_hweightl(void)
     RUNTIME_CHECK(generic_hweightl, 1 | (1UL << (BITS_PER_LONG - 1)), 2);
     RUNTIME_CHECK(generic_hweightl, -1UL, BITS_PER_LONG);
 }
+
+#ifdef CONFIG_X86
+unsigned int arch_generic_hweightl(unsigned long);
+static void __init __constructor test_arch_generic_hweightl(void)
+{
+    RUNTIME_CHECK(arch_generic_hweightl, 0, 0);
+    RUNTIME_CHECK(arch_generic_hweightl, 1, 1);
+    RUNTIME_CHECK(arch_generic_hweightl, 3, 2);
+    RUNTIME_CHECK(arch_generic_hweightl, 7, 3);
+    RUNTIME_CHECK(arch_generic_hweightl, 0xff, 8);
+
+    RUNTIME_CHECK(arch_generic_hweightl, 1 | (1UL << (BITS_PER_LONG - 1)), 2);
+    RUNTIME_CHECK(arch_generic_hweightl, -1UL, BITS_PER_LONG);
+}
+#endif /* CONFIG_X86 */
 #endif /* CONFIG_SELF_TESTS */
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 9/9] x86/bitops: Use the POPCNT instruction when available
  2024-08-22 23:06 ` [PATCH 9/9] x86/bitops: Use the POPCNT instruction when available Andrew Cooper
@ 2024-08-23 15:35   ` Nicola Vetrini
  2024-08-23 15:38     ` Andrew Cooper
  2024-08-26 13:07   ` Jan Beulich
  1 sibling, 1 reply; 40+ messages in thread
From: Nicola Vetrini @ 2024-08-23 15:35 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel, Jan Beulich, Roger Pau Monné

On 2024-08-23 01:06, Andrew Cooper wrote:
> It has existed in x86 CPUs since 2008, so we're only 16 years late 
> adding
> support.  With all the other scafolding in place, implement 
> arch_hweightl()
> for x86.
> 
> The only complication is that the call to arch_generic_hweightl() is 
> behind
> the compilers back.  Address this by writing it in ASM and ensure that 
> it
> preserves all registers.
> 
> Copy the code generation from generic_hweightl().  It's not a 
> complicated
> algorithm, and is easy to regenerate if needs be, but cover it with the 
> same
> unit tests as test_generic_hweightl() just for piece of mind.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> 
> A few RFC points.
> 
>  * I throught we had an x86 general lib-y but I can't find one, hence 
> why it's
>    still in xen/lib/ for now.
> 
>  * When we up the minimum toolchain to GCC 7 / Clang 5, we can use a
>    __attribute__((no_caller_saved_registers)) and can forgo writing 
> this in asm.
> 
>    GCC seems to need extra help, and wants -mgeneral-regs-only too.  It 
> has a
>    habit of complaining about incompatible instructions even when it's 
> not
>    emitting them.
> ---

> diff --git a/xen/lib/arch-generic-hweightl.S 
> b/xen/lib/arch-generic-hweightl.S
> new file mode 100644
> index 000000000000..15c6e3394845
> --- /dev/null
> +++ b/xen/lib/arch-generic-hweightl.S
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +        .file __FILE__
> +
> +#include <xen/linkage.h>
> +
> +/*
> + * An implementation of generic_hweightl() used on hardware without 
> the POPCNT
> + * instruction.
> + *
> + * This function is called from within an ALTERNATIVE in 
> arch_hweightl().
> + * i.e. behind the back of the compiler.  Therefore all registers are 
> callee
> + * preserved.
> + *
> + * The ASM is what GCC-12 emits for generic_hweightl() in a release 
> build of
> + * Xen, with spilling of %rdi/%rdx to preserve the callers registers.
> + */
> +FUNC(arch_generic_hweightl)
> +        push   %rdi
> +        push   %rdx
> +
> +        movabs $0x5555555555555555, %rdx
> +        mov    %rdi, %rax
> +        shr    $1, %rax
> +        and    %rdx, %rax
> +        sub    %rax, %rdi
> +        movabs $0x3333333333333333, %rax
> +        mov    %rdi, %rdx
> +        shr    $0x2, %rdi
> +        and    %rax, %rdx
> +        and    %rax, %rdi
> +        add    %rdi, %rdx
> +        mov    %rdx, %rax
> +        shr    $0x4, %rax
> +        add    %rdx, %rax
> +        movabs $0xf0f0f0f0f0f0f0f, %rdx
> +        and    %rdx, %rax
> +        movabs $0x101010101010101, %rdx
> +        imul   %rdx, %rax
> +        shr    $0x38, %rax
> +
> +        pop    %rdx
> +        pop    %rdi
> +
> +        ret
> +END(arch_generic_hweightl)
> diff --git a/xen/lib/generic-hweightl.c b/xen/lib/generic-hweightl.c
> index fa4bbec273ab..4b39dd84de5e 100644
> --- a/xen/lib/generic-hweightl.c
> +++ b/xen/lib/generic-hweightl.c
> @@ -43,4 +43,19 @@ static void __init __constructor 
> test_generic_hweightl(void)
>      RUNTIME_CHECK(generic_hweightl, 1 | (1UL << (BITS_PER_LONG - 1)), 
> 2);
>      RUNTIME_CHECK(generic_hweightl, -1UL, BITS_PER_LONG);
>  }
> +
> +#ifdef CONFIG_X86
> +unsigned int arch_generic_hweightl(unsigned long);

Hi Andrew,

do you mind putting a parameter name here, as the current form 
introduces a violation of MISRA Rule 8.2 [1] (even if unnecessary, given 
its implementation)?

Thanks,
  Nicola

[1] 
https://saas.eclairit.com:3787/fs/var/local/eclair/xen-project.ecdf/xen-project/patchew/xen/ECLAIR_normal/patchew/20240822230635.954557-1-andrew.cooper3@citrix.com/X86_64/7647484702/PROJECT.ecd;/by_service/MC3R1.R8.2.html

> +static void __init __constructor test_arch_generic_hweightl(void)
> +{
> +    RUNTIME_CHECK(arch_generic_hweightl, 0, 0);
> +    RUNTIME_CHECK(arch_generic_hweightl, 1, 1);
> +    RUNTIME_CHECK(arch_generic_hweightl, 3, 2);
> +    RUNTIME_CHECK(arch_generic_hweightl, 7, 3);
> +    RUNTIME_CHECK(arch_generic_hweightl, 0xff, 8);
> +
> +    RUNTIME_CHECK(arch_generic_hweightl, 1 | (1UL << (BITS_PER_LONG - 
> 1)), 2);
> +    RUNTIME_CHECK(arch_generic_hweightl, -1UL, BITS_PER_LONG);
> +}
> +#endif /* CONFIG_X86 */
>  #endif /* CONFIG_SELF_TESTS */

-- 
Nicola Vetrini, BSc
Software Engineer, BUGSENG srl (https://bugseng.com)


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 9/9] x86/bitops: Use the POPCNT instruction when available
  2024-08-23 15:35   ` Nicola Vetrini
@ 2024-08-23 15:38     ` Andrew Cooper
  0 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2024-08-23 15:38 UTC (permalink / raw)
  To: Nicola Vetrini; +Cc: Xen-devel, Jan Beulich, Roger Pau Monné

On 23/08/2024 4:35 pm, Nicola Vetrini wrote:
> On 2024-08-23 01:06, Andrew Cooper wrote:
>> diff --git a/xen/lib/generic-hweightl.c b/xen/lib/generic-hweightl.c
>> index fa4bbec273ab..4b39dd84de5e 100644
>> --- a/xen/lib/generic-hweightl.c
>> +++ b/xen/lib/generic-hweightl.c
>> @@ -43,4 +43,19 @@ static void __init __constructor
>> test_generic_hweightl(void)
>>      RUNTIME_CHECK(generic_hweightl, 1 | (1UL << (BITS_PER_LONG -
>> 1)), 2);
>>      RUNTIME_CHECK(generic_hweightl, -1UL, BITS_PER_LONG);
>>  }
>> +
>> +#ifdef CONFIG_X86
>> +unsigned int arch_generic_hweightl(unsigned long);
>
> Hi Andrew,
>
> do you mind putting a parameter name here, as the current form
> introduces a violation of MISRA Rule 8.2 [1] (even if unnecessary,
> given its implementation)?

Sorry.

I did run Eclair on this series during development, but this was a late
change and I forgot to re-check.

Fixed locally.

~Andrew


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/9] xen/bitops: Introduce a multiple_bits_set() helper
  2024-08-22 23:06 ` [PATCH 2/9] xen/bitops: Introduce a multiple_bits_set() helper Andrew Cooper
@ 2024-08-26 10:30   ` Jan Beulich
  2024-08-27 12:01     ` Andrew Cooper
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2024-08-26 10:30 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné, Wei Liu, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Bertrand Marquis, Michal Orzel,
	Oleksii Kurochko, Shawn Anastasio, Xen-devel

On 23.08.2024 01:06, Andrew Cooper wrote:
> This will be used to simplify real logic in the following patch.  Add compile
> and boot time testing as with other bitops.
> 
> Because the expression is so simple, implement it as a function-like macro
> which is generic on the type of it's argument, rather than having multiple
> variants.
> 
> Testing function-like macros needs a minor adjustments to the infrastructure
> in xen/self-tests.h to avoid bracketing the fn parameter.  The utility of this
> outweighs the associated risks.

We may need to mark these two places as deviated.

> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>
with one remark/request:

> --- a/xen/common/bitops.c
> +++ b/xen/common/bitops.c
> @@ -84,8 +84,32 @@ static void __init test_fls(void)
>      CHECK(fls64, 0x8000000000000001ULL, 64);
>  }
>  
> +static void __init test_multiple_bits_set(void)
> +{
> +    /*
> +     * multiple_bits_set() is generic on the type of it's parameter, as the
> +     * internal expression is so simple.
> +     */
> +
> +    CHECK(multiple_bits_set, 0, false);
> +    CHECK(multiple_bits_set, 1, false);
> +    CHECK(multiple_bits_set, 2, false);
> +    CHECK(multiple_bits_set, 3, true);
> +
> +    CHECK(multiple_bits_set, 1 | (1UL << (BITS_PER_LONG - 1)), true);

This is really the same as ...

> +#if BITS_PER_LONG > 32
> +    CHECK(multiple_bits_set, 1 | (1UL << 32), true);
> +    CHECK(multiple_bits_set, 1 | (1UL << 63), true);

... this, at least as long as BITS_PER_LONG > 32 in practice means
BITS_PER_LONG == 64. Perhaps not really worth keeping that line?

Jan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/9] xen/bitops: Convert 'hweight(x) > 1' to new multiple_bits_set()
  2024-08-22 23:06 ` [PATCH 3/9] xen/bitops: Convert 'hweight(x) > 1' to new multiple_bits_set() Andrew Cooper
@ 2024-08-26 10:37   ` Jan Beulich
  2024-08-27  9:44     ` Andrew Cooper
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2024-08-26 10:37 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné, Wei Liu, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Bertrand Marquis, Michal Orzel,
	Oleksii Kurochko, Shawn Anastasio, Xen-devel

On 23.08.2024 01:06, Andrew Cooper wrote:
> Using hweight() is an especially expensive way of determining simply if
> multiple bits are set in a value.  Worse, 4 of the 10 hweight() calls in Xen
> are of this form.
> 
> Switch to the new multiple_bits_set() helper.  This is far more efficient than
> the longhand hweight() algorithm and, owing to its simplicity, likely more
> efficient than even a dedicated instruction on a superscalar processor.
> 
> No functional change.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

Just to mention it: With the title being sufficiently generic, I half
expected the similar {bitmap,cpumask}_weight() to also be taken care of.
Yet certainly those (and maybe also their ... == 1 forms) can be covered
later.

Jan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/9] xen/bitops: Drop the remnants of hweight{8,16}()
  2024-08-22 23:06 ` [PATCH 4/9] xen/bitops: Drop the remnants of hweight{8,16}() Andrew Cooper
@ 2024-08-26 10:39   ` Jan Beulich
  2024-08-27  9:49     ` Andrew Cooper
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2024-08-26 10:39 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné, Wei Liu, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Bertrand Marquis, Michal Orzel,
	Oleksii Kurochko, Shawn Anastasio, Xen-devel

On 23.08.2024 01:06, Andrew Cooper wrote:
> They are no more, and won't be returning in this form.

And what's the plan? Use hweight32((uint8_t)...) in an open-coded manner?
Not overly nice I would say.

Jan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 5/9] xen/bitops: Introduce generic_hweightl() and hweightl()
  2024-08-22 23:06 ` [PATCH 5/9] xen/bitops: Introduce generic_hweightl() and hweightl() Andrew Cooper
@ 2024-08-26 11:40   ` Jan Beulich
  2024-08-27 10:39     ` Andrew Cooper
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2024-08-26 11:40 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné, Wei Liu, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Bertrand Marquis, Michal Orzel,
	Oleksii Kurochko, Shawn Anastasio, Xen-devel

On 23.08.2024 01:06, Andrew Cooper wrote:
--- a/xen/include/xen/bitops.h
+++ b/xen/include/xen/bitops.h
@@ -35,6 +35,12 @@ extern void __bitop_bad_size(void);
 unsigned int __pure generic_ffsl(unsigned long x);
 unsigned int __pure generic_flsl(unsigned long x);
 
> +/*
> + * Hamming Weight, also called Population Count.  Returns the number of set
> + * bits in @x.
> + */
> +unsigned int __pure generic_hweightl(unsigned long x);

Aren't this and ...

> @@ -284,6 +290,18 @@ static always_inline __pure unsigned int fls64(uint64_t x)
>          (_v & (_v - 1)) != 0;                   \
>      })
>  
> +static always_inline __pure unsigned int hweightl(unsigned long x)

... this even __attribute_const__?

> +{
> +    if ( __builtin_constant_p(x) )
> +        return __builtin_popcountl(x);

How certain are you that compilers (even old ones) will reliably fold
constant expressions here, and never emit a libgcc call instead? The
conditions look to be more tight than just __builtin_constant_p(); a
pretty absurd example:

unsigned ltest(void) {
    return __builtin_constant_p("") ? __builtin_popcountl((unsigned long)"") : ~0;
}

> --- /dev/null
> +++ b/xen/lib/generic-hweightl.c
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +
> +#include <xen/bitops.h>
> +#include <xen/init.h>
> +#include <xen/self-tests.h>
> +
> +/* Mask value @b broadcast to every byte in a long */
> +#if BITS_PER_LONG == 32
> +# define MASK(b) ((b) * 0x01010101UL)
> +#elif BITS_PER_LONG == 64
> +# define MASK(b) ((b) * 0x0101010101010101UL)
> +#else
> +# error Extend me please
> +#endif
> +
> +unsigned int generic_hweightl(unsigned long x)
> +{
> +    x -= (x >> 1) & MASK(0x55);
> +    x =  (x & MASK(0x33)) + ((x >> 2) & MASK(0x33));
> +    x =  (x + (x >> 4)) & MASK(0x0f);
> +
> +    if ( IS_ENABLED(CONFIG_HAS_FAST_MULTIPLY) )
> +        return (x * MASK(0x01)) >> (BITS_PER_LONG - 8);

I realize it's nitpicking, yet especially this use isn't really "mask"-
like. Could I talk you into naming the macro e.g. BCST()?

> +    x += x >> 8;
> +    x += x >> 16;
> +#if BITS_PER_LONG > 32
> +    x += x >> 32;
> +#endif
> +
> +    return x & 0xff;
> +}

Perhaps #undef MASK here, or else ...

> +#ifdef CONFIG_SELF_TESTS
> +static void __init __constructor test_generic_hweightl(void)
> +{
> +    RUNTIME_CHECK(generic_hweightl, 0, 0);
> +    RUNTIME_CHECK(generic_hweightl, 1, 1);
> +    RUNTIME_CHECK(generic_hweightl, 3, 2);
> +    RUNTIME_CHECK(generic_hweightl, 7, 3);
> +    RUNTIME_CHECK(generic_hweightl, 0xff, 8);
> +
> +    RUNTIME_CHECK(generic_hweightl, 1 | (1UL << (BITS_PER_LONG - 1)), 2);
> +    RUNTIME_CHECK(generic_hweightl, -1UL, BITS_PER_LONG);
> +}

... actually use it some here, to have a few more cases?

Jan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 6/9] xen/bitops: Drop hweight_long() and use hweightl()
  2024-08-22 23:06 ` [PATCH 6/9] xen/bitops: Drop hweight_long() and use hweightl() Andrew Cooper
@ 2024-08-26 11:51   ` Jan Beulich
  0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2024-08-26 11:51 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné, Wei Liu, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Bertrand Marquis, Michal Orzel,
	Oleksii Kurochko, Shawn Anastasio, Xen-devel

On 23.08.2024 01:06, Andrew Cooper wrote:
> No functional change.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 7/9] xen/bitops: Implement hweight64() in terms of hweightl()
  2024-08-22 23:06 ` [PATCH 7/9] xen/bitops: Implement hweight64() in terms of hweightl() Andrew Cooper
@ 2024-08-26 11:55   ` Jan Beulich
  2024-08-27 11:50     ` Andrew Cooper
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2024-08-26 11:55 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné, Wei Liu, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Bertrand Marquis, Michal Orzel,
	Oleksii Kurochko, Shawn Anastasio, Xen-devel

On 23.08.2024 01:06, Andrew Cooper wrote:
> ... and drop generic_hweight64().
> 
> This is identical on all architectures except ARM32.  Add one extra SELF_TEST
> to check that hweight64() works when the input is split in half.
> 
> No functional change.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>
with one remark:

> --- a/xen/include/xen/bitops.h
> +++ b/xen/include/xen/bitops.h
> @@ -302,6 +302,14 @@ static always_inline __pure unsigned int hweightl(unsigned long x)
>  #endif
>  }
>  
> +static always_inline __pure unsigned int hweight64(uint64_t x)
> +{
> +    if ( BITS_PER_LONG == 64 )
> +        return hweightl(x);
> +    else
> +        return hweightl(x >> 32) + hweightl(x);

This assume BITS_PER_LONG == 32, which of course is true right now, but
doesn't need to be in general. Better add an explicit cast to uint32_t
(or masking by 0xffffffffU)?

Jan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 8/9] xen/bitops: Implement hweight32() in terms of hweightl()
  2024-08-22 23:06 ` [PATCH 8/9] xen/bitops: Implement hweight32() " Andrew Cooper
@ 2024-08-26 11:59   ` Jan Beulich
  2024-08-27 12:05     ` Andrew Cooper
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2024-08-26 11:59 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné, Wei Liu, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Bertrand Marquis, Michal Orzel,
	Oleksii Kurochko, Shawn Anastasio, Xen-devel

On 23.08.2024 01:06, Andrew Cooper wrote:
> ... and drop generic_hweight32().
> 
> As noted previously, the only two users of hweight32() and they're both
> singleton callers in __init paths, so it's not interesting to have a sub-GPR
> optimised generic.

I think it's clear what is meant, but the part of the sentence ahead of
the comma is a little bumpy. As to not interesting: Perhaps indeed not
right now, but new uses may appear and generally the overly wide
operations may be (slightly) more expensive. Of course we can deal with
the need when it arises, so ...

> No functional change.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Acked-by: Jan Beulich <jbeulich@suse.com>




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 9/9] x86/bitops: Use the POPCNT instruction when available
  2024-08-22 23:06 ` [PATCH 9/9] x86/bitops: Use the POPCNT instruction when available Andrew Cooper
  2024-08-23 15:35   ` Nicola Vetrini
@ 2024-08-26 13:07   ` Jan Beulich
  2024-08-27 11:17     ` Andrew Cooper
  1 sibling, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2024-08-26 13:07 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 23.08.2024 01:06, Andrew Cooper wrote:
> A few RFC points.
> 
>  * I throught we had an x86 general lib-y but I can't find one, hence why it's
>    still in xen/lib/ for now.

We indeed have nothing like that yet. The file name should then imo not be
arch-* though, but x86-*. Or you could put it in xen/lib/x86/. It could even
be obj-y rather than lib-y, unless you expect we'll be able to get rid of
all uses, which seems unlikely at least due to bitmap_weight(). Otoh with
my ABI-level series the call site in arch_hweightl() could then be made go
away for v2 and above, at which point it living in lib-y will be preferable.

>  * When we up the minimum toolchain to GCC 7 / Clang 5, we can use a
>    __attribute__((no_caller_saved_registers)) and can forgo writing this in asm.
> 
>    GCC seems to need extra help, and wants -mgeneral-regs-only too.  It has a
>    habit of complaining about incompatible instructions even when it's not
>    emitting them.

What is this part about? What incompatible instructions, in particular?

> @@ -475,4 +476,24 @@ static always_inline unsigned int arch_flsl(unsigned long x)
>  }
>  #define arch_flsl arch_flsl
>  
> +static always_inline unsigned int arch_hweightl(unsigned long x)
> +{
> +    unsigned int r;
> +
> +    /*
> +     * arch_generic_hweightl() is written in ASM in order to preserve all
> +     * registers, as the compiler can't see the call.
> +     *
> +     * This limits the POPCNT instruction to using the same ABI as a function
> +     * call (input in %rdi, output in %eax) but that's fine.
> +     */
> +    alternative_io("call arch_generic_hweightl",
> +                   "popcnt %[val], %q[res]", X86_FEATURE_POPCNT,
> +                   ASM_OUTPUT2([res] "=a" (r) ASM_CALL_CONSTRAINT),
> +                   [val] "D" (x));

If you made [val] an output ("+D") you could avoid preserving the register
in the function. And I'd expect the register wouldn't be re-used often
afterwards, so its clobbering likely won't harm code quality very much.

> --- /dev/null
> +++ b/xen/lib/arch-generic-hweightl.S
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +        .file __FILE__
> +
> +#include <xen/linkage.h>
> +
> +/*
> + * An implementation of generic_hweightl() used on hardware without the POPCNT
> + * instruction.
> + *
> + * This function is called from within an ALTERNATIVE in arch_hweightl().
> + * i.e. behind the back of the compiler.  Therefore all registers are callee
> + * preserved.
> + *
> + * The ASM is what GCC-12 emits for generic_hweightl() in a release build of
> + * Xen, with spilling of %rdi/%rdx to preserve the callers registers.
> + */
> +FUNC(arch_generic_hweightl)
> +        push   %rdi
> +        push   %rdx
> +
> +        movabs $0x5555555555555555, %rdx
> +        mov    %rdi, %rax
> +        shr    $1, %rax
> +        and    %rdx, %rax
> +        sub    %rax, %rdi
> +        movabs $0x3333333333333333, %rax
> +        mov    %rdi, %rdx
> +        shr    $0x2, %rdi

Could I talk you into omitting the 0x here and ...

> +        and    %rax, %rdx
> +        and    %rax, %rdi
> +        add    %rdi, %rdx
> +        mov    %rdx, %rax
> +        shr    $0x4, %rax

... here, and maybe use ...

> +        add    %rdx, %rax
> +        movabs $0xf0f0f0f0f0f0f0f, %rdx
> +        and    %rdx, %rax
> +        movabs $0x101010101010101, %rdx
> +        imul   %rdx, %rax
> +        shr    $0x38, %rax

... 56 here (or even BITS_PER_LONG-8)?

Jan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/9] xen/bitops: Convert 'hweight(x) > 1' to new multiple_bits_set()
  2024-08-26 10:37   ` Jan Beulich
@ 2024-08-27  9:44     ` Andrew Cooper
  0 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2024-08-27  9:44 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monné, Wei Liu, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Bertrand Marquis, Michal Orzel,
	Oleksii Kurochko, Shawn Anastasio, Xen-devel

On 26/08/2024 11:37 am, Jan Beulich wrote:
> On 23.08.2024 01:06, Andrew Cooper wrote:
>> Using hweight() is an especially expensive way of determining simply if
>> multiple bits are set in a value.  Worse, 4 of the 10 hweight() calls in Xen
>> are of this form.
>>
>> Switch to the new multiple_bits_set() helper.  This is far more efficient than
>> the longhand hweight() algorithm and, owing to its simplicity, likely more
>> efficient than even a dedicated instruction on a superscalar processor.
>>
>> No functional change.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks.

>
> Just to mention it: With the title being sufficiently generic, I half
> expected the similar {bitmap,cpumask}_weight() to also be taken care of.
> Yet certainly those (and maybe also their ... == 1 forms) can be covered
> later.

They are on my radar, but are not as easy to transform.

We'd need a new bitmap API to optimise this case; looking at the
callers, it is definitely worth optimising.  There are also an awful lot
of cpumask_weight() == 1 which want converting, albeit slightly differently.

That's a full series of work in and of itself.

~Andrew


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/9] xen/bitops: Drop the remnants of hweight{8,16}()
  2024-08-26 10:39   ` Jan Beulich
@ 2024-08-27  9:49     ` Andrew Cooper
  2024-08-27 10:11       ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2024-08-27  9:49 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monné, Wei Liu, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Bertrand Marquis, Michal Orzel,
	Oleksii Kurochko, Shawn Anastasio, Xen-devel

On 26/08/2024 11:39 am, Jan Beulich wrote:
> On 23.08.2024 01:06, Andrew Cooper wrote:
>> They are no more, and won't be returning in this form.
> And what's the plan? Use hweight32((uint8_t)...) in an open-coded manner?
> Not overly nice I would say.

If we ever regain a genuine need for the 8 or 16 forms, they can go back
into bitops.h, in terms of hweightl(), just like hweight32().

But it's been 20 years so far and we haven't actually needed
hweight8/16, and I'm expecting this to continue for the forseeable future.

~Andrew


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/9] xen/bitops: Drop the remnants of hweight{8,16}()
  2024-08-27  9:49     ` Andrew Cooper
@ 2024-08-27 10:11       ` Jan Beulich
  2024-08-27 12:04         ` Andrew Cooper
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2024-08-27 10:11 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné, Wei Liu, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Bertrand Marquis, Michal Orzel,
	Oleksii Kurochko, Shawn Anastasio, Xen-devel

On 27.08.2024 11:49, Andrew Cooper wrote:
> On 26/08/2024 11:39 am, Jan Beulich wrote:
>> On 23.08.2024 01:06, Andrew Cooper wrote:
>>> They are no more, and won't be returning in this form.
>> And what's the plan? Use hweight32((uint8_t)...) in an open-coded manner?
>> Not overly nice I would say.
> 
> If we ever regain a genuine need for the 8 or 16 forms, they can go back
> into bitops.h, in terms of hweightl(), just like hweight32().
> 
> But it's been 20 years so far and we haven't actually needed
> hweight8/16, and I'm expecting this to continue for the forseeable future.

Well, I'm not fully convinced. People may (try to) add open-coded forms like
in my earlier reply instead. But anyway:
Acked-by: Jan Beulich <jbeulich@suse.com>

Jan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 5/9] xen/bitops: Introduce generic_hweightl() and hweightl()
  2024-08-26 11:40   ` Jan Beulich
@ 2024-08-27 10:39     ` Andrew Cooper
  2024-08-27 11:41       ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2024-08-27 10:39 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monné, Wei Liu, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Bertrand Marquis, Michal Orzel,
	Oleksii Kurochko, Shawn Anastasio, Xen-devel

On 26/08/2024 12:40 pm, Jan Beulich wrote:
> On 23.08.2024 01:06, Andrew Cooper wrote:
> --- a/xen/include/xen/bitops.h
> +++ b/xen/include/xen/bitops.h
> @@ -35,6 +35,12 @@ extern void __bitop_bad_size(void);
>  unsigned int __pure generic_ffsl(unsigned long x);
>  unsigned int __pure generic_flsl(unsigned long x);
>  
>> +/*
>> + * Hamming Weight, also called Population Count.  Returns the number of set
>> + * bits in @x.
>> + */
>> +unsigned int __pure generic_hweightl(unsigned long x);
> Aren't this and ...
>
>> @@ -284,6 +290,18 @@ static always_inline __pure unsigned int fls64(uint64_t x)
>>          (_v & (_v - 1)) != 0;                   \
>>      })
>>  
>> +static always_inline __pure unsigned int hweightl(unsigned long x)
> ... this even __attribute_const__?

Hmm.  This is following fls(), but on further consideration, they should
be const too.

I'll do a prep patch fixing that, although I'm going to rename it to
__attr_const for brevity.

Much as I'd prefer __const, I expect that is going too far, making it
too close to regular const.

>
>> +{
>> +    if ( __builtin_constant_p(x) )
>> +        return __builtin_popcountl(x);
> How certain are you that compilers (even old ones) will reliably fold
> constant expressions here, and never emit a libgcc call instead? The
> conditions look to be more tight than just __builtin_constant_p(); a
> pretty absurd example:
>
> unsigned ltest(void) {
>     return __builtin_constant_p("") ? __builtin_popcountl((unsigned long)"") : ~0;
> }

How do you express that in terms of a call to hweightl()?

Again, this is following the layout started with fls() in order to avoid
each arch opencoding different versions of constant folding.

https://godbolt.org/z/r544c49oY

When it's forced through the hweightl() interface, even GCC 4.1 decides
that it's non-constant and falls back to generic_hweightl().


I did spend a *lot* of time with the fls() series checking that all
compilers we supported did what we wanted in this case, so I don't
expect it to be a problem.  But, if a library call is emitted, it will
be very obvious (link failure), and we can re-evaluate.


>
>> --- /dev/null
>> +++ b/xen/lib/generic-hweightl.c
>> @@ -0,0 +1,46 @@
>> +/* SPDX-License-Identifier: GPL-2.0-only */
>> +
>> +#include <xen/bitops.h>
>> +#include <xen/init.h>
>> +#include <xen/self-tests.h>
>> +
>> +/* Mask value @b broadcast to every byte in a long */
>> +#if BITS_PER_LONG == 32
>> +# define MASK(b) ((b) * 0x01010101UL)
>> +#elif BITS_PER_LONG == 64
>> +# define MASK(b) ((b) * 0x0101010101010101UL)
>> +#else
>> +# error Extend me please
>> +#endif
>> +
>> +unsigned int generic_hweightl(unsigned long x)
>> +{
>> +    x -= (x >> 1) & MASK(0x55);
>> +    x =  (x & MASK(0x33)) + ((x >> 2) & MASK(0x33));
>> +    x =  (x + (x >> 4)) & MASK(0x0f);
>> +
>> +    if ( IS_ENABLED(CONFIG_HAS_FAST_MULTIPLY) )
>> +        return (x * MASK(0x01)) >> (BITS_PER_LONG - 8);
> I realize it's nitpicking, yet especially this use isn't really "mask"-
> like. Could I talk you into naming the macro e.g. BCST()?

Ok - I wasn't overly happy with the name MASK(), and BCST() is better.

>
>> +    x += x >> 8;
>> +    x += x >> 16;
>> +#if BITS_PER_LONG > 32
>> +    x += x >> 32;
>> +#endif
>> +
>> +    return x & 0xff;
>> +}
> Perhaps #undef MASK here, or else ...
>
>> +#ifdef CONFIG_SELF_TESTS
>> +static void __init __constructor test_generic_hweightl(void)
>> +{
>> +    RUNTIME_CHECK(generic_hweightl, 0, 0);
>> +    RUNTIME_CHECK(generic_hweightl, 1, 1);
>> +    RUNTIME_CHECK(generic_hweightl, 3, 2);
>> +    RUNTIME_CHECK(generic_hweightl, 7, 3);
>> +    RUNTIME_CHECK(generic_hweightl, 0xff, 8);
>> +
>> +    RUNTIME_CHECK(generic_hweightl, 1 | (1UL << (BITS_PER_LONG - 1)), 2);
>> +    RUNTIME_CHECK(generic_hweightl, -1UL, BITS_PER_LONG);
>> +}
> ... actually use it some here, to have a few more cases?

Hmm, why not.

~Andrew


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 9/9] x86/bitops: Use the POPCNT instruction when available
  2024-08-26 13:07   ` Jan Beulich
@ 2024-08-27 11:17     ` Andrew Cooper
  2024-08-27 12:47       ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2024-08-27 11:17 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel

On 26/08/2024 2:07 pm, Jan Beulich wrote:
> On 23.08.2024 01:06, Andrew Cooper wrote:
>> A few RFC points.
>>
>>  * I throught we had an x86 general lib-y but I can't find one, hence why it's
>>    still in xen/lib/ for now.
> We indeed have nothing like that yet. The file name should then imo not be
> arch-* though, but x86-*. Or you could put it in xen/lib/x86/.

I was worried about xen/lib/x86/ and interfering with userspace.

> It could even
> be obj-y rather than lib-y, unless you expect we'll be able to get rid of
> all uses, which seems unlikely at least due to bitmap_weight(). Otoh with
> my ABI-level series the call site in arch_hweightl() could then be made go
> away for v2 and above, at which point it living in lib-y will be preferable.

Yes, I was specifically trying to account for this.

I'm expecting the mandatory-popcnt work to end up looking something like:

diff --git a/xen/arch/x86/include/asm/bitops.h
b/xen/arch/x86/include/asm/bitops.h
index 0db698ed3f4c..027eda60c094 100644
--- a/xen/arch/x86/include/asm/bitops.h
+++ b/xen/arch/x86/include/asm/bitops.h
@@ -480,6 +480,9 @@ static always_inline unsigned int
arch_hweightl(unsigned long x)
 {
     unsigned int r;
 
+    if ( IS_ENABLED(CONFIG_REQUIRE_POPCNT /* or whatever */) )
+        return __builtin_popcountl(x);
+
     /*
      * arch_generic_hweightl() is written in ASM in order to preserve all
      * registers, as the compiler can't see the call.


which in turn DCE's the alternative_io() and drops the reference to
arch_generic_hweightl().

>
>>  * When we up the minimum toolchain to GCC 7 / Clang 5, we can use a
>>    __attribute__((no_caller_saved_registers)) and can forgo writing this in asm.
>>
>>    GCC seems to need extra help, and wants -mgeneral-regs-only too.  It has a
>>    habit of complaining about incompatible instructions even when it's not
>>    emitting them.
> What is this part about? What incompatible instructions, in particular?

This was weird.  https://godbolt.org/z/4z1qzWbfE is an example.

>
>> @@ -475,4 +476,24 @@ static always_inline unsigned int arch_flsl(unsigned long x)
>>  }
>>  #define arch_flsl arch_flsl
>>  
>> +static always_inline unsigned int arch_hweightl(unsigned long x)
>> +{
>> +    unsigned int r;
>> +
>> +    /*
>> +     * arch_generic_hweightl() is written in ASM in order to preserve all
>> +     * registers, as the compiler can't see the call.
>> +     *
>> +     * This limits the POPCNT instruction to using the same ABI as a function
>> +     * call (input in %rdi, output in %eax) but that's fine.
>> +     */
>> +    alternative_io("call arch_generic_hweightl",
>> +                   "popcnt %[val], %q[res]", X86_FEATURE_POPCNT,
>> +                   ASM_OUTPUT2([res] "=a" (r) ASM_CALL_CONSTRAINT),
>> +                   [val] "D" (x));
> If you made [val] an output ("+D") you could avoid preserving the register
> in the function. And I'd expect the register wouldn't be re-used often
> afterwards, so its clobbering likely won't harm code quality very much.

"+D" means it's modified by the asm, which forces preservation of the
register, if it's still needed afterwards.

Plain "D" means not modified by the asm, which means it can be reused if
necessary.

>
>> --- /dev/null
>> +++ b/xen/lib/arch-generic-hweightl.S
>> @@ -0,0 +1,46 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +        .file __FILE__
>> +
>> +#include <xen/linkage.h>
>> +
>> +/*
>> + * An implementation of generic_hweightl() used on hardware without the POPCNT
>> + * instruction.
>> + *
>> + * This function is called from within an ALTERNATIVE in arch_hweightl().
>> + * i.e. behind the back of the compiler.  Therefore all registers are callee
>> + * preserved.
>> + *
>> + * The ASM is what GCC-12 emits for generic_hweightl() in a release build of
>> + * Xen, with spilling of %rdi/%rdx to preserve the callers registers.
>> + */
>> +FUNC(arch_generic_hweightl)
>> +        push   %rdi
>> +        push   %rdx
>> +
>> +        movabs $0x5555555555555555, %rdx
>> +        mov    %rdi, %rax
>> +        shr    $1, %rax
>> +        and    %rdx, %rax
>> +        sub    %rax, %rdi
>> +        movabs $0x3333333333333333, %rax
>> +        mov    %rdi, %rdx
>> +        shr    $0x2, %rdi
> Could I talk you into omitting the 0x here and ...
>
>> +        and    %rax, %rdx
>> +        and    %rax, %rdi
>> +        add    %rdi, %rdx
>> +        mov    %rdx, %rax
>> +        shr    $0x4, %rax
> ... here, and maybe use ...
>
>> +        add    %rdx, %rax
>> +        movabs $0xf0f0f0f0f0f0f0f, %rdx
>> +        and    %rdx, %rax
>> +        movabs $0x101010101010101, %rdx
>> +        imul   %rdx, %rax
>> +        shr    $0x38, %rax
> ... 56 here (or even BITS_PER_LONG-8)?

Ok.

~Andrew


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 5/9] xen/bitops: Introduce generic_hweightl() and hweightl()
  2024-08-27 10:39     ` Andrew Cooper
@ 2024-08-27 11:41       ` Jan Beulich
  2024-08-27 12:30         ` Andrew Cooper
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2024-08-27 11:41 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné, Wei Liu, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Bertrand Marquis, Michal Orzel,
	Oleksii Kurochko, Shawn Anastasio, Xen-devel

On 27.08.2024 12:39, Andrew Cooper wrote:
> On 26/08/2024 12:40 pm, Jan Beulich wrote:
>> On 23.08.2024 01:06, Andrew Cooper wrote:
>> --- a/xen/include/xen/bitops.h
>> +++ b/xen/include/xen/bitops.h
>> @@ -35,6 +35,12 @@ extern void __bitop_bad_size(void);
>>  unsigned int __pure generic_ffsl(unsigned long x);
>>  unsigned int __pure generic_flsl(unsigned long x);
>>  
>>> +/*
>>> + * Hamming Weight, also called Population Count.  Returns the number of set
>>> + * bits in @x.
>>> + */
>>> +unsigned int __pure generic_hweightl(unsigned long x);
>> Aren't this and ...
>>
>>> @@ -284,6 +290,18 @@ static always_inline __pure unsigned int fls64(uint64_t x)
>>>          (_v & (_v - 1)) != 0;                   \
>>>      })
>>>  
>>> +static always_inline __pure unsigned int hweightl(unsigned long x)
>> ... this even __attribute_const__?
> 
> Hmm.  This is following fls(), but on further consideration, they should
> be const too.
> 
> I'll do a prep patch fixing that, although I'm going to rename it to
> __attr_const for brevity.
> 
> Much as I'd prefer __const, I expect that is going too far, making it
> too close to regular const.

I was actually going to suggest using that name, if we want to shorten
__attribute_const__. Yes, gcc treats __const (and __const__) as
keywords, but do we care (much)? All it requires is that we don't start
using __const as a (real) keyword.

Of course __const is a good example of why really we shouldn't use
double-underscore prefixed names anywhere. Any of them can be assigned
a meaning by the compiler, and here that's clearly the case. Therefore,
taking your planned rename, maybe better make it attr_const then? And
eventually switch stuff like __packed, __pure, and __weak to attr_* as
well? Or even introduce something like

#define attr(attr...) __attribute__((attr))

and use attr(const) here?

>>> +{
>>> +    if ( __builtin_constant_p(x) )
>>> +        return __builtin_popcountl(x);
>> How certain are you that compilers (even old ones) will reliably fold
>> constant expressions here, and never emit a libgcc call instead? The
>> conditions look to be more tight than just __builtin_constant_p(); a
>> pretty absurd example:
>>
>> unsigned ltest(void) {
>>     return __builtin_constant_p("") ? __builtin_popcountl((unsigned long)"") : ~0;
>> }
> 
> How do you express that in terms of a call to hweightl()?

hweightl((unsigned long)"");

Yet as said - it's absurd. It merely serves to make the point that what
__builtin_constant_p() returns true for doesn't necessarily constant-
fold in expressions.

> Again, this is following the layout started with fls() in order to avoid
> each arch opencoding different versions of constant folding.
> 
> https://godbolt.org/z/r544c49oY
> 
> When it's forced through the hweightl() interface, even GCC 4.1 decides
> that it's non-constant and falls back to generic_hweightl().
> 
> 
> I did spend a *lot* of time with the fls() series checking that all
> compilers we supported did what we wanted in this case, so I don't
> expect it to be a problem.

Right, and I guess I was pointlessly more concerned about popcount than
I was for ffs() / fls(). The criteria upon which gcc decides whether to
constant-fold the uses is exactly the same.

>  But, if a library call is emitted, it will
> be very obvious (link failure), and we can re-evaluate.

Indeed, we certainly would notice, albeit the diagnostic may be cryptic
to people.

Bottom line - keep it as is.

Jan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 7/9] xen/bitops: Implement hweight64() in terms of hweightl()
  2024-08-26 11:55   ` Jan Beulich
@ 2024-08-27 11:50     ` Andrew Cooper
  2024-08-27 13:00       ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2024-08-27 11:50 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monné, Wei Liu, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Bertrand Marquis, Michal Orzel,
	Oleksii Kurochko, Shawn Anastasio, Xen-devel

On 26/08/2024 12:55 pm, Jan Beulich wrote:
> On 23.08.2024 01:06, Andrew Cooper wrote:
>> ... and drop generic_hweight64().
>>
>> This is identical on all architectures except ARM32.  Add one extra SELF_TEST
>> to check that hweight64() works when the input is split in half.
>>
>> No functional change.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks.

> with one remark:
>
>> --- a/xen/include/xen/bitops.h
>> +++ b/xen/include/xen/bitops.h
>> @@ -302,6 +302,14 @@ static always_inline __pure unsigned int hweightl(unsigned long x)
>>  #endif
>>  }
>>  
>> +static always_inline __pure unsigned int hweight64(uint64_t x)
>> +{
>> +    if ( BITS_PER_LONG == 64 )
>> +        return hweightl(x);
>> +    else
>> +        return hweightl(x >> 32) + hweightl(x);
> This assume BITS_PER_LONG == 32, which of course is true right now, but
> doesn't need to be in general. Better add an explicit cast to uint32_t
> (or masking by 0xffffffffU)?

This is part of the point of putting in the self-tests.  They're
intended to catch things like this in new build environments.

Although, I think we've got enough cases which will #error on
BITS_PER_LONG not being 32 or 64.

Again, this is modelled after f[fl]s64() which have the same
expectations about the BITS_PER_LONG != 64 case.

~Andrew


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/9] xen/bitops: Introduce a multiple_bits_set() helper
  2024-08-26 10:30   ` Jan Beulich
@ 2024-08-27 12:01     ` Andrew Cooper
  2024-08-27 12:50       ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2024-08-27 12:01 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monné, Wei Liu, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Bertrand Marquis, Michal Orzel,
	Oleksii Kurochko, Shawn Anastasio, Xen-devel

On 26/08/2024 11:30 am, Jan Beulich wrote:
> On 23.08.2024 01:06, Andrew Cooper wrote:
>> This will be used to simplify real logic in the following patch.  Add compile
>> and boot time testing as with other bitops.
>>
>> Because the expression is so simple, implement it as a function-like macro
>> which is generic on the type of it's argument, rather than having multiple
>> variants.
>>
>> Testing function-like macros needs a minor adjustments to the infrastructure
>> in xen/self-tests.h to avoid bracketing the fn parameter.  The utility of this
>> outweighs the associated risks.
> We may need to mark these two places as deviated.

Perhaps, although it would want to be a project-wide deviation.

Eclair was green with this patch in place, so it's not blocking.

>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> with one remark/request:
>
>> --- a/xen/common/bitops.c
>> +++ b/xen/common/bitops.c
>> @@ -84,8 +84,32 @@ static void __init test_fls(void)
>>      CHECK(fls64, 0x8000000000000001ULL, 64);
>>  }
>>  
>> +static void __init test_multiple_bits_set(void)
>> +{
>> +    /*
>> +     * multiple_bits_set() is generic on the type of it's parameter, as the
>> +     * internal expression is so simple.
>> +     */
>> +
>> +    CHECK(multiple_bits_set, 0, false);
>> +    CHECK(multiple_bits_set, 1, false);
>> +    CHECK(multiple_bits_set, 2, false);
>> +    CHECK(multiple_bits_set, 3, true);
>> +
>> +    CHECK(multiple_bits_set, 1 | (1UL << (BITS_PER_LONG - 1)), true);
> This is really the same as ...
>
>> +#if BITS_PER_LONG > 32
>> +    CHECK(multiple_bits_set, 1 | (1UL << 32), true);
>> +    CHECK(multiple_bits_set, 1 | (1UL << 63), true);
> ... this, at least as long as BITS_PER_LONG > 32 in practice means
> BITS_PER_LONG == 64. Perhaps not really worth keeping that line?

I suppose not.  I'll drop it.

However, It occurs to me that I do need a test of 0x8000000000000001ULL
mostly for 32bit builds to check that even when the argument is split,
the answer is still accurate.

~Andrew


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/9] xen/bitops: Drop the remnants of hweight{8,16}()
  2024-08-27 10:11       ` Jan Beulich
@ 2024-08-27 12:04         ` Andrew Cooper
  0 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2024-08-27 12:04 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monné, Wei Liu, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Bertrand Marquis, Michal Orzel,
	Oleksii Kurochko, Shawn Anastasio, Xen-devel

On 27/08/2024 11:11 am, Jan Beulich wrote:
> On 27.08.2024 11:49, Andrew Cooper wrote:
>> On 26/08/2024 11:39 am, Jan Beulich wrote:
>>> On 23.08.2024 01:06, Andrew Cooper wrote:
>>>> They are no more, and won't be returning in this form.
>>> And what's the plan? Use hweight32((uint8_t)...) in an open-coded manner?
>>> Not overly nice I would say.
>> If we ever regain a genuine need for the 8 or 16 forms, they can go back
>> into bitops.h, in terms of hweightl(), just like hweight32().
>>
>> But it's been 20 years so far and we haven't actually needed
>> hweight8/16, and I'm expecting this to continue for the forseeable future.
> Well, I'm not fully convinced. People may (try to) add open-coded forms like
> in my earlier reply instead.

I'd hope we'd spot that during review, and even if not, we can fix it up
after the fact.

>  But anyway:
> Acked-by: Jan Beulich <jbeulich@suse.com>

Thanks.

~Andrew


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 8/9] xen/bitops: Implement hweight32() in terms of hweightl()
  2024-08-26 11:59   ` Jan Beulich
@ 2024-08-27 12:05     ` Andrew Cooper
  0 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2024-08-27 12:05 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monné, Wei Liu, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Bertrand Marquis, Michal Orzel,
	Oleksii Kurochko, Shawn Anastasio, Xen-devel

On 26/08/2024 12:59 pm, Jan Beulich wrote:
> On 23.08.2024 01:06, Andrew Cooper wrote:
>> ... and drop generic_hweight32().
>>
>> As noted previously, the only two users of hweight32() and they're both
>> singleton callers in __init paths, so it's not interesting to have a sub-GPR
>> optimised generic.
> I think it's clear what is meant, but the part of the sentence ahead of
> the comma is a little bumpy. As to not interesting: Perhaps indeed not
> right now, but new uses may appear and generally the overly wide
> operations may be (slightly) more expensive. Of course we can deal with
> the need when it arises, so ...

Oh yes, that is wonky.  I'll rephrase.

>> No functional change.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Acked-by: Jan Beulich <jbeulich@suse.com>

Thanks.

~Andrew


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 5/9] xen/bitops: Introduce generic_hweightl() and hweightl()
  2024-08-27 11:41       ` Jan Beulich
@ 2024-08-27 12:30         ` Andrew Cooper
  0 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2024-08-27 12:30 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monné, Wei Liu, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Bertrand Marquis, Michal Orzel,
	Oleksii Kurochko, Shawn Anastasio, Xen-devel

On 27/08/2024 12:41 pm, Jan Beulich wrote:
> On 27.08.2024 12:39, Andrew Cooper wrote:
>> On 26/08/2024 12:40 pm, Jan Beulich wrote:
>>> On 23.08.2024 01:06, Andrew Cooper wrote:
>>> --- a/xen/include/xen/bitops.h
>>> +++ b/xen/include/xen/bitops.h
>>> @@ -35,6 +35,12 @@ extern void __bitop_bad_size(void);
>>>  unsigned int __pure generic_ffsl(unsigned long x);
>>>  unsigned int __pure generic_flsl(unsigned long x);
>>>  
>>>> +/*
>>>> + * Hamming Weight, also called Population Count.  Returns the number of set
>>>> + * bits in @x.
>>>> + */
>>>> +unsigned int __pure generic_hweightl(unsigned long x);
>>> Aren't this and ...
>>>
>>>> @@ -284,6 +290,18 @@ static always_inline __pure unsigned int fls64(uint64_t x)
>>>>          (_v & (_v - 1)) != 0;                   \
>>>>      })
>>>>  
>>>> +static always_inline __pure unsigned int hweightl(unsigned long x)
>>> ... this even __attribute_const__?
>> Hmm.  This is following fls(), but on further consideration, they should
>> be const too.
>>
>> I'll do a prep patch fixing that, although I'm going to rename it to
>> __attr_const for brevity.
>>
>> Much as I'd prefer __const, I expect that is going too far, making it
>> too close to regular const.
> I was actually going to suggest using that name, if we want to shorten
> __attribute_const__. Yes, gcc treats __const (and __const__) as
> keywords, but do we care (much)? All it requires is that we don't start
> using __const as a (real) keyword.

Well also we'll get into more MISRA fun for overriding keywords.

But yes - the fact that GCC treats __const to mean const is precisely
why we shouldn't give it an unrelated meaning.

>
> Of course __const is a good example of why really we shouldn't use
> double-underscore prefixed names anywhere. Any of them can be assigned
> a meaning by the compiler, and here that's clearly the case. Therefore,
> taking your planned rename, maybe better make it attr_const then? And
> eventually switch stuff like __packed, __pure, and __weak to attr_* as
> well? Or even introduce something like
>
> #define attr(attr...) __attribute__((attr))
>
> and use attr(const) here?

Hmm - that's an interesting approach, and for other attributes which we
can use unconditionally.  It will end up shorter than multiple separate
__-prefixed names.

As a tangent, I've got some work from playing with -fanalyzer which
sprinkles some attr malloc/alloc_{size,align}()/free around.  It does
improve code generation (abeit marginally), but the function declaration
size suffers.

It won't work for attributes which are conditionally nothing (e.g.
cf_check), or ones that contain multiple aspects (e.g. __constructer
conataining cf_check).

In practice this means we're always going to end up with a mix, so maybe
attr_const is better for consistency.

>
>>>> +{
>>>> +    if ( __builtin_constant_p(x) )
>>>> +        return __builtin_popcountl(x);
>>> How certain are you that compilers (even old ones) will reliably fold
>>> constant expressions here, and never emit a libgcc call instead? The
>>> conditions look to be more tight than just __builtin_constant_p(); a
>>> pretty absurd example:
>>>
>>> unsigned ltest(void) {
>>>     return __builtin_constant_p("") ? __builtin_popcountl((unsigned long)"") : ~0;
>>> }
>> How do you express that in terms of a call to hweightl()?
> hweightl((unsigned long)"");
>
> Yet as said - it's absurd. It merely serves to make the point that what
> __builtin_constant_p() returns true for doesn't necessarily constant-
> fold in expressions.

Yes, but as shown in the godbolt link, this form changes GCC's mind
about the __builtin_const-ness of the expression.

>
>> Again, this is following the layout started with fls() in order to avoid
>> each arch opencoding different versions of constant folding.
>>
>> https://godbolt.org/z/r544c49oY
>>
>> When it's forced through the hweightl() interface, even GCC 4.1 decides
>> that it's non-constant and falls back to generic_hweightl().
>>
>>
>> I did spend a *lot* of time with the fls() series checking that all
>> compilers we supported did what we wanted in this case, so I don't
>> expect it to be a problem.
> Right, and I guess I was pointlessly more concerned about popcount than
> I was for ffs() / fls(). The criteria upon which gcc decides whether to
> constant-fold the uses is exactly the same.
>
>>   But, if a library call is emitted, it will
>> be very obvious (link failure), and we can re-evaluate.
> Indeed, we certainly would notice, albeit the diagnostic may be cryptic
> to people.
>
> Bottom line - keep it as is.

Thanks.

~Andrew


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 9/9] x86/bitops: Use the POPCNT instruction when available
  2024-08-27 11:17     ` Andrew Cooper
@ 2024-08-27 12:47       ` Jan Beulich
  2024-08-27 14:59         ` Andrew Cooper
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2024-08-27 12:47 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 27.08.2024 13:17, Andrew Cooper wrote:
> On 26/08/2024 2:07 pm, Jan Beulich wrote:
>> On 23.08.2024 01:06, Andrew Cooper wrote:
>>> A few RFC points.
>>>
>>>  * I throught we had an x86 general lib-y but I can't find one, hence why it's
>>>    still in xen/lib/ for now.
>> We indeed have nothing like that yet. The file name should then imo not be
>> arch-* though, but x86-*. Or you could put it in xen/lib/x86/.
> 
> I was worried about xen/lib/x86/ and interfering with userspace.
> 
>> It could even
>> be obj-y rather than lib-y, unless you expect we'll be able to get rid of
>> all uses, which seems unlikely at least due to bitmap_weight(). Otoh with
>> my ABI-level series the call site in arch_hweightl() could then be made go
>> away for v2 and above, at which point it living in lib-y will be preferable.
> 
> Yes, I was specifically trying to account for this.
> 
> I'm expecting the mandatory-popcnt work to end up looking something like:
> 
> diff --git a/xen/arch/x86/include/asm/bitops.h
> b/xen/arch/x86/include/asm/bitops.h
> index 0db698ed3f4c..027eda60c094 100644
> --- a/xen/arch/x86/include/asm/bitops.h
> +++ b/xen/arch/x86/include/asm/bitops.h
> @@ -480,6 +480,9 @@ static always_inline unsigned int
> arch_hweightl(unsigned long x)
>  {
>      unsigned int r;
>  
> +    if ( IS_ENABLED(CONFIG_REQUIRE_POPCNT /* or whatever */) )
> +        return __builtin_popcountl(x);
> +
>      /*
>       * arch_generic_hweightl() is written in ASM in order to preserve all
>       * registers, as the compiler can't see the call.
> 
> 
> which in turn DCE's the alternative_io() and drops the reference to
> arch_generic_hweightl().

Right, that's along the lines of what I was thinking to re-base to once
your work has gone in. (I have close to zero hope that my work would be
going in first. [1]) Just that I don't think we'll have separate
CONFIG_REQUIRE_<feature> settings. Yet how exactly that wants structuring
is something we ought to discuss there, not here.

>>>  * When we up the minimum toolchain to GCC 7 / Clang 5, we can use a
>>>    __attribute__((no_caller_saved_registers)) and can forgo writing this in asm.
>>>
>>>    GCC seems to need extra help, and wants -mgeneral-regs-only too.  It has a
>>>    habit of complaining about incompatible instructions even when it's not
>>>    emitting them.
>> What is this part about? What incompatible instructions, in particular?
> 
> This was weird.  https://godbolt.org/z/4z1qzWbfE is an example.

That's because apparently in your experiments you didn't add -mno-sse. If you
incrementally add that, then -mno-mmx, then -msoft-float, you'll first see the
diagnostic change and then observe it to finally compile. And yes, from
looking at the gcc code emitting this error, this is solely tied to the ISAs
enabled at the time the function is being compiled. It's independent of the
choice of insns. Pretty clearly a shortcoming, imo.

>>> @@ -475,4 +476,24 @@ static always_inline unsigned int arch_flsl(unsigned long x)
>>>  }
>>>  #define arch_flsl arch_flsl
>>>  
>>> +static always_inline unsigned int arch_hweightl(unsigned long x)
>>> +{
>>> +    unsigned int r;
>>> +
>>> +    /*
>>> +     * arch_generic_hweightl() is written in ASM in order to preserve all
>>> +     * registers, as the compiler can't see the call.
>>> +     *
>>> +     * This limits the POPCNT instruction to using the same ABI as a function
>>> +     * call (input in %rdi, output in %eax) but that's fine.
>>> +     */
>>> +    alternative_io("call arch_generic_hweightl",
>>> +                   "popcnt %[val], %q[res]", X86_FEATURE_POPCNT,
>>> +                   ASM_OUTPUT2([res] "=a" (r) ASM_CALL_CONSTRAINT),
>>> +                   [val] "D" (x));
>> If you made [val] an output ("+D") you could avoid preserving the register
>> in the function. And I'd expect the register wouldn't be re-used often
>> afterwards, so its clobbering likely won't harm code quality very much.
> 
> "+D" means it's modified by the asm, which forces preservation of the
> register, if it's still needed afterwards.
> 
> Plain "D" means not modified by the asm, which means it can be reused if
> necessary.

And we likely would prefer the former: If the register's value isn't
use afterwards (and that's the case as far as the function on its own
goes), the compiler will know it doesn't need to preserve anything.
That way, rather than always preserving (in the called function)
preservation will be limited to just the (likely few) cases where the
value actually is still needed afterwards.

Jan

[1] "x86: allow Kconfig control over psABI level" actually has a suitable
    R-b, but its prereq "build: permit Kconfig control over how to deal
    with unsatisfiable choices" doesn't.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/9] xen/bitops: Introduce a multiple_bits_set() helper
  2024-08-27 12:01     ` Andrew Cooper
@ 2024-08-27 12:50       ` Jan Beulich
  2024-08-27 14:13         ` Andrew Cooper
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2024-08-27 12:50 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné, Wei Liu, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Bertrand Marquis, Michal Orzel,
	Oleksii Kurochko, Shawn Anastasio, Xen-devel

On 27.08.2024 14:01, Andrew Cooper wrote:
> On 26/08/2024 11:30 am, Jan Beulich wrote:
>> On 23.08.2024 01:06, Andrew Cooper wrote:
>>> --- a/xen/common/bitops.c
>>> +++ b/xen/common/bitops.c
>>> @@ -84,8 +84,32 @@ static void __init test_fls(void)
>>>      CHECK(fls64, 0x8000000000000001ULL, 64);
>>>  }
>>>  
>>> +static void __init test_multiple_bits_set(void)
>>> +{
>>> +    /*
>>> +     * multiple_bits_set() is generic on the type of it's parameter, as the
>>> +     * internal expression is so simple.
>>> +     */
>>> +
>>> +    CHECK(multiple_bits_set, 0, false);
>>> +    CHECK(multiple_bits_set, 1, false);
>>> +    CHECK(multiple_bits_set, 2, false);
>>> +    CHECK(multiple_bits_set, 3, true);
>>> +
>>> +    CHECK(multiple_bits_set, 1 | (1UL << (BITS_PER_LONG - 1)), true);
>> This is really the same as ...
>>
>>> +#if BITS_PER_LONG > 32
>>> +    CHECK(multiple_bits_set, 1 | (1UL << 32), true);
>>> +    CHECK(multiple_bits_set, 1 | (1UL << 63), true);
>> ... this, at least as long as BITS_PER_LONG > 32 in practice means
>> BITS_PER_LONG == 64. Perhaps not really worth keeping that line?
> 
> I suppose not.  I'll drop it.
> 
> However, It occurs to me that I do need a test of 0x8000000000000001ULL
> mostly for 32bit builds to check that even when the argument is split,
> the answer is still accurate.

IOW you'll insert an #else in the middle. Fine with me; keep the R-b.

Jan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 7/9] xen/bitops: Implement hweight64() in terms of hweightl()
  2024-08-27 11:50     ` Andrew Cooper
@ 2024-08-27 13:00       ` Jan Beulich
  2024-08-27 13:25         ` Andrew Cooper
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2024-08-27 13:00 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné, Wei Liu, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Bertrand Marquis, Michal Orzel,
	Oleksii Kurochko, Shawn Anastasio, Xen-devel

On 27.08.2024 13:50, Andrew Cooper wrote:
> On 26/08/2024 12:55 pm, Jan Beulich wrote:
>> On 23.08.2024 01:06, Andrew Cooper wrote:
>>> --- a/xen/include/xen/bitops.h
>>> +++ b/xen/include/xen/bitops.h
>>> @@ -302,6 +302,14 @@ static always_inline __pure unsigned int hweightl(unsigned long x)
>>>  #endif
>>>  }
>>>  
>>> +static always_inline __pure unsigned int hweight64(uint64_t x)
>>> +{
>>> +    if ( BITS_PER_LONG == 64 )
>>> +        return hweightl(x);
>>> +    else
>>> +        return hweightl(x >> 32) + hweightl(x);
>> This assume BITS_PER_LONG == 32, which of course is true right now, but
>> doesn't need to be in general. Better add an explicit cast to uint32_t
>> (or masking by 0xffffffffU)?
> 
> This is part of the point of putting in the self-tests.  They're
> intended to catch things like this in new build environments.

I don't think I saw any testcase where the result would be wrong if
this split didn't truncate x to the low 32 bits on the rhs of the +.

> Although, I think we've got enough cases which will #error on
> BITS_PER_LONG not being 32 or 64.

My take on this is: Such checks (#error or whatever else precaution)
should like in every single place where violating the assumptions
made would matter. Or else - how do you locate all the places that
need changing?

> Again, this is modelled after f[fl]s64() which have the same
> expectations about the BITS_PER_LONG != 64 case.

Both of them are fine afaict. fls64() has an explicit intermediate
variable of type uint32_t, and ffs64() has a (uint32_t)x as part
of the conditional expression achieving the intended effect.

Anyway, why not use hweight32() instead of hweightl() here? That'll
make things very explicit.

Jan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 7/9] xen/bitops: Implement hweight64() in terms of hweightl()
  2024-08-27 13:00       ` Jan Beulich
@ 2024-08-27 13:25         ` Andrew Cooper
  2024-08-27 14:32           ` Andrew Cooper
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2024-08-27 13:25 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monné, Wei Liu, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Bertrand Marquis, Michal Orzel,
	Oleksii Kurochko, Shawn Anastasio, Xen-devel

On 27/08/2024 2:00 pm, Jan Beulich wrote:
> On 27.08.2024 13:50, Andrew Cooper wrote:
>> On 26/08/2024 12:55 pm, Jan Beulich wrote:
>>> On 23.08.2024 01:06, Andrew Cooper wrote:
>>>> --- a/xen/include/xen/bitops.h
>>>> +++ b/xen/include/xen/bitops.h
>>>> @@ -302,6 +302,14 @@ static always_inline __pure unsigned int hweightl(unsigned long x)
>>>>  #endif
>>>>  }
>>>>  
>>>> +static always_inline __pure unsigned int hweight64(uint64_t x)
>>>> +{
>>>> +    if ( BITS_PER_LONG == 64 )
>>>> +        return hweightl(x);
>>>> +    else
>>>> +        return hweightl(x >> 32) + hweightl(x);
>>> This assume BITS_PER_LONG == 32, which of course is true right now, but
>>> doesn't need to be in general. Better add an explicit cast to uint32_t
>>> (or masking by 0xffffffffU)?
>> This is part of the point of putting in the self-tests.  They're
>> intended to catch things like this in new build environments.
> I don't think I saw any testcase where the result would be wrong if
> this split didn't truncate x to the low 32 bits on the rhs of the +.

That's arguably an error in the choice of test cases.

Although, they're just my best guesses at some

>
>> Although, I think we've got enough cases which will #error on
>> BITS_PER_LONG not being 32 or 64.
> My take on this is: Such checks (#error or whatever else precaution)
> should like in every single place where violating the assumptions
> made would matter. Or else - how do you locate all the places that
> need changing?

Whomever gets to add RISCV-128 support will have to inspect every use of
BITS_PER_LONG, irrespective of #error/BUILD_BUG_ON()/etc.

So, the answer is `grep`.

I'm not advocating that we stop helping out with #error, but it's
unrealistic to expect that only addressing the build errors will result
in a working Xen for BITS_PER_LONG==128.

>
>> Again, this is modelled after f[fl]s64() which have the same
>> expectations about the BITS_PER_LONG != 64 case.
> Both of them are fine afaict. fls64() has an explicit intermediate
> variable of type uint32_t, and ffs64() has a (uint32_t)x as part
> of the conditional expression achieving the intended effect.
>
> Anyway, why not use hweight32() instead of hweightl() here? That'll
> make things very explicit.

hweight32() doesn't exist until the next patch in the series.

Although looking at the end result, I can't figure out why I thought it
was necessary to transform hweight64 first.

I'll swap this patch and the next one, and then use hweight32().

~Andrew


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/9] xen/bitops: Introduce a multiple_bits_set() helper
  2024-08-27 12:50       ` Jan Beulich
@ 2024-08-27 14:13         ` Andrew Cooper
  0 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2024-08-27 14:13 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monné, Wei Liu, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Bertrand Marquis, Michal Orzel,
	Oleksii Kurochko, Shawn Anastasio, Xen-devel

On 27/08/2024 1:50 pm, Jan Beulich wrote:
> On 27.08.2024 14:01, Andrew Cooper wrote:
>> On 26/08/2024 11:30 am, Jan Beulich wrote:
>>> On 23.08.2024 01:06, Andrew Cooper wrote:
>>>> --- a/xen/common/bitops.c
>>>> +++ b/xen/common/bitops.c
>>>> @@ -84,8 +84,32 @@ static void __init test_fls(void)
>>>>      CHECK(fls64, 0x8000000000000001ULL, 64);
>>>>  }
>>>>  
>>>> +static void __init test_multiple_bits_set(void)
>>>> +{
>>>> +    /*
>>>> +     * multiple_bits_set() is generic on the type of it's parameter, as the
>>>> +     * internal expression is so simple.
>>>> +     */
>>>> +
>>>> +    CHECK(multiple_bits_set, 0, false);
>>>> +    CHECK(multiple_bits_set, 1, false);
>>>> +    CHECK(multiple_bits_set, 2, false);
>>>> +    CHECK(multiple_bits_set, 3, true);
>>>> +
>>>> +    CHECK(multiple_bits_set, 1 | (1UL << (BITS_PER_LONG - 1)), true);
>>> This is really the same as ...
>>>
>>>> +#if BITS_PER_LONG > 32
>>>> +    CHECK(multiple_bits_set, 1 | (1UL << 32), true);
>>>> +    CHECK(multiple_bits_set, 1 | (1UL << 63), true);
>>> ... this, at least as long as BITS_PER_LONG > 32 in practice means
>>> BITS_PER_LONG == 64. Perhaps not really worth keeping that line?
>> I suppose not.  I'll drop it.
>>
>> However, It occurs to me that I do need a test of 0x8000000000000001ULL
>> mostly for 32bit builds to check that even when the argument is split,
>> the answer is still accurate.
> IOW you'll insert an #else in the middle. Fine with me; keep the R-b.

Oh, I already had that case, out of context below what you quoted.  I'll
just drop the one intermediate test.

~Andrew


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 7/9] xen/bitops: Implement hweight64() in terms of hweightl()
  2024-08-27 13:25         ` Andrew Cooper
@ 2024-08-27 14:32           ` Andrew Cooper
  2024-08-27 15:01             ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2024-08-27 14:32 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monné, Wei Liu, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Bertrand Marquis, Michal Orzel,
	Oleksii Kurochko, Shawn Anastasio, Xen-devel

On 27/08/2024 2:25 pm, Andrew Cooper wrote:
> On 27/08/2024 2:00 pm, Jan Beulich wrote:
>> On 27.08.2024 13:50, Andrew Cooper wrote:
>>> On 26/08/2024 12:55 pm, Jan Beulich wrote:
>>>> On 23.08.2024 01:06, Andrew Cooper wrote:
>>> Again, this is modelled after f[fl]s64() which have the same
>>> expectations about the BITS_PER_LONG != 64 case.
>> Both of them are fine afaict. fls64() has an explicit intermediate
>> variable of type uint32_t, and ffs64() has a (uint32_t)x as part
>> of the conditional expression achieving the intended effect.
>>
>> Anyway, why not use hweight32() instead of hweightl() here? That'll
>> make things very explicit.
> hweight32() doesn't exist until the next patch in the series.
>
> Although looking at the end result, I can't figure out why I thought it
> was necessary to transform hweight64 first.
>
> I'll swap this patch and the next one, and then use hweight32().

I've found out why.

The hweight32() patch is the one that deletes generic_hweight32(), but
generic_hweight64() uses it.

I can work around this, but it means keeping generic_hweight32() around
and deleting it in the hweight64() patch.

~Andrew


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 9/9] x86/bitops: Use the POPCNT instruction when available
  2024-08-27 12:47       ` Jan Beulich
@ 2024-08-27 14:59         ` Andrew Cooper
  2024-08-27 15:04           ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2024-08-27 14:59 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Roger Pau Monné, Xen-devel

On 27/08/2024 1:47 pm, Jan Beulich wrote:
> On 27.08.2024 13:17, Andrew Cooper wrote:
>> On 26/08/2024 2:07 pm, Jan Beulich wrote:
>>> On 23.08.2024 01:06, Andrew Cooper wrote:
>>>> A few RFC points.
>>>>
>>>>  * I throught we had an x86 general lib-y but I can't find one, hence why it's
>>>>    still in xen/lib/ for now.
>>> We indeed have nothing like that yet. The file name should then imo not be
>>> arch-* though, but x86-*. Or you could put it in xen/lib/x86/.
>> I was worried about xen/lib/x86/ and interfering with userspace.
>>
>>> It could even
>>> be obj-y rather than lib-y, unless you expect we'll be able to get rid of
>>> all uses, which seems unlikely at least due to bitmap_weight(). Otoh with
>>> my ABI-level series the call site in arch_hweightl() could then be made go
>>> away for v2 and above, at which point it living in lib-y will be preferable.
>> Yes, I was specifically trying to account for this.
>>
>> I'm expecting the mandatory-popcnt work to end up looking something like:
>>
>> diff --git a/xen/arch/x86/include/asm/bitops.h
>> b/xen/arch/x86/include/asm/bitops.h
>> index 0db698ed3f4c..027eda60c094 100644
>> --- a/xen/arch/x86/include/asm/bitops.h
>> +++ b/xen/arch/x86/include/asm/bitops.h
>> @@ -480,6 +480,9 @@ static always_inline unsigned int
>> arch_hweightl(unsigned long x)
>>  {
>>      unsigned int r;
>>  
>> +    if ( IS_ENABLED(CONFIG_REQUIRE_POPCNT /* or whatever */) )
>> +        return __builtin_popcountl(x);
>> +
>>      /*
>>       * arch_generic_hweightl() is written in ASM in order to preserve all
>>       * registers, as the compiler can't see the call.
>>
>>
>> which in turn DCE's the alternative_io() and drops the reference to
>> arch_generic_hweightl().
> Right, that's along the lines of what I was thinking to re-base to once
> your work has gone in. (I have close to zero hope that my work would be
> going in first. [1]) Just that I don't think we'll have separate
> CONFIG_REQUIRE_<feature> settings. Yet how exactly that wants structuring
> is something we ought to discuss there, not here.

I just couldn't remember the name you'd given the option.  I wasn't
trying to driveby review it here.

>
>>>>  * When we up the minimum toolchain to GCC 7 / Clang 5, we can use a
>>>>    __attribute__((no_caller_saved_registers)) and can forgo writing this in asm.
>>>>
>>>>    GCC seems to need extra help, and wants -mgeneral-regs-only too.  It has a
>>>>    habit of complaining about incompatible instructions even when it's not
>>>>    emitting them.
>>> What is this part about? What incompatible instructions, in particular?
>> This was weird.  https://godbolt.org/z/4z1qzWbfE is an example.
> That's because apparently in your experiments you didn't add -mno-sse. If you
> incrementally add that, then -mno-mmx, then -msoft-float, you'll first see the
> diagnostic change and then observe it to finally compile. And yes, from
> looking at the gcc code emitting this error, this is solely tied to the ISAs
> enabled at the time the function is being compiled. It's independent of the
> choice of insns. Pretty clearly a shortcoming, imo.

Ah - it was -msoft-float which I was looking for.

I'd played around with -mno-{sse,mmx} but not gotten a working result.


>
>>>> @@ -475,4 +476,24 @@ static always_inline unsigned int arch_flsl(unsigned long x)
>>>>  }
>>>>  #define arch_flsl arch_flsl
>>>>  
>>>> +static always_inline unsigned int arch_hweightl(unsigned long x)
>>>> +{
>>>> +    unsigned int r;
>>>> +
>>>> +    /*
>>>> +     * arch_generic_hweightl() is written in ASM in order to preserve all
>>>> +     * registers, as the compiler can't see the call.
>>>> +     *
>>>> +     * This limits the POPCNT instruction to using the same ABI as a function
>>>> +     * call (input in %rdi, output in %eax) but that's fine.
>>>> +     */
>>>> +    alternative_io("call arch_generic_hweightl",
>>>> +                   "popcnt %[val], %q[res]", X86_FEATURE_POPCNT,
>>>> +                   ASM_OUTPUT2([res] "=a" (r) ASM_CALL_CONSTRAINT),
>>>> +                   [val] "D" (x));
>>> If you made [val] an output ("+D") you could avoid preserving the register
>>> in the function. And I'd expect the register wouldn't be re-used often
>>> afterwards, so its clobbering likely won't harm code quality very much.
>> "+D" means it's modified by the asm, which forces preservation of the
>> register, if it's still needed afterwards.
>>
>> Plain "D" means not modified by the asm, which means it can be reused if
>> necessary.
> And we likely would prefer the former: If the register's value isn't
> use afterwards (and that's the case as far as the function on its own
> goes), the compiler will know it doesn't need to preserve anything.
> That way, rather than always preserving (in the called function)
> preservation will be limited to just the (likely few) cases where the
> value actually is still needed afterwards.

Constraints are there to describe how the asm() behaves to the compiler.

You appear to be asking me to put in explicitly-incorrect constraints
because you think it will game the optimiser.

Except the reasoning is backwards.  The only thing forcing "+D" will do
is cause the compiler to preserve the value elsewhere if it's actually
needed later, despite the contents of %rdi still being good for the purpose.

In other words, using "+D" for asm which is really only "D" (as this one
is) will result in strictly worse code generation in the corner case you
seem to be worried about.

~Andrew


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 7/9] xen/bitops: Implement hweight64() in terms of hweightl()
  2024-08-27 14:32           ` Andrew Cooper
@ 2024-08-27 15:01             ` Jan Beulich
  0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2024-08-27 15:01 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné, Wei Liu, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Bertrand Marquis, Michal Orzel,
	Oleksii Kurochko, Shawn Anastasio, Xen-devel

On 27.08.2024 16:32, Andrew Cooper wrote:
> On 27/08/2024 2:25 pm, Andrew Cooper wrote:
>> On 27/08/2024 2:00 pm, Jan Beulich wrote:
>>> On 27.08.2024 13:50, Andrew Cooper wrote:
>>>> On 26/08/2024 12:55 pm, Jan Beulich wrote:
>>>>> On 23.08.2024 01:06, Andrew Cooper wrote:
>>>> Again, this is modelled after f[fl]s64() which have the same
>>>> expectations about the BITS_PER_LONG != 64 case.
>>> Both of them are fine afaict. fls64() has an explicit intermediate
>>> variable of type uint32_t, and ffs64() has a (uint32_t)x as part
>>> of the conditional expression achieving the intended effect.
>>>
>>> Anyway, why not use hweight32() instead of hweightl() here? That'll
>>> make things very explicit.
>> hweight32() doesn't exist until the next patch in the series.
>>
>> Although looking at the end result, I can't figure out why I thought it
>> was necessary to transform hweight64 first.
>>
>> I'll swap this patch and the next one, and then use hweight32().
> 
> I've found out why.
> 
> The hweight32() patch is the one that deletes generic_hweight32(), but
> generic_hweight64() uses it.
> 
> I can work around this, but it means keeping generic_hweight32() around
> and deleting it in the hweight64() patch.

Or simply fold both patches?

Jan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 9/9] x86/bitops: Use the POPCNT instruction when available
  2024-08-27 14:59         ` Andrew Cooper
@ 2024-08-27 15:04           ` Jan Beulich
  0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2024-08-27 15:04 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Roger Pau Monné, Xen-devel

On 27.08.2024 16:59, Andrew Cooper wrote:
> On 27/08/2024 1:47 pm, Jan Beulich wrote:
>> On 27.08.2024 13:17, Andrew Cooper wrote:
>>> On 26/08/2024 2:07 pm, Jan Beulich wrote:
>>>> On 23.08.2024 01:06, Andrew Cooper wrote:
>>>>> @@ -475,4 +476,24 @@ static always_inline unsigned int arch_flsl(unsigned long x)
>>>>>  }
>>>>>  #define arch_flsl arch_flsl
>>>>>  
>>>>> +static always_inline unsigned int arch_hweightl(unsigned long x)
>>>>> +{
>>>>> +    unsigned int r;
>>>>> +
>>>>> +    /*
>>>>> +     * arch_generic_hweightl() is written in ASM in order to preserve all
>>>>> +     * registers, as the compiler can't see the call.
>>>>> +     *
>>>>> +     * This limits the POPCNT instruction to using the same ABI as a function
>>>>> +     * call (input in %rdi, output in %eax) but that's fine.
>>>>> +     */
>>>>> +    alternative_io("call arch_generic_hweightl",
>>>>> +                   "popcnt %[val], %q[res]", X86_FEATURE_POPCNT,
>>>>> +                   ASM_OUTPUT2([res] "=a" (r) ASM_CALL_CONSTRAINT),
>>>>> +                   [val] "D" (x));
>>>> If you made [val] an output ("+D") you could avoid preserving the register
>>>> in the function. And I'd expect the register wouldn't be re-used often
>>>> afterwards, so its clobbering likely won't harm code quality very much.
>>> "+D" means it's modified by the asm, which forces preservation of the
>>> register, if it's still needed afterwards.
>>>
>>> Plain "D" means not modified by the asm, which means it can be reused if
>>> necessary.
>> And we likely would prefer the former: If the register's value isn't
>> use afterwards (and that's the case as far as the function on its own
>> goes), the compiler will know it doesn't need to preserve anything.
>> That way, rather than always preserving (in the called function)
>> preservation will be limited to just the (likely few) cases where the
>> value actually is still needed afterwards.
> 
> Constraints are there to describe how the asm() behaves to the compiler.
> 
> You appear to be asking me to put in explicitly-incorrect constraints
> because you think it will game the optimiser.
> 
> Except the reasoning is backwards.  The only thing forcing "+D" will do
> is cause the compiler to preserve the value elsewhere if it's actually
> needed later, despite the contents of %rdi still being good for the purpose.
> 
> In other words, using "+D" for asm which is really only "D" (as this one
> is) will result in strictly worse code generation in the corner case you
> seem to be worried about.

Well, then leave it as is. (An extra benefit would have been that
arch_generic_hweightl() then would ended up with a odd-number-of-slots
stack frame. Not that this matters much, but having ABI-compliant
functions where possible seems always preferable, when possible.)

Jan


^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2024-08-27 15:05 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-22 23:06 [PATCH 0/9] xen/bitops: hweight() cleanup/improvements Andrew Cooper
2024-08-22 23:06 ` [PATCH 1/9] xen/bitops: Reinstate the please tidy message Andrew Cooper
2024-08-22 23:06 ` [PATCH 2/9] xen/bitops: Introduce a multiple_bits_set() helper Andrew Cooper
2024-08-26 10:30   ` Jan Beulich
2024-08-27 12:01     ` Andrew Cooper
2024-08-27 12:50       ` Jan Beulich
2024-08-27 14:13         ` Andrew Cooper
2024-08-22 23:06 ` [PATCH 3/9] xen/bitops: Convert 'hweight(x) > 1' to new multiple_bits_set() Andrew Cooper
2024-08-26 10:37   ` Jan Beulich
2024-08-27  9:44     ` Andrew Cooper
2024-08-22 23:06 ` [PATCH 4/9] xen/bitops: Drop the remnants of hweight{8,16}() Andrew Cooper
2024-08-26 10:39   ` Jan Beulich
2024-08-27  9:49     ` Andrew Cooper
2024-08-27 10:11       ` Jan Beulich
2024-08-27 12:04         ` Andrew Cooper
2024-08-22 23:06 ` [PATCH 5/9] xen/bitops: Introduce generic_hweightl() and hweightl() Andrew Cooper
2024-08-26 11:40   ` Jan Beulich
2024-08-27 10:39     ` Andrew Cooper
2024-08-27 11:41       ` Jan Beulich
2024-08-27 12:30         ` Andrew Cooper
2024-08-22 23:06 ` [PATCH 6/9] xen/bitops: Drop hweight_long() and use hweightl() Andrew Cooper
2024-08-26 11:51   ` Jan Beulich
2024-08-22 23:06 ` [PATCH 7/9] xen/bitops: Implement hweight64() in terms of hweightl() Andrew Cooper
2024-08-26 11:55   ` Jan Beulich
2024-08-27 11:50     ` Andrew Cooper
2024-08-27 13:00       ` Jan Beulich
2024-08-27 13:25         ` Andrew Cooper
2024-08-27 14:32           ` Andrew Cooper
2024-08-27 15:01             ` Jan Beulich
2024-08-22 23:06 ` [PATCH 8/9] xen/bitops: Implement hweight32() " Andrew Cooper
2024-08-26 11:59   ` Jan Beulich
2024-08-27 12:05     ` Andrew Cooper
2024-08-22 23:06 ` [PATCH 9/9] x86/bitops: Use the POPCNT instruction when available Andrew Cooper
2024-08-23 15:35   ` Nicola Vetrini
2024-08-23 15:38     ` Andrew Cooper
2024-08-26 13:07   ` Jan Beulich
2024-08-27 11:17     ` Andrew Cooper
2024-08-27 12:47       ` Jan Beulich
2024-08-27 14:59         ` Andrew Cooper
2024-08-27 15:04           ` Jan Beulich

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.