[PATCH next 0/3] lib: Implement mul_u64_u64_div_u64

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH next 0/3] lib: Implement mul_u64_u64_div_u64_roundup()
@ 2025-04-05 20:45 David Laight
  2025-04-05 20:45 ` [PATCH 1/3] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup() David Laight
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: David Laight @ 2025-04-05 20:45 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: David Laight, Uwe Kleine-König, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das

The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
This can be done simply by adding 'divisor - 1' to the 128bit product.
Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
existing code.
Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).

Only x86-64 has an optimsed (asm) version of the function.
That is optimised to avoid the 'add c' when c is known to be zero.
In all other cases the extra code will be noise compared to the software
divide code.

I've updated the test module to test mul_u64_u64_div_u64_roundup() and
also enhanced it to verify the C division code on x86-64.

Note that the code generated by gcc (eg for 32bit x86) just for the multiply
is rather more horrid than one would expect (clang does better).
I dread to think how long the divide loop takes.
And I'm not at all sure the call in kernel/sched/cputime.c isn't in a
relatively common path (rather than just hardware initialisation).

David Laight (3):
  lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
  lib: Add tests for mul_u64_u64_div_u64_roundup()
  lib: Update the muldiv64 tests to verify the C on x86-64

 arch/x86/include/asm/div64.h        |  19 ++--
 include/linux/math64.h              |  44 ++++++++-
 lib/math/div64.c                    |  57 ++++++++----
 lib/math/test_mul_u64_u64_div_u64.c | 136 +++++++++++++++++-----------
 4 files changed, 179 insertions(+), 77 deletions(-)

-- 
2.39.5

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/3] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
  2025-04-05 20:45 [PATCH next 0/3] lib: Implement mul_u64_u64_div_u64_roundup() David Laight
@ 2025-04-05 20:45 ` David Laight
  2025-04-06  1:46   ` Nicolas Pitre
  2025-04-05 20:45 ` [PATCH 2/3] lib: Add tests for mul_u64_u64_div_u64_roundup() David Laight
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: David Laight @ 2025-04-05 20:45 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: David Laight, Uwe Kleine-König, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das

The existing mul_u64_u64_div_u64() rounds down, a 'rounding up'
variant needs 'divisor - 1' adding in between the multiply and
divide so cannot easily be done by a caller.

Add mul_u64_add_u64_div_u64(a, b, c, d) that calculates (a * b + c)/d
and implement the 'round down' and 'round up' using it.

Update the x86-64 asm to optimise for 'c' being a constant zero.

For architectures that support u128 check for a 64bit product after
the multiply (will be cheap).
Leave in the early check for other architectures (mostly 32bit) when
'c' is zero to avoid the multi-part multiply.

Note that the cost of the 128bit divide will dwarf the rest of the code.
This function is very slow on everything except x86-64 (very very slow
on 32bit).

Add kerndoc definitions for all three functions.

Signed-off-by: David Laight <david.laight.linux@gmail.com>
---
 arch/x86/include/asm/div64.h | 19 +++++++++++-----
 include/linux/math64.h       | 44 +++++++++++++++++++++++++++++++++++-
 lib/math/div64.c             | 41 ++++++++++++++++++---------------
 3 files changed, 79 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/div64.h b/arch/x86/include/asm/div64.h
index 9931e4c7d73f..9322a35f6a39 100644
--- a/arch/x86/include/asm/div64.h
+++ b/arch/x86/include/asm/div64.h
@@ -84,21 +84,28 @@ static inline u64 mul_u32_u32(u32 a, u32 b)
  * Will generate an #DE when the result doesn't fit u64, could fix with an
  * __ex_table[] entry when it becomes an issue.
  */
-static inline u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div)
+static inline u64 mul_u64_add_u64_div_u64(u64 a, u64 mul, u64 add, u64 div)
 {
 	u64 q;
 
-	asm ("mulq %2; divq %3" : "=a" (q)
-				: "a" (a), "rm" (mul), "rm" (div)
-				: "rdx");
+	if (statically_true(!add)) {
+		asm ("mulq %2; divq %3" : "=a" (q)
+					: "a" (a), "rm" (mul), "rm" (div)
+					: "rdx");
+	} else {
+		asm ("mulq %2; addq %4,%%rax; adcq $0,%%rdx; divq %3"
+			: "=a" (q)
+			: "a" (a), "rm" (mul), "rm" (div), "rm" (add)
+			: "rdx");
+	}
 
 	return q;
 }
-#define mul_u64_u64_div_u64 mul_u64_u64_div_u64
+#define mul_u64_add_u64_div_u64 mul_u64_add_u64_div_u64
 
 static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 div)
 {
-	return mul_u64_u64_div_u64(a, mul, div);
+	return mul_u64_add_u64_div_u64(a, mul, 0, div);
 }
 #define mul_u64_u32_div	mul_u64_u32_div
 
diff --git a/include/linux/math64.h b/include/linux/math64.h
index 6aaccc1626ab..e958170e64ab 100644
--- a/include/linux/math64.h
+++ b/include/linux/math64.h
@@ -282,7 +282,49 @@ static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 divisor)
 }
 #endif /* mul_u64_u32_div */
 
-u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div);
+/**
+ * mul_u64_add_u64_div_u64 - unsigned 64bit multiply, add, and divide
+ * @a: first unsigned 64bit multiplicand
+ * @b: second unsigned 64bit multiplicand
+ * @c: unsigned 64bit addend
+ * @d: unsigned 64bit divisor
+ *
+ * Multiply two 64bit values together to generate a 128bit product
+ * add a third value and then divide by a fourth.
+ * May BUG()/trap if @d is zero or the quotient exceeds 64 bits.
+ *
+ * Return: (@a * @b + @c) / @d
+ */
+u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d);
+
+/**
+ * mul_u64_u64_div_u64 - unsigned 64bit multiply and divide
+ * @a: first unsigned 64bit multiplicand
+ * @b: second unsigned 64bit multiplicand
+ * @d: unsigned 64bit divisor
+ *
+ * Multiply two 64bit values together to generate a 128bit product
+ * and then divide by a third value.
+ * May BUG()/trap if @d is zero or the quotient exceeds 64 bits.
+ *
+ * Return: @a * @b / @d
+ */
+#define mul_u64_u64_div_u64(a, b, d) mul_u64_add_u64_div_u64(a, b, 0, d)
+
+/**
+ * mul_u64_u64_div_u64_roundup - unsigned 64bit multiply and divide rounded up
+ * @a: first unsigned 64bit multiplicand
+ * @b: second unsigned 64bit multiplicand
+ * @d: unsigned 64bit divisor
+ *
+ * Multiply two 64bit values together to generate a 128bit product
+ * and then divide and round up.
+ * May BUG()/trap if @d is zero or the quotient exceeds 64 bits.
+ *
+ * Return: (@a * @b + @d - 1) / @d
+ */
+#define mul_u64_u64_div_u64_roundup(a, b, d) \
+	({ u64 _tmp = (d); mul_u64_add_u64_div_u64(a, b, _tmp - 1, _tmp); })
 
 /**
  * DIV64_U64_ROUND_UP - unsigned 64bit divide with 64bit divisor rounded up
diff --git a/lib/math/div64.c b/lib/math/div64.c
index 5faa29208bdb..50e025174495 100644
--- a/lib/math/div64.c
+++ b/lib/math/div64.c
@@ -183,26 +183,28 @@ u32 iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
 }
 EXPORT_SYMBOL(iter_div_u64_rem);
 
-#ifndef mul_u64_u64_div_u64
-u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
+#if !defined(mul_u64_add_u64_div_u64)
+u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
 {
-	if (ilog2(a) + ilog2(b) <= 62)
-		return div64_u64(a * b, c);
-
 #if defined(__SIZEOF_INT128__)
 
 	/* native 64x64=128 bits multiplication */
-	u128 prod = (u128)a * b;
+	u128 prod = (u128)a * b + c;
 	u64 n_lo = prod, n_hi = prod >> 64;
 
 #else
 
+	if (!c && ilog2(a) + ilog2(b) <= 62)
+		return div64_u64(a * b, d);
+
 	/* perform a 64x64=128 bits multiplication manually */
 	u32 a_lo = a, a_hi = a >> 32, b_lo = b, b_hi = b >> 32;
 	u64 x, y, z;
 
-	x = (u64)a_lo * b_lo;
+	/* Since (x-1)(x-1) + 2(x-1) == x.x - 1 two u32 can be added to a u64 */
+	x = (u64)a_lo * b_lo + (u32)c;
 	y = (u64)a_lo * b_hi + (u32)(x >> 32);
+	y += (u32)(c >> 32);
 	z = (u64)a_hi * b_hi + (u32)(y >> 32);
 	y = (u64)a_hi * b_lo + (u32)y;
 	z += (u32)(y >> 32);
@@ -212,36 +214,39 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
 
 #endif
 
-	/* make sure c is not zero, trigger exception otherwise */
+	if (!n_hi)
+		return div64_u64(n_lo, d);
+
+	/* make sure d is not zero, trigger exception otherwise */
 #pragma GCC diagnostic push
 #pragma GCC diagnostic ignored "-Wdiv-by-zero"
-	if (unlikely(c == 0))
+	if (unlikely(d == 0))
 		return 1/0;
 #pragma GCC diagnostic pop
 
-	int shift = __builtin_ctzll(c);
+	int shift = __builtin_ctzll(d);
 
 	/* try reducing the fraction in case the dividend becomes <= 64 bits */
 	if ((n_hi >> shift) == 0) {
 		u64 n = shift ? (n_lo >> shift) | (n_hi << (64 - shift)) : n_lo;
 
-		return div64_u64(n, c >> shift);
+		return div64_u64(n, d >> shift);
 		/*
 		 * The remainder value if needed would be:
-		 *   res = div64_u64_rem(n, c >> shift, &rem);
+		 *   res = div64_u64_rem(n, d >> shift, &rem);
 		 *   rem = (rem << shift) + (n_lo - (n << shift));
 		 */
 	}
 
-	if (n_hi >= c) {
+	if (n_hi >= d) {
 		/* overflow: result is unrepresentable in a u64 */
 		return -1;
 	}
 
 	/* Do the full 128 by 64 bits division */
 
-	shift = __builtin_clzll(c);
-	c <<= shift;
+	shift = __builtin_clzll(d);
+	d <<= shift;
 
 	int p = 64 + shift;
 	u64 res = 0;
@@ -256,8 +261,8 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
 		n_hi <<= shift;
 		n_hi |= n_lo >> (64 - shift);
 		n_lo <<= shift;
-		if (carry || (n_hi >= c)) {
-			n_hi -= c;
+		if (carry || (n_hi >= d)) {
+			n_hi -= d;
 			res |= 1ULL << p;
 		}
 	} while (n_hi);
@@ -265,5 +270,5 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
 
 	return res;
 }
-EXPORT_SYMBOL(mul_u64_u64_div_u64);
+EXPORT_SYMBOL(mul_u64_add_u64_div_u64);
 #endif
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/3] lib: Add tests for mul_u64_u64_div_u64_roundup()
  2025-04-05 20:45 [PATCH next 0/3] lib: Implement mul_u64_u64_div_u64_roundup() David Laight
  2025-04-05 20:45 ` [PATCH 1/3] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup() David Laight
@ 2025-04-05 20:45 ` David Laight
  2025-04-06  1:47   ` Nicolas Pitre
  2025-04-05 20:45 ` [PATCH 3/3] lib: Update the muldiv64 tests to verify the C on x86-64 David Laight
  2025-05-16  9:47 ` [PATCH next 0/3] lib: Implement mul_u64_u64_div_u64_roundup() Uwe Kleine-König
  3 siblings, 1 reply; 14+ messages in thread
From: David Laight @ 2025-04-05 20:45 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: David Laight, Uwe Kleine-König, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das

Replicate the existing mul_u64_u64_div_u64() test cases with round up.
Update the shell script that verifies the table, remove the comment
markers so that it can be directly pasted into a shell.

It any tests fail then fail the module load with -EINVAL.

Signed-off-by: David Laight <david.laight.linux@gmail.com>
---
 lib/math/test_mul_u64_u64_div_u64.c | 112 ++++++++++++++++------------
 1 file changed, 66 insertions(+), 46 deletions(-)

diff --git a/lib/math/test_mul_u64_u64_div_u64.c b/lib/math/test_mul_u64_u64_div_u64.c
index 58d058de4e73..9548eb7458c7 100644
--- a/lib/math/test_mul_u64_u64_div_u64.c
+++ b/lib/math/test_mul_u64_u64_div_u64.c
@@ -10,61 +10,72 @@
 #include <linux/printk.h>
 #include <linux/math64.h>
 
-typedef struct { u64 a; u64 b; u64 c; u64 result; } test_params;
+typedef struct { u64 a; u64 b; u64 c; u64 result; uint round_up;} test_params;
 
 static test_params test_values[] = {
 /* this contains many edge values followed by a couple random values */
-{                0xb,                0x7,                0x3,               0x19 },
-{         0xffff0000,         0xffff0000,                0xf, 0x1110eeef00000000 },
-{         0xffffffff,         0xffffffff,                0x1, 0xfffffffe00000001 },
-{         0xffffffff,         0xffffffff,                0x2, 0x7fffffff00000000 },
-{        0x1ffffffff,         0xffffffff,                0x2, 0xfffffffe80000000 },
-{        0x1ffffffff,         0xffffffff,                0x3, 0xaaaaaaa9aaaaaaab },
-{        0x1ffffffff,        0x1ffffffff,                0x4, 0xffffffff00000000 },
-{ 0xffff000000000000, 0xffff000000000000, 0xffff000000000001, 0xfffeffffffffffff },
-{ 0x3333333333333333, 0x3333333333333333, 0x5555555555555555, 0x1eb851eb851eb851 },
-{ 0x7fffffffffffffff,                0x2,                0x3, 0x5555555555555554 },
-{ 0xffffffffffffffff,                0x2, 0x8000000000000000,                0x3 },
-{ 0xffffffffffffffff,                0x2, 0xc000000000000000,                0x2 },
-{ 0xffffffffffffffff, 0x4000000000000004, 0x8000000000000000, 0x8000000000000007 },
-{ 0xffffffffffffffff, 0x4000000000000001, 0x8000000000000000, 0x8000000000000001 },
-{ 0xffffffffffffffff, 0x8000000000000001, 0xffffffffffffffff, 0x8000000000000001 },
-{ 0xfffffffffffffffe, 0x8000000000000001, 0xffffffffffffffff, 0x8000000000000000 },
-{ 0xffffffffffffffff, 0x8000000000000001, 0xfffffffffffffffe, 0x8000000000000001 },
-{ 0xffffffffffffffff, 0x8000000000000001, 0xfffffffffffffffd, 0x8000000000000002 },
-{ 0x7fffffffffffffff, 0xffffffffffffffff, 0xc000000000000000, 0xaaaaaaaaaaaaaaa8 },
-{ 0xffffffffffffffff, 0x7fffffffffffffff, 0xa000000000000000, 0xccccccccccccccca },
-{ 0xffffffffffffffff, 0x7fffffffffffffff, 0x9000000000000000, 0xe38e38e38e38e38b },
-{ 0x7fffffffffffffff, 0x7fffffffffffffff, 0x5000000000000000, 0xccccccccccccccc9 },
-{ 0xffffffffffffffff, 0xfffffffffffffffe, 0xffffffffffffffff, 0xfffffffffffffffe },
-{ 0xe6102d256d7ea3ae, 0x70a77d0be4c31201, 0xd63ec35ab3220357, 0x78f8bf8cc86c6e18 },
-{ 0xf53bae05cb86c6e1, 0x3847b32d2f8d32e0, 0xcfd4f55a647f403c, 0x42687f79d8998d35 },
-{ 0x9951c5498f941092, 0x1f8c8bfdf287a251, 0xa3c8dc5f81ea3fe2, 0x1d887cb25900091f },
-{ 0x374fee9daa1bb2bb, 0x0d0bfbff7b8ae3ef, 0xc169337bd42d5179, 0x03bb2dbaffcbb961 },
-{ 0xeac0d03ac10eeaf0, 0x89be05dfa162ed9b, 0x92bb1679a41f0e4b, 0xdc5f5cc9e270d216 },
+{                0xb,                0x7,                0x3,               0x19, 1 },
+{         0xffff0000,         0xffff0000,                0xf, 0x1110eeef00000000, 0 },
+{         0xffffffff,         0xffffffff,                0x1, 0xfffffffe00000001, 0 },
+{         0xffffffff,         0xffffffff,                0x2, 0x7fffffff00000000, 1 },
+{        0x1ffffffff,         0xffffffff,                0x2, 0xfffffffe80000000, 1 },
+{        0x1ffffffff,         0xffffffff,                0x3, 0xaaaaaaa9aaaaaaab, 0 },
+{        0x1ffffffff,        0x1ffffffff,                0x4, 0xffffffff00000000, 1 },
+{ 0xffff000000000000, 0xffff000000000000, 0xffff000000000001, 0xfffeffffffffffff, 1 },
+{ 0x3333333333333333, 0x3333333333333333, 0x5555555555555555, 0x1eb851eb851eb851, 1 },
+{ 0x7fffffffffffffff,                0x2,                0x3, 0x5555555555555554, 1 },
+{ 0xffffffffffffffff,                0x2, 0x8000000000000000,                0x3, 1 },
+{ 0xffffffffffffffff,                0x2, 0xc000000000000000,                0x2, 1 },
+{ 0xffffffffffffffff, 0x4000000000000004, 0x8000000000000000, 0x8000000000000007, 1 },
+{ 0xffffffffffffffff, 0x4000000000000001, 0x8000000000000000, 0x8000000000000001, 1 },
+{ 0xffffffffffffffff, 0x8000000000000001, 0xffffffffffffffff, 0x8000000000000001, 0 },
+{ 0xfffffffffffffffe, 0x8000000000000001, 0xffffffffffffffff, 0x8000000000000000, 1 },
+{ 0xffffffffffffffff, 0x8000000000000001, 0xfffffffffffffffe, 0x8000000000000001, 1 },
+{ 0xffffffffffffffff, 0x8000000000000001, 0xfffffffffffffffd, 0x8000000000000002, 1 },
+{ 0x7fffffffffffffff, 0xffffffffffffffff, 0xc000000000000000, 0xaaaaaaaaaaaaaaa8, 1 },
+{ 0xffffffffffffffff, 0x7fffffffffffffff, 0xa000000000000000, 0xccccccccccccccca, 1 },
+{ 0xffffffffffffffff, 0x7fffffffffffffff, 0x9000000000000000, 0xe38e38e38e38e38b, 1 },
+{ 0x7fffffffffffffff, 0x7fffffffffffffff, 0x5000000000000000, 0xccccccccccccccc9, 1 },
+{ 0xffffffffffffffff, 0xfffffffffffffffe, 0xffffffffffffffff, 0xfffffffffffffffe, 0 },
+{ 0xe6102d256d7ea3ae, 0x70a77d0be4c31201, 0xd63ec35ab3220357, 0x78f8bf8cc86c6e18, 1 },
+{ 0xf53bae05cb86c6e1, 0x3847b32d2f8d32e0, 0xcfd4f55a647f403c, 0x42687f79d8998d35, 1 },
+{ 0x9951c5498f941092, 0x1f8c8bfdf287a251, 0xa3c8dc5f81ea3fe2, 0x1d887cb25900091f, 1 },
+{ 0x374fee9daa1bb2bb, 0x0d0bfbff7b8ae3ef, 0xc169337bd42d5179, 0x03bb2dbaffcbb961, 1 },
+{ 0xeac0d03ac10eeaf0, 0x89be05dfa162ed9b, 0x92bb1679a41f0e4b, 0xdc5f5cc9e270d216, 1 },
 };
 
 /*
  * The above table can be verified with the following shell script:
- *
- * #!/bin/sh
- * sed -ne 's/^{ \+\(.*\), \+\(.*\), \+\(.*\), \+\(.*\) },$/\1 \2 \3 \4/p' \
- *     lib/math/test_mul_u64_u64_div_u64.c |
- * while read a b c r; do
- *   expected=$( printf "obase=16; ibase=16; %X * %X / %X\n" $a $b $c | bc )
- *   given=$( printf "%X\n" $r )
- *   if [ "$expected" = "$given" ]; then
- *     echo "$a * $b / $c = $r OK"
- *   else
- *     echo "$a * $b / $c = $r is wrong" >&2
- *     echo "should be equivalent to 0x$expected" >&2
- *     exit 1
- *   fi
- * done
+
+#!/bin/sh
+sed -ne 's/^{ \+\(.*\), \+\(.*\), \+\(.*\), \+\(.*\), \+\(.*\) },$/\1 \2 \3 \4 \5/p' \
+    lib/math/test_mul_u64_u64_div_u64.c |
+while read a b c r d; do
+  expected=$( printf "obase=16; ibase=16; %X * %X / %X\n" $a $b $c | bc )
+  given=$( printf "%X\n" $r )
+  if [ "$expected" = "$given" ]; then
+    echo "$a * $b  / $c = $r OK"
+  else
+    echo "$a * $b  / $c = $r is wrong" >&2
+    echo "should be equivalent to 0x$expected" >&2
+    exit 1
+  fi
+  expected=$( printf "obase=16; ibase=16; (%X * %X + %X) / %X\n" $a $b $((c-1)) $c | bc )
+  given=$( printf "%X\n" $((r + d)) )
+  if [ "$expected" = "$given" ]; then
+    echo "$a * $b +/ $c = $(printf '%#x' $((r + d))) OK"
+  else
+    echo "$a * $b +/ $c = $(printf '%#x' $((r + d))) is wrong" >&2
+    echo "should be equivalent to 0x$expected" >&2
+    exit 1
+  fi
+done
+
  */
 
 static int __init test_init(void)
 {
+	int errors = 0;
 	int i;
 
 	pr_info("Starting mul_u64_u64_div_u64() test\n");
@@ -75,16 +86,25 @@ static int __init test_init(void)
 		u64 c = test_values[i].c;
 		u64 expected_result = test_values[i].result;
 		u64 result = mul_u64_u64_div_u64(a, b, c);
+		u64 result_up = mul_u64_u64_div_u64_roundup(a, b, c);
 
 		if (result != expected_result) {
 			pr_err("ERROR: 0x%016llx * 0x%016llx / 0x%016llx\n", a, b, c);
 			pr_err("ERROR: expected result: %016llx\n", expected_result);
 			pr_err("ERROR: obtained result: %016llx\n", result);
+			errors++;
+		}
+		expected_result += test_values[i].round_up;
+		if (result_up != expected_result) {
+			pr_err("ERROR: 0x%016llx * 0x%016llx +/ 0x%016llx\n", a, b, c);
+			pr_err("ERROR: expected result: %016llx\n", expected_result);
+			pr_err("ERROR: obtained result: %016llx\n", result_up);
+			errors++;
 		}
 	}
 
-	pr_info("Completed mul_u64_u64_div_u64() test\n");
-	return 0;
+	pr_info("Completed mul_u64_u64_div_u64() test, %d errors\n", errors);
+	return errors ? -EINVAL : 0;
 }
 
 static void __exit test_exit(void)
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/3] lib: Update the muldiv64 tests to verify the C on x86-64
  2025-04-05 20:45 [PATCH next 0/3] lib: Implement mul_u64_u64_div_u64_roundup() David Laight
  2025-04-05 20:45 ` [PATCH 1/3] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup() David Laight
  2025-04-05 20:45 ` [PATCH 2/3] lib: Add tests for mul_u64_u64_div_u64_roundup() David Laight
@ 2025-04-05 20:45 ` David Laight
  2025-04-06  2:26   ` Nicolas Pitre
  2025-05-16  9:47 ` [PATCH next 0/3] lib: Implement mul_u64_u64_div_u64_roundup() Uwe Kleine-König
  3 siblings, 1 reply; 14+ messages in thread
From: David Laight @ 2025-04-05 20:45 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: David Laight, Uwe Kleine-König, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das

div64.c contains a 128 by 64 division algorithm which x86-64 overrides
it with an asm implementation.
So running the muldiv64 tests only verifies the asm code.
Since x86-64 is the most likely test system compile the default
code into an x86-64 kernel (under a different name) when the tests
are being built.
Verify that both the asm and C functions generate the correct results.

Signed-off-by: David Laight <david.laight.linux@gmail.com>
---
 lib/math/div64.c                    | 18 ++++++++++++++++--
 lib/math/test_mul_u64_u64_div_u64.c | 26 ++++++++++++++++++++------
 2 files changed, 36 insertions(+), 8 deletions(-)

diff --git a/lib/math/div64.c b/lib/math/div64.c
index 50e025174495..38ee5c01c288 100644
--- a/lib/math/div64.c
+++ b/lib/math/div64.c
@@ -25,6 +25,8 @@
 #include <linux/minmax.h>
 #include <linux/log2.h>
 
+#include <generated/autoconf.h>
+
 /* Not needed on 64bit architectures */
 #if BITS_PER_LONG == 32
 
@@ -183,10 +185,22 @@ u32 iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
 }
 EXPORT_SYMBOL(iter_div_u64_rem);
 
-#if !defined(mul_u64_add_u64_div_u64)
+/*
+ * If the architecture overrides the implementation below and the test module
+ * is being built then compile the default implementation with a different name
+ * so that it can be tested.
+ */
+#if defined(mul_u64_add_u64_div_u64) && (defined(CONFIG_TEST_MULDIV64) || defined(CONFIG_TEST_MULDIV64_MODULE))
+#define TEST_MULDIV64
+#undef mul_u64_add_u64_div_u64
+#define mul_u64_add_u64_div_u64 mul_u64_add_u64_div_u64_test
+u64 mul_u64_add_u64_div_u64_test(u64 a, u64 b, u64 c, u64 d);
+#endif
+
+#if !defined( mul_u64_add_u64_div_u64) || defined(TEST_MULDIV64)
 u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
 {
-#if defined(__SIZEOF_INT128__)
+#if defined(__SIZEOF_INT128__) && !defined(TEST_MULDIV64)
 
 	/* native 64x64=128 bits multiplication */
 	u128 prod = (u128)a * b + c;
diff --git a/lib/math/test_mul_u64_u64_div_u64.c b/lib/math/test_mul_u64_u64_div_u64.c
index 9548eb7458c7..e2289b412601 100644
--- a/lib/math/test_mul_u64_u64_div_u64.c
+++ b/lib/math/test_mul_u64_u64_div_u64.c
@@ -73,6 +73,10 @@ done
 
  */
 
+#ifdef mul_u64_add_u64_div_u64
+u64 mul_u64_add_u64_div_u64_test(u64 a, u64 b, u64 add, u64 c);
+#endif
+
 static int __init test_init(void)
 {
 	int errors = 0;
@@ -80,21 +84,31 @@ static int __init test_init(void)
 
 	pr_info("Starting mul_u64_u64_div_u64() test\n");
 
-	for (i = 0; i < ARRAY_SIZE(test_values); i++) {
-		u64 a = test_values[i].a;
-		u64 b = test_values[i].b;
-		u64 c = test_values[i].c;
-		u64 expected_result = test_values[i].result;
+	for (i = 0; i < ARRAY_SIZE(test_values) * 2; i++) {
+		u64 a = test_values[i / 2].a;
+		u64 b = test_values[i / 2].b;
+		u64 c = test_values[i / 2].c;
+		u64 expected_result = test_values[i / 2].result;
 		u64 result = mul_u64_u64_div_u64(a, b, c);
 		u64 result_up = mul_u64_u64_div_u64_roundup(a, b, c);
 
+#ifdef mul_u64_add_u64_div_u64
+		if (i & 1) {
+			/* Verify the generic C version */
+			result = mul_u64_add_u64_div_u64_test(a, b, 0, c);
+			result_up = mul_u64_add_u64_div_u64_test(a, b, c - 1, c);
+		}
+#else
+		i++;
+#endif
+
 		if (result != expected_result) {
 			pr_err("ERROR: 0x%016llx * 0x%016llx / 0x%016llx\n", a, b, c);
 			pr_err("ERROR: expected result: %016llx\n", expected_result);
 			pr_err("ERROR: obtained result: %016llx\n", result);
 			errors++;
 		}
-		expected_result += test_values[i].round_up;
+		expected_result += test_values[i / 2].round_up;
 		if (result_up != expected_result) {
 			pr_err("ERROR: 0x%016llx * 0x%016llx +/ 0x%016llx\n", a, b, c);
 			pr_err("ERROR: expected result: %016llx\n", expected_result);
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
  2025-04-05 20:45 ` [PATCH 1/3] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup() David Laight
@ 2025-04-06  1:46   ` Nicolas Pitre
  2025-04-06  3:06     ` Nicolas Pitre
  2025-04-06  9:35     ` David Laight
  0 siblings, 2 replies; 14+ messages in thread
From: Nicolas Pitre @ 2025-04-06  1:46 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, Uwe Kleine-König, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Sat, 5 Apr 2025, David Laight wrote:

> The existing mul_u64_u64_div_u64() rounds down, a 'rounding up'
> variant needs 'divisor - 1' adding in between the multiply and
> divide so cannot easily be done by a caller.
> 
> Add mul_u64_add_u64_div_u64(a, b, c, d) that calculates (a * b + c)/d
> and implement the 'round down' and 'round up' using it.
> 
> Update the x86-64 asm to optimise for 'c' being a constant zero.
> 
> For architectures that support u128 check for a 64bit product after
> the multiply (will be cheap).
> Leave in the early check for other architectures (mostly 32bit) when
> 'c' is zero to avoid the multi-part multiply.
> 
> Note that the cost of the 128bit divide will dwarf the rest of the code.
> This function is very slow on everything except x86-64 (very very slow
> on 32bit).
> 
> Add kerndoc definitions for all three functions.
> 
> Signed-off-by: David Laight <david.laight.linux@gmail.com>

Reviewed-by: Nicolas Pitre <npitre@baylibre.com>

Sidenote: The 128-bits division cost is proportional to the number of 
bits in the final result. So if the result is 0x0080000000000000 then 
the loop will execute only once and exit early.

> ---
>  arch/x86/include/asm/div64.h | 19 +++++++++++-----
>  include/linux/math64.h       | 44 +++++++++++++++++++++++++++++++++++-
>  lib/math/div64.c             | 41 ++++++++++++++++++---------------
>  3 files changed, 79 insertions(+), 25 deletions(-)
> 
> diff --git a/arch/x86/include/asm/div64.h b/arch/x86/include/asm/div64.h
> index 9931e4c7d73f..9322a35f6a39 100644
> --- a/arch/x86/include/asm/div64.h
> +++ b/arch/x86/include/asm/div64.h
> @@ -84,21 +84,28 @@ static inline u64 mul_u32_u32(u32 a, u32 b)
>   * Will generate an #DE when the result doesn't fit u64, could fix with an
>   * __ex_table[] entry when it becomes an issue.
>   */
> -static inline u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div)
> +static inline u64 mul_u64_add_u64_div_u64(u64 a, u64 mul, u64 add, u64 div)
>  {
>  	u64 q;
>  
> -	asm ("mulq %2; divq %3" : "=a" (q)
> -				: "a" (a), "rm" (mul), "rm" (div)
> -				: "rdx");
> +	if (statically_true(!add)) {
> +		asm ("mulq %2; divq %3" : "=a" (q)
> +					: "a" (a), "rm" (mul), "rm" (div)
> +					: "rdx");
> +	} else {
> +		asm ("mulq %2; addq %4,%%rax; adcq $0,%%rdx; divq %3"
> +			: "=a" (q)
> +			: "a" (a), "rm" (mul), "rm" (div), "rm" (add)
> +			: "rdx");
> +	}
>  
>  	return q;
>  }
> -#define mul_u64_u64_div_u64 mul_u64_u64_div_u64
> +#define mul_u64_add_u64_div_u64 mul_u64_add_u64_div_u64
>  
>  static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 div)
>  {
> -	return mul_u64_u64_div_u64(a, mul, div);
> +	return mul_u64_add_u64_div_u64(a, mul, 0, div);
>  }
>  #define mul_u64_u32_div	mul_u64_u32_div
>  
> diff --git a/include/linux/math64.h b/include/linux/math64.h
> index 6aaccc1626ab..e958170e64ab 100644
> --- a/include/linux/math64.h
> +++ b/include/linux/math64.h
> @@ -282,7 +282,49 @@ static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 divisor)
>  }
>  #endif /* mul_u64_u32_div */
>  
> -u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div);
> +/**
> + * mul_u64_add_u64_div_u64 - unsigned 64bit multiply, add, and divide
> + * @a: first unsigned 64bit multiplicand
> + * @b: second unsigned 64bit multiplicand
> + * @c: unsigned 64bit addend
> + * @d: unsigned 64bit divisor
> + *
> + * Multiply two 64bit values together to generate a 128bit product
> + * add a third value and then divide by a fourth.
> + * May BUG()/trap if @d is zero or the quotient exceeds 64 bits.
> + *
> + * Return: (@a * @b + @c) / @d
> + */
> +u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d);
> +
> +/**
> + * mul_u64_u64_div_u64 - unsigned 64bit multiply and divide
> + * @a: first unsigned 64bit multiplicand
> + * @b: second unsigned 64bit multiplicand
> + * @d: unsigned 64bit divisor
> + *
> + * Multiply two 64bit values together to generate a 128bit product
> + * and then divide by a third value.
> + * May BUG()/trap if @d is zero or the quotient exceeds 64 bits.
> + *
> + * Return: @a * @b / @d
> + */
> +#define mul_u64_u64_div_u64(a, b, d) mul_u64_add_u64_div_u64(a, b, 0, d)
> +
> +/**
> + * mul_u64_u64_div_u64_roundup - unsigned 64bit multiply and divide rounded up
> + * @a: first unsigned 64bit multiplicand
> + * @b: second unsigned 64bit multiplicand
> + * @d: unsigned 64bit divisor
> + *
> + * Multiply two 64bit values together to generate a 128bit product
> + * and then divide and round up.
> + * May BUG()/trap if @d is zero or the quotient exceeds 64 bits.
> + *
> + * Return: (@a * @b + @d - 1) / @d
> + */
> +#define mul_u64_u64_div_u64_roundup(a, b, d) \
> +	({ u64 _tmp = (d); mul_u64_add_u64_div_u64(a, b, _tmp - 1, _tmp); })
>  
>  /**
>   * DIV64_U64_ROUND_UP - unsigned 64bit divide with 64bit divisor rounded up
> diff --git a/lib/math/div64.c b/lib/math/div64.c
> index 5faa29208bdb..50e025174495 100644
> --- a/lib/math/div64.c
> +++ b/lib/math/div64.c
> @@ -183,26 +183,28 @@ u32 iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
>  }
>  EXPORT_SYMBOL(iter_div_u64_rem);
>  
> -#ifndef mul_u64_u64_div_u64
> -u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
> +#if !defined(mul_u64_add_u64_div_u64)
> +u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
>  {
> -	if (ilog2(a) + ilog2(b) <= 62)
> -		return div64_u64(a * b, c);
> -
>  #if defined(__SIZEOF_INT128__)
>  
>  	/* native 64x64=128 bits multiplication */
> -	u128 prod = (u128)a * b;
> +	u128 prod = (u128)a * b + c;
>  	u64 n_lo = prod, n_hi = prod >> 64;
>  
>  #else
>  
> +	if (!c && ilog2(a) + ilog2(b) <= 62)
> +		return div64_u64(a * b, d);
> +
>  	/* perform a 64x64=128 bits multiplication manually */
>  	u32 a_lo = a, a_hi = a >> 32, b_lo = b, b_hi = b >> 32;
>  	u64 x, y, z;
>  
> -	x = (u64)a_lo * b_lo;
> +	/* Since (x-1)(x-1) + 2(x-1) == x.x - 1 two u32 can be added to a u64 */
> +	x = (u64)a_lo * b_lo + (u32)c;
>  	y = (u64)a_lo * b_hi + (u32)(x >> 32);
> +	y += (u32)(c >> 32);
>  	z = (u64)a_hi * b_hi + (u32)(y >> 32);
>  	y = (u64)a_hi * b_lo + (u32)y;
>  	z += (u32)(y >> 32);
> @@ -212,36 +214,39 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
>  
>  #endif
>  
> -	/* make sure c is not zero, trigger exception otherwise */
> +	if (!n_hi)
> +		return div64_u64(n_lo, d);
> +
> +	/* make sure d is not zero, trigger exception otherwise */
>  #pragma GCC diagnostic push
>  #pragma GCC diagnostic ignored "-Wdiv-by-zero"
> -	if (unlikely(c == 0))
> +	if (unlikely(d == 0))
>  		return 1/0;
>  #pragma GCC diagnostic pop
>  
> -	int shift = __builtin_ctzll(c);
> +	int shift = __builtin_ctzll(d);
>  
>  	/* try reducing the fraction in case the dividend becomes <= 64 bits */
>  	if ((n_hi >> shift) == 0) {
>  		u64 n = shift ? (n_lo >> shift) | (n_hi << (64 - shift)) : n_lo;
>  
> -		return div64_u64(n, c >> shift);
> +		return div64_u64(n, d >> shift);
>  		/*
>  		 * The remainder value if needed would be:
> -		 *   res = div64_u64_rem(n, c >> shift, &rem);
> +		 *   res = div64_u64_rem(n, d >> shift, &rem);
>  		 *   rem = (rem << shift) + (n_lo - (n << shift));
>  		 */
>  	}
>  
> -	if (n_hi >= c) {
> +	if (n_hi >= d) {
>  		/* overflow: result is unrepresentable in a u64 */
>  		return -1;
>  	}
>  
>  	/* Do the full 128 by 64 bits division */
>  
> -	shift = __builtin_clzll(c);
> -	c <<= shift;
> +	shift = __builtin_clzll(d);
> +	d <<= shift;
>  
>  	int p = 64 + shift;
>  	u64 res = 0;
> @@ -256,8 +261,8 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
>  		n_hi <<= shift;
>  		n_hi |= n_lo >> (64 - shift);
>  		n_lo <<= shift;
> -		if (carry || (n_hi >= c)) {
> -			n_hi -= c;
> +		if (carry || (n_hi >= d)) {
> +			n_hi -= d;
>  			res |= 1ULL << p;
>  		}
>  	} while (n_hi);
> @@ -265,5 +270,5 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
>  
>  	return res;
>  }
> -EXPORT_SYMBOL(mul_u64_u64_div_u64);
> +EXPORT_SYMBOL(mul_u64_add_u64_div_u64);
>  #endif
> -- 
> 2.39.5
> 
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] lib: Add tests for mul_u64_u64_div_u64_roundup()
  2025-04-05 20:45 ` [PATCH 2/3] lib: Add tests for mul_u64_u64_div_u64_roundup() David Laight
@ 2025-04-06  1:47   ` Nicolas Pitre
  0 siblings, 0 replies; 14+ messages in thread
From: Nicolas Pitre @ 2025-04-06  1:47 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, Uwe Kleine-König, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Sat, 5 Apr 2025, David Laight wrote:

> Replicate the existing mul_u64_u64_div_u64() test cases with round up.
> Update the shell script that verifies the table, remove the comment
> markers so that it can be directly pasted into a shell.
> 
> It any tests fail then fail the module load with -EINVAL.
> 
> Signed-off-by: David Laight <david.laight.linux@gmail.com>

Reviewed-by: Nicolas Pitre <npitre@baylibre.com>

> ---
>  lib/math/test_mul_u64_u64_div_u64.c | 112 ++++++++++++++++------------
>  1 file changed, 66 insertions(+), 46 deletions(-)
> 
> diff --git a/lib/math/test_mul_u64_u64_div_u64.c b/lib/math/test_mul_u64_u64_div_u64.c
> index 58d058de4e73..9548eb7458c7 100644
> --- a/lib/math/test_mul_u64_u64_div_u64.c
> +++ b/lib/math/test_mul_u64_u64_div_u64.c
> @@ -10,61 +10,72 @@
>  #include <linux/printk.h>
>  #include <linux/math64.h>
>  
> -typedef struct { u64 a; u64 b; u64 c; u64 result; } test_params;
> +typedef struct { u64 a; u64 b; u64 c; u64 result; uint round_up;} test_params;
>  
>  static test_params test_values[] = {
>  /* this contains many edge values followed by a couple random values */
> -{                0xb,                0x7,                0x3,               0x19 },
> -{         0xffff0000,         0xffff0000,                0xf, 0x1110eeef00000000 },
> -{         0xffffffff,         0xffffffff,                0x1, 0xfffffffe00000001 },
> -{         0xffffffff,         0xffffffff,                0x2, 0x7fffffff00000000 },
> -{        0x1ffffffff,         0xffffffff,                0x2, 0xfffffffe80000000 },
> -{        0x1ffffffff,         0xffffffff,                0x3, 0xaaaaaaa9aaaaaaab },
> -{        0x1ffffffff,        0x1ffffffff,                0x4, 0xffffffff00000000 },
> -{ 0xffff000000000000, 0xffff000000000000, 0xffff000000000001, 0xfffeffffffffffff },
> -{ 0x3333333333333333, 0x3333333333333333, 0x5555555555555555, 0x1eb851eb851eb851 },
> -{ 0x7fffffffffffffff,                0x2,                0x3, 0x5555555555555554 },
> -{ 0xffffffffffffffff,                0x2, 0x8000000000000000,                0x3 },
> -{ 0xffffffffffffffff,                0x2, 0xc000000000000000,                0x2 },
> -{ 0xffffffffffffffff, 0x4000000000000004, 0x8000000000000000, 0x8000000000000007 },
> -{ 0xffffffffffffffff, 0x4000000000000001, 0x8000000000000000, 0x8000000000000001 },
> -{ 0xffffffffffffffff, 0x8000000000000001, 0xffffffffffffffff, 0x8000000000000001 },
> -{ 0xfffffffffffffffe, 0x8000000000000001, 0xffffffffffffffff, 0x8000000000000000 },
> -{ 0xffffffffffffffff, 0x8000000000000001, 0xfffffffffffffffe, 0x8000000000000001 },
> -{ 0xffffffffffffffff, 0x8000000000000001, 0xfffffffffffffffd, 0x8000000000000002 },
> -{ 0x7fffffffffffffff, 0xffffffffffffffff, 0xc000000000000000, 0xaaaaaaaaaaaaaaa8 },
> -{ 0xffffffffffffffff, 0x7fffffffffffffff, 0xa000000000000000, 0xccccccccccccccca },
> -{ 0xffffffffffffffff, 0x7fffffffffffffff, 0x9000000000000000, 0xe38e38e38e38e38b },
> -{ 0x7fffffffffffffff, 0x7fffffffffffffff, 0x5000000000000000, 0xccccccccccccccc9 },
> -{ 0xffffffffffffffff, 0xfffffffffffffffe, 0xffffffffffffffff, 0xfffffffffffffffe },
> -{ 0xe6102d256d7ea3ae, 0x70a77d0be4c31201, 0xd63ec35ab3220357, 0x78f8bf8cc86c6e18 },
> -{ 0xf53bae05cb86c6e1, 0x3847b32d2f8d32e0, 0xcfd4f55a647f403c, 0x42687f79d8998d35 },
> -{ 0x9951c5498f941092, 0x1f8c8bfdf287a251, 0xa3c8dc5f81ea3fe2, 0x1d887cb25900091f },
> -{ 0x374fee9daa1bb2bb, 0x0d0bfbff7b8ae3ef, 0xc169337bd42d5179, 0x03bb2dbaffcbb961 },
> -{ 0xeac0d03ac10eeaf0, 0x89be05dfa162ed9b, 0x92bb1679a41f0e4b, 0xdc5f5cc9e270d216 },
> +{                0xb,                0x7,                0x3,               0x19, 1 },
> +{         0xffff0000,         0xffff0000,                0xf, 0x1110eeef00000000, 0 },
> +{         0xffffffff,         0xffffffff,                0x1, 0xfffffffe00000001, 0 },
> +{         0xffffffff,         0xffffffff,                0x2, 0x7fffffff00000000, 1 },
> +{        0x1ffffffff,         0xffffffff,                0x2, 0xfffffffe80000000, 1 },
> +{        0x1ffffffff,         0xffffffff,                0x3, 0xaaaaaaa9aaaaaaab, 0 },
> +{        0x1ffffffff,        0x1ffffffff,                0x4, 0xffffffff00000000, 1 },
> +{ 0xffff000000000000, 0xffff000000000000, 0xffff000000000001, 0xfffeffffffffffff, 1 },
> +{ 0x3333333333333333, 0x3333333333333333, 0x5555555555555555, 0x1eb851eb851eb851, 1 },
> +{ 0x7fffffffffffffff,                0x2,                0x3, 0x5555555555555554, 1 },
> +{ 0xffffffffffffffff,                0x2, 0x8000000000000000,                0x3, 1 },
> +{ 0xffffffffffffffff,                0x2, 0xc000000000000000,                0x2, 1 },
> +{ 0xffffffffffffffff, 0x4000000000000004, 0x8000000000000000, 0x8000000000000007, 1 },
> +{ 0xffffffffffffffff, 0x4000000000000001, 0x8000000000000000, 0x8000000000000001, 1 },
> +{ 0xffffffffffffffff, 0x8000000000000001, 0xffffffffffffffff, 0x8000000000000001, 0 },
> +{ 0xfffffffffffffffe, 0x8000000000000001, 0xffffffffffffffff, 0x8000000000000000, 1 },
> +{ 0xffffffffffffffff, 0x8000000000000001, 0xfffffffffffffffe, 0x8000000000000001, 1 },
> +{ 0xffffffffffffffff, 0x8000000000000001, 0xfffffffffffffffd, 0x8000000000000002, 1 },
> +{ 0x7fffffffffffffff, 0xffffffffffffffff, 0xc000000000000000, 0xaaaaaaaaaaaaaaa8, 1 },
> +{ 0xffffffffffffffff, 0x7fffffffffffffff, 0xa000000000000000, 0xccccccccccccccca, 1 },
> +{ 0xffffffffffffffff, 0x7fffffffffffffff, 0x9000000000000000, 0xe38e38e38e38e38b, 1 },
> +{ 0x7fffffffffffffff, 0x7fffffffffffffff, 0x5000000000000000, 0xccccccccccccccc9, 1 },
> +{ 0xffffffffffffffff, 0xfffffffffffffffe, 0xffffffffffffffff, 0xfffffffffffffffe, 0 },
> +{ 0xe6102d256d7ea3ae, 0x70a77d0be4c31201, 0xd63ec35ab3220357, 0x78f8bf8cc86c6e18, 1 },
> +{ 0xf53bae05cb86c6e1, 0x3847b32d2f8d32e0, 0xcfd4f55a647f403c, 0x42687f79d8998d35, 1 },
> +{ 0x9951c5498f941092, 0x1f8c8bfdf287a251, 0xa3c8dc5f81ea3fe2, 0x1d887cb25900091f, 1 },
> +{ 0x374fee9daa1bb2bb, 0x0d0bfbff7b8ae3ef, 0xc169337bd42d5179, 0x03bb2dbaffcbb961, 1 },
> +{ 0xeac0d03ac10eeaf0, 0x89be05dfa162ed9b, 0x92bb1679a41f0e4b, 0xdc5f5cc9e270d216, 1 },
>  };
>  
>  /*
>   * The above table can be verified with the following shell script:
> - *
> - * #!/bin/sh
> - * sed -ne 's/^{ \+\(.*\), \+\(.*\), \+\(.*\), \+\(.*\) },$/\1 \2 \3 \4/p' \
> - *     lib/math/test_mul_u64_u64_div_u64.c |
> - * while read a b c r; do
> - *   expected=$( printf "obase=16; ibase=16; %X * %X / %X\n" $a $b $c | bc )
> - *   given=$( printf "%X\n" $r )
> - *   if [ "$expected" = "$given" ]; then
> - *     echo "$a * $b / $c = $r OK"
> - *   else
> - *     echo "$a * $b / $c = $r is wrong" >&2
> - *     echo "should be equivalent to 0x$expected" >&2
> - *     exit 1
> - *   fi
> - * done
> +
> +#!/bin/sh
> +sed -ne 's/^{ \+\(.*\), \+\(.*\), \+\(.*\), \+\(.*\), \+\(.*\) },$/\1 \2 \3 \4 \5/p' \
> +    lib/math/test_mul_u64_u64_div_u64.c |
> +while read a b c r d; do
> +  expected=$( printf "obase=16; ibase=16; %X * %X / %X\n" $a $b $c | bc )
> +  given=$( printf "%X\n" $r )
> +  if [ "$expected" = "$given" ]; then
> +    echo "$a * $b  / $c = $r OK"
> +  else
> +    echo "$a * $b  / $c = $r is wrong" >&2
> +    echo "should be equivalent to 0x$expected" >&2
> +    exit 1
> +  fi
> +  expected=$( printf "obase=16; ibase=16; (%X * %X + %X) / %X\n" $a $b $((c-1)) $c | bc )
> +  given=$( printf "%X\n" $((r + d)) )
> +  if [ "$expected" = "$given" ]; then
> +    echo "$a * $b +/ $c = $(printf '%#x' $((r + d))) OK"
> +  else
> +    echo "$a * $b +/ $c = $(printf '%#x' $((r + d))) is wrong" >&2
> +    echo "should be equivalent to 0x$expected" >&2
> +    exit 1
> +  fi
> +done
> +
>   */
>  
>  static int __init test_init(void)
>  {
> +	int errors = 0;
>  	int i;
>  
>  	pr_info("Starting mul_u64_u64_div_u64() test\n");
> @@ -75,16 +86,25 @@ static int __init test_init(void)
>  		u64 c = test_values[i].c;
>  		u64 expected_result = test_values[i].result;
>  		u64 result = mul_u64_u64_div_u64(a, b, c);
> +		u64 result_up = mul_u64_u64_div_u64_roundup(a, b, c);
>  
>  		if (result != expected_result) {
>  			pr_err("ERROR: 0x%016llx * 0x%016llx / 0x%016llx\n", a, b, c);
>  			pr_err("ERROR: expected result: %016llx\n", expected_result);
>  			pr_err("ERROR: obtained result: %016llx\n", result);
> +			errors++;
> +		}
> +		expected_result += test_values[i].round_up;
> +		if (result_up != expected_result) {
> +			pr_err("ERROR: 0x%016llx * 0x%016llx +/ 0x%016llx\n", a, b, c);
> +			pr_err("ERROR: expected result: %016llx\n", expected_result);
> +			pr_err("ERROR: obtained result: %016llx\n", result_up);
> +			errors++;
>  		}
>  	}
>  
> -	pr_info("Completed mul_u64_u64_div_u64() test\n");
> -	return 0;
> +	pr_info("Completed mul_u64_u64_div_u64() test, %d errors\n", errors);
> +	return errors ? -EINVAL : 0;
>  }
>  
>  static void __exit test_exit(void)
> -- 
> 2.39.5
> 
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] lib: Update the muldiv64 tests to verify the C on x86-64
  2025-04-05 20:45 ` [PATCH 3/3] lib: Update the muldiv64 tests to verify the C on x86-64 David Laight
@ 2025-04-06  2:26   ` Nicolas Pitre
  0 siblings, 0 replies; 14+ messages in thread
From: Nicolas Pitre @ 2025-04-06  2:26 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, Uwe Kleine-König, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Sat, 5 Apr 2025, David Laight wrote:

> div64.c contains a 128 by 64 division algorithm which x86-64 overrides
> it with an asm implementation.
> So running the muldiv64 tests only verifies the asm code.
> Since x86-64 is the most likely test system compile the default
> code into an x86-64 kernel (under a different name) when the tests
> are being built.
> Verify that both the asm and C functions generate the correct results.
> 
> Signed-off-by: David Laight <david.laight.linux@gmail.com>
> ---
>  lib/math/div64.c                    | 18 ++++++++++++++++--
>  lib/math/test_mul_u64_u64_div_u64.c | 26 ++++++++++++++++++++------
>  2 files changed, 36 insertions(+), 8 deletions(-)
> 
> diff --git a/lib/math/div64.c b/lib/math/div64.c
> index 50e025174495..38ee5c01c288 100644
> --- a/lib/math/div64.c
> +++ b/lib/math/div64.c
> @@ -25,6 +25,8 @@
>  #include <linux/minmax.h>
>  #include <linux/log2.h>
>  
> +#include <generated/autoconf.h>

Isn't this automatically included everywhere by the Makefile?

> @@ -183,10 +185,22 @@ u32 iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
>  }
>  EXPORT_SYMBOL(iter_div_u64_rem);
>  
> -#if !defined(mul_u64_add_u64_div_u64)
> +/*
> + * If the architecture overrides the implementation below and the test module
> + * is being built then compile the default implementation with a different name
> + * so that it can be tested.
> + */
> +#if defined(mul_u64_add_u64_div_u64) && (defined(CONFIG_TEST_MULDIV64) || defined(CONFIG_TEST_MULDIV64_MODULE))

You could shorten this to:

#if defined(mul_u64_add_u64_div_u64) && IS_ENABLED(CONFIG_TEST_MULDIV64)

> +#define TEST_MULDIV64

Then I'd use IS_ENABLED(CONFIG_TEST_MULDIV64) in place of TEST_MULDIV64.
It is more self explanatory.

> +#undef mul_u64_add_u64_div_u64
> +#define mul_u64_add_u64_div_u64 mul_u64_add_u64_div_u64_test
> +u64 mul_u64_add_u64_div_u64_test(u64 a, u64 b, u64 c, u64 d);
> +#endif

Hmmm... I wish there could be a better way to do this, but other than 
the above suggestion I don't see one.

> +
> +#if !defined( mul_u64_add_u64_div_u64) || defined(TEST_MULDIV64)
>  u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
>  {
> -#if defined(__SIZEOF_INT128__)
> +#if defined(__SIZEOF_INT128__) && !defined(TEST_MULDIV64)
>  
>  	/* native 64x64=128 bits multiplication */
>  	u128 prod = (u128)a * b + c;
> diff --git a/lib/math/test_mul_u64_u64_div_u64.c b/lib/math/test_mul_u64_u64_div_u64.c
> index 9548eb7458c7..e2289b412601 100644
> --- a/lib/math/test_mul_u64_u64_div_u64.c
> +++ b/lib/math/test_mul_u64_u64_div_u64.c
> @@ -73,6 +73,10 @@ done
>  
>   */
>  
> +#ifdef mul_u64_add_u64_div_u64
> +u64 mul_u64_add_u64_div_u64_test(u64 a, u64 b, u64 add, u64 c);
> +#endif
> +
>  static int __init test_init(void)
>  {
>  	int errors = 0;
> @@ -80,21 +84,31 @@ static int __init test_init(void)
>  
>  	pr_info("Starting mul_u64_u64_div_u64() test\n");
>  
> -	for (i = 0; i < ARRAY_SIZE(test_values); i++) {
> -		u64 a = test_values[i].a;
> -		u64 b = test_values[i].b;
> -		u64 c = test_values[i].c;
> -		u64 expected_result = test_values[i].result;
> +	for (i = 0; i < ARRAY_SIZE(test_values) * 2; i++) {
> +		u64 a = test_values[i / 2].a;
> +		u64 b = test_values[i / 2].b;
> +		u64 c = test_values[i / 2].c;
> +		u64 expected_result = test_values[i / 2].result;

I don't see the point of the loop doubling here.
If I understand it correctly, you'll test the default version twice and 
the _test version once for each test entry.


>  		u64 result = mul_u64_u64_div_u64(a, b, c);
>  		u64 result_up = mul_u64_u64_div_u64_roundup(a, b, c);
>  
> +#ifdef mul_u64_add_u64_div_u64
> +		if (i & 1) {
> +			/* Verify the generic C version */
> +			result = mul_u64_add_u64_div_u64_test(a, b, 0, c);
> +			result_up = mul_u64_add_u64_div_u64_test(a, b, c - 1, c);
> +		}
> +#else
> +		i++;
> +#endif
> +
>  		if (result != expected_result) {
>  			pr_err("ERROR: 0x%016llx * 0x%016llx / 0x%016llx\n", a, b, c);
>  			pr_err("ERROR: expected result: %016llx\n", expected_result);
>  			pr_err("ERROR: obtained result: %016llx\n", result);
>  			errors++;
>  		}
> -		expected_result += test_values[i].round_up;
> +		expected_result += test_values[i / 2].round_up;
>  		if (result_up != expected_result) {
>  			pr_err("ERROR: 0x%016llx * 0x%016llx +/ 0x%016llx\n", a, b, c);
>  			pr_err("ERROR: expected result: %016llx\n", expected_result);
> -- 
> 2.39.5
> 
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
  2025-04-06  1:46   ` Nicolas Pitre
@ 2025-04-06  3:06     ` Nicolas Pitre
  2025-04-06  9:35     ` David Laight
  1 sibling, 0 replies; 14+ messages in thread
From: Nicolas Pitre @ 2025-04-06  3:06 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, Uwe Kleine-König, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Sat, 5 Apr 2025, Nicolas Pitre wrote:

> On Sat, 5 Apr 2025, David Laight wrote:
> 
> > The existing mul_u64_u64_div_u64() rounds down, a 'rounding up'
> > variant needs 'divisor - 1' adding in between the multiply and
> > divide so cannot easily be done by a caller.
> > 
> > Add mul_u64_add_u64_div_u64(a, b, c, d) that calculates (a * b + c)/d
> > and implement the 'round down' and 'round up' using it.
> > 
> > Update the x86-64 asm to optimise for 'c' being a constant zero.
> > 
> > For architectures that support u128 check for a 64bit product after
> > the multiply (will be cheap).
> > Leave in the early check for other architectures (mostly 32bit) when
> > 'c' is zero to avoid the multi-part multiply.
> > 
> > Note that the cost of the 128bit divide will dwarf the rest of the code.
> > This function is very slow on everything except x86-64 (very very slow
> > on 32bit).
> > 
> > Add kerndoc definitions for all three functions.
> > 
> > Signed-off-by: David Laight <david.laight.linux@gmail.com>
> 
> Reviewed-by: Nicolas Pitre <npitre@baylibre.com>
> 
> Sidenote: The 128-bits division cost is proportional to the number of 
> bits in the final result. So if the result is 0x0080000000000000 then 
> the loop will execute only once and exit early.

Just to clarify what I said: the 128-bits division cost is proportional 
to the number of _set_ bits in the final result.


Nicolas

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
  2025-04-06  1:46   ` Nicolas Pitre
  2025-04-06  3:06     ` Nicolas Pitre
@ 2025-04-06  9:35     ` David Laight
  2025-04-06 12:30       ` David Laight
  1 sibling, 1 reply; 14+ messages in thread
From: David Laight @ 2025-04-06  9:35 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Andrew Morton, linux-kernel, Uwe Kleine-König, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Sat, 5 Apr 2025 21:46:25 -0400 (EDT)
Nicolas Pitre <npitre@baylibre.com> wrote:

> On Sat, 5 Apr 2025, David Laight wrote:
> 
> > The existing mul_u64_u64_div_u64() rounds down, a 'rounding up'
> > variant needs 'divisor - 1' adding in between the multiply and
> > divide so cannot easily be done by a caller.
> > 
> > Add mul_u64_add_u64_div_u64(a, b, c, d) that calculates (a * b + c)/d
> > and implement the 'round down' and 'round up' using it.
> > 
> > Update the x86-64 asm to optimise for 'c' being a constant zero.
> > 
> > For architectures that support u128 check for a 64bit product after
> > the multiply (will be cheap).
> > Leave in the early check for other architectures (mostly 32bit) when
> > 'c' is zero to avoid the multi-part multiply.
> > 
> > Note that the cost of the 128bit divide will dwarf the rest of the code.
> > This function is very slow on everything except x86-64 (very very slow
> > on 32bit).
> > 
> > Add kerndoc definitions for all three functions.
> > 
> > Signed-off-by: David Laight <david.laight.linux@gmail.com>  
> 
> Reviewed-by: Nicolas Pitre <npitre@baylibre.com>
> 
> Sidenote: The 128-bits division cost is proportional to the number of 
> bits in the final result. So if the result is 0x0080000000000000 then 
> the loop will execute only once and exit early.

Some performance measurements for the test cases:
0: ok    50    25    19    19    19    19    19    19    19    19 mul_u64_u64_div_u64 
1: ok     2     0     0     0     0     0     0     0     0     0 mul_u64_u64_div_u64 
2: ok     4     4     4     4     4     4     4     4     4     4 mul_u64_u64_div_u64 
3: ok     4     4     4     4     4     4     4     4     4     4 mul_u64_u64_div_u64 
4: ok     4     4     4     4     4     4     4     4     4     4 mul_u64_u64_div_u64 
5: ok    15     8     4     4     4     4     4     4     4     4 mul_u64_u64_div_u64 
6: ok   275   225   223   223   223   223   223   224   224   223 mul_u64_u64_div_u64 
7: ok     9     6     4     4     4     4     5     5     4     4 mul_u64_u64_div_u64 
8: ok   241   192   187   187   187   187   187   188   187   187 mul_u64_u64_div_u64 
9: ok   212   172   169   169   169   169   169   169   169   169 mul_u64_u64_div_u64 
10: ok    12     6     4     4     4     4     4     4     4     4 mul_u64_u64_div_u64 
11: ok     9     5     4     4     4     4     4     4     4     4 mul_u64_u64_div_u64 
12: ok     6     4     4     4     4     4     4     4     4     4 mul_u64_u64_div_u64 
13: ok     6     5     5     4     4     4     4     4     4     4 mul_u64_u64_div_u64 
14: ok     4     4     5     5     4     4     4     4     4     5 mul_u64_u64_div_u64 
15: ok    18    12     8     8     8     8     8     8     8     8 mul_u64_u64_div_u64 
16: ok    18    11     6     6     6     6     6     6     6     6 mul_u64_u64_div_u64 
17: ok    22    16    11    11    11    11    11    11    11    11 mul_u64_u64_div_u64 
18: ok    25    18     9     9     9     9     9    10     9    10 mul_u64_u64_div_u64 
19: ok   272   231   227   227   226   227   227   227   227   226 mul_u64_u64_div_u64 
20: ok   198   188   185   185   185   186   185   185   186   186 mul_u64_u64_div_u64 
21: ok   207   198   196   196   196   196   196   196   196   196 mul_u64_u64_div_u64 
22: ok   201   189   190   189   190   189   190   189   190   189 mul_u64_u64_div_u64 
23: ok   224   184   181   181   181   181   180   180   181   181 mul_u64_u64_div_u64 
24: ok   238   185   179   179   179   179   179   179   179   179 mul_u64_u64_div_u64 
25: ok   208   178   177   177   177   177   177   177   177   177 mul_u64_u64_div_u64 
26: ok   170   146   139   139   139   139   139   139   139   139 mul_u64_u64_div_u64 
27: ok   256   204   196   196   196   196   196   196   196   196 mul_u64_u64_div_u64 
28: ok   227   195   194   195   194   195   194   195   194   195 mul_u64_u64_div_u64 

Entry 0 is an extra test and is the test overhead - subtracted from the others.
Each column is clocks for one run of the test, but for this set I'm running
the actual test 16 times and later dividing the clock count by 16.
It doesn't make much difference though, so the cost of the
	mfence; rdpmc; mfence; nop_test; mfence; rdpmc; mfence
really is about 190 clocks (between the rdpmc results).

As soon as you hit a non-trival case the number of clocks increases
dramatically.

This is on a zen5 in 64bit mode ignoring the u128 path.
(I don't have the packages installed to compile a 32bit binary).

Maybe I can compile it for arm32, it'll need the mfence and rdpmc changing.
But maybe something simple will be ok on a pi-5.

(oh and yes, I didn't need to include autoconf.h)

	David




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
  2025-04-06  9:35     ` David Laight
@ 2025-04-06 12:30       ` David Laight
  0 siblings, 0 replies; 14+ messages in thread
From: David Laight @ 2025-04-06 12:30 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Andrew Morton, linux-kernel, Uwe Kleine-König, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Sun, 6 Apr 2025 10:35:16 +0100
David Laight <david.laight.linux@gmail.com> wrote:

> On Sat, 5 Apr 2025 21:46:25 -0400 (EDT)
> Nicolas Pitre <npitre@baylibre.com> wrote:
> 
> > On Sat, 5 Apr 2025, David Laight wrote:
> >   
> > > The existing mul_u64_u64_div_u64() rounds down, a 'rounding up'
> > > variant needs 'divisor - 1' adding in between the multiply and
> > > divide so cannot easily be done by a caller.
> > > 
> > > Add mul_u64_add_u64_div_u64(a, b, c, d) that calculates (a * b + c)/d
> > > and implement the 'round down' and 'round up' using it.
> > > 
> > > Update the x86-64 asm to optimise for 'c' being a constant zero.
> > > 
> > > For architectures that support u128 check for a 64bit product after
> > > the multiply (will be cheap).
> > > Leave in the early check for other architectures (mostly 32bit) when
> > > 'c' is zero to avoid the multi-part multiply.
> > > 
> > > Note that the cost of the 128bit divide will dwarf the rest of the code.
> > > This function is very slow on everything except x86-64 (very very slow
> > > on 32bit).
> > > 
> > > Add kerndoc definitions for all three functions.
> > > 
> > > Signed-off-by: David Laight <david.laight.linux@gmail.com>    
> > 
> > Reviewed-by: Nicolas Pitre <npitre@baylibre.com>
> > 
> > Sidenote: The 128-bits division cost is proportional to the number of 
> > bits in the final result. So if the result is 0x0080000000000000 then 
> > the loop will execute only once and exit early.  
> 
> Some performance measurements for the test cases:
> 0: ok    50    25    19    19    19    19    19    19    19    19 mul_u64_u64_div_u64 
> 1: ok     2     0     0     0     0     0     0     0     0     0 mul_u64_u64_div_u64 
> 2: ok     4     4     4     4     4     4     4     4     4     4 mul_u64_u64_div_u64 
> 3: ok     4     4     4     4     4     4     4     4     4     4 mul_u64_u64_div_u64 
> 4: ok     4     4     4     4     4     4     4     4     4     4 mul_u64_u64_div_u64 
> 5: ok    15     8     4     4     4     4     4     4     4     4 mul_u64_u64_div_u64 
> 6: ok   275   225   223   223   223   223   223   224   224   223 mul_u64_u64_div_u64 
> 7: ok     9     6     4     4     4     4     5     5     4     4 mul_u64_u64_div_u64 
> 8: ok   241   192   187   187   187   187   187   188   187   187 mul_u64_u64_div_u64 
> 9: ok   212   172   169   169   169   169   169   169   169   169 mul_u64_u64_div_u64 
> 10: ok    12     6     4     4     4     4     4     4     4     4 mul_u64_u64_div_u64 
> 11: ok     9     5     4     4     4     4     4     4     4     4 mul_u64_u64_div_u64 
> 12: ok     6     4     4     4     4     4     4     4     4     4 mul_u64_u64_div_u64 
> 13: ok     6     5     5     4     4     4     4     4     4     4 mul_u64_u64_div_u64 
> 14: ok     4     4     5     5     4     4     4     4     4     5 mul_u64_u64_div_u64 
> 15: ok    18    12     8     8     8     8     8     8     8     8 mul_u64_u64_div_u64 
> 16: ok    18    11     6     6     6     6     6     6     6     6 mul_u64_u64_div_u64 
> 17: ok    22    16    11    11    11    11    11    11    11    11 mul_u64_u64_div_u64 
> 18: ok    25    18     9     9     9     9     9    10     9    10 mul_u64_u64_div_u64 
> 19: ok   272   231   227   227   226   227   227   227   227   226 mul_u64_u64_div_u64 
> 20: ok   198   188   185   185   185   186   185   185   186   186 mul_u64_u64_div_u64 
> 21: ok   207   198   196   196   196   196   196   196   196   196 mul_u64_u64_div_u64 
> 22: ok   201   189   190   189   190   189   190   189   190   189 mul_u64_u64_div_u64 
> 23: ok   224   184   181   181   181   181   180   180   181   181 mul_u64_u64_div_u64 
> 24: ok   238   185   179   179   179   179   179   179   179   179 mul_u64_u64_div_u64 
> 25: ok   208   178   177   177   177   177   177   177   177   177 mul_u64_u64_div_u64 
> 26: ok   170   146   139   139   139   139   139   139   139   139 mul_u64_u64_div_u64 
> 27: ok   256   204   196   196   196   196   196   196   196   196 mul_u64_u64_div_u64 
> 28: ok   227   195   194   195   194   195   194   195   194   195 mul_u64_u64_div_u64 
> 
> Entry 0 is an extra test and is the test overhead - subtracted from the others.

Except I'd failed to add the 'if (!a) return' so it was doing all the multiples.

> Each column is clocks for one run of the test, but for this set I'm running
> the actual test 16 times and later dividing the clock count by 16.
> It doesn't make much difference though, so the cost of the
> 	mfence; rdpmc; mfence; nop_test; mfence; rdpmc; mfence
> really is about 190 clocks (between the rdpmc results).
> 
> As soon as you hit a non-trival case the number of clocks increases
> dramatically.
> 
> This is on a zen5 in 64bit mode ignoring the u128 path.
> (I don't have the packages installed to compile a 32bit binary).

I had a brain wave.
If I make the mul_div depend on the result of the first rdpmc and make the
second rdpmc depend on the mul_div all the mfence can be removed.
(easily done by + (value & volatile_zero)).
(I need to do the same 'trick' for my 'rep movsb' measurements.)

I then get (for 10 single calls of each mul_div):
0: ok   109    62    27    26    26    26    26    26    26    26 mul_u64_u64_div_u64 
1: ok   100    48    47    26    25    25    25    25    25    25 mul_u64_u64_div_u64 
2: ok   306    31    31    31    31    31    31    31    31    31 mul_u64_u64_div_u64 
3: ok    32    32    32    32    32    32    32    32    32    32 mul_u64_u64_div_u64 
4: ok   329    31    31    31    31    31    31    31    31    31 mul_u64_u64_div_u64 
5: ok   108    61    59    34    34    34    34    34    34    34 mul_u64_u64_div_u64 
6: ok   892   462   290   245   243   243   243   243   243   243 mul_u64_u64_div_u64 
7: ok    95    80    34    34    34    34    34    34    34    34 mul_u64_u64_div_u64 
8: ok   598   500   217   218   216   214   213   218   216   218 mul_u64_u64_div_u64 
9: ok   532   461   228   186   187   189   189   189   189   189 mul_u64_u64_div_u64 
10: ok    57    53    31    31    31    31    31    31    31    31 mul_u64_u64_div_u64 
11: ok    71    41    41    27    27    27    27    27    27    27 mul_u64_u64_div_u64 
12: ok    54    27    27    27    27    27    27    27    27    27 mul_u64_u64_div_u64 
13: ok    60    34    34    34    34    34    34    34    34    34 mul_u64_u64_div_u64 
14: ok    34    34    34    34    34    34    34    34    34    34 mul_u64_u64_div_u64 
15: ok    74    94    24    24    24    24    24    24    24    24 mul_u64_u64_div_u64 
16: ok   124    46    45    20    20    20    20    20    20    20 mul_u64_u64_div_u64 
17: ok   140    79    26    24    25    24    25    24    25    24 mul_u64_u64_div_u64 
18: ok   144   106    24    23    23    23    23    23    23    23 mul_u64_u64_div_u64 
19: ok   569   357   204   203   204   203   204   203   204   203 mul_u64_u64_div_u64 
20: ok   263   279   240   208   208   208   208   208   208   208 mul_u64_u64_div_u64 
21: ok   351   580   397   216   215   215   215   215   215   215 mul_u64_u64_div_u64 
22: ok   293   267   267   208   209   207   208   207   208   207 mul_u64_u64_div_u64 
23: ok   536   319   236   236   236   236   236   236   236   236 mul_u64_u64_div_u64 
24: ok   984   334   194   194   197   193   193   193   193   193 mul_u64_u64_div_u64 
25: ok   630   360   195   199   199   199   199   199   199   199 mul_u64_u64_div_u64 
26: ok   558   239   150   149   148   151   149   151   149   151 mul_u64_u64_div_u64 
27: ok   997   414   215   219   214   219   214   219   214   219 mul_u64_u64_div_u64 
28: ok   679   398   216   216   213   215   217   216   216   213 mul_u64_u64_div_u64 

which now includes the cost of the divide when the product is 64bit.
The first few results on each row are affected by things like cache reads and
branch prediction.
But the later ones are pretty stable.

	David






^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH next 0/3] lib: Implement mul_u64_u64_div_u64_roundup()
  2025-04-05 20:45 [PATCH next 0/3] lib: Implement mul_u64_u64_div_u64_roundup() David Laight
                   ` (2 preceding siblings ...)
  2025-04-05 20:45 ` [PATCH 3/3] lib: Update the muldiv64 tests to verify the C on x86-64 David Laight
@ 2025-05-16  9:47 ` Uwe Kleine-König
  2025-05-16 12:17   ` David Laight
  3 siblings, 1 reply; 14+ messages in thread
From: Uwe Kleine-König @ 2025-05-16  9:47 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das

[-- Attachment #1: Type: text/plain, Size: 1592 bytes --]

Hello David,

On Sat, Apr 05, 2025 at 09:45:27PM +0100, David Laight wrote:
> The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
> This can be done simply by adding 'divisor - 1' to the 128bit product.
> Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
> existing code.
> Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
> mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).
> 
> Only x86-64 has an optimsed (asm) version of the function.
> That is optimised to avoid the 'add c' when c is known to be zero.
> In all other cases the extra code will be noise compared to the software
> divide code.
> 
> I've updated the test module to test mul_u64_u64_div_u64_roundup() and
> also enhanced it to verify the C division code on x86-64.
> 
> Note that the code generated by gcc (eg for 32bit x86) just for the multiply
> is rather more horrid than one would expect (clang does better).
> I dread to think how long the divide loop takes.
> And I'm not at all sure the call in kernel/sched/cputime.c isn't in a
> relatively common path (rather than just hardware initialisation).
> 
> David Laight (3):
>   lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
>   lib: Add tests for mul_u64_u64_div_u64_roundup()
>   lib: Update the muldiv64 tests to verify the C on x86-64

I wonder what happend to this series. I'd like to make use of
mul_u64_u64_div_u64_roundup() so I'd be interested to get this into the
mainline.

Best regards
Uwe

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH next 0/3] lib: Implement mul_u64_u64_div_u64_roundup()
  2025-05-16  9:47 ` [PATCH next 0/3] lib: Implement mul_u64_u64_div_u64_roundup() Uwe Kleine-König
@ 2025-05-16 12:17   ` David Laight
  2025-05-16 15:49     ` Nicolas Pitre
  0 siblings, 1 reply; 14+ messages in thread
From: David Laight @ 2025-05-16 12:17 UTC (permalink / raw)
  To: Uwe Kleine-König
  Cc: Andrew Morton, linux-kernel, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Fri, 16 May 2025 11:47:58 +0200
Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:

> Hello David,
> 
> On Sat, Apr 05, 2025 at 09:45:27PM +0100, David Laight wrote:
> > The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
> > This can be done simply by adding 'divisor - 1' to the 128bit product.
> > Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
> > existing code.
> > Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
> > mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).
> > 
> > Only x86-64 has an optimsed (asm) version of the function.
> > That is optimised to avoid the 'add c' when c is known to be zero.
> > In all other cases the extra code will be noise compared to the software
> > divide code.
> > 
> > I've updated the test module to test mul_u64_u64_div_u64_roundup() and
> > also enhanced it to verify the C division code on x86-64.
> > 
> > Note that the code generated by gcc (eg for 32bit x86) just for the multiply
> > is rather more horrid than one would expect (clang does better).
> > I dread to think how long the divide loop takes.
> > And I'm not at all sure the call in kernel/sched/cputime.c isn't in a
> > relatively common path (rather than just hardware initialisation).
> > 
> > David Laight (3):
> >   lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
> >   lib: Add tests for mul_u64_u64_div_u64_roundup()
> >   lib: Update the muldiv64 tests to verify the C on x86-64  
> 
> I wonder what happend to this series. I'd like to make use of
> mul_u64_u64_div_u64_roundup() so I'd be interested to get this into the
> mainline.

I've a WIP rewrite of the divide code, speeds it up considerably for
'not amd-64'.

IIRC (the test machine is powered off) the test cases with random data
do down from over 900 clocks to below 150 for x86-32.
The 64bit code (on x64-x64 but skipping the divide instruction) are ~70
clocks rather than nearer 180.
Both those clock counts are almost data independent.

The code relies on the cpu having an 'unsigned long/unsigned long'
instruction.

I got reasonable code for 32bit x64 running gcc 12.2 from debian.
Then copied it to godbolt and found gcc before 15.0 making a pigs
breakfast of it (all clang versions are fine).
So some rework.

I still need to actually plumb it into the kernel sources.

I do want to test the divide loop on x86-64, at least for 64bit.
Probably by including the code directly into the test module.

	David


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH next 0/3] lib: Implement mul_u64_u64_div_u64_roundup()
  2025-05-16 12:17   ` David Laight
@ 2025-05-16 15:49     ` Nicolas Pitre
  2025-05-18 13:43       ` David Laight
  0 siblings, 1 reply; 14+ messages in thread
From: Nicolas Pitre @ 2025-05-16 15:49 UTC (permalink / raw)
  To: David Laight
  Cc: Uwe Kleine-König, Andrew Morton, linux-kernel, Oleg Nesterov,
	Peter Zijlstra, Biju Das

[-- Attachment #1: Type: text/plain, Size: 2044 bytes --]

On Fri, 16 May 2025, David Laight wrote:

> On Fri, 16 May 2025 11:47:58 +0200
> Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:
> 
> > Hello David,
> > 
> > On Sat, Apr 05, 2025 at 09:45:27PM +0100, David Laight wrote:
> > > The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
> > > This can be done simply by adding 'divisor - 1' to the 128bit product.
> > > Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
> > > existing code.
> > > Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
> > > mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).
> > > 
> > > Only x86-64 has an optimsed (asm) version of the function.
> > > That is optimised to avoid the 'add c' when c is known to be zero.
> > > In all other cases the extra code will be noise compared to the software
> > > divide code.
> > > 
> > > I've updated the test module to test mul_u64_u64_div_u64_roundup() and
> > > also enhanced it to verify the C division code on x86-64.
> > > 
> > > Note that the code generated by gcc (eg for 32bit x86) just for the multiply
> > > is rather more horrid than one would expect (clang does better).
> > > I dread to think how long the divide loop takes.
> > > And I'm not at all sure the call in kernel/sched/cputime.c isn't in a
> > > relatively common path (rather than just hardware initialisation).
> > > 
> > > David Laight (3):
> > >   lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
> > >   lib: Add tests for mul_u64_u64_div_u64_roundup()
> > >   lib: Update the muldiv64 tests to verify the C on x86-64  
> > 
> > I wonder what happend to this series. I'd like to make use of
> > mul_u64_u64_div_u64_roundup() so I'd be interested to get this into the
> > mainline.
> 
> I've a WIP rewrite of the divide code, speeds it up considerably for
> 'not amd-64'.

May I suggest you simply submit the new API now (addressing my latest 
comments if possible) and submit the divide optimization later?


Nicolas

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH next 0/3] lib: Implement mul_u64_u64_div_u64_roundup()
  2025-05-16 15:49     ` Nicolas Pitre
@ 2025-05-18 13:43       ` David Laight
  0 siblings, 0 replies; 14+ messages in thread
From: David Laight @ 2025-05-18 13:43 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Uwe Kleine-König, Andrew Morton, linux-kernel, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Fri, 16 May 2025 11:49:44 -0400 (EDT)
Nicolas Pitre <npitre@baylibre.com> wrote:

> On Fri, 16 May 2025, David Laight wrote:
> 
> > On Fri, 16 May 2025 11:47:58 +0200
> > Uwe Kleine-König <u.kleine-koenig@baylibre.com> wrote:
> >   
> > > Hello David,
> > > 
> > > On Sat, Apr 05, 2025 at 09:45:27PM +0100, David Laight wrote:  
> > > > The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
> > > > This can be done simply by adding 'divisor - 1' to the 128bit product.
> > > > Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
> > > > existing code.
> > > > Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
> > > > mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).
> > > > 
> > > > Only x86-64 has an optimsed (asm) version of the function.
> > > > That is optimised to avoid the 'add c' when c is known to be zero.
> > > > In all other cases the extra code will be noise compared to the software
> > > > divide code.
> > > > 
> > > > I've updated the test module to test mul_u64_u64_div_u64_roundup() and
> > > > also enhanced it to verify the C division code on x86-64.
> > > > 
> > > > Note that the code generated by gcc (eg for 32bit x86) just for the multiply
> > > > is rather more horrid than one would expect (clang does better).
> > > > I dread to think how long the divide loop takes.
> > > > And I'm not at all sure the call in kernel/sched/cputime.c isn't in a
> > > > relatively common path (rather than just hardware initialisation).
...
> > > I wonder what happend to this series. I'd like to make use of
> > > mul_u64_u64_div_u64_roundup() so I'd be interested to get this into the
> > > mainline.  
> > 
> > I've a WIP rewrite of the divide code, speeds it up considerably for
> > 'not amd-64'.  
> 
> May I suggest you simply submit the new API now (addressing my latest 
> comments if possible) and submit the divide optimization later?

I've just 'lobbed in' a 'v2' that excludes the last patch (extra changes
to the test module) and replaces 'if (!divisor) 1/0;' with BUG_ON(!divisor).

I'll to the speedup changes on top.

	David


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2025-05-18 13:43 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-05 20:45 [PATCH next 0/3] lib: Implement mul_u64_u64_div_u64_roundup() David Laight
2025-04-05 20:45 ` [PATCH 1/3] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup() David Laight
2025-04-06  1:46   ` Nicolas Pitre
2025-04-06  3:06     ` Nicolas Pitre
2025-04-06  9:35     ` David Laight
2025-04-06 12:30       ` David Laight
2025-04-05 20:45 ` [PATCH 2/3] lib: Add tests for mul_u64_u64_div_u64_roundup() David Laight
2025-04-06  1:47   ` Nicolas Pitre
2025-04-05 20:45 ` [PATCH 3/3] lib: Update the muldiv64 tests to verify the C on x86-64 David Laight
2025-04-06  2:26   ` Nicolas Pitre
2025-05-16  9:47 ` [PATCH next 0/3] lib: Implement mul_u64_u64_div_u64_roundup() Uwe Kleine-König
2025-05-16 12:17   ` David Laight
2025-05-16 15:49     ` Nicolas Pitre
2025-05-18 13:43       ` David Laight

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.