linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 next 00/10] Implement mul_u64_u64_div_u64_roundup()
@ 2025-06-14  9:53 David Laight
  2025-06-14  9:53 ` [PATCH v3 next 01/10] lib: mul_u64_u64_div_u64() rename parameter 'c' to 'd' David Laight
                   ` (10 more replies)
  0 siblings, 11 replies; 44+ messages in thread
From: David Laight @ 2025-06-14  9:53 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: David Laight, u.kleine-koenig, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das

The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
This can be done simply by adding 'divisor - 1' to the 128bit product.
Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
existing code.
Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).
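
For illustration, a minimal usage sketch (the variable names here are
made up, this is not the actual pwm-stm32.c change):

	u64 cycles;

	/* cycles = ceil(rate * period / NSEC_PER_SEC), keeping the 128bit product */
	cycles = mul_u64_u64_div_u64_roundup(rate, period, NSEC_PER_SEC);

	/* which is exactly */
	cycles = mul_u64_add_u64_div_u64(rate, period, NSEC_PER_SEC - 1, NSEC_PER_SEC);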

Only x86-64 has an optimised (asm) version of the function.
That is optimised to avoid the 'add c' when c is known to be zero.
In all other cases the extra code will be noise compared to the software
divide code.

The test module has been updated to test mul_u64_u64_div_u64_roundup() and
also enhanced to verify the C division code on x86-64 and the 32bit
division code on 64bit.

Changes for v2:
- Rename the 'divisor' parameter from 'c' to 'd'.
- Add an extra patch to use BUG_ON() to trap zero divisors.
- Remove the last patch that ran the C code on x86-64
  (I've a plan to do that differently).

Changes for v3:
- Replace the BUG_ON() (or panic in the original version) for zero
  divisors with a WARN_ONCE() and return zero.
- Remove the 'pre-multiply' check for small products.
  Completely non-trivial on 32bit systems.
- Use mul_u32_u32() and the new add_u64_u32() to stop gcc generating
  pretty much pessimal code for x86 with lots of register spills.
- Replace the 'bit at a time' divide with one that generates 16 bits
  per iteration on 32bit systems and 32 bits per iteration on 64bit.
  Massively faster, the tests run in under 1/3 the time.

David Laight (10):
  lib: mul_u64_u64_div_u64() rename parameter 'c' to 'd'
  lib: mul_u64_u64_div_u64() Use WARN_ONCE() for divide errors.
  lib: mul_u64_u64_div_u64() simplify check for a 64bit product
  lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
  lib: Add tests for mul_u64_u64_div_u64_roundup()
  lib: test_mul_u64_u64_div_u64: Test both generic and arch versions
  lib: mul_u64_u64_div_u64() optimise multiply on 32bit x86
  lib: mul_u64_u64_div_u64() Separate multiply to a helper for clarity
  lib: mul_u64_u64_div_u64() Optimise the divide code
  lib: test_mul_u64_u64_div_u64: Test the 32bit code on 64bit

 arch/x86/include/asm/div64.h        |  33 ++++-
 include/linux/math64.h              |  56 ++++++++-
 lib/math/div64.c                    | 189 +++++++++++++++++++---------
 lib/math/test_mul_u64_u64_div_u64.c | 187 +++++++++++++++++++--------
 4 files changed, 349 insertions(+), 116 deletions(-)

-- 
2.39.5


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v3 next 01/10] lib: mul_u64_u64_div_u64() rename parameter 'c' to 'd'
  2025-06-14  9:53 [PATCH v3 next 00/10] Implement mul_u64_u64_div_u64_roundup() David Laight
@ 2025-06-14  9:53 ` David Laight
  2025-06-14  9:53 ` [PATCH v3 next 02/10] lib: mul_u64_u64_div_u64() Use WARN_ONCE() for divide errors David Laight
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 44+ messages in thread
From: David Laight @ 2025-06-14  9:53 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: David Laight, u.kleine-koenig, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das

Change the prototype from mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
to mul_u64_u64_div_u64(u64 a, u64 b, u64 d).
Using 'd' for 'divisor' makes more sense.

An upcoming change adds a 'c' parameter to calculate (a * b + c)/d.

Signed-off-by: David Laight <david.laight.linux@gmail.com>
Reviewed-by: Nicolas Pitre <npitre@baylibre.com>
---
 lib/math/div64.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/lib/math/div64.c b/lib/math/div64.c
index 5faa29208bdb..a5c966a36836 100644
--- a/lib/math/div64.c
+++ b/lib/math/div64.c
@@ -184,10 +184,10 @@ u32 iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
 EXPORT_SYMBOL(iter_div_u64_rem);
 
 #ifndef mul_u64_u64_div_u64
-u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
+u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
 {
 	if (ilog2(a) + ilog2(b) <= 62)
-		return div64_u64(a * b, c);
+		return div64_u64(a * b, d);
 
 #if defined(__SIZEOF_INT128__)
 
@@ -212,36 +212,36 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
 
 #endif
 
-	/* make sure c is not zero, trigger exception otherwise */
+	/* make sure d is not zero, trigger exception otherwise */
 #pragma GCC diagnostic push
 #pragma GCC diagnostic ignored "-Wdiv-by-zero"
-	if (unlikely(c == 0))
+	if (unlikely(d == 0))
 		return 1/0;
 #pragma GCC diagnostic pop
 
-	int shift = __builtin_ctzll(c);
+	int shift = __builtin_ctzll(d);
 
 	/* try reducing the fraction in case the dividend becomes <= 64 bits */
 	if ((n_hi >> shift) == 0) {
 		u64 n = shift ? (n_lo >> shift) | (n_hi << (64 - shift)) : n_lo;
 
-		return div64_u64(n, c >> shift);
+		return div64_u64(n, d >> shift);
 		/*
 		 * The remainder value if needed would be:
-		 *   res = div64_u64_rem(n, c >> shift, &rem);
+		 *   res = div64_u64_rem(n, d >> shift, &rem);
 		 *   rem = (rem << shift) + (n_lo - (n << shift));
 		 */
 	}
 
-	if (n_hi >= c) {
+	if (n_hi >= d) {
 		/* overflow: result is unrepresentable in a u64 */
 		return -1;
 	}
 
 	/* Do the full 128 by 64 bits division */
 
-	shift = __builtin_clzll(c);
-	c <<= shift;
+	shift = __builtin_clzll(d);
+	d <<= shift;
 
 	int p = 64 + shift;
 	u64 res = 0;
@@ -256,8 +256,8 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
 		n_hi <<= shift;
 		n_hi |= n_lo >> (64 - shift);
 		n_lo <<= shift;
-		if (carry || (n_hi >= c)) {
-			n_hi -= c;
+		if (carry || (n_hi >= d)) {
+			n_hi -= d;
 			res |= 1ULL << p;
 		}
 	} while (n_hi);
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 next 02/10] lib: mul_u64_u64_div_u64() Use WARN_ONCE() for divide errors.
  2025-06-14  9:53 [PATCH v3 next 00/10] Implement mul_u64_u64_div_u64_roundup() David Laight
  2025-06-14  9:53 ` [PATCH v3 next 01/10] lib: mul_u64_u64_div_u64() rename parameter 'c' to 'd' David Laight
@ 2025-06-14  9:53 ` David Laight
  2025-06-14 15:17   ` Nicolas Pitre
  2025-06-14  9:53 ` [PATCH v3 next 03/10] lib: mul_u64_u64_div_u64() simplify check for a 64bit product David Laight
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 44+ messages in thread
From: David Laight @ 2025-06-14  9:53 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: David Laight, u.kleine-koenig, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das

Do an explicit WARN_ONCE(!divisor) instead of hoping the 'undefined
behaviour' the compiler generates for a compile-time 1/0 is in any
way useful.

Return 0 (rather than ~(u64)0) because it is less likely to cause
further serious issues.

Add WARN_ONCE() in the divide overflow path.

A new change for v2 of the patchset.
Whereas gcc inserts (IIRC) 'ud2', clang is likely to let the code
continue and generate 'random' results for any 'undefined behaviour'.

v3: Use WARN_ONCE() and return 0 instead of BUG_ON().
    Explicitly #include <linux/bug.h>

Signed-off-by: David Laight <david.laight.linux@gmail.com>
---
 lib/math/div64.c | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/lib/math/div64.c b/lib/math/div64.c
index a5c966a36836..397578dc9a0b 100644
--- a/lib/math/div64.c
+++ b/lib/math/div64.c
@@ -19,6 +19,7 @@
  */
 
 #include <linux/bitops.h>
+#include <linux/bug.h>
 #include <linux/export.h>
 #include <linux/math.h>
 #include <linux/math64.h>
@@ -186,6 +187,15 @@ EXPORT_SYMBOL(iter_div_u64_rem);
 #ifndef mul_u64_u64_div_u64
 u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
 {
+	if (WARN_ONCE(!d, "%s: division of (%#llx * %#llx) by zero, returning 0",
+		      __func__, a, b)) {
+		/*
+		 * Return 0 (rather than ~(u64)0) because it is less likely to
+		 * have unexpected side effects.
+		 */
+		return 0;
+	}
+
 	if (ilog2(a) + ilog2(b) <= 62)
 		return div64_u64(a * b, d);
 
@@ -212,12 +222,10 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
 
 #endif
 
-	/* make sure d is not zero, trigger exception otherwise */
-#pragma GCC diagnostic push
-#pragma GCC diagnostic ignored "-Wdiv-by-zero"
-	if (unlikely(d == 0))
-		return 1/0;
-#pragma GCC diagnostic pop
+	if (WARN_ONCE(n_hi >= d,
+		      "%s: division of (%#llx * %#llx = %#llx%016llx) by %#llx overflows, returning ~0",
+		      __func__, a, b, n_hi, n_lo, d))
+		return ~(u64)0;
 
 	int shift = __builtin_ctzll(d);
 
@@ -233,11 +241,6 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
 		 */
 	}
 
-	if (n_hi >= d) {
-		/* overflow: result is unrepresentable in a u64 */
-		return -1;
-	}
-
 	/* Do the full 128 by 64 bits division */
 
 	shift = __builtin_clzll(d);
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 next 03/10] lib: mul_u64_u64_div_u64() simplify check for a 64bit product
  2025-06-14  9:53 [PATCH v3 next 00/10] Implement mul_u64_u64_div_u64_roundup() David Laight
  2025-06-14  9:53 ` [PATCH v3 next 01/10] lib: mul_u64_u64_div_u64() rename parameter 'c' to 'd' David Laight
  2025-06-14  9:53 ` [PATCH v3 next 02/10] lib: mul_u64_u64_div_u64() Use WARN_ONCE() for divide errors David Laight
@ 2025-06-14  9:53 ` David Laight
  2025-06-14 14:01   ` Nicolas Pitre
  2025-06-14  9:53 ` [PATCH v3 next 04/10] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup() David Laight
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 44+ messages in thread
From: David Laight @ 2025-06-14  9:53 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: David Laight, u.kleine-koenig, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das

If the product is only 64bits, div64_u64() can be used for the divide.
Replace the pre-multiply check (ilog2(a) + ilog2(b) <= 62) with a
simple post-multiply check that the high 64bits are zero.

This has the advantage of being simpler, more accurate and less code.
(For example a = 3, b = 1 << 62 gives ilog2(a) + ilog2(b) = 63, so the
old check took the slow path even though the product fits in 64 bits.)
It will always be faster when the product is larger than 64bits.

Most 64bit cpus have a native 64x64=128 bit multiply; this is needed
(for the low 64bits) even when div64_u64() is called - so the early
check gains nothing and is just extra code.

A 32bit cpu will need a compare (etc) to generate the 64bit ilog2()
from two 32bit bit scans - so that is non-trivial.
(Never mind the mess of x86's 'bsr' and any oddball cpu without
fast bit-scan instructions.)
Whereas the additional instructions for the 128bit multiply result
are pretty much one multiply and two adds (typically the 'adc $0,%reg'
can be run in parallel with the instruction that follows).

The only outliers are 64bit systems without a 128bit multiply and
simple in-order 32bit ones with fast bit scan but needing extra
instructions to get the high bits of the multiply result.
I doubt it makes much difference to either, the latter is definitely
not mainsteam.

Split from patch 3 of v2 of this series.

If anyone is worried about the analysis they can look at the
generated code for x86 (especially when cmov isn't used).

Signed-off-by: David Laight <david.laight.linux@gmail.com>
---
 lib/math/div64.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/math/div64.c b/lib/math/div64.c
index 397578dc9a0b..ed9475b9e1ef 100644
--- a/lib/math/div64.c
+++ b/lib/math/div64.c
@@ -196,9 +196,6 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
 		return 0;
 	}
 
-	if (ilog2(a) + ilog2(b) <= 62)
-		return div64_u64(a * b, d);
-
 #if defined(__SIZEOF_INT128__)
 
 	/* native 64x64=128 bits multiplication */
@@ -222,6 +219,9 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
 
 #endif
 
+	if (!n_hi)
+		return div64_u64(n_lo, d);
+
 	if (WARN_ONCE(n_hi >= d,
 		      "%s: division of (%#llx * %#llx = %#llx%016llx) by %#llx overflows, returning ~0",
 		      __func__, a, b, n_hi, n_lo, d))
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 next 04/10] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
  2025-06-14  9:53 [PATCH v3 next 00/10] Implement mul_u64_u64_div_u64_roundup() David Laight
                   ` (2 preceding siblings ...)
  2025-06-14  9:53 ` [PATCH v3 next 03/10] lib: mul_u64_u64_div_u64() simplify check for a 64bit product David Laight
@ 2025-06-14  9:53 ` David Laight
  2025-06-14 14:06   ` Nicolas Pitre
  2025-06-14  9:53 ` [PATCH v3 next 05/10] lib: Add tests for mul_u64_u64_div_u64_roundup() David Laight
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 44+ messages in thread
From: David Laight @ 2025-06-14  9:53 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: David Laight, u.kleine-koenig, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das

The existing mul_u64_u64_div_u64() rounds down. A 'rounding up'
variant needs 'divisor - 1' added in between the multiply and the
divide, so it cannot easily be done by a caller.

Add mul_u64_add_u64_div_u64(a, b, c, d) that calculates (a * b + c)/d
and implement the 'round down' and 'round up' using it.

Update the x86-64 asm to optimise for 'c' being a constant zero.

Add kernel-doc comments for all three functions.

Signed-off-by: David Laight <david.laight.linux@gmail.com>

Changes for v2 (formerly patch 1/3):
- Reinstate the early call to div64_u64() on 32bit when 'c' is zero.
  Although I'm not convinced the path is common enough to be worth
  the two ilog2() calls.

Changes for v3 (formerly patch 3/4):
- The early call to div64_u64() has been removed by patch 3.
  Pretty much guaranteed to be a pessimisation.

Signed-off-by: David Laight <david.laight.linux@gmail.com>
---
 arch/x86/include/asm/div64.h | 19 ++++++++++-----
 include/linux/math64.h       | 45 +++++++++++++++++++++++++++++++++++-
 lib/math/div64.c             | 22 ++++++++++--------
 3 files changed, 69 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/div64.h b/arch/x86/include/asm/div64.h
index 9931e4c7d73f..7a0a916a2d7d 100644
--- a/arch/x86/include/asm/div64.h
+++ b/arch/x86/include/asm/div64.h
@@ -84,21 +84,28 @@ static inline u64 mul_u32_u32(u32 a, u32 b)
  * Will generate an #DE when the result doesn't fit u64, could fix with an
  * __ex_table[] entry when it becomes an issue.
  */
-static inline u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div)
+static inline u64 mul_u64_add_u64_div_u64(u64 a, u64 mul, u64 add, u64 div)
 {
 	u64 q;
 
-	asm ("mulq %2; divq %3" : "=a" (q)
-				: "a" (a), "rm" (mul), "rm" (div)
-				: "rdx");
+	if (statically_true(!add)) {
+		asm ("mulq %2; divq %3" : "=a" (q)
+					: "a" (a), "rm" (mul), "rm" (div)
+					: "rdx");
+	} else {
+		asm ("mulq %2; addq %3, %%rax; adcq $0, %%rdx; divq %4"
+			: "=a" (q)
+			: "a" (a), "rm" (mul), "rm" (add), "rm" (div)
+			: "rdx");
+	}
 
 	return q;
 }
-#define mul_u64_u64_div_u64 mul_u64_u64_div_u64
+#define mul_u64_add_u64_div_u64 mul_u64_add_u64_div_u64
 
 static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 div)
 {
-	return mul_u64_u64_div_u64(a, mul, div);
+	return mul_u64_add_u64_div_u64(a, mul, 0, div);
 }
 #define mul_u64_u32_div	mul_u64_u32_div
 
diff --git a/include/linux/math64.h b/include/linux/math64.h
index 6aaccc1626ab..e1c2e3642cec 100644
--- a/include/linux/math64.h
+++ b/include/linux/math64.h
@@ -282,7 +282,53 @@ static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 divisor)
 }
 #endif /* mul_u64_u32_div */
 
-u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div);
+/**
+ * mul_u64_add_u64_div_u64 - unsigned 64bit multiply, add, and divide
+ * @a: first unsigned 64bit multiplicand
+ * @b: second unsigned 64bit multiplicand
+ * @c: unsigned 64bit addend
+ * @d: unsigned 64bit divisor
+ *
+ * Multiply two 64bit values together to generate a 128bit product
+ * add a third value and then divide by a fourth.
+ * Generic code returns 0 if @d is zero and ~0 if the quotient exceeds 64 bits.
+ * Architecture specific code may trap on zero or overflow.
+ *
+ * Return: (@a * @b + @c) / @d
+ */
+u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d);
+
+/**
+ * mul_u64_u64_div_u64 - unsigned 64bit multiply and divide
+ * @a: first unsigned 64bit multiplicand
+ * @b: second unsigned 64bit multiplicand
+ * @d: unsigned 64bit divisor
+ *
+ * Multiply two 64bit values together to generate a 128bit product
+ * and then divide by a third value.
+ * Generic code returns 0 if @d is zero and ~0 if the quotient exceeds 64 bits.
+ * Architecture specific code may trap on zero or overflow.
+ *
+ * Return: @a * @b / @d
+ */
+#define mul_u64_u64_div_u64(a, b, d) mul_u64_add_u64_div_u64(a, b, 0, d)
+
+/**
+ * mul_u64_u64_div_u64_roundup - unsigned 64bit multiply and divide rounded up
+ * @a: first unsigned 64bit multiplicand
+ * @b: second unsigned 64bit multiplicand
+ * @d: unsigned 64bit divisor
+ *
+ * Multiply two 64bit values together to generate a 128bit product
+ * and then divide and round up.
+ * Generic code returns 0 if @d is zero and ~0 if the quotient exceeds 64 bits.
+ * Architecture specific code may trap on zero or overflow.
+ *
+ * Return: (@a * @b + @d - 1) / @d
+ */
+#define mul_u64_u64_div_u64_roundup(a, b, d) \
+	({ u64 _tmp = (d); mul_u64_add_u64_div_u64(a, b, _tmp - 1, _tmp); })
+
 
 /**
  * DIV64_U64_ROUND_UP - unsigned 64bit divide with 64bit divisor rounded up
diff --git a/lib/math/div64.c b/lib/math/div64.c
index ed9475b9e1ef..7850cc0a7596 100644
--- a/lib/math/div64.c
+++ b/lib/math/div64.c
@@ -184,11 +184,11 @@ u32 iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
 }
 EXPORT_SYMBOL(iter_div_u64_rem);
 
-#ifndef mul_u64_u64_div_u64
-u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
+#ifndef mul_u64_add_u64_div_u64
+u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
 {
-	if (WARN_ONCE(!d, "%s: division of (%#llx * %#llx) by zero, returning 0",
-		      __func__, a, b)) {
+	if (WARN_ONCE(!d, "%s: division of (%#llx * %#llx + %#llx) by zero, returning 0",
+		      __func__, a, b, c)) {
 		/*
 		 * Return 0 (rather than ~(u64)0) because it is less likely to
 		 * have unexpected side effects.
@@ -199,7 +199,7 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
 #if defined(__SIZEOF_INT128__)
 
 	/* native 64x64=128 bits multiplication */
-	u128 prod = (u128)a * b;
+	u128 prod = (u128)a * b + c;
 	u64 n_lo = prod, n_hi = prod >> 64;
 
 #else
@@ -208,8 +208,10 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
 	u32 a_lo = a, a_hi = a >> 32, b_lo = b, b_hi = b >> 32;
 	u64 x, y, z;
 
-	x = (u64)a_lo * b_lo;
-	y = (u64)a_lo * b_hi + (u32)(x >> 32);
+	/* Since (x-1)(x-1) + 2(x-1) == x.x - 1 two u32 can be added to a u64 */
+	x = (u64)a_lo * b_lo + (u32)c;
+	y = (u64)a_lo * b_hi + (u32)(c >> 32);
+	y += (u32)(x >> 32);
 	z = (u64)a_hi * b_hi + (u32)(y >> 32);
 	y = (u64)a_hi * b_lo + (u32)y;
 	z += (u32)(y >> 32);
@@ -223,8 +225,8 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
 		return div64_u64(n_lo, d);
 
 	if (WARN_ONCE(n_hi >= d,
-		      "%s: division of (%#llx * %#llx = %#llx%016llx) by %#llx overflows, returning ~0",
-		      __func__, a, b, n_hi, n_lo, d))
+		      "%s: division of (%#llx * %#llx + %#llx = %#llx%016llx) by %#llx overflows, returning ~0",
+		      __func__, a, b, c, n_hi, n_lo, d))
 		return ~(u64)0;
 
 	int shift = __builtin_ctzll(d);
@@ -268,5 +270,5 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
 
 	return res;
 }
-EXPORT_SYMBOL(mul_u64_u64_div_u64);
+EXPORT_SYMBOL(mul_u64_add_u64_div_u64);
 #endif
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 next 05/10] lib: Add tests for mul_u64_u64_div_u64_roundup()
  2025-06-14  9:53 [PATCH v3 next 00/10] Implement mul_u64_u64_div_u64_roundup() David Laight
                   ` (3 preceding siblings ...)
  2025-06-14  9:53 ` [PATCH v3 next 04/10] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup() David Laight
@ 2025-06-14  9:53 ` David Laight
  2025-06-14 15:19   ` Nicolas Pitre
  2025-06-17  4:30   ` Nicolas Pitre
  2025-06-14  9:53 ` [PATCH v3 next 06/10] lib: test_mul_u64_u64_div_u64: Test both generic and arch versions David Laight
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 44+ messages in thread
From: David Laight @ 2025-06-14  9:53 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: David Laight, u.kleine-koenig, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das

Replicate the existing mul_u64_u64_div_u64() test cases with round up.
Update the shell script that verifies the table, removing the comment
markers so that it can be pasted directly into a shell.

Rename the divisor from 'c' to 'd' to match mul_u64_add_u64_div_u64().

If any tests fail then fail the module load with -EINVAL.

Signed-off-by: David Laight <david.laight.linux@gmail.com>
---

Changes for v3:
- Rename 'c' to 'd' to match mul_u64_add_u64_div_u64()

 lib/math/test_mul_u64_u64_div_u64.c | 122 +++++++++++++++++-----------
 1 file changed, 73 insertions(+), 49 deletions(-)

diff --git a/lib/math/test_mul_u64_u64_div_u64.c b/lib/math/test_mul_u64_u64_div_u64.c
index 58d058de4e73..ea5b703cccff 100644
--- a/lib/math/test_mul_u64_u64_div_u64.c
+++ b/lib/math/test_mul_u64_u64_div_u64.c
@@ -10,61 +10,73 @@
 #include <linux/printk.h>
 #include <linux/math64.h>
 
-typedef struct { u64 a; u64 b; u64 c; u64 result; } test_params;
+typedef struct { u64 a; u64 b; u64 d; u64 result; uint round_up;} test_params;
 
 static test_params test_values[] = {
 /* this contains many edge values followed by a couple random values */
-{                0xb,                0x7,                0x3,               0x19 },
-{         0xffff0000,         0xffff0000,                0xf, 0x1110eeef00000000 },
-{         0xffffffff,         0xffffffff,                0x1, 0xfffffffe00000001 },
-{         0xffffffff,         0xffffffff,                0x2, 0x7fffffff00000000 },
-{        0x1ffffffff,         0xffffffff,                0x2, 0xfffffffe80000000 },
-{        0x1ffffffff,         0xffffffff,                0x3, 0xaaaaaaa9aaaaaaab },
-{        0x1ffffffff,        0x1ffffffff,                0x4, 0xffffffff00000000 },
-{ 0xffff000000000000, 0xffff000000000000, 0xffff000000000001, 0xfffeffffffffffff },
-{ 0x3333333333333333, 0x3333333333333333, 0x5555555555555555, 0x1eb851eb851eb851 },
-{ 0x7fffffffffffffff,                0x2,                0x3, 0x5555555555555554 },
-{ 0xffffffffffffffff,                0x2, 0x8000000000000000,                0x3 },
-{ 0xffffffffffffffff,                0x2, 0xc000000000000000,                0x2 },
-{ 0xffffffffffffffff, 0x4000000000000004, 0x8000000000000000, 0x8000000000000007 },
-{ 0xffffffffffffffff, 0x4000000000000001, 0x8000000000000000, 0x8000000000000001 },
-{ 0xffffffffffffffff, 0x8000000000000001, 0xffffffffffffffff, 0x8000000000000001 },
-{ 0xfffffffffffffffe, 0x8000000000000001, 0xffffffffffffffff, 0x8000000000000000 },
-{ 0xffffffffffffffff, 0x8000000000000001, 0xfffffffffffffffe, 0x8000000000000001 },
-{ 0xffffffffffffffff, 0x8000000000000001, 0xfffffffffffffffd, 0x8000000000000002 },
-{ 0x7fffffffffffffff, 0xffffffffffffffff, 0xc000000000000000, 0xaaaaaaaaaaaaaaa8 },
-{ 0xffffffffffffffff, 0x7fffffffffffffff, 0xa000000000000000, 0xccccccccccccccca },
-{ 0xffffffffffffffff, 0x7fffffffffffffff, 0x9000000000000000, 0xe38e38e38e38e38b },
-{ 0x7fffffffffffffff, 0x7fffffffffffffff, 0x5000000000000000, 0xccccccccccccccc9 },
-{ 0xffffffffffffffff, 0xfffffffffffffffe, 0xffffffffffffffff, 0xfffffffffffffffe },
-{ 0xe6102d256d7ea3ae, 0x70a77d0be4c31201, 0xd63ec35ab3220357, 0x78f8bf8cc86c6e18 },
-{ 0xf53bae05cb86c6e1, 0x3847b32d2f8d32e0, 0xcfd4f55a647f403c, 0x42687f79d8998d35 },
-{ 0x9951c5498f941092, 0x1f8c8bfdf287a251, 0xa3c8dc5f81ea3fe2, 0x1d887cb25900091f },
-{ 0x374fee9daa1bb2bb, 0x0d0bfbff7b8ae3ef, 0xc169337bd42d5179, 0x03bb2dbaffcbb961 },
-{ 0xeac0d03ac10eeaf0, 0x89be05dfa162ed9b, 0x92bb1679a41f0e4b, 0xdc5f5cc9e270d216 },
+{                0xb,                0x7,                0x3,               0x19, 1 },
+{         0xffff0000,         0xffff0000,                0xf, 0x1110eeef00000000, 0 },
+{         0xffffffff,         0xffffffff,                0x1, 0xfffffffe00000001, 0 },
+{         0xffffffff,         0xffffffff,                0x2, 0x7fffffff00000000, 1 },
+{        0x1ffffffff,         0xffffffff,                0x2, 0xfffffffe80000000, 1 },
+{        0x1ffffffff,         0xffffffff,                0x3, 0xaaaaaaa9aaaaaaab, 0 },
+{        0x1ffffffff,        0x1ffffffff,                0x4, 0xffffffff00000000, 1 },
+{ 0xffff000000000000, 0xffff000000000000, 0xffff000000000001, 0xfffeffffffffffff, 1 },
+{ 0x3333333333333333, 0x3333333333333333, 0x5555555555555555, 0x1eb851eb851eb851, 1 },
+{ 0x7fffffffffffffff,                0x2,                0x3, 0x5555555555555554, 1 },
+{ 0xffffffffffffffff,                0x2, 0x8000000000000000,                0x3, 1 },
+{ 0xffffffffffffffff,                0x2, 0xc000000000000000,                0x2, 1 },
+{ 0xffffffffffffffff, 0x4000000000000004, 0x8000000000000000, 0x8000000000000007, 1 },
+{ 0xffffffffffffffff, 0x4000000000000001, 0x8000000000000000, 0x8000000000000001, 1 },
+{ 0xffffffffffffffff, 0x8000000000000001, 0xffffffffffffffff, 0x8000000000000001, 0 },
+{ 0xfffffffffffffffe, 0x8000000000000001, 0xffffffffffffffff, 0x8000000000000000, 1 },
+{ 0xffffffffffffffff, 0x8000000000000001, 0xfffffffffffffffe, 0x8000000000000001, 1 },
+{ 0xffffffffffffffff, 0x8000000000000001, 0xfffffffffffffffd, 0x8000000000000002, 1 },
+{ 0x7fffffffffffffff, 0xffffffffffffffff, 0xc000000000000000, 0xaaaaaaaaaaaaaaa8, 1 },
+{ 0xffffffffffffffff, 0x7fffffffffffffff, 0xa000000000000000, 0xccccccccccccccca, 1 },
+{ 0xffffffffffffffff, 0x7fffffffffffffff, 0x9000000000000000, 0xe38e38e38e38e38b, 1 },
+{ 0x7fffffffffffffff, 0x7fffffffffffffff, 0x5000000000000000, 0xccccccccccccccc9, 1 },
+{ 0xffffffffffffffff, 0xfffffffffffffffe, 0xffffffffffffffff, 0xfffffffffffffffe, 0 },
+{ 0xe6102d256d7ea3ae, 0x70a77d0be4c31201, 0xd63ec35ab3220357, 0x78f8bf8cc86c6e18, 1 },
+{ 0xf53bae05cb86c6e1, 0x3847b32d2f8d32e0, 0xcfd4f55a647f403c, 0x42687f79d8998d35, 1 },
+{ 0x9951c5498f941092, 0x1f8c8bfdf287a251, 0xa3c8dc5f81ea3fe2, 0x1d887cb25900091f, 1 },
+{ 0x374fee9daa1bb2bb, 0x0d0bfbff7b8ae3ef, 0xc169337bd42d5179, 0x03bb2dbaffcbb961, 1 },
+{ 0xeac0d03ac10eeaf0, 0x89be05dfa162ed9b, 0x92bb1679a41f0e4b, 0xdc5f5cc9e270d216, 1 },
 };
 
 /*
  * The above table can be verified with the following shell script:
- *
- * #!/bin/sh
- * sed -ne 's/^{ \+\(.*\), \+\(.*\), \+\(.*\), \+\(.*\) },$/\1 \2 \3 \4/p' \
- *     lib/math/test_mul_u64_u64_div_u64.c |
- * while read a b c r; do
- *   expected=$( printf "obase=16; ibase=16; %X * %X / %X\n" $a $b $c | bc )
- *   given=$( printf "%X\n" $r )
- *   if [ "$expected" = "$given" ]; then
- *     echo "$a * $b / $c = $r OK"
- *   else
- *     echo "$a * $b / $c = $r is wrong" >&2
- *     echo "should be equivalent to 0x$expected" >&2
- *     exit 1
- *   fi
- * done
+
+#!/bin/sh
+sed -ne 's/^{ \+\(.*\), \+\(.*\), \+\(.*\), \+\(.*\), \+\(.*\) },$/\1 \2 \3 \4 \5/p' \
+    lib/math/test_mul_u64_u64_div_u64.c |
+while read a b d r d; do
+  expected=$( printf "obase=16; ibase=16; %X * %X / %X\n" $a $b $d | bc )
+  given=$( printf "%X\n" $r )
+  if [ "$expected" = "$given" ]; then
+    echo "$a * $b  / $d = $r OK"
+  else
+    echo "$a * $b  / $d = $r is wrong" >&2
+    echo "should be equivalent to 0x$expected" >&2
+    exit 1
+  fi
+  expected=$( printf "obase=16; ibase=16; (%X * %X + %X) / %X\n" $a $b $((d-1)) $d | bc )
+  given=$( printf "%X\n" $((r + d)) )
+  if [ "$expected" = "$given" ]; then
+    echo "$a * $b +/ $d = $(printf '%#x' $((r + d))) OK"
+  else
+    echo "$a * $b +/ $d = $(printf '%#x' $((r + d))) is wrong" >&2
+    echo "should be equivalent to 0x$expected" >&2
+    exit 1
+  fi
+done
+
  */
 
 static int __init test_init(void)
 {
+	int errors = 0;
+	int tests = 0;
 	int i;
 
 	pr_info("Starting mul_u64_u64_div_u64() test\n");
@@ -72,19 +84,31 @@ static int __init test_init(void)
 	for (i = 0; i < ARRAY_SIZE(test_values); i++) {
 		u64 a = test_values[i].a;
 		u64 b = test_values[i].b;
-		u64 c = test_values[i].c;
+		u64 d = test_values[i].d;
 		u64 expected_result = test_values[i].result;
-		u64 result = mul_u64_u64_div_u64(a, b, c);
+		u64 result = mul_u64_u64_div_u64(a, b, d);
+		u64 result_up = mul_u64_u64_div_u64_roundup(a, b, d);
+
+		tests += 2;
 
 		if (result != expected_result) {
-			pr_err("ERROR: 0x%016llx * 0x%016llx / 0x%016llx\n", a, b, c);
+			pr_err("ERROR: 0x%016llx * 0x%016llx / 0x%016llx\n", a, b, d);
 			pr_err("ERROR: expected result: %016llx\n", expected_result);
 			pr_err("ERROR: obtained result: %016llx\n", result);
+			errors++;
+		}
+		expected_result += test_values[i].round_up;
+		if (result_up != expected_result) {
+			pr_err("ERROR: 0x%016llx * 0x%016llx +/ 0x%016llx\n", a, b, d);
+			pr_err("ERROR: expected result: %016llx\n", expected_result);
+			pr_err("ERROR: obtained result: %016llx\n", result_up);
+			errors++;
 		}
 	}
 
-	pr_info("Completed mul_u64_u64_div_u64() test\n");
-	return 0;
+	pr_info("Completed mul_u64_u64_div_u64() test, %d tests, %d errors\n",
+		tests, errors);
+	return errors ? -EINVAL : 0;
 }
 
 static void __exit test_exit(void)
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 next 06/10] lib: test_mul_u64_u64_div_u64: Test both generic and arch versions
  2025-06-14  9:53 [PATCH v3 next 00/10] Implement mul_u64_u64_div_u64_roundup() David Laight
                   ` (4 preceding siblings ...)
  2025-06-14  9:53 ` [PATCH v3 next 05/10] lib: Add tests for mul_u64_u64_div_u64_roundup() David Laight
@ 2025-06-14  9:53 ` David Laight
  2025-06-14 15:25   ` Nicolas Pitre
  2025-06-14  9:53 ` [PATCH v3 next 07/10] lib: mul_u64_u64_div_u64() optimise multiply on 32bit x86 David Laight
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 44+ messages in thread
From: David Laight @ 2025-06-14  9:53 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: David Laight, u.kleine-koenig, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das

Change the #if in div64.c so that test_mul_u64_u64_div_u64.c
can compile and test the generic version (including the 'long multiply')
on architectures (e.g. amd64) that define their own copy.

Test the kernel version and the locally compiled version on all architectures.
Output the time taken (in ns) on the 'test completed' trace.

For reference, on my zen 5, the optimised version takes ~220ns and the
generic version ~3350ns.
Using the native multiply saves ~200ns and adding back the ilog2() 'optimisation'
test adds ~50ns.

Signed-off-by: David Laight <david.laight.linux@gmail.com>
---

New patch for v3, replacing changes in v1 that were removed for v2.

 lib/math/div64.c                    |  8 +++--
 lib/math/test_mul_u64_u64_div_u64.c | 48 ++++++++++++++++++++++++-----
 2 files changed, 47 insertions(+), 9 deletions(-)

diff --git a/lib/math/div64.c b/lib/math/div64.c
index 7850cc0a7596..22433e5565c4 100644
--- a/lib/math/div64.c
+++ b/lib/math/div64.c
@@ -178,13 +178,15 @@ EXPORT_SYMBOL(div64_s64);
  * Iterative div/mod for use when dividend is not expected to be much
  * bigger than divisor.
  */
+#ifndef iter_div_u64_rem
 u32 iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
 {
 	return __iter_div_u64_rem(dividend, divisor, remainder);
 }
 EXPORT_SYMBOL(iter_div_u64_rem);
+#endif
 
-#ifndef mul_u64_add_u64_div_u64
+#if !defined(mul_u64_add_u64_div_u64) || defined(test_mul_u64_add_u64_div_u64)
 u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
 {
 	if (WARN_ONCE(!d, "%s: division of (%#llx * %#llx + %#llx) by zero, returning 0",
@@ -196,7 +198,7 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
 		return 0;
 	}
 
-#if defined(__SIZEOF_INT128__)
+#if defined(__SIZEOF_INT128__) && !defined(test_mul_u64_add_u64_div_u64)
 
 	/* native 64x64=128 bits multiplication */
 	u128 prod = (u128)a * b + c;
@@ -270,5 +272,7 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
 
 	return res;
 }
+#if !defined(test_mul_u64_add_u64_div_u64)
 EXPORT_SYMBOL(mul_u64_add_u64_div_u64);
 #endif
+#endif
diff --git a/lib/math/test_mul_u64_u64_div_u64.c b/lib/math/test_mul_u64_u64_div_u64.c
index ea5b703cccff..f0134f25cb0d 100644
--- a/lib/math/test_mul_u64_u64_div_u64.c
+++ b/lib/math/test_mul_u64_u64_div_u64.c
@@ -73,21 +73,34 @@ done
 
  */
 
-static int __init test_init(void)
+static u64 test_mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d);
+
+static int __init test_run(unsigned int fn_no, const char *fn_name)
 {
+	u64 start_time;
 	int errors = 0;
 	int tests = 0;
 	int i;
 
-	pr_info("Starting mul_u64_u64_div_u64() test\n");
+	start_time = ktime_get_ns();
 
 	for (i = 0; i < ARRAY_SIZE(test_values); i++) {
 		u64 a = test_values[i].a;
 		u64 b = test_values[i].b;
 		u64 d = test_values[i].d;
 		u64 expected_result = test_values[i].result;
-		u64 result = mul_u64_u64_div_u64(a, b, d);
-		u64 result_up = mul_u64_u64_div_u64_roundup(a, b, d);
+		u64 result, result_up;
+
+		switch (fn_no) {
+		default:
+			result = mul_u64_u64_div_u64(a, b, d);
+			result_up = mul_u64_u64_div_u64_roundup(a, b, d);
+			break;
+		case 1:
+			result = test_mul_u64_add_u64_div_u64(a, b, 0, d);
+			result_up = test_mul_u64_add_u64_div_u64(a, b, d - 1, d);
+			break;
+		}
 
 		tests += 2;
 
@@ -106,15 +119,36 @@ static int __init test_init(void)
 		}
 	}
 
-	pr_info("Completed mul_u64_u64_div_u64() test, %d tests, %d errors\n",
-		tests, errors);
-	return errors ? -EINVAL : 0;
+	pr_info("Completed %s() test, %d tests, %d errors, %llu ns\n",
+		fn_name, tests, errors, ktime_get_ns() - start_time);
+	return errors;
+}
+
+static int __init test_init(void)
+{
+	pr_info("Starting mul_u64_u64_div_u64() test\n");
+	if (test_run(0, "mul_u64_u64_div_u64"))
+		return -EINVAL;
+	if (test_run(1, "test_mul_u64_u64_div_u64"))
+		return -EINVAL;
+	return 0;
 }
 
 static void __exit test_exit(void)
 {
 }
 
+/* Compile the generic mul_u64_add_u64_div_u64() code */
+#define div64_u64 div64_u64
+#define div64_s64 div64_s64
+#define iter_div_u64_rem iter_div_u64_rem
+
+#undef mul_u64_add_u64_div_u64
+#define mul_u64_add_u64_div_u64 test_mul_u64_add_u64_div_u64
+#define test_mul_u64_add_u64_div_u64 test_mul_u64_add_u64_div_u64
+
+#include "div64.c"
+
 module_init(test_init);
 module_exit(test_exit);
 
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 next 07/10] lib: mul_u64_u64_div_u64() optimise multiply on 32bit x86
  2025-06-14  9:53 [PATCH v3 next 00/10] Implement mul_u64_u64_div_u64_roundup() David Laight
                   ` (5 preceding siblings ...)
  2025-06-14  9:53 ` [PATCH v3 next 06/10] lib: test_mul_u64_u64_div_u64: Test both generic and arch versions David Laight
@ 2025-06-14  9:53 ` David Laight
  2025-06-14 15:31   ` Nicolas Pitre
  2025-06-14  9:53 ` [PATCH v3 next 08/10] lib: mul_u64_u64_div_u64() Separate multiply to a helper for clarity David Laight
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 44+ messages in thread
From: David Laight @ 2025-06-14  9:53 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: David Laight, u.kleine-koenig, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das

gcc generates horrid code for both ((u64)u32_a * u32_b) and (u64_a + u32_b).
As well as the extra instructions it can generate a lot of spills to stack
(including spills of constant zeros and even multiplies by constant zero).

mul_u32_u32() already exists to optimise the multiply.
Add a similar add_u64_u32() for the addition.
Disable both for clang - it generates better code without them.

Use mul_u32_u32() and add_u64_u32() in the 64x64 => 128 multiply
in mul_u64_add_u64_div_u64().

Tested by forcing the amd64 build of test_mul_u64_u64_div_u64.ko
to use the 32bit asm code.

Signed-off-by: David Laight <david.laight.linux@gmail.com>
---

New patch for v3.

 arch/x86/include/asm/div64.h | 14 ++++++++++++++
 include/linux/math64.h       | 11 +++++++++++
 lib/math/div64.c             | 18 ++++++++++++------
 3 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/div64.h b/arch/x86/include/asm/div64.h
index 7a0a916a2d7d..4a4c29e8602d 100644
--- a/arch/x86/include/asm/div64.h
+++ b/arch/x86/include/asm/div64.h
@@ -60,6 +60,7 @@ static inline u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder)
 }
 #define div_u64_rem	div_u64_rem
 
+#ifndef __clang__
 static inline u64 mul_u32_u32(u32 a, u32 b)
 {
 	u32 high, low;
@@ -71,6 +72,19 @@ static inline u64 mul_u32_u32(u32 a, u32 b)
 }
 #define mul_u32_u32 mul_u32_u32
 
+static inline u64 add_u64_u32(u64 a, u32 b)
+{
+	u32 high = a >> 32, low = a;
+
+	asm ("addl %[b], %[low]; adcl $0, %[high]"
+		: [low] "+r" (low), [high] "+r" (high)
+		: [b] "rm" (b) );
+
+	return low | (u64)high << 32;
+}
+#define add_u64_u32 add_u64_u32
+#endif
+
 /*
  * __div64_32() is never called on x86, so prevent the
  * generic definition from getting built.
diff --git a/include/linux/math64.h b/include/linux/math64.h
index e1c2e3642cec..5e497836e975 100644
--- a/include/linux/math64.h
+++ b/include/linux/math64.h
@@ -158,6 +158,17 @@ static inline u64 mul_u32_u32(u32 a, u32 b)
 }
 #endif
 
+#ifndef add_u64_u32
+/*
+ * Many a GCC version also messes this up.
+ * Zero extending b and then spilling everything to stack.
+ */
+static inline u64 add_u64_u32(u64 a, u32 b)
+{
+	return a + b;
+}
+#endif
+
 #if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
 
 #ifndef mul_u64_u32_shr
diff --git a/lib/math/div64.c b/lib/math/div64.c
index 22433e5565c4..2ac7e25039a1 100644
--- a/lib/math/div64.c
+++ b/lib/math/div64.c
@@ -187,6 +187,12 @@ EXPORT_SYMBOL(iter_div_u64_rem);
 #endif
 
 #if !defined(mul_u64_add_u64_div_u64) || defined(test_mul_u64_add_u64_div_u64)
+
+static u64 mul_add(u32 a, u32 b, u32 c)
+{
+	return add_u64_u32(mul_u32_u32(a, b), c);
+}
+
 u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
 {
 	if (WARN_ONCE(!d, "%s: division of (%#llx * %#llx + %#llx) by zero, returning 0",
@@ -211,12 +217,12 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
 	u64 x, y, z;
 
 	/* Since (x-1)(x-1) + 2(x-1) == x.x - 1 two u32 can be added to a u64 */
-	x = (u64)a_lo * b_lo + (u32)c;
-	y = (u64)a_lo * b_hi + (u32)(c >> 32);
-	y += (u32)(x >> 32);
-	z = (u64)a_hi * b_hi + (u32)(y >> 32);
-	y = (u64)a_hi * b_lo + (u32)y;
-	z += (u32)(y >> 32);
+	x = mul_add(a_lo, b_lo, c);
+	y = mul_add(a_lo, b_hi, c >> 32);
+	y = add_u64_u32(y, x >> 32);
+	z = mul_add(a_hi, b_hi, y >> 32);
+	y = mul_add(a_hi, b_lo, y);
+	z = add_u64_u32(z, y >> 32);
 	x = (y << 32) + (u32)x;
 
 	u64 n_lo = x, n_hi = z;
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 next 08/10] lib: mul_u64_u64_div_u64() Separate multiply to a helper for clarity
  2025-06-14  9:53 [PATCH v3 next 00/10] Implement mul_u64_u64_div_u64_roundup() David Laight
                   ` (6 preceding siblings ...)
  2025-06-14  9:53 ` [PATCH v3 next 07/10] lib: mul_u64_u64_div_u64() optimise multiply on 32bit x86 David Laight
@ 2025-06-14  9:53 ` David Laight
  2025-06-14 15:37   ` Nicolas Pitre
  2025-06-14  9:53 ` [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code David Laight
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 44+ messages in thread
From: David Laight @ 2025-06-14  9:53 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: David Laight, u.kleine-koenig, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das

Move the 64x64 => 128 multiply into a static inline helper function
for code clarity.
No need for the a/b_hi/lo variables; the implicit casts on the function
calls do the work for us.
Should have minimal effect on the generated code.

Signed-off-by: David Laight <david.laight.linux@gmail.com>
---

new patch for v3.

 lib/math/div64.c | 54 +++++++++++++++++++++++++++---------------------
 1 file changed, 30 insertions(+), 24 deletions(-)

diff --git a/lib/math/div64.c b/lib/math/div64.c
index 2ac7e25039a1..fb77fd9d999d 100644
--- a/lib/math/div64.c
+++ b/lib/math/div64.c
@@ -193,42 +193,48 @@ static u64 mul_add(u32 a, u32 b, u32 c)
 	return add_u64_u32(mul_u32_u32(a, b), c);
 }
 
-u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
-{
-	if (WARN_ONCE(!d, "%s: division of (%#llx * %#llx + %#llx) by zero, returning 0",
-		      __func__, a, b, c)) {
-		/*
-		 * Return 0 (rather than ~(u64)0) because it is less likely to
-		 * have unexpected side effects.
-		 */
-		return 0;
-	}
-
 #if defined(__SIZEOF_INT128__) && !defined(test_mul_u64_add_u64_div_u64)
-
+static inline u64 mul_u64_u64_add_u64(u64 *p_lo, u64 a, u64 b, u64 c)
+{
 	/* native 64x64=128 bits multiplication */
 	u128 prod = (u128)a * b + c;
-	u64 n_lo = prod, n_hi = prod >> 64;
 
-#else
+	*p_lo = prod;
+	return prod >> 64;
+}
 
-	/* perform a 64x64=128 bits multiplication manually */
-	u32 a_lo = a, a_hi = a >> 32, b_lo = b, b_hi = b >> 32;
+#else
+static inline u64 mul_u64_u64_add_u64(u64 *p_lo, u64 a, u64 b, u64 c)
+{
+	/* perform a 64x64=128 bits multiplication in 32bit chunks */
 	u64 x, y, z;
 
 	/* Since (x-1)(x-1) + 2(x-1) == x.x - 1 two u32 can be added to a u64 */
-	x = mul_add(a_lo, b_lo, c);
-	y = mul_add(a_lo, b_hi, c >> 32);
+	x = mul_add(a, b, c);
+	y = mul_add(a, b >> 32, c >> 32);
 	y = add_u64_u32(y, x >> 32);
-	z = mul_add(a_hi, b_hi, y >> 32);
-	y = mul_add(a_hi, b_lo, y);
-	z = add_u64_u32(z, y >> 32);
-	x = (y << 32) + (u32)x;
-
-	u64 n_lo = x, n_hi = z;
+	z = mul_add(a >> 32, b >> 32, y >> 32);
+	y = mul_add(a >> 32, b, y);
+	*p_lo = (y << 32) + (u32)x;
+	return add_u64_u32(z, y >> 32);
+}
 
 #endif
 
+u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
+{
+	u64 n_lo, n_hi;
+
+	if (WARN_ONCE(!d, "%s: division of (%llx * %llx + %llx) by zero, returning 0",
+		      __func__, a, b, c )) {
+		/*
+		 * Return 0 (rather than ~(u64)0) because it is less likely to
+		 * have unexpected side effects.
+		 */
+		return 0;
+	}
+
+	n_hi = mul_u64_u64_add_u64(&n_lo, a, b, c);
 	if (!n_hi)
 		return div64_u64(n_lo, d);
 
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code
  2025-06-14  9:53 [PATCH v3 next 00/10] Implement mul_u64_u64_div_u64_roundup() David Laight
                   ` (7 preceding siblings ...)
  2025-06-14  9:53 ` [PATCH v3 next 08/10] lib: mul_u64_u64_div_u64() Separate multiply to a helper for clarity David Laight
@ 2025-06-14  9:53 ` David Laight
  2025-06-17  4:16   ` Nicolas Pitre
  2025-07-09 14:24   ` David Laight
  2025-06-14  9:53 ` [PATCH v3 next 10/10] lib: test_mul_u64_u64_div_u64: Test the 32bit code on 64bit David Laight
  2025-06-14 10:27 ` [PATCH v3 next 00/10] Implement mul_u64_u64_div_u64_roundup() Peter Zijlstra
  10 siblings, 2 replies; 44+ messages in thread
From: David Laight @ 2025-06-14  9:53 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: David Laight, u.kleine-koenig, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das

Replace the bit by bit algorithm with one that generates 16 bits
per iteration on 32bit architectures and 32 bits on 64bit ones.

On my zen 5 this reduces the time for the tests (using the generic
code) from ~3350ns to ~1000ns.

Running the 32bit algorithm on 64bit x86 takes ~1500ns.
It'll be slightly slower on a real 32bit system, mostly due
to register pressure.

The savings for 32bit x86 are much higher (tested in userspace).
The worst case (lots of bits in the quotient) drops from ~900 clocks
to ~130 (pretty much independent of the arguments).
Other 32bit architectures may see better savings.

It is possible to optimise for divisors that span less than
__LONG_WIDTH__/2 bits. However I suspect they don't happen that often,
and it doesn't remove any slow cpu divide instructions, which dominate
the result.
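
For anyone who wants to see the scheme in isolation, here is a
scaled-down userspace sketch (illustration only, not the kernel code):
a 64bit dividend divided by a 32bit divisor, generating 16 quotient
bits per iteration with the same 'guess the digit from the top divisor
bits, then correct upwards' idea (minus the inverted-remainder trick
the real code uses):

#include <stdint.h>
#include <stdio.h>

/* Requires d != 0 and (n >> 32) < d so the quotient fits in 32 bits */
static uint32_t div_64_by_32_digitwise(uint64_t n, uint32_t d)
{
	int shift = __builtin_clz(d);
	uint64_t dd = (uint64_t)d << shift;	/* left align the divisor */
	uint64_t rem = n << shift;		/* < dd << 32, cannot overflow */
	uint32_t d_msig = (uint32_t)(dd >> 16) + 1;	/* guess is never too high */
	uint32_t q = 0;

	for (int pos = 16; pos >= 0; pos -= 16) {
		/* Guess the next 16bit digit from the top 32 remainder bits */
		uint32_t q_digit = (uint32_t)(rem >> (pos + 16)) / d_msig;

		rem -= ((uint64_t)q_digit * dd) << pos;
		/* The guess can only be low, fix it up (a couple of steps at most) */
		while (rem >= (dd << pos)) {
			q_digit++;
			rem -= dd << pos;
		}
		q = (q << 16) | q_digit;
	}
	return q;
}

int main(void)
{
	uint64_t n = 0x0123456789abcdefull;
	uint32_t d = 0x87654321;

	/* both lines should print the same value (36092162 here) */
	printf("%u\n", div_64_by_32_digitwise(n, d));
	printf("%llu\n", (unsigned long long)(n / d));
	return 0;
}

The real code does the same thing with the 128bit remainder held in
n_hi/n_lo, and keeps the remainder negated so it can add the
q_digit * d product instead of subtracting it.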

Signed-off-by: David Laight <david.laight.linux@gmail.com>
---

new patch for v3.

 lib/math/div64.c | 124 +++++++++++++++++++++++++++++++++--------------
 1 file changed, 87 insertions(+), 37 deletions(-)

diff --git a/lib/math/div64.c b/lib/math/div64.c
index fb77fd9d999d..bb318ff2ad87 100644
--- a/lib/math/div64.c
+++ b/lib/math/div64.c
@@ -221,11 +221,37 @@ static inline u64 mul_u64_u64_add_u64(u64 *p_lo, u64 a, u64 b, u64 c)
 
 #endif
 
+#ifndef BITS_PER_ITER
+#define BITS_PER_ITER (__LONG_WIDTH__ >= 64 ? 32 : 16)
+#endif
+
+#if BITS_PER_ITER == 32
+#define mul_u64_long_add_u64(p_lo, a, b, c) mul_u64_u64_add_u64(p_lo, a, b, c)
+#define add_u64_long(a, b) ((a) + (b))
+
+#else
+static inline u32 mul_u64_long_add_u64(u64 *p_lo, u64 a, u32 b, u64 c)
+{
+	u64 n_lo = mul_add(a, b, c);
+	u64 n_med = mul_add(a >> 32, b, c >> 32);
+
+	n_med = add_u64_u32(n_med, n_lo >> 32);
+	*p_lo = n_med << 32 | (u32)n_lo;
+	return n_med >> 32;
+}
+
+#define add_u64_long(a, b) add_u64_u32(a, b)
+
+#endif
+
 u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
 {
-	u64 n_lo, n_hi;
+	unsigned long d_msig, q_digit;
+	unsigned int reps, d_z_hi;
+	u64 quotient, n_lo, n_hi;
+	u32 overflow;
 
 	if (WARN_ONCE(!d, "%s: division of (%#llx * %#llx + %#llx) by zero, returning 0",
 		      __func__, a, b, c )) {
 		/*
 		 * Return 0 (rather than ~(u64)0) because it is less likely to
@@ -243,46 +269,70 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
 		      __func__, a, b, c, n_hi, n_lo, d))
 		return ~(u64)0;
 
-	int shift = __builtin_ctzll(d);
-
-	/* try reducing the fraction in case the dividend becomes <= 64 bits */
-	if ((n_hi >> shift) == 0) {
-		u64 n = shift ? (n_lo >> shift) | (n_hi << (64 - shift)) : n_lo;
-
-		return div64_u64(n, d >> shift);
-		/*
-		 * The remainder value if needed would be:
-		 *   res = div64_u64_rem(n, d >> shift, &rem);
-		 *   rem = (rem << shift) + (n_lo - (n << shift));
-		 */
+	/* Left align the divisor, shifting the dividend to match */
+	d_z_hi = __builtin_clzll(d);
+	if (d_z_hi) {
+		d <<= d_z_hi;
+		n_hi = n_hi << d_z_hi | n_lo >> (64 - d_z_hi);
+		n_lo <<= d_z_hi;
 	}
 
-	/* Do the full 128 by 64 bits division */
-
-	shift = __builtin_clzll(d);
-	d <<= shift;
-
-	int p = 64 + shift;
-	u64 res = 0;
-	bool carry;
+	reps = 64 / BITS_PER_ITER;
+	/* Optimise loop count for small dividends */
+	if (!(u32)(n_hi >> 32)) {
+		reps -= 32 / BITS_PER_ITER;
+		n_hi = n_hi << 32 | n_lo >> 32;
+		n_lo <<= 32;
+	}
+#if BITS_PER_ITER == 16
+	if (!(u32)(n_hi >> 48)) {
+		reps--;
+		n_hi = add_u64_u32(n_hi << 16, n_lo >> 48);
+		n_lo <<= 16;
+	}
+#endif
 
-	do {
-		carry = n_hi >> 63;
-		shift = carry ? 1 : __builtin_clzll(n_hi);
-		if (p < shift)
-			break;
-		p -= shift;
-		n_hi <<= shift;
-		n_hi |= n_lo >> (64 - shift);
-		n_lo <<= shift;
-		if (carry || (n_hi >= d)) {
-			n_hi -= d;
-			res |= 1ULL << p;
+	/* Invert the dividend so we can use add instead of subtract. */
+	n_lo = ~n_lo;
+	n_hi = ~n_hi;
+
+	/*
+	 * Get the most significant BITS_PER_ITER bits of the divisor.
+	 * This is used to get a low 'guestimate' of the quotient digit.
+	 */
+	d_msig = (d >> (64 - BITS_PER_ITER)) + 1;
+
+	/*
+	 * Now do a 'long division' with BITS_PER_ITER bit 'digits'.
+	 * The 'guess' quotient digit can be low and BITS_PER_ITER+1 bits.
+	 * The worst case is dividing ~0 by 0x8000 which requires two subtracts.
+	 */
+	quotient = 0;
+	while (reps--) {
+		q_digit = (unsigned long)(~n_hi >> (64 - 2 * BITS_PER_ITER)) / d_msig;
+		/* Shift 'n' left to align with the product q_digit * d */
+		overflow = n_hi >> (64 - BITS_PER_ITER);
+		n_hi = add_u64_u32(n_hi << BITS_PER_ITER, n_lo >> (64 - BITS_PER_ITER));
+		n_lo <<= BITS_PER_ITER;
+		/* Add product to negated divisor */
+		overflow += mul_u64_long_add_u64(&n_hi, d, q_digit, n_hi);
+		/* Adjust for the q_digit 'guestimate' being low */
+		while (overflow < 0xffffffff >> (32 - BITS_PER_ITER)) {
+			q_digit++;
+			n_hi += d;
+			overflow += n_hi < d;
 		}
-	} while (n_hi);
-	/* The remainder value if needed would be n_hi << p */
+		quotient = add_u64_long(quotient << BITS_PER_ITER, q_digit);
+	}
 
-	return res;
+	/*
+	 * The above only ensures the remainder doesn't overflow,
+	 * it can still be possible to add (aka subtract) another copy
+	 * of the divisor.
+	 */
+	if ((n_hi + d) > n_hi)
+		quotient++;
+	return quotient;
 }
 #if !defined(test_mul_u64_add_u64_div_u64)
 EXPORT_SYMBOL(mul_u64_add_u64_div_u64);
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 next 10/10] lib: test_mul_u64_u64_div_u64: Test the 32bit code on 64bit
  2025-06-14  9:53 [PATCH v3 next 00/10] Implement mul_u64_u64_div_u64_roundup() David Laight
                   ` (8 preceding siblings ...)
  2025-06-14  9:53 ` [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code David Laight
@ 2025-06-14  9:53 ` David Laight
  2025-06-14 10:27 ` [PATCH v3 next 00/10] Implement mul_u64_u64_div_u64_roundup() Peter Zijlstra
  10 siblings, 0 replies; 44+ messages in thread
From: David Laight @ 2025-06-14  9:53 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: David Laight, u.kleine-koenig, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das

There are slight differences in the mul_u64_add_u64_div_u64() code
between 32bit and 64bit systems.

Compile and test the 32bit version on 64bit hosts for better test
coverage.

Signed-off-by: David Laight <david.laight.linux@gmail.com>
---

new patch for v3.

 lib/math/test_mul_u64_u64_div_u64.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/lib/math/test_mul_u64_u64_div_u64.c b/lib/math/test_mul_u64_u64_div_u64.c
index f0134f25cb0d..ff5df742ec8a 100644
--- a/lib/math/test_mul_u64_u64_div_u64.c
+++ b/lib/math/test_mul_u64_u64_div_u64.c
@@ -74,6 +74,10 @@ done
  */
 
 static u64 test_mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d);
+#if __LONG_WIDTH__ >= 64
+#define TEST_32BIT_DIV
+static u64 test_mul_u64_add_u64_div_u64_32bit(u64 a, u64 b, u64 c, u64 d);
+#endif
 
 static int __init test_run(unsigned int fn_no, const char *fn_name)
 {
@@ -100,6 +104,12 @@ static int __init test_run(unsigned int fn_no, const char *fn_name)
 			result = test_mul_u64_add_u64_div_u64(a, b, 0, d);
 			result_up = test_mul_u64_add_u64_div_u64(a, b, d - 1, d);
 			break;
+#ifdef TEST_32BIT_DIV
+		case 2:
+			result = test_mul_u64_add_u64_div_u64_32bit(a, b, 0, d);
+			result_up = test_mul_u64_add_u64_div_u64_32bit(a, b, d - 1, d);
+			break;
+#endif
 		}
 
 		tests += 2;
@@ -131,6 +141,10 @@ static int __init test_init(void)
 		return -EINVAL;
 	if (test_run(1, "test_mul_u64_u64_div_u64"))
 		return -EINVAL;
+#ifdef TEST_32BIT_DIV
+	if (test_run(2, "test_mul_u64_u64_div_u64_32bit"))
+		return -EINVAL;
+#endif
 	return 0;
 }
 
@@ -149,6 +163,21 @@ static void __exit test_exit(void)
 
 #include "div64.c"
 
+#ifdef TEST_32BIT_DIV
+/* Recompile the generic code for 32bit long */
+#undef test_mul_u64_add_u64_div_u64
+#define test_mul_u64_add_u64_div_u64 test_mul_u64_add_u64_div_u64_32bit
+#undef BITS_PER_ITER
+#define BITS_PER_ITER 16
+
+#define mul_add mul_add_32bit
+#define mul_u64_u64_add_u64 mul_u64_u64_add_u64_32bit
+#undef mul_u64_long_add_u64
+#undef add_u64_long
+
+#include "div64.c"
+#endif
+
 module_init(test_init);
 module_exit(test_exit);
 
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 00/10] Implement mul_u64_u64_div_u64_roundup()
  2025-06-14  9:53 [PATCH v3 next 00/10] Implement mul_u64_u64_div_u64_roundup() David Laight
                   ` (9 preceding siblings ...)
  2025-06-14  9:53 ` [PATCH v3 next 10/10] lib: test_mul_u64_u64_div_u64: Test the 32bit code on 64bit David Laight
@ 2025-06-14 10:27 ` Peter Zijlstra
  2025-06-14 11:59   ` David Laight
  10 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2025-06-14 10:27 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Nicolas Pitre,
	Oleg Nesterov, Biju Das

On Sat, Jun 14, 2025 at 10:53:36AM +0100, David Laight wrote:
> The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
> This can be done simply by adding 'divisor - 1' to the 128bit product.
> Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
> existing code.
> Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
> mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).

Another way to achieve the same would be to expose the remainder of
mul_u64_u64_div_u64(), no?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 00/10] Implement mul_u64_u64_div_u64_roundup()
  2025-06-14 10:27 ` [PATCH v3 next 00/10] Implement mul_u64_u64_div_u64_roundup() Peter Zijlstra
@ 2025-06-14 11:59   ` David Laight
  0 siblings, 0 replies; 44+ messages in thread
From: David Laight @ 2025-06-14 11:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Nicolas Pitre,
	Oleg Nesterov, Biju Das

On Sat, 14 Jun 2025 12:27:17 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Sat, Jun 14, 2025 at 10:53:36AM +0100, David Laight wrote:
> > The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
> > This can be done simply by adding 'divisor - 1' to the 128bit product.
> > Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
> > existing code.
> > Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
> > mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).  
> 
> Another way to achieve the same would be to expose the remainder of
> mul_u64_u64_div_u64(), no?

The optimised paths for small products don't give the remainder.
And you don't want to do the multiply again to generate it.

As well as the conditional increment you'd also need another test
for the quotient overflowing.

Adding 'c' to the product is cheap.
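
Roughly, a remainder based version would push something like this onto
every caller (hypothetical function name, just to illustrate):

	u64 rem;
	u64 q = mul_u64_u64_div_u64_rem(a, b, d, &rem);

	if (rem) {
		/* also need to check q != ~(u64)0 before rounding up */
		q++;
	}

whereas adding 'c' to the 128bit product before the divide is a single
add/adc pair.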

	David

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 03/10] lib: mul_u64_u64_div_u64() simplify check for a 64bit product
  2025-06-14  9:53 ` [PATCH v3 next 03/10] lib: mul_u64_u64_div_u64() simplify check for a 64bit product David Laight
@ 2025-06-14 14:01   ` Nicolas Pitre
  0 siblings, 0 replies; 44+ messages in thread
From: Nicolas Pitre @ 2025-06-14 14:01 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Sat, 14 Jun 2025, David Laight wrote:

> If the product is only 64bits div64_u64() can be used for the divide.
> Replace the pre-multiply check (ilog2(a) + ilog2(b) <= 62) with a
> simple post-multiply check that the high 64bits are zero.
> 
> This has the advantage of being simpler, more accurate and less code.
> It will always be faster when the product is larger than 64bits.
> 
> Most 64bit cpu have a native 64x64=128 bit multiply, this is needed
> (for the low 64bits) even when div64_u64() is called - so the early
> check gains nothing and is just extra code.
> 
> 32bit cpu will need a compare (etc) to generate the 64bit ilog2()
> from two 32bit bit scans - so that is non-trivial.
> (Never mind the mess of x86's 'bsr' and any oddball cpu without
> fast bit-scan instructions.)
> Whereas the additional instructions for the 128bit multiply result
> are pretty much one multiply and two adds (typically the 'adc $0,%reg'
> can be run in parallel with the instruction that follows).
> 
> The only outliers are 64bit systems without 128bit mutiply and
> simple in order 32bit ones with fast bit scan but needing extra
> instructions to get the high bits of the multiply result.
> I doubt it makes much difference to either, the latter is definitely
> not mainsteam.

mainstream*

> Split from patch 3 of v2 of this series.
> 
> If anyone is worried about the analysis they can look at the
> generated code for x86 (especially when cmov isn't used).
> 
> Signed-off-by: David Laight <david.laight.linux@gmail.com>

Reviewed-by: Nicolas Pitre <npitre@baylibre.com>

> ---
>  lib/math/div64.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/math/div64.c b/lib/math/div64.c
> index 397578dc9a0b..ed9475b9e1ef 100644
> --- a/lib/math/div64.c
> +++ b/lib/math/div64.c
> @@ -196,9 +196,6 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
>  		return 0;
>  	}
>  
> -	if (ilog2(a) + ilog2(b) <= 62)
> -		return div64_u64(a * b, d);
> -
>  #if defined(__SIZEOF_INT128__)
>  
>  	/* native 64x64=128 bits multiplication */
> @@ -222,6 +219,9 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
>  
>  #endif
>  
> +	if (!n_hi)
> +		return div64_u64(n_lo, d);
> +
>  	if (WARN_ONCE(n_hi >= d,
>  		      "%s: division of (%#llx * %#llx = %#llx%016llx) by %#llx overflows, returning ~0",
>  		      __func__, a, b, n_hi, n_lo, d))
> -- 
> 2.39.5
> 
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 04/10] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
  2025-06-14  9:53 ` [PATCH v3 next 04/10] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup() David Laight
@ 2025-06-14 14:06   ` Nicolas Pitre
  0 siblings, 0 replies; 44+ messages in thread
From: Nicolas Pitre @ 2025-06-14 14:06 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Sat, 14 Jun 2025, David Laight wrote:

> The existing mul_u64_u64_div_u64() rounds down, a 'rounding up'
> variant needs 'divisor - 1' adding in between the multiply and
> divide so cannot easily be done by a caller.
> 
> Add mul_u64_add_u64_div_u64(a, b, c, d) that calculates (a * b + c)/d
> and implement the 'round down' and 'round up' using it.
> 
> Update the x86-64 asm to optimise for 'c' being a constant zero.
> 
> Add kerndoc definitions for all three functions.
> 
> Signed-off-by: David Laight <david.laight.linux@gmail.com>

Reviewed-by: Nicolas Pitre <npitre@baylibre.com>


> 
> Changes for v2 (formerly patch 1/3):
> - Reinstate the early call to div64_u64() on 32bit when 'c' is zero.
>   Although I'm not convinced the path is common enough to be worth
>   the two ilog2() calls.
> 
> Changes for v3 (formerly patch 3/4):
> - The early call to div64_u64() has been removed by patch 3.
>   Pretty much guaranteed to be a pessimisation.

Might get rid of the above in the log. Justification is in the previous 
patch.

> Signed-off-by: David Laight <david.laight.linux@gmail.com>

Double signoff.

> ---
>  arch/x86/include/asm/div64.h | 19 ++++++++++-----
>  include/linux/math64.h       | 45 +++++++++++++++++++++++++++++++++++-
>  lib/math/div64.c             | 22 ++++++++++--------
>  3 files changed, 69 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/x86/include/asm/div64.h b/arch/x86/include/asm/div64.h
> index 9931e4c7d73f..7a0a916a2d7d 100644
> --- a/arch/x86/include/asm/div64.h
> +++ b/arch/x86/include/asm/div64.h
> @@ -84,21 +84,28 @@ static inline u64 mul_u32_u32(u32 a, u32 b)
>   * Will generate an #DE when the result doesn't fit u64, could fix with an
>   * __ex_table[] entry when it becomes an issue.
>   */
> -static inline u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div)
> +static inline u64 mul_u64_add_u64_div_u64(u64 a, u64 mul, u64 add, u64 div)
>  {
>  	u64 q;
>  
> -	asm ("mulq %2; divq %3" : "=a" (q)
> -				: "a" (a), "rm" (mul), "rm" (div)
> -				: "rdx");
> +	if (statically_true(!add)) {
> +		asm ("mulq %2; divq %3" : "=a" (q)
> +					: "a" (a), "rm" (mul), "rm" (div)
> +					: "rdx");
> +	} else {
> +		asm ("mulq %2; addq %3, %%rax; adcq $0, %%rdx; divq %4"
> +			: "=a" (q)
> +			: "a" (a), "rm" (mul), "rm" (add), "rm" (div)
> +			: "rdx");
> +	}
>  
>  	return q;
>  }
> -#define mul_u64_u64_div_u64 mul_u64_u64_div_u64
> +#define mul_u64_add_u64_div_u64 mul_u64_add_u64_div_u64
>  
>  static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 div)
>  {
> -	return mul_u64_u64_div_u64(a, mul, div);
> +	return mul_u64_add_u64_div_u64(a, mul, 0, div);
>  }
>  #define mul_u64_u32_div	mul_u64_u32_div
>  
> diff --git a/include/linux/math64.h b/include/linux/math64.h
> index 6aaccc1626ab..e1c2e3642cec 100644
> --- a/include/linux/math64.h
> +++ b/include/linux/math64.h
> @@ -282,7 +282,53 @@ static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 divisor)
>  }
>  #endif /* mul_u64_u32_div */
>  
> -u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div);
> +/**
> + * mul_u64_add_u64_div_u64 - unsigned 64bit multiply, add, and divide
> + * @a: first unsigned 64bit multiplicand
> + * @b: second unsigned 64bit multiplicand
> + * @c: unsigned 64bit addend
> + * @d: unsigned 64bit divisor
> + *
> + * Multiply two 64bit values together to generate a 128bit product
> + * add a third value and then divide by a fourth.
> + * Generic code returns 0 if @d is zero and ~0 if the quotient exceeds 64 bits.
> + * Architecture specific code may trap on zero or overflow.
> + *
> + * Return: (@a * @b + @c) / @d
> + */
> +u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d);
> +
> +/**
> + * mul_u64_u64_div_u64 - unsigned 64bit multiply and divide
> + * @a: first unsigned 64bit multiplicand
> + * @b: second unsigned 64bit multiplicand
> + * @d: unsigned 64bit divisor
> + *
> + * Multiply two 64bit values together to generate a 128bit product
> + * and then divide by a third value.
> + * Generic code returns 0 if @d is zero and ~0 if the quotient exceeds 64 bits.
> + * Architecture specific code may trap on zero or overflow.
> + *
> + * Return: @a * @b / @d
> + */
> +#define mul_u64_u64_div_u64(a, b, d) mul_u64_add_u64_div_u64(a, b, 0, d)
> +
> +/**
> + * mul_u64_u64_div_u64_roundup - unsigned 64bit multiply and divide rounded up
> + * @a: first unsigned 64bit multiplicand
> + * @b: second unsigned 64bit multiplicand
> + * @d: unsigned 64bit divisor
> + *
> + * Multiply two 64bit values together to generate a 128bit product
> + * and then divide and round up.
> + * Generic code returns 0 if @d is zero and ~0 if the quotient exceeds 64 bits.
> + * Architecture specific code may trap on zero or overflow.
> + *
> + * Return: (@a * @b + @d - 1) / @d
> + */
> +#define mul_u64_u64_div_u64_roundup(a, b, d) \
> +	({ u64 _tmp = (d); mul_u64_add_u64_div_u64(a, b, _tmp - 1, _tmp); })
> +
>  
>  /**
>   * DIV64_U64_ROUND_UP - unsigned 64bit divide with 64bit divisor rounded up
> diff --git a/lib/math/div64.c b/lib/math/div64.c
> index ed9475b9e1ef..7850cc0a7596 100644
> --- a/lib/math/div64.c
> +++ b/lib/math/div64.c
> @@ -184,11 +184,11 @@ u32 iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
>  }
>  EXPORT_SYMBOL(iter_div_u64_rem);
>  
> -#ifndef mul_u64_u64_div_u64
> -u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
> +#ifndef mul_u64_add_u64_div_u64
> +u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
>  {
> -	if (WARN_ONCE(!d, "%s: division of (%#llx * %#llx) by zero, returning 0",
> -		      __func__, a, b)) {
> +	if (WARN_ONCE(!d, "%s: division of (%#llx * %#llx + %#llx) by zero, returning 0",
> +		      __func__, a, b, c)) {
>  		/*
>  		 * Return 0 (rather than ~(u64)0) because it is less likely to
>  		 * have unexpected side effects.
> @@ -199,7 +199,7 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
>  #if defined(__SIZEOF_INT128__)
>  
>  	/* native 64x64=128 bits multiplication */
> -	u128 prod = (u128)a * b;
> +	u128 prod = (u128)a * b + c;
>  	u64 n_lo = prod, n_hi = prod >> 64;
>  
>  #else
> @@ -208,8 +208,10 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
>  	u32 a_lo = a, a_hi = a >> 32, b_lo = b, b_hi = b >> 32;
>  	u64 x, y, z;
>  
> -	x = (u64)a_lo * b_lo;
> -	y = (u64)a_lo * b_hi + (u32)(x >> 32);
> +	/* Since (x-1)(x-1) + 2(x-1) == x.x - 1 two u32 can be added to a u64 */
> +	x = (u64)a_lo * b_lo + (u32)c;
> +	y = (u64)a_lo * b_hi + (u32)(c >> 32);
> +	y += (u32)(x >> 32);
>  	z = (u64)a_hi * b_hi + (u32)(y >> 32);
>  	y = (u64)a_hi * b_lo + (u32)y;
>  	z += (u32)(y >> 32);
> @@ -223,8 +225,8 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
>  		return div64_u64(n_lo, d);
>  
>  	if (WARN_ONCE(n_hi >= d,
> -		      "%s: division of (%#llx * %#llx = %#llx%016llx) by %#llx overflows, returning ~0",
> -		      __func__, a, b, n_hi, n_lo, d))
> +		      "%s: division of (%#llx * %#llx + %#llx = %#llx%016llx) by %#llx overflows, returning ~0",
> +		      __func__, a, b, c, n_hi, n_lo, d))
>  		return ~(u64)0;
>  
>  	int shift = __builtin_ctzll(d);
> @@ -268,5 +270,5 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
>  
>  	return res;
>  }
> -EXPORT_SYMBOL(mul_u64_u64_div_u64);
> +EXPORT_SYMBOL(mul_u64_add_u64_div_u64);
>  #endif
> -- 
> 2.39.5
> 
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 02/10] lib: mul_u64_u64_div_u64() Use WARN_ONCE() for divide errors.
  2025-06-14  9:53 ` [PATCH v3 next 02/10] lib: mul_u64_u64_div_u64() Use WARN_ONCE() for divide errors David Laight
@ 2025-06-14 15:17   ` Nicolas Pitre
  2025-06-14 21:26     ` David Laight
  0 siblings, 1 reply; 44+ messages in thread
From: Nicolas Pitre @ 2025-06-14 15:17 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Nicolas Pitre

On Sat, 14 Jun 2025, David Laight wrote:

> Do an explicit WARN_ONCE(!divisor) instead of hoping the 'undefined
> behaviour' the compiler generates for a compile-time 1/0 is in any
> way useful.
> 
> Return 0 (rather than ~(u64)0) because it is less likely to cause
> further serious issues.

I still disagree with this patch. Whether or not what the compiler 
produces is useful is beside the point. What's important here is to have 
a coherent behavior across all division flavors and what's proposed here 
is not.

Arguably, a compile time 1/0 might not be what we want either. The 
compiler forces an "illegal instruction" exception when what we want is 
a "floating point" exception (strange to have floating point exceptions 
for integer divisions but that's what it is).

So I'd suggest the following instead:

----- >8
From Nicolas Pitre <npitre@baylibre.com>
Subject: [PATCH] mul_u64_u64_div_u64(): improve division-by-zero handling

Forcing 1/0 at compile time makes the compiler (on x86 at least) emit
an undefined instruction to trigger the exception. But that's not what 
we want. Modify the code so that an actual runtime div-by-0 exception
is triggered to be coherent with the behavior of all the other division
flavors.

And don't use 1 for the dividend as the compiler would convert the 
actual division into a simple compare.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>

diff --git a/lib/math/div64.c b/lib/math/div64.c
index 5faa29208bdb..e6839b40e271 100644
--- a/lib/math/div64.c
+++ b/lib/math/div64.c
@@ -212,12 +212,12 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
 
 #endif
 
-	/* make sure c is not zero, trigger exception otherwise */
-#pragma GCC diagnostic push
-#pragma GCC diagnostic ignored "-Wdiv-by-zero"
-	if (unlikely(c == 0))
-		return 1/0;
-#pragma GCC diagnostic pop
+	/* make sure c is not zero, trigger runtime exception otherwise */
+	if (unlikely(c == 0)) {
+		unsigned long zero = 0;
+		asm ("" : "+r" (zero)); /* hide actual value from the compiler */
+		return ~0UL/zero;
+	}
 
 	int shift = __builtin_ctzll(c);
 

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 05/10] lib: Add tests for mul_u64_u64_div_u64_roundup()
  2025-06-14  9:53 ` [PATCH v3 next 05/10] lib: Add tests for mul_u64_u64_div_u64_roundup() David Laight
@ 2025-06-14 15:19   ` Nicolas Pitre
  2025-06-17  4:30   ` Nicolas Pitre
  1 sibling, 0 replies; 44+ messages in thread
From: Nicolas Pitre @ 2025-06-14 15:19 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Sat, 14 Jun 2025, David Laight wrote:

> Replicate the existing mul_u64_u64_div_u64() test cases with round up.
> Update the shell script that verifies the table, remove the comment
> markers so that it can be directly pasted into a shell.
> 
> Rename the divisor from 'c' to 'd' to match mul_u64_add_u64_div_u64().
> 
> If any tests fail then fail the module load with -EINVAL.
> 
> Signed-off-by: David Laight <david.laight.linux@gmail.com>

Reviewed-by: Nicolas Pitre <npitre@baylibre.com>

> ---
> 
> Changes for v3:
> - Rename 'c' to 'd' to match mul_u64_add_u64_div_u64()
> 
>  lib/math/test_mul_u64_u64_div_u64.c | 122 +++++++++++++++++-----------
>  1 file changed, 73 insertions(+), 49 deletions(-)
> 
> diff --git a/lib/math/test_mul_u64_u64_div_u64.c b/lib/math/test_mul_u64_u64_div_u64.c
> index 58d058de4e73..ea5b703cccff 100644
> --- a/lib/math/test_mul_u64_u64_div_u64.c
> +++ b/lib/math/test_mul_u64_u64_div_u64.c
> @@ -10,61 +10,73 @@
>  #include <linux/printk.h>
>  #include <linux/math64.h>
>  
> -typedef struct { u64 a; u64 b; u64 c; u64 result; } test_params;
> +typedef struct { u64 a; u64 b; u64 d; u64 result; uint round_up;} test_params;
>  
>  static test_params test_values[] = {
>  /* this contains many edge values followed by a couple random values */
> -{                0xb,                0x7,                0x3,               0x19 },
> -{         0xffff0000,         0xffff0000,                0xf, 0x1110eeef00000000 },
> -{         0xffffffff,         0xffffffff,                0x1, 0xfffffffe00000001 },
> -{         0xffffffff,         0xffffffff,                0x2, 0x7fffffff00000000 },
> -{        0x1ffffffff,         0xffffffff,                0x2, 0xfffffffe80000000 },
> -{        0x1ffffffff,         0xffffffff,                0x3, 0xaaaaaaa9aaaaaaab },
> -{        0x1ffffffff,        0x1ffffffff,                0x4, 0xffffffff00000000 },
> -{ 0xffff000000000000, 0xffff000000000000, 0xffff000000000001, 0xfffeffffffffffff },
> -{ 0x3333333333333333, 0x3333333333333333, 0x5555555555555555, 0x1eb851eb851eb851 },
> -{ 0x7fffffffffffffff,                0x2,                0x3, 0x5555555555555554 },
> -{ 0xffffffffffffffff,                0x2, 0x8000000000000000,                0x3 },
> -{ 0xffffffffffffffff,                0x2, 0xc000000000000000,                0x2 },
> -{ 0xffffffffffffffff, 0x4000000000000004, 0x8000000000000000, 0x8000000000000007 },
> -{ 0xffffffffffffffff, 0x4000000000000001, 0x8000000000000000, 0x8000000000000001 },
> -{ 0xffffffffffffffff, 0x8000000000000001, 0xffffffffffffffff, 0x8000000000000001 },
> -{ 0xfffffffffffffffe, 0x8000000000000001, 0xffffffffffffffff, 0x8000000000000000 },
> -{ 0xffffffffffffffff, 0x8000000000000001, 0xfffffffffffffffe, 0x8000000000000001 },
> -{ 0xffffffffffffffff, 0x8000000000000001, 0xfffffffffffffffd, 0x8000000000000002 },
> -{ 0x7fffffffffffffff, 0xffffffffffffffff, 0xc000000000000000, 0xaaaaaaaaaaaaaaa8 },
> -{ 0xffffffffffffffff, 0x7fffffffffffffff, 0xa000000000000000, 0xccccccccccccccca },
> -{ 0xffffffffffffffff, 0x7fffffffffffffff, 0x9000000000000000, 0xe38e38e38e38e38b },
> -{ 0x7fffffffffffffff, 0x7fffffffffffffff, 0x5000000000000000, 0xccccccccccccccc9 },
> -{ 0xffffffffffffffff, 0xfffffffffffffffe, 0xffffffffffffffff, 0xfffffffffffffffe },
> -{ 0xe6102d256d7ea3ae, 0x70a77d0be4c31201, 0xd63ec35ab3220357, 0x78f8bf8cc86c6e18 },
> -{ 0xf53bae05cb86c6e1, 0x3847b32d2f8d32e0, 0xcfd4f55a647f403c, 0x42687f79d8998d35 },
> -{ 0x9951c5498f941092, 0x1f8c8bfdf287a251, 0xa3c8dc5f81ea3fe2, 0x1d887cb25900091f },
> -{ 0x374fee9daa1bb2bb, 0x0d0bfbff7b8ae3ef, 0xc169337bd42d5179, 0x03bb2dbaffcbb961 },
> -{ 0xeac0d03ac10eeaf0, 0x89be05dfa162ed9b, 0x92bb1679a41f0e4b, 0xdc5f5cc9e270d216 },
> +{                0xb,                0x7,                0x3,               0x19, 1 },
> +{         0xffff0000,         0xffff0000,                0xf, 0x1110eeef00000000, 0 },
> +{         0xffffffff,         0xffffffff,                0x1, 0xfffffffe00000001, 0 },
> +{         0xffffffff,         0xffffffff,                0x2, 0x7fffffff00000000, 1 },
> +{        0x1ffffffff,         0xffffffff,                0x2, 0xfffffffe80000000, 1 },
> +{        0x1ffffffff,         0xffffffff,                0x3, 0xaaaaaaa9aaaaaaab, 0 },
> +{        0x1ffffffff,        0x1ffffffff,                0x4, 0xffffffff00000000, 1 },
> +{ 0xffff000000000000, 0xffff000000000000, 0xffff000000000001, 0xfffeffffffffffff, 1 },
> +{ 0x3333333333333333, 0x3333333333333333, 0x5555555555555555, 0x1eb851eb851eb851, 1 },
> +{ 0x7fffffffffffffff,                0x2,                0x3, 0x5555555555555554, 1 },
> +{ 0xffffffffffffffff,                0x2, 0x8000000000000000,                0x3, 1 },
> +{ 0xffffffffffffffff,                0x2, 0xc000000000000000,                0x2, 1 },
> +{ 0xffffffffffffffff, 0x4000000000000004, 0x8000000000000000, 0x8000000000000007, 1 },
> +{ 0xffffffffffffffff, 0x4000000000000001, 0x8000000000000000, 0x8000000000000001, 1 },
> +{ 0xffffffffffffffff, 0x8000000000000001, 0xffffffffffffffff, 0x8000000000000001, 0 },
> +{ 0xfffffffffffffffe, 0x8000000000000001, 0xffffffffffffffff, 0x8000000000000000, 1 },
> +{ 0xffffffffffffffff, 0x8000000000000001, 0xfffffffffffffffe, 0x8000000000000001, 1 },
> +{ 0xffffffffffffffff, 0x8000000000000001, 0xfffffffffffffffd, 0x8000000000000002, 1 },
> +{ 0x7fffffffffffffff, 0xffffffffffffffff, 0xc000000000000000, 0xaaaaaaaaaaaaaaa8, 1 },
> +{ 0xffffffffffffffff, 0x7fffffffffffffff, 0xa000000000000000, 0xccccccccccccccca, 1 },
> +{ 0xffffffffffffffff, 0x7fffffffffffffff, 0x9000000000000000, 0xe38e38e38e38e38b, 1 },
> +{ 0x7fffffffffffffff, 0x7fffffffffffffff, 0x5000000000000000, 0xccccccccccccccc9, 1 },
> +{ 0xffffffffffffffff, 0xfffffffffffffffe, 0xffffffffffffffff, 0xfffffffffffffffe, 0 },
> +{ 0xe6102d256d7ea3ae, 0x70a77d0be4c31201, 0xd63ec35ab3220357, 0x78f8bf8cc86c6e18, 1 },
> +{ 0xf53bae05cb86c6e1, 0x3847b32d2f8d32e0, 0xcfd4f55a647f403c, 0x42687f79d8998d35, 1 },
> +{ 0x9951c5498f941092, 0x1f8c8bfdf287a251, 0xa3c8dc5f81ea3fe2, 0x1d887cb25900091f, 1 },
> +{ 0x374fee9daa1bb2bb, 0x0d0bfbff7b8ae3ef, 0xc169337bd42d5179, 0x03bb2dbaffcbb961, 1 },
> +{ 0xeac0d03ac10eeaf0, 0x89be05dfa162ed9b, 0x92bb1679a41f0e4b, 0xdc5f5cc9e270d216, 1 },
>  };
>  
>  /*
>   * The above table can be verified with the following shell script:
> - *
> - * #!/bin/sh
> - * sed -ne 's/^{ \+\(.*\), \+\(.*\), \+\(.*\), \+\(.*\) },$/\1 \2 \3 \4/p' \
> - *     lib/math/test_mul_u64_u64_div_u64.c |
> - * while read a b c r; do
> - *   expected=$( printf "obase=16; ibase=16; %X * %X / %X\n" $a $b $c | bc )
> - *   given=$( printf "%X\n" $r )
> - *   if [ "$expected" = "$given" ]; then
> - *     echo "$a * $b / $c = $r OK"
> - *   else
> - *     echo "$a * $b / $c = $r is wrong" >&2
> - *     echo "should be equivalent to 0x$expected" >&2
> - *     exit 1
> - *   fi
> - * done
> +
> +#!/bin/sh
> +sed -ne 's/^{ \+\(.*\), \+\(.*\), \+\(.*\), \+\(.*\), \+\(.*\) },$/\1 \2 \3 \4 \5/p' \
> +    lib/math/test_mul_u64_u64_div_u64.c |
> +while read a b d r d; do
> +  expected=$( printf "obase=16; ibase=16; %X * %X / %X\n" $a $b $d | bc )
> +  given=$( printf "%X\n" $r )
> +  if [ "$expected" = "$given" ]; then
> +    echo "$a * $b  / $d = $r OK"
> +  else
> +    echo "$a * $b  / $d = $r is wrong" >&2
> +    echo "should be equivalent to 0x$expected" >&2
> +    exit 1
> +  fi
> +  expected=$( printf "obase=16; ibase=16; (%X * %X + %X) / %X\n" $a $b $((d-1)) $d | bc )
> +  given=$( printf "%X\n" $((r + d)) )
> +  if [ "$expected" = "$given" ]; then
> +    echo "$a * $b +/ $d = $(printf '%#x' $((r + d))) OK"
> +  else
> +    echo "$a * $b +/ $d = $(printf '%#x' $((r + d))) is wrong" >&2
> +    echo "should be equivalent to 0x$expected" >&2
> +    exit 1
> +  fi
> +done
> +
>   */
>  
>  static int __init test_init(void)
>  {
> +	int errors = 0;
> +	int tests = 0;
>  	int i;
>  
>  	pr_info("Starting mul_u64_u64_div_u64() test\n");
> @@ -72,19 +84,31 @@ static int __init test_init(void)
>  	for (i = 0; i < ARRAY_SIZE(test_values); i++) {
>  		u64 a = test_values[i].a;
>  		u64 b = test_values[i].b;
> -		u64 c = test_values[i].c;
> +		u64 d = test_values[i].d;
>  		u64 expected_result = test_values[i].result;
> -		u64 result = mul_u64_u64_div_u64(a, b, c);
> +		u64 result = mul_u64_u64_div_u64(a, b, d);
> +		u64 result_up = mul_u64_u64_div_u64_roundup(a, b, d);
> +
> +		tests += 2;
>  
>  		if (result != expected_result) {
> -			pr_err("ERROR: 0x%016llx * 0x%016llx / 0x%016llx\n", a, b, c);
> +			pr_err("ERROR: 0x%016llx * 0x%016llx / 0x%016llx\n", a, b, d);
>  			pr_err("ERROR: expected result: %016llx\n", expected_result);
>  			pr_err("ERROR: obtained result: %016llx\n", result);
> +			errors++;
> +		}
> +		expected_result += test_values[i].round_up;
> +		if (result_up != expected_result) {
> +			pr_err("ERROR: 0x%016llx * 0x%016llx +/ 0x%016llx\n", a, b, d);
> +			pr_err("ERROR: expected result: %016llx\n", expected_result);
> +			pr_err("ERROR: obtained result: %016llx\n", result_up);
> +			errors++;
>  		}
>  	}
>  
> -	pr_info("Completed mul_u64_u64_div_u64() test\n");
> -	return 0;
> +	pr_info("Completed mul_u64_u64_div_u64() test, %d tests, %d errors\n",
> +		tests, errors);
> +	return errors ? -EINVAL : 0;
>  }
>  
>  static void __exit test_exit(void)
> -- 
> 2.39.5
> 
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 06/10] lib: test_mul_u64_u64_div_u64: Test both generic and arch versions
  2025-06-14  9:53 ` [PATCH v3 next 06/10] lib: test_mul_u64_u64_div_u64: Test both generic and arch versions David Laight
@ 2025-06-14 15:25   ` Nicolas Pitre
  2025-06-18  1:39     ` Nicolas Pitre
  0 siblings, 1 reply; 44+ messages in thread
From: Nicolas Pitre @ 2025-06-14 15:25 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Sat, 14 Jun 2025, David Laight wrote:

> Change the #if in div64.c so that test_mul_u64_u64_div_u64.c
> can compile and test the generic version (including the 'long multiply')
> on architectures (eg amd64) that define their own copy.

Please also include some explanation for iter_div_u64_rem.

> Test the kernel version and the locally compiled version on all arch.
> Output the time taken (in ns) on the 'test completed' trace.
> 
> For reference, on my zen 5, the optimised version takes ~220ns and the
> generic version ~3350ns.
> Using the native multiply saves ~200ns and adding back the ilog2() 'optimisation'
> test adds ~50ms.
> 
> Signed-off-by: David Laight <david.laight.linux@gmail.com>

Reviewed-by: Nicolas Pitre <npitre@baylibre.com>

> ---
> 
> New patch for v3, replacing changes in v1 that were removed for v2.
> 
>  lib/math/div64.c                    |  8 +++--
>  lib/math/test_mul_u64_u64_div_u64.c | 48 ++++++++++++++++++++++++-----
>  2 files changed, 47 insertions(+), 9 deletions(-)
> 
> diff --git a/lib/math/div64.c b/lib/math/div64.c
> index 7850cc0a7596..22433e5565c4 100644
> --- a/lib/math/div64.c
> +++ b/lib/math/div64.c
> @@ -178,13 +178,15 @@ EXPORT_SYMBOL(div64_s64);
>   * Iterative div/mod for use when dividend is not expected to be much
>   * bigger than divisor.
>   */
> +#ifndef iter_div_u64_rem
>  u32 iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
>  {
>  	return __iter_div_u64_rem(dividend, divisor, remainder);
>  }
>  EXPORT_SYMBOL(iter_div_u64_rem);
> +#endif
>  
> -#ifndef mul_u64_add_u64_div_u64
> +#if !defined(mul_u64_add_u64_div_u64) || defined(test_mul_u64_add_u64_div_u64)
>  u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
>  {
>  	if (WARN_ONCE(!d, "%s: division of (%#llx * %#llx + %#llx) by zero, returning 0",
> @@ -196,7 +198,7 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
>  		return 0;
>  	}
>  
> -#if defined(__SIZEOF_INT128__)
> +#if defined(__SIZEOF_INT128__) && !defined(test_mul_u64_add_u64_div_u64)
>  
>  	/* native 64x64=128 bits multiplication */
>  	u128 prod = (u128)a * b + c;
> @@ -270,5 +272,7 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
>  
>  	return res;
>  }
> +#if !defined(test_mul_u64_add_u64_div_u64)
>  EXPORT_SYMBOL(mul_u64_add_u64_div_u64);
>  #endif
> +#endif
> diff --git a/lib/math/test_mul_u64_u64_div_u64.c b/lib/math/test_mul_u64_u64_div_u64.c
> index ea5b703cccff..f0134f25cb0d 100644
> --- a/lib/math/test_mul_u64_u64_div_u64.c
> +++ b/lib/math/test_mul_u64_u64_div_u64.c
> @@ -73,21 +73,34 @@ done
>  
>   */
>  
> -static int __init test_init(void)
> +static u64 test_mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d);
> +
> +static int __init test_run(unsigned int fn_no, const char *fn_name)
>  {
> +	u64 start_time;
>  	int errors = 0;
>  	int tests = 0;
>  	int i;
>  
> -	pr_info("Starting mul_u64_u64_div_u64() test\n");
> +	start_time = ktime_get_ns();
>  
>  	for (i = 0; i < ARRAY_SIZE(test_values); i++) {
>  		u64 a = test_values[i].a;
>  		u64 b = test_values[i].b;
>  		u64 d = test_values[i].d;
>  		u64 expected_result = test_values[i].result;
> -		u64 result = mul_u64_u64_div_u64(a, b, d);
> -		u64 result_up = mul_u64_u64_div_u64_roundup(a, b, d);
> +		u64 result, result_up;
> +
> +		switch (fn_no) {
> +		default:
> +			result = mul_u64_u64_div_u64(a, b, d);
> +			result_up = mul_u64_u64_div_u64_roundup(a, b, d);
> +			break;
> +		case 1:
> +			result = test_mul_u64_add_u64_div_u64(a, b, 0, d);
> +			result_up = test_mul_u64_add_u64_div_u64(a, b, d - 1, d);
> +			break;
> +		}
>  
>  		tests += 2;
>  
> @@ -106,15 +119,36 @@ static int __init test_init(void)
>  		}
>  	}
>  
> -	pr_info("Completed mul_u64_u64_div_u64() test, %d tests, %d errors\n",
> -		tests, errors);
> -	return errors ? -EINVAL : 0;
> +	pr_info("Completed %s() test, %d tests, %d errors, %llu ns\n",
> +		fn_name, tests, errors, ktime_get_ns() - start_time);
> +	return errors;
> +}
> +
> +static int __init test_init(void)
> +{
> +	pr_info("Starting mul_u64_u64_div_u64() test\n");
> +	if (test_run(0, "mul_u64_u64_div_u64"))
> +		return -EINVAL;
> +	if (test_run(1, "test_mul_u64_u64_div_u64"))
> +		return -EINVAL;
> +	return 0;
>  }
>  
>  static void __exit test_exit(void)
>  {
>  }
>  
> +/* Compile the generic mul_u64_add_u64_div_u64() code */
> +#define div64_u64 div64_u64
> +#define div64_s64 div64_s64
> +#define iter_div_u64_rem iter_div_u64_rem
> +
> +#undef mul_u64_add_u64_div_u64
> +#define mul_u64_add_u64_div_u64 test_mul_u64_add_u64_div_u64
> +#define test_mul_u64_add_u64_div_u64 test_mul_u64_add_u64_div_u64
> +
> +#include "div64.c"
> +
>  module_init(test_init);
>  module_exit(test_exit);
>  
> -- 
> 2.39.5
> 
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 07/10] lib: mul_u64_u64_div_u64() optimise multiply on 32bit x86
  2025-06-14  9:53 ` [PATCH v3 next 07/10] lib: mul_u64_u64_div_u64() optimise multiply on 32bit x86 David Laight
@ 2025-06-14 15:31   ` Nicolas Pitre
  0 siblings, 0 replies; 44+ messages in thread
From: Nicolas Pitre @ 2025-06-14 15:31 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Sat, 14 Jun 2025, David Laight wrote:

> gcc generates horrid code for both ((u64)u32_a * u32_b) and (u64_a + u32_b).
> As well as the extra instructions it can generate a lot of spills to stack
> (including spills of constant zeros and even multiplies by constant zero).
> 
> mul_u32_u32() already exists to optimise the multiply.
> Add a similar add_u64_u32() for the addition.
> Disable both for clang - it generates better code without them.
> 
> Use mul_u32_u32() and add_u64_u32() in the 64x64 => 128 multiply
> in mul_u64_add_u64_div_u64().
> 
> Tested by forcing the amd64 build of test_mul_u64_u64_div_u64.ko
> to use the 32bit asm code.
> 
> Signed-off-by: David Laight <david.laight.linux@gmail.com>
> ---
> 
> New patch for v3.
> 
>  arch/x86/include/asm/div64.h | 14 ++++++++++++++
>  include/linux/math64.h       | 11 +++++++++++
>  lib/math/div64.c             | 18 ++++++++++++------
>  3 files changed, 37 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/include/asm/div64.h b/arch/x86/include/asm/div64.h
> index 7a0a916a2d7d..4a4c29e8602d 100644
> --- a/arch/x86/include/asm/div64.h
> +++ b/arch/x86/include/asm/div64.h
> @@ -60,6 +60,7 @@ static inline u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder)
>  }
>  #define div_u64_rem	div_u64_rem
>  
> +#ifndef __clang__

Might be worth adding a comment here justifying why clang is excluded.

>  static inline u64 mul_u32_u32(u32 a, u32 b)
>  {
>  	u32 high, low;
> @@ -71,6 +72,19 @@ static inline u64 mul_u32_u32(u32 a, u32 b)
>  }
>  #define mul_u32_u32 mul_u32_u32
>  
> +static inline u64 add_u64_u32(u64 a, u32 b)
> +{
> +	u32 high = a >> 32, low = a;
> +
> +	asm ("addl %[b], %[low]; adcl $0, %[high]"
> +		: [low] "+r" (low), [high] "+r" (high)
> +		: [b] "rm" (b) );
> +
> +	return low | (u64)high << 32;
> +}
> +#define add_u64_u32 add_u64_u32
> +#endif
> +
>  /*
>   * __div64_32() is never called on x86, so prevent the
>   * generic definition from getting built.
> diff --git a/include/linux/math64.h b/include/linux/math64.h
> index e1c2e3642cec..5e497836e975 100644
> --- a/include/linux/math64.h
> +++ b/include/linux/math64.h
> @@ -158,6 +158,17 @@ static inline u64 mul_u32_u32(u32 a, u32 b)
>  }
>  #endif
>  
> +#ifndef add_u64_u32
> +/*
> + * Many a GCC version also messes this up.
> + * Zero extending b and then spilling everything to stack.
> + */
> +static inline u64 add_u64_u32(u64 a, u32 b)
> +{
> +	return a + b;
> +}
> +#endif
> +
>  #if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
>  
>  #ifndef mul_u64_u32_shr
> diff --git a/lib/math/div64.c b/lib/math/div64.c
> index 22433e5565c4..2ac7e25039a1 100644
> --- a/lib/math/div64.c
> +++ b/lib/math/div64.c
> @@ -187,6 +187,12 @@ EXPORT_SYMBOL(iter_div_u64_rem);
>  #endif
>  
>  #if !defined(mul_u64_add_u64_div_u64) || defined(test_mul_u64_add_u64_div_u64)
> +
> +static u64 mul_add(u32 a, u32 b, u32 c)
> +{
> +	return add_u64_u32(mul_u32_u32(a, b), c);
> +}
> +
>  u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
>  {
>  	if (WARN_ONCE(!d, "%s: division of (%#llx * %#llx + %#llx) by zero, returning 0",
> @@ -211,12 +217,12 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
>  	u64 x, y, z;
>  
>  	/* Since (x-1)(x-1) + 2(x-1) == x.x - 1 two u32 can be added to a u64 */
> -	x = (u64)a_lo * b_lo + (u32)c;
> -	y = (u64)a_lo * b_hi + (u32)(c >> 32);
> -	y += (u32)(x >> 32);
> -	z = (u64)a_hi * b_hi + (u32)(y >> 32);
> -	y = (u64)a_hi * b_lo + (u32)y;
> -	z += (u32)(y >> 32);
> +	x = mul_add(a_lo, b_lo, c);
> +	y = mul_add(a_lo, b_hi, c >> 32);
> +	y = add_u64_u32(y, x >> 32);
> +	z = mul_add(a_hi, b_hi, y >> 32);
> +	y = mul_add(a_hi, b_lo, y);
> +	z = add_u64_u32(z, y >> 32);
>  	x = (y << 32) + (u32)x;
>  
>  	u64 n_lo = x, n_hi = z;
> -- 
> 2.39.5
> 
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 08/10] lib: mul_u64_u64_div_u64() Separate multiply to a helper for clarity
  2025-06-14  9:53 ` [PATCH v3 next 08/10] lib: mul_u64_u64_div_u64() Separate multiply to a helper for clarity David Laight
@ 2025-06-14 15:37   ` Nicolas Pitre
  2025-06-14 21:30     ` David Laight
  0 siblings, 1 reply; 44+ messages in thread
From: Nicolas Pitre @ 2025-06-14 15:37 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Sat, 14 Jun 2025, David Laight wrote:

> Move the 64x64 => 128 multiply into a static inline helper function
> for code clarity.
> No need for the a/b_hi/lo variables, the implicit casts on the function
> calls do the work for us.
> Should have minimal effect on the generated code.
> 
> Signed-off-by: David Laight <david.laight.linux@gmail.com>
> ---
> 
> new patch for v3.
> 
>  lib/math/div64.c | 54 +++++++++++++++++++++++++++---------------------
>  1 file changed, 30 insertions(+), 24 deletions(-)
> 
> diff --git a/lib/math/div64.c b/lib/math/div64.c
> index 2ac7e25039a1..fb77fd9d999d 100644
> --- a/lib/math/div64.c
> +++ b/lib/math/div64.c
> @@ -193,42 +193,48 @@ static u64 mul_add(u32 a, u32 b, u32 c)
>  	return add_u64_u32(mul_u32_u32(a, b), c);
>  }
>  
> -u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
> -{
> -	if (WARN_ONCE(!d, "%s: division of (%#llx * %#llx + %#llx) by zero, returning 0",
> -		      __func__, a, b, c)) {
> -		/*
> -		 * Return 0 (rather than ~(u64)0) because it is less likely to
> -		 * have unexpected side effects.
> -		 */
> -		return 0;
> -	}
> -
>  #if defined(__SIZEOF_INT128__) && !defined(test_mul_u64_add_u64_div_u64)
> -
> +static inline u64 mul_u64_u64_add_u64(u64 *p_lo, u64 a, u64 b, u64 c)

Why not move the #if inside the function body and have only one function 
definition?



> +{
>  	/* native 64x64=128 bits multiplication */
>  	u128 prod = (u128)a * b + c;
> -	u64 n_lo = prod, n_hi = prod >> 64;
>  
> -#else
> +	*p_lo = prod;
> +	return prod >> 64;
> +}
>  
> -	/* perform a 64x64=128 bits multiplication manually */
> -	u32 a_lo = a, a_hi = a >> 32, b_lo = b, b_hi = b >> 32;
> +#else
> +static inline u64 mul_u64_u64_add_u64(u64 *p_lo, u64 a, u64 b, u64 c)
> +{
> +	/* perform a 64x64=128 bits multiplication in 32bit chunks */
>  	u64 x, y, z;
>  
>  	/* Since (x-1)(x-1) + 2(x-1) == x.x - 1 two u32 can be added to a u64 */
> -	x = mul_add(a_lo, b_lo, c);
> -	y = mul_add(a_lo, b_hi, c >> 32);
> +	x = mul_add(a, b, c);
> +	y = mul_add(a, b >> 32, c >> 32);
>  	y = add_u64_u32(y, x >> 32);
> -	z = mul_add(a_hi, b_hi, y >> 32);
> -	y = mul_add(a_hi, b_lo, y);
> -	z = add_u64_u32(z, y >> 32);
> -	x = (y << 32) + (u32)x;
> -
> -	u64 n_lo = x, n_hi = z;
> +	z = mul_add(a >> 32, b >> 32, y >> 32);
> +	y = mul_add(a >> 32, b, y);
> +	*p_lo = (y << 32) + (u32)x;
> +	return add_u64_u32(z, y >> 32);
> +}
>  
>  #endif
>  
> +u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
> +{
> +	u64 n_lo, n_hi;
> +
> +	if (WARN_ONCE(!d, "%s: division of (%llx * %llx + %llx) by zero, returning 0",
> +		      __func__, a, b, c )) {
> +		/*
> +		 * Return 0 (rather than ~(u64)0) because it is less likely to
> +		 * have unexpected side effects.
> +		 */
> +		return 0;
> +	}
> +
> +	n_hi = mul_u64_u64_add_u64(&n_lo, a, b, c);
>  	if (!n_hi)
>  		return div64_u64(n_lo, d);
>  
> -- 
> 2.39.5
> 
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 02/10] lib: mul_u64_u64_div_u64() Use WARN_ONCE() for divide errors.
  2025-06-14 15:17   ` Nicolas Pitre
@ 2025-06-14 21:26     ` David Laight
  2025-06-14 22:23       ` Nicolas Pitre
  0 siblings, 1 reply; 44+ messages in thread
From: David Laight @ 2025-06-14 21:26 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Nicolas Pitre

On Sat, 14 Jun 2025 11:17:33 -0400 (EDT)
Nicolas Pitre <nico@fluxnic.net> wrote:

> On Sat, 14 Jun 2025, David Laight wrote:
> 
> > Do an explicit WARN_ONCE(!divisor) instead of hoping the 'undefined
> > behaviour' the compiler generates for a compile-time 1/0 is in any
> > way useful.
> > 
> > Return 0 (rather than ~(u64)0) because it is less likely to cause
> > further serious issues.  
> 
> I still disagree with this patch. Whether or not what the compiler 
> produces is useful is beside the point. What's important here is to have 
> a coherent behavior across all division flavors and what's proposed here 
> is not.
> 
> Arguably, a compile time 1/0 might not be what we want either. The 
> compiler forces an "illegal instruction" exception when what we want is 
> a "floating point" exception (strange to have floating point exceptions 
> for integer divisions but that's what it is).
> 
> So I'd suggest the following instead:
> 
> ----- >8  
> From Nicolas Pitre <npitre@baylibre.com>
> Subject: [PATCH] mul_u64_u64_div_u64(): improve division-by-zero handling
> 
> Forcing 1/0 at compile time makes the compiler (on x86 at least) emit
> an undefined instruction to trigger the exception. But that's not what 
> we want. Modify the code so that an actual runtime div-by-0 exception
> is triggered to be coherent with the behavior of all the other division
> flavors.
> 
> And don't use 1 for the dividend as the compiler would convert the 
> actual division into a simple compare.

The alternative would be BUG() or BUG_ON() - but Linus really doesn't
like those unless there is no alternative.

I'm pretty sure that both divide overflow (quotient too large) and
divide by zero are 'Undefined behaviour' in C.
Unless the compiler detects and does something 'strange' it becomes
cpu architecture defined.
It is actually a right PITA that many cpu trap for overflow
and/or divide by zero (x86 traps for both, m68k traps for divide by
zero but sets the overflow flag for overflow (with unchanged outputs),
can't find my arm book, sparc doesn't have divide).

	David

> 
> Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
> 
> diff --git a/lib/math/div64.c b/lib/math/div64.c
> index 5faa29208bdb..e6839b40e271 100644
> --- a/lib/math/div64.c
> +++ b/lib/math/div64.c
> @@ -212,12 +212,12 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
>  
>  #endif
>  
> -	/* make sure c is not zero, trigger exception otherwise */
> -#pragma GCC diagnostic push
> -#pragma GCC diagnostic ignored "-Wdiv-by-zero"
> -	if (unlikely(c == 0))
> -		return 1/0;
> -#pragma GCC diagnostic pop
> +	/* make sure c is not zero, trigger runtime exception otherwise */
> +	if (unlikely(c == 0)) {
> +		unsigned long zero = 0;
> +		asm ("" : "+r" (zero)); /* hide actual value from the compiler */
> +		return ~0UL/zero;
> +	}
>  
>  	int shift = __builtin_ctzll(c);
>  


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 08/10] lib: mul_u64_u64_div_u64() Separate multiply to a helper for clarity
  2025-06-14 15:37   ` Nicolas Pitre
@ 2025-06-14 21:30     ` David Laight
  2025-06-14 22:27       ` Nicolas Pitre
  0 siblings, 1 reply; 44+ messages in thread
From: David Laight @ 2025-06-14 21:30 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Sat, 14 Jun 2025 11:37:54 -0400 (EDT)
Nicolas Pitre <npitre@baylibre.com> wrote:

> On Sat, 14 Jun 2025, David Laight wrote:
> 
> > Move the 64x64 => 128 multiply into a static inline helper function
> > for code clarity.
> > No need for the a/b_hi/lo variables, the implicit casts on the function
> > calls do the work for us.
> > Should have minimal effect on the generated code.
> > 
> > Signed-off-by: David Laight <david.laight.linux@gmail.com>
> > ---
> > 
> > new patch for v3.
> > 
> >  lib/math/div64.c | 54 +++++++++++++++++++++++++++---------------------
> >  1 file changed, 30 insertions(+), 24 deletions(-)
> > 
> > diff --git a/lib/math/div64.c b/lib/math/div64.c
> > index 2ac7e25039a1..fb77fd9d999d 100644
> > --- a/lib/math/div64.c
> > +++ b/lib/math/div64.c
> > @@ -193,42 +193,48 @@ static u64 mul_add(u32 a, u32 b, u32 c)
> >  	return add_u64_u32(mul_u32_u32(a, b), c);
> >  }
> >  
> > -u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
> > -{
> > -	if (WARN_ONCE(!d, "%s: division of (%#llx * %#llx + %#llx) by zero, returning 0",
> > -		      __func__, a, b, c)) {
> > -		/*
> > -		 * Return 0 (rather than ~(u64)0) because it is less likely to
> > -		 * have unexpected side effects.
> > -		 */
> > -		return 0;
> > -	}
> > -
> >  #if defined(__SIZEOF_INT128__) && !defined(test_mul_u64_add_u64_div_u64)
> > -
> > +static inline u64 mul_u64_u64_add_u64(u64 *p_lo, u64 a, u64 b, u64 c)  
> 
> Why not move the #if inside the function body and have only one function 
> definition?

Because I think it is easier to read with two definitions,
especially when the bodies are entirely different.

	David

> > +{
> >  	/* native 64x64=128 bits multiplication */
> >  	u128 prod = (u128)a * b + c;
> > -	u64 n_lo = prod, n_hi = prod >> 64;
> >  
> > -#else
> > +	*p_lo = prod;
> > +	return prod >> 64;
> > +}
> >  
> > -	/* perform a 64x64=128 bits multiplication manually */
> > -	u32 a_lo = a, a_hi = a >> 32, b_lo = b, b_hi = b >> 32;
> > +#else
> > +static inline u64 mul_u64_u64_add_u64(u64 *p_lo, u64 a, u64 b, u64 c)
> > +{
> > +	/* perform a 64x64=128 bits multiplication in 32bit chunks */
> >  	u64 x, y, z;
> >  
> >  	/* Since (x-1)(x-1) + 2(x-1) == x.x - 1 two u32 can be added to a u64 */
> > -	x = mul_add(a_lo, b_lo, c);
> > -	y = mul_add(a_lo, b_hi, c >> 32);
> > +	x = mul_add(a, b, c);
> > +	y = mul_add(a, b >> 32, c >> 32);
> >  	y = add_u64_u32(y, x >> 32);
> > -	z = mul_add(a_hi, b_hi, y >> 32);
> > -	y = mul_add(a_hi, b_lo, y);
> > -	z = add_u64_u32(z, y >> 32);
> > -	x = (y << 32) + (u32)x;
> > -
> > -	u64 n_lo = x, n_hi = z;
> > +	z = mul_add(a >> 32, b >> 32, y >> 32);
> > +	y = mul_add(a >> 32, b, y);
> > +	*p_lo = (y << 32) + (u32)x;
> > +	return add_u64_u32(z, y >> 32);
> > +}
> >  
> >  #endif
> >  
> > +u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
> > +{
> > +	u64 n_lo, n_hi;
> > +
> > +	if (WARN_ONCE(!d, "%s: division of (%llx * %llx + %llx) by zero, returning 0",
> > +		      __func__, a, b, c )) {
> > +		/*
> > +		 * Return 0 (rather than ~(u64)0) because it is less likely to
> > +		 * have unexpected side effects.
> > +		 */
> > +		return 0;
> > +	}
> > +
> > +	n_hi = mul_u64_u64_add_u64(&n_lo, a, b, c);
> >  	if (!n_hi)
> >  		return div64_u64(n_lo, d);
> >  
> > -- 
> > 2.39.5
> > 
> >   


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 02/10] lib: mul_u64_u64_div_u64() Use WARN_ONCE() for divide errors.
  2025-06-14 21:26     ` David Laight
@ 2025-06-14 22:23       ` Nicolas Pitre
  0 siblings, 0 replies; 44+ messages in thread
From: Nicolas Pitre @ 2025-06-14 22:23 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Sat, 14 Jun 2025, David Laight wrote:

> On Sat, 14 Jun 2025 11:17:33 -0400 (EDT)
> Nicolas Pitre <nico@fluxnic.net> wrote:
> 
> > On Sat, 14 Jun 2025, David Laight wrote:
> > 
> > > Do an explicit WARN_ONCE(!divisor) instead of hoping the 'undefined
> > > behaviour' the compiler generates for a compile-time 1/0 is in any
> > > way useful.
> > > 
> > > Return 0 (rather than ~(u64)0) because it is less likely to cause
> > > further serious issues.  
> > 
> > I still disagree with this patch. Whether or not what the compiler 
> > produces is useful is beside the point. What's important here is to have 
> > a coherent behavior across all division flavors and what's proposed here 
> > is not.
> > 
> > Arguably, a compile time 1/0 might not be what we want either. The 
> > compiler forces an "illegal instruction" exception when what we want is 
> > a "floating point" exception (strange to have floating point exceptions 
> > for integer divisions but that's what it is).
> > 
> > So I'd suggest the following instead:
> > 
> > ----- >8  
> > From Nicolas Pitre <npitre@baylibre.com>
> > Subject: [PATCH] mul_u64_u64_div_u64(): improve division-by-zero handling
> > 
> > Forcing 1/0 at compile time makes the compiler (on x86 at least) emit
> > an undefined instruction to trigger the exception. But that's not what 
> > we want. Modify the code so that an actual runtime div-by-0 exception
> > is triggered to be coherent with the behavior of all the other division
> > flavors.
> > 
> > And don't use 1 for the dividend as the compiler would convert the 
> > actual division into a simple compare.
> 
> The alternative would be BUG() or BUG_ON() - but Linus really doesn't
> like those unless there is no alternative.
> 
> I'm pretty sure that both divide overflow (quotient too large) and
> divide by zero are 'Undefined behaviour' in C.
> Unless the compiler detects and does something 'strange' it becomes
> cpu architecture defined.

Exactly. Let each architecture produce what people expect of them when a 
zero divisor is encountered.

> It is actually a right PITA that many cpu trap for overflow
> and/or divide by zero (x86 traps for both, m68k traps for divide by
> zero but sets the overflow flag for overflow (with unchanged outputs),
> can't find my arm book, sparc doesn't have divide).

Some ARMs don't have divide either. But the software lib the compiler is 
relying on in those cases deals with a zero divisor already. We should 
invoke the same path here.


Nicolas

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 08/10] lib: mul_u64_u64_div_u64() Separate multiply to a helper for clarity
  2025-06-14 21:30     ` David Laight
@ 2025-06-14 22:27       ` Nicolas Pitre
  0 siblings, 0 replies; 44+ messages in thread
From: Nicolas Pitre @ 2025-06-14 22:27 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Sat, 14 Jun 2025, David Laight wrote:

> On Sat, 14 Jun 2025 11:37:54 -0400 (EDT)
> Nicolas Pitre <npitre@baylibre.com> wrote:
> 
> > On Sat, 14 Jun 2025, David Laight wrote:
> > 
> > > Move the 64x64 => 128 multiply into a static inline helper function
> > > for code clarity.
> > > No need for the a/b_hi/lo variables, the implicit casts on the function
> > > calls do the work for us.
> > > Should have minimal effect on the generated code.
> > > 
> > > Signed-off-by: David Laight <david.laight.linux@gmail.com>
> > > ---
> > > 
> > > new patch for v3.
> > > 
> > >  lib/math/div64.c | 54 +++++++++++++++++++++++++++---------------------
> > >  1 file changed, 30 insertions(+), 24 deletions(-)
> > > 
> > > diff --git a/lib/math/div64.c b/lib/math/div64.c
> > > index 2ac7e25039a1..fb77fd9d999d 100644
> > > --- a/lib/math/div64.c
> > > +++ b/lib/math/div64.c
> > > @@ -193,42 +193,48 @@ static u64 mul_add(u32 a, u32 b, u32 c)
> > >  	return add_u64_u32(mul_u32_u32(a, b), c);
> > >  }
> > >  
> > > -u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
> > > -{
> > > -	if (WARN_ONCE(!d, "%s: division of (%#llx * %#llx + %#llx) by zero, returning 0",
> > > -		      __func__, a, b, c)) {
> > > -		/*
> > > -		 * Return 0 (rather than ~(u64)0) because it is less likely to
> > > -		 * have unexpected side effects.
> > > -		 */
> > > -		return 0;
> > > -	}
> > > -
> > >  #if defined(__SIZEOF_INT128__) && !defined(test_mul_u64_add_u64_div_u64)
> > > -
> > > +static inline u64 mul_u64_u64_add_u64(u64 *p_lo, u64 a, u64 b, u64 c)  
> > 
> > Why not move the #if inside the function body and have only one function 
> > definition?
> 
> Because I think it is easier to read with two definitions,
> especially when the bodies are entirely different.

We have differing opinions here, but I don't care that strongly in this 
case.

Reviewed-by: Nicolas Pitre <npitre@baylibre.com>




> 
> 	David
> 
> > > +{
> > >  	/* native 64x64=128 bits multiplication */
> > >  	u128 prod = (u128)a * b + c;
> > > -	u64 n_lo = prod, n_hi = prod >> 64;
> > >  
> > > -#else
> > > +	*p_lo = prod;
> > > +	return prod >> 64;
> > > +}
> > >  
> > > -	/* perform a 64x64=128 bits multiplication manually */
> > > -	u32 a_lo = a, a_hi = a >> 32, b_lo = b, b_hi = b >> 32;
> > > +#else
> > > +static inline u64 mul_u64_u64_add_u64(u64 *p_lo, u64 a, u64 b, u64 c)
> > > +{
> > > +	/* perform a 64x64=128 bits multiplication in 32bit chunks */
> > >  	u64 x, y, z;
> > >  
> > >  	/* Since (x-1)(x-1) + 2(x-1) == x.x - 1 two u32 can be added to a u64 */
> > > -	x = mul_add(a_lo, b_lo, c);
> > > -	y = mul_add(a_lo, b_hi, c >> 32);
> > > +	x = mul_add(a, b, c);
> > > +	y = mul_add(a, b >> 32, c >> 32);
> > >  	y = add_u64_u32(y, x >> 32);
> > > -	z = mul_add(a_hi, b_hi, y >> 32);
> > > -	y = mul_add(a_hi, b_lo, y);
> > > -	z = add_u64_u32(z, y >> 32);
> > > -	x = (y << 32) + (u32)x;
> > > -
> > > -	u64 n_lo = x, n_hi = z;
> > > +	z = mul_add(a >> 32, b >> 32, y >> 32);
> > > +	y = mul_add(a >> 32, b, y);
> > > +	*p_lo = (y << 32) + (u32)x;
> > > +	return add_u64_u32(z, y >> 32);
> > > +}
> > >  
> > >  #endif
> > >  
> > > +u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
> > > +{
> > > +	u64 n_lo, n_hi;
> > > +
> > > +	if (WARN_ONCE(!d, "%s: division of (%llx * %llx + %llx) by zero, returning 0",
> > > +		      __func__, a, b, c )) {
> > > +		/*
> > > +		 * Return 0 (rather than ~(u64)0) because it is less likely to
> > > +		 * have unexpected side effects.
> > > +		 */
> > > +		return 0;
> > > +	}
> > > +
> > > +	n_hi = mul_u64_u64_add_u64(&n_lo, a, b, c);
> > >  	if (!n_hi)
> > >  		return div64_u64(n_lo, d);
> > >  
> > > -- 
> > > 2.39.5
> > > 
> > >   
> 
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code
  2025-06-14  9:53 ` [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code David Laight
@ 2025-06-17  4:16   ` Nicolas Pitre
  2025-06-18  1:33     ` Nicolas Pitre
  2025-07-09 14:24   ` David Laight
  1 sibling, 1 reply; 44+ messages in thread
From: Nicolas Pitre @ 2025-06-17  4:16 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Sat, 14 Jun 2025, David Laight wrote:

> Replace the bit by bit algorithm with one that generates 16 bits
> per iteration on 32bit architectures and 32 bits on 64bit ones.
> 
> On my zen 5 this reduces the time for the tests (using the generic
> code) from ~3350ns to ~1000ns.
> 
> Running the 32bit algorithm on 64bit x86 takes ~1500ns.
> It'll be slightly slower on a real 32bit system, mostly due
> to register pressure.
> 
> The savings for 32bit x86 are much higher (tested in userspace).
> The worst case (lots of bits in the quotient) drops from ~900 clocks
> to ~130 (pretty much independent of the arguments).
> Other 32bit architectures may see better savings.
> 
> It is possible to optimise for divisors that span less than
> __LONG_WIDTH__/2 bits. However I suspect they don't happen that often
> and it doesn't remove any slow cpu divide instructions which dominate
> the result.
> 
> Signed-off-by: David Laight <david.laight.linux@gmail.com>

Nice work. I had to be fully awake to review this one.
Some suggestions below.

> +	reps = 64 / BITS_PER_ITER;
> +	/* Optimise loop count for small dividends */
> +	if (!(u32)(n_hi >> 32)) {
> +		reps -= 32 / BITS_PER_ITER;
> +		n_hi = n_hi << 32 | n_lo >> 32;
> +		n_lo <<= 32;
> +	}
> +#if BITS_PER_ITER == 16
> +	if (!(u32)(n_hi >> 48)) {
> +		reps--;
> +		n_hi = add_u64_u32(n_hi << 16, n_lo >> 48);
> +		n_lo <<= 16;
> +	}
> +#endif

I think it would be more beneficial to integrate this within the loop 
itself instead. It is often the case that the dividend is initially big, 
and becomes zero or too small for the division in the middle of the 
loop.

Something like:

	unsigned long n;
	...

	reps = 64 / BITS_PER_ITER;
	while (reps--) {
		quotient <<= BITS_PER_ITER;
		n = ~n_hi >> (64 - 2 * BITS_PER_ITER);
		if (n < d_msig) {
			n_hi = (n_hi << BITS_PER_ITER | (n_lo >> (64 - BITS_PER_ITER)));
			n_lo <<= BITS_PER_ITER;
			continue;
		}
		q_digit = n / d_msig;
		...

This way small dividends are optimized as well as dividends with holes 
in them. And this allows for the number of loops to become constant 
which opens some unrolling optimization possibilities.

> +	/*
> +	 * Get the most significant BITS_PER_ITER bits of the divisor.
> +	 * This is used to get a low 'guestimate' of the quotient digit.
> +	 */
> +	d_msig = (d >> (64 - BITS_PER_ITER)) + 1;

Here the unconditional +1 is pessimizing d_msig too much. You should do 
this instead:

	d_msig = (d >> (64 - BITS_PER_ITER)) + !!(d << BITS_PER_ITER);

In other words, you want to round the value up, not add an unconditional +1
that causes several unnecessary overflow fixup loops.
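
For instance, with BITS_PER_ITER == 32 and d already left aligned
(top bit set):

	d = 0x8000000000000000: low half is zero, so the rounded form gives
	    d_msig = 0x80000000 (exact); the unconditional +1 gives
	    0x80000001 and leaves the q_digit guess low, costing extra
	    fixup iterations.
	d = 0x8000000000000001: both forms give 0x80000001, which is the
	    round-up actually needed to keep the guess from overshooting.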


Nicolas

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 05/10] lib: Add tests for mul_u64_u64_div_u64_roundup()
  2025-06-14  9:53 ` [PATCH v3 next 05/10] lib: Add tests for mul_u64_u64_div_u64_roundup() David Laight
  2025-06-14 15:19   ` Nicolas Pitre
@ 2025-06-17  4:30   ` Nicolas Pitre
  1 sibling, 0 replies; 44+ messages in thread
From: Nicolas Pitre @ 2025-06-17  4:30 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Sat, 14 Jun 2025, David Laight wrote:

> Replicate the existing mul_u64_u64_div_u64() test cases with round up.
> Update the shell script that verifies the table, remove the comment
> markers so that it can be directly pasted into a shell.
> 
> Rename the divisor from 'c' to 'd' to match mul_u64_add_u64_div_u64().
> 
> If any tests fail then fail the module load with -EINVAL.
> 
> Signed-off-by: David Laight <david.laight.linux@gmail.com>

I must withdraw my Reviewed-by here.

>  /*
>   * The above table can be verified with the following shell script:
> - *
> - * #!/bin/sh
> - * sed -ne 's/^{ \+\(.*\), \+\(.*\), \+\(.*\), \+\(.*\) },$/\1 \2 \3 \4/p' \
> - *     lib/math/test_mul_u64_u64_div_u64.c |
> - * while read a b c r; do
> - *   expected=$( printf "obase=16; ibase=16; %X * %X / %X\n" $a $b $c | bc )
> - *   given=$( printf "%X\n" $r )
> - *   if [ "$expected" = "$given" ]; then
> - *     echo "$a * $b / $c = $r OK"
> - *   else
> - *     echo "$a * $b / $c = $r is wrong" >&2
> - *     echo "should be equivalent to 0x$expected" >&2
> - *     exit 1
> - *   fi
> - * done
> +
> +#!/bin/sh
> +sed -ne 's/^{ \+\(.*\), \+\(.*\), \+\(.*\), \+\(.*\), \+\(.*\) },$/\1 \2 \3 \4 \5/p' \
> +    lib/math/test_mul_u64_u64_div_u64.c |
> +while read a b d r d; do

This "read a b d r d" is wrong and that breaks the script.


Nicolas

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code
  2025-06-17  4:16   ` Nicolas Pitre
@ 2025-06-18  1:33     ` Nicolas Pitre
  2025-06-18  9:16       ` David Laight
  2025-06-26 21:46       ` David Laight
  0 siblings, 2 replies; 44+ messages in thread
From: Nicolas Pitre @ 2025-06-18  1:33 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Tue, 17 Jun 2025, Nicolas Pitre wrote:

> On Sat, 14 Jun 2025, David Laight wrote:
> 
> > Replace the bit by bit algorithm with one that generates 16 bits
> > per iteration on 32bit architectures and 32 bits on 64bit ones.
> > 
> > On my zen 5 this reduces the time for the tests (using the generic
> > code) from ~3350ns to ~1000ns.
> > 
> > Running the 32bit algorithm on 64bit x86 takes ~1500ns.
> > It'll be slightly slower on a real 32bit system, mostly due
> > to register pressure.
> > 
> > The savings for 32bit x86 are much higher (tested in userspace).
> > The worst case (lots of bits in the quotient) drops from ~900 clocks
> > to ~130 (pretty much independent of the arguments).
> > Other 32bit architectures may see better savings.
> > 
> > It is possible to optimise for divisors that span less than
> > __LONG_WIDTH__/2 bits. However I suspect they don't happen that often
> > and it doesn't remove any slow cpu divide instructions which dominate
> > the result.
> > 
> > Signed-off-by: David Laight <david.laight.linux@gmail.com>
> 
> Nice work. I had to be fully awake to review this one.
> Some suggestions below.

Here's a patch with my suggestions applied to make it easier to figure 
them out. The added "inline" is necessary to fix compilation on ARM32. 
The "likely()" makes for better assembly and this part is pretty much 
likely anyway. I've explained the rest previously, although this is a 
better implementation.

commit 99ea338401f03efe5dbebe57e62bd7c588409c5c
Author: Nicolas Pitre <nico@fluxnic.net>
Date:   Tue Jun 17 14:42:34 2025 -0400

    fixup! lib: mul_u64_u64_div_u64() Optimise the divide code

diff --git a/lib/math/div64.c b/lib/math/div64.c
index 3c9fe878ce68..740e59a58530 100644
--- a/lib/math/div64.c
+++ b/lib/math/div64.c
@@ -188,7 +188,7 @@ EXPORT_SYMBOL(iter_div_u64_rem);
 
 #if !defined(mul_u64_add_u64_div_u64) || defined(test_mul_u64_add_u64_div_u64)
 
-static u64 mul_add(u32 a, u32 b, u32 c)
+static inline u64 mul_add(u32 a, u32 b, u32 c)
 {
 	return add_u64_u32(mul_u32_u32(a, b), c);
 }
@@ -246,7 +246,7 @@ static inline u32 mul_u64_long_add_u64(u64 *p_lo, u64 a, u32 b, u64 c)
 
 u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
 {
-	unsigned long d_msig, q_digit;
+	unsigned long n_long, d_msig, q_digit;
 	unsigned int reps, d_z_hi;
 	u64 quotient, n_lo, n_hi;
 	u32 overflow;
@@ -271,36 +271,21 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
 
 	/* Left align the divisor, shifting the dividend to match */
 	d_z_hi = __builtin_clzll(d);
-	if (d_z_hi) {
+	if (likely(d_z_hi)) {
 		d <<= d_z_hi;
 		n_hi = n_hi << d_z_hi | n_lo >> (64 - d_z_hi);
 		n_lo <<= d_z_hi;
 	}
 
-	reps = 64 / BITS_PER_ITER;
-	/* Optimise loop count for small dividends */
-	if (!(u32)(n_hi >> 32)) {
-		reps -= 32 / BITS_PER_ITER;
-		n_hi = n_hi << 32 | n_lo >> 32;
-		n_lo <<= 32;
-	}
-#if BITS_PER_ITER == 16
-	if (!(u32)(n_hi >> 48)) {
-		reps--;
-		n_hi = add_u64_u32(n_hi << 16, n_lo >> 48);
-		n_lo <<= 16;
-	}
-#endif
-
 	/* Invert the dividend so we can use add instead of subtract. */
 	n_lo = ~n_lo;
 	n_hi = ~n_hi;
 
 	/*
-	 * Get the most significant BITS_PER_ITER bits of the divisor.
+	 * Get the rounded-up most significant BITS_PER_ITER bits of the divisor.
 	 * This is used to get a low 'guestimate' of the quotient digit.
 	 */
-	d_msig = (d >> (64 - BITS_PER_ITER)) + 1;
+	d_msig = (d >> (64 - BITS_PER_ITER)) + !!(d << BITS_PER_ITER);
 
 	/*
 	 * Now do a 'long division' with BITS_PER_ITER bit 'digits'.
@@ -308,12 +293,17 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
 	 * The worst case is dividing ~0 by 0x8000 which requires two subtracts.
 	 */
 	quotient = 0;
-	while (reps--) {
-		q_digit = (unsigned long)(~n_hi >> (64 - 2 * BITS_PER_ITER)) / d_msig;
+	for (reps = 64 / BITS_PER_ITER; reps; reps--) {
+		quotient <<= BITS_PER_ITER;
+		n_long = ~n_hi >> (64 - 2 * BITS_PER_ITER);
 		/* Shift 'n' left to align with the product q_digit * d */
 		overflow = n_hi >> (64 - BITS_PER_ITER);
 		n_hi = add_u64_u32(n_hi << BITS_PER_ITER, n_lo >> (64 - BITS_PER_ITER));
 		n_lo <<= BITS_PER_ITER;
+		/* cut it short if q_digit would be 0 */
+		if (n_long < d_msig)
+			continue;
+		q_digit = n_long / d_msig;
 		/* Add product to negated divisor */
 		overflow += mul_u64_long_add_u64(&n_hi, d, q_digit, n_hi);
 		/* Adjust for the q_digit 'guestimate' being low */
@@ -322,7 +312,7 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
 			n_hi += d;
 			overflow += n_hi < d;
 		}
-		quotient = add_u64_long(quotient << BITS_PER_ITER, q_digit);
+		quotient = add_u64_long(quotient, q_digit);
 	}
 
 	/*

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 06/10] lib: test_mul_u64_u64_div_u64: Test both generic and arch versions
  2025-06-14 15:25   ` Nicolas Pitre
@ 2025-06-18  1:39     ` Nicolas Pitre
  0 siblings, 0 replies; 44+ messages in thread
From: Nicolas Pitre @ 2025-06-18  1:39 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Sat, 14 Jun 2025, Nicolas Pitre wrote:

> On Sat, 14 Jun 2025, David Laight wrote:
> 
> > Change the #if in div64.c so that test_mul_u64_u64_div_u64.c
> > can compile and test the generic version (including the 'long multiply')
> > on architectures (eg amd64) that define their own copy.
> > Test the kernel version and the locally compiled version on all arch.
> > Output the time taken (in ns) on the 'test completed' trace.
> > 
> > For reference, on my zen 5, the optimised version takes ~220ns and the
> > generic version ~3350ns.
> > Using the native multiply saves ~200ns and adding back the ilog2() 'optimisation'
> > test adds ~50ns.
> > 
> > Signed-off-by: David Laight <david.laight.linux@gmail.com>
> 
> Reviewed-by: Nicolas Pitre <npitre@baylibre.com>

In fact this doesn't compile on ARM32. The following is needed to fix that:

commit 271a7224634699721b6383ba28f37b23f901319e
Author: Nicolas Pitre <nico@fluxnic.net>
Date:   Tue Jun 17 17:14:05 2025 -0400

    fixup! lib: test_mul_u64_u64_div_u64: Test both generic and arch versions

diff --git a/lib/math/test_mul_u64_u64_div_u64.c b/lib/math/test_mul_u64_u64_div_u64.c
index 88316e68512c..44df9aa39406 100644
--- a/lib/math/test_mul_u64_u64_div_u64.c
+++ b/lib/math/test_mul_u64_u64_div_u64.c
@@ -153,7 +153,10 @@ static void __exit test_exit(void)
 }
 
 /* Compile the generic mul_u64_add_u64_div_u64() code */
+#define __div64_32 __div64_32
+#define div_s64_rem div_s64_rem
 #define div64_u64 div64_u64
+#define div64_u64_rem div64_u64_rem
 #define div64_s64 div64_s64
 #define iter_div_u64_rem iter_div_u64_rem
 

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code
  2025-06-18  1:33     ` Nicolas Pitre
@ 2025-06-18  9:16       ` David Laight
  2025-06-18 15:39         ` Nicolas Pitre
  2025-06-26 21:46       ` David Laight
  1 sibling, 1 reply; 44+ messages in thread
From: David Laight @ 2025-06-18  9:16 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Tue, 17 Jun 2025 21:33:23 -0400 (EDT)
Nicolas Pitre <nico@fluxnic.net> wrote:

> On Tue, 17 Jun 2025, Nicolas Pitre wrote:
> 
> > On Sat, 14 Jun 2025, David Laight wrote:
> >   
> > > Replace the bit by bit algorithm with one that generates 16 bits
> > > per iteration on 32bit architectures and 32 bits on 64bit ones.
> > > 
> > > On my zen 5 this reduces the time for the tests (using the generic
> > > code) from ~3350ns to ~1000ns.
> > > 
> > > Running the 32bit algorithm on 64bit x86 takes ~1500ns.
> > > It'll be slightly slower on a real 32bit system, mostly due
> > > to register pressure.
> > > 
> > > The savings for 32bit x86 are much higher (tested in userspace).
> > > The worst case (lots of bits in the quotient) drops from ~900 clocks
> > > to ~130 (pretty much independent of the arguments).
> > > Other 32bit architectures may see better savings.
> > > 
> > > It is possible to optimise for divisors that span less than
> > > __LONG_WIDTH__/2 bits. However I suspect they don't happen that often
> > > and it doesn't remove any slow cpu divide instructions which dominate
> > > the result.
> > > 
> > > Signed-off-by: David Laight <david.laight.linux@gmail.com>  
> > 
> > Nice work. I had to be fully awake to review this one.
> > Some suggestions below.  
> 
> Here's a patch with my suggestions applied to make it easier to figure 
> them out. The added "inline" is necessary to fix compilation on ARM32. 
> The "likely()" makes for better assembly and this part is pretty much 
> likely anyway. I've explained the rest previously, although this is a 
> better implementation.
> 
> commit 99ea338401f03efe5dbebe57e62bd7c588409c5c
> Author: Nicolas Pitre <nico@fluxnic.net>
> Date:   Tue Jun 17 14:42:34 2025 -0400
> 
>     fixup! lib: mul_u64_u64_div_u64() Optimise the divide code
> 
> diff --git a/lib/math/div64.c b/lib/math/div64.c
> index 3c9fe878ce68..740e59a58530 100644
> --- a/lib/math/div64.c
> +++ b/lib/math/div64.c
> @@ -188,7 +188,7 @@ EXPORT_SYMBOL(iter_div_u64_rem);
>  
>  #if !defined(mul_u64_add_u64_div_u64) || defined(test_mul_u64_add_u64_div_u64)
>  
> -static u64 mul_add(u32 a, u32 b, u32 c)
> +static inline u64 mul_add(u32 a, u32 b, u32 c)
>  {
>  	return add_u64_u32(mul_u32_u32(a, b), c);
>  }
> @@ -246,7 +246,7 @@ static inline u32 mul_u64_long_add_u64(u64 *p_lo, u64 a, u32 b, u64 c)
>  
>  u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
>  {
> -	unsigned long d_msig, q_digit;
> +	unsigned long n_long, d_msig, q_digit;
>  	unsigned int reps, d_z_hi;
>  	u64 quotient, n_lo, n_hi;
>  	u32 overflow;
> @@ -271,36 +271,21 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
>  
>  	/* Left align the divisor, shifting the dividend to match */
>  	d_z_hi = __builtin_clzll(d);
> -	if (d_z_hi) {
> +	if (likely(d_z_hi)) {
>  		d <<= d_z_hi;
>  		n_hi = n_hi << d_z_hi | n_lo >> (64 - d_z_hi);
>  		n_lo <<= d_z_hi;
>  	}
>  
> -	reps = 64 / BITS_PER_ITER;
> -	/* Optimise loop count for small dividends */
> -	if (!(u32)(n_hi >> 32)) {
> -		reps -= 32 / BITS_PER_ITER;
> -		n_hi = n_hi << 32 | n_lo >> 32;
> -		n_lo <<= 32;
> -	}
> -#if BITS_PER_ITER == 16
> -	if (!(u32)(n_hi >> 48)) {
> -		reps--;
> -		n_hi = add_u64_u32(n_hi << 16, n_lo >> 48);
> -		n_lo <<= 16;
> -	}
> -#endif
> -
>  	/* Invert the dividend so we can use add instead of subtract. */
>  	n_lo = ~n_lo;
>  	n_hi = ~n_hi;
>  
>  	/*
> -	 * Get the most significant BITS_PER_ITER bits of the divisor.
> +	 * Get the rounded-up most significant BITS_PER_ITER bits of the divisor.
>  	 * This is used to get a low 'guestimate' of the quotient digit.
>  	 */
> -	d_msig = (d >> (64 - BITS_PER_ITER)) + 1;
> +	d_msig = (d >> (64 - BITS_PER_ITER)) + !!(d << BITS_PER_ITER);

If the low divisor bits are zero an alternative simpler divide
can be used (you want to detect it before the left align).
I deleted that because it complicates things and probably doesn't
happen often enough outside the test cases.

>  
>  	/*
>  	 * Now do a 'long division' with BITS_PER_ITER bit 'digits'.
> @@ -308,12 +293,17 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
>  	 * The worst case is dividing ~0 by 0x8000 which requires two subtracts.
>  	 */
>  	quotient = 0;
> -	while (reps--) {
> -		q_digit = (unsigned long)(~n_hi >> (64 - 2 * BITS_PER_ITER)) / d_msig;
> +	for (reps = 64 / BITS_PER_ITER; reps; reps--) {
> +		quotient <<= BITS_PER_ITER;
> +		n_long = ~n_hi >> (64 - 2 * BITS_PER_ITER);
>  		/* Shift 'n' left to align with the product q_digit * d */
>  		overflow = n_hi >> (64 - BITS_PER_ITER);
>  		n_hi = add_u64_u32(n_hi << BITS_PER_ITER, n_lo >> (64 - BITS_PER_ITER));
>  		n_lo <<= BITS_PER_ITER;
> +		/* cut it short if q_digit would be 0 */
> +		if (n_long < d_msig)
> +			continue;

I don't think that is right.
d_msig is an overestimate so you can only skip the divide and mul_add().
Could be something like:
		if (n_long < d_msig) {
			if (!n_long)
				continue;
			q_digit = 0;
		} else {
			q_digit = n_long / d_msig;
	  		/* Add product to negated divisor */
	  		overflow += mul_u64_long_add_u64(&n_hi, d, q_digit, n_hi);		
		}
but that starts looking like branch prediction hell.

> +		q_digit = n_long / d_msig;

I think you want to do the divide right at the top - maybe even if the
result isn't used!
All the shifts then happen while the divide instruction is in progress
(even without out-of-order execution).

It is also quite likely that any cpu divide instruction takes a lot
less clocks when the dividend or quotient is small.
So if the quotient would be zero there isn't a stall waiting for the
divide to complete.

As I said before it is a trade off between saving a few clocks for
specific cases against adding clocks to all the others.
Leading zero bits on the dividend are very likely, quotients with
the low 16bits zero much less so (except for test cases).

>  		/* Add product to negated divisor */
>  		overflow += mul_u64_long_add_u64(&n_hi, d, q_digit, n_hi);
>  		/* Adjust for the q_digit 'guestimate' being low */
> @@ -322,7 +312,7 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
>  			n_hi += d;
>  			overflow += n_hi < d;
>  		}
> -		quotient = add_u64_long(quotient << BITS_PER_ITER, q_digit);
> +		quotient = add_u64_long(quotient, q_digit);

Moving the shift to the top (after the divide) may help instruction
scheduling (esp. without aggressive out-of-order execution).

	David

>  	}
>  
>  	/*


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code
  2025-06-18  9:16       ` David Laight
@ 2025-06-18 15:39         ` Nicolas Pitre
  2025-06-18 16:42           ` Nicolas Pitre
  2025-06-18 17:54           ` David Laight
  0 siblings, 2 replies; 44+ messages in thread
From: Nicolas Pitre @ 2025-06-18 15:39 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Wed, 18 Jun 2025, David Laight wrote:

> On Tue, 17 Jun 2025 21:33:23 -0400 (EDT)
> Nicolas Pitre <nico@fluxnic.net> wrote:
> 
> > On Tue, 17 Jun 2025, Nicolas Pitre wrote:
> > 
> > > On Sat, 14 Jun 2025, David Laight wrote:
> > >   
> > > > Replace the bit by bit algorithm with one that generates 16 bits
> > > > per iteration on 32bit architectures and 32 bits on 64bit ones.
> > > > 
> > > > On my zen 5 this reduces the time for the tests (using the generic
> > > > code) from ~3350ns to ~1000ns.
> > > > 
> > > > Running the 32bit algorithm on 64bit x86 takes ~1500ns.
> > > > It'll be slightly slower on a real 32bit system, mostly due
> > > > to register pressure.
> > > > 
> > > > The savings for 32bit x86 are much higher (tested in userspace).
> > > > The worst case (lots of bits in the quotient) drops from ~900 clocks
> > > > to ~130 (pretty much independent of the arguments).
> > > > Other 32bit architectures may see better savings.
> > > > 
> > > > It is possible to optimise for divisors that span less than
> > > > __LONG_WIDTH__/2 bits. However I suspect they don't happen that often
> > > > and it doesn't remove any slow cpu divide instructions which dominate
> > > > the result.
> > > > 
> > > > Signed-off-by: David Laight <david.laight.linux@gmail.com>  
> > > 
> > > Nice work. I had to be fully awake to review this one.
> > > Some suggestions below.  
> > 
> > Here's a patch with my suggestions applied to make it easier to figure 
> > them out. The added "inline" is necessary to fix compilation on ARM32. 
> > The "likely()" makes for better assembly and this part is pretty much 
> > likely anyway. I've explained the rest previously, although this is a 
> > better implementation.
> > 
> > commit 99ea338401f03efe5dbebe57e62bd7c588409c5c
> > Author: Nicolas Pitre <nico@fluxnic.net>
> > Date:   Tue Jun 17 14:42:34 2025 -0400
> > 
> >     fixup! lib: mul_u64_u64_div_u64() Optimise the divide code
> > 
> > diff --git a/lib/math/div64.c b/lib/math/div64.c
> > index 3c9fe878ce68..740e59a58530 100644
> > --- a/lib/math/div64.c
> > +++ b/lib/math/div64.c
> > @@ -188,7 +188,7 @@ EXPORT_SYMBOL(iter_div_u64_rem);
> >  
> >  #if !defined(mul_u64_add_u64_div_u64) || defined(test_mul_u64_add_u64_div_u64)
> >  
> > -static u64 mul_add(u32 a, u32 b, u32 c)
> > +static inline u64 mul_add(u32 a, u32 b, u32 c)
> >  {
> >  	return add_u64_u32(mul_u32_u32(a, b), c);
> >  }
> > @@ -246,7 +246,7 @@ static inline u32 mul_u64_long_add_u64(u64 *p_lo, u64 a, u32 b, u64 c)
> >  
> >  u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
> >  {
> > -	unsigned long d_msig, q_digit;
> > +	unsigned long n_long, d_msig, q_digit;
> >  	unsigned int reps, d_z_hi;
> >  	u64 quotient, n_lo, n_hi;
> >  	u32 overflow;
> > @@ -271,36 +271,21 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
> >  
> >  	/* Left align the divisor, shifting the dividend to match */
> >  	d_z_hi = __builtin_clzll(d);
> > -	if (d_z_hi) {
> > +	if (likely(d_z_hi)) {
> >  		d <<= d_z_hi;
> >  		n_hi = n_hi << d_z_hi | n_lo >> (64 - d_z_hi);
> >  		n_lo <<= d_z_hi;
> >  	}
> >  
> > -	reps = 64 / BITS_PER_ITER;
> > -	/* Optimise loop count for small dividends */
> > -	if (!(u32)(n_hi >> 32)) {
> > -		reps -= 32 / BITS_PER_ITER;
> > -		n_hi = n_hi << 32 | n_lo >> 32;
> > -		n_lo <<= 32;
> > -	}
> > -#if BITS_PER_ITER == 16
> > -	if (!(u32)(n_hi >> 48)) {
> > -		reps--;
> > -		n_hi = add_u64_u32(n_hi << 16, n_lo >> 48);
> > -		n_lo <<= 16;
> > -	}
> > -#endif
> > -
> >  	/* Invert the dividend so we can use add instead of subtract. */
> >  	n_lo = ~n_lo;
> >  	n_hi = ~n_hi;
> >  
> >  	/*
> > -	 * Get the most significant BITS_PER_ITER bits of the divisor.
> > +	 * Get the rounded-up most significant BITS_PER_ITER bits of the divisor.
> >  	 * This is used to get a low 'guestimate' of the quotient digit.
> >  	 */
> > -	d_msig = (d >> (64 - BITS_PER_ITER)) + 1;
> > +	d_msig = (d >> (64 - BITS_PER_ITER)) + !!(d << BITS_PER_ITER);
> 
> If the low divisor bits are zero an alternative simpler divide
> can be used (you want to detect it before the left align).
> I deleted that because it complicates things and probably doesn't
> > happen often enough outside the test cases.

Depends. On 32-bit systems that might be worth it. Something like:

	if (unlikely(sizeof(long) == 4 && !(u32)d))
		return div_u64(n_hi, d >> 32);

> >  	/*
> >  	 * Now do a 'long division' with BITS_PER_ITER bit 'digits'.
> > @@ -308,12 +293,17 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
> >  	 * The worst case is dividing ~0 by 0x8000 which requires two subtracts.
> >  	 */
> >  	quotient = 0;
> > -	while (reps--) {
> > -		q_digit = (unsigned long)(~n_hi >> (64 - 2 * BITS_PER_ITER)) / d_msig;
> > +	for (reps = 64 / BITS_PER_ITER; reps; reps--) {
> > +		quotient <<= BITS_PER_ITER;
> > +		n_long = ~n_hi >> (64 - 2 * BITS_PER_ITER);
> >  		/* Shift 'n' left to align with the product q_digit * d */
> >  		overflow = n_hi >> (64 - BITS_PER_ITER);
> >  		n_hi = add_u64_u32(n_hi << BITS_PER_ITER, n_lo >> (64 - BITS_PER_ITER));
> >  		n_lo <<= BITS_PER_ITER;
> > +		/* cut it short if q_digit would be 0 */
> > +		if (n_long < d_msig)
> > +			continue;
> 
> I don't think that is right.
> d_msig is an overestimate so you can only skip the divide and mul_add().

That's what I thought initially. But `overflow` was always 0xffff in 
that case. I'd like to prove it mathematically, but empirically this 
seems to be true with all edge cases I could think of.

I also have a test rig using random numbers validating against the 
native x86 128/64 div and it has been running for an hour.
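
For reference, a minimal user-space rig of that shape (a sketch, not the
actual program: it uses unsigned __int128 for the reference result rather
than inline divq, and assumes mul_u64_add_u64_div_u64() has been copied
out of div64.c into the same build) could look like:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef uint64_t u64;

/* copy of the generic code under test, compiled into this program */
u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d);

static u64 rand64(void)
{
        return (u64)rand() << 62 ^ (u64)rand() << 31 ^ (u64)rand();
}

int main(void)
{
        for (;;) {
                u64 a = rand64(), b = rand64(), c = rand64(), d = rand64();
                unsigned __int128 n = (unsigned __int128)a * b + c;

                /* skip divide-by-zero and quotients that don't fit in 64 bits */
                if (!d || n / d > ~(u64)0)
                        continue;
                if (mul_u64_add_u64_div_u64(a, b, c, d) != (u64)(n / d)) {
                        fprintf(stderr, "FAIL %llx %llx %llx %llx\n",
                                (unsigned long long)a, (unsigned long long)b,
                                (unsigned long long)c, (unsigned long long)d);
                        return 1;
                }
        }
}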

> Could be something like:
> 		if (n_long < d_msig) {
> 			if (!n_long)
> 				continue;
> 			q_digit = 0;
> 		} else {
> 			q_digit = n_long / d_msig;
> 	  		/* Add product to negated divisor */
> 	  		overflow += mul_u64_long_add_u64(&n_hi, d, q_digit, n_hi);		
> 		}
> but that starts looking like branch prediction hell.

... and so far unnecessary (see above).


> 
> > +		q_digit = n_long / d_msig;
> 
> I think you want to do the divide right at the top - maybe even if the
> result isn't used!
> All the shifts then happen while the divide instruction is in progress
> (even without out-of-order execution).

Only if there is an actual divide instruction available. If it is a 
library call then you're screwed.

And the compiler ought to be able to do that kind of shuffling on its 
own when there's a benefit.

> It is also quite likely that any cpu divide instruction takes a lot
> less clocks when the dividend or quotient is small.
> So if the quotient would be zero there isn't a stall waiting for the
> divide to complete.
> 
> As I said before it is a trade off between saving a few clocks for
> specific cases against adding clocks to all the others.
> Leading zero bits on the dividend are very likely, quotients with
> the low 16bits zero much less so (except for test cases).

Given what I said above wrt overflow I think this is a good tradeoff.
We just need to measure it.


Nicolas

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code
  2025-06-18 15:39         ` Nicolas Pitre
@ 2025-06-18 16:42           ` Nicolas Pitre
  2025-06-18 17:54           ` David Laight
  1 sibling, 0 replies; 44+ messages in thread
From: Nicolas Pitre @ 2025-06-18 16:42 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Wed, 18 Jun 2025, Nicolas Pitre wrote:

> On Wed, 18 Jun 2025, David Laight wrote:
> 
> > If the low divisor bits are zero an alternative simpler divide
> > can be used (you want to detect it before the left align).
> > I deleted that because it complicates things and probably doesn't
> > > happen often enough outside the test cases.
> 
> Depends. On 32-bit systems that might be worth it. Something like:
> 
> 	if (unlikely(sizeof(long) == 4 && !(u32)d))
> 		return div_u64(n_hi, d >> 32);

Well scratch that. Won't work obviously.


Nicolas

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code
  2025-06-18 15:39         ` Nicolas Pitre
  2025-06-18 16:42           ` Nicolas Pitre
@ 2025-06-18 17:54           ` David Laight
  2025-06-18 20:12             ` Nicolas Pitre
  1 sibling, 1 reply; 44+ messages in thread
From: David Laight @ 2025-06-18 17:54 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Wed, 18 Jun 2025 11:39:20 -0400 (EDT)
Nicolas Pitre <nico@fluxnic.net> wrote:

> On Wed, 18 Jun 2025, David Laight wrote:
> 
> > On Tue, 17 Jun 2025 21:33:23 -0400 (EDT)
> > Nicolas Pitre <nico@fluxnic.net> wrote:
> >   
> > > On Tue, 17 Jun 2025, Nicolas Pitre wrote:
> > >   
> > > > On Sat, 14 Jun 2025, David Laight wrote:
> > > >     
> > > > > Replace the bit by bit algorithm with one that generates 16 bits
> > > > > per iteration on 32bit architectures and 32 bits on 64bit ones.
> > > > > 
> > > > > On my zen 5 this reduces the time for the tests (using the generic
> > > > > code) from ~3350ns to ~1000ns.
> > > > > 
> > > > > Running the 32bit algorithm on 64bit x86 takes ~1500ns.
> > > > > It'll be slightly slower on a real 32bit system, mostly due
> > > > > to register pressure.
> > > > > 
> > > > > The savings for 32bit x86 are much higher (tested in userspace).
> > > > > The worst case (lots of bits in the quotient) drops from ~900 clocks
> > > > > to ~130 (pretty much independent of the arguments).
> > > > > Other 32bit architectures may see better savings.
> > > > > 
> > > > > It is possible to optimise for divisors that span less than
> > > > > __LONG_WIDTH__/2 bits. However I suspect they don't happen that often
> > > > > and it doesn't remove any slow cpu divide instructions which dominate
> > > > > the result.
> > > > > 
> > > > > Signed-off-by: David Laight <david.laight.linux@gmail.com>    
> > > > 
> > > > Nice work. I had to be fully awake to review this one.
> > > > Some suggestions below.    
> > > 
> > > Here's a patch with my suggestions applied to make it easier to figure 
> > > them out. The added "inline" is necessary to fix compilation on ARM32. 
> > > The "likely()" makes for better assembly and this part is pretty much 
> > > likely anyway. I've explained the rest previously, although this is a 
> > > better implementation.
> > > 
> > > commit 99ea338401f03efe5dbebe57e62bd7c588409c5c
> > > Author: Nicolas Pitre <nico@fluxnic.net>
> > > Date:   Tue Jun 17 14:42:34 2025 -0400
> > > 
> > >     fixup! lib: mul_u64_u64_div_u64() Optimise the divide code
> > > 
> > > diff --git a/lib/math/div64.c b/lib/math/div64.c
> > > index 3c9fe878ce68..740e59a58530 100644
> > > --- a/lib/math/div64.c
> > > +++ b/lib/math/div64.c
> > > @@ -188,7 +188,7 @@ EXPORT_SYMBOL(iter_div_u64_rem);
> > >  
> > >  #if !defined(mul_u64_add_u64_div_u64) || defined(test_mul_u64_add_u64_div_u64)
> > >  
> > > -static u64 mul_add(u32 a, u32 b, u32 c)
> > > +static inline u64 mul_add(u32 a, u32 b, u32 c)
> > >  {
> > >  	return add_u64_u32(mul_u32_u32(a, b), c);
> > >  }
> > > @@ -246,7 +246,7 @@ static inline u32 mul_u64_long_add_u64(u64 *p_lo, u64 a, u32 b, u64 c)
> > >  
> > >  u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
> > >  {
> > > -	unsigned long d_msig, q_digit;
> > > +	unsigned long n_long, d_msig, q_digit;
> > >  	unsigned int reps, d_z_hi;
> > >  	u64 quotient, n_lo, n_hi;
> > >  	u32 overflow;
> > > @@ -271,36 +271,21 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
> > >  
> > >  	/* Left align the divisor, shifting the dividend to match */
> > >  	d_z_hi = __builtin_clzll(d);
> > > -	if (d_z_hi) {
> > > +	if (likely(d_z_hi)) {
> > >  		d <<= d_z_hi;
> > >  		n_hi = n_hi << d_z_hi | n_lo >> (64 - d_z_hi);
> > >  		n_lo <<= d_z_hi;
> > >  	}
> > >  
> > > -	reps = 64 / BITS_PER_ITER;
> > > -	/* Optimise loop count for small dividends */
> > > -	if (!(u32)(n_hi >> 32)) {
> > > -		reps -= 32 / BITS_PER_ITER;
> > > -		n_hi = n_hi << 32 | n_lo >> 32;
> > > -		n_lo <<= 32;
> > > -	}
> > > -#if BITS_PER_ITER == 16
> > > -	if (!(u32)(n_hi >> 48)) {
> > > -		reps--;
> > > -		n_hi = add_u64_u32(n_hi << 16, n_lo >> 48);
> > > -		n_lo <<= 16;
> > > -	}
> > > -#endif
> > > -
> > >  	/* Invert the dividend so we can use add instead of subtract. */
> > >  	n_lo = ~n_lo;
> > >  	n_hi = ~n_hi;
> > >  
> > >  	/*
> > > -	 * Get the most significant BITS_PER_ITER bits of the divisor.
> > > +	 * Get the rounded-up most significant BITS_PER_ITER bits of the divisor.
> > >  	 * This is used to get a low 'guestimate' of the quotient digit.
> > >  	 */
> > > -	d_msig = (d >> (64 - BITS_PER_ITER)) + 1;
> > > +	d_msig = (d >> (64 - BITS_PER_ITER)) + !!(d << BITS_PER_ITER);  
> > 
> > If the low divisor bits are zero an alternative simpler divide
> > can be used (you want to detect it before the left align).
> > I deleted that because it complicates things and probably doesn't
> > happen often enough outside the test cases.
> 
> Depends. On 32-bit systems that might be worth it. Something like:
> 
> 	if (unlikely(sizeof(long) == 4 && !(u32)d))
> 		return div_u64(n_hi, d >> 32);

You need a bigger divide than that (96 bits by 32).
It is also possible this code is better than div_u64() !

My initial version optimised for divisors with max 16 bits using:

        d_z_hi = __builtin_clzll(d);
        d_z_lo = __builtin_ffsll(d) - 1;
        if (d_z_hi + d_z_lo >= 48) {
                // Max 16 bits in divisor, shift right
                if (d_z_hi < 48) {
                        n_lo = n_lo >> d_z_lo | n_hi << (64 - d_z_lo);
                        n_hi >>= d_z_lo;
                        d >>= d_z_lo;
                }
                return div_me_16(n_hi, n_lo >> 32, n_lo, d);
        }
	// Followed by the 'align left' code

with the 80 by 16 divide:
static u64 div_me_16(u32 acc, u32 n1, u32 n0, u32 d)
{
        u64 q = 0;

        if (acc) {
                acc = acc << 16 | n1 >> 16;
                q = (acc / d) << 16;
                acc = (acc % d) << 16 | (n1 & 0xffff);
        } else {
                acc = n1;
                if (!acc)
                        return n0 / d;
        }
        q |= acc / d;
        q <<= 32;
        acc = (acc % d) << 16 | (n0 >> 16);
        q |= (acc / d) << 16;
        acc = (acc % d) << 16 | (n0 & 0xffff);
        q |= acc / d;

        return q;
}

In the end I decided the cost of the 64bit ffsll() on 32bit was too much.
(I could cut it down to 3 'cmov' instead of 4 - and remember they have to be
conditional jumps in the kernel.)

> 
> > >  	/*
> > >  	 * Now do a 'long division' with BITS_PER_ITER bit 'digits'.
> > > @@ -308,12 +293,17 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
> > >  	 * The worst case is dividing ~0 by 0x8000 which requires two subtracts.
> > >  	 */
> > >  	quotient = 0;
> > > -	while (reps--) {
> > > -		q_digit = (unsigned long)(~n_hi >> (64 - 2 * BITS_PER_ITER)) / d_msig;
> > > +	for (reps = 64 / BITS_PER_ITER; reps; reps--) {
> > > +		quotient <<= BITS_PER_ITER;
> > > +		n_long = ~n_hi >> (64 - 2 * BITS_PER_ITER);
> > >  		/* Shift 'n' left to align with the product q_digit * d */
> > >  		overflow = n_hi >> (64 - BITS_PER_ITER);
> > >  		n_hi = add_u64_u32(n_hi << BITS_PER_ITER, n_lo >> (64 - BITS_PER_ITER));
> > >  		n_lo <<= BITS_PER_ITER;
> > > +		/* cut it short if q_digit would be 0 */
> > > +		if (n_long < d_msig)
> > > +			continue;  
> > 
> > I don't think that is right.
> > d_msig is an overestimate so you can only skip the divide and mul_add().  
> 
> That's what I thought initially. But `overflow` was always 0xffff in 
> that case. I'd like to prove it mathematically, but empirically this 
> seems to be true with all edge cases I could think of.

You are right :-(
It all gets 'saved' because the next q_digit can be 17 bits.

> I also have a test rig using random numbers validating against the 
> native x86 128/64 div and it has been running for an hour.

I doubt random values will hit many nasty edge cases.

...
> >   
> > > +		q_digit = n_long / d_msig;  
> > 
> > I think you want to do the divide right at the top - maybe even if the
> > result isn't used!
> > All the shifts then happen while the divide instruction is in progress
> > (even without out-of-order execution).  
> 
> Only if there is an actual divide instruction available. If it is a 
> library call then you're screwed.

The library call may well short circuit it anyway.

> And the compiler ought to be able to do that kind of shuffling on its 
> own when there's a benefit.

In my experience it doesn't do a very good job - best to give it help.
In this case I'd bet it even moves it later on (especially if you add
a conditional path that doesn't need it).

Try getting (a + b) + (c + d) compiled that way rather than as
(a + (b + (c + d))) which has a longer register dependency chain.
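
As a concrete sketch of that dependency-chain point (illustrative only;
whether a given compiler keeps the balanced shape has to be checked in
the generated assembly, which is exactly the problem):

        /* balanced: the two inner adds are independent, depth 2 */
        u64 sum4_tree(u64 a, u64 b, u64 c, u64 d)
        {
                return (a + b) + (c + d);
        }

        /* chained: each add waits for the previous result, depth 3 */
        u64 sum4_chain(u64 a, u64 b, u64 c, u64 d)
        {
                return a + (b + (c + d));
        }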

> 
> > It is also quite likely that any cpu divide instruction takes a lot
> > less clocks when the dividend or quotient is small.
> > So if the quotient would be zero there isn't a stall waiting for the
> > divide to complete.
> > 
> > As I said before it is a trade off between saving a few clocks for
> > specific cases against adding clocks to all the others.
> > Leading zero bits on the dividend are very likely, quotients with
> > the low 16bits zero much less so (except for test cases).  
> 
> Given what I said above wrt overflow I think this is a good tradeoff.
> We just need to measure it.

Can you do accurate timings for arm64 or arm32?

I've found a 2004 Arm book that includes several I-cache busting
divide algorithms.
But I'm sure this pi-5 has hardware divide.

My suspicion is that provided the cpu has reasonable multiply instructions
this code will be reasonable with a 32 by 32 software divide.

According to Agner's tables all AMD Zen cpus have a fast 64bit divide.
The same isn't true for Intel though: Skylake is 26 clocks for r32 but 35-88 for r64.
You have to get to IceLake (xx-10) to get a fast r64 divide.
Probably not enough to avoid for the 128 by 64 case, but there may be cases
where it is worth it.

I need to get my test farm set up - and source a zen1/2 system.

	David

> 
> 
> Nicolas


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code
  2025-06-18 17:54           ` David Laight
@ 2025-06-18 20:12             ` Nicolas Pitre
  2025-06-18 22:26               ` David Laight
  0 siblings, 1 reply; 44+ messages in thread
From: Nicolas Pitre @ 2025-06-18 20:12 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Wed, 18 Jun 2025, David Laight wrote:

> On Wed, 18 Jun 2025 11:39:20 -0400 (EDT)
> Nicolas Pitre <nico@fluxnic.net> wrote:
> 
> > > > +		q_digit = n_long / d_msig;  
> > > 
> > > I think you want to do the divide right at the top - maybe even if the
> > > result isn't used!
> > > All the shifts then happen while the divide instruction is in progress
> > > (even without out-of-order execution).  

Well.... testing on my old Intel Core i7-4770R doesn't show a gain.

With your proposed patch as is: ~34ns per call

With my proposed changes: ~31ns per call

With my changes but leaving the divide at the top of the loop: ~32ns per call

> Can you do accurate timings for arm64 or arm32?

On a Broadcom BCM2712 (ARM Cortex-A76):

With your proposed patch as is: ~20 ns per call

With my proposed changes: ~19 ns per call

With my changes but leaving the divide at the top of the loop: ~19 ns per call

Both CPUs have the same max CPU clock rate (2.4 GHz). These are obtained 
with clock_gettime(CLOCK_MONOTONIC) over 56000 calls. There is some 
noise in the results over multiple runs, but still.

I could get cycle measurements on the RPi5 but that requires a kernel 
recompile.

> I've found a 2004 Arm book that includes several I-cache busting
> divide algorithms.
> But I'm sure this pi-5 has hardware divide.

It does.


Nicolas

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code
  2025-06-18 20:12             ` Nicolas Pitre
@ 2025-06-18 22:26               ` David Laight
  2025-06-19  2:43                 ` Nicolas Pitre
  0 siblings, 1 reply; 44+ messages in thread
From: David Laight @ 2025-06-18 22:26 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Wed, 18 Jun 2025 16:12:49 -0400 (EDT)
Nicolas Pitre <nico@fluxnic.net> wrote:

> On Wed, 18 Jun 2025, David Laight wrote:
> 
> > On Wed, 18 Jun 2025 11:39:20 -0400 (EDT)
> > Nicolas Pitre <nico@fluxnic.net> wrote:
> >   
> > > > > +		q_digit = n_long / d_msig;    
> > > > 
> > > > I think you want to do the divide right at the top - maybe even if the
> > > > result isn't used!
> > > > All the shifts then happen while the divide instruction is in progress
> > > > (even without out-of-order execution).    
> 
> Well.... testing on my old Intel Core i7-4770R doesn't show a gain.
> 
> With your proposed patch as is: ~34ns per call
> 
> With my proposed changes: ~31ns per call
> 
> With my changes but leaving the divide at the top of the loop: ~32ns per call

I wonder what makes the difference...
Is that random 64bit values (where you don't expect zero digits)
or values where there are likely to be small divisors and/or zero digits?

On x86 you can use the PERF_COUNT_HW_CPU_CYCLES counter to get pretty accurate
counts for a single call.
The 'trick' is to use syscall(__NR_perf_event_open,...) and pc = mmap() to get
the register number pc->index - 1.
Then you want:
static inline unsigned int rdpmc(unsigned int counter)
{
        unsigned int low, high;
        asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (counter));
        return low;
}
and do:
	unsigned int start = rdpmc(pc->index - 1);
	unsigned int zero = 0;
	OPTIMISER_HIDE_VAR(zero);
	q = mul_u64_add_u64_div_u64(a + (start & zero), b, c, d);
	elapsed = rdpmc(pc->index - 1 + (q & zero)) - start;

That carefully forces the rdpmc reads to bracket the code being tested without
the massive penalty of lfence/mfence.
Do 10 calls and the last 8 will be pretty similar.
Lets you time cold-cache and branch mis-prediction effects.
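
For completeness, the setup behind that is roughly the following (a sketch:
error handling and the pc->cap_user_rdpmc check are omitted):

#include <linux/perf_event.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>

static struct perf_event_mmap_page *pc;

static int setup_cycle_counter(void)
{
        struct perf_event_attr attr;
        int fd;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.exclude_kernel = 1;
        fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0)
                return -1;
        pc = mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED, fd, 0);
        if (pc == MAP_FAILED)
                return -1;
        /* pc->index is the hardware counter number + 1; 0 means unavailable */
        return pc->index ? 0 : -1;
}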

> > Can you do accurate timings for arm64 or arm32?  
> 
> On a Broadcom BCM2712 (ARM Cortex-A76):
> 
> With your proposed patch as is: ~20 ns per call
> 
> With my proposed changes: ~19 ns per call
> 
> With my changes but leaving the divide at the top of the loop: ~19 ns per call

Pretty much no difference.
Is that 64bit or 32bit (or the 16 bits per iteration on 64bit) ?
The shifts get more expensive on 32bit.
Have you timed the original code?


> 
> Both CPUs have the same max CPU clock rate (2.4 GHz). These are obtained 
> with clock_gettime(CLOCK_MONOTONIC) over 56000 calls. There is some 
> noise in the results over multiple runs, but still.

That many loops definitely trains the branch predictor and ignores
any effects of loading the I-cache.
As Linus keeps saying, the kernel tends to be 'cold cache', code size
matters.
That also means that branches are 50% likely to be mis-predicted.
(Although working out what cpus actually do is hard.)

> 
> I could get cycle measurements on the RPi5 but that requires a kernel 
> recompile.

Or a loadable module - shame there isn't a sysctl.

> 
> > I've found a 2004 Arm book that includes several I-cache busting
> > divide algorithms.
> > But I'm sure this pi-5 has hardware divide.  
> 
> It does.
> 
> 
> Nicolas


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code
  2025-06-18 22:26               ` David Laight
@ 2025-06-19  2:43                 ` Nicolas Pitre
  2025-06-19  8:32                   ` David Laight
  0 siblings, 1 reply; 44+ messages in thread
From: Nicolas Pitre @ 2025-06-19  2:43 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Wed, 18 Jun 2025, David Laight wrote:

> On Wed, 18 Jun 2025 16:12:49 -0400 (EDT)
> Nicolas Pitre <nico@fluxnic.net> wrote:
> 
> > On Wed, 18 Jun 2025, David Laight wrote:
> > 
> > > On Wed, 18 Jun 2025 11:39:20 -0400 (EDT)
> > > Nicolas Pitre <nico@fluxnic.net> wrote:
> > >   
> > > > > > +		q_digit = n_long / d_msig;    
> > > > > 
> > > > > I think you want to do the divide right at the top - maybe even if the
> > > > > result isn't used!
> > > > > All the shifts then happen while the divide instruction is in progress
> > > > > (even without out-of-order execution).    
> > 
> > Well.... testing on my old Intel Core i7-4770R doesn't show a gain.
> > 
> > With your proposed patch as is: ~34ns per call
> > 
> > With my proposed changes: ~31ns per call
> > 
> > With my changes but leaving the divide at the top of the loop: ~32ns per call
> 
> Wonders what makes the difference...
> Is that random 64bit values (where you don't expect zero digits)
> or values where there are likely to be small divisors and/or zero digits?

Those are values from the test module. I just copied it into a user 
space program.

> On x86 you can use the PERF_COUNT_HW_CPU_CYCLES counter to get pretty accurate
> counts for a single call.
> The 'trick' is to use syscall(___NR_perf_event_open,...) and pc = mmap() to get
> the register number pc->index - 1.

I'm not acquainted enough with x86 to fully make sense of the above.
This is more your territory than mine.  ;-)

> > > Can you do accurate timings for arm64 or arm32?  
> > 
> > On a Broadcom BCM2712 (ARM Cortex-A76):
> > 
> > With your proposed patch as is: ~20 ns per call
> > 
> > With my proposed changes: ~19 ns per call
> > 
> > With my changes but leaving the divide at the top of the loop: ~19 ns per call
> 
> Pretty much no difference.
> Is that 64bit or 32bit (or the 16 bits per iteration on 64bit) ?

64bit executable with 32 bits per iteration.

> Have you timed the original code?

The "your proposed patch as is" is the original code per the first email 
in this thread.

> > Both CPUs have the same max CPU clock rate (2.4 GHz). These are obtained 
> > with clock_gettime(CLOCK_MONOTONIC) over 56000 calls. There is some 
> > noise in the results over multiple runs, but still.
> 
> That many loops definitely trains the branch predictor and ignores
> any effects of loading the I-cache.
> As Linus keeps saying, the kernel tends to be 'cold cache', code size
> matters.

My proposed changes shrink the code especially on 32-bit systems due to 
the pre-loop special cases removal.

> That also means that branches are 50% likely to be mis-predicted.

We can tag it as unlikely. In practice this isn't taken very often.

> (Although working out what cpu actually do is hard.)
> 
> > 
> > I could get cycle measurements on the RPi5 but that requires a kernel 
> > recompile.
> 
> Or a loadable module - shame there isn't a sysctl.

Not sure. I've not investigated how the RPi kernel is configured yet.
I suspect this is about granting user space direct access to PMU regs.


Nicolas

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code
  2025-06-19  2:43                 ` Nicolas Pitre
@ 2025-06-19  8:32                   ` David Laight
  0 siblings, 0 replies; 44+ messages in thread
From: David Laight @ 2025-06-19  8:32 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Wed, 18 Jun 2025 22:43:47 -0400 (EDT)
Nicolas Pitre <nico@fluxnic.net> wrote:

> On Wed, 18 Jun 2025, David Laight wrote:
> 
> > On Wed, 18 Jun 2025 16:12:49 -0400 (EDT)
> > Nicolas Pitre <nico@fluxnic.net> wrote:
> >   
> > > On Wed, 18 Jun 2025, David Laight wrote:
> > >   
> > > > On Wed, 18 Jun 2025 11:39:20 -0400 (EDT)
> > > > Nicolas Pitre <nico@fluxnic.net> wrote:
> > > >     
> > > > > > > +		q_digit = n_long / d_msig;      
> > > > > > 
> > > > > > I think you want to do the divide right at the top - maybe even if the
> > > > > > result isn't used!
> > > > > > All the shifts then happen while the divide instruction is in progress
> > > > > > (even without out-of-order execution).      
> > > 
> > > Well.... testing on my old Intel Core i7-4770R doesn't show a gain.
> > > 
> > > With your proposed patch as is: ~34ns per call
> > > 
> > > With my proposed changes: ~31ns per call
> > > 
> > > With my changes but leaving the divide at the top of the loop: ~32ns per call  
> > 
> > Wonders what makes the difference...
> > Is that random 64bit values (where you don't expect zero digits)
> > or values where there are likely to be small divisors and/or zero digits?  
> 
> Those are values from the test module. I just copied it into a user 
> space program.

Ah, those tests are heavily biased towards values with all bits set.
I added the 'pre-loop' check to speed up the few that have leading zeros
(and don't escape into the div_u64() path).
I've been timing the divisions separately.

I will look at whether it is worth just checking for the top 32bits
being zero on 32bit - where the conditional code is just register moves.

...
> My proposed changes shrink the code especially on 32-bit systems due to 
> the pre-loop special cases removal.
> 
> > That also means that branches are 50% likely to be mis-predicted.  
> 
> We can tag it as unlikely. In practice this isn't taken very often.

I suspect 'unlikely' is over-rated :-)
I had 'fun and games' a few years back trying to minimise the worst-case
path for some code running on a simple embedded cpu.
Firstly gcc seems to ignore 'unlikely' unless there is code in the 'likely'
path - an asm comment will do nicely.

Then there is the cpu itself: the x86 prefetch logic is likely to assume
non-taken (probably actually not-branch), but the prediction logic itself
uses whatever is in the selected entry (effectively an array - assume hashed),
so if it isn't 'trained' on the code being executed the branch is 50% taken.
(Only the P6 (late 90's) had a prefix for unlikely/likely.)
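
A sketch of that 'asm comment' trick (my reading of it; the names are
made up for illustration):

        if (unlikely(need_fixup)) {
                do_fixup();             /* the cold path */
        } else {
                /*
                 * Empty 'likely' arm: the asm comment gives gcc something
                 * to lay out in line, so the cold path is moved out of line.
                 */
                asm volatile("# likely path");
        }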

> 
> > (Although working out what cpu actually do is hard.)
> >   
> > > 
> > > I could get cycle measurements on the RPi5 but that requires a kernel 
> > > recompile.  
> > 
> > Or a loadable module - shame there isn't a sysctl.  
> 
> Not sure. I've not investigated how the RPi kernel is configured yet.
> I suspect this is about granting user space direct access to PMU regs.

Something like that - you don't get the TSC by default.
Access is denied to (try to) stop timing attacks - but doesn't really
help. Just makes it all too hard for everyone else.

I'm not sure it also stops all the 'time' functions being implemented
in the vdso without a system call - and that hurts performance.

	David

> 
> 
> Nicolas


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code
  2025-06-18  1:33     ` Nicolas Pitre
  2025-06-18  9:16       ` David Laight
@ 2025-06-26 21:46       ` David Laight
  2025-06-27  3:48         ` Nicolas Pitre
  1 sibling, 1 reply; 44+ messages in thread
From: David Laight @ 2025-06-26 21:46 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Tue, 17 Jun 2025 21:33:23 -0400 (EDT)
Nicolas Pitre <nico@fluxnic.net> wrote:

> On Tue, 17 Jun 2025, Nicolas Pitre wrote:
> 
> > On Sat, 14 Jun 2025, David Laight wrote:
> >   
> > > Replace the bit by bit algorithm with one that generates 16 bits
> > > per iteration on 32bit architectures and 32 bits on 64bit ones.
> > > 
> > > On my zen 5 this reduces the time for the tests (using the generic
> > > code) from ~3350ns to ~1000ns.
> > > 
> > > Running the 32bit algorithm on 64bit x86 takes ~1500ns.
> > > It'll be slightly slower on a real 32bit system, mostly due
> > > to register pressure.
> > > 
> > > The savings for 32bit x86 are much higher (tested in userspace).
> > > The worst case (lots of bits in the quotient) drops from ~900 clocks
> > > > to ~130 (pretty much independent of the arguments).
> > > Other 32bit architectures may see better savings.
> > > 
> > > > It is possible to optimise for divisors that span less than
> > > __LONG_WIDTH__/2 bits. However I suspect they don't happen that often
> > > and it doesn't remove any slow cpu divide instructions which dominate
> > > the result.
> > > 
> > > Signed-off-by: David Laight <david.laight.linux@gmail.com>  
> > 
> > Nice work. I had to be fully awake to review this one.
> > Some suggestions below.  
> 
> Here's a patch with my suggestions applied to make it easier to figure 
> them out. The added "inline" is necessary to fix compilation on ARM32. 
> The "likely()" makes for better assembly and this part is pretty much 
> likely anyway. I've explained the rest previously, although this is a 
> better implementation.

I've been trying to find time to do some more performance measurements.
I did a few last night and managed to realise at least some of what
is happening.

I've dropped the responses in here since there is a copy of the code.

The 'TLDR' is that the full divide takes my zen5 about 80 clocks
(including some test red tape) provided the branch-predictor is
predicting correctly.
I get that consistently for repeats of the same division, and moving
onto the next one - provided they take the same path.
(eg for the last few 'random' value tests.)
However if I include something that takes a different path (eg divisor
doesn't have the top bit set, or the product has enough high zero bits
to take the 'optimised' path) then the first two or three repeats of the
next division are at least 20 clocks slower.
In the kernel these calls aren't going to be common enough (and with
similar inputs) to train the branch predictor.
So the mispredicted branches really start to dominate.
This is similar to the issue that Linus keeps mentioning about kernel
code being mainly 'cold cache'.

> 
> commit 99ea338401f03efe5dbebe57e62bd7c588409c5c
> Author: Nicolas Pitre <nico@fluxnic.net>
> Date:   Tue Jun 17 14:42:34 2025 -0400
> 
>     fixup! lib: mul_u64_u64_div_u64() Optimise the divide code
> 
> diff --git a/lib/math/div64.c b/lib/math/div64.c
> index 3c9fe878ce68..740e59a58530 100644
> --- a/lib/math/div64.c
> +++ b/lib/math/div64.c
> @@ -188,7 +188,7 @@ EXPORT_SYMBOL(iter_div_u64_rem);
>  
>  #if !defined(mul_u64_add_u64_div_u64) || defined(test_mul_u64_add_u64_div_u64)
>  
> -static u64 mul_add(u32 a, u32 b, u32 c)
> +static inline u64 mul_add(u32 a, u32 b, u32 c)

If inline is needed, it needs to be always_inline.

>  {
>  	return add_u64_u32(mul_u32_u32(a, b), c);
>  }
> @@ -246,7 +246,7 @@ static inline u32 mul_u64_long_add_u64(u64 *p_lo, u64 a, u32 b, u64 c)
>  
>  u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
>  {
> -	unsigned long d_msig, q_digit;
> +	unsigned long n_long, d_msig, q_digit;
>  	unsigned int reps, d_z_hi;
>  	u64 quotient, n_lo, n_hi;
>  	u32 overflow;
> @@ -271,36 +271,21 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
>  
>  	/* Left align the divisor, shifting the dividend to match */
>  	d_z_hi = __builtin_clzll(d);
> -	if (d_z_hi) {
> +	if (likely(d_z_hi)) {

I don't think likely() helps much.
But for 64bit this can be unconditional...

>  		d <<= d_z_hi;
>  		n_hi = n_hi << d_z_hi | n_lo >> (64 - d_z_hi);

... replace 'n_lo >> (64 - d_z_hi)' with 'n_lo >> (63 - d_z_hi) >> 1'.
The extra shift probably costs too much on 32bit though.


>  		n_lo <<= d_z_hi;
>  	}
>  
> -	reps = 64 / BITS_PER_ITER;
> -	/* Optimise loop count for small dividends */
> -	if (!(u32)(n_hi >> 32)) {
> -		reps -= 32 / BITS_PER_ITER;
> -		n_hi = n_hi << 32 | n_lo >> 32;
> -		n_lo <<= 32;
> -	}
> -#if BITS_PER_ITER == 16
> -	if (!(u32)(n_hi >> 48)) {
> -		reps--;
> -		n_hi = add_u64_u32(n_hi << 16, n_lo >> 48);
> -		n_lo <<= 16;
> -	}
> -#endif

You don't want the branch in the loop.
Once predicted correctly it makes little difference where you put
the test - but that won't be 'normal'.
Unless it gets trained odd-even it'll get mispredicted at least once.
For 64bit (zen5) the test outside the loop was less bad
(IIRC dropped to 65 clocks - so probably worth it).
I need to check an older cpu (got a few Intel ones).


> -
>  	/* Invert the dividend so we can use add instead of subtract. */
>  	n_lo = ~n_lo;
>  	n_hi = ~n_hi;
>  
>  	/*
> -	 * Get the most significant BITS_PER_ITER bits of the divisor.
> +	 * Get the rounded-up most significant BITS_PER_ITER bits of the divisor.
>  	 * This is used to get a low 'guestimate' of the quotient digit.
>  	 */
> -	d_msig = (d >> (64 - BITS_PER_ITER)) + 1;
> +	d_msig = (d >> (64 - BITS_PER_ITER)) + !!(d << BITS_PER_ITER);

For 64bit x64 that is pretty much zero cost
(gcc manages to use the 'Z' flag from the '<< 32' in a 'setne' to get a 1;
I didn't look hard enough to find where the zero register came from
since 'setne' only changes the low 8 bits).
But it is horrid for 32bit - added quite a few clocks.
I'm not at all sure it is worth it though.


>  
>  	/*
>  	 * Now do a 'long division' with BITS_PER_ITER bit 'digits'.
> @@ -308,12 +293,17 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
>  	 * The worst case is dividing ~0 by 0x8000 which requires two subtracts.
>  	 */
>  	quotient = 0;
> -	while (reps--) {
> -		q_digit = (unsigned long)(~n_hi >> (64 - 2 * BITS_PER_ITER)) / d_msig;
> +	for (reps = 64 / BITS_PER_ITER; reps; reps--) {
> +		quotient <<= BITS_PER_ITER;
> +		n_long = ~n_hi >> (64 - 2 * BITS_PER_ITER);
>  		/* Shift 'n' left to align with the product q_digit * d */
>  		overflow = n_hi >> (64 - BITS_PER_ITER);
>  		n_hi = add_u64_u32(n_hi << BITS_PER_ITER, n_lo >> (64 - BITS_PER_ITER));
>  		n_lo <<= BITS_PER_ITER;
> +		/* cut it short if q_digit would be 0 */
> +		if (n_long < d_msig)
> +			continue;

As I said above that gets mispredicted too often.

> +		q_digit = n_long / d_msig;

I fiddled with the order of the lines of code - it changes the way the object
code is laid out and changes the clock count 'randomly' by one or two.
Noise compared to mispredicted branches.

I've not tested on an older system with a slower divide.
I suspect older (Haswell and Broadwell) may benefit from the divide
being scheduled earlier.

If anyone wants a copy of the test program I can send it to you.

	David

>  		/* Add product to negated divisor */
>  		overflow += mul_u64_long_add_u64(&n_hi, d, q_digit, n_hi);
>  		/* Adjust for the q_digit 'guestimate' being low */
> @@ -322,7 +312,7 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
>  			n_hi += d;
>  			overflow += n_hi < d;
>  		}
> -		quotient = add_u64_long(quotient << BITS_PER_ITER, q_digit);
> +		quotient = add_u64_long(quotient, q_digit);
>  	}
>  
>  	/*


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code
  2025-06-26 21:46       ` David Laight
@ 2025-06-27  3:48         ` Nicolas Pitre
  0 siblings, 0 replies; 44+ messages in thread
From: Nicolas Pitre @ 2025-06-27  3:48 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das

On Thu, 26 Jun 2025, David Laight wrote:

> On Tue, 17 Jun 2025 21:33:23 -0400 (EDT)
> Nicolas Pitre <nico@fluxnic.net> wrote:
> 
> > On Tue, 17 Jun 2025, Nicolas Pitre wrote:
> > 
> > > On Sat, 14 Jun 2025, David Laight wrote:
> > >   
> > > > Replace the bit by bit algorithm with one that generates 16 bits
> > > > per iteration on 32bit architectures and 32 bits on 64bit ones.
> > > > 
> > > > On my zen 5 this reduces the time for the tests (using the generic
> > > > code) from ~3350ns to ~1000ns.
> > > > 
> > > > Running the 32bit algorithm on 64bit x86 takes ~1500ns.
> > > > It'll be slightly slower on a real 32bit system, mostly due
> > > > to register pressure.
> > > > 
> > > > The savings for 32bit x86 are much higher (tested in userspace).
> > > > The worst case (lots of bits in the quotient) drops from ~900 clocks
> > > > to ~130 (pretty much independent of the arguments).
> > > > Other 32bit architectures may see better savings.
> > > > 
> > > > It is possible to optimise for divisors that span less than
> > > > __LONG_WIDTH__/2 bits. However I suspect they don't happen that often
> > > > and it doesn't remove any slow cpu divide instructions which dominate
> > > > the result.
> > > > 
> > > > Signed-off-by: David Laight <david.laight.linux@gmail.com>  
> > > 
> > > Nice work. I had to be fully awake to review this one.
> > > Some suggestions below.  
> > 
> > Here's a patch with my suggestions applied to make it easier to figure 
> > them out. The added "inline" is necessary to fix compilation on ARM32. 
> > The "likely()" makes for better assembly and this part is pretty much 
> > likely anyway. I've explained the rest previously, although this is a 
> > better implementation.
> 
> I've been trying to find time to do some more performance measurements.
> I did a few last night and managed to realise at least some of what
> is happening.
> 
> I've dropped the responses in here since there is a copy of the code.
> 
> The 'TLDR' is that the full divide takes my zen5 about 80 clocks
> (including some test red tape) provided the branch-predictor is
> predicting correctly.
> I get that consistently for repeats of the same division, and moving
> onto the next one - provided they take the same path.
> (eg for the last few 'random' value tests.)
> However if I include something that takes a different path (eg divisor
> doesn't have the top bit set, or the product has enough high zero bits
> to take the 'optimised' path then the first two or three repeats of the
> next division are at least 20 clocks slower.
> In the kernel these calls aren't going to be common enough (and with
> similar inputs) to train the branch predictor.
> So the mispredicted branches really start to dominate.
> This is similar to the issue that Linus keeps mentioning about kernel
> code being mainly 'cold cache'.
> 
> > 
> > commit 99ea338401f03efe5dbebe57e62bd7c588409c5c
> > Author: Nicolas Pitre <nico@fluxnic.net>
> > Date:   Tue Jun 17 14:42:34 2025 -0400
> > 
> >     fixup! lib: mul_u64_u64_div_u64() Optimise the divide code
> > 
> > diff --git a/lib/math/div64.c b/lib/math/div64.c
> > index 3c9fe878ce68..740e59a58530 100644
> > --- a/lib/math/div64.c
> > +++ b/lib/math/div64.c
> > @@ -188,7 +188,7 @@ EXPORT_SYMBOL(iter_div_u64_rem);
> >  
> >  #if !defined(mul_u64_add_u64_div_u64) || defined(test_mul_u64_add_u64_div_u64)
> >  
> > -static u64 mul_add(u32 a, u32 b, u32 c)
> > +static inline u64 mul_add(u32 a, u32 b, u32 c)
> 
> If inline is needed, it need to be always_inline.

Here "inline" was needed only because I got a "defined but unused" 
complaint from the compiler in some cases. The "always_inline" is not 
absolutely necessary.

> >  {
> >  	return add_u64_u32(mul_u32_u32(a, b), c);
> >  }
> > @@ -246,7 +246,7 @@ static inline u32 mul_u64_long_add_u64(u64 *p_lo, u64 a, u32 b, u64 c)
> >  
> >  u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
> >  {
> > -	unsigned long d_msig, q_digit;
> > +	unsigned long n_long, d_msig, q_digit;
> >  	unsigned int reps, d_z_hi;
> >  	u64 quotient, n_lo, n_hi;
> >  	u32 overflow;
> > @@ -271,36 +271,21 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
> >  
> >  	/* Left align the divisor, shifting the dividend to match */
> >  	d_z_hi = __builtin_clzll(d);
> > -	if (d_z_hi) {
> > +	if (likely(d_z_hi)) {
> 
> I don't think likely() helps much.
> But for 64bit this can be unconditional...

It helps if you look at how gcc orders the code. Without the likely, the 
normalization code is moved out of line and a conditional forward branch 
(predicted untaken by default) is used to reach it even though this code 
is quite likely. With likely() gcc inserts the normalization 
code in line and a conditional forward branch (predicted untaken by 
default) is used to skip over that code. Clearly we want the latter.

> >  		d <<= d_z_hi;
> >  		n_hi = n_hi << d_z_hi | n_lo >> (64 - d_z_hi);
> 
> ... replace 'n_lo >> (64 - d_z_hi)' with 'n_lo >> (63 - d_z_hi) >> 1'.

Why? I don't see the point.

> The extra shift probably costs too much on 32bit though.
> 
> 
> >  		n_lo <<= d_z_hi;
> >  	}
> >  
> > -	reps = 64 / BITS_PER_ITER;
> > -	/* Optimise loop count for small dividends */
> > -	if (!(u32)(n_hi >> 32)) {
> > -		reps -= 32 / BITS_PER_ITER;
> > -		n_hi = n_hi << 32 | n_lo >> 32;
> > -		n_lo <<= 32;
> > -	}
> > -#if BITS_PER_ITER == 16
> > -	if (!(u32)(n_hi >> 48)) {
> > -		reps--;
> > -		n_hi = add_u64_u32(n_hi << 16, n_lo >> 48);
> > -		n_lo <<= 16;
> > -	}
> > -#endif
> 
> You don't want the branch in the loop.
> Once predicted correctly is makes little difference where you put
> the test - but that won't be 'normal'
> Unless it gets trained odd-even it'll get mispredicted at least once.
> For 64bit (zen5) the test outside the loop was less bad
> (IIRC dropped to 65 clocks - so probably worth it).
> I need to check an older cpu (got a few Intel ones).
> 
> 
> > -
> >  	/* Invert the dividend so we can use add instead of subtract. */
> >  	n_lo = ~n_lo;
> >  	n_hi = ~n_hi;
> >  
> >  	/*
> > -	 * Get the most significant BITS_PER_ITER bits of the divisor.
> > +	 * Get the rounded-up most significant BITS_PER_ITER bits of the divisor.
> >  	 * This is used to get a low 'guestimate' of the quotient digit.
> >  	 */
> > -	d_msig = (d >> (64 - BITS_PER_ITER)) + 1;
> > +	d_msig = (d >> (64 - BITS_PER_ITER)) + !!(d << BITS_PER_ITER);
> 
> For 64bit x64 that is pretty much zero cost
> (gcc manages to use the 'Z' flag from the '<< 32' in a 'setne' to get a 1
> I didn't look hard enough to find where the zero register came from
> since 'setne' of changes the low 8 bits).
> But it is horrid for 32bit - added quite a few clocks.
> I'm not at all sure it is worth it though.

If you do a hardcoded +1 when it is not necessary, in most of those 
cases you end up with one or two extra overflow adjustment loops. Those 
loops are unnecessary when d_msig is not increased by 1.
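
A rough worked example, assuming BITS_PER_ITER == 32: take
a = 0xffffffffffffffff, b = 0x4000000000000004, d = 0x8000000000000000
(so all of d's low bits are zero and the top 64 bits of the product are
0x4000000000000003). The first digit guess is then

        n_long = 0x4000000000000003
        n_long / 0x80000000 = 0x80000000        /* conditional round-up: exact */
        n_long / 0x80000001 = 0x7fffffff        /* unconditional +1: one low */

and the low guess is exactly what the "guestimate being low" loop has to
make up with an extra pass.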

> >  
> >  	/*
> >  	 * Now do a 'long division' with BITS_PER_ITER bit 'digits'.
> > @@ -308,12 +293,17 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
> >  	 * The worst case is dividing ~0 by 0x8000 which requires two subtracts.
> >  	 */
> >  	quotient = 0;
> > -	while (reps--) {
> > -		q_digit = (unsigned long)(~n_hi >> (64 - 2 * BITS_PER_ITER)) / d_msig;
> > +	for (reps = 64 / BITS_PER_ITER; reps; reps--) {
> > +		quotient <<= BITS_PER_ITER;
> > +		n_long = ~n_hi >> (64 - 2 * BITS_PER_ITER);
> >  		/* Shift 'n' left to align with the product q_digit * d */
> >  		overflow = n_hi >> (64 - BITS_PER_ITER);
> >  		n_hi = add_u64_u32(n_hi << BITS_PER_ITER, n_lo >> (64 - BITS_PER_ITER));
> >  		n_lo <<= BITS_PER_ITER;
> > +		/* cut it short if q_digit would be 0 */
> > +		if (n_long < d_msig)
> > +			continue;
> 
> As I said above that gets mispredicted too often
> 
> > +		q_digit = n_long / d_msig;
> 
> I fiddled with the order of the lines of code - it changes the way the object
> code is laid out and changes the clock count 'randomly' by one or two.
> Noise compared to mispredicted branches.
> 
> I've not tested on an older system with a slower divide.
> I suspect older (Haswell and Broadwell) may benefit from the divide
> being scheduled earlier.
> 
> If anyone wants a copy of the test program I can send you one.
> 
> 	David
> 
> >  		/* Add product to negated divisor */
> >  		overflow += mul_u64_long_add_u64(&n_hi, d, q_digit, n_hi);
> >  		/* Adjust for the q_digit 'guestimate' being low */
> > @@ -322,7 +312,7 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
> >  			n_hi += d;
> >  			overflow += n_hi < d;
> >  		}
> > -		quotient = add_u64_long(quotient << BITS_PER_ITER, q_digit);
> > +		quotient = add_u64_long(quotient, q_digit);
> >  	}
> >  
> >  	/*
> 
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code
  2025-06-14  9:53 ` [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code David Laight
  2025-06-17  4:16   ` Nicolas Pitre
@ 2025-07-09 14:24   ` David Laight
  2025-07-10  9:39     ` Reply: [????] " Li,Rongqing
  2025-07-11 21:17     ` David Laight
  1 sibling, 2 replies; 44+ messages in thread
From: David Laight @ 2025-07-09 14:24 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: u.kleine-koenig, Nicolas Pitre, Oleg Nesterov, Peter Zijlstra,
	Biju Das, rostedt, lirongqing

On Sat, 14 Jun 2025 10:53:45 +0100
David Laight <david.laight.linux@gmail.com> wrote:

> Replace the bit by bit algorithm with one that generates 16 bits
> per iteration on 32bit architectures and 32 bits on 64bit ones.

I've spent far too long doing some clock counting exercises on this code.
This is the latest version with some conditional compiles and comments
explaining the various optimisations.
I think the 'best' version is with -DMULDIV_OPT=0xc3

#ifndef MULDIV_OPT
#define MULDIV_OPT 0
#endif

#ifndef BITS_PER_ITER
#define BITS_PER_ITER (__LONG_WIDTH__ >= 64 ? 32 : 16)
#endif

#define unlikely(x)  __builtin_expect((x), 0)
#define likely(x)  __builtin_expect(!!(x), 1)

#if __LONG_WIDTH__ >= 64

/* gcc generates sane code for 64bit. */
static unsigned int tzcntll(unsigned long long x)
{
        return __builtin_ctzll(x);
}

static unsigned int lzcntll(unsigned long long x)
{
        return __builtin_clzll(x);
}
#else
        
/*
 * Assuming that bsf/bsr don't change the output register
 * when the input is zero (should be true now that the 486 isn't
 * supported) these simple conditional (and cmov) free functions
 * can be used to count trailing/leading zeros.
 */
static inline unsigned int tzcnt_z(u32 x, unsigned int if_z)
{               
        asm("bsfl %1,%0" : "+r" (if_z) : "r" (x));
        return if_z;
}

static inline unsigned int tzcntll(unsigned long long x)
{
        return tzcnt_z(x, 32 + tzcnt_z(x >> 32, 32));
}

static inline unsigned int bsr_z(u32 x, unsigned int if_z)
{
        asm("bsrl %1,%0" : "+r" (if_z) : "r" (x));
        return if_z;
}

static inline unsigned int bsrll(unsigned long long x)
{
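        /*
         * If the high word is zero, bsr_z() hands back its if_z argument
         * (the low word's bsr biased by -32), so the outer +32 gives the
         * low-word result; otherwise it is 32 + bsr of the high word.
         */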
        return 32 + bsr_z(x >> 32, bsr_z(x, -1) - 32);
}

static inline unsigned int lzcntll(unsigned long long x)
{
        return 63 - bsrll(x);
}

#endif

/*
 *  gcc (but not clang) makes a pigs-breakfast of mixed
 *  32/64 bit maths.
 */
#if !defined(__i386__) || defined(__clang__)
static u64 add_u64_u32(u64 a, u32 b)
{
        return a + b;
}

static inline u64 mul_u32_u32(u32 a, u32 b)
{
        return (u64)a * b;
}
#else
static u64 add_u64_u32(u64 a, u32 b)
{
        u32 hi = a >> 32, lo = a;

        asm ("addl %[b], %[lo]; adcl $0, %[hi]"
                         : [lo] "+r" (lo), [hi] "+r" (hi)
                         : [b] "rm" (b) );

        return (u64)hi << 32 | lo;
}

static inline u64 mul_u32_u32(u32 a, u32 b)
{
        u32 high, low;

        asm ("mull %[b]" : "=a" (low), "=d" (high)
                         : [a] "a" (a), [b] "rm" (b) );

        return low | ((u64)high) << 32;
}
#endif

static inline u64 mul_add(u32 a, u32 b, u32 c)
{
        return add_u64_u32(mul_u32_u32(a, b), c);
}

#if defined(__SIZEOF_INT128__)
typedef unsigned __int128 u128;
static inline u64 mul_u64_u64_add_u64(u64 *p_lo, u64 a, u64 b, u64 c)
{
        /* native 64x64=128 bits multiplication */
        u128 prod = (u128)a * b + c;

        *p_lo = prod;
        return prod >> 64;
}

#else
static inline u64 mul_u64_u64_add_u64(u64 *p_lo, u64 a, u64 b, u64 c)
{
        /* perform a 64x64=128 bits multiplication in 32bit chunks */
        u64 x, y, z;

        /* Since (x-1)*(x-1) + 2*(x-1) == x*x - 1, two u32 can be added to a u64 */
        x = mul_add(a, b, c);
        y = mul_add(a, b >> 32, c >> 32);
        y = add_u64_u32(y, x >> 32);
        z = mul_add(a >> 32, b >> 32, y >> 32);
        y = mul_add(a >> 32, b, y);
        *p_lo = (y << 32) + (u32)x;
        return add_u64_u32(z, y >> 32);
}

#endif

#if BITS_PER_ITER == 32
#define mul_u64_long_add_u64(p_lo, a, b, c) mul_u64_u64_add_u64(p_lo, a, b, c)
#define add_u64_long(a, b) ((a) + (b))

#else
static inline u32 mul_u64_long_add_u64(u64 *p_lo, u64 a, u32 b, u64 c)
{
        u64 n_lo = mul_add(a, b, c);
        u64 n_med = mul_add(a >> 32, b, c >> 32);

        n_med = add_u64_u32(n_med, n_lo >> 32);
        *p_lo = n_med << 32 | (u32)n_lo;
        return n_med >> 32;
}

#define add_u64_long(a, b) add_u64_u32(a, b)

#endif

#if MULDIV_OPT & 0x40
/*
 * If the divisor has BITS_PER_ITER or fewer bits then a simple
 * long division can be done.
 */
#if BITS_PER_ITER == 16
static u64 div_u80_u16(u32 n_hi, u64 n_lo, u32 d)
{
        u64 q = 0;

        if (n_hi) {
                n_hi = n_hi << 16 | (u32)(n_lo >> 48);
                q = (n_hi / d) << 16;
                n_hi = (n_hi % d) << 16 | (u16)(n_lo >> 32);
        } else {
                n_hi = n_lo >> 32;
                if (!n_hi)
                        return (u32)n_lo / d;
        }
        q |= n_hi / d;
        q <<= 32;
        n_hi = (n_hi % d) << 16 | ((u32)n_lo >> 16);
        q |= (n_hi / d) << 16;
        n_hi = (n_hi % d) << 16 | (u16)n_lo;
        q |= n_hi / d;

        return q;
}
#else
static u64 div_u96_u32(u64 n_hi, u64 n_lo, u32 d)
{
        u64 q;

        if (!n_hi)
                return n_lo / d;

        n_hi = n_hi << 32 | n_lo >> 32;
        q = (n_hi / d) << 32;
        n_hi = (n_hi % d) << 32 | (u32)n_lo;
        return q | n_hi / d;
}
#endif
#endif

u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
{
        unsigned long d_msig, n_long, q_digit;
        unsigned int reps, d_z_hi;
        u64 quotient, n_lo, n_hi;
        u32 overflow;

        n_hi = mul_u64_u64_add_u64(&n_lo, a, b, c);
        if (unlikely(n_hi >= d)) {
                if (!d)
                        // Quotient infinity or NaN
                        return 0;
                // Quotient larger than 64 bits
                return ~(u64)0;
        }

#if !(MULDIV_OPT & 0x80)
        // OPT: Small divisors can be optimised here.
        // OPT: But the test is measurable and a lot of the cases get
        // OPT: picked up by later tests - especially 1, 2 and 0x40.
        // OPT: For 32bit a full 64bit divide will also be non-trivial.
        if (unlikely(!n_hi))
                return div64_u64(n_lo, d);
#endif

        d_z_hi = lzcntll(d);

#if MULDIV_OPT & 0x40
        // Optimise for divisors with less than BITS_PER_ITER significant bits.
        // OPT: A much simpler 'long division' can be done.
        // OPT: The test could be reworked to avoid the tzcntll() when d_z_hi
        // OPT: is large enough - but the code starts looking horrid.
        // OPT: This picks up the same divisions as OPT 8, with a faster algorithm.
        u32 d_z_lo = tzcntll(d);
        if (d_z_hi + d_z_lo >= 64 - BITS_PER_ITER) {
                if (d_z_hi < 64 - BITS_PER_ITER) {
                        n_lo = n_lo >> d_z_lo | n_hi << (64 - d_z_lo);
                        n_hi >>= d_z_lo;
                        d >>= d_z_lo;
                }
#if BITS_PER_ITER == 16
                return div_u80_u16(n_hi, n_lo, d);
#else
                return div_u96_u32(n_hi, n_lo, d);
#endif
        }
#endif

        /* Left align the divisor, shifting the dividend to match */
#if MULDIV_OPT & 0x10
        // OPT: Replacing the 'pretty much always taken' branch
        // OPT: with an extra shift (one clock - should be noise)
        // OPT: feels like it ought to be a gain (for 64bit).
        // OPT: Most of the test cases have 64bit divisors - so lose,
        // OPT: but even some with a smaller divisor are hit for a few clocks.
        // OPT: Might be generating a register spill to stack.
        d <<= d_z_hi;
        n_hi = n_hi << d_z_hi | (n_lo >> (63 - d_z_hi) >> 1);
        n_lo <<= d_z_hi;
#else
        if (d_z_hi) {
                d <<= d_z_hi;
                n_hi = n_hi << d_z_hi | n_lo >> (64 - d_z_hi);
                n_lo <<= d_z_hi;
        }
#endif

        reps = 64 / BITS_PER_ITER;
        /* Optimise loop count for small dividends */
#if MULDIV_OPT & 1
        // OPT: Products with lots of leading zeros are almost certainly
        // OPT: very common.
        // OPT: The gain from removing the loop iterations is significant.
        // OPT: Especially on 32bit where two iterations can be removed
        // OPT: with a simple shift and conditional jump.
        if (!(u32)(n_hi >> 32)) {
                reps -= 32 / BITS_PER_ITER;
                n_hi = n_hi << 32 | n_lo >> 32;
                n_lo <<= 32;
        }
#endif
#if MULDIV_OPT & 2 && BITS_PER_ITER == 16
        if (!(u32)(n_hi >> 48)) {
                reps--;
                n_hi = add_u64_u32(n_hi << 16, n_lo >> 48);
                n_lo <<= 16;
        }
#endif

        /*
         * Get the most significant BITS_PER_ITER bits of the divisor.
         * This is used to get a low 'guestimate' of the quotient digit.
         */
        d_msig = (d >> (64 - BITS_PER_ITER));
#if MULDIV_OPT & 8
        // OPT: d_msig only needs rounding up - so can be unchanged if
        // OPT: all its low bits are zero.
        // OPT: However the test itself causes register pressure on x86-32.
        // OPT: The earlier check (0x40) optimises the same cases.
        // OPT: The code it generates is a lot better.
        d_msig += !!(d << BITS_PER_ITER);
#else
        d_msig += 1;
#endif


        /* Invert the dividend so we can use add instead of subtract. */
        n_lo = ~n_lo;
        n_hi = ~n_hi;

        /*
         * Now do a 'long division' with BITS_PER_ITER bit 'digits'.
         * The 'guess' quotient digit can be low and BITS_PER_ITER+1 bits.
         * The worst case is dividing ~0 by 0x8000 which requires two subtracts.
         */
        quotient = 0;
        while (reps--) {
                n_long = ~n_hi >> (64 - 2 * BITS_PER_ITER);
#if !(MULDIV_OPT & 0x20)
                // OPT: If the cpu divide instruction has a long latency and other
                // OPT: instructions can execute while the divide is pending, then
                // OPT: you want the divide as early as possible.
                // OPT: I've seen delays if it is moved below the shifts, but I suspect
                // OPT: the ivy bridge cpu spreads the u-ops between the execution
                // OPT: units so you don't get the full latency to play with.
                // OPT: gcc doesn't put the divide as early as it might, attempting to
                // OPT: do so by hand failed - and I'm not playing with custom asm.
                q_digit = n_long / d_msig;
#endif
                /* Shift 'n' left to align with the product q_digit * d */
                overflow = n_hi >> (64 - BITS_PER_ITER);
                n_hi = add_u64_u32(n_hi << BITS_PER_ITER, n_lo >> (64 - BITS_PER_ITER));
                n_lo <<= BITS_PER_ITER;
                quotient <<= BITS_PER_ITER;
#if MULDIV_OPT & 4
                // OPT: This optimises for zero digits.
                // OPT: With the compiler/cpu I'm using today (gcc 13.3 and Sandy bridge)
                // OPT: it needs the divide moved below the conditional.
                // OPT: For x86-64 0x24 and 0x03 are actually pretty similar,
                // OPT: but x86-32 is definitely slower all the time, and the outer
                // OPT: check removes two loop iterations at once.
                if (unlikely(n_long < d_msig)) {
                        // OPT: Without something here the 'unlikely' still generates
                        // OPT: a conditional backwards branch which some cpu will
                        // OPT: statically predict taken.
                        // asm( "nop");
                        continue;
                }
#endif
#if MULDIV_OPT & 0x20
                q_digit = n_long / d_msig;
#endif
                /* Add product to negated divisor */
                overflow += mul_u64_long_add_u64(&n_hi, d, q_digit, n_hi);
                /* Adjust for the q_digit 'guestimate' being low */
                while (unlikely(overflow < 0xffffffff >> (32 - BITS_PER_ITER))) {
                        q_digit++;
                        n_hi += d;
                        overflow += n_hi < d;
                }
                quotient = add_u64_long(quotient, q_digit);
        }

        /*
         * The above only ensures the remainder doesn't overflow,
         * it can still be possible to add (aka subtract) another copy
         * of the divisor.
         */
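        /*
         * (n_hi holds the inverted remainder at this point, so
         * 'n_hi + d does not wrap' is the same test as 'remainder >= d'.)
         */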
        if ((n_hi + d) > n_hi)
                quotient++;
        return quotient;
}

Some measurements on an ivy bridge.
These are the test vectors from the test module with a few extra values on the
end that pick different paths through this implementation.
The numbers are 'performance counter' deltas for 10 consecutive calls with the
same values.
So the latter values are with the branch predictor 'trained' to the test case.
The first few (larger) values show the cost of mispredicted branches.
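
(The harness itself isn't included above; a minimal sketch of the kind of
per-call delta being measured - using rdtsc as a stand-in for whatever
counter div_perf.c actually reads, and assuming the u64 typedef that goes
with the code above - looks something like this:)

#include <stdio.h>
#include <x86intrin.h>

static unsigned long long time_one(u64 a, u64 b, u64 d)
{
        unsigned long long t0 = __rdtsc();      /* stand-in for the real counter read */
        volatile u64 q = mul_u64_add_u64_div_u64(a, b, 0, d);

        (void)q;
        return __rdtsc() - t0;
}

int main(void)
{
        /* 10 consecutive calls with the same values, as in the tables below */
        for (int i = 0; i < 10; i++)
                printf(" %5llu", time_one(0xe6102d256d7ea3aeULL,
                                          0x70a77d0be4c31201ULL,
                                          0xd63ec35ab3220357ULL));
        printf("\n");
        return 0;
}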

Apologies for the very long lines.

$ cc -O2 -o div_perf div_perf.c -DMULDIV_OPT=0xc3  && sudo ./div_perf
0: ok   162   134    78    78    78    78    78    80    80    80 mul_u64_u64_div_u64_new b*7/3 = 19
1: ok    91    91    91    91    91    91    91    91    91    91 mul_u64_u64_div_u64_new ffff0000*ffff0000/f = 1110eeef00000000
2: ok    75    77    75    77    77    77    77    77    77    77 mul_u64_u64_div_u64_new ffffffff*ffffffff/1 = fffffffe00000001
3: ok    89    91    91    91    91    91    89    90    91    91 mul_u64_u64_div_u64_new ffffffff*ffffffff/2 = 7fffffff00000000
4: ok   147   147   128   128   128   128   128   128   128   128 mul_u64_u64_div_u64_new 1ffffffff*ffffffff/2 = fffffffe80000000
5: ok   128   128   128   128   128   128   128   128   128   128 mul_u64_u64_div_u64_new 1ffffffff*ffffffff/3 = aaaaaaa9aaaaaaab
6: ok   121   121   121   121   121   121   121   121   121   121 mul_u64_u64_div_u64_new 1ffffffff*1ffffffff/4 = ffffffff00000000
7: ok   274   234   146   138   138   138   138   138   138   138 mul_u64_u64_div_u64_new ffff000000000000*ffff000000000000/ffff000000000001 = fffeffffffffffff
8: ok   177   148   148   149   149   149   149   149   149   149 mul_u64_u64_div_u64_new 3333333333333333*3333333333333333/5555555555555555 = 1eb851eb851eb851
9: ok   138    90   118    91    91    91    91    92    92    92 mul_u64_u64_div_u64_new 7fffffffffffffff*2/3 = 5555555555555554
10: ok   113   114    86    86    84    86    86    84    87    87 mul_u64_u64_div_u64_new ffffffffffffffff*2/8000000000000000 = 3
11: ok    87    88    88    86    88    88    88    88    90    90 mul_u64_u64_div_u64_new ffffffffffffffff*2/c000000000000000 = 2
12: ok    82    86    84    86    83    86    83    86    83    87 mul_u64_u64_div_u64_new ffffffffffffffff*4000000000000004/8000000000000000 = 8000000000000007
13: ok    82    86    84    86    83    86    83    86    83    86 mul_u64_u64_div_u64_new ffffffffffffffff*4000000000000001/8000000000000000 = 8000000000000001
14: ok   189   187   138   132   132   132   131   131   131   131 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/ffffffffffffffff = 8000000000000001
15: ok   221   175   159   131   131   131   131   131   131   131 mul_u64_u64_div_u64_new fffffffffffffffe*8000000000000001/ffffffffffffffff = 8000000000000000
16: ok   134   132   134   134   134   135   134   134   134   134 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/fffffffffffffffe = 8000000000000001
17: ok   172   134   137   134   134   134   134   134   134   134 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/fffffffffffffffd = 8000000000000002
18: ok   182   182   129   129   129   129   129   129   129   129 mul_u64_u64_div_u64_new 7fffffffffffffff*ffffffffffffffff/c000000000000000 = aaaaaaaaaaaaaaa8
19: ok   130   129   130   129   129   129   129   129   129   129 mul_u64_u64_div_u64_new ffffffffffffffff*7fffffffffffffff/a000000000000000 = ccccccccccccccca
20: ok   130   129   129   129   129   129   129   129   129   129 mul_u64_u64_div_u64_new ffffffffffffffff*7fffffffffffffff/9000000000000000 = e38e38e38e38e38b
21: ok   130   129   129   129   129   129   129   129   129   129 mul_u64_u64_div_u64_new 7fffffffffffffff*7fffffffffffffff/5000000000000000 = ccccccccccccccc9
22: ok   206   140   138   138   138   138   138   138   138   138 mul_u64_u64_div_u64_new ffffffffffffffff*fffffffffffffffe/ffffffffffffffff = fffffffffffffffe
23: ok   174   140   138   138   138   138   138   138   138   138 mul_u64_u64_div_u64_new e6102d256d7ea3ae*70a77d0be4c31201/d63ec35ab3220357 = 78f8bf8cc86c6e18
24: ok   135   137   137   137   137   137   137   137   137   137 mul_u64_u64_div_u64_new f53bae05cb86c6e1*3847b32d2f8d32e0/cfd4f55a647f403c = 42687f79d8998d35
25: ok   134   136   136   136   136   136   136   136   136   136 mul_u64_u64_div_u64_new 9951c5498f941092*1f8c8bfdf287a251/a3c8dc5f81ea3fe2 = 1d887cb25900091f
26: ok   136   134   134   134   134   134   134   134   134   134 mul_u64_u64_div_u64_new 374fee9daa1bb2bb*d0bfbff7b8ae3ef/c169337bd42d5179 = 3bb2dbaffcbb961
27: ok   139   138   138   138   138   138   138   138   138   138 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
28: ok   130   143    95    95    96    96    96    96    96    96 mul_u64_u64_div_u64_new 2d256d7ea3ae*7d0be4c31201/d63ec35ab3220357 = 1a599d6e
29: ok   169   158   158   138   138   138   138   138   138   138 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
30: ok   178   164   144   147   147   147   147   147   147   147 mul_u64_u64_div_u64_new 2d256d7ea3ae*7d0be4c31201/63ec35ab3220357 = 387f55cef
31: ok   163   128   128   128   128   128   128   128   128   128 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb000000000000 = dc5f7e8b334db07d
32: ok   163   184   137   136   136   138   138   138   138   138 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216

$ cc -O2 -o div_perf div_perf.c -DMULDIV_OPT=0x03  && sudo ./div_perf
0: ok   125    78    78    79    79    79    79    79    79    79 mul_u64_u64_div_u64_new b*7/3 = 19
1: ok    88    89    89    88    89    89    89    89    89    89 mul_u64_u64_div_u64_new ffff0000*ffff0000/f = 1110eeef00000000
2: ok    75    76    76    76    76    76    74    76    76    76 mul_u64_u64_div_u64_new ffffffff*ffffffff/1 = fffffffe00000001
3: ok    87    89    89    89    89    89    89    88    88    88 mul_u64_u64_div_u64_new ffffffff*ffffffff/2 = 7fffffff00000000
4: ok   305   221   148   144   147   147   147   147   147   147 mul_u64_u64_div_u64_new 1ffffffff*ffffffff/2 = fffffffe80000000
5: ok   179   178   141   141   141   141   141   141   141   141 mul_u64_u64_div_u64_new 1ffffffff*ffffffff/3 = aaaaaaa9aaaaaaab
6: ok   148   200   143   145   145   145   145   145   145   145 mul_u64_u64_div_u64_new 1ffffffff*1ffffffff/4 = ffffffff00000000
7: ok   201   186   140   135   135   135   135   135   135   135 mul_u64_u64_div_u64_new ffff000000000000*ffff000000000000/ffff000000000001 = fffeffffffffffff
8: ok   227   154   145   141   141   141   141   141   141   141 mul_u64_u64_div_u64_new 3333333333333333*3333333333333333/5555555555555555 = 1eb851eb851eb851
9: ok   111   111    89    89    89    89    89    89    89    89 mul_u64_u64_div_u64_new 7fffffffffffffff*2/3 = 5555555555555554
10: ok   149   156   124    90    90    90    90    90    90    90 mul_u64_u64_div_u64_new ffffffffffffffff*2/8000000000000000 = 3
11: ok    91    91    90    90    90    90    90    90    90    90 mul_u64_u64_div_u64_new ffffffffffffffff*2/c000000000000000 = 2
12: ok   197   197   138   134   134   134   134   134   134   134 mul_u64_u64_div_u64_new ffffffffffffffff*4000000000000004/8000000000000000 = 8000000000000007
13: ok   260   136   136   135   135   135   135   135   135   135 mul_u64_u64_div_u64_new ffffffffffffffff*4000000000000001/8000000000000000 = 8000000000000001
14: ok   186   187   164   130   127   127   127   127   127   127 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/ffffffffffffffff = 8000000000000001
15: ok   171   172   173   158   160   128   125   127   125   127 mul_u64_u64_div_u64_new fffffffffffffffe*8000000000000001/ffffffffffffffff = 8000000000000000
16: ok   157   164   129   130   130   130   130   130   130   130 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/fffffffffffffffe = 8000000000000001
17: ok   191   158   130   132   132   130   130   130   130   130 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/fffffffffffffffd = 8000000000000002
18: ok   197   214   163   138   138   138   138   138   138   138 mul_u64_u64_div_u64_new 7fffffffffffffff*ffffffffffffffff/c000000000000000 = aaaaaaaaaaaaaaa8
19: ok   196   196   135   134   134   134   134   134   134   134 mul_u64_u64_div_u64_new ffffffffffffffff*7fffffffffffffff/a000000000000000 = ccccccccccccccca
20: ok   191   216   176   140   138   138   138   138   138   138 mul_u64_u64_div_u64_new ffffffffffffffff*7fffffffffffffff/9000000000000000 = e38e38e38e38e38b
21: ok   232   157   156   145   145   145   145   145   145   145 mul_u64_u64_div_u64_new 7fffffffffffffff*7fffffffffffffff/5000000000000000 = ccccccccccccccc9
22: ok   159   192   133   134   134   134   134   134   134   134 mul_u64_u64_div_u64_new ffffffffffffffff*fffffffffffffffe/ffffffffffffffff = fffffffffffffffe
23: ok   133   134   134   134   134   134   134   134   134   134 mul_u64_u64_div_u64_new e6102d256d7ea3ae*70a77d0be4c31201/d63ec35ab3220357 = 78f8bf8cc86c6e18
24: ok   134   131   131   131   131   134   131   131   131   131 mul_u64_u64_div_u64_new f53bae05cb86c6e1*3847b32d2f8d32e0/cfd4f55a647f403c = 42687f79d8998d35
25: ok   133   130   130   130   130   130   130   130   130   130 mul_u64_u64_div_u64_new 9951c5498f941092*1f8c8bfdf287a251/a3c8dc5f81ea3fe2 = 1d887cb25900091f
26: ok   133   130   130   130   130   130   130   130   130   131 mul_u64_u64_div_u64_new 374fee9daa1bb2bb*d0bfbff7b8ae3ef/c169337bd42d5179 = 3bb2dbaffcbb961
27: ok   133   134   134   134   134   134   134   134   134   133 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
28: ok   151    93   119    93    93    93    93    93    93    93 mul_u64_u64_div_u64_new 2d256d7ea3ae*7d0be4c31201/d63ec35ab3220357 = 1a599d6e
29: ok   193   137   134   134   134   134   134   134   134   134 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
30: ok   194   151   150   137   137   137   137   137   137   137 mul_u64_u64_div_u64_new 2d256d7ea3ae*7d0be4c31201/63ec35ab3220357 = 387f55cef
31: ok   137   173   172   137   138   138   138   138   138   138 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb000000000000 = dc5f7e8b334db07d
32: ok   160   149   131   134   134   134   134   134   134   134 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216

$ cc -O2 -o div_perf div_perf.c -DMULDIV_OPT=0x24  && sudo ./div_perf
0: ok   130   106    79    79    78    78    78    78    81    81 mul_u64_u64_div_u64_new b*7/3 = 19
1: ok    88    92    92    89    92    92    92    92    91    91 mul_u64_u64_div_u64_new ffff0000*ffff0000/f = 1110eeef00000000
2: ok    74    78    78    78    78    78    75    79    78    78 mul_u64_u64_div_u64_new ffffffff*ffffffff/1 = fffffffe00000001
3: ok    87    92    92    92    92    92    92    92    92    92 mul_u64_u64_div_u64_new ffffffff*ffffffff/2 = 7fffffff00000000
4: ok   330   275   181   145   147   148   148   148   148   148 mul_u64_u64_div_u64_new 1ffffffff*ffffffff/2 = fffffffe80000000
5: ok   225   175   141   146   146   146   146   146   146   146 mul_u64_u64_div_u64_new 1ffffffff*ffffffff/3 = aaaaaaa9aaaaaaab
6: ok   187   194   193   194   178   144   148   148   148   148 mul_u64_u64_div_u64_new 1ffffffff*1ffffffff/4 = ffffffff00000000
7: ok   202   189   178   139   140   139   140   139   140   139 mul_u64_u64_div_u64_new ffff000000000000*ffff000000000000/ffff000000000001 = fffeffffffffffff
8: ok   228   168   150   143   143   143   143   143   143   143 mul_u64_u64_div_u64_new 3333333333333333*3333333333333333/5555555555555555 = 1eb851eb851eb851
9: ok   112   112    92    89    92    92    92    92    92    87 mul_u64_u64_div_u64_new 7fffffffffffffff*2/3 = 5555555555555554
10: ok   153   184    92    93    95    95    95    95    95    95 mul_u64_u64_div_u64_new ffffffffffffffff*2/8000000000000000 = 3
11: ok   158    93    93    93    95    92    95    95    95    95 mul_u64_u64_div_u64_new ffffffffffffffff*2/c000000000000000 = 2
12: ok   206   178   146   139   139   140   139   140   139   140 mul_u64_u64_div_u64_new ffffffffffffffff*4000000000000004/8000000000000000 = 8000000000000007
13: ok   187   146   140   139   140   139   140   139   140   139 mul_u64_u64_div_u64_new ffffffffffffffff*4000000000000001/8000000000000000 = 8000000000000001
14: ok   167   173   136   136   102    97    97    97    97    97 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/ffffffffffffffff = 8000000000000001
15: ok   166   150   105    98    98    98    97    99    97    99 mul_u64_u64_div_u64_new fffffffffffffffe*8000000000000001/ffffffffffffffff = 8000000000000000
16: ok   209   197   139   136   136   136   136   136   136   136 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/fffffffffffffffe = 8000000000000001
17: ok   170   142   170   137   136   135   136   135   136   135 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/fffffffffffffffd = 8000000000000002
18: ok   238   197   172   140   140   140   139   141   140   139 mul_u64_u64_div_u64_new 7fffffffffffffff*ffffffffffffffff/c000000000000000 = aaaaaaaaaaaaaaa8
19: ok   185   206   139   141   142   142   142   142   142   142 mul_u64_u64_div_u64_new ffffffffffffffff*7fffffffffffffff/a000000000000000 = ccccccccccccccca
20: ok   207   226   146   142   140   140   140   140   140   140 mul_u64_u64_div_u64_new ffffffffffffffff*7fffffffffffffff/9000000000000000 = e38e38e38e38e38b
21: ok   161   153   148   148   148   148   148   148   148   146 mul_u64_u64_div_u64_new 7fffffffffffffff*7fffffffffffffff/5000000000000000 = ccccccccccccccc9
22: ok   172   199   141   140   140   140   140   140   140   140 mul_u64_u64_div_u64_new ffffffffffffffff*fffffffffffffffe/ffffffffffffffff = fffffffffffffffe
23: ok   172   137   140   140   140   140   140   140   140   140 mul_u64_u64_div_u64_new e6102d256d7ea3ae*70a77d0be4c31201/d63ec35ab3220357 = 78f8bf8cc86c6e18
24: ok   135   138   138   138   138   138   138   138   138   138 mul_u64_u64_div_u64_new f53bae05cb86c6e1*3847b32d2f8d32e0/cfd4f55a647f403c = 42687f79d8998d35
25: ok   136   137   137   137   137   137   137   137   137   137 mul_u64_u64_div_u64_new 9951c5498f941092*1f8c8bfdf287a251/a3c8dc5f81ea3fe2 = 1d887cb25900091f
26: ok   136   136   136   136   136   136   136   136   136   136 mul_u64_u64_div_u64_new 374fee9daa1bb2bb*d0bfbff7b8ae3ef/c169337bd42d5179 = 3bb2dbaffcbb961
27: ok   139   140   140   140   140   140   140   140   140   140 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
28: ok   132    94    97    96    96    96    96    96    96    96 mul_u64_u64_div_u64_new 2d256d7ea3ae*7d0be4c31201/d63ec35ab3220357 = 1a599d6e
29: ok   139   140   140   140   140   140   140   140   140   140 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
30: ok   159   141   141   141   141   141   141   141   141   141 mul_u64_u64_div_u64_new 2d256d7ea3ae*7d0be4c31201/63ec35ab3220357 = 387f55cef
31: ok   244   186   154   141   139   141   140   139   141   140 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb000000000000 = dc5f7e8b334db07d
32: ok   165   140   140   140   140   140   140   140   140   140 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216

$ cc -O2 -o div_perf div_perf.c -DMULDIV_OPT=0xc3 -m32 && sudo ./div_perf
0: ok   336   131   104   104   102   102   103   103   103   105 mul_u64_u64_div_u64_new b*7/3 = 19
1: ok   195   195   171   171   171   171   171   171   171   171 mul_u64_u64_div_u64_new ffff0000*ffff0000/f = 1110eeef00000000
2: ok   165   162   162   162   162   162   162   162   162   162 mul_u64_u64_div_u64_new ffffffff*ffffffff/1 = fffffffe00000001
3: ok   165   173   173   173   173   173   173   173   173   173 mul_u64_u64_div_u64_new ffffffff*ffffffff/2 = 7fffffff00000000
4: ok   220   216   202   206   202   206   202   206   202   206 mul_u64_u64_div_u64_new 1ffffffff*ffffffff/2 = fffffffe80000000
5: ok   203   205   208   205   209   205   209   205   209   205 mul_u64_u64_div_u64_new 1ffffffff*ffffffff/3 = aaaaaaa9aaaaaaab
6: ok   203   203   199   203   199   203   199   203   199   203 mul_u64_u64_div_u64_new 1ffffffff*1ffffffff/4 = ffffffff00000000
7: ok   574   421   251   242   246   254   246   254   246   254 mul_u64_u64_div_u64_new ffff000000000000*ffff000000000000/ffff000000000001 = fffeffffffffffff
8: ok   370   317   263   262   254   259   258   262   254   259 mul_u64_u64_div_u64_new 3333333333333333*3333333333333333/5555555555555555 = 1eb851eb851eb851
9: ok   214   199   177   177   177   177   177   177   177   177 mul_u64_u64_div_u64_new 7fffffffffffffff*2/3 = 5555555555555554
10: ok   163   150   128   129   129   129   129   129   129   129 mul_u64_u64_div_u64_new ffffffffffffffff*2/8000000000000000 = 3
11: ok   135   133   135   135   135   135   135   135   135   135 mul_u64_u64_div_u64_new ffffffffffffffff*2/c000000000000000 = 2
12: ok   206   208   208   180   183   183   183   183   183   183 mul_u64_u64_div_u64_new ffffffffffffffff*4000000000000004/8000000000000000 = 8000000000000007
13: ok   205   183   183   184   183   183   183   183   183   183 mul_u64_u64_div_u64_new ffffffffffffffff*4000000000000001/8000000000000000 = 8000000000000001
14: ok   331   296   238   229   232   232   232   232   232   232 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/ffffffffffffffff = 8000000000000001
15: ok   324   311   238   239   239   239   239   239   239   242 mul_u64_u64_div_u64_new fffffffffffffffe*8000000000000001/ffffffffffffffff = 8000000000000000
16: ok   300   262   265   233   238   234   238   234   238   234 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/fffffffffffffffe = 8000000000000001
17: ok   295   282   244   244   244   244   244   244   244   244 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/fffffffffffffffd = 8000000000000002
18: ok   245   247   235   222   221   222   219   222   221   222 mul_u64_u64_div_u64_new 7fffffffffffffff*ffffffffffffffff/c000000000000000 = aaaaaaaaaaaaaaa8
19: ok   235   221   222   221   222   219   222   221   222   219 mul_u64_u64_div_u64_new ffffffffffffffff*7fffffffffffffff/a000000000000000 = ccccccccccccccca
20: ok   220   222   221   222   219   222   219   222   221   222 mul_u64_u64_div_u64_new ffffffffffffffff*7fffffffffffffff/9000000000000000 = e38e38e38e38e38b
21: ok   225   221   222   219   221   219   221   219   221   219 mul_u64_u64_div_u64_new 7fffffffffffffff*7fffffffffffffff/5000000000000000 = ccccccccccccccc9
22: ok   309   274   250   238   237   244   239   241   237   244 mul_u64_u64_div_u64_new ffffffffffffffff*fffffffffffffffe/ffffffffffffffff = fffffffffffffffe
23: ok   284   284   250   247   243   247   247   243   247   247 mul_u64_u64_div_u64_new e6102d256d7ea3ae*70a77d0be4c31201/d63ec35ab3220357 = 78f8bf8cc86c6e18
24: ok   239   239   243   240   239   239   239   239   239   239 mul_u64_u64_div_u64_new f53bae05cb86c6e1*3847b32d2f8d32e0/cfd4f55a647f403c = 42687f79d8998d35
25: ok   255   255   255   255   247   243   247   247   243   247 mul_u64_u64_div_u64_new 9951c5498f941092*1f8c8bfdf287a251/a3c8dc5f81ea3fe2 = 1d887cb25900091f
26: ok   327   274   240   242   240   242   240   242   240   242 mul_u64_u64_div_u64_new 374fee9daa1bb2bb*d0bfbff7b8ae3ef/c169337bd42d5179 = 3bb2dbaffcbb961
27: ok   461   313   340   259   257   259   257   259   257   259 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
28: ok   291   219   180   174   170   173   171   174   171   174 mul_u64_u64_div_u64_new 2d256d7ea3ae*7d0be4c31201/d63ec35ab3220357 = 1a599d6e
29: ok   280   319   258   257   259   257   259   257   259   257 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
30: ok   379   352   247   239   220   225   225   229   228   229 mul_u64_u64_div_u64_new 2d256d7ea3ae*7d0be4c31201/63ec35ab3220357 = 387f55cef
31: ok   235   219   221   219   221   219   221   219   221   219 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb000000000000 = dc5f7e8b334db07d
32: ok   305   263   257   257   258   258   263   257   257   257 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216

$ cc -O2 -o div_perf div_perf.c -DMULDIV_OPT=0x03 -m32 && sudo ./div_perf
0: ok   292   127   129   125   123   128   125   123   125   121 mul_u64_u64_div_u64_new b*7/3 = 19
1: ok   190   196   151   149   151   149   151   149   151   149 mul_u64_u64_div_u64_new ffff0000*ffff0000/f = 1110eeef00000000
2: ok   141   139   139   139   139   139   139   139   139   139 mul_u64_u64_div_u64_new ffffffff*ffffffff/1 = fffffffe00000001
3: ok   152   149   151   149   151   149   151   149   151   149 mul_u64_u64_div_u64_new ffffffff*ffffffff/2 = 7fffffff00000000
4: ok   667   453   276   270   271   271   271   267   274   272 mul_u64_u64_div_u64_new 1ffffffff*ffffffff/2 = fffffffe80000000
5: ok   366   319   373   337   278   278   278   278   278   278 mul_u64_u64_div_u64_new 1ffffffff*ffffffff/3 = aaaaaaa9aaaaaaab
6: ok   380   349   268   268   271   265   265   265   265   265 mul_u64_u64_div_u64_new 1ffffffff*1ffffffff/4 = ffffffff00000000
7: ok   340   277   255   251   249   255   251   249   255   251 mul_u64_u64_div_u64_new ffff000000000000*ffff000000000000/ffff000000000001 = fffeffffffffffff
8: ok   377   302   253   252   256   256   256   256   256   256 mul_u64_u64_div_u64_new 3333333333333333*3333333333333333/5555555555555555 = 1eb851eb851eb851
9: ok   181   184   157   155   157   155   157   155   157   155 mul_u64_u64_div_u64_new 7fffffffffffffff*2/3 = 5555555555555554
10: ok   304   223   142   139   139   141   139   139   139   139 mul_u64_u64_div_u64_new ffffffffffffffff*2/8000000000000000 = 3
11: ok   153   143   148   142   143   142   143   142   143   142 mul_u64_u64_div_u64_new ffffffffffffffff*2/c000000000000000 = 2
12: ok   428   323   292   246   257   248   253   250   248   253 mul_u64_u64_div_u64_new ffffffffffffffff*4000000000000004/8000000000000000 = 8000000000000007
13: ok   289   256   260   257   255   257   255   257   255   257 mul_u64_u64_div_u64_new ffffffffffffffff*4000000000000001/8000000000000000 = 8000000000000001
14: ok   334   246   234   234   227   233   229   233   229   233 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/ffffffffffffffff = 8000000000000001
15: ok   324   302   273   236   236   236   236   236   236   236 mul_u64_u64_div_u64_new fffffffffffffffe*8000000000000001/ffffffffffffffff = 8000000000000000
16: ok   269   328   285   259   232   230   236   232   232   236 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/fffffffffffffffe = 8000000000000001
17: ok   307   329   330   244   247   246   245   245   245   245 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/fffffffffffffffd = 8000000000000002
18: ok   359   361   324   258   258   258   258   258   258   258 mul_u64_u64_div_u64_new 7fffffffffffffff*ffffffffffffffff/c000000000000000 = aaaaaaaaaaaaaaa8
19: ok   347   325   295   258   260   253   251   251   251   251 mul_u64_u64_div_u64_new ffffffffffffffff*7fffffffffffffff/a000000000000000 = ccccccccccccccca
20: ok   339   312   261   260   255   255   255   255   255   255 mul_u64_u64_div_u64_new ffffffffffffffff*7fffffffffffffff/9000000000000000 = e38e38e38e38e38b
21: ok   411   349   333   276   272   272   272   272   272   272 mul_u64_u64_div_u64_new 7fffffffffffffff*7fffffffffffffff/5000000000000000 = ccccccccccccccc9
22: ok   297   330   290   266   239   236   238   237   238   237 mul_u64_u64_div_u64_new ffffffffffffffff*fffffffffffffffe/ffffffffffffffff = fffffffffffffffe
23: ok   299   311   250   247   250   245   247   250   245   247 mul_u64_u64_div_u64_new e6102d256d7ea3ae*70a77d0be4c31201/d63ec35ab3220357 = 78f8bf8cc86c6e18
24: ok   274   245   238   237   237   237   237   237   237   247 mul_u64_u64_div_u64_new f53bae05cb86c6e1*3847b32d2f8d32e0/cfd4f55a647f403c = 42687f79d8998d35
25: ok   247   247   245   247   250   245   247   250   245   247 mul_u64_u64_div_u64_new 9951c5498f941092*1f8c8bfdf287a251/a3c8dc5f81ea3fe2 = 1d887cb25900091f
26: ok   341   354   288   239   242   240   239   242   240   239 mul_u64_u64_div_u64_new 374fee9daa1bb2bb*d0bfbff7b8ae3ef/c169337bd42d5179 = 3bb2dbaffcbb961
27: ok   408   375   288   312   257   260   259   259   259   259 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
28: ok   289   259   199   198   201   170   170   174   173   173 mul_u64_u64_div_u64_new 2d256d7ea3ae*7d0be4c31201/d63ec35ab3220357 = 1a599d6e
29: ok   334   257   260   257   260   259   259   259   259   259 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
30: ok   341   267   244   233   229   233   229   233   229   233 mul_u64_u64_div_u64_new 2d256d7ea3ae*7d0be4c31201/63ec35ab3220357 = 387f55cef
31: ok   323   323   297   297   264   268   267   268   267   268 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb000000000000 = dc5f7e8b334db07d
32: ok   284   262   251   251   253   251   251   251   251   251 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216

$ cc -O2 -o div_perf div_perf.c -DMULDIV_OPT=0x24 -m32 && sudo ./div_perf
0: ok   362   127   125   122   123   122   125   129   126   126 mul_u64_u64_div_u64_new b*7/3 = 19
1: ok   190   175   149   150   154   150   154   150   154   150 mul_u64_u64_div_u64_new ffff0000*ffff0000/f = 1110eeef00000000
2: ok   144   139   139   139   139   139   139   139   139   139 mul_u64_u64_div_u64_new ffffffff*ffffffff/1 = fffffffe00000001
3: ok   146   150   154   154   150   154   150   154   150   154 mul_u64_u64_div_u64_new ffffffff*ffffffff/2 = 7fffffff00000000
4: ok   747   554   319   318   316   319   318   328   314   324 mul_u64_u64_div_u64_new 1ffffffff*ffffffff/2 = fffffffe80000000
5: ok   426   315   312   315   315   315   315   315   315   315 mul_u64_u64_div_u64_new 1ffffffff*ffffffff/3 = aaaaaaa9aaaaaaab
6: ok   352   391   317   316   323   327   323   327   323   324 mul_u64_u64_div_u64_new 1ffffffff*1ffffffff/4 = ffffffff00000000
7: ok   369   328   298   292   298   292   298   292   298   292 mul_u64_u64_div_u64_new ffff000000000000*ffff000000000000/ffff000000000001 = fffeffffffffffff
8: ok   436   348   298   299   298   300   307   297   297   301 mul_u64_u64_div_u64_new 3333333333333333*3333333333333333/5555555555555555 = 1eb851eb851eb851
9: ok   180   183   151   151   151   151   151   151   151   153 mul_u64_u64_div_u64_new 7fffffffffffffff*2/3 = 5555555555555554
10: ok   286   251   188   169   174   170   178   170   178   170 mul_u64_u64_div_u64_new ffffffffffffffff*2/8000000000000000 = 3
11: ok   230   177   172   177   172   177   172   177   172   177 mul_u64_u64_div_u64_new ffffffffffffffff*2/c000000000000000 = 2
12: ok   494   412   371   331   296   298   305   290   296   298 mul_u64_u64_div_u64_new ffffffffffffffff*4000000000000004/8000000000000000 = 8000000000000007
13: ok   330   300   304   305   290   305   294   302   290   296 mul_u64_u64_div_u64_new ffffffffffffffff*4000000000000001/8000000000000000 = 8000000000000001
14: ok   284   241   175   172   176   175   175   172   176   175 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/ffffffffffffffff = 8000000000000001
15: ok   283   263   171   176   175   175   175   175   175   175 mul_u64_u64_div_u64_new fffffffffffffffe*8000000000000001/ffffffffffffffff = 8000000000000000
16: ok   346   247   258   208   202   205   208   202   205   208 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/fffffffffffffffe = 8000000000000001
17: ok   242   272   208   213   209   213   209   213   209   213 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/fffffffffffffffd = 8000000000000002
18: ok   494   337   306   309   309   309   309   309   309   309 mul_u64_u64_div_u64_new 7fffffffffffffff*ffffffffffffffff/c000000000000000 = aaaaaaaaaaaaaaa8
19: ok   392   394   329   305   299   302   294   302   292   305 mul_u64_u64_div_u64_new ffffffffffffffff*7fffffffffffffff/a000000000000000 = ccccccccccccccca
20: ok   372   307   306   310   308   310   308   310   308   310 mul_u64_u64_div_u64_new ffffffffffffffff*7fffffffffffffff/9000000000000000 = e38e38e38e38e38b
21: ok   525   388   310   320   314   313   314   315   312   315 mul_u64_u64_div_u64_new 7fffffffffffffff*7fffffffffffffff/5000000000000000 = ccccccccccccccc9
22: ok   400   293   289   354   284   283   284   284   284   284 mul_u64_u64_div_u64_new ffffffffffffffff*fffffffffffffffe/ffffffffffffffff = fffffffffffffffe
23: ok   320   324   289   290   289   289   290   289   289   290 mul_u64_u64_div_u64_new e6102d256d7ea3ae*70a77d0be4c31201/d63ec35ab3220357 = 78f8bf8cc86c6e18
24: ok   279   289   290   285   285   285   285   285   285   285 mul_u64_u64_div_u64_new f53bae05cb86c6e1*3847b32d2f8d32e0/cfd4f55a647f403c = 42687f79d8998d35
25: ok   288   290   289   289   290   289   289   290   289   289 mul_u64_u64_div_u64_new 9951c5498f941092*1f8c8bfdf287a251/a3c8dc5f81ea3fe2 = 1d887cb25900091f
26: ok   361   372   351   325   288   293   286   293   294   293 mul_u64_u64_div_u64_new 374fee9daa1bb2bb*d0bfbff7b8ae3ef/c169337bd42d5179 = 3bb2dbaffcbb961
27: ok   483   349   302   302   305   300   307   304   307   304 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
28: ok   339   328   234   233   213   214   213   208   215   208 mul_u64_u64_div_u64_new 2d256d7ea3ae*7d0be4c31201/d63ec35ab3220357 = 1a599d6e
29: ok   406   300   303   300   307   304   307   304   307   304 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
30: ok   404   421   335   268   265   271   267   271   267   272 mul_u64_u64_div_u64_new 2d256d7ea3ae*7d0be4c31201/63ec35ab3220357 = 387f55cef
31: ok   484   350   310   309   306   309   309   306   309   309 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb000000000000 = dc5f7e8b334db07d
32: ok   368   306   301   307   304   307   304   307   304   307 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216

	David

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Reply: [????] Re: [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code
  2025-07-09 14:24   ` David Laight
@ 2025-07-10  9:39     ` Li,Rongqing
  2025-07-10 10:35       ` David Laight
  2025-07-11 21:17     ` David Laight
  1 sibling, 1 reply; 44+ messages in thread
From: Li,Rongqing @ 2025-07-10  9:39 UTC (permalink / raw)
  To: David Laight, Andrew Morton, linux-kernel@vger.kernel.org
  Cc: u.kleine-koenig@baylibre.com, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das, rostedt@goodmis.org

...

Nice work!

Is it necessary to add an exception test case, such as a case where the division result overflows 64 bits?
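
Something like the following perhaps (just a sketch, matching the error
returns of the version posted above - ~0 when the quotient doesn't fit in
64 bits, 0 for a zero divisor):

        #include <assert.h>

        assert(mul_u64_add_u64_div_u64(~0ULL, ~0ULL, 0, 1) == ~0ULL);   /* quotient > 64 bits */
        assert(mul_u64_add_u64_div_u64(1, 1, 0, 0) == 0);               /* divide by zero */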

Thanks

-Li

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [????] Re: [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code
  2025-07-10  9:39     ` Reply: [????] " Li,Rongqing
@ 2025-07-10 10:35       ` David Laight
  0 siblings, 0 replies; 44+ messages in thread
From: David Laight @ 2025-07-10 10:35 UTC (permalink / raw)
  To: Li,Rongqing
  Cc: Andrew Morton, linux-kernel@vger.kernel.org,
	u.kleine-koenig@baylibre.com, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das, rostedt@goodmis.org

On Thu, 10 Jul 2025 09:39:51 +0000
"Li,Rongqing" <lirongqing@baidu.com> wrote:

...
> Nice work!
> 
> Is it necessary to add an exception test case, such as a case for division result overflowing 64bit?

Not if the code traps :-)

	David

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code
  2025-07-09 14:24   ` David Laight
  2025-07-10  9:39     ` Reply: [????] " Li,Rongqing
@ 2025-07-11 21:17     ` David Laight
  2025-07-11 21:40       ` Nicolas Pitre
  1 sibling, 1 reply; 44+ messages in thread
From: David Laight @ 2025-07-11 21:17 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: u.kleine-koenig, Nicolas Pitre, Oleg Nesterov, Peter Zijlstra,
	Biju Das, rostedt, lirongqing

On Wed, 9 Jul 2025 15:24:20 +0100
David Laight <david.laight.linux@gmail.com> wrote:

> On Sat, 14 Jun 2025 10:53:45 +0100
> David Laight <david.laight.linux@gmail.com> wrote:
> 
> > Replace the bit by bit algorithm with one that generates 16 bits
> > per iteration on 32bit architectures and 32 bits on 64bit ones.  
> 
> I've spent far too long doing some clock counting exercises on this code.
> This is the latest version with some conditional compiles and comments
> explaining the various optimisations.
> I think the 'best' version is with -DMULDIV_OPT=0xc3
> 
>....
> Some measurements on an ivy bridge.
> These are the test vectors from the test module with a few extra values on the
> end that pick different paths through this implementatoin.
> The numbers are 'performance counter' deltas for 10 consecutive calls with the
> same values.
> So the latter values are with the branch predictor 'trained' to the test case.
> The first few (larger) values show the cost of mispredicted branches.

I should have included the values for the same tests with the existing code.
Added here, but this includes the change to test for overflow and divide
by zero after the multiply (no ilog2 calls - which make the 32bit code worse).
About the only cases where the old code is better are when the quotient
only has two bits set.

> Apologies for the very long lines.
> 
> $ cc -O2 -o div_perf div_perf.c -DMULDIV_OPT=0xc3  && sudo ./div_perf
> 0: ok   162   134    78    78    78    78    78    80    80    80 mul_u64_u64_div_u64_new b*7/3 = 19
> 1: ok    91    91    91    91    91    91    91    91    91    91 mul_u64_u64_div_u64_new ffff0000*ffff0000/f = 1110eeef00000000
> 2: ok    75    77    75    77    77    77    77    77    77    77 mul_u64_u64_div_u64_new ffffffff*ffffffff/1 = fffffffe00000001
> 3: ok    89    91    91    91    91    91    89    90    91    91 mul_u64_u64_div_u64_new ffffffff*ffffffff/2 = 7fffffff00000000
> 4: ok   147   147   128   128   128   128   128   128   128   128 mul_u64_u64_div_u64_new 1ffffffff*ffffffff/2 = fffffffe80000000
> 5: ok   128   128   128   128   128   128   128   128   128   128 mul_u64_u64_div_u64_new 1ffffffff*ffffffff/3 = aaaaaaa9aaaaaaab
> 6: ok   121   121   121   121   121   121   121   121   121   121 mul_u64_u64_div_u64_new 1ffffffff*1ffffffff/4 = ffffffff00000000
> 7: ok   274   234   146   138   138   138   138   138   138   138 mul_u64_u64_div_u64_new ffff000000000000*ffff000000000000/ffff000000000001 = fffeffffffffffff
> 8: ok   177   148   148   149   149   149   149   149   149   149 mul_u64_u64_div_u64_new 3333333333333333*3333333333333333/5555555555555555 = 1eb851eb851eb851
> 9: ok   138    90   118    91    91    91    91    92    92    92 mul_u64_u64_div_u64_new 7fffffffffffffff*2/3 = 5555555555555554
> 10: ok   113   114    86    86    84    86    86    84    87    87 mul_u64_u64_div_u64_new ffffffffffffffff*2/8000000000000000 = 3
> 11: ok    87    88    88    86    88    88    88    88    90    90 mul_u64_u64_div_u64_new ffffffffffffffff*2/c000000000000000 = 2
> 12: ok    82    86    84    86    83    86    83    86    83    87 mul_u64_u64_div_u64_new ffffffffffffffff*4000000000000004/8000000000000000 = 8000000000000007
> 13: ok    82    86    84    86    83    86    83    86    83    86 mul_u64_u64_div_u64_new ffffffffffffffff*4000000000000001/8000000000000000 = 8000000000000001
> 14: ok   189   187   138   132   132   132   131   131   131   131 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/ffffffffffffffff = 8000000000000001
> 15: ok   221   175   159   131   131   131   131   131   131   131 mul_u64_u64_div_u64_new fffffffffffffffe*8000000000000001/ffffffffffffffff = 8000000000000000
> 16: ok   134   132   134   134   134   135   134   134   134   134 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/fffffffffffffffe = 8000000000000001
> 17: ok   172   134   137   134   134   134   134   134   134   134 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/fffffffffffffffd = 8000000000000002
> 18: ok   182   182   129   129   129   129   129   129   129   129 mul_u64_u64_div_u64_new 7fffffffffffffff*ffffffffffffffff/c000000000000000 = aaaaaaaaaaaaaaa8
> 19: ok   130   129   130   129   129   129   129   129   129   129 mul_u64_u64_div_u64_new ffffffffffffffff*7fffffffffffffff/a000000000000000 = ccccccccccccccca
> 20: ok   130   129   129   129   129   129   129   129   129   129 mul_u64_u64_div_u64_new ffffffffffffffff*7fffffffffffffff/9000000000000000 = e38e38e38e38e38b
> 21: ok   130   129   129   129   129   129   129   129   129   129 mul_u64_u64_div_u64_new 7fffffffffffffff*7fffffffffffffff/5000000000000000 = ccccccccccccccc9
> 22: ok   206   140   138   138   138   138   138   138   138   138 mul_u64_u64_div_u64_new ffffffffffffffff*fffffffffffffffe/ffffffffffffffff = fffffffffffffffe
> 23: ok   174   140   138   138   138   138   138   138   138   138 mul_u64_u64_div_u64_new e6102d256d7ea3ae*70a77d0be4c31201/d63ec35ab3220357 = 78f8bf8cc86c6e18
> 24: ok   135   137   137   137   137   137   137   137   137   137 mul_u64_u64_div_u64_new f53bae05cb86c6e1*3847b32d2f8d32e0/cfd4f55a647f403c = 42687f79d8998d35
> 25: ok   134   136   136   136   136   136   136   136   136   136 mul_u64_u64_div_u64_new 9951c5498f941092*1f8c8bfdf287a251/a3c8dc5f81ea3fe2 = 1d887cb25900091f
> 26: ok   136   134   134   134   134   134   134   134   134   134 mul_u64_u64_div_u64_new 374fee9daa1bb2bb*d0bfbff7b8ae3ef/c169337bd42d5179 = 3bb2dbaffcbb961
> 27: ok   139   138   138   138   138   138   138   138   138   138 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
> 28: ok   130   143    95    95    96    96    96    96    96    96 mul_u64_u64_div_u64_new 2d256d7ea3ae*7d0be4c31201/d63ec35ab3220357 = 1a599d6e
> 29: ok   169   158   158   138   138   138   138   138   138   138 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
> 30: ok   178   164   144   147   147   147   147   147   147   147 mul_u64_u64_div_u64_new 2d256d7ea3ae*7d0be4c31201/63ec35ab3220357 = 387f55cef
> 31: ok   163   128   128   128   128   128   128   128   128   128 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb000000000000 = dc5f7e8b334db07d
> 32: ok   163   184   137   136   136   138   138   138   138   138 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
> 
> $ cc -O2 -o div_perf div_perf.c -DMULDIV_OPT=0x03  && sudo ./div_perf
> 0: ok   125    78    78    79    79    79    79    79    79    79 mul_u64_u64_div_u64_new b*7/3 = 19
> 1: ok    88    89    89    88    89    89    89    89    89    89 mul_u64_u64_div_u64_new ffff0000*ffff0000/f = 1110eeef00000000
> 2: ok    75    76    76    76    76    76    74    76    76    76 mul_u64_u64_div_u64_new ffffffff*ffffffff/1 = fffffffe00000001
> 3: ok    87    89    89    89    89    89    89    88    88    88 mul_u64_u64_div_u64_new ffffffff*ffffffff/2 = 7fffffff00000000
> 4: ok   305   221   148   144   147   147   147   147   147   147 mul_u64_u64_div_u64_new 1ffffffff*ffffffff/2 = fffffffe80000000
> 5: ok   179   178   141   141   141   141   141   141   141   141 mul_u64_u64_div_u64_new 1ffffffff*ffffffff/3 = aaaaaaa9aaaaaaab
> 6: ok   148   200   143   145   145   145   145   145   145   145 mul_u64_u64_div_u64_new 1ffffffff*1ffffffff/4 = ffffffff00000000
> 7: ok   201   186   140   135   135   135   135   135   135   135 mul_u64_u64_div_u64_new ffff000000000000*ffff000000000000/ffff000000000001 = fffeffffffffffff
> 8: ok   227   154   145   141   141   141   141   141   141   141 mul_u64_u64_div_u64_new 3333333333333333*3333333333333333/5555555555555555 = 1eb851eb851eb851
> 9: ok   111   111    89    89    89    89    89    89    89    89 mul_u64_u64_div_u64_new 7fffffffffffffff*2/3 = 5555555555555554
> 10: ok   149   156   124    90    90    90    90    90    90    90 mul_u64_u64_div_u64_new ffffffffffffffff*2/8000000000000000 = 3
> 11: ok    91    91    90    90    90    90    90    90    90    90 mul_u64_u64_div_u64_new ffffffffffffffff*2/c000000000000000 = 2
> 12: ok   197   197   138   134   134   134   134   134   134   134 mul_u64_u64_div_u64_new ffffffffffffffff*4000000000000004/8000000000000000 = 8000000000000007
> 13: ok   260   136   136   135   135   135   135   135   135   135 mul_u64_u64_div_u64_new ffffffffffffffff*4000000000000001/8000000000000000 = 8000000000000001
> 14: ok   186   187   164   130   127   127   127   127   127   127 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/ffffffffffffffff = 8000000000000001
> 15: ok   171   172   173   158   160   128   125   127   125   127 mul_u64_u64_div_u64_new fffffffffffffffe*8000000000000001/ffffffffffffffff = 8000000000000000
> 16: ok   157   164   129   130   130   130   130   130   130   130 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/fffffffffffffffe = 8000000000000001
> 17: ok   191   158   130   132   132   130   130   130   130   130 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/fffffffffffffffd = 8000000000000002
> 18: ok   197   214   163   138   138   138   138   138   138   138 mul_u64_u64_div_u64_new 7fffffffffffffff*ffffffffffffffff/c000000000000000 = aaaaaaaaaaaaaaa8
> 19: ok   196   196   135   134   134   134   134   134   134   134 mul_u64_u64_div_u64_new ffffffffffffffff*7fffffffffffffff/a000000000000000 = ccccccccccccccca
> 20: ok   191   216   176   140   138   138   138   138   138   138 mul_u64_u64_div_u64_new ffffffffffffffff*7fffffffffffffff/9000000000000000 = e38e38e38e38e38b
> 21: ok   232   157   156   145   145   145   145   145   145   145 mul_u64_u64_div_u64_new 7fffffffffffffff*7fffffffffffffff/5000000000000000 = ccccccccccccccc9
> 22: ok   159   192   133   134   134   134   134   134   134   134 mul_u64_u64_div_u64_new ffffffffffffffff*fffffffffffffffe/ffffffffffffffff = fffffffffffffffe
> 23: ok   133   134   134   134   134   134   134   134   134   134 mul_u64_u64_div_u64_new e6102d256d7ea3ae*70a77d0be4c31201/d63ec35ab3220357 = 78f8bf8cc86c6e18
> 24: ok   134   131   131   131   131   134   131   131   131   131 mul_u64_u64_div_u64_new f53bae05cb86c6e1*3847b32d2f8d32e0/cfd4f55a647f403c = 42687f79d8998d35
> 25: ok   133   130   130   130   130   130   130   130   130   130 mul_u64_u64_div_u64_new 9951c5498f941092*1f8c8bfdf287a251/a3c8dc5f81ea3fe2 = 1d887cb25900091f
> 26: ok   133   130   130   130   130   130   130   130   130   131 mul_u64_u64_div_u64_new 374fee9daa1bb2bb*d0bfbff7b8ae3ef/c169337bd42d5179 = 3bb2dbaffcbb961
> 27: ok   133   134   134   134   134   134   134   134   134   133 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
> 28: ok   151    93   119    93    93    93    93    93    93    93 mul_u64_u64_div_u64_new 2d256d7ea3ae*7d0be4c31201/d63ec35ab3220357 = 1a599d6e
> 29: ok   193   137   134   134   134   134   134   134   134   134 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
> 30: ok   194   151   150   137   137   137   137   137   137   137 mul_u64_u64_div_u64_new 2d256d7ea3ae*7d0be4c31201/63ec35ab3220357 = 387f55cef
> 31: ok   137   173   172   137   138   138   138   138   138   138 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb000000000000 = dc5f7e8b334db07d
> 32: ok   160   149   131   134   134   134   134   134   134   134 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
> 
> $ cc -O2 -o div_perf div_perf.c -DMULDIV_OPT=0x24  && sudo ./div_perf
> 0: ok   130   106    79    79    78    78    78    78    81    81 mul_u64_u64_div_u64_new b*7/3 = 19
> 1: ok    88    92    92    89    92    92    92    92    91    91 mul_u64_u64_div_u64_new ffff0000*ffff0000/f = 1110eeef00000000
> 2: ok    74    78    78    78    78    78    75    79    78    78 mul_u64_u64_div_u64_new ffffffff*ffffffff/1 = fffffffe00000001
> 3: ok    87    92    92    92    92    92    92    92    92    92 mul_u64_u64_div_u64_new ffffffff*ffffffff/2 = 7fffffff00000000
> 4: ok   330   275   181   145   147   148   148   148   148   148 mul_u64_u64_div_u64_new 1ffffffff*ffffffff/2 = fffffffe80000000
> 5: ok   225   175   141   146   146   146   146   146   146   146 mul_u64_u64_div_u64_new 1ffffffff*ffffffff/3 = aaaaaaa9aaaaaaab
> 6: ok   187   194   193   194   178   144   148   148   148   148 mul_u64_u64_div_u64_new 1ffffffff*1ffffffff/4 = ffffffff00000000
> 7: ok   202   189   178   139   140   139   140   139   140   139 mul_u64_u64_div_u64_new ffff000000000000*ffff000000000000/ffff000000000001 = fffeffffffffffff
> 8: ok   228   168   150   143   143   143   143   143   143   143 mul_u64_u64_div_u64_new 3333333333333333*3333333333333333/5555555555555555 = 1eb851eb851eb851
> 9: ok   112   112    92    89    92    92    92    92    92    87 mul_u64_u64_div_u64_new 7fffffffffffffff*2/3 = 5555555555555554
> 10: ok   153   184    92    93    95    95    95    95    95    95 mul_u64_u64_div_u64_new ffffffffffffffff*2/8000000000000000 = 3
> 11: ok   158    93    93    93    95    92    95    95    95    95 mul_u64_u64_div_u64_new ffffffffffffffff*2/c000000000000000 = 2
> 12: ok   206   178   146   139   139   140   139   140   139   140 mul_u64_u64_div_u64_new ffffffffffffffff*4000000000000004/8000000000000000 = 8000000000000007
> 13: ok   187   146   140   139   140   139   140   139   140   139 mul_u64_u64_div_u64_new ffffffffffffffff*4000000000000001/8000000000000000 = 8000000000000001
> 14: ok   167   173   136   136   102    97    97    97    97    97 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/ffffffffffffffff = 8000000000000001
> 15: ok   166   150   105    98    98    98    97    99    97    99 mul_u64_u64_div_u64_new fffffffffffffffe*8000000000000001/ffffffffffffffff = 8000000000000000
> 16: ok   209   197   139   136   136   136   136   136   136   136 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/fffffffffffffffe = 8000000000000001
> 17: ok   170   142   170   137   136   135   136   135   136   135 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/fffffffffffffffd = 8000000000000002
> 18: ok   238   197   172   140   140   140   139   141   140   139 mul_u64_u64_div_u64_new 7fffffffffffffff*ffffffffffffffff/c000000000000000 = aaaaaaaaaaaaaaa8
> 19: ok   185   206   139   141   142   142   142   142   142   142 mul_u64_u64_div_u64_new ffffffffffffffff*7fffffffffffffff/a000000000000000 = ccccccccccccccca
> 20: ok   207   226   146   142   140   140   140   140   140   140 mul_u64_u64_div_u64_new ffffffffffffffff*7fffffffffffffff/9000000000000000 = e38e38e38e38e38b
> 21: ok   161   153   148   148   148   148   148   148   148   146 mul_u64_u64_div_u64_new 7fffffffffffffff*7fffffffffffffff/5000000000000000 = ccccccccccccccc9
> 22: ok   172   199   141   140   140   140   140   140   140   140 mul_u64_u64_div_u64_new ffffffffffffffff*fffffffffffffffe/ffffffffffffffff = fffffffffffffffe
> 23: ok   172   137   140   140   140   140   140   140   140   140 mul_u64_u64_div_u64_new e6102d256d7ea3ae*70a77d0be4c31201/d63ec35ab3220357 = 78f8bf8cc86c6e18
> 24: ok   135   138   138   138   138   138   138   138   138   138 mul_u64_u64_div_u64_new f53bae05cb86c6e1*3847b32d2f8d32e0/cfd4f55a647f403c = 42687f79d8998d35
> 25: ok   136   137   137   137   137   137   137   137   137   137 mul_u64_u64_div_u64_new 9951c5498f941092*1f8c8bfdf287a251/a3c8dc5f81ea3fe2 = 1d887cb25900091f
> 26: ok   136   136   136   136   136   136   136   136   136   136 mul_u64_u64_div_u64_new 374fee9daa1bb2bb*d0bfbff7b8ae3ef/c169337bd42d5179 = 3bb2dbaffcbb961
> 27: ok   139   140   140   140   140   140   140   140   140   140 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
> 28: ok   132    94    97    96    96    96    96    96    96    96 mul_u64_u64_div_u64_new 2d256d7ea3ae*7d0be4c31201/d63ec35ab3220357 = 1a599d6e
> 29: ok   139   140   140   140   140   140   140   140   140   140 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
> 30: ok   159   141   141   141   141   141   141   141   141   141 mul_u64_u64_div_u64_new 2d256d7ea3ae*7d0be4c31201/63ec35ab3220357 = 387f55cef
> 31: ok   244   186   154   141   139   141   140   139   141   140 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb000000000000 = dc5f7e8b334db07d
> 32: ok   165   140   140   140   140   140   140   140   140   140 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216

Existing code, 64bit x86
$ cc -O2 -o div_perf div_perf.c  && sudo ./div_perf
0: ok   171   117    88    89    88    87    87    87    90    88 mul_u64_u64_div_u64 b*7/3 = 19
1: ok    98    97    97    97   101   101   101   101   101   101 mul_u64_u64_div_u64 ffff0000*ffff0000/f = 1110eeef00000000
2: ok    85    85    84    84    84    87    87    87    87    87 mul_u64_u64_div_u64 ffffffff*ffffffff/1 = fffffffe00000001
3: ok    99   101   101   101   101    99   101   101   101   101 mul_u64_u64_div_u64 ffffffff*ffffffff/2 = 7fffffff00000000
4: ok   162   145    90    90    90    90    90    90    90    90 mul_u64_u64_div_u64 1ffffffff*ffffffff/2 = fffffffe80000000
5: ok   809   675   582   462   450   446   442   446   446   450 mul_u64_u64_div_u64 1ffffffff*ffffffff/3 = aaaaaaa9aaaaaaab
6: ok   136    90    90    90    90    90    90    90    90    90 mul_u64_u64_div_u64 1ffffffff*1ffffffff/4 = ffffffff00000000
7: ok   559   449   391   370   374   370   394   377   375   384 mul_u64_u64_div_u64 ffff000000000000*ffff000000000000/ffff000000000001 = fffeffffffffffff
8: ok   587   515   386   333   334   334   334   334   334   334 mul_u64_u64_div_u64 3333333333333333*3333333333333333/5555555555555555 = 1eb851eb851eb851
9: ok   124   123    99    99   101   101   101   101   101   101 mul_u64_u64_div_u64 7fffffffffffffff*2/3 = 5555555555555554
10: ok   143   119    93    90    90    90    90    90    90    90 mul_u64_u64_div_u64 ffffffffffffffff*2/8000000000000000 = 3
11: ok   136    93    93    95    93    93    93    93    93    93 mul_u64_u64_div_u64 ffffffffffffffff*2/c000000000000000 = 2
12: ok    88    90    90    90    90    90    93    90    90    90 mul_u64_u64_div_u64 ffffffffffffffff*4000000000000004/8000000000000000 = 8000000000000007
13: ok    88    90    90    90    90    90    90    90    90    90 mul_u64_u64_div_u64 ffffffffffffffff*4000000000000001/8000000000000000 = 8000000000000001
14: ok   138   144   102   122   101    71    73    74    74    74 mul_u64_u64_div_u64 ffffffffffffffff*8000000000000001/ffffffffffffffff = 8000000000000001
15: ok   101    98    95    98    93    66    69    69    69    69 mul_u64_u64_div_u64 fffffffffffffffe*8000000000000001/ffffffffffffffff = 8000000000000000
16: ok   207   169   124    99    76    75    76    76    76    76 mul_u64_u64_div_u64 ffffffffffffffff*8000000000000001/fffffffffffffffe = 8000000000000001
17: ok   177   143   107   108    76    74    74    74    74    74 mul_u64_u64_div_u64 ffffffffffffffff*8000000000000001/fffffffffffffffd = 8000000000000002
18: ok   565   444   412   399   401   413   399   412   392   415 mul_u64_u64_div_u64 7fffffffffffffff*ffffffffffffffff/c000000000000000 = aaaaaaaaaaaaaaa8
19: ok   411   447   435   392   417   398   422   394   402   394 mul_u64_u64_div_u64 ffffffffffffffff*7fffffffffffffff/a000000000000000 = ccccccccccccccca
20: ok   467   452   450   427   405   428   418   413   421   405 mul_u64_u64_div_u64 ffffffffffffffff*7fffffffffffffff/9000000000000000 = e38e38e38e38e38b
21: ok   443   395   398   417   405   418   398   404   404   404 mul_u64_u64_div_u64 7fffffffffffffff*7fffffffffffffff/5000000000000000 = ccccccccccccccc9
22: ok   441   400   363   361   347   348   345   345   345   343 mul_u64_u64_div_u64 ffffffffffffffff*fffffffffffffffe/ffffffffffffffff = fffffffffffffffe
23: ok   895   811   423   348   352   352   352   352   352   352 mul_u64_u64_div_u64 e6102d256d7ea3ae*70a77d0be4c31201/d63ec35ab3220357 = 78f8bf8cc86c6e18
24: ok   928   638   445   339   344   344   344   344   344   344 mul_u64_u64_div_u64 f53bae05cb86c6e1*3847b32d2f8d32e0/cfd4f55a647f403c = 42687f79d8998d35
25: ok   606   457   277   279   276   276   276   276   276   276 mul_u64_u64_div_u64 9951c5498f941092*1f8c8bfdf287a251/a3c8dc5f81ea3fe2 = 1d887cb25900091f
26: ok   955   683   415   377   377   377   377   377   377   377 mul_u64_u64_div_u64 374fee9daa1bb2bb*d0bfbff7b8ae3ef/c169337bd42d5179 = 3bb2dbaffcbb961
27: ok   700   599   375   382   382   382   382   382   382   382 mul_u64_u64_div_u64 eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
28: ok   406   346   238   238   217   214   214   214   214   214 mul_u64_u64_div_u64 2d256d7ea3ae*7d0be4c31201/d63ec35ab3220357 = 1a599d6e
29: ok   486   380   378   382   382   382   382   382   382   382 mul_u64_u64_div_u64 eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
30: ok   538   407   297   262   244   244   244   244   244   244 mul_u64_u64_div_u64 2d256d7ea3ae*7d0be4c31201/63ec35ab3220357 = 387f55cef
31: ok   564   539   446   451   417   418   424   425   417   425 mul_u64_u64_div_u64 eac0d03ac10eeaf0*89be05dfa162ed9b/92bb000000000000 = dc5f7e8b334db07d
32: ok   416   382   387   382   382   382   382   382   382   382 mul_u64_u64_div_u64 eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216


> 
> $ cc -O2 -o div_perf div_perf.c -DMULDIV_OPT=0xc3 -m32 && sudo ./div_perf
> 0: ok   336   131   104   104   102   102   103   103   103   105 mul_u64_u64_div_u64_new b*7/3 = 19
> 1: ok   195   195   171   171   171   171   171   171   171   171 mul_u64_u64_div_u64_new ffff0000*ffff0000/f = 1110eeef00000000
> 2: ok   165   162   162   162   162   162   162   162   162   162 mul_u64_u64_div_u64_new ffffffff*ffffffff/1 = fffffffe00000001
> 3: ok   165   173   173   173   173   173   173   173   173   173 mul_u64_u64_div_u64_new ffffffff*ffffffff/2 = 7fffffff00000000
> 4: ok   220   216   202   206   202   206   202   206   202   206 mul_u64_u64_div_u64_new 1ffffffff*ffffffff/2 = fffffffe80000000
> 5: ok   203   205   208   205   209   205   209   205   209   205 mul_u64_u64_div_u64_new 1ffffffff*ffffffff/3 = aaaaaaa9aaaaaaab
> 6: ok   203   203   199   203   199   203   199   203   199   203 mul_u64_u64_div_u64_new 1ffffffff*1ffffffff/4 = ffffffff00000000
> 7: ok   574   421   251   242   246   254   246   254   246   254 mul_u64_u64_div_u64_new ffff000000000000*ffff000000000000/ffff000000000001 = fffeffffffffffff
> 8: ok   370   317   263   262   254   259   258   262   254   259 mul_u64_u64_div_u64_new 3333333333333333*3333333333333333/5555555555555555 = 1eb851eb851eb851
> 9: ok   214   199   177   177   177   177   177   177   177   177 mul_u64_u64_div_u64_new 7fffffffffffffff*2/3 = 5555555555555554
> 10: ok   163   150   128   129   129   129   129   129   129   129 mul_u64_u64_div_u64_new ffffffffffffffff*2/8000000000000000 = 3
> 11: ok   135   133   135   135   135   135   135   135   135   135 mul_u64_u64_div_u64_new ffffffffffffffff*2/c000000000000000 = 2
> 12: ok   206   208   208   180   183   183   183   183   183   183 mul_u64_u64_div_u64_new ffffffffffffffff*4000000000000004/8000000000000000 = 8000000000000007
> 13: ok   205   183   183   184   183   183   183   183   183   183 mul_u64_u64_div_u64_new ffffffffffffffff*4000000000000001/8000000000000000 = 8000000000000001
> 14: ok   331   296   238   229   232   232   232   232   232   232 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/ffffffffffffffff = 8000000000000001
> 15: ok   324   311   238   239   239   239   239   239   239   242 mul_u64_u64_div_u64_new fffffffffffffffe*8000000000000001/ffffffffffffffff = 8000000000000000
> 16: ok   300   262   265   233   238   234   238   234   238   234 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/fffffffffffffffe = 8000000000000001
> 17: ok   295   282   244   244   244   244   244   244   244   244 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/fffffffffffffffd = 8000000000000002
> 18: ok   245   247   235   222   221   222   219   222   221   222 mul_u64_u64_div_u64_new 7fffffffffffffff*ffffffffffffffff/c000000000000000 = aaaaaaaaaaaaaaa8
> 19: ok   235   221   222   221   222   219   222   221   222   219 mul_u64_u64_div_u64_new ffffffffffffffff*7fffffffffffffff/a000000000000000 = ccccccccccccccca
> 20: ok   220   222   221   222   219   222   219   222   221   222 mul_u64_u64_div_u64_new ffffffffffffffff*7fffffffffffffff/9000000000000000 = e38e38e38e38e38b
> 21: ok   225   221   222   219   221   219   221   219   221   219 mul_u64_u64_div_u64_new 7fffffffffffffff*7fffffffffffffff/5000000000000000 = ccccccccccccccc9
> 22: ok   309   274   250   238   237   244   239   241   237   244 mul_u64_u64_div_u64_new ffffffffffffffff*fffffffffffffffe/ffffffffffffffff = fffffffffffffffe
> 23: ok   284   284   250   247   243   247   247   243   247   247 mul_u64_u64_div_u64_new e6102d256d7ea3ae*70a77d0be4c31201/d63ec35ab3220357 = 78f8bf8cc86c6e18
> 24: ok   239   239   243   240   239   239   239   239   239   239 mul_u64_u64_div_u64_new f53bae05cb86c6e1*3847b32d2f8d32e0/cfd4f55a647f403c = 42687f79d8998d35
> 25: ok   255   255   255   255   247   243   247   247   243   247 mul_u64_u64_div_u64_new 9951c5498f941092*1f8c8bfdf287a251/a3c8dc5f81ea3fe2 = 1d887cb25900091f
> 26: ok   327   274   240   242   240   242   240   242   240   242 mul_u64_u64_div_u64_new 374fee9daa1bb2bb*d0bfbff7b8ae3ef/c169337bd42d5179 = 3bb2dbaffcbb961
> 27: ok   461   313   340   259   257   259   257   259   257   259 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
> 28: ok   291   219   180   174   170   173   171   174   171   174 mul_u64_u64_div_u64_new 2d256d7ea3ae*7d0be4c31201/d63ec35ab3220357 = 1a599d6e
> 29: ok   280   319   258   257   259   257   259   257   259   257 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
> 30: ok   379   352   247   239   220   225   225   229   228   229 mul_u64_u64_div_u64_new 2d256d7ea3ae*7d0be4c31201/63ec35ab3220357 = 387f55cef
> 31: ok   235   219   221   219   221   219   221   219   221   219 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb000000000000 = dc5f7e8b334db07d
> 32: ok   305   263   257   257   258   258   263   257   257   257 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
> 
> $ cc -O2 -o div_perf div_perf.c -DMULDIV_OPT=0x03 -m32 && sudo ./div_perf
> 0: ok   292   127   129   125   123   128   125   123   125   121 mul_u64_u64_div_u64_new b*7/3 = 19
> 1: ok   190   196   151   149   151   149   151   149   151   149 mul_u64_u64_div_u64_new ffff0000*ffff0000/f = 1110eeef00000000
> 2: ok   141   139   139   139   139   139   139   139   139   139 mul_u64_u64_div_u64_new ffffffff*ffffffff/1 = fffffffe00000001
> 3: ok   152   149   151   149   151   149   151   149   151   149 mul_u64_u64_div_u64_new ffffffff*ffffffff/2 = 7fffffff00000000
> 4: ok   667   453   276   270   271   271   271   267   274   272 mul_u64_u64_div_u64_new 1ffffffff*ffffffff/2 = fffffffe80000000
> 5: ok   366   319   373   337   278   278   278   278   278   278 mul_u64_u64_div_u64_new 1ffffffff*ffffffff/3 = aaaaaaa9aaaaaaab
> 6: ok   380   349   268   268   271   265   265   265   265   265 mul_u64_u64_div_u64_new 1ffffffff*1ffffffff/4 = ffffffff00000000
> 7: ok   340   277   255   251   249   255   251   249   255   251 mul_u64_u64_div_u64_new ffff000000000000*ffff000000000000/ffff000000000001 = fffeffffffffffff
> 8: ok   377   302   253   252   256   256   256   256   256   256 mul_u64_u64_div_u64_new 3333333333333333*3333333333333333/5555555555555555 = 1eb851eb851eb851
> 9: ok   181   184   157   155   157   155   157   155   157   155 mul_u64_u64_div_u64_new 7fffffffffffffff*2/3 = 5555555555555554
> 10: ok   304   223   142   139   139   141   139   139   139   139 mul_u64_u64_div_u64_new ffffffffffffffff*2/8000000000000000 = 3
> 11: ok   153   143   148   142   143   142   143   142   143   142 mul_u64_u64_div_u64_new ffffffffffffffff*2/c000000000000000 = 2
> 12: ok   428   323   292   246   257   248   253   250   248   253 mul_u64_u64_div_u64_new ffffffffffffffff*4000000000000004/8000000000000000 = 8000000000000007
> 13: ok   289   256   260   257   255   257   255   257   255   257 mul_u64_u64_div_u64_new ffffffffffffffff*4000000000000001/8000000000000000 = 8000000000000001
> 14: ok   334   246   234   234   227   233   229   233   229   233 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/ffffffffffffffff = 8000000000000001
> 15: ok   324   302   273   236   236   236   236   236   236   236 mul_u64_u64_div_u64_new fffffffffffffffe*8000000000000001/ffffffffffffffff = 8000000000000000
> 16: ok   269   328   285   259   232   230   236   232   232   236 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/fffffffffffffffe = 8000000000000001
> 17: ok   307   329   330   244   247   246   245   245   245   245 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/fffffffffffffffd = 8000000000000002
> 18: ok   359   361   324   258   258   258   258   258   258   258 mul_u64_u64_div_u64_new 7fffffffffffffff*ffffffffffffffff/c000000000000000 = aaaaaaaaaaaaaaa8
> 19: ok   347   325   295   258   260   253   251   251   251   251 mul_u64_u64_div_u64_new ffffffffffffffff*7fffffffffffffff/a000000000000000 = ccccccccccccccca
> 20: ok   339   312   261   260   255   255   255   255   255   255 mul_u64_u64_div_u64_new ffffffffffffffff*7fffffffffffffff/9000000000000000 = e38e38e38e38e38b
> 21: ok   411   349   333   276   272   272   272   272   272   272 mul_u64_u64_div_u64_new 7fffffffffffffff*7fffffffffffffff/5000000000000000 = ccccccccccccccc9
> 22: ok   297   330   290   266   239   236   238   237   238   237 mul_u64_u64_div_u64_new ffffffffffffffff*fffffffffffffffe/ffffffffffffffff = fffffffffffffffe
> 23: ok   299   311   250   247   250   245   247   250   245   247 mul_u64_u64_div_u64_new e6102d256d7ea3ae*70a77d0be4c31201/d63ec35ab3220357 = 78f8bf8cc86c6e18
> 24: ok   274   245   238   237   237   237   237   237   237   247 mul_u64_u64_div_u64_new f53bae05cb86c6e1*3847b32d2f8d32e0/cfd4f55a647f403c = 42687f79d8998d35
> 25: ok   247   247   245   247   250   245   247   250   245   247 mul_u64_u64_div_u64_new 9951c5498f941092*1f8c8bfdf287a251/a3c8dc5f81ea3fe2 = 1d887cb25900091f
> 26: ok   341   354   288   239   242   240   239   242   240   239 mul_u64_u64_div_u64_new 374fee9daa1bb2bb*d0bfbff7b8ae3ef/c169337bd42d5179 = 3bb2dbaffcbb961
> 27: ok   408   375   288   312   257   260   259   259   259   259 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
> 28: ok   289   259   199   198   201   170   170   174   173   173 mul_u64_u64_div_u64_new 2d256d7ea3ae*7d0be4c31201/d63ec35ab3220357 = 1a599d6e
> 29: ok   334   257   260   257   260   259   259   259   259   259 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
> 30: ok   341   267   244   233   229   233   229   233   229   233 mul_u64_u64_div_u64_new 2d256d7ea3ae*7d0be4c31201/63ec35ab3220357 = 387f55cef
> 31: ok   323   323   297   297   264   268   267   268   267   268 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb000000000000 = dc5f7e8b334db07d
> 32: ok   284   262   251   251   253   251   251   251   251   251 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
> 
> $ cc -O2 -o div_perf div_perf.c -DMULDIV_OPT=0x24 -m32 && sudo ./div_perf
> 0: ok   362   127   125   122   123   122   125   129   126   126 mul_u64_u64_div_u64_new b*7/3 = 19
> 1: ok   190   175   149   150   154   150   154   150   154   150 mul_u64_u64_div_u64_new ffff0000*ffff0000/f = 1110eeef00000000
> 2: ok   144   139   139   139   139   139   139   139   139   139 mul_u64_u64_div_u64_new ffffffff*ffffffff/1 = fffffffe00000001
> 3: ok   146   150   154   154   150   154   150   154   150   154 mul_u64_u64_div_u64_new ffffffff*ffffffff/2 = 7fffffff00000000
> 4: ok   747   554   319   318   316   319   318   328   314   324 mul_u64_u64_div_u64_new 1ffffffff*ffffffff/2 = fffffffe80000000
> 5: ok   426   315   312   315   315   315   315   315   315   315 mul_u64_u64_div_u64_new 1ffffffff*ffffffff/3 = aaaaaaa9aaaaaaab
> 6: ok   352   391   317   316   323   327   323   327   323   324 mul_u64_u64_div_u64_new 1ffffffff*1ffffffff/4 = ffffffff00000000
> 7: ok   369   328   298   292   298   292   298   292   298   292 mul_u64_u64_div_u64_new ffff000000000000*ffff000000000000/ffff000000000001 = fffeffffffffffff
> 8: ok   436   348   298   299   298   300   307   297   297   301 mul_u64_u64_div_u64_new 3333333333333333*3333333333333333/5555555555555555 = 1eb851eb851eb851
> 9: ok   180   183   151   151   151   151   151   151   151   153 mul_u64_u64_div_u64_new 7fffffffffffffff*2/3 = 5555555555555554
> 10: ok   286   251   188   169   174   170   178   170   178   170 mul_u64_u64_div_u64_new ffffffffffffffff*2/8000000000000000 = 3
> 11: ok   230   177   172   177   172   177   172   177   172   177 mul_u64_u64_div_u64_new ffffffffffffffff*2/c000000000000000 = 2
> 12: ok   494   412   371   331   296   298   305   290   296   298 mul_u64_u64_div_u64_new ffffffffffffffff*4000000000000004/8000000000000000 = 8000000000000007
> 13: ok   330   300   304   305   290   305   294   302   290   296 mul_u64_u64_div_u64_new ffffffffffffffff*4000000000000001/8000000000000000 = 8000000000000001
> 14: ok   284   241   175   172   176   175   175   172   176   175 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/ffffffffffffffff = 8000000000000001
> 15: ok   283   263   171   176   175   175   175   175   175   175 mul_u64_u64_div_u64_new fffffffffffffffe*8000000000000001/ffffffffffffffff = 8000000000000000
> 16: ok   346   247   258   208   202   205   208   202   205   208 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/fffffffffffffffe = 8000000000000001
> 17: ok   242   272   208   213   209   213   209   213   209   213 mul_u64_u64_div_u64_new ffffffffffffffff*8000000000000001/fffffffffffffffd = 8000000000000002
> 18: ok   494   337   306   309   309   309   309   309   309   309 mul_u64_u64_div_u64_new 7fffffffffffffff*ffffffffffffffff/c000000000000000 = aaaaaaaaaaaaaaa8
> 19: ok   392   394   329   305   299   302   294   302   292   305 mul_u64_u64_div_u64_new ffffffffffffffff*7fffffffffffffff/a000000000000000 = ccccccccccccccca
> 20: ok   372   307   306   310   308   310   308   310   308   310 mul_u64_u64_div_u64_new ffffffffffffffff*7fffffffffffffff/9000000000000000 = e38e38e38e38e38b
> 21: ok   525   388   310   320   314   313   314   315   312   315 mul_u64_u64_div_u64_new 7fffffffffffffff*7fffffffffffffff/5000000000000000 = ccccccccccccccc9
> 22: ok   400   293   289   354   284   283   284   284   284   284 mul_u64_u64_div_u64_new ffffffffffffffff*fffffffffffffffe/ffffffffffffffff = fffffffffffffffe
> 23: ok   320   324   289   290   289   289   290   289   289   290 mul_u64_u64_div_u64_new e6102d256d7ea3ae*70a77d0be4c31201/d63ec35ab3220357 = 78f8bf8cc86c6e18
> 24: ok   279   289   290   285   285   285   285   285   285   285 mul_u64_u64_div_u64_new f53bae05cb86c6e1*3847b32d2f8d32e0/cfd4f55a647f403c = 42687f79d8998d35
> 25: ok   288   290   289   289   290   289   289   290   289   289 mul_u64_u64_div_u64_new 9951c5498f941092*1f8c8bfdf287a251/a3c8dc5f81ea3fe2 = 1d887cb25900091f
> 26: ok   361   372   351   325   288   293   286   293   294   293 mul_u64_u64_div_u64_new 374fee9daa1bb2bb*d0bfbff7b8ae3ef/c169337bd42d5179 = 3bb2dbaffcbb961
> 27: ok   483   349   302   302   305   300   307   304   307   304 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
> 28: ok   339   328   234   233   213   214   213   208   215   208 mul_u64_u64_div_u64_new 2d256d7ea3ae*7d0be4c31201/d63ec35ab3220357 = 1a599d6e
> 29: ok   406   300   303   300   307   304   307   304   307   304 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
> 30: ok   404   421   335   268   265   271   267   271   267   272 mul_u64_u64_div_u64_new 2d256d7ea3ae*7d0be4c31201/63ec35ab3220357 = 387f55cef
> 31: ok   484   350   310   309   306   309   309   306   309   309 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb000000000000 = dc5f7e8b334db07d
> 32: ok   368   306   301   307   304   307   304   307   304   307 mul_u64_u64_div_u64_new eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
> 
> 	David

# old code x86 32bit
$ cc -O2 -o div_perf div_perf.c -DMULDIV_OPT=0x03 -m32 && sudo ./div_perf
0: ok   450   149   114   114   113   115   119   114   114   115 mul_u64_u64_div_u64 b*7/3 = 19
1: ok   169   166   141   136   140   136   140   136   140   136 mul_u64_u64_div_u64 ffff0000*ffff0000/f = 1110eeef00000000
2: ok   136   136   133   134   130   134   130   134   130   134 mul_u64_u64_div_u64 ffffffff*ffffffff/1 = fffffffe00000001
3: ok   139   138   140   136   140   136   140   136   138   136 mul_u64_u64_div_u64 ffffffff*ffffffff/2 = 7fffffff00000000
4: ok   258   245   161   159   161   159   161   159   161   159 mul_u64_u64_div_u64 1ffffffff*ffffffff/2 = fffffffe80000000
5: ok  1806  1474  1210  1148  1204  1146  1147  1149  1147  1149 mul_u64_u64_div_u64 1ffffffff*ffffffff/3 = aaaaaaa9aaaaaaab
6: ok   243   192   192   161   162   162   162   162   162   162 mul_u64_u64_div_u64 1ffffffff*1ffffffff/4 = ffffffff00000000
7: ok   965   907   902   833   835   842   858   846   806   805 mul_u64_u64_div_u64 ffff000000000000*ffff000000000000/ffff000000000001 = fffeffffffffffff
8: ok  1171   962   860   922   860   860   860   860   860   860 mul_u64_u64_div_u64 3333333333333333*3333333333333333/5555555555555555 = 1eb851eb851eb851
9: ok   169   176   147   146   142   146   142   146   141   146 mul_u64_u64_div_u64 7fffffffffffffff*2/3 = 5555555555555554
10: ok   205   198   142   136   141   139   141   139   141   139 mul_u64_u64_div_u64 ffffffffffffffff*2/8000000000000000 = 3
11: ok   145   143   143   143   143   143   143   144   143   144 mul_u64_u64_div_u64 ffffffffffffffff*2/c000000000000000 = 2
12: ok   186   188   188   163   161   163   161   163   161   163 mul_u64_u64_div_u64 ffffffffffffffff*4000000000000004/8000000000000000 = 8000000000000007
13: ok   158   160   156   156   156   156   156   156   156   156 mul_u64_u64_div_u64 ffffffffffffffff*4000000000000001/8000000000000000 = 8000000000000001
14: ok   271   261   166   139   138   138   138   138   138   138 mul_u64_u64_div_u64 ffffffffffffffff*8000000000000001/ffffffffffffffff = 8000000000000001
15: ok   242   190   170   171   155   127   124   125   124   125 mul_u64_u64_div_u64 fffffffffffffffe*8000000000000001/ffffffffffffffff = 8000000000000000
16: ok   210   175   176   176   215   146   149   148   149   149 mul_u64_u64_div_u64 ffffffffffffffff*8000000000000001/fffffffffffffffe = 8000000000000001
17: ok   235   224   188   139   140   136   139   140   136   139 mul_u64_u64_div_u64 ffffffffffffffff*8000000000000001/fffffffffffffffd = 8000000000000002
18: ok  1124  1035   960   960   960   960   960   960   960   960 mul_u64_u64_div_u64 7fffffffffffffff*ffffffffffffffff/c000000000000000 = aaaaaaaaaaaaaaa8
19: ok  1106  1092  1028  1027  1010  1009  1011  1010  1010  1010 mul_u64_u64_div_u64 ffffffffffffffff*7fffffffffffffff/a000000000000000 = ccccccccccccccca
20: ok  1123  1015  1017  1017  1003  1002  1002  1002  1002  1002 mul_u64_u64_div_u64 ffffffffffffffff*7fffffffffffffff/9000000000000000 = e38e38e38e38e38b
21: ok   997   973   974   975   974   975   974   975   974   975 mul_u64_u64_div_u64 7fffffffffffffff*7fffffffffffffff/5000000000000000 = ccccccccccccccc9
22: ok   892   834   808   802   832   786   780   784   783   773 mul_u64_u64_div_u64 ffffffffffffffff*fffffffffffffffe/ffffffffffffffff = fffffffffffffffe
23: ok  1495  1162  1028   896   898   898   898   898   898   898 mul_u64_u64_div_u64 e6102d256d7ea3ae*70a77d0be4c31201/d63ec35ab3220357 = 78f8bf8cc86c6e18
24: ok  1493  1220   962   880   880   880   880   880   880   880 mul_u64_u64_div_u64 f53bae05cb86c6e1*3847b32d2f8d32e0/cfd4f55a647f403c = 42687f79d8998d35
25: ok  1079   960   830   709   711   708   708   708   708   708 mul_u64_u64_div_u64 9951c5498f941092*1f8c8bfdf287a251/a3c8dc5f81ea3fe2 = 1d887cb25900091f
26: ok  1551  1257  1043   956   957   957   957   957   957   957 mul_u64_u64_div_u64 374fee9daa1bb2bb*d0bfbff7b8ae3ef/c169337bd42d5179 = 3bb2dbaffcbb961
27: ok  1473  1193   995   995   995   995   995   995   995   995 mul_u64_u64_div_u64 eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
28: ok   797   735   579   535   533   534   534   534   534   534 mul_u64_u64_div_u64 2d256d7ea3ae*7d0be4c31201/d63ec35ab3220357 = 1a599d6e
29: ok  1122   993   995   995   995   995   995   995   995   995 mul_u64_u64_div_u64 eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216
30: ok   931   789   674   604   605   606   605   606   605   606 mul_u64_u64_div_u64 2d256d7ea3ae*7d0be4c31201/63ec35ab3220357 = 387f55cef
31: ok  1536  1247  1068  1034  1034  1034  1034  1094  1032  1035 mul_u64_u64_div_u64 eac0d03ac10eeaf0*89be05dfa162ed9b/92bb000000000000 = dc5f7e8b334db07d
32: ok  1068  1019   995   995   995   995   995   995   995   995 mul_u64_u64_div_u64 eac0d03ac10eeaf0*89be05dfa162ed9b/92bb1679a41f0e4b = dc5f5cc9e270d216

	David

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code
  2025-07-11 21:17     ` David Laight
@ 2025-07-11 21:40       ` Nicolas Pitre
  2025-07-14  7:06         ` David Laight
  0 siblings, 1 reply; 44+ messages in thread
From: Nicolas Pitre @ 2025-07-11 21:40 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das, rostedt, lirongqing

On Fri, 11 Jul 2025, David Laight wrote:

> On Wed, 9 Jul 2025 15:24:20 +0100
> David Laight <david.laight.linux@gmail.com> wrote:
> 
> > On Sat, 14 Jun 2025 10:53:45 +0100
> > David Laight <david.laight.linux@gmail.com> wrote:
> > 
> > > Replace the bit by bit algorithm with one that generates 16 bits
> > > per iteration on 32bit architectures and 32 bits on 64bit ones.  
> > 
> > I've spent far too long doing some clock counting exercises on this code.
> > This is the latest version with some conditional compiles and comments
> > explaining the various optimisations.
> > I think the 'best' version is with -DMULDIV_OPT=0xc3
> > 
> >....
> > Some measurements on an ivy bridge.
> > These are the test vectors from the test module with a few extra values on the
> > end that pick different paths through this implementation.
> > The numbers are 'performance counter' deltas for 10 consecutive calls with the
> > same values.
> > So the latter values are with the branch predictor 'trained' to the test case.
> > The first few (larger) values show the cost of mispredicted branches.
> 
> I should have included the values for the same tests with the existing code.
> Added here, but this includes the change to test for overflow and divide
> by zero after the multiply (no ilog2 calls - those make the 32bit code worse).
> About the only cases where the old code is better are when the quotient
> only has two bits set.
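
For reference, the general shape of a divide that retires a 32bit quotient
digit per step is the textbook Knuth / Hacker's Delight "divlu" long
division.  A userspace sketch is below - it is not the code from the patch,
the helper name is made up, and it assumes d != 0 and u_hi < d (so the
quotient fits in 64 bits):

#include <stdint.h>

/* 128-by-64 division as two 32bit quotient digits (sketch only). */
static uint64_t div_128_by_64(uint64_t u_hi, uint64_t u_lo, uint64_t d, uint64_t *rem)
{
	const uint64_t base = 1ull << 32;
	uint64_t d_hi, d_lo, n32, n1, n0, n21, q1, q0, rhat;
	unsigned int s;

	/* Normalise so the top bit of the divisor is set. */
	s = __builtin_clzll(d);
	d <<= s;
	d_hi = d >> 32;
	d_lo = d & 0xffffffff;

	n32 = (u_hi << s) | (s ? u_lo >> (64 - s) : 0);
	n1 = (u_lo << s) >> 32;
	n0 = (u_lo << s) & 0xffffffff;

	/* First digit: estimate from the top bits, correct at most twice. */
	q1 = n32 / d_hi;
	rhat = n32 % d_hi;
	while (q1 >= base || q1 * d_lo > base * rhat + n1) {
		q1--;
		rhat += d_hi;
		if (rhat >= base)
			break;
	}
	n21 = n32 * base + n1 - q1 * d;	/* partial remainder, fits in 64 bits */

	/* Second digit: same estimate and correction. */
	q0 = n21 / d_hi;
	rhat = n21 % d_hi;
	while (q0 >= base || q0 * d_lo > base * rhat + n0) {
		q0--;
		rhat += d_hi;
		if (rhat >= base)
			break;
	}

	if (rem)
		*rem = (n21 * base + n0 - q0 * d) >> s;
	return q1 * base + q0;
}

The two digits are simply unrolled here; looping the digit step (and using
16bit digits on a 32bit build) is what gives the "16 or 32 bits per
iteration" behaviour described above.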

Kudos to you.

I don't think it is necessary to go deeper than this without going crazy.

Given you have your brain wrapped around all the variants at this point, 
I'm deferring to you to create a version for the kernel representing the 
best compromise between performance and maintainability.

Please consider including the following at the start of your next patch 
series version as this is headed for mainline already:
https://lkml.kernel.org/r/q246p466-1453-qon9-29so-37105116009q@onlyvoer.pbz


Nicolas

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code
  2025-07-11 21:40       ` Nicolas Pitre
@ 2025-07-14  7:06         ` David Laight
  0 siblings, 0 replies; 44+ messages in thread
From: David Laight @ 2025-07-14  7:06 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das, rostedt, lirongqing

On Fri, 11 Jul 2025 17:40:57 -0400 (EDT)
Nicolas Pitre <nico@fluxnic.net> wrote:

...
> Kudos to you.
> 
> I don't think it is necessary to go deeper than this without going crazy.

I'm crazy already :-)

	David

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2025-07-14  7:06 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-14  9:53 [PATCH v3 next 00/10] Implement mul_u64_u64_div_u64_roundup() David Laight
2025-06-14  9:53 ` [PATCH v3 next 01/10] lib: mul_u64_u64_div_u64() rename parameter 'c' to 'd' David Laight
2025-06-14  9:53 ` [PATCH v3 next 02/10] lib: mul_u64_u64_div_u64() Use WARN_ONCE() for divide errors David Laight
2025-06-14 15:17   ` Nicolas Pitre
2025-06-14 21:26     ` David Laight
2025-06-14 22:23       ` Nicolas Pitre
2025-06-14  9:53 ` [PATCH v3 next 03/10] lib: mul_u64_u64_div_u64() simplify check for a 64bit product David Laight
2025-06-14 14:01   ` Nicolas Pitre
2025-06-14  9:53 ` [PATCH v3 next 04/10] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup() David Laight
2025-06-14 14:06   ` Nicolas Pitre
2025-06-14  9:53 ` [PATCH v3 next 05/10] lib: Add tests for mul_u64_u64_div_u64_roundup() David Laight
2025-06-14 15:19   ` Nicolas Pitre
2025-06-17  4:30   ` Nicolas Pitre
2025-06-14  9:53 ` [PATCH v3 next 06/10] lib: test_mul_u64_u64_div_u64: Test both generic and arch versions David Laight
2025-06-14 15:25   ` Nicolas Pitre
2025-06-18  1:39     ` Nicolas Pitre
2025-06-14  9:53 ` [PATCH v3 next 07/10] lib: mul_u64_u64_div_u64() optimise multiply on 32bit x86 David Laight
2025-06-14 15:31   ` Nicolas Pitre
2025-06-14  9:53 ` [PATCH v3 next 08/10] lib: mul_u64_u64_div_u64() Separate multiply to a helper for clarity David Laight
2025-06-14 15:37   ` Nicolas Pitre
2025-06-14 21:30     ` David Laight
2025-06-14 22:27       ` Nicolas Pitre
2025-06-14  9:53 ` [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code David Laight
2025-06-17  4:16   ` Nicolas Pitre
2025-06-18  1:33     ` Nicolas Pitre
2025-06-18  9:16       ` David Laight
2025-06-18 15:39         ` Nicolas Pitre
2025-06-18 16:42           ` Nicolas Pitre
2025-06-18 17:54           ` David Laight
2025-06-18 20:12             ` Nicolas Pitre
2025-06-18 22:26               ` David Laight
2025-06-19  2:43                 ` Nicolas Pitre
2025-06-19  8:32                   ` David Laight
2025-06-26 21:46       ` David Laight
2025-06-27  3:48         ` Nicolas Pitre
2025-07-09 14:24   ` David Laight
2025-07-10  9:39     ` 答复: [????] " Li,Rongqing
2025-07-10 10:35       ` David Laight
2025-07-11 21:17     ` David Laight
2025-07-11 21:40       ` Nicolas Pitre
2025-07-14  7:06         ` David Laight
2025-06-14  9:53 ` [PATCH v3 next 10/10] lib: test_mul_u64_u64_div_u64: Test the 32bit code on 64bit David Laight
2025-06-14 10:27 ` [PATCH v3 next 00/10] Implement mul_u64_u64_div_u64_roundup() Peter Zijlstra
2025-06-14 11:59   ` David Laight
