public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup()
@ 2025-10-29 17:38 David Laight
  2025-10-29 17:38 ` [PATCH v4 next 1/9] lib: mul_u64_u64_div_u64() rename parameter 'c' to 'd' David Laight
                   ` (10 more replies)
  0 siblings, 11 replies; 32+ messages in thread
From: David Laight @ 2025-10-29 17:38 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: David Laight, u.kleine-koenig, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
This can be done simply by adding 'divisor - 1' to the 128bit product.
Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
existing code.
Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).

Only x86-64 has an optimised (asm) version of the function.
It is optimised to avoid the 'add c' when c is known to be zero.
In all other cases the extra code will be noise compared to the software
divide code.
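
The (a * b + c)/d building block and the two wrappers above can be sketched
in portable C, assuming a compiler with a 128bit integer type (the kernel's
generic code also has a 32bit fallback). The names match the proposed API
but this is an illustration, not the kernel implementation:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the proposed helper: (a * b + c) / d with a 128bit product. */
static uint64_t mul_u64_add_u64_div_u64(uint64_t a, uint64_t b,
					uint64_t c, uint64_t d)
{
	unsigned __int128 prod = (unsigned __int128)a * b + c;

	return (uint64_t)(prod / d);
}

/* Round down: add nothing. */
static uint64_t mul_u64_u64_div_u64(uint64_t a, uint64_t b, uint64_t d)
{
	return mul_u64_add_u64_div_u64(a, b, 0, d);
}

/* Round up: add 'divisor - 1' to the 128bit product before dividing. */
static uint64_t mul_u64_u64_div_u64_roundup(uint64_t a, uint64_t b, uint64_t d)
{
	return mul_u64_add_u64_div_u64(a, b, d - 1, d);
}
```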

The test module has been updated to test mul_u64_u64_div_u64_roundup() and
also enhanced to verify the C division code on x86-64 and the 32bit
division code on 64bit.

Changes for v2:
- Rename the 'divisor' parameter from 'c' to 'd'.
- Add an extra patch to use BUG_ON() to trap zero divisors.
- Remove the last patch that ran the C code on x86-64
  (I've a plan to do that differently).

Changes for v3:
- Replace the BUG_ON() (or panic in the original version) for zero
  divisors with a WARN_ONCE() and return zero.
- Remove the 'pre-multiply' check for small products.
  Completely non-trivial on 32bit systems.
- Use mul_u32_u32() and the new add_u64_u32() to stop gcc generating
  pretty much pessimal code for x86 with lots of register spills.
- Replace the 'bit at a time' divide with one that generates 16 bits
  per iteration on 32bit systems and 32 bits per iteration on 64bit.
  Massively faster, the tests run in under 1/3 the time.

Changes for v4:
No significant code changes.
- Rebase on 6.18-rc2
- Don't change the behaviour for overflow (return ~0ull) or divide
  by zero (execute ~0ul/0).
- Merge patches 8 and 9 to avoid bisection issues.
- Fix build of 32bit test cases on non-x86.
- Fix shell script that verifies test cases.

I've left the x86-64 faulting on both overflow and divide by zero.
The patch to add an exception table entry to return ~0 for both
doesn't seem to have been merged.
If merged it would make sense for the C version to return ~0 for both.
Callers can check for a result of ~0 and then check the divisor if
they care about overflow (etc).
(A valid quotient of ~0 is pretty unlikely and marginal changes to the
input values are likely to generate a real overflow.)
The code that faulted on overflow was about to get invalid results
because one of the 64bit inputs would itself wrap very soon.
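
The caller-side pattern described above can be sketched with a local
__int128 stand-in (not the kernel function, and purely illustrative):
return ~0 for both error cases and let a caller that cares disambiguate
afterwards:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a mul_u64_u64_div_u64() that returns ~0 on both
 * overflow and divide by zero (illustrative only). */
static uint64_t mul_div(uint64_t a, uint64_t b, uint64_t d)
{
	unsigned __int128 p = (unsigned __int128)a * b;

	if (d == 0 || p / d > UINT64_MAX)
		return UINT64_MAX;
	return (uint64_t)(p / d);
}

/* A quotient of ~0 is legitimate but rare; only then re-check the inputs. */
static int mul_div_failed(uint64_t a, uint64_t b, uint64_t d)
{
	if (mul_div(a, b, d) != UINT64_MAX)
		return 0;
	return d == 0 || (unsigned __int128)a * b / d > UINT64_MAX;
}
```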

David Laight (9):
  lib: mul_u64_u64_div_u64() rename parameter 'c' to 'd'
  lib: mul_u64_u64_div_u64() Combine overflow and divide by zero checks
  lib: mul_u64_u64_div_u64() simplify check for a 64bit product
  lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
  lib: Add tests for mul_u64_u64_div_u64_roundup()
  lib: test_mul_u64_u64_div_u64: Test both generic and arch versions
  lib: mul_u64_u64_div_u64() optimise multiply on 32bit x86
  lib: mul_u64_u64_div_u64() Optimise the divide code
  lib: test_mul_u64_u64_div_u64: Test the 32bit code on 64bit

 arch/x86/include/asm/div64.h        |  39 ++++--
 include/linux/math64.h              |  59 ++++++++-
 lib/math/div64.c                    | 183 ++++++++++++++++++---------
 lib/math/test_mul_u64_u64_div_u64.c | 190 ++++++++++++++++++++--------
 4 files changed, 352 insertions(+), 119 deletions(-)

-- 
2.39.5


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v4 next 1/9] lib: mul_u64_u64_div_u64() rename parameter 'c' to 'd'
  2025-10-29 17:38 [PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup() David Laight
@ 2025-10-29 17:38 ` David Laight
  2025-10-29 17:38 ` [PATCH v4 next 2/9] lib: mul_u64_u64_div_u64() Combine overflow and divide by zero checks David Laight
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: David Laight @ 2025-10-29 17:38 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: David Laight, u.kleine-koenig, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

Change the prototype from mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
to mul_u64_u64_div_u64(u64 a, u64 b, u64 d).
Using 'd' for 'divisor' makes more sense.

An upcoming change adds a 'c' parameter to calculate (a * b + c)/d.

Signed-off-by: David Laight <david.laight.linux@gmail.com>
Reviewed-by: Nicolas Pitre <npitre@baylibre.com>
---
 lib/math/div64.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/lib/math/div64.c b/lib/math/div64.c
index bf77b9843175..0ebff850fd4d 100644
--- a/lib/math/div64.c
+++ b/lib/math/div64.c
@@ -184,10 +184,10 @@ u32 iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
 EXPORT_SYMBOL(iter_div_u64_rem);
 
 #ifndef mul_u64_u64_div_u64
-u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
+u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
 {
 	if (ilog2(a) + ilog2(b) <= 62)
-		return div64_u64(a * b, c);
+		return div64_u64(a * b, d);
 
 #if defined(__SIZEOF_INT128__)
 
@@ -212,37 +212,37 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
 
 #endif
 
-	/* make sure c is not zero, trigger runtime exception otherwise */
-	if (unlikely(c == 0)) {
+	/* make sure d is not zero, trigger runtime exception otherwise */
+	if (unlikely(d == 0)) {
 		unsigned long zero = 0;
 
 		OPTIMIZER_HIDE_VAR(zero);
 		return ~0UL/zero;
 	}
 
-	int shift = __builtin_ctzll(c);
+	int shift = __builtin_ctzll(d);
 
 	/* try reducing the fraction in case the dividend becomes <= 64 bits */
 	if ((n_hi >> shift) == 0) {
 		u64 n = shift ? (n_lo >> shift) | (n_hi << (64 - shift)) : n_lo;
 
-		return div64_u64(n, c >> shift);
+		return div64_u64(n, d >> shift);
 		/*
 		 * The remainder value if needed would be:
-		 *   res = div64_u64_rem(n, c >> shift, &rem);
+		 *   res = div64_u64_rem(n, d >> shift, &rem);
 		 *   rem = (rem << shift) + (n_lo - (n << shift));
 		 */
 	}
 
-	if (n_hi >= c) {
+	if (n_hi >= d) {
 		/* overflow: result is unrepresentable in a u64 */
 		return -1;
 	}
 
 	/* Do the full 128 by 64 bits division */
 
-	shift = __builtin_clzll(c);
-	c <<= shift;
+	shift = __builtin_clzll(d);
+	d <<= shift;
 
 	int p = 64 + shift;
 	u64 res = 0;
@@ -257,8 +257,8 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
 		n_hi <<= shift;
 		n_hi |= n_lo >> (64 - shift);
 		n_lo <<= shift;
-		if (carry || (n_hi >= c)) {
-			n_hi -= c;
+		if (carry || (n_hi >= d)) {
+			n_hi -= d;
 			res |= 1ULL << p;
 		}
 	} while (n_hi);
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 next 2/9] lib: mul_u64_u64_div_u64() Combine overflow and divide by zero checks
  2025-10-29 17:38 [PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup() David Laight
  2025-10-29 17:38 ` [PATCH v4 next 1/9] lib: mul_u64_u64_div_u64() rename parameter 'c' to 'd' David Laight
@ 2025-10-29 17:38 ` David Laight
  2025-10-29 18:02   ` Nicolas Pitre
  2025-10-29 17:38 ` [PATCH v4 next 3/9] lib: mul_u64_u64_div_u64() simplify check for a 64bit product David Laight
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: David Laight @ 2025-10-29 17:38 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: David Laight, u.kleine-koenig, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

Since the overflow check always triggers when the divisor is zero,
move the check for divide by zero inside the overflow check.
This means there is only one test in the normal path.
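
The reasoning can be seen in a standalone sketch (hypothetical names,
not the kernel code): for unsigned values n_hi >= d always holds when
d == 0, so the zero-divisor case can hide inside the overflow branch:

```c
#include <assert.h>
#include <stdint.h>

/* Combined check: the divide-by-zero test lives inside the overflow
 * branch, so the normal path pays for a single comparison. */
static uint64_t check_errors(uint64_t n_hi, uint64_t d)
{
	if (n_hi >= d) {
		if (d == 0)
			return 2;	/* stand-in for the runtime exception */
		return 1;		/* overflow: quotient won't fit in u64 */
	}
	return 0;			/* normal path */
}
```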

Signed-off-by: David Laight <david.laight.linux@gmail.com>
---

V3 contained a different patch 2 that made different changes to the
error paths.

 lib/math/div64.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/lib/math/div64.c b/lib/math/div64.c
index 0ebff850fd4d..1092f41e878e 100644
--- a/lib/math/div64.c
+++ b/lib/math/div64.c
@@ -212,12 +212,16 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
 
 #endif
 
-	/* make sure d is not zero, trigger runtime exception otherwise */
-	if (unlikely(d == 0)) {
-		unsigned long zero = 0;
+	if (unlikely(n_hi >= d)) {
+		/* trigger runtime exception if divisor is zero */
+		if (d == 0) {
+			unsigned long zero = 0;
 
-		OPTIMIZER_HIDE_VAR(zero);
-		return ~0UL/zero;
+			OPTIMIZER_HIDE_VAR(zero);
+			return ~0UL/zero;
+		}
+		/* overflow: result is unrepresentable in a u64 */
+		return ~0ULL;
 	}
 
 	int shift = __builtin_ctzll(d);
@@ -234,11 +238,6 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
 		 */
 	}
 
-	if (n_hi >= d) {
-		/* overflow: result is unrepresentable in a u64 */
-		return -1;
-	}
-
 	/* Do the full 128 by 64 bits division */
 
 	shift = __builtin_clzll(d);
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 next 3/9] lib: mul_u64_u64_div_u64() simplify check for a 64bit product
  2025-10-29 17:38 [PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup() David Laight
  2025-10-29 17:38 ` [PATCH v4 next 1/9] lib: mul_u64_u64_div_u64() rename parameter 'c' to 'd' David Laight
  2025-10-29 17:38 ` [PATCH v4 next 2/9] lib: mul_u64_u64_div_u64() Combine overflow and divide by zero checks David Laight
@ 2025-10-29 17:38 ` David Laight
  2025-10-29 18:11   ` Nicolas Pitre
  2025-10-29 17:38 ` [PATCH v4 next 4/9] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup() David Laight
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: David Laight @ 2025-10-29 17:38 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: David Laight, u.kleine-koenig, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

If the product is only 64bits, div64_u64() can be used for the divide.
Replace the pre-multiply check (ilog2(a) + ilog2(b) <= 62) with a
simple post-multiply check that the high 64bits are zero.

This has the advantage of being simpler, more accurate and less code.
It will always be faster when the product is larger than 64bits.

Most 64bit cpus have a native 64x64=128 bit multiply; this is needed
(for the low 64bits) even when div64_u64() is called - so the early
check gains nothing and is just extra code.

32bit cpus will need a compare (etc) to generate the 64bit ilog2()
from two 32bit bit scans - so that is non-trivial.
(Never mind the mess of x86's 'bsr' and any oddball cpu without
fast bit-scan instructions.)
Whereas the additional instructions for the 128bit multiply result
are pretty much one multiply and two adds (typically the 'adc $0,%reg'
can be run in parallel with the instruction that follows).

The only outliers are 64bit systems without a 128bit multiply and
simple in-order 32bit ones with fast bit scan but needing extra
instructions to get the high bits of the multiply result.
I doubt it makes much difference to either, the latter is definitely
not mainstream.

If anyone is worried about the analysis they can look at the
generated code for x86 (especially when cmov isn't used).
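
The difference between the two checks can be illustrated with a small
sketch (assuming __int128; the bit-scan helper is illustrative, not the
kernel's ilog2()): the pre-multiply bound is conservative, while looking
at the high 64 bits of the product is exact:

```c
#include <assert.h>
#include <stdint.h>

static int ilog2_(uint64_t v)	/* illustrative: position of the top set bit */
{
	int r = -1;

	while (v) {
		r++;
		v >>= 1;
	}
	return r;
}

/* Old check: conservative, rejects some products that do fit in 64 bits. */
static int fits_pre(uint64_t a, uint64_t b)
{
	return ilog2_(a) + ilog2_(b) <= 62;
}

/* New check: multiply first, then test whether the high word is zero. */
static int fits_post(uint64_t a, uint64_t b)
{
	return (uint64_t)(((unsigned __int128)a * b) >> 64) == 0;
}
```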

Signed-off-by: David Laight <david.laight.linux@gmail.com>
---

Split from patch 3 for v2, unchanged since.

 lib/math/div64.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/math/div64.c b/lib/math/div64.c
index 1092f41e878e..7158d141b6e9 100644
--- a/lib/math/div64.c
+++ b/lib/math/div64.c
@@ -186,9 +186,6 @@ EXPORT_SYMBOL(iter_div_u64_rem);
 #ifndef mul_u64_u64_div_u64
 u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
 {
-	if (ilog2(a) + ilog2(b) <= 62)
-		return div64_u64(a * b, d);
-
 #if defined(__SIZEOF_INT128__)
 
 	/* native 64x64=128 bits multiplication */
@@ -224,6 +221,9 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
 		return ~0ULL;
 	}
 
+	if (!n_hi)
+		return div64_u64(n_lo, d);
+
 	int shift = __builtin_ctzll(d);
 
 	/* try reducing the fraction in case the dividend becomes <= 64 bits */
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 next 4/9] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
  2025-10-29 17:38 [PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup() David Laight
                   ` (2 preceding siblings ...)
  2025-10-29 17:38 ` [PATCH v4 next 3/9] lib: mul_u64_u64_div_u64() simplify check for a 64bit product David Laight
@ 2025-10-29 17:38 ` David Laight
  2025-10-29 18:17   ` Nicolas Pitre
  2025-10-31 20:59   ` David Laight
  2025-10-29 17:38 ` [PATCH v4 next 5/9] lib: Add tests for mul_u64_u64_div_u64_roundup() David Laight
                   ` (6 subsequent siblings)
  10 siblings, 2 replies; 32+ messages in thread
From: David Laight @ 2025-10-29 17:38 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: David Laight, u.kleine-koenig, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

The existing mul_u64_u64_div_u64() rounds down; a 'rounding up'
variant needs 'divisor - 1' added in between the multiply and
divide, so it cannot easily be done by a caller.

Add mul_u64_add_u64_div_u64(a, b, c, d) that calculates (a * b + c)/d
and implement the 'round down' and 'round up' using it.

Update the x86-64 asm to optimise for 'c' being a constant zero.

Add kerndoc definitions for all three functions.
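
The 32bit fallback in the diff below folds the two halves of 'c' into
separate u32 x u32 partial products. That this cannot wrap follows from
(2^32-1)^2 + 2*(2^32-1) == 2^64-1, demonstrated at the extreme values in
this hedged sketch (illustrative helper, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* A u32 x u32 product plus two u32 addends still fits in a u64:
 *   (2^32 - 1) * (2^32 - 1) + 2 * (2^32 - 1) == 2^64 - 1
 * so adding the halves of 'c' into the partial products is safe. */
static int u32_product_plus_two_u32_fits_u64(void)
{
	uint32_t m = UINT32_MAX;
	uint64_t v = (uint64_t)m * m;	/* 0xfffffffe00000001 */

	v += m;
	v += m;				/* exactly UINT64_MAX: no wrap */
	return v == UINT64_MAX;
}
```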

Signed-off-by: David Laight <david.laight.linux@gmail.com>
---

Changes for v2 (formerly patch 1/3):
- Reinstate the early call to div64_u64() on 32bit when 'c' is zero.
  Although I'm not convinced the path is common enough to be worth
  the two ilog2() calls.

Changes for v3 (formerly patch 3/4):
- The early call to div64_u64() has been removed by patch 3.
  Pretty much guaranteed to be a pessimisation.

Changes for v4:
- For x86-64 split the multiply, add and divide into three asm blocks.
  (gcc makes a pig's breakfast of (u128)a * b + c)
- Change the kerndoc since divide by zero will (probably) fault.

 arch/x86/include/asm/div64.h | 20 +++++++++------
 include/linux/math64.h       | 48 +++++++++++++++++++++++++++++++++++-
 lib/math/div64.c             | 14 ++++++-----
 3 files changed, 67 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/div64.h b/arch/x86/include/asm/div64.h
index 9931e4c7d73f..cabdc2d5a68f 100644
--- a/arch/x86/include/asm/div64.h
+++ b/arch/x86/include/asm/div64.h
@@ -84,21 +84,25 @@ static inline u64 mul_u32_u32(u32 a, u32 b)
  * Will generate an #DE when the result doesn't fit u64, could fix with an
  * __ex_table[] entry when it becomes an issue.
  */
-static inline u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div)
+static inline u64 mul_u64_add_u64_div_u64(u64 rax, u64 mul, u64 add, u64 div)
 {
-	u64 q;
+	u64 rdx;
 
-	asm ("mulq %2; divq %3" : "=a" (q)
-				: "a" (a), "rm" (mul), "rm" (div)
-				: "rdx");
+	asm ("mulq %[mul]" : "+a" (rax), "=d" (rdx) : [mul] "rm" (mul));
 
-	return q;
+	if (statically_true(add))
+		asm ("addq %[add], %[lo]; adcq $0, %[hi]" :
+			[lo] "+r" (rax), [hi] "+r" (rdx) : [add] "irm" (add));
+
+	asm ("divq %[div]" : "+a" (rax), "+d" (rdx) : [div] "rm" (div));
+
+	return rax;
 }
-#define mul_u64_u64_div_u64 mul_u64_u64_div_u64
+#define mul_u64_add_u64_div_u64 mul_u64_add_u64_div_u64
 
 static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 div)
 {
-	return mul_u64_u64_div_u64(a, mul, div);
+	return mul_u64_add_u64_div_u64(a, mul, 0, div);
 }
 #define mul_u64_u32_div	mul_u64_u32_div
 
diff --git a/include/linux/math64.h b/include/linux/math64.h
index 6aaccc1626ab..e889d850b7f1 100644
--- a/include/linux/math64.h
+++ b/include/linux/math64.h
@@ -282,7 +282,53 @@ static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 divisor)
 }
 #endif /* mul_u64_u32_div */
 
-u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div);
+/**
+ * mul_u64_add_u64_div_u64 - unsigned 64bit multiply, add, and divide
+ * @a: first unsigned 64bit multiplicand
+ * @b: second unsigned 64bit multiplicand
+ * @c: unsigned 64bit addend
+ * @d: unsigned 64bit divisor
+ *
+ * Multiply two 64bit values together to generate a 128bit product
+ * add a third value and then divide by a fourth.
+ * The Generic code divides by 0 if @d is zero and returns ~0 on overflow.
+ * Architecture specific code may trap on zero or overflow.
+ *
+ * Return: (@a * @b + @c) / @d
+ */
+u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d);
+
+/**
+ * mul_u64_u64_div_u64 - unsigned 64bit multiply and divide
+ * @a: first unsigned 64bit multiplicand
+ * @b: second unsigned 64bit multiplicand
+ * @d: unsigned 64bit divisor
+ *
+ * Multiply two 64bit values together to generate a 128bit product
+ * and then divide by a third value.
+ * The Generic code divides by 0 if @d is zero and returns ~0 on overflow.
+ * Architecture specific code may trap on zero or overflow.
+ *
+ * Return: @a * @b / @d
+ */
+#define mul_u64_u64_div_u64(a, b, d) mul_u64_add_u64_div_u64(a, b, 0, d)
+
+/**
+ * mul_u64_u64_div_u64_roundup - unsigned 64bit multiply and divide rounded up
+ * @a: first unsigned 64bit multiplicand
+ * @b: second unsigned 64bit multiplicand
+ * @d: unsigned 64bit divisor
+ *
+ * Multiply two 64bit values together to generate a 128bit product
+ * and then divide and round up.
+ * The Generic code divides by 0 if @d is zero and returns ~0 on overflow.
+ * Architecture specific code may trap on zero or overflow.
+ *
+ * Return: (@a * @b + @d - 1) / @d
+ */
+#define mul_u64_u64_div_u64_roundup(a, b, d) \
+	({ u64 _tmp = (d); mul_u64_add_u64_div_u64(a, b, _tmp - 1, _tmp); })
+
 
 /**
  * DIV64_U64_ROUND_UP - unsigned 64bit divide with 64bit divisor rounded up
diff --git a/lib/math/div64.c b/lib/math/div64.c
index 7158d141b6e9..25295daebde9 100644
--- a/lib/math/div64.c
+++ b/lib/math/div64.c
@@ -183,13 +183,13 @@ u32 iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
 }
 EXPORT_SYMBOL(iter_div_u64_rem);
 
-#ifndef mul_u64_u64_div_u64
-u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
+#ifndef mul_u64_add_u64_div_u64
+u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
 {
 #if defined(__SIZEOF_INT128__)
 
 	/* native 64x64=128 bits multiplication */
-	u128 prod = (u128)a * b;
+	u128 prod = (u128)a * b + c;
 	u64 n_lo = prod, n_hi = prod >> 64;
 
 #else
@@ -198,8 +198,10 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
 	u32 a_lo = a, a_hi = a >> 32, b_lo = b, b_hi = b >> 32;
 	u64 x, y, z;
 
-	x = (u64)a_lo * b_lo;
-	y = (u64)a_lo * b_hi + (u32)(x >> 32);
+	/* Since (x-1)(x-1) + 2(x-1) == x.x - 1 two u32 can be added to a u64 */
+	x = (u64)a_lo * b_lo + (u32)c;
+	y = (u64)a_lo * b_hi + (u32)(c >> 32);
+	y += (u32)(x >> 32);
 	z = (u64)a_hi * b_hi + (u32)(y >> 32);
 	y = (u64)a_hi * b_lo + (u32)y;
 	z += (u32)(y >> 32);
@@ -265,5 +267,5 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
 
 	return res;
 }
-EXPORT_SYMBOL(mul_u64_u64_div_u64);
+EXPORT_SYMBOL(mul_u64_add_u64_div_u64);
 #endif
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 next 5/9] lib: Add tests for mul_u64_u64_div_u64_roundup()
  2025-10-29 17:38 [PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup() David Laight
                   ` (3 preceding siblings ...)
  2025-10-29 17:38 ` [PATCH v4 next 4/9] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup() David Laight
@ 2025-10-29 17:38 ` David Laight
  2025-10-29 18:26   ` Nicolas Pitre
  2025-10-29 17:38 ` [PATCH v4 next 6/9] lib: test_mul_u64_u64_div_u64: Test both generic and arch versions David Laight
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: David Laight @ 2025-10-29 17:38 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: David Laight, u.kleine-koenig, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

Replicate the existing mul_u64_u64_div_u64() test cases with rounding up.
Update the shell script that verifies the table, removing the comment
markers so that it can be pasted directly into a shell.

Rename the divisor from 'c' to 'd' to match mul_u64_add_u64_div_u64().

If any tests fail, fail the module load with -EINVAL.

Signed-off-by: David Laight <david.laight.linux@gmail.com>
---

Changes for v3:
- Rename 'c' to 'd' to match mul_u64_add_u64_div_u64()

Changes for v4:
- Fix shell script that verifies the table

 lib/math/test_mul_u64_u64_div_u64.c | 122 +++++++++++++++++-----------
 1 file changed, 73 insertions(+), 49 deletions(-)

diff --git a/lib/math/test_mul_u64_u64_div_u64.c b/lib/math/test_mul_u64_u64_div_u64.c
index 58d058de4e73..4d5e4e5dac67 100644
--- a/lib/math/test_mul_u64_u64_div_u64.c
+++ b/lib/math/test_mul_u64_u64_div_u64.c
@@ -10,61 +10,73 @@
 #include <linux/printk.h>
 #include <linux/math64.h>
 
-typedef struct { u64 a; u64 b; u64 c; u64 result; } test_params;
+typedef struct { u64 a; u64 b; u64 d; u64 result; uint round_up;} test_params;
 
 static test_params test_values[] = {
 /* this contains many edge values followed by a couple random values */
-{                0xb,                0x7,                0x3,               0x19 },
-{         0xffff0000,         0xffff0000,                0xf, 0x1110eeef00000000 },
-{         0xffffffff,         0xffffffff,                0x1, 0xfffffffe00000001 },
-{         0xffffffff,         0xffffffff,                0x2, 0x7fffffff00000000 },
-{        0x1ffffffff,         0xffffffff,                0x2, 0xfffffffe80000000 },
-{        0x1ffffffff,         0xffffffff,                0x3, 0xaaaaaaa9aaaaaaab },
-{        0x1ffffffff,        0x1ffffffff,                0x4, 0xffffffff00000000 },
-{ 0xffff000000000000, 0xffff000000000000, 0xffff000000000001, 0xfffeffffffffffff },
-{ 0x3333333333333333, 0x3333333333333333, 0x5555555555555555, 0x1eb851eb851eb851 },
-{ 0x7fffffffffffffff,                0x2,                0x3, 0x5555555555555554 },
-{ 0xffffffffffffffff,                0x2, 0x8000000000000000,                0x3 },
-{ 0xffffffffffffffff,                0x2, 0xc000000000000000,                0x2 },
-{ 0xffffffffffffffff, 0x4000000000000004, 0x8000000000000000, 0x8000000000000007 },
-{ 0xffffffffffffffff, 0x4000000000000001, 0x8000000000000000, 0x8000000000000001 },
-{ 0xffffffffffffffff, 0x8000000000000001, 0xffffffffffffffff, 0x8000000000000001 },
-{ 0xfffffffffffffffe, 0x8000000000000001, 0xffffffffffffffff, 0x8000000000000000 },
-{ 0xffffffffffffffff, 0x8000000000000001, 0xfffffffffffffffe, 0x8000000000000001 },
-{ 0xffffffffffffffff, 0x8000000000000001, 0xfffffffffffffffd, 0x8000000000000002 },
-{ 0x7fffffffffffffff, 0xffffffffffffffff, 0xc000000000000000, 0xaaaaaaaaaaaaaaa8 },
-{ 0xffffffffffffffff, 0x7fffffffffffffff, 0xa000000000000000, 0xccccccccccccccca },
-{ 0xffffffffffffffff, 0x7fffffffffffffff, 0x9000000000000000, 0xe38e38e38e38e38b },
-{ 0x7fffffffffffffff, 0x7fffffffffffffff, 0x5000000000000000, 0xccccccccccccccc9 },
-{ 0xffffffffffffffff, 0xfffffffffffffffe, 0xffffffffffffffff, 0xfffffffffffffffe },
-{ 0xe6102d256d7ea3ae, 0x70a77d0be4c31201, 0xd63ec35ab3220357, 0x78f8bf8cc86c6e18 },
-{ 0xf53bae05cb86c6e1, 0x3847b32d2f8d32e0, 0xcfd4f55a647f403c, 0x42687f79d8998d35 },
-{ 0x9951c5498f941092, 0x1f8c8bfdf287a251, 0xa3c8dc5f81ea3fe2, 0x1d887cb25900091f },
-{ 0x374fee9daa1bb2bb, 0x0d0bfbff7b8ae3ef, 0xc169337bd42d5179, 0x03bb2dbaffcbb961 },
-{ 0xeac0d03ac10eeaf0, 0x89be05dfa162ed9b, 0x92bb1679a41f0e4b, 0xdc5f5cc9e270d216 },
+{                0xb,                0x7,                0x3,               0x19, 1 },
+{         0xffff0000,         0xffff0000,                0xf, 0x1110eeef00000000, 0 },
+{         0xffffffff,         0xffffffff,                0x1, 0xfffffffe00000001, 0 },
+{         0xffffffff,         0xffffffff,                0x2, 0x7fffffff00000000, 1 },
+{        0x1ffffffff,         0xffffffff,                0x2, 0xfffffffe80000000, 1 },
+{        0x1ffffffff,         0xffffffff,                0x3, 0xaaaaaaa9aaaaaaab, 0 },
+{        0x1ffffffff,        0x1ffffffff,                0x4, 0xffffffff00000000, 1 },
+{ 0xffff000000000000, 0xffff000000000000, 0xffff000000000001, 0xfffeffffffffffff, 1 },
+{ 0x3333333333333333, 0x3333333333333333, 0x5555555555555555, 0x1eb851eb851eb851, 1 },
+{ 0x7fffffffffffffff,                0x2,                0x3, 0x5555555555555554, 1 },
+{ 0xffffffffffffffff,                0x2, 0x8000000000000000,                0x3, 1 },
+{ 0xffffffffffffffff,                0x2, 0xc000000000000000,                0x2, 1 },
+{ 0xffffffffffffffff, 0x4000000000000004, 0x8000000000000000, 0x8000000000000007, 1 },
+{ 0xffffffffffffffff, 0x4000000000000001, 0x8000000000000000, 0x8000000000000001, 1 },
+{ 0xffffffffffffffff, 0x8000000000000001, 0xffffffffffffffff, 0x8000000000000001, 0 },
+{ 0xfffffffffffffffe, 0x8000000000000001, 0xffffffffffffffff, 0x8000000000000000, 1 },
+{ 0xffffffffffffffff, 0x8000000000000001, 0xfffffffffffffffe, 0x8000000000000001, 1 },
+{ 0xffffffffffffffff, 0x8000000000000001, 0xfffffffffffffffd, 0x8000000000000002, 1 },
+{ 0x7fffffffffffffff, 0xffffffffffffffff, 0xc000000000000000, 0xaaaaaaaaaaaaaaa8, 1 },
+{ 0xffffffffffffffff, 0x7fffffffffffffff, 0xa000000000000000, 0xccccccccccccccca, 1 },
+{ 0xffffffffffffffff, 0x7fffffffffffffff, 0x9000000000000000, 0xe38e38e38e38e38b, 1 },
+{ 0x7fffffffffffffff, 0x7fffffffffffffff, 0x5000000000000000, 0xccccccccccccccc9, 1 },
+{ 0xffffffffffffffff, 0xfffffffffffffffe, 0xffffffffffffffff, 0xfffffffffffffffe, 0 },
+{ 0xe6102d256d7ea3ae, 0x70a77d0be4c31201, 0xd63ec35ab3220357, 0x78f8bf8cc86c6e18, 1 },
+{ 0xf53bae05cb86c6e1, 0x3847b32d2f8d32e0, 0xcfd4f55a647f403c, 0x42687f79d8998d35, 1 },
+{ 0x9951c5498f941092, 0x1f8c8bfdf287a251, 0xa3c8dc5f81ea3fe2, 0x1d887cb25900091f, 1 },
+{ 0x374fee9daa1bb2bb, 0x0d0bfbff7b8ae3ef, 0xc169337bd42d5179, 0x03bb2dbaffcbb961, 1 },
+{ 0xeac0d03ac10eeaf0, 0x89be05dfa162ed9b, 0x92bb1679a41f0e4b, 0xdc5f5cc9e270d216, 1 },
 };
 
 /*
  * The above table can be verified with the following shell script:
- *
- * #!/bin/sh
- * sed -ne 's/^{ \+\(.*\), \+\(.*\), \+\(.*\), \+\(.*\) },$/\1 \2 \3 \4/p' \
- *     lib/math/test_mul_u64_u64_div_u64.c |
- * while read a b c r; do
- *   expected=$( printf "obase=16; ibase=16; %X * %X / %X\n" $a $b $c | bc )
- *   given=$( printf "%X\n" $r )
- *   if [ "$expected" = "$given" ]; then
- *     echo "$a * $b / $c = $r OK"
- *   else
- *     echo "$a * $b / $c = $r is wrong" >&2
- *     echo "should be equivalent to 0x$expected" >&2
- *     exit 1
- *   fi
- * done
+
+#!/bin/sh
+sed -ne 's/^{ \+\(.*\), \+\(.*\), \+\(.*\), \+\(.*\), \+\(.*\) },$/\1 \2 \3 \4 \5/p' \
+    lib/math/test_mul_u64_u64_div_u64.c |
+while read a b d r e; do
+  expected=$( printf "obase=16; ibase=16; %X * %X / %X\n" $a $b $d | bc )
+  given=$( printf "%X\n" $r )
+  if [ "$expected" = "$given" ]; then
+    echo "$a * $b  / $d = $r OK"
+  else
+    echo "$a * $b  / $d = $r is wrong" >&2
+    echo "should be equivalent to 0x$expected" >&2
+    exit 1
+  fi
+  expected=$( printf "obase=16; ibase=16; (%X * %X + %X) / %X\n" $a $b $((d-1)) $d | bc )
+  given=$( printf "%X\n" $((r + e)) )
+  if [ "$expected" = "$given" ]; then
+    echo "$a * $b +/ $d = $(printf '%#x' $((r + e))) OK"
+  else
+    echo "$a * $b +/ $d = $(printf '%#x' $((r + e))) is wrong" >&2
+    echo "should be equivalent to 0x$expected" >&2
+    exit 1
+  fi
+done
+
  */
 
 static int __init test_init(void)
 {
+	int errors = 0;
+	int tests = 0;
 	int i;
 
 	pr_info("Starting mul_u64_u64_div_u64() test\n");
@@ -72,19 +84,31 @@ static int __init test_init(void)
 	for (i = 0; i < ARRAY_SIZE(test_values); i++) {
 		u64 a = test_values[i].a;
 		u64 b = test_values[i].b;
-		u64 c = test_values[i].c;
+		u64 d = test_values[i].d;
 		u64 expected_result = test_values[i].result;
-		u64 result = mul_u64_u64_div_u64(a, b, c);
+		u64 result = mul_u64_u64_div_u64(a, b, d);
+		u64 result_up = mul_u64_u64_div_u64_roundup(a, b, d);
+
+		tests += 2;
 
 		if (result != expected_result) {
-			pr_err("ERROR: 0x%016llx * 0x%016llx / 0x%016llx\n", a, b, c);
+			pr_err("ERROR: 0x%016llx * 0x%016llx / 0x%016llx\n", a, b, d);
 			pr_err("ERROR: expected result: %016llx\n", expected_result);
 			pr_err("ERROR: obtained result: %016llx\n", result);
+			errors++;
+		}
+		expected_result += test_values[i].round_up;
+		if (result_up != expected_result) {
+			pr_err("ERROR: 0x%016llx * 0x%016llx +/ 0x%016llx\n", a, b, d);
+			pr_err("ERROR: expected result: %016llx\n", expected_result);
+			pr_err("ERROR: obtained result: %016llx\n", result_up);
+			errors++;
 		}
 	}
 
-	pr_info("Completed mul_u64_u64_div_u64() test\n");
-	return 0;
+	pr_info("Completed mul_u64_u64_div_u64() test, %d tests, %d errors\n",
+		tests, errors);
+	return errors ? -EINVAL : 0;
 }
 
 static void __exit test_exit(void)
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 next 6/9] lib: test_mul_u64_u64_div_u64: Test both generic and arch versions
  2025-10-29 17:38 [PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup() David Laight
                   ` (4 preceding siblings ...)
  2025-10-29 17:38 ` [PATCH v4 next 5/9] lib: Add tests for mul_u64_u64_div_u64_roundup() David Laight
@ 2025-10-29 17:38 ` David Laight
  2025-10-29 18:53   ` Nicolas Pitre
                     ` (2 more replies)
  2025-10-29 17:38 ` [PATCH v4 next 7/9] lib: mul_u64_u64_div_u64() optimise multiply on 32bit x86 David Laight
                   ` (4 subsequent siblings)
  10 siblings, 3 replies; 32+ messages in thread
From: David Laight @ 2025-10-29 17:38 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: David Laight, u.kleine-koenig, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

Change the #if in div64.c so that test_mul_u64_u64_div_u64.c
can compile and test the generic version (including the 'long multiply')
on architectures (eg amd64) that define their own copy.

Test the kernel version and the locally compiled version on all architectures.
Output the time taken (in ns) on the 'test completed' trace.

For reference, on my zen 5, the optimised version takes ~220ns and the
generic version ~3350ns.
Using the native multiply saves ~200ns and adding back the ilog2() 'optimisation'
test adds ~50ns.

Signed-off-by: David Laight <david.laight.linux@gmail.com>
---

Changes for v4:
- Fix build on non x86 (eg arm32)

 lib/math/div64.c                    |  8 +++--
 lib/math/test_mul_u64_u64_div_u64.c | 51 +++++++++++++++++++++++++----
 2 files changed, 50 insertions(+), 9 deletions(-)

diff --git a/lib/math/div64.c b/lib/math/div64.c
index 25295daebde9..f92e7160feb6 100644
--- a/lib/math/div64.c
+++ b/lib/math/div64.c
@@ -177,16 +177,18 @@ EXPORT_SYMBOL(div64_s64);
  * Iterative div/mod for use when dividend is not expected to be much
  * bigger than divisor.
  */
+#ifndef iter_div_u64_rem
 u32 iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
 {
 	return __iter_div_u64_rem(dividend, divisor, remainder);
 }
 EXPORT_SYMBOL(iter_div_u64_rem);
+#endif
 
-#ifndef mul_u64_add_u64_div_u64
+#if !defined(mul_u64_add_u64_div_u64) || defined(test_mul_u64_add_u64_div_u64)
 u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
 {
-#if defined(__SIZEOF_INT128__)
+#if defined(__SIZEOF_INT128__) && !defined(test_mul_u64_add_u64_div_u64)
 
 	/* native 64x64=128 bits multiplication */
 	u128 prod = (u128)a * b + c;
@@ -267,5 +269,7 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
 
 	return res;
 }
+#if !defined(test_mul_u64_add_u64_div_u64)
 EXPORT_SYMBOL(mul_u64_add_u64_div_u64);
 #endif
+#endif
diff --git a/lib/math/test_mul_u64_u64_div_u64.c b/lib/math/test_mul_u64_u64_div_u64.c
index 4d5e4e5dac67..a3c5e54f37ef 100644
--- a/lib/math/test_mul_u64_u64_div_u64.c
+++ b/lib/math/test_mul_u64_u64_div_u64.c
@@ -73,21 +73,34 @@ done
 
  */
 
-static int __init test_init(void)
+static u64 test_mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d);
+
+static int __init test_run(unsigned int fn_no, const char *fn_name)
 {
+	u64 start_time;
 	int errors = 0;
 	int tests = 0;
 	int i;
 
-	pr_info("Starting mul_u64_u64_div_u64() test\n");
+	start_time = ktime_get_ns();
 
 	for (i = 0; i < ARRAY_SIZE(test_values); i++) {
 		u64 a = test_values[i].a;
 		u64 b = test_values[i].b;
 		u64 d = test_values[i].d;
 		u64 expected_result = test_values[i].result;
-		u64 result = mul_u64_u64_div_u64(a, b, d);
-		u64 result_up = mul_u64_u64_div_u64_roundup(a, b, d);
+		u64 result, result_up;
+
+		switch (fn_no) {
+		default:
+			result = mul_u64_u64_div_u64(a, b, d);
+			result_up = mul_u64_u64_div_u64_roundup(a, b, d);
+			break;
+		case 1:
+			result = test_mul_u64_add_u64_div_u64(a, b, 0, d);
+			result_up = test_mul_u64_add_u64_div_u64(a, b, d - 1, d);
+			break;
+		}
 
 		tests += 2;
 
@@ -106,15 +119,39 @@ static int __init test_init(void)
 		}
 	}
 
-	pr_info("Completed mul_u64_u64_div_u64() test, %d tests, %d errors\n",
-		tests, errors);
-	return errors ? -EINVAL : 0;
+	pr_info("Completed %s() test, %d tests, %d errors, %llu ns\n",
+		fn_name, tests, errors, ktime_get_ns() - start_time);
+	return errors;
+}
+
+static int __init test_init(void)
+{
+	pr_info("Starting mul_u64_u64_div_u64() test\n");
+	if (test_run(0, "mul_u64_u64_div_u64"))
+		return -EINVAL;
+	if (test_run(1, "test_mul_u64_u64_div_u64"))
+		return -EINVAL;
+	return 0;
 }
 
 static void __exit test_exit(void)
 {
 }
 
+/* Compile the generic mul_u64_add_u64_div_u64() code */
+#define __div64_32 __div64_32
+#define div_s64_rem div_s64_rem
+#define div64_u64_rem div64_u64_rem
+#define div64_u64 div64_u64
+#define div64_s64 div64_s64
+#define iter_div_u64_rem iter_div_u64_rem
+
+#undef mul_u64_add_u64_div_u64
+#define mul_u64_add_u64_div_u64 test_mul_u64_add_u64_div_u64
+#define test_mul_u64_add_u64_div_u64 test_mul_u64_add_u64_div_u64
+
+#include "div64.c"
+
 module_init(test_init);
 module_exit(test_exit);
 
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 next 7/9] lib: mul_u64_u64_div_u64() optimise multiply on 32bit x86
  2025-10-29 17:38 [PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup() David Laight
                   ` (5 preceding siblings ...)
  2025-10-29 17:38 ` [PATCH v4 next 6/9] lib: test_mul_u64_u64_div_u64: Test both generic and arch versions David Laight
@ 2025-10-29 17:38 ` David Laight
  2025-10-29 19:01   ` Nicolas Pitre
  2025-10-29 17:38 ` [PATCH v4 next 8/9] lib: mul_u64_u64_div_u64() Optimise the divide code David Laight
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: David Laight @ 2025-10-29 17:38 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: David Laight, u.kleine-koenig, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

gcc generates horrid code for both ((u64)u32_a * u32_b) and (u64_a + u32_b).
As well as the extra instructions, it can generate a lot of spills to stack
(including spills of constant zeros and even multiplies by constant zero).

mul_u32_u32() already exists to optimise the multiply.
Add a similar add_u64_u32() for the addition.
Disable both for clang - it generates better code without them.

Move the 64x64 => 128 multiply into a static inline helper function
for code clarity.
There is no need for the a/b_hi/lo variables; the implicit casts on the
function calls do the work for us.
This should have minimal effect on the generated code.

Use mul_u32_u32() and add_u64_u32() in the 64x64 => 128 multiply
in mul_u64_add_u64_div_u64().

Signed-off-by: David Laight <david.laight.linux@gmail.com>
---

Changes for v4:
- merge in patch 8.
- Add comments about gcc being 'broken' for mixed 32/64 bit maths.
  clang doesn't have the same issues.
- Use a #define for mul_add() to avoid 'defined but not used'
  errors.

 arch/x86/include/asm/div64.h | 19 +++++++++++++++++
 include/linux/math64.h       | 11 ++++++++++
 lib/math/div64.c             | 40 +++++++++++++++++++++++-------------
 3 files changed, 56 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/div64.h b/arch/x86/include/asm/div64.h
index cabdc2d5a68f..a18c045aa8a1 100644
--- a/arch/x86/include/asm/div64.h
+++ b/arch/x86/include/asm/div64.h
@@ -60,6 +60,12 @@ static inline u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder)
 }
 #define div_u64_rem	div_u64_rem
 
+/*
+ * gcc tends to zero extend 32bit values and do full 64bit maths.
+ * Define asm functions that avoid this.
+ * (clang generates better code for the C versions.)
+ */
+#ifndef __clang__
 static inline u64 mul_u32_u32(u32 a, u32 b)
 {
 	u32 high, low;
@@ -71,6 +77,19 @@ static inline u64 mul_u32_u32(u32 a, u32 b)
 }
 #define mul_u32_u32 mul_u32_u32
 
+static inline u64 add_u64_u32(u64 a, u32 b)
+{
+	u32 high = a >> 32, low = a;
+
+	asm ("addl %[b], %[low]; adcl $0, %[high]"
+		: [low] "+r" (low), [high] "+r" (high)
+		: [b] "rm" (b) );
+
+	return low | (u64)high << 32;
+}
+#define add_u64_u32 add_u64_u32
+#endif
+
 /*
  * __div64_32() is never called on x86, so prevent the
  * generic definition from getting built.
diff --git a/include/linux/math64.h b/include/linux/math64.h
index e889d850b7f1..cc305206d89f 100644
--- a/include/linux/math64.h
+++ b/include/linux/math64.h
@@ -158,6 +158,17 @@ static inline u64 mul_u32_u32(u32 a, u32 b)
 }
 #endif
 
+#ifndef add_u64_u32
+/*
+ * Many a GCC version also messes this up.
+ * Zero extending b and then spilling everything to stack.
+ */
+static inline u64 add_u64_u32(u64 a, u32 b)
+{
+	return a + b;
+}
+#endif
+
 #if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
 
 #ifndef mul_u64_u32_shr
diff --git a/lib/math/div64.c b/lib/math/div64.c
index f92e7160feb6..f6da7b5fb69e 100644
--- a/lib/math/div64.c
+++ b/lib/math/div64.c
@@ -186,33 +186,45 @@ EXPORT_SYMBOL(iter_div_u64_rem);
 #endif
 
 #if !defined(mul_u64_add_u64_div_u64) || defined(test_mul_u64_add_u64_div_u64)
-u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
-{
+
+#define mul_add(a, b, c) add_u64_u32(mul_u32_u32(a, b), c)
+
 #if defined(__SIZEOF_INT128__) && !defined(test_mul_u64_add_u64_div_u64)
 
+static inline u64 mul_u64_u64_add_u64(u64 *p_lo, u64 a, u64 b, u64 c)
+{
 	/* native 64x64=128 bits multiplication */
 	u128 prod = (u128)a * b + c;
-	u64 n_lo = prod, n_hi = prod >> 64;
+
+	*p_lo = prod;
+	return prod >> 64;
+}
 
 #else
 
-	/* perform a 64x64=128 bits multiplication manually */
-	u32 a_lo = a, a_hi = a >> 32, b_lo = b, b_hi = b >> 32;
+static inline u64 mul_u64_u64_add_u64(u64 *p_lo, u64 a, u64 b, u64 c)
+{
+	/* perform a 64x64=128 bits multiplication in 32bit chunks */
 	u64 x, y, z;
 
 	/* Since (x-1)(x-1) + 2(x-1) == x.x - 1 two u32 can be added to a u64 */
-	x = (u64)a_lo * b_lo + (u32)c;
-	y = (u64)a_lo * b_hi + (u32)(c >> 32);
-	y += (u32)(x >> 32);
-	z = (u64)a_hi * b_hi + (u32)(y >> 32);
-	y = (u64)a_hi * b_lo + (u32)y;
-	z += (u32)(y >> 32);
-	x = (y << 32) + (u32)x;
-
-	u64 n_lo = x, n_hi = z;
+	x = mul_add(a, b, c);
+	y = mul_add(a, b >> 32, c >> 32);
+	y = add_u64_u32(y, x >> 32);
+	z = mul_add(a >> 32, b >> 32, y >> 32);
+	y = mul_add(a >> 32, b, y);
+	*p_lo = (y << 32) + (u32)x;
+	return add_u64_u32(z, y >> 32);
+}
 
 #endif
 
+u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
+{
+	u64 n_lo, n_hi;
+
+	n_hi = mul_u64_u64_add_u64(&n_lo, a, b, c);
+
 	if (unlikely(n_hi >= d)) {
 		/* trigger runtime exception if divisor is zero */
 		if (d == 0) {
-- 
2.39.5



* [PATCH v4 next 8/9] lib: mul_u64_u64_div_u64() Optimise the divide code
  2025-10-29 17:38 [PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup() David Laight
                   ` (6 preceding siblings ...)
  2025-10-29 17:38 ` [PATCH v4 next 7/9] lib: mul_u64_u64_div_u64() optimise multiply on 32bit x86 David Laight
@ 2025-10-29 17:38 ` David Laight
  2025-10-29 20:47   ` Nicolas Pitre
  2025-10-29 17:38 ` [PATCH v4 next 9/9] lib: test_mul_u64_u64_div_u64: Test the 32bit code on 64bit David Laight
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: David Laight @ 2025-10-29 17:38 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: David Laight, u.kleine-koenig, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

Replace the bit-by-bit algorithm with one that generates 16 bits
per iteration on 32bit architectures and 32 bits on 64bit ones.

On my zen 5 this reduces the time for the tests (using the generic
code) from ~3350ns to ~1000ns.

Running the 32bit algorithm on 64bit x86 takes ~1500ns.
It'll be slightly slower on a real 32bit system, mostly due
to register pressure.

The savings for 32bit x86 are much higher (tested in userspace).
The worst case (lots of bits in the quotient) drops from ~900 clocks
to ~130 (pretty much independent of the arguments).
Other 32bit architectures may see better savings.

It is possible to optimise for divisors that span less than
__LONG_WIDTH__/2 bits. However, I suspect they don't happen that often,
and it doesn't remove any slow cpu divide instructions, which dominate
the result.

Typical improvements for 64bit random divides:
               old     new
sandy bridge:  470     150
haswell:       400     144
piledriver:    960     467   I think rdpmc is very slow.
zen5:          244      80
(Timing is 'rdpmc; mul_div(); rdpmc' with the multiply depending on the
first rdpmc and the second rdpmc depending on the quotient.)

Signed-off-by: David Laight <david.laight.linux@gmail.com>
---

Algorithm unchanged from v3.

 lib/math/div64.c | 124 ++++++++++++++++++++++++++++++++---------------
 1 file changed, 85 insertions(+), 39 deletions(-)

diff --git a/lib/math/div64.c b/lib/math/div64.c
index f6da7b5fb69e..4e4e962261c3 100644
--- a/lib/math/div64.c
+++ b/lib/math/div64.c
@@ -190,7 +190,6 @@ EXPORT_SYMBOL(iter_div_u64_rem);
 #define mul_add(a, b, c) add_u64_u32(mul_u32_u32(a, b), c)
 
 #if defined(__SIZEOF_INT128__) && !defined(test_mul_u64_add_u64_div_u64)
-
 static inline u64 mul_u64_u64_add_u64(u64 *p_lo, u64 a, u64 b, u64 c)
 {
 	/* native 64x64=128 bits multiplication */
@@ -199,9 +198,7 @@ static inline u64 mul_u64_u64_add_u64(u64 *p_lo, u64 a, u64 b, u64 c)
 	*p_lo = prod;
 	return prod >> 64;
 }
-
 #else
-
 static inline u64 mul_u64_u64_add_u64(u64 *p_lo, u64 a, u64 b, u64 c)
 {
 	/* perform a 64x64=128 bits multiplication in 32bit chunks */
@@ -216,12 +213,37 @@ static inline u64 mul_u64_u64_add_u64(u64 *p_lo, u64 a, u64 b, u64 c)
 	*p_lo = (y << 32) + (u32)x;
 	return add_u64_u32(z, y >> 32);
 }
+#endif
+
+#ifndef BITS_PER_ITER
+#define BITS_PER_ITER (__LONG_WIDTH__ >= 64 ? 32 : 16)
+#endif
+
+#if BITS_PER_ITER == 32
+#define mul_u64_long_add_u64(p_lo, a, b, c) mul_u64_u64_add_u64(p_lo, a, b, c)
+#define add_u64_long(a, b) ((a) + (b))
+#else
+#undef BITS_PER_ITER
+#define BITS_PER_ITER 16
+static inline u32 mul_u64_long_add_u64(u64 *p_lo, u64 a, u32 b, u64 c)
+{
+	u64 n_lo = mul_add(a, b, c);
+	u64 n_med = mul_add(a >> 32, b, c >> 32);
+
+	n_med = add_u64_u32(n_med, n_lo >> 32);
+	*p_lo = n_med << 32 | (u32)n_lo;
+	return n_med >> 32;
+}
 
+#define add_u64_long(a, b) add_u64_u32(a, b)
 #endif
 
 u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
 {
-	u64 n_lo, n_hi;
+	unsigned long d_msig, q_digit;
+	unsigned int reps, d_z_hi;
+	u64 quotient, n_lo, n_hi;
+	u32 overflow;
 
 	n_hi = mul_u64_u64_add_u64(&n_lo, a, b, c);
 
@@ -240,46 +262,70 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
 	if (!n_hi)
 		return div64_u64(n_lo, d);
 
-	int shift = __builtin_ctzll(d);
-
-	/* try reducing the fraction in case the dividend becomes <= 64 bits */
-	if ((n_hi >> shift) == 0) {
-		u64 n = shift ? (n_lo >> shift) | (n_hi << (64 - shift)) : n_lo;
-
-		return div64_u64(n, d >> shift);
-		/*
-		 * The remainder value if needed would be:
-		 *   res = div64_u64_rem(n, d >> shift, &rem);
-		 *   rem = (rem << shift) + (n_lo - (n << shift));
-		 */
+	/* Left align the divisor, shifting the dividend to match */
+	d_z_hi = __builtin_clzll(d);
+	if (d_z_hi) {
+		d <<= d_z_hi;
+		n_hi = n_hi << d_z_hi | n_lo >> (64 - d_z_hi);
+		n_lo <<= d_z_hi;
 	}
 
-	/* Do the full 128 by 64 bits division */
-
-	shift = __builtin_clzll(d);
-	d <<= shift;
-
-	int p = 64 + shift;
-	u64 res = 0;
-	bool carry;
+	reps = 64 / BITS_PER_ITER;
+	/* Optimise loop count for small dividends */
+	if (!(u32)(n_hi >> 32)) {
+		reps -= 32 / BITS_PER_ITER;
+		n_hi = n_hi << 32 | n_lo >> 32;
+		n_lo <<= 32;
+	}
+#if BITS_PER_ITER == 16
+	if (!(u32)(n_hi >> 48)) {
+		reps--;
+		n_hi = add_u64_u32(n_hi << 16, n_lo >> 48);
+		n_lo <<= 16;
+	}
+#endif
 
-	do {
-		carry = n_hi >> 63;
-		shift = carry ? 1 : __builtin_clzll(n_hi);
-		if (p < shift)
-			break;
-		p -= shift;
-		n_hi <<= shift;
-		n_hi |= n_lo >> (64 - shift);
-		n_lo <<= shift;
-		if (carry || (n_hi >= d)) {
-			n_hi -= d;
-			res |= 1ULL << p;
+	/* Invert the dividend so we can use add instead of subtract. */
+	n_lo = ~n_lo;
+	n_hi = ~n_hi;
+
+	/*
+	 * Get the most significant BITS_PER_ITER bits of the divisor.
+	 * This is used to get a low 'guestimate' of the quotient digit.
+	 */
+	d_msig = (d >> (64 - BITS_PER_ITER)) + 1;
+
+	/*
+	 * Now do a 'long division' with BITS_PER_ITER bit 'digits'.
+	 * The 'guess' quotient digit can be low and BITS_PER_ITER+1 bits.
+	 * The worst case is dividing ~0 by 0x8000 which requires two subtracts.
+	 */
+	quotient = 0;
+	while (reps--) {
+		q_digit = (unsigned long)(~n_hi >> (64 - 2 * BITS_PER_ITER)) / d_msig;
+		/* Shift 'n' left to align with the product q_digit * d */
+		overflow = n_hi >> (64 - BITS_PER_ITER);
+		n_hi = add_u64_u32(n_hi << BITS_PER_ITER, n_lo >> (64 - BITS_PER_ITER));
+		n_lo <<= BITS_PER_ITER;
+		/* Add product to negated divisor */
+		overflow += mul_u64_long_add_u64(&n_hi, d, q_digit, n_hi);
+		/* Adjust for the q_digit 'guestimate' being low */
+		while (overflow < 0xffffffff >> (32 - BITS_PER_ITER)) {
+			q_digit++;
+			n_hi += d;
+			overflow += n_hi < d;
 		}
-	} while (n_hi);
-	/* The remainder value if needed would be n_hi << p */
+		quotient = add_u64_long(quotient << BITS_PER_ITER, q_digit);
+	}
 
-	return res;
+	/*
+	 * The above only ensures the remainder doesn't overflow,
+	 * it can still be possible to add (aka subtract) another copy
+	 * of the divisor.
+	 */
+	if ((n_hi + d) > n_hi)
+		quotient++;
+	return quotient;
 }
 #if !defined(test_mul_u64_add_u64_div_u64)
 EXPORT_SYMBOL(mul_u64_add_u64_div_u64);
-- 
2.39.5



* [PATCH v4 next 9/9] lib: test_mul_u64_u64_div_u64: Test the 32bit code on 64bit
  2025-10-29 17:38 [PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup() David Laight
                   ` (7 preceding siblings ...)
  2025-10-29 17:38 ` [PATCH v4 next 8/9] lib: mul_u64_u64_div_u64() Optimise the divide code David Laight
@ 2025-10-29 17:38 ` David Laight
  2025-10-29 20:48   ` Nicolas Pitre
  2025-10-31  4:29 ` [PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup() Andrew Morton
  2025-10-31 13:52 ` Oleg Nesterov
  10 siblings, 1 reply; 32+ messages in thread
From: David Laight @ 2025-10-29 17:38 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: David Laight, u.kleine-koenig, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

There are slight differences in the mul_u64_add_u64_div_u64() code
between 32bit and 64bit systems.

Compile and test the 32bit version on 64bit hosts for better test
coverage.

Signed-off-by: David Laight <david.laight.linux@gmail.com>
---

Changes for v4:
- Fix build on non-x86-64

 lib/math/test_mul_u64_u64_div_u64.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/lib/math/test_mul_u64_u64_div_u64.c b/lib/math/test_mul_u64_u64_div_u64.c
index a3c5e54f37ef..57d5c7158b10 100644
--- a/lib/math/test_mul_u64_u64_div_u64.c
+++ b/lib/math/test_mul_u64_u64_div_u64.c
@@ -74,6 +74,10 @@ done
  */
 
 static u64 test_mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d);
+#if __LONG_WIDTH__ >= 64
+#define TEST_32BIT_DIV
+static u64 test_mul_u64_add_u64_div_u64_32bit(u64 a, u64 b, u64 c, u64 d);
+#endif
 
 static int __init test_run(unsigned int fn_no, const char *fn_name)
 {
@@ -100,6 +104,12 @@ static int __init test_run(unsigned int fn_no, const char *fn_name)
 			result = test_mul_u64_add_u64_div_u64(a, b, 0, d);
 			result_up = test_mul_u64_add_u64_div_u64(a, b, d - 1, d);
 			break;
+#ifdef TEST_32BIT_DIV
+		case 2:
+			result = test_mul_u64_add_u64_div_u64_32bit(a, b, 0, d);
+			result_up = test_mul_u64_add_u64_div_u64_32bit(a, b, d - 1, d);
+			break;
+#endif
 		}
 
 		tests += 2;
@@ -131,6 +141,10 @@ static int __init test_init(void)
 		return -EINVAL;
 	if (test_run(1, "test_mul_u64_u64_div_u64"))
 		return -EINVAL;
+#ifdef TEST_32BIT_DIV
+	if (test_run(2, "test_mul_u64_u64_div_u64_32bit"))
+		return -EINVAL;
+#endif
 	return 0;
 }
 
@@ -152,6 +166,21 @@ static void __exit test_exit(void)
 
 #include "div64.c"
 
+#ifdef TEST_32BIT_DIV
+/* Recompile the generic code for 32bit long */
+#undef test_mul_u64_add_u64_div_u64
+#define test_mul_u64_add_u64_div_u64 test_mul_u64_add_u64_div_u64_32bit
+#undef BITS_PER_ITER
+#define BITS_PER_ITER 16
+
+#define mul_u64_u64_add_u64 mul_u64_u64_add_u64_32bit
+#undef mul_u64_long_add_u64
+#undef add_u64_long
+#undef mul_add
+
+#include "div64.c"
+#endif
+
 module_init(test_init);
 module_exit(test_exit);
 
-- 
2.39.5



* Re: [PATCH v4 next 2/9] lib: mul_u64_u64_div_u64() Combine overflow and divide by zero checks
  2025-10-29 17:38 ` [PATCH v4 next 2/9] lib: mul_u64_u64_div_u64() Combine overflow and divide by zero checks David Laight
@ 2025-10-29 18:02   ` Nicolas Pitre
  0 siblings, 0 replies; 32+ messages in thread
From: Nicolas Pitre @ 2025-10-29 18:02 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

On Wed, 29 Oct 2025, David Laight wrote:

> Since the overflow check always triggers when the divisor is zero
> move the check for divide by zero inside the overflow check.
> This means there is only one test in the normal path.
> 
> Signed-off-by: David Laight <david.laight.linux@gmail.com>

Reviewed-by: Nicolas Pitre <npitre@baylibre.com>

> ---
> 
> V3 contained a different patch 2 that did different changes to the
> error paths.
> 
>  lib/math/div64.c | 19 +++++++++----------
>  1 file changed, 9 insertions(+), 10 deletions(-)
> 
> diff --git a/lib/math/div64.c b/lib/math/div64.c
> index 0ebff850fd4d..1092f41e878e 100644
> --- a/lib/math/div64.c
> +++ b/lib/math/div64.c
> @@ -212,12 +212,16 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
>  
>  #endif
>  
> -	/* make sure d is not zero, trigger runtime exception otherwise */
> -	if (unlikely(d == 0)) {
> -		unsigned long zero = 0;
> +	if (unlikely(n_hi >= d)) {
> +		/* trigger runtime exception if divisor is zero */
> +		if (d == 0) {
> +			unsigned long zero = 0;
>  
> -		OPTIMIZER_HIDE_VAR(zero);
> -		return ~0UL/zero;
> +			OPTIMIZER_HIDE_VAR(zero);
> +			return ~0UL/zero;
> +		}
> +		/* overflow: result is unrepresentable in a u64 */
> +		return ~0ULL;
>  	}
>  
>  	int shift = __builtin_ctzll(d);
> @@ -234,11 +238,6 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
>  		 */
>  	}
>  
> -	if (n_hi >= d) {
> -		/* overflow: result is unrepresentable in a u64 */
> -		return -1;
> -	}
> -
>  	/* Do the full 128 by 64 bits division */
>  
>  	shift = __builtin_clzll(d);
> -- 
> 2.39.5
> 
> 


* Re: [PATCH v4 next 3/9] lib: mul_u64_u64_div_u64() simplify check for a 64bit product
  2025-10-29 17:38 ` [PATCH v4 next 3/9] lib: mul_u64_u64_div_u64() simplify check for a 64bit product David Laight
@ 2025-10-29 18:11   ` Nicolas Pitre
  2025-10-31  9:19     ` David Laight
  0 siblings, 1 reply; 32+ messages in thread
From: Nicolas Pitre @ 2025-10-29 18:11 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

On Wed, 29 Oct 2025, David Laight wrote:

> If the product is only 64bits div64_u64() can be used for the divide.
> Replace the pre-multiply check (ilog2(a) + ilog2(b) <= 62) with a
> simple post-multiply check that the high 64bits are zero.
> 
> This has the advantage of being simpler, more accurate and less code.
> It will always be faster when the product is larger than 64bits.
> 
> Most 64bit cpu have a native 64x64=128 bit multiply, this is needed
> (for the low 64bits) even when div64_u64() is called - so the early
> check gains nothing and is just extra code.
> 
> 32bit cpu will need a compare (etc) to generate the 64bit ilog2()
> from two 32bit bit scans - so that is non-trivial.
> (Never mind the mess of x86's 'bsr' and any oddball cpu without
> fast bit-scan instructions.)
> Whereas the additional instructions for the 128bit multiply result
> are pretty much one multiply and two adds (typically the 'adc $0,%reg'
> can be run in parallel with the instruction that follows).
> 
> The only outliers are 64bit systems without 128bit multiply and
> simple in order 32bit ones with fast bit scan but needing extra
> instructions to get the high bits of the multiply result.
> I doubt it makes much difference to either, the latter is definitely
> not mainstream.
> 
> If anyone is worried about the analysis they can look at the
> generated code for x86 (especially when cmov isn't used).
> 
> Signed-off-by: David Laight <david.laight.linux@gmail.com>

Comment below.


> ---
> 
> Split from patch 3 for v2, unchanged since.
> 
>  lib/math/div64.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/math/div64.c b/lib/math/div64.c
> index 1092f41e878e..7158d141b6e9 100644
> --- a/lib/math/div64.c
> +++ b/lib/math/div64.c
> @@ -186,9 +186,6 @@ EXPORT_SYMBOL(iter_div_u64_rem);
>  #ifndef mul_u64_u64_div_u64
>  u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
>  {
> -	if (ilog2(a) + ilog2(b) <= 62)
> -		return div64_u64(a * b, d);
> -
>  #if defined(__SIZEOF_INT128__)
>  
>  	/* native 64x64=128 bits multiplication */
> @@ -224,6 +221,9 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
>  		return ~0ULL;
>  	}
>  
> +	if (!n_hi)
> +		return div64_u64(n_lo, d);

I'd move this before the overflow test. If this path is taken then
you'll save one test; same cost otherwise.



* Re: [PATCH v4 next 4/9] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
  2025-10-29 17:38 ` [PATCH v4 next 4/9] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup() David Laight
@ 2025-10-29 18:17   ` Nicolas Pitre
  2025-10-31 20:59   ` David Laight
  1 sibling, 0 replies; 32+ messages in thread
From: Nicolas Pitre @ 2025-10-29 18:17 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

On Wed, 29 Oct 2025, David Laight wrote:

> The existing mul_u64_u64_div_u64() rounds down, a 'rounding up'
> variant needs 'divisor - 1' adding in between the multiply and
> divide so cannot easily be done by a caller.
> 
> Add mul_u64_add_u64_div_u64(a, b, c, d) that calculates (a * b + c)/d
> and implement the 'round down' and 'round up' using it.
> 
> Update the x86-64 asm to optimise for 'c' being a constant zero.
> 
> Add kerndoc definitions for all three functions.
> 
> Signed-off-by: David Laight <david.laight.linux@gmail.com>

Reviewed-by: Nicolas Pitre <npitre@baylibre.com>


> ---
> 
> Changes for v2 (formally patch 1/3):
> - Reinstate the early call to div64_u64() on 32bit when 'c' is zero.
>   Although I'm not convinced the path is common enough to be worth
>   the two ilog2() calls.
> 
> Changes for v3 (formally patch 3/4):
> - The early call to div64_u64() has been removed by patch 3.
>   Pretty much guaranteed to be a pessimisation.
> 
> Changes for v4:
> - For x86-64 split the multiply, add and divide into three asm blocks.
>   (gcc makes a pigs breakfast of (u128)a * b + c)
> - Change the kerndoc since divide by zero will (probably) fault.
> 
>  arch/x86/include/asm/div64.h | 20 +++++++++------
>  include/linux/math64.h       | 48 +++++++++++++++++++++++++++++++++++-
>  lib/math/div64.c             | 14 ++++++-----
>  3 files changed, 67 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/include/asm/div64.h b/arch/x86/include/asm/div64.h
> index 9931e4c7d73f..cabdc2d5a68f 100644
> --- a/arch/x86/include/asm/div64.h
> +++ b/arch/x86/include/asm/div64.h
> @@ -84,21 +84,25 @@ static inline u64 mul_u32_u32(u32 a, u32 b)
>   * Will generate an #DE when the result doesn't fit u64, could fix with an
>   * __ex_table[] entry when it becomes an issue.
>   */
> -static inline u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div)
> +static inline u64 mul_u64_add_u64_div_u64(u64 rax, u64 mul, u64 add, u64 div)
>  {
> -	u64 q;
> +	u64 rdx;
>  
> -	asm ("mulq %2; divq %3" : "=a" (q)
> -				: "a" (a), "rm" (mul), "rm" (div)
> -				: "rdx");
> +	asm ("mulq %[mul]" : "+a" (rax), "=d" (rdx) : [mul] "rm" (mul));
>  
> -	return q;
> +	if (statically_true(add))
> +		asm ("addq %[add], %[lo]; adcq $0, %[hi]" :
> +			[lo] "+r" (rax), [hi] "+r" (rdx) : [add] "irm" (add));
> +
> +	asm ("divq %[div]" : "+a" (rax), "+d" (rdx) : [div] "rm" (div));
> +
> +	return rax;
>  }
> -#define mul_u64_u64_div_u64 mul_u64_u64_div_u64
> +#define mul_u64_add_u64_div_u64 mul_u64_add_u64_div_u64
>  
>  static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 div)
>  {
> -	return mul_u64_u64_div_u64(a, mul, div);
> +	return mul_u64_add_u64_div_u64(a, mul, 0, div);
>  }
>  #define mul_u64_u32_div	mul_u64_u32_div
>  
> diff --git a/include/linux/math64.h b/include/linux/math64.h
> index 6aaccc1626ab..e889d850b7f1 100644
> --- a/include/linux/math64.h
> +++ b/include/linux/math64.h
> @@ -282,7 +282,53 @@ static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 divisor)
>  }
>  #endif /* mul_u64_u32_div */
>  
> -u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div);
> +/**
> + * mul_u64_add_u64_div_u64 - unsigned 64bit multiply, add, and divide
> + * @a: first unsigned 64bit multiplicand
> + * @b: second unsigned 64bit multiplicand
> + * @c: unsigned 64bit addend
> + * @d: unsigned 64bit divisor
> + *
> + * Multiply two 64bit values together to generate a 128bit product
> + * add a third value and then divide by a fourth.
> + * The Generic code divides by 0 if @d is zero and returns ~0 on overflow.
> + * Architecture specific code may trap on zero or overflow.
> + *
> + * Return: (@a * @b + @c) / @d
> + */
> +u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d);
> +
> +/**
> + * mul_u64_u64_div_u64 - unsigned 64bit multiply and divide
> + * @a: first unsigned 64bit multiplicand
> + * @b: second unsigned 64bit multiplicand
> + * @d: unsigned 64bit divisor
> + *
> + * Multiply two 64bit values together to generate a 128bit product
> + * and then divide by a third value.
> + * The Generic code divides by 0 if @d is zero and returns ~0 on overflow.
> + * Architecture specific code may trap on zero or overflow.
> + *
> + * Return: @a * @b / @d
> + */
> +#define mul_u64_u64_div_u64(a, b, d) mul_u64_add_u64_div_u64(a, b, 0, d)
> +
> +/**
> + * mul_u64_u64_div_u64_roundup - unsigned 64bit multiply and divide rounded up
> + * @a: first unsigned 64bit multiplicand
> + * @b: second unsigned 64bit multiplicand
> + * @d: unsigned 64bit divisor
> + *
> + * Multiply two 64bit values together to generate a 128bit product
> + * and then divide and round up.
> + * The Generic code divides by 0 if @d is zero and returns ~0 on overflow.
> + * Architecture specific code may trap on zero or overflow.
> + *
> + * Return: (@a * @b + @d - 1) / @d
> + */
> +#define mul_u64_u64_div_u64_roundup(a, b, d) \
> +	({ u64 _tmp = (d); mul_u64_add_u64_div_u64(a, b, _tmp - 1, _tmp); })
> +
>  
>  /**
>   * DIV64_U64_ROUND_UP - unsigned 64bit divide with 64bit divisor rounded up
> diff --git a/lib/math/div64.c b/lib/math/div64.c
> index 7158d141b6e9..25295daebde9 100644
> --- a/lib/math/div64.c
> +++ b/lib/math/div64.c
> @@ -183,13 +183,13 @@ u32 iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
>  }
>  EXPORT_SYMBOL(iter_div_u64_rem);
>  
> -#ifndef mul_u64_u64_div_u64
> -u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
> +#ifndef mul_u64_add_u64_div_u64
> +u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
>  {
>  #if defined(__SIZEOF_INT128__)
>  
>  	/* native 64x64=128 bits multiplication */
> -	u128 prod = (u128)a * b;
> +	u128 prod = (u128)a * b + c;
>  	u64 n_lo = prod, n_hi = prod >> 64;
>  
>  #else
> @@ -198,8 +198,10 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
>  	u32 a_lo = a, a_hi = a >> 32, b_lo = b, b_hi = b >> 32;
>  	u64 x, y, z;
>  
> -	x = (u64)a_lo * b_lo;
> -	y = (u64)a_lo * b_hi + (u32)(x >> 32);
> +	/* Since (x-1)(x-1) + 2(x-1) == x.x - 1 two u32 can be added to a u64 */
> +	x = (u64)a_lo * b_lo + (u32)c;
> +	y = (u64)a_lo * b_hi + (u32)(c >> 32);
> +	y += (u32)(x >> 32);
>  	z = (u64)a_hi * b_hi + (u32)(y >> 32);
>  	y = (u64)a_hi * b_lo + (u32)y;
>  	z += (u32)(y >> 32);
> @@ -265,5 +267,5 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
>  
>  	return res;
>  }
> -EXPORT_SYMBOL(mul_u64_u64_div_u64);
> +EXPORT_SYMBOL(mul_u64_add_u64_div_u64);
>  #endif
> -- 
> 2.39.5
> 
> 


* Re: [PATCH v4 next 5/9] lib: Add tests for mul_u64_u64_div_u64_roundup()
  2025-10-29 17:38 ` [PATCH v4 next 5/9] lib: Add tests for mul_u64_u64_div_u64_roundup() David Laight
@ 2025-10-29 18:26   ` Nicolas Pitre
  0 siblings, 0 replies; 32+ messages in thread
From: Nicolas Pitre @ 2025-10-29 18:26 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

On Wed, 29 Oct 2025, David Laight wrote:

> Replicate the existing mul_u64_u64_div_u64() test cases with round up.
> Update the shell script that verifies the table, remove the comment
> markers so that it can be directly pasted into a shell.
> 
> Rename the divisor from 'c' to 'd' to match mul_u64_add_u64_div_u64().
> 
> If any test fails, fail the module load with -EINVAL.
> 
> Signed-off-by: David Laight <david.laight.linux@gmail.com>

Reviewed-by: Nicolas Pitre <npitre@baylibre.com>


> ---
> 
> Changes for v3:
> - Rename 'c' to 'd' to match mul_u64_add_u64_div_u64()
> 
> Changes for v4:
> - Fix shell script that verifies the table
> 
>  lib/math/test_mul_u64_u64_div_u64.c | 122 +++++++++++++++++-----------
>  1 file changed, 73 insertions(+), 49 deletions(-)
> 
> diff --git a/lib/math/test_mul_u64_u64_div_u64.c b/lib/math/test_mul_u64_u64_div_u64.c
> index 58d058de4e73..4d5e4e5dac67 100644
> --- a/lib/math/test_mul_u64_u64_div_u64.c
> +++ b/lib/math/test_mul_u64_u64_div_u64.c
> @@ -10,61 +10,73 @@
>  #include <linux/printk.h>
>  #include <linux/math64.h>
>  
> -typedef struct { u64 a; u64 b; u64 c; u64 result; } test_params;
> +typedef struct { u64 a; u64 b; u64 d; u64 result; uint round_up;} test_params;
>  
>  static test_params test_values[] = {
>  /* this contains many edge values followed by a couple random values */
> -{                0xb,                0x7,                0x3,               0x19 },
> -{         0xffff0000,         0xffff0000,                0xf, 0x1110eeef00000000 },
> -{         0xffffffff,         0xffffffff,                0x1, 0xfffffffe00000001 },
> -{         0xffffffff,         0xffffffff,                0x2, 0x7fffffff00000000 },
> -{        0x1ffffffff,         0xffffffff,                0x2, 0xfffffffe80000000 },
> -{        0x1ffffffff,         0xffffffff,                0x3, 0xaaaaaaa9aaaaaaab },
> -{        0x1ffffffff,        0x1ffffffff,                0x4, 0xffffffff00000000 },
> -{ 0xffff000000000000, 0xffff000000000000, 0xffff000000000001, 0xfffeffffffffffff },
> -{ 0x3333333333333333, 0x3333333333333333, 0x5555555555555555, 0x1eb851eb851eb851 },
> -{ 0x7fffffffffffffff,                0x2,                0x3, 0x5555555555555554 },
> -{ 0xffffffffffffffff,                0x2, 0x8000000000000000,                0x3 },
> -{ 0xffffffffffffffff,                0x2, 0xc000000000000000,                0x2 },
> -{ 0xffffffffffffffff, 0x4000000000000004, 0x8000000000000000, 0x8000000000000007 },
> -{ 0xffffffffffffffff, 0x4000000000000001, 0x8000000000000000, 0x8000000000000001 },
> -{ 0xffffffffffffffff, 0x8000000000000001, 0xffffffffffffffff, 0x8000000000000001 },
> -{ 0xfffffffffffffffe, 0x8000000000000001, 0xffffffffffffffff, 0x8000000000000000 },
> -{ 0xffffffffffffffff, 0x8000000000000001, 0xfffffffffffffffe, 0x8000000000000001 },
> -{ 0xffffffffffffffff, 0x8000000000000001, 0xfffffffffffffffd, 0x8000000000000002 },
> -{ 0x7fffffffffffffff, 0xffffffffffffffff, 0xc000000000000000, 0xaaaaaaaaaaaaaaa8 },
> -{ 0xffffffffffffffff, 0x7fffffffffffffff, 0xa000000000000000, 0xccccccccccccccca },
> -{ 0xffffffffffffffff, 0x7fffffffffffffff, 0x9000000000000000, 0xe38e38e38e38e38b },
> -{ 0x7fffffffffffffff, 0x7fffffffffffffff, 0x5000000000000000, 0xccccccccccccccc9 },
> -{ 0xffffffffffffffff, 0xfffffffffffffffe, 0xffffffffffffffff, 0xfffffffffffffffe },
> -{ 0xe6102d256d7ea3ae, 0x70a77d0be4c31201, 0xd63ec35ab3220357, 0x78f8bf8cc86c6e18 },
> -{ 0xf53bae05cb86c6e1, 0x3847b32d2f8d32e0, 0xcfd4f55a647f403c, 0x42687f79d8998d35 },
> -{ 0x9951c5498f941092, 0x1f8c8bfdf287a251, 0xa3c8dc5f81ea3fe2, 0x1d887cb25900091f },
> -{ 0x374fee9daa1bb2bb, 0x0d0bfbff7b8ae3ef, 0xc169337bd42d5179, 0x03bb2dbaffcbb961 },
> -{ 0xeac0d03ac10eeaf0, 0x89be05dfa162ed9b, 0x92bb1679a41f0e4b, 0xdc5f5cc9e270d216 },
> +{                0xb,                0x7,                0x3,               0x19, 1 },
> +{         0xffff0000,         0xffff0000,                0xf, 0x1110eeef00000000, 0 },
> +{         0xffffffff,         0xffffffff,                0x1, 0xfffffffe00000001, 0 },
> +{         0xffffffff,         0xffffffff,                0x2, 0x7fffffff00000000, 1 },
> +{        0x1ffffffff,         0xffffffff,                0x2, 0xfffffffe80000000, 1 },
> +{        0x1ffffffff,         0xffffffff,                0x3, 0xaaaaaaa9aaaaaaab, 0 },
> +{        0x1ffffffff,        0x1ffffffff,                0x4, 0xffffffff00000000, 1 },
> +{ 0xffff000000000000, 0xffff000000000000, 0xffff000000000001, 0xfffeffffffffffff, 1 },
> +{ 0x3333333333333333, 0x3333333333333333, 0x5555555555555555, 0x1eb851eb851eb851, 1 },
> +{ 0x7fffffffffffffff,                0x2,                0x3, 0x5555555555555554, 1 },
> +{ 0xffffffffffffffff,                0x2, 0x8000000000000000,                0x3, 1 },
> +{ 0xffffffffffffffff,                0x2, 0xc000000000000000,                0x2, 1 },
> +{ 0xffffffffffffffff, 0x4000000000000004, 0x8000000000000000, 0x8000000000000007, 1 },
> +{ 0xffffffffffffffff, 0x4000000000000001, 0x8000000000000000, 0x8000000000000001, 1 },
> +{ 0xffffffffffffffff, 0x8000000000000001, 0xffffffffffffffff, 0x8000000000000001, 0 },
> +{ 0xfffffffffffffffe, 0x8000000000000001, 0xffffffffffffffff, 0x8000000000000000, 1 },
> +{ 0xffffffffffffffff, 0x8000000000000001, 0xfffffffffffffffe, 0x8000000000000001, 1 },
> +{ 0xffffffffffffffff, 0x8000000000000001, 0xfffffffffffffffd, 0x8000000000000002, 1 },
> +{ 0x7fffffffffffffff, 0xffffffffffffffff, 0xc000000000000000, 0xaaaaaaaaaaaaaaa8, 1 },
> +{ 0xffffffffffffffff, 0x7fffffffffffffff, 0xa000000000000000, 0xccccccccccccccca, 1 },
> +{ 0xffffffffffffffff, 0x7fffffffffffffff, 0x9000000000000000, 0xe38e38e38e38e38b, 1 },
> +{ 0x7fffffffffffffff, 0x7fffffffffffffff, 0x5000000000000000, 0xccccccccccccccc9, 1 },
> +{ 0xffffffffffffffff, 0xfffffffffffffffe, 0xffffffffffffffff, 0xfffffffffffffffe, 0 },
> +{ 0xe6102d256d7ea3ae, 0x70a77d0be4c31201, 0xd63ec35ab3220357, 0x78f8bf8cc86c6e18, 1 },
> +{ 0xf53bae05cb86c6e1, 0x3847b32d2f8d32e0, 0xcfd4f55a647f403c, 0x42687f79d8998d35, 1 },
> +{ 0x9951c5498f941092, 0x1f8c8bfdf287a251, 0xa3c8dc5f81ea3fe2, 0x1d887cb25900091f, 1 },
> +{ 0x374fee9daa1bb2bb, 0x0d0bfbff7b8ae3ef, 0xc169337bd42d5179, 0x03bb2dbaffcbb961, 1 },
> +{ 0xeac0d03ac10eeaf0, 0x89be05dfa162ed9b, 0x92bb1679a41f0e4b, 0xdc5f5cc9e270d216, 1 },
>  };
>  
>  /*
>   * The above table can be verified with the following shell script:
> - *
> - * #!/bin/sh
> - * sed -ne 's/^{ \+\(.*\), \+\(.*\), \+\(.*\), \+\(.*\) },$/\1 \2 \3 \4/p' \
> - *     lib/math/test_mul_u64_u64_div_u64.c |
> - * while read a b c r; do
> - *   expected=$( printf "obase=16; ibase=16; %X * %X / %X\n" $a $b $c | bc )
> - *   given=$( printf "%X\n" $r )
> - *   if [ "$expected" = "$given" ]; then
> - *     echo "$a * $b / $c = $r OK"
> - *   else
> - *     echo "$a * $b / $c = $r is wrong" >&2
> - *     echo "should be equivalent to 0x$expected" >&2
> - *     exit 1
> - *   fi
> - * done
> +
> +#!/bin/sh
> +sed -ne 's/^{ \+\(.*\), \+\(.*\), \+\(.*\), \+\(.*\), \+\(.*\) },$/\1 \2 \3 \4 \5/p' \
> +    lib/math/test_mul_u64_u64_div_u64.c |
> +while read a b d r e; do
> +  expected=$( printf "obase=16; ibase=16; %X * %X / %X\n" $a $b $d | bc )
> +  given=$( printf "%X\n" $r )
> +  if [ "$expected" = "$given" ]; then
> +    echo "$a * $b  / $d = $r OK"
> +  else
> +    echo "$a * $b  / $d = $r is wrong" >&2
> +    echo "should be equivalent to 0x$expected" >&2
> +    exit 1
> +  fi
> +  expected=$( printf "obase=16; ibase=16; (%X * %X + %X) / %X\n" $a $b $((d-1)) $d | bc )
> +  given=$( printf "%X\n" $((r + e)) )
> +  if [ "$expected" = "$given" ]; then
> +    echo "$a * $b +/ $d = $(printf '%#x' $((r + e))) OK"
> +  else
> +    echo "$a * $b +/ $d = $(printf '%#x' $((r + e))) is wrong" >&2
> +    echo "should be equivalent to 0x$expected" >&2
> +    exit 1
> +  fi
> +done
> +
>   */
>  
>  static int __init test_init(void)
>  {
> +	int errors = 0;
> +	int tests = 0;
>  	int i;
>  
>  	pr_info("Starting mul_u64_u64_div_u64() test\n");
> @@ -72,19 +84,31 @@ static int __init test_init(void)
>  	for (i = 0; i < ARRAY_SIZE(test_values); i++) {
>  		u64 a = test_values[i].a;
>  		u64 b = test_values[i].b;
> -		u64 c = test_values[i].c;
> +		u64 d = test_values[i].d;
>  		u64 expected_result = test_values[i].result;
> -		u64 result = mul_u64_u64_div_u64(a, b, c);
> +		u64 result = mul_u64_u64_div_u64(a, b, d);
> +		u64 result_up = mul_u64_u64_div_u64_roundup(a, b, d);
> +
> +		tests += 2;
>  
>  		if (result != expected_result) {
> -			pr_err("ERROR: 0x%016llx * 0x%016llx / 0x%016llx\n", a, b, c);
> +			pr_err("ERROR: 0x%016llx * 0x%016llx / 0x%016llx\n", a, b, d);
>  			pr_err("ERROR: expected result: %016llx\n", expected_result);
>  			pr_err("ERROR: obtained result: %016llx\n", result);
> +			errors++;
> +		}
> +		expected_result += test_values[i].round_up;
> +		if (result_up != expected_result) {
> +			pr_err("ERROR: 0x%016llx * 0x%016llx +/ 0x%016llx\n", a, b, d);
> +			pr_err("ERROR: expected result: %016llx\n", expected_result);
> +			pr_err("ERROR: obtained result: %016llx\n", result_up);
> +			errors++;
>  		}
>  	}
>  
> -	pr_info("Completed mul_u64_u64_div_u64() test\n");
> -	return 0;
> +	pr_info("Completed mul_u64_u64_div_u64() test, %d tests, %d errors\n",
> +		tests, errors);
> +	return errors ? -EINVAL : 0;
>  }
>  
>  static void __exit test_exit(void)
> -- 
> 2.39.5
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 next 6/9] lib: test_mul_u64_u64_div_u64: Test both generic and arch versions
  2025-10-29 17:38 ` [PATCH v4 next 6/9] lib: test_mul_u64_u64_div_u64: Test both generic and arch versions David Laight
@ 2025-10-29 18:53   ` Nicolas Pitre
  2025-11-01 19:35   ` kernel test robot
  2025-11-01 20:59   ` kernel test robot
  2 siblings, 0 replies; 32+ messages in thread
From: Nicolas Pitre @ 2025-10-29 18:53 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

On Wed, 29 Oct 2025, David Laight wrote:

> Change the #if in div64.c so that test_mul_u64_u64_div_u64.c
> can compile and test the generic version (including the 'long multiply')
> on architectures (eg amd64) that define their own copy.
> 
> Test the kernel version and the locally compiled version on all arch.
> Output the time taken (in ns) on the 'test completed' trace.
> 
> For reference, on my zen 5, the optimised version takes ~220ns and the
> generic version ~3350ns.
> Using the native multiply saves ~200ns and adding back the ilog2() 'optimisation'
> test adds ~50ns.
> 
> Signed-off-by: David Laight <david.laight.linux@gmail.com>

Reviewed-by: Nicolas Pitre <npitre@baylibre.com>

> ---
> 
> Changes for v4:
> - Fix build on non x86 (eg arm32)
> 
>  lib/math/div64.c                    |  8 +++--
>  lib/math/test_mul_u64_u64_div_u64.c | 51 +++++++++++++++++++++++++----
>  2 files changed, 50 insertions(+), 9 deletions(-)
> 
> diff --git a/lib/math/div64.c b/lib/math/div64.c
> index 25295daebde9..f92e7160feb6 100644
> --- a/lib/math/div64.c
> +++ b/lib/math/div64.c
> @@ -177,16 +177,18 @@ EXPORT_SYMBOL(div64_s64);
>   * Iterative div/mod for use when dividend is not expected to be much
>   * bigger than divisor.
>   */
> +#ifndef iter_div_u64_rem
>  u32 iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
>  {
>  	return __iter_div_u64_rem(dividend, divisor, remainder);
>  }
>  EXPORT_SYMBOL(iter_div_u64_rem);
> +#endif
>  
> -#ifndef mul_u64_add_u64_div_u64
> +#if !defined(mul_u64_add_u64_div_u64) || defined(test_mul_u64_add_u64_div_u64)
>  u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
>  {
> -#if defined(__SIZEOF_INT128__)
> +#if defined(__SIZEOF_INT128__) && !defined(test_mul_u64_add_u64_div_u64)
>  
>  	/* native 64x64=128 bits multiplication */
>  	u128 prod = (u128)a * b + c;
> @@ -267,5 +269,7 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
>  
>  	return res;
>  }
> +#if !defined(test_mul_u64_add_u64_div_u64)
>  EXPORT_SYMBOL(mul_u64_add_u64_div_u64);
>  #endif
> +#endif
> diff --git a/lib/math/test_mul_u64_u64_div_u64.c b/lib/math/test_mul_u64_u64_div_u64.c
> index 4d5e4e5dac67..a3c5e54f37ef 100644
> --- a/lib/math/test_mul_u64_u64_div_u64.c
> +++ b/lib/math/test_mul_u64_u64_div_u64.c
> @@ -73,21 +73,34 @@ done
>  
>   */
>  
> -static int __init test_init(void)
> +static u64 test_mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d);
> +
> +static int __init test_run(unsigned int fn_no, const char *fn_name)
>  {
> +	u64 start_time;
>  	int errors = 0;
>  	int tests = 0;
>  	int i;
>  
> -	pr_info("Starting mul_u64_u64_div_u64() test\n");
> +	start_time = ktime_get_ns();
>  
>  	for (i = 0; i < ARRAY_SIZE(test_values); i++) {
>  		u64 a = test_values[i].a;
>  		u64 b = test_values[i].b;
>  		u64 d = test_values[i].d;
>  		u64 expected_result = test_values[i].result;
> -		u64 result = mul_u64_u64_div_u64(a, b, d);
> -		u64 result_up = mul_u64_u64_div_u64_roundup(a, b, d);
> +		u64 result, result_up;
> +
> +		switch (fn_no) {
> +		default:
> +			result = mul_u64_u64_div_u64(a, b, d);
> +			result_up = mul_u64_u64_div_u64_roundup(a, b, d);
> +			break;
> +		case 1:
> +			result = test_mul_u64_add_u64_div_u64(a, b, 0, d);
> +			result_up = test_mul_u64_add_u64_div_u64(a, b, d - 1, d);
> +			break;
> +		}
>  
>  		tests += 2;
>  
> @@ -106,15 +119,39 @@ static int __init test_init(void)
>  		}
>  	}
>  
> -	pr_info("Completed mul_u64_u64_div_u64() test, %d tests, %d errors\n",
> -		tests, errors);
> -	return errors ? -EINVAL : 0;
> +	pr_info("Completed %s() test, %d tests, %d errors, %llu ns\n",
> +		fn_name, tests, errors, ktime_get_ns() - start_time);
> +	return errors;
> +}
> +
> +static int __init test_init(void)
> +{
> +	pr_info("Starting mul_u64_u64_div_u64() test\n");
> +	if (test_run(0, "mul_u64_u64_div_u64"))
> +		return -EINVAL;
> +	if (test_run(1, "test_mul_u64_u64_div_u64"))
> +		return -EINVAL;
> +	return 0;
>  }
>  
>  static void __exit test_exit(void)
>  {
>  }
>  
> +/* Compile the generic mul_u64_add_u64_div_u64() code */
> +#define __div64_32 __div64_32
> +#define div_s64_rem div_s64_rem
> +#define div64_u64_rem div64_u64_rem
> +#define div64_u64 div64_u64
> +#define div64_s64 div64_s64
> +#define iter_div_u64_rem iter_div_u64_rem
> +
> +#undef mul_u64_add_u64_div_u64
> +#define mul_u64_add_u64_div_u64 test_mul_u64_add_u64_div_u64
> +#define test_mul_u64_add_u64_div_u64 test_mul_u64_add_u64_div_u64
> +
> +#include "div64.c"
> +
>  module_init(test_init);
>  module_exit(test_exit);
>  
> -- 
> 2.39.5
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 next 7/9] lib: mul_u64_u64_div_u64() optimise multiply on 32bit x86
  2025-10-29 17:38 ` [PATCH v4 next 7/9] lib: mul_u64_u64_div_u64() optimise multiply on 32bit x86 David Laight
@ 2025-10-29 19:01   ` Nicolas Pitre
  0 siblings, 0 replies; 32+ messages in thread
From: Nicolas Pitre @ 2025-10-29 19:01 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

On Wed, 29 Oct 2025, David Laight wrote:

> gcc generates horrid code for both ((u64)u32_a * u32_b) and (u64_a + u32_b).
> As well as the extra instructions it can generate a lot of spills to stack
> (including spills of constant zeros and even multiplies by constant zero).
> 
> mul_u32_u32() already exists to optimise the multiply.
> Add a similar add_u64_u32() for the addition.
> Disable both for clang - it generates better code without them.
> 
> Move the 64x64 => 128 multiply into a static inline helper function
> for code clarity.
> No need for the a/b_hi/lo variables, the implicit casts on the function
> calls do the work for us.
> Should have minimal effect on the generated code.
> 
> Use mul_u32_u32() and add_u64_u32() in the 64x64 => 128 multiply
> in mul_u64_add_u64_div_u64().
> 
> Signed-off-by: David Laight <david.laight.linux@gmail.com>

Reviewed-by: Nicolas Pitre <npitre@baylibre.com>


> ---
> 
> Changes for v4:
> - merge in patch 8.
> - Add comments about gcc being 'broken' for mixed 32/64 bit maths.
>   clang doesn't have the same issues.
> - use a #define for mul_add() to avoid 'defined but not used'
>   errors.
> 
>  arch/x86/include/asm/div64.h | 19 +++++++++++++++++
>  include/linux/math64.h       | 11 ++++++++++
>  lib/math/div64.c             | 40 +++++++++++++++++++++++-------------
>  3 files changed, 56 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/x86/include/asm/div64.h b/arch/x86/include/asm/div64.h
> index cabdc2d5a68f..a18c045aa8a1 100644
> --- a/arch/x86/include/asm/div64.h
> +++ b/arch/x86/include/asm/div64.h
> @@ -60,6 +60,12 @@ static inline u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder)
>  }
>  #define div_u64_rem	div_u64_rem
>  
> +/*
> + * gcc tends to zero extend 32bit values and do full 64bit maths.
> + * Define asm functions that avoid this.
> + * (clang generates better code for the C versions.)
> + */
> +#ifndef __clang__
>  static inline u64 mul_u32_u32(u32 a, u32 b)
>  {
>  	u32 high, low;
> @@ -71,6 +77,19 @@ static inline u64 mul_u32_u32(u32 a, u32 b)
>  }
>  #define mul_u32_u32 mul_u32_u32
>  
> +static inline u64 add_u64_u32(u64 a, u32 b)
> +{
> +	u32 high = a >> 32, low = a;
> +
> +	asm ("addl %[b], %[low]; adcl $0, %[high]"
> +		: [low] "+r" (low), [high] "+r" (high)
> +		: [b] "rm" (b) );
> +
> +	return low | (u64)high << 32;
> +}
> +#define add_u64_u32 add_u64_u32
> +#endif
> +
>  /*
>   * __div64_32() is never called on x86, so prevent the
>   * generic definition from getting built.
> diff --git a/include/linux/math64.h b/include/linux/math64.h
> index e889d850b7f1..cc305206d89f 100644
> --- a/include/linux/math64.h
> +++ b/include/linux/math64.h
> @@ -158,6 +158,17 @@ static inline u64 mul_u32_u32(u32 a, u32 b)
>  }
>  #endif
>  
> +#ifndef add_u64_u32
> +/*
> + * Many a GCC version also messes this up.
> + * Zero extending b and then spilling everything to stack.
> + */
> +static inline u64 add_u64_u32(u64 a, u32 b)
> +{
> +	return a + b;
> +}
> +#endif
> +
>  #if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
>  
>  #ifndef mul_u64_u32_shr
> diff --git a/lib/math/div64.c b/lib/math/div64.c
> index f92e7160feb6..f6da7b5fb69e 100644
> --- a/lib/math/div64.c
> +++ b/lib/math/div64.c
> @@ -186,33 +186,45 @@ EXPORT_SYMBOL(iter_div_u64_rem);
>  #endif
>  
>  #if !defined(mul_u64_add_u64_div_u64) || defined(test_mul_u64_add_u64_div_u64)
> -u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
> -{
> +
> +#define mul_add(a, b, c) add_u64_u32(mul_u32_u32(a, b), c)
> +
>  #if defined(__SIZEOF_INT128__) && !defined(test_mul_u64_add_u64_div_u64)
>  
> +static inline u64 mul_u64_u64_add_u64(u64 *p_lo, u64 a, u64 b, u64 c)
> +{
>  	/* native 64x64=128 bits multiplication */
>  	u128 prod = (u128)a * b + c;
> -	u64 n_lo = prod, n_hi = prod >> 64;
> +
> +	*p_lo = prod;
> +	return prod >> 64;
> +}
>  
>  #else
>  
> -	/* perform a 64x64=128 bits multiplication manually */
> -	u32 a_lo = a, a_hi = a >> 32, b_lo = b, b_hi = b >> 32;
> +static inline u64 mul_u64_u64_add_u64(u64 *p_lo, u64 a, u64 b, u64 c)
> +{
> +	/* perform a 64x64=128 bits multiplication in 32bit chunks */
>  	u64 x, y, z;
>  
>  	/* Since (x-1)(x-1) + 2(x-1) == x.x - 1 two u32 can be added to a u64 */
> -	x = (u64)a_lo * b_lo + (u32)c;
> -	y = (u64)a_lo * b_hi + (u32)(c >> 32);
> -	y += (u32)(x >> 32);
> -	z = (u64)a_hi * b_hi + (u32)(y >> 32);
> -	y = (u64)a_hi * b_lo + (u32)y;
> -	z += (u32)(y >> 32);
> -	x = (y << 32) + (u32)x;
> -
> -	u64 n_lo = x, n_hi = z;
> +	x = mul_add(a, b, c);
> +	y = mul_add(a, b >> 32, c >> 32);
> +	y = add_u64_u32(y, x >> 32);
> +	z = mul_add(a >> 32, b >> 32, y >> 32);
> +	y = mul_add(a >> 32, b, y);
> +	*p_lo = (y << 32) + (u32)x;
> +	return add_u64_u32(z, y >> 32);
> +}
>  
>  #endif
>  
> +u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
> +{
> +	u64 n_lo, n_hi;
> +
> +	n_hi = mul_u64_u64_add_u64(&n_lo, a, b, c);
> +
>  	if (unlikely(n_hi >= d)) {
>  		/* trigger runtime exception if divisor is zero */
>  		if (d == 0) {
> -- 
> 2.39.5
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 next 8/9] lib: mul_u64_u64_div_u64() Optimise the divide code
  2025-10-29 17:38 ` [PATCH v4 next 8/9] lib: mul_u64_u64_div_u64() Optimise the divide code David Laight
@ 2025-10-29 20:47   ` Nicolas Pitre
  0 siblings, 0 replies; 32+ messages in thread
From: Nicolas Pitre @ 2025-10-29 20:47 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

On Wed, 29 Oct 2025, David Laight wrote:

> Replace the bit by bit algorithm with one that generates 16 bits
> per iteration on 32bit architectures and 32 bits on 64bit ones.
> 
> On my zen 5 this reduces the time for the tests (using the generic
> code) from ~3350ns to ~1000ns.
> 
> Running the 32bit algorithm on 64bit x86 takes ~1500ns.
> It'll be slightly slower on a real 32bit system, mostly due
> to register pressure.
> 
> The savings for 32bit x86 are much higher (tested in userspace).
> The worst case (lots of bits in the quotient) drops from ~900 clocks
> to ~130 (pretty much independent of the arguments).
> Other 32bit architectures may see better savings.
> 
> It is possible to optimise for divisors that span less than
> __LONG_WIDTH__/2 bits. However I suspect they don't happen that often
> and it doesn't remove any slow cpu divide instructions which dominate
> the result.
> 
> Typical improvements for 64bit random divides:
>                old     new
> sandy bridge:  470     150
> haswell:       400     144
> piledriver:    960     467   I think rdpmc is very slow.
> zen5:          244      80
> (Timing is 'rdpmc; mul_div(); rdpmc' with the multiply depending on the
> first rdpmc and the second rdpmc depending on the quotient.)
> 
> Signed-off-by: David Laight <david.laight.linux@gmail.com>

Reviewed-by: Nicolas Pitre <npitre@baylibre.com>


> ---
> 
> Algorithm unchanged from v3.
> 
>  lib/math/div64.c | 124 ++++++++++++++++++++++++++++++++---------------
>  1 file changed, 85 insertions(+), 39 deletions(-)
> 
> diff --git a/lib/math/div64.c b/lib/math/div64.c
> index f6da7b5fb69e..4e4e962261c3 100644
> --- a/lib/math/div64.c
> +++ b/lib/math/div64.c
> @@ -190,7 +190,6 @@ EXPORT_SYMBOL(iter_div_u64_rem);
>  #define mul_add(a, b, c) add_u64_u32(mul_u32_u32(a, b), c)
>  
>  #if defined(__SIZEOF_INT128__) && !defined(test_mul_u64_add_u64_div_u64)
> -
>  static inline u64 mul_u64_u64_add_u64(u64 *p_lo, u64 a, u64 b, u64 c)
>  {
>  	/* native 64x64=128 bits multiplication */
> @@ -199,9 +198,7 @@ static inline u64 mul_u64_u64_add_u64(u64 *p_lo, u64 a, u64 b, u64 c)
>  	*p_lo = prod;
>  	return prod >> 64;
>  }
> -
>  #else
> -
>  static inline u64 mul_u64_u64_add_u64(u64 *p_lo, u64 a, u64 b, u64 c)
>  {
>  	/* perform a 64x64=128 bits multiplication in 32bit chunks */
> @@ -216,12 +213,37 @@ static inline u64 mul_u64_u64_add_u64(u64 *p_lo, u64 a, u64 b, u64 c)
>  	*p_lo = (y << 32) + (u32)x;
>  	return add_u64_u32(z, y >> 32);
>  }
> +#endif
> +
> +#ifndef BITS_PER_ITER
> +#define BITS_PER_ITER (__LONG_WIDTH__ >= 64 ? 32 : 16)
> +#endif
> +
> +#if BITS_PER_ITER == 32
> +#define mul_u64_long_add_u64(p_lo, a, b, c) mul_u64_u64_add_u64(p_lo, a, b, c)
> +#define add_u64_long(a, b) ((a) + (b))
> +#else
> +#undef BITS_PER_ITER
> +#define BITS_PER_ITER 16
> +static inline u32 mul_u64_long_add_u64(u64 *p_lo, u64 a, u32 b, u64 c)
> +{
> +	u64 n_lo = mul_add(a, b, c);
> +	u64 n_med = mul_add(a >> 32, b, c >> 32);
> +
> +	n_med = add_u64_u32(n_med, n_lo >> 32);
> +	*p_lo = n_med << 32 | (u32)n_lo;
> +	return n_med >> 32;
> +}
>  
> +#define add_u64_long(a, b) add_u64_u32(a, b)
>  #endif
>  
>  u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
>  {
> -	u64 n_lo, n_hi;
> +	unsigned long d_msig, q_digit;
> +	unsigned int reps, d_z_hi;
> +	u64 quotient, n_lo, n_hi;
> +	u32 overflow;
>  
>  	n_hi = mul_u64_u64_add_u64(&n_lo, a, b, c);
>  
> @@ -240,46 +262,70 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
>  	if (!n_hi)
>  		return div64_u64(n_lo, d);
>  
> -	int shift = __builtin_ctzll(d);
> -
> -	/* try reducing the fraction in case the dividend becomes <= 64 bits */
> -	if ((n_hi >> shift) == 0) {
> -		u64 n = shift ? (n_lo >> shift) | (n_hi << (64 - shift)) : n_lo;
> -
> -		return div64_u64(n, d >> shift);
> -		/*
> -		 * The remainder value if needed would be:
> -		 *   res = div64_u64_rem(n, d >> shift, &rem);
> -		 *   rem = (rem << shift) + (n_lo - (n << shift));
> -		 */
> +	/* Left align the divisor, shifting the dividend to match */
> +	d_z_hi = __builtin_clzll(d);
> +	if (d_z_hi) {
> +		d <<= d_z_hi;
> +		n_hi = n_hi << d_z_hi | n_lo >> (64 - d_z_hi);
> +		n_lo <<= d_z_hi;
>  	}
>  
> -	/* Do the full 128 by 64 bits division */
> -
> -	shift = __builtin_clzll(d);
> -	d <<= shift;
> -
> -	int p = 64 + shift;
> -	u64 res = 0;
> -	bool carry;
> +	reps = 64 / BITS_PER_ITER;
> +	/* Optimise loop count for small dividends */
> +	if (!(u32)(n_hi >> 32)) {
> +		reps -= 32 / BITS_PER_ITER;
> +		n_hi = n_hi << 32 | n_lo >> 32;
> +		n_lo <<= 32;
> +	}
> +#if BITS_PER_ITER == 16
> +	if (!(u32)(n_hi >> 48)) {
> +		reps--;
> +		n_hi = add_u64_u32(n_hi << 16, n_lo >> 48);
> +		n_lo <<= 16;
> +	}
> +#endif
>  
> -	do {
> -		carry = n_hi >> 63;
> -		shift = carry ? 1 : __builtin_clzll(n_hi);
> -		if (p < shift)
> -			break;
> -		p -= shift;
> -		n_hi <<= shift;
> -		n_hi |= n_lo >> (64 - shift);
> -		n_lo <<= shift;
> -		if (carry || (n_hi >= d)) {
> -			n_hi -= d;
> -			res |= 1ULL << p;
> +	/* Invert the dividend so we can use add instead of subtract. */
> +	n_lo = ~n_lo;
> +	n_hi = ~n_hi;
> +
> +	/*
> +	 * Get the most significant BITS_PER_ITER bits of the divisor.
> +	 * This is used to get a low 'guestimate' of the quotient digit.
> +	 */
> +	d_msig = (d >> (64 - BITS_PER_ITER)) + 1;
> +
> +	/*
> +	 * Now do a 'long division' with BITS_PER_ITER bit 'digits'.
> +	 * The 'guess' quotient digit can be low and BITS_PER_ITER+1 bits.
> +	 * The worst case is dividing ~0 by 0x8000 which requires two subtracts.
> +	 */
> +	quotient = 0;
> +	while (reps--) {
> +		q_digit = (unsigned long)(~n_hi >> (64 - 2 * BITS_PER_ITER)) / d_msig;
> +		/* Shift 'n' left to align with the product q_digit * d */
> +		overflow = n_hi >> (64 - BITS_PER_ITER);
> +		n_hi = add_u64_u32(n_hi << BITS_PER_ITER, n_lo >> (64 - BITS_PER_ITER));
> +		n_lo <<= BITS_PER_ITER;
> +		/* Add product to negated divisor */
> +		overflow += mul_u64_long_add_u64(&n_hi, d, q_digit, n_hi);
> +		/* Adjust for the q_digit 'guestimate' being low */
> +		while (overflow < 0xffffffff >> (32 - BITS_PER_ITER)) {
> +			q_digit++;
> +			n_hi += d;
> +			overflow += n_hi < d;
>  		}
> -	} while (n_hi);
> -	/* The remainder value if needed would be n_hi << p */
> +		quotient = add_u64_long(quotient << BITS_PER_ITER, q_digit);
> +	}
>  
> -	return res;
> +	/*
> +	 * The above only ensures the remainder doesn't overflow,
> +	 * it can still be possible to add (aka subtract) another copy
> +	 * of the divisor.
> +	 */
> +	if ((n_hi + d) > n_hi)
> +		quotient++;
> +	return quotient;
>  }
>  #if !defined(test_mul_u64_add_u64_div_u64)
>  EXPORT_SYMBOL(mul_u64_add_u64_div_u64);
> -- 
> 2.39.5
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 next 9/9] lib: test_mul_u64_u64_div_u64: Test the 32bit code on 64bit
  2025-10-29 17:38 ` [PATCH v4 next 9/9] lib: test_mul_u64_u64_div_u64: Test the 32bit code on 64bit David Laight
@ 2025-10-29 20:48   ` Nicolas Pitre
  0 siblings, 0 replies; 32+ messages in thread
From: Nicolas Pitre @ 2025-10-29 20:48 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

On Wed, 29 Oct 2025, David Laight wrote:

> There are slight differences in the mul_u64_add_u64_div_u64() code
> between 32bit and 64bit systems.
> 
> Compile and test the 32bit version on 64bit hosts for better test
> coverage.
> 
> Signed-off-by: David Laight <david.laight.linux@gmail.com>

Reviewed-by: Nicolas Pitre <npitre@baylibre.com>

> ---
> 
> Changes for v4:
> - Fix build on non x86-64
> 
>  lib/math/test_mul_u64_u64_div_u64.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/lib/math/test_mul_u64_u64_div_u64.c b/lib/math/test_mul_u64_u64_div_u64.c
> index a3c5e54f37ef..57d5c7158b10 100644
> --- a/lib/math/test_mul_u64_u64_div_u64.c
> +++ b/lib/math/test_mul_u64_u64_div_u64.c
> @@ -74,6 +74,10 @@ done
>   */
>  
>  static u64 test_mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d);
> +#if __LONG_WIDTH__ >= 64
> +#define TEST_32BIT_DIV
> +static u64 test_mul_u64_add_u64_div_u64_32bit(u64 a, u64 b, u64 c, u64 d);
> +#endif
>  
>  static int __init test_run(unsigned int fn_no, const char *fn_name)
>  {
> @@ -100,6 +104,12 @@ static int __init test_run(unsigned int fn_no, const char *fn_name)
>  			result = test_mul_u64_add_u64_div_u64(a, b, 0, d);
>  			result_up = test_mul_u64_add_u64_div_u64(a, b, d - 1, d);
>  			break;
> +#ifdef TEST_32BIT_DIV
> +		case 2:
> +			result = test_mul_u64_add_u64_div_u64_32bit(a, b, 0, d);
> +			result_up = test_mul_u64_add_u64_div_u64_32bit(a, b, d - 1, d);
> +			break;
> +#endif
>  		}
>  
>  		tests += 2;
> @@ -131,6 +141,10 @@ static int __init test_init(void)
>  		return -EINVAL;
>  	if (test_run(1, "test_mul_u64_u64_div_u64"))
>  		return -EINVAL;
> +#ifdef TEST_32BIT_DIV
> +	if (test_run(2, "test_mul_u64_u64_div_u64_32bit"))
> +		return -EINVAL;
> +#endif
>  	return 0;
>  }
>  
> @@ -152,6 +166,21 @@ static void __exit test_exit(void)
>  
>  #include "div64.c"
>  
> +#ifdef TEST_32BIT_DIV
> +/* Recompile the generic code for 32bit long */
> +#undef test_mul_u64_add_u64_div_u64
> +#define test_mul_u64_add_u64_div_u64 test_mul_u64_add_u64_div_u64_32bit
> +#undef BITS_PER_ITER
> +#define BITS_PER_ITER 16
> +
> +#define mul_u64_u64_add_u64 mul_u64_u64_add_u64_32bit
> +#undef mul_u64_long_add_u64
> +#undef add_u64_long
> +#undef mul_add
> +
> +#include "div64.c"
> +#endif
> +
>  module_init(test_init);
>  module_exit(test_exit);
>  
> -- 
> 2.39.5
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup()
  2025-10-29 17:38 [PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup() David Laight
                   ` (8 preceding siblings ...)
  2025-10-29 17:38 ` [PATCH v4 next 9/9] lib: test_mul_u64_u64_div_u64: Test the 32bit code on 64bit David Laight
@ 2025-10-31  4:29 ` Andrew Morton
  2025-11-04 17:16   ` Nicolas Pitre
  2025-10-31 13:52 ` Oleg Nesterov
  10 siblings, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2025-10-31  4:29 UTC (permalink / raw)
  To: David Laight
  Cc: linux-kernel, u.kleine-koenig, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

On Wed, 29 Oct 2025 17:38:19 +0000 David Laight <david.laight.linux@gmail.com> wrote:

> The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
> This can be done simply by adding 'divisor - 1' to the 128bit product.
> Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
> existing code.
> Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
> mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).
> 
> Only x86-64 has an optimised (asm) version of the function.
> That is optimised to avoid the 'add c' when c is known to be zero.
> In all other cases the extra code will be noise compared to the software
> divide code.
> 
> The test module has been updated to test mul_u64_u64_div_u64_roundup() and
> also enhanced it to verify the C division code on x86-64 and the 32bit
> division code on 64bit.

Thanks, I added this to mm.git's mm-nonmm-unstable branch for some
linux-next exposure.  I have a note that [3/9] may be updated in
response to Nicolas's comment.




* Re: [PATCH v4 next 3/9] lib: mul_u64_u64_div_u64() simplify check for a 64bit product
  2025-10-29 18:11   ` Nicolas Pitre
@ 2025-10-31  9:19     ` David Laight
  2025-10-31 17:26       ` Nicolas Pitre
  0 siblings, 1 reply; 32+ messages in thread
From: David Laight @ 2025-10-31  9:19 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

On Wed, 29 Oct 2025 14:11:08 -0400 (EDT)
Nicolas Pitre <npitre@baylibre.com> wrote:

> On Wed, 29 Oct 2025, David Laight wrote:
> 
> > If the product is only 64bits div64_u64() can be used for the divide.
> > Replace the pre-multiply check (ilog2(a) + ilog2(b) <= 62) with a
> > simple post-multiply check that the high 64bits are zero.
> > 
> > This has the advantage of being simpler, more accurate and less code.
> > It will always be faster when the product is larger than 64bits.
> > 
> > Most 64bit cpu have a native 64x64=128 bit multiply, this is needed
> > (for the low 64bits) even when div64_u64() is called - so the early
> > check gains nothing and is just extra code.
> > 
> > 32bit cpu will need a compare (etc) to generate the 64bit ilog2()
> > from two 32bit bit scans - so that is non-trivial.
> > (Never mind the mess of x86's 'bsr' and any oddball cpu without
> > fast bit-scan instructions.)
> > Whereas the additional instructions for the 128bit multiply result
> > are pretty much one multiply and two adds (typically the 'adc $0,%reg'
> > can be run in parallel with the instruction that follows).
> > 
> > The only outliers are 64bit systems without 128bit multiply and
> > simple in order 32bit ones with fast bit scan but needing extra
> > instructions to get the high bits of the multiply result.
> > I doubt it makes much difference to either, the latter is definitely
> > not mainstream.
> > 
> > If anyone is worried about the analysis they can look at the
> > generated code for x86 (especially when cmov isn't used).
> > 
> > Signed-off-by: David Laight <david.laight.linux@gmail.com>  
> 
> Comment below.
> 
> 
> > ---
> > 
> > Split from patch 3 for v2, unchanged since.
> > 
> >  lib/math/div64.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/lib/math/div64.c b/lib/math/div64.c
> > index 1092f41e878e..7158d141b6e9 100644
> > --- a/lib/math/div64.c
> > +++ b/lib/math/div64.c
> > @@ -186,9 +186,6 @@ EXPORT_SYMBOL(iter_div_u64_rem);
> >  #ifndef mul_u64_u64_div_u64
> >  u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
> >  {
> > -	if (ilog2(a) + ilog2(b) <= 62)
> > -		return div64_u64(a * b, d);
> > -
> >  #if defined(__SIZEOF_INT128__)
> >  
> >  	/* native 64x64=128 bits multiplication */
> > @@ -224,6 +221,9 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
> >  		return ~0ULL;
> >  	}
> >  
> > +	if (!n_hi)
> > +		return div64_u64(n_lo, d);  
> 
> I'd move this before the overflow test. If this is to be taken then 
> you'll save one test. same cost otherwise.
> 

I wanted the 'divide by zero' result to be consistent.

Additionally the change to stop the x86-64 version panicking on
overflow also makes it return ~0 for divide by zero.
If that is done then this version needs to be consistent and
return ~0 for divide by zero - which div64_u64() won't do.

It is worth remembering that the chance of (a * b + c)/d being ~0
is pretty small (for non-test inputs), and any code that might expect
such a value is likely to have to handle overflow as well.
(Not to mention avoiding overflow of 'a' and 'b'.)
So using ~0 for overflow isn't really a problem.
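
For illustration, the single-error-path convention described above can be
modelled in user space. This is only a sketch with made-up names; the
compiler's __int128 stands in for the kernel implementation:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative model of (a * b + c) / d with a single error path:
 * ~0 is returned for both a zero divisor and a quotient that does
 * not fit in 64 bits, so callers only need one check.
 */
static uint64_t model_mul_add_div(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
	unsigned __int128 n = (unsigned __int128)a * b + c;
	unsigned __int128 q;

	if (d == 0)
		return ~(uint64_t)0;	/* same sentinel as overflow */
	q = n / d;
	if (q >> 64)
		return ~(uint64_t)0;	/* quotient overflows u64 */
	return (uint64_t)q;
}

/* Caller pattern: one test covers both overflow and a bad divisor. */
static int scaled(uint64_t val, uint64_t mul, uint64_t div, uint64_t *res)
{
	uint64_t v = model_mul_add_div(val, mul, div - 1, div); /* round up */

	if (v == ~(uint64_t)0)
		return -1;
	*res = v;
	return 0;
}
```

The caller never has to distinguish "divide by zero" from "overflow":
both collapse into the one ~0 check, which is the point being made above.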

	David


 


* Re: [PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup()
  2025-10-29 17:38 [PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup() David Laight
                   ` (9 preceding siblings ...)
  2025-10-31  4:29 ` [PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup() Andrew Morton
@ 2025-10-31 13:52 ` Oleg Nesterov
  2025-10-31 16:17   ` David Laight
  10 siblings, 1 reply; 32+ messages in thread
From: Oleg Nesterov @ 2025-10-31 13:52 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Nicolas Pitre,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

On 10/29, David Laight wrote:
>
> The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
> This can be done simply by adding 'divisor - 1' to the 128bit product.
> Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
> existing code.
> Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
> mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).

Sorry for the noise. Can't review due to the lack of knowledge.
But this reminds me... What about

	[PATCH v3 1/2] x86/math64: handle #DE in mul_u64_u64_div_u64()
	https://lore.kernel.org/all/20250815164055.GA13444@redhat.com

?

should I resend it once again, or do we not care?

Oleg.



* Re: [PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup()
  2025-10-31 13:52 ` Oleg Nesterov
@ 2025-10-31 16:17   ` David Laight
  0 siblings, 0 replies; 32+ messages in thread
From: David Laight @ 2025-10-31 16:17 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Nicolas Pitre,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

On Fri, 31 Oct 2025 14:52:56 +0100
Oleg Nesterov <oleg@redhat.com> wrote:

> On 10/29, David Laight wrote:
> >
> > The pwm-stm32.c code wants a 'rounding up' version of mul_u64_u64_div_u64().
> > This can be done simply by adding 'divisor - 1' to the 128bit product.
> > Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a * b + c)/d based on the
> > existing code.
> > Define mul_u64_u64_div_u64(a, b, d) as mul_u64_add_u64_div_u64(a, b, 0, d) and
> > mul_u64_u64_div_u64_roundup(a, b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).  
> 
> Sorry for the noise. Can't review due to the lack of knowledge.
> But this reminds me... What about
> 
> 	[PATCH v3 1/2] x86/math64: handle #DE in mul_u64_u64_div_u64()
> 	https://lore.kernel.org/all/20250815164055.GA13444@redhat.com
> 
> ?
> 
> should I resend it once again or we don't care?

I'd remembered that as well.
I'll do a new version of that on top as a separate patch.
Probably when this set has been merged further.

	David



* Re: [PATCH v4 next 3/9] lib: mul_u64_u64_div_u64() simplify check for a 64bit product
  2025-10-31  9:19     ` David Laight
@ 2025-10-31 17:26       ` Nicolas Pitre
  2025-10-31 18:04         ` David Laight
  0 siblings, 1 reply; 32+ messages in thread
From: Nicolas Pitre @ 2025-10-31 17:26 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

On Fri, 31 Oct 2025, David Laight wrote:

> On Wed, 29 Oct 2025 14:11:08 -0400 (EDT)
> Nicolas Pitre <npitre@baylibre.com> wrote:
> 
> > On Wed, 29 Oct 2025, David Laight wrote:
> > 
> > > If the product is only 64bits div64_u64() can be used for the divide.
> > > Replace the pre-multiply check (ilog2(a) + ilog2(b) <= 62) with a
> > > simple post-multiply check that the high 64bits are zero.
> > > 
> > > This has the advantage of being simpler, more accurate and less code.
> > > It will always be faster when the product is larger than 64bits.
> > > 
> > > Most 64bit cpu have a native 64x64=128 bit multiply, this is needed
> > > (for the low 64bits) even when div64_u64() is called - so the early
> > > check gains nothing and is just extra code.
> > > 
> > > 32bit cpu will need a compare (etc) to generate the 64bit ilog2()
> > > from two 32bit bit scans - so that is non-trivial.
> > > (Never mind the mess of x86's 'bsr' and any oddball cpu without
> > > fast bit-scan instructions.)
> > > Whereas the additional instructions for the 128bit multiply result
> > > are pretty much one multiply and two adds (typically the 'adc $0,%reg'
> > > can be run in parallel with the instruction that follows).
> > > 
> > > The only outliers are 64bit systems without 128bit multiply and
> > > simple in order 32bit ones with fast bit scan but needing extra
> > > instructions to get the high bits of the multiply result.
> > > I doubt it makes much difference to either, the latter is definitely
> > > not mainstream.
> > > 
> > > If anyone is worried about the analysis they can look at the
> > > generated code for x86 (especially when cmov isn't used).
> > > 
> > > Signed-off-by: David Laight <david.laight.linux@gmail.com>  
> > 
> > Comment below.
> > 
> > 
> > > ---
> > > 
> > > Split from patch 3 for v2, unchanged since.
> > > 
> > >  lib/math/div64.c | 6 +++---
> > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/lib/math/div64.c b/lib/math/div64.c
> > > index 1092f41e878e..7158d141b6e9 100644
> > > --- a/lib/math/div64.c
> > > +++ b/lib/math/div64.c
> > > @@ -186,9 +186,6 @@ EXPORT_SYMBOL(iter_div_u64_rem);
> > >  #ifndef mul_u64_u64_div_u64
> > >  u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
> > >  {
> > > -	if (ilog2(a) + ilog2(b) <= 62)
> > > -		return div64_u64(a * b, d);
> > > -
> > >  #if defined(__SIZEOF_INT128__)
> > >  
> > >  	/* native 64x64=128 bits multiplication */
> > > @@ -224,6 +221,9 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
> > >  		return ~0ULL;
> > >  	}
> > >  
> > > +	if (!n_hi)
> > > +		return div64_u64(n_lo, d);  
> > 
> > I'd move this before the overflow test. If this is to be taken then 
> > you'll save one test. same cost otherwise.
> > 
> 
> I wanted the 'divide by zero' result to be consistent.

It is. div64_u64(x, 0) will produce the same result/behavior.

> Additionally the change to stop the x86-64 version panicking on
> overflow also makes it return ~0 for divide by zero.
> If that is done then this version needs to be consistent and
> return ~0 for divide by zero - which div64_u64() won't do.

Well here I disagree. If that is some x86 peculiarity then x86 should 
deal with it and not impose it on everybody. At least having most other 
architectures raising SIGFPE when encountering a divide by 0 should 
provide enough coverage to have such obviously buggy code fixed.

> It is worth remembering that the chance of (a * b + c)/d being ~0
> is pretty small (for non-test inputs), and any code that might expect
> such a value is likely to have to handle overflow as well.
> (Not to mention avoiding overflow of 'a' and 'b'.)
> So using ~0 for overflow isn't really a problem.

It is not.

To be clear I'm not talking about overflow nor divide by zero here. I'm 
suggesting that the case where div64_u64() can be used should be tested 
first as this is a far more prevalent valid occurrence than a zero 
divisor which is not.
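
The suggested ordering can be sketched in user space. The names mirror
the patch but everything here is a stand-in; the 128/64 long-division
tail is done with __int128 purely so the sketch compiles:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the 128/64 long division the patch implements. */
static uint64_t div_128_64(uint64_t n_hi, uint64_t n_lo, uint64_t d)
{
	return (uint64_t)((((unsigned __int128)n_hi << 64) | n_lo) / d);
}

/*
 * Epilogue ordering suggested above: take the common "product fits
 * in 64 bits" path first, then the overflow test.  n_hi:n_lo is the
 * 128-bit product of the earlier multiply.
 */
static uint64_t div_epilogue(uint64_t n_hi, uint64_t n_lo, uint64_t d)
{
	if (!n_hi)			/* common case: div64_u64() in the kernel */
		return n_lo / d;	/* faults on d == 0, as div64_u64() may */
	if (n_hi >= d)			/* quotient cannot fit in u64 */
		return ~(uint64_t)0;
	return div_128_64(n_hi, n_lo, d);
}
```

With this ordering the prevalent small-product case pays for one test
instead of two; the cost is that a zero divisor reaches the native
divide rather than the ~0 overflow path.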


Nicolas


* Re: [PATCH v4 next 3/9] lib: mul_u64_u64_div_u64() simplify check for a 64bit product
  2025-10-31 17:26       ` Nicolas Pitre
@ 2025-10-31 18:04         ` David Laight
  2025-10-31 18:45           ` Nicolas Pitre
  0 siblings, 1 reply; 32+ messages in thread
From: David Laight @ 2025-10-31 18:04 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

On Fri, 31 Oct 2025 13:26:41 -0400 (EDT)
Nicolas Pitre <npitre@baylibre.com> wrote:

> On Fri, 31 Oct 2025, David Laight wrote:
> 
> > On Wed, 29 Oct 2025 14:11:08 -0400 (EDT)
> > Nicolas Pitre <npitre@baylibre.com> wrote:
> >   
> > > On Wed, 29 Oct 2025, David Laight wrote:
> > >   
> > > > If the product is only 64bits div64_u64() can be used for the divide.
> > > > Replace the pre-multiply check (ilog2(a) + ilog2(b) <= 62) with a
> > > > simple post-multiply check that the high 64bits are zero.
> > > > 
> > > > This has the advantage of being simpler, more accurate and less code.
> > > > It will always be faster when the product is larger than 64bits.
> > > > 
> > > > Most 64bit cpu have a native 64x64=128 bit multiply, this is needed
> > > > (for the low 64bits) even when div64_u64() is called - so the early
> > > > check gains nothing and is just extra code.
> > > > 
> > > > 32bit cpu will need a compare (etc) to generate the 64bit ilog2()
> > > > from two 32bit bit scans - so that is non-trivial.
> > > > (Never mind the mess of x86's 'bsr' and any oddball cpu without
> > > > fast bit-scan instructions.)
> > > > Whereas the additional instructions for the 128bit multiply result
> > > > are pretty much one multiply and two adds (typically the 'adc $0,%reg'
> > > > can be run in parallel with the instruction that follows).
> > > > 
> > > > The only outliers are 64bit systems without 128bit multiply and
> > > > simple in order 32bit ones with fast bit scan but needing extra
> > > > instructions to get the high bits of the multiply result.
> > > > I doubt it makes much difference to either, the latter is definitely
> > > > not mainstream.
> > > > 
> > > > If anyone is worried about the analysis they can look at the
> > > > generated code for x86 (especially when cmov isn't used).
> > > > 
> > > > Signed-off-by: David Laight <david.laight.linux@gmail.com>    
> > > 
> > > Comment below.
> > > 
> > >   
> > > > ---
> > > > 
> > > > Split from patch 3 for v2, unchanged since.
> > > > 
> > > >  lib/math/div64.c | 6 +++---
> > > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/lib/math/div64.c b/lib/math/div64.c
> > > > index 1092f41e878e..7158d141b6e9 100644
> > > > --- a/lib/math/div64.c
> > > > +++ b/lib/math/div64.c
> > > > @@ -186,9 +186,6 @@ EXPORT_SYMBOL(iter_div_u64_rem);
> > > >  #ifndef mul_u64_u64_div_u64
> > > >  u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
> > > >  {
> > > > -	if (ilog2(a) + ilog2(b) <= 62)
> > > > -		return div64_u64(a * b, d);
> > > > -
> > > >  #if defined(__SIZEOF_INT128__)
> > > >  
> > > >  	/* native 64x64=128 bits multiplication */
> > > > @@ -224,6 +221,9 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
> > > >  		return ~0ULL;
> > > >  	}
> > > >  
> > > > +	if (!n_hi)
> > > > +		return div64_u64(n_lo, d);    
> > > 
> > > I'd move this before the overflow test. If this is to be taken then 
> > > you'll save one test. same cost otherwise.
> > >   
> > 
> > I wanted the 'divide by zero' result to be consistent.  
> 
> It is. div64_u64(x, 0) will produce the same result/behavior.

Are you sure, for all architectures?

> 
> > Additionally the change to stop the x86-64 version panicking on
> > overflow also makes it return ~0 for divide by zero.
> > If that is done then this version needs to be consistent and
> > return ~0 for divide by zero - which div64_u64() won't do.  
> 
> Well here I disagree. If that is some x86 peculiarity then x86 should 
> deal with it and not impose it on everybody. At least having most other 
> architectures raising SIGFPE when encountering a divide by 0 should 
> provide enough coverage to have such obviously buggy code fixed.

The issue here is that crashing the kernel isn't really acceptable.
An extra parameter could be added to return the 'status',
but that makes the calling interface horrid.
So returning ~0 on overflow and divide-by-zero makes it possible
for the caller to check for errors.
Ok, you lose ~0 as a valid result - but that is very unlikely to
need to be treated differently to 'overflow'.

> 
> > It is worth remembering that the chance of (a * b + c)/d being ~0
> > is pretty small (for non-test inputs), and any code that might expect
> > such a value is likely to have to handle overflow as well.
> > (Not to mention avoiding overflow of 'a' and 'b'.)
> > So using ~0 for overflow isn't really a problem.  
> 
> It is not.
> 
> To be clear I'm not talking about overflow nor divide by zero here. I'm 
> suggesting that the case where div64_u64() can be used should be tested 
> first as this is a far more prevalent valid occurrence than a zero 
> divisor which is not.

and I'd rather use the same error path for 'divide by zero'.

	David

> 
> 
> Nicolas



* Re: [PATCH v4 next 3/9] lib: mul_u64_u64_div_u64() simplify check for a 64bit product
  2025-10-31 18:04         ` David Laight
@ 2025-10-31 18:45           ` Nicolas Pitre
  2025-10-31 20:12             ` David Laight
  0 siblings, 1 reply; 32+ messages in thread
From: Nicolas Pitre @ 2025-10-31 18:45 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

On Fri, 31 Oct 2025, David Laight wrote:

> On Fri, 31 Oct 2025 13:26:41 -0400 (EDT)
> Nicolas Pitre <npitre@baylibre.com> wrote:
> 
> > On Fri, 31 Oct 2025, David Laight wrote:
> > 
> > > On Wed, 29 Oct 2025 14:11:08 -0400 (EDT)
> > > Nicolas Pitre <npitre@baylibre.com> wrote:
> > >   
> > > > On Wed, 29 Oct 2025, David Laight wrote:
> > > >   
> > > > > If the product is only 64bits div64_u64() can be used for the divide.
> > > > > Replace the pre-multiply check (ilog2(a) + ilog2(b) <= 62) with a
> > > > > simple post-multiply check that the high 64bits are zero.
> > > > > 
> > > > > This has the advantage of being simpler, more accurate and less code.
> > > > > It will always be faster when the product is larger than 64bits.
> > > > > 
> > > > > Most 64bit cpu have a native 64x64=128 bit multiply, this is needed
> > > > > (for the low 64bits) even when div64_u64() is called - so the early
> > > > > check gains nothing and is just extra code.
> > > > > 
> > > > > 32bit cpu will need a compare (etc) to generate the 64bit ilog2()
> > > > > from two 32bit bit scans - so that is non-trivial.
> > > > > (Never mind the mess of x86's 'bsr' and any oddball cpu without
> > > > > fast bit-scan instructions.)
> > > > > Whereas the additional instructions for the 128bit multiply result
> > > > > are pretty much one multiply and two adds (typically the 'adc $0,%reg'
> > > > > can be run in parallel with the instruction that follows).
> > > > > 
> > > > > The only outliers are 64bit systems without 128bit multiply and
> > > > > simple in order 32bit ones with fast bit scan but needing extra
> > > > > instructions to get the high bits of the multiply result.
> > > > > I doubt it makes much difference to either, the latter is definitely
> > > > > not mainstream.
> > > > > 
> > > > > If anyone is worried about the analysis they can look at the
> > > > > generated code for x86 (especially when cmov isn't used).
> > > > > 
> > > > > Signed-off-by: David Laight <david.laight.linux@gmail.com>    
> > > > 
> > > > Comment below.
> > > > 
> > > >   
> > > > > ---
> > > > > 
> > > > > Split from patch 3 for v2, unchanged since.
> > > > > 
> > > > >  lib/math/div64.c | 6 +++---
> > > > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/lib/math/div64.c b/lib/math/div64.c
> > > > > index 1092f41e878e..7158d141b6e9 100644
> > > > > --- a/lib/math/div64.c
> > > > > +++ b/lib/math/div64.c
> > > > > @@ -186,9 +186,6 @@ EXPORT_SYMBOL(iter_div_u64_rem);
> > > > >  #ifndef mul_u64_u64_div_u64
> > > > >  u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
> > > > >  {
> > > > > -	if (ilog2(a) + ilog2(b) <= 62)
> > > > > -		return div64_u64(a * b, d);
> > > > > -
> > > > >  #if defined(__SIZEOF_INT128__)
> > > > >  
> > > > >  	/* native 64x64=128 bits multiplication */
> > > > > @@ -224,6 +221,9 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
> > > > >  		return ~0ULL;
> > > > >  	}
> > > > >  
> > > > > +	if (!n_hi)
> > > > > +		return div64_u64(n_lo, d);    
> > > > 
> > > > I'd move this before the overflow test. If this is to be taken then 
> > > > you'll save one test. same cost otherwise.
> > > >   
> > > 
> > > I wanted the 'divide by zero' result to be consistent.  
> > 
> > It is. div64_u64(x, 0) will produce the same result/behavior.
> 
> Are you sure, for all architectures?

At least all the ones I'm familiar with.

> > 
> > > Additionally the change to stop the x86-64 version panicking on
> > > overflow also makes it return ~0 for divide by zero.
> > > If that is done then this version needs to be consistent and
> > > return ~0 for divide by zero - which div64_u64() won't do.  
> > 
> > Well here I disagree. If that is some x86 peculiarity then x86 should 
> > deal with it and not impose it on everybody. At least having most other 
> > architectures raising SIGFPE when encountering a divide by 0 should 
> > provide enough coverage to have such obviously buggy code fixed.
> 
> The issue here is that crashing the kernel isn't really acceptable.

Encountering a div-by-0 _will_ crash the kernel (or at least kill the 
current task) with most CPUs. They do raise an exception already with 
the other division types. This is no different.

> An extra parameter could be added to return the 'status',
> but that makes the calling interface horrid.

No please.

> So returning ~0 on overflow and divide-by-zero makes it possible
> for the caller to check for errors.

The caller should check for a possible zero divisor _before_ performing 
a division, not after. Relying on the div-by-0 CPU behavior is a bug.

> Ok, you lose ~0 as a valid result - but that is very unlikely to
> need to be treated differently to 'overflow'.

I disagree. You need to check for a zero divisor up front and not rely 
on the division to tell you about it.  This is true whether you do
a = b/c; a = div64_u64(b, c); or a = mul_u64_u64_div_u64(a, b, c);.
Most architectures will simply raise an exception if you attempt a div 
by 0, some will return a plain 0. You can't rely on that.

But you need to perform the mul+div before you know there is an 
overflow. Maybe the handling of those cases is the same for the caller 
but this is certainly not universal.
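
The two points — divisor validated up front, overflow only knowable
afterwards — in a user-space sketch (the helper name is made up and
__int128 stands in for the kernel code):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Caller-side pattern: reject a zero divisor before dividing, since
 * what a div-by-0 does is CPU-dependent (exception on most, plain 0
 * on some).  The overflow test, by contrast, can only come after
 * the multiply and divide have been performed.
 */
static int checked_mul_div(uint64_t a, uint64_t b, uint64_t d, uint64_t *res)
{
	unsigned __int128 q;

	if (d == 0)
		return -1;		/* validate up front, don't rely on #DE */
	q = (unsigned __int128)a * b / d;
	if (q >> 64)
		return -1;		/* overflow: only detectable here */
	*res = (uint64_t)q;
	return 0;
}
```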


Nicolas


* Re: [PATCH v4 next 3/9] lib: mul_u64_u64_div_u64() simplify check for a 64bit product
  2025-10-31 18:45           ` Nicolas Pitre
@ 2025-10-31 20:12             ` David Laight
  0 siblings, 0 replies; 32+ messages in thread
From: David Laight @ 2025-10-31 20:12 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Andrew Morton, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

On Fri, 31 Oct 2025 14:45:49 -0400 (EDT)
Nicolas Pitre <npitre@baylibre.com> wrote:

> On Fri, 31 Oct 2025, David Laight wrote:
> 
> > On Fri, 31 Oct 2025 13:26:41 -0400 (EDT)
> > Nicolas Pitre <npitre@baylibre.com> wrote:
> >   
> > > On Fri, 31 Oct 2025, David Laight wrote:
> > >   
> > > > On Wed, 29 Oct 2025 14:11:08 -0400 (EDT)
> > > > Nicolas Pitre <npitre@baylibre.com> wrote:
> > > >     
> > > > > On Wed, 29 Oct 2025, David Laight wrote:
> > > > >     
> > > > > > If the product is only 64bits div64_u64() can be used for the divide.
> > > > > > Replace the pre-multiply check (ilog2(a) + ilog2(b) <= 62) with a
> > > > > > simple post-multiply check that the high 64bits are zero.
> > > > > > 
> > > > > > This has the advantage of being simpler, more accurate and less code.
> > > > > > It will always be faster when the product is larger than 64bits.
> > > > > > 
> > > > > > Most 64bit cpu have a native 64x64=128 bit multiply, this is needed
> > > > > > (for the low 64bits) even when div64_u64() is called - so the early
> > > > > > check gains nothing and is just extra code.
> > > > > > 
> > > > > > 32bit cpu will need a compare (etc) to generate the 64bit ilog2()
> > > > > > from two 32bit bit scans - so that is non-trivial.
> > > > > > (Never mind the mess of x86's 'bsr' and any oddball cpu without
> > > > > > fast bit-scan instructions.)
> > > > > > Whereas the additional instructions for the 128bit multiply result
> > > > > > are pretty much one multiply and two adds (typically the 'adc $0,%reg'
> > > > > > can be run in parallel with the instruction that follows).
> > > > > > 
> > > > > > The only outliers are 64bit systems without 128bit multiply and
> > > > > > simple in order 32bit ones with fast bit scan but needing extra
> > > > > > instructions to get the high bits of the multiply result.
> > > > > > I doubt it makes much difference to either, the latter is definitely
> > > > > > not mainstream.
> > > > > > 
> > > > > > If anyone is worried about the analysis they can look at the
> > > > > > generated code for x86 (especially when cmov isn't used).
> > > > > > 
> > > > > > Signed-off-by: David Laight <david.laight.linux@gmail.com>      
> > > > > 
> > > > > Comment below.
> > > > > 
> > > > >     
> > > > > > ---
> > > > > > 
> > > > > > Split from patch 3 for v2, unchanged since.
> > > > > > 
> > > > > >  lib/math/div64.c | 6 +++---
> > > > > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > > > > > 
> > > > > > diff --git a/lib/math/div64.c b/lib/math/div64.c
> > > > > > index 1092f41e878e..7158d141b6e9 100644
> > > > > > --- a/lib/math/div64.c
> > > > > > +++ b/lib/math/div64.c
> > > > > > @@ -186,9 +186,6 @@ EXPORT_SYMBOL(iter_div_u64_rem);
> > > > > >  #ifndef mul_u64_u64_div_u64
> > > > > >  u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
> > > > > >  {
> > > > > > -	if (ilog2(a) + ilog2(b) <= 62)
> > > > > > -		return div64_u64(a * b, d);
> > > > > > -
> > > > > >  #if defined(__SIZEOF_INT128__)
> > > > > >  
> > > > > >  	/* native 64x64=128 bits multiplication */
> > > > > > @@ -224,6 +221,9 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
> > > > > >  		return ~0ULL;
> > > > > >  	}
> > > > > >  
> > > > > > +	if (!n_hi)
> > > > > > +		return div64_u64(n_lo, d);      
> > > > > 
> > > > > I'd move this before the overflow test. If this is to be taken then 
> > > > > you'll save one test. same cost otherwise.
> > > > >     
> > > > 
> > > > I wanted the 'divide by zero' result to be consistent.    
> > > 
> > > It is. div64_u64(x, 0) will produce the same result/behavior.  
> > 
> > Are you sure, for all architectures?  
> 
> At least all the ones I'm familiar with.
> 
> > >   
> > > > Additionally the change to stop the x86-64 version panicking on
> > > > overflow also makes it return ~0 for divide by zero.
> > > > If that is done then this version needs to be consistent and
> > > > return ~0 for divide by zero - which div64_u64() won't do.    
> > > 
> > > Well here I disagree. If that is some x86 peculiarity then x86 should 
> > > deal with it and not impose it on everybody. At least having most other 
> > > architectures raising SIGFPE when encountering a divide by 0 should 
> > > provide enough coverage to have such obviously buggy code fixed.  
> > 
> > The issue here is that crashing the kernel isn't really acceptable.  
> 
> Encountering a div-by-0 _will_ crash the kernel (or at least kill the 
> current task) with most CPUs. They do raise an exception already with 
> the other division types. This is no different.
> 
> > An extra parameter could be added to return the 'status',
> > but that makes the calling interface horrid.  
> 
> No please.
> 
> > So returning ~0 on overflow and divide-by-zero makes it possible
> > for the caller to check for errors.  
> 
> The caller should check for a possible zero divisor _before_ performing 
> a division, not after. Relying on the div-by-0 CPU behavior is a bug.
> 
> > Ok, you lose ~0 as a valid result - but that is very unlikely to
> > need to be treated differently to 'overflow'.  
> 
> I disagree. You need to check for a zero divisor up front and not rely 
> on the division to tell you about it.  This is true whether you do
> a = b/c; a = div64_u64(b, c); or a = mul_u64_u64_div_u64(a, b, c);.
> Most architectures will simply raise an exception if you attempt a div 
> by 0, some will return a plain 0. You can't rely on that.
> 
> But you need to perform the mul+div before you know there is an 
> overflow. Maybe the handling of those cases is the same for the caller 
> but this is certainly not universal.

Anyway this is all pretty much irrelevant for this patch.

	David

> 
> 
> Nicolas



* Re: [PATCH v4 next 4/9] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
  2025-10-29 17:38 ` [PATCH v4 next 4/9] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup() David Laight
  2025-10-29 18:17   ` Nicolas Pitre
@ 2025-10-31 20:59   ` David Laight
  2025-11-01  2:12     ` Andrew Morton
  1 sibling, 1 reply; 32+ messages in thread
From: David Laight @ 2025-10-31 20:59 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: u.kleine-koenig, Nicolas Pitre, Oleg Nesterov, Peter Zijlstra,
	Biju Das, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Ingo Molnar, Thomas Gleixner, Li RongQing, Yu Kuai,
	Khazhismel Kumykov, Jens Axboe, x86

On Wed, 29 Oct 2025 17:38:23 +0000
David Laight <david.laight.linux@gmail.com> wrote:

There is a slight bug ...

> The existing mul_u64_u64_div_u64() rounds down, a 'rounding up'
> variant needs 'divisor - 1' adding in between the multiply and
> divide so cannot easily be done by a caller.
> 
> Add mul_u64_add_u64_div_u64(a, b, c, d) that calculates (a * b + c)/d
> and implement the 'round down' and 'round up' using it.
> 
> Update the x86-64 asm to optimise for 'c' being a constant zero.
> 
> Add kerndoc definitions for all three functions.
> 
> Signed-off-by: David Laight <david.laight.linux@gmail.com>
> ---
> 
> Changes for v2 (formally patch 1/3):
> - Reinstate the early call to div64_u64() on 32bit when 'c' is zero.
>   Although I'm not convinced the path is common enough to be worth
>   the two ilog2() calls.
> 
> Changes for v3 (formally patch 3/4):
> - The early call to div64_u64() has been removed by patch 3.
>   Pretty much guaranteed to be a pessimisation.
> 
> Changes for v4:
> - For x86-64 split the multiply, add and divide into three asm blocks.
>   (gcc makes a pig's breakfast of (u128)a * b + c)
> - Change the kerndoc since divide by zero will (probably) fault.
> 
>  arch/x86/include/asm/div64.h | 20 +++++++++------
>  include/linux/math64.h       | 48 +++++++++++++++++++++++++++++++++++-
>  lib/math/div64.c             | 14 ++++++-----
>  3 files changed, 67 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/include/asm/div64.h b/arch/x86/include/asm/div64.h
> index 9931e4c7d73f..cabdc2d5a68f 100644
> --- a/arch/x86/include/asm/div64.h
> +++ b/arch/x86/include/asm/div64.h
> @@ -84,21 +84,25 @@ static inline u64 mul_u32_u32(u32 a, u32 b)
>   * Will generate an #DE when the result doesn't fit u64, could fix with an
>   * __ex_table[] entry when it becomes an issue.
>   */
> -static inline u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div)
> +static inline u64 mul_u64_add_u64_div_u64(u64 rax, u64 mul, u64 add, u64 div)
>  {
> -	u64 q;
> +	u64 rdx;
>  
> -	asm ("mulq %2; divq %3" : "=a" (q)
> -				: "a" (a), "rm" (mul), "rm" (div)
> -				: "rdx");
> +	asm ("mulq %[mul]" : "+a" (rax), "=d" (rdx) : [mul] "rm" (mul));
>  
> -	return q;
> +	if (statically_true(add))

This needs to be:
	if (!statically_true(!add))

Do you need me to resend the full series?

	David

> +		asm ("addq %[add], %[lo]; adcq $0, %[hi]" :
> +			[lo] "+r" (rax), [hi] "+r" (rdx) : [add] "irm" (add));
> +
> +	asm ("divq %[div]" : "+a" (rax), "+d" (rdx) : [div] "rm" (div));
> +
> +	return rax;
>  }
> -#define mul_u64_u64_div_u64 mul_u64_u64_div_u64
> +#define mul_u64_add_u64_div_u64 mul_u64_add_u64_div_u64
>  
>  static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 div)
>  {
> -	return mul_u64_u64_div_u64(a, mul, div);
> +	return mul_u64_add_u64_div_u64(a, mul, 0, div);
>  }
>  #define mul_u64_u32_div	mul_u64_u32_div
>  
> diff --git a/include/linux/math64.h b/include/linux/math64.h
> index 6aaccc1626ab..e889d850b7f1 100644
> --- a/include/linux/math64.h
> +++ b/include/linux/math64.h
> @@ -282,7 +282,53 @@ static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 divisor)
>  }
>  #endif /* mul_u64_u32_div */
>  
> -u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div);
> +/**
> + * mul_u64_add_u64_div_u64 - unsigned 64bit multiply, add, and divide
> + * @a: first unsigned 64bit multiplicand
> + * @b: second unsigned 64bit multiplicand
> + * @c: unsigned 64bit addend
> + * @d: unsigned 64bit divisor
> + *
> + * Multiply two 64bit values together to generate a 128bit product,
> + * add a third value, and then divide by a fourth.
> + * The Generic code divides by 0 if @d is zero and returns ~0 on overflow.
> + * Architecture specific code may trap on zero or overflow.
> + *
> + * Return: (@a * @b + @c) / @d
> + */
> +u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d);
> +
> +/**
> + * mul_u64_u64_div_u64 - unsigned 64bit multiply and divide
> + * @a: first unsigned 64bit multiplicand
> + * @b: second unsigned 64bit multiplicand
> + * @d: unsigned 64bit divisor
> + *
> + * Multiply two 64bit values together to generate a 128bit product
> + * and then divide by a third value.
> + * The Generic code divides by 0 if @d is zero and returns ~0 on overflow.
> + * Architecture specific code may trap on zero or overflow.
> + *
> + * Return: @a * @b / @d
> + */
> +#define mul_u64_u64_div_u64(a, b, d) mul_u64_add_u64_div_u64(a, b, 0, d)
> +
> +/**
> + * mul_u64_u64_div_u64_roundup - unsigned 64bit multiply and divide rounded up
> + * @a: first unsigned 64bit multiplicand
> + * @b: second unsigned 64bit multiplicand
> + * @d: unsigned 64bit divisor
> + *
> + * Multiply two 64bit values together to generate a 128bit product
> + * and then divide and round up.
> + * The Generic code divides by 0 if @d is zero and returns ~0 on overflow.
> + * Architecture specific code may trap on zero or overflow.
> + *
> + * Return: (@a * @b + @d - 1) / @d
> + */
> +#define mul_u64_u64_div_u64_roundup(a, b, d) \
> +	({ u64 _tmp = (d); mul_u64_add_u64_div_u64(a, b, _tmp - 1, _tmp); })
> +
>  
>  /**
>   * DIV64_U64_ROUND_UP - unsigned 64bit divide with 64bit divisor rounded up
> diff --git a/lib/math/div64.c b/lib/math/div64.c
> index 7158d141b6e9..25295daebde9 100644
> --- a/lib/math/div64.c
> +++ b/lib/math/div64.c
> @@ -183,13 +183,13 @@ u32 iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
>  }
>  EXPORT_SYMBOL(iter_div_u64_rem);
>  
> -#ifndef mul_u64_u64_div_u64
> -u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
> +#ifndef mul_u64_add_u64_div_u64
> +u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
>  {
>  #if defined(__SIZEOF_INT128__)
>  
>  	/* native 64x64=128 bits multiplication */
> -	u128 prod = (u128)a * b;
> +	u128 prod = (u128)a * b + c;
>  	u64 n_lo = prod, n_hi = prod >> 64;
>  
>  #else
> @@ -198,8 +198,10 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
>  	u32 a_lo = a, a_hi = a >> 32, b_lo = b, b_hi = b >> 32;
>  	u64 x, y, z;
>  
> -	x = (u64)a_lo * b_lo;
> -	y = (u64)a_lo * b_hi + (u32)(x >> 32);
> +	/* Since (x-1)*(x-1) + 2*(x-1) == x*x - 1, two u32 values can be added to a u64 without overflow */
> +	x = (u64)a_lo * b_lo + (u32)c;
> +	y = (u64)a_lo * b_hi + (u32)(c >> 32);
> +	y += (u32)(x >> 32);
>  	z = (u64)a_hi * b_hi + (u32)(y >> 32);
>  	y = (u64)a_hi * b_lo + (u32)y;
>  	z += (u32)(y >> 32);
> @@ -265,5 +267,5 @@ u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 d)
>  
>  	return res;
>  }
> -EXPORT_SYMBOL(mul_u64_u64_div_u64);
> +EXPORT_SYMBOL(mul_u64_add_u64_div_u64);
>  #endif


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 next 4/9] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
  2025-10-31 20:59   ` David Laight
@ 2025-11-01  2:12     ` Andrew Morton
  0 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2025-11-01  2:12 UTC (permalink / raw)
  To: David Laight
  Cc: linux-kernel, u.kleine-koenig, Nicolas Pitre, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

On Fri, 31 Oct 2025 20:59:17 +0000 David Laight <david.laight.linux@gmail.com> wrote:

> > -static inline u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div)
> > +static inline u64 mul_u64_add_u64_div_u64(u64 rax, u64 mul, u64 add, u64 div)
> >  {
> > -	u64 q;
> > +	u64 rdx;
> >  
> > -	asm ("mulq %2; divq %3" : "=a" (q)
> > -				: "a" (a), "rm" (mul), "rm" (div)
> > -				: "rdx");
> > +	asm ("mulq %[mul]" : "+a" (rax), "=d" (rdx) : [mul] "rm" (mul));
> >  
> > -	return q;
> > +	if (statically_true(add))
> 
> This needs to be:
> 	if (!statically_true(!add))
> 
> Do you need me to resend the full series?

I queued a fix, thanks.

--- a/arch/x86/include/asm/div64.h~lib-add-mul_u64_add_u64_div_u64-and-mul_u64_u64_div_u64_roundup-fix
+++ a/arch/x86/include/asm/div64.h
@@ -90,7 +90,7 @@ static inline u64 mul_u64_add_u64_div_u6
 
 	asm ("mulq %[mul]" : "+a" (rax), "=d" (rdx) : [mul] "rm" (mul));
 
-	if (statically_true(add))
+	if (!statically_true(!add))
 		asm ("addq %[add], %[lo]; adcq $0, %[hi]" :
 			[lo] "+r" (rax), [hi] "+r" (rdx) : [add] "irm" (add));
 
_



* Re: [PATCH v4 next 6/9] lib: test_mul_u64_u64_div_u64: Test both generic and arch versions
  2025-10-29 17:38 ` [PATCH v4 next 6/9] lib: test_mul_u64_u64_div_u64: Test both generic and arch versions David Laight
  2025-10-29 18:53   ` Nicolas Pitre
@ 2025-11-01 19:35   ` kernel test robot
  2025-11-01 20:59   ` kernel test robot
  2 siblings, 0 replies; 32+ messages in thread
From: kernel test robot @ 2025-11-01 19:35 UTC (permalink / raw)
  To: David Laight, Andrew Morton, linux-kernel
  Cc: oe-kbuild-all, Linux Memory Management List, David Laight,
	u.kleine-koenig, Nicolas Pitre, Oleg Nesterov, Peter Zijlstra,
	Biju Das, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Ingo Molnar, Thomas Gleixner, Li RongQing, Yu Kuai,
	Khazhismel Kumykov, Jens Axboe, x86

Hi David,

kernel test robot noticed the following build warnings:

[auto build test WARNING on next-20251029]

url:    https://github.com/intel-lab-lkp/linux/commits/David-Laight/lib-mul_u64_u64_div_u64-rename-parameter-c-to-d/20251030-025633
base:   next-20251029
patch link:    https://lore.kernel.org/r/20251029173828.3682-7-david.laight.linux%40gmail.com
patch subject: [PATCH v4 next 6/9] lib: test_mul_u64_u64_div_u64: Test both generic and arch versions
config: i386-randconfig-052-20251102 (https://download.01.org/0day-ci/archive/20251102/202511020341.TEhkaR65-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251102/202511020341.TEhkaR65-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511020341.TEhkaR65-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> lib/math/test_mul_u64_u64_div_u64.c:142:9: warning: "__div64_32" redefined
     142 | #define __div64_32 __div64_32
         |         ^~~~~~~~~~
   In file included from include/linux/math.h:6,
                    from include/linux/math64.h:6,
                    from include/linux/time.h:6,
                    from include/linux/stat.h:19,
                    from include/linux/module.h:13,
                    from lib/math/test_mul_u64_u64_div_u64.c:9:
   arch/x86/include/asm/div64.h:78:9: note: this is the location of the previous definition
      78 | #define __div64_32
         |         ^~~~~~~~~~


vim +/__div64_32 +142 lib/math/test_mul_u64_u64_div_u64.c

   140	
   141	/* Compile the generic mul_u64_add_u64_div_u64() code */
 > 142	#define __div64_32 __div64_32
   143	#define div_s64_rem div_s64_rem
   144	#define div64_u64_rem div64_u64_rem
   145	#define div64_u64 div64_u64
   146	#define div64_s64 div64_s64
   147	#define iter_div_u64_rem iter_div_u64_rem
   148	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH v4 next 6/9] lib: test_mul_u64_u64_div_u64: Test both generic and arch versions
  2025-10-29 17:38 ` [PATCH v4 next 6/9] lib: test_mul_u64_u64_div_u64: Test both generic and arch versions David Laight
  2025-10-29 18:53   ` Nicolas Pitre
  2025-11-01 19:35   ` kernel test robot
@ 2025-11-01 20:59   ` kernel test robot
  2025-11-02 10:36     ` David Laight
  2 siblings, 1 reply; 32+ messages in thread
From: kernel test robot @ 2025-11-01 20:59 UTC (permalink / raw)
  To: David Laight, Andrew Morton, linux-kernel
  Cc: llvm, oe-kbuild-all, Linux Memory Management List, David Laight,
	u.kleine-koenig, Nicolas Pitre, Oleg Nesterov, Peter Zijlstra,
	Biju Das, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Ingo Molnar, Thomas Gleixner, Li RongQing, Yu Kuai,
	Khazhismel Kumykov, Jens Axboe, x86

Hi David,

kernel test robot noticed the following build warnings:

[auto build test WARNING on next-20251029]

url:    https://github.com/intel-lab-lkp/linux/commits/David-Laight/lib-mul_u64_u64_div_u64-rename-parameter-c-to-d/20251030-025633
base:   next-20251029
patch link:    https://lore.kernel.org/r/20251029173828.3682-7-david.laight.linux%40gmail.com
patch subject: [PATCH v4 next 6/9] lib: test_mul_u64_u64_div_u64: Test both generic and arch versions
config: i386-buildonly-randconfig-004-20251102 (https://download.01.org/0day-ci/archive/20251102/202511020421.ZZPBAIIw-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251102/202511020421.ZZPBAIIw-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511020421.ZZPBAIIw-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> lib/math/test_mul_u64_u64_div_u64.c:142:9: warning: '__div64_32' macro redefined [-Wmacro-redefined]
     142 | #define __div64_32 __div64_32
         |         ^
   arch/x86/include/asm/div64.h:78:9: note: previous definition is here
      78 | #define __div64_32
         |         ^
   1 warning generated.


vim +/__div64_32 +142 lib/math/test_mul_u64_u64_div_u64.c

   140	
   141	/* Compile the generic mul_u64_add_u64_div_u64() code */
 > 142	#define __div64_32 __div64_32
   143	#define div_s64_rem div_s64_rem
   144	#define div64_u64_rem div64_u64_rem
   145	#define div64_u64 div64_u64
   146	#define div64_s64 div64_s64
   147	#define iter_div_u64_rem iter_div_u64_rem
   148	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH v4 next 6/9] lib: test_mul_u64_u64_div_u64: Test both generic and arch versions
  2025-11-01 20:59   ` kernel test robot
@ 2025-11-02 10:36     ` David Laight
  0 siblings, 0 replies; 32+ messages in thread
From: David Laight @ 2025-11-02 10:36 UTC (permalink / raw)
  To: kernel test robot
  Cc: Andrew Morton, linux-kernel, llvm, oe-kbuild-all,
	Linux Memory Management List, u.kleine-koenig, Nicolas Pitre,
	Oleg Nesterov, Peter Zijlstra, Biju Das, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
	Li RongQing, Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

On Sun, 2 Nov 2025 04:59:10 +0800
kernel test robot <lkp@intel.com> wrote:

> Hi David,
> 
> kernel test robot noticed the following build warnings:
> 
> [auto build test WARNING on next-20251029]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/David-Laight/lib-mul_u64_u64_div_u64-rename-parameter-c-to-d/20251030-025633
> base:   next-20251029
> patch link:    https://lore.kernel.org/r/20251029173828.3682-7-david.laight.linux%40gmail.com
> patch subject: [PATCH v4 next 6/9] lib: test_mul_u64_u64_div_u64: Test both generic and arch versions
> config: i386-buildonly-randconfig-004-20251102 (https://download.01.org/0day-ci/archive/20251102/202511020421.ZZPBAIIw-lkp@intel.com/config)
> compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251102/202511020421.ZZPBAIIw-lkp@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202511020421.ZZPBAIIw-lkp@intel.com/
> 
> All warnings (new ones prefixed by >>):
> 
> >> lib/math/test_mul_u64_u64_div_u64.c:142:9: warning: '__div64_32' macro redefined [-Wmacro-redefined]  
>      142 | #define __div64_32 __div64_32
>          |         ^
>    arch/x86/include/asm/div64.h:78:9: note: previous definition is here
>       78 | #define __div64_32

That is preceded by a comment that says it can't happen for x86-64.
I think it would be better as '#define __div64_32 @@@' so that you get a
compile error.
But that isn't part of this change.
>          |         ^
>    1 warning generated.
> 
> 
> vim +/__div64_32 +142 lib/math/test_mul_u64_u64_div_u64.c
> 
>    140	
>    141	/* Compile the generic mul_u64_add_u64_div_u64() code */
>  > 142	#define __div64_32 __div64_32  

It needs a preceding #undef
Although I'm not sure why a normal build doesn't show it.

Looks like I'll need to do a v5 :-(

	David

>    143	#define div_s64_rem div_s64_rem
>    144	#define div64_u64_rem div64_u64_rem
>    145	#define div64_u64 div64_u64
>    146	#define div64_s64 div64_s64
>    147	#define iter_div_u64_rem iter_div_u64_rem
>    148	
> 



* Re: [PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup()
  2025-10-31  4:29 ` [PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup() Andrew Morton
@ 2025-11-04 17:16   ` Nicolas Pitre
  0 siblings, 0 replies; 32+ messages in thread
From: Nicolas Pitre @ 2025-11-04 17:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Laight, linux-kernel, u.kleine-koenig, Oleg Nesterov,
	Peter Zijlstra, Biju Das, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Li RongQing,
	Yu Kuai, Khazhismel Kumykov, Jens Axboe, x86

On Thu, 30 Oct 2025, Andrew Morton wrote:

> Thanks, I added this to mm.git's mm-nonmm-unstable branch for some
> linux-next exposure.  I have a note that [3/9] may be updated in
> response to Nicolas's comment.

This is the change I'd like to see:

----- >8
From: Nicolas Pitre <npitre@baylibre.com>
Subject: lib: mul_u64_u64_div_u64(): optimize quick path for small numbers

If the 128-bit product is small enough (n_hi == 0) we should branch to
div64_u64() right away. This saves one test on this quick path, which is
more prevalent than the divide-by-0 cases, and div64_u64() can deal with
the (theoretically undefined) zero divisor just fine too. The cost
remains the same for regular cases.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
---
diff --git a/lib/math/div64.c b/lib/math/div64.c
index 4e4e962261c3..d1e92ea24fce 100644
--- a/lib/math/div64.c
+++ b/lib/math/div64.c
@@ -247,6 +247,9 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
 
 	n_hi = mul_u64_u64_add_u64(&n_lo, a, b, c);
 
+	if (!n_hi)
+		return div64_u64(n_lo, d);
+
 	if (unlikely(n_hi >= d)) {
 		/* trigger runtime exception if divisor is zero */
 		if (d == 0) {
@@ -259,9 +262,6 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
 		return ~0ULL;
 	}
 
-	if (!n_hi)
-		return div64_u64(n_lo, d);
-
 	/* Left align the divisor, shifting the dividend to match */
 	d_z_hi = __builtin_clzll(d);
 	if (d_z_hi) {


end of thread, other threads:[~2025-11-04 17:16 UTC | newest]

Thread overview: 32+ messages
2025-10-29 17:38 [PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup() David Laight
2025-10-29 17:38 ` [PATCH v4 next 1/9] lib: mul_u64_u64_div_u64() rename parameter 'c' to 'd' David Laight
2025-10-29 17:38 ` [PATCH v4 next 2/9] lib: mul_u64_u64_div_u64() Combine overflow and divide by zero checks David Laight
2025-10-29 18:02   ` Nicolas Pitre
2025-10-29 17:38 ` [PATCH v4 next 3/9] lib: mul_u64_u64_div_u64() simplify check for a 64bit product David Laight
2025-10-29 18:11   ` Nicolas Pitre
2025-10-31  9:19     ` David Laight
2025-10-31 17:26       ` Nicolas Pitre
2025-10-31 18:04         ` David Laight
2025-10-31 18:45           ` Nicolas Pitre
2025-10-31 20:12             ` David Laight
2025-10-29 17:38 ` [PATCH v4 next 4/9] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup() David Laight
2025-10-29 18:17   ` Nicolas Pitre
2025-10-31 20:59   ` David Laight
2025-11-01  2:12     ` Andrew Morton
2025-10-29 17:38 ` [PATCH v4 next 5/9] lib: Add tests for mul_u64_u64_div_u64_roundup() David Laight
2025-10-29 18:26   ` Nicolas Pitre
2025-10-29 17:38 ` [PATCH v4 next 6/9] lib: test_mul_u64_u64_div_u64: Test both generic and arch versions David Laight
2025-10-29 18:53   ` Nicolas Pitre
2025-11-01 19:35   ` kernel test robot
2025-11-01 20:59   ` kernel test robot
2025-11-02 10:36     ` David Laight
2025-10-29 17:38 ` [PATCH v4 next 7/9] lib: mul_u64_u64_div_u64() optimise multiply on 32bit x86 David Laight
2025-10-29 19:01   ` Nicolas Pitre
2025-10-29 17:38 ` [PATCH v4 next 8/9] lib: mul_u64_u64_div_u64() Optimise the divide code David Laight
2025-10-29 20:47   ` Nicolas Pitre
2025-10-29 17:38 ` [PATCH v4 next 9/9] lib: test_mul_u64_u64_div_u64: Test the 32bit code on 64bit David Laight
2025-10-29 20:48   ` Nicolas Pitre
2025-10-31  4:29 ` [PATCH v4 next 0/9] Implement mul_u64_u64_div_u64_roundup() Andrew Morton
2025-11-04 17:16   ` Nicolas Pitre
2025-10-31 13:52 ` Oleg Nesterov
2025-10-31 16:17   ` David Laight

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox