public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: David Laight <david.laight.linux@gmail.com>
To: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, u.kleine-koenig@baylibre.com,
	Nicolas Pitre <npitre@baylibre.com>,
	Oleg Nesterov <oleg@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Biju Das <biju.das.jz@bp.renesas.com>,
	Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Ingo Molnar <mingo@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Li RongQing <lirongqing@baidu.com>,
	Khazhismel Kumykov <khazhy@chromium.org>,
	Jens Axboe <axboe@kernel.dk>,
	x86@kernel.org
Subject: Re: [PATCH v5 next 4/9] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
Date: Thu, 6 Nov 2025 09:52:14 +0000	[thread overview]
Message-ID: <20251106095214.2b9c9b8c@pumpkin> (raw)
In-Reply-To: <128901BE-2651-4335-935F-34A5B50D811E@zytor.com>

On Wed, 05 Nov 2025 16:26:05 -0800
"H. Peter Anvin" <hpa@zytor.com> wrote:

> On November 5, 2025 12:10:30 PM PST, David Laight <david.laight.linux@gmail.com> wrote:
> >The existing mul_u64_u64_div_u64() rounds down, a 'rounding up'
> >variant needs 'divisor - 1' adding in between the multiply and
> >divide so cannot easily be done by a caller.
> >
> >Add mul_u64_add_u64_div_u64(a, b, c, d) that calculates (a * b + c)/d
> >and implement the 'round down' and 'round up' using it.
> >
> >Update the x86-64 asm to optimise for 'c' being a constant zero.
> >
> >Add kerndoc definitions for all three functions.
> >
> >Signed-off-by: David Laight <david.laight.linux@gmail.com>
> >Reviewed-by: Nicolas Pitre <npitre@baylibre.com>
> >---
> >
> >Changes for v2 (formally patch 1/3):
> >- Reinstate the early call to div64_u64() on 32bit when 'c' is zero.
> >  Although I'm not convinced the path is common enough to be worth
> >  the two ilog2() calls.
> > 
> >Changes for v3 (formally patch 3/4):
> >- The early call to div64_u64() has been removed by patch 3.
> >  Pretty much guaranteed to be a pessimisation.
> >
> >Changes for v4: 
> >- For x86-64 split the multiply, add and divide into three asm blocks.
> >  (gcc makes a pigs breakfast of (u128)a * b + c)
> >- Change the kerndoc since divide by zero will (probably) fault.
> >
> >Changes for v5:
> >- Fix test that excludes the add/adc asm block for constant zero 'add'.
> >
> > arch/x86/include/asm/div64.h | 20 +++++++++------
> > include/linux/math64.h       | 48 +++++++++++++++++++++++++++++++++++-
> > lib/math/div64.c             | 14 ++++++-----
> > 3 files changed, 67 insertions(+), 15 deletions(-)
> >
> >diff --git a/arch/x86/include/asm/div64.h b/arch/x86/include/asm/div64.h
> >index 9931e4c7d73f..6d8a3de3f43a 100644
> >--- a/arch/x86/include/asm/div64.h
> >+++ b/arch/x86/include/asm/div64.h
> >@@ -84,21 +84,25 @@ static inline u64 mul_u32_u32(u32 a, u32 b)
> >  * Will generate an #DE when the result doesn't fit u64, could fix with an
> >  * __ex_table[] entry when it becomes an issue.
> >  */
> >-static inline u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div)
> >+static inline u64 mul_u64_add_u64_div_u64(u64 rax, u64 mul, u64 add, u64 div)
> > {
> >-	u64 q;
> >+	u64 rdx;
> > 
> >-	asm ("mulq %2; divq %3" : "=a" (q)
> >-				: "a" (a), "rm" (mul), "rm" (div)
> >-				: "rdx");
> >+	asm ("mulq %[mul]" : "+a" (rax), "=d" (rdx) : [mul] "rm" (mul));
> > 
> >-	return q;
> >+	if (!statically_true(!add))
> >+		asm ("addq %[add], %[lo]; adcq $0, %[hi]" :
> >+			[lo] "+r" (rax), [hi] "+r" (rdx) : [add] "irm" (add));
> >+
> >+	asm ("divq %[div]" : "+a" (rax), "+d" (rdx) : [div] "rm" (div));
> >+
> >+	return rax;
> > }
> >-#define mul_u64_u64_div_u64 mul_u64_u64_div_u64
> >+#define mul_u64_add_u64_div_u64 mul_u64_add_u64_div_u64
...
> 
> For the roundup case, I'm somewhat curious how this compares with doing:

I guess you are referring to the x86-64 asm version (left above).

>    cmp $1, %rdx
>    sbb $-1, %rax
> 
> ... after the division. At least it means not needing to compute d - 1,
> saving an instruction as well as a register.

> Unfortunately using an lea instruction to compute %rax (which otherwise
>  would incorporate both) isn't possible since it doesn't set the flags.
> 
> The cmp; sbb sequence should be no slower than add;
> adc – I'm saying "no slower" because %rdx is never written to,
> so I think this is provably a better sequence; whether or not it is
> measurable is another thing (but if we are tweaking this stuff...)

I wanted the same function as the non-x64-64 version and 'multiply and add'
possibly has other uses.

The instruction to calculate 'd - 1' (if not a constant) will usually
execute in parallel with an earlier instruction (eg the multiply)
so will be pretty much 'zero cost'.
The add/adc pair are in the 'register dependency chain' - so add a clock each.
The same is true for your cmp/sbb pair.

(Except on pre-broadwell Intel cpu where adc/sbb are two clocks.
I've lost the full reference, the initial changes fixed 'adc $0,x' and
generated the carry flag immediately and only delayed the result.
The doc said 'adc $0,reg' not 'adc $const,reg' - so maybe the sbb $-1
was two clocks for longer.)

	David

  reply	other threads:[~2025-11-06  9:52 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-05 20:10 [PATCH v5 next 0/9] Implement mul_u64_u64_div_u64_roundup() David Laight
2025-11-05 20:10 ` [PATCH v5 next 1/9] lib: mul_u64_u64_div_u64() rename parameter 'c' to 'd' David Laight
2025-11-05 20:10 ` [PATCH v5 next 2/9] lib: mul_u64_u64_div_u64() Combine overflow and divide by zero checks David Laight
2025-11-05 20:10 ` [PATCH v5 next 3/9] lib: mul_u64_u64_div_u64() simplify check for a 64bit product David Laight
2025-11-05 20:59   ` Nicolas Pitre
2025-11-05 20:10 ` [PATCH v5 next 4/9] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup() David Laight
2025-11-06  0:26   ` H. Peter Anvin
2025-11-06  9:52     ` David Laight [this message]
2025-11-05 20:10 ` [PATCH v5 next 5/9] lib: Add tests for mul_u64_u64_div_u64_roundup() David Laight
2025-11-05 20:10 ` [PATCH v5 next 6/9] lib: test_mul_u64_u64_div_u64: Test both generic and arch versions David Laight
2025-11-05 20:10 ` [PATCH v5 next 7/9] lib: mul_u64_u64_div_u64() optimise multiply on 32bit x86 David Laight
2025-11-05 23:45   ` H. Peter Anvin
2025-11-06  9:26     ` David Laight
2025-11-05 20:10 ` [PATCH v5 next 8/9] lib: mul_u64_u64_div_u64() Optimise the divide code David Laight
2025-11-05 20:10 ` [PATCH v5 next 9/9] lib: test_mul_u64_u64_div_u64: Test the 32bit code on 64bit David Laight

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251106095214.2b9c9b8c@pumpkin \
    --to=david.laight.linux@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=axboe@kernel.dk \
    --cc=biju.das.jz@bp.renesas.com \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=hpa@zytor.com \
    --cc=khazhy@chromium.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lirongqing@baidu.com \
    --cc=mingo@redhat.com \
    --cc=npitre@baylibre.com \
    --cc=oleg@redhat.com \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=u.kleine-koenig@baylibre.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox