From: David Laight <david.laight.linux@gmail.com>
To: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
linux-kernel@vger.kernel.org, u.kleine-koenig@baylibre.com,
Nicolas Pitre <npitre@baylibre.com>,
Oleg Nesterov <oleg@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Biju Das <biju.das.jz@bp.renesas.com>,
Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
Ingo Molnar <mingo@redhat.com>,
Thomas Gleixner <tglx@linutronix.de>,
Li RongQing <lirongqing@baidu.com>,
Khazhismel Kumykov <khazhy@chromium.org>,
Jens Axboe <axboe@kernel.dk>,
x86@kernel.org
Subject: Re: [PATCH v5 next 4/9] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
Date: Thu, 6 Nov 2025 09:52:14 +0000 [thread overview]
Message-ID: <20251106095214.2b9c9b8c@pumpkin> (raw)
In-Reply-To: <128901BE-2651-4335-935F-34A5B50D811E@zytor.com>
On Wed, 05 Nov 2025 16:26:05 -0800
"H. Peter Anvin" <hpa@zytor.com> wrote:
> On November 5, 2025 12:10:30 PM PST, David Laight <david.laight.linux@gmail.com> wrote:
> >The existing mul_u64_u64_div_u64() rounds down, a 'rounding up'
> >variant needs 'divisor - 1' adding in between the multiply and
> >divide so cannot easily be done by a caller.
> >
> >Add mul_u64_add_u64_div_u64(a, b, c, d) that calculates (a * b + c)/d
> >and implement the 'round down' and 'round up' using it.
> >
> >Update the x86-64 asm to optimise for 'c' being a constant zero.
> >
> >Add kerndoc definitions for all three functions.
> >
> >Signed-off-by: David Laight <david.laight.linux@gmail.com>
> >Reviewed-by: Nicolas Pitre <npitre@baylibre.com>
> >---
> >
> >Changes for v2 (formally patch 1/3):
> >- Reinstate the early call to div64_u64() on 32bit when 'c' is zero.
> > Although I'm not convinced the path is common enough to be worth
> > the two ilog2() calls.
> >
> >Changes for v3 (formally patch 3/4):
> >- The early call to div64_u64() has been removed by patch 3.
> > Pretty much guaranteed to be a pessimisation.
> >
> >Changes for v4:
> >- For x86-64 split the multiply, add and divide into three asm blocks.
> > (gcc makes a pigs breakfast of (u128)a * b + c)
> >- Change the kerndoc since divide by zero will (probably) fault.
> >
> >Changes for v5:
> >- Fix test that excludes the add/adc asm block for constant zero 'add'.
> >
> > arch/x86/include/asm/div64.h | 20 +++++++++------
> > include/linux/math64.h | 48 +++++++++++++++++++++++++++++++++++-
> > lib/math/div64.c | 14 ++++++-----
> > 3 files changed, 67 insertions(+), 15 deletions(-)
> >
> >diff --git a/arch/x86/include/asm/div64.h b/arch/x86/include/asm/div64.h
> >index 9931e4c7d73f..6d8a3de3f43a 100644
> >--- a/arch/x86/include/asm/div64.h
> >+++ b/arch/x86/include/asm/div64.h
> >@@ -84,21 +84,25 @@ static inline u64 mul_u32_u32(u32 a, u32 b)
> > * Will generate an #DE when the result doesn't fit u64, could fix with an
> > * __ex_table[] entry when it becomes an issue.
> > */
> >-static inline u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div)
> >+static inline u64 mul_u64_add_u64_div_u64(u64 rax, u64 mul, u64 add, u64 div)
> > {
> >- u64 q;
> >+ u64 rdx;
> >
> >- asm ("mulq %2; divq %3" : "=a" (q)
> >- : "a" (a), "rm" (mul), "rm" (div)
> >- : "rdx");
> >+ asm ("mulq %[mul]" : "+a" (rax), "=d" (rdx) : [mul] "rm" (mul));
> >
> >- return q;
> >+ if (!statically_true(!add))
> >+ asm ("addq %[add], %[lo]; adcq $0, %[hi]" :
> >+ [lo] "+r" (rax), [hi] "+r" (rdx) : [add] "irm" (add));
> >+
> >+ asm ("divq %[div]" : "+a" (rax), "+d" (rdx) : [div] "rm" (div));
> >+
> >+ return rax;
> > }
> >-#define mul_u64_u64_div_u64 mul_u64_u64_div_u64
> >+#define mul_u64_add_u64_div_u64 mul_u64_add_u64_div_u64
...
>
> For the roundup case, I'm somewhat curious how this compares with doing:
I guess you are referring to the x86-64 asm version (left above).
> cmp $1, %rdx
> sbb $-1, %rax
>
> ... after the division. At least it means not needing to compute d - 1,
> saving an instruction as well as a register.
> Unfortunately using an lea instruction to compute %rax (which otherwise
> would incorporate both) isn't possible since it doesn't set the flags.
>
> The cmp; sbb sequence should be no slower than add;
> adc – I'm saying "no slower" because %rdx is never written to,
> so I think this is provably a better sequence; whether or not it is
> measurable is another thing (but if we are tweaking this stuff...)
I wanted the same function as the non-x64-64 version and 'multiply and add'
possibly has other uses.
The instruction to calculate 'd - 1' (if not a constant) will usually
execute in parallel with an earlier instruction (eg the multiply)
so will be pretty much 'zero cost'.
The add/adc pair are in the 'register dependency chain' - so add a clock each.
The same is true for your cmp/sbb pair.
(Except on pre-broadwell Intel cpu where adc/sbb are two clocks.
I've lost the full reference, the initial changes fixed 'adc $0,x' and
generated the carry flag immediately and only delayed the result.
The doc said 'adc $0,reg' not 'adc $const,reg' - so maybe the sbb $-1
was two clocks for longer.)
David
next prev parent reply other threads:[~2025-11-06 9:52 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-11-05 20:10 [PATCH v5 next 0/9] Implement mul_u64_u64_div_u64_roundup() David Laight
2025-11-05 20:10 ` [PATCH v5 next 1/9] lib: mul_u64_u64_div_u64() rename parameter 'c' to 'd' David Laight
2025-11-05 20:10 ` [PATCH v5 next 2/9] lib: mul_u64_u64_div_u64() Combine overflow and divide by zero checks David Laight
2025-11-05 20:10 ` [PATCH v5 next 3/9] lib: mul_u64_u64_div_u64() simplify check for a 64bit product David Laight
2025-11-05 20:59 ` Nicolas Pitre
2025-11-05 20:10 ` [PATCH v5 next 4/9] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup() David Laight
2025-11-06 0:26 ` H. Peter Anvin
2025-11-06 9:52 ` David Laight [this message]
2025-11-05 20:10 ` [PATCH v5 next 5/9] lib: Add tests for mul_u64_u64_div_u64_roundup() David Laight
2025-11-05 20:10 ` [PATCH v5 next 6/9] lib: test_mul_u64_u64_div_u64: Test both generic and arch versions David Laight
2025-11-05 20:10 ` [PATCH v5 next 7/9] lib: mul_u64_u64_div_u64() optimise multiply on 32bit x86 David Laight
2025-11-05 23:45 ` H. Peter Anvin
2025-11-06 9:26 ` David Laight
2025-11-05 20:10 ` [PATCH v5 next 8/9] lib: mul_u64_u64_div_u64() Optimise the divide code David Laight
2025-11-05 20:10 ` [PATCH v5 next 9/9] lib: test_mul_u64_u64_div_u64: Test the 32bit code on 64bit David Laight
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251106095214.2b9c9b8c@pumpkin \
--to=david.laight.linux@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=axboe@kernel.dk \
--cc=biju.das.jz@bp.renesas.com \
--cc=bp@alien8.de \
--cc=dave.hansen@linux.intel.com \
--cc=hpa@zytor.com \
--cc=khazhy@chromium.org \
--cc=linux-kernel@vger.kernel.org \
--cc=lirongqing@baidu.com \
--cc=mingo@redhat.com \
--cc=npitre@baylibre.com \
--cc=oleg@redhat.com \
--cc=peterz@infradead.org \
--cc=tglx@linutronix.de \
--cc=u.kleine-koenig@baylibre.com \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox