All of lore.kernel.org
 help / color / mirror / Atom feed
From: David Laight <david.laight.linux@gmail.com>
To: Nicolas Pitre <nico@fluxnic.net>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, u.kleine-koenig@baylibre.com,
	Oleg Nesterov <oleg@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Biju Das <biju.das.jz@bp.renesas.com>
Subject: Re: [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code
Date: Wed, 18 Jun 2025 18:54:32 +0100	[thread overview]
Message-ID: <20250618185432.5ce80e0d@pumpkin> (raw)
In-Reply-To: <s919qno2-782s-ooqq-0p19-sr754osn8n02@syhkavp.arg>

On Wed, 18 Jun 2025 11:39:20 -0400 (EDT)
Nicolas Pitre <nico@fluxnic.net> wrote:

> On Wed, 18 Jun 2025, David Laight wrote:
> 
> > On Tue, 17 Jun 2025 21:33:23 -0400 (EDT)
> > Nicolas Pitre <nico@fluxnic.net> wrote:
> >   
> > > On Tue, 17 Jun 2025, Nicolas Pitre wrote:
> > >   
> > > > On Sat, 14 Jun 2025, David Laight wrote:
> > > >     
> > > > > Replace the bit by bit algorithm with one that generates 16 bits
> > > > > per iteration on 32bit architectures and 32 bits on 64bit ones.
> > > > > 
> > > > > On my zen 5 this reduces the time for the tests (using the generic
> > > > > code) from ~3350ns to ~1000ns.
> > > > > 
> > > > > Running the 32bit algorithm on 64bit x86 takes ~1500ns.
> > > > > It'll be slightly slower on a real 32bit system, mostly due
> > > > > to register pressure.
> > > > > 
> > > > > The savings for 32bit x86 are much higher (tested in userspace).
> > > > > The worst case (lots of bits in the quotient) drops from ~900 clocks
> > > > > to ~130 (pretty much independant of the arguments).
> > > > > Other 32bit architectures may see better savings.
> > > > > 
> > > > > It is possibly to optimise for divisors that span less than
> > > > > __LONG_WIDTH__/2 bits. However I suspect they don't happen that often
> > > > > and it doesn't remove any slow cpu divide instructions which dominate
> > > > > the result.
> > > > > 
> > > > > Signed-off-by: David Laight <david.laight.linux@gmail.com>    
> > > > 
> > > > Nice work. I had to be fully awake to review this one.
> > > > Some suggestions below.    
> > > 
> > > Here's a patch with my suggestions applied to make it easier to figure 
> > > them out. The added "inline" is necessary to fix compilation on ARM32. 
> > > The "likely()" makes for better assembly and this part is pretty much 
> > > likely anyway. I've explained the rest previously, although this is a 
> > > better implementation.
> > > 
> > > commit 99ea338401f03efe5dbebe57e62bd7c588409c5c
> > > Author: Nicolas Pitre <nico@fluxnic.net>
> > > Date:   Tue Jun 17 14:42:34 2025 -0400
> > > 
> > >     fixup! lib: mul_u64_u64_div_u64() Optimise the divide code
> > > 
> > > diff --git a/lib/math/div64.c b/lib/math/div64.c
> > > index 3c9fe878ce68..740e59a58530 100644
> > > --- a/lib/math/div64.c
> > > +++ b/lib/math/div64.c
> > > @@ -188,7 +188,7 @@ EXPORT_SYMBOL(iter_div_u64_rem);
> > >  
> > >  #if !defined(mul_u64_add_u64_div_u64) || defined(test_mul_u64_add_u64_div_u64)
> > >  
> > > -static u64 mul_add(u32 a, u32 b, u32 c)
> > > +static inline u64 mul_add(u32 a, u32 b, u32 c)
> > >  {
> > >  	return add_u64_u32(mul_u32_u32(a, b), c);
> > >  }
> > > @@ -246,7 +246,7 @@ static inline u32 mul_u64_long_add_u64(u64 *p_lo, u64 a, u32 b, u64 c)
> > >  
> > >  u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
> > >  {
> > > -	unsigned long d_msig, q_digit;
> > > +	unsigned long n_long, d_msig, q_digit;
> > >  	unsigned int reps, d_z_hi;
> > >  	u64 quotient, n_lo, n_hi;
> > >  	u32 overflow;
> > > @@ -271,36 +271,21 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
> > >  
> > >  	/* Left align the divisor, shifting the dividend to match */
> > >  	d_z_hi = __builtin_clzll(d);
> > > -	if (d_z_hi) {
> > > +	if (likely(d_z_hi)) {
> > >  		d <<= d_z_hi;
> > >  		n_hi = n_hi << d_z_hi | n_lo >> (64 - d_z_hi);
> > >  		n_lo <<= d_z_hi;
> > >  	}
> > >  
> > > -	reps = 64 / BITS_PER_ITER;
> > > -	/* Optimise loop count for small dividends */
> > > -	if (!(u32)(n_hi >> 32)) {
> > > -		reps -= 32 / BITS_PER_ITER;
> > > -		n_hi = n_hi << 32 | n_lo >> 32;
> > > -		n_lo <<= 32;
> > > -	}
> > > -#if BITS_PER_ITER == 16
> > > -	if (!(u32)(n_hi >> 48)) {
> > > -		reps--;
> > > -		n_hi = add_u64_u32(n_hi << 16, n_lo >> 48);
> > > -		n_lo <<= 16;
> > > -	}
> > > -#endif
> > > -
> > >  	/* Invert the dividend so we can use add instead of subtract. */
> > >  	n_lo = ~n_lo;
> > >  	n_hi = ~n_hi;
> > >  
> > >  	/*
> > > -	 * Get the most significant BITS_PER_ITER bits of the divisor.
> > > +	 * Get the rounded-up most significant BITS_PER_ITER bits of the divisor.
> > >  	 * This is used to get a low 'guestimate' of the quotient digit.
> > >  	 */
> > > -	d_msig = (d >> (64 - BITS_PER_ITER)) + 1;
> > > +	d_msig = (d >> (64 - BITS_PER_ITER)) + !!(d << BITS_PER_ITER);  
> > 
> > If the low divisor bits are zero an alternative simpler divide
> > can be used (you want to detect it before the left align).
> > I deleted that because it complicates things and probably doesn't
> > happen often enough outside the tests cases.  
> 
> Depends. On 32-bit systems that might be worth it. Something like:
> 
> 	if (unlikely(sizeof(long) == 4 && !(u32)d))
> 		return div_u64(n_hi, d >> 32);

You need a bigger divide than that (96 bits by 32).
It is also possible this code is better than div_u64() !

My initial version optimised for divisors with max 16 bits using:

        d_z_hi = __builtin_clzll(d);
        d_z_lo = __builtin_ffsll(d) - 1;
        if (d_z_hi + d_z_lo >= 48) {
                // Max 16 bits in divisor, shift right
                if (d_z_hi < 48) {
                        n_lo = n_lo >> d_z_lo | n_hi << (64 - d_z_lo);
                        n_hi >>= d_z_lo;
                        d >>= d_z_lo;
                }
                return div_me_16(n_hi, n_lo >> 32, n_lo, d);
        }
	// Followed by the 'align left' code

with the 80 by 16 divide:
static u64 div_me_16(u32 acc, u32 n1, u32 n0, u32 d)
{
        u64 q = 0;

        if (acc) {
                acc = acc << 16 | n1 >> 16;
                q = (acc / d) << 16;
                acc = (acc % d) << 16 | (n1 & 0xffff);
        } else {
                acc = n1;
                if (!acc)
                        return n0 / d;
        }
        q |= acc / d;
        q <<= 32;
        acc = (acc % d) << 16 | (n0 >> 16);
        q |= (acc / d) << 16;
        acc = (acc % d) << 16 | (n0 & 0xffff);
        q |= acc / d;

        return q;
}

In the end I decided the cost of the 64bit ffsll() on 32bit was too much.
(I could cut it down to 3 'cmov' instead of 4 - and remember they have to be
conditional jumps in the kernel.)

> 
> > >  	/*
> > >  	 * Now do a 'long division' with BITS_PER_ITER bit 'digits'.
> > > @@ -308,12 +293,17 @@ u64 mul_u64_add_u64_div_u64(u64 a, u64 b, u64 c, u64 d)
> > >  	 * The worst case is dividing ~0 by 0x8000 which requires two subtracts.
> > >  	 */
> > >  	quotient = 0;
> > > -	while (reps--) {
> > > -		q_digit = (unsigned long)(~n_hi >> (64 - 2 * BITS_PER_ITER)) / d_msig;
> > > +	for (reps = 64 / BITS_PER_ITER; reps; reps--) {
> > > +		quotient <<= BITS_PER_ITER;
> > > +		n_long = ~n_hi >> (64 - 2 * BITS_PER_ITER);
> > >  		/* Shift 'n' left to align with the product q_digit * d */
> > >  		overflow = n_hi >> (64 - BITS_PER_ITER);
> > >  		n_hi = add_u64_u32(n_hi << BITS_PER_ITER, n_lo >> (64 - BITS_PER_ITER));
> > >  		n_lo <<= BITS_PER_ITER;
> > > +		/* cut it short if q_digit would be 0 */
> > > +		if (n_long < d_msig)
> > > +			continue;  
> > 
> > I don't think that is right.
> > d_msig is an overestimate so you can only skip the divide and mul_add().  
> 
> That's what I thought initially. But `overflow` was always 0xffff in 
> that case. I'd like to prove it mathematically, but empirically this 
> seems to be true with all edge cases I could think of.

You are right :-(
It all gets 'saved' because the next q_digit can be 17 bits.

> I also have a test rig using random numbers validating against the 
> native x86 128/64 div and it has been running for an hour.

I doubt random values will hit many nasty edge cases.

...
> >   
> > > +		q_digit = n_long / d_msig;  
> > 
> > I think you want to do the divide right at the top - maybe even if the
> > result isn't used!
> > All the shifts then happen while the divide instruction is in progress
> > (even without out-of-order execution).  
> 
> Only if there is an actual divide instruction available. If it is a 
> library call then you're screwed.

The library call may well short circuit it anyway.

> And the compiler ought to be able to do that kind of shuffling on its 
> own when there's a benefit.

In my experience it doesn't do a very good job - best to give it help.
In this case I'd bet it even moves it later on (especially if you add
a conditional path that doesn't need it).

Try getting (a + b) + (c + d) compiled that way rather than as
(a + (b + (c + d))) which has a longer register dependency chain.

> 
> > It is also quite likely that any cpu divide instruction takes a lot
> > less clocks when the dividend or quotient is small.
> > So if the quotient would be zero there isn't a stall waiting for the
> > divide to complete.
> > 
> > As I said before it is a trade off between saving a few clocks for
> > specific cases against adding clocks to all the others.
> > Leading zero bits on the dividend are very likely, quotients with
> > the low 16bits zero much less so (except for test cases).  
> 
> Given what I said above wrt overflow I think this is a good tradeoff.
> We just need to measure it.

Can you do accurate timings for arm64 or arm32?

I've found a 2004 Arm book that includes several I-cache busting
divide algorithms.
But I'm sure this pi-5 has hardware divide.

My suspicion is that provided the cpu has reasonable multiply instructions
this code will be reasonable with a 32 by 32 software divide.

According to Agner's tables all AMD Zen cpu have fast 64bit divide.
The same isn't true for Intel though, Skylake is 26 clocks for r32 but 35-88 for r64.
You have to get to IceLake (xx-10) to get a fast r64 divide.
Probably not enough to avoid for the 128 by 64 case, but there may be cases
where it is worth it.

I need to get my test farm set up - and source a zen1/2 system.

	David

> 
> 
> Nicolas


  parent reply	other threads:[~2025-06-18 17:54 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-14  9:53 [PATCH v3 next 00/10] Implement mul_u64_u64_div_u64_roundup() David Laight
2025-06-14  9:53 ` [PATCH v3 next 01/10] lib: mul_u64_u64_div_u64() rename parameter 'c' to 'd' David Laight
2025-06-14  9:53 ` [PATCH v3 next 02/10] lib: mul_u64_u64_div_u64() Use WARN_ONCE() for divide errors David Laight
2025-06-14 15:17   ` Nicolas Pitre
2025-06-14 21:26     ` David Laight
2025-06-14 22:23       ` Nicolas Pitre
2025-06-14  9:53 ` [PATCH v3 next 03/10] lib: mul_u64_u64_div_u64() simplify check for a 64bit product David Laight
2025-06-14 14:01   ` Nicolas Pitre
2025-06-14  9:53 ` [PATCH v3 next 04/10] lib: Add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup() David Laight
2025-06-14 14:06   ` Nicolas Pitre
2025-06-14  9:53 ` [PATCH v3 next 05/10] lib: Add tests for mul_u64_u64_div_u64_roundup() David Laight
2025-06-14 15:19   ` Nicolas Pitre
2025-06-17  4:30   ` Nicolas Pitre
2025-09-18 14:00     ` Uwe Kleine-König
2025-09-18 21:06       ` David Laight
2025-06-14  9:53 ` [PATCH v3 next 06/10] lib: test_mul_u64_u64_div_u64: Test both generic and arch versions David Laight
2025-06-14 15:25   ` Nicolas Pitre
2025-06-18  1:39     ` Nicolas Pitre
2025-06-14  9:53 ` [PATCH v3 next 07/10] lib: mul_u64_u64_div_u64() optimise multiply on 32bit x86 David Laight
2025-06-14 15:31   ` Nicolas Pitre
2025-06-14  9:53 ` [PATCH v3 next 08/10] lib: mul_u64_u64_div_u64() Separate multiply to a helper for clarity David Laight
2025-06-14 15:37   ` Nicolas Pitre
2025-06-14 21:30     ` David Laight
2025-06-14 22:27       ` Nicolas Pitre
2025-06-14  9:53 ` [PATCH v3 next 09/10] lib: mul_u64_u64_div_u64() Optimise the divide code David Laight
2025-06-17  4:16   ` Nicolas Pitre
2025-06-18  1:33     ` Nicolas Pitre
2025-06-18  9:16       ` David Laight
2025-06-18 15:39         ` Nicolas Pitre
2025-06-18 16:42           ` Nicolas Pitre
2025-06-18 17:54           ` David Laight [this message]
2025-06-18 20:12             ` Nicolas Pitre
2025-06-18 22:26               ` David Laight
2025-06-19  2:43                 ` Nicolas Pitre
2025-06-19  8:32                   ` David Laight
2025-06-26 21:46       ` David Laight
2025-06-27  3:48         ` Nicolas Pitre
2025-07-09 14:24   ` David Laight
2025-07-10  9:39     ` 答复: [????] " Li,Rongqing
2025-07-10 10:35       ` David Laight
2025-07-11 21:17     ` David Laight
2025-07-11 21:40       ` Nicolas Pitre
2025-07-14  7:06         ` David Laight
2025-06-14  9:53 ` [PATCH v3 next 10/10] lib: test_mul_u64_u64_div_u64: Test the 32bit code on 64bit David Laight
2025-06-14 10:27 ` [PATCH v3 next 00/10] Implement mul_u64_u64_div_u64_roundup() Peter Zijlstra
2025-06-14 11:59   ` David Laight

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250618185432.5ce80e0d@pumpkin \
    --to=david.laight.linux@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=biju.das.jz@bp.renesas.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=nico@fluxnic.net \
    --cc=oleg@redhat.com \
    --cc=peterz@infradead.org \
    --cc=u.kleine-koenig@baylibre.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.