public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/4] lib: vsprintf: Optimize division by 10 for small integers.
@ 2012-08-03  5:21 George Spelvin
  2012-08-03  5:21 ` [PATCH 2/4] lib: vsprintf: Optimize division by 10000 George Spelvin
                   ` (4 more replies)
  0 siblings, 5 replies; 29+ messages in thread
From: George Spelvin @ 2012-08-03  5:21 UTC (permalink / raw)
  To: vda.linux, mina86; +Cc: hughd, linux-kernel, linux

Shrink the reciprocal approximations used in put_dec_full4
based on the comments in put_dec_full9.

Signed-off-by: George Spelvin <linux@horizon.com>
Cc: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
---
 lib/vsprintf.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

I was looking over the code and noticed that the constants could be smaller.

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index c3f36d41..2f32fe8 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -243,13 +243,14 @@ char *put_dec(char *buf, unsigned long long n)
 
 /* Second algorithm: valid only for 64-bit long longs */
 
+/* See comment in put_dec_full9 for choice of constants */
 static noinline_for_stack
 char *put_dec_full4(char *buf, unsigned q)
 {
 	unsigned r;
-	r      = (q * 0xcccd) >> 19;
+	r      = (q * 0xccd) >> 15;
 	*buf++ = (q - 10 * r) + '0';
-	q      = (r * 0x199a) >> 16;
+	q      = (r * 0xcd) >> 11;
 	*buf++ = (r - 10 * q)  + '0';
 	r      = (q * 0xcd) >> 11;
 	*buf++ = (q - 10 * r)  + '0';
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 2/4] lib: vsprintf: Optimize division by 10000
  2012-08-03  5:21 [PATCH 1/4] lib: vsprintf: Optimize division by 10 for small integers George Spelvin
@ 2012-08-03  5:21 ` George Spelvin
  2012-09-23 17:30   ` Michal Nazarewicz
  2012-09-24  9:03   ` Denys Vlasenko
  2012-08-03  5:21 ` [PATCH 3/4] lib: vsprintf: Optimize put_dec_trunc8 George Spelvin
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 29+ messages in thread
From: George Spelvin @ 2012-08-03  5:21 UTC (permalink / raw)
  To: vda.linux, mina86; +Cc: hughd, linux-kernel, linux

The same multiply-by-inverse technique can be used to
convert division by 10000 to a 32x32->64-bit multiply.

Signed-off-by: George Spelvin <linux@horizon.com>
---
 lib/vsprintf.c |   60 +++++++++++++++++++++++++++++++-------------------------
 1 file changed, 33 insertions(+), 27 deletions(-)

This is something of an RFC, I haven't benchmarked the helper
function.  But it sure cleans up the code!

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 2f32fe8..a8e7392 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -245,17 +245,32 @@ char *put_dec(char *buf, unsigned long long n)
 
 /* See comment in put_dec_full9 for choice of constants */
 static noinline_for_stack
-char *put_dec_full4(char *buf, unsigned q)
+void put_dec_full4(char *buf, unsigned q)
 {
 	unsigned r;
 	r      = (q * 0xccd) >> 15;
-	*buf++ = (q - 10 * r) + '0';
+	buf[0] = (q - 10 * r) + '0';
 	q      = (r * 0xcd) >> 11;
-	*buf++ = (r - 10 * q)  + '0';
+	buf[1] = (r - 10 * q)  + '0';
 	r      = (q * 0xcd) >> 11;
-	*buf++ = (q - 10 * r)  + '0';
-	*buf++ = r + '0';
-	return buf;
+	buf[2] = (q - 10 * r)  + '0';
+	buf[3] = r + '0';
+}
+
+/*
+ * Call put_dec_full4 on x % 10000, return x / 10000.
+ * The approximation x/10000 == (x * 0x346DC5D7) >> 43
+ * holds for all x < 1,128,869,999.  The largest value this
+ * helper will ever be asked to convert is 1,125,520,955.
+ * (d1 in the put_dec code, assuming n is all-ones).
+ */
+static
+unsigned put_dec_helper4(char *buf, unsigned x)
+{
+        uint32_t q = (x * (uint64_t)0x346DC5D7) >> 43;
+
+        put_dec_full4(buf, x - q * 10000);
+        return q;
 }
 
 /* Based on code by Douglas W. Jones found at
@@ -277,28 +292,19 @@ char *put_dec(char *buf, unsigned long long n)
 	d3  = (h >> 16); /* implicit "& 0xffff" */
 
 	q   = 656 * d3 + 7296 * d2 + 5536 * d1 + ((uint32_t)n & 0xffff);
+	q = put_dec_helper4(buf, q);
+
+	q += 7671 * d3 + 9496 * d2 + 6 * d1;
+	q = put_dec_helper4(buf+4, q);
+
+	q += 4749 * d3 + 42 * d2;
+	q = put_dec_helper4(buf+8, q);
 
-	buf = put_dec_full4(buf, q % 10000);
-	q   = q / 10000;
-
-	d1  = q + 7671 * d3 + 9496 * d2 + 6 * d1;
-	buf = put_dec_full4(buf, d1 % 10000);
-	q   = d1 / 10000;
-
-	d2  = q + 4749 * d3 + 42 * d2;
-	buf = put_dec_full4(buf, d2 % 10000);
-	q   = d2 / 10000;
-
-	d3  = q + 281 * d3;
-	if (!d3)
-		goto done;
-	buf = put_dec_full4(buf, d3 % 10000);
-	q   = d3 / 10000;
-	if (!q)
-		goto done;
-	buf = put_dec_full4(buf, q);
- done:
-	while (buf[-1] == '0')
+	q += 281 * d3;
+	buf += 12;
+	if (q)
+		buf = put_dec_trunc8(buf, q);
+	else while (buf[-1] == '0')
 		--buf;
 
 	return buf;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 3/4] lib: vsprintf: Optimize put_dec_trunc8
  2012-08-03  5:21 [PATCH 1/4] lib: vsprintf: Optimize division by 10 for small integers George Spelvin
  2012-08-03  5:21 ` [PATCH 2/4] lib: vsprintf: Optimize division by 10000 George Spelvin
@ 2012-08-03  5:21 ` George Spelvin
  2012-09-23 14:18   ` Rabin Vincent
  2012-09-23 18:22   ` Michal Nazarewicz
  2012-08-03  5:21 ` [PATCH 4/4] lib: vsprintf: Fix broken comments George Spelvin
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 29+ messages in thread
From: George Spelvin @ 2012-08-03  5:21 UTC (permalink / raw)
  To: vda.linux, mina86; +Cc: hughd, linux-kernel, linux

If you're going to have a conditional branch after
each 32x32->64-bit multiply, might as well shrink the code
and make it a loop.

This also avoids using the long multiply for small integers.

(This leaves the comments in a confusing state, but that's a separate
patch to make review easier.)

Signed-off-by: George Spelvin <linux@horizon.com>
---
 lib/vsprintf.c |   20 ++++++--------------
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index a8e7392..3ca77b8 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -174,20 +174,12 @@ char *put_dec_trunc8(char *buf, unsigned r)
 	unsigned q;
 
 	/* Copy of previous function's body with added early returns */
-	q      = (r * (uint64_t)0x1999999a) >> 32;
-	*buf++ = (r - 10 * q) + '0'; /* 2 */
-	if (q == 0)
-		return buf;
-	r      = (q * (uint64_t)0x1999999a) >> 32;
-	*buf++ = (q - 10 * r) + '0'; /* 3 */
-	if (r == 0)
-		return buf;
-	q      = (r * (uint64_t)0x1999999a) >> 32;
-	*buf++ = (r - 10 * q) + '0'; /* 4 */
-	if (q == 0)
-		return buf;
-	r      = (q * (uint64_t)0x1999999a) >> 32;
-	*buf++ = (q - 10 * r) + '0'; /* 5 */
+	while (r >= 10000) {
+		q = r + '0';
+		r  = (r * (uint64_t)0x1999999a) >> 32;
+		*buf++ = q - 10*r;
+	}
+
 	if (r == 0)
 		return buf;
 	q      = (r * 0x199a) >> 16;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 4/4] lib: vsprintf: Fix broken comments
  2012-08-03  5:21 [PATCH 1/4] lib: vsprintf: Optimize division by 10 for small integers George Spelvin
  2012-08-03  5:21 ` [PATCH 2/4] lib: vsprintf: Optimize division by 10000 George Spelvin
  2012-08-03  5:21 ` [PATCH 3/4] lib: vsprintf: Optimize put_dec_trunc8 George Spelvin
@ 2012-08-03  5:21 ` George Spelvin
  2012-09-23 17:22 ` [PATCH 1/4] lib: vsprintf: Optimize division by 10 for small integers Michal Nazarewicz
  2012-09-24  9:06 ` Denys Vlasenko
  4 siblings, 0 replies; 29+ messages in thread
From: George Spelvin @ 2012-08-03  5:21 UTC (permalink / raw)
  To: vda.linux, mina86; +Cc: hughd, linux-kernel, linux

Numbering the 8 potential digits 2 though 9 never did make a lot of sense.

Signed-off-by: George Spelvin <linux@horizon.com>
---
 lib/vsprintf.c |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 3ca77b8..692e61b 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -182,19 +182,19 @@ char *put_dec_trunc8(char *buf, unsigned r)
 
 	if (r == 0)
 		return buf;
-	q      = (r * 0x199a) >> 16;
-	*buf++ = (r - 10 * q)  + '0'; /* 6 */
+	q      = (r * 0x199a) >> 16;	/* r <= 9999 */
+	*buf++ = (r - 10 * q)  + '0';
 	if (q == 0)
 		return buf;
-	r      = (q * 0xcd) >> 11;
-	*buf++ = (q - 10 * r)  + '0'; /* 7 */
+	r      = (q * 0xcd) >> 11;	/* q <= 999 */
+	*buf++ = (q - 10 * r)  + '0';
 	if (r == 0)
 		return buf;
-	q      = (r * 0xcd) >> 11;
-	*buf++ = (r - 10 * q) + '0'; /* 8 */
+	q      = (r * 0xcd) >> 11;	/* r <= 99 */
+	*buf++ = (r - 10 * q) + '0';
 	if (q == 0)
 		return buf;
-	*buf++ = q + '0'; /* 9 */
+	*buf++ = q + '0';		/* q <= 9 */
 	return buf;
 }
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] lib: vsprintf: Optimize put_dec_trunc8
  2012-08-03  5:21 ` [PATCH 3/4] lib: vsprintf: Optimize put_dec_trunc8 George Spelvin
@ 2012-09-23 14:18   ` Rabin Vincent
  2012-09-24 11:13     ` George Spelvin
  2012-09-24 14:33     ` George Spelvin
  2012-09-23 18:22   ` Michal Nazarewicz
  1 sibling, 2 replies; 29+ messages in thread
From: Rabin Vincent @ 2012-09-23 14:18 UTC (permalink / raw)
  To: George Spelvin, Andrew Morton; +Cc: vda.linux, mina86, hughd, linux-kernel

2012/8/3 George Spelvin <linux@horizon.com>:
> If you're going to have a conditional branch after
> each 32x32->64-bit multiply, might as well shrink the code
> and make it a loop.
>
> This also avoids using the long multiply for small integers.
>
> (This leaves the comments in a confusing state, but that's a separate
> patch to make review easier.)
>
> Signed-off-by: George Spelvin <linux@horizon.com>

This patch breaks IP address printing with "%pI4" (and by extension,
nfsroot).  Example:

 - Before: 10.0.0.1
 - After: 10...1

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/4] lib: vsprintf: Optimize division by 10 for small integers.
  2012-08-03  5:21 [PATCH 1/4] lib: vsprintf: Optimize division by 10 for small integers George Spelvin
                   ` (2 preceding siblings ...)
  2012-08-03  5:21 ` [PATCH 4/4] lib: vsprintf: Fix broken comments George Spelvin
@ 2012-09-23 17:22 ` Michal Nazarewicz
  2012-09-24 14:18   ` George Spelvin
  2012-09-24  9:06 ` Denys Vlasenko
  4 siblings, 1 reply; 29+ messages in thread
From: Michal Nazarewicz @ 2012-09-23 17:22 UTC (permalink / raw)
  To: George Spelvin, vda.linux; +Cc: hughd, linux-kernel, linux

[-- Attachment #1: Type: text/plain, Size: 1627 bytes --]

On Fri, Aug 03 2012, George Spelvin <linux@horizon.com> wrote:
> Shrink the reciprocal approximations used in put_dec_full4
> based on the comments in put_dec_full9.
>
> Signed-off-by: George Spelvin <linux@horizon.com>
> Cc: Denys Vlasenko <vda.linux@googlemail.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>

Have you verified that the comment is correct?

> ---
>  lib/vsprintf.c |    5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> I was looking over the code and noticed that the constants could be smaller.
>
> diff --git a/lib/vsprintf.c b/lib/vsprintf.c
> index c3f36d41..2f32fe8 100644
> --- a/lib/vsprintf.c
> +++ b/lib/vsprintf.c
> @@ -243,13 +243,14 @@ char *put_dec(char *buf, unsigned long long n)
>  
>  /* Second algorithm: valid only for 64-bit long longs */
>  
> +/* See comment in put_dec_full9 for choice of constants */
>  static noinline_for_stack
>  char *put_dec_full4(char *buf, unsigned q)
>  {
>  	unsigned r;
> -	r      = (q * 0xcccd) >> 19;
> +	r      = (q * 0xccd) >> 15;
>  	*buf++ = (q - 10 * r) + '0';
> -	q      = (r * 0x199a) >> 16;
> +	q      = (r * 0xcd) >> 11;
>  	*buf++ = (r - 10 * q)  + '0';
>  	r      = (q * 0xcd) >> 11;

If you are changing everything, this could also be changed to:

	r      = (q * 0x67) >> 10;

no?

>  	*buf++ = (q - 10 * r)  + '0';

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--

[-- Attachment #2.1: Type: text/plain, Size: 0 bytes --]



[-- Attachment #2.2: Type: application/pgp-signature, Size: 835 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/4] lib: vsprintf: Optimize division by 10000
  2012-08-03  5:21 ` [PATCH 2/4] lib: vsprintf: Optimize division by 10000 George Spelvin
@ 2012-09-23 17:30   ` Michal Nazarewicz
  2012-09-24 12:16     ` George Spelvin
  2012-09-24  9:03   ` Denys Vlasenko
  1 sibling, 1 reply; 29+ messages in thread
From: Michal Nazarewicz @ 2012-09-23 17:30 UTC (permalink / raw)
  To: George Spelvin, vda.linux; +Cc: hughd, linux-kernel, linux

[-- Attachment #1: Type: text/plain, Size: 613 bytes --]

On Fri, Aug 03 2012, George Spelvin wrote:
> The same multiply-by-inverse technique can be used to
> convert division by 10000 to a 32x32->64-bit multiply.
>
> Signed-off-by: George Spelvin <linux@horizon.com>

You are using a 64-bit multiply in a path that is designed for 32-bit
processors, which makes me feel that it will be slower.

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--

[-- Attachment #2.1: Type: text/plain, Size: 0 bytes --]



[-- Attachment #2.2: Type: application/pgp-signature, Size: 835 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] lib: vsprintf: Optimize put_dec_trunc8
  2012-08-03  5:21 ` [PATCH 3/4] lib: vsprintf: Optimize put_dec_trunc8 George Spelvin
  2012-09-23 14:18   ` Rabin Vincent
@ 2012-09-23 18:22   ` Michal Nazarewicz
  2012-09-24 11:46     ` George Spelvin
  1 sibling, 1 reply; 29+ messages in thread
From: Michal Nazarewicz @ 2012-09-23 18:22 UTC (permalink / raw)
  To: George Spelvin, vda.linux; +Cc: hughd, linux-kernel, linux

[-- Attachment #1: Type: text/plain, Size: 1919 bytes --]

On Fri, Aug 03 2012, George Spelvin wrote:
> If you're going to have a conditional branch after
> each 32x32->64-bit multiply, might as well shrink the code
> and make it a loop.
>
> This also avoids using the long multiply for small integers.
>
> (This leaves the comments in a confusing state, but that's a separate
> patch to make review easier.)
>
> Signed-off-by: George Spelvin <linux@horizon.com>

NAK.

> ---
>  lib/vsprintf.c |   20 ++++++--------------
>  1 file changed, 6 insertions(+), 14 deletions(-)
>
> diff --git a/lib/vsprintf.c b/lib/vsprintf.c
> index a8e7392..3ca77b8 100644
> --- a/lib/vsprintf.c
> +++ b/lib/vsprintf.c
> @@ -174,20 +174,12 @@ char *put_dec_trunc8(char *buf, unsigned r)
>  	unsigned q;
>  
>  	/* Copy of previous function's body with added early returns */
> -	q      = (r * (uint64_t)0x1999999a) >> 32;
> -	*buf++ = (r - 10 * q) + '0'; /* 2 */
> -	if (q == 0)
> -		return buf;
> -	r      = (q * (uint64_t)0x1999999a) >> 32;
> -	*buf++ = (q - 10 * r) + '0'; /* 3 */
> -	if (r == 0)
> -		return buf;
> -	q      = (r * (uint64_t)0x1999999a) >> 32;
> -	*buf++ = (r - 10 * q) + '0'; /* 4 */
> -	if (q == 0)
> -		return buf;
> -	r      = (q * (uint64_t)0x1999999a) >> 32;
> -	*buf++ = (q - 10 * r) + '0'; /* 5 */
> +	while (r >= 10000) {
> +		q = r + '0';
> +		r  = (r * (uint64_t)0x1999999a) >> 32;
> +		*buf++ = q - 10*r;
> +	}

This loop looks nothing like the original code.  Why are you adding '0'
at the beginning?  Also, the original code switches the role of q and r,
the loop does not.

>  	if (r == 0)
>  		return buf;
>  	q      = (r * 0x199a) >> 16;

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--

[-- Attachment #2.1: Type: text/plain, Size: 0 bytes --]



[-- Attachment #2.2: Type: application/pgp-signature, Size: 835 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/4] lib: vsprintf: Optimize division by 10000
  2012-08-03  5:21 ` [PATCH 2/4] lib: vsprintf: Optimize division by 10000 George Spelvin
  2012-09-23 17:30   ` Michal Nazarewicz
@ 2012-09-24  9:03   ` Denys Vlasenko
  2012-09-24 12:35     ` George Spelvin
  1 sibling, 1 reply; 29+ messages in thread
From: Denys Vlasenko @ 2012-09-24  9:03 UTC (permalink / raw)
  To: George Spelvin; +Cc: mina86, hughd, linux-kernel

On Fri, Aug 3, 2012 at 7:21 AM, George Spelvin <linux@horizon.com> wrote:
> The same multiply-by-inverse technique can be used to
> convert division by 10000 to a 32x32->64-bit multiply.
>
> Signed-off-by: George Spelvin <linux@horizon.com>
> ---
>  lib/vsprintf.c |   60 +++++++++++++++++++++++++++++++-------------------------
>  1 file changed, 33 insertions(+), 27 deletions(-)
>
> This is something of an RFC, I haven't benchmarked the helper
> function.  But it sure cleans up the code!

Can you please do that before you send alterations
to the code which *was* benchmarked?


> +static
> +unsigned put_dec_helper4(char *buf, unsigned x)
> +{
> +        uint32_t q = (x * (uint64_t)0x346DC5D7) >> 43;
> +
> +        put_dec_full4(buf, x - q * 10000);
> +        return q;
>  }
>
>  /* Based on code by Douglas W. Jones found at
> @@ -277,28 +292,19 @@ char *put_dec(char *buf, unsigned long long n)
>         d3  = (h >> 16); /* implicit "& 0xffff" */
>
>         q   = 656 * d3 + 7296 * d2 + 5536 * d1 + ((uint32_t)n & 0xffff);
> +       q = put_dec_helper4(buf, q);
> +
> +       q += 7671 * d3 + 9496 * d2 + 6 * d1;
> +       q = put_dec_helper4(buf+4, q);
> +
> +       q += 4749 * d3 + 42 * d2;
> +       q = put_dec_helper4(buf+8, q);
>
> -       buf = put_dec_full4(buf, q % 10000);
> -       q   = q / 10000;
> -
> -       d1  = q + 7671 * d3 + 9496 * d2 + 6 * d1;
> -       buf = put_dec_full4(buf, d1 % 10000);
> -       q   = d1 / 10000;
> -
> -       d2  = q + 4749 * d3 + 42 * d2;
> -       buf = put_dec_full4(buf, d2 % 10000);
> -       q   = d2 / 10000;
> -
> -       d3  = q + 281 * d3;
> -       if (!d3)
> -               goto done;
> -       buf = put_dec_full4(buf, d3 % 10000);
> -       q   = d3 / 10000;
> -       if (!q)
> -               goto done;
> -       buf = put_dec_full4(buf, q);
> - done:
> -       while (buf[-1] == '0')
> +       q += 281 * d3;
> +       buf += 12;
> +       if (q)
> +               buf = put_dec_trunc8(buf, q);
> +       else while (buf[-1] == '0')
>                 --buf;

gcc was already doing that optimization for us.
It will use a larger constant and shift, but that
changes neither size nor speed.

Moreover, put_dec_helper4 will get inlined,
so the generated assembly code will be very similar.

Here is the comparison of the x86-32 assembly
of the fragment which does "x / 10000" thing,
before and after the patch:

-01 c6                  add    %eax,%esi
-b8 59 17 b7 d1         mov    $0xd1b71759,%eax
-f7 e6                  mul    %esi
-89 d3                  mov    %edx,%ebx
-89 f2                  mov    %esi,%edx
-c1 eb 0d               shr    $0xd,%ebx

+01 c7                  add    %eax,%edi
+b8 d7 c5 6d 34         mov    $0x346dc5d7,%eax
+f7 e7                  mul    %edi
+89 55 e8               mov    %edx,-0x18(%ebp)
+8b 5d e8               mov    -0x18(%ebp),%ebx
+89 fa                  mov    %edi,%edx
+89 45 e4               mov    %eax,-0x1c(%ebp)
+c1 eb 0b               shr    $0xb,%ebx

Poor gcc got confused, and generated somewhat
worse code (spilling and immediately reloading upper
part of 32x32->64 multiply).

Please test and benchmark your changes to this code
before submitting them.

-- 
vda

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/4] lib: vsprintf: Optimize division by 10 for small integers.
  2012-08-03  5:21 [PATCH 1/4] lib: vsprintf: Optimize division by 10 for small integers George Spelvin
                   ` (3 preceding siblings ...)
  2012-09-23 17:22 ` [PATCH 1/4] lib: vsprintf: Optimize division by 10 for small integers Michal Nazarewicz
@ 2012-09-24  9:06 ` Denys Vlasenko
  2012-09-24 11:27   ` George Spelvin
  4 siblings, 1 reply; 29+ messages in thread
From: Denys Vlasenko @ 2012-09-24  9:06 UTC (permalink / raw)
  To: George Spelvin; +Cc: mina86, hughd, linux-kernel

On Fri, Aug 3, 2012 at 7:21 AM, George Spelvin <linux@horizon.com> wrote:
> Shrink the reciprocal approximations used in put_dec_full4
> based on the comments in put_dec_full9.
>
> Signed-off-by: George Spelvin <linux@horizon.com>
> Cc: Denys Vlasenko <vda.linux@googlemail.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> ---
>  lib/vsprintf.c |    5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> I was looking over the code and noticed that the constants could be smaller.
>
> diff --git a/lib/vsprintf.c b/lib/vsprintf.c
> index c3f36d41..2f32fe8 100644
> --- a/lib/vsprintf.c
> +++ b/lib/vsprintf.c
> @@ -243,13 +243,14 @@ char *put_dec(char *buf, unsigned long long n)
>
>  /* Second algorithm: valid only for 64-bit long longs */
>
> +/* See comment in put_dec_full9 for choice of constants */
>  static noinline_for_stack
>  char *put_dec_full4(char *buf, unsigned q)
>  {
>         unsigned r;
> -       r      = (q * 0xcccd) >> 19;
> +       r      = (q * 0xccd) >> 15;
>         *buf++ = (q - 10 * r) + '0';
> -       q      = (r * 0x199a) >> 16;
> +       q      = (r * 0xcd) >> 11;

I would use 16-bit shifts instead of smaller ones.
There may be CPUs on which "get upper half of 32-bit reg"
operation is cheaper or smaller than a shift.

-- 
vda

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] lib: vsprintf: Optimize put_dec_trunc8
  2012-09-23 14:18   ` Rabin Vincent
@ 2012-09-24 11:13     ` George Spelvin
  2012-09-24 14:33     ` George Spelvin
  1 sibling, 0 replies; 29+ messages in thread
From: George Spelvin @ 2012-09-24 11:13 UTC (permalink / raw)
  To: akpm, linux, rabin; +Cc: hughd, linux-kernel, mina86, mpn, vda.linux

Oh, joy, a new week and a nice dose of public humiliation to start it off.
(Remind me never to go AFK for a weekend again.)

Seriously, Rabin, thank you very much for the bug report and my
apologies for inflicting the bug on you in the first place.

Denys, good to hear from you.  I had hoped this would be signed off
by you before going upstream, but akpm picked it up first.


This code *was* extensively tested, but in a user-space test harness.
Moving it into the kernel was less so; I looked at a lot of /proc and
/sys files, but known answers are hard to come by there, and I didn't
actually write a complete test module.

(It's also running on the production file/mail server I'm sending
this from; I do eat my own dog food.)

However, thinking back, the sprintf code I used, for comparison with
the libc sprintf, wanted a binary->decimal primitive that converts 0 to
the empty string; getting ANSI compatible results from "%.0u" and "%.1u"
is easier with that convention.

My initial suspicion is that I didn't convert that convention properly
somehow.


Anyway, individual responses to follow.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/4] lib: vsprintf: Optimize division by 10 for small integers.
  2012-09-24  9:06 ` Denys Vlasenko
@ 2012-09-24 11:27   ` George Spelvin
  0 siblings, 0 replies; 29+ messages in thread
From: George Spelvin @ 2012-09-24 11:27 UTC (permalink / raw)
  To: linux, vda.linux; +Cc: hughd, linux-kernel, mina86

>> +/* See comment in put_dec_full9 for choice of constants */
>>  static noinline_for_stack
>>  char *put_dec_full4(char *buf, unsigned q)
>>  {
>>         unsigned r;
>> -       r      = (q * 0xcccd) >> 19;
>> +       r      = (q * 0xccd) >> 15;
>>         *buf++ = (q - 10 * r) + '0';
>> -       q      = (r * 0x199a) >> 16;
>> +       q      = (r * 0xcd) >> 11;

> I would use 16-bit shifts instead of smaller ones.
> There may be CPUs on which "get upper half of 32-bit reg"
> operation is cheaper or smaller than a shift.

Good point, but wouldn't those CPUs *also* have multi-cycle multiply,
or have to synthesize it out of shift-and-add, in which case smaller
constants would save even more cycles?

I'm thinking original MC68010 here, which I'm not sure is even
meaningful any more.  ColdFire has single-cycle shifts.

Can you think of a processor where that would actually be
an improvement?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] lib: vsprintf: Optimize put_dec_trunc8
  2012-09-23 18:22   ` Michal Nazarewicz
@ 2012-09-24 11:46     ` George Spelvin
  2012-09-24 12:29       ` Michal Nazarewicz
  0 siblings, 1 reply; 29+ messages in thread
From: George Spelvin @ 2012-09-24 11:46 UTC (permalink / raw)
  To: linux, mpn, vda.linux; +Cc: hughd, linux-kernel

>>  lib/vsprintf.c |   20 ++++++--------------
>>  1 file changed, 6 insertions(+), 14 deletions(-)
>>
>> diff --git a/lib/vsprintf.c b/lib/vsprintf.c
>> index a8e7392..3ca77b8 100644
>> --- a/lib/vsprintf.c
>> +++ b/lib/vsprintf.c
>> @@ -174,20 +174,12 @@ char *put_dec_trunc8(char *buf, unsigned r)
>>  	unsigned q;
>>=20=20
>>  	/* Copy of previous function's body with added early returns */
>> -	q      = (r * (uint64_t)0x1999999a) >> 32;
>> -	*buf++ = (r - 10 * q) + '0'; /* 2 */
>> -	if (q == 0)
>> -		return buf;
>> -	r      = (q * (uint64_t)0x1999999a) >> 32;
>> -	*buf++ = (q - 10 * r) + '0'; /* 3 */
>> -	if (r == 0)
>> -		return buf;
>> -	q      = (r * (uint64_t)0x1999999a) >> 32;
>> -	*buf++ = (r - 10 * q) + '0'; /* 4 */
>> -	if (q == 0)
>> -		return buf;
>> -	r      = (q * (uint64_t)0x1999999a) >> 32;
>> -	*buf++ = (q - 10 * r) + '0'; /* 5 */
>> +	while (r >= 10000) {
>> +		q = r + '0';
>> +		r  = (r * (uint64_t)0x1999999a) >> 32;
>> +		*buf++ = q - 10*r;
>> +	}

> This loop looks nothing like the original code.  Why are you adding '0'
> at the beginning?

Because I was trying to avoid useless bare "move" instructions
when converting to a non-swapping loop, and putting it up
there made clear that the copy could be combined with a
useful add on a 3-operand machine.  (Or even a 2-operand with
lea.)

Compilers are probably smart enough to figure that out by themselves,
but it mirrors my thinking as I was writing the code.

While it is a bit subtle, it's only three lines of code, and
I figured the equivalence to
"q = r; r = <math>; +buf++ = q - 10*r + '0';"
was pretty easy to see.


> Also, the original code switches the role of q and r,
> the loop does not.

Well, obviously; I had to do that to make it possible to put
into a loop.

Truthfully, it would have made *more* sense to swap q and r globally,
so the loop had a more sensible q=quotient/r=remainder assignment,
but I wanted to show that the unmodified tail was in fact unmodified.

The big saving from using a loop is that it avoids unnecessary
32x32->64-bit multiplies, falling through to the 16x16->32-bit
code as early as possible.  Given that most numbers are small,
this seemed like a significant win.

(As long as it doesn't add additional unpredictable conditional
branches, which are also expensive.)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/4] lib: vsprintf: Optimize division by 10000
  2012-09-23 17:30   ` Michal Nazarewicz
@ 2012-09-24 12:16     ` George Spelvin
  2012-09-24 12:41       ` Michal Nazarewicz
  0 siblings, 1 reply; 29+ messages in thread
From: George Spelvin @ 2012-09-24 12:16 UTC (permalink / raw)
  To: linux, mpn, vda.linux; +Cc: hughd, linux-kernel

> You are using a 64-bit multiply in a path that is designed for 32-bit
> processors, which makes me feel that it will be slower.

Slower than the divide it's replacing?

The following 32-bit processors have 32x32->64-bit multiply:

x86
ARM (as of ARMv4 = ARM7TDMI, the lowest version in common use)
SPARCv7, SPARCv8
MIPS32
MC68020
PA-RISC 1.1 (XMPYU)
avr32
PowerPC (MULHWU)
VAX (EMUL)

I could keep going through the full list of architectures in arch/,
but it's starting to get slow and I haven't hit one *without* a widening
multiply yet.  (And if it doesn't have hardware divide, I expect the
multiply is still faster.)

Ah!  Found one!  ColdFire MCF5272 has 32/32-bit divide, but only 32x32->32
multiply.  However, DIVU takes 20 or 35 cycles, which is pretty close to the
time to synthesize the multiply out of 4 16x16->32 pieces (4 cycles each).

I could do some Kconfig hacking and make the code path architecture-dependent.
Do you think it's worth it?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] lib: vsprintf: Optimize put_dec_trunc8
  2012-09-24 11:46     ` George Spelvin
@ 2012-09-24 12:29       ` Michal Nazarewicz
  2012-09-24 13:49         ` George Spelvin
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Nazarewicz @ 2012-09-24 12:29 UTC (permalink / raw)
  To: George Spelvin, linux, vda.linux; +Cc: hughd, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2488 bytes --]

>>> @@ -174,20 +174,12 @@ char *put_dec_trunc8(char *buf, unsigned r)
>>>  	unsigned q;
>>>  	/* Copy of previous function's body with added early returns */
>>> -	q      = (r * (uint64_t)0x1999999a) >> 32;
>>> -	*buf++ = (r - 10 * q) + '0'; /* 2 */
>>> -	if (q == 0)
>>> -		return buf;
>>> -	r      = (q * (uint64_t)0x1999999a) >> 32;
>>> -	*buf++ = (q - 10 * r) + '0'; /* 3 */
>>> -	if (r == 0)
>>> -		return buf;
>>> -	q      = (r * (uint64_t)0x1999999a) >> 32;
>>> -	*buf++ = (r - 10 * q) + '0'; /* 4 */
>>> -	if (q == 0)
>>> -		return buf;
>>> -	r      = (q * (uint64_t)0x1999999a) >> 32;
>>> -	*buf++ = (q - 10 * r) + '0'; /* 5 */
>>> +	while (r >= 10000) {
>>> +		q = r + '0';
>>> +		r  = (r * (uint64_t)0x1999999a) >> 32;
>>> +		*buf++ = q - 10*r;
>>> +	}

All right, I now see what the loop is doing (I couldn't grasp it
yesterday) and expect for r=0 it looks legit.

On Mon, Sep 24 2012, George Spelvin wrote:
> Truthfully, it would have made *more* sense to swap q and r globally,
> so the loop had a more sensible q=quotient/r=remainder assignment,
> but I wanted to show that the unmodified tail was in fact unmodified.

The original has it a bit awkwardly because it just copies code from
put_dec_full9() with the first iteration skipped.

> The big saving from using a loop is that it avoids unnecessary
> 32x32->64-bit multiplies, falling through to the 16x16->32-bit
> code as early as possible.  Given that most numbers are small,
> this seemed like a significant win.

Ah, makes sense.

I guess the following should work, even though it's not so pretty:

static noinline_for_stack
char *put_dec_trunc8(char *buf, unsigned r) {
	unsigned q;

	if (r > 10000) {
		do {
			q = r + '0';
			r = (r * (uint64_t)0x1999999a) >> 32;
			*buf++ = q - 10 * r;
		} while (r >= 10000);
		if (r == 0)
			return buf;
	}

	q      = (r * 0x199a) >> 16;
	*buf++ = (r - 10 * q)  + '0'; /* 6 */
	if (q == 0)
		return buf;
	r      = (q * 0xcd) >> 11;
	*buf++ = (q - 10 * r)  + '0'; /* 7 */
	if (r == 0)
		return buf;
	q      = (r * 0xcd) >> 11;
	*buf++ = (r - 10 * q) + '0'; /* 8 */
	if (q == 0)
		return buf;
	*buf++ = q + '0'; /* 9 */
	return buf;
}

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--

[-- Attachment #2.1: Type: text/plain, Size: 0 bytes --]



[-- Attachment #2.2: Type: application/pgp-signature, Size: 835 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/4] lib: vsprintf: Optimize division by 10000
  2012-09-24  9:03   ` Denys Vlasenko
@ 2012-09-24 12:35     ` George Spelvin
  2012-09-24 15:02       ` Denys Vlasenko
  0 siblings, 1 reply; 29+ messages in thread
From: George Spelvin @ 2012-09-24 12:35 UTC (permalink / raw)
  To: linux, vda.linux; +Cc: hughd, linux-kernel, mina86

> Here is the comparison of the x86-32 assembly
> of the fragment which does "x / 10000" thing,
> before and after the patch:

> -01 c6                  add    %eax,%esi
> -b8 59 17 b7 d1         mov    $0xd1b71759,%eax
> -f7 e6                  mul    %esi
> -89 d3                  mov    %edx,%ebx
> -89 f2                  mov    %esi,%edx
> -c1 eb 0d               shr    $0xd,%ebx
> 
> +01 c7                  add    %eax,%edi
> +b8 d7 c5 6d 34         mov    $0x346dc5d7,%eax
> +f7 e7                  mul    %edi
> +89 55 e8               mov    %edx,-0x18(%ebp)
> +8b 5d e8               mov    -0x18(%ebp),%ebx
> +89 fa                  mov    %edi,%edx
> +89 45 e4               mov    %eax,-0x1c(%ebp)
> +c1 eb 0b               shr    $0xb,%ebx
> 
> Poor gcc got confused, and generated somewhat
> worse code (spilling and immediately reloading upper
> part of 32x32->64 multiply).

> Please test and benchmark your changes to this code
> before submitting them.

Thanks for the feedback!  It very much *was* intended to start a
conversation with you, but the 7 week response delay somewhat interfered
with that process.

I was playing with it on ARM, where the results are a bit different.

As you can see, it fell out of some other word which *did* make a
useful difference.  I just hadn't tested this change in isolation,
which I realized as I wrote the final commit comment while cleaning
up the series for publication.

(And please excuse me if there's some paging delay on my part
to swap the whole business back in; it's been a while.)

I'll see if I can come up with something that provides the cleaner code
(do you agree that the source *looks* nicer?) and still makes GCC do
the right thing.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/4] lib: vsprintf: Optimize division by 10000
  2012-09-24 12:16     ` George Spelvin
@ 2012-09-24 12:41       ` Michal Nazarewicz
  2012-09-24 13:56         ` George Spelvin
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Nazarewicz @ 2012-09-24 12:41 UTC (permalink / raw)
  To: George Spelvin, linux, vda.linux; +Cc: hughd, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1102 bytes --]

On Mon, Sep 24 2012, George Spelvin wrote:
>> You are using a 64-bit multiply in a path that is designed for 32-bit
>> processors, which makes me feel that it will be slower.
>
> Slower than the divide it's replacing?

OK, granted, it might be faster after all. ;)  Still, I'd love to see
some benchmark.

> The following 32-bit processors have 32x32->64-bit multiply:
>
> x86
> ARM (as of ARMv4 = ARM7TDMI, the lowest version in common use)
> SPARCv7, SPARCv8

Didn't some SPARCs have 32x32->32 multiply?  I remember reading some
rant from a GMP developer about how SPARC is broken that way.

> MIPS32
> MC68020
> PA-RISC 1.1 (XMPYU)
> avr32
> PowerPC (MULHWU)
> VAX (EMUL)

> I could do some Kconfig hacking and make the code path
> architecture-dependent.  Do you think it's worth it?

Definitely not.

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--

[-- Attachment #2.1: Type: text/plain, Size: 0 bytes --]



[-- Attachment #2.2: Type: application/pgp-signature, Size: 835 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] lib: vsprintf: Optimize put_dec_trunc8
  2012-09-24 12:29       ` Michal Nazarewicz
@ 2012-09-24 13:49         ` George Spelvin
  2012-09-24 15:06           ` Michal Nazarewicz
  2012-09-25 11:44           ` George Spelvin
  0 siblings, 2 replies; 29+ messages in thread
From: George Spelvin @ 2012-09-24 13:49 UTC (permalink / raw)
  To: linux, mpn, vda.linux; +Cc: hughd, linux-kernel

Michal Nazarewicz <mpn@google.com> wrote:
> The original has it a bit awkwardly because it just copies code from
> put_dec_full9() with the first iteration skipped.

Yeah, it also makes the comments pretty confusing.

> I guess the following should work, even though it's not so pretty:
> 
> static noinline_for_stack
> char *put_dec_trunc8(char *buf, unsigned r) {
> 	unsigned q;
> 
> 	if (r > 10000) {
> 		do {
> 			q = r + '0';
> 			r = (r * (uint64_t)0x1999999a) >> 32;
> 			*buf++ = q - 10 * r;
> 		} while (r >= 10000);
> 		if (r == 0)
> 			return buf;
> 	}
> 
> 	q      = (r * 0x199a) >> 16;
> 	*buf++ = (r - 10 * q)  + '0'; /* 6 */
> 	if (q == 0)
> 		return buf;
> 	r      = (q * 0xcd) >> 11;
> 	*buf++ = (q - 10 * r)  + '0'; /* 7 */
> 	if (r == 0)
> 		return buf;
> 	q      = (r * 0xcd) >> 11;
> 	*buf++ = (r - 10 * q) + '0'; /* 8 */
> 	if (q == 0)
> 		return buf;
> 	*buf++ = q + '0'; /* 9 */
> 	return buf;
> }

Two bugs:

1) The initial "(r > 10000)" should be >=.
   If you let r == 10000 through to the remaining code, you'll get ":000".

2) The "r == 0" test isn't necessary.
   Given that the loop divides r by 10 each time, r >= 10000 at the
   beginning implies r >= 1000 at the end, so 1000 <= r < 10000
   when the loop exits.

   The only place you might need a test is if the "r >= 10000"
   test is *false*.  I.e.

	if (r > 10000) {
		/* Code */
	} else if (r == 0)
		return buf;

(But I think that last test is the bug I need to track down.)


You could reduce the number of conditional branches by using a binary
search tree for the number of digits to jump into an unrolled loop,
in the style of Duff's device, but I wasn't sure the complexity
was worth it.  And the q/r swapping makes it even messier.

Basically something like this, except I'd also probably change the
variable names, and modify the calling convention to return the decimal
in big-endian order:

char *put_dec_trunc8(char *buf, unsigned r)
{
	unsigned q;

	if (r < 10000)
		if (r < 100)
			if (r < 10)
				goto d1;
			else
				goto d2;
		else
			if (r < 1000)
				goto d3;
			else
				goto d4;
	else
		if (r < 1000000)
			if (r < 100000)
				goto d5;
			else
				goto d6;
		else
			if (r < 10000000)
				goto d7;
			else
				goto d8;
d8:
	q = r + '0';
	r = (r * (uint64_t)0x1999999a) >> 32;
	*buf++ = q - 10 * r;
d7:
	q = r + '0';
	r = (r * (uint64_t)0x1999999a) >> 32;
	*buf++ = q - 10 * r;
d6:
	q = r + '0';
	r = (r * (uint64_t)0x1999999a) >> 32;
	*buf++ = q - 10 * r;
d5:
	q = r + '0';
	r = (r * (uint64_t)0x1999999a) >> 32;
	*buf++ = q - 10 * r;
d4:
	q = r + '0';
	r = (r * 0x199a) >> 16;
	*buf++ = q - 10 * r;
d3:
	q = r + '0';
	r = (r * 0xcd) >> 11;
	*buf++ = q - 10 * r;
d2:
	q = r + '0';
	r = (r * 0xcd) >> 11;
	*buf++ = q - 10 * r;
d1:
	*buf++ = r + '0';
	return buf;
}

Another possibility would be to count the bits in r, convert that to
an estimate of the number of digits, and do one test to see which side
of the appropriate threshold it lines on.

Here are the ambiguous cases:
Bits	Digits
4	1 (8) or 2 (15)
7	2 (64) or 3 (127)
10	3 (512) or 4 (1023)
14	4 (8192) or 5 (16383)
17	5 (65536) or 6 (131071)
20	6 (524288) 7 (1048575)
24	7 (8388608) or 8 (16777215)
27	8 (67108864) or 9(134217727)

So, for this range, we have
(3*bits+8)/10 <= digits <= (3*bits+10)/10.
Does that get us anywhere?

Actually, those formulae are good for up to 105 bits!
I bet there's a simpler version with a power-of-2
divisor that's good enough for this range.

Some initial guesses
f1(bits) = (19 * bits + 51)/64 (valid for bits < 45)
f2(bits) = (19 * bits + 70)/64 (valid for bits < 44)

I couldn't make it work with a quotient of 32.

Then it would be either:

{
	static unsigned const pow10[] = { 1, 10, 100, 1000, 10000, ... };
	unsigned bits = sigbits(r);
	unsigned digits = f1(bits);
	unsigned digits_high = f2(bits);

	digits += digits_high != digits && r >= pow10[digits_high];

	switch (digits) {
	case 8:
	case 7:
		...
	case 1:
		*buf++ = r + '0';
	}
	return buf;
}

or the possibly simpler:
{
	static unsigned const pow10[] = { 1, 10, 100, 1000, 10000, ... };
	unsigned bits = sigbits(r);
	unsigned digits = f2(bits);	/* This is the high estimate! */

	switch (digits - (r < pow1[digits])) {
	case 8:
	case 7:
		...
	case 1:
		*buf++ = r + '0';
	}
	return buf;
}

I'll play with that; thanks for the inspiration!

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/4] lib: vsprintf: Optimize division by 10000
  2012-09-24 12:41       ` Michal Nazarewicz
@ 2012-09-24 13:56         ` George Spelvin
  2012-09-24 15:14           ` Geert Uytterhoeven
  0 siblings, 1 reply; 29+ messages in thread
From: George Spelvin @ 2012-09-24 13:56 UTC (permalink / raw)
  To: linux, mpn, vda.linux; +Cc: hughd, linux-kernel

Michal Nazarewicz <mpn@google.com> wrote:
> Didn't some SPARCs have 32x32->32 multiply?  I remember reading some
> rant from a GMP developer about how SPARC is broken that way.

SPARCv9 only has 64x64->64; there's no 128-bit result version.
That cuts large-integer math speed by a factor of 4 (very
crude approximation), to the same speed as 32x32->64.

SPARCv8 UMUL puts the high half of the 64-bit result into the Y
register, and SPARCv7 has a multiply-step instruction (MULScc) which
does likewise.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/4] lib: vsprintf: Optimize division by 10 for small integers.
  2012-09-23 17:22 ` [PATCH 1/4] lib: vsprintf: Optimize division by 10 for small integers Michal Nazarewicz
@ 2012-09-24 14:18   ` George Spelvin
  0 siblings, 0 replies; 29+ messages in thread
From: George Spelvin @ 2012-09-24 14:18 UTC (permalink / raw)
  To: linux, mpn, vda.linux; +Cc: hughd, linux-kernel

Michal Nazarewicz <mpn@google.com> wrote:
> On Fri, Aug 03 2012, George Spelvin <linux@horizon.com> wrote:
>> Shrink the reciprocal approximations used in put_dec_full4
>> based on the comments in put_dec_full9.
>
> Have you verified that the comment is correct?

I rechecked all the validity limits myself.

>>  	r      = (q * 0xcd) >> 11;

> If you are changing everything, this could also be changed to:
>
>	r      = (q * 0x67) >> 10;
>
> no?

Also in those comments is a statement I did *not* recheck, as it didn't
affect correctness, saying that 0xcd produces shorter code than 0x67 on
x86 (if the code is generated using shifts and adds).

That's why I left it that way.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] lib: vsprintf: Optimize put_dec_trunc8
  2012-09-23 14:18   ` Rabin Vincent
  2012-09-24 11:13     ` George Spelvin
@ 2012-09-24 14:33     ` George Spelvin
  2012-09-24 14:53       ` Michal Nazarewicz
  1 sibling, 1 reply; 29+ messages in thread
From: George Spelvin @ 2012-09-24 14:33 UTC (permalink / raw)
  To: akpm, linux, rabin; +Cc: hughd, linux-kernel, mina86, vda.linux

Rabin Vincent <rabin@rab.in> wrote:
> This patch breaks IP address printing with "%pI4" (and by extension,
> nfsroot).  Example:
>
>  - Before: 10.0.0.1
>  - After: 10...1

Mea culpa, and thank you for catching it!  As I said in my earlier
comment, I tested this most extensively wrapped by some sprintf code
that liked 0 converted to a 0-length string, as that works naturally
with the ANSI spec for %.0u.  And it turns out not to matter for the
usual printf code, as num_to_str special-cases that anyway.

The fix is straightforward:

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index e755083..9872855 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -180,8 +180,6 @@ char *put_dec_trunc8(char *buf, unsigned r)
 		*buf++ = q - 10*r;
 	}
 
-	if (r == 0)
-		return buf;
 	q      = (r * 0x199a) >> 16;	/* r <= 9999 */
 	*buf++ = (r - 10 * q)  + '0';
 	if (q == 0)

Inspired by Michal Nazarewicz, I have some ideas for more tweaking to
that code.


AKPM: How should I submit this to you?  Would you like it as a fixup
patch, or would you like a revised patch from baseline?  You're free to
do either manually and add my Signed-off-by: to the result if you want
the fix faster.

I'm also working on addressing Denys Vlasenko's comments, but I figure
the bugfix is more urgent.

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] lib: vsprintf: Optimize put_dec_trunc8
  2012-09-24 14:33     ` George Spelvin
@ 2012-09-24 14:53       ` Michal Nazarewicz
  2012-09-24 14:57         ` Michal Nazarewicz
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Nazarewicz @ 2012-09-24 14:53 UTC (permalink / raw)
  To: George Spelvin, akpm, linux, rabin; +Cc: hughd, linux-kernel, vda.linux

[-- Attachment #1: Type: text/plain, Size: 1143 bytes --]

On Mon, Sep 24 2012, George Spelvin wrote:
> The fix is straightforward:
>
> diff --git a/lib/vsprintf.c b/lib/vsprintf.c
> index e755083..9872855 100644
> --- a/lib/vsprintf.c
> +++ b/lib/vsprintf.c
> @@ -180,8 +180,6 @@ char *put_dec_trunc8(char *buf, unsigned r)
>  		*buf++ = q - 10*r;
>  	}
>  
> -	if (r == 0)
> -		return buf;
>  	q      = (r * 0x199a) >> 16;	/* r <= 9999 */
>  	*buf++ = (r - 10 * q)  + '0';
>  	if (q == 0)
>
> Inspired by Michal Nazarewicz, I have some ideas for more tweaking to
> that code.

Ah, right.  I also thought about that first but than started worrying
that it could produce unnecessary zeros if the loop iterates at least
once and exits with r being zero, but now I see that this cannot happen
since if the loop condition was true, r >= 10000 and it has been divided
by 10000 in the loop so now it's at least 1.

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--

[-- Attachment #2.1: Type: text/plain, Size: 0 bytes --]



[-- Attachment #2.2: Type: application/pgp-signature, Size: 835 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] lib: vsprintf: Optimize put_dec_trunc8
  2012-09-24 14:53       ` Michal Nazarewicz
@ 2012-09-24 14:57         ` Michal Nazarewicz
  0 siblings, 0 replies; 29+ messages in thread
From: Michal Nazarewicz @ 2012-09-24 14:57 UTC (permalink / raw)
  To: George Spelvin, akpm, linux, rabin; +Cc: hughd, linux-kernel, vda.linux

[-- Attachment #1: Type: text/plain, Size: 763 bytes --]

On Mon, Sep 24 2012, Michal Nazarewicz wrote:
> Ah, right.  I also thought about that first but than started worrying
> that it could produce unnecessary zeros if the loop iterates at least
> once and exits with r being zero, but now I see that this cannot happen
> since if the loop condition was true, r >= 10000 and it has been divided
> by 10000 in the loop so now it's at least 1.
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I of course meant “10” and “at least 1000”.

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--

[-- Attachment #2.1: Type: text/plain, Size: 0 bytes --]



[-- Attachment #2.2: Type: application/pgp-signature, Size: 835 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/4] lib: vsprintf: Optimize division by 10000
  2012-09-24 12:35     ` George Spelvin
@ 2012-09-24 15:02       ` Denys Vlasenko
  0 siblings, 0 replies; 29+ messages in thread
From: Denys Vlasenko @ 2012-09-24 15:02 UTC (permalink / raw)
  To: George Spelvin; +Cc: hughd, linux-kernel, mina86

[-- Attachment #1: Type: text/plain, Size: 2465 bytes --]

On Mon, Sep 24, 2012 at 2:35 PM, George Spelvin <linux@horizon.com> wrote:
>> Here is the comparison of the x86-32 assembly
>> of the fragment which does "x / 10000" thing,
>> before and after the patch:
>
>> -01 c6                  add    %eax,%esi
>> -b8 59 17 b7 d1         mov    $0xd1b71759,%eax
>> -f7 e6                  mul    %esi
>> -89 d3                  mov    %edx,%ebx
>> -89 f2                  mov    %esi,%edx
>> -c1 eb 0d               shr    $0xd,%ebx
>>
>> +01 c7                  add    %eax,%edi
>> +b8 d7 c5 6d 34         mov    $0x346dc5d7,%eax
>> +f7 e7                  mul    %edi
>> +89 55 e8               mov    %edx,-0x18(%ebp)
>> +8b 5d e8               mov    -0x18(%ebp),%ebx
>> +89 fa                  mov    %edi,%edx
>> +89 45 e4               mov    %eax,-0x1c(%ebp)
>> +c1 eb 0b               shr    $0xb,%ebx
>>
>> Poor gcc got confused, and generated somewhat
>> worse code (spilling and immediately reloading upper
>> part of 32x32->64 multiply).
>
>> Please test and benchmark your changes to this code
>> before submitting them.
>
> Thanks for the feedback!  It very much *was* intended to start a
> conversation with you, but the 7 week response delay somewhat interfered
> with that process.
>
> I was playing with it on ARM, where the results are a bit different.
>
> As you can see, it fell out of some other word which *did* make a
> useful difference.  I just hadn't tested this change in isolation,

Please find attached source of test program.

You need to touch test_new.c (or make a few copies of it
to experiment with different versions of code). test_header.c
and test_main.c contain benchmarking code and need not be modified.

It also includes a verification step, which would catch the bugs
you had in your patches.

Usage:

$ gcc [ --static] [-m32] -O2 -Wall test_new.c -otest_new
$ ./test_new
Conversions per second: 8:59964000 123:48272000 123456:37852000
12345678:34216000 123456789:23528000 2^32:23520000 2^64:17616000
Conversions per second: 8:60092000 123:48536000 123456:37836000
12345678:33924000 123456789:23580000 2^32:23372000 2^64:17608000
Conversions per second: 8:60084000 123:48396000 123456:37840000
12345678:34192000 123456789:23564000 2^32:23484000 2^64:17612000
Conversions per second: 8:60108000 123:48500000 123456:37872000
12345678:33996000 123456789:23576000 2^32:23524000 2^64:17612000
Tested 14680064      ^C
^^^^^^^^^^^^^^^^^^^^^^^^ tests correctness until user interrupts it
$

[-- Attachment #2: test_header.c --]
[-- Type: text/x-csrc, Size: 1708 bytes --]

#include <features.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdbool.h>
#include <sys/types.h> /* size_t */
#include <limits.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdarg.h>
#include <stddef.h>
#include <ctype.h>
#include <string.h>
#include <errno.h>
#include <stdio.h> /* for FILE */
#include <sys/time.h>
#include <time.h>

#define noinline_for_stack __attribute__((noinline))
typedef uint8_t u8;
typedef int16_t s16;

/*
 * do_div() is NOT a C function. It wants to return
 * two values (the quotient and the remainder), but
 * since that doesn't work very well in C, what it
 * does is:
 *
 * - modifies the 64-bit dividend _in_place_
 * - returns the 32-bit remainder
 *
 * This ends up being the most efficient "calling
 * convention" on x86.
 */
#if LONG_MAX == ((1U << 31)-1)
#define do_div(n, base)						\
({								\
	unsigned long __upper, __low, __high, __mod, __base;	\
	__base = (base);					\
	{							\
		asm("" : "=a" (__low), "=d" (__high) : "A" (n));\
		__upper = __high;				\
		if (__high) {					\
			__upper = __high % (__base);		\
			__high = __high / (__base);		\
		}						\
		asm("divl %2" : "=a" (__low), "=d" (__mod)	\
			: "rm" (__base), "0" (__low), "1" (__upper));	\
		asm("" : "=A" (n) : "a" (__low), "d" (__high));	\
	}							\
	__mod;							\
})
#elif LONG_MAX > ((1U << 31)-1)
#define do_div(n,base) ({                                       \
	uint32_t __base = (base);                               \
	uint32_t __rem;                                         \
	__rem = ((uint64_t)(n)) % __base;                       \
	(n) = ((uint64_t)(n)) / __base;                         \
	__rem;                                                  \
})
#endif

[-- Attachment #3: test_main.c --]
[-- Type: text/x-csrc, Size: 1766 bytes --]

static const struct printf_spec dummy_spec = {
	.type = FORMAT_TYPE_LONG_LONG,
	.flags = 0,
	.base = 10,
	.qualifier = 0,
	.field_width = 0,
	.precision = 0,
};

#define REPS (4000)

static unsigned measure_number(time_t* tp, unsigned long long num)
{
	char buf[64];
	time_t t, n;
	int i;
	unsigned count;

	t = *tp;
	count = 0;
	while (t == (n = time(NULL))) {
		for (i = REPS; --i >= 0;)
			number(buf, buf + 63, num, dummy_spec)[0] = '\0';
		count += REPS;
	}
	*tp = n;
	return count;
}

static void measure()
{
	time_t t, n;
	unsigned count1, count2, count3, count4, count5, count6, count7;

	t = time(NULL);
	while (t == (n = time(NULL)))
		continue;
	t = n;

	count1 = measure_number(&t, 8);
	count2 = measure_number(&t, 123);
	count3 = measure_number(&t, 123456);
	count4 = measure_number(&t, 12345678);
	count5 = measure_number(&t, 123456789);
	count6 = measure_number(&t, (1ULL << 32) - 1);
	count7 = measure_number(&t, (0ULL - 1));

	printf("Conversions per second: 8:%d 123:%d 123456:%d 12345678:%d 123456789:%d 2^32:%d 2^64:%d\n",
		count1,
		count2,
		count3,
		count4,
		count5,
		count6,
		count7
	);
}

static void check(unsigned long long v)
{
	char buf1[64];
	char buf2[64];
	number(buf1, buf1 + 63, v, dummy_spec)[0] = '\0';
	sprintf(buf2, "%llu", v);
	if (strcmp(buf1, buf2) != 0) {
		printf("Error in formatting %llu:'%s'\n", v, buf1);
		exit(1);
	}
}

int main()
{
	measure();measure();
	measure();measure();
	//measure();measure();

	unsigned long long i;
	for (i = 0; ; i++) {
		check(i);
//		check(((1ULL << 32) - 1) + i);
//		check(((1ULL << 32) - 1) - i);
		check(0ULL - i);
//		check(i * i);
//		check(i * i * i);
		if (!(i & 0x3ffff)) {
			printf("\rTested %llu                               ", i);
			fflush(NULL);
		}
	}

	return 0;
}

[-- Attachment #4: test_new.c --]
[-- Type: text/x-csrc, Size: 10005 bytes --]

#include "test_header.c"

/* Decimal conversion is by far the most typical, and is used
 * for /proc and /sys data. This directly impacts e.g. top performance
 * with many processes running. We optimize it for speed
 * using ideas described at <http://www.cs.uiowa.edu/~jones/bcd/divide.html>
 * (with permission from the author, Douglas W. Jones).
 */

#if LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)
/* Formats correctly any integer in [0, 999999999] */
static noinline_for_stack
char *put_dec_full9(char *buf, unsigned q)
{
	unsigned r;

	/* Possible ways to approx. divide by 10
	 * (x * 0x1999999a) >> 32 x < 1073741829 (multiply must be 64-bit)
	 * (x * 0xcccd) >> 19     x <      81920 (x < 262149 when 64-bit mul)
	 * (x * 0x6667) >> 18     x <      43699
	 * (x * 0x3334) >> 17     x <      16389
	 * (x * 0x199a) >> 16     x <      16389
	 * (x * 0x0ccd) >> 15     x <      16389
	 * (x * 0x0667) >> 14     x <       2739
	 * (x * 0x0334) >> 13     x <       1029
	 * (x * 0x019a) >> 12     x <       1029
	 * (x * 0x00cd) >> 11     x <       1029 shorter code than * 0x67 (on i386)
	 * (x * 0x0067) >> 10     x <        179
	 * (x * 0x0034) >>  9     x <         69 same
	 * (x * 0x001a) >>  8     x <         69 same
	 * (x * 0x000d) >>  7     x <         69 same, shortest code (on i386)
	 * (x * 0x0007) >>  6     x <         19
	 * See <http://www.cs.uiowa.edu/~jones/bcd/divide.html>
	 */
	r      = (q * (uint64_t)0x1999999a) >> 32;
	*buf++ = (q - 10 * r) + '0'; /* 1 */
	q      = (r * (uint64_t)0x1999999a) >> 32;
	*buf++ = (r - 10 * q) + '0'; /* 2 */
	r      = (q * (uint64_t)0x1999999a) >> 32;
	*buf++ = (q - 10 * r) + '0'; /* 3 */
	q      = (r * (uint64_t)0x1999999a) >> 32;
	*buf++ = (r - 10 * q) + '0'; /* 4 */
	r      = (q * (uint64_t)0x1999999a) >> 32;
	*buf++ = (q - 10 * r) + '0'; /* 5 */
	/* Now value is under 10000, can avoid 64-bit multiply */
	q      = (r * 0x199a) >> 16;
	*buf++ = (r - 10 * q)  + '0'; /* 6 */
	r      = (q * 0xcd) >> 11;
	*buf++ = (q - 10 * r)  + '0'; /* 7 */
	q      = (r * 0xcd) >> 11;
	*buf++ = (r - 10 * q) + '0'; /* 8 */
	*buf++ = q + '0'; /* 9 */
	return buf;
}
#endif

/* Similar to above but do not pad with zeros.
 * Code can be easily arranged to print 9 digits too, but our callers
 * always call put_dec_full9() instead when the number has 9 decimal digits.
 */
static noinline_for_stack
char *put_dec_trunc8(char *buf, unsigned r)
{
	unsigned q;

	/* Copy of previous function's body with added early returns */
	q      = (r * (uint64_t)0x1999999a) >> 32;
	*buf++ = (r - 10 * q) + '0'; /* 2 */
	if (q == 0) return buf;
	r      = (q * (uint64_t)0x1999999a) >> 32;
	*buf++ = (q - 10 * r) + '0'; /* 3 */
	if (r == 0) return buf;
	q      = (r * (uint64_t)0x1999999a) >> 32;
	*buf++ = (r - 10 * q) + '0'; /* 4 */
	if (q == 0) return buf;
	r      = (q * (uint64_t)0x1999999a) >> 32;
	*buf++ = (q - 10 * r) + '0'; /* 5 */
	if (r == 0) return buf;
	q      = (r * 0x199a) >> 16;
	*buf++ = (r - 10 * q)  + '0'; /* 6 */
	if (q == 0) return buf;
	r      = (q * 0xcd) >> 11;
	*buf++ = (q - 10 * r)  + '0'; /* 7 */
	if (r == 0) return buf;
	q      = (r * 0xcd) >> 11;
	*buf++ = (r - 10 * q) + '0'; /* 8 */
	if (q == 0) return buf;
	*buf++ = q + '0'; /* 9 */
	return buf;
}

/* There are two algorithms to print larger numbers.
 * One is generic: divide by 1000000000 and repeatedly print
 * groups of (up to) 9 digits. It's conceptually simple,
 * but requires a (unsigned long long) / 1000000000 division.
 *
 * Second algorithm splits 64-bit unsigned long long into 16-bit chunks,
 * manipulates them cleverly and generates groups of 4 decimal digits.
 * It so happens that it does NOT require long long division.
 *
 * If long is > 32 bits, division of 64-bit values is relatively easy,
 * and we will use the first algorithm.
 * If long long is > 64 bits (strange architecture with VERY large long long),
 * second algorithm can't be used, and we again use the first one.
 *
 * Else (if long is 32 bits and long long is 64 bits) we use second one.
 */

#if LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)

/* First algorithm: generic */

static
char *put_dec(char *buf, unsigned long long n)
{
	if (n >= 100*1000*1000) {
		while (n >= 1000*1000*1000)
			buf = put_dec_full9(buf, do_div(n, 1000*1000*1000));
		if (n >= 100*1000*1000)
			return put_dec_full9(buf, n);
	}
	return put_dec_trunc8(buf, n);
}

#else

/* Second algorithm: valid only for 32-bit longs, 64-bit long longs */

static noinline_for_stack
char *put_dec_full4(char *buf, unsigned q)
{
	unsigned r;
	r      = (q * 0xcccd) >> 19;
	*buf++ = (q - 10 * r) + '0';
	q      = (r * 0x199a) >> 16;
	*buf++ = (r - 10 * q)  + '0';
	r      = (q * 0xcd) >> 11;
	*buf++ = (q - 10 * r)  + '0';
	*buf++ = r + '0';
	return buf;
}

/* Based on code by Douglas W. Jones found at
 * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html#sixtyfour>
 * (with permission from the author).
 * Performs no 64-bit division and hence should be fast on 32-bit machines.
 */
static
char *put_dec(char *buf, unsigned long long n)
{
	uint32_t d3, d2, d1, q, h;

	if (n < 100*1000*1000)
		return put_dec_trunc8(buf, n);

	d1  = ((uint32_t)n >> 16); /* implicit "& 0xffff" */
	h   = (n >> 32);
	d2  = (h      ) & 0xffff;
	d3  = (h >> 16); /* implicit "& 0xffff" */

	q   = 656 * d3 + 7296 * d2 + 5536 * d1 + ((uint32_t)n & 0xffff);

	buf = put_dec_full4(buf, q % 10000);
	q   = q / 10000;

	d1  = q + 7671 * d3 + 9496 * d2 + 6 * d1;
	buf = put_dec_full4(buf, d1 % 10000);
	q   = d1 / 10000;

	d2  = q + 4749 * d3 + 42 * d2;
	buf = put_dec_full4(buf, d2 % 10000);
	q   = d2 / 10000;

	d3  = q + 281 * d3;
	if (!d3)
		goto done;
	buf = put_dec_full4(buf, d3 % 10000);
	q   = d3 / 10000;
	if (!q)
		goto done;
	buf = put_dec_full4(buf, q);
 done:
	while (buf[-1] == '0')
		--buf;

	return buf;
}

#endif

/*
 * Convert passed number to decimal string.
 * Returns the length of string.  On buffer overflow, returns 0.
 *
 * If speed is not important, use snprintf(). It's easy to read the code.
 */
int num_to_str(char *buf, int size, unsigned long long num)
{
	char tmp[sizeof(num) * 3];
	int idx, len;

	/* put_dec() may work incorrectly for num = 0 (generate "", not "0") */
	if (num <= 9) {
		tmp[0] = '0' + num;
		len = 1;
	} else {
		len = put_dec(tmp, num) - tmp;
	}

	if (len > size)
		return 0;
	for (idx = 0; idx < len; ++idx)
		buf[idx] = tmp[len - idx - 1];
	return len;
}

#define ZEROPAD	1		/* pad with zero */
#define SIGN	2		/* unsigned/signed long */
#define PLUS	4		/* show plus */
#define SPACE	8		/* space if plus */
#define LEFT	16		/* left justified */
#define SMALL	32		/* use lowercase in hex (must be 32 == 0x20) */
#define SPECIAL	64		/* prefix hex with "0x", octal with "0" */

enum format_type {
	FORMAT_TYPE_NONE, /* Just a string part */
	FORMAT_TYPE_WIDTH,
	FORMAT_TYPE_PRECISION,
	FORMAT_TYPE_CHAR,
	FORMAT_TYPE_STR,
	FORMAT_TYPE_PTR,
	FORMAT_TYPE_PERCENT_CHAR,
	FORMAT_TYPE_INVALID,
	FORMAT_TYPE_LONG_LONG,
	FORMAT_TYPE_ULONG,
	FORMAT_TYPE_LONG,
	FORMAT_TYPE_UBYTE,
	FORMAT_TYPE_BYTE,
	FORMAT_TYPE_USHORT,
	FORMAT_TYPE_SHORT,
	FORMAT_TYPE_UINT,
	FORMAT_TYPE_INT,
	FORMAT_TYPE_NRCHARS,
	FORMAT_TYPE_SIZE_T,
	FORMAT_TYPE_PTRDIFF
};

struct printf_spec {
	u8	type;		/* format_type enum */
	u8	flags;		/* flags to number() */
	u8	base;		/* number base, 8, 10 or 16 only */
	u8	qualifier;	/* number qualifier, one of 'hHlLtzZ' */
	s16	field_width;	/* width of output field */
	s16	precision;	/* # of digits/chars */
};

static noinline_for_stack
char *number(char *buf, char *end, unsigned long long num,
	     struct printf_spec spec)
{
	/* we are called with base 8, 10 or 16, only, thus don't need "G..."  */
	static const char digits[16] = "0123456789ABCDEF"; /* "GHIJKLMNOPQRSTUVWXYZ"; */

	char tmp[66];
	char sign;
	char locase;
	int need_pfx = ((spec.flags & SPECIAL) && spec.base != 10);
	int i;

	/* locase = 0 or 0x20. ORing digits or letters with 'locase'
	 * produces same digits or (maybe lowercased) letters */
	locase = (spec.flags & SMALL);
	if (spec.flags & LEFT)
		spec.flags &= ~ZEROPAD;
	sign = 0;
	if (spec.flags & SIGN) {
		if ((signed long long)num < 0) {
			sign = '-';
			num = -(signed long long)num;
			spec.field_width--;
		} else if (spec.flags & PLUS) {
			sign = '+';
			spec.field_width--;
		} else if (spec.flags & SPACE) {
			sign = ' ';
			spec.field_width--;
		}
	}
	if (need_pfx) {
		spec.field_width--;
		if (spec.base == 16)
			spec.field_width--;
	}

	/* generate full string in tmp[], in reverse order */
	i = 0;
	if (num < 8)
		tmp[i++] = '0' + num;
	/* Generic code, for any base:
	else do {
		tmp[i++] = (digits[do_div(num,base)] | locase);
	} while (num != 0);
	*/
	else if (spec.base != 10) { /* 8 or 16 */
		int mask = spec.base - 1;
		int shift = 3;

		if (spec.base == 16)
			shift = 4;
		do {
			tmp[i++] = (digits[((unsigned char)num) & mask] | locase);
			num >>= shift;
		} while (num);
	} else { /* base 10 */
		i = put_dec(tmp, num) - tmp;
	}

	/* printing 100 using %2d gives "100", not "00" */
	if (i > spec.precision)
		spec.precision = i;
	/* leading space padding */
	spec.field_width -= spec.precision;
	if (!(spec.flags & (ZEROPAD+LEFT))) {
		while (--spec.field_width >= 0) {
			if (buf < end)
				*buf = ' ';
			++buf;
		}
	}
	/* sign */
	if (sign) {
		if (buf < end)
			*buf = sign;
		++buf;
	}
	/* "0x" / "0" prefix */
	if (need_pfx) {
		if (buf < end)
			*buf = '0';
		++buf;
		if (spec.base == 16) {
			if (buf < end)
				*buf = ('X' | locase);
			++buf;
		}
	}
	/* zero or space padding */
	if (!(spec.flags & LEFT)) {
		char c = (spec.flags & ZEROPAD) ? '0' : ' ';
		while (--spec.field_width >= 0) {
			if (buf < end)
				*buf = c;
			++buf;
		}
	}
	/* hmm even more zero padding? */
	while (i <= --spec.precision) {
		if (buf < end)
			*buf = '0';
		++buf;
	}
	/* actual digits of result */
	while (--i >= 0) {
		if (buf < end)
			*buf = tmp[i];
		++buf;
	}
	/* trailing space padding */
	while (--spec.field_width >= 0) {
		if (buf < end)
			*buf = ' ';
		++buf;
	}

	return buf;
}

#include "test_main.c"

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] lib: vsprintf: Optimize put_dec_trunc8
  2012-09-24 13:49         ` George Spelvin
@ 2012-09-24 15:06           ` Michal Nazarewicz
  2012-09-25 11:44           ` George Spelvin
  1 sibling, 0 replies; 29+ messages in thread
From: Michal Nazarewicz @ 2012-09-24 15:06 UTC (permalink / raw)
  To: George Spelvin, linux, vda.linux; +Cc: hughd, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1186 bytes --]

On Mon, Sep 24 2012, George Spelvin wrote:
> Michal Nazarewicz <mpn@google.com> wrote:
>> static noinline_for_stack
>> char *put_dec_trunc8(char *buf, unsigned r) {
>> 	unsigned q;
>> 
>> 	if (r > 10000) {
>> 		do {
>> 			q = r + '0';
>> 			r = (r * (uint64_t)0x1999999a) >> 32;
>> 			*buf++ = q - 10 * r;
>> 		} while (r >= 10000);
>> 		if (r == 0)
>> 			return buf;
>> 	}
>> 
>> 	q      = (r * 0x199a) >> 16;
>> 	*buf++ = (r - 10 * q)  + '0'; /* 6 */
[...]
>> 	return buf;
>> }
>
> Two bugs:
>
> 1) The initial "(r > 10000)" should be >=.
>    If you let r == 10000 through to the remaining code, you'll get
>    ":000".

Obviously... ;)

>
> 2) The "r == 0" test isn't necessary.
>    Given that the loop divides r by 10 each time, r >= 10000 at the
>    beginning implies r >= 1000 at the end, so 1000 <= r < 10000
>    when the loop exits.

Yeah, I've just figured that out. :]
-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--

[-- Attachment #2.1: Type: text/plain, Size: 0 bytes --]



[-- Attachment #2.2: Type: application/pgp-signature, Size: 835 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/4] lib: vsprintf: Optimize division by 10000
  2012-09-24 13:56         ` George Spelvin
@ 2012-09-24 15:14           ` Geert Uytterhoeven
  2012-09-24 15:48             ` George Spelvin
  0 siblings, 1 reply; 29+ messages in thread
From: Geert Uytterhoeven @ 2012-09-24 15:14 UTC (permalink / raw)
  To: George Spelvin; +Cc: mpn, vda.linux, hughd, linux-kernel

On Mon, Sep 24, 2012 at 3:56 PM, George Spelvin <linux@horizon.com> wrote:
> Michal Nazarewicz <mpn@google.com> wrote:
>> Didn't some SPARCs have 32x32->32 multiply?  I remember reading some
>> rant from a GMP developer about how SPARC is broken that way.
>
> SPARCv9 only has 64x64->64; there's no 128-bit result version.
> That cuts large-integer math speed by a factor of 4 (very
> crude approximation), to the same speed as 32x32->64.
>
> SPARCv8 UMUL puts the high half of the 64-bit result into the Y
> register, and SPARCv7 has a multiply-step instruction (MULScc) which
> does likewise.

Early SPARCs don't even have a multiply instruction.

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/4] lib: vsprintf: Optimize division by 10000
  2012-09-24 15:14           ` Geert Uytterhoeven
@ 2012-09-24 15:48             ` George Spelvin
  0 siblings, 0 replies; 29+ messages in thread
From: George Spelvin @ 2012-09-24 15:48 UTC (permalink / raw)
  To: geert, linux; +Cc: hughd, linux-kernel, mpn, vda.linux

Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> On Mon, Sep 24, 2012 at 3:56 PM, George Spelvin <linux@horizon.com> wrote:
>> SPARCv8 UMUL puts the high half of the 64-bit result into the Y
>> register, and SPARCv7 has a multiply-step instruction (MULScc) which
>> does likewise.
>
> Early SPARCs don't even have a multiply instruction.

Are you sure we're not talking about the same thing?  Early SPARCs didn't
have a *single* multiply instruction, but *did* have a multiply step
instruction, which can be iterated to produce a 32x32->64-bit multiply.

That's what I was referring to.  All multiplies were slow, but widening
multiply was no slower.

And SPARCv7 had no divide support at all, so that was *really* slow.

SPARClite added a divide step, but also added an integer multiply
instruction, so again widening multiply beat the pants off divide.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] lib: vsprintf: Optimize put_dec_trunc8
  2012-09-24 13:49         ` George Spelvin
  2012-09-24 15:06           ` Michal Nazarewicz
@ 2012-09-25 11:44           ` George Spelvin
  2012-09-25 13:00             ` Denys Vlasenko
  1 sibling, 1 reply; 29+ messages in thread
From: George Spelvin @ 2012-09-25 11:44 UTC (permalink / raw)
  To: linux, mpn, vda.linux; +Cc: hughd, linux-kernel

Thanks to Denys Vlasenko for sending me his benchmarking code.

I went and hacked on it to ransomize the numbers being converted more,
since repeatedly converting the same number underestimates the number
of branch mispredictions.

Then I tried computing the number of digits beforehand, as mentioned
in my earlier message, and it came out slightly slower.  The code is:

static noinline_for_stack
char *put_dec_trunc8(char *buf, unsigned n)
{
        static unsigned const pow10[9] = { 0, 10, 100, 1000, 10000, 100000,
					 1000000, 10000000, 100000000 };
	unsigned digits = (19 * fls(n) + 6) >> 6;  /* Valid for < 44 bits */
	unsigned t;

	/*
	 * "Digits" is an estimate, always exact or high by 1, of the
	 * number of decimal digits in r.  It is offset by 1, so digits=0
	 * means 1 digit.  Given that the largest value ever passed to
	 * this function has 27 bits, The largest value is (19 * 27 +
	 * 6) >> 6 = 8, meaning 9 digits.  However, that will always be
	 * rounded down because r is less than pow10[8] = 10^8.
	 *
	 * We could use smaller multipliers in cases 4 through 6, but that
	 * would require a shift by less than 32, which is awkward on a
	 * 32-bit processor and wastes more time than the larger multiply.
	 */
	switch (digits - (n < pow10[digits])) {
	default:
		__builtin_unreachable();
	case 7:		/* q <= 99999999 */
		t = n + '0';
		n  = (n * (uint64_t)0x1999999a) >> 32;
		*buf++ = t - 10*n;
	case 6:		/* q <= 9999999 */
		t = n + '0';
		n  = (n * (uint64_t)0x1999999a) >> 32;
		*buf++ = t - 10*n;
	case 5:		/* t <= 999999 */
		t = n + '0';
		n  = (n * (uint64_t)0x1999999a) >> 32;
		*buf++ = t - 10*n;
	case 4:		/* t <= 99999 */
		t = n + '0';
		n  = (n * (uint64_t)0x1999999a) >> 32;
		*buf++ = t - 10*n;
	case 3:		/* t <= 9999 */
		t = n + '0';
		n  = (n * 0x199a) >> 16;
		*buf++ = t - 10*n;
	case 2:		/* t <= 999 */
		t = n + '0';
		n  = (n * 0xcd) >> 11;
		*buf++ = t - 10*n;
	case 1:		/* t <= 99 */
		t = n + '0';
		n  = (n * 0xcd) >> 11;
		*buf++ = t - 10*n;
	case 0:		/* t <= 9 */
		*buf++ = n + '0';
	}
	return buf;
}


But!  With more extensive refactoring of the number() code, computing
the number of digits at the beginning can eliminate the entire business
of formatting into tmp[] and copying backward.  We'll first compute the
number of digits, check for buffer overflow, insert the printf padding,
and then call the number-formatting code, passing the number of digits in.

It seems plausible that the resultant simplification will produce a
speedup.

I'm going to experiment with that.  Denys, since I'm playing in your sandbox,
do you have any violent objections to that?  (Assuming, of course, that it's
a net speed win and my surgery to function signatures isn't too hideous.)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] lib: vsprintf: Optimize put_dec_trunc8
  2012-09-25 11:44           ` George Spelvin
@ 2012-09-25 13:00             ` Denys Vlasenko
  0 siblings, 0 replies; 29+ messages in thread
From: Denys Vlasenko @ 2012-09-25 13:00 UTC (permalink / raw)
  To: George Spelvin; +Cc: mpn, hughd, linux-kernel

On Tue, Sep 25, 2012 at 1:44 PM, George Spelvin <linux@horizon.com> wrote:
> Thanks to Denys Vlasenko for sending me his benchmarking code.
>
> I went and hacked on it to ransomize the numbers being converted more,
> since repeatedly converting the same number underestimates the number
> of branch mispredictions.
>
> Then I tried computing the number of digits beforehand, as mentioned
> in my earlier message, and it came out slightly slower.  The code is:
>
> static noinline_for_stack
> char *put_dec_trunc8(char *buf, unsigned n)
> {
>         static unsigned const pow10[9] = { 0, 10, 100, 1000, 10000, 100000,
>                                          1000000, 10000000, 100000000 };
>         unsigned digits = (19 * fls(n) + 6) >> 6;  /* Valid for < 44 bits */

fls() itself may be more expensive than a few jumps.

> But!  With more extensive refactoring of the number() code, computing
> the number of digits at the beginning can eliminate the entire business
> of formatting into tmp[] and copying backward.  We'll first compute the
> number of digits, check for buffer overflow, insert the printf padding,
> and then call the number-formatting code, passing the number of digits in.
>
> It seems plausible that the resultant simplification will produce a
> speedup.
>
> I'm going to experiment with that.  Denys, since I'm playing in your sandbox,
> do you have any violent objections to that?

I don't object.

-- 
vda

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2012-09-25 13:01 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-08-03  5:21 [PATCH 1/4] lib: vsprintf: Optimize division by 10 for small integers George Spelvin
2012-08-03  5:21 ` [PATCH 2/4] lib: vsprintf: Optimize division by 10000 George Spelvin
2012-09-23 17:30   ` Michal Nazarewicz
2012-09-24 12:16     ` George Spelvin
2012-09-24 12:41       ` Michal Nazarewicz
2012-09-24 13:56         ` George Spelvin
2012-09-24 15:14           ` Geert Uytterhoeven
2012-09-24 15:48             ` George Spelvin
2012-09-24  9:03   ` Denys Vlasenko
2012-09-24 12:35     ` George Spelvin
2012-09-24 15:02       ` Denys Vlasenko
2012-08-03  5:21 ` [PATCH 3/4] lib: vsprintf: Optimize put_dec_trunc8 George Spelvin
2012-09-23 14:18   ` Rabin Vincent
2012-09-24 11:13     ` George Spelvin
2012-09-24 14:33     ` George Spelvin
2012-09-24 14:53       ` Michal Nazarewicz
2012-09-24 14:57         ` Michal Nazarewicz
2012-09-23 18:22   ` Michal Nazarewicz
2012-09-24 11:46     ` George Spelvin
2012-09-24 12:29       ` Michal Nazarewicz
2012-09-24 13:49         ` George Spelvin
2012-09-24 15:06           ` Michal Nazarewicz
2012-09-25 11:44           ` George Spelvin
2012-09-25 13:00             ` Denys Vlasenko
2012-08-03  5:21 ` [PATCH 4/4] lib: vsprintf: Fix broken comments George Spelvin
2012-09-23 17:22 ` [PATCH 1/4] lib: vsprintf: Optimize division by 10 for small integers Michal Nazarewicz
2012-09-24 14:18   ` George Spelvin
2012-09-24  9:06 ` Denys Vlasenko
2012-09-24 11:27   ` George Spelvin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox