[PATCHv2 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec


* [PATCHv2 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full()
@ 2010-08-08 19:29 Michal Nazarewicz
  2010-08-08 19:29 ` [PATCHv2 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Michal Nazarewicz
  2010-08-10  3:17 ` [PATCHv2 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full() Denys Vlasenko
  0 siblings, 2 replies; 10+ messages in thread
From: Michal Nazarewicz @ 2010-08-08 19:29 UTC (permalink / raw)
  To: linux-kernel, linux-kernel
  Cc: m.nazarewicz, Michal Nazarewicz, Douglas W. Jones, Denis Vlasenko,
	Andrew Morton

The put_dec_trunc() and put_dec_full() functions were based on
code optimised for processors with an 8-bit ALU.  Since we do
not need to limit ourselves to such small ALUs, that code had
already been adapted to use the capacity of a 16-bit ALU.

This patch goes further and uses the full capacity of a 32-bit
ALU.  Instead of splitting the number into nibbles and
operating on them, it performs the obvious algorithm for base
conversion (except that it uses optimised code for dividing by
ten).

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
---
 lib/vsprintf.c |  143 +++++++++++++++++++++++++++----------------------------
 1 files changed, 70 insertions(+), 73 deletions(-)

Compared to v1 only commit message and comments were changed.


I ran some benchmarks on the following processors:

ARM     : ARMv7 Processor rev 2 (v7l)                   (32-bit)
Atom    : Intel(R) Atom(TM) CPU N270 @ 1.60GHz          (32-bit)
Core    : Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz   (64-bit)
Phenom  : AMD Phenom(tm) II X3 710 Processor            (64-bit)

Here are the results (normalised to the fastest/smallest):

                    :        ARM       Atom       Core     Phenom
-- Speed --------------------------------------------------------
orig_put_dec_full   :   1.054570   1.356214   1.732636   1.725760  Original
mod1_put_dec_full   :   1.000000   1.017216   1.255518   1.116559
mod3_put_dec_full   :   1.018222   1.000000   1.000000   1.000000  Proposed

orig_put_dec_trunc  :   1.137903   1.216017   1.850478   1.662370  Original
mod1_put_dec_trunc  :   1.000000   1.078154   1.355635   1.400637
mod3_put_dec_trunc  :   1.025989   1.000000   1.000000   1.000000  Proposed
-- Size ---------------------------------------------------------
orig_put_dec_full   :   1.212766   1.310345   1.355372   1.355372  Original
mod1_put_dec_full   :   1.021277   1.000000   1.000000   1.000000
mod3_put_dec_full   :   1.000000   1.172414   1.049587   1.049587  Proposed

orig_put_dec_trunc  :   1.363636   1.317365   1.784000   1.784000  Original
mod1_put_dec_trunc  :   1.181818   1.275449   1.400000   1.400000
mod3_put_dec_trunc  :   1.000000   1.000000   1.000000   1.000000  Proposed


The source of the benchmark, as well as the code of all the modified
versions of the functions, is included in the third patch of this series.


As can be seen from the table, the "mod3" version (proposed by
this patch) is the fastest, with the only exception being ARM, on
which it loses to "mod1" by ~2%.

It is also smaller, in terms of code size, than the original version,
although "mod1" is smaller still for put_dec_full() on the x86 machines.

In the end, I'm proposing "mod3" because I think the speed is worth
those few bytes.  Also, for ARM I'm proposing another version in the
patch that follows this one.

The function is also shorter in terms of lines of code.


I'm currently running 2.6.35 with this patch applied.  It applies just
fine on -next and I've run it on ARM.


PS. I've sent a private email to Mr. Jones to get his permission to
use his code.  I'm sure there will be no issues.  I'll resubmit the
patchset with his Signed-off-by when I hear back from him.

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index b8a2f54..35764f6 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -278,96 +278,93 @@ int skip_atoi(const char **s)
 	return i;
 }
 
-/* Decimal conversion is by far the most typical, and is used
+/*
+ * Decimal conversion is by far the most typical, and is used
  * for /proc and /sys data. This directly impacts e.g. top performance
  * with many processes running. We optimize it for speed
- * using code from
- * http://www.cs.uiowa.edu/~jones/bcd/decimal.html
- * (with permission from the author, Douglas W. Jones). */
-
-/* Formats correctly any integer in [0,99999].
- * Outputs from one to five digits depending on input.
- * On i386 gcc 4.1.2 -O2: ~250 bytes of code. */
+ * using ideas described at <http://www.cs.uiowa.edu/~jones/bcd/divide.html>.
+ *
+ * Formats correctly any integer in [0, 9999].
+ */
 static noinline_for_stack
-char *put_dec_trunc(char *buf, unsigned q)
+char *put_dec_full(char *buf, unsigned q)
 {
-	unsigned d3, d2, d1, d0;
-	d1 = (q>>4) & 0xf;
-	d2 = (q>>8) & 0xf;
-	d3 = (q>>12);
-
-	d0 = 6*(d3 + d2 + d1) + (q & 0xf);
-	q = (d0 * 0xcd) >> 11;
-	d0 = d0 - 10*q;
-	*buf++ = d0 + '0'; /* least significant digit */
-	d1 = q + 9*d3 + 5*d2 + d1;
-	if (d1 != 0) {
-		q = (d1 * 0xcd) >> 11;
-		d1 = d1 - 10*q;
-		*buf++ = d1 + '0'; /* next digit */
-
-		d2 = q + 2*d2;
-		if ((d2 != 0) || (d3 != 0)) {
-			q = (d2 * 0xd) >> 7;
-			d2 = d2 - 10*q;
-			*buf++ = d2 + '0'; /* next digit */
-
-			d3 = q + 4*d3;
-			if (d3 != 0) {
-				q = (d3 * 0xcd) >> 11;
-				d3 = d3 - 10*q;
-				*buf++ = d3 + '0';  /* next digit */
-				if (q != 0)
-					*buf++ = q + '0'; /* most sign. digit */
-			}
-		}
+	unsigned r;
+	char a = '0';
+
+	/*
+	 * '(x * 0xcccd) >> 19' is an approximation of 'x / 10' that
+	 * gives correct results for all x < 81920.  However, because
+	 * intermediate result can be at most 32-bit we limit x to be
+	 * 16-bit.
+	 *
+	 * Because of those, we check if we are dealing with a "big"
+	 * number and if so, we make it smaller remembering to add to
+	 * the most significant digit.
+	 */
+	if (q >= 50000) {
+		a  = '5';
+		q -= 50000;
 	}
 
-	return buf;
-}
-/* Same with if's removed. Always emits five digits */
-static noinline_for_stack
-char *put_dec_full(char *buf, unsigned q)
-{
-	/* BTW, if q is in [0,9999], 8-bit ints will be enough, */
-	/* but anyway, gcc produces better code with full-sized ints */
-	unsigned d3, d2, d1, d0;
-	d1 = (q>>4) & 0xf;
-	d2 = (q>>8) & 0xf;
-	d3 = (q>>12);
+	r   = (q * 0xcccd) >> 19;
+	*buf++ = (q - 10 * r) + '0';
 
 	/*
-	 * Possible ways to approx. divide by 10
-	 * gcc -O2 replaces multiply with shifts and adds
+	 * Other, possible ways to approx. divide by 10
 	 * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386)
 	 * (x * 0x67) >> 10:  1100111
 	 * (x * 0x34) >> 9:    110100 - same
 	 * (x * 0x1a) >> 8:     11010 - same
 	 * (x * 0x0d) >> 7:      1101 - same, shortest code (on i386)
 	 */
-	d0 = 6*(d3 + d2 + d1) + (q & 0xf);
-	q = (d0 * 0xcd) >> 11;
-	d0 = d0 - 10*q;
-	*buf++ = d0 + '0';
-	d1 = q + 9*d3 + 5*d2 + d1;
-		q = (d1 * 0xcd) >> 11;
-		d1 = d1 - 10*q;
-		*buf++ = d1 + '0';
-
-		d2 = q + 2*d2;
-			q = (d2 * 0xd) >> 7;
-			d2 = d2 - 10*q;
-			*buf++ = d2 + '0';
-
-			d3 = q + 4*d3;
-				q = (d3 * 0xcd) >> 11; /* - shorter code */
-				/* q = (d3 * 0x67) >> 10; - would also work */
-				d3 = d3 - 10*q;
-				*buf++ = d3 + '0';
-					*buf++ = q + '0';
+
+	q   = (r * 0x199a) >> 16;
+	*buf++ = (r - 10 * q)  + '0';
+
+	r   = (q * 0xcd) >> 11;
+	*buf++ = (q - 10 * r)  + '0';
+
+	q   = (r * 0xd) >> 7;
+	*buf++ = (r - 10 * q) + '0';
+
+	*buf++ = q + a;
+
+	return buf;
+}
+
+/* Same as above but do not pad with zeros. */
+static noinline_for_stack
+char *put_dec_trunc(char *buf, unsigned q)
+{
+	unsigned r;
+
+	/*
+	 * We need to check if q is < 65536 so we might as well check
+	 * if we can just call the _full version of this function.
+	 */
+	if (q > 9999)
+		return put_dec_full(buf, q);
+
+	r   = (q * 0xcccd) >> 19;
+	*buf++ = (q - 10 * r) + '0';
+
+	if (r) {
+		q   = (r * 0x199a) >> 16;
+		*buf++ = (r - 10 * q)  + '0';
+
+		if (q) {
+			r   = (q * 0xcd) >> 11;
+			*buf++ = (q - 10 * r)  + '0';
+
+			if (r)
+				*buf++ = r + '0';
+		}
+	}
 
 	return buf;
 }
+
 /* No inlining helps gcc to use registers better */
 static noinline_for_stack
 char *put_dec(char *buf, unsigned long long num)
-- 
1.7.1



* [PATCHv2 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines
  2010-08-08 19:29 [PATCHv2 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full() Michal Nazarewicz
@ 2010-08-08 19:29 ` Michal Nazarewicz
  2010-08-08 19:29   ` [PATCHv2 3/3] lib: vsprintf: added a put_dec() test and benchmark tool Michal Nazarewicz
  2010-08-10  4:15   ` [PATCHv2 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Denys Vlasenko
  2010-08-10  3:17 ` [PATCHv2 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full() Denys Vlasenko
  1 sibling, 2 replies; 10+ messages in thread
From: Michal Nazarewicz @ 2010-08-08 19:29 UTC (permalink / raw)
  To: linux-kernel, linux-kernel
  Cc: m.nazarewicz, Michal Nazarewicz, Douglas W. Jones, Denis Vlasenko,
	Andrew Morton

The existing put_dec() function uses do_div() to divide its
64-bit argument.  On 32-bit machines this may be a costly
operation.  This patch replaces put_dec() on 32-bit processors
with a version that performs no 64-bit divisions.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
---
 lib/vsprintf.c |  114 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 files changed, 108 insertions(+), 6 deletions(-)

Compared to the previous version, the new code is used only if:
1. long long is 64-bit (i.e. ULLONG_MAX == 2**64-1), and
2. the user did not select optimisation for size with Kconfig.


I ran some benchmarks on the following processors:

ARM     : ARMv7 Processor rev 2 (v7l)                   (32-bit)
Atom    : Intel(R) Atom(TM) CPU N270 @ 1.60GHz          (32-bit)

(I'm skipping 64-bit machines as this patch is intended only for
32-bit).

Here are the results (normalised to the fastest/smallest):

                    :        ARM       Atom
-- Speed ----------------------------------
orig_put_dec        :   9.333822   2.083110  Original
mod1_put_dec        :   9.282045   1.904564
mod2_put_dec        :   9.260409   1.910302
mod3_put_dec        :   9.320053   1.905689  Proposed by previous patch
mod4_put_dec        :   9.297146   1.933971
mod5_put_dec        :  13.034318   2.434942
mod6_put_dec        :   1.000000   1.000000  Proposed by this patch
mod7_put_dec        :   1.009574   1.014147
mod8_put_dec        :   7.226004   1.953460
-- Size -----------------------------------
orig_put_dec        :   1.000000   1.000000  Original
mod1_put_dec        :   1.000000   1.000000
mod2_put_dec        :   1.361111   1.403226
mod3_put_dec        :   1.000000   1.000000  Proposed by previous patch
mod4_put_dec        :   1.361111   1.403226
mod5_put_dec        :   1.000000   1.000000
mod6_put_dec        :   2.555556   3.508065  Proposed by this patch
mod7_put_dec        :   2.833333   3.911290
mod8_put_dec        :   2.027778   2.258065


The source of the benchmark, as well as the code of all the modified
versions of the functions, is included in the third patch of this series.


As can be observed, the proposed version of put_dec() is about twice
as fast as the original version on Atom and more than 9 times faster
on ARM.  I imagine that it may be similar on other "embedded"
processors.

This may be skewed by the fact that the benchmark is using GCC's
64-bit division operator instead of kernel's do_div but it would
appear that by avoiding 64-bit division something can be gained.

The disadvantage is that the proposed function is 2.5-3.5 times bigger.
These are not big functions though -- we are talking here about the
proposed function being below 512 -- and the advantage in speed seems
non-marginal.

In any case, because of its size, it is not used if the user selected
optimisation for size (CONFIG_CC_OPTIMIZE_FOR_SIZE).


Another drawback is that the patch adds a fair bit of code.  It is
debatable whether optimising this much is worth it.  Anyway, I'm
posting this in case someone decides that it is, or is simply
interested. :)


I'm currently running 2.6.35 with this patch applied.  It applies just
fine on -next and I've run it on ARM.


PS. From Mr. Jones's site: "Nonetheless, before relying on the material
here, it would be prudent to check the arithmetic!"  Hence I checked
all the calculations myself and everything seemed fine.  I've also run
the test application several times, so it has tested a few 64-bit
numbers.  Here's a "bc" script which calculates all the numbers:


# You can feed "bc" with this file to check the numbers

x = 2^16

print "n =\t1 * n0 +\n\t", x, " * n1 +\n\t", x^2, " * n2 +\n\t", x^3, " * n3\n"

print "0 <= n0, n1, n2, n3 <= ", x - 1, "\n"

# n  =                  1 * n0 +    0 <= n0 <= 65535
#                  6 5536 * n1 +    0 <= n1 <= 65535
#            42 9496 7296 * n2 +    0 <= n2 <= 65535
#      281 4749 7671 0656 * n3      0 <= n3 <= 65535

n0 = x - 1
n1 = x - 1
n2 = x - 1
n3 = x - 1

# n  =              10^ 0 * d0 +
#                   10^ 4 * d1 +
#                   10^ 8 * d2 +
#                   10^12 * d3 +
#                   10^16 * d4

a0 =  656 * n3 + 7296 * n2 + 5536 * n1 +   1 * n0
print "0 <= a0 <= ", a0, "\n"
# 0 <= a0 <=   884 001 615

a1 = 7671 * n3 + 9496 * n2 +    6 * n1
print "0 <= a1 <= ", a1, "\n"
# 0 <= a1 <= 1 125 432 555

a2 = 4749 * n3 +   42 * n2
print "0 <= a2 <= ", a2, "\n"
# 0 <= a2 <=   313 978 185

a3 =  281 * n3
print "0 <= a3 <= ", a3, "\n"
# 0 <= a3 <=    18 415 335


b0 = a0
print "0 <= b0 <= ", b0, "\n0 <= c1 <= ", b0 / 10000, "\n"
# 0 <= b0 <=   884 001 615
# 0 <= c1 <=        88 400

b1 = a1 + b0 / 10000
print "0 <= b1 <= ", b1, "\n0 <= c2 <= ", b1 / 10000, "\n"
# 0 <= b1 <= 1 125 520 955
# 0 <= c2 <=       112 552

b2 = a2 + b1 / 10000
print "0 <= b2 <= ", b2, "\n0 <= c3 <= ", b2 / 10000, "\n"
# 0 <= b2 <=   314 090 737
# 0 <= c3 <=        31 409

b3 = a3 + b2 / 10000
print "0 <= b3 <= ", b3, "\n0 <= c4 <= ", b3 / 10000, "\n"
# 0 <= b3 <=    18 446 744
# 0 <= c4 <=         1 844

# There is no a4 term; b4 is just the carry out of b3.
b4 = b3 / 10000
print "0 <= b4 <= ", b4, "\n"
# 0 <= b4 <=         1 844


diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 35764f6..cf0aa9e 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -278,6 +278,9 @@ int skip_atoi(const char **s)
 	return i;
 }
 
+#if BITS_PER_LONG != 32 || defined CONFIG_CC_OPTIMIZE_FOR_SIZE || \
+	ULLONG_MAX != 18446744073709551615ULL
+
 /*
  * Decimal conversion is by far the most typical, and is used
  * for /proc and /sys data. This directly impacts e.g. top performance
@@ -287,7 +290,7 @@ int skip_atoi(const char **s)
  * Formats correctly any integer in [0, 9999].
  */
 static noinline_for_stack
-char *put_dec_full(char *buf, unsigned q)
+char *put_dec_full5(char *buf, unsigned q)
 {
 	unsigned r;
 	char a = '0';
@@ -335,7 +338,7 @@ char *put_dec_full(char *buf, unsigned q)
 
 /* Same as above but do not pad with zeros. */
 static noinline_for_stack
-char *put_dec_trunc(char *buf, unsigned q)
+char *put_dec_trunc5(char *buf, unsigned q)
 {
 	unsigned r;
 
@@ -344,7 +347,7 @@ char *put_dec_trunc(char *buf, unsigned q)
 	 * if we can just call the _full version of this function.
 	 */
 	if (q > 9999)
-		return put_dec_full(buf, q);
+		return put_dec_full5(buf, q);
 
 	r   = (q * 0xcccd) >> 19;
 	*buf++ = (q - 10 * r) + '0';
@@ -372,12 +375,111 @@ char *put_dec(char *buf, unsigned long long num)
 	while (1) {
 		unsigned rem;
 		if (num < 100000)
-			return put_dec_trunc(buf, num);
+			return put_dec_trunc5(buf, num);
 		rem = do_div(num, 100000);
-		buf = put_dec_full(buf, rem);
+		buf = put_dec_full5(buf, rem);
 	}
 }
 
+/* This is used by ip4_string(). */
+#define put_dec_8bit put_dec_trunc5
+
+#else /* BITS_PER_LONG == 32 && !OPTIMIZE_FOR_SIZE && ULLONG_MAX == 2^64-1 */
+
+/*
+ * This is similar to the put_dec_full5() above except it handles
+ * numbers from 0 to 9999 (ie. at most four digits).  It is used by
+ * the put_dec() below which is optimised for 32-bit processors.
+ */
+static noinline_for_stack
+char *put_dec_full4(char *buf, unsigned q)
+{
+	unsigned r;
+
+	r      = (q * 0xcccd) >> 19;
+	*buf++ = (q - 10 * r) + '0';
+
+	q      = (r * 0x199a) >> 16;
+	*buf++ = (r - 10 * q)  + '0';
+
+	r      = (q * 0xcd) >> 11;
+	*buf++ = (q - 10 * r)  + '0';
+
+	*buf++ = r + '0';
+
+	return buf;
+}
+
+/*
+ * Similar to above but handles only 8-bit operands and does not pad
+ * with zeros.  Used by ip4_string().
+ */
+static noinline_for_stack
+char *put_dec_8bit(char *buf, unsigned q)
+{
+	unsigned r;
+
+	r      = (q * 0xcd) >> 11;
+	*buf++ = (q - 10 * r) + '0';
+
+	if (r) {
+		q      = (r * 0xd) >> 7;
+		*buf++ = (r - 10 * q)  + '0';
+
+		if (q)
+			*buf++ = q + '0';
+	}
+
+	return buf;
+}
+
+/*
+ * Based on code by Douglas W. Jones found at
+ * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html#sixtyfour>.  This
+ * performs no 64-bit division and hence should be faster on 32-bit
+ * machines than the version of the function above.
+ */
+static noinline_for_stack
+char *put_dec(char *buf, unsigned long long n)
+{
+	uint32_t d3, d2, d1, q;
+
+	if (!n) {
+		*buf++ = '0';
+		return buf;
+	}
+
+	d1  = (n >> 16) & 0xFFFF;
+	d2  = (n >> 32) & 0xFFFF;
+	d3  = (n >> 48) & 0xFFFF;
+
+	q   = 656 * d3 + 7296 * d2 + 5536 * d1 + (n & 0xFFFF);
+
+	buf = put_dec_full4(buf, q % 10000);
+	q   = q / 10000;
+
+	d1  = q + 7671 * d3 + 9496 * d2 + 6 * d1;
+	buf = put_dec_full4(buf, d1 % 10000);
+	q   = d1 / 10000;
+
+	d2  = q + 4749 * d3 + 42 * d2;
+	buf = put_dec_full4(buf, d2 % 10000);
+	q   = d2 / 10000;
+
+	d3  = q + 281 * d3;
+	buf = put_dec_full4(buf, d3 % 10000);
+	q   = d3 / 10000;
+
+	buf = put_dec_full4(buf, q);
+
+	while (buf[-1] == '0')
+		--buf;
+
+	return buf;
+}
+
+#endif
+
 #define ZEROPAD	1		/* pad with zero */
 #define SIGN	2		/* unsigned/signed long */
 #define PLUS	4		/* show plus */
@@ -751,7 +853,7 @@ char *ip4_string(char *p, const u8 *addr, const char *fmt)
 	}
 	for (i = 0; i < 4; i++) {
 		char temp[3];	/* hold each IP quad in reverse order */
-		int digits = put_dec_trunc(temp, addr[index]) - temp;
+		int digits = put_dec_8bit(temp, addr[index]) - temp;
 		if (leading_zeros) {
 			if (digits < 3)
 				*p++ = '0';
-- 
1.7.1



* [PATCHv2 3/3] lib: vsprintf: added a put_dec() test and benchmark tool
  2010-08-08 19:29 ` [PATCHv2 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Michal Nazarewicz
@ 2010-08-08 19:29   ` Michal Nazarewicz
  2010-08-10  4:15   ` [PATCHv2 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Denys Vlasenko
  1 sibling, 0 replies; 10+ messages in thread
From: Michal Nazarewicz @ 2010-08-08 19:29 UTC (permalink / raw)
  To: linux-kernel, linux-kernel
  Cc: m.nazarewicz, Michal Nazarewicz, Douglas W. Jones, Denis Vlasenko,
	Andrew Morton

This commit adds a test application for the put_dec() family
of functions modified by the previous two commits.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
---
 tools/put-dec/Makefile       |   14 +
 tools/put-dec/put-dec-test.c |  942 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 956 insertions(+), 0 deletions(-)
 create mode 100644 tools/put-dec/Makefile
 create mode 100644 tools/put-dec/put-dec-test.c

diff --git a/tools/put-dec/Makefile b/tools/put-dec/Makefile
new file mode 100644
index 0000000..77400c9
--- /dev/null
+++ b/tools/put-dec/Makefile
@@ -0,0 +1,14 @@
+put-dec-test: put-dec-test.c
+	exec $(CC) -Wall -Wextra -O2 -o $@ $<
+
+put-dec-test.s: put-dec-test.c
+	exec $(CC) -Wall -Wextra -O2 -S -o $@ $<
+
+put-dec-test-s: put-dec-test.c
+	exec $(CC) -Wall -Wextra -Os -o $@ $<
+
+put-dec-test-s.s: put-dec-test.c
+	exec $(CC) -Wall -Wextra -Os -S -o $@ $<
+
+clean:
+	rm -f -- put-dec-test
diff --git a/tools/put-dec/put-dec-test.c b/tools/put-dec/put-dec-test.c
new file mode 100644
index 0000000..1860ae3
--- /dev/null
+++ b/tools/put-dec/put-dec-test.c
@@ -0,0 +1,942 @@
+/*
+ * put-dec-test.c -- Various put_dec*() functions implementation
+ *                   testing and benchmarking tool
+ * Written by                  Michal Nazarewicz <mina86@mina86.com>
+ * with helpful suggestions by Denys Vlasenko <vda.linux@googlemail.com>
+ */
+#define _BSD_SOURCE
+
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <stdarg.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <time.h>
+#include <unistd.h>
+
+#if CHAR_BIT != 8
+#  error This code assumes CHAR_BIT == 8
+#endif
+
+
+
+#  define do_div(n, base) ({			\
+		uint32_t __base = (base);	\
+		uint32_t __rem = (n) % __base;	\
+		(n) /= __base;			\
+		__rem;				\
+	})
+
+
+/****************************** Original version ******************************/
+
+static char *orig_put_dec_trunc(char *buf, unsigned q)
+{
+	unsigned d3, d2, d1, d0;
+	d1 = (q>>4) & 0xf;
+	d2 = (q>>8) & 0xf;
+	d3 = (q>>12);
+
+	d0 = 6*(d3 + d2 + d1) + (q & 0xf);
+	q = (d0 * 0xcd) >> 11;
+	d0 = d0 - 10*q;
+	*buf++ = d0 + '0'; /* least significant digit */
+	d1 = q + 9*d3 + 5*d2 + d1;
+	if (d1 != 0) {
+		q = (d1 * 0xcd) >> 11;
+		d1 = d1 - 10*q;
+		*buf++ = d1 + '0'; /* next digit */
+
+		d2 = q + 2*d2;
+		if ((d2 != 0) || (d3 != 0)) {
+			q = (d2 * 0xd) >> 7;
+			d2 = d2 - 10*q;
+			*buf++ = d2 + '0'; /* next digit */
+
+			d3 = q + 4*d3;
+			if (d3 != 0) {
+				q = (d3 * 0xcd) >> 11;
+				d3 = d3 - 10*q;
+				*buf++ = d3 + '0';  /* next digit */
+				if (q != 0)
+					*buf++ = q + '0'; /* most sign. digit */
+			}
+		}
+	}
+
+	return buf;
+}
+/* Same with if's removed. Always emits five digits */
+static char *orig_put_dec_full(char *buf, unsigned q)
+{
+	/* BTW, if q is in [0,9999], 8-bit ints will be enough, */
+	/* but anyway, gcc produces better code with full-sized ints */
+	unsigned d3, d2, d1, d0;
+	d1 = (q>>4) & 0xf;
+	d2 = (q>>8) & 0xf;
+	d3 = (q>>12);
+
+	/*
+	 * Possible ways to approx. divide by 10
+	 * gcc -O2 replaces multiply with shifts and adds
+	 * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386)
+	 * (x * 0x67) >> 10:  1100111
+	 * (x * 0x34) >> 9:    110100 - same
+	 * (x * 0x1a) >> 8:     11010 - same
+	 * (x * 0x0d) >> 7:      1101 - same, shortest code (on i386)
+	 */
+	d0 = 6*(d3 + d2 + d1) + (q & 0xf);
+	q = (d0 * 0xcd) >> 11;
+	d0 = d0 - 10*q;
+	*buf++ = d0 + '0';
+	d1 = q + 9*d3 + 5*d2 + d1;
+		q = (d1 * 0xcd) >> 11;
+		d1 = d1 - 10*q;
+		*buf++ = d1 + '0';
+
+		d2 = q + 2*d2;
+			q = (d2 * 0xd) >> 7;
+			d2 = d2 - 10*q;
+			*buf++ = d2 + '0';
+
+			d3 = q + 4*d3;
+				q = (d3 * 0xcd) >> 11; /* - shorter code */
+				/* q = (d3 * 0x67) >> 10; - would also work */
+				d3 = d3 - 10*q;
+				*buf++ = d3 + '0';
+					*buf++ = q + '0';
+
+	return buf;
+}
+
+static __attribute__((noinline))
+char *orig_put_dec(char *buf, unsigned long long num)
+{
+	while (1) {
+		unsigned rem;
+		if (num < 100000)
+			return orig_put_dec_trunc(buf, num);
+		rem = do_div(num, 100000);
+		buf = orig_put_dec_full(buf, rem);
+	}
+}
+
+
+
+/****************************** Modified versions ******************************/
+
+/*
+ * Decimal conversion is by far the most typical, and is used for
+ * /proc and /sys data. This directly impacts e.g. top performance
+ * with many processes running.
+ *
+ * We optimize it for speed using code based on idea described at:
+ * http://www.cs.uiowa.edu/~jones/bcd/decimal.html (with permission
+ * from the author, Douglas W. Jones).
+ *
+ * The original code was designed for 8-bit ALUs but since we can
+ * assume more capable hardware the code has been rewritten to use the
+ * following properties:
+ *
+ * n  =    1 * n0 +                   ( 0 <= n0 <= 1023 )
+ *      1024 * n1                     ( 0 <= n1 <=   97 )
+ * a0 = 0         + 4 * n1 + 1 * n0   ( 0 <= a0 <= 1412 )
+ * a1 = (a0 / 10) + 2 * n1            ( 0 <= a1 <=  335 )
+ * a2 = (a1 / 10) + 0 * n1            ( 0 <= a2 <=   33 )
+ * a3 = (a2 / 10) + 1 * n1            ( 0 <= a3 <=  100 )
+ * d0 = a0 % 10
+ * d1 = a1 % 10
+ * d2 = a2 % 10
+ * d3 = a3 % 10
+ * d4 = a3 / 10
+ *
+ * So instead of dividing the number into four nibbles we divide it
+ * into two numbers: first one 10-bit and the other one 7-bit
+ * (argument is 17-bit number from 0 to 99999).
+ *
+ * Moreover, 1024, which is the value second part of the number needs
+ * to be multiplied by, has nice property that each digit is a power
+ * of two or zero -- this helps with multiplications.
+ *
+ * Possible ways to approx. divide by 10
+ * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386)
+ * (x * 0x67) >> 10:  1100111
+ * (x * 0x34) >> 9:    110100 - same
+ * (x * 0x1a) >> 8:     11010 - same
+ * (x * 0x0d) >> 7:      1101 - same, shortest code (on i386)
+ */
+static char *mod1_put_dec_full(char *buf, unsigned q)
+{
+	unsigned p, r;
+	p   = q >> 10;
+
+	q  &= 0x3ff;
+	q  += 4 * p;
+	r   = (q * 0x199A) >> 16;
+	*buf++ = (q - 10 * r) + '0';
+
+	r  += 2 * p;
+	q   = (r * 0xcd) >> 11;
+	*buf++ = (r - 10 * q)  + '0';
+
+	/* q += 0; */
+	r   = (q * 0xcd) >> 11;
+	*buf++ = (q - 10 * r)  + '0';
+
+	r  += p;
+	q   = (r * 0xcd) >> 11;
+	*buf++ = (r - 10 * q) + '0';
+
+	*buf++ = q + '0';
+
+	return buf;
+}
+
+static char *mod1_put_dec_trunc(char *buf, unsigned q)
+{
+	unsigned p, r;
+	p   = q >> 10;
+
+	q  &= 0x3ff;
+	q  += 4 * p;
+	r   = (q * 0x199a) >> 16;
+	*buf++ = (q - 10 * r) + '0';
+
+	r  += 2 * p;
+	if (r) {
+		q   = (r * 0xcd) >> 11;
+		*buf++ = (r - 10 * q)  + '0';
+
+		/* q += 0; */
+		if (q || p) {
+			r   = (q * 0xcd) >> 11;
+			*buf++ = (q - 10 * r)  + '0';
+
+			r  += p;
+			if (r) {
+				q   = (r * 0xcd) >> 11;
+				*buf++ = (r - 10 * q) + '0';
+
+				if (q)
+					*buf++ = q + '0';
+			}
+		}
+	}
+
+	return buf;
+}
+
+
+static __attribute__((noinline))
+char *mod1_put_dec(char *buf, unsigned long long num)
+{
+	while (1) {
+		unsigned rem;
+		if (num < 100000)
+			return mod1_put_dec_trunc(buf, num);
+		rem = do_div(num, 100000);
+		buf = mod1_put_dec_full(buf, rem);
+	}
+}
+
+
+static __attribute__((noinline))
+char *mod2_put_dec(char *buf, unsigned long long num)
+{
+	if (!num) {
+		*buf++ = '0';
+		return buf;
+	}
+
+	while (num >= 100000) {
+		unsigned rem;
+		rem = do_div(num, 100000);
+		buf = mod1_put_dec_full(buf, rem);
+	}
+
+	buf = mod1_put_dec_full(buf, num);
+	while (buf[-1] == '0')
+		--buf;
+	return buf;
+}
+
+
+
+/*
+ * Decimal conversion is by far the most typical, and is used for
+ * /proc and /sys data. This directly impacts e.g. top performance
+ * with many processes running.
+ *
+ * We optimize it for speed using ideas described at
+ * <http://www.cs.uiowa.edu/~jones/bcd/divide.html>.
+ *
+ * '(num * 0xcccd) >> 19' is an approximation of 'num / 10' that gives
+ * correct results for num < 81920.  Because of this, we check at the
+ * beginning if we are dealing with a number that may cause trouble
+ * and if so, we make it smaller.
+ *
+ * (As a minor note, all operands are always 16 bit so this function
+ * should work well on hardware that cannot multiply 32 bit numbers).
+ *
+ * (Previously, code based on
+ * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html> was used here,
+ * with permission from the author, Douglas W. Jones.)
+ *
+ * Other, possible ways to approx. divide by 10
+ * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386)
+ * (x * 0x67) >> 10:  1100111
+ * (x * 0x34) >> 9:    110100 - same
+ * (x * 0x1a) >> 8:     11010 - same
+ * (x * 0x0d) >> 7:      1101 - same, shortest code (on i386)
+ */
+static char *mod3_put_dec_full(char *buf, unsigned q)
+{
+	unsigned r;
+	char a = '0';
+
+	if (q >= 50000) {
+		a  = '5';
+		q -= 50000;
+	}
+
+	r   = (q * 0xcccd) >> 19;
+	*buf++ = (q - 10 * r) + '0';
+
+	q   = (r * 0x199a) >> 16;
+	*buf++ = (r - 10 * q)  + '0';
+
+	r   = (q * 0xcd) >> 11;
+	*buf++ = (q - 10 * r)  + '0';
+
+	q   = (r * 0xd) >> 7;
+	*buf++ = (r - 10 * q) + '0';
+
+	*buf++ = q + a;
+
+	return buf;
+}
+
+static char *mod3_put_dec_trunc(char *buf, unsigned q)
+{
+	unsigned r;
+
+	/*
+	 * We need to check if num is < 81920 so we might as well
+	 * check if we can just call the _full version of this
+	 * function.
+	 */
+	if (q > 9999)
+		return mod3_put_dec_full(buf, q);
+
+	r   = (q * 0xcccd) >> 19;
+	*buf++ = (q - 10 * r) + '0';
+
+	if (r) {
+		q   = (r * 0x199a) >> 16;
+		*buf++ = (r - 10 * q)  + '0';
+
+		if (q) {
+			r   = (q * 0xcd) >> 11;
+			*buf++ = (q - 10 * r)  + '0';
+
+			if (r)
+				*buf++ = r + '0';
+		}
+	}
+
+	return buf;
+}
+
+
+static char *mod3_put_dec_8bit(char *buf, unsigned q)
+{
+	unsigned r;
+
+	r      = (q * 0xcd) >> 11;
+	*buf++ = (q - 10 * r) + '0';
+
+	if (r) {
+		q   = (r * 0xd) >> 7;
+		*buf++ = (r - 10 * q)  + '0';
+
+		if (q)
+			*buf++ = q + '0';
+	}
+
+	return buf;
+}
+
+
+static __attribute__((noinline))
+char *mod3_put_dec(char *buf, unsigned long long num)
+{
+	while (1) {
+		unsigned rem;
+		if (num < 100000)
+			return mod3_put_dec_trunc(buf, num);
+		rem = do_div(num, 100000);
+		buf = mod3_put_dec_full(buf, rem);
+	}
+}
+
+
+static __attribute__((noinline))
+char *mod4_put_dec(char *buf, unsigned long long num)
+{
+	if (!num) {
+		*buf++ = '0';
+		return buf;
+	}
+
+	while (num >= 100000) {
+		unsigned rem;
+		rem = do_div(num, 100000);
+		buf = mod3_put_dec_full(buf, rem);
+	}
+
+	buf = mod3_put_dec_full(buf, num);
+	while (buf[-1] == '0')
+		--buf;
+	return buf;
+}
+
+
+
+/*
+ * Decimal conversion is by far the most typical, and is used for
+ * /proc and /sys data. This directly impacts e.g. top performance
+ * with many processes running.
+ *
+ * We optimize it for speed using ideas described at
+ * <http://www.cs.uiowa.edu/~jones/bcd/divide.html>.
+ *
+ * (Previously, code based on
+ * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html> was used here,
+ * with permission from the author, Douglas W. Jones.)
+ *
+ * Other, possible ways to approx. divide by 10
+ * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386)
+ * (x * 0x67) >> 10:  1100111
+ * (x * 0x34) >> 9:    110100 - same
+ * (x * 0x1a) >> 8:     11010 - same
+ * (x * 0x0d) >> 7:      1101 - same, shortest code (on i386)
+ */
+static char *mod5_put_dec_full(char *buf, unsigned q)
+{
+	unsigned r;
+
+	r      = (q * 0xcccd) >> 19;
+	*buf++ = (q - 10 * r) + '0';
+
+	q      = (r * 0x199a) >> 16;
+	*buf++ = (r - 10 * q)  + '0';
+
+	r      = (q * 0xcd) >> 11;
+	*buf++ = (q - 10 * r)  + '0';
+
+	*buf++ = r + '0';
+
+	return buf;
+}
+
+static char *mod5_put_dec_trunc(char *buf, unsigned q)
+{
+	unsigned r;
+
+	r      = (q * 0xcccd) >> 19;
+	*buf++ = (q - 10 * r) + '0';
+
+	if (r) {
+		q      = (r * 0x199a) >> 16;
+		*buf++ = (r - 10 * q)  + '0';
+
+		if (q) {
+			r      = (q * 0xcd) >> 11;
+			*buf++ = (q - 10 * r)  + '0';
+
+			if (r)
+				*buf++ = r + '0';
+		}
+	}
+
+	return buf;
+}
+
+
+static __attribute__((noinline))
+char *mod5_put_dec(char *buf, unsigned long long num)
+{
+	while (1) {
+		unsigned rem;
+		if (num < 10000)
+			return mod5_put_dec_trunc(buf, num);
+		rem = do_div(num, 10000);
+		buf = mod5_put_dec_full(buf, rem);
+	}
+}
+
+/*
+ * Based on code by Douglas W. Jones found at
+ * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html>.
+ */
+static __attribute__((noinline))
+char *mod6_put_dec(char *buf, unsigned long long n)
+{
+	uint32_t d3, d2, d1, q;
+
+	if (!n) {
+		*buf++ = '0';
+		return buf;
+	}
+
+	d1  = (n >> 16) & 0xFFFF;
+	d2  = (n >> 32) & 0xFFFF;
+	d3  = (n >> 48) & 0xFFFF;
+
+	q   = 656 * d3 + 7296 * d2 + 5536 * d1 + (n & 0xFFFF);
+
+	buf = mod5_put_dec_full(buf, q % 10000);
+	q   = q / 10000;
+
+	d1  = q + 7671 * d3 + 9496 * d2 + 6 * d1;
+		buf = mod5_put_dec_full(buf, d1 % 10000);
+		q   = d1 / 10000;
+
+		d2  = q + 4749 * d3 + 42 * d2;
+			buf = mod5_put_dec_full(buf, d2 % 10000);
+			q   = d2 / 10000;
+
+			d3  = q + 281 * d3;
+				buf = mod5_put_dec_full(buf, d3 % 10000);
+				q   = d3 / 10000;
+
+					buf = mod5_put_dec_trunc(buf, q);
+
+	while (buf[-1] == '0')
+		--buf;
+
+	return buf;
+}
+
+
+/*
+ * Based on code by Douglas W. Jones found at
+ * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html>.
+ */
+static __attribute__((noinline))
+char *mod7_put_dec(char *buf, unsigned long long n)
+{
+	uint32_t d3, d2, d1, q;
+
+	if (!n) {
+		*buf++ = '0';
+		return buf;
+	}
+
+	d1  = (n >> 16) & 0xFFFF;
+	d2  = (n >> 32) & 0xFFFF;
+	d3  = (n >> 48) & 0xFFFF;
+
+	q   = 656 * d3 + 7296 * d2 + 5536 * d1 + (n & 0xFFFF);
+
+	buf = mod5_put_dec_full(buf, q % 10000);
+	q   = q / 10000;
+
+	d1  = q + 7671 * d3 + 9496 * d2 + 6 * d1;
+	if (d1) {
+		buf = mod5_put_dec_full(buf, d1 % 10000);
+		q   = d1 / 10000;
+
+		d2  = q + 4749 * d3 + 42 * d2;
+		if (d2) {
+			buf = mod5_put_dec_full(buf, d2 % 10000);
+			q   = d2 / 10000;
+
+			d3  = q + 281 * d3;
+			if (d3) {
+				buf = mod5_put_dec_full(buf, d3 % 10000);
+				q   = d3 / 10000;
+
+				if (q)
+					buf = mod5_put_dec_trunc(buf, q);
+			}
+		}
+	}
+
+	while (buf[-1] == '0')
+		--buf;
+
+	return buf;
+}
+
+
+static __attribute__((noinline))
+char *mod8_put_dec(char *buf, unsigned long long num)
+{
+	while (1) {
+		unsigned rem = do_div(num, 100000000);
+
+		if (!num && rem < 10000)
+			return mod5_put_dec_trunc(buf, rem);
+		buf = mod5_put_dec_full(buf, rem % 10000);
+
+		if (!num)
+			return mod5_put_dec_trunc(buf, rem / 10000);
+		buf = mod5_put_dec_full(buf, rem / 10000);
+	}
+}
+
+
+
+/****************************** Main ******************************/
+
+static const struct put_dec_helper {
+	const char *name;
+	char *(*func)(char *buf, unsigned v);
+	unsigned inner;
+	unsigned outer;
+	int width;
+} put_dec_helpers[] = {
+	{ "orig_put_dec_full" , orig_put_dec_full , 100000,   1, 5 },
+	{ "mod1_put_dec_full" , mod1_put_dec_full , 100000,   1, 5 },
+	{ "mod3_put_dec_full" , mod3_put_dec_full , 100000,   1, 5 },
+	{ "mod5_put_dec_full" , mod5_put_dec_full ,  10000,  10, 4 },
+	{ "orig_put_dec_trunc", orig_put_dec_trunc, 100000,   1, 0 },
+	{ "mod1_put_dec_trunc", mod1_put_dec_trunc, 100000,   1, 0 },
+	{ "mod3_put_dec_trunc", mod3_put_dec_trunc, 100000,   1, 0 },
+	{ "mod5_put_dec_trunc", mod5_put_dec_trunc,  10000,  10, 0 },
+	{ "mod3_put_dec_8bit",  mod3_put_dec_8bit,     256, 500, 0 },
+	{ NULL, NULL, 0, 0, 0 }
+};
+
+static const struct put_dec {
+	const char *name;
+	char *(*func)(char *buf, unsigned long long v);
+} put_decs[] = {
+	{ "orig_put_dec" , orig_put_dec },
+	{ "mod1_put_dec" , mod1_put_dec },
+	{ "mod2_put_dec" , mod2_put_dec },
+	{ "mod3_put_dec" , mod3_put_dec },
+	{ "mod4_put_dec" , mod4_put_dec },
+	{ "mod5_put_dec" , mod5_put_dec },
+	{ "mod6_put_dec" , mod6_put_dec },
+	{ "mod7_put_dec" , mod7_put_dec },
+	{ "mod8_put_dec" , mod8_put_dec },
+	{ NULL, NULL }
+};
+
+
+#define WIDTH (int)(sizeof(long long) * 4)
+static char buf_output[WIDTH], buf_expected[WIDTH];
+
+
+static bool load_random(const char *file);
+static void benchmark_helpers(unsigned long iterations);
+static void benchmark_put_dec(unsigned long iterations);
+static void show_sizes(char *app);
+static bool test_helpers(void);
+static bool test_put_dec_rand(void);
+static bool test_put_dec_range(unsigned long long range);
+
+
+int main(int argc, char **argv) {
+	unsigned long iterations;
+	unsigned long long range;
+	bool ret;
+
+	puts(">> Reading entropy...");
+	fflush(NULL);
+
+	if (!load_random("/dev/urandom"))
+		return 2;
+
+	iterations = 1000;
+	if (argc > 1)
+		iterations = atoi(argv[1]);
+
+	if (iterations) {
+		puts(">> Benchmarking...");
+		fflush(NULL);
+
+		benchmark_helpers(iterations);
+		benchmark_put_dec(iterations * 25000);
+	}
+
+	puts(">> Sizes");
+	show_sizes(*argv);
+
+	puts(">> Testing...");
+	fflush(NULL);
+
+	memset(buf_output, 77, sizeof buf_output);
+
+	range = 10*1000*1000;
+	if (argc > 2)
+		range = atoi(argv[2]);
+
+	ret = test_helpers();
+	ret = test_put_dec_rand() && ret;
+	if (range)
+		ret = test_put_dec_range(range) && ret;
+
+	printf(">> %s%*s\n",
+	       ret
+	     ? "Everything went fine"
+	     : "Some test failed, consult stderr",
+	       WIDTH, "");
+
+	/* ret is true on success; exit status 0 means success */
+	return ret ? 0 : 1;
+}
+
+
+static bool test(const char *name, char *b, char *fmt, ...);
+static void stop(const char *name, struct timeval *start);
+
+static unsigned long long __random[1 << 20];
+static inline unsigned long long randll(unsigned n) {
+	return __random[n & ((sizeof __random / sizeof *__random) - 1)];
+}
+
+
+static bool load_random(const char *file)
+{
+	size_t left, pos;
+	bool ret = true;
+	int fd;
+
+	fd = open(file, O_RDONLY);
+	if (fd < 0) {
+		perror(file);
+		return false;
+	}
+
+	left = sizeof __random;
+	pos = 0;
+	do {
+		ssize_t r = read(fd, ((char *)__random) + pos, left);
+		if (r > 0) {
+			pos  += r;
+			left -= r;
+		} else if (r == 0) {
+			printf("%s: file too small\n", file);
+			ret = false;
+			break;
+		} else if (errno == EAGAIN || errno == EINTR) {
+			/* nothing */
+		} else {
+			perror(file);
+			ret = false;
+			break;
+		}
+	} while (left);
+
+	close(fd);
+	return ret;
+}
+
+
+static void benchmark_helpers(unsigned long iterations)
+{
+	const struct put_dec_helper *helper = put_dec_helpers;
+
+	printf("\thelpers (%lu iterations)\n", iterations * 100000);
+	fflush(NULL);
+
+	do {
+		char *(*func)(char *buf, unsigned v);
+		struct timeval start;
+		unsigned long o;
+		unsigned inner;
+
+		func = helper->func;
+		o = helper->outer * iterations;
+		inner = helper->inner;
+		gettimeofday(&start, NULL);
+
+		do {
+			unsigned i = inner;
+			do {
+				func(buf_output, i);
+			} while (--i);
+		} while (--o);
+
+		stop(helper->name, &start);
+		++helper;
+	} while (helper->name);
+
+	fflush(NULL);
+}
+
+static void benchmark_put_dec(unsigned long iterations)
+{
+	const struct put_dec *pd = put_decs;
+
+	printf("\tput_dec (%lu iterations)\n", iterations);
+	fflush(NULL);
+
+	do {
+		char *(*func)(char *buf, unsigned long long v);
+		struct timeval start;
+		unsigned long i;
+
+		func = pd->func;
+		i = iterations;
+		gettimeofday(&start, NULL);
+
+		do {
+			func(buf_output, randll(--i));
+		} while (i);
+
+		stop(pd->name, &start);
+		++pd;
+	} while (pd->name);
+
+	fflush(NULL);
+}
+
+static void show_sizes(char *app)
+{
+	setenv("APP", app, 1);
+	printf("\tobjdump -t '%s' | grep -F _put_dec | cut -f 2-\n", app);
+	fflush(NULL);
+	system("objdump -t \"$APP\" | grep -F _put_dec | cut -f 2-");
+}
+
+static bool test_helpers(void)
+{
+	const struct put_dec_helper *helper = put_dec_helpers;
+	bool ret = true;
+
+	puts("\thelpers");
+	fflush(NULL);
+
+	do {
+		unsigned i = helper->inner;
+
+		do {
+			--i;
+
+			ret = test(helper->name, helper->func(buf_output, i),
+				   "%0*u", helper->width, i) && ret;
+		} while (i);
+
+		++helper;
+	} while (helper->name);
+
+	fflush(NULL);
+
+	return ret;
+}
+
+static bool __test_put_dec(unsigned long long v, bool print)
+{
+	const struct put_dec *pd = put_decs;
+	bool ret = true;
+
+	sprintf(buf_expected, "%llu", v);
+
+	if (print) {
+		printf("\ttesting: %*s\r", WIDTH, buf_expected);
+		fflush(NULL);
+	}
+
+	do {
+		ret = test(pd->name, pd->func(buf_output, v), NULL) && ret;
+		++pd;
+	} while (pd->name);
+
+	return ret;
+}
+
+static bool test_put_dec_rand(void)
+{
+	size_t i = sizeof __random / sizeof *__random;
+	bool ret = true;
+
+	printf("\tput_dec, %zu random numbers\n", i);
+	fflush(NULL);
+
+	do {
+		--i;
+		ret = __test_put_dec(randll(i), !(i & 0xffff)) && ret;
+	} while (i);
+
+	return ret;
+}
+
+static bool test_put_dec_range(unsigned long long range)
+{
+	unsigned long long edge = 1;
+	bool ret = true;
+
+	printf("\tput_dec, %llu numbers around the \"edges\"%10s\n", range, "");
+	fflush(NULL);
+
+	do {
+		unsigned long long i = 2 * range, v = edge - range;
+		do {
+			ret = __test_put_dec(v, !(i & 0xffff)) && ret;
+			++v;
+		} while (--i);
+
+		edge <<= 16;
+	} while (edge);
+
+	return ret;
+}
+
+
+static bool test(const char *name, char *b, char *fmt, ...)
+{
+	char *a = buf_output;
+	va_list ap;
+	bool ret;
+
+	*b-- = '\0';
+	while (a < b) {
+		char tmp = *a;
+		*a = *b;
+		*b = tmp;
+		++a;
+		--b;
+	}
+
+	if (fmt) {
+		va_start(ap, fmt);
+		vsprintf(buf_expected, fmt, ap);
+		va_end(ap);
+	}
+
+	ret = !strcmp(buf_output, buf_expected);
+	if (!ret)
+		fprintf(stderr, "%-20s: expecting %*s got %*s\n",
+			name, WIDTH, buf_expected, WIDTH, buf_output);
+
+	memset(buf_output, 77, sizeof buf_output);
+
+	return ret;
+}
+
+static void stop(const char *name, struct timeval *start)
+{
+	struct timeval stop;
+	gettimeofday(&stop, NULL);
+
+	stop.tv_sec -= start->tv_sec;
+	if (stop.tv_usec < start->tv_usec) {
+		--stop.tv_sec;
+		stop.tv_usec += 1000000;
+	}
+	stop.tv_usec -= start->tv_usec;
+
+	fflush(NULL);
+	printf("%-20s: %3lu.%06lus\n", name,
+	       (unsigned long)stop.tv_sec, (unsigned long)stop.tv_usec);
+}
-- 
1.7.1



* Re: [PATCHv2 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines
  2010-08-08 19:29 ` [PATCHv2 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Michal Nazarewicz
  2010-08-08 19:29   ` [PATCHv2 3/3] lib: vsprintf: added a put_dec() test and benchmark tool Michal Nazarewicz
@ 2010-08-10  4:15   ` Denys Vlasenko
  2010-08-10  7:42     ` Michał Nazarewicz
  1 sibling, 1 reply; 10+ messages in thread
From: Denys Vlasenko @ 2010-08-10  4:15 UTC (permalink / raw)
  To: Michal Nazarewicz
  Cc: linux-kernel, m.nazarewicz, Douglas W. Jones, Andrew Morton

On Sunday 08 August 2010 21:29, Michal Nazarewicz wrote:
> Compared to previous version: the code is used only if:
> 1. if long long is 64-bit (ie. ULLONG_MAX == 2**64-1), and
> 2. user did not select optimisation for size with Kconfig.

I measured the size and it does not seem to make sense
to exclude it on -Os. On x86:

put_dec_full change: 0x93 -> 0x47 bytes
put_dec      change: 0x12c -> 0x137 bytes

IOW, there is net code size reduction (compared to current kernel,
it may be a slight growth compared to patch 1).

So, please use the optimized code even for CONFIG_CC_OPTIMIZE_FOR_SIZE.


> Here are the results (normalised to the fastest/smallest):
>                     :        ARM       Atom
> -- Speed ----------------------------------
> orig_put_dec        :   9.333822   2.083110  Original
> mod1_put_dec        :   9.282045   1.904564
> mod2_put_dec        :   9.260409   1.910302
> mod3_put_dec        :   9.320053   1.905689  Proposed by previous patch
> mod4_put_dec        :   9.297146   1.933971
> mod5_put_dec        :  13.034318   2.434942
> mod6_put_dec        :   1.000000   1.000000  Proposed by this patch
> mod7_put_dec        :   1.009574   1.014147
> mod8_put_dec        :   7.226004   1.953460
> -- Size -----------------------------------
> orig_put_dec        :   1.000000   1.000000  Original
> mod1_put_dec        :   1.000000   1.000000
> mod2_put_dec        :   1.361111   1.403226
> mod3_put_dec        :   1.000000   1.000000  Proposed by previous patch
> mod4_put_dec        :   1.361111   1.403226
> mod5_put_dec        :   1.000000   1.000000
> mod6_put_dec        :   2.555556   3.508065  Proposed by this patch
> mod7_put_dec        :   2.833333   3.911290
> mod8_put_dec        :   2.027778   2.258065

I believe these are old results? Size growth is just too big.


> As can be observed, the proposed version of the put_dec function is
> twice as fast as the original version on Atom and almost 10 times
> faster on ARM.  I imagine that it may be similar on other "embedded"
> processors.
> 
> This may be skewed by the fact that the benchmark is using GCC's
> 64-bit division operator instead of kernel's do_div but it would
> appear that by avoiding 64-bit division something can be gained.

Re speed: on Phenom II in 32-bit mode, I see ~x3.3 speedup
on conversions involving large integers (might be skewed
by gcc's full-blown 64-bit division in "old" code - kernel's
div is smarter).


> PS. From Mr. Jones site: "Nonetheless, before relying on the material
> here, it would be prudent to check the arithmetic!" hence I checked
> all the calculations myself and everything seemed fine.  I've also run
> test application several times so it tested a few 64-bit numbers..."

I tested [0, 100 million] and [2^64-100 million, 2^64-1] ranges.
No errors.
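
The arithmetic being checked boils down to the multipliers in the quoted
put_dec(): 5536 and 6, then 7296, 9496 and 42, then 656, 7671, 4749 and 281
are the base-10000 digits of 2^16, 2^32 and 2^48. A minimal user-space
sketch of re-deriving them follows; the helper names to_base10000() and
jones_constants_ok() are made up for this illustration and are not part of
the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Split v into base-10000 "digits", least significant first. */
static void to_base10000(uint64_t v, unsigned out[5])
{
	for (int i = 0; i < 5; ++i) {
		out[i] = (unsigned)(v % 10000);
		v /= 10000;
	}
	/* 10000^5 = 10^20 exceeds 2^64, so five digits always suffice. */
	assert(v == 0);
}

/* Returns 1 if the multipliers match the digits of 2^16, 2^32 and 2^48. */
static int jones_constants_ok(void)
{
	unsigned p16[5], p32[5], p48[5];

	to_base10000(1ULL << 16, p16);
	to_base10000(1ULL << 32, p32);
	to_base10000(1ULL << 48, p48);

	return p16[0] == 5536 && p16[1] == 6 &&
	       p32[0] == 7296 && p32[1] == 9496 && p32[2] == 42 &&
	       p48[0] == 656  && p48[1] == 7671 && p48[2] == 4749 &&
	       p48[3] == 281;
}
```

This only confirms the constants; it says nothing about intermediate
overflow, which still has to be checked separately (for instance by range
tests like the ones above).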


> +#if BITS_PER_LONG != 32 || defined CONFIG_CC_OPTIMIZE_FOR_SIZE || \
> +	ULLONG_MAX != 18446744073709551615ULL

I think it's better to say "if BITS_PER_LONG > 32 and ULLONG_MAX > 2^64-1",
since it expresses your intent better. Also, add comments explaining
what case you optimize for:

#if BITS_PER_LONG > 32 || ULLONG_MAX > 18446744073709551615ULL

/* Generic code */
...

#else /* BITS_PER_LONG <= 32 && ULLONG_MAX <= 2^64-1 */

/* Optimized code for arches with 64-bit long longs */
...


> +static noinline_for_stack
> +char *put_dec(char *buf, unsigned long long n)
> +{
> +	uint32_t d3, d2, d1, q;
> +
> +	if (!n) {
> +		*buf++ = '0';
> +		return buf;
> +	}

You may as well use the above shortcut for n <= 9, not only for 0.
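
Roughly what I have in mind, as a sketch only: put_dec_sketch() is a
hypothetical stand-in, and the generic loop below merely takes the place of
the optimised body so that the example is self-contained. Digits are
written least significant first, as the kernel helpers do.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for put_dec(): writes the decimal digits of n
 * least significant first and returns a pointer past the last digit.
 */
static char *put_dec_sketch(char *buf, unsigned long long n)
{
	assert(buf != NULL);

	/* Single-digit fast path: covers 0 as well as 1..9. */
	if (n <= 9) {
		*buf++ = (char)('0' + n);
		return buf;
	}

	/* Generic fallback standing in for the optimised body. */
	while (n) {
		*buf++ = (char)('0' + n % 10);
		n /= 10;
	}
	return buf;
}
```

The comparison costs the same as the test against zero, and every
single-digit value then skips the multiplications entirely.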

> +	buf = put_dec_full4(buf, q % 10000);
> +	q   = q / 10000;
> +
> +	d1  = q + 7671 * d3 + 9496 * d2 + 6 * d1;
> +	buf = put_dec_full4(buf, d1 % 10000);
> +	q   = d1 / 10000;

I experimented with moving division up, before put_dec_full4:
q   = d1 / 10000;
buf = put_dec_full4(buf, d1 % 10000);
but gcc appears to be smart enough to do this transformation
itself. But you may still do it for older (dumber) gcc's.

-- 
vda


* Re: [PATCHv2 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines
  2010-08-10  4:15   ` [PATCHv2 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Denys Vlasenko
@ 2010-08-10  7:42     ` Michał Nazarewicz
  2010-08-10 16:10       ` Denys Vlasenko
  0 siblings, 1 reply; 10+ messages in thread
From: Michał Nazarewicz @ 2010-08-10  7:42 UTC (permalink / raw)
  To: Michal Nazarewicz, Denys Vlasenko
  Cc: linux-kernel, Douglas W. Jones, Andrew Morton

> On Sunday 08 August 2010 21:29, Michal Nazarewicz wrote:
>> Compared to previous version: the code is used only if:
>> 1. if long long is 64-bit (ie. ULLONG_MAX == 2**64-1), and
>> 2. user did not select optimisation for size with Kconfig.

On Tue, 10 Aug 2010 06:15:52 +0200, Denys Vlasenko <vda.linux@googlemail.com> wrote:
> I measured the size and it does not seem to make sense
> to exclude it on -Os. On x86:
>
> put_dec_full change: 0x93 -> 0x47 bytes
> put_dec      change: 0x12c -> 0x137 bytes
>
> IOW, there is net code size reduction (compared to current kernel,
> it may be a slight growth compared to patch 1).
>
> So, please use the optimized code even for CONFIG_CC_OPTIMIZE_FOR_SIZE.

Will do.

>> Here are the results (normalised to the fastest/smallest):
>>                     :        ARM       Atom
>> -- Speed ----------------------------------
>> orig_put_dec        :   9.333822   2.083110  Original
>> mod1_put_dec        :   9.282045   1.904564
>> mod2_put_dec        :   9.260409   1.910302
>> mod3_put_dec        :   9.320053   1.905689  Proposed by previous patch
>> mod4_put_dec        :   9.297146   1.933971
>> mod5_put_dec        :  13.034318   2.434942
>> mod6_put_dec        :   1.000000   1.000000  Proposed by this patch
>> mod7_put_dec        :   1.009574   1.014147
>> mod8_put_dec        :   7.226004   1.953460
>> -- Size -----------------------------------
>> orig_put_dec        :   1.000000   1.000000  Original
>> mod1_put_dec        :   1.000000   1.000000
>> mod2_put_dec        :   1.361111   1.403226
>> mod3_put_dec        :   1.000000   1.000000  Proposed by previous patch
>> mod4_put_dec        :   1.361111   1.403226
>> mod5_put_dec        :   1.000000   1.000000
>> mod6_put_dec        :   2.555556   3.508065  Proposed by this patch
>> mod7_put_dec        :   2.833333   3.911290
>> mod8_put_dec        :   2.027778   2.258065
>
> I believe these are old results? Size growth is just too big.

Hmm...  I think those are new results, but I might have messed something
up.  I'll redo them.

>> +#if BITS_PER_LONG != 32 || defined CONFIG_CC_OPTIMIZE_FOR_SIZE || \
>> +	ULLONG_MAX != 18446744073709551615ULL
>
> I think it's better to say "if BITS_PER_LONG > 32 and ULLONG_MAX > 2^64-1",
> since it expresses your intent better. Also, add comments explaining
> what case you optimize for:

Will do.

>> +static noinline_for_stack
>> +char *put_dec(char *buf, unsigned long long n)
>> +{
>> +	uint32_t d3, d2, d1, q;
>> +
>> +	if (!n) {
>> +		*buf++ = '0';
>> +		return buf;
>> +	}

> You may as well use the above shortcut for n <= 9, not only for 0.

Will do.

>> +	buf = put_dec_full4(buf, q % 10000);
>> +	q   = q / 10000;
>> +
>> +	d1  = q + 7671 * d3 + 9496 * d2 + 6 * d1;
>> +	buf = put_dec_full4(buf, d1 % 10000);
>> +	q   = d1 / 10000;
>
> I experimented with moving division up, before put_dec_full4:
> q   = d1 / 10000;
> buf = put_dec_full4(buf, d1 % 10000);
> but gcc appears to be smart enough to do this transformation
> itself. But you may still do it for older (dumber) gcc's.

I wasn't sure where would be a better place to put this line.  I'll
follow your advice on this one then.

-- 
Best regards,                                        _     _
| Humble Liege of Serenely Enlightened Majesty of  o' \,=./ `o
| Computer Science,  Michał "mina86" Nazarewicz       (o o)
+----[mina86*mina86.com]---[mina86*jabber.org]----ooO--(_)--Ooo--


* Re: [PATCHv2 2/3] lib: vsprintf: optimised put_dec() for 32-bit  machines
  2010-08-10  7:42     ` Michał Nazarewicz
@ 2010-08-10 16:10       ` Denys Vlasenko
  0 siblings, 0 replies; 10+ messages in thread
From: Denys Vlasenko @ 2010-08-10 16:10 UTC (permalink / raw)
  To: Michał Nazarewicz
  Cc: Michal Nazarewicz, linux-kernel, Douglas W. Jones, Andrew Morton

2010/8/10 Michał Nazarewicz <m.nazarewicz@samsung.com>:
>>> +#if BITS_PER_LONG != 32 || defined CONFIG_CC_OPTIMIZE_FOR_SIZE || \
>>> +       ULLONG_MAX != 18446744073709551615ULL
>>
>> I think it's better to say "if BITS_PER_LONG > 32 and ULLONG_MAX >
>> 2^64-1",

Thinko. I meant "or", not "and"...

-- 
vda


* Re: [PATCHv2 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full()
  2010-08-08 19:29 [PATCHv2 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full() Michal Nazarewicz
  2010-08-08 19:29 ` [PATCHv2 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Michal Nazarewicz
@ 2010-08-10  3:17 ` Denys Vlasenko
  2010-08-10  7:39   ` Michał Nazarewicz
  1 sibling, 1 reply; 10+ messages in thread
From: Denys Vlasenko @ 2010-08-10  3:17 UTC (permalink / raw)
  To: Michal Nazarewicz
  Cc: linux-kernel, m.nazarewicz, Douglas W. Jones, Andrew Morton

On Sunday 08 August 2010 21:29, Michal Nazarewicz wrote:
> -- Speed --------------------------------------------------------
> orig_put_dec_full   :   1.054570   1.356214   1.732636   1.725760  Original
> mod1_put_dec_full   :   1.000000   1.017216   1.255518   1.116559
> mod3_put_dec_full   :   1.018222   1.000000   1.000000   1.000000  Proposed
> 
> orig_put_dec_trunc  :   1.137903   1.216017   1.850478   1.662370  Original
> mod1_put_dec_trunc  :   1.000000   1.078154   1.355635   1.400637
> mod3_put_dec_trunc  :   1.025989   1.000000   1.000000   1.000000  Proposed
> -- Size ---------------------------------------------------------
> orig_put_dec_full   :   1.212766   1.310345   1.355372   1.355372  Original
> mod1_put_dec_full   :   1.021277   1.000000   1.000000   1.000000
> mod3_put_dec_full   :   1.000000   1.172414   1.049587   1.049587  Proposed
> 
> orig_put_dec_trunc  :   1.363636   1.317365   1.784000   1.784000  Original
> mod1_put_dec_trunc  :   1.181818   1.275449   1.400000   1.400000
> mod3_put_dec_trunc  :   1.000000   1.000000   1.000000   1.000000  Proposed

In my testing on Phenom II the speed gain is smaller,
but it is indeed faster. And smaller!


> +	/*
> +	 * '(x * 0xcccd) >> 19' is an approximation of 'x / 10' that
> +	 * gives correct results for all x < 81920.  However, because
> +	 * intermediate result can be at most 32-bit we limit x to be
> +	 * 16-bit.
> +	 *
> +	 * Because of those, we check if we are dealing with a "big"
> +	 * number and if so, we make it smaller remembering to add to
> +	 * the most significant digit.
> +	 */
> +	if (q >= 50000) {
> +		a  = '5';
> +		q -= 50000;
...
> +	/*
> +	 * We need to check if q is < 65536 so we might as well check

You meant "need to check if q is < 81920"?

> +	 * if we can just call the _full version of this function.
> +	 */
> +	if (q > 9999)
> +		return put_dec_full(buf, q);

-- 
vda


* Re: [PATCHv2 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full()
  2010-08-10  3:17 ` [PATCHv2 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full() Denys Vlasenko
@ 2010-08-10  7:39   ` Michał Nazarewicz
  2010-08-10 16:08     ` Denys Vlasenko
  0 siblings, 1 reply; 10+ messages in thread
From: Michał Nazarewicz @ 2010-08-10  7:39 UTC (permalink / raw)
  To: Michal Nazarewicz, Denys Vlasenko
  Cc: linux-kernel, Douglas W. Jones, Andrew Morton

> On Sunday 08 August 2010 21:29, Michal Nazarewicz wrote:
>> +	/*
>> +	 * '(x * 0xcccd) >> 19' is an approximation of 'x / 10' that
>> +	 * gives correct results for all x < 81920.  However, because
>> +	 * intermediate result can be at most 32-bit we limit x to be
>> +	 * 16-bit.
>> +	 *
>> +	 * Because of those, we check if we are dealing with a "big"
>> +	 * number and if so, we make it smaller remembering to add to
>> +	 * the most significant digit.
>> +	 */
>> +	if (q >= 50000) {
>> +		a  = '5';
>> +		q -= 50000;
> ...
>> +	/*
>> +	 * We need to check if q is < 65536 so we might as well check

On Tue, 10 Aug 2010 05:17:48 +0200, Denys Vlasenko <vda.linux@googlemail.com> wrote:
> You meant "need to check if q is < 81920"?

No.  81920 is a 17 bit number and when we multiply it by 0xcccd we lose
the most significant bit.  Therefore we cannot use the '(x * 0xcccd) >>
19' approximation for numbers which are higher than 65535.

>> +	 * if we can just call the _full version of this function.
>> +	 */
>> +	if (q > 9999)
>> +		return put_dec_full(buf, q);

-- 
Best regards,                                        _     _
| Humble Liege of Serenely Enlightened Majesty of  o' \,=./ `o
| Computer Science,  Michał "mina86" Nazarewicz       (o o)
+----[mina86*mina86.com]---[mina86*jabber.org]----ooO--(_)--Ooo--


* Re: [PATCHv2 1/3] lib: vsprintf: optimised put_dec_trunc() and  put_dec_full()
  2010-08-10  7:39   ` Michał Nazarewicz
@ 2010-08-10 16:08     ` Denys Vlasenko
  2010-08-10 22:42       ` Michal Nazarewicz
  0 siblings, 1 reply; 10+ messages in thread
From: Denys Vlasenko @ 2010-08-10 16:08 UTC (permalink / raw)
  To: Michał Nazarewicz
  Cc: Michal Nazarewicz, linux-kernel, Douglas W. Jones, Andrew Morton

2010/8/10 Michał Nazarewicz <m.nazarewicz@samsung.com>:
>> On Sunday 08 August 2010 21:29, Michal Nazarewicz wrote:
>>>
>>> +       /*
>>> +        * '(x * 0xcccd) >> 19' is an approximation of 'x / 10' that
>>> +        * gives correct results for all x < 81920.  However, because
>>> +        * intermediate result can be at most 32-bit we limit x to be
>>> +        * 16-bit.
>>> +        *
>>> +        * Because of those, we check if we are dealing with a "big"
>>> +        * number and if so, we make it smaller remembering to add to
>>> +        * the most significant digit.
>>> +        */
>>> +       if (q >= 50000) {
>>> +               a  = '5';
>>> +               q -= 50000;
>>
>> ...
>>>
>>> +       /*
>>> +        * We need to check if q is < 65536 so we might as well check
>
> On Tue, 10 Aug 2010 05:17:48 +0200, Denys Vlasenko
> <vda.linux@googlemail.com> wrote:
>>
>> You meant "need to check if q is < 81920"?
>
> No.  81920 is a 17 bit number and when we multiply it by 0xcccd we lose
> the most significant bit.
>  Therefore we cannot use the '(x * 0xcccd) >>
> 19' approximation for numbers which are higher than 65535.

No. All x up to (exclusive) 81920 can be multiplied by 0xcccd
and result still fits into 32 bits. Proof:

# printf "%x\n" $((81919 * 0xcccd))
ffff7333
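
The same check can be done exhaustively in C with 32-bit arithmetic only.
This is a sketch, and the helper names max_x_no_overflow() and
div10_ok_below() are invented for the illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Largest x for which x * mul still fits in an unsigned 32-bit value. */
static uint32_t max_x_no_overflow(uint32_t mul)
{
	assert(mul != 0);
	return UINT32_MAX / mul;
}

/*
 * Returns 1 if (x * 0xcccd) >> 19 equals x / 10 for every x < limit.
 * The product is deliberately computed in 32 bits, so any wrap-around
 * shows up as a mismatch.
 */
static int div10_ok_below(uint32_t limit)
{
	for (uint32_t x = 0; x < limit; ++x)
		if (((x * 0xcccdu) >> 19) != x / 10)
			return 0;
	return 1;
}
```

max_x_no_overflow(0xcccd) gives 81919, and div10_ok_below(81920) holds,
whereas x = 81920 is the first value whose product wraps.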

-- 
vda


* Re: [PATCHv2 1/3] lib: vsprintf: optimised put_dec_trunc() and  put_dec_full()
  2010-08-10 16:08     ` Denys Vlasenko
@ 2010-08-10 22:42       ` Michal Nazarewicz
  0 siblings, 0 replies; 10+ messages in thread
From: Michal Nazarewicz @ 2010-08-10 22:42 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Michał Nazarewicz, linux-kernel, Douglas W. Jones,
	Andrew Morton

Denys Vlasenko <vda.linux@googlemail.com> writes:

> 2010/8/10 Michał Nazarewicz <m.nazarewicz@samsung.com>:
>>> On Sunday 08 August 2010 21:29, Michal Nazarewicz wrote:
>>>>
>>>> +       /*
>>>> +        * '(x * 0xcccd) >> 19' is an approximation of 'x / 10' that
>>>> +        * gives correct results for all x < 81920.  However, because
>>>> +        * intermediate result can be at most 32-bit we limit x to be
>>>> +        * 16-bit.
>>>> +        *
>>>> +        * Because of those, we check if we are dealing with a "big"
>>>> +        * number and if so, we make it smaller remembering to add to
>>>> +        * the most significant digit.
>>>> +        */
>>>> +       if (q >= 50000) {
>>>> +               a  = '5';
>>>> +               q -= 50000;
>>>
>>> ...
>>>>
>>>> +       /*
>>>> +        * We need to check if q is < 65536 so we might as well check
>>
>> On Tue, 10 Aug 2010 05:17:48 +0200, Denys Vlasenko
>> <vda.linux@googlemail.com> wrote:
>>>
>>> You meant "need to check if q is < 81920"?
>>
>> No.  81920 is a 17 bit number and when we multiply it by 0xcccd we lose
>> the most significant bit.
>>  Therefore we cannot use the '(x * 0xcccd) >>
>> 19' approximation for numbers which are higher than 65535.
>
> No. All x up to (exclusive) 81920 can be multiplied by 0xcccd
> and result still fits into 32 bits. Proof:
>
> # printf "%x\n" $((81919 * 0xcccd))
> ffff7333

Turns out something else was the problem: (x * 13) >> 7 only works for
x < 69.  I'll update comments in the next version.
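
For reference, a brute-force sketch of how such a limit can be found for
any multiplier/shift pair. first_bad_x() is a hypothetical helper, not
code from the patch, and it uses 64-bit intermediates on purpose so that
overflow does not mask the approximation error:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Smallest x below limit for which (x * mul) >> shift differs from
 * x / 10.  Returns limit if no mismatch is found.
 */
static uint32_t first_bad_x(uint32_t mul, unsigned shift, uint32_t limit)
{
	assert(shift < 64);

	for (uint32_t x = 0; x < limit; ++x)
		if ((((uint64_t)x * mul) >> shift) != x / 10)
			return x;
	return limit;
}
```

first_bad_x(0x0d, 7, 1000) returns 69, which matches the limit above.  The
same helper can be pointed at the other pairs from the benchmark tool
(0xcd >> 11, 0x199a >> 16, 0xcccd >> 19) to see how far each one is exact.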

-- 
Best regards,                                         _     _
 .o. | Liege of Serenly Enlightened Majesty of      o' \,=./ `o
 ..o | Computer Science,  Michal "mina86" Nazarewicz   (o o)
 ooo +--<mina86-tlen.pl>--<jid:mina86-jabber.org>--ooO--(_)--Ooo--

