* [PATCH 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full()
@ 2010-08-05 22:38 Michal Nazarewicz
2010-08-05 22:38 ` [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Michal Nazarewicz
2010-08-06 3:58 ` [PATCH 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full() Denys Vlasenko
0 siblings, 2 replies; 14+ messages in thread
From: Michal Nazarewicz @ 2010-08-05 22:38 UTC (permalink / raw)
To: linux-kernel, linux-kernel
Cc: m.nazarewicz, Michal Nazarewicz, Douglas W. Jones, Denis Vlasenko,
Andrew Morton
The put_dec_trunc() and put_dec_full() functions were based on
a code optimised for processors with 8-bit ALU but even then
they failed to satisfy the same constraints and in fact
required at least 16-bit ALU (because at least one number they
operate in can take 9 bits).
This version of those functions proposed by this patch goes
further and uses the full capacity of a 32-bit ALU and instead
of splitting the number into nibbles and operating on them it
performs the obvious algorithm for base conversion expect it
uses optimised code for dividing by ten (ie. no division is
actually performed).
Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
---
lib/vsprintf.c | 150 +++++++++++++++++++++++++++----------------------------
1 files changed, 74 insertions(+), 76 deletions(-)
I did some benchmark on the following three processors:
Phenom: AMD Phenom(tm) II X3 710 Processor (64-bit)
Atom: Intel(R) Atom(TM) CPU N270 @ 1.60GHz (32-bit)
ARM: ARMv7 Processor rev 2 (v7l) (32-bit)
Here are the results (normalised to the fastest/smallest):
: ARM Phonem Intel
-- Speed -------------------------------------------------------
orig_put_dec_full : 1.078600 1.777800 1.356917 Original
mod1_put_dec_full : 1.000000 1.117665 1.017742
mod3_put_dec_full : 1.032507 1.000000 1.000000 Proposed
orig_put_dec_trunc : 1.092177 1.657014 1.215658 Original
mod1_put_dec_trunc : 1.006836 1.395088 1.078385
mod3_put_dec_trunc : 1.000000 1.000000 1.000000 Proposed
-- Size --------------------------------------------------------
orig_put_dec_full : 1.212766 1.355372 1.310345 Original
mod1_put_dec_full : 1.021277 1.000000 1.000000
mod3_put_dec_full : 1.000000 1.049587 1.172414 Proposed
orig_put_dec_trunc : 1.363636 1.784000 1.317365 Original
mod1_put_dec_trunc : 1.181818 1.400000 1.275449
mod3_put_dec_trunc : 1.000000 1.000000 1.000000 Proposed
Source of the benchmark as well as code of all the modified version of
functions is included with the third patch of the benchmark.
As it can be observed from the table, the "mod3" version (proposed by
this patch) is the fastest version with the only exception of
"mod3_put_dec_full" on ARM which is slightly slower then
"mod1_put_dec_full" version.
It is also smaller, in terms of code size, then the original version
even though "mod1" is even smaller.
In the end, I'm proposing "mod3" because the size is not that
important (those are mere bytes) and as of speed, for ARM I have
proposed another solution in the next patch of this patchset.
The function is also shorter in terms of lines of code. ;)
I'm currently running 2.6.35 with this patch applied. It applies just
fine on -next as well but I haven't tested this kernel and I've run it
with -next on ARM.
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index b8a2f54..d63fb15 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -278,96 +278,94 @@ int skip_atoi(const char **s)
return i;
}
-/* Decimal conversion is by far the most typical, and is used
- * for /proc and /sys data. This directly impacts e.g. top performance
- * with many processes running. We optimize it for speed
- * using code from
- * http://www.cs.uiowa.edu/~jones/bcd/decimal.html
- * (with permission from the author, Douglas W. Jones). */
-
-/* Formats correctly any integer in [0,99999].
- * Outputs from one to five digits depending on input.
- * On i386 gcc 4.1.2 -O2: ~250 bytes of code. */
+/*
+ * Decimal conversion is by far the most typical, and is used for
+ * /proc and /sys data. This directly impacts e.g. top performance
+ * with many processes running.
+ *
+ * We optimize it for speed using ideas described at
+ * <http://www.cs.uiowa.edu/~jones/bcd/divide.html>.
+ *
+ * '(num * 0xcccd) >> 19' is an approximation of 'num / 10' that gives
+ * correct results for num < 81920. Because of this, we check at the
+ * beginning if we are dealing with a number that may cause trouble
+ * and if so, we make it smaller.
+ *
+ * (As a minor note, all operands are always 16 bit so this function
+ * should work well on hardware that cannot multiply 32 bit numbers).
+ *
+ * (Previous a code based on
+ * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html> was used here,
+ * with permission from the author, Douglas W. Jones.)
+ *
+ * Other, possible ways to approx. divide by 10
+ * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386)
+ * (x * 0x67) >> 10: 1100111
+ * (x * 0x34) >> 9: 110100 - same
+ * (x * 0x1a) >> 8: 11010 - same
+ * (x * 0x0d) >> 7: 1101 - same, shortest code (on i386)
+ */
static noinline_for_stack
-char *put_dec_trunc(char *buf, unsigned q)
+char *put_dec_full(char *buf, unsigned q)
{
- unsigned d3, d2, d1, d0;
- d1 = (q>>4) & 0xf;
- d2 = (q>>8) & 0xf;
- d3 = (q>>12);
-
- d0 = 6*(d3 + d2 + d1) + (q & 0xf);
- q = (d0 * 0xcd) >> 11;
- d0 = d0 - 10*q;
- *buf++ = d0 + '0'; /* least significant digit */
- d1 = q + 9*d3 + 5*d2 + d1;
- if (d1 != 0) {
- q = (d1 * 0xcd) >> 11;
- d1 = d1 - 10*q;
- *buf++ = d1 + '0'; /* next digit */
-
- d2 = q + 2*d2;
- if ((d2 != 0) || (d3 != 0)) {
- q = (d2 * 0xd) >> 7;
- d2 = d2 - 10*q;
- *buf++ = d2 + '0'; /* next digit */
-
- d3 = q + 4*d3;
- if (d3 != 0) {
- q = (d3 * 0xcd) >> 11;
- d3 = d3 - 10*q;
- *buf++ = d3 + '0'; /* next digit */
- if (q != 0)
- *buf++ = q + '0'; /* most sign. digit */
- }
- }
+ unsigned r;
+ char a = '0';
+
+ if (q > 0xffff) {
+ a = '6';
+ q -= 60000;
}
+ r = (q * 0xcccd) >> 19;
+ *buf++ = (q - 10 * r) + '0';
+
+ q = (r * 0x199a) >> 16;
+ *buf++ = (r - 10 * q) + '0';
+
+ r = (q * 0xcd) >> 11;
+ *buf++ = (q - 10 * r) + '0';
+
+ q = (r * 0xd) >> 7;
+ *buf++ = (r - 10 * q) + '0';
+
+ *buf++ = q + a;
+
return buf;
}
-/* Same with if's removed. Always emits five digits */
+
+/* Same as above but do not pad with zeros. */
static noinline_for_stack
-char *put_dec_full(char *buf, unsigned q)
+char *put_dec_trunc(char *buf, unsigned q)
{
- /* BTW, if q is in [0,9999], 8-bit ints will be enough, */
- /* but anyway, gcc produces better code with full-sized ints */
- unsigned d3, d2, d1, d0;
- d1 = (q>>4) & 0xf;
- d2 = (q>>8) & 0xf;
- d3 = (q>>12);
+ unsigned r;
/*
- * Possible ways to approx. divide by 10
- * gcc -O2 replaces multiply with shifts and adds
- * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386)
- * (x * 0x67) >> 10: 1100111
- * (x * 0x34) >> 9: 110100 - same
- * (x * 0x1a) >> 8: 11010 - same
- * (x * 0x0d) >> 7: 1101 - same, shortest code (on i386)
+ * We need to check if num is < 81920 so we might as well
+ * check if we can just call the _full version of this
+ * function.
*/
- d0 = 6*(d3 + d2 + d1) + (q & 0xf);
- q = (d0 * 0xcd) >> 11;
- d0 = d0 - 10*q;
- *buf++ = d0 + '0';
- d1 = q + 9*d3 + 5*d2 + d1;
- q = (d1 * 0xcd) >> 11;
- d1 = d1 - 10*q;
- *buf++ = d1 + '0';
-
- d2 = q + 2*d2;
- q = (d2 * 0xd) >> 7;
- d2 = d2 - 10*q;
- *buf++ = d2 + '0';
-
- d3 = q + 4*d3;
- q = (d3 * 0xcd) >> 11; /* - shorter code */
- /* q = (d3 * 0x67) >> 10; - would also work */
- d3 = d3 - 10*q;
- *buf++ = d3 + '0';
- *buf++ = q + '0';
+ if (q > 9999)
+ return put_dec_full(buf, q);
+
+ r = (q * 0xcccd) >> 19;
+ *buf++ = (q - 10 * r) + '0';
+
+ if (r) {
+ q = (r * 0x199a) >> 16;
+ *buf++ = (r - 10 * q) + '0';
+
+ if (q) {
+ r = (q * 0xcd) >> 11;
+ *buf++ = (q - 10 * r) + '0';
+
+ if (r)
+ *buf++ = r + '0';
+ }
+ }
return buf;
}
+
/* No inlining helps gcc to use registers better */
static noinline_for_stack
char *put_dec(char *buf, unsigned long long num)
--
1.7.1
^ permalink raw reply related [flat|nested] 14+ messages in thread* [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines 2010-08-05 22:38 [PATCH 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full() Michal Nazarewicz @ 2010-08-05 22:38 ` Michal Nazarewicz 2010-08-05 22:38 ` [PATCH 3/3] lib: vsprintf: added a put_dec() test and benchmark tool Michal Nazarewicz 2010-08-06 5:18 ` [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Denys Vlasenko 2010-08-06 3:58 ` [PATCH 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full() Denys Vlasenko 1 sibling, 2 replies; 14+ messages in thread From: Michal Nazarewicz @ 2010-08-05 22:38 UTC (permalink / raw) To: linux-kernel, linux-kernel Cc: m.nazarewicz, Michal Nazarewicz, Douglas W. Jones, Denis Vlasenko, Andrew Morton Existing put_dec() function uses a do_div() function for dividing the 64-bit argument. On 32-bit machines this may be a costly operation. This patch, replaces the put_dec() function on 32-bit processors to one that performs no 64-bit divisions. Signed-off-by: Michal Nazarewicz <mina86@mina86.com> --- lib/vsprintf.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 files changed, 102 insertions(+), 1 deletions(-) I did some benchmark on the following two processors: Atom: Intel(R) Atom(TM) CPU N270 @ 1.60GHz ARM: ARMv7 Processor rev 2 (v7l) (I'm ommitting Phenom since this code is ment for 32-bit processors). Here are the results (normalised to the fastest/smallest): : ARM Intel -- Speed --------------------------------------------------------------- orig_put_dec : 10.194361 2.190063 Original mod1_put_dec : 10.259952 2.025620 mod2_put_dec : 10.142540 1.970004 mod3_put_dec : 10.284313 1.961153 Version from previous patch mod4_put_dec : 10.164480 1.986127 mod5_put_dec : 14.529587 2.531521 mod6_put_dec : 1.000000 1.000000 Proposed mod7_put_dec : 1.006486 1.011573 mod8_put_dec : 8.347325 2.153098 -- Size ---------------------------------------------------------------- orig_put_dec : 1.000000 1.000000 Original mod1_put_dec : 1.000000 1.000000 mod2_put_dec : 1.361111 1.403226 mod3_put_dec : 1.000000 1.000000 Version from previous patch mod4_put_dec : 1.361111 1.403226 mod5_put_dec : 1.000000 1.000000 mod6_put_dec : 2.555556 3.508065 Proposed mod7_put_dec : 2.833333 3.911290 mod8_put_dec : 2.027778 2.258065 Source of the benchmark as well as code of all the modified version of functions is included with the third patch of the benchmark. As it can be obsevred, proposed version of the put_dec function is 10 times faster on ARM. I imagine that it may be similar on other "embedded" processors. This may be skewed by the fact that the benchmark is using GCC's 64-bit division operator instead of kernel's do_div but it would appear that by avoiding 64-bit division something can be gained. The disadvantage is that the proposed function is 2.5-3.5 bigger. Those are not big functions though -- we are talking here about proposed function being below 512. The drawback of this function is also that the patch adds a bit of code. It could be questionable whether it's worth optimising that much. Anyway, posting in case someone decides that it is or will be simply interested. :) I'm currently running 2.6.35 with this patch applied. It applies just fine on -next as well but I haven't tested this kernel (I don't really have a machine that I wouldn't be afraid to run -next on ;) ). PS. From Mr. Jones site: "Nonetheless, before relying on the material here, it would be prudent to check the arithmetic!" hence I checked all the calculations myself and everything seemed fine. I've also run test applitacion several times so it tested a few 64-bit numbers. Here's a "bc" script which calculates all the numbers: # You can feed "bc" with this file to check the numbers # n = 1 * n0 + 0 <= n0 <= 65535 # 6 5536 * n1 + 0 <= n1 <= 65535 # 42 9496 7296 * n2 + 0 <= n2 <= 65535 # 281 4749 7671 0656 * n3 0 <= n3 <= 65535 n0 = 65535 n1 = 65535 n2 = 65535 n3 = 65535 # n = 10^ 0 * d0 + # 10^ 4 * d1 + # 10^ 8 * d2 + # 10^12 * d3 + # 10^16 * d4 a0 = 656 * n3 + 7296 * n2 + 5536 * n1 + 1 * n0 print "0 <= a0 <= ", a0, "\n" # 0 <= a0 <= 884 001 615 a1 = 7671 * n3 + 9496 * n2 + 6 * n1 print "0 <= a1 <= ", a1, "\n" # 0 <= a1 <= 1 125 432 555 a2 = 4749 * n3 + 42 * n2 print "0 <= a2 <= ", a2, "\n" # 0 <= a2 <= 313 978 185 a3 = 281 * n3 print "0 <= a3 <= ", a3, "\n" # 0 <= a3 <= 18 415 335 b0 = a0 print "0 <= b0 <= ", b0, "\n0 <= c1 <= ", b0 / 10000, "\n" # 0 <= d0 <= 884 001 615 # 0 <= c1 <= 88 400 b1 = a1 + b0 / 10000 print "0 <= b1 <= ", b1, "\n0 <= c2 <= ", b1 / 10000, "\n" # 0 <= d1 <= 1 125 520 955 # 0 <= c2 <= 112 552 b2 = a2 + b1 / 10000 print "0 <= b2 <= ", b2, "\n0 <= c3 <= ", b2 / 10000, "\n" # 0 <= d2 <= 314 090 737 # 0 <= c3 <= 31 409 b3 = a3 + b2 / 10000 print "0 <= b3 <= ", b3, "\n0 <= c4 <= ", b3 / 10000, "\n" # 0 <= d3 <= 18 446 744 # 0 <= c4 <= 1 844 b4 = a4 + b3 / 10000 print "0 <= b4 <= ", b4, "\n" # 0 <= b4 <= 1 844 diff --git a/lib/vsprintf.c b/lib/vsprintf.c index d63fb15..bfbe662 100644 --- a/lib/vsprintf.c +++ b/lib/vsprintf.c @@ -278,6 +278,8 @@ int skip_atoi(const char **s) return i; } +#if BITS_PER_LONG == 64 + /* * Decimal conversion is by far the most typical, and is used for * /proc and /sys data. This directly impacts e.g. top performance @@ -366,6 +368,9 @@ char *put_dec_trunc(char *buf, unsigned q) return buf; } +/* This is used by ip4_string(). */ +#define put_dec_8bit put_dec_trunc + /* No inlining helps gcc to use registers better */ static noinline_for_stack char *put_dec(char *buf, unsigned long long num) @@ -379,6 +384,102 @@ char *put_dec(char *buf, unsigned long long num) } } +#else + +/* + * This is similar to the put_dec_full() above expect it handles + * numbers from 0 to 9999 (ie. at most four digits). It is used by + * the put_dec() below which is optimised for 32-bit processors. + */ +static noinline_for_stack +char *put_dec_full(char *buf, unsigned q) +{ + unsigned r; + + r = (q * 0xcccd) >> 19; + *buf++ = (q - 10 * r) + '0'; + + q = (r * 0x199a) >> 16; + *buf++ = (r - 10 * q) + '0'; + + r = (q * 0xcd) >> 11; + *buf++ = (q - 10 * r) + '0'; + + *buf++ = r + '0'; + + return buf; +} + +/* + * Similar to above but handles only 8-bit operands and does not pad + * with zeros. + */ +static noinline_for_stack +char *put_dec_8bit(char *buf, unsigned q) +{ + unsigned r; + + r = (q * 0xcd) >> 11; + *buf++ = (q - 10 * r) + '0'; + + if (r) { + q = (r * 0xd) >> 7; + *buf++ = (r - 10 * q) + '0'; + + if (q) + *buf++ = q + '0'; + } + + return buf; +} + +/* + * Based on code by Douglas W. Jones found at + * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html#sixtyfour>. This + * performs no 64-bit division and hence should be faster on 32-bit + * machines then the version of the function above. + */ +static noinline_for_stack +char *put_dec(char *buf, unsigned long long n) +{ + uint32_t d3, d2, d1, q; + + if (!n) { + *buf++ = '0'; + return buf; + } + + d1 = (n >> 16) & 0xFFFF; + d2 = (n >> 32) & 0xFFFF; + d3 = (n >> 48) & 0xFFFF; + + q = 656 * d3 + 7296 * d2 + 5536 * d1 + (n & 0xFFFF); + + buf = put_dec_full(buf, q % 10000); + q = q / 10000; + + d1 = q + 7671 * d3 + 9496 * d2 + 6 * d1; + buf = put_dec_full(buf, d1 % 10000); + q = d1 / 10000; + + d2 = q + 4749 * d3 + 42 * d2; + buf = put_dec_full(buf, d2 % 10000); + q = d2 / 10000; + + d3 = q + 281 * d3; + buf = put_dec_full(buf, d3 % 10000); + q = d3 / 10000; + + buf = put_dec_full(buf, q); + + while (buf[-1] == '0') + --buf; + + return buf; +} + +#endif + #define ZEROPAD 1 /* pad with zero */ #define SIGN 2 /* unsigned/signed long */ #define PLUS 4 /* show plus */ @@ -752,7 +853,7 @@ char *ip4_string(char *p, const u8 *addr, const char *fmt) } for (i = 0; i < 4; i++) { char temp[3]; /* hold each IP quad in reverse order */ - int digits = put_dec_trunc(temp, addr[index]) - temp; + int digits = put_dec_8bit(temp, addr[index]) - temp; if (leading_zeros) { if (digits < 3) *p++ = '0'; -- 1.7.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 3/3] lib: vsprintf: added a put_dec() test and benchmark tool 2010-08-05 22:38 ` [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Michal Nazarewicz @ 2010-08-05 22:38 ` Michal Nazarewicz 2010-08-06 5:10 ` Denys Vlasenko 2010-08-06 5:18 ` [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Denys Vlasenko 1 sibling, 1 reply; 14+ messages in thread From: Michal Nazarewicz @ 2010-08-05 22:38 UTC (permalink / raw) To: linux-kernel, linux-kernel Cc: m.nazarewicz, Michal Nazarewicz, Douglas W. Jones, Denis Vlasenko, Andrew Morton This commit adds a test application for the put_dec() and family of functions that are used by the previous two commits. Signed-off-by: Michal Nazarewicz <mina86@mina86.com> --- tools/put-dec/Makefile | 8 + tools/put-dec/put-dec-test.c | 725 ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 733 insertions(+), 0 deletions(-) create mode 100644 tools/put-dec/Makefile create mode 100644 tools/put-dec/put-dec-test.c This is just a benchmark I've used to test various implementations of the put_dec() and friends. This is not intended as a patch included in the kernel -- it's merely a tool for anyone who'd like to check various versions. diff --git a/tools/put-dec/Makefile b/tools/put-dec/Makefile new file mode 100644 index 0000000..a56e3ef --- /dev/null +++ b/tools/put-dec/Makefile @@ -0,0 +1,7 @@ +CFLAGS := -O2 + +put-dec-test: put-dec-test.c + exec $(CC) $(CFLAGS) -o $@ $< + +clean: + rm -f -- put-dec-test diff --git a/tools/put-dec/put-dec-test.c b/tools/put-dec/put-dec-test.c new file mode 100644 index 0000000..b43cf08 --- /dev/null +++ b/tools/put-dec/put-dec-test.c @@ -0,0 +1,725 @@ +#define _BSD_SOURCE + +#include <stdarg.h> +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/time.h> +#include <time.h> + + +# define do_div(n, base) ({ \ + uint32_t __base = (base); \ + uint32_t __rem = (n) % __base; \ + (n) /= __base; \ + __rem; \ + }) + + +/****************************** Original versian ******************************/ + +static char *orig_put_dec_trunc(char *buf, unsigned q) +{ + unsigned d3, d2, d1, d0; + d1 = (q>>4) & 0xf; + d2 = (q>>8) & 0xf; + d3 = (q>>12); + + d0 = 6*(d3 + d2 + d1) + (q & 0xf); + q = (d0 * 0xcd) >> 11; + d0 = d0 - 10*q; + *buf++ = d0 + '0'; /* least significant digit */ + d1 = q + 9*d3 + 5*d2 + d1; + if (d1 != 0) { + q = (d1 * 0xcd) >> 11; + d1 = d1 - 10*q; + *buf++ = d1 + '0'; /* next digit */ + + d2 = q + 2*d2; + if ((d2 != 0) || (d3 != 0)) { + q = (d2 * 0xd) >> 7; + d2 = d2 - 10*q; + *buf++ = d2 + '0'; /* next digit */ + + d3 = q + 4*d3; + if (d3 != 0) { + q = (d3 * 0xcd) >> 11; + d3 = d3 - 10*q; + *buf++ = d3 + '0'; /* next digit */ + if (q != 0) + *buf++ = q + '0'; /* most sign. digit */ + } + } + } + + return buf; +} +/* Same with if's removed. Always emits five digits */ +static char *orig_put_dec_full(char *buf, unsigned q) +{ + /* BTW, if q is in [0,9999], 8-bit ints will be enough, */ + /* but anyway, gcc produces better code with full-sized ints */ + unsigned d3, d2, d1, d0; + d1 = (q>>4) & 0xf; + d2 = (q>>8) & 0xf; + d3 = (q>>12); + + /* + * Possible ways to approx. divide by 10 + * gcc -O2 replaces multiply with shifts and adds + * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386) + * (x * 0x67) >> 10: 1100111 + * (x * 0x34) >> 9: 110100 - same + * (x * 0x1a) >> 8: 11010 - same + * (x * 0x0d) >> 7: 1101 - same, shortest code (on i386) + */ + d0 = 6*(d3 + d2 + d1) + (q & 0xf); + q = (d0 * 0xcd) >> 11; + d0 = d0 - 10*q; + *buf++ = d0 + '0'; + d1 = q + 9*d3 + 5*d2 + d1; + q = (d1 * 0xcd) >> 11; + d1 = d1 - 10*q; + *buf++ = d1 + '0'; + + d2 = q + 2*d2; + q = (d2 * 0xd) >> 7; + d2 = d2 - 10*q; + *buf++ = d2 + '0'; + + d3 = q + 4*d3; + q = (d3 * 0xcd) >> 11; /* - shorter code */ + /* q = (d3 * 0x67) >> 10; - would also work */ + d3 = d3 - 10*q; + *buf++ = d3 + '0'; + *buf++ = q + '0'; + + return buf; +} + +static __attribute__((noinline)) +char *orig_put_dec(char *buf, unsigned long long num) +{ + while (1) { + unsigned rem; + if (num < 100000) + return orig_put_dec_trunc(buf, num); + rem = do_div(num, 100000); + buf = orig_put_dec_full(buf, rem); + } +} + + + +/****************************** Modified versions ******************************/ + +/* + * Decimal conversion is by far the most typical, and is used for + * /proc and /sys data. This directly impacts e.g. top performance + * with many processes running. + * + * We optimize it for speed using code based on idea described at: + * http://www.cs.uiowa.edu/~jones/bcd/decimal.html (with permission + * from the author, Douglas W. Jones). + * + * The original code was designed for 8-bit ALus but since we can + * assume more capable hardware the code has been rewritten to use the + * following properties: + * + * n = 1 * n0 + ( 0 <= n0 <= 1023 ) + * 1024 * n1 ( 0 <= n1 <= 97 ) + * a0 = 0 + 4 * n1 + 1 * n0 ( 0 <= a0 <= 1412 ) + * a1 = (a0 / 10) + 2 * n1 ( 0 <= a1 <= 335 ) + * a2 = (a1 / 10) + 0 * n1 ( 0 <= a2 <= 33 ) + * a3 = (a2 / 10) + 1 * n1 ( 0 <= a3 <= 100 ) + * d0 = a0 % 10 + * d1 = a1 % 10 + * d2 = a2 % 10 + * d3 = a3 % 10 + * d4 = a3 / 10 + * + * So instead of dividing the number into four nibles we divide it + * into two numbers: first one 10-bit and the other one 7-bit + * (argument is 17-bit number from 0 to 99999). + * + * Moreover, 1024, which is the value second part of the number needs + * to be multiplied by, has nice property that each digit is a power + * of two or zero -- this helps with multiplications. + * + * Possible ways to approx. divide by 10 + * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386) + * (x * 0x67) >> 10: 1100111 + * (x * 0x34) >> 9: 110100 - same + * (x * 0x1a) >> 8: 11010 - same + * (x * 0x0d) >> 7: 1101 - same, shortest code (on i386) + */ +static char *mod1_put_dec_full(char *buf, unsigned q) +{ + unsigned p, r; + p = q >> 10; + + q &= 0x3ff; + q += 4 * p; + r = (q * 0x199A) >> 16; + *buf++ = (q - 10 * r) + '0'; + + r += 2 * p; + q = (r * 0xcd) >> 11; + *buf++ = (r - 10 * q) + '0'; + + /* q += 0; */ + r = (q * 0xcd) >> 11; + *buf++ = (q - 10 * r) + '0'; + + r += p; + q = (r * 0xcd) >> 11; + *buf++ = (r - 10 * q) + '0'; + + *buf++ = q + '0'; + + return buf; +} + +static char *mod1_put_dec_trunc(char *buf, unsigned q) +{ + unsigned p, r; + p = q >> 10; + + q &= 0x3ff; + q += 4 * p; + r = (q * 0x199a) >> 16; + *buf++ = (q - 10 * r) + '0'; + + r += 2 * p; + if (r) { + q = (r * 0xcd) >> 11; + *buf++ = (r - 10 * q) + '0'; + + /* q += 0; */ + if (q || p) { + r = (q * 0xcd) >> 11; + *buf++ = (q - 10 * r) + '0'; + + r += p; + if (r) { + q = (r * 0xcd) >> 11; + *buf++ = (r - 10 * q) + '0'; + + if (q) + *buf++ = q + '0'; + } + } + } + + return buf; +} + + +static __attribute__((noinline)) +char *mod1_put_dec(char *buf, unsigned long long num) +{ + while (1) { + unsigned rem; + if (num < 100000) + return mod1_put_dec_trunc(buf, num); + rem = do_div(num, 100000); + buf = mod1_put_dec_full(buf, rem); + } +} + + +static __attribute__((noinline)) +char *mod2_put_dec(char *buf, unsigned long long num) +{ + if (!num) { + *buf++ = '0'; + return buf; + } + + while (num >= 100000) { + unsigned rem; + rem = do_div(num, 100000); + buf = mod1_put_dec_full(buf, rem); + } + + buf = mod1_put_dec_full(buf, num); + while (buf[-1] == '0') + --buf; + return buf; +} + + + +/* + * Decimal conversion is by far the most typical, and is used for + * /proc and /sys data. This directly impacts e.g. top performance + * with many processes running. + * + * We optimize it for speed using ideas described at + * <http://www.cs.uiowa.edu/~jones/bcd/divide.html>. + * + * '(num * 0xcccd) >> 19' is an approximation of 'num / 10' that gives + * correct results for num < 81920. Because of this, we check at the + * beginning if we are dealing with a number that may cause trouble + * and if so, we make it smaller. + * + * (As a minor note, all operands are always 16 bit so this function + * should work well on hardware that cannot multiply 32 bit numbers). + * + * (Previous a code based on + * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html> was used here, + * with permission from the author, Douglas W. Jones.) + * + * Other, possible ways to approx. divide by 10 + * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386) + * (x * 0x67) >> 10: 1100111 + * (x * 0x34) >> 9: 110100 - same + * (x * 0x1a) >> 8: 11010 - same + * (x * 0x0d) >> 7: 1101 - same, shortest code (on i386) + */ +static char *mod3_put_dec_full(char *buf, unsigned q) +{ + unsigned r; + char a = '0'; + + if (q > 0xffff) { + a = '6'; + q -= 60000; + } + + r = (q * 0xcccd) >> 19; + *buf++ = (q - 10 * r) + '0'; + + q = (r * 0x199a) >> 16; + *buf++ = (r - 10 * q) + '0'; + + r = (q * 0xcd) >> 11; + *buf++ = (q - 10 * r) + '0'; + + q = (r * 0xd) >> 7; + *buf++ = (r - 10 * q) + '0'; + + *buf++ = q + a; + + return buf; +} + +static char *mod3_put_dec_trunc(char *buf, unsigned q) +{ + unsigned r; + + /* + * We need to check if num is < 81920 so we might as well + * check if we can just call the _full version of this + * function. + */ + if (q > 9999) + return mod3_put_dec_full(buf, q); + + r = (q * 0xcccd) >> 19; + *buf++ = (q - 10 * r) + '0'; + + if (r) { + q = (r * 0x199a) >> 16; + *buf++ = (r - 10 * q) + '0'; + + if (q) { + r = (q * 0xcd) >> 11; + *buf++ = (q - 10 * r) + '0'; + + if (r) + *buf++ = r + '0'; + } + } + + return buf; +} + + +static char *mod3_put_dec_8bit(char *buf, unsigned q) +{ + unsigned r; + + r = (q * 0xcd) >> 11; + *buf++ = (q - 10 * r) + '0'; + + if (r) { + q = (r * 0xd) >> 7; + *buf++ = (r - 10 * q) + '0'; + + if (q) + *buf++ = q + '0'; + } + + return buf; +} + + +static __attribute__((noinline)) +char *mod3_put_dec(char *buf, unsigned long long num) +{ + while (1) { + unsigned rem; + if (num < 100000) + return mod3_put_dec_trunc(buf, num); + rem = do_div(num, 100000); + buf = mod3_put_dec_full(buf, rem); + } +} + + +static __attribute__((noinline)) +char *mod4_put_dec(char *buf, unsigned long long num) +{ + if (!num) { + *buf++ = '0'; + return buf; + } + + while (num >= 100000) { + unsigned rem; + rem = do_div(num, 100000); + buf = mod3_put_dec_full(buf, rem); + } + + buf = mod3_put_dec_full(buf, num); + while (buf[-1] == '0') + --buf; + return buf; +} + + + +/* + * Decimal conversion is by far the most typical, and is used for + * /proc and /sys data. This directly impacts e.g. top performance + * with many processes running. + * + * We optimize it for speed using ideas described at + * <http://www.cs.uiowa.edu/~jones/bcd/divide.html>. + * + * (Previous a code based on + * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html> was used here, + * with permission from the author, Douglas W. Jones.) + * + * Other, possible ways to approx. divide by 10 + * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386) + * (x * 0x67) >> 10: 1100111 + * (x * 0x34) >> 9: 110100 - same + * (x * 0x1a) >> 8: 11010 - same + * (x * 0x0d) >> 7: 1101 - same, shortest code (on i386) + */ +static char *mod5_put_dec_full(char *buf, unsigned q) +{ + unsigned r; + + r = (q * 0xcccd) >> 19; + *buf++ = (q - 10 * r) + '0'; + + q = (r * 0x199a) >> 16; + *buf++ = (r - 10 * q) + '0'; + + r = (q * 0xcd) >> 11; + *buf++ = (q - 10 * r) + '0'; + + *buf++ = r + '0'; + + return buf; +} + +static char *mod5_put_dec_trunc(char *buf, unsigned q) +{ + unsigned r; + + r = (q * 0xcccd) >> 19; + *buf++ = (q - 10 * r) + '0'; + + if (r) { + q = (r * 0x199a) >> 16; + *buf++ = (r - 10 * q) + '0'; + + if (q) { + r = (q * 0xcd) >> 11; + *buf++ = (q - 10 * r) + '0'; + + if (r) + *buf++ = r + '0'; + } + } + + return buf; +} + + +static __attribute__((noinline)) +char *mod5_put_dec(char *buf, unsigned long long num) +{ + while (1) { + unsigned rem; + if (num < 10000) + return mod5_put_dec_trunc(buf, num); + rem = do_div(num, 10000); + buf = mod5_put_dec_full(buf, rem); + } +} + +/* + * Based on code by Douglas W. Jones found at + * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html>. + */ +static __attribute__((noinline)) +char *mod6_put_dec(char *buf, unsigned long long n) +{ + uint32_t d3, d2, d1, q; + + if (!n) { + *buf++ = '0'; + return buf; + } + + d1 = (n >> 16) & 0xFFFF; + d2 = (n >> 32) & 0xFFFF; + d3 = (n >> 48) & 0xFFFF; + + q = 656 * d3 + 7296 * d2 + 5536 * d1 + (n & 0xFFFF); + + buf = mod5_put_dec_full(buf, q % 10000); + q = q / 10000; + + d1 = q + 7671 * d3 + 9496 * d2 + 6 * d1; + buf = mod5_put_dec_full(buf, d1 % 10000); + q = d1 / 10000; + + d2 = q + 4749 * d3 + 42 * d2; + buf = mod5_put_dec_full(buf, d2 % 10000); + q = d2 / 10000; + + d3 = q + 281 * d3; + buf = mod5_put_dec_full(buf, d3 % 10000); + q = d3 / 10000; + + buf = mod5_put_dec_trunc(buf, q); + + while (buf[-1] == '0') + --buf; + + return buf; +} + + +/* + * Based on code by Douglas W. Jones found at + * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html>. + */ +static __attribute__((noinline)) +char *mod7_put_dec(char *buf, unsigned long long n) +{ + uint32_t d3, d2, d1, q; + + if (!n) { + *buf++ = '0'; + return buf; + } + + d1 = (n >> 16) & 0xFFFF; + d2 = (n >> 32) & 0xFFFF; + d3 = (n >> 48) & 0xFFFF; + + q = 656 * d3 + 7296 * d2 + 5536 * d1 + (n & 0xFFFF); + + buf = mod5_put_dec_full(buf, q % 10000); + q = q / 10000; + + d1 = q + 7671 * d3 + 9496 * d2 + 6 * d1; + if (d1) { + buf = mod5_put_dec_full(buf, d1 % 10000); + q = d1 / 10000; + + d2 = q + 4749 * d3 + 42 * d2; + if (d2) { + buf = mod5_put_dec_full(buf, d2 % 10000); + q = d2 / 10000; + + d3 = q + 281 * d3; + if (d3) { + buf = mod5_put_dec_full(buf, d3 % 10000); + q = d3 / 10000; + + if (q) + buf = mod5_put_dec_trunc(buf, q); + } + } + } + + while (buf[-1] == '0') + --buf; + + return buf; +} + + +static __attribute__((noinline)) +char *mod8_put_dec(char *buf, unsigned long long num) +{ + while (1) { + unsigned rem = do_div(num, 100000000); + + if (!num && rem < 10000) + return mod5_put_dec_trunc(buf, rem); + buf = mod5_put_dec_full(buf, rem % 10000); + + if (!num) + return mod5_put_dec_trunc(buf, rem / 10000); + buf = mod5_put_dec_full(buf, rem / 10000); + } +} + + + +/****************************** Main ******************************/ + +static uint64_t rand_64(void) +{ + uint64_t v = 0, m = 1; + for (;;) { + uint64_t n; + v = m * rand(); + n = m + m * RAND_MAX; + if (n < m) + break; + m = n; + } + return v; +} + + +static char buf1[24]; + +static int test(const char *name, char *b, char *fmt, ...) +{ + static char buf2[24]; + char *a = buf1; + va_list ap; + + *b-- = '\0'; + while (a < b) { + char tmp = *a; + *a = *b; + *b = tmp; + ++a; + --b; + } + + va_start(ap, fmt); + vsprintf(buf2, fmt, ap); + va_end(ap); + + if (strcmp(buf1, buf2)) { + fprintf(stderr, "%-20s: expecting %20s got %20s\n", + name, buf2, buf1); + return 1; + } + return 0; +} + +static void stop(const char *name, struct timeval *start, struct timeval *correction) +{ + struct timeval stop, *a; + gettimeofday(&stop, NULL); + + a = start; + do { + stop.tv_sec -= a->tv_sec; + if (stop.tv_usec < a->tv_usec) { + --stop.tv_sec; + stop.tv_usec += 1000000; + } + stop.tv_usec -= a->tv_usec; + + a = correction; + correction = NULL; + } while (a); + + if (name) { + fflush(NULL); + printf("%-20s: %3d.%06ds\n", name, stop.tv_sec, stop.tv_usec); + } else { + *start = stop; + } +} + +#define FUNC(func, outer, inner, correction, format, value) do { \ + struct timeval start; \ + unsigned i, o; \ + for (i = (inner); i--; ) { \ + typeof(value) v = (value); \ + ret |= test(#func, func(buf1, v), format, v); \ + } \ + \ + gettimeofday(&start, NULL); \ + for (o = (outer); o; --o) \ + for (i = (inner); i--; ) \ + func(buf1, (value)); \ + stop(#func, &start, correction); \ + } while (0) \ + + +int main(int argc, char **argv) { + unsigned long iterations = 1000, i; + struct timeval correction; + int ret = 0; + + srand(time(NULL)); + + if (argc > 1) + iterations = atoi(argv[1]); + + gettimeofday(&correction, NULL); + for (i = 25000 * iterations; i; --i) + rand_64(); + stop(NULL, &correction, NULL); + + + puts(">> Benchmarks:\n\tput_dec_full()"); + fflush(NULL); + + FUNC(orig_put_dec_full, iterations, 100000, NULL, "%05u", i); + FUNC(mod1_put_dec_full, iterations, 100000, NULL, "%05u", i); + FUNC(mod3_put_dec_full, iterations, 100000, NULL, "%05u", i); + FUNC(mod5_put_dec_full, iterations * 10, 10000, NULL, "%04u", i); + + puts("\tput_dec_trunc()"); + fflush(NULL); + + FUNC(orig_put_dec_trunc, iterations, 100000, NULL, "%u", i); + FUNC(mod1_put_dec_trunc, iterations, 100000, NULL, "%u", i); + FUNC(mod3_put_dec_trunc, iterations, 100000, NULL, "%u", i); + FUNC(mod5_put_dec_trunc, iterations * 10, 10000, NULL, "%u", i); + FUNC(mod3_put_dec_8bit, iterations * 500, 256, NULL, "%u", i); + + puts("\n\tput_dec()"); + fflush(NULL); + + FUNC(orig_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64()); + FUNC(mod1_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64()); + FUNC(mod2_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64()); + FUNC(mod3_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64()); + FUNC(mod4_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64()); + FUNC(mod5_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64()); + FUNC(mod6_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64()); + FUNC(mod7_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64()); + FUNC(mod8_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64()); + + + fflush(NULL); + puts("\n>> Size of the functions:"); + fflush(NULL); + + setenv("APP", *argv, 1); + puts("\tobjdump -t \"$APP\" | grep -F _put_dec | cut -f 2-"); + system("objdump -t \"$APP\" | grep -F _put_dec | cut -f 2-"); + + + return ret; +} -- 1.7.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 3/3] lib: vsprintf: added a put_dec() test and benchmark tool 2010-08-05 22:38 ` [PATCH 3/3] lib: vsprintf: added a put_dec() test and benchmark tool Michal Nazarewicz @ 2010-08-06 5:10 ` Denys Vlasenko 2010-08-06 8:34 ` Michał Nazarewicz 0 siblings, 1 reply; 14+ messages in thread From: Denys Vlasenko @ 2010-08-06 5:10 UTC (permalink / raw) To: Michal Nazarewicz Cc: linux-kernel, m.nazarewicz, Douglas W. Jones, Andrew Morton On Friday 06 August 2010 00:38, Michal Nazarewicz wrote: > This commit adds a test application for the put_dec() and > family of functions that are used by the previous two commits. > > Signed-off-by: Michal Nazarewicz <mina86@mina86.com> > +put-dec-test: put-dec-test.c > + exec $(CC) $(CFLAGS) -o $@ $< (1) Why exec? (2) Add -Wall, you'd be surprised > +static uint64_t rand_64(void) > +{ > + uint64_t v = 0, m = 1; > + for (;;) { > + uint64_t n; > + v = m * rand(); > + n = m + m * RAND_MAX; > + if (n < m) > + break; > + m = n; > + } > + return v; > +} What this function do? Looks cryptic. In my testing, it picks 0 quite often. > +static char buf1[24]; Can you size the array safely, without assuming that long long is no wider than 64 bits? > +#define FUNC(func, outer, inner, correction, format, value) do { \ > + struct timeval start; \ > + unsigned i, o; \ > + for (i = (inner); i--; ) { \ > + typeof(value) v = (value); \ > + ret |= test(#func, func(buf1, v), format, v); \ I'd add memset(buf1, 77, sizeof(buf1)) before test > +int main(int argc, char **argv) { > + unsigned long iterations = 1000, i; > + struct timeval correction; > + int ret = 0; > + > + srand(time(NULL)); > + > + if (argc > 1) > + iterations = atoi(argv[1]); > + > + gettimeofday(&correction, NULL); > + for (i = 25000 * iterations; i; --i) > + rand_64(); > + stop(NULL, &correction, NULL); Why is this "correction" thing needed? I looked at the entire machinery and I see no reason to have it. > + puts(">> Benchmarks:\n\tput_dec_full()"); > + fflush(NULL); > + > + FUNC(orig_put_dec_full, iterations, 100000, NULL, "%05u", i); You have variable named i, you pass it as macro parameter, but macro has local variable named i too. Is it an International Obfuscated C Code Contest entry? > + FUNC(mod1_put_dec_full, iterations, 100000, NULL, "%05u", i); > + FUNC(mod3_put_dec_full, iterations, 100000, NULL, "%05u", i); > + FUNC(mod5_put_dec_full, iterations * 10, 10000, NULL, "%04u", i); > + puts("\tput_dec_trunc()"); > + fflush(NULL); > + > + FUNC(orig_put_dec_trunc, iterations, 100000, NULL, "%u", i); > + FUNC(mod1_put_dec_trunc, iterations, 100000, NULL, "%u", i); > + FUNC(mod3_put_dec_trunc, iterations, 100000, NULL, "%u", i); > + FUNC(mod5_put_dec_trunc, iterations * 10, 10000, NULL, "%u", i); > + FUNC(mod3_put_dec_8bit, iterations * 500, 256, NULL, "%u", i); > + puts("\n\tput_dec()"); > + fflush(NULL); > + > + FUNC(orig_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64()); "%llu" fmt is for unsigned long long, not uint64_t. > + FUNC(mod1_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64()); > + FUNC(mod2_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64()); > + FUNC(mod3_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64()); > + FUNC(mod4_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64()); > + FUNC(mod5_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64()); > + FUNC(mod6_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64()); > + FUNC(mod7_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64()); > + FUNC(mod8_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64()); Here a lot of CPU time is taken by rand() calls. Also, you use different values for different functions, which is wrong. -- vda ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 3/3] lib: vsprintf: added a put_dec() test and benchmark tool 2010-08-06 5:10 ` Denys Vlasenko @ 2010-08-06 8:34 ` Michał Nazarewicz 2010-08-06 15:57 ` Raja R Harinath 2010-08-06 19:26 ` Denys Vlasenko 0 siblings, 2 replies; 14+ messages in thread From: Michał Nazarewicz @ 2010-08-06 8:34 UTC (permalink / raw) To: Michal Nazarewicz, Denys Vlasenko Cc: linux-kernel, Douglas W. Jones, Andrew Morton On Fri, 06 Aug 2010 07:10:06 +0200, Denys Vlasenko <vda.linux@googlemail.com> wrote: > On Friday 06 August 2010 00:38, Michal Nazarewicz wrote: >> This commit adds a test application for the put_dec() and >> family of functions that are used by the previous two commits. >> >> Signed-off-by: Michal Nazarewicz <mina86@mina86.com> > >> +put-dec-test: put-dec-test.c >> + exec $(CC) $(CFLAGS) -o $@ $< > > (1) Why exec? Micro Makefile optimisation -- saves us a fork(). I'll try to fix the benchmark over the weekend and will post updated version. Thanks for the comments. -- Best regards, _ _ | Humble Liege of Serenely Enlightened Majesty of o' \,=./ `o | Computer Science, Michał "mina86" Nazarewicz (o o) +----[mina86*mina86.com]---[mina86*jabber.org]----ooO--(_)--Ooo-- ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 3/3] lib: vsprintf: added a put_dec() test and benchmark tool 2010-08-06 8:34 ` Michał Nazarewicz @ 2010-08-06 15:57 ` Raja R Harinath 2010-08-06 19:26 ` Denys Vlasenko 1 sibling, 0 replies; 14+ messages in thread From: Raja R Harinath @ 2010-08-06 15:57 UTC (permalink / raw) To: linux-kernel Hi, Michał Nazarewicz <m.nazarewicz@samsung.com> writes: > On Fri, 06 Aug 2010 07:10:06 +0200, Denys Vlasenko <vda.linux@googlemail.com> wrote: > >> On Friday 06 August 2010 00:38, Michal Nazarewicz wrote: >>> This commit adds a test application for the put_dec() and >>> family of functions that are used by the previous two commits. >>> >>> Signed-off-by: Michal Nazarewicz <mina86@mina86.com> >> >>> +put-dec-test: put-dec-test.c >>> + exec $(CC) $(CFLAGS) -o $@ $< >> >> (1) Why exec? > > Micro Makefile optimisation -- saves us a fork(). I don't think it's worth it. The use of 'exec' will force make to invoke the shell, while without the exec, make has an opportunity to optimize out the shell invocation completely [1]. Forcing the use of a shell to avoid a fork() sounds like a terrible trade-off. - Hari [1] See, for instance, construct_command_argv_internal() in http://cvs.savannah.gnu.org/viewvc/make/job.c?root=make&view=markup ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 3/3] lib: vsprintf: added a put_dec() test and benchmark tool 2010-08-06 8:34 ` Michał Nazarewicz 2010-08-06 15:57 ` Raja R Harinath @ 2010-08-06 19:26 ` Denys Vlasenko 2010-08-06 20:58 ` Michal Nazarewicz 1 sibling, 1 reply; 14+ messages in thread From: Denys Vlasenko @ 2010-08-06 19:26 UTC (permalink / raw) To: Michał Nazarewicz Cc: Michal Nazarewicz, linux-kernel, Douglas W. Jones, Andrew Morton [-- Attachment #1: Type: text/plain, Size: 950 bytes --] On Friday 06 August 2010 10:34, Michał Nazarewicz wrote: > On Fri, 06 Aug 2010 07:10:06 +0200, Denys Vlasenko <vda.linux@googlemail.com> wrote: > > > On Friday 06 August 2010 00:38, Michal Nazarewicz wrote: > >> This commit adds a test application for the put_dec() and > >> family of functions that are used by the previous two commits. > >> > >> Signed-off-by: Michal Nazarewicz <mina86@mina86.com> > > > >> +put-dec-test: put-dec-test.c > >> + exec $(CC) $(CFLAGS) -o $@ $< > > > > (1) Why exec? > > Micro Makefile optimisation -- saves us a fork(). > > I'll try to fix the benchmark over the weekend and will post updated > version. Thanks for the comments. You might find some ideas in the attached file: * removed "correction" code * added verification of correctness for put_dec() * different rand64 (old one was giving same "random" number surprisingly often) * more robust coding in a few places -- vda [-- Attachment #2: put-dec-test.c --] [-- Type: text/x-csrc, Size: 8205 bytes --] #define _BSD_SOURCE #include <stdarg.h> #include <stdint.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/time.h> #include <time.h> # define do_div(n, base) ({ \ uint32_t __base = (base); \ uint32_t __rem = (n) % __base; \ (n) /= __base; \ __rem; \ }) /****************************** Original versian ******************************/ static char *orig_put_dec_trunc(char *buf, unsigned q) { unsigned d3, d2, d1, d0; d1 = (q>>4) & 0xf; d2 = (q>>8) & 0xf; d3 = (q>>12); d0 = 6*(d3 + d2 + d1) + (q & 0xf); q = (d0 * 0xcd) >> 11; d0 = d0 - 10*q; *buf++ = d0 + '0'; /* least significant digit */ d1 = q + 9*d3 + 5*d2 + d1; if (d1 != 0) { q = (d1 * 0xcd) >> 11; d1 = d1 - 10*q; *buf++ = d1 + '0'; /* next digit */ d2 = q + 2*d2; if ((d2 != 0) || (d3 != 0)) { q = (d2 * 0xd) >> 7; d2 = d2 - 10*q; *buf++ = d2 + '0'; /* next digit */ d3 = q + 4*d3; if (d3 != 0) { q = (d3 * 0xcd) >> 11; d3 = d3 - 10*q; *buf++ = d3 + '0'; /* next digit */ if (q != 0) *buf++ = q + '0'; /* most sign. digit */ } } } return buf; } /* Same with if's removed. Always emits five digits */ static char *orig_put_dec_full(char *buf, unsigned q) { /* BTW, if q is in [0,9999], 8-bit ints will be enough, */ /* but anyway, gcc produces better code with full-sized ints */ unsigned d3, d2, d1, d0; d1 = (q>>4) & 0xf; d2 = (q>>8) & 0xf; d3 = (q>>12); /* * Possible ways to approx. divide by 10 * gcc -O2 replaces multiply with shifts and adds * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386) * (x * 0x67) >> 10: 1100111 * (x * 0x34) >> 9: 110100 - same * (x * 0x1a) >> 8: 11010 - same * (x * 0x0d) >> 7: 1101 - same, shortest code (on i386) */ d0 = 6*(d3 + d2 + d1) + (q & 0xf); q = (d0 * 0xcd) >> 11; d0 = d0 - 10*q; *buf++ = d0 + '0'; d1 = q + 9*d3 + 5*d2 + d1; q = (d1 * 0xcd) >> 11; d1 = d1 - 10*q; *buf++ = d1 + '0'; d2 = q + 2*d2; q = (d2 * 0xd) >> 7; d2 = d2 - 10*q; *buf++ = d2 + '0'; d3 = q + 4*d3; q = (d3 * 0xcd) >> 11; /* - shorter code */ /* q = (d3 * 0x67) >> 10; - would also work */ d3 = d3 - 10*q; *buf++ = d3 + '0'; *buf++ = q + '0'; return buf; } static __attribute__((noinline)) char *orig_put_dec(char *buf, unsigned long long num) { while (1) { unsigned rem; if (num < 100000) return orig_put_dec_trunc(buf, num); rem = do_div(num, 100000); buf = orig_put_dec_full(buf, rem); } } /****************************** Modified versions ******************************/ /* * Decimal conversion is by far the most typical, and is used for * /proc and /sys data. This directly impacts e.g. top performance * with many processes running. * * We optimize it for speed using ideas described at * <http://www.cs.uiowa.edu/~jones/bcd/divide.html>. * * '(num * 0xcccd) >> 19' is an approximation of 'num / 10' that gives * correct results for num < 81920. Because of this, we check at the * beginning if we are dealing with a number that may cause trouble * and if so, we make it smaller. * * (As a minor note, all operands are always 16 bit so this function * should work well on hardware that cannot multiply 32 bit numbers). * * (Previous a code based on * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html> was used here, * with permission from the author, Douglas W. Jones.) * * Other, possible ways to approx. divide by 10 * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386) * (x * 0x67) >> 10: 1100111 * (x * 0x34) >> 9: 110100 - same * (x * 0x1a) >> 8: 11010 - same * (x * 0x0d) >> 7: 1101 - same, shortest code (on i386) */ static char *mod3_put_dec_full(char *buf, unsigned q) { unsigned r; char a = '0'; if (q > 0xffff) { a = '6'; q -= 60000; } r = (q * 0xcccd) >> 19; *buf++ = (q - 10 * r) + '0'; q = (r * 0x199a) >> 16; *buf++ = (r - 10 * q) + '0'; r = (q * 0xcd) >> 11; *buf++ = (q - 10 * r) + '0'; q = (r * 0xd) >> 7; *buf++ = (r - 10 * q) + '0'; *buf++ = q + a; return buf; } static char *mod3_put_dec_trunc(char *buf, unsigned q) { unsigned r; /* * We need to check if num is < 81920 so we might as well * check if we can just call the _full version of this * function. */ if (q > 9999) return mod3_put_dec_full(buf, q); r = (q * 0xcccd) >> 19; *buf++ = (q - 10 * r) + '0'; if (r) { q = (r * 0x199a) >> 16; *buf++ = (r - 10 * q) + '0'; if (q) { r = (q * 0xcd) >> 11; *buf++ = (q - 10 * r) + '0'; if (r) *buf++ = r + '0'; } } return buf; } static __attribute__((noinline)) char *mod3_put_dec(char *buf, unsigned long long num) { while (1) { unsigned rem; if (num < 100000) return mod3_put_dec_trunc(buf, num); rem = do_div(num, 100000); buf = mod3_put_dec_full(buf, rem); } } /****************************** Main ******************************/ static unsigned long long rand_64(void) { unsigned long long v = 0; int i; for (i = 0; i < sizeof(v); i++) { v += rand(); v <<= 9; } return v; } static char buf1[sizeof(long long)*3]; static void reverse_buf1(char *b) { char *a = buf1; *b-- = '\0'; while (a < b) { char tmp = *a; *a = *b; *b = tmp; ++a; --b; } } static int test(const char *name, char *b, char *fmt, ...) { static char buf2[sizeof(long long)*3]; va_list ap; reverse_buf1(b); va_start(ap, fmt); vsprintf(buf2, fmt, ap); va_end(ap); if (strcmp(buf1, buf2)) { fprintf(stderr, "%-20s: expecting '%s' got '%s'\n", name, buf2, buf1); return 1; } return 0; } static void stop(const char *name, struct timeval *start) { struct timeval stop; gettimeofday(&stop, NULL); stop.tv_sec -= start->tv_sec; if (stop.tv_usec < start->tv_usec) { --stop.tv_sec; stop.tv_usec += 1000000; } stop.tv_usec -= start->tv_usec; printf("%-20s: %3d.%06ds\n", name, (int)stop.tv_sec, (int)stop.tv_usec); fflush(NULL); } #define FUNC(func, outer, inner, format, value) do { \ struct timeval start; \ unsigned i, o; \ for (i = (inner); i--; ) { \ typeof(value) v = (value); \ memset(buf1, 77, sizeof(buf1)); \ ret |= test(#func, func(buf1, v), format, v); \ } \ \ gettimeofday(&start, NULL); \ for (o = (outer); o; --o) \ for (i = (inner); i--; ) \ func(buf1, (value)); \ stop(#func, &start); \ } while (0) \ int main(int argc, char **argv) { int ret = 0; unsigned long iterations; unsigned long long ll; unsigned long long range; srand(time(NULL)); iterations = 1000; if (argc > 1) iterations = atoi(argv[1]); printf("Benchmark with %ld iterations:\n", iterations*100000); printf("\tput_dec_full(99999..0)\n"); fflush(NULL); /* func, bench_mult x test_iter, correction, fmt, val */ FUNC(orig_put_dec_full, iterations, 100000, "%05u", i); FUNC(mod3_put_dec_full, iterations, 100000, "%05u", i); printf("\tput_dec_trunc(99999..0)\n"); fflush(NULL); FUNC(orig_put_dec_trunc, iterations, 100000, "%u", i); FUNC(mod3_put_dec_trunc, iterations, 100000, "%u", i); ll = rand_64(); printf("\tput_dec(%llu)\n", ll); fflush(NULL); FUNC(orig_put_dec, iterations, 100000, "%llu", ll); FUNC(mod3_put_dec, iterations, 100000, "%llu", ll); /* Test wide consecutive ranges at both ends of [0,max] interval */ range = 10*1000*1000; if (argc > 2) range = atoi(argv[2]); ll = -1LL - range; while (ll != range) { char buf[sizeof(long long)*3]; sprintf(buf, "%llu", ll); if (!(ll & 0xffff)) { printf("Testing correctness: %s%*s\r", buf, (int)(sizeof(long long) * 4), ""); fflush(NULL); } memset(buf1, 77, sizeof(buf1)); reverse_buf1(orig_put_dec(buf1, ll)); if (strcmp(buf, buf1) == 0) { memset(buf1, 77, sizeof(buf1)); reverse_buf1(mod3_put_dec(buf1, ll)); if (strcmp(buf, buf1) == 0) { ll++; continue; } } printf("error at %llu: bad:'%s' expected:'%s'\n", ll, buf1, buf); return 1; } printf("\n"); printf("\n>> Size of the functions:\n"); setenv("APP", *argv, 1); printf("\tobjdump -t \"$APP\" | grep -F _put_dec | cut -f 2-\n"); fflush(NULL); system("objdump -t \"$APP\" | grep -F _put_dec | cut -f 2-"); return ret; } ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 3/3] lib: vsprintf: added a put_dec() test and benchmark tool 2010-08-06 19:26 ` Denys Vlasenko @ 2010-08-06 20:58 ` Michal Nazarewicz 0 siblings, 0 replies; 14+ messages in thread From: Michal Nazarewicz @ 2010-08-06 20:58 UTC (permalink / raw) To: Denys Vlasenko Cc: Michał Nazarewicz, linux-kernel, Douglas W. Jones, Andrew Morton [-- Attachment #1: Type: text/plain, Size: 1858 bytes --] Denys Vlasenko <vda.linux@googlemail.com> writes: > On Friday 06 August 2010 10:34, Michał Nazarewicz wrote: >> On Fri, 06 Aug 2010 07:10:06 +0200, Denys Vlasenko <vda.linux@googlemail.com> wrote: >> >> > On Friday 06 August 2010 00:38, Michal Nazarewicz wrote: >> >> This commit adds a test application for the put_dec() and >> >> family of functions that are used by the previous two commits. >> >> >> >> Signed-off-by: Michal Nazarewicz <mina86@mina86.com> >> > >> >> +put-dec-test: put-dec-test.c >> >> + exec $(CC) $(CFLAGS) -o $@ $< >> > >> > (1) Why exec? >> >> Micro Makefile optimisation -- saves us a fork(). >> >> I'll try to fix the benchmark over the weekend and will post updated >> version. Thanks for the comments. > > You might find some ideas in the attached file: > * removed "correction" code > * added verification of correctness for put_dec() > * different rand64 > (old one was giving same "random" number surprisingly often) > * more robust coding in a few places Thanks! I actually changed the benchmark earlier today and redid all benchmarks but I'll sure incorporate you're test code. Also, I've removed rand_64() from my code in favour of reading random data from /dev/urandom. In consequence, all functions ale benchmarked using the same values and it's still random (ie. no the same value all the time). This also made "correction" no longer needed. It's 11pm here, so I'll try to send the new patches tomorrow morning after getting some sleep. Once again, thank you for all the comments and suggestiveness! -- Best regards, _ _ .o. | Liege of Serenly Enlightened Majesty of o' \,=./ `o ..o | Computer Science, Michal "mina86" Nazarewicz (o o) ooo +--<mina86-tlen.pl>--<jid:mina86-jabber.org>--ooO--(_)--Ooo-- [-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines 2010-08-05 22:38 ` [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Michal Nazarewicz 2010-08-05 22:38 ` [PATCH 3/3] lib: vsprintf: added a put_dec() test and benchmark tool Michal Nazarewicz @ 2010-08-06 5:18 ` Denys Vlasenko 2010-08-06 7:08 ` Michal Nazarewicz 1 sibling, 1 reply; 14+ messages in thread From: Denys Vlasenko @ 2010-08-06 5:18 UTC (permalink / raw) To: Michal Nazarewicz Cc: linux-kernel, m.nazarewicz, Douglas W. Jones, Andrew Morton On Friday 06 August 2010 00:38, Michal Nazarewicz wrote: > The disadvantage is that the proposed function is 2.5-3.5 bigger. > Those are not big functions though -- we are talking here about > proposed function being below 512. It's a slippery slope. Here's where it ends: glibc has memcpy() function which is "only" 8k of code or so. I'm not joking. > +#if BITS_PER_LONG == 64 > + ... > +#else ... > +/* > + * Based on code by Douglas W. Jones found at > + * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html#sixtyfour>. This > + * performs no 64-bit division and hence should be faster on 32-bit > + * machines then the version of the function above. > + */ > +static noinline_for_stack > +char *put_dec(char *buf, unsigned long long n) > +{ > + uint32_t d3, d2, d1, q; > + > + if (!n) { > + *buf++ = '0'; > + return buf; > + } > + > + d1 = (n >> 16) & 0xFFFF; > + d2 = (n >> 32) & 0xFFFF; > + d3 = (n >> 48) & 0xFFFF; Are you assuming that sizeof(long long) == 8, always? -- vda ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines 2010-08-06 5:18 ` [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Denys Vlasenko @ 2010-08-06 7:08 ` Michal Nazarewicz 2010-08-06 7:35 ` Denys Vlasenko 0 siblings, 1 reply; 14+ messages in thread From: Michal Nazarewicz @ 2010-08-06 7:08 UTC (permalink / raw) To: Denys Vlasenko Cc: linux-kernel, m.nazarewicz, Douglas W. Jones, Andrew Morton Denys Vlasenko <vda.linux@googlemail.com> writes: > On Friday 06 August 2010 00:38, Michal Nazarewicz wrote: >> The disadvantage is that the proposed function is 2.5-3.5 bigger. >> Those are not big functions though -- we are talking here about >> proposed function being below 512. > It's a slippery slope. Here's where it ends: glibc > has memcpy() function which is "only" 8k of code or so. > I'm not joking. I'm aware of that. I assume that someone more clever then me will decide whether to accept this patch or not. (Also we win a few bytes on put_dec_full() and put_dec_8bit()). :P >> +#if BITS_PER_LONG == 64 >> + > ... >> +#else > ... >> +/* >> + * Based on code by Douglas W. Jones found at >> + * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html#sixtyfour>. This >> + * performs no 64-bit division and hence should be faster on 32-bit >> + * machines then the version of the function above. >> + */ >> +static noinline_for_stack >> +char *put_dec(char *buf, unsigned long long n) >> +{ >> + uint32_t d3, d2, d1, q; >> + >> + if (!n) { >> + *buf++ = '0'; >> + return buf; >> + } >> + >> + d1 = (n >> 16) & 0xFFFF; >> + d2 = (n >> 32) & 0xFFFF; >> + d3 = (n >> 48) & 0xFFFF; > > Are you assuming that sizeof(long long) == 8, always? Well... yes. C requires long long to be at least 64-bit and I don't see it being larger in any foreseeable feature. Wouldn't it be enough to put a static assert here? -- Best regards, _ _ .o. | Liege of Serenly Enlightened Majesty of o' \,=./ `o ..o | Computer Science, Michal "mina86" Nazarewicz (o o) ooo +--<mina86-tlen.pl>--<jid:mina86-jabber.org>--ooO--(_)--Ooo-- ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines 2010-08-06 7:08 ` Michal Nazarewicz @ 2010-08-06 7:35 ` Denys Vlasenko 2010-08-06 8:54 ` Michał Nazarewicz 0 siblings, 1 reply; 14+ messages in thread From: Denys Vlasenko @ 2010-08-06 7:35 UTC (permalink / raw) To: Michal Nazarewicz Cc: linux-kernel, m.nazarewicz, Douglas W. Jones, Andrew Morton On Friday 06 August 2010 09:08, Michal Nazarewicz wrote: > >> +static noinline_for_stack > >> +char *put_dec(char *buf, unsigned long long n) > >> +{ > >> + uint32_t d3, d2, d1, q; > >> + > >> + if (!n) { > >> + *buf++ = '0'; > >> + return buf; > >> + } > >> + > >> + d1 = (n >> 16) & 0xFFFF; > >> + d2 = (n >> 32) & 0xFFFF; > >> + d3 = (n >> 48) & 0xFFFF; > > > > Are you assuming that sizeof(long long) == 8, always? > > Well... yes. C requires long long to be at least 64-bit and I don't > see it being larger in any foreseeable feature. "640k will be enough for everybody"? > Wouldn't it be enough to put a static assert here? I'd prefer the code which works with arbitrarily wide long long. If needed, use if (sizeof(long long) == 8) 64-bit code else generic code -- vda ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines 2010-08-06 7:35 ` Denys Vlasenko @ 2010-08-06 8:54 ` Michał Nazarewicz 0 siblings, 0 replies; 14+ messages in thread From: Michał Nazarewicz @ 2010-08-06 8:54 UTC (permalink / raw) To: Michal Nazarewicz, Denys Vlasenko Cc: linux-kernel, Douglas W. Jones, Andrew Morton On Fri, 06 Aug 2010 09:35:26 +0200, Denys Vlasenko <vda.linux@googlemail.com> wrote: > On Friday 06 August 2010 09:08, Michal Nazarewicz wrote: >> >> +static noinline_for_stack >> >> +char *put_dec(char *buf, unsigned long long n) >> >> +{ >> >> + uint32_t d3, d2, d1, q; >> >> + >> >> + if (!n) { >> >> + *buf++ = '0'; >> >> + return buf; >> >> + } >> >> + >> >> + d1 = (n >> 16) & 0xFFFF; >> >> + d2 = (n >> 32) & 0xFFFF; >> >> + d3 = (n >> 48) & 0xFFFF; >> > >> > Are you assuming that sizeof(long long) == 8, always? >> >> Well... yes. C requires long long to be at least 64-bit and I don't >> see it being larger in any foreseeable feature. > > "640k will be enough for everybody"? > >> Wouldn't it be enough to put a static assert here? > > I'd prefer the code which works with arbitrarily wide long long. > If needed, use > > if (sizeof(long long) == 8) > 64-bit code > else > generic code Thanks. That seems like a perfect solution. I rearrange the code and try to post updated version after the weekend. -- Best regards, _ _ | Humble Liege of Serenely Enlightened Majesty of o' \,=./ `o | Computer Science, Michał "mina86" Nazarewicz (o o) +----[mina86*mina86.com]---[mina86*jabber.org]----ooO--(_)--Ooo-- ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full() 2010-08-05 22:38 [PATCH 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full() Michal Nazarewicz 2010-08-05 22:38 ` [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Michal Nazarewicz @ 2010-08-06 3:58 ` Denys Vlasenko 2010-08-06 6:56 ` Michal Nazarewicz 1 sibling, 1 reply; 14+ messages in thread From: Denys Vlasenko @ 2010-08-06 3:58 UTC (permalink / raw) To: Michal Nazarewicz Cc: linux-kernel, m.nazarewicz, Douglas W. Jones, Andrew Morton On Friday 06 August 2010 00:38, Michal Nazarewicz wrote: > The put_dec_trunc() and put_dec_full() functions were based on > a code optimised for processors with 8-bit ALU but even then > they failed to satisfy the same constraints "Failed"? Interesting wording. Yes, the code won't map easily onto 8-bit ALU, for the simple reason Linux kernel does not support any 8-bit CPUs, and by going to wider register I was able to process 5 decimal digits at once, not 4. It was done deliberately. It is not a "failure". Your code isn't 8-bit ALU optimized either. Do you think a bit of smear of previous code would help your to be accepted? > and in fact > required at least 16-bit ALU (because at least one number they > operate in can take 9 bits). Yes, as explained above. > This version of those functions proposed by this patch goes > further and uses the full capacity of a 32-bit ALU and instead > of splitting the number into nibbles and operating on them it > performs the obvious algorithm for base conversion expect it > uses optimised code for dividing by ten (ie. no division is > actually performed). (1) "expect" is a typo (2) No, _this_ patch does not eliminate division. Next one does. Move this part of changelong to the next patch, where it belongs. > + * Decimal conversion is by far the most typical, and is used for > + * /proc and /sys data. This directly impacts e.g. top performance > + * with many processes running. > + * > + * We optimize it for speed using ideas described at > + * <http://www.cs.uiowa.edu/~jones/bcd/divide.html>. Do you have author's permission to do it? Document it in the comment please. > + * '(num * 0xcccd) >> 19' is an approximation of 'num / 10' that gives > + * correct results for num < 81920. Because of this, we check at the > + * beginning if we are dealing with a number that may cause trouble > + * and if so, we make it smaller. This comment needs to be moved to the code line where the opration is performed. > + * (As a minor note, all operands are always 16 bit so this function > + * should work well on hardware that cannot multiply 32 bit numbers). > + * > + * (Previous a code based on English is a bit broken in the line above. -- vda ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full() 2010-08-06 3:58 ` [PATCH 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full() Denys Vlasenko @ 2010-08-06 6:56 ` Michal Nazarewicz 0 siblings, 0 replies; 14+ messages in thread From: Michal Nazarewicz @ 2010-08-06 6:56 UTC (permalink / raw) To: Denys Vlasenko Cc: linux-kernel, m.nazarewicz, Douglas W. Jones, Andrew Morton Denys Vlasenko <vda.linux@googlemail.com> writes: > On Friday 06 August 2010 00:38, Michal Nazarewicz wrote: >> The put_dec_trunc() and put_dec_full() functions were based on >> a code optimised for processors with 8-bit ALU but even then >> they failed to satisfy the same constraints > > "Failed"? Interesting wording. Yes, the code won't map easily > onto 8-bit ALU, for the simple reason Linux kernel > does not support any 8-bit CPUs, and by going to wider register > I was able to process 5 decimal digits at once, not 4. > It was done deliberately. It is not a "failure". You're probably right. ;) I'll change the wording in the next patch. > Your code isn't 8-bit ALU optimized either. Yep, that was not my intention. > Do you think a bit of smear of previous code > would help your to be accepted? I'm sorry if I sounded offencive, I just wanted to point out that we don't have to and we in fact do not limit ourselves to 8-bit ALUs. >> and in fact >> required at least 16-bit ALU (because at least one number they >> operate in can take 9 bits). > > Yes, as explained above. > >> This version of those functions proposed by this patch goes >> further and uses the full capacity of a 32-bit ALU and instead >> of splitting the number into nibbles and operating on them it >> performs the obvious algorithm for base conversion expect it >> uses optimised code for dividing by ten (ie. no division is >> actually performed). > > (1) "expect" is a typo Fixed. > (2) No, _this_ patch does not eliminate division. Next one does. > Move this part of changelong to the next patch, where it belongs. Probably right, the part in parens can be misleading. Will fix. >> + * Decimal conversion is by far the most typical, and is used for >> + * /proc and /sys data. This directly impacts e.g. top performance >> + * with many processes running. >> + * >> + * We optimize it for speed using ideas described at >> + * <http://www.cs.uiowa.edu/~jones/bcd/divide.html>. > > Do you have author's permission to do it? > Document it in the comment please. OK, I'll send an email to Mr. Jones directly (unless he's actually reading it, in which case, could you please give permission to use the code?) In case of this patch I assumed the right to use because: (i) we had permission for the previous code and (ii) I've used only one line of code that is from the "Binary to Decimal Conversion in Limited Precision" website (namely division). In case of the next patch, only (i) applies. >> + * '(num * 0xcccd) >> 19' is an approximation of 'num / 10' that gives >> + * correct results for num < 81920. Because of this, we check at the >> + * beginning if we are dealing with a number that may cause trouble >> + * and if so, we make it smaller. > > This comment needs to be moved to the code line where the opration > is performed. Fixed. >> + * (As a minor note, all operands are always 16 bit so this function >> + * should work well on hardware that cannot multiply 32 bit numbers). >> + * >> + * (Previous a code based on > > English is a bit broken in the line above. Fixed. -- Best regards, _ _ .o. | Liege of Serenly Enlightened Majesty of o' \,=./ `o ..o | Computer Science, Michal "mina86" Nazarewicz (o o) ooo +--<mina86-tlen.pl>--<jid:mina86-jabber.org>--ooO--(_)--Ooo-- ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2010-08-06 20:59 UTC | newest] Thread overview: 14+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-08-05 22:38 [PATCH 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full() Michal Nazarewicz 2010-08-05 22:38 ` [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Michal Nazarewicz 2010-08-05 22:38 ` [PATCH 3/3] lib: vsprintf: added a put_dec() test and benchmark tool Michal Nazarewicz 2010-08-06 5:10 ` Denys Vlasenko 2010-08-06 8:34 ` Michał Nazarewicz 2010-08-06 15:57 ` Raja R Harinath 2010-08-06 19:26 ` Denys Vlasenko 2010-08-06 20:58 ` Michal Nazarewicz 2010-08-06 5:18 ` [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Denys Vlasenko 2010-08-06 7:08 ` Michal Nazarewicz 2010-08-06 7:35 ` Denys Vlasenko 2010-08-06 8:54 ` Michał Nazarewicz 2010-08-06 3:58 ` [PATCH 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full() Denys Vlasenko 2010-08-06 6:56 ` Michal Nazarewicz
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox