From: Denys Vlasenko <vda.linux@googlemail.com>
To: Michal Nazarewicz <mina86@mina86.com>
Cc: linux-kernel@vger.kernel.org, m.nazarewicz@samsung.com,
"Douglas W. Jones" <jones@cs.uiowa.edu>,
Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCHv2 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines
Date: Tue, 10 Aug 2010 06:15:52 +0200 [thread overview]
Message-ID: <201008100615.52510.vda.linux@googlemail.com> (raw)
In-Reply-To: <419733287a42072a658f0f74d0b1901132178b75.1281295424.git.mina86@mina86.com>
On Sunday 08 August 2010 21:29, Michal Nazarewicz wrote:
> Compared to previous version: the code is used only if:
> 1. if long long is 64-bit (ie. ULLONG_MAX == 2**64-1), and
> 2. user did not select optimisation for size with Kconfig.
I measured the size and it does not seem to make sense
to exclude it on -Os. On x86:
put_dec_full change: 0x93 -> 0x47 bytes
put_dec change: 0x12c -> 0x137 bytes
IOW, there is net code size reduction (compared to current kernel,
it may be a slight growth compared to patch 1).
So, please use the optimized code even for CONFIG_CC_OPTIMIZE_FOR_SIZE.
> Here are the results (normalised to the fastest/smallest):
> : ARM Atom
> -- Speed ----------------------------------
> orig_put_dec : 9.333822 2.083110 Original
> mod1_put_dec : 9.282045 1.904564
> mod2_put_dec : 9.260409 1.910302
> mod3_put_dec : 9.320053 1.905689 Proposed by previous patch
> mod4_put_dec : 9.297146 1.933971
> mod5_put_dec : 13.034318 2.434942
> mod6_put_dec : 1.000000 1.000000 Proposed by this patch
> mod7_put_dec : 1.009574 1.014147
> mod8_put_dec : 7.226004 1.953460
> -- Size -----------------------------------
> orig_put_dec : 1.000000 1.000000 Original
> mod1_put_dec : 1.000000 1.000000
> mod2_put_dec : 1.361111 1.403226
> mod3_put_dec : 1.000000 1.000000 Proposed by previous patch
> mod4_put_dec : 1.361111 1.403226
> mod5_put_dec : 1.000000 1.000000
> mod6_put_dec : 2.555556 3.508065 Proposed by this patch
> mod7_put_dec : 2.833333 3.911290
> mod8_put_dec : 2.027778 2.258065
I believe these are old results? Size growth is just too big.
> As it can be obsevred, proposed version of the put_dec function is
> twice as fast as the original version on Atom and almost 10 times
> faster on ARM. I imagine that it may be similar on other "embedded"
> processors.
>
> This may be skewed by the fact that the benchmark is using GCC's
> 64-bit division operator instead of kernel's do_div but it would
> appear that by avoiding 64-bit division something can be gained.
Re speed: on Phenom II in 32-bit mode, I see ~x3.3 speedup
on conversions involving large integers (might be skewed
by gcc's full-blown 64-bit division in "old" code - kernel's
div is smarter).
> PS. From Mr. Jones site: "Nonetheless, before relying on the material
> here, it would be prudent to check the arithmetic!" hence I checked
> all the calculations myself and everything seemed fine. I've also run
> test applitacion several times so it tested a few 64-bit numbers..."
I tested [0, 100 million] and [2^64-100 million, 2^64-1] ranges.
No errors.
> +#if BITS_PER_LONG != 32 || defined CONFIG_CC_OPTIMIZE_FOR_SIZE || \
> + ULLONG_MAX != 18446744073709551615ULL
I think it's better to say "if BITS_PER_LONG > 32 and ULLONG_MAX > 2^64-1",
since it expresses your intent better. Also, add comments explaining
what case you optimize for:
#if BITS_PER_LONG > 32 || ULLONG_MAX > 18446744073709551615ULL
/* Generic code */
...
#else /* BITS_PER_LONG <= 32 && ULLONG_MAX <= 2^64-1 */
/* Optimized code for arches with 64-bit long longs */
...
> +static noinline_for_stack
> +char *put_dec(char *buf, unsigned long long n)
> +{
> + uint32_t d3, d2, d1, q;
> +
> + if (!n) {
> + *buf++ = '0';
> + return buf;
> + }
You may as well use the above shortcut for n <= 9, not only for 0.
> + buf = put_dec_full4(buf, q % 10000);
> + q = q / 10000;
> +
> + d1 = q + 7671 * d3 + 9496 * d2 + 6 * d1;
> + buf = put_dec_full4(buf, d1 % 10000);
> + q = d1 / 10000;
I experimented with moving division up, before put_dec_full4:
q = d1 / 10000;
buf = put_dec_full4(buf, d1 % 10000);
but gcc appears to be smart emough to do this transformation
itself. But you may still do it for older (dumber) gcc's.
--
vda
next prev parent reply other threads:[~2010-08-10 4:16 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-08-08 19:29 [PATCHv2 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full() Michal Nazarewicz
2010-08-08 19:29 ` [PATCHv2 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Michal Nazarewicz
2010-08-08 19:29 ` [PATCHv2 3/3] lib: vsprintf: added a put_dec() test and benchmark tool Michal Nazarewicz
2010-08-10 4:15 ` Denys Vlasenko [this message]
2010-08-10 7:42 ` [PATCHv2 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Michał Nazarewicz
2010-08-10 16:10 ` Denys Vlasenko
2010-08-10 3:17 ` [PATCHv2 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full() Denys Vlasenko
2010-08-10 7:39 ` Michał Nazarewicz
2010-08-10 16:08 ` Denys Vlasenko
2010-08-10 22:42 ` Michal Nazarewicz
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=201008100615.52510.vda.linux@googlemail.com \
--to=vda.linux@googlemail.com \
--cc=akpm@linux-foundation.org \
--cc=jones@cs.uiowa.edu \
--cc=linux-kernel@vger.kernel.org \
--cc=m.nazarewicz@samsung.com \
--cc=mina86@mina86.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.