All of lore.kernel.org
 help / color / mirror / Atom feed
From: Denys Vlasenko <vda.linux@googlemail.com>
To: "Michał Nazarewicz" <m.nazarewicz@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>,
	linux-kernel@vger.kernel.org,
	"Douglas W. Jones" <jones@cs.uiowa.edu>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH 3/3] lib: vsprintf: added a put_dec() test and benchmark tool
Date: Fri, 6 Aug 2010 21:26:10 +0200	[thread overview]
Message-ID: <201008062126.10789.vda.linux@googlemail.com> (raw)
In-Reply-To: <op.vg0csju97p4s8u@pikus>

[-- Attachment #1: Type: text/plain, Size: 950 bytes --]

On Friday 06 August 2010 10:34, Michał Nazarewicz wrote:
> On Fri, 06 Aug 2010 07:10:06 +0200, Denys Vlasenko <vda.linux@googlemail.com> wrote:
> 
> > On Friday 06 August 2010 00:38, Michal Nazarewicz wrote:
> >> This commit adds a test application for the put_dec() and
> >> family of functions that are used by the previous two commits.
> >>
> >> Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
> >
> >> +put-dec-test: put-dec-test.c
> >> +	exec $(CC) $(CFLAGS) -o $@ $<
> >
> > (1) Why exec?
> 
> Micro Makefile optimisation -- saves us a fork().
> 
> I'll try to fix the benchmark over the weekend and will post updated
> version.  Thanks for the comments.

You might find some ideas in the attached file:
* removed "correction" code
* added verification of correctness for put_dec()
* different rand64
  (old one was giving same "random" number surprisingly often)
* more robust coding in a few places

-- 
vda

[-- Attachment #2: put-dec-test.c --]
[-- Type: text/x-csrc, Size: 8205 bytes --]

#define _BSD_SOURCE

#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>


#  define do_div(n, base) ({			\
		uint32_t __base = (base);	\
		uint32_t __rem = (n) % __base;	\
		(n) /= __base;			\
		__rem;				\
	})


/****************************** Original versian ******************************/

static char *orig_put_dec_trunc(char *buf, unsigned q)
{
	unsigned d3, d2, d1, d0;
	d1 = (q>>4) & 0xf;
	d2 = (q>>8) & 0xf;
	d3 = (q>>12);

	d0 = 6*(d3 + d2 + d1) + (q & 0xf);
	q = (d0 * 0xcd) >> 11;
	d0 = d0 - 10*q;
	*buf++ = d0 + '0'; /* least significant digit */
	d1 = q + 9*d3 + 5*d2 + d1;
	if (d1 != 0) {
		q = (d1 * 0xcd) >> 11;
		d1 = d1 - 10*q;
		*buf++ = d1 + '0'; /* next digit */

		d2 = q + 2*d2;
		if ((d2 != 0) || (d3 != 0)) {
			q = (d2 * 0xd) >> 7;
			d2 = d2 - 10*q;
			*buf++ = d2 + '0'; /* next digit */

			d3 = q + 4*d3;
			if (d3 != 0) {
				q = (d3 * 0xcd) >> 11;
				d3 = d3 - 10*q;
				*buf++ = d3 + '0';  /* next digit */
				if (q != 0)
					*buf++ = q + '0'; /* most sign. digit */
			}
		}
	}

	return buf;
}
/* Same with if's removed. Always emits five digits */
static char *orig_put_dec_full(char *buf, unsigned q)
{
	/* BTW, if q is in [0,9999], 8-bit ints will be enough, */
	/* but anyway, gcc produces better code with full-sized ints */
	unsigned d3, d2, d1, d0;
	d1 = (q>>4) & 0xf;
	d2 = (q>>8) & 0xf;
	d3 = (q>>12);

	/*
	 * Possible ways to approx. divide by 10
	 * gcc -O2 replaces multiply with shifts and adds
	 * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386)
	 * (x * 0x67) >> 10:  1100111
	 * (x * 0x34) >> 9:    110100 - same
	 * (x * 0x1a) >> 8:     11010 - same
	 * (x * 0x0d) >> 7:      1101 - same, shortest code (on i386)
	 */
	d0 = 6*(d3 + d2 + d1) + (q & 0xf);
	q = (d0 * 0xcd) >> 11;
	d0 = d0 - 10*q;
	*buf++ = d0 + '0';
	d1 = q + 9*d3 + 5*d2 + d1;
		q = (d1 * 0xcd) >> 11;
		d1 = d1 - 10*q;
		*buf++ = d1 + '0';

		d2 = q + 2*d2;
			q = (d2 * 0xd) >> 7;
			d2 = d2 - 10*q;
			*buf++ = d2 + '0';

			d3 = q + 4*d3;
				q = (d3 * 0xcd) >> 11; /* - shorter code */
				/* q = (d3 * 0x67) >> 10; - would also work */
				d3 = d3 - 10*q;
				*buf++ = d3 + '0';
					*buf++ = q + '0';

	return buf;
}

static __attribute__((noinline))
char *orig_put_dec(char *buf, unsigned long long num)
{
	while (1) {
		unsigned rem;
		if (num < 100000)
			return orig_put_dec_trunc(buf, num);
		rem = do_div(num, 100000);
		buf = orig_put_dec_full(buf, rem);
	}
}



/****************************** Modified versions ******************************/

/*
 * Decimal conversion is by far the most typical, and is used for
 * /proc and /sys data. This directly impacts e.g. top performance
 * with many processes running.
 *
 * We optimize it for speed using ideas described at
 * <http://www.cs.uiowa.edu/~jones/bcd/divide.html>.
 *
 * '(num * 0xcccd) >> 19' is an approximation of 'num / 10' that gives
 * correct results for num < 81920.  Because of this, we check at the
 * beginning if we are dealing with a number that may cause trouble
 * and if so, we make it smaller.
 *
 * (As a minor note, all operands are always 16 bit so this function
 * should work well on hardware that cannot multiply 32 bit numbers).
 *
 * (Previous a code based on
 * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html> was used here,
 * with permission from the author, Douglas W. Jones.)
 *
 * Other, possible ways to approx. divide by 10
 * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386)
 * (x * 0x67) >> 10:  1100111
 * (x * 0x34) >> 9:    110100 - same
 * (x * 0x1a) >> 8:     11010 - same
 * (x * 0x0d) >> 7:      1101 - same, shortest code (on i386)
 */
static char *mod3_put_dec_full(char *buf, unsigned q)
{
	unsigned r;
	char a = '0';

	if (q > 0xffff) {
		a = '6';
		q -= 60000;
	}

	r   = (q * 0xcccd) >> 19;
	*buf++ = (q - 10 * r) + '0';

	q   = (r * 0x199a) >> 16;
	*buf++ = (r - 10 * q)  + '0';

	r   = (q * 0xcd) >> 11;
	*buf++ = (q - 10 * r)  + '0';

	q   = (r * 0xd) >> 7;
	*buf++ = (r - 10 * q) + '0';

	*buf++ = q + a;

	return buf;
}

static char *mod3_put_dec_trunc(char *buf, unsigned q)
{
	unsigned r;

	/*
	 * We need to check if num is < 81920 so we might as well
	 * check if we can just call the _full version of this
	 * function.
	 */
	if (q > 9999)
		return mod3_put_dec_full(buf, q);

	r   = (q * 0xcccd) >> 19;
	*buf++ = (q - 10 * r) + '0';

	if (r) {
		q   = (r * 0x199a) >> 16;
		*buf++ = (r - 10 * q)  + '0';

		if (q) {
			r   = (q * 0xcd) >> 11;
			*buf++ = (q - 10 * r)  + '0';

			if (r)
				*buf++ = r + '0';
		}
	}

	return buf;
}

static __attribute__((noinline))
char *mod3_put_dec(char *buf, unsigned long long num)
{
	while (1) {
		unsigned rem;
		if (num < 100000)
			return mod3_put_dec_trunc(buf, num);
		rem = do_div(num, 100000);
		buf = mod3_put_dec_full(buf, rem);
	}
}

/****************************** Main ******************************/

static unsigned long long rand_64(void)
{
	unsigned long long v = 0;
	int i;

	for (i = 0; i < sizeof(v); i++) {
		v += rand();
		v <<= 9;
	}
	return v;
}

static char buf1[sizeof(long long)*3];

static void reverse_buf1(char *b)
{
	char *a = buf1;
	*b-- = '\0';
	while (a < b) {
		char tmp = *a;
		*a = *b;
		*b = tmp;
		++a;
		--b;
	}
}

static int test(const char *name, char *b, char *fmt, ...)
{
	static char buf2[sizeof(long long)*3];
	va_list ap;

	reverse_buf1(b);

	va_start(ap, fmt);
	vsprintf(buf2, fmt, ap);
	va_end(ap);

	if (strcmp(buf1, buf2)) {
		fprintf(stderr, "%-20s: expecting '%s' got '%s'\n",
			name, buf2, buf1);
		return 1;
	}
	return 0;
}

static void stop(const char *name, struct timeval *start)
{
	struct timeval stop;

	gettimeofday(&stop, NULL);

	stop.tv_sec -= start->tv_sec;
	if (stop.tv_usec < start->tv_usec) {
		--stop.tv_sec;
		stop.tv_usec += 1000000;
	}
	stop.tv_usec -= start->tv_usec;

	printf("%-20s: %3d.%06ds\n", name, (int)stop.tv_sec, (int)stop.tv_usec);
	fflush(NULL);
}

#define FUNC(func, outer, inner, format, value) do {	\
		struct timeval start;					\
		unsigned i, o;						\
		for (i = (inner); i--; ) {				\
			typeof(value) v = (value);			\
			memset(buf1, 77, sizeof(buf1));			\
			ret |= test(#func, func(buf1, v), format, v);	\
		}							\
									\
		gettimeofday(&start, NULL);				\
		for (o = (outer); o; --o)				\
			for (i = (inner); i--; )			\
				func(buf1, (value));			\
		stop(#func, &start);					\
	} while (0)							\


int main(int argc, char **argv)
{
	int ret = 0;
	unsigned long iterations;
	unsigned long long ll;
	unsigned long long range;

	srand(time(NULL));

	iterations = 1000;
	if (argc > 1)
		iterations = atoi(argv[1]);

	printf("Benchmark with %ld iterations:\n", iterations*100000);
	printf("\tput_dec_full(99999..0)\n");
	fflush(NULL);
	/* func, bench_mult x test_iter, correction, fmt, val */
	FUNC(orig_put_dec_full, iterations, 100000, "%05u", i);
	FUNC(mod3_put_dec_full, iterations, 100000, "%05u", i);

	printf("\tput_dec_trunc(99999..0)\n");
	fflush(NULL);
	FUNC(orig_put_dec_trunc, iterations, 100000, "%u", i);
	FUNC(mod3_put_dec_trunc, iterations, 100000, "%u", i);

	ll = rand_64();
	printf("\tput_dec(%llu)\n", ll);
	fflush(NULL);
	FUNC(orig_put_dec, iterations, 100000, "%llu", ll);
	FUNC(mod3_put_dec, iterations, 100000, "%llu", ll);

	/* Test wide consecutive ranges at both ends of [0,max] interval */
	range = 10*1000*1000;
	if (argc > 2)
		range = atoi(argv[2]);
	ll = -1LL - range;
	while (ll != range) {
		char buf[sizeof(long long)*3];

		sprintf(buf, "%llu", ll);
		if (!(ll & 0xffff)) {
			printf("Testing correctness: %s%*s\r", buf, (int)(sizeof(long long) * 4), "");
			fflush(NULL);
		}
		memset(buf1, 77, sizeof(buf1));
		reverse_buf1(orig_put_dec(buf1, ll));
		if (strcmp(buf, buf1) == 0) {
			memset(buf1, 77, sizeof(buf1));
			reverse_buf1(mod3_put_dec(buf1, ll));
			if (strcmp(buf, buf1) == 0) {
				ll++;
				continue;
			}
		}
		printf("error at %llu: bad:'%s' expected:'%s'\n", ll, buf1, buf);
		return 1;
	}
	printf("\n");

	printf("\n>> Size of the functions:\n");
	setenv("APP", *argv, 1);
	printf("\tobjdump -t \"$APP\" | grep -F _put_dec | cut -f 2-\n");
	fflush(NULL);
	system("objdump -t \"$APP\" | grep -F _put_dec | cut -f 2-");

	return ret;
}

  parent reply	other threads:[~2010-08-06 19:26 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-08-05 22:38 [PATCH 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full() Michal Nazarewicz
2010-08-05 22:38 ` [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Michal Nazarewicz
2010-08-05 22:38   ` [PATCH 3/3] lib: vsprintf: added a put_dec() test and benchmark tool Michal Nazarewicz
2010-08-06  5:10     ` Denys Vlasenko
2010-08-06  8:34       ` Michał Nazarewicz
2010-08-06 15:57         ` Raja R Harinath
2010-08-06 19:26         ` Denys Vlasenko [this message]
2010-08-06 20:58           ` Michal Nazarewicz
2010-08-06  5:18   ` [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Denys Vlasenko
2010-08-06  7:08     ` Michal Nazarewicz
2010-08-06  7:35       ` Denys Vlasenko
2010-08-06  8:54         ` Michał Nazarewicz
2010-08-06  3:58 ` [PATCH 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full() Denys Vlasenko
2010-08-06  6:56   ` Michal Nazarewicz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=201008062126.10789.vda.linux@googlemail.com \
    --to=vda.linux@googlemail.com \
    --cc=akpm@linux-foundation.org \
    --cc=jones@cs.uiowa.edu \
    --cc=linux-kernel@vger.kernel.org \
    --cc=m.nazarewicz@samsung.com \
    --cc=mina86@mina86.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.