Re: Csum and csum copyroutines benchmark


From: Momchil Velikov <velco@fadata.bg>
To: vda@port.imtp.ilyichevsk.odessa.ua
Cc: Russell King <rmk@arm.linux.org.uk>,
	Roy Sigurd Karlsbakk <roy@karlsbakk.net>,
	netdev@oss.sgi.com,
	Kernel mailing list <linux-kernel@vger.kernel.org>
Subject: Re: Csum and csum copyroutines benchmark
Date: 25 Oct 2002 12:47:05 +0300
Message-ID: <87znt297fq.fsf@fadata.bg>
In-Reply-To: <200210250906.g9P96Yp14775@Port.imtp.ilyichevsk.odessa.ua>

[-- Attachment #1: Type: text/plain, Size: 2863 bytes --]

>>>>> "Denis" == Denis Vlasenko <vda@port.imtp.ilyichevsk.odessa.ua> writes:

Denis> [please drop libc from CC:]
Denis> On 25 October 2002 05:48, Momchil Velikov wrote:
>>> Short conclusion:
>>> 1. It is possible to speed up csum routines for AMD processors
>>> by 30%.
>>> 2. It is possible to speed up csum_copy routines for both AMD
>>> and Intel three times or more.

>> Additional data point:
>> 
>> Short summary:
>> 1. Checksum - kernelpii_csum is ~19% faster
>> 2. Copy - kernelpii_copy is ~6% faster
>> 
>> Dual Pentium III, 1266 MHz, 512K cache, 2G SDRAM (133 MHz, ECC)
>> 
>> The only changes I made were to decrease the buffer size to 1K (as I
>> think this is more representative of a network packet size, correct
>> me if I'm wrong) and to increase the runs to 1024. Max values are
>> indeed worthless.

Denis> Well, that makes it run entirely in L1 cache. This is unrealistic
Denis> for actual use. movntq is 3x faster when you hit RAM instead of L1.

Oops ...

Denis> You need to be more clever than that - generate pseudo-random
Denis> offsets in large buffer and run on ~1K pieces of that buffer.

Here it is:

Csum benchmark program
buffer size: 1 K
Each test tried 1024 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
                     kernel_csum - took  8678 max,  808 min cycles per kb. sum=0x400270e8
                     kernel_csum - took   941 max,  808 min cycles per kb. sum=0x400270e8
                     kernel_csum - took 11604 max,  808 min cycles per kb. sum=0x400270e8
                  kernelpii_csum - took 28839 max,  664 min cycles per kb. sum=0x400270e8
                kernelpiipf_csum - took  9163 max,  665 min cycles per kb. sum=0x400270e8
                        pfm_csum - took  2788 max, 1470 min cycles per kb. sum=0x400270e8
                       pfm2_csum - took  1179 max,  915 min cycles per kb. sum=0x400270e8
copy tests:
                     kernel_copy - took   688 max,  263 min cycles per kb. sum=0x400270e8
                     kernel_copy - took   456 max,  263 min cycles per kb. sum=0x400270e8
                     kernel_copy - took 11241 max,  263 min cycles per kb. sum=0x400270e8
                  kernelpii_copy - took  7635 max,  246 min cycles per kb. sum=0x400270e8
                      ntqpf_copy - took  5349 max,  536 min cycles per kb. sum=0x400270e8
                     ntqpfm_copy - took   769 max,  425 min cycles per kb. sum=0x400270e8
                        ntq_copy - took   672 max,  469 min cycles per kb. sum=0x400270e8
                     ntqpf2_copy - took  8000 max,  579 min cycles per kb. sum=0x400270e8
Done

This ran on a 512K buffer (my cache size), picking a random 1K piece
each time. Making the buffer larger (2M, 4M) does not make any
difference.

And the modified 0main.c is attached.

~velco

[-- Attachment #2: 0main.c --]
[-- Type: text/x-csrc, Size: 3996 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>	/* memset, memcmp */

#define NAME(a) \
unsigned int a##csum(const unsigned char * buff, int len, \
			unsigned int sum); \
unsigned int a##copy(const char *src, char *dst, \
                        int len, int sum, int *src_err_ptr, int *dst_err_ptr)
			
/* This makes adding/removing test functions easier */
/* asm ones... */
NAME(kernel_);
NAME(kernelpii_);
NAME(kernelpiipf_);
/* and C */
#include "pfm_csum.c"
#include "pfm2_csum.c"
#include "ntq_copy.c"
#include "ntqpf_copy.c"
#include "ntqpf2_copy.c"
#include "ntqpfm_copy.c"

const int TRY_TIMES = 1024;
const int NBUFS = 512;
const int BUFSIZE = 1024;
const int POISON = 0; // want to check correctness?

typedef unsigned int csum_func(const unsigned char * buff, int len,
		unsigned int sum);
typedef unsigned int copy_func(const char *src, char *dst,
		int len, int sum, int *src_err_ptr, int *dst_err_ptr);

static inline long long rdtsc()
{
	unsigned int low,high;
	__asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high));
	return low + (((long long)high)<<32);
}

int die(const char *msg) {
	puts(msg);
	abort();
	return 1;
}

unsigned test_one_csum(csum_func *func, char *name, char *buffer)
{
	int i;
	unsigned long long before,after,min,max;
	unsigned sum = 0;
	
	// pick fastest run
	min = ~0ULL;
	max = 0;
	for (i=0;i<TRY_TIMES;i++) {
		// choose a random 1K piece before starting the timer,
		// so that rand() itself is not measured
		const unsigned char *buf = (const unsigned char *)buffer
				+ (rand () % NBUFS) * BUFSIZE;
		before = rdtsc();
		sum = func(buf, BUFSIZE, 0);
		after = rdtsc();
		if (before>after) die("timer overflow");
		else {
			after-=before;
			if(min>after) min=after;
			if(max<after) max=after;
		}		
	}
	// sum is the checksum of the last piece tested
	printf("%32s - took %5llu max,%5llu min cycles per kb. sum=0x%08x\n",
		name,
		max / (BUFSIZE/1024),
		min / (BUFSIZE/1024),
		sum
		);
	return sum;
}
     
unsigned test_one_copy(copy_func *func, char *name, char *buffer)
{
	int i;
	unsigned long long before,after,min,max;
	unsigned sum = 0;
	int err;

	// restart the random sequence so that every routine copies the
	// same pieces and the returned sums can be compared
	srand(1);
	// pick fastest run
	min = ~0ULL;
	max = 0;
	for (i=0; i<TRY_TIMES; i++) {
		// choose a random 1K piece before starting the timer; its first
		// half is the source, its second half the destination
		char *buf = buffer + (rand () % NBUFS) * BUFSIZE;
		if(POISON) {
			memset(buf,          0x55,BUFSIZE/2);
			memset(buf+BUFSIZE/2,0xaa,BUFSIZE/2);
			buf[0] = 0x77;
			buf[BUFSIZE/2-1] = 0x44;
		}
		before = rdtsc();
		sum = func(buf,buf+BUFSIZE/2,BUFSIZE/2,0,&err,&err);
		after = rdtsc();
		if(POISON) if(memcmp(buf,buf+BUFSIZE/2,BUFSIZE/2)!=0) die("BAD copy!");
		if (before>after) die("timer overflow");
		else {
			after-=before;
			if(min>after) min=after;
			if(max<after) max=after;
		}		
	}
	// sum is the checksum of the last piece copied
	printf("%32s - took %5llu max,%5llu min cycles per kb. sum=0x%08x\n",
		name,
		max / (BUFSIZE/1024) / 2,
		min / (BUFSIZE/1024) / 2,
		sum
	);
	return sum;
}
     
     
void test_csum(char *buffer)
{
	unsigned sum;
	puts("csum tests:");

#define	TEST_CSUM(a) test_one_csum(a,#a,buffer)
	TEST_CSUM(kernel_csum	);
	TEST_CSUM(kernel_csum	);
	TEST_CSUM(kernel_csum	);
	TEST_CSUM(kernelpii_csum	);
	TEST_CSUM(kernelpiipf_csum);
	TEST_CSUM(pfm_csum	);
	TEST_CSUM(pfm2_csum	);
#undef TEST_CSUM
}   

void test_copy(char *buffer)
{
	unsigned sum;
	puts("copy tests:");

#define	TEST_COPY(a) test_one_copy(a,#a,buffer)
	sum =  TEST_COPY(kernel_copy	);
	sum == TEST_COPY(kernel_copy	) || die("Bad sum");
	sum == TEST_COPY(kernel_copy	) || die("Bad sum");
	sum == TEST_COPY(kernelpii_copy	) || die("Bad sum");
	sum == TEST_COPY(ntqpf_copy	) || die("Bad sum");
	sum == TEST_COPY(ntqpfm_copy	) || die("Bad sum");
	sum == TEST_COPY(ntq_copy	) || die("Bad sum");
	sum == TEST_COPY(ntqpf2_copy	) || die("Bad sum");
#undef TEST_COPY
}

int main()
{
	char *buffer_raw,*buffer;
	printf("Csum benchmark program\n"
		"buffer size: %i K\n"
		"Each test tried %i times, max and min CPU cycles are reported.\n"
		"Please disregard max values. They are due to system interference only.\n",
		BUFSIZE/1024,
		TRY_TIMES
	);
	
	buffer_raw = malloc(NBUFS * BUFSIZE+16);
	if(!buffer_raw) die("Malloc failed");
		
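	// round up to a 16-byte boundary; unsigned long is wide enough for a
	// pointer on Linux, whereas int truncates it on 64-bit targets
	buffer = (char*) ((((unsigned long)buffer_raw)+15) & ~0xFUL);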
	
	test_csum(buffer);
	test_copy(buffer);

	puts("Done");
	free(buffer_raw);
	return 0;
}
