From: Momchil Velikov <velco@fadata.bg>
To: vda@port.imtp.ilyichevsk.odessa.ua
Cc: Russell King <rmk@arm.linux.org.uk>,
Roy Sigurd Karlsbakk <roy@karlsbakk.net>,
netdev@oss.sgi.com,
Kernel mailing list <linux-kernel@vger.kernel.org>
Subject: Re: Csum and csum copyroutines benchmark
Date: 25 Oct 2002 12:47:05 +0300
Message-ID: <87znt297fq.fsf@fadata.bg>
In-Reply-To: <200210250906.g9P96Yp14775@Port.imtp.ilyichevsk.odessa.ua>
>>>>> "Denis" == Denis Vlasenko <vda@port.imtp.ilyichevsk.odessa.ua> writes:
Denis> [please drop libc from CC:]
Denis> On 25 October 2002 05:48, Momchil Velikov wrote:
>>> Short conclusion:
>>> 1. It is possible to speed up csum routines for AMD processors
>>> by 30%.
>>> 2. It is possible to speed up csum_copy routines for both AMD
>>> and Intel three times or more.
>> Additional data point:
>>
>> Short summary:
>> 1. Checksum - kernelpii_csum is ~19% faster
>> 2. Copy - kernelpii_copy is ~6% faster
>>
>> Dual Pentium III, 1266MHz, 512K cache, 2G SDRAM (133MHz, ECC)
>>
>> The only changes I made were to decrease the buffer size to 1K (as I
>> think this is more representative of a network packet size, correct
>> me if I'm wrong) and increase the runs to 1024. Max values are
>> worthless indeed.
Denis> Well, that makes it run entirely in L0 cache. This is unrealistic
Denis> for actual use. movntq is x3 faster when you hit RAM instead of L0.
Oops ...
Denis> You need to be more clever than that - generate pseudo-random
Denis> offsets in large buffer and run on ~1K pieces of that buffer.
Here it is:
Csum benchmark program
buffer size: 1 K
Each test tried 1024 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
kernel_csum - took 8678 max, 808 min cycles per kb. sum=0x400270e8
kernel_csum - took 941 max, 808 min cycles per kb. sum=0x400270e8
kernel_csum - took 11604 max, 808 min cycles per kb. sum=0x400270e8
kernelpii_csum - took 28839 max, 664 min cycles per kb. sum=0x400270e8
kernelpiipf_csum - took 9163 max, 665 min cycles per kb. sum=0x400270e8
pfm_csum - took 2788 max, 1470 min cycles per kb. sum=0x400270e8
pfm2_csum - took 1179 max, 915 min cycles per kb. sum=0x400270e8
copy tests:
kernel_copy - took 688 max, 263 min cycles per kb. sum=0x400270e8
kernel_copy - took 456 max, 263 min cycles per kb. sum=0x400270e8
kernel_copy - took 11241 max, 263 min cycles per kb. sum=0x400270e8
kernelpii_copy - took 7635 max, 246 min cycles per kb. sum=0x400270e8
ntqpf_copy - took 5349 max, 536 min cycles per kb. sum=0x400270e8
ntqpfm_copy - took 769 max, 425 min cycles per kb. sum=0x400270e8
ntq_copy - took 672 max, 469 min cycles per kb. sum=0x400270e8
ntqpf2_copy - took 8000 max, 579 min cycles per kb. sum=0x400270e8
Done
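For reference, the relative gain can be read off the min columns
above: kernelpii_csum needs 664 cycles against 808 for kernel_csum,
and kernelpii_copy 246 against 263. A minimal sketch of that
arithmetic (speedup_percent is a name made up for this note, it is
not part of 0main.c):

```c
#include <assert.h>

/* Percentage by which `cand` is faster than `base`. Both arguments are
 * minimum cycle counts, so lower is better. Hypothetical helper, not
 * taken from the benchmark sources. */
static double speedup_percent(unsigned long base, unsigned long cand)
{
	assert(base > 0);
	return 100.0 * ((double)base - (double)cand) / (double)base;
}
```

With the figures above, speedup_percent(808, 664) comes out at about
17.8 and speedup_percent(263, 246) at about 6.5.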
Ran on a 512K buffer (my cache size), choosing a random 1K piece on
each run. Making the buffer larger (2M, 4M) does not make any
difference.
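The piece selection boils down to the following. This is only a
sketch: pick_piece and its parameter names are made up here, and it
assumes the buffer holds npieces pieces of piece bytes each, back to
back.

```c
#include <assert.h>
#include <stdlib.h>

/* Return the start of a randomly chosen piece inside a buffer that
 * holds `npieces` pieces of `piece` bytes each. Hypothetical helper,
 * not taken from the benchmark sources. */
static char *pick_piece(char *base, int npieces, int piece)
{
	assert(npieces > 0 && piece > 0);
	return base + (size_t)(rand() % npieces) * (size_t)piece;
}
```

Called as pick_piece(buffer, 512, 1024), it yields the start of one of
the 512 1K pieces of the buffer.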
And the modified 0main.c is attached.
~velco
[-- Attachment #2: 0main.c --]
#include <stdio.h>
#include <stdlib.h>
#include <string.h>	/* memset, memcmp */
#define NAME(a) \
	unsigned int a##csum(const unsigned char * buff, int len, \
			     unsigned int sum); \
	unsigned int a##copy(const char *src, char *dst, \
			     int len, int sum, int *src_err_ptr, int *dst_err_ptr)
/* This makes adding/removing test functions easier */
/* asm ones... */
NAME(kernel_);
NAME(kernelpii_);
NAME(kernelpiipf_);
/* and C */
#include "pfm_csum.c"
#include "pfm2_csum.c"
#include "ntq_copy.c"
#include "ntqpf_copy.c"
#include "ntqpf2_copy.c"
#include "ntqpfm_copy.c"
const int TRY_TIMES = 1024;
const int NBUFS = 512;
const int BUFSIZE = 1024;
const int POISON = 0; // want to check correctness?
typedef unsigned int csum_func(const unsigned char * buff, int len,
			       unsigned int sum);
typedef unsigned int copy_func(const char *src, char *dst,
			       int len, int sum, int *src_err_ptr, int *dst_err_ptr);
static inline long long rdtsc()
{
	unsigned int low,high;
	__asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high));
	return low + (((long long)high)<<32);
}
int die(const char *msg) {
	puts(msg);
	abort();
	return 1;
}
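unsigned test_one_csum(csum_func *func, char *name, char *buffer)
{
	int i;
	unsigned long long before,after,min,max;
	unsigned sum;
	// pick fastest run
	min = ~0ULL;
	max = 0;
	for (i=0;i<TRY_TIMES;i++) {
		// choose a random BUFSIZE piece before starting the timer
		const unsigned char *buf = (const unsigned char *)buffer
			+ (rand () % NBUFS) * BUFSIZE;
		before = rdtsc();
		func(buf, BUFSIZE, 0);
		after = rdtsc();
		if (before>after) die("timer overflow");
		else {
			after-=before;
			if(min>after) min=after;
			if(max<after) max=after;
		}
	}
	// untimed run on a fixed piece, so that the reported sum is
	// comparable between routines ("sum" used to be printed
	// uninitialized)
	sum = func((const unsigned char *)buffer, BUFSIZE, 0);
	printf("%32s - took %5llu max,%5llu min cycles per kb. sum=0x%08x\n",
		name,
		max / (BUFSIZE/1024),
		min / (BUFSIZE/1024),
		sum
	);
	return sum;
}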
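unsigned test_one_copy(copy_func *func, char *name, char *buffer)
{
	int i;
	unsigned long long before,after,min,max;
	unsigned sum;
	int err;
	// pick fastest run
	min = ~0ULL;
	max = 0;
	for (i=0; i<TRY_TIMES; i++) {
		// choose a random BUFSIZE piece before starting the timer; its
		// first half is the source, its second half the destination.
		// The offset is scaled by BUFSIZE (it used to be a byte offset).
		char *buf = buffer + (rand () % NBUFS) * BUFSIZE;
		if(POISON) memset(buf, 0x55,BUFSIZE/2);
		if(POISON) memset(buf+BUFSIZE/2,0xaa,BUFSIZE/2);
		buf[0] = 0x77;
		buf[BUFSIZE/2-1] = 0x44;
		before = rdtsc();
		func(buf,buf+BUFSIZE/2,BUFSIZE/2,0,&err,&err);
		after = rdtsc();
		if(POISON) if(memcmp(buf,buf+BUFSIZE/2,BUFSIZE/2)!=0) die("BAD copy!");
		if (before>after) die("timer overflow");
		else {
			after-=before;
			if(min>after) min=after;
			if(max<after) max=after;
		}
	}
	// untimed run on a fixed piece, so that test_copy() can compare
	// the sum between routines ("sum" used to be uninitialized)
	if(POISON) memset(buffer, 0x55,BUFSIZE/2);
	buffer[0] = 0x77;
	buffer[BUFSIZE/2-1] = 0x44;
	sum = func(buffer,buffer+BUFSIZE/2,BUFSIZE/2,0,&err,&err);
	printf("%32s - took %5llu max,%5llu min cycles per kb. sum=0x%08x\n",
		name,
		max / (BUFSIZE/1024) / 2,
		min / (BUFSIZE/1024) / 2,
		sum
	);
	return sum;
}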
void test_csum(char *buffer)
{
	unsigned sum;
	puts("csum tests:");
#define TEST_CSUM(a) test_one_csum(a,#a,buffer)
	TEST_CSUM(kernel_csum );
	TEST_CSUM(kernel_csum );
	TEST_CSUM(kernel_csum );
	TEST_CSUM(kernelpii_csum );
	TEST_CSUM(kernelpiipf_csum);
	TEST_CSUM(pfm_csum );
	TEST_CSUM(pfm2_csum );
#undef TEST_CSUM
}
void test_copy(char *buffer)
{
	unsigned sum;
	puts("copy tests:");
#define TEST_COPY(a) test_one_copy(a,#a,buffer)
	sum = TEST_COPY(kernel_copy );
	sum == TEST_COPY(kernel_copy ) || die("Bad sum");
	sum == TEST_COPY(kernel_copy ) || die("Bad sum");
	sum == TEST_COPY(kernelpii_copy ) || die("Bad sum");
	sum == TEST_COPY(ntqpf_copy ) || die("Bad sum");
	sum == TEST_COPY(ntqpfm_copy ) || die("Bad sum");
	sum == TEST_COPY(ntq_copy ) || die("Bad sum");
	sum == TEST_COPY(ntqpf2_copy ) || die("Bad sum");
#undef TEST_COPY
}
int main()
{
	char *buffer_raw,*buffer;
	printf("Csum benchmark program\n"
		"buffer size: %i K\n"
		"Each test tried %i times, max and min CPU cycles are reported.\n"
		"Please disregard max values. They are due to system interference only.\n",
		BUFSIZE/1024,
		TRY_TIMES
	);
	buffer_raw = malloc(NBUFS * BUFSIZE+16);
	if(!buffer_raw) die("Malloc failed");
	// round up to a 16-byte boundary; cast through unsigned long,
	// since int is narrower than a pointer on 64-bit targets
	buffer = (char*) ((((unsigned long)buffer_raw)+15) & (~0xFUL));
	test_csum(buffer);
	test_copy(buffer);
	puts("Done");
	free(buffer_raw);
	return 0;
}
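A note on the alignment step in main(): it rounds the malloc()ed
pointer up to a 16-byte boundary, which is why 16 extra bytes are
allocated. The same step as a standalone helper might look like the
sketch below. It uses uintptr_t so that the cast is pointer-sized on
any target; align_up is a name made up for this note and is not part
of 0main.c.

```c
#include <assert.h>
#include <stdint.h>

/* Round `p` up to the next multiple of `align`, which must be a power
 * of two. The caller has to over-allocate by at least align - 1 bytes.
 * Hypothetical helper, not taken from the benchmark sources. */
static void *align_up(void *p, uintptr_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (void *)(((uintptr_t)p + align - 1) & ~(align - 1));
}
```

With it, the assignment in main() could read
buffer = align_up(buffer_raw, 16).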