From: Momchil Velikov <velco@fadata.bg>
To: vda@port.imtp.ilyichevsk.odessa.ua
Cc: Russell King <rmk@arm.linux.org.uk>,
Roy Sigurd Karlsbakk <roy@karlsbakk.net>,
netdev@oss.sgi.com,
Kernel mailing list <linux-kernel@vger.kernel.org>
Subject: Re: Csum and csum copyroutines benchmark
Date: 25 Oct 2002 12:47:05 +0300
Message-ID: <87znt297fq.fsf@fadata.bg>
In-Reply-To: <200210250906.g9P96Yp14775@Port.imtp.ilyichevsk.odessa.ua>
[-- Attachment #1: Type: text/plain, Size: 2863 bytes --]
>>>>> "Denis" == Denis Vlasenko <vda@port.imtp.ilyichevsk.odessa.ua> writes:
Denis> [please drop libc from CC:]
Denis> On 25 October 2002 05:48, Momchil Velikov wrote:
>>> Short conclusion:
>>> 1. It is possible to speed up csum routines for AMD processors
>>> by 30%.
>>> 2. It is possible to speed up csum_copy routines for both AMD
>>> and Intel three times or more.
>> Additional data point:
>>
>> Short summary:
>> 1. Checksum - kernelpii_csum is ~19% faster
>> 2. Copy - kernelpii_copy is ~6% faster
>>
>> Dual Pentium III, 1266 MHz, 512K cache, 2G SDRAM (133 MHz, ECC)
>>
>> The only changes I made were to decrease the buffer size to 1K (as I
>> think this is more representative of a network packet size, correct
>> me if I'm wrong) and increase the runs to 1024. Max values are
>> indeed worthless.
Denis> Well, that makes it run entirely in L1 cache. This is unrealistic
Denis> for actual use. movntq is 3x faster when you hit RAM instead of L1.
Oops ...
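For reference, a non-temporal store like movntq goes straight to memory,
bypassing the cache, so it only pays off once the data no longer fits in
the cache. A minimal sketch of such an inner copy loop (illustration only,
not one of the benchmarked routines; alignment and tail handling omitted):

    /* hypothetical sketch of a movntq-based copy;
     * needs SSE (Pentium III) or AMD extended MMX; clobbers %mm0 */
    static void nt_copy_sketch(const char *src, char *dst, int len)
    {
        int i;
        for (i = 0; i < len; i += 8)
            __asm__ __volatile__(
                "movq   (%0), %%mm0\n\t"   /* load 8 bytes through the cache */
                "movntq %%mm0, (%1)\n\t"   /* store 8 bytes, bypassing the cache */
                : : "r" (src + i), "r" (dst + i) : "memory");
        __asm__ __volatile__("sfence; emms"); /* drain WC buffers, clear MMX state */
    }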
Denis> You need to be more clever than that - generate pseudo-random
Denis> offsets in a large buffer and run on ~1K pieces of that buffer.
Here it is:
Csum benchmark program
buffer size: 1 K
Each test tried 1024 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
kernel_csum - took 8678 max, 808 min cycles per kb. sum=0x400270e8
kernel_csum - took 941 max, 808 min cycles per kb. sum=0x400270e8
kernel_csum - took 11604 max, 808 min cycles per kb. sum=0x400270e8
kernelpii_csum - took 28839 max, 664 min cycles per kb. sum=0x400270e8
kernelpiipf_csum - took 9163 max, 665 min cycles per kb. sum=0x400270e8
pfm_csum - took 2788 max, 1470 min cycles per kb. sum=0x400270e8
pfm2_csum - took 1179 max, 915 min cycles per kb. sum=0x400270e8
copy tests:
kernel_copy - took 688 max, 263 min cycles per kb. sum=0x400270e8
kernel_copy - took 456 max, 263 min cycles per kb. sum=0x400270e8
kernel_copy - took 11241 max, 263 min cycles per kb. sum=0x400270e8
kernelpii_copy - took 7635 max, 246 min cycles per kb. sum=0x400270e8
ntqpf_copy - took 5349 max, 536 min cycles per kb. sum=0x400270e8
ntqpfm_copy - took 769 max, 425 min cycles per kb. sum=0x400270e8
ntq_copy - took 672 max, 469 min cycles per kb. sum=0x400270e8
ntqpf2_copy - took 8000 max, 579 min cycles per kb. sum=0x400270e8
Done
Ran on a 512K buffer (my cache size), choosing a random 1K piece each
time. (Making the buffer larger, to 2M or 4M, does not make any
difference.)
And the modified 0main.c is attached.
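The relevant change is how each measurement picks its data: every iteration
checksums (or copies) a randomly chosen 1K piece of the 512K working set,
essentially

    func(buffer + (rand() % NBUFS) * BUFSIZE, BUFSIZE, 0);

with NBUFS = 512 and BUFSIZE = 1024.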
~velco
[-- Attachment #2: 0main.c --]
[-- Type: text/x-csrc, Size: 3996 bytes --]
#include <stdio.h>
#include <stdlib.h>
#include <string.h>     /* memset, memcmp */
#define NAME(a) \
    unsigned int a##csum(const unsigned char *buff, int len, \
                         unsigned int sum); \
    unsigned int a##copy(const char *src, char *dst, \
                         int len, int sum, int *src_err_ptr, int *dst_err_ptr)
/* This makes adding/removing test functions easier */
/* asm ones... */
NAME(kernel_);
NAME(kernelpii_);
NAME(kernelpiipf_);
/* and C */
#include "pfm_csum.c"
#include "pfm2_csum.c"
#include "ntq_copy.c"
#include "ntqpf_copy.c"
#include "ntqpf2_copy.c"
#include "ntqpfm_copy.c"
const int TRY_TIMES = 1024;
const int NBUFS = 512;
const int BUFSIZE = 1024;
const int POISON = 0; // want to check correctness?
typedef unsigned int csum_func(const unsigned char *buff, int len,
                               unsigned int sum);
typedef unsigned int copy_func(const char *src, char *dst,
                               int len, int sum, int *src_err_ptr, int *dst_err_ptr);
/* read the CPU's timestamp counter (cycles since reset) */
static inline long long rdtsc(void)
{
    unsigned int low, high;
    __asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high));
    return low + (((long long)high) << 32);
}
int die(const char *msg)
{
    puts(msg);
    abort();
    return 1;
}
unsigned test_one_csum(csum_func *func, char *name, char *buffer)
{
    int i;
    unsigned long long before, after, min, max;
    unsigned sum = 0;

    /* report the fastest and slowest of TRY_TIMES runs */
    min = ~0ULL;
    max = 0;
    for (i = 0; i < TRY_TIMES; i++) {
        /* checksum a randomly chosen 1K piece of the large buffer */
        const unsigned char *buf =
            (const unsigned char *)buffer + (rand() % NBUFS) * BUFSIZE;

        before = rdtsc();
        sum = func(buf, BUFSIZE, 0);
        after = rdtsc();
        if (before > after) die("timer overflow");
        after -= before;
        if (min > after) min = after;
        if (max < after) max = after;
    }
    printf("%32s - took %5llu max,%5llu min cycles per kb. sum=0x%08x\n",
           name,
           max / (BUFSIZE / 1024),
           min / (BUFSIZE / 1024),
           sum);
    return sum;
}
unsigned test_one_copy(copy_func *func, char *name, char *buffer)
{
    int i;
    unsigned long long before, after, min, max;
    unsigned sum = 0;
    int err;

    /* report the fastest and slowest of TRY_TIMES runs */
    min = ~0ULL;
    max = 0;
    for (i = 0; i < TRY_TIMES; i++) {
        /* copy+checksum a randomly chosen 1K piece: first half -> second half */
        char *buf = buffer + (rand() % (NBUFS - 1)) * BUFSIZE;

        if (POISON) memset(buf, 0x55, BUFSIZE/2);
        if (POISON) memset(buf + BUFSIZE/2, 0xaa, BUFSIZE/2);
        buf[0] = 0x77;              /* canaries at both ends of the source half */
        buf[BUFSIZE/2 - 1] = 0x44;
        before = rdtsc();
        sum = func(buf, buf + BUFSIZE/2, BUFSIZE/2, 0, &err, &err);
        after = rdtsc();
        if (POISON && memcmp(buf, buf + BUFSIZE/2, BUFSIZE/2) != 0)
            die("BAD copy!");
        if (before > after) die("timer overflow");
        after -= before;
        if (min > after) min = after;
        if (max < after) max = after;
    }
    printf("%32s - took %5llu max,%5llu min cycles per kb. sum=0x%08x\n",
           name,
           max / (BUFSIZE / 1024) / 2,
           min / (BUFSIZE / 1024) / 2,
           sum);
    return sum;
}
void test_csum(char *buffer)
{
    puts("csum tests:");
#define TEST_CSUM(a) test_one_csum(a, #a, buffer)
    TEST_CSUM(kernel_csum);
    TEST_CSUM(kernel_csum);
    TEST_CSUM(kernel_csum);
    TEST_CSUM(kernelpii_csum);
    TEST_CSUM(kernelpiipf_csum);
    TEST_CSUM(pfm_csum);
    TEST_CSUM(pfm2_csum);
#undef TEST_CSUM
}
void test_copy(char *buffer)
{
    unsigned sum;

    puts("copy tests:");
    /* every routine must return the same checksum for the same source data */
#define TEST_COPY(a) test_one_copy(a, #a, buffer)
    sum = TEST_COPY(kernel_copy);
    if (sum != TEST_COPY(kernel_copy))    die("Bad sum");
    if (sum != TEST_COPY(kernel_copy))    die("Bad sum");
    if (sum != TEST_COPY(kernelpii_copy)) die("Bad sum");
    if (sum != TEST_COPY(ntqpf_copy))     die("Bad sum");
    if (sum != TEST_COPY(ntqpfm_copy))    die("Bad sum");
    if (sum != TEST_COPY(ntq_copy))       die("Bad sum");
    if (sum != TEST_COPY(ntqpf2_copy))    die("Bad sum");
#undef TEST_COPY
}
int main(void)
{
    char *buffer_raw, *buffer;

    printf("Csum benchmark program\n"
           "buffer size: %i K\n"
           "Each test tried %i times, max and min CPU cycles are reported.\n"
           "Please disregard max values. They are due to system interference only.\n",
           BUFSIZE / 1024,
           TRY_TIMES);
    buffer_raw = malloc(NBUFS * BUFSIZE + 16);
    if (!buffer_raw) die("Malloc failed");
    /* align the working buffer to 16 bytes */
    buffer = (char *)(((unsigned long)buffer_raw + 15) & ~0xFUL);
    /* touch every page so the random pieces are backed by real memory
       and the copy routines all see the same (deterministic) data */
    memset(buffer, 0, NBUFS * BUFSIZE);
    test_csum(buffer);
    test_copy(buffer);
    puts("Done");
    free(buffer_raw);
    return 0;
}
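(Build note: the C variants above are pulled in via #include; the asm
kernel_*, kernelpii_* and kernelpiipf_* routines have to be assembled and
linked in from Denis' original benchmark, along the lines of

    gcc -O2 -Wall 0main.c <asm csum/copy objects> -o csumbench
    ./csumbench

where <asm csum/copy objects> stands for whatever files provide those
symbols.)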