>>>>> "Denis" == Denis Vlasenko <vda@port.imtp.ilyichevsk.odessa.ua> writes:

Denis> [please drop libc from CC:]
Denis> On 25 October 2002 05:48, Momchil Velikov wrote:
>>> Short conclusion:
>>> 1. It is possible to speed up csum routines for AMD processors
>>> by 30%.
>>> 2. It is possible to speed up csum_copy routines for both AMD
>>> and Intel three times or more.

>> Additional data point:
>> 
>> Short summary:
>> 1. Checksum - kernelpii_csum is ~19% faster
>> 2. Copy - kernelpii_csum is ~6% faster
>> 
>> Dual Pentium III, 1266MHz, 512K cache, 2G SDRAM (133MHz, ECC)
>> 
>> The only changes I made were to decrease the buffer size to 1K (as I
>> think this is more representative of a network packet size, correct
>> me if I'm wrong) and increase the runs to 1024. Max values are
>> worthless indeed.

Denis> Well, that makes it run entirely in L0 cache. This is unrealistic
Denis> for actual use. movntq is x3 faster when you hit RAM instead of L0.

Oops ...

Denis> You need to be more clever than that - generate pseudo-random
Denis> offsets in large buffer and run on ~1K pieces of that buffer.

Here it is:

Csum benchmark program
buffer size: 1 K
Each test tried 1024 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
                     kernel_csum - took  8678 max,  808 min cycles per kb. sum=0x400270e8
                     kernel_csum - took   941 max,  808 min cycles per kb. sum=0x400270e8
                     kernel_csum - took 11604 max,  808 min cycles per kb. sum=0x400270e8
                  kernelpii_csum - took 28839 max,  664 min cycles per kb. sum=0x400270e8
                kernelpiipf_csum - took  9163 max,  665 min cycles per kb. sum=0x400270e8
                        pfm_csum - took  2788 max, 1470 min cycles per kb. sum=0x400270e8
                       pfm2_csum - took  1179 max,  915 min cycles per kb. sum=0x400270e8
copy tests:
                     kernel_copy - took   688 max,  263 min cycles per kb. sum=0x400270e8
                     kernel_copy - took   456 max,  263 min cycles per kb. sum=0x400270e8
                     kernel_copy - took 11241 max,  263 min cycles per kb. sum=0x400270e8
                  kernelpii_copy - took  7635 max,  246 min cycles per kb. sum=0x400270e8
                      ntqpf_copy - took  5349 max,  536 min cycles per kb. sum=0x400270e8
                     ntqpfm_copy - took   769 max,  425 min cycles per kb. sum=0x400270e8
                        ntq_copy - took   672 max,  469 min cycles per kb. sum=0x400270e8
                     ntqpf2_copy - took  8000 max,  579 min cycles per kb. sum=0x400270e8
Done
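
To compare the min columns above, the relative difference can be
computed as below. This is only a sketch: the helper name
percent_fewer_cycles is made up for illustration and is not part of
0main.c.

```c
#include <assert.h>

/* Percentage of cycles saved by `candidate` relative to `baseline`,
 * both given as "min cycles per kb" from the benchmark output.
 * Hypothetical helper, not taken from 0main.c. */
static double percent_fewer_cycles(unsigned baseline, unsigned candidate)
{
    assert(baseline > 0);
    return 100.0 * ((double)baseline - (double)candidate) / (double)baseline;
}
```

With the numbers from this run, percent_fewer_cycles(808, 664) gives
about 17.8 for kernelpii_csum vs kernel_csum, and
percent_fewer_cycles(263, 246) gives about 6.5 for kernelpii_copy vs
kernel_copy.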

Ran on a 512K buffer (my cache size), choosing a 1K piece at a
pseudo-random offset each time. Making the buffer larger (2M, 4M)
does not make any difference.
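
Roughly, the piece selection can be sketched like this. It is not
the code from the attached file: pick_offset, prng_next and the
xorshift generator are assumptions chosen for illustration, and the
4-byte alignment of offsets is my own simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUF_SIZE   (512 * 1024)  /* large buffer, 512K as in the run above */
#define PIECE_SIZE 1024          /* 1K piece handed to each csum/copy call */

/* Hypothetical xorshift32 generator; any cheap PRNG would do here. */
static uint32_t prng_state = 2463534242u;

static uint32_t prng_next(void)
{
    uint32_t x = prng_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    prng_state = x;
    return x;
}

/* Return a pseudo-random offset such that [off, off + piece) lies
 * inside a buffer of buf_size bytes. Offsets are rounded down to a
 * 4-byte boundary, which is an assumption of this sketch. */
static size_t pick_offset(size_t buf_size, size_t piece)
{
    assert(piece <= buf_size);
    size_t span = buf_size - piece + 1;   /* number of valid start offsets */
    size_t off = prng_next() % span;
    return off & ~(size_t)3;
}
```

Each timed run would then call the routine under test on
buf + pick_offset(BUF_SIZE, PIECE_SIZE) with length PIECE_SIZE, so
successive runs do not keep hitting the same 1K of cache.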

And the modified 0main.c is attached.

~velco