>>>>> "Denis" == Denis Vlasenko writes:

Denis> [please drop libc from CC:]

Denis> On 25 October 2002 05:48, Momchil Velikov wrote:
>>> Short conclusion:
>>> 1. It is possible to speed up csum routines for AMD processors
>>>    by 30%.
>>> 2. It is possible to speed up csum_copy routines for both AMD
>>>    and Intel three times or more.
>> Additional data point:
>>
>> Short summary:
>> 1. Checksum - kernelpii_csum is ~19% faster
>> 2. Copy - kernelpii_csum is ~6% faster
>>
>> Dual Pentium III, 1266Mhz, 512K cache, 2G SDRAM (133Mhz, ECC)
>>
>> The only changes I made were to decrease the buffer size to 1K (as I
>> think this is more representative to a network packet size, correct
>> me if I'm wrong) and increase the runs to 1024. Max values are
>> worthless indeed.

Denis> Well, that makes it run entirely in L0 cache. This is unrealistic
Denis> for actual use. movntq is x3 faster when you hit RAM instead of L0.

Oops ...

Denis> You need to be more clever than that - generate pseudo-random
Denis> offsets in large buffer and run on ~1K pieces of that buffer.

Here it is:

Csum benchmark program
buffer size: 1 K
Each test tried 1024 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
     kernel_csum - took  8678 max,   808 min cycles per kb. sum=0x400270e8
     kernel_csum - took   941 max,   808 min cycles per kb. sum=0x400270e8
     kernel_csum - took 11604 max,   808 min cycles per kb. sum=0x400270e8
  kernelpii_csum - took 28839 max,   664 min cycles per kb. sum=0x400270e8
kernelpiipf_csum - took  9163 max,   665 min cycles per kb. sum=0x400270e8
        pfm_csum - took  2788 max,  1470 min cycles per kb. sum=0x400270e8
       pfm2_csum - took  1179 max,   915 min cycles per kb. sum=0x400270e8
copy tests:
     kernel_copy - took   688 max,   263 min cycles per kb. sum=0x400270e8
     kernel_copy - took   456 max,   263 min cycles per kb. sum=0x400270e8
     kernel_copy - took 11241 max,   263 min cycles per kb. sum=0x400270e8
  kernelpii_copy - took  7635 max,   246 min cycles per kb. sum=0x400270e8
      ntqpf_copy - took  5349 max,   536 min cycles per kb. sum=0x400270e8
     ntqpfm_copy - took   769 max,   425 min cycles per kb. sum=0x400270e8
        ntq_copy - took   672 max,   469 min cycles per kb. sum=0x400270e8
     ntqpf2_copy - took  8000 max,   579 min cycles per kb. sum=0x400270e8
Done

Ran on a 512K (my cache size) buffer, choosing a 1K piece each time.
Making the buffer larger (2M, 4M) does not make any difference.

The modified 0main.c is attached.

~velco
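
For readers without the attachment, here is a rough sketch of the random-offset scheme Denis describes: pick a pseudo-random 1K piece of a large buffer for each run and keep the min and max timings. This is not the attached 0main.c. The names (xorshift32, random_offset, simple_csum, ticks_clock, bench_random_pieces) are made up for illustration, simple_csum is only a stand-in for the routines under test, and clock() replaces the cycle counter a real run would presumably read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Assumed sizes, taken from the numbers in the mail: 512K buffer, 1K pieces. */
#define BUF_SIZE   (512 * 1024)
#define PIECE_SIZE 1024

/* Small xorshift32 PRNG so the offsets are reproducible; state must be nonzero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Pick a 4-byte-aligned offset so that [off, off + piece) lies inside the buffer. */
static size_t random_offset(uint32_t *state, size_t buf_size, size_t piece)
{
    assert(buf_size >= piece);
    size_t span = buf_size - piece + 1;
    size_t off = xorshift32(state) % span;
    return off & ~(size_t)3;   /* rounding down cannot push us out of range */
}

/* Stand-in for the routine under test: 16-bit sum with end-around carry. */
static uint32_t simple_csum(const unsigned char *p, size_t len)
{
    uint64_t sum = 0;
    size_t i;
    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)p[i] | ((uint32_t)p[i + 1] << 8);
    if (i < len)
        sum += p[i];           /* odd trailing byte */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint32_t)sum;
}

/* Portable time source; a real benchmark would read the CPU cycle counter. */
static uint64_t ticks_clock(void)
{
    return (uint64_t)clock();
}

static volatile uint32_t sink; /* keeps the compiler from dropping the calls */

/* Time 'runs' calls, each on a different pseudo-random piece; report min and max. */
static void bench_random_pieces(const unsigned char *buf, size_t buf_size,
                                size_t piece, unsigned runs,
                                uint64_t (*now)(void), uint32_t seed,
                                uint64_t *min_out, uint64_t *max_out)
{
    uint32_t state = seed ? seed : 1;
    uint64_t min = UINT64_MAX, max = 0;
    for (unsigned r = 0; r < runs; r++) {
        size_t off = random_offset(&state, buf_size, piece);
        uint64_t t0 = now();
        sink = simple_csum(buf + off, piece);
        uint64_t dt = now() - t0;
        if (dt < min) min = dt;
        if (dt > max) max = dt;
    }
    *min_out = min;
    *max_out = max;
}
```

To use it, allocate a BUF_SIZE buffer, fill it with data, and call bench_random_pieces(buf, BUF_SIZE, PIECE_SIZE, 1024, ticks_clock, 1, &min, &max). The offset is computed before the timer starts, so the PRNG cost stays out of the measurement.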