From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755394Ab1HBVTQ (ORCPT ); Tue, 2 Aug 2011 17:19:16 -0400 Received: from cdptpa-bc-oedgelb.mail.rr.com ([75.180.133.33]:62216 "EHLO cdptpa-bc-oedgelb.mail.rr.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754856Ab1HBVTM (ORCPT ); Tue, 2 Aug 2011 17:19:12 -0400 Authentication-Results: cdptpa-bc-oedgelb.mail.rr.com smtp.user=rpearson@systemfabricworks.com; auth=pass (LOGIN) X-Authority-Analysis: v=1.1 cv=40Z/dbZBr1wgzPkGSf8y7qdCkiWp+M7NvixVUiz+qMg= c=1 sm=0 a=I7fHHdvOj7QA:10 a=ozIaqLvjkoIA:10 a=kj9zAlcOel0A:10 a=DCwX0kaxZCiV3mmbfDr8nQ==:17 a=VwQbUJbxAAAA:8 a=W0vUJOdyAAAA:8 a=nG7CyOcyLrvjaoSqppwA:9 a=5FGdio68gzCh-EWwshkA:7 a=CjuIK1q_8ugA:10 a=x8gzFH9gYPwA:10 a=_n0j_2CNJc_necV-:21 a=Xx5GAISqMiC7FI9j:21 a=DCwX0kaxZCiV3mmbfDr8nQ==:117 X-Cloudmark-Score: 0 X-Originating-IP: 67.79.195.91 From: "Bob Pearson" To: "'Joakim Tjernlund'" , "'frank zago'" , "'Andrew Morton'" , References: <01dc01cc5159$317879a0$94696ce0$@systemfabricworks.com> In-Reply-To: <01dc01cc5159$317879a0$94696ce0$@systemfabricworks.com> Subject: RE: [PATCH] add slice by 8 algorithm to crc32.c Date: Tue, 2 Aug 2011 16:19:09 -0500 Message-ID: <01dd01cc5159$d21f7d40$765e77c0$@systemfabricworks.com> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Outlook 14.0 Thread-Index: AQDVh4pDc04B6RwbKPhJsZYe9rqqGgID1lG1lueFwqA= Content-Language: en-us Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Outlook sucks. Sorry. > -----Original Message----- > From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel- > owner@vger.kernel.org] On Behalf Of Bob Pearson > Sent: Tuesday, August 02, 2011 4:15 PM > To: 'Joakim Tjernlund'; 'frank zago'; 'Andrew Morton'; linux- > kernel@vger.kernel.org > Subject: RE: [PATCH] add slice by 8 algorithm to crc32.c > > Hi Joakim, > > Sorry to take so long to respond. > > Here are some performance data collected from the original and modified > crc32 algorithms. > The following is a simple test loop that computes the time to compute 1000 > crc's over 4096 bytes of data aligned on an 8 byte boundary after warming > the cache. You could make other measurements but this is sort of a best > case. > > These measurements were made on a dual socket Nehalem 2.267 GHz > system. > > #ifdef CONFIG_CRC32_PERFTEST > #include > > static u8 __attribute__((__aligned__(8))) test_buf[4096]; > > /* need to convince gcc to not optimize out all the crc calls as dead code > */ > u32 crc; > > static int __init crc32_init(void) > { > int i; > struct timespec start, stop; > u64 nsec_le; > u64 nsec_be; > > for (i = 0; i < 4096; i++) > test_buf[i] = i; > > crc = crc32_le(0, test_buf, 4096); > crc = crc32_le(crc, test_buf, 4096); > > getnstimeofday(&start); > for (i = 0; i < 1000; i++) > crc = crc32_le(crc, test_buf, 4096); > getnstimeofday(&stop); > > nsec_le = stop.tv_nsec - start.tv_nsec + > 1000000000 * (stop.tv_sec - start.tv_sec); > > crc = crc32_be(0, test_buf, 4096); > crc = crc32_be(crc, test_buf, 4096); > > getnstimeofday(&start); > for (i = 0; i < 1000; i++) > crc = crc32_be(crc, test_buf, 4096); > getnstimeofday(&stop); > > nsec_be = stop.tv_nsec - start.tv_nsec + > 1000000000 * (stop.tv_sec - start.tv_sec); > > pr_info("crc32: nsec_le = %lld nsec_be = %lld\n", nsec_le, nsec_be); > > return 0; > } > > module_init(crc32_init); > #endif /* CONFIG_CRC32_PERFTEST */ > > Here are the timings. > > CRC_?E_BITS crc_le(nsec) crc_be(nsec) > orig crc32.c 8 5842712 > 5829793 > > new crc32.c 64 2877530 > 2862515 > new crc32.c 32 5174400 > 5105816 (Equiv. to orig. BITS = 8) > > Modify all 'i' loops from for (i = 0; i < foo; i++) { ... } to for (i = foo > - 1; i >= 0; i--) { ... } > + count down 64 2837860 > 3230438 > > 2865727 3264033 (repeat) > + count down 32 5177334 > 5070337 > Results seem ambiguous on Nehalem. Slight improvement in > some cases. 12% worse in one case. > > Modify main loop to use pre-increment instead of post-increment. > + pre-increment 64 2872502 > 2767601 > + pre-increment 32 5100579 > 5090453 > Small scale improvements for each case. Will add to next > drop. > > Do both count down and pre increment > + ct down + pre-inc 64 3329125 > 3367815 > + ct down + pre-inc 32 5065308 > 5089825 > Got significantly worse for 64 bit slightly better for > 32. > > Separately I looked at the difference between 1D and 2D arrays and 1D > arrays > are a few % faster. The difference is that the loader computes the base > address at compile time. In the 2D case you have to add an offset to a > parameter that was passed in to the body routine which is a runtime > addition. > > The first comment below relates to the behavior of gen_crc32table.c which > always printed out 256x4 byte table entries at a time. The CRC32_?E_BITS = 2 > and 4 cases only read the first 4 and 16 table entries respectively so if > you are trying to make a really small image you can save most of 1KB by > shortening the table. > > I think I just caused more confusion below. The point is that the two > routines crc32_le and crc32_be take a byte string as argument which is > invariant under big/little endian byte order and produce a u32 which is just > an integer. All the byte swapping in the code is implementation detail and > has no effect on the result. The caller should just do what you normally do > with an integer when you use it in a network message e.g. call htonl or > equivalent before putting it in a packet. > > Last point below. The existing code fails sparse with __CHECK_ENDIAN__. > (There are several other sparse fails in lib as well BTW.) I cleaned up > these warnings by forcing everything to u32. > > I will include Andrew's comments with the above and send out another patch > soon. > > Regards, > > Bob Pearson > > > -----Original Message----- > > From: Joakim Tjernlund [mailto:joakim.tjernlund@transmode.se] > > Sent: Monday, August 01, 2011 4:20 AM > > To: frank zago; Andrew Morton; Bob Pearson > > Subject: Re: [PATCH] add slice by 8 algorithm to crc32.c > > > > > > Hi Bob, Frank and Andrew > > > > I just got back from vacation and noticed this patch in the kernel mail > archive. > > I have pasted some > > bits from there and commented too. > > > > > Hello, > > > > > > Added support for slice by 8 to existing crc32 algorithm. Also > > > modified gen_crc32table.c to only produce table entries that are > > > actually used. > > > > Hmm, I don't get this. What entries are unused? > > > > > The parameters CRC_LE_BITS and CRC_BE_BITS determine > > > the number of bits in the input array that are processed during each > > > step. Generally the more bits the faster the algorithm is but the > > > more table data required. > > > > > > Using an x86_64 Opteron machine running at 2100MHz the following > table > > > was collected with a pre-warmed cache by computing the crc 1000 times > > > on a buffer of 4096 bytes. > > > > > > BITS Size LE Cycles/byte BE > > Cycles/byte > > > ---------------------------------------------- > > > 1 873 41.65 > > 34.60 > > > 2 1097 25.43 > > 29.61 > > > 4 1057 13.29 > > 15.28 > > > 8 2913 7.13 > > 8.19 > > > 32 9684 2.80 > > 2.82 > > > 64 18178 1.53 > > 1.53 > > > > > > BITS is the value of CRC_LE_BITS or CRC_BE_BITS. The old > > > default was 8 which actually selected the 32 bit algorithm. > In > > > this version the value 8 is used to select the standard > > > 8 bit algorithm and two new values: 32 and 64 are > introduced > > > to select the slice by 4 and slice by 8 algorithms > respectively. > > > > > > Where Size is the size of crc32.o's text segment which > > includes > > > code and table data when both LE and BE versions are set to > > BITS. > > > > I miss the numbers from the current 32 bits impl. How does this new impl. > > for 32 bits compare to the current impl? > > > > > > > > The current version of crc32.c by default uses the slice by 4 algorithm > > > which requires about 2.8 cycles per byte. The slice by 8 algorithm is > > > roughly 2X faster and enables packet processing at over 1GB/sec on a > > typical > > > 2-3GHz system. > > > --- lib/crc32.c > > > +++ lib/crc32.c > > > > > > ... > > > > > > @@ -28,14 +31,17 @@ > > > #include > > > #include > > > #include "crc32defs.h" > > > -#if CRC_LE_BITS == 8 > > > -# define tole(x) __constant_cpu_to_le32(x) > > > + > > > +#include > > > +#if CRC_LE_BITS > 8 > > > +# define tole(x) (__force u32) __constant_cpu_to_le32(x) > > > #else > > > # define tole(x) (x) > > > #endif > > > > > > -#if CRC_BE_BITS == 8 > > > -# define tobe(x) __constant_cpu_to_be32(x) > > > +#if CRC_BE_BITS > 8 > > > +# define tobe(x) (__force u32) __constant_cpu_to_be32(x) > > > #else > > > # define tobe(x) (x) > > > #endif > > > @@ -45,54 +51,228 @@ MODULE_AUTHOR("Matt Domsch > > > > MODULE_DESCRIPTION("Ethernet CRC32 calculations"); > > > MODULE_LICENSE("GPL"); > > > > > > -#if CRC_LE_BITS == 8 || CRC_BE_BITS == 8 > > > +#if CRC_LE_BITS > 8 > > > +static inline u32 crc32_le_body(u32 crc, u8 const *buf, size_t len) > > > +{ > > > + const u8 *p8; > > > + const u32 *p32; > > > + int init_bytes, end_bytes; > > > + size_t words; > > > + int i; > > > + u32 q; > > > + u8 i0, i1, i2, i3; > > > + > > > + crc = (__force u32) __cpu_to_le32(crc); > > > + > > > +#if CRC_LE_BITS == 64 > > > + p8 = buf; > > > + p32 = (u32 *)(((uintptr_t)p8 + 7) & ~7); > > > + > > > + init_bytes = (uintptr_t)p32 - (uintptr_t)p8; > > > + if (init_bytes > len) > > > + init_bytes = len; > > > + words = (len - init_bytes) >> 3; > > > + end_bytes = (len - init_bytes) & 7; > > > +#else > > > + p8 = buf; > > > + p32 = (u32 *)(((uintptr_t)p8 + 3) & ~3); > > > + init_bytes = (uintptr_t)p32 - (uintptr_t)p8; > > > + if (init_bytes > len) > > > + init_bytes = len; > > > + words = (len - init_bytes) >> 2; > > > + end_bytes = (len - init_bytes) & 3; > > > +#endif > > > > The difference in the above code is just two constants. Should be possible > > to avoid code duplication. > > > > > + > > > + for (i = 0; i < init_bytes; i++) { > > > > All for(..) loops should be changed to: > > for (i = init_bytes; i ; --i) > > This is easier for gcc to optimize. Always compare to zero when possible. > > > > > +#ifdef __LITTLE_ENDIAN > > > + i0 = *p8++ ^ crc; > > > + crc = t0_le[i0] ^ (crc >> 8); > > > +#else > > > + i0 = *p8++ ^ (crc >> 24); > > > + crc = t0_le[i0] ^ (crc << 8); > > > +#endif > > > + } > > > > > Better to hide in macro as current code do. > > ... > > > > > + for (i = 0; i < words; i++) { > > > +#ifdef __LITTLE_ENDIAN > > > +# if CRC_LE_BITS == 64 > > > + /* slice by 8 algorithm */ > > > + q = *p32++ ^ crc; > > > + i3 = q; > > > + i2 = q >> 8; > > > + i1 = q >> 16; > > > + i0 = q >> 24; > > > + crc = t7_le[i3] ^ t6_le[i2] ^ t5_le[i1] ^ > > t4_le[i0]; > > > + > > > + q = *p32++; > > > + i3 = q; > > > + i2 = q >> 8; > > > + i1 = q >> 16; > > > + i0 = q >> 24; > > > + crc ^= t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ > > t0_le[i0]; > > > > I am not convinced that converting the table matrix to several arrays is > faster. > > Now you have to load 8 addressed and then index them instead of just one > > address. > > > > Also, this looks more like unrolling the 32 bit variant. Are you sure that > you > > wont get just as good results by just unrolling the 32 bit variant? > > > > You also lost the pre increment which was faster on RISC archs like > ppc(sparc > > too?) > > last time I tested crc32 speed. > > > > > +# else > > > + /* slice by 4 algorithm */ > > > + q = *p32++ ^ crc; > > > + i3 = q; > > > + i2 = q >> 8; > > > + i1 = q >> 16; > > > + i0 = q >> 24; > > > + crc = t3_le[i3] ^ t2_le[i2] ^ t1_le[i1] ^ > > t0_le[i0]; > > > +# endif > > > > ... > > > General comment. The use of le/be in this and the previous version of > > > crc32.c is confusing because it does not refer to little/big endian cpu > > > architecture but rather to the arbitrary choice made by different > protocols > > > as to whether the bits in a byte are presented to the CRC in little/big > > > *BIT* order. Some protocols (e.g. Ethernet, InfiniBand) process the bits > > > starting from the LSB and some (e.g. atm) process the bits in MSB order. > > > These routines (crc32_le and crc32_be) compute the CRC as a u32 in cpu > > byte > > > order. The caller has to then get the CRC into the correct order for > > > inclusion into the message. As a result there are four versions of the > > > > No, the caller does not have get the CRC into correct order, this is done > by > > crc32 code. > > > > > calculation: > > > BE bit order on BE byte order cpu, BE bit order on LE byte order cpu, > etc. > > > > What would be better? I don't see what to do. Linux requires both LE and > BE > > bit > > ordering. When we optimize heavily, CPU byte order becomes an issue > that > > needs > > to be dealt with. > > > > > Some calculations are simplified by arranging the bits sequentially in > > WORDS > > > when we are going to process them in a certain order within each byte. > This > > > means that the tables used to compute crc32_be are easier to use in BE > > byte > > > order and that tables used to compute crc32_le are easier to use in LE > byte > > > order. That is the logic behind the following weird looking macros which > are > > > kept the same as the previous version except for the inclusion of > __force > > to > > > pass sparse with -D__CHECK_ENDIAN__. > > > > Don't understand, how is you code any different from what we have > today? > > > > Jocke > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/