From: "Bob Pearson"
To: "'Joakim Tjernlund'", "George Spelvin"
Subject: RE: [patch v3 7/7] crc32: final-cleanup.diff
Date: Tue, 9 Aug 2011 18:05:56 -0500
Message-ID: <019701cc56e8$e606f9c0$b214ed40$@systemfabricworks.com>
References: <4E40C5C7.2050609@systemfabricworks.com> <00d401cc565c$98eeab60$cacc0220$@systemfabricworks.com>
List-ID: linux-kernel@vger.kernel.org

> > Doing this adds one insn to ppc. What is good for x86 isn't for
> > ppc32 (and I guess any modern arch).

I looked at the assembly output from the original code and I see two
compares at the end of the first loop: one for --len and one for
(buf & 3), which it has to compute. You are better off computing
buf & 3 once and comparing it with len once to compute init_len; then
in each iteration of the loop you only have to compare one thing.
As proposed:

	init_len = min((-((uintptr_t)buf)) & 3, len);
	...
	/* process unaligned initial bytes */
	for (i = 0; i < init_len; i++)
		DO_CRC(*buf++);

crc32x_le:
	<.....>
	negq	%rcx			# (-buf)
	andl	$3, %ecx		# (-buf) & 3
	cmpq	%rdx, %rcx		# min()
	cmova	%rdx, %rcx		# init_len =
	xorl	%edi, %edi		# i = 0
	<.....>
	jmp	.L2
.L3:	# crc = tab[0][(crc ^ (*buf++)) & 255] ^ (crc >> 8)
	movb	(%rsi,%rdi), %dl	# buf[i]
	incq	%rdi			# buf++
	xorl	%eax, %edx		# crc ^ *buf++
	shrl	$8, %eax		# crc >> 8
	movzbl	%dl, %edx		# & 255
	xorl	crc32table_le(,%rdx,4), %eax	# crc = tab[...] ^ (crc >> 8)
.L2:
	cmpq	%rcx, %rdi		# compare i with init_len
	jb	.L3
	<.....>

As was originally:

	if (unlikely((long)buf & 3 && len)) {
		do {
			DO_CRC(*buf++);
		} while ((--len) && ((long)buf) & 3);
	}

crc32x_le:
	pushq	%rbp
	movl	%edi, %eax
	pushq	%rbx
	testb	$3, %sil
	je	.L16
	testq	%rdx, %rdx
.L25:
	je	.L16
	movb	(%rsi), %cl		# *p
	incq	%rsi			# p++
	xorl	%eax, %ecx		# crc ^ *p
	shrl	$8, %eax		# crc >> 8
	movzbl	%cl, %ecx		# ... & 255
	xorl	crc32table_le(,%rcx,4), %eax	# tab[...] ^ (crc >> 8)
	decq	%rdx			# len--
	je	.L16			# if len != 0 continue
	testb	$3, %sil		# buf & 3
	jmp	.L25			# if buf & 3 continue
.L16:

The first loop has 8 instructions; the second has 11. The first loop
has slightly more setup to compute init_len; the second has a couple
of extra branches from computing the if() outside of the loop. I get
better performance with the new form of the loop. How does PPC get
better results with two tests?

> > > -/* implements slicing-by-4 or slicing-by-8 algorithm */
> > > -static inline u32
> > > -crc32_body(u32 crc, unsigned char const *buf, size_t len,
> > > -	   const u32 (*tab)[256])
> >
> > After careful measurements and looking at the asm code I figured out
> > that there was no penalty for using a 2D array. That allowed me to
> > go back to the original form.
>
> What about my suggestion to assign each table to a ptr? This was good
> for ppc, so if it doesn't slow x86 it should be added.
Gcc is very clever: it inlined crc32_body() into crc32_le() and
crc32_be(), and then replaced references to tab[x][y] with references
to crc32table_le and crc32table_be plus a constant offset that depends
on the first index ([x]). So in effect it completely unrolled back to
1D arrays as (crc32table_le + offset)[y], which is equivalent to the
t0_le[y] code I had before.