From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756395Ab1HEPvI (ORCPT ); Fri, 5 Aug 2011 11:51:08 -0400
Received: from cdptpa-bc-oedgelb.mail.rr.com ([75.180.133.32]:34264 "EHLO cdptpa-bc-oedgelb.mail.rr.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753552Ab1HEPvD (ORCPT ); Fri, 5 Aug 2011 11:51:03 -0400
Authentication-Results: cdptpa-bc-oedgelb.mail.rr.com smtp.user=rpearson@systemfabricworks.com; auth=pass (LOGIN)
X-Authority-Analysis: v=1.1 cv=QcSFu2tMqX8VyBnwf4xZriMeG3TVj1s8v1Rcea0EwGI= c=1 sm=0 a=I7fHHdvOj7QA:10 a=ozIaqLvjkoIA:10 a=kj9zAlcOel0A:10 a=DCwX0kaxZCiV3mmbfDr8nQ==:17 a=E3JUflbP-Z-PLmnorkwA:9 a=_y-yNVsX3OwVeKDXip8A:7 a=CjuIK1q_8ugA:10 a=DCwX0kaxZCiV3mmbfDr8nQ==:117
X-Cloudmark-Score: 0
X-Originating-IP: 67.79.195.91
From: "Bob Pearson"
To: "'Joakim Tjernlund'"
Cc: "'Andrew Morton'" , "'frank zago'" ,
References: <01dc01cc5159$317879a0$94696ce0$@systemfabricworks.com> <019501cc52d7$c8688100$59398300$@systemfabricworks.com>
In-Reply-To:
Subject: RE: [PATCH] add slice by 8 algorithm to crc32.c
Date: Fri, 5 Aug 2011 10:51:00 -0500
Message-ID: <023b01cc5387$79b09dd0$6d11d970$@systemfabricworks.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
X-Mailer: Microsoft Outlook 14.0
thread-index: AQDVh4pDc04B6RwbKPhJsZYe9rqqGgID1lG1AfPbjtABsy4QAgGPPXDLlsIrZTA=
Content-Language: en-us
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

> > version. While I haven't done the experiment you suggest there is something
> > to the point that the second q computation in the new version can be moved
> > ahead of the table lookups from the first q computation. My guess is that
> > the unrolled version will be significantly slower.
>
> Ah, didn't see that. Don't understand how this works though.
> Why do you do two 32 bit loads instead of one 64 bit load?

The two expression trees can be computed in parallel and combined with the final xor. If the compiler and instruction scheduler are smart enough, and enough instructions can be processed per cycle, the two trees overlap well and you get some speedup. I did try a 64 bit load on Nehalem but got about 2 cycles per byte, which is a little worse than doing two 32 bit loads and better than the 32 bit version. I'm not really sure why.
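
To make the "two expression trees" point concrete, here is a minimal sketch of a slice-by-8 inner loop that does two 32 bit loads. It is not the code from the patch: the names crc_tab, crc32_slice8_init, load_le32, crc32_slice8 and crc32_bytewise are made up for illustration, and it assumes the usual reflected CRC-32 polynomial 0xEDB88320 with the crc inverted on entry and exit. Only q1 depends on the previous crc value; q2 and its four table lookups do not, so they can be started early and overlapped with the first tree.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; a sketch, not the code from the patch. */
static uint32_t crc_tab[8][256];

/* Build the eight lookup tables for the reflected polynomial 0xEDB88320. */
static void crc32_slice8_init(void)
{
	for (uint32_t i = 0; i < 256; i++) {
		uint32_t c = i;
		for (int k = 0; k < 8; k++)
			c = (c >> 1) ^ ((c & 1) ? 0xEDB88320u : 0);
		crc_tab[0][i] = c;
	}
	/* Table t advances the entry of table t-1 by one more zero byte. */
	for (int t = 1; t < 8; t++)
		for (uint32_t i = 0; i < 256; i++) {
			uint32_t prev = crc_tab[t - 1][i];
			crc_tab[t][i] = (prev >> 8) ^ crc_tab[0][prev & 0xff];
		}
}

/* Little-endian 32 bit load that does not depend on host byte order. */
static uint32_t load_le32(const unsigned char *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Plain one-byte-at-a-time version, used here only as a cross-check. */
static uint32_t crc32_bytewise(uint32_t crc, const unsigned char *p, size_t len)
{
	crc = ~crc;
	while (len--)
		crc = (crc >> 8) ^ crc_tab[0][(crc ^ *p++) & 0xff];
	return ~crc;
}

static uint32_t crc32_slice8(uint32_t crc, const unsigned char *p, size_t len)
{
	crc = ~crc;
	while (len >= 8) {
		uint32_t q1 = load_le32(p) ^ crc;	/* depends on previous crc */
		uint32_t q2 = load_le32(p + 4);		/* independent of crc */

		/* First expression tree: four lookups driven by q1. */
		uint32_t a = crc_tab[7][q1 & 0xff] ^
			     crc_tab[6][(q1 >> 8) & 0xff] ^
			     crc_tab[5][(q1 >> 16) & 0xff] ^
			     crc_tab[4][q1 >> 24];
		/* Second expression tree: four lookups driven by q2. */
		uint32_t b = crc_tab[3][q2 & 0xff] ^
			     crc_tab[2][(q2 >> 8) & 0xff] ^
			     crc_tab[1][(q2 >> 16) & 0xff] ^
			     crc_tab[0][q2 >> 24];

		crc = a ^ b;	/* the final xor joins the two trees */
		p += 8;
		len -= 8;
	}
	/* Remaining 0..7 bytes, one at a time. */
	while (len--)
		crc = (crc >> 8) ^ crc_tab[0][(crc ^ *p++) & 0xff];
	return ~crc;
}
```

Call crc32_slice8_init() once before the first crc32_slice8() call. Whether a and b really overlap depends on the compiler and the CPU, so this sketch only shows the data dependencies and makes no claim about cycle counts.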