From: linux@horizon.com
Subject: Re: x86 asm SHA1 (draft)
Date: 24 Jun 2006 05:20:26 -0400
Message-ID: <20060624092026.31029.qmail@science.horizon.com>
References: <7vfyhv11ej.fsf@assigned-by-dhcp.cox.net>
In-Reply-To: <7vfyhv11ej.fsf@assigned-by-dhcp.cox.net>
To: git@vger.kernel.org, junkio@cox.net
Cc: linux@horizon.com

> OK.  I somehow got an impression that your two versions had
> quite different performance characteristics on G4 and G5 and
> there was a real choice.  If they are between a few per-cent,
> then I agree it is not worth doing at all.

My apologies for being unclear.  The place where a noticeable (if not
disastrous) difference can appear is x86, which has a lot more models
with "interesting" performance characteristics.

In particular, Intel is fond of building CPUs with a very small "sweet
spot".  The openssl SHA1 code had to be reworked to not suck on a P4,
with the resultant performance change:

#               compared with original   compared with Intel cc
#               assembler impl.          generated code
# Pentium       -16%                     +48%
# PIII/AMD      +8%                      +16%
# P4            +85%(!)                  +45%

The original code had the most popular round (what I call
ROUND_MIX(F2,...)) implemented as follows, with single-uop instructions
(no load+op) scheduled for the Pentium pipeline:

(A..E are working variables, S and T are temps)

        movl 16(%esp),S         U \
        movl 24(%esp),T         V  \
        xorl S,T                U   \
        movl 48(%esp),S         V    > "MIX", pentium-optimized
        xorl S,T                U   /
        movl 4(%esp),S          V  /
        xorl S,T                U /
        movl B,S                V
        roll $1,T               U       Rotate of mix (SHA0 -> SHA1 fix)
        xor C,S                 V
        mov T,16(%esp)          U       Store back W[i]
        xor D,S                 V       Finish computing F(B,C,D) = B^C^D
        lea K(T,E),E            U       Add K and W[i] to E
        mov A,T                 V
        roll $5,T               UV
        rorl $1,B               U
        add S,E                 V
        rorl $1,B               U
        add T,E                 V

While the P4-optimized version goes:

        movl B,S
        movl 16(%esp),T
        rorl $2,B
        xorl 24(%esp),T
        xorl C,S
        xorl 48(%esp),T
        xorl D,S                This is F(B,C,D) = B^C^D
        xorl 4(%esp),T
        roll $1,T               Rotate of mix (SHA0 -> SHA1 fix)
        addl S,E
        movl T,16(%esp)
        movl A,S
        roll $5,S
        lea K(E,T),E
        add S,E

(The original code actually rotates the working variables around 6
registers, not 5, but I've rearranged the last couple of instructions
to rotate around 5.)
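
For anyone who would rather read C than AT&T syntax, here is a rough
sketch of what both listings compute.  It is not the openssl code or the
draft assembly, just the textbook SHA-1 block function written so that
the F2 round and the MIX step are easy to pick out.  The names rol32,
mix and sha1_block are invented for this illustration, and the comment
tying W[] indices to stack offsets assumes W[] starts at 0(%esp).  The
F2 branch is taken for rounds 20..39 and 60..79, which is why it is the
most popular round.

```c
#include <stdint.h>

/* Rotate left; n must be in 1..31. */
static uint32_t rol32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* "MIX": update the 16-word circular schedule in place and return W[i].
 * Assuming W[] starts at 0(%esp), (i & 15) == 4 touches the
 * 16, 24, 48 and 4(%esp) slots used in the listings. */
static uint32_t mix(uint32_t W[16], unsigned i)
{
    uint32_t t = W[i & 15] ^ W[(i + 2) & 15] ^ W[(i + 8) & 15] ^ W[(i + 13) & 15];
    return W[i & 15] = rol32(t, 1);   /* the SHA0 -> SHA1 rotate */
}

/* Process one 64-byte block, updating the five-word state h[]. */
void sha1_block(uint32_t h[5], const unsigned char p[64])
{
    uint32_t W[16], v[5];
    unsigned i;

    for (i = 0; i < 16; i++)          /* big-endian load of the message */
        W[i] = (uint32_t)p[4*i] << 24 | (uint32_t)p[4*i + 1] << 16 |
               (uint32_t)p[4*i + 2] << 8 | (uint32_t)p[4*i + 3];
    for (i = 0; i < 5; i++)
        v[i] = h[i];

    for (i = 0; i < 80; i++) {
        /* Rotate the roles of A..E around the five slots, not the data. */
        uint32_t a = v[(80 - i) % 5], *b = &v[(81 - i) % 5];
        uint32_t c = v[(82 - i) % 5], d = v[(83 - i) % 5];
        uint32_t *e = &v[(84 - i) % 5];
        uint32_t w = (i < 16) ? W[i] : mix(W, i);
        uint32_t f, k;

        if (i < 20)      { f = d ^ (*b & (c ^ d));        k = 0x5A827999u; }
        else if (i < 40) { f = *b ^ c ^ d;                k = 0x6ED9EBA1u; } /* F2 */
        else if (i < 60) { f = (*b & c) | (d & (*b | c)); k = 0x8F1BBCDCu; }
        else             { f = *b ^ c ^ d;                k = 0xCA62C1D6u; } /* F2 */

        *e += rol32(a, 5) + f + k + w;   /* E += rol(A,5) + F(B,C,D) + K + W[i] */
        *b = rol32(*b, 30);              /* same as rorl $2,B */
    }

    for (i = 0; i < 5; i++)
        h[i] += v[i];
}
```

Only E and B are written in each round, so a round is one sum into E
plus a rotate of B.  That is all either assembly version has to
schedule.  Because 80 is a multiple of 5, the roles are back in their
starting slots when the loop ends.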