Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Benjamin Gilbert <bgilbert@cs.cmu.edu>
To: Jeff Garzik <jeff@garzik.org>
Cc: Matt Mackall <mpm@selenic.com>,
	akpm@linux-foundation.org, herbert@gondor.apana.org.au,
	linux-crypto@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
Date: Sat, 09 Jun 2007 20:33:25 -0400	[thread overview]
Message-ID: <466B46D5.1020004@cs.cmu.edu> (raw)
In-Reply-To: <466B0C3F.3040300@garzik.org>

Jeff Garzik wrote:
> Matt Mackall wrote:
>> Have you benchmarked this against lib/sha1.c? Please post the results.
>> Until then, I'm frankly skeptical that your unrolled version is faster
>> because when I introduced lib/sha1.c the rolled version therein won by
>> a significant margin and had 1/10th the cache footprint.

See the benchmark tables in patch 0 at the head of this thread. 
Performance improved by at least 25% in every test, and 40-60% was more 
common for the 32-bit version (on a Pentium IV).

It's not just the loop unrolling; it's the register allocation and 
spilling.  For comparison, I built SHATransform() from the 
drivers/char/random.c in 2.6.11, using gcc 3.3.5 with -O2 and 
SHA_CODE_SIZE == 3 (i.e., fully unrolled); I'm guessing this is pretty 
close to what you tested back then.  The resulting code is 49% MOV 
instructions, and 80% of *those* involve memory.  gcc4 is somewhat 
better, but it still spills a whole lot, both for the 2.6.11 unrolled 
code and for the current lib/sha1.c.

In contrast, the assembly implementation in this patch only has to go to 
memory for data and workspace (with one small exception in the F3 
rounds), and the workspace has a fifth of the cache footprint of the 
default implementation.

> Yes. And it also depends on the CPU as well.  Testing on a server-class 
> x86 CPU (often with bigger L2, and perhaps even L1, cache) will produce 
> different result than from popular but less-capable "value" CPUs.

Good point.  I benchmarked the 32-bit assembly code on a couple more boxes:

=== AMD Duron, average of 5 trials ===
Test#  Bytes/  Bytes/  Cyc/B  Cyc/B  Change
         block  update    (C)  (asm)
     0      16      16    104     72     31%
     1      64      16     52     36     31%
     2      64      64     45     29     36%
     3     256      16     33     23     30%
     4     256      64     27     17     37%
     5     256     256     24     14     42%
     6    1024      16     29     20     31%
     7    1024     256     20     11     45%
     8    1024    1024     19     11     42%
     9    2048      16     28     20     29%
    10    2048     256     19     11     42%
    11    2048    1024     18     10     44%
    12    2048    2048     18     10     44%
    13    4096      16     28     19     32%
    14    4096     256     18     10     44%
    15    4096    1024     18     10     44%
    16    4096    4096     18     10     44%
    17    8192      16     27     19     30%
    18    8192     256     18     10     44%
    19    8192    1024     18     10     44%
    20    8192    4096     17     10     41%
    21    8192    8192     17     10     41%

=== Classic Pentium, average of 5 trials ===
Test#  Bytes/  Bytes/  Cyc/B  Cyc/B  Change
         block  update    (C)  (asm)
     0      16      16    145    144      1%
     1      64      16     72     61     15%
     2      64      64     65     52     20%
     3     256      16     46     39     15%
     4     256      64     39     32     18%
     5     256     256     36     29     19%
     6    1024      16     40     33     18%
     7    1024     256     30     23     23%
     8    1024    1024     29     23     21%
     9    2048      16     39     32     18%
    10    2048     256     29     22     24%
    11    2048    1024     28     22     21%
    12    2048    2048     28     22     21%
    13    4096      16     38     32     16%
    14    4096     256     28     22     21%
    15    4096    1024     28     21     25%
    16    4096    4096     27     21     22%
    17    8192      16     38     32     16%
    18    8192     256     28     22     21%
    19    8192    1024     28     21     25%
    20    8192    4096     27     21     22%
    21    8192    8192     27     21     22%

The improvement isn't as good, but it's still noticeable.

--Benjamin Gilbert

next prev parent reply	other threads:[~2007-06-10  0:34 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-06-08 21:42 [PATCH 0/3] Add optimized SHA-1 implementations for x86 and x86_64 Benjamin Gilbert
2007-06-08 21:42 ` [PATCH 1/3] [CRYPTO] Move sha_init() into cryptohash.h Benjamin Gilbert
2007-06-08 21:42 ` [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+ Benjamin Gilbert
2007-06-09  7:32   ` Jan Engelhardt
2007-06-10  1:15     ` Benjamin Gilbert
2007-06-11 19:47       ` Benjamin Gilbert
2007-06-11 19:50         ` [PATCH] " Benjamin Gilbert
2007-06-11 19:52         ` [PATCH] [CRYPTO] Add optimized SHA-1 implementation for x86_64 Benjamin Gilbert
2007-06-09 20:11   ` [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+ Matt Mackall
2007-06-09 20:23     ` Jeff Garzik
2007-06-09 21:34       ` Matt Mackall
2007-06-10  0:33       ` Benjamin Gilbert [this message]
2007-06-10 13:59         ` Matt Mackall
2007-06-10 16:47           ` Benjamin Gilbert
2007-06-10 17:33             ` Matt Mackall
2007-06-11 17:39           ` Benjamin Gilbert
2007-06-11 12:04     ` Andi Kleen
2007-06-08 21:42 ` [PATCH 3/3] [CRYPTO] Add optimized SHA-1 implementation for x86_64 Benjamin Gilbert
2007-06-11 12:01   ` Andi Kleen
2007-06-11 19:45     ` Benjamin Gilbert
2007-06-11 20:30 ` [PATCH 0/3] Add optimized SHA-1 implementations for x86 and x86_64 Adrian Bunk
  -- strict thread matches above, loose matches on Subject: below --
2007-06-11  7:53 [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+ linux
2007-06-11 19:17 ` Benjamin Gilbert
2007-06-12  5:05   ` linux
2007-06-13  5:50     ` Matt Mackall
2007-06-13  6:46       ` linux

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=466B46D5.1020004@cs.cmu.edu \
    --to=bgilbert@cs.cmu.edu \
    --cc=akpm@linux-foundation.org \
    --cc=herbert@gondor.apana.org.au \
    --cc=jeff@garzik.org \
    --cc=linux-crypto@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mpm@selenic.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox