Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+

From: Benjamin Gilbert <bgilbert@cs.cmu.edu>
To: Jeff Garzik <jeff@garzik.org>
Cc: Matt Mackall <mpm@selenic.com>,
	akpm@linux-foundation.org, herbert@gondor.apana.org.au,
	linux-crypto@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
Date: Sat, 09 Jun 2007 20:33:25 -0400
Message-ID: <466B46D5.1020004@cs.cmu.edu>
In-Reply-To: <466B0C3F.3040300@garzik.org>

Jeff Garzik wrote:
> Matt Mackall wrote:
>> Have you benchmarked this against lib/sha1.c? Please post the results.
>> Until then, I'm frankly skeptical that your unrolled version is faster
>> because when I introduced lib/sha1.c the rolled version therein won by
>> a significant margin and had 1/10th the cache footprint.

See the benchmark tables in patch 0 at the head of this thread. 
Performance improved by at least 25% in every test, and 40-60% was more 
common for the 32-bit version (on a Pentium IV).

It's not just the loop unrolling; it's the register allocation and 
spilling.  For comparison, I built SHATransform() from the 
drivers/char/random.c in 2.6.11, using gcc 3.3.5 with -O2 and 
SHA_CODE_SIZE == 3 (i.e., fully unrolled); I'm guessing this is pretty 
close to what you tested back then.  The resulting code is 49% MOV 
instructions, and 80% of *those* involve memory.  gcc4 is somewhat 
better, but it still spills a whole lot, both for the 2.6.11 unrolled 
code and for the current lib/sha1.c.

In contrast, the assembly implementation in this patch only has to go to 
memory for data and workspace (with one small exception in the F3 
rounds), and its workspace has a fifth of the cache footprint of the 
default implementation's workspace.
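
To make the workspace comparison concrete, here is a minimal C sketch (not the assembly from the patch) of a SHA-1 block transform that keeps only a 16-word circular schedule (64 bytes) rather than a full 80-word one (320 bytes). That the "fifth" comes from exactly this 16-versus-80 split is an assumption on my part; the names sha1_block16() and sha1_abc() are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static uint32_t rol32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/*
 * Hypothetical helper: process one 64-byte block with a 16-word
 * circular workspace.  Each schedule word W[t] depends only on
 * W[t-3], W[t-8], W[t-14] and W[t-16], so the slot being overwritten
 * (t & 15) is exactly the W[t-16] that is no longer needed afterwards.
 */
static void sha1_block16(uint32_t state[5], const unsigned char block[64])
{
    uint32_t w[16];
    uint32_t a = state[0], b = state[1], c = state[2];
    uint32_t d = state[3], e = state[4];
    int t;

    for (t = 0; t < 80; t++) {
        uint32_t f, k, tmp;

        if (t < 16) {
            /* Load the input as big-endian 32-bit words. */
            w[t] = (uint32_t)block[4 * t] << 24 |
                   (uint32_t)block[4 * t + 1] << 16 |
                   (uint32_t)block[4 * t + 2] << 8 |
                   (uint32_t)block[4 * t + 3];
        } else {
            w[t & 15] = rol32(w[(t - 3) & 15] ^ w[(t - 8) & 15] ^
                              w[(t - 14) & 15] ^ w[t & 15], 1);
        }

        if (t < 20) {
            f = (b & c) | (~b & d);
            k = 0x5a827999;
        } else if (t < 40) {
            f = b ^ c ^ d;
            k = 0x6ed9eba1;
        } else if (t < 60) {
            f = (b & c) | (b & d) | (c & d);
            k = 0x8f1bbcdc;
        } else {
            f = b ^ c ^ d;
            k = 0xca62c1d6;
        }

        tmp = rol32(a, 5) + f + e + k + w[t & 15];
        e = d;
        d = c;
        c = rol32(b, 30);
        b = a;
        a = tmp;
    }

    state[0] += a;
    state[1] += b;
    state[2] += c;
    state[3] += d;
    state[4] += e;
}

/* Hash the single padded block for the 3-byte message "abc". */
static void sha1_abc(uint32_t state[5])
{
    unsigned char block[64];

    memset(block, 0, sizeof(block));
    memcpy(block, "abc", 3);
    block[3] = 0x80;   /* padding: a single 1 bit after the message */
    block[63] = 24;    /* message length in bits, big-endian */

    state[0] = 0x67452301;
    state[1] = 0xefcdab89;
    state[2] = 0x98badcfe;
    state[3] = 0x10325476;
    state[4] = 0xc3d2e1f0;
    sha1_block16(state, block);
}
```

Calling sha1_abc() should leave the standard "abc" test vector in state (a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d). Whether a compiler keeps a through e in registers for code like this is exactly the register-pressure question discussed above; the sketch only shows the smaller workspace, not the hand-scheduled register use.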

> Yes. And it also depends on the CPU as well.  Testing on a server-class 
> x86 CPU (often with bigger L2, and perhaps even L1, cache) will produce 
> different result than from popular but less-capable "value" CPUs.

Good point.  I benchmarked the 32-bit assembly code on a couple more boxes:

=== AMD Duron, average of 5 trials ===
 Test#  Bytes/  Bytes/  Cyc/B  Cyc/B  Change
         block  update    (C)  (asm)
     0      16      16    104     72     31%
     1      64      16     52     36     31%
     2      64      64     45     29     36%
     3     256      16     33     23     30%
     4     256      64     27     17     37%
     5     256     256     24     14     42%
     6    1024      16     29     20     31%
     7    1024     256     20     11     45%
     8    1024    1024     19     11     42%
     9    2048      16     28     20     29%
    10    2048     256     19     11     42%
    11    2048    1024     18     10     44%
    12    2048    2048     18     10     44%
    13    4096      16     28     19     32%
    14    4096     256     18     10     44%
    15    4096    1024     18     10     44%
    16    4096    4096     18     10     44%
    17    8192      16     27     19     30%
    18    8192     256     18     10     44%
    19    8192    1024     18     10     44%
    20    8192    4096     17     10     41%
    21    8192    8192     17     10     41%

=== Classic Pentium, average of 5 trials ===
 Test#  Bytes/  Bytes/  Cyc/B  Cyc/B  Change
         block  update    (C)  (asm)
     0      16      16    145    144      1%
     1      64      16     72     61     15%
     2      64      64     65     52     20%
     3     256      16     46     39     15%
     4     256      64     39     32     18%
     5     256     256     36     29     19%
     6    1024      16     40     33     18%
     7    1024     256     30     23     23%
     8    1024    1024     29     23     21%
     9    2048      16     39     32     18%
    10    2048     256     29     22     24%
    11    2048    1024     28     22     21%
    12    2048    2048     28     22     21%
    13    4096      16     38     32     16%
    14    4096     256     28     22     21%
    15    4096    1024     28     21     25%
    16    4096    4096     27     21     22%
    17    8192      16     38     32     16%
    18    8192     256     28     22     21%
    19    8192    1024     28     21     25%
    20    8192    4096     27     21     22%
    21    8192    8192     27     21     22%

The improvement isn't as good, but it's still noticeable.
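
For anyone rechecking the tables: the Change column appears to be the relative reduction in cycles per byte, (C - asm) / C, rounded to the nearest percent. Below is a small C sketch of that calculation. The helper name percent_change() is hypothetical, and the posted figures may well have been derived from unrounded averages rather than from the rounded Cyc/B values shown.

```c
/*
 * Hypothetical helper: percent reduction in cycles/byte when going
 * from the C implementation (c_cpb) to the assembly one (asm_cpb),
 * rounded to the nearest integer.
 * Assumes c_cpb > 0 and asm_cpb <= c_cpb.
 */
static unsigned percent_change(unsigned c_cpb, unsigned asm_cpb)
{
    /* floor((100 * delta + c/2) / c), scaled by 2 to stay in integers */
    return (200u * (c_cpb - asm_cpb) + c_cpb) / (2u * c_cpb);
}
```

For example, percent_change(104, 72) gives 31, matching Duron test 0, and percent_change(145, 144) gives 1, matching Classic Pentium test 0.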

--Benjamin Gilbert
