From: Wayne Scott <wsc9tt@gmail.com>
To: Paul Mackerras <paulus@samba.org>
Cc: linux@horizon.com, git@vger.kernel.org
Subject: Re: [PATCH] PPC assembly implementation of SHA1
Date: Sun, 24 Apr 2005 07:04:27 -0500
Message-ID: <59a6e583050424050434ae2501@mail.gmail.com>
In-Reply-To: <17003.9009.226712.220822@cargo.ozlabs.ibm.com>
_The_ book on expression rewriting tricks like this, especially for
the PPC, is "Hacker's Delight" by Henry Warren. Great reading!!!
http://www.hackersdelight.org/
-Wayne
On 4/23/05, Paul Mackerras <paulus@samba.org> wrote:
> linux@horizon.com writes:
>
> > I was working on the same thing, but hindered by lack of access to PPC
> > hardware. I notice that you also took advantage of the unaligned load
> > support and native byte order to do the hash straight from the source.
>
> Yes. :) In previous experiments (in the context of trying different
> ways to do memcpy) I found that doing unaligned word loads is faster
> than doing aligned loads plus extra rotate and mask instructions to
> get the bytes you want together.
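
A minimal C sketch of the two load strategies being compared, for readers without the assembly in front of them. The function names are hypothetical and not taken from the patch. The memcpy form is the well-defined way to express a possibly unaligned word load in C; compilers commonly reduce it to a single load instruction where the target allows that, but this is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-at-a-time reference: works for any alignment and any host byte order. */
static uint32_t load_be32_bytes(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* One whole-word load from a possibly unaligned address. */
static uint32_t load_be32_word(const unsigned char *p)
{
    /* Run-time byte-order probe: c[0] is 1 only on a little-endian host. */
    const union { uint16_t u; unsigned char c[2]; } probe = { 1 };
    uint32_t w;

    memcpy(&w, p, sizeof w);
    if (probe.c[0] == 1) {
        /* Little-endian host: swap into big-endian order.
         * On a big-endian PPC the loaded word needs no fix-up. */
        w = (w >> 24) | ((w >> 8) & 0x0000ff00u) |
            ((w << 8) & 0x00ff0000u) | (w << 24);
    }
    return w;
}

/* Check that both strategies agree at every byte offset in buf. */
static void check_loads_agree(const unsigned char *buf, size_t len)
{
    for (size_t off = 0; off + 4 <= len; off++)
        assert(load_be32_word(buf + off) == load_be32_bytes(buf + off));
}
```

SHA-1 reads its input as big-endian words, which is why a big-endian machine can hash straight from the source buffer without a swap.
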
>
> > But I came up with a few additional refinements:
> >
> > - You are using three temporaries (%r0, %r6, and RT(x)) for your
> > round functions. You only need one temporary (%r0) for all the functions.
> > (Plus %r15 for k)
>
> The reason I used more than one temporary is that I was trying to put
> dependent instructions as far apart as reasonably possible, to
> minimize the chances of pipeline stalls. Given that the 970 does
> register renaming and out-of-order execution, I don't know how
> essential that is, but it can't hurt.
>
> > All are three logical instructions on PPC. The second form
> > lets you add it into the accumulator e in two pieces:
>
> A sequence of adds into a single register is going to incur the
> 2-cycle latency between generation and use of a value; i.e. the adds
> will only issue on every second cycle. I think we are better off
> making the dataflow more like a tree than a linear chain where
> possible.
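
The chain-versus-tree point can be sketched in C as follows. This is only an illustration of the dependency shape, with hypothetical names; the operands stand for the five terms a SHA-1 round adds together (e, the rotated a, the round function value, the constant k, and the message word). A C compiler is free to reassociate unsigned additions, so the instruction order actually emitted may differ from what is written here.

```c
#include <assert.h>
#include <stdint.h>

/* Linear chain: every add waits for the previous one (dependency depth 4). */
static uint32_t sum_chain(uint32_t e, uint32_t rot_a, uint32_t f,
                          uint32_t k, uint32_t w)
{
    e += rot_a;
    e += f;
    e += k;
    e += w;
    return e;
}

/* Tree: the first two adds are independent of each other (depth 3). */
static uint32_t sum_tree(uint32_t e, uint32_t rot_a, uint32_t f,
                         uint32_t k, uint32_t w)
{
    uint32_t t0 = e + rot_a;   /* independent of t1 */
    uint32_t t1 = f + k;       /* independent of t0 */
    return (t0 + t1) + w;
}

/* Unsigned addition wraps modulo 2^32, so both orders give the same sum. */
static void check_sums_agree(uint32_t e, uint32_t rot_a, uint32_t f,
                             uint32_t k, uint32_t w)
{
    assert(sum_chain(e, rot_a, f, k, w) == sum_tree(e, rot_a, f, k, w));
}
```

With a 2-cycle latency between producing and consuming a value, the chain would need roughly four dependent steps and the tree three, and the two independent adds in the tree can issue back to back.
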
>
> > And the last function, majority(x,y,z), can be written as:
> > f3(x,y,z) = (x & y) | (y & z) | (z & x)
> >           = (x & y) | (z & (x | y))
> >           = (x & y) | (z & (x ^ y))
> >           = (x & y) + (z & (x ^ y))
>
> That's cute, I hadn't thought of that.
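
For anyone who wants to convince themselves of the majority identities, here is a small C check (function names are made up for this note). The last step works because (x & y) and (z & (x ^ y)) can never both have the same bit set: the first needs x and y both 1, the second needs them to differ. With no overlapping bits there are no carries, so OR and ADD give the same result.

```c
#include <assert.h>
#include <stdint.h>

/* The four forms from the derivation, fully parenthesized. */
static uint32_t maj_classic(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) | (y & z) | (z & x);
}

static uint32_t maj_or(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) | (z & (x | y));
}

static uint32_t maj_xor(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) | (z & (x ^ y));
}

static uint32_t maj_add(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) + (z & (x ^ y));
}

/* Assert that all four forms agree for one input triple. */
static void check_majority_forms(uint32_t x, uint32_t y, uint32_t z)
{
    uint32_t want = maj_classic(x, y, z);

    assert(maj_or(x, y, z) == want);
    assert(maj_xor(x, y, z) == want);
    assert(maj_add(x, y, z) == want);
}
```

Calling check_majority_forms(0xf0f0f0f0, 0xcccccccc, 0xaaaaaaaa) covers all eight input combinations at once, since those three words together present every (x, y, z) bit pattern in some bit position.
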
>
> > - You don't need to decrement %r1 before saving registers.
> > The PPC calling convention defines a "red zone" below the
> > current stack pointer that is guaranteed never to be touched
> > by signal handlers or the like. This is specifically for
> > leaf procedure optimization, and is at least 224 bytes.
>
> Not in the ppc32 ELF ABI - you are not supposed to touch memory below
> the stack pointer. The kernel is more forgiving than that, and in
> fact you can currently use the red zone without anything bad
> happening, but you really shouldn't.
>
> > - Is that many stw/lwz instructions faster than stmw/lmw?
> > The latter is at least more cache-friendly.
>
> I believe the stw/lwz and the stmw/lmw will actually execute at the
> same speed on the 970, but I have seen lwz/stw go faster than lmw/stmw
> on other machines. In any case we aren't executing the prolog and
> epilog as often as the instructions in the main loop, hopefully.
>
> > - You can avoid saving and restoring %r15 by recycling %r5 for that
> > purpose; it's not used after the mtctr %r5.
>
> True.
>
> > - The above changes actually save enough registers to cache the whole hash[5]
> > in registers as well, eliminating *all* unnecessary load/store traffic.
>
> That's cool.
>
> > With all of the above changes, your sha1ppc.S file turns into:
>
> I added a stwu and an addi to make a stack frame, and changed %r15 to
> %r5 as you mentioned in another message. I tried it in a little test
> program I have that calls SHA1_Update 256,000 times with a buffer of
> 4096 zero bytes, i.e. it processes 1000MB. Your version seems to be
> about 2% faster; it took 4.53 seconds compared to 4.62 for mine. But
> it also gives the wrong answer; I haven't investigated why.
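
The figures quoted above can be cross-checked with a few lines of C. This is only the arithmetic, not the test program itself, and the helper names are hypothetical. It assumes "MB" here means 2^20 bytes, which is what makes 256,000 calls of 4096 bytes come out to exactly 1000.

```c
#include <assert.h>
#include <stdint.h>

/* Total bytes hashed by `calls` updates of `buf_len` bytes each. */
static uint64_t total_bytes(uint64_t calls, uint64_t buf_len)
{
    return calls * buf_len;
}

/* Throughput in units of 2^20 bytes per second. */
static double throughput_mib_per_s(uint64_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes / (1024.0 * 1024.0) / seconds;
}

/* Relative run-time reduction of new_s versus old_s, in percent. */
static double percent_faster(double old_s, double new_s)
{
    assert(old_s > 0.0);
    return 100.0 * (old_s - new_s) / old_s;
}
```

With the numbers from the message, total_bytes(256000, 4096) is 1,048,576,000 bytes, the two runs work out to roughly 221 and 216 MB/s, and percent_faster(4.62, 4.53) is a little under 2.
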
>
> Thanks,
> Paul.