AES assembler optimizations

public inbox for linux-kernel@vger.kernel.org

* AES assembler optimizations
@ 2004-08-09 13:47 Bob Deblier
  2004-08-09 14:32 ` Patrick McFarland
  0 siblings, 1 reply; 12+ messages in thread
From: Bob Deblier @ 2004-08-09 13:47 UTC (permalink / raw)
  To: linux-kernel

Just picked up on KernelTrap that there were some problems with
optimized AES code; if you wish, I can provide my own LGPL-licensed code
(or I can relicense it for you under GPL), as included in the BeeCrypt
Cryptography Library.

I have generic i586 code and SSE-optimized code available in GNU
assembler format. Latest version is always available on SourceForge
(http://sourceforge.net/cvs/?group_id=8924).

Please cc: me for responses, as I'm not a list subscriber.

Sincerely,

Bob Deblier


* Re: AES assembler optimizations
       [not found] <2riR3-7U5-3@gated-at.bofh.it>
@ 2004-08-09 14:28 ` Andi Kleen
  2004-08-09 16:02   ` Bob Deblier
  0 siblings, 1 reply; 12+ messages in thread
From: Andi Kleen @ 2004-08-09 14:28 UTC (permalink / raw)
  To: Bob Deblier; +Cc: linux-kernel

Bob Deblier <bob.deblier@telenet.be> writes:

> Just picked up on KernelTrap that there were some problems with
> optimized AES code; if you wish, I can provide my own LGPL-licensed code
> (or I can relicense it for you under GPL), as included in the BeeCrypt
> Cryptography Library.
>
> I have generic i586 code and SSE-optimized code available in GNU
> assembler format. Latest version is always available on SourceForge
> (http://sourceforge.net/cvs/?group_id=8924).

Would be interesting.  Do you have any benchmarks for your code?

However I think we would need to get rid of the M4 first. I don't
think it would be a good idea to add that as a kernel build dependency.
Linux kernel assembly normally uses the C preprocessor and modern gas
also has a quite powerful macro facility that is usually good
enough. Any chance to convert the code to one of these?

-Andi


* Re: AES assembler optimizations
  2004-08-09 13:47 Bob Deblier
@ 2004-08-09 14:32 ` Patrick McFarland
  0 siblings, 0 replies; 12+ messages in thread
From: Patrick McFarland @ 2004-08-09 14:32 UTC (permalink / raw)
  To: Bob Deblier; +Cc: linux-kernel

On Mon, 09 Aug 2004 15:47:57 +0200, Bob Deblier <bob.deblier@telenet.be> wrote:
> Just picked up on KernelTrap that there were some problems with
> optimized AES code; if you wish, I can provide my own LGPL-licensed code
> (or I can relicense it for you under GPL), as included in the BeeCrypt
> Cryptography Library.

Well, it turned out that the author who was complaining about
license violations allowed Linus to use his code, AND someone else
rewrote the code so it no longer uses the offending parts. If you want
to help us out, make a kernel patch of your code.

> I have generic i586 code and SSE-optimized code available in GNU
> assembler format. Latest version is always available on SourceForge
> (http://sourceforge.net/cvs/?group_id=8924).
> 
> Please cc: me for responses, as I'm not a list subscriber.
> 
> Sincerely,
> 
> Bob Deblier
> 


-- 
Patrick "Diablo-D3" McFarland || diablod3@gmail.com
"Computer games don't affect kids; I mean if Pac-Man affected us as kids, we'd 
all be running around in darkened rooms, munching magic pills and listening to
repetitive electronic music." -- Kristian Wilson, Nintendo, Inc, 1989


* Re: AES assembler optimizations
  2004-08-09 14:28 ` AES assembler optimizations Andi Kleen
@ 2004-08-09 16:02   ` Bob Deblier
  2004-08-09 17:12     ` Matti Aarnio
                       ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Bob Deblier @ 2004-08-09 16:02 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Mon, 2004-08-09 at 16:28, Andi Kleen wrote:
> Bob Deblier <bob.deblier@telenet.be> writes:
> 
> > Just picked up on KernelTrap that there were some problems with
> > optimized AES code; if you wish, I can provide my own LGPL-licensed code
> > (or I can relicense it for you under GPL), as included in the BeeCrypt
> > Cryptography Library.
> >
> > I have generic i586 code and SSE-optimized code available in GNU
> > assembler format. Latest version is always available on SourceForge
> > (http://sourceforge.net/cvs/?group_id=8924).
> 
> Would be interesting.  Do you have any benchmarks for your code?

BeeCrypt contains benchmarks in the 'tests' subdirectory. Running
'make bench' will execute them. The benchmark results below are for
repeatedly looping over the same 16K block, produced by 'benchbc',
without any tweaks (YMMV):

P4 2400, with MMX:
ECB encrypted 738304 KB in 10.00 seconds = 73823.02 KB/s
CBC encrypted 659456 KB in 10.00 seconds = 65925.82 KB/s
ECB decrypted 765952 KB in 10.00 seconds = 76564.57 KB/s
CBC decrypted 616448 KB in 10.02 seconds = 61546.33 KB/s

P4 2400, plain i386:
ECB encrypted 584704 KB in 10.01 seconds = 58435.34 KB/s
CBC encrypted 570368 KB in 10.01 seconds = 56979.82 KB/s
ECB decrypted 444416 KB in 10.02 seconds = 44357.32 KB/s
CBC decrypted 423936 KB in 10.02 seconds = 42304.76 KB/s

P3 800, with MMX:
ECB encrypted 436224 KB in 10.02 seconds = 43526.64 KB/s
CBC encrypted 308224 KB in 10.02 seconds = 30776.24 KB/s
ECB decrypted 449536 KB in 10.00 seconds = 44935.63 KB/s
CBC decrypted 292864 KB in 10.01 seconds = 29262.99 KB/s

P3 800, plain i386:
ECB encrypted 177152 KB in 10.03 seconds = 17665.74 KB/s
CBC encrypted 160768 KB in 10.03 seconds = 16030.31 KB/s
ECB decrypted 163840 KB in 10.05 seconds = 16300.87 KB/s
CBC decrypted 153600 KB in 10.04 seconds = 15306.43 KB/s
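
As a rough reading aid for the figures above, the ratio of MMX to plain
i386 throughput can be computed directly. This sketch is not part of
BeeCrypt; the dictionary layout and the `speedups` helper are names made
up for illustration, and the numbers are simply copied from the listing.
By this calculation the MMX build comes out roughly 1.2x to 1.7x faster
on the P4 and about 1.9x to 2.8x faster on the P3, for a loop over the
same 16K block.

```python
# Throughput in KB/s, copied from the 'benchbc' listing above.
RESULTS = {
    "P4 2400": {
        "mmx":  {"ECB enc": 73823.02, "CBC enc": 65925.82,
                 "ECB dec": 76564.57, "CBC dec": 61546.33},
        "i386": {"ECB enc": 58435.34, "CBC enc": 56979.82,
                 "ECB dec": 44357.32, "CBC dec": 42304.76},
    },
    "P3 800": {
        "mmx":  {"ECB enc": 43526.64, "CBC enc": 30776.24,
                 "ECB dec": 44935.63, "CBC dec": 29262.99},
        "i386": {"ECB enc": 17665.74, "CBC enc": 16030.31,
                 "ECB dec": 16300.87, "CBC dec": 15306.43},
    },
}


def speedups(results):
    """Return {cpu: {mode: mmx_throughput / i386_throughput}}."""
    out = {}
    for cpu, variants in results.items():
        out[cpu] = {
            mode: variants["mmx"][mode] / variants["i386"][mode]
            for mode in variants["mmx"]
        }
    return out


if __name__ == "__main__":
    for cpu, ratios in speedups(RESULTS).items():
        for mode, ratio in ratios.items():
            print(f"{cpu:8s} {mode}: {ratio:.2f}x")
```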

> However I think we would need to get rid of the M4 first. I don't
> think it would be a good idea to add that as a kernel build dependency.
> Linux kernel assembly normally uses the C preprocessor and modern gas
> also has a quite powerful macro facility that is usually good
> enough. Any chance to convert the code to one of these?

I switched to M4 just to make sure the code works on more platforms than
just Linux with modern tools. It's quite easy to convert - it probably
wouldn't take more than an hour or so to produce a version for one fixed
platform.

Sincerely,

Bob Deblier



* Re: AES assembler optimizations
  2004-08-09 16:02   ` Bob Deblier
@ 2004-08-09 17:12     ` Matti Aarnio
  2004-08-10 19:51       ` H. Peter Anvin
  2004-08-09 18:16     ` Andi Kleen
  2004-08-09 20:20     ` dean gaudet
  2 siblings, 1 reply; 12+ messages in thread
From: Matti Aarnio @ 2004-08-09 17:12 UTC (permalink / raw)
  To: Bob Deblier; +Cc: Andi Kleen, linux-kernel

On Mon, Aug 09, 2004 at 06:02:08PM +0200, Bob Deblier wrote:
...
> BeeCrypt contains benchmarks in the 'tests' subdirectory. Running
> 'make bench' will execute them. The benchmark results below are for
> repeatedly looping over the same 16K block, produced by 'benchbc',
> without any tweaks (YMMV):

Usage of MMX inside the Linux kernel is like the usage of FP inside
the Linux kernel:  Can be done after jumping through complex hoops, BUT
NOT RECOMMENDED... (MMX is intertwined with FP hardware, after all.)

You have to do a lot of MMX work in order to win after amortizing those
necessary hoops.  The RAID-5 code does XOR via MMX code under some
conditions, where that happens to become a win.

> P4 2400, with MMX:
> ECB encrypted 738304 KB in 10.00 seconds = 73823.02 KB/s
> CBC encrypted 659456 KB in 10.00 seconds = 65925.82 KB/s
> ECB decrypted 765952 KB in 10.00 seconds = 76564.57 KB/s
> CBC decrypted 616448 KB in 10.02 seconds = 61546.33 KB/s
> 
> P4 2400, plain i386:
> ECB encrypted 584704 KB in 10.01 seconds = 58435.34 KB/s
> CBC encrypted 570368 KB in 10.01 seconds = 56979.82 KB/s
> ECB decrypted 444416 KB in 10.02 seconds = 44357.32 KB/s
> CBC decrypted 423936 KB in 10.02 seconds = 42304.76 KB/s
...
> Sincerely,
> 
> Bob Deblier

/Matti Aarnio


* Re: AES assembler optimizations
  2004-08-09 16:02   ` Bob Deblier
  2004-08-09 17:12     ` Matti Aarnio
@ 2004-08-09 18:16     ` Andi Kleen
  2004-08-09 20:20     ` dean gaudet
  2 siblings, 0 replies; 12+ messages in thread
From: Andi Kleen @ 2004-08-09 18:16 UTC (permalink / raw)
  To: Bob Deblier; +Cc: linux-kernel

On Mon, Aug 09, 2004 at 06:02:08PM +0200, Bob Deblier wrote:
> On Mon, 2004-08-09 at 16:28, Andi Kleen wrote:
> > Bob Deblier <bob.deblier@telenet.be> writes:
> > 
> > > Just picked up on KernelTrap that there were some problems with
> > > optimized AES code; if you wish, I can provide my own LGPL-licensed code
> > > (or I can relicense it for you under GPL), as included in the BeeCrypt
> > > Cryptography Library.
> > >
> > > I have generic i586 code and SSE-optimized code available in GNU
> > > assembler format. Latest version is always available on SourceForge
> > > (http://sourceforge.net/cvs/?group_id=8924).
> > 
> > Would be interesting.  Do you have any benchmarks for your code?
> 
> BeeCrypt contains benchmarks in the 'tests' subdirectory. Running
> 'make bench' will execute them. The benchmark results below are for
> repeatedly looping over the same 16K block, produced by 'benchbc',
> without any tweaks (YMMV):

I guess a cache-cold benchmark would be more interesting. AFAIK
Linux usually does encryption/decryption on cache-cold buffers.

> P4 2400, with MMX:
> ECB encrypted 738304 KB in 10.00 seconds = 73823.02 KB/s
> CBC encrypted 659456 KB in 10.00 seconds = 65925.82 KB/s
> ECB decrypted 765952 KB in 10.00 seconds = 76564.57 KB/s
> CBC decrypted 616448 KB in 10.02 seconds = 61546.33 KB/s
> 
> P4 2400, plain i386:
> ECB encrypted 584704 KB in 10.01 seconds = 58435.34 KB/s
> CBC encrypted 570368 KB in 10.01 seconds = 56979.82 KB/s
> ECB decrypted 444416 KB in 10.02 seconds = 44357.32 KB/s
> CBC decrypted 423936 KB in 10.02 seconds = 42304.76 KB/s

MMX seems to be fast enough that it's probably a win to use,
even with the overhead of kernel_fpu_begin/end.

It usually annoys the "low latency" people a bit though because
it requires disabling kernel preemption during the computation.

-Andi


* Re: AES assembler optimizations
  2004-08-09 16:02   ` Bob Deblier
  2004-08-09 17:12     ` Matti Aarnio
  2004-08-09 18:16     ` Andi Kleen
@ 2004-08-09 20:20     ` dean gaudet
  2 siblings, 0 replies; 12+ messages in thread
From: dean gaudet @ 2004-08-09 20:20 UTC (permalink / raw)
  To: Bob Deblier; +Cc: Andi Kleen, linux-kernel

On Mon, 9 Aug 2004, Bob Deblier wrote:

> On Mon, 2004-08-09 at 16:28, Andi Kleen wrote:
> > Bob Deblier <bob.deblier@telenet.be> writes:
> >
> > > Just picked up on KernelTrap that there were some problems with
> > > optimized AES code; if you wish, I can provide my own LGPL-licensed code
> > > (or I can relicense it for you under GPL), as included in the BeeCrypt
> > > Cryptography Library.
> > >
> > > I have generic i586 code and SSE-optimized code available in GNU
> > > assembler format. Latest version is always available on SourceForge
> > > (http://sourceforge.net/cvs/?group_id=8924).
> >
> > Would be interesting.  Do you have any benchmarks for your code?
>
> BeeCrypt contains benchmarks in the 'tests' subdirectory. Running
> 'make bench' will execute them. The benchmark results below are for
> repeatedly looping over the same 16K block, produced by 'benchbc',
> without any tweaks (YMMV):
>
> P4 2400, with MMX:
> ECB encrypted 738304 KB in 10.00 seconds = 73823.02 KB/s

the gladman code achieves ~88MB/s for p4 northwood 2.4GHz... without using
mmx.

it looks like your mmx code is 1-2% faster on p-m compared to the gladman
code though -- nice, just a half hour ago i posted wondering if anyone had
taken advantage of the 1 cycle mmx on the p2/p3/p-m processors for doing
the AES XOR steps... and that's what your code does.

unfortunately i don't think that pays off compared to the gladman code on
other x86 processors.

-dean


* Re: AES assembler optimizations
  2004-08-09 17:12     ` Matti Aarnio
@ 2004-08-10 19:51       ` H. Peter Anvin
  2004-08-10 20:36         ` David S. Miller
  0 siblings, 1 reply; 12+ messages in thread
From: H. Peter Anvin @ 2004-08-10 19:51 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <20040809171231.GG2716@mea-ext.zmailer.org>
By author:    Matti Aarnio <matti.aarnio@zmailer.org>
In newsgroup: linux.dev.kernel
> 
> Usage of MMX inside the Linux kernel is like the usage of FP inside
> the Linux kernel:  Can be done after jumping through complex hoops, BUT
> NOT RECOMMENDED... (MMX is intertwined with FP hardware, after all.)
> 
> You have to do a lot of MMX work in order to win after amortizing those
> necessary hoops.  The RAID-5 code does XOR via MMX code under some
> conditions, where that happens to become a win.
> 

It's not really that hard, you just have to have enough work to
amortize it over.  The two metrics are: how much work do you get per
call, and how much work do you get before the next schedule().
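
To make the amortization argument concrete, here is a minimal
back-of-the-envelope model of the per-call metric. It assumes a fixed
cycle cost for saving and restoring FPU/MMX state and a constant
per-block cost for each code path; the function names and every number
in the example are hypothetical, not measurements of any real CPU or
kernel.

```python
def total_cycles(blocks, cycles_per_block, overhead_cycles=0):
    """Cost of processing `blocks` blocks plus a one-time fixed overhead."""
    return overhead_cycles + blocks * cycles_per_block


def break_even_blocks(overhead_cycles, scalar_cycles_per_block,
                      simd_cycles_per_block):
    """Smallest block count at which the SIMD path is strictly cheaper.

    Returns None when the SIMD path saves nothing per block, in which
    case the fixed overhead can never be recovered.
    """
    saving = scalar_cycles_per_block - simd_cycles_per_block
    if saving <= 0:
        return None
    # overhead + n * simd < n * scalar  <=>  n > overhead / saving
    return overhead_cycles // saving + 1


if __name__ == "__main__":
    # Illustrative integers only: 1000 cycles of save/restore overhead,
    # 400 cycles per block without MMX, 300 cycles per block with MMX.
    print(break_even_blocks(1000, 400, 300))
```

Treating the fixed cost as paid once per call is a simplification. If
the saved state can be shared by several calls before the next
schedule(), the same formula applies with the block count summed over
those calls, which is the second metric.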

	-hpa


* Re: AES assembler optimizations
  2004-08-10 19:51       ` H. Peter Anvin
@ 2004-08-10 20:36         ` David S. Miller
  2004-08-11  1:02           ` Paul Mackerras
  2004-08-12 20:18           ` Bill Davidsen
  0 siblings, 2 replies; 12+ messages in thread
From: David S. Miller @ 2004-08-10 20:36 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel

On Tue, 10 Aug 2004 19:51:29 +0000 (UTC)
hpa@zytor.com (H. Peter Anvin) wrote:

> It's not really that hard, you just have to have enough work to
> amortize it over.  The two metrics are: how much work do you get per
> call, and how much work do you get before the next schedule().

Someone might want to investigate using sparc64's FPU saving
scheme on x86, if possible.  It might make the cut-off point
smaller.

On sparc64, we:

1) Always save the full FPU state at context switch time if it
   is active.

2) On entry to a FPU-using kernel routine, we save the FPU if
   it is active.

3) On exit from a FPU-using kernel routine, we do nothing
   except mark the FPU as inactive.

4) FPU-disabled traps by the user restore the state saved
   by #1 or #2

Note that this means FPU state can be recursively saved.
For example, if an FPU memcpy takes an interrupt, and the interrupt
handler invokes an FPU memcpy, it works just fine.

This works extremely well for cases such as:

   The user made the FPU active, but is not going to
   use the FPU for quite some time.  The kernel can use
   the FPU multiple times, and only needs to save state once.

It's worked extremely well in practice.  We store the stack
of FPU states at the end of the thread_struct area.  This
provides better cache behavior than storing it on the local
kernel stack each time the kernel wants to use the FPU (Solaris
on UltraSPARC chooses this method BTW).
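
The four rules can be sketched as a toy state machine. This is only a
model of the bookkeeping described above, not the sparc64 code: the
class and method names are invented for illustration, the "registers"
are a single Python value, and nested kernel use (the stack of saved
states) is deliberately left out.

```python
class FpuModel:
    """Toy model of one thread's FPU bookkeeping (no nesting)."""

    def __init__(self):
        self.active = False    # live registers hold state worth preserving
        self.live = None       # stand-in for the FPU register contents
        self.saved = None      # stand-in for the per-thread save area
        self.save_count = 0    # number of full saves performed

    def _save_if_active(self):
        if self.active:
            self.saved = self.live
            self.save_count += 1

    def _user_trap_if_inactive(self):
        # Rule 4: an FPU-disabled trap restores what rule 1 or 2 saved.
        if not self.active:
            self.live = self.saved
            self.active = True

    def user_write(self, value):
        self._user_trap_if_inactive()
        self.live = value

    def user_read(self):
        self._user_trap_if_inactive()
        return self.live

    def context_switch_out(self):
        # Rule 1: save the full state at context switch time if active.
        self._save_if_active()
        # Assumption of this model: the thread resumes with the FPU inactive.
        self.active = False

    def kernel_fpu_routine(self, scratch):
        # Rule 2: on entry, save only if the FPU is active.
        self._save_if_active()
        self.live = scratch    # the kernel routine clobbers the registers
        # Rule 3: on exit, only mark the FPU inactive; nothing is restored.
        self.active = False


if __name__ == "__main__":
    m = FpuModel()
    m.user_write("user state")
    for _ in range(3):
        m.kernel_fpu_routine("kernel scratch")
    print(m.save_count, m.user_read())
```

In this model the user state is saved on the first kernel use only; the
second and third kernel uses find the FPU inactive and skip the save,
and the user's next FPU access restores the saved state.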


* Re: AES assembler optimizations
  2004-08-10 20:36         ` David S. Miller
@ 2004-08-11  1:02           ` Paul Mackerras
  2004-08-11  1:25             ` David S. Miller
  2004-08-12 20:18           ` Bill Davidsen
  1 sibling, 1 reply; 12+ messages in thread
From: Paul Mackerras @ 2004-08-11  1:02 UTC (permalink / raw)
  To: David S. Miller; +Cc: H. Peter Anvin, linux-kernel

David S. Miller writes:

> On sparc64, we:
> 
> 1) Always save the full FPU state at context switch time if it
>    is active.
> 
> 2) On entry to a FPU-using kernel routine, we save the FPU if
>    it is active.

How is that implemented?  Do you have some magic to make gcc emit a
call to an fpu-save routine in the prolog if the function uses the
FPU?  Or are you only talking about functions written in assembler?

Paul.


* Re: AES assembler optimizations
  2004-08-11  1:02           ` Paul Mackerras
@ 2004-08-11  1:25             ` David S. Miller
  0 siblings, 0 replies; 12+ messages in thread
From: David S. Miller @ 2004-08-11  1:25 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: hpa, linux-kernel

On Wed, 11 Aug 2004 11:02:48 +1000
Paul Mackerras <paulus@samba.org> wrote:

> David S. Miller writes:
> 
> > On sparc64, we:
> > 
> > 1) Always save the full FPU state at context switch time if it
> >    is active.
> > 
> > 2) On entry to a FPU-using kernel routine, we save the FPU if
> >    it is active.
> 
> How is that implemented?  Do you have some magic to make gcc emit a
> call to an fpu-save routine in the prolog if the function uses the
> FPU?  Or are you only talking about functions written in assembler?

All FPU usage is explicit, and VISEntry (if using the full FPU register
set) or VISEntryHalf (if using only the lower 32 registers) calls
are made to enter an FPU usage area, and VISExit/VISExitHalf is invoked
afterwards.

We could do it with traps, but there is no reason when we know
exactly where these places are and the save code is only a
handful of instructions.



* Re: AES assembler optimizations
  2004-08-10 20:36         ` David S. Miller
  2004-08-11  1:02           ` Paul Mackerras
@ 2004-08-12 20:18           ` Bill Davidsen
  1 sibling, 0 replies; 12+ messages in thread
From: Bill Davidsen @ 2004-08-12 20:18 UTC (permalink / raw)
  To: linux-kernel

David S. Miller wrote:

> On sparc64, we:
> 
> 1) Always save the full FPU state at context switch time if it
>    is active.
> 
> 2) On entry to a FPU-using kernel routine, we save the FPU if
>    it is active.
> 
> 3) On exit from a FPU-using kernel routine, we do nothing
>    except mark the FPU as inactive.
> 
> 4) FPU-disabled traps by the user restore the state saved
>    by #1 or #2

Depending on the cost saving of not saving the registers if they haven't 
changed, vs. the time to take the trap and set the FPU active again, it 
might be a win overall, even if you never used FPU in the kernel. Wasn't 
there a change between saving everything and saving FPU only when used 
"back when?" I seem to remember something about that, and the cost of 
the test vs. the cost of just doing the save.

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me

