Linux cryptographic layer development
* Poor RNG performance on Ryzen
@ 2017-07-21  7:12 Oliver Mangold
  2017-07-21  9:26 ` Jan Glauber
  2017-07-21 12:11 ` Jeffrey Walton
  0 siblings, 2 replies; 9+ messages in thread
From: Oliver Mangold @ 2017-07-21  7:12 UTC (permalink / raw)
  To: linux-crypto

Hi,

I was wondering why reading from /dev/urandom is much slower on Ryzen 
than on Intel, and did some analysis. It turns out that the RDRAND 
instruction is at fault, which takes much longer on AMD.

If I read this correctly:

--- drivers/char/random.c ---
        spin_lock_irqsave(&crng->lock, flags);
        if (arch_get_random_long(&v))
                crng->state[14] ^= v;
        chacha20_block(&crng->state[0], out);

one call to RDRAND (with 64-bit operand) is issued per computation of a 
chacha20 block. According to the measurements I did, it seems on Ryzen 
this dominates the time usage:

On Broadwell E5-2650 v4:

---
# dd if=/dev/urandom of=/dev/null bs=1M status=progress
28827451392 bytes (29 GB) copied, 143.290349 s, 201 MB/s
# perf top
   49.88%  [kernel]            [k] chacha20_block
   31.22%  [kernel]            [k] _extract_crng
---

On Ryzen 1800X:

---
# dd if=/dev/urandom of=/dev/null bs=1M status=progress
3169845248 bytes (3,2 GB, 3,0 GiB) copied, 42,0106 s, 75,5 MB/s
# perf top
   76,40%  [kernel]                       [k] _extract_crng
   13,05%  [kernel]                       [k] chacha20_block
---

An easy improvement might be to replace arch_get_random_long() with
arch_get_random_int(), as the state array contains just 32-bit elements,
and (unlike on Intel) 32-bit RDRAND on Ryzen is supposed to be faster by
roughly a factor of 2.
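
To illustrate the point about the element width, here is a minimal
userspace sketch (not kernel code; mix_hw_word is a made-up name) of
what the XOR into a 32-bit state word does with a 64-bit value. It
assumes a 64-bit build where unsigned long is 64 bits wide:

```c
#include <stdint.h>

/* Hypothetical stand-in for "crng->state[14] ^= v" with a 64-bit v.
 * The state word is 32 bits wide, so the result of the XOR is
 * truncated and only the low 32 bits of v have any effect. */
static uint32_t mix_hw_word(uint32_t state_word, uint64_t v)
{
        state_word ^= v;        /* computed in 64 bits, stored in 32 */
        return state_word;
}
```

The upper 32 bits returned by the 64-bit RDRAND are simply discarded,
so nothing is lost by requesting only 32 bits in the first place.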

Best regards,

OM

* Re: Poor RNG performance on Ryzen
  2017-07-21  7:12 Poor RNG performance on Ryzen Oliver Mangold
@ 2017-07-21  9:26 ` Jan Glauber
  2017-07-21 11:39   ` Oliver Mangold
  2017-07-21 12:11 ` Jeffrey Walton
  1 sibling, 1 reply; 9+ messages in thread
From: Jan Glauber @ 2017-07-21  9:26 UTC (permalink / raw)
  To: Oliver Mangold; +Cc: linux-crypto

On Fri, Jul 21, 2017 at 09:12:01AM +0200, Oliver Mangold wrote:
> Hi,
> 
> I was wondering why reading from /dev/urandom is much slower on
> Ryzen than on Intel, and did some analysis. It turns out that the
> RDRAND instruction is at fault, which takes much longer on AMD.
> 
> If I read this correctly:
> 
> --- drivers/char/random.c ---
>         spin_lock_irqsave(&crng->lock, flags);
>         if (arch_get_random_long(&v))
>                 crng->state[14] ^= v;
>         chacha20_block(&crng->state[0], out);
> 
> one call to RDRAND (with 64-bit operand) is issued per computation
> of a chacha20 block. According to the measurements I did, it seems
> on Ryzen this dominates the time usage:
> 
> On Broadwell E5-2650 v4:
> 
> ---
> # dd if=/dev/urandom of=/dev/null bs=1M status=progress
> 28827451392 bytes (29 GB) copied, 143.290349 s, 201 MB/s
> # perf top
>   49.88%  [kernel]            [k] chacha20_block
>   31.22%  [kernel]            [k] _extract_crng
> ---
> 
> On Ryzen 1800X:
> 
> ---
> # dd if=/dev/urandom of=/dev/null bs=1M status=progress
> 3169845248 bytes (3,2 GB, 3,0 GiB) copied, 42,0106 s, 75,5 MB/s
> # perf top
>   76,40%  [kernel]                       [k] _extract_crng
>   13,05%  [kernel]                       [k] chacha20_block
> ---
> 
> An easy improvement might be to replace the usage of
> arch_get_random_long() by arch_get_random_int(), as the state array
> contains just 32-bit elements, and (contrary to Intel) on Ryzen
> 32-bit RDRAND is supposed to be faster by roughly a factor of 2.

Nice catch. How much does the performance improve on Ryzen when you
use arch_get_random_int()?

--Jan

> Best regards,
> 
> OM

* Re: Poor RNG performance on Ryzen
  2017-07-21  9:26 ` Jan Glauber
@ 2017-07-21 11:39   ` Oliver Mangold
  2017-07-21 14:47     ` Theodore Ts'o
  0 siblings, 1 reply; 9+ messages in thread
From: Oliver Mangold @ 2017-07-21 11:39 UTC (permalink / raw)
  To: linux-crypto

On 21.07.2017 11:26, Jan Glauber wrote:
>
> Nice catch. How much does the performance improve on Ryzen when you
> use arch_get_random_int()?

Okay, now I have some results for you:

On Ryzen 1800X (using arch_get_random_int()):

---
# dd if=/dev/urandom of=/dev/null bs=1M status=progress
8751415296 bytes (8,8 GB, 8,2 GiB) copied, 71,0079 s, 123 MB/s
# perf top
    57,37%  [kernel]                    [k] _extract_crng
    26,20%  [kernel]                    [k] chacha20_block
---

Better, but obviously there is still much room for improvement by 
reducing the number of calls to RDRAND.

On Ryzen 1800X (with nordrand kernel option):

---
# dd if=/dev/urandom of=/dev/null bs=1M status=progress
22643998720 bytes (23 GB, 21 GiB) copied, 67,0025 s, 338 MB/s
---
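
As a rough cross-check, the throughput figures can be converted into
time per 64-byte ChaCha20 block. The sketch below rests on several
assumptions: that dd's "MB" means 10^6 bytes, that every 64 bytes of
output cost one pass through _extract_crng(), and that the whole
difference to the nordrand run is due to RDRAND. The helper names are
made up for illustration:

```c
#define CHACHA20_BLOCK_BYTES 64.0       /* one ChaCha20 block is 64 bytes */

/* Convert a dd throughput figure (MB/s, 1 MB = 10^6 bytes) into
 * nanoseconds spent per 64-byte output block. */
static double ns_per_block(double mb_per_s)
{
        return CHACHA20_BLOCK_BYTES / (mb_per_s * 1e6) * 1e9;
}

/* Estimated per-block cost of the RDRAND call: time per block with
 * RDRAND enabled minus time per block with nordrand. */
static double rdrand_cost_ns(double with_rdrand_mb_per_s,
                             double nordrand_mb_per_s)
{
        return ns_per_block(with_rdrand_mb_per_s) -
               ns_per_block(nordrand_mb_per_s);
}
```

With 75.5, 123 and 338 MB/s this gives about 848 ns, 520 ns and 189 ns
per block, i.e. roughly 660 ns attributable to the 64-bit RDRAND and
330 ns to the 32-bit one, which fits the expected factor of 2.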

Here is the patch I used:

--- drivers/char/random.c.orig  2017-07-03 01:07:02.000000000 +0200
+++ drivers/char/random.c       2017-07-21 11:57:40.541677118 +0200
@@ -859,13 +859,14 @@
 static void _extract_crng(struct crng_state *crng,
                           __u8 out[CHACHA20_BLOCK_SIZE])
 {
-        unsigned long v, flags;
+        unsigned int v;
+        unsigned long flags;

         if (crng_init > 1 &&
             time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL))
                 crng_reseed(crng, crng == &primary_crng ? &input_pool : NULL);
         spin_lock_irqsave(&crng->lock, flags);
-        if (arch_get_random_long(&v))
+        if (arch_get_random_int(&v))
                 crng->state[14] ^= v;
         chacha20_block(&crng->state[0], out);
         if (crng->state[12] == 0)

* Re: Poor RNG performance on Ryzen
  2017-07-21  7:12 Poor RNG performance on Ryzen Oliver Mangold
  2017-07-21  9:26 ` Jan Glauber
@ 2017-07-21 12:11 ` Jeffrey Walton
  1 sibling, 0 replies; 9+ messages in thread
From: Jeffrey Walton @ 2017-07-21 12:11 UTC (permalink / raw)
  To: Oliver Mangold; +Cc: Linux Crypto Mailing List

On Fri, Jul 21, 2017 at 3:12 AM, Oliver Mangold <o.mangold@gmail.com> wrote:
> Hi,
>
> I was wondering why reading from /dev/urandom is much slower on Ryzen than
> on Intel, and did some analysis. It turns out that the RDRAND instruction is
> at fault, which takes much longer on AMD.
>
> If I read this correctly:
>
> --- drivers/char/random.c ---
>         spin_lock_irqsave(&crng->lock, flags);
>         if (arch_get_random_long(&v))
>                 crng->state[14] ^= v;
>         chacha20_block(&crng->state[0], out);
>
> one call to RDRAND (with 64-bit operand) is issued per computation of a
> chacha20 block. According to the measurements I did, it seems on Ryzen this
> dominates the time usage:

AMD's implementations of RDRAND and RDSEED are simply slow. This dates
back to Bulldozer. While Intel can produce random numbers at 10
cycles/byte, AMD regularly takes thousands of cycles for one byte.
Bulldozer was measured at 4100 cycles per byte.

It also appears AMD uses the same circuit for random numbers for both
RDRAND and RDSEED. Both are equally fast (or equally slow).

Here are some benchmarks if you are interested:
https://www.cryptopp.com/wiki/RDRAND#Performance .
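
For a rough local measurement from userspace, something like the
following could be used. This is only a sketch, assuming GCC or Clang
on x86-64; all helper names are made up, and timing with clock() is
much coarser than the cycle counters a proper benchmark would use:

```c
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__)
#include <cpuid.h>

/* CPUID leaf 1, ECX bit 30 reports RDRAND support. */
static int cpu_has_rdrand(void)
{
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
                return 0;
        return (ecx >> 30) & 1;
}

/* Execute RDRAND with a 32-bit operand; returns 1 if the carry flag
 * signalled that a value was delivered. */
static int rdrand32(uint32_t *v)
{
        unsigned char ok;

        __asm__ volatile("rdrand %0; setc %1" : "=r"(*v), "=qm"(ok) : : "cc");
        return ok;
}

/* Same with a 64-bit operand. */
static int rdrand64(uint64_t *v)
{
        unsigned char ok;

        __asm__ volatile("rdrand %0; setc %1" : "=r"(*v), "=qm"(ok) : : "cc");
        return ok;
}
#else
/* Other architectures: report RDRAND as unavailable. */
static int cpu_has_rdrand(void) { return 0; }
static int rdrand32(uint32_t *v) { (void)v; return 0; }
static int rdrand64(uint64_t *v) { (void)v; return 0; }
#endif

/* Average CPU time in nanoseconds per RDRAND call over n calls, or
 * -1.0 if RDRAND is not available. use64 selects the operand width. */
static double rdrand_ns_per_call(int use64, long n)
{
        uint32_t v32 = 0;
        uint64_t v64 = 0;
        clock_t c0, c1;
        long i;

        if (!cpu_has_rdrand() || n <= 0)
                return -1.0;
        c0 = clock();
        for (i = 0; i < n; i++) {
                if (use64)
                        rdrand64(&v64);
                else
                        rdrand32(&v32);
        }
        c1 = clock();
        return (double)(c1 - c0) / CLOCKS_PER_SEC * 1e9 / (double)n;
}
```

Comparing rdrand_ns_per_call(0, 1000000) with
rdrand_ns_per_call(1, 1000000) should show whether the 32-bit form is
cheaper on a given CPU.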

Jeff

* Re: Poor RNG performance on Ryzen
  2017-07-21 11:39   ` Oliver Mangold
@ 2017-07-21 14:47     ` Theodore Ts'o
  2017-07-21 14:55       ` Oliver Mangold
  2017-07-21 15:04       ` Gary R Hook
  0 siblings, 2 replies; 9+ messages in thread
From: Theodore Ts'o @ 2017-07-21 14:47 UTC (permalink / raw)
  To: Oliver Mangold; +Cc: linux-crypto

On Fri, Jul 21, 2017 at 01:39:13PM +0200, Oliver Mangold wrote:
> Better, but obviously there is still much room for improvement by reducing
> the number of calls to RDRAND.

Hmm, is there some way we can easily tell we are running on Ryzen?  Or
do we believe this is going to be true for all AMD devices?

I guess we could add some kind of "has_crappy_arch_get_random()" call
which could be defined by arch/x86, and change how aggressively we use
arch_get_random_*() depending on whether has_crappy_arch_get_random()
returns true or not....
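
A minimal sketch of that idea, purely hypothetical: none of these names
exist in the kernel, and the interval of 16 blocks is an arbitrary
assumption. The arch-specific answer is passed in as a flag so that the
sketch stays self-contained:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical policy: how many ChaCha20 blocks to generate between
 * two calls to the hardware RNG. */
#define HW_MIX_INTERVAL_FAST  1  /* cheap instruction: mix on every block */
#define HW_MIX_INTERVAL_SLOW 16  /* expensive instruction: arbitrary value */

/* Returns true if the hardware RNG should be mixed in for this block.
 * slow_hw stands in for what has_crappy_arch_get_random() would report. */
static bool should_mix_hw_random(bool slow_hw, uint64_t block_counter)
{
        uint64_t interval = slow_hw ? HW_MIX_INTERVAL_SLOW
                                    : HW_MIX_INTERVAL_FAST;

        return block_counter % interval == 0;
}
```

The caller would then guard the arch_get_random_*() call with this
check instead of issuing it unconditionally for every block.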

						- Ted

* Re: Poor RNG performance on Ryzen
  2017-07-21 14:47     ` Theodore Ts'o
@ 2017-07-21 14:55       ` Oliver Mangold
  2017-07-22 18:16         ` Theodore Ts'o
  2017-07-21 15:04       ` Gary R Hook
  1 sibling, 1 reply; 9+ messages in thread
From: Oliver Mangold @ 2017-07-21 14:55 UTC (permalink / raw)
  To: linux-crypto

On 21.07.2017 16:47, Theodore Ts'o wrote:
> On Fri, Jul 21, 2017 at 01:39:13PM +0200, Oliver Mangold wrote:
>> Better, but obviously there is still much room for improvement by reducing
>> the number of calls to RDRAND.
> Hmm, is there some way we can easily tell we are running on Ryzen?  Or
> do we believe this is going to be true for all AMD devices?
I would like to note that my first measurement on Broadwell suggests that
the current frequency of RDRAND calls seems to slow things down on Intel 
also (but not as much as on Ryzen).

* Re: Poor RNG performance on Ryzen
  2017-07-21 14:47     ` Theodore Ts'o
  2017-07-21 14:55       ` Oliver Mangold
@ 2017-07-21 15:04       ` Gary R Hook
  1 sibling, 0 replies; 9+ messages in thread
From: Gary R Hook @ 2017-07-21 15:04 UTC (permalink / raw)
  To: Theodore Ts'o, Oliver Mangold; +Cc: linux-crypto@vger.kernel.org

On 07/21/2017 09:47 AM, Theodore Ts'o wrote:
> On Fri, Jul 21, 2017 at 01:39:13PM +0200, Oliver Mangold wrote:
>> Better, but obviously there is still much room for improvement by reducing
>> the number of calls to RDRAND.
>
> Hmm, is there some way we can easily tell we are running on Ryzen?  Or
> do we believe this is going to be true for all AMD devices?
>
> I guess we could add some kind of "has_crappy_arch_get_random()" call
> which could be defined by arch/x86, and change how aggressively we use
> arch_get_random_*() depending on whether has_crappy_arch_get_random()
> returns true or not....

Just a quick note to say that we are aware of the issue, and that it is
being addressed.

* Re: Poor RNG performance on Ryzen
  2017-07-21 14:55       ` Oliver Mangold
@ 2017-07-22 18:16         ` Theodore Ts'o
  2017-07-25  6:20           ` Jan Glauber
  0 siblings, 1 reply; 9+ messages in thread
From: Theodore Ts'o @ 2017-07-22 18:16 UTC (permalink / raw)
  To: Oliver Mangold; +Cc: linux-crypto

On Fri, Jul 21, 2017 at 04:55:12PM +0200, Oliver Mangold wrote:
> On 21.07.2017 16:47, Theodore Ts'o wrote:
> > On Fri, Jul 21, 2017 at 01:39:13PM +0200, Oliver Mangold wrote:
> > > Better, but obviously there is still much room for improvement by reducing
> > > the number of calls to RDRAND.
> > Hmm, is there some way we can easily tell we are running on Ryzen?  Or
> > do we believe this is going to be true for all AMD devices?
> I would like to note that my first measurement on Broadwell suggests that the
> current frequency of RDRAND calls seems to slow things down on Intel also
> (but not as much as on Ryzen).

On my T470 laptop (with an Intel mobile core i7 processor), using your
benchmark, I am getting 136 MB/s, versus your 75 MB/s.  But so what?

More realistically, if we are generating 256 bit keys (so we're
reading from /dev/urandom 32 bytes at a time), it takes 2.24
microseconds per key generation.  What do you get when you run:

dd if=/dev/urandom of=/dev/zero bs=256 count=1000000

Even if on Ryzen it's slower by a factor of two, 5 microseconds per
key generation is pretty fast!  The time to do the Diffie-Hellman
exchange and the RSA operations will probably completely swamp the
time to generate the session key.

And if you think 2.24 or 5 microseconds is too slow for the IV
generation --- then use a userspace ChaCha20 CRNG for that purpose.
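
For reference, the core of such a userspace generator is just the
ChaCha20 block function. The following is a sketch after RFC 7539, not
a vetted CRNG: seeding the key words (for example from getrandom(2)),
reseeding, key erasure and fork handling are all left out, and the
function names are made up:

```c
#include <stdint.h>
#include <string.h>

#define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

/* ChaCha quarter round as specified in RFC 7539, section 2.1. */
static void chacha_quarter_round(uint32_t s[16], int a, int b, int c, int d)
{
        s[a] += s[b]; s[d] ^= s[a]; s[d] = ROTL32(s[d], 16);
        s[c] += s[d]; s[b] ^= s[c]; s[b] = ROTL32(s[b], 12);
        s[a] += s[b]; s[d] ^= s[a]; s[d] = ROTL32(s[d], 8);
        s[c] += s[d]; s[b] ^= s[c]; s[b] = ROTL32(s[b], 7);
}

/* Produce one 64-byte keystream block from a 16-word state
 * (4 constants, 8 key words, 1 counter word, 3 nonce words),
 * then advance the block counter in state[12]. */
static void chacha20_block_user(uint32_t state[16], uint8_t out[64])
{
        uint32_t x[16];
        int i;

        memcpy(x, state, sizeof(x));
        for (i = 0; i < 10; i++) {      /* 20 rounds = 10 double rounds */
                chacha_quarter_round(x, 0, 4,  8, 12);  /* columns */
                chacha_quarter_round(x, 1, 5,  9, 13);
                chacha_quarter_round(x, 2, 6, 10, 14);
                chacha_quarter_round(x, 3, 7, 11, 15);
                chacha_quarter_round(x, 0, 5, 10, 15);  /* diagonals */
                chacha_quarter_round(x, 1, 6, 11, 12);
                chacha_quarter_round(x, 2, 7,  8, 13);
                chacha_quarter_round(x, 3, 4,  9, 14);
        }
        for (i = 0; i < 16; i++) {
                uint32_t w = x[i] + state[i];   /* add the input state */

                out[4 * i + 0] = (uint8_t)w;    /* little-endian output */
                out[4 * i + 1] = (uint8_t)(w >> 8);
                out[4 * i + 2] = (uint8_t)(w >> 16);
                out[4 * i + 3] = (uint8_t)(w >> 24);
        }
        state[12]++;
}
```

Each call yields 64 bytes without entering the kernel, so the cost per
32-byte key is dominated by the block computation itself.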

I'm not really sure I see a real-life operational problem here.

	    	       	      	    - Ted

* Re: Poor RNG performance on Ryzen
  2017-07-22 18:16         ` Theodore Ts'o
@ 2017-07-25  6:20           ` Jan Glauber
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Glauber @ 2017-07-25  6:20 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Oliver Mangold, linux-crypto

On Sat, Jul 22, 2017 at 02:16:41PM -0400, Theodore Ts'o wrote:
> On Fri, Jul 21, 2017 at 04:55:12PM +0200, Oliver Mangold wrote:
> > On 21.07.2017 16:47, Theodore Ts'o wrote:
> > > On Fri, Jul 21, 2017 at 01:39:13PM +0200, Oliver Mangold wrote:
> > > > Better, but obviously there is still much room for improvement by reducing
> > > > the number of calls to RDRAND.
> > > Hmm, is there some way we can easily tell we are running on Ryzen?  Or
> > > do we believe this is going to be true for all AMD devices?
> > I would like to note that my first measurement on Broadwell suggests that the
> > current frequency of RDRAND calls seems to slow things down on Intel also
> > (but not as much as on Ryzen).
> 
> On my T470 laptop (with an Intel mobile core i7 processor), using your
> benchmark, I am getting 136 MB/s, versus your 75 MB/s.  But so what?
> 
> More realistically, if we are generating 256 bit keys (so we're
> reading from /dev/urandom 32 bytes at a time), it takes 2.24
> microseconds per key generation.  What do you get when you run:
> 
> dd if=/dev/urandom of=/dev/zero bs=256 count=1000000
> 
> Even if on Ryzen it's slower by a factor of two, 5 microseconds per
> key generation is pretty fast!  The time to do the Diffie-Hellman
> exchange and the RSA operations will probably completely swamp the
> time to generate the session key.
> 
> And if you think 2.24 or 5 microseconds is too slow for the IV
> generation --- then use a userspace ChaCha20 CRNG for that purpose.
> 
> I'm not really sure I see a real-life operational problem here.
> 
> 	    	       	      	    - Ted

While I agree that it is not an issue if the hardware is just slow, I
still wonder why we read 8 bytes with arch_get_random_long() and
only use half of them, as Oliver pointed out.

If arch_get_random_int() is not slower on Intel, we could use that.
Or am I missing something?

--Jan
