[PATCH 0/2] crypto: xor - defer and optimize boot time benchmark

linux-crypto.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] crypto: xor - defer and optimize boot time benchmark
@ 2020-09-23 18:22 Ard Biesheuvel
  2020-09-23 18:22 ` [PATCH 1/2] crypto: xor - defer load time benchmark to a later time Ard Biesheuvel
  2020-09-23 18:22 ` [PATCH 2/2] crypto: xor - use ktime for template benchmarking Ard Biesheuvel
  0 siblings, 2 replies; 11+ messages in thread
From: Ard Biesheuvel @ 2020-09-23 18:22 UTC (permalink / raw)
  To: linux-crypto; +Cc: herbert, Ard Biesheuvel, Douglas Anderson, David Laight

Doug reports [0] that the XOR boot time benchmark takes more time than
necessary, and runs at a time when there is little room for other
boot time tasks to run concurrently.

Let's fix this by #1 deferring the benchmark, and #2 uses a faster
implementation.

[0] https://lore.kernel.org/linux-arm-kernel/20200921172603.1.Id9450c1d3deef17718bd5368580a3c44895209ee@changeid/

Cc: Douglas Anderson <dianders@chromium.org>
Cc: David Laight <David.Laight@aculab.com>

Ard Biesheuvel (2):
  crypto: xor - defer load time benchmark to a later time
  crypto: xor - use ktime for template benchmarking

 crypto/xor.c | 65 +++++++++++++-------
 1 file changed, 43 insertions(+), 22 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/2] crypto: xor - defer load time benchmark to a later time
  2020-09-23 18:22 [PATCH 0/2] crypto: xor - defer and optimize boot time benchmark Ard Biesheuvel
@ 2020-09-23 18:22 ` Ard Biesheuvel
  2020-09-24  0:38   ` Doug Anderson
  2020-09-23 18:22 ` [PATCH 2/2] crypto: xor - use ktime for template benchmarking Ard Biesheuvel
  1 sibling, 1 reply; 11+ messages in thread
From: Ard Biesheuvel @ 2020-09-23 18:22 UTC (permalink / raw)
  To: linux-crypto; +Cc: herbert, Ard Biesheuvel, Douglas Anderson, David Laight

Currently, the XOR module performs its boot time benchmark at core
initcall time when it is built-in, to ensure that the RAID code can
make use of it when it is built-in as well.

Let's defer this to a later stage during the boot, to avoid impacting
the overall boot time of the system. Instead, just pick an arbitrary
implementation from the list, and use that as the preliminary default.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 crypto/xor.c | 29 +++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/crypto/xor.c b/crypto/xor.c
index ea7349e6ed23..b42c38343733 100644
--- a/crypto/xor.c
+++ b/crypto/xor.c
@@ -54,6 +54,28 @@ EXPORT_SYMBOL(xor_blocks);
 /* Set of all registered templates.  */
 static struct xor_block_template *__initdata template_list;
 
+#ifndef MODULE
+static void __init do_xor_register(struct xor_block_template *tmpl)
+{
+	tmpl->next = template_list;
+	template_list = tmpl;
+}
+
+static int __init register_xor_blocks(void)
+{
+	active_template = XOR_SELECT_TEMPLATE(NULL);
+
+	if (!active_template) {
+#define xor_speed	do_xor_register
+		// register all the templates and pick the first as the default
+		XOR_TRY_TEMPLATES;
+#undef xor_speed
+		active_template = template_list;
+	}
+	return 0;
+}
+#endif
+
 #define BENCH_SIZE (PAGE_SIZE)
 
 static void __init
@@ -129,6 +151,7 @@ calibrate_xor_blocks(void)
 #define xor_speed(templ)	do_xor_speed((templ), b1, b2)
 
 	printk(KERN_INFO "xor: measuring software checksum speed\n");
+	template_list = NULL;
 	XOR_TRY_TEMPLATES;
 	fastest = template_list;
 	for (f = fastest; f; f = f->next)
@@ -150,6 +173,10 @@ static __exit void xor_exit(void) { }
 
 MODULE_LICENSE("GPL");
 
+#ifndef MODULE
 /* when built-in xor.o must initialize before drivers/md/md.o */
-core_initcall(calibrate_xor_blocks);
+core_initcall(register_xor_blocks);
+#endif
+
+module_init(calibrate_xor_blocks);
 module_exit(xor_exit);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/2] crypto: xor - use ktime for template benchmarking
  2020-09-23 18:22 [PATCH 0/2] crypto: xor - defer and optimize boot time benchmark Ard Biesheuvel
  2020-09-23 18:22 ` [PATCH 1/2] crypto: xor - defer load time benchmark to a later time Ard Biesheuvel
@ 2020-09-23 18:22 ` Ard Biesheuvel
  2020-09-24  0:36   ` Doug Anderson
  1 sibling, 1 reply; 11+ messages in thread
From: Ard Biesheuvel @ 2020-09-23 18:22 UTC (permalink / raw)
  To: linux-crypto; +Cc: herbert, Ard Biesheuvel, Douglas Anderson, David Laight

Currently, we use the jiffies counter as a time source, by staring at
it until a HZ period elapses, and then staring at it again and perform
as many XOR operations as we can at the same time until another HZ
period elapses, so that we can calculate the throughput. This takes
longer than necessary, and depends on HZ, which is undesirable, since
HZ is system dependent.

Let's use the ktime interface instead, and use it to time a fixed
number of XOR operations, which can be done much faster, and makes
the time spent depend on the performance level of the system itself,
which is much more reasonable.

On ThunderX2, I get the following results:

Before:

  [72625.956765] xor: measuring software checksum speed
  [72625.993104]    8regs     : 10169.000 MB/sec
  [72626.033099]    32regs    : 12050.000 MB/sec
  [72626.073095]    arm64_neon: 11100.000 MB/sec
  [72626.073097] xor: using function: 32regs (12050.000 MB/sec)

After:

  [ 2503.189696] xor: measuring software checksum speed
  [ 2503.189896]    8regs           : 10556 MB/sec
  [ 2503.190061]    32regs          : 12538 MB/sec
  [ 2503.190250]    arm64_neon      : 11470 MB/sec
  [ 2503.190252] xor: using function: 32regs (12538 MB/sec)

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 crypto/xor.c | 36 ++++++++------------
 1 file changed, 15 insertions(+), 21 deletions(-)

diff --git a/crypto/xor.c b/crypto/xor.c
index b42c38343733..23f98b451b69 100644
--- a/crypto/xor.c
+++ b/crypto/xor.c
@@ -76,49 +76,43 @@ static int __init register_xor_blocks(void)
 }
 #endif
 
-#define BENCH_SIZE (PAGE_SIZE)
+#define BENCH_SIZE	4096
+#define REPS		100
 
 static void __init
 do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2)
 {
 	int speed;
-	unsigned long now, j;
-	int i, count, max;
+	int i, j, count;
+	ktime_t min, start, diff;
 
 	tmpl->next = template_list;
 	template_list = tmpl;
 
 	preempt_disable();
 
-	/*
-	 * Count the number of XORs done during a whole jiffy, and use
-	 * this to calculate the speed of checksumming.  We use a 2-page
-	 * allocation to have guaranteed color L1-cache layout.
-	 */
-	max = 0;
+	min = (ktime_t)S64_MAX;
 	for (i = 0; i < 5; i++) {
-		j = jiffies;
-		count = 0;
-		while ((now = jiffies) == j)
-			cpu_relax();
-		while (time_before(jiffies, now + 1)) {
+		start = ktime_get();
+		for (j = 0; j < REPS; j++) {
 			mb(); /* prevent loop optimzation */
 			tmpl->do_2(BENCH_SIZE, b1, b2);
 			mb();
 			count++;
 			mb();
 		}
-		if (count > max)
-			max = count;
+		diff = ktime_sub(ktime_get(), start);
+		if (diff < min)
+			min = diff;
 	}
 
 	preempt_enable();
 
-	speed = max * (HZ * BENCH_SIZE / 1024);
+	// bytes/ns == GB/s, multiply by 1000 to get MB/s [not MiB/s]
+	speed = (1000 * REPS * BENCH_SIZE) / (u32)min;
 	tmpl->speed = speed;
 
-	printk(KERN_INFO "   %-10s: %5d.%03d MB/sec\n", tmpl->name,
-	       speed / 1000, speed % 1000);
+	printk(KERN_INFO "   %-16s: %5d MB/sec\n", tmpl->name, speed);
 }
 
 static int __init
@@ -158,8 +152,8 @@ calibrate_xor_blocks(void)
 		if (f->speed > fastest->speed)
 			fastest = f;
 
-	printk(KERN_INFO "xor: using function: %s (%d.%03d MB/sec)\n",
-	       fastest->name, fastest->speed / 1000, fastest->speed % 1000);
+	printk(KERN_INFO "xor: using function: %s (%d MB/sec)\n",
+	       fastest->name, fastest->speed);
 
 #undef xor_speed
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] crypto: xor - use ktime for template benchmarking
  2020-09-23 18:22 ` [PATCH 2/2] crypto: xor - use ktime for template benchmarking Ard Biesheuvel
@ 2020-09-24  0:36   ` Doug Anderson
  2020-09-24  8:31     ` Ard Biesheuvel
  0 siblings, 1 reply; 11+ messages in thread
From: Doug Anderson @ 2020-09-24  0:36 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: linux-crypto, herbert, David Laight

Hi,

On Wed, Sep 23, 2020 at 11:22 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> Currently, we use the jiffies counter as a time source, by staring at
> it until a HZ period elapses, and then staring at it again and perform
> as many XOR operations as we can at the same time until another HZ
> period elapses, so that we can calculate the throughput. This takes
> longer than necessary, and depends on HZ, which is undesirable, since
> HZ is system dependent.
>
> Let's use the ktime interface instead, and use it to time a fixed
> number of XOR operations, which can be done much faster, and makes
> the time spent depend on the performance level of the system itself,
> which is much more reasonable.
>
> On ThunderX2, I get the following results:
>
> Before:
>
>   [72625.956765] xor: measuring software checksum speed
>   [72625.993104]    8regs     : 10169.000 MB/sec
>   [72626.033099]    32regs    : 12050.000 MB/sec
>   [72626.073095]    arm64_neon: 11100.000 MB/sec
>   [72626.073097] xor: using function: 32regs (12050.000 MB/sec)
>
> After:
>
>   [ 2503.189696] xor: measuring software checksum speed
>   [ 2503.189896]    8regs           : 10556 MB/sec
>   [ 2503.190061]    32regs          : 12538 MB/sec
>   [ 2503.190250]    arm64_neon      : 11470 MB/sec
>   [ 2503.190252] xor: using function: 32regs (12538 MB/sec)
>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  crypto/xor.c | 36 ++++++++------------
>  1 file changed, 15 insertions(+), 21 deletions(-)
>
> diff --git a/crypto/xor.c b/crypto/xor.c
> index b42c38343733..23f98b451b69 100644
> --- a/crypto/xor.c
> +++ b/crypto/xor.c
> @@ -76,49 +76,43 @@ static int __init register_xor_blocks(void)
>  }
>  #endif
>
> -#define BENCH_SIZE (PAGE_SIZE)
> +#define BENCH_SIZE     4096

I'm curious why the change away from PAGE_SIZE to using 4096.
Everything should work OK w/ using PAGE_SIZE, right?


> +#define REPS           100

Is this sufficient?  I'm not sure what the lower bound on what's
expected of ktime.  If I'm doing the math right, on your system
running 100 loops took 38802 ns in one case, since:

(4096 * 1000 * 100) / 10556 = 38802

If you happen to have your timer backed by a 32 kHz clock, one tick of
ktime could be as much as 31250 ns, right?  Maybe on systems backed
with a 32kHz clock they'll take longer, but it still seems moderately
iffy?  I dunno, maybe I'm just being paranoid.


>  static void __init
>  do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2)
>  {
>         int speed;
> -       unsigned long now, j;
> -       int i, count, max;
> +       int i, j, count;
> +       ktime_t min, start, diff;
>
>         tmpl->next = template_list;
>         template_list = tmpl;
>
>         preempt_disable();
>
> -       /*
> -        * Count the number of XORs done during a whole jiffy, and use
> -        * this to calculate the speed of checksumming.  We use a 2-page
> -        * allocation to have guaranteed color L1-cache layout.
> -        */
> -       max = 0;
> +       min = (ktime_t)S64_MAX;
>         for (i = 0; i < 5; i++) {
> -               j = jiffies;
> -               count = 0;
> -               while ((now = jiffies) == j)
> -                       cpu_relax();
> -               while (time_before(jiffies, now + 1)) {
> +               start = ktime_get();
> +               for (j = 0; j < REPS; j++) {
>                         mb(); /* prevent loop optimzation */
>                         tmpl->do_2(BENCH_SIZE, b1, b2);
>                         mb();
>                         count++;
>                         mb();
>                 }
> -               if (count > max)
> -                       max = count;
> +               diff = ktime_sub(ktime_get(), start);
> +               if (diff < min)
> +                       min = diff;
>         }
>
>         preempt_enable();
>
> -       speed = max * (HZ * BENCH_SIZE / 1024);
> +       // bytes/ns == GB/s, multiply by 1000 to get MB/s [not MiB/s]

Comment is super helpful, thanks!  ...but are folks really OK with
"//" comments these days?


> +       speed = (1000 * REPS * BENCH_SIZE) / (u32)min;

nit: Just for prettiness, maybe call ktime_to_ns(min)?

optional nit: I always think of u32 as something for accessing
hardware.  Maybe "unsigned int"?


>         tmpl->speed = speed;
>
> -       printk(KERN_INFO "   %-10s: %5d.%03d MB/sec\n", tmpl->name,
> -              speed / 1000, speed % 1000);
> +       printk(KERN_INFO "   %-16s: %5d MB/sec\n", tmpl->name, speed);

Since you're touching, switch to pr_info()?


>  }
>
>  static int __init
> @@ -158,8 +152,8 @@ calibrate_xor_blocks(void)
>                 if (f->speed > fastest->speed)
>                         fastest = f;
>
> -       printk(KERN_INFO "xor: using function: %s (%d.%03d MB/sec)\n",
> -              fastest->name, fastest->speed / 1000, fastest->speed % 1000);
> +       printk(KERN_INFO "xor: using function: %s (%d MB/sec)\n",
> +              fastest->name, fastest->speed);

Since you're touching, switch to pr_info()?


-Doug

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] crypto: xor - defer load time benchmark to a later time
  2020-09-23 18:22 ` [PATCH 1/2] crypto: xor - defer load time benchmark to a later time Ard Biesheuvel
@ 2020-09-24  0:38   ` Doug Anderson
  0 siblings, 0 replies; 11+ messages in thread
From: Doug Anderson @ 2020-09-24  0:38 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: linux-crypto, herbert, David Laight

Hi,

On Wed, Sep 23, 2020 at 11:22 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> Currently, the XOR module performs its boot time benchmark at core
> initcall time when it is built-in, to ensure that the RAID code can
> make use of it when it is built-in as well.
>
> Let's defer this to a later stage during the boot, to avoid impacting
> the overall boot time of the system. Instead, just pick an arbitrary
> implementation from the list, and use that as the preliminary default.
>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  crypto/xor.c | 29 +++++++++++++++++++-
>  1 file changed, 28 insertions(+), 1 deletion(-)

Seems like it'll work to me.  Thanks!

Reviewed-by: Douglas Anderson <dianders@chromium.org>

-Doug

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] crypto: xor - use ktime for template benchmarking
  2020-09-24  0:36   ` Doug Anderson
@ 2020-09-24  8:31     ` Ard Biesheuvel
  2020-09-24 15:28       ` Doug Anderson
  0 siblings, 1 reply; 11+ messages in thread
From: Ard Biesheuvel @ 2020-09-24  8:31 UTC (permalink / raw)
  To: Doug Anderson; +Cc: Linux Crypto Mailing List, Herbert Xu, David Laight

On Thu, 24 Sep 2020 at 02:36, Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Wed, Sep 23, 2020 at 11:22 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > Currently, we use the jiffies counter as a time source, by staring at
> > it until a HZ period elapses, and then staring at it again and perform
> > as many XOR operations as we can at the same time until another HZ
> > period elapses, so that we can calculate the throughput. This takes
> > longer than necessary, and depends on HZ, which is undesirable, since
> > HZ is system dependent.
> >
> > Let's use the ktime interface instead, and use it to time a fixed
> > number of XOR operations, which can be done much faster, and makes
> > the time spent depend on the performance level of the system itself,
> > which is much more reasonable.
> >
> > On ThunderX2, I get the following results:
> >
> > Before:
> >
> >   [72625.956765] xor: measuring software checksum speed
> >   [72625.993104]    8regs     : 10169.000 MB/sec
> >   [72626.033099]    32regs    : 12050.000 MB/sec
> >   [72626.073095]    arm64_neon: 11100.000 MB/sec
> >   [72626.073097] xor: using function: 32regs (12050.000 MB/sec)
> >
> > After:
> >
> >   [ 2503.189696] xor: measuring software checksum speed
> >   [ 2503.189896]    8regs           : 10556 MB/sec
> >   [ 2503.190061]    32regs          : 12538 MB/sec
> >   [ 2503.190250]    arm64_neon      : 11470 MB/sec
> >   [ 2503.190252] xor: using function: 32regs (12538 MB/sec)
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> >  crypto/xor.c | 36 ++++++++------------
> >  1 file changed, 15 insertions(+), 21 deletions(-)
> >
> > diff --git a/crypto/xor.c b/crypto/xor.c
> > index b42c38343733..23f98b451b69 100644
> > --- a/crypto/xor.c
> > +++ b/crypto/xor.c
> > @@ -76,49 +76,43 @@ static int __init register_xor_blocks(void)
> >  }
> >  #endif
> >
> > -#define BENCH_SIZE (PAGE_SIZE)
> > +#define BENCH_SIZE     4096
>
> I'm curious why the change away from PAGE_SIZE to using 4096.
> Everything should work OK w/ using PAGE_SIZE, right?
>

Yes, but then the test will take 16x more time on a 64k page size
system for no reason whatsoever.

>
> > +#define REPS           100
>
> Is this sufficient?  I'm not sure what the lower bound on what's
> expected of ktime.  If I'm doing the math right, on your system
> running 100 loops took 38802 ns in one case, since:
>
> (4096 * 1000 * 100) / 10556 = 38802
>
> If you happen to have your timer backed by a 32 kHz clock, one tick of
> ktime could be as much as 31250 ns, right?  Maybe on systems backed
> with a 32kHz clock they'll take longer, but it still seems moderately
> iffy?  I dunno, maybe I'm just being paranoid.
>

No, that is a good point - I didn't really consider that ktime could
be that coarse.

OTOH, we don't really need the full 5 digits of precision either, as
long as we don't misidentify the fastest algorithm.

So I think it should be sufficient to bump this to 800. If my
calculations are correct, this would limit any potential
misidentification of algorithms performing below 10 GB/s to ones that
only deviate in performance up to 10%.

800 * 1000 * 4096 / (10 * 31250) = 10485
800 * 1000 * 4096 / (11 * 31250) = 9532

(10485/9532) / 10485 = 10%

>
> >  static void __init
> >  do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2)
> >  {
> >         int speed;
> > -       unsigned long now, j;
> > -       int i, count, max;
> > +       int i, j, count;
> > +       ktime_t min, start, diff;
> >
> >         tmpl->next = template_list;
> >         template_list = tmpl;
> >
> >         preempt_disable();
> >
> > -       /*
> > -        * Count the number of XORs done during a whole jiffy, and use
> > -        * this to calculate the speed of checksumming.  We use a 2-page
> > -        * allocation to have guaranteed color L1-cache layout.
> > -        */
> > -       max = 0;
> > +       min = (ktime_t)S64_MAX;
> >         for (i = 0; i < 5; i++) {
> > -               j = jiffies;
> > -               count = 0;
> > -               while ((now = jiffies) == j)
> > -                       cpu_relax();
> > -               while (time_before(jiffies, now + 1)) {
> > +               start = ktime_get();
> > +               for (j = 0; j < REPS; j++) {
> >                         mb(); /* prevent loop optimzation */
> >                         tmpl->do_2(BENCH_SIZE, b1, b2);
> >                         mb();
> >                         count++;
> >                         mb();
> >                 }
> > -               if (count > max)
> > -                       max = count;
> > +               diff = ktime_sub(ktime_get(), start);
> > +               if (diff < min)
> > +                       min = diff;
> >         }
> >
> >         preempt_enable();
> >
> > -       speed = max * (HZ * BENCH_SIZE / 1024);
> > +       // bytes/ns == GB/s, multiply by 1000 to get MB/s [not MiB/s]
>
> Comment is super helpful, thanks!  ...but are folks really OK with
> "//" comments these days?
>

Linus said he is fine with it, and even prefers it for single line
comments, so I don't think it's a problem

>
> > +       speed = (1000 * REPS * BENCH_SIZE) / (u32)min;
>
> nit: Just for prettiness, maybe call ktime_to_ns(min)?
>
> optional nit: I always think of u32 as something for accessing
> hardware.  Maybe "unsigned int"?
>

Ack

>
> >         tmpl->speed = speed;
> >
> > -       printk(KERN_INFO "   %-10s: %5d.%03d MB/sec\n", tmpl->name,
> > -              speed / 1000, speed % 1000);
> > +       printk(KERN_INFO "   %-16s: %5d MB/sec\n", tmpl->name, speed);
>
> Since you're touching, switch to pr_info()?
>

Ack (x2)


>
> >  }
> >
> >  static int __init
> > @@ -158,8 +152,8 @@ calibrate_xor_blocks(void)
> >                 if (f->speed > fastest->speed)
> >                         fastest = f;
> >
> > -       printk(KERN_INFO "xor: using function: %s (%d.%03d MB/sec)\n",
> > -              fastest->name, fastest->speed / 1000, fastest->speed % 1000);
> > +       printk(KERN_INFO "xor: using function: %s (%d MB/sec)\n",
> > +              fastest->name, fastest->speed);
>
> Since you're touching, switch to pr_info()?
>
>
> -Doug

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] crypto: xor - use ktime for template benchmarking
  2020-09-24  8:31     ` Ard Biesheuvel
@ 2020-09-24 15:28       ` Doug Anderson
  2020-09-24 15:36         ` Ard Biesheuvel
  0 siblings, 1 reply; 11+ messages in thread
From: Doug Anderson @ 2020-09-24 15:28 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: Linux Crypto Mailing List, Herbert Xu, David Laight

Hi,

On Thu, Sep 24, 2020 at 1:32 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Thu, 24 Sep 2020 at 02:36, Doug Anderson <dianders@chromium.org> wrote:
> >
> > Hi,
> >
> > On Wed, Sep 23, 2020 at 11:22 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > >
> > > Currently, we use the jiffies counter as a time source, by staring at
> > > it until a HZ period elapses, and then staring at it again and perform
> > > as many XOR operations as we can at the same time until another HZ
> > > period elapses, so that we can calculate the throughput. This takes
> > > longer than necessary, and depends on HZ, which is undesirable, since
> > > HZ is system dependent.
> > >
> > > Let's use the ktime interface instead, and use it to time a fixed
> > > number of XOR operations, which can be done much faster, and makes
> > > the time spent depend on the performance level of the system itself,
> > > which is much more reasonable.
> > >
> > > On ThunderX2, I get the following results:
> > >
> > > Before:
> > >
> > >   [72625.956765] xor: measuring software checksum speed
> > >   [72625.993104]    8regs     : 10169.000 MB/sec
> > >   [72626.033099]    32regs    : 12050.000 MB/sec
> > >   [72626.073095]    arm64_neon: 11100.000 MB/sec
> > >   [72626.073097] xor: using function: 32regs (12050.000 MB/sec)
> > >
> > > After:
> > >
> > >   [ 2503.189696] xor: measuring software checksum speed
> > >   [ 2503.189896]    8regs           : 10556 MB/sec
> > >   [ 2503.190061]    32regs          : 12538 MB/sec
> > >   [ 2503.190250]    arm64_neon      : 11470 MB/sec
> > >   [ 2503.190252] xor: using function: 32regs (12538 MB/sec)
> > >
> > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > > ---
> > >  crypto/xor.c | 36 ++++++++------------
> > >  1 file changed, 15 insertions(+), 21 deletions(-)
> > >
> > > diff --git a/crypto/xor.c b/crypto/xor.c
> > > index b42c38343733..23f98b451b69 100644
> > > --- a/crypto/xor.c
> > > +++ b/crypto/xor.c
> > > @@ -76,49 +76,43 @@ static int __init register_xor_blocks(void)
> > >  }
> > >  #endif
> > >
> > > -#define BENCH_SIZE (PAGE_SIZE)
> > > +#define BENCH_SIZE     4096
> >
> > I'm curious why the change away from PAGE_SIZE to using 4096.
> > Everything should work OK w/ using PAGE_SIZE, right?
> >
>
> Yes, but then the test will take 16x more time on a 64k page size
> system for no reason whatsoever.

Ah.  I wasn't sure if using PAGE_SIZE as trying to help avoid
measurement inaccuracies or make it work more like the real thing
(maybe the code that calls XOR often does it PAGE_SIZE at a time?)  I
definitely haven't played with it, but I could sorta imagine it making
some small differences.  Maybe the "prefetch" versions of the XOR ops
have some overhead but the overhead is mitigated with larger
operations?  Though both 4K and 64K are probably large enough and
probably it wouldn't affect the outcome of which algorithm is best...


> > > +#define REPS           100
> >
> > Is this sufficient?  I'm not sure what the lower bound on what's
> > expected of ktime.  If I'm doing the math right, on your system
> > running 100 loops took 38802 ns in one case, since:
> >
> > (4096 * 1000 * 100) / 10556 = 38802
> >
> > If you happen to have your timer backed by a 32 kHz clock, one tick of
> > ktime could be as much as 31250 ns, right?  Maybe on systems backed
> > with a 32kHz clock they'll take longer, but it still seems moderately
> > iffy?  I dunno, maybe I'm just being paranoid.
> >
>
> No, that is a good point - I didn't really consider that ktime could
> be that coarse.
>
> OTOH, we don't really need the full 5 digits of precision either, as
> long as we don't misidentify the fastest algorithm.
>
> So I think it should be sufficient to bump this to 800. If my
> calculations are correct, this would limit any potential
> misidentification of algorithms performing below 10 GB/s to ones that
> only deviate in performance up to 10%.
>
> 800 * 1000 * 4096 / (10 * 31250) = 10485
> 800 * 1000 * 4096 / (11 * 31250) = 9532
>
> (10485/9532) / 10485 = 10%

Seems OK to me.  Seems unlikely that super fast machine are going to
have a 32 kHz backed k_time and the worst case is that we'll pick a
slightly sub-optimal xor, I guess.  I assume your goal is to keep
things fitting in a 32-bit unsigned integer?  Looks like if your use
1000 it also fits...

-Doug

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] crypto: xor - use ktime for template benchmarking
  2020-09-24 15:28       ` Doug Anderson
@ 2020-09-24 15:36         ` Ard Biesheuvel
  2020-09-24 18:22           ` Doug Anderson
  0 siblings, 1 reply; 11+ messages in thread
From: Ard Biesheuvel @ 2020-09-24 15:36 UTC (permalink / raw)
  To: Doug Anderson; +Cc: Linux Crypto Mailing List, Herbert Xu, David Laight

On Thu, 24 Sep 2020 at 17:28, Doug Anderson <dianders@chromium.org> wrote:
>
> On Thu, Sep 24, 2020 at 1:32 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> >
...
> > > > +#define REPS           100
> > >
> > > Is this sufficient?  I'm not sure what the lower bound on what's
> > > expected of ktime.  If I'm doing the math right, on your system
> > > running 100 loops took 38802 ns in one case, since:
> > >
> > > (4096 * 1000 * 100) / 10556 = 38802
> > >
> > > If you happen to have your timer backed by a 32 kHz clock, one tick of
> > > ktime could be as much as 31250 ns, right?  Maybe on systems backed
> > > with a 32kHz clock they'll take longer, but it still seems moderately
> > > iffy?  I dunno, maybe I'm just being paranoid.
> > >
> >
> > No, that is a good point - I didn't really consider that ktime could
> > be that coarse.
> >
> > OTOH, we don't really need the full 5 digits of precision either, as
> > long as we don't misidentify the fastest algorithm.
> >
> > So I think it should be sufficient to bump this to 800. If my
> > calculations are correct, this would limit any potential
> > misidentification of algorithms performing below 10 GB/s to ones that
> > only deviate in performance up to 10%.
> >
> > 800 * 1000 * 4096 / (10 * 31250) = 10485
> > 800 * 1000 * 4096 / (11 * 31250) = 9532
> >
> > (10485/9532) / 10485 = 10%
>
> Seems OK to me.  Seems unlikely that super fast machine are going to
> have a 32 kHz backed k_time and the worst case is that we'll pick a
> slightly sub-optimal xor, I guess.  I assume your goal is to keep
> things fitting in a 32-bit unsigned integer?  Looks like if your use
> 1000 it also fits...
>

Yes, but the larger we make this number, the more time the test will
take on such slow machines. Doing 1000 iterations of 4k on a low-end
machine that only manages 500 MB/s (?) takes a couple of milliseconds,
which is more than it does today when HZ=1000 I think.

Not that 800 vs 1000 makes a great deal of difference in that regard,
just illustrating that there is an upper bound as well.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] crypto: xor - use ktime for template benchmarking
  2020-09-24 15:36         ` Ard Biesheuvel
@ 2020-09-24 18:22           ` Doug Anderson
  2020-09-24 18:40             ` Ard Biesheuvel
  0 siblings, 1 reply; 11+ messages in thread
From: Doug Anderson @ 2020-09-24 18:22 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: Linux Crypto Mailing List, Herbert Xu, David Laight

Hi,

On Thu, Sep 24, 2020 at 8:36 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Thu, 24 Sep 2020 at 17:28, Doug Anderson <dianders@chromium.org> wrote:
> >
> > On Thu, Sep 24, 2020 at 1:32 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > >
> ...
> > > > > +#define REPS           100
> > > >
> > > > Is this sufficient?  I'm not sure what the lower bound on what's
> > > > expected of ktime.  If I'm doing the math right, on your system
> > > > running 100 loops took 38802 ns in one case, since:
> > > >
> > > > (4096 * 1000 * 100) / 10556 = 38802
> > > >
> > > > If you happen to have your timer backed by a 32 kHz clock, one tick of
> > > > ktime could be as much as 31250 ns, right?  Maybe on systems backed
> > > > with a 32kHz clock they'll take longer, but it still seems moderately
> > > > iffy?  I dunno, maybe I'm just being paranoid.
> > > >
> > >
> > > No, that is a good point - I didn't really consider that ktime could
> > > be that coarse.
> > >
> > > OTOH, we don't really need the full 5 digits of precision either, as
> > > long as we don't misidentify the fastest algorithm.
> > >
> > > So I think it should be sufficient to bump this to 800. If my
> > > calculations are correct, this would limit any potential
> > > misidentification of algorithms performing below 10 GB/s to ones that
> > > only deviate in performance up to 10%.
> > >
> > > 800 * 1000 * 4096 / (10 * 31250) = 10485
> > > 800 * 1000 * 4096 / (11 * 31250) = 9532
> > >
> > > (10485/9532) / 10485 = 10%
> >
> > Seems OK to me.  Seems unlikely that super fast machine are going to
> > have a 32 kHz backed k_time and the worst case is that we'll pick a
> > slightly sub-optimal xor, I guess.  I assume your goal is to keep
> > things fitting in a 32-bit unsigned integer?  Looks like if your use
> > 1000 it also fits...
> >
>
> Yes, but the larger we make this number, the more time the test will
> take on such slow machines. Doing 1000 iterations of 4k on a low-end
> machine that only manages 500 MB/s (?) takes a couple of milliseconds,
> which is more than it does today when HZ=1000 I think.
>
> Not that 800 vs 1000 makes a great deal of difference in that regard,
> just illustrating that there is an upper bound as well.

Would it make sense to use some type of hybrid approach?  I know
getting ktime itself has some overhead so you don't want to do it in a
tight loop, but maybe calling it every once in a while would be
acceptable and if it's been more than 500 us then stop early?

-Doug

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] crypto: xor - use ktime for template benchmarking
  2020-09-24 18:22           ` Doug Anderson
@ 2020-09-24 18:40             ` Ard Biesheuvel
  2020-09-24 18:52               ` Doug Anderson
  0 siblings, 1 reply; 11+ messages in thread
From: Ard Biesheuvel @ 2020-09-24 18:40 UTC (permalink / raw)
  To: Doug Anderson; +Cc: Linux Crypto Mailing List, Herbert Xu, David Laight

On Thu, 24 Sep 2020 at 20:22, Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Thu, Sep 24, 2020 at 8:36 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > On Thu, 24 Sep 2020 at 17:28, Doug Anderson <dianders@chromium.org> wrote:
> > >
> > > On Thu, Sep 24, 2020 at 1:32 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > >
> > ...
> > > > > > +#define REPS           100
> > > > >
> > > > > Is this sufficient?  I'm not sure what the lower bound on what's
> > > > > expected of ktime.  If I'm doing the math right, on your system
> > > > > running 100 loops took 38802 ns in one case, since:
> > > > >
> > > > > (4096 * 1000 * 100) / 10556 = 38802
> > > > >
> > > > > If you happen to have your timer backed by a 32 kHz clock, one tick of
> > > > > ktime could be as much as 31250 ns, right?  Maybe on systems backed
> > > > > with a 32kHz clock they'll take longer, but it still seems moderately
> > > > > iffy?  I dunno, maybe I'm just being paranoid.
> > > > >
> > > >
> > > > No, that is a good point - I didn't really consider that ktime could
> > > > be that coarse.
> > > >
> > > > OTOH, we don't really need the full 5 digits of precision either, as
> > > > long as we don't misidentify the fastest algorithm.
> > > >
> > > > So I think it should be sufficient to bump this to 800. If my
> > > > calculations are correct, this would limit any potential
> > > > misidentification of algorithms performing below 10 GB/s to ones that
> > > > only deviate in performance up to 10%.
> > > >
> > > > 800 * 1000 * 4096 / (10 * 31250) = 10485
> > > > 800 * 1000 * 4096 / (11 * 31250) = 9532
> > > >
> > > > (10485/9532) / 10485 = 10%
> > >
> > > Seems OK to me.  Seems unlikely that super fast machine are going to
> > > have a 32 kHz backed k_time and the worst case is that we'll pick a
> > > slightly sub-optimal xor, I guess.  I assume your goal is to keep
> > > things fitting in a 32-bit unsigned integer?  Looks like if your use
> > > 1000 it also fits...
> > >
> >
> > Yes, but the larger we make this number, the more time the test will
> > take on such slow machines. Doing 1000 iterations of 4k on a low-end
> > machine that only manages 500 MB/s (?) takes a couple of milliseconds,
> > which is more than it does today when HZ=1000 I think.
> >
> > Not that 800 vs 1000 makes a great deal of difference in that regard,
> > just illustrating that there is an upper bound as well.
>
> Would it make sense to use some type of hybrid approach?  I know
> getting ktime itself has some overhead so you don't want to do it in a
> tight loop, but maybe calling it every once in a while would be
> acceptable and if it's been more than 500 us then stop early?
>

To be honest, I don't think we don't need complexity like this - if
boot time is critical on such a slow system, you probable won't have
XOR built in, assuming it even makes sense to do software XOR on such
a system.

It is indeed preferable to have a numerator that fits in a U32, and so
1000 would be equally suitable in that regard, but I think I will
stick with 800 if you don't mind.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] crypto: xor - use ktime for template benchmarking
  2020-09-24 18:40             ` Ard Biesheuvel
@ 2020-09-24 18:52               ` Doug Anderson
  0 siblings, 0 replies; 11+ messages in thread
From: Doug Anderson @ 2020-09-24 18:52 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: Linux Crypto Mailing List, Herbert Xu, David Laight

Hi,

On Thu, Sep 24, 2020 at 11:40 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Thu, 24 Sep 2020 at 20:22, Doug Anderson <dianders@chromium.org> wrote:
> >
> > Hi,
> >
> > On Thu, Sep 24, 2020 at 8:36 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > >
> > > On Thu, 24 Sep 2020 at 17:28, Doug Anderson <dianders@chromium.org> wrote:
> > > >
> > > > On Thu, Sep 24, 2020 at 1:32 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > > >
> > > ...
> > > > > > > +#define REPS           100
> > > > > >
> > > > > > Is this sufficient?  I'm not sure what the lower bound on what's
> > > > > > expected of ktime.  If I'm doing the math right, on your system
> > > > > > running 100 loops took 38802 ns in one case, since:
> > > > > >
> > > > > > (4096 * 1000 * 100) / 10556 = 38802
> > > > > >
> > > > > > If you happen to have your timer backed by a 32 kHz clock, one tick of
> > > > > > ktime could be as much as 31250 ns, right?  Maybe on systems backed
> > > > > > with a 32kHz clock they'll take longer, but it still seems moderately
> > > > > > iffy?  I dunno, maybe I'm just being paranoid.
> > > > > >
> > > > >
> > > > > No, that is a good point - I didn't really consider that ktime could
> > > > > be that coarse.
> > > > >
> > > > > OTOH, we don't really need the full 5 digits of precision either, as
> > > > > long as we don't misidentify the fastest algorithm.
> > > > >
> > > > > So I think it should be sufficient to bump this to 800. If my
> > > > > calculations are correct, this would limit any potential
> > > > > misidentification of algorithms performing below 10 GB/s to ones that
> > > > > only deviate in performance up to 10%.
> > > > >
> > > > > 800 * 1000 * 4096 / (10 * 31250) = 10485
> > > > > 800 * 1000 * 4096 / (11 * 31250) = 9532
> > > > >
> > > > > (10485/9532) / 10485 = 10%
> > > >
> > > > Seems OK to me.  Seems unlikely that super fast machine are going to
> > > > have a 32 kHz backed k_time and the worst case is that we'll pick a
> > > > slightly sub-optimal xor, I guess.  I assume your goal is to keep
> > > > things fitting in a 32-bit unsigned integer?  Looks like if your use
> > > > 1000 it also fits...
> > > >
> > >
> > > Yes, but the larger we make this number, the more time the test will
> > > take on such slow machines. Doing 1000 iterations of 4k on a low-end
> > > machine that only manages 500 MB/s (?) takes a couple of milliseconds,
> > > which is more than it does today when HZ=1000 I think.
> > >
> > > Not that 800 vs 1000 makes a great deal of difference in that regard,
> > > just illustrating that there is an upper bound as well.
> >
> > Would it make sense to use some type of hybrid approach?  I know
> > getting ktime itself has some overhead so you don't want to do it in a
> > tight loop, but maybe calling it every once in a while would be
> > acceptable and if it's been more than 500 us then stop early?
> >
>
> To be honest, I don't think we don't need complexity like this - if
> boot time is critical on such a slow system, you probable won't have
> XOR built in, assuming it even makes sense to do software XOR on such
> a system.
>
> It is indeed preferable to have a numerator that fits in a U32, and so
> 1000 would be equally suitable in that regard, but I think I will
> stick with 800 if you don't mind.

OK, fair enough.

-Doug

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-09-24 18:53 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-09-23 18:22 [PATCH 0/2] crypto: xor - defer and optimize boot time benchmark Ard Biesheuvel
2020-09-23 18:22 ` [PATCH 1/2] crypto: xor - defer load time benchmark to a later time Ard Biesheuvel
2020-09-24  0:38   ` Doug Anderson
2020-09-23 18:22 ` [PATCH 2/2] crypto: xor - use ktime for template benchmarking Ard Biesheuvel
2020-09-24  0:36   ` Doug Anderson
2020-09-24  8:31     ` Ard Biesheuvel
2020-09-24 15:28       ` Doug Anderson
2020-09-24 15:36         ` Ard Biesheuvel
2020-09-24 18:22           ` Doug Anderson
2020-09-24 18:40             ` Ard Biesheuvel
2020-09-24 18:52               ` Doug Anderson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).