netdev.vger.kernel.org archive mirror
* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
       [not found]         ` <20131017003421.GA31470@hmsreliant.think-freely.org>
@ 2013-10-17  8:41           ` Ingo Molnar
  2013-10-17 18:19             ` H. Peter Anvin
  2013-10-28 16:01             ` Neil Horman
  0 siblings, 2 replies; 47+ messages in thread
From: Ingo Molnar @ 2013-10-17  8:41 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> > > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
> > > 
> > > > So, early testing results today.  I wrote a test module that allocated a 4k
> > > > buffer, initialized it with random data, and called csum_partial on it 100000
> > > > times, recording the time at the start and end of that loop.  Results on a 2.4
> > > > GHz Intel Xeon processor:
> > > > 
> > > > Without patch: Average execute time for csum_partial was 808 ns
> > > > With patch: Average execute time for csum_partial was 438 ns
> > > 
> > > Impressive, but could you try again with data out of cache?
> > 
> > So I tried your patch on a GRE tunnel and got the following results on a
> > single TCP flow. (short result: no visible difference)
> > 
> > 
> 
> So I went to reproduce these results, but was unable to (due to the fact that I
> only have a pretty jittery network to do testing across at the moment with
> these devices).  So instead I figured that I would go back to just doing
> measurements with the module that I cobbled together, operating under the
> assumption that it would give me accurate, relatively jitter-free results (I've
> attached the module code for reference below).  My results show slightly
> different behavior:
> 
> Base results runs:
> 89417240
> 85170397
> 85208407
> 89422794
> 91645494
> 103655144
> 86063791
> 75647774
> 83502921
> 85847372
> AVG = 875 ns
>
> Prefetch only runs:
> 70962849
> 77555099
> 81898170
> 68249290
> 72636538
> 83039294
> 78561494
> 83393369
> 85317556
> 79570951
> AVG = 781 ns
> 
> Parallel addition only runs:
> 42024233
> 44313064
> 48304416
> 64762297
> 42994259
> 41811628
> 55654282
> 64892958
> 55125582
> 42456403
> AVG = 510 ns
> 
> 
> Both prefetch and parallel addition:
> 41329930
> 40689195
> 61106622
> 46332422
> 49398117
> 52525171
> 49517101
> 61311153
> 43691814
> 49043084
> AVG = 494 ns
> 
> 
> For reference, each of the above large numbers is the number of 
> nanoseconds taken to compute the checksum of a 4kb buffer 100000 times.  
> To get my average results, I ran the test in a loop 10 times, averaged 
> them, and divided by 100000.
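> (Worked through for the base runs: the ten samples sum to 875,581,334 ns,
> i.e. a mean of roughly 87,558,133 ns per run, and 87,558,133 / 100000 is
> about 875 ns per csum_partial call.)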
> 
> Based on these, prefetching is obviously a good improvement, but not 
> as good as parallel execution, and the winner by far is doing both.

But in the actual usecase mentioned the packet data was likely cache-cold, 
it just arrived in the NIC and an IRQ got sent. Your testcase uses a 
super-hot 4K buffer that fits into the L1 cache. So it's apples to 
oranges.

To correctly simulate the workload you'd have to:

 - allocate a buffer larger than your L2 cache.

 - to measure the effects of the prefetches you'd also have to randomize
   the individual buffer positions. See how 'perf bench numa' implements a
   random walk via --data_rand_walk, in tools/perf/bench/numa.c.
   Otherwise the CPU might learn your simplistic stream direction and the
   L2 cache might hw-prefetch your data, interfering with any explicit 
   prefetches the code does. In many real-life usecases packet buffers are
   scattered.
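
Something along these lines (a rough userspace sketch with made-up sizes - 
the real test would of course live in the kernel module) captures the 
cache-cold, randomized access pattern:

	#include <stdint.h>
	#include <stdlib.h>

	#define BUF_SIZE	(8 * 1024 * 1024)	/* assumed: larger than L2 */
	#define CHUNK		1500			/* packet-sized window */
	#define ITERATIONS	100000

	/*
	 * Touch CHUNK bytes at a pseudo-random, 2-byte aligned offset on each
	 * iteration, so neither the L2 nor the hardware prefetcher can learn
	 * a simple streaming direction.
	 */
	static unsigned int sum_random_walk(const uint8_t *buf)
	{
		unsigned int sum = 0;
		size_t off, j;
		int i;

		for (i = 0; i < ITERATIONS; i++) {
			off = ((size_t)rand() % (BUF_SIZE - CHUNK)) & ~(size_t)1;
			for (j = 0; j < CHUNK; j++)
				sum += buf[off + j];
		}
		return sum;
	}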

Also, it would be nice to see standard deviation noise numbers when two 
averages are close to each other, to be able to tell whether differences 
are statistically significant or not.

For example 'perf stat --repeat' will output stddev for you:

  comet:~/tip> perf stat --repeat 20 --null bash -c 'usleep $((RANDOM*10))'

   Performance counter stats for 'bash -c usleep $((RANDOM*10))' (20 runs):

       0.189084480 seconds time elapsed                                          ( +- 11.95% )

The last '+-' percentage is the noise of the measurement.

Also note that you can inspect many cache behavior details of your 
algorithm via perf stat - the -ddd option will give you a laundry list:

  aldebaran:~> perf stat --repeat 20 -ddd perf bench sched messaging
  ...

     Total time: 0.095 [sec]

 Performance counter stats for 'perf bench sched messaging' (20 runs):

       1519.128721 task-clock (msec)         #   12.305 CPUs utilized            ( +-  0.34% )
            22,882 context-switches          #    0.015 M/sec                    ( +-  2.84% )
             3,927 cpu-migrations            #    0.003 M/sec                    ( +-  2.74% )
            16,616 page-faults               #    0.011 M/sec                    ( +-  0.17% )
     2,327,978,366 cycles                    #    1.532 GHz                      ( +-  1.61% ) [36.43%]
     1,715,561,189 stalled-cycles-frontend   #   73.69% frontend cycles idle     ( +-  1.76% ) [38.05%]
       715,715,454 stalled-cycles-backend    #   30.74% backend  cycles idle     ( +-  2.25% ) [39.85%]
     1,253,106,346 instructions              #    0.54  insns per cycle        
                                             #    1.37  stalled cycles per insn  ( +-  1.71% ) [49.68%]
       241,181,126 branches                  #  158.763 M/sec                    ( +-  1.43% ) [47.83%]
         4,232,053 branch-misses             #    1.75% of all branches          ( +-  1.23% ) [48.63%]
       431,907,354 L1-dcache-loads           #  284.313 M/sec                    ( +-  1.00% ) [48.37%]
        20,550,528 L1-dcache-load-misses     #    4.76% of all L1-dcache hits    ( +-  0.82% ) [47.61%]
         7,435,847 LLC-loads                 #    4.895 M/sec                    ( +-  0.94% ) [36.11%]
         2,419,201 LLC-load-misses           #   32.53% of all LL-cache hits     ( +-  2.93% ) [ 7.33%]
       448,638,547 L1-icache-loads           #  295.326 M/sec                    ( +-  2.43% ) [21.75%]
        22,066,490 L1-icache-load-misses     #    4.92% of all L1-icache hits    ( +-  2.54% ) [30.66%]
       475,557,948 dTLB-loads                #  313.047 M/sec                    ( +-  1.96% ) [37.96%]
         6,741,523 dTLB-load-misses          #    1.42% of all dTLB cache hits   ( +-  2.38% ) [37.05%]
     1,268,628,660 iTLB-loads                #  835.103 M/sec                    ( +-  1.75% ) [36.45%]
            74,192 iTLB-load-misses          #    0.01% of all iTLB cache hits   ( +-  2.88% ) [36.19%]
         4,466,526 L1-dcache-prefetches      #    2.940 M/sec                    ( +-  1.61% ) [36.17%]
         2,396,311 L1-dcache-prefetch-misses #    1.577 M/sec                    ( +-  1.55% ) [35.71%]

       0.123459566 seconds time elapsed                                          ( +-  0.58% )

There's also a number of prefetch counters that might be useful:

 aldebaran:~> perf list | grep prefetch
  L1-dcache-prefetches                               [Hardware cache event]
  L1-dcache-prefetch-misses                          [Hardware cache event]
  LLC-prefetches                                     [Hardware cache event]
  LLC-prefetch-misses                                [Hardware cache event]
  node-prefetches                                    [Hardware cache event]
  node-prefetch-misses                               [Hardware cache event]

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-17  8:41           ` Ingo Molnar
@ 2013-10-17 18:19             ` H. Peter Anvin
  2013-10-17 18:48               ` Eric Dumazet
  2013-10-18  6:43               ` Ingo Molnar
  2013-10-28 16:01             ` Neil Horman
  1 sibling, 2 replies; 47+ messages in thread
From: H. Peter Anvin @ 2013-10-17 18:19 UTC (permalink / raw)
  To: Ingo Molnar, Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, x86, netdev

On 10/17/2013 01:41 AM, Ingo Molnar wrote:
> 
> To correctly simulate the workload you'd have to:
> 
>  - allocate a buffer larger than your L2 cache.
> 
>  - to measure the effects of the prefetches you'd also have to randomize
>    the individual buffer positions. See how 'perf bench numa' implements a
>    random walk via --data_rand_walk, in tools/perf/bench/numa.c.
>    Otherwise the CPU might learn your simplistic stream direction and the
>    L2 cache might hw-prefetch your data, interfering with any explicit 
>    prefetches the code does. In many real-life usecases packet buffers are
>    scattered.
> 
> Also, it would be nice to see standard deviation noise numbers when two 
> averages are close to each other, to be able to tell whether differences 
> are statistically significant or not.
> 

Seriously, though, how much does it matter?  All the above seems likely
to do is to drown the signal by adding noise.

If the parallel (threaded) checksumming is faster, which theory says it
should and microbenchmarking confirms, how important are the
macrobenchmarks?

	-hpa

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-17 18:19             ` H. Peter Anvin
@ 2013-10-17 18:48               ` Eric Dumazet
  2013-10-18  6:43               ` Ingo Molnar
  1 sibling, 0 replies; 47+ messages in thread
From: Eric Dumazet @ 2013-10-17 18:48 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Neil Horman, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, x86, netdev

On Thu, 2013-10-17 at 11:19 -0700, H. Peter Anvin wrote:

> Seriously, though, how much does it matter?  All the above seems likely
> to do is to drown the signal by adding noise.

I don't think so.

> 
> If the parallel (threaded) checksumming is faster, which theory says it
> should and microbenchmarking confirms, how important are the
> macrobenchmarks?

Seriously, micro benchmarks are very misleading.

I spent time on this patch, and found no changes on real workloads.

I was excited first, then disappointed.

I hope we will find the real issue, as I really don't care about micro
benchmarks.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-17 18:19             ` H. Peter Anvin
  2013-10-17 18:48               ` Eric Dumazet
@ 2013-10-18  6:43               ` Ingo Molnar
  1 sibling, 0 replies; 47+ messages in thread
From: Ingo Molnar @ 2013-10-18  6:43 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Neil Horman, Eric Dumazet, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, x86, netdev


* H. Peter Anvin <hpa@zytor.com> wrote:

> On 10/17/2013 01:41 AM, Ingo Molnar wrote:
> > 
> > To correctly simulate the workload you'd have to:
> > 
> >  - allocate a buffer larger than your L2 cache.
> > 
> >  - to measure the effects of the prefetches you'd also have to randomize
> >    the individual buffer positions. See how 'perf bench numa' implements a
> >    random walk via --data_rand_walk, in tools/perf/bench/numa.c.
> >    Otherwise the CPU might learn your simplistic stream direction and the
> >    L2 cache might hw-prefetch your data, interfering with any explicit 
> >    prefetches the code does. In many real-life usecases packet buffers are
> >    scattered.
> > 
> > Also, it would be nice to see standard deviation noise numbers when two 
> > averages are close to each other, to be able to tell whether differences 
> > are statistically significant or not.
> 
> 
> Seriously, though, how much does it matter?  All the above seems likely 
> to do is to drown the signal by adding noise.

I think it matters a lot and I don't think it 'adds' noise - it measures 
something else (cache cold behavior - which is the common case for 
first-time csum_partial() use for network packets), which was not measured 
before, and which by its nature has different noise patterns.

I've done many cache-cold measurements myself and had no trouble achieving 
statistically significant results and high precision.

> If the parallel (threaded) checksumming is faster, which theory says it 
> should and microbenchmarking confirms, how important are the 
> macrobenchmarks?

Microbenchmarks can be totally blind to things like the ideal prefetch 
window size (or to whether a prefetch should be done at all: some CPUs will 
throw away prefetches if enough regular fetches arrive).

Also, 'naive' single-threaded algorithms can occasionally be better in the 
cache-cold case because a linear, predictable stream of memory accesses 
might saturate the memory bus better than a somewhat random looking, 
interleaved web of accesses that might not harmonize with buffer depths.

I _think_ if correctly tuned then the parallel algorithm should be better 
in the cache cold case, I just don't know with what parameters (and the 
algorithm has at least one free parameter: the prefetch window size), and 
I don't know how significant the effect is.
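
To make that free parameter concrete, the loop structure would be roughly 
the following (the names and the 64-byte cache line are my assumptions, and 
the summing is simplified - the real code keeps a ones' complement sum):

	#define CACHELINE		64
	#define PREFETCH_STRIDE		(5 * CACHELINE)	/* the tunable window */

	static unsigned long sum_with_prefetch(const unsigned char *buf, size_t len)
	{
		unsigned long sum = 0;
		size_t i, j;

		for (i = 0; i < len; i += CACHELINE) {
			/*
			 * Ask for data PREFETCH_STRIDE bytes ahead of where we
			 * are summing: too small and the loads still stall, too
			 * large and the line may be evicted again (or the hw
			 * prefetcher has already fetched it anyway).
			 */
			if (i + PREFETCH_STRIDE < len)
				__builtin_prefetch(buf + i + PREFETCH_STRIDE);

			for (j = 0; j < CACHELINE && i + j < len; j++)
				sum += buf[i + j];
		}
		return sum;
	}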

Also, more fundamentally, I absolutely detest doing no measurements or 
measuring the wrong thing - IMHO there are too many 'blind' optimization 
commits in the kernel with little to no observational data attached.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-17  8:41           ` Ingo Molnar
  2013-10-17 18:19             ` H. Peter Anvin
@ 2013-10-28 16:01             ` Neil Horman
  2013-10-28 16:20               ` Ingo Molnar
  2013-10-28 16:24               ` Ingo Molnar
  1 sibling, 2 replies; 47+ messages in thread
From: Neil Horman @ 2013-10-28 16:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev



Ingo, et al.-
	Ok, sorry for the delay, here are the test results you've been asking
for.


First, some information about what I did.  I attached the module that I ran this
test with at the bottom of this email.  You'll note that I started using a
module parameter write path to trigger the csum rather than the module load
path.  The latter seemed to be giving me lots of variance in my run times, which
I wanted to eliminate.  I attributed it to the module load mechanism itself, and
by using the parameter write path, I was able to get more consistent results.

First, the run time tests:

I ran this command:
for i in `seq 0 1 3`
do
	 echo $i > /sys/module/csum_test/parameters/module_test_mode
	 perf stat --repeat 20 --null bash -c "echo 1 > /sys/module/csum_test/parameters/test_fire"
done

The for loop allows me to change the module_test_mode, which is tied to a switch
statement in do_csum that selects which checksumming method we use
(base/prefetch/parallel alu/both); a rough sketch of that selection is below,
followed by the results.
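
The do_csum() hook itself isn't included in this mail, so the helper names
below are just placeholders, but structurally the selection looks like this:

	switch (csum_mode) {
	case 0:			/* base: unmodified code path */
		result = do_csum_base(buff, len);
		break;
	case 1:			/* prefetch only (5x64 stride) */
		result = do_csum_prefetch(buff, len);
		break;
	case 2:			/* parallel ALU: two independent add chains */
		result = do_csum_parallel_alu(buff, len);
		break;
	case 3:			/* prefetch + parallel ALU */
	default:
		result = do_csum_both(buff, len);
		break;
	}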


Base:
 Performance counter stats for 'bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       0.093269042 seconds time elapsed                                          ( +-  2.24% )

Prefetch (5x64):
 Performance counter stats for 'bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       0.079440009 seconds time elapsed                                          ( +-  2.29% )

Parallel ALU:
 Performance counter stats for 'bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       0.087666677 seconds time elapsed                                          ( +-  4.01% )

Prefetch + Parallel ALU:
 Performance counter stats for 'bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       0.080758702 seconds time elapsed                                          ( +-  2.34% )

So we can see here that we get about a 1% speedup between the base and the both
(Prefetch + Parallel ALU) case, with prefetch accounting for most of that
speedup.

Looking at the specific cpu counters we get this:


Base:
     Total time: 0.179 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1571.304618 task-clock                #    5.213 CPUs utilized            ( +-  0.45% )
            14,423 context-switches          #    0.009 M/sec                    ( +-  4.28% )
             2,710 cpu-migrations            #    0.002 M/sec                    ( +-  2.83% )
            75,402 page-faults               #    0.048 M/sec                    ( +-  0.07% )
     1,597,349,326 cycles                    #    1.017 GHz                      ( +-  1.74% ) [40.51%]
       104,882,858 stalled-cycles-frontend   #    6.57% frontend cycles idle     ( +-  1.25% ) [40.33%]
     1,043,429,984 stalled-cycles-backend    #   65.32% backend  cycles idle     ( +-  1.25% ) [39.73%]
       868,372,132 instructions              #    0.54  insns per cycle        
                                             #    1.20  stalled cycles per insn  ( +-  1.43% ) [39.88%]
       161,143,820 branches                  #  102.554 M/sec                    ( +-  1.49% ) [39.76%]
         4,348,075 branch-misses             #    2.70% of all branches          ( +-  1.43% ) [39.99%]
       457,042,576 L1-dcache-loads           #  290.868 M/sec                    ( +-  1.25% ) [40.63%]
         8,928,240 L1-dcache-load-misses     #    1.95% of all L1-dcache hits    ( +-  1.26% ) [41.17%]
        15,821,051 LLC-loads                 #   10.069 M/sec                    ( +-  1.56% ) [41.20%]
         4,902,576 LLC-load-misses           #   30.99% of all LL-cache hits     ( +-  1.51% ) [41.36%]
       235,775,688 L1-icache-loads           #  150.051 M/sec                    ( +-  1.39% ) [41.10%]
         3,116,106 L1-icache-load-misses     #    1.32% of all L1-icache hits    ( +-  3.43% ) [40.96%]
       461,315,416 dTLB-loads                #  293.588 M/sec                    ( +-  1.43% ) [41.18%]
           140,280 dTLB-load-misses          #    0.03% of all dTLB cache hits   ( +-  2.30% ) [40.96%]
       236,127,031 iTLB-loads                #  150.275 M/sec                    ( +-  1.63% ) [41.43%]
            46,173 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  3.40% ) [41.11%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [40.82%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [40.37%]

       0.301414024 seconds time elapsed                                          ( +-  0.47% )

Prefetch (5x64):
     Total time: 0.172 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1565.797128 task-clock                #    5.238 CPUs utilized            ( +-  0.46% )
            13,845 context-switches          #    0.009 M/sec                    ( +-  4.20% )
             2,624 cpu-migrations            #    0.002 M/sec                    ( +-  2.72% )
            75,452 page-faults               #    0.048 M/sec                    ( +-  0.08% )
     1,642,106,355 cycles                    #    1.049 GHz                      ( +-  1.33% ) [40.17%]
       107,786,666 stalled-cycles-frontend   #    6.56% frontend cycles idle     ( +-  1.37% ) [39.90%]
     1,065,286,880 stalled-cycles-backend    #   64.87% backend  cycles idle     ( +-  1.59% ) [39.14%]
       888,815,001 instructions              #    0.54  insns per cycle        
                                             #    1.20  stalled cycles per insn  ( +-  1.29% ) [38.92%]
       163,106,907 branches                  #  104.169 M/sec                    ( +-  1.32% ) [38.93%]
         4,333,456 branch-misses             #    2.66% of all branches          ( +-  1.94% ) [39.77%]
       459,779,806 L1-dcache-loads           #  293.639 M/sec                    ( +-  1.60% ) [40.23%]
         8,827,680 L1-dcache-load-misses     #    1.92% of all L1-dcache hits    ( +-  1.77% ) [41.38%]
        15,556,816 LLC-loads                 #    9.935 M/sec                    ( +-  1.76% ) [41.16%]
         4,885,618 LLC-load-misses           #   31.40% of all LL-cache hits     ( +-  1.40% ) [40.84%]
       236,131,778 L1-icache-loads           #  150.806 M/sec                    ( +-  1.32% ) [40.59%]
         3,037,537 L1-icache-load-misses     #    1.29% of all L1-icache hits    ( +-  2.23% ) [41.13%]
       454,835,028 dTLB-loads                #  290.481 M/sec                    ( +-  1.23% ) [41.34%]
           139,907 dTLB-load-misses          #    0.03% of all dTLB cache hits   ( +-  2.18% ) [41.21%]
       236,357,655 iTLB-loads                #  150.950 M/sec                    ( +-  1.31% ) [41.29%]
            46,633 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  2.74% ) [40.67%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [40.16%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [40.09%]

       0.298948767 seconds time elapsed                                          ( +-  0.36% )

Here it appears everything between the two runs is about the same.  We reduced
the number of dcache misses by a small amount (0.03 percentage points), which is
nice, but I'm not sure that would account for the speedup we see in the run time.

Parallel ALU:
     Total time: 0.182 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1553.544876 task-clock                #    5.217 CPUs utilized            ( +-  0.42% )
            14,066 context-switches          #    0.009 M/sec                    ( +-  6.24% )
             2,831 cpu-migrations            #    0.002 M/sec                    ( +-  3.33% )
            75,432 page-faults               #    0.049 M/sec                    ( +-  0.08% )
     1,659,509,743 cycles                    #    1.068 GHz                      ( +-  1.27% ) [40.10%]
       106,466,680 stalled-cycles-frontend   #    6.42% frontend cycles idle     ( +-  1.50% ) [39.98%]
     1,035,481,957 stalled-cycles-backend    #   62.40% backend  cycles idle     ( +-  1.23% ) [39.38%]
       875,104,201 instructions              #    0.53  insns per cycle        
                                             #    1.18  stalled cycles per insn  ( +-  1.30% ) [38.66%]
       160,553,275 branches                  #  103.346 M/sec                    ( +-  1.32% ) [38.85%]
         4,329,119 branch-misses             #    2.70% of all branches          ( +-  1.39% ) [39.59%]
       448,195,116 L1-dcache-loads           #  288.498 M/sec                    ( +-  1.91% ) [41.07%]
         8,632,347 L1-dcache-load-misses     #    1.93% of all L1-dcache hits    ( +-  1.90% ) [41.56%]
        15,143,145 LLC-loads                 #    9.747 M/sec                    ( +-  1.89% ) [41.05%]
         4,698,204 LLC-load-misses           #   31.03% of all LL-cache hits     ( +-  1.03% ) [41.23%]
       224,316,468 L1-icache-loads           #  144.390 M/sec                    ( +-  1.27% ) [41.39%]
         2,902,842 L1-icache-load-misses     #    1.29% of all L1-icache hits    ( +-  2.65% ) [42.60%]
       433,914,588 dTLB-loads                #  279.306 M/sec                    ( +-  1.75% ) [43.07%]
           132,090 dTLB-load-misses          #    0.03% of all dTLB cache hits   ( +-  2.15% ) [43.12%]
       230,701,361 iTLB-loads                #  148.500 M/sec                    ( +-  1.77% ) [43.47%]
            45,562 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  3.76% ) [42.88%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [42.29%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [41.32%]

       0.297758185 seconds time elapsed                                          ( +-  0.40% )

Here it seems the major advantage was backend stall cycles saved (which makes
sense to me).  Since we split the instruction path into two units that could run
independently of each other, we spent less time waiting for prior instructions to
retire.  As a result we dropped two percentage points in our stall number.
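
To make "split into two units" concrete, the idea is roughly the following C
rendering (a simplification - the actual patch does this with adcq chains in
assembly): the two accumulators form independent dependency chains, so two
ALUs can retire adds in parallel rather than serializing on one carry chain.

	/* simplified two-chain ones' complement sum over 64-bit words */
	static unsigned short csum_two_chains(const unsigned long *p, int nwords)
	{
		unsigned long a = 0, b = 0, t;
		int i;

		for (i = 0; i + 1 < nwords; i += 2) {
			t = a + p[i];		/* chain 1 */
			a = t + (t < a);	/* fold the carry back in */
			t = b + p[i + 1];	/* chain 2, independent of chain 1 */
			b = t + (t < b);
		}
		if (i < nwords) {
			t = a + p[i];
			a = t + (t < a);
		}

		/* combine the chains, then fold 64 bits down to 16 */
		t = a + b;
		a = t + (t < a);
		a = (a & 0xffffffff) + (a >> 32);
		a = (a & 0xffffffff) + (a >> 32);
		a = (a & 0xffff) + (a >> 16);
		a = (a & 0xffff) + (a >> 16);
		return (unsigned short)a;
	}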

Prefetch + Parallel ALU:
     Total time: 0.182 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1549.171283 task-clock                #    5.231 CPUs utilized            ( +-  0.50% )
            13,717 context-switches          #    0.009 M/sec                    ( +-  4.32% )
             2,721 cpu-migrations            #    0.002 M/sec                    ( +-  2.47% )
            75,432 page-faults               #    0.049 M/sec                    ( +-  0.07% )
     1,579,140,244 cycles                    #    1.019 GHz                      ( +-  1.71% ) [40.06%]
       103,803,034 stalled-cycles-frontend   #    6.57% frontend cycles idle     ( +-  1.74% ) [39.60%]
     1,016,582,613 stalled-cycles-backend    #   64.38% backend  cycles idle     ( +-  1.79% ) [39.57%]
       881,036,653 instructions              #    0.56  insns per cycle        
                                             #    1.15  stalled cycles per insn  ( +-  1.61% ) [39.29%]
       164,333,010 branches                  #  106.078 M/sec                    ( +-  1.51% ) [39.38%]
         4,385,459 branch-misses             #    2.67% of all branches          ( +-  1.62% ) [40.29%]
       463,987,526 L1-dcache-loads           #  299.507 M/sec                    ( +-  1.52% ) [40.20%]
         8,739,535 L1-dcache-load-misses     #    1.88% of all L1-dcache hits    ( +-  1.95% ) [40.37%]
        15,318,497 LLC-loads                 #    9.888 M/sec                    ( +-  1.80% ) [40.43%]
         4,846,148 LLC-load-misses           #   31.64% of all LL-cache hits     ( +-  1.68% ) [40.59%]
       231,982,874 L1-icache-loads           #  149.746 M/sec                    ( +-  1.43% ) [41.25%]
         3,141,106 L1-icache-load-misses     #    1.35% of all L1-icache hits    ( +-  2.32% ) [41.76%]
       459,688,615 dTLB-loads                #  296.732 M/sec                    ( +-  1.75% ) [41.87%]
           138,667 dTLB-load-misses          #    0.03% of all dTLB cache hits   ( +-  1.97% ) [42.31%]
       235,629,204 iTLB-loads                #  152.100 M/sec                    ( +-  1.40% ) [42.04%]
            46,038 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  2.75% ) [41.20%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [40.77%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [40.27%]

       0.296173305 seconds time elapsed                                          ( +-  0.44% )
Here, with both optimizations, we've reduced both our backend stall cycles and
our dcache miss rate (though our load misses here are higher than they were when
we were just doing parallel ALU execution).  I wonder if the separation of the adcx
path is leading to multiple load requests before the prefetch completes.  I'll
try messing with the stride a bit more to see if I can get some more insight
there.

So there you have it.  I think, looking at this, I can say that it's not as big a
win as my initial measurements were indicating, but still a win.

Thoughts?

Regards
Neil

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/init.h>
#include <linux/moduleparam.h>
#include <linux/rtnetlink.h>
#include <net/rtnetlink.h>
#include <linux/u64_stats_sync.h>

#define BUFSIZ (2*1024*1024)	/* per-buffer size: big enough to spill out of L2 */
#define NBPAGES 16		/* number of buffers the test scatters accesses over */

extern int csum_mode;		/* do_csum() variant selector added by the patch under test */
int module_test_mode = 0;	/* which csum variant to exercise (0-3), set via sysfs */
int test_fire = 0;		/* writing 1 here runs the benchmark */

static int __init csum_init_module(void)
{
        return 0;
}

static void __exit csum_cleanup_module(void)
{
        return;
}

static int set_param_str(const char *val, const struct kernel_param *kp)
{
        int i;
        __wsum sum = 0;
        /*u64 start, end;*/
        void *base, *addrs[NBPAGES];
        u32 rnd, offset;

	
        memset(addrs, 0, sizeof(addrs));
        for (i = 0; i < NBPAGES; i++) {
                addrs[i] = kmalloc_node(BUFSIZ, GFP_KERNEL, 0);
                if (!addrs[i])
                        goto out;
        }

	csum_mode = module_test_mode;

	local_bh_disable();
        /*pr_err("STARTING ITERATIONS on cpu %d\n", smp_processor_id());*/
        /*start = ktime_to_ns(ktime_get());*/

        /* 100000 iterations: each one checksums a 1500 byte, 2-byte aligned
         * window at a pseudo-random offset within a randomly chosen buffer */
        for (i = 0; i < 100000; i++) {
                rnd = prandom_u32();
                base = addrs[rnd % NBPAGES];
                rnd /= NBPAGES;
                offset = rnd % (BUFSIZ - 1500);
                offset &= ~1U;
                sum = csum_partial(base + offset, 1500, sum);
        }
        /*end = ktime_to_ns(ktime_get());*/
        local_bh_enable();

	/*pr_err("COMPLETED 100000 iterations of csum %x in %llu nanosec\n", sum, end - start);*/

	csum_mode = 0;
out:
        for (i = 0; i < NBPAGES; i++)
                kfree(addrs[i]);

        return 0;
}

static int get_param_str(char *buffer, const struct kernel_param *kp)
{
	return sprintf(buffer, "%d\n", test_fire);
}

static struct kernel_param_ops param_ops_str = {
	.set = set_param_str,
	.get = get_param_str,
};

module_param_named(module_test_mode, module_test_mode, int, 0644);
MODULE_PARM_DESC(module_test_mode, "csum test mode");
module_param_cb(test_fire, &param_ops_str, &test_fire, 0644);
module_init(csum_init_module);
module_exit(csum_cleanup_module);
MODULE_LICENSE("GPL");

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 16:01             ` Neil Horman
@ 2013-10-28 16:20               ` Ingo Molnar
  2013-10-28 17:49                 ` Neil Horman
  2013-10-28 16:24               ` Ingo Molnar
  1 sibling, 1 reply; 47+ messages in thread
From: Ingo Molnar @ 2013-10-28 16:20 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> Base:
>        0.093269042 seconds time elapsed                                          ( +-  2.24% )
> Prefetch (5x64):
>        0.079440009 seconds time elapsed                                          ( +-  2.29% )
> Parallel ALU:
>        0.087666677 seconds time elapsed                                          ( +-  4.01% )
> Prefetch + Parallel ALU:
>        0.080758702 seconds time elapsed                                          ( +-  2.34% )
> 
> So we can see here that we get about a 1% speedup between the base 
> and the both (Prefetch + Parallel ALU) case, with prefetch 
> accounting for most of that speedup.

Hm, there's still something strange about these results. So the 
range of the results is 790-930 nsecs. The noise of the measurements 
is 2%-4%, i.e. 20-40 nsecs.

The prefetch-only result itself is the fastest of all - 
statistically equivalent to the prefetch+parallel-ALU result, within 
the noise range.

So if prefetch is enabled, turning on parallel-ALU has no measurable 
effect - which is counter-intuitive. Do you have a 
theory/explanation for that?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 16:01             ` Neil Horman
  2013-10-28 16:20               ` Ingo Molnar
@ 2013-10-28 16:24               ` Ingo Molnar
  2013-10-28 16:49                 ` David Ahern
  2013-10-28 17:46                 ` Neil Horman
  1 sibling, 2 replies; 47+ messages in thread
From: Ingo Molnar @ 2013-10-28 16:24 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> Looking at the specific cpu counters we get this:
> 
> Base:
>      Total time: 0.179 [sec]
> 
>  Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
> 
>        1571.304618 task-clock                #    5.213 CPUs utilized            ( +-  0.45% )
>             14,423 context-switches          #    0.009 M/sec                    ( +-  4.28% )
>              2,710 cpu-migrations            #    0.002 M/sec                    ( +-  2.83% )

Hm, for this second round of measurements were you using 'perf stat 
-a -C ...'?

The most accurate method of measurement for such single-threaded 
workloads is something like:

	taskset 0x1 perf stat -a -C 1 --repeat 20 ...

this will bind your workload to CPU#0, and will do PMU measurements 
only there - without mixing in other CPUs or workloads.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 16:24               ` Ingo Molnar
@ 2013-10-28 16:49                 ` David Ahern
  2013-10-28 17:46                 ` Neil Horman
  1 sibling, 0 replies; 47+ messages in thread
From: David Ahern @ 2013-10-28 16:49 UTC (permalink / raw)
  To: Ingo Molnar, Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On 10/28/13 10:24 AM, Ingo Molnar wrote:

> The most accurate method of measurement for such single-threaded
> workloads is something like:
>
> 	taskset 0x1 perf stat -a -C 1 --repeat 20 ...
>
> this will bind your workload to CPU#0, and will do PMU measurements
> only there - without mixing in other CPUs or workloads.

you can drop the -a if you only want a specific CPU (-C arg). And -C in 
perf is cpu number starting with 0, so in your example above -C 1 means 
cpu1, not cpu0.
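
So for this test the invocation would presumably be something along the
lines of:

	taskset -c 0 perf stat -C 0 --repeat 20 ...

i.e. both the pinning and the PMU measurement pointed at cpu0.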

David

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 16:24               ` Ingo Molnar
  2013-10-28 16:49                 ` David Ahern
@ 2013-10-28 17:46                 ` Neil Horman
  2013-10-28 18:29                   ` Neil Horman
  1 sibling, 1 reply; 47+ messages in thread
From: Neil Horman @ 2013-10-28 17:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Mon, Oct 28, 2013 at 05:24:38PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > Looking at the specific cpu counters we get this:
> > 
> > Base:
> >      Total time: 0.179 [sec]
> > 
> >  Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
> > 
> >        1571.304618 task-clock                #    5.213 CPUs utilized            ( +-  0.45% )
> >             14,423 context-switches          #    0.009 M/sec                    ( +-  4.28% )
> >              2,710 cpu-migrations            #    0.002 M/sec                    ( +-  2.83% )
> 
> Hm, for this second round of measurements were you using 'perf stat 
> -a -C ...'?
> 
> The most accurate method of measurement for such single-threaded 
> workloads is something like:
> 
> 	taskset 0x1 perf stat -a -C 1 --repeat 20 ...
> 
> this will bind your workload to CPU#0, and will do PMU measurements 
> only there - without mixing in other CPUs or workloads.
> 
> Thanks,
> 
> 	Ingo
I wasn't, but I will...
Neil


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 16:20               ` Ingo Molnar
@ 2013-10-28 17:49                 ` Neil Horman
  0 siblings, 0 replies; 47+ messages in thread
From: Neil Horman @ 2013-10-28 17:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Mon, Oct 28, 2013 at 05:20:45PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > Base:
> >        0.093269042 seconds time elapsed                                          ( +-  2.24% )
> > Prefetch (5x64):
> >        0.079440009 seconds time elapsed                                          ( +-  2.29% )
> > Parallel ALU:
> >        0.087666677 seconds time elapsed                                          ( +-  4.01% )
> > Prefetch + Parallel ALU:
> >        0.080758702 seconds time elapsed                                          ( +-  2.34% )
> > 
> > So we can see here that we get about a 1% speedup between the base 
> > and the both (Prefetch + Parallel ALU) case, with prefetch 
> > accounting for most of that speedup.
> 
> Hm, there's still something strange about these results. So the 
> range of the results is 790-930 nsecs. The noise of the measurements 
> is 2%-4%, i.e. 20-40 nsecs.
> 
> The prefetch-only result itself is the fastest of all - 
> statistically equivalent to the prefetch+parallel-ALU result, within 
> the noise range.
> 
> So if prefetch is enabled, turning on parallel-ALU has no measurable 
> effect - which is counter-intuitive. Do you have a 
> theory/explanation for that?
> 
> Thanks,
I mentioned it farther down, loosely theorizing that running with parallel
alu's in conjunction with a prefetch puts more pressure on the load/store unit,
causing stalls while both alu's wait for the L1 cache to fill.  Not sure if that
makes sense, but I did note that in the combined (prefetch+alu) case our data cache
hit rate was somewhat degraded, so I was going to play with the prefetch stride
to see if that fixed the situation.  Regardless, I agree that the lack of improvement
in the combined case is definitely counter-intuitive.

Neil

> 
> 	Ingo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 17:46                 ` Neil Horman
@ 2013-10-28 18:29                   ` Neil Horman
  2013-10-29  8:25                     ` Ingo Molnar
  0 siblings, 1 reply; 47+ messages in thread
From: Neil Horman @ 2013-10-28 18:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Mon, Oct 28, 2013 at 01:46:30PM -0400, Neil Horman wrote:
> On Mon, Oct 28, 2013 at 05:24:38PM +0100, Ingo Molnar wrote:
> > 
> > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > 
> > > Looking at the specific cpu counters we get this:
> > > 
> > > Base:
> > >      Total time: 0.179 [sec]
> > > 
> > >  Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
> > > 
> > >        1571.304618 task-clock                #    5.213 CPUs utilized            ( +-  0.45% )
> > >             14,423 context-switches          #    0.009 M/sec                    ( +-  4.28% )
> > >              2,710 cpu-migrations            #    0.002 M/sec                    ( +-  2.83% )
> > 
> > Hm, for this second round of measurements were you using 'perf stat 
> > -a -C ...'?
> > 
> > The most accurate method of measurement for such single-threaded 
> > workloads is something like:
> > 
> > 	taskset 0x1 perf stat -a -C 1 --repeat 20 ...
> > 
> > this will bind your workload to CPU#0, and will do PMU measurements 
> > only there - without mixing in other CPUs or workloads.
> > 
> > Thanks,
> > 
> > 	Ingo
> I wasn't, but I will...
> Neil
> 

Here's my data for running the same test with taskset restricting execution to
only cpu0.  I'm not quite sure what's going on here, but doing so resulted in a
10x slowdown of the runtime of each iteration, which I can't explain.  As before,
however, both the parallel alu run and the prefetch run resulted in speedups,
but the two together were not in any way additive.  I'm going to keep playing
with the prefetch stride, unless you have an alternate theory.

Regards
Neil


Base:
     Total time: 1.013 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1140.286043 task-clock                #    1.001 CPUs utilized            ( +-  0.65% ) [100.00%]
            48,779 context-switches          #    0.043 M/sec                    ( +- 10.08% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
            75,398 page-faults               #    0.066 M/sec                    ( +-  0.05% )
     2,950,225,491 cycles                    #    2.587 GHz                      ( +-  0.65% ) [16.63%]
       263,349,439 stalled-cycles-frontend   #    8.93% frontend cycles idle     ( +-  1.87% ) [16.70%]
     1,615,723,017 stalled-cycles-backend    #   54.77% backend  cycles idle     ( +-  0.64% ) [16.76%]
     2,168,440,946 instructions              #    0.74  insns per cycle        
                                             #    0.75  stalled cycles per insn  ( +-  0.52% ) [16.76%]
       406,885,149 branches                  #  356.827 M/sec                    ( +-  0.61% ) [16.74%]
        10,099,789 branch-misses             #    2.48% of all branches          ( +-  0.73% ) [16.73%]
     1,138,829,982 L1-dcache-loads           #  998.723 M/sec                    ( +-  0.57% ) [16.71%]
        21,341,094 L1-dcache-load-misses     #    1.87% of all L1-dcache hits    ( +-  1.22% ) [16.69%]
        38,453,870 LLC-loads                 #   33.723 M/sec                    ( +-  1.46% ) [16.67%]
         9,587,987 LLC-load-misses           #   24.93% of all LL-cache hits     ( +-  0.48% ) [16.66%]
       566,241,820 L1-icache-loads           #  496.579 M/sec                    ( +-  0.70% ) [16.65%]
         9,061,979 L1-icache-load-misses     #    1.60% of all L1-icache hits    ( +-  3.39% ) [16.65%]
     1,130,620,555 dTLB-loads                #  991.524 M/sec                    ( +-  0.64% ) [16.64%]
           423,302 dTLB-load-misses          #    0.04% of all dTLB cache hits   ( +-  4.89% ) [16.63%]
       563,371,089 iTLB-loads                #  494.061 M/sec                    ( +-  0.62% ) [16.62%]
           215,406 iTLB-load-misses          #    0.04% of all iTLB cache hits   ( +-  6.97% ) [16.60%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [16.59%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [16.58%]

       1.139598762 seconds time elapsed                                          ( +-  0.65% )

Prefetch:
     Total time: 0.981 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1128.603117 task-clock                #    1.001 CPUs utilized            ( +-  0.66% ) [100.00%]
            45,992 context-switches          #    0.041 M/sec                    ( +-  9.47% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
            75,428 page-faults               #    0.067 M/sec                    ( +-  0.06% )
     2,920,666,228 cycles                    #    2.588 GHz                      ( +-  0.66% ) [16.59%]
       255,998,006 stalled-cycles-frontend   #    8.77% frontend cycles idle     ( +-  1.78% ) [16.67%]
     1,601,090,475 stalled-cycles-backend    #   54.82% backend  cycles idle     ( +-  0.69% ) [16.75%]
     2,164,301,312 instructions              #    0.74  insns per cycle        
                                             #    0.74  stalled cycles per insn  ( +-  0.59% ) [16.78%]
       404,920,928 branches                  #  358.781 M/sec                    ( +-  0.54% ) [16.77%]
        10,025,146 branch-misses             #    2.48% of all branches          ( +-  0.66% ) [16.75%]
     1,133,764,674 L1-dcache-loads           # 1004.573 M/sec                    ( +-  0.47% ) [16.74%]
        21,251,432 L1-dcache-load-misses     #    1.87% of all L1-dcache hits    ( +-  1.01% ) [16.72%]
        38,006,432 LLC-loads                 #   33.676 M/sec                    ( +-  1.56% ) [16.70%]
         9,625,034 LLC-load-misses           #   25.32% of all LL-cache hits     ( +-  0.40% ) [16.68%]
       565,712,289 L1-icache-loads           #  501.250 M/sec                    ( +-  0.57% ) [16.66%]
         8,726,826 L1-icache-load-misses     #    1.54% of all L1-icache hits    ( +-  3.40% ) [16.64%]
     1,130,140,463 dTLB-loads                # 1001.362 M/sec                    ( +-  0.53% ) [16.63%]
           419,645 dTLB-load-misses          #    0.04% of all dTLB cache hits   ( +-  4.44% ) [16.62%]
       560,199,307 iTLB-loads                #  496.365 M/sec                    ( +-  0.51% ) [16.61%]
           213,413 iTLB-load-misses          #    0.04% of all iTLB cache hits   ( +-  6.65% ) [16.59%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [16.56%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [16.54%]

       1.127934534 seconds time elapsed                                          ( +-  0.66% )


Parallel ALU:
     Total time: 0.986 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1131.914738 task-clock                #    1.001 CPUs utilized            ( +-  0.49% ) [100.00%]
            40,807 context-switches          #    0.036 M/sec                    ( +- 10.72% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                    ( +-100.00% ) [100.00%]
            75,329 page-faults               #    0.067 M/sec                    ( +-  0.04% )
     2,929,149,996 cycles                    #    2.588 GHz                      ( +-  0.49% ) [16.58%]
       250,428,558 stalled-cycles-frontend   #    8.55% frontend cycles idle     ( +-  1.75% ) [16.66%]
     1,621,074,968 stalled-cycles-backend    #   55.34% backend  cycles idle     ( +-  0.46% ) [16.73%]
     2,147,405,781 instructions              #    0.73  insns per cycle        
                                             #    0.75  stalled cycles per insn  ( +-  0.56% ) [16.77%]
       401,196,771 branches                  #  354.441 M/sec                    ( +-  0.58% ) [16.76%]
         9,941,701 branch-misses             #    2.48% of all branches          ( +-  0.67% ) [16.74%]
     1,126,651,774 L1-dcache-loads           #  995.350 M/sec                    ( +-  0.50% ) [16.73%]
        21,075,294 L1-dcache-load-misses     #    1.87% of all L1-dcache hits    ( +-  0.96% ) [16.72%]
        37,885,850 LLC-loads                 #   33.471 M/sec                    ( +-  1.10% ) [16.71%]
         9,729,116 LLC-load-misses           #   25.68% of all LL-cache hits     ( +-  0.62% ) [16.69%]
       562,058,495 L1-icache-loads           #  496.556 M/sec                    ( +-  0.54% ) [16.67%]
         8,617,450 L1-icache-load-misses     #    1.53% of all L1-icache hits    ( +-  3.06% ) [16.65%]
     1,121,765,737 dTLB-loads                #  991.034 M/sec                    ( +-  0.57% ) [16.63%]
           388,875 dTLB-load-misses          #    0.03% of all dTLB cache hits   ( +-  4.27% ) [16.62%]
       556,029,393 iTLB-loads                #  491.229 M/sec                    ( +-  0.64% ) [16.61%]
           189,181 iTLB-load-misses          #    0.03% of all iTLB cache hits   ( +-  6.98% ) [16.60%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [16.58%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [16.56%]

       1.131247174 seconds time elapsed                                          ( +-  0.49% )


Both:
     Total time: 0.993 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1130.912197 task-clock                #    1.001 CPUs utilized            ( +-  0.60% ) [100.00%]
            45,859 context-switches          #    0.041 M/sec                    ( +-  9.00% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
            75,398 page-faults               #    0.067 M/sec                    ( +-  0.07% )
     2,926,527,048 cycles                    #    2.588 GHz                      ( +-  0.60% ) [16.60%]
       255,482,254 stalled-cycles-frontend   #    8.73% frontend cycles idle     ( +-  1.62% ) [16.67%]
     1,608,247,364 stalled-cycles-backend    #   54.95% backend  cycles idle     ( +-  0.73% ) [16.74%]
     2,162,135,903 instructions              #    0.74  insns per cycle        
                                             #    0.74  stalled cycles per insn  ( +-  0.46% ) [16.77%]
       403,436,790 branches                  #  356.736 M/sec                    ( +-  0.44% ) [16.76%]
        10,062,572 branch-misses             #    2.49% of all branches          ( +-  0.85% ) [16.75%]
     1,133,889,264 L1-dcache-loads           # 1002.632 M/sec                    ( +-  0.56% ) [16.74%]
        21,460,116 L1-dcache-load-misses     #    1.89% of all L1-dcache hits    ( +-  1.31% ) [16.73%]
        38,070,119 LLC-loads                 #   33.663 M/sec                    ( +-  1.63% ) [16.72%]
         9,593,162 LLC-load-misses           #   25.20% of all LL-cache hits     ( +-  0.42% ) [16.71%]
       562,867,188 L1-icache-loads           #  497.711 M/sec                    ( +-  0.59% ) [16.68%]
         8,472,343 L1-icache-load-misses     #    1.51% of all L1-icache hits    ( +-  3.02% ) [16.64%]
     1,126,997,403 dTLB-loads                #  996.538 M/sec                    ( +-  0.53% ) [16.61%]
           414,900 dTLB-load-misses          #    0.04% of all dTLB cache hits   ( +-  4.12% ) [16.60%]
       561,156,032 iTLB-loads                #  496.198 M/sec                    ( +-  0.56% ) [16.59%]
           212,482 iTLB-load-misses          #    0.04% of all iTLB cache hits   ( +-  6.10% ) [16.58%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [16.57%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [16.56%]

       1.130242195 seconds time elapsed                                          ( +-  0.60% )


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 18:29                   ` Neil Horman
@ 2013-10-29  8:25                     ` Ingo Molnar
  2013-10-29 11:20                       ` Neil Horman
  0 siblings, 1 reply; 47+ messages in thread
From: Ingo Molnar @ 2013-10-29  8:25 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> Here's my data for running the same test with taskset restricting 
> execution to only cpu0.  I'm not quite sure what's going on here, 
> but doing so resulted in a 10x slowdown of the runtime of each 
> iteration, which I can't explain.  As before, however, both the 
> parallel alu run and the prefetch run resulted in speedups, but 
> the two together were not in any way additive.  I'm going to keep 
> playing with the prefetch stride, unless you have an alternate 
> theory.

Could you please cite the exact command-line you used for running 
the test?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29  8:25                     ` Ingo Molnar
@ 2013-10-29 11:20                       ` Neil Horman
  2013-10-29 11:30                         ` Ingo Molnar
  0 siblings, 1 reply; 47+ messages in thread
From: Neil Horman @ 2013-10-29 11:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Tue, Oct 29, 2013 at 09:25:42AM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > Here's my data for running the same test with taskset restricting 
> > execution to only cpu0.  I'm not quite sure what's going on here, 
> > but doing so resulted in a 10x slowdown of the runtime of each 
> > iteration, which I can't explain.  As before, however, both the 
> > parallel alu run and the prefetch run resulted in speedups, but 
> > the two together were not in any way additive.  I'm going to keep 
> > playing with the prefetch stride, unless you have an alternate 
> > theory.
> 
> Could you please cite the exact command-line you used for running 
> the test?
> 
> Thanks,
> 
> 	Ingo
> 

Sure it was this:
for i in `seq 0 1 3`
do
echo $i > /sys/module/csum_test/parameters/module_test_mode
taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
done >> counters.txt 2>&1

where test.sh is:
#!/bin/sh
echo 1 > /sys/module/csum_test/parameters/test_fire


As before, module_test_mode selects a case in a switch statement I added in
do_csum to test one of the 4 csum variants we've been discussing (base, prefetch,
parallel ALU or both), and test_fire is a callback trigger I use in the test
module to run 100000 iterations of a checksum operation.  As you requested, I
ran the above on cpu 0 (-C 0 on perf and -c 0 on taskset), and I removed all irq
affinity to cpu 0.

Regards
Neil

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 11:20                       ` Neil Horman
@ 2013-10-29 11:30                         ` Ingo Molnar
  2013-10-29 11:49                           ` Neil Horman
  0 siblings, 1 reply; 47+ messages in thread
From: Ingo Molnar @ 2013-10-29 11:30 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> Sure it was this:
> for i in `seq 0 1 3`
> do
> echo $i > /sys/module/csum_test/parameters/module_test_mode
> taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> done >> counters.txt 2>&1
> 
> where test.sh is:
> #!/bin/sh
> echo 1 > /sys/module/csum_test/parameters/test_fire

What does '-- /root/test.sh' do?

Unless I'm missing something, the line above will run:

  perf bench sched messaging -- /root/test.sh

which should be equivalent to:

  perf bench sched messaging

i.e. /root/test.sh won't be run.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 11:30                         ` Ingo Molnar
@ 2013-10-29 11:49                           ` Neil Horman
  2013-10-29 12:52                             ` Ingo Molnar
  0 siblings, 1 reply; 47+ messages in thread
From: Neil Horman @ 2013-10-29 11:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > Sure it was this:
> > for i in `seq 0 1 3`
> > do
> > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> > done >> counters.txt 2>&1
> > 
> > where test.sh is:
> > #!/bin/sh
> > echo 1 > /sys/module/csum_test/parameters/test_fire
> 
> What does '-- /root/test.sh' do?
> 
> Unless I'm missing something, the line above will run:
> 
>   perf bench sched messaging -- /root/test.sh
> 
> which should be equivalent to:
> 
>   perf bench sched messaging
> 
> i.e. /root/test.sh won't be run.
> 
According to the perf man page, I'm supposed to be able to use -- to separate
perf command line parameters from the command I want to run.  And it definitely
executed test.sh; I added an echo to stdout in there as a test run and observed
its output get captured in counters.txt.

Neil

> Thanks,
> 
> 	Ingo
> 


* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 11:49                           ` Neil Horman
@ 2013-10-29 12:52                             ` Ingo Molnar
  2013-10-29 13:07                               ` Neil Horman
  2013-10-29 14:12                               ` David Ahern
  0 siblings, 2 replies; 47+ messages in thread
From: Ingo Molnar @ 2013-10-29 12:52 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote:
> > 
> > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > 
> > > Sure it was this:
> > > for i in `seq 0 1 3`
> > > do
> > > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > > taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> > > done >> counters.txt 2>&1
> > > 
> > > where test.sh is:
> > > #!/bin/sh
> > > echo 1 > /sys/module/csum_test/parameters/test_fire
> > 
> > What does '-- /root/test.sh' do?
> > 
> > Unless I'm missing something, the line above will run:
> > 
> >   perf bench sched messaging -- /root/test.sh
> > 
> > which should be equivalent to:
> > 
> >   perf bench sched messaging
> > 
> > i.e. /root/test.sh won't be run.
> 
> According to the perf man page, I'm supposed to be able to use -- 
> to separate perf command line parameters from the command I want 
> to run.  And it definitely executed test.sh; I added an echo to 
> stdout in there as a test run and observed its output get captured 
> in counters.txt.

Well, '--' can be used to delineate the command portion for cases 
where it's ambiguous.

Here it's unambiguous though. This:

  perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh

stops parsing a valid option after the -ddd option, so in theory it 
should execute 'perf bench sched messaging -- /root/test.sh' where 
'-- /root/test.sh' is simply a parameter to 'perf bench' and is thus 
ignored.

The message output you provided seems to suggest that to be the 
case:

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

See how the command executed by perf stat was 'perf bench ...'.

Did you want to run:

  perf stat --repeat 20 -C 0 -ddd /root/test.sh

?

Thanks,

	Ingo


* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 12:52                             ` Ingo Molnar
@ 2013-10-29 13:07                               ` Neil Horman
  2013-10-29 13:11                                 ` Ingo Molnar
  2013-10-29 14:12                               ` David Ahern
  1 sibling, 1 reply; 47+ messages in thread
From: Neil Horman @ 2013-10-29 13:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Tue, Oct 29, 2013 at 01:52:33PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote:
> > > 
> > > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > > 
> > > > Sure it was this:
> > > > for i in `seq 0 1 3`
> > > > do
> > > > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > > > taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> > > > done >> counters.txt 2>&1
> > > > 
> > > > where test.sh is:
> > > > #!/bin/sh
> > > > echo 1 > /sys/module/csum_test/parameters/test_fire
> > > 
> > > What does '-- /root/test.sh' do?
> > > 
> > > Unless I'm missing something, the line above will run:
> > > 
> > >   perf bench sched messaging -- /root/test.sh
> > > 
> > > which should be equivalent to:
> > > 
> > >   perf bench sched messaging
> > > 
> > > i.e. /root/test.sh won't be run.
> > 
> > According to the perf man page, I'm supposed to be able to use -- 
> > to separate perf command line parameters from the command I want 
> > to run.  And it definitely executed test.sh; I added an echo to 
> > stdout in there as a test run and observed its output get captured 
> > in counters.txt.
> 
> Well, '--' can be used to delineate the command portion for cases 
> where it's ambiguous.
> 
> Here it's unambiguous though. This:
> 
>   perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> 
> stops parsing a valid option after the -ddd option, so in theory it 
> should execute 'perf bench sched messaging -- /root/test.sh' where 
> '-- /root/test.sh' is simply a parameter to 'perf bench' and is thus 
> ignored.
> 
> The message output you provided seems to suggest that to be the 
> case:
> 
>  Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
> 
> See how the command executed by perf stat was 'perf bench ...'.
> 
> Did you want to run:
> 
>   perf stat --repeat 20 -C 0 -ddd /root/test.sh
> 
I'm sure it worked properly on my system here, I specifically checked it, but
I'll gladly run it again.  You have to give me an hour as I have a meeting to
run to, but I'll have results shortly.
Neil

> ?
> 
> Thanks,
> 
> 	Ingo
> 


* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 13:07                               ` Neil Horman
@ 2013-10-29 13:11                                 ` Ingo Molnar
  2013-10-29 13:20                                   ` Neil Horman
  2013-10-29 14:17                                   ` Neil Horman
  0 siblings, 2 replies; 47+ messages in thread
From: Ingo Molnar @ 2013-10-29 13:11 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> I'm sure it worked properly on my system here, I specifically 
> checked it, but I'll gladly run it again.  You have to give me an 
> hour as I have a meeting to run to, but I'll have results shortly.

So what I tried to react to was this observation of yours:

> > > Here's my data for running the same test with taskset 
> > > restricting execution to only cpu0.  I'm not quite sure what's 
> > > going on here, but doing so resulted in a 10x slowdown of the 
> > > runtime of each iteration which I can't explain. [...]

A 10x slowdown would be consistent with not running your testcase 
but 'perf bench sched messaging' by accident, or so.

But I was really just guessing wildly here.

Thanks,

	Ingo


* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 13:11                                 ` Ingo Molnar
@ 2013-10-29 13:20                                   ` Neil Horman
  2013-10-29 14:17                                   ` Neil Horman
  1 sibling, 0 replies; 47+ messages in thread
From: Neil Horman @ 2013-10-29 13:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Tue, Oct 29, 2013 at 02:11:49PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > I'm sure it worked properly on my system here, I specifically 
> > checked it, but I'll gladly run it again.  You have to give me an 
> > hour as I have a meeting to run to, but I'll have results shortly.
> 
> So what I tried to react to was this observation of yours:
> 
> > > > Here's my data for running the same test with taskset 
> > > > restricting execution to only cpu0.  I'm not quite sure what's 
> > > > going on here, but doing so resulted in a 10x slowdown of the 
> > > > runtime of each iteration which I can't explain. [...]
> 
> A 10x slowdown would be consistent with not running your testcase 
> but 'perf bench sched messaging' by accident, or so.
> 
> But I was really just guessing wildly here.
> 
> Thanks,
> 
> 	Ingo
> 
Ok, well, I'll run it again in just a bit here.
Neil


* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 12:52                             ` Ingo Molnar
  2013-10-29 13:07                               ` Neil Horman
@ 2013-10-29 14:12                               ` David Ahern
  1 sibling, 0 replies; 47+ messages in thread
From: David Ahern @ 2013-10-29 14:12 UTC (permalink / raw)
  To: Ingo Molnar, Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On 10/29/13 6:52 AM, Ingo Molnar wrote:
>> According to the perf man page, I'm supposed to be able to use --
>> to separate perf command line parameters from the command I want
>> to run.  And it definitely executed test.sh; I added an echo to
>> stdout in there as a test run and observed its output get captured
>> in counters.txt.
>
> Well, '--' can be used to delineate the command portion for cases
> where it's ambiguous.
>
> Here it's unambiguous though. This:
>
>    perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
>
> stops parsing a valid option after the -ddd option, so in theory it
> should execute 'perf bench sched messaging -- /root/test.sh' where
> '-- /root/test.sh' is simply a parameter to 'perf bench' and is thus
> ignored.

Normally with perf commands a workload can be specified to state how 
long to collect perf data. That is not the case for perf-bench.

David


* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 13:11                                 ` Ingo Molnar
  2013-10-29 13:20                                   ` Neil Horman
@ 2013-10-29 14:17                                   ` Neil Horman
  2013-10-29 14:27                                     ` Ingo Molnar
  1 sibling, 1 reply; 47+ messages in thread
From: Neil Horman @ 2013-10-29 14:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Tue, Oct 29, 2013 at 02:11:49PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > I'm sure it worked properly on my system here, I specifically 
> > checked it, but I'll gladly run it again.  You have to give me an 
> > hour as I have a meeting to run to, but I'll have results shortly.
> 
> So what I tried to react to was this observation of yours:
> 
> > > > Here's my data for running the same test with taskset 
> > > > restricting execution to only cpu0.  I'm not quite sure what's 
> > > > going on here, but doing so resulted in a 10x slowdown of the 
> > > > runtime of each iteration which I can't explain. [...]
> 
> A 10x slowdown would be consistent with not running your testcase 
> but 'perf bench sched messaging' by accident, or so.
> 
> But I was really just guessing wildly here.
> 
> Thanks,
> 
> 	Ingo
> 


So, I apologize, you were right.  I was running the test.sh script but perf was
measuring itself.  Using this command line:

for i in `seq 0 1 3`
do
echo $i > /sys/module/csum_test/parameters/module_test_mode; taskset -c 0 perf stat --repeat 20 -C 0 -ddd /root/test.sh
done >> counters.txt 2>&1

with test.sh unchanged I get these results:


Base:
 Performance counter stats for '/root/test.sh' (20 runs):

         56.069737 task-clock                #    1.005 CPUs utilized            ( +-  0.13% ) [100.00%]
                 5 context-switches          #    0.091 K/sec                    ( +-  5.11% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
               366 page-faults               #    0.007 M/sec                    ( +-  0.08% )
       144,264,737 cycles                    #    2.573 GHz                      ( +-  0.23% ) [17.49%]
         9,239,760 stalled-cycles-frontend   #    6.40% frontend cycles idle     ( +-  3.77% ) [19.19%]
       110,635,829 stalled-cycles-backend    #   76.69% backend  cycles idle     ( +-  0.14% ) [19.68%]
        54,291,496 instructions              #    0.38  insns per cycle        
                                             #    2.04  stalled cycles per insn  ( +-  0.14% ) [18.30%]
         5,844,933 branches                  #  104.244 M/sec                    ( +-  2.81% ) [16.58%]
           301,523 branch-misses             #    5.16% of all branches          ( +-  0.12% ) [16.09%]
        23,645,797 L1-dcache-loads           #  421.721 M/sec                    ( +-  0.05% ) [16.06%]
           494,467 L1-dcache-load-misses     #    2.09% of all L1-dcache hits    ( +-  0.06% ) [16.06%]
         2,907,250 LLC-loads                 #   51.851 M/sec                    ( +-  0.08% ) [16.06%]
           486,329 LLC-load-misses           #   16.73% of all LL-cache hits     ( +-  0.11% ) [16.06%]
        11,113,848 L1-icache-loads           #  198.215 M/sec                    ( +-  0.07% ) [16.06%]
             5,378 L1-icache-load-misses     #    0.05% of all L1-icache hits    ( +-  1.34% ) [16.06%]
        23,742,876 dTLB-loads                #  423.453 M/sec                    ( +-  0.06% ) [16.06%]
                 0 dTLB-load-misses          #    0.00% of all dTLB cache hits  [16.06%]
        11,108,538 iTLB-loads                #  198.120 M/sec                    ( +-  0.06% ) [16.06%]
                 0 iTLB-load-misses          #    0.00% of all iTLB cache hits  [16.07%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [16.07%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [16.07%]

       0.055817066 seconds time elapsed                                          ( +-  0.10% )

Prefetch(5*64):
 Performance counter stats for '/root/test.sh' (20 runs):

         47.423853 task-clock                #    1.005 CPUs utilized            ( +-  0.62% ) [100.00%]
                 6 context-switches          #    0.116 K/sec                    ( +-  4.27% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
               368 page-faults               #    0.008 M/sec                    ( +-  0.07% )
       120,423,860 cycles                    #    2.539 GHz                      ( +-  0.85% ) [14.23%]
         8,555,632 stalled-cycles-frontend   #    7.10% frontend cycles idle     ( +-  0.56% ) [16.23%]
        87,438,794 stalled-cycles-backend    #   72.61% backend  cycles idle     ( +-  1.13% ) [18.33%]
        55,039,308 instructions              #    0.46  insns per cycle        
                                             #    1.59  stalled cycles per insn  ( +-  0.05% ) [18.98%]
         5,619,298 branches                  #  118.491 M/sec                    ( +-  2.32% ) [18.98%]
           303,686 branch-misses             #    5.40% of all branches          ( +-  0.08% ) [18.98%]
        26,577,868 L1-dcache-loads           #  560.432 M/sec                    ( +-  0.05% ) [18.98%]
         1,323,630 L1-dcache-load-misses     #    4.98% of all L1-dcache hits    ( +-  0.14% ) [18.98%]
         3,426,016 LLC-loads                 #   72.242 M/sec                    ( +-  0.05% ) [18.98%]
         1,304,201 LLC-load-misses           #   38.07% of all LL-cache hits     ( +-  0.13% ) [18.98%]
        13,190,316 L1-icache-loads           #  278.137 M/sec                    ( +-  0.21% ) [18.98%]
            33,881 L1-icache-load-misses     #    0.26% of all L1-icache hits    ( +-  4.63% ) [17.93%]
        25,366,685 dTLB-loads                #  534.893 M/sec                    ( +-  0.24% ) [15.93%]
               734 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  8.40% ) [13.94%]
        13,314,660 iTLB-loads                #  280.759 M/sec                    ( +-  0.05% ) [12.97%]
                 0 iTLB-load-misses          #    0.00% of all iTLB cache hits  [12.98%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [12.98%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [12.87%]

       0.047194407 seconds time elapsed                                          ( +-  0.62% )

Parallel ALU:
 Performance counter stats for '/root/test.sh' (20 runs):

         57.395070 task-clock                #    1.004 CPUs utilized            ( +-  1.71% ) [100.00%]
                 5 context-switches          #    0.092 K/sec                    ( +-  3.90% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
               367 page-faults               #    0.006 M/sec                    ( +-  0.10% )
       143,232,396 cycles                    #    2.496 GHz                      ( +-  1.68% ) [16.73%]
         7,299,843 stalled-cycles-frontend   #    5.10% frontend cycles idle     ( +-  2.69% ) [18.47%]
       109,485,845 stalled-cycles-backend    #   76.44% backend  cycles idle     ( +-  2.01% ) [19.99%]
        56,867,669 instructions              #    0.40  insns per cycle        
                                             #    1.93  stalled cycles per insn  ( +-  0.22% ) [19.49%]
         6,646,323 branches                  #  115.800 M/sec                    ( +-  2.15% ) [17.75%]
           304,671 branch-misses             #    4.58% of all branches          ( +-  0.37% ) [16.23%]
        23,612,428 L1-dcache-loads           #  411.402 M/sec                    ( +-  0.05% ) [15.95%]
           518,988 L1-dcache-load-misses     #    2.20% of all L1-dcache hits    ( +-  0.11% ) [15.95%]
         2,934,119 LLC-loads                 #   51.121 M/sec                    ( +-  0.06% ) [15.95%]
           509,027 LLC-load-misses           #   17.35% of all LL-cache hits     ( +-  0.15% ) [15.95%]
        11,103,819 L1-icache-loads           #  193.463 M/sec                    ( +-  0.08% ) [15.95%]
             5,381 L1-icache-load-misses     #    0.05% of all L1-icache hits    ( +-  2.45% ) [15.95%]
        23,727,164 dTLB-loads                #  413.401 M/sec                    ( +-  0.06% ) [15.95%]
                 0 dTLB-load-misses          #    0.00% of all dTLB cache hits  [15.95%]
        11,104,205 iTLB-loads                #  193.470 M/sec                    ( +-  0.06% ) [15.95%]
                 0 iTLB-load-misses          #    0.00% of all iTLB cache hits  [15.95%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [15.95%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [15.96%]

       0.057151644 seconds time elapsed                                          ( +-  1.69% )

Both:
 Performance counter stats for '/root/test.sh' (20 runs):

         48.377833 task-clock                #    1.005 CPUs utilized            ( +-  0.67% ) [100.00%]
                 5 context-switches          #    0.113 K/sec                    ( +-  3.88% ) [100.00%]
                 0 cpu-migrations            #    0.001 K/sec                    ( +-100.00% ) [100.00%]
               367 page-faults               #    0.008 M/sec                    ( +-  0.08% )
       122,529,490 cycles                    #    2.533 GHz                      ( +-  1.05% ) [14.24%]
         8,796,729 stalled-cycles-frontend   #    7.18% frontend cycles idle     ( +-  0.56% ) [16.20%]
        88,936,550 stalled-cycles-backend    #   72.58% backend  cycles idle     ( +-  1.48% ) [18.16%]
        58,405,660 instructions              #    0.48  insns per cycle        
                                             #    1.52  stalled cycles per insn  ( +-  0.07% ) [18.61%]
         5,742,738 branches                  #  118.706 M/sec                    ( +-  1.54% ) [18.61%]
           303,555 branch-misses             #    5.29% of all branches          ( +-  0.09% ) [18.61%]
        26,321,789 L1-dcache-loads           #  544.088 M/sec                    ( +-  0.07% ) [18.61%]
         1,236,101 L1-dcache-load-misses     #    4.70% of all L1-dcache hits    ( +-  0.08% ) [18.61%]
         3,409,768 LLC-loads                 #   70.482 M/sec                    ( +-  0.05% ) [18.61%]
         1,212,511 LLC-load-misses           #   35.56% of all LL-cache hits     ( +-  0.08% ) [18.61%]
        10,579,372 L1-icache-loads           #  218.682 M/sec                    ( +-  0.05% ) [18.61%]
            19,426 L1-icache-load-misses     #    0.18% of all L1-icache hits    ( +- 14.70% ) [18.61%]
        25,329,963 dTLB-loads                #  523.586 M/sec                    ( +-  0.27% ) [17.29%]
               802 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  5.43% ) [15.33%]
        10,635,524 iTLB-loads                #  219.843 M/sec                    ( +-  0.09% ) [13.38%]
                 0 iTLB-load-misses          #    0.00% of all iTLB cache hits  [12.72%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [12.72%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [12.72%]

       0.048140073 seconds time elapsed                                          ( +-  0.67% )


Which overall looks a lot more like I expect, save for the parallel ALU cases.
It seems here that the parallel ALU changes actually hurt performance, which
really seems counter-intuitive.  I don't yet have any explanation for that.  I
do note that we seem to have more stalls in the "both" case so perhaps the
parallel chains call for a more aggressive prefetch.  Do you have any thoughts?

Regards
Neil


* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 14:17                                   ` Neil Horman
@ 2013-10-29 14:27                                     ` Ingo Molnar
  2013-10-29 20:26                                       ` Neil Horman
  0 siblings, 1 reply; 47+ messages in thread
From: Ingo Molnar @ 2013-10-29 14:27 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> So, I apologize, you were right.  I was running the test.sh script 
> but perf was measuring itself. [...]

Ok, cool - one mystery less!

> Which overall looks a lot more like I expect, save for the parallel 
> ALU cases. It seems here that the parallel ALU changes actually 
> hurt performance, which really seems counter-intuitive.  I don't 
> yet have any explanation for that.  I do note that we seem to have 
> more stalls in the "both" case so perhaps the parallel chains call 
> for a more aggressive prefetch.  Do you have any thoughts?

Note that with -ddd you 'overload' the PMU with more counters than 
can be run at once, which introduces extra noise. Since you are 
running the tests for 0.150 secs or so, the results are not very 
representative:

               734 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  8.40% ) [13.94%]
        13,314,660 iTLB-loads                #  280.759 M/sec                    ( +-  0.05% ) [12.97%]

with such low runtimes those results are very hard to trust.

So -ddd is typically used to pick up the most interesting PMU events 
you want to see measured, and then use them like this:

   -e dTLB-load-misses -e iTLB-loads

etc. For such short runtimes make sure the last column displays 
close to 100%, so that the PMU results become trustable.

A nehalem+ PMU will allow 2-4 events to be measured in parallel, 
plus generics like 'cycles', 'instructions' can be added 'for free' 
because they get counted in a separate (fixed purpose) PMU register.

The last column tells you what percentage of the runtime that 
particular event was actually active. 100% (or empty last column) 
means it was active all the time.

Thanks,

	Ingo
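
As a concrete illustration of the advice above, one plausible reduced event
set for these runs, chosen so that each counter stays active close to 100% of
the time, would be something like:

   taskset -c 0 perf stat --repeat 20 -C 0 \
           -e cycles -e instructions \
           -e L1-dcache-loads -e L1-dcache-load-misses \
           /root/test.sh

The exact events to keep are of course whichever ones the earlier -ddd runs
showed to be the most interesting.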


* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 14:27                                     ` Ingo Molnar
@ 2013-10-29 20:26                                       ` Neil Horman
  2013-10-31 10:22                                         ` Ingo Molnar
  0 siblings, 1 reply; 47+ messages in thread
From: Neil Horman @ 2013-10-29 20:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Tue, Oct 29, 2013 at 03:27:16PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > So, I apologize, you were right.  I was running the test.sh script 
> > but perf was measuring itself. [...]
> 
> Ok, cool - one mystery less!
> 
> > Which overall looks a lot more like I expect, save for the parallel 
> > ALU cases. It seems here that the parallel ALU changes actually 
> > hurt performance, which really seems counter-intuitive.  I don't 
> > yet have any explanation for that.  I do note that we seem to have 
> > more stalls in the "both" case so perhaps the parallel chains call 
> > for a more aggressive prefetch.  Do you have any thoughts?
> 
> Note that with -ddd you 'overload' the PMU with more counters than 
> can be run at once, which introduces extra noise. Since you are 
> running the tests for 0.150 secs or so, the results are not very 
> representative:
> 
>                734 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  8.40% ) [13.94%]
>         13,314,660 iTLB-loads                #  280.759 M/sec                    ( +-  0.05% ) [12.97%]
> 
> with such low runtimes those results are very hard to trust.
> 
> So -ddd is typically used to pick up the most interesting PMU events 
> you want to see measured, and then use them like this:
> 
>    -e dTLB-load-misses -e iTLB-loads
> 
> etc. For such short runtimes make sure the last column displays 
> close to 100%, so that the PMU results become trustable.
> 
> A nehalem+ PMU will allow 2-4 events to be measured in parallel, 
> plus generics like 'cycles', 'instructions' can be added 'for free' 
> because they get counted in a separate (fixed purpose) PMU register.
> 
> The last column tells you what percentage of the runtime that 
> particular event was actually active. 100% (or empty last column) 
> means it was active all the time.
> 
> Thanks,
> 
> 	Ingo
> 

Hmm, 

I ran this test:

for i in `seq 0 1 3`
do
echo $i > /sys/module/csum_test/parameters/module_test_mode
taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
done

And I updated the test module to run for a million iterations rather than 100000 to increase the sample size and got this:


Base:
 Performance counter stats for './test.sh' (20 runs):

        47,305,064 L1-dcache-load-misses     #    2.09% of all L1-dcache hits    ( +-  0.04% ) [18.74%]
                 0 L1-dcache-prefetches                                         [18.75%]
    13,906,212,348 cycles                    #    0.000 GHz                      ( +-  0.05% ) [18.76%]
     4,426,395,949 instructions              #    0.32  insns per cycle          ( +-  0.01% ) [18.77%]
     2,261,551,278 L1-dcache-loads                                               ( +-  0.02% ) [18.76%]
        47,287,226 L1-dcache-load-misses     #    2.09% of all L1-dcache hits    ( +-  0.04% ) [18.76%]
       276,842,685 LLC-loads                                                     ( +-  0.01% ) [18.76%]
        46,454,114 LLC-load-misses           #   16.78% of all LL-cache hits     ( +-  0.05% ) [18.76%]
     1,048,894,486 L1-icache-loads                                               ( +-  0.07% ) [18.76%]
           472,205 L1-icache-load-misses     #    0.05% of all L1-icache hits    ( +-  1.19% ) [18.76%]
     2,260,639,613 dTLB-loads                                                    ( +-  0.01% ) [18.75%]
               172 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +- 35.14% ) [18.74%]
     1,048,732,481 iTLB-loads                                                    ( +-  0.07% ) [18.74%]
                19 iTLB-load-misses          #    0.00% of all iTLB cache hits   ( +- 39.75% ) [18.73%]
                 0 L1-dcache-prefetches                                         [18.73%]
                 0 L1-dcache-prefetch-misses                                    [18.73%]

       5.370546698 seconds time elapsed                                          ( +-  0.05% )


Prefetch:
 Performance counter stats for './test.sh' (20 runs):

       124,885,469 L1-dcache-load-misses     #    4.96% of all L1-dcache hits    ( +-  0.09% ) [18.74%]
                 0 L1-dcache-prefetches                                         [18.75%]
    11,434,328,889 cycles                    #    0.000 GHz                      ( +-  1.11% ) [18.77%]
     4,601,831,553 instructions              #    0.40  insns per cycle          ( +-  0.01% ) [18.77%]
     2,515,483,814 L1-dcache-loads                                               ( +-  0.01% ) [18.77%]
       124,928,127 L1-dcache-load-misses     #    4.97% of all L1-dcache hits    ( +-  0.09% ) [18.76%]
       323,355,145 LLC-loads                                                     ( +-  0.02% ) [18.76%]
       123,008,548 LLC-load-misses           #   38.04% of all LL-cache hits     ( +-  0.10% ) [18.75%]
     1,256,391,060 L1-icache-loads                                               ( +-  0.01% ) [18.75%]
           374,691 L1-icache-load-misses     #    0.03% of all L1-icache hits    ( +-  1.41% ) [18.75%]
     2,514,984,046 dTLB-loads                                                    ( +-  0.01% ) [18.75%]
                67 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +- 51.81% ) [18.74%]
     1,256,333,548 iTLB-loads                                                    ( +-  0.01% ) [18.74%]
                19 iTLB-load-misses          #    0.00% of all iTLB cache hits   ( +- 39.74% ) [18.74%]
                 0 L1-dcache-prefetches                                         [18.73%]
                 0 L1-dcache-prefetch-misses                                    [18.73%]

       4.496839773 seconds time elapsed                                          ( +-  0.64% )


Parallel ALU:
 Performance counter stats for './test.sh' (20 runs):

        49,489,518 L1-dcache-load-misses     #    2.19% of all L1-dcache hits    ( +-  0.09% ) [18.74%]
                 0 L1-dcache-prefetches                                         [18.76%]
    13,777,501,365 cycles                    #    0.000 GHz                      ( +-  1.73% ) [18.78%]
     4,707,160,703 instructions              #    0.34  insns per cycle          ( +-  0.01% ) [18.78%]
     2,261,693,074 L1-dcache-loads                                               ( +-  0.02% ) [18.78%]
        49,468,878 L1-dcache-load-misses     #    2.19% of all L1-dcache hits    ( +-  0.09% ) [18.77%]
       279,524,254 LLC-loads                                                     ( +-  0.01% ) [18.76%]
        48,491,934 LLC-load-misses           #   17.35% of all LL-cache hits     ( +-  0.12% ) [18.75%]
     1,057,877,680 L1-icache-loads                                               ( +-  0.02% ) [18.74%]
           461,784 L1-icache-load-misses     #    0.04% of all L1-icache hits    ( +-  1.87% ) [18.74%]
     2,260,978,836 dTLB-loads                                                    ( +-  0.02% ) [18.74%]
                27 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +- 89.96% ) [18.74%]
     1,057,886,632 iTLB-loads                                                    ( +-  0.02% ) [18.74%]
                 4 iTLB-load-misses          #    0.00% of all iTLB cache hits   ( +-100.00% ) [18.74%]
                 0 L1-dcache-prefetches                                         [18.73%]
                 0 L1-dcache-prefetch-misses                                    [18.73%]

       5.500417234 seconds time elapsed                                          ( +-  1.60% )


Both:
 Performance counter stats for './test.sh' (20 runs):

       116,621,570 L1-dcache-load-misses     #    4.68% of all L1-dcache hits    ( +-  0.04% ) [18.73%]
                 0 L1-dcache-prefetches                                         [18.75%]
    11,597,067,510 cycles                    #    0.000 GHz                      ( +-  1.73% ) [18.77%]
     4,952,251,361 instructions              #    0.43  insns per cycle          ( +-  0.01% ) [18.77%]
     2,493,003,710 L1-dcache-loads                                               ( +-  0.02% ) [18.77%]
       116,640,333 L1-dcache-load-misses     #    4.68% of all L1-dcache hits    ( +-  0.04% ) [18.77%]
       322,246,216 LLC-loads                                                     ( +-  0.03% ) [18.76%]
       114,528,956 LLC-load-misses           #   35.54% of all LL-cache hits     ( +-  0.04% ) [18.76%]
       999,371,469 L1-icache-loads                                               ( +-  0.02% ) [18.76%]
           406,679 L1-icache-load-misses     #    0.04% of all L1-icache hits    ( +-  1.97% ) [18.75%]
     2,492,708,710 dTLB-loads                                                    ( +-  0.01% ) [18.75%]
               140 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +- 38.46% ) [18.74%]
       999,320,389 iTLB-loads                                                    ( +-  0.01% ) [18.74%]
                19 iTLB-load-misses          #    0.00% of all iTLB cache hits   ( +- 39.90% ) [18.73%]
                 0 L1-dcache-prefetches                                         [18.73%]
                 0 L1-dcache-prefetch-misses                                    [18.72%]

       4.634419247 seconds time elapsed                                          ( +-  1.60% )


I note a few oddities here:

1) We seem to be getting more counter results than I specified, not sure why
2) The % active column is adding up to way more than 100 (which from my read of
the man page makes sense, given that multiple counters might increment in
response to a single instruction execution)
3) The run times are proportionally larger, but still indicate that Parallel ALU
execution is hurting rather than helping, which is counter-intuitive.  I'm
looking into it, but thought you might want to see these results in case
something jumped out at you.

Regards
Neil


* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
@ 2013-10-30  5:25 Doug Ledford
  2013-10-30 10:27 ` David Laight
  2013-10-30 11:02 ` Neil Horman
  0 siblings, 2 replies; 47+ messages in thread
From: Doug Ledford @ 2013-10-30  5:25 UTC (permalink / raw)
  To: Neil Horman; +Cc: Ingo Molnar, Eric Dumazet, Doug Ledford, linux-kernel, netdev

* Neil Horman <nhorman@tuxdriver.com> wrote:
> 3) The run times are proportionally larger, but still indicate that Parallel ALU
> execution is hurting rather than helping, which is counter-intuitive.  I'm
> looking into it, but thought you might want to see these results in case
> something jumped out at you

So here's my theory about all of this.

I think that the original observation some years back was a fluke caused by
either a buggy CPU or a CPU design that is no longer used.

The parallel ALU design of this patch seems OK at first glance, but it means
that two parallel operations are both trying to set/clear both the overflow
and carry flags of the EFLAGS register of the *CPU* (not the ALU).  So, either
some CPU in the past had a set of overflow/carry flags per ALU and did some
sort of magic to make sure that the last state of those flags across multiple
ALUs that might have been used in parallelizing work were always in the CPU's
logical EFLAGS register, or the CPU has a buggy microcode that allowed two
ALUs to operate on data at the same time in situations where they would
potentially stomp on the carry/overflow flags of the other ALUs' operations.

It's my theory that all modern CPUs have this behavior fixed, probably via a
microcode update, and so trying to do parallel ALU operations like this simply
has no effect because the CPU (rightly so) serializes the operations to keep
them from clobbering the overflow/carry flags of the other ALUs' operations.

My additional theory then is that the reason you see a slowdown from this
patch is because the attempt to parallelize the ALU operation has caused
us to write a series of instructions that, once serialized, are non-optimal
and hinder smooth pipelining of the data (aka going 0*8, 2*8, 4*8, 6*8, 1*8,
3*8, 5*8, and 7*8 in terms of memory accesses is worse than doing them in
order, and since we aren't getting the parallel operation we want, this
is the net result of the patch).

It would explain things anyway.
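
To make the flags dependency in that theory concrete, here is a sketch (not
the patch's actual code) of what an interleaved pair of chains boils down to:
every adcq both reads and writes CF, so each one has to wait for the carry
produced by the previous one, no matter how many ALUs are available.

	adcq 0*8(%rsi), %rax	# reads CF, writes CF
	adcq 1*8(%rsi), %rbx	# must wait for the CF written above
	adcq 2*8(%rsi), %rax	# one long dependency chain through EFLAGS
	adcq 3*8(%rsi), %rbx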


* RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-30  5:25 [PATCH] x86: Run checksumming in parallel accross multiple alu's Doug Ledford
@ 2013-10-30 10:27 ` David Laight
  2013-10-30 11:02 ` Neil Horman
  1 sibling, 0 replies; 47+ messages in thread
From: David Laight @ 2013-10-30 10:27 UTC (permalink / raw)
  To: Doug Ledford, Neil Horman; +Cc: Ingo Molnar, Eric Dumazet, linux-kernel, netdev

> The parallel ALU design of this patch seems OK at first glance, but it means
> that two parallel operations are both trying to set/clear both the overflow
> and carry flags of the EFLAGS register of the *CPU* (not the ALU).  So, either
> some CPU in the past had a set of overflow/carry flags per ALU and did some
> sort of magic to make sure that the last state of those flags across multiple
> ALUs that might have been used in parallelizing work were always in the CPU's
> logical EFLAGS register, or the CPU has a buggy microcode that allowed two
> ALUs to operate on data at the same time in situations where they would
> potentially stomp on the carry/overflow flags of the other ALUs' operations.

IIRC x86 CPUs treat the (arithmetic) flags register as a single entity.
So an instruction that only changes some of the flags is dependent
on any previous instruction that changes any flags.
OTOH if the instruction writes all of the flags then it doesn't
have to wait for the earlier instruction to complete.

This is problematic for the ADC chain in the IP checksum.
I did once try to use the SSE instructions to sum 16bit
fields into multiple 32bit registers.

	David


* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-30  5:25 [PATCH] x86: Run checksumming in parallel accross multiple alu's Doug Ledford
  2013-10-30 10:27 ` David Laight
@ 2013-10-30 11:02 ` Neil Horman
  2013-10-30 12:18   ` David Laight
  2013-10-30 13:35   ` Doug Ledford
  1 sibling, 2 replies; 47+ messages in thread
From: Neil Horman @ 2013-10-30 11:02 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Ingo Molnar, Eric Dumazet, linux-kernel, netdev

On Wed, Oct 30, 2013 at 01:25:39AM -0400, Doug Ledford wrote:
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> > 3) The run times are proportionally larger, but still indicate that Parallel ALU
> > execution is hurting rather than helping, which is counter-intuitive.  I'm
> > looking into it, but thought you might want to see these results in case
> > something jumped out at you
> 
> So here's my theory about all of this.
> 
> I think that the original observation some years back was a fluke caused by
> either a buggy CPU or a CPU design that is no longer used.
> 
> The parallel ALU design of this patch seems OK at first glance, but it means
> that two parallel operations are both trying to set/clear both the overflow
> and carry flags of the EFLAGS register of the *CPU* (not the ALU).  So, either
> some CPU in the past had a set of overflow/carry flags per ALU and did some
> sort of magic to make sure that the last state of those flags across multiple
> ALUs that might have been used in parallelizing work were always in the CPU's
> logical EFLAGS register, or the CPU has a buggy microcode that allowed two
> ALUs to operate on data at the same time in situations where they would
> potentially stomp on the carry/overflow flags of the other ALUs' operations.
> 
> It's my theory that all modern CPUs have this behavior fixed, probably via a
> microcode update, and so trying to do parallel ALU operations like this simply
> has no effect because the CPU (rightly so) serializes the operations to keep
> them from clobbering the overflow/carry flags of the other ALUs' operations.
> 
> My additional theory then is that the reason you see a slowdown from this
> patch is because the attempt to parallelize the ALU operation has caused
> us to write a series of instructions that, once serialized, are non-optimal
> and hinder smooth pipelining of the data (aka going 0*8, 2*8, 4*8, 6*8, 1*8,
> 3*8, 5*8, and 7*8 in terms of memory accesses is worse than doing them in
> order, and since we aren't getting the parallel operation we want, this
> is the net result of the patch).
> 
> It would explain things anyway.
> 

That does make sense, but it then begs the question, what's the advantage of
having multiple ALUs at all?  If they're just going to serialize on the
updating of the condition register, there doesn't seem to be much advantage in
having multiple ALUs at all, especially if a common use case (parallelizing an
operation on a large linear dataset) resulted in lower performance.

/me wonders if rearranging the instructions into this order:
adcq 0*8(src), res1
adcq 1*8(src), res2
adcq 2*8(src), res1

would prevent pipeline stalls.  That would be interesting data, and (I think)
support your theory, Doug.  I'll give that a try

Neil


* RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-30 11:02 ` Neil Horman
@ 2013-10-30 12:18   ` David Laight
  2013-10-30 13:22     ` Doug Ledford
  2013-10-30 13:35   ` Doug Ledford
  1 sibling, 1 reply; 47+ messages in thread
From: David Laight @ 2013-10-30 12:18 UTC (permalink / raw)
  To: Neil Horman, Doug Ledford; +Cc: Ingo Molnar, Eric Dumazet, linux-kernel, netdev

> /me wonders if rearranging the instructions into this order:
> adcq 0*8(src), res1
> adcq 1*8(src), res2
> adcq 2*8(src), res1

Those have to be sequenced.

Using a 64bit lea to add 32bit quantities should avoid the
dependencies on the flags register.
However you'd need to get 3 of those active to beat a 64bit adc.

	David
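
As a rough sketch of that lea approach (the register assignments here are
hypothetical), the point is that lea writes no flags, so the independent sums
below carry no dependency on EFLAGS between them:

	movl 0*4(%rsi), %eax		# 32bit loads zero-extend into 64bit regs
	movl 1*4(%rsi), %ecx
	movl 2*4(%rsi), %edx
	leaq (%r8,%rax), %r8		# sum1 += word0, no flags touched
	leaq (%r9,%rcx), %r9		# sum2 += word1
	leaq (%r10,%rdx), %r10		# sum3 += word2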


* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-30 12:18   ` David Laight
@ 2013-10-30 13:22     ` Doug Ledford
  0 siblings, 0 replies; 47+ messages in thread
From: Doug Ledford @ 2013-10-30 13:22 UTC (permalink / raw)
  To: David Laight, Neil Horman; +Cc: Ingo Molnar, Eric Dumazet, linux-kernel, netdev

On 10/30/2013 08:18 AM, David Laight wrote:
>> /me wonders if rearranging the instructions into this order:
>> adcq 0*8(src), res1
>> adcq 1*8(src), res2
>> adcq 2*8(src), res1
>
> Those have to be sequenced.
>
> Using a 64bit lea to add 32bit quantities should avoid the
> dependencies on the flags register.
> However you'd need to get 3 of those active to beat a 64bit adc.
>
> 	David
>
>
>

Already done (well, something similar to what you mention above anyway), 
doesn't help (although doesn't hurt either, even though it doubles the 
number of adds needed to complete the same work).  This is the code I 
tested:

#define ADDL_64                                         \
         asm("xorq  %%r8,%%r8\n\t"                       \
             "xorq  %%r9,%%r9\n\t"                       \
             "xorq  %%r10,%%r10\n\t"                     \
             "xorq  %%r11,%%r11\n\t"                     \
             "movl  0*4(%[src]),%%r8d\n\t"               \
             "movl  1*4(%[src]),%%r9d\n\t"               \
             "movl  2*4(%[src]),%%r10d\n\t"              \
             "movl  3*4(%[src]),%%r11d\n\t"              \
             "addq  %%r8,%[res1]\n\t"                    \
             "addq  %%r9,%[res2]\n\t"                    \
             "addq  %%r10,%[res3]\n\t"                   \
             "addq  %%r11,%[res4]\n\t"                   \
             "movl  4*4(%[src]),%%r8d\n\t"               \
             "movl  5*4(%[src]),%%r9d\n\t"               \
             "movl  6*4(%[src]),%%r10d\n\t"              \
             "movl  7*4(%[src]),%%r11d\n\t"              \
             "addq  %%r8,%[res1]\n\t"                    \
             "addq  %%r9,%[res2]\n\t"                    \
             "addq  %%r10,%[res3]\n\t"                   \
             "addq  %%r11,%[res4]\n\t"                   \
             "movl  8*4(%[src]),%%r8d\n\t"               \
             "movl  9*4(%[src]),%%r9d\n\t"               \
             "movl  10*4(%[src]),%%r10d\n\t"             \
             "movl  11*4(%[src]),%%r11d\n\t"             \
             "addq  %%r8,%[res1]\n\t"                    \
             "addq  %%r9,%[res2]\n\t"                    \
             "addq  %%r10,%[res3]\n\t"                   \
             "addq  %%r11,%[res4]\n\t"                   \
             "movl  12*4(%[src]),%%r8d\n\t"              \
             "movl  13*4(%[src]),%%r9d\n\t"              \
             "movl  14*4(%[src]),%%r10d\n\t"             \
             "movl  15*4(%[src]),%%r11d\n\t"             \
             "addq  %%r8,%[res1]\n\t"                    \
             "addq  %%r9,%[res2]\n\t"                    \
             "addq  %%r10,%[res3]\n\t"                   \
             "addq  %%r11,%[res4]"                       \
             : [res1] "=r" (result1),                    \
               [res2] "=r" (result2),                    \
               [res3] "=r" (result3),                    \
               [res4] "=r" (result4)                     \
             : [src] "r" (buff),                         \
               "[res1]" (result1), "[res2]" (result2),   \
               "[res3]" (result3), "[res4]" (result4)    \
             : "r8", "r9", "r10", "r11" )


* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-30 11:02 ` Neil Horman
  2013-10-30 12:18   ` David Laight
@ 2013-10-30 13:35   ` Doug Ledford
  2013-10-30 14:04     ` David Laight
                       ` (2 more replies)
  1 sibling, 3 replies; 47+ messages in thread
From: Doug Ledford @ 2013-10-30 13:35 UTC (permalink / raw)
  To: Neil Horman; +Cc: Ingo Molnar, Eric Dumazet, linux-kernel, netdev, David Laight

On 10/30/2013 07:02 AM, Neil Horman wrote:

> That does make sense, but it then begs the question, what's the advantage of
> having multiple ALUs at all?

There's lots of ALU operations that don't operate on the flags or other 
entities that can be run in parallel.

> If they're just going to serialize on the
> updating of the condition register, there doesn't seem to be much advantage in
> having multiple ALUs at all, especially if a common use case (parallelizing an
> operation on a large linear dataset) resulted in lower performance.
>
> /me wonders if rearranging the instructions into this order:
> adcq 0*8(src), res1
> adcq 1*8(src), res2
> adcq 2*8(src), res1
>
> would prevent pipeline stalls.  That would be interesting data, and (I think)
> support your theory, Doug.  I'll give that a try

Just to avoid spending too much time on various combinations, here are 
the methods I've tried:

Original code
2 chains doing interleaved memory accesses
2 chains doing serial memory accesses (as above)
4 chains doing serial memory accesses
4 chains using 32bit values in 64bit registers so you can always use add 
instead of adc and never need the carry flag

And I've done all of the above with simple prefetch and smart prefetch.

In all cases, the result is basically that the add method doesn't matter 
much in the grand scheme of things, but the prefetch does, and smart 
prefetch always beat simple prefetch.

My simple prefetch was to just go into the main while() loop for the 
csum operation and always prefetch 5*64 into the future.

My smart prefetch looks like this:

static inline void prefetch_line(unsigned long *cur_line,
                                  unsigned long *end_line,
                                  size_t size)
{
         size_t fetched = 0;

         while (*cur_line <= *end_line && fetched < size) {
                 prefetch((void *)*cur_line);
                 *cur_line += cache_line_size();
                 fetched += cache_line_size();
         }
}

static unsigned do_csum(const unsigned char *buff, unsigned len)
{
	...
         unsigned long cur_line = (unsigned long)buff & ~(cache_line_size() - 1);
         unsigned long end_line = ((unsigned long)buff + len) & ~(cache_line_size() - 1);

	...
         /* Don't bother to prefetch the first line, we'll end up stalling
          * on it anyway, but go ahead and start the prefetch on the next 3 */
         cur_line += cache_line_size();
         prefetch_line(&cur_line, &end_line, cache_line_size() * 3);
         odd = 1 & (unsigned long) buff;
         if (unlikely(odd)) {
                 result = *buff << 8;
	...
                 count >>= 1;            /* nr of 32-bit words.. */

                 /* prefetch line #4 ahead of main loop */
                 prefetch_line(&cur_line, &end_line, cache_line_size());

                 if (count) {
		...
                         while (count64) {
                                 /* we are now prefetching line #5 ahead of
                                  * where we are starting, and will stay 5
                                  * ahead throughout the loop, at least until
                                  * we get to the end line and then we'll
                                  * stop prefetching */
                                 prefetch_line(&cur_line, &end_line, 64);
                                 ADDL_64;
                                 buff += 64;
                                 count64--;
                         }

                         ADDL_64_FINISH;


I was going to tinker today and tomorrow with this function once I get a 
toolchain that will compile it (I reinstalled all my rhel6 hosts as f20 
and I'm hoping that does the trick, if not I need to do more work):

#define ADCXQ_64                                        \
         asm("xorq %[res1],%[res1]\n\t"                  \
             "adcxq 0*8(%[src]),%[res1]\n\t"             \
             "adoxq 1*8(%[src]),%[res2]\n\t"             \
             "adcxq 2*8(%[src]),%[res1]\n\t"             \
             "adoxq 3*8(%[src]),%[res2]\n\t"             \
             "adcxq 4*8(%[src]),%[res1]\n\t"             \
             "adoxq 5*8(%[src]),%[res2]\n\t"             \
             "adcxq 6*8(%[src]),%[res1]\n\t"             \
             "adoxq 7*8(%[src]),%[res2]\n\t"             \
             "adcxq %[zero],%[res1]\n\t"                 \
             "adoxq %[zero],%[res2]\n\t"                 \
             : [res1] "=r" (result1),                    \
               [res2] "=r" (result2)                     \
             : [src] "r" (buff), [zero] "r" (zero),      \
               "[res1]" (result1), "[res2]" (result2))

and then I also wanted to try using both xmm and ymm registers and doing 
64bit adds with 32bit numbers across multiple xmm/ymm registers as that 
should parallelize nicely.  David, you mentioned you've tried this, how did 
your experiment turn out and what was your method?  I was planning on 
doing regular full size loads into one xmm/ymm register, then using 
pshufd/vshufd to move the data into two different registers, then 
summing into a fourth register, and possibly running two of those 
pipelines in parallel.
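
For reference, here is a userspace sketch of that widen-and-add idea using
SSE2 intrinsics (the function name and structure are illustrative only, and it
uses unpack rather than pshufd to do the widening; an in-kernel version would
also need kernel_fpu_begin()/kernel_fpu_end(), handling of the trailing bytes,
and the final fold down to 16 bits):

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t sum32_sse2(const unsigned char *buff, size_t len)
{
	__m128i zero = _mm_setzero_si128();
	__m128i lo = zero, hi = zero;
	size_t i;

	for (i = 0; i + 16 <= len; i += 16) {
		__m128i v = _mm_loadu_si128((const __m128i *)(buff + i));
		/* widen the four 32bit words into 64bit lanes, then add;
		 * the 64bit accumulators cannot overflow for any sane len */
		lo = _mm_add_epi64(lo, _mm_unpacklo_epi32(v, zero));
		hi = _mm_add_epi64(hi, _mm_unpackhi_epi32(v, zero));
	}
	lo = _mm_add_epi64(lo, hi);
	return (uint64_t)_mm_cvtsi128_si64(lo) +
	       (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(lo, lo));
}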


* RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-30 13:35   ` Doug Ledford
@ 2013-10-30 14:04     ` David Laight
  2013-10-30 14:52     ` Neil Horman
  2013-10-31 18:30     ` Neil Horman
  2 siblings, 0 replies; 47+ messages in thread
From: David Laight @ 2013-10-30 14:04 UTC (permalink / raw)
  To: Doug Ledford, Neil Horman; +Cc: Ingo Molnar, Eric Dumazet, linux-kernel, netdev

...
> and then I also wanted to try using both xmm and ymm registers and doing
> 64bit adds with 32bit numbers across multiple xmm/ymm registers as that
> should parallelize nicely.  David, you mentioned you've tried this, how did
> your experiment turn out and what was your method?  I was planning on
> doing regular full size loads into one xmm/ymm register, then using
> pshufd/vshufd to move the data into two different registers, then
> summing into a fourth register, and possibly running two of those
> pipelines in parallel.

It was a long time ago, and IIRC the code was just SSE so the
register length just wasn't going to give the required benefit.
I know I wrote the code, but I can't even remember whether I
actually got it working!
With the longer AVX words it might make enough difference.
Of course, this assumes that you have the fpu registers
available. If you have to do a fpu context switch it will
be a lot slower.

About the same time I did manage to get an open coded copy
loop to run as fast as 'rep movs' - and without any unrolling
or any prefetch instructions.

Thinking about AVX you should be able to do (without looking up the
actual mnemonics):
	load
	add 32bit chunks to sum
	compare sum with read value (equiv of carry)
	add/subtract compare result (0 or ~0) to a carry-sum register
That is 4 instructions for 256 bits, so you can aim for 4 clocks.
You'd need to check the cpu book to see if any of those can
be scheduled at the same time (if not dependent).
(and also whether there is any result delay - don't think so.)

I'd try running two copies of the above - probably skewed so that
the memory accesses are separated, do the memory read for the
next iteration, and use the 3rd instruction unit for loop control.

	David


* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-30 13:35   ` Doug Ledford
  2013-10-30 14:04     ` David Laight
@ 2013-10-30 14:52     ` Neil Horman
  2013-10-31 18:30     ` Neil Horman
  2 siblings, 0 replies; 47+ messages in thread
From: Neil Horman @ 2013-10-30 14:52 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Ingo Molnar, Eric Dumazet, linux-kernel, netdev, David Laight

On Wed, Oct 30, 2013 at 09:35:05AM -0400, Doug Ledford wrote:
> On 10/30/2013 07:02 AM, Neil Horman wrote:
> 
> >That does make sense, but it then begs the question, what's the advantage of
> >having multiple ALUs at all?
> 
> There's lots of ALU operations that don't operate on the flags or
> other entities that can be run in parallel.
> 
> >If they're just going to serialize on the
> >updating of the condition register, there doesn't seem to be much advantage in
> >having multiple ALUs at all, especially if a common use case (parallelizing an
> >operation on a large linear dataset) resulted in lower performance.
> >
> >/me wonders if rearranging the instructions into this order:
> >adcq 0*8(src), res1
> >adcq 1*8(src), res2
> >adcq 2*8(src), res1
> >
> >would prevent pipeline stalls.  That would be interesting data, and (I think)
> >support your theory, Doug.  I'll give that a try
> 
> Just to avoid spending too much time on various combinations, here
> are the methods I've tried:
> 
> Original code
> 2 chains doing interleaved memory accesses
> 2 chains doing serial memory accesses (as above)
> 4 chains doing serial memory accesses
> 4 chains using 32bit values in 64bit registers so you can always use
> add instead of adc and never need the carry flag
> 
> And I've done all of the above with simple prefetch and smart prefetch.
> 
Yup, I just tried the 2 chains doing interleaved access and came up with the
same results for both prefetch cases.

> In all cases, the result is basically that the add method doesn't
> matter much in the grand scheme of things, but the prefetch does,
> and smart prefetch always beat simple prefetch.
> 
> My simple prefetch was to just go into the main while() loop for the
> csum operation and always prefetch 5*64 into the future.
> 
> My smart prefetch looks like this:
> 
> static inline void prefetch_line(unsigned long *cur_line,
>                                  unsigned long *end_line,
>                                  size_t size)
> {
>         size_t fetched = 0;
> 
>         while (*cur_line <= *end_line && fetched < size) {
>                 prefetch((void *)*cur_line);
>                 *cur_line += cache_line_size();
>                 fetched += cache_line_size();
>         }
> }
> 
I've done this too, but I've come up with results that are very close to simple
prefetch.
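
For reference, the "simple prefetch" variant I'm comparing against is
basically just the pattern below (a sketch with illustrative names, not the
exact kernel code): the stock 64-byte add/adc chain plus an unconditional
prefetch 5 cache lines ahead, i.e. Doug's 5*64:

	static unsigned long csum_64_simple_prefetch(const unsigned char *buff,
						     unsigned long count,
						     unsigned long result)
	{
		while (count >= 64) {
			__builtin_prefetch(buff + 5 * 64);	/* 5 lines ahead */
			asm("addq 0*8(%[src]),%[res]\n\t"
			    "adcq 1*8(%[src]),%[res]\n\t"
			    "adcq 2*8(%[src]),%[res]\n\t"
			    "adcq 3*8(%[src]),%[res]\n\t"
			    "adcq 4*8(%[src]),%[res]\n\t"
			    "adcq 5*8(%[src]),%[res]\n\t"
			    "adcq 6*8(%[src]),%[res]\n\t"
			    "adcq 7*8(%[src]),%[res]\n\t"
			    "adcq $0,%[res]"		/* fold the trailing carry */
			    : [res] "=r" (result)
			    : [src] "r" (buff), "[res]" (result)
			    : "memory");
			buff += 64;
			count -= 64;
		}
		return result;	/* sub-64-byte tail handling omitted */
	}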

> I was going to tinker today and tomorrow with this function once I
> get a toolchain that will compile it (I reinstalled all my rhel6
> hosts as f20 and I'm hoping that does the trick, if not I need to do
> more work):
> 
> #define ADCXQ_64                                        \
>         asm("xorq %[res1],%[res1]\n\t"                  \
>             "adcxq 0*8(%[src]),%[res1]\n\t"             \
>             "adoxq 1*8(%[src]),%[res2]\n\t"             \
>             "adcxq 2*8(%[src]),%[res1]\n\t"             \
>             "adoxq 3*8(%[src]),%[res2]\n\t"             \
>             "adcxq 4*8(%[src]),%[res1]\n\t"             \
>             "adoxq 5*8(%[src]),%[res2]\n\t"             \
>             "adcxq 6*8(%[src]),%[res1]\n\t"             \
>             "adoxq 7*8(%[src]),%[res2]\n\t"             \
>             "adcxq %[zero],%[res1]\n\t"                 \
>             "adoxq %[zero],%[res2]\n\t"                 \
>             : [res1] "=r" (result1),                    \
>               [res2] "=r" (result2)                     \
>             : [src] "r" (buff), [zero] "r" (zero),      \
>               "[res1]" (result1), "[res2]" (result2))
> 
I've tried using this method also (HPA suggested it early in the thread), but it's
not going to be useful for a while.  The compiler supports it already, but
there's no hardware available with support for these instructions yet (at least
not that I have available).
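
(For whenever that hardware does show up: the feature is advertised in CPUID
leaf 7, EBX bit 19, so a user-space test harness could gate on something like
the sketch below.  In-kernel we'd obviously want a proper cpufeature test
instead.)

	#include <cpuid.h>
	#include <stdbool.h>

	/* sketch: detect adcx/adox (ADX), CPUID.(EAX=7,ECX=0):EBX bit 19 */
	static bool cpu_has_adx(void)
	{
		unsigned int eax, ebx, ecx, edx;

		if (__get_cpuid(0, &eax, &ebx, &ecx, &edx) == 0 || eax < 7)
			return false;
		__cpuid_count(7, 0, eax, ebx, ecx, edx);
		return (ebx >> 19) & 1;
	}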

Neil

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 20:26                                       ` Neil Horman
@ 2013-10-31 10:22                                         ` Ingo Molnar
  2013-10-31 14:33                                           ` Neil Horman
  0 siblings, 1 reply; 47+ messages in thread
From: Ingo Molnar @ 2013-10-31 10:22 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> > etc. For such short runtimes make sure the last column displays 
> > close to 100%, so that the PMU results become trustable.
> > 
> > A nehalem+ PMU will allow 2-4 events to be measured in parallel, 
> > plus generics like 'cycles', 'instructions' can be added 'for free' 
> > because they get counted in a separate (fixed purpose) PMU register.
> > 
> > The last colum tells you what percentage of the runtime that 
> > particular event was actually active. 100% (or empty last column) 
> > means it was active all the time.
> > 
> > Thanks,
> > 
> > 	Ingo
> > 
> 
> Hmm, 
> 
> I ran this test:
> 
> for i in `seq 0 1 3`
> do
> echo $i > /sys/module/csum_test/parameters/module_test_mode
> taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
> done

You need to remove '-ddd' which is a shortcut for a ton of useful 
events, but here you want to use fewer events, to increase the 
precision of the measurement.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-31 10:22                                         ` Ingo Molnar
@ 2013-10-31 14:33                                           ` Neil Horman
  2013-11-01  9:13                                             ` Ingo Molnar
  0 siblings, 1 reply; 47+ messages in thread
From: Neil Horman @ 2013-10-31 14:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > > etc. For such short runtimes make sure the last column displays 
> > > close to 100%, so that the PMU results become trustable.
> > > 
> > > A nehalem+ PMU will allow 2-4 events to be measured in parallel, 
> > > plus generics like 'cycles', 'instructions' can be added 'for free' 
> > > because they get counted in a separate (fixed purpose) PMU register.
> > > 
> > > The last colum tells you what percentage of the runtime that 
> > > particular event was actually active. 100% (or empty last column) 
> > > means it was active all the time.
> > > 
> > > Thanks,
> > > 
> > > 	Ingo
> > > 
> > 
> > Hmm, 
> > 
> > I ran this test:
> > 
> > for i in `seq 0 1 3`
> > do
> > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
> > done
> 
> You need to remove '-ddd' which is a shortcut for a ton of useful 
> events, but here you want to use fewer events, to increase the 
> precision of the measurement.
> 
> Thanks,
> 
> 	Ingo
> 

Thank you Ingo, that fixed it.  I'm trying some other variants of the csum
algorithm that Doug and I discussed last night, but FWIW, the relative
performance of the 4 test cases (base/prefetch/parallel/both) remains unchanged.
I'm starting to feel like at this point, there's very little point in doing
parallel alu operations (unless we can find a way to break the dependency on the
carry flag, which is what I'm tinkering with now).
Neil

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-30 13:35   ` Doug Ledford
  2013-10-30 14:04     ` David Laight
  2013-10-30 14:52     ` Neil Horman
@ 2013-10-31 18:30     ` Neil Horman
  2013-11-01  9:21       ` Ingo Molnar
  2013-11-01 15:42       ` Ben Hutchings
  2 siblings, 2 replies; 47+ messages in thread
From: Neil Horman @ 2013-10-31 18:30 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Ingo Molnar, Eric Dumazet, linux-kernel, netdev, David Laight

On Wed, Oct 30, 2013 at 09:35:05AM -0400, Doug Ledford wrote:
> On 10/30/2013 07:02 AM, Neil Horman wrote:
> 
> >That does make sense, but it then begs the question, what's the advantage of
> >having multiple alu's at all?
> 
> There's lots of ALU operations that don't operate on the flags or
> other entities that can be run in parallel.
> 
> >If they're just going to serialize on the
> >updating of the condition register, there doesn't seem to be much advantage in
> >having multiple alu's at all, especially if a common use case (parallelizing an
> >operation on a large linear dataset) resulted in lower performance.
> >
> >/me wonders if rearranging the instructions into this order:
> >adcq 0*8(src), res1
> >adcq 1*8(src), res2
> >adcq 2*8(src), res1
> >
> >would prevent pipeline stalls.  That would be interesting data, and (I think)
> >support your theory, Doug.  I'll give that a try
> 
> Just to avoid spending too much time on various combinations, here
> are the methods I've tried:
> 
> Original code
> 2 chains doing interleaved memory accesses
> 2 chains doing serial memory accesses (as above)
> 4 chains doing serial memory accesses
> 4 chains using 32bit values in 64bit registers so you can always use
> add instead of adc and never need the carry flag
> 
> And I've done all of the above with simple prefetch and smart prefetch.
> 
> 


So, above and beyond this I spent yesterday trying this pattern, something Doug
and I discussed together offline:
 
asm("prefetch 5*64(%[src])\n\t"
    "addq 0*8(%[src]),%[res1]\n\t"
    "jo 2f\n\t"
    "incq %[cry]\n\t"
    "2:addq 1*8(%[src]),%[res2]\n\t"
    "jc 3f\n\t"
    "incq %[cry]\n\t"
    "3:addq 2*8(%[src]),%[res1]\n\t"
    ...

The hope being that by using the add instruction instead of the adc
instruction, and alternately testing the overflow and carry flags, I could
break the serialization on the flags register between subsequent adds and
start doing things in parallel (creating a poor man's adcx/adox instruction
in effect).  It functions, but unfortunately the performance lost to the
completely broken branch prediction that this inflicts makes it a non-starter:


Base Performance:
 Performance counter stats for './test.sh' (20 runs):

        48,143,372 L1-dcache-load-misses                                         ( +-  0.03% ) [74.99%]
                 0 L1-dcache-prefetches                                         [75.00%]
    13,913,339,911 cycles                    #    0.000 GHz                      ( +-  0.06% ) [75.01%]
        28,878,999 branch-misses                                                 ( +-  0.05% ) [75.00%]

       5.367618727 seconds time elapsed                                          ( +-  0.06% )


Prefetch and simulated adcx/adox from above:
 Performance counter stats for './test.sh' (20 runs):

        35,704,331 L1-dcache-load-misses                                         ( +-  0.07% ) [75.00%]
                 0 L1-dcache-prefetches                                         [75.00%]
    19,751,409,264 cycles                    #    0.000 GHz                      ( +-  0.59% ) [75.00%]
        34,850,056 branch-misses                                                 ( +-  1.29% ) [75.00%]

       7.768602160 seconds time elapsed                                          ( +-  1.38% )


With the above instruction changes, the prefetching lowers our dcache miss rate
significantly, but greatly raises our branch miss rate, and absolutely kills our
cycle count and run time.

At this point I feel like this is dead in the water.  I apologize for wasting
everyone's time.  The best thing to do here would seem to be:

1) Add in some prefetching (from what I've seen a simple prefetch is as
performant as smart prefetching), so we may as well do it exactly as
csum_copy_from_user does it, and save ourselves the extra while loop.

2) Revisit this when the AVX extensions, or the adcx/adox instructions, are
available and we can really perform parallel alu ops here.

Does that sound reasonable?
Neil

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-31 14:33                                           ` Neil Horman
@ 2013-11-01  9:13                                             ` Ingo Molnar
  2013-11-01 14:06                                               ` Neil Horman
  0 siblings, 1 reply; 47+ messages in thread
From: Ingo Molnar @ 2013-11-01  9:13 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote:
> > 
> > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > 
> > > > etc. For such short runtimes make sure the last column displays 
> > > > close to 100%, so that the PMU results become trustable.
> > > > 
> > > > A nehalem+ PMU will allow 2-4 events to be measured in parallel, 
> > > > plus generics like 'cycles', 'instructions' can be added 'for free' 
> > > > because they get counted in a separate (fixed purpose) PMU register.
> > > > 
> > > > The last colum tells you what percentage of the runtime that 
> > > > particular event was actually active. 100% (or empty last column) 
> > > > means it was active all the time.
> > > > 
> > > > Thanks,
> > > > 
> > > > 	Ingo
> > > > 
> > > 
> > > Hmm, 
> > > 
> > > I ran this test:
> > > 
> > > for i in `seq 0 1 3`
> > > do
> > > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > > taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
> > > done
> > 
> > You need to remove '-ddd' which is a shortcut for a ton of useful 
> > events, but here you want to use fewer events, to increase the 
> > precision of the measurement.
> > 
> > Thanks,
> > 
> > 	Ingo
> > 
> 
> Thank you Ingo, that fixed it.  I'm trying some other variants of 
> the csum algorithm that Doug and I discussed last night, but FWIW, 
> the relative performance of the 4 test cases 
> (base/prefetch/parallel/both) remains unchanged. I'm starting to 
> feel like at this point, there's very little point in doing 
> parallel alu operations (unless we can find a way to break the 
> dependency on the carry flag, which is what I'm tinkering with 
> now).

I would still like to encourage you to pick up the improvements that 
Doug measured (mostly via prefetch tweaking?) - that looked like 
some significant speedups that we don't want to lose!

Also, trying to stick the in-kernel implementation into 'perf bench' 
would be a useful first step as well, for this and future efforts.

See what we do in tools/perf/bench/mem-memcpy-x86-64-asm.S to pick 
up the in-kernel assembly memcpy implementations:

#define memcpy MEMCPY /* don't hide glibc's memcpy() */
#define altinstr_replacement text
#define globl p2align 4; .globl
#define Lmemcpy_c globl memcpy_c; memcpy_c
#define Lmemcpy_c_e globl memcpy_c_e; memcpy_c_e

#include "../../../arch/x86/lib/memcpy_64.S"

So it needed a bit of trickery/wrappery for 'perf bench mem memcpy', 
but that is a one-time effort - once it's done then the current 
in-kernel csum_partial() implementation would be easily measurable 
(and any performance regression in it bisectable, etc.) from that 
point on.

In user-space it would also be easier to add various parameters and 
experimental implementations and background cache-stressing 
workloads automatically.

Something similar might be possible for csum_partial(), 
csum_partial_copy*(), etc.
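
Purely as an illustration of the user-space side (not a perf bench patch, and
the csum_partial() prototype below just mirrors the generic checksum header),
even a dumb standalone harness along these lines would let you time an
extracted copy of the routine against a larger-than-L2 buffer:

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <time.h>

	/* whichever csum_partial() implementation got extracted and linked in */
	extern unsigned int csum_partial(const void *buff, int len, unsigned int sum);

	int main(void)
	{
		size_t len = 64 * 1024 * 1024;	/* well past L2 size */
		unsigned char *buf = malloc(len);
		struct timespec t0, t1;
		unsigned int sum = 0;
		size_t off;

		memset(buf, 0x5a, len);
		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (off = 0; off < len; off += 4096)	/* 4k blocks, mostly cache-cold */
			sum = csum_partial(buf + off, 4096, sum);
		clock_gettime(CLOCK_MONOTONIC, &t1);
		printf("sum=%08x  %.3f ms\n", sum,
		       (t1.tv_sec - t0.tv_sec) * 1e3 +
		       (t1.tv_nsec - t0.tv_nsec) / 1e6);
		free(buf);
		return 0;
	}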

Note, if any of you ventures to add checksum-benchmarking to perf 
bench, please base any patches on top of tip:perf/core:

  git pull git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/core

as there are a couple of perf bench enhancements in the pipeline 
already for v3.13.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-31 18:30     ` Neil Horman
@ 2013-11-01  9:21       ` Ingo Molnar
  2013-11-01 15:42       ` Ben Hutchings
  1 sibling, 0 replies; 47+ messages in thread
From: Ingo Molnar @ 2013-11-01  9:21 UTC (permalink / raw)
  To: Neil Horman
  Cc: Doug Ledford, Eric Dumazet, linux-kernel, netdev, David Laight


* Neil Horman <nhorman@tuxdriver.com> wrote:

> Prefetch and simulated adcx/adox from above:
>  Performance counter stats for './test.sh' (20 runs):
> 
>         35,704,331 L1-dcache-load-misses                                         ( +-  0.07% ) [75.00%]
>                  0 L1-dcache-prefetches                                         [75.00%]
>     19,751,409,264 cycles                    #    0.000 GHz                      ( +-  0.59% ) [75.00%]
>         34,850,056 branch-misses                                                 ( +-  1.29% ) [75.00%]
> 
>        7.768602160 seconds time elapsed                                          ( +-  1.38% )

btw., you might also want to try measuring only the basics:

   -e cycles -e instructions -e branches -e branch-misses

that should give you 100% in the last column and should also allow 
you to double check whether all the PMU counts are correct: is it 
the expected number of instructions, expected number of branches, 
expected number of branch-misses, etc.

Then you can remove branch stats and add just L1-dcache stats - and 
still be 100% covered:

   -e cycles -e instructions -e L1-dcache-loads -e L1-dcache-load-misses

etc.

Just so that you can trust what the PMU tells you. Prefetch counts 
are sometimes off, they might include speculative activities, etc.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01  9:13                                             ` Ingo Molnar
@ 2013-11-01 14:06                                               ` Neil Horman
  0 siblings, 0 replies; 47+ messages in thread
From: Neil Horman @ 2013-11-01 14:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Fri, Nov 01, 2013 at 10:13:37AM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote:
> > > 
> > > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > > 
> > > > > etc. For such short runtimes make sure the last column displays 
> > > > > close to 100%, so that the PMU results become trustable.
> > > > > 
> > > > > A nehalem+ PMU will allow 2-4 events to be measured in parallel, 
> > > > > plus generics like 'cycles', 'instructions' can be added 'for free' 
> > > > > because they get counted in a separate (fixed purpose) PMU register.
> > > > > 
> > > > > The last colum tells you what percentage of the runtime that 
> > > > > particular event was actually active. 100% (or empty last column) 
> > > > > means it was active all the time.
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > 	Ingo
> > > > > 
> > > > 
> > > > Hmm, 
> > > > 
> > > > I ran this test:
> > > > 
> > > > for i in `seq 0 1 3`
> > > > do
> > > > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > > > taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
> > > > done
> > > 
> > > You need to remove '-ddd' which is a shortcut for a ton of useful 
> > > events, but here you want to use fewer events, to increase the 
> > > precision of the measurement.
> > > 
> > > Thanks,
> > > 
> > > 	Ingo
> > > 
> > 
> > Thank you Ingo, that fixed it.  I'm trying some other variants of 
> > the csum algorithm that Doug and I discussed last night, but FWIW, 
> > the relative performance of the 4 test cases 
> > (base/prefetch/parallel/both) remains unchanged. I'm starting to 
> > feel like at this point, there's very little point in doing 
> > parallel alu operations (unless we can find a way to break the 
> > dependency on the carry flag, which is what I'm tinkering with 
> > now).
> 
> I would still like to encourage you to pick up the improvements that 
> Doug measured (mostly via prefetch tweaking?) - that looked like 
> some significant speedups that we don't want to lose!
> 
Well, yes, I made a line item of that in my subsequent note below.  I'm going to
repost that shortly, and I suggested that we revisit this when the AVX
instruction extensions are available.

> Also, trying to stick the in-kernel implementation into 'perf bench' 
> would be a useful first step as well, for this and future efforts.
> 
> See what we do in tools/perf/bench/mem-memcpy-x86-64-asm.S to pick 
> up the in-kernel assembly memcpy implementations:
> 
Yes, I'll look into adding this as well
Regards
Neil

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-31 18:30     ` Neil Horman
  2013-11-01  9:21       ` Ingo Molnar
@ 2013-11-01 15:42       ` Ben Hutchings
  2013-11-01 16:08         ` Neil Horman
  1 sibling, 1 reply; 47+ messages in thread
From: Ben Hutchings @ 2013-11-01 15:42 UTC (permalink / raw)
  To: Neil Horman
  Cc: Doug Ledford, Ingo Molnar, Eric Dumazet, linux-kernel, netdev,
	David Laight

On Thu, 2013-10-31 at 14:30 -0400, Neil Horman wrote:
[...]
> It
> functions, but unfortunately the performance lost to the completely broken
> branch prediction that this inflicts makes it a non-starter:
[...]

Conditional branches are no good but conditional moves might be worth a shot.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 15:42       ` Ben Hutchings
@ 2013-11-01 16:08         ` Neil Horman
  2013-11-01 16:16           ` Ben Hutchings
  2013-11-01 16:18           ` David Laight
  0 siblings, 2 replies; 47+ messages in thread
From: Neil Horman @ 2013-11-01 16:08 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Doug Ledford, Ingo Molnar, Eric Dumazet, linux-kernel, netdev,
	David Laight

On Fri, Nov 01, 2013 at 03:42:46PM +0000, Ben Hutchings wrote:
> On Thu, 2013-10-31 at 14:30 -0400, Neil Horman wrote:
> [...]
> > It
> > functions, but unfortunately the performance lost to the completely broken
> > branch prediction that this inflicts makes it a non-starter:
> [...]
> 
> Conditional branches are no good but conditional moves might be worth a shot.
> 
> Ben.
> 
How would you suggest replacing the jumps in this case?  I agree it would be
faster here, but I'm not sure how I would implement an increment using a single
conditional move.
Neil

> -- 
> Ben Hutchings, Staff Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 16:08         ` Neil Horman
@ 2013-11-01 16:16           ` Ben Hutchings
  2013-11-01 16:18           ` David Laight
  1 sibling, 0 replies; 47+ messages in thread
From: Ben Hutchings @ 2013-11-01 16:16 UTC (permalink / raw)
  To: Neil Horman
  Cc: Doug Ledford, Ingo Molnar, Eric Dumazet, linux-kernel, netdev,
	David Laight

On Fri, 2013-11-01 at 12:08 -0400, Neil Horman wrote:
> On Fri, Nov 01, 2013 at 03:42:46PM +0000, Ben Hutchings wrote:
> > On Thu, 2013-10-31 at 14:30 -0400, Neil Horman wrote:
> > [...]
> > > It
> > > functions, but unfortunately the performance lost to the completely broken
> > > branch prediction that this inflicts makes it a non-starter:
> > [...]
> > 
> > Conditional branches are no good but conditional moves might be worth a shot.
> > 
> > Ben.
> > 
> How would you suggest replacing the jumps in this case?  I agree it would be
> faster here, but I'm not sure how I would implement an increment using a single
> conditional move.

You can't, but it lets you use additional registers as carry flags.
Whether there are enough registers and enough parallelism to cancel out
the extra additions required, I don't know.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 16:08         ` Neil Horman
  2013-11-01 16:16           ` Ben Hutchings
@ 2013-11-01 16:18           ` David Laight
  2013-11-01 17:37             ` Neil Horman
  1 sibling, 1 reply; 47+ messages in thread
From: David Laight @ 2013-11-01 16:18 UTC (permalink / raw)
  To: Neil Horman, Ben Hutchings
  Cc: Doug Ledford, Ingo Molnar, Eric Dumazet, linux-kernel, netdev

> How would you suggest replacing the jumps in this case?  I agree it would be
> faster here, but I'm not sure how I would implement an increment using a single
> conditional move.

I think you need 3 instructions, move a 0, conditionally move a 1
then add. I suspect it won't be a win!
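
Roughly, per word, something like this (illustrative registers, the constant 1
hoisted out of the loop, completely untested):

	# setup, once:
	movq	$1, %r11		# keep a constant 1 around
	# then per 64bit word:
	xorl	%edx, %edx		# tmp = 0
	addq	(%rsi), %rax		# sum += word; sets CF
	cmovc	%r11, %rdx		# tmp = 1 only if that add carried
	addq	%rdx, %rcx		# carries += tmp (clobbers CF, nothing reads it)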

If you do 'win' it is probably very dependent on how the instructions
get scheduled onto the execution units - which will probably make
it very cpu type dependent.

	David

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 16:18           ` David Laight
@ 2013-11-01 17:37             ` Neil Horman
  2013-11-01 19:45               ` Joe Perches
  2013-11-04  9:47               ` David Laight
  0 siblings, 2 replies; 47+ messages in thread
From: Neil Horman @ 2013-11-01 17:37 UTC (permalink / raw)
  To: David Laight
  Cc: Ben Hutchings, Doug Ledford, Ingo Molnar, Eric Dumazet,
	linux-kernel, netdev

On Fri, Nov 01, 2013 at 04:18:50PM -0000, David Laight wrote:
> > How would you suggest replacing the jumps in this case?  I agree it would be
> > faster here, but I'm not sure how I would implement an increment using a single
> > conditional move.
> 
> I think you need 3 instructions, move a 0, conditionally move a 1
> then add. I suspect it won't be a win!
> 
> If you do 'win' it is probably very dependent on how the instructions
> get scheduled onto the execution units - which will probably make
> it very cpu type dependent.
> 
> 	David
> 
I agree, that sounds interesting, but very cpu dependent.  Thanks for the
suggestion, Ben, but I think it would be better if we just did the prefetch here
and re-addressed this area when AVX (or adcx/adox) instructions were available
for testing on hardware.

Neil

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 17:37             ` Neil Horman
@ 2013-11-01 19:45               ` Joe Perches
  2013-11-01 19:58                 ` Neil Horman
  2013-11-04  9:47               ` David Laight
  1 sibling, 1 reply; 47+ messages in thread
From: Joe Perches @ 2013-11-01 19:45 UTC (permalink / raw)
  To: Neil Horman
  Cc: David Laight, Ben Hutchings, Doug Ledford, Ingo Molnar,
	Eric Dumazet, linux-kernel, netdev

On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote:

> I think it would be better if we just did the prefetch here
> > and re-addressed this area when AVX (or adcx/adox) instructions were available
> for testing on hardware.

Could there be a difference if only a single software
prefetch was done at the beginning of transfer before
the while loop and hardware prefetches did the rest?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 19:45               ` Joe Perches
@ 2013-11-01 19:58                 ` Neil Horman
  2013-11-01 20:26                   ` Joe Perches
  0 siblings, 1 reply; 47+ messages in thread
From: Neil Horman @ 2013-11-01 19:58 UTC (permalink / raw)
  To: Joe Perches
  Cc: David Laight, Ben Hutchings, Doug Ledford, Ingo Molnar,
	Eric Dumazet, linux-kernel, netdev

On Fri, Nov 01, 2013 at 12:45:29PM -0700, Joe Perches wrote:
> On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote:
> 
> > I think it would be better if we just did the prefetch here
> > > and re-addressed this area when AVX (or adcx/adox) instructions were available
> > for testing on hardware.
> 
> Could there be a difference if only a single software
> prefetch was done at the beginning of transfer before
> the while loop and hardware prefetches did the rest?
> 

I wouldn't think so.  If hardware is going to do any prefetching based on
memory access patterns, it will do so regardless of the leading prefetch, and
that first prefetch isn't helpful because we still wind up stalling on the adds
while it's completing.
Neil

> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 19:58                 ` Neil Horman
@ 2013-11-01 20:26                   ` Joe Perches
  2013-11-02  2:07                     ` Neil Horman
  0 siblings, 1 reply; 47+ messages in thread
From: Joe Perches @ 2013-11-01 20:26 UTC (permalink / raw)
  To: Neil Horman
  Cc: David Laight, Ben Hutchings, Doug Ledford, Ingo Molnar,
	Eric Dumazet, linux-kernel, netdev

On Fri, 2013-11-01 at 15:58 -0400, Neil Horman wrote:
> On Fri, Nov 01, 2013 at 12:45:29PM -0700, Joe Perches wrote:
> > On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote:
> > 
> > > I think it would be better if we just did the prefetch here
> > > > and re-addressed this area when AVX (or adcx/adox) instructions were available
> > > for testing on hardware.
> > 
> > Could there be a difference if only a single software
> > prefetch was done at the beginning of transfer before
> > the while loop and hardware prefetches did the rest?
> > 
> I wouldn't think so.  If hardware is going to do any prefetching based on
> memory access patterns, it will do so regardless of the leading prefetch, and
> that first prefetch isn't helpful because we still wind up stalling on the adds
> while it's completing.

I imagine one benefit to be helping prevent
prefetching beyond the actual data required.

Maybe some hardware optimizes prefetch stride
better than 5*64.

I wonder also if using

	if (count > some_length)
		prefetch
	while (...)

helps small lengths more than the test/jump cost.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 20:26                   ` Joe Perches
@ 2013-11-02  2:07                     ` Neil Horman
  0 siblings, 0 replies; 47+ messages in thread
From: Neil Horman @ 2013-11-02  2:07 UTC (permalink / raw)
  To: Joe Perches
  Cc: David Laight, Ben Hutchings, Doug Ledford, Ingo Molnar,
	Eric Dumazet, linux-kernel, netdev

On Fri, Nov 01, 2013 at 01:26:52PM -0700, Joe Perches wrote:
> On Fri, 2013-11-01 at 15:58 -0400, Neil Horman wrote:
> > On Fri, Nov 01, 2013 at 12:45:29PM -0700, Joe Perches wrote:
> > > On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote:
> > > 
> > > > I think it would be better if we just did the prefetch here
> > > > > and re-addressed this area when AVX (or adcx/adox) instructions were available
> > > > for testing on hardware.
> > > 
> > > Could there be a difference if only a single software
> > > prefetch was done at the beginning of transfer before
> > > the while loop and hardware prefetches did the rest?
> > > 
> > I wouldn't think so.  If hardware is going to do any prefetching based on
> > memory access patterns, it will do so regardless of the leading prefetch, and
> > that first prefetch isn't helpful because we still wind up stalling on the adds
> > while it's completing.
> 
> I imagine one benefit to be helping prevent
> prefetching beyond the actual data required.
> 
> Maybe some hardware optimizes prefetch stride
> better than 5*64.
> 
> I wonder also if using
> 
> 	if (count > some_length)
> 		prefetch
> 	while (...)
> 
> helps small lengths more than the test/jump cost.
> 
We've already done this and it is in fact the best performing.  I'll be posting
that patch, along with Ingo's requested addition of do_csum to the perf bench
code, when I have that done.
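
For the curious, the kind of gating being discussed looks roughly like the
sketch below (a guess at the shape, not the actual patch; the names and the
5*64 stride are illustrative, and add_64byte_chunk() is a hypothetical
stand-in for the usual unrolled adc chain):

	/* hypothetical helper: the usual 64-byte unrolled add/adc chain */
	extern unsigned long add_64byte_chunk(const unsigned char *buff,
					      unsigned long sum);

	static unsigned long csum_gated_prefetch(const unsigned char *buff,
						 unsigned long count,
						 unsigned long sum)
	{
		while (count > 5 * 64) {	/* big enough: prefetch well ahead */
			__builtin_prefetch(buff + 5 * 64);
			sum = add_64byte_chunk(buff, sum);
			buff += 64;
			count -= 64;
		}
		while (count >= 64) {		/* small buffers and the tail skip it */
			sum = add_64byte_chunk(buff, sum);
			buff += 64;
			count -= 64;
		}
		return sum;
	}
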
Best
Neil

> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 17:37             ` Neil Horman
  2013-11-01 19:45               ` Joe Perches
@ 2013-11-04  9:47               ` David Laight
  1 sibling, 0 replies; 47+ messages in thread
From: David Laight @ 2013-11-04  9:47 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ben Hutchings, Doug Ledford, Ingo Molnar, Eric Dumazet,
	linux-kernel, netdev

> > I think you need 3 instructions, move a 0, conditionally move a 1
> > then add. I suspect it won't be a win!

Or, with an appropriately unrolled loop, for each word:
	zero %eax, cmove a 1 to %al
	cmove a 1 to %ah
	shift %eax left, cmove a 1 to %al
	cmove a 1 to %ah, add %eax onto somewhere.
However the 2nd instruction stream would have to use a different
register (IIRC 8bit updates depend on the entire register).

> I agree, that sounds interesting, but very cpu dependent.  Thanks for the
> suggestion, Ben, but I think it would be better if we just did the prefetch here
> and re-addressed this area when AVX (or addcx/addox) instructions were available
> for testing on hardware.

I didn't look too closely at the original figures.
With a simple loop you need 4 instructions per iteration (load, adc, inc, branch).
How close to one iteration per clock do you get?
I thought x86 hardware prefetch would load the cache lines for sequential
accesses - so any prefetch instructions are rather pointless.
However reading the value in the previous loop iteration should help.

I've just realised that there is a problem with the loop termination
condition also needing the flags register:-(
I don't remember the 'loop' instruction ever being added to any of the
fast path instruction decodes - so it won't help.

So I suspect the best you'll get is an interleaved sequence of load and adc
with an lea and inc (both to adjust the index) and a bne back to the top.
(the lea wants to be in the middle somewhere).
That might manage 1 clock per word + 1 clock per loop iteration (if the inc
and bne can be 'fused').
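
Something in the shape of the below, say (untested, registers purely
illustrative): the adc chain keeps CF live across iterations because lea sets
no flags and inc leaves CF alone.

	clc				# start with carry clear; %rcx = -(words)
1:	movq	(%rsi), %rdx		# load the next word
	adcq	%rdx, %rax		# sum += word + carry
	leaq	8(%rsi), %rsi		# advance the pointer; lea sets no flags
	incq	%rcx			# count up towards zero; inc leaves CF alone
	jnz	1b
	adcq	$0, %rax		# fold the final carry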

	David

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2013-11-04  9:50 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-30  5:25 [PATCH] x86: Run checksumming in parallel accross multiple alu's Doug Ledford
2013-10-30 10:27 ` David Laight
2013-10-30 11:02 ` Neil Horman
2013-10-30 12:18   ` David Laight
2013-10-30 13:22     ` Doug Ledford
2013-10-30 13:35   ` Doug Ledford
2013-10-30 14:04     ` David Laight
2013-10-30 14:52     ` Neil Horman
2013-10-31 18:30     ` Neil Horman
2013-11-01  9:21       ` Ingo Molnar
2013-11-01 15:42       ` Ben Hutchings
2013-11-01 16:08         ` Neil Horman
2013-11-01 16:16           ` Ben Hutchings
2013-11-01 16:18           ` David Laight
2013-11-01 17:37             ` Neil Horman
2013-11-01 19:45               ` Joe Perches
2013-11-01 19:58                 ` Neil Horman
2013-11-01 20:26                   ` Joe Perches
2013-11-02  2:07                     ` Neil Horman
2013-11-04  9:47               ` David Laight
     [not found] <1381510298-20572-1-git-send-email-nhorman@tuxdriver.com>
     [not found] ` <20131012172124.GA18241@gmail.com>
     [not found]   ` <20131014202854.GH26880@hmsreliant.think-freely.org>
     [not found]     ` <1381785560.2045.11.camel@edumazet-glaptop.roam.corp.google.com>
     [not found]       ` <1381789127.2045.22.camel@edumazet-glaptop.roam.corp.google.com>
     [not found]         ` <20131017003421.GA31470@hmsreliant.think-freely.org>
2013-10-17  8:41           ` Ingo Molnar
2013-10-17 18:19             ` H. Peter Anvin
2013-10-17 18:48               ` Eric Dumazet
2013-10-18  6:43               ` Ingo Molnar
2013-10-28 16:01             ` Neil Horman
2013-10-28 16:20               ` Ingo Molnar
2013-10-28 17:49                 ` Neil Horman
2013-10-28 16:24               ` Ingo Molnar
2013-10-28 16:49                 ` David Ahern
2013-10-28 17:46                 ` Neil Horman
2013-10-28 18:29                   ` Neil Horman
2013-10-29  8:25                     ` Ingo Molnar
2013-10-29 11:20                       ` Neil Horman
2013-10-29 11:30                         ` Ingo Molnar
2013-10-29 11:49                           ` Neil Horman
2013-10-29 12:52                             ` Ingo Molnar
2013-10-29 13:07                               ` Neil Horman
2013-10-29 13:11                                 ` Ingo Molnar
2013-10-29 13:20                                   ` Neil Horman
2013-10-29 14:17                                   ` Neil Horman
2013-10-29 14:27                                     ` Ingo Molnar
2013-10-29 20:26                                       ` Neil Horman
2013-10-31 10:22                                         ` Ingo Molnar
2013-10-31 14:33                                           ` Neil Horman
2013-11-01  9:13                                             ` Ingo Molnar
2013-11-01 14:06                                               ` Neil Horman
2013-10-29 14:12                               ` David Ahern
