* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
@ 2013-10-18 17:42 Doug Ledford
  2013-10-19  8:23 ` Ingo Molnar
  0 siblings, 1 reply; 107+ messages in thread
From: Doug Ledford @ 2013-10-18 17:42 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Eric Dumazet, Doug Ledford, Neil Horman, linux-kernel

On 2013-10-17, Ingo wrote:
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
>> On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote:
>> > On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
>> > > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
>> > > 
>> > > > So, early testing results today.  I wrote a test module that allocated a 4k
>> > > > buffer, initialized it with random data, and called csum_partial on it 100000
>> > > > times, recording the time at the start and end of that loop.  Results on a 2.4
>> > > > GHz Intel Xeon processor:
>> > > > 
>> > > > Without patch: Average execute time for csum_partial was 808 ns
>> > > > With patch: Average execute time for csum_partial was 438 ns
>> > > 
>> > > Impressive, but could you try again with data out of cache ?
>> > 
>> > So I tried your patch on a GRE tunnel and got following results on a
>> > single TCP flow. (short result : no visible difference)

[ to Eric ]

You didn't show profile data from before and after the patch, only after.  And it
showed csum_partial at 19.9% IIRC.  That's a much better result than I get on my
test machines (this is on a rhel6.5-beta kernel, but understand that the entire
IB stack in rhel6.5-beta is at roughly a 3.10 level, with parts closer to 3.11+):

For IPoIB in connected mode, where there is no rx csum offload:

::::::::::::::
rhel6.5-beta-cm-no-offload-oprofile-run1
::::::::::::::
CPU: Intel Architectural Perfmon, speed 3392.17 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
Samples on CPU 0
Samples on CPU 4
Samples on CPU 6 (edited out as it was only a few samples and ruined
                  line wrapping)
samples  %        samples  %        image name   symbol name
98588    59.1431  215      57.9515  vmlinux      csum_partial_copy_generic
3003      1.8015  8         2.1563  vmlinux      tcp_sendmsg
2219      1.3312  0         0       vmlinux      irq_entries_start
2076      1.2454  4         1.0782  vmlinux      avc_has_perm_noaudit
1815      1.0888  0         0       mlx4_ib.ko   mlx4_ib_poll_cq

So, here anyway, it's 60%.  With it that high in the profile, there is a lot more
to be gained from an improvement to that function.  And here's the measured
performance
from those runs:

[root@rdma-master rhel6.5-beta-client]# more rhel6.5-beta-cm-no-offload-netperf.output 
Recv   Send    Send                          Utilization
Socket Socket  Message  Elapsed              Send     Recv
Size   Size    Size     Time     Throughput  local    remote
bytes  bytes   bytes    secs.    MBytes  /s  % S      % S
 87380  16384  16384    20.00      2815.29   7.92     12.80  
 87380  16384  16384    20.00      2798.22   7.88     12.87  
 87380  16384  16384    20.00      2786.74   7.79     12.84

The test machine has 8 logical CPUs, so 12.5% is 100% of a single CPU.  That
said, the receive side is obviously the bottleneck here, and 60% of that
bottleneck is csum_partial.

[ snip a bunch of Neil's measurements ]

>> Based on these, prefetching is obviously a good improvement, but not 
>> as good as parallel execution, and the winner by far is doing both.

OK, this is where I have to chime in that these tests can *not* be used
to say anything about prefetch, and not just for the reasons Ingo lists
in his various emails to this thread.  In fact I would argue that Ingo's
methodology on this is wrong as well.

All prefetch operations get sent to an access queue in the memory controller,
where they compete with both other reads and writes for the available memory
bandwidth.  The optimal prefetch window is not a function of memory bandwidth
and latency alone; it's a function of memory bandwidth, memory latency, the
depth of the memory access queue at the time the prefetch is issued, and the
memory bank switch time multiplied by the number of queued memory operations
that will require a bank switch.  In other words, it's much more complex and
also much more fluid than any static optimization can capture.  So every time
I see someone run a series of micro-benchmarks like you just did, where the
system was only doing the micro-benchmark and not a real workload, and
conclusions about optimal prefetch distances get drawn from that test, I
cringe inside and I think I even die... just a little.

A better test for this, IMO, would be to start a local kernel compile with at
least twice as many gcc instances allowed as you have CPUs, *then* run your
benchmark kernel module and see what prefetch distance works well.  This
distance should be far enough out that it can withstand other memory pressure,
yet not so far out that the prefetched line is constantly evicted by that same
pressure and has to be re-fetched (with a stall) when the load actually happens.  And it may
not benchmark as well on a quiescent system running only the micro-benchmark,
but it should end up performing better in actual real world usage.
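
To make that concrete, here is a minimal sketch of the kind of timing module
being discussed; the buffer size, iteration count, and names are illustrative
choices, not Neil's actual test code, and the memory pressure is meant to come
from the parallel kernel build running outside the module:

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/random.h>
#include <linux/ktime.h>
#include <asm/checksum.h>

#define CSUM_BUF_SIZE	4096
#define CSUM_ITERS	100000

static int __init csum_bench_init(void)
{
	void *buf = kmalloc(CSUM_BUF_SIZE, GFP_KERNEL);
	ktime_t start, end;
	__wsum sum = 0;
	int i;

	if (!buf)
		return -ENOMEM;
	get_random_bytes(buf, CSUM_BUF_SIZE);

	start = ktime_get();
	for (i = 0; i < CSUM_ITERS; i++)
		sum = csum_partial(buf, CSUM_BUF_SIZE, sum);
	end = ktime_get();

	/* report average cost per call; printing 'sum' keeps the loop
	 * from being optimized away */
	pr_info("csum_bench: %lld ns/call (sum %08x)\n",
		ktime_to_ns(ktime_sub(end, start)) / CSUM_ITERS,
		(__force u32)sum);

	kfree(buf);
	return -EAGAIN;	/* report an error so the module doesn't stay loaded */
}

module_init(csum_bench_init);
MODULE_LICENSE("GPL");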

> Also, it would be nice to see standard deviation noise numbers when two 
> averages are close to each other, to be able to tell whether differences 
> are statistically significant or not.
> 
> For example 'perf stat --repeat' will output stddev for you:
> 
>   comet:~/tip> perf stat --repeat 20 --null bash -c 'usleep $((RANDOM*10))'
> 
>    Performance counter stats for 'bash -c usleep $((RANDOM*10))' (20 runs):
> 
>        0.189084480 seconds time elapsed                                          ( +- 11.95% )

[ snip perf usage tips ]

I ran my original tests with oprofile.  I'll rerun the last one plus some new
tests with the various incarnations of this patch using perf and report the
results back here.

However, the machines I ran these tests on were limited by a 40GBit/s line
speed, which works out to a theoretical max of 4GBytes/s after the 8b/10b bit
encoding on the wire, and I think they are limited a bit lower still by the
theoretical limit of useful data across a PCI-e gen2 x8 bus.  So I wouldn't
expect the throughput to go much higher even if this helps; it should mainly
reduce CPU usage.  I can try the same tests on a 56GBit/s link and with cards
that have PCI-e gen3 and see how those machines do by comparison (the hosts
are identical, just the cards are different).

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
@ 2013-10-30  5:25 Doug Ledford
  2013-10-30 10:27 ` David Laight
  2013-10-30 11:02 ` Neil Horman
  0 siblings, 2 replies; 107+ messages in thread
From: Doug Ledford @ 2013-10-30  5:25 UTC (permalink / raw)
  To: Neil Horman; +Cc: Ingo Molnar, Eric Dumazet, Doug Ledford, linux-kernel, netdev

* Neil Horman <nhorman@tuxdriver.com> wrote:
> 3) The run times are proportionally larger, but still indicate that Parallel ALU
> execution is hurting rather than helping, which is counter-intuitive.  I'm
> looking into it, but thought you might want to see these results in case
> something jumped out at you

So here's my theory about all of this.

I think that the original observation some years back was a fluke caused by
either a buggy CPU or a CPU design that is no longer used.

The parallel ALU design of this patch seems OK at first glance, but it means
that two parallel operations are both trying to set/clear the overflow and
carry flags of the EFLAGS register of the *CPU* (not the ALU).  So either
some CPU in the past had a set of overflow/carry flags per ALU, with some
sort of magic to make sure that the last state of those flags across the
multiple ALUs used to parallelize the work always ended up in the CPU's
logical EFLAGS register, or the CPU had buggy microcode that allowed two
ALUs to operate on data at the same time in situations where they could
stomp on the carry/overflow flags of the other ALU's operations.

It's my theory that all modern CPUs have this behavior fixed, probably via a
microcode update, and so trying to do parallel ALU operations like this simply
has no effect because the CPU (rightly so) serializes the operations to keep
them from clobbering the overflow/carry flags of the other ALU's operations.

My additional theory, then, is that the reason you see a slowdown from this
patch is that the attempt to parallelize the ALU operations has caused us to
write a series of instructions that, once serialized, are non-optimal and
hinder smooth pipelining of the data (i.e. going 0*8, 2*8, 4*8, 6*8, 1*8,
3*8, 5*8, and 7*8 in terms of memory accesses is worse than doing them in
order, and since we aren't getting the parallel operation we wanted, that
non-optimal ordering is the net result of the patch).

It would explain things anyway.
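
For illustration, here is a small standalone C sketch of the two access orders
being compared (these helpers are made up for this sketch, not the kernel
code; the add_carry() helper folds the carry back in after each add, standing
in for the adcq chain, which gives an equivalent checksum even though the
intermediate values differ):

#include <stdint.h>

static uint64_t add_carry(uint64_t sum, uint64_t val)
{
	sum += val;
	return sum + (sum < val);	/* end-around carry, like adcq */
}

/* in-order walk of one 64-byte block: 0*8, 1*8, ..., 7*8 */
static uint64_t csum_in_order(const uint64_t *p)
{
	uint64_t sum = 0;
	int i;

	for (i = 0; i < 8; i++)
		sum = add_carry(sum, p[i]);
	return sum;
}

/* the split walk from the patch: 0*8, 2*8, 4*8, 6*8, then 1*8, 3*8, 5*8, 7*8 */
static uint64_t csum_split(const uint64_t *p)
{
	uint64_t s1 = 0, s2 = 0;
	int i;

	for (i = 0; i < 8; i += 2)
		s1 = add_carry(s1, p[i]);	/* even chain */
	for (i = 1; i < 8; i += 2)
		s2 = add_carry(s2, p[i]);	/* odd chain */
	return add_carry(s1, s2);
}

If the two chains do not actually execute in parallel, csum_split() is the
same amount of work as csum_in_order(), just with the less friendly memory
access order described above.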


* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
@ 2013-10-18 15:46 Doug Ledford
  0 siblings, 0 replies; 107+ messages in thread
From: Doug Ledford @ 2013-10-18 15:46 UTC (permalink / raw)
  To: Joe Perches; +Cc: Ingo Molnar, Eric Dumazet, linux-kernel

On Mon, 2013-10-14 at 22:49 -0700, Joe Perches wrote:
> On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
>> On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
>> > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
>> > > attached patch brings much better results
>> > > 
>> > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
>> > > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
>> > > Recv   Send    Send                          Utilization       Service Demand
>> > > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
>> > > Size   Size    Size     Time     Throughput  local    remote   local   remote
>> > > bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
>> > > 
>> > >  87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304  
>> > > 
>> > > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
>> > []
>> > > @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
>> > >  			zero = 0;
>> > >  			count64 = count >> 3;
>> > >  			while (count64) { 
>> > > -				asm("addq 0*8(%[src]),%[res]\n\t"
>> > > +				asm("prefetch 5*64(%[src])\n\t"
>> > 
>> > Might the prefetch size be too big here?
>> 
>> To be effective, you need to prefetch well ahead of time.
> 
> No doubt.
> 
>> 5*64 seems common practice (check arch/x86/lib/copy_page_64.S)
> 
> 5 cachelines for some processors seems like a lot.
> 
> Given you've got a test rig, maybe you could experiment
> with 2 and increase it until it doesn't get better.

You have a fundamental misunderstanding of the prefetch operation.  The 5*64
in the above asm statement is not a size, it is an offset from %[src], the
base pointer.  So it is saying to go to address %[src] + 5*64 and prefetch
there.  The prefetch size itself is always a cache line.  Once the address is
known, whatever cacheline holds that address is the cacheline we will
prefetch.  Your size concerns have no meaning.
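
In C terms, a hedged illustration of what that single instruction does (using
the compiler builtin rather than the raw prefetch opcode, with buff standing
in for the %[src] base pointer) would be:

	/* ask for the cache line containing the byte at buff + 5*64,
	 * i.e. roughly five 64-byte cache lines ahead of the current
	 * position; the hardware always fills a whole line, so there
	 * is no prefetch "size" to pick */
	__builtin_prefetch(buff + 5 * 64);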


* [PATCH] x86: Run checksumming in parallel accross multiple alu's
@ 2013-10-11 16:51 Neil Horman
  2013-10-12 17:21 ` Ingo Molnar
                   ` (2 more replies)
  0 siblings, 3 replies; 107+ messages in thread
From: Neil Horman @ 2013-10-11 16:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Neil Horman, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

Sébastien Dugué reported to me that devices implementing ipoib (which don't have
checksum offload hardware) were spending a significant amount of time computing
checksums.  We found that by splitting the checksum computation into two
separate streams, each skipping successive elements of the buffer being summed,
we could parallelize the checksum operation across multiple ALUs.  Since neither
chain is dependent on the result of the other, we get a speedup in execution (on
hardware that has multiple ALUs available, which is almost ubiquitous on x86),
and only a negligible slowdown on hardware that has only a single ALU (one extra
addition is introduced).  Since addition is commutative, the result is the same,
only faster.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: sebastien.dugue@bull.net
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: x86@kernel.org
---
 arch/x86/lib/csum-partial_64.c | 37 +++++++++++++++++++++++++------------
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index 9845371..2c7bc50 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -29,11 +29,12 @@ static inline unsigned short from32to16(unsigned a)
  * Things tried and found to not make it faster:
  * Manual Prefetching
  * Unrolling to an 128 bytes inner loop.
- * Using interleaving with more registers to break the carry chains.
  */
 static unsigned do_csum(const unsigned char *buff, unsigned len)
 {
 	unsigned odd, count;
+	unsigned long result1 = 0;
+	unsigned long result2 = 0;
 	unsigned long result = 0;
 
 	if (unlikely(len == 0))
@@ -68,22 +69,34 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
 			zero = 0;
 			count64 = count >> 3;
 			while (count64) { 
-				asm("addq 0*8(%[src]),%[res]\n\t"
-				    "adcq 1*8(%[src]),%[res]\n\t"
-				    "adcq 2*8(%[src]),%[res]\n\t"
-				    "adcq 3*8(%[src]),%[res]\n\t"
-				    "adcq 4*8(%[src]),%[res]\n\t"
-				    "adcq 5*8(%[src]),%[res]\n\t"
-				    "adcq 6*8(%[src]),%[res]\n\t"
-				    "adcq 7*8(%[src]),%[res]\n\t"
-				    "adcq %[zero],%[res]"
-				    : [res] "=r" (result)
+				asm("addq 0*8(%[src]),%[res1]\n\t"
+				    "adcq 2*8(%[src]),%[res1]\n\t"
+				    "adcq 4*8(%[src]),%[res1]\n\t"
+				    "adcq 6*8(%[src]),%[res1]\n\t"
+				    "adcq %[zero],%[res1]\n\t"
+
+				    "addq 1*8(%[src]),%[res2]\n\t"
+				    "adcq 3*8(%[src]),%[res2]\n\t"
+				    "adcq 5*8(%[src]),%[res2]\n\t"
+				    "adcq 7*8(%[src]),%[res2]\n\t"
+				    "adcq %[zero],%[res2]"
+				    : [res1] "=r" (result1),
+				      [res2] "=r" (result2)
 				    : [src] "r" (buff), [zero] "r" (zero),
-				    "[res]" (result));
+				      "[res1]" (result1), "[res2]" (result2));
 				buff += 64;
 				count64--;
 			}
 
+			asm("addq %[res1],%[res]\n\t"
+			    "adcq %[res2],%[res]\n\t"
+			    "adcq %[zero],%[res]"
+			    : [res] "=r" (result)
+			    : [res1] "r" (result1),
+			      [res2] "r" (result2),
+			      [zero] "r" (zero),
+			      "0" (result));
+
 			/* last up to 7 8byte blocks */
 			count %= 8; 
 			while (count) { 
-- 
1.8.3.1



Thread overview: 107+ messages
2013-10-18 17:42 [PATCH] x86: Run checksumming in parallel accross multiple alu's Doug Ledford
2013-10-19  8:23 ` Ingo Molnar
2013-10-21 17:54   ` Doug Ledford
2013-10-26 11:55     ` Ingo Molnar
2013-10-28 17:02       ` Doug Ledford
2013-10-29  8:38         ` Ingo Molnar
  -- strict thread matches above, loose matches on Subject: below --
2013-10-30  5:25 Doug Ledford
2013-10-30 10:27 ` David Laight
2013-10-30 11:02 ` Neil Horman
2013-10-30 12:18   ` David Laight
2013-10-30 13:22     ` Doug Ledford
2013-10-30 13:35   ` Doug Ledford
2013-10-30 14:04     ` David Laight
2013-10-30 14:52     ` Neil Horman
2013-10-31 18:30     ` Neil Horman
2013-11-01  9:21       ` Ingo Molnar
2013-11-01 15:42       ` Ben Hutchings
2013-11-01 16:08         ` Neil Horman
2013-11-01 16:16           ` Ben Hutchings
2013-11-01 16:18           ` David Laight
2013-11-01 17:37             ` Neil Horman
2013-11-01 19:45               ` Joe Perches
2013-11-01 19:58                 ` Neil Horman
2013-11-01 20:26                   ` Joe Perches
2013-11-02  2:07                     ` Neil Horman
2013-11-04  9:47               ` David Laight
2013-10-18 15:46 Doug Ledford
2013-10-11 16:51 Neil Horman
2013-10-12 17:21 ` Ingo Molnar
2013-10-13 12:53   ` Neil Horman
2013-10-14 20:28   ` Neil Horman
2013-10-14 21:19     ` Eric Dumazet
2013-10-14 22:18       ` Eric Dumazet
2013-10-14 22:37         ` Joe Perches
2013-10-14 22:44           ` Eric Dumazet
2013-10-14 22:49             ` Joe Perches
2013-10-15  7:41               ` Ingo Molnar
2013-10-15 10:51                 ` Borislav Petkov
2013-10-15 12:04                   ` Ingo Molnar
2013-10-15 16:21                 ` Joe Perches
2013-10-16  0:34                   ` Eric Dumazet
2013-10-16  6:25                   ` Ingo Molnar
2013-10-16 16:55                     ` Joe Perches
2013-10-17  0:34         ` Neil Horman
2013-10-17  1:42           ` Eric Dumazet
2013-10-18 16:50             ` Neil Horman
2013-10-18 17:20               ` Eric Dumazet
2013-10-18 20:11                 ` Neil Horman
2013-10-18 21:15                   ` Eric Dumazet
2013-10-20 21:29                     ` Neil Horman
2013-10-21 17:31                       ` Eric Dumazet
2013-10-21 17:46                         ` Neil Horman
2013-10-21 19:21                     ` Neil Horman
2013-10-21 19:44                       ` Eric Dumazet
2013-10-21 20:19                         ` Neil Horman
2013-10-26 12:01                           ` Ingo Molnar
2013-10-26 13:58                             ` Neil Horman
2013-10-27  7:26                               ` Ingo Molnar
2013-10-27 17:05                                 ` Neil Horman
2013-10-17  8:41           ` Ingo Molnar
2013-10-17 18:19             ` H. Peter Anvin
2013-10-17 18:48               ` Eric Dumazet
2013-10-18  6:43               ` Ingo Molnar
2013-10-28 16:01             ` Neil Horman
2013-10-28 16:20               ` Ingo Molnar
2013-10-28 17:49                 ` Neil Horman
2013-10-28 16:24               ` Ingo Molnar
2013-10-28 16:49                 ` David Ahern
2013-10-28 17:46                 ` Neil Horman
2013-10-28 18:29                   ` Neil Horman
2013-10-29  8:25                     ` Ingo Molnar
2013-10-29 11:20                       ` Neil Horman
2013-10-29 11:30                         ` Ingo Molnar
2013-10-29 11:49                           ` Neil Horman
2013-10-29 12:52                             ` Ingo Molnar
2013-10-29 13:07                               ` Neil Horman
2013-10-29 13:11                                 ` Ingo Molnar
2013-10-29 13:20                                   ` Neil Horman
2013-10-29 14:17                                   ` Neil Horman
2013-10-29 14:27                                     ` Ingo Molnar
2013-10-29 20:26                                       ` Neil Horman
2013-10-31 10:22                                         ` Ingo Molnar
2013-10-31 14:33                                           ` Neil Horman
2013-11-01  9:13                                             ` Ingo Molnar
2013-11-01 14:06                                               ` Neil Horman
2013-10-29 14:12                               ` David Ahern
2013-10-15  7:32     ` Ingo Molnar
2013-10-15 13:14       ` Neil Horman
2013-10-12 22:29 ` H. Peter Anvin
2013-10-13 12:53   ` Neil Horman
2013-10-18 16:42   ` Neil Horman
2013-10-18 17:09     ` H. Peter Anvin
2013-10-25 13:06       ` Neil Horman
2013-10-14  4:38 ` Andi Kleen
2013-10-14  7:49   ` Ingo Molnar
2013-10-14 21:07     ` Eric Dumazet
2013-10-15 13:17       ` Neil Horman
2013-10-14 20:25   ` Neil Horman
2013-10-15  7:12     ` Sébastien Dugué
2013-10-15 13:33       ` Andi Kleen
2013-10-15 13:56         ` Sébastien Dugué
2013-10-15 14:06           ` Eric Dumazet
2013-10-15 14:15             ` Sébastien Dugué
2013-10-15 14:26               ` Eric Dumazet
2013-10-15 14:52                 ` Eric Dumazet
2013-10-15 16:02                   ` Andi Kleen
2013-10-16  0:28                     ` Eric Dumazet
