From: Ingo Molnar <mingo@kernel.org>
To: Neil Horman <nhorman@tuxdriver.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>,
linux-kernel@vger.kernel.org, sebastien.dugue@bull.net,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
x86@kernel.org, netdev@vger.kernel.org
Subject: Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
Date: Thu, 17 Oct 2013 10:41:21 +0200
Message-ID: <20131017084121.GC22705@gmail.com>
In-Reply-To: <20131017003421.GA31470@hmsreliant.think-freely.org>
* Neil Horman <nhorman@tuxdriver.com> wrote:
> On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> > > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
> > >
> > > > So, early testing results today. I wrote a test module that allocated a 4k
> > > > buffer, initialized it with random data, and called csum_partial on it 100000
> > > > times, recording the time at the start and end of that loop. Results on a 2.4
> > > > GHz Intel Xeon processor:
> > > >
> > > > Without patch: Average execute time for csum_partial was 808 ns
> > > > With patch: Average execute time for csum_partial was 438 ns
> > >
> > > Impressive, but could you try again with data out of cache?
> >
> > So I tried your patch on a GRE tunnel and got the following results on a
> > single TCP flow. (short result: no visible difference)
> >
> >
>
> So I went to reproduce these results, but was unable to (due to the fact that I
> only have a pretty jittery network to do testing across at the moment with
> these devices). So instead I figured I would go back to just doing
> measurements with the module that I cobbled together, operating under the
> assumption that it would give me accurate, relatively jitter-free results
> (I've attached the module code for reference below). My results show slightly
> different behavior:
>
> Base results runs:
> 89417240
> 85170397
> 85208407
> 89422794
> 91645494
> 103655144
> 86063791
> 75647774
> 83502921
> 85847372
> AVG = 875 ns
>
> Prefetch only runs:
> 70962849
> 77555099
> 81898170
> 68249290
> 72636538
> 83039294
> 78561494
> 83393369
> 85317556
> 79570951
> AVG = 781 ns
>
> Parallel addition only runs:
> 42024233
> 44313064
> 48304416
> 64762297
> 42994259
> 41811628
> 55654282
> 64892958
> 55125582
> 42456403
> AVG = 510 ns
>
>
> Both prefetch and parallel addition:
> 41329930
> 40689195
> 61106622
> 46332422
> 49398117
> 52525171
> 49517101
> 61311153
> 43691814
> 49043084
> AVG = 494 ns
>
>
> For reference, each of the above large numbers is the number of
> nanoseconds taken to compute the checksum of a 4kb buffer 100000 times.
> To get my average results, I ran the test in a loop 10 times, averaged
> them, and divided by 100000.
>
> Based on these, prefetching is obviously a good improvement, but not
> as good as parallel execution, and the winner by far is doing both.
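For reference, the measurement described above presumably boils down to a
loop like the one below - this is only an illustrative sketch, not the
attached module; csum_partial(), the 4k buffer and the 100000 iterations are
taken from the description, everything else is made up:

/*
 * Illustrative reconstruction of a hot-cache csum_partial() timing
 * loop: one 4K buffer, checksummed 100000 times back to back.
 */
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/random.h>
#include <linux/ktime.h>
#include <asm/checksum.h>

#define BUF_SIZE	4096
#define ITERATIONS	100000

static int __init csum_test_init(void)
{
	void *buf;
	ktime_t start, end;
	__wsum sum = 0;
	int i;

	buf = kmalloc(BUF_SIZE, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;
	get_random_bytes(buf, BUF_SIZE);

	start = ktime_get();
	for (i = 0; i < ITERATIONS; i++)
		sum = csum_partial(buf, BUF_SIZE, sum);
	end = ktime_get();

	pr_info("csum test: %lld ns for %d iterations (sum: %x)\n",
		ktime_to_ns(ktime_sub(end, start)), ITERATIONS, (u32)sum);

	kfree(buf);
	return 0;
}

static void __exit csum_test_exit(void) { }

module_init(csum_test_init);
module_exit(csum_test_exit);
MODULE_LICENSE("GPL");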
But in the actual use case mentioned the packet data was likely cache-cold:
it just arrived in the NIC and an IRQ got sent. Your testcase uses a
super-hot 4K buffer that fits into the L1 cache. So it's apples to
oranges.
To correctly simulate the workload you'd have to:

 - allocate a buffer larger than your L2 cache.

 - randomize the individual buffer positions, to measure the effects of the
   prefetches. See how 'perf bench numa' implements a random walk via
   --data_rand_walk, in tools/perf/bench/numa.c.

Otherwise the CPU might learn your simplistic stream direction and the L2
cache might hw-prefetch your data, interfering with any explicit prefetches
the code does. In many real-life use cases packet buffers are scattered. A
rough sketch of such a cache-cold loop follows below.
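A minimal sketch of that cache-cold variant, reusing the module skeleton from
the sketch above (the 64 MB buffer size, the memset() fill and the
prandom_u32() offset scheme are illustrative choices, not anything from this
thread):

/* needs, in addition: <linux/vmalloc.h> and <linux/string.h> */

#define COLD_BUF_SIZE	(64UL * 1024 * 1024)	/* well beyond a typical L2 */
#define CHUNK_SIZE	4096

static int __init csum_cold_init(void)
{
	char *buf;
	ktime_t start, end;
	__wsum sum = 0;
	unsigned long nr_chunks = COLD_BUF_SIZE / CHUNK_SIZE;
	unsigned long off;
	int i;

	buf = vmalloc(COLD_BUF_SIZE);
	if (!buf)
		return -ENOMEM;
	memset(buf, 0x5a, COLD_BUF_SIZE);

	start = ktime_get();
	for (i = 0; i < ITERATIONS; i++) {
		/*
		 * Checksum a random 4K chunk each iteration: the data is
		 * unlikely to be cache resident and the hw prefetcher
		 * cannot lock onto a simple stream direction. (The
		 * prandom_u32() cost is part of the measurement.)
		 */
		off = (unsigned long)(prandom_u32() % nr_chunks) * CHUNK_SIZE;
		sum = csum_partial(buf + off, CHUNK_SIZE, sum);
	}
	end = ktime_get();

	pr_info("cold csum test: %lld ns for %d iterations (sum: %x)\n",
		ktime_to_ns(ktime_sub(end, start)), ITERATIONS, (u32)sum);

	vfree(buf);
	return 0;
}

With per-iteration random offsets the prefetch numbers above may well look
very different, which is the point of the exercise.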
Also, it would be nice to see standard deviation noise numbers when two
averages are close to each other, to be able to tell whether differences
are statistically significant or not.
For example 'perf stat --repeat' will output stddev for you:
comet:~/tip> perf stat --repeat 20 --null bash -c 'usleep $((RANDOM*10))'
Performance counter stats for 'bash -c usleep $((RANDOM*10))' (20 runs):
0.189084480 seconds time elapsed ( +- 11.95% )
The last '+-' percentage is the noise of the measurement.
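If you keep using the standalone module instead of perf, the same kind of
noise number is easy to get from the per-run totals it prints - a throwaway
userspace helper along these lines would do (the array below holds the first
three 'base' totals quoted above merely as placeholders; paste in all 10):

/* gcc -O2 -o runstats runstats.c -lm */
#include <stdio.h>
#include <math.h>

int main(void)
{
	/* per-run totals in ns, i.e. 100000 csum_partial() calls each */
	double runs[] = { 89417240, 85170397, 85208407 /* , ... */ };
	int i, n = sizeof(runs) / sizeof(runs[0]);
	double sum = 0.0, var = 0.0, mean;

	for (i = 0; i < n; i++)
		sum += runs[i];
	mean = sum / n;

	for (i = 0; i < n; i++)
		var += (runs[i] - mean) * (runs[i] - mean);
	var /= n - 1;			/* sample variance */

	printf("mean: %.0f ns  stddev: %.0f ns  ( +- %.1f%% )\n",
	       mean, sqrt(var), 100.0 * sqrt(var) / mean);
	return 0;
}

If two averages differ by less than the run-to-run noise, the difference is
not really meaningful.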
Also note that you can inspect many cache behavior details of your
algorithm via perf stat - the -ddd option will give you a laundry list:
aldebaran:~> perf stat --repeat 20 -ddd perf bench sched messaging
...
Total time: 0.095 [sec]
Performance counter stats for 'perf bench sched messaging' (20 runs):
1519.128721 task-clock (msec) # 12.305 CPUs utilized ( +- 0.34% )
22,882 context-switches # 0.015 M/sec ( +- 2.84% )
3,927 cpu-migrations # 0.003 M/sec ( +- 2.74% )
16,616 page-faults # 0.011 M/sec ( +- 0.17% )
2,327,978,366 cycles # 1.532 GHz ( +- 1.61% ) [36.43%]
1,715,561,189 stalled-cycles-frontend # 73.69% frontend cycles idle ( +- 1.76% ) [38.05%]
715,715,454 stalled-cycles-backend # 30.74% backend cycles idle ( +- 2.25% ) [39.85%]
1,253,106,346 instructions # 0.54 insns per cycle
# 1.37 stalled cycles per insn ( +- 1.71% ) [49.68%]
241,181,126 branches # 158.763 M/sec ( +- 1.43% ) [47.83%]
4,232,053 branch-misses # 1.75% of all branches ( +- 1.23% ) [48.63%]
431,907,354 L1-dcache-loads # 284.313 M/sec ( +- 1.00% ) [48.37%]
20,550,528 L1-dcache-load-misses # 4.76% of all L1-dcache hits ( +- 0.82% ) [47.61%]
7,435,847 LLC-loads # 4.895 M/sec ( +- 0.94% ) [36.11%]
2,419,201 LLC-load-misses # 32.53% of all LL-cache hits ( +- 2.93% ) [ 7.33%]
448,638,547 L1-icache-loads # 295.326 M/sec ( +- 2.43% ) [21.75%]
22,066,490 L1-icache-load-misses # 4.92% of all L1-icache hits ( +- 2.54% ) [30.66%]
475,557,948 dTLB-loads # 313.047 M/sec ( +- 1.96% ) [37.96%]
6,741,523 dTLB-load-misses # 1.42% of all dTLB cache hits ( +- 2.38% ) [37.05%]
1,268,628,660 iTLB-loads # 835.103 M/sec ( +- 1.75% ) [36.45%]
74,192 iTLB-load-misses # 0.01% of all iTLB cache hits ( +- 2.88% ) [36.19%]
4,466,526 L1-dcache-prefetches # 2.940 M/sec ( +- 1.61% ) [36.17%]
2,396,311 L1-dcache-prefetch-misses # 1.577 M/sec ( +- 1.55% ) [35.71%]
0.123459566 seconds time elapsed ( +- 0.58% )
There are also a number of prefetch counters that might be useful:
aldebaran:~> perf list | grep prefetch
L1-dcache-prefetches [Hardware cache event]
L1-dcache-prefetch-misses [Hardware cache event]
LLC-prefetches [Hardware cache event]
LLC-prefetch-misses [Hardware cache event]
node-prefetches [Hardware cache event]
node-prefetch-misses [Hardware cache event]
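Which of these are actually backed by real hardware events depends on the
CPU, but for instance counting prefetch activity system-wide while the test
module runs could look like this (illustrative invocation):

  aldebaran:~> perf stat -a -e L1-dcache-prefetches,L1-dcache-prefetch-misses sleep 10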
Thanks,
Ingo