From: Neil Horman <nhorman@tuxdriver.com>
To: Ingo Molnar <mingo@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>,
linux-kernel@vger.kernel.org, sebastien.dugue@bull.net,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
x86@kernel.org, netdev@vger.kernel.org
Subject: Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
Date: Tue, 29 Oct 2013 16:26:44 -0400 [thread overview]
Message-ID: <20131029202644.GB32389@localhost.localdomain> (raw)
In-Reply-To: <20131029142716.GA28113@gmail.com>
On Tue, Oct 29, 2013 at 03:27:16PM +0100, Ingo Molnar wrote:
>
> * Neil Horman <nhorman@tuxdriver.com> wrote:
>
> > So, I apologize, you were right. I was running the test.sh script
> > but perf was measuring itself. [...]
>
> Ok, cool - one mystery less!
>
> > Which overall looks a lot more like I expect, save for the parallel
> > ALU cases. It seems here that the parallel ALU changes actually
> > hurt performance, which really seems counter-intuitive. I don't
> > yet have any explanation for that. I do note that we seem to have
> > more stalls in the "both" case, so perhaps the parallel chains call
> > for a more aggressive prefetch. Do you have any thoughts?
>
> Note that with -ddd you 'overload' the PMU with more counters than
> can be run at once, which introduces extra noise. Since you are
> running the tests for 0.150 secs or so, the results are not very
> representative:
>
> 734 dTLB-load-misses # 0.00% of all dTLB cache hits ( +- 8.40% ) [13.94%]
> 13,314,660 iTLB-loads # 280.759 M/sec ( +- 0.05% ) [12.97%]
>
> with such low runtimes those results are very hard to trust.
>
> So -ddd is typically used to pick up the most interesting PMU events
> you want to see measured, and then use them like this:
>
> -e dTLB-load-misses -e iTLB-loads
>
> etc. For such short runtimes make sure the last column displays
> close to 100%, so that the PMU results become trustable.
>
> A nehalem+ PMU will allow 2-4 events to be measured in parallel,
> plus generics like 'cycles', 'instructions' can be added 'for free'
> because they get counted in a separate (fixed purpose) PMU register.
>
> The last column tells you what percentage of the runtime that
> particular event was actually active. 100% (or an empty last column)
> means it was active all the time.
>
> Thanks,
>
> Ingo
>
Hmm,
I ran this test:

for i in `seq 0 1 3`
do
	echo $i > /sys/module/csum_test/parameters/module_test_mode
	taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
done

And I updated the test module to run for a million iterations rather than 100,000 to increase the sample size, and got this:
Base:
Performance counter stats for './test.sh' (20 runs):
47,305,064 L1-dcache-load-misses # 2.09% of all L1-dcache hits ( +- 0.04% ) [18.74%]
0 L1-dcache-prefetches [18.75%]
13,906,212,348 cycles # 0.000 GHz ( +- 0.05% ) [18.76%]
4,426,395,949 instructions # 0.32 insns per cycle ( +- 0.01% ) [18.77%]
2,261,551,278 L1-dcache-loads ( +- 0.02% ) [18.76%]
47,287,226 L1-dcache-load-misses # 2.09% of all L1-dcache hits ( +- 0.04% ) [18.76%]
276,842,685 LLC-loads ( +- 0.01% ) [18.76%]
46,454,114 LLC-load-misses # 16.78% of all LL-cache hits ( +- 0.05% ) [18.76%]
1,048,894,486 L1-icache-loads ( +- 0.07% ) [18.76%]
472,205 L1-icache-load-misses # 0.05% of all L1-icache hits ( +- 1.19% ) [18.76%]
2,260,639,613 dTLB-loads ( +- 0.01% ) [18.75%]
172 dTLB-load-misses # 0.00% of all dTLB cache hits ( +- 35.14% ) [18.74%]
1,048,732,481 iTLB-loads ( +- 0.07% ) [18.74%]
19 iTLB-load-misses # 0.00% of all iTLB cache hits ( +- 39.75% ) [18.73%]
0 L1-dcache-prefetches [18.73%]
0 L1-dcache-prefetch-misses [18.73%]
5.370546698 seconds time elapsed ( +- 0.05% )
Prefetch:
Performance counter stats for './test.sh' (20 runs):
124,885,469 L1-dcache-load-misses # 4.96% of all L1-dcache hits ( +- 0.09% ) [18.74%]
0 L1-dcache-prefetches [18.75%]
11,434,328,889 cycles # 0.000 GHz ( +- 1.11% ) [18.77%]
4,601,831,553 instructions # 0.40 insns per cycle ( +- 0.01% ) [18.77%]
2,515,483,814 L1-dcache-loads ( +- 0.01% ) [18.77%]
124,928,127 L1-dcache-load-misses # 4.97% of all L1-dcache hits ( +- 0.09% ) [18.76%]
323,355,145 LLC-loads ( +- 0.02% ) [18.76%]
123,008,548 LLC-load-misses # 38.04% of all LL-cache hits ( +- 0.10% ) [18.75%]
1,256,391,060 L1-icache-loads ( +- 0.01% ) [18.75%]
374,691 L1-icache-load-misses # 0.03% of all L1-icache hits ( +- 1.41% ) [18.75%]
2,514,984,046 dTLB-loads ( +- 0.01% ) [18.75%]
67 dTLB-load-misses # 0.00% of all dTLB cache hits ( +- 51.81% ) [18.74%]
1,256,333,548 iTLB-loads ( +- 0.01% ) [18.74%]
19 iTLB-load-misses # 0.00% of all iTLB cache hits ( +- 39.74% ) [18.74%]
0 L1-dcache-prefetches [18.73%]
0 L1-dcache-prefetch-misses [18.73%]
4.496839773 seconds time elapsed ( +- 0.64% )
Parallel ALU:
Performance counter stats for './test.sh' (20 runs):
49,489,518 L1-dcache-load-misses # 2.19% of all L1-dcache hits ( +- 0.09% ) [18.74%]
0 L1-dcache-prefetches [18.76%]
13,777,501,365 cycles # 0.000 GHz ( +- 1.73% ) [18.78%]
4,707,160,703 instructions # 0.34 insns per cycle ( +- 0.01% ) [18.78%]
2,261,693,074 L1-dcache-loads ( +- 0.02% ) [18.78%]
49,468,878 L1-dcache-load-misses # 2.19% of all L1-dcache hits ( +- 0.09% ) [18.77%]
279,524,254 LLC-loads ( +- 0.01% ) [18.76%]
48,491,934 LLC-load-misses # 17.35% of all LL-cache hits ( +- 0.12% ) [18.75%]
1,057,877,680 L1-icache-loads ( +- 0.02% ) [18.74%]
461,784 L1-icache-load-misses # 0.04% of all L1-icache hits ( +- 1.87% ) [18.74%]
2,260,978,836 dTLB-loads ( +- 0.02% ) [18.74%]
27 dTLB-load-misses # 0.00% of all dTLB cache hits ( +- 89.96% ) [18.74%]
1,057,886,632 iTLB-loads ( +- 0.02% ) [18.74%]
4 iTLB-load-misses # 0.00% of all iTLB cache hits ( +-100.00% ) [18.74%]
0 L1-dcache-prefetches [18.73%]
0 L1-dcache-prefetch-misses [18.73%]
5.500417234 seconds time elapsed ( +- 1.60% )
Both:
Performance counter stats for './test.sh' (20 runs):
116,621,570 L1-dcache-load-misses # 4.68% of all L1-dcache hits ( +- 0.04% ) [18.73%]
0 L1-dcache-prefetches [18.75%]
11,597,067,510 cycles # 0.000 GHz ( +- 1.73% ) [18.77%]
4,952,251,361 instructions # 0.43 insns per cycle ( +- 0.01% ) [18.77%]
2,493,003,710 L1-dcache-loads ( +- 0.02% ) [18.77%]
116,640,333 L1-dcache-load-misses # 4.68% of all L1-dcache hits ( +- 0.04% ) [18.77%]
322,246,216 LLC-loads ( +- 0.03% ) [18.76%]
114,528,956 LLC-load-misses # 35.54% of all LL-cache hits ( +- 0.04% ) [18.76%]
999,371,469 L1-icache-loads ( +- 0.02% ) [18.76%]
406,679 L1-icache-load-misses # 0.04% of all L1-icache hits ( +- 1.97% ) [18.75%]
2,492,708,710 dTLB-loads ( +- 0.01% ) [18.75%]
140 dTLB-load-misses # 0.00% of all dTLB cache hits ( +- 38.46% ) [18.74%]
999,320,389 iTLB-loads ( +- 0.01% ) [18.74%]
19 iTLB-load-misses # 0.00% of all iTLB cache hits ( +- 39.90% ) [18.73%]
0 L1-dcache-prefetches [18.73%]
0 L1-dcache-prefetch-misses [18.72%]
4.634419247 seconds time elapsed ( +- 1.60% )
I note a few oddities here:

1) We seem to be getting more counter results than I specified; I'm not sure
   why, though presumably -ddd adds its own detailed event list on top of the
   events given with -e.

2) The "% active" column adds up to way more than 100% (which from my read of
   the man page makes sense, given that multiple counters might increment in
   response to a single instruction execution).

3) The run times are proportionally larger, but still indicate that parallel
   ALU execution is hurting rather than helping, which is counter-intuitive.
   I'm looking into it, but thought you might want to see these results in
   case something jumped out at you.
Regards
Neil