From: Ingo Molnar <mingo@kernel.org>
To: Doug Ledford <dledford@redhat.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>,
Neil Horman <nhorman@tuxdriver.com>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
Date: Sat, 19 Oct 2013 10:23:14 +0200
Message-ID: <20131019082314.GA7778@gmail.com>
In-Reply-To: <201310181742.r9IHgO1q021001@ib.usersys.redhat.com>
* Doug Ledford <dledford@redhat.com> wrote:
> >> Based on these, prefetching is obviously a good improvement, but
> >> not as good as parallel execution, and the winner by far is doing
> >> both.
>
> OK, this is where I have to chime in that these tests can *not* be used
> to say anything about prefetch, and not just for the reasons Ingo lists
> in his various emails to this thread. In fact I would argue that Ingo's
> methodology on this is wrong as well.
Well, I didn't go into as many details as you - but I agree with your full
list obviously:
> All prefetch operations get sent to an access queue in the memory
> controller where they compete with both other reads and writes for the
> available memory bandwidth. The optimal prefetch window is not a factor
> of memory bandwidth and latency, it's a factor of memory bandwidth,
> memory latency, current memory access queue depth at time prefetch is
> issued, and memory bank switch time * number of queued memory operations
> that will require a bank switch. In other words, it's much more complex
> and also much more fluid than any static optimization can pull out.
> [...]
But this is generally true of _any_ static optimization - CPUs are complex,
workloads are complex, other threads, CPUs, sockets, devices might
interact, etc.
Yet it does not make it invalid to optimize for the isolated, static
usecase that was offered, because 'dynamism' and parallelism in a real
system will rarely make that optimization completely invalid, it will
typically only diminish its fruits to a certain degree (for example by
causing prefetches to be discarded).
What I was objecting to strongly here was measuring the _wrong_ thing,
i.e. the cache-hot case. The cache-cold case should be measured in a low
noise fashion, so that results are representative. It's closer to the real
usecase than any other microbenchmark. That will give us a usable speedup
figure and will tell us which technique helped how much and which
parameter should be how large.
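To make the cache-cold measurement concrete, here is a minimal userspace sketch, not the kernel's csum_partial() and not the benchmark module from this thread. Everything named here (csum_simple, csum_variant, evict_caches, time_cold_pass_ns, the junk buffer) is hypothetical. It assumes GCC or Clang for __builtin_prefetch, a POSIX clock_gettime(), and that the caller supplies a junk buffer larger than the last-level cache so that walking it evicts the data under test; hardware prefetchers are not defeated by this.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() under strict -std= */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Fold a 64-bit accumulator down to a 16-bit ones' complement sum. */
static uint16_t fold64(uint64_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Baseline: a single accumulator, no software prefetch. */
static uint16_t csum_simple(const uint32_t *buf, size_t nwords)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < nwords; i++)
		sum += buf[i];
	return fold64(sum);
}

/*
 * Variant under test: two independent accumulators, so the two adds do
 * not depend on each other, plus an optional software prefetch
 * 'prefetch_words' ahead (0 disables it). The prefetch is skipped once
 * it would point past the end of the buffer.
 */
static uint16_t csum_variant(const uint32_t *buf, size_t nwords,
			     size_t prefetch_words)
{
	uint64_t a = 0, b = 0;
	size_t i = 0;

	for (; i + 2 <= nwords; i += 2) {
		if (prefetch_words && i + prefetch_words < nwords)
			__builtin_prefetch(&buf[i + prefetch_words]);
		a += buf[i];
		b += buf[i + 1];
	}
	if (i < nwords)
		a += buf[i];
	return fold64(a + b);
}

/* Touch one byte per assumed 64-byte line of 'junk' to push 'buf' out. */
static void evict_caches(volatile unsigned char *junk, size_t junk_bytes)
{
	for (size_t i = 0; i < junk_bytes; i += 64)
		junk[i]++;
}

/* Time one pass over 'buf' right after eviction; returns nanoseconds. */
static int64_t time_cold_pass_ns(const uint32_t *buf, size_t nwords,
				 size_t prefetch_words,
				 volatile unsigned char *junk,
				 size_t junk_bytes, uint16_t *out)
{
	struct timespec t0, t1;

	assert(out != NULL);
	evict_caches(junk, junk_bytes);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	*out = csum_variant(buf, nwords, prefetch_words);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL +
	       (int64_t)(t1.tv_nsec - t0.tv_nsec);
}
```

One would call time_cold_pass_ns() many times per prefetch distance and compare the distributions (minimum and median, say) rather than a single sample, and check that the returned checksum matches csum_simple() so that a broken variant is never mistaken for a fast one.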
> [...] So every time I see someone run a series of micro-benchmarks
> like you just did, where the system was only doing the micro-benchmark
> and not a real workload, and we draw conclusions about optimal prefetch
> distances from that test, I cringe inside and I think I even die... just
> a little.
So the thing is, microbenchmarks can indeed be misleading - and, as in this
case, the cache-hot claims can be outright dangerously misleading.
Yet, if done correctly and interpreted correctly, they tell us a little
bit of the truth and are often correlated with real performance.
Do microbenchmarks show us everything that a 'real' workload exhibits? Not
at all, they are way too simple for that. They are a shortcut, an
indicator, which is often helpful as long as it is not taken as 'the'
performance of the system.
> A better test for this, IMO, would be to start a local kernel compile
> with at least twice as many gcc instances allowed as you have CPUs,
> *then* run your benchmark kernel module and see what prefetch distance
> works well. [...]
I don't agree that this represents our optimization target. It may
represent _one_ optimization target. But many other important usecases
such as a dedicated file server, or a computation node that is
cache-optimized, would be unlikely to show such high parallel memory
pressure as a GCC compilation.
> [...] This distance should be far enough out that it can withstand
> other memory pressure, yet not so far as to constantly be prefetching,
> tossing the result out of cache due to pressure, then fetching/stalling
> that same memory on load. And it may not benchmark as well on a
> quiescent system running only the micro-benchmark, but it should end up
> performing better in actual real world usage.
The 'fully adversarial' case where all resources are maximally competed
for by all other cores is actually pretty rare in practice. I don't say it
does not happen or that it does not matter, but I do say there are many
other important usecases as well.
More importantly, the 'maximally adversarial' case is very hard to
generate and validate, and it's highly system dependent!
Cache-cold (and cache-hot) microbenchmarks on the other hand tend to be
more stable, because they typically reflect current physical (mostly
latency) limits of CPU and system technology, _not_ highly system
dependent resource sizing (mostly bandwidth) limits which are very hard to
optimize for in a generic fashion.
Cache-cold and cache-hot measurements are, in a way, important physical
'eigenvalues' of a complex system. If they both show speedups then it's
likely that a more dynamic, contended for, mixed workload will show
speedups as well. And these 'eigenvalues' are statistically much more
stable across systems, and that's something we care for when we implement
various lowlevel assembly routines in arch/x86/ which cover many different
systems with different bandwidth characteristics.
I hope I managed to explain my views clearly enough on this.
Thanks,
Ingo