From: Denys Vlasenko <dvlasenk@redhat.com>
To: Ingo Molnar <mingo@kernel.org>,
Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>,
Davidlohr Bueso <dave@stgolabs.net>, Peter Anvin <hpa@zytor.com>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Tim Chen <tim.c.chen@linux.intel.com>,
Borislav Petkov <bp@alien8.de>,
Peter Zijlstra <peterz@infradead.org>,
"Chandramouleeswaran, Aswin" <aswin@hp.com>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
Brian Gerst <brgerst@gmail.com>,
Paul McKenney <paulmck@linux.vnet.ibm.com>,
Thomas Gleixner <tglx@linutronix.de>,
Jason Low <jason.low2@hp.com>,
"linux-tip-commits@vger.kernel.org"
<linux-tip-commits@vger.kernel.org>,
Arjan van de Ven <arjan@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions
Date: Wed, 20 May 2015 13:29:22 +0200 [thread overview]
Message-ID: <555C7012.3040806@redhat.com> (raw)
In-Reply-To: <20150519213820.GA31688@gmail.com>
On 05/19/2015 11:38 PM, Ingo Molnar wrote:
> Here's the result from the Intel system:
>
> linux-falign-functions=_64-bytes/res.txt: 647,853,942 L1-icache-load-misses ( +- 0.07% ) (100.00%)
> linux-falign-functions=128-bytes/res.txt: 669,401,612 L1-icache-load-misses ( +- 0.08% ) (100.00%)
> linux-falign-functions=_32-bytes/res.txt: 685,969,043 L1-icache-load-misses ( +- 0.08% ) (100.00%)
> linux-falign-functions=256-bytes/res.txt: 699,130,207 L1-icache-load-misses ( +- 0.06% ) (100.00%)
> linux-falign-functions=512-bytes/res.txt: 699,130,207 L1-icache-load-misses ( +- 0.06% ) (100.00%)
> linux-falign-functions=_16-bytes/res.txt: 706,080,917 L1-icache-load-misses [vanilla kernel] ( +- 0.05% ) (100.00%)
> linux-falign-functions=__1-bytes/res.txt: 724,539,055 L1-icache-load-misses ( +- 0.31% ) (100.00%)
> linux-falign-functions=__4-bytes/res.txt: 725,707,848 L1-icache-load-misses ( +- 0.12% ) (100.00%)
> linux-falign-functions=__8-bytes/res.txt: 726,543,194 L1-icache-load-misses ( +- 0.04% ) (100.00%)
> linux-falign-functions=__2-bytes/res.txt: 738,946,179 L1-icache-load-misses ( +- 0.12% ) (100.00%)
> linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt: 921,910,808 L1-icache-load-misses ( +- 0.05% ) (100.00%)
>
> The optimal I$ miss rate is at 64 bytes - which is 9% better than the
> default kernel's I$ miss rate at 16 bytes alignment.
>
> The 128/256/512 bytes numbers show an increasing amount of cache
> misses: probably due to the artificially reduced associativity of the
> caching.
>
> Surprisingly there's a rather marked improvement in elapsed time as
> well:
>
> linux-falign-functions=_64-bytes/res.txt: 7.154816369 seconds time elapsed ( +- 0.03% )
> linux-falign-functions=_32-bytes/res.txt: 7.231074263 seconds time elapsed ( +- 0.12% )
> linux-falign-functions=__8-bytes/res.txt: 7.292203002 seconds time elapsed ( +- 0.30% )
> linux-falign-functions=128-bytes/res.txt: 7.314226040 seconds time elapsed ( +- 0.29% )
> linux-falign-functions=_16-bytes/res.txt: 7.333597250 seconds time elapsed [vanilla kernel] ( +- 0.48% )
> linux-falign-functions=__1-bytes/res.txt: 7.367139908 seconds time elapsed ( +- 0.28% )
> linux-falign-functions=__4-bytes/res.txt: 7.371721930 seconds time elapsed ( +- 0.26% )
> linux-falign-functions=__2-bytes/res.txt: 7.410033936 seconds time elapsed ( +- 0.34% )
> linux-falign-functions=256-bytes/res.txt: 7.507029637 seconds time elapsed ( +- 0.07% )
> linux-falign-functions=512-bytes/res.txt: 7.507029637 seconds time elapsed ( +- 0.07% )
> linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt: 8.531418784 seconds time elapsed ( +- 0.19% )
>
> the workload got 2.5% faster - which is pretty nice! This result is 5+
> standard deviations above the noise of the measurement.
>
> Side note: see how catastrophic -Os (CC_OPTIMIZE_FOR_SIZE=y)
> performance is: markedly higher cache miss rate despite a 'smaller'
> kernel, and the workload is 16.3% slower (!).
>
> Part of the -Os picture is that the -Os kernel is executing much more
> instructions:
>
> linux-falign-functions=_64-bytes/res.txt: 11,851,763,357 instructions ( +- 0.01% )
> linux-falign-functions=__1-bytes/res.txt: 11,852,538,446 instructions ( +- 0.01% )
> linux-falign-functions=_16-bytes/res.txt: 11,854,159,736 instructions ( +- 0.01% )
> linux-falign-functions=__4-bytes/res.txt: 11,864,421,708 instructions ( +- 0.01% )
> linux-falign-functions=__8-bytes/res.txt: 11,865,947,941 instructions ( +- 0.01% )
> linux-falign-functions=_32-bytes/res.txt: 11,867,369,566 instructions ( +- 0.01% )
> linux-falign-functions=128-bytes/res.txt: 11,867,698,477 instructions ( +- 0.01% )
> linux-falign-functions=__2-bytes/res.txt: 11,870,853,247 instructions ( +- 0.01% )
> linux-falign-functions=256-bytes/res.txt: 11,876,281,686 instructions ( +- 0.01% )
> linux-falign-functions=512-bytes/res.txt: 11,876,281,686 instructions ( +- 0.01% )
> linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt: 14,318,175,358 instructions ( +- 0.01% )
>
> 21.2% more instructions executed ... that cannot go well.
>
> So this should be a reminder that it's effective I$ footprint and
> number of instructions executed that matters to performance, not
> kernel size alone. With current GCC -Os should only be used on
> embedded systems where one is willing to make the kernel 10%+ slower,
> in exchange for a 20% smaller kernel.
Can you post your .config for the test?
If you have CONFIG_OPTIMIZE_INLINING=y in your -Os test,
consider re-testing with it turned off.
You may be seeing this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
> The AMD system, with a starkly different x86 microarchitecture, is
> showing similar characteristics:
>
> linux-falign-functions=_64-bytes/res-amd.txt: 108,886,550 L1-icache-load-misses ( +- 0.10% ) (100.00%)
> linux-falign-functions=_32-bytes/res-amd.txt: 110,433,214 L1-icache-load-misses ( +- 0.15% ) (100.00%)
> linux-falign-functions=__1-bytes/res-amd.txt: 113,623,200 L1-icache-load-misses ( +- 0.17% ) (100.00%)
> linux-falign-functions=128-bytes/res-amd.txt: 119,100,216 L1-icache-load-misses ( +- 0.22% ) (100.00%)
> linux-falign-functions=_16-bytes/res-amd.txt: 122,916,937 L1-icache-load-misses ( +- 0.15% ) (100.00%)
> linux-falign-functions=__8-bytes/res-amd.txt: 123,810,566 L1-icache-load-misses ( +- 0.18% ) (100.00%)
> linux-falign-functions=__2-bytes/res-amd.txt: 124,337,908 L1-icache-load-misses ( +- 0.71% ) (100.00%)
> linux-falign-functions=__4-bytes/res-amd.txt: 125,221,805 L1-icache-load-misses ( +- 0.09% ) (100.00%)
> linux-falign-functions=256-bytes/res-amd.txt: 135,761,433 L1-icache-load-misses ( +- 0.18% ) (100.00%)
> linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt: 159,918,181 L1-icache-load-misses ( +- 0.10% ) (100.00%)
> linux-falign-functions=512-bytes/res-amd.txt: 170,307,064 L1-icache-load-misses ( +- 0.26% ) (100.00%)
>
> 64 bytes is a similar sweet spot. Note that the penalty at 512 bytes
> is much steeper than on Intel systems: cache associativity is likely
> lower on this AMD CPU.
>
> Interestingly the 1 byte alignment result is still pretty good on AMD
> systems - and I used the exact same kernel image on both systems, so
> the layout of the functions is exactly the same.
>
> Elapsed time is noisier, but shows a similar trend:
>
> linux-falign-functions=_64-bytes/res-amd.txt: 1.928409143 seconds time elapsed ( +- 2.74% )
> linux-falign-functions=128-bytes/res-amd.txt: 1.932961745 seconds time elapsed ( +- 2.18% )
> linux-falign-functions=__8-bytes/res-amd.txt: 1.940703051 seconds time elapsed ( +- 1.84% )
> linux-falign-functions=__1-bytes/res-amd.txt: 1.940744001 seconds time elapsed ( +- 2.15% )
> linux-falign-functions=_32-bytes/res-amd.txt: 1.962074787 seconds time elapsed ( +- 2.38% )
> linux-falign-functions=_16-bytes/res-amd.txt: 2.000941789 seconds time elapsed ( +- 1.18% )
> linux-falign-functions=__4-bytes/res-amd.txt: 2.002305627 seconds time elapsed ( +- 2.75% )
> linux-falign-functions=256-bytes/res-amd.txt: 2.003218532 seconds time elapsed ( +- 3.16% )
> linux-falign-functions=__2-bytes/res-amd.txt: 2.031252839 seconds time elapsed ( +- 1.77% )
> linux-falign-functions=512-bytes/res-amd.txt: 2.080632439 seconds time elapsed ( +- 1.06% )
> linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt: 2.346644318 seconds time elapsed ( +- 2.19% )
>
> 64 bytes alignment is the sweet spot here as well, it's 3.7% faster
> than the default 16 bytes alignment.
In AMD, 64 bytes win too, yes, but by a *very* small margin.
8 bytes and 1 byte alignments have basically same timings,
and both take what, +0.63% of time longer to run?
linux-falign-functions=_64-bytes/res-amd.txt: 1.928409143 seconds time elapsed
linux-falign-functions=__8-bytes/res-amd.txt: 1.940703051 seconds time elapsed
linux-falign-functions=__1-bytes/res-amd.txt: 1.940744001 seconds time elapsed
I wouldn't say that it's the same as Intel. There the difference between 64 byte
alignment and no alignment at all is five times larger than for AMD, it's +3%:
linux-falign-functions=_64-bytes/res.txt: 7.154816369 seconds time elapsed
linux-falign-functions=_32-bytes/res.txt: 7.231074263 seconds time elapsed
linux-falign-functions=__8-bytes/res.txt: 7.292203002 seconds time elapsed
linux-falign-functions=_16-bytes/res.txt: 7.333597250 seconds time elapsed
linux-falign-functions=__1-bytes/res.txt: 7.367139908 seconds time elapsed
> So based on those measurements, I think we should do the exact
> opposite of my original patch that reduced alignment to 1 bytes, and
> increase kernel function address alignment from 16 bytes to the
> natural cache line size (64 bytes on modern CPUs).
> + #
> + # Allocate a separate cacheline for every function,
> + # for optimal instruction cache packing:
> + #
> + KBUILD_CFLAGS += -falign-functions=$(CONFIG_X86_FUNCTION_ALIGNMENT)
How about -falign-functions=CONFIG_X86_FUNCTION_ALIGNMENT/2 + 1 instead?
This avoids pathological cases where function starting just a few bytes after
64-bytes boundary gets aligned to the next one, wasting ~60 bytes.
--
vda
next prev parent reply other threads:[~2015-05-20 11:31 UTC|newest]
Thread overview: 108+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-04-08 19:39 [PATCH 0/2] locking: Simplify mutex and rwsem spinning code Jason Low
2015-04-08 19:39 ` [PATCH 1/2] locking/mutex: Further refactor mutex_spin_on_owner() Jason Low
2015-04-09 9:00 ` [tip:locking/core] locking/mutex: Further simplify mutex_spin_on_owner() tip-bot for Jason Low
2015-04-08 19:39 ` [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() Jason Low
2015-04-09 5:37 ` Ingo Molnar
2015-04-09 6:40 ` Jason Low
2015-04-09 7:53 ` Ingo Molnar
2015-04-09 16:47 ` Linus Torvalds
2015-04-09 17:56 ` Paul E. McKenney
2015-04-09 18:08 ` Linus Torvalds
2015-04-09 18:16 ` Linus Torvalds
2015-04-09 18:39 ` Paul E. McKenney
2015-04-10 9:00 ` [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock Ingo Molnar
2015-04-10 9:12 ` Ingo Molnar
2015-04-10 9:21 ` [PATCH] uaccess: Add __copy_from_kernel_inatomic() primitive Ingo Molnar
2015-04-10 11:14 ` [PATCH] x86/uaccess: Implement get_kernel() Ingo Molnar
2015-04-10 11:27 ` [PATCH] mutex: Improve mutex_spin_on_owner() code generation Ingo Molnar
2015-04-10 12:08 ` [PATCH] x86: Align jump targets to 1 byte boundaries Ingo Molnar
2015-04-10 12:18 ` [PATCH] x86: Pack function addresses tightly as well Ingo Molnar
2015-04-10 12:30 ` [PATCH] x86: Pack loops " Ingo Molnar
2015-04-10 13:46 ` Borislav Petkov
2015-05-15 9:40 ` [tip:x86/asm] " tip-bot for Ingo Molnar
2015-05-17 6:03 ` [tip:x86/apic] " tip-bot for Ingo Molnar
2015-05-15 9:39 ` [tip:x86/asm] x86: Pack function addresses " tip-bot for Ingo Molnar
2015-05-15 18:36 ` Linus Torvalds
2015-05-15 20:52 ` Denys Vlasenko
2015-05-17 5:58 ` Ingo Molnar
2015-05-17 7:09 ` Ingo Molnar
2015-05-17 7:30 ` Ingo Molnar
2015-05-18 9:28 ` Denys Vlasenko
2015-05-19 21:38 ` [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions Ingo Molnar
2015-05-20 0:47 ` Linus Torvalds
2015-05-20 12:21 ` Denys Vlasenko
2015-05-21 11:36 ` Ingo Molnar
2015-05-21 11:38 ` Denys Vlasenko
2016-04-16 21:08 ` Denys Vlasenko
2015-05-20 13:09 ` Ingo Molnar
2015-05-20 11:29 ` Denys Vlasenko [this message]
2015-05-21 13:28 ` Ingo Molnar
2015-05-21 14:03 ` Ingo Molnar
2015-04-10 12:50 ` [PATCH] x86: Align jump targets to 1 byte boundaries Denys Vlasenko
2015-04-10 13:18 ` H. Peter Anvin
2015-04-10 17:54 ` Ingo Molnar
2015-04-10 18:32 ` H. Peter Anvin
2015-04-11 14:41 ` Markus Trippelsdorf
2015-04-12 10:14 ` Ingo Molnar
2015-04-13 16:23 ` Markus Trippelsdorf
2015-04-13 17:26 ` Markus Trippelsdorf
2015-04-13 18:31 ` Linus Torvalds
2015-04-13 19:09 ` Markus Trippelsdorf
2015-04-14 5:38 ` Ingo Molnar
2015-04-14 8:23 ` Markus Trippelsdorf
2015-04-14 9:16 ` Ingo Molnar
2015-04-14 11:17 ` Markus Trippelsdorf
2015-04-14 12:09 ` Ingo Molnar
2015-04-10 18:48 ` Linus Torvalds
2015-04-12 23:44 ` Maciej W. Rozycki
2015-04-10 19:23 ` Daniel Borkmann
2015-04-11 13:48 ` Markus Trippelsdorf
2015-04-10 13:19 ` Borislav Petkov
2015-04-10 13:54 ` Denys Vlasenko
2015-04-10 14:01 ` Borislav Petkov
2015-04-10 14:53 ` Denys Vlasenko
2015-04-10 15:25 ` Borislav Petkov
2015-04-10 15:48 ` Denys Vlasenko
2015-04-10 15:54 ` Borislav Petkov
2015-04-10 21:44 ` Borislav Petkov
2015-04-10 18:54 ` Linus Torvalds
2015-04-10 14:10 ` Paul E. McKenney
2015-04-11 14:28 ` Josh Triplett
2015-04-11 9:20 ` [PATCH] x86: Turn off GCC branch probability heuristics Ingo Molnar
2015-04-11 17:41 ` Linus Torvalds
2015-04-11 18:57 ` Thomas Gleixner
2015-04-11 19:35 ` Linus Torvalds
2015-04-12 5:47 ` Ingo Molnar
2015-04-12 6:20 ` Markus Trippelsdorf
2015-04-12 10:15 ` Ingo Molnar
2015-04-12 7:56 ` Mike Galbraith
2015-04-12 7:41 ` Ingo Molnar
2015-04-12 8:07 ` Ingo Molnar
2015-04-12 21:11 ` Jan Hubicka
2015-05-14 11:59 ` [PATCH] x86: Align jump targets to 1 byte boundaries Denys Vlasenko
2015-05-14 18:17 ` Ingo Molnar
2015-05-14 19:04 ` Denys Vlasenko
2015-05-14 19:44 ` Ingo Molnar
2015-05-15 15:45 ` Josh Triplett
2015-05-17 5:34 ` Ingo Molnar
2015-05-17 19:18 ` Josh Triplett
2015-05-18 6:48 ` Ingo Molnar
2015-05-15 9:39 ` [tip:x86/asm] x86: Align jump targets to 1-byte boundaries tip-bot for Ingo Molnar
2015-04-10 11:34 ` [PATCH] x86/uaccess: Implement get_kernel() Peter Zijlstra
2015-04-10 18:04 ` Ingo Molnar
2015-04-10 17:49 ` Linus Torvalds
2015-04-10 18:04 ` Ingo Molnar
2015-04-10 18:09 ` Linus Torvalds
2015-04-10 14:20 ` [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock Paul E. McKenney
2015-04-10 17:44 ` Ingo Molnar
2015-04-10 18:05 ` Paul E. McKenney
2015-04-09 19:43 ` [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() Jason Low
2015-04-09 19:58 ` Paul E. McKenney
2015-04-09 20:58 ` Jason Low
2015-04-09 21:07 ` Paul E. McKenney
2015-04-09 19:59 ` Davidlohr Bueso
2015-04-09 20:36 ` Jason Low
2015-04-10 2:43 ` Andev
2015-04-10 9:04 ` Ingo Molnar
2015-04-08 19:49 ` [PATCH 0/2] locking: Simplify mutex and rwsem spinning code Davidlohr Bueso
2015-04-08 20:10 ` Jason Low
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=555C7012.3040806@redhat.com \
--to=dvlasenk@redhat.com \
--cc=a.p.zijlstra@chello.nl \
--cc=akpm@linux-foundation.org \
--cc=arjan@infradead.org \
--cc=aswin@hp.com \
--cc=bp@alien8.de \
--cc=brgerst@gmail.com \
--cc=dave@stgolabs.net \
--cc=hpa@zytor.com \
--cc=jason.low2@hp.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-tip-commits@vger.kernel.org \
--cc=luto@amacapital.net \
--cc=mingo@kernel.org \
--cc=paulmck@linux.vnet.ibm.com \
--cc=peterz@infradead.org \
--cc=tglx@linutronix.de \
--cc=tim.c.chen@linux.intel.com \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).