Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions

From: Denys Vlasenko <dvlasenk@redhat.com>
To: Ingo Molnar <mingo@kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>,
	Davidlohr Bueso <dave@stgolabs.net>, Peter Anvin <hpa@zytor.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Borislav Petkov <bp@alien8.de>,
	Peter Zijlstra <peterz@infradead.org>,
	"Chandramouleeswaran, Aswin" <aswin@hp.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Brian Gerst <brgerst@gmail.com>,
	Paul McKenney <paulmck@linux.vnet.ibm.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Jason Low <jason.low2@hp.com>,
	"linux-tip-commits@vger.kernel.org" 
	<linux-tip-commits@vger.kernel.org>,
	Arjan van de Ven <arjan@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions
Date: Wed, 20 May 2015 13:29:22 +0200
Message-ID: <555C7012.3040806@redhat.com>
In-Reply-To: <20150519213820.GA31688@gmail.com>

On 05/19/2015 11:38 PM, Ingo Molnar wrote:
> Here's the result from the Intel system:
> 
> linux-falign-functions=_64-bytes/res.txt:        647,853,942      L1-icache-load-misses                                         ( +-  0.07% )  (100.00%)
> linux-falign-functions=128-bytes/res.txt:        669,401,612      L1-icache-load-misses                                         ( +-  0.08% )  (100.00%)
> linux-falign-functions=_32-bytes/res.txt:        685,969,043      L1-icache-load-misses                                         ( +-  0.08% )  (100.00%)
> linux-falign-functions=256-bytes/res.txt:        699,130,207      L1-icache-load-misses                                         ( +-  0.06% )  (100.00%)
> linux-falign-functions=512-bytes/res.txt:        699,130,207      L1-icache-load-misses                                         ( +-  0.06% )  (100.00%)
> linux-falign-functions=_16-bytes/res.txt:        706,080,917      L1-icache-load-misses   [vanilla kernel]                      ( +-  0.05% )  (100.00%)
> linux-falign-functions=__1-bytes/res.txt:        724,539,055      L1-icache-load-misses                                         ( +-  0.31% )  (100.00%)
> linux-falign-functions=__4-bytes/res.txt:        725,707,848      L1-icache-load-misses                                         ( +-  0.12% )  (100.00%)
> linux-falign-functions=__8-bytes/res.txt:        726,543,194      L1-icache-load-misses                                         ( +-  0.04% )  (100.00%)
> linux-falign-functions=__2-bytes/res.txt:        738,946,179      L1-icache-load-misses                                         ( +-  0.12% )  (100.00%)
> linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt:        921,910,808      L1-icache-load-misses                                         ( +-  0.05% )  (100.00%)
> 
> The optimal I$ miss rate is at 64 bytes - which is 9% better than the 
> default kernel's I$ miss rate at 16 bytes alignment.
> 
> The 128/256/512 bytes numbers show an increasing amount of cache 
> misses: probably due to the artificially reduced associativity of the 
> caching.
> 
> Surprisingly there's a rather marked improvement in elapsed time as 
> well:
> 
> linux-falign-functions=_64-bytes/res.txt:        7.154816369 seconds time elapsed                                          ( +-  0.03% )
> linux-falign-functions=_32-bytes/res.txt:        7.231074263 seconds time elapsed                                          ( +-  0.12% )
> linux-falign-functions=__8-bytes/res.txt:        7.292203002 seconds time elapsed                                          ( +-  0.30% )
> linux-falign-functions=128-bytes/res.txt:        7.314226040 seconds time elapsed                                          ( +-  0.29% )
> linux-falign-functions=_16-bytes/res.txt:        7.333597250 seconds time elapsed     [vanilla kernel]                     ( +-  0.48% )
> linux-falign-functions=__1-bytes/res.txt:        7.367139908 seconds time elapsed                                          ( +-  0.28% )
> linux-falign-functions=__4-bytes/res.txt:        7.371721930 seconds time elapsed                                          ( +-  0.26% )
> linux-falign-functions=__2-bytes/res.txt:        7.410033936 seconds time elapsed                                          ( +-  0.34% )
> linux-falign-functions=256-bytes/res.txt:        7.507029637 seconds time elapsed                                          ( +-  0.07% )
> linux-falign-functions=512-bytes/res.txt:        7.507029637 seconds time elapsed                                          ( +-  0.07% )
> linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt:        8.531418784 seconds time elapsed                                          ( +-  0.19% )
> 
> the workload got 2.5% faster - which is pretty nice! This result is 5+ 
> standard deviations above the noise of the measurement.
> 
> Side note: see how catastrophic -Os (CC_OPTIMIZE_FOR_SIZE=y) 
> performance is: markedly higher cache miss rate despite a 'smaller' 
> kernel, and the workload is 16.3% slower (!).
>
> Part of the -Os picture is that the -Os kernel is executing much more 
> instructions:
> 
> linux-falign-functions=_64-bytes/res.txt:     11,851,763,357      instructions                                                  ( +-  0.01% )
> linux-falign-functions=__1-bytes/res.txt:     11,852,538,446      instructions                                                  ( +-  0.01% )
> linux-falign-functions=_16-bytes/res.txt:     11,854,159,736      instructions                                                  ( +-  0.01% )
> linux-falign-functions=__4-bytes/res.txt:     11,864,421,708      instructions                                                  ( +-  0.01% )
> linux-falign-functions=__8-bytes/res.txt:     11,865,947,941      instructions                                                  ( +-  0.01% )
> linux-falign-functions=_32-bytes/res.txt:     11,867,369,566      instructions                                                  ( +-  0.01% )
> linux-falign-functions=128-bytes/res.txt:     11,867,698,477      instructions                                                  ( +-  0.01% )
> linux-falign-functions=__2-bytes/res.txt:     11,870,853,247      instructions                                                  ( +-  0.01% )
> linux-falign-functions=256-bytes/res.txt:     11,876,281,686      instructions                                                  ( +-  0.01% )
> linux-falign-functions=512-bytes/res.txt:     11,876,281,686      instructions                                                  ( +-  0.01% )
> linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt:     14,318,175,358      instructions                                                  ( +-  0.01% )
> 
> 21.2% more instructions executed ... that cannot go well.
>
> So this should be a reminder that it's effective I$ footprint and 
> number of instructions executed that matters to performance, not 
> kernel size alone. With current GCC -Os should only be used on 
> embedded systems where one is willing to make the kernel 10%+ slower, 
> in exchange for a 20% smaller kernel.

Can you post your .config for the test?
If you have CONFIG_OPTIMIZE_INLINING=y in your -Os test,
consider re-testing with it turned off.
You may be seeing this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122


> The AMD system, with a starkly different x86 microarchitecture, is 
> showing similar characteristics:
> 
> linux-falign-functions=_64-bytes/res-amd.txt:        108,886,550      L1-icache-load-misses                                         ( +-  0.10% )  (100.00%)
> linux-falign-functions=_32-bytes/res-amd.txt:        110,433,214      L1-icache-load-misses                                         ( +-  0.15% )  (100.00%)
> linux-falign-functions=__1-bytes/res-amd.txt:        113,623,200      L1-icache-load-misses                                         ( +-  0.17% )  (100.00%)
> linux-falign-functions=128-bytes/res-amd.txt:        119,100,216      L1-icache-load-misses                                         ( +-  0.22% )  (100.00%)
> linux-falign-functions=_16-bytes/res-amd.txt:        122,916,937      L1-icache-load-misses                                         ( +-  0.15% )  (100.00%)
> linux-falign-functions=__8-bytes/res-amd.txt:        123,810,566      L1-icache-load-misses                                         ( +-  0.18% )  (100.00%)
> linux-falign-functions=__2-bytes/res-amd.txt:        124,337,908      L1-icache-load-misses                                         ( +-  0.71% )  (100.00%)
> linux-falign-functions=__4-bytes/res-amd.txt:        125,221,805      L1-icache-load-misses                                         ( +-  0.09% )  (100.00%)
> linux-falign-functions=256-bytes/res-amd.txt:        135,761,433      L1-icache-load-misses                                         ( +-  0.18% )  (100.00%)
> linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt:        159,918,181      L1-icache-load-misses                                         ( +-  0.10% )  (100.00%)
> linux-falign-functions=512-bytes/res-amd.txt:        170,307,064      L1-icache-load-misses                                         ( +-  0.26% )  (100.00%)
> 
> 64 bytes is a similar sweet spot. Note that the penalty at 512 bytes 
> is much steeper than on Intel systems: cache associativity is likely 
> lower on this AMD CPU.
> 
> Interestingly the 1 byte alignment result is still pretty good on AMD 
> systems - and I used the exact same kernel image on both systems, so 
> the layout of the functions is exactly the same.
> 
> Elapsed time is noisier, but shows a similar trend:
> 
> linux-falign-functions=_64-bytes/res-amd.txt:        1.928409143 seconds time elapsed                                          ( +-  2.74% )
> linux-falign-functions=128-bytes/res-amd.txt:        1.932961745 seconds time elapsed                                          ( +-  2.18% )
> linux-falign-functions=__8-bytes/res-amd.txt:        1.940703051 seconds time elapsed                                          ( +-  1.84% )
> linux-falign-functions=__1-bytes/res-amd.txt:        1.940744001 seconds time elapsed                                          ( +-  2.15% )
> linux-falign-functions=_32-bytes/res-amd.txt:        1.962074787 seconds time elapsed                                          ( +-  2.38% )
> linux-falign-functions=_16-bytes/res-amd.txt:        2.000941789 seconds time elapsed                                          ( +-  1.18% )
> linux-falign-functions=__4-bytes/res-amd.txt:        2.002305627 seconds time elapsed                                          ( +-  2.75% )
> linux-falign-functions=256-bytes/res-amd.txt:        2.003218532 seconds time elapsed                                          ( +-  3.16% )
> linux-falign-functions=__2-bytes/res-amd.txt:        2.031252839 seconds time elapsed                                          ( +-  1.77% )
> linux-falign-functions=512-bytes/res-amd.txt:        2.080632439 seconds time elapsed                                          ( +-  1.06% )
> linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt:        2.346644318 seconds time elapsed                                          ( +-  2.19% )
> 
> 64 bytes alignment is the sweet spot here as well, it's 3.7% faster 
> than the default 16 bytes alignment.

On AMD, 64 bytes wins too, yes, but by a *very* small margin.
The 8-byte and 1-byte alignments have basically the same timings,
and both take what, about +0.64% longer to run?

linux-falign-functions=_64-bytes/res-amd.txt:        1.928409143 seconds time elapsed
linux-falign-functions=__8-bytes/res-amd.txt:        1.940703051 seconds time elapsed
linux-falign-functions=__1-bytes/res-amd.txt:        1.940744001 seconds time elapsed

I wouldn't say that it's the same as on Intel. There the difference between
64-byte alignment and no alignment at all is almost five times larger than on
AMD, about +3%:

linux-falign-functions=_64-bytes/res.txt:        7.154816369 seconds time elapsed
linux-falign-functions=_32-bytes/res.txt:        7.231074263 seconds time elapsed
linux-falign-functions=__8-bytes/res.txt:        7.292203002 seconds time elapsed
linux-falign-functions=_16-bytes/res.txt:        7.333597250 seconds time elapsed
linux-falign-functions=__1-bytes/res.txt:        7.367139908 seconds time elapsed
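
To put numbers on that comparison, here is a minimal Python sketch that
recomputes the relative slowdowns from the elapsed times quoted above.
The helper name slowdown_pct is made up for illustration; only the
timings come from the measurements.

```python
# Elapsed times in seconds, copied from the results quoted above,
# keyed by the -falign-functions value.
amd = {64: 1.928409143, 8: 1.940703051, 1: 1.940744001}
intel = {64: 7.154816369, 1: 7.367139908}

def slowdown_pct(t, baseline):
    """Percent by which run time t exceeds the baseline run time."""
    return (t / baseline - 1.0) * 100.0

amd_pct = slowdown_pct(amd[1], amd[64])
intel_pct = slowdown_pct(intel[1], intel[64])

print(f"AMD   1-byte vs 64-byte: +{amd_pct:.2f}%")
print(f"AMD   8-byte vs 64-byte: +{slowdown_pct(amd[8], amd[64]):.2f}%")
print(f"Intel 1-byte vs 64-byte: +{intel_pct:.2f}%")
print(f"Intel/AMD ratio: {intel_pct / amd_pct:.1f}x")
```

This works out to roughly +0.64% on AMD and just under +3% on Intel,
so the Intel gap is a bit more than 4.6 times the AMD one.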

> So based on those measurements, I think we should do the exact 
> opposite of my original patch that reduced alignment to 1 bytes, and 
> increase kernel function address alignment from 16 bytes to the 
> natural cache line size (64 bytes on modern CPUs).

> +        #
> +        # Allocate a separate cacheline for every function,
> +        # for optimal instruction cache packing:
> +        #
> +        KBUILD_CFLAGS += -falign-functions=$(CONFIG_X86_FUNCTION_ALIGNMENT)

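How about  -falign-functions=CONFIG_X86_FUNCTION_ALIGNMENT/2 + 1  instead?

With a value N that is not a power of two, GCC still aligns to the next power
of two (64 here), but only if that can be done by skipping fewer than N bytes.
This avoids pathological cases where a function starting just a few bytes after
a 64-byte boundary gets aligned to the next one, wasting ~60 bytes.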
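
A rough model of the padding involved, as a Python sketch. It assumes the
semantics documented for GCC of that era: -falign-functions=N aligns to
the smallest power of two >= N, but only when that skips at most N-1
bytes. It also assumes a function is left where it is otherwise, which
is a simplification (the assembler output may still apply a smaller
fallback alignment). The names next_pow2 and padding are made up for
this example.

```python
def next_pow2(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    p = 1
    while p < n:
        p *= 2
    return p

def padding(addr, n):
    """Bytes of padding inserted before a function that would otherwise
    start at addr, under the assumed -falign-functions=n semantics."""
    boundary = next_pow2(n)
    skip = -addr % boundary          # distance to the next boundary
    # Assumption: alignment is skipped entirely if it costs n bytes or more.
    return skip if skip <= n - 1 else 0

# Function that would start 4 bytes past a 64-byte boundary.
print(padding(64 + 4, 64))    # plain 64-byte alignment pads to the next line
print(padding(64 + 4, 33))    # 64/2 + 1: too far from the boundary, no padding

# Function that would start 40 bytes past a 64-byte boundary.
print(padding(64 + 40, 33))   # close enough to the boundary, so it is aligned
```

Under this model the worst-case padding per function drops from 63 bytes
to 32, at the cost of some functions no longer starting on a cache line
boundary.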
-- 
vda
