From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>,
linux-kernel@vger.kernel.org, Jason Baron <jbaron@redhat.com>,
Rusty Russell <rusty@rustcorp.com.au>,
Adrian Bunk <bunk@stusta.de>, Andi Kleen <andi@firstfloor.org>,
Christoph Hellwig <hch@infradead.org>
Subject: Re: [patch 02/12] Immediate Values - Architecture Independent Code
Date: Sun, 27 Sep 2009 19:23:45 -0400
Message-ID: <20090927232345.GA5831@Krystal>
In-Reply-To: <20090924212013.d27226c4.akpm@linux-foundation.org>

* Andrew Morton (akpm@linux-foundation.org) wrote:
> On Thu, 24 Sep 2009 09:26:28 -0400 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
>
> > Immediate values are used as read mostly variables that are rarely updated. They
> > use code patching to modify the values inscribed in the instruction stream. It
> > provides a way to save precious cache lines that would otherwise have to be used
> > by these variables.
>
> What a hare-brained concept.
>

Hi Andrew,

Improving performance by specializing the implementation has been
studied thoroughly by many in the past, especially for JIT compilers.
What I am proposing here is merely a very specific use of the concept,
applied to read-often variables.

> > * Why should this be merged *
> >
> > It improves performances on heavy memory I/O workloads.
> >
> > An interesting result shows the potential this infrastructure has by
> > showing the slowdown a simple system call such as getppid() suffers when it is
> > used under heavy user-space cache trashing:
> >
> > Random walk L1 and L2 trashing surrounding a getppid() call:
> > (note: in this test, do_syscal_trace was taken at each system call, see
> > Documentation/immediate.txt in these patches for details)
> > - No memory pressure : getppid() takes 1573 cycles
> > - With memory pressure : getppid() takes 15589 cycles
>
> Our ideas of what constitutes an "interesting result" differ.
>
> Do you have any data which indicates that this thing is of any real
> benefit to anyone for anything?

Yep. See the benchmarks I just ran below.

Immediate Values Benchmarks

Kernel 2.6.31-tip
8-core Xeon, 2.0GHz, E5405
gcc version 4.3.2 (Debian 4.3.2-1.1)
Test workload: build the Linux kernel tree, cache-hot, make -j10

Executive result summary:

In these tests, each system call has an added workload, which is to read a fixed
number of integers from randomly chosen cache lines within an array and perform
a branch. The implementation is added to ptrace.c. The baseline is an unmodified
kernel.

* Baseline:                                sys 0m57.63s
* 4096 integer reads, random locations     sys 2m21.781s
* 4096 integer reads, immediate values     sys 1m44.695s
* 128 integer reads, random locations      sys 0m59.348s
* 128 integer reads, immediate values      sys 0m58.640s
* 32 integer reads, random locations       sys 0m58.68s
* 32 integer reads, immediate values       sys 0m57.60s

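These numbers show that turning read-often data accesses into immediate
values speeds up the kernel: with 4096 reads per system call, sys time
drops from 2m21.781s to 1m44.695s (about 26%), and the 128-read and
32-read cases improve by between 1% and 2%.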
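
To re-derive how much of the added system-call cost goes away, here is a
minimal Python sketch. It uses only the sys times from the summary above;
the names SYS_SECONDS, reduction_percent and overhead_seconds are made up
for this illustration and are not part of the patch set.

```python
# sys times from the summary above, converted to seconds.
BASELINE = 57.63
SYS_SECONDS = {
    4096: {"random": 2 * 60 + 21.781, "imv": 1 * 60 + 44.695},
    128: {"random": 59.348, "imv": 58.640},
    32: {"random": 58.68, "imv": 57.60},
}


def reduction_percent(random_s, imv_s):
    """Relative sys-time reduction when memory reads become immediate values."""
    return 100.0 * (random_s - imv_s) / random_s


def overhead_seconds(seconds):
    """sys time added on top of the unmodified-kernel baseline."""
    return seconds - BASELINE


for reads, t in SYS_SECONDS.items():
    print(
        f"{reads:5d} reads: "
        f"overhead random {overhead_seconds(t['random']):6.2f}s, "
        f"imv {overhead_seconds(t['imv']):6.2f}s, "
        f"sys reduction {reduction_percent(t['random'], t['imv']):4.1f}%"
    )
```

Comparing against the baseline rather than against the total matters for
the 4096-read case, where most of the sys time is the added workload
itself. For the 32-read case the immediate-value figure sits within a few
hundredths of a second of the baseline.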

Binary size results:

* 4096 integer reads, random locations
   text    data     bss     dec     hex filename
  66079     648  262156  328883   504b3 arch/x86/kernel/ptrace.o

* 4096 integer reads, immediate values
   text    data     bss     dec     hex filename
  66079   74412  262156  402647   624d7 arch/x86/kernel/ptrace.o

The text and bss sizes are unchanged, but the data size increases with
immediate values. The section headers confirm that this extra data is put
in the __imv section, which is only accessed when immediate value updates
are performed.

So the tradeoff is: immediate values use more cache-cold space to increase
speed.
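
The claim that all of the extra data lives in __imv can be checked from
the numbers in this mail. The sketch below parses the two size(1) rows
above and compares the growth with the __imv section size (0x12024) taken
from the section headers in the detailed results; parse_size_row is a
hypothetical helper written for this example.

```python
def parse_size_row(row):
    """Parse one data row of size(1) output (text data bss dec hex filename)."""
    text, data, bss, dec, hex_total, filename = row.split()
    return {
        "text": int(text),
        "data": int(data),
        "bss": int(bss),
        "dec": int(dec),
        "hex": int(hex_total, 16),
        "filename": filename,
    }


without_imv = parse_size_row("66079 648 262156 328883 504b3 arch/x86/kernel/ptrace.o")
with_imv = parse_size_row("66079 74412 262156 402647 624d7 arch/x86/kernel/ptrace.o")

# Size of the __imv section as reported by the section headers.
IMV_SECTION_SIZE = 0x12024

growth = with_imv["data"] - without_imv["data"]
print(f"data growth: {growth} bytes, __imv section: {IMV_SECTION_SIZE} bytes")
print("growth fully explained by __imv:", growth == IMV_SECTION_SIZE)
```

The __discard section does not show up in the totals because it is not
flagged ALLOC.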

Therefore, if we can turn a significant number of fast-path read-often
variables into immediate values, this should lead to a performance gain.
Also, given that we can expect the fast-path cache-line footprint to grow
with the next kernel releases (a trend I've seen a lot of people complain
about), immediate values should help minimize this by removing the
d-cache hit from such read-often variables, leaving an i-cache hit within
a mostly sequential instruction stream.

A quick look at the vmlinux section headers:

vmlinux:     file format elf64-x86-64

Sections:
Idx Name               Size      VMA               LMA               File off  Algn
 13 .data.read_mostly  00002df0  ffffffff80859440  0000000000859440  00859440  2**6
                       CONTENTS, ALLOC, LOAD, DATA

This shows that we have about 11.48kB of read-mostly data in the kernel
image which could be turned into immediate values. This is without
counting the modules. If only a portion of this data is not only read
mostly, but also read often, then we will see a clear performance
improvement.
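
For reference, the 11.48kB figure follows directly from the section size
in the header above. The short Python sketch below does the conversion and
also expresses the section in cache lines; the 64-byte line size is an
assumption, chosen to match the section's 2**6 alignment.

```python
# Size of .data.read_mostly as printed in the section header.
SECTION_SIZE = 0x2DF0

# Assumption: 64-byte cache lines, matching the 2**6 section alignment.
CACHE_LINE_BYTES = 64

size_kib = SECTION_SIZE / 1024
# Round up: a partially used line still occupies a whole cache line.
cache_lines = -(-SECTION_SIZE // CACHE_LINE_BYTES)

print(f"{SECTION_SIZE} bytes = {size_kib:.2f} KiB = {cache_lines} cache lines")
```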

Thanks,

Mathieu

Detailed test results follow.
----------------------------------------
* Baseline:

# size of kernel original ptrace.o
   text    data     bss     dec     hex filename
  12863     648       8   13519    34cf arch/x86/kernel/ptrace.o

# time make -j10
real 1m25.358s
user 9m7.506s
sys 0m57.856s

real 1m21.580s
user 9m7.362s
sys 0m57.212s

real 1m21.361s
user 9m6.358s
sys 0m57.824s
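
The baseline figure in the summary is the mean of the three sys times
above. A minimal Python sketch of that averaging follows; to_seconds and
mean_sys are illustrative helper names, and the regular expression only
assumes the "XmY.YYYs" form that time prints in this mail.

```python
import re


def to_seconds(stamp):
    """Convert a time value such as '0m57.856s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time value: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def mean_sys(stamps):
    """Average several sys times, in seconds."""
    values = [to_seconds(s) for s in stamps]
    return sum(values) / len(values)


baseline_runs = ["0m57.856s", "0m57.212s", "0m57.824s"]
print(f"baseline avg sys: {mean_sys(baseline_runs):.2f}s")  # prints 57.63s
```

The same helper applied to the three-run groups further down gives the
"avg sys" values quoted there.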

* 4096 cache lines read per system call (random cache lines)
(CONFIG_IMMEDIATE=n)

# size of modified ptrace.o
   text    data     bss     dec     hex filename
  66079     648  262156  328883   504b3 arch/x86/kernel/ptrace.o

# section headers
arch/x86/kernel/ptrace.o:     file format elf64-x86-64

Sections:
Idx Name                  Size      VMA               LMA               File off  Algn
  0 .text                 0000f4e8  0000000000000000  0000000000000000  00000040  2**4
                          CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  1 .data                 000000a8  0000000000000000  0000000000000000  0000f540  2**5
                          CONTENTS, ALLOC, LOAD, RELOC, DATA
  2 .bss                  0004000c  0000000000000000  0000000000000000  0000f600  2**5
                          ALLOC
  3 .rodata               00000988  0000000000000000  0000000000000000  0000f600  2**5
                          CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  4 .fixup                0000005b  0000000000000000  0000000000000000  0000ff88  2**0
                          CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  5 __ex_table            00000090  0000000000000000  0000000000000000  0000ffe8  2**3
                          CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  6 .smp_locks            00000028  0000000000000000  0000000000000000  00010078  2**3
                          CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  7 .rodata.str1.8        000001f2  0000000000000000  0000000000000000  000100a0  2**3
                          CONTENTS, ALLOC, LOAD, READONLY, DATA
  8 .rodata.str1.1        00000097  0000000000000000  0000000000000000  00010292  2**0
                          CONTENTS, ALLOC, LOAD, READONLY, DATA
  9 __tracepoints         00000080  0000000000000000  0000000000000000  00010340  2**5
                          CONTENTS, ALLOC, LOAD, RELOC, DATA
 10 _ftrace_events        00000160  0000000000000000  0000000000000000  000103c0  2**3
                          CONTENTS, ALLOC, LOAD, RELOC, DATA
 11 __tracepoints_strings 00000013  0000000000000000  0000000000000000  00010520  2**0
                          CONTENTS, ALLOC, LOAD, READONLY, DATA
 12 .comment              0000001f  0000000000000000  0000000000000000  00010533  2**0
                          CONTENTS, READONLY
 13 .note.GNU-stack       00000000  0000000000000000  0000000000000000  00010552  2**0
                          CONTENTS, READONLY

# pattern
 820:   83 3d 00 00 00 00 01    cmpl   $0x1,0x0(%rip)        # 827 <test_pollute_cache+0x7>
 827:   0f 84 cb cf 00 00       je     d7f8 <test_pollute_cache+0xcfd8>

# time make -j10
real 1m36.075s
user 9m15.163s
sys 2m21.781s

* 4096 imv read per system call
(CONFIG_IMMEDIATE=y)

# size of modified ptrace.o
   text    data     bss     dec     hex filename
  66079   74412  262156  402647   624d7 arch/x86/kernel/ptrace.o
(note: data is larger due to __imv table, which is used only for updates)

# section headers
arch/x86/kernel/ptrace.o:     file format elf64-x86-64

Sections:
Idx Name                  Size      VMA               LMA               File off  Algn
  0 .text                 0000f4e8  0000000000000000  0000000000000000  00000040  2**4
                          CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  1 .data                 000000a8  0000000000000000  0000000000000000  0000f540  2**5
                          CONTENTS, ALLOC, LOAD, RELOC, DATA
  2 .bss                  0004000c  0000000000000000  0000000000000000  0000f600  2**5
                          ALLOC
  3 .rodata               00000988  0000000000000000  0000000000000000  0000f600  2**5
                          CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  4 .fixup                0000005b  0000000000000000  0000000000000000  0000ff88  2**0
                          CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  5 __ex_table            00000090  0000000000000000  0000000000000000  0000ffe8  2**3
                          CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  6 .smp_locks            00000028  0000000000000000  0000000000000000  00010078  2**3
                          CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  7 __discard             00005004  0000000000000000  0000000000000000  000100a0  2**0
                          CONTENTS, READONLY
  8 __imv                 00012024  0000000000000000  0000000000000000  000150a4  2**0
                          CONTENTS, ALLOC, LOAD, RELOC, DATA
  9 .rodata.str1.8        000001f2  0000000000000000  0000000000000000  000270c8  2**3
                          CONTENTS, ALLOC, LOAD, READONLY, DATA
 10 .rodata.str1.1        00000097  0000000000000000  0000000000000000  000272ba  2**0
                          CONTENTS, ALLOC, LOAD, READONLY, DATA
 11 __tracepoints         00000080  0000000000000000  0000000000000000  00027360  2**5
                          CONTENTS, ALLOC, LOAD, RELOC, DATA
 12 _ftrace_events        00000160  0000000000000000  0000000000000000  000273e0  2**3
                          CONTENTS, ALLOC, LOAD, RELOC, DATA
 13 __tracepoints_strings 00000013  0000000000000000  0000000000000000  00027540  2**0
                          CONTENTS, ALLOC, LOAD, READONLY, DATA
 14 .comment              0000001f  0000000000000000  0000000000000000  00027553  2**0
                          CONTENTS, READONLY
 15 .note.GNU-stack       00000000  0000000000000000  0000000000000000  00027572  2**0
                          CONTENTS, READONLY

# pattern
 820:   b8 00 00 00 00          mov    $0x0,%eax
 825:   ff c8                   dec    %eax
 827:   0f 84 d3 cf 00 00       je     d800 <test_pollute_cache+0xcfe0>
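
The two "# pattern" listings can be compared byte for byte, which is
consistent with the identical text size of both builds. The Python sketch
below counts the encoded bytes of each test site; the byte strings are
copied from the listings in this mail, and encoded_length is a
hypothetical helper name.

```python
# Instruction bytes copied from the two "# pattern" listings.
MEMORY_READ_SITE = [
    "83 3d 00 00 00 00 01",  # cmpl $0x1,0x0(%rip): compares a value loaded from memory
    "0f 84 cb cf 00 00",     # je
]
IMMEDIATE_SITE = [
    "b8 00 00 00 00",        # mov $0x0,%eax: the value is encoded in the instruction
    "ff c8",                 # dec %eax
    "0f 84 d3 cf 00 00",     # je
]


def encoded_length(instructions):
    """Total number of bytes used by a list of hex-encoded instructions."""
    return sum(len(bytes.fromhex(insn)) for insn in instructions)


print("memory read site:", encoded_length(MEMORY_READ_SITE), "bytes")
print("immediate site:  ", encoded_length(IMMEDIATE_SITE), "bytes")
```

Both sequences place the branch at the same offset (0x827), so each test
site occupies the same amount of text; only the source of the compared
value changes.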
# time make -j10
real 1m30.688s
user 9m7.770s
sys 1m44.695s

* 128 cache lines read per system call (random cache lines)
(CONFIG_IMMEDIATE=n)

# time make -j10
real 1m27.801s
user 9m12.447s
sys 0m59.348s

* 128 imv read per system call
(CONFIG_IMMEDIATE=y)

# time make -j10
real 1m22.454s
user 9m5.822s
sys 0m58.640s

* 32 cache lines read per system call (random cache lines)
(CONFIG_IMMEDIATE=n)

# time make -j10
real 1m21.539s
user 9m6.946s
sys 0m57.888s

real 1m26.789s
user 9m11.606s
sys 0m59.392s

real 1m29.461s
user 9m12.195s
sys 0m58.768s

avg sys: 58.68s

* 32 imv read per system call
(CONFIG_IMMEDIATE=y)

# time make -j10
real 1m21.844s
user 9m7.278s
sys 0m57.648s

real 1m22.123s
user 9m6.850s
sys 0m56.848s

real 1m24.589s
user 9m5.674s
sys 0m58.328s

avg sys: 57.60s

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68