Re: [patch 02/12] Immediate Values - Architecture Independent Code

From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>,
	linux-kernel@vger.kernel.org, Jason Baron <jbaron@redhat.com>,
	Rusty Russell <rusty@rustcorp.com.au>,
	Adrian Bunk <bunk@stusta.de>, Andi Kleen <andi@firstfloor.org>,
	Christoph Hellwig <hch@infradead.org>
Subject: Re: [patch 02/12] Immediate Values - Architecture Independent Code
Date: Sun, 27 Sep 2009 19:23:45 -0400
Message-ID: <20090927232345.GA5831@Krystal>
In-Reply-To: <20090924212013.d27226c4.akpm@linux-foundation.org>

* Andrew Morton (akpm@linux-foundation.org) wrote:
> On Thu, 24 Sep 2009 09:26:28 -0400 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > Immediate values are used as read mostly variables that are rarely updated. They
> > use code patching to modify the values inscribed in the instruction stream. It
> > provides a way to save precious cache lines that would otherwise have to be used
> > by these variables.
> 
> What a hare-brained concept.
> 

Hi Andrew,

Improving performance by specializing the implementation has been
studied thoroughly by many in the past, especially for JIT compilers.
What I am proposing here is merely a very specific use of the concept,
applied to read-often variables.

> > * Why should this be merged *
> > 
> > It improves performance on heavy memory I/O workloads.
> > 
> > An interesting result shows the potential this infrastructure has by
> > showing the slowdown a simple system call such as getppid() suffers when it is
> > used under heavy user-space cache thrashing:
> > 
> > Random walk L1 and L2 thrashing surrounding a getppid() call:
> > (note: in this test, do_syscall_trace was taken at each system call, see
> > Documentation/immediate.txt in these patches for details)
> > - No memory pressure :   getppid() takes  1573 cycles
> > - With memory pressure : getppid() takes 15589 cycles
> 
> Our ideas of what constitutes an "interesting result" differ.
> 
> Do you have any data which indicates that this thing is of any real
> benefit to anyone for anything?

Yep. See the benchmarks I just ran below.

Immediate Values Benchmarks

Kernel 2.6.31-tip
8-core Xeon, 2.0GHz, E5405
gcc version 4.3.2 (Debian 4.3.2-1.1) 

Test workload: build the Linux kernel tree, cache-hot, make -j10

Executive result summary:

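In these tests, each system call has an added workload, which is to read a fixed
number of integers from randomly chosen cache lines within an array and perform
a branch. The implementation is added to ptrace.c. The baseline is an unmodified
kernel. The baseline and 32-read figures are averages of three runs; the
4096-read and 128-read figures are single runs (see the detailed results).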

* Baseline:				sys	0m57.63s

* 4096 integer reads, random locations	sys	2m21.781s
* 4096 integer reads, immediate values	sys	1m44.695s

* 128 integer reads, random locations	sys	0m59.348s
* 128 integer reads, immediate values	sys	0m58.640s

* 32 integer reads, random locations	sys	0m58.68s
* 32 integer reads, immediate values	sys	0m57.60s

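These numbers show that turning read-often data accesses into immediate values
reduces system time in every configuration tested. The effect is largest at
4096 reads per system call, where sys time drops from 2m21.781s to 1m44.695s.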

Binary size results:

* 4096 integer reads, random locations
   text	   data	    bss	    dec	    hex	filename
  66079	    648	 262156	 328883	  504b3	arch/x86/kernel/ptrace.o

* 4096 integer reads, immediate values
   text	   data	    bss	    dec	    hex	filename
  66079	  74412	 262156	 402647	  624d7	arch/x86/kernel/ptrace.o

The text and bss sizes are identical in both builds; only the data size grows
with immediate values. The section headers confirm that this extra data is put
in the __imv section, which is only accessed when immediate value updates are
performed.

So the tradeoff is: immediate values use more cache-cold space to increase
speed.

Therefore, if we can turn a significant number of fast-path, read-often
variables into immediate values, this should lead to a performance gain. Also,
given that we can expect the fast-path cache-line footprint to grow with the
next kernel releases (a trend I've seen a lot of people complain about),
immediate values should help minimize this by removing the d-cache hit for
such read-often variables, leaving only an i-cache hit within a mostly
sequential instruction stream.

A quick look at the vmlinux section headers:

vmlinux:     file format elf64-x86-64

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
 13 .data.read_mostly 00002df0  ffffffff80859440  0000000000859440  00859440  2**6
                  CONTENTS, ALLOC, LOAD, DATA

This shows that we have about 11.48kB of read-mostly data in the kernel image
which could be turned into immediate values. This is without counting
the modules. If even a portion of this data is not only read-mostly but
also read often, then we will see a clear performance improvement.

Thanks,

Mathieu


Detailed test results follow.
----------------------------------------

* Baseline:

# size of kernel original ptrace.o

   text	   data	    bss	    dec	    hex	filename
  12863	    648	      8	  13519	   34cf	arch/x86/kernel/ptrace.o

# time make -j10

real	1m25.358s
user	9m7.506s
sys	0m57.856s

real	1m21.580s
user	9m7.362s
sys	0m57.212s

real	1m21.361s
user	9m6.358s
sys	0m57.824s


* 4096 cache lines read per system call (random cache lines)
  (CONFIG_IMMEDIATE=n)

# size of modified ptrace.o

  text	   data	    bss	    dec	    hex	filename
  66079	    648	 262156	 328883	  504b3	arch/x86/kernel/ptrace.o

# section headers

arch/x86/kernel/ptrace.o:     file format elf64-x86-64

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         0000f4e8  0000000000000000  0000000000000000  00000040  2**4
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  1 .data         000000a8  0000000000000000  0000000000000000  0000f540  2**5
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
  2 .bss          0004000c  0000000000000000  0000000000000000  0000f600  2**5
                  ALLOC
  3 .rodata       00000988  0000000000000000  0000000000000000  0000f600  2**5
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  4 .fixup        0000005b  0000000000000000  0000000000000000  0000ff88  2**0
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  5 __ex_table    00000090  0000000000000000  0000000000000000  0000ffe8  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  6 .smp_locks    00000028  0000000000000000  0000000000000000  00010078  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  7 .rodata.str1.8 000001f2  0000000000000000  0000000000000000  000100a0  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  8 .rodata.str1.1 00000097  0000000000000000  0000000000000000  00010292  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  9 __tracepoints 00000080  0000000000000000  0000000000000000  00010340  2**5
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
 10 _ftrace_events 00000160  0000000000000000  0000000000000000  000103c0  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
 11 __tracepoints_strings 00000013  0000000000000000  0000000000000000  00010520  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 12 .comment      0000001f  0000000000000000  0000000000000000  00010533  2**0
                  CONTENTS, READONLY
 13 .note.GNU-stack 00000000  0000000000000000  0000000000000000  00010552  2**0
                  CONTENTS, READONLY

# pattern

     820:       83 3d 00 00 00 00 01    cmpl   $0x1,0x0(%rip)        # 827 <test_pollute_cache+0x7>
     827:       0f 84 cb cf 00 00       je     d7f8 <test_pollute_cache+0xcfd8>

# time make -j10

real	1m36.075s
user	9m15.163s
sys	2m21.781s


* 4096 imv read per system call
  (CONFIG_IMMEDIATE=y)

# size of modified ptrace.o

   text	   data	    bss	    dec	    hex	filename
  66079	  74412	 262156	 402647	  624d7	arch/x86/kernel/ptrace.o

    (note: data is larger due to __imv table, which is used only for updates)

# section headers

arch/x86/kernel/ptrace.o:     file format elf64-x86-64

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         0000f4e8  0000000000000000  0000000000000000  00000040  2**4
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  1 .data         000000a8  0000000000000000  0000000000000000  0000f540  2**5
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
  2 .bss          0004000c  0000000000000000  0000000000000000  0000f600  2**5
                  ALLOC
  3 .rodata       00000988  0000000000000000  0000000000000000  0000f600  2**5
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  4 .fixup        0000005b  0000000000000000  0000000000000000  0000ff88  2**0
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  5 __ex_table    00000090  0000000000000000  0000000000000000  0000ffe8  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  6 .smp_locks    00000028  0000000000000000  0000000000000000  00010078  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  7 __discard     00005004  0000000000000000  0000000000000000  000100a0  2**0
                  CONTENTS, READONLY
  8 __imv         00012024  0000000000000000  0000000000000000  000150a4  2**0
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
  9 .rodata.str1.8 000001f2  0000000000000000  0000000000000000  000270c8  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 10 .rodata.str1.1 00000097  0000000000000000  0000000000000000  000272ba  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 11 __tracepoints 00000080  0000000000000000  0000000000000000  00027360  2**5
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
 12 _ftrace_events 00000160  0000000000000000  0000000000000000  000273e0  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
 13 __tracepoints_strings 00000013  0000000000000000  0000000000000000  00027540  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 14 .comment      0000001f  0000000000000000  0000000000000000  00027553  2**0
                  CONTENTS, READONLY
 15 .note.GNU-stack 00000000  0000000000000000  0000000000000000  00027572  2**0
                  CONTENTS, READONLY

# pattern

     820:       b8 00 00 00 00          mov    $0x0,%eax
     825:       ff c8                   dec    %eax
     827:       0f 84 d3 cf 00 00       je     d800 <test_pollute_cache+0xcfe0>

# time make -j10

real	1m30.688s
user	9m7.770s
sys	1m44.695s


* 128 cache lines read per system call (random cache lines)
  (CONFIG_IMMEDIATE=n)

# time make -j10

real	1m27.801s
user	9m12.447s
sys	0m59.348s


* 128 imv read per system call
  (CONFIG_IMMEDIATE=y)

# time make -j10

real	1m22.454s
user	9m5.822s
sys	0m58.640s


* 32 cache lines read per system call (random cache lines)
  (CONFIG_IMMEDIATE=n)

# time make -j10

real	1m21.539s
user	9m6.946s
sys	0m57.888s

real	1m26.789s
user	9m11.606s
sys	0m59.392s

real	1m29.461s
user	9m12.195s
sys	0m58.768s

avg sys:	58.68s


* 32 imv read per system call
  (CONFIG_IMMEDIATE=y)

# time make -j10

real	1m21.844s
user	9m7.278s
sys	0m57.648s

real	1m22.123s
user	9m6.850s
sys	0m56.848s

real	1m24.589s
user	9m5.674s
sys	0m58.328s

avg sys:	57.60s



-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
