Re: [patch 02/12] Immediate Values - Architecture Independent Code

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>,
	linux-kernel@vger.kernel.org, Jason Baron <jbaron@redhat.com>,
	Rusty Russell <rusty@rustcorp.com.au>,
	Adrian Bunk <bunk@stusta.de>, Andi Kleen <andi@firstfloor.org>,
	Christoph Hellwig <hch@infradead.org>
Subject: Re: [patch 02/12] Immediate Values - Architecture Independent Code
Date: Sun, 27 Sep 2009 19:23:45 -0400	[thread overview]
Message-ID: <20090927232345.GA5831@Krystal> (raw)
In-Reply-To: <20090924212013.d27226c4.akpm@linux-foundation.org>

* Andrew Morton (akpm@linux-foundation.org) wrote:
> On Thu, 24 Sep 2009 09:26:28 -0400 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > Immediate values are used as read mostly variables that are rarely updated. They
> > use code patching to modify the values inscribed in the instruction stream. It
> > provides a way to save precious cache lines that would otherwise have to be used
> > by these variables.
> 
> What a hare-brained concept.
> 

Hi Andrew,

Improving performance by specializing the implementation has been
studied thoroughly by many in the past, especially for JIT compilers.
What I am proposing here is merely a very specific use of the concept,
applied to read-often variables.

> > * Why should this be merged *
> > 
> > It improves performances on heavy memory I/O workloads.
> > 
> > An interesting result shows the potential this infrastructure has by
> > showing the slowdown a simple system call such as getppid() suffers when it is
> > used under heavy user-space cache trashing:
> > 
> > Random walk L1 and L2 trashing surrounding a getppid() call:
> > (note: in this test, do_syscal_trace was taken at each system call, see
> > Documentation/immediate.txt in these patches for details)
> > - No memory pressure :   getppid() takes  1573 cycles
> > - With memory pressure : getppid() takes 15589 cycles
> 
> Our ideas of what constitutes an "interesting result" differ.
> 
> Do you have any data which indicates that this thing is of any real
> benefit to anyone for anything?

Yep. See the benchmarks I just ran below.

Immediate Values Benchmarks

Kernel 2.6.31-tip
8-core Xeon, 2.0Ghz, E5405
gcc version 4.3.2 (Debian 4.3.2-1.1) 

Test workload: build the Linux kernel tree, cache-hot, make -j10

Executive result summary:

In these tests, each system call has an added workload, which is to read a fixed
number of integers from randomly chosen cache lines within an array and perform
a branch. The implementation is added to ptrace.c. The baseline is an unmodified
kernel.

* Baseline:				sys	0m57.63s

* 4096 integer reads, random locations	sys	2m21.781s
* 4096 integer reads, immediate values	sys	1m44.695s

* 128 integer reads, random locations	sys	0m59.348s
* 128 integer reads, immediate values	sys	0m58.640s

* 32 integer reads, random locations	sys	0m58.68s
* 32 integer reads, immediate values	sys	0m57.60s

These numbers show that by turning read-often data accesses into immediate
values, we can speed up the kernel.

Binary size results:

* 4096 integer reads, random locations
  text     data     bss     dec     hex filename
  66079     648  262156  328883   504b3 arch/x86/kernel/ptrace.o

* 4096 integer reads, immediate values
   text	   data	    bss	    dec	    hex	filename
  66079	  74412	 262156	 402647	  624d7	arch/x86/kernel/ptrace.o

As we notice, the size of text is the same, same for bss, but the data size
increases with immediate values. The section headers confirms that this extra
data is put in the __imv section, which is only accessed when immediate value
updates are performed.

So the tradeoff is: immediate values use more cache-cold space to increase
speed.

Therefore, if we can turn a significant amount of fast-path read-often variables
into immediate values, this should lead to a performance gain. Also,
given we can expect the fastpath cache-line footprint to grow with the
next kernel releases (this has been a trend I've seen a lot of people
complaining about), immediate values should help minimizing this by
removing the d-cache hit from such read-often variables, leaving a
i-cache hit within a mostly sequential instruction stream.

A quick look at the vmlinux section headers:

vmlinux:     file format elf64-x86-64

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
 13 .data.read_mostly 00002df0  ffffffff80859440  0000000000859440  00859440  2**6
                  CONTENTS, ALLOC, LOAD, DATA

Shows that we have about 11.48kB of read mostly data in the kernel image
which could be turned into immediate values. This is without counting
the modules. If only a portion of this data is not only read mostly, but
also read often, then we will see a clear performance improvement.

Thanks,

Mathieu


Detailed test results follow.
----------------------------------------

* Baseline:

# size of kernel original ptrace.o

   text	   data	    bss	    dec	    hex	filename
  12863	    648	      8	  13519	   34cf	arch/x86/kernel/ptrace.o

# time make -j10

real	1m25.358s
user	9m7.506s
sys	0m57.856s

real	1m21.580s
user	9m7.362s
sys	0m57.212s

real	1m21.361s
user	9m6.358s
sys	0m57.824s


* 4096 cache lines read per system call (random cache lines)
  (CONFIG_IMMEDIATE=n)

# size of modified ptrace.o

  text	   data	    bss	    dec	    hex	filename
  66079	    648	 262156	 328883	  504b3	arch/x86/kernel/ptrace.o

# section headers

arch/x86/kernel/ptrace.o:     file format elf64-x86-64

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         0000f4e8  0000000000000000  0000000000000000  00000040  2**4
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  1 .data         000000a8  0000000000000000  0000000000000000  0000f540  2**5
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
  2 .bss          0004000c  0000000000000000  0000000000000000  0000f600  2**5
                  ALLOC
  3 .rodata       00000988  0000000000000000  0000000000000000  0000f600  2**5
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  4 .fixup        0000005b  0000000000000000  0000000000000000  0000ff88  2**0
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  5 __ex_table    00000090  0000000000000000  0000000000000000  0000ffe8  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  6 .smp_locks    00000028  0000000000000000  0000000000000000  00010078  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  7 .rodata.str1.8 000001f2  0000000000000000  0000000000000000  000100a0  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  8 .rodata.str1.1 00000097  0000000000000000  0000000000000000  00010292  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  9 __tracepoints 00000080  0000000000000000  0000000000000000  00010340  2**5
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
 10 _ftrace_events 00000160  0000000000000000  0000000000000000  000103c0  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
 11 __tracepoints_strings 00000013  0000000000000000  0000000000000000  00010520  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 12 .comment      0000001f  0000000000000000  0000000000000000  00010533  2**0
                  CONTENTS, READONLY
 13 .note.GNU-stack 00000000  0000000000000000  0000000000000000  00010552  2**0
                  CONTENTS, READONLY

# pattern

     820:       83 3d 00 00 00 00 01    cmpl   $0x1,0x0(%rip)        # 827 <test_pollute_cache+0x7>
     827:       0f 84 cb cf 00 00       je     d7f8 <test_pollute_cache+0xcfd8>

# time make -j10

real	1m36.075s
user	9m15.163s
sys	2m21.781s


* 4096 imv read per system call
  (CONFIG_IMMEDIATE=y)

# size of modified ptrace.o

   text	   data	    bss	    dec	    hex	filename
  66079	  74412	 262156	 402647	  624d7	arch/x86/kernel/ptrace.o

    (note: data is larger due to __imv table, which is used only for updates)

# section headers

arch/x86/kernel/ptrace.o:     file format elf64-x86-64

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         0000f4e8  0000000000000000  0000000000000000  00000040  2**4
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  1 .data         000000a8  0000000000000000  0000000000000000  0000f540  2**5
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
  2 .bss          0004000c  0000000000000000  0000000000000000  0000f600  2**5
                  ALLOC
  3 .rodata       00000988  0000000000000000  0000000000000000  0000f600  2**5
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  4 .fixup        0000005b  0000000000000000  0000000000000000  0000ff88  2**0
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  5 __ex_table    00000090  0000000000000000  0000000000000000  0000ffe8  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  6 .smp_locks    00000028  0000000000000000  0000000000000000  00010078  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
  7 __discard     00005004  0000000000000000  0000000000000000  000100a0  2**0
                  CONTENTS, READONLY
  8 __imv         00012024  0000000000000000  0000000000000000  000150a4  2**0
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
  9 .rodata.str1.8 000001f2  0000000000000000  0000000000000000  000270c8  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 10 .rodata.str1.1 00000097  0000000000000000  0000000000000000  000272ba  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 11 __tracepoints 00000080  0000000000000000  0000000000000000  00027360  2**5
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
 12 _ftrace_events 00000160  0000000000000000  0000000000000000  000273e0  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, DATA
 13 __tracepoints_strings 00000013  0000000000000000  0000000000000000  00027540  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 14 .comment      0000001f  0000000000000000  0000000000000000  00027553  2**0
                  CONTENTS, READONLY
 15 .note.GNU-stack 00000000  0000000000000000  0000000000000000  00027572  2**0
                  CONTENTS, READONLY

# pattern

     820:       b8 00 00 00 00          mov    $0x0,%eax
     825:       ff c8                   dec    %eax
     827:       0f 84 d3 cf 00 00       je     d800 <test_pollute_cache+0xcfe0>

# time make -j10

real	1m30.688s
user	9m7.770s
sys	1m44.695s


* 128 cache lines read per system call (random cache lines)
  (CONFIG_IMMEDIATE=n)

# time make -j10

real	1m27.801s
user	9m12.447s
sys	0m59.348s


* 128 imv read per system call
  (CONFIG_IMMEDIATE=y)

# time make -j10

real	1m22.454s
user	9m5.822s
sys	0m58.640s


* 32 cache lines read per system call (random cache lines)
  (CONFIG_IMMEDIATE=n)

# time make -j10

real	1m21.539s
user	9m6.946s
sys	0m57.888s

real	1m26.789s
user	9m11.606s
sys	0m59.392s

real	1m29.461s
user	9m12.195s
sys	0m58.768s

avg sys:	58.68s


* 32 imv read per system call
  (CONFIG_IMMEDIATE=y)

# time make -j10

real	1m21.844s
user	9m7.278s
sys	0m57.648s

real	1m22.123s
user	9m6.850s
sys	0m56.848s

real	1m24.589s
user	9m5.674s
sys	0m58.328s

avg sys:	57.60s



-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

next prev parent reply	other threads:[~2009-09-27 23:23 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-09-24 13:26 [patch 00/12] Immediate Values Mathieu Desnoyers
2009-09-24 13:26 ` [patch 01/12] x86: text_poke_early non static Mathieu Desnoyers
2009-09-24 13:26 ` [patch 02/12] Immediate Values - Architecture Independent Code Mathieu Desnoyers
2009-09-25  4:20   ` Andrew Morton
2009-09-27 23:23     ` Mathieu Desnoyers [this message]
2009-09-28  1:23     ` Andi Kleen
2009-09-28 17:46       ` Andrew Morton
2009-09-28 18:03         ` Arjan van de Ven
2009-09-28 18:40           ` Mathieu Desnoyers
2009-09-28 19:54           ` Andi Kleen
2009-09-28 20:37             ` Arjan van de Ven
2009-09-28 21:32               ` H. Peter Anvin
2009-09-28 22:05                 ` Mathieu Desnoyers
2009-09-28 20:11         ` Andi Kleen
2009-09-28 21:16           ` Andrew Morton
2009-09-28 22:01             ` Mathieu Desnoyers
2009-09-24 13:26 ` [patch 03/12] Immediate Values - Kconfig menu in EMBEDDED Mathieu Desnoyers
2009-09-24 13:26 ` [patch 04/12] Immediate Values - x86 Optimization Mathieu Desnoyers
2009-09-24 13:26 ` [patch 05/12] Add text_poke and sync_core to powerpc Mathieu Desnoyers
2009-09-24 13:26 ` [patch 06/12] Immediate Values - Powerpc Optimization Mathieu Desnoyers
2009-09-24 13:26 ` [patch 07/12] Sparc create asm.h Mathieu Desnoyers
2009-09-24 21:10   ` David Miller
2009-09-24 13:26 ` [patch 08/12] sparc64: Optimized immediate value implementation Mathieu Desnoyers
2009-09-24 13:26 ` [patch 09/12] Immediate Values - Documentation Mathieu Desnoyers
2009-09-24 13:26 ` [patch 10/12] Immediate Values Support init Mathieu Desnoyers
2009-09-24 15:33   ` [patch 10.1/12] Immediate values fixes for modules Mathieu Desnoyers
2009-09-24 15:35   ` [patch 10.2/12] Fix Immediate Values x86_64 support old gcc Mathieu Desnoyers
2009-09-24 13:26 ` [patch 11/12] Scheduler Profiling - Use Immediate Values Mathieu Desnoyers
2009-09-24 13:26 ` [patch 12/12] Tracepoints - " Mathieu Desnoyers
2009-09-24 14:51   ` Peter Zijlstra
2009-09-24 15:03     ` Mathieu Desnoyers
2009-09-24 15:06       ` Peter Zijlstra
2009-09-24 16:01         ` [RFC patch] Immediate Values - x86 Optimization NMI and MCE support Mathieu Desnoyers
2009-09-24 21:59           ` Masami Hiramatsu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090927232345.GA5831@Krystal \
    --to=mathieu.desnoyers@polymtl.ca \
    --cc=akpm@linux-foundation.org \
    --cc=andi@firstfloor.org \
    --cc=bunk@stusta.de \
    --cc=hch@infradead.org \
    --cc=jbaron@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=rusty@rustcorp.com.au \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.